Evaluating Semantic Parsing against a Simple Web-based Question Answering Model

Alon Talmor, Mor Geva, Jonathan Berant
Tel-Aviv University
[email protected], [email protected], [email protected]

Abstract

Semantic parsing shines at analyzing complex natural language that involves composition and computation over multiple pieces of evidence. However, datasets for semantic parsing contain many factoid questions that can be answered from a single web document. In this paper, we propose to evaluate semantic parsing-based question answering models by comparing them to a question answering baseline that queries the web and extracts the answer only from web snippets, without access to the target knowledge-base. We investigate this approach on COMPLEXQUESTIONS, a dataset designed to focus on compositional language, and find that our model obtains reasonable performance (∼35 F1 compared to 41 F1 of state-of-the-art). We find in our analysis that our model performs well on complex questions involving conjunctions, but struggles on questions that involve relation composition and superlatives.

1 Introduction

Question answering (QA) has witnessed a surge of interest in recent years (Hill et al., 2015; Yang et al., 2015; Pasupat and Liang, 2015; Chen et al., 2016; Joshi et al., 2017), as it is one of the prominent tests for natural language understanding. QA can be coarsely divided into semantic parsing-based QA, where a question is translated into a logical form that is executed against a knowledge-base (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Liang et al., 2011; Kwiatkowski et al., 2013; Reddy et al., 2014; Berant and Liang, 2015), and unstructured QA, where a question is answered directly from some relevant text (Voorhees and Tice, 2000; Hermann et al., 2015; Hewlett et al., 2016; Kadlec et al., 2016; Seo et al., 2016).

In semantic parsing, background knowledge has already been compiled into a knowledge-base (KB), and thus the challenge is in interpreting the question, which may contain compositional constructions (“What is the second-highest mountain in Europe?”) or computations (“What is the difference in population between France and Germany?”). In unstructured QA, the model needs to also interpret the language of a document, and thus most datasets focus on matching the question against the document and extracting the answer from some local context, such as a sentence or a paragraph (Onishi et al., 2016; Rajpurkar et al., 2016; Yang et al., 2015).

Since semantic parsing models excel at handling complex linguistic constructions and reasoning over multiple facts, a natural way to examine whether a benchmark indeed requires modeling these properties is to train an unstructured QA model and check whether it under-performs compared to semantic parsing models. If questions can be answered by examining local contexts only, then the use of a knowledge-base is perhaps unnecessary. However, to the best of our knowledge, only models that utilize the KB have been evaluated on common semantic parsing benchmarks.

The goal of this paper is to bridge this evaluation gap. We develop a simple log-linear model, in the spirit of traditional web-based QA systems (Kwok et al., 2001; Brill et al., 2002), that answers questions by querying the web and extracting the answer from returned web snippets. Thus, our evaluation scheme is suitable for semantic parsing benchmarks in which the knowledge required for answering questions is covered by the web (in contrast with virtual assistants, for which the knowledge is specific to an application).

We test this model on COMPLEXQUESTIONS (Bao et al., 2016), a dataset designed to require more compositionality compared to earlier datasets, such as WEBQUESTIONS (Berant et al., 2013) and SIMPLEQUESTIONS (Bordes et al., 2015). We find that a simple QA model, despite having no access to the target KB, performs reasonably well on this dataset (∼35 F1 compared to the state-of-the-art of 41 F1). Moreover, for the subset of questions for which the right answer can be found in one of the web snippets, we outperform the semantic parser (51.9 F1 vs. 48.5 F1). We analyze results for different types of compositionality and find that superlatives and relation composition constructions are challenging for a web-based QA system, while conjunctions and events with multiple arguments are easier.

An important insight is that semantic parsers must overcome the mismatch between natural language and formal language. Consequently, language that can be easily matched against the web may become challenging to express in logical form. For example, the word “wife” is an atomic binary relation in natural language, but is expressed with the complex binary λx.λy.Spouse(x, y) ∧ Gender(x, Female) in knowledge-bases. Thus, some of the complexity of understanding natural language is removed when working with a natural language representation.

To conclude, we propose to evaluate the extent to which semantic parsing-based QA benchmarks require compositionality by comparing semantic parsing models to a baseline that extracts the answer from short web snippets. We obtain reasonable performance on COMPLEXQUESTIONS, and analyze the types of compositionality that are challenging for a web-based QA model. To ensure reproducibility, we release our dataset, which attaches to each example from COMPLEXQUESTIONS the top-100 retrieved web snippets (the data can be downloaded from https://worksheets.codalab.org/worksheets/0x91d77db37e0a4bbbaeb37b8972f4784f/).

2 Problem Setting and Dataset

Given a training set of triples $\{(q^{(i)}, R^{(i)}, a^{(i)})\}_{i=1}^{N}$, where $q^{(i)}$ is a question, $R^{(i)}$ is a web result set, and $a^{(i)}$ is the answer, our goal is to learn a model that produces an answer a for a new question-result set pair (q, R). A web result set R consists of K(=100) web snippets, where each snippet s_i has a title and a text fragment. A training example is shown in Figure 1.

s1: Billy Batts (Character) - Biography - IMDb. Billy Batts (Character) on IMDb: Movies, TV, Celebs, and more... Devino is portrayed by Frank Vincent in the film Goodfellas. Page last updated by !!!de leted!!!
s2: Frank Vincent - Wikipedia. He appeared in Scorsese’s 1990 film Goodfellas, where he played Billy Batts, a made man in the Gambino crime family. He also played a role in Scorsese’s...
...
s100: Voice-over in Goodfellas. In the summer when they played cards all night, nobody ever called the cops. ... But we had a problem with Billy Batts. This was a touchy thing. Tommy had killed a made man. Billy was a part of the Bambino crew and untouchable. Before you...
q: “who played the part of billy batts in goodfellas?”
a: “Frank Vincent”

Figure 1: A training example containing a result set R, a question q and an answer a. The result set R contains 100 web snippets s_i, each including a title (boldface) and text. The answer is underlined.

Semantic parsing-based QA datasets contain question-answer pairs alongside a background KB. To convert such datasets to our setup, we run the question q against Google’s search engine and scrape the top-K web snippets. We use only the web snippets and ignore any boxes or other information returned (see Figure 1 and the full dataset in the supplementary material).
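In outline, a single training triple can be represented as follows. This is a minimal illustrative sketch in Python; the class and field names (Snippet, Example, rank) are illustrative and not taken from the released data format.

```python
# A minimal sketch of one training triple (q, R, a) as described above.
# Class and field names are illustrative, not the authors' released format.
from dataclasses import dataclass
from typing import List

@dataclass
class Snippet:
    title: str  # snippet title returned by the search engine
    text: str   # snippet body text
    rank: int   # 1-based position (Google rank) in the result set R

@dataclass
class Example:
    question: str            # q
    snippets: List[Snippet]  # R, typically K = 100 snippets
    answers: List[str]       # a, the gold answer set (often a singleton)

example = Example(
    question="who played the part of billy batts in goodfellas?",
    snippets=[Snippet(
        title="Frank Vincent - Wikipedia",
        text="He appeared in Scorsese's 1990 film Goodfellas, where he "
             "played Billy Batts, a made man in the Gambino crime family.",
        rank=2)],
    answers=["Frank Vincent"],
)
```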

Compositionality  We argue that if a dataset truly requires a compositional model, then it should be difficult to tackle with methods that only match the question against short web snippets. This is because it is unlikely that all the necessary pieces of evidence can be integrated from such snippets.

We convert COMPLEXQUESTIONS into the aforementioned format, and manually analyze the types of compositionality that occur in 100 random training examples. Table 1 provides an example for each of the question types we found:

- SIMPLE: an application of a single binary relation on a single entity.
- FILTER: a question where the semantic type of the answer is mentioned (“tv shows” in Table 1).
- N-ARY: a question about a single event that involves more than one entity (“juni” and “spy kids 4” in Table 1).
- CONJUNCTION: a question whose answer is the conjunction of more than one binary relation in the question.
- COMPOSITION: a question that involves composing more than one binary relation over an entity (“grandson” and “father” in Table 1).
- SUPERLATIVE: a question that requires sorting or comparing entities based on a numeric property.
- OTHER: any other question.

Type          Example                                                          %
SIMPLE        “who has gone out with cornelis de graeff”                       17%
FILTER        “which tv shows has wayne rostad starred in”                     18%
N-ARY         “who played juni in spy kids 4?”                                 51%
CONJUNCTION   “what has queen latifah starred in that doug mchenry directed”   10%
COMPOSITION   “who was the grandson of king david’s father?”                   7%
SUPERLATIVE   “who is the richest sports woman?”                               9%
OTHER         “what is the name george lopez on the show?”                     8%

Table 1: An example for each compositionality type and the proportion of examples in 100 random examples. A question can fall into multiple types, and thus the sum exceeds 100%.

Table 1 illustrates that COMPLEXQUESTIONS is dominated by N-ARY questions that involve an event with multiple entities. In Section 4 we evaluate the performance of a simple QA model for each compositionality type, and find that N-ARY questions are handled well by our web-based QA system.

3 Model

Our model comprises two parts. First, we extract a set of answer candidates, $\mathcal{A}$, from the web result set. Then, we train a log-linear model that outputs a distribution over the candidates in $\mathcal{A}$, and is used at test time to find the most probable answers.

Candidate Extraction  We extract all 1-grams, 2-grams, 3-grams and 4-grams (lowercased) that appear in R, yielding roughly 5,000 candidates per question. We then discard any candidate that fully appears in the question itself, and define $\mathcal{A}$ to be the top-K candidates based on their tf-idf score, where term frequency is computed on all the snippets in R, and inverse document frequency is computed on a large external corpus.

Candidate Ranking  We define a log-linear model over the candidates in $\mathcal{A}$:

$$p_\theta(a \mid q, R) = \frac{\exp\left(\phi(q, R, a)^\top \theta\right)}{\sum_{a' \in \mathcal{A}} \exp\left(\phi(q, R, a')^\top \theta\right)},$$

where $\theta \in \mathbb{R}^d$ are learned parameters and $\phi(\cdot) \in \mathbb{R}^d$ is a feature function. We train our model by maximizing the regularized conditional log-likelihood objective $\sum_{i=1}^{N} \log p_\theta(a^{(i)} \mid q^{(i)}, R^{(i)}) - \lambda \lVert\theta\rVert_2^2$. At test time, we return the most probable answers based on $p_\theta(a \mid q, R)$ (details in Section 4). While semantic parsers generally return a set, in COMPLEXQUESTIONS 87% of the answers are a singleton set.

Features  A candidate span a often has multiple mentions in the result set R. Therefore, our feature function $\phi(\cdot)$ computes the average of the features extracted from each mention. The main information sources used are the match between the candidate answer itself and the question (top of Table 2), the match between the context of a candidate answer in a specific mention and the question (bottom of Table 2), and the Google rank in which the mention appeared.

Lexicalized features are useful for our task, but the number of training examples is too small to train a fully lexicalized model. Therefore, we define lexicalized features over the 50 most common non-stop words in COMPLEXQUESTIONS. Last, our context features are defined in a 6-word window around the candidate answer mention, where the feature value decays exponentially as the distance from the candidate answer mention grows. Overall, we compute a total of 892 features over the dataset.

Template            Description
SPAN LENGTH         Indicator for the number of tokens in a_m
TF-IDF              Binned and raw tf-idf scores of a_m for every span length
CAPITALIZED         Whether a_m is capitalized
STOPWORD            Fraction of words in a_m that are stop words
IN QUEST            Fraction of words in a_m that are in q
IN QUEST + COMMON   Conjunction of IN QUEST with common words in q
IN QUESTION DIST.   Max./avg. cosine similarity between a_m words and q words
WH + NE             Conjunction of wh-word in q and named entity tags (NE) of a_m
WH + POS            Conjunction of wh-word in q and part-of-speech tags of a_m
NE + NE             Conjunction of NE tags in q and NE tags in a_m
NE + COMMON         Conjunction of NE tags in a_m and common words in q
MAX-NE              Whether a_m is a NE with maximal span (not contained in another NE)
YEAR                Binned indicator for year if a_m is a year
CTXT MATCH          Max./avg. over non-stop words in q, for whether a q word occurs around a_m, weighted by distance from a_m
CTXT SIMILARITY     Max./avg. cosine similarity over non-stop words in q, between q words and words around a_m, weighted by distance
IN TITLE            Whether a_m is in the title part of the snippet
CTXT ENTITY         Indicator for whether a common word appears between a_m and a named entity that appears in q
GOOGLE RANK         Binned snippet rank of a_m in the result set R

Table 2: Feature templates used to extract features from each answer candidate mention a_m. Cosine similarity is computed with pre-trained GloVe embeddings (Pennington et al., 2014). The definition of common words and weighting by distance is in the body of the paper.

4 Experiments

COMPLEXQUESTIONS contains 1,300 training examples and 800 test examples. We performed 5 random 70/30 splits of the training set for development. We computed POS tags and named entities with Stanford CoreNLP (Manning et al., 2014). We did not employ any co-reference resolution tool in this work. If, after candidate extraction, we do not find the gold answer in the top-K(=140) candidates, we discard the example, resulting in a training set of 856 examples.

We compare our model, WEBQA, to STAGG (Yih et al., 2015) and COMPQ (Bao et al., 2016), which are, to the best of our knowledge, the highest performing semantic parsing models on both COMPLEXQUESTIONS and WEBQUESTIONS. For these systems, we only report test F1 numbers that are provided in the original papers, as we do not have access to the code or predictions. We evaluate models by computing average F1, the official evaluation metric defined for COMPLEXQUESTIONS. This measure computes the F1 between the set of answers returned by the system and the set of gold answers, and averages across questions. To allow WEBQA to return a set rather than a single answer, we return the most probable answer $a^*$ as well as any answer a such that $\phi(q, R, a^*)^\top\theta - \phi(q, R, a)^\top\theta < 0.5$. We also compute precision@1 and Mean Reciprocal Rank (MRR) for WEBQA, since we have a ranking over answers. To compute metrics we lowercase the gold and predicted spans and perform exact string match.

Table 3 presents the results of our evaluation. WEBQA obtained 32.6 F1 (33.5 p@1, 42.4 MRR) compared to 40.9 F1 of COMPQ. Our candidate extraction step finds the correct answer in the top-K candidates in 65.9% of development examples and 62.7% of test examples. Thus, our test F1 on examples for which candidate extraction succeeded (WEBQA-SUBSET) is 51.9 (53.4 p@1, 67.5 MRR).

System            Dev F1   Dev p@1   Test F1   Test p@1   Test MRR
STAGG             -        -         37.7      -          -
COMPQ             -        -         40.9      -          -
WEBQA             35.3     36.4      32.6      33.5       42.4
WEBQA-EXTRAPOL    -        -         34.4      -          -

COMPQ-SUBSET      -        -         48.5      -          -
WEBQA-SUBSET      53.6     55.1      51.9      53.4       67.5

Table 3: Results on the development set (average over random splits) and the test set. Middle: results on all examples. Bottom: results on the subset where candidate extraction succeeded.

We were able to indirectly compare WEBQA-SUBSET to COMPQ: Bao et al. (2016) graciously provided us with the predictions of COMPQ when it was trained on COMPLEXQUESTIONS, WEBQUESTIONS, and SIMPLEQUESTIONS. In this setup, COMPQ obtained 42.2 F1 on the test set (compared to 40.9 F1 when training on COMPLEXQUESTIONS only, as we do). Restricting the predictions to the subset for which candidate extraction succeeded, the F1 of COMPQ-SUBSET is 48.5, which is 3.4 F1 points lower than WEBQA-SUBSET, even though the latter was trained on less data.

Not using a KB results in a considerable disadvantage for WEBQA. KB entities have normalized descriptions, and the answers have been annotated according to those descriptions. We, conversely, find answers on the web and often predict a correct answer, but get penalized due to small string differences. E.g., for “what is the longest river in China?” we answer “yangtze river”, while the gold answer is “yangtze”. To quantify this effect we manually annotated all 258 examples in the first random development set split, and determined whether string matching failed and we actually returned the gold answer (we also publicly release our annotations). This improved performance from 53.6 F1 to 56.6 F1 (on examples that passed candidate extraction). Further normalizing gold and predicted entities, such that “Hillary Clinton” and “Hillary Rodham Clinton” are unified, improved F1 to 57.3 F1. Extrapolating this to the test set would result in an F1 of 34.4 (WEBQA-EXTRAPOL in Table 3) and 34.9, respectively.

Last, to determine the contribution of each feature template, we performed ablation tests, and we present the five feature templates that resulted in the largest drop in performance on the development set in Table 4. Note that TF-IDF is by far the most impactful feature, leading to a large drop of 12 points in performance. This shows the importance of using the redundancy of the web for our QA system.
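The answer-set selection rule and the average F1 metric used above can be sketched as follows. This is an illustrative sketch only: scores maps each candidate to its linear score φ(q, R, a)ᵀθ, and the 0.5 margin follows the rule stated earlier.

```python
def select_answer_set(scores, margin=0.5):
    """Return the top candidate plus any candidate whose linear score is
    within `margin` of the best one."""
    best = max(scores.values())
    return {a for a, s in scores.items() if best - s < margin}

def f1(predicted, gold):
    """Set-level F1 between lowercased predicted and gold answer strings."""
    predicted = {p.lower() for p in predicted}
    gold = {g.lower() for g in gold}
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_f1(all_predicted, all_gold):
    """Average F1 across questions, as in the official metric."""
    return sum(f1(p, g) for p, g in zip(all_predicted, all_gold)) / len(all_gold)
```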

Feature Template   F1     Δ
WEBQA              53.6
-MAX-NE            51.8   -1.8
-NE+COMMON         51.8   -1.8
-GOOGLE RANK       51.4   -2.2
-IN QUEST          50.1   -3.5
-TF-IDF            41.5   -12

Table 4: Feature ablation results. The five features that lead to the largest drop in performance are displayed.

Figure 2 (bar chart; panels omitted): Proportion of examples that passed or failed candidate extraction for each compositionality type (Simple, Filter, N-ary, Conj., Compos., Superl., Other), as well as average dev F1 for each compositionality type. COMPOSITION and SUPERLATIVE questions are difficult for WEBQA.

Analysis  To understand the success of WEBQA on different compositionality types, we manually annotated the compositionality type of 100 random examples that passed candidate extraction and 50 random examples that failed candidate extraction. Figure 2 presents the results of this analysis, as well as the average F1 obtained for each compositionality type on the 100 examples that passed candidate extraction (note that a question can belong to multiple compositionality types). We observe that COMPOSITION and SUPERLATIVE questions are challenging for WEBQA, while SIMPLE, FILTER, and N-ARY questions are easier (recall that a large fraction of the questions in COMPLEXQUESTIONS are N-ARY). Interestingly, WEBQA performs well on CONJUNCTION questions (“what film victor garber starred in that rob marshall directed”), possibly because the correct answer can obtain signal from multiple snippets.

An advantage of finding answers to questions from web documents compared to semantic parsing is that we do not need to learn the “language of the KB”. For example, the question “who is the governor of California 2010” can be matched directly to web snippets, while in Freebase (Bollacker et al., 2008) the word “governor” is expressed by the complex predicate λx.∃z.GoverPos(x, z) ∧ PosTitle(z, Governor). This could provide a partial explanation for the reasonable performance of WEBQA.

5 Related Work

Our model WEBQA performs QA using web snippets, similar to traditional QA systems like MULDER (Kwok et al., 2001) and AskMSR (Brill et al., 2002). However, it enjoys the advances in commercial search engines of the last decade, and uses a simple log-linear model, which has become standard in Natural Language Processing.

Similar to this work, Yao et al. (2014) analyzed a semantic parsing benchmark with a simple QA system. However, they employed a semantic parser that is limited to applying a single binary relation on a single entity, while we develop a QA system that does not use the target KB at all.

Last, in parallel to this work, Chen et al. (2017) evaluated an unstructured QA system against semantic parsing benchmarks. However, their focus was on examining the contributions of multi-task learning and distant supervision to training rather than on comparing to state-of-the-art semantic parsers.

6 Conclusion

We propose in this paper to evaluate semantic parsing-based QA systems by comparing them to a web-based QA baseline. We evaluate such a QA system on COMPLEXQUESTIONS and find that it obtains reasonable performance. We analyze performance and find that COMPOSITION and SUPERLATIVE questions are challenging for a web-based QA system, while CONJUNCTION and N-ARY questions can often be handled by our QA model.

Reproducibility  Code, data, annotations, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x91d77db37e0a4bbbaeb37b8972f4784f/.

Acknowledgments

We thank Junwei Bao for providing us with the test predictions of his system. We thank the anonymous reviewers for their constructive feedback. This work was partially supported by the Israel Science Foundation, grant 942/16.

References

J. Bao, N. Duan, Z. Yan, M. Zhou, and T. Zhao. 2016. Constraint-based question answering with knowledge graph. In International Conference on Computational Linguistics (COLING).

J. Berant, A. Chou, R. Frostig, and P. Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

J. Berant and P. Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics (TACL) 3:545–558.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD). pages 1247–1250.

A. Bordes, N. Usunier, S. Chopra, and J. Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

E. Brill, S. Dumais, and M. Banko. 2002. An analysis of the AskMSR question-answering system. In Association for Computational Linguistics (ACL). pages 257–264.

D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN / Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).

D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).

K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL).

F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The goldilocks principle: Reading children's books with explicit memory representations. In International Conference on Learning Representations (ICLR).

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. 2016. Text understanding with the attention sum reader network. In Association for Computational Linguistics (ACL).

T. Kwiatkowski, E. Choi, Y. Artzi, and L. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP).

C. Kwok, O. Etzioni, and D. S. Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS) 19:242–262.

P. Liang, M. I. Jordan, and D. Klein. 2011. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL). pages 590–599.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL system demonstrations.

T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Empirical Methods in Natural Language Processing (EMNLP).

P. Pasupat and P. Liang. 2015. Compositional semantic parsing on semi-structured tables. In Association for Computational Linguistics (ACL).

J. Pennington, R. Socher, and C. D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

S. Reddy, M. Lapata, and M. Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics (TACL) 2(10):377–392.

M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. arXiv.

E. M. Voorhees and D. M. Tice. 2000. Building a question answering test collection. In ACM Special Interest Group on Information Retrieval (SIGIR). pages 200–207.

Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP). pages 2013–2018.

X. Yao, J. Berant, and B. Van Durme. 2014. Freebase QA: Information extraction or semantic parsing? In Workshop on Semantic Parsing.

W. Yih, M. Chang, X. He, and J. Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Association for Computational Linguistics (ACL).

M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI). pages 1050–1055.

L. S. Zettlemoyer and M. Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI). pages 658–666.
