arXiv:1809.09528v2 [cs.CL] 10 Apr 2019

ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters

Abdalghani Abujabal¹, Rishiraj Saha Roy², Mohamed Yahya³ and Gerhard Weikum²
¹Amazon Alexa, Aachen, Germany
²Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
³Bloomberg L.P., London, United Kingdom

The main part of this work was carried out when the author was at the Max Planck Institute for Informatics.

Abstract

To bridge the gap between the capabilities of the state-of-the-art in factoid question answering (QA) and what users ask, we need large datasets of real user questions that capture the various question phenomena users are interested in, and the diverse ways in which these questions are formulated. We introduce ComQA, a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology. Through a large crowdsourcing effort, we clean the question dataset, group questions into paraphrase clusters, and annotate clusters with their answers. ComQA contains 11,214 questions grouped into 4,834 paraphrase clusters. We detail the process of constructing ComQA, including the measures taken to ensure its high quality while making effective use of crowdsourcing. We also present an extensive analysis of the dataset and the results achieved by state-of-the-art systems on ComQA, demonstrating that our dataset can be a driver of future research on QA.

Cluster 1 (temporal)
Q: “Who was the Britain’s leader during WW1?”
Q: “Who ran Britain during WW1?”
Q: “Who was the leader of Britain during World War One?”
A: [https://en.wikipedia.org/wiki/h._h._asquith, https://en.wikipedia.org/wiki/david_lloyd_george]

Cluster 2 (comparison)
Q: “largest city located along the Nile river?”
Q: “largest city by the Nile river?”
Q: “What is the largest city in Africa that is on the banks of the Nile river?”
A: [https://en.wikipedia.org/wiki/cairo]

Cluster 3 (compositional)
Q: “John Travolta and Jamie Lee Curtis acted in this film?”
Q: “Jamie Lee Curtis and John Travolta played together in this movie?”
Q: “John Travolta and Jamie Lee Curtis were actors in this film?”
A: [https://en.wikipedia.org/wiki/perfect_(film)]

Cluster 4 (empty answer)
Q: “Who is the first human landed in Mars?”
Q: “Who was the first human being on Mars?”
Q: “first human in Mars?”
A: []

Figure 1: ComQA paraphrase clusters covering a range of question aspects, e.g., temporal and compositional questions, with lexical and syntactic diversity.

1 Introduction

Factoid QA is the task of answering natural language questions whose answer is one or a small number of entities (Voorhees and Tice, 2000). To advance research in QA in a manner consistent with the needs of end users, it is important to have access to datasets that reflect real user information needs by covering various question phenomena and the wide lexical and syntactic variety in expressing these information needs. The benchmarks should be large enough to facilitate the use of data-hungry machine learning methods. In this paper, we present ComQA, a large dataset of 11,214 real user questions collected from the WikiAnswers community QA website. As shown in Figure 1, the dataset contains various question phenomena. Through a large-scale crowdsourcing effort, ComQA questions are grouped into 4,834 paraphrase clusters that capture lexical and syntactic variety. Crowdsourcing is also used to pair paraphrase clusters with answers to serve as a supervision signal for training and as a basis for evaluation.

Table 1 contrasts ComQA with publicly available QA datasets. The foremost issue that ComQA tackles is ensuring research is driven by information needs formulated by real users. Most large-scale datasets resort to highly-templatic synthetically generated natural language questions (Bordes et al., 2015; Cai and Yates, 2013; Su et al., 2016; Talmor and Berant, 2018; Trivedi et al., 2017). Other datasets utilize search engine logs to collect their questions (Berant et al., 2013), which creates a bias towards simpler questions that search engines can already answer reasonably well. In contrast, ComQA questions come from WikiAnswers, a community QA website where users pose questions to be answered by other users. This is often a reflection of the fact that such questions are beyond the capabilities of commercial search engines and QA systems. Questions in our dataset exhibit a wide range of interesting aspects such as the need for temporal reasoning (Figure 1, cluster 1), comparison (Figure 1, cluster 2), compositionality (multiple subquestions with multiple entities and relations) (Figure 1, cluster 3), and unanswerable questions (Figure 1, cluster 4).

Dataset | Large scale (> 5K) | Real Information Needs | Complex Questions | Question Paraphrases
ComQA (This paper) | ✓ | ✓ | ✓ | ✓
Free917 (Cai and Yates, 2013) | ✗ | ✗ | ✗ | ✗
WebQuestions (Berant et al., 2013) | ✓ | ✓ | ✗ | ✗
SimpleQuestions (Bordes et al., 2015) | ✓ | ✗ | ✗ | ✗
QALD (Usbeck et al., 2017) | ✗ | ✗ | ✓ | ✗
LC-QuAD (Trivedi et al., 2017) | ✓ | ✗ | ✓ | ✗
ComplexQuestions (Bao et al., 2016) | ✗ | ✓ | ✓ | ✗
GraphQuestions (Su et al., 2016) | ✓ | ✗ | ✓ | ✓
ComplexWebQuestions (Talmor and Berant, 2018) | ✓ | ✗ | ✓ | ✗
TREC (Voorhees and Tice, 2000) | ✗ | ✓ | ✓ | ✗

Table 1: Comparison of ComQA with existing QA datasets over various dimensions.

ComQA is the result of a carefully designed large-scale crowdsourcing effort to group questions into paraphrase clusters and pair them with answers. Past work has demonstrated the benefits of paraphrasing for QA (Abujabal et al., 2018; Berant and Liang, 2014; Dong et al., 2017; Fader et al., 2013). Motivated by this, we judiciously use crowdsourcing to obtain clean paraphrase clusters from WikiAnswers’ noisy ones, resulting in clusters like those shown in Figure 1, with both lexical and syntactic variations. The only other dataset to provide such clusters is that of Su et al. (2016), but that is based on synthetic information needs.

For answering, recent research has shown that combining various resources significantly improves performance (Savenkov and Agichtein, 2016; Sun et al., 2018; Xu et al., 2016). Therefore, we do not pair ComQA with a specific knowledge base (KB) or text corpus for answering. We call on the research community to innovate in combining different answering sources to tackle ComQA and advance research in QA.

We use crowdsourcing to pair paraphrase clusters with answers. ComQA answers are primarily Wikipedia entity URLs. This has two motivations: (i) it builds on the example of search engines that use Wikipedia entities as answers for entity-centric queries (e.g., through knowledge cards), and (ii) most modern KBs ground their entities in Wikipedia. Wherever the answers are temporal or measurable quantities, we use TIMEX3¹ and the International System of Units² for normalization. Providing canonical answers allows for better comparison of different systems.

We present an extensive analysis of ComQA, where we introduce the various question aspects of the dataset. We also analyze the results of running state-of-the-art QA systems on ComQA. ComQA exposes major shortcomings in these systems, mainly related to their inability to handle compositionality, time, and comparison. Our detailed error analysis provides inspiration for avenues of future work to ensure that QA systems meet the expectations of real users. To summarize, in this paper we make the following contributions:

• We present a dataset of 11,214 real user questions collected from a community QA website. The questions exhibit a range of aspects that are important for users and challenging for existing QA systems. Using crowdsourcing, questions are grouped into 4,834 paraphrase clusters that are annotated with answers. ComQA is available at: http://qa.mpi-inf.mpg.de/comqa.

• We present an extensive analysis and quantify the various difficulties in ComQA. We also present the results of state-of-the-art QA systems on ComQA, and a detailed error analysis.

¹http://www.timeml.org
²https://en.wikipedia.org/wiki/SI

2 Related Work

There are two main variants of the factoid QA task, with the distinction tied to the underlying answering resources and the nature of answers. Traditionally, QA has been explored over large textual corpora (Cui et al., 2005; Harabagiu et al., 2001, 2003; Ravichandran and Hovy, 2002; Saquete et al., 2009) with answers being textual phrases. Recently, it has been explored over large structured resources such as KBs (Berant et al., 2013; Unger et al., 2012), with answers being semantic entities. Recent work demonstrated that the two variants are complementary, and a combination of the two results in the best performance (Sun et al., 2018; Xu et al., 2016).

QA over textual corpora. [...]

[...] et al., 2013). Over the past five years, many datasets were introduced for this setting. However, as Table 1 shows, they are either small in size (Free917 and ComplexQuestions), composed of synthetically generated questions (SimpleQuestions, GraphQuestions, LC-QuAD and ComplexWebQuestions), or are structurally simple (WebQuestions). ComQA addresses these shortcomings. Returning semantic entities as answers allows users to further explore these entities in various resources such as their Wikipedia pages, Freebase entries, etc. It also allows QA systems to tap into various interlinked resources for improvement (e.g., to obtain better lexicons, or train better NER systems). Because of this, ComQA provides semantically grounded reference answers in [...]
