ComQA: A Community-sourced Dataset for Complex Factoid Question Answering with Paraphrase Clusters

Abdalghani Abujabal1, Rishiraj Saha Roy2, Mohamed Yahya3 and Gerhard Weikum2
1 Amazon Alexa, Aachen, Germany
[email protected]
2 Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
{rishiraj, [email protected]
3 Bloomberg L.P., London, United Kingdom
[email protected]
Abstract

To bridge the gap between the capabilities of the state of the art in factoid question answering (QA) and what users ask, we need large datasets of real user questions that capture the various question phenomena users are interested in, and the diverse ways in which these questions are formulated. We introduce ComQA, a large dataset of real user questions that exhibit different challenging aspects such as compositionality, temporal reasoning, and comparisons. ComQA questions come from the WikiAnswers community QA platform, which typically contains questions that are not satisfactorily answerable by existing search engine technology.

Example paraphrase clusters:

Cluster 1 (temporal)
Q: "Who was the Britain's leader during WW1?"
Q: "Who ran Britain during WW1?"
Q: "Who was the leader of Britain during World War One?"
A: [https://en.wikipedia.org/wiki/h._h._asquith, https://en.wikipedia.org/wiki/david_lloyd_george]

Cluster 2 (comparison)
Q: "largest city located along the Nile river?"
Q: "largest city by the Nile river?"
Q: "What is the largest city in Africa that is on the banks of the Nile river?"
A: [https://en.wikipedia.org/wiki/cairo]

Cluster 3 (compositional)
Q: "John Travolta and Jamie Lee Curtis acted in this film?"
Q: "Jamie Lee Curtis and John Travolta played together in this movie?"
Q: "John Travolta and Jamie Lee Curtis were actors in this film?"
A: [https://en.wikipedia.org/wiki/perfect_(film)]

Cluster 4 (empty answer)
Q: "Who is the first human landed in Mars?"
Q: "Who was the first human being on Mars?"
A: []
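To make the cluster structure above concrete, the sketch below shows one way a single paraphrase cluster could be represented in code. The class and field names (ParaphraseCluster, cluster_id, label, questions, answers) are illustrative assumptions and do not necessarily match the released ComQA data format.

# Minimal sketch: one ComQA-style paraphrase cluster as a Python record.
# Field names are illustrative assumptions, not the official ComQA schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ParaphraseCluster:
    cluster_id: int
    label: str                       # e.g. "temporal", "comparison", "compositional"
    questions: List[str]             # paraphrases of the same information need
    answers: List[str] = field(default_factory=list)  # Wikipedia URLs; empty if unanswerable

cluster3 = ParaphraseCluster(
    cluster_id=3,
    label="compositional",
    questions=[
        "John Travolta and Jamie Lee Curtis acted in this film?",
        "Jamie Lee Curtis and John Travolta played together in this movie?",
        "John Travolta and Jamie Lee Curtis were actors in this film?",
    ],
    answers=["https://en.wikipedia.org/wiki/perfect_(film)"],
)

In this representation, an empty answers list, as in Cluster 4 above, would mark a cluster whose question has no available answer.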