
PhotoshopQuiA: A Corpus of Non-Factoid Questions and Answers for Why-Question Answering

Andrei Dulceanu†∗, Thang Le Dinh‡, Walter Chang∗, Trung Bui∗, Doo Soon Kim∗, Manh Chien Vu‡, Seokhwan Kim∗
†Universitatea Politehnica București, România; ∗Adobe Research, California; ‡Université du Québec à Trois-Rivières, Québec
[email protected], {wachang, bui, dkim, seokim}@adobe.com, {thang.ledinh, manh.chien.vu}@uqtr.ca

Abstract

Recent years have witnessed a high interest in non-factoid question answering using Community Question Answering (CQA) web sites. Despite ongoing research using state-of-the-art methods, there is a scarcity of available datasets for this task. Why-questions, which play an important role in open-domain and domain-specific applications, are difficult to answer automatically since the answers need to be constructed based on different information extracted from multiple knowledge sources. We introduce the PhotoshopQuiA dataset, a new publicly available corpus of 2,854 why-question and answer(s) (WhyQ, A) pairs related to Adobe Photoshop usage, collected from five CQA web sites. We chose Adobe Photoshop because it is a popular and well-known product, with a lively, knowledgeable and sizable community. To the best of our knowledge, this is the first English dataset for Why-QA that focuses on a product, as opposed to previous open-domain datasets. The corpus is stored in JSON format and contains detailed data about questions and questioners as well as answers and answerers. The dataset can be used to build Why-QA systems, to evaluate current approaches for answering why-questions, and to develop new models for future QA systems research.

Keywords: question answering, community question answering, non-factoid question, Why-QA

1. Introduction

The success of IBM's Watson in the Jeopardy! TV game show in 2011 and the significant investments of large tech companies in building personal assistants (e.g., Microsoft Cortana, Apple Siri, Amazon Alexa or Google Assistant) have strengthened the interest in the Question Answering field. These systems have in common the fact that they mostly tackle factoid questions. These are questions that "can be answered with simple facts expressed in short text answers"; usually, their answers include "short strings expressing a personal name, temporal expression, or location" (Jurafsky and Martin, 2017). An example of a factoid question and its answer is:

Q: Who is Canada's prime minister?
A: Justin Trudeau.

By contrast, non-factoid questions ask for "opinions, suggestions, interpretations and the like" (Tomasoni and Huang, 2010). Answering non-factoid questions and evaluating the quality of the provided answers have proved to be non-trivial due to the complexity of the task as well as the lack of training data. To address the latter issue, numerous researchers have tried to take advantage of user-generated content on Community Question Answering (CQA) web sites such as Yahoo! Answers (https://answers.yahoo.com), Stack Overflow (https://stackoverflow.com) or Quora (https://www.quora.com). These web forums allow users to post their own questions, answer others' questions, comment on others' replies, and upvote or downvote answers. Usually, if a user is the original questioner, he/she is allowed to select the most relevant answer to his/her question. Although CQA web sites have lots of experts, it still takes time for them to give pertinent, authoritative answers to user questions, and not all the content shares the same characteristics. Some key differences (Blooma and Kurian, 2011) in answer quality and availability between traditional QA systems and CQA web sites include: the type of questions (factoid vs. non-factoid), the quality of the answers (high vs. varying from answerer to answerer) and the response time (immediate vs. several hours or days).

Among all categories of non-factoid questions, namely list, confirmation, causal and hypothetical (Mishra and Jain, 2016), we are especially interested in why-questions, which are related to causal relations. Why-questions are difficult to answer automatically since the answers often need to be constructed based on different information extracted from multiple knowledge sources. For this reason, why-questions need a different approach than factoid questions, because their answers usually cannot be stated in a single phrase (Verberne et al., 2010).

A why-Question Answering (Why-QA) system trying to answer questions using CQA data needs to be able to distinguish between relevant and irrelevant answers (the answer selection task). Most of the time these systems also produce a sorted output of relevant answers (the answer re-ranking task). Both tasks require curated and informative datasets on which to evaluate proposed methods.

In this paper, we introduce the PhotoshopQuiA dataset, a corpus consisting of 2,854 (WhyQ, A) pairs covering various questions and answers about Adobe Photoshop, the de facto industry-standard raster graphics editor developed and published by Adobe Systems for macOS and Windows (https://en.wikipedia.org/wiki/Adobe_Photoshop). We chose Adobe Photoshop because it is a popular and well-known product, with a lively, knowledgeable and sizable QA community. To the best of our knowledge, PhotoshopQuiA is the first large English Why-QA dataset that focuses on a product, as opposed to previous open-domain datasets. Our dataset focuses on why-questions that occur while a user is trying to complete a task (e.g., changing the color mode of an image, or applying a filter). It contains contextual information about the answer, which in turn makes it easier to build a QA system that is able to find the most relevant answers. We named the corpus PhotoshopQuiA because quia (first and last letters capitalized as in Question Answering) means because in Latin and therefore hints at the expected why-answer type.

One of the challenges that often arises with CQA data is the high variance in quality for both questions and answers. To address this problem, we include in our dataset mostly official answers (65.5% of total pairs) given by Adobe Photoshop experts. We chose to provide both text and HTML representations of questions and answers because the raw HTML snippets often include relevant information like documentation links, screenshots or short videos which would otherwise be lost. We also analyze the (WhyQ, A) pairs for the presence of certain linguistic cues such as causality markers (e.g., the reason for, because or due to).

2. Related Work & Datasets

In recent years, numerous datasets have been released in the domain of question answering (QA) systems to promote new methods that integrate natural language processing, information retrieval, artificial intelligence and knowledge discovery. The majority of these datasets were open-domain (Bollacker et al., 2008; Ignatova et al., 2009; Cai and Yates, 2013; Yang et al., 2015; Chen et al., 2017). There are still a few QA datasets for specific fields such as BioASQ and WikiMovies. The BioASQ dataset contains questions in English, along with reference answers constructed by a team of biomedical experts (Tsatsaronis et al., 2015). The WikiMovies dataset contains 96K question-answer pairs in the domain of movies (Miller et al., 2016). Table 1 introduces selected recent open-domain QA datasets.

- WebQuestions and Free917: used for training semantic parsers, which map natural language utterances to denotations (answers) via intermediate logical forms (Berant et al., 2013)
- CuratedTREC: 2,180 questions extracted from the datasets from TREC (Baudiš and Šedivý, 2015)
- WikiQA: 3,000 questions sampled from Bing query logs, each associated with a Wikipedia page presumed to be the topic of the question (Yang et al., 2015)
- 30M Factoid QA Corpus: 30M natural language questions in English and their corresponding facts in the knowledge base Freebase (Serban et al., 2016)
- SQuAD: 100,000 question-answer pairs on more than 500 Wikipedia articles (Rajpurkar et al., 2016)
- Amazon: 1.4 million answered questions from Amazon (Wan and McAuley, 2016)
- Baidu: 42K questions and 579K evidences, where an evidence is a piece of text containing information for answering the question (Li et al., 2016)
- Allen AI Science Challenge: 2,500 questions, each with 4 answer candidates (Chen et al., 2017)
- Quora: over 400K sentence pairs, of which almost 150K are semantically similar questions; no answers are provided (Iyer et al., 2017)

Table 1: Recent datasets for QA systems

Several existing datasets focus on data taken from CQA web sites. The data structure of our dataset (question-answer(s) pairs with the best answer labeled) resembles the one in (Hoogeveen et al., 2015); however, it does not include comments and tags, making it more suitable for Why-QA than previous structures which include only questions (Iyer et al., 2017). Our work is mostly related to the SemEval-2016 Task 3 dataset (Nakov et al., 2016), which contains more than 2,000 questions associated with the ten most relevant comments (answers). It also shares some characteristics with Yahoo's Webscope (https://webscope.sandbox.yahoo.com/catalog.php?datatype=l) L4, used by (Surdeanu et al., 2008) and (Jansen et al., 2014), with L6, and with nfL6 (https://ciir.cs.umass.edu/downloads/nfL6), a non-factoid subset of L6 focusing only on how-questions (Cohen and Croft, 2016). Although Yahoo Webscope L6 certainly contains many why-QA pairs, which should be fairly trivial to extract from the entire dataset, we believe this limits its usefulness for building Why-QA systems. While all these datasets focus on CQA forum data, there are some key differences between our work and existing datasets (Table 2).

- Scope: this work, closed domain (focus on product usage); previous work, open domain
- Answer picked by a domain authority: this work, expert (65.5%); previous work, n/a
- Question types: this work, why-questions only; previous work, various (focus on how-questions)

Table 2: Major differences between our dataset and previous datasets focusing on CQA forum data

The difference in scope allows researchers to verify whether all previous research achievements made on open-domain datasets such as Yahoo Webscope can be translated to a closed domain such as ours, or whether domain adaptation is needed. The authors are aware of the corpora from (Prasad et al., 2007) and (Dunietz et al., 2017), but these were not considered for this work because they do not address CQA and/or Why-QA.

Some of the previous studies on Why-QA systems tried to extract why-questions from QA datasets related to general questions; however, the size and quality of the extracted why-questions were limited. Previous datasets used in the Why-QA task contain few (WhyQ, A) pairs (under 1,000), are handcrafted, are no longer available online (Verberne et al., 2007; Mrozinski et al., 2008; Higashinaka and Isozaki, 2008) or target Japanese (Higashinaka and Isozaki, 2008; Oh et al., 2012). There is a need for a public, domain-specific why-question dataset for English to advance research and development in Why-QA.

3. PhotoshopQuiA Dataset

In this section we describe the process of creating the PhotoshopQuiA dataset and succinctly compare our data collection approach to previous related approaches.

3.1. Data Sources

We identified the following five web sites as appropriate sources for collecting why-questions about Adobe Photoshop: Adobe Forums (https://forums.adobe.com/welcome), Stack Overflow, Graphic Design (https://graphicdesign.stackexchange.com), Super User (https://superuser.com) and Feedback Photoshop (https://feedback.photoshop.com). Although there were additional CQA web sites containing Photoshop-related questions and answers, we selected only the sources above because they all have moderated, recent, high-quality and authoritative content. Regarding the last two points, it is worth mentioning that the dataset contains a high ratio of answers coming from Adobe experts working on the Photoshop team (65.5% of total answers).

When using one of the above-mentioned forums, the original questioner has the possibility to select the most relevant answer to his/her question. This is often referred to as the accepted answer. If such an answer does not exist, does not fully meet established criteria or does not solve the problem at hand, registered users may upvote or downvote the available answers. As stated previously, PhotoshopQuiA includes all answers available for each why-question, labeling the correct one. We strove to label only accepted answers as correct, but when such answers were not available, we selected the most voted answer instead. If the most voted answer had at least one downvote, we did not include the (WhyQ, A) pair at all.

Our approach resembles the previous datasets described in the related work section, where a dataset item contained either the full conversation thread (Hoogeveen et al., 2015) (although we did not include tags or comments) or multiple relevant answers for a question (Nakov et al., 2016). One of the key advantages of our approach is that our answers are more reliable and correct: two thirds of the (WhyQ, A) pairs have an accepted answer authored by a Photoshop expert.

3.2. Web Crawling

We used Scrapy (https://scrapy.org), an application framework written in Python, for our web scraping task. Web scraping includes two main stages: web crawling (i.e., fetching or downloading a web page) and data extraction (i.e., extracting structured content from a fetched page). In order to successfully scrape a CQA web site, Scrapy needs a Spider definition containing the following:

• an initial list of URLs from which the Spider will begin to crawl. This is provided in the start_urls attribute or via the start_requests() method;
• an implementation of the default parse() callback method, which is a generator function under the hood. This method is responsible for all the heavy lifting needed to process the response corresponding to each request. When all content is available on the requested web page, it yields a Python dictionary filled with the data of interest. If additional requests need to be performed (i.e., following a new URL found in the page), it yields a new request, which in turn needs
• an implementation of a new, custom, user-defined callback method for parsing the new response. This method will yield the final scraped items.

All Scrapy requests are processed and scheduled asynchronously, enabling fast concurrent crawls. In order to politely scrape the mentioned CQA web sites, we overrode Scrapy's default settings and limited the number of concurrent requests per IP to one, with a five-second download delay between requests.

Having looked at the building blocks of a Scrapy Spider, a few words should be said about how the actual item scraping works in the parse() callback. This is done in two phases: selection and extraction. Selection employs CSS selectors for selecting HTML elements in the response. Once the desired elements are selected, XPath expressions can be used for more fine-grained control of the extracted content.
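To make this structure concrete, below is a minimal sketch of such a spider. The CSS and XPath selectors, the page range and the item fields are hypothetical placeholders rather than the ones used for the actual PhotoshopQuiA crawl; the politeness settings mirror those described above.

import scrapy


class WhyPhotoshopSpider(scrapy.Spider):
    # A sketch of the three building blocks described in Section 3.2.
    name = "why_photoshop"

    # Polite crawling: one concurrent request per IP, five seconds apart.
    custom_settings = {
        "CONCURRENT_REQUESTS_PER_IP": 1,
        "DOWNLOAD_DELAY": 5,
    }

    # Initial list of URLs the spider starts crawling from; the number
    # of result pages (here 10) is a placeholder (see Section 3.3).
    start_urls = [
        f"https://stackoverflow.com/search?page={n}&tab=Relevance&q=why%20Photoshop"
        for n in range(1, 11)
    ]

    def parse(self, response):
        # Default callback: selection via a CSS selector, then follow
        # each search result to the custom callback below.
        for result in response.css("div.question-summary"):  # hypothetical selector
            href = result.css("a.question-hyperlink::attr(href)").get()
            if href:
                yield response.follow(href, callback=self.parse_question_answer_pair)

    def parse_question_answer_pair(self, response):
        # Custom, user-defined callback yielding the final scraped item.
        # XPath allows finer-grained extraction than the CSS selection.
        yield {
            "url": response.url,
            "title": response.xpath("//h1//a/text()").get(),                    # hypothetical
            "questionHtml": response.xpath("//div[@class='post-text']").get(),  # hypothetical
        }

Such a spider can be run with scrapy runspider, writing the yielded items to a JSON file via the -o flag.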

3.3. Data Collection

The same steps were followed for each CQA web site considered; for brevity, we describe only the full workflow used for scraping Stack Overflow. We first manually performed a search containing the keywords "why Photoshop" (quotes excluded). In the next step, we used the resulting URL and the number of result pages to handcraft the start_urls list. For example, performing a search on Stack Overflow using our search query and clicking the first result page at the bottom resulted in the following link: https://stackoverflow.com/search?page=1&tab=Relevance&q=why%20Photoshop. We iterated over the number of result pages returned by the search and included the appropriate URL (with the page number changed) in start_urls.

In the parse() method we iterated over each search result on the page and created a new WhyQuestionAnswerPair item. This item, along with the actual URL of the search result, was passed in turn to the parse_question_answer_pair() callback. This method extracted all the information needed from the new URL and returned the populated item. Finally, each resulting item was appended to a consolidated JSON file.

3.4. Annotation

We retain the following data and metadata for each (WhyQ, A) pair: question identifier on the source web site, question URL, question title, question date, questioner, questioner level, question state (open or resolved), full question text and full question HTML, and, for each answer available: answer date, answerer, answerer level, answer votes, answer text, full answer HTML and a boolean property indicating whether this is the best answer. Beautiful Soup (https://www.crummy.com/software/BeautifulSoup) was used for extracting question and answer text from HTML; links provided inside raw snippets were enclosed in parentheses and kept in the final text excerpt.

Since almost all of the above properties are self-explanatory, we give more insight only into the questioner/answerer level properties. Depending on the source web site, these properties can be a text describing user seniority/level on the web site (e.g., "Level 1", "Adobe Community Professional", etc. for Adobe Forums), a number giving the number of posts written so far (Feedback Photoshop), or the reputation score of the user (Stack Overflow and the like). We treat question identifier, questioner level, answerer level and answer votes as optional properties; all other properties are required. We enforce these constraints and validate our dataset using a JSON Schema (http://json-schema.org). A sample (WhyQ, A) pair from our corpus is available in the appendix.
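As an illustration of this validation step, the fragment below uses the jsonschema Python package with a simplified stand-in schema; the real schema shipped with the dataset covers all properties, while this one keeps only a few of the required and optional ones listed above.

import json

from jsonschema import ValidationError, validate

# Simplified stand-in for the dataset's JSON Schema: the optional
# properties ("id", "questionerLevel", per-answer "answererLevel"
# and "answerVotes") are deliberately absent from "required".
PAIR_SCHEMA = {
    "type": "object",
    "required": ["url", "title", "questionDate", "questioner",
                 "questionState", "questionText", "questionHtml", "answers"],
    "properties": {
        "questionState": {"enum": ["open", "resolved"]},
        "answers": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["answerDate", "answerer", "answerText",
                             "answerHtml", "bestAnswer"],
                "properties": {"bestAnswer": {"type": "boolean"}},
            },
        },
    },
}


def validate_pairs(path):
    # Validate every (WhyQ, A) pair in the consolidated JSON file,
    # reporting offending pairs by their question URL.
    with open(path, encoding="utf-8") as f:
        pairs = json.load(f)
    for pair in pairs:
        try:
            validate(instance=pair, schema=PAIR_SCHEMA)
        except ValidationError as err:
            print(pair.get("url", "<no url>"), "->", err.message)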
3.5. Post-Processing

After the web scraping phase, we ended up with a considerably larger number of (WhyQ, A) pairs (4,365 obtained) than those included in our final dataset (2,854 pairs). We refined these pairs in a two-stage manual cleanup process in which we removed duplicates, questions which mentioned Adobe Photoshop only incidentally, and questions of a different question type (e.g., "how to", "definition") than our intended why-question type.

3.6. Availability & Reuse

The dataset, its JSON schema and a copy of this paper are available on GitHub (https://github.com/dulceanu/photoshop-quia) under the CC BY-SA 3.0 license with attribution required (https://creativecommons.org/licenses/by-sa/3.0/). We also plan to add to that repository the Python scripts used for web scraping and manipulating the data.

4. Data Statistics & Analysis

The statistics of the PhotoshopQuiA dataset are described in Table 3. When computing these statistics, we ignored 13 (WhyQ, A) pairs with very long question content: the questioners had posted OS- and Photoshop-related configurations, which artificially inflated the numbers.

- Avg. no. of words in question title: 10.69
- Avg. no. of words in question text: 102.29
- Avg. no. of words in best answer: 95.60

Table 3: Statistics of the PhotoshopQuiA dataset

Causality, causation or causal relation is "the relation between a cause and its effect or between regularly correlated events or phenomena" (Merriam-Webster's Online Dictionary). Words connecting the cause and effect parts of a causal relation are called causal markers. As outlined by (Girju, 2003), causative constructions can be explicit (introduced by causal markers like cause, effect, consequence, etc.) or implicit (i.e., without any explicit marker). Blanco et al. (2008) refine causal relation categories by introducing ambiguity (the causal marker does not always signal a causation, e.g., since) and completeness (both cause and effect parts are present).

For our analysis, we are interested in explicit causal markers, ignoring the ambiguity and completeness of the causal relations. To come up with an extended list of causal markers, we started with the adverb therefore and looked for its synonyms in the Oxford Thesaurus of English. We found that causality markers are present in 1,583 answers out of 2,854 total answers (55.46%). Figure 1 shows the distribution of the 22 markers found.

Figure 1: Distribution of explicit causality markers found in PhotoshopQuiA (1,583 answers)

A future study needs to be done to remove the ambiguous markers.
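The marker counting itself is straightforward to reproduce. The sketch below uses a small illustrative subset of the 22 markers, and the choice to search only the labeled best answer of each pair is our assumption about how the 2,854-answer denominator was obtained.

import json
import re

# Illustrative subset of the 22 explicit causal markers derived from
# synonyms of "therefore" in the Oxford Thesaurus of English.
CAUSAL_MARKERS = ["because", "due to", "therefore", "the reason for",
                  "as a result", "consequently", "hence", "since"]
MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in CAUSAL_MARKERS) + r")\b")


def count_marked_answers(path):
    # Count pairs whose best answer contains an explicit causal marker.
    with open(path, encoding="utf-8") as f:
        pairs = json.load(f)
    marked = 0
    for pair in pairs:
        best = next((a for a in pair["answers"] if a["bestAnswer"]), None)
        if best is not None and MARKER_RE.search(best["answerText"].lower()):
            marked += 1
    return marked, len(pairs)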

To better visualize the diversity of the questions in the dataset, we include in Figure 2 a breakdown of questions based on the first wh-word contained in the question title. When crunching the numbers behind these statistics, we noticed that 239 questions (8.37% of the dataset) have multiple wh-words in their title. A distinct category emerged from questions asking for OS-specific details or solutions, totaling 286 questions, or 10.26% of the dataset. Moreover, 763 questions (accounting for 26.73% of the total questions) contained the adverb not. The key takeaway here is that the two most important features when it comes to question titles are the presence of the word why and of the negation.

Figure 2: Distribution of questions based on first occurrence of a wh-word in question title (1,322 questions)

At first, it might seem odd that 9.76% of the questions with at least one wh-word contain the word how and are still considered why-questions. An example of such a question is:

Q Title: Photoshop: how to change from RGB to CMYK without any color loss?
Q Content: [...] When converting this from RGB to CMYK, my original colors are changing. [...]
A: By changing color mode you essentially change colors. [...] Because of that, [...] the colors you get [...] will be made as close to original as possible - but not identical. [...]

As can be seen, the questioner was not interested in a general recipe for changing the color mode of his/her document, but rather wanted to do it under special circumstances (i.e., without color loss). In other words, this question is equivalent to a why-question like Why do I have a color loss when changing from RGB to CMYK?. Often, people use how, usually followed by come, to informally ask for causes of events, as outlined by this example from Merriam-Webster's Online Dictionary: How come you can't go?.

A detailed classification of the questions containing the word why in their title is included in Figure 3. The numbers are again presented as percentages, and the categories are not mutually exclusive. In orange we show the percentage of why-questions in each sub-category which also have the adverb not in their title. Noteworthy here is the strong connection between the modal can and the negation (e.g., Why can't I amend my mask with the brush tool?) and between the auxiliary will and the negation (e.g., Why won't Photoshop let me rename my layers?).


Figure 3: Explicit why questions sub-categories (1,006 questions)
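As an illustration of how the breakdowns in Figures 2 and 3 can be recomputed from the corpus, the sketch below tallies the first wh-word in each title and the presence of negation. The simple tokenization and the decision to count contracted forms such as can't and won't as negations are our assumptions.

import json
import re

WH_WORDS = {"why", "how", "what", "when", "where", "which", "who"}


def title_statistics(path):
    # Tally the first wh-word per question title and count titles
    # containing a negation.
    with open(path, encoding="utf-8") as f:
        pairs = json.load(f)
    first_wh, negated = {}, 0
    for pair in pairs:
        words = re.findall(r"[a-z']+", pair["title"].lower())
        wh = next((w for w in words if w in WH_WORDS), None)
        if wh is not None:
            first_wh[wh] = first_wh.get(wh, 0) + 1
        if "not" in words or any(w.endswith("n't") for w in words):
            negated += 1
    return first_wh, negated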

5. Expected Use

Below we suggest some use cases and tasks that we are currently working on:

- Question classification: what classification models for why-questions can be used to facilitate the tasks of Why-QA systems, and how to classify why-questions based on the selected classification model.
- Semantic parsing: how to convert a why-question into a structured format (or a canonical form) to speed up the matching process.
- Question-to-question matching: what approaches can be used to match the asked question against the questions in the knowledge base (information retrieval, bag-of-words, knowledge-based approaches, or neural networks).

6. Conclusion and Future Work

In this paper, we present PhotoshopQuiA, a new dataset for Why-QA. The dataset is constructed in a natural and practicable manner so that it can be used to observe different characteristics and behaviors of why-questions. We believe that the PhotoshopQuiA dataset enables new research in Why-QA, which has received less focus, and that our work stimulates further research to advance the QA technology needed for smart services such as recommendation systems, chatbots, and intelligent assistants.

7. Acknowledgements

The authors express their sincere thanks to the University Gift Funding of Adobe Systems Incorporated for the partial financial support of this research.

Appendix: JSON format of the PhotoshopQuiA dataset

1 "id":20518335, 2 "url": "https://stackoverflow.com/questions/20518335/why-do-photoshop-files-start -with-8bps", 3 "title": "Why do Photoshop files start with8BPS?", 4 "questionDate":"2013-12-11T11:48:22Z", 5 "questioner": "Party Ark ", 6 "questionerLevel":"532", 7 "questionState": "resolved", 8 "questionText": "From pretty much the beginning of time Photoshop files have begun with8BPS. (I have verified this back to version2.5) It must have had some meaning at some point.\nThe8B I thought might refer to bits/channel, but it makes no difference saving it16 or32. PS is probably PhotoShop, but might not be. Something to do with the way Mac saved files?", 9 "questionHtml": "

\r\n\r\n

From pretty much the beginning of time Photoshop files have begun with8BPS. (I have verified this back to version2.5) It must have had some meaning at some point.

\n\n

The8B I thought might refer to bits/channel, but it makes no difference saving it16 or32. PS is probably PhotoShop, but might not be. Something to do with the way Mac saved files?

\n
", 10 "answers":[ 11 { 12 "answerDate":"2015-02-20T22:31:45Z", 13 "answerer": "Community ", 14 "answererLevel":"1", 15 "answerVotes":4, 16 "answerText":"8B is shorthand for Adobe. I guess \"eight bee\" sounds a little bit like Adobe; more so if you’re Italian - \"Otto Bee\".\nAnd PS is \"Photoshop\". So,8BPS is \"Adobe Photoshop\".\n8B crops up in quite a few places in Adobe file extensions or internal types. Wikipedia has a list (http://en.wikipedia.org/wiki/8B).", 17 "answerHtml": "
\r\n

8B is shorthand for Adobe. I guess \"eight bee\" sounds a little bit like Adobe; more so if you’re Italian - \"Otto Bee\".

\n\n< p>And PS is \"Photoshop\". So, 8BPS is \"Adobe Photoshop\".

\n\n

8B crops up in quite a few places in Adobe file extensions or internal types. Wikipedia has a list.

\n
", 18 "bestAnswer": true 19 }, 20 { 21 "answerDate":"2013-12-11T11:53:19Z", 22 "answerer": "Syjin ", 23 "answererLevel":"1,922", 24 "answerVotes":0, 25 "answerText": "That is just the Signature to identify the file as a Photoshop file. From the Specification:\nSignature: always equal to ’8 BPS’ . Do not try to read the file if the signature does not match this value.\nSee the Photoshop File Format Specification (http://www.adobe. com/devnet-apps/photoshop/fileformatashtml/#50577409_pgfId-1036097) for more detailied information.", 26 "answerHtml": ... truncated due to space constraints..., 27 "bestAnswer": false 28 } 29 ] 30 } Listing 1: An example of an item containing all properties from PhotoshopQuiA dataset

8. Bibliographical References

Baudiš, P. and Šedivý, J. (2015). Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer.

Berant, J., Chou, A., Frostig, R., and Liang, P. (2013). Semantic parsing on freebase from question-answer pairs. In EMNLP, volume 2, page 6.

Blanco, E., Castell, N., and Moldovan, D. I. (2008). Causal relation extraction. In LREC.

Blooma, M. J. and Kurian, J. C. (2011). Research issues in community based question answering. In PACIS, page 29.

Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. (2008). Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM.

Cai, Q. and Yates, A. (2013). Large-scale semantic parsing via schema matching and lexicon extension. In ACL (1), pages 423–433.

Chen, D., Fisch, A., Weston, J., and Bordes, A. (2017). Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

Cohen, D. and Croft, W. B. (2016). End to end long short term memory networks for non-factoid question answering. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, pages 143–146. ACM.

Dunietz, J., Levin, L., and Carbonell, J. (2017). The because corpus 2.0: Annotating causality and overlapping relations. In Proceedings of the 11th Linguistic Annotation Workshop, pages 95–104.

Girju, R. (2003). Automatic detection of causal relations for question answering. In Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering - Volume 12, pages 76–83. Association for Computational Linguistics.

Higashinaka, R. and Isozaki, H. (2008). Corpus-based question answering for why-questions. In IJCNLP, pages 418–425.

Hoogeveen, D., Verspoor, K. M., and Baldwin, T. (2015). Cqadupstack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian Document Computing Symposium, page 3. ACM.

Ignatova, K., Toprak, C., Bernhard, D., and Gurevych, I. (2009). Annotating question types in social q&a sites. In Tagungsband des GSCL Symposiums 'Sprachtechnologie und eHumanities', pages 44–49.

Iyer, S., Dandekar, N., and Csernai, K. (2017). First quora dataset release: Question pairs.

Jansen, P., Surdeanu, M., and Clark, P. (2014). Discourse complements lexical semantics for non-factoid answer reranking. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 977–986.

Jurafsky, D. and Martin, J. H. (2017). Speech and language processing. Unpublished draft, 3rd edition.

Li, P., Li, W., He, Z., Wang, X., Cao, Y., Zhou, J., and Xu, W. (2016). Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv preprint arXiv:1607.06275.

Miller, A., Fisch, A., Dodge, J., Karimi, A.-H., Bordes, A., and Weston, J. (2016). Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126.

Mishra, A. and Jain, S. K. (2016). A survey on question answering systems with classification. Journal of King Saud University - Computer and Information Sciences, 28(3):345–361.

Mrozinski, J., Whittaker, E., and Furui, S. (2008). Collecting a why-question corpus for development and evaluation of an automatic qa-system. Proceedings of ACL-08: HLT, pages 443–451.

Nakov, P., Màrquez, L., Moschitti, A., Magdy, W., Mubarak, H., Freihat, A. A., Glass, J., and Randeree, B. (2016). Semeval-2016 task 3: Community question answering. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 525–545, San Diego, California, June. Association for Computational Linguistics.

Oh, J.-H., Torisawa, K., Hashimoto, C., Kawada, T., De Saeger, S., Kazama, J., and Wang, Y. (2012). Why question answering using sentiment analysis and word classes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 368–378. Association for Computational Linguistics.

Prasad, R., Miltsakaki, E., Dinesh, N., Lee, A., Joshi, A., Robaldo, L., and Webber, B. L. (2007). The penn discourse treebank 2.0 annotation manual.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Serban, I. V., García-Durán, A., Gulcehre, C., Ahn, S., Chandar, S., Courville, A., and Bengio, Y. (2016). Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. arXiv preprint arXiv:1603.06807.

Surdeanu, M., Ciaramita, M., and Zaragoza, H. (2008). Learning to rank answers on large online qa collections. Proceedings of ACL-08: HLT, pages 719–727.

Tomasoni, M. and Huang, M. (2010). Metadata-aware measures for answer summarization in community question answering. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 760–769. Association for Computational Linguistics.

Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M. R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., et al. (2015). An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16(1):138.

Verberne, S., Boves, L., Oostdijk, N., and Coppen, P.-A. (2007). Evaluating discourse-based answer extraction for why-question answering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 735–736. ACM.

Verberne, S., Boves, L., Oostdijk, N., and Coppen, P.-A. (2010). What is not in the bag of words for why-qa? Computational Linguistics, 36(2):229–245.

Wan, M. and McAuley, J. (2016). Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 489–498. IEEE.

Yang, Y., Yih, W.-t., and Meek, C. (2015). Wikiqa: A challenge dataset for open-domain question answering. In EMNLP, pages 2013–2018.
