Arxiv:2010.09692V2 [Cs.CL] 9 Jul 2021
Total Page:16
File Type:pdf, Size:1020Kb
Summary-Oriented Question Generation for Informational Queries Xusen Yin∗ Li Zhou USC/ISI Amazon [email protected] [email protected] Kevin Small Jonathan May Amazon USC/ISI [email protected] [email protected] Abstract not changing these expectations, and users simply not possessing sufficient knowledge of the subject Users frequently ask simple factoid questions of interest to ask more challenging questions. Ir- for question answering (QA) systems, attenu- respective of the reason, one potential solution to ating the impact of myriad recent works that support more complex questions. Prompting this dilemma is to provide users with automatically users with automatically generated suggested generated suggested questions (SQs) to help users questions (SQs) can improve user understand- better understand QA system capabilities. ing of QA system capabilities and thus facil- Generating SQs is a specific form of question itate more effective use. We aim to produce generation (QG), a long-studied task with many ap- self-explanatory questions that focus on main plied use cases – the most frequent purpose being document topics and are answerable with vari- data augmentation for mitigating the high sample able length passages as appropriate. We sat- complexity of neural QA models (Alberti et al., isfy these requirements by using a BERT-based Pointer-Generator Network trained on the Nat- 2019a). However, the objective of such existing ural Questions (NQ) dataset. Our model shows QG systems is to produce large quantities of ques- SOTA performance of SQ generation on the tion/answer pairs for training, which is contrary to NQ dataset (20.1 BLEU-4). We further ap- that of SQs. The latter seeks to guide users in their ply our model on out-of-domain news articles, research of a particular subject by producing engag- evaluating with a QA system due to the lack of ing and understandable questions. To this end, we gold questions and demonstrate that our model aim to generate questions that are self-explanatory produces better SQs for news articles – with further confirmation via a human evaluation. and introductory. Self-explanatory questions require neither signif- 1 Introduction icant background knowledge nor access to docu- ments used for QG to understand the SQ context. Question answering (QA) systems have experi- For example, existing QG systems may use the enced dramatic recent empirical improvements due text “On December 13, 2013, Beyonce´ unexpect- to several factors including novel neural architec- edly released her eponymous fifth studio album on tures (Chen and Yih, 2020), access to pre-trained the iTunes store without any prior announcement contextualized embeddings (Devlin et al., 2019), or promotion.” to produce the question “Where arXiv:2010.09692v2 [cs.CL] 9 Jul 2021 and the development of large QA training cor- was the album released?” This kind of question pora (Rajpurkar et al., 2016; Trischler et al., 2017; is not uncommon in crowd-sourced datasets (e.g., Yu et al., 2020). However, despite technological SQuAD (Rajpurkar et al., 2016)) but do not satisfy advancements that support more sophisticated ques- the self-explanatory requirement. Clark and Gard- tions (Yang et al., 2018; Joshi et al., 2017; Choi ner(2018) estimate that 33 % of SQuAD questions et al., 2018; Reddy et al., 2019), many consumers are context-dependent. This context-dependency of QA technology in practice tend to ask simple is not surprising, given that annotators observe the factoid questions when engaging with these sys- underlying documents when generating questions. tems. Potential explanations for this phenomenon Introductory questions are best answered by a include low expectations set by previous QA sys- larger passage than short spans such that users tems, limited coverage for more complex questions can learn more about the subject, possibly inspir- ∗Work was done as an intern at Amazon. ing follow-up questions (e.g., “Can convalescent plasma help COVID patients?”). However, existing suring the possibility to find answers from given QG methods mostly generate questions while read- contexts. The second is granularity, indicating ing the text corpus and tend to produce narrowly whether the answer would be passages or short focused questions with close syntactic relations to spans. Finally, we conduct a human evaluation with associated answer spans. TriviaQA (Joshi et al., generated questions of the test set and demonstrate 2017) and HotpotQA (Yang et al., 2018) also pro- that our BERTPGN model can produce introduc- vide fine-grained questions, even though reasoning tory and self-explanatory questions for information- from a larger document context via multi-hop infer- seeking scenarios, even for a new domain that dif- ence. This narrower focus often produces factoid fers from the training data. questions peripheral to the main topic of the under- The novel contributions of our paper include: lying document and is less useful to a human user • We generate questions, aiming to be both in- seeking information about a target concept. troductory and self-explanatory, to support Conversely, the Natural Question (NQ) dataset human information seeking QA sessions. (Kwiatkowski et al., 2019) (and similar ones • We propose to use the BERT-based Pointer- such as MS Marco (Bajaj et al., 2016), Generator Network to generate questions by GooAQ (Khashabi et al., 2021)) is significantly encoding larger contexts capable of resulting closer to simulating the desired information- in answer forms including entities, short text seeking behavior. Questions are generated inde- spans, and even whole paragraphs. pendently of the corpus by processing search query • We evaluate our method, both automatically logs, and the resulting answers can be entities, and with human evaluation, on in-domain spans in texts (aka short answers), or entire para- Natural Questions and out-of-domain news graphs (aka long answers). Thus, the NQ dataset datasets, providing insights into question gen- is more suitable as QG training data for generat- eration for information seeking. ing SQs as long-answer questions that tend to sat- • We propose a novel evaluation metric with a isfy our self-explanatory and introductory require- pretrained QA system for generated SQs when ments. there is no reference question. To this end, we propose a novel BERT-based 2 Related Work Pointer-Generator Network (BERTPGN) trained with the NQ dataset to generate introductory and QG has been studied in multiple application con- self-explanatory questions as SQs. Using NQ, texts (e.g., generating questions for reading com- we start by creating a QG dataset that contains prehension tests (Heilman and Smith, 2010), gen- questions with both short and long answers. We erating questions about an image (Mostafazadeh train our BERTPGN model with these two types of et al., 2016), recommending questions with respect context-question pairs together. During inference, to a news article (Laban et al., 2020)), evaluat- the model can generate either short- or long-answer ing summaries (Deutsch et al., 2020; Wang et al., questions as determined by the context. With auto- 2020), and using multiple methods (see (Pan et al., matic evaluation metrics such as BLEU (Papineni 2019) for a recent survey). Early neural models et al., 2002), we show that for long-answer question focused on sequence-to-sequence generation based generation, our model can produce state-of-the-art solutions (Serban et al., 2016; Du et al., 2017). performance with 20.1 BLEU-4, 6.2 higher than The primary directions for improving these early (Mishra et al., 2020), the current state-of-the-art on works generally fall into the categories of providing this dataset. The short answer question generation mechanisms to inject answer-aware information performance can reach 28.1 BLEU-4. into the neural encoder-decoder architectures (Du We further validate the generalization ability of and Cardie, 2018; Li et al., 2019; Liu et al., 2019; our BERTPGN model by creating an out-of-domain Wang et al., 2020; Sun et al., 2018), encoding larger test set with the CNN/Daily Mail (Hermann et al., portions of the answer document as context (Zhao 2015). Without human-generated reference ques- et al., 2018; Tuan et al., 2020), and incorporating tions, automatic evaluation metrics such as BLEU richer knowledge sources (Elsahar et al., 2018). are not usable. We propose to evaluate these ques- These QG methods and the work described in tions with a pretrained QA system that produces this paper focus on using single-hop QA datasets two novel metrics. The first is answerability, mea- such as SQuAD (Rajpurkar et al., 2016, 2018), NewsQA (Trischler et al., 2017; Hermann et al., Copy ? 2015), and MS Marco (Bajaj et al., 2016). However, Transformer there has also been recent interest in multi-hop QG Attn BERT Encoder settings (Yu et al., 2020; Gupta et al., 2020; Malon BERT as a LM w w w w and Bai, 2020) by using multi-hop QA datasets p p p p w w w including HotPotQA (Yang et al., 2018), Trivi- t t t t p p p aQA (Joshi et al., 2017), and FreebaseQA (Jiang Figure 1: The BERTPGN architecture. The input for the et al., 2019). Finally, there has been some recent in- BERT encoder is the context (w/p: word and position teresting work regarding unsupervised QG, where embeddinngs) with answer spans (or the whole context the goal is to generate QA training data without in the long answer setting) marked with the answer tag- an existing QG corpus to train better QA mod- ging (t: answer tagging embeddings). The decoder is els (Lewis et al., 2019; Li et al., 2020). a combination of BERT as a language model (i.e. has Most directly related to our work from a mo- only self-attentions) and a Transformer-based pointer- tivation perspective is recent research regarding generator network. providing SQs in the context of supporting a news chatbot (Laban et al., 2020).