
Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

Bang Liu1, Haojie Wei2, Di Niu1, Haolan Chen2, Yancheng He2
1University of Alberta, Edmonton, AB, Canada
2Platform and Content Group, Tencent, Shenzhen, China

ABSTRACT
The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question-answering and machine reading comprehension tasks, and helps a chatbot to keep a conversation flowing with a human. Existing question generation models are ineffective at generating a large amount of high-quality question-answer pairs from unstructured text, since, given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which aims at automatically generating high-quality and diverse question-answer pairs from unlabeled text corpus at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples from the text multiple types of assistive information to guide question generation; ii) neural question generators, which generate diverse and controllable questions, leveraging the extracted assistive information; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and resort to voluntary human evaluation to assess the quality of the generated question-answer pairs. The evaluation results suggest that our system dramatically outperforms state-of-the-art neural question generation models in terms of generation quality, while being scalable in the meantime. With models trained on a relatively smaller amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.

CCS CONCEPTS
• Computing methodologies → Natural language processing; Natural language generation; Machine translation.

KEYWORDS
Question Generation, Sequence-to-Sequence, Machine Reading Comprehension

Input sentence: "The fight scene finale between Sharon and the character played by Ali Larter, from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight."

Answer: MTV Movie Award for Best Fight
Clue: from the movie Obsessed
Style: Which
Q: A fight scene from the movie, Obsessed, won which award?

Answer: MTV Movie Award for Best Fight
Clue: The fight scene finale between Sharon and the character played by Ali Larter
Style: Which
Q: Which award did the fight scene between Sharon and the role of Ali Larter win?

Answer: Obsessed
Clue: won the 2010 MTV Movie Award for Best Fight
Style: What
Q: What is the name of the movie that won the 2010 MTV Movie Award for Best Fight?

Figure 1: Given the same input sentence, we can ask diverse questions based on different choices of i) what the target answer is; ii) which answer-related chunk is used as a clue; and iii) what type of question is asked.

1 INTRODUCTION
Automatically generating question-answer pairs from unlabeled text passages is of great value to many applications, such as assisting the training of machine reading comprehension systems [10, 44, 45], generating queries/questions from documents to improve search engines [17], training chatbots to get and keep a conversation going [40], generating exercises for educational purposes [7, 18, 19], and generating FAQs for web documents [25]. Many question-answering tasks, such as machine reading comprehension and chatbots, require a large amount of labeled samples for supervised training, and acquiring such data is time-consuming and costly. Automatic question-answer generation makes it possible to provide these systems with scalable training data and to transfer a pre-trained model to new domains that lack manually labeled training samples.

Despite a large number of studies on neural question generation, it remains a significant challenge to generate high-quality QA pairs from unstructured text in large quantities. Most existing neural question generation approaches try to solve the answer-aware question generation problem, where an answer chunk and the surrounding passage are provided as input to the model, while the output is the question to be generated. They formulate the task as a Sequence-to-Sequence (Seq2Seq) problem, and design various encoders, decoders, and input features to improve the quality of generated questions [10, 11, 22, 27, 39, 41, 53]. However, answer-aware question generation models are far from sufficient, since question generation from a passage is inherently a one-to-many mapping. Figure 1 shows an example of this phenomenon. Given the same input text "The fight scene finale between Sharon and the character played by Ali Larter, from the movie Obsessed, won the 2010 MTV Movie Award for Best Fight.", we can ask a variety of questions based on it. Even if we fix the text chunk "MTV Movie Award for Best Fight" as the answer, we can still ask different questions such as "A fight scene from the movie, Obsessed, won which award?" or "Which award did the fight scene between Sharon and the role of Ali Larter win?".

We argue that when a human asks a question based on a passage, she considers various factors. First, she selects an answer as a target that her question points to. Second, she decides which piece of information will be present (or rephrased) in her question to set constraints or context for the question.
We call this piece of information the clue. The target answer may be related to different clues in the passage. Third, even the same question may be expressed in different styles (e.g., "what", "who", "why", etc.). For example, one can ask "which award" or "what is the name of the award" to express the same meaning. Once the answer, clue, and question style are selected, the question generation process is narrowed down and becomes closer to a one-to-one mapping problem, essentially mimicking the human way of asking questions. In other words, introducing these pieces of information into question-answer generation helps reduce the difficulty of the task.

In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), designed for the scalable generation of high-quality question-answer pairs from unlabeled text corpus. Just as a human asks a question with a clue and style in mind, our system first automatically extracts multiple types of information from an input passage to assist question generation. Based on the multi-aspect information extracted, we design neural network models to generate diverse questions in a controllable way. Compared with existing answer-aware question generation, our approach essentially converts the one-to-many mapping problem into a one-to-one mapping problem, and is thus scalable by varying the assistive information fed to the neural network, while in the meantime ensuring generation quality. Specifically, we have solved multiple challenges in the ACS-aware question generation system:

What to ask given an unlabeled passage? Given an input passage such as a sentence, randomly sampling ⟨answer, clue, style⟩ combinations will cause type mismatches, since answer, clue, and style are not independent of each other.
Without taking their correlations into account, for example, we may select "how" or "when" as the target question style while a person's name is selected as the answer. Moreover, random sampling may lead to input volume explosion, as most of such combinations point to meaningless questions. To overcome these challenges, we design and implement an information extractor to efficiently sample meaningful inputs from the given text. We learn the joint distribution of ⟨answer, clue, style⟩ tuples from existing reading comprehension datasets, such as SQuAD [36]. In the meantime, we decompose the joint probability distribution of the tuple into three components, and apply a three-step sampling mechanism to select reasonable combinations of input information from the input passage to feed into the ACS-aware question generator. Based on this strategy, we can alleviate type mismatches and avoid meaningless combinations of assistive information.

How to learn a model to ask ACS-aware questions? Most existing neural approaches are designed for answer-aware question generation, while there is no training data available for the ACS-aware question generation task. We propose effective strategies to automatically construct training samples from existing reading comprehension datasets without any human labeling effort. We define a "clue" as a semantic chunk in an input passage that will be included (or rephrased) in the target question. Based on this definition, we perform syntactic parsing and chunking on the input text, and select the chunk most relevant to the target question as the clue. Furthermore, we categorize questions into 9 styles, including "what", "how", "yes-no", and so forth. In this manner, we have leveraged the abundance of reading comprehension datasets to automatically construct training data for ACS-aware question generation models.

We propose two deep neural network models for ACS-aware question generation, and show their superior performance in generating diverse and high-quality questions. The first model employs a sequence-to-sequence framework with copy and attention mechanisms [1, 3, 43], incorporating the information of answer, clue, and style into the encoder and decoder. Furthermore, it discriminates between content words and function words in the input, and utilizes vocabulary reduction (which downsizes the vocabularies for both the encoder and decoder) to encourage aggressive copying. In the second model, we fine-tune a GPT2-small model [34], using the input passage, answer, clue, and question style as the language modeling context. As a result, we reduce the phenomenon of repeated output words, which commonly afflicts sequence-to-sequence models, and can generate questions with better readability. With multi-aspect assistive information, our models are able to ask a variety of high-quality questions based on an input passage, while making the generation process controllable.

How to ensure the quality of generated QA pairs? We construct a data filter, which consists of an entailment model and a question answering model. In our filtering process, we feed each generated question into a BERT-based [9] question answering model to get its predicted answer span, and measure the overlap between the input answer span and the predicted answer span. In addition, we also classify the entailment relationship between the original sentence and the question-answer concatenation. These components allow us to remove low-quality QA pairs. By combining the input sampler, the ACS-aware question generator, and the data filter, we have constructed a pipeline that is able to generate a large number of QA pairs from unlabeled text without extra human labeling effort.
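To make the answer-overlap check concrete, the following is a minimal sketch of this filtering criterion. The token-level F1 measure and the 0.7 threshold are our illustrative assumptions rather than values specified in the paper, and qa_predict is a hypothetical stand-in for the BERT-based question answering model; the entailment check is handled analogously by a separate classifier.

    from collections import Counter

    def token_f1(pred, gold):
        """Token-level F1 between the predicted and the input answer span."""
        pred_tokens, gold_tokens = pred.split(), gold.split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    def keep_qa_pair(passage, question, answer, qa_predict, threshold=0.7):
        """Keep a generated QA pair only if the QA model's predicted answer
        sufficiently overlaps the input answer span."""
        predicted = qa_predict(passage, question)
        return token_f1(predicted, answer) >= threshold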

We perform extensive experiments based on the SQuAD dataset [36] and Wikipedia, and compare our ACS-aware question generation models with existing approaches. The results show that both the content-separated Seq2Seq model with an aggressive copying mechanism and the extra input information bring substantial benefits to question generation. Our method outperforms state-of-the-art models significantly in terms of various metrics such as BLEU-4, ROUGE-L, and METEOR.

With models trained on 86,635 SQuAD data samples, we can automatically generate two large datasets containing 1.33 million and 1.45 million QA pairs, respectively, from a corpus of top-ranked Wikipedia articles. We perform quality evaluation on the generated datasets and identify their strengths and weaknesses. Finally, we also evaluate how our generated QA data perform in training question-answering models for machine reading comprehension, as an alternative means to assess the data generation quality.

Figure 2: An overview of the system architecture. It contains a dataset constructor, an information sampler, ACS-aware question generators, and a data filter.
2 PROBLEM FORMULATION
In this section, we formally introduce the problem of ACS-aware question generation.

Denote a passage by p, where it can be either a sentence or a paragraph (in our work, it is a sentence). Let q denote a question related to this passage, and a denote the answer of that question. A passage consists of a sequence of words p = {p_t}_{t=1}^{|p|}, where |p| denotes the length of p. A question q = {q_t}_{t=1}^{|q|} contains words from either a predefined vocabulary V or from the input text p. Our objective is to generate different question-answer pairs from the passage; therefore, we aim to model the probability P(q, a|p).

In our work, we factorize the generation process into multiple steps to select different inputs for question generation. Specifically, given a passage p, we select three types of information as input to a generative model, which are defined as follows:

• Answer: we define an answer a as a span in the input passage p. Specifically, we select a semantic chunk of p as a target answer from a set of chunks given by parsing and chunking.
• Clue: denote a clue as c. As mentioned in Sec. 1, a clue is a semantic chunk in the input p which will be copied or rephrased in the target question. It is related to the answer a, and providing it as input helps reduce the uncertainty when generating questions. This alleviates the one-to-many mapping problem of question generation, makes the generation process more controllable, and improves the quality of generated questions.
• Style: denote a question style as s. We classify each question into nine styles: "who", "where", "when", "why", "which", "what", "how", "yes-no", and "other". By providing the target style to the question generation model, we further reduce the uncertainty of generation and increase controllability.

We shall note that our definition of clue is different from that of [27]. In our work, given a passage p and a question q, we identify a clue as a consistent chunk in p rather than the overlapping non-stop words between p and q. On one hand, this allows the clue to be expressed in different ways in a question. On the other hand, given an unlabeled text corpus, we can sample clue chunks for generating questions according to the same distribution as in the training datasets, to avoid discrepancy between training and generation.

Given the above definitions, our generation process is decomposed into input sampling and ACS-aware question generation:

    P(q, a|p) = Σ_{c,s} P(a, c, s|p) P(q|a, c, s, p)                        (1)
              = Σ_{c,s} P(a|p) P(s|a, p) P(c|s, a, p) P(q|c, s, a, p),      (2)

where P(a|p), P(s|a, p), and P(c|s, a, p) model the process of input sampling to obtain the answer, question style, and clue information for a target question, and P(q|c, s, a, p) models the process of generating the target question.
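As a concrete illustration of Eq. (2), below is a minimal sketch of the three-step input sampling, assuming the three conditional distributions have already been estimated from a labeled dataset such as SQuAD; the function names and the representation of each distribution as (candidate, probability) pairs are our own assumptions.

    import random

    def sample_acs_input(passage, p_answer, p_style, p_clue):
        """Draw one (answer, clue, style) tuple following Eq. (2):
        a ~ P(a|p), then s ~ P(s|a,p), then c ~ P(c|s,a,p)."""
        def draw(distribution):
            # distribution: list of (candidate, probability) pairs
            candidates, weights = zip(*distribution)
            return random.choices(candidates, weights=weights, k=1)[0]

        a = draw(p_answer(passage))        # step 1: sample an answer chunk
        s = draw(p_style(a, passage))      # step 2: sample a question style
        c = draw(p_clue(s, a, passage))    # step 3: sample a clue chunk
        return a, c, s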
3 MODEL DESCRIPTION
In this section, we present our overall system architecture for generating questions from unlabeled text corpus, and then introduce the details of each component.

Figure 2 shows the pipelined system we build to train ACS-aware question generation models and generate large-scale datasets that can be utilized for different applications. Our system consists of four major components: i) a dataset constructor, which takes existing QA datasets as input and constructs training datasets for ACS-aware question generation; ii) an information sampler (extractor), which samples answer, clue, and style information from input text and feeds them into the ACS-aware QG models; iii) ACS-aware question generation models, which are trained on the constructed datasets to generate questions; and iv) a data filter, which controls the quality of the generated questions.

3.1 Obtaining Training Data for Question Generation
Our first step is to acquire a training dataset to train ACS-aware question generation models. Existing answer-aware question generation methods [11, 27, 39, 53] utilize reading comprehension datasets such as SQuAD [36], as these datasets contain ⟨p, q, a⟩ tuples. However, for our problem, the input and output consist of ⟨p, q, a, c, s⟩, where the clue c and style s information are not directly provided in existing datasets. To address this issue, we design effective strategies to automatically extract clue and style information without involving human labeling.

Rules for Clue Identification. As mentioned in Sec. 2, given ⟨p, q, a⟩, we define a semantic chunk c in the input p which is copied or rephrased in the output question q as the clue. We identify c by the method shown in Algorithm 1.

ALGORITHM 1: Clue Extraction
Input: passage p, answer a, question q, related words dictionary R.
Output: clue c.
1: get candidate chunks C = {c_1, c_2, ..., c_|C|} of passage p by parsing and chunking;
2: remove function words, tokenize p and q to get p_{t,c} and q_{t,c}, and stem p and q to get p_{m,c} and q_{m,c};
3: for c ∈ C do
4:   get tokenized clue c_{t,c} and stemmed clue c_{m,c} with only content words;
5:   n^o_{t,c} ← number of overlapping tokens between c_{t,c} and q_{t,c};
6:   n^o_{m,c} ← number of overlapping stems between c_{m,c} and q_{m,c};
7:   n^{soft-o}_{t,c} ← number of soft-copied tokens between c_{t,c} and q_{t,c};
8:   binary x ← whether q contains the chunk text c;
9:   score(c) = n^o_{t,c} + n^o_{m,c} + n^{soft-o}_{t,c} + x;
10: end for
11: select the chunk c with the maximum score(c) as the clue chunk;

ALGORITHM 2: Style Classification
Input: question q, style set S = {who, where, when, why, which, what, how, yes-no, other}, yes-no feature words set Y = {am, is, was, were, are, does, do, did, have, had, has, could, can, shall, should, will, would, may, might}.
Output: style s ∈ S.
1: for s ∈ S \ {yes-no, other} do
2:   if word s is contained in q then
3:     return s
4:   end if
5: end for
6: for y ∈ Y do
7:   if word y is the first word of q then
8:     return yes-no
9:   end if
10: end for
11: return other
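For concreteness, a direct Python rendering of Algorithm 2 follows; lowercasing and whitespace tokenization are our simplifying assumptions.

    STYLE_KEYWORDS = ["who", "where", "when", "why", "which", "what", "how"]
    YES_NO_WORDS = {"am", "is", "was", "were", "are", "does", "do", "did",
                    "have", "had", "has", "could", "can", "shall", "should",
                    "will", "would", "may", "might"}

    def classify_style(question):
        """Algorithm 2: map a question to one of the nine styles."""
        tokens = question.lower().split()
        for style in STYLE_KEYWORDS:       # keyword styles checked first
            if style in tokens:
                return style
        if tokens and tokens[0] in YES_NO_WORDS:
            return "yes-no"                # e.g., "Did the movie win an award?"
        return "other"

    # Example: classify_style("Which award did the fight scene win?") -> "which"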

First, we parse and chunk the input passage to get all candidate chunks. Second, we tokenize and stem the passage and the question, keeping only the content words in the results. Third, we calculate the similarities between each candidate chunk and the target question according to different criteria; the final score of each chunk is the sum of the different similarities. Finally, we select the chunk with the maximum score as the identified clue chunk c.

To estimate the similarities between each candidate chunk and the question, we calculate the number of overlapping tokens and stems between each chunk and the question, and check whether the chunk is fully contained in the question. In addition, we further define a "soft copy" relationship between two words to take rephrasing into consideration. Specifically, a word w_q ∈ q is considered soft-copied from the input passage p if there exists a word w_p ∈ p which is semantically coherent with w_q. For instance, given the passage "Selina left her hometown at the age of 18" and the question "How old was Selina when she left?", the word "old" is soft-copied from "age" in the input passage.

To identify the soft-copy relationship between any pair of words, we utilize synonyms and word vectors, such as GloVe [33], to construct a related words dictionary R, where R(w) = {w_1, w_2, ..., w_{|R(w)|}} returns the set of words closely related to w. For each word w, R(w) is composed of the synonyms of w, as well as the top N most similar words estimated by word vector representations (we set N = 5). In our work, we utilize GloVe word vectors, and construct R based on Gensim [37] and WordNet [30].
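A minimal sketch of this construction is shown below, assuming GloVe vectors have been converted to word2vec format and loaded into Gensim (the file path is illustrative) and that NLTK's WordNet corpus is available.

    from gensim.models import KeyedVectors
    from nltk.corpus import wordnet as wn

    # Illustrative path; assumes GloVe vectors converted to word2vec format.
    glove = KeyedVectors.load_word2vec_format("glove.840B.300d.w2v.txt")

    def related_words(w, topn=5):
        """R(w): WordNet synonyms of w plus its top-N GloVe neighbors (N = 5)."""
        related = {lemma.name() for synset in wn.synsets(w)
                   for lemma in synset.lemmas()}
        if w in glove:
            related |= {word for word, _ in glove.most_similar(w, topn=topn)}
        related.discard(w)
        return related

    def is_soft_copied(w_q, passage_words):
        """A question word is soft-copied if some passage word is related to it."""
        return any(w_p in related_words(w_q) or w_q in related_words(w_p)
                   for w_p in passage_words)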
Rules for Style Classification. Algorithm 2 presents our method for question style classification. We classify a given question into 9 classes based on a few heuristic strategies. If q contains "who", "where", "when", "why", "which", "what", or "how", we classify it as the corresponding type. For yes-no type questions, we define a set of feature words; if q starts with any word belonging to this set, we classify it as type yes-no. For all other cases, we label it as other.

3.2 ACS-Aware Question Generation
After obtaining the training datasets, we design two models for ACS-aware question generation. The first model is based on the Seq2Seq framework with attention and copy mechanisms [1, 14, 43]. In addition, we exploit clue embedding, content embedding, style encoding, and aggressive copying to improve the performance of question generation. The second model is based on pre-trained language models: we fine-tune a GPT2-small model [34] using the constructed training datasets.

3.2.1 Seq2Seq-based ACS-aware question generation. Given a passage, an answer span, a clue span, and a desired question style, we train a neural encoder-decoder model to generate appropriate questions.

Encoder. We utilize a bidirectional Gated Recurrent Unit (BiGRU) [5] as our encoder. For each word p_i in the input passage p, we concatenate different features to form an embedding vector w_i as the input to the encoder. Specifically, each word is represented by the concatenation of its word vector, the embeddings of its Named Entity Recognition (NER) tag and Part-of-Speech (POS) tag, and an indicator of whether it is a content word. In addition, we know whether each word is within the span of the answer a or the clue c, and utilize binary features to indicate the positions of the answer and the clue in the input passage. All tag features and binary features are cast into 16-dimensional vectors by different trainable embedding matrices.

Suppose the embedding of passage p is (w_1, w_2, ..., w_|p|). Our encoder reads the input sequence and produces a sequence of hidden states h_1, h_2, ..., h_|p|, where each hidden state is a concatenation of a forward representation and a backward representation:

    h_i = [→h_i ; ←h_i],             (3)
    →h_i = BiGRU(w_i, →h_{i-1}),     (4)
    ←h_i = BiGRU(w_i, ←h_{i+1}).     (5)

Here →h_i and ←h_i are the forward and backward hidden states of the i-th token in p, respectively.
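A minimal PyTorch sketch of this input construction and encoder (Eqs. (3)-(5)) is given below; the vocabulary and tag-set sizes, word dimension, and hidden size are illustrative placeholders, while the 16-dimensional tag and binary feature embeddings follow the text.

    import torch
    import torch.nn as nn

    class ACSEncoder(nn.Module):
        """BiGRU encoder over the concatenated word, NER, POS, content-word,
        answer-position, and clue-position embeddings (Eqs. (3)-(5))."""
        def __init__(self, vocab_size, n_ner_tags, n_pos_tags,
                     word_dim=300, hidden_dim=256):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            self.ner_emb = nn.Embedding(n_ner_tags, 16)  # 16-dim tag features
            self.pos_emb = nn.Embedding(n_pos_tags, 16)
            self.content_emb = nn.Embedding(2, 16)       # content vs. function word
            self.answer_emb = nn.Embedding(2, 16)        # inside answer span?
            self.clue_emb = nn.Embedding(2, 16)          # inside clue span?
            self.bigru = nn.GRU(word_dim + 5 * 16, hidden_dim,
                                batch_first=True, bidirectional=True)

        def forward(self, words, ner, pos, is_content, in_answer, in_clue):
            # Each argument: LongTensor of shape [batch, |p|].
            w = torch.cat([self.word_emb(words), self.ner_emb(ner),
                           self.pos_emb(pos), self.content_emb(is_content),
                           self.answer_emb(in_answer), self.clue_emb(in_clue)],
                          dim=-1)
            h, _ = self.bigru(w)   # h: [batch, |p|, 2 * hidden_dim]
            return h               # h_i = [forward h_i ; backward h_i]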

[Figure: the GPT-2 based model, which feeds word, positional, and segment embeddings of the passage, clue, and answer into the GPT-2 language model.]

Decoder. Our decoder is another GRU with attention and copy mechanisms. Denote the embedding vector of the desired question style s as h_s. We initialize the hidden state of our decoder GRU by feeding the last backward encoder hidden state ←h_1 into a linear layer and concatenating the result with h_s:

    s_l = tanh(W_0 ←h_1 + b),    (6)
    s_0 = [h_s ; s_l].           (7)
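Continuing the sketch above, the decoder initialization of Eqs. (6)-(7) can be written as follows; the dimensions are again illustrative.

    import torch
    import torch.nn as nn

    class DecoderInitializer(nn.Module):
        """Initial decoder state per Eqs. (6)-(7):
        s_l = tanh(W_0 * h1_backward + b), s_0 = [h_s ; s_l]."""
        def __init__(self, enc_hidden=256, style_dim=16):
            super().__init__()
            self.w0 = nn.Linear(enc_hidden, enc_hidden)  # W_0 and bias b

        def forward(self, h1_backward, style_emb):
            # h1_backward: last backward encoder state; style_emb: h_s
            s_l = torch.tanh(self.w0(h1_backward))
            return torch.cat([style_emb, s_l], dim=-1)   # s_0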