arXiv:2010.01611v2 [cs.CL] 19 Oct 2020
When in Doubt, Ask: Generating Answerable and Unanswerable Questions, Unsupervised

Liubov Nikolenko
Stanford University / Stanford, CA
[email protected]

Pouya Rezazadeh Kalehbasti
Stanford University / Stanford, CA
[email protected]

Abstract

Question Answering (QA) is key to robust communication between humans and machines. Modern language models used for QA have surpassed human performance in several essential tasks; however, these models require large amounts of human-generated training data, which are costly and time-consuming to create. This paper studies augmenting human-made datasets with synthetic data as a way of surmounting this problem. A state-of-the-art model based on deep transformers is used to inspect the impact of using synthetic answerable and unanswerable questions to complement a well-known human-made dataset. The results indicate a tangible improvement in the performance of the language model (measured in terms of F1 and EM scores) trained on the mixed dataset. Specifically, unanswerable question-answers prove more effective in boosting the model: the F1 score gains from adding the answerable, unanswerable, and combined question-answers to the original dataset were 1.3%, 5.0%, and 6.7%, respectively. [Link to the GitHub repository: https://github.com/lnikolenko/EQA]

1 Introduction

1.1 Problem Statement

Question Answering (QA) is essential to enabling effective communication between humans and machines. As a computer science discipline, it falls under information retrieval and natural language processing (NLP), and it concerns building machines able to answer questions asked by humans in natural languages (Cimiano et al., 2014). Recent years have seen significant progress in Question Answering owing to novel comprehensive public datasets, e.g. SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), HOTPOTQA (Yang et al., 2018), and modern deep-learning models, most notably BERT (Devlin et al., 2018). As an instance of these advancements in Extractive Question Answering (EQA), state-of-the-art models trained on the SQuAD dataset have surpassed human performance (Devlin et al., 2018), while the BERT model performs on par with humans on the updated version of the dataset, SQuAD 2.0 (Devlin et al., 2018).

Yet, the mentioned achievements come at a price: massive human-made training datasets. Generating this massive training data is typically crowd-sourced, and the process takes considerable time and resources (Lewis et al., 2019b). Further, training models on these large datasets is time-consuming and computationally expensive. The problem is exacerbated when models are trained on languages other than English, which is well-researched and has an abundance of training data available for different tasks. To tackle these challenges, some researchers have tried to create more effective models which can perform better than current models on existing datasets, while others have developed more complex datasets or devised methods of synthesizing training data to get better results from current EQA models. The following section briefly reviews some of these efforts during the past few years.

1.2 Prior Research

Researchers have adopted three major approaches for creating more effective QA systems: (1) create more effective models to better leverage existing datasets, (2) generate more actual training/test data using crowd-sourcing, or (3) generate synthetic training/test data to improve the performance of existing models. These approaches are briefly explored below.

1.2.1 Model Development

Qi et al. (Qi et al., 2019) focus on the task of QA across multiple documents, which requires multi-hop reasoning. Their main hypothesis is that the current QA models are too expensive to scale up efficiently to open-domain QA queries, so they create a new QA model called GOLDEN (Gold Entity) Retriever, able to perform iterative reasoning and retrieval for open-domain multi-hop question answering. They train and test the proposed model on the HOTPOTQA multi-hop dataset. One highlight of Qi et al.'s model is that they avoid computationally demanding neural models, such as BERT, and instead use off-the-shelf information retrieval systems to look for missing entities. They show that the proposed QA model outperforms several state-of-the-art QA models on the HOTPOTQA test set.

In another work, Wang et al. (Wang et al., 2018) develop an open-domain QA system, called R³, with two innovative features in its question-answering pipeline: a ranker to rank the retrieved passages (based on the likelihood of retrieving the ground-truth answer to a query), and a reader to extract answers from the ranked passages using Reinforcement Learning (RL). Modern deep learning models for open-domain QA use large text corpora as training sets, and use a two-step process to answer questions: (1) Information Retrieval (IR) to select the relevant passages, and (2) Reading Comprehension (RC) to select candidate phrases containing the answer (Chen et al., 2017; Dhingra et al., 2016). The model proposed in this paper follows this same structure: the Ranker module acts as the IR while the Reader module acts as the RC. Wang et al. use SGD/backprop to train their Reader and to maximize the probability that the selected span contains the potential answer to the query. They train the Ranker using the REINFORCE (Williams, 1992) RL algorithm with a reward function evaluating the quality of the answers extracted from the passages the Ranker sends to the Reader. They show that this configuration is robust against semantic differences between the question and the passage.

In an older study, Lee et al. (Lee et al., 2016) implemented a recurrent network (called RASOR) on the SQuAD dataset for question answering, which resulted in a model with higher EM and F1 scores compared to the most successful models up to that date [Match-LSTM]. In their analysis, Lee et al. state that a recurrent net enables sharing computation for shared substructures across candidate spans for answering the asked question, and this has resulted in the improved performance of their model compared to the baseline models studied in the paper.

1.2.2 Actual Data Generation

Lewis et al. (Lewis et al., 2019a) took on the challenge of 'cross-lingual EQA' by developing a multi-lingual benchmark dataset, called MLQA. This dataset covered seven languages including English and Vietnamese, with more than 12k instances in English and 5k in each of the other six languages. They also managed to make each instance included in the benchmark parallel across at least four of their chosen languages. Lewis et al. aimed to reduce the overfitting observed in cross-lingual QA models. As their baseline models, Lewis et al. used BERT and XLM models (Lewis et al., 2019a). The dataset they developed only included development and test sets, so for training baseline models, they used the SQuAD v1.1 dataset. Using their test/dev dataset, Lewis et al. finally showed that the transfer results for state-of-the-art models (in terms of EM and F1 score) largely lag behind the training results; hence, more work is required to reduce the variance of high-performance models in EQA.

In a recent paper, Reddy et al. (Reddy et al., 2019) develop a dataset focused on Conversational Question-Answering, called CoQA. They hypothesize that machine QA systems should be able to answer questions asked based on conversations, as humans can do. Their dataset includes 127k question-answer pairs from 8k conversation passages across 7 distinct domains. Reddy et al. show that the state-of-the-art language models (including Augmented DrQA and DrQA+PGNet) are only able to secure an F1 score of 65.4% on the CoQA dataset, falling short of human performance by more than 20 points. The results of their work show a huge potential for further research on conversational question answering, which is key for natural human-machine communication. Previously, Choi et al. (Choi et al., 2018) had conducted a similar study on conversational question answering, and using high-performance language models, they obtained an F1 score 20 points less than that of humans on their proposed dataset, called QuAC.

In another work, Rajpurkar et al. (Rajpurkar et al., 2018) focus on augmenting the existing QA datasets with unanswerable questions. They hypothesize that the existing QA models get trained only on answerable questions and easy-to-recognize unanswerable questions. To make QA models robust against unanswerable questions, they augment the SQuAD dataset with 50k+ unanswerable questions generated through crowd-sourcing. They observe that the strongest existing language models struggle to achieve an F1 score of 66% on their proposed update to the SQuAD dataset (called SQuAD 2.0), while achieving an F1 score of 86% on the initial version of the dataset. Rajpurkar et al. state that this newly developed dataset may spur research in QA on stronger models which are robust against unanswerable questions.

1.2.3 Synthetic Data Generation

Lewis et al.

on question-answering task were not explored as thoroughly.

In a similar work, Zhu et al. (Zhu et al., 2019) propose a model to automatically generate unanswerable questions based on paragraph-answerable-question pairs for the task of machine reading comprehension. They use this model to augment the SQuAD 2.0 dataset and achieve improved F1 scores, compared to the non-augmented dataset, using two state-of-the-art QA models. To create the model for generating unanswerable questions, Zhu et al. adopt a pair-to-sequence architecture, which they show outperforms models with a typical sequence-to-sequence question-generating architecture.

In an earlier work from 2017, Duan et al. (Duan et al., 2017) propose a question-generator