Answer Span Correction in Machine Reading Comprehension

Revanth Gangi Reddy,∗ Md Arafat Sultan, Efsun Sarioglu Kayi, Rong Zhang, Vittorio Castelli,† Avirup Sil
IBM Research AI
[email protected], [email protected], {avi,zhangr,vittorio}@us.ibm.com, [email protected]

∗ Work done during AI Residency at IBM Research.
† Corresponding author.

Abstract

Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the “answerability” of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.

Q: what type of goes in italian
GT: usually , , , , etc.
Prediction: acini di pepe

Q: when does precipitate form in a chemical reaction
GT: When the reaction occurs in a liquid solution
Prediction: Precipitation is the creation of a solid from a solution. When the reaction occurs in a liquid solution

Q: what is most likely cause of algal blooms
GT: an excess of nutrients, particularly some phosphates
Prediction: Freshwater algal blooms are the result of an excess of nutrients

Figure 1: Examples of partially correct MRC predictions and corresponding ground truth (GT) answers. The reader fails to find a minimal yet sufficient answer in all three cases.

1 Introduction

Extractive machine reading comprehension (MRC) has seen unprecedented progress in recent years (Pan et al., 2019; Liu et al., 2020; Khashabi et al., 2020; Lewis et al., 2019). Nevertheless, existing MRC systems—readers, henceforth—extract only partially correct answers in many cases. At the time of this writing, for example, the top systems on leaderboards like SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018) and Quoref (Dasigi et al., 2019) all have a difference of 5–13 points between their exact match (EM) and F1 scores, which are measures of full and partial overlap with the ground truth answer(s), respectively. Figure 1 shows three examples of such errors that we observed in a state-of-the-art (SOTA) RoBERTa-large (Liu et al., 2019) model on the recently released Natural Questions (NQ) (Kwiatkowski et al., 2019) dataset. In this paper, we investigate the nature of such partial match errors in MRC and also their post hoc correction in context.

Recent work on answer validation (Peñas et al., 2007) has focused on improving the prediction of the answerability of a question given an already extracted answer. Hu et al. (2019) look for support of the extracted answer in local entailments between the answer sentence and the question. Back et al. (2020) propose an attention-based model that explicitly checks if the candidate answer satisfies all the conditions in the question. Zhang et al. (2020) use a two-stage reading process: a sketchy reader produces a preliminary judgment on answerability and an intensive reader extracts candidate answer spans to verify the answerability.

Here we address the related problem of improving the answer span, and present a correction model that re-examines the extracted answer in context to suggest corrections. Specifically, we mark the extracted answer with special delimiter tokens and show that a corrector with an architecture similar to that of the original reader can be trained to produce a new, accurate prediction.

Our main contributions are as follows: (1) We analyze partially correct predictions of a SOTA English reader model, revealing a distribution over three broad categories of errors. (2) We show that an Answer Corrector model can be trained to correct errors in all three categories given the question and the original prediction in context. (3) We further show that our approach generalizes to other languages: our proposed answer corrector yields statistically significant improvements over strong RoBERTa and Multilingual BERT (mBERT) (Devlin et al., 2019) baselines on both monolingual and multilingual benchmarks.
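To make the distinction between the two metrics concrete, the following is a simplified, illustrative sketch of SQuAD-style scoring (the official evaluation scripts additionally normalize answers, e.g. by removing articles and punctuation, and take the maximum over multiple GT annotations):

from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def token_f1(prediction: str, ground_truth: str) -> float:
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# A partially correct span such as "an excess of nutrients" against the GT
# "an excess of nutrients, particularly some phosphates" scores EM = 0 but
# F1 > 0 (about 0.55 under this simple whitespace tokenization).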

2 Partial Match in MRC

Short-answer extractive MRC only extracts short sub-sentence answer spans, but locating the best span can still be hard. For example, the answer may contain complex substructures including multi-item lists or question-specific qualifications and contextualizations of the main answer entity. This section analyzes the distribution of broad categories of errors that neural readers make when they fail to pinpoint the exact ground truth span (GT) despite making a partially correct prediction.

To investigate, we evaluate a RoBERTa-large reader (details in Section 3) on the NQ dev set and identify 587 examples where the predicted span has only a partial match (EM = 0, F1 > 0) with the GT. Since most existing MRC readers are trained to produce single spans, we discard examples where the NQ annotators provided multi-span answers consisting of multiple non-contiguous subsequences of the context. After discarding such multi-span GT examples, we retain 67% of the 587 originally identified samples.

There are three broad categories of partial match errors:

1. Prediction ⊂ GT: As the top example in Figure 1 shows, in these cases, the reader only extracts part of the GT and drops words/phrases such as items in a comma-separated list and qualifications or syntactic completions of the main answer entity.

2. GT ⊂ Prediction: Exemplified by the second example in Figure 1, this category comprises cases where the model's prediction subsumes the closest GT, and is therefore not minimal. In many cases, these predictions lack syntactic structure and semantic coherence as a textual unit.

3. Prediction ∩ GT ≠ ∅: This final category consists of cases similar to the last example of Figure 1, where the prediction partially overlaps with the GT. (We slightly abuse the set notation for conciseness.) Such predictions generally exhibit both verbosity and inadequacy.

Table 1 shows the distribution of errors over all categories.

Error                   %
Single-Span GT          67
  Prediction ⊂ GT       33
  GT ⊂ Prediction       28
  Prediction ∩ GT ≠ ∅    6
Multi-Span GT           33

Table 1: Types of errors in NQ dev predictions with a partial match with the ground truth.

3 Method

In this section, we describe our approach to correcting partial-match predictions of the reader.

3.1 The Reader

We train a baseline reader for the standard MRC task of answer extraction from a passage given a question. The reader uses two classification heads on top of a pre-trained transformer-based language model (Liu et al., 2019), pointing to the start and end positions of the answer span. The entire network is then fine-tuned on the target MRC training data. For additional details on a transformer-based reader, see (Devlin et al., 2019).
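For illustration, a reader of this kind can be sketched with the Hugging Face Transformers API as follows (a minimal sketch rather than our exact implementation; API details vary slightly across library versions, and the span head is randomly initialized until fine-tuned on the target MRC data):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
reader = AutoModelForQuestionAnswering.from_pretrained("roberta-large")
# Two classification heads (start and end logits) sit on top of the
# pre-trained language model; they are fine-tuned on the MRC training data.

def extract_answer(question: str, context: str) -> str:
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = reader(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    # No constraint that end >= start is enforced in this sketch.
    return tokenizer.decode(inputs["input_ids"][0][start:end + 1],
                            skip_special_tokens=True)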
3.2 The Corrector

Our correction model uses an architecture that is similar to the reader's, but takes a slightly different input. As shown in Figure 2, the input to the corrector contains special delimiter tokens marking the boundaries of the reader's prediction, while the rest is the same as the reader's input. Ideally, we want the model to keep answers that already match the GT intact and correct the rest.

Figure 2: Flow of an MRC instance through the reader-corrector pipeline. The corrector takes an input, with special delimiter tokens ([Td]) marking the reader's predicted answer in context, and makes a new prediction.

To generate training data for the corrector, we need a reader's predictions for the training set. To obtain these, we split the training set into five folds, train a reader on four of the folds and get predictions on the remaining fold. We repeat this process five times to produce predictions for all (question, answer) pairs in the training set. The training examples for the corrector are generated using these reader predictions and the original GT annotations. To create examples that do not require correction, we create a new example from each original example where we delimit the GT answer itself in the input, indicating no need for correction. For examples that need correction, we use the reader's top k incorrect predictions (k is a hyperparameter) to create an example for each, where the input is the reader's predicted span and the target is the GT. The presence of both GT (correct) and incorrect predictions in the input data ensures that the corrector learns both to detect errors in the reader's predictions and to correct them.
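A simplified sketch of this input construction and training-example generation is given below; only the use of delimiter tokens ([Td], as in Figure 2) is fixed by the description above, so the exact string handling and data layout here are illustrative assumptions:

TD = "[Td]"  # delimiter token, as shown in Figure 2

def mark_span(context: str, start: int, end: int) -> str:
    # Wrap the character span context[start:end] with delimiter tokens.
    return f"{context[:start]}{TD} {context[start:end]} {TD}{context[end:]}"

def corrector_examples(question, context, gt_span, reader_spans, k=2):
    # One "no correction needed" example with the GT span delimited, plus one
    # "needs correction" example for each of the reader's top-k incorrect
    # predictions; the target is always the GT span.
    examples = [(question, mark_span(context, *gt_span), gt_span)]
    incorrect = [s for s in reader_spans if s != gt_span][:k]
    for span in incorrect:
        examples.append((question, mark_span(context, *span), gt_span))
    return examples

In practice the delimiter would also be registered with the tokenizer as a special token so that it is not split into subwords.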
4 Experiments

4.1 Datasets

We evaluate our answer correction model on two benchmark datasets.

Natural Questions (NQ) (Kwiatkowski et al., 2019) is an English MRC benchmark which contains questions from Google users, and requires systems to read and comprehend entire Wikipedia articles. We evaluate our system only on the answerable questions in the dev and test sets. NQ contains 307,373 instances in the train set, 3,456 answerable questions in the dev set and 7,842 total questions in the blind test set, of which an undisclosed number is answerable. To compute exact match on answerable test set questions, we submitted a system that always outputs an answer and took the recall value from the leaderboard.[1]

[1] ai.google.com/research/NaturalQuestions/leaderboard

MLQA (Lewis et al., 2019) is a multilingual extractive MRC dataset with monolingual and cross-lingual instances in seven languages: English (en), Arabic (ar), German (de), Spanish (es), Hindi (hi), Vietnamese (vi) and Simplified Chinese (zh). It has 15,747 answerable questions in the dev set and a much larger test set with 158,083 answerable questions.

4.2 Setup

Our NQ and MLQA readers fine-tune a RoBERTa-large and an mBERT (cased, 104 languages) language model, respectively. Following Alberti et al. (2019), we fine-tune the RoBERTa model first on SQuAD 2.0 (Rajpurkar et al., 2018) and then on NQ. Our experiments showed that training on both answerable and unanswerable questions yields a stronger and more robust reader for NQ, even though we evaluate only on answerable questions. For MLQA, we follow Lewis et al. (2019) to train on SQuAD 1.1 (Rajpurkar et al., 2016), as MLQA does not contain any training data. We obtain similar baseline results as reported in (Lewis et al., 2019). All our implementations are based on the Transformers library by Wolf et al. (2019).

For each dataset, the answer corrector uses the same underlying transformer language model as the corresponding reader. While creating training data for the corrector, to generate examples that need correction, we take the two (k = 2) highest-scoring incorrect reader predictions (the value of k was tuned on dev). Since our goal is to fully correct any inaccuracies in the reader's prediction, we use exact match (EM) as our evaluation metric. We train the corrector model for one epoch with a batch size of 32, a warmup rate of 0.1 and a maximum query length of 30. For NQ, we use a learning rate of 2e-5 and a maximum sequence length of 512; the corresponding values for MLQA are 3e-5 and 384, respectively.
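For reference, the corrector fine-tuning settings above can be summarized as follows (a sketch; the dictionary keys and model identifiers are illustrative, and optimizer details beyond the stated values are not specified here):

COMMON = dict(num_train_epochs=1, train_batch_size=32,
              warmup_proportion=0.1, max_query_length=30, top_k_incorrect=2)

CORRECTOR_CONFIG = {
    "nq":   dict(base_model="roberta-large",
                 learning_rate=2e-5, max_seq_length=512, **COMMON),
    "mlqa": dict(base_model="bert-base-multilingual-cased",
                 learning_rate=3e-5, max_seq_length=384, **COMMON),
}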

4.3 Results

We report results obtained by averaging over three seeds. Table 2 shows the results on the answerable questions of NQ. Our answer corrector improves upon the reader by 1.6 points on the dev set and 1.3 points on the blind test set.

Model             Dev    Test
RoBERTa Reader    61.2   62.4
+ Corrector       62.8   63.7

Table 2: Exact Match results on Natural Questions.

Results on MLQA are shown in Table 3. We compare performances in two settings: one with the paragraph in English and the question in any of the seven languages (En-Context), and the other being the Generalized Cross-Lingual task (G-XLT) proposed in (Lewis et al., 2019), where the final performance is the average over all 49 (question, paragraph) language pairs involving the seven languages.

                 En-Context      G-XLT
Model            Dev    Test     Dev    Test
mBERT Reader     47.5   45.6     35.0   34.7
+ Corrector      48.3   46.4     35.5   35.3

Table 3: Exact match results on MLQA. En-Context refers to examples with an English paragraph, G-XLT refers to the generalized cross-lingual transfer task.

Table 4 shows the differences in exact match scores for all 49 MLQA language pair combinations, from using the answer corrector over the reader. On average, the corrector gives performance gains for paragraphs in all languages (last row). The highest gains are observed in English contexts, which is expected as the model was trained to correct English answers in context. However, we find that the approach generalizes well to the other languages in a zero-shot setting as exact match improves in 40 of the 49 language pairs.

q\c    en     es     hi     vi     de     ar     zh
en     0.2↑   ↓0.1   ↓0.2   ↓0.4   ↓0.1   ↓0.1   ↓0.3
es     0.9↑   ↓0.2   ↓0.1   0.2↑   0.8↑   0.5↑   1.4↑
hi     0.8↑   0.8↑   0.8↑   0.8↑   0.6↑   0.4↑   0.2↑
vi     0.9↑   1.7↑   0.7↑   0.3↑   1.3↑   0.9↑   0.5↑
de     1.7↑   0.6↑   ↓0.1   0.6↑   0.1↑   1.3↑   0.9↑
ar     0.5↑   1.0↑   0.4↑   0.7↑   0.9↑   0.5↑   0.4↑
zh     0.9↑   0.1↑   0.9↑   0.8↑   1.3↑   0.4↑   0.3↑
AVG    0.8↑   0.6↑   0.3↑   0.4↑   0.7↑   0.6↑   0.5↑

Table 4: Changes in exact match with the answer corrector, for all the language pair combinations in the MLQA test set. The final row shows the gain for each paragraph language averaged over questions in different languages.

We performed Fisher randomization tests (Fisher, 1936) on the exact match numbers to verify the statistical significance of our results. For MLQA, we found our reader + corrector pipeline to be significantly better than the baseline reader on the 158k-example test set at p < 0.01. For NQ, the p-value for the dev set results was approximately 0.05.

5 Analysis

5.1 Comparison with Equal Parameters

In our approach, the reader and the corrector have a common architecture, but their parameters are separate and independently learned. To compare with an equally sized baseline, we build an ensemble system for NQ which averages the output logits of two different RoBERTa readers. As Table 5 shows, the corrector on top of a single reader still outperforms this ensemble of readers. These results confirm that the proposed correction objective complements the reader's extraction objective well and is fundamental to our overall performance gain.

Model                  EM
Reader                 61.2
Ensemble of Readers    62.1
Reader + Corrector     62.8

Table 5: Error correction versus model ensembling.

5.2 Changes in Answers

We inspect the changes made by the answer corrector to the reader's predictions on the NQ dev set. Overall, it altered 13% (450 out of 3,456) of the reader predictions. Of all changes, 24% resulted in the correction of an incorrect or a partially correct answer to a GT answer and 10% replaced the original correct answer with a new correct answer (due to multiple GT annotations in NQ). In 57% of the cases, the change did not correct the error. On a closer look, however, we observe that the F1 score went up in more of these cases (30%) compared to when it dropped (15%). Finally, 9% of the changes introduced an error in a correct reader prediction. These statistics are shown in Table 6.

R\R+C       Correct      Incorrect
Correct     45 (10%)     43 (9%)
Incorrect   109 (24%)    253 (57%)

Table 6: Statistics for the correction model altering original reader predictions. The row header refers to predictions from the reader and the column header refers to the final output from the corrector.
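The counts in Table 6 can be tallied as in the following sketch (an illustration of the bookkeeping rather than our analysis code; `em` stands for any exact-match comparison of a prediction against one GT string):

from collections import Counter

def tally_changes(reader_preds, corrector_preds, gold_answers, em):
    # A question may have multiple GT annotations (as in NQ); a prediction
    # counts as correct if it matches any of them.
    counts = Counter()
    for r, c, golds in zip(reader_preds, corrector_preds, gold_answers):
        if r == c:  # only predictions the corrector actually altered are counted
            continue
        r_status = "correct" if any(em(r, g) for g in golds) else "incorrect"
        c_status = "correct" if any(em(c, g) for g in golds) else "incorrect"
        counts[(r_status, c_status)] += 1
    return counts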
Table 7 shows some examples of correction made by the model for each of the three single-span error categories of Table 1. Two examples wherein the corrector introduces an error into a previously correct output from the reader model are shown in Table 8.

Error Class: Prediction ⊂ GT
Question: who won the king of dance season 2
Passage: ... Title Winner : LAAB Crew From Team Sherif , 1st Runner-up : ADS kids From Team Sherif , 2nd Runner-up : Bipin and Princy From Team Jeffery ...
R: LAAB Crew
R+C: LAAB Crew From Team Sherif

Error Class: GT ⊂ Prediction
Question: unsaturated fats are comprised of lipids that contain
Passage: ... An unsaturated fat is a fat or fatty acid in which there is at least one double bond within the fatty acid chain. A fatty acid chain is monounsaturated if it contains one double ...
R: An unsaturated fat is a fat or fatty acid in which there is at least one double bond
R+C: at least one double bond

Error Class: Prediction ∩ GT ≠ ∅
Question: what is most likely cause of algal blooms
Passage: ... colloquially as red tides. Freshwater algal blooms are the result of an excess of nutrients , particularly some phosphates. The excess of nutrients may originate from fertilizers that are applied to land for agricultural or recreational ...
R: Freshwater algal blooms are the result of an excess of nutrients
R+C: an excess of nutrients , particularly some phosphates

Table 7: Some examples for different error classes in the Natural Questions dev set wherein the answer corrector corrects a previously incorrect reader output. Ground truth answer is marked in bold in the passage. R and C refer to reader and corrector, respectively.

Question: where are the cones in the eye located
Passage: ... Cone cells, or cones, are one of three types of photoreceptor cells in the retina of mammalian eyes (e.g. the human eye). They are responsible for color vision and function best in ..
R: in the retina
R+C: retina

Question: who sang the theme song to step by step
Passage: ... Jesse Frederick James Conaway (born 1948), known professionally as Jesse Frederick, is an American film and television composer and singer best known for writing ...
R: Jesse Frederick James Conaway
R+C: Jesse Frederick James Conaway (born 1948), known professionally as Jesse Frederick

Table 8: Examples from the Natural Questions dev set wherein the answer corrector introduces an error in a previously correct reader output. The ground truth answer is marked in bold in each passage. R and C refer to reader and corrector, respectively.

Table 9 shows the percentage of errors corrected in each error class. Corrections were made in all three categories, but more in GT ⊂ Prediction and Prediction ∩ GT ≠ ∅ than in Prediction ⊂ GT, indicating that the corrector learns the concepts of minimality and syntactic structure better than that of adequacy. We note that most existing MRC systems that only output a single contiguous span are not equipped to handle multi-span discontinuous GT.

Error class              Total        Corrected
GT ⊂ Prediction          165 (28%)    62 (38%)
Prediction ⊂ GT          191 (33%)    18 (9%)
Prediction ∩ GT ≠ ∅      37 (6%)      8 (22%)
Multi-span GT            194 (34%)    -

Table 9: Correction statistics for different error categories in 587 partial match (EM = 0, F1 > 0) reader predictions.

6 Conclusion

We describe a novel method for answer span correction in machine reading comprehension. The proposed method operates by marking an original, possibly incorrect, answer prediction in context and then making a new prediction using a corrector model. We show that this method corrects the predictions of a state-of-the-art English-language reader in different error categories. In our experiments, the approach also generalizes well to multilingual and cross-lingual MRC in seven languages. Future work will explore joint answer span correction and validation of the answerability of the question, re-using the original reader's output representations in the correction model, and architectural changes enabling parameter sharing between the reader and the corrector.

Acknowledgments

We thank Tom Kwiatkowski for his help with debugging while submitting to the Google Natural Questions leaderboard. We would also like to thank the multilingual NLP team at IBM Research AI and the anonymous reviewers for their helpful suggestions and feedback.

References

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT Baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.

Seohyun Back, Sai Chetan Chinthakindi, Akhil Kedia, Haejun Lee, and Jaegul Choo. 2020. NeurQuRI: Neural Question Requirement Inspector for Answerability Prediction in Machine Reading Comprehension. In International Conference on Learning Representations.

Pradeep Dasigi, Nelson F Liu, Ana Marasovic, Noah A Smith, and Matt Gardner. 2019. Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5927–5934.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ronald Aylmer Fisher. 1936. Design of Experiments. Br Med J, 1(3923):554–554.

Minghao Hu, Furu Wei, Yuxing Peng, Zhen Xian Huang, Nan Yang, and Ming Zhou. 2019. Read + Verify: Machine Reading Comprehension with Unanswerable Questions. In AAAI.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. arXiv preprint arXiv:2005.00700.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Lin Pan, Rishav Chakravarti, Anthony Ferritto, Michael Glass, Alfio Gliozzo, Salim Roukos, Radu Florian, and Avirup Sil. 2019. Frustratingly Easy Natural Question Answering. arXiv preprint arXiv:1909.05286.

Anselmo Peñas, Álvaro Rodrigo, and Felisa Verdejo. 2007. Overview of the Answer Validation Exercise 2007. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 237–248. Springer.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv, abs/1910.03771.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv preprint arXiv:2001.09694.

Patrick Lewis, Barlas Oguz,˘ Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evalu- ating Cross-lingual Extractive Question Answering. arXiv preprint arXiv:1910.07475.

Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, and Nan Duan. 2020. RikiNet: Reading Wikipedia Pages for Natural Question Answering. arXiv preprint arXiv:2004.14560.