Answer Span Correction in Machine Reading Comprehension

Revanth Gangi Reddy,∗ Md Arafat Sultan, Efsun Sarioglu Kayi, Rong Zhang, Vittorio Castelli,† Avirup Sil
IBM Research AI
[email protected], [email protected], {avi,zhangr,vittorio}@us.ibm.com, [email protected]

∗ Work done during AI Residency at IBM Research.
† Corresponding author.

Abstract

Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the “answerability” of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.

Q: what type of goes in italian
GT: usually , , , , etc.
Prediction: acini di pepe

Q: when does precipitate form in a chemical reaction
GT: When the reaction occurs in a liquid solution
Prediction: Precipitation is the creation of a solid from a solution. When the reaction occurs in a liquid solution

Q: what is most likely cause of algal blooms
GT: an excess of nutrients, particularly some phosphates
Prediction: Freshwater algal blooms are the result of an excess of nutrients

Figure 1: Examples of partially correct MRC predictions and corresponding ground truth (GT) answers. The reader fails to find a minimal yet sufficient answer in all three cases.

1 Introduction

Extractive machine reading comprehension (MRC) has seen unprecedented progress in recent years (Pan et al., 2019; Liu et al., 2020; Khashabi et al., 2020; Lewis et al., 2019). Nevertheless, existing MRC systems—readers, henceforth—extract only partially correct answers in many cases. At the time of this writing, for example, the top systems on leaderboards like SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018) and Quoref (Dasigi et al., 2019) all have a difference of 5–13 points between their exact match (EM) and F1 scores, which are measures of full and partial overlap with the ground truth answer(s), respectively. Figure 1 shows three examples of such errors that we observed in a state-of-the-art (SOTA) RoBERTa-large (Liu et al., 2019) model on the recently released Natural Questions (NQ) (Kwiatkowski et al., 2019) dataset. In this paper, we investigate the nature of such partial match errors in MRC and also their post hoc correction in context.

Recent work on answer validation (Peñas et al., 2007) has focused on improving the prediction of the answerability of a question given an already extracted answer. Hu et al. (2019) look for support of the extracted answer in local entailments between the answer sentence and the question. Back et al. (2020) propose an attention-based model that explicitly checks if the candidate answer satisfies all the conditions in the question. Zhang et al. (2020) use a two-stage reading process: a sketchy reader produces a preliminary judgment on answerability and an intensive reader extracts candidate answer spans to verify the answerability.

Here we address the related problem of improving the answer span, and present a correction model that re-examines the extracted answer in context to suggest corrections. Specifically, we mark the extracted answer with special delimiter tokens and show that a corrector with an architecture similar to that of the original reader can be trained to produce a new, accurate prediction.

Our main contributions are as follows: (1) We analyze partially correct predictions of a SOTA English reader model, revealing a distribution over three broad categories of errors. (2) We show that an Answer Corrector model can be trained to correct errors in all three categories given the question and the original prediction in context. (3) We further show that our approach generalizes to other languages: our proposed answer corrector yields statistically significant improvements over strong RoBERTa and Multilingual BERT (mBERT) (Devlin et al., 2019) baselines on both monolingual and multilingual benchmarks.
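To make the distinction between the two metrics concrete, the following is a simplified, illustrative sketch of SQuAD-style scoring (the official evaluation scripts additionally normalize answers, e.g. by removing articles and punctuation, and take the maximum over multiple GT annotations):

from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def token_f1(prediction: str, ground_truth: str) -> float:
    pred, gold = prediction.lower().split(), ground_truth.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# A partially correct span such as "an excess of nutrients" against the GT
# "an excess of nutrients, particularly some phosphates" scores EM = 0 but
# F1 > 0 (about 0.55 under this simple whitespace tokenization).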

2 Partial Match in MRC

Short-answer extractive MRC only extracts short sub-sentence answer spans, but locating the best span can still be hard. For example, the answer may contain complex substructures including multi-item lists or question-specific qualifications and contextualizations of the main answer entity. This section analyzes the distribution of broad categories of errors that neural readers make when they fail to pinpoint the exact ground truth span (GT) despite making a partially correct prediction.

To investigate, we evaluate a RoBERTa-large reader (details in Section 3) on the NQ dev set and identify 587 examples where the predicted span has only a partial match (EM = 0, F1 > 0) with the GT. Since most existing MRC readers are trained to produce single spans, we discard examples where the NQ annotators provided multi-span answers consisting of multiple non-contiguous subsequences of the context. After discarding such multi-span GT examples, we retain 67% of the 587 originally identified samples.

There are three broad categories of partial match errors:

1. Prediction ⊂ GT: As the top example in Figure 1 shows, in these cases, the reader only extracts part of the GT and drops words/phrases such as items in a comma-separated list and qualifications or syntactic completions of the main answer entity.

2. GT ⊂ Prediction: Exemplified by the second example in Figure 1, this category comprises cases where the model's prediction subsumes the closest GT, and is therefore not minimal. In many cases, these predictions lack syntactic structure and semantic coherence as a textual unit.

3. Prediction ∩ GT ≠ ∅: This final category consists of cases similar to the last example of Figure 1, where the prediction partially overlaps with the GT. (We slightly abuse the set notation for conciseness.) Such predictions generally exhibit both verbosity and inadequacy.

Table 1 shows the distribution of errors over all categories.

Error                   %
Single-Span GT          67
  Prediction ⊂ GT       33
  GT ⊂ Prediction       28
  Prediction ∩ GT ≠ ∅    6
Multi-Span GT           33

Table 1: Types of errors in NQ dev predictions with a partial match with the ground truth.

3 Method

In this section, we describe our approach to correcting partial-match predictions of the reader.

3.1 The Reader

We train a baseline reader for the standard MRC task of answer extraction from a passage given a question. The reader uses two classification heads on top of a pre-trained transformer-based language model (Liu et al., 2019), pointing to the start and end positions of the answer span. The entire network is then fine-tuned on the target MRC training data. For additional details on a transformer-based reader, see (Devlin et al., 2019).
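For illustration, a reader of this kind can be sketched with the Hugging Face Transformers API as follows (a minimal sketch rather than our exact implementation; API details vary slightly across library versions, and the span head is randomly initialized until fine-tuned on the target MRC data):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
reader = AutoModelForQuestionAnswering.from_pretrained("roberta-large")
# Two classification heads (start and end logits) sit on top of the
# pre-trained language model; they are fine-tuned on the MRC training data.

def extract_answer(question: str, context: str) -> str:
    inputs = tokenizer(question, context, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = reader(**inputs)
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    # No constraint that end >= start is enforced in this sketch.
    return tokenizer.decode(inputs["input_ids"][0][start:end + 1],
                            skip_special_tokens=True)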
3.2 The Corrector

Our correction model uses an architecture that is similar to the reader's, but takes a slightly different input. As shown in Figure 2, the input to the corrector contains special delimiter tokens marking the boundaries of the reader's prediction, while the rest is the same as the reader's input. Ideally, we want the model to keep answers that already match the GT intact and correct the rest.

Figure 2: Flow of an MRC instance through the reader-corrector pipeline. The corrector takes an input, with special delimiter tokens ([Td]) marking the reader's predicted answer in context, and makes a new prediction.

To generate training data for the corrector, we need a reader's predictions for the training set. To obtain these, we split the training set into five folds, train a reader on four of the folds and get predictions on the remaining fold. We repeat this process five times to produce predictions for all (question, answer) pairs in the training set. The training examples for the corrector are generated using these reader predictions and the original GT annotations. To create examples that do not require correction, we create a new example from each original example where we delimit the GT answer itself in the input, indicating no need for correction. For examples that need correction, we use the reader's top k incorrect predictions (k is a hyperparameter) to create an example for each, where the input is the reader's predicted span and the target is the GT. The presence of both GT (correct) and incorrect predictions in the input data ensures that the corrector learns both to detect errors in the reader's predictions and to correct them.
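A simplified sketch of this input construction and training-example generation is given below; only the use of delimiter tokens ([Td], as in Figure 2) is fixed by the description above, so the exact string handling and data layout here are illustrative assumptions:

TD = "[Td]"  # delimiter token, as shown in Figure 2

def mark_span(context: str, start: int, end: int) -> str:
    # Wrap the character span context[start:end] with delimiter tokens.
    return f"{context[:start]}{TD} {context[start:end]} {TD}{context[end:]}"

def corrector_examples(question, context, gt_span, reader_spans, k=2):
    # One "no correction needed" example with the GT span delimited, plus one
    # "needs correction" example for each of the reader's top-k incorrect
    # predictions; the target is always the GT span.
    examples = [(question, mark_span(context, *gt_span), gt_span)]
    incorrect = [s for s in reader_spans if s != gt_span][:k]
    for span in incorrect:
        examples.append((question, mark_span(context, *span), gt_span))
    return examples

In practice the delimiter would also be registered with the tokenizer as a special token so that it is not split into subwords.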
4 Experiments

4.1 Datasets

We evaluate our answer correction model on two benchmark datasets.

Natural Questions (NQ) (Kwiatkowski et al., 2019) is an English MRC benchmark which contains questions from Google users, and requires systems to read and comprehend entire Wikipedia articles. We evaluate our system only on the answerable questions in the dev and test sets. NQ contains 307,373 instances in the train set, 3,456 answerable questions in the dev set and 7,842 total questions in the blind test set, of which an undisclosed number is answerable. To compute exact match on answerable test set questions, we submitted a system that always outputs an answer and took the recall value from the leaderboard.[1]

[1] ai.google.com/research/NaturalQuestions/leaderboard

MLQA (Lewis et al., 2019) is a multilingual extractive MRC dataset with monolingual and cross-lingual instances in seven languages: English (en), Arabic (ar), German (de), Spanish (es), Hindi (hi), Vietnamese (vi) and Simplified Chinese (zh). It has 15,747 answerable questions in the dev set and a much larger test set with 158,083 answerable questions.

4.2 Setup

Our NQ and MLQA readers fine-tune a RoBERTa-large and an mBERT (cased, 104 languages) language model, respectively. Following Alberti et al. (2019), we fine-tune the RoBERTa model first on SQuAD 2.0 (Rajpurkar et al., 2018) and then on NQ. Our experiments showed that training on both answerable and unanswerable questions yields a stronger and more robust reader for NQ, even though we evaluate only on answerable questions. For MLQA, we follow Lewis et al. (2019) to train on SQuAD 1.1 (Rajpurkar et al., 2016), as MLQA does not contain any training data. We obtain similar baseline results as reported in (Lewis et al., 2019). All our implementations are based on the Transformers library by Wolf et al. (2019).

For each dataset, the answer corrector uses the same underlying transformer language model as the corresponding reader. While creating training data for the corrector, to generate examples that need correction, we take the two (k = 2) highest-scoring incorrect reader predictions (the value of k was tuned on dev). Since our goal is to fully correct any inaccuracies in the reader's prediction, we use exact match (EM) as our evaluation metric. We train the corrector model for one epoch with a batch size of 32, a warmup rate of 0.1 and a maximum query length of 30. For NQ, we use a learning rate of 2e-5 and a maximum sequence length of 512; the corresponding values for MLQA are 3e-5 and 384, respectively.
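For reference, the corrector fine-tuning settings above can be summarized as follows (a sketch; the dictionary keys and model identifiers are illustrative, and optimizer details beyond the stated values are not specified here):

COMMON = dict(num_train_epochs=1, train_batch_size=32,
              warmup_proportion=0.1, max_query_length=30, top_k_incorrect=2)

CORRECTOR_CONFIG = {
    "nq":   dict(base_model="roberta-large",
                 learning_rate=2e-5, max_seq_length=512, **COMMON),
    "mlqa": dict(base_model="bert-base-multilingual-cased",
                 learning_rate=3e-5, max_seq_length=384, **COMMON),
}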

4.3 Results

We report results obtained by averaging over three seeds. Table 2 shows the results on the answerable questions of NQ. Our answer corrector improves upon the reader by 1.6 points on the dev set and 1.3 points on the blind test set.

Model             Dev    Test
RoBERTa Reader    61.2   62.4
+ Corrector       62.8   63.7

Table 2: Exact Match results on Natural Questions.

Results on MLQA are shown in Table 3. We compare performances in two settings: one with the paragraph in English and the question in any of the seven languages (En-Context), and the other being the Generalized Cross-Lingual task (G-XLT) proposed in (Lewis et al., 2019), where the final performance is the average over all 49 (question, paragraph) language pairs involving the seven languages.

                 En-Context      G-XLT
Model            Dev    Test     Dev    Test
mBERT Reader     47.5   45.6     35.0   34.7
+ Corrector      48.3   46.4     35.5   35.3

Table 3: Exact match results on MLQA. En-Context refers to examples with an English paragraph, G-XLT refers to the generalized cross-lingual transfer task.

Table 4 shows the differences in exact match scores for all 49 MLQA language pair combinations, from using the answer corrector over the reader. On average, the corrector gives performance gains for paragraphs in all languages (last row). The highest gains are observed in English contexts, which is expected as the model was trained to correct English answers in context. However, we find that the approach generalizes well to the other languages in a zero-shot setting as exact match improves in 40 of the 49 language pairs.

q\c    en     es     hi     vi     de     ar     zh
en     0.2↑   ↓0.1   ↓0.2   ↓0.4   ↓0.1   ↓0.1   ↓0.3
es     0.9↑   ↓0.2   ↓0.1   0.2↑   0.8↑   0.5↑   1.4↑
hi     0.8↑   0.8↑   0.8↑   0.8↑   0.6↑   0.4↑   0.2↑
vi     0.9↑   1.7↑   0.7↑   0.3↑   1.3↑   0.9↑   0.5↑
de     1.7↑   0.6↑   ↓0.1   0.6↑   0.1↑   1.3↑   0.9↑
ar     0.5↑   1.0↑   0.4↑   0.7↑   0.9↑   0.5↑   0.4↑
zh     0.9↑   0.1↑   0.9↑   0.8↑   1.3↑   0.4↑   0.3↑
AVG    0.8↑   0.6↑   0.3↑   0.4↑   0.7↑   0.6↑   0.5↑

Table 4: Changes in exact match with the answer corrector, for all the language pair combinations in the MLQA test set. The final row shows the gain for each paragraph language averaged over questions in different languages.

We performed Fisher randomization tests (Fisher, 1936) on the exact match numbers to verify the statistical significance of our results. For MLQA, we found our reader + corrector pipeline to be significantly better than the baseline reader on the 158k-example test set at p < 0.01. For NQ, the p-value for the dev set results was approximately 0.05.

5 Analysis

5.1 Comparison with Equal Parameters

In our approach, the reader and the corrector have a common architecture, but their parameters are separate and independently learned. To compare with an equally sized baseline, we build an ensemble system for NQ which averages the output logits of two different RoBERTa readers. As Table 5 shows, the corrector on top of a single reader still outperforms this ensemble of readers. These results confirm that the proposed correction objective complements the reader's extraction objective well and is fundamental to our overall performance gain.

Model                  EM
Reader                 61.2
Ensemble of Readers    62.1
Reader + Corrector     62.8

Table 5: Error correction versus model ensembling.

5.2 Changes in Answers

We inspect the changes made by the answer corrector to the reader's predictions on the NQ dev set. Overall, it altered 13% (450 out of 3,456) of the reader predictions. Of all changes, 24% resulted in the correction of an incorrect or a partially correct answer to a GT answer and 10% replaced the original correct answer with a new correct answer (due to multiple GT annotations in NQ). In 57% of the cases, the change did not correct the error. On a closer look, however, we observe that the F1 score went up in more of these cases (30%) compared to when it dropped (15%). Finally, 9% of the changes introduced an error in a correct reader prediction. These statistics are shown in Table 6.

R\R+C       Correct      Incorrect
Correct     45 (10%)     43 (9%)
Incorrect   109 (24%)    253 (57%)

Table 6: Statistics for the correction model altering original reader predictions. The row header refers to predictions from the reader and the column header refers to the final output from the corrector.
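The counts in Table 6 can be tallied as in the following sketch (an illustration of the bookkeeping rather than our analysis code; `em` stands for any exact-match comparison of a prediction against one GT string):

from collections import Counter

def tally_changes(reader_preds, corrector_preds, gold_answers, em):
    # A question may have multiple GT annotations (as in NQ); a prediction
    # counts as correct if it matches any of them.
    counts = Counter()
    for r, c, golds in zip(reader_preds, corrector_preds, gold_answers):
        if r == c:  # only predictions the corrector actually altered are counted
            continue
        r_status = "correct" if any(em(r, g) for g in golds) else "incorrect"
        c_status = "correct" if any(em(c, g) for g in golds) else "incorrect"
        counts[(r_status, c_status)] += 1
    return counts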
Table 7 shows some examples of correction made by the model for each of the three single-span error categories of Table 1. Two examples wherein the corrector introduces an error into a previously correct output from the reader model are shown in Table 8.

Error Class: Prediction ⊂ GT
Question: who won the king of dance season 2
Passage: ... Title Winner : LAAB Crew From Team Sherif , 1st Runner-up : ADS kids From Team Sherif , 2nd Runner-up : Bipin and Princy From Team Jeffery ...
R: LAAB Crew
R+C: LAAB Crew From Team Sherif

Error Class: GT ⊂ Prediction
Question: unsaturated fats are comprised of lipids that contain
Passage: ... An unsaturated fat is a fat or fatty acid in which there is at least one double bond within the fatty acid chain. A fatty acid chain is monounsaturated if it contains one double ...
R: An unsaturated fat is a fat or fatty acid in which there is at least one double bond
R+C: at least one double bond

Error Class: Prediction ∩ GT ≠ ∅
Question: what is most likely cause of algal blooms
Passage: ... colloquially as red tides. Freshwater algal blooms are the result of an excess of nutrients , particularly some phosphates. The excess of nutrients may originate from fertilizers that are applied to land for agricultural or recreational ...
R: Freshwater algal blooms are the result of an excess of nutrients
R+C: an excess of nutrients , particularly some phosphates

Table 7: Some examples for different error classes in the Natural Questions dev set wherein the answer corrector corrects a previously incorrect reader output. Ground truth answer is marked in bold in the passage. R and C refer to reader and corrector, respectively.

Question: where are the cones in the eye located
Passage: ... Cone cells, or cones, are one of three types of photoreceptor cells in the retina of mammalian eyes (e.g. the human eye). They are responsible for color vision and function best in ..
R: in the retina
R+C: retina

Question: who sang the theme song to step by step
Passage: ... Jesse Frederick James Conaway (born 1948), known professionally as Jesse Frederick, is an American film and television composer and singer best known for writing ...
R: Jesse Frederick James Conaway
R+C: Jesse Frederick James Conaway (born 1948), known professionally as Jesse Frederick

Table 8: Examples from the Natural Questions dev set wherein the answer corrector introduces an error in a previously correct reader output. The ground truth answer is marked in bold in each passage. R and C refer to reader and corrector, respectively.

Table 9 shows the percentage of errors corrected in each error class. Corrections were made in all three categories, but more in GT ⊂ Prediction and Prediction ∩ GT ≠ ∅ than in Prediction ⊂ GT, indicating that the corrector learns the concepts of minimality and syntactic structure better than that of adequacy. We note that most existing MRC systems that only output a single contiguous span are not equipped to handle multi-span discontinuous GT.

Error class              Total        Corrected
GT ⊂ Prediction          165 (28%)    62 (38%)
Prediction ⊂ GT          191 (33%)    18 (9%)
Prediction ∩ GT ≠ ∅      37 (6%)      8 (22%)
Multi-span GT            194 (34%)    -

Table 9: Correction statistics for different error categories in 587 partial match (EM = 0, F1 > 0) reader predictions.

6 Conclusion

We describe a novel method for answer span correction in machine reading comprehension. The proposed method operates by marking an original, possibly incorrect, answer prediction in context and then making a new prediction using a corrector model. We show that this method corrects the predictions of a state-of-the-art English-language reader in different error categories. In our experiments, the approach also generalizes well to multilingual and cross-lingual MRC in seven languages. Future work will explore joint answer span correction and validation of the answerability of the question, re-using the original reader's output representations in the correction model, and architectural changes enabling parameter sharing between the reader and the corrector.

Acknowledgments

We thank Tom Kwiatkowski for his help with debugging while submitting to the Google Natural Questions leaderboard. We would also like to thank the multilingual NLP team at IBM Research AI and the anonymous reviewers for their helpful suggestions and feedback.

References

Chris Alberti, Kenton Lee, and Michael Collins. 2019. A BERT Baseline for the Natural Questions. arXiv preprint arXiv:1901.08634.

Seohyun Back, Sai Chetan Chinthakindi, Akhil Kedia, Haejun Lee, and Jaegul Choo. 2020. NeurQuRI: Neural Question Requirement Inspector for Answerability Prediction in Machine Reading Comprehension. In International Conference on Learning Representations.

Pradeep Dasigi, Nelson F Liu, Ana Marasovic, Noah A Smith, and Matt Gardner. 2019. Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5927–5934.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ronald Aylmer Fisher. 1936. Design of Experiments. Br Med J, 1(3923):554–554.

Minghao Hu, Furu Wei, Yuxing Peng, Zhen Xian Huang, Nan Yang, and Ming Zhou. 2019. Read + Verify: Machine Reading Comprehension with Unanswerable Questions. In AAAI.

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. 2020. UnifiedQA: Crossing Format Boundaries With a Single QA System. arXiv preprint arXiv:2005.00700.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics, 7:453–466.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.

Lin Pan, Rishav Chakravarti, Anthony Ferritto, Michael Glass, Alfio Gliozzo, Salim Roukos, Radu Florian, and Avirup Sil. 2019. Frustratingly Easy Natural Question Answering. arXiv preprint arXiv:1909.05286.

Anselmo Peñas, Álvaro Rodrigo, and Felisa Verdejo. 2007. Overview of the Answer Validation Exercise 2007. In Workshop of the Cross-Language Evaluation Forum for European Languages, pages 237–248. Springer.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. ArXiv, abs/1910.03771.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Zhuosheng Zhang, Junjie Yang, and Hai Zhao. 2020. Retrospective Reader for Machine Reading Comprehension. arXiv preprint arXiv:2001.09694.

Patrick Lewis, Barlas Oguz,˘ Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. MLQA: Evalu- ating Cross-lingual Extractive Question Answering. arXiv preprint arXiv:1910.07475.

Dayiheng Liu, Yeyun Gong, Jie Fu, Yu Yan, Jiusheng Chen, Daxin Jiang, Jiancheng Lv, and Nan Duan. 2020. RikiNet: Reading Wikipedia Pages for Natural Question Answering. arXiv preprint arXiv:2004.14560.