QUOREF: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

Pradeep Dasigi†, Nelson F. Liu†‡, Ana Marasović†, Noah A. Smith†‡, Matt Gardner†
†Allen Institute for Artificial Intelligence
‡Paul G. Allen School of Computer Science & Engineering, University of Washington
{pradeepd,anam,mattg}@allenai.org, {nfliu,nasmith}@cs.washington.edu

Abstract

Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark: the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.

[Figure 1 paragraph]
Byzantines were avid players of tavli (Byzantine Greek: τάβλη), a game known in English as backgammon, which is still popular in former Byzantine realms, and still known by the name tavli in Greece. Byzantine nobles were devoted to horsemanship, particularly tzykanion, now known as polo. The game came from Sassanid Persia in the early period and a Tzykanisterion (stadium for playing the game) was built by Theodosius II (r. 408–450) inside the Great Palace of Constantinople. Emperor Basil I (r. 867–886) excelled at it; Emperor Alexander (r. 912–913) died from exhaustion while playing, Emperor Alexios I Komnenos (r. 1081–1118) was injured while playing with Tatikios, and John I of Trebizond (r. 1235–1238) died from a fatal injury during a game. Aside from Constantinople and Trebizond, other Byzantine cities also featured tzykanisteria, most notably Sparta, Ephesus, and Athens, an indication of a thriving urban aristocracy.

Q1. What is the Byzantine name of the game that Emperor Basil I excelled at? (it → tzykanion)
Q2. What are the names of the sport that is played in a Tzykanisterion? (the game → tzykanion; polo)
Q3. What cities had tzykanisteria? (cities → Constantinople; Trebizond; Sparta; Ephesus; Athens)

Figure 1: Example paragraph and questions from the dataset. Highlighted text in paragraphs is where the questions with matching highlights are anchored. Next to the questions are the relevant coreferent mentions from the paragraph. They are bolded for the first question, italicized for the second, and underlined for the third in the paragraph.

1 Introduction

Paragraphs and other longer texts typically make multiple references to the same entities. Tracking these references and resolving coreference is essential for full machine comprehension of these texts.

Significant progress has recently been made in reading comprehension research, due to large crowdsourced datasets (Rajpurkar et al., 2016; Bajaj et al., 2016; Joshi et al., 2017; Kwiatkowski et al., 2019, inter alia). However, these datasets focus largely on understanding local predicate-argument structure, with very few questions requiring long-distance entity tracking. Obtaining such questions is hard for two reasons: (1) teaching crowdworkers about coreference is challenging, with even experts disagreeing on its nuances (Pradhan et al., 2007; Versley, 2008; Recasens et al., 2011; Poesio et al., 2018), and (2) even if we can get crowdworkers to target coreference phenomena in their questions, these questions may contain giveaways that let models arrive at the correct answer without performing the desired reasoning (see §3 for examples).

We introduce a new dataset, QUOREF,¹ that contains questions requiring coreferential reasoning (see examples in Figure 1). The questions are derived from paragraphs taken from a diverse set of English Wikipedia articles and are collected using an annotation process (§2) that deals with the aforementioned issues in the following ways:

First, we devise a set of instructions that gets workers to find anaphoric expressions and their referents, asking questions that connect two mentions in a paragraph. These questions mostly revolve around traditional notions of coreference (Figure 1, Q1), but they can also involve referential phenomena that are more nebulous (Figure 1, Q3). Second, inspired by Dua et al. (2019), we disallow questions that can be answered by an adversary model (uncased base BERT, Devlin et al., 2019, trained on SQuAD 1.1, Rajpurkar et al., 2016) running in the background as the workers write questions. This adversary is not particularly skilled at answering questions requiring coreference, but can follow obvious lexical cues; it thus helps workers avoid writing questions that shortcut coreferential reasoning.

QUOREF contains more than 24K questions whose answers are spans or sets of spans in 4.7K paragraphs from English Wikipedia that can be arrived at by resolving coreference in those paragraphs. We manually analyze a sample of the dataset (§3) and find that 78% of the questions cannot be answered without resolving coreference. We also show (§4) that the best system performance is 70.5% F1, while the estimated human performance is 93.4%. These findings indicate that this dataset is an appropriate benchmark for coreference-aware reading comprehension.

¹ Links to dataset, code, models, and leaderboard are available at https://allennlp.org/quoref.
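The F1 numbers reported above are token-overlap scores between predicted and gold answer strings, in the style of SQuAD evaluation. The snippet below is a minimal sketch of that metric for single-span answers; QUOREF's official evaluation additionally aligns sets of predicted and gold spans for multi-span questions, which this sketch does not attempt, and the normalization shown (lowercasing, stripping punctuation and English articles) is a simplification.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall (SQuAD-style F1)."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Example using Figure 1, Q1: the gold answer is "tzykanion".
print(token_f1("tzykanion", "tzykanion"))      # 1.0
print(token_f1("the tzykanion", "tzykanion"))  # 1.0 (article removed by normalization)
print(token_f1("polo", "tzykanion"))           # 0.0 (no token overlap)
```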
2 Dataset Construction

Collecting paragraphs. We scraped paragraphs from Wikipedia pages about English movies, art and architecture, geography, history, and music. For movies, we followed the list of English-language films,² and extracted plot summaries that are at least 40 tokens long, and for the remaining categories, we followed the lists of featured articles.³ Since movie plot summaries usually mention many characters, it was easier to find hard QUOREF questions for them, and we sampled about 40% of the paragraphs from this category.

² https://en.wikipedia.org/wiki/Category:English-language_films
³ https://en.wikipedia.org/wiki/Wikipedia:Featured_articles

Crowdsourcing setup. We crowdsourced questions about these paragraphs on Mechanical Turk. We asked workers to find two or more co-referring spans in the paragraph, and to write questions such that answering them would require the knowledge that those spans are coreferential. We did not ask them to explicitly mark the co-referring spans. Workers were asked to write questions for a random sample of paragraphs from our pool, and we showed them examples of good and bad questions in the instructions (see Appendix A). For each question, the workers were also required to select one or more spans in the corresponding paragraph as the answer, and these spans are not required to be the same as the coreferential spans that triggered the questions.⁴ We used an uncased base BERT QA model (Devlin et al., 2019) trained on SQuAD 1.1 (Rajpurkar et al., 2016) as an adversary running in the background that attempted to answer the questions written by workers in real time, and the workers were able to submit their questions only if their answer did not match the adversary's prediction.⁵ Appendix A further details the logistics of the crowdsourcing tasks. Some basic statistics of the resulting dataset can be seen in Table 1.

⁴ For example, the last question in Table 2 is about the coreference of {she, Fania, his mother}, but none of these mentions is the answer.
⁵ Among models with acceptably low latency, we qualitatively found uncased base BERT to be the most effective.

                              Train     Dev.     Test
Number of questions           19399     2418     2537
Number of paragraphs           3771      454      477
Avg. paragraph len (tokens)  384±105  381±101  385±103
Avg. question len (tokens)      17±6     17±6     17±6
Paragraph vocabulary size     57648    18226    18885
Question vocabulary size      19803     5579     5624
% of multi-span answers        10.2      9.1      9.7

Table 1: Key statistics of QUOREF splits.
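The submit-time check in the crowdsourcing setup described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the adversary is modeled as a generic `adversary(question, paragraph) -> answer` callable (in the actual setup, an uncased base BERT model trained on SQuAD 1.1), and the string normalization and matching rules here are assumptions.

```python
from typing import Callable, List


def _normalize(text: str) -> str:
    """Minimal answer normalization for comparison: lowercase, collapse spaces."""
    return " ".join(text.lower().split())


def accept_question(
    paragraph: str,
    question: str,
    worker_answers: List[str],
    adversary: Callable[[str, str], str],
) -> bool:
    """Accept a crowdsourced question only if the background adversary fails to
    produce any of the worker's answer spans, i.e. the question cannot be
    answered from obvious surface cues alone."""
    model_answer = _normalize(adversary(question, paragraph))
    return all(_normalize(ans) != model_answer for ans in worker_answers)


if __name__ == "__main__":
    # Toy stand-in adversary: predicts the last word of the first sentence.
    toy_adversary = lambda q, p: p.split(".")[0].split()[-1]
    paragraph = "Byzantine nobles were devoted to tzykanion. Basil I excelled at it."
    ok = accept_question(paragraph, "What sport did Basil I excel at?",
                         ["tzykanion"], toy_adversary)
    print(ok)  # False: the adversary recovered the answer, so the worker must revise.
```

A filter like this trades adversary strength against latency; as noted in the footnote above, among models that were fast enough to run in the loop, uncased base BERT was qualitatively the most effective.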
3 Semantic Phenomena in QUOREF

To better understand the phenomena present in QUOREF, we manually analyzed a random sample of 100 paragraph-question pairs. The following are some empirical observations.

Requirement of coreference resolution. We found that 78% of the manually analyzed questions cannot be answered without coreference resolution. The remaining 22% involve some form of coreference, but do not require it to be resolved for answering them. Examples include a paragraph that mentions only one city, "Bristol", and a sentence that says "the city was bombed". The associated question, Which city was bombed?, does not really require coreference resolution from a model.

Pronominal resolution (69%)
  Paragraph snippet: Anna and Declan eventually make their way on foot to a roadside pub, where they discover the three van thieves going through Anna's luggage. Declan fights them, displaying unexpected strength for a man of his size, and retrieves Anna's bag.
  Question: Who does Declan get into a fight with?
  Answer: the three van thieves

Nominal resolution (54%)
  Paragraph snippet: Later, Hippolyta was granted a daughter, Princess Diana, ... Diana defies her mother and ...
  Question: What is the name of the person who is defied by her daughter?
  Answer: Hippolyta

Multiple resolutions (32%)
  Paragraph snippet: The now upbeat collective keep the toucan, nicknaming it "Amigo" ... When authorities show up to catch the bird, Pete and Liz spirit him away by hiding him in her dress ...
  Question: What is the name of the character who hides in Liz's dress?
  Answer: Amigo

Commonsense reasoning (10%)
  Paragraph snippet: Amos reflects back on his early childhood ... with his mother Fania and father Arieh. ... One of his mother's friends is killed while hanging up laundry during the war.
  Question: How does Arieh's wife die?
  Answer: kills herself by overdose

Table 2: Example paragraph snippets, questions, and answers for each type of reasoning observed in the analyzed sample.
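To make the span-selection format concrete, the sketch below encodes two of the Figure 1 questions as simple Python records. The field names are illustrative assumptions rather than the dataset's released schema; answers are stored as lists so that single-span and multi-span questions (about 10% of answers, per Table 1) share the same shape.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuorefExample:
    """Illustrative record for one QUOREF-style question (hypothetical schema)."""
    paragraph: str
    question: str
    answers: List[str] = field(default_factory=list)  # one or more answer spans


# Single-span answer (Figure 1, Q1).
q1 = QuorefExample(
    paragraph=("Byzantine nobles were devoted to horsemanship, particularly "
               "tzykanion, now known as polo. ... Emperor Basil I (r. 867-886) "
               "excelled at it ..."),
    question="What is the Byzantine name of the game that Emperor Basil I excelled at?",
    answers=["tzykanion"],
)

# Multi-span answer (Figure 1, Q3).
q3 = QuorefExample(
    paragraph=("... Aside from Constantinople and Trebizond, other Byzantine "
               "cities also featured tzykanisteria, most notably Sparta, "
               "Ephesus, and Athens ..."),
    question="What cities had tzykanisteria?",
    answers=["Constantinople", "Trebizond", "Sparta", "Ephesus", "Athens"],
)

# Every answer is a span of the paragraph text (character offsets omitted here).
assert all(a in ex.paragraph for ex in (q1, q3) for a in ex.answers)
```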