R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason

Naoya Inoue1,2 Pontus Stenetorp2,3 Kentaro Inui1,2
1Tohoku University 2RIKEN 3University College London

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6740–6750, July 5-10, 2020. © 2020 Association for Computational Linguistics

Abstract

Recent studies have revealed that reading comprehension (RC) systems learn to exploit annotation artifacts and other biases in current datasets. This prevents the community from reliably measuring the progress of RC systems. To address this issue, we introduce R4C, a new task for evaluating RC systems' internal reasoning. R4C requires giving not only answers but also derivations: explanations that justify predicted answers. We present a reliable, crowdsourced framework for scalably annotating RC datasets with derivations. We create and publicly release the R4C dataset, the first, quality-assured dataset consisting of 4.6k questions, each of which is annotated with 3 reference derivations (i.e. 13.8k derivations). Experiments show that our automatic evaluation metrics using multiple reference derivations are reliable, and that R4C assesses different skills from an existing benchmark.

[Figure 1 content]
Input
  Question: What was the former band of the member of Mother Love Bone who died just before the release of “Apple”?
  Articles:
    Title: Return to Olympus
      [1] Return to Olympus is the only album by the alternative rock band Malfunkshun.
      [2] It was released after the band had broken up and after lead singer Andrew Wood (later of Mother Love Bone) had died...
      [3] Stone Gossard had compiled…
    Title: Mother Love Bone
      [4] Mother Love Bone was an American rock band that…
      [5] The band was active from…
      [6] Frontman Andrew Wood’s personality and compositions helped to catapult the group to...
      [7] Wood died only days before the scheduled release of the band’s debut album, “Apple”, thus ending the…
Output
  Answer: Malfunkshun
  Explanation
    Supporting facts (SFs): [1], [2], [4], [6], [7]
    R4C: Derivation:
      [Malfunkshun] is [a rock band]
      [Andrew Wood] is lead singer of [Malfunkshun]
      [Malfunkshun] is former of [Mother Love Bone]
      [Andrew Wood] is a member of [Mother Love Bone]
      [Andrew Wood] died just before the release of [Apple]

Figure 1: R4C, a new RC task extending upon the standard RC setting, requiring systems to provide not only an answer, but also a derivation. The example is taken from HotpotQA (Yang et al., 2018), where sentences [1-2, 4, 6-7] are supporting facts, and [3, 5] are not.

1 Introduction

Reading comprehension (RC) has become a key benchmark for natural language understanding (NLU) systems, and a large number of datasets are now available (Welbl et al., 2018; Kočiský et al., 2018; Yang et al., 2018, i.a.). However, it has been established that these datasets suffer from annotation artifacts and other biases, which may allow systems to “cheat”: Instead of learning to read and comprehend texts in their entirety, systems learn to exploit these biases and find answers via simple heuristics, such as looking for an entity with a particular semantic type (Sugawara et al., 2018; Mudrakarta et al., 2018) (e.g. given a question starting with Who, a system finds a person entity found in a document).

To address this issue, the community has introduced increasingly more difficult Question Answering (QA) problems, for example, so that answer-related information is scattered across several articles (Welbl et al., 2018; Yang et al., 2018) (i.e. multi-hop QA). However, recent studies show that such multi-hop QA also has weaknesses (Chen and Durrett, 2019; Min et al., 2019; Jiang et al., 2019), e.g. combining multiple sources of information is not always necessary to find answers. Another direction, which we follow, includes evaluating a system's reasoning (Jansen, 2018; Yang et al., 2018; Thorne and Vlachos, 2018; Camburu et al., 2018; Fan et al., 2019; Rajani et al., 2019). In the context of RC, Yang et al. (2018) propose HotpotQA, which requires systems not only to give an answer but also to identify supporting facts (SFs), defined as sentences containing information that supports the answer (see “Supporting facts” in Fig. 1 for an example).

As shown in SFs [1], [2], and [7], however, only a subset of SFs may contribute to the necessary reasoning. For example, [1] states two facts: (a) Return to Olympus is an album by Malfunkshun; and (b) Malfunkshun is a rock band. Among these, only (b) is related to the necessary reasoning. Thus, achieving a high accuracy in the SF detection task does not fully prove an RC system's reasoning ability.

This paper proposes R4C, a new task of RC that requires systems to provide an answer and derivation¹: a minimal explanation that justifies predicted answers in a semi-structured natural language form (see “Derivation” in Fig. 1 for an example). Our main contributions can be summarized as follows:

• We propose R4C, which enables us to quantitatively evaluate a system's internal reasoning in a finer-grained manner than the SF detection task. We show that R4C assesses different skills from the SF detection task.

• We create and publicly release the first dataset of R4C consisting of 4,588 questions, each of which is annotated with 3 high-quality derivations (i.e. 13,764 derivations), available at https://naoya-i.github.io/r4c/.

• We present and publicly release a reliable, crowdsourced framework for scalably annotating existing RC datasets with derivations in order to facilitate large-scale dataset construction of derivations in the RC community.

¹ R4C is short for “Right for the Right Reasons RC.”

2 Task description

2.1 Task definition

We build R4C on top of the standard RC task. Given a question $q$ and articles $R$, the task is (i) to find the answer $a$ from $R$ and (ii) to generate a derivation $D$ that justifies why $a$ is believed to be the answer to $q$.

There are several design choices for derivations, including whether derivations should be structured, whether the vocabulary should be closed, etc. This leads to a trade-off between the expressivity of reasoning and the interpretability of an evaluation metric. To maintain a reasonable trade-off, we choose to represent derivations in a semi-structured natural language form. Specifically, a derivation is defined as a set of derivation steps. Each derivation step $d_i \in D$ is defined as a relational fact, i.e. $d_i \equiv \langle d_i^h, d_i^r, d_i^t \rangle$, where $d_i^h$ and $d_i^t$ are entities (noun phrases), and $d_i^r$ is a verb phrase representing a relationship between $d_i^t$ and $d_i^h$ (see Fig. 1 for an example), similar to the Open Information Extraction paradigm (Etzioni et al., 2008). Each of $d_i^h$, $d_i^r$, $d_i^t$ may be a phrase not contained in $R$ (e.g. is lead singer of in Fig. 1).

2.2 Evaluation metrics

While the output derivations are semi-structured, the linguistic diversity of entities and relations still prevents automatic evaluation. One typical solution is crowdsourced judgement, but it is costly both in terms of time and budget. We thus resort to a reference-based similarity metric.

Specifically, for output derivation $D$, we assume $n$ sets of golden derivations $G_1, G_2, \ldots, G_n$. For evaluation, we would like to assess how well derivation steps in $D$ can be aligned with those in $G_i$ in the best case. For each golden derivation $G_i$, we calculate $c(D, G_i)$, an alignment score of $D$ with respect to $G_i$, which can be read as a soft version of the number of correct derivation steps in $D$ (i.e. $0 \le c(D, G_i) \le \min(|D|, |G_i|)$). We then find a golden derivation $G^*$ that gives the highest $c(D, G^*)$ and define the precision, recall and f1 as follows:

$$\mathrm{pr}(D) = \frac{c(D, G^*)}{|D|}, \qquad \mathrm{rc}(D) = \frac{c(D, G^*)}{|G^*|}$$

$$f_1(D) = \frac{2 \cdot \mathrm{pr}(D) \cdot \mathrm{rc}(D)}{\mathrm{pr}(D) + \mathrm{rc}(D)}$$

An official evaluation script is available at https://naoya-i.github.io/r4c/.

Alignment score. To calculate $c(D, G_i)$, we would like to find the best alignment between derivation steps in $D$ and those in $G_i$. See Fig. 2 for an example, where two possible alignments $A_1$, $A_2$ are shown. Because the derivation steps in $D$ agree with those in $G_i$ more under $A_2$ than under $A_1$, we would like to use $A_2$ when evaluating. We first define $c(D, G_i, A_j)$, the correctness of $D$ given a specific alignment $A_j$, and then pick the best alignment as follows:

Figure 2: Two possible alignments $A_1$ and $A_2$ between $D$ and $G_i$ with their alignment scores $a(\cdot, \cdot)$. The precision and recall of $D$ are (0.1+1.0+0.8)/3 = 0.633 and (0.1+1.0+0.8)/5 = 0.380, respectively.

Figure 3: Crowdsourcing interface for derivation annotation.
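As a rough illustration of the precision, recall and f1 definitions in Section 2.2, the Python sketch below reproduces the arithmetic given in the Figure 2 caption. It is not the official evaluation script: the names `DerivationStep`, `precision_recall_f1` and `score_with_references` are hypothetical, the alignment scores are taken as given inputs rather than computed, and breaking ties between golden derivations by taking the first is an assumption.

```python
from typing import NamedTuple, Sequence, Tuple


class DerivationStep(NamedTuple):
    """One relational fact <head, relation, tail> (hypothetical container)."""
    head: str      # d^h: entity (noun phrase)
    relation: str  # d^r: verb phrase linking the two entities
    tail: str      # d^t: entity (noun phrase)


def precision_recall_f1(c_best: float, n_output: int,
                        n_golden: int) -> Tuple[float, float, float]:
    """Return (pr, rc, f1) with pr = c/|D| and rc = c/|G*|."""
    if not 0.0 <= c_best <= min(n_output, n_golden):
        raise ValueError("alignment score must lie in [0, min(|D|, |G*|)]")
    pr = c_best / n_output if n_output else 0.0
    rc = c_best / n_golden if n_golden else 0.0
    f1 = 2 * pr * rc / (pr + rc) if pr + rc > 0 else 0.0
    return pr, rc, f1


def score_with_references(scores: Sequence[float], n_output: int,
                          golden_sizes: Sequence[int]) -> Tuple[float, float, float]:
    """Pick the golden derivation G* with the highest c(D, G_i), then score.

    scores[i] is c(D, G_i) and golden_sizes[i] is |G_i|.
    Ties go to the first maximum (an assumption, not stated in the text).
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return precision_recall_f1(scores[best], n_output, golden_sizes[best])


# A derivation step from Figure 1, stored as a triple.
step = DerivationStep("Andrew Wood", "is lead singer of", "Malfunkshun")

# Figure 2 caption: alignment scores 0.1, 1.0 and 0.8, with |D| = 3 and |G_i| = 5.
c_fig2 = 0.1 + 1.0 + 0.8
pr, rc, f1 = precision_recall_f1(c_fig2, n_output=3, n_golden=5)
print(f"{pr:.3f} {rc:.3f} {f1:.3f}")  # prints: 0.633 0.380 0.475
```

The f1 value of 0.475 does not appear in the text; it follows from the stated precision and recall, since $f_1 = 2c / (|D| + |G^*|) = 3.8 / 8$.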