Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence

Tal Schuster, Adam Fisch, Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{tals,fisch,regina}@csail.mit.edu

Abstract

Typical fact verification models use retrieved written evidence to verify claims. Evidence sources, however, often change over time as more information is gathered and revised. In order to adapt, models must be sensitive to subtle differences in supporting evidence. We present VITAMINC, a benchmark infused with challenging cases that require fact verification models to discern and adjust to slight factual changes. We collect over 100,000 Wikipedia revisions that modify an underlying fact, and leverage these revisions, together with additional synthetically constructed ones, to create a total of over 400,000 claim-evidence pairs. Unlike previous resources, the examples in VITAMINC are contrastive, i.e., they contain evidence pairs that are nearly identical in language and content, with the exception that one supports a given claim while the other does not. We show that training using this design increases robustness, improving accuracy by 10% on adversarial fact verification and 6% on adversarial natural language inference (NLI). Moreover, the structure of VITAMINC leads us to define additional tasks for fact-checking resources: tagging relevant words in the evidence for verifying the claim, identifying factual revisions, and providing automatic edits via factually consistent text generation.[1]

[Figure 1: In VITAMINC, we focus on Wikipedia revisions in which the factual content changes. This example revision now supports an initially refuted claim. Article: Beaverton, Oregon (Wikipedia). Original revision (ID 336934876): "its population is estimated to be 86,205, almost 14% more than the 2000 census figure of 76,129." (refutes). Revision as of 04:10, 10 January 2010: "its population is estimated to be 91,757, almost 14% more than the 2000 census figure of 76,129." (supports). Claim: "More than 90K people live in Beaverton."]

[1] The VITAMINC dataset and our models are available at: https://github.com/TalSchuster/VitaminC

1 Introduction

Determining the truthfulness of factual claims by comparing them to textual sources of evidence has received intense research interest in recent years. An underlying, but often overlooked, challenge for this paradigm, however, is the dynamic nature of today's written resources. An extraordinary amount of new information becomes available daily; as a result, many consequential facts are established, changed, or added to over time. We argue that the quality of fact verification systems should be measured by how well they adjust to new evidence. In this way, we seek to advance fact verification by requiring that models remain reliable and robust to the change present in practical settings.

To this end, we focus on fact verification with contrastive evidence. That is, we infuse the standard fact verification paradigm with challenging cases that require models to be sensitive to factual changes in their presented evidence (hereon referred to interchangeably as "context"). We present VITAMINC,[2] a new large-scale fact verification dataset that is based on factual revisions to Wikipedia. The key concept is exemplified in Figure 1: there, a factual revision yields a contrastive pair of contexts that are nearly identical in language and content, except that one context refutes the given claim, while the other supports it.

[2] Etymology of VITAMINC: Contrastive evidence keeps fact verification models robust and healthy, hence "Vitamin C."

This type of contrastive structure exposes existing deficiencies in model behavior. To illustrate this, we train a classifier on the popular FEVER fact verification dataset (Thorne et al., 2018) and evaluate it on contrastive claim-evidence pairs. We find that the model flips its prediction from the original verdict on only 56% of the contrastive cases. When examples from VITAMINC are included during training, however, the model's sensitivity increases, flipping on 86% of contrastive cases.
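To make the flip-rate measure concrete, the following minimal sketch represents one contrastive case (the Figure 1 example) and computes the share of cases in which a classifier changes its verdict when the evidence changes. The record fields and the verify interface are illustrative assumptions, not the released VitaminC format or our model.

```python
from typing import Callable, Dict, List

# One contrastive case: the same claim paired with the pre- and post-revision
# evidence sentences from Figure 1 (field names here are illustrative).
contrastive_cases: List[Dict[str, str]] = [
    {
        "claim": "More than 90K people live in Beaverton.",
        "evidence_before": "its population is estimated to be 86,205, almost 14% "
                           "more than the 2000 census figure of 76,129.",
        "evidence_after": "its population is estimated to be 91,757, almost 14% "
                          "more than the 2000 census figure of 76,129.",
    },
]

def flip_rate(verify: Callable[[str, str], str],
              cases: List[Dict[str, str]]) -> float:
    """Fraction of contrastive cases where the verdict changes with the evidence.

    `verify(claim, evidence)` is any classifier returning a label such as
    "SUPPORTS" or "REFUTES"; a context-sensitive model should flip its verdict
    when the factual content of the evidence flips.
    """
    flips = sum(
        verify(c["claim"], c["evidence_before"]) != verify(c["claim"], c["evidence_after"])
        for c in cases
    )
    return flips / len(cases)

# Example with a trivial stand-in classifier that always predicts "SUPPORTS";
# as expected, it never flips, so the flip rate is 0.0.
print(flip_rate(lambda claim, evidence: "SUPPORTS", contrastive_cases))
```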
Such context-sensitive inference has two main benefits. First, it ensures that the model considers the provided evidence rather than relying on built-in static knowledge, such as that obtained via language model pre-training (Petroni et al., 2019; Roberts et al., 2020). This is particularly important for scenarios in which the source of truth is mutable (e.g., the current US president, or new declarations as in Figure 1). Second, this setting discourages certain biases and idiosyncrasies, such as exploiting differences in how true vs. false claims are posed, that are common in similar crowd-sourced datasets (Poliak et al., 2018; Schuster et al., 2019). Indeed, we show that augmenting both fact verification models and NLI models with VITAMINC data improves their robustness to adversarial inputs.

Furthermore, our emphasis on contrastive contexts allows us to expand the scope of commonly considered tasks. Most of the fact verification literature focuses on resolving claims to be true or false (Popat et al., 2018; Thorne and Vlachos, 2018; Wang, 2017). The surrounding ecosystem, however, includes additional challenges, some of which we explore here: Documents such as Wikipedia articles are updated frequently; which edits represent factual changes? For a given claim and (refuting or supporting) evidence pair, which words or phrases in the evidence are most relevant? If we know that a certain claim is true, can we modify an outdated document to be consistent with it? We show that the unique structure of our VITAMINC dataset can be leveraged to provide both supervised and distantly supervised data for these new questions.

Our key contributions are as follows:

1. We pose a contrastive fact verification paradigm that requires sensitivity to changes in data;
2. We introduce VITAMINC, a new large-scale dataset that supports this paradigm;
3. We demonstrate that training on VITAMINC leads to better performance on standard tasks;
4. We show how VITAMINC opens the door to additional research directions in fact verification.

2 Related Work

Fact Verification. The FEVER dataset (Thorne et al., 2018) fueled the development of many fact-checking models (e.g., see Hanselowski et al., 2018; Nie et al., 2019a,b; Yoneda et al., 2018, inter alia). The claim creation process, however, required crowd-workers to write claims related to Wikipedia articles, and was found to engender biases that allow an evidence-agnostic model to achieve unexpectedly high performance (Schuster et al., 2019). Other recent datasets cover verification against tables (Chen et al., 2020), relational databases (Jo et al., 2019), Wikipedia references (Sathe et al., 2020), multiple articles (Jiang et al., 2020), and search snippets (Augenstein et al., 2019). These resources all assume static ground truths. In contrast, VITAMINC compares objective claims to a dynamic source of truth, and requires models to change their verdicts accordingly.

Annotation Bias. Annotation artifacts are common in many NLP datasets, and affect performance on adversarial and contrastive examples (Gardner et al., 2020; Ribeiro et al., 2020; Ross et al., 2020). Sentence-pair inference tasks such as fact verification (Paul Panenghat et al., 2020; Schuster et al., 2019) and NLI (Gururangan et al., 2018; McCoy et al., 2019; Poliak et al., 2018; Tsuchiya, 2018) are no exception. Alleviating this bias requires either modeling solutions (Karimi Mahabadi et al., 2020; Pratapa et al., 2020; Shah et al., 2020; Thorne and Vlachos, 2020; Utama et al., 2020b), which have limited effectiveness (Utama et al., 2020a), or adversarially removing troublesome training examples (Bras et al., 2020) or manually collecting new ones (Nie et al., 2020; Thorne et al., 2019a), which is model specific. Instead, our dataset design avoids single-sentence artifacts and provides model-agnostic challenging examples that increase the robustness of trained models.

Explainability. Current fact verification datasets provide sentence-level rationales (DeYoung et al., 2020; Petroni et al., 2020) but do not enforce the model's verdict to rely on them, leading to a potential discrepancy. VITAMINC ensures the verdict is conditioned on the retrieved evidence. Moreover, we use the revision history as distant supervision for word-level rationales, allowing for finer-grained explanations (Camburu et al., 2018; Lei et al., 2016; Portelli et al., 2020; Thorne et al., 2019b).
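The word-level supervision mentioned above comes from aligning the two versions of a revised sentence. The sketch below shows one way such distant rationale labels could be derived with a token-level diff; the whitespace tokenization and binary labeling are simplifying assumptions for illustration, not the paper's exact annotation procedure.

```python
import difflib
from typing import List, Tuple

def distant_rationale_labels(before: str, after: str) -> Tuple[List[int], List[int]]:
    """Label tokens that differ between the two revision versions of a sentence.

    Returns one binary label per whitespace token for each version
    (1 = token changed by the revision, a candidate word-level rationale;
     0 = unchanged context).
    """
    before_toks, after_toks = before.split(), after.split()
    before_labels = [0] * len(before_toks)
    after_labels = [0] * len(after_toks)
    matcher = difflib.SequenceMatcher(a=before_toks, b=after_toks)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":  # replaced, inserted, or deleted spans
            for i in range(i1, i2):
                before_labels[i] = 1
            for j in range(j1, j2):
                after_labels[j] = 1
    return before_labels, after_labels

before = "its population is estimated to be 86,205, almost 14% more than the 2000 census figure of 76,129."
after = "its population is estimated to be 91,757, almost 14% more than the 2000 census figure of 76,129."
b_labels, a_labels = distant_rationale_labels(before, after)
# Only the changed population figure is marked as a rationale token.
print([tok for tok, lab in zip(after.split(), a_labels) if lab])  # ['91,757,']
```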
Factually Consistent Generation. Generating texts that match given facts is a known challenge (Fan et al., 2020; Kryscinski et al., 2020; Lewis et al., 2020b; Parikh et al., 2020; Shah et al., 2020; Tian et al., 2020), as language models tend to degenerate and hallucinate (Holtzman et al., 2020; Schuster et al., 2020; Zhou et al., 2020). Moreover, evaluation is non-trivial, and usually manual. VITAMINC includes supervised data for training sequence-to-sequence models, and provides automatic evaluation via the fact verification classifier.

3 The VITAMINC Dataset

VITAMINC (abbreviated VitC) is based on revisions to English Wikipedia. Wikipedia has become a comprehensive online resource that is rigorously maintained by a large and active community (Benjakob and Harrison, 2019). While adversaries do try to insert disinformation, popular pages are usually quickly corrected (Kumar et al., 2016). Furthermore, Wikipedia's policies dictate that its content should be written from a neutral perspective, or should otherwise objectively state all points of view.

Revisions identified as factual can also be recursively recycled for re-training the automated BERT classifier in the future to expand the corpus further (we also introduce this as a task, see §4.1).

3.2 Writing Claims

The factual Wikipedia revisions guide us in creating challenging claims for fact verification. For each revision, annotators were asked to write two symmetric claims related to the same edit (a sketch of the resulting claim-evidence pairs follows the list):

1. The first should be supported by the original sentence and refuted by the revised sentence;
2. The second should be supported by the revised sentence and refuted by the original sentence.
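Each revision therefore yields four labeled claim-evidence pairs: the two symmetric claims, each paired with the sentence version that supports it and the one that refutes it. The sketch below illustrates this construction; the field and label names are illustrative assumptions rather than the released dataset schema, and the first claim is a hypothetical counterpart written for the example.

```python
from typing import Dict, List

def pairs_from_revision(original: str, revised: str,
                        claim_true_before: str, claim_true_after: str) -> List[Dict[str, str]]:
    """Expand one factual revision and its two symmetric claims into four
    labeled claim-evidence pairs (label strings here are illustrative)."""
    return [
        {"claim": claim_true_before, "evidence": original, "label": "SUPPORTS"},
        {"claim": claim_true_before, "evidence": revised,  "label": "REFUTES"},
        {"claim": claim_true_after,  "evidence": original, "label": "REFUTES"},
        {"claim": claim_true_after,  "evidence": revised,  "label": "SUPPORTS"},
    ]

# Example built from the Figure 1 revision. The second claim is the one shown
# in Figure 1; the first is a hypothetical symmetric counterpart.
pairs = pairs_from_revision(
    original="its population is estimated to be 86,205, almost 14% more than the 2000 census figure of 76,129.",
    revised="its population is estimated to be 91,757, almost 14% more than the 2000 census figure of 76,129.",
    claim_true_before="Fewer than 90K people live in Beaverton.",
    claim_true_after="More than 90K people live in Beaverton.",
)
for p in pairs:
    print(p["label"], "|", p["claim"])
```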
