arXiv:1803.05355v3 [cs.CL] 18 Dec 2018
FEVER: a large-scale dataset for Fact Extraction and VERification

James Thorne1, Andreas Vlachos1, Christos Christodoulopoulos2, and Arpit Mittal2
1Department of Computer Science, University of Sheffield
2Amazon Research Cambridge
fj.thorne, [email protected] fchrchrs, [email protected]

Abstract

In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as SUPPORTED, REFUTED or NOTENOUGHINFO by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.

1 Introduction

The ever-increasing amount of textual information available, combined with the ease of sharing it through the web, has increased the demand for verification, also referred to as fact checking. While it has received a lot of attention in the context of journalism, verification is important for other domains, e.g. information in scientific publications, product reviews, etc.

In this paper we focus on verification of textual claims against textual sources. When compared to textual entailment (TE)/natural language inference (Dagan et al., 2009; Bowman et al., 2015), the key difference is that in these tasks the passage to verify each claim is given, and in recent years it typically consists of a single sentence, while in verification systems it is retrieved from a large set of documents in order to form the evidence. Another related task is question answering (QA), for which approaches have recently been extended to handle large-scale resources such as Wikipedia (Chen et al., 2017). However, questions typically provide the information needed to identify the answer, while information missing from a claim can often be crucial in retrieving refuting evidence. For example, a claim stating “Fiji’s largest island is Kauai.” can be refuted by retrieving “Kauai is the oldest Hawaiian Island.” as evidence.

Progress on the aforementioned tasks has benefited from the availability of large-scale datasets (Bowman et al., 2015; Rajpurkar et al., 2016). However, despite the rising interest in verification and fact checking among researchers, the datasets currently used for this task are limited to a few hundred claims. Indicatively, the recently conducted Fake News Challenge (Pomerleau and Rao, 2017) with 50 participating teams used a dataset consisting of 300 claims verified against 2,595 associated news articles, which is orders of magnitude smaller than those used for TE and QA.

In this paper we present a new dataset for claim verification, FEVER: Fact Extraction and VERification. It consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two classes, systems and annotators also need to return the combination of sentences forming the necessary evidence supporting or refuting the claim (see Figure 1). The claims were generated by human annotators extracting claims from Wikipedia and mutating them in a variety of ways, some of which were meaning-altering.
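The difference between the two accuracy figures quoted in the abstract (labeling with correct evidence versus ignoring the evidence) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the released baseline: the `Claim` and `Prediction` classes, the sentence indices, and in particular the criterion for correct evidence, which is assumed to mean that the predicted sentences contain at least one complete gold evidence set. The example data is the multi-page claim of Figure 1.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

# An evidence sentence is identified by (page title, sentence index).
# The index values used below are hypothetical placeholders.
Sentence = Tuple[str, int]


@dataclass
class Claim:
    text: str
    label: str  # "SUPPORTED", "REFUTED" or "NOTENOUGHINFO"
    # Each gold evidence set is one complete combination of sentences
    # that justifies the label; a claim may have several such sets.
    evidence_sets: List[FrozenSet[Sentence]] = field(default_factory=list)


@dataclass
class Prediction:
    label: str
    evidence: FrozenSet[Sentence] = frozenset()


def label_correct(gold: Claim, pred: Prediction) -> bool:
    """Score the label alone, ignoring the evidence."""
    return gold.label == pred.label


def label_and_evidence_correct(gold: Claim, pred: Prediction) -> bool:
    """Score the label and, for verifiable claims, the evidence as well.

    Assumption: evidence is correct when the predicted sentences contain
    at least one complete gold evidence set.
    """
    if gold.label != pred.label:
        return False
    if gold.label == "NOTENOUGHINFO":
        return True  # no evidence is required for this class
    return any(gold_set <= pred.evidence for gold_set in gold.evidence_sets)


def accuracy(golds, preds, scorer) -> float:
    """Fraction of claims that the given scorer accepts."""
    return sum(scorer(g, p) for g, p in zip(golds, preds)) / len(golds)


riots = Claim(
    text="The Rodney King riots took place in the most populous county in the USA.",
    label="SUPPORTED",
    evidence_sets=[
        frozenset({("Los Angeles Riots", 0), ("Los Angeles County", 0)})
    ],
)
# Right label, but only one of the two required pages was retrieved.
partial = Prediction("SUPPORTED", frozenset({("Los Angeles Riots", 0)}))
# Right label and the complete two-sentence evidence set.
full = Prediction(
    "SUPPORTED",
    frozenset({("Los Angeles Riots", 0), ("Los Angeles County", 0)}),
)

golds, preds = [riots, riots], [partial, full]
label_only = accuracy(golds, preds, label_correct)
with_evidence = accuracy(golds, preds, label_and_evidence_correct)
```

Under these assumptions the label-only score can never be lower than the evidence-aware score, since the latter adds a condition on top of a correct label.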
The verification of each claim was conducted in a separate annotation process by annotators who were aware of the page but not the sentence from which the original claim was extracted; thus, in 31.75% of the claims more than one sentence was considered appropriate evidence. Claims require composition of evidence from multiple sentences in 16.82% of cases. Furthermore, in 12.15% of the claims, this evidence was taken from multiple pages.

Claim: The Rodney King riots took place in the most populous county in the USA.

[wiki/Los Angeles Riots]
The 1992 Los Angeles riots, also known as the Rodney King riots were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.

[wiki/Los Angeles County]
Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

Verdict: Supported

Figure 1: Manually verified claim requiring evidence from multiple Wikipedia pages.

To ensure annotation consistency, we developed suitable guidelines and user interfaces, resulting in inter-annotator agreement of 0.6841 in Fleiss κ (Fleiss, 1971) in claim verification classification, and 95.42% precision and 72.36% recall in evidence retrieval.

To characterize the challenges posed by FEVER we develop a pipeline approach which, given a claim, first identifies relevant documents, then selects sentences forming the evidence from the documents and finally classifies the claim w.r.t. the evidence. The best performing version achieves 31.87% accuracy in verification when requiring correct evidence to be retrieved for claims SUPPORTED or REFUTED, and 50.91% if the correctness of the evidence is ignored, both indicating the difficulty but also the feasibility of the task. We also conducted oracle experiments in which components of the pipeline were replaced by the gold standard annotations, and observed that the most challenging part of the task is selecting the sentences containing the evidence. In addition to publishing the data via our website (http://fever.ai), we also publish the annotation interfaces (https://github.com/awslabs/fever) and the baseline system (https://github.com/sheffieldnlp/fever-baselines) to stimulate further research on verification.

2 Related Works

Vlachos and Riedel (2014) constructed a dataset for claim verification consisting of 106 claims, selecting data from fact-checking websites such as PolitiFact, taking advantage of the labelled claims available there. However, in order to develop claim verification components we typically require the justification for each verdict, including the sources used. While this information is usually available in justifications provided by the journalists, it is not in a machine-readable form. Thus, also considering the small number of claims, the task defined by the dataset proposed remains too challenging for the ML/NLP methods currently available. Wang (2017) extended this approach by including all 12.8K claims available from PolitiFact via its API; however, the justification and the evidence contained in it were ignored in the experiments as they were not machine-readable. Instead, the claims were classified considering only the text and the metadata related to the person making the claim. While this rendered the task amenable to current NLP/ML methods, it does not allow for verification against any sources and no evidence needs to be returned to justify the verdicts.

The Fake News Challenge (Pomerleau and Rao, 2017) modelled verification as stance classification: given a claim and an article, predict whether the article supports, refutes, observes (neutrally states the claim) or is irrelevant to the claim. The dataset consists of 50K labelled claim-article pairs, combining 300 claims with 2,582 articles. The claims and the articles were curated and labeled by journalists in the context of the Emergent Project (Silverman, 2015), and the dataset was first proposed by Ferreira and Vlachos (2016), who only classified the claim w.r.t. the article headline instead of the whole article. Similar to recognizing textual entailment (RTE) (Dagan et al., 2009), the systems were provided with the sources to verify against, instead of having to retrieve them.

A differently motivated but closely related dataset is the one developed by Angeli and Manning (2014) to evaluate natural logic inference for common sense reasoning, as it evaluated simple claims such as “not all birds can fly” against textual sources, including Wikipedia, which were processed with an Open Information Extraction system (Mausam et al., 2012). However, the claims were small in number (1,378) and limited in variety as they were derived from eight binary ConceptNet relations (Tandon et al., 2011).

Claim verification is also related to the multilingual Answer Validation Exercise (Rodrigo et al., 2009) conducted in the context of the TREC shared tasks. Apart from the difference in dataset size (1,000 instances per language), the key difference is that the claims being validated were answers returned to questions by QA systems. The questions and the QA systems themselves provide additional context to the claim, while in our task definition the claims are outside any particular context.

[...] be simplifications and paraphrases. At the other extreme, if we allowed world knowledge to be freely incorporated it would result in claims that would be hard to verify on Wikipedia alone. We address this issue by introducing a dictionary: a list of terms that were (hyper-)linked in the original sentence, along with the first sentence from their corresponding Wikipedia pages. Using this dictionary, we provide additional knowledge that can be used to increase the complexity of the generated claims in a controlled manner.

The annotators were also asked to generate mutations of the claims: altered versions of the original claims, which may or may not change whether they are supported by Wikipedia, or even if they can be verified against it. Inspired by the operators used in Natural Logic Inference (Angeli and