
FEVER: a large-scale dataset for Fact Extraction and VERification

James Thorne (1), Andreas Vlachos (1), Christos Christodoulopoulos (2), and Arpit Mittal (2)

(1) Department of Computer Science, University of Sheffield
(2) Amazon Research Cambridge
{j.thorne, a.vlachos}@sheffield.ac.uk
{chrchrs, mitarpit}@amazon.co.uk

arXiv:1803.05355v3 [cs.CL] 18 Dec 2018

Abstract

In this paper we introduce a new publicly available dataset for verification against textual sources, FEVER: Fact Extraction and VERification. It consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. The claims are classified as SUPPORTED, REFUTED or NOTENOUGHINFO by annotators achieving 0.6841 in Fleiss κ. For the first two classes, the annotators also recorded the sentence(s) forming the necessary evidence for their judgment. To characterize the challenge of the dataset presented, we develop a pipeline approach and compare it to suitably designed oracles. The best accuracy we achieve on labeling a claim accompanied by the correct evidence is 31.87%, while if we ignore the evidence we achieve 50.91%. Thus we believe that FEVER is a challenging testbed that will help stimulate progress on claim verification against textual sources.

1 Introduction

The ever-increasing amounts of textual information available combined with the ease in sharing it through the web have increased the demand for verification, also referred to as fact checking. While it has received a lot of attention in the context of journalism, verification is important for other domains, e.g. information in scientific publications, product reviews, etc.

In this paper we focus on verification of textual claims against textual sources. When compared to textual entailment (TE)/natural language inference (Dagan et al., 2009; Bowman et al., 2015), the key difference is that in these tasks the passage to verify each claim is given, and in recent years it typically consists of a single sentence, while in verification systems it is retrieved from a large set of documents in order to form the evidence. Another related task is question answering (QA), for which approaches have recently been extended to handle large-scale resources such as Wikipedia (Chen et al., 2017). However, questions typically provide the information needed to identify the answer, while information missing from a claim can often be crucial in retrieving refuting evidence. For example, a claim stating "Fiji's largest island is Kauai." can be refuted by retrieving "Kauai is the oldest Hawaiian Island." as evidence.

Progress on the aforementioned tasks has benefited from the availability of large-scale datasets (Bowman et al., 2015; Rajpurkar et al., 2016). However, despite the rising interest in verification and fact checking among researchers, the datasets currently used for this task are limited to a few hundred claims. Indicatively, the recently conducted Fake News Challenge (Pomerleau and Rao, 2017) with 50 participating teams used a dataset consisting of 300 claims verified against 2,595 associated news articles, which is orders of magnitude smaller than those used for TE and QA.

In this paper we present a new dataset for claim verification, FEVER: Fact Extraction and VERification. It consists of 185,445 claims manually verified against the introductory sections of Wikipedia pages and classified as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two classes, systems and annotators need to also return the combination of sentences forming the necessary evidence supporting or refuting the claim (see Figure 1). The claims were generated by human annotators extracting claims from Wikipedia and mutating them in a variety of ways, some of which were meaning-altering. The verification of each claim was conducted in a separate annotation process by annotators who were aware of the page but not the sentence from which the original claim was extracted, and thus in 31.75% of the claims more than one sentence was considered appropriate evidence. Claims require composition of evidence from multiple sentences in 16.82% of cases. Furthermore, in 12.15% of the claims, this evidence was taken from multiple pages.

    Claim: The Rodney King riots took place in the most populous county in the USA.

    [wiki/Los Angeles Riots]
    The 1992 Los Angeles riots, also known as the Rodney King riots were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992.

    [wiki/Los Angeles County]
    Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

    Verdict: Supported

Figure 1: Manually verified claim requiring evidence from multiple Wikipedia pages.

To ensure annotation consistency, we developed suitable guidelines and user interfaces, resulting in inter-annotator agreement of 0.6841 in Fleiss κ (Fleiss, 1971) in claim verification classification, and 95.42% precision and 72.36% recall in evidence retrieval.

To characterize the challenges posed by FEVER we develop a pipeline approach which, given a claim, first identifies relevant documents, then selects sentences forming the evidence from the documents and finally classifies the claim w.r.t. evidence. The best performing version achieves 31.87% accuracy in verification when requiring correct evidence to be retrieved for claims SUPPORTED or REFUTED, and 50.91% if the correctness of the evidence is ignored, both indicating the difficulty but also the feasibility of the task. We also conducted oracle experiments in which components of the pipeline were replaced by the gold standard annotations, and observed that the most challenging part of the task is selecting the sentences containing the evidence. In addition to publishing the data via our website [1], we also publish the annotation interfaces [2] and the baseline system [3] to stimulate further research on verification.

[1] http://fever.ai
[2] https://github.com/awslabs/fever
[3] https://github.com/sheffieldnlp/fever-baselines

2 Related Works

Vlachos and Riedel (2014) constructed a dataset for claim verification consisting of 106 claims, selecting data from fact-checking websites such as PolitiFact, taking advantage of the labelled claims available there. However, in order to develop claim verification components we typically require the justification for each verdict, including the sources used. While this information is usually available in justifications provided by the journalists, they are not in a machine-readable form. Thus, also considering the small number of claims, the task defined by the dataset proposed remains too challenging for the ML/NLP methods currently available. Wang (2017) extended this approach by including all 12.8K claims available by Politifact via its API; however, the justification and the evidence contained in it were ignored in the experiments as they were not machine-readable. Instead, the claims were classified considering only the text and the metadata related to the person making the claim. While this rendered the task amenable to current NLP/ML methods, it does not allow for verification against any sources and no evidence needs to be returned to justify the verdicts.

The Fake News challenge (Pomerleau and Rao, 2017) modelled verification as stance classification: given a claim and an article, predict whether the article supports, refutes, observes (neutrally states the claim) or is irrelevant to the claim. It consists of 50K labelled claim-article pairs, combining 300 claims with 2,582 articles. The claims and the articles were curated and labeled by journalists in the context of the Emergent Project (Silverman, 2015), and the dataset was first proposed by Ferreira and Vlachos (2016), who only classified the claim w.r.t. the article headline instead of the whole article. Similar to recognizing textual entailment (RTE) (Dagan et al., 2009), the systems were provided with the sources to verify against, instead of having to retrieve them.

A differently motivated but closely related dataset is the one developed by Angeli and Manning (2014) to evaluate natural logic inference for common sense reasoning, as it evaluated simple claims such as "not all birds can fly" against textual sources (including Wikipedia) which were processed with an Open Information Extraction system (Mausam et al., 2012). However, the claims were small in number (1,378) and limited in variety as they were derived from eight binary ConceptNet relations (Tandon et al., 2011).

Claim verification is also related to the multilingual Answer Validation Exercise (Rodrigo et al., 2009) conducted in the context of the TREC shared tasks. Apart from the difference in dataset size (1,000 instances per language), the key difference is that the claims being validated were answers returned to questions by QA systems. The questions and the QA systems themselves provide additional context to the claim, while in our task definition the claims are outside any particular context. In the same vein, Kobayashi et al. (2017) collected a dataset of 412 statements in context from high-school student exams that were validated against Wikipedia and history textbooks.

3 Fact extraction and verification dataset

The dataset was constructed in two stages:

Claim Generation: Extracting information from Wikipedia and generating claims from it.

Claim Labeling: Classifying whether a claim is supported or refuted by Wikipedia and selecting the evidence for it, or deciding there's not enough information to make a decision.

3.1 Task 1 - Claim Generation

The objective of this task was to generate claims from information extracted from Wikipedia. We used the June 2017 Wikipedia dump, processed it with Stanford CoreNLP (Manning et al., 2014), and sampled sentences from the introductory sections of approximately 50,000 popular pages [4].

[4] These consisted of 5,000 from a Wikipedia 'most accessed pages' list and the pages hyperlinked from them.

The annotators were given a sentence from the sample chosen at random, and were asked to generate a set of claims containing a single piece of information, focusing on the entity that its original Wikipedia page was about. We asked the annotators to generate claims about a single fact which could be arbitrarily complex and allowed for a variety of expressions for the entities.

If only the source sentences were used to generate claims then this would result in trivially verifiable claims, as the new claims would in essence be simplifications and paraphrases. At the other extreme, if we allowed world knowledge to be freely incorporated it would result in claims that would be hard to verify on Wikipedia alone. We address this issue by introducing a dictionary: a list of terms that were (hyper-)linked in the original sentence, along with the first sentence from their corresponding Wikipedia pages. Using this dictionary, we provide additional knowledge that can be used to increase the complexity of the generated claims in a controlled manner.

The annotators were also asked to generate mutations of the claims: altered versions of the original claims, which may or may not change whether they are supported by Wikipedia, or even if they can be verified against it. Inspired by the operators used in Natural Logic Inference (Angeli and Manning, 2014), we specified six types of mutation: paraphrasing, negation, substitution of an entity/relation with a similar/dissimilar one, and making the claim more general/specific. See Appendix A for the full guidelines.

During trials of the annotation task, we discovered that the majority of annotators had difficulty generating non-trivial negation mutations (e.g. mutations beyond adding "not" to the original). Besides providing numerous examples for each mutation, we also redesigned the annotation interface so that all mutation types were visible at once and highlighted mutations that contained "not" in order to discourage trivial negations. Finally, we provided the annotators with an ontology diagram to illustrate the different levels of entity similarity and class membership.

This process resulted in claims (both extracted and mutated) with a mean length of 9.4 tokens, which is comparable to the average hypothesis length of 8.3 tokens in Bowman et al. (2015).

3.2 Task 2 - Claim Labeling

The annotators were asked to label each individual claim generated during Task 1 as SUPPORTED, REFUTED or NOTENOUGHINFO. For the first two cases, the annotators were asked to find the evidence from any page that supports or refutes the claim (see Figure 2 for a screenshot of the interface). In order to encourage inter-annotator consistency, we gave the following general guideline:

    If I was given only the selected sentences, do I have strong reason to believe the claim is true (supported) or stronger reason to believe the claim is false (refuted). If I'm not certain, what additional information (dictionary) do I have to add to reach this conclusion.

In the annotation interface, all sentences from the introductory section of the page for the main entity of the claim and of every linked entity in those sentences were provided as a default source of evidence (left-hand side in Fig. 2). Using this interface the annotators recorded the sentences necessary to justify their classification decisions. In order to allow exploration beyond the main and linked pages, we also allowed annotators to add an arbitrary Wikipedia page by providing its URL and the system would add its introductory section as additional sentences that could be then selected as evidence (right-hand side in Fig. 2). The title of the page could also be used as evidence to resolve co-reference, but this decision was not explicitly recorded. We did not set a hard time limit for the task, but the annotators were advised not to spend more than 2-3 minutes per claim. The label NOTENOUGHINFO was used if the claim could not be supported or refuted by any amount of information in Wikipedia (either because it is too general, or too specific).

Figure 2: Screenshot of Task 2 - Claim Labeling

3.3 Annotators

The annotation team consisted of a total of 50 members, 25 of which were involved only in the first task. All annotators were native US English speakers and were trained either directly by the authors, or by experienced annotators. The interface for both tasks was developed by the authors in collaboration with an initial team of two annotators. Their notes and suggestions were incorporated into the annotation guidelines.

The majority of the feedback received from the annotators was very positive: they found the task engaging and challenging, and after the initial stages of annotation they had developed an understanding of the needs of the task which let them discuss solutions about edge cases as a group.

3.4 Data Validation

Given the complexity of the second task (claim labeling), we conducted three forms of data validation: 5-way inter-annotator agreement, agreement against super-annotators (defined in Section 3.4.2), and manual validation by the authors. The validation for claim generation was done implicitly during claim labeling. As a result 1.01% of claims were skipped, 2.11% contained typos and 6.63% of the generated claims were flagged as too vague/ambiguous and were excluded (e.g. "Sons of Anarchy premiered.").

3.4.1 5-way Agreement

We randomly selected 4% (n = 7506) of claims which were not skipped to be annotated by 5 annotators. We calculated the Fleiss κ score (Fleiss, 1971) to be 0.6841, which we consider encouraging given the complexity of the task. In comparison, Bowman et al. (2015) reported a κ of 0.7 for a simpler task, since the annotators were given the premise/evidence to verify a hypothesis against without the additional task of finding it.

3.4.2 Agreement against Super-Annotators

We randomly selected 1% of the data to be annotated by super-annotators: expert annotators with no suggested time restrictions. The purpose of this exercise was to provide as much coverage of evidence as possible. We instructed the super-annotators to search over the whole Wikipedia for every possible sentence that could be used as evidence. We compared the regular annotations against this set of evidence and the precision/recall was 95.42% and 72.36% respectively.

3.4.3 Validation by the Authors

As a final quality control step, we chose 227 examples and annotated them for accuracy of the labels and the evidence provided. We found that 91.2% of the examples were annotated correctly. 3% of the claims were mistakes in claim generation that had not been flagged during labeling. We found a similar number of these claims which did not meet the guidelines during a manual error analysis of the baseline system (Section 5.8).

3.4.4 Findings

When compared against the super-annotators, all except two annotators achieved > 90% precision and all but 9 achieved recall > 70% in evidence retrieval. The majority of the low-recall cases are for claims such as "Akshay Kumar is an actor." where the super-annotator added 34 sentences as evidence, most of them being filmography listings (e.g. "In 2000, he starred in the Priyadarshan-directed comedy Hera Pheri").

During the validation by the authors, we found that most of the examples that were annotated incorrectly were cases where the label was correct, but the evidence selected was not sufficient (only 4 out of 227 examples were labeled incorrectly according to the guidelines).

We tried to resolve this issue by asking our annotators to err on the side of caution. For example, while the claim "Shakira is Canadian" could be labeled as REFUTED by the sentence "Shakira is a Colombian singer, songwriter, dancer, and record producer", we advocated that unless more explicit evidence is provided (e.g. "She was denied Canadian citizenship"), the claim should be labeled as NOTENOUGHINFO, since dual citizenships are permitted, and the annotators' world knowledge should not be factored in.

A related issue is entity resolution. For a claim like "David Beckham was with United.", it might be trivial for an annotator to accept "David Beckham made his European League debut playing for Manchester United." as supporting evidence. This implicitly assumes that "United" refers to "Manchester United"; however, there are many Uniteds in Wikipedia and not just football clubs, e.g. United Airlines. The annotators knew the page of the main entity and thus it was relatively easy to resolve ambiguous entities. While we provide this information as part of the dataset, we argue that it should only be used for system training/development.

4 Baseline System Description

We construct a simple pipelined system comprising three components: document retrieval, sentence-level evidence selection and textual entailment. Each component is evaluated in isolation through oracle evaluations on the development set and we report the final accuracies on the test set.

Document Retrieval: We use the document retrieval component from the DrQA system (Chen et al., 2017) which returns the k nearest documents for a query using cosine similarity between binned unigram and bigram Term Frequency-Inverse Document Frequency (TF-IDF) vectors.

Sentence Selection: Our simple sentence selection method ranks sentences by TF-IDF similarity to the claim. We sort the most-similar sentences first and tune a cut-off using validation accuracy on the development set. We evaluate both DrQA and a simple unigram TF-IDF implementation to rank the sentences for selection. We further evaluate the impact of sentence selection on the RTE module by predicting entailment given the original documents without sentence selection.

Recognizing Textual Entailment: We compare two models for recognizing textual entailment. For a simple well-performing baseline, we selected Riedel et al. (2017)'s submission from the 2017 Fake News Challenge. It is a multi-layer perceptron (MLP) with a single hidden layer which uses term frequencies and TF-IDF cosine similarity between the claim and evidence as features. Evaluating the state-of-the-art in RTE, we used a decomposable attention (DA) model between the claim and the evidence passage (Parikh et al., 2016). We selected it because at the time of development this model was the highest scoring system for the Stanford Natural Language Inference task (Bowman et al., 2015) with publicly available code that did not require the input text to be parsed syntactically, nor was an ensemble.

The RTE component must correctly classify a claim as NOTENOUGHINFO when the evidence retrieved is not relevant or informative. However, the instances labeled as NOTENOUGHINFO have no evidence annotated, thus cannot be used to train RTE for this class. To overcome this issue, we simulate training instances for the NOTENOUGHINFO class through two methods: sampling a sentence from the nearest page (NEARESTP) to the claim as evidence using our document retrieval component and sampling a sentence from Wikipedia uniformly at random (RANDOMS).

5 Experiments

5.1 Dataset Statistics

We partitioned the annotated claims into training, development and test sets. We ensured that each Wikipedia page used to generate claims occurs in exactly one set. We reserved a further 19,998 examples for use as a test set for a shared task.

    Split      SUPPORTED   REFUTED   NEI
    Training   80,035      29,775    35,639
    Dev        3,333       3,333     3,333
    Test       3,333       3,333     3,333
    Reserved   6,666       6,666     6,666

Table 1: Dataset split sizes for SUPPORTED, REFUTED and NOTENOUGHINFO (NEI) classes

5.2 Evaluation

Predicting whether a claim is SUPPORTED, REFUTED or NOTENOUGHINFO is a 3-way classification task that we evaluate using accuracy. In the case of the first two classes, appropriate evidence must be provided, at a sentence-level, to justify the classification. We consider an answer returned correct for the first two classes only if correct evidence is returned. Given that the development and test datasets have balanced class distributions, a random baseline will have ∼33% accuracy if one ignores the requirement for evidence for SUPPORTED and REFUTED.

We evaluate the correctness of the evidence retrieved by computing the F1-score of all the predicted sentences in comparison to the human-annotated sentences for those claims requiring evidence on our complete pipeline system (Section 5.7). As in Fig. 1, some claims require multi-hop inference involving sentences from more than one document to be correctly supported as SUPPORTED/REFUTED. In this case all sentences must be selected for the evidence to be marked as correct. We report this as the proportion of fully supported claims. Some claims may be equally supported by different pieces of evidence; in this case one complete set of sentences should be predicted.

Systems that select information that the annotators did not will be penalized in terms of precision. We recognize that it is not feasible to ensure that the evidence selection annotations are complete; nevertheless we argue that they are useful for automatic evaluation during system development. For a more reliable evaluation we advocate crowd-sourcing annotations of false-positive predictions at a later date in a similar manner to the TAC KBP Slot Filler Validation (Ellis et al., 2016).

5.3 Document Retrieval

The document retrieval component of the baseline system uses the DrQA (Chen et al., 2017) TF-IDF implementation to return the k nearest documents to the claim. In the scenario where evidence from multiple documents is required, k must be no smaller than the number of documents involved. We simulate the upper bound in accuracy using an oracle 3-way RTE classifier that predicts SUPPORTED/REFUTED instances correctly only if the documents containing the supporting/refuting evidence are returned by document retrieval and always predicts NOTENOUGHINFO instances correctly independently of the evidence. Results are shown in Table 2.

    k      Fully Supported (%)   Oracle Accuracy (%)
    1      25.31                 50.21
    5      55.30                 70.20
    10     65.86                 77.24
    25     75.92                 83.95
    50     82.49                 90.13
    100    86.59                 91.06

Table 2: Dev. set document retrieval evaluation.

5.4 Sentence Selection

Mirroring document retrieval, we extract the top l-most similar sentences from the k-most relevant documents using TF-IDF vector similarity. We modified the document retrieval component of DrQA (Chen et al., 2017) to select sentences using bigram TF-IDF with binning and compared this to a simple unigram TF-IDF implementation using NLTK (Loper and Bird, 2002). Using the parameters k = 5 documents and l = 5 sentences, 55.30% of claims (excluding NOTENOUGHINFO) can be fully supported or refuted by the retrieved documents before sentence selection (see Table 2). After applying the sentence selection component, 44.22% of claims can be fully supported using the extracted sentences with DrQA and only 34.03% with NLTK. This would yield oracle accuracies of 62.81% and 56.02% respectively.

5.5 Recognizing Textual Entailment

The RTE component is trained on labeled claims paired with sentence-level evidence. Where multiple sentences are required as evidence, the strings are concatenated. As discussed in Section 4, such data is not annotated for claims labeled NOTENOUGHINFO, thus we compare random sampling-based and similarity-based strategies for generating it. We evaluate classification accuracy on the development set in an oracle evaluation, assuming correct evidence sentences are selected (Table 3). Additionally, for the DA model, we predict entailment given evidence, using the AllenNLP (Gardner et al., 2017) pre-trained Stanford Natural Language Inference (SNLI) model for comparison.

             Accuracy (%)
    Model    NEARESTP   RANDOMS   SNLI
    MLP      65.13      73.81     -
    DA       80.82      88.00     38.54

Table 3: Oracle classification on claims in the development set using gold sentences as evidence

The random sampling (RANDOMS) approach (where a sentence is sampled at random from Wikipedia in place of evidence for claims labeled as NOTENOUGHINFO) yielded sentences that were not only semantically different to the claim, but also unrelated. While the accuracy of models trained with the sampling approach is higher in the oracle evaluation setting, this may not yield a better system in the pipeline setting. In contrast, the nearest page (NEARESTP) method samples a sentence from the highest-ranked page returned by our document retrieval module. This simulates finding related information that may not be sufficient to support or refute a claim. We will evaluate both RANDOMS and NEARESTP in the full pipeline setting, but we will not pursue the SNLI-trained model further as it performed substantially worse.

5.6 Full Pipeline

The complete pipeline consists of the DrQA document retrieval module (Section 5.3), DrQA-based sentence retrieval module (Section 5.4), and the decomposable attention RTE model (Section 5.5). The two parameters, k, describing the number of documents, and l, describing the number of sentences to return, were found using grid-search optimizing the RTE accuracy with the DA model. For the pipeline, we set k = 5 and l = 5 and report the development set accuracy, both with and without the requirement to provide correct evidence for the SUPPORTED/REFUTED predictions (marked as ScoreEv and NoScoreEv respectively).

                Accuracy (%)
    Model       NoScoreEv   ScoreEv
    MLP / NP    41.86       19.04
    MLP / RS    40.63       19.42
    DA / NP     52.09       32.57
    DA / RS     50.37       23.53

Table 4: Full pipeline results on development set

The decomposable attention model trained with NEARESTP is the most accurate when evidence is considered. Inspection of the confusion matrices shows that the RANDOMS strategy harms recall for the NOTENOUGHINFO class. This is due to the difference between the sampled pages in the training set and the ones retrieved in the development set causing related but uninformative evidence to be misclassified as SUPPORTED and REFUTED.

Ablation of the sentence selection module: We evaluate the impact of the sentence selection module on the RTE accuracy by removing it. While the sentence selection module may improve accuracy in the RTE component, it is discarding sentences that are required as evidence to support claims, harming performance (see Section 5.4). We assess the accuracies in both the oracle setting (similar to Section 5.5; see Table 5) and the full pipeline (see Table 6).

In the oracle setting, the decomposable attention models are worst affected by removal of the sentence selection module, exhibiting a substantial decrease in accuracy. The NEARESTP training regime exhibits a decrease of 17 percentage points and the RANDOMS accuracy decreases by 19 percentage points, despite near-perfect recall of the NOTENOUGHINFO class.

             Oracle Accuracy (%)
    Model    NEARESTP   RANDOMS
    MLP      57.16      73.36
    DA       63.68      69.05

Table 5: Oracle accuracy on claims in the dev. set using gold documents as evidence (c.f. Table 3).

In the pipeline setting, we run the RTE component without sentence selection using the k = 5 most similar predicted documents. The removal of the sentence selection component decreased the accuracy (NoScoreEv) by approximately 10 percentage points for both decomposable attention models.

             Accuracy (%)
    Model    NEARESTP   RANDOMS
    MLP      38.85      40.45
    DA       41.57      40.62

Table 6: Pipeline accuracy on the dev. set without the sentence selection module (c.f. Table 4).

5.7 Evaluating Full Pipeline on Test Set

We evaluate our pipeline approach on the test set based on the results observed in Section 5.6. First, we use DrQA to select the 5 documents nearest to the claim. Then, we select 5 sentences using our DrQA-based sentence retrieval component and concatenate them. Finally, we predict entailment using the Decomposable Attention model trained with the NEARESTP strategy. The classification accuracy is 31.87%. Ignoring the requirement for correct evidence (NoScoreEv) the accuracy is 50.91%, which highlights that while the systems were predicting the correct label, the evidence selected was different to that which the human annotators chose. The recall of the document and sentence retrieval modules for claims which required evidence on the test set was 45.89% (considering complete groups of evidence) and the precision 10.79%. The resulting F1 score is 17.47%.

5.8 Manual Error Analysis

Using the predictions on the test set, we sampled 961 of the predictions with an incorrect label or incorrect evidence and performed a manual analysis (procedure described in Appendix B). Of these, 28.51% (n = 274) had the correct predicted label but did not satisfy the requirements for evidence. The information retrieval component of the pipeline failed to identify any correct evidence in 58.27% (n = 560) of cases, which accounted for the large disparity between accuracy of the system when evidence was and was not considered. Where suitable evidence was found, the RTE component incorrectly classified 13.84% (n = 133) of claims.

The pipeline retrieved new evidence that had not been identified by annotators in 21.85% (n = 210) of claims. This was in line with our expectation given the measured recall rate of annotators (see Section 3.4.2), who achieved recall of 72.36% of evidence identified by the super-annotators.

We found that 4.05% (n = 41) of claims did not meet our guidelines. Of these, there were 11 claims which could be checked without evidence as these were either tautologous or self-contradictory. Some correct claims appeared ungrammatical due to the mis-parsing of named entities (e.g. Exotic Birds is the name of a band but could be parsed as a type of animal). Annotator errors (where the wrong label was applied) were present in 1.35% (n = 13) of incorrectly classified claims.

Interestingly, our system found new evidence that contradicted the gold evidence in 0.52% (n = 5) of cases. This was caused either by entity resolution errors or by inconsistent information present in Wikipedia pages (e.g. Pakistan was described as having both the 41st and 42nd largest GDP in two different pages).

5.9 Ablation of Training Data

To evaluate whether the size of the dataset is suitable for training the RTE component of the pipeline, we plot the learning curves for the DA and MLP models (Fig. 3). For each model, we trained 5 models with different random initializations using the NEARESTP method (see Section 5.5). We selected the highest performing model when evaluated on the development set and report the oracle RTE accuracy on the test set. We observe that with fewer than 6000 training instances, the accuracy of DA is unstable. However, with more data, its accuracy increases with respect to the log of the number of training instances and exceeds that of MLP. While both learning curves exhibit the typical diminishing return trends, they indicate that the dataset is large enough to demonstrate the differences of models with different learning capabilities.

[Plot: test set oracle accuracy (%) against number of training instances (log scale, 10^2 to 10^5) for the MLP and DA models]

Figure 3: Learning curves for the RTE models.

6 Discussion

The pipeline presented and evaluated in the previous section is one possible approach to the task proposed in our dataset, but we envisage different ones to be equally valid and possibly better performing. For instance, it would be interesting to test how approaches similar to natural logic inference (Angeli and Manning, 2014) can be applied, where a knowledge base/graph is constructed by reading the textual sources and then a reasoning process over the claim is applied, possibly using recent advances in neural theorem proving (Rocktäschel and Riedel, 2017). A different approach could be to consider a combination of question generation (Heilman and Smith, 2010) followed by a question answering model such as BiDAF (Seo et al., 2016), possibly requiring modification as they are designed to select a single span [...] this enhance the interpretability of predictions, but also facilitate the development of new methods for reading comprehension.

Another use case for the FEVER dataset is claim extraction: generating short concise textual facts from longer encyclopedic texts. For sources like Wikipedia or news articles, the sentences can contain multiple individual claims, making them not only difficult to parse, but also hard to evaluate against evidence. During the construction of the FEVER dataset, we allowed for an extension of the task where simple claims can be extracted from multiple complex sentences.

Finally, we would like to note that while we chose Wikipedia as our textual source, we do not consider it to be the only source of information worth considering in verification, hence not using TRUE or FALSE in our classification scheme. We expect systems developed on the dataset presented to be portable to different textual sources.

7 Conclusions

In this paper we have introduced FEVER, a publicly available dataset for fact extraction and verification against textual sources. We discussed the data collection and annotation methods and shared some of the insights obtained during the annotation process that we hope will be useful to other large-scale annotation efforts.

In order to evaluate the challenge this dataset presents, we developed a pipeline approach that comprises information retrieval and textual entailment components. We showed that the task is challenging yet feasible, with the best performing system achieving an accuracy of 31.87%.

We also discussed other uses for the FEVER dataset and presented some further extensions that we would like to work on in the future. We believe that FEVER will provide a stimulating challenge for claim extraction and verification systems.

Acknowledgments

The work reported was partly conducted while James Thorne was at Amazon Research Cambridge.
Andreas Vlachos is supported by the EU H2020 SUMMA project (grant agree- of text from a document rather than return one or ment number 688139). The authors would like to thank the more sentences as per our scoring criteria. The following people for their advice and suggestions: David sentence-level evidence annotation in our dataset Hardcastle, Marie Hanabusa, Timothy Howd, Neil Lawrence, Benjamin Riedel, Craig Saunders and Iris Spik. The authors will help develop models selecting and attending also wish to thank the team of annotators involved in prepar- to the relevant information from multiple docu- ing this dataset. ments and non-contiguous passages. Not only will References Edward Loper and Steven Bird. 2002. Nltk: The natural language toolkit. In Proceedings of the Gabor Angeli and Christopher D. Manning. 2014. Nat- ACL-02 Workshop on Effective Tools and Method- uralLI: Natural logic inference for common sense ologies for Teaching Natural Language Process- reasoning. In Proceedings of the 2014 Conference ing and Computational Linguistics - Volume 1. As- on Empirical Methods in Natural Language Pro- sociation for Computational Linguistics, Strouds- cessing. pages 534–545. burg, PA, USA, ETMTNLP ’02, pages 63–70. Samuel R. Bowman, Gabor Angeli, Christopher Potts, https://doi.org/10.3115/1118108.1118117. and Christopher D. Manning. 2015. A large anno- Christopher D Manning, Mihai Surdeanu, John Bauer, tated corpus for learning natural language inference. Jenny Rose Finkel, Steven Bethard, and David Mc- In Proceedings of the 2015 Conference on Empirical Closky. 2014. The Stanford CoreNLP natural lan- Methods in Natural Language Processing. guage processing toolkit. In ACL (System Demon- strations). pages 55–60. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open- Mausam, Michael Schmitz, Robert Bart, Stephen domain questions. In Proceedings of the 55th An- Soderland, and Oren Etzioni. 2012. 
Open language nual Meeting of the Association for Computational learning for information extraction. In Proceedings Linguistics (Volume 1: Long Papers). Association of the 2012 Joint Conference on Empirical Methods for Computational Linguistics, pages 1870–1879. in Natural Language Processing and Computational https://doi.org/10.18653/v1/P17-1171. Natural Language Learning. pages 523–534. Ido Dagan, Bill Dolan, Bernardo Magnini, and Ankur Parikh, Oscar Tackstr¨ om,¨ Dipanjan Das, and Dan Roth. 2009. Recognizing textual entail- Jakob Uszkoreit. 2016. A decomposable attention ment: Rational, evaluation and approaches. model for natural language inference. In Proceed- Natural Language Engineering 15(4):i–xvii. ings of the 2016 Conference on Empirical Meth- https://doi.org/10.1017/S1351324909990209. ods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, 2249–2255. https://aclweb.org/anthology/D16- Zhiyi Song, Ann Bies, and Stephanie Strassel. 2016. 1244. Overview of Linguistic Resources for the TAC KBP 2016 Evaluations : and Results. Pro- Dean Pomerleau and Delip Rao. 2017. Fake news chal- ceedings of TAC KBP 2016 Workshop, National In- lenge. http://fakenewschallenge.org/. stitute of Standards and Technology, Maryland, USA P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. (Ldc). SQuAD: 100,000+ questions for machine compre- Empirical Methods in Natural William Ferreira and Andreas Vlachos. 2016. Emer- hension of text. In Language Processing (EMNLP) gent: a novel data-set for stance classification. In . Proceedings of the 2016 Conference of the North Benjamin Riedel, Isabelle Augenstein, George Sp- American Chapter of the Association for Computa- ithourakis, and Sebastian Riedel. 2017. A simple tional Linguistics: Human Language Technologies. but tough-to-beat baseline for the Fake News Chal- San Diego, California, pages 1163–1168. lenge stance detection task. 
CoRR abs/1707.03264. http://arxiv.org/abs/1707.03264. Joseph L Fleiss. 1971. Measuring nominal scale agree- ment among many raters. Psychological bulletin Tim Rocktaschel¨ and Sebastian Riedel. 2017. End- 76(5):378. to-end differentiable proving. In Advances in Neu- ral Information Processing Systems 31: Annual Matt Gardner, Joel Grus, Mark Neumann, Oyvind Conference on Neural Information Processing Sys- Tafjord, Pradeep Dasigi, Nelson Liu, Matthew Pe- tems 2017, December 4-9, 2017, Long Beach, ters, Michael Schmitz, and Luke Zettlemoyer. 2017. California, . volume abs/1705.11040. AllenNLP: A Deep Semantic Natural Language Pro- http://arxiv.org/abs/1705.11040. cessing Platform . Alvaro´ Rodrigo, Anselmo Penas,˜ and Felisa Verdejo. Michael Heilman and Noah A. Smith. 2010. Good 2009. Overview of the answer validation exercise Question! statistical ranking for question genera- 2008. In Carol Peters, Thomas Deselaers, Nicola tion. In Proceedings of the 2010 Annual Conference Ferro, Julio Gonzalo, Gareth J. F. Jones, Mikko Ku- of the North American Chapter of the Association rimo, Thomas Mandl, Anselmo Penas,˜ and Vivien for Computational Linguistics. pages 609–617. Petras, editors, Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop Mio Kobayashi, Ai Ishii, Chikara Hoshino, Hiroshi of the Cross-Language Evaluation Forum. pages Miyashita, and Takuya Matsuzaki. 2017. Auto- 296–313. mated historical fact-checking by passage retrieval, word statistics, and virtual question-answering. In Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, Proceedings of the Eighth International Joint Con- and Hannaneh Hajishirzi. 2016. Bidirectional at- ference on Natural Language Processing (Volume 1: tention flow for machine comprehension. CoRR Long Papers). volume 1, pages 967–975. abs/1611.01603. http://arxiv.org/abs/1611.01603. Craig Silverman. 2015. Lies, Damn Lies and Viral Content. http: //towcenter.org/research/ lies-damn-lies-and-viral-content/. 
Niket Tandon, Gerard de Melo, and Gerhard Weikum. 2011. Deriving a Web-scale common sense fact database. In Proceedings of the 25th AAAI Conference on Artificial Intelligence (AAAI 2011). AAAI Press, Palo Alto, CA, USA, pages 152–157.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. ACL. http://www.aclweb.org/anthology/W14-2508.

William Yang Wang. 2017. “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). http://aclweb.org/anthology/P17-2067.

A Annotation Guidelines

A.1 Task 1 Definitions

Claim A claim is a single sentence expressing information (true or mutated) about a single aspect of one target entity. Using only the source sentence to generate claims will result in simple claims that are not challenging. But, allowing world knowledge to be incorporated is too unconstrained and will result in claims that cannot be evidenced by this dataset. We address this gap by introducing a dictionary that provides additional knowledge that can be used to increase the complexity of claims in a controlled manner.

Dictionary Additional world knowledge is given to the annotator in a dictionary. This allows for more complex claims to be generated in a structured manner by the annotator. This knowledge may be incorporated into claims or may be needed when labelling whether evidence supports or refutes claims.

Mutation True claims will be distorted or mutated as part of the claim generation workflow. This may be achieved by making the sentence negative, substituting words or ideas or by making words more or less specific. The annotation system will select which type of mutation will be used.

Requirements:

• Claims must reference the target entity directly and avoid use of pronouns/nominals (e.g. he, she, it, the country)

• Claims must not use speculative/cautious/vague language (e.g. may be, might be, it is reported that)

• True claims should only be facts that can be deduced by information given in the source sentence and dictionary

• Minor variations over the entity name are acceptable (e.g. Amazon River vs River Amazon)

Examples of true claims:

• Keanu Reeves has acted in a Shakespeare play

• The Assassin’s Creed game franchise was launched in 2007

• Prince Hamlet is the Prince of Denmark

• In 2004, the coach of the Argentinian men’s national field hockey team was Carlos Retegui

A.2 Task 1 (subtask 1) Guidelines

The objective of this task is to generate true claims from this source sentence that was extracted from Wikipedia.

• Extract simple factoid claims about entity given the source sentence.

• Use the source sentence and dictionary as the basis for your claims.

• Reference any entity directly (i.e. pronouns and nominals should not be used)

• Minor variations of names are acceptable (e.g. John F Kennedy, JFK, President Kennedy).

• Avoid vague or cautious language (e.g. might be, may be, could be, is reported that)

• Correct capitalisation of entity names should be followed (India, not india).

• Sentences should end with a period.

• Numbers can be formatted in any appropriate English format (including as words for smaller quantities).

• Some of the extracted text might not be accurate. These are still valid candidates for summary. It is not your job to fact check the information

World Knowledge

• Do not incorporate your own knowledge or beliefs.

• Additional world knowledge is given to you in the form of a dictionary. Use this to make more complex claims (we prefer using this dictionary instead of your own knowledge because the information in this dictionary can be backed up from Wikipedia)

• If the source sentence is not suitable, leave the box blank to skip.

• If a dictionary entry is not suitable or uninformative, ignore it.

Figure 4: Screenshot of annotation task 1 (subtask 1)

A.3 Task 1 (subtask 1) Examples

See Tables 7 and 8 for examples from the real data.

A.4 Task 1 (subtask 2) Guidelines

The objective of this task is to generate modifications to claims. The modifications can be either true or false. You will be given specific instructions about the types of modifications to make.

• Use the original claims and dictionary as the basis for your modifications to facts about entity

• Reference any entity directly (i.e. pronouns and nominals should not be used)

• Avoid vague or cautious language (e.g. might be, may be, could be, is reported that)

• Correct capitalisation of entity names should be followed (India, not india).

• Sentences should end with a period.

• Numbers can be formatted in any appropriate English format (including as words for smaller quantities).

• Some of the extracted text might not be accurate. These are still valid candidates for summary. It is not your job to fact check the information

Specific guidelines for this screen

• Aim to spend up to about 1 minute generating each claim.

• You are allowed to incorporate your own world knowledge in making these modifications and .

• All facts should reference any entity directly (i.e. pronouns and nominals should not be used).

• The mutations you produce should be objective (i.e. not subjective) and verifiable using information/knowledge that would be publicly available

• If it is not possible to generate facts or misinformation, leave the box blank.

There are six types of mutation the annotator will be asked to introduce. These will all be given on the same annotation page as all the claim modification types are related.

Entity: INDIA

Source Sentence: It shares land borders with Pakistan to the west; China, Nepal, and Bhutan to the northeast; and Myanmar (Burma) and Bangladesh to the east.

Dictionary:
  Bhutan: Bhutan, officially the Kingdom of Bhutan, is a landlocked country in Asia, and it is the smallest state located entirely within the Himalaya mountain range.
  China: China, officially the People’s Republic of China (PRC), is a unitary sovereign state in East Asia and the world’s most populous country, with a population of over 1.381 billion.
  Pakistan: Pakistan, officially the Islamic Republic of Pakistan, is a federal parliamentary republic in South Asia on the crossroads of Central and Western Asia.

Claims:
  - One of the land borders that India shares is with the world’s most populous country. (uses information from the dictionary entry for China)
  - India borders 6 countries. (summarises some of the information in the source sentence)
  - The Republic of India is situated between Pakistan and Burma. (deduced by Pakistan being West of India, and Burma being to the East)

Table 7: Task 1 (subtask 1) example: India

Entity: CANADA

Source Sentence: Canada is sparsely populated, the majority of its land territory being dominated by forest and tundra and the Rocky Mountains.

Dictionary:
  Province of Canada: The United Province of Canada, or the Province of Canada, or the United was a British colony in North America from 1841 to 1867.
  Rocky Mountains: The Rocky Mountains, commonly known as the Rockies, are a major mountain range in western North America.
  tundra: In physical geography, tundra is a type of biome where the tree growth is hindered by low temperatures and short growing season.

Claims:
  - The terrain in Canada is mostly forest and tundra.
  - Parts of Canada are subject to low temperatures.
  - Canada is in North America.
  - In some areas of Canada, it is difficult for trees to grow.

Table 8: Task 1 (subtask 1) example: Canada

Figure 5: Screenshots of annotation task 1 (subtask 2)

1. Rephrase the claim so that it has the same meaning

2. Negate the meaning of the claim

3. Substitute the verb and/or object in the claim to alternative from the same set of things

4. Substitute the verb and/or object in the claim to alternative from a different set of things

5. Make the claim more specific so that the new claim implies the original claim (by making the meaning more specific)

6. Make the claim more general so that the new claim can be implied by the original claim (by making the meaning less specific)

It may not always be possible to generate claims for each modification type. In this case, the box may be left blank.

A.5 Task 1 (subtask 2) Examples

The following example illustrates how, given a single source sentence, the following mutations could be made and why they are suitable. For the claim “Barack Obama toured the UK.”, Figure 6 shows the relations between objects, and Table 9 contains examples for each type of mutation.

Figure 6: Toy ontology to be used with the provided examples of similar and dissimilar mutations

Type: Rephrase
  Claim: President Obama visited some places in the .
  Rationale: Rephrased. Same meaning.

Type: Negate
  Claim: Obama has never been to the UK before.
  Rationale: Obama could not have toured the UK if he has never been there.

Type: Substitute Similar
  Claim: Barack Obama visited France.
  Rationale: Both the UK and France are countries

Type: Substitute Dissimilar
  Claim: Barack Obama attended the Whitehouse Correspondents Dinner.
  Rationale: In the claim, Barack Obama is visiting a country, whereas the dinner is a political event.

Type: More specific
  Claim: Barack Obama made a state visit to London.
  Rationale: London is in the UK. If Obama visited London, he must have visited the UK.

Type: More general
  Claim: Barack Obama visited a country in the EU.
  Rationale: The UK is in the EU. If Obama visited the UK, he visited an EU country.

Table 9: Example mutations

A.6 Task 2 Guidelines

The purpose of this task is to identify evidence from a Wikipedia page that can be used to support or refute simple factoid sentences called claims. The claims are generated by humans (as part of the WF1 annotation workflow) from a Wikipedia page. Some claims are true. Some claims are fake. You must find the evidence from the page that supports or refutes the claim.

Other Wikipedia pages will also provide additional information that can serve as evidence. For each line, we will provide extracts from the linked pages in the dictionary column which appear when you “Expand” the sentence. The sentences from these linked pages that contain relevant supplementary information should be individually selected to record which information is used in justifying your decisions.

Step-by-step guide:

1. Read and understand the claim

2. Read the Wikipedia page and identify sentences that contain relevant information.

3. On identifying a relevant sentence, press the Expand button to highlight it. This will load the dictionary and the buttons to annotate it:

   (a) If the highlighted sentence contains enough information in a definitive statement to support or refute the claim, press the Supports or Refutes button to add your annotation. No information from the dictionary is needed in this case (this includes information from the main Wikipedia page). Then continue annotating from step 2.

   (b) If the highlighted sentence contains some information supporting or refuting the claim but also needs supporting information, this can be added from the dictionary.

       i. The hyperlinked sentences from the passage are automatically added to the dictionary

       ii. If a sentence from the main Wikipedia article is needed to provide supporting information, click “Add Main Wikipedia Page” to add it to the dictionary.

       iii. If the claim or sentence contains an entity that is not in the dictionary, then a custom page can be added by clicking “Add Custom Page”. Use a search engine of your choice to find the page and then paste the Wikipedia URL into the box.

       iv. Tick the sentences from the dictionary that provide the minimal amount of supporting information needed to form your decision. If there are multiple equally relevant entries (such as a list of movies), then just select the first. Once all required information is added, then press the Supports or Refutes button to add your annotation and continue from step 2.

   (c) If the highlighted sentence and the dictionary do not contain enough information to support or refute the claim, press the Cancel button and continue from step 2 to identify more relevant sentences.

4. On reaching the end of the Wikipedia page, press Submit if you could find information that supports or refutes the claim. If you could not find any supporting evidence, press Skip then select Not enough information

The objective is to find sentences that support or refute the claim.

You must apply common-sense reasoning to the evidence you read but avoid applying your own world-knowledge by basing your decisions on the information presented in the Wikipedia page and dictionary.

As a guide, you should ask yourself: If I was given only the selected sentences, do I have stronger reason to believe the claim is true (supported) or stronger reason to believe the claim is false (refuted)? If I’m not certain, what additional information (dictionary) do I have to add to reach this conclusion?

A.7 Task 2 Examples

A.7.1 What does it mean to Support or Refute a claim

The following count as valid justifications for marking an item as supported/refuted:

Sentence directly states information that supports/refutes the claim or states information that is synonymous/antonymous with information in the claim

Claim: Water occurs artificially
Refuted by: “It also occurs in nature as snow, glaciers ...”

Claim: Samuel L. Jackson was in the third movie in the Die Hard film series.
Supported by: “He is a highly prolific actor, having appeared in over 100 films, including Die Hard 3.”

Sentence refutes the claim through negation or quantification

Claim: Schindler’s List received no awards.
Refuted by: “It was the recipient of seven Academy Awards (out of twelve nominations), including Best Picture, Best Director...”

Sentence provides information about a different entity and only one entity is permitted (e.g. place of birth can only be one place)

Claim: David Schwimmer finished acting in Friends in 2005.
Refuted by: “After the series finale of Friends in 2004, Schwimmer was cast as the title character in the 2005 drama Duane Hopwood.”

Sentence provides information that, in conjunction with other sentences from the dictionary, fulfils one of the above criteria

Claim: John McCain is a conservative.
Supported by: “He was the Republican nominee for the 2008 U.S. presidential election.” AND “The Republican Party’s current is American conservatism, which contrasts with the Democrats’ more progressive platform (also called modern liberalism).”

Adding Main Wikipedia Page to Dictionary

In the case where the claim can be supported from multiple sentences from the main Wikipedia page, information from the main Wikipedia page should be added to the dictionary to add supporting information. This is because each line that is annotated in the left column for the main Wikipedia page is stored independently.

Claim: George Washington was a soldier, born in 1732.
Wikipedia page: George Washington
Sentence 1: George Washington was an American politician and soldier who served as the first President of the United States from 1789 to 1797 and was one of the Founding Fathers of the United States.
Sentence 2: In 1775, the Second Continental Congress commissioned him as commander-in-chief of the Continental Army in the American Revolution.
Sentence 3: The Gregorian calendar was adopted within the British Empire in 1752, and it renders a birth date of February 22, 1732.

Sentence 1 contains enough information to wholly support the claim without the need for any additional information.

Sentences 2 and 3 contain partial information that can be combined. Expand sentence 2 and click add main Wikipedia page to add George Washington to the dictionary. Sentence 3 can now be added to the dictionary to support the claim.

The order of the sentences doesn’t matter (selecting sentence 2+3 is the same as adding sentence 3+2) because we sort the sentences in document order. This means that you only need to annotate this once.

If you attempt to add the main Wikipedia page to the dictionary from sentence 3 having already used it for sentence 2, the system will warn you that you are making a duplicate annotation.

Figure 7: Screenshot of annotation task 2

A.7.2 Adding Custom Pages

You may need to add a custom page from Wikipedia to the dictionary. This may happen in cases where the claim discusses an entity that was not in the original Wikipedia page

Claim: Colin Firth is a Gemini.
In Original Page: “Colin Firth (born 10 September 1960)...”
Requires Additional Information from Gemini: “Under the tropical zodiac, the sun transits this sign between May 21 and June 21.”

Tense The difference in verb tenses that do not affect the meaning should be ignored.

Claim: Frank Sinatra is a musician
Supported: He is one of the best-selling music artists of all time, having sold more than 150 million records worldwide.

Claim: Frank Sinatra is a musician
Supported: Francis Albert Sinatra was an American singer

A.7.3 Skipping

There may be times where it is appropriate to skip the claim by pressing the Skip button:

• The claim cannot be verified using the information provided:

  – If the claim could potentially be verified using other publicly available information, select Not Enough Information

  – If the claim can’t be verified using any publicly available information (because it’s ambiguous, vague, personal or implausible) select The claim is ambiguous or contains personal information

  – If the claim doesn’t meet the guidelines from WF1, select: The claim doesn’t meet the WF1 guidelines

• The claim contains typographical errors, spelling mistakes, is ungrammatical or could be fixed with a very minor change

  – Select The claim has a typo or grammatical error

Keep in mind that clicking Not Enough Information or The claim is ambiguous or contains personal information is still very useful feedback for the AI systems. They need examples of what a verifiable claim looks like, and negative examples are as useful (if not more so) than positive ones!

A.8 Task 2 additional guidelines

After conferring with the organizers, the annotators expanded the guidelines to include common cases that were not explicitly covered in the guidelines:

1. Any claims involving “many”, “several”, “rarely”, “barely” or other indeterminate count words are going to be ambiguous and can be flagged.

2. Same goes for “popular”, “famous”, and “successful” (for people; for works we can assume commercial success)

3. We cannot prove personhood for fictional characters like we can for real people (dogs and cats can be authors, actors, and citizens in fiction)

4. If a claim is “[Person] was in [film]”, the only way to refute it would be (a) if they were born after it was released, or (b) their acting debut is mentioned and occurs a realistically long enough amount of time after the film’s release (at least a few years).

5. A list of movies that someone was in or jobs a person held is not necessarily exclusive; we cannot refute someone being a lawyer because the first sentence of their wiki article says they were an actor.

6. A person is not their roles; if a claim is something like “Tom Cruise participated in a heist in Mission Impossible 3”, we cannot prove it, because Ethan Hunt did that, not Tom Cruise.

7. Our workflow is time-insensitive, but if a claim tags something to a time period, we can treat it as such. “Neil Armstrong is an astronaut” can be supported, but “Neil Armstrong was an astronaut in 2013” can be refuted, because he was dead at the time.

8. If someone won 5 Academy Awards, they won 3 Academy Awards. Similarly, if they won an Academy Award, they were nominated for an award.

9. Multiple citizenships can exist.

10. If a claim says “[Person] was in [film] in 2009”, then the film’s release date can support it. If the claim is “[Person] acted in [film] in 2009”, filming dates or release dates can prove it.

11. Flag anything related to death, large-scale recent disasters, and controversial religious or social statements.

B Manual Error Analysis

The manual error analysis was conducted with the decision process described in Figure 8. For the cases of finding new evidence, recommended actions have been listed. While these were not performed for this version of the dataset, this may form a future update following a pool-based evaluation in the FEVER Shared Task.

Figure 8: Manual Error Coding Process
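The error analysis separates predictions with a correct label from predictions that also satisfy the evidence requirement, the same distinction behind the gap between the evidence-aware accuracy and the NOSCOREEV accuracy. The sketch below shows one way the two figures could be computed. It is not the authors' scorer: the `Example` class, the function names, and in particular the rule that evidence counts as correct when at least one complete gold evidence group is contained in the predicted sentences are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# (page title, sentence index); an assumed representation for illustration.
Sentence = Tuple[str, int]


@dataclass
class Example:
    gold_label: str                         # SUPPORTED, REFUTED or NOTENOUGHINFO
    predicted_label: str
    gold_groups: List[FrozenSet[Sentence]]  # alternative complete evidence groups
    predicted_evidence: FrozenSet[Sentence]


def evidence_ok(ex: Example) -> bool:
    """Assumed rule: one complete gold group must appear in the predicted evidence."""
    if ex.gold_label == "NOTENOUGHINFO":
        return True  # no evidence is required for this class
    return any(group <= ex.predicted_evidence for group in ex.gold_groups)


def accuracies(examples: List[Example]) -> Tuple[float, float]:
    """Return (label-only accuracy, accuracy requiring correct evidence)."""
    if not examples:
        raise ValueError("need at least one example")
    label_hits = sum(ex.predicted_label == ex.gold_label for ex in examples)
    strict_hits = sum(
        ex.predicted_label == ex.gold_label and evidence_ok(ex) for ex in examples
    )
    return label_hits / len(examples), strict_hits / len(examples)
```

Under this assumed rule, a prediction with the right label but an incomplete evidence group counts towards the first figure only, which corresponds to the "correct predicted label but did not satisfy the requirements for evidence" category of the error analysis.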