arXiv:2011.03088v2 [cs.CL] 16 Nov 2020

HOVER: A Dataset for Many-Hop Fact Extraction And Claim Verification

Yichen Jiang†∗ Shikha Bordia‡∗ Zheng Zhong‡ Charles Dognin‡ Maneesh Singh‡ Mohit Bansal†
†UNC Chapel Hill ‡Verisk Analytics, Inc.
shikha.bordia, zheng.zhong, charles.dognin, [email protected]
yichenj, [email protected]

∗ Equal contribution.

Abstract

We introduce HOVER (HOppy VERification), a dataset for many-hop evidence extraction and fact verification. It challenges models to extract facts from several Wikipedia articles that are relevant to a claim and classify whether the claim is SUPPORTED or NOT-SUPPORTED by the facts. In HOVER, the claims require evidence to be extracted from as many as four English Wikipedia articles and embody reasoning graphs of diverse shapes. Moreover, most of the 3/4-hop claims are written in multiple sentences, which adds to the complexity of understanding long-range dependency relations such as coreference. We show that the performance of an existing state-of-the-art semantic-matching model degrades significantly on our dataset as the number of reasoning hops increases, hence demonstrating the necessity of many-hop reasoning to achieve strong results. We hope that the introduction of this challenging dataset and the accompanying evaluation task will encourage research in many-hop fact retrieval and information verification.¹

¹ We make the HOVER dataset publicly available at https://hover-nlp.github.io

1 Introduction

The proliferation of social media platforms and digital content has been accompanied by a rise in deliberate disinformation and hoaxes, leading to polarized opinions among masses. With the increasing number of inexact statements, there is a large interest in a fact-checking system that can verify claims based on automatically retrieved facts and evidence. FEVER (Thorne et al., 2018) is an open-domain fact extraction and verification dataset closely related to this real-world application. However, more than 87% of the claims in FEVER require information from a single Wikipedia article, while real-world “claims” might refer to information from multiple sources. QA datasets like HOTPOTQA (Yang et al., 2018) and QAngaroo (Welbl et al., 2018) represent the first efforts to challenge models to reason with information from three documents at most. However, Chen and Durrett (2019) and Min et al. (2019) show that single-hop models can achieve good results in these multi-hop datasets. Moreover, most models were also shown to degrade in adversarial evaluation (Perez et al., 2020), where word-matching reasoning shortcuts are suppressed by extra adversarial documents (Jiang and Bansal, 2019). In the HOTPOTQA open-domain setting, the two supporting documents can be accurately retrieved by a neural model exploiting a single hyperlink (Nie et al., 2019b; Asai et al., 2020).

Hence, while providing very useful starting points for the community, FEVER is mostly restricted to a single-hop setting and existing multi-hop QA datasets are limited by the number of reasoning steps and the word overlapping between the question and all evidence. An ideal multi-hop example should have at least one piece of evidence (supporting document) that cannot be retrieved with high precision by shallowly performing direct semantic matching with only the claim. Instead, uncovering this document requires information from previously retrieved documents. In this paper, we try to address these issues by creating HOVER (i.e., HOppy VERification) whose claims (1) require evidence from as many as four English Wikipedia articles and (2) contain significantly less semantic overlap between the claims and some supporting documents to avoid reasoning shortcuts. We create HOVER with 26k claims in three stages. In stage 1 (left box in Fig. 1), we ask a group of trained and evaluated crowd-workers to rewrite the question-answer pairs from HOTPOTQA (Yang et al., 2018) into claims that mention facts from two English Wikipedia articles. We then introduce extra hops² to a subset of these 2-hop claims by asking crowd-workers to substitute an entity in the claim with information from another English Wikipedia article that describes the original entity. We then repeat this process on these 3-hop claims to further create 4-hop claims. To make many-hop claims more natural and readable, we encourage crowd-workers to write the 3/4-hop claims in multiple sentences and connect them using coreference. An entire evolution history from 2-hop claims to 3/4-hop claims is presented in the leftmost box in Fig. 1 and Table 1, where the latter further presents the reasoning graphs of various shapes embodied by the many-hop claims.

² The number of hops of a claim is the same as the number of supporting documents for this claim.

#H  Examples
2   Claim: Patrick Carpentier currently drives a Ford Fusion, introduced for model year 2006, in the NASCAR Sprint Cup Series.
    Doc A: Ford Fusion is manufactured and marketed by Ford. Introduced for the 2006 model year, ...
    Doc B: Patrick Carpentier competed in the NASCAR Sprint Cup Series, driving the Ford Fusion.
3   Claim: The Ford Fusion was introduced for model year 2006. The Rookie of The Year in the 1997 CART season drives it in the NASCAR Sprint Cup Series.
    Doc C: The 1997 CART PPG World Series season, the nineteenth in the CART era of U.S. open-wheel racing, consisted of 17 races, ... Rookie of the Year was Patrick Carpentier.
4   Claim: The model of car Trevor Bayne drives was introduced for model year 2006. The Rookie of The Year in the 1997 CART season drives it in the NASCAR Sprint Cup.
    Doc D: Trevor Bayne is an American professional stock car racing driver. He last competed in the NASCAR Cup Series, driving the No. 6 Ford Fusion...
4   Claim: The Ford Fusion was introduced for model year 2006. It was driven in the NASCAR Sprint Cup Series by The Rookie of The Year of a Cart season, in which the 1997 Marlboro 500 was the 17th and last round.
    Doc D: The 1997 Marlboro 500 was the 17th and last round of the 1997 CART season...
4   Claim: The Ford Fusion was introduced for model year 2006. The Rookie of The Year in the 1997 CART season drives it in the series held by the group that held an event at the Saugus Speedway.
    Doc D: Saugus Speedway is a 1/3 mile racetrack in Saugus, California on a 35 acre site. The track hosted one NASCAR Craftsman Truck Series event in 1995...
(Reasoning Graph column: one graph over Docs A, B, C, D per row; drawings not reproduced.)

Table 1: Types (graph shape) of many-hop reasoning required to extract the evidence and to verify the claim in the dataset. All claims presented are created and extended based on a single Q-A pair in HOTPOTQA. The highlighted (blue+underlined) words from the original 2/3-hop claims are replaced with the italicized phrase based on the information from the newly-introduced Docs to form the 3/4-hop claims.

In stage 2 (the central box in Fig. 1), we create claims that are not supported by the evidence by mutating the claims collected in stage 1 with a combination of automatic word/entity substitution and human editing. Specifically, we ask the trained crowd-workers to rewrite a claim by making it either more specific/general than or negating the original claim. We ensure the quality of the machine-generated claims using human validation detailed in Sec. 2.2. In stage 3, we follow Thorne et al. (2018) to label the claims as SUPPORTED, REFUTED, or NOTENOUGHINFO. However, we find that the decision between REFUTED and NOTENOUGHINFO can be ambiguous in many-hop claims and even the high-quality, trained annotators from Appen, instead of Mturk, cannot consistently choose the correct label from these two classes. Recent works (Pavlick and Kwiatkowski, 2019; Chen et al., 2020a) have raised concern over the uncertainty of NLI tasks with categorical labels and proposed to shift to a probabilistic scale. Since this work is mainly targeting the many-hop retrieval, we combine the REFUTED and NOTENOUGHINFO into a single class, namely NOT-SUPPORTED. This binary classification task is still challenging for models given the incomplete evidence retrieved, as we will explain later.

Next, we introduce the baseline system and demonstrate its limited ability in addressing many-hop claims. Following a state-of-the-art system (Nie et al., 2019a) for FEVER, we build the baseline with a TF-IDF document retrieval stage and three BERT models fine-tuned to conduct document retrieval, sentence selection, and claim verification respectively. We show that the bi-gram TF-IDF (Chen et al., 2017)'s top-100 retrieved documents can only recover all supporting documents in 80% of 2-hop claims, 39% of 3-hop claims, and 15% of 4-hop claims. The performance of down-

Figure 1: Data Collection flow chart for HOVER.
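The retrieval figures quoted in the introduction (80%, 39%, and 15% for 2-, 3-, and 4-hop claims) count a claim as recovered only when every one of its supporting documents appears among the top-100 retrieved documents. The sketch below illustrates one way such a per-hop rate could be computed, along with the merge of REFUTED and NOTENOUGHINFO into NOT-SUPPORTED. It is not the authors' evaluation code: the function names, the dictionary keys `supporting_docs` and `retrieved`, and the toy claims are assumptions made for illustration and do not reflect the released dataset's schema.

```python
from collections import defaultdict

# Three-way labels as named in the text; the binary target merges two of them.
THREE_WAY_LABELS = {"SUPPORTED", "REFUTED", "NOTENOUGHINFO"}


def collapse_label(label):
    """Map a three-way label to the binary SUPPORTED / NOT-SUPPORTED scheme."""
    if label not in THREE_WAY_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return "SUPPORTED" if label == "SUPPORTED" else "NOT-SUPPORTED"


def full_recovery_rate_by_hops(claims, k=100):
    """Fraction of claims, per hop count, whose supporting documents all
    appear in the top-k retrieved documents.

    Each claim is a dict with hypothetical keys:
      'supporting_docs': set of gold document titles
      'retrieved':       ranked list of retrieved document titles
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for claim in claims:
        gold = set(claim["supporting_docs"])
        # Hop count equals the number of supporting documents (footnote 2).
        hops = len(gold)
        top_k = set(claim["retrieved"][:k])
        totals[hops] += 1
        # A claim counts only if no supporting document is missing.
        if gold <= top_k:
            hits[hops] += 1
    return {hops: hits[hops] / totals[hops] for hops in sorted(totals)}


# Toy claims built from the document titles in Table 1 (illustrative only).
toy_claims = [
    {
        "supporting_docs": {"Ford Fusion", "Patrick Carpentier"},
        "retrieved": ["Ford Fusion", "Patrick Carpentier", "NASCAR"],
    },
    {
        "supporting_docs": {"Ford Fusion", "Patrick Carpentier", "1997 CART season"},
        "retrieved": ["Ford Fusion", "NASCAR", "Patrick Carpentier"],
    },
]

toy_rates = full_recovery_rate_by_hops(toy_claims, k=3)
print(toy_rates)
```

Requiring the full gold set, rather than averaging per-document recall, matches the wording "recover all supporting documents": missing a single hop leaves the evidence chain incomplete for verification.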
