HOVER: A Dataset for Many-Hop Fact Extraction And Claim Verification

Yichen Jiang†∗  Shikha Bordia‡∗  Zheng Zhong‡  Charles Dognin‡  Maneesh Singh‡  Mohit Bansal†
†UNC Chapel Hill  ‡Verisk Analytics, Inc.
{yichenj, [email protected]
{shikha.bordia, zheng.zhong, charles.dognin, [email protected]

Abstract

We introduce HOVER (HOppy VERification), a dataset for many-hop evidence extraction and fact verification. It challenges models to extract facts from several Wikipedia articles that are relevant to a claim and classify whether the claim is SUPPORTED or NOT-SUPPORTED by the facts. In HOVER, the claims require evidence to be extracted from as many as four English Wikipedia articles and embody reasoning graphs of diverse shapes. Moreover, most of the 3/4-hop claims are written in multiple sentences, which adds to the complexity of understanding long-range dependency relations such as coreference. We show that the performance of an existing state-of-the-art semantic-matching model degrades significantly on our dataset as the number of reasoning hops increases.

… while real-world “claims” might refer to information from multiple sources. QA datasets like HOTPOTQA (Yang et al., 2018) and QAngaroo (Welbl et al., 2018) represent the first efforts to challenge models to reason over information from at most three documents. However, Chen and Durrett (2019) and Min et al. (2019) show that single-hop models can achieve good results on these multi-hop datasets. Moreover, most models were also shown to degrade under adversarial evaluation (Perez et al., 2020), in which word-matching reasoning shortcuts are suppressed by extra adversarial documents (Jiang and Bansal, 2019). In the HOTPOTQA open-domain setting, the two supporting documents can be accurately retrieved by a neural model exploiting a single hyperlink (Nie et al., 2019b; Asai et al., 2020).
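To make the task format concrete, the following is a minimal Python sketch of what a single HOVER instance might look like, based only on the description above. The field names (claim, label, num_hops, supporting_facts) and the example values are illustrative assumptions, not necessarily the released schema.

    # Hypothetical HOVER instance, sketched from the task description above.
    # Field names and values are illustrative, not the official data format.
    example = {
        "claim": "The director of this film was born in the same city as "
                 "the founder of the studio that produced it.",
        "label": "NOT_SUPPORTED",      # binary: SUPPORTED / NOT_SUPPORTED
        "num_hops": 3,                 # evidence spans up to four Wikipedia articles
        "supporting_facts": [          # (article title, sentence index) pairs to extract
            ("Film X", 0),
            ("Director Y", 1),
            ("Studio Z", 2),
        ],
    }

Under this reading, a system is evaluated on two subtasks: extracting the supporting sentences from the relevant articles, and classifying the claim against the extracted evidence.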