Mining an “Anti-Knowledge Base” from Wikipedia Updates with Applications to Fact Checking and Beyond

Mining an “Anti-Knowledge Base” from Wikipedia Updates with Applications to Fact Checking and Beyond Georgios Karagiannis, Immanuel Xuezhi Wang, Cong Yu Trummer, Saehan Jo, Shubham Google, NY, USA Khandelwal fxuezhiw,[email protected] Cornell University, NY, USA fgk446,it224,sj683,[email protected] ABSTRACT mantic web and data mining. The research work has been We introduce the problem of anti-knowledge mining. Our impactful|leading to significant advancement in major search goal is to create an \anti-knowledge base" that contains fac- engine features such as Google's Knowledge Graph [26] and tual mistakes. The resulting data can be used for analysis, Bing's Satori [19]. Little attention, however, has been paid training, and benchmarking in the research domain of auto- to the large number of erroneous facts on the Web. While mated fact checking. Prior data sets feature manually gen- storing all the world's wrong facts would be infeasible and erated fact checks of famous misclaims. Instead, we focus not particularly useful, we argue that a knowledge base of on the long tail of factual mistakes made by Web authors, common factual mistakes, what we call an anti-knowledge ranging from erroneous sports results to incorrect capitals. base, can complement the traditional knowledge base for We mine mistakes automatically, by an unsupervised ap- many applications including, but not limited to, automated proach, from Wikipedia updates that correct factual mis- fact checking. Specifically, collecting common factual mistakes. Identifying such updates (only a small fraction of the takes enables us to develop learning algorithms for detect- total number of updates) is one of the primary challenges. ing long tail factual mistakes in a much more precise way We mine anti-knowledge by a multi-step pipeline. First, than verification against a traditional knowledge base can we filter out candidate updates via several simple heuris- accomplish. Furthermore, an anti-knowledge base can pro- tics. Next, we correlate Wikipedia updates with other state- vide values for the research community in conducting new ments made on the Web. Using claim occurrence frequen- studies such as how users make mistakes and designing novel cies as input to a probabilistic model, we infer the likelihood systems such as alerting users when they are working on of corrections via an iterative expectation-maximization ap- mistake-prone topics. proach. Finally, we extract mistakes in the form of subject- In this paper, we formally introduce the concept of an predicate-object triples and rank them according to several anti-knowledge base (AKB) and propose the first algorith- criteria. Our end result is a data set containing over 110,000 mic approach for anti-knowledge mining (AKM). Similar to ranked mistakes with a precision of 85% in the top 1% and traditional knowledge bases, an anti-knowledge base con- a precision of over 60% in the top 25%. We demonstrate tains hsubject, predicate, objecti triples with associated meta- that baselines achieve significantly lower precision. Also, we data, where each triple represents a factually mistaken claim exploit our data to verify several hypothesis on why users that has been asserted by users on the Web. Whenever pos- make mistakes. We finally show that the AKB can be used sible, we pair each mistaken triple with a correction triple in to find mistakes on the entire Web. the anti-knowledge base so that users get the correct knowledge as well. A formal and more elaborate definitions can PVLDB Reference Format: be found in Section 3. Georgios Karagiannis, Immanuel Trummer, Saehan Jo, Shubham At first, it may seem possible to mine mistakes by com- Khandelwal, Xuezhi Wang, Cong Yu. VLDB. PVLDB, 13(4): paring claims extracted from text against an existing knowl- xxxx-yyyy, 2019. DOI: https://doi.org/10.14778/3372716.3372727 edge base. As we will show in Section 8, this approach leads however to an excessive number of false positives (i.e., claims erroneously marked up as mistakes) due to incom- 1. INTRODUCTION plete knowledge bases and imperfect extraction. We could Constructing a knowledge base of high quality facts at leverage prior work on detecting conflicting statements on large scale is one of the main topics of many research com- the Web [28]. However, conflicting statements may simply munities ranging from natural language processing to se- indicate a controversial topic where different users have different opinions, rather than a factual mistake. Our key idea is to exploit a property that distinguishes factual mistakes This work is licensed under the Creative Commons Attribution- from generally accepted claims as well as subjective topics: NonCommercial-NoDerivatives 4.0 International License. To view a copy a diligent user community will correct factual mistakes once of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For they are spotted. any use beyond those covered by this license, obtain permission by emailing A static view on the Web is insufficient to leverage this [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. property. Hence, we assume that we are given an update Proceedings of the VLDB Endowment, Vol. 13, No. 4 log on a collection of documents. By comparing original ISSN 2150-8097. and updated text versions for each update, we obtain can- DOI: https://doi.org/10.14778/3372716.3372727 561 didates for factual mistakes. Of course, text updates may 2. RELATED WORK not only correct factual mistakes (but also expand text, cor- Prior efforts [29] at generating corpora for fact checking rect spelling mistakes, or update controversial statements based on the ClaimReview standard [24] leverage manual an- to reflect one author's personal opinion). One of the main notations by human fact checkers. This limits coverage and challenges solved by our approach is therefore to filter out size of generated data sets. Prior work on mining conflicting those updates that correct factual mistakes to form AKB Web sources has focused on extracting correct facts [31] or entries. To do so, we exploit a multi-stage pipeline com- subjective opinions [28] as opposed to mining factual mis- bining simple heuristics, a probabilistic model trained via takes with high precision. A large body of work in the area unsupervised learning, as well as feedback obtained from of fact checking is aimed at supporting human fact check- a static Web snapshot (counting the number of supporting ers in verifying specific claims or documents [9{13], e.g. by statements for text before and after updates). identifying relevant text documents for verification [29]. The proposed approach is fully automated and targeted at min- Example 1.1. Imagine the sentence \Heineken is Danish" ing factual mistakes from large document collections. Note is updated to \Heineken is Dutch". This update is correct- that tools such as ClaimBuster [11, 12] link claims to verify ing a factual mistake since Heineken is indeed based in the to previously annotated fact checks. Automatically gener- Netherlands (as can be learned from authoritative sources ated fact checks may extend the coverage of tools working such as the company's Web site). Our goal is to distinguish according to that principle. such updates from other types of updates. We use this exam- Knowledge Base Construction has received wide attention ple (an actual update mined by our approach) throughout in both academia and industry [7,18,25,27,33]. Open infor- most of the paper. mation extraction is a closely related area, where the goal For this paper, we leverage the Wikipedia update log as is to extract relational tuples from massive corpora without source for factual mistake candidates (while we exploit a a pre-specified vocabulary [1, 8]). Our work is complemen- snapshot of the entire Web for feedback). We select Wikipedia tary as we mine anti-knowledge. We apply our approach to due to its size, breadth of topics, and due to its large and ac- Wikipedia update logs. In that, our work connects to prior tive user community. Furthermore, mistakes that appear on work leveraging Wikipedia as data source (e.g., identifying Wikipedia tend to propagate further (we show some exam- relevant articles to controversial topics [23], identifying up- ples for mistakes propagating from Wikipedia to the rest of date intentions [30], or mining data sets for grammatical the Web in Section 8). Therefore, the potential impact (e.g., error corrections [16]). The goal of our work (as well as via fact checking tools) of mining mistakes from Wikipedia the method) differs however from all of those. Our work is is higher than for other Web sources. While we focus on also complementary to prior work leveraging different types Wikipedia in this paper, please note that the general ap- of data from the Wikipedia corpus (e.g., analyzing logs of proach we propose is not specific to Wikipedia and could structured Wikidata queries [3], as opposed to logs of up- also be applied to different sources of updates and static dates to natural language Wikipedia text). feedback (e.g., an update log of Web documents in a com- Our approach uses a probabilistic model and unsuper- pany's intranet). Finally, note that Wikipedia is imperfect vised learning (in one stage of the pipeline) to classify up- and may contain uncorrected mistakes. We will not be able dated sentences as factual mistakes, based on several in- to spot those mistakes by mining the update log. However, put features. In that, it connects in general to systems our goal is to mine common factual mistakes. Each occur- performing weakly supervised learning such as Snorkel [20] rence increases the likelihood that at least one corresponding and approaches for efficient inference, e.g.

Mining an “Anti-Knowledge Base” from Wikipedia Updates with Applications to Fact Checking and Beyond

Realising the Potential of Algal Biomass Production Through Semantic Web and Linked Data

From Cataloguing Cards to Semantic Web 1

Download Slides

Knowledge Extraction for Hybrid Question Answering

Datatone: Managing Ambiguity in Natural Language Interfaces for Data Visualization Tong Gao1, Mira Dontcheva2, Eytan Adar1, Zhicheng Liu2, Karrie Karahalios3

Directions in AI Research and Applications at Siemens Corporate

A Framework for Ontology-Based Library Data Generation, Access and Exploitation

Ontologies to Interpret Remote Sensing Images: Why Do We Need Them?

Question Answering

Ques+On Answering

The Negative Impact of A-Box Materialization on Rdf2vec Knowledge Graph Embeddings

Question Answering