Finding Evidence for Unsubstantiated Claims on Wikipedia
Masaryk University
Faculty of Informatics

Finding evidence for unsubstantiated claims on Wikipedia

Bachelor’s Thesis

Vojtěch Krajňanský

Brno, Fall 2017

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Vojtěch Krajňanský

Advisor: Mgr. et Mgr. Vít Baisa, Ph.D.

Acknowledgements

I would like to thank my advisor, Mgr. et Mgr. Vít Baisa, Ph.D., for his patient guidance, which proved invaluable in helping me finish this work. My thanks also belong to Mgr. Michal Krajňanský, who first sparked my interest in the field of computer science, and who provided me with helpful insights in our various discussions about computational linguistics and natural language processing. Finally, I would like to express my gratitude to Lorcan Cook for proofreading this work.

Abstract

The goal of this thesis is the creation of a Python script to help Wikipedia annotators semi-automate the process of source searching for claims which have been marked as requiring an external source. The result is a language-independent tool which parses a given set of Wikipedia articles, retrieves parts of text which have been deemed to be unsubstantiated claims, formulates a free text search query for each such claim, and retrieves relevant responses via the Bing Web Search API. It then generates an HTML report summarizing the results, presenting the user with an overview of suggested sources from which they can assess and choose the best sources for the claims.

Keywords

information retrieval, evidence finding, Wikipedia, query formulation, free text search, Python

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Outline
  1.3 Results
2 State-of-the-art and Related Work
3 Thesis Foundations
  3.1 Natural Language Processing and Information Retrieval
    3.1.1 Key Concepts
  3.2 Task Definition
4 Implementation
  4.1 Unsubstantiated Claim Identification
    4.1.1 Wiki markup
    4.1.2 Citation Needed Templates
    4.1.3 Claim Text Identification
    4.1.4 Quote Templates
  4.2 Query Formulation
    4.2.1 Assumptions
    4.2.2 Term Scoring
    4.2.3 Keyword Extraction
  4.3 Web Search Engine
  4.4 Website Relevance Assessment
    4.4.1 Copycat Websites
  4.5 Report Generation
  4.6 Tool Usage
    4.6.1 Configuration File
5 Performance Evaluation
  5.1 Evaluation Metric
  5.2 Results
6 Summary
Bibliography
A Project Structure
B Blacklisted Sites
C Term Score Cutoff Estimation

List of Figures

4.1 A citation needed template after claim text
4.2 A citation needed template before a list

1 Introduction

1.1 Motivation

Today, Wikipedia (www.wikipedia.org) is one of the most commonly accessed sources of factual information for people around the world.
As of April 18, 2017, Alexa.com, a website that aggregates information about internet traffic, ranks Wikipedia as the 5th most popular website, globally [1]. This makes Wikipedia one of the most influential websites in the world, possibly even capable of affecting public opinion on certain issues.

As of April 18, 2017, the English-language edition of Wikipedia contained 5,386,775 articles, with an increase of over 20,000 new articles each month [2], and over 30 million users have registered a username. Out of these users, only a fraction contributes to the site’s maintenance regularly. In February 2015, only about 12,000 editors had an edit count of at least 600 overall and 50 in the last six months [3]. These relatively few volunteers’ responsibilities range from fixing typographic errors and policing articles for vandalism to resolving disputes and perfecting content. They are the ones making sure that Wikipedia articles abide by its core sourcing policy, which has been previously described as “verifiability, not truth” [4]. That is, they ought to present only material that has been previously published by a reliable source and describe leading opinions on disputed issues in a neutral way.

Verifiability of information available on the Internet has recently become a much discussed topic in various social circles, with mainstream western media often calling our era the post-truth age. The term post-truth has in fact been named the 2016 international word of the year by Oxford Dictionaries [5]. Wikipedia editors have a duty to ensure that the information presented is as up to date as possible, while referencing credible sources.

Various software robots have been created to help with the more mundane tasks of the site’s maintenance, such as vandalism identification, spell checking, etc. [6] However, finding reliable evidence for disputed claims still remains a human dominated manual task.

The goal of this work is to create and present a language independent tool to help semi-automate the process of evidence finding for unsubstantiated claims in order to ease and accelerate it, saving time and energy of the human party involved. The tool should parse a given set of articles, find claims that have been identified as requiring an external source, and subsequently search the Internet for electronic material which could serve as a credible reference. A collection of candidate sources is then presented to the user, ranked by estimated relevance, from which they are able to choose the best result as they see fit. Different lightweight, language independent approaches to query formulation from natural language text are explored in an effort to maximize the precision of retrieved source candidates for a given claim.

1.2 Thesis Outline

Chapter 2 describes recent work related to this thesis. Chapter 3 introduces the key concepts used throughout the thesis, and precisely defines the task of source searching for unsubstantiated claims found on Wikipedia. In Chapter 4, the proposed solution for the task and its implementation are described. Chapter 5 introduces a quality metric against which the presented tool’s success rate is measured and provides an overview of its experimental evaluation. Finally, in Chapter 6, the results of the work are summarized and various methods for improvement are suggested.
1.3 Results

The result of this work is a script written in the Python programming language, which provides the user with a simple means to automatically search for a set of web documents presumed to contain factual information supporting claims which have been marked by Wikipedia editors as requiring an additional external source. The user is first required to create a dictionary of word–document frequency mappings for a text corpus created from the whole Wikipedia article dump of a language of their choice. Afterwards, the user provides a dump of articles of their choice to search for unsubstantiated claims as marked by editors and to find relevant sources for these claims using the Bing Web Search API. Both the dictionary creation and source search steps are performed by running a simple command on the command line.

The tool is able to formulate free text queries from natural language text for which the Google search engine returns a response containing a factual source for a given claim in more than 60% of cases, tested on a set of 50 English and 20 Czech articles. The Bing API shows a worse performance for the queries, returning a factual source for 46% of the English claims, and a mere 15% of the Czech ones. I try to explain the reason for such a large difference between the Bing API success rates for the two languages in Section 5.2.

The script is divided into 6 Python modules, each of which handles one of the responsibilities necessitated by the task. All code and related files are publicly available in a GitHub repository at https://github.com/Vocco/wikifinder.

2 State-of-the-art and Related Work

No tool currently exists to help Wikipedia annotators automate the process of finding evidence in the form of electronic documents/webpages for unsubstantiated claims. The means of creating one, however, lie in techniques and theory established by the scientific community, especially in the field of information retrieval. The tasks such a tool must handle can be summarized as follows: Wiki markup and plain text parsing, semantic and topical analysis, query formulation from natural language text, and relevance assessment for found sources.

Wiki markup is the syntax and keywords used by the MediaWiki software to format a page [7]. Although not meant primarily as a Wiki markup parser, the gensim Python library has been chosen as the base framework for Wikipedia processing and query formulation, as it has been shown to efficiently compute bag-of-words (BOW) representations of documents on Wikipedia dumps [8] and can be easily modified to detect the citation needed templates used by Wiki markup to identify claims that need to be verified. It also provides several utility functions such as computing the cosine similarity of two documents, which is used during the website relevance assessment process.

Once a portion of text has been identified as a claim that needs to be supported by an outside source, it needs to be analyzed and reformulated into a query for a web search engine.
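The gensim facilities mentioned above can be illustrated with a minimal, self-contained sketch. The toy documents and variable names below are illustrative only and are not taken from the thesis code; for full Wikipedia dumps, gensim additionally offers a streaming WikiCorpus reader.

    # Illustrative only: build bag-of-words (BOW) vectors with gensim and
    # compare two documents by cosine similarity, as described above.
    from gensim import corpora, matutils

    documents = [
        "wikipedia is a free online encyclopedia",
        "editors verify claims against reliable published sources",
        "a free encyclopedia relies on reliable published sources",
    ]
    tokenized = [doc.split() for doc in documents]

    # Map each distinct token to an integer id, then convert every document
    # into a sparse BOW vector of (token id, count) pairs.
    dictionary = corpora.Dictionary(tokenized)
    bow_vectors = [dictionary.doc2bow(tokens) for tokens in tokenized]

    # Cosine similarity between the second and third BOW vectors.
    print(matutils.cossim(bow_vectors[1], bow_vectors[2]))

In the same spirit, a claim’s text and a retrieved web page can be represented as BOW vectors and compared by cosine similarity during website relevance assessment.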
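Once such a query has been formulated, retrieving candidate sources amounts to a single call to a search endpoint. The sketch below shows one possible way to query the Bing Web Search API (v7 REST interface) with the requests library; the endpoint URL, header, and response fields follow Microsoft’s public documentation rather than the thesis implementation, and the subscription key is a placeholder.

    # Hypothetical sketch of a Bing Web Search API (v7) request; not the
    # thesis implementation. BING_API_KEY is a placeholder value.
    import requests

    BING_ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/search"
    BING_API_KEY = "<your-subscription-key>"

    def search_candidate_sources(query, count=10):
        """Return (name, url, snippet) triples for the top web results."""
        response = requests.get(
            BING_ENDPOINT,
            headers={"Ocp-Apim-Subscription-Key": BING_API_KEY},
            params={"q": query, "count": count},
            timeout=10,
        )
        response.raise_for_status()
        pages = response.json().get("webPages", {}).get("value", [])
        return [(page["name"], page["url"], page["snippet"]) for page in pages]

    for name, url, snippet in search_candidate_sources("post-truth word of the year 2016"):
        print(name, url)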