<<

A Machine-Learning-Based Pipeline Approach to Automated Fact-Checking Hanselowski, Andreas (2020)

DOI (TUprints): https://doi.org/10.25534/tuprints-00014136

License:

CC BY-SA 4.0 International (Creative Commons Attribution-ShareAlike)
Publication type: Dissertation
Department: 20 Department of Computer Science
Source of the original: https://tuprints.ulb.tu-darmstadt.de/14136

A Machine-Learning-Based Pipeline Approach to Automated Fact-Checking

Dissertation approved by the Department of Computer Science of the Technische Universität Darmstadt

in fulfillment of the requirements for the degree of Dr. rer. nat.

submitted by Dr.-Ing. Andreas Hanselowski, born in Sokuluk

Date of submission: 17 March 2020
Date of the defense: 14 May 2020

Referees: Prof. Dr. Iryna Gurevych, Darmstadt; Prof. Chris Reed, Dundee, U.K.

Darmstadt 2020, D17

Hanselowski, Andreas: A Machine-Learning-Based Pipeline Approach to Automated Fact-Checking. Darmstadt, Technische Universität Darmstadt. Year of publication of the dissertation on TUprints: 2020. URN: urn:nbn:de:tuda-tuprints-141365. Date of the oral examination: 14 May 2020. Published under CC BY-SA 4.0 International, https://creativecommons.org/licenses/

To my beloved wife, Lidiia

Academic Career of the Author1

09/06–08/09 Bachelor's studies (B.Eng.) in Mechanical Engineering at the Hochschule für Technik, Wirtschaft und Gestaltung (HTWG) Konstanz

09/09–11/11 Master's studies (M.Sc.) in Computational Mechanics of Materials and Structures (COMMAS) at the Universität Stuttgart

12/11–09/16 Doctoral studies (Dr.-Ing.) at the Institut für Technische und Numerische Mechanik (ITM) of the Universität Stuttgart

10/16–09/19 Doctoral studies (Dr. rer. nat.) at the Ubiquitous Knowledge Processing (UKP) Lab of the Technische Universität Darmstadt

1In accordance with §20(3) of the doctoral regulations of TU Darmstadt

Abstract

In the past couple of years, there has been a significant increase in the amount of false information on the web. The falsehoods quickly spread through social networks, reaching a wider audience than ever before. This poses new challenges to our society, as we have to reevaluate which information sources we should trust and how we consume and distribute content on the web. As a response to the rising amount of false information on the Internet, the number of fact-checking platforms has increased. On these platforms, professional fact-checkers validate the published information and make their conclusions publicly available. Nevertheless, the manual validation of information by fact-checkers is laborious and time-consuming, and as a result, not all of the published content can be validated. Since the conclusions of the validations are released with a delay, the interest in the topic has often already declined, and thus, only a small fraction of the original news consumers can be reached.

Automated fact-checking holds the promise to address these drawbacks, as it would allow fact-checkers to identify and eliminate false information as it appears on the web and before it reaches a wide audience. However, despite significant progress in the field of automated fact-checking, substantial challenges remain: (i) The datasets available for training machine-learning-based fact-checking systems do not provide high-quality annotations of real fact-checking instances for all the tasks in the fact-checking process. (ii) Many of today's fact-checking systems are based on knowledge bases that have low coverage. Moreover, because for these systems sentences in natural language need to be transformed into formal queries, which is a difficult task, the systems are error-prone. (iii) Current end-to-end trained machine learning systems can process raw text and thus potentially harness the vast amount of knowledge on the Internet, but they are intransparent and do not reach the desired performance. In fact, fact-checking is a challenging task, and today's machine learning approaches are not mature enough to solve the problem without human assistance.

In order to tackle the identified challenges, in this thesis, we make the following contributions: (1) We introduce a new corpus based on the Snopes fact-checking website that contains real fact-checking instances and provides high-quality annotations for the different sub-tasks in the fact-checking process. In addition to the corpus, we release our corpus creation approach, which allows for efficiently creating large datasets with a high inter-annotator agreement in order to train machine learning models for automated fact-checking. (2) In order to address the drawbacks of current automated fact-checking systems, we propose a pipeline approach that consists of four sub-systems: document retrieval, stance detection, evidence extraction, and claim validation. Since today's machine learning models are not advanced enough to complete the task without human assistance, our pipeline approach is designed to help fact-checkers speed up the fact-checking process rather than taking over the job entirely. Our pipeline is able to process raw text and thus make use of the large amount of textual information available on the web; at the same time, it is transparent, as the outputs of the sub-components of the pipeline can be observed. Thus, the different parts of the fact-checking process are automated, and potential errors can be identified and traced back to their origin.

(3) In order to assess the performance of the developed system, we evaluate the sub-components of the pipeline in highly competitive shared tasks. The stance detection component of the system is evaluated in the Fake News Challenge, reaching the second rank out of 50 competing systems.2 The document retrieval component together with the evidence extraction sub-system and the claim validation component are evaluated in the FEVER shared task.3 The first two systems combined reach the first rank in the FEVER shared task Sentence Ranking sub-task, outperforming 23 other competing systems. The claim validation component reaches the third rank in the FEVER Recognizing Textual Entailment sub-task. (4) We evaluate our pipeline system, as well as other promising machine learning models for automated fact-checking, on our newly constructed Snopes fact-checking corpus. The results show that even though the systems are able to reach reasonable performance on other datasets, they under-perform on our newly created corpus. Our analysis reveals that the more realistic fact-checking problem setting defined by our corpus is more challenging than the problem setting posed by other fact-checking corpora. We therefore conclude that further research is required in order to increase the performance of automated systems in real fact-checking scenarios.

2http://www.fakenewschallenge.org/ 3http://fever.ai/2018/task.html

Zusammenfassung

In recent years, the amount of false information on the Internet has increased sharply. False information spreads very quickly through social networks and thereby reaches a larger readership than ever before. This poses new challenges to our society, since we have to reassess which sources of information we can trust and how we consume web content and share it with others. As a response to the growing amount of false information on the Internet, the number of fact-checking organizations has risen considerably. On these platforms, professional fact-checkers validate published information and make the results of their investigations publicly available. However, the manual validation of information by fact-checkers is very labor-intensive and time-consuming. As a consequence, not all content can be checked, and for validated content the analysis is often published with a delay. By that time, interest in the topic has in many cases already declined, so that only a fraction of the original readership can be reached.

Automated fact-checking has the potential to solve these problems, because it could enable fact-checkers to recognize and remove false information before it reaches a wide audience. Despite the substantial progress in this field, several challenges still have to be overcome before automated fact-checking becomes usable under real-world conditions: (i) The datasets available for training machine-learning-based fact-checking systems lack high-quality annotations of real fact-checking cases for all sub-tasks of the fact-checking process. (ii) Many of today's fact-checking systems are based on knowledge bases that cover only a relatively small number of facts, and because such systems require sentences in natural language to be transformed into formal queries, they are error-prone. (iii) Modern machine-learning-based systems trained in an end-to-end fashion can process text in natural language and thereby potentially exploit the large amount of information on the Internet. These systems, however, are intransparent and do not reach the desired performance. Indeed, fact-checking is a demanding task, and modern machine-learning-based systems are not mature enough to solve the problem entirely without human assistance.

In order to address the identified challenges, we make the following contributions in this thesis: (1) We construct a new corpus that is based on the Snopes fact-checking platform. This corpus contains real fact-checking cases enriched with high-quality annotations for the different sub-tasks within the fact-checking process. Furthermore, we publish our approach for the efficient construction of large datasets that are suitable for training models for automated fact-checking. (2) To address the shortcomings of today's fact-checking systems, we propose a new pipeline approach in this thesis that consists of the following four components: document retrieval, stance detection, evidence extraction, and claim validation.

Because today's machine-learning-based systems are not yet mature enough to solve the fact-checking problem on their own, our pipeline approach has been designed specifically to support fact-checkers in their work rather than to carry out the entire fact-checking process autonomously. Our pipeline approach is able to process natural language and thereby exploit the large amount of textual information available on the Internet. At the same time, our system is transparent, since the outputs of the intermediate systems in the pipeline can be inspected. This makes it possible to automate individual tasks in the fact-checking process while recognizing potential errors and tracing them back to their origin.

(3) To test the performance of the sub-components of the pipeline, we evaluate them in several highly competitive international challenges. The stance detection component of the pipeline reaches the second rank among 50 competing systems in the Fake News Challenge.4 The document retrieval component, the evidence extraction component, and the claim validation component are evaluated in the FEVER shared task.5 The first two components combined reach the first rank in the FEVER shared task Sentence Ranking sub-task. The claim validation component reaches the third rank in the FEVER Recognizing Textual Entailment sub-task.

(4) We evaluate our pipeline system, as well as other capable models that have been developed for automated fact-checking, on our newly constructed Snopes fact-checking corpus. The results show that although the systems achieve good results on other corpora, their performance on our corpus is relatively low. Our analysis shows that the realistic problem setting defined by our corpus is considerably more difficult than the fact-checking problem settings defined by the other corpora. We conclude from this that further research is necessary in order to increase the performance of automated systems in realistic fact-checking scenarios.

4http://www.fakenewschallenge.org/ 5http://fever.ai/2018/task.html

Acknowledgments

The doctoral program and ultimately this work have received extraordinary support from many people, for which I am very grateful and to whom I would like to express my gratitude. I would like to thank Prof. Iryna Gurevych for giving me the opportunity to complete my doctoral thesis under her supervision and to fulfill my long-term goal of becoming a researcher of machine learning and language. Her guidance in research projects, as well as her assistance in getting connected with other researchers, was extraordinarily helpful in making the doctoral program a success.

There were two competitions at the core of my doctoral research, the Fake News Challenge and the FEVER shared task, and it was here where my work had the most impact and where I experienced the most pleasure in doing research. The projects would never have been so successful without the people who joined them, and I am very grateful to have had the opportunity to collaborate with such talented researchers. Therefore, I would like to thank Benjamin Schiller, Avinesh PVS, Felix Caspelherr, and Debanjan Chaudhuri for participating with me in the Fake News Challenge, and Hao Zhang, Zile Li, Daniil Sorokin, and Claudia Schulz for taking part with me in the FEVER shared task.

I also would like to thank my UKP colleagues for fruitful discussions of my presentations, for helpful feedback on my paper drafts, but also for their friendship. In particular, I would like to thank my collaborators in various projects for their support and guidance, namely Christian Mayer, Christian Stab, Christopher Tauchmann, and Thomas Arnold. The AIPHES research group was a great source of inspiration. I have very much enjoyed our controversial reading groups, the research retreats, and the Friday philosophical coffees. Therefore, I would like to thank all members of AIPHES and in particular Teresa Botschen, Markus Zopf, Tobias Falke, Ana Marasovic, Todor Mihaylov, Beniamin Heinzerling, Fabrizio Ventola, Aicha Diallo, Shweta Mahajan, Leonardo Filipe Rodrigues Ribeiro, Ajie Utama, Federico López, Wei Zhao, Debjit Paul, and Benjamin Hättasch.

Finally, I also would like to thank my parents for their support during demanding periods of the doctoral project; their pride has served as a valuable motivation to pursue my research goals.


Contents

1 Introduction and Background
  1.1 False information and its proliferation
    1.1.1 Different kinds of false information
    1.1.2 Proliferation of false information on the Internet
  1.2 Fact-checking
    1.2.1 Manual fact-checking
    1.2.2 Automated fact-checking
    1.2.3 Tasks related to automated fact-checking
  1.3 Chapter summary

2 Contributions of the thesis
  2.1 Contributions and research questions
    2.1.1 Contributions
    2.1.2 Research questions
  2.2 A pipeline approach to automated fact-checking
  2.3 Publication record and thesis outline

3 Corpus construction
  3.1 Existing corpora
    3.1.1 Overview and discussion of existing corpora
    3.1.2 Highlighted corpora
  3.2 Snopes fact-checking corpus (Snopes19)
    3.2.1 Corpus construction
    3.2.2 Corpus analysis
  3.3 Discussion of the annotation models
  3.4 Chapter summary

4 Document retrieval
  4.1 Problem setting
  4.2 Related work
  4.3 Document retrieval systems
    4.3.1 Entity linking for Wikidata
    4.3.2 Entity linking based document retrieval for Wikipedia
  4.4 Experiments
  4.5 Error analysis
  4.6 Chapter summary


5 Stance detection
  5.1 Problem setting
  5.2 Related work
    5.2.1 Target-specific stance detection
    5.2.2 Document-level stance detection
    5.2.3 FNC stance detection problem
    5.2.4 Summary
  5.3 Analysis of the FNC stance detection task
    5.3.1 Participating systems
    5.3.2 Reproduction of the results
    5.3.3 Evaluation of the FNC metric
    5.3.4 Analysis of models and features
    5.3.5 Analysis of the generalizability of the models
    5.3.6 Discussion
  5.4 Experiments on Snopes19
    5.4.1 Model architectures
    5.4.2 Experiments
    5.4.3 Error analysis
  5.5 Chapter summary

6 Evidence extraction
  6.1 Problem setting
  6.2 Related work
    6.2.1 Argumentation mining
    6.2.2 Open-domain question answering
    6.2.3 Summary
  6.3 Machine learning approaches
    6.3.1 Model architectures
    6.3.2 Training configuration
    6.3.3 Testing configuration
  6.4 Experiments and results
    6.4.1 Experiments on FEVER18
    6.4.2 Experiments on Snopes19
  6.5 Error analysis
    6.5.1 Error analysis FEVER18
    6.5.2 Error analysis Snopes19
  6.6 Chapter summary

7 Claim validation
  7.1 Problem setting
  7.2 Related work
    7.2.1 Early work on claim validation
    7.2.2 Advanced claim validation approaches
    7.2.3 FEVER RTE problem setting
    7.2.4 Summary
  7.3 Experiments on FEVER18
    7.3.1 Model architecture

    7.3.2 Experiments and results
    7.3.3 Error analysis
    7.3.4 Conclusion
  7.4 Experiments on Snopes19
    7.4.1 Experiments
    7.4.2 Error analysis and further investigations
    7.4.3 Conclusion
  7.5 Chapter summary

8 Conclusions and outlook
  8.1 Summary of contributions and findings
  8.2 Impact of the contributions
  8.3 Future research directions

A Appendix
  A.1 FNC features: detailed description
  A.2 Snopes annotation interface and annotation guidelines
    A.2.1 Annotation guidelines for the annotation of the stance of the ETSs
    A.2.2 Annotation guidelines for the annotation of FGE in the ETSs




Chapter 1

Introduction and Background

The ever-increasing role of the Internet as a primary communication channel is arguably the most important development in the media over the past decades. In drastic contrast to traditional editorial-board-based publishing, the modern Internet is decentralized, and anyone with access to a network can produce and broadcast new content to a wide audience at almost zero expense. This has resulted in unprecedented growth in information coverage and distribution speed. Nevertheless, there are also considerable downsides to the increased technical possibilities of freely generating and distributing content. False information, such as false claims or entire false-news articles (Paskin, 2018), can be shared through this channel and reach a wider audience than ever before (Howell et al., 2013). Individuals or organizations hiding behind fake accounts on social media can generate and distribute false content without being held accountable for their actions (SafeGuardCyber, 2019). This phenomenon is typically called disinformation, as opposed to the information of an audience about the true state of events. In addition to malicious content generated by humans, an increased amount of harmful content is generated by machines. Recent advances in AI and machine learning have made it possible to generate fake images,1 fake videos,2 or fake speech.3 Recently developed deep neural networks pretrained on massive corpora can generate coherent pieces of text on a desired issue and thus mislead information consumers4 (Radford et al., 2019; Zellers et al., 2019).

In response to the rising amount of false information on the Internet, the number of fact-checking platforms has increased. Whereas in 2014 only 44 fact-checking platforms were active worldwide, in 2019 the number of active platforms had risen to 188 organizations.5 On these platforms, professional fact-checkers validate the distributed information and make their fact-checks publicly available.

1https://www.theverge.com/2018/12/17/18144356/ai-image-generation-fake-faces-people-nvidia-generative-adversarial-networks-gans 2https://www.theguardian.com/news/shortcuts/2019/aug/13/danger-deepfakes-viral-video-bill-hader-tom-cruise 3https://www.bbc.com/news/av/technology-40598465/fake-obama-created-using-ai-tool-to-make-phoney-speeches 4https://www.technologyreview.com/s/612960/an-ai-tool-auto-generates-fake-news-bogus-tweets-and-plenty-of-gibberish/ 5https://www.poynter.org/fact-checking/2019/number-of-fact-checking-outlets-surges-to-188-in-more-than-60-countries/

The fact-checks typically contain a verdict for the validated claim, such as true or false, and evidence to back up the verdict.

However, despite its popularity, traditional, manual fact-checking practiced by most of the fact-checking platforms today has substantial drawbacks. The process is laborious, and because of the large amount of false information on the web, it is not possible to identify and validate every false claim. Moreover, even if false information is identified and validated, most of the damage is already done once a misleading message goes viral. The majority of news consumers are unlikely to review the facts on a story once the focus of the media has shifted to a different topic. Nevertheless, even if the public is informed about the corrected version of the story, psychological studies show that once beliefs are formed, they cannot be easily reversed (Levy, 2017a,b).

Many of the issues with manual fact-checking can be addressed by automated approaches, as they would allow fact-checkers to validate a large amount of information as it appears on the web. Thus, the operators of the platforms on which the deceptive content was published can remove the falsehoods before they can spread through social networks. Owing to recent advances in machine learning and AI, a number of tasks previously performed by humans can be successfully automated, e.g., speech recognition (Graves et al., 2013), object detection (Krizhevsky et al., 2012), or product recommendation (Linden et al., 2003). Progress in Natural Language Processing (NLP) as a sub-field of AI has led to the development of advanced methods for solving problems related to natural language that could also be leveraged to automate the fact-checking process. Thus, automated fact-checking has received increased research interest in the last couple of years. Today, scientists as well as corporations are exploring different approaches to address this task, such as designing fact-checking systems based on knowledge bases (Ciampaglia et al., 2015) or using machine learning methods to harness textual information from the web to validate published content (Popat et al., 2017).

In this chapter, we provide background information for our contributions to automated fact-checking presented in this thesis. Because the information validation process depends on the characteristics of the false information and the proliferation of information on the Internet, the first part of this chapter is devoted to the analysis of these two topics. We investigate different kinds of false information and the proliferation of false content on the Internet. We analyze the mechanics of the proliferation and why disinformation on the web has such a large influence.

The second part of this chapter provides background information about the actual fact-checking process. We first examine traditional fact-checking that is internally practiced by news magazines. Thereafter, we discuss the recent increase in the number of external fact-checkers in response to the increased amount of false information on the web. Next, we explore the emerging field of automated fact-checking, whereby different automated fact-checking frameworks are presented. The section is concluded with the discussion of the fields related to fact-checking: fake-news detection, computational argumentation, interactive evidence detection, and the identification of automatically generated false textual content.


1.1 False information and its proliferation

The distribution of false information to manipulate others for achieving a certain goal is not a new phenomenon but has evolved over time. Propaganda or disinformation was widely used by people in power from antiquity until the modern era to manipulate public opinion (Jowett and Donnell, 2006). With the rise of the Internet, not only the way we consume information, but also the tools bad actors (SafeGuardCyber, 2019) use to get their message across have changed. Today, information can be shared instantly, and it can reach a wide audience in a short period of time through the distribution of the message on social networks. Below, we give an overview of different kinds of false information and discuss the proliferation mechanisms of false information on the web. Here, the term false information will be used broadly to refer to different kinds of manipulation of the original message, e.g., intentional/unintentional manipulation, presenting information out of context, and inappropriately highlighting/hiding different aspects of the original message.

1.1.1 Different kinds of false information

The manipulation of groups of people by distributing false information is an ancient phenomenon dating back at least to antiquity. However, the methods for this manipulation have changed over the years, as the newest technology is always adopted to efficiently distribute the misleading content. During the Protestant Reformation, the printing press was widely deployed by different parties to spread false information to discredit the opponent. In the modern era, radio, television, and now the Internet are used to manipulate others. The terminology to address different kinds of falsehoods has developed ever since. The term propaganda was originally used by the Vatican to refer to the propagation of the faith of the Roman Catholic Church, but the term lost its neutral connotation owing to its excessive use against alternative beliefs to establish the Catholic faith in the “new world” (Jowett and Donnell, 2006). Since then, different forms of the term propaganda have evolved. Jowett and Donnell (2006) differentiate between three different kinds of propaganda:

• White propaganda is distributed by a source that correctly reveals itself, and the distributed message is mostly accurate. It emphasizes the good intent of the information provider, and its purpose is to build a relation of trust with the information consumer. The domestic mainstream media, for example, is in general more in favor of the actions of the government than foreign reporters. A more left-leaning news network favors politicians on the left spectrum and is more likely to be silent about their misconduct than the right-leaning media, and vice versa. The reports of the source are mostly accurate, but through the selection of the content and by highlighting different aspects of the message, a biased view is presented.

• Black propaganda refers to different kinds of deception, the distribution of lies, and fabricated stories. The source of the message is thereby often concealed or credited to a false authority. One form of black propaganda is disinformation, a widely used term. It was derived from the Russian word dezinformatsiya, which was the name of the KGB division devoted to black propaganda (Jowett and Donnell, 2006; Shultz and Godson, 1984).


Shultz and Godson (1984) propose the following definition of the term: Disinformation means “false, incomplete, or misleading information that is passed, fed, or confirmed to a targeted individual, group, or country”. Disinformation differs from misinformation in that the former is a purposeful and intentional act of spreading false information (Golbeck, 2008) and the latter is an unintended error in the propagated message.

• Gray propaganda refers to information that is somewhere between black and white propaganda. The source of the message may or may not be correctly revealed, and the accuracy of the content is uncertain.

More recently, fake news has emerged as a new term to refer to false information and particularly to fabricated news articles. The term was mainly popularized during the 2016 U.S. presidential elections, in response to the growing amount of highly partisan news on the Internet. Fake news can be defined as an intentional (Rubin et al., 2015) and knowing (Klein and Wueller, 2017) deception, with the purpose of either political or monetary gain (Allcott and Gentzkow, 2017). Facebook defines fake news as “inaccurate or manipulated information/content that is spread intentionally” (Weedon et al., 2017). Because the definitions stress the deceptive nature of the message and the information being intentionally distributed, fake news can also be considered as a kind of black propaganda. The term fake news is not new, but its meaning has altered over the years. It was originally used to refer to satirical and parody news shows (Paskin, 2018). More recently, the term has been excessively used by public figures to discredit news organizations that report negatively about them or simply to attack a political opponent (Klein and Wueller, 2017). Because fake news is now often used as an insult rather than according to the definition given by Allcott and Gentzkow (2017), Rubin et al. (2015), Klein and Wueller (2017), and Weedon et al. (2017), the meaning of the term has become increasingly diluted. For this reason, avoiding the term altogether in a serious discussion about disinformation has been suggested (Thorne and Vlachos, 2018; Vosoughi et al., 2018). Thus, we will refer to this kind of false information as false news in the rest of the thesis. For a better understanding of the relations between the various terms referring to different kinds of false information, in Figure 1.1, we display a taxonomy of these terms.

1.1.2 Proliferation of false information on the Internet

In this section, we examine the proliferation of false information on the Internet and the topics associated with it. The proliferation of false information has many facets that need to be examined for a comprehensive overview of the topic. Below we discuss the distribution of false information on social media by bad actors, the proliferation mechanisms, the influence of the false information on the information consumers, and the containment of disinformation. The goal is to analyze the dynamics of disinformation on the web, which in turn can help to develop effective countermeasures.


Figure 1.1: Taxonomy of the different kinds of false information

Social media as a news provider. Social media has recently emerged as a new vehicle to drive the distribution of information on the web. Platforms such as Facebook, Twitter, the Russian Odnoklasniki, or the Chinese Sina-Weibo have hundreds of millions of active users (Webb et al., 2016), and the distribution of news articles and political messages is an important means of interaction on these platforms. Users are increasingly abandoning mainstream media and moving to social media platforms, where they receive the news reports on their news feeds (Bakshy et al., 2015). According to Gottfried and Shearer (2016), in the U.S., for instance, 62% of adults get their news on social media.

Distribution of false information by bad actors. Media depends on sensationalism, novelty over newsworthiness, and clickbait to get the attention of the information consumers, and this makes it vulnerable to manipulation (Marwick and Lewis, 2017). Bad actors (SafeGuardCyber, 2019) take advantage of this media ecosystem to manipulate news frames, set agendas, and propagate ideas. Among them are far-right groups, trolls, white nationalists, men's rights activists, gamergaters (Braithwaite, 2016), the “altright” (Marwick and Lewis, 2017), and individuals representing the interests of large corporations6 or of different governments (Ramsay and Robertshaw, 2019). The incentives for spreading the disinformation are thereby very diverse, ranging from monetary gains, such as profits from ads (Chen et al., 2015), to the attempt of manipulating public opinion in elections in one's favor (Persily, 2017; SafeGuardCyber, 2019).

6https://www.theguardian.com/environment/2015/mar/25/fossil-fuel-firms-are-still-bankrolling-climate-denial-lobby-groups

Proliferation mechanisms and their influence on information consumers. A number of studies have shown that the influence of false information on consumers is substantial and should not be underestimated. Silverman (2016) has observed that the most popular false news stories were more widely shared on Facebook than the most popular mainstream news stories. Vosoughi et al. (2018) have found that false news stories reached more people than reliable news articles: “The top 1% of false news cascades diffused to between 1000 and 100,000 people, whereas the truth rarely diffused to more than 1000 people. Falsehood also diffused faster than the truth”. False information therefore proliferates significantly farther, faster, deeper, and more broadly in social networks than the truth over all categories of information. Numerous studies have analyzed why false information is distributed more widely than the truth and what the properties of false news are that make it more worth sharing. Vosoughi et al. (2018) have discovered that false news is more novel than true news, and because novel information is more worth sharing, they imply that people are more likely to share false information. Moreover, false stories inspired fear, disgust, and surprise, whereas true stories inspired anticipation, sadness, joy, and trust. Thus, as suggested by Vosoughi et al. (2018), false information appears to be more exciting and therefore receives much attention. Nevertheless, users share false information not only because it is novel or exciting, but also because they often believe in the content and are unaware of its deceptive message (Silverman and Singer-Vine, 2016).

An additional negative effect of the consumption of news on social media is that the diversity of the information decreases. According to Bakshy et al. (2015), users are less likely to encounter information in their news feeds from the opposite political view than information aligned with their political perspective. Friends, for instance, share substantially fewer news articles presenting an opposing political view. Another source of bias on social media is the algorithm that filters and ranks the content in the news feed. Bakshy et al. (2015) report that in a study, users encountered about 15% less content from the opposite political spectrum in their news feed owing to the ranking algorithm and clicked up to 70% less through this content. This biased consumer feedback reinforces the bias of the ranking algorithm, and the algorithm is therefore less likely to present news with an alternative view in the future.

Psychological perspective on disinformation. Given that false information is widely distributed through social networks and that many users believe the falsehoods, the question arises of why people are susceptible to false information in the first place and are not more critical about the information that they consume. Psychological theories suggest that this is an interplay of different factors. (i) One factor is our implicit bias or implicit stereotype (Greenwald and Banaji, 1995).7 We innocuously form a set of beliefs according to which we perceive and treat other people. We consider people like ourselves as more honest and trustworthy than people belonging to a different ethnic, religious, or age group. (ii) Another factor is our confirmation bias, which refers to the phenomenon that we are more willing to accept new information that is in agreement with our beliefs than to accept information that contradicts them (Waldman, 2017; Sunstein, 2014).

7https://www.psychologytoday.com/us/blog/contemporary-psychoanalysis-in-action/201612/fake-news-why-we-fall-it


Thus, conservative people are more inclined to accept false claims about left-leaning politicians, whereas progressive people are more susceptible to false information about conservative politicians. Owing to our biases, we are less likely to accept information from an unfamiliar source and information that is not in agreement with our worldview. The two biases reinforce each other, and we end up in a cycle in which we are seldom confronted with an alternative perspective. The selective consumption of information leads to echo chambers or filter bubbles, which refers to the phenomenon that a group of people is only sharing information within the group. The group members are only exposed to information from like-minded people confirming their biases. Attitude-changing content is therefore avoided, leading to polarization effects (Waldman, 2017; Bakshy et al., 2015; Pariser, 2011; Flaxman et al., 2013).

For a better overview of how different factors affect our consumption of information, in Figure 1.2, we illustrate the information consumption process and the risks associated with it, such as our biases and reinforcing feedback loops.

Figure 1.2: Information consumption process and its associated risks, such as our biases and reinforcing feedback loops

Containment of disinformation. If disinformation on the web is so harmful, why have we not yet found an effective solution to the problem? The proliferation of false information on social media cannot be easily controlled for several reasons. The insensitive removal of information by authorities or social media platforms can be perceived as censorship and can face great opposition. Moreover, because of the sheer amount of false information on the web, it is difficult to detect and eliminate every single false claim or article. Self-correcting mechanisms—for example, as observed on Wikipedia, where some users repair the damage done by other users—do not seem to apply to the fast proliferation of information on social media. The information often goes viral before it is detected and corrected by other users in a social network (Webb et al., 2016). To address these problems, information validation approaches are required that can quickly and reliably identify false information on the Internet. The deceptive content can then be removed before it spreads through social networks. Modern fact-checking approaches, discussed in the following section, hold the promise to fulfill these objectives and are therefore in the focus of the public debate about disinformation.

1.2 Fact-checking

Traditionally, fact-checking has been manually done by trained employees at news magazines to ensure the quality of an article before publication. With the increase in the amount of false information on the web, external fact-checking platforms emerged, where fact-checkers manually validate information published on the web and share their findings with the public. Modern approaches to fact-checking include different degrees of automation of the fact-checking process, ranging from only using a search engine to extract information from the web to completely automated fact-checking pipelines. In this section, we first present the traditional manual fact-checking approaches before discussing different variants of modern automated fact-checking methods. The section is concluded with a discussion of NLP tasks related to automated fact-checking.

1.2.1 Manual fact-checking

Internal fact-checking. Fact-checking originally emerged as an internal quality assurance process within news magazines. In this process, an article written by a reporter is validated by a professional fact-checker. The internal fact-checking process was proposed by Briton Hadden, who co-founded the Time magazine (Thomas and Weiss, 2010); however, the validation process was not originally called fact-checking.8 The information validation process practiced by the Time magazine, as well as by the News Week magazine,9 is referred to as reporter-researchers.10 It can be considered as a collaboration between the reporter and the fact-checker (researcher), in which the fact-checker accompanies the reporter and helps with putting the article together. The fact-checker validates the alleged facts gathered by the reporter while the article is written. However, the fact-checker is sometimes also involved in interviewing the sources (experts, witnesses) and writing the article (Thomas and Weiss, 2010). At the New Yorker,11 which is an American magazine well known for its fact-checking process, a different information validation approach was established. It significantly differs from the reporter-researchers approach and relies on checks and balances (Thomas and Weiss, 2010). Here, the fact-checker works independently of the reporter and is not helping to write the article. Instead, the fact-checker gets involved after the article is written. The fact-checker obtains the background material

8http://time.com/4858683/fact-checking-history/ 9https://www.newsweek.com/ 10http://time.com/4858683/fact-checking-history/ 11https://www.newyorker.com/

from the reporter, such as the newspaper clips, magazine stories, websites the reporter consulted, telephone numbers of the sources, etc. (Thomas and Weiss, 2010). Then, “the fact-checker takes the article apart and puts it back together again”,12 which basically means validating the alleged facts in the article, double-checking the statements of the sources, and validating the coherence of the arguments.

External fact-checking. With the increased amount of false information on the web, there is a growing demand for external fact-checking. External fact-checkers or fact-checking platforms are (mostly) independent of news providers and are only concerned with the validation of information coming from a secondary source, such as news articles, claims made by public figures, or rumors emerging on social media. According to Duke Reporters' Lab,13 188 such organizations were active in 2019. More than 90% of them were established since 2010, and about 50% in 2016 alone. The American Press Institute reports that from 2004 to 2008 the number of fact-checked stories increased by more than 50% and from 2008 to 2012 by more than 300%.14 To give a brief overview of how external fact-checkers operate, below we present four popular external fact-checking platforms. Each of these platforms focuses on one particular kind of disinformation and follows its own fact-checking process.

www.politifact.com: Politifact is a nonprofit organization located in Florida and is arguably the most prominent external fact-checker. It received the Pulitzer Prize for national reporting in 2009 for its fact-checking efforts during the 2008 U.S. presidential campaign. Politifact focuses mainly on U.S. politics and validates claims made by elected officials, candidates, their staff, lobbyists, bloggers, and other public figures. Validated claims receive a Truth-O-Meter rating (or verdict) that ranges across six different categories from true to pants on fire (meaning that the claim is entirely false). Each claim is accompanied by a detailed analysis, where the fact-checker presents evidence that supports the given verdict and refutes evidence that supports a different verdict.

www.snopes.com: Snopes,15 also known as Urban Legends Reference Pages, is an organization owned by the Snopes Media Group.16 Snopes Media Group is almost entirely funded through digital advertising sales, which are independent of their fact-checking efforts: that is, advertisers have no influence over the published content. Donations over $10,000 are disclosed to the public.17 Snopes focuses on debunking myths, rumors, and political claims on the web. Their “fact-checks” are frequently cited by the news media, such as CNN, MSNBC, or the New York Times. Like Politifact, Snopes provides a verdict for each claim, but the number of classes

12https://archives.cjr.org/critical_eye/fact-checking_at_the_new_yorker.php 13https://www.poynter.org/fact-checking/2019/number-of-fact-checking-outlets-surges-to-188-in-more-than-60-countries/ 14https://www.americanpressinstitute.org/fact-checking-project/new-research-on-political-fact-checking-growing-and-influential-but-partisanship-is-a-factor/ 15Because the Snopes website serves as a source for the corpus construction part of this thesis, the structure of the website is discussed in more detail in Section 3.2.1. 16https://www.snopes.com/about-snopes/ 17https://www.snopes.com/disclosures/

for the verdict is much more diverse. A fact-checker can label the claim as being not only true or false but also as undetermined, mixed, mostly false, legend, etc., if the validated issue is more nuanced. The fact-checkers also provide a detailed analysis for each validated claim, in which they present explicit evidence extracted from various sources that supports the different perspectives on the topic.

www.fullfact.org: FullFact is an independent fact-checking charity based in London and mostly focuses on validating and correcting claims made by U.K. politicians and news providers. The fact-checking process relies not only on their own research but also on insights of external academics.18 Like Politifact and Snopes, FullFact provides a detailed analysis for each validated claim. However, in contrast to other fact-checkers, no verdicts for the claims are given, that is, the claim is not labeled as true, false, etc. Instead, FullFact presents a summary that contains the claim and a short conclusion of the analysis.

www.factcheck.org: FactCheck is a nonprofit project of the Annenberg Public Policy Center of the University of Pennsylvania. It is funded by the Annenberg Foundation and donations from individuals. FactCheck avoids funding from corporations, unions, partisan organizations, or advocacy groups.19 The project's mission is to reduce the level of deception and confusion in U.S. politics.19 FactCheck has won four Webby Awards20 in the Politics category for their fact-checking efforts. On their website, FactCheck investigates different political events that are subject to some kind of controversy. For the investigated events, they provide an analysis and often explicit evidence that either supports their conclusions or presents a different perspective on the topic. As practiced by FullFact, besides the analysis, no verdicts for the investigated events are given.

For a better overview of the four external fact-checkers, the main characteristics of the fact-checking platforms are given in Table 1.1. In order to be credible, the fact-checkers try to disclose their sources of funding and abstain from donations from political parties or lobbyists. The four fact-checkers have different fact-checking procedures and structure their fact-checks differently. Whereas some provide an explicit verdict for the claims, others only present evidence.

The fact-checking process. The validation approaches of different internal and external fact-checkers vary significantly, and the American Press Institute (as one important authority on the issue) has its own guidelines for fact-checking.21 However, there are a number of steps which many of the fact-checkers follow when evaluating a piece of text and which we consider important to include in the fact-checking process. A summary of these steps is given below, and the steps are illustrated in Figure 1.3. Nevertheless, note that such a list can never be considered as an objective summary of the fact-checking process, because the kind of the validated information as well as the goals of the different fact-checking organizations are too diverse to be summarized into one single best practice that fits all needs.

18https://www.poynter.org/fact-checking/2016/lessons-from-fact-checking-the-brexit-debate/ 19https://www.factcheck.org/spindetectors/about/ 20The Webby Award is an Internet-oriented award that honors outstanding Internet content in different web categories: https://www.webbyawards.com/ 21http://cdn.gatehousemedia.com/custom-systems/ghns/files/upload/files/home/ghns-staff/Training/SampleFact-CheckingGuidelinesforNewsrooms%20(4).docx


            main focus     explicit verdict   explicit evidence   country   funding
Politifact  U.S. politics  yes                no                  U.S.      nonprofit
Snopes      Web rumors     yes                yes                 U.S.      Snopes Media
FullFact    U.K. politics  no                 yes                 U.K.      charity
FactCheck   U.S. politics  no                 yes                 U.S.      nonprofit

Table 1.1: Overview of external fact-checking platforms (explicit verdict: the claim is labeled with a verdict (true, false, ...); explicit evidence: the evidence is highlighted in the text)

1. Getting familiar with the topic: The fact-checker first needs to get acquainted with the topic of the given text by reading related information that is published on the topic.

2. Identification of the claims/alleged facts in the text: The central claims/alleged facts in the text that are not commonly agreed to be true need to be identified. These claims form the basis of the text and need to be validated.

3. Evidence aggregation: The fact-checker consults different sources to collect evidence that supports or refutes the identified claims. This includes finding facts on the subject on the Internet and in databases or interviewing experts or witnesses.

4. Checking the credibility of sources: When aggregating evidence, the evaluation of the credibility and independence of the sources of the evidence is important. A source might have an interest in manipulating the public's perception of an issue for its own advantage and can therefore be biased.

5. Claim validation: In this step, the identified claims/alleged facts are validated, i.e., the claims are labeled with a verdict (true, false, ...) on the basis of the collected evidence.

6. Validation of the reasoning chain: Once the individual claims are validated, the reasoning chain in the text as a whole needs to be assessed: that is, whether the conclusions drawn by the author logically follow from the validated claims.

7. Checking for fallacies: The argumentation of the author must be evaluated for fallacies or red flags,22 such as deceptive dramatization, guilt by association, deception by omission, etc.

22http://cdn.gatehousemedia.com/custom-systems/ghns/files/upload/files/home/ghns-staff/Training/SampleFact-CheckingGuidelinesforNewsrooms%20(4).docx


Figure 1.3: The fact-checking process

1.2.2 Automated fact-checking

Even though the number of external fact-checking platforms has substantially increased in the past couple of years, because the manual fact-checking process is laborious, it is unlikely that these platforms will be able to keep up with the increasing amount of false information on the web. Because many of the upcoming

issues of manual fact-checking can be addressed by automated approaches, the interest in this research field has significantly increased. In this sub-section, we give an overview of existing automated fact-checking approaches. We first introduce the terminology that is typically used to refer to important concepts in the field of automated fact-checking and that we have adopted in this thesis. Then, we present traditional automated fact-checking approaches based on knowledge bases. Next, we discuss the automation of the sub-tasks of the fact-checking process by leveraging machine learning techniques developed for different NLP problems. Thereafter, an overview of existing machine learning-based pipeline approaches to automated fact-checking is given.

Terminology

Fact-checking. As described above, the term fact-checking is typically used by journalists to refer to the internal or external process for the validation of a piece of text or a claim. Shapiro et al. (2013) define fact-checking as a “discipline of verification”, and in fact, the term verification is often interchangeably used with fact-checking. More recently, however, the meanings of the two terms have become more distinct (Kovach and Rosenstiel, 2014; Thorne and Vlachos, 2018). Whereas fact-checking still refers to the entire fact-checking process, the term verification in its more recent use refers to a part of it: “verifying the source, date, and location of materials (evidence)” (Thorne and Vlachos, 2018, page 2).

Evidence. In this thesis, we only consider the automated validation of claims based on textual data. Thus, unlike in manual fact-checking, where a piece of evidence can be a graph, a diagram, or an image, evidence will only refer to pieces of text on different levels of granularity, such as text snippets, sentences, or phrases. A given piece of text needs to provide some kind of valuable information for the validation of a given claim to be considered as evidence. It can be a conclusion of a scientific study, an expert opinion, or a witness account. Moreover, because we grant the authors of the web document some authority on the issue, we consider statements in web documents as evidence if they paraphrase the claim or contradict it. Evidence typically expresses a stance with respect to the claim: that is, it supports or refutes the claim. However, evidences can also be fragmentary: that is, they might only support or refute a claim in combination. E.g., Claim: Tesla is planning to build a new factory near the capital of Germany. Evidences: Tesla revealed plans to build a new factory near Berlin. Berlin is the capital of Germany. Such evidences are closely related to premises in the field of logic and argumentation, where several premises in combination (linked premises (Schneider, 2014)) can provide sufficient information to solve a syllogism.

Claim. A claim will be defined as a factual statement that is under investigation: that is, it is not yet clear whether the claim is a fact or not (whether it is true or false). It must refer to factual matters in the world that can be objectively assessed, e.g., a high sugar diet leads to diabetes or Donald Tusk is the president of the European Council. This definition deviates from the interpretation of a claim in the field of computational argumentation and argumentation mining, where claims

often refer to matters of opinion that are subject to debate and for which no objective conclusions can be obtained. Examples of such claims are “abortion is morally wrong and should be banned”, or “homeschooling is superior to education at public schools”.

Verdict. A verdict will be defined as a rating indicating whether a given claim is true. Alternative terms are veracity, truthfulness, or rating. Typical verdicts for a claim are true, false, mostly true, mostly false, or not enough info if the provided information is not enough to classify the claim as true or false with high confidence.
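To make the relation between claims, evidence, and verdicts concrete, the following minimal sketch models the terminology as simple data structures, reusing the Tesla example from above. The class and field names are illustrative assumptions made for this sketch and are not definitions taken from the thesis or from any particular system.

```python
from dataclasses import dataclass
from typing import List

# Illustrative data model of the terminology above; class and field names are
# assumptions made for this sketch, not definitions taken from the thesis.

@dataclass
class Evidence:
    text: str    # a text snippet, sentence, or phrase
    stance: str  # "supports" or "refutes" the claim; may be fragmentary

@dataclass
class Claim:
    statement: str            # a factual statement under investigation
    evidence: List[Evidence]  # the collected evidence
    verdict: str              # e.g. "true", "false", "mostly true", "not enough info"

# The linked-premises example from the text: the two evidences only support the
# claim in combination.
example = Claim(
    statement="Tesla is planning to build a new factory near the capital of Germany.",
    evidence=[
        Evidence("Tesla revealed plans to build a new factory near Berlin.", "supports"),
        Evidence("Berlin is the capital of Germany.", "supports"),
    ],
    verdict="true",
)
```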

Automated fact-checking based on knowledge bases

Many of the first approaches to automated fact-checking rely on information from knowledge bases (Shi and Weninger, 2016a,b; Ciampaglia et al., 2015; Conroy et al., 2015; Gerber et al., 2015). Such approaches use general knowledge bases that contain large collections of facts on a variety of topics, see for instance: Wikidata (Vrandečić and Krötzsch, 2014), DBpedia (Auer et al., 2007), YAGO (Suchanek et al., 2007), Freebase (Bollacker et al., 2008), or NELL (Carlson et al., 2010). For the validation of a claim on the basis of knowledge bases, a number of different methods have been devised. For a better understanding of how such approaches work, we briefly discuss two approaches below.

Ciampaglia et al. (2015) and Conroy et al. (2015) determine the veracity of a claim by measuring the proximity of two entities of the claim in a knowledge graph. For instance, in order to validate the claim Barack Obama is a Muslim, the shortest path between the entities Barack Obama and Islam is determined. The path length and the entities along the path are considered as features for the validation of the claim. Shi and Weninger (2016a,b) define claim validation as a link prediction problem. Given a relational triple in the form (subject, predicate, object), for example, (Chicago, capitalOf, Illinois), the approach derives a generalized statement (U.S. city, capitalOf, U.S. state). These general statements, in turn, are used to validate candidate triples: e.g., the statement (Barack Obama, capitalOf, Illinois) is false because, according to the general statement (U.S. city, capitalOf, U.S. state), Barack Obama is not a U.S. city.

Both approaches have their strengths and weaknesses. The method developed by Ciampaglia et al. (2015) is relatively easy to implement, and it can be used to validate an unlimited number of claims, as any claim containing two entities from the knowledge base can be validated. The method proposed by Shi and Weninger (2016a,b) is more complicated, because before the claim can be validated, appropriate general statements need to be derived. Nevertheless, in contrast to the method by Ciampaglia et al. (2015), it does take the relation between the two entities in the claim into account. In fact, the method proposed by Shi and Weninger (2016a,b) is particularly designed to evaluate whether a particular relation between two entities in the knowledge base holds.

The fundamental drawback of using knowledge bases for fact-checking is that they have a relatively low coverage, as they only contain a small fraction of the knowledge available on the web. Moreover, because the process of updating a knowledge base is laborious, important new facts are often missing. Furthermore, to compare

14 1.2. FACT-CHECKING the information contained in the claim with the information in the knowledge base, a claim needs to be transformed to a relational triple that introduces an additional source of error. The described challenges can be addressed by text-based fact-checking approaches that can directly process raw text and thus, potentially harness the vast amount of information available on the web. Such approaches are presented in the following sub-sections.
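To make the path-based idea concrete, the following minimal sketch builds a toy knowledge graph and extracts the shortest path between two claim entities as features. It only illustrates the general principle, not the actual method of Ciampaglia et al. (2015); the graph, entities, and relation labels are invented for this example.

```python
# Illustrative sketch only: validate a claim by inspecting the shortest path
# between its two entities in a toy knowledge graph (entities and relations
# are made up for the example).
import networkx as nx

kg = nx.Graph()
kg.add_edge("Barack Obama", "United States", relation="presidentOf")
kg.add_edge("United States", "Christianity", relation="majorityReligion")
kg.add_edge("Christianity", "Islam", relation="abrahamicReligion")

def path_features(graph, subj, obj):
    """Return the shortest path and its length as features for claim validation."""
    path = nx.shortest_path(graph, subj, obj)
    return {"path": path, "length": len(path) - 1}

# A long path over generic intermediate nodes would be taken as evidence against
# the claim "Barack Obama is a Muslim"; a downstream classifier uses such features.
print(path_features(kg, "Barack Obama", "Islam"))
```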

Text-based approaches to automated fact-checking

To speed up the fact-checking process, machine learning approaches have already been introduced for some steps of the process: claim detection, document retrieval, evidence extraction, stance detection, and claim validation. These approaches process raw text and thus use the information contained in large document collections, such as Wikipedia, ClueWeb23, or the entire web. A brief overview of these approaches is given below.

Claim detection is the problem of identifying claims in a given text (Levy et al., 2014; Vlachos and Riedel, 2015; Stab and Gurevych, 2017; Konstantinovskiy et al., 2018; Aharoni et al., 2014; Lippi and Torroni, 2016a; Atanasova et al., 2018). For this purpose, lexical, structural, or contextual features can be used (Levy et al., 2014; Stab and Gurevych, 2017). In the field of argumentation mining, where argument components including claims are identified in a given text, the problem is often defined as sequence labeling. In this problem setting, a sequence of tokens is classified using the Beginning-Intermediate-Outside (BIO) scheme (Habernal and Gurevych, 2017). The problem is often tackled using LSTM-based network architectures (Eger et al., 2017) that reach high performance on this task. Claim detection is relevant for the second step of the presented fact-checking process (Figure 1.3), where the fact-checker needs to identify factual statements in the text before they are validated.
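To illustrate the sequence labeling formulation, the snippet below shows how a claim span could be encoded with BIO labels; the sentence and labels are hand-crafted for this example and do not stem from any of the cited corpora.

```python
# Illustrative only: encoding a claim span with the BIO scheme for sequence labeling.
tokens = ["Scientists", "say", "that", "a", "high", "sugar", "diet",
          "leads", "to", "diabetes", "."]
labels = ["O", "O", "O", "B-Claim", "I-Claim", "I-Claim", "I-Claim",
          "I-Claim", "I-Claim", "I-Claim", "O"]

# A sequence labeler (e.g., a BiLSTM-based tagger) is trained to predict one label
# per token; contiguous B-/I- spans are then read off as claim candidates.
claim = [t for t, l in zip(tokens, labels) if l != "O"]
print(" ".join(claim))  # -> "a high sugar diet leads to diabetes"
```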

Document retrieval is a sub-field of information retrieval (Manning et al., 2010; Baeza-Yates et al., 2011) and refers to the problem of identifying documents relevant to a given query. The problem is usually solved in two steps: In the first step, a document collection is retrieved based on the token overlap with the query (inverted index). In the second step, the identified documents are ranked according to their relevance to the query. Traditional approaches to ranking are based on Okapi BM25 (Robertson et al., 1995), a method that measures lexical overlap using Term Frequency-Inverse Document Frequency (TF-IDF), or on the PageRank algorithm (Page et al., 1999), which utilizes the information about cross-references between documents. Modern approaches, such as Semantic Search (Bast et al., 2016) or Google’s RankBrain24, make use of machine learning and/or explicit semantic representations from knowledge bases to compare the query and the documents in the ranking process.

23https://lemurproject.org/clueweb12/ 24https://moz.com/learn/seo/google-rankbrain


Document retrieval is one of the central tools in fact-checking because it allows the fact-checker to query a large collection of documents or the entire Internet to find information. Thus, it is important in the first step of the fact-checking process (Figure 1.3), where the fact-checker needs to find information about the topic of the text to be validated, and in the third step, where evidence for the claims needs to be found. In Chapter 6 of this thesis, the topic of information retrieval is discussed in more detail.
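As a concrete illustration of the ranking step, the following sketch scores a small, invented document collection against a claim used as query with TF-IDF and cosine similarity. It is a deliberate simplification (neither BM25 nor the retrieval system developed later in this thesis).

```python
# Minimal TF-IDF ranking sketch: rank documents by cosine similarity to the claim.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Donald Tusk served as president of the European Council from 2014 to 2019.",
    "A balanced diet and regular exercise reduce the risk of type 2 diabetes.",
    "The European Council defines the overall political direction of the EU.",
]
query = "Donald Tusk is the president of the European Council"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")
```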

Stance detection is the problem of identifying the stance of a piece of text with respect to another piece of text. Much work in stance detection is focused on target-specific stance prediction, in which the stance of a piece of text with respect to a topic or a named entity is determined; for instance, see stance detection for tweets (Mohammad et al., 2016; Augenstein et al., 2016; Zarrella and Marsh, 2016) or online debates (Walker et al., 2012; Somasundaran and Wiebe, 2010; Sridhar et al., 2015). In the context of fact-checking, it is of interest to determine the relation between the retrieved documents or the evidence and the claim. This problem setting is similar to the task defined by Pomerleau and Rao (2017), where the stance of a document with respect to a headline needs to be classified, or the task formulated by Ferreira and Vlachos (2016), where the stance of the headline of an article towards a claim needs to be determined. In argumentation mining, the relation between a claim and a premise is typically classified as support or attack (Stab and Gurevych, 2017), which is also related to the stance detection problem in fact-checking. Annotation schemes used in stance detection often consider three classes: a favorable stance (support, agree), an unfavorable stance (attack, refute, disagree), and a neutral stance (no stance, discuss, neutral, unrelated, etc.).

The stance detection task is not explicitly contained in the manual fact-checking process described above. However, when a fact-checker collects evidence or documents for a claim, they are certainly aware of the stance of the evidence, as this information is essential for the subsequent validation of the claim in step five. Thus, in this thesis, stance detection is considered an important step in the automated fact-checking process. This task will be discussed in Chapter 5 in more detail.
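Since the label inventories differ between datasets, stance labels are often normalized to the three coarse classes mentioned above. The snippet below sketches such a mapping; the label sets are illustrative examples rather than a complete inventory of any particular dataset.

```python
# Illustrative mapping of scheme-specific stance labels onto three coarse classes.
COARSE_STANCE = {
    # favorable
    "support": "favor", "agree": "favor",
    # unfavorable
    "attack": "against", "refute": "against", "disagree": "against", "deny": "against",
    # neutral / no stance
    "discuss": "neutral", "unrelated": "neutral", "comment": "neutral", "query": "neutral",
}

def normalize(label: str) -> str:
    return COARSE_STANCE.get(label.lower(), "neutral")

print(normalize("Refute"))   # -> against
print(normalize("discuss"))  # -> neutral
```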

Evidence extraction is an information retrieval problem. In this problem setting, one is given a query in the form of a claim or a hypothesis and needs to find propositions, phrases, sentences, or paragraphs in a document that support or refute the claim or hypothesis. The problem can be approached either as a ranking or as a classification task. In the ranking problem setting, one needs to rank potential evidence according to its relevance to the claim (see, for instance, the sentence ranking problem in the FEVER shared task (Thorne et al., 2018a)). This task is similar to community-based question answering, where candidate answers need to be ranked according to their usefulness in answering the question (Feng et al., 2015; Rücklé and Gurevych, 2017). In the classification problem setting, one needs to classify whether a piece of text contains valuable information for the validation of the claim. This task is related to the classification of relations between claims and premises in the field of argumentation mining (Stab and Gurevych, 2017). Evidence extraction is relevant for the third step of the presented fact-checking process (Figure 1.3), where the fact-checker needs to find evidence for a given claim.


A more detailed discussion of the evidence extraction problem settings is given in Chapter 6.
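As a minimal illustration of the ranking formulation, the sketch below scores the sentences of a retrieved document by token overlap with the claim; actual evidence extraction models, such as the ESIM-based ranker discussed in Chapter 6, learn this scoring function from data. The claim and sentences are invented for the example.

```python
# Deliberately simple sketch of evidence extraction as ranking: score each
# sentence of a retrieved document by token overlap with the claim.
def overlap_score(claim: str, sentence: str) -> float:
    claim_tokens = set(claim.lower().split())
    sent_tokens = set(sentence.lower().split())
    if not sent_tokens:
        return 0.0
    return len(claim_tokens & sent_tokens) / len(claim_tokens | sent_tokens)

claim = "Donald Tusk is the president of the European Council"
document_sentences = [
    "The European Council meets four times a year.",
    "Donald Tusk was elected president of the European Council in 2014.",
    "Brussels is the de facto capital of the European Union.",
]

ranked = sorted(document_sentences, key=lambda s: overlap_score(claim, s), reverse=True)
print(ranked[0])  # most claim-relevant sentence is returned as candidate evidence
```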

Claim validation is the problem of determining the veracity of a claim, or alternatively, the prediction of the verdict for a claim. The predicted verdicts typically range on a scale from true to false, with intermediate labels such as mostly true or mostly false. To account for cases in which the claim cannot be distinctly classified as true or false, a neutral verdict, such as not enough info, is given. A number of tasks defined in the literature correspond to claim validation, even though they are often named differently: claim verification (Vlachos and Riedel, 2015), fact-checking (bar), rumor veracity prediction (Derczynski et al., 2017), credibility assessment (Popat et al., 2017), fact verification/Recognizing Textual Entailment (RTE) (Thorne et al., 2018a), or Natural Language Inference (NLI) (Thorne and Vlachos, 2019). As some of these task names already suggest, the claim validation problem is related to RTE (Dagan et al., 2005) and NLI (Bowman et al., 2015). In RTE and NLI, one only needs to determine whether the hypothesis (claim) can be logically deduced from the premise (evidence). Claim validation, however, is more challenging because the verdict for a claim needs to be determined on the basis of a number of evidence pieces, some of which can originate from unreliable sources. In such cases, the evidence can be contradictory, and evaluating only the entailment relation may not be sufficient.

Claim validation corresponds to the fifth step of the proposed fact-checking process (Figure 1.3), where the fact-checker validates the claims identified in a text. The automated claim validation problem setting is discussed in depth in Chapter 7 of this thesis.
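The following sketch illustrates, in a deliberately simplified form, how a verdict could be aggregated from stance-labeled evidence alone; the threshold and decision rule are arbitrary assumptions, and an actual claim validation model would also take the textual content of the claim and the evidence into account.

```python
# Hedged illustration of verdict aggregation from stance-labeled evidence.
from collections import Counter

def aggregate_verdict(evidence_stances, min_evidence=2):
    counts = Counter(evidence_stances)
    supporting, refuting = counts["support"], counts["refute"]
    if supporting + refuting < min_evidence or supporting == refuting:
        return "not enough info"
    return "true" if supporting > refuting else "false"

print(aggregate_verdict(["support", "support", "refute"]))  # -> true
print(aggregate_verdict(["refute"]))                        # -> not enough info
```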

Automating the remaining tasks in the fact-checking process. In step four, the credibility of the sources of the evidence needs to be determined. The problem is difficult because even generally trustworthy sources, such as the New York Times or the Washington Post, have been proven wrong by fact-checkers in a number of cases. Moreover, articles often include quotes from other sources that have a different credibility than the publisher of the article. Thus, the information within the document needs to be examined, and individual statements need to be attributed to the correct source, which is a task difficult to accomplish automatically. Nevertheless, there are a number of works that try to tackle the problem. Often, simple heuristics, such as the PageRank and the Alexa Rank25 of web sources (Popat et al., 2016), are used to estimate the credibility of the source of a document. A more elaborate approach to the problem was developed by Pasternack and Roth (2013). They proposed a probabilistic graphical model that predicts the credibility of the source and the veracity of the claims made by the source. A number of steps in the described fact-checking process are difficult to automate, and, to our knowledge, there are very few methods in the literature that tackle these tasks.

25Alexa Rank or Alexa Traffic Rank is a metric to measure the popularity of websites: https://www.alexa.com/
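By way of illustration, a heuristic in the spirit of such popularity-based credibility estimates could combine a normalized PageRank value with an inverted Alexa rank; the weighting scheme and the example values below are invented and are not taken from any of the cited works.

```python
# Hypothetical popularity-based credibility heuristic (weights and values invented).
def credibility_score(pagerank: float, alexa_rank: int, w_pr: float = 0.7) -> float:
    """Blend a normalized PageRank (0..1) with an inverted, normalized Alexa rank."""
    alexa_component = 1.0 / (1.0 + alexa_rank / 1000.0)  # lower rank -> more popular
    return w_pr * pagerank + (1.0 - w_pr) * alexa_component

print(round(credibility_score(pagerank=0.8, alexa_rank=120), 3))
print(round(credibility_score(pagerank=0.2, alexa_rank=500000), 3))
```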


For step six in the fact-checking process, the entire reasoning chain of the article needs to be assessed: that is, whether the conclusions drawn by the author logically follow from the validated claims in the article. For this purpose, multi-hop reasoning over the claims of the article needs to be performed. For example, after the claims “The German car manufacturers have announced that they are planning to lay off thousands of workers.” and “The German Ifo business climate index has decreased three months in a row.” have been validated as true, it needs to be assessed whether the conclusion drawn by the author, “The German economy is slowing down and is at risk of recession.”, logically follows from the two claims. This is notoriously difficult to accomplish for raw text using machine learning approaches (Marasovic, 2018; Marcus, 2018). Step seven would require some form of automated detection, which is also a difficult task. To our knowledge, only a few studies have explored this problem (see, for instance, Habernal et al. (2018a,c)).

Pipeline approaches to automated fact-checking

Even though not all steps in the fact-checking process can be automated, by combining systems that can solve some of the tasks with reasonable accuracy, it is possible to construct a pipeline that automates at least a part of the fact-checking process. In fact, a number of such fact-checking pipelines have been proposed in the literature. To reduce the complexity of the addressed problem setting, most of the pipelines are designed to validate statements, such as claims, rumors, or alleged facts, instead of validating entire articles. Below, we present some of the most popular pipeline approaches and analyze their strengths and weaknesses.

RumourEval. RumourEval was a shared task at SemEval-2017, which was concerned with “determining rumour veracity and support for rumours” (Derczynski et al., 2017). The shared task focused on validating information on Twitter in two steps. Given a rumourous thread, first, the stance of the individual tweets with respect to a rumour had to be identified. Next, the veracity of the rumour had to be determined on the basis of the tweets in the thread, which corresponds to claim validation. The defined problem setting represents a two-step pipeline.

The drawbacks of this approach are that (1) it is only suitable for validating claims on Twitter and (2) the systems trained on the provided dataset are unlikely to generalize to other domains. In fact, domain transfer is often only feasible between relatively similar datasets: a model trained on one dataset and applied to the same (or a similar) task on a different dataset typically achieves a substantially lower performance (see, for instance, Daxenberger et al. (2017)). Moreover, the pipeline only contains two steps, and it is therefore expected that the performance could be further increased if the system were extended with a document retrieval module and an evidence extraction module. This would allow the system to additionally extract information from the web to improve claim validation (determining the veracity of the rumour). Tweets provide relatively little information, and it is expected that the additional information would help to validate the more difficult cases.

CLEF-2018. Nakov et al. (2018) organized the CLEF-2018 fact-checking task, which also corresponds to a two-step pipeline approach. In the first sub-task, participants had to identify “check-worthy” claims in a political debate, which corresponds to claim detection. In the second sub-task, the “factuality” of the identified “check-worthy” claims had to be determined, which corresponds to the claim validation step. However, participants had to develop their own systems for evidence retrieval if they intended to use external knowledge to validate the claims. Similar to the pipeline approach proposed in RumourEval, the CLEF-2018 two-step approach is restricted: for both shared tasks, additional supervision for the development of an evidence retrieval model would be helpful but is not provided.

Where the Truth Lies 2017. Popat et al. (2017) developed a pipeline system for validating emerging claims on the web and social media. The developed pipeline consists of four steps: document retrieval, stance determination (stance detection), credibility assessment (claim validation), and evidence presentation (evidence extraction). The system is superior to RumourEval and CLEF-2018 because the proposed pipeline can access information from the web in order to validate the claims. However, even though the system also provides evidence to the user at the end of the pipeline, it is not guaranteed that the predicted verdict is based on this evidence, as the evidence extraction step and the claim validation step are decoupled. Furthermore, Popat et al. (2017) use the Google search engine for the document retrieval step. The search engine is not optimized for the identification of evidence, which can lead to low performance. As we show in Section 4.4, traditional document retrieval approaches can be inferior to methods tailored for automated fact-checking.

ClaimBuster. ClaimBuster (Hassan et al., 2017) is an automated fact-checking pipeline designed to validate web documents in four steps: (1) Claim Monitor: Using document retrieval, web documents are downloaded from the web. (2) Claim Spotter: In the retrieved documents, claims are identified, which corresponds to claim detection. (3) Claim Matcher: The identified claims are matched to those for which a verdict has already been determined, in order to avoid validating the same claim again. (4) Claim Checker (evidence extraction and claim validation): For the identified claims, supporting and refuting evidence is aggregated from knowledge bases as well as from the web using document retrieval. Then, the actual claim validation is performed: “If any clear discrepancies between the returned answers (evidence) and the claim exist, then a verdict may be derived and presented to the user.” (Hassan et al., 2017).

The ClaimBuster system is comprehensive and includes a claim matching module that could save the user some time when already validated claims are encountered. However, the claim validation part of the system is not very elaborate, as only the “discrepancy” between a given claim and the evidence is determined. Machine-learning-based classifiers, by contrast, achieve better performance in such problem settings. Moreover, the stance information for the evidence is missing, which does not allow the user to observe the relation between the evidence and the claim.

FEVER 2018. The FEVER shared task (Thorne et al., 2018b) was designed to foster the development of comprehensive fact-checking pipelines on the basis of a large dataset with 185,445 validated claims. Participants were asked to develop pipeline systems consisting of three steps: document retrieval, sentence selection (evidence extraction), and recognizing textual entailment (claim validation). The shared task organizers provided a dataset with supervision for the three tasks, which is based on Wikipedia. The shared task attracted a relatively large number of participants, with 23 competing teams.

Even though large performance gains have been achieved in the shared task, the developed systems do not generalize well to other domains (see Section 7.4.3). Because the entire dataset is based only on Wikipedia, the articles do not substantially vary in style and are identically structured. Moreover, because the claims are also extracted from Wikipedia, the systems did not have to deal with unreliable sources, contradicting evidence, or different writing styles from heterogeneous web sources, as in real-life fact-checking. This topic is discussed in more depth in Section 7.4.3.

There are a number of other systems for fact-checking, which are similar to the presented pipeline approaches; therefore, they are not discussed in further detail (Nadeem et al., 2019; Hassan et al., 2015; Shao et al., 2016).

Discussion. The presented pipeline approaches are important contributions to automated fact-checking, as they helped to formalize the problem by dividing it into a number of sub-tasks, and they report first results for these sub-tasks. Nevertheless, as discussed above, the approaches have a number of drawbacks: (1) Important steps in the pipeline are missing, and no annotation for these parts is provided. In particular, many pipelines lack modules for document retrieval and evidence aggregation. (2) The systems are mostly tailored to a single domain because they are trained on single-domain corpora. They are therefore unlikely to generalize to the heterogeneous web sources on which false information emerges. Because of these drawbacks, the proposed pipelines are only partly applicable to real fact-checking instances. More discussion on this topic is given in Section 7.4.

To address these problems, in this thesis, an alternative pipeline approach to fact-checking is proposed (Section 2.2). Our pipeline includes the most crucial steps of the fact-checking process: document retrieval, evidence extraction, and claim validation. To increase transparency, we include a stance detection model that allows us to classify the stance of the retrieved documents and the evidence with respect to a claim. To enable our pipeline system to generalize across different types of text, it is developed on the basis of a heterogeneous corpus from the web, presented in Chapter 3. The corpus provides not only the annotation of the verdicts for the claims but also the annotation of evidence and of the documents containing the evidence.

1.2.3 Tasks related to automated fact-checking

A number of tasks are often associated with automated fact-checking: fake-news detection, neural fake-news detection, computational argumentation/argumentation mining, and interactive evidence detection. To highlight the differences as well as the similarities of these tasks to automated fact-checking, a brief overview of these tasks is given below.


Fake-news detection

Motivated by the recent increase in the amount of false-news articles on the web, a considerable number of studies have been devoted to the problem of determining whether a news article is reliable or not. We summarize this kind of work under the name fake-news detection.

In these studies, datasets with different types of articles are provided: mainstream, left-wing, and right-wing news articles (Potthast et al., 2018); legitimate news and fake news (Pérez-Rosas et al., 2018); real stories, fake stories, and satire stories (Rubin et al., 2016; Horne and Adali, 2017); and real news, satire, hoaxes, and propaganda (Rashkin et al., 2017). For the classification of the articles, many different features have been proposed: n-grams, characters, stop words, part-of-speech tags, readability scores, term frequency, syntactic features, features from the Linguistic Inquiry and Word Count (LIWC) dictionary (Tausczik and Pennebaker, 2010), named entities, grammatical features, punctuation, and others (Potthast et al., 2018; Pérez-Rosas et al., 2018; Rubin et al., 2016; Horne and Adali, 2017; Rashkin et al., 2017). The described features represent only the internal characteristics of the articles. More sophisticated approaches analyze the proliferation characteristics of the news articles on social networks and use this information as an additional feature for the classification (Tacchini et al., 2017; Volkova et al., 2017; Farajtabar et al., 2017; Monti et al., 2019; Zhou and Zafarani, 2018; Ruchansky et al., 2017).

Even though increasingly complex methods are used for fake-news detection, there is a fundamental drawback to these methods: because fake-news detection methods are based on shallow stylistic and lexical features and/or the proliferation characteristics of the articles on social media, they are more likely to learn to differentiate between different genres of text than to keep reliable and unreliable articles apart. Tabloid press articles, for instance, are more likely to be of the same writing style as false-news articles, and it would therefore be more difficult to differentiate between them only on the basis of stylistic features or distribution patterns on social media. Moreover, it is likely that the trained systems can be misled by news articles that mimic the style of an established news magazine but contain only false information.

Thus, from our perspective, these methods do not address the problem in depth. We therefore clearly differentiate between fake-news detection and automated fact-checking, which we consider to be a more promising approach. Automated fact-checking allows for the identification of claims in the articles and the validation of the claims on the basis of evidence stemming from external sources. As a result, a verdict for the entire article can be derived on the basis of the truthfulness of all the claims in the article. This is expected to be more fruitful than the classification of the entire document mostly on the basis of shallow linguistic features, without considering the propositional content of the document and without consulting external sources.
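For illustration, the snippet below sketches the kind of shallow, style-based classifier criticized here: character n-gram features and a linear model. The two training examples are invented and far too few for a real experiment; the point is only to show that such a model reacts to writing style rather than to the propositional content of an article.

```python
# Sketch of a shallow, style-based fake-news classifier (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "SHOCKING!!! You won't BELIEVE what happened next...",
    "The committee published its annual report on Tuesday.",
]
train_labels = ["fake", "legitimate"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # character n-grams
    LogisticRegression(),
)
model.fit(train_texts, train_labels)
print(model.predict(["ASTONISHING new claim rocks the internet!!!"]))
```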

Neural fake-news detection

Deep generative models trained in a self-supervised manner on massive corpora (Radford et al., 2019; Dai et al., 2019; Zellers et al., 2019) have recently achieved the capability of generating long, coherent text. Because entire text documents can now be automatically generated, there is increasing concern over the misuse of such models for the automatic generation of fake-news articles.26 Because such models have only recently been developed, research on how to detect automatically generated text is limited. However, first results in this research area suggest that the same deep generative models can be leveraged to identify automatically generated fake content (Zellers et al., 2019). In fact, as shown in the same study, methods used for fake-news detection appear not to be reliable enough to solve the problem and are inferior to the classifier based on the deep generative model.

Computational argumentation and argumentation mining

Argumentation is a multidisciplinary field concerned with the analysis of debating and reasoning processes (Lippi and Torroni, 2016b). Owing to the recent advances in machine learning and natural language processing, computational models for argumentation and automated reasoning are becoming more feasible, which has given rise to computational argumentation (Slonim et al., 2016; Atkinson et al., 2017). The data required for computational argumentation is available on the web in the form of online newspapers, debate sections of news articles, product reviews, or blogs. For the identification of argument components in raw text, a number of strategies have been developed, and the corresponding field of research has become known as argumentation mining (Palau and Moens, 2009; Mochales and Moens, 2011; Lawrence and Reed, 2020). Lippi and Torroni (2016b) define argumentation mining as follows: “The main goal of argumentation mining is to automatically extract arguments from generic textual corpora, in order to provide structured data for computational models of argument and reasoning engines” (Lippi and Torroni, 2016b, p. 2).

For modeling argumentation structures in text, many different annotation schemes have been proposed (see, for instance, Lippi and Torroni (2016b); Stab and Gurevych (2014); Stab et al. (2018a)). Most of them include the annotation of claims and premises and the definition of attack or support relations between the premises and the claims. These annotation schemes are to some extent similar to the terminology used in automated fact-checking (see Section 1.2.2). In both frameworks, claims are considered, and premises to some extent resemble evidence. However, important differences exist. Argumentation mining is in general concerned with the identification of argumentation structures in text and not with the validation of the identified arguments. Typical text sources to which argumentation mining is applied are debate forums, social media discussions, or political debates (Gurevych et al., 2016; Stab and Habernal, 2016; Habernal et al., 2018b). The discussions are therefore often centered around controversial topics, such as nuclear energy, homeschooling, or abortion, for which valid supporting or attacking arguments can be brought forward, but for which, in general, no conclusive verdicts can be determined. Nuclear energy, for instance, has its upsides and downsides, and whether it causes more harm than good to our society cannot be easily validated. In fact-checking, however, as we define it in Section 1.2.2, the subject of the discussion must be factual, and one should be able to objectively assess the issue.

26https://openai.com/blog/better-language-models/


Interactive evidence detection

There is a line of research devoted to interactive evidence detection for hypothesis validation in the humanities (Stahlhut et al., 2018; Stahlhut, 2019) that is related to our definition of automated fact-checking. This work is meant to assist researchers in the humanities in their investigations. In particular, when humanities researchers analyze a text on a particular topic, they formulate hypotheses about the topic and then collect evidence that strengthens or refutes these hypotheses. This kind of work is very laborious because, in order to find evidence, the researcher needs to go through large collections of text. Research in interactive evidence detection aims to automate, or at least to speed up, the process of evidence detection and hypothesis validation.

The problem setting is similar to automated fact-checking, as for both tasks, evidence needs to be identified in a large collection of text. Moreover, hypotheses to some extent correspond to claims, as they represent assertions that are not yet proven to be true. Consequently, hypothesis validation and claim validation are similar problems. Nevertheless, important differences between automated fact-checking and interactive evidence detection exist. As shown in (Stahlhut et al., 2018), there is no consensus among humanities researchers regarding what constitutes a hypothesis and what constitutes a piece of evidence. Thus, it is not possible to design annotation guidelines that clearly describe both concepts. As a result, there is no inter-annotator agreement between researchers in evidence detection and hypothesis validation studies (Stahlhut et al., 2018). This is in strong contrast to our annotation framework (see Section 3.2.1), in which we provide a number of clearly defined rules describing what properties a piece of text needs to have to be considered evidence.

Because there is no inter-annotator agreement among humanities researchers, it is not possible to design an evidence detection model for hypothesis validation that fits all needs. Instead, Stahlhut (2019) has designed an interactively trained evidence detection model that adapts to the user’s needs and learns to detect the kind of evidence defined by the individual user. This approach is in contrast to our framework presented in Chapter 6, where we propose a generally applicable evidence extraction model for automated fact-checking that is independent of the user.

1.3 Chapter summary

In this chapter, we gave a comprehensive overview of the fact-checking problem on the basis of the literature on this topic. Because knowledge about the characteristics of false information is important for fact-checking, we first discussed the different kinds of false information and the proliferation of false information through social media. We categorized false information into white, gray, and black propaganda and examined the properties of each type of false information. We argued that whereas different kinds of false information distributed on the web can be assigned to different categories of propaganda, false-news articles are typically a kind of black propaganda, which is the most severe form of deception. The proliferation of false information was investigated in a number of sub-sections, each of which discussed a specific aspect of the proliferation process. We described bad actors who distribute false information, analyzed the dynamics of information distribution on the web in general, and investigated why people are susceptible to false information. We pointed out that disinformation on the Internet is successful because of the characteristics of the media landscape, such as its dependence on novelty and sensationalism, and because of psychological factors, such as our prejudices towards people with different political views.

After analyzing the nature of false information, we discussed the manual fact-checking process and introduced the internal and external fact-checking processes. The internal fact-checking process was presented as the internal quality assurance process within news magazines. The external fact-checking process was defined as the validation process practiced by the growing number of fact-checking platforms on the web. We then described the manual validation process followed by many internal and external fact-checkers in more detail. We remarked that whereas manual fact-checking is still very popular, it is unlikely that this kind of information validation alone would enable us to control the spread of false information, and we argued that automated approaches are required to address the problem.

Next, we gave an overview of the field of automated fact-checking. We first introduced the terminology that is typically used in automated fact-checking and that we have adopted for this thesis. Thereafter, we analyzed traditional fact-checking systems based on knowledge bases. We identified the low coverage of the knowledge bases as a major drawback of these approaches and introduced machine-learning-based systems for the different tasks in the fact-checking process as an alternative, because they can harness the large amount of textual information on the web. We then described pipeline approaches that combine several machine-learning-based systems to automate parts of the fact-checking process. We concluded the chapter with a discussion of NLP tasks related to automated fact-checking: fake-news detection, neural fake-news detection, argumentation mining, and interactive evidence detection. We pointed out that whereas argumentation mining targets a different problem than automated fact-checking, namely the analysis of the discussion of a controversial topic instead of the validation of factual claims, automated fact-checking approaches can be leveraged for fake-news detection. In fact, automated fact-checking can be used to validate an article by assessing the veracity of the claims in the article. Interactive evidence detection is related to automated fact-checking, as for both tasks, evidence needs to be aggregated. Nevertheless, whereas a developed automated fact-checking pipeline should be generally applicable by most fact-checkers, interactive evidence detection systems are designed to adapt to the needs of an individual user in the hypothesis validation process.



Chapter 2

Contributions of the thesis

Despite significant progress in the area of automated fact-checking, current fact-checking systems still have a number of major shortcomings that need to be addressed. After analyzing the state of the art in this field, we have identified the following main challenges: (1) The datasets available for training machine learning models for fact-checking do not provide high-quality annotations of real fact-checking instances for all the tasks in the fact-checking process. (2) Automated fact-checking approaches based on knowledge bases are restricted by the low coverage of the knowledge bases and are error-prone because statements in natural language need to be converted into formal knowledge-base queries. (3) Current end-to-end trained machine learning systems can process raw text and thus can potentially make use of the vast amount of knowledge on the Internet. However, such systems are intransparent, and the reason for a prediction cannot be easily determined. Moreover, end-to-end systems do not reach the desired performance because today’s machine learning techniques are not mature enough to solve the fact-checking problem without human assistance.

The validation process is very challenging, and a fact-checking system needs to possess a number of important abilities to solve this task: common-sense reasoning, which is the ability to foresee the outcomes of actions, such as what is going to happen if we suddenly remove the table under a vase; and world knowledge, which is knowledge about the relations of objects in the world, e.g., a pencil is smaller than the Eiffel Tower, and whereas the former fits into a handbag, the latter does not. A system should be able to infer such facts without being explicitly programmed to know all possible relations between objects. Because these abilities are still unresolved problems in the AI research community, the development of a fully automated fact-checking system is a very difficult task. To address the identified challenges, in this thesis, we make the following two main contributions:

I Corpus work. To address the lack of an appropriate dataset for training machine learning systems for automated fact-checking, we introduce a new richly annotated corpus on the basis of the Snopes fact-checking website. The corpus provides high-quality annotations for the different sub-tasks in the fact-checking process (Section 1.2.1). The corpus is based on information from real fact-checking instances aggregated from heterogeneous web sources, many of which are the origin of the false information. In addition to the corpus, we publish our corpus creation methodology, which allows for efficiently creating large datasets with high inter-annotator agreement.

II Methodological contribution. To address the need for automated fact-checking methods that do not suffer from the drawbacks of previous approaches, we propose a novel fact-checking pipeline consisting of several sub-systems for validating textual claims. In contrast to many other fact-checking systems, our pipeline approach is able to process raw text and thus make use of the vast amount of textual information available on the web, but at the same time, the pipeline is transparent, as the outputs of the sub-systems can be observed. Correctly completed sub-tasks do not need to be performed manually: potential errors can be traced back to their origin, and the predictions can be revised by the fact-checker. In fact, because we believe that fact-checking is an AI-complete problem, meaning that human-level intelligence is required to perform the task, the objective of this thesis is to develop a system that assists the fact-checker in the validation process in order to speed up the procedure, rather than one that takes over the task entirely.

In this chapter, we first give a detailed overview of all the contributions of the thesis and present a number of research questions that this thesis sets out to answer. Thereafter, we present our fact-checking pipeline for validating textual claims. The chapter concludes with the publication record and the thesis outline.

2.1 Contributions and research questions

For a better overview of the individual contributions of this thesis, we illustrate the relations between the contributions in Figure 2.1. The contributions are closely related to the research questions presented in this section, as both address open problems in designing machine learning systems for automated fact-checking and in constructing corpora for training such systems.

2.1.1 Contributions

Figure 2.1: Overview of the contributions

I Corpus work.
(1) We introduce a new stance detection corpus (ARC) by modifying the argument reasoning comprehension dataset introduced by Habernal et al. (2018b). On the basis of the new dataset and the Fake News Challenge1 corpus, we conduct cross-domain experiments with models from the Fake News Challenge to evaluate their generalization capabilities.
(2) We construct a new, richly annotated corpus that is based on real fact-checking instances from the Snopes fact-checking website and provides annotations for training machine learning models for different sub-tasks in the fact-checking process.
(3) In addition to the corpus, we present our corpus creation framework for the efficient construction of fact-checking corpora with high inter-annotator agreement.
(4) We perform a large number of experiments on our newly created Snopes corpus with models that we have introduced and with other successful approaches suitable for the different fact-checking sub-tasks. Based on these experiments, we conduct a detailed analysis of the fact-checking problem setting defined by our Snopes corpus and compare it with the fact-checking problem defined by the FEVER shared task corpus.

II Methodological contributions.
(5) We propose a novel automated fact-checking pipeline consisting of four sub-systems: document retrieval, stance detection, evidence extraction, and claim validation.
(6) We introduce a novel document retrieval system for retrieving documents for fact-checking from the Wikipedia FEVER shared task corpus. Our system is based on entity linking and hand-crafted rules, and it substantially outperforms traditional document retrieval systems based on Term Frequency-Inverse Document Frequency (TF-IDF) on the FEVER Wikipedia corpus.
(7) We propose a stance detection system based on a deep Multi-Layer Perceptron (MLP) and hand-crafted features. The system can determine the stance of a document with respect to a given headline and reaches high performance in the Fake News Challenge problem setting.
(8) We critically assess the top three systems of the Fake News Challenge and the Fake News Challenge problem setting itself. We systematically evaluate the performance of the features used by the top three systems as well as a set of novel features. Based on the insights of these analyses, we propose a new metric for the Fake News Challenge datasets and two new models for the task.
(9) We propose an evidence extraction model for extracting sentence-level evidence from documents for a given claim. The model is based on the ESIM (Chen et al., 2017b) and a ranking loss objective function. The evidence extraction model combined with our document retrieval system outperforms other competitive systems on the evidence identification sub-task of the FEVER shared task.
(10) We present our claim validation model that reaches the third rank for recognizing textual entailment in the FEVER shared task and analyze the performance of the model in an error analysis.

1http://www.fakenewschallenge.org/


2.1.2 Research questions

Besides constructing a new corpus and developing a pipeline approach for automated fact-checking, in this thesis, we answer a number of research questions that are important for solving the fact-checking problem:

• Is it possible to design an annotation framework for the annotation of evidence in documents for a given claim and for the annotation of the stance of the evidence with respect to the claim that leads to high inter-annotator agreement?

• If we have stance-annotated evidence for a claim (in the form of text snippets that support or refute the claim), is it possible to validate this claim only on the basis of the number of evidence text snippets and their stances, without considering the textual content of the claim and the evidence?

• How well can current machine learning models for fact-checking, which perform well on existing datasets covering a single domain, generalize to multi-domain datasets? That is, can we achieve a high performance for automated fact-checking on a multi-domain dataset if the system is only trained on a single-domain corpus?

• In most of the fact-checking problem settings defined so far, evidence for the validation of a claim is only provided in the form of one or several sentences. Is this information sufficient, or do we need additional contextual information to reach high performance on the claim validation sub-task?

2.2 A pipeline approach to automated fact-checking

Our fact-checking pipeline is designed for the validation of textual claims, as these are at the core of most of the false information distributed on the web. Moreover, such a system can be used to validate larger pieces of text, such as an elaborated news story. Using a claim detection approach, the claims in the article can be identified. The truthfulness (or veracity) of the article as a whole can then be determined on the basis of the truthfulness of the claims in the article. For the validation of a given claim, we propose a pipeline consisting of four steps, which is illustrated in Figure 2.2. The four sub-steps of the pipeline are briefly discussed below.

Step 1: Document retrieval. Given a claim, the system retrieves documents that contain relevant information for the validation of the claim.

Step 2: Stance detection. In this step, the stance of the retrieved documents with respect to the given claim is determined. Documents can be classified as either supporting or refuting the claim or as having no stance with respect to the claim.

Step 3: Evidence extraction. On the basis of the given claim, sentence-level evidence that contains important information for the validation of the claim is extracted from the retrieved documents.


Figure 2.2: Our pipeline approach to automated fact-checking

Step 4: Claim validation. In this step, the verdict (label) for the claim is determined on the basis of the collected evidence. The claim can be classified not only as true or false but also as not enough info if the evidence is not sufficient to label the claim as true or false with high confidence (see the classification framework proposed in (Thorne et al., 2018a)).
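A minimal schematic of such a pipeline, with placeholder functions standing in for the four sub-systems described in Chapters 4 to 7, might look as follows; the function names and the toy stand-ins are assumptions made for this sketch, not the actual implementation.

```python
# Schematic sketch of the four-step pipeline in Figure 2.2 (placeholder functions only).
def fact_check(claim, retrieve_documents, detect_stance, extract_evidence, validate_claim):
    documents = retrieve_documents(claim)                            # Step 1: document retrieval
    stances = {doc: detect_stance(claim, doc) for doc in documents}  # Step 2: stance detection
    evidence = extract_evidence(claim, documents)                    # Step 3: evidence extraction
    verdict = validate_claim(claim, evidence)                        # Step 4: claim validation
    # Intermediate outputs are returned so that a fact-checker can inspect and revise them.
    return {"verdict": verdict, "evidence": evidence, "stances": stances}

# Toy usage with trivial stand-ins for the four sub-systems:
result = fact_check(
    "Donald Tusk is the president of the European Council",
    retrieve_documents=lambda c: ["Donald Tusk was elected president of the European Council."],
    detect_stance=lambda c, d: "support",
    extract_evidence=lambda c, docs: docs,
    validate_claim=lambda c, ev: "true" if ev else "not enough info",
)
print(result["verdict"])
```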

In the subsequent chapters of this thesis, the individual steps of the pipeline are discussed in detail. In these chapters, a deeper analysis of each sub-task is given, and novel machine learning techniques to tackle the sub-tasks are presented.

2.3 Publication record and thesis outline

This thesis is based on the contributions of the following four publications:

Andreas Hanselowski and Iryna Gurevych: A Framework for Automated Fact-Checking for Real-Time Validation of Emerging Claims on the Web. In Proceedings of the 2017 NIPS Workshop on Prioritising Online Content, Long Beach, Los Angeles, CA, USA, 2017. (WPOC2017) https://www.k4all.org/wp-content/uploads/2017/09/WPOC2017_paper_6.pdf

Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych: A Retrospective Analysis of the Fake News Challenge Stance Detection Task. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1859–1874, Santa Fe, NM, USA, 2018. (COLING2018) https://www.aclweb.org/anthology/C18-1158.pdf

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych: UKP-Athene: Multi-Sentence Textual Entailment for Claim Verification. In Proceedings of the EMNLP 2018 First Workshop on Fact Extraction and Verification, pages 103-108, Brussels, Belgium, 2018. (FEVER2018) https://www.aclweb.org/anthology/W18-5516.pdf

Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych: A richly annotated corpus for different fact-checking sub-tasks. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pages 493-503, Hong Kong, 2019. (CoNLL2019) https://www.aclweb.org/anthology/K19-1046.pdf

The following two publications have been completed in the course of the doctoral program, but their content is not included in the thesis.

Christopher Tauchmann, Thomas Arnold, Andreas Hanselowski, Christian M. Meyer, and Margot Mieskes: Beyond generic summarization: A multi-faceted hierarchical summarization corpus of large heterogeneous data. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 3184-3191, Miyazaki, Japan, 2018. (LREC2018) https://www.aclweb.org/anthology/L18-1503.pdf

Norbert Fuhr, Anastasia Giachanou, Gregory Grefenstette, Iryna Gurevych, Andreas Hanselowski, Kalervo Järvelin, Rosie Jones, Yiqun Liu, Josiane Mothe, Wolfgang Nejdl, Isabella Peters, and Benno Stein: An Information Nutritional Label for Online Documents. In Special Interest Group on Information Retrieval Forum, volume 51, number 3, pages 46-66, 2017. (SIGIR2017) https://www.aclweb.org/anthology/W18-5505.pdf

The thesis outline is given below. The contributions of the four publications are presented in five chapters:

Chapter 3 is concerned with corpora for training automated fact-checking systems. We give a comprehensive overview of existing fact-checking corpora and highlight their strengths and weaknesses. We also introduce a new stance detection dataset by modifying the argument reasoning comprehension corpus introduced by Habernal et al. (2018b) (Section 3.1.2) (contribution 1). We construct a new substantially-sized corpus with high-quality annotations for the different fact-checking sub-tasks (contribution 2). In addition to the corpus, we present our methodology for constructing and annotating corpora, which allows for efficiently creating and annotating corpora (contribution 3). The corpus and the corpus creation methodology are published in CoNLL2019.

In Chapter 4, we present a document retrieval system for retrieving documents for fact-checking from the Wikipedia FEVER shared task corpus (contribution 6). Our system is based on entity linking and hand-crafted rules and substantially outperforms traditional document retrieval systems based on Term Frequency-Inverse Document Frequency (TF-IDF) on the FEVER Wikipedia corpus. Our document retrieval system is published in FEVER2018.

Chapter 5 is concerned with stance detection and the Fake News Challenge. We introduce our stance detection system for determining the stance of a document with respect to a given headline, which was deployed in the Fake News Challenge (contribution 7). We analyze the top three systems of the Fake News Challenge, as well as the Fake News Challenge problem setting itself (contribution 8). We also evaluate the performance of the features of the top three systems and the performance of a new feature set. On the basis of these analyses, we propose two new models. We test the generalizability of the top three models from the challenge and of our two newly proposed models by evaluating their performance on a second corpus that we created by modifying the argument reasoning comprehension dataset. The proposed stance detection system and the conducted experiments and analyses are published in COLING2018. In this chapter, we also perform stance detection experiments on our Snopes corpus with systems that reach high performance in similar problem settings and conduct an error analysis for the best-performing system (contribution 4). These experiments and the error analysis are published in CoNLL2019.

In Chapter 6, we present an evidence extraction model for extracting sentence-level evidence from documents for a given claim and perform experiments with this model on the FEVER shared task corpus (contribution 9). The system and the experiments are published in FEVER2018. We also perform evidence extraction experiments with a number of systems on our Snopes corpus and conduct an error analysis (contribution 4). The experiments on the Snopes corpus and the error analysis are published in CoNLL2019.

In Chapter 7, we introduce our model for claim validation, which is able to validate a claim on the basis of an arbitrary number of evidence sentences, and present experiments with this system on the FEVER corpus (contribution 10). The system and the results of the experiments are published in FEVER2018. We perform a large number of experiments for the claim validation problem of our Snopes corpus with models from the FEVER shared task and other successful models suitable for the task. Based on the conducted experiments, we analyze the fact-checking problem setting defined by our Snopes corpus and compare it to the fact-checking problem defined by the FEVER shared task corpus (contribution 4). The experiments on the Snopes corpus and the subsequent analyses are published in CoNLL2019.

In Chapter 8, we conclude the thesis and present a summary of our contributions and findings. Based on the findings, we answer the research questions posed in the introduction of the thesis.
In the final part of the chapter, we discuss promising future research directions in automated fact-checking.


Chapter 3

Corpus construction

For the development of a full-fledged automated fact-checking system based on machine learning, a corpus is required that satisfies certain criteria. It must contain a large number of examples with high-quality annotations for the different tasks in the fact-checking process. Since false information can come from different sources, such as discussion forums, news articles, or messages on Twitter, the training data should not be limited to a particular domain but cover different text sources varying in genre and writing style. In this chapter, we analyze how far existing corpora satisfy these criteria, and we introduce a new richly annotated corpus to address the drawbacks of the existing datasets.

We divide this chapter into two parts. First, we present existing corpora for training machine learning models for fact-checking. We give an overview of all fact-checking corpora we are aware of and discuss their strengths and weaknesses. Next, we highlight in more detail a number of corpora that are used for the experiments in the following chapters of this thesis. Thereafter, we present the corpus ARC2017, which we have created by modifying the corpus introduced by Habernal et al. (2018b). Second, we introduce a new richly annotated corpus based on the Snopes1 fact-checking platform, which is more in agreement with the described criteria. In addition to the corpus, we describe our corpus creation methodology for constructing fact-checking corpora with high inter-annotator agreement.

The contributions of this chapter are the following.2
(1) We introduce a new stance detection corpus, ARC2017, by modifying the corpus introduced by Habernal et al. (2018b).
(2) We construct a new richly annotated corpus, Snopes19, which is based on real fact-checking instances and provides annotations for training machine learning models for different sub-tasks in the fact-checking process.
(3) In addition to the introduced corpora, we present our corpus creation methodology for the efficient construction of fact-checking corpora with high inter-annotator agreement.

1http://www.snopes.com/ 2The complete list of contributions ranging from 1 to 10 is given in Section 2.1.1.


3.1 Existing corpora

3.1.1 Overview and discussion of existing corpora

PolitiFact14. Vlachos and Riedel (2014) analyzed the fact-checking problem and constructed a corpus on the basis of the fact-checking blog of Channel 43 and the Truth-O-Meter from PolitiFact.4 In addition to validated claims, the corpus contains evidence, which has been used by fact-checkers to validate the claims, as well as metadata, such as the speaker ID and the date when the claim was made. Since this was early work in automated fact-checking, Vlachos and Riedel (2014) mainly focused on the analysis of the task. Thus, the corpus is small in size and only contains 106 validated claims.

Emergent16. A more comprehensive corpus for automated fact-checking was introduced by Ferreira and Vlachos (2016). The dataset is based on the project Emergent,5 which is a journalist initiative for rumor debunking. The corpus consists of 300 claims, which have been validated by journalists. For each claim, the corpus provides a number of news articles that are related to the claim, giving rise to a collection of 2,595 associated documents. The journalists summarized each article in a headline and annotated the stance of the article with respect to the claim.

FNC2017. Pomerleau and Rao (2017) altered the corpus Emergent16 in order to create a dataset for the Fake News Challenge (FNC). The FNC was an international competition in which the stance of a news article with respect to a headline had to be determined. Since Emergent16 only contains the annotation of the stance of the headlines of the news articles with respect to claims, the original corpus was modified in order to derive stance labels for the news articles (the modification is described in detail in the following section).

ARC2017. We created this corpus by modifying the argument reasoning comprehension dataset introduced by Habernal et al. (2018b). The original corpus is based on user posts from the debate section of the New York Times and was designed for the argument reasoning comprehension shared task. The shared task participants had to identify and reconstruct implicit warrants for the user posts in a number of steps, one of which was stance detection. We modified this corpus to adjust it to the FNC stance detection problem setting (the modification is described in detail in the following section).

PolitiFact17. Wang (2017) extracted 12,800 validated claims made by public figures in various contexts from PolitiFact.6 For each statement, the corpus provides a verdict and meta information, such as the name of the speaker, the subject of the debate, or the party affiliation of the speaker. This corpus has substantially more claims compared to PolitiFact14, but it does not contain evidence for the validation of the claims.

3http://blogs.channel4.com/factcheck/ 4http://www.politifact.com/truth-o-meter/statements/ 5http://www.emergent.info/ 6http://www.politifact.com/

RumEval17. Derczynski et al. (2017) have organized the RumourEval shared task, for which they provided a corpus of 297 rumourous threads from Twitter containing 4,519 tweets in total. The shared task was divided into two parts, stance detection and veracity prediction for the rumours (similar to claim validation). For stance detection, Derczynski et al. (2017) labeled the tweets as support, deny, query, or comment with respect to the rumour. The 297 rumours (claims) were annotated as true, false, or unverified. The large number of stance-annotated tweets allows for training stance detection systems that can reach a relatively high accuracy score of about 0.78. However, since the number of rumours (claims) is relatively small, and the corpus is only based on tweets, this dataset alone is not suitable for training generally applicable fact-checking systems.

Snopes17. A corpus featuring a substantially larger number of validated claims was introduced by Popat et al. (2017). It contains 4,956 claims with their verdicts, which have been extracted from the Snopes website, the Wikipedia collection of proven hoaxes, and fictitious people.8 For each claim, they retrieved about 30 associated documents from the web using the Google search engine, resulting in a collection of 136,085 documents.

CLEF-2018. Nakov et al. (2018) introduced a corpus concerned with political debates in the course of the CLEF-2018 shared task. The corpus consists of transcripts of political debates in English and Arabic with annotations for two tasks that have been tackled in the competition. In the first task, the participants had to identify check-worthy statements (claims) in the debate, and in the second, 150 statements (claims) from the debates had to be validated.

FEVER18. The FEVER corpus introduced by Thorne et al. (2018a) is the largest available fact-checking corpus, consisting of 185,445 validated claims. The corpus is based on about 50k popular Wikipedia articles, where annotators modified their sentences to create the claims and labeled other sentences in the articles, which support or refute the claim, as evidence. This allows training machine learning models to carry out the three tasks: document retrieval, evidence extraction, and claim validation.

In Table 3.1, we give an overview of all presented fact-checking corpora, including our corpus Snopes19, which we introduce in Section 3.2 of this chapter. We focus on the key parameters: the fact-checking sub-task coverage, availability of annotation, the corpus size, and domain coverage. It must be remarked that a fair comparison between the datasets is difficult to accomplish, since the length of evidence and documents, as well as the annotation quality, significantly vary across different corpora. As discussed in the introduction of this chapter, the training data for the development of a full-fledged fact-checking system needs to satisfy several criteria. Below, we discuss to what extent the presented corpora satisfy these criteria.

7 https://en.wikipedia.org/wiki/List_of_hoaxes#Proven_hoaxe
8 https://en.wikipedia.org/wiki/List_of_fictitious_people


Corpus (reference)                          claims    docs.     evid.   stance   sourc.   agr.   domain
PolitiFact14 (Vlachos and Riedel, 2014)     106       no        yes     no       no       no     political debates
Emergent16 (Ferreira and Vlachos, 2016)     300       2,595     no      yes      yes      no     news
PolitiFact17 (Wang, 2017)                   12,800    no        no      no       no       no     political debates
RumEval17 (Derczynski et al., 2017)         297       4,519     no      yes      no       yes    Twitter
Snopes17 (Popat et al., 2017)               4,956     136,085   no      no       yes      no     Google search
ARC2017 (Habernal et al., 2018b)            376       2,884     no      yes      no       yes    NYT debates
CLEF-2018 (Nakov et al., 2018)              150       no        no      no       no       no     political debates
FEVER18 (Thorne et al., 2018a)              185,445   14,533    yes     yes      yes      yes    Wikipedia
Snopes19 (Hanselowski et al., 2019)         6,422     14,296    yes     yes      yes      yes    multi domain

Table 3.1: Overview of corpora for automated fact-checking. docs.: documents related to the claims; evid.: evidence in the form of sentences or text snippets; stance: stance of the evidence; sourc.: sources of the evidence (opinion holder or the URL of the web document from which the evidence was extracted); agr.: whether or not the inter-annotator agreement is reported; domain: the genre of the corpus

The ARC2017 dataset is based on the corpus by Habernal et al. (2018b) that was designed for the identification of implicit warrants in the area of argumentation mining (see Section 1.2.3). Thus, even though one sub-task for this corpus is stance detection, which is a task useful for automated fact-checking, the corpus as a whole is not suitable for training automated fact-checking systems.

The corpora PolitiFact14 and CLEF-2018 contain claims annotated with verdicts but do not provide annotated evidence for the validation of the claims. This means that for training the evidence extraction component of a fact-checking system, an additional source of supervision is required. Thus, these corpora can only be complementary for the development of a full-fledged fact-checking system.

The corpora PolitiFact17, Emergent16 (FNC2017), and RumEval17 are valuable for the analysis of the fact-checking problem and provide annotations for stance detection. However, they contain only several hundred validated claims, and it is therefore unlikely that modern deep learning models can be trained on these corpora.

The corpus Snopes17 contains significantly more validated claims, but, for each claim, it only provides 30 documents that are retrieved from the web using the Google search engine. Since these documents are not collected by human fact-checkers, they cannot be considered as a gold standard. Thus, it is expected that many of the documents are unrelated to the claim and that important information for the validation can be missing.

As discussed above, FEVER18 is the largest corpus available for the development of automated fact-checking systems. It features a large number of validated claims and annotations of documents and evidence. Nevertheless, since the corpus only covers Wikipedia, the trained systems are unlikely to be able to extract evidence from heterogeneous web sources. Moreover, the corpus is based on synthetic claims derived by modifying sentences from Wikipedia and not on natural claims that originate from diverse sources on the Internet. Thus, a system trained on this dataset is unlikely to generalize to a real-world fact-checking problem setting and to be able to validate naturally emerging claims (see the discussion in Section 7.4.3).

As our analysis shows, while multiple fact-checking corpora are already available, no single existing resource provides full fact-checking sub-task coverage and at the same time is of substantial size and covers multiple domains of text. To eliminate this gap, we created a new corpus, which is described in detail in Section 3.2 of this chapter.

3.1.2 Highlighted corpora

In this section, we present the three corpora FEVER18, FNC2017, and ARC2017 in more detail, as we are going to use them in our experiments in the following chapters of this thesis.

FEVER shared task corpus (FEVER18)

The FEVER corpus (FEVER18) was introduced by Thorne et al. (2018a) for the Fact Extraction and Verification (FEVER) shared task.9 It is the largest corpus available for training automated fact-checking systems and consists of 185,445 validated claims. The entire corpus is based on about 50k popular Wikipedia documents, and annotators have created the claims by modifying sentences in these documents. The annotators have also marked sentences in the Wikipedia documents that either support or refute the claim as evidence. The evidence sentences for one claim can thereby originate from the same document or from a number of different documents. The claims are labeled as supported, refuted, and not enough info. Even though the stance of the evidence is not annotated, it can be deduced from the verdict: if the claim is true, the annotated evidences necessarily support the claim, and if the claim is false, the evidences necessarily refute the claim. In the case of not enough info, no evidence could be found to refute or support the claim. One of the more elaborate instances from the corpus is given in Example 3.1 below. Here, two sentences need to be combined in order to validate the claim. Such instances represent 31.75% of the dataset. For the remaining instances, only one sentence is given that directly supports or contradicts the claim.

9http://fever.ai/2018/task.html


Claim: The Rodney King riots took place in the most populous county in the USA.

Evidence sentences: [wiki/Los Angeles Riots] The 1992 Los Angeles riots, also known as the Rodney King riots were a series of riots, lootings, arsons, and civil disturbances that occurred in Los Angeles County, California in April and May 1992. [wiki/Los Angeles County] Los Angeles County, officially the County of Los Angeles, is the most populous county in the USA.

Verdict: Supported

Example 3.1: An example for a validated claim from the FEVER18 corpus

Thorne et al. (2018a) report an inter-annotator agreement of 0.6841 Fleiss' κ (Fleiss, 1971) for claim classification and 95.42% precision and 72.36% recall for the annotation of evidence sentences. The corpus provides annotations for three tasks which had to be tackled during the shared task: document retrieval, sentence selection (evidence extraction), and recognizing textual entailment (which is similar to our definition of claim validation). The main statistics of the corpus are given in Table 3.2. Each instance is represented by a claim with its verdict and Wikipedia documents with annotated evidence sentences. As can be noticed, the training set of the corpus is biased towards supported claims, as these instances represent more than half of the instances. Nevertheless, the development set and the held-out test set are balanced. Even though there are 185,445 validated claims in the original corpus, evidence has only been annotated in 14,533 documents. This means that many of the claims are about the same topic. Since other documents have not been considered during annotation, there often exist more evidence sentences for a claim than those labeled by the annotators. In fact, Wikipedia articles often overlap in their content (see, for instance, the articles about Richard Nixon and the Watergate scandal). The shared task organizers were aware of this issue, and therefore, the evidence sentences predicted by the systems during the shared task were in part evaluated by humans. If the annotators agreed that the new evidence sentences predicted for a claim are valid, these sentences were added to the set of ground-truth evidence sentences of this claim. In this way, the number of evidence sentences in the corpus could also be increased.

instances   evidence docs.   supported   refuted   not enough info
185,445     14,533           55%         20%       25%

Table 3.2: Corpus statistics and label distribution of the FEVER18 corpus


Fake News Challenge corpus (FNC2017)

The Fake News Challenge (FNC) was organized by Pomerleau and Rao (2017) in order to foster the development of AI technology to automatically detect fake news. The challenge received much attention in the NLP community, as 50 teams from both academia and industry participated. The goal in the FNC was to determine the stance (or the perspective) of a news article (a document) relative to a given article headline. An article's stance can either agree or disagree with the headline, discuss the same topic, or be completely unrelated. For training and validation of the competing systems, the organizers provided a dataset (FNC2017) which is based on the Emergent16 corpus. The dataset construction framework, which was used by the organizers to create the corpus, is described below. Moreover, in order to assess the performance of machine learning models with respect to humans on the task, we have determined the human upper bound in this thesis.

Dataset construction. Ferreira and Vlachos (2016) report that the original Emergent16 corpus contains claims, for each of which news articles are provided that are concerned with the claim. Nevertheless, after analyzing the corpus, we have found that instead of entire news articles, the corpus often contains only a number of sentences from a news article. This is also reflected in the low average token count for the news articles reported in Table 3.3. Thus, even though we are going to refer to these “news articles” as documents, the reader should keep in mind that these can also contain just a handful of sentences. Each news article is summarized into a headline, and the stance of this headline with respect to the claim is annotated. However, the problem setting defined for the FNC is the detection of the stance of an article with respect to a headline, which was not explicitly given in Emergent16. Thus, the FNC organizers modified the original corpus to fit this problem setting. Below, we give a detailed description of the modification process and present an example from the resulting corpus (Example 3.2). The Emergent database, from which the Emergent16 dataset was constructed, contains 300 clusters of stories (documents), each of which is centered around a single claim. Each document in a cluster talks about the claim and either agrees with the claim, disagrees with it, or simply discusses the topic without taking a side. As mentioned above, in the original corpus, each document was summarized into a headline. The organizers matched every headline within a cluster with every document within the same cluster. Depending on the stance of the headline and the stance of the matched document, the pair was annotated as agree, disagree, or discuss. In order to generate the unrelated class, headlines and bodies from different clusters were matched. The labeling rules are described below in more detail, and a small illustrative sketch follows the list:

• agree: If both the document and the headline were annotated as agreeing with the claim, or both were annotated as disagreeing with the claim, the pair was labeled as agree.

10 http://www.fakenewschallenge.org/
11 https://github.com/FakeNewsChallenge/fnc-1-baseline


• disagree: If the headline was annotated as disagreeing with the central claim and the document agreed with the claim, the pair was labeled as disagree. The same label was given for the reverse case, that is, if the headline agreed and the document disagreed with the claim.

• discuss: If either the headline or the document or both were annotated as discuss relative to the central claim, the pair was labeled as discuss.

• unrelated: Headlines and documents from different clusters were randomly matched, and since both are concerned with different topics, the generated instances were annotated as unrelated.
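
The following minimal sketch illustrates these derivation rules; the function and variable names are our own and not part of the original FNC tooling.

```python
def derive_fnc_label(headline_stance, document_stance, same_cluster):
    """Derive the FNC label of a (headline, document) pair from the stances of the
    headline and the document towards the cluster's central claim.
    headline_stance / document_stance take values 'agree', 'disagree', or 'discuss'."""
    if not same_cluster:
        # headlines and documents matched across clusters form the unrelated class
        return "unrelated"
    if headline_stance == "discuss" or document_stance == "discuss":
        return "discuss"
    if headline_stance == document_stance:
        return "agree"       # both agree or both disagree with the claim
    return "disagree"        # one agrees, the other disagrees
```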

Moreover, to prevent teams from using any unfair means by deriving the labels for the test set from the publicly available Emergent16 dataset, the organizers created 266 additional instances. For this purpose, they have taken claims from the web, manually collected documents that are related to the claims, and annotated their stance. Table 3.3 shows the size of the resulting corpus and the label distribution. As can be noticed, the dataset is heavily biased towards unrelated claims, which make up almost three-quarters of the dataset. The disagree class has the smallest number of instances, representing only 2.0% of the dataset.

headlines   docs.   tokens   instances   agree   disagree   discuss   unrelated
2,587       2,587   372      75,385      7.4%    2.0%       17.7%     72.8%

Table 3.3: Corpus statistics and label distribution for the FNC dataset (“docs.” refers to article bodies in the FNC corpus; “tokens” refers to the average number of tokens per document in the corpus; “instances” are labeled document-headline pairs)

Human upper bound. In order to estimate the room for improvement of machine learning systems with regard to human performance, we determined the human upper bound on the resulting stance detection task. We asked five human raters to manually label 200 randomly selected document-claim pairs. The raters reached an overall inter-annotator agreement of Fleiss' κ = 0.686 (Fleiss, 1971), which is substantial and allows drawing tentative conclusions (Artstein and Poesio, 2008). However, when ignoring the unrelated class, the inter-annotator agreement dramatically drops to κ = 0.218. This indicates that differentiating between the three related classes agree, disagree, and discuss is difficult even for humans. We assume this is because in the original dataset Emergent16, the stance of the headlines with respect to the claims is annotated, whereas in the modified version of the dataset for the FNC (FNC2017), the relative stance of the documents with respect to the headlines is automatically derived according to the rules described above. We have found that this often leads to ambiguous instances. In particular, in cases in which the headline discusses the central claim and the document agrees with the claim, or vice versa, it was difficult to determine the stance. In these cases, the stance was annotated as discuss; however, the annotators had difficulties keeping these instances apart from the agree instances. This is also reflected in the relatively low inter-annotator agreement for the annotation of related documents. This implies that the quality of the corpus is not very high and that human performance on the task is relatively low.
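
As an aside, agreement figures of this kind can be computed from raw rater decisions with standard tooling; the following sketch (our own illustration with hypothetical ratings, using statsmodels) shows the computation of Fleiss' κ for such a five-rater annotation study.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

LABELS = ["agree", "disagree", "discuss", "unrelated"]

# Hypothetical ratings: one row per document-headline pair, one column per rater.
ratings = np.array([
    ["unrelated", "unrelated", "unrelated", "unrelated", "unrelated"],
    ["agree",     "agree",     "discuss",   "agree",     "agree"],
    ["disagree",  "disagree",  "disagree",  "agree",     "disagree"],
])

# Map string labels to integer category indices expected by statsmodels.
numeric = np.vectorize(LABELS.index)(ratings)

# aggregate_raters turns the (items x raters) matrix into (items x categories) counts.
table, _ = aggregate_raters(numeric)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```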


Headline: Robert Plant Ripped up $800M Led Zeppelin Reunion Contract

Text snippets with their stance labels extracted from example documents:

“... Led Zeppelin’s Robert Plant turned down £500 MILLION to reform supergroup. ...” Stance label: agree

“... No, Robert Plant did not rip up an $800 million deal to get Led Zeppelin back together. ...” Stance label: disagree

“... Robert Plant reportedly tore up an $800 million Led Zeppelin reunion deal. ...” Stance label: discuss

“... Richard Branson’s Virgin Galactic is set to launch SpaceShipTwo today. ...” Stance label: unrelated

Example 3.2: An example from the FNC2017 corpus

Argument Reasoning Comprehension dataset (ARC2017)

The Argument Reasoning Comprehension (ARC) dataset (ARC2017) was constructed for the ARC shared task (Habernal et al., 2018b). The shared task covered a number of sub-tasks for which the corpus provides annotations. Since in this thesis we are interested in the stance detection sub-task of the ARC shared task, we only discuss the part of the dataset which was created for this purpose. To evaluate the performance of the different machine learning models, which were developed for the FNC, we have modified the ARC in such a way that it fits the FNC problem setting (the experiments on both corpora are described in Section 5.3). Below, we describe how the original ARC dataset was constructed and how we modified the corpus. Moreover, similar to FNC2017, we have also determined the human upper bound in order to estimate the performance of the machine learning methods with respect to humans on this task.

Dataset construction. For the creation of the dataset, Habernal et al. (2018b) manually selected 188 debate topics with popular questions from the user debate section of the New York Times.12 For each topic, they collected user-posts, which were highly ranked by other users, and created two claims representing two opposing views on the topic. They then asked crowd workers to decide whether a user post supports either of the two opposing claims or does not express a stance at all.

12https://www.nytimes.com/


Example from the original ARC dataset
Topic:      Do same-sex colleges play an important role in education or are they outdated?
User post:  Only 40 women’s colleges are left in the U.S. And, while there are a variety of opinions on their value, to the women who have attended ... them, they have been ... tremendously valuable. ...
Claims:     1. Same-sex colleges are outdated
            2. Same-sex colleges are still relevant
Label:      Same-sex colleges are still relevant

Generated instance in alignment with the FNC problem setting
Stance:     agree
Headline:   Same-sex colleges are still relevant
Document:   Only 40 women’s colleges are left in the U.S. ...

Table 3.4: An example from the original ARC dataset and a generated instance which corresponds to instances in the FNC dataset (FNC2017)

The resulting dataset covers typical controversial topics from the news domain, such as immigration, schooling issues, or international affairs. While the topics are similar to the FNC dataset, there are significant differences between the corpora. A user post is typically a multi-sentence statement representing one viewpoint on the topic. The news articles of the FNC, on the other hand, are longer and usually provide a more balanced and detailed perspective on an issue. To use the ARC data for the FNC stance detection setup, we considered each user post as a document and randomly selected one of the two claims as the headline. In fact, the user-posts typically express an opinion in several sentences and can therefore be considered as short documents. Since the claims express two different views on the discussed topic, and because the user-posts always referred to the topic, we assumed that the user-posts from the same topic are always related to the two opposing claims for this topic. We labeled the claim-user-post pair as agree if the workers had selected this claim for the user post and as disagree if the workers chose the opposite claim. We annotated the pair as discuss if the workers selected neither of the two claims. In order to generate the unrelated instances, we randomly matched the user-posts with claims, while avoiding that a user post is assigned to a claim from its own topic. Table 3.4 shows an example of our revised ARC corpus structure. Table 3.5 provides the statistics of the resulting corpus. We have modified the resulting dataset in such a way that the class distribution roughly corresponds to the FNC dataset. The corpus is also biased towards unrelated instances. However, the other three classes are more balanced compared to FNC2017.
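
A minimal sketch of this conversion is shown below; the data layout (a user post with its topic, the two opposing claims, and the workers' selection) and all names are our own simplification, not the original conversion script.

```python
import random

def arc_to_fnc_instance(user_post, topic_claims, selected_claim, rng=random.Random(0)):
    """Convert one ARC user post into an FNC-style (headline, document, stance) triple.
    topic_claims: the two opposing claims of the post's topic;
    selected_claim: the claim chosen by the crowd workers, or None for 'no stance'."""
    headline = rng.choice(topic_claims)          # one of the two claims becomes the headline
    if selected_claim is None:
        stance = "discuss"                       # workers selected neither claim
    elif selected_claim == headline:
        stance = "agree"
    else:
        stance = "disagree"                      # workers chose the opposite claim
    return headline, user_post, stance

def make_unrelated(user_post, post_topic, claims_by_topic, rng=random.Random(0)):
    """Pair the post with a claim from a different topic to create an unrelated instance."""
    other_topics = [t for t in claims_by_topic if t != post_topic]
    headline = rng.choice(claims_by_topic[rng.choice(other_topics)])
    return headline, user_post, "unrelated"
```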

Human upper bound. We have determined the human upper bound for the resulting ARC2017 dataset using the same procedure as applied to the FNC corpus.


headlines   docs.   tokens   instances   agree   disagree   discuss   unrelated
4,448       4,448   99       17,792      8.9%    10.0%      6.1%      75.0%

Table 3.5: Corpus statistics and label distribution for the ARC dataset (“docs.” refers to article bodies, i.e., the ARC user posts; “tokens” refers to the average number of tokens per document in the corpus; “instances” are labeled document-headline pairs)

Five expert annotators annotated 200 samples according to the four-class scheme. The overall inter-annotator agreement is κ = 0.614 (Fleiss' κ), which is slightly lower compared to the FNC corpus. However, the agreement for the three related classes agree, disagree, and discuss is higher, as the annotators reach κ = 0.383. On the basis of the annotations of the five experts, we have computed the most probable annotation using MACE (Hovy et al., 2013). We considered this annotation as the human prediction for the task, reaching a macro F1 score of 0.773. This score is slightly higher compared to the FNC corpus. We also determined the class-wise F1 scores: unrelated = 0.954, agree = 0.710, disagree = 0.857, and discuss = 0.571. As can be noted, the most difficult class in this case is discuss. For this class, we have the smallest number of instances, and the human performance is the lowest. In fact, after analyzing the discuss instances, we have found that in many cases it is not obvious that the user-post is related to the claim. These user-posts have therefore often been labeled as unrelated.
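
The macro and class-wise F1 scores of such a human prediction against the gold labels can be computed as in the following sketch (our own illustration with hypothetical label lists, using scikit-learn).

```python
from sklearn.metrics import f1_score

CLASSES = ["agree", "disagree", "discuss", "unrelated"]

# Hypothetical gold labels and aggregated human predictions for a few instances.
gold = ["unrelated", "agree", "discuss", "disagree", "unrelated"]
pred = ["unrelated", "agree", "agree",   "disagree", "unrelated"]

print("macro F1:", f1_score(gold, pred, average="macro"))
print("class-wise F1:", dict(zip(CLASSES, f1_score(gold, pred, labels=CLASSES, average=None))))
```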

3.2 Snopes fact-checking corpus (Snopes19)

In order to address the drawbacks of existing datasets, we introduce a new comprehensive corpus based on the Snopes fact-checking website.13 We have chosen the Snopes website as the source for our corpus as it provides many important annotations for each validated claim, such as a verdict, evidence in the form of text snippets, and links to documents that provide additional information about the claim. The resulting corpus consists of 6,422 validated claims with rich annotations based on the data collected by Snopes fact-checkers and our crowd-workers. The corpus covers several domains, including discussion blogs, news, and social media, that are often found responsible for the creation and distribution of unreliable information. In addition to validated claims, the corpus comprises over 14k documents annotated with evidence on two granularity levels and provides the annotation of the stance of the evidence with respect to the claims. Our data allows training machine learning models for the four steps of the automated fact-checking process: document retrieval, evidence extraction, stance detection, and claim validation. Below, we provide a detailed description of the created corpus.

13http://www.snopes.com/


3.2.1 Corpus construction

This subsection describes the source data from the Snopes website, followed by a detailed report on our corpus annotation framework.

Source data

Snopes is a large-scale fact-checking platform that employs human fact-checkers to validate claims (rumors) that often emerge on the web and are then distributed through social media (see also Section 1.2.1). A simple fact-checking instance from the Snopes website is shown in Figure 3.1. As displayed in the figure, each instance is represented by a set of fields that store information relevant for fact-checking. At the top of the page, the claim and the rating (verdict) are given. The Snopes fact-checkers additionally provide a document (resolution) which backs up the verdict. The important passages in the document, which we call Evidence Text Snippets (ETSs), are marked with a yellow bar. As additional validation support, Snopes provides URLs (underlined words in the resolution are hyperlinks) for original documents (ODCs) from which the ETSs have been extracted or which provide additional information. Our crawler extracts this information from the Snopes fact-checking web pages, that is, from each page we extract: 1. the claim, 2. the verdict, 3. the resolution, 4. the ETSs, and 5. the ODCs with their URLs.14
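
For illustration, the crawled fields of one fact-checking instance could be stored in a simple record like the following (a minimal sketch with names of our own choosing; the actual crawler format is not specified here).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SnopesInstance:
    """One crawled fact-checking instance from a Snopes page."""
    claim: str                                          # the claim under investigation
    verdict: str                                        # Snopes rating, e.g. "false" or "mostly true"
    resolution: str                                     # the fact-checkers' article backing up the verdict
    etss: List[str] = field(default_factory=list)       # Evidence Text Snippets highlighted in the resolution
    odc_urls: List[str] = field(default_factory=list)   # URLs of the linked original documents (ODCs)
```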

Corpus annotation

The relationship between the evidence and the claim is an important piece of information in the automated fact-checking process, as it allows the user of the system to gain a better understanding of the fact-checking instance. Moreover, the claim validation component of the fact-checking pipeline would potentially benefit from this information for the prediction of the verdict. However, while the ETSs provided by Snopes express a stance towards the claim, this relationship is not explicitly stated on the web page (see, for instance, the example in Figure 3.1). Another issue with the original annotation on the Snopes website is the ETS granularity. The ETSs extracted by fact-checkers are relatively coarse and often contain detailed background information. This information is not directly related to the claim and consequently not useful for its validation. In order to create an informative high-quality collection of evidence that is important for validating the claims, we asked crowd-workers first to label the stance of the ETSs with respect to the claims, and then to extract sentence-level evidence from the ETSs that is directly relevant for the validation of the claim. We further refer to these sentences as Fine Grained Evidence (FGE). Further details of the annotation process are given below.

Stance annotation. To identify the stance of ETSs towards the claim, we performed an annotation study on Amazon Mechanical Turk.15 For this purpose, we have developed an annotation interface and defined annotation guidelines (see Appendix A.2.1 for more information).

14 The crawling of the Snopes website was done in the course of two master theses: Nagaraja (2017) and Zhang (2018).
15 https://www.mturk.com/


Figure 3.1: Source fact-checking instance from the Snopes website

The stance annotation is a three-way classification. The stance of the ETS (article) towards the claim needs to be labeled as agree or disagree; in case the ETS is off-topic or does not explicitly express a stance towards the claim, a third label is given (No explicit reference to the claim). For the sake of simplicity, we refer to this label as no stance henceforth. Two ETSs annotated with their stance are given in Example 3.1.

FGE annotation. Since the stance is already annotated in the previous step, the complexity of the FGE annotation task can be significantly reduced. We have filtered out ETSs with no stance, as they do not contain any supporting or refuting FGE. The annotation of FGE is a binary classification on the sentence level, that is, a sentence is either evidence or not. The crowd workers were asked to annotate sentences in the ETSs if they explicitly referred to the claim and either supported it or refuted it. The annotation guidelines for the annotation of the FGE in the ETSs and the annotation interface are described in Appendix A.2.2. Example 3.1 shows two ETSs with marked FGE (displayed in italics). As can be observed, not all information given in the original ETSs is directly relevant for validating the claim. For instance, sentence (1c) in the first ETS only provides additional background information and is therefore not considered as FGE. The second example furthermore demonstrates that FGE is annotated in accordance with the stance of the ETS: while sentence (2a) does support the claim on its own, it is not annotated as FGE, since the parent ETS stance is refuting.


Claim: The Fox News will be shutting down for routine maintenance on 21 January 2013

Stance: support Evidence text snippet: (1a) Fox News Channel announced today that it would shut down for what it called “routine maintenance”. (1b) The shut down is on 21 January 2013. (1c) Fox News president Roger Ailes explained the timing of the shutdown: “We wanted to pick a time when nothing would be happening that our viewers want to see.”

Claim: Donald Trump supported Emmanuel Macron during the French Elections

Stance: refute Evidence Text Snippet: (2a) In their first meeting, the U.S. President told Emmanuel Macron that he had been his favorite in the French presidential election saying “You were my guy”. (2b) In an interview with the Associated Press, however, Trump said he thinks Le Pen is stronger than Macron on what’s been going on in France.

Example 3.1: Evidence selection for automated fact-checking (the selected FGE are marked in italics)

3.2.2 Corpus analysis

In this subsection, we calculate the inter-annotator agreement on both the stance and FGE annotations, present the corpus statistics, and discuss the characteristics of the corpus. Moreover, we compare the created dataset to the FEVER18 corpus.

Corpus statistics

Table 3.6 displays the main statistics of the corpus. For a better understanding of the content of the corpus, we give an overview of the entire corpus construction procedure below. The corpus contains 6,422 claims, which corresponds to the number of Snopes fact-checking web pages we have crawled. On these web pages, we have found 16,509 ETSs. We further followed the links on the Snopes web pages and found 14,296 web documents (ODCs), which we also crawled. Workers annotated the stance of the ETSs and the FGE in the ETSs. In 8,291 ETSs, the workers were able to find FGE; we call the resulting sets FGE-sets, i.e., an FGE-set is the set of annotated sentences (FGE) from the same ETS. Many of the ETSs were judged to have no stance (see Table 3.8) and, following our annotation study setup, are not used for FGE annotation. Thus, the number of FGE-sets is substantially lower than the number of ETSs. We have found that, on average, an ETS consists of 6.5 sentences. For those ETSs that have a support or refute stance, on average, 2.3 sentences are selected as FGE.

          claims   ETSs     FGE-sets   ODCs
count:    6,422    16,509   8,291      14,296

Table 3.6: Overall statistics of the Snopes corpus (FGE-set: a set of annotated sentences (FGE) in the same ETS; ODCs: web-documents that have been linked from the Snopes web pages)

The distribution of the verdicts in our corpus is shown in Table 3.7. As can be observed, the dataset is unbalanced in favor of false claims. The label mostly false means that some aspects of the claim are false and some are true, but the false aspects outweigh the true aspects. For example, the claim “Citrus fruits are an effective medicine against the scurvy disease because of their acidity” is mostly false because, even though citrus fruits are effective against scurvy, the cause of the cure is vitamin C and not acidity. The label mostly true is given if the true aspects outweigh the false aspects. The verdict other refers to a collection of verdicts that do not express a tendency towards declaring the claim as being false or true, such as mixture, unproven, outdated, legend, etc.

          false    mostly false   mostly true   true   other
count:    2,943    659            334           93     2,393
%         45.8     10.3           5.2           1.4    37.3

Table 3.7: Distribution of verdicts for claims on the Snopes website

The stance distribution for ETSs is displayed in Table 3.8. In this case, supporting ETSs and ETSs which do not express any stance are the majority. For supporting and refuting ETSs, annotators identified FGE-sets for 8,291 out of 8,998 ETSs. ETSs with a stance but without FGE-sets often miss a clear connection to the claim, so the annotators did not annotate any sentences in these cases. The class distribution of the FGE-sets is also given in Table 3.8. As for the stance annotation, the supporting FGE-sets are more frequent.


                     support   refute   no stance
ETS:        count    6,734     2,266    7,508
            %        40.8      13.7     45.5
FGE-sets:   count    6,178     2,113    –
            %        74.5      25.5     –

Table 3.8: Stance distribution for ETSs and the FGE-sets (FGE from the same ETS)

Inter-annotator agreement

Stance annotation. Every ETS has been annotated by at least six crowd workers. The workers were paid 9 cents for the annotation of the stance of three ETSs. We evaluated the inter-annotator agreement between two groups of workers following the approach presented in Habernal et al. (2018b). As reported by Habernal et al. (2018b), the more workers are deployed, the better the inter-annotator agreement between the two groups. In a number of preliminary experiments, we have found that six workers are enough to reach a high inter-annotator agreement for this task. For the evaluation of the inter-annotator agreement, we randomly divided the workers into two groups (of at least three workers each) and determined the best annotation for each group using MACE (Hovy et al., 2013). The final inter-annotator agreement is determined by comparing the best annotations of the two groups. Using this procedure, we obtain a Cohen's Kappa of κ = 0.7 (Cohen, 1968), indicating a substantial agreement between the crowd workers (Artstein and Poesio, 2008). The gold annotations for the ETS stances were computed with MACE using the annotations of all crowd workers, that is, we have put the workers from the two groups into one group (of at least six workers) and then computed the best annotation using MACE. We have further assessed the quality of the annotations performed by crowd workers by comparing them to expert annotations. Two experts labeled 200 ETSs, reaching an inter-annotator agreement of κ = 0.7. We then computed the best expert annotation on the basis of the annotations of the two experts using MACE. The agreement between the best experts' annotations and the computed gold annotations from the crowd workers is κ = 0.683. This confirms that the gold annotations of the ETS stance obtained from crowd workers are of high quality.
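
The group-based evaluation can be sketched as follows; for readability we stand in for MACE with a simple majority vote, so this is our own simplified illustration rather than the exact procedure.

```python
import random
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def group_agreement(annotations, rng=random.Random(0)):
    """annotations: list of per-ETS label lists, one label per crowd worker.
    Randomly split the workers of each ETS into two groups, aggregate each group
    by majority vote (a stand-in for MACE), and compare the two aggregated labelings."""
    group_a, group_b = [], []
    for labels in annotations:
        labels = labels[:]
        rng.shuffle(labels)
        half = len(labels) // 2
        group_a.append(Counter(labels[:half]).most_common(1)[0][0])
        group_b.append(Counter(labels[half:]).most_common(1)[0][0])
    return cohen_kappa_score(group_a, group_b)

# Hypothetical annotations for three ETSs, six workers each.
example = [
    ["agree"] * 5 + ["no stance"],
    ["disagree"] * 4 + ["agree", "no stance"],
    ["no stance"] * 6,
]
print("Cohen's kappa between the two groups:", group_agreement(example))
```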

FGE annotation. In this case, workers were asked to annotate the ETSs on the sentence level. Because this task is more difficult than stance annotation, we paid 10 cents for the annotation of the sentences in a single ETS. Similar to the evaluation of the stance annotation, we compared the annotations of FGE in 200 ETSs by experts to the annotations of the same ETSs by crowd workers. The inter-annotator agreement between the crowd workers is 0.55 Cohen's Kappa. The agreement between the expert annotations and the gold annotations obtained from the crowd workers using MACE is κ = 0.56. In fact, the task is significantly more difficult than stance annotation. Since sentences may provide only partial evidence for or against the claim, it is difficult to determine how big the information overlap between a sentence and the claim should be for the sentence to be considered as FGE. Sentence (1a) in Example 3.1, for instance, only refers to one part of the claim without mentioning the time of the shutdown. We can further modify the example in order to make the problem more obvious:

(a) The Fox News Channel announced today that it is planning a shutdown on 21 January 2013.
(b) The maintenance of the Fox News Channel is on 21 January 2013.
(c) Fox News made an announcement today about a shutdown.

As the examples illustrate, there is a gradual transition between sentences that can be considered as essential for the validation of the claim and those which just provide minor negligible details or unrelated information.

In summary, the agreement scores show that our method for crowd-sourcing the annotation of the stance and FGE produces labels of good quality. For stance detection, we reach a substantial inter-annotator agreement with a Cohen's Kappa score of 0.7 (Artstein and Poesio, 2008). For the more challenging task of annotating FGE, we reach 0.55 Cohen's Kappa, which is considered moderate inter-annotator agreement (Artstein and Poesio, 2008).

Data statement

In this subsection, we present the data statement (Bender and Friedman, 2018) for our Snopes corpus. According to Bender and Friedman (2018), the data statement should describe a number of specific characteristics of the introduced corpus. Below, we present a subset of these characteristics that are applicable to our Snopes corpus.

A. CURATION RATIONALE: Our corpus is based on the Snopes fact-checking website, and we have chosen this website because it provides many annotations required for training machine learning models for fact-checking. The included texts are very diverse and are discussed in more detail in paragraph F. TEXT CHARACTERISTICS.

B. LANGUAGE VARIETY: Snopes is almost entirely focused on claims made on English-speaking websites in the U.S. The corpus therefore only features English fact-checking instances. Since the sources of the included texts are very diverse, no further specifications can be made with respect to the language variety, e.g., whether different English dialects are present in the corpus.

C. SPEAKER DEMOGRAPHICS: As outlined in B., the sources of the texts are very diverse, and we therefore cannot pin down the speaker demographics such as age, gender, race, etc.

D. ANNOTATOR DEMOGRAPHICS: We have not made any specifications on Amazon Mechanical Turk with respect to the demographics of the annotators. Thus, the annotation study was performed by annotators from the pool of all eligible AMT workers (most workers are female and white; 60% come from the U.S., 30% from India, and 10% from the rest of the world).16

F. TEXT CHARACTERISTICS: We have investigated what kinds of topics are prevalent in our corpus in order to identify potential biases. We have grouped the fact-checking instances (claims with their ETSs) according to the categories defined by Snopes. We have found that the four categories Fake News, Political News, Politics, and Fauxtography are most frequent in the corpus, ranging from 700 to about 900 claims. A substantial number of claims (200 - 300 claims) are present in the categories Inboxer Rebellion (Email), Business, Medical, Entertainment, and Crime. We have further investigated the sources of the collected documents (Original Documents (ODCs)) and grouped them into a number of classes according to the genre of the document: real news, false and satire news, social media, and diverse sources (which encompasses documents from different sources, such as governmental domains, online retail, or entertainment websites). Real news is the largest group, representing 38% of all the documents. The news articles come from different websites ranging from mainstream news like CNN to tabloid press or partisan news. False and satire news is the second-largest group of documents with a share of 30%. Most articles in this group are from the two websites thelastlineofdefense.org and worldnewsdailyreport.com. Social media documents, with a share of 11%, come from sources like Facebook and Twitter and represent the third-largest class of documents. The rest of the documents, representing 21% of the document collection, come from diverse sources.

Comparison of the Snopes19 and the FEVER18 corpora

We view the FEVER18 corpus as the most comprehensive dataset introduced so far, and therefore, we compare the properties of this corpus to our Snopes corpus. Although the annotations provided by our dataset are similar to the annotations of the FEVER corpus, important differences exist with respect to the following characteristics: (1) the domain diversity, (2) the nature of the claims, and (3) the relation between the stance of the evidence and the verdict of the claim. Below, we discuss these characteristics in detail.

(1) Domain diversity: One obvious difference is that the Snopes corpus is based on different sources of data and therefore covers multiple domains (see Section 3.2.2). The FEVER18 corpus, on the other hand, is only based on Wikipedia.

(2) Nature of claims: The claims in the FEVER18 corpus have been created by modifying Wikipedia sentences and are therefore artificial. Thus, it is not guaranteed that such claims emerge in the real world. Moreover, these claims represent factual statements about knowledge that is available in encyclopedias and knowledge bases and which has often been known for years, for instance: “Tim Roth is an English actor.”, “The Bermuda Triangle is in the western part of the North Atlantic Ocean.” Many of the claims contain well-known entities and have a simple structure of the form subject is/was/were object.

16https://en.wikipedia.org/wiki/Amazon_Mechanical_Turk#Research_validity


The claims in the Snopes19 corpus, on the other hand, are naturally occurring claims that originated on false-news websites, social media, etc. They mostly refer to recent events that often have not been well documented, or to extraordinary events which lack the reference to known named entities, e.g.: “A tornado carried a mobile home for 130 miles and left its occupants unharmed.” The structure of the claims is more diverse, often including more than two entities, and the alleged relations between the entities are more complicated. See for instance: “Louise Rosealma was photographed holding an explosive device made from a glass bottle before she was punched by Nathan Damigo.” The lack of available information in encyclopedias about the claims in the Snopes corpus and their complex structure makes the claim validation problem more difficult compared to the FEVER18 corpus.

(3) The relation between the stance of evidence and the verdict of the claim: Another difference between the two datasets results from the approach based on which the corpora have been constructed. The verdict of a claim for FEVER18 depends on the stance of the evidence, that is, if the stance of the evidence is agree, the claim is necessarily true, and if the stance is disagree, the claim is necessarily false. This is because the annotators of the FEVER corpus first created the claims and then looked for evidence for the claim in Wikipedia articles. As a result, the FEVER claim validation problem can be reduced to stance detection. Such a transformation is not possible for the Snopes19 corpus. The evidence in the Snopes corpus might originate from unreliable sources, and a claim can therefore have both supporting and refuting ETSs. The stance of the ETSs is therefore not necessarily indicative of the veracity of the claim. In order to investigate the relationship between the stance of the evidence and the verdict for our dataset, we computed their correlation. In the correlation analysis, we investigated how a claim's verdict, represented by the classes false, mostly false, other, mostly true, true, correlates with the number of supporting ETSs minus the number of refuting ETSs. The formal definition of the problem is as follows: The verdicts of the claims are considered as one variable V ∈ {−2, −1, 0, 1, 2}, where the values correspond to the following verdicts: {−2: false, −1: mostly false, 0: other, 1: mostly true, 2: true}. The stance of the ETSs is considered as the other variable S = P_ETS − N_ETS, where P_ETS is the number of supporting or positive ETSs for the considered claim and N_ETS the number of refuting or negative ETSs. In our correlation analysis, we found that the verdict is only weakly correlated with the stance, as the computed Pearson correlation coefficient between V and S is only 0.16.
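
The correlation analysis corresponds to a computation like the following minimal sketch (the variable names and the toy numbers are our own; scipy's pearsonr is used for the coefficient).

```python
from scipy.stats import pearsonr

# Map the Snopes verdict of each claim to the ordinal variable V.
VERDICT_TO_V = {"false": -2, "mostly false": -1, "other": 0, "mostly true": 1, "true": 2}

# Hypothetical per-claim data: (verdict, number of supporting ETSs, number of refuting ETSs).
claims = [("false", 3, 1), ("true", 2, 0), ("mostly false", 1, 2), ("other", 2, 2)]

V = [VERDICT_TO_V[verdict] for verdict, _, _ in claims]
S = [p_ets - n_ets for _, p_ets, n_ets in claims]       # S = P_ETS - N_ETS

r, p_value = pearsonr(V, S)
print(f"Pearson correlation between verdict and stance balance: {r:.2f}")
```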

Summary. The presented comparison of the two corpora shows that even though the corpora are similarly structured, the resulting fact-checking problem significantly differs between the two. In contrast to FEVER18, the Snopes corpus is a multi-domain corpus, which means that machine learning systems trained on this corpus need to generalize across different kinds of text. The claims in the Snopes corpus are naturally occurring claims, they have a more complicated structure, and they refer to recent events which are often not well documented. Thus, the claim validation problem is more complicated for this corpus. Another difficulty for the validation of the claims in the Snopes corpus is that the evidences often emerge from unreliable sources. As a result, the verdict of the claims is only weakly correlated with the stance of the evidence. Based on our comparison, we conclude that the fact-checking problem defined by our Snopes corpus is more realistic, as information from real fact-checking instances is used, but also more difficult, as the dataset is more diverse and information from unreliable sources is included. A further analysis of the differences between FEVER18 and Snopes19 is given in Section 7.4.2, where experiments on the two corpora are discussed.

17http://www.dmstat1.com/res/TheCorrelationCoefficientDefined.html


3.3 Discussion of the annotation models

The choice of the annotation framework is always subjective and researchers often disagree on how a particular problem should be modeled. We therefore would like to discuss potential weaknesses of the chosen annotation frameworks and justify our particular choice for each sub-task.

Document retrieval. Of the four considered tasks, document retrieval is probably the least controversial problem setting. As described in Section 4.1, in our definition of the document retrieval problem, a document is considered to be relevant if it includes important information for the validation of the claim. More concretely, the document is labeled as relevant if it contains one or several evidence sentences supporting or attacking the claim. The annotation of the documents is therefore highly dependent on the evidence extraction task, where it needs to be defined what constitutes an evidence sentence and whether one should consider sentence-level evidence in the first place. For our corpus Snopes19, we have collected the documents that have been annotated by the Snopes fact-checkers with Evidence Text Snippets (ETSs). Thus, no annotation from our side was required.

Stance detection. We consider stance detection on the document level (FNC2017), the user-comment level (ARC2017), and the ETS level (Snopes19). An alternative approach would be to perform stance detection on the sentence/sub-sentence level, as it is typically done in argumentation mining. Here, sentence/sub-sentence arguments are labeled as supporting or attacking the claim (Stab and Gurevych, 2014). The advantage of the latter approach is that the information is reduced to the sentence/sub-sentence level and the stance is potentially less ambiguous. A document, on the other hand, can contain sentences with an opposing stance, as the author might highlight both sides of the argument. However, when considering sentence/sub-sentence level information, contextual information might be missing, as often only the document as a whole conveys the consistent stance on a topic which the author tries to communicate.

Moreover, performing document-level stance detection also simplifies the subsequent task of labeling evidence sentences, as the annotator only needs to identify evidence sentences having the same stance as the document. This is cognitively less demanding than identifying evidence without knowing the stance in advance. Thus, the two steps can be efficiently arranged after each other: in the first step, the stance of the document as a whole is identified, and then, in a second step, sentence-level evidence sentences are determined that are most indicative of the stance of the entire document.

Due to these benefits, we have chosen the annotation of the stance on the document/ETS level, as it was also suggested for the Fake News Challenge (Pomerleau and Rao, 2017).

Evidence extraction. As described in Section 3.2.1, we define evidence sentences as sentences that explicitly refer to the claim and either support or refute it. An alternative annotation framework is often chosen in argumentation mining, where sub-sentence level evidences/arguments are considered (Stab and Gurevych, 2014). In fact, arguments (which in their definition are similar to evidence) in many cases do not follow sentence boundaries. Thus, in argumentation mining, the identification of arguments is often defined as a sequence labeling task where a model annotates spans of tokens (Eger et al., 2017). On the other hand, one could also argue that sentence pieces or even entire sentences are not sufficient to capture comprehensive evidences/arguments, and that text snippets containing several sentences should be considered as evidence instead. This approach is followed by Snopes fact-checkers, who extract Evidence Text Snippets (ETSs) from documents as evidence. An advantage of this approach is that we have more contextual information and anaphoric pronouns are mostly avoided.

We have not adopted the sub-sentence level approach since annotators often do not agree on the span boundaries. The annotation quality for span annotation is therefore often significantly lower compared to sentence-level annotation (see, for instance, the agreement scores reported in (Zechner, 2002; Tauchmann et al., 2018)). Text-snippet-level evidence annotation, as it is practiced by Snopes fact-checkers, is problematic, as the wider context often contains irrelevant and redundant information. In fact, our experiments on the Snopes corpus have shown that learning to identify ETSs in text is almost impossible, as we were not able to train a classifier that significantly outperforms the random baseline. From our perspective, the choice of whether to include a sentence in an ETS was often made arbitrarily by the fact-checkers. Thus, we consider the task of annotating ETSs as not reproducible. Sentence-level annotation, on the other hand, leads to the best results in our annotation and classification studies. Moreover, we observed that if several sentences in a row are important for the validation of the claim, annotators selected all of these sentences. Thus, arguments that are composed of several sentences are captured.

Due to these advantages, we have adopted the annotation of sentence-level evidence. There have been a number of other studies that also found that the annotation of sentence-level evidence works well in practice (see for instance (Thorne et al., 2018a; Stab et al., 2018a)).

Claim validation. In our experiments described in Chapter 7, we consider claim validation as a ternary classification. We reduce the different verdicts present on the Snopes website to the three ratings true, false, and not enough info (i.e., unverifiable given the collected evidence), following the annotation framework proposed by Thorne et al. (2018a).

Even though widely practiced, the labeling of a given claim with a verdict is disputed among fact-checkers. As described in Section 1.2.1, whereas Snopes and Politifact label the claim according to a predefined rating scheme, FullFact and FactCheck abstain from giving a verdict and only provide an analysis. The mere classification of a claim as being true, false, or not enough info might be too simplistic, miss nuances, and conceal the often complex nature of the considered issue. The true state of affairs can be distorted in different ways: (1) a message could have been manipulated unintentionally, (2) different aspects of the message could have been highlighted inappropriately, (3) the message has been presented out of context, or (4) some aspects of the message are wrong whereas others are true. Thus, at least some of the fact-checkers do not reduce the problem of validating a claim to mere classification and provide an analysis instead, where they discuss the issue in detail. However, such a comprehensive analysis of a claim is beyond the capabilities of current machine learning methods. A system would have to be able to analyze in which way a message has been distorted and discuss these findings in a summary. Since these tasks cannot yet be accomplished automatically, we make use of the more simplistic approach of only classifying the claim. This can act as a proxy for the analysis of the true state of affairs as long as we do not have more sophisticated machine learning methods that would allow us to resolve an issue more comprehensively. As illustrated in Table 3.1, most researchers in automated fact-checking today adopt this more simplistic approach, since in most corpora, claims are annotated with a verdict.
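
Purely as an illustration of such a reduction, a mapping from Snopes ratings to the three classes could look like the sketch below; the exact mapping used in our experiments is specified in Chapter 7, so the assignments here (in particular for mostly true and mostly false) are assumptions.

```python
# Illustrative reduction of Snopes verdicts to a ternary claim validation label.
# The treatment of "mostly true"/"mostly false" is an assumption for this sketch.
TERNARY_VERDICT = {
    "true": "true",
    "mostly true": "true",
    "false": "false",
    "mostly false": "false",
    "other": "not enough info",   # mixture, unproven, outdated, legend, etc.
}

def reduce_verdict(snopes_verdict: str) -> str:
    return TERNARY_VERDICT.get(snopes_verdict.lower(), "not enough info")
```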

3.4 Chapter summary

This chapter was concerned with the analysis and the construction of corpora for training automated fact-checking systems. In the first part, we presented related work in the field, that is, we discussed the characteristics of the currently available fact-checking datasets and investigated their strengths and weaknesses. We have found that while a large number of fact-checking corpora exist, there is not a single multi-domain dataset that is of substantial size and provides annotations for the different tasks in the fact-checking process.

In the next part of the chapter, we highlighted a number of corpora in more detail: the FEVER shared task corpus FEVER18, the Fake News Challenge corpus FNC2017, and the Argument Reasoning Comprehension corpus ARC2017. The ARC2017 was created by us through modifying an existing corpus introduced by Habernal et al. (2018b). For the two corpora FNC2017 and ARC2017, we have presented the human upper bound, which we have determined in an annotation study. The properties of these corpora are of particular importance, as they are used for the experiments in the following chapters of this thesis.

In order to address the lack of an appropriate corpus for training machine learning models for fact-checking, in the last part of the chapter, we have introduced a new richly annotated corpus based on the Snopes fact-checking website. This corpus allows training systems for document retrieval, stance detection, evidence extraction, and claim validation. We have presented the original data from the Snopes website, which we have crawled, and our annotation methodology for labeling the stance of evidence text snippets and annotating sentence-level evidence. As shown by the evaluation of the resulting corpus, our annotation approach leads to a high inter-annotator agreement. Moreover, we provided the statistics of the corpus, analyzed the biases of the dataset, and compared the corpus to the FEVER18 corpus as the most comprehensive dataset for fact-checking introduced to date. The analysis of our corpus has shown that it is biased towards false claims and that many of the provided evidence text snippets originate from unreliable sources, such as false-news websites. Since many of the evidences are unreliable, the veracity of a claim in our corpus cannot be easily deduced from the stance of the evidence, as is the case for the FEVER corpus. We therefore concluded that the more realistic fact-checking problem setting posed by our Snopes corpus is more challenging than the fact-checking problem defined by the FEVER corpus.


Chapter 4

Document retrieval

The previous three chapters give an introduction to automated fact-checking and present datasets for training automated fact-checking systems. This chapter and the following three chapters are concerned with specific sub-tasks of the automated fact-checking pipeline (Chapter 2). In this chapter, we address document retrieval, which is the first sub-task of the pipeline. Since in this thesis we consider document retrieval as a sub-task of a fact-checking system, we define the problem as the retrieval of documents that contain important information for the validation of a given claim. The document retrieval problem is discussed in this chapter in six sections. In the first section, we introduce the document retrieval problem setting and provide an example. In the second section, we present related work, where we focus in particular on document retrieval systems that retrieve documents for evidence extraction. The third section is concerned with different document retrieval approaches that are designed for the FEVER shared task document retrieval problem setting. In the third section, we also introduce a novel entity-linking approach for retrieving Wikipedia articles on the basis of a given claim. In the fourth section, we present experiments with the approaches introduced in the third section. Section five presents an error analysis, where we highlight weaknesses of our entity-linking approach that should be addressed in future work. In Section six, we sum up the most important points of this chapter.

The contribution of this chapter is the following:1 (4) We propose a novel document retrieval system based on entity linking that outperforms other approaches on the FEVER18 dataset.

4.1 Problem setting

To evaluate the document retrieval systems presented in this chapter, we use the FEVER shared task corpus FEVER18 discussed in Section 3.1.2. In this thesis, we define document retrieval in the context of automated fact-checking as the retrieval of documents from a document collection using a claim as a query. The retrieved documents should contain relevant information for the validation of the claim.

1The complete list of contributions ranging from 1 to 10 is given in Section 2.1.1.

We define this problem setting formally as follows: find the k most relevant documents D = {d1, ..., dk} for the query (claim) c in the document collection DC according to the relevance function f(DC, c) and the ranking function g(d, c). Whereas the relevance function f(DC, c) retrieves n relevant documents d, the ranking function g(d, c) computes a ranking score for each of these documents. Thus, the documents can be ranked and the k highest-ranked documents selected as the retrieved documents. In practice, the relevance function is often based on an inverted index, meaning that a document is retrieved if there is lexical overlap between the query and the document. The ranking function is based on a similarity measure between the query and the document, such as Term Frequency-Inverse Document Frequency (TF-IDF). An example of a claim c, for which k = 3 documents D have been retrieved, is given below (Example 4.1). As can be noticed, all the retrieved documents discuss the claim.

Claim: Israel caused flooding in Gaza by opening river dams

Retrieved documents:

Doc.1 The Gaza Ministry of Interior said in a statement that civil defense services and teams from the Ministry of Public Works had evacuated more than 80 families from both sides of the Gaza Valley (Wadi Gaza) after their homes flooded as water levels reached more than three meters. “Israel opened water dams, without warning, last night, causing serious damage to Gazan villages near the border,” General Al-Saudi told Al Jazeera. ...

Doc.2 Hundreds of Palestinians left homeless after Israel opens river dams and floods houses General Al-Saudi said that the dams were opened without warning. The suffering is compounded by the fact that Israel has maintained a complete siege over Gaza for the last eight years, severely limiting electricity and the availability of fuel for generators. It has also prevented the displaced from rebuilding their homes, as construction materials are largely banned from entering. ...

Doc.3 The Daily Mail published a story on Monday that originally accused Israel of intentionally opening dams in southern Israel in order to flood Gaza. The only problem is, ... there are no dams in southern Israel. Honest Reporting, an NGO that according to its website “monitors the news for bias, inaccuracy, or other breach of journalistic standards in coverage of the Arab-Israeli conflict,” took screen shots of the article before amendments were made. Even more embarrassing ... the Daily Mail’s article attempted to connect the flooding in Gaza with the Israel Electric Company’s decision to cut power to the West Bank cities. ...

Example 4.1: Document retrieval for automated fact-checking
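To make the formal definition above concrete, the following minimal Python sketch implements the retrieve-and-rank scheme: the relevance function f(DC, c) is approximated by simple lexical overlap (a stand-in for an inverted index), and the ranking function g(d, c) by the cosine similarity of TF-IDF vectors computed with scikit-learn. The toy document collection is an illustrative assumption and not part of the FEVER18 data.

# Minimal sketch of the retrieve-and-rank scheme defined above:
# f(DC, c) keeps documents with lexical overlap to the claim,
# g(d, c) scores them by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(claim, document_collection, k=3):
    # f(DC, c): keep documents that share at least one token with the claim
    claim_tokens = set(claim.lower().split())
    candidates = [d for d in document_collection
                  if claim_tokens & set(d.lower().split())]
    if not candidates:
        return []
    # g(d, c): rank candidates by cosine similarity of TF-IDF vectors
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(candidates)
    claim_vector = vectorizer.transform([claim])
    scores = cosine_similarity(claim_vector, doc_vectors)[0]
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]

# Illustrative toy collection (not the FEVER18 data)
docs = ["Israel opened water dams causing flooding in Gaza villages.",
        "There are no dams in southern Israel, the report was amended.",
        "A new species of frog was discovered in the Amazon."]
for doc, score in retrieve("Israel caused flooding in Gaza by opening river dams", docs):
    print(f"{score:.3f}  {doc}")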


4.2 Related work

Document retrieval is a sub-branch of information retrieval (Manning et al., 2010; Baeza-Yates et al., 2011) and is the task of identifying free-text documents given a multi-word or multi-sentence query. Document retrieval systems, also referred to as search engines, can be based on lexical search (Robertson et al., 1995), non-lexical (semantic) search (Guha et al., 2003; Bast et al., 2016), or a mixture of the two. Another distinction can be made between machine-learning-based and non-machine-learning-based document retrieval systems. Whereas non-machine-learning-based systems only extract features from the document and the query in order to measure their similarity for document ranking, machine-learning-based systems learn a ranking function using examples of query-document pairs.

Document retrieval systems typically use an inverted index in order to keep track of which term is contained in which document. Thus, when a number of keywords is typed into the search box, all documents containing the keywords can be retrieved. For the ranking of the retrieved documents, a number of different approaches have been proposed. Traditional ranking approaches are based on the Okapi BM25 algorithm (Robertson et al., 1995) and/or Google's PageRank algorithm (Page et al., 1999). Okapi BM25 is a lexical, non-machine-learning-based algorithm which makes use of Term Frequency-Inverse Document Frequency (TF-IDF) in order to measure the similarity between the query and a document. A document and a query are thereby often represented as TF-IDF weighted bag-of-words vectors, and their cosine similarity is taken as the ranking score. PageRank uses the information about the cross-references between documents in order to determine the documents' importance. The more links point to a document, the more important it is and the higher its ranking position.

In contrast to Okapi BM25, which only considers lexical overlap between the query and the documents, semantic (non-lexical) non-machine-learning ranking algorithms are based on explicit semantic representations (Guha et al., 2003; Ruotsalo, 2012), such as open linked data (Bizer et al., 2011), semantic web ontologies (Huiping, 2010), or knowledge graphs (Singhal, 2012). Machine-learning-based semantic ranking approaches make use of learned representations (Kumar et al., 2012; SanJuan et al., 2007), topic models (Eickhoff and Neuss, 2017), and deep learning approaches (Guo et al., 2016; Mitra and Craswell, 2017; McDonald et al., 2018). Semantic and machine-learning-based ranking approaches are widely used in today's search engines. However, document retrieval systems for practical NLP applications still often rely only on TF-IDF similarity for ranking because the approach is effective and easy to implement.

Stab et al. (2018a) developed ArgumenText, a search engine for retrieving arguments. ArgumenText is based on an index generated by Elasticsearch (Gormley and Tong, 2015) and the Okapi BM25 ranking algorithm. In a first step, ArgumenText retrieves documents that potentially contain arguments for a given query. In the second step, the arguments within the documents are identified. Argument retrieval or argument search is a task defined within the field of argumentation mining (Section 1.2.3), and it is similar to the two steps document retrieval and evidence extraction in the context of automated fact-checking.

The retrieved arguments are similar to evidence, as they often represent credible statements, such as the conclusion of a scientific study or an expert opinion (Section 1.2.2), that either support or attack a given text query. However, whereas in document retrieval for automated fact-checking the query is a claim, in argument search the query is a controversial topic. The retrieval problem is therefore significantly different. Since a topic is mostly represented by only a few words (e.g. nuclear energy), keyword-based (or Boolean) search (Frants et al., 1999) is often sufficient to find the relevant documents. In automated fact-checking, ideally, the claim as a whole needs to be understood by the search engine in order to find the relevant documents.

Another line of research relevant to document retrieval for automated fact-checking is open-domain question answering. For this task, too, a document retrieval system is deployed in the first step of the pipeline. Chen et al. (2017a) tackle open-domain factoid question answering using Wikipedia as a knowledge source. They consider text spans in Wikipedia articles as answers to the factoid questions. To address the defined problem, Chen et al. (2017a) propose the DrQA system. The system first retrieves Wikipedia articles containing relevant information for a given question and then detects the answer text spans in the retrieved articles using a Recurrent Neural Network (RNN). DrQA is based on an inverted index followed by TF-IDF ranking. Chen et al. (2017a) further improve the performance of the system using n-gram features that represent information about the local word order. Since the first component of the DrQA system was developed to retrieve Wikipedia articles, Thorne et al. (2018a) used this component as a baseline for document retrieval in the FEVER shared task. In fact, for both tasks, in the first step, one needs to identify Wikipedia articles that contain relevant information for a given query. Whereas for the FEVER shared task the query is a claim, for question answering the query is a question. As both types of queries are sentences, it is expected that the document retrieval problems are similar.

Document retrieval for Wikipedia is a problem setting that provides additional information compared to a general document retrieval problem, and therefore, the task can also be approached in a different manner. A query often features one or multiple entities that form its main content. Wikipedia can be viewed as a knowledge base, where each article describes a particular entity, and the entity is represented by the article title. Thus, as originally proposed by Cucerzan (2007), document retrieval can be framed as an entity linking problem. More concretely, one can identify entity mentions in the claim and link them to the Wikipedia articles describing these entities. The linked Wikipedia articles can then be used as the set of retrieved documents. Nevertheless, even though the entity linking approach is often superior to classical document retrieval methods (see the analysis in Section 4.4), it is only applicable to encyclopedias where the article title corresponds to an entity which is described in the article body. Since this chapter is concerned with the document retrieval problem defined in the FEVER shared task, we use the DrQA system as a baseline in our experiments. The performance of the DrQA system is compared to the results of two entity-linking-based document retrieval approaches.
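Since Okapi BM25 plays a central role in the systems discussed above, the following minimal sketch implements the BM25 scoring formula with the common parameter choices k1 = 1.5 and b = 0.75; the toy corpus is an illustrative assumption, and the sketch is not the implementation used by any of the cited systems.

import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in the corpus against the query with Okapi BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # document frequency of each term
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

# Illustrative toy corpus
corpus = [doc.lower().split() for doc in [
    "Israel opened water dams causing flooding in Gaza",
    "there are no dams in southern Israel",
    "a new frog species was discovered in the Amazon"]]
query = "israel caused flooding in gaza by opening river dams".split()
print(bm25_scores(query, corpus))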


4.3 Document retrieval systems

In this section, we present a method for linking entity mentions in a query to entities in the Wikidata knowledge base (Vrandečić and Krötzsch, 2014) that was proposed by Sorokin and Gurevych (2018), and an entity linking approach that we have developed for the FEVER shared task document retrieval problem. Since Wikidata is based on Wikipedia, the approach proposed by Sorokin and Gurevych (2018) is also suitable for the FEVER document retrieval problem.

4.3.1 Entity linking for Wikidata

Sorokin and Gurevych (2018) introduced an entity linking approach for question answering based on the Wikidata knowledge base (Vrandečić and Krötzsch, 2014). In the first step of the question-answering task, entities in the input question have to be linked to entities in the knowledge base. A jointly optimized neural architecture is used for the detection of entity mentions in the input question and their disambiguation. In the detection phase, token n-grams need to be identified which correspond to the entity mention, and in the disambiguation phase, entities in the knowledge base need to be found to which the detected entity mention refers. As features, Sorokin and Gurevych (2018) use the surrounding context of the candidate n-gram on different levels of granularity. A token-level system component extracts higher-level features from the whole question context, and a character-level system component builds lower-level features for the candidate n-gram. E.g., for the question “Where was Barack Obama born?”, the token n-grams “where, where-was, where-was-barack, ...” and the character n-grams “wh, ere, was, bar, ack, ...” are extracted as features.

As described in Section 4.2, entity linking approaches can be leveraged for retrieving Wikipedia articles, whereby entities in the query are matched with the titles of the Wikipedia articles. Since the approach developed by Sorokin and Gurevych (2018) was designed for the Wikidata knowledge base, which is constructed on the basis of Wikipedia, the method is also suitable for document retrieval based on Wikipedia. In fact, claims in the FEVER18 dataset also feature entities that can be linked to Wikipedia articles. Thus, in the experiments in the subsequent section, we evaluate the performance of the system on the FEVER document retrieval task.
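As an illustration of the mention-detection features described above, the following sketch enumerates cumulative token n-grams and overlapping character n-grams for a question; it is a simplified approximation of the feature extraction of Sorokin and Gurevych (2018), not their actual implementation.

def token_ngrams(tokens, max_n=4):
    """Cumulative token n-grams starting at the sentence beginning,
    e.g. 'where', 'where-was', 'where-was-barack', ..."""
    return ["-".join(tokens[:n]) for n in range(1, min(max_n, len(tokens)) + 1)]

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a candidate mention (a simple variant
    of the character-level features described above)."""
    text = text.replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]

question = "Where was Barack Obama born ?"
tokens = question.lower().split()
print(token_ngrams(tokens))        # ['where', 'where-was', 'where-was-barack', ...]
print(char_ngrams("barack obama")) # ['bar', 'ara', 'rac', ...]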

4.3.2 Entity linking based document retrieval for Wikipedia

The system introduced by Sorokin and Gurevych (2018) is designed for linking entity mentions in a question to entities in the Wikidata knowledge base. Even though the system can also be applied for linking entity mentions in a claim to Wikipedia articles, it was not developed for this purpose. In fact, after analyzing the FEVER18 dataset, we have found that other features, which are not used by Sorokin and Gurevych (2018), are also important for document retrieval on this dataset. We have observed that instead of using n-gram features, we can reach higher performance if we use noun phrases and the subject of the claim as entity candidates (see the discussion below). Thus, we have developed our own entity linking approach that is particularly tailored to retrieving articles from Wikipedia in the context of the FEVER shared task.

The main challenge in entity linking is the processing of ambiguous entity mentions. Since a claim is only a single sentence, which does not provide much context for disambiguation, we base our system on entity linking approaches for short texts (Guo et al., 2013; Sorokin and Gurevych, 2018). These approaches focus on the extraction and modeling of the entity mentions. In our problem setting, we need to find entities in the claims that match the titles of Wikipedia articles. For instance, given the claim “Robin Thicke has worked with Pharrell Williams”, the two named entities in the claim can be linked to the Wikipedia article titles “Robin Thicke” and “Pharrell Williams”. The article “Robin Thicke” is the Wikipedia article that contains supporting evidence sentences for the given claim and would therefore be the desired document. Following the typical entity linking pipeline, we develop a document retrieval component that has three main steps, which are described below. The three steps are also illustrated in Figure 4.1.

Figure 4.1: Entity linking for the FEVER shared task document retrieval problem (source of figure (Zhang, 2018))

Mention extraction. In general, pre-trained named entity recognition models, such as those contained in the Stanford CoreNLP (Manning et al., 2014) or AllenNLP (Gardner et al., 2018) libraries, can be used for mention extraction. However, these models focus on the three main types of named entities (Organization, People, and Location), which represent only a subset of the entities present in the claims of the FEVER18 dataset. Thus, we develop our own mention extraction approach. In order to find entities of different categories, such as movie or song titles, which are numerous in the claims of FEVER18, we employ the constituency parser from AllenNLP (Gardner et al., 2018). The constituency parser breaks a claim into phrases or constituents in the form of a tree structure (see the example in Figure 4.1).

After parsing the claim, we consider every noun phrase as a potential entity mention that can be linked to a Wikipedia article. Nevertheless, a movie or a song title may be an adjective or any other type of syntactic phrase. To account for such cases, we also consider the set of words in the claim before the main verb, as well as the whole claim itself, as potential entity mentions. For example, the claim “Down With Love is a 2003 comedy film.” contains the noun phrases “a 2003 comedy film” and “Love”. Neither of these two noun phrases is the correct entity mention, but all words before the main verb “is” form the entity “Down With Love”.
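The following sketch approximates the mention extraction step. As a compact stand-in for the AllenNLP constituency parser used in our system, it relies on spaCy's noun chunks and dependency parse (an assumption made purely for the sake of a short, self-contained example); the candidates are the noun phrases, the token span before the main verb, and the claim itself.

# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_mention_candidates(claim):
    doc = nlp(claim)
    candidates = set()
    # (1) every noun phrase is a potential entity mention
    for np in doc.noun_chunks:
        candidates.add(np.text)
    # (2) all tokens before the main verb (the "subject" heuristic)
    root = next((t for t in doc if t.dep_ == "ROOT"), None)
    if root is not None and root.i > 0:
        candidates.add(doc[:root.i].text)
    # (3) the whole claim itself
    candidates.add(claim)
    return candidates

print(extract_mention_candidates("Down With Love is a 2003 comedy film."))
# e.g. {'Love', 'a 2003 comedy film', 'Down With Love', 'Down With Love is a 2003 comedy film.'}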

Candidate article search. We use the MediaWiki API2 to search through the entire English Wikipedia in order to find Wikipedia article titles that potentially correspond to the entity mentions found in the claim in the previous step. The MediaWiki API uses the Wikipedia search engine to find matching articles. Given an entity mention as a query, it returns about ten Wikipedia articles. The order of the articles reflects the similarity between the entity mention and the named entity that the Wikipedia article describes. This means that the higher the article is placed in the ranking, the more likely it is that the article describes the entity mentioned in the claim. The MediaWiki API uses the online version of Wikipedia, and therefore, there are some discrepancies between the 2017 dump provided by the shared task organizers and the latest Wikipedia version. To account for the difference, we also perform a search over all Wikipedia articles in the dump, where we consider only exact matches between the extracted entity mentions and the Wikipedia article titles. We add these results to the set of retrieved articles.
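A minimal sketch of the candidate article search against the public MediaWiki search endpoint is given below; error handling is kept to a minimum and the query is an illustrative assumption.

import requests

def search_wikipedia_titles(entity_mention, limit=10):
    """Return the titles of Wikipedia articles matching the entity mention,
    ordered by the relevance ranking of the Wikipedia search engine."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "search",
            "srsearch": entity_mention,
            "srlimit": limit,
            "format": "json",
        },
        headers={"User-Agent": "doc-retrieval-example/0.1"},
        timeout=10,
    )
    response.raise_for_status()
    return [hit["title"] for hit in response.json()["query"]["search"]]

print(search_wikipedia_titles("Robin Thicke"))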

Candidate filtering. The MediaWiki API retrieves articles whose titles overlap with the entity mention query. The results may therefore contain articles with a title longer or shorter than the entity mention used as the query, and thus, these articles can refer to different entities. As illustrated in the example in Figure 4.1, using Hot Right Now (a song name) as a query, the article with the title So Hot Right Now is among the search results. However, this article refers to a different song that is unrelated to Hot Right Now. To address this problem, we remove articles whose titles are longer than the entity mention and do not overlap with the rest of the claim; e.g., since So Hot Right Now is longer than Hot Right Now and the word So is not contained in the claim, this article is discarded. To evaluate the overlap, we first remove the content in parentheses (used for disambiguation) from the Wikipedia article titles and stem the remaining words in the titles and the claim. Then, we discard a Wikipedia article if its stemmed title is not completely included in the stemmed claim. After filtering, we collect all retrieved Wikipedia articles for all identified entity mentions in the claim and supply them to the next step in the fact-checking pipeline, which is evidence extraction.

2https://www.mediawiki.org/wiki/API:Main_page
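The filtering heuristic can be sketched as follows: the disambiguation suffix in parentheses is stripped from each candidate title, title and claim are stemmed, and a candidate is kept only if its stemmed title is fully contained in the stemmed claim. The NLTK Porter stemmer is used here as one possible choice of stemmer, and the example claim is an illustrative assumption.

import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def _stem_tokens(text):
    return [stemmer.stem(tok) for tok in re.findall(r"\w+", text.lower())]

def filter_candidates(claim, candidate_titles):
    """Keep only titles whose stemmed words are all contained in the stemmed claim."""
    claim_stems = set(_stem_tokens(claim))
    kept = []
    for title in candidate_titles:
        # drop the disambiguation part, e.g. "Hot Right Now (song)" -> "Hot Right Now"
        cleaned = re.sub(r"\s*\(.*?\)\s*$", "", title)
        if set(_stem_tokens(cleaned)) <= claim_stems:
            kept.append(title)
    return kept

claim = "Hot Right Now was released by DJ Fresh."
print(filter_candidates(claim, ["Hot Right Now", "So Hot Right Now", "Hot Right Now (song)"]))
# ['Hot Right Now', 'Hot Right Now (song)'] - "So Hot Right Now" is discarded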


4.4 Experiments

In Table 4.1, we illustrate how the introduced heuristics affect the performance of our document retrieval system on the FEVER18 development set. Using only noun phrases as queries for the MediaWiki API already leads to high performance, as we are able to retrieve 88.37% of the gold-standard documents when considering the 7 highest-ranked articles (recall @7). By adding entire claims and the tokens before the main verb (subject) to the retrieved entity mentions, the document recall @7 could be further improved by about two percentage points.

candidate entity type       accuracy   recall @7
NPs                         92.24      88.37
NPs + claims                92.69      89.40
NPs + claims + subject      93.55      90.33

Table 4.1: The influence of the selection of different entity mention candidates on the performance of the document retrieval system on the FEVER18 development set (7 highest ranked articles considered)

Table 4.2 shows the performance of our document retrieval system when retrieving different numbers of the highest-ranked Wikipedia articles. The results show that the more articles we retrieve, the better the accuracy and recall of the system. However, since the overall number of sentences retrieved for a claim is increased, this does not necessarily improve the performance of the subsequent evidence extraction system. In fact, when using 10 articles, we have a larger number of irrelevant candidate evidence sentences and the performance of the evidence extraction component is reduced (see the analysis in Section 6.4.1). Thus, in the results reported in the rest of the thesis, we only use the system which retrieves 7 documents.

# docs   accuracy   recall
1        89.75      84.63
3        92.60      88.90
5        93.30      89.94
7        93.55      90.33
10       93.66      90.49

Table 4.2: Performance of our retrieval system when using different numbers of search results on the FEVER18 development set

In Table 4.3, we compare the performance of our document retrieval system to the results obtained by the baseline system, which was provided by the shared task organizers (Thorne et al., 2018a), and to the entity linking approach proposed by Sorokin and Gurevych (2018). As the results demonstrate, we were able to substantially outperform both systems. The FEVER shared task baseline is based only on an inverted index and TF-IDF ranking and does not make use of the additional information provided by the Wikipedia article headlines.

The entity linking approach proposed by Sorokin and Gurevych (2018) is designed to link entities in questions with entities in the Wikidata knowledge base and is therefore not well suited for retrieving Wikipedia documents. Compared to questions, the claims in the FEVER shared task corpus exhibit a different structure. We have exploited this structure in our entity linking approach by identifying candidate entity mentions using a constituency parser and by filtering the retrieved articles based on heuristics. As demonstrated by the results, the introduced modifications allow us to identify the Wikipedia articles with higher recall, and we can therefore substantially increase the performance of the system.

system                        recall
Thorne et al. (2018a)         70.20
Sorokin and Gurevych (2018)   73.70
our system                    90.49

Table 4.3: Results of the compared entity linking approaches on the FEVER18 development set

4.5 Error analysis

We have identified the following causes for the errors made by our system:

Spelling errors. A word in the claim or in the article title is misspelled, e.g. “Homer Hickman wrote some historical fiction novels.” vs. “Homer Hickam”. If spelling errors occur, our document retrieval system discards the article during the filtering phase. In order to address this problem, a document retrieval system could be based not on exact matches between tokens but on the cosine similarity between word and/or character embeddings. Since a misspelled word would have a similar word or character embedding, the performance of the system for these cases could be improved.

Missing entity mentions. The title of the article that needs to be retrieved is not related to any entity mention in the claim. E.g., article title to be retrieved: “Blue Jasmine”; claim: “Cate Blanchett ignored the offer to act in Cate Blanchett.”. This problem is difficult to solve, as there is no lexical overlap between the title and the claim. However, also in this case, embeddings could potentially help to improve performance, as the embeddings of the entities in the claim and the embedding of the article title might be semantically similar.

Search failures. Some Wikipedia articles refer to different entities but have the same title. For disambiguation, they contain a category name in parentheses, e.g., Michael Jordan (basketball). Since the correct entity among the different alternatives needs to be retrieved, this makes it additionally difficult to find articles using the MediaWiki API. E.g., for the claim “Alex Jones is apolitical” the article “Alex Jones (radio host)” needs to be retrieved, but the MediaWiki API only returns Wikipedia articles that refer to other people: “Alex Jones (actor)”, “Alex Jones (baseball)”, “Alex Jones (basketball)”, ...

A future document retrieval system should be able to identify all article headlines that match the entity and then disambiguate between those.
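As a first step towards such soft matching, even a simple character-level similarity would recover the misspelled mention from the example above; the sketch below uses Python's difflib as an illustration, and the similarity measure and threshold are assumptions rather than part of our system.

from difflib import SequenceMatcher

def char_similarity(a, b):
    """Ratio of matching characters between two strings (0.0 - 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Exact matching fails for the misspelled mention, but a soft threshold keeps it
print(char_similarity("Homer Hickman", "Homer Hickam"))   # high similarity
print(char_similarity("Homer Hickman", "Michael Jordan")) # much lower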

4.6 Chapter summary

In this chapter, we analyzed the problem of retrieving documents in the context of automated fact-checking. Thereby, we mostly focused on the document retrieval sub-task of the FEVER shared task. We first discussed the document retrieval problem setting and provided an example. Next, we presented related work on the task of retrieving documents from which, in a subsequent step, evidence for a claim or answers to a question need to be identified. We thereby presented a document retrieval approach for open-domain question answering which served as a baseline in the FEVER shared task.

In the second part of the chapter, we presented two entity linking based approaches for retrieving Wikipedia articles. These methods are based on the observation that entities in the claims of the FEVER shared task often match headlines of Wikipedia articles. The first approach we have presented was proposed by Sorokin and Gurevych (2018) and was originally designed for linking entities in questions to entities in the Wikidata knowledge base. However, having observed that claims in the FEVER18 dataset have a different structure compared to questions, we have developed our own entity linking approach that is particularly tailored for matching entities in the claims of the FEVER18 dataset to Wikipedia articles. After presenting the two entity linking approaches, we discussed the experiments we have performed with the systems. We have shown that the different components we have introduced in our system are beneficial and help to increase performance. As a result, we are able to substantially outperform the FEVER shared task baseline system, as well as the entity linking approach proposed by Sorokin and Gurevych (2018).

In the last part of the chapter, we presented the results of the error analysis that we have conducted for our entity linking approach, in order to help improve the performance of the system in future research. In the analysis, we have identified three major causes of errors: spelling errors in the claim or the Wikipedia articles, missing lexical overlap between the claim and the titles of the correct Wikipedia articles, and failures of the used search engine to retrieve the required Wikipedia articles.



Chapter 5

Stance detection

In this chapter, we analyze the stance detection problem which, after document retrieval, represents the second task in the fact-checking pipeline (Section 1.2.2). We define stance detection as the problem of identifying the relative perspective of a document with respect to a given claim. The detection of the stance of a document provides valuable information for the fact-checker operating the system, as the agreement/disagreement between the retrieved documents and the claim can be made explicit. Moreover, this information can potentially be helpful for the subsequent tasks in the fact-checking pipeline: evidence extraction and claim validation. We investigate the stance detection problem by running experiments on the corpora FNC2017, ARC2017, and Snopes19. Here, the stance of a news article (FNC2017), a user-post (ARC2017), or an Evidence Text Snippet (ETS) (Snopes19) needs to be identified with respect to a claim (ARC2017 and Snopes19) or an article headline (FNC2017). Even though news articles, user-posts, and ETSs vary in size, they in general consist of multiple sentences, and for the sake of simplicity, we will refer to them as documents.

The analysis of the stance detection problem in this chapter is structured as follows. In the first part (Section 5.1), we discuss the stance detection problem and provide an example. The second part (Section 5.2) gives an overview of related work in the field of stance detection. In the third part (Section 5.3), we analyze the Fake News Challenge (FNC) stance detection task1 (FNC2017) and perform experiments on the ARC2017 corpus for comparison. In the fourth part (Section 5.4), we conduct experiments with three different models on our newly introduced corpus Snopes19.

The contributions of this chapter are the following.2 (7) We present our stance detection system for determining the stance of a document with respect to a given headline, which was deployed in the Fake News Challenge. (8) We analyze the top three systems of the Fake News Challenge, and the Fake News Challenge problem setting itself. We evaluate the performance of the features of the top three models on FNC2017, as well as the performance of a new set of features. (1) We evaluate the generalizability of all analyzed FNC models by testing them

1http://www.fakenewschallenge.org/ 2The complete list of contributions ranging from 1 to 10 is given in Section 2.1.1.

on a second corpus, ARC2017, and by performing cross-domain experiments between FNC2017 and ARC2017. (4) We conduct experiments on the stance detection task of the Snopes19 corpus with systems that reach high performance in similar problem settings and discuss the results. Moreover, we conduct an error analysis in order to identify challenging instances that can be addressed in future work.

5.1 Problem setting

We define stance detection as the problem of identifying the relative perspective of a document towards a claim. Our definition of a document is broad, and we also consider news articles, user blog posts, and Evidence Text Snippets (ETSs) as documents. Moreover, in the FNC2017 corpus, the anchor, with respect to which the stance of the documents needs to be determined, is a news article headline and not a claim. Nevertheless, the headlines are often framed as claims, and the FNC problem setting is therefore similar to the stance detection task defined for the other two corpora. This broad definition of the problem setting allows us to analyze how the stance detection problem differs for each corpus and how different methods perform across different corpora. In this chapter, we consider two stance labeling schemes: the FNC problem setting (agree, disagree, discuss, unrelated) and our definition of the stance detection problem for the Snopes19 corpus (agree, refute, no stance). An example of the annotation of the stance of a number of documents with respect to a claim is given below in Example 5.1. Whereas the green-colored documents agree with the claim, the red-colored document disagrees with the claim.

Claim: Israel caused flooding in Gaza by opening river dams

Retrieved documents:

Doc. 1 The Gaza Ministry of Interior said in a statement that civil defense services and teams from the Ministry of Public Works had evacuated more than 80 families from both sides of the Gaza Valley (Wadi Gaza) after their homes flooded as water levels reached more than three meters. “Israel opened water dams, without warning, last night, causing serious damage to Gazan villages near the border,” General Al-Saudi told Al Jazeera. ...

Doc. 2 Hundreds of Palestinians left homeless after Israel opens river dams and floods houses General Al-Saudi said that the dams were opened without warning. The suffering is compounded by the fact that Israel has maintained a complete siege over Gaza for the last eight years, severely limiting electricity and the availability of fuel for generators. It has also prevented the displaced from rebuilding their homes, as construction materials are largely banned from entering. ...

Doc. 3 The Daily Mail published a story on Monday that originally accused Israel of intentionally opening dams in southern Israel in order to flood Gaza. As writer Lydia Willgress has learned the hard way, there are no dams in southern Israel.

Honest Reporting ... took screen shots of the article before amendments were made. Even more embarrassing ... the Daily Mail's article attempted to connect the flooding in Gaza with the Israel Electric Company's decision to cut power to the West Bank cities. ...

Example 5.1: Stance detection for automated fact-checking

5.2 Related work

Previous works in stance detection mostly focused on target-specific stance prediction, where the stance of a piece of text with respect to a topic or a named entity is determined. Corpora concerned with document-level stance detection are the three corpora FNC2017, ARC2017, and Snopes19, which we use for the experiments in this chapter, and the datasets introduced by Hasan and Ng (2013) and Faulkner (2014). Below, we first briefly discuss previous work on target-specific stance prediction and then work on document-level stance detection. Since in this chapter we analyze the FNC problem setting in more detail, we discuss related work on FNC2017 in a separate subsection.

5.2.1 Target-specific stance detection

Target-specific stance detection has been done for tweets (Mohammad et al., 2016; Augenstein et al., 2016; Zarrella and Marsh, 2016) and online debates (Walker et al., 2012; Somasundaran and Wiebe, 2010; Sridhar et al., 2015). Such target-specific approaches are based on structural (Walker et al., 2012), linguistic, and lexical features (Somasundaran and Wiebe, 2010) and use probabilistic soft logic (Sridhar et al., 2015) or neural models (Zarrella and Marsh, 2016; Du et al., 2017) with conditional encoding (Augenstein et al., 2016) to predict the stance. Stance prediction in tweets (Mohammad et al., 2016; Augenstein et al., 2016; Du et al., 2017) and in online debates (Hasan and Ng, 2013) differs from the problem setting we are analyzing in this chapter, as the considered text pieces consist of only a few sentences instead of an entire document. Moreover, the stance is determined with respect to a target, which is a single word or a multi-word expression, and not a statement in natural language, such as a claim or a headline, as in our case.

5.2.2 Document-level stance detection

Faulkner (2014) tackled document-level stance detection in student essays, and Hasan and Ng (2013) considered document-level stance detection in online debates. Faulkner (2014) proposed a classifier based on lexica of polarity words, dependency sub-trees, and prompt topic words in the target that invoke a response in the essay. Hasan and Ng (2013) developed a sentence modeling approach to improve document-level stance classification. They hypothesize that sentences with a neutral stance should not play a role in determining the document's stance. However, this pipeline approach depends heavily on the sentence-level stance classification being accurate.


5.2.3 FNC stance detection problem

The FNC stance detection task is inspired by Ferreira and Vlachos (2016), who classify the stance of a headline (summarizing a news article in a sentence) towards a specific claim. In the FNC, however, the task is document-level stance detection, which requires the classification of an entire news article (document) relative to the headline (see also the discussion in Section 3.1.2). There has been much work done on the FNC2017 dataset, and not all of the studies and methods can be discussed here in detail. Thus, below, only the studies are outlined that we consider to be most significant.

The top performing system in the FNC was developed by the team SOLAT in the SWEN (Sean et al., 2017) from Talos Intelligence (Talos henceforth). They used a combination of a deep convolutional neural network and gradient-boosted decision trees with lexical features. Our system Athene (Hanselowski et al., 2017) won the second place with an ensemble of five Multi-Layer Perceptrons (MLPs) and handcrafted features. UCL Machine Reading (UCLMR) (Riedel et al., 2017) were placed third using a multi-layer perceptron with bag-of-words features. The three systems are presented in Section 5.3.1 in more detail. Other works on the FNC use a two-step logistic regression-based classifier (Bourgonje et al., 2017) and a stacked ensemble of five classifiers (Thorne et al., 2017) and achieve the 9th and 11th places respectively. Although many groups performed experiments on the FNC2017 corpus, in our analysis in Section 5.3, we focus only on the top three systems due to the availability of source code and our goal of analyzing what contributes most to good performance.

The more recent studies on the FNC presented below were published after our analysis and are therefore beyond the scope of this chapter. Mohtarami et al. (2018) introduced an end-to-end memory network to address the FNC stance detection task. The memory module of the model is based on convolutional and recurrent neural networks. The model uses a similarity matrix in order to extract evidence text snippets that are most indicative of the stance. The model stands out among other models developed for the FNC task, as it allows the user to identify sentences expressing a stance towards the headline that can then be used to reason about the veracity of the headline. However, the model does not outperform the top three systems from the FNC. Zhang et al. (2018) framed the FNC stance detection task as a ranking problem. For this purpose, they defined a max-margin hinge-loss objective function that maximizes the difference between the ranking score of the correct stance prediction and the ranking scores for the other three classes. This approach is superior to the classification-based method used by the three winning systems of the FNC and reaches a new state-of-the-art on FNC2017. Other studies concerned with the FNC are the following (Thorne et al., 2017; Bourgonje et al., 2017; Stanford, 2017).

5.2.4 Summary

As we have outlined in this section, there was not much work on document-level stance detection prior to the FNC. Most of the related work focused on target-specific stance detection, where the stance of a relatively short piece of text (one or several sentences) with respect to a multi-word expression (target) is determined.

The FNC document-level stance detection problem is more challenging, as the stance of an entire document with respect to a news article headline needs to be determined. A document typically contains many neutral sentences, which do not express a stance, and can include sentences with opposing stances, that is, both sentences agreeing and sentences disagreeing with the headline. In the latter case, it is particularly difficult to determine the stance of the article as a whole. Nevertheless, most of the successful systems on the FNC2017 corpus do not model the stance of individual sentences, but directly classify the stance of the entire document, mostly relying on lexical features. These systems are analyzed in the rest of this chapter in more detail.

5.3 Analysis of the FNC stance detection task

Even though the FNC has received much attention in the NLP community, with 50 participating teams, there was no overview or analysis paper on the FNC comparable to those for the shared tasks on detecting stance in Twitter (Mohammad et al., 2016; Derczynski et al., 2017; Taulé et al., 2017). In the interest of scientific best practice and research transparency, we closed this gap by performing a detailed analysis of the FNC stance detection task and by systematically reviewing the top-ranked systems of the FNC. More specifically, in this section, we present the top three participating systems of the FNC, reproduce and analyze their results, and investigate the performance of different features for the FNC problem setting. Based on the analysis, we propose an optimized set of features and a new model for the FNC task. In order to analyze the generalization capabilities of the models developed for the FNC, we evaluate their performance on the ARC2017 dataset.

5.3.1 Participating systems

In our experiments, we consider the FNC's three top-ranked systems, which we abbreviate as Talos, Athene, and UCLMR.

Talos. Talos Intelligence's SOLAT in the SWEN team (Sean et al., 2017) won the FNC using their weighted average model (TalosComb) of a deep convolutional neural network (TalosCNN) and a gradient-boosted decision trees model (TalosTree). TalosCNN, illustrated in Figure 5.1, uses pre-trained word2vec embeddings.3 The embeddings are passed through several convolutional layers followed by three fully-connected layers and a final softmax layer for classification. TalosTree, displayed in Figure 5.2, is based on word count, TF-IDF, sentiment, and singular-value decomposition features, and word2vec embeddings. The combined model (TalosComb) is illustrated in Figure 5.3.

Athene. Our model Athene4 (Hanselowski et al., 2017) came in second. We proposed a Multi-Layer Perceptron (MLP) inspired by the work of Davis and Proctor (2017). We extended the original model structure to six hidden layers and one softmax layer.

3https://code.google.com/archive/p/word2vec/ 4The model architecture was mainly developed by Benjamin Schiller, who was a member of our team participating in the Fake News Challenge (Schiller, 2017).


Figure 5.1: TalosCNN (source: (Sean et al., 2017))

Figure 5.2: TalosTree (source: (Sean et al., 2017))

Figure 5.3: Talos Intelligence combined model (TalosComb) (source: (Sean et al., 2017))


Figure 5.4: Our multi-layer perceptron (Athene)

In addition, we incorporated a number of hand-engineered features: unigrams; the cosine similarity of word embeddings of nouns and verbs of the headline and the document; topic models based on non-negative matrix factorization (NMF), latent Dirichlet allocation (LDA), and latent semantic indexing (LSI); and the baseline features provided by the FNC organizers. Depending on the feature type, we either formed separate feature vectors for document and headline or a joint feature vector. In the competition, we used an ensemble of five MLPs that were trained with different random seeds, and for the prediction of the stance we used hard voting. The model architecture is displayed in Figure 5.4.

UCLMR. The UCL Machine Reading group proposed an MLP model (Riedel et al., 2017), which was ranked third. In contrast to our MLP, it uses only a single hidden layer. As features, Riedel et al. (2017) used term frequency vectors of unigrams of the 5,000 most frequent words for the headlines and the documents. Additionally, they computed the cosine similarity between the TF-IDF vectors of the headline and the document. The feature vectors of the headline and the document are concatenated along with the cosine similarity of the two feature vectors and then fed into the MLP. The model architecture is displayed in Figure 5.5.
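To make the UCLMR feature construction concrete, the following sketch builds term frequency vectors over the 5,000 most frequent words for headline and document, adds the cosine similarity of their TF-IDF vectors, and feeds the concatenation into a single-hidden-layer MLP using scikit-learn. The toy instances, labels, and hyperparameters are illustrative assumptions and do not reproduce the original system.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neural_network import MLPClassifier

# Illustrative toy instances (not from FNC2017)
headlines = ["Israel opens dams and floods Gaza",
             "Woman pays 20,000 for a third breast"]
documents = ["General Al-Saudi said the dams were opened without warning.",
             "The woman who claimed she had a third breast has been proved a hoax."]
labels = ["agree", "disagree"]

corpus = headlines + documents
tf = CountVectorizer(max_features=5000).fit(corpus)     # term frequency features
tfidf = TfidfVectorizer(max_features=5000).fit(corpus)  # for the similarity feature

def featurize(headline, document):
    h_tf = tf.transform([headline]).toarray()
    d_tf = tf.transform([document]).toarray()
    sim = cosine_similarity(tfidf.transform([headline]), tfidf.transform([document]))
    # concatenate headline vector, document vector, and their TF-IDF cosine similarity
    return np.hstack([h_tf, d_tf, sim])[0]

X = np.array([featurize(h, d) for h, d in zip(headlines, documents)])
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500).fit(X, labels)
print(clf.predict(X))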

5.3.2 Reproduction of the results

Following the instructions from the GitHub repositories of the teams Talos5 and UCLMR6, we could successfully reproduce the results reported in the competition without significant deviations. In Table 5.1, we compare these results to our model Athene. Since Talos use a combination of two models, we have also included the results of TalosCNN and TalosTree in the table.

5https://github.com/Cisco-Talos/fnc-1 6https://github.com/uclmr/fakenewschallenge


Figure 5.5: Multi-layer perceptron introduced by Riedel et al. (2017) (UCLMR) (source: (Riedel et al., 2017))

The models featMLP and stackLSTM listed in the table will be introduced later in the chapter. The first interesting finding is that TalosTree even outperforms the combined model (TalosComb), since TalosCNN's performance is relatively low. We have also observed that basically all the models perform about 20% worse on the test set than on the development set. This is most likely because the test dataset is based on 100 new topics, which have not been seen during training (see the discussion in Section 3.1.2). Thus, the evaluation on the test set can be considered an out-of-domain prediction. To understand further merits and drawbacks of the systems, we analyze the performance metrics and the features used in the following subsections.

5.3.3 Evaluation of the FNC metric

The FNC organizers proposed a hierarchical metric for the evaluation of the participating systems, which is illustrated in Figure 5.6. The metric first awards 0.25 points if a document is correctly classified as related (i.e., agree, disagree, discuss) or unrelated to a given headline. If the document is related, 0.75 additional points are assigned if the model correctly classifies the document as agree, disagree, or discuss. The goal of this weighting is to balance out the large number of unrelated instances in FNC2017. Nevertheless, the metric fails to take into account the highly imbalanced class distribution of the three related classes agree, disagree, and discuss, as illustrated in Table 3.3. Thus, models which perform well on the majority class and poorly on the minority classes are favored.


Systems         FNC-metric   F1m    agree   disagree   discuss   unrelated
Upper bound     .859         .754   .588    .667       .765      .997
stackLSTM       .821         .609   .501    .180       .757      .995
featMLP         .825         .607   .530    .151       .766      .982
TalosComb       .820         .582   .539    .035       .760      .994
TalosTree       .830         .570   .520    .003       .762      .994
TalosCNN        .502         .308   .258    .092       0.0       .882
Athene          .820         .604   .487    .151       .780      .996
UCLMR           .817         .583   .479    .114       .747      .989
Majority vote   .394         .210   0.0     0.0        0.0       .839

Table 5.1: FNC-metric, F1macro (F1m), and class-wise F1 scores for the analyzed models

Figure 5.6: FNC metric for the evaluation of systems in the FNC (source: www.fakenewschallenge.org)

Because it is not difficult to separate related from unrelated instances (the best systems reach about F1 = 0.99 for the unrelated class), a classifier that reaches 100% on related vs. unrelated classification and then just randomly predicts one of the three related classes would already achieve a high FNC-metric score. A classifier that always predicts the majority class discuss for the related documents even reaches 0.833 on the FNC-metric, which is higher than the score of the top-ranked system.

Thus, using only this simple approach, one would be able to win the FNC. We therefore argue that the FNC-metric is not appropriate for evaluating the document-level stance detection task defined for FNC2017. Instead, we propose the class-wise F1 and the macro-averaged F1macro as new metrics for this task that are not affected by the large size of the majority class. The class-wise F1 scores are the harmonic means of the precisions and recalls of the four classes, which are then averaged to obtain the F1macro metric. The naïve approach of perfectly classifying unrelated and always predicting discuss for the related classes would achieve only F1macro = 0.444, which is clearly a lower score compared to the F1macro scores of the top three systems (see Table 5.1). By averaging over the individual classes' scores, F1macro also allows for a fairer comparison to other datasets, which have a different class distribution than FNC2017. As the results in Table 5.1 show, the three top-ranked systems reach only about F1macro = 0.6. The results reveal that TalosCNN does not predict the discuss class, yielding an F1 score of zero for this class. Also, the overall performance of this model is low according to the FNC-metric. In contrast to TalosCNN, TalosTree performs exceptionally well in terms of the FNC-metric, but it returns almost no predictions for the disagree class. Since there are only a few disagree instances and the model more often predicts the majority class discuss, the overall performance of this model appears high. As a result, TalosTree would even outperform TalosComb.
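The following sketch implements the hierarchical FNC score exactly as described above (0.25 points for a correct related/unrelated decision and 0.75 additional points for the correct related class, reported relative to the maximum achievable score), together with the class-wise F1 and F1macro that we propose instead; the toy predictions are illustrative assumptions.

from sklearn.metrics import f1_score

RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, pred):
    """Relative FNC-metric: achieved points divided by the maximum achievable points."""
    achieved, best = 0.0, 0.0
    for g, p in zip(gold, pred):
        g_related, p_related = g in RELATED, p in RELATED
        best += 0.25 + (0.75 if g_related else 0.0)
        if g_related == p_related:        # related/unrelated decided correctly
            achieved += 0.25
        if g_related and g == p:          # correct related class
            achieved += 0.75
    return achieved / best

# Toy predictions for illustration
gold = ["agree", "disagree", "discuss", "unrelated", "unrelated"]
pred = ["discuss", "discuss", "discuss", "unrelated", "agree"]

print(fnc_score(gold, pred))                                              # FNC-metric
print(f1_score(gold, pred, average=None, labels=sorted(set(gold)), zero_division=0))  # class-wise F1
print(f1_score(gold, pred, average="macro", zero_division=0))             # F1 macro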

Considering the FNC results according to our proposed F1macro metric, the ranking of the systems changes. The TalosComb and TalosTree systems are slightly outperformed by UCLMR and clearly outperformed by Athene. This is because the two Talos models benefit from the FNC-metric definition, favoring the prediction of the majority classes unrelated and discuss. On smaller classes, such as disagree, they perform much worse than Athene and UCLMR. Using F1macro as a metric, the Athene system would be ranked first, as it outperforms the second-ranked UCLMR by 2.1 percentage points. In addition to that, Athene also works best on the agree, disagree, and discuss classes.

5.3.4 Analysis of models and features

In this subsection, we first perform an error analysis for the top three models in the FNC in order to find out what the models are learning and in which cases they fail. In order to address the identified drawbacks, we subsequently conduct a systematic feature analysis and derive two alternative models based on our findings.

Error analysis

We identified four major causes of errors: lexical overlap, synonyms, cue words, and indirect mention of disagreement. These causes are discussed below in detail.

Lexical overlap. If there is lexical overlap between the headline and the document, the models classify the instance as one of the related classes, even in cases in which the two are unrelated.
Example case 1. (ground truth: unrelated, system predicts: agree)


Headline: CNN: Doctor Took Mid-Surgery Selfie with Unconscious Joan Rivers
Document: ... “A TEENAGER woke up during brain surgery to ask doctors how it was going. Iga Jasica, 19, was having an op to remove a tumour at when the anaesthetic wore off and she struck up a conversation with the medics still working on her.” ...

Synonyms. If the document-headline pair is related but only contains synonyms rather than the same tokens, the model often misclassifies the instance as unrelated.
Example case 2. (ground truth: agree, system predicts: unrelated)
Headline: Three Boobs Are Most Likely Two Boobs and a Lie
Document: ... The woman who claimed she had a third breast has been proved a hoax. ...

Cue words. If keywords like reports, said, or allegedly are detected, the systems often classify the pair as discuss.
Example case 3. (ground truth: disagree, system predicts: discuss)
Headline: Woman pays 20,000 for third breast to make herself LESS attractive to men
Document: ... The woman who reported that she added a third breast was most likely lying. ...

Indirect mention of disagreement. The disagree class is especially difficult to determine, as only a few lexical indicators (e.g., false, hoax, fake) are available as features. The disagreement is often expressed in complex terms, which demands more sophisticated machine learning techniques.
Example case 4. (ground truth: disagree, system predicts: agree)
Headline: Disgusting! Joan Rivers Doc Gwen Korovin’s Sick Selfie EXPOSED — Last Photo Of Comic Icon, When She Was Under Anesthesia
Document: If the bizarre story about Joan Rivers’ doctor pausing to take a “selfie” in the operating room minutes before the 81-year-old comedienne went into cardiac arrest on August 29 sounded outlandish, that’s because it was.

Our analysis shows that the models exploit the similarity between the headline and the document in terms of lexical overlap. Lexical cue words, such as reports, said, false, and hoax, play an important role in classification. However, the systems fail when semantic relations between words need to be taken into account, complex negation instances are encountered, or the understanding of propositional content in general is required. This is not surprising, since the three models are based on n-grams, bag-of-words, topic models, and lexicon-based features instead of capturing the semantics of the text.

Feature analysis

Throughout our feature analysis, we use the Athene model, which performed best in terms of F1macro and allows for a large number of experiments due to its fast computation. All tests are performed on the FNC2017 training set with 10-fold cross-validation. Below, we first discuss and evaluate the performance of each

feature individually and then conduct an ablation test for groups of similar features. Detailed feature descriptions are included in the appendix (Section A.1). Figure 5.7 shows the performance of Athene when combined with the individual features that are described below.7

FNC baseline features. The FNC organizers provided a gradient-boosting baseline using the co-occurrence (COOC) of word and character n-grams in the headline and the document, as well as two lexicon-based features, which count the number of refuting (REFU) and polarity (POLA) words based on small word lists. Figure 5.7 indicates that COOC performs well, whereas both lexicon-based features are on par with the majority vote baseline.

Challenge features. The top three FNC systems rely on combinations of the following features: bag-of-words (BoW) unigram features, topic model features based on non-negative matrix factorization (NMF-300, NMF-cos) (Lin, 2007), Latent Dirichlet Allocation (LDA-cos) (Blei et al., 2001), Latent Semantic Indexing (LSI-300) (Deerwester et al., 1990), two lexicon-based features using NRC Hashtag Sentiment (NRC-Lex) and Sentiment140 (Sent140) (Mohammad et al., 2013), and word similarity features which measure the cosine similarity of pre-trained word2vec embeddings of nouns and verbs in the headlines and the documents (WSim). All topic models use 300 topics. Besides the concatenated topic vectors (NMF-300, LSI-300), we also consider the cosine similarity between the topics of document and headline (NMF-cos, LDA-cos). The BoW features perform best in terms of F1macro. While the LSI-300, NMF-300, and NMF-cos topic models yield high scores, LDA-cos and WSim fall behind.
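As an illustration of the topic model features, the following sketch derives NMF topic vectors for a headline and a document and computes their cosine similarity (the NMF-cos feature) with scikit-learn; whereas 300 topics are used in the thesis, the toy corpus here only supports a few, and the texts are illustrative assumptions.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative toy corpus
corpus = ["Israel opens dams and floods Gaza",
          "General Al-Saudi said the dams were opened without warning",
          "The woman who claimed she had a third breast has been proved a hoax"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# 300 topics in the thesis; a toy corpus only supports a handful
nmf = NMF(n_components=2, init="nndsvd", random_state=0).fit(X)

def topic_features(headline, document):
    h_topics = nmf.transform(vectorizer.transform([headline]))
    d_topics = nmf.transform(vectorizer.transform([document]))
    sim = cosine_similarity(h_topics, d_topics)[0, 0]   # NMF-cos-style feature
    return h_topics[0], d_topics[0], sim                # concatenated topic vectors + similarity

print(topic_features(corpus[0], corpus[1]))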

Novel features. We also analyze a number of novel features for the FNC task that have not been used in the challenge. The performance of these features is also illustrated in Figure 5.7. Bag-of-character 3-grams (BoC) represent sub-word information and show promising results in our setup. Structural features (STRUC) include the average word lengths of the headline and the document, the number of paragraphs in the document, and their average lengths. The low performance of these features indicates that the structure of the headline and the documents is not indicative of the stance. A lexical diversity (LexDiv) feature combining the type-token ratio (TTR) and the measure of textual lexical diversity (MTLD) (McCarthy, 2005) also shows a much lower score compared to the baseline. Furthermore, we test readability features (READ), which estimate the complexity of a text. Less complex texts could be indicative of deficiently written false news. We tried the following metrics for headline and document as a concatenated feature vector: SMOG grade (Mc Laughlin, 1969), Flesch-Kincaid grade level and Flesch reading ease (Kincaid et al., 1975), Gunning fog index (Štajner et al., 2012), Coleman-Liau index (Mari and Ta Lin, 1975), automated readability index (Senter and Smith, 1967), LIX and RIX (Jonathan, 1983), McAlpine EFLAW Readability Score (McAlpine, 1997), and Strain Index (Solomon, 2006). However, in the present problem setting, these features show only a low performance.

7 The feature analysis presented in this sub-section was conducted by Benjamin Schiller (Schiller, 2017).


Figure 5.7: Performance of the system based on individual features in terms of the FNC score and F1macro (* indicates the best performing features)

The same is true for the lexical diversity (LexDiv) metrics, the type-token ratio (TTR), and the measure of textual lexical diversity (MTLD) (McCarthy, 2005). We finally analyze the performance of features based on the following lexicons: MPQA (Wilson et al., 2005), MaxDiff (Kiritchenko et al., 2014), and EmoLex (Mohammad and Turney, 2010). These features are based on the sentiment, polarity, and emotion words included in the headlines and documents, which might be good indicators of an author's opinion. However, our results show that these lexicon-based features are not helpful. Even though the considered lexicons are important for fake-news detection (Shu et al., 2017; Horne and Adali, 2017), for the FNC stance detection task, the properties captured by the lexicon-based features are not very useful.

Feature ablation test. We first remove all features that are more than 10% below the FNC baseline, since they mostly predict the majority class and thus harm the F1macro score. In Figure 5.7, we mark the remaining high-performing features with an asterisk (*). To quantify the contribution of the high-performing features, we conduct an ablation test across three groups of related features: (1) BoW and BoC (Bo*), (2) LSI-300, NMF-300, NMF-cos, and LDA-cos (Topic), and (3) NRC-POS and WSim (Oth). Table 5.2 shows the results of our ablation test. The BoW and BoC features have the biggest impact on the performance. While the topic models yield further improvements, the NRC-POS and WSim features are not helpful. Hence, we suggest BoW, BoC, and the four topic-model-based features as the most promising feature set. We evaluate this feature set on the test set of FNC2017. We call the model equipped with this feature set featMLP and display the results in Table 5.1. Although featMLP with the revised feature selection outperforms the best performing FNC system Athene in terms of F1macro and the FNC metric, the margin is not significant. We suspect this is because many of the features exploit similar patterns and are therefore to some extent redundant.
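A minimal sketch of how such a group-wise ablation could be run, assuming precomputed feature matrices for the groups Bo*, Topic, and Oth and a scikit-learn MLP as a stand-in for the Athene classifier (all names and hyper-parameters are illustrative, not the exact setup used in the thesis):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def feature_ablation(feature_groups, labels):
    # feature_groups: dict mapping a group name ("Bo*", "Topic", "Oth") to a
    # (n_samples, n_features) matrix; labels: the stance label of each instance
    results = {}
    for left_out in [None] + list(feature_groups):
        X = np.hstack([m for name, m in feature_groups.items() if name != left_out])
        clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)
        scores = cross_val_score(clf, X, labels, cv=10, scoring="f1_macro")
        results["All*" if left_out is None else "-" + left_out] = scores.mean()
    return results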


              Baselines         Only                All without
          maj.    FNC     Bo*   Topic   Oth    -Bo*  -Topic   -Oth    All*    All
agr.      0.0    .241    .772    .637   0.0    .665    .714   .722    .713   .675
dsg.      0.0    .047    .601    .571   0.0    .530    .598   .616    .573   .455
dsc.      0.0    .738    .874    .838   .731   .841    .863   .876    .870   .835
unr.      .835   .970    .991    .983   .964   .982    .989   .995    .993   .989
F1m       .209   .499    .796    .757   .425   .754    .791   .802    .787   .738

Table 5.2: Results of the feature ablation test. The FNC baseline uses a gradient boosting classifier with all FNC baseline features. * indicates that only the pre-selected features are used (see Figure 5.7). (agr. = agree, dsg. = disagree, dsc. = discuss, unr. = unrelated, maj. = majority vote, FNC = FNC baseline, F1m = F1macro)

Analysis of models

In order to further push the performance on FNC2017, we conducted a number of experiments with different models in various settings.8 We performed experiments with an ensemble of the three models featMLP, TalosComb, and UCLMR using hard voting. However, we could not significantly improve the results, which suggests that all three models exploit the same patterns in the dataset. Furthermore, since all models struggle with the minority class disagree, we applied different under- and over-sampling techniques to balance the class distribution. However, this did not yield improved results either. In the error analysis presented above, we observed that the feature-based models lack semantic understanding. Therefore, we combine a feature-based model with an LSTM-based model that is better able to capture the semantics of the headlines and the documents using word embeddings and sequential encoding. Sequential processing of information is important in order to get the meaning of the entire sentence, e.g. "It wasn't long ago that Gary Bettman was ready to expand NHL." vs. "It was long ago that Gary Bettman wasn't ready to expand NHL." In Figure 5.8, we introduce the stackLSTM model that combines the best feature set found in the ablation test with a stacked LSTM network (Hermans and Schrauwen, 2013). We use 50-dimensional GloVe word embeddings9 (Pennington et al., 2014) in order to generate sequences of word vectors of a headline-document pair. We combine the token embeddings of the headline and the document with a maximum length of 100 tokens (the remaining tokens of the documents are neglected). These embedded word sequences v1, v2, . . . , vn are fed through two stacked LSTMs with a hidden state size of 100 and a dropout (Hinton et al., 2012) of 0.2 each. The last hidden state of the second LSTM is concatenated with a vector representing the feature set and is fed into a 3-layer neural network with 600 neurons per layer. Finally, we add a dense layer with four neurons and a softmax activation function in order to obtain the class probabilities.
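The following PyTorch sketch illustrates the stackLSTM architecture described above. It is only meant to make the data flow concrete; the class name, the vocabulary handling, and the GloVe initialization are assumptions and not the exact implementation used in the thesis.

import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    def __init__(self, vocab_size, feat_dim, emb_dim=50, hidden=100, n_classes=4):
        super().__init__()
        # 50-dimensional embeddings, initialized with GloVe vectors in practice
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # two stacked LSTMs with hidden size 100 and dropout 0.2
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, dropout=0.2, batch_first=True)
        # 3-layer MLP with 600 neurons per layer, followed by the output layer
        self.mlp = nn.Sequential(
            nn.Linear(hidden + feat_dim, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
            nn.Linear(600, 600), nn.ReLU(),
            nn.Linear(600, n_classes),
        )

    def forward(self, token_ids, features):
        # token_ids: (batch, 100) headline+document tokens; features: (batch, feat_dim)
        emb = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(emb)            # h_n: (num_layers, batch, hidden)
        last_hidden = h_n[-1]                   # last hidden state of the second LSTM
        logits = self.mlp(torch.cat([last_hidden, features], dim=-1))
        return torch.softmax(logits, dim=-1)    # probabilities over UNR, DSC, AGR, DSG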

8 The analysis of the models presented in this sub-section was conducted by Benjamin Schiller (Schiller, 2017).
9 http://nlp.stanford.edu/data/glove.twitter.27B.zip


Figure 5.8: Model architecture of the feature-rich stackLSTM (the embedded token sequence v1, . . . , vn is fed through two stacked LSTMs; the last hidden state is concatenated with the headline and body features and passed through dense layers to the output layer over UNR, DSC, AGR, DSG)

Headline: NHL expansion ahead? No, says Gary Bettman

Document: It wasn’t very long ago that NHL commissioner Gary Bettman was treating talk of expansion as though he was being asked if he’d like an epidemic of Ebola. But recently the nature of the rhetoric has changed so much that the question is becoming not if, but when. ...

Example 5.3: Stance detection for automated fact-checking

Table 5.1 shows the performance of stackLSTM. Even though this model outperforms all other methods in terms of F1macro, the difference to Athene and featMLP is not significant. Nevertheless, an important advantage of stackLSTM is its improved performance for the disagree class, which is the most difficult class to predict due to the low number of instances. This means that stackLSTM correctly classifies a larger number of complex negation instances. The difference on the disagree class between stackLSTM and all other methods is statistically significant (using Student's t-test). The model predicts the disagree class more often and gets more of these examples correct without compromising the overall performance. One challenging disagree instance, which was correctly classified by the stackLSTM, is given in Example 5.3.

5.3.5 Analysis of the generalizability of the models

In order to test the robustness of the models (i.e. how well they generalize to new datasets), we conduct experiments with the models on the ARC2017 corpus (Section 3.1.2). We first perform in-domain experiments, where we train and test the models on ARC2017, and then cross-domain experiments, where we train on FNC2017 or ARC2017 and then predict for the other corpus.

In-domain experiments on the ARC2017 corpus: The in-domain results for the ARC2017 corpus listed in Table 5.3 show that the overall performance of all models decreases.


ARC2017-ARC2017
Systems          FNC metric   F1macro   agree   disagree   discuss   unrelated
Upper bound        .796        .773      .710     .857       .571      .954
stackLSTM          .685        .524      .451     .518       .194      .935
featMLP            .690        .526      .526     .506       .144      .934
TalosComb          .725        .573      .593     .598       .160      .944
Athene             .680        .548      .516     .482       .190      .933
UCLMR              .667        .519      .517     .503       .121      .932
Majority vote      .430        .214      0.0      0.0        0.0       .857

Table 5.3: FNC metric, F1macro, and class-wise F1 scores for the analyzed models in the in-domain experiments on ARC2017

Because the models have been constructed to perform well on the FNC2017 dataset, this is not surprising. Nevertheless, for the ARC2017 corpus, the models are better able to distinguish between agree and disagree instances than for the FNC2017 corpus. We assume this is because the number of disagree instances in ARC2017 is substantially larger than in FNC2017. Moreover, for ARC2017 the number of disagree instances roughly corresponds to the number of agree instances (see Table 3.5). The classification of the discuss instances, on the other hand, turns out to be more challenging for ARC2017. When analyzing the issue, we have found that this is because the user posts related to the claim often do not explicitly refer to it. Thus, the classifier misses the connection between the user post and the claim in these cases. The model TalosComb is better able to generalize to the new data than the other models, as it achieves the best results on ARC2017. Even though the stackLSTM is again better on the more difficult minority class (in this case discuss), the structure and features of TalosComb seem to be more appropriate for this problem setting.

Cross-domain experiments: In the cross-domain setting, we train on the training set of one corpus and evaluate on the testing set of the other corpus. Table 5.4 shows the performance of the models when they are trained on the FNC2017 dataset and then predict for the ARC2017 dataset. In Table 5.5, the results for the reverse experiments are displayed (training on ARC2017 and testing on FNC2017). The experiments show that the performance of the models is substantially better than the majority vote baseline. We therefore conclude that the stance detection problems for the two corpora FNC2017 and ARC2017 are related and exhibit a common structure. The results also suggest that TalosComb is best able to learn from the ARC2017 corpus, as it is superior in the ARC2017-ARC2017 and ARC2017-FNC2017 settings. The stackLSTM, on the other hand, yields the best results when trained on the FNC2017 corpus, as the FNC2017-ARC2017 and FNC2017-FNC2017 settings suggest. We assume that the particular model architecture and the features used for TalosComb are more appropriate for the ARC2017 dataset. In order to further narrow down the cause for the different performance of the models on the two datasets, a deeper analysis is required. This, however, was outside the scope of the current study.


FNC2017-ARC2017
Systems          FNC metric   F1macro   agree   disagree   discuss   unrelated
Upper bound        .796        .773      .710     .857       .571      .954
stackLSTM          .591        .401      .321     .191       .182      .910
featMLP            .586        .389      .321     .159       .171      .906
TalosComb          .584        .365      .336     0.0        .195      .929
Athene             .523        .340      .340     .244       .138      .894
UCLMR              .557        .358      .271     .064       .201      .896
Majority vote      .430        .214      0.0      0.0        0.0       .857

Table 5.4: FNC metric, F1macro, and class-wise F1 scores for the cross-domain experiments FNC2017-ARC2017

ARC2017-FNC2017
Systems          FNC metric   F1macro   agree   disagree   discuss   unrelated
Upper bound        .859        .754      .588     .667       .765      .997
stackLSTM          .613        .373      .343     .116       .082      .950
featMLP            .585        .351      .322     .111       .033      .939
TalosComb          .607        .388      .279     .183       .113      .977
Athene             .548        .321      .277     .097       .028      .882
UCLMR              .482        .288      .234     .109       .080      .728
Majority vote      .394        .210      0.0      0.0        0.0       .839

Table 5.5: FNC metric, F1macro, and class-wise F1 scores for the cross-domain experiments ARC2017-FNC2017

5.3.6 Discussion

Our analysis of the FNC stance detection task shows that the defined problem setting remains challenging. Even though relatively high scores according to the FNC metric have been reported, we are sceptical about the performance of the systems. Because the FNC metric favors systems that perform well on the majority class, we conclude that the metric is not appropriate to validate the performance of the models on FNC2017. Our F1macro metric, on the other hand, highlights this problem, as only models that reach good performance across all classes score high. The metric also shows that the best models still reach a relatively low score of 0.6 F1macro. We have found that this low performance arises because the models mostly rely on the lexical overlap between the headline and the document. We therefore conclude that in order to reach a higher performance on the stance detection task, more sophisticated machine learning techniques are needed that have a deeper semantic understanding. The cross-domain experiments show that the stance detection problems defined by FNC2017 and ARC2017 are similar, because models trained on FNC2017 and evaluated on ARC2017 and vice versa outperform the random baseline. Nevertheless, the performance gains are relatively small.

5.4 Experiments on Snopes19

In this section, we analyze the stance detection task defined by our Snopes19 corpus. In this problem setting, the stance of an ETS towards a claim needs to be identified. According to our annotation framework (Section 3.2.1), an ETS can support, refute, or express no stance with respect to the claim. We conduct experiments with a number of different models on the corpus. We evaluate the performance of the featMLP introduced in the previous section, since it reaches similar performance on the FNC2017 corpus as the best performing model stackLSTM, but is structurally considerably simpler. In order to evaluate how commonly used deep learning models perform on the Snopes19 dataset, we conduct experiments with two neural architectures: a novel model based on the universal sentence encoder (Cer et al., 2018) and decomposable attention (Parikh et al., 2016). The experiments are followed by a detailed discussion of the results, and in order to identify outstanding challenges, we conclude the section with a detailed error analysis.

5.4.1 Model architectures

Universal Sentence Encoder with Attention (USEAtt)

We introduce a new model for the stance detection task that is based on the Universal Sentence Encoder (USE) (Cer et al., 2018). USE is a neural model pre-trained on a number of text classification tasks, and it reaches good performance across different problem settings. We use the Deep Average Network version of the USE (illustrated in Figure 5.9) as a building block in our model. The parameters of the USE can either be kept fixed or fine-tuned during training. In our analysis, we have found that the latter option yields better performance, and therefore, in our experiments reported below, the parameters of the USE are always updated during training. As illustrated in Figure 5.10, we use the USE as one component of our model. A detailed description of the entire model is given below.

Using the USE, we compute c, which is the representation of the claim, and s_1, ..., s_i, ..., s_n, which are the representations of the sentences of the ETS. In order to compress the sentence representations of the ETS derived by the USE into one vector, we use an attention mechanism inspired by Yang et al. (2016). The evidence sentences and the claim are first fed through a single-layer MLP:

\hat{s}_i = \sigma(W s_i + b) \quad \forall i \in [1, n],
\hat{c} = \sigma(W c + b).    (5.1)

In the next step, in order to compute the attention weights a_i, we take the cosine similarity between the claim representation ĉ and each sentence representation ŝ_i, and apply the softmax function to ensure that the weights add up to one:


Figure 5.9: Universal sentence encoder deep average network (source: Cer et al., 2018)

\alpha_i = \frac{\hat{s}_i \cdot \hat{c}}{|\hat{s}_i| \, |\hat{c}|} \quad \forall i \in [1, n],
a_i = \frac{\exp(\alpha_i)}{\sum_{i=1}^{n} \exp(\alpha_i)} \quad \forall i \in [1, n].    (5.2)

We then apply weighted pooling to reduce the sentence representations of the ETS to a single vector r:

r = \frac{\sum_{i=1}^{n} a_i s_i}{n}.    (5.3)

In order to combine the information of the claim c with the aggregated representation of the evidence sentences r, we derive a vector that is based on different combinations of the two vectors:

v = [c, \, r, \, c - r, \, c \circ r],    (5.4)

where ◦ denotes the Hadamard product. Such a combination has been shown to work better in practice (especially for the SNLI dataset (Bowman et al., 2015)) than a simple vector concatenation. The resulting vector v is fed to a classifier, which is an MLP, in order to predict the stance. The motivation behind the attention mechanism is that, when predicting the stance of the ETS, the model should learn to focus on the sentences that are most indicative of the stance of the ETS. In fact, as described in Section 3.2.1, ETSs often contain additional background information that is not directly related to the claim and can mislead the model.
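A minimal sketch of this attention pooling (Equations 5.1-5.4) in PyTorch, assuming the claim and the ETS sentences have already been encoded into fixed-size vectors (e.g. by the USE) and that σ is a sigmoid; class and variable names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # shared W and b of Eq. 5.1

    def forward(self, claim, sentences):
        # claim: (dim,), sentences: (n, dim) sentence representations of the ETS
        c_hat = torch.sigmoid(self.proj(claim))
        s_hat = torch.sigmoid(self.proj(sentences))
        alpha = F.cosine_similarity(s_hat, c_hat.unsqueeze(0).expand_as(s_hat), dim=-1)  # Eq. 5.2
        a = torch.softmax(alpha, dim=0)
        r = (a.unsqueeze(1) * sentences).sum(dim=0) / sentences.size(0)  # Eq. 5.3
        v = torch.cat([claim, r, claim - r, claim * r])                  # Eq. 5.4
        return v  # fed into an MLP classifier to predict the stance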

Decomposable Attention (DecAtt)

Decomposable Attention (DecAtt) is a model originally proposed by Parikh et al. (2016) for the Stanford Natural Language Inference (SNLI) task (Bowman et al., 2015).


Figure 5.10: Universal Sentence Encoder with Attention (USEAtt) (source: Li, 2018)

The model has a low number of parameters and relies on a relatively simple attention mechanism instead of more complex neural network modules like LSTMs or CNNs. Nevertheless, the model still reached state-of-the-art results on SNLI in 2016. Even though it was originally designed for the SNLI task, the model can also be used for other tasks such as stance detection. Standard Siamese sentence encoders (Bowman et al., 2015) applied to SNLI have the problem that the meaning of the hypothesis sentence and the premise sentence is encoded into two separate vectors. This represents an information bottleneck, as the meaning of the individual tokens of the sentences is entangled in the two resulting representations. The DecAtt model mitigates this issue by comparing the tokens of the hypothesis and the premise directly rather than the encoded sentences. In the process, parts of the hypothesis and premise are first aligned using an attention mechanism, and the information about the aligned phrases is then aggregated for the classification of the entailment relation.


Figure 5.11: Decomposable attention model (source: Parikh et al., 2016)

Parikh et al. (2016) give the following example in the introduction of their paper to describe the intuition behind the approach:

1. Bob is in his room, but because of the thunder and lightning outside, he cannot sleep.

2. Bob is awake.

“When aligning the second sentence with the first one can easily conclude that it is entailed by the first since Bob can be aligned with Bob and cannot sleep with awake which are basically synonyms” (Parikh et al. 2016, page 1). The model solves the entailment problem using the alignment approach in three subsequent steps, which are described below. These steps are also illustrated in Figure 5.11.

Attend: In this step, a matrix of attention weights w_ij is computed, which represents the attention of each token h_i in the first sentence (the hypothesis) to every token p_j in the second sentence (the premise). Then, the sub-phrases to which each token attends are collected:

\rho_i = \sum_{j=1}^{n} w_{ij} \, p_j,
\eta_j = \sum_{i=1}^{n} w_{ij} \, h_i.    (5.5)

In contrast to the encoding of a sentence using a sentence encoder, ρ_i and η_j are sub-phrase representations that are based on the information of the two sentences. Thus, the interaction of the two sentences is taken into account.

Compare: The representations of the sub-phrases η_j, ρ_i and the token embeddings h_i, p_j are fed through a neural network G in order to compare the two kinds of representations:


v_{1,i} = G(h_i, \rho_i),
v_{2,j} = G(p_j, \eta_j).    (5.6)

As a result, the entailment relations on the token-sub-phrase level, represented by v_{1,i} and v_{2,j}, can be determined.

Aggregate: The vectors v_{1,i} and v_{2,j}, which indicate the entailment relations of the sub-phrases, are aggregated and fed through a neural network H in order to determine the entailment relation between the two entire sentences:

v_1 = \sum_{i=1}^{n} v_{1,i},
v_2 = \sum_{j=1}^{n} v_{2,j},    (5.7)
Y = H(v_1, v_2),
y = \arg\max(Y).

Here Y represents a vector of unnormalized scores for each class (entailment relation), and y represents the highest scoring class, which is also the predicted class (entail, contradict, or neutral).

Even though the model was designed for the SNLI task, the problem is structurally similar to the considered stance detection task. For both problems, two different textual inputs are given, and the relation between the two inputs needs to be predicted. The three labels for the two tasks roughly correspond to each other: entail ≈ agree, contradict ≈ disagree, neutral ≈ no stance. However, whereas the premise in the SNLI task is only one sentence, an ETS in the Snopes19 corpus consists of 6.5 sentences on average. In order to be able to feed an ETS into the model, we simply treat the entire snippet as one long sentence by concatenating all of its sentences. The individual tokens are thereby represented by GloVe (Pennington et al., 2014) word embeddings.
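The following sketch outlines the attend-compare-aggregate scheme of DecAtt (Equations 5.5-5.7) applied to a claim and an ETS whose sentences are concatenated into one token sequence, as described above. The feed-forward networks and all sizes are illustrative assumptions, not the original implementation; in particular, the attention scores are computed from a learned projection of the embeddings, as in the original paper.

import torch
import torch.nn as nn

def ff(d_in, d_out):
    # small feed-forward block used for F, G, and H
    return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(),
                         nn.Linear(d_out, d_out), nn.ReLU())

class DecAttStance(nn.Module):
    def __init__(self, emb_dim=300, hidden=200, n_classes=3):
        super().__init__()
        self.attend = ff(emb_dim, hidden)                       # F
        self.compare = ff(2 * emb_dim, hidden)                  # G
        self.aggregate = nn.Sequential(ff(2 * hidden, hidden),
                                       nn.Linear(hidden, n_classes))  # H

    def forward(self, h, p):
        # h: (n, emb_dim) claim token embeddings; p: (m, emb_dim) ETS token embeddings
        e = self.attend(h) @ self.attend(p).T                   # unnormalized attention scores
        rho = torch.softmax(e, dim=1) @ p                       # Eq. 5.5: p sub-phrase for each h_i
        eta = torch.softmax(e, dim=0).T @ h                     #          h sub-phrase for each p_j
        v1 = self.compare(torch.cat([h, rho], dim=-1))          # Eq. 5.6
        v2 = self.compare(torch.cat([p, eta], dim=-1))
        scores = self.aggregate(torch.cat([v1.sum(0), v2.sum(0)]))  # Eq. 5.7
        return scores  # unnormalized scores over agree / disagree / no stance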

5.4.2 Experiments

The performance of the models discussed in the previous sections on the Snopes19 corpus is presented in Table 5.6. The comparison shows that FeatMLP is superior to the two neural models. This result is similar to the outcome of the FNC, in which feature-based models outperformed neural networks based on word embeddings (Hanselowski et al., 2018a). Moreover, DecAtt reaches a substantially higher score than USEAtt. The superior token-based alignment scheme of DecAtt compared to the sentence-level approach of USEAtt is probably the reason for the higher performance. As the comparison of the results to the human upper bound suggests, there is still substantial room for improvement in future work, as the human upper bound is about 20% higher than the best performing model FeatMLP.


model              recall   precision   F1m
upperBound          0.770     0.837     0.802
random baseline     0.333     0.333     0.333
majority vote       0.150     0.333     0.206
FeatMLP             0.585     0.607     0.596
DecAtt              0.510     0.560     0.534
USEAtt              0.380     0.505     0.434

Table 5.6: Stance detection models and baselines (F1m = F1macro, upperBound = human upper bound)

model \ gold    support   refute   no stance
support            472        86       175
refute              41        80        51
no stance          141        74       531

Table 5.7: Stance confusion matrix for FeatMLP

5.4.3 Error analysis

In order to identify potential future research directions for the improvement of the models, we perform an error analysis for the best-scoring model FeatMLP. As described in the previous section, the model is based on bag-of-words representations, character n-grams, and topic models. This indicates that the lexical overlap between the claim and the ETS is an important feature for stance prediction. The error analysis shows that supporting ETSs are mostly classified correctly if there is a significant lexical overlap between the claim and the ETS. If the claim and the ETS use different wording, or if the ETS implies the validity of the claim without explicitly referring to it, the model mostly classifies the snippets as having no stance. A concrete example of an instance misclassified because of missing lexical overlap is given in Example 5.2. Even though the ETS agrees with the claim, the lexical overlap is relatively low. Most likely for this reason, the model predicts refute instead of agree. Moreover, as the distribution of the classes in Table 3.8 shows, support and no stance instances are more dominant than the refute cases. The model is therefore biased towards these two classes and is less likely to predict refute. This can also be observed in the confusion matrix illustrated in Table 5.7. Our analysis of the misclassified refute ETSs shows that the contradiction is often expressed indirectly, and the model was therefore not able to correctly classify these cases, e.g. "the myth originated", "no effect can be observed", "there is no evidence". Another frequently encountered type of misclassified refute instance is the case in which the refutation is expressed by pointing out that another statement contradicting the claim is true (implicit refutation henceforth). For instance, the claim "There is a new punishment for reading the Bible in Saudi Arabia" is contradicted by the ETS sentence "Saudi authorities accept the private practice of religions other than Islam", but the contradiction is not explicitly stated.


Correct stance: agree; model prediction: refute

Claim: The Reuters news agency has proscribed the use of the word ’terrorists’ to describe those who pulled off the September 11 terrorist attacks on America.

ETS: Reuters’ approach doesn’t sit well with some journalists, who say it amounts to self-censorship. “Journalism should be about telling the truth. And when you don’t call this a terrorist attack, you’re not telling the truth,” says Rich Noyes, director of media analysis at the conservative Media Research Center. They also argue that it’s inaccurate. “A news organization’s responsibility is to find the facts ... not to play politics with its reporting.”

Example 5.2: A misclassified ETS of the Snopes corpus

In addition to the qualitative error analysis described above, we also determine the number of occurrences of the different classes of errors. For this purpose, we divide the errors into different classes according to the most likely reason for the misclassification. The results illustrated in Table 5.8 show that the misclassification of agree instances because of missing lexical overlap is most dominant and represents 24% of all errors. Ambiguous ETSs, which are difficult to classify even for a human, are the second largest group with 22%. False agree predictions because of lexical overlap are also frequently encountered with 20%. Misclassified refute instances due to complex refutation or implicit refutation are also numerous with 16% and 10%, respectively.

error type                                           %
miscl. agree because of missing lexical overlap      24
ambiguous ETS                                        22
false agree predictions because of lexical overlap   20
complex refutation                                   16
implicit refutation                                  10
other errors                                          8

Table 5.8: Different types of errors in stance detection

Summary. The presented error analysis is in agreement with the error analysis performed for the FNC2017 dataset presented in Section 5.3.4. The low performance of the deep learning models is most likely due to the relatively low number of samples (16,509 ETSs). The feature-based models, which mostly exploit lexical overlap, perform best on relatively small datasets but have a number of issues that need to be addressed. They are not able to classify the ETSs correctly if the agreement or disagreement is expressed in different terms (for example, if synonyms are used). Thus, more elaborate models are required that have a better understanding of the semantics of the text instead of only measuring the lexical overlap. In order to further improve performance on FNC2017 and Snopes19, models that are based on self-supervised pre-training or multitask learning may prove superior. These kinds of models are able to reach high performance in new text classification tasks if they are fine-tuned on only a small number of examples (Devlin et al., 2018; Subramanian et al., 2018).

5.5 Chapter summary

This chapter was concerned with the stance detection problem, and we focused on two stance detection problem settings. In the first part of the chapter, we conducted a thorough analysis of the FNC stance detection task. Given that the challenge has attracted much attention in the NLP community with 50 participating teams, a detailed analysis is valuable as it provides insights into the problem setting and lessons learned for upcoming competitions. In our analysis, we evaluated the performance of the three top-scoring systems of the FNC, critically assessed the experimental setup, and performed a detailed feature analysis. In the feature analysis, we found that features based on word n-grams, character n-grams, and topic models perform best. We conducted an error analysis for the top three FNC models and found that they mostly rely on lexical overlap for classification. To assess how well the models generalize to a similar problem setting, we ran experiments on a second corpus, ARC2017. We found that the models are to some extent able to generalize across the two corpora, that is, when training on FNC2017 and predicting for ARC2017 and vice versa. Since the challenge's metric is highly affected by the imbalanced class distribution of the test data, we also proposed a new evaluation metric based on F1 scores. Using this evaluation setup, the ranking of the top three systems changes, favoring models that reach good performance across all classes. In the second part of this chapter, we analyzed the stance detection problem defined by our Snopes19 dataset. We conducted experiments with three models on the dataset, in which a feature-based MLP came out on top. As in the FNC stance detection problem, feature-based models showed better performance than pure deep-learning-based approaches. In a subsequent error analysis, we found that also for our corpus Snopes19, the feature-based models mostly rely on lexical overlap for the classification. In our analysis of both problem settings, we found that even the best performing models are not yet able to resolve difficult cases. We therefore concluded that the investigated stance detection problem is challenging, and more sophisticated machine learning techniques are needed. These methods must have a deeper semantic understanding and be able to determine the stance of a piece of text on the basis of propositional content instead of relying on lexical features.


Chapter 6

Evidence extraction

This chapter is concerned with the evidence extraction problem, which is the next step in the fact-checking pipeline (Section 1.2.2) after stance detection. We define evidence extraction as the problem of finding evidence sentences in the retrieved documents that either support or refute the given claim. The evidence sentences serve as a basis for claim validation, which is the next step in the pipeline. The analysis of the evidence extraction problem in this chapter is structured as follows. We first discuss the evidence extraction problem setting in more detail and give an illustrative example (Section 6.1). We then present related work, where we outline relevant studies on evidence extraction in the areas of argumentation mining, interactive evidence detection, and open-domain question answering (Section 6.2). In the next section (Section 6.3), we introduce a number of evidence extraction systems that we have explored in the course of the FEVER shared task. The subsequent section (Section 6.4) discusses the results of the experiments that we have performed with the introduced systems on the FEVER shared task corpus (FEVER18) and on the Snopes corpus (Snopes19). The last part of the chapter (Section 6.5) is devoted to the error analysis, where we analyze the predictions of the best-performing systems on the two corpora.

The contributions of this chapter are the following.1 (9) We propose a new deep learning model, which together with the document retrieval system introduced in Section 4.3.2 reaches the best performance on the FEVER evidence extraction task. (4) We conduct evidence extraction experiments on the Snopes19 corpus, with systems that reach high performance in similar problem settings, discuss the results, and perform an error analysis.

6.1 Problem setting

We define evidence extraction as the problem of identifying sentence-level evidence in the retrieved documents. A valid piece of evidence is considered to be a sentence that provides relevant information for the validation of a claim and either supports or refutes the claim. Thus, minor details about the topic of the claim or additional background information are not considered as evidence.

1 The complete list of contributions ranging from 1 to 10 is given in Section 2.1.1.

Ideally, the evidence originates from a credible source, such as a scientific study or an expert's account. Nevertheless, we also consider statements from the retrieved documents that restate the claim or contradict it as evidence, because we grant the authors of the documents some authority on the issue. In the claim validation process, it could be found that some of the evidence sentences are contradicted by other, more substantial evidence, and thus, these evidence sentences are neglected in the claim validation process. However, this is not an issue that we try to resolve in the evidence extraction step, as the goal is only to aggregate all available information that is relevant for the validation of a claim, and not to reason about its validity. More information on this issue is also given in Sections 1.2.2 and 3.2.1. The evidence extraction problem can either be framed as a classification task (Stab et al., 2018b; Stahlhut et al., 2018) or as a ranking task (Thorne et al., 2018b; Hua and Wang, 2017; Aharoni et al., 2014; Rinott et al., 2015). In the classification problem setting, evidence extraction is considered as binary classification. We define this task formally as follows: determine whether a sentence s_ij in a document d_i from the retrieved documents D belongs to the set of evidence sentences E_c for the claim c, i.e., evaluate for every s_ij ∈ d_i ∈ D whether s_ij ∈ E_c. In the evidence ranking setting, all sentences in the retrieved documents are ranked according to their relevance to the claim, i.e. how valuable the sentences are for the validation of a given claim. After ranking, the top-k sentences are then taken as evidence. We define this task formally as follows: rank all sentences s_ij in the documents {d_1, ..., d_i, ..., d_n} = D according to the ranking function f(c, s_ij) and take the top-k sentences s_l as evidence E_c = {s_1, ..., s_l, ..., s_k}. An example of extracted evidence sentences for a given claim is presented below (Example 6.1). The evidence sentences are highlighted in the documents in italics.

Claim: Israel caused flooding in Gaza by opening river dams

Retrieved documents with highlighted evidence:

Doc.1 The Gaza Ministry of Interior said in a statement that civil defense services and teams from the Ministry of Public Works had evacuated more than 80 families from both sides of the Gaza Valley (Wadi Gaza) after their homes flooded as water levels reached more than three meters. “Israel opened water dams, without warning, last night, causing serious damage to Gazan villages near the border,” General Al-Saudi told Al Jazeera. ...

Doc.2 Hundreds of Palestinians left homeless after Israel opens river dams and floods houses General Al-Saudi said that the dams were opened without warning. The suffering is compounded by the fact that Israel has maintained a complete siege over Gaza for the last eight years, severely limiting electricity and the availability of fuel for generators. It has also prevented the displaced from rebuilding their homes, as construction materials are largely banned from entering. ...

Doc.3 The Daily Mail published a story on Monday that originally accused Israel of intentionally opening dams in southern Israel in order to flood Gaza. The only problem is, ... there are no dams in southern Israel. Honest Reporting ... took screen shots of the article before amendments were made. Even more embarrassing ... the Daily Mail's article attempted to connect the flooding in Gaza with the Israel Electric Company's decision to cut power to the West Bank cities. ...

Example 6.1: Evidence extraction for automated fact-checking (identified evidence sentences are highlighted with the italic font)

6.2 Related work

There is a considerable amount of work related to our definition of evidence extraction in the areas of argumentation mining and open-domain question answering. Below, we present these studies in three different sub-sections.

6.2.1 Argumentation mining Arguments (or premises) are often indistinguishable from our definition of evidence (see for instance the definition by Rinott et al. (2015)). Thus, work on argument detection (Hua and Wang, 2017; Stab et al., 2018b; Levy et al., 2014) is closely related or our definition of the evidence extraction problem. Nevertheless, whereas in argument detection the query is a controversial topic or a controversial claim, for evidence extraction in the area of automated fact-checking, the query is a factual claim (Section 1.2.3). Aharoni et al. (2014) introduced a benchmark dataset for automated detection of claims and evidence in the context of controversial topics. They give the following definition for the three types of argument elements: Topic: “a short phrase that frames the discussion.” Claim: “a general, concise statement that directly supports or contests the topic.” Context-dependent evidence: “a text segment that directly supports a claim in the context of the topic.” (Aharoni et al. 2014 p. 64-65). An example for the three argument elements is given in (Rinott et al. 2015 p. 440): Topic: “use of Performance Enhancing Drugs (PEDs)”; Claim: “PEDs are bad for health”; context-dependent evidence: “a 2006 study shows that PEDs have psychiatric side effects”. The corpus created by Aharoni et al. (2014) consists of 2,683 argument elements collected from Wikipedia in the context of 33 controversial topics. It must be noted that in contrast to our definition of the evidence extraction problem, the framework suggested by Aharoni et al. (2014), and consequently the corpus, only features supporting and no refuting evidence. Levy et al. (2014) presented a supervised learning approach for detecting context- dependent claims that either support or refute a given controversial topic. For the su- pervision, they used the dataset introduced by Aharoni et al. (2014). The presented approach is designed as a cascade of three classifiers that are based on handcrafted features. The system receives as an input the topic along with relevant articles and outputs context-dependent claims contained therein. The purpose of the classifier cascade is to gradually focus on smaller text segments, while filtering out irrelevant text. Levy et al. (2014) defined the task as a ranking problem, whereby they re-

Rinott et al. (2015) introduced another dataset for context-dependent evidence detection in documents, which is similar to the corpus constructed by Aharoni et al. (2014) and contains parts of it. The claims of the corpus of Aharoni et al. (2014) are contained in the corpus of Rinott et al. (2015), but the evidence differs. The dataset introduced by Rinott et al. (2015) is based on 274 articles, covers 39 topics, and contains 1,734 claims and 3,057 pieces of context-dependent evidence. Rinott et al. (2015) additionally provided annotations for three different evidence types that they define as follows: "Study: result of a quantitative analysis of data, given as numbers, or as conclusions. Expert: Testimony by a person / group / committee / organization with some known expertise / authority on the topic. Anecdotal evidence: A description of an episode(s), centered on individual(s) or clearly located in place and/or in time." (Rinott et al. 2015, p. 440-441). Rinott et al. (2015) developed a classifier that is able to identify sentence-level evidence in the articles and distinguish between the three types of evidence. The classifier is based on the following features: lexicons, named entities, regular expressions, and the subjects of the evidence sentences. Lippi and Torroni (2016c) introduced MARGOT, which is an online argumentation mining (argument search) system. They developed a model for evidence extraction and a model for the detection of the boundaries of claims and evidence. The goal of Lippi and Torroni (2016c) was to construct a generally applicable system for context-independent claim and evidence detection by using cross-topic applicable features instead of contextualized features. For instance, they consider the structure of the parse trees of candidate sentences as containing valuable information for the detection of arguments. MARGOT is trained on the two corpora introduced by Aharoni et al. (2014) and Rinott et al. (2015), but in contrast to machine learning approaches previously applied to these corpora (Levy et al., 2014; Rinott et al., 2015), topic-specific features are neglected. Stab et al. (2018a) developed ArgumenText, which is an argument search engine for retrieving arguments from heterogeneous sources. The system consists of three steps, where first documents for a topic are collected, then arguments in the documents are identified, and in the last step the stance of the arguments with respect to the topic is determined. The annotation framework (argument model) used for ArgumenText is significantly simpler compared to the framework proposed by Levy et al. (2014) that was used for MARGOT. Stab et al. (2018a) only consider topics and arguments, for which they give the following definitions: "We define an argument as a span of text expressing evidence or reasoning that can be used to either support or oppose a given topic. ...", "A topic, in turn, is some matter of controversy for which there is an obvious polarity to the possible outcomes ..." (Stab et al. 2018b, p. 3665).
They argue that the advantage of this approach is that "it allows annotators to classify text spans without having to read large amounts of text and without having to consider relations to other topics or arguments" (Stab et al. 2018b, p. 3666). This allows them to apply the annotation framework to diverse types of documents, and they use crowd workers to annotate arguments on the sentence level in the documents.

For the automated identification of arguments using machine learning techniques, Stab et al. (2018b) framed the problem as binary classification, that is, a sentence in the retrieved documents is either evidence or not (see also the definition in Section 6.1). They proposed two models for the task: a Bidirectional Long Short-Term Memory network (BiLSTM) and a BiLSTM model that additionally uses features from the topic. The presented problem setting is closely related to our evidence extraction problem, as our goal is also to identify sentence-level text units that either support or oppose a given statement.

6.2.2 Open-domain question answering

The identification of candidate answers in documents for open-domain question answering is a problem setting that is also similar to the identification of evidence in large collections of text. Thus, similar approaches can be used for both tasks. Wang et al. (2017) tackled open-domain question answering by aggregating evidence (candidate answers) from multiple text passages. Their motivation is that "the correct answer is often suggested by more passages repeatedly" and that "sometimes the question covers multiple answer aspects, which spreads over multiple passages." (Wang et al. 2017, p. 1). For the solution of the problem, they proposed the following approach. In the first step, they use an information retrieval system to extract passages from documents that potentially contain the answer. They then generate a preliminary ranking by collecting the top-k candidate answers (which they consider as evidence), based on the probabilities computed by standard question answering systems. For the final ranking, they propose two models motivated by these observations: a strength-based re-ranker and a coverage-based re-ranker. This problem setting is closely related to the evidence ranking problem setting described in Section 6.1. The retrieval component of the DrQA system presented in Section 4.2 was developed by Chen et al. (2017a) to retrieve and rank documents potentially containing answers to open-domain questions. However, the system can also be used to rank sentences according to their relevance to an arbitrary query. Thorne et al. (2018a) have therefore used the system as a baseline in the FEVER shared task in order to extract evidence sentences on the basis of a claim. The system selects sentences using bi-gram TF-IDF with binning. This system will be considered as a baseline in our experiments presented in the following sections. Even though similar approaches are used in open-domain question answering and automated fact-checking, important differences exist. Whereas in automated fact-checking the query is a factual claim, in question answering the query is a question. Moreover, evidence also differs from candidate answers: in question answering, the question can often be answered by identifying the correct entity in a text, whereas in automated fact-checking, sentence-level or clause-level evidence needs to be found for the validation of the claim.

6.2.3 Summary

The work on argument detection in argumentation mining is similar to our evidence extraction problem setting, as the goal is to find premises or claims that support or refute a topic or a claim.

Nevertheless, whereas in argumentation mining evidence for a controversial topic or a claim needs to be aggregated, in automated fact-checking evidence for a factual claim needs to be found. Open-domain question answering is a task similar to evidence extraction, and for both tasks similar approaches are used. However, whereas in automated fact-checking the query is a factual claim, in question answering the query is a question. Moreover, evidence for a claim differs from candidate answers, as it is typically represented by entire sentences or clauses, whereas answers in question answering can also be multi-word expressions, such as named entities.

6.3 Machine learning approaches

In our experiments in this chapter, we consider evidence extraction as a ranking problem and evaluate the performance of a number of models on this task. In this section, we present the architectures of these models, the way the models are trained, and the configuration of the models during testing.

6.3.1 Model architectures

We have performed experiments with three models: a newly developed model, BiLSTM, and two existing models, DecAtt and ESIM, which we modified to rank the candidate evidence sentences on the basis of a given claim. Below, we only present BiLSTM and ESIM, since DecAtt was already introduced in Section 5.4.1.

Stacked-BiLSTM (BiLSTM)

We have developed a Siamese model, which has two encoders, one for the claim and one for the candidate evidence sentence. The model architecture is illustrated in Figure 6.1. The two encoders are identical and share the same parameters. The encoders are based on two stacked Bidirectional LSTM layers (BiLSTMs) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997). We take the last hidden state of the upper BiLSTM as the representation of the claim or the candidate evidence sentence, respectively. The representations of the claim and the candidate evidence sentence are concatenated and fed through a Multi-Layer Perceptron (MLP). In the last layer of the MLP, there is only a single neuron that predicts the ranking score. The ranking score indicates how relevant the current candidate evidence sentence is for the validation of the claim. The resulting model will be called BiLSTM.
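A minimal PyTorch sketch of this Siamese ranking encoder is given below; the hidden sizes, the GloVe-style embedding initialization, and the class name are assumptions for illustration only.

import torch
import torch.nn as nn

class SiameseBiLSTMRanker(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # word embeddings
        # two stacked bidirectional LSTM layers, shared by both encoders
        self.encoder = nn.LSTM(emb_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        # MLP whose last layer is a single neuron predicting the ranking score
        self.mlp = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def encode(self, token_ids):
        # the same parameters encode both the claim and the candidate sentence
        _, (h_n, _) = self.encoder(self.embedding(token_ids))
        return torch.cat([h_n[-2], h_n[-1]], dim=-1)  # last states of the top BiLSTM

    def forward(self, claim_ids, sentence_ids):
        c, s = self.encode(claim_ids), self.encode(sentence_ids)
        return self.mlp(torch.cat([c, s], dim=-1)).squeeze(-1)  # ranking score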

Enhanced Sequential Inference Model (ESIM)

The Enhanced Sequential Inference Model (ESIM) (Chen et al., 2017b) was originally developed for determining the entailment relation between a hypothesis sentence and a premise sentence in the SNLI task (Bowman et al., 2015). The model is similar to DecAtt introduced in Section 5.4.1, since the two sentences are compared on the token level before the entailment relation is predicted. However, whereas DecAtt only relies on the token interaction, ESIM additionally uses LSTMs.


Figure 6.1: Siamese stacked BiLSTM model (BiLSTM)

Chen et al. (2017b) proposed two versions of the ESIM, both of which are illustrated in Figure 6.2. The model displayed on the left-hand side computes contextual representations of the tokens based on the sequence of tokens in the sentence using BiLSTMs. The model depicted on the right-hand side computes token representations based on the syntactic parse tree of the sentence, which is traversed by Tree-LSTMs. In our experiments, we only use the BiLSTM version of the model, depicted on the left-hand side of the figure. Below, we give a short description of the BiLSTM-ESIM by presenting its three levels of computation.

Input encoding: The embeddings of the tokens of the two sentences a_j and b_i (in our case the candidate evidence sentence and the claim) are fed through a BiLSTM, and output representations â_j and b̂_i of the individual tokens are obtained. Since a BiLSTM takes the information about the tokens from both sides of the sentence into account, the computed token representations â_j and b̂_i are enhanced with contextual information from neighboring tokens.

Local inference modeling: Each token â_j of one sentence is used to compute attention weights with respect to each token b̂_i of the other sentence, giving rise to an attention weight matrix w_ij. Then, each contextualized token representation â_j and b̂_i is multiplied by all of its attention weights, and weighted pooling is applied in order to compute a single representation for each token (α_j or β_i) on the basis of all the tokens in the other sentence:


Figure 6.2: Enhanced Sequential Inference Model (ESIM) (source: Chen et al., 2017b)

\beta_i = \sum_{j=1}^{n} w_{ij} \, \hat{a}_j,
\alpha_j = \sum_{i=1}^{n} w_{ij} \, \hat{b}_i.    (6.1)

This step is identical to the Attend step of the DecAtt model, but the attention operation is performed with contextualized embeddings instead of global embeddings of the tokens.

Inference composition: The two token sequences α_j and β_i are fed through another BiLSTM, which again computes sequences of representations for each sentence, α̂_j and β̂_i. Maximum and average pooling is then applied to the two sequences in order to derive two vectors α̂ and β̂. These vectors are combined in three different ways (concatenation, difference, and element-wise product) as [α̂, β̂, α̂ - β̂, α̂ ⊗ β̂], giving rise to the last hidden state of the ESIM. This vector is then fed into an MLP for the classification of the entailment relation.
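As a rough illustration, the inference-composition step could be sketched as follows in PyTorch: a second BiLSTM re-encodes the locally aligned sequences, max and average pooling produce the two vectors, and their combination is passed to an MLP. All dimensions and names are assumptions, not the original implementation.

import torch
import torch.nn as nn

class InferenceComposition(nn.Module):
    def __init__(self, dim=300, hidden=300, n_classes=3):
        super().__init__()
        self.bilstm = nn.LSTM(dim, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(16 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def pool(self, seq):
        out, _ = self.bilstm(seq)                      # (batch, len, 2*hidden)
        return torch.cat([out.max(dim=1).values, out.mean(dim=1)], dim=-1)

    def forward(self, alpha_seq, beta_seq):
        # alpha_seq, beta_seq: locally aligned token sequences of the two sentences
        a_hat, b_hat = self.pool(alpha_seq), self.pool(beta_seq)
        v = torch.cat([a_hat, b_hat, a_hat - b_hat, a_hat * b_hat], dim=-1)
        return self.mlp(v)                             # entailment / stance scores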


Figure 6.3: Training configuration of the evidence extraction model (two weight-sharing ESIM encoders process the claim paired with a ground-truth evidence sentence and with a sampled negative sentence; a tanh layer with dropout produces the ranking scores s_p and s_n, which enter the hinge loss Σ max(0, 1 + s_n - s_p))

6.3.2 Training configuration

The three models BiLSTM, DecAtt, and ESIM represent different approaches to combining two sentences in order to classify their relation. Even though DecAtt and ESIM were originally designed for classification, they compute a rich representation of the two compared sentences, which can also be used for ranking. Thus, we use these models as encoders in order to derive a joint representation of a candidate evidence sentence and the claim that can be used to compute a ranking score. This ranking score shall indicate how relevant the given sentence is for the validation of the claim. The training configuration of the models is illustrated in Figure 6.3, where we depict ESIM as the encoder of the system. When using DecAtt or BiLSTM as an encoder, we only need to replace the ESIM with the other encoder in the system. The encoder takes as input a claim and a candidate evidence sentence and outputs a vector that represents the interaction between the two sentences. The output vector is then fed through a hidden layer that is connected to a single neuron predicting the ranking score. The models are trained to predict the ranking scores for positive and negative samples. As a loss function, we use a modified hinge loss with negative sampling: Σ max(0, 1 + s_n - s_p), where s_p indicates the ranking score of a positive sample and s_n the ranking score of a negative sample. To obtain s_p, we feed the network with a claim and a ground-truth evidence sentence. To compute s_n, we take all documents from which the ground-truth evidence sentences for the claim originate, randomly sample one sentence (excluding ground-truth evidence sentences), and feed this sentence with the claim into the encoder. With our modified hinge loss function, we are able to maximize the margin between positive and negative samples. The ranking score of the positive samples is thereby pushed towards one and the ranking score of the negative samples towards minus one.
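A minimal sketch of this pairwise training objective, assuming a ranking model that maps a (claim, sentence) pair to a scalar score; the function name and data handling are illustrative.

import torch

def pairwise_hinge_loss(model, claim, positive_sentence, negative_sentence):
    # positive_sentence: a ground-truth evidence sentence for the claim
    # negative_sentence: a randomly sampled non-evidence sentence from the same documents
    s_p = model(claim, positive_sentence)   # ranking score of the positive pair
    s_n = model(claim, negative_sentence)   # ranking score of the negative pair
    return torch.clamp(1.0 + s_n - s_p, min=0.0).mean()  # max(0, 1 + s_n - s_p)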


6.3.3 Testing configuration

At test time, a claim is combined with each sentence in the retrieved documents, and each pair is fed into the trained model to predict a ranking score. Then, the sentences are ranked in descending order based on their ranking scores, and we choose the top-k highest-ranked sentences as evidence. The three resulting ranking models will be named rankingESIM, rankingDecAtt, and rankingBiLSTM henceforth.
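A minimal sketch of this test-time procedure; the helper name and the representation of documents as lists of sentences are assumptions.

def extract_evidence(model, claim, retrieved_documents, k=5):
    # score every sentence from the retrieved documents against the claim
    candidates = [s for doc in retrieved_documents for s in doc]
    scored = [(model(claim, sentence), sentence) for sentence in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # descending ranking score
    return [sentence for _, sentence in scored[:k]]       # top-k sentences as evidence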

6.4 Experiments and results

In order to evaluate the performance of the three models in the evidence ranking setup, we measure the precision and recall on the five highest-ranked sentences (precision @5 and recall @5). This setting is identical to the evaluation procedure used in the FEVER shared task to evaluate the evidence ranking models. As encouraged by the FEVER evaluation metric, we also believe that a model which is able to retrieve most of the evidence sentences with relatively low precision is more valuable than a model that is only able to retrieve a subset of the evidence sentences, though with high precision. Thus, recall @5 is considered as the decisive evaluation metric. We evaluate the performance of the models using this metric on the FEVER18 and the Snopes19 datasets and present the results below in two separate sub-sections.
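A simplified sketch of how precision @k and recall @k can be computed per claim from the predicted top-k sentences and the gold evidence; the official FEVER scorer additionally handles alternative evidence sets, which is omitted here.

def precision_recall_at_k(predicted_top_k, gold_evidence):
    hits = sum(1 for s in predicted_top_k if s in gold_evidence)
    precision = hits / len(predicted_top_k) if predicted_top_k else 0.0
    recall = hits / len(gold_evidence) if gold_evidence else 0.0
    return precision, recall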

6.4.1 Experiments on FEVER18

In this subsection, we perform a number of experiments with the introduced ranking models in order to evaluate their performance and find the best configuration for the best performing model. Moreover, we discuss the results of the FEVER evidence extraction sub-task.

Comparison of different ranking models

In the experiments discussed below, we compare the performance of our three ranking models to the FEVER baseline TF-IDF. The results of the different systems on the development set of FEVER18 are shown in Table 6.1. As can be seen, rankingESIM performs best of all the compared systems. However, the FEVER baseline TF-IDF also reaches good results and outperforms the more complicated models rankingDecAtt and rankingBiLSTM. This indicates that the lexical overlap between the claim and the candidate evidence sentence is an important feature for evidence ranking. Since rankingBiLSTM outperforms rankingDecAtt, we assume that the contextual information computed by LSTMs is important in the evidence ranking task. Since rankingESIM outperforms all other models, we use this model in the experiments below.

Evidence ranking with rankingESIM

In Table 6.2, we show the performance of our document retrieval system (presented in Section 4.3.2) and of rankingESIM when retrieving different numbers of the highest-ranked Wikipedia documents. The results show that both systems benefit from a larger number of retrieved documents.


model               precision @5   recall @5
rankingESIM            0.839          0.862
TF-IDF (baseline)      0.764          0.859
rankingBiLSTM          0.541          0.832
rankingDecAtt          0.415          0.719
random baseline        0.206          0.602

Table 6.1: Comparison of different models in the evidence ranking problem setting of the FEVER18 corpus

However, whereas for document retrieval the performance always improves when more documents are considered, for evidence extraction the recall slightly decreases if we retrieve 10 Wikipedia documents. This problem arises from the fact that when 10 documents are retrieved, there is a larger number of sentences from which the evidence needs to be extracted. Since there is an increased amount of noise, in terms of a larger number of sentences that are somehow related to the claim but are not considered as evidence, the problem becomes more difficult. Thus, because we reach the best performance with 7 documents, we use this setting in the following experiments.

# docs   doc. recall   evidence recall @5
1        0.8463        0.8208
3        0.8890        0.8537
5        0.8994        0.8602
7        0.9033        0.8624
10       0.9049        0.8615

Table 6.2: Performance of our document retrieval system and rankingESIM when using different numbers of MediaWiki search results

As mentioned above, in the FEVER shared task, recall @5 was considered as the evaluation metric. However, we also analyze how different numbers of selected evidence sentences affect the performance, that is, recall @4, recall @3, ... The results in Table 6.3 show that whereas recall increases if more evidence sentences are considered, the precision and the F1 score decrease. This indicates that even though we are able to retrieve a larger number of correct evidence sentences, the noise in the retrieved set of evidence increases. In fact, since the FEVER18 corpus in many cases provides only one or two evidence sentences, the selected set of five sentences necessarily includes unrelated sentences.

FEVER shared task evidence extraction sub-task

In order to maximize the performance in the FEVER shared task, we used an ensemble of 10 rankingESIM models for sentence ranking. To construct the ensemble, we trained 10 rankingESIM models with different randomisation seeds.


# sentences (k)   precision @k   recall @k   F1 @k
1                 0.7855         0.7216      0.7522
2                 0.4857         0.7216      0.6075
3                 0.3576         0.8421      0.5020
4                 0.2862         0.8599      0.4294
5                 0.2402         0.8624      0.3767

Table 6.3: Performance of the rankingESIM on the FEVER18 corpus when different numbers of sentences are selected

User          Team Name     precision @5   recall @5   F1 @5
chaonan99     UNC-NLP       0.4227         0.7091      0.5296
tyoneda       UCL           0.2216         0.8284      0.3497
littsler      Athene UKP    0.2361         0.8519      0.3697
papelo                      0.9218         0.5002      0.6485
chidey                      0.1848         0.7539      0.2969
Tuhin         ColumbiaNLP   0.2302         0.7589      0.3533
Wotto                       0.1209         0.5169      0.1960
FEVER base.                 0.1826         0.4422      0.1866

Table 6.4: Results of the evidence ranking part of the FEVER shared task

At test time, the claim is individually combined with all candidate evidence sentences and fed into each rankingESIM in order to compute 10 ranking scores for each claim-sentence pair. Then, the mean score of each claim-sentence pair over all 10 models of the ensemble is calculated. The candidate evidence sentences are ranked according to these scores, and the five highest-ranked sentences are then taken as the output of the model. The results of the FEVER evidence extraction task are illustrated in Table 6.4. It must be noted that the performance of the ranking models also depends on the document retrieval system, and thus, the results are indicative of the performance of both systems combined. As the table illustrates, our document retrieval and evidence ranking systems reach the highest recall @5, which was the official metric for the FEVER shared task evidence extraction problem. In fact, as can be observed in Table 6.4, only the team UCL reaches a similar performance; the recall score of their model is only about 2.4% lower than the score of our model. The user papelo only predicted one sentence as evidence and was therefore able to reach the best precision @5. Nevertheless, this strategy leads to low recall @5, since often two evidence sentences had to be identified.
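The score averaging used by the ensemble can be sketched as follows; the 10 trained rankingESIM models are represented here by an arbitrary sequence of scoring functions.

from statistics import mean
from typing import Callable, List, Sequence, Tuple

def ensemble_rank(claim: str,
                  candidates: List[str],
                  scorers: Sequence[Callable[[str, str], float]],
                  k: int = 5) -> List[Tuple[str, float]]:
    """Score every claim-sentence pair with each ensemble member, average the
    scores, and return the k highest-ranked sentences."""
    averaged = [(sentence, mean(scorer(claim, sentence) for scorer in scorers))
                for sentence in candidates]
    averaged.sort(key=lambda pair: pair[1], reverse=True)
    return averaged[:k]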

6.4.2 Experiments on Snopes19

As described in Section 3.2.1, for the Snopes19 corpus, we define evidence extraction as the identification of Fine-Grained Evidence (FGE) in Evidence Text Snippets (ETS).


model             recall m   precis. m   F1 m
upperBound        0.769      0.725       0.746
random baseline   0.500      0.500       0.500
majority vote     0.343      0.500       0.407

Table 6.5: Baselines and the human upper bound for the evidence extraction task of the Snopes19 corpus, if the problem is considered as a classification task (m = macro)

model             precision @5   recall @5
rankingBiLSTM     0.451          0.637
rankingDecAtt     0.420          0.627
TF-IDF            0.627          0.601
rankingESIM       0.288          0.507
random baseline   0.296          0.529

Table 6.6: Comparison of different models on the evidence ranking problem for the Snopes19 corpus

This problem setting is similar to the identification of evidence sentences for the FEVER18 corpus, where sentences of the introductory sections of Wikipedia articles need to be ranked according to their relevance for the validation of the claim. Since ETSs are similar in size to the introductory sections of Wikipedia articles, the FEVER problem setting will be considered as a reference task. We therefore frame evidence extraction for the Snopes19 corpus also as a ranking problem. Nevertheless, in Table 6.5, we provide the human upper bound, the random baseline, and the majority vote scores for evidence extraction as a classification problem for future reference. In Table 6.6, we show the performance of the four discussed models on the Snopes19 corpus. Compared to the performance of the models on the FEVER18 corpus illustrated in Table 6.1, the results differ considerably. In terms of recall @5, the neural networks with a smaller number of parameters, rankingBiLSTM and rankingDecAtt, perform best. The TF-IDF model reaches the best results in terms of precision. The rankingESIM reaches a relatively low score and is not able to beat the random baseline. We assume this is because the model has a large number of parameters and requires more training instances than provided by Snopes19.

6.5 Error analysis

6.5.1 Error analysis FEVER18

For FEVER18, we performed an error analysis for the rankingESIM as the highest-scoring model. In a number of cases, the correct evidence sentences could not be retrieved not because of shortcomings of the sentence ranking model but for other reasons. One common error, which is independent of the ranking model, is that the document retrieval system is not able to retrieve all documents containing the evidence sentences.

As shown in Table 6.2, the best document retrieval system is still not able to identify all the required documents for almost 10% of the claims. This reduces the upper bound of the evidence ranking model. Furthermore, the rankingESIM retrieves in many cases a reasonable set of evidence sentences, but these sentences have not been labeled as evidence by the annotators. For example, the claim “The Bee Gees wrote music.” can be supported by the identified sentence “The Bee Gees wrote all of their own hits, as well as writing and producing several major hits for other artists.”, which was not annotated as evidence. In fact, Wikipedia is too large to be exhaustively explored for evidence by a small group of annotators. This was also acknowledged by the FEVER shared task organizers. Thus, in the evaluation phase of the FEVER shared task, human annotators evaluated the evidence sentences predicted by the competing systems. It was therefore possible to give credit to systems that identified valid evidence sentences which had not yet been labeled as such. Another frequent case in which the correct evidence sentences could not be identified is when an entity mention in the claim does not occur in the annotated evidence sentences. E.g. for the claim “Daggering is nontraditional.” there is only one annotated evidence sentence, “This dance is not a traditional dance.”. Here, this dance refers to daggering, but this information is not given. Thus, this case cannot be resolved by our model. For some claims, one of the evidence sentences is less related to the claim and is therefore not identified by our model. E.g. the claim “Henry II of France has three cars.” has the two evidence sentences “Henry II died in 1559.” and “1886 is regarded as the birth year of the modern car.”. The second sentence is important for the validation of the claim, but it has a low lexical and semantic overlap with the claim, and it is therefore ranked very low by our model.

Summary. As our analysis shows, the encountered errors can often be traced back to failures of the document retrieval system and issues with the dataset. They can therefore not be directly addressed by improving the evidence extraction model. Nevertheless, we have also found instances where the evidence sentences could in principle be identified if a more sophisticated model were used. For instance, coreference resolution approaches would help to find the evidence sentence in the daggering dance example. If the model had some kind of world knowledge and were able to link semantically distant sentences that in combination could help to validate the claim, then the Henry II example could be resolved.

6.5.2 Error analysis Snopes19

For Snopes19, we perform an error analysis for the rankingBiLSTM and the TF-IDF system, as they reach the highest recall and precision, respectively. The TF-IDF system achieves the best precision because it only predicts a small set of sentences that have lexical overlap with the claim. The model therefore misses FGE in which the claim is paraphrased and synonyms instead of the same words are used. The rankingBiLSTM is better able to capture the semantics of the claim and the FGE, and we believe that it was therefore able to reach a higher recall. In fact, we have found that in order to identify the correct FGE, the model had to make the connection between semantically related words, such as “Israel”-“Jewish”, “price”-“sold”, “pointed”-“pointing”, “broken”-“injured”.

Nevertheless, the model fails when the relationship between the claim and the FGE is more elaborate, e.g. if the claim is not paraphrased but reasons for it being true are given. We have also found that refuting FGE is more difficult to identify since the semantic and lexical overlap between the FGE and the claim is less pronounced. Moreover, as illustrated by Example 6.2 below, the model wrongly assigns a high ranking score to a sentence when the topic of the sentence is similar to the topic of the claim but the sentence is not relevant for the validation of the claim.

Claim: The Department of Homeland Security uncovered a terrorist plot to attack Black Friday shoppers in several locations. Sentence: Bhakkar Fatwa is a small, relatively unknown group of Islamic militants and fanatics that originated in Bhakkar Pakistan as the central leadership of Al Qaeda disintegrated under the pressures of U.S. military operations in Afghanistan and drone strikes conducted around the world.

Example 6.2: A sentence wrongly assigned a high ranking score because of semantic relatedness with the claim

Another frequently encountered error occurs when several sentences only in combination form a consistent narrative that supports or refutes the claim. For example, the claim “Vintage color photograph from 1941 shows the Japanese attack on Pearl Harbor.” has the ground-truth sentences “The 7 December 1941 Japanese raid on Pearl Harbor was one of the great defining moments in history.” and “Watch these Amazing color photos of the attack.” In isolation, the second sentence is only marginally related to the claim, and the model therefore fails to identify this sentence as evidence.

Summary. As our analysis shows, the evidence ranking system fails in more difficult cases, for which features such as lexical overlap or semantic relatedness are not sufficient. In fact, we have found that, compared to the errors of the rankingESIM in the FEVER shared task discussed in the previous section, the errors of the rankingBiLSTM and the TF-IDF system discussed above are more rudimentary. The rankingESIM is more elaborate and has more parameters, which, on the one hand, leads to improved performance, but, on the other hand, requires a larger corpus for training. Since the Snopes19 corpus does not contain sufficient instances for training the model, pre-trained language models, such as BERT (Devlin et al., 2018) or XLNet (Yang et al., 2019b), could be explored in the future. Such models are trained on large collections of unlabeled text in a self-supervised manner and can then be fine-tuned on a small corpus. After fine-tuning, such models generally reach high performance on the small corpus.

6.6 Chapter summary

In this chapter, we analyzed the evidence extraction problem, which is the task of identifying evidence sentences in documents using a claim as a query.

The task can either be considered as binary classification, where sentences in the retrieved documents are classified as evidence or not evidence, or as a ranking problem. In the ranking problem setting, sentences from all the retrieved documents are first ranked according to their relevance for the validation of the claim and then the k highest-ranked sentences are taken as evidence. Similar to the FEVER shared task, we have considered evidence extraction as a ranking problem. For the ranking of the evidence sentences, we have presented three neural models, which have different levels of complexity, and compared their performance to the FEVER shared task TF-IDF baseline. We have evaluated the performance of the models on the two corpora FEVER18 and Snopes19. For FEVER18, our neural model based on the ESIM architecture reached the best results. For Snopes19, a simpler BiLSTM-based model and the TF-IDF baseline performed best. We conclude that the results for the two corpora are different because Snopes19 is an order of magnitude smaller than FEVER18. The more sophisticated model based on the ESIM probably has too many parameters to be effectively trained on the smaller Snopes19 corpus.

Our error analysis on both corpora has shown that the correct evidence sentences cannot be identified for a number of reasons: (1) The documents containing the required evidence may not be retrieved by the document retrieval system. (2) The correct evidence sentences are lexically and semantically different from the claim, and because the models rely on these features, such sentences could often not be identified. (3) The evidence sentences contain anaphoric pronouns and the models therefore often fail to make the connection between these sentences and the claim. (4) World knowledge is required to infer the relevance of a sentence for the validation of the claim.

Since the error sources are very diverse, we have concluded that different strategies need to be followed to improve performance. For the FEVER evidence ranking problem, models are required that can resolve anaphoric pronouns and are able to incorporate world knowledge into the ranking process. Since the models already achieve a relatively high score on the FEVER dataset, and world knowledge cannot be easily incorporated in practice, significant performance gains on FEVER18 can probably not be achieved. The errors occurring on the Snopes19 corpus are, on the other hand, more rudimentary, and therefore, more elaborate approaches could help to improve performance for this corpus. Most promising in this case are pre-trained language models such as BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019b). These models are trained on large collections of text and are able to generalize to a new task if fine-tuned only on a small corpus.



Chapter 7

Claim validation

Claim validation is the problem of determining the verdict for a claim on the basis of collected evidence, and it represents the last step in our fact-checking pipeline (Section 1.2.2). A system capable of validating claims with high accuracy would be of enormous value, as it would make it possible to validate a large number of claims emerging on the web and thus help to control the spread of false information (see also the discussion in Section 1.2.2). However, the task is very difficult, as the system needs to understand the often complex relationship between the evidence and the claim and must be able to weigh contradicting pieces of evidence against each other. Moreover, the accumulation of errors through the pipeline makes it even more difficult to determine the verdict: it is not guaranteed that the correct set of evidence was retrieved, let alone that such a set exists in the queried document collection. As described in Section 1.2.2, the goal of the proposed fact-checking pipeline is therefore not to take over the fact-checking task entirely, but to assist the fact-checker in order to speed up the process. Thus, our objective in this chapter is to investigate in which cases we can achieve good results for claim validation and what the outstanding challenges are.

The chapter is structured as follows. In Section 7.1, we discuss the claim validation problem setting and provide an example. Section 7.2 gives an overview of work related to claim validation. In Section 7.3, we analyze the Recognizing Textual Entailment (RTE) problem of the FEVER shared task, which is similar to our definition of the claim validation problem. We introduce a new model for the RTE task and perform an error analysis for the misclassified instances of this model. In Section 7.4, we present experiments with a number of different claim validation models on the Snopes19 corpus, perform an error analysis, and discuss outstanding challenges.

The contributions of this chapter are the following.1
(10) We present our model for claim validation that reached the third rank for recognizing textual entailment in the FEVER shared task, and analyze the performance of the model in an error analysis.
(4) We perform a large number of experiments on our newly created Snopes corpus with models from the FEVER shared task and other successful approaches suitable for the claim validation sub-task. Based on the conducted experiments, we perform an analysis of the claim validation problem setting defined by our corpus Snopes19 and compare it to the claim validation problem defined by FEVER18.

1The complete list of contributions ranging from 1 to 10 is given in Section 2.1.1.


7.1 Problem setting

As described above, we define claim validation as the problem of predicting the verdict for a claim on the basis of a set of aggregated evidence. The set of evidence is represented by a number of sentences that provide valuable information for the validation of the claim (see also the definition in Section 6.1). We analyze the task on the basis of the two corpora FEVER18 and Snopes19. For both corpora, we are given one or several evidence sentences based on which the verdict needs to be determined. In our experiments, we use the FEVER shared task classification scheme for both problem settings, that is, a claim needs to be classified as supported, refuted, or not enough info. The labels supported and refuted correspond to the verdicts true and false, respectively. The not enough info label is given if the collected evidence sentences do not provide sufficient information to label the claim as supported or refuted.

A challenging claim validation problem is presented in Example 7.1. The given claim needs to be validated on the basis of three sets of evidence sentences, each of which is extracted from a different document. The classification problem is difficult as the claim is false, but there are two sets of sentences which support the claim and only one sentence set which refutes it. Thus, a deeper analysis of the case is required. The first two sentence sets are from the same source, that is, General Al-Saudi, and they are therefore not independent of each other and can be treated as one piece of evidence. Evidence set three attacks the claim by pointing out that a story published by the Daily Mail, which supports the claim (most likely the origin of one of the two other sentence sets), is false because there are no dams in southern Israel. It further states that the original Daily Mail story was altered (supposedly after noticing the mistake). All these are hints that the claim is false; however, without additional contextual information, the case is difficult to resolve even for humans. We will discuss such instances in more detail in Section 7.4.2 of this chapter.

It must be noted that even though we have discussed the role of sentence sets (evidence sentences coming from the same source) and the stance of the sentences in the claim validation process, this kind of information is not considered in the claim validation problem setting discussed below. As pointed out in the previous paragraph, the task is only to classify the claim on the basis of the collected evidence sentences.

7.2 Related work

There has been a considerable amount of work in the area of claim validation, even though the task is often given a different name and is framed slightly differently from our definition of the problem. In the following, we present the work that we consider most relevant to our definition of the claim validation problem. We split the related work into three sub-sections: early work on claim validation, advanced claim validation approaches, and the FEVER RTE problem setting.


Claim: Israel caused flooding in Gaza by opening river dams

Verdict: refuted

Collected evidence sentences:

Evidence set 1 “Israel opened water dams, without warning, last night, causing serious damage to Gazan villages near the border,” General Al-Saudi told Al Jazeera.

Evidence set 2 Hundreds of Palestinians left homeless after Israel opens river dams and floods houses General Al-Saudi said that the dams were opened without warning.

Evidence set 3 The Daily Mail published a story on Monday that originally accused Israel of intentionally opening dams in southern Israel in order to flood Gaza. The only problem is, ... there are no dams in southern Israel. Honest Reporting ... took screen shots of the article before amendments were made.

Example 7.1: Claim validation on the basis of collected evidence

7.2.1 Early work on claim validation

Early work on claim validation was done on relatively small corpora and the problem setting was often very restricted. Wang (2017) performed experiments on the PolitiFact17 corpus (Section 3.1.1) in two different settings: predicting the verdict only on the basis of the claim itself, or additionally considering meta-information (such as the name of the speaker, his/her party affiliation, etc.) for the prediction. Derczynski et al. (2017) introduced the dataset RumEval17 (Section 3.1.1) for the RumourEval shared task. The organizers suggested two variants of the task: prediction of the verdict only on the basis of the rumors (claims) themselves, or using additional information, such as Wikipedia, for the prediction. Nevertheless, for the second setting, no additional annotations were provided. Another study concerned with the validation of rumors was performed by Dungs et al. (2018). They proposed a Hidden Markov Model (HMM) to predict the verdict for rumors on Twitter. As features, Dungs et al. (2018) only used the stance of the tweets that referred to the rumor and the timestamps of these tweets. The organizers of the CLEF-2018 shared task introduced the corpus CLEF-2018 (Section 3.1.1). In the second part of the shared task, participants had to classify claims as true, false, or half-true.

The corpora of the studies discussed above are relatively small, and in most cases, no evidence for the validation of the claims is considered. From our point of view, such a definition of claim validation is problematic. If no evidence is considered, the models most likely only learn biases with respect to the words in the claim rather than learning to differentiate between true and false claims. Moreover, since the number of claims in all corpora is relatively low, most models trained on these corpora reach a relatively low performance, and they are unlikely to be able to generalize to other datasets.


7.2.2 Advanced claim validation approaches

Popat et al. (2017) constructed the substantially larger corpus Snopes17 containing 4,956 claims from the Snopes and Wikipedia web-pages. However, even though they provide additional documents for the validation of the claims, the documents have been collected not by human experts but using the Google search engine. The documents can therefore not be considered as a gold standard. Popat et al. (2017) proposed two models for the claim validation task which make use of hand-engineered features: a model based on Distant Supervision and a Joint Model based on a Conditional Random Field. Even though the Joint Model is more sophisticated, it is outperformed by Distant Supervision by 2% (80% vs. 82%). Popat et al. (2018) proposed a neural network architecture, DeClarE, for “debunking false claims”. The claim’s verdict is predicted on the basis of the claim, its source, and the evidence documents with their sources (URLs of the web documents). The model makes use of an attention mechanism that should increase performance, and because it allows highlighting passages in the documents, it should also help to explain why a particular prediction was made. Popat et al. (2018) perform experiments on the corpora Snopes17, RumEval17, and two datasets based on the PolitiFact2 and the NewsTrust3 websites. On three of these corpora, the model is able to achieve a new state of the art.

7.2.3 FEVER RTE problem setting

The corpus FEVER18 provided by the FEVER shared task organizers is substantially larger than all previous fact-checking datasets. It therefore allows training machine learning models of higher complexity, which should lead to higher performance. As described in Section 3.1.2, in the FEVER shared task, a pipeline had to be constructed for three fact-checking sub-tasks. For the Recognizing Textual Entailment (RTE) sub-task (which corresponds to claim validation), the FEVER organizers proposed two baselines: DecAtt (Section 5.4.1), which was developed for the SNLI (Bowman et al., 2015) task, and the UCLMR MLP introduced for the FNC stance detection task by Riedel et al. (2017) (see Section 5.3.1). Even though the two models were designed for different tasks (NLI and stance detection), because the problem settings are structurally identical to claim validation, both models can be used for the FEVER RTE task. Since DecAtt is superior to UCLMR, this model serves as a baseline for the experiments in this chapter. We consider the FEVER shared task problem setting as the most comprehensive definition of the automated fact-checking problem, and thus, in Section 7.4, we use the FEVER RTE sub-task as a reference for our analysis of the claim validation problem defined for the Snopes19 corpus. There has been much work on the FEVER RTE sub-task that is beyond the literature review given in this section. Thus, in the following, we only discuss studies that are most relevant to our analysis in this chapter. The UNC-NLP team (Nie et al., 2019) won the RTE sub-task, reaching an accuracy score of 67.98%. Nie et al. (2019) called their system Neural Semantic Matching Network.

2https://www.politifact.com/
3https://news.trust.org/


The network takes as input the concatenated evidence sentences and the claim and predicts the verdict. For the representation of the tokens of the claim and the evidence sentences, they used GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018) embeddings, and other features. The two input sequences representing the claim and the evidence sentences are fed through a network that is similar to the ESIM (Section 6.3.1). More specifically, BiLSTM layers are used in order to enrich the tokens with contextual information, and an attention mechanism allows the model to identify important information in the evidence sentences on the basis of the claim and vice versa.

The UCL Machine Reading Group (Yoneda et al., 2018) came in second in the RTE sub-task, reaching a verdict accuracy of 67.44%. They constructed a model based on the ESIM (Chen et al., 2017b) and pre-trained it on the SNLI (Bowman et al., 2015) dataset before fine-tuning it on FEVER18. The model is designed in such a way that evidence sentences are individually combined with the claim and fed into different ESIMs. Each ESIM computes a probability distribution over the three classes. These predictions for each claim-sentence pair are multiplied by the ranking scores of their sentence selection model (for the current sentence) and then fed through an MLP for the final classification.

Our system Athene (Hanselowski et al., 2018b) achieved a verdict accuracy of 65.22% and came in third. Similar to the UCL Machine Reading Group, we used the ESIM as a building block in our network architecture. We also fed sentence-claim pairs into individual ESIMs, but instead of weighting the predictions by the scores of the evidence ranking model, we used an attention mechanism to weight the individual outputs of the ESIM for each sentence-claim pair (see Section 7.3.1).

Yin and Roth (2018) did not participate in the FEVER shared task but also proposed a pipeline for the three FEVER sub-tasks. They introduced a model that, in a multitask setting, jointly learns to select the evidence sentences and predict the verdict for the claim. Moreover, the model has a two-channel mechanism, meaning that during the classification process an attention mechanism is not only applied between the claim and the evidence sentences (as by the top three models in the FEVER RTE sub-task), but also between a given evidence sentence and the other remaining evidence sentences. This should enable the model to predict the verdict on the basis of several facts distributed across different evidence sentences (see Example 3.1 in Section 3.1.2). The resulting model reaches an accuracy score of 75.99% on the unofficial test set released before the shared task. Even though the results are not directly comparable, as the top three systems were evaluated on the official test set, the performance gains over the top three models are substantial.

7.2.4 Summary

Early work on claim validation was done on relatively small corpora and the problem setting was often very restricted. The corpora did not provide annotation of evidence and the verdict for the claim was often only predicted on the basis of the claim itself. This problem setting is problematic, because if no evidence is provided, the trained systems can only exploit regularities in the dataset for the prediction of the verdict. Moreover, the systems have been trained only on single-domain corpora, which means that they are unlikely to generalize to other domains.


Popat et al. (2017, 2018) used a number of substantially larger multi-domain corpora for training claim validation systems. For the validation of claims, they explored elaborate machine learning approaches, such as distant supervision, conditional random fields, and attention-based neural networks. These approaches are more promising compared to early work on claim validation, as they reach relatively high accuracy scores of about 80%. The corpus FEVER18 constructed by the FEVER shared task organizers is substantially larger than all previous fact-checking datasets, and it provides more annotations for the claim validation problem than previous corpora. The corpus received much attention in the NLP community, and a number of different neural network architectures have been proposed for the FEVER claim validation problem. In this chapter, we present our approach to the FEVER claim validation problem, which reached the third rank in the FEVER shared task. Moreover, we compare claim validation experiments on our Snopes19 corpus to the experiments on the FEVER corpus as the most comprehensive dataset introduced so far.

7.3 Experiments on FEVER18

In this section, we first present the model that we have developed for the FEVER claim validation problem (RTE sub-task) and discuss the results of the FEVER shared task. The performance of our model is then evaluated in an error analysis. We conclude the section with a brief discussion of the major takeaways from the FEVER shared task.

7.3.1 Model architecture

Our model, illustrated in Figure 7.1, is an extension of the ESIM (Section 6.3.1). In contrast to the ESIM, which can only predict the entailment relation between a hypothesis sentence and a premise sentence, our model can predict the relation between multiple input sentences and the claim, as required in the FEVER RTE sub-task. For the prediction of the verdict (supported, refuted, or not enough info), we use the five sentences retrieved by our sentence selection model (Section 6.3.1). For the representation of the tokens of the claim and the evidence sentences, we concatenate GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017) embeddings. Since both types of embeddings are pretrained on Wikipedia, they are particularly suitable for the FEVER RTE problem setting. To process the five input sentences using the ESIM, we combine the claim with each sentence and feed the pairs into five ESIMs. The pairs are analogous to the hypothesis-premise pairs in the original SNLI setting. The last hidden states of the ESIMs computed for the five individual claim-sentence pairs are compressed into one vector using an attention mechanism and pooling operations. The attention mechanism is based on representations of the claim and the five evidence sentences. These representations are obtained by average pooling over the outputs of the first BiLSTM of the ESIM (Section 6.3.1).


Figure 7.1: Extension of the ESIM for the FEVER RTE task

For each claim-sentence pair, the sentence representation and the representation of the claim are individually fed through a single-layer perceptron, giving rise to two vectors. The cosine similarity of the two vectors is then used as an attention weight. The five output vectors of all ESIMs are multiplied by their respective attention weights, and we apply average and max pooling on these vectors in order to reduce them to two representations. Finally, the two representations are concatenated and fed through a 3-layer perceptron for the prediction of the verdict (three-way classification). The motivation behind the attention mechanism is that it allows us to extract the information from the five sentences that is most relevant for the classification of the claim. In fact, since we always consider five sentences but an evidence set in the FEVER corpus usually consists of only one or two sentences, the model needs to be able to focus on the most important information. Other teams participating in the FEVER shared task came up with similar sentence weighting approaches (Nie et al., 2019; Yin and Roth, 2018).
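The aggregation step described above can be sketched in PyTorch as follows. This is a simplified illustration: the ESIM encoders and the pooled claim and sentence representations are assumed to be computed elsewhere, and the dimensions and layer sizes are placeholders rather than the exact configuration used in our experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionAggregator(nn.Module):
    """Aggregates the outputs of the five per-sentence ESIMs into one verdict.
    Only the attention-weighted pooling and the classifier are shown here."""

    def __init__(self, esim_dim=300, repr_dim=300, hidden_dim=100, num_classes=3):
        super().__init__()
        # Single-layer projections used to compute the attention weights.
        self.claim_proj = nn.Linear(repr_dim, hidden_dim)
        self.sent_proj = nn.Linear(repr_dim, hidden_dim)
        # 3-layer perceptron for the final three-way classification.
        self.classifier = nn.Sequential(
            nn.Linear(2 * esim_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, esim_out, claim_repr, sent_repr):
        # esim_out:   (batch, 5, esim_dim) last hidden states of the five ESIMs
        # claim_repr: (batch, repr_dim)    average-pooled claim representation
        # sent_repr:  (batch, 5, repr_dim) average-pooled sentence representations
        c = self.claim_proj(claim_repr).unsqueeze(1)             # (batch, 1, hidden)
        s = self.sent_proj(sent_repr)                            # (batch, 5, hidden)
        attn = F.cosine_similarity(c, s, dim=-1).unsqueeze(-1)   # (batch, 5, 1)
        weighted = esim_out * attn                               # weight each ESIM output
        pooled = torch.cat([weighted.mean(dim=1),
                            weighted.max(dim=1).values], dim=-1) # average + max pooling
        return self.classifier(pooled)                           # (batch, num_classes)

# Toy usage with random tensors in place of real ESIM outputs:
model = AttentionAggregator()
logits = model(torch.randn(2, 5, 300), torch.randn(2, 300), torch.randn(2, 5, 300))
print(logits.shape)  # torch.Size([2, 3])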

7.3.2 Experiments and results

As for the rankingESIM (Section 6.4.1), we explore the performance of our claim validation model for different numbers of evidence sentences. The results on the FEVER18 development set illustrated in Table 7.1 demonstrate that our model performs best if all five highest-ranked sentences are used.

In the table, label accuracy refers to the accuracy of our claim validation model. The FEVER score was the official metric for the evaluation of the systems in the FEVER shared task. It represents a restricted version of the label accuracy, where a claim is only considered as validated correctly if the right verdict for the claim is predicted and the correct set of evidence for the claim is retrieved. It must be noted that the scores presented in the table are based on experiments with the entire pipeline, that is, we do not use the gold evidence sentences for the verdict prediction, but sentences extracted by our sentence selection model from the documents that have been retrieved by our document retrieval system.

# sentence(s)   label accuracy (%)   FEVER score (%)
1               60.03                54.80
2               61.87                57.19
3               64.29                59.33
4               66.30                61.79
5               68.49                64.74

Table 7.1: Performance of our claim validation model using different numbers of sentences
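A simplified sketch of how the FEVER score can be computed is given below. It follows the description above: a claim only counts as correct if the predicted verdict matches the gold verdict and, for supported and refuted claims, the predicted evidence contains at least one complete gold evidence set. The instance format and sentence identifiers are hypothetical, and the official scorer handles additional details.

def fever_score(instances) -> float:
    """Simplified FEVER score over a list of instances."""
    correct = 0
    for inst in instances:
        label_ok = inst["predicted_label"] == inst["gold_label"]
        if inst["gold_label"] == "not enough info":
            evidence_ok = True  # no evidence is required for NEI claims
        else:
            predicted = set(inst["predicted_evidence"])
            # At least one complete gold evidence set must be contained in the prediction.
            evidence_ok = any(set(gold_set) <= predicted
                              for gold_set in inst["gold_evidence_sets"])
        correct += int(label_ok and evidence_ok)
    return correct / len(instances)

# Example instance with hypothetical sentence identifiers (page title, sentence index):
example = [{
    "gold_label": "supported", "predicted_label": "supported",
    "gold_evidence_sets": [[("Bee_Gees", 3)]],
    "predicted_evidence": [("Bee_Gees", 3), ("Bee_Gees", 7)],
}]
print(fever_score(example))  # 1.0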

In Table 7.2, we compare the performance of our three sub-systems developed for the FEVER shared task, as well as the full pipeline, to the baseline sub-systems and the entire pipeline implemented by the shared task organizers (Thorne et al., 2018a). It must be remarked that for sentence selection, we provide our model with the documents from our document retrieval system, whereas the baseline sentence selection model receives the documents from the baseline document retrieval system. We use the same strategy to compute the label accuracy for the RTE task (the provided input sentences are from the respective sentence selection model). The results reported on the development set demonstrate that we were able to significantly improve upon the baseline in each sub-task. The performance gains over the entire pipeline add up to an improvement of about 100% with respect to the performance of the baseline pipeline. In Table 7.3, the results of the top five systems (out of the 23 participating systems) of the FEVER shared task are reported on the held-out test set. As the results demonstrate, we reached the third rank. Since the three top systems used similar network architectures (different versions of the ESIM attention mechanism), their results do not differ much. Team Papelo (Malon, 2018) used a Transformer network (Vaswani et al., 2017) and also reached good performance. With respect to the scores of the remaining systems, there is a substantial drop in performance, and the fifth-ranked team SWEEPer only reaches a score of 49.94%.

7.3.3 Error analysis

We have discovered that a large number of claims are misclassified due to the model’s inability to interpret numerical values. For instance, the claim “The heart beats at a resting rate close to 22 beats per minute.” is not classified as refuted by the evidence sentence “The heart beats at a resting rate close to 72 beats per minute.”.


Task (metric)                         system                       score (%)
Document retrieval (accuracy)         baseline (TF-IDF)            70.20
                                      our system (Section 4.3.2)   93.55
Sentence selection (recall @5)        baseline (TF-IDF)            44.22
                                      our system (Section 6.3.1)   87.10
Textual entailment (label accuracy)   baseline (DecAtt)            52.09
                                      our system (Section 7.3.1)   68.49
Full pipeline (FEVER score)           baseline                     32.27
                                      our system                   64.74

Table 7.2: Comparison of the performance of our pipeline and the baseline pipeline (Thorne et al., 2018a) on the FEVER18 development set

#   Team                  label accuracy (%)   FEVER score (%)
1   UNC-NLP               67.98                63.98
2   UCL Machine Reading   67.44                62.34
3   TUDA UKP-Athene       65.22                61.32
4   Papelo                60.74                57.04
5   SWEEPer               59.64                49.86

Table 7.3: Top 5 systems of the FEVER shared task

The only information refuting the claim is the number, but we assume that neither GloVe nor FastText word vectors can embed numbers distinctly enough for the model to identify the contradiction. Another problem is posed by challenging not enough info cases. For instance, the claim “Terry Crews played for the Los Angeles Chargers.” (annotated as not enough info) is classified as refuted, given the sentence “Crews played as a defensive end and linebacker in the National Football League (NFL) for the Los Angeles Rams, San Diego Chargers, and Washington Redskins, ...”. The sentence is related to the claim but does not exclude it, which makes this case difficult.

7.3.4 Conclusion

In order to achieve high performance in the FEVER RTE task, a pipeline needs to perform well across all three sub-tasks. Without retrieving the correct set of documents, the required evidence sentences could not be identified, and consequently, no credit could be given for a validated claim even if the correct verdict was predicted. Obviously, the correct evidence also helps the RTE component to predict the correct verdict. The evaluation metric penalizes models that make the right prediction on the basis of wrong evidence, which also helps to prevent overfitting. For the RTE sub-task, the top three teams converged to a similar network architecture that is based on the ESIM or on an ESIM-like attention mechanism.


Even though good performance is reached by the top three systems, which almost achieve 70% accuracy for the three-way classification, it is expected that more recent model architectures based on language modeling (Devlin et al., 2018; Radford et al., 2019; Yang et al., 2019b) can further improve performance. As already shown by Yin and Roth (2018), further gains can be achieved when considering sentence selection and RTE as a multitask problem. The FEVER shared task has significantly progressed the work on automated fact-checking and inspired the development of pipeline systems in contrast to single-model fact-checking systems. Nevertheless, the FEVER fact-checking problem is to some extent simplified and still deviates from real fact-checking. The entire FEVER dataset was created synthetically using only Wikipedia as an information source. We address this issue in the following section by running experiments on our multi-domain corpus Snopes19, which is based on real fact-checking instances.

7.4 Experiments on Snopes19

We define the claim validation problem for the Snopes19 corpus in such a way that we can compare it to the FEVER RTE problem setting discussed above. Thus, as illustrated in Table 7.4, we compress the different verdicts from the Snopes19 corpus into three categories. In order to form the not enough info (NEI) class, we combine the claims with these three verdicts: mixture, unproven, undetermined. We have found that for these classes of claims, no clear decision can be made whether the claim is true or false even though Fine-Grained Evidence (FGE) is provided. Moreover, we entirely omit all the other instances with verdicts like legend, outdated, miscaptioned, ... since these cases are ambiguous and difficult to classify on the basis of the given FGE.

FEVER       Snopes
refuted     false, mostly false
supported   true, mostly true
NEI         mixture, unproven, undetermined

Table 7.4: Compression of the Snopes verdicts to FEVER verdicts (NEI: not enough information)
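The compression of the verdicts can be expressed as a simple lookup following Table 7.4; the exact verdict strings used in the corpus are assumptions here, and verdicts not listed in the mapping (e.g. legend, outdated, miscaptioned) are dropped entirely.

# Hypothetical mapping from Snopes verdicts to the three FEVER-style classes.
SNOPES_TO_FEVER = {
    "false": "refuted", "mostly false": "refuted",
    "true": "supported", "mostly true": "supported",
    "mixture": "not enough info", "unproven": "not enough info",
    "undetermined": "not enough info",
}

def compress_verdict(snopes_verdict: str):
    """Return the FEVER-style label, or None if the instance is omitted."""
    return SNOPES_TO_FEVER.get(snopes_verdict.lower())

print(compress_verdict("Mostly False"))  # refuted
print(compress_verdict("legend"))        # None (instance omitted)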

7.4.1 Experiments

For the Snopes claim validation task, we consider models of different complexity:

extendedESIM is our extended version of the ESIM that we have developed for the FEVER RTE task (Section 7.3.1).

BertSentEmb is an MLP classifier that is based on BERT (Devlin et al., 2018) pre-trained sentence embeddings.4 This model first derives representations of the claim and the FGE using BERT.

4https://github.com/hanxiao/bert-as-service

The representations of the FGE are reduced to one vector using an attention mechanism based on the claim (FGE more relevant to the claim is weighted higher). The compressed representation of the FGE is then concatenated with the representation of the claim and fed through an MLP for classification.

DecAtt is the Decomposed Attention model presented in Section 5.4.1. This model was used as a baseline in the FEVER shared task.

BiLSTM is a simple BiLSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) architecture. For this model, the individual FGE sentences and the claim are fed through a BiLSTM, giving rise to one representation per sentence. In order to obtain one representation for the FGE, average and max pooling is applied to all representations of the FGE computed by the BiLSTM. The resulting FGE representation is concatenated with the representation of the claim and fed through an MLP for classification.

USE+MLP is the Universal Sentence Encoder (USE) (presented in Section 5.4.1) combined with an MLP. The claim and the FGE, the latter combined into one long sequence, are fed into the USE, giving rise to two representations. These representations are concatenated and fed through an MLP.

featureSVM is a Support Vector Machine classifier based on bag-of-words, unigrams, and topic models.

quantBaseline is a simple quantitative baseline that weighs the number of supporting ETSs against the number of refuting ETSs for a claim. If the number of refuting ETSs is larger than the number of supporting ETSs, the claim is classified as refuted and otherwise as supported. In the case of an exact tie, it predicts not enough info (NEI). A minimal sketch of this decision rule is given below.

The results illustrated in Table 7.5 show that BertSentEmb, USE+MLP, BiLSTM, and extendedESIM reach similar performance. The models DecAtt and featureSVM appear not to be suitable for the task and are even inferior to the quantitative baseline quantBaseline. BertSentEmb performs best; however, the score is still relatively low compared to the model's performance on the FEVER RTE problem. This performance drop is discussed in the following section in more detail.
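The quantBaseline decision rule can be sketched as follows; the stance labels of the ETSs are assumed to be given as simple strings.

from typing import Sequence

def quant_baseline(ets_stances: Sequence[str]) -> str:
    """quantBaseline: compare the number of supporting and refuting ETSs."""
    support = sum(1 for stance in ets_stances if stance == "support")
    refute = sum(1 for stance in ets_stances if stance == "refute")
    if refute > support:
        return "refuted"
    if support > refute:
        return "supported"
    return "not enough info"  # exact tie

print(quant_baseline(["support", "refute", "refute"]))  # refuted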

Labeling method   recall m   prec. m   F1 m
quantBaseline     0.415      0.377     0.395
random baseline   0.333      0.333     0.333
majority vote     0.198      0.170     0.249
BertSentEmb       0.477      0.493     0.485
USE+MLP           0.483      0.468     0.475
BiLSTM            0.456      0.473     0.464
extendedESIM      0.561      0.503     0.454
featureSVM        0.384      0.396     0.390
DecAtt            0.336      0.312     0.324

Table 7.5: Claim validation results (m = macro)


gold \ predicted   supported   refuted   NEI
supported          36          26        13
refuted            38          203       53
NEI                18          42        27

Table 7.6: Confusion matrix for claim validation with BertSentEmb

7.4.2 Error analysis and further investigations

We performed an error analysis for BertSentEmb as the highest-scoring model. The confusion matrix for the predictions of the model is illustrated in Table 7.6. The class distribution of the verdicts for Snopes19 is highly biased towards refuted (false) claims (see Table 3.7 in Section 3.2.2), and therefore, as the confusion matrix shows, claims are frequently labeled as refuted even though they belong to one of the other two classes. We observed that many of the FGE for claims that have been wrongly classified as refuted contain negating phrases, such as “there was no”, “instead of”, “But there was”, “has no”, “must not”. We assume this was the main reason for the misclassifications. Our analysis has also shown that many of the instances are difficult to classify because they contain contradicting FGE. An example of such a case is given below (Example 7.2). Whereas the first two FGE sentences support the claim, the third FGE sentence contradicts it.

Claim: As a teenager, U.S. Secretary of State Colin Powell learned to speak Yiddish while working in a Jewish-owned baby equipment store.

Gold standard: supported; Prediction: refuted

FGE: (1) As a boy whose friends and employers at the furniture store were Jewish, Powell picked up a smattering of Yiddish. (2) He kept working at Sickser’s through his teens, earning 75 cents an hour and picking up a smattering of Yiddish. (3) A spokesman for Mr. Powell said he hadn’t heard about the spoof but confirmed that Gen. Powell does speak a little Yiddish.

Example 7.2: A misclassified instance due to contradicting ETS

The low performance of the model was reflected in the examples analyzed in the error analysis, as it was often difficult to determine why the classifier misclassified a particular instance. As mentioned above, compared to the performance of the highest-scoring models on the FEVER RTE task, which reach almost 0.7 accuracy, the performance of the best models on the Snopes corpus is relatively low. The performance score decreases even further if we consider the claim validation problem not in isolation, but as a second step in the pipeline, that is, if we first identify FGE using our rankingBiLSTM (Section 6.3.1) and then classify the claim on the basis of the collected FGE using BertSentEmb. In this case, the model reaches only 0.44 F1 macro.

126 7.4. EXPERIMENTS ON SNOPES19

verdict \ stance   support   refute   sum
supported          3.19      0.16     3.34
refuted            1.89      0.80     2.69
NEI                2.54      0.83     3.37

Table 7.7: The number of FGE (fine-grained evidence sentences) depending on the stance of the FGE and the verdict of the claim

In order to determine the reason for the low performance, we have conducted a detailed investigation. We performed experiments with the models applied to Snopes19 (listed in Table 7.5) on the FEVER RTE task using the ground-truth evidence sentences. To eliminate the difference in size between the two corpora, we reduced the number of training instances in the FEVER18 corpus to the number of training instances in the Snopes19 corpus and re-ran the experiments. Nevertheless, the best-performing model, BertSentEmb, could still reach a score of 0.54 F1 macro on the FEVER18 corpus, which is significantly higher than the performance of the model on the Snopes19 corpus. After eliminating the difference in size, we analyzed the training instances of the two corpora more closely. Based on our analysis, we came to the conclusion that there are two main reasons for the performance gap:

(1) The Snopes19 corpus is heterogeneous, and thus, it is more challenging for a machine learning model to generalize across different text styles. In fact, we have performed additional experiments in which we pre-trained models on the FEVER corpus and fine-tuned the parameters of the models on our Snopes corpus, and vice versa. In both experiments, no significant performance gains over the original training setup with only one corpus could be achieved. Models pre-trained on FEVER are probably tailored to the Wikipedia text and are not able to generalize to the diverse forms of text in the Snopes corpus. The reverse experiment is probably not successful because the simplified form of the FEVER claim validation problem differs too much from the general claim validation setting defined for the Snopes19 corpus. E.g. as discussed in Section 3.2.2, the structure of claims in FEVER18 is much simpler compared to the claims in Snopes19.

(2) Another reason for the low performance on Snopes19 is that it is difficult to infer the verdict from the given FGE. Although the corpus is biased towards false claims, there is a large number of ETSs that support these false claims (see Table 3.8). As discussed in Section 3.2.2, this is because many of the retrieved ETSs originate from false news websites. Thus, in many cases, we have contradicting evidence, i.e. some of the evidence sentences support the claim and others refute it. In order to analyze this issue in more detail, we computed the distribution of the FGE (the number of fine-grained evidence sentences) depending on the stance of the FGE and the verdict of the claim. The resulting matrix, illustrated in Table 7.7, shows that even though a high number of supporting FGE is a strong indicator for a supported claim (3.19 vs. 0.16 on average), refuted claims are on average supported by more FGE than they are refuted by (1.89 vs. 0.80). This further confirms our observation from Section 3.2.2, where we showed that the stance of the evidence is not a good indicator of the verdict of a claim.
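Such a distribution can be computed directly from per-claim counts, for example with pandas; the records and column names below are hypothetical and only illustrate the grouping that underlies Table 7.7.

import pandas as pd

# Hypothetical per-claim records: number of supporting and refuting FGE
# together with the gold verdict of the claim.
records = pd.DataFrame([
    {"verdict": "supported", "support_fge": 4, "refute_fge": 0},
    {"verdict": "refuted",   "support_fge": 2, "refute_fge": 1},
    {"verdict": "NEI",       "support_fge": 3, "refute_fge": 1},
])

# Average number of supporting/refuting FGE per claim, grouped by verdict.
distribution = records.groupby("verdict")[["support_fge", "refute_fge"]].mean()
distribution["sum"] = distribution.sum(axis=1)
print(distribution)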


7.4.3 Conclusion

As the results of our analyses in the previous sub-section have shown, the claim validation problem defined for the Snopes19 corpus is more challenging than the FEVER shared task RTE problem. For the FEVER RTE problem, systems basically only need to classify the stance of the evidence with respect to the claim to predict the verdict (Section 3.2.2). For Snopes19, the relation between the evidence and the claims is more complicated. The stance in combination with information about the content of the FGE is potentially useful for the prediction of the verdict, but, as our correlation analysis in Section 3.2.2 shows, not in isolation.

Moreover, we believe that the Snopes problem setting is not only more challenging but also more realistic. Since the Snopes corpus is based on real fact-checking instances from the Snopes website, it is more similar to the task that fact-checkers encounter in practice. Thus, in order to make progress on the fact-checking problem, systems should be developed that reach high performance not on synthetic datasets but on real fact-checking instances. Since present approaches reach only low performance on such instances (see Table 7.6), alternative machine learning methods should be explored.

Based on our analysis, we conclude that a system is not able to validate claims in realistic fact-checking instances only on the basis of a small number of evidence sentences, and that additional contextual information is required. Thus, in future work, not only the FGE need to be considered for the validation of the claim, but also the other additional information available in the Snopes19 corpus, such as the stance of the FGE, the FGE sources, and documents from the Snopes website that provide additional information about the claim. Here, the work done by Popat et al. (2017, 2018) can serve as a starting point, as they already performed claim validation experiments in which they considered the sources and the stance of the evidence. Moreover, as Example 7.1 at the beginning of the chapter demonstrates, we need a system that explores the relations between the pieces of evidence. In particular, evidence sentences can refute not only the claim but also other evidence sentences. Thus, cross-references between the evidence should be taken into account in order to find a subset of evidence that forms a consistent narrative supporting or refuting the claim. This extended approach is expected to lead to better performance, as much more information would be considered in the claim validation process than is available in a small set of evidence sentences.

7.5 Chapter summary

In this chapter, we analyzed the claim validation problem, which is the task of predicting the verdict for a claim on the basis of a number of evidence sentences. We presented a model that we developed for the FEVER RTE sub-task (which corresponds to our definition of the claim validation problem). The model reached the third rank in the FEVER shared task among 23 competing systems. In a number of experiments, we analyzed the performance of the model in isolation and as a part of our fact-checking pipeline. In order to identify the weaknesses of our model that can be addressed in future work, we performed an error analysis and identified challenging instances in which the model fails.

We found that the model fails if it needs to determine the verdict of a claim on the basis of numerical values or if world knowledge is required. Moreover, we have discussed the results of the FEVER shared task and, in particular, the outcome of the RTE sub-task. We concluded that even though relatively high scores for the task could be achieved, modern pretrained language models could potentially further increase performance on the task. Nevertheless, we have also remarked that the FEVER shared task problem setting is not realistic enough, as it is based on synthetically generated fact-checking instances that have been created only on the basis of the information in Wikipedia.

In the second part of the chapter, we therefore analyzed the more realistic claim validation problem defined by the Snopes corpus. We performed a large number of experiments on the corpus using models of different levels of complexity. We found that, compared to the results on the FEVER corpus, the models achieve a relatively low performance. In order to identify the reason for the lower scores, we performed an error analysis for the best-scoring model based on BERT embeddings and conducted additional experiments. The error analysis has shown that the model is biased towards the majority class of the Snopes corpus and that the errors are very diverse. Our additional experiments have shown that the low performance of the models can only in part be attributed to the difference in size of the two corpora. A more significant factor is that the Snopes corpus is based on heterogeneous web documents with diverse text styles. We assume that the models struggle to generalize across the different domains of the corpus. Another reason for the low performance is that the evidence for a claim in the Snopes corpus is often contradictory, that is, while some of the evidence sentences support the claim, others refute it. In fact, since much of the evidence comes from unreliable sources, the stance of the evidence is not necessarily indicative of the verdict for a claim, as it is for the FEVER corpus. We therefore concluded that in order to make progress on real fact-checking instances, not just the content and the stance of the evidence sentences need to be taken into account, but also the sources of the evidence and other contextual information available in the Snopes corpus.


Chapter 8

Conclusions and outlook

In this thesis, we explored the problem of automating the fact-checking process using machine learning techniques in order to help to control the growing amount of false information on the web. We proposed a pipeline approach for the validation of claims on the basis of evidence sentences. The pipeline consists of sub-systems for the following tasks: document retrieval, stance detection, evidence extraction, and claim validation. Because we believe that fact-checking is an AI-complete problem, meaning that human-level intelligence is required to solve the task, our pipeline approach is designed to assist fact-checkers in the fact-checking process rather than taking over the task entirely. The proposed pipeline is able to process raw text and thus make use of the vast amount of information contained in web documents. It is therefore superior to fact-checking methods based on knowledge bases that have a relatively low coverage. At the same time, our approach is more transparent than current end-to-end trained machine learning systems, as the fact-checker can observe the outputs of the intermediate sub-systems and thus trace back potential errors. The developed sub-systems of the pipeline have been tested in two competitive shared tasks, namely the Fake News Challenge and the FEVER shared task, and secured top three positions in these competitions.

In this thesis, we analyzed the sub-tasks of the fact-checking process in individual chapters and introduced a new, richly annotated corpus for training machine learning models for the sub-tasks. Below, we summarize the most important findings and contributions of this thesis and answer the research questions defined at the beginning of the thesis. We also discuss the impact of our work on the fact-checking community and highlight promising future research directions.

8.1 Summary of contributions and findings

Chapter 3. We introduced a new, richly annotated corpus for training machine learning systems for the four sub-tasks in the fact-checking process: document retrieval, stance detection, evidence extraction, and claim validation. The corpus is based on the Snopes website and provides annotations from real fact-checking instances. It is therefore superior to synthetic datasets, such as the FEVER shared task dataset, which is only based on Wikipedia. We presented our corpus construction and annotation framework that allows for the efficient development of large corpora with high inter-annotator agreement. We also provided a detailed

analysis of the corpus, discussed the dataset statistics, and described the biases of the corpus. On the basis of our analysis, we found that many evidence sentences come from unreliable sources, which makes it difficult to validate many of the claims in the corpus. Using a correlation analysis, we discovered that there is only a weak correlation between the stance of the evidence and the verdict for the claims. This means that the claim validation problem cannot be solved only using the information about the stance of the evidence. Our analyses showed that although the fact-checking problem defined by our corpus is more realistic, it is more challenging than the fact-checking problem posed by other available corpora such as the FEVER shared task corpus.

Chapter 4. We proposed a document retrieval system based on entity linking and hand-crafted rules that reaches high performance in the FEVER document retrieval problem setting. The system is based on our analysis of the FEVER dataset, wherein we discovered that the entities in the claims often correspond to titles of Wikipedia articles. Thus, it was possible to identify Wikipedia articles by matching the entity mentions in the claim to Wikipedia article titles. Our entity linking approach substantially outperformed traditional information retrieval systems based on the inverted index and TF-IDF ranking.
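The core idea can be illustrated with a short sketch: extract candidate entity mentions from the claim and match them against Wikipedia article titles through the MediaWiki search API. This is only a minimal sketch, not the thesis system; in particular, candidate_mentions is a crude capitalization heuristic standing in for the constituency-parse-based mention extraction, and per_mention is a hypothetical parameter.

```python
import requests

MEDIAWIKI_API = "https://en.wikipedia.org/w/api.php"

def candidate_mentions(claim: str):
    """Crude stand-in for the mention extraction used in the thesis:
    runs of capitalized tokens are taken as candidate entity mentions."""
    mentions, current = [], []
    for token in claim.split():
        if token[:1].isupper():
            current.append(token.strip(".,!?"))
        elif current:
            mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

def retrieve_articles(claim: str, per_mention: int = 7):
    """Match entity mentions against Wikipedia article titles via the
    MediaWiki search API and return the union of the retrieved titles."""
    titles = []
    for mention in candidate_mentions(claim):
        response = requests.get(
            MEDIAWIKI_API,
            params={"action": "query", "list": "search", "srsearch": mention,
                    "srlimit": per_mention, "format": "json"},
            timeout=10,
        )
        for hit in response.json()["query"]["search"]:
            if hit["title"] not in titles:
                titles.append(hit["title"])
    return titles

print(retrieve_articles("Tim Burton directed Beetlejuice."))
```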

Chapter 5. In order to tackle document-level stance detection, we introduced a feature-based deep Multi-Layer Perceptron (MLP). The model was deployed in the Fake News Challenge (FNC), reaching the second rank out of 50 competing systems according to the official FNC metric. In fact, we found that the feature-based MLP is superior to LSTM-based deep neural networks that use word embeddings as features. In a retrospective analysis of the FNC stance detection task, we evaluated the FNC problem setting and discovered that the official FNC metric is biased towards the majority class and can therefore be exploited to maximize the score. F1 macro, by contrast, is a more reliable indicator of the robustness of a system, as it penalizes systems that do not perform well across all classes. We therefore proposed F1 macro as a metric for the evaluation of systems on the FNC corpus. We have shown that, because our feature-based MLP performs well across all classes, it is superior to the other systems on the FNC corpus according to F1 macro. In this chapter, we also performed stance detection experiments on our Snopes corpus and found that in this case, too, feature-based classifiers outperform LSTM-based neural networks using word embeddings.
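To make the difference between the two metrics concrete, the following sketch scores a deliberately degenerate classifier that never predicts the rare agree and disagree classes, both under the weighted FNC scheme (following our reading of the official FNC-1 scorer) and under F1 macro; the toy label lists are invented for illustration.

```python
from sklearn.metrics import f1_score

RELATED = {"agree", "disagree", "discuss"}

def fnc_score(gold, pred):
    """Weighted FNC-1 score: 0.25 for getting the related/unrelated decision
    right, and 0.75 more for predicting the exact related stance."""
    score, best = 0.0, 0.0
    for g, p in zip(gold, pred):
        best += 0.25 + (0.75 if g in RELATED else 0.0)
        if g == p:
            score += 0.25
            if g in RELATED:
                score += 0.50
        if g in RELATED and p in RELATED:
            score += 0.25
    return score / best  # relative score in [0, 1]

gold  = ["unrelated", "unrelated", "discuss", "agree", "disagree", "unrelated"]
naive = ["unrelated", "unrelated", "discuss", "discuss", "discuss", "unrelated"]

# A classifier that ignores the rare classes still looks acceptable under the FNC metric...
print("FNC metric:", round(fnc_score(gold, naive), 3))
# ...but F1 macro penalizes the missing 'agree' and 'disagree' predictions.
print("F1 macro:  ", round(f1_score(gold, naive, average="macro"), 3))
```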

Chapter 6. We introduced an evidence extraction system that is able to identify sentence-level evidence for a given claim in retrieved documents. The system is based on the ESIM (Chen et al., 2017b), which is a powerful encoder developed for the SNLI task (Bowman et al., 2015). We modified ESIM so that sentences from the retrieved documents can be ranked according to their relevance for the validation of the claim. After ranking, we took the top-k sentences as evidence. The evidence ranking system together with our document retrieval system was evaluated in the FEVER evidence selection sub-task. We were able to beat all 23 competing systems and win the shared task in this category.
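The ranking step itself is simple once a relevance scorer is available: score every candidate sentence against the claim, sort, and keep the top k. The sketch below uses TF-IDF cosine similarity only as a cheap stand-in for the modified ESIM scorer, so it illustrates the ranking loop, not the neural model; the example claim and candidates are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_evidence(claim, sentences, k=5):
    """Rank candidate sentences by relevance to the claim and keep the top k."""
    vectorizer = TfidfVectorizer().fit([claim] + sentences)
    claim_vec = vectorizer.transform([claim])
    sent_vecs = vectorizer.transform(sentences)
    scores = cosine_similarity(claim_vec, sent_vecs)[0]
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

claim = "The Eiffel Tower is located in Paris."
candidates = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "It was named after the engineer Gustave Eiffel.",
    "Paris is the capital of France.",
]
for sentence, score in rank_evidence(claim, candidates, k=2):
    print(f"{score:.3f}  {sentence}")
```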


Chapter 7. To address the claim validation problem, we again developed a new model on the basis of the ESIM. We implemented an extended version of the ESIM that allows us to determine the verdict for a claim on the basis of an arbitrary number of evidence sentences. The system was evaluated in the FEVER claim validation sub-task, reaching the third rank. Our combined pipeline consisting of our systems for document retrieval, evidence extraction, and claim validation secured the third rank out of the 23 pipelines participating in the FEVER shared task. We applied our extended ESIM and other promising claim validation models to our newly constructed Snopes corpus and found that these models show substantially lower performance than on the FEVER shared task corpus. Our subsequent analysis showed that the performance gap can be attributed to two major factors: (1) The Snopes corpus is based on heterogeneous web sources. The text styles in the corpus are therefore very diverse, ranging from informal language in discussion forums to carefully written news articles of established news magazines. Because a system trained on the corpus needs to generalize across all these different text styles, the performance is reduced. (2) The evidence for a claim in the Snopes corpus is often contradictory, that is, whereas some of the evidence sentences support the claim, others refute it. In fact, we found that the correlation between the stance of the evidence and the verdict of the claim is very low. Thus, the verdict for a claim cannot simply be deduced from the stance of the evidence sentences, as is the case for the FEVER corpus. It is therefore often difficult even for humans to validate a claim only on the basis of the given evidence sentences.

Based on our analysis, we concluded that the more realistic fact-checking problem defined by our Snopes corpus is very challenging. Even though we can reach reasonable performance for some of the fact-checking sub-tasks, the problem as a whole, and particularly the claim validation problem, is far from being solved. Thus, more elaborate methods are required in order to make progress on this task in future work.
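The aggregation idea behind the extended model, namely combining a variable number of claim-evidence representations into a single verdict prediction, can be pictured with the toy module below. This is not the extended ESIM; the pair features, dimensions, and pooling choice are assumptions made only to show the size-invariant aggregation step.

```python
import torch
import torch.nn as nn

class EvidenceAggregatingClassifier(nn.Module):
    """Toy illustration: encode each claim-evidence pair, pool over the
    evidence dimension, and classify the pooled representation."""
    def __init__(self, pair_dim=300, hidden_dim=128, n_verdicts=3):
        super().__init__()
        self.pair_encoder = nn.Sequential(nn.Linear(pair_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, n_verdicts)

    def forward(self, pair_features):
        # pair_features: (num_evidence, pair_dim), one row per claim-evidence pair
        encoded = self.pair_encoder(pair_features)
        pooled, _ = encoded.max(dim=0)   # max-pooling makes the model size-invariant
        return self.classifier(pooled)

model = EvidenceAggregatingClassifier()
five_evidence_sentences = torch.randn(5, 300)   # hypothetical precomputed pair features
print(model(five_evidence_sentences).shape)     # -> torch.Size([3]) verdict logits
```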

Research questions. Based on our findings, we can now answer the research questions posed in Chapter 2 of this thesis.

• Is it possible to design an annotation framework for the annotation of evidence in documents for a given claim and for the annotation of the stance of the evidence with respect to the claim that leads to high inter-annotator agreement?
⇒ For the annotation of the stance of the Evidence Text Snippets (ETSs), we have been able to reach an inter-annotator agreement of 0.7 Cohen's Kappa, which is considered to be substantial. For the annotation of Fine-Grained Evidence (FGE), that is, evidence on the sentence level, we have been able to achieve a moderate inter-annotator agreement of 0.55 Cohen's Kappa (see the code sketch after this list of research questions). In fact, the annotation of sentences in documents is challenging, and related work often reports lower agreement scores. We therefore conclude that the annotation of the evidence and of the stance of the evidence was successful. Nevertheless, we acknowledge that further research is required in order to achieve higher agreement, in particular for the evidence annotation.


• If we have stance-annotated evidence for a claim (in the form of text snippets that support or refute the claim), is it possible to validate this claim only on the basis of the number of supporting and refuting ETSs, without considering the textual content of the claim and the evidence?
⇒ As our correlation analysis in Section 3.2.2 demonstrates, for a realistic fact-checking corpus with evidence from diverse sources, the correlation between the veracity of the claim and the stance of the evidence is low. It follows that the stance of the evidence is only weakly indicative of the verdict of the claim. Therefore, we conclude that the stance information is not sufficient for the validation of the claim and that additional information is required.

• How well can current machine learning models for fact-checking, which perform well on existing datasets covering a single domain, generalize to multi-domain datasets, i.e., can we achieve high performance for automated fact-checking on a multi-domain dataset if the system is only trained on a single-domain corpus?
⇒ Our analysis in Section 7.4.3 shows that models that perform well on a single-domain dataset, such as the FEVER corpus, do not reach high performance on heterogeneous multi-domain corpora such as our Snopes corpus. Cross-domain experiments have also shown that pre-training a model on a single-domain corpus and then fine-tuning the model on a multi-domain corpus is not helpful, as no performance improvements could be observed. We therefore conclude that large heterogeneous corpora are required in order to train models that are able to generalize across different domains and to validate claims on the basis of evidence with diverse text styles.

• In most of the fact-checking problem settings defined so far, evidence for the validation of a claim is only provided in the form of one or several sentences. Is this information sufficient, or do we need additional contextual information in order to reach high performance on the claim validation task?
⇒ Our experiments in Section 7.4 demonstrate that the performance of the claim validation models, which only take the claim and a number of evidence sentences as input, is relatively low on a realistic multi-domain corpus. Our subsequent analysis has shown that the evidence sentences originate not only from reliable but also from unreliable sources. As a result, the evidence sentences for a claim often contradict each other. Thus, we conclude that in order to reach higher performance for claim validation, information about the source of the evidence is required and the relationship between the evidence sentences needs to be taken into account.
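As referenced in the first research question above, Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance. The snippet below computes it with scikit-learn for two hypothetical annotators of ETS stance labels; the label sequences are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical stance labels from two annotators for the same ten ETSs.
annotator_a = ["agree", "agree", "refute", "no stance", "agree",
               "refute", "no stance", "agree", "refute", "agree"]
annotator_b = ["agree", "agree", "refute", "agree", "agree",
               "refute", "no stance", "agree", "no stance", "agree"]

# Kappa values around 0.6-0.8 are conventionally read as "substantial" agreement;
# for these toy labels the observed agreement is 0.8 and kappa is about 0.67.
print(round(cohen_kappa_score(annotator_a, annotator_b), 2))
```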

8.2 Impact of the contributions

With our work, we have been able to significantly contribute to the fact-checking community. The repository of our stance detection sub-system evaluated in the Fake News Challenge was forked more than 110 times.1 Our pipeline developed for

1https://github.com/hanselowski/athene_system

the FEVER shared task serves as an official baseline for the shared task.2 Fact-checkers from FullFact3 and Factmata4 have requested the code of the pipeline in order to apply the system to real fact-checking instances. Our publications are frequently cited in the context of automated fact-checking; in particular, the two studies (Hanselowski et al., 2018a,b) received much attention in the literature. Both works are frequently cited as related work, but more importantly, our systems were successfully combined with other models or reached a new state of the art on a new corpus. Below we present some of these studies.

Follow-up work based on (Hanselowski et al., 2018a): Our feature-based MLP for the FNC task was shown to reach the best results on a new Arabic document-level stance detection corpus if the model is given the entire document and the most important passages for the classification are highlighted in the document (Baly et al., 2018). The model was able to outperform other top-ranking systems from the FNC, as well as a Memory Network.

Our stacked LSTM developed for the FNC was shown to reach the highest performance on article-claim stance classification for a new Persian corpus, thereby outperforming other feature-based classifiers (Zarharan et al., 2019).

Following our criticism of the FNC metric, a number of studies acknowledged that F1 macro is more appropriate for the evaluation of systems on the FNC corpus than the FNC metric (Conforti et al., 2018; Jwa et al., 2019; Chernyavskiy and Ilvovsky, 2019).

Follow-up work based on (Hanselowski et al., 2018b): Because our document retrieval and evidence ranking systems reached the best results in the FEVER shared task, they have been frequently used by other researchers. The systems were used either to retrieve evidence for a new claim validation model, or as baselines for the development of novel document retrieval and evidence ranking approaches (Soleimani et al., 2019; Zhou et al., 2019; Liu et al., 2019b; Chernyavskiy and Ilvovsky, 2019).

8.3 Future research directions

Below we discuss a number of promising research directions that can be followed in order to improve the performance of the individual fact-checking sub-systems or the pipeline as a whole. Moreover, we discuss a number of popular recent machine learning approaches that can also be leveraged for automated fact-checking.

Improvement of the claim validation sub-system. Claim validation is at the core of the fact-checking process; however, it is also the most challenging sub-task. To improve the claim validation sub-system proposed in this thesis, several changes to the system can be made. The information about the stance and the source of the evidence can be taken as additional input to the system, which could help to

2http://fever.ai/task.html
3https://fullfact.org/
4https://factmata.com/

achieve further performance improvement. Relations and cross-references between the evidence could be explicitly modeled, which would potentially also lead to better performance. In fact, in some fact-checking instances, a number of evidence sentences need to be combined to validate a claim (see, for instance, multi-hop reasoning for question answering (Yang et al., 2018) or solving syllogisms in logical argumentation). Moreover, in order to estimate how reliable the collected evidence sentences are, in addition to the information about the source of the evidence sentences, a second evidence extraction system could be developed that tries to find supporting evidence for a given evidence sentence. All these measures would potentially enable the system to find a consistent and reliable set of evidence that either supports or refutes a given claim.

Self-supervised models for automated fact-checking. Self-supervised approaches such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), XLM (Conneau and Lample, 2019), or XLNet (Yang et al., 2019b) have recently shown that substantial performance improvements across a large number of tasks can be achieved if a model is pretrained in a self-supervised manner on a massive dataset and then fine-tuned on a small corpus for a particular target task. So far, we have only explored how a multi-layer perceptron, which is based on pretrained representations of BERT, performs for claim validation without fine-tuning BERT on the fact-checking corpora. Some further work was done by Yang et al. (2019a), who explored the GPT model (Radford et al., 2018) for rumor evaluation, and by Ning et al. (2019), who applied BERT to hyper-partisan news detection. More experiments are required in order to find out which of the pretrained models is best suited for the different fact-checking sub-tasks. Moreover, an entirely new self-supervised technique could be developed that is particularly suitable for automated fact-checking. Current self-supervised models are based on different training objectives, such as predicting the next word (Peters et al., 2018; Radford et al., 2018, 2019), next sentence prediction (Devlin et al., 2018), predicting masked words (Devlin et al., 2018), or word permutations (Yang et al., 2019b). One training objective (or auxiliary task) often benefits a particular target task that is framed in a similar manner. It would therefore also be worth exploring whether auxiliary tasks for the different fact-checking sub-tasks can be found that could be used to pretrain models for these tasks.
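A natural next step along these lines would be to fine-tune BERT end-to-end on claim-evidence pairs rather than using its frozen representations. The sketch below shows what such a fine-tuning loop could look like with the HuggingFace transformers library; the label scheme, the examples, and the choice of bert-base-uncased are assumptions, and a real setup would add batching, evaluation, and multiple epochs.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Claim/evidence pairs with hypothetical verdict labels (0 = refuted, 1 = supported, 2 = NEI).
pairs = [
    ("The moon landing happened in 1969.",
     "Apollo 11 landed on the Moon on July 20, 1969.", 1),
    ("Vaccines cause autism.",
     "Large studies have found no link between vaccines and autism.", 0),
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for claim, evidence, label in pairs:
    # BERT encodes the claim and the evidence jointly as a sentence pair.
    batch = tokenizer(claim, evidence, truncation=True, return_tensors="pt")
    output = model(**batch, labels=torch.tensor([label]))
    output.loss.backward()   # fine-tunes all BERT layers, not just a classifier on top
    optimizer.step()
    optimizer.zero_grad()
```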

Multitask learning for automated fact-checking. Even though we have considered the four fact-checking sub-tasks independently of each other, there are synergies between the tasks that can be exploited if a joint model is trained in a multitask setting. A model simultaneously trained to select evidence sentences, determine their stance with respect to the claim, and predict the verdict for the claim could benefit from the supervision at different levels and thus achieve higher performance. In fact, work done by Popat et al. (2017); Yin and Roth (2018); Li et al. (2019) already suggests that multitask learning for automated fact-checking can be beneficial. Self-supervised pretrained models can be fine-tuned in a multitask setting, yielding further performance improvements (see, for instance, MT-DNN (Liu et al., 2019a) and ERNIE 2.05). This framework can also be explored for automated fact-checking,

5http://research.baidu.com/Blog/index-view?id=121

where a model is first pretrained on an auxiliary task and then simultaneously fine-tuned on several fact-checking sub-tasks.
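A minimal way to picture such a multitask setup is a single shared encoder with one classification head per sub-task, trained on the sum of the per-task losses. The toy PyTorch module below illustrates only this weight-sharing pattern; the input features, dimensions, and label sets are invented and do not correspond to any model evaluated in this thesis.

```python
import torch
import torch.nn as nn

class MultitaskFactChecker(nn.Module):
    """Toy multitask model: one shared encoder, one head per fact-checking sub-task."""
    def __init__(self, input_dim=768, hidden_dim=256, n_stances=3, n_verdicts=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.stance_head = nn.Linear(hidden_dim, n_stances)    # stance of the evidence
        self.verdict_head = nn.Linear(hidden_dim, n_verdicts)  # verdict for the claim

    def forward(self, pair_embedding):
        shared = self.encoder(pair_embedding)
        return self.stance_head(shared), self.verdict_head(shared)

model = MultitaskFactChecker()
loss_fn = nn.CrossEntropyLoss()

# One hypothetical claim-evidence pair embedding with gold labels for both tasks.
pair_embedding = torch.randn(1, 768)
stance_gold, verdict_gold = torch.tensor([0]), torch.tensor([1])

stance_logits, verdict_logits = model(pair_embedding)
# The joint loss lets supervision from both sub-tasks shape the shared encoder.
loss = loss_fn(stance_logits, stance_gold) + loss_fn(verdict_logits, verdict_gold)
loss.backward()
```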

Multimodal automated fact-checking. In this thesis, we have restricted our scope to textual data only, that is, we have only considered the case where the claims and the evidence are presented as text. Nevertheless, evidence or misleading content often comes in the form of an image (fauxtography), a video (deep fakes), an audio signal (fake speech), or multiple modalities at once. It would therefore be promising to develop a system that can process information from different modalities and validate published content based on all signals in combination. For this purpose, recently introduced multi-modal self-supervised systems proposed by Sun et al. (2019) and Lu et al. (2019) could be explored.

Detecting automatically generated fake textual content. Recent advances in self-supervised learning have led to significant progress in natural language generation. Deep neural networks pretrained in a self-supervised manner can enable bad actors to generate fake textual content in large quantities. Given only a headline as a prompt, a pretrained generative model is able to generate a coherent piece of text (Radford et al., 2019). However, as shown by Zellers et al. (2019), pretrained models can be used not only to generate fake content but are also able to reliably detect artificially generated text. These are the first promising results indicating that the flood of false information generated by pretrained generative models can be controlled. Nevertheless, additional follow-up research is required. In particular, it needs to be investigated whether artificially generated texts produced by different types of pretrained generative models can be reliably identified by a single pretrained detector model or whether a collection of detector models is required.


Appendix A

Appendix

A.1 FNC features: detailed description

BoW/BoC features We use bag-of-words (BoW) 1- and 2-grams with a 5,000-token vocabulary for the headline as well as for the document. For the BoW feature, based on a technique by Das and Chen (2007), we add a negation tag "_NEG" as prefix to every word between special negation keywords (e.g. "not", "never", "no") until the next punctuation mark appears. For the bag-of-characters (BoC) feature, 3-grams are chosen, also with a 5,000-token vocabulary. For the BoW/BoC features we use the TF to extract the vocabulary and to build the feature vectors of headline and document. The resulting TF vectors of headline and document are concatenated afterwards. The co-occurrence feature (FNC-1 baseline feature) counts how many times word 1-/2-/4-grams, character 2-/4-/8-/16-grams, and stop words of the headline appear in the first 100 characters, in the first 255 characters, and in the document overall.

Topic models We use non-negative matrix factorization (NMF) (Lin, 2007), latent semantic indexing (LSI) (Deerwester et al., 1990), and latent Dirichlet allocation (LDA) (Blei et al., 2001) to create topic models out of which we create independent features. For each topic model, we extract 300 topics out of the headline and document texts. Afterwards, we compute the similarity of headlines and bodies to the found topics separately and either concatenate the feature vectors (NMF, LSI) or calculate the cosine distance between them as a single-valued feature (NMF, LDA).

Lexicon-based features These features are based on the NRC Hashtag Sentiment and Sentiment140 lexicons (Kiritchenko et al., 2014; Mohammad et al., 2013; Zhu et al., 2014), as well as the MPQA lexicon (Wilson et al., 2005) and the MaxDiff Twitter lexicon (Rosenthal et al., 2015; Kiritchenko et al., 2014). All named lexicons hold values that signal the sentiment/polarity of each word. The features are computed separately for headline and document, and constructed as proposed by Mohammad et al. (2013): First, we count how many words with positive, negative, and without polarity are found in the text. Two features sum up the positive and negative polarity values of the words in the texts, and another two features are set by finding the word with the maximum


positive and negative polarity value in the text. Finally, the last word in the text with negative or positive polarity is taken as a feature. Since the MaxDiff Twitter lexicon also contains 2-grams, we decided to take them into account as well, whereas for the other lexicons only 1-grams are incorporated. Additionally, we base features on the EmoLex lexicon (Mohammad and Turney, 2010, 2013). For all its words, it holds up to eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, disgust), based on the context they frequently appear in. For headline and document respectively, the emotions of all words are counted as a feature vector. The resulting vectors for headline and document are then concatenated. Lastly, the baseline features polarity words and refuting words are added. The first one counts refuting words (e.g. "fake", "hoax"), divides the sum by two, and takes the remainder as a feature signaling the polarity of the headline or document. The latter one sets a binary feature for each refuting word (e.g. "fraud", "deny") appearing in the headline or document.
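The negation tagging used for the BoW features above can be sketched in a few lines: every token between a negation keyword and the next punctuation mark receives a "_NEG" prefix. This is only a minimal sketch; the tokenizer and the exact keyword list are simplifications, not the implementation used in the system.

```python
import re

NEGATION_WORDS = {"not", "never", "no", "n't"}

def add_negation_tags(text):
    """Prefix every token between a negation keyword and the next punctuation
    mark with "_NEG", in the spirit of the technique of Das and Chen (2007)."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    tagged, in_scope = [], False
    for token in tokens:
        if token in ".,!?;":
            in_scope = False
            tagged.append(token)
        elif token in NEGATION_WORDS:
            in_scope = True
            tagged.append(token)
        else:
            tagged.append("_NEG" + token if in_scope else token)
    return " ".join(tagged)

print(add_negation_tags("The report is not accurate or reliable, experts say."))
# -> "the report is not _NEGaccurate _NEGor _NEGreliable , experts say ."
```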

Readability features We measure the readability of headline and document with SMOG grade (only document), Flesch-Kincaid grade level, Flesch reading ease, and Gunning fog index (Štajner et al., 2012), Coleman-Liau index (Mari and Ta Lin, 1975), automated readability index (Senter and Smith, 1967), LIX and RIX (Jonathan, 1983), McAlpine EFLAW Readability Score (McAlpine, 1997), Strain Index (Solomon, 2006). The SMOG grade is only valid if a text has at least 30 sentences, and thus is only implemented for the bodies.
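As an example of the readability features, the Flesch-Kincaid grade level combines the average sentence length with the average number of syllables per word using the standard published constants. The sketch below approximates syllables by counting vowel groups, so the resulting scores are only rough estimates and the sample sentence is invented.

```python
import re

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    # Crude syllable heuristic: count groups of consecutive vowels per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

sample = "The committee approved the controversial legislation. Critics disagreed."
print(round(flesch_kincaid_grade(sample), 1))
```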

Lexical features As lexical features we implement the type-token-ratio (TTR) and the measure of textual lexical diversity (MTLD) (McCarthy, 2005) for the document, and only type-token-ratio for the headline, since MTLD needs at least 50 tokens to be valid. Also, the baseline feature word overlap belongs to this group. It divides the cardinality of the intersection of unique words in headline and document by the cardinality of the union of unique words in headline and document.
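The type-token ratio and the word-overlap baseline feature follow directly from their definitions above; a minimal sketch with an invented headline and body:

```python
def type_token_ratio(tokens):
    """TTR: number of distinct tokens divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

def word_overlap(headline_tokens, body_tokens):
    """Baseline word-overlap feature: |intersection| / |union| of unique words."""
    headline, body = set(headline_tokens), set(body_tokens)
    return len(headline & body) / len(headline | body)

headline = "robert plant ditches led zeppelin reunion offer".split()
body = ("robert plant reportedly tore up an 800 million dollar offer for a "
        "led zeppelin reunion because plant was not interested in a reunion").split()

print(round(type_token_ratio(body), 2))
print(round(word_overlap(headline, body), 2))
```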

POS features The POS features, among others, include counters for nouns, personal pronouns, verbs and verbs in past tense, adverbs, nouns and proper nouns, cardinal numbers, punctuation, the ratio of quoted words, and also the frequency of the three least common words in the text. The headline feature also contains a value for the percentage of stop words and the number of verb phrases, which showed good results in the work of Horne and Adali (2017). For the word-similarity features, which are mainly based on Ferreira and Vlachos (2016), we calculated average word embeddings (pre-trained word2vec model1) for all verbs (retrieved with the Stanford CoreNLP toolkit2) of headline and document separately. The cosine similarity between the averaged embeddings of headline and document is taken as a feature, as well as the Hungarian distance between verbs of headline and document based on the paraphrase database3. The same computation is done for all nouns of headline and document. Additionally, the average sentiment of the headline and the average sentiment of

1https://code.google.com/archive/p/word2vec/
2https://stanfordnlp.github.io/CoreNLP/
3http://www.cis.upenn.edu/~ccb/ppdb/


the document is used as a feature. A count of negating words in the headline and the document is added to the feature vector, as well as the distance from the negated word to the root of the sentence. The average number of words per sentence of headline and document is another feature. The aforementioned features are improved by only selecting a predefined number of sentences of the document and headline. For this purpose, the sentences are ordered by their TF-IDF score.

Structural features The structural features contain the average word length of the headline and document, and the number of paragraphs and the average paragraph length of the document.

A.2 Snopes annotation interface and annotation guidelines

A.2.1 Annotation guidelines for the annotation of the stance of the ETSs

Task: You are given claims and articles (ETSs). Your task is to determine whether the articles are supporting or refuting the claims. If an article is off-topic or does not explicitly express a stance towards the claim, the third option must be selected: No explicit reference to the claim

Further remarks: The stance must be determined only on the basis of the given article (ETS), that is, don't make the decision on the basis of your own knowledge about the claim. Also note that many articles are about the topic of the claim but do not express a stance. Thus, select the option No explicit reference to the claim in these cases.

Pictures of the annotation interface are given in Figures A.1 and A.2. Whereas the first picture illustrates the entire annotation interface, the second displays a magnified instance for the annotation. As the figures show, in addition to the annotation guidelines, we provide a number of examples in order to help annotators understand the task. As the above annotation guidelines indicate, the stance of the ETS (article) towards the claim needs to be labeled as agree or disagree, and in case the ETS is off-topic or does not explicitly express a stance towards the claim, a third label is given (No explicit reference to the claim). For the sake of simplicity, we refer to this label as no stance henceforth.

A.2.2 Annotation guidelines for the annotation of FGE in the ETSs

Find supporting or refuting sentences for a claim:

• You are given a claim and an article (ETS), and you are asked to annotate supporting or refuting sentences (FGE) in the article (ETS).


Figure A.1: Annotation interface for the annotation of the stance of the ETSs (entire annotation interface)

Figure A.2: Annotation interface for the annotation of the stance of the ETSs (a magnified instance)


• Annotate sentences:
  – if they restate (support) the claim
  – if they directly refute the claim

• If the information that refutes or supports the claim is distributed over several sentences, all these sentences must be selected.

• If you are asked to find supporting sentences and cannot find any, select the option: “The article does not contain any supporting sentences.” Then try to find and annotate refuting sentences:

  – if you have found and annotated refuting sentences, select the option: “The article contains the selected refuting sentences”
  – if you could not find any refuting sentences, select the option: “The article does not contain any supporting nor refuting sentences”

• If you are asked to find refuting sentences and cannot find any, you need to try to find supporting sentences.

As the last two bullet points of the annotation guidelines describe, the annotation interface allows users to correct a wrong stance label, which was given in the first step of the annotation, and then select sentences with the corrected stance. Even though the annotation of the stance is of good quality and a wrong stance was assigned relatively rarely (see Section 3.2.2), we wanted to further improve the quality of the corpus by providing this option. The annotation interface is displayed in Figures A.3, A.4, and A.5. Figure A.3 gives an overview of the annotation interface, and the two other figures illustrate the two parts of the annotation interface: an instance to be annotated and the annotation guidelines. As illustrated in Figure A.4, the user can select and unselect sets of sentences in ETSs. Moreover, as shown in Figure A.5, the annotation interface provides an example for each bullet point, which can be accessed by pushing the Example button.


Figure A.3: Annotation interface for the annotation of FGE in ETSs (entire annotation interface)

Figure A.4: Annotation interface for the annotation of FGE in ETSs (annotation of sentence sets)


Figure A.5: Annotation interface for the annotation of FGE in ETSs (an opened example presented for a particular case, i.e. the Example button was activated for this case)


List of Figures

1.1 Taxonomy of the different kinds of false information ... 5
1.2 Information consumption process and its associated risks, such as our biases and reinforcing feedback loops ... 7
1.3 The fact-checking process ... 12

2.1 Overview of the contributions ... 28
2.2 Our pipeline approach to automated fact-checking ... 31

3.1 Source fact-checking instance from the Snopes website ... 47

4.1 Entity linking for the FEVER shared task document retrieval problem (source of figure: (Zhang, 2018)) ... 64

5.1 TalosTree (source: (Sean et al., 2017)) ... 76
5.2 TalosCNN (source: (Sean et al., 2017)) ... 76
5.3 Talos Intelligence combined model (TalosComb) (source: (Sean et al., 2017)) ... 76
5.4 Our multi-layer perceptron (Athene) ... 77
5.5 Multi-layer perceptron introduced by Riedel et al. (2017) (UCLMR) (source: (Riedel et al., 2017)) ... 78
5.6 FNC metric for the evaluation of systems in the FNC (source: www.fakenewschallenge.org) ... 79
5.7 Performance of the system based on individual features (* indicates the best performing features) ... 83
5.8 Model architecture of the feature-rich stackLSTM ... 85
5.9 Universal sentence encoder deep average network (source: (Cer et al., 2018)) ... 89
5.10 Universal Sentence Encoder with Attention (USEAtt) (source: (Li, 2018)) ... 90
5.11 Decomposable attention model (source: (Parikh et al., 2016)) ... 91

6.1 Siamese stacked BiLSTM model (BiLSTM) ... 103
6.2 Enhanced Sequential Inference Model (ESIM) (source: (Chen et al., 2017b)) ... 104
6.3 Training configuration of the evidence extraction model ... 105

7.1 Extension of the ESIM for the FEVER RTE task ... 121

A.1 Annotation interface for the annotation of the stance of the ETSs (entire annotation interface) ... 142


A.2 Annotation interface for the annotation of the stance of the ETSs (a magnified instance) ... 142
A.3 Annotation interface for the annotation of FGE in ETSs (entire annotation interface) ... 144
A.4 Annotation interface for the annotation of FGE in ETSs (annotation of sentence sets) ... 144
A.5 Annotation interface for the annotation of FGE in ETSs (an opened example presented for a particular case, i.e. the Example button was activated for this case) ... 145

List of Tables

1.1 Overview of external fact-checking platforms (explicit verdict: the claim is labeled with a verdict (true, false, ...); explicit evidence: the evidence is highlighted in text) ... 11

3.1 Overview of corpora for automated fact-checking. docs: documents related to the claims; evid.: evidence in the form of sentences or text snippets; stance: stance of the evidence; sourc.: sources of the evidence (opinion holder or the URL of the web document from which the evidence was extracted); agr.: whether or not the inter-annotator agreement is reported; domain: the genre of the corpus ... 38
3.2 Corpus statistics and label distribution of the FEVER18 corpus ... 40
3.3 Corpus statistics and label distribution for the FNC dataset (“docs.” refers to article bodies in the FNC corpus; “tokens” refers to the average number of tokens in all documents in the corpus; “instances” are labeled document-headline pairs) ... 42
3.4 An example from the original ARC dataset and a generated instance which corresponds to instances in the FNC dataset (FNC2017) ... 44
3.5 Corpus statistics and label distribution for the ARC dataset (“documents” refers to article bodies in the FNC corpus; “tokens” refers to the average number of tokens in all documents in a corpus; “instances” are labeled document-headline pairs) ... 45
3.6 Overall statistics of the Snopes corpus (FGE-set: a set of annotated sentences (FGE) in the same ETS; ODCs: web documents that have been linked from the Snopes web pages) ... 49
3.7 Distribution of verdicts for claims on the Snopes website ... 49
3.8 Stance distribution for ETSs and the FGE-sets (FGE from the same ETS) ... 50

4.1 The influence of the selection of different entity mention candidates on the performance of the document retrieval system on the FEVER18 development set (7 highest ranked articles considered) ... 66
4.2 Performance of our retrieval system when using different numbers of search results on the FEVER18 development set ... 66
4.3 Results of the compared entity linking approaches on the FEVER18 development set ... 67

5.1 FNC metric, F1 macro (F1m), and class-wise F1 scores for the analyzed models ... 79


5.2 Results of the feature ablation test. Baseline FNC uses a gradient boosting classifier with all FNC baseline features. * states that only the pre-selected features are used (see Figure 5.7). (agr. = agree, dsg. = disagree, dsc. = discuss, unr. = unrelated, maj. = majority vote, FNC = FNC baseline, F1m = F1 macro) ... 84
5.3 FNC metric, F1 macro, and class-wise F1 scores for the analyzed models on in-domain experiments on ARC2017 ... 86
5.4 FNC metric, F1 macro, and class-wise F1 scores based on cross-domain experiments FNC2017-ARC2017 ... 87
5.5 FNC metric, F1 macro, and class-wise F1 scores based on cross-domain experiments ARC2017-FNC2017 ... 87
5.6 Stance detection models and baselines (F1m = F1 macro, upperBound = human upper bound) ... 93
5.7 Stance confusion matrix for FeatMLP ... 93
5.8 Different types of errors in stance detection ... 94

6.1 Comparison of different models in the evidence ranking problem setting of the FEVER18 corpus ... 107
6.2 Performance of our document retrieval system and rankingESIM when using different numbers of MediaWiki search results ... 107
6.3 Performance of the rankingESIM on the FEVER18 corpus when different numbers of sentences are selected ... 108
6.4 Results of the evidence ranking part of the FEVER shared task ... 108
6.5 Baselines and the human upper bound for the evidence extraction task of the Snopes19 corpus, if the problem is considered as a classification task (m = macro) ... 109
6.6 Comparison of different models on the evidence ranking problem for the Snopes19 corpus ... 109

7.1 Performance of our claim validation model using different numbers of sentences ... 122
7.2 Comparison of the performance of our pipeline and the baseline pipeline (Thorne et al., 2018a) on the FEVER18 development set ... 123
7.3 Top 5 systems of the FEVER shared task ... 123
7.4 Compression of the Snopes verdicts to FEVER verdicts (NEI: not enough information) ... 124
7.5 Claim validation results (m = macro) ... 125
7.6 Confusion matrix for claim validation BertSentEmb ... 126
7.7 The number of FGE (fine-grained evidence sentences) depending on the stance of the FGE and the verdict of the claim ... 127

Bibliography

Ehud Aharoni, Anatoly Polnarov, Tamar Lavee, Daniel Hershcovich, Ran Levy, Ruty Rinott, Dan Gutfreund, and Noam Slonim. 2014. A benchmark dataset for automatic detection of claims and evidence in the context of controversial topics. In Proceedings of the First Workshop on Argumentation Mining, pages 64–68.

Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of economic perspectives, 31(2):211–36.

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596.

Pepa Atanasova, Lluís Màrquez, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Wajdi Zaghouani, Spas Kyuchukov, Giovanni Da San Martino, and Preslav Nakov. 2018. Overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims, task 1: Check-worthiness. In CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Avignon, France. CEUR-WS.org.

Katie Atkinson, Pietro Baroni, Massimiliano Giacomin, Anthony Hunter, Henry Prakken, Chris Reed, Guillermo Simari, Matthias Thimm, and Serena Villata. 2017. Towards artificial argumentation. AI magazine, 38(3):25–36.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer.

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance detection with bidirectional conditional encoding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 876–885, Austin, TX, USA.

Ricardo Baeza-Yates, Berthier de Araújo Neto Ribeiro, et al. 2011. Modern information retrieval. New York: ACM Press; Harlow, England: Addison-Wesley.

Eytan Bakshy, Solomon Messing, and Lada A Adamic. 2015. Exposure to ideologically diverse news and opinion on facebook. Science, 348(6239):1130–1132.


Ramy Baly, Mitra Mohtarami, James Glass, Lluís Màrquez, Alessandro Moschitti, and Preslav Nakov. 2018. Integrating stance detection and fact checking in a unified corpus. In Proceedings of NAACL-HLT, pages 21–27.

Hannah Bast, Björn Buchhold, Elmar Haussmann, et al. 2016. Semantic search on text and knowledge bases. Foundations and Trends in Information Retrieval, 10(2-3):119–271.

Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.

Christian Bizer, Tom Heath, and Tim Berners-Lee. 2011. Linked data: The story so far. In Semantic services, interoperability and web applications: emerging concepts, pages 205–227. IGI Global.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2001. Latent dirichlet allocation. In Advances in Neural Information Processing Systems 14 (NIPS), pages 601–608, Vancouver, BC, Canada.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM.

Peter Bourgonje, Julian Moreno Schneider, and Georg Rehm. 2017. From Clickbait to Fake News Detection: An Approach based on Detecting the Stance of Headlines to Articles. In Proceedings of the EMNLP 2017 Workshop ‘Natural Language Processing meets Journalism’, pages 84–89, Copenhagen, Denmark.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.

Andrea Braithwaite. 2016. It’s about ethics in games journalism? Gamergaters and geek masculinity. Social Media+ Society, 2(4):2056305116672484.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (System Demonstrations).


Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017a. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017b. Enhanced lstm for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

Yimin Chen, Niall J Conroy, and Victoria L Rubin. 2015. Misleading online content: Recognizing clickbait as false news. In Proceedings of the 2015 ACM on Workshop on Multimodal Deception Detection, pages 15–19. ACM.

Anton Chernyavskiy and Dmitry Ilvovsky. 2019. Extract and aggregate: A novel domain-independent approach to factual data verification. In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 69–78.

Giovanni Luca Ciampaglia, Prashant Shiralkar, Luis M Rocha, Johan Bollen, Filippo Menczer, and Alessandro Flammini. 2015. Computational fact checking from knowledge networks. PloS one, 10(6).

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological bulletin, 70(4):213.

Costanza Conforti, Mohammad Taher Pilehvar, and Nigel Collier. 2018. Towards automatic fake news detection: Cross-level stance detection in news articles. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 40–49.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.

Niall J Conroy, Victoria L Rubin, and Yimin Chen. 2015. Automatic deception detection: Methods for finding fake news. Proceedings of the Association for Information Science and Technology, 52(1):1–4.

Silviu Cucerzan. 2007. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 708–716.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2005. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.


Sanjiv R. Das and Mike Y. Chen. 2007. Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web. Management Science, 53(9):1375–1388.

Richard Davis and Chris Proctor. 2017. Fake News, Real Consequences: Recruiting Neural Networks for the Fight Against Fake News. Online: http://web.stanford.edu/class/cs224n/reports/2761239.pdf. Accessed: 2018-03-16.

Johannes Daxenberger, Steffen Eger, Ivan Habernal, Christian Stab, and Iryna Gurevych. 2017. What is the essence of a claim? Cross-domain claim identification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2055–2066.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391.

Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours. Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT).

Jiachen Du, Ruifeng Xu, Yulan He, and Lin Gui. 2017. Stance classification with target-specific neural attention. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pages 3988–3994, Melbourne, Australia.

Sebastian Dungs, Ahmet Aker, Norbert Fuhr, and Kalina Bontcheva. 2018. Can rumour stance alone predict veracity? In Proceedings of the 27th International Conference on Computational Linguistics, pages 3360–3370.

Steffen Eger, Johannes Daxenberger, and Iryna Gurevych. 2017. Neural end-to-end learning for computational argumentation mining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11–22.

Matthias Eickhoff and Nicole Neuss. 2017. Topic modelling methodology: its use in information systems and other managerial disciplines. In Proceedings of the 25th European Conference on Information Systems.

Mehrdad Farajtabar, Jiachen Yang, Xiaojing Ye, Huan Xu, Rakshit Trivedi, Elias Khalil, Shuang Li, Le Song, and Hongyuan Zha. 2017. Fake news mitigation via point process based intervention. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1097–1106. JMLR. org.


Adam Faulkner. 2014. Automated classification of stance in student essays: An approach using stance target information and the Wikipedia link-based measure. In Proceedings of the Twenty-Seventh International Florida Artificial Intelligence Research Society Conference, FLAIRS 2014, Pensacola Beach, Florida, May 21- 23, 2014.

Minwei Feng, Bing Xiang, Michael R Glass, Lidan Wang, and Bowen Zhou. 2015. Applying deep learning to answer selection: A study and an open task. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 813–820. IEEE.

William Ferreira and Andreas Vlachos. 2016. Emergent: A novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), pages 1163–1168, San Diego, CA, USA.

Seth Flaxman, Sharad Goel, and Justin M Rao. 2013. Ideological segregation and the effects of social media on news consumption. Available at SSRN, 2363701.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378.

Valery I Frants, Jacob Shapiro, Isak Taksa, and Vladimir G Voiskunskii. 1999. Boolean search: Current state and perspectives. Journal of the American Society for Information Science, 50(1):86–95.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.

Daniel Gerber, Diego Esteves, Jens Lehmann, Lorenz Bühmann, Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, and René Speck. 2015. Defacto—temporal and multilingual deep fact validation. Web Semantics: Science, Services and Agents on the World Wide Web, 35:85–101.

Jennifer Golbeck. 2008. Computing with social trust. Springer Science & Business Media.

Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The definitive guide: A distributed real-time search and analytics engine. O’Reilly Media, Inc.

Jeffrey Gottfried and Elisa Shearer. 2016. News Use Across Social Media Platforms 2016. Pew Research Center.

Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE international conference on acoustics, speech and signal processing, pages 6645–6649. IEEE.

Anthony G Greenwald and Mahzarin R Banaji. 1995. Implicit social cognition: attitudes, self-esteem, and stereotypes. Psychological review, 102(1):4.


Ramanathan Guha, Rob McCool, and Eric Miller. 2003. Semantic search. In Proceedings of the 12th international conference on World Wide Web, pages 700–709. ACM.

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 55–64. ACM.

Stephen Guo, Ming-Wei Chang, and Emre Kiciman. 2013. To link or not to link? A study on end-to-end tweet entity linking. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1020–1030.

Iryna Gurevych, Eduard H Hovy, Noam Slonim, and Benno Stein. 2016. Debating technologies (dagstuhl seminar 15512). In Dagstuhl reports, volume 5. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

Ivan Habernal and Iryna Gurevych. 2017. Argumentation mining in user-generated web discourse. Computational Linguistics, 43(1):125–179.

Ivan Habernal, Patrick Pauli, and Iryna Gurevych. 2018a. Adapting serious game for fallacious argumentation to german: Pitfalls, insights, and best practices. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan. European Language Resource Association.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018b. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), pages 1930–1940, New Orleans, LA, USA.

Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018c. Before name-calling: Dynamics and triggers of ad hominem fallacies in web argumentation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 386–396, New Orleans, Louisiana. Association for Computational Linguistics.

Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M Meyer, and Iryna Gurevych. 2018a. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1859–1874.

Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, and Felix Caspelherr. 2017. Description of the system developed by team Athene in the FNC-1, 2017. Online: https://github.com/hanselowski/athene_system/blob/master/system_description_athene.pdf. Accessed: 2018-03-13.

Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. 2019. A richly annotated corpus for different tasks in automated fact-checking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 493–503.

Andreas Hanselowski, Hao Zhang, Zile Li, Daniil Sorokin, Benjamin Schiller, Claudia Schulz, and Iryna Gurevych. 2018b. UKP-Athene: Multi-sentence textual entailment for claim verification. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 103–108.

Kazi Saidul Hasan and Vincent Ng. 2013. Stance Classification of Ideological Debates: Data, Models, Features, and Constraints. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP), pages 1348–1356, Nagoya, Japan.

Naeemul Hassan, Bill Adair, James T Hamilton, Chengkai Li, Mark Tremayne, Jun Yang, and Cong Yu. 2015. The quest to automate fact-checking.

Naeemul Hassan, Gensheng Zhang, Fatma Arslan, Josue Caraballo, Damian Jimenez, Siddhant Gawsane, Shohedul Hasan, Minumol Joseph, Aaditya Kulkarni, Anil Kumar Nayak, et al. 2017. Claimbuster: The first-ever end-to-end fact-checking system. Proceedings of the VLDB Endowment, 10(12):1945–1948.

Michiel Hermans and Benjamin Schrauwen. 2013. Training and analysing deep recurrent neural networks. In Advances in neural information processing systems 26 (NIPS), pages 190–198, Stateline, NV, USA.

Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural computation, 9(8):1735–1780.

Benjamin D. Horne and Sibel Adali. 2017. This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News. In Proceedings of the ICWSM 2017 Workshop on News and Public Opinion, pages 759–766, Montréal, QC, Canada.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. 2013. Learning Whom to Trust with MACE. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), pages 1120–1130, Atlanta, GA, USA.

Lee Howell et al. 2013. Digital wildfires in a hyperconnected world. WEF Report, 3:15–94.

Xinyu Hua and Lu Wang. 2017. Understanding and detecting supporting arguments of diverse types. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 203–208.


Jiang Huiping. 2010. Information retrieval and the semantic web. In 2010 International Conference on Educational and Information Technology, volume 3, pages V3–461. IEEE.

Anderson Jonathan. 1983. Lix and Rix: Variations on a Little-known Readability Index. Journal of Reading, 26(6):490–496.

Garth Jowett and Victoria O'Donnell. 2006. What is propaganda, and how does it differ from persuasion? Propaganda and Misinformation, Chapter 1.

Heejung Jwa, Dongsuk Oh, Kinam Park, Jang Mook Kang, and Heuiseok Lim. 2019. exbake: Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT). Applied Sciences, 9(19):4062.

J. Peter Kincaid, Robert P. Fishburne Jr, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Research Branch Report 8-75, Naval Technical Training Command, Millington, TN, USA.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50:723–762.

David Klein and Joshua Wueller. 2017. Fake news: a legal perspective. Journal of Internet Law (Apr. 2017).

Lev Konstantinovskiy, Oliver Price, Mevan Babakar, and Arkaitz Zubiaga. 2018. Towards automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection. In Proceedings of the 1st Workshop on Fact Extraction and VERification (FEVER).

Bill Kovach and Tom Rosenstiel. 2014. The elements of journalism: What newspeople should know and the public should expect. Three Rivers Press (CA).

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.

Ch. Aswani Kumar, M Radvansky, and J Annapurna. 2012. Analysis of a vector space model, latent semantic indexing and formal concept analysis for information retrieval. Cybernetics and Information Technologies, 12(1):34–48.

John Lawrence and Chris Reed. 2020. Argument mining: A survey. Computational Linguistics.

Neil Levy. 2017a. The bad news about fake news. Social Epistemology Review and Reply Collective, 6(8):20–36.

Neil Levy. 2017b. Nudges in a post-truth world. Journal of medical ethics, 43(8):495– 500.


Ran Levy, Yonatan Bilu, Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1489–1500.

Quanzhi Li, Qiong Zhang, and Luo Si. 2019. Rumor detection by exploiting user credibility information, attention and multi-task learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1173–1179, Florence, Italy. Association for Computational Linguistics.

Zile Li. 2018. Claim validation for the FEVER shared task. Master's thesis, Technische Universität Darmstadt.

Chih-Jen Lin. 2007. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10):2756–2779.

Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, (1):76–80.

Marco Lippi and Paolo Torroni. 2016a. Argument mining from speech: Detecting claims in political debates. In Thirtieth AAAI Conference on Artificial Intelligence.

Marco Lippi and Paolo Torroni. 2016b. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology (TOIT), 16(2):10.

Marco Lippi and Paolo Torroni. 2016c. Margot: A web server for argumentation mining. Expert Systems with Applications, 65:292–303.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496.

Zhenghao Liu, Chenyan Xiong, and Maosong Sun. 2019b. Kernel graph attention network for fact verification. arXiv preprint arXiv:1910.09796.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23.

Christopher Malon. 2018. Team papelo: Transformer networks at fever. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 109–113.

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2010. Introduction to information retrieval. Natural Language Engineering, 16(1):100–103.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.

Ana Marasovic. 2018. NLP’s generalization problem, and how researchers are tackling it. The Gradient.

Gary Marcus. 2018. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631.

Meri Coleman and Ta Lin Liau. 1975. A computer readability formula designed for machine scoring. Journal of Applied Psychology, 60(2):283.

Alice Marwick and Rebecca Lewis. 2017. Media manipulation and disinformation online. New York: Data & Society Research Institute.

G. Harry Mc Laughlin. 1969. SMOG grading—a new readability formula. Journal of reading, 12(8):639–646.

Rachel McAlpine. 1997. Global English for global business. Longman.

Philip M. McCarthy. 2005. An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity (MTLD). Dissertation Abstracts International, 66:12.

Ryan McDonald, George Brokos, and Ion Androutsopoulos. 2018. Deep relevance ranking using enhanced document-query interactions. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1849–1860.

Bhaskar Mitra and Nick Craswell. 2017. Neural models for information retrieval. arXiv preprint arXiv:1705.01509.

Raquel Mochales and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law, 19(1):1–22.

Saif M. Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. SemEval-2016 Task 6: Detecting Stance in Tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval), pages 31–41, San Diego, CA, USA.

Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval), pages 321–327, Atlanta, GA, USA.

Saif M. Mohammad and Peter D. Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL/HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 26–34, Los Angeles, CA, USA.

Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436–465.

Mitra Mohtarami, Ramy Baly, James Glass, Preslav Nakov, Lluís Màrquez, and Alessandro Moschitti. 2018. Automatic stance detection using end-to-end memory networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 767–776.

Federico Monti, Fabrizio Frasca, Davide Eynard, Damon Mannion, and Michael M Bronstein. 2019. Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673.

Moin Nadeem, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass. 2019. FAKTA: An automatic end-to-end fact checking system. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT).

Arpitha Nagaraja. 2017. Development of a system for automated fact-checking - corpus construction and evidence extraction. Master’s thesis, Technische Universität Darmstadt.

Preslav Nakov, Alberto Barrón-Cedeño, Tamer Elsayed, Reem Suwaileh, Lluís Màrquez, Wajdi Zaghouani, Pepa Atanasova, Spas Kyuchukov, and Giovanni Da San Martino. 2018. Overview of the CLEF-2018 checkthat! Lab on automatic identification and verification of political claims. In Proceedings of the Ninth International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, Avignon, France. Springer.

Yixin Nie, Haonan Chen, and Mohit Bansal. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6859–6866.

Zhiyuan Ning, Yuanzhen Lin, and Ruichao Zhong. 2019. Team peter-parker at semeval-2019 task 4: Bert-based method in hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1037–1040.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation mining: the detection, classification and structure of arguments in text. In Proceedings of the 12th international conference on artificial intelligence and law, pages 98–107.

Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

Eli Pariser. 2011. The filter bubble: What the Internet is hiding from you. Penguin UK.

Danny Paskin. 2018. Real or fake news: Who knows? The Journal of Social Media in Society, 7(2):252–273.

Jeff Pasternack and Dan Roth. 2013. Latent credibility analysis. In Proceedings of the 22nd international conference on World Wide Web, pages 1009–1020. ACM.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar.

Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. 2018. Automatic detection of fake news. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3391–3401.

Nathaniel Persily. 2017. The 2016 US election: Can democracy survive the internet? Journal of democracy, 28(2):63–76.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Dean Pomerleau and Delip Rao. 2017. The Fake News Challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news. http://www.fakenewschallenge.org/. Accessed: 2017-10-20.

Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2016. Credibility assessment of textual claims on the web. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 2173–2178. ACM.

Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 1003–1012. International World Wide Web Conferences Steering Committee.

Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2018. Declare: Debunking fake news and false claims using evidence-aware deep learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 22–32.

Martin Potthast, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno Stein. 2018. A stylometric inquiry into hyperpartisan and fake news. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 231–240.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Gordon Ramsay and Sam Robertshaw. 2019. Weaponising news: RT, Sputnik and targeted disinformation. King’s College London: The Policy Institute, Centre for the Study of Media, Communication and Power.

Hannah Rashkin, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, and Yejin Choi. 2017. Truth of varying shades: Analyzing language in fake news and political fact-checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2931–2937.

Benjamin Riedel, Isabelle Augenstein, Georgios P Spithourakis, and Sebastian Riedel. 2017. A simple but tough-to-beat baseline for the fake news challenge stance detection task. arXiv preprint arXiv:1707.03264.

Ruty Rinott, Lena Dankin, Carlos Alzate Perez, Mitesh M Khapra, Ehud Aharoni, and Noam Slonim. 2015. Show me your evidence - an automatic method for context dependent evidence detection. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 440–450.

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at trec-3. Nist Special Publication Sp, 109:109.

Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval), pages 451–463, Denver, CO, USA.

Victoria Rubin, Niall Conroy, Yimin Chen, and Sarah Cornwell. 2016. Fake news or truth? using satirical cues to detect potentially misleading news. In Proceedings of the second Workshop on Computational Approaches to Deception Detection, pages 7–17.

Victoria L Rubin, Yimin Chen, and Niall J Conroy. 2015. Deception detection for news: three types of fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community, page 83. American Society for Information Science.

Natali Ruchansky, Sungyong Seo, and Yan Liu. 2017. Csi: A hybrid deep model for fake news detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 797–806. ACM.

Andreas Rücklé and Iryna Gurevych. 2017. Representation learning for answer selection with LSTM-based importance weighting. In IWCS 2017—12th International Conference on Computational Semantics—Short papers.

Tuukka Ruotsalo. 2012. Domain specific data retrieval on the semantic web. In Extended Semantic Web Conference, pages 422–436. Springer.

SafeGuardCyber. 2019. Contactless actions against the enemy: How Russia is deploying misinformation on social media to influence European parliamentary elections. https://www.safeguardcyber.com/resources/white-papers/eu-election-security. Accessed: 2019-10-15.

Eric SanJuan, Fidelia Ibekwe-SanJuan, Juan-Manuel Torres-Moreno, and Patricia Velázquez-Morales. 2007. Combining vector space model and multi word term extraction for semantic query expansion. In International Conference on Application of Natural Language to Information Systems, pages 252–263. Springer.

Benjamin Schiller. 2017. Development of machine learning methods for automated claim validation. Master’s thesis, Technische Universität Darmstadt.

Jodi Schneider. 2014. Automated argumentation mining to the rescue? Envisioning argumentation and decision-making support for debates in open online collaboration communities. In Proceedings of the First Workshop on Argumentation Mining, pages 59–63.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Baird Sean, Sibley Doug, and Pan Yuxi. 2017. Talos Targets Disinformation with Fake News Challenge Victory. http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html. Accessed: 2017-12-02.

R.J. Senter and Edgar A. Smith. 1967. Automated readability index. Technical Report AMRL-TR-66-220, University of Cincinnati.

Chengcheng Shao, Giovanni Luca Ciampaglia, Alessandro Flammini, and Filippo Menczer. 2016. Hoaxy: A platform for tracking online misinformation. In Proceedings of the 25th international conference companion on world wide web, pages 745–750. International World Wide Web Conferences Steering Committee.

Ivor Shapiro, Colette Brin, Isabelle Bédard-Brûlé, and Kasia Mychajlowycz. 2013. Verification as a strategic ritual: How journalists retrospectively describe processes for ensuring accuracy. Journalism Practice, 7(6):657–673.

Baoxu Shi and Tim Weninger. 2016a. Discriminative predicate path mining for fact checking in knowledge graphs. Knowledge-Based Systems, 104:123–133.

Baoxu Shi and Tim Weninger. 2016b. Fact checking in heterogeneous information networks. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 101–102. International World Wide Web Conferences Steering Committee.

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. 2017. Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1):22–36.

Richard H Shultz and Roy Godson. 1984. Dezinformatsia: Active measures in Soviet strategy. Potomac Books.

Craig Silverman and Jeremy Singer-Vine. 2016. Most americans who see fake news believe it, new survey says. https://www.buzzfeednews.com/article/craigsilverman/fake-news-survey. Accessed: 2020-3-10.

Jacob Silverman. 2016. Terms of service: social media and the price of constant connection. Harper Perennial.

Amit Singhal. 2012. Introducing the knowledge graph: things, not strings. https://www.blog.google/products/search/introducing-knowledge-graph-things-not/. Accessed: 2020-3-11.

Noam Slonim, Iryna Gurevych, Chris Reed, and Benno Stein. 2016. Nlp approaches to computational argumentation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts.

Amir Soleimani, Christof Monz, and Marcel Worring. 2019. Bert for evidence retrieval and claim verification. arXiv preprint arXiv:1910.02655.

N. Watson Solomon. 2006. Strain index: A new readability formula. Master’s thesis, Madurai Kamaraj University, December.

Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ideological on-line debates. In Proceedings of the NAACL/HLT 2010 Workshop on Computa- tional Approaches to Analysis and Generation of Emotion in Text, pages 116–124, Los Angeles, CA, USA.

Daniil Sorokin and Iryna Gurevych. 2018. Mixing context granularities for improved entity linking on question answering data across entity categories. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 65–75.

Dhanya Sridhar, James R. Foulds, Bert Huang, Lise Getoor, and Marilyn A. Walker. 2015. Joint models of disagreement and stance in online debate. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL/IJCNLP), pages 116–125, Beijing, China.

Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, and Iryna Gurevych. 2018a. Argumentext: Searching for arguments in heterogeneous sources. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 21–25.

Christian Stab and Iryna Gurevych. 2014. Annotating argument components and relations in persuasive essays. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 1501–1510.

Christian Stab and Iryna Gurevych. 2017. Parsing argumentation structures in persuasive essays. Computational Linguistics, 43(3):619–659.

Christian Stab and Ivan Habernal. 2016. Existing resources for debating technologies. In Report of Dagstuhl Seminar 15512, Debating Technologies.

Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018b. Cross-topic argument mining from heterogeneous sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3664–3674.

Chris Stahlhut. 2019. Interactive evidence detection: train state-of-the-art model out-of-domain or simple model interactively? In Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), pages 79–89.

Chris Stahlhut, Christian Stab, and Iryna Gurevych. 2018. Pilot experiments of hypothesis validation through evidence detection for historians. In Proceedings of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems, volume 2167 of CEUR Workshop Proceedings, pages 83–89.

Sanja Štajner, Richard Evans, Constantin Orăsan, and Ruslan Mitkov. 2012. What can readability measures really tell us about text complexity. In Proceedings of the LREC 2012 Workshop on Natural Language Processing for Improving Textual Accessibility (NLP4ITA), pages 14–21, Istanbul, Turkey.

Stanford. 2017. Cs224n: Natural language processing with deep learning, course project reports for 2017. http://web.stanford.edu/class/cs224n/reports.html. Accessed: 2017-12-13.

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. 6th International Conference on Learning Representations.

Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM.

Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 7464–7473.

Cass R Sunstein. 2014. On rumors: How falsehoods spread, why we believe them, and what can be done. Princeton University Press.

Eugenio Tacchini, Gabriele Ballarin, Marco L Della Vedova, Stefano Moret, and Luca de Alfaro. 2017. Some like it hoax: Automated fake news detection in social networks. In 2nd Workshop on Data Science for Social Good, SoGood 2017, pages 1–15. CEUR-WS.

Christopher Tauchmann, Thomas Arnold, Andreas Hanselowski, Christian M Meyer, and Margot Mieskes. 2018. Beyond generic summarization: A multi-faceted hierarchical summarization corpus of large heterogeneous data. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018).

Mariona Taulé, Maria Antònia Martí, Francisco M. Rangel Pardo, Paolo Rosso, Cristina Bosco, and Viviana Patti. 2017. Overview of the Task on Stance and Gender Detection in Tweets on Catalan Independence. In Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval) co-located with 33th Conference of the Spanish Society for Natural Language Processing (SEPLN), pages 157–177, Murcia, Spain.

Yla R Tausczik and James W Pennebaker. 2010. The psychological meaning of words: Liwc and computerized text analysis methods. Journal of language and social psychology, 29(1):24–54.

Leif Thomas and Bertram Weiss. 2010. Fact-Checking und redaktionelles Qualitätsmanagement. In: netzwerk recherche e.V. (Hrsg.): Fact-Checking: Fakten finden, Fehler vermeiden. Geschäftsstelle, Stubbenhuk 10, 20459 Hamburg.

James Thorne, Mingjie Chen, Giorgos Myrianthous, Jiashu Pu, Xiaoxuan Wang, and Andreas Vlachos. 2017. Fake news stance detection using stacked ensemble of classifiers. In Proceedings of the EMNLP 2017 Workshop ‘Natural Language Processing meets Journalism’, pages 80–83, Copenhagen, Denmark.

James Thorne and Andreas Vlachos. 2018. Automated fact checking: Task formulations, methods and future directions. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3346–3359.

James Thorne and Andreas Vlachos. 2019. Adversarial attacks against fact extraction and verification. arXiv preprint arXiv:1903.05543.

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. FEVER: A large-scale dataset for fact extraction and verification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819.

James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal. 2018b. The fact extraction and verification (FEVER) shared task. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 1–9.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.

Andreas Vlachos and Sebastian Riedel. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 18–22.

Andreas Vlachos and Sebastian Riedel. 2015. Identification and verification of simple claims about statistical properties. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2596–2601. Association for Computational Linguistics.

Svitlana Volkova, Kyle Shaffer, Jin Yea Jang, and Nathan Hodas. 2017. Separating facts from fiction: Linguistic models to classify suspicious and trusted news posts on twitter. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 647–653.

Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science, 359(6380):1146–1151.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledge base. Communications of the ACM, 57(10):78–85.

Ari Ezra Waldman. 2017. The marketplace of fake news. University of Pennsylvania Journal of Constitutional Law, 20:845.

Marilyn A. Walker, Pranav Anand, Robert Abbott, and Ricky Grant. 2012. Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL/HLT), pages 592–596, Montréal, QC, Canada.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2017. Evidence aggregation for answer re-ranking in open-domain question answering. arXiv preprint arXiv:1711.05116.

William Yang Wang. 2017. "Liar, Liar Pants on Fire": A new benchmark dataset for fake news detection. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).

Helena Webb, Marina Jirotka, Bernd Carsten Stahl, William Housley, Adam Edwards, Matthew Williams, Rob Procter, Omer Rana, and Pete Burnap. 2016. Digital wildfires: Hyper-connectivity, havoc and a global ethos to govern social media. ACM SIGCAS Computers and Society, 45(3):193–201.

Jen Weedon, William Nuland, and Alex Stamos. 2017. Information operations and Facebook. Facebook.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 347–354, Vancouver, BC, Canada.

Ruoyao Yang, Wanying Xie, Chunhua Liu, and Dong Yu. 2019a. BLCU_NLP at SemEval-2019 task 7: An inference chain-based GPT model for rumour evaluation. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 1090–1096, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pages 5754–5764.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pages 1480–1489.

Wenpeng Yin and Dan Roth. 2018. TwoWingOS: A two-wing optimization strategy for evidential claim verification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 105–114.

Takuma Yoneda, Jeff Mitchell, Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Ucl machine reading group: Four factor framework for fact finding (hexaf). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 97–102.

Majid Zarharan, Samane Ahangar, Fateme Sadat Rezvaninejad, Mahdi Lotfi Bidhendi, Sauleh Eetemadi, Mohammad Taher Pilevar, and Behrouz Minaei. 2019. Persian stance classification data set. In Proceedings of the conference for Truth and Trust Online.

Guido Zarrella and Amy Marsh. 2016. MITRE at SemEval-2016 Task 6: Transfer Learning for Stance Detection. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 458–463, San Diego, CA, USA.

Klaus Zechner. 2002. Automatic Summarization of Open-Domain Multiparty Dialogues in Diverse Genres. Computational Linguistics, 28(4):447–485.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062.

Hao Zhang. 2018. Stance detection and evidence extraction for automated fact-checking. Master’s thesis, Technische Universität Darmstadt.

Qiang Zhang, Emine Yilmaz, and Shangsong Liang. 2018. Ranking-based method for news stance detection. In Companion Proceedings of the The Web Conference 2018, pages 41–42. International World Wide Web Conferences Steering Committee.

Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2019. Gear: Graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 892–901.

Xinyi Zhou and Reza Zafarani. 2018. Fake news: A survey of research, detection methods, and opportunities. arXiv preprint arXiv:1812.00315.

Xiaodan Zhu, Svetlana Kiritchenko, and Saif M. Mohammad. 2014. NRC-Canada-2014: Recent Improvements in the Sentiment Analysis of Tweets. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval), pages 443–447, Dublin, Ireland.
