Cardiff University
School of Computer Science and Informatics

CM3203 ONE SEMESTER INDIVIDUAL PROJECT REPORT

Implementation and Analysis of Truth Discovery

Joe Singleton

Supervisor: Dr Richard Booth
Moderator: Dr Federico Cerutti

May 2019

Abstract

With the vast amount of data available in today’s world, particularly on the web, it is common to find conflicting information from different sources. Given an input consisting of conflicting claims from multiple sources of unknown trustworthiness and reliability, truth discovery algorithms aim to evaluate which claims should be believed and which sources should be trusted. The evaluations of trust and belief should cohere with one another, so that a claim receives a high belief ranking if it is backed up by trustworthy sources and vice versa.

This project investigates truth discovery from practical and theoretical perspectives. On the practical side, a number of algorithms from the literature are implemented in software and analysed. On the theoretical side, a formal framework is developed to study truth discovery from a general point of view, allowing results to be proved and comparisons made between truth discovery and related areas in the literature. Desirable properties of truth discovery algorithms are defined in the framework, and we consider whether they are satisfied by a particular real-world algorithm, Sums.

Contents

List of Figures

1 Introduction

2 Background
  2.1 Related Areas
    2.1.1 Resolving Conflicts in Data
    2.1.2 Trust Analysis
  2.2 Truth Discovery Background
  2.3 Existing Work
    2.3.1 Software Implementations
    2.3.2 Theoretical Work

3 Software Implementation
  3.1 Specification and Design
    3.1.1 Datasets
    3.1.2 Algorithms
    3.1.3 Results and Evaluation
    3.1.4 Visualisation
    3.1.5 User Interfaces
  3.2 Implementation
    3.2.1 Programming Languages


    3.2.2 Third-party Libraries Used
    3.2.3 Truth Discovery Algorithms
  3.3 Results and Evaluation
    3.3.1 Real-World Dataset Demonstration
    3.3.2 Synthetic Data Accuracy Experiments
    3.3.3 Convergence Analysis
    3.3.4 Time Complexity Analysis
    3.3.5 User Interfaces
    3.3.6 Testing
    3.3.7 Evaluation against Requirements
  3.4 Future Work

4 Theoretical Analysis
  4.1 Approach
    4.1.1 Overview of Approach
  4.2 A Framework for Truth Discovery
    4.2.1 Standard Definitions and Notation
    4.2.2 Truth Discovery Definitions
    4.2.3 Axioms
    4.2.4 Iterative Truth Discovery Operators
  4.3 Evaluation
  4.4 Future Work
    4.4.1 Unfinished Business
    4.4.2 Future Directions

5 Conclusions

6 Reflection on Learning

References

A Web Interface Screenshots

B Test Results

C Proofs
  C.1 Proof of proposition 1
  C.2 Proof of proposition 2
  C.3 Proof of proposition 3

  C.4 Proof of proposition 4
  C.5 Proof of proposition 5
  C.6 Proof of lemma 1
  C.7 Proof of theorem 1
  C.8 Proof of theorem 2

List of Figures

3.1 Example truth discovery dataset as a list of claim tuples
3.2 Matrix representation of the dataset shown in figure 3.1
3.3 CSV representation of the dataset shown in figure 3.1
3.4 UML class diagram showing high-level design for datasets
3.5 UML class diagram for algorithms and iterators
3.6 Example of results of an algorithm as key-value mappings
3.7 UML class diagram for results
3.8 Results from figure 3.6 represented graphically
3.9 UML class diagram for graphical representations
3.10 Command-line interface example
3.11 Command-line YAML output example
3.12 UML class diagram for user interfaces
3.13 Python code for loading the stock dataset
3.14 Results for the stock dataset for each algorithm
3.15 Source trust distribution experiment on synthetic datasets
3.16 Claim probability experiment on synthetic datasets
3.17 Domain size experiment on synthetic datasets
3.18 Convergence experiment on a large synthetic dataset
3.19 Algorithm run time experiments on synthetic datasets
3.20 Python code demonstrating simple use of the API
3.21 Example output of the code in figure 3.20
3.22 Example of CLI interface for running an algorithm
3.23 Example of a Python unit test


3.24 Status of the requirements in the finished implementation

4.1 Example of two equivalent truth discovery networks
4.2 Network where we do not wish for an IIA-type axiom to hold
4.3 Network where some notion of independence may be applied
4.4 A network in which Sums does not rank all sources equally
4.5 A counterexample to independence for Sums

A.1 Basic view of the input form in the web interface
A.2 CSV dataset entry in the web interface
A.3 Advanced algorithm options in the web interface
A.4 Example results of a single algorithm run via the web interface
A.5 Example results of several algorithms run via the web interface
A.6 Graph representation of results in the web interface
A.7 Frame from an animation of the results of an algorithm

Chapter 1

Introduction

There is an increasing amount of data available in today’s world, particularly from the vast number of pages on the web, user-submitted content on social media platforms and blogs, and sensor data from scientific instruments. Data can be found in many different formats, from structured datasets to natural language articles, and from many disparate sources, e.g. news outlets, scientific institutions, and members of the public.

With this abundance of data, it is common to find information about a single object from multiple sources. An inherent problem faced when using such data is that different sources may provide conflicting information for the same object.

Conflicts have a variety of causes, including out-of-date information, poor quality data sources, errors in information extraction (when parsing natural language text, for example), and deliberate spread of misinformation. When it comes to finding information about an object with conflicting reports, the question is this: which information should we accept as correct?

Truth discovery has emerged as a topic aiming to tackle this problem of determining what to believe by also considering the trustworthiness of sources. A main principle in many approaches is that believable claims are those that are made by trustworthy sources, and that trustworthy sources generally make believable claims.


This project has both practical and theoretical components. On the practical side, we develop a robust software framework in Python for truth discovery, which supports running truth discovery algorithms on real-world datasets and evaluating their performance. A few popular algorithms from the literature are implemented, although the framework aims to be easily extendible, well-documented and well-tested so as to allow more algorithms and features to be implemented in the future. The framework will provide a uniform interface for users and researchers in truth discovery to run different algorithms on datasets, and compare behaviour between algorithms. Additionally, it could be used to aid development of new algorithms by evaluating performance against a number of existing algorithms. To demonstrate its capabilities, some basic analysis and evaluation of the implemented algorithms is performed.

For the theoretical component, we define a formal mathematical framework for truth discovery, highlighting parallels with other areas in the literature, especially the theory of voting in social choice [41]. Following the axiomatic approach used in social choice, we look for axioms (desirable properties) of truth discovery algorithms expressed in the formal framework, and consider which axioms are satisfied by a particular algorithm, namely Sums [27]. As well as providing some immediate results, this provides foundations for further theoretical work on truth discovery.

Chapter 2

Background

The fundamental problem of truth discovery is to resolve conflicts in a set of claims from different sources. A naïve approach is to apply a majority vote, where the claim made by the largest number of sources is accepted. Unfortunately, this is prone to yield poor results when the sources are not all equally trustworthy. For example, the study in [31] found that false information on Twitter is shared more quickly and more widely than true information. Applying a majority vote in the Twittersphere would therefore, in many cases, select the false information as correct.

The problem with the majority vote is that all sources are treated identically: a claim from one source carries as much weight as a claim from any other. This is contrary to how we judge the veracity of statements in everyday life, where claims from trusted colleagues have considerably more weight than claims from unknown persons (and especially people known to be untrustworthy). Trust can therefore be a valuable tool in resolving conflicting information, since one expects that trustworthy sources are more likely to make accurate claims than untrustworthy sources are.

Truth discovery therefore has two components: determining trust and belief in sources and claims, and resolving conflicts in data. These are tightly linked, since the trust evaluation is based on the claims in the input data, and the claims to accept are based on the trust evaluation.

Presently we outline related areas in the literature that deal with resolving conflicts and trust analysis, before providing some background on truth discovery itself. Then we review existing work relevant to the practical and theoretical components of the project.

2.1 Related Areas

2.1.1 Resolving Conflicts in Data

There are numerous areas in the existing literature that deal with resolving conflicts in data. In data mining, data fusion considers aggregating data from different sources into a single representation. Various approaches have been suggested (see [8] for a review): for example, taking a majority vote (as discussed above), taking the most recent value as correct, or ignoring objects entirely when a conflict exists.

In collective annotation [17], multiple individuals assign labels (annotations) to objects, which are to be aggregated into a single collective annotation. Different individuals may not agree on the appropriate labels for a given object; aggregation functions consider how to obtain the collective annotation given these conflicts.

Belief revision [28] is set out in a logical framework, and considers how to update a knowledge base upon receiving new information that could cause the knowledge base to be inconsistent.

Argumentation theory [7] takes an abstract view and considers a set of ‘arguments’ which conflict with each other (known as ‘attacking’). The structure of the arguments is abstracted away, and only the network of which arguments attack each other is considered. The aim is then to find sets of arguments that are acceptable and consistent.

In a more general sense, social choice also deals with conflicts in data. Here voters express preferences for a number of ‘alternatives’ (e.g. candidates for an election), and a social ordering of the alternatives is sought that reflects the will of the voters. Difficulties may arise when there is no consensus among the voters – e.g. one voter’s favourite outcome could be another’s least favourite. Although the notion of fairness present here is not applicable to truth discovery, the issue of conflicts in the preferences of voters is relevant.

2.1.2 Trust Analysis

Trust has been studied in many different domains for different applications (see [24] for a survey). In the social sciences and economics, trust between humans has been considered for its effects on economic transactions. E-commerce sites such as eBay use the concept of trust and reputation of sellers to inform buyers.

In wireless sensing, P2P networks and ad-hoc mobile networks, nodes are required to behave cooperatively for the network to function. Various approaches have been suggested for analysing the trust of nodes in these scenarios (e.g. see section 4.4 in [24]) to mitigate the effects of misbehaviour, e.g. due to hardware problems or malicious interference from an adversary.

It should be noted that some of these approaches to evaluating trust compute ‘local’ measures of trust from the perspective of a particular node. For example, a sensor in a network may evaluate the trust of its neighbours based on its interactions with them. The trust assigned to a given node may therefore vary as it is evaluated from different perspectives. This is not the case with truth discovery, where we seek an objective ‘global’ notion of trust in sources based only on the claims made.

Other works do not aim to compute trust, but instead use trust relationships between ‘agents’ in a multi-agent system for some purpose. One example is trust-based recommendation systems [5], where an agent is given a recommendation for an item of interest based on the recommendations of the agents it trusts. Other examples include personalised ranking systems [2], trust-based argumentation [29] and trust-based belief revision [9]. The trust considered here is again from the perspective of a given agent. Nevertheless, ideas from these areas may still be relevant to truth discovery. From a theoretical point of view, Marsh [23] provides a formal model for trust in multi-agent systems.

2.2 Truth Discovery Background

In the preceding sections truth discovery has been discussed only informally. Here we outline more precisely the main concepts in the basic form of truth discovery, and discuss the various extensions to this basic form that have been addressed in the literature. More information can be found in survey papers on truth discovery methods [15, 21].

A source is an entity that provides (or claims) information, called facts, about objects.1 A source may claim at most one fact for a given object. Different sources may provide different facts for the same object; a fact need not be ‘true’. It is often assumed that for each object there is a unique true fact. In the basic form of truth discovery, the nature of the facts is irrelevant, and they are treated as categorical values.2

1 Note that the terminology is not uniform across the literature; e.g. the survey in [15] refers to sources as ‘providers’, and the facts in [27] are referred to as ‘claims’ (i.e. the word ‘claim’ is used as a noun, whereas we use it as a verb).
2 Some approaches instead make use of the specific data types of the facts in their calculations (e.g. [22]).

A truth discovery algorithm takes the sources, facts, objects and the facts claimed by each source as its input, and outputs a measure of the trustworthiness of sources and the believability of facts. In an application of truth discovery for finding true facts, the fact determined to be most believable for a given object is taken to be the truth.

The precise definitions of input and output vary across the popular truth discovery literature. Definitions of input are generally compatible with each other, in the sense that they can be restated in terms of sources, objects and facts (as above) without changing the essence of the approach. For example, in Sums, Average·Log, Investment and PooledInvestment [27] there is no concept of objects; instead there are mutual exclusion sets of facts which cannot simultaneously be true. However the mutual exclusion sets themselves can be seen as the objects linking related facts together. Cosine, 2-Estimates and 3-Estimates [14] also have no concept of objects, and sources may make negative claims where they state a fact is false. However one may view the ‘facts’ (in the sense of [14]) as objects, and ‘true’ and ‘false’ as the only two facts associated with each object.

The form of the output varies more widely. The treatment of sources is generally the same: each source is assigned a trust score (usually a number in [0, 1]), with higher scores indicating more trustworthy sources. Differences appear for the treatment of facts, however. The two main ideas are to assign each fact a belief score [27, 35, 36, 37], or to select a single ‘true’ fact for each object [34, 38, 40]. Another view, taken in [22] for example, is to produce a single value for each object to represent the true fact, but where the true value may not have been claimed by any sources (for example, a weighted average could be applied when facts are numeric values, and could lead to this situation). Note that selecting a single fact for each object can be seen as a special case of assigning each fact a score; e.g. the true facts receive a score of 1 and all others receive 0.

In practice, most algorithms operate iteratively, computing trust and belief scores (or finding ‘true’ facts) until convergence or some stopping criterion is satisfied.
For algorithms that assign trust and belief scores, facts are first assigned initial belief scores. At each iteration, the trust scores for sources are updated based on the current belief scores for the facts they claim; then the belief scores are updated based on the current trust scores for the sources. This mutual dependence of trust and belief scores is hoped to encode the idea that trustworthy sources are ones that claim believable facts, and believable facts are those claimed by trustworthy sources.

This basic formulation is a simple representation of the real world, and a number of extensions have been addressed to deal with more complex situations. Some approaches extend the basic model by requiring more information in the input (e.g. a set of ‘ground truths’ for semi-supervised truth discovery), whereas others keep the same basic input but consider more nuanced issues (e.g. considering copying amongst sources). Extensions include: implications between facts [35]; heterogeneous data [22]; correlations between objects [34]; hardness of facts [14]; incorporating prior domain knowledge [27]; sparsity in claims [37]; semi-supervised truth discovery [36]; copying between sources [11]; time-varying truth [11]; and streaming data [39].
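To make the mutual update between trust and belief concrete, the following is a minimal sketch of an iteration scheme in the style of Sums [27]. The function name, the fixed prior of 0.5 and the normalisation by the maximum score are illustrative simplifications for this sketch, not necessarily the exact formulation used in the implementation described in chapter 3.

def iterate_trust_and_belief(claims, iterations=20):
    """Sums-style mutual update of trust and belief scores.

    `claims` is a list of (source, fact) pairs.  Scores are normalised
    by their maximum after each round to keep them in [0, 1].
    """
    sources = {s for s, _ in claims}
    facts = {f for _, f in claims}
    belief = {f: 0.5 for f in facts}   # prior belief scores
    trust = {s: 0.0 for s in sources}

    for _ in range(iterations):
        # Trust in a source is the total belief in the facts it claims.
        for s in sources:
            trust[s] = sum(belief[f] for src, f in claims if src == s)
        max_trust = max(trust.values())
        trust = {s: t / max_trust for s, t in trust.items()}

        # Belief in a fact is the total trust of the sources claiming it.
        for f in facts:
            belief[f] = sum(trust[s] for s, fact in claims if fact == f)
        max_belief = max(belief.values())
        belief = {f: b / max_belief for f, b in belief.items()}

    return trust, belief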

2.3 Existing Work

2.3.1 Software Implementations

Due to the wide range of domains in which truth discovery may be applied and the variety of approaches and additional considerations beyond the basic model (some of which are domain specific), there is not likely to be a single algorithm that is best suited for all applications.

Instead, for a given problem it is necessary to try several algorithms, or even tailor a bespoke one, to achieve the best results. Note that even evaluation may be domain specific: run time and memory efficiency may be critical in some cases (e.g. when dealing with large volumes of data in a scenario where real-time results are desired), whereas some applications may be insensitive to long run times but require precise accuracy.3

For this reason, there is a need for an openly available and extendible software framework for truth discovery, which allows different algorithms to be evaluated and compared in a uniform environment. When faced with a particular truth discovery problem, users may then run many algorithms on their data without additional effort for each algorithm. An easily extendible framework will allow them to define their own metrics for evaluation that are suitable for the dataset in question, and even modify algorithms to suit the type of data if required.

Despite the wide interest in truth discovery in research papers, there are few open source software implementations. One such implementation is spectrum4, available on GitHub. This library implements some algorithms from the literature, but lacks proper documentation, does not provide a uniform interface for getting results across algorithms, and does not support evaluating results for datasets for which some true values are already known. It also does not provide fine-tuned control for the running of algorithms, such as the threshold for determining when trust and belief scores have converged.

Another open source library is DAFNA-EA5, which is the implementation behind the comparative study in [18]. Whilst extensive in the number of algorithms implemented, its capabilities for evaluating performance, and its generation of synthetic data, it lacks the documentation needed for its code to be extended (i.e. for users to define their own evaluation metrics or algorithms), and is not geared towards real-world applications of truth discovery. Additionally, a web interface is reportedly available, which may improve accessibility for non-technical users, but it is not operational at the time of writing.

Other truth discovery implementations are available on GitHub and elsewhere,6 but these are generally repositories containing code used in the production of a research paper rather than general purpose libraries.

To produce a software framework for truth discovery that achieves the goals stated above and which addresses the deficiencies of the existing frameworks, this project will implement a selection of algorithms from the literature in a uniform way with a focus on extendability, which will include comprehensive documentation of both the code and the user interface. Fine-tuned control of parameters for initialising and running algorithms will be available. It will also identify standard metrics for evaluating algorithms (including generation of synthetic datasets), and provide means of comparing algorithms with respect to those metrics.

3 The meaning of ‘accuracy’ for truth discovery algorithms will be discussed in section 3.1.
4 https://github.com/totucuong/spectrum
5 https://github.com/daqcri/DAFNA-EA
6 e.g. https://github.com/lvlingyu11/Truth-Discovery-for-Crowdsourcing-Data and https://github.com/MengtingWan/KDEm

2.3.2 Theoretical Work

Many of the truth discovery algorithms in the literature are supported by some form of theoretical analysis. For example, the survey in [15] analyses the time complexity of various algorithms, the convergence of semi-supervised truth finder is proved in [36], and the approach in [33] is proven to converge to find the true facts under assumptions on the distributions of source reliability. In [32], a probabilistic framework termed Unified Truth Discovery is developed, and theoretical properties of this framework are proved under certain mild assumptions.

A limitation of this kind of theoretical analysis is that the results apply only to a single algorithm or class of algorithm. These results, whilst important in their own right, are not general enough to apply to truth discovery as a whole, and depend on the specific ideas and approaches in use.

As far as I am aware, at the time of writing there is no general theoretical framework capable of modelling all truth discovery algorithms, which allows general properties of such algorithms to be studied. Developing such a framework would allow algorithms to be compared with respect to their theoretical properties as opposed to purely practical ones, something which has not been done in the literature to date.

A general theory of truth discovery would also facilitate deeper comparison between truth discovery and other areas in the literature.

For example, it was noted above that truth discovery bears similarity to belief revision, argumentation theory and, in a more general sense, social choice. Each of these areas has rich formal foundations which have provided useful results for both theoretical and practical purposes; developing similar foundations for truth discovery may reveal deeper similarities and allow results in these areas to be applied to truth discovery.

An approach that has seen great success in social choice is the axiomatic method, where axioms (desirable properties) of voting rules are stated, and rules are compared with respect to these properties. Such analysis can provide deep results; for example, Arrow’s famous impossibility theorem [6] shows that it is impossible for a voting rule to simultaneously satisfy a few simple axioms that one would reasonably expect of a fair voting rule, thus proving a fundamental limitation of voting rules in general. The axiomatic approach has also been successfully applied in the related areas of ranking [2, 3], recommendation [5, 19], collective annotation [17] and belief revision [1].

This project will aim to define a theoretical framework for truth discovery that is general enough for any algorithm to be modelled, independent of the algorithm’s specific approach. Following the axiomatic method employed in social choice and related areas, axioms for truth discovery algorithms will be developed to encode desirable properties. To demonstrate the framework’s suitability as a tool for analysing real-world truth discovery algorithms, Sums [27] will be defined formally and analysed with respect to the developed axioms.

Chapter 3

Software Implementation

This chapter describes the software framework for truth discovery developed for the practical component of this project. First, high-level requirements and the design of the system are discussed and justified.

3.1 Specification and Design

The broad goals and aims for the practical component of this project were outlined in section 2.3.1. In this section, these ideas are developed and made into precise requirements for the software. This is accompanied by a high-level description of how the software is designed to meet these requirements.

The primary use case for the system is applying truth discovery algorithms to real-world datasets to tackle real-world truth discovery problems. For example, a user may have collected information from various websites and wishes to use truth discovery to determine, as far as possible, which information is true and the trustworthiness of the websites. To distinguish them from other types of users, we shall call such users truth discovery practitioners. To determine the best algorithm to use for their specific purpose, these users will also be interested in evaluating algorithms in various ways, such as time and memory complexity.

Due to the variety of diverse domains in which truth discovery can be applied, practitioners may also wish to define their own methods of evaluation specific to their type of data.

Another use case is algorithm development. Developing and testing a new truth discovery algorithm requires a lot of supporting infrastructure, such as methods for loading datasets and user interfaces. Additionally, one will often look to compare the new algorithm to existing ones, in order to determine in what sense the new algorithm is an improvement; this requires implementing existing algorithms and methods for evaluation. Algorithm developers would therefore benefit from a library for truth discovery that provides the necessary supporting code, allowing them to focus solely on the development of the algorithm itself. Evaluating the existing and new algorithms with the same library also ensures comparisons are fair.

Both the ‘practitioner’ and algorithm developer roles require evaluating algorithms in some sense. An important measure of an algorithm’s performance is its accuracy, defined as the proportion of cases where the algorithm predicts the true fact for an object [20, 27]. In much of the truth discovery literature, accuracy is calculated by running an algorithm on a dataset for which true values are already known for some objects. We will refer to such datasets as supervised datasets. Supervised datasets are often constructed synthetically, where sources, facts and claims are generated randomly according to some statistical model, due to the difficulty in determining with confidence the true facts in real-world datasets [14, 22, 27, 35].

Accordingly, our system should provide methods for loading supervised datasets, both from real-world data and by synthesis, and for evaluating accuracy with respect to such datasets.

The final major use for the system is to be a tool for theoretical work. Users considering theoretical aspects of truth discovery, who we shall refer to as theorists, will often need to construct simple instances of truth discovery problems for examples and counterexamples, and run algorithms on these instances. They may also use a software library to empirically verify or disprove properties of certain algorithms.

Having outlined the target audience for the software implementation, their goals and the tasks they wish to perform, we can identify distinct components of the system and high-level requirements for each.

• Datasets: Datasets need to be loaded from suitable formats. This includes loading files on disk for large real-world datasets, and creating small examples by hand. Additionally, supervised and synthetic data generation should be supported.

• Algorithms: A selection of algorithms from the literature should be implemented. Users should have control over any parameters available, such as the stopping criterion for iterative algorithms. The implementation should be extensible so that new algorithms can be developed without reimplementing code, to support the ‘algorithm developer’ use case.

• Results and evaluation: Results from algorithms should be presented to the user in a suitable format that reflects the aims of the user – the format may differ between use cases. To support evaluation, information such as run time, memory usage and the number of iterations should also be returned. It should also be possible for users to extend the code to define their own metrics, to support the ‘practitioner’ use case when the evaluation of an algorithm is domain-specific.

• Visualisation: For the ‘theorist’ use case, it is important that the system allows simple examples to be created by hand, and for these to be analysed in some way. Visual representations such as images and animations could be used to analyse results for small input datasets.

• User interfaces: Each of the main user types defined above has different reasons for using the system and may perform different kinds of tasks. A suitable user interface (or interfaces) should be available that reflects the aims of the user and allows the required tasks to be carried out as simply as possible.

Detailed requirements and design for datasets, algorithms and results will depend on the model of truth discovery that is adopted, since this will dictate the form of input and output. For our software implementation, we aim for a model that is general enough to support a wide range of current and future algorithms, but does not stray too far from the model actually used in the definition of the algorithms we wish to implement.1

For input, we shall use the sources, facts and objects model already described; it was noted that this approach is widely applicable to popular algorithms in the literature. However, we will use different terminology. It is anticipated that facts for objects will commonly represent numeric values for a variable, e.g. a source may claim the temperature in Celsius for the weather on the 27th April in Cardiff is 14°C. In such cases it is more natural to call ‘temperature’ a variable instead of an object, and to call ‘14°C’ a value instead of a fact. Whilst we will not actually restrict users to numeric values only, this terminology will occasionally be used instead of sources/objects/facts.2 Since the same value may be used for multiple objects, the notion of a ‘fact’ is replaced by a variable-value pair.

For output, trust and belief scores for each source and variable-value pair will be used, as this is the more general form of output discussed in section 2.2.

We now describe the components listed above in more detail, set out precise requirements, and discuss high-level aspects of the design. Non-functional requirements, which describe how the system should be as opposed to what it should do, are also discussed.

1 The requirements for a model here differ from those in the theoretical part of the project (discussed in section 4.1), where we are less concerned with the model matching the definition of specific algorithms.
2 The decision to use this alternate terminology was made early on in the project. Whether it actually improves or worsens clarity is up for debate.

3.1.1 Datasets

Format

The key parts of a truth discovery dataset are the sources, the variables, and the values claimed by sources for these variables. In a real application of truth discovery, one wishes to use an algorithm to determine true values for the variables and to analyse the trustworthiness of sources; the sources and variables therefore need to be labelled in some way so they can be referred to in the results. Labels will most commonly be strings (e.g. ‘Met Office’, ‘Humidity’), but other data types such as integers and floating point numbers should be allowed too.

(weather.com, 0.50, Humidity),
(weather.com, 1014.2 mb, Pressure),
(weather.com, 14°C, Temperature),
(weather.com, 16.1 km, Visibility),
(Met Office, 0.56, Humidity),
(Met Office, 1014 mb, Pressure),
(Met Office, 11°C, Temperature)

Figure 3.1: Example of a small truth discovery dataset expressed as a list of claim tuples. The data was collected manually from two weather forecasts for Cardiff on the 27th April 2019.

The first assumption made in this project is that these labels are unique identifiers for sources and variables. With this assumption, a claim consists of only three parts: a source label, a value, and a variable label. There is no need to define the sources and variables separately, since the whole set of sources and variables can be constructed from a list of claims. A dataset will therefore be represented as a list of claim tuples. An example is shown in figure 3.1. This format is simple to understand and work with by hand and in code, and leads to the first requirement.

Requirement 1. Datasets can be loaded from a list of (source, value, variable) tuples. Each component of the tuple can be any reasonable data type, such as a string, integer, or floating point number.

For use cases other than real-world applications of truth discovery, such as evaluation of algorithms for theoretical analysis and algorithm development, one is not interested in the results of truth discovery for its own sake, but for other aspects such as run time and accuracy on supervised datasets. In this case the labels for the sources and variables are irrelevant – the conflicts in the claims between sources are the only important detail. Labels can also be irrelevant even when one is interested in the results; for example when constructing a simple ‘fake’ example dataset by hand to analyse the results of an algorithm.

Having to create artificial labels in these cases would be inconvenient, so a format with ‘anonymous’ sources and variables is desired. One simple way to achieve this is to have a matrix with rows corresponding to sources, columns corresponding to variables, and claimed values as the entries of the matrix.

0.50    1012.2 mb    14°C    16.1 km
0.56    1012 mb      11°C    −

Figure 3.2: Matrix representation of the dataset shown in figure 3.1.

Sources can then be referred to by their row number if necessary, but no explicit labels need to be given. An example is shown in figure 3.2. Note that the matrix may contain ‘empty’ cells when a source does not make a claim for all variables, such as the second source (‘Met Office’) for the final variable (‘Visibility’) in figure 3.2.

Requirement 2. Datasets can be loaded from a matrix format with a row for each source and a column for each variable, where the entry at row i, column j is the value the i-th source claims for the j-th variable, or a special value denoting an empty cell.

In all but trivial cases, datasets need to be stored and loaded from files on disk. Due to the limited existing software for truth discovery, there is no standardised format for storing truth discovery datasets. For this reason we consider ‘bespoke’ formats for the claim tuple and matrix formats described above.

The claim tuple format (requirement 1) is designed with real-world applications of truth discovery in mind. Creating a real-world dataset from scratch is a difficult process which requires considerable time and effort. To create a dataset of a reasonable size the process must be automated, which presents challenges for identifying the variables and claims, especially when dealing with unstructured sources of information such as web pages. Furthermore, different sources may use different names for the same variables (e.g. ‘Wind’ and ‘Wind Speed’) which must be consolidated – this is called schema matching [8] – and values in different formats (e.g. 14mph and 22.5kmph) need to be recognised and converted to a common form.

As such, it is expected that users will more often use data from existing work that they did not collect themselves – such data can be readily found on the web3 – and so will have no control over the format. With so many different formats in use (e.g. CSV files, JSON files, relational databases), there is no single format that can be used in all cases.

Defining a fixed file format for our system would therefore require users to first write ‘conversion code’ to re-write the data in the chosen format; this would take up additional storage space, require data to be processed twice, and add complexity for users.

Clearly some amount of code from the user is required, however, to load data from an unknown format. Ideally the user would write as little code as possible, specifying only how to extract (source, value, variable) tuples, with other implementation details abstracted away.

To achieve this, base classes will be provided that implement the core functionality for loading datasets, but leave the method of extracting claim tuples from the file undefined. Users will then extend these classes to implement this functionality according to the particular format in use. This approach allows files from any format to be read directly, as the conversion to claim tuple format can be done on the fly.

Requirement 3. Users can load data from any custom format by implementing a function that extracts (source, value, variable) tuples from an input file.
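As an illustration of this pattern, the sketch below shows how a user-defined loader for a simple JSON format might look. The class and method names here are hypothetical placeholders for this sketch; the actual names used in the implementation are set out in the design discussion below.

import json
from abc import ABC, abstractmethod


class AbstractFileDataset(ABC):
    # Hypothetical base class: generic loading logic lives here, while the
    # extraction of (source, value, variable) tuples is left to subclasses.

    def __init__(self, path):
        with open(path) as f:
            self.claims = list(self.get_tuples(f))

    @abstractmethod
    def get_tuples(self, fileobj):
        """Yield (source, value, variable) tuples from the open file."""


class JSONClaimsDataset(AbstractFileDataset):
    # User-supplied loader for files of the form
    # [{"source": ..., "variable": ..., "value": ...}, ...]

    def get_tuples(self, fileobj):
        for record in json.load(fileobj):
            yield record["source"], record["value"], record["variable"]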

The situation is not the same for the matrix format (requirement 2), which is intended for evaluation on synthetic and hand-made datasets. In this case the datasets are more likely to be used within the scope of this software only, and will be created by users themselves, so the issues with defining a fixed file format described above do not apply.

We opt to use a plain text CSV (comma separated values) format for matrix datasets. CSV files are widely used in many domains, can be created and edited easily in a text editor or in spreadsheet software, and naturally represent a matrix: each line in the file represents a row, and values within a row are separated by commas. Note that empty cells can be conveniently denoted by the ‘empty string’. Figure 3.3 shows an example dataset in CSV format.

Requirement 4. Datasets can be loaded from a CSV file representing a matrix.

3 For example, there are five real-world datasets available to download at http://lunadong.com/fusionDataSets.htm.

0.50,1012.2mb,14°C,16.1km
0.56,1012mb,11°C,

Figure 3.3: CSV representation of the dataset shown in figure 3.1.

Supervised and Synthetic Data

Supervised data is often used to evaluate the accuracy of truth discovery algorithms by considering the proportion of variables for which the predicted value is correct. Supervised datasets can be obtained by manually determining true variable values in a real-world dataset, or by creating the dataset synthetically. Both approaches are useful in their own ways: real-world supervised data may be a realistic indicator of accuracy, while synthetic data is easier to obtain and allows parameters of the dataset to be precisely controlled (e.g. the number of sources and variables, the proportion of true claims, the variation in claimed values etc.).

Note that supervised data is still useful if true values are only known for a subset of variables. Running an algorithm on the full input data provides more information for the algorithm to base its judgment on, and the results can be evaluated based only on the subset. In any case, for real-world datasets with hundreds of variables it is infeasible to manually find true values for all variables with a high degree of confidence.

A supervised dataset therefore consists of a regular dataset, as defined in requirements 1 or 2, along with a subset of variables and associated true values. For datasets in claim tuple form, true values can be represented simply as a list of (true value, variable) tuples. For the matrix format, they can be represented in an additional row.

Requirement 5. Supervised datasets can be loaded from the following formats:

• a list of (source, value, variable) tuples together with a list of (true value, variable) tuples for a subset of the variables;

• a matrix of claimed values, where the first row contains the known true values.

For generating synthetic data, various methods have been used in the literature.

In some cases synthetic data is not completely artificial, but generated based on real-world true facts [22, 27]. The number of artificial sources is fixed and reliability levels set, either by random selection [27] or provided as input to the process [22]. Claims are then generated according to the source reliability levels. In some methods, source reliability is the probability that any given claim from a source is true; sources choose the correct value with this probability and choose uniformly from a set of pre-generated false values otherwise [27]. Other methods do not explicitly generate false facts, but generate a source’s claims by adding different levels of noise to the true values, where the amount of noise is inversely proportional to the source reliability [22].

There are also papers describing entirely artificial data generation [14, 35]. These approaches are broadly similar to the ones described above, but the ‘true values’ for variables are created randomly first.

Note that sources do not make a claim for every variable in the generated set of claims, since this would not reflect the sparse nature of real-world datasets. The number of sources for a variable can be determined first and the appropriate number of sources selected randomly [27, 35], or one can specify the probability of a source making a claim on any particular variable and make a probabilistic choice for each source-variable pair [14].

For our truth discovery implementation, we aim for a simple and general method that can be used with a variety of algorithms. A simple approach will provide value initially, and may be extended in future work to model more specific qualities of truth discovery datasets. Taking a more complex and opinionated view from the outset would instead make it difficult to adapt to future use cases, and risks tailoring the synthetic data generation to a particular algorithm or type of data.

We will therefore generate synthetic data wholly artificially, i.e. not based on real-world truths, using a fixed set of categorical values for variable values. The user will specify the number of sources and variables to generate, a reliability value for each source, the probability that a source makes a claim for any particular variable, and the number of possible values for variables.

For each variable, a uniform random number in {0, . . . , d − 1} will be generated as the ‘true’ value, where d is the number of possible values (specified by the user).

Source reliability will have the probabilistic interpretation mentioned above: a source with reliability p chooses the ‘true’ value with probability p, and any other value with probability (1 − p)/(d − 1); that is, the source chooses among the false values uniformly.

Having the source reliabilities specified by the user instead of generated randomly allows for different scenarios to be considered – for example one can investigate the difference in the behaviour of algorithms when most sources are trustworthy and when most are untrustworthy. In any case, the user is free to generate source reliabilities randomly themselves beforehand.

Requirement 6. Synthetic datasets can be generated based on the approach described above, using user-supplied parameters.
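A minimal sketch of this generation procedure is given below. The function and parameter names are illustrative only; the real implementation need not follow this exact structure.

import random


def generate_synthetic_claims(reliabilities, num_variables, claim_prob, domain_size):
    """Generate (source, value, variable) tuples plus the true values.

    reliabilities : list giving each source's probability of claiming the true value
    num_variables : number of variables to generate
    claim_prob    : probability that a source claims a value for any given variable
    domain_size   : number of possible values d for each variable
    """
    true_values = {var: random.randrange(domain_size) for var in range(num_variables)}
    claims = []
    for source, p in enumerate(reliabilities):
        for var in range(num_variables):
            if random.random() >= claim_prob:
                continue  # this source makes no claim for this variable
            if random.random() < p:
                value = true_values[var]  # claim the true value
            else:
                # choose uniformly among the d - 1 false values
                candidates = [v for v in range(domain_size) if v != true_values[var]]
                value = random.choice(candidates)
            claims.append((source, value, var))
    return claims, true_values


# Example: five sources of varying reliability, ten variables, domain size four.
claims, truths = generate_synthetic_claims([0.9, 0.7, 0.5, 0.3, 0.1], 10, 0.8, 4)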

In practice, synthetic data is used to evaluate and compare different algorithms. It is important that the algorithms are run on exactly the same dataset, particularly when random chance is involved. The software must therefore support saving synthetic datasets to files; a natural choice is to use the same CSV format proposed for supervised data to avoid introducing an extra format.

Requirement 7. Synthetic datasets can be saved in the CSV format for supervised data defined in requirement 5.

Large Data

Real-world truth discovery datasets are often very large. For example, consider the stock dataset curated by the authors in [20]. Information about 1,000 stock symbols from 55 sources was collected for each week day in July 2011. For each stock symbol, 16 attributes were identified. Data was obtained for 21 days, resulting in 21 × 16 = 336 variables for each stock, and 336,000 variables in total. In my own experiments, I found that not all sources made a claim for all variables; even so, across the 55 sources there are 2,843,803 claims.

Since this dataset only covers a period of one month, one may imagine that even larger datasets are possible. Indeed, the authors in [36] performed experiments on a highly diverse web-scale dataset consisting of 65.7 million variables, 591 million claims, and 33,000 sources.

Figure 3.4: UML class diagram showing high-level design for datasets.

Any serious truth discovery implementation must therefore be able to handle ‘large’ datasets. Clearly ‘large’ is not a precise term, but nevertheless we state the following non-functional requirement.

Requirement 8. The system supports loading and running algorithms on large scale datasets without significant slow-down.

Design

With requirements for datasets defined, the main classes required for dealing with input data can be identified. The core class, called Dataset, will implement loading data from the claim tuples format defined in requirement 1. This is the most general form of input: the matrix form in requirement 2 can be reduced to this form by labelling the sources and variables according to their respective row and column numbers. A class MatrixDataset will therefore be a specialisation of Dataset.

Supervised data consists of a dataset and true values. In order to use any kind of dataset (Dataset or MatrixDataset), the SupervisedData class will hold a Dataset object; i.e. composition is used instead of inheritance. Synthetic datasets are special cases of supervised data, so a SyntheticData class will be a specialisation of SupervisedData.

For loading data from custom file formats, as set out in requirement 3, classes FileDataset and FileSupervisedData will implement the necessary functionality, except the format-specific details. These will be abstract classes; they must be sub-classed for the unimplemented methods to be defined. A UML class diagram [30] depicting the class hierarchy is shown in figure 3.4.
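For illustration, the hierarchy of figure 3.4 might be skeletonised roughly as follows; the constructor signatures are indicative only and do not represent the final interface.

class Dataset:
    """Core class: loads data from an iterable of (source, value, variable) tuples."""

    def __init__(self, triples):
        self.claims = list(triples)


class MatrixDataset(Dataset):
    """Matrix input: row and column indices serve as source and variable labels."""

    def __init__(self, matrix):
        triples = [
            (row, value, col)
            for row, row_values in enumerate(matrix)
            for col, value in enumerate(row_values)
            if value is not None  # None marks an empty cell
        ]
        super().__init__(triples)


class SupervisedData:
    """Composition: wraps any Dataset together with known true values."""

    def __init__(self, dataset, true_values):
        self.dataset = dataset          # a Dataset (or subclass) instance
        self.true_values = true_values  # mapping: variable -> true value


class SyntheticData(SupervisedData):
    """Special case of supervised data where claims are generated randomly."""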

3.1.2 Algorithms

The core ingredient for a truth discovery library is the implementation of truth discovery algorithms, but so far the actual selection of algorithms that will be implemented has not been discussed. In this section we briefly describe the various kinds of algorithms defined in the literature, and set out which ones will be implemented. Other aspects concerning their implementation will then be considered.

Algorithm Selection

Many truth discovery algorithms have been described in the literature, with wide variety in the approach and methodology used. The authors in [20] define five categories: baseline, web-link based, IR based, Bayesian based, and copying affected. Baseline methods are basic data fusion methods such as majority voting. These methods are not intended to be serious contenders for truth discovery, but are useful for evaluation of ‘real’ algorithms by comparison; a truth discovery algorithm should surely perform better than majority voting to warrant the additional computational cost.

Web-link based algorithms use the structure of links between sources, facts and objects to iteratively compute trust scores for sources and belief scores for facts. Examples include Sums, Average·Log, Investment and PooledInvestment [27]. They are inspired by algorithms that measure authority of web pages based on hyperlink structure, such as Hubs and Authorities [16] and PageRank [25].

IR (information retrieval) based methods use similarity measures popular in information retrieval, such as cosine similarity, to measure similarity between sources’ claims, and use this to infer trust and belief. Examples include Cosine, 2-Estimates and 3-Estimates [14].

Bayesian based methods employ Bayesian probability and statistics to, roughly speaking, determine the probability that sources will claim true facts. Examples include TruthFinder [35], LDT [38], BCCTD [12], PTDCorr [34], and the algorithms defined in [33].

Copying aware algorithms consider copying relationships between sources, and aim to reduce the trustworthiness of sources that copy from others. Examples include the algorithms defined in [11].

The survey in [21] notes an additional class of algorithms that define truth discovery as an optimisation problem. Sources are assigned weights (i.e. trust scores), and variables assigned ‘true’ values. The objective function, to be minimised, is the sum of the distances between ‘true’ values and source claims, with each distance weighted by the source weight. Distances must be measured using a metric suitable for the data type of the variables.

In terms of implementation, most algorithms operate in an iterative fashion, and alternate between source trustworthiness and true fact inference steps. The inference steps vary from simple update rules (e.g. Sums, Average·Log and friends), to complex operations involving several hyperparameters (e.g. Bayesian statistical methods).

Ideally, the implementation for this project would cover a range of algorithms from each of the above categories, including both simple and complex ones. However, there is limited time available, and there are other aspects of the work to consider besides the implementation of algorithms. For this reason we will require only that some of the simpler algorithms are implemented.

The simplest method to implement is undoubtedly the baseline majority voting method. Voting will also be useful for evaluation of algorithms, which is an important consideration for this project. For non-baseline methods, it is my view that the web-link based algorithms proposed in [27] are the simplest to implement, along with the more straightforward Bayesian methods such as TruthFinder. Implementing ‘simple’ algorithms only still provides a useful framework, in which more advanced algorithms can be considered in future work.

Requirement 9. The algorithms implemented include baseline majority voting, Sums, Average·Log, Investment, PooledInvestment and TruthFinder.

Parameters

Many algorithms depend on various parameters that control their operation. Some parameters are specific to particular algorithms, whereas others apply widely to whole classes of algorithms.

It is important that the system allows users to fine-tune these parameters. This is particularly important for evaluation use cases, where one may be interested in not just comparing different algorithms against each other, but comparing instances of the same algorithm with different parameters. Additionally, some parameters have no semantic meaning (e.g. g for Investment and PooledInvestment), so that experimentation may be the only sensible way to choose values for them.

All the algorithms listed in requirement 9 (except baseline voting) operate iteratively and recursively, updating trust and belief scores based on the scores in the previous iteration. There are two parameters that apply globally to this class of algorithm: the mode of iteration and priors.

The recursion aspect requires that initial trust or belief scores are specified. Four of the algorithms listed use fact beliefs as the initial values (called priors in [27]), so we will do the same. Users should be able to select the method of assigning these initial scores; for example all facts could receive the same score (referred to later as fixed priors), or belief could be distributed evenly amongst facts for the same object (uniform priors).

Requirement 10. The method of assigning prior belief scores can be specified when running an algorithm.

The iterative aspect means that algorithms run until a pre-defined stopping criterion is satisfied. The stopping criterion is clearly an important parameter, as the results may be greatly affected by it – both in terms of how well truths are discovered in the data, and in terms of run time.

A basic method of iteration is to simply perform a fixed number of iterations in all cases. More commonly though, algorithms iterate until ‘convergence’ of source trust scores, i.e. until trust scores settle down to fixed values. In a truth discovery problem with m sources, the set of trust scores can be seen as a vector in $\mathbb{R}^m$, which is a metric space when equipped with a suitable metric.

Convergence of a sequence in a metric space $(X, d)$ is defined as follows: a sequence $(x_n)_{n \in \mathbb{N}}$ in $X$ converges to $y \in X$ iff for any $\varepsilon > 0$ there exists $N \in \mathbb{N}$ such that $d(x_n, y) < \varepsilon$ for all $n > N$.

The convergence (or otherwise) of a sequence in this sense cannot be determined via purely computational means, since infinitely many terms would have to be considered.4 A common heuristic is to instead iterate until the distance between successive terms passes below a fixed (small) threshold $\delta$.

For the metric $d$, common choices for $\mathbb{R}^m$ include the metrics induced by the $\ell_p$ norm (for $p \ge 1$) and the infinity norm, which are $d(\vec{x}, \vec{y}) := \left( \sum_{i=1}^{m} |x_i - y_i|^p \right)^{1/p}$ and $d(\vec{x}, \vec{y}) := \max_{1 \le i \le m} |x_i - y_i|$ respectively. The cases $p = 1$ and $p = 2$ of the $\ell_p$ norm, called the Manhattan and Euclidean norms respectively, are widely used.

Some authors consider a looser sense of convergence, where a function that does not qualify as a metric is used to measure distances between iterations. For example, the authors of TruthFinder use cosine distance to determine when iteration should stop, but cosine distance does not satisfy the triangle inequality and is therefore not a metric.5 To facilitate such approaches, we will use the term distance measure to mean any function $\mathbb{R}^m \times \mathbb{R}^m \to [0, \infty)$ that is used to compare the trust scores in successive iterations.

Two parameters required for convergence iteration are therefore the distance measure $d$ and the threshold distance $\delta$. It is possible that the distance always remains above $\delta$; to avoid an infinite loop a maximum iteration count should also be given.
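To make these distance measures concrete, the following sketch implements them for trust-score vectors represented as lists of floats. The function names are illustrative; the implementation may organise this differently.

import math


def manhattan(x, y):
    """l1 distance between two equal-length score vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """l2 distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def chebyshev(x, y):
    """l-infinity distance: the largest change in any single trust score."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """1 minus cosine similarity; not a metric, but usable as a distance measure."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


# A convergence check in the sense described above: stop once successive
# trust vectors are within delta of each other.
def has_converged(old, new, measure=euclidean, delta=1e-6):
    return measure(old, new) < delta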

Requirement 11. The mode of iteration can be specified when running an iterative algorithm. The mode may be one of:

• Fixed iteration, where a fixed number of iterations are performed;

• Convergence iteration, where iteration continues until the distance between trust scores in successive iterations, measured by a user-specified distance measure, becomes smaller than a user-specified threshold level, or a maximum number of iterations are performed. The available distance measures include $\ell_1$, $\ell_2$, $\ell_\infty$ and cosine distance.

4 This issue is discussed in more detail in section 4.2.4.
5 Cosine distance is defined as 1 minus cosine similarity: $d(\vec{x}, \vec{y}) := 1 - \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$, where $\cdot$ denotes the dot product and $\|\cdot\|$ the Euclidean $\ell_2$ norm.

Finally, particular algorithms may have their own parameters, such as the ‘dampening factor’ in TruthFinder. Such parameters should also be possible for the user to define.

Requirement 12. Algorithm-specific parameters can be specified by the user when running an algorithm.

Development

For the development use case, users need to extend code and define their own algorithms. This clearly requires some interaction with the code base, but should not require users to know implementation details of other (unrelated) areas of the software, such as loading datasets and user interfaces. Even some algorithm-related functionality should be handled externally to the user's code, such as checking the stopping criterion and reading parameters from user input.

To achieve this we aim to isolate the algorithm-specific code so that it can be overridden without needing to reimplement generic functionality. Doing so lowers the barrier to entry for algorithm developers, requiring less time investment to start working with the library. We state a non-functional requirement to capture these ideas.

Requirement 13. Users can implement new algorithms without needing to reimplement or know implementation details of unrelated parts of the software.

Design

As with the design for datasets in section 3.1.1, the class hierarchy for the implementation of algorithms can now be set out so as to meet the requirements defined above. It was noted above that the implementation of an algorithm includes both generic functionality and the core steps of the algorithm itself. An abstract base class BaseAlgorithm will contain the generic functionality, and each individual algorithm will be implemented as a sub-class.

Figure 3.5: UML class diagram showing high-level design for algorithms and iterators.

Iterative algorithms have yet more generic functionality, such as initialising iteration and prior beliefs; this will be implemented in a BaseIterativeAlgorithm class.

For iteration, it should be possible to use either fixed or convergence iteration with any algorithm. As such the iteration logic itself – particularly the stopping criterion for convergence – should be defined outside the algorithm classes themselves. This avoids repetition of code in pursuit of requirement 13. A base class Iterator will implement any common functionality, with FixedIterator and ConvergenceIterator sub-classes providing the actual logic. Each iterative algorithm object will then have an Iterator instance as a parameter.

Finally, we note that PooledInvestment is a specialisation of Investment, and so will be implemented as a sub-class.

Trust:
{
    "source 1": 1.0,
    "source 2": 0.5,
    "source 3": 0.5,
    "source 4": 0.75,
    "source 5": 0.1
}

Belief:
{
    "var 1": {"7": 0.9, "8": 0.3, "43": 0.01},
    "var 2": {"green": 0.7, "red": 0.9, "dark red": 0.93}
}

Figure 3.6: Example of the raw results of a truth discovery algorithm as key-value mappings.

These classes and their relationships are illustrated in the UML class diagram in figure 3.5.
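To illustrate how this design separates generic functionality from algorithm-specific code, a rough Python skeleton is sketched below. The class names follow figure 3.5, but the method names and signatures are assumptions for illustration only and do not necessarily match the real code base.

class Iterator:
    """Base class for iteration logic, shared by all iterative algorithms."""
    def __init__(self):
        self.it_count = 0

    def finished(self, old_trust, new_trust):
        raise NotImplementedError

class FixedIterator(Iterator):
    """Stop after a fixed number of iterations."""
    def __init__(self, limit=20):
        super().__init__()
        self.limit = limit

    def finished(self, old_trust, new_trust):
        return self.it_count >= self.limit

class BaseAlgorithm:
    """Generic functionality common to all algorithms."""
    def run(self, dataset):
        raise NotImplementedError

class BaseIterativeAlgorithm(BaseAlgorithm):
    """Adds prior beliefs and an Iterator; sub-classes supply the update step."""
    def __init__(self, iterator=None, priors="fixed"):
        self.iterator = iterator if iterator is not None else FixedIterator()
        self.priors = priors

    def update(self, trust, belief, dataset):
        # Overridden by each concrete algorithm (Sums, Investment, ...)
        raise NotImplementedError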

3.1.3 Results and Evaluation

Having run a truth discovery algorithm, users need to obtain the results in some way. Results include the raw trust and belief scores as produced by the algorithm, and information about the running of the algorithm itself, such as time/memory usage and the number of iterations. Evaluation of algorithms – which is a key task for all the use cases defined at the start of this chapter – will be based solely on the results they return. As such the requirements for evaluation may shape the form results are presented in, so we consider results and evaluation together in this section.

For the 'practitioner' use case, where algorithms are run on real-world datasets, users clearly need to access the trust and belief scores given by the algorithm. For trust scores, a key-value mapping that maps source labels to trust scores is a convenient representation. For belief scores, each variable has a number of proposed values, which in turn have belief scores. A two-level mapping can be used here: keys at the outer level are the variable labels, and the proposed values are mapped to belief scores at the inner level. An example is shown in figure 3.6.

There are a number of derived results that can be obtained from the raw trust and belief scores, including the value for each variable with maximum belief score (these are often taken as the 'true values' in applications), and statistics regarding the trust and belief scores, e.g. the mean and standard deviation. For convenience, these calculations will be implemented in a method that users can call as required.

Finally, recall that in a real application there may be hundreds or thousands of sources and variables. It may be that only a subset of variables are of interest, so it should be possible for users to 'query' their results and only include specified sources and variables.

Requirement 14. Results of an algorithm are given in the key-value mapping format shown in figure 3.6. Methods are available to obtain the fact for a given variable with maximum belief score, and the mean and standard deviation of trust and belief scores. Results can also be limited to a specified set of sources and variables.

When it comes to evaluation of algorithms, two important metrics are time and memory usage. Time usage is straightforward to calculate, but memory usage needs to be defined precisely. For example, it could be interpreted as the maximum memory in use at any given time during the algorithm's operation, or the total memory allocated; these measurements could be vastly different. We defer the precise meaning to the implementation. Time and memory usage are useful for real-world uses, where one may compare different algorithms on the same data to determine which is more efficient, but also for algorithm development and theoretical purposes, where the asymptotic complexity can be estimated by running an algorithm on datasets of increasing size (such datasets could be generated synthetically).

For evaluation of iterative algorithms, analysing the behaviour of convergence is also important. This includes the total number of iterations taken, and the distances between trust scores in successive iterations over time. For example, does the distance decrease linearly, or does the convergence 'slow down' as iteration progresses? Users can answer such questions if detailed information on convergence is provided.

Requirement 15. Time, memory usage and iteration statistics (where applicable) are returned alongside the raw results of an algorithm.

Another important aspect of evaluation is accuracy on supervised datasets. As mentioned previously, this is usually defined as the proportion of variables where the value with highest belief score is the correct one according to the supervised data. However, the precise accuracy calculation may vary depending on the type of dataset in use: consider a case where the true value of a variable is '8', but the most likely value according to an algorithm is '8.00'. Whether this is acceptable as a correct answer depends on the context in which truth discovery is being applied.

Nevertheless, a basic implementation of accuracy can be provided that compares values exactly (i.e. '8' would be considered distinct from '8.00'). This will be useful in most cases, and users may extend this implementation for their own more specific accuracy calculations if required.

Requirement 16. Users can calculate the accuracy of a set of results with respect to a supervised dataset.
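As a sketch of the exact-comparison accuracy described above, the following assumes belief scores in the two-level mapping of figure 3.6 and a supervised mapping from variables to their true values; it is an illustration rather than the project's actual method.

def accuracy(belief, true_values):
    # belief: {variable: {value: belief score}}; true_values: {variable: correct value}
    correct = 0
    counted = 0
    for var, truth in true_values.items():
        scores = belief.get(var)
        if not scores:
            continue                       # variable not present in the results
        best_value = max(scores, key=scores.get)
        counted += 1
        if best_value == truth:            # exact comparison: '8' != '8.00'
            correct += 1
    return correct / counted if counted else 0.0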

Finally, the 'theorist' use case involves analysing theoretical properties of algorithms. One interesting task is to compare the results of an algorithm between two datasets, i.e. to study the effects of a small change in the input data (see the Monotonicity axiom in section 4.2.3 for an example of the type of change that could be studied). This leads to the final requirement for results and evaluation.

Requirement 17. Two sets of results can be compared to view

• changes in trust and belief scores (for sources and facts present in both results);

• differences in time and memory usage, and the number of iterations taken (where applicable).

Design

The class structure for results and related functionality is illustrated in figure 3.7. The Result class will represent the results of an algorithm, and has fields for trust and belief scores (in the key-value format described above), time and memory usage, and the number of iterations. Note that the number of iterations is not applicable to all algorithms – it does not apply to majority voting – yet we include it in the results class. This is because we expect that voting is an outlier in this respect, and that almost all future algorithms to be implemented will be iterative.

Comparing results will be done using the ResultDiff class, which stores the differences between the fields in its two component Result objects.

Figure 3.7: UML class diagram showing high-level design for results.
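A minimal sketch of what Result and ResultDiff might look like follows; the field names mirror the discussion above, but the exact attributes and constructor signatures are assumptions for illustration.

class Result:
    def __init__(self, trust, belief, time_taken, iterations=None):
        self.trust = trust                # {source: trust score}
        self.belief = belief              # {variable: {value: belief score}}
        self.time_taken = time_taken      # seconds
        self.iterations = iterations      # None for non-iterative algorithms

class ResultDiff:
    """Differences between two Result objects, for sources and facts in both."""
    def __init__(self, old, new):
        self.trust = {s: new.trust[s] - old.trust[s]
                      for s in new.trust if s in old.trust}
        self.belief = {
            var: {val: scores[val] - old.belief[var][val]
                  for val in scores if val in old.belief.get(var, {})}
            for var, scores in new.belief.items() if var in old.belief
        }
        self.time_taken = new.time_taken - old.time_taken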

3.1.4 Visualisation

For the 'theorist' use case defined previously, we stated that users will be interested in creating small examples of truth discovery problems to analyse the behaviour of algorithms. This involves inspecting results, and comparing results in different cases. Whilst this is possible using the key-value form of results defined in the previous section, a visual representation is more suitable.

Figure 3.8: Results from figure 3.6 represented graphically. Darker colours correspond to higher trust and belief scores.

For example, consider the set of results that were given in the textual key-value pairs format in figure 3.6. They may alternatively be represented by the colour-coded graph shown in figure 3.8, where darker colours correspond to higher trust and belief scores. It is my view that the visual representation is much more useful for extracting key information at a glance. For example, one immediately sees which sources are most and least trustworthy, and which values are most and least believable. It can also aid in developing intuition for what is going on, which is important for theoretical work. Note that this only applies to small datasets: any more than a handful of sources, values and variables would cause the graphs to become overly crowded.

Graphs can also be useful for visualising datasets alone, i.e. without colouring the nodes according to the results of some algorithm. For example, chapter 4 includes several figures that demonstrate datasets with particular properties, and datasets that provide counterexamples for particular properties of algorithms.

Requirement 18. A dataset can be represented visually as a graph with nodes for sources, variables and values, and edges indicating connections between them. The results of an algorithm can additionally be shown in the graph by colouring sources and values according to their respective trust and belief scores.

We note that the graph representation corresponds to the definition of input to a truth discovery problem in the theoretical analysis of chapter 4.

In addition to visualising the end results of an algorithm, one may wish to visualise the convergence of results. Requirement 15 partly addresses this by ensuring users can access statistics regarding the convergence of trust scores for all sources as a whole. However for small datasets where a graphical representation is feasible, we can do even better, and visualise the convergence of scores for individual sources by means of an animation. This will simply be a sequence of graph images colour-coded according to trust and belief scores as described above.

Requirement 19. Animations can be generated that show the results of an iterative algorithm at each iteration as a colour-coded graph.

Design

A UML diagram showing the design and class hierarchy for graphing datasets and results is shown in figure 3.9. Note that this diagram references classes first defined in previous sections, namely Result, Dataset and BaseIterativeAlgorithm.

The core class for producing graphs will be GraphRenderer. It will contain a Dataset object and a ColourScheme object for specifying the colour palette; other graphical settings (node size, border widths etc.) are encapsulated in the display settings field in the diagram for brevity. The ResultsGradientColourScheme class will implement colouring nodes according to their scores in the results of an algorithm, and is a specialisation of ColourScheme.

Figure 3.9: UML class diagram showing high-level design for graphical representation of datasets and results.

For animations, an Animator class will take dataset and algorithm objects as input, and create an image for each partial result as the algorithm progresses. Individual frames will be rendered via a GraphRenderer to avoid duplication of functionality.

3.1.5 User Interfaces

So far we have discussed the form of users' interactions with the system, but not the specific interface they will use. In this section we consider appropriate user interfaces for the use cases and tasks described.

Since the use cases have sometimes widely varying goals and constraints, there is no single interface that will be suitable for all purposes. For example, using the software as a tool for theoretical analysis requires building datasets by hand and detailed inspection of results. By contrast, real-world applications need to load datasets from files, and the scale of the data makes it impractical to look at detailed results for each source and variable.

There is also variation in aims for the same broad use case: e.g. algorithm development clearly requires interacting with the code itself on the one hand (i.e. using an API interface from code), but may also require investigating results on smaller datasets to get a feel for an algorithm's behaviour. The latter could also be done from code, but a graphical interface could allow for simpler data entry and visualising results. With this in mind, we propose three separate user interfaces:

• API: a simple and well-documented Python API covering the entire codebase (justification for using Python as the programming language is given in section 3.2) will support algorithm development, loading datasets from bespoke formats and integrating the library with other code. Additionally, an all-encompassing API ensures that all functionality is available to users in at least some form, even if it is not implemented in other more accessible interfaces.

• Command-line: this will be suitable for tasks involving large datasets – where graphical representations are infeasible – and for interfacing with other code at a higher level (for example, to be used with programming languages other than Python). Also, well-designed command-line interfaces can often be simpler to use than graphical ones, particularly when there are many sub-commands and options available.

• Web-based: for small-scale datasets and non-technical users, a web-based interface will be provided. This will make it easy for new users to try out the software without learning a command-line interface or API. It also allows the graphical representations discussed above to be presented in a simple way.

API

In this context, by API we simply mean a set of public-facing classes for users to interact with in their own code. These classes should allow the user to have full control over the operation of algorithms, datasets etc., whilst providing a simple interface that does not require knowledge of unrelated implementation details. In particular, the user should be able to treat the system as a 'black box', providing their input and receiving output without consideration for how the implementation actually works. The design of the classes that comprise the software should therefore consider the tasks users wish to perform, which details are relevant to them, and what should be hidden as implementation details.

Besides the code itself, thorough documentation and a suite of examples are essential for a successful API.

Requirement 20. A Python API is available that allows all features of the system to be accessed without detailed knowledge of the system as a whole. Detailed documentation and examples of API usage are also provided.

Command-Line

A command-line interface is suitable for many of the tasks described throughout this chapter, particularly those for which input and output needs to be stored in files, where data is large and cannot be reasonably represented visually, and where few steps of interaction are required. It will also allow the library to be used programmatically with other projects, particularly if a machine-readable output format is used.

As an example, consider evaluating accuracy, run-time and trust scores for a particular algorithm on a synthetic dataset. The user supplies a few simple parameters, including the file to read the CSV dataset from, and receives some simple output. An example of how this could look, following the established conventions for command-line programs in the UNIX world, is shown in figure 3.10.

truthdiscovery run --algorithm sums --dataset synth-data.csv \
    --supervised --output time accuracy trust

Figure 3.10: Example of command-line arguments for running an algorithm on synthetic data.

Here truthdiscovery is the name of the program, run is the relevant sub-command, and options are given in the long format used by many GNU utilities.

Output should be printed to stdout – this complies with established conventions and allows users to either inspect results by eye in their terminal for small-scale datasets, or use output redirection functionality from their shell to save output to a file or pipe to other programs.

In terms of the format of the output, YAML (https://yaml.org/) will be used. YAML is a data serialisation format that is both human and machine readable, with implementations available for a wide range of programming languages. A hypothetical example of the output of the command in figure 3.10 is shown in figure 3.11.

sums:
  accuracy: 0.625
  time: 0.1234
  trust:
    0: 0.2
    1: 0.4
    2: 1.0

Figure 3.11: Hypothetical YAML output for the command shown in figure 3.10.

The full list of tasks that will be available in the command-line interface is as follows.

• Running algorithms on CSV datasets, specifying the output fields and parameters for algorithms;

• Generating synthetic data;

• Producing visual graph representations of datasets, and saving these as images to files.

Requirement 21. A command-line interface is available that implements the tasks listed above. Its form complies with conventions for command-line programs. Output is given in YAML format where appropriate.

Web-based

A web-based interface will provide a graphical front end to the system. This is well-suited for casual and non-technical users, as it requires no installation and virtually all users are familiar with using the web.


A task that will greatly benefit from a graphical interface is creating example datasets by hand, as required for the 'theorist' use case, and to some extent algorithm development. Recall that the matrix form of input (requirement 2) is most suitable in these cases. A simple approach to inputting such matrices in a graphical context is to display an empty matrix in which users can click the cells to interactively provide values. This is less error prone than manual CSV entry, and arguably more user-friendly.

Viewing the results of algorithms for small datasets will also be possible in the web interface. There is some overlap with the command-line client in this sense, but graphical elements such as colour, font size and text styles can be used in a web page to enhance the output.

Finally, visualising datasets and results through images and animations, as per section 3.1.4, is simple in a web interface. When one is interested in simply viewing such imagery, as opposed to saving to a file for later use, it is much more convenient to have the images displayed immediately alongside the results.

It is worth noting that the web interface has its limitations, and is not suitable for all use cases. For example, large datasets will be difficult to input, and the bespoke formats in which many real-world datasets are stored cannot be used. The interface also cannot be accessed programmatically, and making the site available to the public requires dedicated hosting which may cost real money. Nevertheless, it will complement the API and command-line interfaces and provide value for its intended use cases.

Requirement 22. A web interface is available that provides a graphical front end to the system. Users can interactively input a dataset in the matrix format described in requirement 2, and select algorithms to run. Results can be viewed, and images and animations as per section 3.1.4 are shown.

Design

The UML class diagram in figure 3.12 shows the high-level design for the implementation of the user interfaces discussed above. The API interface is not represented by a single class: the API in fact consists of all the classes defined throughout this chapter. The two more concrete interfaces, command-line and web-based, are represented by classes CommandLineClient and WebClient respectively. Some common functionality is required in

both interfaces: e.g. parsing algorithm names, parameters and datasets from user-supplied strings to their object representations. Such functions will be implemented in a base class BaseClient.

Users will also be able to specify which output fields they wish to receive, as was illustrated in the command-line example in figure 3.10. An enumeration OutputFields lists the available fields.

Figure 3.12: UML class diagram showing high-level design for user interfaces.

3.2 Implementation

This section discusses the software implementation at a lower level, covering how the requirements of the preceding section were actually implemented in code. Justification for the programming languages and libraries used is given first, followed by detailed descriptions of the algorithms identified in requirement 9.

3.2.1 Programming Languages

Python was chosen as the core programming language for this project – specifically Python 3. Development and testing was performed with version 3.6.7, but it is expected that it will work with versions 3.3 and newer (to the best of my knowledge, the newest language feature used is the yield from expression, which was added in version 3.3; see https://docs.python.org/3/whatsnew/3.3.html#pep-380 for details).

There were several reasons for choosing Python. Firstly, it is an extremely popular language – according to the TIOBE index (https://www.tiobe.com/tiobe-index/) it is the fourth most popular worldwide, as of April 2019. This is an important consideration, since a major goal for the software is to be extendible to allow for new algorithms to be developed, and for it to integrate nicely with other code. An unpopular or obscure programming language would go against this goal.

Due to its popularity, Python has a rich ecosystem of libraries and packages surrounding it. This allows external libraries to be used during development to provide functionality that would otherwise take too long to implement in the time available. For example, I was able to use the cairo library (https://github.com/pygobject/pycairo) to produce graphics and Flask (http://flask.pocoo.org) to implement a web server – both tasks could have constituted entire projects in their own right if no suitable libraries were available. Such libraries are invariably more comprehensive than a 'home-grown' solution would be anyway, since they are dedicated projects aimed specifically at graphics and web frameworks respectively.

Whilst third-party libraries are not unique to Python, they are particularly plentiful in the Python world, which makes it a suitable choice.

Python also has strong object-oriented programming capabilities, which were useful for creating a clean and straightforward API. Code for different portions of the system can be easily compartmentalised into separate classes, making it so that users only need to be aware of the specific classes relevant to their use case.

Finally, Python is the language I personally have the most experience with. Using it for this project allowed me to get started quickly, without

having to learn a new language. This meant that more features could be developed in the limited time available.

For the web interface to the software, an interactive web page was required: e.g. for inputting data, selecting which results to view and controlling animations. Such interactive elements require the use of JavaScript. The JavaScript code interacts with the Python server via HTTP.

3.2.2 Third-party Libraries Used

The main purpose of the practical component of this project is to implement truth discovery algorithms. These algorithms often involve lots of numerical computations, which can be represented conveniently in terms of matrix operations on large matrices. The de-facto standard library in the Python world for numerical computing is numpy (https://www.numpy.org/). Amongst other things, numpy has a highly efficient n-dimensional array implementation, written in C, which supports various operations including matrix multiplication (the case n = 2 of an n-dimensional array is a matrix). Another library, scipy (https://www.scipy.org/) – which is often used in conjunction with numpy – implements sparse arrays, which are optimised to store large arrays with few non-zero entries in a memory-efficient way. Operations can also be performed on sparse arrays efficiently without converting to a 'dense' format. Sparse arrays were essential in this project for representing large real-world datasets, and numpy and scipy were used extensively throughout.

For generating graphics, cairo was used. The graphs described in requirement 18 only require simple drawing, such as circles, lines, rectangles and basic text rendering. cairo was more than sufficient for this, and provides a simple and straightforward API. For generating animations as per requirement 19, imageio (https://imageio.github.io/) was used to combine PNG images produced by cairo into an animated GIF (note that cairo and imageio are not used for the images and animations shown in the web interface: JavaScript canvas drawing is used instead).

As briefly mentioned already, Flask was used to provide the back-end server for the web interface. Testing was performed with the help of pytest (https://docs.pytest.org/en/latest/).


For the web front-end, I opted to use AngularJS (https://angularjs.org/), a JavaScript library designed to simplify the development of web applications and encourage rapid development. In particular, Angular removed the need for large amounts of simple and tedious code in my application, and its simplicity allowed me to implement features that I may not have bothered with otherwise. For the visual aspect of the web interface, I used a CSS framework called Spectre.css (https://picturepan2.github.io/spectre/), which handles visual styling of elements on the page and provides responsive and mobile-friendly layouts.

Sphinx (http://www.sphinx-doc.org/en/master/) was used for documentation. Sphinx compiles reStructuredText sources to HTML, and can automatically parse Python source files to document classes and methods. For this project the documentation includes a user guide and examples of using the API and CLI client, with links to the class API documentation as appropriate.

3.2.3 Truth Discovery Algorithms

In this section we describe in detail the algorithms implemented for this project – namely majority voting, Sums, Average·Log, Investment, PooledInvestment and TruthFinder – and how they were implemented in code.

The format of truth discovery datasets was discussed from a user's perspective in section 3.1.1. In terms of actual implementation, yet another representation is most appropriate: the dataset is represented by two matrices (distinct from the matrix format described in requirement 2), trust and belief scores are stored as vectors, and scores are updated at each iteration using matrix operations.

Some notation is required to properly define the algorithms above (this notation is similar to that which will be adopted in the theoretical work in section 4.2, but adapted to the set-up here). Consider a fixed dataset of $m$ sources labelled $1, \ldots, m$ and $n$ distinct facts labelled $1, \ldots, n$. Note that the labelling is arbitrary. We denote the set of sources claiming fact $j$ by $\mathrm{src}(j)$, the set of facts claimed by source $i$ by $\mathrm{facts}(i)$, and the object relating to fact $j$ by $\mathrm{obj}(j)$. The set of facts about the same object as $j$ – the facts mutually exclusive with $j$ – is $\mathrm{mut}(j) = \{k : \mathrm{obj}(k) = \mathrm{obj}(j)\}$.

The trust scores for an algorithm at iteration $t \in \mathbb{N}$ will be denoted $T^t \in \mathbb{R}^m$, and the belief scores $B^t \in \mathbb{R}^n$. Subscripts will denote the entries

in these vectors; i.e. $T^t_i$ is the trust score for source $i$ at iteration $t$. The following matrices can be used to represent the dataset and perform the trust/belief score updates.

• Source-claims matrix: this binary $m \times n$ matrix indicates which claims are made by which sources, and will be denoted here by $M$. It is defined as
$$[M]_{ij} = \begin{cases} 1 & \text{if } i \in \mathrm{src}(j) \\ 0 & \text{otherwise} \end{cases}$$
i.e. 1 if source $i$ claims fact $j$, and 0 otherwise.

• Mutual exclusion matrix: this $n \times n$ matrix, denoted $X$, indicates which facts relate to the same object, i.e. which facts are mutually exclusive:
$$[X]_{kj} = \begin{cases} 1 & \text{if } \mathrm{obj}(k) = \mathrm{obj}(j) \\ 0 & \text{otherwise} \end{cases}$$

Note that X is symmetric.
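As a small illustration (not the project's internal loading code), the following sketch constructs M and X as scipy sparse matrices for a toy dataset with three sources, two objects and four facts:

import numpy as np
from scipy.sparse import csr_matrix

m, n = 3, 4
obj = np.array([0, 0, 1, 1])            # obj(j) for each fact j: facts 0, 1 belong
                                        # to object 0; facts 2, 3 to object 1

# Source-claims matrix M: source 0 claims facts 0 and 2, source 1 claims
# facts 1 and 2, and source 2 claims facts 1 and 3
rows = [0, 0, 1, 1, 2, 2]
cols = [0, 2, 1, 2, 1, 3]
M = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(m, n))

# Mutual exclusion matrix X: [X]_kj = 1 iff obj(k) == obj(j)
X = csr_matrix((obj[:, None] == obj[None, :]).astype(float))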

In practice, both $M$ and $X$ are often extremely sparse, in that most entries are 0. Note that a truth discovery dataset is uniquely determined by $M$ and $X$, up to the ordering of the sources and facts. Using this notation, each algorithm listed above is specified by three components:

• the prior belief scores $B^0$, which may depend on $M$ and $X$;

• the formulae for obtaining $T^{t+1}$ and $B^{t+1}$ from $B^t$, $T^t$, $M$ and $X$;

• the stopping criterion.

Note that we do not claim all iterative truth discovery algorithms are determined by these three factors, merely that the ones implemented in this project are. However, this structure is common across the literature. Since it was stated in requirement 11 that the user should have full control over the stopping criterion, we do not consider it as part of the algorithm here.

Prior Belief Scores

It was stated in requirement 10 that the user should be able to specify how the prior belief scores are assigned to facts. We choose the priors defined by Pasternack and Roth in [27] for the available options: fixed, uniform and voted.

• Fixed: each fact is assigned a score of 0.5:
$$B^0 = 0.5\,e_n$$
where $e_n = [1, \ldots, 1] \in \mathbb{R}^n$ is the vector consisting of $n$ ones.

• Uniform: each object is allocated unit belief, which is distributed evenly amongst the facts for that object, i.e. the score for fact $j$ is $\frac{1}{|\mathrm{mut}(j)|}$. Note that $|\mathrm{mut}(j)|$ is the sum of the $j$-th row of $X$, which is given by the $j$-th entry of $Xe_n$. We may therefore write
$$B^0 = \frac{1}{Xe_n}$$
where the division is performed entry-wise.

• Voted: the belief in a fact is proportional to the number of sources claiming it, with the scores scaled such that the total belief across each object is 1:
$$B^0_j = \frac{|\mathrm{src}(j)|}{\sum_{k \in \mathrm{mut}(j)} |\mathrm{src}(k)|}$$
Note that $|\mathrm{src}(j)|$ is the sum of the $j$-th column of $M$, i.e. the $j$-th entry of $M^T e_m$. Set $v = M^T e_m$. Also note that $k \in \mathrm{mut}(j)$ iff $[X]_{jk} = 1$, and $[X]_{jk} = 0$ otherwise. The denominator above is therefore
$$\sum_{k \in \mathrm{mut}(j)} |\mathrm{src}(k)| = \sum_{k=1}^{n} [X]_{jk} v_k = [Xv]_j$$
Performing entry-wise division of vectors, we may write
$$B^0 = \frac{v}{Xv}$$

numpy supports matrix multiplication and entry-wise division as used above, which makes computing prior beliefs extremely simple when expressed in this form. We can now define the truth discovery algorithms in terms of matrix operations in a similar way.
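For illustration, the three priors can be computed with a few lines of numpy. Dense arrays are used here for brevity; the same expressions carry over to scipy sparse matrices with minor changes, and this sketch is not the project's actual code.

import numpy as np

def prior_beliefs(mode, M, X):
    m, n = M.shape
    e_n = np.ones(n)
    if mode == "fixed":
        return 0.5 * e_n
    if mode == "uniform":
        return 1 / (X @ e_n)               # 1 / |mut(j)| for each fact j
    if mode == "voted":
        v = M.T @ np.ones(m)               # v_j = |src(j)|
        return v / (X @ v)
    raise ValueError("unknown prior mode: %s" % mode)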

Sums

Inspired by the Hubs and Authorities algorithm [16] for ranking web pages based on the hyperlink structure of the web, Sums sets the trust score for a source to the sum of the belief scores of its facts, and vice versa:

$$T^{t+1}_i = \sum_{j \in \mathrm{facts}(i)} B^t_j, \qquad B^{t+1}_j = \sum_{i \in \mathrm{src}(j)} T^{t+1}_i$$

Sums has a natural matrix representation, which is also given in the original Hubs and Authorities paper. Note that $j \in \mathrm{facts}(i)$ iff $i \in \mathrm{src}(j)$ iff $[M]_{ij} = 1$, and $[M]_{ij} = 0$ otherwise. Hence
$$T^{t+1}_i = \sum_{j=1}^{n} [M]_{ij} B^t_j = [MB^t]_i$$
and similarly
$$B^{t+1}_j = [M^T T^{t+1}]_j$$
so the trust and belief updates are simply
$$T^{t+1} = MB^t$$
$$B^{t+1} = M^T T^{t+1}$$
To prevent the trust and belief scores growing without bound, $T^{t+1}$ and $B^{t+1}$ are normalised by dividing by $\max_i T^{t+1}_i$ and $\max_j B^{t+1}_j$ respectively after the above operations.
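In numpy the Sums update is essentially a direct transcription of the two matrix equations above; the sketch below (with illustrative names, not the project's API) performs a single iteration:

import numpy as np

def sums_iteration(M, belief):
    trust = M @ belief                     # T^{t+1} = M B^t
    new_belief = M.T @ trust               # B^{t+1} = M^T T^{t+1}
    # Normalise by the maximum to prevent unbounded growth
    return trust / trust.max(), new_belief / new_belief.max()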

Average·Log

Sums allows sources to inflate their trust score by simply making many claims, which is potentially undesirable. The number of claims should not be ignored, however: a source with 90% accuracy over a hundred claims is surely more trustworthy than one with 90% accuracy over 10 [27]. Average·Log attempts a compromise by setting the trust score for a source to the average belief score of its claims, multiplied by the logarithm of the number of facts it claims:
$$T^{t+1}_i = \log(|\mathrm{facts}(i)|) \cdot \frac{\sum_{j \in \mathrm{facts}(i)} B^t_j}{|\mathrm{facts}(i)|}$$

Taking a logarithm ensures that making many trivial claims has diminishing returns for sources. The belief score update is the same as for Sums. Observe that $|\mathrm{facts}(i)|$ is the sum of the $i$-th row of $M$, which is the $i$-th entry of $Me_n$. It was shown for Sums that the sum of the belief scores is the $i$-th entry of $MB^t$. Set

$$w = \frac{\log(Me_n)}{Me_n}$$
where the log and division are taken entry-wise. Then

$$T^{t+1} = w \circ MB^t$$
where $\circ$ denotes entry-wise multiplication of vectors (also called the Hadamard product). As with Sums, normalisation is performed to prevent numerical overflow.
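The Average·Log trust update differs from Sums only by the weight vector w, as the following sketch (again illustrative, not the project's code) shows; note that a source making a single claim receives a weight of log(1) = 0.

import numpy as np

def average_log_iteration(M, belief):
    claim_counts = M @ np.ones(M.shape[1])     # |facts(i)| for each source i
    w = np.log(claim_counts) / claim_counts
    trust = w * (M @ belief)                   # T^{t+1} = w ∘ M B^t
    new_belief = M.T @ trust                   # belief update as in Sums
    return trust / trust.max(), new_belief / new_belief.max()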

Investment

In Investment, sources invest their trust among their claims, belief scores are grown non-linearly, and sources receive trust returns proportional to their investment in the next iteration (relative to other sources). More detail on the intuition behind the algorithm is given in [27]. The trust and belief updates are as follows.

$$T^{t+1}_i = \sum_{j \in \mathrm{facts}(i)} B^t_j \cdot \frac{T^t_i}{|\mathrm{facts}(i)| \cdot \sum_{r \in \mathrm{src}(j)} \frac{T^t_r}{|\mathrm{facts}(r)|}}$$
$$B^{t+1}_j = G\left( \sum_{i \in \mathrm{src}(j)} \frac{T^{t+1}_i}{|\mathrm{facts}(i)|} \right)$$
where $G(x) = x^g$. The parameter $g$ is set to 1.2 in [27]. In our implementation the $g$ value can be specified by the user, but defaults to 1.2.

Each source invests its trust score evenly amongst its facts: the investment amount is $\frac{T^t_i}{|\mathrm{facts}(i)|}$ for source $i$. Let $S^t$ be the vector of investment amounts at iteration $t$, i.e.
$$S^t = \frac{T^t}{Me_n}$$
Then we have
$$T^{t+1}_i = \sum_{j=1}^{n} [M]_{ij} \cdot B^t_j \cdot \frac{S^t_i}{\sum_{r=1}^{m} [M]_{rj} \cdot S^t_r} = \sum_{j=1}^{n} S^t_i \cdot [M]_{ij} \cdot B^t_j \cdot \frac{1}{[M^T S^t]_j} = S^t_i \cdot \sum_{j=1}^{n} \frac{[M]_{ij}}{[M^T S^t]_j} \cdot B^t_j$$

Define an $m \times n$ matrix $N$ by $[N]_{ij} = \frac{[M]_{ij}}{[M^T S^t]_j}$; then the sum is the $i$-th entry of $NB^t$, so
$$T^{t+1} = S^t \circ NB^t$$
For the belief update, we have

$$B^{t+1}_j = G\left( \sum_{i=1}^{m} [M]_{ij} \cdot [S^{t+1}]_i \right) = G\left( [M^T S^{t+1}]_j \right)$$
so $B^{t+1}$ is obtained by applying $G$ to each entry in $M^T S^{t+1}$. Again, normalisation is performed by dividing by the maximum trust and belief scores.
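Putting the pieces together, one Investment iteration might be sketched as follows (assuming every source makes at least one claim and every fact has at least one source, so that no division by zero occurs; this is an illustration rather than the project's implementation):

import numpy as np

def investment_iteration(M, trust, belief, g=1.2):
    claim_counts = M @ np.ones(M.shape[1])     # |facts(i)|
    S = trust / claim_counts                   # S^t: investment per claim
    N = M / (M.T @ S)                          # [N]_ij = [M]_ij / [M^T S^t]_j
    new_trust = S * (N @ belief)               # T^{t+1} = S^t ∘ N B^t
    new_trust = new_trust / new_trust.max()
    S_next = new_trust / claim_counts          # S^{t+1}
    new_belief = (M.T @ S_next) ** g           # B^{t+1} = G(M^T S^{t+1})
    return new_trust, new_belief / new_belief.max()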

PooledInvestment

PooledInvestment uses the same trust update as Investment, but belief scores are linearly scaled after applying $G$ so that the total across each object is preserved. We defer the detailed interpretation to [27]. The belief update is as follows: with $H^{t+1} \in \mathbb{R}^n$ defined by $H^{t+1}_j = \sum_{i \in \mathrm{src}(j)} \frac{T^{t+1}_i}{|\mathrm{facts}(i)|}$:

$$B^{t+1}_j = H^{t+1}_j \cdot \frac{G(H^{t+1}_j)}{\sum_{k \in \mathrm{mut}(j)} G(H^{t+1}_k)}$$

Note that $H^{t+1}_j$ is the total amount 'invested' in fact $j$ before applying $G$ in Investment, so we have $H^{t+1} = M^T S^{t+1}$ with $S^{t+1}$ defined as above. Write $\tilde{H}^{t+1} = G(H^{t+1})$, where $G$ is applied entry-wise. Then the sum in the denominator is
$$\sum_{k \in \mathrm{mut}(j)} G(H^{t+1}_k) = \sum_{k=1}^{n} [X]_{jk} \tilde{H}^{t+1}_k = [X\tilde{H}^{t+1}]_j$$
Hence we have
$$B^{t+1} = H^{t+1} \circ \frac{\tilde{H}^{t+1}}{X\tilde{H}^{t+1}}$$
Normalisation is performed after the trust and belief updates.
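The PooledInvestment belief update can then be sketched as below, reusing the investment amounts S^{t+1} computed from the Investment trust update (illustrative only, not the project's code):

import numpy as np

def pooled_investment_belief(M, X, new_trust, g=1.2):
    claim_counts = M @ np.ones(M.shape[1])     # |facts(i)|
    S_next = new_trust / claim_counts          # S^{t+1}
    H = M.T @ S_next                           # H^{t+1} = M^T S^{t+1}
    H_g = H ** g                               # G applied entry-wise
    return H * H_g / (X @ H_g)                 # B^{t+1} = H ∘ G(H) / (X G(H))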

TruthFinder

TruthFinder, by Yin et al. [35], was one of the first truth discovery algorithms to be introduced. In contrast to the algorithms above, belief scores in TruthFinder have a pseudo-probabilistic interpretation: $B^t_j$ is (an estimate of) the probability that fact $j$ is correct. TruthFinder also uses prior trust scores instead of prior belief scores; the trust for each source is set to a fixed initial value $t_0$, which can be specified by the user in our implementation but defaults to 0.9.

TruthFinder also considers implications between claims, for cases where confidence in one fact should increase (or decrease) the confidence in another. For each pair of facts $f_1, f_2$ relating to a common object, an implication value $\mathrm{imp}(f_1 \to f_2) \in [-1, 1]$ describes the level of implication: 1 for strong positive implication, -1 for strong negative implication, and 0 for no implication. The specific method of assigning implication values is domain-specific.

An example from [35] is as follows: suppose a fact $f_1$ states that the author of a particular book is 'Jennifer Widom', and a second fact $f_2$ gives the authors of the same book as 'Jennifer Widom and Stefano Ceri'. If we have high confidence in $f_2$, then $f_1$ is incomplete and thus should receive low confidence: the implication $\mathrm{imp}(f_2 \to f_1)$ should be low. On the other hand, it is common for sources to list only the first author of a book, even when multiple authors exist. If we are confident about $f_1$, then we should also be confident about $f_2$, because $f_2$ is consistent with $f_1$. Thus $\mathrm{imp}(f_1 \to f_2)$ should be high. This example illustrates that the implication values are asymmetric.

As for the actual definition of TruthFinder, it is already given in terms of matrix operations in the original paper, using an $m \times n$ matrix $U$ and an $n \times m$ matrix $V$ defined as follows ($U$ and $V$ are called $A$ and $B$ in the original paper).
$$[U]_{ij} = \begin{cases} 1/|\mathrm{facts}(i)| & \text{if } j \in \mathrm{facts}(i) \\ 0 & \text{otherwise} \end{cases}$$
$$[V]_{ji} = \begin{cases} 1 & \text{if } j \in \mathrm{facts}(i) \\ \rho \cdot \mathrm{imp}(f_k \to f_j) & \text{if } k \in \mathrm{facts}(i) \text{ for some } k \in \mathrm{mut}(j) \\ 0 & \text{otherwise} \end{cases}$$
where $\rho \in [0, 1]$ is a parameter controlling the influence of implications between facts. It defaults to 0.5 in our implementation.

The definitions of the trust and belief updates make use of additional vectors $\tau \in \mathbb{R}^m$ and $\sigma \in \mathbb{R}^n$; these represent trust and belief scores scaled from $[0, 1]$ to $[0, \infty)$ by taking a logarithm. The precise definitions are as follows (as usual, logarithms, scalar addition etc. for vectors are taken entry-wise).
$$\tau^{t+1} = -\log(1 - T^t)$$
$$\sigma^{t+1} = V\tau^{t+1}$$
$$B^{t+1} = \frac{1}{1 + \exp(-\gamma \cdot \sigma^{t+1})}$$
$$T^{t+1} = UB^{t+1}$$

Here $\gamma \in (0, 1)$ is a parameter called the dampening factor; details can be found in the original paper. It defaults to 0.3 in our implementation.
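Given U, V and the previous trust scores, a single TruthFinder iteration is again a direct transcription of the formulas above (a sketch with illustrative names, not the project's API; note the potential for log(0) if a trust score reaches exactly 1, which is relevant to the convergence issue discussed in section 3.3.3):

import numpy as np

def truthfinder_iteration(U, V, trust, gamma=0.3):
    tau = -np.log(1 - trust)                   # scaled trust scores; -inf if trust == 1
    sigma = V @ tau                            # scaled belief scores
    belief = 1 / (1 + np.exp(-gamma * sigma))  # B^{t+1}
    new_trust = U @ belief                     # T^{t+1} = U B^{t+1}
    return new_trust, belief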

3.3 Results and Evaluation

Having described the requirements for the truth discovery software framework in section 3.1 and aspects of its implementation in section 3.2, we now demonstrate exactly what was implemented and evaluate the system against the requirements. The demonstrations will cover each of the use cases identified, including real-world truth discovery applications and evaluation of algorithms.

3.3.1 Real-World Dataset Demonstration

The main use case for the system is to run truth discovery algorithms on real-world datasets. The stock dataset briefly described in section 3.1.1 provides a suitable example of 'large' real-world data to test with: it contains 2,843,803 claims covering 336,000 variables from 55 sources (the dataset was collected by the authors of [20], and is available to download at http://lunadong.com/fusionDataSets.htm).

The claims relate to 1,000 distinct stocks. For 100 of these, data was manually obtained from NASDAQ by the original authors, and these values are taken to be ground truths. This means we can not only run algorithms on the large dataset, but also load supervised data and assess accuracy with respect to the NASDAQ ground truths.

Additionally, the data is in a TSV (tab separated values) format, which allows us to demonstrate the base classes FileDataset and FileSupervisedData for loading data from custom formats. The Python classes are shown in figure 3.13. Note that no code is given to actually open the files and construct the matrices required for running algorithms: this is handled automatically by the FileDataset and FileSupervisedData base classes.

A small script was created to run each algorithm on the dataset, with each algorithm initialised with its default parameters. The results are shown in figure 3.14.

From these results we see that TruthFinder is by far the quickest (excluding majority voting), but achieves poor accuracy. It should be noted that TruthFinder was run without considering implications between claims, which may explain this. We also see that Sums and PooledInvestment are the only algorithms that perform better than naïve majority voting.

Figure 3.13: Python code for loading the stock dataset.

loading data...
loaded in 208.617 seconds
loading true values...
loaded in 0.119 seconds
dataset has 55 sources, 2843803 claims, 336000 variables
running MajorityVoting...
0.280 seconds, 0.641 accuracy
running Sums...
11.998 seconds, 0.646 accuracy
running AverageLog...
12.172 seconds, 0.641 accuracy
running Investment...
31.216 seconds, 0.423 accuracy
running PooledInvestment...
18.545 seconds, 0.671 accuracy
running TruthFinder...
1.598 seconds, 0.439 accuracy

Figure 3.14: Results for the stock dataset for each algorithm.

3.3.2 Synthetic Data Accuracy Experiments

Another key use case for the system is evaluation of algorithms with respect to their accuracy on synthetic datasets. Unlike real-world datasets, synthetic datasets are easy to obtain, allow accuracy to be calculated, and their parameters can be precisely controlled. One can also generate multiple datasets with varying parameters to study the effects of changes in certain parameters.

Recall the parameters available in this project: the source reliability scores (interpreted as probabilities), the number of variables to generate, the probability that a source will make a claim for a given variable (referred to as the claim probability), and the domain size for each generated variable. We will explore this parameter space in a number of directions to demonstrate the types of analysis that can be performed with the system.

The first experiment studies the effects of the distribution of source reliability scores. We consider three distributions:

• Mostly bad: a third of sources have reliability score 0.75, and the remaining two-thirds have score 0.25, i.e. most sources are correct only 25% of the time.

• Uniform: reliability scores are drawn from a uniform distribution on [0, 1].

• Normal: reliability scores are drawn from a normal distribution with mean 0.5 and standard deviation 0.15. It is possible for scores to be less than 0 or greater than 1: they are clipped to 0 or 1 in such cases.

For each of these distributions, source reliability scores were generated and a synthetic dataset produced with 100 sources, 100 variables, 0.1 claim probability and 5 domain values. Each algorithm was then run (on the same dataset), and its accuracy computed. This process was repeated 10 times for each distribution in an attempt to cancel out random effects of the reliability score and claim generation. The mean accuracy scores were then taken as the final results, which are shown in figure 3.15.

Figure 3.15: Source trust distribution experiment on synthetic datasets.

Figure 3.16: Claim probability experiment on synthetic datasets.

As one might expect, the 'Mostly bad' distribution leads to the poorest accuracy for all algorithms, with the exception of PooledInvestment where 'Normal' gave slightly worse accuracy.

For the next experiment, the effects of the claim probability $p_c$ on accuracy were investigated. Recall that for each artificial source and variable, a claim is made with probability $p_c$. In the extreme case $p_c = 1$, every source claims a value for every variable. The dataset becomes more 'sparse' as $p_c$ decreases.

In the experiment, claim probability was increased from 0.1 to 1 in increments of 0.05. For each value, 10 synthetic datasets were generated with a uniform trust distribution over 100 sources, 100 variables, and 10 domain values. The mean accuracy of each algorithm across the 10 datasets is shown in figure 3.16.

The results are again unsurprising. For low claim probabilities, the

expected number of claims for each variable is small. This means it is less likely for multiple sources to agree on the value for a variable – but this is one of the main ways in which truth discovery algorithms determine the true values. As a result we see accuracy rise dramatically as the claim probability increases. All algorithms achieve the maximal accuracy score of 1 for $p_c$ greater than around 0.5.

The final experiment concerns the domain size, which is the number of values the artificial sources choose from for their claims – in this project the domain of possible values is the same for each artificial variable. The minimum domain size is 2; values from 2 to 20 were tested for this demonstration. Other parameters were kept at the same values as in the claim probability experiment, with claim probability itself set to 0.1. Mean accuracy scores for each algorithm are shown in figure 3.17.

Figure 3.17: Domain size experiment on synthetic datasets.

The shape of the graph is similar to the claim probability experiment; that is, accuracy increases sharply as domain size increases. Note that the increase is more dramatic here: mean accuracy increases from around 0.55 to near 1, whereas in the claim probability experiment the minimal accuracy was already fairly high at around 0.9.

Recall that in the model of synthetic data adopted in this project, a source $s$ will choose the correct value with probability $p_s$ – their reliability score – when making a claim for a variable, and choose one of the $d - 1$ incorrect values each with probability $\frac{1 - p_s}{d - 1}$.

Since each incorrect value is chosen with equal probability, the probability of choosing any particular value decreases as the domain size $d$ is increased. This means that for larger $d$, it is less likely that sources will agree on incorrect values. Stated another way, when $d$ is large agreements between sources are likely to correspond to the true values, as opposed to multiple sources making the same mistake. Agreement between sources is a key indicator for true values, at least as far as the algorithms considered in this project are concerned, and thus we see higher accuracy as domain size increases.
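The synthetic data model described above can be sketched in a few lines of Python; this is an illustration of the model rather than the generator actually used in the project (for simplicity the true value of every variable is taken to be 0):

import numpy as np

def synthetic_claims(reliability, num_vars, claim_prob, domain_size, seed=None):
    rng = np.random.default_rng(seed)
    claims = np.full((len(reliability), num_vars), -1)   # -1 means 'no claim'
    for s, p_s in enumerate(reliability):
        for v in range(num_vars):
            if rng.random() >= claim_prob:
                continue                                  # no claim for this variable
            if rng.random() < p_s:
                claims[s, v] = 0                          # the correct value
            else:
                # each of the d - 1 incorrect values with probability (1 - p_s)/(d - 1)
                claims[s, v] = rng.integers(1, domain_size)
    return claims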

3.3.3 Convergence Analysis

Continuing with the evaluation of truth discovery algorithms using the developed software, we analyse the convergence behaviour of each algorithm. To do this, a synthetic dataset consisting of 1000 sources and 1000 variables was created. Each iterative algorithm was then run for 100 iterations, and the distance between the trust score vectors in successive iterations measured in the $\ell_2$ norm, i.e. we measure $\|T^t - T^{t-1}\|_2$ for each time $t \in \{2, \ldots, 100\}$ (distance measures other than $\ell_2$ – namely $\ell_1$, $\ell_\infty$ and cosine distance – were also experimented with; the graphs for each looked almost identical to the $\ell_2$ graph, so they are not included here). If the trust scores converge – in the sense of a limit in a metric space as discussed earlier – we should see this distance become arbitrarily small, and remain so as $t \to \infty$.

Figure 3.18 shows the results. Many algorithms, particularly Sums, Average·Log and TruthFinder, appeared to converge exponentially quickly. A logarithmic scale is therefore used on the vertical axis to show the convergence behaviour more clearly.

Unfortunately, it was only possible to run TruthFinder for 3 iterations. After this the trust score $T^3_i$ for one of the sources $i$ became sufficiently

Figure 3.18: Convergence experiment on a large synthetic dataset.

close to 1 that $1 - T^3_i$ becomes 0 due to a rounding error; we then get an undefined result when computing $\log(1 - T^3_i)$ in the next iteration as per the TruthFinder algorithm. More work needs to be done to determine whether this is an implementation issue or a limitation of TruthFinder itself. However, in only three iterations the distance between successive trust scores becomes close to $10^{-3}$, which is small enough that one may consider it to have converged reasonably well.

Another interesting point is the distances for Sums and Average·Log. They are almost identical up to around 25 iterations, where they start to diverge. Average·Log then stabilises so that the trust scores remain constant, and the distance is therefore 0 for the remainder.

In contrast, the distances for Sums remain more or less constant after 30 iterations, but are not 0. Upon closer inspection, the trust scores repeatedly oscillate between two (very close together) vectors. This could

be caused by numerical instability in the algorithm, or its implementation in this project. In practical terms this is not problematic, since the distance between these two vectors in the $\ell_2$ norm is as small as $10^{-14}$.

Finally, observe that Investment and its friend PooledInvestment converge much more slowly than the other algorithms. The reason for the sharp drop-off in distance for Investment at around 80 iterations is not clear. Running beyond 100 iterations showed this behaviour continuing; unlike Sums, the distance appears to decrease consistently towards 0.

3.3.4 Time Complexity Analysis

To conclude the analysis of truth discovery algorithms, we consider algorithm run time as the size of the input dataset varies. Synthetic datasets were again used, due to the ease of creating datasets with fixed sizes.

'Size' has at least three components here: the number of sources, the number of variables, and the number of claims. The effect of the number of claims on accuracy was shown above. Its effect on run time was observed to be marginal, so it is not considered in detail.

Instead, this test investigates the effects of varying the number of sources and variables. In principle an algorithm may scale differently with respect to these parameters, so they are changed independently in two separate tests.

The methodology was as follows. First, a large synthetic dataset with 2,000 sources, 2,000 variables and uniform trust distribution was created. For each parameter (number of sources and number of variables), sizes from 100 to 2,000 in increments of 200 were tested, whilst the other parameter was fixed at 500. The large 2,000 × 2,000 dataset was then reduced to the correct size by taking the first n sources and first m variables in each case. Figure 3.19 shows the results.

Figure 3.19: Algorithm run time experiments on synthetic datasets.

Subsetting a large dataset for each trial, as opposed to generating a new dataset, was hoped to reduce any random effects on timing that may arise due to the random nature of synthetic datasets. It also allowed larger sizes to be tested in a reasonable time, since generating many random claims is time consuming.

In terms of the results, we see that most algorithms exhibit linear growth with respect to both the number of sources and variables. The exception is Voting, which is not an iterative algorithm and thus does not depend in a major way on the size of the dataset. One might expect the matrix operations involved to take longer for larger matrices, but this effect is negligible compared to other algorithms. Note that TruthFinder's run time is particularly quick, and grows very slowly. This may be due to it running for only a few iterations due to the numerical problems mentioned above.

3.3.5 User Interfaces

As set out in section 3.1.5, three user interfaces were developed for the system: a documented Python API, a command-line interface, and a web-based interface.

Python API

A simple example of API usage is shown in figure 3.20. This example creates a synthetic dataset, runs an algorithm with particular parameters, and inspects the results. Output of this script is shown in figure 3.21.

Note that the API is the most comprehensive interface, in the sense that all the implemented functionality is available and can be used in any combination. For the CLI and web interfaces, some functionality has to be left out in the interest of presenting a simple and fit-for-purpose interface.

Figure 3.20: Python code demonstrating simple use of the API.

Got results in 0.012 seconds, 23 iterations
Most trustworthy source is 5
Most probable value(s) for variable 4 are: [3.0]
Dataset was:
0.0,0.0,0.0,7.0,3.0,7.0,5.0
3.0,0.0,,,,,2.0
0.0,0.0,0.0,,,,5.0
,2.0,7.0,,,,0.0
,,,7.0,3.0,,
1.0,1.0,5.0,,4.0,,
0.0,,,,3.0,,
,,,,2.0,2.0,
,,,,5.0,,6.0
,,,,,4.0,
0.0,6.0,,,3.0,,0.0

Figure 3.21: Example output of the code in figure 3.20.

For example, it is not possible in the CLI or web interfaces to instantiate multiple algorithms with separate parameters; this is a niche requirement and allowing for it would complicate parameter specification for the most common use cases. However, this is trivial to achieve when using the API.

Further examples of API usage include the code for each of the algorithm analysis experiments discussed throughout this section. Each experiment uses the 'external' API interface without requiring modifications to the core code; this emulates the level of access an end user would have if using the API interface. These examples can be found in the examples directory in the source code repository.

Command-Line Interface

In the command-line interface, users can run algorithms on datasets in CSV format, generate synthetic data, and create graph representations of datasets. Figure 3.22 shows an example of running an algorithm; this uses the same dataset, algorithm and parameters as the code of figure 3.20. One can confirm that the results coincide with those shown in figure 3.21.

Figure 3.22: Example of CLI interface for running an algorithm.

Note that the tail command (http://man7.org/linux/man-pages/man1/tail.1.html) is used to remove the first line of the synthetic data CSV, which contains the 'true' values and is thus not part of the data itself. This demonstrates a strength of the command-line interface: it can easily be used in conjunction with other programs using standard shell features such as input and output redirection and variable substitution.

Not all functionality of the command-line interface can be demonstrated here. A full description of the available features and options is given in the help output, which can be shown with truthdiscovery --help, and with --help after each of the available sub-commands, namely run, synth and graph.

Web Interface

The finished web interface is shown in figures A.1 to A.7 in appendix A. These screenshots are included in a separate appendix due to their size and number.

Figure A.1 shows the basic view presented when a user first loads the page. At the top, the user selects one or more algorithms from the list. By default a small dataset is already loaded. Datasets are represented in the matrix form described in requirement 2. Users may edit the cells by simply clicking them, and buttons are available to add and remove sources and variables. The 'trash' icon in the lower right allows all entries to be cleared.

This interactive matrix format has proved to be successful for manual entry of small datasets. For example, I used it extensively when analysing the behaviour of algorithms for the theoretical work of chapter 4. It is not suitable for datasets of any significant size, however; it is particularly limited by the number of variables (displayed horizontally) that can comfortably be shown at one time. It also becomes fiddly to enter many entries when one has to click each individual cell. Future work could improve this by implementing keyboard shortcuts to navigate the matrix.

It is also possible to load datasets from CSV format, as shown in figure A.2. The reverse conversion – exporting the constructed matrix to CSV – was not implemented; this is a clear improvement that can be made in future work.

As a final point on data entry, note that there is a dropdown menu for 'preset' datasets. I created four small datasets exhibiting different properties: a 'typical' dataset with a mixture of agreements and disagreements, one where all sources are in agreement bar one, one consisting of two disjoint sets of sources whose claims do not overlap, and one where no sources agree on any variable. This allows users to quickly get started with the site without having to construct their own dataset.

Another aspect of user input is algorithm parameters. Parameter options are hidden behind an 'advanced options' checkbox, since it is expected that most users will use the default parameters. Figure A.3 shows the options displayed when this box is checked. Algorithm parameters are specified by a free-text field using the same format as for the command-line interface. This is a weakness of the interface; a graphical view would be more appropriate here.

Once the algorithms, dataset and parameters are chosen, the user runs the algorithms by clicking the ‘Run’ button. An example of how results are displayed is shown in figure A.4. First, the run-time and number of iterations are displayed. Beneath this there are four sections: source trust scores, claim belief scores, graph representation and animation. Each section can be collapsed and expanded by clicking its title. The trust and belief score sections are straightforward: the scores are displayed in a tabular format, with the maximum scores shown in bold. Note that the rows can be sorted by score (ascending or descending) or by source/variable ID by clicking the table headings.

Note that each score has an additional number in brackets beside it. This shows the change in trust/belief score compared to the previous results. Displaying these changes is optional, and can be controlled by the checkbox labelled ‘Compare against previous results’ which can be seen in figure A.1. This was particularly useful when experimenting with algorithms on hand-crafted datasets to see how they react to small changes in the dataset, and also to see how different algorithms compare on the same dataset. Note that comparison is not available when running multiple algorithms simultaneously, since it is not clear which results are being compared against.

The results of running multiple algorithms are shown in figure A.5. Observe that a tabbed interface is used, so that only one set of results can be seen at a time. This was useful for comparing results between algorithms – a major use case identified in section 3.1. One may collapse the sections of results that are not of interest and quickly change between the tabs to see any changes.

As for the graph and animation sections in the results, examples are shown in figures A.6 and A.7 respectively. A slider and buttons beneath the animation frame allow the user to control playback; arrow keys can also be used. Future work could implement automatic playback of animations – presently the user must move the slider themselves to see a smooth animation.

3.3.6 Testing

Testing was an important consideration throughout the practical component of this project. Several use cases involve users extending or otherwise interacting with the code, which becomes extremely difficult if the code is buggy or unreliable. Thorough testing is therefore required to ensure, as far as possible, that the code behaves correctly.

The most basic form of testing involved manually checking the behaviour of the code during development. For example, after implementing running algorithms until convergence, I experimented with the threshold level and checked that the number of iterations performed changed accordingly. Clearly this kind of ad-hoc testing is not sufficient. It is impractical to perform such tests for all areas of the software after every code change, which makes it easy for bugs to creep in when making future changes. Moreover, manual inspection is not always enough to spot errors in output. For the truth discovery algorithms implemented here, it is practically impossible to know the expected results of an algorithm before running it through software, so looking at the results by eye will most likely not provide any useful information regarding the correctness of the implementation. Even if one could predict the results of an algorithm, the trust and belief scores are often floating-point numbers with in excess of 10 digits after the decimal point: checking these numbers manually would be extremely challenging and error-prone.

Automated testing was used heavily to address these issues. This allowed many tests covering broad parts of the code to be run quickly without manual intervention, and allowed the tests to be far more comprehensive than manual testing could be. For these reasons and others, automated tests are now standard practice in software development. I intended to adopt a test-driven development workflow, where tests are written before the code they test. The hope is that this forces one to write tests that cover the functionality actually required, rather than testing a specific implementation. I achieved this semi-successfully, often writing a basic test first and spotting gaps in the test coverage after the code implementation. At any rate, the project benefits from a comprehensive suite of 150 unit tests covering all aspects of the Python code. Test coverage, which is the percentage of lines of code that are run during the execution of the tests, is 100% as measured using the coverage library²⁶ (after manually excluding lines that ought not to run during tests, such as starting the web server, reading command-line arguments etc.). Note that 100% test coverage does not necessarily mean that the tests are sufficient to find all potential bugs in the code, but it is nonetheless a desirable statistic.

Figure 3.23: Example of a unit test for constructing the source-claims matrix for a dataset.

As mentioned, pytest was used to implement and run the tests. The complete test output is shown in appendix B. Most of the automated tests can be described as unit tests, which check the behaviour of a small part of the code (a ‘unit’) in isolation. In my case this usually involved writing multiple tests for each class and method. An example is shown in figure 3.23. This test checks that the source-claims matrix M is constructed correctly when a dataset is loaded. A pattern I used throughout development was to first write tests for the expected behaviour of a class with ‘correct’ inputs, then consider ‘edge cases’, and finally consider invalid inputs to check that errors are handled appropriately.

²⁶ https://coverage.readthedocs.io/en/v4.5.x/

The example in figure 3.23 illustrates this to some extent. The first six claims follow a predictable format: they deal with one variable at a time (‘wind’ for the first four, ‘rain’ for the next two), with a mixture of repeated values and distinct ones. These constitute the obviously ‘correct’ parts of the input. Edge cases are thrown in for the remaining claims. First, the value ‘wet’ is repeated, but for a different variable; this is perfectly valid input, but a naïve implementation could incorrectly fail to distinguish between the claims in this case. The variables are also introduced out of order with the final claim. Again, this is an attempt to ‘trick’ the code in order to catch errors that may occur if one does not consider these edge cases. Erroneous input is handled in a separate test which is not shown in figure 3.23. Test cases here include invalid datasets where a source makes multiple claims for a single variable.

In addition to unit tests, regression tests were used for testing the implementation of algorithms. Regression tests are designed to ensure code behaves correctly over time as new developments are made. This was particularly important for the truth discovery algorithms themselves, where it is not at all obvious whether an algorithm has been implemented correctly by looking at its results on a dataset. To create the regression tests, each algorithm was run against a large synthetic dataset as soon as it was implemented. The results were then stored in a file, and a test created to re-run the algorithm on the same dataset and compare against the stored results. Any inadvertent changes resulting in a change in behaviour for a particular algorithm are then caught the next time the tests are run. Note that the regression tests rely on the algorithms being correct at the time of their initial implementation; this was checked separately in unit tests. Indeed, this proved useful on a few occasions throughout development, where accidental errors in the algorithm code caused non-obvious errors in results. These were fortunately caught by the regression tests and quickly rectified. A sketch of such a regression test is given at the end of this subsection.

Besides the Python code, JavaScript was used for the interactive elements of the web interface. Unfortunately, no automated tests were produced for the JavaScript code. Manual testing was performed during development and a final run-through and test of all functionality was performed on completion, but this was the extent of testing for the web interface. On the one hand, the core backend Python code is thoroughly tested, so the web interface testing would only need to cover the interactive user interface aspects, such as dataset entry, display of results and so on. Automated testing for such graphical components is much more challenging than for the Python code, and would have required significant time investment. On the other hand, AngularJS, the JavaScript framework used, is built with testability in mind, and various tools exist to simplify automated testing of Angular applications. Tests could be added in future work to address this.
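To illustrate the regression-testing approach described above, here is a minimal pytest sketch. The file paths, dataset loader and algorithm class names are hypothetical stand-ins, not the project's actual API; the point is only the pattern of re-running an algorithm on a fixed dataset and comparing against stored results.

import json
import pytest

# Hypothetical imports standing in for the project's actual API.
from truthdiscovery import Dataset, Sums

STORED_RESULTS_PATH = "tests/data/sums_expected.json"   # hypothetical path
SYNTHETIC_DATASET_PATH = "tests/data/synthetic.csv"     # hypothetical path


def test_sums_regression():
    """Re-run Sums on a fixed synthetic dataset and compare against stored results."""
    dataset = Dataset.from_csv(SYNTHETIC_DATASET_PATH)
    results = Sums().run(dataset)

    with open(STORED_RESULTS_PATH) as f:
        expected = json.load(f)

    # Trust and belief scores are floats, so compare up to a small tolerance
    # rather than requiring exact equality.
    for source, trust in expected["trust"].items():
        assert results.trust[source] == pytest.approx(trust, abs=1e-9)
    for claim, belief in expected["belief"].items():
        assert results.belief[claim] == pytest.approx(belief, abs=1e-9)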

3.3.7 Evaluation against Requirements

To conclude this section, the software as a whole is evaluated with respect to the requirements and use cases of section 3.1. The table in figure 3.24 shows the status of each requirement in the finished system. Some requirements are tested in several ways; for example, generating animations is covered by unit tests, but the animations were also inspected visually. Requirements with a subjective component (e.g. support for ‘large’ datasets) are indicated with asterisks. The table shows that most requirements were indeed met, and for the most part verified by unit tests. Note that most of the unit-tested requirements have multiple associated tests, and that the unit tests go far beyond the high-level requirements listed.

Requirement 8, which relates to the handling of ‘large’ datasets, was not met. Whilst it was possible to run algorithms on a reasonably large dataset (see section 3.3.1), it took far too long for the dataset to load. It is unclear where the bottleneck in dataset loading lies; this could be investigated in future work.

Requirement 15 stated that time, memory and iteration statistics should be returned with the results of an algorithm. This is only partially complete in the finished system: no memory usage information is returned. Defining memory usage proved to be challenging, and initial investigations suggested that profiling memory usage whilst an algorithm runs would have a negative impact on performance. Memory usage was omitted for these reasons. Note that the other aspects of the requirement, namely time and iteration statistics, were met and verified with unit tests. Naturally, requirement 17, which requires that time, memory and iteration statistics can be compared between results, was also only partially met.

Req. number Unit tested Manually tested Partially met Not met 1 X 2 X 3 X 4 X 5 X 6 X 7 X 8 X* 9 X 10 X 11 X 12 X 13 X* 14 X 15 X X 16 X 17 X X 18 X X 19 X X 20 X* 21 X X* 22 X Figure 3.24: Status of each of the requirements in the finished implementation.

The use cases identified at the start of this section are supported to varying degrees in the final implementation. The main goals for the ‘practitioner’ use case were to run algorithms on real-world datasets and evaluate them with respect to their performance. It has been seen that large real-world datasets are not well supported in the system due to the excessive time it takes to load them. This limits the system's usefulness in practical applications of truth discovery, especially if real-time results are required. Evaluation is supported reasonably well; the demonstrations throughout this section show how the system can be used to evaluate and compare algorithms with respect to various metrics, such as accuracy on synthetic datasets and run-time. One shortcoming is that analysis of memory usage is not possible.

The ‘algorithm developer’ use case requires users to be able to extend the codebase and implement new algorithms with ease. This is a subjective goal, and it is difficult for the author of some software to impartially comment on its usability from the perspective of others. Nevertheless, I feel that this goal was mostly achieved, due to the clean separation in the code between the public API and implementation details. The code is also well-documented, well-tested, and includes numerous examples of its usage.

Finally, the system was intended to be helpful as a tool for theoretical truth discovery work. The web interface proved particularly useful for this in my own theoretical work (see chapter 4), where I often needed to run a particular algorithm on small datasets to get a feel for its behaviour in different scenarios. The counterexample used to prove that Sums does not satisfy a certain independence property (see theorem 2) was found using the web interface, and the figures demonstrating it were generated using the Python API.

3.4 Future Work

This section discusses unrealised ideas and potential future work for the software framework.

Perhaps the most obvious area for improvement is the selection of algorithms available. Truth discovery has been studied extensively in the literature and many algorithms proposed, yet only five are implemented here. It would be interesting to implement algorithms of different types; for example, algorithms based on statistical models (e.g. [33, 38]) are absent.

Real-valued variables could also be handled in future work. This implementation treats values categorically – 3.001 is completely distinct from 3.002. In practice, however, datasets often contain real-valued variables whose claimed values may be close but not exactly equal. In this case the values can be ‘quantised’ to discrete intervals before truth discovery is applied. Future work could implement this inside the framework to reduce the effort required from end users; a sketch of such quantisation is given below.
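As a sketch of what such quantisation might look like, the snippet below maps real-valued claims into fixed-width intervals before they are treated as categorical values. The function name and bin width are illustrative only; they are not part of the implemented API.

import math


def quantise(value: float, bin_width: float = 0.01) -> str:
    """Map a real-valued claim to a categorical label for its interval.

    Claims falling into the same interval become identical categorical
    values, so sources making near-equal claims are treated as agreeing.
    """
    return f"bin_{math.floor(value / bin_width)}"


# 3.001 and 3.002 fall into the same 0.01-wide interval, so after
# quantisation they are treated as the same categorical value.
assert quantise(3.001) == quantise(3.002)
# With a much finer bin width they remain distinct values.
assert quantise(3.001, bin_width=0.0001) != quantise(3.002, bin_width=0.0001)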

Another area that could be greatly expanded upon is synthetic data generation. The model of synthetic data adopted here is a simple one, and could be extended in various ways. For example, each variable could have its own domain of possible values, instead of a fixed constant domain; this would more accurately reflect real-world datasets, where variables have their own unique characteristics. The distribution of claimed values could also vary between variables: currently all sources select incorrect values uniformly at random. This artificially prevents certain algorithms from performing as well as they might do on real data, where similarity between claimed values can be used to great effect – for example, via implications between claims in TruthFinder. A sketch of a per-variable generation scheme is given below.
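As an illustration of the kind of extension proposed here, the following sketch generates claims where each variable has its own domain and a non-uniform distribution over incorrect values. None of these names correspond to the implemented synthetic-data code; they are hypothetical and only demonstrate the idea.

import random

# Hypothetical per-variable configuration: each variable has its own domain
# and its own true value.
variables = {
    "temperature": {"domain": [18, 19, 20, 21, 22], "true_value": 20},
    "weather": {"domain": ["sun", "rain", "snow"], "true_value": "rain"},
}


def generate_claim(var_name: str, source_trust: float):
    """Generate a single claimed value for one variable from one source.

    With probability equal to the source's trust the true value is claimed;
    otherwise an incorrect value is drawn, weighted towards values close to
    the truth when the domain is numeric.
    """
    config = variables[var_name]
    domain, truth = config["domain"], config["true_value"]
    if random.random() < source_trust:
        return truth
    wrong = [v for v in domain if v != truth]
    if all(isinstance(v, (int, float)) for v in wrong):
        # Weight numeric errors so that values closer to the truth are more likely.
        weights = [1.0 / (1.0 + abs(v - truth)) for v in wrong]
        return random.choices(wrong, weights=weights, k=1)[0]
    return random.choice(wrong)


claims = {name: generate_claim(name, source_trust=0.7) for name in variables}
print(claims)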

Chapter 4

Theoretical Analysis

This chapter presents and analyses a formal theoretical framework for truth discovery. First, the approach taken is discussed and justified.

4.1 Approach

In the previous chapters, we have motivated the need for a general theoretical framework for truth discovery. To work towards actually constructing one, it is necessary to set out exactly what such a framework will consist of, and what features and properties are required for it to be useful. The main goal of developing the framework is to set out rigorous definitions for what truth discovery is, allowing the current situation to be modelled whilst also permitting a more general view. The key definitions will therefore address the following questions:

• What is the ‘input’ to truth discovery? The input has been described in terms of sources, facts, objects and conflicting claims, but this needs to be formulated mathematically.

• What is the ‘output’? We have stated that the output is most commonly trust and belief scores for sources and facts, according to existing work in the literature. However, our aim is to study truth discovery in full generality, and not just the algorithms already in existence. Therefore a more general view could be taken if desired, so long as it can still model existing algorithms.

With these definitions in place, a truth discovery algorithm is simply a mapping from the space of inputs to outputs. This abstracts away the process of performing truth discovery, so ‘algorithm’ – which suggests a particular procedure – is not quite the correct term to use. We opt for truth discovery operator to describe such a mapping from inputs to outputs. There are several criteria against which to judge the usefulness of the developed framework.

• Ability to model existing approaches: We aim to find a unified framework that allows as many existing algorithms in the literature as possible to be represented.

• Simplicity: the key definitions should be easy to interpret, and should relate to intuitive notions of truth discovery in a clear way.

• Flexibility: we wish to prove properties of operators, compare different operators, and develop axioms, so the framework should be easy and flexible to work in.

• Generality: the framework should be general and ‘unopinionated’ enough to be useful as foundations for future work, i.e. it should not rely on specific ideas and approaches to performing truth discovery. It should also be general in the sense of facilitating easy comparison between truth discovery and related areas in the literature. This will allow ideas in these areas to be applied to truth discovery, e.g. many axioms from social choice could be translated to truth discovery.

Once the framework has been established, we aim to develop axioms for operators. In line with axiomatic foundations for other problems, the axioms should represent intuitively desirable properties that a ‘reasonable’ operator should satisfy. The power of the axiomatic approach is to then consider multiple axioms together; the types of results attained include impossibility results, where it is proved that no operator¹ can satisfy a given set of axioms, and representation theorems, where a set of sound and complete axioms is found for a particular operator. For example, in the context of ranking systems, the authors in [3] show that two seemingly complementary and desirable axioms cannot be satisfied simultaneously, which has implications when deciding which ranking system to use in practice. Requirements for ‘good’ axioms include having simple interpretations and representing desirable properties in some way or another.

¹ We use ‘operator’ as a blanket term to refer to social choice functions, ranking systems, annotation aggregators etc.

4.1.1 Overview of Approach

We now give an overview and justification of the approach to developing the theoretical framework, before the formal definitions are given in section 4.2.

For input to a truth discovery problem, it was noted in section 2.2 that the following form is applicable to many approaches in the literature: we have a set of sources, facts and objects, and sources claim facts for objects. To represent this formally, a graph-theoretic representation is chosen. Nodes will be sources, facts and objects. Edges between nodes represent the obvious relations: an edge from a source to a fact represents the source claiming that fact, and an edge from a fact to an object indicates the object the fact relates to. Setting this up in graph theory allows for simple interpretation, and allows concepts in graph theory to be usefully applied to describe properties of the input (for example, the notion of connected components is key in axiom 6). Using a well-established tool such as graph theory should also provide flexibility for future refinements to consider more complex problems: notions such as weighted edges, annotated nodes etc. could be used to conveniently describe additional properties of the input. Finally, a graph representation is already used in the related area of ranking systems [2, 3]. Using a similar set-up allows comparison between the two areas.

For output, consider the two main ideas discussed in section 2.2: assigning each fact a score and selecting a single true fact for each object. Since the latter is a special case of the former, the former is more suitable for a general theory of truth discovery. Note that assigning a numeric score to each source and fact in particular induces a ranking of the sources and facts. We argue that the essence of truth discovery lies more in this induced ranking than in the particular numerical scores, and will therefore define the output of truth discovery as a pair of rankings (precisely, total preorders) on the sets of sources and facts. Indeed, when applying truth discovery methods to determine which sources to trust and which facts to believe, one is interested in which sources/facts are more believable than others, not in the particular numeric values produced by an algorithm. Additionally, the numeric values produced often do not have any semantic meaning [27], which prevents inter-algorithm comparison. The induced rankings can therefore act as a bridge between results from different algorithms.

A similar view is also taken in social choice and the axiomatic theory of ranking systems, where rankings instead of numeric scores are the main objects of interest. Taking the same approach here highlights the similarities between truth discovery and these areas, and allows concepts in these areas to be carried over into truth discovery. Nevertheless, to model in their entirety the algorithms that produce numeric scores, it will be possible to define such operators in the framework as more general objects, but we restrict our attention mainly to ranking-output operators.

4.2 A Framework for Truth Discovery

In this section we define formally the graph-theoretic framework for truth discovery, and set out the central problem of truth discovery via the definition of a truth discovery operator. We then develop axioms (desirable properties) for such operators, and prove some basic properties regarding them. Finally, we consider how to represent real-world truth discovery algorithms in the developed framework, focussing specifically on Sums [27]. First we state some standard definitions and notation.

4.2.1 Standard Definitions and Notation

Definition 1. A preorder on a set X is a binary relation ≼ that is reflexive and transitive:

1. x ≼ x for all x ∈ X (reflexivity)

2. If x ≼ y and y ≼ z, then x ≼ z, for all x, y, z ∈ X (transitivity)

A total preorder is a preorder that is complete: for all x, y ∈ X, x ≼ y or y ≼ x. L(X) denotes the set of all total preorders on X. The strict order induced by ≼ is ≺, where x ≺ y if and only if x ≼ y but not y ≼ x. Note that ≺ is irreflexive, transitive (and hence asymmetric), and is not complete. The equality predicate associated with ≼ is ≃, where x ≃ y if and only if x ≼ y and y ≼ x. Note that ≃ is an equivalence relation on X.

Definition 2. A permutation of a set X is a bijective mapping X → X. We use cyclic notation for permutations: π = (a, b, c) is the mapping π(a) = b, π(b) = c, π(c) = a, and π(x) = x for x ∉ {a, b, c}. Juxtaposition of cycles denotes function composition.

Definition 3. Two graphs G = (V, E) and G′ = (V′, E′) are isomorphic if there is a bijective mapping φ : V → V′ such that (u, v) ∈ E ⇐⇒ (φ(u), φ(v)) ∈ E′.

Definition 4. Let G = (V, E) be an undirected graph, and define a relation ∼ on V by u ∼ v iff there is a path from u to v in G (including the zero-length path when u = v). It is easily checked that ∼ is an equivalence relation. A connected component of G is an induced subgraph of an equivalence class of ∼.

Notation. For sets X and Y, Y^X denotes the set of all functions X → Y.

4.2.2 Truth Discovery Definitions

We consider fixed finite and mutually disjoint sets S, F and O, called the sources, facts and objects respectively. All definitions and axioms will be stated with respect to these sets.

Definition 5. A truth discovery network is a directed graph N = (V,E) where V = S ∪ F ∪ O, and E ⊆ (S × F) ∪ (F × O) satisfies the following properties:

1. Each f ∈ F has a unique successor node in O, denoted obj(N, f) (i.e. each fact relates to a single object).

2. For s ∈ S and o ∈ O, there is at most one directed path from s to o (i.e. sources can only claim one fact per object).

3. (S × F) ∩ E is non-empty (i.e. at least one claim is made).

We will say that s claims a fact f when (s, f) ∈ E. Let N denote the set of all truth discovery networks.

Remark. Note that the definition above does not rule out a source s making no claims, a fact f being claimed by no sources, or an object o having no associated facts or sources. In the special case where each object has exactly two associated facts, the objects can be seen as binary variables taking one of two values, e.g. true or false. The truth discovery network is then similar to a set of judgements in judgement aggregation [13] for an agenda consisting only of propositional variables.

Notation. For convenience, for a network N = (V, E), define:

facts(N, s) = {f ∈ F : (s, f) ∈ E}
facts(N, o) = {f ∈ F : (f, o) ∈ E}
src(N, f) = {s ∈ S : (s, f) ∈ E}
src(N, o) = {s ∈ S : ∃f ∈ F, (s, f) ∈ E and (f, o) ∈ E}
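Purely to illustrate this notation, the following small Python sketch computes these sets for a toy edge set; the encoding of sources, facts, objects and edges is hypothetical and not part of the implemented software.

# A toy network encoded as plain sets, only to illustrate the notation above.
sources = {"s1", "s2"}
facts_set = {"f1", "f2"}
objects = {"o1"}
edges = {("s1", "f1"), ("s2", "f2"), ("f1", "o1"), ("f2", "o1")}


def facts_of_source(s):
    return {f for (x, f) in edges if x == s and f in facts_set}


def facts_of_object(o):
    return {f for (f, x) in edges if x == o and f in facts_set}


def src_of_fact(f):
    return {s for (s, x) in edges if x == f and s in sources}


def src_of_object(o):
    return {s for s in sources
            if any((s, f) in edges and (f, o) in edges for f in facts_set)}


assert facts_of_source("s1") == {"f1"}
assert src_of_object("o1") == {"s1", "s2"}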

Definition 6. A truth discovery operator T is a mapping T : N → L(S) × L(F), i.e. T assigns to each truth discovery network N a pair of total preorders T(N) = (⊑_N^T, ≼_N^T) on the sets S and F respectively.

s1 ⊑_N^T s2 means that s2 is ranked as at least as trustworthy as s1 in the network N according to T; f1 ≼_N^T f2 means that f2 is ranked as at least as believable as f1.

In practice, real-world truth discovery algorithms do not usually output a ranking of sources and facts directly, but instead assign each source a numeric trust score, and each fact a belief score. This is captured in the following definition.

Definition 7. A numerical truth discovery operator T is a mapping T : N → ℝ^S × ℝ^F, i.e. T assigns to each truth discovery network N functions t_N : S → ℝ (referred to as the source trust mapping) and b_N : F → ℝ (referred to as the fact belief mapping).

Remark. Any numerical truth discovery operator T naturally induces a truth discovery operator T′, where for any truth discovery network N we define

s1 ⊑_N^T′ s2 ⇐⇒ t_N(s1) ≤ t_N(s2)
f1 ≼_N^T′ f2 ⇐⇒ b_N(f1) ≤ b_N(f2)

for s1, s2 ∈ S and f1, f2 ∈ F.

In this work we deal primarily with truth discovery operators as defined in definition 6, instead of working directly with numeric trust and belief scores as in definition 7. This is for the reasons discussed in section 4.1.1; namely that, from a theoretical point of view, we are interested in the qualitative ranking of sources and facts rather than quantitative values. One disadvantage of this approach is that whilst we can tell whether or not s1 is more trustworthy than s2, we cannot tell by how much. For example, consider two numerical operators T and T′ and S = {s1, s2} such that t_N(s1) = 0.5, t_N(s2) = 0.51, and t′_N(s1) = 0.01, t′_N(s2) = 0.99. Both operators induce the same ranking on S, yet T considers the two sources to have similar trust values while T′ considers s2 to be much more trustworthy than s1.
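To make the induced ranking concrete, here is a small sketch (not part of the implemented API) that turns numeric trust scores into the induced total preorder, represented as groups of equally-ranked items from least to most trusted.

from itertools import groupby


def induced_ranking(scores: dict) -> list:
    """Group items with equal scores and order the groups by ascending score.

    The returned list of groups represents the total preorder induced by the
    numeric scores: items in the same group are ranked equally, and items in
    later groups rank strictly above items in earlier groups.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1])
    return [
        {name for name, _ in group}
        for _, group in groupby(ordered, key=lambda item: item[1])
    ]


# Both score assignments from the example above induce the same ranking,
# even though the numeric gaps between the sources are very different.
print(induced_ranking({"s1": 0.5, "s2": 0.51}))   # [{'s1'}, {'s2'}]
print(induced_ranking({"s1": 0.01, "s2": 0.99}))  # [{'s1'}, {'s2'}]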

4.2.3 Axioms

The fact-believability component of truth discovery can be seen as a special case of voting in the theory of social choice [41], where agents are sources and alternatives are facts. Each source then ranks the facts it claims above all other facts, and ranks its claimed facts equally.² Several axioms for voting rules from this theory can be adapted to truth discovery, and we do so presently.

² Note that the formulation of social choice must allow agents to have weak preferences over alternatives, where ties are allowed.

Symmetry and Dictatorship

Definition 8. Two truth discovery networks N and N′ are equivalent if there is a graph isomorphism π between them that preserves sources, facts and objects:

1. π(s) ∈ S for all s ∈ S

2. π(f) ∈ F for all f ∈ F

3. π(o) ∈ O for all o ∈ O

In such a case we write π(N) for N′.

Figure 4.1 shows an example of two equivalent networks. The first axiom states that the ordering of sources and facts should not depend on the ‘names’ of the sources, facts and objects in the input.

Definition 9. Let T be a truth discovery operator. T satisfies symmetry if for any equivalent truth discovery networks N and N′ = π(N), we have

s1 ⊑_N^T s2 ⇐⇒ π(s1) ⊑_N′^T π(s2)

and

f1 ≼_N^T f2 ⇐⇒ π(f1) ≼_N′^T π(f2)

T satisfies source-symmetry if both of the above statements hold in the cases where π only permutes sources, i.e. π(f) = f and π(o) = o for all f ∈ F and o ∈ O. Fact-symmetry and object-symmetry are defined similarly.

Axiom 1 (Symmetry). An operator T should satisfy symmetry.

Source-symmetry is analogous to anonymity in classical social choice, where all voters are treated identically, and fact and object symmetry are analogous to neutrality, where the alternatives being voted on are treated identically [41]. Note that source-symmetry does not mean that sources are treated equally per se, since some sources are presumed to be more trustworthy than others (this is more or less the central premise of truth discovery). Instead it means that sources are judged solely by the facts that they claim, not their identities.

Proposition 1. T satisfies symmetry if and only if it satisfies source, fact and object symmetry.

Proposition 2. If T satisfies source-symmetry and facts(N, s1) = facts(N, s2) in some network N, then s1 ≃_N^T s2.

Figure 4.1: Example of two equivalent truth discovery networks. Here the isomorphism between the left and right networks is π = (S, T, U)(A, B). The colours show that the structure of the right network is the same as the left, just with different ‘names’ for the nodes.

Similarly, if T satisfies fact-symmetry, src(N, f1) = src(N, f2) and obj(N, f1) = obj(N, f2), then f1 ≈_N^T f2. That is, any two sources making identical claims are ranked equally, and any two facts for the same object with identical support from sources are ranked equally.

All missing proofs are presented in appendix C.

In some sense the opposite of source-symmetry – where the identities of the sources are irrelevant and only the structure of the truth discovery network matters – is a situation where only the identities of the sources are considered.

Definition 10. A source s∗ ∈ S is authoritative for a network N = (V, E) with respect to an operator T if s ⊑_N^T s∗ for all s ∈ S, and (s∗, f∗) ∈ E implies f ≼_N^T f∗ for all f ∈ F. In other words, s∗ is more (or equally) trusted than all other sources, and its facts are more (or equally) believable than all others.

We also define a strict version: s∗ is strictly authoritative if additionally s ⊏_N^T s∗ for all s ≠ s∗, and f ≺_N^T f∗ for all f, f∗ ∈ F such that (s∗, f∗) ∈ E and (s∗, f) ∉ E.

An operator T is a dictatorship if there is a source s∗ ∈ S (the dictator) that is authoritative for all networks, and T is a strict dictatorship if there is a source s∗ ∈ S that is strictly authoritative for all networks.

Axiom 2 (Non-dictatorship). An operator T should not be a dictatorship.

As noted above, source-symmetry and dictatorship are conceptually at odds with one another. This is expressed formally in the following proposition, which essentially shows that only trivial operators can satisfy both properties.

Proposition 3. If an operator T is both source-symmetric and a dictatorship, then for any network N:

1. All sources are ranked equally

2. If f1 is claimed by at least one source in N, then f2 ≼_N^T f1 for all facts f2.

In particular, there is no operator that is both source-symmetric and a strict dictatorship.

Example 1. A trivial operator that satisfies symmetry and dictatorship is one that always ranks all sources and facts equally: T_triv(N) = (S × S, F × F).

If we restrict N to those networks where all facts are claimed by at least one source, then proposition 3 shows that T satisfies source-symmetry and dictatorship if and only if T = T_triv. Without this restriction, facts not claimed by any source may be ranked strictly below other facts. Indeed, consider T defined as follows. For any network N write F₊ = {f ∈ F : src(N, f) ≠ ∅}, and define T by s1 ≃_N^T s2 for all s1, s2 ∈ S, and

f1 ≼_N^T f2 ⇐⇒ f2 ∈ F₊ or f1 ∉ F₊    (f1, f2 ∈ F)

T is trivially a dictatorship for any s∗ ∈ S. It can easily be checked that ≼_N^T is a well-defined total preorder, and that T is also symmetric. However, any fact in F \ F₊ ranks strictly below any fact in F₊.

Dictatorship requires there to be a fixed source that is authoritative in all networks. A weaker form of dictatorship, which is more compatible with symmetry, is one where the authoritative source may depend on N.

Definition 11. A truth discovery operator T is a generalised dictatorship if for every network N there exists a source s_N ∈ S that is authoritative for N with respect to T. A generalised strict dictatorship is defined similarly.

Clearly a dictatorship is also a generalised dictatorship.

Example 2. An operator that is both symmetric and a generalised dictatorship is the numerical operator T defined as follows. For any truth discovery network N, let Q_N = {s ∈ S : |facts(N, s)| = max_{x∈S} |facts(N, x)|} be the set of sources making the maximal number of claims, and set

t_N(s) = 1 if s ∈ Q_N, and 0 otherwise
b_N(f) = 1 if src(N, f) ∩ Q_N ≠ ∅, and 0 otherwise

Clearly any source in Q_N is authoritative. To show symmetry, let N and π(N) be equivalent networks.

Let s ∈ S. First note that f ∈ facts(N, s) iff π(f) ∈ facts(π(N), π(s)) by definition of equivalent networks, and in particular the restriction of π to facts(N, s) is a bijection into facts(π(N), π(s)); hence |facts(N, s)| = |facts(π(N), π(s))|. Also, since π restricted to S is a bijection into S, we have

max_{x∈S} |facts(N, x)| = max_{x∈S} |facts(π(N), π(x))| = max_{x∈S} |facts(π(N), x)|

and so

s ∈ Q_N ⇐⇒ |facts(N, s)| = max_{x∈S} |facts(N, x)|
        ⇐⇒ |facts(π(N), π(s))| = max_{x∈S} |facts(π(N), x)|
        ⇐⇒ π(s) ∈ Q_π(N)

We see that t_N(s) = t_π(N)(π(s)) for any s ∈ S. Now let f ∈ F. Note that s ∈ src(N, f) iff π(s) ∈ src(π(N), π(f)). Using this fact and s ∈ Q_N ⇐⇒ π(s) ∈ Q_π(N), it is easy to see that src(N, f) ∩ Q_N ≠ ∅ iff src(π(N), π(f)) ∩ Q_π(N) ≠ ∅, i.e. b_N(f) = b_π(N)(π(f)). Finally this means, for any s1, s2 ∈ S and f1, f2 ∈ F:

s1 ⊑_N^T s2 ⇐⇒ t_N(s1) ≤ t_N(s2)
           ⇐⇒ t_π(N)(π(s1)) ≤ t_π(N)(π(s2))
           ⇐⇒ π(s1) ⊑_π(N)^T π(s2)

and similarly f1 ≼_N^T f2 iff π(f1) ≼_π(N)^T π(f2). Hence T is symmetric.

Note that to be a generalised dictatorship, an operator need only rank facts claimed by the most trusted source(s) above all other facts. One may argue that this is not necessarily an undesirable property, since the most trusted source presumably claims believable facts, which should rank highly.

However, the operator in example 2 has the additional (perhaps undesirable) property that the ranking is ‘binary’: it is a two-level ranking where all non-authoritative sources rank equally to each other and strictly below the authoritative ones. This behaviour is captured in the following definition.

Definition 12. A truth discovery operator T is a binary generalised dictatorship if for every network N there is a set of sources Q_N ⊆ S such that, with

t_N(s) = 1 if s ∈ Q_N, and 0 otherwise
b_N(f) = 1 if src(N, f) ∩ Q_N ≠ ∅, and 0 otherwise

it holds that

s1 ⊑_N^T s2 ⇐⇒ t_N(s1) ≤ t_N(s2)
f1 ≼_N^T f2 ⇐⇒ b_N(f1) ≤ b_N(f2)

Remark. If T is a binary generalised dictatorship, it is clear that for each network N, each source in Q_N is authoritative. In such a case the orderings ⊑_N^T and ≼_N^T are fully determined by the choice of Q_N. Therefore a binary generalised dictatorship can be identified with a mapping N → 2^S that selects the authoritative sources for each truth discovery network.

Axiom 3 (Non-binary generalised dictatorship). An operator T should not be a binary generalised dictatorship.

Proposition 4. Non-dictatorship and non-binary generalised dictatorship are independent.

Unanimity

The next axioms formalise the idea that if all sources are in agreement about the status of a fact, then a truth discovery operator should respect this in its verdict. Two obvious ways in which sources can be in agreement are when all sources believe a fact is true, and when no sources believe a fact is true.

Axiom 4 (Unanimity). For any truth discovery network N, src(N, f) = S implies f′ ≼_N^T f for all f′ ∈ F.

Axiom 5 (Groundedness). For any truth discovery network N, src(N, f) = ∅ implies f ≼_N^T f′ for all f′ ∈ F.

That is, a fact cannot do better than to be claimed by all sources when T satisfies unanimity, and cannot do worse than to be claimed by no sources when T is grounded. Note that we do not require strict inequalities here, so as not to be too restrictive. For unanimity in particular, requiring f to rank strictly above all other facts would require T to choose a highest-ranking fact arbitrarily in the case where there are multiple facts claimed by all sources.

Unanimity is similar to the weak Paretian property [10] in social choice, which states that whenever each individual prefers an alternative a over b, the social preference order prefers a over b also. It can also be compared to unanimity in judgement aggregation [13]. Axioms similar to groundedness have been proposed for collective annotation (e.g. see groundedness in [17]).

Example 3. The majority voting operator, which ranks a fact by the number of sources claiming it, satisfies unanimity and groundedness. Indeed, define T_vote by s1 ≃_N^Tvote s2 for all s1, s2 ∈ S, and

f1 ≼_N^Tvote f2 ⇐⇒ |src(N, f1)| ≤ |src(N, f2)|

If src(N, f) = S then for all f′ we have src(N, f′) ⊆ S = src(N, f), so f′ ≼_N^Tvote f. Also, if src(N, f) = ∅ then src(N, f) ⊆ src(N, f′) for all f′, so f ≼_N^Tvote f′. Hence T_vote is unanimous and grounded.

A consequence of groundedness is that any fact ranking strictly above all others must have been claimed by at least one source (assuming |F| > 1).
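As a concrete illustration of the majority voting operator in example 3, the following sketch (with a toy claims set, not part of the implemented software) computes |src(N, f)| for each fact; the induced ranking simply sorts facts by these counts.

from collections import Counter

# Toy claims: two sources claim f1, one claims f2.
claims = {("s1", "f1"), ("s2", "f1"), ("s3", "f2")}
support = Counter(f for _, f in claims)

# |src(N, f1)| = 2 and |src(N, f2)| = 1, so f2 ranks below f1 under T_vote.
print(support)  # Counter({'f1': 2, 'f2': 1})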

Figure 4.2: Network demonstrating a case where we do not wish for an IIA-type axiom to hold.

Proposition 5. Unanimity and groundedness are independent.

Independence

In social choice, the ‘Independence of Irrelevant Alternatives’ (IIA) axiom [6] requires that the relative ranking of two alternatives A and B depends only on the individual rankings of A and B, and not on any ‘irrelevant’ alternative C. That is, if the individual voter preferences are changed such that the ranking of A versus B remains the same for each voter, the ranking of A and B in the social ranking remains unchanged.

To consider whether a similar axiom should be adopted for truth discovery, consider facts A and B in the network shown in figure 4.2. A has support from source S only, who is not in agreement with any other sources, whilst B has support from T, who agrees with both U and V on facts C and D. For this reason, it may be reasonable to expect that T is more trustworthy than S, and therefore B is more believable than A. Directly translating IIA to this situation, we would require that the ranking of A and B is unchanged if, say, we removed T's claims for C and D (which are ‘irrelevant’), and instead had S make these claims.

Figure 4.3: Network where some notion of independence may be applied.

However, the intuition above suggests that the ranking of A and B should actually be reversed in this case, despite the individual judgements on A and B remaining unchanged. For this reason, we argue that a more subtle notion of independence is required.

The issue in figure 4.2 is that C and D are not entirely irrelevant to A and B, since they are connected indirectly via sources that make claims for both objects O and P (namely, source T). Consider removing these indirect links, as shown in figure 4.3. In this case it can be argued that C and D truly are irrelevant to A and B, and so changes to the network outside of S, T, A, B and O should not affect the ranking of A and B.

This idea can be generalised by noting that two nodes are ‘relevant’ to each other (perhaps indirectly) if they lie in the same connected component of the network (where we consider the connected components of the undirected version of the graph). A suitable independence axiom is therefore to require that changes outside a connected component do not affect the ranking of sources and facts within that component. A precise statement is given below.

Axiom 6 (Independence Of Irrelevant Stuff). For any truth discovery networks N1, N2 with a common connected component G, the restrictions of ⊑_N1^T and ⊑_N2^T to G ∩ S are equal, and the restrictions of ≼_N1^T and ≼_N2^T to G ∩ F are equal.

Independence of irrelevant stuff requires that s1 ⊑_N1^T s2 if and only if s1 ⊑_N2^T s2. A weaker version is to require only that the ranking of s1 and s2 in N1 is not reversed in N2, not necessarily that the ranking is the same (for example, a strict inequality in N1 may become weak in N2).

Axiom 7 (Weak Independence Of Irrelevant Stuff). For any truth discovery networks N1, N2 with a common connected component G and for any s1, s2 ∈ G ∩ S and f1, f2 ∈ G ∩ F:

s1 ⊏_N1^T s2 =⇒ s1 ⊑_N2^T s2

f1 ≺_N1^T f2 =⇒ f1 ≼_N2^T f2

Clearly axiom 6 implies axiom 7.

Coherence

A guiding principle of many truth discovery approaches is that facts claimed by trustworthy sources should receive high belief, and sources claiming high-belief facts should be seen as trustworthy – the trust and belief rankings should cohere with one another in this sense. The following axiom aims to formalise this in a specific case where it is possible to compare the facts claimed by two sources in a straightforward way (and similarly to compare the sources for two facts). Its form is inspired by transitivity axioms for ranking systems [3].

Axiom 8 (Coherence). Suppose N is a truth discovery network and s1, s2 ∈ S are such that there is a bijective mapping φ : facts(N, s1) → facts(N, s2) with f ≼_N^T φ(f) for all f ∈ facts(N, s1). Then s1 ⊑_N^T s2.

That is, if the facts claimed by two sources can be paired up such that the fact claimed by s1 always ranks beneath the fact claimed by s2, then s1 ranks beneath s2.

Similarly, if there are facts f1, f2 ∈ F and a bijection φ : src(N, f1) → src(N, f2) such that s ⊑_N^T φ(s) for all s ∈ src(N, f1), then f1 ≼_N^T f2.

Monotonicity

The axioms considered so far have largely dealt with the output of a truth discovery operator for one input network at a time, or for two networks which are structurally similar. Another dimension to the axiomatic approach is to consider how the output of an operator is affected when the input is modified in a particular way.

The following axiom considers what should happen if a network is changed by adding additional support for a particular fact. Intuitively, this should be seen as additional evidence that the fact is true, and an operator should rate it no worse than it did before.

Axiom 9 (Monotonicity). Let N = (V, E) be a truth discovery network, and f ∈ F, s ∈ S such that (s, f) ∉ E. Write o = obj(N, f). Consider the network N′ = (V, E′) in which s claims f, i.e.

E′ = {(s, f)} ∪ E \ {(s, f′) : f′ ≠ f, (f′, o) ∈ E}

Then f′ ≼_N^T f implies f′ ≼_N′^T f for all f′ ∈ F. That is, if f receives additional support from a new source s, its ranking should not get worse.

4.2.4 Iterative Truth Discovery Operators

Real-world algorithms for truth discovery generally compute numerical trust and belief scores, as per definition 7. Additionally, most operate in an iterative manner, computing trust and belief scores recursively from one another until the respective scores (hopefully) converge to fixed values. In this section we define the concept of iterative truth discovery operators to represent and reason about such real-world algorithms.

Definition 13. An iterative truth discovery operator is a sequence I = (T_n)_{n∈ℕ} of numerical truth discovery operators, i.e. a sequence of mappings T_n : N → ℝ^S × ℝ^F.

For a network N and n ∈ ℕ we will write T_n(N) = (t_N^n, b_N^n) to refer directly to the source trust and fact belief mappings for the n-th iteration (but note that this notation does not make explicit the dependence of t and b on the sequence I).

I is said to converge to a numerical operator T* if t_N^n → t*_N and b_N^n → b*_N pointwise as n → ∞ for each network N.

Remark. Recall that a numerical operator T* naturally induces a (non-numerical) operator by ranking sources according to t*_N and facts according to b*_N. We may therefore identify a convergent iterative operator I with the operator induced by its limit T*, and write ⊑_N^I and ≼_N^I for the source and fact rankings. To be explicit:

s1 ⊑_N^I s2 ⇐⇒ lim_{n→∞} t_N^n(s1) ≤ lim_{n→∞} t_N^n(s2)

and similarly for facts.

Note that this mapping is not injective: there may be many iterative operators with the same limit T*; furthermore, two distinct numerical operators T* and T** may induce the same non-numerical operator.

Given a real-world algorithm in practice, one usually aims to determine whether it converges by iterating until the distance (measured in some suitable way) between trust (or belief) scores in consecutive iterations becomes smaller than a fixed threshold. This is of course only a heuristic, since it is not possible to determine whether a sequence converges by considering only finitely many terms. Moreover, even if the difference between subsequent trust/belief scores were to become arbitrarily small (i.e. smaller than any threshold), one still cannot guarantee convergence.³ This means that it may not be trivial to define a real-world algorithm as a truth discovery operator, since it may not be clear whether the trust and belief scores converge in all cases. Nevertheless, in this work we will assume that the iteration does converge in all cases and consider which axioms of section 4.2.3 are satisfied given this assumption.

To check whether a convergent operator satisfies the axioms, it will be convenient to have sufficient conditions for some of the axioms that refer to the numeric trust and belief scores directly. For convenience we assume that trust and belief scores are in the range [0, 1], as this is generally the case in practice.

Lemma 1. Let I be a convergent iterative truth discovery operator with limit T*. Suppose that t_N^n(s) ∈ [0, 1] and b_N^n(f) ∈ [0, 1] for all N, s, f and n. Then:

1. t*_N(s) ∈ [0, 1] and b*_N(f) ∈ [0, 1]

³ For an example of a sequence exhibiting such behaviour, consider the partial sums of the harmonic series Σ_{j=1}^∞ 1/j, which is divergent. The difference between the (n+1)-th and n-th partial sums is 1/(n+1), which converges to 0 as n → ∞, yet the series does not converge.

2. If for any equivalent networks N and N′ = π(N) it holds that

   t_N^n(s) = t_π(N)^n(π(s)) and b_N^n(f) = b_π(N)^n(π(f))

   for all N, n, s and f, then I satisfies symmetry (axiom 1).

3. If for any network N and f ∈ F,

   src(N, f) = S =⇒ b_N^n(f) = 1 for sufficiently large n ∈ ℕ,

   then I satisfies unanimity (axiom 4).

4. If for any network N and f ∈ F,

   src(N, f) = ∅ =⇒ b_N^n(f) = 0 for sufficiently large n ∈ ℕ,

   then I satisfies groundedness (axiom 5).

5. If for any networks N1, N2 with a common connected component G it holds that t_N1^n(s) = t_N2^n(s) and b_N1^n(f) = b_N2^n(f) for all n, s ∈ G ∩ S and f ∈ G ∩ F, then I satisfies independence of irrelevant stuff (axiom 6).

6. If for any networks N1, N2 with a common connected component G

there are sequences of non-negative numbers (α_n)_{n∈ℕ}, (β_n)_{n∈ℕ} such that, for all n ∈ ℕ, s ∈ G ∩ S and f ∈ G ∩ F,

   t_N2^n(s) = α_n · t_N1^n(s)
   b_N2^n(f) = β_n · b_N1^n(f)

   then I satisfies weak independence of irrelevant stuff (axiom 7).

Sums

Sums [27] is an iterative algorithm for truth discovery based on Hubs and Authorities [16], an algorithm for ranking web pages based on the hyperlink structure of the web. The trust score for a source at a given iteration is computed as the sum of the current belief scores of its claimed facts, and the belief score for a fact is given by the sum of its sources' trust scores. The trust/belief scores are normalised at each iteration by dividing by the maximum score; this prevents the scores from growing without bound, so as to ensure convergence.

Definition 14 (Sums). Sums is the iterative truth discovery operator I_sums defined for any network N as follows, where we write t_n for t_N^n for brevity:

t_1(s) = 1/2,   b_1(f) = 1/2

and for n > 1:

t̂_n(s) = Σ_{f ∈ facts(N,s)} b_{n−1}(f)
b̂_n(f) = Σ_{s ∈ src(N,f)} t̂_n(s)

t_n(s) = t̂_n(s) / max_{x∈S} t̂_n(x)
b_n(f) = b̂_n(f) / max_{y∈F} b̂_n(y)

Note that t̂ and b̂ are only used to define t and b, and are not part of the definition of Sums itself. If a source s makes no claims in N (i.e. facts(N, s) = ∅), we follow the convention that an empty sum is 0 and set t̂_n(s) = 0 (and similarly for a fact without sources).

Remark. The normalisation ensures that trust and belief scores always lie in [0, 1]. Note that any source that makes at least one claim has a strictly positive trust score for all n, and any fact with at least one source has a strictly positive belief score. Since any network N must contain at least one claim, this ensures that the maximum in the denominator for t_n and b_n is non-zero.

Theorem 1. If Sums is convergent, it satisfies symmetry (axiom 1), non-dictatorship (axiom 2), unanimity (axiom 4), groundedness (axiom 5) and weak independence of irrelevant stuff (axiom 7).

The proof of theorem 1 uses lemma 1, and can be found in appendix C. For non-dictatorship, it is sufficient by proposition 3 and symmetry to find a single truth discovery network in which Sums does not rank all sources equally. Figure 4.4 shows such a network: in it, source A ranks strictly above B and C, which are ranked equally.

Figure 4.4: A network in which Sums does not rank all sources equally.

Theorem 2. Sums does not satisfy Independence of Irrelevant Stuff (axiom 6).

Figures 4.4 and 4.5 provide an example of Sums failing to satisfy independence. The details can be found in the proof in appendix C.

We conjecture that Sums is indeed convergent for any input network. Indeed, it is easy to see that Sums is closely related to Hubs and Authorities [16], where trust scores correspond to hub scores and belief scores correspond to authority scores. There are only two differences: the initial scores (1/2 in Sums and 1 in Hubs and Authorities), and the method of normalisation (Sums ensures the maximum score is 1, whereas Hubs and Authorities ensures the sum of the squares of the scores is 1). In [16] it is proved, using techniques from linear algebra, that Hubs and Authorities always converges. The difference in normalisation only amounts to using a different norm to measure convergence (namely ‖·‖_∞ in Sums instead of ‖·‖_2), so it is hoped that the proof can be modified to work for Sums.

Figure 4.5: A counterexample to independence for Sums: this network contains the one shown in figure 4.4 as a connected component, but A is ranked equal to B here whereas it ranks strictly above B in figure 4.4.
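To make the iterative scheme of definition 14 concrete, the following is a small Python sketch of the Sums update on plain dictionaries. It is illustrative only – it is not the project's actual implementation, the toy claims set at the end does not reproduce figure 4.4, and the stopping criterion is the heuristic threshold test discussed above.

def sums(claims, threshold=1e-6, max_iterations=200):
    """Run the Sums iteration of definition 14 on a set of (source, fact) claims.

    `claims` is a set of (source, fact) pairs. Returns the final trust and
    belief score dictionaries. Objects are not needed, since Sums ignores them.
    """
    sources = {s for s, _ in claims}
    facts = {f for _, f in claims}
    trust = {s: 0.5 for s in sources}
    belief = {f: 0.5 for f in facts}

    for _ in range(max_iterations):
        # Un-normalised updates: the t-hat and b-hat of definition 14.
        t_hat = {s: sum(belief[f] for s2, f in claims if s2 == s) for s in sources}
        b_hat = {f: sum(t_hat[s] for s, f2 in claims if f2 == f) for f in facts}

        # Normalise by the maximum score so that scores stay in [0, 1].
        new_trust = {s: t_hat[s] / max(t_hat.values()) for s in sources}
        new_belief = {f: b_hat[f] / max(b_hat.values()) for f in facts}

        # Heuristic convergence test on the maximum change between iterations.
        delta = max(
            max(abs(new_trust[s] - trust[s]) for s in sources),
            max(abs(new_belief[f] - belief[f]) for f in facts),
        )
        trust, belief = new_trust, new_belief
        if delta < threshold:
            break
    return trust, belief


# A tiny example: sources A and B agree on f1, while C claims a conflicting f2.
example_claims = {("A", "f1"), ("B", "f1"), ("C", "f2")}
print(sums(example_claims))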

4.3 Evaluation

In this section, the framework and results of the previous section are evaluated. In particular, the framework is evaluated with respect to the criteria outlined in section 4.1.

Ability to model existing approaches

The main definitions are that of a truth discovery network and a truth discovery operator. For the framework to be useful as a tool for analysing truth discovery, these definitions should be compatible with existing ideas and approaches to truth discovery, in the sense that it should be possible to define existing algorithms within the framework.

For the input network definition, it is easily verified that the definition given is capable of modelling the input required for many algorithms proposed in the literature. Indeed, since there is little disagreement on the form of input across algorithms, there are several possible choices for the exact form of the input, and the one we make is sufficient.

For the operator definition, we must consider whether the output of an operator, namely a pair of total preorders on the sets of sources and facts, is sufficient to model the existing approaches. As mentioned previously, output usually consists of numeric trust scores for each source, and either numeric belief scores for facts or a single ‘true’ fact for each object. Whilst neither of these options involves rankings of sources and facts directly, they both induce such rankings, allowing both forms of output to be reduced to a common form. Indeed, it was already noted that the numeric scores induce rankings by simply sorting sources and facts by their score in ascending order. When given instead an identified ‘true’ fact for each object, a ranking is induced by having the identified true facts rank equally to each other and strictly above non-true facts. For some algorithms, the identified true fact may not have been claimed by any source; this is not a problem in the framework since we permit a network to contain facts with no associated sources. One may simply take the set of facts for an object to be all permitted values for the object.⁴

The definition of an operator can therefore model many existing algorithms. However, it neglects an important characteristic of many algorithms in practice, which is that they operate iteratively, running until the results converge in some sense. For this reason, we defined an iterative operator. This allowed a real-world algorithm, Sums, to be defined and analysed in the framework. Due to time constraints, no other algorithms were realised. However, it is clear that many other algorithms can be defined in a similar way. This is immediate for algorithms similar to Sums, such as Average·Log, Investment and PooledInvestment [27], since they differ only in their formulae for trust and belief score updates; their fundamental method of operation is the same.

⁴ We make the assumption that the domain of all possible values is well-defined as a set.

Simplicity

Simplicity is naturally a subjective aim, since what appears simple to the author may not appear so to others. Nonetheless, we argue that the framework achieves its goal of expressing ideas as simply as possible.

For example, one of the key definitions is that of a truth discovery network. Adopting a graph-theoretic approach, the definition (including the constraints on the graph) is easy to understand for those familiar with the basics of graph theory, and even lends itself to pictorial representations of truth discovery networks. The next main definition is that of a truth discovery operator. This is defined simply as a mapping from a space of inputs, denoted N, to a space of outputs, denoted L(S) × L(F). The definition of an iterative operator extends the non-iterative one in a natural way, by defining it simply as a sequence of non-iterative operators.

Whilst the notation for the rankings for a particular operator and particular network may appear crowded at first, it expresses all the components of the ranking without having to introduce additional notation prior to its use each time. It is inspired by the notation introduced by Altman & Tennenholtz [3] for ranking systems. We also believe that the axioms are expressed as simply as possible. Where the formalities become tedious, plain-English explanations are provided to give insight into the intuition backing them.

Flexibility

Flexibility is also not something that can be objectively verified. Nevertheless, we were able to express a variety of ideas in the framework without excessive complexity, and the basic results shown have simple proofs.

Generality

By and large, the framework is neutral with respect to any specific idea or approach for truth discovery. A possible exception is perhaps the definition of an iterative operator; this is defined as a sequence of numerical operators, whereas in principle an iterative algorithm need not compute numerical scores. Indeed, algorithms such as CRH [22] operate in an iterative manner, yet do not assign belief scores to facts.

However, the definition could easily be generalised to a sequence of non-numerical operators, and a separate definition given for numerical iterative operators. The definition as given was chosen to reduce the number of definitions required and improve the clarity of the work, since the only algorithm actually discussed does in fact use numeric scores.

Another aim for the framework was to permit comparison between truth discovery and related areas in the literature. The framework is general enough for this; whilst clearly being a framework for truth discovery, one may easily see truth discovery networks and operators from the perspective of social choice and ranking systems. For example, it is easily seen that truth discovery networks form a particular class of graphs, and a truth discovery operator is essentially a ranking system defined on this class of graphs. The similarity is also demonstrated empirically by the fact that many of the developed axioms are directly inspired by axioms in these areas, but still have intuitive interpretations in terms of truth discovery. However, the similarities do not extend to areas less influenced by social choice, such as argumentation theory and belief revision.

Having evaluated the definitions comprising the framework, we turn to the work carried out inside it, namely the development of axioms and analysis of operators with respect to these axioms.

Axioms and Results Several axioms covering a range of ideas were defined, each accompanied by a description of the intuition backing them. It is hoped that the axioms represent ‘desirable’ properties for operators, although of course desirability is a subjective property.
However, little work was done besides stating the axioms. An important aspect of the axiomatic approach is to analyse the implications of axioms, and to consider interactions between them (e.g. impossibility and representation results, or interesting properties entailed by a combination of axioms). In section 4.2 only very simple results regarding the axioms were proved, such as the independence of similar axioms and the incompatibility of source-symmetry and dictatorship. More work is required to fully study the developed axioms.
In terms of the analysis of operators with respect to the axioms, a set of sound axioms for Sums was obtained. Whilst I expect that these axioms are not complete, this was not considered in section 4.2.

A clear weakness of the analysis is that only one real-world algorithm is considered. One of the aims for the framework was a unified model that can represent many different algorithms – defining only a single algorithm does not demonstrate this particularly well.
As such, there is no comparison of the theoretical properties of different operators. An interesting task would be to find axioms that distinguish between operators, i.e. axioms that one operator satisfies but another does not. This would provide insight into meaningful differences between operators, which is hard to glean from definitions given in terms of an iterative procedure. Knowledge of the differences in terms of simple desirable properties could be helpful in deciding which algorithm to use in practice for real applications of truth discovery.
It was noted above that a strength of the framework is the scope for comparison between truth discovery and other areas. A weakness of the analysis is that no such comparison was actually carried out, besides the casual observations linking truth discovery to social choice and ranking systems. To make the links more concrete, one could consider whether social welfare functions, ranking systems, annotation aggregators etc. can be formulated as truth discovery operators, or vice versa.

4.4 Future Work

This section discusses possible future work for the theoretical part of the project.

4.4.1 Unfinished Business Due to time constraints, some elements of the work were left in a semi-finished state. The most obvious example concerns the convergence of Sums. At the end of section 4.2.4, we conjecture that the trust and belief scores of Sums converge for any input network. As mentioned there, a proof might be obtained by modifying the proof of the convergence of Hubs and Authorities, to which Sums is closely related.
The lack of a proof of convergence in all cases (or otherwise, as the case may be) leaves the analysis of Sums in the unsatisfactory state where it is unclear whether Sums properly defines a truth discovery operator in the sense of definition 6. This also means that theorems 1 and 2 must include the hypothesis that Sums is actually convergent.
Another unfinished element of the Sums analysis is that we make no determination (or even conjecture) on whether it satisfies the monotonicity and coherence axioms. Monotonicity and coherence are also not addressed in lemma 1.
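Although a proof is currently lacking, the convergence conjecture is easy to explore empirically. The following is a minimal standalone sketch (the function and its interface are hypothetical, not the project's API) of the Sums update rules as used in the proofs of appendix C — sums of neighbours' scores followed by normalisation by the maximum, starting from scores of 1/2 — together with a simple check on the change in scores between iterations:

# Minimal sketch of the Sums iteration (hypothetical helper, not the project's API).
import numpy as np

def sums(claims, iterations=100, tol=1e-8):
    """claims: boolean matrix with claims[s, f] = True iff source s claims fact f.
    Returns (trust, belief) after iterating until the scores stop changing
    (up to tol) or the iteration limit is reached."""
    claims = np.asarray(claims, dtype=float)
    trust = np.full(claims.shape[0], 0.5)
    belief = np.full(claims.shape[1], 0.5)
    for _ in range(iterations):
        trust_hat = claims @ belief            # unnormalised trust scores
        belief_hat = claims.T @ trust_hat      # unnormalised belief scores
        new_trust = trust_hat / trust_hat.max()
        new_belief = belief_hat / belief_hat.max()
        if (np.abs(new_trust - trust).max() < tol
                and np.abs(new_belief - belief).max() < tol):
            return new_trust, new_belief
        trust, belief = new_trust, new_belief
    return trust, belief

# Example: on randomly generated networks of this kind the scores appear to
# settle down quickly, which is consistent with (but of course does not prove)
# the convergence conjecture.
rng = np.random.default_rng(0)
trust, belief = sums(rng.random((20, 30)) < 0.3)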

4.4.2 Future Directions In terms of future directions for the framework, two aspects to consider are problems that could be fixed in future work, and new ideas that could be explored. New ideas can be further split into ideas for the framework itself (i.e. new or modified definitions and concepts) and ideas for new results that one could attempt to prove.
A first problem concerns the role of objects. Whilst objects are an important part of truth discovery in many approaches (particularly those which output an identified true fact for each object), they do not play much of a role in the framework of section 4.2, besides the constraint that sources make at most one claim per object in definition 6.
Future work could address this by developing axioms that consider objects directly. For example, one could consider what happens if two objects are ‘merged’, i.e. the facts from each are combined under a single object in a new network. This would presumably have little or no effect on algorithms such as Sums and Average·Log [27], which do not use objects in their calculations, but would affect algorithms such as Investment [27] and TruthFinder [35], where the role of objects is important.5
In many of the definitions and axioms, there is redundancy where we state a property for source rankings, and then an almost identical one for fact rankings. Similarly, the proofs often prove a result for source rankings, and the result for facts follows by an identical argument. This may lead one to wonder whether truth discovery as defined here is just an instance of ranking k groups of nodes in a (k + 1)-partite graph for the special case k = 2 (the extra part accounting for the objects). A similar idea is considered in the PhD thesis of Pasternack ([26], section 4.5.3) to represent ‘groups’ of sources.

5 Objects are termed mutual exclusion sets in [27].

Taking this more general view would highlight the symmetry between the groups (sources and facts in our case), where symmetry exists, and remove redundancy from definitions and proofs. However, this new problem may no longer represent truth discovery in the same way, and would need careful interpretation. In particular, it is not clear how objects would fit into this approach. Nevertheless, it could be something to consider in future work.
There are numerous possible extensions to the framework that could be made. One example is the definition of an iterative truth discovery operator (definition 13). Most real-world algorithms operate not only iteratively but recursively, updating trust and belief scores based on the scores in the previous iteration. However, the recursion aspect is not captured in any definition in this work. Making such a definition could lead to a simpler representation of these algorithms in terms of their update rules (a small sketch of this idea is given below). This would provide a more general setup for studying recursively iterative algorithms; for example, one could consider the effects of making changes to the update rules. It could also provide a method for comparing different algorithms, by comparing their update rules.
More potential extensions to the framework come from the numerous extensions to the basic truth discovery model, some of which were listed in section 2.2. These would allow for consideration of more specific sub-problems in truth discovery and lead to a richer theory.
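As a concrete illustration of representing algorithms by their update rules (a hypothetical sketch, not a proposed definition; only the Sums rules below are taken from this report), a recursive algorithm could be reduced to the two functions it applies at each iteration:

# Hypothetical sketch: a recursively iterative algorithm given by its update rules.
def run(network, trust_update, belief_update, iterations=20):
    """network: dict with 'facts_of' (source -> facts) and 'sources_of'
    (fact -> sources). All scores start at 1/2."""
    trust = {s: 0.5 for s in network["facts_of"]}
    belief = {f: 0.5 for f in network["sources_of"]}
    for _ in range(iterations):
        trust = trust_update(network, trust, belief)
        belief = belief_update(network, trust, belief)
    return trust, belief

# Sums-style rules: sum the neighbours' scores and normalise by the maximum
# (equivalent to the report's formulation up to the order of normalisation).
def sums_trust(network, trust, belief):
    raw = {s: sum(belief[f] for f in fs) for s, fs in network["facts_of"].items()}
    top = max(raw.values())
    return {s: v / top for s, v in raw.items()}

def sums_belief(network, trust, belief):
    raw = {f: sum(trust[s] for s in ss) for f, ss in network["sources_of"].items()}
    top = max(raw.values())
    return {f: v / top for f, v in raw.items()}

# A contrasting (voting-style) rule set: belief is simply the number of
# sources, and trust is left unchanged.
def voting_trust(network, trust, belief):
    return trust

def voting_belief(network, trust, belief):
    return {f: len(ss) for f, ss in network["sources_of"].items()}

Comparing algorithms then amounts to comparing the functions passed to run(), which is the kind of comparison such a definition would make possible.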

Future work on the axioms themselves could involve generalising axioms where possible. For example, one may note that unanimity and groundedness (axioms 4 and 5) are special cases of the following axiom.

Axiom 10. For any truth discovery network N and facts $f_1, f_2 \in F$, $\mathrm{src}(N, f_1) \subseteq \mathrm{src}(N, f_2)$ implies $f_1 \preceq^T_N f_2$. That is, facts with ‘more’ sources rank at least as high as those with fewer.6

6 ‘More’ here is in the sense of set inclusion, not the number of sources.

Unanimity is the case where $\mathrm{src}(N, f_2) = S$, and groundedness is the case where $\mathrm{src}(N, f_1) = \emptyset$.
Future work could investigate this axiom in detail. For example, it would be interesting to find operators that satisfy unanimity and/or groundedness, but not this more general axiom.
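Such an axiom also lends itself to a simple mechanical check. The following sketch (hypothetical names, not part of the project) tests the condition of axiom 10 against a numeric belief assignment, which induces the fact ranking as described earlier:

# Hypothetical sketch: check axiom 10 for one network and one fact ranking.
from itertools import permutations

def satisfies_axiom_10(sources_of, belief):
    """sources_of: fact -> set of sources; belief: fact -> numeric score
    inducing the fact ranking. Returns True iff src(f1) being a subset of
    src(f2) implies that f1 ranks no higher than f2."""
    for f1, f2 in permutations(sources_of, 2):
        if sources_of[f1] <= sources_of[f2] and belief[f1] > belief[f2]:
            return False
    return True

# Toy example: f2's sources include f1's, so f2 must rank at least as high.
sources_of = {"f1": {"A"}, "f2": {"A", "B"}, "f3": {"C"}}
print(satisfies_axiom_10(sources_of, {"f1": 0.4, "f2": 0.9, "f3": 0.5}))  # True
print(satisfies_axiom_10(sources_of, {"f1": 0.8, "f2": 0.2, "f3": 0.5}))  # False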

Also relating to axioms, it was mentioned in the text preceding the monotonicity axiom (axiom 9) that most of the axioms deal with the properties of rankings in a static network. Monotonicity, however, deals with modifications to a network and the resulting effect on rankings. These kinds of axioms, where the input is changed in some way, are used to great effect in social choice and related areas; a notable example is the axiomatisation of PageRank [25] in [4]. Developing more axioms of this type may therefore be useful in seeking sound and complete axioms for truth discovery algorithms.
Finally, the approach taken throughout the theoretical work was largely to apply ideas from social choice to truth discovery. Future work could explicitly incorporate ideas from other related areas such as argumentation theory and belief revision.
In terms of new results, a major aspect of future work will be to formulate more results involving the axioms. As alluded to in the evaluation in section 4.3, this could include implications of the axioms (particularly the implications of an operator satisfying multiple axioms simultaneously), impossibility results, and finding sound and complete axioms for a particular algorithm or class of algorithms. More truth discovery operators could also be defined to provide examples of axioms failing to hold.

Chapter 5

Conclusions

Truth discovery has been explored in this project from a practical perspective in chapter 3, and a theoretical perspective in chapter 4. The two halves have been largely disjoint, although it is noted that the software implementation was useful in constructing examples for the theoretical work and in verifying results empirically.
The practical aspect involved implementing truth discovery algorithms in Python and making them accessible through command-line and web-based user interfaces. Methods for evaluating and comparing algorithms were provided, and some basic algorithm analysis was demonstrated in section 3.3. Additionally, visual representations of truth discovery datasets and results were produced, which were useful for demonstrating concepts throughout the theoretical chapter.
For the theoretical component, we set out a formal framework in which to study truth discovery, the key definitions being those of a truth discovery network and a truth discovery operator. Graph-theoretic and ordinal representations were used, highlighting the similarities with related areas in the literature such as ranking systems and social choice.
Following the approach commonly taken in social choice, axioms were defined to encode desirable properties of operators. The axioms presented are largely inspired by existing axioms from other areas in the literature, adapted to the truth discovery framework. Some basic results regarding

these axioms were shown, including a proof that Sums satisfies some but not all of our axioms when viewed as an operator within the framework.

Chapter 6

Reflection on Learning

To conclude this report, we reflect on the skills developed and what was learned throughout the project. There were several aspects to the project, each requiring a unique set of skills. In particular, there were elements of research, software development, theoretical work and project management.
Research was required throughout the project, and especially in its early stages. Understanding the cutting-edge approaches and ideas in the truth discovery world required careful reading of numerous published papers, something I had little experience with before this project. With time I was able to efficiently parse the relevant information from these articles, which often have broad scope and contain sections irrelevant to my work. Extracting the important details from large bodies of text in this way is an important skill that will undoubtedly be useful in future work.
Beyond just extracting these details, I had to constantly relate them back to my own project to consider how they were relevant and whether they would influence my approach. This was particularly challenging when reading papers relating to other fields, such as social choice and ranking systems. As a result I am now more confident in combining ideas from existing papers and keeping a mental map of information across the literature.
Reading published papers also improved my writing skills, as I became

more accustomed to the type of language and phrasing appropriate for technical writing.
In terms of software development, the project called for a fairly large-scale system to be designed, developed and tested. This contrasted with previous programming work I had done, which for the most part only involved writing code to satisfy fixed requirements. Aspects of the design process, such as considering relevant use cases and user interfaces, were not specific to truth discovery; this allowed me to develop general software design skills that will be useful in future projects.
Writing documentation is another general skill that the project provided ample opportunity to practise. In addition to documenting the code throughout development, I wrote a user guide explaining the concepts of the system and how it can be used. This required looking critically at the software as a whole and understanding how it can be broken down logically in a clear way.
The theoretical work was largely separate from the programming, and required very different skills and techniques. Transferable skills developed here include formulating mathematically rigorous definitions, theorems and proofs, mathematical writing, and laying out a mathematical paper. I also had to think carefully about the level of generality and the type of results desired; this involved many iterations before the final framework was decided on. I am better equipped to approach future theoretical work having gone through this process in this project.
Finally, this was a long-running individual project which required time management and organisational skills. In particular, I learned through trial and error what sort of time schedule works well for me personally; this will be carried forward to future projects in order to use my time as efficiently as possible. I also learned suitable methods for organising my workload, finding Issues on GitHub particularly useful, for example.1

1 https://help.github.com/en/articles/about-issues

References

[1] Carlos E. Alchourrón, Peter Gärdenfors, and David Makinson. “On the Logic of Theory Change: Partial Meet Contraction and Revision Functions”. In: The Journal of Symbolic Logic 50.2 (1985), pp. 510–530. ISSN: 00224812. URL: http://www.jstor.org/stable/2274239.
[2] Alon Altman and Moshe Tennenholtz. “An Axiomatic Approach to Personalized Ranking Systems”. In: J. ACM 57.4 (May 2010), 26:1–26:35. ISSN: 0004-5411. DOI: 10.1145/1734213.1734220. URL: http://doi.acm.org/10.1145/1734213.1734220.
[3] Alon Altman and Moshe Tennenholtz. “Axiomatic Foundations for Ranking Systems”. In: J. Artif. Int. Res. 31.1 (Mar. 2008), pp. 473–495. ISSN: 1076-9757. URL: http://dl.acm.org/citation.cfm?id=1622655.1622669.
[4] Alon Altman and Moshe Tennenholtz. “Ranking Systems: The PageRank Axioms”. In: Proceedings of the 6th ACM Conference on Electronic Commerce. EC ’05. Vancouver, BC, Canada: ACM, 2005, pp. 1–8. ISBN: 1-59593-049-3. DOI: 10.1145/1064009.1064010. URL: http://doi.acm.org/10.1145/1064009.1064010.
[5] Reid Andersen et al. “Trust-based Recommendation Systems: An Axiomatic Approach”. In: Proceedings of the 17th International Conference on World Wide Web. WWW ’08. Beijing, China: ACM, 2008, pp. 199–208. ISBN: 978-1-60558-085-2. DOI: 10.1145/1367497.1367525. URL: http://doi.acm.org/10.1145/1367497.1367525.


[6] Kenneth J. Arrow. “Social Choice and Individual Values”. In: Ethics 62.3 (1952), pp. 220–222.
[7] Pietro Baroni, Martin Caminada, and Massimiliano Giacomin. “Abstract Argumentation Frameworks and Their Semantics”. In: Handbook of Formal Argumentation. Ed. by Pietro Baroni et al. College Publications, 2018. Chap. 4.
[8] Jens Bleiholder and Felix Naumann. “Data Fusion”. In: ACM Comput. Surv. 41.1 (Jan. 2009), 1:1–1:41. ISSN: 0360-0300. DOI: 10.1145/1456650.1456651. URL: http://doi.acm.org/10.1145/1456650.1456651.
[9] Richard Booth and Aaron Hunter. “Trust as a Precursor to Belief Revision”. In: J. Artif. Intell. Res. 61 (2018), pp. 699–722.
[10] Felix Brandt et al. “Introduction to Computational Social Choice”. In: Handbook of Computational Social Choice. Ed. by Felix Brandt et al. 1st. New York, NY, USA: Cambridge University Press, 2016. Chap. 1.
[11] Xin Luna Dong, Laure Berti-Equille, and Divesh Srivastava. “Truth Discovery and Copying Detection in a Dynamic World”. In: Proc. VLDB Endow. 2.1 (Aug. 2009), pp. 562–573. ISSN: 2150-8097. DOI: 10.14778/1687627.1687691. URL: https://doi.org/10.14778/1687627.1687691.
[12] Y. Du et al. “Bayesian Co-Clustering Truth Discovery for Mobile Crowd Sensing Systems”. In: IEEE Transactions on Industrial Informatics (2019), pp. 1–1. ISSN: 1551-3203. DOI: 10.1109/TII.2019.2896287.
[13] Ulle Endriss. “Judgment Aggregation”. In: Handbook of Computational Social Choice. Ed. by Felix Brandt et al. 1st. New York, NY, USA: Cambridge University Press, 2016. Chap. 17.
[14] Alban Galland et al. “Corroborating Information from Disagreeing Views”. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. WSDM ’10. New York, New York, USA: ACM, 2010, pp. 131–140. ISBN: 978-1-60558-889-6. DOI: 10.1145/1718487.1718504. URL: http://doi.acm.org/10.1145/1718487.1718504.

[15] Manish Gupta and Jiawei Han. “Heterogeneous Network-based Trust Analysis: A Survey”. In: SIGKDD Explor. Newsl. 13.1 (Aug. 2011), pp. 54–71. ISSN: 1931-0145. DOI: 10.1145/2031331.2031341. URL: http://doi.acm.org/10.1145/2031331.2031341.
[16] Jon M. Kleinberg. “Authoritative Sources in a Hyperlinked Environment”. In: J. ACM 46.5 (Sept. 1999), pp. 604–632. ISSN: 0004-5411. DOI: 10.1145/324133.324140. URL: http://doi.acm.org/10.1145/324133.324140.
[17] Justin Kruger et al. “Axiomatic Analysis of Aggregation Methods for Collective Annotation”. In: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems. AAMAS ’14. Paris, France: International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 1185–1192. ISBN: 978-1-4503-2738-1. URL: http://dl.acm.org/citation.cfm?id=2617388.2617437.
[18] Dalia Attia Waguih and Laure Berti-Equille. “Truth Discovery Algorithms: An Experimental Evaluation”. In: CoRR abs/1409.6428 (2014). URL: http://arxiv.org/abs/1409.6428.
[19] Omer Lev and Moshe Tennenholtz. “Group Recommendations: Axioms, Impossibilities, and Random Walks”. In: TARK. 2017.
[20] Xian Li et al. “Truth finding on the deep web: is the problem solved?” In: Proceedings of the 39th International Conference on Very Large Data Bases. PVLDB ’13. Trento, Italy: VLDB Endowment, 2013, pp. 97–108. URL: http://dl.acm.org/citation.cfm?id=2448936.2448943.
[21] Yaliang Li et al. “A Survey on Truth Discovery”. In: SIGKDD Explor. Newsl. 17.2 (Feb. 2016), pp. 1–16. ISSN: 1931-0145. DOI: 10.1145/2897350.2897352. URL: http://doi.acm.org/10.1145/2897350.2897352.
[22] Yaliang Li et al. “Conflicts to Harmony: A Framework for Resolving Conflicts in Heterogeneous Data by Truth Discovery”. In: IEEE Transactions on Knowledge and Data Engineering 28.8 (Aug. 2016), pp. 1986–1999. ISSN: 1041-4347. DOI: 10.1109/TKDE.2016.2559481.
[23] Stephen Marsh. “Formalising Trust as a Computational Concept”. PhD thesis. University of Stirling, 1994.

[24] Mohammad Momani and Subhash Challa. “Survey of trust models in different network domains”. In: CoRR abs/1010.0168 (2010).
[25] Lawrence Page et al. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Previous number = SIDL-WP-1999-0120. Stanford InfoLab, Nov. 1999. URL: http://ilpubs.stanford.edu:8090/422/.
[26] Jeff Pasternack. “Knowing Who to Trust and What to Believe in the Presence of Conflicting Information”. PhD thesis. University of Illinois at Urbana-Champaign, 2011. URL: http://hdl.handle.net/2142/29516.
[27] Jeff Pasternack and Dan Roth. “Knowing What to Believe (when You Already Know Something)”. In: Proceedings of the 23rd International Conference on Computational Linguistics. COLING ’10. Beijing, China: Association for Computational Linguistics, 2010, pp. 877–885. URL: http://dl.acm.org/citation.cfm?id=1873781.1873880.
[28] Peter Gärdenfors. Belief Revision. Cambridge University Press, 2003.
[29] Yuqing Tang et al. “Using argumentation to reason about trust and belief”. In: Journal of Logic and Computation 22.5 (Oct. 2012), pp. 979–1018. ISSN: 0955-792X. DOI: 10.1093/logcom/exr038.
[30] Unified Modeling Language (Version 2.5). Mar. 2015. URL: http://www.omg.org/spec/UML/2.5.
[31] Soroush Vosoughi, Deb Roy, and Sinan Aral. “The spread of true and false news online”. In: Science 359.6380 (2018), pp. 1146–1151. ISSN: 0036-8075. DOI: 10.1126/science.aap9559. eprint: https://science.sciencemag.org/content/359/6380/1146.full.pdf. URL: https://science.sciencemag.org/content/359/6380/1146.
[32] Houping Xiao. “Multi-sourced Information Trustworthiness Analysis: Applications and Theory”. PhD thesis. University at Buffalo, State University of New York, 2018.

[33] Houping Xiao et al. “A Truth Discovery Approach with Theoretical Guarantee”. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’16. San Francisco, California, USA: ACM, 2016, pp. 1925–1934. ISBN: 978-1-4503-4232-2. DOI: 10.1145/2939672.2939816. URL: http://doi.acm.org/10.1145/2939672.2939816.
[34] Yi Yang, Quan Bai, and Qing Liu. “A probabilistic model for truth discovery with object correlations”. In: Knowledge-Based Systems 165 (2019), pp. 360–373. ISSN: 0950-7051. DOI: https://doi.org/10.1016/j.knosys.2018.12.004. URL: http://www.sciencedirect.com/science/article/pii/S0950705118305914.
[35] Xiaoxin Yin, Jiawei Han, and Philip S. Yu. “Truth Discovery with Multiple Conflicting Information Providers on the Web”. In: IEEE Transactions on Knowledge and Data Engineering 20.6 (June 2008), pp. 796–808. ISSN: 1041-4347. DOI: 10.1109/TKDE.2007.190745.
[36] Xiaoxin Yin and Wenzhao Tan. “Semi-supervised Truth Discovery”. In: Proceedings of the 20th International Conference on World Wide Web. WWW ’11. Hyderabad, India: ACM, 2011, pp. 217–226. ISBN: 978-1-4503-0632-4. DOI: 10.1145/1963405.1963439. URL: http://doi.acm.org/10.1145/1963405.1963439.
[37] Daniel Yue Zhang et al. “On robust truth discovery in sparse social media sensing”. In: 2016 IEEE International Conference on Big Data (Big Data). Dec. 2016, pp. 1076–1081. DOI: 10.1109/BigData.2016.7840710.
[38] Liyan Zhang et al. “Latent Dirichlet Truth Discovery: Separating Trustworthy and Untrustworthy Components in Data Sources”. In: IEEE Access 6 (2018), pp. 1741–1752. ISSN: 2169-3536. DOI: 10.1109/ACCESS.2017.2780182.
[39] Zhou Zhao, James Cheng, and Wilfred Ng. “Truth Discovery in Data Streams: A Single-Pass Probabilistic Approach”. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. CIKM ’14. Shanghai, China: ACM, 2014, pp. 1589–1598. ISBN: 978-1-4503-2598-1. DOI: 10.1145/2661829.2661892. URL: http://doi.acm.org/10.1145/2661829.2661892.

[40] Shi Zhi et al. “Modeling Truth Existence in Truth Discovery”. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’15. Sydney, NSW, Australia: ACM, 2015, pp. 1543–1552. ISBN: 978-1-4503-3664-2. DOI: 10.1145/2783258.2783339. URL: http://doi.acm.org/10.1145/2783258.2783339.
[41] William S. Zwicker. “Introduction to the Theory of Voting”. In: Handbook of Computational Social Choice. Ed. by Felix Brandt et al. 1st. New York, NY, USA: Cambridge University Press, 2016. Chap. 2.

Appendix A

Web-interface Screenshots


Figure A.1: Basic view of the input form in the web interface.

Figure A.2: CSV dataset entry in the web interface.

Figure A.3: Advanced options for algorithm parameters in the web interface.

Figure A.4: Example results of a single algorithm run via the web interface.

Figure A.5: Example results of several algorithms run simultaneously via the web interface.

Figure A.6: Graph representation of results of a truth discovery algorithm, presented in the web interface.

Figure A.7: Frame from an animation of the results of an algorithm as it iterates, presented in the web interface.

Appendix B

Test Results

======test session starts ======platform linux -- Python 3.6.7, pytest-4.2.0, py-1.7.0, pluggy-0.8.1 -- /home/joe/ documents/uni/project/code/venv/bin/python3 cachedir: .pytest_cache rootdir: /home/joe/documents/uni/project/code, inifile: collecting ... collecting 31 items collected 150 items truthdiscovery/test/test_algorithms.py::TestBase::test_empty_dataset PASSED [ 0%] truthdiscovery/test/test_algorithms.py::TestBase::test_get_parameter_names PASSED [ 1%] truthdiscovery/test/test_algorithms.py::TestVoting::test_basic PASSED [ 2%] truthdiscovery/test/test_algorithms.py::TestBaseIterative::test_run_fail PASSED [ 2%] truthdiscovery/test/test_algorithms.py::TestBaseIterative::test_fixed_priors PASSED [ 3%] truthdiscovery/test/test_algorithms.py::TestBaseIterative::test_voted_priors PASSED [ 4%] truthdiscovery/test/test_algorithms.py::TestBaseIterative::test_uniform_priors PASSED [ 4%] truthdiscovery/test/test_algorithms.py::TestBaseIterative::test_invalid_priors PASSED [ 5%] truthdiscovery/test/test_algorithms.py::TestSums::test_basic PASSED [ 6%] truthdiscovery/test/test_algorithms.py::TestAverageLog::test_basic PASSED [ 6%] truthdiscovery/test/test_algorithms.py::TestInvestment::test_basic PASSED [ 7%] truthdiscovery/test/test_algorithms.py::TestInvestment::test_converge_to_zero PASSED [ 8%] truthdiscovery/test/test_algorithms.py::TestPooledInvestment::test_basic PASSED [ 8%] truthdiscovery/test/test_algorithms.py::TestTruthFinder::test_basic PASSED [ 9%] truthdiscovery/test/test_algorithms.py::TestTruthFinder::test_no_implications PASSED [ 10%] truthdiscovery/test/test_algorithms.py::TestTruthFinder::test_trust_invalid PASSED [ 10%] truthdiscovery/test/test_algorithms.py::TestOnLargeData::test_sums PASSED [ 11%] truthdiscovery/test/test_algorithms.py::TestOnLargeData::test_average_log PASSED [ 12%] truthdiscovery/test/test_algorithms.py::TestOnLargeData::test_investment PASSED [ 12%]


truthdiscovery/test/test_algorithms.py::TestOnLargeData::test_pooled_investment PASSED [ 13%] truthdiscovery/test/test_algorithms.py::TestOnLargeData::test_truthfinder PASSED [ 14%] truthdiscovery/test/test_algorithms.py::TestOnLargeData::test_voting PASSED [ 14%] truthdiscovery/test/test_algorithms.py::TestIteratorsForAlgorithms:: test_default_iterator_types PASSED [ 15%] truthdiscovery/test/test_algorithms.py::TestLoggingAlgorithm::test_final_result PASSED [ 16%] truthdiscovery/test/test_algorithms.py::TestLoggingAlgorithm::test_no_logging PASSED [ 16%] truthdiscovery/test/test_algorithms.py::TestLoggingAlgorithm::test_partial_results PASSED [ 17%] truthdiscovery/test/test_algorithms.py::TestLoggingAlgorithm::test_sums_detailed PASSED [ 18%] truthdiscovery/test/test_clients.py::TestBaseClient::test_get_iterator PASSED [ 18%] truthdiscovery/test/test_clients.py::TestBaseClient::test_get_algorithm_parameter PASSED [ 19%] truthdiscovery/test/test_clients.py::TestBaseClient::test_get_output_obj PASSED [ 20%] truthdiscovery/test/test_clients.py::TestBaseClient::test_get_algorithm_params PASSED [ 20%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_no_commands PASSED [ 21%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_basic PASSED [ 22%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_results PASSED [ 22%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_multiple_parameters PASSED [ 23%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_multiple_algorithms PASSED [ 24%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_get_algorithm_instance PASSED [ 24%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_set_prior_belief PASSED [ 25%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_set_iterator PASSED [ 26%] truthdiscovery/test/test_clients.py::TestCommandLineClient:: test_invalid_algorithm_parameter PASSED [ 26%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_invalid_algorithm PASSED [ 27%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_filter_sources_variables PASSED [ 28%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_default_output PASSED [ 28%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_custom_output PASSED [ 29%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_show_most_believed_values PASSED [ 30%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_belief_stats PASSED [ 30%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_synthetic_generation PASSED [ 31%] truthdiscovery/test/test_clients.py::TestCommandLineClient:: test_synthetic_generation_claim_prob_1 PASSED [ 32%] truthdiscovery/test/test_clients.py::TestCommandLineClient:: test_synthetic_generation_source_trust_1 PASSED [ 32%] truthdiscovery/test/test_clients.py::TestCommandLineClient:: test_synthetic_generation_invalid_params PASSED [ 33%] truthdiscovery/test/test_clients.py::TestCommandLineClient:: test_supervised_dataset_and_accuracy PASSED [ 34%] APPENDIX B. TEST RESULTS 121

truthdiscovery/test/test_clients.py::TestCommandLineClient::test_accuracy_not_supervised PASSED [ 34%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_accuracy_undefined PASSED [ 35%] truthdiscovery/test/test_clients.py::TestCommandLineClient::test_graph_generation PASSED [ 36%] truthdiscovery/test/test_clients.py::TestWebClient::test_routing PASSED [ 36%] truthdiscovery/test/test_clients.py::TestWebClient::test_home PASSED [ 37%] truthdiscovery/test/test_clients.py::TestWebClient::test_run_fail PASSED [ 38%] truthdiscovery/test/test_clients.py::TestWebClient::test_run_success PASSED [ 38%] truthdiscovery/test/test_clients.py::TestWebClient::test_run_multiple_algorithms PASSED [ 39%] truthdiscovery/test/test_clients.py::TestWebClient::test_results_diff PASSED [ 40%] truthdiscovery/test/test_clients.py::TestWebClient:: test_results_diff_invalid_previous_results PASSED [ 40%] truthdiscovery/test/test_clients.py::TestWebClient::test_get_json_graph PASSED [ 41%] truthdiscovery/test/test_clients.py::TestWebClient::test_get_json_animated_gif PASSED [ 42%] truthdiscovery/test/test_clients.py::TestWebClient::test_voting_imagery PASSED [ 42%] truthdiscovery/test/test_clients.py::TestWebClient::test_empty_dataset PASSED [ 43%] truthdiscovery/test/test_clients.py::TestWebClient::test_non_convergence PASSED [ 44%] truthdiscovery/test/test_codestyle.py::TestCodeStyle::test_pep8 PASSED [ 44%] truthdiscovery/test/test_drawing.py::TestRendering::test_entity_positioning PASSED [ 45%] truthdiscovery/test/test_drawing.py::TestRendering::test_node_ordering PASSED [ 46%] truthdiscovery/test/test_drawing.py::TestRendering::test_single_source PASSED [ 46%] truthdiscovery/test/test_drawing.py::TestRendering::test_invalid_node_size PASSED [ 47%] truthdiscovery/test/test_drawing.py::TestRendering::test_no_horizontal_overlapping PASSED [ 48%] truthdiscovery/test/test_drawing.py::TestRendering::test_png_is_default PASSED [ 48%] truthdiscovery/test/test_drawing.py::TestRendering::test_long_labels PASSED [ 49%] truthdiscovery/test/test_drawing.py::TestRendering::test_matrix_renderer PASSED [ 50%] truthdiscovery/test/test_drawing.py::TestRendering::test_image_size PASSED [ 50%] truthdiscovery/test/test_drawing.py::TestRendering::test_progress_bar PASSED [ 51%] truthdiscovery/test/test_drawing.py::TestBackends::test_base_backend PASSED [ 52%] truthdiscovery/test/test_drawing.py::TestBackends::test_valid_png PASSED [ 52%] truthdiscovery/test/test_drawing.py::TestBackends::test_results_based_valid_png PASSED [ 53%] truthdiscovery/test/test_drawing.py::TestBackends::test_json_backend PASSED [ 54%] truthdiscovery/test/test_drawing.py::TestBackends::test_json_backend_entity_serialisation PASSED [ 54%] truthdiscovery/test/test_drawing.py::TestColourSchemes::test_get_gradient_colour PASSED [ 55%] truthdiscovery/test/test_drawing.py::TestColourSchemes::test_plain_colour_scheme PASSED [ 56%] truthdiscovery/test/test_drawing.py::TestColourSchemes::test_results_based_colours PASSED [ 56%] truthdiscovery/test/test_drawing.py::TestAnimations::test_base PASSED [ 57%] truthdiscovery/test/test_drawing.py::TestAnimations::test_gif_animation PASSED [ 58%] truthdiscovery/test/test_drawing.py::TestAnimations::test_json_animation PASSED [ 58%] truthdiscovery/test/test_drawing.py::TestAnimations::test_renderer PASSED [ 59%] truthdiscovery/test/test_drawing.py::TestAnimations::test_fps PASSED [ 60%] truthdiscovery/test/test_drawing.py::TestAnimations::test_invalid_backend PASSED [ 60%] 
truthdiscovery/test/test_drawing.py::TestAnimations::test_default_backend PASSED [ 61%] truthdiscovery/test/test_drawing.py::TestAnimations::test_progress_bar PASSED [ 62%] truthdiscovery/test/test_input.py::TestDataset::test_num_sources_variables_claims PASSED [ 62%] truthdiscovery/test/test_input.py::TestDataset::test_claims_matrix PASSED [ 63%] APPENDIX B. TEST RESULTS 122

truthdiscovery/test/test_input.py::TestDataset::test_mut_ex_matrix PASSED [ 64%] truthdiscovery/test/test_input.py::TestDataset:: test_source_multiple_claims_for_a_single_variable PASSED [ 64%] truthdiscovery/test/test_input.py::TestIDMapping::test_insert PASSED [ 65%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_create PASSED [ 66%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_num_sources_variables_claims PASSED [ 66%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_dimension PASSED [ 67%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_from_csv PASSED [ 68%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_invalid_csv_shape PASSED [ 68%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_from_csv_empty_rows PASSED [ 69%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_from_csv_single_row_or_column PASSED [ 70%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_claims_matrix PASSED [ 70%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_mutual_exclusion_matrix PASSED [ 71%] truthdiscovery/test/test_input.py::TestMatrixDataset::test_export_to_csv PASSED [ 72%] truthdiscovery/test/test_input.py::TestSupervisedData::test_from_csv PASSED [ 72%] truthdiscovery/test/test_input.py::TestSupervisedData::test_accuracy PASSED [ 73%] truthdiscovery/test/test_input.py::TestSupervisedData::test_unknown_variable PASSED [ 74%] truthdiscovery/test/test_input.py::TestSupervisedData::test_no_true_values_known PASSED [ 74%] truthdiscovery/test/test_input.py::TestSupervisedData::test_not_enough_claimed_values PASSED [ 75%] truthdiscovery/test/test_input.py::TestSyntheticData::test_trust_as_list PASSED [ 76%] truthdiscovery/test/test_input.py::TestSyntheticData::test_invalid_trust_vector_shape PASSED [ 76%] truthdiscovery/test/test_input.py::TestSyntheticData::test_trust_range_error PASSED [ 77%] truthdiscovery/test/test_input.py::TestSyntheticData::test_valid_trusts PASSED [ 78%] truthdiscovery/test/test_input.py::TestSyntheticData::test_claim_probability PASSED [ 78%] truthdiscovery/test/test_input.py::TestSyntheticData::test_invalid_claim_probability PASSED [ 79%] truthdiscovery/test/test_input.py::TestSyntheticData::test_domain_size PASSED [ 80%] truthdiscovery/test/test_input.py::TestSyntheticData::test_export_to_csv PASSED [ 80%] truthdiscovery/test/test_input.py::TestImplications::test_implications PASSED [ 81%] truthdiscovery/test/test_input.py::TestImplications::test_invalid_implication_values PASSED [ 82%] truthdiscovery/test/test_input.py::TestFileDataset::test_base PASSED [ 82%] truthdiscovery/test/test_input.py::TestFileDataset::test_basic PASSED [ 83%] truthdiscovery/test/test_input.py::TestFileDataset::test_implications PASSED [ 84%] truthdiscovery/test/test_input.py::TestFileSupervisedData::test_base PASSED [ 84%] truthdiscovery/test/test_input.py::TestFileSupervisedData::test_basic PASSED [ 85%] truthdiscovery/test/test_iterators.py::TestBaseIterator::test_run_base_fail PASSED [ 86%] truthdiscovery/test/test_iterators.py::TestBaseIterator::test_it_count PASSED [ 86%] truthdiscovery/test/test_iterators.py::TestBaseIterator::test_reset PASSED [ 87%] truthdiscovery/test/test_iterators.py::TestFixedIterator::test_invalid_limit PASSED [ 88%] truthdiscovery/test/test_iterators.py::TestFixedIterator::test_finish_condition PASSED [ 88%] truthdiscovery/test/test_iterators.py::TestConvergenceIterator::test_basic PASSED [ 89%] truthdiscovery/test/test_iterators.py::TestConvergenceIterator:: 
test_invalid_distance_measures PASSED [ 90%] truthdiscovery/test/test_iterators.py::TestConvergenceIterator::test_did_not_converge PASSED [ 90%] truthdiscovery/test/test_iterators.py::TestDistanceMeasures::test_l1 PASSED [ 91%] truthdiscovery/test/test_iterators.py::TestDistanceMeasures::test_l2 PASSED [ 92%] truthdiscovery/test/test_iterators.py::TestDistanceMeasures::test_l_inf PASSED [ 92%] APPENDIX B. TEST RESULTS 123

truthdiscovery/test/test_iterators.py::TestDistanceMeasures::test_cosine PASSED [ 93%] truthdiscovery/test/test_result.py::TestResult::test_most_believed_values PASSED [ 94%] truthdiscovery/test/test_result.py::TestResult::test_num_iterations PASSED [ 94%] truthdiscovery/test/test_result.py::TestResult::test_time_taken PASSED [ 95%] truthdiscovery/test/test_result.py::TestResult::test_filter_result PASSED [ 96%] truthdiscovery/test/test_result.py::TestResult::test_stats PASSED [ 96%] truthdiscovery/test/test_result.py::TestResultDiff::test_no_common_sources_or_vars PASSED [ 97%] truthdiscovery/test/test_result.py::TestResultDiff::test_common_sources PASSED [ 98%] truthdiscovery/test/test_result.py::TestResultDiff::test_common_vars_but_no_common_values PASSED [ 98%] truthdiscovery/test/test_result.py::TestResultDiff::test_common_var_values PASSED [ 99%] truthdiscovery/test/test_result.py::TestResultDiff::test_no_iteration_info PASSED [100%]

====== 150 passed in 9.57 seconds ======

Appendix C

Proofs

C.1 Proof of proposition 1

Proof. The ‘only if’ direction is clear. For the converse, suppose T is source, fact and object symmetric, and let $N$, $N' = \pi(N)$ be equivalent networks. Define
$$\pi_S(x) = \begin{cases} \pi(x) & \text{if } x \in S \\ x & \text{if } x \in F \cup O \end{cases}$$
Define $\pi_F$, $\pi_O$ similarly, and set $\sigma = \pi_S \circ \pi_F \circ \pi_O$. Then for $s \in S$,
$$\sigma(s) = \pi_S(\pi_F(\pi_O(s))) = \pi_S(\pi_F(s)) = \pi_S(s) = \pi(s)$$
Similarly $\sigma(f) = \pi(f)$ and $\sigma(o) = \pi(o)$ for $f \in F$ and $o \in O$. Hence $\sigma = \pi$.
Let $s_1, s_2 \in S$. Since $\pi_O$ only permutes objects, we may apply object-symmetry to get
$$s_1 \sqsubseteq^T_N s_2 \iff \pi_O(s_1) \sqsubseteq^T_{\pi_O(N)} \pi_O(s_2)$$
Then since $\pi_F$ only permutes facts and $\pi_S$ only permutes sources, we may successively apply fact and source symmetry to the right hand side to get
$$s_1 \sqsubseteq^T_N s_2 \iff \sigma(s_1) \sqsubseteq^T_{\sigma(N)} \sigma(s_2) \iff \pi(s_1) \sqsubseteq^T_{\pi(N)} \pi(s_2) \iff \pi(s_1) \sqsubseteq^T_{N'} \pi(s_2)$$
An identical argument for the fact ranking gives $f_1 \preceq^T_N f_2 \iff \pi(f_1) \preceq^T_{N'} \pi(f_2)$. Hence T is symmetric.

C.2 Proof of proposition 2

Proof. For the first claim, consider the permutation $\pi = (s_1, s_2)$. We claim that $\pi(N) = N$. Let $E$ be the edges in $N$ and $\pi(E)$ be the edges in $\pi(N)$. For any $(s, f) \in S \times F$ we have three cases:

1. $s \notin \{s_1, s_2\}$: in this case $(\pi(s), \pi(f)) = (s, f)$, and $(s, f) \in E$ iff $(\pi(s), \pi(f)) \in \pi(E)$ by definition, so $(s, f) \in E$ iff $(s, f) \in \pi(E)$.

2. $s = s_1$: here we have $(s_1, f) \in E$ iff $(s_2, f) \in E$ by hypothesis. This is equivalent to $(\pi(s_2), \pi(f)) \in \pi(E)$ by definition of $\pi(E)$, which by definition of $\pi$ is equivalent to $(s_1, f) \in \pi(E)$.

3. $s = s_2$: as above, $(s_2, f) \in E$ iff $(s_2, f) \in \pi(E)$.

Also, it is clear that $(f, o) \in E$ iff $(f, o) \in \pi(E)$ for $f \in F$, $o \in O$. We have shown that $(x, y) \in E$ iff $(x, y) \in \pi(E)$, i.e. $\pi(E) = E$ and thus $\pi(N) = N$. Applying source-symmetry, this gives
$$s_1 \sqsubseteq^T_N s_2 \iff \pi(s_1) \sqsubseteq^T_{\pi(N)} \pi(s_2) \iff s_2 \sqsubseteq^T_N s_1$$
i.e. $s_1 \simeq^T_N s_2$.
For the second claim, consider $\sigma = (f_1, f_2)$ and apply a similar argument to the above. Note that we require $f_1$ and $f_2$ to be associated with the same object here so that both facts have the same incoming and outgoing edges in $N$.

C.3 Proof of proposition 3

Proof. Suppose T is source-symmetric and a dictatorship with dictator $s^*$. Let $N = (V, E)$ be a truth discovery network.

1. Let $s_1, s_2 \in S$. Without loss of generality $s_1 \sqsubseteq^T_N s_2$, since $\sqsubseteq^T_N$ is complete. Consider the permutation $\pi = (s_1, s^*)$. We have $s_2 \sqsubseteq^T_{\pi(N)} s^*$ by dictatorship in $\pi(N)$, which by symmetry means $\pi^{-1}(s_2) \sqsubseteq^T_N \pi^{-1}(s^*)$, i.e. $s_2 \sqsubseteq^T_N s_1$. Hence $s_1 \simeq^T_N s_2$ as required.

2. Let $f_1, f_2 \in F$ be such that there is a source $s$ with $(s, f_1) \in E$. Set $\sigma = (s, s^*)$. Then $\sigma(E)$ contains $(\sigma(s), \sigma(f_1)) = (s^*, f_1)$, so we have $\sigma(f_2) \preceq^T_{\sigma(N)} f_1 = \sigma(f_1)$ since the facts claimed by $s^*$ rank above all others. Applying symmetry gives $f_2 \preceq^T_N f_1$ as required.

This implies there is no source-symmetric strict dictatorship. If T were such an operator then T would also be a non-strict dictatorship, so symmetry implies all sources are ranked equally, but this contradicts $s \sqsubset^T_N s^*$ for all $s \neq s^*$.

C.4 Proof of proposition 4

Proof. The operator T defined in example 2 is not a dictatorship but is a binary generalised dictatorship. Indeed, it is clearly a binary generalised dictatorship from the definition. To show it is not a dictatorship, it is sufficient by symmetry and proposition 3 to find a network in which T does not rank all sources equally. A suitable example is a network $N$ where a source $s$ claims two facts and all other sources claim only one fact. Then $Q_N = \{s\}$, so $s$ is ranked strictly above all other sources.
For an operator that is a dictatorship but not a binary generalised dictatorship, fix two distinct sources $s^*, s^{**} \in S$ and consider the numerical operator $T'$ defined by
$$t_N(s) = \begin{cases} 2 & \text{if } s = s^* \\ 1 & \text{if } s = s^{**} \\ 0 & \text{otherwise} \end{cases} \qquad b_N(f) = \begin{cases} 1 & \text{if } s^* \in \mathrm{src}(N, f) \\ 0 & \text{otherwise} \end{cases}$$
Clearly this is a dictatorship with dictator source $s^*$. Also, for any network $N$ we have $s \sqsubset^{T'}_N s^{**} \sqsubset^{T'}_N s^*$ for any source $s \notin \{s^*, s^{**}\}$; such a chain of strict inequalities cannot occur for a binary generalised dictatorship, so we are done.

C.5 Proof of proposition 5

Proof. An operator that satisfies unanimity but not groundedness is the operator T that ranks facts according to the function
$$r_N(f) = \begin{cases} 1 & \text{if } \mathrm{src}(N, f) \in \{S, \emptyset\} \\ 0 & \text{otherwise} \end{cases} \qquad (f \in F)$$
i.e. $f_1 \preceq^T_N f_2$ iff $r_N(f_1) \le r_N(f_2)$ (note that T's ranking of sources is irrelevant). Clearly T satisfies unanimity but not groundedness: consider any network $N$ in which there is a fact $f_1$ with no corresponding sources, and a fact $f_2$ with at least one source, but whose sources are not the whole of $S$. Then $f_1 \not\preceq^T_N f_2$, contrary to the requirements of groundedness.
Reversing the fact ordering of T gives an operator satisfying groundedness but not unanimity.

C.6 Proof of lemma 1

The following lemma will be used to prove the claim regarding weak independence.

Lemma 2. Let $(a_n)_{n \in \mathbb{N}}$, $(b_n)_{n \in \mathbb{N}}$ be convergent sequences of real numbers, and let $L = \lim_{n \to \infty} a_n$, $M = \lim_{n \to \infty} b_n$.

1. If $L < M$, then $a_n < b_n$ for sufficiently large $n$.

2. If $a_n \le b_n$ for sufficiently large $n$, then $L \le M$.

Proof. For the first claim, suppose $L < M$. Set $\varepsilon = \frac{1}{2}(M - L) > 0$. By definition of the limit $a_n \to L$, there is $N_1 \in \mathbb{N}$ such that $|a_n - L| < \varepsilon$ for all $n > N_1$. In particular, $a_n - L < \varepsilon = \frac{1}{2}M - \frac{1}{2}L$ and so $a_n < \frac{1}{2}(M + L)$. On the other hand, by definition of $b_n \to M$ there is $N_2 \in \mathbb{N}$ such that $|b_n - M| < \varepsilon$ for $n > N_2$, and in particular $b_n - M > -\varepsilon = \frac{1}{2}L - \frac{1}{2}M$ and so $b_n > \frac{1}{2}(M + L)$. Thus, for $n > N := \max\{N_1, N_2\}$ we have $a_n < \frac{1}{2}(M + L) < b_n$.
The second claim is now immediate; if $a_n \le b_n$ for sufficiently large $n$ and $L \le M$ were false, then by the first claim we would have $b_n < a_n$ for sufficiently large $n$, which is a contradiction.

Proof of lemma 1. Let I be as in the statement of the lemma.

1. This is immediate since $t^*_N(s)$ is a limit of numbers in $[0, 1]$, which is a closed set, so $t^*_N(s) \in [0, 1]$ (and similarly for $b^*_N(f)$).

2. Let $N$ and $N' = \pi(N)$ be equivalent networks. For any $s_1, s_2 \in S$ we have
$$s_1 \sqsubseteq^I_N s_2 \iff \lim_{n \to \infty} t^n_N(s_1) \le \lim_{n \to \infty} t^n_N(s_2) \iff \lim_{n \to \infty} t^n_{\pi(N)}(\pi(s_1)) \le \lim_{n \to \infty} t^n_{\pi(N)}(\pi(s_2)) \iff \pi(s_1) \sqsubseteq^I_{\pi(N)} \pi(s_2) \iff \pi(s_1) \sqsubseteq^I_{N'} \pi(s_2)$$
and for $f_1, f_2 \in F$,
$$f_1 \preceq^I_N f_2 \iff \lim_{n \to \infty} b^n_N(f_1) \le \lim_{n \to \infty} b^n_N(f_2) \iff \lim_{n \to \infty} b^n_{\pi(N)}(\pi(f_1)) \le \lim_{n \to \infty} b^n_{\pi(N)}(\pi(f_2)) \iff \pi(f_1) \preceq^I_{\pi(N)} \pi(f_2) \iff \pi(f_1) \preceq^I_{N'} \pi(f_2)$$
Hence I is symmetric.

3. Let $N$ be a truth discovery network and $f \in F$ with $\mathrm{src}(N, f) = S$. By hypothesis there is $M \in \mathbb{N}$ such that $b^n_N(f) = 1$ for $n > M$, so $b^*_N(f) = \lim_{n \to \infty} 1 = 1 \ge b^*_N(f')$ for any $f'$. This means $f' \preceq^I_N f$, i.e. I satisfies unanimity.

4. Similarly, if $\mathrm{src}(N, f) = \emptyset$ then by hypothesis $b^n_N(f) = 0$ for $n$ sufficiently large, so $b^*_N(f) = 0 \le b^*_N(f')$ for any $f'$, and $f \preceq^I_N f'$ as required for groundedness.

5. Let $G$, $N_1$, $N_2$ be as in the statement of the claim. For $s_1, s_2 \in G \cap S$ we have
$$s_1 \sqsubseteq^I_{N_1} s_2 \iff \lim_{n \to \infty} t^n_{N_1}(s_1) \le \lim_{n \to \infty} t^n_{N_1}(s_2) \iff \lim_{n \to \infty} t^n_{N_2}(s_1) \le \lim_{n \to \infty} t^n_{N_2}(s_2) \iff s_1 \sqsubseteq^I_{N_2} s_2$$
and an identical argument proves the same for facts. Hence I satisfies independence of irrelevant stuff.

6. Let $G$, $N_1$, $N_2$ be as in the statement of the claim. Let $s_1, s_2 \in G \cap S$ and suppose $s_1 \sqsubset^I_{N_1} s_2$, i.e. $\lim_{n \to \infty} t^n_{N_1}(s_1) < \lim_{n \to \infty} t^n_{N_1}(s_2)$. By lemma 2 part 1, there is $K \in \mathbb{N}$ such that $t^n_{N_1}(s_1) < t^n_{N_1}(s_2)$ for $n > K$. For any such $n$, $\alpha_n \ge 0$ means $t^n_{N_2}(s_1) = \alpha_n \cdot t^n_{N_1}(s_1) \le \alpha_n \cdot t^n_{N_1}(s_2) = t^n_{N_2}(s_2)$. Lemma 2 part 2 then gives $\lim_{n \to \infty} t^n_{N_2}(s_1) \le \lim_{n \to \infty} t^n_{N_2}(s_2)$, i.e. $s_1 \sqsubseteq^I_{N_2} s_2$. An identical argument proves the result for the fact ranking. Hence I satisfies weak independence.

C.7 Proof of theorem 1

Proof. Assume Sums is convergent. Since trust and belief scores for Sums always lie in the range $[0, 1]$, lemma 1 can be applied.

1. Symmetry: We will use lemma 1 part 2. Let $N$ and $N' = \pi(N)$ be equivalent networks. We show that $t^n_N(s) = t^n_{\pi(N)}(\pi(s))$ and $b^n_N(f) = b^n_{\pi(N)}(\pi(f))$ for all $n \in \mathbb{N}$ by induction.
The base case $n = 1$ is clear since $t^1_N$, $t^1_{\pi(N)}$, $b^1_N$ and $b^1_{\pi(N)}$ are constant with value $\frac{1}{2}$. Suppose $n \in \mathbb{N}$ is such that $t^n_N(s) = t^n_{\pi(N)}(\pi(s))$ and $b^n_N(f) = b^n_{\pi(N)}(\pi(f))$ for all $s \in S$ and $f \in F$.
Let $s \in S$. Note that $f \in \mathrm{facts}(N, s)$ iff $\pi(f) \in \mathrm{facts}(\pi(N), \pi(s))$ by definition of $\pi(N)$. In particular, $\pi$ restricted to $\mathrm{facts}(N, s)$ is a bijection into $\mathrm{facts}(\pi(N), \pi(s))$, so we may consider a ‘substitution’ $y = \pi(f)$ in the sum for $\hat{t}^{n+1}_N(s)$:
$$\hat{t}^{n+1}_N(s) = \sum_{f \in \mathrm{facts}(N, s)} b^n_N(f) = \sum_{y \in \mathrm{facts}(\pi(N), \pi(s))} b^n_N(\pi^{-1}(y)) = \sum_{y \in \mathrm{facts}(\pi(N), \pi(s))} b^n_{\pi(N)}(\pi(\pi^{-1}(y))) = \sum_{y \in \mathrm{facts}(\pi(N), \pi(s))} b^n_{\pi(N)}(y) = \hat{t}^{n+1}_{\pi(N)}(\pi(s))$$
Similarly, for $f \in F$ note that $\pi$ restricted to $\mathrm{src}(N, f)$ is a bijection into $\mathrm{src}(\pi(N), \pi(f))$: using the above and an identical argument we get $\hat{b}^{n+1}_N(f) = \hat{b}^{n+1}_{\pi(N)}(\pi(f))$. Now we have
$$t^{n+1}_N(s) = \frac{\hat{t}^{n+1}_N(s)}{\max_{x \in S} \hat{t}^{n+1}_N(x)} = \frac{\hat{t}^{n+1}_{\pi(N)}(\pi(s))}{\max_{x \in S} \hat{t}^{n+1}_{\pi(N)}(\pi(x))}$$
Note that $\pi$ restricted to $S$ is a bijection into $S$ itself, since by definition of equivalent networks each of $S$, $F$ and $O$ is closed under $\pi$. Hence we may replace $\pi(x)$ in the maximum in the denominator with simply $x$ by surjectivity of $\pi$, and so $t^{n+1}_N(s) = t^{n+1}_{\pi(N)}(\pi(s))$. Similarly, $b^{n+1}_N(f) = b^{n+1}_{\pi(N)}(\pi(f))$. Hence, by lemma 1, Sums satisfies symmetry.

2. Non-dictatorship: We have shown that Sums satisfies symmetry. In particular Sums satisfies source-symmetry, so given proposition 3 it suffices to show that there is at least one truth discovery network $N$ for which Sums does not rank all sources equally. The network shown in figure 4.4 is a suitable example. In this network there are three sources A, B and C, and two facts D and E, each relating to a different object.1 We will show that Sums converges on this network, and that B ranks strictly beneath A.
For brevity write $a_n$ for $t^n_N(A)$, $\hat{a}_n$ for $\hat{t}^n_N(A)$, and similarly for B, C, D, E. We claim that for all $n > 1$, $a_n = 1$, $b_n = c_n = \frac{1}{2}$ and $d_n = e_n = 1$. By definition, for all $n > 1$:
$$\hat{a}_n = d_{n-1} + e_{n-1}, \quad \hat{b}_n = d_{n-1}, \quad \hat{c}_n = e_{n-1}, \quad \hat{d}_n = \hat{a}_n + \hat{b}_n, \quad \hat{e}_n = \hat{a}_n + \hat{c}_n$$
Recalling that $d_1 = e_1 = \frac{1}{2}$, taking $n = 2$ we get
$$\hat{a}_2 = 1, \quad \hat{b}_2 = \tfrac{1}{2}, \quad \hat{c}_2 = \tfrac{1}{2}, \quad \hat{d}_2 = \tfrac{3}{2}, \quad \hat{e}_2 = \tfrac{3}{2}$$
and so $a_2 = 1$, $b_2 = c_2 = \frac{1}{2}$ and $d_2 = e_2 = 1$. For $n = 3$ we get
$$\hat{a}_3 = d_2 + e_2 = 2, \quad \hat{b}_3 = d_2 = 1, \quad \hat{c}_3 = e_2 = 1, \quad \hat{d}_3 = \hat{a}_3 + \hat{b}_3 = 3, \quad \hat{e}_3 = \hat{a}_3 + \hat{c}_3 = 3$$
and so $a_3 = 1$, $b_3 = c_3 = \frac{1}{2}$ and $d_3 = e_3 = 1$, i.e. the trust/belief scores are unchanged. Repeating in this way we see that $a_n = 1$ and $b_n = \frac{1}{2}$ for all $n > 1$. Hence $t^*_N(A) = 1 > \frac{1}{2} = t^*_N(B)$: this means B ranks strictly below A, which completes the proof.

1 If S contains more than three sources we consider all other sources to have no outgoing edges (and similarly if F contains more than two facts). Note that the objects do not play any role in Sums.

3. Unanimity: Let $N$ be a truth discovery network and suppose $f \in F$ is such that $\mathrm{src}(N, f) = S$. Then, for $n > 1$ and any $f' \in F$,
$$\hat{b}_n(f') = \sum_{s \in \mathrm{src}(N, f')} \hat{t}_n(s) \le \sum_{s \in S} \hat{t}_n(s) = \hat{b}_n(f)$$
using the fact that $\hat{t}_n(s) \ge 0$. Therefore $\max_{y \in F} \hat{b}_n(y) = \hat{b}_n(f)$, so $b_n(f) = 1$. By lemma 1 part 3, Sums satisfies unanimity.

4. Groundedness: Let $N$ be a truth discovery network and suppose $f \in F$ has $\mathrm{src}(N, f) = \emptyset$. By definition $b_n(f) = 0$ for all $n > 1$, so by lemma 1, Sums satisfies groundedness.

5. Weak independence: To show weak independence we will use lemma 1 part 6. Let $N_1$, $N_2$ be truth discovery networks with a common connected component $G$. Suitable sequences $(\alpha_n)_{n \in \mathbb{N}}$, $(\beta_n)_{n \in \mathbb{N}}$ will be defined recursively.
For $n = 1$, set $\alpha_1 = \beta_1 = 1$. We have, by definition of Sums,
$$t^1_{N_2}(s) = \tfrac{1}{2} = \alpha_1 \cdot \tfrac{1}{2} = \alpha_1 \cdot t^1_{N_1}(s), \qquad b^1_{N_2}(f) = \tfrac{1}{2} = \beta_1 \cdot \tfrac{1}{2} = \beta_1 \cdot b^1_{N_1}(f)$$
for any $s \in G \cap S$ and $f \in G \cap F$. Now suppose $n \in \mathbb{N}$ is such that there are $\alpha_{n-1}, \beta_{n-1} \ge 0$ with
$$t^{n-1}_{N_2}(s) = \alpha_{n-1} \cdot t^{n-1}_{N_1}(s), \qquad b^{n-1}_{N_2}(f) = \beta_{n-1} \cdot b^{n-1}_{N_1}(f)$$
for all $s \in G \cap S$ and $f \in G \cap F$.
Fix a source $s \in G \cap S$. Let $E_1$, $E_2$ and $E_G$ denote the sets of edges in $N_1$, $N_2$ and $G$ respectively. Note that for any fact $f \in F$,
$$f \in \mathrm{facts}(N_1, s) \iff (s, f) \in E_1 \iff (s, f) \in E_G \iff (s, f) \in E_2 \iff f \in \mathrm{facts}(N_2, s)$$
so $\mathrm{facts}(N_1, s) = \mathrm{facts}(N_2, s)$. Hence
$$\hat{t}^n_{N_2}(s) = \sum_{f \in \mathrm{facts}(N_2, s)} b^{n-1}_{N_2}(f) = \sum_{f \in \mathrm{facts}(N_1, s)} \beta_{n-1} \cdot b^{n-1}_{N_1}(f) = \beta_{n-1} \sum_{f \in \mathrm{facts}(N_1, s)} b^{n-1}_{N_1}(f) = \beta_{n-1} \hat{t}^n_{N_1}(s)$$
Write
$$\gamma_1 = \frac{1}{\max_{x \in S} \hat{t}^n_{N_1}(x)}, \qquad \gamma_2 = \frac{1}{\max_{x \in S} \hat{t}^n_{N_2}(x)}$$
so that
$$t^n_{N_1}(s) = \gamma_1 \cdot \hat{t}^n_{N_1}(s), \qquad t^n_{N_2}(s) = \gamma_2 \cdot \hat{t}^n_{N_2}(s) = \gamma_2 \cdot \beta_{n-1} \cdot \hat{t}^n_{N_1}(s) = \frac{\gamma_2 \cdot \beta_{n-1}}{\gamma_1} \cdot \gamma_1 \cdot \hat{t}^n_{N_1}(s) = \frac{\gamma_2 \cdot \beta_{n-1}}{\gamma_1} \cdot t^n_{N_1}(s)$$
Taking $\alpha_n = \frac{\gamma_2 \beta_{n-1}}{\gamma_1}$ we have the desired equality for trust scores. Note that $\gamma_1, \gamma_2 > 0$, so $\alpha_n \ge 0$.
The argument for belief scores is similar. Fix $f \in G \cap F$. We have $\mathrm{src}(N_1, f) = \mathrm{src}(N_2, f)$, so
$$\hat{b}^n_{N_2}(f) = \sum_{s \in \mathrm{src}(N_2, f)} \hat{t}^n_{N_2}(s) = \sum_{s \in \mathrm{src}(N_1, f)} \beta_{n-1} \cdot \hat{t}^n_{N_1}(s) = \beta_{n-1} \cdot \hat{b}^n_{N_1}(f)$$
Write
$$\delta_1 = \frac{1}{\max_{y \in F} \hat{b}^n_{N_1}(y)}, \qquad \delta_2 = \frac{1}{\max_{y \in F} \hat{b}^n_{N_2}(y)}$$
Then
$$b^n_{N_1}(f) = \delta_1 \cdot \hat{b}^n_{N_1}(f), \qquad b^n_{N_2}(f) = \delta_2 \cdot \hat{b}^n_{N_2}(f) = \delta_2 \cdot \beta_{n-1} \cdot \hat{b}^n_{N_1}(f) = \frac{\delta_2 \cdot \beta_{n-1}}{\delta_1} \cdot \delta_1 \cdot \hat{b}^n_{N_1}(f) = \frac{\delta_2 \cdot \beta_{n-1}}{\delta_1} \cdot b^n_{N_1}(f)$$
so we may take $\beta_n = \frac{\delta_2 \beta_{n-1}}{\delta_1}$.
By induction, there exist sequences $(\alpha_n)_{n \in \mathbb{N}}$, $(\beta_n)_{n \in \mathbb{N}}$ satisfying the hypothesis of lemma 1 part 6, so Sums satisfies weak independence.
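As noted in the conclusions, the software implementation was useful for verifying results of this kind empirically. The fixed point claimed above for the network of figure 4.4 can also be reproduced with a few lines of standalone Python; the sketch below is illustrative only (it does not use the project's API) and simply iterates the update rules stated in the proof.

# Illustrative check of the figure 4.4 computation: A claims D and E, B claims D,
# C claims E. The claimed fixed point is a_n = 1, b_n = c_n = 1/2, d_n = e_n = 1.
facts_of = {"A": ["D", "E"], "B": ["D"], "C": ["E"]}
sources_of = {"D": ["A", "B"], "E": ["A", "C"]}

trust = {s: 0.5 for s in facts_of}
belief = {f: 0.5 for f in sources_of}
for _ in range(50):
    t_hat = {s: sum(belief[f] for f in fs) for s, fs in facts_of.items()}
    b_hat = {f: sum(t_hat[s] for s in ss) for f, ss in sources_of.items()}
    trust = {s: v / max(t_hat.values()) for s, v in t_hat.items()}
    belief = {f: v / max(b_hat.values()) for f, v in b_hat.items()}

print(trust)    # {'A': 1.0, 'B': 0.5, 'C': 0.5}
print(belief)   # {'D': 1.0, 'E': 1.0}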

C.8 Proof of theorem 2

Proof. Let $N_1$ be the network shown in figure 4.4, and $N_2$ be the network shown in figure 4.5. With $G = N_1$, $G$ is a connected component of both networks. We have already shown in the proof of theorem 1 that $t^*_{N_1}(A) = 1$ and $t^*_{N_1}(B) = t^*_{N_1}(C) = \frac{1}{2}$, so in particular A is ranked strictly above B in $N_1$. To prove Sums does not satisfy independence, we will show that A and B are in fact ranked equally in $N_2$; in particular, we shall have $A \sqsubseteq^{I_{\mathrm{Sums}}}_{N_2} B$ but $A \not\sqsubseteq^{I_{\mathrm{Sums}}}_{N_1} B$.
As before, write $a_n$ for $t^n_{N_2}(A)$ and similarly for the other nodes. We claim that, for $n > 1$:
$$a_n = \frac{2}{3^{n-1}}, \quad b_n = c_n = \frac{1}{3^{n-1}}, \quad s_n = t_n = u_n = 1, \quad d_n = e_n = \frac{1}{3^{n-1}}, \quad v_n = w_n = x_n = 1$$
The base case for induction is $n = 2$. We have
$$\hat{a}_2 = d_1 + e_1 = 1, \quad \hat{b}_2 = d_1 = \tfrac{1}{2}, \quad \hat{c}_2 = e_1 = \tfrac{1}{2}, \quad \hat{s}_2 = \hat{t}_2 = \hat{u}_2 = v_1 + w_1 + x_1 = \tfrac{3}{2}$$
$$\hat{d}_2 = \hat{a}_2 + \hat{b}_2 = \tfrac{3}{2}, \quad \hat{e}_2 = \hat{a}_2 + \hat{c}_2 = \tfrac{3}{2}, \quad \hat{v}_2 = \hat{w}_2 = \hat{x}_2 = \hat{s}_2 + \hat{t}_2 + \hat{u}_2 = \tfrac{9}{2}$$
The maximum trust score is $\frac{3}{2}$ and the maximum belief score is $\frac{9}{2}$, so
$$a_2 = \tfrac{2}{3}, \quad b_2 = c_2 = \tfrac{1}{3}, \quad s_2 = t_2 = u_2 = 1, \quad d_2 = e_2 = \tfrac{1}{3}, \quad v_2 = w_2 = x_2 = 1$$
Thus the claim holds for $n = 2$. Now suppose that the claim holds for the $(n-1)$-th iteration. We have
$$\hat{a}_n = d_{n-1} + e_{n-1} = \frac{2}{3^{n-2}}, \quad \hat{b}_n = d_{n-1} = \frac{1}{3^{n-2}}, \quad \hat{c}_n = e_{n-1} = \frac{1}{3^{n-2}}, \quad \hat{s}_n = \hat{t}_n = \hat{u}_n = v_{n-1} + w_{n-1} + x_{n-1} = 3$$
$$\hat{d}_n = \hat{a}_n + \hat{b}_n = \frac{3}{3^{n-2}}, \quad \hat{e}_n = \hat{a}_n + \hat{c}_n = \frac{3}{3^{n-2}}, \quad \hat{v}_n = \hat{w}_n = \hat{x}_n = 9$$
The maximum trust and belief scores are 3 and 9 respectively, so we get
$$a_n = \frac{2}{3^{n-2}} \cdot \frac{1}{3} = \frac{2}{3^{n-1}}, \quad b_n = c_n = \frac{1}{3^{n-1}}, \quad s_n = t_n = u_n = 1, \quad d_n = \frac{3}{3^{n-2}} \cdot \frac{1}{9} = \frac{1}{3^{n-1}}, \quad e_n = \frac{1}{3^{n-1}}, \quad v_n = w_n = x_n = 1$$
as required. Finally, this means
$$t^*_{N_2}(A) = \lim_{n \to \infty} \frac{2}{3^{n-1}} = 0, \qquad t^*_{N_2}(B) = \lim_{n \to \infty} \frac{1}{3^{n-1}} = 0$$
and $A \simeq^{I_{\mathrm{Sums}}}_{N_2} B$, which completes the proof.