Towards Understanding the Information Ecosystem Through the Lens of Multiple Web Communities Arxiv:1911.10517V1 [Cs.SI] 24
Total Page:16
File Type:pdf, Size:1020Kb
Towards Understanding the Information Ecosystem Through the Lens of Multiple Web Communities Savvas Zannettou A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of Cyprus University of Technology. arXiv:1911.10517v1 [cs.SI] 24 Nov 2019 Department of Electrical Engineering, Computer Engineering and Informatics Cyprus University of Technology November 26, 2019 Abstract The Web consists of numerous Web communities, news sources, and services, which are often ex- ploited by various entities for the dissemination of false or otherwise malevolent information. Yet, we lack tools and techniques to effectively track the propagation of information across the multiple diverse communities, and to capture and model the interplay and influence between them. Further- more, we lack a basic understanding of what the role and impact of some emerging communities and services on the Web information ecosystem are, and how such communities are exploited by bad actors (e.g., state-sponsored trolls) that spread false and weaponized information. In this thesis, we shed some light on the complexity and diversity of the information ecosystem on the Web by presenting a typology that includes the various types of false information, the involved actors as well as their possible motives. Then, we follow a data-driven cross-platform quantitative approach to analyze billions of posts from Twitter, Reddit, 4chan’s Politically Incorrect board (/pol/), and Gab, to shed light on: 1) how news and image-based memes travel from one Web community to another and how we can model and quantify the influence between the various Web communities; 2) characterizing the role of emerging Web communities and services on the information ecosystem, by studying Gab and two popular Web archiving services, namely the Wayback Machine and archive.is; and 3) how popular Web communities are exploited by state-sponsored actors for the purpose of spreading disinformation and sowing public discord. In a nutshell, our analysis reveal that small fringe Web communities like 4chan’s /pol/ and The Donald subreddit have a disproportionate influence on mainstream communities such as Twitter with regard to the dissemination of news and image-based memes. We find that Gab acts as the new hub for the alt-right community, while for Web archiving services we find that they are popular on fringe Web communities and that they can be misused by Reddit moderators in order to penalize ad revenue from news sources with conflicting ideology. Finally, when studying state-sponsored actors, we find that they exhibit substantial differences compared to random users, that their tactics change and evolve over time, and that they were particularly influential in spreading news on popular mainstream communities like Twitter and Reddit. Acknowledgements First, I would like to acknowledge my advisor, Michael Sirivianos, and my de facto co-advisor, Jeremy Blackburn, for their continuous support and feedback throughout my PhD journey. Their support and guidance was instrumental in turning me into an independent and competent researcher. Second, I would like to thank Emiliano De Cristofaro and The Legendary Gianluca Stringhini, who I consider my “second advisors.” Their expertise and feedback complemented the one received by my two advisors, hence helping me in further expanding my research and writing skills. More importantly, I am grateful to these four individuals mainly because they ensured that this journey was fun while undertaking important and cool research. Also, I want to thank various colleagues from Cyprus University of Technology, Telefonica Research, University College London, Princeton University, and University of Illinois at Urbana- Champaign: their help and feedback was pivotal for undertaking the studies presented in this thesis. Furthermore, I would like to thank my family for their support, patience, and encouragement, which ensured that I was mentally strong to overcome all the obstacles faced during my PhD journey. Finally, I would like to thank anonymous users on 4chan for their comments on our papers, as well as for creating some high quality memes about our research: their comments and “meme magic” made this journey much more enjoyable. Contents 1 Introduction 18 1.1 Contributions . 22 1.2 Peer-Reviewed Papers . 23 1.3 Thesis Organization . 24 2 Background 25 2.1 Web Communities . 25 2.1.1 Twitter . 25 2.1.2 Reddit . 26 2.1.3 4chan . 26 2.1.4 Gab . 27 2.1.5 Remarks . 28 2.2 Hawkes Processes . 29 3 Literature Review 34 3.1 Typology of the False Information Ecosystem . 34 3.1.1 Types of False Information . 34 3.1.2 False Information Actors . 36 3.1.3 Motives behind false information propagation . 38 3.2 User Perception of False Information . 39 3.2.1 OSN data analysis . 39 3.2.2 Questionnaires/Interviews . 41 3.2.3 Crowdsourcing platforms . 41 3.2.4 User Perception - Remarks . 42 3.3 Propagation of False Information . 42 3.3.1 OSN Data Analysis . 42 3.3.2 Epidemic and Statistical Modeling . 45 3.3.3 Systems . 46 3.3.4 Propagation of False Information - Remarks . 46 Contents 5 3.4 Detection and Containment of False Information . 47 3.4.1 Detection of false information . 47 3.4.1.1 Machine Learning . 47 3.4.1.2 Systems . 52 3.4.1.3 Other models/algorithms . 53 3.4.2 Containment of false information . 54 3.4.3 Detection and Containment of False Information - Remarks . 56 3.5 False Information in the political stage . 57 3.5.1 Machine Learning . 57 3.5.2 OSN Data Analysis . 58 3.5.3 Other models/algorithms . 60 3.5.4 False information in political stage - Remarks . 61 3.6 Other related work . 62 3.6.1 General Studies . 62 3.6.2 Systems . 63 3.6.3 Use of images on the false information ecosystem . 64 4 Understanding the Spread Of Information Through The Lens Of Multiple Web Com- munities 66 4.1 How Web Communities Influence Each Other Through the Lens of News Sources . 66 4.1.1 Motivation . 66 4.1.2 Datasets . 67 4.1.3 General Characterization . 69 4.1.4 Temporal Analysis . 73 4.1.4.1 URL Occurrence . 73 4.1.4.2 Cross Platform Analysis . 75 4.1.5 Influence Estimation . 79 4.1.5.1 Methodology . 79 4.1.5.2 Results . 80 4.1.6 Remarks . 83 4.2 Detecting and Understanding the Spread of Memes Across Multiple Web Communities 84 4.2.1 Motivation . 84 4.2.2 Methodology . 85 4.2.2.1 Overview . 86 4.2.2.2 Processing Pipeline . 86 4.2.2.3 Distance Metric . 92 4.2.3 Datasets . 94 Contents 6 4.2.3.1 Web Communities . 94 4.2.3.2 Meme Annotation Site . 95 4.2.3.3 Running the pipeline on our datasets . 97 4.2.4 Analysis . 97 4.2.4.1 Cluster-based Analysis . 98 4.2.4.1.1 Clusters . 98 4.2.4.1.2 Memes’ Branching Nature . 102 4.2.4.2 Meme Visualization . 103 4.2.4.3 Web Community-based Analysis . 104 4.2.4.3.1 Meme Popularity . 104 4.2.4.3.2 Temporal Analysis . 106 4.2.4.3.3 Score Analysis . 107 4.2.4.3.4 Sub-Communities . 108 4.2.4.4 Take-Aways . 108 4.2.5 Influence Estimation . 109 4.2.5.1 Influence Results . 109 4.2.5.2 Interesting Images . 113 4.2.6 Remarks . 113 5 Characterizing the Role of Emerging Web Communities and Services on the Informa- tion Ecosystem 116 5.1 What is Gab? . 116 5.1.1 Motivation . 116 5.1.2 Dataset . 117 5.1.3 Analysis . 117 5.1.3.1 Ranking of users . 117 5.1.3.2 User account analysis . 119 5.1.3.3 Posts Analysis . 121 5.1.4 Remarks . 126 5.2 Understanding Web Archiving Services and their Use on Multiple Web Communities 126 5.2.1 Motivation . 126 5.2.2 Background . 128 5.2.3 Datasets . 129 5.2.4 Cross-Platform Analysis . ..