Department of Mathematics "Tullio Levi-Civita" Bachelor’s Degree in Computer Science University of Padua

Multilingual Analysis of Conflicts in Wikipedia

Bachelor’s Thesis

Author: Marco Chilese (Student ID: 1143012) Supervisor: Prof. Massimo Marchiori Co-Supervisor: Prof. Claudio Enrico Palazzi

A.Y. 2018-2019 September 26, 2019

“Per aspera ad astra” — Cicero

Acknowledgments

I would first like to thank my supervisor, Prof. Massimo Marchiori, for his constant stream of great ideas and for supporting and encouraging me.

I would particularly like to thank Enrico Bonetti Vieno for his precious advice, his competence, his support during the development of the project, and his work on integrating this project into Negapedia.

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Marco Chilese

Padua, Italy 09.26.2019

Abstract

The aim of the internship is to analyze conflicts in Wikipedia, providing a qualitative analysis and not just a quantitative one. Wikipedia's pages are the result of "edit wars": additions and removals generated by users who try to make their point of view prevail over the others. This conflict can be quantified, and the Negapedia project already does so. The purpose of this project is therefore to provide a complementary view of the conflict that completes the quantitative one: that is, to show the theme of the conflict in a page through its words.


Contents

1 Introduction
  1.1 Internship Goals

2 Technologies
  2.1 General Considerations
  2.2 Repository Structure
  2.3 Back-end Development: Data Processing
    2.3.1 Code Quality and Testing
  2.4 I/O Performance: Python Vs. Golang
    2.4.1 Writing Files
    2.4.2 Reading Files
  2.5 Front-end Development: Data Presentation
  2.6 The Product
    2.6.1 Public API
  2.7 Stand-Alone Version
  2.8 Development Environment
    2.8.1 Processing Times
    2.8.2 Minimum and Recommended Requirements

3 Wikipedia Dump: Structure and Content
  3.1 Structure
  3.2 Reverts

4 Dump Analysis
  4.1 Dump Pre-processing
    4.1.1 Dump Parse
    4.1.2 Dump Non-Revert Reduction and Revision Filtering Method
  4.2 Text Cleaning and Normalization
    4.2.1 WikiText Markup Cleaning
    4.2.2 Text Normalization
    4.2.3 Text Mapping
  4.3 Text Analysis: a Statistical Approach
    4.3.1 Global Pages File
    4.3.2 Global Words File
    4.3.3 TF-IDF: Attributing Importance to Words
    4.3.4 Global Pages File With TF-IDF
    4.3.5 De-Stemming
    4.3.6 Global Topics File
    4.3.7 Top N Words Analysis
    4.3.8 Bad Language Analysis

5 The Words Cloud
  5.1 Pages Words Cloud
  5.2 Topic Words Cloud
  5.3 Wiki Words Cloud
  5.4 Bad-Words Cloud
    5.4.1 Global Bad-Words Cloud
    5.4.2 Pages Bad-Words Cloud
  5.5 Brief Considerations About Words Distribution

6 Integration in Negapedia
  6.1 Current Integration
    6.1.1 Data Exporters
  6.2 The State of Art

7 Analysis of the Results
  7.1 Amount of Data
  7.2 Considerations About Pages Data

8 Conclusions
  8.1 Requirements
  8.2 Development
  8.3 About the Future

A Available Languages
  A.1 Project Handled Languages
  A.2 Bad Language: Handled Languages
  A.3 Add Support for a New Language

References


List of Figures

1  Wikipedia page in Negapedia
2  Word cloud of the Wikipedia page about Wikipedia
3  Python vs Golang: sequential writing timing
4  Python vs Cython vs Golang: parallel writing timing
5  Python vs Cython vs Go: I/O performance comparison
6  Python vs Cython vs Go: sequential reading timing
7  Python vs Cython vs Go: parallel reading timing
8  Python vs Cython vs Go: I/O reading performance comparison
9  "Computer" page word cloud
10 "Microsoft" page word cloud
11 "Apple Inc." page word cloud
12 "University" page word cloud
13 Example of page revision history
14 High-level representation of the whole analysis process
15 High-level representation of the parse and dump reduction process
16 Wikimedia Markup Cleaning Process
17 Text Stemming and Stopwords Removal Process
18 Linear word size interpolation
19 Word cloud of the "Cold War" page
20 Word distribution of the "Cold War" page
21 Top 50 words for the topic "Technology and applied sciences"
22 Word distribution for the top 50 words of the topic "Technology and applied sciences"
23 Word cloud of the 50 most popular words in the English Wikipedia
24 Word distribution for global Wikipedia words
25 Global bad-words cloud for the English Wikipedia
26 Global bad-words distribution for the English Wikipedia
27 Bad-words from the "Web services protocol stack" page
28 Bad-words from the "Sexuality in ancient Rome" page
29 Kubernetes Architecture
30 State of art system representation using Kubernetes
31 Distribution of words in topics


List of Tables

2  Requirements description
3  Python vs Cython vs Go: sequential writing timing results
4  Python vs Cython vs Go: sequential writing speed results
5  Python vs Cython vs Go: parallel writing timing results
6  Python vs Cython vs Go: parallel writing speed results
7  Golang vs Cython vs Python: sequential time speedup
8  Python vs Cython vs Go: parallel time speedup
9  Python vs Cython vs Go: sequential reading timing results
10 Python vs Cython vs Go: sequential reading speed results
11 Python vs Cython vs Go: parallel reading timing results
12 Python vs Cython vs Go: parallel reading speed results
13 Python vs Cython vs Go: sequential time speedup
14 Python vs Cython vs Go: parallel time speedup
17 English Wikipedia, last 10 reverts of the August 2019 dumps: amount of data


Listings

1  Wikipedia pages-meta-history dump XML structure
2  JSON Page Data Format
3  JSON Page Data after Revert removal
4  JSON page data format after word mapping
5  Global Pages File JSON data format
6  Global Words File JSON data format
7  Global Pages File with TF-IDF JSON data format
8  Stemming and De-Stemming dictionary building algorithm
9  Global Topics File JSON data format
10 Bad words report JSON data format
11 "Cold War" word-cloud page data
12 Top 50 words for the topic "Technology and applied sciences"
13 Global Wikipedia word-cloud data
14 Global bad-words data for the English Wikipedia
15 "Web services protocol stack" bad-words data
16 "Sexuality in ancient Rome" bad-words data



1 Introduction

Note for the Reader

Attention: this document may contain offensive and vulgar words, which could upset the most sensitive readers. These words are the result of a part of the project's analysis. The document is intended for an adult audience only.

Negapedia was born in 2016 as an open-source project conceived by Professor Massimo Marchiori and developed by Enrico Bonetti Vieno, with the aim of increasing public awareness of the controversies beneath each topic in Wikipedia, also providing a historical view of the information battles that shape public data (Marchiori & Bonetti Vieno, 2018b). It is important to sensitise people on this theme because nowadays Wikipedia is taken as a point of reference by a lot of people, who consider as true information what could actually be the result of clashes between factions. In fact, every page in Wikipedia exists only thanks to people's collaboration: by their nature, people do not always agree about something, and they can be influenced by their political, religious or commercial interests; this means that what they write can also be partial. Because of that, clashes arise inside Wikipedia's articles, and they can become battles that destroy or manipulate the whole information of a page. These levels of conflictuality, or negativity, can be measured in many different ways, but Negapedia was born with the purpose of being accessible to everyone, so the amount of exposed data must be carefully dosed. These metrics have therefore been deeply analysed and summarized into two quantifiable concepts:

• Conflict: this measure represents the quantity of negativity in a page. Conflict is defined as the number of people involved in reverts (see §3.2), in an absolute sense;

• Polemic: this measure is, in a sense, a complementary view of conflict. It is no longer based on quantity in an absolute sense, but on "quality". In this way the relative negativity inside the social community of a page becomes measurable. For these reasons, polemic is defined in a more sophisticated way (taking inspiration from TF-IDF) as the product of two terms (Marchiori & Bonetti Vieno, 2018a):



\[
\text{Polemic} = \frac{\text{Conflict}}{\text{Popularity}} \cdot \log\left(\frac{\#\text{articles}}{\#\text{articles} \mid \geq \text{Conflict} \wedge \leq \text{Popularity}}\right)
\]

Each page in Negapedia reports these two indexes for the recent activity and the past one. From these values, awards are assigned (in a chronological sense or in an absolute-negativity sense), which help visitors immediately understand how conflictual, or negative, a page is.

Figure 1: Wikipedia page in Negapedia.

However, all this information gives the user an idea of "how much", but not of "what" and "where". Indeed Negapedia, in its original version, does not offer the possibility to know what people are fighting about. This reflection was the starting point of this project: giving users an idea of what editors are fighting over. To convey the "about what", i.e. the theme of the conflict, the word cloud has been designed: a cloud containing words of different sizes, based on their importance, which represent the theme of the clash.


Figure 2: Word cloud of the English Wikipedia page about Wikipedia.

In particular, the image above shows the 50 most relevant words, considering the latest 10 reverts.

1.1 Internship Goals

The aim of the internship is to write a tool able to analyse the history of Wikipedia pages in order to perform a statistical analysis of their text. Through this analysis, a sort of synthesis of the conflict is generated for each page, based on the absolute occurrence or importance of its words. These data are then inserted into Negapedia pages, thus allowing a qualitative measure in addition to the quantitative one. The goals to achieve are ordered by importance:

Ob Obligatory requirement, binding target because required by the customer;

De Desirable requirement, not strictly necessary but with recognizable added value;

Op Optional requirement, added value but not strictly competitive.


Goals are so described:

Obligatory
  Ob1  Creation of the tool (Wikipedia Conflict Analyser) for the qualitative analysis of conflicts in Wikipedia
  Ob2  Multilingualism management: the tool must be able to process any national version of Wikipedia
  Ob3  Data insertion in Negapedia: the elaborated data must be incorporated into Negapedia pages to allow online visualization
  Ob4  Use of open-source technologies or, alternatively, free ones

Desirable
  De1  Analysis management through time frame selection
  De2  Analysis not only for each page, but also for a set of pages, for example by topic

Optional
  Op1  Data test visualization through JavaScript

Table 2: Requirements description.


2 Technologies

2.1 General Considerations

One of the goals of the project was to use open-source technologies or, alternatively, free ones. The whole project has been developed under Git versioning, in a repository on the Negapedia GitHub account.

2.2 Repository Structure

The repository mentioned above is structured as follows:

/
  cmd
  internal
    badwords
    destemmer
    dumpreducer
    structures
    textnormalizer
    tfidf
    topicwords
    topwordspageextractor
    utils
    wordmapper
  wikitfidf.go
  exporter.go
  Dockerfile

where:

• cmd: contains the file called by the Dockerfile for the stand-alone execution;

• internal: contains all the implementation packages used by the public interface exposed by wikitfidf.go and exporter.go.

2.3 Back-end Development: Data Processing

At the beginning the designated programming language was Python (version 3.7): thanks to its great availability of libraries, it seemed to be the best choice. In particular, Python offers the NLTK (Natural Language Toolkit) library, which allows text to be analysed and normalized (see §4.2); moreover, NLTK is the toolkit of its kind that supports the largest number of languages by default. Furthermore, another library was needed for Wikitext cleaning: a process whose aim is to extract plain text from the Wikipedia article markup. For this particular task a very popular library is mwparserfromhell, an open-source Python module available on GitHub and installable with pip.
The Python version of the project was stable, but its processing time was not acceptable, even though parallel processing was used. In fact, the complete elaboration of the Italian Wikipedia was estimated at about 10-12 days of processing, which means that the entire elaboration of the English Wikipedia, which is approximately ten times bigger, could exceed 100-120 days. This cannot be accepted: Wikipedia dumps are published monthly, so the elaboration time must stay under this threshold. After code profiling1, the slowness of the Python version has been attributed to three main causes:

• Dump parse;

• Wikitext cleaning;

• Text normalization;

These are the parts with the highest computational load. This slowness is caused by two main factors: the language itself and an intense I/O activity, which can involve a very large number (on the order of 10^6) of large files. In fact, Python is a dynamically typed interpreted language: these characteristics, combined with a large code base and a huge amount of data, can cause considerable slowness.
After these results, it has been decided to replicate the project using another programming language: Golang, a compiled, strongly typed language. Moreover, the Negapedia back-end is itself built in Golang, so it seemed a natural choice, and also a convenient one for integration. In fact, by its nature Golang is oriented to concurrency, and this feature has been deeply exploited inside Negapedia (Enrico Bonetti Vieno, 2016) and inside this project. To clarify the I/O performance difference between Python and Golang, a set of tests has been designed, whose results justify this change (see §2.4).
Even if the intent was to replace the whole Python code, it was not completely possible because of the absence of some specific libraries for Golang. In particular, no libraries for Wikitext cleaning and text processing are available for this language. So, Wikipedia markup cleaning has been delegated to a Java library named wikiclean, which has been forked and adapted to the needs of the case. Conversely, the text processing part remained implemented in Python and is called by the Golang code. This decision has been taken because nothing similar (in terms of supported languages and usage possibilities) to the NLTK library exists, either in Python or in other languages.

1 Profiling is a dynamic analysis which measures some metrics during execution, e.g. memory usage, time per call, and so on. In this case, the time spent per function call was measured.

However, to speed up the Python code, and thus the entire project, Cython has been used: a static compiler which compiles Python code to C. This approach partially removes the overhead due to the interpreted nature of Python, speeding it up. Both these parts have been optimized to execute with the maximum obtainable degree of parallelism: during execution, every available CPU thread is working. In order to make the project independent of the system on which it runs, and to make it portable, a Docker image, described by a Dockerfile, has been designed.

2.3.1 Code Quality and Testing

Since the project is developed in three main languages, and two of them (Go and Python) have been used to develop code from scratch, it has been decided to adhere to the de facto standards for both.

Go   The official guides CodeReviewComments and Effective Go, which describe best practices to consider while coding, have been used as style guides for the Go code. Adherence to these rules is ensured by tools configured in the IDE (JetBrains GoLand), in particular golangci-lint, which is a bundle of tools; the three main ones are:

1. golint: a linter which points out coding style mistakes;

2. gofmt: tool which checks code formatting;

3. govet: a tool which reports suspicious constructs in Go programs.

Adhering to these best practices enables the automatic generation of the code documentation and of the code evaluation, which are respectively available at:

https://godoc.org/github.com/negapedia/wikitfidf and at:

https://goreportcard.com/report/github.com/negapedia/wikitfidf which generates a report with an evaluation of "A+".

Python   PEP 8 has been used as the style guide for the Python code. It defines the best practices to adopt while coding in Python. Adherence to this standard is ensured by the checkers built into the IDEs (JetBrains GoLand and JetBrains PyCharm).


Code Static Analysis: SonarCloud   SonarCloud is a code-quality cloud service provided by SonarSource. It is based on a repository which is continuously analysed whenever code is committed. The use of this kind of tool helps to identify vulnerabilities, bugs, code smells2 and security weaknesses. In particular, it measures:

• Reliability: this analysis marks the presence of code whose behaviour could be different from what expected;

• Security: this analysis marks potential weakness to hackers;

• Maintainability: this analysis marks code which could be difficult to update;

• Coverage: the percentage of lines of code covered by tests (see §2.3.1);

• Duplication: the percentage of duplicated code;

• Size: metrics about the code, such as number of lines of code, percentage of comments, and so on;

• Complexity: this analysis calculates the cyclomatic3 and the cognitive4 complexity.

Go Code Testing and Continuous Integration with Travis CI   In order to ensure the quality of the developed tool, a series of tests has been designed. These tests are run whenever something is committed to the repository. This service has been delegated to Travis CI, a continuous integration service optimized for GitHub repositories. Its behaviour is described inside its configuration file. In particular, it runs the tests and produces the test report, which is used by SonarCloud to calculate code coverage.

2"Code smell" is used to indicate those characteristics which could highlight design weak- nesses which reduce code quality 3Cyclomatic complexity is about the number of independent paths through a program’s source code 4Cognitive complexity is a metric about how hard code is to understand


2.4 I/O Performance: Python Vs. Golang

This section reports the results of read-write benchmarks comparing Python5, Python compiled with Cython6, and Golang7 for different data sizes. The source code used for these tests is available at: https://github.com/MarcoChilese/I-O-Python-vs-Go. The tests have been executed on the same machine used for the project development (see §2.8).

2.4.1 Writing Files

The tests below use different data sizes, each written 100 times to different files created from scratch. Every test has been repeated 10 times, and the reported values are the averages of the resulting times and the estimated speeds.
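The full benchmark source is available in the repository linked above; the following is only a minimal Go sketch of the kind of sequential write test performed (file names and helper functions are illustrative, not taken from the benchmark code).

package main

import (
	"fmt"
	"os"
	"time"
)

// writeFile writes the given payload to a newly created file.
func writeFile(path string, payload []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(payload)
	return err
}

// benchmarkSequential writes the payload to 100 distinct files, one by one,
// and returns the elapsed time.
func benchmarkSequential(payload []byte) time.Duration {
	start := time.Now()
	for i := 0; i < 100; i++ {
		if err := writeFile(fmt.Sprintf("out_%d.tmp", i), payload); err != nil {
			panic(err)
		}
	}
	return time.Since(start)
}

func main() {
	payload := make([]byte, 1<<20) // 1MB of zero bytes
	fmt.Println("sequential 1MB x100:", benchmarkSequential(payload))
}

The parallel variant of the test follows the same pattern, with the 100 writes distributed over all available CPU threads.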

Sequential Processing In this test files are written one by one.

Figure 3: Python vs Golang: sequential writing timing.

5 Version 3.7.
6 Version 0.29.10.
7 Version 1.12.6.


File Size   Python (s)   Cython (s)   Go (s)
1MB         0.64         0.62         0.14
10MB        5.73         5.72         1.50
100MB       48.55        50.11        14.67
500MB       209.66       192.85       73.22

Table 3: Python vs Cython vs Go: sequential writing timing results.

File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         156.05          161.30          718.72
10MB        174.37          174.82          667.61
100MB       205.99          199.56          681.51
500MB       238.49          259.27          682.83

Table 4: Python vs Cython vs Go: sequential writing speed results.

Parallel Processing In this test parallelism is used: every CPU thread is used for writing files.

Figure 4: Python vs Cython vs Golang: parallel writing timing.

File Size   Python (s)   Cython (s)   Go (s)
1MB         0.73         0.73         0.13
10MB        6.06         5.20         1.40
100MB       58.45        55.42        14.74
500MB       308.62       282.47       77.74

Table 5: Python vs Cython vs Go: parallel writing timing results.

File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         136.20          136.20          766.10
10MB        165.15          192.31          711.75
100MB       171.08          180.44          678.49
500MB       162.01          177.00          643.14

Table 6: Python vs Cython vs Go: parallel writing speed results.

Comparison

Figure 5: Python vs Cython vs Go: I/O performance comparison.

File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.64     0.62     0.14     1.03x            4.61x
10MB        5.73     5.72     1.50     1.00x            3.83x
100MB       48.55    50.12    14.67    0.97x            3.31x
500MB       209.66   192.85   73.22    1.09x            2.86x

Table 7: Golang vs Cython vs Python: sequential time speedup.


File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.73     0.73     0.13     1.00x            5.62x
10MB        6.06     5.20     1.40     1.17x            4.31x
100MB       58.45    55.42    14.74    1.05x            3.97x
500MB       308.62   282.47   77.74    1.09x            3.97x

Table 8: Python vs Cython vs Go: parallel time speedup.

2.4.2 Reading Files

The tests below use different data sizes, each read 100 times and stored in a variable. Every test has been repeated 10 times, and the reported values are the averages of the resulting times and the estimated speeds.

Sequential Processing In this test the same file is read 100 times, singularly.

Figure 6: Python vs Cython vs Go: sequential reading timing.

File Size   Python (s)   Cython (s)   Go (s)
1MB         0.17         0.12         0.033
10MB        1.74         1.34         0.33
100MB       15.58        15.15        2.65
500MB       82.92        89.89        13.39

Table 9: Python vs Cython vs Go: sequential reading timing results.


File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         580.69          833.33          2991.04
10MB        574.09          746.27          3031.33
100MB       641.84          660.07          3780.51
500MB       603.00          556.24          3733.31

Table 10: Python vs Cython vs Go: sequential reading speed results.

Parallel Processing In this test parallelism is used: every CPU thread is used for reading files.

Figure 7: Python vs Cython vs Go: parallel reading timing.

File Size   Python (s)   Cython (s)   Go (s)
1MB         0.23         0.22         0.03
10MB        1.86         1.75         0.28
100MB       19.58        18.83        23.55
500MB       104.61       99.53        307.82

Table 11: Python vs Cython vs Go: parallel reading timing results.


File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         440.75          454.55          3268.29
10MB        536.26          571.43          3628.36
100MB       510.61          531.07          424.62
500MB       477.97          502.36          162.43

Table 12: Python vs Cython vs Go: parallel reading speed results.

Comparison

Figure 8: Python vs Cython vs Go: I/O reading performance comparison.

File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.17     0.12     0.03     1.41x            5.15x
10MB        1.74     1.34     0.33     1.30x            5.28x
100MB       15.58    15.15    2.65     1.03x            5.89x
500MB       82.92    89.89    13.39    0.92x            6.19x

Table 13: Python vs Cython vs Go: sequential time speedup.


File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.23     0.22     0.03     1.05x            7.42x
10MB        1.86     1.75     0.28     1.06x            6.77x
100MB       19.58    18.83    23.55    1.04x            0.83x
500MB       104.61   99.53    307.82   1.05x            0.34x

Table 14: Python vs Cython vs Go: parallel time speedup.

2.5 Front-end Development: Data Presentation

The project computes a huge amount of data, which must be carefully dosed when exposed to the user. With this in mind, a way to easily communicate it has been designed: the word cloud. A word cloud is a cloud containing words of different sizes, based on their importance or absolute number of occurrences. A series of examples follows, extracted from the latest English Wikipedia computation, considering the latest 10 reverts8:

Figure 9: "Computer" page word Figure 10: "Microsoft" page word cloud. cloud.

8See §4.1.2 for definition.


Figure 11: "Apple Inc." page word Figure 12: "University" page word cloud. cloud.

These clouds contain the 50 most important words (in terms of TF-IDF value) for the specified page. To build this kind of word cloud, a popular open-source JavaScript library available on GitHub has been selected: d3-cloud. In order to be completely transparent, the data used to build the word cloud are reported inside the Negapedia page, inside a tag, as a variable (a JSON dictionary). In this way, the data are visible to those users who want to use them.

2.6 The Product

The developed product is available in an open-source repository:

https://github.com/negapedia/wikitfidf which contains the code and resources of the project. As said in §2.3, the final product consists of a tool developed mostly in Go, which uses two separate parts developed in Java and Python. These two have different core tasks: Wikitext cleaning and text normalization, respectively. The Java code has been packed with its dependencies, through Apache Maven, into an executable jar package which is called at the right time by the Go code through a system call. The Python part is analogous but slightly different: its code is also called by Go through a system call, but what is called is not the Python code as developed, but the Cython-compiled one. That is why, before execution, as reported in §2.3, the Python code is compiled with Cython to improve its performance.

So, what is actually called is a Python module which references the Cython-compiled module, which in turn executes the originally developed task.
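To illustrate this orchestration, the sketch below shows how a Go program can delegate work to an external jar and to a Python entry point through system calls; the file names and arguments are placeholders, not the actual ones used by wikitfidf.

package main

import (
	"log"
	"os/exec"
)

// runWikiClean delegates Wikitext cleaning to a packaged Java tool.
// The jar name and arguments are placeholders for illustration only.
func runWikiClean(inputDir string) error {
	cmd := exec.Command("java", "-jar", "wikiclean-bundled.jar", inputDir)
	out, err := cmd.CombinedOutput()
	log.Printf("wikiclean output: %s", out)
	return err
}

// runTextNormalizer calls a Python entry point that imports the
// Cython-compiled module doing the actual work (names are placeholders).
func runTextNormalizer(inputDir, lang string) error {
	cmd := exec.Command("python3", "normalizer_entry.py", inputDir, lang)
	out, err := cmd.CombinedOutput()
	log.Printf("normalizer output: %s", out)
	return err
}

func main() {
	if err := runWikiClean("/data/pages"); err != nil {
		log.Fatal(err)
	}
	if err := runTextNormalizer("/data/pages", "en"); err != nil {
		log.Fatal(err)
	}
}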

2.6.1 Public API

In order to hide all the low-level implementation from the user, an API has been designed, described inside the files wikitfidf.go and exporter.go, with the aim of offering to the user only the essential operations:

• CheckAvailableLanguage: allows the user to check if the required language is handled by the project, if not an error is returned;

• Preprocess: pre-processes pages, taking as input a channel9 of parsed pages from the Wikibrief library and reducing the page information;

• Process: called after Preprocess, it is the core part of the project and performs all the operations from start to end;

• GlobalWordsExporter: exports the set of words of the analysed Wikipedia, with their occurrence values, for the purpose of integrating the data inside Negapedia web pages (see §6.1.1);

• PagesExporter: like the previous one, but this is about the list of pages with their list of words and TF-IDF values (see §6.1.1);

• TopicsExporter: like the previous one, but this is about words in each topic (see §6.1.1);

• BadwordsReportExporter: like the previous one, but this is about the bad- words report (see §6.1.1).

In addition to this, the package is "go-gettable", which means that it can easily be downloaded and installed in a Go environment through the native Go package manager, simply with:

go get github.com/negapedia/wikitfidf

This is particularly important because it simplifies the integration process (see §6) inside the Negapedia back-end.

9 Channels are pipes used by Golang to connect different goroutines (lightweight threads of execution): they are typed conduits through which data can be sent and received, and they are particularly well suited to concurrency.


2.7 Stand-Alone Version

The final product has been designed not only as an integrable component, but also as a fully working stand-alone project10, and this is possible thanks to Docker. In fact, through a dedicated Docker image, defined in a Dockerfile, the project can run inside a container. The execution is customizable through command-line flags, which are:

• l: Wikipedia language;

• d: Result directory path;

• s: Revert starting date to consider8;

• e: Revert ending date to consider8;

• specialList: Special page list to consider8;

• rev: Number of revert to consider8;

• topPages: Number of top words per page to consider;

• topWords: Number of top words of global words to consider;

• topTopic: Number of top words per topic to consider;

• verbose: If true, logs are shown (default: true).

So an example of execution could be:

docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it

which will compute, for the Italian Wikipedia, the 50 most important words for each page, the 100 most frequent words in each topic and the 100 most frequent words in the entire Wikipedia, considering the last 10 reverts of each page.

2.8 Development Environment

The whole project has been developed on a Apple MacBook Pro 11,1 (Mid 2014) with the following specifications:

• Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz (2 cores-4 threads);

• 8GB RAM DDR3 @ 1600MHz;

• 256GB PCIe SSD;

10 With the sole purpose of processing data and saving the output.

running Apple macOS 10.14.6. JetBrains PyCharm and GoLand have been used as Integrated Development Environments (IDEs). The project has been tested on the Negapedia servers, hosted by CloudVeneto, with the following specifications:

• Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz (8 cores-8 threads);

• 16GB RAM and 16GB swap memory;

• 500GB/1TB HDD;

running Ubuntu 16.04 LTS or Ubuntu 18.04 LTS.

2.8.1 Processing Times

Considering the latest computation made on the servers, with the latest 10 reverts, the average times required to complete the processing are:

English Wikipedia   3 days and 21 hours
Italian Wikipedia   14 hours

2.8.2 Minimum and Recommended Requirements

The minimum requirements which are needed for executing the project in rea- sonable times are:

• At least 4 cores-8 thread CPU;

• At least 16GB of RAM;

• At least 300GB of disk space.

However the recommended requirements are:

• 32GB of RAM or more;

• Swap memory area enabled.



3 Wikipedia Dump: Structure and Content

Wikipedia is an online encyclopedia based on open collaboration, launched in English in 2001. Since then, Wikipedia has kept growing and now counts more than 300 languages, 40 million pages and over 85 million registered users (Wikipedia, 2019). Due to its collaborative nature, everyone can add content to Wikipedia, modifying or writing pages, adding resources, updating data and so on. Exactly for these reasons, everything is under versioning: every page in Wikipedia exposes its own history, giving the possibility to compare every revision. As can be guessed, these activities generate an incredible amount of data which, for safety reasons, is backed up periodically: every month the Wikimedia Foundation generates a series of complete Wikipedia database backups (database dumps), divided into categories. The most relevant are:

• pages-articles-multistream: Articles, templates, media/file descriptions, and primary meta-pages;

• pages-meta-history: All pages with complete history, the one useful for the project purpose;

• pages-logging: log events for all pages and users;

• pages-meta-current: all pages at their current version;

• pages-articles: current versions of articles only, without talk or user pages.

Over the years, Wikipedia's size has kept growing: the whole pages-meta-history dump of the English Wikipedia, uncompressed, reached 14 TB in February 2013, and by October 2018 its size had grown to 17,959,415,517,241 bytes, i.e. 17.96 TB (Wikimedia Foundation, 2018), (Wikimedia Foundation, 2019). All Wikipedia dumps are public, and everyone can download them for free from the dedicated website: https://dumps.wikimedia.org/.

3.1 Structure

The type of dump to be analysed is the pages-meta-history one. As said, this kind of file contains the whole history of a page: in it can be found, page by page, a chronologically ordered list of every revision the page has undergone. The page data structure is fairly regular, though some revisions can carry some extra information; their presence, or absence, is not a problem because they do not convey information relevant to the project.


The simplified XML structure follows:

<mediawiki xmlns="..." xml:lang="...">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>...</dbname>
    <namespaces>
      <namespace key="0" />
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>Page title</title>
    <ns>0</ns>
    <id>...</id>
    <revision>
      <id>...</id>
      <parentid>...</parentid>
      <timestamp>yyyy-mm-ddThh:mm:ssZ</timestamp>
      <contributor>
        <username>...</username>
        <id>...</id>
      </contributor>
      <text>......</text>
      <sha1>...</sha1>
    </revision>
    ...
  </page>
  ...
</mediawiki>

Listing 1: Wikipedia pages-meta-history dump XML structure.

In the namespaces declaration, 25 namespaces are listed, and the only relevant one is namespace "0", which identifies normal pages (articles) in Wikipedia. In fact, during the dump analysis, every page which does not belong to this category is excluded. For the purpose of our analysis, the most relevant tags of a page are:

• Page ID;


• Revision:

– Timestamp;

– Text;

– SHA1.

In particular, the SHA1 tag allows us to easily identify which revisions are to be considered reverts.

3.2 Reverts

A revert is a revision which has been considered by other editors useless, vandalic, wrong in form or content (or both), or not appropriate. These revisions are simply removed by rolling the page back to a revision which is considered "stable". Nevertheless, there can also exist reverts which are themselves vandalic: revisions reverted with the aim of reversing the page status. Vandalism in Wikipedia is analysed by the WikiTrust API (Thomas Adler, de Alfaro, & Pye, 2010), which is able to detect vandalic acts inside edits, identifying vandalism with a recall of 83.5%, a precision of 48.5%, and a false positive rate of 8%. As said, which revisions are to be considered reverts can be understood from the SHA1 analysis: where a SHA1 appears repeated, everything between the two identical hashes has to be considered a revert. Consider the image below:

Figure 13: Example of page revision history.


In this case, the hash of the first revision appears again in the fourth one: this means that an editor considered (rightly, or maliciously) everything between the first and the fourth revision, inside the blue rectangle, useless and therefore to be removed. With this choice, the editor rolls the page back to the state of the first revision.
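The following is a minimal Go sketch of this detection idea (the Revision type and the function name are illustrative, not taken from the project): every revision lying between two occurrences of the same SHA1 is marked as reverted.

package main

import "fmt"

// Revision is an illustrative, simplified revision record.
type Revision struct {
	Timestamp string
	SHA1      string
}

// markReverted returns, for each revision, whether it lies between two
// revisions sharing the same SHA1 (i.e. it was wiped out by a rollback).
func markReverted(revs []Revision) []bool {
	reverted := make([]bool, len(revs))
	lastSeen := map[string]int{} // SHA1 -> most recent index where it appeared
	for i, r := range revs {
		if j, ok := lastSeen[r.SHA1]; ok {
			// Everything strictly between the two identical hashes was undone.
			for k := j + 1; k < i; k++ {
				reverted[k] = true
			}
		}
		lastSeen[r.SHA1] = i
	}
	return reverted
}

func main() {
	revs := []Revision{
		{"t1", "aaa"}, {"t2", "bbb"}, {"t3", "ccc"}, {"t4", "aaa"},
	}
	fmt.Println(markReverted(revs)) // [false true true false]
}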


4 Dump Analysis

Before examining every step in detail, let us have a brief view of the complete analysis process through an activity diagram:

Figure 14: High level representation of whole analysis process.

4.1 Dump Pre-processing

4.1.1 Dump Parse

The first step of the processing pipeline is to download the pages-meta-history dumps from the Wikimedia repository for the selected language and date.


These dumps are available in two compressed formats: 7z and bz2. The most convenient is the 7z one, thanks to its performance and better compression rate, which allows using less disk space while processing. The dump size is not fixed: it typically varies from 100MB to 400MB in the 7z version, but it is not unusual to come across "abnormal" dumps which can exceed a gigabyte. The decompressed size of a 200MB dump is on average around 34GB, so it is important to define a strategy for decompression and content reading. In fact, full extraction before processing can be prohibitive for the disk space; likewise, loading the entire extracted file into RAM is prohibitive for reading, so incremental extraction and reading techniques are the key.
The technique used in the project is live extraction to the terminal standard output with simultaneous reading for incremental parsing, which is needed for the same reason above. The XML parser is fed line by line and it is triggered by the opening page tag, from which page data are collected into structures, skipping, as said, those pages which do not belong to namespace 0. The data collection is the first moment where useless data are discarded; in fact, the only data collected are:

• Page ID;

• Revisions:

– Revision timestamp;

– Revision text;

– Revision SHA1 digital sign.

Those structures can be imagined as JSON dictionaries which will be saved on disk during the next steps:

1 {"PageID": 123456,

2 "TopicID":y,

3 "Revision":[

4 {"Timestamp":"yyyy-mm-ddThh:mm:ssZ",

5 "Text":"Revision Text",

6 "Sha1":"ikhwgq0sxctzqpl8aomp6gelk5p539e"},

7 { ... },

8 ...

9 ]}

Listing 2: JSON Page Data Format.

where y is the topic code assigned by the Negapedia system.
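Such a record maps naturally onto Go structures. The following is a hedged sketch; the field and type names are illustrative and do not necessarily match the ones defined in the project's structures package.

package main

import (
	"encoding/json"
	"fmt"
)

// Revision is one entry of the page history, as kept after parsing.
type Revision struct {
	Timestamp string `json:"Timestamp"`
	Text      string `json:"Text"`
	Sha1      string `json:"Sha1"`
}

// Page is the reduced page record written to disk as JSON.
type Page struct {
	PageID   uint32     `json:"PageID"`
	TopicID  uint32     `json:"TopicID"`
	Revision []Revision `json:"Revision"`
}

func main() {
	raw := `{"PageID":123456,"TopicID":7,"Revision":[{"Timestamp":"2019-01-01T00:00:00Z","Text":"...","Sha1":"ikhwgq0sxctzqpl8aomp6gelk5p539e"}]}`
	var p Page
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		panic(err)
	}
	fmt.Println(p.PageID, len(p.Revision))
}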


4.1.2 Dump Non-Revert Reduction and Revision Filtering Method

During the page parsing, in order to reduce the page file size, some elaborations are performed with the task of removing those revisions which are not considered reverts. These operations considerably decrease the size of the file that will be written in JSON format (same as above) at the end of the page analysis. To reduce the amount of data even more, and consequently the required processing time, some filters can be applied to the revision list:

• Revision date: a temporal filter which excludes revisions that are not in a specified time frame;

• Special Page List: only the pages in the list are considered, with their complete history;

• Number of Reverts: only the latest n reverts in the complete page history are considered.

The idea of writing one file per page may seem not particularly convenient in terms of I/O, but it allows pages to be processed in parallel in the next steps of the pipeline.

Figure 15: High level representation of parse and dump reduction process.

After this process, the data are saved in JSON, and they look like:

1 {"PageID": 123456,

2 "TopicID":y,

3 "Revision":[

4 {"Timestamp":"yyyy-mm-ddThh:mm:ssZ",

27 4 DUMP ANALYSIS Marco Chilese

5 "Text":"Revision Text",

6 "Sha1":"ikhwgq0sxctzqpl8aomp6gelk5p539e"},

7 { ... },

8 ...

9 ]}

Listing 3: JSON Page Data after Revert removal.

4.2 Text Cleaning and Normalization

4.2.1 WikiText Markup Cleaning

Inside Wikipedia's articles, a markup language is used to define layouts, text style, links, tables, images and so on; it is called Wikitext, and also known as Wiki Markup or Wikicode. For the project's purpose, this extra information available in the revisions' text is useless: it does not convey any information about the meaning of the text or the subject of the page, so it has to be removed. The cleaning process ensures that the analysed text is pure text only, and not text filled with an overstructure used to present the text itself. An activity diagram representing the Wikitext cleaning process follows.

Figure 16: Wikimedia Markup Cleaning Process.

4.2.2 Text Normalization

Text normalization is one of the most important steps and it is made of several sequential actions, implemented in the NLTK library. Let us enumerate them and then analyse them one by one:

1. Tokenization;

28 Marco Chilese 4 DUMP ANALYSIS

2. Stopwords cleaning;

3. Stemming.

Tokenization Tokenization is about splitting text into a list of single words, like this:

"Wikipedia is a multilingual online encyclopedia."

["wikipedia", "is", "a", "multilingual", "online", "encyclopedia"]

In addition to the splitting process, punctuation is removed and the text is converted to lowercase. Having single words, and no longer the entire text, allows considering them individually and enables the following steps.

Stopwords Cleaning   Stopwords cleaning is an important step whose purpose is to remove those words which are too common to be relevant in the statistical analysis. To do this, a list of stopwords for the given language is used: if a word is in the stopwords list, it is immediately removed from the text. The stopword lists used are the NLTK ones, but they have been enriched both in number of words and in supported languages11. Currently, 45 languages are handled (see §A). So, considering the previous example:

["wikipedia", "is", "a", "multilingual", "online", "encyclopedia"]

["wikipedia", "multilingual", "online", "encyclopedia"]12

This, in addition to yielding better text to analyse, is also a way to reduce the amount of information, and thus the computational weight.

11 To do this, https://www.ranks.nl/stopwords/ has been used.
12 Considering the stopwords list used in the project.
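In the project this step is performed in Python with NLTK; the following Go sketch only illustrates the idea, with a toy stopword set standing in for the real per-language lists.

package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the text and splits it on anything that is not a letter
// or a digit, which also drops punctuation.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// removeStopwords drops every token found in the stopword set.
func removeStopwords(tokens []string, stopwords map[string]bool) []string {
	var out []string
	for _, t := range tokens {
		if !stopwords[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	stop := map[string]bool{"is": true, "a": true}
	tokens := tokenize("Wikipedia is a multilingual online encyclopedia.")
	fmt.Println(removeStopwords(tokens, stop))
	// [wikipedia multilingual online encyclopedia]
}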

29 4 DUMP ANALYSIS Marco Chilese

Stemming   The last step is the one that really deals with normalization. Stemming reduces inflected forms of a word to a common root:

Playing, plays, played, play → play

By doing this, we increase the statistical relevance of a word; otherwise, the four words in the example above would be considered different words, each with its own statistical relevance, even though they mean the same thing. It is noteworthy that during the stemming phase a "de-stemming" dictionary is built: it allows rebuilding a meaningful word after the statistical analysis (see §4.3.5). So, let us consider a slightly more complex example to sum up the entire process:

"Wikipedia is a multilingual online encyclopedia, based on open collaboration through a wiki-based content editing system. It is the largest and most popular general reference work on the World Wide Web, and is one of the most popular websites ranked by Alexa as of June 2019.13"

["wikipedia", "multilingu", "onlin", "encyclopedia", "base", "open", "collabor", "through", "wiki", "base", "content", "edit", "system", "largest" "popular", "gener", "refer", "work", "world", "wide", web", "popular", "websit", "rank", "alexa", "june", "2019"]

Follow an activity diagram which represent text normalization process.

Figure 17: Text Stemming and Stopwords Removal Process.

13Text from the Wikipedia page of Wikipedia (11/07/2019).


So, formalizing what has been said: let R be the revision text and Tok the function which splits the text into single words; then

\[ T = Tok(R) \qquad (1) \]

where T is the list of single words, and so T_i is a single word. Now, let SC be the function which performs the stopwords cleaning, and let S(w, l) be the function which returns 1 if the word w, for the language l, is in the stopwords list, and 0 otherwise. Then:

\[ SC(T) = \bigcup_{i=1}^{|T|} \begin{cases} T_i & \text{if } S(T_i, l) = 0 \\ \emptyset & \text{otherwise} \end{cases} \qquad (2) \]

Finally, Stem(T) is the function which performs the stemming of each word in T, which has been cleaned up in (2):

\[ T = Stem(T) \qquad (3) \]

4.2.3 Text Mapping

Text mapping is a key part of the process. In this step the stemmed text is analysed again and summarized in the form of a dictionary:

"Term": x

where x is the absolute occurrence value of the term across all the collected revert texts of the current page. So now the page data are represented like this:

1 {"PageID": "123456",

2 "Tot":t,

3 "Words":[

4 {"Word1": occurr1},

5 {"Word2": occurr2},

6 ...

7 ]

8 }

Listing 4: JSON page data format after word mapping.

where t represents the total number of words in the page reverts, and occurr1, occurr2 the absolute occurrence values of word1 and word2, respectively. This particular elaboration, besides being the starting point for the second part of the project (the statistical one), is relevant because the amount of information decreases considerably, thanks to the nature of this new page data representation.

4.3 Text Analysis: a Statistical Approach

4.3.1 Global Pages File

At this point, the data are structured in the format shown in §4.2.3 and are ready to be aggregated into a single file. This approach helps processing the data. In fact, a way for writing pages to, and reading them from, a single file has been designed: every page is written on a single line, so every line represents a page. In this way, the file can be read incrementally, without loading it completely into memory. This approach is necessary because of the size of the resulting document. The global pages file looks like this:

1 {"123456": {"Tot": t1,"Words": [{"Word1": occurr1},{"Word2": occurr2}, ...]},

2 "7890123":{"Tot": t2,"Words": [{"Word3": occurr3}, ...]},

3 ...

4 }

Listing 5: Global Page File JSON data format.

4.3.2 Global Words File

The statistical phase which follows requires a dictionary containing all the words of the wiki, mapped to their absolute occurrence value and to the number of documents which contain them. So, exploiting the Global Pages File previously built, the new Global Words File is built. It also includes two special counters: the number of processed pages and the number of words in them; these two, in particular, will help during the calculation of TF-IDF, as will be seen later. It looks like:

1 {"123456": {"Tot":t,"Words": [{"Word1": occurr1,"in":x}, {"Word2": occurr2,"in":y}, ...]},

2 ...

3 "@Total Page":p,

4 "@Total Words":w

5 }

Listing 6: Global Words File JSON data format.

where x and y represent the number of documents that contain the terms word1 and word2, respectively; p and w are the number of pages analysed and the number of words inside them.

4.3.3 TF-IDF: Attributing Importance to Words

TF-IDF (Term Frequency - Inverse Document Frequency) is an information retrieval function used to assign importance to a word relative to a document or to a set of documents. A term becomes more important the more frequently it appears in a document, and less important the more documents of the set contain it. The TF-IDF calculation is composed of three separate parts: the TF calculation, the IDF calculation and then the TF-IDF calculation. Let us analyse them separately.

TF The TF value represents the frequency of a term in a single document, considered relatively to the amount of word in the document. So:

\[ tf_{i,j} = \frac{n_{i,j}}{|d_j|} \qquad (4) \]

where tf_{i,j} is the term frequency of term i in document j, n_{i,j} is the number of occurrences of term i in document j, and |d_j| is the size of the document (number of words).

IDF   IDF represents the importance of a term inside the collection of documents. It considers the number of documents and the number of documents that contain the analysed term. So:

\[ idf_i = \log \frac{|D|}{|\{d : i \in d\}|} \qquad (5) \]

where idf_i is the inverse document frequency of term i, |D| is the cardinality of the collection, and |{d : i ∈ d}| indicates the number of documents which contain the term i.

TF-IDF So these two factors take part to the final calculation for the TF-IDF of a single term:

\[ tfidf_{i,j} = tf_{i,j} \cdot idf_i \qquad (6) \]

Now, let us see an example. Consider a collection of 1000 documents, where each document contains 100 words. Let w be a word which appears in 50 documents of the collection, and 5 times in the document j. The TF for w in the document j would be 5/100 = 0.05; the IDF would be log(1000/50) = 1.30. From these results the TF-IDF can be calculated as 0.05 * 1.30 = 0.065.
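A minimal Go sketch of this computation, using the numbers of the example above and a base-10 logarithm as in the example (function and variable names are illustrative):

package main

import (
	"fmt"
	"math"
)

// tfidf computes the TF-IDF value of a term given its occurrences in a
// document, the document length, the collection size, and the number of
// documents containing the term.
func tfidf(occurrences, docLen, totalDocs, docsWithTerm float64) float64 {
	tf := occurrences / docLen
	idf := math.Log10(totalDocs / docsWithTerm)
	return tf * idf
}

func main() {
	// 5 occurrences in a 100-word document, 1000 documents, term in 50 of them.
	fmt.Printf("%.4f\n", tfidf(5, 100, 1000, 50)) // ≈ 0.0651
}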

4.3.4 Global Pages File With TF-IDF

Thanks to the previously built Global Words and Global Pages files, the computation of TF-IDF for every page and every term turns out to be particularly easy. In fact, the most important data for the function have already been calculated: as mentioned in §4.3.2, we have the number of documents (pages) in the set, the total amount of words inside the set, the amount of words in each page, and the absolute occurrence value of each term. From these premises, it is only a matter of implementation. As mentioned in §4.3.1, the Global Pages File is read incrementally and so it is processed page by page; vice versa, the Global Words File must be read entirely, because a global vision of it is needed by the nature of the process. After the statistical calculation described in §4.3.3, the Global Pages File is replaced with an updated version including the just-calculated results. Now, this file looks like:

1 {"123456": {"Tot":t,"Words": [{"Word1":{"abs": x1,"tfidf": y1}}, ...],

2 ...

3 } Listing 7: Global Page File with TF-IDF JSON data format.

where x1 represents the number of occurrences of the term in that page, and y1 the TF-IDF value of that term in that page, considering the whole page collection.

4.3.5 De-Stemming

As said in §4.2.2, stemming is a key part of the statistical calculations (words are counted by their root and not in all their forms), but this means that words have been truncated or reduced to a form which is not always meaningful. Considering that our aim is to show the most relevant words, it is not acceptable to show badly formed words; so, during the stemming process, a "de-stemming" dictionary is built, based on key-value pairs:

"stemmedWord": "realWord"

where realWord is continuously updated by choosing, as a strategy, the shortest word whose stemmed form gives stemmedWord. In this way, when a word is found whose stemmed version is equal to the real word, it is inserted in the dictionary, ensuring the correct reconstruction. This solution can be described as "brute-force".


It guarantees consistency during the de-stemming process. The algorithm used is:

from nltk.stem import PorterStemmer


def _stemming(revert_text, stemmer_reverse_dict):
    ps = PorterStemmer()  # NLTK stemmer
    text = []

    for word in revert_text:
        stemmed_word = ps.stem(word)
        # Keep, for each stem, the shortest original word seen so far.
        if stemmed_word in stemmer_reverse_dict.keys() and len(word) < len(stemmer_reverse_dict[stemmed_word]):
            stemmer_reverse_dict[stemmed_word] = word
        elif stemmed_word not in stemmer_reverse_dict.keys():
            stemmer_reverse_dict[stemmed_word] = word

        text.append(stemmed_word)
    return text, stemmer_reverse_dict

Listing 8: Stemming and De-Stemming dictionary building algorithm.

Let us consider an example:
Phrase 1: "chain chained chains" → de-stem dictionary: {chain: chain};
Phrase 2: "chained" → de-stem dictionary: {chain: chained}.
When aggregating these two dictionaries into a single one, the following rule is applied (with the entries of the de-stem dictionary named as above): for each stemmedWord, the shortest realWord is kept. So, in this case, the global de-stem dictionary would be {chain: chain}, producing the correct de-stemming.
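A hedged Go sketch of this merge rule follows (names are illustrative; the project's destemmer package may differ):

package main

import "fmt"

// mergeDestem merges a per-page de-stemming dictionary into a global one,
// keeping for each stemmed word the shortest real word seen so far.
func mergeDestem(global, local map[string]string) {
	for stem, word := range local {
		if current, ok := global[stem]; !ok || len(word) < len(current) {
			global[stem] = word
		}
	}
}

func main() {
	global := map[string]string{"chain": "chained"}
	mergeDestem(global, map[string]string{"chain": "chain"})
	fmt.Println(global) // map[chain:chain]
}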

After the TF-IDF calculation and the writing of the results (§4.3.3, §4.3.4), the de-stemming process is performed, allowing the reconstruction of the words. Considering the short example of §4.2.2:

["wikipedia", "multilingu", "onlin", "encyclopedia", "base", "open", "collabor", "through", "wiki", "base", "content", "edit", "system", "largest" "popular", "gener", "refer", "work", "world", "wide", web", "popular", "websit", "rank", "alexa", "June", "2019"]

its stemming process would also have generated its de-stemming dictionary, which looks like:

1 {"multilingu":"multilingual",

2 "onlin":"online",

3 "collabor":"collaboration",

4 "gener":"general",

5 "refer":"reference",

35 4 DUMP ANALYSIS Marco Chilese

6 "websit":"website"}

So, using the above dictionary, the stemmed text can be de-stemmed, getting back a more complete meaning:

["wikipedia", "multilingual", "online", "encyclopedia", "base", "open", "collaboration", "through", "wiki", "base", "content", "edit", "system", "largest" "popular", "general", "refer", "work", "world", "wide", web", "popular", "website", "rank", "alexa", "June", "2019"]

4.3.6 Global Topics File

Exploiting the "TopicID" data field that every page has, a global file can easily be built, containing for each topic every word with its absolute occurrence value. This file looks like:

1 {"TopicID1":{"word1": freq1,"word2: freq2", ...},

2 "TopicID2":{"word1": freq1, ...}

3 ...

4 }

Listing 9: Global Topics File JSON data format.

4.3.7 Top N Words Analysis

The previously built files contain a lot of data, so for the use described in §5 smaller versions are required. For this purpose, a simple Python module has been written which, from those files (Global Pages with TF-IDF, Global Words and Global Topics), builds a dictionary from which the N (defined as a parameter) most important or most frequent words are extracted, and then writes a copy of the original files with the reduced data set. In this case, to save disk space, the files are written with the maximum compression level made available by gzip.
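In the project this extraction is done by a small Python module; the following Go sketch only illustrates the idea (sort the words by value and keep the first N):

package main

import (
	"fmt"
	"sort"
)

// topN returns the n entries of the map with the highest values,
// sorted in descending order.
func topN(freq map[string]int, n int) []string {
	words := make([]string, 0, len(freq))
	for w := range freq {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool { return freq[words[i]] > freq[words[j]] })
	if n > len(words) {
		n = len(words)
	}
	return words[:n]
}

func main() {
	freq := map[string]int{"wikipedia": 10, "negapedia": 15, "golang": 7, "python": 8}
	fmt.Println(topN(freq, 2)) // [negapedia wikipedia]
}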

4.3.8 Bad Language Analysis

Once de-stemming has been performed, another sort of analysis can also be run: a bad language analysis for each page. This analysis is quite similar to stopwords cleaning, but the caught words are not deleted: they are collected. In fact, using bad-words lists, a report is built which includes the bad words used in each page that contains them. Those lists are available for 24 languages (see §A.2), including the most common ones.

36 Marco Chilese 4 DUMP ANALYSIS

They have been built by combining lists from different sources14. This report uses the same JSON incremental-writing trick described in §4.3.1, and it looks like:

1 {"123456":{"Abs":x,"Rel":y,"BadW":{"badword1":occurr1,"badword2":occurr2}},

2 ...

3 }

Listing 10: Bad words report JSON data format.

where x represents the absolute counter of bad words found in the examined page, and y the relative degree of vulgarity of the page, based on the number of words in it, calculated as \( Vulg_{rel} = \frac{Vulg_{abs}}{|Words|} \).

14 http://www.bannedwordlist.com/, https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/, http://aurbano.eu/blog/2008/04/04/bad-words-list/, https://github.com/chucknorris-io/swear-words, https://github.com/pdrhlik/sweary/. (Last access August 5, 2019)



5 The Words Cloud

As briefly described in §2.5, a word cloud is a container of words, where every word has a weight on which its font size depends. Let us consider a simple example:

1 {"Words":{

2 "wikipedia": 10,

3 "negapedia": 15,

4 "golang": 7,

5 "python": 8,

6 "NLTK": 2}

7 }

This dictionary is populated by words associated with their weight or, in other words, with their absolute occurrence. This example could be represented by the following cloud:

From this, at a single glance, it can easily be understood that the most relevant word in the set is "negapedia". In fact, the idea was to have a simple structure which could represent a fair amount of data without overloading the user with information.

Word Size Calculation The values associated to a word are of two kinds:

• Absolute occurrence value: it is a value, said v, with v > 0;

• TF-IDF value: it is a value, said v, with 0 < v < 1.

These values represent the popularity and the importance, respectively, of a single word. To produce a word cloud, they must be converted to a font-size value.


A linear interpolating function has been chosen. To build this function, let us consider:

Figure 18: Linear word size interpolation.

where the two points are:

• (wordMin; fontSizeMin): the pair formed by the word with the lowest value and the minimum font size that has been chosen;

• (wordMax; fontSizeMax): the pair formed by the word with the highest value and the maximum font size that has been chosen.

Considering therefore that all other words will necessarily fall within [wordMin; wordMax], the interpolating function can be built as a line passing through two points. This line is defined by:

\[ \frac{x - x_0}{x_1 - x_0} = \frac{y - y_0}{y_1 - y_0} \qquad (7) \]

or:

\[ y = \frac{x - x_0}{x_1 - x_0} \cdot (y_1 - y_0) + y_0 \qquad (8) \]

which in our case, for the word i, becomes:

\[ fontSize_i = \frac{value_i - wordMin}{wordMax - wordMin} \cdot (fontSizeMax - fontSizeMin) + fontSizeMin \]

where value_i is the word's TF-IDF value (or absolute occurrence value) and, obviously, fontSize_i is the associated font size.
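The actual rendering is done client-side with d3-cloud; the following Go sketch only illustrates the interpolation formula (the font-size bounds in the example are arbitrary):

package main

import "fmt"

// fontSize linearly maps a word value from [wordMin, wordMax]
// to a font size in [fontMin, fontMax].
func fontSize(value, wordMin, wordMax, fontMin, fontMax float64) float64 {
	if wordMax == wordMin {
		return fontMin // avoid division by zero when all words share one value
	}
	return (value-wordMin)/(wordMax-wordMin)*(fontMax-fontMin) + fontMin
}

func main() {
	// Weights between 2 and 15 (as in the example cloud of this chapter),
	// mapped to font sizes between 10 and 40 (illustrative bounds).
	fmt.Println(fontSize(10, 2, 15, 10, 40)) // ≈ 28.46
}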


Summing up, after the analysis phase the output files are:

F.1 Global Pages file: this file contains, for each page, the whole list of words with their absolute occurrence value and TF-IDF value;

F.2 Global Pages file with only the most important N words: this file contains, for each page, a restricted list of N most important words with their tf-idf value;

F.3 Global Words file: this file contains all the words inside the analysed Wikipedia, associated with their absolute occurrence value;

F.4 Global Words file with only the most popular N words: this file contains the most N popular inside the analysed Wikipedia, with their absolute occurrence value;

F.5 Global Topic Words: this file contains all the words inside each topic of the analysed Wikipedia, associated with their absolute occurrence value;

F.6 Global Topic Words with only the most popular N words: this file contains the most N popular inside each topic of the analysed Wikipedia, with their absolute occurrence value;

F.7 Bad Words Report: this file contains, for each page which has them, a list of bad words with their absolute occurrence value.

These files, in particular the top-N-words ones, will be used to produce the word clouds.


5.1 Pages Words Cloud

Using the file F.2, a word cloud is built for each available page. An example could be:

Figure 19: Word cloud of "Cold War" page.

Based on:

{"Tot": 92792,"Words":{"state": 0.0056,"american": 0.004,"germany": 0.0054,"soviet": 0.0456,"superpower": 0.0037,"west": 0.0032,"khrushchev": 0.0068,"missil": 0.0052,"presid": 0.0032,"crisi": 0.004,"party": 0.0043,"eastern": 0.0054,"afghanistan": 0.0038,"communist": 0.0157,"nuclear": 0.0082,"cambodia": 0.0032,"conflict": 0.0034,"power": 0.0039,"cuban": 0.0037,"govern": 0.0048,"vietnam": 0.0048,"gorbachev": 0.0057,"relat": 0.0044,"invase": 0.0035,"policy": 0.0046,"churchil": 0.0033,"unit": 0.0048,"berlin": 0.0052,"cold": 0.0154,"bloc": 0.008,"moscow": 0.0042,"europ": 0.0069,"brezhnev": 0.0041,"country": 0.0033,"nato": 0.004,"reagan": 0.0066,"regim": 0.0039,"union": 0.0089,"ussr": 0.0055,"arms": 0.0039,"revolut": 0.0039,"khmer": 0.0042,"eisenhowe": 0.0034,"econom": 0.0049,"western": 0.0053,"military": 0.0084,"stalin": 0.0129,"peac": 0.0036,"truman": 0.0054,"ally": 0.0041},"TopicID": 2147483637}

Listing 11: "Cold War" word-cloud page.

In this kind of cloud the most controversial words of the page appear; in this case the dominant one is clearly "soviet". The words distribution could be represented like this:


Figure 20: Word distribution "Cold War" page.
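As a side note, a record in the format of Listing 11 can be decoded with a few lines of Go. The following is only a minimal sketch: the type names are hypothetical and the data is a trimmed-down excerpt of the listing above.

package main

import (
	"encoding/json"
	"fmt"
)

// PageWords mirrors the per-page record of the top-N words file (Listing 11):
// total number of words, the selected words with their TF-IDF value, and the topic.
type PageWords struct {
	Tot     int                `json:"Tot"`
	Words   map[string]float64 `json:"Words"`
	TopicID uint32             `json:"TopicID"`
}

func main() {
	raw := []byte(`{"Tot": 92792, "Words": {"soviet": 0.0456, "communist": 0.0157, "cold": 0.0154}, "TopicID": 2147483637}`)

	var page PageWords
	if err := json.Unmarshal(raw, &page); err != nil {
		panic(err)
	}

	// The word with the highest TF-IDF value dominates the cloud.
	best, bestVal := "", 0.0
	for w, v := range page.Words {
		if v > bestVal {
			best, bestVal = w, v
		}
	}
	fmt.Printf("most controversial word: %q (tf-idf %.4f)\n", best, bestVal)
}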


5.2 Topic Words Cloud

Using the file F.6, a word cloud is built for each topic. An example could be:

Figure 21: Top 50 words for the topic "Technology and applied sciences."

Based on:

{"include": 73809,"design": 68304,"time": 64639,"oper": 63041,"system": 62784,"develop": 61680,"base": 60235,"gener": 54488,"history": 51326,"well": 49214,"servic": 48574,"year": 48518,"allow": 48206,"work": 48057,"provid": 47996,"origin": 47836,"unit": 47747,"call": 47442,"state": 47339,"number": 46568,"company": 46226,"built": 45130,"high": 44734,"engin": 44562,"three": 44222,"product": 44120,"featur": 42388,"addit": 42042,"open": 41977,"requir": 41920,"type": 40866,"power": 40532,"current": 40279,"support": 39323,"form": 39185,"complete": 39093,"creat": 38800,"control": 38712,"locat": 38564,"produc": 38372,"chang": 37964,"standard": 37448,"start": 37334,"second": 37241,"version": 37124,"larg": 36303,"early": 36042,"exampl": 35511,"area": 35022,"singl": 34968}

Listing 12: Top 50 words for the topic "Technology and applied sciences."

In this cloud the most controversial words for this topic appear. The words distribution could be represented like this:


Figure 22: Word distribution for top 50 words for the topic "Technology and applied sciences".


5.3 Wiki Words Cloud

Using the file F.4, the global Wikipedia word cloud is built. An example could be:

Figure 23: Word cloud of the 50 most popular words in English Wikipedia.

Based on:

{"army": 4082388,"victory": 1715700,"effects": 2463275,"germany": 2066031,"dies": 3395774,"society": 2906176,"universe": 11036215,"property": 1780786,"study": 4765130,"industry": 3229320,"history": 7515945,"defense": 1739903,"days": 2930201,"mary": 1931143,"july": 6258655,"release": 10611562,"energy": 1902029,"committee": 1940316,"goed": 1482858,"degree": 1938587,"maked": 2745303,"acts": 2047435,"movy": 2329648,"represe": 3827545,"decise": 1883253,"rights": 2337223,"extense": 1473604,"commercy": 1850363,"body": 3101547,"presents": 2193304,"professione": 2749855,"ends": 2123210,"company": 7138777,"academy": 1843645,"suggests": 1651599,"memory": 1958650,"founds": 2689593,"parts": 1749895,"military": 3315229,"financie": 1513415,"arms": 1694792,"policy": 2035518,"televise": 4072533,"primary": 1866713,"increase": 3628379,"marry": 2482972,"sets": 1583546,"century": 5748435,"chinese": 1954766,"named": 5909838}

Listing 13: Global Wikipedia word-cloud.

This cloud is particularly interesting because it represents the most conflictual words in the entire English Wikipedia. The words distribution could be represented like this:


Figure 24: Word distribution for global Wikipedia words.


5.4 Bad-Words Cloud

Attention: the following words may upset the most sensitive readers. This part of the document is intended for an adult audience only.

5.4.1 Global Bad-Words Cloud

Using the file F.7, the global cloud of bad words can be built. For the latest English Wikipedia it is:

Figure 25: Global bad-words words cloud for English Wikipedia.

Based on:

1 {’Tot’: 2017551, ’Badwords’: {’organ’: 348666, ’kill’: 223682, ’murder’: 102368, ’sexual’: 64185, ’strip’: 56930, ’erect’: 45400, ’bone’: 43373, ’ blow’: 40205, ’dick’: 37758, ’virgin’: 35187, ’hell’: 34975, ’slave’: 34831, ’nazi’: 34386, ’stroke’: 27928, ’oral’: 25989, ’crack’: 24835, ’ rape’: 23898, ’beer’: 23684, ’escort’: 22597, ’breast’: 19899, ’bloody’: 18586, ’bang’: 17444, ’drunk’: 17155, ’wang’: 16225, ’prostitute’: 15976, ’heroin’: 15646, ’hitler’: 15579, ’negro’: 12710, ’butt’: 12632, ’climax’: 12248, ’thrust’: 12191, ’ugly’: 11651, ’fuck’: 11613, ’damn’: 11444, ’ stupid’: 11102, ’woody’: 10912, ’beaver’: 10902, ’suck’: 10685, ’screw’: 9950, ’reich’: 9923, ’dong’: 9353, ’sexy’: 9296, ’nude’: 9136, ’paddy’: 9030, ’weed’: 8821, ’playboy’: 8415, ’cocain’: 7891, ’babe’: 7706, ’sniper ’: 7449, ’shit’: 7007}}

Listing 14: Global bad-words data for English Wikipedia.

The words distribution could be represented like this:


Figure 26: Global bad-words distribution for English Wikipedia.

5.4.2 Pages Bad-Words Cloud

Using the file F.7, the bad-language word cloud is built for each page that contains bad words. A couple of examples could be:

Figure 27: Bad-words from "Web services protocol stack" page.


This is, as reported in §7.2, the page with the highest number of bad-words, which are:

{"anus":1,"arse":1,"asshole":1,"babes":1,"balls":1,"bigblack":1,"bloody":1,"blowjobs":1,"bondage":1,"boobs":1,"booty":1,"breasts":1,"bukkake":1,"busty":1,"butthole":1,"clitoris":1,"clits":1,"cocks":1,"cocksucker":1,"creampie":1,"cumshots":1,"cunilingus":1,"cunts":1,"dildos":1,"ejaculate":1,"femdom":1,"fucks":1,"gays":1,"groupsex":1,"hardcore":1,"hentai":1,"hooters":1,"horny":1,"intercourse":1,"juggs":1,"kinky":1,"lesbians":1,"lesbos":1,"lezbians":1,"mams":1,"masturbate":1,"naked":1,"niggers":1,"nipple":1,"nudity":1,"orgy":1,"panty":1,"penetrate":1,"pornography":1,"prostitute":1,"pussy":1,"seduce":1,"sexy":1,"shemale":1,"shiteater":1,"shits":1,"sleazy":1,"sluts":1,"sucks":1,"testicle":1,"threesome":1,"tits":1,"titty":1,"tranny":1,"transsexual":1,"twats":1,"whores":1}

Listing 15: "Web services protocol stack" bad-words data.

Figure 28: Bad-words from "Sexuality in ancient Rome" page.

Based on:

{"anus":1,"asshole":1,"balls":1,"bawdy":1,"bondage":1,"breasts":1,"clitoris":1,"clits":1,"cocks":1,"cunnilingus":1,"cunts":1,"dildos":1,"ejaculate":1,"fellate":1,"fondle":1,"fucks":1,"glans":1,"homoerotic":1,"intercourse":1,"lesbians":1,"lesbos":1,"loins":1,"masturbate":1,"menstruate":1,"naked":1,"nipple":1,"nudity":1,"orgy":1,"penetrate":1,"pornography":1,"prostitute":1,"seduce":1,"steamy":1,"stoned":1,"testicle":1,"threesome":1,"titty":1,"ugly":1,"urine":1,"uterus":1,"whorehouse":1,"whores":1}


Listing 16: "Sexuality in ancient Rome" bad-words data.

5.5 Brief Considerations About Words Distribution

Some observations can be made from the reported distributions. First of all, it can be noticed that the distribution is not linear but "skewed" superlinear: very few words appear a great number of times, while all the others appear less and less often. There is also an interesting difference between the distribution of "normal" words and that of bad words: in the bad-words distribution the skew is even more accentuated. There is therefore a sort of "super champions" set of words that catalyses usage during clashes.



6 Integration in Negapedia

6.1 Current Integration

The current integration in Negapedia consists mainly of an extension of the original Overpedia code. The key to the integration is the capability of the developed tool to be "go gettable", as said in §2.6, i.e. to be recognized as an automatically downloadable set of Go packages. However, as said, the tool does not consist only of Golang code, but also of Python code and of a jar executable file. To get these files, a simple solution is to clone the repository inside the Docker container in charge of the calculations. With this premise, the integration stays pretty simple: the only thing to do inside the Overpedia code is to use the API (§2.6.1) made available by the developed tool. Thanks to the exporter functions defined in the API, it is simple to get the data to integrate into the Negapedia web pages, a task that is delegated to Overpedia.

6.1.1 Data Exporters

To easily export the calculated data, an important feature of the Go language is exploited: channels. Channels are the pipes used by Golang to connect different goroutines (lightweight threads of execution): they are typed conduits through which data can be sent and received, and they are particularly well suited to concurrency. With this definition in mind, the calculated data are read, encapsulated in structures and sent through the specific channel. As said in §2.6.1, the API has an exporter function for each type of data:

• Pages with their words data;

• Topics with their words data;

• Pages with their bad-words data;

• analysed Wikipedia words data.

Through these functions, data can be easily integrated inside the Negapedia pages: these streams of data are merged with the Negapedia data coming from the database and then, thanks to a template, injected into the HTML pages. At the end of the process, every page is compressed with gzip and the compressed files are grouped into a tarball. A minimal sketch of the exporter pattern is shown below.
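The following Go sketch illustrates the exporter pattern described above; the structure and function names are illustrative only and do not reflect the actual API.

package main

import "fmt"

// PageBadWords is a hypothetical structure for one item of the bad-words stream.
type PageBadWords struct {
	PageID   uint32
	Abs      int
	BadWords map[string]int
}

// exportBadWords illustrates the exporter pattern: the calculated data are read,
// wrapped in structures and sent through a typed channel that the caller
// (Overpedia, in the real integration) consumes concurrently.
func exportBadWords(report map[uint32]PageBadWords) <-chan PageBadWords {
	out := make(chan PageBadWords)
	go func() {
		defer close(out)
		for _, page := range report {
			out <- page
		}
	}()
	return out
}

func main() {
	report := map[uint32]PageBadWords{
		123456: {PageID: 123456, Abs: 2, BadWords: map[string]int{"damn": 1, "hell": 1}},
	}
	// The consumer simply ranges over the channel and injects the data into its templates.
	for page := range exportBadWords(report) {
		fmt.Printf("page %d: %d bad words\n", page.PageID, page.Abs)
	}
}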


6.2 The State of the Art

Regarding the integration, a system has also been designed which, without resource limitations, could reach the state of the art. This system consists in the use of Kubernetes (also known as K8s), an open-source orchestration and management system for Docker containers, particularly optimized for application scaling and deployment. Kubernetes allows managing Docker containers inside clusters of multiple hosts, making better use of resources; it automatically manages application deployment and updates, mounts and adds storage for running stateful applications, and easily manages application scalability. For example, Kubernetes allows distributing (scaling) a huge computational load over multiple machines which run the designed containers at the same time. This is exactly the idea that makes Kubernetes optimal in our case.

Kubernetes Fundamentals The Kubernetes system is based on a few fundamental concepts:

• Master: the machine which controls the Kubernetes nodes; it is the starting point of all assigned activities;

• Nodes: the machines which perform the activities assigned by the Master;

• Pods: a group of one or more containers deployed on a single node. All the containers share the same IP address, host name and other resources. Pods abstract network and storage away from the underlying containers, allowing containers to be moved easily inside the cluster;

• Replication controller: controls the number of pod replicas in a specific point of the cluster;

• Services: decouple job definitions from the pods. Kubernetes Service proxies automatically send requests to the correct pod, independently of its movements inside the cluster, even if it has been repositioned;

• Kubelet: this service is executed on the nodes; it reads the container manifests and ensures that the containers start;

• Kubectl: the Kubernetes command-line configuration tool.

With these definitions in mind, the Kubernetes architecture can be represented like this:


Figure 29: Kubernetes Architecture.

Credit: https://blog.newrelic.com/engineering/what-is-kubernetes/

The idea which led to considering a system based on Kubernetes is the following. As said in §3, the pages' history is collected in a certain number, say N, of dumps. Each dump therefore contains the full history of a group of pages, and this makes it the smallest item to process in our case. Every dump must go through three main phases:

download → pre-processing → processing. These steps must be repeated N times in a single-node context. But what if N machines (or nodes) are used instead of a single one? After these steps, the project requires a final stage of elaboration which must be executed sequentially; say it requires a time k to complete. In terms of time: let T be the list of the overall processing times of the dumps, so that $T_i$ is the time required to process the i-th dump. In a single-machine context, the global amount of time needed to complete the processing would be:

$$Tot = \sum_{i=0}^{N-1} T_i + k$$

Conversely, having N machines, the amount of time required to complete the processing would simply be:

$$Tot = \max(T) + k$$

that is, the processing time of the biggest dump in the collection, and no longer the sum of all processing times. From this point of view, it is clear that the bigger the number of available nodes, the smaller the time required for processing.
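As a purely illustrative example (the figures are invented): suppose N = 3 dumps with processing times T = {10, 12, 8} hours and a final sequential stage of k = 2 hours. A single machine would need $Tot = 10 + 12 + 8 + 2 = 32$ hours, while three nodes working in parallel would only need $Tot = \max(10, 12, 8) + 2 = 14$ hours.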


With this in mind, a fully scalable system has been designed which can run on N nodes, where N, in the best case, matches the number of dumps to process15, reducing the required processing time to the minimum. The system is structured as follows:

Figure 30: State of art system representation using Kubernetes.

where:

• RESTful API: a specifically designed REST web service whose aim is to configure and start the execution from the specified parameters (e.g. Wikipedia language, etc.);

• Shared Data Volume: a storage volume shared between the nodes;

• Node i: the node which deals with the three main processing stages; it saves its results in the shared data volume;

• Final Node: a "special" node which starts when all the other nodes have finished. Its task is to complete the processing by aggregating the results calculated by the other nodes.

15E.g. about 645 for English Wikipedia, about 70 for Italian Wikipedia, etc.


7 Analysis of the Results

The considerations expressed in this section are based on the last calculation on the English Wikipedia, considering the last 10 reverts of the August 2019 data.

7.1 Amount of Data

The data managed for the calculation in question are:

Name                        Quantity
Number of processed dumps   648
Number of pages             2,810,522
Number of words             1,508,753,550
Number of bad-words         610,032
Size of data archive        24.2 GB
Size of 7z data archive     3.2 GB

Table 17: English Wikipedia, last 10 reverts of the August 2019 dumps: amount of data.

7.2 Considerations About Pages Data

From the calculated data, some simple quantitative considerations can be made, and it is easy to sort certain kinds of data. An example of this statistical approach could be the following. Thanks to the structure of the file F.1, the page with the highest number of words can easily be retrieved, that is:

Page ID           49801965
Title             1918 New Year Honours
Number of words   919,105
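For instance, assuming the global pages file maps each page ID to a record exposing a total-words field analogous to the "Tot" field of Listing 11, such a query could be sketched in Go as follows; the type names and the tiny excerpt are invented.

package main

import (
	"encoding/json"
	"fmt"
)

// pageEntry is a partial, hypothetical view of one record of the global pages
// file: only the total number of words is decoded, everything else is ignored.
type pageEntry struct {
	Tot int `json:"Tot"`
}

// largestPage returns the ID of the page with the highest number of words.
func largestPage(pages map[string]pageEntry) (id string, words int) {
	for pageID, p := range pages {
		if p.Tot > words {
			id, words = pageID, p.Tot
		}
	}
	return id, words
}

func main() {
	// Tiny invented excerpt standing in for the real, much larger file.
	raw := []byte(`{"111":{"Tot":1500},"222":{"Tot":90000},"333":{"Tot":42}}`)
	var pages map[string]pageEntry
	if err := json.Unmarshal(raw, &pages); err != nil {
		panic(err)
	}
	id, words := largestPage(pages)
	fmt.Printf("page %s has the most words: %d\n", id, words)
}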

From F.7, the page with the highest number of bad-words can be retrieved, which is:

Page ID              1302413
Title                Web services protocol stack
Number of bad-words  67

and, in relative terms, the most vulgar page is:


Page ID          18623985
Title            GFY
Vulgarity ratio  1

From F.5, the distribution of words across topics can easily be extracted:

Figure 31: Distribution of words in topics.

As for the words, from F.4 the list of the 10 most popular words can easily be extracted; they are (expressed as "word: number of occurrences"):

1. Origin: 8,053,018;

2. Govern: 7,704,371;

3. Form: 7,424,495;

4. Life: 7,144,896;

5. Public: 7,033,300;

6. Opere: 6,925,410;

7. North: 6,688,197;

8. Design: 6,494,640;


9. Start: 6,373,290;

10. Power: 6,326,000.

This list is noteworthy, because these words represent the most controversial words of the Wikipedia in question. Moreover, they are particularly meaningful because they come from the last 10 reverts of each page, so they refer to recent page history. From F.7, instead, the list of the 10 most popular bad-words can be drawn up (expressed as "word: number of occurrences"):

Attention: the following words may upset the most sensitive readers. This part of the document is intended for an adult audience only.

1. Stoned: 132,058;

2. Balls: 78,725;

3. Naked: 20,951;

4. Penetrate: 20,903;

5. Breasts: 19,940;

6. Bloody: 19,183;

7. Prostitute: 16,003;

8. Lesbian: 13,454;

9. Hardcore: 12,017;

10. Fuck: 11,668.



8 Conclusions

This project started with the purpose of clarifying the world of conflict beneath each page of Wikipedia, and it fully reached that goal. Now, thanks to the developed tool, it is easy to understand what people are fighting about. Through word clouds, clashes can take form and be easily represented. Indeed, until now Negapedia was missing a complementary view to the quantitative one (defined by the conflict and polemic indexes). Beyond that, the project offers the possibility of exposing the bad language used inside page reverts. The bad-words analysis could be a useful tool for studying people's aggressiveness: as shown in §5.4, bad words can appear in the most unexpected pages, confirming the unpredictability of the community.

8.1 Requirements

In terms of requirements, coverage has reached 100%, and has gone beyond it by providing unplanned features, such as the global Wikipedia words data and the bad-words analysis.

8.2 Development

Project development was slowed down by the necessity of changing the main programming language. As described in depth in §2.3, the change was necessary to considerably decrease the calculation time, which would otherwise have largely exceeded one month.

8.3 About the Future

As proposed in §6.2, the project execution times could benefit from the use of a Kubernetes cluster. In this way, the whole processing load would no longer burden a single machine but N machines (where N possibly matches the number of dumps for the required Wikipedia). The project would also benefit from the development of a new Wikitext cleaner, possibly in Golang. With a new implementation, performance could improve and the cleaning process would be smarter and more precise. Having a better Wikimedia markup cleaner would mean better text to analyse and, consequently, better results.


This journey inside Wikipedia and its data was particularly instructive: it is possible to experience first-hand that what we usually treat as "safe information" or "correct information" is actually something shaped by people, and therefore by their interests and ambitions. Luckily, what is published on Wikipedia is checked by a lot of people, and most of the time this is enough to fight disinformation.


A Available Languages

The developed tool can handle 45 languages. Here follows the complete list of the languages handled by the main part of the project and by the bad-language report.

A.1 Project Handled Languages

• English;
• Arabic;
• Danish;
• Dutch;
• Finnish;
• French;
• German;
• Greek;
• Hungarian;
• Indonesian;
• Italian;
• Kazakh;
• Nepali;
• Norwegian;
• Portuguese;
• Romanian;
• Russian;
• Spanish;
• Swedish;
• Turkish;
• Armenian;
• Azerbaijani;
• Basque;
• Bengali;
• Bulgarian;
• Catalan;
• Chinese;
• Croatian;
• Czech;
• Galician;
• Hebrew;
• Hindi;
• Irish;
• Japanese;
• Korean;
• Latvian;
• Lithuanian;
• Marathi;
• Persian;
• Polish;
• Slovak;
• Thai;
• Ukrainian;
• Urdu;
• Simple English.


A.2 Bad Language: Handled Languages

The bad-language analysis handles 22 languages: if the language used in the main part of the project is not among them, this analysis is simply skipped.

• English;
• Arabic;
• Danish;
• Dutch;
• Finnish;
• French;
• German;
• Hungarian;
• Italian;
• Norwegian;
• Portuguese;
• Spanish;
• Swedish;
• Chinese;
• Czech;
• Hindi;
• Japanese;
• Korean;
• Persian;
• Polish;
• Thai;
• Simple English.

A.3 Add Support for a New Language

Adding support for a new Wikipedia language is simple: the only thing to do is to add the data for the new language. The steps are the following:

• Adding a stopwords language (project core):

1. The list of stopwords must be formed in the following way:

stopwords1
stopwords2
...
stopwordsN

and the file must be named after the language (e.g. "english"), without extension;
2. fork the Negapedia NLTK repository from https://github.com/negapedia/nltk;
3. push the new file into the forked repository, inside the /stopwords/corpora/stopwords folder;


4. fork the project repository from https://github.com/negapedia/wikitfidf;
5. add the language to the function CheckAvailableLanguage in the forked repository, inside /wikitfidf.go;
6. propose a pull request to the Negapedia team for both repositories, specifying the changes you made.

• Adding a badwords language:

1. The list of bad words must be formed in the following way:

badwords1
badwords2
...
badwordsN

and the file must be named after the language (e.g. "english"), without extension;
2. fork the Negapedia Badwords repository from https://github.com/negapedia/badwords;
3. add the new file to the forked repository, inside the /badwords folder;
4. fork the project repository from https://github.com/negapedia/wikitfidf;
5. add the language to the function AvailableLanguage in the forked repository, inside /internal/badwords/badwords.go;
6. propose a pull request to the Negapedia team for both repositories, specifying the changes you made.



Glossary

API An Application Programming Interface is a set of subroutines, communication protocols and tools for building software. 17, 56

Database Dump A database dump contains the structure of a database and, optionally, all the data in it. This type of file is generated for backing up data, so that everything can be restored in case of data or structure corruption. 21

Dockerfile A Dockerfile is a text document that contains all the commands used to build an image, which will run in a container. 7, 18

Git Git is open-source version control software created by Linus Torvalds in 2005. 5

GitHub GitHub is a hosting service for software projects under git versioning. 5

gzip GNU zip is open-source data compression software, born in 1992. 36

IDE An IDE (Integrated Development Environment) is a software application which helps programmers write code. It usually also includes debugging functionality and other services. 19

jar Jar (Java ARchive) is a package file used to aggregate Java classes, dependencies, metadata and resources into a single file for distribution. 16

JSON JavaScript Object Notation is a lightweight data format, easy to generate and parse. It is based on the concept of key:value pairs and lists of elements. 16, 26

Maven Apache Maven is Java project management and build automation software. 16

pip Pip is the package installer for Python: it allows modules from the Python Package Index software repository to be installed easily. 6

REST Representational State Transfer (REST) is a software architectural style which defines a set of constraints to be used for creating web services. Web services which conform to the REST architectural style are called RESTful web services. 56


SHA1 Secure Hash Algorithm 1 is a cryptographic hash function for message digests, developed by the American NSA (National Security Agency) since 1993. The algorithm works on 512-bit blocks, which are processed to form a 160-bit message digest. 23

TF-IDF Term Frequency-Inverse Document Frequency is an information retrieval function for measuring the importance of a term inside a document or a collection of documents. 1

Wikitext Wikitext is the markup language used by the MediaWiki software to format a page. Reference guide at: https://en.wikipedia.org/wiki/Help:Wikitext. 5, 6, 16, 28


September 26, 2019

To mum and dad for always being there and for making this journey possible, To Aurora for being always by my side, To my grandma for always believing in me, To my grandpa who dreamed of this day,

To Monica, my best friend, To my friends Giova, Seba, Ale, Dina, Ago, Samu and Pego because we shared the laughter and the lessons, To Mirko and my colleagues for the projects,

To the people I met along my way during these three years,

This is the end of an incredible journey, and the person here today has changed a lot compared to three years ago. This is the end of a journey and the beginning of a new adventure.
