Department of Mathematics "Tullio Levi-Civita" Bachelor’s Degree in Computer Science University of Padua

Multilingual Analysis of Conflicts in Wikipedia

Bachelor’s Thesis

Author: Marco Chilese (Student ID: 1143012) Supervisor: Prof. Massimo Marchiori Co-Supervisor: Prof. Claudio Enrico Palazzi

A.Y. 2018-2019 September 26, 2019

“Per aspera ad astra” — Cicero

Acknowledgments

I would first like to thank my supervisor, Prof. Massimo Marchiori, for his constant stream of great ideas and for supporting and encouraging me.

I would particularly like to thank Enrico Bonetti Vieno for his precious advice, his competence, his support during the development of the project, and his work on integrating this project into Negapedia.

Statement of Original Authorship

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Marco Chilese

Padua, Italy 09.26.2019

Abstract

The aim of the internship is to analyze conflicts in Wikipedia, providing a qualitative analysis and not just a quantitative one. Wikipedia's pages are the result of "edit wars": additions and removals generated by users who try to make their point of view prevail over the others. This conflict can be quantified, and the Negapedia project already does so. The purpose of this project is therefore to provide a complementary view of the conflict that completes the quantitative one: that is, to show the theme of the conflict in a page through its words.


Contents

1 Introduction
  1.1 Internship Goals

2 Technologies
  2.1 General Considerations
  2.2 Repository Structure
  2.3 Back-end Development: Data Processing
    2.3.1 Code Quality and Testing
  2.4 I/O Performance: Python Vs. Golang
    2.4.1 Writing Files
    2.4.2 Reading Files
  2.5 Front-end Development: Data Presentation
  2.6 The Product
    2.6.1 Public API
  2.7 Stand-Alone Version
  2.8 Development Environment
    2.8.1 Processing Times
    2.8.2 Minimum and Recommended Requirements

3 Wikipedia Dump: Structure and Content
  3.1 Structure
  3.2 Reverts

4 Dump Analysis
  4.1 Dump Pre-processing
    4.1.1 Dump Parse
    4.1.2 Dump Non-Revert Reduction and Revision Filtering Method
  4.2 Text Cleaning and Normalization
    4.2.1 WikiText Markup Cleaning
    4.2.2 Text Normalization
    4.2.3 Text Mapping
  4.3 Text Analysis: a Statistical Approach
    4.3.1 Global Pages File
    4.3.2 Global Words File
    4.3.3 TF-IDF: Attributing Importance to Words
    4.3.4 Global Pages File With TF-IDF
    4.3.5 De-Stemming
    4.3.6 Global Topics File
    4.3.7 Top N Words Analysis
    4.3.8 Bad Language Analysis

5 The Words Cloud
  5.1 Pages Words Cloud
  5.2 Topic Words Cloud
  5.3 Wiki Words Cloud
  5.4 Bad-Words Cloud
    5.4.1 Global Bad-Words Cloud
    5.4.2 Pages Bad-Words Cloud
  5.5 Brief Considerations About Words Distribution

6 Integration in Negapedia
  6.1 Current Integration
    6.1.1 Data Exporters
  6.2 The State of Art

7 Analysis of the Results
  7.1 Amount of Data
  7.2 Considerations About Pages Data

8 Conclusions
  8.1 Requirements
  8.2 Development
  8.3 About the Future

A Available Languages
  A.1 Project Handled Languages
  A.2 Bad Language: Handled Languages
  A.3 Add Support for a New Language

References


List of Figures

1  Wikipedia page in Negapedia
2  Word cloud of the Wikipedia page about Wikipedia
3  Python vs Golang: sequential writing timing
4  Python vs Cython vs Golang: parallel writing timing
5  Python vs Cython vs Go: I/O performance comparison
6  Python vs Cython vs Go: sequential reading timing
7  Python vs Cython vs Go: parallel reading timing
8  Python vs Cython vs Go: I/O reading performance comparison
9  "Computer" page word cloud
10 "Microsoft" page word cloud
11 "Apple Inc." page word cloud
12 "University" page word cloud
13 Example of page revision history
14 High-level representation of the whole analysis process
15 High-level representation of the parse and dump reduction process
16 Wikimedia Markup Cleaning Process
17 Text Stemming and Stopwords Removal Process
18 Linear word size interpolation
19 Word cloud of the "Cold War" page
20 Word distribution of the "Cold War" page
21 Top 50 words for the topic "Technology and applied sciences"
22 Word distribution for the top 50 words of the topic "Technology and applied sciences"
23 Word cloud of the 50 most popular words in the English Wikipedia
24 Word distribution for global Wikipedia words
25 Global bad-words cloud for the English Wikipedia
26 Global bad-words distribution for the English Wikipedia
27 Bad-words from the "Web services protocol stack" page
28 Bad-words from the "Sexuality in ancient Rome" page
29 Kubernetes Architecture
30 State of art system representation using Kubernetes
31 Distribution of words in topics


List of Tables

2  Requirements description
3  Python vs Cython vs Go: sequential writing timing results
4  Python vs Cython vs Go: sequential writing speed results
5  Python vs Cython vs Go: parallel writing timing results
6  Python vs Cython vs Go: parallel writing speed results
7  Golang vs Cython vs Python: sequential time speedup
8  Python vs Cython vs Go: parallel time speedup
9  Python vs Cython vs Go: sequential reading timing results
10 Python vs Cython vs Go: sequential reading speed results
11 Python vs Cython vs Go: parallel reading timing results
12 Python vs Cython vs Go: parallel reading speed results
13 Python vs Cython vs Go: sequential time speedup
14 Python vs Cython vs Go: parallel time speedup
17 English Wikipedia, last 10 reverts of the August 2019 dumps: amount of data


Listings

1  Wikipedia pages-meta-history dump XML structure
2  JSON Page Data Format
3  JSON Page Data after Revert removal
4  JSON page data format after word mapping
5  Global Pages File JSON data format
6  Global Words File JSON data format
7  Global Pages File with TF-IDF JSON data format
8  Stemming and De-Stemming dictionary building algorithm
9  Global Topics File JSON data format
10 Bad words report JSON data format
11 "Cold War" word-cloud page data
12 Top 50 words for the topic "Technology and applied sciences"
13 Global Wikipedia word-cloud data
14 Global bad-words data for the English Wikipedia
15 "Web services protocol stack" bad-words data
16 "Sexuality in ancient Rome" bad-words data



1 Introduction

Note for the Reader

Attention: this document may contain offensive and vulgar words, which could upset the most sensitive readers. These words are the result of a part of the project's analysis. The document is intended for an adult audience only.

Negapedia was born in 2016 as an open-source project conceived by Professor Massimo Marchiori and developed by Enrico Bonetti Vieno, with the aim of increasing public awareness of the controversies beneath each topic in Wikipedia, also providing a historical view of the information battles that shape public data (Marchiori & Bonetti Vieno, 2018b). It is important to sensitise people on this theme because nowadays Wikipedia is taken as a point of reference by a lot of people, who consider as true information what could actually be the result of clashes between factions. In fact, every page in Wikipedia exists only thanks to people's collaboration: by their nature, people do not always agree about something, and they can be influenced by their political, religious or commercial interests; this means that what they write can also be partial. Because of that, clashes arise inside Wikipedia's articles, and they can become battles that destroy or manipulate the whole information of a page. These levels of conflictuality, or negativity, can be measured in many different ways, but Negapedia was born with the purpose of being accessible to everyone, so the amount of exposed data must be carefully dosed. These metrics have therefore been deeply analysed and summarized into two quantifiable concepts:

• Conflict: this measure represents the quantity of negativity in a page. Conflict is defined as the number of people involved in reverts (see §3.2), in an absolute sense;

• Polemic: this measure is, in a sense, a complementary view of conflict. It is no longer based on quantity in an absolute sense, but on "quality". In this way the relative negativity inside the social community of a page becomes measurable. For these reasons, polemic is defined in a more sophisticated way (taking inspiration from TF-IDF) as the product of two terms (Marchiori & Bonetti Vieno, 2018a):



\[
\text{Polemic} = \frac{\text{Conflict}}{\text{Popularity}} \cdot \log\left(\frac{\#\text{articles}}{\#\text{articles} \mid \geq \text{Conflict} \wedge \leq \text{Popularity}}\right)
\]

Each page in Negapedia reports these two indexes for the recent activity and the past one. From these values, awards are assigned (in a chronological sense or in an absolute-negativity sense), which help visitors immediately understand how conflictual, or negative, a page is.

Figure 1: Wikipedia page in Negapedia.

However, all this information gives the user an idea of "how much", but not of "what" and "where". Indeed Negapedia, in its original version, does not offer the possibility to know what people are fighting about. This reflection was the starting point of this project: giving users an idea of what editors are fighting over. To convey the "about what", i.e. the theme of the conflict, the word cloud has been designed: a cloud containing words of different sizes, based on their importance, which represent the theme of the clash.


Figure 2: Word cloud of the English Wikipedia page about Wikipedia.

In particular, the image above shows the 50 most relevant words, considering the latest 10 reverts.

1.1 Internship Goals

The aim of the internship is to write a tool able to analyse the history of Wikipedia pages in order to perform a statistical analysis of their text. Through this analysis, a sort of synthesis of the conflict is generated for each page, based on the absolute occurrence or importance of its words. These data are then inserted into Negapedia pages, thus allowing a qualitative measure in addition to the quantitative one. The goals to achieve are ordered by importance:

Ob Obligatory requirement, binding target because required by the customer;

De Desirable requirement, not strictly necessary but with recognizable added value;

Op Optional requirement, added value but not strictly competitive.


Goals are so described:

Obligatory
  Ob1  Creation of the tool (Wikipedia Conflict Analyser) for the qualitative analysis of conflicts in Wikipedia
  Ob2  Multilingualism management: the tool must be able to process any national version of Wikipedia
  Ob3  Data insertion in Negapedia: the elaborated data must be incorporated into Negapedia pages to allow online visualization
  Ob4  Use of open-source technologies or, alternatively, free ones

Desirable
  De1  Analysis management through time frame selection
  De2  Analysis not only for each page, but also for a set of pages, for example by topic

Optional
  Op1  Data test visualization through JavaScript

Table 2: Requirements description.


2 Technologies

2.1 General Considerations

One of the goals of the project was to use open-source technologies or, alternatively, free ones. The whole project has been developed under Git versioning, in a repository on the Negapedia GitHub account.

2.2 Repository Structure

The repository mentioned above is structured as follows:

/
  cmd
  internal
    badwords
    destemmer
    dumpreducer
    structures
    textnormalizer
    tfidf
    topicwords
    topwordspageextractor
    utils
    wordmapper
  wikitfidf.go
  exporter.go
  Dockerfile

where:

• cmd: contains the file called by the Dockerfile for the stand-alone execution;

• internal: contains all the implementation packages used by the public interface exposed by wikitfidf.go and exporter.go.

2.3 Back-end Development: Data Processing

At the beginning the designated programming language was Python (version 3.7): thanks to its great availability of libraries, it seemed to be the best choice. In particular, Python offers the NLTK (Natural Language Toolkit) library, which allows text to be analysed and normalized (see §4.2); moreover, NLTK is the toolkit of its kind that supports the largest number of languages by default. Furthermore, another library was needed for Wikitext cleaning: a process whose aim is to extract plain text from the Wikipedia article markup. For this particular task a very popular library is mwparserfromhell, an open-source Python module available on GitHub and installable with pip.
The Python version of the project was stable, but its processing time was not acceptable, even though parallel processing was used. In fact, the complete elaboration of the Italian Wikipedia was estimated at about 10-12 days of processing, which means that the entire elaboration of the English Wikipedia, which is approximately ten times bigger, could exceed 100-120 days. This cannot be accepted: Wikipedia dumps are published monthly, so the elaboration time must stay under this threshold. After code profiling1, the slowness of the Python version has been attributed to three main causes:

• Dump parse;

• Wikitext cleaning;

• Text normalization;

These are the parts with the highest computational load. This slowness is caused by two main factors: the language itself and an intense I/O activity, which can involve a very large number (on the order of 10^6) of large files. In fact, Python is a dynamically typed interpreted language: these characteristics, combined with a large code base and a huge amount of data, can cause considerable slowness.
After these results, it has been decided to replicate the project using another programming language: Golang, a compiled, strongly typed language. Moreover, the Negapedia back-end is itself built in Golang, so it seemed a natural choice, and also a convenient one for integration. In fact, by its nature Golang is oriented to concurrency, and this feature has been deeply exploited inside Negapedia (Enrico Bonetti Vieno, 2016) and inside this project. To clarify the I/O performance difference between Python and Golang, a set of tests has been designed, whose results justify this change (see §2.4).
Even if the intent was to replace the whole Python code, it was not completely possible because of the absence of some specific libraries for Golang. In particular, no libraries for Wikitext cleaning and text processing are available for this language. So, Wikipedia markup cleaning has been delegated to a Java library named wikiclean, which has been forked and adapted to the needs of the case. Conversely, the text processing part remained implemented in Python and is called by the Golang code. This decision has been taken because nothing similar (in terms of supported languages and usage possibilities) to the NLTK library exists, either in Python or in other languages.

1 Profiling is a dynamic analysis which measures some metrics during execution, e.g. memory usage, time per call, and so on. In this case, the time spent per function call was measured.

However, to speed up the Python code, and thus the entire project, Cython has been used: a static compiler which compiles Python code to C. This approach partially removes the overhead due to the interpreted nature of Python, speeding it up. Both these parts have been optimized to execute with the maximum obtainable degree of parallelism: during execution, every available CPU thread is working. In order to make the project independent of the system on which it runs, and to make it portable, a Docker image, described by a Dockerfile, has been designed.

2.3.1 Code Quality and Testing

Since the project is developed in three main languages, and two of them (Go and Python) have been used to develop code from scratch, it has been decided to adhere to the de facto standards for both.

Go   The official guides CodeReviewComments and Effective Go, which describe best practices to consider while coding, have been used as style guides for the Go code. Adherence to these rules is ensured by tools configured in the IDE (JetBrains GoLand), in particular golangci-lint, which is a bundle of tools; the three main ones are:

1. golint: a linter which points out coding style mistakes;

2. gofmt: tool which checks code formatting;

3. govet: a tool which reports suspicious constructs in Go programs.

Adhering to these best practices enables the automatic generation of the code documentation and of the code evaluation, which are respectively available at:

https://godoc.org/github.com/negapedia/wikitfidf and at:

https://goreportcard.com/report/github.com/negapedia/wikitfidf which generates a report with an evaluation of "A+".

Python   PEP 8 has been used as the style guide for the Python code. It defines the best practices to adopt while coding in Python. Adherence to this standard is ensured by the checkers built into the IDEs (JetBrains GoLand and JetBrains PyCharm).


Code Static Analysis: SonarCloud   SonarCloud is a code-quality cloud service provided by SonarSource. It is based on a repository which is continuously analysed whenever code is committed. The use of this kind of tool helps to identify vulnerabilities, bugs, code smells2 and security weaknesses. In particular, it measures:

• Reliability: this analysis marks the presence of code whose behaviour could be different from what expected;

• Security: this analysis marks potential weakness to hackers;

• Maintainability: this analysis marks code which could be difficult to update;

• Coverage: the percentage of lines of code covered by tests (see §2.3.1);

• Duplication: the percentage of duplicated code;

• Size: metrics about the code, such as number of lines of code, percentage of comments, and so on;

• Complexity: this analysis calculates the cyclomatic3 and the cognitive4 complexity.

Go Code Testing and Continuous Integration with Travis CI   In order to ensure the quality of the developed tool, a series of tests has been designed. These tests are run whenever something is committed to the repository. This service has been delegated to Travis CI, a continuous integration service optimized for GitHub repositories. Its behaviour is described inside its configuration file. In particular, it runs the tests and produces the test report, which is used by SonarCloud to calculate code coverage.

2"Code smell" is used to indicate those characteristics which could highlight design weak- nesses which reduce code quality 3Cyclomatic complexity is about the number of independent paths through a program’s source code 4Cognitive complexity is a metric about how hard code is to understand


2.4 I/O Performance: Python Vs. Golang

This section reports the results of read-write benchmarks comparing Python5, Python compiled with Cython6, and Golang7 for different data sizes. The source code used for these tests is available at: https://github.com/MarcoChilese/I-O-Python-vs-Go. The tests have been executed on the same machine used for the project development (see §2.8).

2.4.1 Writing Files

The tests below use different data sizes, each written 100 times to different files created from scratch. Every test has been repeated 10 times, and the reported values are the averages of the resulting times and the estimated speeds.
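The full benchmark source is available in the repository linked above; the following is only a minimal Go sketch of the kind of sequential write test performed (file names and helper functions are illustrative, not taken from the benchmark code).

package main

import (
	"fmt"
	"os"
	"time"
)

// writeFile writes the given payload to a newly created file.
func writeFile(path string, payload []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = f.Write(payload)
	return err
}

// benchmarkSequential writes the payload to 100 distinct files, one by one,
// and returns the elapsed time.
func benchmarkSequential(payload []byte) time.Duration {
	start := time.Now()
	for i := 0; i < 100; i++ {
		if err := writeFile(fmt.Sprintf("out_%d.tmp", i), payload); err != nil {
			panic(err)
		}
	}
	return time.Since(start)
}

func main() {
	payload := make([]byte, 1<<20) // 1MB of zero bytes
	fmt.Println("sequential 1MB x100:", benchmarkSequential(payload))
}

The parallel variant of the test follows the same pattern, with the 100 writes distributed over all available CPU threads.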

Sequential Processing In this test files are written one by one.

Figure 3: Python vs Golang: sequential writing timing.

5 Version 3.7.
6 Version 0.29.10.
7 Version 1.12.6.


File Size   Python (s)   Cython (s)   Go (s)
1MB         0.64         0.62         0.14
10MB        5.73         5.72         1.50
100MB       48.55        50.11        14.67
500MB       209.66       192.85       73.22

Table 3: Python vs Cython vs Go: sequential writing timing results.

File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         156.05          161.30          718.72
10MB        174.37          174.82          667.61
100MB       205.99          199.56          681.51
500MB       238.49          259.27          682.83

Table 4: Python vs Cython vs Go: sequential writing speed results.

Parallel Processing In this test parallelism is used: every CPU thread is used for writing files.

Figure 4: Python vs Cython vs Golang: parallel writing timing.

File Size   Python (s)   Cython (s)   Go (s)
1MB         0.73         0.73         0.13
10MB        6.06         5.20         1.40
100MB       58.45        55.42        14.74
500MB       308.62       282.47       77.74

Table 5: Python vs Cython vs Go: parallel writing timing results.

File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         136.20          136.20          766.10
10MB        165.15          192.31          711.75
100MB       171.08          180.44          678.49
500MB       162.01          177.00          643.14

Table 6: Python vs Cython vs Go: parallel writing speed results.

Comparison

Figure 5: Python vs Cython vs Go: I/O performance comparison.

File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.64     0.62     0.14     1.03x            4.61x
10MB        5.73     5.72     1.50     1.00x            3.83x
100MB       48.55    50.12    14.67    0.97x            3.31x
500MB       209.66   192.85   73.22    1.09x            2.86x

Table 7: Golang vs Cython vs Python: sequential time speedup.


File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.73     0.73     0.13     1.00x            5.62x
10MB        6.06     5.20     1.40     1.17x            4.31x
100MB       58.45    55.42    14.74    1.05x            3.97x
500MB       308.62   282.47   77.74    1.09x            3.97x

Table 8: Python vs Cython vs Go: parallel time speedup.

2.4.2 Reading Files

The tests below use different data sizes, each read 100 times and stored in a variable. Every test has been repeated 10 times, and the reported values are the averages of the resulting times and the estimated speeds.

Sequential Processing In this test the same file is read 100 times, singularly.

Figure 6: Python vs Cython vs Go: sequential reading timing.

File Size   Python (s)   Cython (s)   Go (s)
1MB         0.17         0.12         0.033
10MB        1.74         1.34         0.33
100MB       15.58        15.15        2.65
500MB       82.92        89.89        13.39

Table 9: Python vs Cython vs Go: sequential reading timing results.


File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         580.69          833.33          2991.04
10MB        574.09          746.27          3031.33
100MB       641.84          660.07          3780.51
500MB       603.00          556.24          3733.31

Table 10: Python vs Cython vs Go: sequential reading speed results.

Parallel Processing In this test parallelism is used: every CPU thread is used for reading files.

Figure 7: Python vs Cython vs Go: parallel reading timing.

File Size   Python (s)   Cython (s)   Go (s)
1MB         0.23         0.22         0.03
10MB        1.86         1.75         0.28
100MB       19.58        18.83        23.55
500MB       104.61       99.53        307.82

Table 11: Python vs Cython vs Go: parallel reading timing results.


File Size   Python (MB/s)   Cython (MB/s)   Go (MB/s)
1MB         440.75          454.55          3268.29
10MB        536.26          571.43          3628.36
100MB       510.61          531.07          424.62
500MB       477.97          502.36          162.43

Table 12: Python vs Cython vs Go: parallel reading speed results.

Comparison

Figure 8: Python vs Cython vs Go: I/O reading performance comparison.

File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.17     0.12     0.03     1.41x            5.15x
10MB        1.74     1.34     0.33     1.30x            5.28x
100MB       15.58    15.15    2.65     1.03x            5.89x
500MB       82.92    89.89    13.39    0.92x            6.19x

Table 13: Python vs Cython vs Go: sequential time speedup.


File Size   Py (s)   Cy (s)   Go (s)   CyVsPy Speedup   GoVsPy Speedup
1MB         0.23     0.22     0.03     1.05x            7.42x
10MB        1.86     1.75     0.28     1.06x            6.77x
100MB       19.58    18.83    23.55    1.04x            0.83x
500MB       104.61   99.53    307.82   1.05x            0.34x

Table 14: Python vs Cython vs Go: parallel time speedup.

2.5 Front-end Development: Data Presentation

The project computes a huge amount of data, which must be carefully dosed when exposed to the user. With this in mind, a way to easily communicate it has been designed: the word cloud. A word cloud is a cloud containing words of different sizes, based on their importance or absolute number of occurrences. A series of examples follows, extracted from the latest English Wikipedia computation, considering the latest 10 reverts8:

Figure 9: "Computer" page word Figure 10: "Microsoft" page word cloud. cloud.

8See §4.1.2 for definition.


Figure 11: "Apple Inc." page word Figure 12: "University" page word cloud. cloud.

These clouds contain the 50 most important words (in terms of TF-IDF value) for the specified page. To build this kind of word cloud, a popular open-source JavaScript library available on GitHub has been selected: d3-cloud. In order to be completely transparent, the data used to build the word cloud are reported inside the Negapedia page, inside a tag, as a variable (a JSON dictionary). In this way, the data are visible to those users who want to use them.

2.6 The Product

The developed product is available in an open-source repository:

https://github.com/negapedia/wikitfidf which contains the code and resources of the project. As said in §2.3, the final product consists of a tool developed mostly in Go, which uses two separate parts developed in Java and Python. These two have different core tasks: Wikitext cleaning and text normalization, respectively. The Java code has been packed with its dependencies, through Apache Maven, into an executable jar package which is called at the right time by the Go code through a system call. The Python part is analogous but slightly different: its code is also called by Go through a system call, but what is called is not the Python code as developed, but the Cython-compiled one. That is why, before execution, as reported in §2.3, the Python code is compiled with Cython to improve its performance.

So, what is actually called is a Python module which references the Cython-compiled module, which in turn executes the originally developed task.
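To illustrate this orchestration, the sketch below shows how a Go program can delegate work to an external jar and to a Python entry point through system calls; the file names and arguments are placeholders, not the actual ones used by wikitfidf.

package main

import (
	"log"
	"os/exec"
)

// runWikiClean delegates Wikitext cleaning to a packaged Java tool.
// The jar name and arguments are placeholders for illustration only.
func runWikiClean(inputDir string) error {
	cmd := exec.Command("java", "-jar", "wikiclean-bundled.jar", inputDir)
	out, err := cmd.CombinedOutput()
	log.Printf("wikiclean output: %s", out)
	return err
}

// runTextNormalizer calls a Python entry point that imports the
// Cython-compiled module doing the actual work (names are placeholders).
func runTextNormalizer(inputDir, lang string) error {
	cmd := exec.Command("python3", "normalizer_entry.py", inputDir, lang)
	out, err := cmd.CombinedOutput()
	log.Printf("normalizer output: %s", out)
	return err
}

func main() {
	if err := runWikiClean("/data/pages"); err != nil {
		log.Fatal(err)
	}
	if err := runTextNormalizer("/data/pages", "en"); err != nil {
		log.Fatal(err)
	}
}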

2.6.1 Public API

In order to hide all the low-level implementation from the user, an API has been designed, described inside the files wikitfidf.go and exporter.go, with the aim of offering to the user only the essential operations:

• CheckAvailableLanguage: allows the user to check if the required language is handled by the project, if not an error is returned;

• Preprocess: pre-processes pages, taking as input a channel9 of parsed pages from the Wikibrief library and reducing the page information;

• Process: called after Preprocess, it is the core part of the project and performs all the operations from start to end;

• GlobalWordsExporter: exports the set of words of the analysed Wikipedia, with their occurrence values, for the purpose of integrating the data inside Negapedia web pages (see §6.1.1);

• PagesExporter: like the previous one, but this is about the list of pages with their list of words and TF-IDF values (see §6.1.1);

• TopicsExporter: like the previous one, but this is about words in each topic (see §6.1.1);

• BadwordsReportExporter: like the previous one, but this is about the bad- words report (see §6.1.1).

In addition to this, the package is "go-gettable", which means that it can easily be downloaded and installed in a Go environment through the native Go package manager, simply with:

go get github.com/negapedia/wikitfidf

This is particularly important because it simplifies the integration process (see §6) inside the Negapedia back-end.

9 Channels are pipes used by Golang to connect different goroutines (lightweight threads of execution): they are typed conduits through which data can be sent and received, and they are particularly well suited to concurrency.


2.7 Stand-Alone Version

The final product has been designed not only as an integrable component, but also as a fully working stand-alone project10, and this is possible thanks to Docker. In fact, through a dedicated Docker image, defined in a Dockerfile, the project can run inside a container. The execution is customizable through command-line flags, which are:

• l: Wikipedia language;

• d: Result directory path;

• s: Revert starting date to consider8;

• e: Revert ending date to consider8;

• specialList: Special page list to consider8;

• rev: Number of revert to consider8;

• topPages: Number of top words per page to consider;

• topWords: Number of top words of global words to consider;

• topTopic: Number of top words per topic to consider;

• verbose: If true, logs are shown (default: true).

So an example of execution could be:

docker run -v /path/2/out/dir:/data wikitfidf dothething -lang it

which will compute, for the Italian Wikipedia, the 50 most important words for each page, the 100 most frequent words in each topic and the 100 most frequent words in the entire Wikipedia, considering the last 10 reverts of each page.

2.8 Development Environment

The whole project has been developed on a Apple MacBook Pro 11,1 (Mid 2014) with the following specifications:

• Intel(R) Core(TM) i5-4278U CPU @ 2.60GHz (2 cores-4 threads);

• 8GB RAM DDR3 @ 1600MHz;

• 256GB PCIe SSD;

10 With the sole purpose of processing data and saving the output.

running Apple macOS 10.14.6. JetBrains PyCharm and GoLand have been used as Integrated Development Environments (IDEs). The project has been tested on the Negapedia servers, hosted by CloudVeneto, with the following specifications:

• Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz (8 cores-8 threads);

• 16GB RAM and 16GB swap memory;

• 500GB/1TB HDD;

running Ubuntu 16.04 LTS or Ubuntu 18.04 LTS.

2.8.1 Processing Times

Considering the latest computation made on the servers, with the latest 10 reverts, the average times required to complete the processing are:

English Wikipedia   3 days and 21 hours
Italian Wikipedia   14 hours

2.8.2 Minimum and Recommended Requirements

The minimum requirements which are needed for executing the project in rea- sonable times are:

• At least 4 cores-8 thread CPU;

• At least 16GB of RAM;

• At least 300GB of disk space.

However the recommended requirements are:

• 32GB of RAM or more;

• Swap memory area enabled.



3 Wikipedia Dump: Structure and Content

Wikipedia is an online encyclopedia based on open collaboration, launched in English in 2001. Since then, Wikipedia has kept growing and now counts more than 300 languages, 40 million pages and over 85 million registered users (Wikipedia, 2019). Due to its collaborative nature, everyone can add content to Wikipedia, modifying or writing pages, adding resources, updating data and so on. Exactly for these reasons, everything is under versioning: every page in Wikipedia exposes its own history, giving the possibility to compare every revision. As can be guessed, these activities generate an incredible amount of data which, for safety reasons, is backed up periodically: every month the Wikimedia Foundation generates a series of complete Wikipedia database backups (database dumps), divided into categories. The most relevant are:

• pages-articles-multistream: Articles, templates, media/file descriptions, and primary meta-pages;

• pages-meta-history: All pages with complete history, the one useful for the project purpose;

• pages-logging: log events for all pages and users;

• pages-meta-current: all pages at their current version;

• pages-articles: current versions of articles only, without talk or user pages.

Over the years, Wikipedia's size has kept growing: the whole pages-meta-history dump of the English Wikipedia, uncompressed, reached 14 TB in February 2013, and by October 2018 its size had grown to 17,959,415,517,241 bytes, i.e. 17.96 TB (Wikimedia Foundation, 2018), (Wikimedia Foundation, 2019). All Wikipedia dumps are public, and everyone can download them for free from the dedicated website: https://dumps.wikimedia.org/.

3.1 Structure

The type of dump to be analysed is the pages-meta-history one. As said, this kind of file contains the whole history of a page: in it can be found, page by page, a chronologically ordered list of every revision the page has undergone. The page data structure is fairly regular, though some revisions can carry some extra information; their presence, or absence, is not a problem because they do not convey information relevant to the project.


The simplified XML structure follows:

<mediawiki xmlns="..." xml:lang="...">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>...</dbname>
    <namespaces>
      <namespace key="0" />
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>Page title</title>
    <ns>0</ns>
    <id>...</id>
    <revision>
      <id>...</id>
      <parentid>...</parentid>
      <timestamp>yyyy-mm-ddThh:mm:ssZ</timestamp>
      <contributor>
        <username>...</username>
        <id>...</id>
      </contributor>
      <text>......</text>
      <sha1>...</sha1>
    </revision>
    ...
  </page>
  ...
</mediawiki>

Listing 1: Wikipedia pages-meta-history dump XML structure.

In the namespaces declaration, 25 namespaces are listed, and the only relevant one is namespace "0", which identifies normal pages (articles) in Wikipedia. In fact, during the dump analysis, every page which does not belong to this category is excluded. For the purpose of our analysis, the most relevant tags of a page are:

• Page ID;


• Revision:

– Timestamp;

– Text;

– SHA1.

In particular, the SHA1 tag allows us to easily identify which revisions are to be considered reverts.

3.2 Reverts

A revert is a revision which has been considered by other editors useless, vandalic, wrong in form or content (or both), or not appropriate. These revisions are simply removed by rolling the page back to a revision which is considered "stable". Nevertheless, there can also exist reverts which are themselves vandalic: revisions reverted with the aim of reversing the page status. Vandalism in Wikipedia is analysed by the WikiTrust API (Thomas Adler, de Alfaro, & Pye, 2010), which is able to detect vandalic acts inside edits, identifying vandalism with a recall of 83.5%, a precision of 48.5%, and a false positive rate of 8%. As said, which revisions are to be considered reverts can be understood from the SHA1 analysis: where a SHA1 appears repeated, everything between the two identical hashes has to be considered a revert. Consider the image below:

Figure 13: Example of page revision history.


In this case, the hash of the first revision appears again in the fourth one: this means that an editor considered (rightly, or maliciously) everything between the first and the fourth revision, inside the blue rectangle, useless and therefore to be removed. With this choice, the editor rolls the page back to the state of the first revision.
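The following is a minimal Go sketch of this detection idea (the Revision type and the function name are illustrative, not taken from the project): every revision lying between two occurrences of the same SHA1 is marked as reverted.

package main

import "fmt"

// Revision is an illustrative, simplified revision record.
type Revision struct {
	Timestamp string
	SHA1      string
}

// markReverted returns, for each revision, whether it lies between two
// revisions sharing the same SHA1 (i.e. it was wiped out by a rollback).
func markReverted(revs []Revision) []bool {
	reverted := make([]bool, len(revs))
	lastSeen := map[string]int{} // SHA1 -> most recent index where it appeared
	for i, r := range revs {
		if j, ok := lastSeen[r.SHA1]; ok {
			// Everything strictly between the two identical hashes was undone.
			for k := j + 1; k < i; k++ {
				reverted[k] = true
			}
		}
		lastSeen[r.SHA1] = i
	}
	return reverted
}

func main() {
	revs := []Revision{
		{"t1", "aaa"}, {"t2", "bbb"}, {"t3", "ccc"}, {"t4", "aaa"},
	}
	fmt.Println(markReverted(revs)) // [false true true false]
}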


4 Dump Analysis

Before examining every step in detail, let us have a brief view of the complete analysis process through an activity diagram:

Figure 14: High level representation of whole analysis process.

4.1 Dump Pre-processing

4.1.1 Dump Parse

The first step of the processing pipeline is to download the pages-meta-history dumps from the Wikimedia repository for the selected language and date.


These dumps are available in two compressed formats: 7z and bz2. The most convenient is the 7z one, thanks to its performance and better compression rate, which allows using less disk space while processing. The dump size is not fixed: it typically varies from 100MB to 400MB in the 7z version, but it is not unusual to come across "abnormal" dumps which can exceed a gigabyte. The decompressed size of a 200MB dump is on average around 34GB, so it is important to define a strategy for decompression and content reading. In fact, full extraction before processing can be prohibitive for the disk space; likewise, loading the entire extracted file into RAM is prohibitive for reading, so incremental extraction and reading techniques are the key.
The technique used in the project is live extraction to the terminal standard output with simultaneous reading for incremental parsing, which is needed for the same reason above. The XML parser is fed line by line and it is triggered by the opening page tag, from which page data are collected into structures, skipping, as said, those pages which do not belong to namespace 0. The data collection is the first moment where useless data are discarded; in fact, the only data collected are:

• Page ID;

• Revisions:

– Revision timestamp;

– Revision text;

– Revision SHA1 digital sign.

Those structures can be imagined as JSON dictionaries which will be saved on disk during the next steps:

1 {"PageID": 123456,

2 "TopicID":y,

3 "Revision":[

4 {"Timestamp":"yyyy-mm-ddThh:mm:ssZ",

5 "Text":"Revision Text",

6 "Sha1":"ikhwgq0sxctzqpl8aomp6gelk5p539e"},

7 { ... },

8 ...

9 ]}

Listing 2: JSON Page Data Format.

where y is the topic code assigned by the Negapedia system.
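Such a record maps naturally onto Go structures. The following is a hedged sketch; the field and type names are illustrative and do not necessarily match the ones defined in the project's structures package.

package main

import (
	"encoding/json"
	"fmt"
)

// Revision is one entry of the page history, as kept after parsing.
type Revision struct {
	Timestamp string `json:"Timestamp"`
	Text      string `json:"Text"`
	Sha1      string `json:"Sha1"`
}

// Page is the reduced page record written to disk as JSON.
type Page struct {
	PageID   uint32     `json:"PageID"`
	TopicID  uint32     `json:"TopicID"`
	Revision []Revision `json:"Revision"`
}

func main() {
	raw := `{"PageID":123456,"TopicID":7,"Revision":[{"Timestamp":"2019-01-01T00:00:00Z","Text":"...","Sha1":"ikhwgq0sxctzqpl8aomp6gelk5p539e"}]}`
	var p Page
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		panic(err)
	}
	fmt.Println(p.PageID, len(p.Revision))
}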


4.1.2 Dump Non-Revert Reduction and Revision Filtering Method

During the page parsing, in order to reduce the page file size, some elaborations are performed with the task of removing those revisions which are not considered reverts. These operations considerably decrease the size of the file that will be written in JSON format (same as above) at the end of the page analysis. To reduce the amount of data even more, and consequently the required processing time, some filters can be applied to the revision list:

• Revision date: a temporal filter which excludes revisions that are not in a specified time frame;

• Special Page List: only the pages in the list are considered, with their complete history;

• Number of Reverts: only the latest n reverts in the complete page history are considered.

The idea of writing one file per page may seem not particularly convenient in terms of I/O, but it allows pages to be processed in parallel in the next steps of the pipeline.

Figure 15: High level representation of parse and dump reduction process.

After this process, the data are saved in JSON, and they look like:

1 {"PageID": 123456,

2 "TopicID":y,

3 "Revision":[

4 {"Timestamp":"yyyy-mm-ddThh:mm:ssZ",

27 4 DUMP ANALYSIS Marco Chilese

5 "Text":"Revision Text",

6 "Sha1":"ikhwgq0sxctzqpl8aomp6gelk5p539e"},

7 { ... },

8 ...

9 ]}

Listing 3: JSON Page Data after Revert removal.

4.2 Text Cleaning and Normalization

4.2.1 WikiText Markup Cleaning

Inside Wikipedia's articles, a markup language is used to define layouts, text style, links, tables, images and so on; it is called Wikitext, and also known as Wiki Markup or Wikicode. For the project's purpose, this extra information available in the revisions' text is useless: it does not convey any information about the meaning of the text or the subject of the page, so it has to be removed. The cleaning process ensures that the analysed text is pure text only, and not text filled with an overstructure used to present the text itself. An activity diagram representing the Wikitext cleaning process follows.

Figure 16: Wikimedia Markup Cleaning Process.

4.2.2 Text Normalization

Text normalization is one of the most important steps and it is made of several sequential actions, implemented in the NLTK library. Let us enumerate them and then analyse them one by one:

1. Tokenization;

28 Marco Chilese 4 DUMP ANALYSIS

2. Stopwords cleaning;

3. Stemming.

Tokenization Tokenization is about splitting text into a list of single words, like this:

"Wikipedia is a multilingual online encyclopedia."

["wikipedia", "is", "a", "multilingual", "online", "encyclopedia"]

In addition to the splitting process, punctuation is removed and the text is converted to lowercase. Having single words, and no longer the entire text, allows considering them individually and enables the following steps.

Stopwords Cleaning   Stopwords cleaning is an important step whose purpose is to remove those words which are too common to be relevant in the statistical analysis. To do this, a list of stopwords for the given language is used: if a word is in the stopwords list, it is immediately removed from the text. The stopword lists used are the NLTK ones, but they have been enriched both in number of words and in supported languages11. Currently, 45 languages are handled (see §A). So, considering the previous example:

["wikipedia", "is", "a", "multilingual", "online", "encyclopedia"]

["wikipedia", "multilingual", "online", "encyclopedia"]12

This, in addition to yielding better text to analyse, is also a way to reduce the amount of information, and thus the computational weight.

11 To do this, https://www.ranks.nl/stopwords/ has been used.
12 Considering the stopwords list used in the project.
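In the project this step is performed in Python with NLTK; the following Go sketch only illustrates the idea, with a toy stopword set standing in for the real per-language lists.

package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases the text and splits it on anything that is not a letter
// or a digit, which also drops punctuation.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// removeStopwords drops every token found in the stopword set.
func removeStopwords(tokens []string, stopwords map[string]bool) []string {
	var out []string
	for _, t := range tokens {
		if !stopwords[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	stop := map[string]bool{"is": true, "a": true}
	tokens := tokenize("Wikipedia is a multilingual online encyclopedia.")
	fmt.Println(removeStopwords(tokens, stop))
	// [wikipedia multilingual online encyclopedia]
}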

29 4 DUMP ANALYSIS Marco Chilese

Stemming   The last step is the one that really deals with normalization. Stemming reduces inflected forms of a word to a common root:

Playing, plays, played, play → play

By doing this, we increase the statistical relevance of a word; otherwise, the four words in the example above would be considered different words, each with its own statistical relevance, even though they mean the same thing. It is noteworthy that during the stemming phase a "de-stemming" dictionary is built: it allows rebuilding a meaningful word after the statistical analysis (see §4.3.5). So, let us consider a slightly more complex example to sum up the entire process:

"Wikipedia is a multilingual online encyclopedia, based on open collaboration through a wiki-based content editing system. It is the largest and most popular general reference work on the World Wide Web, and is one of the most popular websites ranked by Alexa as of June 2019.13"

["wikipedia", "multilingu", "onlin", "encyclopedia", "base", "open", "collabor", "through", "wiki", "base", "content", "edit", "system", "largest" "popular", "gener", "refer", "work", "world", "wide", web", "popular", "websit", "rank", "alexa", "june", "2019"]

Follow an activity diagram which represent text normalization process.

Figure 17: Text Stemming and Stopwords Removal Process.

13Text from the Wikipedia page of Wikipedia (11/07/2019).


So, formalizing what has been said: let R be the revision text and Tok the function which splits the text into single words; then

\[ T = Tok(R) \qquad (1) \]

where T is the list of single words, and so T_i is a single word. Now, let SC be the function which performs the stopwords cleaning, and let S(w, l) be the function which returns 1 if the word w, for the language l, is in the stopwords list, and 0 otherwise. Then:

\[ SC(T) = \bigcup_{i=1}^{|T|} \begin{cases} T_i & \text{if } S(T_i, l) = 0 \\ \emptyset & \text{otherwise} \end{cases} \qquad (2) \]

Finally, Stem(T) is the function which performs the stemming of each word in T, which has been cleaned up in (2):

\[ T = Stem(T) \qquad (3) \]

4.2.3 Text Mapping

Text mapping is a key part of the process. In this step the stemmed text is analysed again and summarized in the form of a dictionary:

"Term": x

where x is the absolute occurrence value of the term across all the collected revert texts of the current page. So now the page data are represented like this:

1 {"PageID": "123456",

2 "Tot":t,

3 "Words":[

4 {"Word1": occurr1},

5 {"Word2": occurr2},

6 ...

7 ]

8 }

Listing 4: JSON page data format after word mapping.

where t represents the total number of words in the page reverts, and occurr1, occurr2 the absolute occurrence values of word1 and word2, respectively. This particular elaboration, besides being the starting point for the second part of the project (the statistical one), is relevant because the amount of information decreases considerably, thanks to the nature of this new page data representation.

4.3 Text Analysis: a Statistical Approach

4.3.1 Global Pages File

At this point, the data are structured in the format shown in §4.2.3 and are ready to be aggregated into a single file. This approach helps processing the data. In fact, a way for writing pages to, and reading them from, a single file has been designed: every page is written on a single line, so every line represents a page. In this way, the file can be read incrementally, without loading it completely into memory. This approach is necessary because of the size of the resulting document. The global pages file looks like this:

1 {"123456": {"Tot": t1,"Words": [{"Word1": occurr1},{"Word2": occurr2}, ...]},

2 "7890123":{"Tot": t2,"Words": [{"Word3": occurr3}, ...]},

3 ...

4 }

Listing 5: Global Page File JSON data format.

4.3.2 Global Words File

The statistical phase which follows requires a dictionary containing all the words of the wiki, mapped to their absolute occurrence value and to the number of documents which contain them. So, exploiting the Global Pages File previously built, the new Global Words File is built. It also includes two special counters: the number of processed pages and the number of words in them; these two, in particular, will help during the calculation of TF-IDF, as will be seen later. It looks like:

1 {"123456": {"Tot":t,"Words": [{"Word1": occurr1,"in":x}, {"Word2": occurr2,"in":y}, ...]},

2 ...

3 "@Total Page":p,

4 "@Total Words":w

5 }

Listing 6: Global Words File JSON data format.

where x and y represent the number of documents that contain the terms word1 and word2, respectively; p and w are the number of pages analysed and the number of words inside them.

4.3.3 TF-IDF: Attributing Importance to Words

TF-IDF (Term Frequency - Inverse Document Frequency) is an information retrieval function used to assign importance to a word relative to a document or to a set of documents. A term becomes more important the more frequently it appears in a document, and less important the more documents of the set contain it. The TF-IDF calculation is composed of three separate parts: the TF calculation, the IDF calculation and then the TF-IDF calculation. Let us analyse them separately.

TF The TF value represents the frequency of a term in a single document, considered relatively to the amount of word in the document. So:

\[ tf_{i,j} = \frac{n_{i,j}}{|d_j|} \qquad (4) \]

where tf_{i,j} is the term frequency of term i in document j, n_{i,j} is the number of occurrences of term i in document j, and |d_j| is the size of the document (number of words).

IDF   IDF represents the importance of a term inside the collection of documents. It considers the number of documents and the number of documents that contain the analysed term. So:

\[ idf_i = \log \frac{|D|}{|\{d : i \in d\}|} \qquad (5) \]

where idf_i is the inverse document frequency of term i, |D| is the cardinality of the collection, and |{d : i ∈ d}| indicates the number of documents which contain the term i.

TF-IDF So these two factors take part to the final calculation for the TF-IDF of a single term:

\[ tfidf_{i,j} = tf_{i,j} \cdot idf_i \qquad (6) \]

Now, let us see an example. Consider a collection of 1000 documents, where each document contains 100 words. Let w be a word which appears in 50 documents of the collection, and 5 times in the document j. The TF for w in the document j would be 5/100 = 0.05; the IDF would be log(1000/50) = 1.30. From these results the TF-IDF can be calculated as 0.05 * 1.30 = 0.065.
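A minimal Go sketch of this computation, using the numbers of the example above and a base-10 logarithm as in the example (function and variable names are illustrative):

package main

import (
	"fmt"
	"math"
)

// tfidf computes the TF-IDF value of a term given its occurrences in a
// document, the document length, the collection size, and the number of
// documents containing the term.
func tfidf(occurrences, docLen, totalDocs, docsWithTerm float64) float64 {
	tf := occurrences / docLen
	idf := math.Log10(totalDocs / docsWithTerm)
	return tf * idf
}

func main() {
	// 5 occurrences in a 100-word document, 1000 documents, term in 50 of them.
	fmt.Printf("%.4f\n", tfidf(5, 100, 1000, 50)) // ≈ 0.0651
}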

4.3.4 Global Pages File With TF-IDF

Thanks to the previously built Global Words and Global Pages files, the computation of TF-IDF for every page and every term turns out to be particularly easy. In fact, the most important data for the function have already been calculated: as mentioned in §4.3.2, we have the number of documents (pages) in the set, the total amount of words inside the set, the amount of words in each page, and the absolute occurrence value of each term. From these premises, it is only a matter of implementation. As mentioned in §4.3.1, the Global Pages File is read incrementally and so it is processed page by page; vice versa, the Global Words File must be read entirely, because a global vision of it is needed by the nature of the process. After the statistical calculation described in §4.3.3, the Global Pages File is replaced with an updated version including the just-calculated results. Now, this file looks like:

1 {"123456": {"Tot":t,"Words": [{"Word1":{"abs": x1,"tfidf": y1}}, ...],

2 ...

3 } Listing 7: Global Page File with TF-IDF JSON data format.

where x1 represents the number of occurrences of the term in that page, and y1 the TF-IDF value of that term in that page, considering the whole page collection.

4.3.5 De-Stemming

As said in §4.2.2, stemming is a key part of the statistical calculations (words are counted by their root and not in all their forms), but this means that words have been truncated or reduced to a form which is not always meaningful. Considering that our aim is to show the most relevant words, it is not acceptable to show badly formed words; so, during the stemming process, a "de-stemming" dictionary is built, based on key-value pairs:

"stemmedWord": "realWord"

where realWord is continuously updated by choosing, as a strategy, the shortest word whose stemmed form gives stemmedWord. In this way, when a word is found whose stemmed version is equal to the real word, it is inserted in the dictionary, ensuring the correct reconstruction. This solution can be described as "brute-force".


It guarantees consistency during the de-stemming process. The algorithm used is:

from nltk.stem import PorterStemmer


def _stemming(revert_text, stemmer_reverse_dict):
    ps = PorterStemmer()  # NLTK stemmer
    text = []

    for word in revert_text:
        stemmed_word = ps.stem(word)
        # Keep, for each stem, the shortest original word seen so far.
        if stemmed_word in stemmer_reverse_dict.keys() and len(word) < len(stemmer_reverse_dict[stemmed_word]):
            stemmer_reverse_dict[stemmed_word] = word
        elif stemmed_word not in stemmer_reverse_dict.keys():
            stemmer_reverse_dict[stemmed_word] = word

        text.append(stemmed_word)
    return text, stemmer_reverse_dict

Listing 8: Stemming and De-Stemming dictionary building algorithm.

Let us consider an example:
Phrase 1: "chain chained chains" → de-stem dictionary: {chain: chain};
Phrase 2: "chained" → de-stem dictionary: {chain: chained}.
When aggregating these two dictionaries into a single one, the following rule is applied (with the entries of the de-stem dictionary named as above): for each stemmedWord, the shortest realWord is kept. So, in this case, the global de-stem dictionary would be {chain: chain}, producing the correct de-stemming.
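A hedged Go sketch of this merge rule follows (names are illustrative; the project's destemmer package may differ):

package main

import "fmt"

// mergeDestem merges a per-page de-stemming dictionary into a global one,
// keeping for each stemmed word the shortest real word seen so far.
func mergeDestem(global, local map[string]string) {
	for stem, word := range local {
		if current, ok := global[stem]; !ok || len(word) < len(current) {
			global[stem] = word
		}
	}
}

func main() {
	global := map[string]string{"chain": "chained"}
	mergeDestem(global, map[string]string{"chain": "chain"})
	fmt.Println(global) // map[chain:chain]
}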

After the TF-IDF calculation and the writing of the results (§4.3.3, §4.3.4), the de-stemming process is performed, allowing the reconstruction of the words. Considering the short example of §4.2.2:

["wikipedia", "multilingu", "onlin", "encyclopedia", "base", "open", "collabor", "through", "wiki", "base", "content", "edit", "system", "largest" "popular", "gener", "refer", "work", "world", "wide", web", "popular", "websit", "rank", "alexa", "June", "2019"]

its stemming process would also have generated its de-stemming dictionary, which looks like:

1 {"multilingu":"multilingual",

2 "onlin":"online",

3 "collabor":"collaboration",

4 "gener":"general",

5 "refer":"reference",

35 4 DUMP ANALYSIS Marco Chilese

6 "websit":"website"}

So, using the above dictionary, the stemmed text can be de-stemmed, getting back a more complete meaning:

["wikipedia", "multilingual", "online", "encyclopedia", "base", "open", "collaboration", "through", "wiki", "base", "content", "edit", "system", "largest" "popular", "general", "refer", "work", "world", "wide", web", "popular", "website", "rank", "alexa", "June", "2019"]

4.3.6 Global Topics File

Exploiting the "TopicID" data field that every page has, a global file can easily be built, containing for each topic every word with its absolute occurrence value. This file looks like:

1 {"TopicID1":{"word1": freq1,"word2: freq2", ...},

2 "TopicID2":{"word1": freq1, ...}

3 ...

4 }

Listing 9: Global Topics File JSON data format.

4.3.7 Top N Words Analysis

The previously built files contain a lot of data, so for the use described in §5 smaller versions are required. For this purpose, a simple Python module has been written which, from those files (Global Pages with TF-IDF, Global Words and Global Topics), builds a dictionary from which the N (defined as a parameter) most important or most frequent words are extracted, and then writes a copy of the original files with the reduced data set. In this case, to save disk space, the files are written with the maximum compression level made available by gzip.
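In the project this extraction is done by a small Python module; the following Go sketch only illustrates the idea (sort the words by value and keep the first N):

package main

import (
	"fmt"
	"sort"
)

// topN returns the n entries of the map with the highest values,
// sorted in descending order.
func topN(freq map[string]int, n int) []string {
	words := make([]string, 0, len(freq))
	for w := range freq {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool { return freq[words[i]] > freq[words[j]] })
	if n > len(words) {
		n = len(words)
	}
	return words[:n]
}

func main() {
	freq := map[string]int{"wikipedia": 10, "negapedia": 15, "golang": 7, "python": 8}
	fmt.Println(topN(freq, 2)) // [negapedia wikipedia]
}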

4.3.8 Bad Language Analysis

Once de-stemming has been performed, another sort of analysis can also be run: a bad language analysis for each page. This analysis is quite similar to stopwords cleaning, but the caught words are not deleted: they are collected. In fact, using bad-words lists, a report is built which includes the bad words used in each page that contains them. Those lists are available for 24 languages (see §A.2), including the most common ones.

36 Marco Chilese 4 DUMP ANALYSIS

They have been built by combining lists from different sources14. This report uses the same JSON incremental-writing trick described in §4.3.1, and it looks like:

1 {"123456":{"Abs":x,"Rel":y,"BadW":{"badword1":occurr1,"badword2":occurr2}},

2 ...

3 }

Listing 10: Bad words report JSON data format.

where x represents the absolute counter of bad words found in the examined page, and y the relative degree of vulgarity of the page, based on the number of words in it, calculated as \( Vulg_{rel} = \frac{Vulg_{abs}}{|Words|} \).

14 http://www.bannedwordlist.com/, https://www.frontgatemedia.com/a-list-of-723-bad-words-to-blacklist-and-how-to-use-facebooks-moderation-tool/, http://aurbano.eu/blog/2008/04/04/bad-words-list/, https://github.com/chucknorris-io/swear-words, https://github.com/pdrhlik/sweary/. (Last access August 5, 2019)



5 The Words Cloud

As briefly described in §2.5, a word cloud is a container of words, where every word has a weight on which its font size depends. Let us consider a simple example:

1 {"Words":{

2 "wikipedia": 10,

3 "negapedia": 15,

4 "golang": 7,

5 "python": 8,

6 "NLTK": 2}

7 }

This dictionary is populated by words associated with their weight or, in other words, with their absolute occurrence. This example could be represented by the following cloud:

From this, at a single glance, it can easily be understood that the most relevant word in the set is "negapedia". In fact, the idea was to have a simple structure which could represent a fair amount of data without overloading the user with information.

Word Size Calculation The values associated to a word are of two kinds:

• Absolute occurrence value: it is a value, said v, with v > 0;

• TF-IDF value: it is a value, said v, with 0 < v < 1.

These values represent the popularity and the importance, respectively, of a single word. To produce a word cloud, they must be converted to a font-size value.


A linear interpolating function has been chosen. To build this function, let us consider:

Figure 18: Linear word size interpolation.

where the two points are:

• (wordMin; fontSizeMin): the pair formed by the word with the lowest value and the minimum font size that has been chosen;

• (wordMax; fontSizeMax): the pair formed by the word with the highest value and the maximum font size that has been chosen.

Considering therefore that all other words will necessarily fall within [wordMin; wordMax], the interpolating function can be built as a line passing through two points. This line is defined by:

\[ \frac{x - x_0}{x_1 - x_0} = \frac{y - y_0}{y_1 - y_0} \qquad (7) \]

or:

\[ y = \frac{x - x_0}{x_1 - x_0} \cdot (y_1 - y_0) + y_0 \qquad (8) \]

which in our case, for the word i, becomes:

\[ fontSize_i = \frac{value_i - wordMin}{wordMax - wordMin} \cdot (fontSizeMax - fontSizeMin) + fontSizeMin \]

where value_i is the word's TF-IDF value (or absolute occurrence value) and, obviously, fontSize_i is the associated font size.
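The actual rendering is done client-side with d3-cloud; the following Go sketch only illustrates the interpolation formula (the font-size bounds in the example are arbitrary):

package main

import "fmt"

// fontSize linearly maps a word value from [wordMin, wordMax]
// to a font size in [fontMin, fontMax].
func fontSize(value, wordMin, wordMax, fontMin, fontMax float64) float64 {
	if wordMax == wordMin {
		return fontMin // avoid division by zero when all words share one value
	}
	return (value-wordMin)/(wordMax-wordMin)*(fontMax-fontMin) + fontMin
}

func main() {
	// Weights between 2 and 15 (as in the example cloud of this chapter),
	// mapped to font sizes between 10 and 40 (illustrative bounds).
	fmt.Println(fontSize(10, 2, 15, 10, 40)) // ≈ 28.46
}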


Summing up, after the analysis phase the output files are:

F.1 Global Pages file: this file contains, for each page, the whole list of words with their absolute occurrence value and TF-IDF value;

F.2 Global Pages file with only the most important N words: this file contains, for each page, a restricted list of N most important words with their tf-idf value;

F.3 Global Words file: this file contains all the words inside the analysed Wikipedia, associated with their absolute occurrence value;

F.4 Global Words file with only the most popular N words: this file contains the most N popular inside the analysed Wikipedia, with their absolute occurrence value;

F.5 Global Topic Words: this file contains all the words inside each topic of the analysed Wikipedia, associated with their absolute occurrence value;

F.6 Global Topic Words with only the most popular N words: this file contains the most N popular inside each topic of the analysed Wikipedia, with their absolute occurrence value;

F.7 Bad Words Report: this file contains, for each page which has them, a list of bad words with their absolute occurrence value.

These files, in particular the top-N-words ones, will be used to produce the word clouds.


5.1 Pages Words Cloud

Using the file F.2, a word cloud is built for each available page. An example could be:

Figure 19: Word cloud of "Cold War" page.

Based on:

{"Tot": 92792,"Words":{"state": 0.0056,"american": 0.004,"germany": 0.0054,"soviet": 0.0456,"superpower": 0.0037,"west": 0.0032,"khrushchev": 0.0068,"missil": 0.0052,"presid": 0.0032,"crisi": 0.004,"party": 0.0043,"eastern": 0.0054,"afghanistan": 0.0038,"communist": 0.0157,"nuclear": 0.0082,"cambodia": 0.0032,"conflict": 0.0034,"power": 0.0039,"cuban": 0.0037,"govern": 0.0048,"vietnam": 0.0048,"gorbachev": 0.0057,"relat": 0.0044,"invase": 0.0035,"policy": 0.0046,"churchil": 0.0033,"unit": 0.0048,"berlin": 0.0052,"cold": 0.0154,"bloc": 0.008,"moscow": 0.0042,"europ": 0.0069,"brezhnev": 0.0041,"country": 0.0033,"nato": 0.004,"reagan": 0.0066,"regim": 0.0039,"union": 0.0089,"ussr": 0.0055,"arms": 0.0039,"revolut": 0.0039,"khmer": 0.0042,"eisenhowe": 0.0034,"econom": 0.0049,"western": 0.0053,"military": 0.0084,"stalin": 0.0129,"peac": 0.0036,"truman": 0.0054,"ally": 0.0041},"TopicID": 2147483637}

Listing 11: "Cold War" word-cloud page.

In this kind of cloud the most controversial words of the page appear; in this case the dominant one is clearly "soviet". The words distribution could be represented like this:


Figure 20: Word distribution "Cold War" page.
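As a side note, a record in the format of Listing 11 can be decoded with a few lines of Go. The following is only a minimal sketch: the type names are hypothetical and the data is a trimmed-down excerpt of the listing above.

package main

import (
	"encoding/json"
	"fmt"
)

// PageWords mirrors the per-page record of the top-N words file (Listing 11):
// total number of words, the selected words with their TF-IDF value, and the topic.
type PageWords struct {
	Tot     int                `json:"Tot"`
	Words   map[string]float64 `json:"Words"`
	TopicID uint32             `json:"TopicID"`
}

func main() {
	raw := []byte(`{"Tot": 92792, "Words": {"soviet": 0.0456, "communist": 0.0157, "cold": 0.0154}, "TopicID": 2147483637}`)

	var page PageWords
	if err := json.Unmarshal(raw, &page); err != nil {
		panic(err)
	}

	// The word with the highest TF-IDF value dominates the cloud.
	best, bestVal := "", 0.0
	for w, v := range page.Words {
		if v > bestVal {
			best, bestVal = w, v
		}
	}
	fmt.Printf("most controversial word: %q (tf-idf %.4f)\n", best, bestVal)
}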


5.2 Topic Words Cloud

Using the file F.6, a word cloud is built for each topic. An example could be:

Figure 21: Top 50 words for the topic "Technology and applied sciences."

Based on:

{"include": 73809,"design": 68304,"time": 64639,"oper": 63041,"system": 62784,"develop": 61680,"base": 60235,"gener": 54488,"history": 51326,"well": 49214,"servic": 48574,"year": 48518,"allow": 48206,"work": 48057,"provid": 47996,"origin": 47836,"unit": 47747,"call": 47442,"state": 47339,"number": 46568,"company": 46226,"built": 45130,"high": 44734,"engin": 44562,"three": 44222,"product": 44120,"featur": 42388,"addit": 42042,"open": 41977,"requir": 41920,"type": 40866,"power": 40532,"current": 40279,"support": 39323,"form": 39185,"complete": 39093,"creat": 38800,"control": 38712,"locat": 38564,"produc": 38372,"chang": 37964,"standard": 37448,"start": 37334,"second": 37241,"version": 37124,"larg": 36303,"early": 36042,"exampl": 35511,"area": 35022,"singl": 34968}

Listing 12: Top 50 words for the topic "Technology and applied sciences."

In this cloud the most controversial words for this topic appear. The words distribution could be represented like this:


Figure 22: Word distribution for top 50 words for the topic "Technology and applied sciences".


5.3 Wiki Words Cloud

Using the file F.4, the global Wikipedia word cloud is built. An example could be:

Figure 23: Word cloud of the 50 most popular words in English Wikipedia.

Based on:

{"army": 4082388,"victory": 1715700,"effects": 2463275,"germany": 2066031,"dies": 3395774,"society": 2906176,"universe": 11036215,"property": 1780786,"study": 4765130,"industry": 3229320,"history": 7515945,"defense": 1739903,"days": 2930201,"mary": 1931143,"july": 6258655,"release": 10611562,"energy": 1902029,"committee": 1940316,"goed": 1482858,"degree": 1938587,"maked": 2745303,"acts": 2047435,"movy": 2329648,"represe": 3827545,"decise": 1883253,"rights": 2337223,"extense": 1473604,"commercy": 1850363,"body": 3101547,"presents": 2193304,"professione": 2749855,"ends": 2123210,"company": 7138777,"academy": 1843645,"suggests": 1651599,"memory": 1958650,"founds": 2689593,"parts": 1749895,"military": 3315229,"financie": 1513415,"arms": 1694792,"policy": 2035518,"televise": 4072533,"primary": 1866713,"increase": 3628379,"marry": 2482972,"sets": 1583546,"century": 5748435,"chinese": 1954766,"named": 5909838}

Listing 13: Global Wikipedia word-cloud.

This cloud is particularly interesting because it represents the most conflictual words in the entire English Wikipedia. The words distribution could be represented like this:


Figure 24: Word distribution for global Wikipedia words.


5.4 Bad-Words Cloud

Attention: the following words may upset the most sensitive readers. This part of the document is intended for an adult audience only.

5.4.1 Global Bad-Words Cloud

Using the file F.7, the global cloud of bad words can be built. For the latest English Wikipedia it is:

Figure 25: Global bad-words words cloud for English Wikipedia.

Based on:

1 {’Tot’: 2017551, ’Badwords’: {’organ’: 348666, ’kill’: 223682, ’murder’: 102368, ’sexual’: 64185, ’strip’: 56930, ’erect’: 45400, ’bone’: 43373, ’ blow’: 40205, ’dick’: 37758, ’virgin’: 35187, ’hell’: 34975, ’slave’: 34831, ’nazi’: 34386, ’stroke’: 27928, ’oral’: 25989, ’crack’: 24835, ’ rape’: 23898, ’beer’: 23684, ’escort’: 22597, ’breast’: 19899, ’bloody’: 18586, ’bang’: 17444, ’drunk’: 17155, ’wang’: 16225, ’prostitute’: 15976, ’heroin’: 15646, ’hitler’: 15579, ’negro’: 12710, ’butt’: 12632, ’climax’: 12248, ’thrust’: 12191, ’ugly’: 11651, ’fuck’: 11613, ’damn’: 11444, ’ stupid’: 11102, ’woody’: 10912, ’beaver’: 10902, ’suck’: 10685, ’screw’: 9950, ’reich’: 9923, ’dong’: 9353, ’sexy’: 9296, ’nude’: 9136, ’paddy’: 9030, ’weed’: 8821, ’playboy’: 8415, ’cocain’: 7891, ’babe’: 7706, ’sniper ’: 7449, ’shit’: 7007}}

Listing 14: Global bad-words data for English Wikipedia.

The words distribution could be represented like this:


Figure 26: Global bad-words distribution for English Wikipedia.

5.4.2 Pages Bad-Words Cloud

Using the file F.7, the bad-language word cloud is built for each page that contains bad words. A couple of examples could be:

Figure 27: Bad-words from "Web services protocol stack" page.


This is, as reported in §7.2, the page with the highest number of bad-words, which are:

{"anus":1,"arse":1,"asshole":1,"babes":1,"balls":1,"bigblack":1,"bloody":1,"blowjobs":1,"bondage":1,"boobs":1,"booty":1,"breasts":1,"bukkake":1,"busty":1,"butthole":1,"clitoris":1,"clits":1,"cocks":1,"cocksucker":1,"creampie":1,"cumshots":1,"cunilingus":1,"cunts":1,"dildos":1,"ejaculate":1,"femdom":1,"fucks":1,"gays":1,"groupsex":1,"hardcore":1,"hentai":1,"hooters":1,"horny":1,"intercourse":1,"juggs":1,"kinky":1,"lesbians":1,"lesbos":1,"lezbians":1,"mams":1,"masturbate":1,"naked":1,"niggers":1,"nipple":1,"nudity":1,"orgy":1,"panty":1,"penetrate":1,"pornography":1,"prostitute":1,"pussy":1,"seduce":1,"sexy":1,"shemale":1,"shiteater":1,"shits":1,"sleazy":1,"sluts":1,"sucks":1,"testicle":1,"threesome":1,"tits":1,"titty":1,"tranny":1,"transsexual":1,"twats":1,"whores":1}

Listing 15: "Web services protocol stack" bad-words data.

Figure 28: Bad-words from "Sexuality in ancient Rome" page.

Based on:

{"anus":1,"asshole":1,"balls":1,"bawdy":1,"bondage":1,"breasts":1,"clitoris":1,"clits":1,"cocks":1,"cunnilingus":1,"cunts":1,"dildos":1,"ejaculate":1,"fellate":1,"fondle":1,"fucks":1,"glans":1,"homoerotic":1,"intercourse":1,"lesbians":1,"lesbos":1,"loins":1,"masturbate":1,"menstruate":1,"naked":1,"nipple":1,"nudity":1,"orgy":1,"penetrate":1,"pornography":1,"prostitute":1,"seduce":1,"steamy":1,"stoned":1,"testicle":1,"threesome":1,"titty":1,"ugly":1,"urine":1,"uterus":1,"whorehouse":1,"whores":1}


Listing 16: "Sexuality in ancient Rome" bad-words data.

5.5 Brief Considerations About Words Distribution

Some observations can be made from the reported distributions. First of all, it can be noticed that the distribution is not linear but "skewed" superlinear: very few words appear a great number of times, while all the others appear less and less often. There is also an interesting difference between the distribution of "normal" words and that of bad words: in the bad-words distribution the skew is even more accentuated. There is therefore a sort of "super champions" set of words that catalyses usage during clashes.



6 Integration in Negapedia

6.1 Current Integration

The current integration in Negapedia consists mainly of an extension of the original Overpedia code. The key to the integration is the capability of the developed tool to be "go gettable", as said in §2.6, i.e. to be recognized as an automatically downloadable set of Go packages. However, as said, the tool does not consist only of Golang code, but also of Python code and of a jar executable file. To get these files, a simple solution is to clone the repository inside the Docker container in charge of the calculations. With this premise, the integration stays pretty simple: the only thing to do inside the Overpedia code is to use the API (§2.6.1) made available by the developed tool. Thanks to the exporter functions defined in the API, it is simple to get the data to integrate into the Negapedia web pages, a task that is delegated to Overpedia.

6.1.1 Data Exporters

To easily export the calculated data, an important feature of the Go language is exploited: channels. Channels are the pipes used by Golang to connect different goroutines (lightweight threads of execution): they are typed conduits through which data can be sent and received, and they are particularly well suited to concurrency. With this definition in mind, the calculated data are read, encapsulated in structures and sent through the specific channel. As said in §2.6.1, the API has an exporter function for each type of data:

• Pages with their words data;

• Topics with their words data;

• Pages with their bad-words data;

• analysed Wikipedia words data.

Through these functions, data can be easily integrated inside the Negapedia pages: these streams of data are merged with the Negapedia data coming from the database and then, thanks to a template, injected into the HTML pages. At the end of the process, every page is compressed with gzip and the compressed files are grouped into a tarball. A minimal sketch of the exporter pattern is shown below.
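The following Go sketch illustrates the exporter pattern described above; the structure and function names are illustrative only and do not reflect the actual API.

package main

import "fmt"

// PageBadWords is a hypothetical structure for one item of the bad-words stream.
type PageBadWords struct {
	PageID   uint32
	Abs      int
	BadWords map[string]int
}

// exportBadWords illustrates the exporter pattern: the calculated data are read,
// wrapped in structures and sent through a typed channel that the caller
// (Overpedia, in the real integration) consumes concurrently.
func exportBadWords(report map[uint32]PageBadWords) <-chan PageBadWords {
	out := make(chan PageBadWords)
	go func() {
		defer close(out)
		for _, page := range report {
			out <- page
		}
	}()
	return out
}

func main() {
	report := map[uint32]PageBadWords{
		123456: {PageID: 123456, Abs: 2, BadWords: map[string]int{"damn": 1, "hell": 1}},
	}
	// The consumer simply ranges over the channel and injects the data into its templates.
	for page := range exportBadWords(report) {
		fmt.Printf("page %d: %d bad words\n", page.PageID, page.Abs)
	}
}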


6.2 The State of the Art

Regarding the integration, a system has also been designed which, without resource limitations, could reach the state of the art. This system consists in the use of Kubernetes (also known as K8s), an open-source orchestration and management system for Docker containers, particularly optimized for application scaling and deployment. Kubernetes allows managing Docker containers inside clusters of multiple hosts, making better use of resources; it automatically manages application deployment and updates, mounts and adds storage for running stateful applications, and easily manages application scalability. For example, Kubernetes allows distributing (scaling) a huge computational load over multiple machines which run the designed containers at the same time. This is exactly the idea that makes Kubernetes optimal in our case.

Kubernetes Fundamentals The Kubernetes system is based on a few fundamental concepts:

• Master: the machine which controls the Kubernetes nodes; it is the starting point of all assigned activities;

• Nodes: the machines which perform the activities assigned by the Master;

• Pods: a group of one or more containers deployed on a single node. All the containers share the same IP address, host name and other resources. Pods abstract network and storage away from the underlying containers, allowing containers to be moved easily inside the cluster;

• Replication controller: controls the number of pod replicas in a specific point of the cluster;

• Services: decouple job definitions from the pods. Kubernetes Service proxies automatically send requests to the correct pod, independently of its movements inside the cluster, even if it has been repositioned;

• Kubelet: this service is executed on the nodes; it reads the container manifests and ensures that the containers start;

• Kubectl: the Kubernetes command-line configuration tool.

With these definitions in mind, the Kubernetes architecture can be represented like this:


Figure 29: Kubernetes Architecture.

Credit: https://blog.newrelic.com/engineering/what-is-kubernetes/

The idea which led to considering a system based on Kubernetes is the following. As said in §3, the pages' history is collected in a certain number, say N, of dumps. Each dump therefore contains the full history of a group of pages, and this makes it the smallest item to process in our case. Every dump must go through three main phases:

download → pre-processing → processing. These steps must be repeated N times in a single-node context. But what if N machines (or nodes) are used instead of a single one? After these steps, the project requires a final stage of elaboration which must be executed sequentially; say it requires a time k to complete. In terms of time: let T be the list of the overall processing times of the dumps, so that $T_i$ is the time required to process the i-th dump. In a single-machine context, the global amount of time needed to complete the processing would be:

$$Tot = \sum_{i=0}^{N-1} T_i + k$$

Conversely, having N machines, the amount of time required to complete the processing would simply be:

$$Tot = \max(T) + k$$

that is, the processing time of the biggest dump in the collection, and no longer the sum of all processing times. From this point of view, it is clear that the bigger the number of available nodes, the smaller the time required for processing.
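As a purely illustrative example (the figures are invented): suppose N = 3 dumps with processing times T = {10, 12, 8} hours and a final sequential stage of k = 2 hours. A single machine would need $Tot = 10 + 12 + 8 + 2 = 32$ hours, while three nodes working in parallel would only need $Tot = \max(10, 12, 8) + 2 = 14$ hours.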


With this in mind, a fully scalable system has been designed which can run on N nodes, where N, in the best case, matches the number of dumps to process15, reducing the required processing time to the minimum. The system is structured as follows:

Figure 30: State of art system representation using Kubernetes.

where:

• RESTful API: a specifically designed REST web service whose aim is to configure and start the execution from the specified parameters (e.g. Wikipedia language, etc.);

• Shared Data Volume: a storage volume shared between the nodes;

• Node i: the node which deals with the three main processing stages; it saves its results in the shared data volume;

• Final Node: a "special" node which starts when all the other nodes have finished. Its task is to complete the processing by aggregating the results calculated by the other nodes.

15E.g. about 645 for English Wikipedia, about 70 for Italian Wikipedia, etc.


7 Analysis of the Results

The considerations expressed in this section are based on the last calculation on the English Wikipedia, considering the last 10 reverts of the August 2019 data.

7.1 Amount of Data

The data managed for the calculation in question are:

Name                        Quantity
Number of processed dumps   648
Number of pages             2,810,522
Number of words             1,508,753,550
Number of bad-words         610,032
Size of data archive        24.2 GB
Size of 7z data archive     3.2 GB

Table 17: English Wikipedia, last 10 reverts of the August 2019 dumps: amount of data.

7.2 Considerations About Pages Data

From the calculated data, some simple quantitative considerations can be made, and it is easy to sort certain kinds of data. An example of this statistical approach could be the following. Thanks to the structure of the file F.1, the page with the highest number of words can easily be retrieved, that is:

Page ID           49801965
Title             1918 New Year Honours
Number of words   919,105
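For instance, assuming the global pages file maps each page ID to a record exposing a total-words field analogous to the "Tot" field of Listing 11, such a query could be sketched in Go as follows; the type names and the tiny excerpt are invented.

package main

import (
	"encoding/json"
	"fmt"
)

// pageEntry is a partial, hypothetical view of one record of the global pages
// file: only the total number of words is decoded, everything else is ignored.
type pageEntry struct {
	Tot int `json:"Tot"`
}

// largestPage returns the ID of the page with the highest number of words.
func largestPage(pages map[string]pageEntry) (id string, words int) {
	for pageID, p := range pages {
		if p.Tot > words {
			id, words = pageID, p.Tot
		}
	}
	return id, words
}

func main() {
	// Tiny invented excerpt standing in for the real, much larger file.
	raw := []byte(`{"111":{"Tot":1500},"222":{"Tot":90000},"333":{"Tot":42}}`)
	var pages map[string]pageEntry
	if err := json.Unmarshal(raw, &pages); err != nil {
		panic(err)
	}
	id, words := largestPage(pages)
	fmt.Printf("page %s has the most words: %d\n", id, words)
}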

From F.7, the page with the highest number of bad-words can be retrieved, which is:

Page ID              1302413
Title                Web services protocol stack
Number of bad-words  67

and, in relative terms, the most vulgar page is:


Page ID          18623985
Title            GFY
Vulgarity ratio  1

From F.5, the distribution of words across topics can easily be extracted:

Figure 31: Distribution of words in topics.

As for the words, from F.4 the list of the 10 most popular words can easily be extracted; they are (expressed as "word: number of occurrences"):

1. Origin: 8,053,018;

2. Govern: 7,704,371;

3. Form: 7,424,495;

4. Life: 7,144,896;

5. Public: 7,033,300;

6. Opere: 6,925,410;

7. North: 6,688,197;

8. Design: 6,494,640;


9. Start: 6,373,290;

10. Power: 6,326,000.

This list is noteworthy, because these words represent the most controversial words of the Wikipedia in question. Moreover, they are particularly meaningful because they come from the last 10 reverts of each page, so they refer to recent page history. From F.7, instead, the list of the 10 most popular bad-words can be drawn up (expressed as "word: number of occurrences"):

Attention: the following words may upset the most sensitive readers. This part of the document is intended for an adult audience only.

1. Stoned: 132,058;

2. Balls: 78,725;

3. Naked: 20,951;

4. Penetrate: 20,903;

5. Breasts: 19,940;

6. Bloody: 19,183;

7. Prostitute: 16,003;

8. Lesbian: 13,454;

9. Hardcore: 12,017;

10. Fuck: 11,668.



8 Conclusions

This project started with the purpose of clarifying the world of conflict beneath each page of Wikipedia, and it fully reached that goal. Now, thanks to the developed tool, it is easy to understand what people are fighting about. Through word clouds, clashes can take form and be easily represented. Indeed, until now Negapedia was missing a complementary view to the quantitative one (defined by the conflict and polemic indexes). Beyond that, the project offers the possibility of exposing the bad language used inside page reverts. The bad-words analysis could be a useful tool for studying people's aggressiveness: as shown in §5.4, bad words can appear in the most unexpected pages, confirming the unpredictability of the community.

8.1 Requirements

In terms of requirements, coverage has reached 100%, and has gone beyond it by providing unplanned features, such as the global Wikipedia words data and the bad-words analysis.

8.2 Development

Project development was slowed down by the necessity of changing the main programming language. As described in depth in §2.3, the change was necessary to considerably decrease the calculation time, which would otherwise have largely exceeded one month.

8.3 About the Future

As proposed in §6.2, the project execution times could benefit from the use of a Kubernetes cluster. In this way, the whole processing load would no longer burden a single machine but N machines (where N possibly matches the number of dumps for the required Wikipedia). The project would also benefit from the development of a new Wikitext cleaner, possibly in Golang. With a new implementation, performance could improve and the cleaning process would be smarter and more precise. Having a better Wikimedia markup cleaner would mean better text to analyse and, consequently, better results.


This journey inside Wikipedia and its data was particularly instructive: it is possible to experience first-hand that what we usually treat as "safe information" or "correct information" is actually something shaped by people, and therefore by their interests and ambitions. Luckily, what is published on Wikipedia is checked by a lot of people, and most of the time this is enough to fight disinformation.


A Available Languages

The developed tool can handle 45 languages. Here follows the complete list of the languages handled by the main part of the project and by the bad-language report.

A.1 Project Handled Languages

• English;
• Arabic;
• Danish;
• Dutch;
• Finnish;
• French;
• German;
• Greek;
• Hungarian;
• Indonesian;
• Italian;
• Kazakh;
• Nepali;
• Norwegian;
• Portuguese;
• Romanian;
• Russian;
• Spanish;
• Swedish;
• Turkish;
• Armenian;
• Azerbaijani;
• Basque;
• Bengali;
• Bulgarian;
• Catalan;
• Chinese;
• Croatian;
• Czech;
• Galician;
• Hebrew;
• Hindi;
• Irish;
• Japanese;
• Korean;
• Latvian;
• Lithuanian;
• Marathi;
• Persian;
• Polish;
• Slovak;
• Thai;
• Ukrainian;
• Urdu;
• Simple English.


A.2 Bad Language: Handled Languages

The bad-language analysis handles 22 languages: if the language used in the main part of the project is not among them, this analysis is simply skipped.

• English;
• Arabic;
• Danish;
• Dutch;
• Finnish;
• French;
• German;
• Hungarian;
• Italian;
• Norwegian;
• Portuguese;
• Spanish;
• Swedish;
• Chinese;
• Czech;
• Hindi;
• Japanese;
• Korean;
• Persian;
• Polish;
• Thai;
• Simple English.

A.3 Add Support for a New Language

Adding support for a new Wikipedia language is simple: the only thing to do is to add the data for the new language. The steps are the following:

• Adding a stopwords language (project core):

1. The list of stopwords must be formed in the following way:

stopwords1
stopwords2
...
stopwordsN

and the file must be named after the language (e.g. "english"), without extension;
2. fork the Negapedia NLTK repository from https://github.com/negapedia/nltk;
3. push the new file into the forked repository, inside the /stopwords/corpora/stopwords folder;


4. fork the project repository from https://github.com/negapedia/wikitfidf;
5. add the language to the function CheckAvailableLanguage in the forked repository, inside /wikitfidf.go;
6. propose a pull request to the Negapedia team for both repositories, specifying the changes you made.

• Adding a badwords language:

1. The list of bad words must be formed in the following way:

badwords1
badwords2
...
badwordsN

and the file must be named after the language (e.g. "english"), without extension;
2. fork the Negapedia Badwords repository from https://github.com/negapedia/badwords;
3. add the new file to the forked repository, inside the /badwords folder;
4. fork the project repository from https://github.com/negapedia/wikitfidf;
5. add the language to the function AvailableLanguage in the forked repository, inside /internal/badwords/badwords.go;
6. propose a pull request to the Negapedia team for both repositories, specifying the changes you made.



Glossary

API An Application Programming Interface is a set of subroutines, communication protocols and tools for building software. 17, 56

Database Dump A database dump contains the structure of a database and, optionally, all the data in it. This type of file is generated for backing up data, so that everything can be restored in case of data or structure corruption. 21

Dockerfile A Dockerfile is a text document that contains all the commands used to build an image, which will run in a container. 7, 18

Git Git is open-source version control software created by Linus Torvalds in 2005. 5

GitHub GitHub is a hosting service for software projects under git versioning. 5

gzip GNU zip is open-source data compression software, born in 1992. 36

IDE An IDE (Integrated Development Environment) is a software application which helps programmers write code. It usually also includes debugging functionality and other services. 19

jar Jar (Java ARchive) is a package file used to aggregate Java classes, dependencies, metadata and resources into a single file for distribution. 16

JSON JavaScript Object Notation is a lightweight data format, easy to generate and parse. It is based on the concept of key:value pairs and lists of elements. 16, 26

Maven Apache Maven is Java project management and build automation software. 16

pip Pip is the package installer for Python: it allows modules from the Python Package Index software repository to be installed easily. 6

REST Representational State Transfer (REST) is a software architectural style which defines a set of constraints to be used for creating web services. Web services which conform to the REST architectural style are called RESTful web services. 56


SHA1 Secure Hash Algorithm 1 is a cryptographic hash function for message digests, developed by the American NSA (National Security Agency) since 1993. The algorithm works on 512-bit blocks, which are processed to form a 160-bit message digest. 23

TF-IDF Term Frequency-Inverse Document Frequency is an information retrieval function for measuring the importance of a term inside a document or a collection of documents. 1

Wikitext Wikitext is the markup language used by the MediaWiki software to format a page. Reference guide at: https://en.wikipedia.org/wiki/Help:Wikitext. 5, 6, 16, 28


September 26, 2019

To mum and dad for always being there and for making this journey possible, To Aurora for being always by my side, To my grandma for always believing in me, To my grandpa who dreamed of this day,

To Monica, my best friend, To my friends Giova, Seba, Ale, Dina, Ago, Samu and Pego because we shared the laughter and the lessons, To Mirko and my colleagues for the projects,

To the people I met along my way during these three years,

This is the end of an incredible journey, and the person here today has changed a lot compared to three years ago. This is the end of a journey and the beginning of a new adventure.
