Implementation and Evaluation of a Framework to calculate Impact Measures for Authors

Bachelor Thesis

by

Sebastian Neef [email protected]

Submitted to the Faculty IV, Electrical Engineering and Computer Science, Database Systems and Information Management Group, in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science at the Technische Universität Berlin, June 29, 2017

Thesis Advisors: Moritz Schubotz

Thesis Supervisors: Prof. Dr. Volker Markl, Prof. Dr. Odej Kao

Statutory Declaration

I hereby declare that I have produced this thesis independently and by my own hand, without unauthorized outside help and using only the listed sources and aids.

The independent and autonomous preparation of this thesis is affirmed in lieu of an oath:

Berlin, June 29, 2017

Sebastian Neef

Zusammenfassung

Wikipedia is an open, collaborative website which can be edited by anyone, even anonymously, and can therefore become the victim of unwanted changes. Based on the continuous versioning of all changes and with rankings based on the calculation of impact measures, unwanted changes can be detected and reputable users can be identified [4]. However, processing many millions of revisions on a single system takes a lot of time. The author implements an open source framework to calculate such rankings on distributed systems in the MapReduce style and evaluates its performance on datasets of various sizes. The reimplementation of the contribution measures by Adler et al. demonstrates the extensibility and usability, as well as the problems of handling huge datasets and possible approaches to solving them. The results discuss the different optimizations and show that horizontal scaling can reduce the total processing time.

Abstract

Wikipedia, an open collaborative website, can be edited by anyone, even anonymously, thus becoming a victim of ill-intentioned changes. Therefore, ranking Wikipedia authors by calculating impact measures based on the edit history can help to identify reputable users or harmful activity such as vandalism [4]. However, processing millions of edits on one system can take a long time. The author implements an open source framework to calculate such rankings in a distributed way (MapReduce) and evaluates its performance on various sized datasets. A reimplementation of the contribution measures by Adler et al. demonstrates its extensibility and usability, as well as problems of handling huge datasets and their possible resolutions. The results put different performance optimizations into perspective and show that horizontal scaling can decrease the total processing time.

Contents

1 Introduction
  1.1 Background
  1.2 Related work
  1.3 Problem statement
  1.4 Keywords and abbreviations
  1.5 Setup

2 Implementation
  2.1 Investigating compressed datasets
  2.2 Plan of implementation
  2.3 Functionality and extensibility
  2.4 Problems and solutions

3 Evaluation
  3.1 WikiTrust vs. framework
  3.2 Performance comparison
  3.3 Framework optimizations
  3.4 Pageviews parsing

4 Discussion
  4.1 Performance and ranking results
  4.2 Framework use cases
  4.3 Future work

5 Conclusion

6 References

7 Appendix
  7.1 Source code on CD and GitHub
  7.2 Java, Flink, WikiTrust and framework installation
  7.3 WikiTrust and framework bash-loop
  7.4 WikiTrust and framework author reputations
  7.5 List of figures, tables and listings


1 Introduction

In this first section, the thesis' topic is introduced, starting with the background and related work and continuing with the problem statement and goal. Important keywords, formulas and abbreviations are defined, followed by general information about the used software and hardware setup.

1.1 Background

People publish articles or contribute to open source projects without an immediate reward, but hope for indirect rewards like extending their skill set, own marketability or peer recognition [37, p. 253]. This may be one of the reasons for using platforms like ResearchGate1. It uses an unknown algorithm [30] to calculate an individual's RG Score2 and position her in the community, leading to said indirect rewards. Such impact measures have been used to accept or reject applicants [15, p. 391]. Another well known measure of the quality of a scientist's work is the h-index (see 1.4.1) [15, p. 392]. Unlike ResearchGate, the free encyclopedia Wikipedia3 does not display author performance measures besides edit counts, thus making it harder for authors to turn their contributions into indirect rewards. In fact, Wikimedia relied on the edit count and votes to nominate new moderators [39]. Another point which makes researching this topic interesting is the high user and edit counts. In 2008, Wikipedia had more than 300,000 authors with at least ten edits, and the numbers have been growing by 5 - 10 percent per month since then [37, p. 243 - 244]. Even at the time of writing in March 2017, Wikipedia has more than 10,000 active authors with more than 100 edits in that month [44].

1https://www.researchgate.net/
2https://www.researchgate.net/publicprofile.RGScoreFAQ.html
3https://wikipedia.org


1.2 Related work

Previous research focused only on the edit count [46, 29, 40] or the length of changed text [38, 3]. Schwartz (2006) discovered high discrepancies between some users' edit and text count. He noticed that top contributors by edit count are not necessarily on the top by text count [38]. Adler et al. conclude that the quantitative measures edit or text count can be manipulated easily. Therefore, more weight should be put on the content's quality with qualitative measures. One measure they introduced was the longevity of a change [4, p. 1f]. Their concepts have been used in more recent publications like the WikiTrust program [22, p. 38] or as a consideration for another rating system [36, p. 75]. A service that makes use of Wikipedia's history trying to detect vandalism by classifying an edit's quality is the "Objective Revision Evaluation Service (ORES)" by MediaWiki [32]. Its approach focuses on machine learning, but manual classification work by the community is needed to get accurate results. Furthermore, it appears that only a limited number of Wikipedia sites have this service enabled [32].

1.3 Problem statement

The work by Adler et al. is a promising starting point on the impact measure topic, due to their proposal of several formulas that calculate an author ranking based on the impact of an author's edits. Adler et al. call those formulas "contribution measures" [4]. The introduced terminology will be adopted to reduce confusion and credit their work. Unfortunately, the authors do not discuss how to efficiently analyze the around 2.6 billion4 edits. This is the bachelor thesis' starting point. It will introduce and discuss an open source framework, which prepares Wikipedia's edit history and page views and facilitates the development of distributed impact measures and rankings for Wikipedia authors. Such a framework could lead to better reproducibility of

4https://tools.wmflabs.org/wmcounter

author rankings. The thesis will try to provide reasonable information to keep all tests and results reproducible. Therefore, the following research objective was defined: Implement and evaluate a framework for distributed calculation of impact measures for Wikipedia authors.

The following tasks were set to achieve this objective:

• Task 1: Analyze the input datasets with regard to their layout and format.
• Task 2: Design and implement the distributed framework.
• Task 3: Implement Adler et al.'s contribution measures as described in [4].
• Task 4: Evaluate the framework's processing speed by comparing it to Adler et al.'s WikiTrust program [18].

The first problem to address is the parsing of Wikipedia XML dumps (see 1.4.2; [21]). For example, the English Wikipedia is more than 0.5 TB compressed and extracts to multiple (>= 10) TB of XML data. The student cluster (see 1.5.1) does not have enough disk space to store such amounts of data, thus processing the data in its compressed format will be explored in section "Investigating compressed datasets". This amount of data has to be transformed from its XML state into a usable representation for further processing, for example objects with the attributes as member variables and references between correlated objects, so that distributed algorithms can operate on them. That also covers the calculation of differences between two revisions of a page. Different types of processing techniques will be tested on multiple datasets and evaluated based on the execution time. Wikipedia's page views (see 1.4.3; [34, 27]) are an additional source of information which was not used by Adler et al., but might be an important factor to rate authors. This data needs to be parsed as well and incorporated into the framework's representation of a Wikipedia page. The key functionality of the framework is to hide the initial processing of the Wikipedia XML and page views data from the user, who uses the framework to calculate author rankings for Wikipedia sites. It should also hide the parallel data processing mechanisms as far as possible. Another important point is extensibility, which will make the framework customizable in almost every aspect.


1.4 Keywords and abbreviations

This section explains and defines important abbreviations and keywords which will be used throughout the thesis.

1.4.1 H-index

The h-index: An author $A$ has a non-empty set $P_A = \{p_1, \dots, p_n\}$ of $n \in \mathbb{N}$ publications. Let $c_A : P_A \to \mathbb{N}$ be a function which returns a publication's citation count. Let $P_A^s$ be the tuple of publications, sorted by decreasing citation count.

Formally, $P_A^s = (s_1, \dots, s_n)$ where $\{s_1, \dots, s_n\} = P_A$ and $c_A(s_i) \geq c_A(s_{i+1})$ for all $i < n$. The h-index is the biggest $i$ for which $c_A(s_i) \geq i$ holds. A higher value depicts a better performance [25, p. 16569]. Example: An author with the publications $c(A)=8$, $c(B)=10$, $c(C)=3$, $c(D)=5$, $c(E)=4$ has $P_A = \{A, B, C, D, E\}$ and $P_A^s = (B, A, D, E, C)$. The resulting h-index is 4.
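To make the definition concrete, here is a minimal Java sketch (not part of the thesis' framework) that computes the h-index from an array of citation counts; the class and method names are chosen for illustration only.

import java.util.Arrays;
import java.util.Collections;

public class HIndex {

    // Computes the h-index: the largest i such that the i-th most
    // cited publication has at least i citations.
    static int hIndex(Integer[] citations) {
        Integer[] sorted = citations.clone();
        // Sort citation counts in decreasing order.
        Arrays.sort(sorted, Collections.reverseOrder());
        int h = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i] >= i + 1) {
                h = i + 1; // position i+1 still has enough citations
            } else {
                break;
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // Example from the text: citations 8, 10, 5, 3, 4 yield an h-index of 4.
        System.out.println(hIndex(new Integer[]{8, 10, 5, 3, 4}));
    }
}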

1.4.2 Wikipedia datasets

Wikipedia5 is a free encyclopedia. Every language, specific topic or project has its own Wikipedia site. A wiki: Wikipedia sites will be abbreviated to wiki, optionally with their domain prefix, e.g. en-wiki for the English6 or de-wiki for the German7 version of the Wikipedia. Wikipedia creates backups of all their pages, the so called data dumps8. Depending on the size of a Wikipedia site, backups are created up to three or more times a month [21]. Such backups contain the full edit history, consisting of revisions which are created when an edit to a page is made. The latest dumps can be downloaded from [43]. The dumps: Throughout this thesis, those edit history backups will be called dumps. A dump's filename, e.g. 'aawiki-20160111-pages-meta-history.xml',

5https://www.wikipedia.org
6https://en.wikipedia.org
7https://de.wikipedia.org
8https://meta.wikimedia.org/wiki/Data_dumps

holds multiple pieces of information. It always follows the structure <WIKI>wiki-<DATE>-pages-meta-history.xml<.FORMAT>, where <WIKI> is the wiki's short tag, e.g. aa, <DATE> is the backup's creation date, e.g. 20160111 in the format YYYYMMDD, and <.FORMAT> is either .7z or .bz2 depending on the used compression algorithm. When referring to <WIKI>wiki-<DATE>-pages-meta-history.xml<.FORMAT>, the suffix -pages-meta-history.xml<.FORMAT> may be omitted. Furthermore, we define a wiki's name to be <WIKI>wiki, allowing to reference a dump by its wiki name when the creation date and the format are given by the context. All dumps follow a specific XML structure9. Figure 1 gives a brief and very simplified overview of it. It begins with a <mediawiki> tag and its attributes giving information about it.

Figure 1: Simplified sketch of a dump's XML structure

A wiki consists of multiple pages with Wikipedia content or meta information. All pages are encapsulated within the <page> tag, belong to a specific namespace and have a title and a list of revisions. Each <revision> tag represents an edit to its associated page and contains the new content in its text attribute. A revision is done by exactly one author, which is further specified in the contributor attribute. In the context of this thesis, the terms author and contributor can be used interchangeably. That is because [4] use the term "author" in their papers, but the dumps and code use the term "contributor". To prevent confusion, we try to prefer the former term. A contribution can either be done anonymously or as an authenticated user. In the former case, only the IP address is saved; otherwise the username and the user's id.

9https://www.mediawiki.org/wiki/Help:Export#Export_format


At the time of writing, the full XML specification for wiki dumps describing more elements and attributes is available at [41]. The thesis will focus on the page, revision and contributor tags with their most relevant attributes, but extending the parsers to account for more information will be possible and is covered in the sections "Data processing flow" and "Code structure".

1.4.3 Pageview datasets

The Mediawiki Analytics Team10 creates hourly statistics of a Wikipedia page's traffic. This includes the amount of non-unique visits (COUNT) as well as the request's size (SIZE) in bytes. Those statistics follow a line-wise structure of <WIKI><PROJECT> <TITLE> <COUNT> <SIZE> for every page that was requested. <PROJECT> defines the project name. It is either empty for Wikipedia projects like the common wikis or one of the following abbreviations for other projects [35]:

• wikibooks: ".b"
• wiktionary: ".d"
• wikimedia: ".m"
• wikipedia mobile: ".mw"
• wikinews: ".n"
• wikiquote: ".q"
• wikisource: ".s"
• wikiversity: ".v"
• mediawiki: ".w"

<TITLE> is a visited page's title, which is not normalized, meaning that special characters are url-encoded11. After decoding, it should match at least one title in the dumps if traffic was observed. A pageview/pagecount: A page view is either one line in a pageviews file or the number of page views for a specific page. The latter is also called pagecount. All pageviews within an hour are gathered in a gzip-compressed12 file.

10https://www.mediawiki.org/wiki/Analytics
11https://www.w3schools.com/tags/ref_urlencode.asp
12http://www.gzip.org/


The pageviews: Throughout this thesis these pageview datasets or files will be referred to as pageviews. The files from 2007 until December of 2015 follow the naming convention pagecounts-<YEAR><MONTH><DAY>-<HOUR>0000.gz, where <YEAR> is the full year, and <MONTH> and <DAY> are month and day with a leading zero respectively. The <HOUR> placeholder uses the 24-hour format with leading zeros. Since December 2015 the file names have changed to pageviews-<YEAR><MONTH><DAY>-<HOUR>0000.gz. The prefix pagecounts- or pageviews- as well as the suffix 0000.gz might be omitted if enough context is given. In 2015 a new method for collecting pageview statistics was developed by the Wikimedia team, which provides statistics that better reflect human traffic, because "spiders" are detected and the data is not sampled anymore [7]. Detecting and filtering automated or robotic traffic can lead to more accurate statistics and therefore serve as better weights for the impact calculation of a wiki edit, because the aspect of visibility and outreach of an edit can be incorporated. The "legacy Pageviews" [7] can be downloaded from [34], whereas the newer pageviews data is available at [27].

1.5 Setup

To keep all tests and results reproducible, the used systems and software versions will be listed below. If not stated otherwise or clear from the context, the student cluster was used for the tests.

1.5.1 Student cluster

The DIMA group at the TU Berlin gives certain students access to a shared ten node cluster. Time slots are requested and assigned using a dedicated Slack13 channel. Each of those ten nodes, except the first node, is equipped with the following hardware and software:

13https://dimatub.slack.com/


• CPU: 1x IBM POWER7 8231-E2B pSeries CPU with 47 cores @ 3720 MHz, revision 2.1 (pvr 003f 0201), as reported by cat /proc/cpuinfo
• RAM: 65288960 kB (~62GB) of memory, as reported by cat /proc/meminfo
• Swap: 71685888 kB (~68GB) of swap with priority -1 and a swappiness of 60, as obtained by free -k or cat /proc/sys/vm/swappiness
• Network: 1x active 1000baseT/Full Ethernet port, as reported by ethtool
• Disks:
  – 4x Toshiba MBF2600RC (600GB)
  – 2x Fujitsu MBC2073RC (73GB)
• OS: Fedora 24

The nodes 2 - 10 are the slaves on which the parallel computing takes place. The exception is the first node (ibm-power-1), which is the main node where Apache Flink's job manager runs. It has less RAM (50478720 kB, ~48GB), but more swap (143372224 kB, ~136GB), and one of the Fujitsu drives is a Toshiba MBE2147RC (147GB). Out of all ten nodes, eight were used during the thesis. Apache Flink's dashboard reported eight task managers with 464 available task slots. If not otherwise specified, 400 task slots were used. For the evaluation in section 3, the nodes 2 and 6 were removed due to showing faulty behavior too often. This decision reduced the number of task managers to 6 and the available task slots from 464 to 384, of which 380 were used. The final configuration had the following important settings configured:

• jobmanager.heap.mb = 1024
• taskmanager.heap.mb = 49152
• taskmanager.memory.fraction = 0.7
• taskmanager.network.numberOfBuffers = 262144
• taskmanager.network.bufferSizeInBytes = 262144


1.5.2 DigitalOcean Virtual Machine

DigitalOcean14 is a cloud-based server hosting provider. Due to WikiTrust's incompatibility with the student cluster's PowerPC architecture, a virtual machine was rented at this provider. The downside of virtualized hardware is that the performance might vary due to the shared underlying host system. Nevertheless, we decided to utilize cloud services, because other suitable dedicated hardware was not available. Furthermore, only one instance with the following specification was rented:

• CPU: 12 cores on an Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz, as reported by cat /proc/cpuinfo
• RAM: 33016556 kB (~31GB) of memory, as reported by cat /proc/meminfo
• Swap: None
• Network: 1x shared 1Gbit/s connection, as explained by a moderator15
• Disk: 21.5GB virtual SSD-based disk16
• OS: Debian 8
• Datacenter: FRA1, Frankfurt, Germany

1.5.3 Apache Hadoop HDFS

The scalable and fault tolerant Apache Hadoop HDFS17 (2.7.1) distributed file system was used. One NameNode stores the file system's metadata and the remaining DataNodes store the actual data [14]. This has the advantage of storing big datasets, e.g. the English wiki, on multiple nodes and allowing different nodes to read different parts of a file simultaneously, thus preventing a single hard drive from becoming a bottleneck. The HDFS data is stored on the nodes 2, 3, 4, 5, 7, 8, 9.

14https://digitalocean.com/
15https://www.digitalocean.com/community/questions/upload-and-download-speed-of-a-droplet
16https://www.digitalocean.com/products/storage/
17https://hadoop.apache.org/docs/r2.7.1/


1.5.4 Apache Flink

Apache Flink18 with Hadoop 2.7.0 support was used. Its Java API allows writing distributed MapReduce programs which can be executed on the cluster, e.g. by providing DataSet classes, which are similar to list data structures suitable for distributed processing. Common MapReduce operations like map, filter, reduce and others can be applied to such DataSets. Additionally, Flink handles the orchestration and coordination of the nodes when a distributed task needs to be run. The framework's build uses Flink 1.1.3 as a dependency, but the server had Flink 1.0.3 deployed as the runtime environment. This discrepancy in the auto-generated pom.xml was unfortunately overlooked until the end, but re-running several tests did not show any significant differences or bug resolutions.
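As a minimal illustration of this programming model (not taken from the framework's code base), the following self-contained Flink batch job builds a small DataSet, applies a filter and a map, and prints the result; all class and variable names are illustrative.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class DataSetSketch {
    public static void main(String[] args) throws Exception {
        // Entry point for a Flink batch program.
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // A small in-memory DataSet; in the framework the elements
        // would come from the parsed dumps instead.
        DataSet<String> lines = env.fromElements("foo", "bar", "foobar");

        // Keep only elements starting with "foo", then map them to their length.
        DataSet<Integer> lengths = lines
                .filter(new FilterFunction<String>() {
                    @Override
                    public boolean filter(String value) {
                        return value.startsWith("foo");
                    }
                })
                .map(new MapFunction<String, Integer>() {
                    @Override
                    public Integer map(String value) {
                        return value.length();
                    }
                });

        // print() triggers the distributed execution and outputs the result.
        lengths.print();
    }
}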

1.5.5 Apache Maven

All code was compiled and executed using either OpenJDK Java19 1.8 or Oracle Java20 1.8 and Apache Maven21 3.3.0. The framework's dependencies are defined in the pom.xml file and fetched automatically by Apache Maven before the build process. It produces a runnable jar file after testing the project with the implemented JUnit tests.

Key points of section "Introduction"

• Calculating author rankings based on the edit history is an interesting topic that might benefit Wikipedia with regard to various problems.

• dump, pageview and other related keywords and abbreviations facilitate the thesis’ understanding.

• For better reproducibility, the software versions and hardware specifications are described.

18https://flink.apache.org/
19http://openjdk.java.net/
20http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.
21https://maven.apache.org/


2 Implementation

Before starting with the implementation of the framework, the actual basis for it - the datasets - has to be investigated. The explanations in section 1.4 give a first overview of the dump and pageview datasets, but both are compressed due to their sheer size. Two questions immediately arise:

1. What is the compression factor? Is it feasible to decompress the data first?
2. What performance impact would decompression-on-the-fly have?

Investigating the initial datasets' format and structure will help to understand how to operate on those datasets and how to maximize the performance when doing so.

2.1 Investigating compressed datasets

This section will answer both questions from above. The dumps are available in two different compression formats, namely bzip222 and 7z23. To get an impression of the compression factor of both algorithms, a selection of different sized dumps was downloaded from [43]. The bzip2 files were decompressed as a preparation for the performance tests and to reveal their decompressed size.

2.1.1 Size aspect

Figure 2 shows the 14 dumps and their 7z, bzip2 and decompressed sizes. The first thing that catches one's eye is the steep growth of the decompressed data, whereas both compression algorithms significantly lower the slope.

22http://www.bzip.org/
23http://www.7-zip.org/


Figure 2: The chart shows sizes of compressed (7z, bzip2) and uncompressed dumps.

The resulting compression rates are plotted in figure 3. The formula is:

$F_c = \frac{S_d}{S_c}$

where:

• $F_c$ is the resulting compression factor
• $S_d$ is the decompressed size in MB
• $S_c$ is the compressed size in MB

There are two significant outliers: The first dump on the far left of the x-axis with a factor of 3, but that can be explained by the relatively small amount of compressible data. The second outlier is the factor 90.30 for the huwiki dump in this test. This might be explainable by the fact that all metadata, like the XML structure or the contributor's name, as well as the whole page's content are saved for every single revision. This means that when only minor changes are introduced in consecutive revisions, the majority of their content will stay the same.


Figure 3: The dumps' bzip2 and 7z compression factors (Fc) and an average per compression type.

Recurring data is usually the optimal input for compression algorithms and 7z compresses it well. With a mean factor of around 45, 7z compresses the data more than twice as much as the bzip2 codec with a mean value of 20. Nevertheless, a compression rate of at least 20 can save a lot of hard disk space. This helps especially when working with huge dumps like the English Wikipedia, which is around 680 GB bzip2 compressed, so a decompressed size of about twelve terabytes could be expected. To answer the first question, we can conclude that good compression factors of 20 with bzip2 or 45 with 7z can be achieved. This, however, leads to the fact that decompressing the dumps prior to using them is not really feasible. It might work for smaller dumps, but definitely is a problem for comprehensive wikis when not enough disk space is available. Those compression achievements only hold for the dumps, not for the pageviews, which are gzip24 compressed. Table 1 shows the smallest, an average and the biggest pageviews dataset from May 201625.

24http://www.gzip.org/


Table 1: Sizes of (de-)compressed gzip pageviews and the resulting compression ratio.

Pageviews      compressed (MB)   decompressed (MB)   factor
20160504-05    65                273.38              4.21
20160519-06    87                386.65              4.44
20160531-19    110               460.69              4.19
Average        87.33             373.57              4.28

With an average factor of around 4, the compression is not nearly as good as bzip2 or 7z, but it still helps in the context of using months or years of pageview data.

2.1.2 Time aspect

This conclusion directly leads to the second question: What performance impact or trade-off can be expected when using compressed datasets over the uncompressed ones? The downside of using compressed data is that not all codecs support splitting the dataset to parallelize the work. Gzip, for example, uses a continuous sliding window to find recurring data to produce an output stream and is therefore not splittable [6]. However, 7z and bzip2 decompression can be parallelized. Bzip2 compresses files in blocks which are delimited by a specific 48-bit pattern and can thus be handled independently [17]. Parallelized systems can therefore read from the same bzip2 file simultaneously at different block offsets. 7z supports multiple compression methods, but uses the multi-threadable LZMA by default [1]. Both gzip and bzip2 are natively supported by Apache Hadoop with the BZip2Codec and GZipCodec [19], but the GZipCodec will only run with a parallelism of one for the reason explained before. Since 2014 Apache Flink has been compatible with Apache Hadoop's MapReduce interfaces [13], allowing the re-use of the already implemented CompressionCodecs and InputFormats. To answer the question, a small Apache Flink application was developed, whose job is to read a (de-)compressed file and count the number of lines in it. Apache Hadoop's TextInputFormat reads the data and transparently handles the decompression of compressed files.

25https://dumps.wikimedia.org/other/pagecounts-raw/2016/2016-05/
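A line-counting job of this kind can be sketched as follows. This simplified stand-in uses Flink's built-in readTextFile instead of the Hadoop TextInputFormat mentioned above, and the input path is only a placeholder; it illustrates the shape of the test application, not its exact implementation.

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LineCountSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Path to a dump, e.g. on HDFS, passed on the command line.
        DataSet<String> lines = env.readTextFile(args[0]);

        // count() triggers the execution and returns the number of lines read.
        long lineCount = lines.count();
        System.out.println("Lines: " + lineCount);
    }
}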


Figure 4: The chart shows that the bzip2 job performance is higher for compressed datasets.

For all fourteen test dumps the application was run on the decompressed XML and the bzip2 version with 100, 200, 300 and 400 slots each. Figure 4 shows the processing times in ms using 400 slots. The times for the other runs follow a similar pattern. Interestingly, the dumps up until a size of around 300 MB compressed and 12 GB decompressed are processed more or less in the same time. After that, the decompressed dumps take longer to process. Using compressed dumps does not only save disk space, but also increases the performance. A possible explanation for that might be a storage bottleneck, e.g. HDFS or the network. Figure 5 shows the throughput calculated with the following formula:

$T_d = \frac{s_d \cdot 1000}{t_d} \cdot c_d$

where:

• $T_d$ is the resulting throughput in MB/s
• $s_d$ is the dump's size in MB
• $t_d$ is the dump's execution time in ms
• $c_d$ is the dump's compression factor. It is 1 for decompressed dumps.

One can see that the extracted dumps' throughput does not cross 900 MB/s. Only the compressed dumps exceed this limit, because, as previously discovered, only around one twentieth of the data needs to be transferred and the decompression only increases the CPU load, but does not slow down the processing due to network or storage limitations.

Figure 5: A higher throughput is possible with bzip2 compressed datasets by processing more data in less time.

Another probable cause of slowness might be that the pageviews data is packaged into hourly compressed gzip files. This leads to a dataset consisting of multiple files when a longer time frame needs to be examined. Because of the rather small sizes of those files, the compression algorithm does not improve the processing speed. In fact, it slows it down for a single file, which can be seen in figure 6. However, the issue can be tackled by concatenating all pageviews files into one huge file. This was concluded by developing and using a preprocessing tool (FileMerger), which reads all files from a folder, decompresses them and then writes one combined, optionally compressed, output dataset. The tool merged 696 separate pageview files from February 2016, accumulating 59.5 GB in total, into single output datasets of the following formats:


Figure 6: Single pageview files are processed faster when they are not (gzip) compressed.

• Plaintext
• Gzip
• Bzip2

Figure 7 shows the resulting compression ratios. A ratio greater than one means that the size was further reduced. A slightly better ratio can be achieved for bzip2, but it does not change for gzip and even drops below 1 for plaintext outputs, because the output is not compressed anymore. Surprisingly, the performance test revealed that using the combined gzip output dataset is much faster than plaintext, bzip2 or the separated gzip files. Figure 8 shows that a speedup of over 450% can be reached and more than 240 seconds of computation time can be saved. This preprocessing step is worth the additional time and effort, because the data only needs to be preprocessed once, but can then be read multiple times with higher performance. This methodology is desired when the dataset does not change over time, but the calculations with the framework on it are repeated often. It takes the following amount of time to fully process the pageview dataset:


Figure 7: Merging the separate pageview files into a single dataset improves the compression ratio for bzip2, but does not change it for gzip.

• Plaintext: 876s
• Gzip: 652s
• Bzip2: 802s

With these preprocessing times and the gained speedup of up to 240 seconds per run, using the combined gzip dataset will pay off when used more than three times. Given the fact that the pageviews datasets contain information about the traffic to all wikis, it is more likely to be static when a longer time frame in the past, e.g. last year, is chosen to accompany author rank calculations for different wikis in that time frame. After investigating the storage and time aspects of compressed versus uncompressed datasets, the following conclusion can be drawn: Using compressed datasets leads to significant gains by having to store less data and also improves the overall computational performance by having to read and transfer less data from the HDFS. Further speedups can be achieved by preprocessing the pageviews datasets.
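A preprocessing step in the spirit of the FileMerger can be sketched with plain Java I/O as follows. This is a minimal, single-threaded sketch that assumes all input files are gzip-compressed; the class name and command-line handling are illustrative and not the tool's actual implementation.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FileMergerSketch {
    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get(args[0]);   // folder with the hourly .gz pageview files
        Path outputFile = Paths.get(args[1]); // single combined .gz output dataset

        // Collect and sort the input files so the hourly order is preserved.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir, "*.gz")) {
            for (Path p : stream) {
                files.add(p);
            }
        }
        Collections.sort(files);

        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(outputFile))) {
            byte[] buffer = new byte[8192];
            for (Path file : files) {
                // Decompress each hourly file and append its content to the combined output.
                try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                }
            }
        }
    }
}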


Figure 8: The bar chart shows the time needed to process the single merged and compressed dataset.

Key points of section "Investigating compressed datasets"

• Compression factors: bzip2 ~20 and 7z ~40 for dumps; gzip ~4 for pageviews.

• Shorter processing times and higher performance are achievable with bzip2 for dumps.

• Preprocessing the pageviews dataset improves its performance.

2.2 Plan of implementation

Now that we are familiar with the format and layout of the input datasets, we can focus on the implementation and how to work with those datasets. Before implementing the framework for author impact measure calculations, it has to be

designed in a way that allows for flexibility and extensibility of its functionality, e.g. the parsing and the representation of the input data, the data processing steps and finally the result's output format. The framework's core dependency is the Apache Flink framework for distributed computation, which was introduced in section 1.5.4. Its API is available for the programming languages Java and Scala [11]. Apache Maven is a dependency manager for Java, so using Java as the framework's language seemed reasonable. Furthermore, Java builds are, unlike C or C++, not tied to a specific CPU architecture or platform, because "[w]hen compiled, the Java code gets converted to a standard, platform-independent set of bytecodes, which are executed by a Java Virtual Machine (JVM). A JVM is a separate program that is optimized for the specific platform on which you run your Java code." [33]

2.2.1 Data representation and transformation

The very first step is to model the input data into Java objects with all the necessary attributes and relations. In sections 1.4.2 and 1.4.3 the rough layout was discussed. The reference implementation of the framework focuses only on the necessary data used to reproduce and implement Adler et al.'s contribution measures, but should be easily modifiable and extensible as explained in section 2.2.3 and shown in section 2.3. In general there are three important pieces of information which can be directly translated from the XML structure of the dumps into equivalent Java classes:

• Page
• Revision
• Contributor

These and the following further classes responsible for the framework's data representation are part of the it.neef.tub.dima.ba.imwa.impl.data package:

• DataHolder
• DoubleDifference


• PageRelevanceScore, RevisionRelevanceScore, ContributorRelevanceScore, & DifferenceRelevanceScore

Figure 9 gives a brief overview of the data structures, which will be described in the following paragraphs.

Figure 9: This class diagram shows the relationships, methods and attributes of the major data representation classes.
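As a complement to the class diagram, the following condensed, runnable sketch shows the rough shape of the data classes described below; it leaves out the interfaces, factories and methods, and the exact field names and types are assumptions rather than the framework's verbatim code.

import java.util.ArrayList;
import java.util.List;

public class DataModelSketch {

    // Condensed, illustrative versions of the framework's data classes.
    static class Contributor {
        String username;
        String ip; // only set for anonymous edits
    }

    static class Revision {
        int id;
        int withInPageID;        // position in the page's history, starting at 1
        String text;
        Contributor contributor;
        Revision parentRevision; // previous revision, null for the first one
        Revision childRevision;  // next revision, null for the last one
    }

    static class Pageview {
        String pageTitle;
        int requestCount;
        int requestSize;
    }

    static class Page {
        String title;
        int nameSpace;
        List<Revision> revisions = new ArrayList<>();
        Pageview pageview;       // null if no pageview data is available
    }

    public static void main(String[] args) {
        // Wire up a page with two revisions as a doubly linked history.
        Page page = new Page();
        page.title = "Example";

        Revision first = new Revision();
        first.withInPageID = 1;
        Revision second = new Revision();
        second.withInPageID = 2;

        first.childRevision = second;
        second.parentRevision = first;
        page.revisions.add(first);
        page.revisions.add(second);

        System.out.println("Revisions of " + page.title + ": " + page.revisions.size());
    }
}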

Page In a dump the <page> XML tag introduces a new Wikipedia page. Important information to extract are the page's title (<title> tag) and the namespace (<ns> tag) in which it was published. The title is necessary to later match the pageview data, and the namespace is a good filter, because Adler et al. only considered pages from the main/article namespace, identified by the value 0 [4]. A page's ID was parsed from the <id> tag for better debugging and testing purposes. The aforementioned attributes are directly represented in the Page class. Additionally, a reference named pageview to a Pageview instance is introduced. When no pageview data is available, the reference is null. The page's <revision> tags are represented in a list (revisions) of Revision objects.

Revision A <revision> XML tag in a Wikipedia page captures its content at a given time. When a change is made, a new revision with its author is appended to the end. The latter is modeled in the Contributor class and is referenced by the contributor attribute. The contribution measures by Adler et al. use the revisions to eventually calculate a contributor's ranking. Each Revision object has a back-reference to its associated Page. The model's attributes ID (<id> tag), parentID (<parentid> tag) and text (<text> tag) are populated with the values of their correspondent XML tags. Furthermore, a withInPageID is assigned during the parsing process to represent a revision's relative position in the page's history. The first revision always has index 1. During the realization of the contribution measures, it turned out that implementing the revisions as a doubly linked list facilitates the calculation of differences between the revisions. Therefore, every Revision instance has a reference to the revision before (parentRevision) and after (childRevision) itself, where "before" refers to the revision with withInPageID − 1 and "after" to withInPageID + 1. The first revision has no parent and the reference is therefore null. Similarly, the last revision's child is null. The resulting differences between two revisions are wrapped in an IDifference object, which is accessible by the difference attribute.

Contributor Each revision has a unique author defined in the <contributor> tag. When the author was authenticated while editing a Wikipedia page, the username and its id are saved in the correspondent tags. Otherwise the only identifying information is the IP address (<ip> tag) [2, p. 10]. Due to this fact, when an anonymous contributor is encountered, the username will be set to an imaginary one, which cannot clash with a real contributor, because "for technical reasons, usernames containing the forbidden characters # < > [ ] | / @ are not possible." [45]. Code listing 1 outlines how anonymous authors are mapped to a single non-existent contributor.

public static final String invalidUsername = "##<<__-=ANONYMOUS=-__>>##";
// [...]
case "ip":
    this.cContributor.setIP(cValue);
    this.cContributor.setUsername(SkipXMLContentHandler.invalidUsername);
    break;

Listing 1: Map anonymous authors to a non-existent, "illegal" username while parsing.

Pageview The pageview data comes from a separate source, namely the pageviews datasets described in section 1.4.3. Each line of a pageview file informs about the request count (requestCount) and request size (requestSize) of one specific Wikipedia page. This information can be useful to weight the importance of an edit, e.g.
edits to a frequently accessed wiki are more important than others. To transform a raw datapoint into an instance of the Pageview class, it has to be separated into its four parts. In contrast to the dumps, the page title is still URL-encoded and not normalized, so that needs to be reverted before setting the pageTitle attribute. Code listing 2 from the PageviewParser26 shows how this is achieved. Due to the fact that this data is recorded and published in hourly batches, multiple Pageview instances with the same page title might exist after parsing. To satisfy the one-to-one relationship with a Page object, duplicate Pageview objects are merged into a single one by summing up their separate values. The projectName attribute is used in a filter step to reduce the pageview data to a specific Wikipedia project before summing and associating them with their respective pages, both to increase the performance and the accuracy of the pageview data by preventing objects with equivalent page titles from different wikis from clashing.

26it.neef.tub.dima.ba.imwa.impl.parser.PageviewParser

// The format is: [wiki shorttag] [title] [count] [size], so split at a whitespace.
String[] parts = s.split(" ");
// Get a new pageview instance.
IPageview pv = Framework.getInstance().getConfiguration().getPageviewFactory().newPageview();
pv.setProjectName(parts[0]);
// The page title is not normalized and URL-encoded. We have to revert that, so that it
// matches the wikipedia (pages) DataSet.
String pageTitle = parts[1].replace("_", " ");
pageTitle = java.net.URLDecoder.decode(pageTitle, "UTF-8");
pv.setPageTitle(pageTitle);
pv.setRequestCount(Integer.valueOf(parts[2]));
pv.setRequestSize(Integer.valueOf(parts[3]));

Listing 2: When parsing pageview data, it must be separated and decoded line by line.

DataHolder The framework needs to manage the eventually parsed and preprocessed data, so that it can be further processed. For this purpose, the DataHolder class was implemented. After the initial parsing process has finished, the DataHolder will be populated with the Page, Revision and Contributor DataSets. It then offers convenience methods for accessing and manipulating those three major datasets, for example adding, removing or accessing objects by their ID or by certain attributes. Its main purpose is to be the central DataSet distribution point within the framework.

DoubleDifference The DoubleDifference class implements the IDifference27 interface, the idea of which is to hold information about the difference between two objects. The difference itself is calculated by a class which implements the IDiffer28 interface. In the case of Adler et al., most contribution measures use a numeric value to denote the size of the difference between two Revision objects, e.g. by determining how many words have changed. To better account for calculations with fractions, the Double type was chosen and the default value was set to null.

27it.neef.tub.dima.ba.imwa.interfaces.data.IDifference
28it.neef.tub.dima.ba.imwa.interfaces.diff

PageRelevanceScore, RevisionRelevanceScore, ContributorRelevanceScore, & DifferenceRelevanceScore All of the headlined classes implement the IRelevanceScore29 interface for their respective class type. A RelevanceScore object has a reference to an associated object, which implements the IDataType interface, and its rating - the so called "relevance" score.
The score should denote the object's importance for the ranking. The RelevanceScore objects are the result of the ICalculation30 and thus should be used as the actual score for the ranking. When applying the terminology to Adler et al.'s contribution measures, the accumulation of the DoubleDifferences of all revisions of a particular contributor will be her ContributorRelevanceScore. To keep the framework extensible and also usable for the calculation of other rankings, the relevance scores are not only available for contributors, but also for other aspects of a Wikipedia site, like revisions, pages or differences. RelevanceScore DataSets of the same type can be passed to the RelevanceAggregator for further processing.

The IDataType All the above described classes, except the DataHolder, implement the IDataType31 interface. It is used to mark those Java objects as serializable to allow Apache Flink to serialize them for distribution across the cluster (more about this issue in section 2.4) and as IIdentifiable32. The latter requires the implementation of a getIdentifier() method, which is necessary to uniquely identify objects in Apache Flink's DataSets when using several operations such as grouping or joining. For example, the Contributor class calculates a very basic hash based on its username, as seen in lines 14-17 of listing 3. For contributors the username is unique and the preferred matching criterion, but returning this.hashCode() for other objects might also be a suitable solution.

29it.neef.tub.dima.ba.imwa.interfaces.data
30it.neef.tub.dima.ba.imwa.interfaces.calc
31it.neef.tub.dima.ba.imwa.interfaces.data
32it.neef.tub.dima.ba.imwa.interfaces.data

1 long hash = 0;
2 int count = 1;
3 String data;
4
5 if (this.username == null) {
6     // If the user is anonymous, only use the IP address.
7     // This is not used when the imaginary username is set in setIP().
8     data = this.getIP();
9 } else {
10     data = this.username;
11 }
12
13 // For every byte in the username, multiply it with its index.
14 for (byte c : data.getBytes()) {
15     hash += c * count;
16     count++;
17 }
18
19 return hash;

Listing 3: Contributor - getIdentifier() implementation using a simple hash function that helps to uniquely identify and compare Contributor objects.

2.2.2 Data processing flow

This section will explain the framework's data processing flow, that is, the way of the raw input data through all stages of processing until a result is outputted. There are basically four stages that the data passes, as shown in figure 10:

1. Parsing
2. Storing
3. Processing
4. Outputting

Figure 10: This sketch outlines the four steps of the framework's data processing flow and what components are involved.

Parsing The first major step is to apply the data transformations described in the previous section. The starting point is dumps and pageviews files in two separate folders. The files should be in their compressed format, because it increases the performance as shown in section 2.1. Instead of saving the files on a hard disk on one node, they should be hosted on a distributed filesystem like HDFS across several nodes on the cluster.
Apache Flink supports this data source, allowing parallel access from nodes on the cluster to different parts of the data, therefore preventing a single node from becoming an IO bottleneck.

Configuration When the framework is invoked, it expects certain configured parameters to function properly. To equip the framework with a globally accessible configuration, the Configuration33 class was implemented. It stores important variables and can be accessed through the framework's .getConfiguration() function. It solves the problem of sharing options between different parts of the framework. The default configuration is applied when the framework is instantiated the first time. However, the following three options regarding the input datasets must be set manually, as outlined in lines 6-8 of code listing 4:

• The path to the folder containing the dumps (.setWikipediaDumpPath())
• The location of the pageviews folder (.setPageviewDumpPath())
• The wiki's project name (.setPageviewDumpShortTag())

33it.neef.tub.dima.ba.imwa.impl.configuration

If the dump's variable is not provided or not set before the .init() call in line 10, the framework will not be able to read and parse any data. Parsing and merging the pageviews data is optional and skipped if the pageviews path is not set. In the thesis' reference implementation of Adler et al.'s author ranking, the information needs to be supplied on the command line as positional arguments. Further configuration options include setting factories for various classes or defining PreFilters and PostFilters, which will be explained shortly.

InputParser When the framework knows about the location of the datasets it needs to process, the configured InputParserFactory34 will be used to instantiate a new InputParser35. The InputParser's .run() method accomplishes three major tasks:

1. Instruct the parsing of the dumps and pageviews
2. Merge the Page and Pageview DataSets and apply the PostFilters
3. Populate the DataHolder with the resulting data

The individual methods for each task need to be called in a specific order. To prevent accidental changes, the .run() method is implemented by the abstract class AInputParser36. It will call runDumpParser() and runPageviewParser() for the first task. Those methods then use classes that implement the IDumpParser and IPageviewParser interfaces (both are in the package37) and are obtained by their respective factories from the configuration to process the two input datasets in parallel.

34it.neef.tub.dima.ba.imwa.impl.factories.parser
35it.neef.tub.dima.ba.imwa.impl.parser
36it.neef.tub.dima.ba.imwa.interfaces.parser
37it.neef.tub.dima.ba.imwa.interfaces.factories.parser

1 public void run(String[] args) throws Exception {
2     // Initialize the framework and the output objects.
3     Framework fw = Framework.getInstance();
4     IOutput outputter = Framework.getInstance().getConfiguration().getOutputFactory().newOutput();
5     // Configure the framework to use the correct options.
6     fw.getConfiguration().setWikipediaDumpPath(args[0]);
7     fw.getConfiguration().setPageviewDumpPath(args[1]);
8     fw.getConfiguration().setPageviewDumpShortTag(args[2]);
9     fw.getConfiguration().getPreFilters().add(fw.getConfiguration().getRegexPreFilterFactory().newIRegexPreFilter("(?is).*<ns>0</ns>.*"));
10     fw.init();
11     // Instantiate the Calculation object to run.
12     ACalculation calculation = new TextLongevityCalculation();
13
14     // Run and output the calculation.
15     fw.runCalculation(calculation);
16     outputter.output(calculation);
17 }

Listing 4: The AdlerContributionMeasuresExample shows the important steps of initializing, running and outputting the TextLongevity contribution measure by Adler et al. with the framework.

Figure 11 shows Apache Flink's execution plan for a run of the framework's TextLongevityCalculation. On the left side of the centered box, the dump (upper box) and pageviews parsing (lower two boxes) are placed on two different paths, representing two detached data flows. The box in the middle displays the merging process of the Page and Pageview DataSet. The data flow then continues with the usage of the transformed data in calculations, as the right side of the figure shows. After the datasets have been transformed by the parsers, which will be explained in the following paragraphs, the Page and the Pageview objects are still in two separate DataSets.

Figure 11: Apache Flink's execution plan for the TextLongevityCalculation, which is generated before the distributed processing starts.

This changes with a call to mergeDumpAndPageviewData() right before the abstract class applies the PostFilters in the applyPostFilters method. A left outer join of the Page with the Pageview DataSet is performed using the page titles as the key. The PageAndPageviewJoinFunction simply assigns a Pageview object to its associated Page object if a match exists. Otherwise it stays null. At this point there is still only one huge DataSet with Page objects, but in theory it is in a state which can be used for further processing. To speed up the overall performance by removing unwanted objects from this DataSet, it can be refined with PostFilters. The result is then divided into the Page, Revision and Contributor DataSets in the populateDataHolder() method, which also takes care of setting the references to those DataSets in the global DataHolder.

PostFilter It should be evident that processing fewer elements results in less computation time. Therefore the framework offers two primary ways to filter unnecessary objects. The filters that can be used to reduce the combined Page DataSet after ("post") the parsing process has finished are called PostFilters. They must implement the IPostFilter38 interface, which defines a filter(IPage page) function. It takes a Page object from the merged DataSet as a parameter and returns a boolean decision: If the object fulfills the criterion and should be further processed, then true must be returned. A return value of false removes the object from the DataSet and any other processing by the framework. The configuration exposes a method to access the list of all PostFilters (line 8), and a new filter should be passed through the CustomPostFilterFactory39, as shown in listing 5. When multiple filters are defined, all will be applied in the same order.

38it.neef.tub.dima.ba.imwa.interfaces.filters.post

1 fw.getConfiguration().getPostFilters().add(fw.getConfiguration().getCustomPostFilterFactory().newICustomPostFilter(new NameSpacePostFilter()));
2 // [...]
3 static class NameSpacePostFilter implements IPostFilter {
4     @Override
5     public boolean filter(IPage page) {
6         return page.getNameSpace() == 0;
7     }
8 }

Listing 5: Shows how to configure and apply a PostFilter to filter pages from the main/article Wikipedia namespace.
There are two methods of filtering: One is using only one .filter() call on the DataSet and looping over all PostFilters within it for every element ("inner loop"). The second method is using one loop over all filters and multiple .filter() calls on the DataSet ("outer loop"). Listing 6 is the "outer loop" implementation for the PostFilters. An argument for the latter approach is that every subsequent PostFilter will only test against the resulting smaller DataSet of the previous filter. At the time of the framework's implementation, the decision fell on the "outer loop" variant. The later conducted comparison of both methods (section 3.3.4) does not reveal any significant performance discrepancies on average.

final List<IPostFilter> filterList = Framework.getInstance().getConfiguration().getPostFilters();
for (final IPostFilter filter : filterList) {
    filteredIpageDataSet = filteredIpageDataSet.filter(new FilterFunction<IPage>() {
        @Override
        public boolean filter(IPage iPage) throws Exception {
            return filter.filter(iPage);
        }
    });
}

Listing 6: Both Pre- and PostFilters are applied subsequently on the whole DataSet to reduce the amount of checked objects.

DumpParser Parsing dumps is done by classes that implement the IDumpParser interface. There are three different XML-based DumpParsers that come with the framework:

• DumpParser: A parser that parses the dumps as they are given.
• SkipDumpParser: A parser that honors Adler et al.'s condition of skipping subsequent Revision objects by the same author [2, p. 11].
• RegexSkipDumpParser: A parser that also honors the revision skipping, but uses regular expressions to extract the information.

With the focus on Adler et al.'s work, the SkipDumpParser is used as the default. It starts to read the dumps in the Hadoop compatibility mode with the help of the PageXMLInputFormat40. This class extends the XmlInputFormat, which was adapted from the Apache Mahout project41 to support compressed input files and split the dumps at the <page> XML tag. In section 2.1.2 it was discovered that using compressed input files can lead to a significant performance boost. The resulting DataSet contains the raw XML content of all pages in that wiki. What follows is the application of the PreFilters to filter out unnecessary Wikipedia pages before doing the heavy parsing. In contrast to the implementation of Adler et al., who use regular expressions to parse out the information, the framework's default setting is to use Java's "Simple API for XML" (SAX) from the org.xml.sax package. A map function on the raw page XML will pass each page into the static SkipXMLContentHandler.parseXMLString along with a small configuration wrapper (DumpParserConfig), which provides the necessary Page, Revision and Contributor factories. Before returning the resulting Page DataSet to the InputParser for further processing, invalid (null) items are filtered from it. The RegexSkipDumpParser, however, follows Adler et al.'s idea of using regular expressions for information retrieval from the raw XML page content.

39it.neef.tub.dima.ba.imwa.impl.factories.filters.post
40it.neef.tub.dima.ba.imwa.impl.parser
41https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
It uses the similar RegexXMLPageParser.parseXMLString function and led to further bug discoveries in WikiTrust, but turned out to be slightly faster than a native XML parser for larger datasets in our evaluation in section 3.3.6. After an object has been successfully parsed, its .postParsing() method must be called. This allows the object to perform some operations before it is added to the DataSet.

SkipXMLContentHandler The static parseXMLString function (listing 7) first instantiates a new XMLReader (line 5), an InputSource (line 6) from the given pageXML string and an instance of the SkipXMLContentHandler itself (line 1). Both latter instances are then connected to the xmlReader instance to start the parsing process (lines 9-10). The function will either return all successfully parsed pages as Page objects or null if an exception occurred (lines 12-14). Furthermore, the SkipXMLContentHandler implements the ContentHandler from the org.xml.sax package, defining two important methods, namely startElement and endElement. These functions are called when a new XML tag starts or ends. This allows tracking the parser's state using a MODES enum and switch statements to handle the different attribute tags within a <page>, <revision> or <contributor> tag. The characters method provides the content between two tags, which can be assigned to an appropriate object when a tag ends. Using the state tracking method and the extracted tag values, the SkipXMLContentHandler can build objects as explained in the previous section "Data representation and transformation". The main difference between the SkipXMLContentHandler and the XMLContentHandler is that the former removes subsequent revisions by the same author. Due to the nature of the XML dump and the sequential parsing, the code has to "look back" and check if the author of the previously parsed revision equals the currently parsed one. If this happens, then certain references need to be updated for a successful removal of the previous revision. This might include updating the parent revision's childRevision attribute and the current revision's parentRevision.

1 // Instantiate the SkipXMLContentHandler
2 SkipXMLContentHandler xmlHandler = new SkipXMLContentHandler(config);
3 try {
4     // Set up a StringReader to read the stream.
5     XMLReader xmlReader = XMLReaderFactory.createXMLReader();
6     InputSource iSource = new InputSource(new StringReader(pageXML));
7
8     // Use the SkipXMLContentHandler to parse the data.
9     xmlReader.setContentHandler(xmlHandler);
10     xmlReader.parse(iSource);
11
12     return xmlHandler.getAllPages();
13 } catch (IOException | SAXException e) {
14     return null;
15 }

Listing 7: Parsing the raw page XML content using Java's SAX library.

RegexXMLPageParser The regular expression parser is similar to the SAX approach, but instead of creating an InputSource, a BufferedReader is instantiated and passed to the internal parseInput method. It then reads the given XML content line-wise and compares each line against a set of regular expressions that match a beginning, ending or attribute XML tag. This approach and the regular expressions were adapted from WikiTrust (in analysis/do_evil.ml) and extended by expressions, for example for a page's namespace. Internally, Java's Pattern and Matcher classes provide the regular expression functionality.
Analogous to the previous parsing method, the MODE switching is re-used and depends on matching a line with one of the regular expressions using the hasTag method. Lines that do not match are probably part of a revision’s text segment and captured in a StringBuilder if the TEXT mode is active. To return the same parsed Revision objects, the same skipping strategies are applied. However, the text had to be URL-decoded with the StringEscapeUtils.unescapeXml function to match the SkipDumpParser output.</p><p>PreFilter PreFilters are similar in their core idea to the PostFilters, but they are applied before parsing process. A PreFilter’s filter function gets a string with a page’s XML representation. A return value of true will preserve the page</p><p>34 2.2 Plan of implementation in the DataSet and false will discard it. There are three convenience PreFilters that the framework provides: • RegexPreFilter filters with regular expressions42. • XpathPreFilter is based on the XML Path language43 (Xpath). • XqueryPreFilter uses the XML query language44 (Xquery). All three classes reside in the same package45 and can be easily instantiated by using the similar named factories from the configuration and passing the query string. Line 9 in code listing4 shows how the pages can be filtered by their namespace with a regular expression PreFilter, the RegexPreFilter. The other two filters do not provide a matching functionality, but return elements that were selected using the query. Therefore, the filters return true when the return value is not null or an empty string. In section 3.3.3 the processing speed of all filtering methods was measured and analyzed. The results show that the fastest PreFilter for our test dumps is the RegexPreFilter.</p><p>PageviewParser Classes that accord for the parsing of pageviews should base on the IPageviewParser interface, which will make them compatible with the PageviewParserFactory and thus accessible from the InputParser. The default PageviewParser46 class reads the pageviews line-wise into a String DataSet. Each line represents one pageview in its raw format as described in the paragraph in section 2.2.1. It also covers the important parsing aspects, such as the page title normalization and the filtering for a specific wiki project using the project name. If no such filter is set and the configuration’s return value of getPageviewDumpShortTag() is null, then the String comparisons costs can be saved by using a simplified object filter function.</p><p>Storing When all input data is parsed, it needs to be stored, so that it can be used in the next processing step. This section will explain the orange box of figure</p><p>42https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html 43https://www.w3.org/TR/xpath/ 44https://www.w3.org/TR/xquery/ 45it.neef.tub.dima.ba.imwa.impl.filters.pre 46it.neef.tub.dima.ba.imwa.impl.parser</p><p>35 2.2 Plan of implementation</p><p>10 containing the DataHolder, which implements the IDataHolder interface. The framework’s default DataHolder was already shortly introduced in section “Data representation and transformation“ - in essence it is a wrapper class for provid- ing access to the three DataSets with Page, Revision and Contributor objects including convenience methods for adding, removing or retrieving individual ob- jects from the DataSets. However, using those methods is quite expensive due to the nature of Apache Flink’s DataSet API and the need of .filter calls to either remove or get elements. 
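As a hedged illustration of why these convenience methods are costly, the following sketch shows what "removing" a single page amounts to; getPages() on the DataHolder and getIdentifier() on IPage are assumed accessor names here, not verified framework API.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;

// Sketch only: removing one object means filtering the complete DataSet,
// which just adds another lazy transformation to Flink's execution plan.
public class DataHolderRemovalSketch {

    static DataSet<IPage> removePage(DataSet<IPage> pages, final Object targetId) {
        return pages.filter(new FilterFunction<IPage>() {
            @Override
            public boolean filter(IPage page) {
                // Every element is touched just to drop a single page.
                return !targetId.equals(page.getIdentifier());
            }
        });
    }
}
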
Retrieving single objects from DataSets is especially ex- pensive performance-wise, because it uses .collect() to obtain an element from the DataSet and this function triggers an execution plan generation in Apache Flink 47. A better approach is to rely on the Pre-/PostFilters for removing ele- ments or on the inter-object references to remove objects. Anyway, the advantage of storing the DataSets in a globally accessible object is that multiple Calculation functions can read and work with the data in parallel what is indicated by the three outgoing edges in the figure.</p><p>Processing All efforts of parsing the data would be worthless, if it does not get used. The blue box in figure 10 models how three calculations A, B and C process the data. Results are aggregated following the formula (A+B)*C. But how does the data flow through this part of the framework?</p><p>Calculation Classes that extend the ACalculation48 abstract class are the most important part of the framework, because that is where the data is processed. Some example Calculation functions based on the con- tribution measures by Adler et al. are implemented and available in the it.neef.tub.dima.ba.imwa.examples.adlerscm package and are explained in more detail in section 3.3.1. Again an abstract class was used to provide a run function that executes the following methods in the correct order:</p><p>• init(): Initialize the calculation (e.g. dependencies) • preProcess(): Initialize or prepare the DataSet</p><p>47http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Retrieving- a-single-element-from-a-DataSet-td9731.html#a10027 48it.neef.tub.dima.ba.imwa.interfaces.calc</p><p>36 2.2 Plan of implementation</p><p>• process(): Do the main calculation and process the DataSet • postProcess(): Finalize the calculation or the result DataSet</p><p>The above rules are not strictly enforced and theoretically everything could be done in one single of those. Still, the paradigm should be applied for better structuring and readability of the code. Since everything revolves around the framework, a Cal- culation must be passed to the framework’s runCalculation method, where run is eventually called. The instantiation and submission of an example Calculation is shown in lines 12-15 of code fragment4. For example, the contribution measure Calculations use the preProcess method to prepare the revisions by calculating the differences using IDiffer classes. The resulting values are then transformed into the actual relevance scores in the process method and the final RelevanceScore- DataSet is saved in a variable that should be accessible through the getResult() function. Eventually a DataSink implemented by an IOutput class will process it, e.g. by writing it to a file or printing it out. Additional objects or arguments can be passed to a Calculation instance with the help of its setArguments method and the ArgumentsBundle49. addArgument or getArgument are used to manage objects in the bundle. The only requirement is that those are serializable and the key to access them is known.</p><p>Differ If a class implements the IDiffer50 interface, it becomes a “Differ“ which can be used to calculate differences between two revisions “current“ and “next“ that are passed to the calculateDiff function. The emitted IDifference51 object reflects the differences. 
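A minimal custom Differ might look like the following sketch. The calculateDiff signature follows the description above, while IRevision#getText() and the DoubleDifference constructor are assumptions; in the framework, the configured IDifferenceFactory would normally create the difference object.

// Hedged sketch of a Differ that rates a revision pair by its change in text length.
public class TextLengthDiffer implements IDiffer {

    @Override
    public IDifference calculateDiff(IRevision current, IRevision next) {
        // Positive: text was added; negative: text was removed. getText() is assumed.
        double delta = next.getText().length() - current.getText().length();
        return new DoubleDifference(delta); // assumed constructor
    }
}
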
For the example implementations, they are numeric values of type Double (DoubleDifference52) which can be instantiated by the configured IDifferenceFactory.</p><p>RelevanceAggregator Two or more separate Calculations can be com- bined to create more complex ones. In such situations the framework’s RelevanceAggregator53 can be used to combine the Calculations resulting</p><p>49it.neef.tub.dima.ba.imwa.impl.calc 50it.neef.tub.dima.ba.imwa.interfaces.diff 51it.neef.tub.dima.ba.imwa.interfaces.data 52it.neef.tub.dima.ba.imwa.impl.data 53it.neef.tub.dima.ba.imwa.impl.calc</p><p>37 2.2 Plan of implementation</p><p>RelevanceScoreDataSets. The class is based on the IRelevanceAggregator54 interface which enforces the implementation of the following functionality: 1. Joining two IRelevanceScoreDataSets using item-wise +, -, * or /. 2. Mathematical operations with a scalar value α on an IRelevanceScore- DataSet, such as + α, - α, * α or / α 3. Aggregations such as min, max, sum on an IRelevanceScoreDataSet. The return values are again IRelevanceScoreDataSets, what makes chaining them possible. When joining such DataSets, the .join operation identifies equal objects by calling their getIdentifier() function. Therefore the aggregated DataSets should be of the same size and contain the same objects, but that is normally the case because of working on the same data provided by the DataHolder. Otherwise the join will omit unmatched items. Scalar operations happen in a standard map process, which seamlessly integrates into an execution plan, and care should be taken to not divide by zero. Aggregations in the form of finding the minimum, maximum or sum of a DataSet are costly, because these use operations that will trigger an execution plan due to their .collect call and that might decrease the performance.</p><p>Outputting Finally the data needs to leave the framework as shown in the red box in figure 10. This can be achieved using the default ConsoleOutput55 which will print the resulting relevance scores on the console or with other classes that implement the IOutput56 interface. Each such class will have two functions called output that will either accept an ACalculation or a DataSet as a parameter. In the former case getResult is used to obtain the final result DataSet. Usually those DataSets can be assumed to be of the type IRelevanceScore. The ConsoleOutput is a wrapper around a DataSet’s .print method and its usage can be seen in lines 4 or 16 of listing4.</p><p>Execution plan Before Apache Flink starts any processing it compiles an execu- tion plan that might look similar to the diagram in figure 11. The plan begins with 54it.neef.tub.dima.ba.imwa.interfaces.calc 55it.neef.tub.dima.ba.imwa.impl.output 56it.neef.tub.dima.ba.imwa.interfaces.output</p><p>38 2.2 Plan of implementation a DataSource such as our DumpParser or PageviewParser where data is read using the configuration’s getExecutionEnvironment(). Flink then continues to collect all DataSet-API calls like .filter, .map, .reduce and incorporates them into its plan until it finds a function call that triggers an execution of the plan: • .print() • .collect() or .count() • .execute() The first two implicitly trigger an execution of the created plan. Other DataSinks need to be explicitly executed by a call to .execute() [8]. 
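The following self-contained example (plain Flink DataSet API, no framework classes) shows this laziness: the filter and map calls only build up the plan, and nothing is processed until count() is reached.

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LazyPlanExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> pages = env.fromElements("<page>a</page>", "<page>bb</page>");

        // filter() and map() only extend the execution plan; nothing runs yet.
        DataSet<Integer> lengths = pages
                .filter(new FilterFunction<String>() {
                    @Override
                    public boolean filter(String value) {
                        return value.startsWith("<page>");
                    }
                })
                .map(new MapFunction<String, Integer>() {
                    @Override
                    public Integer map(String value) {
                        return value.length();
                    }
                });

        // count() is one of the calls that implicitly triggers the execution.
        System.out.println(lengths.count());
    }
}
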
When an execution is triggered, Apache Flink will start to distribute and process the program on the configured cluster.</p><p>2.2.3 Code structure</p><p>To better understand the implementation and usage of the framework, the code structure and its characteristics will be briefly covered in this section. The whole code is part of the it.neef.tub.dima.ba.imwa package, which further contains three packages and two Java files: • examples: A package for examples using the framework. • impl: A package containing the (default) implementation of certain necessary interfaces and classes. • interfaces: A package consisting of all abstract class interfaces that should be used with the framework. • Framework.java: The framework’s outermost layer. • Job.java: It contains Java’s entrypoint “main“ which runs the AdlerContributionMeasuresExample</p><p>A singleton instance of the Framework object is available through its getInstance() function. On its first initialization it will create the Config- uration object and set various default settings. Further essential functionality are the init for initializing the parsing and numerous methods with a run prefix for starting Apache Flink’s execution plan.</p><p>39 2.2 Plan of implementation</p><p>As listed above, the package interface is about the framework’s interfaces and it contains packages, with each having a specific topic and optionally more sub- packages: • calc: Interfaces for the Calculation, RelevanceAggregator and Argument- Bundle • data: Interfaces for all data classes, including all with the RelevanceScores suffix and the DataHolder • diff: The IDiffer interface • factories: Same package structure (except for “calc“) with factories for most of the interfaces. • filters: Interfaces for Pre- and PostFilters in separate packages. • output: Contains the IOutput interface • parser: Parser related interfaces The second package, impl, implements most of those interfaces. It adheres to the same package structure, except that an additional package “configuration“ with the Configuration class exists. To test and evaluate the framework’s performance, Adler et al. contribution measures were implemented with the help of it. All six measures are placed in separate packages correspondent to their name in the examples.adlerscm package. Another important piece of code is in the Job.java. It contains Java’s en- trypoint, the public void main(String[] args) function that will be called when the packaged .jar-file is submitted using flink run. It instantiates an AdlerContributionMeasuresExample object, which then uses the functionality provided by the framework. In fact, the framework cannot exist as a program by itself, because it lacks such a main function. Therefore, it only has a supportive role in the sense of providing a frame for other’s work. Building the framework with the aforementioned structure was considered to im- prove its readability and comprehensibility. The naming of the packages by their respective topic should ensure that the correct classes and interfaces can be easily found. In a similar fashion, the implemented classes’ naming follows the convention of having a suffix of its interface.</p><p>40 2.3 Functionality and extensibility</p><p>Key points of section “Plan of implementation“ • Several classes represent the dump and pageviews information with their necessary relationships. 
All calculations are done on those ob- jects.</p><p>• The four processing steps consist of multiple components that are re- sponsible for the data processing flow.</p><p>• The framework’s source code is separated into packages for better read- ability and usability.</p><p>2.3 Functionality and extensibility</p><p>This section will give an overview from the outside perspective of a user. It is as- sumed that the user wants to use the framework for its primary developed purpose - the calculation of Wikipedia author rankings. When referring to a “user“ of the framework, it describes a person that uses it to develop software for achieving their desired goal, for example calculating author rankings.</p><p>2.3.1 Functionality</p><p>The previous section “Plan of implementation“ covered the framework’s internal processes and data flow that is partly hidden from the user. A detailed insight into the framework’s functionality was given in that section.</p><p>Abstraction layer The whole parsing process, including dump and pageview parsing, is completely invisible to the user if she decides to use the default parsers. Same applies for running and printing out calculation results. In that regard, the framework poses as an abstraction layer to remove complexity from the user. The only part that she needs to implement are the Calculation classes if the im- plemented contribution measures are not enough. In that case she will need to interact with the DataHolder, which already provides all necessary DataSets, and the DataSet-API to operate on them.</p><p>41 2.3 Functionality and extensibility</p><p>Automated tests To ensure that the flexible components of the framework work as expected, automated tests with JUnit 4 were implemented. By automatically ex- ecuting them with every build, changes that modify the framework’s functionality in a negative way can be identified. Due to Apache Flink’s DataSet-API requiring an Apache Flink environment to function, the flink-spector57 project was necessary to mimic such an environment. At the time of writing, the Apache Maven repos- itory only contained a version for Flink 1.2.1, what hindered the implementation of certain tests due to the newer version’s type handling.</p><p>2.3.2 Extensibility</p><p>Throughout the previous sections it was mentioned that the framework provides default implementations and classes for a lot of functionality. The reasoning behind this is to deliver a framework that can be used with minimal effort by providing appropriate defaults, but it actually is designed to be flexible. The defaults might be biased towards the implementation of Adler et al. contribution measures. How- ever, the goal was to not limit or restrict it to only work with specific predefined classes, but to allow for extensibility and adaptation to a user’s needs. That is realized using features of the Java language and programming patterns: • Interfaces and abstract classes • Factories • Singletons</p><p>Interface Interfaces are a special type that are used as a “contract between the class and the outside world“ as well as a way to define “the behavior it promises to provide“ [42]. Such “contracts“ are used throughout the framework. All relevant interfaces were described in section “Code structure“ and can be found in the interfaces package. The framework uses them to define method signatures. Code listing8 shows the interface that all Calculation classes have to implement. An interface cannot be instantiated by itself. 
It must be implemented by a class, but an interface can extend other interfaces to combine their behavior definitions [28]. This approach has the advantage that the framework’s underlying classes can define 57https://github.com/ottogroup/flink-spector</p><p>42 2.3 Functionality and extensibility</p><p>1 public interface ICalculation<T extends IDataType, S extends Double, B extends IDataType, A extends IArgumentsBundle> extends Serializable { 2 3 A getArguments(); 4 void setArguments(A arguments); 5 6 DataSet <? extends IRelevanceScore<T, S>> getResult(); 7 DataSet <? extends IRelevanceScore<B, S>> getBaseResult(); 8 9 void init(); 10 void preProcess(); 11 void process(); 12 void postProcess(); 13 void run(); 14 } Listing 8: The ICalculation interface defines, which methods a Calculation class has to implement. functions and methods that take objects, which implement specific interfaces, as parameters or return values. Lines 3,4 show such methods. The type A is a “Generic Type“58 that is defined as A extends IArgumentsBundle meaning that only objects of a type which extends the IArgumentsBundle are allowed as a parameter for those function. Generic types can make the source code more readable. Additionally they might help with error detection at compile-time, because the compiler can more precisely infer an object’s type [24]. Anyway, the framework’s extensibility benefits from interfaces, because functions do not know the object’s exact type, but have a promise that the object implements certain well-defined methods. For example, the DumpParser and SkipDumpParser both implement the IDumpParser interface and must therefor provide the method body for parseDumpData. But the InputParser does not know nor care about the actual implementation when calling the method and is satisfied when the following call to getPages returns the parsed data. Similarly the XML parser only knows that the objects have certain methods to assign the values, but not their internal structure or implementation.</p><p>58https://docs.oracle.com/javase/tutorial/java/generics/types.html</p><p>43 2.3 Functionality and extensibility</p><p>1 public abstract class ACalculation<T extends IDataType, S extends Double, B extends IDataType, A extends IArgumentsBundle> implements ICalculation< T, S, B, A> { 2 3 4 /** 5 * Runs all processing steps in the right order. 6 * 7 * @see ICalculation#run() 8 */ 9 @Override 10 public void run() { 11 this.init(); 12 this.preProcess(); 13 this.process(); 14 this.postProcess(); 15 } 16 } Listing 9: The ACalculation abstract class, which implements the run function to ensure a particular call order.</p><p>Abstract class On several occasions “abstract classes“ were mentioned. That particular type is similar to interfaces, but allows to predefine method bodies. The source code in listing9 presents the ACalculation class with its run method. It implements the previously discussed ICalculation interface. By setting the frame- work’s runCalculation parameter type to ACalculation all calculations will have to implement all methods from the interface, except the run method, which had already been implemented by the abstract class. This approach aims at hiding the methods from the users, so that the four processing steps happen in the correct order. It was also used in the InputParser with the AInputParser abstract class, so that important processing steps cannot accidentally be changed. 
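To illustrate what a user-defined Calculation has to provide, the following skeleton extends ACalculation with the generic parameters from listing 8. It is a sketch only: IContributor is assumed to be the framework's contributor interface, and the actual DataSet transformations are omitted.

import org.apache.flink.api.java.DataSet;

// Skeleton of a custom Calculation; run() is inherited from ACalculation and
// calls the four processing methods below in the correct order.
public class EditCountSketchCalculation
        extends ACalculation<IContributor, Double, IRevision, IArgumentsBundle> {

    private IArgumentsBundle arguments;
    private DataSet<? extends IRelevanceScore<IContributor, Double>> result;

    @Override public IArgumentsBundle getArguments() { return arguments; }
    @Override public void setArguments(IArgumentsBundle args) { this.arguments = args; }

    @Override public void init() { /* resolve dependencies, e.g. a Differ */ }
    @Override public void preProcess() { /* e.g. compute revision differences */ }

    @Override
    public void process() {
        // Transform the DataHolder's DataSets into one relevance score per
        // contributor and store it in 'result' (transformations omitted here).
    }

    @Override public void postProcess() { /* finalize the result DataSet */ }

    @Override
    public DataSet<? extends IRelevanceScore<IContributor, Double>> getResult() {
        return result;
    }

    @Override
    public DataSet<? extends IRelevanceScore<IRevision, Double>> getBaseResult() {
        return null; // only needed when combining calculations, cf. section 2.4.8
    }
}

Such a class would then be submitted with the framework's runCalculation method, as shown for the example Calculation in listing 4.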
Overall this is another beneficial addition, because it increases the ease of use and lowers the barrier of entry for custom implementations.</p><p>Factory Pattern Although the interfaces and abstract classes facilitate the cus- tomization of nearly all components of the framework, changing a data structure</p><p>44 2.3 Functionality and extensibility</p><p>1 public class RevisionFactory implements IRevisionFactory<IRevision> { 2 @Override 3 public IRevision newRevision() { 4 return new Revision(); 5 } 6 } Listing 10: The RevisionFactory for Revision objects as an example for an object factory. like a Revision class would require more changes in the code. That is, because it is instantiated during the parsing process in the SkipXMLContentHandler. If this instantiation was to be fixed, e.g. IRevision rev = new Revision(), then changing the invocation to e.g. IRevision rev = new CustomRevision() would require a re-implementation of the SkipContentHandler. That in turn requires a new implementation of the DumpParser. This “chain“ of rewriting would continue up until all type declarations match again. It is not far-fetched that this is not fea- sible nor in the spirit of a good framework. To combat this issue, factories and the global configuration (section 2.2.2) were introduced. The code snippet 10 shows that factories are normal Java classes whose only task is to instantiate new objects of a specific type. The trick is that the return value’s type is the shared interface of all possible objects of this category. The factories are accessible through the global configuration and the appropriate getter-method. This reduces the location where an instantiation needs to be changed to exactly one line in the factory, because all other classes can use IRevision cRevision = Framework.getInstance() .getConfiguration().getRevisionFactory().newRevision(); to obtain re- visions with the necessary IRevision base type. If another class is needed, a new rather simple factory must be implemented and then configured glob- ally using the configuration’s suitable setter. In this example it would be setRevisionFactory(new RevisionFactory()). Similar to the interfaces, such factory patterns are implemented throughout the framework for most core components. With the last-mentioned features, the framework should give enough freedom for extensibility of its functionality and further use cases in the future.</p><p>45 2.3 Functionality and extensibility</p><p>1 /** 2 * Singleton instance of the framework. 3 */ 4 private final static Framework instance = new Framework (); 5 /** 6 * The framework’s configuration object. 7 */ 8 private final Configuration configuration; 9 10 private Framework() { 11 this.configuration = new Configuration(); 12 //[...] 13 } 14 public static Framework getInstance() { 15 return Framework.instance; 16 } Listing 11: The Framework’s singleton implementation for global access to the Framework instance.</p><p>Singleton Pattern The framework uses a Singleton pattern to ensure that there is only one instance of the Framework class. Listing 11 shows the crucial lines of code. The reasoning behind it is that otherwise the DataSet operations and Apache Flink’s execution plan might mix up between multiple framework instances. To prevent this, the Framework class has a static variable instance with an instance of itself that can only be accessed with the getInstance() method. As another precaution the constructor is private, so that only the class can instantiate itself. 
The final keyword of the instance and configuration variables enforces that they can only be assigned once.</p><p>Key points of section “Functionality and extensibility“ • The parsing is hidden from the user and automated tests verify its correctness.</p><p>• Extensive usage of interfaces, abstract classes, factory and singleton patterns are the framework’s basis for its flexibility and extensibility.</p><p>46 2.4 Problems and solutions</p><p>2.4 Problems and solutions</p><p>During the development and evaluation process of the framework, several problems and bugs were encountered. Some of them were easy to resolve or work around, but others were rooted deeper in the dependencies. Fixing the latter ones would require too much effort and time and are therefore discussed for future work.</p><p>2.4.1 Object serialization</p><p>When running distributed processes across a cluster with more than one node, data has to be transferred between them. That means the data, e.g. Java objects, must be transformed from memory into a suitable representation, then transferred to other node and re-instantiated. The receiving node then reverts the transformation to rebuild the same object in memory. This process is called serialization and deserialization. Apache Flink tries to handle it transparently and falls back to the Kryo59 serializer if it cannot [9]. In the early stages of the framework’s development Apache Flink was not able to serialize the data objects used in the DataSets, most likely due to their private methods and them not extending the Serializable interface. This was addressed with the IDataType interface. It extends the aforementioned Serializable inter- face and should be implemented by all objects that are used with Apache Flink’s DataSet-API. This will allow Flink to determine that the object can be serialized, even when falling back to Kryo is necessary.</p><p>2.4.2 Incomplete XML parsing</p><p>One of the dump parsers was implemented using the SAX XML ContentHandler class. When data between two XML tags is encountered, the public void characters(char[] chars, int i, int i1) throws SAXException will be called. It was first assumed that the method returns all data in a single chunk. However reviewing the parsed XML dumps lead to the discovery that some revisions contained truncated text. Indeed, the documentation says that “SAX parsers may return all contiguous character data in a single chunk, or they 59https://github.com/EsotericSoftware/kryo</p><p>47 2.4 Problems and solutions may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.“ [20] To prevent data loss by ignoring continuous chunks, a StringBuilder instance concatenates all data chunks into one string. After an entity is parsed and the full data assigned as a property, the StringBuilder’s buffer is reset to length zero for the next entity. This solution allows to capture all content between the XML tags in its full length.</p><p>2.4.3 Bzip2 decompression</p><p>The biggest problem throughout the development and evaluation of the frame- work was related to bzip2 decompression. While investigating compressed datasets and their decompression performance in section 2.1 the compatibility bridge be- tween Apache Flink and Apache Hadoop allowed to decompress the data. The TextInputFormat in org.apache.hadoop.mapred returns a LineRecordReader that is capable of handling compressed input data. 
Those classes were used in the decompression performance testing program.

Parsing the dumps was realized by splitting the XML data at <page> tags. The Apache Mahout project implemented a suitable XmlInputFormat60 class that splits Wikipedia dumps. However, its XmlRecordReader implementation expects uncompressed input data, because it lacks the code for handling compressed data. The approach that seemed feasible was to extend Apache Mahout's XmlRecordReader with the appropriate compressed-data handling wrappers from the LineRecordReader. However, the final source code introduced bugs that led to severe issues with the framework. Unfortunately, even the community on the Apache Flink mailing list could neither explain nor identify the root cause61.

60https://github.com/apache/mahout/blob/master/integration/src/main/java/org/apache/mahout/text/wikipedia/XmlInputFormat.java
61http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Reading-compressed-XML-data-td10985.html

Losing revisions Tests with dumps of several hundred megabytes in compressed size revealed that the number of revisions does not match that of the same, but decompressed, dump. In all tested cases, processing the compressed dump resulted in a lower number. This was not observable during development with small compressed dumps (<1 MB) and their uncompressed versions. For example, the uncompressed bgwiktionary-20160111 dump has 901286 revisions, but processing the compressed version yields only 898669 extracted revisions, a loss of about 0.290 %. For the acewiki-20170501 dump, a significant loss of ~5.7 % occurs. A likely explanation is a mistake in finding the correct boundaries of a page, which partially skips pages or some revisions within them. Losing revisions has a direct effect on the computation of the author ranking: revisions that might lift the score of a good author or further discredit a vandal will not be processed, leading to inaccurate author rankings.
After further investigation of this bug, the problem was identified and a partial fix implemented. The XmlInputFormat indeed fails to correctly determine the boundaries or positions of <page> tags. As a result, some page XML strings are emitted into the DataSet multiple times, leading to duplicate content and revisions. This was resolved with a .distinct() operation on the XmlInputFormat's result DataSet, which removes all duplicates. This might have an impact on the framework's performance, but a higher accuracy in correctly parsing the dumps and reproducing Adler et al.'s contribution measures is more important; otherwise a direct performance comparison would not be possible. With the resolution in place, the uncompressed dumps yielded the same number of revisions as their compressed counterparts. The reverse conclusion is that the decompression worked flawlessly, but the "reference" processing of the decompressed dumps had flaws.

Memory issues Another issue arose after the development when running the framework on the cluster with datasets of moderate size (e.g. warwiki-20170501). Certain compressed dumps led to an error related to Java's VM limits, causing the whole distributed processing job to fail. Listing 12 shows the first part of the Java stack trace, which indicates that the DataOutputStream's byte array cannot be extended because the new size would exceed the allowed limit.
02/19/2017 22:20:12 Job execution switched to status FAILING.
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:135)
    at java.io.DataOutputStream.write(DataOutputStream.java:88)
    at imwa.impl.parser.XmlInputFormat$XmlRecordReader.readUntilMatch(XmlInputFormat.java:114)

Listing 12: Java's "JVM Out of memory" exception that occurred for bigger dumps.

The Apache Spark community discusses the same error message in a post62 on their mailing list. Apparently this is a technical limitation of the JVM, because it cannot create arrays whose size exceeds the maximum value of their index type. For indexes of type signed integer in a 32-bit JVM, this equals a maximum array size of $\lfloor \frac{2^{32}-1}{2} \rfloor = 2147483647$ elements. The same number can be found as a length check in the aforementioned LineRecordReader's maxBytesToConsume function (see listing 13).

62http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-Requested-array-size-exceeds-VM-limit-td16809.html

private int maxBytesToConsume(long pos) {
    return this.isCompressedInput ? 2147483647 : (int) Math.min(2147483647L, this.end - pos);
}

Listing 13: Length checks in Apache Hadoop's LineRecordReader that might prevent the observed error.

Taking into account the byte array type and the thereby imposed limit of $\frac{2147483647}{1024 \cdot 1024} \approx 2048$ megabytes per XML page, processing pages of such sizes might not be possible with the framework. No further investigation was done to see whether the memory issue correlates with the loss of revisions. After a while a possible cause was identified: the framework's pom.xml contained a directive that enforced compilation with Java 1.7. Changing the value to Java 1.8 did not fully resolve the problem, but shifted it elsewhere. The new error messages were similar to "Cannot write record to fresh sort buffer. Record too large." or "Thread 'SortMerger Reading Thread' terminated due to an exception: null". A significant portion of time was invested into analyzing and debugging those issues, but ultimately no sufficiently satisfying solution was found. Further observations and resolution ideas are discussed in section 4.
In general, to prevent memory issues or revision loss, the dumps could be split into smaller datasets at <page> tags in a preprocessing step. After running the calculations on the separate smaller files, the resulting scores could be accumulated, because the calculations happen on a page basis in theory. The downside of this solution is that it would require a costly preprocessing step and most likely additional overhead for submitting a job to the cluster multiple times. Additionally, it would restrict the impact measures to those where accumulation over several separate pages is possible.

2.4.4 Page-Pageview merging

Section 2.2.2 explains that the Page and Pageview DataSets are merged so that pages obtain information about their traffic. At first this was realized with a standard .join operation on the Page DataSet. The drawback of using join is that the resulting DataSet only contains Page objects that matched with at least one Pageview object.
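The following self-contained sketch, with simple tuples standing in for the framework's Page and Pageview types, contrasts the two behaviours; the left outer join variant at the end corresponds to the fix adopted in the next paragraph, where unmatched pages are kept and the pageview side arrives as null.

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class PageviewJoinSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // (pageTitle, pageId) and (pageTitle, viewCount) stand in for Page and Pageview.
        DataSet<Tuple2<String, Integer>> pages = env.fromElements(
                new Tuple2<String, Integer>("Main_Page", 1),
                new Tuple2<String, Integer>("Rarely_Read", 2));
        DataSet<Tuple2<String, Long>> views = env.fromElements(
                new Tuple2<String, Long>("Main_Page", 1234L));

        // A plain join would silently drop "Rarely_Read"; the left outer join
        // keeps it and hands a null pageview to the join function instead.
        DataSet<Tuple2<Integer, Long>> pagesWithTraffic = pages
                .leftOuterJoin(views)
                .where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Long>, Tuple2<Integer, Long>>() {
                    @Override
                    public Tuple2<Integer, Long> join(Tuple2<String, Integer> page,
                                                      Tuple2<String, Long> view) {
                        return new Tuple2<>(page.f1, view == null ? 0L : view.f1);
                    }
                });

        pagesWithTraffic.print();
    }
}
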
It is not guaranteed that all pages of a wiki receive traffic within an hour, especially when a narrow time frame and therefore small pageview dataset is used. Nevertheless, those unvisited pages are created and edited by authors and therefore need to be present for correct ranking calculations. That is why a left outer join was implemented as the solution. In opposite to a normal join it will preserve all page objects that did not match and assign a value of null. A calculation depending on the pageview values could then decide how to handle such cases. It could either remove those pages with an intermediate filter step or handle the value null as zero, meaning that no traffic was generated on a particular page.</p><p>51 2.4 Problems and solutions</p><p>2.4.5 Differ Factories</p><p>The initial framework’s planning phase contained only one Differ and a related factory, because it was assumed that the basis for all calculations would be the same differences between two revisions. While implementing the contribution measures by Adler et al. it was realized that this approach does not suite well. Most of the contribution measures require different input values - the revision differences - for their calculation. This would require a factory and Differ class change every time another calculation would be used, what could lead to a more complex system with less usability for the user. To accommodate for that situation the central difference calculation was put aside and is now a part of each calculation’s preprocessing step. It comes with the advantage that a Calculation class stands for itself and can be exchanged easily.</p><p>2.4.6 Cluster configuration</p><p>On the student cluster the Peel Framework63 is used to manage the Apache Flink configuration on all nodes. It automatically configures and starts all necessary dependencies and instances, thus abstracting from the configuration layer. While working with the framework on the cluster, several adjustments to the cluster’s configuration were needed.</p><p>Akka timeout After implementing all contribution measures and executing them on the cluster, it would eventually fail with an error message saying that an “ask timed out“ (listing 14). From the error message one learns that it is related to Akka64. Akka helps Apache Flink to communicate with all distributed nodes. Peel configures a value of 10,000 milliseconds per default, but that was not enough for successful operation. Increasing the value to 50,000 by setting akka.ask.timeout resolved the timeout issue.</p><p>Network buffer Another issue that was resolved by adjusting the cluster con- figuration was the amount of network buffers. It correlated with the paralellism</p><p>63http://peel-framework.org/ 64https://cwiki.apache.org/confluence/display/FLINK/Akka+and+Actors</p><p>52 2.4 Problems and solutions</p><p>1 org.apache.flink.client.program. ProgramInvocationException: The program execution failed: Job execution failed. 2 [...] 3 Caused by: org.apache.flink.runtime.client. JobExecutionException: Job execution failed. 4 [...] 5 Caused by: java.lang.IllegalStateException: Update task on instance a69314c8c1892b9bc52c23ef675202fc @ ibm-power-3 - 64 slots - URL: akka.tcp://flink@130 .149.21.80:6122/user/taskmanager failed due to: 6 [...] 7 Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@130 .149.21.80:6122/user/taskmanager#689599963]] after [10000 ms] 8 [...] 
Listing 14: Important lines of the Akka “ask timed out“ error message when the job execution fails. value. When executing the framework with a parallelism higher than 100 a Insufficient number of network buffers exception was thrown. Trial and error adjustments of the options taskmanager.network.numberOfBuffers and taskmanager.network.bufferSizeInBytes to a value of 262144 removed the limitations and allowed to use a parallelism up to 400.</p><p>Shared access The fact that the cluster was shared between students lead to problems that are not immediately related to the implementation, but of organi- zational nature that influenced the development and testing of the framework. For reliable results and performance only one person was allowed to use the cluster at a time. A slack65 channel was used to request and assign a time slot. The fair-use rule applied, so prolonged occupation of the cluster was not welcomed, what made tests on larger datasets difficult due to the needed amount of processing time. Therefore, smaller dumps had to be used and one had to have a well thought out plan on which tasks to run. Consistency of those datasets throughout the devel-</p><p>65https://slack.com</p><p>53 2.4 Problems and solutions opment was hard to maintain, because they were deleted several times from the Apache HDFS filesystem. Either by a student by accident or by the administration without prior notice. In the end re-downloading the datasets was necessary, thus leading to newer and slightly different dumps due to more revisions, when the older ones were not available anymore. This is the main reason why the datasets change throughout the thesis and why smaller ones were preferred not only due to shorter download times. In rare cases individual nodes of the cluster went offline and had to be manually restarted by an administrator, what could take up to several days. Without all nodes, the cluster’s resources decrease and render the comparison of repeated performance tests impossible.</p><p>2.4.7 Differences to Adler et al.</p><p>During the preparation for the evaluation part of the thesis, both the framework and Adler et al. WikiTrust system were run on different dumps. Surprisingly the resulting scores did not match, what caused a deeper investigation of the parsing process. Adjusting the tools to output which pages and revisions were analyzed allowed to resolve all of the following mismatches:</p><p>Pages A handful of revisions that the framework included, but WikiTrust omit- ted belonged to pages with a <redirect> tag. After removing all pages with that characteristic, the framework was missing revisions that WikiTrust preserved. Further analyzing all details revealed that only pages with the aforementioned tag and a colon in the page’s title were removed. The responsible dump parsers were modified to abort the whole parsing of a page, if it matches the criteria.</p><p>Revisions Further analysis showed that the framework didn’t handle certain cases very well. For example a page may have revisions where the author has been deleted and the XML tag is shortened to <contributor deleted="deleted" />. The parser could not find a username nor an IP address and assumed them to be null. This in turn prevented successful skipping, when two such consecutive revisions appeared. WikiTrust’s revision skipping implementation is based on the author’s ID, whereas the framework checks usernames. 
The latter method should</p><p>54 2.4 Problems and solutions yield the same results, because the “Create account“66 page refuses duplicate user- names with the error message “Username entered already in use. Please choose a different name.“ and “once a username has been changed, existing contributions will be listed under the new name in page histories, diffs, logs, and user contri- butions.“ [45] In theory one username maps to one ID and vice versa. However, the acewiki-20170501 dump contains revisions by different authors, but with the same ID of 0. Listing 15 shows an excerpt from the dump. WikiTrust only com- pares the IDs and therefore removes revision 1050 from further analysis. It is not clear which approach the right one is, but for better approximation of the WikiTrust program, the Framework now checks the IDs first. Another matching problem was based on the IP-addresses for anonymous users. The assumption was that those authors are treated the same and two consecutive anonymous revisions lead to a skipping step. But at this stage, WikiTrust still separates all revisions, even taking IP-addresses individually and only skipping a revision if its succes- sor has the same IP. The framework was modified to respect this behavior, too. All bugs were resolved by changing the SkipXMLContentHandler and RegexSkip- DumpParser to correctly skip revisions. Those bugs began to surface with larger dumps like acewiki-20170501. In contrast the aawiki-20170501 dump, which was extensively used during the development phase, does not have enough revisions or pages for all edge cases to occur. Additionally, bgwiktionary-20170501 served as an isolated verification instance after adjusting the framework. Another misbehavior that we believe to be a bug in WikiTrust, was discovered while implementing the RegexXMLPageParser. The regular expression for the start of a revision’s text segment is <text xml:space="preserve">. However, when down- loading the history for a single page, like “David Siegel (screenwriter)“67 through Wikipedia’s “Export pages“-tool68, the text segments will contain another attribute indicating the amount of bytes <text xml:space="preserve" bytes="540">. Therefore, WikiTrust’s regular expression does not match and it fails to process this single-page dump. We believe that parsing such dumps should also be possible and adjusted the regular expression to match <text xml:space="preserve.*">. It allows any other attributes as long as the last one ends with a quotation mark.</p><p>66https://en.wikipedia.org/wiki/Special:CreateAccount 67https://en.wikipedia.org/wiki/David_Siegel_(screenwriter) 68https://en.wikipedia.org/wiki/Special:Export</p><p>55 2.4 Problems and solutions</p><p>1 <revision > 2 <id>1050</id> 3 <parentid>1049</parentid> 4 <timestamp>2008-08-26T04:14:43Z</timestamp> 5 <contributor> 6 <username>Lam Tamot</username> 7 <id >0 </id > 8 </contributor> 9 [...] 10 </revision> 11 <revision > 12 <id>1051</id> 13 <parentid>1050</parentid> 14 <timestamp>2008-08-26T05:04:25Z</timestamp> 15 <contributor> 16 <username>Keuramat</username> 17 <id >0 </id > 18 </contributor> 19 [...] 20 </revision> Listing 15: The listing shows parts of the acewiki-20170501 dump, where different usernames map to the same ID.</p><p>Finally both programs emitted the same list of revisions. 
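As a small, self-contained illustration of the <text> tag mismatch described above: the two patterns below are the ones quoted in the text, while the sample tag strings are made up for the check.

import java.util.regex.Pattern;

// WikiTrust's original pattern misses <text> tags that carry an additional
// bytes="..." attribute, while the relaxed pattern accepts both forms.
public class TextTagRegexCheck {
    public static void main(String[] args) {
        String plain = "<text xml:space=\"preserve\">";
        String withBytes = "<text xml:space=\"preserve\" bytes=\"540\">";

        Pattern original = Pattern.compile("<text xml:space=\"preserve\">");
        Pattern relaxed = Pattern.compile("<text xml:space=\"preserve.*\">");

        System.out.println(original.matcher(plain).find());     // true
        System.out.println(original.matcher(withBytes).find()); // false
        System.out.println(relaxed.matcher(plain).find());      // true
        System.out.println(relaxed.matcher(withBytes).find());  // true
    }
}
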
The comparison of the two revision lists was realized by writing all revision IDs into two separate files and then using the comm command line tool: comm -2 -3 <(sort acewiki-revs-fw.txt) <(sort acewiki-revs-wikitr.txt) && comm -2 -3 <(sort acewiki-revs-wikitr.txt) <(sort acewiki-revs-fw.txt). It outputs the lines that are in the first file but not in the second. Running this check in both directions without any discovered mismatch implies identical files and thus identically parsed revisions and pages.

Authorship tracking An important piece of Adler et al.'s contribution measures is text authorship tracking "on words as the fundamental unit, to more closely approximate how people perceive edits" [4, p. 3]. Their algorithms implement this using recursion, because "the attribution of words in some version of the document depends on the attribution of matching words from earlier versions" [2, p. 21]. Recursion, however, is a problem for the chosen map-reduce approach of implementing the contribution measures on a distributed platform: it would require recursive subtasks to return values before the task is finished, which is in direct contradiction to a distributed system's "blocking property" [5, p. 1]. Because of this limitation, the simpler text tracking algorithm described by Adler and Alfaro was implemented. It only considers three revisions to track words: from the consecutive revisions $r_{i-1}$, $r_i$, $r_{i+1}$, the first two, $r_{i-1}$ and $r_i$, are used to compute the text added in $r_i$; in a second step the algorithm checks how much of $r_i$ is still present in $r_{i+1}$ [cf. 3, p. 266].
On the downside, this recursion-free algorithm is vulnerable to manipulation. Adler and Alfaro model a situation in [3, p. 266] where restoring the wiki from a vandal's or spammer's damage leads to the restored text being attributed to the author who repaired the wiki, even though she did not contribute the text herself. This behavior can easily be abused: given a malicious attacker with two accounts, $A_d$ for deleting and $A_r$ for restoring, she first deletes all text with $A_d$ and immediately restores the previous version with $A_r$. By sacrificing the reputation of the throwaway account $A_d$, $A_r$'s reputation might increase, because the revision's text is now associated with $A_r$. This attack could also be repeated on different pages or within reasonable time intervals. This was considered an acceptable risk, because developing a workaround or an equivalent implementation of the original text authorship tracking algorithm would be beyond this thesis' scope. In general, it might be realizable if the framework's parallelization were limited to map processes on whole pages instead of single revisions. Since this is not the case for the current example and reference implementation, some deviations from WikiTrust are expected.

2.4.8 Relevance Aggregation

A simple formula such as $s(a) = \sum_{r \in R_a} f(r) + g(r)$, which calculates an author's relevance score $s(a)$ from the set $R_a$ of her revisions $r$ by adding the individual functions $f(r)$ and $g(r)$, can be approached with the framework in two different ways. The first possibility is to implement only one Calculation and one Differ class, where the latter implements $\sum_{r \in R_a} f(r) + g(r)$ and the Calculation emits it. $f$ and $g$ are then bound to that Differ and probably not easily reusable by other Differs or Calculations.
The second, more portable way is to implement $s'(a) = \sum_{r \in R_a} f(r)$ and $s''(a) = \sum_{r \in R_a} g(r)$ as separate Differ/Calculation classes. A third class implements $s'''(a) = s'(a) + s''(a)$ using the framework's RelevanceAggregator to combine the results of $s'(a)$ and $s''(a)$. The calculation order does not matter due to the associativity of $+$, therefore $s(a) = s'''(a)$ holds. This changes, however, when multiplication or division is the operator between $f$ and $g$. In such a case the previous equality breaks, and special care needs to be taken if one wants to calculate $s'''_{*}(a) = s'(a) \cdot s''(a)$, because internally $s'(a)$ and $s''(a)$ sum the values of $f$ and $g$, respectively, for each author. To resolve the problem, $s'''_{*}(a)$ needs access to the results of $f(r)$ and $g(r)$ before they are summed, so that it can multiply all elements pair-wise first. This was achieved by adding a method getBaseResult, which should return an IRelevanceScore DataSet of revisions with their associated rating ($f(r)$ or $g(r)$), to the ICalculation interface. Those DataSets can then be passed to the RelevanceAggregator as usual.

Key points of section "Problems and solutions"
• During and after the implementation, slight adjustments to the framework's design were necessary to remedy problems.
• Processing datasets with several million revisions surfaced previously unnoticed bugs in WikiTrust and the framework, most of which were resolvable.
• Fine-tuning the cluster configuration and analyzing Apache Flink's internal classes helped to reduce job failures.

3 Evaluation

After finishing the implementation, the resulting framework needs to be evaluated with regard to its correctness and performance, answering the question whether distributed processing can lead to faster author ranking calculations. The following sections cover the three aspects below and focus on an objective execution and measurement of all performance tests. The implications of the results, and the question above, will be discussed and answered in the following section "Discussion".
1. Comparison of the framework vs. WikiTrust.
2. Performance differences between the framework's contribution measures and possible optimizations.
3. Impact of the pageview processing on the performance.
For the first point, the WikiTrust program implemented by Adler et al. will be used as the reference system for the evaluation. The performance and results of the framework will be compared to those of WikiTrust. In a second step, the individual performance of the framework's example contribution measures is evaluated. Last but not least, the performance impact of incorporating the pageview information is analyzed.
For all performance-related comparisons the raw execution time reported by the GNU time69 command is considered. That means the cluster's setup and tear-down time counts towards its performance. To account for sporadic anomalies due to hardware or software delays, each execution is repeated five times on the same dataset and the average is calculated.

3.1 WikiTrust vs. framework

Adler et al. implemented a "modular tool [...] [that] processes XML dumps from the Wikipedia [...] which we instrumented to calculate the various contribution measures we have defined" [4, p. 4]. This tool is open source and was originally published at http://trust.cse.ucsc.edu, but was offline when accessed.

69http://man7.org/linux/man-pages/man1/time.1.html

59 3.1 WikiTrust vs.
framework</p><p>In an email from one of the authors, Luca de Alfaro, it was clarified that the program called “WikiTrust“ is now available on GitHub [18]. At the time of writ- ing, the last change was more than three years ago with code pieces dating back no less than seven years. Most of the program is written in OCaml70, “an in- dustrial strength programming language supporting functional, imperative and object-oriented styles“ [31]. For the framework’s evaluation, the WikiTrust program should serve as a baseline and direct indicator of a possible performance improvement. The source code71 indicates that Python’s multiprocess package splits and distributes the work onto all cores for parallelization. However, it is limited by the physical amount of CPU cores of the system it runs on. In theory higher parallelization by distributing the work across multiple systems should increase the performance and reduce the processing time. This is what we want to achieve with our framework.</p><p>3.1.1 Setup</p><p>WikiTrust’s repository contains several README files and instructions on how to compile and use the tool. Unfortunately there were a lot of problems with soft- ware version incompatibilities, for example due to updated dependencies. Missing version information about the systems and software used to run WikiTrust lead to intensive debugging and testing until a satisfying working configuration was found. The newest OCaml release 4.04.072 is not compatible with all of WikiTrust’s de- pendencies. For example the package “sexplib“ lacks essential files for a successful compilation that are not necessary with OCaml up to version 4.02.3. Further- more, the recommended package management tool GODI73 was discontinued in 2014. Any attempt to use it anyway to install the necessary dependencies was not successful. Luckily, another package manager, Opam74, is still under develop- ment and helped to finally build an important part of WikiTrust. Instructions on how to build it are described in section “Java, Flink, WikiTrust and framework</p><p>70https://ocaml.org/ 71https://github.com/collaborativetrust/WikiTrust/blob/master/util/batch_process. py 72https://ocaml.org/releases/ 73http://godi.camlcity.org/archive/godi/index.html 74https://opam.ocaml.org/</p><p>60 3.1 WikiTrust vs. framework installation“. Comparable performance results are obtainable through equal circumstances and environments for both programs. Ideally the student cluster would allow to run WikiTrust and the framework on one single node, followed by running the latter one on the whole cluster. There was one reason why this ideal environment could not be created: OCaml does not fully support75 the IBM-PowerPC-architecture, thus not running or compiling programs for it. Even trying to manually backport a pull request76 and recompiling a custom OCaml version did not help, because it was intended for a newer OCaml release. Eventually, we came to the conclusion that another server with a suitable architecture was necessary. We dealt with this dilemma by renting a powerful virtual machine at DigitalOcean, which is not ideal due to shared hardware, but is better than no comparison. The server’s specification is detailed in section 1.5.2. A README-batch file advises to run a python script batch_process.py from the util/ folder. Regarding the instructions, it is a wrapper for the following 5 processing steps: 1. do_split is responsible for splitting the dump into smaller chunks. It saves them gzip compressed in the split_wiki/ folder. 2. 
do_compute_stats calculates statistics and creates “stat“ files, which are stored in stats/ and compressed as well. 3. do_sort_stats applies a sorting algorithm to the stat files. Results will be written into the buckets/ directory. 4. do_compute_rep is where the user reputation scores are calculated and then emitted into a user_reputations.txt file. 5. do_compute_trust computes text trust. All steps, except for do_sort_stats, are multi-threaded. Before running the tool on the test dumps, the source code was reviewed resulting in several conclusions and modifications. The reference paper [4] sets the amount of “judges“ of a revision to 10, but the script’s default is 6. This was changed to match the paper. The</p><p>75https://caml.inria.fr/ocaml/portability.en.html 76https://github.com/ocaml/ocaml/pull/225</p><p>61 3.1 WikiTrust vs. framework do_compute_trust step calculates text trust values and utilizes further external programs, but we learned that it does not affect the user reputation calculation and was therefore removed. It was also done to make the performance more comparable by focusing on the important tasks. Afterwards, a variable called end_time with a default value of Timeconv.time_to_float 2008 10 30 0 0 0; was discovered in the analysis/generate_reputation.ml file. From the paper we have learned that Adler et al. used to limit the amount of considered revisions for their computations by time. Our framework does not limit the processed revisions by time, so this restriction had to be removed by setting the date into the future (e.g. 1st of January 2020). After each change to the OCaml code, the WikiTrust program had to be recompiled. Three Wikipedia dumps were preferably used during the development of the frame- work, because of their rather small size and relatively fast processing speed on the author’s system: 1. aawiki-20170501: 3.5 MB / 229 KB bzip2 / 189 KB 7z 2. acewiki-20170501: 299 MB / 14 MB bzip2 / 9.4 MB 7z 3. bgwiktionary-20170501: 1384 MB / 49 MB bzip2 / 61 MB 7z The first dump has 80 processable revisions, the second one 56749 and the biggest test dataset has 899122. “Processable“ revisions are those that will be further processed that means consecutive edits by the same author have already been filtered out. In the beginning we had difficulties to reproduce the same processable revisions with the framework. It turned out that WikiTrust does some additional checks and filtering on pages and revisions. Modifying the methods add_revision and eval in the reputation_analysis.ml to print out revisions r' before they are added to the revs vector gave valuable insights into which revisions WikiTrust processes. The patch wikitrust-debug-patch.diff can be applied to get the same debug output, but it breaks WikiTrust’s further processing. Most bugs and their fixes were already described in “Problems and solutions“, like excluding pages with a specific <redirect> tag or the differences in detecting consecutive revisions, where WikiTrust matches using a contributor’s ID, but the framework uses the username. With larger dumps more edge cases and bugs appeared, but resolving all those issues was important for basing the comparison on the same parsed data.</p><p>62 3.2 Performance comparison</p><p>Ideally, this would lead to similar user reputation values in the end, making a pure performance comparison possible.</p><p>Key points of section “WikiTrust vs. framework“ • WikiTrust is written in OCaml, which does not compile for the student cluster’s PowerPC architecture. 
Therefore, an x86_64 virtual server was rented.

• To ensure that both programs operate on the same input data, debugging and patching WikiTrust was required.

• WikiTrust's last "text trust" processing step was omitted, because it did not influence the user reputations, but increased its overall processing time.

3.2 Performance comparison

On the DigitalOcean server a Bash script (see section 7.3) executed the WikiTrust program five times on each dump while monitoring its execution time using the GNU time command. Table 2 shows the resulting processing times. One can easily see that the bigger the dump, the longer it takes for WikiTrust to compute the user reputation scores, while it reaches a maximum throughput of around 100 revisions per second. It is noteworthy that WikiTrust operates on the 7z compressed dumps and gzip compresses the intermediate data between its processing steps.

Table 2: Time needed by WikiTrust to calculate the user reputations on the virtual machine without the last processing step.

Round        aawiki-20170501  acewiki-20170501  bgwiktionary-20170501
#1           23.18s           09:12.47m         02:44:54h
#2           23.09s           09:08.70m         02:45:56h
#3           23.20s           09:07.66m         02:45:29h
#4           23.21s           09:01.80m         02:45:04h
#5           23.41s           08:55.17m         02:44:16h
AVG          23.22s           09:05.16m         02:45:07h
Revisions/s  3.45 R/s         104 R/s           90.75 R/s

As for the framework, it does not have one special formula to calculate user reputations, but offers functionality to implement various ranking algorithms. Therefore, all contribution measures described in [4] were implemented using our framework. A more detailed explanation will be given in the next subsection "Framework optimizations", but for now the focus is on the performance in the aforementioned environment. Similar to WikiTrust, each contribution measure was run exactly five times on each of the three dumps. To improve the reproducibility of our tests, the framework's source code at the time of this evaluation can be found on the wikitrust_comp branch. As a reference for all further tests, a RegexPreFilter with the regular expression (?is).*<ns>0</ns>.* filtered pages from the initial Page DataSet that did not belong to the main namespace. Apache Flink 1.0.3 was configured to use 12 task slots and a total allocation of 24576 MB for the taskmanager's heap memory. The Apache Flink setup steps are explained in the appendix ("Java, Flink, WikiTrust and framework installation"), as is the nested for-loop used to run the framework on all three datasets ("WikiTrust and framework bash-loop"). Due to only having one node, there was no need to set up the shared file system HDFS, so both tools read their data directly from disk.

Table 3 shows the processing times for each contribution measure on each dump. Entries marked with '*' indicate that the framework reported an error and the job was canceled during computation; these times are not counted towards the averages. Some calculations end with the suffix "Single". Those implement the same contribution measure as their suffix-free equivalent, but bundle all computational steps in one Calculation and Differ class, similar to the example given in section "Relevance Aggregation". Looking at the column in the middle, we see that the Flink job fails for the EditLongevity-Calculation, but succeeds for its equivalent, bundled EditLongevitySingle-Calculation. The crashes can be attributed to the relatively small disk.
Apparently Apache Flink caches the intermediate results on the disk when multiple separate calculations are combined. Around 15 GB of temporary space was not enough and therefore the whole Job got canceled after the first task encountered an I/O exception. From the bgwiktionary column we learn that with the optimized “Single“-Calculations less crashes occur while improving the performance significantly.</p><p>64 Table 3: Framework processing times for all contribution measures on the DigitalOcean server. Entries marked with ’*’ indicate an error and are not included in the averages.</p><p> aawiki-20170501 acewiki-20170501 bgwiktionary-20170501</p><p>#1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG</p><p>NumEdits 2:04.07 0:47.64 0:26.46 0:19.25 0:20.72 0:47.63 0:58.02 0:44.20 0:42.95 0:38.47 0:39.92 0:44.71 3:02.94 3:14.91 3:04.54 3:07.83 3:08.31 3:07.71</p><p>TextOnly 0:27.03 0:19.32 0:17.52 0:22.25 0:13.98 0:20.02 0:40.40 0:19.66 0:40.73 0:36.18 0:40.89 0:35.57 2:55.77 3:31.73 3:22.24 3:14.42 3:26.96 3:18.22</p><p>EditOnly 0:22.91 0:17.24 0:18.70 0:18.80 0:18.18 0:19.17 0:45.97 0:44.30 0:50.31 0:47.31 0:46.34 0:46.85 3:11.67 3:23.16 3:09.68 3:22.93 3:13.56 3:16.20</p><p>TextLongevity 0:24.34 0:20.46 0:20.51 0:25.64 0:20.00 0:22.19 2:16.74* 2:03.44* 1:49.83* 2:09.09* 0:29.91* 1:45.80* 6:20.61 5:39.54 5:44.05 5:53.18 6:03.13 5:56.10</p><p>TextLongevitySingle 0:29.00 0:22.45 0:15.57 0:21.81 0:18.28 0:21.42 1:33.43 1:24.81 1:37.87 0:23.61* 1:36.17 1:33.07 4:08.71 4:01.03 4:03.36 3:55.59 3:48.82 3:59.50</p><p>EditLongevity 0:30.33 0:22.57 0:19.03 0:24.55 0:19.32 0:23.16 2:34.00* 2:20.61* 2:17.94* 2:20.38* 2:20.25* 2:22.64* 0:29.42* 0:28.70* 6:55.79 6:34.50 6:13.19 6:34.49</p><p>EditLongevitySingle 0:28.44 0:28.08 0:16.28 0:18.41 0:17.23 0:21.69 6:04.80 6:23.03 6:22.40 7:04.07 7:32.95 6:41.45 4:56.44 0:32.58* 4:04.41 4:14.45 4:00.99 4:19.07</p><p>TenRevisions 0:21.09 0:27.69 0:22.42 0:25.08 0:22.17 0:23.69 1:40.23* 1:38.87* 1:34.64* 1:26.61* 1:34.21* 1:34.91* 6:31.77 5:46.26 5:27.18 6:11.24 5:59.62 5:59.21</p><p>TenRevisionsSingle 0:34.48 0:19.74 0:13.35 0:13.94 0:16.38 0:19.58 1:24.40 1:18.23 1:22.83 1:13.19 1:17.56 1:19.24 3:29.53 3:38.01 3:44.30 3:37.68 3:27.37 3:35.38</p><p>TextLongevityWithPenalty 0:24.31 0:20.36 0:26.12 0:20.33 0:19.50 0:22.12 1:34.55* 1:30.46* 1:37.21* 1:26.12* 1:28.11* 1:31.29* 9:39.89 8:24.89 8:23.29 8:30.48 8:24.41 8:40.59</p><p>TextLongevityWithPenaltySingle 0:37.27 0:20.88 0:14.99 0:19.96 0:18.36 0:22.29 1:24.26* 1:33.21* 1:26.53* 1:32.58* 1:30.14* 1:29.34* 7:48.97 7:16.85 7:16.92 6:57.44 6:58.45 7:15.73</p><p>AVG 23.90 1:56.92 5:05.65 3.2 Performance comparison</p><p>The averages over all calculations for one dump can be used to approximate the revisions per second: • aawiki: 3 R/s • acewiki: 485 R/s • bgwiktionary: 2941 R/s When looking at a dump’s rows in both tables2 and3, one can see that the measured execution time usually fluctuates. Averages of all successful runs were calculated to minimize the impact of the comparison. The following formula describes the speed-up calculation mathematically. Given a dump D and Calculation C, let tFWC (D) be a function that returns the framework’s measured processing time in seconds for the Calculation C on the dataset D.</p><p>Similarly, let tWT (D) be a function that returns WikiTrust’s measured processing time in seconds on the dataset D. The speed-up in percent is then defined as   tWT (D) SC (D) = − 1 ∗ 100 tFWC (D)</p><p>SCAV G (D) shall return the speed-up based on the average computation times. 
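Written out in a single expression and instantiated with the rounded average times from tables 2 and 3, the definition gives, as a worked example for the bgwiktionary dump and the TextLongevityWithPenalty calculation (t_WT ≈ 9907 s, t_FW ≈ 520.6 s):

S_C(D) = (t_WT(D) / t_FW_C(D) − 1) · 100,
S_TextLongevityWithPenalty,AVG("bgwiktionary") ≈ (9907 / 520.6 − 1) · 100 ≈ 1803,

i.e. on this dump the framework processes the data roughly 19 times as fast as WikiTrust.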
A value of S_C(D) > 0 denotes faster processing by the framework with a speed-up of S_C(D) percent. Negative values indicate slower performance. For the smallest dump, WikiTrust is on par with the framework and even slightly faster on average. An initial speed-up can be observed for the mid-sized acewiki. On average WikiTrust needs 9 minutes to finish, whereas the framework's fastest calculation ends after 35 seconds (S_TextOnly,AVG("acewiki") ≈ 1432, i.e. roughly 15 times as fast) and its slowest is still 35.8 percent faster (S_EditLongevitySingle,AVG("acewiki") = 35.8) with an execution time of 6 minutes and 41.45 seconds. The biggest performance differences can be seen for the bgwiktionary dump. WikiTrust needs approximately three hours, whereas the framework only needs up to nine minutes, resulting in a speed-up value of S_TextLongevityWithPenalty,AVG("bgwiktionary") = 1803.03, meaning the framework is around 19 times faster.

While conducting WikiTrust's performance tests, the single-threaded statistics sorting step was responsible for a major share of the whole processing time. A second performance test with only the first two processing steps, namely do_split and do_compute_stats, was done. Table 4 shows the resulting processing times without the do_sort_stats and do_compute_rep steps.

Table 4: WikiTrust's raw performance for parsing and calculating the statistics file by omitting the last three computation steps.

Round        aawiki-20170501  acewiki-20170501  bgwiktionary-20170501
#1           07.28s           1:43.27m          15:00.68m
#2           07.15s           1:44.99m          14:54.54m
#3           07.00s           1:43.79m          15:00.13m
#4           07.05s           1:46.47m          14:53.47m
#5           07.21s           1:47.84m          15:00.97m
AVG          07.14s           1:45.27m          14:57.96m
Revisions/s  11 R/s           539 R/s           1001 R/s

Comparing these average speeds with the framework's from table 3, we see that WikiTrust definitely outperforms it for the smallest dump by a factor of 3. Ignoring acewiki's outlier EditLongevitySingle and the crashed runs, all other contribution measures perform faster than WikiTrust. The speed-up is even more visible for the bgwiktionary, where the framework performs around three times faster on average.

Key points of section "Performance comparison"
• For certain combinations of large dumps and calculations, job failures occur on the DigitalOcean server.
• WikiTrust's do_sort_stats takes significantly longer than do_compute_stats, but the framework is as fast as the latter.
• The framework is more reliable and faster when only a single Differ and Calculation class without a RelevanceAggregator is used.

3.3 Framework optimizations

3.3.1 Measuring contributions

Throughout the thesis, the so-called contribution measures by Adler et al. were used without introducing them in detail. In the paper "Measuring Author Contributions to the Wikipedia" they define and analyze seven formulas that, when applied to all revisions and summed for each author, shall return an author's score that she earned with her contributions to a wiki. All contribution measures rate contributions with more emphasis on the quality than the quantity of an edit, while still differing in specific aspects, such as preventing malicious abuse of the formula [4, p. 4-5]. A contribution measure is a function of type A ↦ R, with A being the set of all authors of a wiki and R the rational numbers. It returns a score for an author a ∈ A based on all her authored revisions from all pages of a wiki.

• NumEdits simply counts an author's edits and returns the total number.
• TextOnly counts the number of words that were added by the author.
• EditOnly calculates the edit distance between a revision and its immediate predecessor. It measures the size of the change by accounting for inserted, moved or deleted words.
• TextLongevity puts emphasis on the quality of an edit by not only counting the number of words added, but also whether and how those decay over the following revisions.
• EditLongevity is similar to TextLongevity, but represents the edit distance of an edit multiplied with the average edit quality of the following revisions.
• TenRevisions measures the usefulness of insertions by checking how much of the text is still present in the next ten revisions.
• TextLongevityWithPenalty is a combination of TextLongevity and EditLongevity with the goal of rewarding authors of new content while, at the same time, punishing vandals who delete or insert inappropriate content.

A more detailed explanation and the exact formulas and thoughts behind those contribution measures are described in the reference paper [4]. The framework's implementation is based on the paper's definitions and formulas and follows them as closely as possible, except for the circumstances that were discussed in the previous chapter. So far, despite the different results in comparison to the WikiTrust system, a huge speed-up was measured for the individual contribution measures. The observed execution times (see table 3) show that some take up to a factor of 2.5 longer depending on the calculated measure. Furthermore, the prolonged processing times belong to the more complex contribution measures, which are composed of several rating functions, such as the TextLongevityWithPenalty. However, as previously discovered, those compound contribution measures' performance can be increased by implementing all calculations in a single Differ class, thus forgoing the RelevanceAggregator and its caching overhead.

3.3.2 Cluster performance

To see how higher parallelization with more task slots further improves the speed-up, the same datasets and calculations were executed on the student cluster. During the initial testing phase, node 6 had some problems leading to irregular job failures. Therefore, the faulty node was excluded from the configuration and all tests continued with 6 task managers and 380 instead of 400 task slots. Table 5 lists the processing times for the same tests as in the previous section, but this time run on the student cluster. The resulting average throughput in revisions per second is:

• aawiki: 2.37 R/s
• acewiki: 560 R/s
• bgwiktionary: ≈ 18294 R/s (899122 revisions divided by the overall average of 49.15 s)

It is observable that the average NumEdits times for the aawiki are nearly the same as for the roughly ten times bigger acewiki, which even outperforms the aawiki on average by one second for the TextOnly and EditOnly measures. Another observation is that the overall average processing time for the biggest dump, the bgwiktionary, is shorter than for the acewiki, which is more than four times smaller. The contribution measures that increased the acewiki's total aver-
Calculations like the TenRevision- sSingle or TextLongevitySingle that finished flawlessly on the virtual machine are also significantly faster and do not seem to be impacted by the slowness. Although an exact one-on-one comparison of the framework’s execution times on the virtual machine at DigitalOcean.com (table3) to its performance on the student cluster (table5) is not possible due to different hardware specifications, rough tendencies are visible: except for the aawiki, that was discussed earlier, the other dumps were processed more quickly and without failures.</p><p>70 Table 5: Measured framework processing times for all contribution measures on the student cluster.</p><p> aawiki-20170501 acewiki-20170501 bgwiktionary-20170501</p><p>#1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG</p><p>NumEdits 0:39.72 0:29.33 0:27.56 0:26.14 0:26.38 0:29.83 0:44.57 0:29.58 0:29.17 0:26.07 0:26.45 0:31.17 0:34.00 0:36.85 0:34.64 0:37.27 0:37.26 0:36.00</p><p>TextOnly 0:30.25 0:26.47 0:26.95 0:27.79 0:26.98 0:27.69 0:26.51 0:25.86 0:25.36 0:26.46 0:26.66 0:26.17 0:37.56 0:37.30 0:34.33 0:37.95 0:36.03 0:36.63</p><p>EditOnly 0:29.20 0:26.91 0:27.26 0:27.20 0:27.36 0:27.59 0:26.12 0:25.22 0:25.63 0:27.63 0:27.14 0:26.35 0:40.81 0:35.89 0:38.13 0:35.38 0:35.38 0:37.12</p><p>TextLongevity 0:36.91 0:33.66 0:36.05 0:29.74 0:36.06 0:34.48 1:31.87 1:42.66 1:31.30 1:32.09 1:30.86 1:33.76 0:48.43 0:49.87 0:50.37 0:49.38 0:49.96 0:49.60</p><p>TextLongevitySingle 0:27.85 0:26.13 0:26.63 0:28.10 0:25.55 0:26.85 0:37.50 0:38.78 0:39.79 0:39.45 0:39.00 0:38.90 0:41.10 0:40.79 0:42.79 0:40.17 0:52.43 0:43.46</p><p>EditLongevity 0:37.96 0:35.02 0:35.88 0:35.12 0:41.19 0:37.03 3:17.55 3:42.47 3:04.55 2:57.41 3:44.86 3:21.37 0:58.60 1:18.81 0:55.09 0:57.66 0:59.47 1:01.93</p><p>EditLongevitySingle 0:27.65 0:25.02 0:25.71 0:25.34 0:23.15 0:25.37 2:01.24 2:42.62 2:38.88 2:26.82 2:33.58 2:28.63 0:41.11 0:41.85 0:45.82 0:42.61 0:40.26 0:42.33</p><p>TenRevisions 0:37.72 0:38.37 0:38.07 0:35.29 0:33.25 0:36.54 1:26.82 1:45.42 1:29.97 1:32.53 1:44.56 1:35.86 0:49.19 0:46.70 0:49.54 0:50.10 0:49.97 0:49.10</p><p>TenRevisionsSingle 0:30.41 0:24.16 0:24.56 0:26.37 0:23.87 0:25.87 0:35.67 0:38.35 0:39.69 0:38.42 0:36.72 0:37.77 0:41.68 0:42.92 0:41.38 0:41.84 0:41.25 0:41.81</p><p>TextLongevityWithPenalty 0:53.81 0:58.46 0:52.26 0:49.23 0:47.23 0:52.20 3:37.18 4:14.38 3:40.28 3:37.86 3:49.45 3:47.83 1:16.34 1:15.42 1:14.64 1:16.53 1:16.55 1:15.90</p><p>TextLongevityWithPenaltySingle 0:46.32 0:42.45 0:52.93 0:52.26 0:45.26 0:47.84 2:46.85 3:32.14 2:54.46 2:49.16 3:29.45 3:06.41 1:04.75 1:04.72 1:07.06 1:10.50 1:06.80 1:06.77</p><p>AVG 0:33.75 1:41.29 0:49.15 3.3 Framework optimizations</p><p>Table 6: Page filtering performance by measuring processing times for counting the parsed pages on the student cluster.</p><p> aawiki-20170501 acewiki-20170501 bgwiktionary-20170501</p><p>#1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG</p><p>RegexPreFilter 0:25.35 0:23.16 0:29.08 0:27.36 0:23.00 0:25.59 0:21.62 0:21.43 0:20.97 0:21.30 0:22.20 0:21.50 0:31.60 0:31.36 0:30.24 0:32.01 0:31.70 0:31.38</p><p>XpathPreFilter 0:25.96 0:22.77 0:27.68 0:25.76 0:23.46 0:25.13 0:22.68 0:21.52 0:27.30 0:21.53 0:21.96 0:23.00 0:48.56 0:46.94 0:48.31 0:48.18 0:46.37 0:47.67</p><p>XqueryPreFilter 0:25.02 0:23.71 0:28.99 0:27.98 0:23.50 0:25.84 0:21.15 0:21.98 0:29.61 0:21.64 0:22.07 0:23.29 1:24.14 1:22.60 1:27.37 1:24.11 1:19.35 1:23.51</p><p>CustomPostFilter 0:25.47 0:26.75 0:22.97 0:23.28 0:31.06 0:25.91 0:21.52 0:31.60 0:20.96 0:30.53 
0:21.24 0:25.17 0:30.48 0:32.63 0:32.58 0:32.39 0:31.28 0:31.87

AVG 0:25.62 0:23.24 0:48.61

3.3.3 Optimizing with filtering methods

The Pre- and PostFilters were introduced as a means to filter the pages of a dump that should be parsed and processed by the framework. In section 2.2.2 we introduced three types of PreFilters that are based on regular expressions, Xpath or Xquery queries. Additionally, PostFilters that operate on the fully parsed Page DataSet can be used to limit the pages for further processing. All described PreFilters and the custom PostFilter from listing 5 were implemented to filter pages that belong to the namespace with ID 0. They were individually enabled for five test runs on the aawiki-20170501, acewiki-20170501 and bgwiktionary-20170501 dumps on the student cluster, with a parallelism of 380. To measure the effective filtering time, no calculation was instrumented; instead, a simple .count() operation was issued on the Page DataSet after the filter was applied. The results in table 6 show no obvious tendencies, but rather mixed results. The aawiki can be considered an outlier, because all its average times are the same except for the milliseconds, and these might not be as accurate. The second column indicates that the performance decreases in the order in which the tests were run, with the RegexPreFilter being the fastest and the CustomPostFilter being the slowest. This observation is disproved by the last column, where the filtering times for the RegexPreFilter and CustomPostFilter are on the same level, but the Xpath- and XqueryPreFilters are significantly slower. Under the impression that larger dumps yield more accurate results, the same test was re-run on the warwiki. It has more than 2 million pages in its main namespace and a bzip2 compressed size of 456 MB. The following results were observed:

• RegexPreFilter: 1:36.78
• XpathPreFilter: 2:32.54
• XqueryPreFilter: 4:01.21
• CustomPostFilter: 1:46.29

The performance differences are in the range of several seconds and more obvious, resulting in the final order: RegexPreFilter < CustomPostFilter < XpathPreFilter < XqueryPreFilter

3.3.4 Optimizing with filtering loops

While conducting the previous filtering performance tests, the discussion about the method of applying the filters from the PostFilter paragraph in section 2.2.2 was recalled. A decision had been made to apply the PreFilters and PostFilters with the "outer loop" method, as shown in code listing 6. The following performance test was meant to check the assumption that this method is preferable. Therefore, both loop types were implemented on the branch filterorderperf in the AInputParser class (a sketch of both strategies is given below). To force the loops to iterate more than once, three PostFilters were implemented and added to the configuration's PostFilter list in the following order:

1. TestPostFilter: A filter for pages belonging to the main namespace with ID 0.
2. TestPostFilterMin: A filter for pages with 2 or more revisions.
3. TestPostFilterMax: A filter for pages with 10 or fewer revisions.

Pages that pass all three filters will have 2 to 10 revisions and belong to the wiki's namespace with ID 0. For the performance measurement, a .count() operation was issued on the Revision DataSet after being parsed with the SkipDumpParser on the student cluster. Table 7 shows the observed processing times and their averages.
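The difference between the two application strategies can be illustrated with the following hedged Java sketch against Flink's DataSet API; the generic PostFilter interface and the method names are simplified stand-ins, not the framework's actual classes.

import java.io.Serializable;
import java.util.List;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;

public class PostFilterStrategies {

    /** Minimal stand-in for the framework's PostFilter abstraction. */
    public interface PostFilter<T> extends Serializable {
        boolean accept(T element);
    }

    /** "Outer loop": one DataSet.filter() call per PostFilter, which results
     *  in n separate filter operators in Flink's execution plan. */
    public static <T> DataSet<T> applyOuterLoop(DataSet<T> data, List<PostFilter<T>> filters) {
        for (final PostFilter<T> f : filters) {
            data = data.filter(new FilterFunction<T>() {
                @Override
                public boolean filter(T element) {
                    return f.accept(element);
                }
            });
        }
        return data;
    }

    /** "Inner loop": a single DataSet.filter() call that checks all PostFilters
     *  and rejects an element as soon as the first filter fails. The captured
     *  list must be serializable (e.g. an ArrayList). */
    public static <T> DataSet<T> applyInnerLoop(DataSet<T> data, final List<PostFilter<T>> filters) {
        return data.filter(new FilterFunction<T>() {
            @Override
            public boolean filter(T element) {
                for (PostFilter<T> f : filters) {
                    if (!f.accept(element)) {
                        return false; // short-circuit on the first mismatch
                    }
                }
                return true;
            }
        });
    }
}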
Before comparing the times, we noticed that Apache Flink’s execution plan changes for both methods. The “inner loop“ only has one DataSet .filter() call, and therefore the execution plan only has this one additional step. The “outer loop“ approach needs n filtering steps in the execution plan, where n is the amount of filters in the list. Except for the first row, the average execution times are not significantly different, but the overall average for the “inner loop“ method is slightly faster.</p><p>73 3.3 Framework optimizations</p><p>Table 7: Comparing the “inner loop“ and “outer loop“ filtering times by counting revisions after parsing by applying three different PostFilters.</p><p>“inner loop“ “outer loop“</p><p>#1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG aawiki-20170501 0:29.78 0:21.82 0:28.04 0:27.92 0:23.80 0:26.27 0:26.16 0:30.12 0:25.58 0:35.88 0:23.05 0:28.16 acewiki-20170501 0:20.89 0:21.24 0:20.72 0:21.42 0:21.47 0:21.15 0:19.65 0:20.85 0:21.47 0:20.56 0:21.70 0:20.85 bgwiktionary-20170501 0:31.51 0:31.82 0:32.96 0:31.58 0:32.33 0:32.04 0:31.78 0:30.26 0:32.06 0:31.13 0:31.25 0:31.30 warwiki-20170501 1:37.27 1:34.76 1:38.33 1:48.88 1:39.04 1:39.66 1:35.00 1:52.05 1:37.45 1:41.55 1:49.00 1:43.01</p><p>AVG 0:44.78 0:45.83</p><p>3.3.5 Optimizing with Kryo serialization</p><p>When Apache Flink cannot serialize an object by itself, it falls back to the Kryo-serializer. From a more recent documentation one learns that the Kryo- serialization process may get a performance boost if sub-classed objects’ types are registered [10]. The framework uses inheritance and sub-classing frequently by implementing interfaces and abstract classes. In assumption that Kryo is mostly used for the serialization process, because most classes have private methods or fields that are not compatible with Apache Flink’s built-in serializer, a test was conducted: the following list of classes was registered to Kryo using the framework’s ExecutionEnvironment and the registerType method immediately after obtaining a Framework object: ArgumentsBundle, RelevanceAggregator, Configuration, Contributor, ContributorRelevanceScore, DataHolder, DoubleDif- ference, Page, Pageview, Revision, RevisionRelevanceScore, ContributorFactory, DifferenceFactory, PageFactory, PageviewFactory, RelevanceScoreFactory, Re- visionFactory, SkipDumpParser, SkipXMLContentHandler, XmlInputFormat, PageviewParser. TextLongevityWithPenalty is a contribution measure that splits its final result up into three separate calculations, namely EditOnly, TextLongevity and EditLongevity, meaning that a lot of objects are likely to be sent across the cluster. If registering the classes improves the performance, then it should be measurable with that contribution measure. Similar to the previous tests, the program was run five times with a parallelization of 380 on all three dumps. The execution times without the Kryo registration were re-used from table5. The results in table8 do not show any specific trend in regards to a speedup. 
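For reference, a minimal sketch of how such a type registration could look, assuming access to the framework's ExecutionEnvironment; the helper method and the way the environment is obtained are illustrative, only env.registerType() is Flink API.

import org.apache.flink.api.java.ExecutionEnvironment;

public class TypeRegistration {

    /** Registers the given classes with Flink's serialization stack so that
     *  a registered ID can be used instead of writing the full class name
     *  with every serialized record. */
    public static void registerAll(ExecutionEnvironment env, Class<?>... types) {
        for (Class<?> t : types) {
            env.registerType(t);
        }
    }

    // Example usage with some of the framework classes listed above
    // (assuming they are on the classpath):
    // registerAll(env, Page.class, Revision.class, Contributor.class,
    //             Pageview.class, RevisionRelevanceScore.class);
}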
For</p><p>74 3.3 Framework optimizations</p><p>Table 8: Measured processing times for the TextLongevityWithPenalty calculation with and without type-registration for the Kryo serializer on the student cluster.</p><p>Without type-registration With type-registration</p><p>#1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG aawiki-20170501 0:53.81 0:58.46 0:52.26 0:49.23 0:47.23 0:52.20 1:01.82 0:52.44 0:55.17 0:50.09 0:44.50 0:52.80 acewiki-20170501 3:37.18 4:14.38 3:40.28 3:37.86 3:49.45 3:47.83 3:53.38 4:23.58 4:24.05 3:48.77 3:50.47 4:04.05 bgwiktionary-20170501 1:16.34 1:15.42 1:14.64 1:16.53 1:16.55 1:15.90 1:11.35 1:11.17 1:13.34 1:13.49 1:10.90 1:12.05</p><p>AVG 1:58.64 2:02.97 the aawiki, both average similar times, but it gets undefined for the other two dumps. The average processing speed for the acewiki is significantly (~6.65%) slower with the Kryo type registration. But on the other hand, the speed improves significantly (~5.07%) for the bgwiktionary. Comparing the total averages, no significant performance improvements are evident.</p><p>3.3.6 Optimizing with dump parsing</p><p>As a last performance optimization resort, the SkipDumpParser was compared to the RegexSkipDumpParser to learn which parsing method performs better. The former uses Java’s SAX library, whereas the latter uses regular expressions simi- lar to WikiTrust. The code for this performance measurement is archived on the regexparser branch. Due to the framework’s extensibility, switching between both parsing techniques was easy and straightforward by setting and adjusting the cus- tom RegexDumpParserFactory. To abstract from the contribution measures pro- cessing time, the framework’s task was limited to parsing all revisions and counting them, like done in previous measurements. During the RegexSkipDumpParser’s first performance test, the debug System.out.println statements were not re- moved, which resulted in an enormous performance loss, compared to the second run without them. The measured processing times in this section and table9 are from the latter one. Relying on the SkipDumpParser for the best performance can be done for the aawiki and acewiki, because for those dumps it gives the best average performance. However, when comparing the other two dumps, extracting</p><p>75 3.4 Pageviews parsing</p><p>Table 9: Performance of the RegexSkipDumpParser vs. the SkipDumpParser mea- sured by their computation time for .count() on the resulting Revision DataSet.</p><p>Regex Parser XML Parser</p><p>#1 #2 #3 #4 #5 AVG #1 #2 #3 #4 #5 AVG aawiki-20170501 0:35.95 0:24.35 0:23.77 0:23.10 0:23.21 0:26.08 0:26.42 0:22.78 0:22.95 0:29.12 0:27.65 0:25.78 acewiki-20170501 0:33.19 0:42.71 0:31.46 0:30.30 0:31.14 0:33.76 0:30.22 0:31.70 0:31.04 0:30.02 0:32.54 0:31.10 bgwiktionary-20170501 0:28.06 0:26.45 0:25.42 0:26.65 0:24.72 0:26.26 0:31.77 0:32.27 0:31.94 0:31.64 0:33.68 0:32.26 warwiki-20170501 1:56.76 1:53.80 1:54.39 1:56.68 1:50.74 1:54.47 2:13.02 2:16.09 2:20.33 2:19.33 2:16.97 2:17.15</p><p>AVG 0:50.14 0:56.57 information with line-wise reading and regular expressions outperforms the native XML parser by at least 16.5% (warwiki) up to 18.5% (bgwiktionary). 
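As an illustration of the line-wise, regular-expression-based extraction idea, the following hedged sketch tests a dump line for a tag and extracts its content; it is illustrative only and not the RegexSkipDumpParser's actual code.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineWiseTagExtraction {

    // Pre-compiled pattern for a tag that fits on a single dump line,
    // e.g. <username>...</username>.
    private static final Pattern USERNAME = Pattern.compile("<username>(.*?)</username>");

    /** Returns true if the line contains the username tag. */
    public static boolean hasUsername(String line) {
        return USERNAME.matcher(line).find();
    }

    /** Returns the tag content, or null if the line does not contain it. */
    public static String getUsername(String line) {
        Matcher m = USERNAME.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}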
Averaging over all dumps, the RegexSkipDumpParser is still significantly faster, with an overall performance improvement of about 11.37%.

Key points of section "Framework optimizations"
• Compared to the virtual server, the cluster processes the contribution measures much faster and without job failures.
• The fastest page filtering method is the RegexPreFilter.
• Applying multiple PostFilters in a single filter DataSet-API call improves the performance slightly.
• The Kryo type registration does not result in faster processing times.
• The RegexSkipDumpParser's regular expressions and line-wise reading outperform the SkipDumpParser's native SAX library for larger dumps.

3.4 Pageviews parsing

WikiTrust does not incorporate pageview information into its calculations, and the contribution measures do not use this information either. The framework supports pars-
The modified TextOnly calculation can be interpreted as a measure that values edits to frequently visited pages, but ignores edits to pages that did not receive any traffic due to multiplication with zero. The framework expects two more positional command line arguments to activate the pageview parsing: 1. Path to the pageview dataset(s) 2. The wiki’s project name as explained in section 1.4.3 For the aawiki and acewiki it is simply the prefix before “wiki“, but for the bgwik- tionary one has to use “bg.d“ to get the correct pageview information. The TextOnlyWithPageview’s performance results are listed in table 10. The first obvious discovery is that all three dumps were processed in a similar time span of 3:52 minutes on average. It is around 2:20 minutes slower than the mean processing time without the pageviews. No correlation between the amount of revisions and the processed time is visible: Acewiki has the smallest processing time, followed by bgwiktionary and aawiki. The very first run using pageviews takes more than 1 minute longer than the other ones and is an obvious outlier.</p><p>Key points of section “Pageviews parsing“ • The pageviews data should be preprocessed as discussed in section 2.1.2.</p><p>• The framework supports the incorporation of pageviews data, but al- though it makes new contribution measures possible, it impacts the overall performance negatively.</p><p>78 4 Discussion</p><p>4 Discussion</p><p>This section will start with a discussion of the previous section’s evaluation results and their implications for the performance and the developed framework. After that the framework’s use cases such as the usability to calculate author ranking as well as an outlook at how it could benefit Wikipedia will be discussed. In the last part, future work in the form of modifications or tests that might further improve the framework will be outlined.</p><p>4.1 Performance and ranking results</p><p>Each section from the previous “Evaluation“ chapter revealed some performance and ranking results that need to be discussed under the aspects of what we learn from those and what their implications and meaning are.</p><p>4.1.1 WikiTrust vs. Framework</p><p>The first section dealt with the execution times of Adler et al. their WikiTrust program and the framework developed by us.</p><p>WikiTrust performance Throughout the performance tests, we noticed that the execution times always fluctuated in the range of several seconds. This was not limited to the framework, but also happened for WikiTrust, when no other work was done on the virtual server and both programs were allowed to exhaust all resources. An explanation is the shared environment where the virtual server does not have guaranteed dedicated resources and other customers can influence the overall hypervisor’s performance. It is likely that the virtual hard drive and the shared disk I/O performance is to blame for a major part of the fluctuation. Adler et al. note in the README-batch file that for larger wikis with more than 100,000 revisions, the processing speed for their first approach can be as low as 20 to 60 revisions per second. Therefore, they advise to use the second approach with the batch_process.py that was also used in our tests. Furthermore, Adler et al. 
claim in the aforementioned file that processing the Italian Wikipedia “took</p><p>79 4.1 Performance and ranking results about 1 day on an 8-CPU-core machine“, but without specific information about the amount of revisions, the needed processing time and the hardware that was used, it is impossible to verify, reproduce or put our own values in relation. Within eight years the wiki probably accumulated a lot more revisions, so its actual size was smaller in the past and therefore faster to process. We, on the other hand, can now say that WikiTrust is able to process around 100 revisions per second on a 12-core server (see 1.5.2 and table2), paving the way for further comparisons. It needs to be noted that the performance does not seem to scale linearly. Oth- erwise the comparatively small aawiki with its 80 revisions would be expected to be processed within 0.8 seconds. But it takes around 23 to finish bringing the speed down to less than 4 revisions per second (see table2). We believe that the increasing processing time stems from the overhead of combining the following two factors:</p><p>• Splitting the small dump into smaller files • Calling external programs for processing steps like 7z or gzip to de-/compress dumps, split_wiki or stats files</p><p>Framework performance WikiTrust is not alone with the additional overhead for relatively small dumps. The same observation was made for the framework, because its average processing time on the aawiki is also around 23 seconds (table 3) and 33 seconds (table5), whereas it processes the other dumps faster by a significant factor. For the latter one, some measured times for the aawiki were even longer than for the other dumps. Either this is a fluctuation in the cluster’s performance, or the circa 1.5s differences come from the overhead of parallelizing 80 items on 380 task slots, therefore slowing down the overall execution time by managing all tasks. We conclude that the smaller the dump and the higher the parallelization is, the more the overhead outweighs the content processing, thus leading to an overall reduced performance. At first, we were confident that we could gain a significant speed-up from dis- tributing the work onto multiple nodes, in comparison to WikiTrust on single node. However, due to the architecture limitations of OCaml and our budget, we could not create an ideal environment where this comparison was possible. The ap-</p><p>80 4.1 Performance and ranking results proximated environment only consisted of a single compatible node. Nevertheless, we were surprised about several observations: We did not initially expect WikiTrust’s sorting step to be the bottleneck leading to the quite long processing times in table2. After removing this particular step and looking at the table4, we expected it to be much faster than the framework. However, its average processing times on the same node are more or less in the same range. A reason could be that WikiTrust’s sophisticated authorship tracking takes more computational time, than the framework’s simpler implementation, thus bringing the measured time differences closer to each other. The general expectation was that the bigger the dump, the longer the execution time, because logically, more data has to be processed. With the results from table 5, this no longer holds true. Despite the ordering by size and processable revisions being aawiki < acewiki < bgwiktionary, the ordering by total average processing time changes to aawiki < bgwiktionary < acewiki. 
The contribution measures that increased the acewiki’s total average due to being slower than the bgwiktionary are TextLongevity, EditLongevity, EditLongevitySingle, TenRevisions, TextLongevity- WithPenalty and TextLongevityWithPenaltySingle. A possible conclusion is that Flink starts to cache data onto the disk, which decreases the performance drasti- cally. However, some calculations that finished flawlessly on the virtual machine are also significantly faster and do not seem to be impacted by the (caching) slow- ness. This is not unexpected, because not only were more nodes involved, but also faster CPU-cores and more RAM. Apparently scaling out an already distributed algorithm by increasing the amount of CPU cores and task slots can improve its performance. That scaling process is furthermore simplified by being based on technology like Java, Apache Flink or HDFS that pose as an abstraction layer for the hardware and operating system.</p><p>Processing failures Another surprising observation was that during the frame- work tests on the virtual server, the acewiki could not be processed successfully, because of not enough caching disk space. The amount of data that needed to be cached was in no obvious relation to the dump’s size. Acewiki’s pages or revisions must have a different structure or distribution, because the framework did not show similar problems with a several times bigger dump. Possible explanations</p><p>81 4.1 Performance and ranking results are either a lot of revisions in a single page, thus creating a long list of them, or huge revisions. Even when only small changes are made, a revision will contain the new complete content. All this data, however, needs to be handled by Apache Flink for processing and the single node was not powerful enough. Furthermore, all crashes also seem related to the compound contribution measures that use the RelevanceAggregator to combine their components’ results. Using the single class based implementation of those measures that end with the Single suffix prevents the creation of additional RelevanceScore DataSets that need to be cached until they are processed with the RelevanceAggregator, therefore reducing the amount of cached data and successfully finishing the task. Implementing contribution mea- sures in that way is recommended and the RelevanceAggregator should only be used if absolutely necessary. Ostensibly, higher performance is still more probable by implementing all relevance calculations in a single Differ and Calculation class. Unfortunately, the other tests and optimizations prevented a deeper look into this behavior, so finding an explanation and solution must be considered future work.</p><p>4.1.2 Sources of error</p><p>Sources of error The speed-up results from section 3.1 can only serve as an indicator of performance differences rather than absolute comparison values. The reason is that WikiTrust’s user reputation results are not equal to any of the frame- work’s. Figure 12 shows the former’s author ranking scores. Most contribution measures which were implemented with the framework have their results plotted in figures 13 to 16. The NumEdits, TextOnly and EditOnly are not included due to the number of returned authors, which was not within the realm of WikiTrust. As one can see, none of the contribution measures show the exact same bars as WikiTrust. Also no direct similarities or correlation can be recognized. Despite Adler et al. 
claiming that a tool was "instrumented to calculate the various contribution measures we have defined" [4, p. 5], none of the framework's results looked similar to WikiTrust's. This might be directly correlated to the long processing times, because WikiTrust's third computation step ("do_sort_stats") accounted for the majority of them, and none of the described contribution measures seemed to need sorting of any kind. Therefore, we gave WikiTrust the benefit of the doubt and timed its performance for the splitting and statistics computation steps (see table 4). Although the framework processes small dumps more slowly than WikiTrust, it starts to outperform its competitor when big dumps are processed. In view of the goal of processing the English Wikipedia with its huge amount of data, ignoring smaller and focusing on bigger dumps seems acceptable. After investigating the discrepancies, reading WikiTrust's source code and further papers by Adler et al., several sources of error were discovered. Overall, the process of trying to successfully reproduce any of WikiTrust's results and calculations was difficult and tedious, and was ultimately abandoned due to time constraints.

Continued development Since the publication of [4] in 2008, WikiTrust's development continued until mid-2014, as can be seen on GitHub's commit history page78. The tool has plenty of releases79, but offers neither a detailed changelog nor other useful information, apart from the paper's publication date, from which a correlation between the paper and the implementation of the contribution measures could be deduced. Most of the measures were referenced again in a dissertation [2] in 2012, which would indicate a re-use of the previously defined algorithms. However, the dissertation explores further ranking methods, which might have influenced the development of WikiTrust. Even more surprising was the fact that applying WikiTrust to its own test datasets in the test-data/ folder yielded slightly different results. Using the wiki1.xml and its wiki1.stats as a reference, the output differs in its content (e.g. TextInc and TextLife are omitted) as well as in its level of detail and values (e.g. differing EditInc or EditLife values). That leads to the assumption that maybe not all tools relevant to the paper [4] were relocated from the website to GitHub, or that they were modified in other ways. This is unfortunate, because it hinders correctly implementing and reproducing their work.

78https://github.com/collaborativetrust/WikiTrust/commits/master
79https://github.com/collaborativetrust/WikiTrust/releases

Base values An important thing to consider is error propagation. If the initial values on which the further algorithms are based differ, then those differences will propagate through all calculations and eventually lead to incorrect results. Some of these errors, such as parsing the wrong number of Page or Revision objects (see section 2.4), were identified and resolved. However, more such edge cases might exist that are yet to be discovered. We are confident that differences in the resulting user reputations also stem, at least in part, from the different text tracking approach, which was discussed in section 2.4.7. Without exact attribution throughout all revisions, an author might get rated for changes that were not originally authored by her, thus changing her and the original author's score.
Therefore, slight differences were likely, but expected to be in the same range, especially for the small test dumps. An unexpected discrepancy was the list of rated revisions. Reading the formulas in [4] creates the impression that the contribution measures are calculated for all authors and revisions. But WikiTrust’s generated statistics file for the aawiki-20170501 contains only 68 out of 80 revisions that were rated using EditLife. From the source code in analysis/wikidata.ml we learn that “[this] line gives the "edit longevity" of an edit.“ Analyzing the omitted 12 revisions did not indicate any obvious reasons for their exclusion and were considered perfectly valid. The possibility of bugs within WikiTrust was regarded, due to the fact that the reduced “statistics file contains information about every version [...]“ [4, p. 4]. All those discrepancies and uncertainties about the generation of the necessary basis for the author ranking could be partly responsible for the final results’ mismatches.</p><p>Implementation details Reading WikiTrust’s source code uncovered some more of its implementation details. For example the earlier explained variable end_time, which limits the amount of parsed revisions to a specific deadline. Two more variables rep_scaling and max_rep could be used to manipulate the gener- ated author reputation. At first it was thought that disabling the scaling by setting it to a multiplicative neutral value of 1 and increasing the max_rep to a sufficiently high value would only affect the final result’s scores. Several tests using different combinations of those parameters showed that this was not only the case, but that also the list of rated authors had changed, which was more than unexpected. With a neutral reputation scaling setting, WikiTrust returns 15 instead of 14 authors for the aawiki-20170501. To minimize the confusion about which resulting author lists to compare, all tests were conducted with the initially identified default pa- rameters due to this behavior. Furthermore, it did not become clear how exactly the claimed modularity of WikiTrust worked, making it impossible to focus and compare individual contribution measures between both programs. A second at- tempt to contact WikiTrust’s authors to clarify some of its operational flow and</p><p>84 4.1 Performance and ranking results configuration, was left unanswered. We believe that such subtle implementation differences were responsible for further dispersing the results.</p><p>Result Correlation Knowing that the results of WikiTrust are of limited use to validate the correctness of our implementation, other sources of information were searched for. The reference paper itself includes several statistics and results of the presented contribution measures for the “Wikipedia dump of February 6, 2007“ while considering only “versions before October 1, 2006“ [4, p. 4]. The used dump was not further specified, but we assume the English Wikipedia. Unfortunately, their analysis is based on a sample of 5 out of 25 million randomly selected revisions due to “memory limitations of the software package.“ [4, p. 5] Randomizing the sample dataset makes it impossible to reproduce the results without information about the samples, like all revision or page IDs. 
It also did not allow to put the framework’s results in relation with those from the paper.</p><p>4.1.3 Optimizations</p><p>The second part of the evaluation focused on running the framework on the stu- dent cluster and testing various optimizing strategies to further increase the perfor- mance. Some expectations were fulfilled, but others were invalidated by showing interesting results.</p><p>Filtering methods The expectation was that using PostFilters instead of Pre- Filters to reduce the Page DataSets’ size would be significantly slower, because all raw XML pages need to be parsed first. The results from table6 suggest that while our assumption held true for the aawiki and acewiki, where first parsing the pages and then filtering using a PostFilter is the slowest operation, it relativizes for the bigger dumps. For the bgwiktionary, the filtering times for the RegexPreFilter and CustomPost- Filter are on the same level, whereas the Xpath- and XqueryPreFilters are signifi- cantly slower. The latter observation about the PreFilters also holds for the war- wiki. We try to explain it with the implementation of the Xpath- and XqueryPre- Filter. Both get the full XML string of a page passed to their filter function,</p><p>85 4.1 Performance and ranking results which internally needs to parse it before applying the filter. Therefore, a page gets parsed twice: Once in the PreFilter and a second time in the DumpParser. Neither the RegexPreFilter, nor the CustomPostFilter require preprocessing in any form, thus the page will be parsed only once and processed faster. While the difference between bgwiktionary’s CustomPostFilter and RegexPreFilter are not significant, the full cost of filtering after parsing is more noticeable with larger dumps, like the warwiki, where a significant difference is measured (~8.95%). On the other hand, immediately passing the raw XML strings without prior examining it to the XML parser and checking the namespace afterwards could payoff when most pages belong to the target namespace and no or slight overhead of later removing pages occurs. Nevertheless, it surprised us that the CustomPostFilter outperforms the two PreFilters and is only a little slower than the RegexPreFilter. We conclude that the Xpath- and XqueryPreFilter should be avoided if perfor- mance is important. It proves the assumption that eliminating pages before the parsing step, leads to overall faster processing times. The question that remains is why this behavior is not observable for the two smaller dumps and why the acewiki is overall faster than the aawiki. It is probable that this can be attributed to mea- surement errors or the previously discussed overhead of processing relatively small dumps. Based on this outcome, we recommend the RegexPreFilter as a names- pace filter, because it provided the best overall performance, but depending on the requirements another filtering technique might perform better.</p><p>Filtering loops At the time of implementing the Pre- and PostFilters, it was believed that applying the filters in the “outer loop“ would yield a higher perfor- mance, because in each iteration the DataSet would contain less elements to filter, but apparently returning false immediately after the first mismatch is at least as fast. The fact that the former method needs n instead of one single filtering operation was not considered back then, and might be a reason for the slightly lower speed. 
Therefore, we believe that the bigger the list of filters to apply, the more switching to the “inner loop“ method should be viable and considered. Such a change is easy to accomplish for the PreFilters, because those are usually im- plemented in a dump parser. For the PostFilters, a change in the AInputParser abstract class is necessary, which cannot be directly changed by the framework’s</p><p>86 4.1 Performance and ranking results users. However, the InputParser class extends this abstract class, so overriding the applyPostFilters method without calling .super() is suggested. That imposes more work on the framework’s user, which is against the idea of simple extensi- bility. This minor design flaw will either be removed in an upcoming version or the “inner loop“ will become the default. We think that the measured performance gains (~2.29%) depend on the amount of filters, but that it is nearly the same for only one filter.</p><p>Kryo serialization After reading about the possible performance boost by reg- istering the used classes with the Kryo serializer, our hopes were high to see an- other performance improvement. Unfortunately, the results from table8 do not look promising and no obvious performance gain is visible. Considering that the quoted claim of a possible speed-up comes from the documentation for Apache Flink 1.2, and an older version is used for the framework, it might only hold for the newer version, due to features or code changes that are not available in the older one. The earlier version’s documentation does not explicitly hint towards a speed-up. On the other hand, a significant improvement was observed for the bg- wiktionary dump. However, the acewiki’s significant performance decrease should not be ignored. It could be a measurement error due to caching, and that Kryo’s type-registration only creates a benefit when no caching is involved. Nonetheless, based on the limited and mixed results, a confident conclusion cannot be drawn yet, because we do not want to warrant a possible performance loss. We believe that with further tests on bigger dumps, a more clear result can be seen. An up- date of the framework to the newer version was considered, but later discarded, because it might bring unforeseen side effects and was also not available on the student cluster.</p><p>Dump parsing We were surprised that the regular RegexSkipDumpParser out- performs the native XML parsing library for bigger dumps. During the implemen- tation and the prior performance tests, we believed that the library would process the XML tags more efficiently. Another supportive argument for it was that for some lines, the regular expression would be matched twice causing some overhead: First within the hasTag method to identify e.g a username XML tag, and then</p><p>87 4.1 Performance and ranking results again in the immediately following getMatch function to retrieve the matched con- tent. An explanation for the notwithstanding higher performance could be based on the fact that the native library processes every XML tag and parses all in- formation from it, for example present attributes, so that it can provide them in the function calls. Although it facilitates accessing the attributes and prevents bugs with regular expressions, it is not used very often for parsing the dumps and CPU time is lost. It is interesting that this observation does not hold for the two smaller dumps, because one could expect that less lines should also be pro- cessed faster. 
However, this is not the case and could be explained with Apache Flink’s distribution and orchestration overhead or with unknown measurement er- rors, but apparently the performance gain starts somewhere between the size of the acewiki and bgwiktionary. Putting the smaller dumps aside, a performance gain of more than 1/6th is enormous. We therefore conclude and recommend to use the RegexSkipDumpParser. No time was left to implement and test other native Java XML parser libraries or other parsing approaches, for example native string operations like .startswith, .endswith instead of regular expressions.</p><p>Pageviews parsing During the implementation of this feature, it was obvious that it would induce longer processing times due to additional input and filter operations, but we did not imagine times to slow down by a factor of five. The processing times of around four minutes are within the same realm as the slowest calculations on the acewiki (see table5). It is possible that while the page and pageview merging took place, disk caching was used, but unfortunately that was not verified during the performance tests. Another reason for this could be that the whole pageview dataset is read by the framework and then filtered for the correct, matching wiki project tag. This gives the freedom of reusing a preprocessed pageviews dataset for different wikis by simply passing another project tag. The other approach that was discarded, had the idea of the FileMerger-Tool already filtering the entries for the correct project tag. By accepting a slightly longer preprocessing time, the homogeneous pageview dataset would likely be drastically smaller, due to only a small percentage of the entries belonging to the given wiki and probably lead to a faster reading and merging process. However, the reason</p><p>88 4.1 Performance and ranking results for discarding this approach was that the target project tag had to be known prior to the preprocessing. If that is not the case, then the preprocessing would need to be repeated several times, thus decreasing the overall performance gain. On the other hand, given the apparent slowness of the chosen approach, adjusting the FileMerger-Tool to sort, split and output the pageviews for every available project tag into a separate folder can be discussed and further researched. Another research question that we did not have the time to follow was at how much pageview data the performance starts to decrease. Until those question are answered, the additional processing time has to be remembered when such new measures are researched.</p><p>4.1.4 General observations</p><p>Throughout the development and the performance tests, several general observa- tions could be made that need to be discussed or at least outlined for further research.</p><p>Performance fluctuation The observed and at several points mentioned fluctu- ation of processing times during the evaluation can have various reasons, which we were not able to identify satisfactorily. Due to the nature of a distributed system, those factors can be quite diverse and add up. That begins with network latencies and processing/polling timeouts from managing the different task slots. Further- more caches down from the CPU, HDFS up to Apache Flink’s caching methods, can all influence the processing time. As outlined multiple times before, we chose to average over five runs to combat sporadic execution time changes. 
4.1.4 General observations

Throughout the development and the performance tests, several general observations were made that need to be discussed, or at least outlined for further research.

Performance fluctuation The fluctuation of processing times observed and mentioned at several points during the evaluation can have various reasons, which we were not able to identify satisfactorily. Due to the nature of a distributed system, those factors can be quite diverse and add up. That begins with network latencies and processing/polling timeouts from managing the different task slots. Furthermore, caches from the CPU and HDFS up to Apache Flink's caching methods can all influence the processing time. As outlined multiple times before, we chose to average over five runs to combat sporadic execution time changes. The decision fell on five executions because it seemed like an acceptable trade-off between the reliability of the average value and the amount of time needed to obtain it. It would be interesting to learn whether further increasing the number of runs would yield more stable and reliable averages and lead to better performance comparisons. While conducting the various (de-)compression tests in section 2.1, the usual processing time fluctuation had not yet been noticed and was therefore ignored. Those values are based on a single run, but the tests were usually run several times with different numbers of task slots and still showed the same tendencies. Nevertheless, we believe that re-running those tests for better reliability is necessary.

Performance differences Most of the conducted performance tests had some controversial results, like a bigger dump being processed faster than a smaller one. Although we tried to attribute those occurrences to the degree of parallelism compared to the dump's size, no real evidence for this claim was obtained. This often left us with results that were hard to interpret or to draw a conclusion from. A further point that needs to be highlighted is that performance comparisons on the aawiki or acewiki usually led to processing time discrepancies that were either impossible to pin down due to the aforementioned problems, or too small to identify reliably. Applying those tests to the bigger dumps like bgwiktionary or warwiki improved that a bit and made performance differences more obvious and measurable. We suppose that, given the framework's purpose of processing large wikis in a distributed manner, the fluctuations and measurement problems with the small wikis can be overlooked. Instead, the focus needs to shift to the bigger dumps, where gaining performance improvements is desirable. The smaller dumps were continuously tested for consistency reasons.

Scalability With regard to the previous section, we tried to run the framework and its contribution measures on bigger dumps, like warwiki-20170501 (456 MB bzip2 and ~10 GB decompressed) or eowiki-20170501 (1.5 GB bzip2 and ~30 GB decompressed). In section "Implementation" some problems (see 2.4.3, 2.4.6) with bigger dumps were highlighted. At first, we thought that scaling the system from a five-year-old development machine to a virtual server and then to a student cluster would sufficiently increase the processing power to prevent those issues. It did help with the processing times and made it possible to process some bigger dumps, but after unsuccessfully trying to run the TextOnlyWithPenaltyCalculation on the warwiki in the evaluation phase, we have to admit that there may be design flaws. Although the parsing step seems to run without any exceptions, the job was canceled at the end due to errors related to Apache Flink's sort buffer when flatMapping the revisions into the DataHolder's Revision object DataSet:

• Cannot write record to fresh sort buffer. Record too large.
• 'SortMerger Reading Thread' terminated due to an exception: null
• 'SortMerger Reading Thread' terminated due to an exception: Error at remote task manager

Repeated executions resulted in different error messages, some of which were discussed online⁸⁰,⁸¹,⁸², but applying the suggested changes did not help.
Although increasing taskmanager.memory.fraction to 0.8 and decreasing taskmanager.numberOfTaskSlots to 8 (totaling 40 task slots) led to more data being processed, the job failures ultimately did not stop occurring. So, in essence, the framework fails to properly calculate author rankings for dumps surpassing a certain size. Three major points are considered limiting factors:

1. Data representation
2. Calculation processing
3. Dump parsing

Data representation The initial idea of how to represent the wiki dump data within the framework was described in section 2.2.1. The main idea was to organize the data around a central DataHolder and then access this data for further processing. That method forces the preservation of all objects that are in the DataHolder, which means that the full content of all revisions is stored somewhere within the Apache Flink job until it is processed by a Differ and a Calculation class. The revisions are then organized as a doubly linked list within an ArrayList bound to a Page object. Without knowing the detailed internals of Apache Flink's serialization, one can assume that a certain overhead is necessary to maintain all those relationships. The decision to preserve as much information as possible in the basic objects was made as a compromise to allow arbitrary calculations on the data, without e.g. forcing a specific calculation type. If it is clear that only revisions are used to calculate author rankings, then the framework could be modified to calculate the revision differences while parsing a page and to store only the resulting RevisionRelevanceScores in the DataHolder. This would drastically reduce the amount of data within the DataHolder and eliminate some processing steps. Because fewer objects would flow through the framework, we assume that its resource footprint should become smaller and larger dumps more reliably processable.

⁸⁰ http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-matrix-factorization-java-io-IOException-Thread-SortMerger-Reading-Thread-terminated-duel-td8809.html
⁸¹ http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Cannot-write-record-to-fresh-sort-buffer-Record-too-large-td13644.html
⁸² https://issues.apache.org/jira/browse/FLINK-1085

Calculation processing Similar to the previous point, the framework was designed to parallelize the computations for every revision instead of only parallelizing the processing of pages. The mathematical definition of Adler et al.'s contribution measures would allow computing "local" author rankings based on the revisions of a single page and later summing up all local rankings into a final one for the whole wiki. That method would also have had the advantage of allowing a page's revisions to be processed recursively before calculating a score, so that the sophisticated authorship tracking could be applied. There would be no need for a DataSet of Contributors or Revisions in the DataHolder, further reducing the amount of information. Nonetheless, we believed that it could lead to single nodes processing huge pages individually, thus reducing the overall performance. By processing DataSets of Revision objects, the decision of how to distribute the work equally over all nodes is left to Apache Flink. Both discussed points are regarded as part of the processing problem. Nevertheless, we are still confident that initially developing an open framework was the right decision. The implemented openness and extensibility should make the realization of the aforementioned, narrower modifications possible, so that processing big dumps becomes achievable; a sketch of the per-page approach is given below. Furthermore, the suggestion from section 2.4.3 to split a big wiki into smaller chunks of pages and execute the framework on them is already possible with the current implementation.
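To make the per-page idea concrete, it could be expressed with the DataSet API roughly as follows. This is a hypothetical sketch, not the framework's current code; Page is the framework's class, while localScores stands in for a contribution measure that processes one page's revision history at a time:

    import java.util.Map;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.util.Collector;

    public final class PerPageRankingSketch {

        // Turns a DataSet of pages into one global (author, score) ranking by scoring
        // every page locally and letting Flink sum the partial scores per author.
        public static DataSet<Tuple2<String, Double>> rank(DataSet<Page> pages) {
            return pages
                .flatMap(new FlatMapFunction<Page, Tuple2<String, Double>>() {
                    @Override
                    public void flatMap(Page page, Collector<Tuple2<String, Double>> out) {
                        for (Map.Entry<String, Double> e : localScores(page).entrySet()) {
                            out.collect(new Tuple2<>(e.getKey(), e.getValue()));
                        }
                    }
                })
                .groupBy(0)   // group by author identifier
                .sum(1);      // add the per-page scores up to one global score
        }

        // Placeholder for a measure that sees one page's full revision history at once
        // (and could therefore also apply recursive authorship tracking).
        private static Map<String, Double> localScores(Page page) {
            throw new UnsupportedOperationException("hypothetical helper");
        }
    }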
Dump parsing Further analysis and efforts to identify the root cause narrowed it down to the initial splitting of the raw XML data into <page>-separated XML strings. For this test, the only operation was counting all elements in the DataHolder's Page DataSet after parsing and filtering. All jobs on bgwiki, eowiki and arwiki failed with errors similar to those observed before. Removing the .distinct() operation merely allowed the eowiki job to finish successfully, but it emitted less data (more items, but a lower total size) than the other dumps. We therefore believe that the framework's problems could also be resolved by implementing a proper XmlInputFormat or by extending the DelimitedInputFormat with (de)compression support.
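The splitting part of such an input format can be sketched on top of Flink's DelimitedInputFormat. The class below is a rough, made-up illustration; the actual open problem, transparently decompressing the input splits, is exactly the part that is not solved here:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.flink.api.common.io.DelimitedInputFormat;
    import org.apache.flink.core.fs.Path;

    // Emits one record per <page>...</page> block of the dump.
    public class PageDelimitedInputFormat extends DelimitedInputFormat<String> {

        public PageDelimitedInputFormat(Path dumpPath) {
            setFilePath(dumpPath);
            setDelimiter("</page>");  // split records on the closing page tag
        }

        @Override
        public String readRecord(String reuse, byte[] bytes, int offset, int numBytes) throws IOException {
            // Re-append the delimiter so every record stays a well-formed page fragment.
            return new String(bytes, offset, numBytes, StandardCharsets.UTF_8) + "</page>";
        }
    }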
Key points of section "Performance and ranking results"

• A credible comparison of WikiTrust's and the framework's performance is not possible because of discrepancies in the user ranking results. Despite that, the distributed framework performs better.
• An increase in the dump's size does not necessarily imply longer processing times.
• The tested optimizations can make the framework's reference implementation about 20% faster, but the constant fluctuation of the processing times made exact conclusions hard.
• Applying the framework to large dumps resulted in unexpected job failures and exceptions that we were not able to resolve despite intensive debugging, but we suggested several changes that might fix them.

4.2 Framework use cases

The following section briefly discusses the developed framework's use cases.

4.2.1 Usability for Author ranking

The thesis' problem statement was to explore how and whether a distributed framework could process a huge amount of edits efficiently, while at the same time facilitating the development of new author ranking methods. Both aspects were considered while planning and implementing the framework (see section 2.2), but a stronger focus was put on functionality and extensibility to create the basis for future work on that topic. Its usability and functionality for calculating author rankings was demonstrated during the evaluation (see table 3). Although the exact results of the WikiTrust system could not be reproduced, due to technical and human limitations, the contribution measures defined in [4] were successfully implemented. On the downside, we have seen and previously discussed problems with applying the framework to dumps that contain several million pages, so its usability for those bigger Wikipedias is limited. However, we provided several ideas on how our reference implementation could be adjusted to remedy those problems. On the other hand, we are confident that the following features make it a viable option for author rankings:

• Extensibility: It allows parts of the framework to be easily replaced with custom implementations through heavy use of interfaces and factories, thus making it flexible in its functionality and feature set.
• Abstraction: It abstracts from the Wikipedia and pageview dump parsing, allowing users to concentrate on implementing impact measures.
• Platform independence: It is written in Java, which provides platform independence as long as the Java Virtual Machine (JVM) runs on the target system. Furthermore, Java is taught in introductory Computer Science classes - at least at the TU Berlin - so its paradigms are well-known and well-studied.
• Scalability: It is implemented using Apache Flink, which allows scaling horizontally from one system to several without adjusting the implementation.
• Open Source: Its source code [23] is public and can be modified by the community or any individual.

4.2.2 Usability for Wikipedia

At least three use cases can be imagined that could benefit the self-organizing and open collaboration style of Wikipedia.

Vandalism detection Wikipedia can only function properly as long as all editors contribute edits in good faith. Sadly, so-called vandals try to sabotage the work by introducing ill-intentioned edits, such as inserting inappropriate content or deleting the whole content. It is in the community's interest to identify and, if possible, prevent harmful activity. There are several projects and papers on this topic. For example, the Objective Revision Evaluation Service (ORES) [32] applies machine learning techniques to a wiki's edit history. Once activated for a specific wiki, it can help to analyze and classify edits, so that responsible editors can identify problems faster and act appropriately to resolve them. However, the table of supported wikis [32] is relatively small compared to the list of downloadable wikis⁸³. The authors of [4, p. 5] used contribution measures to identify non-human editors (bots) and vandals, namely when added text was immediately removed in the next revision. With regard to the aforementioned methods, we believe that the latter one can be implemented directly with the framework, because it re-implements the same contribution measures; a sketch follows below. For the former point, the framework also takes the edit history as an input. We think that incorporating machine learning algorithms is possible, thereby supporting the ORES system in its work or providing vandalism detection for other wikis.
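The heuristic just mentioned, flagging edits whose added text does not survive the immediately following revision, could be expressed on top of the framework's Revision objects roughly as shown below. The helpers insertedText and survivesIn are hypothetical placeholders for what the Differ would provide, not existing framework methods:

    import java.util.List;

    public final class VandalismHeuristic {

        // Flags a revision as suspicious when everything it added was removed
        // again by the very next revision.
        public static boolean looksLikeVandalism(Revision current, Revision next) {
            List<String> added = insertedText(current);   // words introduced by 'current'
            if (added.isEmpty()) {
                return false;                             // nothing added, nothing to judge
            }
            return added.stream().noneMatch(word -> survivesIn(word, next));
        }

        // Placeholders: in a real implementation these would be backed by the Differ.
        private static List<String> insertedText(Revision r) {
            throw new UnsupportedOperationException("hypothetical helper");
        }

        private static boolean survivesIn(String word, Revision next) {
            throw new UnsupportedOperationException("hypothetical helper");
        }
    }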
Adminship The community members who actively "delete copyright violations, protect frequently vandalized pages, block malicious users, move pages when there are name conflicts, exclude bulk vandalism from the recent changes list [...]" [16, p. 3441] are administrators with additional privileges. To become an administrator, a user has to submit a Request for adminship⁸⁴ (RfA) to the community, which then discusses the candidate's experience as well as her trustworthiness. One argument to avoid is Editcountitis⁸⁵, which describes that the mere number of edits does not allow conclusions about a user's ability to administer Wikipedia well; the content of the edits should be used as an indicator instead. This sentiment towards the quality of an edit, however, is exactly what Adler et al. try to capture with their contribution measures. We therefore believe that deploying the framework with its contribution measures, which focus on edit quality, could lend a helping hand and facilitate the decision. It could also serve as a self-evaluation service for checking one's own score before submitting an RfA.

⁸³ https://dumps.wikimedia.org/backup-index-bydb.html
⁸⁴ https://en.wikipedia.org/wiki/Wikipedia:Requests_for_adminship
⁸⁵ https://en.wikipedia.org/wiki/Wikipedia:Arguments_to_avoid_in_adminship_discussions#Editcountitis

Gamification The word gamification can be defined as "a process of enhancing a service with affordances for gameful experiences in order to support user's overall value creation." [26, p. 19] Such gameful experiences can be understood as rewards for (repeated) actions by the user, so that she creates additional value. Similar rewards have already been given away by the Wikipedia community in the form of Barnstars⁸⁶ for hard work and other renowned achievements. We suggest that the framework could be used to periodically calculate an author ranking for all wikis and to display the various rankings on a user's page. Depending on the frequency of those calculations, a gamified environment could be created in which Wikipedia editors become eager to increase their rank or to fine-tune their individual contribution scores. We believe that introducing gamification in such a form to Wikipedia could benefit it through increased engagement and activity.

4.2.3 Other use cases

Two further use cases that stem from the framework's extensibility are calculating rankings for Revision or Page objects.

Revision rating The RevisionRelevanceScore and the Differ could be used to calculate scores for single Revision objects in order to label them for further analysis, e.g. for the recent changes page of an article, similar to ORES. The calculations could be based on previous revisions or on the authors who contributed text.

Page rating With the PageRelevanceScore it is possible to rate Page objects. Calculating relevance scores for pages might lead to scoring systems that do not focus on authors, but on whole wikis. Such an approach of comparing wikis based on a page quality or page quantity measure might shift some attention to wikis that do not have a big community yet and would be happy to receive contributions. Those are just two further ideas that the developed framework should be capable of handling without extensive modifications, but they were not examined in this thesis.

⁸⁶ https://en.wikipedia.org/wiki/Wikipedia:Barnstars

Key points of section "Framework use cases"

• The framework's open design makes modifications and different use cases possible, which are not limited to author rankings but extend to general rankings based on the parsed data objects.
• It might benefit Wikipedia for vandalism detection, moderator or admin elections, or for increasing intrinsic motivation with methods of gamification.

4.3 Future work

As the last part of this section, we want to outline future work that the framework could benefit from. During the implementation and evaluation, certain bugs, design issues and places for improvement were identified. Additionally, some further research facets about author contributions and Wikipedia rankings were introduced, but could not be pursued due to this work's scope and time constraints. Facilitating usability and future work with the framework was one of the core concerns, so documenting the framework's classes well was part of it. The comments and the detailed description in section 2 should give enough insight to extend and improve the framework's functionality over time. Throughout the thesis, different improvable aspects came up.

Parsing improvements The XML parsing problems with the custom XmlInputFormat implementation for handling compressed datasets were remedied by a call to .distinct (see 2.4). A flawless implementation of a transparently decompressing XmlInputFormat class, even better natively for Apache Flink, would not only benefit the framework and its performance, but possibly also other projects that process Wikipedia dumps. Analyzing and implementing other splittable compression formats with a higher compression ratio than bzip2 is also an area of interest. Earlier in this section, the surprising performance boost of the RegexSkipDumpParser was discussed and other ideas were presented. An implementation of those and other parsing methods might further increase the framework's performance. Right now, the framework only parses the minimal information necessary to calculate Adler et al.'s contribution measures, but more is available. Therefore, adding new fields to the IPage, IRevision and IContributor interfaces and adjusting the parsers to populate them could be useful for new contribution measures. We think that further performance gains could be achieved if the framework's objects avoided private fields that lack public getters and setters. The goal is for them to be classified as simple POJOs by Apache Flink, so that no fallback to the apparently slower Kryo serializer is needed.
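For reference, Flink classifies a type as a POJO when the class is public, has a public no-argument constructor, and every field is either public or accessible through a public getter and setter; anything else falls back to the generic Kryo path. A made-up example of a conforming class (RevisionPojo is not one of the framework's classes) could look like this:

    // Hypothetical POJO-style variant of a revision record; Flink can analyze this
    // class and use its faster POJO serializer instead of the Kryo fallback.
    public class RevisionPojo {
        public int id;                 // public fields ...
        public String contributorName;
        public String text;

        public RevisionPojo() {        // ... and a public no-argument constructor
        }
    }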
Pageviews preprocessing In the previous subsection we discussed the huge impact of pageview parsing on the overall performance and formed some ideas on how it could be reduced. It would require a rewrite of the FileMerger, a re-evaluation of the needed preprocessing time versus the final framework runtime, and a discussion of whether the reduced usability is acceptable. Given a mandatory preprocessing step, it could also decode the page titles at the same time to reduce the framework's workload a bit more. The reduce step of summing up the page count and traffic values for all entries could also be moved from the framework to the FileMerger or to a dedicated pageviews preprocessor, but then a discussion about splitting the framework into different programs, like WikiTrust's approach, becomes necessary.

Text tracking One of the reasons for the discrepancies between the results of WikiTrust and the framework was the simpler text tracking algorithm. Section 2.4.7 discusses the exact details and technical limitations. On the downside, it does not provide text tracking results that are as exact, and it is also vulnerable to malicious authors who try to game it. Further analyzing, understanding, adjusting and implementing the original algorithm in a distributed manner is considered future work and would most likely result in more reliable and accurate contribution measure results.

Page mapping Due to the job failures that were observed in the evaluation, we suggested refraining from mapping on the Revision object level and mapping on the Page object level instead. As discussed earlier in this chapter, combining this with the previously mentioned text tracking issue could be an option to research.

Rankings Besides future work on the framework's implementation, another topic to research and explore is designing new contribution measures with the help of the framework.
New measures could, for example, incorporate external information like the pagecounts or other attributes that can be obtained from a Page or Revision object. Analyzing where differences to WikiTrust exist and how those can be removed while maintaining the distributable programming style is an important task to ensure the ranking's correctness. De Alfaro et al.'s dissertation analyzes and discusses in detail how to design impact measures. Applying those measures to different wikis and periodically making the resulting information available to the public is another task for future work.

Key points of section "Future work"

• Improving the parsing or preprocessing, as well as more precise text tracking techniques, are considered future work.
• Further development of contribution measures with the framework is another important topic.

5 Conclusion

The previous discussion outlined the strengths of the thesis' work as well as potential places for improvement. We succeeded in implementing a framework that calculates impact measures for Wikipedia authors in a distributed manner and abstracts from the initial parsing and preparation of the edit history and pageviews input data. The flexible and extensible design allows for the realization of various impact measures that might improve or benefit Wikipedia in areas such as vandalism detection, moderator or admin elections, or increasing intrinsic motivation with methods of gamification. Prior to the implementation, we analyzed the input datasets and concluded that bzip2 is the preferred compression algorithm for the edit history with regard to saved disk space and processing time. Preprocessing the pageviews further increases the performance. Throughout the thesis, we tried to adhere to the important aspect of reproducibility, which we believe to have achieved by thoroughly documenting the framework and its source code, the used soft- and hardware, as well as the evaluation in as much detail as possible. Therefore, providing code snapshots and important datasets and implementing automated tests seemed natural and is believed to facilitate future work on this topic. Adler et al.'s work on the contribution measures and the WikiTrust program was chosen as a reference system for our framework. A re-implementation of their formulas allowed us to demonstrate the framework's usability and the possible performance gains from distributing the work onto multiple systems. Unfortunately, we were not able to exactly reproduce WikiTrust's results or Adler et al.'s observations in the evaluation phase, due to technical limitations and missing information on their and our side. Experiments with several code optimizations indicated that the performance of our framework's reference implementation can be improved by up to 20%. Although this proved the extensibility and interchangeability of its components, the goal of calculating author rankings for the huge English Wikipedia was not accomplished. However, we also showed that incorporating pageviews information paves the way for more diverse and interesting impact measures.

We are confident that our work is a first step that lays the groundwork for distributed author rank calculations based on impact measures for Wikipedia. By open sourcing the framework and all related information, we hope to facilitate further development of the framework and research on this topic, and to involve the Wikipedia community in it.

6 References

[1] 7z Format. Last accessed: 2017-03-29. url: http://7-zip.org/7z.html (cit. on p. 14).
[2] B. T. Adler. "WikiTrust: content-driven reputation for the Wikipedia". In: (2012) (cit. on pp. 22, 31, 57, 83).
[3] B. T. Adler and L. de Alfaro. "A Content-driven Reputation System for the Wikipedia". In: Proceedings of the 16th International Conference on World Wide Web. WWW '07. Banff, Alberta, Canada: ACM, 2007, pp. 261–270. isbn: 978-1-59593-654-7. doi: 10.1145/1242572.1242608. url: http://doi.acm.org/10.1145/1242572.1242608 (cit. on pp. 2, 57).
[4] B. T. Adler, L. de Alfaro, I. Pye, and V. Raman. "Measuring Author Contributions to the Wikipedia". In: Proceedings of the 4th International Symposium on Wikis. WikiSym '08. Porto, Portugal: ACM, 2008, 15:1–15:10. isbn: 978-1-60558-128-6. doi: 10.1145/1822258.1822279. url: http://doi.acm.org/10.1145/1822258.1822279 (cit. on pp. IV, 2, 3, 5, 20, 22, 24, 25, 28, 29, 31, 32, 36, 40, 42, 49, 52, 54, 56, 59, 61–63, 68, 77, 79, 82–85, 92, 94, 95, 98, 100, 108, 109, 119).
[5] F. N. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. "Map-reduce Extensions and Recursive Queries". In: Proceedings of the 14th International Conference on Extending Database Technology. EDBT/ICDT '11. Uppsala, Sweden: ACM, 2011, pp. 1–8. isbn: 978-1-4503-0528-0. doi: 10.1145/1951365.1951367. url: http://doi.acm.org/10.1145/1951365.1951367 (cit. on p. 57).
[6] An Explanation of the DEFLATE Algorithm. Last accessed: 2017-03-28. url: http://www.gzip.org/deflate.html (cit. on p. 14).
[7] Analytics/Pageviews - Wikitech. Last accessed: 2017-03-27. Feb. 9, 2017. url: https://wikitech.wikimedia.org/wiki/Analytics/Pageviews (cit. on p. 7).
[8] Apache Flink 0.10.2 Documentation: Flink DataSet API Programming Guide. Last accessed: 2017-05-03. url: https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html (cit. on p. 39).
[9] Apache Flink 1.2 Documentation: Data Types & Serialization. Last accessed: 2017-05-10. url: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html (cit. on p. 47).
[10] Apache Flink 1.2 Documentation: Data Types & Serialization. Last accessed: 2017-06-26. url: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/types_serialization.html (cit. on p. 74).
[11] Apache Flink 1.2 Documentation: Quickstart. Last accessed: 2017-04-04. url: https://ci.apache.org/projects/flink/flink-docs-release-1.2/quickstart/setup_quickstart.html#read-the-code (cit. on p. 20).
[12] Apache Flink: Downloads. Last accessed: 2017-06-26. url: https://flink.apache.org/downloads.html (cit. on p. 110).
[13] Apache Flink: Hadoop Compatibility in Flink. Last accessed: 2017-03-29. Nov. 18, 2014. url: https://flink.apache.org/news/2014/11/18/hadoop-compatibility.html (cit. on p. 14).
[14] Apache Hadoop 2.7.1 - HDFS Users Guide. Last accessed: 2017-03-27. url: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html (cit. on p. 9).
[15] L. Bornmann and H.-D. Daniel. "Does the h-index for ranking of scientists really work?" In: Scientometrics 65.3 (2005), pp. 391–392. issn: 1588-2861. doi: 10.1007/s11192-005-0281-4. url: http://dx.doi.org/10.1007/s11192-005-0281-4 (cit. on p. 1).
[16] M. Burke and R. Kraut. "Taking Up the Mop: Identifying Future Wikipedia Administrators". In: CHI '08 Extended Abstracts on Human Factors in Computing Systems. CHI EA '08. Florence, Italy: ACM, 2008, pp. 3441–3446. isbn: 978-1-60558-012-8. doi: 10.1145/1358628.1358871.
url: http:// doi.acm.org/10.1145/1358628.1358871 (cit. on p. 95). [17] Bzip2 man page. Last accessed: 2017-03-29. url: http://www.bzip.org/1. 0.1/bzip2.txt (cit. on p. 14). [18] collaborativetrust/WikiTrust: Author reputation, content trust, and origin / authorship analysis for wiki content. Last accessed: 2017-06-19. url: http: //github.com/collaborativetrust/WikiTrust (cit. on pp.3, 60).</p><p>103 References</p><p>[19] CompressionCode (Apache Hadoop Main 2.7.1 API). Last accessed: 2017- 03-29. url: https://hadoop.apache.org/docs/r2.7.1/api/org/apache/ hadoop/io/compress/CompressionCodec.html (cit. on p. 14). [20] ContentHandler. Last accessed: 2017-05-10. url: http://www.saxproject. org/apidoc/org/xml/sax/ContentHandler.html#characters(char[], %20int,%20int) (cit. on p. 48). [21] Data dumps - Meta. Last accessed: 2016-05-26. 2016. url: https://meta. wikimedia.org/wiki/Data_dumps#How_often_dumps_are_produced (cit. on pp.3,4). [22] L. De Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler. “Reputation Systems for Open Collaboration”. In: Commun. ACM 54.8 (Aug. 2011), pp. 81–87. issn: 0001-0782. doi: 10.1145/1978542.1978560. url: http://doi.acm. org/10.1145/1978542.1978560 (cit. on pp.2, 99). [23] gehaxelt/thesis-imwa: IMWA Framework: Code for a framework to calcu- late impact measures for Wikipedia authors. Last accessed: 2017-06-26. url: https://github.com/gehaxelt/thesis-imwa (cit. on pp. 94, 107, 110). [24] Generic Types (The Java™ Tutorials > Learning the Java Language > Gener- ics (Updated)). Last accessed: 2017-05-04. url: https://docs.oracle.com/ javase/tutorial/java/generics/types.html (cit. on p. 43). [25] J. E. Hirsch. “An index to quantify an individual’s scientific research output”. In: Proceedings of the National Academy of Sciences of the United States of America 102.46 (2005), pp. 16569–16572. doi: 10.1073/pnas.0507655102. eprint: http://www.pnas.org/content/102/46/16569.full.pdf+html. url: http://www.pnas.org/content/102/46/16569.abstract (cit. on p.4). [26] K. Huotari and J. Hamari. “Defining Gamification: A Service Marketing Perspective”. In: Proceeding of the 16th International Academic MindTrek Conference. MindTrek ’12. Tampere, Finland: ACM, 2012, pp. 17–22. isbn: 978-1-4503-1637-8. doi: 10.1145/2393132.2393137. url: http://doi.acm. org/10.1145/2393132.2393137 (cit. on p. 96). [27] Index of /other/pageviews/. Last accessed: 2017-06-19. url: https://dumps. wikimedia.org/other/pageviews/ (cit. on pp.3,7).</p><p>104 References</p><p>[28] Interfaces (The Java™ Tutorials > Learning the Java Language > Interfaces and Inheritance). Last accessed: 2017-05-04. url: https://docs.oracle. com/javase/tutorial/java/IandI/createinterface.html (cit. on p. 42). [29] A. Kittur, E. Chi, B. A. Pendleton, B. Suh, and T. Mytkowicz. “Power of the few vs. wisdom of the crowd: Wikipedia and the rise of the bourgeoisie”. In: World wide web 1.2 (2007), p. 19 (cit. on p.2). [30] P. Kraker and E. Lex. “A critical look at the ResearchGate score as a measure of scientific reputation”. In: Proceedings of the Quantifying and Analysing Scholarly Communication on the Web workshop (ASCW’15), Web Science conference. 2015 (cit. on p.1). [31] OCaml - OCaml. Last accessed: 2017-05-22. url: https://ocaml.org/ (cit. on p. 60). [32] ORES - MediaWiki. Last accessed: 2017-06-09. url: https : / / www . mediawiki.org/wiki/ORES (cit. on pp.2, 95). [33] Overview of Java. Last accessed: 2017-04-04. url: https://docs.oracle. com/database/122/JJDEV/Java-overview.htm#JJDEV01100 (cit. on p. 20). 
[34] Page view statistics for Wikimedia projects. Last accessed: 2017-06-19. url: https://dumps.wikimedia.org/other/pagecounts-raw/ (cit. on pp. 3, 7).
[35] Page view statistics for Wikimedia projects. Last accessed: 2017-03-27. url: https://dumps.wikimedia.org/other/pagecounts-raw/ (cit. on p. 6).
[36] A. V. Pantola, S. Pancho-Festin, and F. Salvador. "Rating the Raters: A Reputation System for Wiki-like Domains". In: Proceedings of the 3rd International Conference on Security of Information and Networks. SIN '10. Taganrog, Rostov-on-Don, Russian Federation: ACM, 2010, pp. 71–80. isbn: 978-1-4503-0234-0. doi: 10.1145/1854099.1854116. url: http://doi.acm.org/10.1145/1854099.1854116 (cit. on p. 2).
[37] S. Rafaeli and Y. Ariel. "Online Motivational Factors: Incentives for Participation and Contribution in Wikipedia". In: Psychological Aspects of Cyberspace. Ed. by A. Barak. Cambridge Books Online. Cambridge University Press, 2008, pp. 243–267. isbn: 9780511813740. url: http://dx.doi.org/10.1017/CBO9780511813740.012 (cit. on p. 1).
[38] A. Schwartz. Who Writes Wikipedia? Last accessed: 2016-02-04. 2006. url: http://www.aaronsw.com/weblog/whowriteswikipedia (cit. on p. 2).
[39] Stewards/Elections 2016/Guidelines - Meta. Last accessed: 2017-06-19. url: https://meta.wikimedia.org/wiki/Stewards/Elections_2016/Guidelines (cit. on p. 1).
[40] B. Suh, E. H. Chi, A. Kittur, and B. A. Pendleton. "Lifting the Veil: Improving Accountability and Social Transparency in Wikipedia with Wikidashboard". In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. CHI '08. Florence, Italy: ACM, 2008, pp. 1037–1040. isbn: 978-1-60558-011-1. doi: 10.1145/1357054.1357214. url: http://doi.acm.org/10.1145/1357054.1357214 (cit. on p. 2).
[41] This is an XML Schema description of the format output by MediaWiki's Special:Export system. Last accessed: 2017-06-19. url: https://www.mediawiki.org/xml/export-0.10.xsd (cit. on p. 6).
[42] What Is an Interface? (The Java™ Tutorials > Learning the Java Language > Object-Oriented Programming Concepts). Last accessed: 2017-05-04. url: https://docs.oracle.com/javase/tutorial/java/concepts/interface.html (cit. on p. 42).
[43] Wikimedia Downloads. Last accessed: 2017-06-19. url: https://dumps.wikimedia.org/backup-index.html (cit. on pp. 4, 11).
[44] Wikipedia Statistics - Tables - All languages. Last accessed: 2017-06-19. url: https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm#editor_activity_levels (cit. on p. 1).
[45] Wikipedia:Username policy - Wikipedia. Last accessed: 2017-05-21. url: https://en.wikipedia.org/wiki/Wikipedia:Username_policy#Changing_your_username (cit. on pp. 22, 55).
[46] D. M. Wilkinson and B. A. Huberman. "Cooperation and Quality in Wikipedia". In: Proceedings of the 2007 International Symposium on Wikis. WikiSym '07. Montreal, Quebec, Canada: ACM, 2007, pp. 157–164. isbn: 978-1-59593-861-9. doi: 10.1145/1296951.1296968. url: http://doi.acm.org/10.1145/1296951.1296968 (cit. on p. 2).

7 Appendix

Besides the list of tables and figures, this section covers the distribution and organization of the thesis' tools and source code, as well as the steps necessary to build and run those programs.
Figures with additional information that can improve the reader's understanding of the results are also part of it.

7.1 Source code on CD and GitHub

The developed tools, their output and performance data, as well as some important datasets, will be published alongside the thesis. To allow for further development, not only by the author but also by the community or otherwise interested people, the data will be published in a repository on the collaborative development platform GitHub⁸⁷ at [23]. A read-only copy will be provided to the thesis' supervisors on a CD-ROM for grading.

7.1.1 Structure

The repository consists of two folders: the code folder holds the programs that were mentioned and developed throughout the thesis, whereas the data folder contains all data, such as input or output files of those programs, to facilitate the reconstruction of the evaluation measurements and calculations.

Contents of the code folder:
• BzipTest: A tool for reading compressed files, used for the evaluation of the (de-)compression performance.
• FileMerger: A tool that can be used to merge multiple input files into a single DataSink (output) with optional compression.
• Framework: A framework for implementing and running contribution measures for author ranking in a distributed manner.
• Test-Tools: A collection of Python tools to parse and analyze small dumps without multi-threading.
• WikiTrust: Adler et al.'s WikiTrust program.
• Wikitrust-framework-setup.txt: Rough instructions on how to set up WikiTrust, Apache Flink and the framework.

Contents of the data folder:
• compression-comparison: All data and results belonging to the compressed datasets and their performance analysis.
• crash-logs: Stack traces from failed jobs.
• dumps: Contains the most important Wikipedia dumps, like aawiki-20170501, acewiki-20170501 and bgwiktionary-20170501, in their bzip2 and 7z formats.
• parsing-tests: Results from analyzing the lists of parsed revisions and pages from WikiTrust and the framework.
• performance-comparison: Various performance measurements and timings can be found in this folder.
• result-comparison: Contains WikiTrust's and the framework's calculation results for comparison with each other.

Some folders contain subfolders for better categorization, but almost all have a README.md file with further information.

⁸⁷ https://github.com

7.1.2 License

The GNU General Public License v3 (GPL-3) was chosen as the license for the source code developed during the thesis. It provides the freedom to use the code commercially, modify it or distribute it, while at the same time requiring derivative work to be disclosed with a statement of what significant changes were made. We believe that this decision helps to maintain the framework's openness and potentially to build a community around it, so that new and better features, such as advanced parsing methods or contribution measures, can be developed by independent individuals.

7.2 Java, Flink, WikiTrust and framework installation

This section describes the steps necessary for a partially successful build of Adler et al.'s WikiTrust program. The operating system of the DigitalOcean server was the Linux-based Debian 8⁸⁸ with up-to-date packages. Listing 16 shows all necessary commands.
Debian's repository already provides OCaml, Opam and most of the dependencies as packages, which can be installed (line 2 of listing 16) after acquiring higher privileges (line 1). The exact package names and versions might differ in the future or on other operating systems.

1 $> sudo su
2 $> apt-get install ocaml git opam pkg-config zlib1g-dev libmysqlclient-dev libpcre3-dev libjson-c-dev camlp4-extra p7zip-full
3 $> ocaml -version
4 The OCaml toplevel, version 4.01.0
5 $> opam --version
6 1.2.0
7 $> opam init
8 $> eval `opam config env`
9 $> opam install pcre extlib ocamlfind json-static json-wheel mysql ocamlnet sexplib type_conv xml-light camlzip
10
11 $> git clone https://github.com/collaborativetrust/OcamlLdaLibs.git
12 $> cd OcamlLdaLibs
13 $> make all
14 $> cd ../
15 $> git clone https://github.com/collaborativetrust/WikiTrust.git
16 $> cd WikiTrust
17 $> make all
18 $> # Copy wikitrust-patch.diff from the repository.
19 $> git apply wikitrust-patch.diff
20 $> make all

Listing 16: It shows the command line steps necessary to build WikiTrust and its dependencies.

⁸⁸ https://www.debian.org/

Lines 3 to 6 show the installed OCaml and Opam versions that were used throughout this thesis. After its installation, Opam needs to be initialized and configured for the current terminal session before it can install all necessary OCaml libraries (lines 7-9). Once everything is set up, the WikiTrust source code can be obtained with git and then compiled using the make command, as shown in lines 11-17. The build of WikiTrust might fail with an error similar to "Error: Unbound module Http_client". Nevertheless, it has already built the required tools in the analysis/ folder at that point, so we ignored the further complications. A patch file from our repository (wikitrust-patch.diff in the code/WikiTrust folder) should be applied to change the number of edit judges to 10 rather than 6 and to set the revision parsing end time into the future. It also removes WikiTrust's last processing step of computing the text trust.

Installing Apache Flink and our framework on a single system takes just a couple of steps. A recent version of Java is required, preferably Java 1.8 from Oracle, as well as Apache Maven for building the framework. Both can be installed using the operating system's package manager once the correct sources are configured (listing 17, lines 2-6). To set up Apache Flink, an archive must be downloaded and extracted from [12]. It is crucial to choose a version with at least Hadoop 2.7 if compressed datasets are to be supported; otherwise the decompression might not work. Afterwards, Flink can be started or stopped locally with the start-local.sh or stop-local.sh executables in its bin/ folder. With all dependencies met, the framework's repository can be cloned from [23]. Executing mvn clean package inside the Framework/ folder will build it.
1 $> sudo su
2 $> echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu xenial main" | tee /etc/apt/sources.list.d/webupd8team-java.list
3 $> echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu xenial main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list
4 $> apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
5 $> apt-get update
6 $> apt-get install oracle-java8-set-default maven
7
8 $> java -version
9 java version "1.8.0_131"
10 Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
11 Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
12
13 $> wget http://mirror.23media.de/apache/flink/flink-1.0.3/flink-1.0.3-bin-hadoop27-scala_2.10.tgz
14 $> tar xfv flink*.tgz
15
16 $> git clone https://github.com/gehaxelt/thesis-imwa
17 $> cd thesis-imwa/code/Framework
18 $> mvn clean package

Listing 17: Java and Flink installation instructions on a Debian 8 system.

7.3 WikiTrust and framework bash-loop

On the DigitalOcean virtual machine, a screen⁸⁹ session was initiated to conveniently reconnect to it in case the connection terminates. A variable DUMP holds the first part of the dump to analyze. A for-loop iterates five times and executes the WikiTrust program while saving the execution time and the resulting user reputation scores (listing 18).

⁸⁹ https://www.gnu.org/software/screen/

1 $> pwd
2 /root/
3 $> screen -S wikitrust
4 $> export DUMP="aawiki"
5 $> for i in $(seq 1 1 5); do
6 cd /root/WikiTrust/
7 mkdir "/root/Output/$DUMP-$i";
8 echo "======$i======" >> "/root/Output/perf-$DUMP";
9 /usr/bin/time -o "/root/Output/perf-$DUMP" --append python2 /root/WikiTrust/util/batch_process.py --cmd_dir /root/WikiTrust/analysis --dir "/root/Output/$DUMP-$i" "/root/Dumps/$DUMP-20170501-pages-meta-history.xml.7z";
10 cat "/root/Output/$DUMP-$i/user_reputations.txt" > "/root/Output/reputation-$DUMP-$i";
11 rm -rf "/root/Output/$DUMP-$i";
12 cd -;
13 sleep 5s;
14 done

Listing 18: Screen session and bash for-loop to execute and time WikiTrust multiple times on different dumps.

Similar to the WikiTrust bash-loop, another for-loop in Bash helped to execute the framework on different dumps multiple times. Listing 19 presents two nested for-loops, where the outer loop iterates over the dumps defined in the DUMPS variable. The inner loop iterates over a sequence of integers to process each dump five times. The CALC variable helps to assign the recorded performance to the executed contribution measure.

1 $> export CALC="numedits"
2 $> export DUMPS="aawiki acewiki bgwiktionary"
3 $> for dump in $DUMPS; do
4 for i in $(seq 1 1 5); do
5 cd /root/flink-1.0.3
6 echo "======$i======" >> "/root/Output/perf-$dump-$CALC";
7 /usr/bin/time -o "/root/Output/perf-$dump-$CALC" --append ./bin/flink run /root/code/Framework/target/Framework-1.0-SNAPSHOT.jar "file:///root/Dumps/$dump-20170501-pages-meta-history.xml.bz2";
8 cd -;
9 sleep 5s;
10 done;
11 done;

Listing 19: Automated, nested bash for-loop to execute the framework five times on each of several dumps.

7.4 WikiTrust and framework author reputations

This section contains figures with all results for the aawiki-20170501 dump, which were computed with the WikiTrust system and with the framework's different contribution measures. Users with a resulting score of zero and the anonymous user were removed before plotting. For better comparability, all figures list all present authors on the x-axis.
Therefore, ones without a rating (value 0) were not part of the original result.</p><p>113 7.4 WikiTrust and framework author reputations</p><p>Figure 12: WikiTrust’s user reputation results on the aawiki.</p><p>Figure 13: Framework’s EditLongevity results</p><p>114 7.4 WikiTrust and framework author reputations</p><p>Figure 14: Framework’s TextLongevity results</p><p>Figure 15: Framework’s TextLongevityWithPenalty results</p><p>115 7.4 WikiTrust and framework author reputations</p><p>Figure 16: Framework’s TenRevisions results</p><p>116 7.5 List of figures, tables and listings</p><p>7.5 List of figures, tables and listings</p><p>Figures</p><p>1 Simplified sketch of a dump’s XML structure ...... 5 2 The chart shows sizes of compressed (7z, bzip2) and uncompressed dumps...... 12</p><p>3 The dumps’ bzip2 and 7z compression factors (Fc) and an average per compression type...... 13 4 The chart shows that the bzip2 job performance is higher for com- pressed datasets...... 15 5 A higher throughput is possible with Bzip2 compressed datasets by processing more data in less time...... 16 6 Single pageview files are processed faster when they are not (Gzip) compressed...... 17 7 Merging the separate pageview files into a single dataset improves the compression ratio for bzip2, but does not change for gzip. . . . 18 8 The bar chart shows the time needed to process the single merged and compressed dataset...... 19 9 This class diagram shows the relationships, methods and attributes of the major data representation classes...... 21 10 This sketch outlines the four steps of the framework’s data process- ing flow and what components are involved...... 27 11 It shows Apache Flink’s execution plan for the TextLongevityCal- culation, which is generated before the distributed processing starts. 30 12 WikiTrust’s user reputation results on the aawiki...... 114 13 Framework’s EditLongevity results ...... 114 14 Framework’s TextLongevity results ...... 115 15 Framework’s TextLongevityWithPenalty results ...... 115 16 Framework’s TenRevisions results ...... 116</p><p>Tables</p><p>117 Listings</p><p>1 Sizes of (de-)compressed gzip Pageviews and the resulting compres- sion ratio...... 14 2 Time needed by WikiTrust to calculate the user reputations on the virtual machine without the last processing step...... 64 3 Framework processing times for all contribution measures on the DigitalOcean server. Entries marked with ’*’ indicate an error and are not included in the averages...... 65 4 WikiTrust’s raw performance for parsing and calculating the statis- tics file by omitting the last three computation steps...... 67 5 Measured framework processing times for all contribution measures on the student cluster...... 71 6 Page filtering performance by measuring processing times for count- ing the parsed pages on the student cluster...... 72 7 Comparing the “inner loop“ and “outer loop“ filtering times by count- ing revisions after parsing by applying three different PostFilters. . 74 8 Measured processing times for the TextLongevityWithPenalty cal- culation with and without type-registration for the Kryo serializer on the student cluster...... 75 9 Performance of the RegexSkipDumpParser vs. the SkipDumpParser measured by their computation time for .count() on the resulting Revision DataSet...... 76 10 Measured processing times for the TextOnly calculation with and without pageviews on the student cluster...... 77</p><p>Listings</p><p>1 Map anonymous authors to an non-existent, “illegal“ username while parsing...... 
23 2 When parsing pageview data, it must be separated and decoded line by line...... 24 3 Contributor - getIdentifier() implementation using a simple hash function that helps to uniquely identify and compare Contributor objects...... 26</p><p>118 Listings</p><p>4 The AdlerContributionMeasuresExample Shows the import steps of initializing, running and outputting the TextLongevity contribution measure by Adler et al. with the framework...... 29 5 Shows how to configure and apply a PostFilter to filter pages from the main/article Wikipedia namespace...... 31 6 Both Pre- and PostFilters are applied subsequently on the whole DataSet to reduce the amount of checked objects...... 32 7 Parsing the raw page XML content using Java’s SAX library. . . . . 34 8 The ICalculation interface defines, which methods a Calculation class has to implement...... 43 9 The ACalculation abstract class, which implements the run function to ensure a particular call order...... 44 10 The RevisionFactory for Revision objects as an example for an ob- ject factory...... 45 11 The Framework’s singleton implementation for global access to the Framework instance...... 46 12 Java’s “JVM Out of memory“ exception that occurred for bigger dumps...... 50 13 Length checks in Apache Hadoop’s LineRecordReader that might prevent the observed error...... 50 14 Important lines of the Akka “ask timed out“ error message when the job execution fails...... 53 15 The listing shows parts of the acewiki-20170501 dump, where differ- ent usernames map to the same ID...... 56 16 It shows the command line steps necessary to build WikiTrust and its dependencies...... 109 17 Java and Flink installation instructions on a Debian 8 system. . . . 111 18 Screen session and bash for-loop to execute and time WikiTrust multiple times on different dumps...... 112 19 Automated, nested bash for-loop to execute the framework five times on each of several dumps...... 113</p><p>119</p> </div> </article> </div> </div> </div> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.1/jquery.min.js" crossorigin="anonymous" referrerpolicy="no-referrer"></script> <script> var docId = 'dc46c8db0e164df8ebb218e5c462c110'; var endPage = 1; var totalPage = 124; var pfLoading = false; window.addEventListener('scroll', function () { if (pfLoading) return; var $now = $('.article-imgview .pf').eq(endPage - 1); if (document.documentElement.scrollTop + $(window).height() > $now.offset().top) { pfLoading = true; endPage++; if (endPage > totalPage) return; var imgEle = new Image(); var imgsrc = "//data.docslib.org/img/dc46c8db0e164df8ebb218e5c462c110-" + endPage + (endPage > 3 ? ".jpg" : ".webp"); imgEle.src = imgsrc; var $imgLoad = $('<div class="pf" id="pf' + endPage + '"><img src="/loading.gif"></div>'); $('.article-imgview').append($imgLoad); imgEle.addEventListener('load', function () { $imgLoad.find('img').attr('src', imgsrc); pfLoading = false }); if (endPage < 5) { adcall('pf' + endPage); } } }, { passive: true }); if (totalPage > 0) adcall('pf1'); </script> <script> var sc_project = 11552861; var sc_invisible = 1; var sc_security = "b956b151"; </script> <script src="https://www.statcounter.com/counter/counter.js" async></script> </html><script data-cfasync="false" src="/cdn-cgi/scripts/5c5dd728/cloudflare-static/email-decode.min.js"></script>