A Vision for Performing Social and Economic Data Analysis Using Wikipedia’s Edit History
Erik Dahm, Moritz Schubotz, Norman Meuschke, Bela Gipp
Dept. of Computer and Information Science, University of Konstanz
Universitaetsstr. 10, 78457 Konstanz, Germany

ABSTRACT
In this vision paper, we suggest combining two lines of research to study the collective behavior of Wikipedia contributors. The first line of research analyzes Wikipedia's edit history to quantify the quality of individual contributions and the resulting reputation of the contributor. The second line of research surveys Wikipedia contributors to gain insights, e.g., on their personal and professional background, socioeconomic status, or motives to contribute to Wikipedia. While both lines of research are valuable on their own, we argue that the combination of both approaches could yield insights that exceed the sum of the individual parts. Linking survey data to contributor reputation and content-based quality metrics could provide a large-scale, public-domain dataset to perform user modeling, i.e., deducing interest profiles of user groups. User profiles can, among other applications, help to improve recommender systems. The resulting dataset can also enable a better understanding and improved prediction of high-quality Wikipedia content and successful Wikipedia contributors. Furthermore, the dataset can enable novel research approaches to investigate team composition and collective behavior as well as help to identify domain experts and young talents. We report on the status of implementing our large-scale, content-based analysis of the Wikipedia edit history using the big data processing framework Apache Flink. Additionally, we describe our plans to conduct a survey among Wikipedia contributors to enhance the content-based quality metrics.

Keywords
Wikipedia; Author Reputation; Article Quality; Editor Types

1. INTRODUCTION
Wikipedia is the largest collaboratively maintained information repository on the Web. Wikipedia contains more than 40 million articles¹ and attracts billions of annual visitors². Wikipedia's openness, which allows virtually everyone to contribute and edit content, is a key factor that ensures the breadth, diversity, and currentness of Wikipedia's content, which in turn is a driving force of Wikipedia's success. However, Wikipedia's open and collaborative editing process is also a source of doubt regarding the quality and reliability of Wikipedia content.

Assessing the reputation, i.e., "quality", of Wikipedia contributors and the quality of Wikipedia content are problems that have attracted much research attention in recent years. Having reviewed the literature on approaches that analyze the editing and revision process of Wikipedia (see Section 2), we found two distinct lines of research that are currently independent of each other (Figure 1). The first line of research includes content-based approaches that analyze Wikipedia's edit history to assess contributor reputation and content quality (see Section 2.1). The edit history represents a persistent, fine-grained record of any change to an article and the originator of the change. Content-based approaches analyze Wikipedia's edit history to assess or predict the trustworthiness of contributors, the quality of their contributions, and the overall quality of Wikipedia articles. All content-based investigations of quality issues in Wikipedia that we found rely on IP addresses or user account names to distinguish individual contributors. These investigations allow few conclusions regarding the individuals the accounts represent. Content-based analysis approaches using Wikipedia's edit history yield valuable results for assessing and ensuring content quality in Wikipedia, yet do not allow linking this data to individuals.

Figure 1: Existing research either analyzes the quality of Wikipedia's content using automated procedures or investigates the behavior of Wikipedia users with the help of traditional surveys. Performing the types of social and economic data analysis we envision requires linking the two data sources.

The second line of research comprises user surveys studying contributor motivation, contributor interaction, and other factors that influence the quality of contributions to Wikipedia (see Section 2.2). While some surveys investigated socioeconomic questions in regard to Wikipedia users, this data is not linked to accounts or IP addresses, a link that would allow modeling the behavior of the individuals.

We suggest that analyzing Wikipedia's edit history and linking this data to individual characteristics of contributors collected through surveys could provide a large-scale, open source dataset offering tremendous potential for user-centered and content-centered research. The dataset would enable investigating questions such as:

• How do user characteristics, e.g., demographics, influence the relevance of topics to the user?
• How does user-specific relevance develop over time?
• Can one derive user models that predict the relevance and relatedness of topics for users and user groups?
• How can user models improve information retrieval systems, such as content and item recommender systems [7]?
• How does the interaction of user accounts observable in the edit history relate to interaction patterns of individuals in real-world situations known from sociology [32]?
• Can one predict the career paths of young contributors based on the edits they perform?
• Can one identify domain experts by analyzing the Wikipedia edit history?
• Can one estimate socio-economic properties of individuals such as education, profession, or social status by analyzing Wikipedia's edit history?

To explain our vision of how these and other research questions could be answered, we structure the remainder of this paper as follows. In Section 2, we review existing research that investigates Wikipedia's edit process and Wikipedia contributors. In Section 3, we explain the potential benefits of linking data from Wikipedia's edit history and survey data to enable novel research approaches in several areas of the social sciences, business and economics, and computer science. In Section 4, we present the current status of our technical solution for analyzing Wikipedia's edit history and our efforts to perform tailored user surveys to complement and extend the insights derived from our automated content-based analysis. Section 5 concludes the paper with an out-

¹ https://en.wikipedia.org/wiki/Wikipedia:Statistics
² Wikipedia currently holds rank 6 in the Alexa traffic ranking http://www.alexa.com/siteinfo/wikipedia.org

© 2017 International World Wide Web Conference Committee (IW3C2), published under Creative Commons CC BY 4.0 License.
WWW’17 Companion, April 3–7, 2017, Perth, Australia.
ACM 978-1-4503-4914-7/17/04.
http://dx.doi.org/10.1145/3041021.3053363

2.1 Content-based Assessments of Contributor Reputation and Article Quality
Reputation is typically defined as the public opinion towards a person, a group of persons, or an organization derived from the social evaluation of a set of criteria [17]. The increasing amount of user-generated content on the Web has increased the importance of establishing and quantifying reputation for Web users. Reputable users can be characterized as users who regularly provide high-quality content that is useful for many other users. Reputable users are an essential asset for many Web sites, such as online forums, blogs, and wikis [10]. Because Wikipedia is the largest collaborative information repository, determining user reputation is of special importance to it. The task has attracted much research, which we briefly review in the following.

Key components of the approaches that have been proposed to measure contributor reputation in Wikipedia correspond to well-established factors used to quantify reputation in academia. Quantifying the productivity, quality, and impact of research contributions for researchers or research institutions is a well-established process. Academic quality metrics are important input data for numerous decision-making processes, such as the hiring and promotion of researchers, the funding of research projects, or the ranking of research institutions. The most widely used indicators of academic reputation consider bibliometric data, i.e., data on the published research works and the number of citations these works have received. Bibliometric data is at the heart of indicators quantifying the reputation of individual researchers or research institutions, such as the h-index [16]. This index assigns a high value to researchers or institutions who publish many research works that are highly cited by other researchers. Bibliometric data is also the base for computing indicators to quantify the reputation of academic venues, such as the impact factor [12]. Several researchers question the informative value of such measures [22, 27] as well as the transparency [21] and fraud-resilience of their computation [6]. However, thus far, no better approach for quantifying reputation and productivity in academia has found widespread use.

Some use cases allow transferring bibliometric approaches to Wikipedia by equating the concepts 'publication', 'author', and 'citation' with the corresponding concepts 'article', 'contributor', and 'intra-wiki-link'. However, for the use case of reputation analysis, bibliometric indicators can only partially be transferred to Wikipedia