Detecting Undisclosed Paid Editing in Wikipedia

Nikesh Joshi, Francesca Spezzano, Mayson Green, and Elijah Hill
Computer Science Department, Boise State University, Boise, Idaho, USA
[email protected], {nikeshjoshi,maysongreen,elijahhill}@u.boisestate.edu

ABSTRACT
Wikipedia, the free and open-collaboration-based online encyclopedia, has millions of pages that are maintained by thousands of volunteer editors. As per Wikipedia's fundamental principles, pages on Wikipedia are written with a neutral point of view and maintained by volunteer editors for free, with well-defined guidelines in order to avoid or disclose any conflict of interest. However, there have been several known incidents where editors intentionally violate such guidelines in order to get paid (or even extort money) for maintaining promotional spam articles without disclosing such activity. In this paper, we address for the first time the problem of identifying undisclosed paid articles in Wikipedia. We propose a machine learning-based framework using a set of features based on both the content of the articles and the patterns in the edit histories of the users who create them. To test our approach, we collected and curated a new dataset from English Wikipedia with ground truth on undisclosed paid articles. Our experimental evaluation shows that we can identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. Moreover, our approach outperforms ORES, a scoring system tool currently used by Wikipedia to automatically detect damaging content, in identifying undisclosed paid articles. Finally, we show that our user-based features can also detect undisclosed paid editors, with an AUROC of 0.94 and an average precision of 0.92, outperforming existing approaches.

CCS CONCEPTS
• Information systems → Wikis; Data mining.

KEYWORDS
Wikipedia, detection of abusive content, malicious editors, sockpuppet accounts

ACM Reference Format:
Nikesh Joshi, Francesca Spezzano, Mayson Green, and Elijah Hill. 2020. Detecting Undisclosed Paid Editing in Wikipedia. In Proceedings of The Web Conference 2020 (WWW '20), April 20–24, 2020, Taipei, Taiwan. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3366423.3380055

1 INTRODUCTION
Wikipedia is the free online encyclopedia based on the principle of open collaboration: for the people, by the people. Anyone can add to and edit almost any article or page, but volunteers are expected to follow a set of guidelines when editing Wikipedia. The purpose of Wikipedia is "to provide the public with articles that summarize accepted knowledge, written neutrally and sourced reliably" (https://en.wikipedia.org/wiki/Wikipedia:Conflict_of_interest), and the encyclopedia should not be used as a platform for advertising and self-promotion. Wikipedia's guidelines strongly discourage any form of conflict-of-interest (COI) editing and require editors to disclose any COI contribution. Paid editing is a form of COI editing and refers to editing Wikipedia (in the majority of cases for promotional purposes) in exchange for compensation. The guidelines set by Wikipedia are based on good faith, and malicious editors who earn a living through paid editing of Wikipedia choose to ignore the requirement to disclose that they are paid. Moreover, these malicious editors often use sockpuppet accounts to circumvent a block or a ban imposed on the person's original account. A sockpuppet is an "online identity used for the purpose of deception" (https://en.wikipedia.org/wiki/Sockpuppet_(Internet)). Usually, several sockpuppet accounts are controlled by a unique individual (or entity) called a puppetmaster.
The first discovered paid editing case was the "Wiki-PR editing of Wikipedia" in 2013 (https://en.wikipedia.org/wiki/Wiki-PR_editing_of_Wikipedia). Wiki-PR is a company, still in existence but banned by Wikipedia, whose core business is to offer consulting services to create, edit, and monitor "your" Wikipedia page. The 2013 investigation found that more than 250 sockpuppet accounts were related to and controlled by the company. On August 31, 2015, the Wikipedia community uncovered a bigger set of 381 sockpuppet accounts as part of an investigation nicknamed "Orangemoody" (https://en.wikipedia.org/wiki/Orangemoody_editing_of_Wikipedia), operating a secret paid editing ring whose participants extorted money from businesses that had had articles about themselves rejected. The Orangemoody accounts themselves may have been involved in the deletion of some of those articles.

When undisclosed paid articles or editors are identified, the pages are removed from Wikipedia and the accounts are blocked. However, the Wikipedia community still relies on administrators who manually track down editors and affected articles. The differences between good-faith editing and spam can be hard for even experienced editors to see, and, with hundreds of articles to be examined each month, the review process can be tedious, inefficient, and possibly unreliable.

In this paper, we focus, for the first time, on automatically detecting undisclosed paid contributions to Wikipedia, so that they can be quickly identified and flagged for removal. We make the following contributions. (1) We propose a machine learning-based framework to classify undisclosed paid articles that uses a set of features based on article content, metadata, and network properties, as well as the patterns of edit behavior of the users who create them. (2) To test our framework, we built a curated English Wikipedia dataset containing 73.9K edits by undisclosed paid editors (including deleted edits) and 199.2K edits by genuine editors, with ground truth on undisclosed paid articles. (3) Through our experimental evaluation, we show that our proposed method can efficiently identify undisclosed paid articles with an AUROC of 0.98 and an average precision of 0.91. We also show that our approach outperforms ORES (https://www.mediawiki.org/wiki/ORES), the state-of-the-art machine learning service created and maintained by the Wikimedia Scoring Platform team to detect content damage on Wikipedia. Finally, we demonstrate that our proposed user-based features can be used to detect undisclosed paid editors as well, achieving an AUROC of 0.94 and an average precision of 0.92 and outperforming other existing approaches for sockpuppet detection in Wikipedia.
2 RELATED WORK
Different forms of content damage on Wikipedia have been studied in the literature, including vandalism, hoaxes, and spam. Wikipedia vandalism is "the act of editing the project in a malicious manner that is intentionally disruptive," e.g., through text that is humorous, nonsensical, or offensive (https://en.wikipedia.org/wiki/Vandalism_on_Wikipedia). Detecting vandalism was the very first problem studied in the context of Wikipedia content deception. Research shows that linguistic, metadata, and user reputation features are all important to detect vandal edits in Wikipedia [1, 2, 11, 18]. Kumar et al. [8] addressed the problem of detecting vandal users and proposed VEWS, a warning system for the early detection of these users that leverages editors' behavioral patterns. Kumar et al. [9] studied the characteristics and impact of Wikipedia hoaxes, articles that deceptively present false information as fact. They showed that Wikipedia hoaxes can be detected by using features that consider the article structure and content, hyperlink network properties, and the hoax creator's reputation.

Spam on Wikipedia refers to the unsolicited promotion of some entity, such as external link spamming and advertisements masquerading as articles (as in the promotional articles written by undisclosed paid editors). The majority of the work on spam detection on Wikipedia has focused on detecting link spamming via metadata, URL properties, and landing site characteristics [16, 17], or spam users by using behavioral features [5]. To the best of our knowledge, there is no work addressing the problem of detecting promotional Wikipedia articles or undisclosed paid edits.

Some bots and tools run on Wikipedia to detect vandalism or generally damaging edits. ClueBot NG (https://en.wikipedia.org/wiki/User:ClueBot_NG) and STiki (https://en.wikipedia.org/wiki/Wikipedia:STiki) [18] are designed to detect vandalism. ClueBot NG is a bot that analyzes edit content, scores edits, and reverts the worst-scoring ones. STiki is an intelligent routing tool that suggests potential vandalism to humans for definitive classification; it works by scoring edits by metadata and reverts and computing a reputation score for each user. These days, Wikimedia ORES (https://www.mediawiki.org/wiki/ORES) is the state-of-the-art approach to classifying the quality of Wikipedia articles. Specifically, given an article, ORES evaluates its content according to one of the following classes: spam, vandalism, attack, or OK. Thus, we compare our proposed approach for detecting undisclosed paid articles with ORES in Section 5.2.2.

As explained in the Introduction, undisclosed paid editors typically act as a group of sockpuppet accounts. In the literature, several works have analyzed and detected sockpuppet accounts in online social networks and discussion forums [4, 7, 10, 15]. Specifically for Wikipedia, Solorio et al. [12, 13] have addressed the problem of detecting whether or not two accounts are maintained by the same user by using text authorship identification features. Other approaches have focused on classifying sockpuppet vs. genuine accounts by using non-verbal behavior and considering editing patterns [14, 19].
3 DATASET
This section describes the dataset we used to perform this study. We collaborated with an English Wikipedia administrator (Smart SE, https://en.wikipedia.org/wiki/User%3ASmartse) active in reviewing articles that may have a conflict of interest (especially paid editing) to collect and curate a dataset of newly created positive articles, created by known undisclosed paid editors, and newly created negative articles, created by genuine users who are not paid editors. We collected the data through the publicly available Wikipedia API. Thanks to our administrator's account, we were also able to access currently deleted edits from known undisclosed paid editors; deleted edits are not visible to general users through the Wikipedia API.

To gather the set of positive articles, we started by considering a manually curated set of 1,006 known undisclosed paid editor (UPE) accounts from English Wikipedia, which includes accounts from 23 different sockpuppet investigations [3]. Another 98 known UPE accounts were manually added by our Wikipedia administrator, resulting in a total of 1,104 UPE accounts. Among the set of new articles created by these UPE accounts, our administrator manually classified 748 articles (authored by 330 different editors) as paid articles (positive data). Note that not all the articles created by a UPE are paid articles, as these editors may create "genuine" articles to build a reputation.

To collect the set of negative articles, we started by retrieving the accounts of users who created a new article (or moved pages created in their user page or draft page to the article namespace, as some UPEs do) in March 2019 (the time of data collection) and who, similarly to UPEs, had made relatively few edits (fewer than 200) in their account lifetime. 1,557 of these users turned out to be genuine, i.e., they are neither known paid editors (nor Wikipedia-blocked users; see https://en.wikipedia.org/wiki/Special:BlockList) nor potentially paid editors, as manually verified by our Wikipedia administrator. Then, we considered as the set of negative articles all the articles newly created by these genuine users. This resulted in 6,984 articles.

For each article in the positive and negative sets, we built a dataset containing the username of the user who created the page, the creation timestamp, and the content of the article corresponding to the last edit by the article creator, and we computed its size (in bytes). Further, in order to be able to compute features about the article creator's account, we collected the time when the account was created and the whole edit history of all the genuine and UPE accounts collected as explained above. For each of these edits (or revisions), we collected the timestamp, edited page title, revision ID, revision size, and size difference with respect to the previous version of the edited page. As explained above, deleted edits by UPEs are included in this data, providing us with a complete edit history for the UPE accounts. In total, we collected 73,931 edits by our 1,104 UPEs and 199,172 edits by our 1,557 genuine editors. Table 1 summarizes the size of the collected data.

Table 1: Size of positive and negative data. Positive data refers to newly created paid articles or known undisclosed paid editors (UPEs).

                          Positive Data    Negative Data
  Newly Created Articles  748              6,984
  Editors                 1,104 (UPEs)     1,557
  Total Num. of Edits     73,931           199,172
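For illustration, per-user edit histories like the ones described above can be pulled through the public MediaWiki API with a `usercontribs` query. The sketch below is not the authors' collection code; the username is a placeholder, and deleted edits are only visible to privileged accounts.

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_user_contribs(username):
    """Yield every visible edit (ids, title, timestamp, size, sizediff) by a user."""
    params = {
        "action": "query",
        "format": "json",
        "list": "usercontribs",
        "ucuser": username,
        "uclimit": "max",  # as many results per request as the API allows
        "ucprop": "ids|title|timestamp|size|sizediff",
    }
    while True:
        data = requests.get(API_URL, params=params, timeout=30).json()
        yield from data["query"]["usercontribs"]
        if "continue" not in data:  # no further pages of results
            break
        params.update(data["continue"])  # carry over the continuation token

# e.g., edits = list(fetch_user_contribs("ExamplePlaceholderUser"))
```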


3.1 Article Network Analysis
We built an article-article network where Wikipedia articles are nodes, and there is an edge between two articles if the same user has edited both of them. We considered the edit history of all the users in our dataset for creating this network. Figure 1 shows the resulting network (93,406 nodes and 44,264,072 edges), where colored nodes represent articles to which at least one undisclosed paid editor contributed (referred to as positive articles in the rest of this section), and gray ones indicate articles edited only by benign users (referred to as negative articles in the rest of this section). Two positive nodes have the same color if the same sockpuppet group created them. We used the list of sockpuppet investigations in the context of undisclosed paid editing provided by Ballioni et al. [3].

Figure 1: Article network: two articles are connected by an edge if they have been edited by a common user. Colors indicate articles created by the same sockpuppet group of undisclosed paid editors (UPEs). Negative articles (in gray) are articles never edited by a UPE.

By studying the network, we found that positive articles are less central in the network than negative ones. On average, positive articles have a PageRank of 1.15e-05 (vs. 1.17e-05 for negative ones) and an average local clustering coefficient (LCC) of 0.966 (vs. 0.974 in the case of negative articles). In both cases, the means are different with a p-value < 0.001 according to an independent t-test. This means that there is less user collaboration among positive articles. UPEs only work on the limited number of Wikipedia titles that they are interested in promoting, whereas genuine users edit more pages related to their field of expertise. That results in negative pages being more tightly knit in the network. This result also shows that sockpuppet accounts' behavior in Wikipedia is different from sockpuppetry in online discussion communities, where the sockpuppets' main goal is to interact with each other to deceive other users, and they have higher PageRank and LCC than benign users [7].
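A minimal sketch of how such a co-editing network and the two centrality statistics can be computed with networkx; the `edit_history` mapping is toy data standing in for the collected edit histories, and the numbers quoted in the text come from the authors' full graph.

```python
from itertools import combinations

import networkx as nx

# Toy stand-in for the collected data: user -> set of articles they edited.
edit_history = {
    "UserA": {"Article1", "Article2"},
    "UserB": {"Article2", "Article3"},
}

G = nx.Graph()
for user, articles in edit_history.items():
    G.add_nodes_from(articles)
    for a, b in combinations(sorted(articles), 2):
        G.add_edge(a, b)  # the two articles share at least one editor

pagerank = nx.pagerank(G)  # per-article centrality
lcc = nx.clustering(G)     # per-article local clustering coefficient
# The group comparison in the text (positive vs. negative means) can then be
# run with an independent t-test, e.g., scipy.stats.ttest_ind.
```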
By looking at the articles edited by the same sockpuppet group in Figure 1, we observe that, for some investigations, the corresponding pages are more clustered than others. To further understand the meaning of the different cluster shapes, for each investigation we computed the number of unique articles edited by each sockpuppet account and used the standard deviation (SD) to measure the amount of variation in unique edited articles per account within each investigation. Figure 2 shows the SD values for each investigation.

Figure 2: Standard deviation of unique edited articles per user per sockpuppet investigation.

We found that the investigations with a higher standard deviation tend to form denser clusters with a "drop" shape. This is the case, for instance, of the LogAntiLog investigation, which has the highest standard deviation (697) among all the investigations, with few sockpuppet accounts contributing to most of the articles. For investigations with standard deviations in the intermediate range, the corresponding clusters tend to form "bracket" shapes, as the UPEs' contributions are more evenly distributed among the affected articles. For example, the Singer_Jethu_Sisodiya (SD=405)/Rudra.shukla (SD=397) and Ventus55 (SD=49)/Anatha_Gulati (SD=70) investigation pairs have similar standard deviation values and form near-identical shapes. As we move to investigations with a lower standard deviation of unique edited articles per account, clusters start to lose shape (e.g., the Orangemoody investigation, with SD=12). This analysis suggests that different undisclosed paid editor groups may adopt different editing strategies, which makes the problem of detecting undisclosed paid articles more challenging.
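The per-investigation dispersion measure reduces to a simple grouped aggregation. In this sketch the `edits` DataFrame and its column names (`investigation`, `account`, `title`) are illustrative stand-ins for the collected data:

```python
import pandas as pd

# One row per edit; the column names are illustrative.
edits = pd.DataFrame({
    "investigation": ["Orangemoody", "Orangemoody", "LogAntiLog",
                      "LogAntiLog", "LogAntiLog"],
    "account": ["sock1", "sock2", "sock3", "sock3", "sock4"],
    "title":   ["A",     "B",     "C",     "D",     "E"],
})

# Unique articles edited by each sockpuppet account...
per_account = edits.groupby(["investigation", "account"])["title"].nunique()
# ...and the standard deviation of that count within each investigation.
sd_per_investigation = per_account.groupby(level="investigation").std()
print(sd_per_investigation)
```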
4 FEATURES FOR IDENTIFYING UNDISCLOSED PAID ARTICLES
In this section, we propose two sets of features to be used with our framework, based on properties of the articles as well as the edit history of the users who created them. The features we chose capture the patterns and behaviors that are more likely to be associated with malicious activity and paid editing, so as to distinguish them from benign ones. The features considered in our approach are described in the following two subsections.

4.1 Article-based Features
This first set of features includes features related to the article, such as metadata, content, and network-based features:

Age of user account at article creation (user_age) - Since sockpuppet accounts are more likely to be created close to the time of creation of an article, the age of the user account at the time of article creation can be used as a feature for detecting undisclosed paid articles.

Infobox - This feature checks whether the article contains an infobox. The infobox is "a fixed-format table usually added to the top right-hand corner of articles to consistently present a summary of some unifying aspect that the articles share and sometimes to improve navigation to other interrelated articles" (https://en.wikipedia.org/wiki/Help:Infobox). Undisclosed paid editors tend to add an infobox to the pages they create to increase the exposure of the entity they are promoting, as the presence of an infobox is an easy way for humans to grasp a summary of the article content.

Number of references - This feature indicates the total number of references (including URL links) present in a given Wikipedia article. Regular Wikipedia articles (especially newly created ones) have a lot of missing references, and researchers have been addressing the problem of suggesting proper references [6]. On the other hand, the purpose of creating undisclosed paid articles is promotional, hence several explicit references to the promoted item are added at the time of page creation. Therefore, a higher number of references in a given article can be a useful indicator of undisclosed paid editing.

Number of photos - This feature refers to the number of photos present in a given Wikipedia article. Uploading images to Wikipedia is relatively complicated, as it requires copyright verification (see https://en.wikipedia.org/wiki/Wikipedia:User_access_levels). The majority of images added to Wikipedia articles are removed within hours or days of being uploaded because of inappropriate, insufficient, or inaccurate copyright information. Thus, to avoid a promotional article looking suspicious because of its associated images, undisclosed paid articles tend to have fewer images than regular articles.

Number of categories - This feature represents the number of categories associated with a given article. Articles that belong to many categories deal with more complex topics and are less likely to be undisclosed paid articles.

Content length - This feature indicates the total length, in bytes, of the content of the given article. As regular pages are more curated and edited collaboratively by many editors, they tend to have more content and to be longer than undisclosed paid ones.

Network-based features - We also consider the article's PageRank and local clustering coefficient (LCC) as additional features (cf. Section 3.1).
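As a rough illustration of how the content-based features above could be extracted from raw wikitext; the regular expressions are simplified heuristics of our own, since the paper does not specify its exact parsing rules:

```python
import re

def article_features(wikitext, user_age_days, pagerank, lcc):
    """Heuristic extraction of the Section 4.1 features from raw wikitext."""
    return {
        "user_age": user_age_days,  # account age at page creation, precomputed
        "has_infobox": int(bool(re.search(r"\{\{\s*Infobox", wikitext, re.I))),
        "num_references": len(re.findall(r"<ref[\s>]", wikitext, re.I)),
        "num_photos": len(re.findall(r"\[\[(?:File|Image):", wikitext, re.I)),
        "num_categories": len(re.findall(r"\[\[Category:", wikitext, re.I)),
        "content_length": len(wikitext.encode("utf-8")),  # size in bytes
        "pagerank": pagerank,  # from the article network of Section 3.1
        "lcc": lcc,
    }
```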
4.2 User-based Features
The second group of features we propose refers to characteristics, such as choice of username and editing behavior, of the user account that created the article. All the features but the username-based ones are computed by considering the history of contributions made by the editor.

Username-based features - Characteristics of usernames can be linked to malicious users who could create undisclosed paid articles [20]. For instance, Green and Spezzano [5] showed that username-based features are important for detecting Wikipedia spammers. Thus, we consider the number of leading digits, the number of digits, the ratio of digits to characters, and the ratio of unique characters in the username as features indicating a suspicious account.

Average size of added text (avg_size_added) - Given an editor, this feature computes the average size of the text added to an article by the editor. Undisclosed paid editors are more likely to create new article content offline and then add it to Wikipedia all at once, while benign users edit Wikipedia directly with smaller additions over time.

Average time difference (avg_time_diff) - This feature indicates the average time between two consecutive edits made by the same user. As explained for the feature above, undisclosed paid editors do not regularly edit Wikipedia: they work mainly offline and then add the content whenever they are ready. Thus, we expect the average time difference to be higher for these malicious editors than for benign editors.

Ten-byte ratio - This feature computes the percentage of edits made by a user that are less than 10 bytes. Undisclosed paid editors try to become autoconfirmed users; thus they typically make around 10 minor edits before creating a promotional article. A registered user account automatically becomes autoconfirmed if the account is more than four days old and has made at least 10 changes. Autoconfirmed users are considered benign users and are therefore allowed to move pages to a different title and to change pages that have been semi-protected by administrators. The main reason for having autoconfirmed status on Wikipedia is to prevent vandalism and other types of disruptive editing.



Percentage of edits on User or Talk pages (user_talk_edits) - This feature computes the percentage of edits a user has made on User or Talk pages. Undisclosed paid editors may want to edit User or Talk pages for several reasons: they may want a User page so as to look like genuine editors, and they may draft content on an article's Talk page before moving it to the main article page. Further, the content of an article is discussed by Wikipedia editors on the article's Talk page or on the contributor's User page, and since the contributions of undisclosed paid editors may be disputed by administrators and genuine editors, we expect these malicious editors to engage more in editing these types of pages than genuine editors.
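A compact sketch of the user-based features of this section, computed from a time-sorted edit history. Field names are illustrative, and where the text leaves a definition open (e.g., whether the ten-byte ratio uses the absolute size change), the comments flag our assumption:

```python
def user_features(username, edits):
    """`edits` is a time-sorted list of dicts, each with `timestamp`
    (datetime), `sizediff` (signed byte delta), and `title`."""
    diffs = [e["sizediff"] for e in edits]
    gaps = [(b["timestamp"] - a["timestamp"]).total_seconds() / 86400.0
            for a, b in zip(edits, edits[1:])]  # gaps between edits, in days
    n = max(len(edits), 1)
    digits = sum(c.isdigit() for c in username)
    return {
        # Username-based features.
        "leading_digits": len(username) - len(username.lstrip("0123456789")),
        "num_digits": digits,
        "digit_ratio": digits / len(username),
        "unique_char_ratio": len(set(username)) / len(username),
        # Behavioral features (assumption: "size added" = positive deltas only).
        "avg_size_added": sum(d for d in diffs if d > 0) / n,
        "avg_time_diff": sum(gaps) / max(len(gaps), 1),
        # Assumption: "less than 10 bytes" read as absolute size change.
        "ten_byte_ratio": sum(abs(d) < 10 for d in diffs) / n,
        "user_talk_edits": sum(e["title"].startswith(("User:", "User talk:", "Talk:"))
                               for e in edits) / n,
    }
```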

5 EXPERIMENTAL RESULTS
In this section, we report on the classification performance of our proposed features in detecting undisclosed paid articles, in comparison with ORES, the scoring system tool currently used by Wikipedia to automatically detect damaging edits such as vandalism and spam. In addition to this main task, we also perform experiments, in Section 5.3, to investigate the performance of our user-based features in predicting undisclosed paid editors.

5.1 Experimental Setting
We tested our features on the classification task by using three different classification algorithms, namely Logistic Regression, Support Vector Machine (SVM), and Random Forest. We used class weighting to deal with class imbalance in all the classifiers. Class weighting is a way to learn from an unbalanced dataset where the classifier imposes, during training, a penalty for classification mistakes that is inversely proportional to the class distribution. To evaluate performance, we considered the Area Under the Receiver Operating Characteristic curve (AUROC) and the Average Precision metrics, which are well suited to measuring classification results on unbalanced data, and performed stratified 5-fold cross-validation.

5.2 Detecting Undisclosed Paid Articles
Our results on the main task of detecting newly created undisclosed paid articles are reported in Table 2. In this experiment, the user-based features describing the behavior of the user, e.g., average time difference, have been computed by considering the user's edit history up to the time of page creation. We observe that Random Forest is the overall best-performing classification algorithm. When we consider article-based features only, we achieve an AUROC of 0.856 and an average precision of 0.507. When we consider the user-based features alone, we instead obtain better performance than with article-based features: this group of features achieves an AUROC of 0.971 (a slightly higher AUROC of 0.98 is achieved with SVM) and an average precision of 0.893. Moreover, when we combine both article- and user-based features, we improve our classification results over each group of features individually: an AUROC of 0.983 and an average precision of 0.913. This means that both the article content and information about the account that created the article are important for detecting undisclosed paid articles.

Table 2: Performance of our proposed features in detecting undisclosed paid articles according to different classification algorithms, and comparison and combinations with ORES features (which are article-based), according to the AUROC and average precision metrics.

                                         Article-based Features   User-based Features      Article + User Features
                                         AUROC   Avg. Precision   AUROC   Avg. Precision   AUROC   Avg. Precision
  Our Features
    Random Forest                        0.856   0.507            0.971   0.893            0.983   0.913
    Logistic Regression                  0.734   0.230            0.675   0.171            0.656   0.153
    Support Vector Machine (SVM)         0.555   0.171            0.980   0.823            0.556   0.172
  ORES (Random Forest)                   0.844   0.424            -       -                -       -
  Our Features + ORES (Random Forest)    0.905   0.597            0.974   0.877            0.981   0.907

5.2.1 Feature Analysis. To analyze our features, we computed feature importance via a forest of randomized trees. Let F be a set of features. The relative importance (for the classification task) of a feature f ∈ F is given by the depth of f when it is used as a decision node in a tree: features used near the top of a tree contribute to the final prediction decision of a larger fraction of the input samples. The expected fraction of the samples they contribute to can thus be used as an estimate of the relative importance of the features. Figure 3 shows the importance of our set of features for the undisclosed paid article classification task; the red bars in the plot show the feature importance computed using the whole forest. The variability of feature importance scores across the trees in the forest is minimal (less than 0.0001).

Figure 3: Top-10 most important features for detecting undisclosed paid articles.
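Before turning to the detailed results, here is a sketch of this evaluation protocol and of the impurity-based importance computation with scikit-learn. The data is a placeholder, and hyperparameters are not reported in the paper, so library defaults are used:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data: X is the feature matrix, y = 1 for undisclosed paid articles.
rng = np.random.default_rng(0)
X, y = rng.random((200, 12)), rng.integers(0, 2, 200)

clf = RandomForestClassifier(class_weight="balanced", random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("AUROC:", cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean())
print("Avg. precision:",
      cross_val_score(clf, X, y, cv=cv, scoring="average_precision").mean())

# Feature importance (Section 5.2.1): mean decrease in impurity over the whole
# forest, with variability measured across the individual trees.
clf.fit(X, y)
importances = clf.feature_importances_
std = np.std([t.feature_importances_ for t in clf.estimators_], axis=0)
for i in np.argsort(importances)[::-1][:10]:
    print(f"feature {i}: importance={importances[i]:.4f} (std {std[i]:.4f})")
```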

Among all the features we defined (article- and user-based), the top four most important features are: percentage of edits on User or Talk pages, average time difference, number of references, and ten-byte ratio. We observe that, on average, the value of the percentage of edits on User or Talk pages feature is higher for positive articles (0.008) than for negative ones (0.0007). These values confirm our hypothesis that users who create undisclosed paid articles are more engaged in editing User and Talk pages than genuine users. Further, we see that, on average, users who create undisclosed paid articles edit more slowly than genuine users: the value of the average time difference feature is 2.8 days for regular articles and 9.2 days for undisclosed paid articles. Also, the percentage of edits that are less than 10 bytes in size is higher for users who created undisclosed paid articles: the value of the ten-byte ratio feature is, on average, 0.38 for positive articles and 0.34 for negative ones. This pattern aligns with the typical behavior of UPEs, who make around 10 minor edits, then remain quiet for a few days while waiting to become autoconfirmed users (the process takes 4 days), and then create a promotional article, after which the account goes silent [3]. The third most important feature is the number of references in the newly created article. We observe that, on average, positive articles have more references than negative ones: 7.06 vs. 4.88. As explained in Section 4.1, this aligns with the fact that regular Wikipedia articles have more missing references than undisclosed paid ones, which instead include references to the promoted item.

5.2.2 Comparison with ORES. As explained in Section 2, ORES is a web service developed by the Wikimedia Scoring Platform team that provides a machine learning-based scoring system for edits, analyzing the edit content and metadata. One of the offered services evaluates the quality of an article draft by assigning to the article a probability distribution over the following four classes: spam, vandalism, attack, or OK. To compare our proposed approach with ORES, we retrieved the draft quality scores for the positive and negative articles in our dataset by using the publicly available ORES API (https://ores.wikimedia.org) and used them as input to a classifier to predict undisclosed paid articles. We found that Random Forest performed better than both Logistic Regression and SVM; hence, we report Random Forest results only, at the bottom of Table 2. Our proposed approach significantly outperforms ORES, which achieves an AUROC of 0.844 and an average precision of 0.424 in detecting undisclosed paid articles. Specifically, we drastically improve the average precision (+49%), mainly thanks to the inclusion of the user-based features in our approach. We argue that adding user-based features (which are not currently considered) to ORES would be beneficial to increase its performance in detecting damaging edits. For instance, if we combine our user-based features with ORES, we increase the ORES AUROC and average precision scores to 0.974 and 0.877, respectively, on the task of detecting undisclosed paid edits.

5.3 Detecting Undisclosed Paid Editors
As we have seen in the previous section, the user-based features described in Section 4.2 are better than the content-based ones at detecting undisclosed paid articles. Thus, in this section, we investigate the effectiveness of these features on the different, but related, task of detecting undisclosed paid editors. As reported in Table 1, we have 1,104 UPEs and 1,557 benign users in our dataset. In this experiment, we consider the whole user edit history to compute the user-based features. To compare with ORES, given a user, we retrieved and averaged the draft quality scores of all their edits. Results are reported in Table 3 for Random Forest (the best classifier). As we can see, our features achieve an AUROC of 0.937 and an average precision of 0.916 and outperform ORES. By combining our features with ORES, we slightly improve the performance, to an AUROC of 0.950 and an average precision of 0.931.

Table 3: Performance of our user-based features in detecting undisclosed paid editors and comparison with related work according to AUROC and average precision. Results are computed with Random Forest (the best classifier).

                                    AUROC    Average Precision
  Our User-based Features           0.937    0.916
  ORES                              0.718    0.620
  Our User-based Features + ORES    0.950    0.931
  Yamak et al. [19]                 0.934    0.885

5.3.1 Comparison with Other Work on Detecting Sockpuppet Accounts in Wikipedia. As we have seen in Section 2, there is work addressing the problem of detecting sockpuppet accounts specifically in Wikipedia. As undisclosed paid editors use sockpuppet accounts, we compare our approach with this related work as well. Yamak et al. [19] showed that their approach, based on the contribution behavior of the users, achieves better performance than other works based on the analysis of the contribution text [12, 13] or on non-verbal behavior [14]. Hence, we compare our approach with the one by Yamak et al. [19] only. (Yamak et al. included a feature that considers whether an edit has been reverted by another user, which makes the detection not completely automated, as human input is required. Since we propose an automatic detection approach that does not rely on human input, we did not include the revert-based feature in our implementation of the Yamak et al. approach, for a fairer comparison.) As we can see from Table 3 (last row), our approach has a comparable AUROC score but achieves a better average precision (+3%) than the competitor.
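For reference, the ORES draft-quality scores used in Sections 5.2.2 and 5.3 come from the public scores endpoint. A minimal request might look as follows; the response layout follows the v3 API as documented at the time, not the authors' code, and the revision ID in the usage note is a placeholder:

```python
import requests

def draftquality_probs(rev_ids):
    """Return {rev_id: class -> probability} over {OK, spam, vandalism, attack}."""
    resp = requests.get(
        "https://ores.wikimedia.org/v3/scores/enwiki/",
        params={"models": "draftquality",
                "revids": "|".join(str(r) for r in rev_ids)},
        timeout=30,
    ).json()
    scores = resp["enwiki"]["scores"]
    return {r: scores[str(r)]["draftquality"]["score"]["probability"]
            for r in rev_ids}

# e.g., draftquality_probs([123456789])
```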
6 CONCLUSIONS
In this paper, we addressed the problem of identifying undisclosed paid articles in English Wikipedia. Our proposed approach relies on article-based and user-based features that describe potentially malicious behavior. Through our experimental evaluation, we showed that we can detect such articles with an AUROC of 0.98 and an average precision of 0.91. As our features are independent of linguistic barriers, our proposed approach can work on any Wikipedia language version. We also showed that our user-based features can be used to identify undisclosed paid editors with 0.94 AUROC and 0.92 average precision. Our results in detecting undisclosed paid articles and editors improve over state-of-the-art approaches.

ACKNOWLEDGEMENTS
The authors would like to thank Smart SE (English Wikipedia administrator) for providing the dataset and contributing to the paper with his domain expertise. They also thank [...] for his help in retrieving ORES scores for deleted articles.

REFERENCES
[1] B. Thomas Adler, Luca de Alfaro, Santiago Moisés Mola-Velasco, Paolo Rosso, and Andrew G. West. 2011. Wikipedia Vandalism Detection: Combining Natural Language, Metadata, and Reputation Features. In Computational Linguistics and Intelligent Text Processing - 12th International Conference, CICLing 2011, Tokyo, Japan, February 20-26, 2011, Proceedings, Part II. 277–288.
[2] B. Thomas Adler, Luca de Alfaro, and Ian Pye. 2010. Detecting Wikipedia Vandalism using WikiTrust - Lab Report for PAN at CLEF 2010. In CLEF 2010 LABs and Workshops, Notebook Papers, 22-23 September 2010, Padua, Italy.
[3] Tony Ballioni, , Brian Henry, and Aaron Halfaker. 2018. Known Undisclosed Paid Editors (English Wikipedia). (April 2018). https://doi.org/10.6084/m9.figshare.6176927.v1
[4] Zhan Bu, Zhengyou Xia, and Jiandong Wang. 2013. A sock puppet detection algorithm on virtual spaces. Knowledge-Based Systems 37 (2013), 366–377.
[5] Thomas Green and Francesca Spezzano. 2017. Spam Users Identification in Wikipedia Via Editing Behavior. In Proceedings of the Eleventh International Conference on Web and Social Media, ICWSM 2017, Montréal, Québec, Canada, May 15-18, 2017. 532–535.
[6] Abhik Jana, Pranjal Kanojiya, Pawan Goyal, and Animesh Mukherjee. 2018. WikiRef: Wikilinks as a route to recommending appropriate references for scientific Wikipedia pages. arXiv preprint arXiv:1806.04092 (2018).
[7] Srijan Kumar, Justin Cheng, Jure Leskovec, and V. S. Subrahmanian. 2017. An Army of Me: Sockpuppets in Online Discussion Communities. In Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017. 857–866.
[8] Srijan Kumar, Francesca Spezzano, and V. S. Subrahmanian. 2015. VEWS: A Wikipedia Vandal Early Warning System. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015. 607–616.
[9] Srijan Kumar, Robert West, and Jure Leskovec. 2016. Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016. 591–602.
[10] Dong Liu, Quanyuan Wu, Weihong Han, and Bin Zhou. 2016. Sockpuppet gang detection on social media sites. Frontiers of Computer Science 10, 1 (2016), 124–135.
[11] Martin Potthast, Benno Stein, and Robert Gerling. 2008. Automatic Vandalism Detection in Wikipedia. In Advances in Information Retrieval, 30th European Conference on IR Research, ECIR 2008, Glasgow, UK, March 30-April 3, 2008, Proceedings. 663–668.
[12] Thamar Solorio, Ragib Hasan, and Mainul Mizan. 2013. A case study of sockpuppet detection in Wikipedia. In Proceedings of the Workshop on Language Analysis in Social Media at NAACL-HLT. 59–68.
[13] Thamar Solorio, Ragib Hasan, and Mainul Mizan. 2013. Sockpuppet detection in Wikipedia: A corpus of real-world deceptive writing for linking identities. arXiv preprint arXiv:1310.6772 (2013).
[14] Michail Tsikerdekis and Sherali Zeadally. 2014. Multiple account identity deception detection in social media using nonverbal behavior. IEEE Transactions on Information Forensics and Security 9, 8 (2014), 1311–1321.
[15] Bimal Viswanath, Ansley Post, Krishna P. Gummadi, and Alan Mislove. 2011. An analysis of social network-based sybil defenses. ACM SIGCOMM Computer Communication Review 41, 4 (2011), 363–374.
[16] Andrew G. West, Avantika Agrawal, Phillip Baker, Brittney Exline, and Insup Lee. 2011. Autonomous link spam detection in purely collaborative environments. In Proceedings of the 7th International Symposium on Wikis and Open Collaboration, Mountain View, CA, USA, October 3-5, 2011. 91–100.
[17] Andrew G. West, Jian Chang, Krishna K. Venkatasubramanian, Oleg Sokolsky, and Insup Lee. 2011. Link spamming Wikipedia for profit. In The 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, CEAS 2011, Perth, Australia, September 1-2, 2011, Proceedings. 152–161.
[18] Andrew G. West, Sampath Kannan, and Insup Lee. 2010. Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata. In Proceedings of the Third European Workshop on System Security, EUROSEC 2010, Paris, France, April 13, 2010. 22–28.
[19] Zaher Yamak, Julien Saunier, and Laurent Vercouter. 2016. Detection of multiple identity manipulation in collaborative projects. In Proceedings of the 25th International Conference Companion on World Wide Web (Companion). 955–960.
[20] Reza Zafarani and Huan Liu. 2015. 10 Bits of Surprise: Detecting Malicious Users with Minimum Information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM 2015, Melbourne, VIC, Australia, October 19-23, 2015. 423–431.