An Analysis of Contributions to Wikipedia from Tor

Are anonymity-seekers just like everybody else? An analysis of contributions to Wikipedia from Tor Chau Tran Kaylea Champion Andrea Forte Department of Computer Science & Engineering Department of Communication College of Computing & Informatics New York University University of Washington Drexel University New York, USA Seatle, USA Philadelphia, USA [email protected] [email protected] [email protected] Benjamin Mako Hill Rachel Greenstadt Department of Communication Department of Computer Science & Engineering University of Washington New York University Seatle, USA New York, USA [email protected] [email protected] Abstract—User-generated content sites routinely block contributions from users of privacy-enhancing proxies like Tor because of a perception that proxies are a source of vandalism, spam, and abuse. Although these blocks might be effective, collateral damage in the form of unrealized valuable contributions from anonymity seekers is invisible. One of the largest and most important user-generated content sites, Wikipedia, has attempted to block contributions from Tor users since as early as 2005. We demonstrate that these blocks have been imperfect and that thousands of attempts to edit on Wikipedia through Tor have been successful. We draw upon several data sources and analytical techniques to measure and describe the history of Tor editing on Wikipedia over time and to compare contributions from Tor users to those from other groups of Wikipedia users. Fig. 1. Screenshot of the page a user is shown when they attempt to edit the Our analysis suggests that although Tor users who slip through Wikipedia article on “Privacy” while using Tor. Wikipedia’s ban contribute content that is more likely to be reverted and to revert others, their contributions are otherwise similar in quality to those from other unregistered participants and to the initial contributions of registered users. In particular, we make use of the fact that Wikipedia’s mechanism of blocking Tor users has been imperfect to identify and extract 11,363 edits made by Tor users to English I. INTRODUCTION Wikipedia between 2007 and 2018. We analyze how some When a Wikipedia reader using the Tor Browser notices a Tor users managed to slip through Wikipedia’s block and stylistic error or missing fact and clicks the “Edit” button to describe how we constructed our dataset of Tor edits. We use fix it, they see a message like the one reproduced in Fig. 1. this dataset to compare the edits of people using Tor (Tor arXiv:1904.04324v3 [cs.SI] 15 Feb 2020 Wikipedia informs would-be Tor contributors that they, like editors) with three different control sets of time-matched edits others using open proxy systems to protect their privacy, have by other Wikipedia contributor populations: (1) non-logged been preemptively blocked from contributing. Wikipedia is not in users editing from non-Tor IP addresses whose edits are alone in the decision to block participation from anonymity- credited to their IP address (IP editors), (2) people logged seeking users. Although service providers vary in their ap- into accounts making their first edit (First-time editors), and proaches, users of privacy-enhancing technologies are unable (3) people logged into accounts with more than one edit using to participate in a broad range of online experiences [21]. the same account (Registered editors). In this work, we attempt to measure the value of contribu- Using a combination of quantitative and qualitative tech- tions made by the privacy-seeking community and compare niques, we find that while Tor editors are more likely to these contributions to those by other users. We focus on the revert someone’s work and to be reverted, other indicators of users of a single service, Wikipedia, and a single privacy- quality suggest that their contributions are similar to those of protecting technology, Tor, to understand what is lost when a IP editors and First-time editors. In an exploratory analysis, user-generated content site systematically blocks contributions we fit topic models to Wikipedia articles and find intriguing from users of privacy-enhancing technologies. differences between the kinds of topics that Tor users and other Wikipedia editors contribute to. We conclude with a discussion are best understood as largely similar to other users. Studies of how user-generated sites like Wikipedia might accept contri- of anonymous behaviors on Quora have found that answers butions from the millions of daily users of privacy-enhancing from anonymous contributors are no worse than answers given technologies like Tor1 in ways that benefit both the websites by registered users and the only significant difference is that and society. “with anonymous answers, social appreciation correlated with the answer’s length” [25]. Furthermore, Mani et al.’s study II. RELATED WORK of the domains visited by Tor users showed that 80% of the Most people seek out anonymity online at some time or websites visited by Tor users are in the Alexa top one million, another [12]. Their reasons for doing so range from seeking giving further evidence that Tor users are similar to the overall help and support [1], exploring or disclosing one’s identity Internet population [24]. [2, 18, 36], protecting themselves when contributing to online Although the tradeoffs between anonymity’s benefits and projects [14], seeking information, pursuing hobbies, and threats have been investigated and discussed from many per- engaging in activities that may violate copyright such as file spectives, the question of what value anonymous contributions sharing [20]. might bring to contexts where they are disallowed is difficult Anonymity can confer important benefits, not just for the to answer. How does one estimate the value of something that individual seeking anonymity but also for the collective good is not happening? By examining the relatively small number of of online communities [20, 31]. The use of anonymity in col- Tor edits that slipped through Wikipedia’s restriction between laborative learning has been demonstrated to improve equity, 2007–2018, we hope to begin doing just that. In the next participation rates, and creative thinking [7]. Research suggests sections, we explain the context of our data collection and that anonymity can support self-expression and self-discovery analysis as well as the methods we used to identify a dataset among young people [13]. For instance, researchers found of Wikipedia edits from Tor. that anonymity helps users discuss topics that are stigmatized [1, 8]. III. EMPIRICAL CONTEXT Despite the range of legitimate reasons that people adopt A. Tor anonymity to interact on the Internet and the benefits to col- The Tor network consists of volunteer-run servers that allow laborative communities, many websites systematically block users to connect to the Internet without revealing their IP traffic coming from anonymity-seeking users of systems like address. Rather than users making a direct connection to a Tor.2 According to Khattak et al., at least 1.3 million IP destination website, Tor routes traffic through a series of relays addresses blocked Tor at the TCP/IP level as of 2015, and that conceal the origin and route of a user’s Internet traffic. “3.67% of the top 1,000 Alexa sites are blocking people using Within Tor, each relay only knows the immediate sender and computers running known Tor exit-node IP addresses” [21]. the next receiver of the data but not the complete path that the Of course, websites do not block anonymity tools like Tor data packet will take. The destination receives only the final for no reason. Research has shown that online anonymity is relay in the route (called the “exit node”), not the Tor user’s sometimes associated with toxic behaviors that are hard to original IP address. The list of all Tor nodes is published so control [23]. Traffic analysis of Tor in 2010 found a substantial that Tor clients can pick relays for their circuits. This public portion of network activity is associated with peer-to-peer list also allows the public to determine whether or not a given applications such as BitTorrent [5]. Another report made by 3 IP address is a Tor exit node at a given point in time. Some Sqreen, an application protection service, claims that “a websites, including Wikipedia, use these lists of exit nodes to user coming from Tor is between six and eight times more restrict traffic from the Tor network. likely to perform an attack” on their website, such as path scanning and SQL/NoSQL injection. Tor exit node operators B. Wikipedia often receive complaints of “copyright infringement, reported As one of the largest peer production websites, Wikipedia hacking attempts, IRC bot network controls, and web page receives vast numbers of contributions every day. While defacements” [27]. The most frequent complaints about Tor Wikipedia is available in many languages, English Wikipedia users’ negative behavior are DCMA violations, which made up is the largest edition with the most articles, active users, 99.74% of the approximately three million email complaints and viewers.4 As of February 2019, the English language sent to exit operators from Torservers.net from June, 2010 to Wikipedia “develops at a rate of 1.8 edits per second” with April, 2016 [33]. more than 136,000 registered editors who contribute each A third perspective suggests that anonymity seeking behav- month.5 When these registered editors change something, their ior is neither “good” nor “bad” and that anonymous users username is credited with that edit. Wikipedia also allows 1https://metrics.torproject.org/userstats-relay-country.html (Archived: https: people to contribute without asking them to sign up or log //perma.cc/B5W4-UG7C) 2https://trac.torproject.org/projects/tor/wiki/org/doc/ 4https://en.wikipedia.org/wiki/List of Wikipedias (Archived: https://perma.

An Analysis of Contributions to Wikipedia from Tor

A Topic-Aligned Multilingual Corpus of Wikipedia Articles for Studying Information Asymmetry in Low Resource Languages

Modeling Popularity and Reliability of Sources in Multilingual Wikipedia

Omnipedia: Bridging the Wikipedia Language

Title of Thesis: ABSTRACT CLASSIFYING BIAS

Iclab: a Global, Longitudinal Internet Censorship Measurement Platform

End of News:Internet Censorship in Turkey

Florida State University Libraries

Parallel Creation of Gigaword Corpora for Medium Density Languages – an Interim Report

Does Wikipedia Matter? the Effect of Wikipedia on Tourist Choices Marit Hinnosaar, Toomas Hinnosaar, Michael Kummer, and Olga Slivko Discus Si On Paper No

Flags and Banners

Detecting Undisclosed Paid Editing in Wikipedia

Post-Digital Cultures of the Far Right