Understanding Tor Usage with Privacy-Preserving Measurement
Akshaya Mani•∗, T Wilson-Brown‡∗, Rob Jansen†, Aaron Johnson†, Micah Sherr•
•Georgetown University, {am3227, micah.sherr}@georgetown.edu
‡UNSW Canberra Cyber, University of New South Wales, [email protected]
†U.S. Naval Research Laboratory, {rob.g.jansen, aaron.m.johnson}@nrl.navy.mil
∗Equally credited authors.

arXiv:1809.08481v1 [cs.CR] 22 Sep 2018

ABSTRACT
The Tor anonymity network is difficult to measure because, if not done carefully, measurements could risk the privacy (and potentially the safety) of the network's users. Recent work has proposed the use of differential privacy and secure aggregation techniques to safely measure Tor, and preliminary proof-of-concept prototype tools have been developed in order to demonstrate the utility of these techniques. In this work, we significantly enhance two such tools, PrivCount and Private Set-Union Cardinality, in order to support the safe exploration of new types of Tor usage behavior that have never before been measured. Using the enhanced tools, we conduct a detailed measurement study of Tor covering three major aspects of Tor usage: how many users connect to Tor and from where do they connect, with which destinations do users most frequently communicate, and how many onion services exist and how are they used. Our findings include that Tor has ∼8 million daily users (a factor of four more than previously believed) while Tor user IPs turn over almost twice in a 4-day period. We also find that ∼40% of the sites accessed over Tor have a torproject.org domain name, ∼10% of the sites have an amazon.com domain name, and ∼80% of the sites have a domain name that is included in the Alexa top 1 million sites list. Finally, we find that ∼90% of lookups for onion addresses are invalid, and more than 90% of attempted connections to onion services fail.

1. INTRODUCTION
The Tor network has been in operation since 2003 [18], enabling millions of daily users [43] to anonymously access the Internet. Despite its relative longevity, growing popularity, and prominence as a privacy-enhancing communication tool, surprisingly little is known about who uses Tor and how they use it.

Challenges in Measuring Tor. The primary challenge in conducting measurements on the Tor network is that, if not done with extreme care, they can impose significant risk to the network's users. The collection of statistical information implies the need to store data about clients, traffic patterns, sites visited, etc. (information that may otherwise not be collected for privacy reasons). Even if the entities collecting this information are honest, the exposure of the data (e.g., through machine compromise or compulsion from government authorities) could pose serious privacy risks. Additionally, the dissemination of even highly aggregated measurement information risks users' privacy since adversaries could potentially combine this data with background knowledge (e.g., when a tweet was posted, ISP logs showing user accesses to the Tor network, etc.) to deanonymize users [16].

In addition to user privacy and safety, monitoring communication is antithetical to the mission of the Tor network. For this reason, the Tor Metrics Portal [43], a data repository that archives information about the anonymity network's size, makeup, and capacity, provides only a limited number of statistics, many of which are based on indirect measurements and assumptions of Tor client behavior. We demonstrate in this paper that some of Tor's estimates do not correlate well with our own direct observations of user behavior.

Other efforts to measure Tor, such as those that record packets at Tor's ingress and egress points, can pose serious ethical and legal issues: they rely on unsafe [41] (and potentially unlawful [40]) techniques that would be difficult to repeat. As a result, such studies have been met with criticism by the privacy community [40].

Toward safe Tor measurement. There has been an exciting recent trend in the literature that proposes differentially private [21] statistics collection techniques for anonymity networks such as Tor. Systems such as PrivEx [23], PrivCount [27], HisTor [36], and PSC [24] enable useful statistics collection while providing formal guarantees about the privacy risks they impose on the network's users. However, to date, researchers have (rightfully) focused on developing these privacy-preserving measurement techniques rather than on performing exhaustive measurements of deployed anonymity networks such as Tor.

Our Contributions. This paper presents the largest and most comprehensive measurement study of the Tor network to date, performed using newly proposed differentially private statistics collection techniques. To conduct our study, we modify Tor and significantly enhance and improve PrivCount [27] and PSC [24] (described in more detail in the following section) to support a larger variety of measurements, and contribute our improvements to the respective open source projects. We run 16 Tor relays that contribute approximately 3.5% of Tor's bandwidth capacity, and deploy our measurement systems across these relays in order to safely collect measurements of the live Tor network. We extend previous statistical methodology to enable unique-counting on Tor, and extend privacy definitions (action bounds) to cover new types of user activity. We also describe the challenges and tactics for selecting appropriate privacy parameters to ensure Tor users' safety, as well as methods for inferring whole-network statistics given local observations at our relays.

Using our improved measurement tools and extended techniques, we conduct a detailed measurement study of Tor covering three major aspects of Tor usage: who connects to Tor and from where do they connect; how is Tor used and with what services and destinations do Tor users communicate; and how many onion services exist and how are they used. Findings from our client-based measurements suggest that Tor has ∼8 million daily users, which is a factor of four more than previously believed. We are the first to measure client churn in Tor, and we find that Tor user IPs turn over almost twice in a 4-day period. We also find that ∼40% of the sites accessed over Tor have a torproject.org domain name, ∼10% of the sites have an amazon.com domain name, and ∼80% of the sites have a domain name that is included in the Alexa top 1 million sites list. Finally, we find that ∼90% of onion address lookups fail because the address is missing or the request is malformed, and more than 90% of attempted connections to onion services fail because the server never completes its side of the connection protocol. Our findings both reinforce existing beliefs about the Tor network (e.g., that the vast majority of its use is for web browsing) and elucidate many aspects of the Tor network that have previously not been explored. In some instances, our measurements, which are based on direct observations of behavior on Tor, suggest that existing heuristically-driven estimates of Tor's usage (including the number of Tor users) are highly inaccurate.

Limiting the risk to Tor's users was paramount in both the design and implementation of all of our measurements, and a significant contribution of this work is the methodology for choosing system parameters to [...] ethical validity of our study, and describe our experiences with institutional ethics boards and an independent safety review board.

2. BACKGROUND
We begin by presenting a brief overview of Tor and reviewing the privacy definitions and privacy-preserving measurement tools we use in our study.

2.1 Tor
Tor provides anonymous TCP connections through source-routed paths that originate at the Tor client (i.e., the user) and traverse through (usually) three relays. Traffic enters the Tor network through guard relays. Middle relays carry traffic from guard relays to exit relays, where the traffic exits the anonymity network. These anonymous paths through Tor relays are called circuits. The unit of transport in Tor circuits is the cell: Tor cells carry encrypted routing information and 498 bytes of data [17]. The encryption of application-level headers and payloads makes it difficult for an adversary to perform traffic analysis on circuits and learn the endpoints or content of communication.

Tor has a three-tiered communication architecture. A single circuit may carry multiple streams. A Tor stream is a logical connection between the client and a single destination (e.g., a webserver), and is roughly analogous to a standard TCP connection. A request to retrieve a webpage may produce several streams (e.g., to example.com and ads.example.com) that are multiplexed over a single circuit. Finally, Tor maintains longstanding TLS-encrypted connections between relays that are adjacent on some Tor circuit. Communication belonging to different circuits that share a common hop between two relays is multiplexed through a single connection. Similarly, a user's Tor client maintains a single connection to each of its guards, through which multiple circuits (and streams) may be formed.

Tor clients periodically download "directory updates" that describe the consensus view of the network. Clients send all user data through one guard by default, but directory updates are obtained through three guards by default. Tor clients avoiding censorship may connect to the network through bridges [37], which are guards whose identities are not public and are disseminated by Tor to users requesting them.

Onion services. Tor allows services to hide their locations through the use of onion services [42]. As a special (but common) case, when the onion service corresponds to a hidden website, we refer to it as an onionsite. To operate an onion service, the operator selects six relays to serve as introduction points.