NEW PERSPECTIVES ABOUT THE TOR ECOSYSTEM: INTEGRATING STRUCTURE WITH INFORMATION
A Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
by
MAHDIEH ZABIHIMAYVAN B.S., Ferdowsi University, 2012 M.S., International University of Imam Reza, 2014
2020 Wright State University
Wright State University GRADUATE SCHOOL
April 22, 2020
I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY SUPERVISION BY MAHDIEH ZABIHIMAYVAN ENTITLED NEW PERSPECTIVES ABOUT THE TOR ECOSYSTEM: INTEGRATING STRUCTURE WITH INFORMATION BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.
Derek Doran, Ph.D. Dissertation Director
Yong Pei, Ph.D. Director, Computer Science and Engineering Ph.D. Program
Barry Milligan, Ph.D. Interim Dean of the Graduate School
Committee on Final Examination
Derek Doran, Ph.D.
Michael Raymer, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.
Amir Zadeh, Ph.D.

ABSTRACT

Zabihimayvan, Mahdieh. Ph.D., Department of Computer Science and Engineering, Wright State University, 2020. New Perspectives About The Tor Ecosystem: Integrating Structure With Information

Tor is the most popular dark network in the world. Its noble uses, including as a platform for free speech and information dissemination under the guise of true anonymity, make it an important socio-technical system in society. Although activities in socio-technical systems are driven by both structure and information, past studies evaluating Tor investigate its structure or its information exclusively and narrowly, which inherently limits our understanding of Tor. This dissertation bridges this gap by contributing insights into the logical structure of Tor, the types of information hosted on this network, and the interplay between its structure and information. These insights arise from three studies:
(a) We perform a comprehensive crawl of the Tor dark Web and, through topic and network analysis, characterize the types of information and services hosted across a wide swath of Tor domains and their hyperlink relational structure.
(b) We study the potential for thought-to-be isolated information on the dark Web to be leaked into the public surface Web by providing a broad evaluation of the network of references from Tor to the surface Web.
(c) We investigate the structural identity of Tor domains as an indicator of their neighborhood structure, independent of either their service type or their location in the network.
Our studies unearth previously unknown properties of Tor, including the finding that Tor domain types can be categorized into nine groups defined by the information they host. We unveil how services for releasing and searching information emerge as the dominant type of Tor domains. Our importance evaluation identifies Dream marketplaces and directory domains as Tor core services and crucial entry points for probes, respectively. Connectivity analyses reveal some patterns of cooperation and competition among Tor domains. We also present measurements that indicate how some types of domains intentionally silo themselves from the rest of Tor.

The investigation of the dark-to-surface referencing network reveals this network as a single massive connected component where over 90% of Tor hidden services have at least one link to the surface world despite their interest in being isolated from surface Web tracking. This referencing puts Tor domains closer to each other and encourages them to cluster. However, it does not raise the domains' contribution to either communication or information dissemination through the Tor network.

Analyses of the Tor structural identity indicate that Tor domains can be categorized into eight groups based on their neighborhood structure. Dream market has its own class: an almost fully connected neighborhood structure which is robust against node removal. The linking structure of Tor domains can further differentiate their structural identities. The domains with direct links to others with high out-degree centrality form the dominant structural identity on Tor. This identity makes the 2nd-degree neighborhood of Tor domains robust against node removal or targeted attack despite their tendency towards isolation.
Contents

1 Introduction
2 Preliminary Knowledge
2.1 Tor routing scheme
2.2 Tor security issues
2.2.1 Client-side attacks
2.2.2 Server-side attacks
3 Literature Review
3.1 Tor security and privacy
3.2 Tor structure characterization
3.3 Tor information characterization
4 Information Ecosystem Evaluation
4.1 Dataset collection and processing
4.1.1 Tor content discovery and labeling
4.2 Content evaluation
4.3 Domain relationships
4.3.1 Connectivity analysis
4.3.2 Importance analysis
4.4 Chapter summary
5 Information Leakage Assessment
5.1 Dataset collection and processing
5.2 Evaluation of dark-to-surface referencing
5.2.1 Linking process of Tor services to dark/surface resources
5.2.2 Analyzing reference view of Tor services
5.3 Chapter summary
6 Structural Identity Characterization
6.1 Structural identity analysis
6.1.1 Tor structural identity representation
6.1.2 Clustering Tor structural identities
6.2 Chapter summary
7 Conclusion
Bibliography
List of Figures

1.1 Relationship among the studies in this dissertation
2.1 Tor architecture
2.2 A browser attack using the codes included in a website. Executing the plugged-in program by the client's browser opens a direct path to a malicious Web server which compromises the client's anonymity
2.3 A browser attack using a malicious exit node. The client's browser executes a code inserted into a Web page by the malicious exit node
2.4 Traffic manipulation to induct malicious entry into the circuit
2.5 Flow diagram of the off-path man-in-the-middle attack
4.1 Tor data collection process
4.2 LDA topic coherence score for different numbers of topics and minimum lengths of document; bold trend representing scores using 9 topics
4.3 Services provided by news, multimedia, forum, and shopping domains
4.4 Topic distribution of Tor domains
4.5 Network of domains with degree > 0
4.6 The Tor domain network. Panels 1 through 9 each highlight the incoming and outgoing edges of a particular domain. The figure is best viewed digitally and in color
4.7 In/out degree distribution of each community
4.8 Intra-relations within domains
4.9 Centrality distributions
5.1 Flowchart of the data collection and processing
5.2 Network of data collected during crawling. Red nodes indicate Tor domains and green nodes represent surface websites
5.3 Number of references to surface Vs. dark Web; both axes in logarithmic scale
5.4 Distribution of number of Dark Web neighbors for Tor domains in the first category; both axes in logarithmic scale
5.5 Frequency Distribution of Network Parameters
6.1 Average Silhouette width Vs. different number of clusters
6.2 Dendrogram of the Hierarchical Clustering
6.3 Label Distributions in the resulted clusters
6.4 Examples of neighborhood structure for Dream market domains
6.5 CDF plot of out-degree centrality for Tor domains
6.6 Examples of 1st-degree neighborhood structure for domains in clusters 1 to 4
6.7 Examples of 2nd-degree neighborhood structure for domains in clusters 1 to 4
6.8 Examples of neighborhood structure for domains in clusters 7 and 8
List of Tables

4.1 List of 10 most probable words per topic and their label
4.2 Summary statistics of the domain network
4.3 Modularity score of each topic community
5.1 Dark/surface resources used to collect initial seeds
5.2 Basic parameters of the network
5.3 Onion domains with more than 200 references to surface and no link to Tor
5.4 Summary statistics of network features
5.5 List of edges with Edge betweenness Centralities greater than 10,000
5.6 20-top Tor Services with high Stress centrality
5.7 20-top Tor Services with out-degree centrality greater than 400
5.8 10-top Tor Services with In-degree centrality greater than 100
6.1 Basic statistics of the out-degree centrality for Tor domains
Acknowledgments
I would like to express my thanks to my supervisor, Dr. Derek Doran, who has continuously supported me throughout this long journey. His patience, advocacy, and constructive comments helped me improve in different aspects of my academic life. His caring about my freedom to choose the projects of my research interest and devise my own research questions notably helped me be the researcher I am today. I sincerely appreciate his belief in my technical abilities and all his effort to make me a better person. My thanks also go to my great committee members, Drs. Michael Raymer, Krishnaprasad Thirunarayan, and Amir Zadeh, who significantly improved my research with their helpful advice and supported me with their encouraging feedback. I am also thankful to the College of Engineering and Computer Science and all its staff members who were so generous with their time in guiding me. And my biggest thanks go to my husband, Dr. Reza Sadeghi, without whom I would have stopped these studies a long time ago. He put up with my stresses and moans, and he was amazingly encouraging and patient. I also cannot forget to thank my parents for all the support they have shown me through these years. I sincerely thank my beloved brother, Mahdi, and sisters, Mahtab and Monireh, who have always heartened me with their selfless kindness. Finally, I thank my great labmates and friends for all the unconditional support, especially during this very intense academic year.
1 Introduction
Socio-technical systems can be viewed as a linked structure of entities that provide and spread information through the network [26]. A set of linked Web pages, a social network of acquaintances or other connections between individuals, and a set of linked business companies are common examples of these systems. The linked structure and the information propagated through such networks have a significant influence on each other. In social systems, for instance, the information owned by an individual has a strong influence on social structure [28]. On the other hand, predefined social structures impact the distribution of information, and hence the amount of value one obtains by participating in the network. Thus, investigating the behavior of socio-technical systems requires considering structure and information together.

Tor is an important socio-technical system that has attracted much attention after attempts by countries like China and Russia to control or suppress the Internet [48]. Tor is also considered the most popular dark network, which requires unique application layer protocols and authorization schemes to access. It provides anonymous communication for both senders and receivers using an encryption scheme similar to onion routing [22] that prevents traffic analysis and network activity monitoring by complicating any possible tracking or tracing of users' identity. Tor is used as a tool for circumventing government censorship [32], releasing information to the public [48], sensitive communication between parties [56], and as a private space to trade goods and services [50].

Figure 1.1: Relationship among the studies in this dissertation.

Although activities in socio-technical systems, and hence Tor, are driven by both structure and information, the present art in evaluating Tor studies its structure or information exclusively and narrowly. Current empirical evidence on Tor information is limited to studies that merely investigate the type of hosted information by crawling, extracting, and analyzing particular content of Tor such as drug trafficking [18], homemade explosives [29], terrorist activities [15], or forums [56]. Also, although the Tor structure is only beginning to be studied, the related work focuses on how Tor is logically organized. Perhaps our best understanding of the Tor structure comes from Bernaschi et al., who present a characterization study of the Tor network graph and investigate the persistence of hidden services and their hyperlinks [6]. They also compare Tor with other social networks and surface Web graphs using well-known network analysis metrics. Their results indicate that in the Tor network, edges are more volatile than vertices, and that the graph shares some similarities with other types of networks while having particular properties such as a huge number of nodes with no outgoing edges.

The objective of this dissertation is to perform a series of characterization studies, outlined in Figure 1.1, that explicitly incorporate the interplay between structure and information in Tor to gain a new understanding of this network. The work collectively contributes
a new perspective on the broad make-up of Tor, its structure, the information it hosts, and the potential for thought-to-be isolated information on the dark Web to be leaked into the public surface Web. The dissertation specifically examines the following:

(i) In the first step, we perform a comprehensive crawl of the Tor dark Web and, through both topic and network analysis, present a broad characterization of the types of information hosted on Tor domains and their hyperlink relational structure. The purpose is to reveal the main applications of Tor domains and to better our understanding of the connectivity and importance of the domains, conditioned on their service type. Through the lens of various analyses, and among other findings, we identify the interest of Tor domains in being isolated.

(ii) In the second step, we emphasize the information and, given the domains' tendency towards isolation, broadly evaluate the Tor-to-surface reference network and analyze how vulnerable Tor domains are to information leakage caused by linking to the surface world. To consider any interplay between the information and structure, the analyses also consider to what extent this linking can change the overall hyperlink structure of Tor domains. Moreover, they provide reports regarding the type of information and services provided by Tor domains.

(iii) In the third step, by emphasizing the structure and to further investigate any interaction between the structure and information of Tor, we study the structural identity of Tor domains. Through this evaluation, we identify extremely important patterns in the neighborhood structure of Tor domains, and reveal the dominant structural identity of the Tor network.
Chapter Overview
The remainder of this dissertation is organized as follows.
Chapter 2: Preliminary Knowledge discusses the background knowledge required to effectively follow the subject and purpose of this dissertation. Specifically, it covers the Tor routing scheme in Section 2.1 and Tor security issues in Section 2.2 in more detail.

Chapter 3: Literature Review details the related work on the main components used in this dissertation. That is, in Section 3.1, we focus on understanding the security and privacy issues on Tor. In Section 3.2, we discuss the work on characterizing the structure of Tor. Finally, Section 3.3 reviews the work on characterizing the information hosted on Tor hidden services.

Chapter 4: Information Ecosystem Evaluation presents our comprehensive topic and network analysis over Tor which characterizes both the types of content hosted across Tor domains and their hyperlink relational structure.

Chapter 5: Information Leakage Assessment presents our investigation of the network of references from dark domains to surface websites to assess the potential for leaking information on Tor and to what extent such referencing can change the hyperlink structure of Tor domains.

Chapter 6: Structural Identity Characterization discusses our study of the structural identity of Tor domains to reveal any interaction between information and structure on Tor and whether there is any dominant structural identity on Tor.

Chapter 7: Concluding Remarks and Future Directions summarizes the contributions of this dissertation and gives directions for future work.
2 Preliminary Knowledge
This chapter discusses the knowledge required to follow the subject and purpose of this dissertation. Specifically, it presents details on Tor routing scheme in Section 2.1, and Tor security issues in Section 2.2.
2.1 Tor routing scheme
Tor is a low-latency anonymity network of over 6500 volunteer relays/routers. It provides anonymous communication for both senders and receivers using an encryption scheme similar to onion routing [22]. A small set of Tor authorities regularly monitors all relays in the network and authorizes them if they are active, allowing at most two relays per IP address. Relays can also be run and advertised by Tor users. The Tor authorities maintain a consensus (list) of the authorized relays and publish it to users. To start communicating with a server, a client incrementally establishes a path, or circuit, of three (by default) active relays and sends the message, encrypted in successive layers, through the circuit. The first and last relays are referred to as the entry (or guard) and the exit, respectively. All connections between the client and the guard, between intermediary relays in the path, and from the exit to the receiver use the Transmission Control Protocol (TCP).
Figure 2.1: Tor architecture (nodes: client, entry, Tor routers, exit, destination; legend: encrypted link, unencrypted link)
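The idea of sending a message "encrypted in successive layers" can be sketched in a few lines of Python. This is a toy illustration only, not Tor's actual cell format or cipher suite: the keystream construction, the function names, and the three hop keys are all assumptions made for the example. The client adds one layer per relay, and each relay on the path removes exactly one layer, so only the exit sees the plaintext.

```python
import hashlib
from itertools import count
from typing import List, Sequence


def keystream(key: bytes, length: int) -> bytes:
    """Derive a toy keystream by hashing key || counter (illustration only)."""
    out = bytearray()
    for i in count():
        if len(out) >= length:
            break
        out.extend(hashlib.sha256(key + i.to_bytes(4, "big")).digest())
    return bytes(out[:length])


def xor_layer(data: bytes, key: bytes) -> bytes:
    """Add or remove one layer; XOR with the keystream is its own inverse."""
    ks = keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


def wrap(message: bytes, hop_keys: Sequence[bytes]) -> bytes:
    """Client side: apply one layer per relay, starting with the exit's layer."""
    cell = message
    for key in reversed(hop_keys):
        cell = xor_layer(cell, key)
    return cell


def relay_path(cell: bytes, hop_keys: Sequence[bytes]) -> List[bytes]:
    """Entry, middle, and exit each strip one layer; return what each one sees."""
    views = []
    for key in hop_keys:
        cell = xor_layer(cell, key)
        views.append(cell)
    return views


# Hypothetical per-hop keys; a real client negotiates them with each relay.
hop_keys = [b"entry-key", b"middle-key", b"exit-key"]
message = b"GET /index.html"
views = relay_path(wrap(message, hop_keys), hop_keys)
print(views[-1])  # only the last hop (the exit) recovers the plaintext
```

One simplification worth noting: XOR layers commute, so this toy does not enforce the order in which layers are removed, whereas a real layered cipher does.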
To build the circuit, the client first contacts the Tor authorities to get a list of available relays and a public key to communicate with the entry. Employing asymmetric-key cryptography, the client establishes a path, shares a secret key with the entry, and sends it the encrypted message. The entry decrypts the message using the key shared with the client. The encrypted link between the entry and the client is used to relay an encrypted message to the middle router. In each step, the message can be decrypted only by the receiving relay, not by its preceding or following routers. After decrypting the message, the second router builds a path with the entry without knowing whether it is the client or a regular router. This repeats until the destination receives the whole data. When the destination sends back a message, the same path is used in the reverse direction. In this method, the entry is the only relay that is able to observe the identity of the initiator of the message, while the intermediaries only know their previous and next routers. The exit is also the only router which can see the completely decrypted message and hence know the location of the destination. Consequently, no single router can infer the location of both the initial sender and the final receiver of a message. Figure 2.1 presents the flow diagram of the Tor architecture.

Tor hidden services (HS) are features added to the Tor network in 2004 to provide privacy for users who run Internet services on Tor. The architecture used for Tor hidden services
establishes a route between the client and the service which is comprised of the client, introduction point (IP), hidden service directory (HSDir), rendezvous point (RP), and the hidden service. The IP is a random relay that is selected by the hidden service as its contact point. To be an HSDir, a relay must be active for 25 hours to receive the HSDir flag. The HSDir is responsible for the HS descriptor. The RP is a relay randomly selected by the client to transmit all the data between the client and the server.

From the hidden service view, the hidden service Onion Proxy (OP) first needs to be configured on the machine hosting the HS. The OP then automatically generates an RSA key pair, whose public key is sent to the IP to build a private connection with it. Hence, the IP only receives the HS public key instead of its Internet Protocol address. After establishing a circuit with the IP, the OP builds the HS identifier, which is a combination of its public and private key. This identifier is then sent to the IP through the circuit. In the next step, the OP generates two HS descriptors (with different IDs), selects two HSDirs, and uploads the descriptors to their hash tables. Each descriptor is comprised of an ID, a list of IPs, and the HS's public key. Technically, configuring a new hidden service automatically generates a string of 16 or 56 characters as the hostname of the service's onion address, which is also considered by the client as the service pointer. The string can contain any letter or the decimal digits from 2 to 7 and is the base-32 encoded identifier of the hidden service.

From the client perspective, using the pointer of a hidden service, the client OP computes the descriptor IDs and the HS's responsible HSDirs, and then obtains its descriptors. To establish a private path with the HS, the client OP randomly selects relays and requests them to be the RP. For each attempt, the OP randomly selects a 20-byte value as the Rendezvous Cookie (RC) and sends it to the relay.
The router which associates the RC with the connection that sent it is selected as RP. The client OP then establishes a new connection with IPs of the hidden service and sends them an introduction message containing information about rendezvous point, RC, and hash of the service public key. If the IPs recognize the public key of the service they are responsible for, they allow transmitting the message to the hidden service OP. In this step, the
hidden service OP decrypts the message using its private key and observes the information regarding the chosen RP. Then, the hidden service OP establishes a circuit with the RP. Finally, the RP informs the client OP that the path has been built successfully and can be used for transmitting all other data exchanged between the client and the hidden service.

Interactive applications such as email, Web browsing, file sharing, and remote terminal access use TCP because it provides reliable data transfer with a guarantee that all messages eventually arrive at the destination. As previously mentioned, all circuits built in Tor employ TCP, which makes this network well suited to interactive applications. However, a few clients consume a notably high amount of traffic on Tor using non-interactive applications such as BitTorrent [34]. Since this conflicts with the basic purpose of the Tor project, which is to provide low latency and high throughput in addition to anonymity, the default policy of exit relays blocks the TCP ports used by file sharing protocols.
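The hostname format described earlier in this section (a 16- or 56-character string over the letters and the digits 2 to 7, i.e. a base-32 identifier) can be checked mechanically. The following is a minimal sketch under that description only; the function name `parse_onion_hostname` and the sample labels are hypothetical, and the check says nothing about whether a service actually exists at the address.

```python
import base64
import re

# 16 or 56 characters drawn from the base-32 alphabet (a-z and 2-7).
ONION_LABEL = re.compile(r"(?:[a-z2-7]{16}|[a-z2-7]{56})")


def parse_onion_hostname(hostname: str) -> bytes:
    """Return the decoded identifier bytes, or raise ValueError if malformed."""
    label = hostname.strip().lower()
    if label.endswith(".onion"):
        label = label[: -len(".onion")]
    if not ONION_LABEL.fullmatch(label):
        raise ValueError(f"not a 16- or 56-character base-32 label: {hostname!r}")
    # The standard base-32 alphabet is upper case; both lengths are
    # multiples of 8 characters, so no padding is needed.
    return base64.b32decode(label.upper())


# Synthetic labels for illustration; they do not refer to real services.
short_id = parse_onion_hostname("abcdefghijklmnop.onion")
long_id = parse_onion_hostname("a" * 56 + ".onion")
print(len(short_id), len(long_id))
```

Each base-32 character carries 5 bits, so the 16-character form decodes to 80 bits (10 bytes) and the 56-character form to 280 bits (35 bytes).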
2.2 Tor security issues
Based on [14], the attacks on Tor can be categorized into three classes: client-side, server-side, and information leakage. This section presents information on the first two categories, while Chapter 5 details information leakage on Tor and investigates to what extent Tor domains are vulnerable to this issue.
2.2.1 Client-side attacks
There have been several attempts to hack Tor clients and associate their identities (i.e., IP addresses) with the data transmitted. Related studies propose solutions to remove these vulnerabilities through updates to different versions of Tor.

Browser-based attack: this attack [1] exploits the Web browser's code execution strategy to conduct an end-to-end traffic analysis that identifies visitors of websites. Although all connections in Tor follow an onion routing scheme to provide anonymity, software programs that plug into the browser, such as Flash, Adobe, and Java, do not have to use this proxy. Hence, downloading and executing their code builds connections through the regular Web instead of Tor's anonymous circuits. Embedding a malicious server on the surface Web, as illustrated in Figure 2.2, can reveal the identity of the website's user in Tor. However, this attack does not allow third parties to deanonymize users of a Tor Web page. In another attempt, depicted in Figure 2.3, a malicious exit relay manipulates the traffic and conducts a man-in-the-middle attack, which allows parties to associate the user's identity with the data requested.

Figure 2.2: A browser attack using the codes included in a website. Executing the plugged-in program by the client's browser opens a direct path to a malicious Web server which compromises the client's anonymity

Figure 2.3: A browser attack using a malicious exit node. The client's browser executes a code inserted into a Web page by the malicious exit node.

Insecure protocols: statistical analyses of the protocols used in Tor communication indicate that a notable proportion of exit traffic uses protocols such as File Transfer Protocol (FTP), Internet Message Access Protocol (IMAP), Simple Mail Transfer Protocol (SMTP), and Post Office Protocol (POP). In such protocols, credential information such as the username and password of users is transferred as a plain text message, which makes it easily identifiable. On the other hand, only a small number of servers provide SSL/TLS connections for their services to protect the transfer of data and information. Even so, their secure connections can still be threatened as a result of using the insecure protocols. For instance, suppose a user simultaneously initiates connections using both secure and insecure protocols. Since both connections belong to the same user, they are multiplexed over one circuit and thus one exit relay. If the insecure connection is deciphered, the leaked confidential information of the user can be simply associated with the secure connection. As a solution strategy, users can be warned about the security consequences that protocols such as POP can cause. Also, a port-based strategy [3] can be employed to block these protocols at the client side. Both solutions have been provided in Tor 0.1.2.18 and later.¹

¹ The most current version at the time of the study is Tor 0.4.2.1-alpha.

Torben attack: this type of attack uses two limitations in the current Tor implementation to deanonymize clients [2]. First, Web pages can be simply manipulated to download and render content from untrusted resources. Second, low-latency anonymization networks cannot hide some characteristics of the traffic, such as the size of the data transmitted. This attack provides manipulated content, such as advertisements, on a website and deanonymizes users by analyzing the indicators of the Web pages that are transmitted through an established side-channel.

P2P information leakage: this attack exploits the connections a Tor user establishes to a P2P system [14]. In the case of the BitTorrent network, a man-in-the-middle attack can deanonymize a Tor user by manipulating the message sent to her by a torrent tracker. The tracker maintains and updates the information of peers that torrent users should contact in order to retrieve their requested resources. This information contains the peers' IP addresses and their listening ports. Since the communications with peers are not anonymous, an attacker can insert the information of a malicious peer into the list of peers, monitor the traffic through the malicious peer, and observe the IP address of the Tor user.

Malicious entry induction: since the entry is the only relay that can observe the user's identity, this attack tries to induce the user's proxy to select a malicious entry instead of other legitimate relays [31] (illustrated in Figure 2.4). To do so, the attacker can block traffic to other routers by manipulating administrative policies, or modify their traffic statistics to reduce their probability of being chosen.

Figure 2.4: Traffic manipulation to induct malicious entry into the circuit

In [34], another type of attack is discussed in which the user, acting as an attacker, attempts to conduct malicious activities. Operating an exit relay with default policy settings can therefore cause the router to be identified as the source of several malicious activities, such as allegations of copyright infringement, reported hacking attempts, IRC bot network controls, and Web page defacement. This can discourage Tor users from voluntarily operating exit relays and makes it difficult for research groups to collect the data required for their studies by monitoring exit relay traffic. As a solution, the ports which cause the largest number of complaints on the exit relay can be blocked. However, doing so significantly reduces the bandwidth required for a functional exit router and is therefore impractical.
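The two client-side remedies mentioned above for insecure protocols, warning the user and blocking by port, amount to a lookup on the destination port. The sketch below illustrates that idea only; it is not Tor's implementation, the names `PLAINTEXT_PORTS`, `Decision`, and `check_port` are hypothetical, and the table assumes each protocol runs on its well-known port.

```python
from enum import Enum

# Well-known ports of the plaintext protocols named in the text (assumed
# mapping; services running on non-standard ports would not be caught).
PLAINTEXT_PORTS = {21: "FTP", 25: "SMTP", 110: "POP3", 143: "IMAP"}


class Decision(Enum):
    ALLOW = "allow"
    WARN = "warn"
    REJECT = "reject"


def check_port(port: int, reject: bool = False) -> Decision:
    """Warn about, or optionally refuse, connections to plaintext-protocol ports."""
    if port in PLAINTEXT_PORTS:
        return Decision.REJECT if reject else Decision.WARN
    return Decision.ALLOW


for port in (443, 110):
    print(port, check_port(port).value, check_port(port, reject=True).value)
```

A port table is a coarse filter: it cannot tell whether a connection on port 143 is later upgraded to an encrypted session, which is one reason a warning may be preferable to outright rejection.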
2.2.2 Server-side attacks
As discussed before, Tor provides anonymity for both clients and owners of hidden services. Attacks in this category focus on Tor hidden services with the aim of either deanonymizing or weakening them.

Padding attack: this attack tries to find the IP address of a hidden service using a malicious RP and entry. First, the attacker sends the introduction point (IP) of the hidden service a manipulated message which determines the RP. The IP then forwards this message to the HS so that it builds a circuit with the RP. When the RP receives the reply, it sends the HS a message that contains a specific amount of padding and then terminates the circuit. This padding, which will be discarded by the HS, acts as a unique signature of the traffic and helps the entry identify the message sent by the HS. In other words, by analyzing the messages received by the entry, the attacker can infer whether this node has been selected as the entry router for a connection to the HS.

Packet manipulation: this attack identifies a hidden service using a malicious user, entry, RP, and central server. First, the malicious user sends a manipulated message (which does not comply with the protocol) to the HS via a malicious RP. Simultaneously, she sends the timestamp of the message to a central server controlled by the attacker. When the HS receives the manipulated message, it sends back a reply to the user indicating that the received data was destroyed. This reply must pass through the RP before arriving at the user. On receiving it, the RP extracts some useful information, such as the timestamp of the packet, and sends it to the central server. Finally, using time correlation analysis at the central server, the attacker can infer the IP address of the targeted HS.
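The text does not specify how the time correlation analysis at the central server is carried out. As a purely illustrative stand-in, the sketch below uses simple timestamp matching: it reports the fraction of recorded send times that are followed by an observed event within a chosen delay window. The function name, the window parameter, and the sample timestamps are all assumptions for the example.

```python
from bisect import bisect_left
from typing import Sequence


def correlated_fraction(sent: Sequence[float],
                        observed: Sequence[float],
                        max_delay: float) -> float:
    """Fraction of send times followed by an observation within max_delay seconds."""
    if not sent:
        return 0.0
    timeline = sorted(observed)
    hits = 0
    for t in sent:
        # Index of the first observation at or after the send time.
        i = bisect_left(timeline, t)
        if i < len(timeline) and timeline[i] - t <= max_delay:
            hits += 1
    return hits / len(sent)


# Hypothetical timestamps in seconds: two of three sends have a nearby observation.
score = correlated_fraction([10.0, 20.0, 30.0], [10.4, 20.3, 45.0], max_delay=1.0)
print(round(score, 3))
```

A score near 1 over many probes suggests the two event streams are related, while a score near the rate expected by chance suggests they are not; choosing the window and the decision threshold is outside the scope of this sketch.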
Figure 2.5: Flow diagram of the off-path man-in-the-middle attack
Off-path man-in-the-middle: in this type of attack, the attacker owns the private key of a hidden service and mounts a man-in-the-middle attack with no need to be in the path between the user and the hidden service [44] (shown in Figure 2.5). To do so, the attack uses two relays: one as a 'malicious' hidden service and the other to act as the user, building a circuit with the 'real' hidden service and maintaining the connection. The attacker first retrieves the descriptors of the real hidden service uploaded to the HSDirs, then builds and maintains a circuit to it. Next, the attacker establishes a malicious hidden service using the compromised private key and uploads its descriptor to the HSDirs. Note that in this step the same HSDirs are selected because the public key did not change. As a consequence, the new descriptor replaces the previous real one. From then on, the user's requests to visit the real hidden service lead to retrieving the descriptor of the malicious hidden service. It is worth mentioning that the Tor routing scheme does not encrypt the connections to the hidden services. As a result, since the attacker still holds the connection with the real hidden service, she can relay and monitor the traffic.
3 Literature Review
Previous research on Tor can be categorized into three classes: (1) work which has focused on understanding the security and privacy on Tor; (2) studies which focus on characterizing the Tor structure; and (3) studies on characterizing the information hosted on Tor.
3.1 Tor security and privacy
We consider these studies, operating at the traffic level, as studying Tor payloads that pass through the network, rather than understanding the content of these payloads or the inter-connected structure of Tor domains. Towards understanding Tor security and privacy issues, Mohaisen et al. studied the possibility of observing Tor requests at the global DNS infrastructure, which could threaten the private location of servers hosting Tor services and the name/onion address of Tor domains [36]. Their characterization indicated high volumes of leakage that are geographically distributed and target different types of hidden services. It also revealed several sharp increases in onion request volumes, which can be attributed to different geopolitical events around the world. Finally, various solutions were provided as remedies for each scenario. McCoy et al. tried to answer how Tor is (mis-)used and what clients and routers contribute to this usage [34]. They also proposed some
remedies to improve the implementation of the Tor network. Results of traffic analyses over exit relays revealed that, at the time of the study, the major type of application used over Tor was non-interactive. They also showed that protocols such as POP3, IMAP, and Telnet are regularly used in Tor. However, these protocols are insecure because they transmit client credentials as plain text, which lets malicious exit relays simply capture them. Hacking and allegations of copyright infringement are some examples of malicious activities done through this network. Focusing on the privacy of Tor hidden services, Biryukov et al. analyzed the traffic of services to evaluate their vulnerability to deanonymization and takedown attacks [9]. They demonstrated how flaws in the design and implementation of Tor hidden services can help attackers find the popularity of a hidden service, harvest its descriptor in a short time, and find its guard relay. They further proposed a large-scale attack technique to disclose the IP addresses of a notable number of Tor hidden services over one year. All the proposed techniques were evaluated over Silk Road, the DuckDuckGo search engine, and a botnet that utilizes Tor hidden services as command and control channels. Biryukov and Pustogarov indicated how using Bitcoin over Tor can allow man-in-the-middle attackers to fully observe information transmitted between Tor clients who use the Bitcoin cryptocurrency [7]. In particular, an attacker can identify the Bitcoin blocks and transactions relayed to each user and delay or discard them. They also proposed a novel technique to fingerprint Bitcoin clients and identify them across different sessions by setting and keeping fresh an address cookie on their computers. Hence, the attacker can recognize the client even if she decides to use Tor hidden services to connect to the Bitcoin network. In [14], Cambiaso et al.
presented an exhaustive investigation of different types of security concerns over the Tor network. Based on the purpose of the attack, threats were categorized into client-side, server-side, and network attacks. Bauer et al. investigated how flaws in Tor’s routing optimization approach can enable end-to-end traffic analysis attacks. They indicated how a low-resource attacker can deanonymize a fairly large number of entry and exit relays by exaggerating the amount of bandwidth its routers can provide in the network [4].
They evaluated the proposed attacks on PlanetLab and proposed solutions to mitigate the severity of the threats. Sanatinia and Noubir demonstrated a man-in-the-middle attack on Tor hidden services that is mounted by an attacker who knows the private key of a hidden service [44]. The attack is called “off-path” because the attacker does not have to be present in the circuit through which the client communicates with the hidden service. They also proposed detection solutions that compare hidden service descriptors at two different levels, and finally proposed techniques to mitigate the effects of this threat. In another work, Sanatinia et al. investigated the longevity of Tor hidden services with respect to the privacy of Tor users and services [45]. They indicated how it is possible to estimate the lifetime of hidden services with fairly high accuracy using a small percentage of Tor HSDir relays. Results indicated that nearly half of Tor hidden services at the time of the study had a longevity of less than 10 days and 80% had a maximum lifetime of a month.
3.2 Tor structure characterization
The topological properties of Tor, at physical and logical levels, are only beginning to be studied. Xu et al. quantitatively evaluated the structure of four terrorist and criminal related networks, one of which is from Tor [50]. They found such networks are efficient in communication and information flow, but are vulnerable to disruption by removing weak ties that connect large connected components. Sanchez-Rola et al. conducted a broader structural analysis over 7,257 Tor domains [46]. Their experiments indicated that domains are logically organized in a sparse network, and found a surprising relation between Tor and the surface Web: there are more links from Tor domains to the surface Web than to other Tor domains. They also reported evidence that suggests a surprising amount of user tracking performed on Tor. Bernaschi et al. presented a characterization study of the topology of the Tor network graph and investigated the persistence of hidden services and their hyperlinks [6]. All analyses were conducted over three different snapshots of Tor captured during
a five-month period. They also compared Tor with other social networks and surface Web graphs using well-known metrics. Results indicated that in the Tor network, edges are more volatile than vertices, and that the graph presents some similarities with other types of networks while having particular properties such as a huge number of nodes with no out-going edges. Well-known models like ER are also unable to accurately represent the Tor network. In another similar work [5], Bernaschi et al. investigated measurements to evaluate and characterize Tor hidden services data and the topology of their network. They provided a critical discussion of possible data collection techniques for the dark Web and conducted analyses on the relationship between Tor English content and its topology. Focusing on deployment and mirroring of Tor hidden services, Burda et al. provided an extensive investigation of the redundancy of Tor services across time and space [13]. They developed a new tool called MASSDEAL that automatically evaluates the appearance of mirrors and estimates the infrastructural redundancy of Tor domains. Results demonstrated that market services have the fewest mirrors compared to other Tor services, due to trust issues for their customers. Regarding the time at which mirrors become accessible, mirrors of some services behave very similarly. Similarly, Griffith et al. investigated the graph theoretic properties of the Tor network and compared them with previous analyses conducted on the surface Web [23]. The study considers bow-tie structure, robustness and fragility against node removal, and the importance of reciprocal connections in the comparisons. Results indicated that the Tor network differs significantly from the graph of the surface Web, and that ‘Web’ in “dark Web” is a misnomer.
3.3 Tor information characterization
Towards understanding types of content on Tor, Dolliver et al. used geovisualizations and exploratory spatial data analyses to analyze distributions of drugs and substances advertised on the Agora Tor marketplace [18]. Results demonstrated that drugs with European
sources are randomly distributed and six countries, with Canada and the United States at the top, have the major portion of drug dealing around the world. Geospatial analyses revealed that heroin and cocaine markets are exclusively retail-based. However, countries with pharmaceutical and chemical establishments significantly contribute to selling new psychoactive substances and prescription drugs. In another similar work [19], Dolliver and Kuhns investigated the types of new psychoactive substances sold on Agora during a four-month period. They also provided an in-depth analysis of the countries supporting trade of these substances on Agora. Experiments over time revealed that, compared with the total number of advertisements, advertisements for new psychoactive substances increased with a gentler slope. Regarding the volume of trade in new psychoactive substances on Agora, China and the U.S. are at the top. Also, over the period of the experiments, the number of countries advertising these substances rose by over 40%. Chen et al. sought an understanding of terrorist activities through a method incorporating information collection, analysis, and visualization techniques applied to 39 Jihad Tor sites [15]. An expert evaluation of the proposed method indicated its high performance in investigating terrorist activities on the dark Web. Mörch et al. analyzed the nature and accessibility of information related to suicide [37] by investigating the search results of nine popular search engines on Tor. Experiments showed that, in comparison with the surface Web, searching “suicide” and “suicide method” on Tor results in far fewer sites providing suicide-related content. Over half of the search results are out-of-date, unreachable, or irrelevant to the suicide topic. Among the results, there are also several forum pages discussing pro-suicide activities, which are blocked by most surface Web search engines.
Dolliver crawled Silk Road 2 with the goal of comparing the nature of its drug trafficking operations with that of the original site [17]. Experimental results revealed that Silk Road 2 is a much smaller dark market than the original Silk Road. Only about one fifth of all items for sale on Silk Road 2 are related to drugs. However, 145 vendors contribute to drug dealing, which is approximately equal to three fourths of the number of seller accounts
existing in this market. The geospatial distribution of dealers indicated that among nineteen countries performing drug trafficking on Silk Road 2, the United States is at the top of the list of both origin and destination countries. Biryukov et al. investigated the content and popularity of Tor hidden services by scanning their descriptors for open ports and looking at their request rate [8]. They also proposed a deanonymization method to identify clients of Tor hidden services. The results indicated that the content of over four fifths of Tor hidden services is in English and nearly half of them are devoted to drugs, adult, counterfeit, and weapon topics. Botnet services are also among the most popular hidden services on Tor. In [16], Christin presented an extensive study of Silk Road over eight months to investigate the types of goods sold and the revenues made by vendors and Silk Road operators. Results revealed that Silk Road is dominantly a market for substances and narcotics and that most items advertised on Silk Road are available for at most three weeks. Similarly, most vendor accounts become deactivated after approximately three months, while the rest, a small proportion of vendors at the time of the experiments, remain active. Estimates indicated that the total revenue of vendors is roughly more than USD 1.2 million per month and the estimated commission for the Silk Road operators is approximately USD 92,000 per month. The study concluded by explaining the economic and policy implications of the results. There are also some studies that proposed tools to support the collection of specific information, such as a focused crawler by Iliou et al. [29], new crawling frameworks for Tor by Zhang et al. [56], crawling extremist content using dark crawling and sentiment analysis by Scrivens et al. [47], and advanced crawling and indexing systems like LIGHTS by Ghosh et al. [21].
4 Information Ecosystem Evaluation
There is an open question whether the fundamental protections that Tor provides its users are worth their cost. In other words, the same features that protect the privacy of users also make Tor an effective tool for illegal activities and evading law enforcement. A wide range of positions on this question has been documented [50], but empirical evidence is limited to work that has crawled, extracted, and investigated specific portions of Tor according to the type of content, as detailed in Chapter 3. Surveying this body of past research cannot give a holistic view of how Tor is used and of its structure as an information ecosystem. This is because previous studies concentrate on a particular part of Tor and take measurements that attempt to answer unique collections of hypotheses. But such a holistic insight into Tor’s utilization and ecosystem is necessary to answer broader questions, such as: How diverse is the content provided on Tor? Is it valid to argue that Tor is used to buy and sell illicit goods and services and to enable criminal activities? Is the service structure of Tor ‘siloed’? Answers to such questions can present an understanding of the types of services and information provided on Tor, and reveal the most popular and important (from a structural perspective) services it provides. To this end, we provide a quantitative characterization of the types of services available across a large swath of English language Tor Web pages. We perform a massive
crawl of Tor starting from 20,000 seed addresses and harvest only the HTML page of each visited address¹. Our crawl resulted in over 1 million addresses, of which 150,473 are hosted on Tor and the remaining 1,085,960 belong to the surface Web. We concentrate on 40,439 dark Web pages belonging to 3,347 English services and augment LDA with a topic-labeling method that utilizes DBpedia to assign semantically meaningful labels to the information of Web pages. We further extract and investigate a logical network of English Tor services connected by hyperlinks. Since the vast majority of Tor content (84%) is provided in English [8], our work focuses on the portion of Tor pages with information in English. We summarize our findings with respect to the following research questions:
• RQ1: How diverse is the information provided on Tor?
– We reveal that Tor services can be categorized into nine types. Over half of all discovered services are either directories that provide information about other Tor domains, or serve as dark markets to buy and sell goods and services. Only 24% of all Tor services are used to post, send, or anonymously explore information. We, however, find that various types of domains relate to each other in unique ways: some dark markets enable payment by forcing games on a gambling site, and Tor domains that involve money transactions have surprisingly weak ties with Tor Bitcoin services.
• RQ2: Is there any core service on Tor?
– An analysis based on importance measurements using centrality metrics reveals the Dream market as the most structurally important, “core” service Tor provides. Directory sites that are used to find and access Tor services have the maximum betweenness centrality, which makes them important sources for browsing the Tor network.
¹ For privacy purposes, no embedded resources of any type, including pictures, scripts, videos, or other multimedia files, were collected in our crawl.
Figure 4.1: Tor data collection process (stages shown: initial seeds, multi-threaded crawling, downloaded HTML files, “.onion?” check, “English?” check, topic modeling and labeling, top-k keywords per domain)
• RQ3: How siloed are Tor services?
– Analyses of the connectivity of Tor domains reveal that Tor services tend to isolate themselves from the others, which makes them difficult to discover by simple browsing. This implies the need for a comprehensive seed list of services to collect data on this network. We also find patterns that suggest competitive and cooperative behavior between Tor services based on their domain type.
To the best of our knowledge, this work reports the largest measurement of Tor taken to date, and is the first to consider the relationship between Tor services conditioned on the kind of information they provide. This chapter is organized as follows: Section 4.1 presents the process for data collection and processing. Section 4.2 reports a broad evaluation of Tor content. Section 4.3 presents the logical structure of the services and provides the evaluation results. Finally, Section 6.2 summarizes the main conclusions.
4.1 Dataset collection and processing
We conducted a broad crawl of the Tor network to collect data for this work. Figure 4.1 shows this data collection procedure. We ran a multi-threaded crawler to collect the HTML source of any Tor Web page accessible by a depth-first search (up to depth 4) starting from an initial list of 20,000 Tor onion addresses. The seed list was made by concatenating the list used in a recent work [46] with addresses extracted through the author’s manual exploration of Reddit, Quora, Ahmia, and other major surface Web resources at the time of crawling. Although a manually compiled seed list carries the risk of a crawl that misses parts of Tor, the hidden nature of Tor content makes it unlikely for there to ever be a single authoritative source for Tor services. We are confident that our seed list leads to a broad crawl of Tor for two reasons: (1) The surface Web directories used in this procedure are well-known for providing up-to-date information about Tor services and are commonly used by Tor users to begin their own browsing for information. This suggests that these seeds, as entry points into Tor, are at worst practically useful, and at best ideal starting points to browse and discover Tor services associated with the most common usages of the service; (2) The list adapted from [46] is noted to be a source often used to discover current Tor onion addresses. Moreover, the crawlers are assigned to cover all hyperlinks up to depth 4 from every seed address to make our data collection as broad as possible. Due to the rapidly changing content and structure of Tor [46], including temporary downtime for some domains, two crawls were executed in June and July 2018. To control for some variability in the up and down time of services, the union of the Tor sites collected during the two crawls was archived for the analysis. It is worth mentioning that we only request HTML files and follow hyperlinks, and avoid downloading the full content of a Web page.
This prevents access control policies and crawler blockers [54][25][51] from blocking our data collection. A total of 1,236,433 unique pages were collected across both crawls. The collected data was further processed to filter English pages and to classify the pages as being from the surface Web or from the Tor
dark Web. We classified any Web page with the suffix .onion as a Tor page, while the rest were considered to be from the surface Web. A language detection technique proposed by [20] was utilized to remove non-English onion content regardless of the value in the HTML language tag. This filtering resulted in 40,439 English Tor pages. We only focused on English pages to facilitate our content analysis. However, an evaluation of non-English pages will be the topic of our future study.
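The collection steps described above (depth-limited traversal from the seeds, union of the two crawls, and classification by the .onion suffix) can be illustrated with a minimal sketch. It is not the crawler used in this work: the in-memory `links` dictionary is a hypothetical stand-in for fetching HTML over Tor and extracting hyperlinks, and the function names `is_onion` and `crawl` are assumptions made for illustration only.

```python
from urllib.parse import urlparse

MAX_DEPTH = 4  # depth limit stated in the text


def is_onion(url):
    """Classify a URL as a Tor page when its host name ends in .onion."""
    host = urlparse(url).hostname or ""
    return host.endswith(".onion")


def crawl(seeds, links, max_depth=MAX_DEPTH):
    """Depth-first traversal limited to max_depth hops from any seed.

    `links` maps a URL to the URLs it links to (hypothetical stand-in for
    downloading a page and parsing its hyperlinks).
    """
    best_depth = {}  # smallest depth at which each URL was reached
    stack = [(seed, 0) for seed in seeds]
    while stack:
        url, depth = stack.pop()
        # Skip URLs beyond the limit or already reached at an equal or
        # smaller depth; re-expand if we found a shorter route to them.
        if depth > max_depth or best_depth.get(url, max_depth + 1) <= depth:
            continue
        best_depth[url] = depth
        for target in links.get(url, ()):
            stack.append((target, depth + 1))
    return set(best_depth)


# Toy snapshots standing in for the two crawls.
june = {"http://a.onion/": ["http://b.onion/x", "https://example.com/"]}
july = {"http://a.onion/": ["http://d.onion/"]}
pages = crawl(["http://a.onion/"], june) | crawl(["http://a.onion/"], july)
tor_pages = {u for u in pages if is_onion(u)}
surface_pages = pages - tor_pages
```

Tracking the smallest depth per URL, rather than a plain visited set, is a design choice of this sketch: it keeps a depth-first order from hiding pages that are reachable within the limit by a shorter route.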
4.1.1 Tor content discovery and labeling
We performed an unsupervised content discovery and labeling procedure on the corpus of English Tor pages. Defining the content as any string outside of a markup tag, the procedure runs the content of every Web page through the Latent Dirichlet Allocation (LDA) [10] and Graph-based Topic Labeling (GbTL) [27] methods to find a set of semantic labels as the broad topics of information on Tor. Each Tor service is then assigned a label by the dominant topic present across the set of all Web pages crawled in the domain.
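The last step, assigning one label per service, can be sketched as follows. The sketch assumes that "dominant" means the topic that is most frequent among the pages of a service, which is one possible reading of the text; the function name `label_services` and both input dictionaries are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def label_services(page_topics, topic_labels):
    """Assign each service the label of its most frequent page topic.

    page_topics: dict mapping page URL -> index of that page's main topic
    topic_labels: dict mapping topic index -> human-readable label
    Both inputs are hypothetical; ties are broken by first occurrence.
    """
    counts = defaultdict(Counter)
    for url, topic in page_topics.items():
        counts[urlparse(url).hostname][topic] += 1
    return {
        host: topic_labels[counter.most_common(1)[0][0]]
        for host, counter in counts.items()
    }


example = label_services(
    {
        "http://shop.onion/a": 0,
        "http://shop.onion/b": 0,
        "http://shop.onion/c": 1,
        "http://wiki.onion/": 1,
    },
    {0: "Market", 1: "Directory"},
)
```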
Topic Modeling
Topic modeling is a method to uncover hidden structures within a collection of documents. By defining a topic as a group of words that often occur together, topic modeling creates semantic relationships among words within the same context, and differentiates words based on their meaning. LDA [10] is a well-known unsupervised learning technique for this end.
It models a topic $t_j$ ($1 \le j \le T$) as a probability distribution $p(w_i \mid t_j)$ over words taken from a corpus, $D = \{d_1, d_2, \cdots, d_N\}$, of documents. Words are drawn from a vocabulary $W = \{w_1, w_2, \cdots, w_M\}$. The probability of observing word $w_i$ in document $d$ is calculated as $p(w_i \mid d) = \sum_{j=1}^{T} p(w_i \mid t_j)\, p(t_j \mid d)$. Gibbs sampling is utilized to estimate the word-topic distribution $p(w_i \mid t_j)$ and the topic-document distribution $p(t_j \mid d)$ from data. The entire generative process can be summarized as follows, where $\alpha$ and $\beta$ are hyperparameters:
Figure 4.2: LDA topic coherence score for different numbers of topics and minimum document lengths; the bold trend represents scores using 9 topics (x-axis: minimum document length; y-axis: coherence; legend: number of topics)
1. For each topic $t \in \{1, 2, \cdots, T\}$, define a word distribution $\omega_t \sim \mathrm{Dir}(\beta)$
2. For each document $d \in \{1, 2, \cdots, N\}$,
(a) Draw a topic distribution $\tau_d \sim \mathrm{Dir}(\alpha)$
(b) For each word position $i$ ($1 \le i \le K$, where $K$ is the number of words in document $d$),
i. Draw a topic $t_i \sim \mathrm{Mult}(\tau_d)$, with $t_i \in \{1, 2, \cdots, T\}$
ii. Draw a word $w_i \sim \mathrm{Mult}(\omega_{t_i})$
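The generative process above can be simulated directly, which may help make the roles of the two Dirichlet priors concrete. The following is a minimal sketch using NumPy, not the inference procedure used in this work (which estimates the distributions with Gibbs sampling); the function names and the symmetric priors are assumptions for illustration.

```python
import numpy as np


def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process.

    Assumes symmetric Dirichlet priors and a fixed document length,
    both simplifications made for this sketch.
    """
    rng = np.random.default_rng(seed)
    # Step 1: one word distribution per topic, shape (n_topics, vocab_size).
    omega = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    taus, docs = [], []
    for _ in range(n_docs):
        # Step 2(a): topic distribution of this document.
        tau = rng.dirichlet(np.full(n_topics, alpha))
        # Step 2(b)i: a topic for every word position.
        topics = rng.choice(n_topics, size=doc_len, p=tau)
        # Step 2(b)ii: a word drawn from the chosen topic's distribution.
        words = np.array([rng.choice(vocab_size, p=omega[t]) for t in topics])
        taus.append(tau)
        docs.append(words)
    return omega, np.array(taus), docs


def word_given_document(omega, taus):
    """p(w | d) = sum_j p(w | t_j) p(t_j | d), one row per document."""
    return taus @ omega


omega, taus, docs = generate_corpus(3, 8, 2, 5, alpha=0.5, beta=0.5, seed=1)
p_w_d = word_given_document(omega, taus)
```

Each row of `p_w_d` is a mixture of the topic-word distributions weighted by the document's topic proportions, so every row sums to one.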
An important hyperparameter of LDA is the number of topics $T$ that needs to be modeled. We set $T$ by considering the coherence [35] of a set of topics inferred for some $T$, selecting the $T$ with the best coherence score $C$. $C$ is defined as a function of the $n$ words of each $t_i$ that have the highest probability $P(w_j \mid t_i)$. Let $W^{(t)} = \{w_1, \cdots, w_n\}$ be the set of top-$n$ most probable words from $P(w \mid t)$. Therefore, $C$ is given by:
$$C(t; W^{(t)}) = \sum_{i=2}^{n} \sum_{j=1}^{i-1} \log \frac{F_d(w_i^{(t)}, w_j^{(t)}) + 1}{F_d(w_j^{(t)})}$$
where $w_i^{(t)}, w_j^{(t)} \in W^{(t)}$, $F_d(w)$ denotes the number of documents in which $w$ appears, and $F_d(w_i, w_j)$ gives the number of documents in which both words $w_i^{(t)}$ and $w_j^{(t)}$ occur. Values which are closer to zero show higher coherence for the corresponding topic. $T$ is thus chosen as the one that results in the smallest average $C$ over all topics, $\arg\min_T \frac{1}{T} \sum_{t=1}^{T} C(t; W^{(t)})$, where the summation is run over a model fitted to $T$ topics. A final parameter of LDA is the minimum length of a document (e.g. a Tor page) for it to be considered in the coherence calculation. We choose this length empirically by inspection of Figure 4.2, which gives topic coherence scores for varied values of $T$ and various minimum document lengths, where $n = 10$ is considered for each topic. The trend for $T = 9$ yields the value closest to zero at a minimum document length of 50 words.
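To make the coherence score concrete, the sketch below computes it from document co-occurrence counts for a list of top words, and averages it over the topics of one fitted model. It is a minimal illustration rather than the implementation used in this work; the function names are hypothetical, documents are treated as sets of tokens, and every top word is assumed to occur in at least one document so that the denominator is non-zero.

```python
import math


def topic_coherence(top_words, documents):
    """Coherence of one topic from document and co-document frequencies.

    top_words: the n most probable words of the topic, in rank order
    documents: iterable of token lists (hypothetical toy input)
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            joint = doc_freq(top_words[i], top_words[j])
            score += math.log((joint + 1) / doc_freq(top_words[j]))
    return score


def average_coherence(topics_top_words, documents):
    """Mean coherence over all topics of one fitted model."""
    scores = [topic_coherence(words, documents) for words in topics_top_words]
    return sum(scores) / len(scores)


toy_docs = [["drug", "market", "bitcoin"], ["drug", "market"], ["bitcoin", "wallet"]]
single = topic_coherence(["drug", "market", "bitcoin"], toy_docs)
mean = average_coherence([["drug", "market", "bitcoin"], ["bitcoin", "wallet"]], toy_docs)
```

In the toy corpus, only the pair ("market", "drug") contributes a non-zero term, since the other pairs give a ratio of exactly one.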
Graph-based Topic Labeling with DBpedia
The word-topic distributions from a fitted topic model are representative of semantically related words that appear in common contexts. A human expert may subsequently assign a label to each topic by manually evaluating these distributions; however, the manual approach results in a subjective, possibly biased interpretation of the topics in a corpus. Also, manual investigation of a huge number of Web pages requires a large amount of time and effort. To avoid any bias towards prior knowledge and to expedite labeling the pages, we incorporate the unsupervised knowledge graph-based labeling method called GbTL [27]. GbTL uses the DBpedia knowledge graph (KG), which codifies Wikipedia articles and their relationships as an ontology. GbTL finds a concept, $\zeta$, from DBpedia that would serve as a suitable label to be representative of $t$.
To choose $\zeta$, GbTL first considers a suitability measure $\gamma$. Before defining $\gamma$, it is worth noting that any optimization over all DBpedia concepts is infeasible because of the massive size of the DBpedia ontology. Instead, GbTL defines a candidate set of possible labels for the topic $t$. The candidate set for $t$ contains all vertices in the subgraph $G_t = (V_t, E_t)$ of DBpedia, where $V_t$ indicates the set of concepts with labels identical to any word in $W^{(t)}$ along with their directed first- and second-degree neighborhood, and $E_t$ shows the links among all concepts in $V_t$. Choosing the second-degree neighbors is based on [27], which found this setting to produce a sufficiently large candidate set of labels without adding any unrelated ones. Since labels of topics can be considered as assigning categories to the topics, we limit the subgraph relations to those with types rdfs:type, dcterms:subject, skos:broader, skos:broaderOf, and rdfs:subClassOf. The suitability $\gamma$ of each concept $\zeta$ in $V_t$ is computed based on its Focused Random Walk Betweenness Centrality [27], which measures the average amount of time it takes for a random walker to arrive at some node starting from any other node in a network. It is computed as below:
1. Let $L = D - A$ be the Laplacian matrix of $G_t$, where $A$ indicates the adjacency matrix of $G_t$ and $D$ shows its diagonal degree matrix.
2. Arbitrarily remove a row and its corresponding column from $L$ and then invert the result. Define $T$ as this inverse with a row and a column of zeroes inserted at the same index from which the row and column were removed from $L$.
3. Define γ(ζ, t) as below:
P xy Ii (t) vx,vy∈W ,x