New Perspectives About the Tor Ecosystem: Integrating Structure with Information
NEW PERSPECTIVES ABOUT THE TOR ECOSYSTEM: INTEGRATING STRUCTURE WITH INFORMATION

A Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

MAHDIEH ZABIHIMAYVAN
B.S., Ferdowsi University, 2012
M.S., International University of Imam Reza, 2014

2020
Wright State University

Wright State University
GRADUATE SCHOOL

April 22, 2020

I HEREBY RECOMMEND THAT THE DISSERTATION PREPARED UNDER MY SUPERVISION BY MAHDIEH ZABIHIMAYVAN ENTITLED NEW PERSPECTIVES ABOUT THE TOR ECOSYSTEM: INTEGRATING STRUCTURE WITH INFORMATION BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.

Derek Doran, Ph.D.
Dissertation Director

Yong Pei, Ph.D.
Director, Computer Science and Engineering Ph.D. Program

Barry Milligan, Ph.D.
Interim Dean of the Graduate School

Committee on Final Examination
Derek Doran, Ph.D.
Michael Raymer, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.
Amir Zadeh, Ph.D.

ABSTRACT

Zabihimayvan, Mahdieh. Ph.D., Department of Computer Science and Engineering, Wright State University, 2020. New Perspectives About The Tor Ecosystem: Integrating Structure With Information

Tor is the most popular dark network in the world. Its noble uses, including as a platform for free speech and information dissemination under the guise of true anonymity, make it an important socio-technical system in society. Although activities in socio-technical systems are driven by both structure and information, past studies on evaluating Tor investigate its structure or information exclusively and narrowly, which inherently limits our understanding of Tor. This dissertation bridges this gap by contributing insights into the logical structure of Tor, the types of information hosted on this network, and the interplay between its structure and information.
These insights arise from three studies: (a) We perform a comprehensive crawl of the Tor dark Web and, through topic and network analysis, characterize the types of information and services hosted across a wide swath of Tor domains and their hyperlink relational structure. (b) We study the potential for thought-to-be isolated information on the dark Web to be leaked into the public surface Web by providing a broad evaluation of the network of references from Tor to the surface Web. (c) We investigate the structural identity of Tor domains as an indicator of their neighborhood structure, independent of either their service type or their location in the network.

Our studies unearth previously unknown properties of Tor, including the finding that Tor domain types can be categorized into nine groups defined by the information they host. We show that services for releasing and searching information emerge as the dominant type of Tor domains. Our importance evaluation identifies Dream marketplaces and directory domains as Tor core services and crucial entry points for probes, respectively. Connectivity analyses reveal some patterns of cooperation and competition among Tor domains. We also present measurements that indicate how some types of domains intentionally silo themselves from the rest of Tor.

The investigation of the dark-to-surface referencing network reveals that this network is a single massive connected component in which over 90% of Tor hidden services have at least one link to the surface world, despite their interest in being isolated from surface Web tracking. This referencing puts Tor domains closer to each other and encourages them to cluster. However, it does not raise the domains' contribution to either communication or information dissemination through the Tor network. Analyses of Tor structural identity indicate that Tor domains can be categorized into eight groups based on their neighborhood structure.
Dream market has its own class: an almost fully connected neighborhood structure that is robust against node removal. The linking structure of Tor domains can further differentiate their structural identities. Domains with direct links to other domains of high out-degree centrality form the dominant structural identity on Tor. This identity makes the 2nd-degree neighborhood of Tor domains robust against node removal or targeted attack despite their tendency towards isolation.

Contents

1 Introduction
2 Preliminary Knowledge
  2.1 Tor routing scheme
  2.2 Tor security issues
    2.2.1 Client-side attacks
    2.2.2 Server-side attacks
3 Literature Review
  3.1 Tor security and privacy
  3.2 Tor structure characterization
  3.3 Tor information characterization
4 Information Ecosystem Evaluation
  4.1 Dataset collection and processing
    4.1.1 Tor content discovery and labeling
  4.2 Content evaluation
  4.3 Domain relationships
    4.3.1 Connectivity analysis
    4.3.2 Importance analysis
  4.4 Chapter summary
5 Information Leakage Assessment
  5.1 Dataset collection and processing
  5.2 Evaluation of dark-to-surface referencing
    5.2.1 Linking process of Tor services to dark/surface resources
    5.2.2 Analyzing reference view of Tor services
  5.3 Chapter summary
6 Structural Identity Characterization
  6.1 Structural identity analysis
    6.1.1 Tor structural identity representation
    6.1.2 Clustering Tor structural identities
  6.2 Chapter summary
7 Conclusion
Bibliography

List of Figures

1.1 Relationship among the studies in this dissertation
2.1 Tor architecture
2.2 A browser attack using code included in a website. Executing the plugged-in program in the client's browser opens a direct path to a malicious Web server, which compromises the client's anonymity
2.3 A browser attack using a malicious exit node. The client's browser executes code inserted into a Web page by the malicious exit node
2.4 Traffic manipulation to induct malicious entry into the circuit
2.5 Flow diagram of the off-path man-in-the-middle attack
4.1 Tor data collection process
4.2 LDA topic coherence score for different numbers of topics and minimum document lengths; the bold trend represents scores using 9 topics
4.3 Services provided by news, multimedia, forum, and shopping domains
4.4 Topic distribution of Tor domains
4.5 Network of domains with degree > 0
4.6 The Tor domain network. Panels 1 through 9 each highlight the incoming and outgoing edges of a particular domain. The figure is best viewed digitally and in color
4.7 In/out degree distribution of each community
4.8 Intra-relations within domains
4.9 Centrality distributions
5.1 Flowchart of the data collection and processing
5.2 Network of data collected during crawling. Red nodes indicate Tor domains and green nodes represent surface websites
5.3 Number of references to surface vs. dark Web; both axes in logarithmic scale
5.4 Distribution of the number of dark Web neighbors for Tor domains in the first category; both axes in logarithmic scale
5.5 Frequency distribution of network parameters
6.1 Average silhouette width vs. number of clusters
6.2 Dendrogram of the hierarchical clustering
6.3 Label distributions in the resulting clusters
6.4 Examples of neighborhood structure for Dream market domains
6.5 CDF plot of out-degree centrality for Tor domains
6.6 Examples of 1st-degree neighborhood structure for domains in clusters 1 to 4
6.7 Examples of 2nd-degree neighborhood structure for domains in clusters 1 to 4
6.8 Examples of neighborhood structure for domains in clusters 7 and 8

List of Tables

4.1 List of 10 most probable words per topic and their label
4.2 Summary statistics of the domain network
4.3 Modularity score of each topic community
5.1 Dark/surface resources used to collect initial seeds
5.2 Basic parameters of the network
5.3 Onion domains with more than 200 references to surface and no link to Tor
5.4 Summary statistics of network features
5.5 List of edges with edge betweenness centralities greater than 10,000
5.6 Top 20 Tor services with high stress centrality
5.7 Top 20 Tor services with out-degree centrality greater than 400
5.8 Top 10 Tor services with in-degree centrality greater than 100
6.1 Basic statistics of the out-degree centrality for Tor domains

Acknowledgments

I would like to express my thanks to my supervisor, Dr. Derek Doran, who has continuously supported me throughout this long journey. His patience, advocacy, and constructive comments helped me improve in different