Florida State University Libraries

Electronic Theses, Treatises and Dissertations The Graduate School

2014

Enhancing Anonymity of Anonymous P2P Content Sharing Systems

Guanyu Tian

FLORIDA STATE UNIVERSITY

COLLEGE OF ARTS AND SCIENCES

ENHANCING ANONYMITY OF ANONYMOUS P2P CONTENT SHARING SYSTEMS

By

GUANYU TIAN

A Dissertation submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Degree Awarded: Spring Semester, 2014

Copyright © 2014 Guanyu Tian. All Rights Reserved.

Guanyu Tian defended this dissertation on April 15, 2014. The members of the supervisory committee were:

Zhenhai Duan, Professor Directing Thesis

Ming Ye, University Representative

Gary Tyson, Committee Member

Zhenghao Zhang, Committee Member

Zhi Wang, Committee Member

The Graduate School has verified and approved the above-named committee members, and certifies that the dissertation has been approved in accordance with university requirements.

I dedicate this dissertation to my family. Without your support and encouragement, none of this work would have been possible.

ACKNOWLEDGMENTS

I would like to acknowledge several people who have helped and guided me throughout my doctoral program. First of all, I would like to thank Dr. Zhenhai Duan for being my academic advisor and mentor. I am grateful that Dr. Duan has always been there to guide me through this intensive training process. His constant support and encouragement have played an important role in my academic growth. I would also like to thank my dissertation committee members, Dr. Gary Tyson, Dr. Zhenghao Zhang, Dr. Zhi Wang, and Dr. Ming Ye, for their reviews and comments. Special thanks to Daniel Clawson for helping me meet the deadlines for manuscript submission and the dissertation defense. Last but not least, I would like to thank all my professors in the Department of Computer Science at Florida State University, who prepared me with excellent training.

TABLE OF CONTENTS

List of Tables
List of Figures
Abstract

1 Introduction
1.1 Background and Motivation
1.2 Contribution
1.3 Structure of the Dissertation

2 An Overview of Anonymous Networks
2.1 Anonymous Communication System Tor
2.2 Anonymous Content Sharing System Freenet, GNUnet, and OneSwarm
2.2.1 Freenet
2.2.2 GNUnet
2.2.3 OneSwarm

3 Traceback Attack on Freenet
3.1 Introduction
3.2 Traceback Attack on Freenet
3.2.1 Connecting to a Freenet Node
3.2.2 Querying a Neighbor
3.2.3 Identifying All Nodes Seeing A Content Request Message
3.2.4 Difficulties in Identifying Originating Machine
3.2.5 Identifying Originating Machine
3.3 Performance Evaluation
3.3.1 Experimental Studies
3.3.2 Simulation Studies
3.4 Discussion
3.5 Related Work

4 DynID: Thwarting the Traceback Attack on Freenet
4.1 Introduction
4.2 Background
4.3 DynID to Thwart Traceback Attack
4.4 Performance Evaluation
4.4.1 Simulation Set-up
4.4.2 Results

5 ROL: Reroute-On-Loop in Anonymous P2P Content Sharing Networks
5.1 Introduction
5.2 Background
5.2.1 Freenet
5.2.2 GNUnet
5.2.3 OneSwarm
5.3 Reroute On Loop
5.4 Performance Evaluation
5.4.1 Simulation Setup
5.4.2 Simulation Results
5.5 Related Work
5.6 Conclusion

6 Related Work

7 Summary

Bibliography
Biographical Sketch

LIST OF TABLES

3.1 Results of experimental studies.
3.2 Classification of messages successfully traced back.
3.3 Properties of message paths.
3.4 Results of simulation studies.
3.5 Classification of messages successfully traced back (simulation).
3.6 Properties of message paths (simulation).
4.1 The number of successful content lookup requests.
4.2 Properties of message forwarding paths.
5.1 Properties of the networks used in simulations.
5.2 Average routing path lengths of Freenet and ROL.
5.3 Number of messages in loops.
5.4 Average routing path lengths on hybrid networks with parameters of S2.
5.5 Average routing path lengths on hybrid networks with parameters of S3.
5.6 Average routing path lengths on hybrid networks with parameters of S11.

LIST OF FIGURES

2.1 How Tor works.
2.2 Circuit creation on Tor.
2.3 Freenet routing scheme.
2.4 Indirecting and forwarding.
3.1 Illustration of the traceback attack.
3.2 Case 1: n_j forwarding request to n_{k-1}.
3.3 Case 2: n_{k-1} forwarding request to n_j, but backtracked from n_j.
3.4 Case 3: No message forwarding between n_j and n_{k-1}.
3.5 Length distribution of linear paths.
3.6 Length distribution of linear reverse paths.
3.7 Length distribution of linear paths (simulation).
3.8 Length distribution of linear reverse paths (simulation).
4.1 Basic structure of the traceback attack.
4.2 A forwarding path with loop.
4.3 Can j forward a message to node k if node h is more preferred?
5.1 Forwarding of a content request message.
5.2 Implication of HTL operation.
5.3 Average routing path length (small-world networks).
5.4 Average routing path length (random networks).
5.5 Distribution of routing path lengths (small-world networks).
5.6 Distribution of routing path lengths (random networks).
5.7 Comparison of forwarding path lengths between Freenet and ROL (small-world networks).
5.8 Comparison of returning path lengths between Freenet and ROL's shortcut (small-world networks).

ABSTRACT

Anonymous networks play a critical role in supporting free speech and user privacy on the Internet. Over the years, many fundamental algorithms and schemes have been proposed to facilitate the development of anonymous networks, including mix networks, onion routing, per-hop (source) address re-writing and message forwarding, and various cryptographic algorithms. In addition, many practical anonymous networks have been developed and some are deployed on the Internet. On the other hand, despite the adoption of these well-established high-level security schemes and algorithms in such networks, the fine-grained design and development decisions of such networks have not been thoroughly examined. As a consequence, vulnerabilities in existing anonymous networks have been continuously identified and existing anonymous networks have been constantly attacked. In this dissertation we take a pragmatic approach to investigate how fine-grained design and development decisions may affect the anonymity strength of anonymous networks, and more importantly, how we can develop proper fine-grained decisions to improve the anonymity strength of anonymous networks. Throughout, we focus on Freenet, a popular peer-to-peer anonymous content sharing network. In the first part of the work, we thoroughly investigate the fine-grained decisions made in the Freenet project, including methods to prevent routing loops of content request messages, the handling of various messages in Freenet, and mechanisms for a Freenet node to populate and update its routing table. An effective traceback attack has been developed that can identify the originating machine of a content request message. That is, the anonymity of a content retriever can be broken in Freenet, even if only a single request message has been issued from the corresponding machine.
The traceback attack exploited a few fine-grained design and development decisions made in Freenet, including the unique identifier (UID) based mechanism to prevent routing loops of content request messages. In the second part of our work, we investigate mechanisms to improve the anonymity of Freenet. In particular, we have developed a simple and effective scheme named dynID to thwart the traceback attack on Freenet. In dynID, the UID associated with a content request message is dynamically changed at the beginning portion of the message forwarding path. As a consequence, an attacker can only trace back a content request message to the node where the UID value is last changed; it cannot uniquely determine the originating machine of the message. Importantly, dynID has only a negligible impact on the performance of Freenet in locating content on the network. For example, our simulation studies based on the original Freenet source code show that, for all content requests, we can successfully locate the corresponding requested content. DynID prevents an attacker from deterministically identifying the originator of a message request, but attackers can still probabilistically trace back to the originator. In the third part of our work, we develop a solution, Reroute-On-Loop (ROL), to prevent leakage of routing information. ROL prevents an attacker from distinguishing a node that has seen a particular message from a node that has not seen the message. Our simulation studies show that this solution is effective and has only a minor performance impact on Freenet. We emphasize that if an attacker cannot distinguish a node that has seen a particular message from a node that has not seen the message, it will become extremely difficult for the attacker to carry out any kind of traceback attack. The Freenet project has released a patch to mitigate the traceback attack developed in this dissertation. The Reroute-On-Loop solution in our work has also been considered by the Freenet project.

CHAPTER 1

INTRODUCTION

1.1 Background and Motivation

Free speech and user privacy are desired features on the Internet. User Internet access is often monitored by governments or private companies. In addition, governments around the world often block access to unsuitable or subversive content, or make service providers liable for hosting such materials [9]. Moreover, the controversy over and facilitates a strong demand for distributed anonymous file storage service [12]. Supporting free speech and user privacy on the Internet is imperative and is a big challenge.

Anonymous networks, which try to hide the identity of senders and receivers, or the communication linkage between a sender and receiver, play an important role in supporting free speech and user privacy on the Internet. Over the years, many fundamental algorithms and schemes have been proposed to facilitate the development of anonymous networks. Representative examples include mix networks, onion routing, per-hop (source) address re-writing and message forwarding, and various cryptographic algorithms. In addition, many practical anonymous networks have been developed and some are deployed on the Internet. These networks can be either centralized, where all nodes are deployed and managed by a single entity, or decentralized (i.e., peer to peer), where nodes are contributed by individual volunteers. Centralized solutions require a high degree of trust in the entities deploying the anonymous networks. In this dissertation we focus on peer to peer anonymous networks (p2pANs). Typical examples of p2pANs are Tor and Freenet.

Despite the adoption of well-established high-level security schemes and algorithms such as mix networks and per-hop source address rewriting in p2pANs, the fine-grained design and development decisions of such networks have not been thoroughly examined. Examples of fine-grained design and development decisions include how an arbitrary node joins a p2pAN, how a node identifies neighbors and populates its routing table, and how to improve network performance (and hence attract more users) without deteriorating user anonymity. As a consequence, vulnerabilities in existing p2pANs have been continuously identified and existing p2pANs have been constantly attacked; examples include watermarking attacks [31] on Tor and the Pitch Black attack [43] on Freenet. The Freenet project has also documented a few potential attacks on Opennet [4], including node harvesting on Freenet, mobile attacker source tracing, and routing table takeover.

In this dissertation we take a pragmatic approach to investigate how fine-grained design and development decisions may affect the anonymity strength of anonymous networks, and more importantly, how we can develop proper fine-grained decisions to improve the anonymity strength of anonymous networks. Throughout, we focus on Freenet. Freenet is a distributed content sharing system, where users can both insert and retrieve files [2]. As a popular peer to peer anonymous network [7], Freenet aims to provide anonymity for both content publishers and retrievers. (In this dissertation we use the two terms file and content interchangeably.) In Freenet, users contribute a portion of their hard disk space to form a global distributed storage sharing system. Global file operations such as insertion, retrieval, and deletion are all managed by the Freenet system itself. The location where a file is stored in Freenet is determined by a unique routing key associated with the file. Each node in Freenet only knows the information of its immediate neighbors. Mechanisms such as hop-by-hop forwarding of user messages, and rewriting the (source) address of the messages at each node, are employed in Freenet to support user anonymity.

Freenet supports two operational modes: Darknet and Opennet. In Darknet, only trusted friends can connect to each other, whereas in Opennet, anyone can connect to Freenet.

In this dissertation we focus on the Opennet mode of Freenet, and we always mean the Opennet mode whenever we refer to Freenet. We note that the large-scale public Freenet on the Internet operates in the Opennet mode so that arbitrary users can join. A (private) Freenet operating in the Darknet mode tends to be smaller in scale, consisting of a limited number of trusted friends. In addition, the stronger security provided by Darknet is not based on improved protocols or architectures, but rather on the assumed stronger trust among members of a Darknet. Attacks on Opennet, such as the traceback attack developed in our work, can be launched on a Darknet if a member of the Darknet decides to do so (or is subverted and controlled by an attacker).

1.2 Contribution

The contribution of this dissertation is organized into two parts. In the first part of the work, we thoroughly investigate the fine-grained decisions made in the Freenet project, including methods to prevent routing loops of content request messages, the handling of various messages in Freenet, and mechanisms for a Freenet node to populate and update its routing table. The objective of this investigation is twofold. First, we want to study how well the fine-grained design and development decisions of Freenet have been made to meet the anonymity goals of the network. Second, we want to obtain insights into developing fine-grained decisions that better support user anonymity. After this thorough examination of the fine-grained decisions made in Freenet, an effective traceback attack is developed that can identify the originating machine of a content request message. That is, the anonymity of a content retriever can be broken in Freenet, even if only a single request message has been issued from the corresponding machine. The traceback attack exploits a few fine-grained design and development decisions made in Freenet, including the unique identifier (UID) based mechanism to prevent routing loops of content request messages.

We have carried out Emulab-based experimental studies to evaluate the performance of the developed traceback attack. The Emulab-based experiments are performed using the source code of Freenet 0.7 (the current version of Freenet), extended to support the traceback attack. With randomly chosen nodes initiating content requests for random files stored in the Emulab-based Freenet testbed, our experimental studies show that, for 24% to 43% of content request messages, we can identify their originating machines. In addition to the Emulab-based experimental studies, we also performed Thynix-based simulation studies. Thynix is a simulator from the Freenet project [5]. This simulator allows us to carry out our traceback attack on a large-scale network. Similar to the Emulab-based experiments, the Thynix-based simulation studies confirm the effectiveness of the developed traceback attack. We can identify the originating machines of 42.9% to 49.2% of content request messages in the simulation studies.

In the second part of our work, we investigate mechanisms to improve the anonymity of Freenet. First, we have developed a simple and effective scheme named dynID to thwart the traceback attack on Freenet. In dynID, the UID associated with a content request message is dynamically changed at the beginning portion of the message forwarding path. As a consequence, an attacker can only trace back a content request message to the node where the UID value is last changed; it cannot uniquely determine the originating machine of the message. Importantly, dynID has only a negligible impact on the performance of Freenet in locating content on the network. For example, our simulation studies based on the original Freenet source code show that, for all content requests, we can successfully locate the corresponding requested content.

DynID prevents an attacker from deterministically identifying the originator of a message request, but attackers can still probabilistically trace back to the originator. In order to prevent any traceback attack, we developed a solution, Reroute-On-Loop (ROL), to prevent leakage of routing information. With ROL, an attacker cannot distinguish a node that has seen a particular message from a node that has not seen the message. ROL is a routing loop-handling mechanism. In ROL, when a message re-visits a node n, the node n will not send any failure message to the upstream node where the message comes from. Instead, the node will continue forwarding the message to the next closest peer that has neither forwarded the message to node n nor received the message from node n. The fact that node n does not send a failure message to the upstream node when a loop happens prevents attackers from deciding whether or not node n has seen a particular message. We note that, if an attacker cannot distinguish a node that has seen a particular message from a node that has not seen the message, it will become extremely difficult for the attacker to carry out a traceback attack.
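The loop-handling rule described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the ROL or Freenet implementation: the class `RolNode`, its `next_hop` method, the per-UID bookkeeping, and the use of a circular distance over locations in [0, 1] are all hypothetical choices made for illustration, and the case where no eligible peer remains is left unhandled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


def circular_distance(a: float, b: float) -> float:
    """Distance between two locations on a circle where 0 and 1 coincide."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


@dataclass
class RolNode:
    """Hypothetical node applying the reroute-on-loop rule."""
    peers: Dict[str, float]  # peer name -> peer location
    # For each message UID, the peers this node has already exchanged it with.
    exchanged: Dict[int, Set[str]] = field(default_factory=dict)

    def next_hop(self, uid: int, key_location: float,
                 upstream: str) -> Optional[str]:
        """Choose the next peer for a message, whether or not it was seen before.

        On a revisit no failure message goes back to `upstream`; the node just
        forwards to the closest peer it has neither sent this message to nor
        received it from. None means no such peer is left.
        """
        used = self.exchanged.setdefault(uid, set())
        used.add(upstream)  # the message was received from this peer
        candidates = [p for p in self.peers if p not in used]
        if not candidates:
            return None
        best = min(candidates,
                   key=lambda p: circular_distance(self.peers[p], key_location))
        used.add(best)  # the message is now forwarded to this peer
        return best
```

In this sketch a first visit and a revisit go through the same code path, so the upstream peer observes the same behavior in both cases, which is the property the scheme relies on.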

Shortly after the traceback attack was developed in this dissertation, a patch was released by the Freenet project to mitigate the effectiveness of the traceback attack [29]. In addition, our Reroute-On-Loop solution has also been considered by the Freenet project [29].

Together, the three findings will help us better understand the fine-grained design and development decisions of p2pANs and their impact on such systems' performance and security strength. We believe that our work will contribute towards improving the current p2pANs and building a new system that is privacy-aware and resistant on the Internet.

1.3 Structure of the Dissertation

The remainder of the dissertation is organized as follows. In Chapter 2, we describe peer-to-peer anonymous networks, including anonymous communication systems and content sharing systems. Because our research focuses on anonymous content sharing networks, we specifically describe three popular systems: Freenet, GNUnet, and OneSwarm. In Chapter 3, we report our complete work on developing the traceback attack on Freenet. In Chapter 4, we present dynID, our solution to thwart this attack. In Chapter 5, we develop a solution to prevent routing information leakage using the Reroute-On-Loop scheme. In Chapter 6, we discuss related work on general and specific attacks on p2pANs. In Chapter 7, we conclude this dissertation.

CHAPTER 2

AN OVERVIEW OF ANONYMOUS NETWORKS

Most anonymous networks are distributed, peer-to-peer systems. These anonymous networks can be categorized into communication systems and content sharing systems.

Communication systems are generally low-latency and responsive. They act as a middleman between a content retriever and publisher. Briefly speaking, an anonymous communication network is a proxy or a chain of proxies. At a high level, such networks use hop-by-hop IP address rewriting and point-to-point encryption to provide user anonymity. A content retriever initiates a query message to the content server. Instead of being sent directly to the server, the query message is routed through multiple proxies. Each proxy knows only about its immediate neighbors. Therefore, only the first proxy knows the source of the query and only the last proxy knows the real destination of the query. This provides some flexibility to the chain of proxies. Even if intermediate proxies are compromised, the anonymity of a content retriever and publisher may still be safe.

Unlike anonymous communication networks, anonymous content sharing networks are usually high-latency. In p2p anonymous content sharing networks, each node provides its own hard disk for the network storage space. Contents are stored in the network. A user can be a content retriever and provider. A content retriever does not know where the desired content is. Anonymous content sharing networks generally define their own network topology and protocol for inserting and searching contents. Nodes can join and depart from the network at any time.

The main difference between anonymous communication networks and content sharing networks is that the former target low-latency applications such as web browsing, while the latter aim to support file sharing applications that can tolerate a certain amount of network delay and application-level delay. A communication system provides a secure channel/bridge between a content retriever and publisher. The retriever knows exactly where the content is stored. However, in content sharing networks, the content retriever and publisher are part of the network and the retriever does not know where the content is. Such networks need to define their own network topologies and mechanisms for content insertion and retrieval.

2.1 Anonymous Communication System Tor

Figure 2.1: How Tor works

One popular real-world peer-to-peer anonymous communication network is Tor. Tor supports user anonymity through a chain of proxies or relays. A user needs to create a virtual circuit before it actually sends a request message to a remote server. In Tor, every communication channel contains exactly three relays: an entry, an intermediary, and an exit. As shown in Figure 2.1, Alice tries to send a content request to Bob. She knows the IP address of Bob. Instead of directly sending the query to Bob, Alice will forward it to the entry Tor proxy. Then, the entry proxy will forward it to the middle proxy, which will continue forwarding it to the exit proxy. Eventually, the exit proxy will deliver Alice's content query message to Bob. At each proxy, the query message's return address will be replaced with that proxy's address. In this way, each of the three proxies and Bob only knows about its immediate upstream and downstream nodes. Bob does not know about Alice. This is an example of searching for content anonymously on the Internet.

Figure 2.2: Circuit creation on Tor

Tor provides a secure channel for users. In the following, we describe the construction of such a channel. The procedure for constructing the three-hop circuit is shown in Figure 2.2. First, the OP (Onion Proxy) on a user's computer establishes a TLS connection with the first OR (Onion Router), OR1. Then, the OP sends a CELL-CREATE cell to OR1 and uses the Diffie-Hellman handshake protocol to negotiate a base key between the OP and OR1. From this base key, the OP and OR1 produce a symmetric key, which is used to encrypt data sent between these two parties. At this point, a one-hop circuit has been created successfully. Similarly, to extend the circuit, the OP sends a relay extend cell carrying an encrypted Diffie-Hellman message to OR1; OR1 decrypts the message and sends it to OR2. OR2 completes the handshake and sends its response back to the OP through OR1. In this way, the OP and OR2 can negotiate a symmetric key. The OP can negotiate a symmetric key with OR3 in the same way. After a user builds the circuit successfully, he or she will encrypt the plaintext message with OR3's symmetric key, then OR2's key, and then OR1's key, in that order. By doing so, OR1 does not learn the destination of the message. OR1 just decrypts the message with its own symmetric key and forwards it to OR2. OR2 decrypts the message with its own symmetric key and forwards it to OR3. OR3 decrypts the message, sees the real destination, and finally forwards the request message to the destination server. Therefore, only OR3 knows the server to which a request message is addressed. This provides receiver anonymity to some extent.
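The layered encryption described above can be sketched in a few lines of Python. This is a toy illustration only: the hash-based XOR keystream below is a stand-in for a real symmetric cipher and is not secure, the function names `wrap` and `peel` are hypothetical, and key negotiation (the Diffie-Hellman steps) is assumed to have already happened.

```python
import hashlib
from typing import List


def keystream(key: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from `key` (toy construction)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """XOR `data` with the keystream; applying it twice restores the input."""
    return bytes(d ^ k for d, k in zip(data, keystream(key, len(data))))


def wrap(message: bytes, hop_keys: List[bytes]) -> bytes:
    """Sender side: hop_keys is ordered [OR1, OR2, OR3].

    The innermost layer uses OR3's key and the outermost layer OR1's key,
    so each relay removes exactly one layer in path order.
    """
    cell = message
    for key in reversed(hop_keys):
        cell = xor_cipher(key, cell)
    return cell


def peel(cell: bytes, key: bytes) -> bytes:
    """Relay side: remove one layer with this relay's own key."""
    return xor_cipher(key, cell)
```

A usage example: with `keys = [k1, k2, k3]`, calling `peel` with `k1`, then `k2`, then `k3` on `wrap(msg, keys)` yields `msg` only after the third step. Because this toy cipher is a plain XOR, its layers happen to commute; the sketch shows the layering idea, not the properties of a production cipher.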

2.2 Anonymous Content Sharing System Freenet, GNUnet, and OneSwarm

Since our work focuses on anonymous peer-to-peer content sharing systems, we will describe three popular real-world p2p anonymous networks: Freenet [2], GNUnet [34], and OneSwarm [3]. We describe Freenet in more detail because our studies use Freenet as a concrete example.

2.2.1 Freenet

In this section we provide a brief overview of the basic operations of Freenet that are most relevant to our current work. We refer interested readers to [2] and [10] for more details of Freenet.

Freenet is a peer to peer, anonymous content sharing system, with each node (a machine running Freenet) contributing a portion of its hard disk space. As a peer to peer system, nodes may join and depart from Freenet dynamically at any time. In Freenet, each node is associated with a location in the circular range [0, 1], where location 0 and location 1 are considered the same. The location of a node is randomly chosen by the node when it first joins Freenet. In order for arbitrary nodes to join Freenet, a set of seed nodes is provided, through which a new node can get connected to other nodes on Freenet.
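Because locations 0 and 1 coincide, the distance between two locations is measured around the circle rather than along a line. A minimal Python sketch of such a distance follows; the function name `circular_distance` is our own choice for illustration and is not taken from the Freenet source code.

```python
def circular_distance(a: float, b: float) -> float:
    """Shortest distance between two locations on a circle of circumference 1."""
    d = abs(a - b) % 1.0
    # Going the other way around the circle may be shorter.
    return min(d, 1.0 - d)
```

For example, locations 0.95 and 0.05 are about 0.1 apart under this measure, not 0.9, and locations 0 and 1 are at distance 0.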

When the Freenet program starts on a node with location x (or when the node still needs more neighbors), an announcement message carrying the identification information of the node will be sent out and routed towards location x on Freenet. Intermediate nodes along the forwarding path of the message can add the requesting node as a neighbor if they still need more neighbors. By default, each Freenet node can have up to 40 neighbors. Given that the announcement message is routed towards location x, it is likely that the majority of these intermediate nodes are close to location x. As a consequence, although nodes join Freenet in a distributed, asynchronous fashion, the topology of Freenet is semi-structured [22] in that, with a high probability, nodes with close-by locations are clustered together, and at the same time, a node may also connect to a neighbor with a far-away location. The semi-structured Freenet topology greatly improves the routing and lookup of data messages in Freenet, with a strong resilience to node departures or failures.

Freenet data insertion and retrieval involve a number of different types of messages. In this dissertation we only focus on Content Hash Key (CHK) based content messages, in which the routing key is the SHA-256 hash of the corresponding data to be inserted or retrieved. The CHK routing key is used to uniquely identify the corresponding data on Freenet. To a degree, CHK messages are the most fundamental in Freenet. For routing purposes, the CHK routing key is converted into a location value in the same range of [0, 1], and the corresponding message will be routed towards that location when received by a node.
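The derivation of a routing key and its mapping to a location can be sketched as follows. The SHA-256 step follows the description above; the conversion in `key_to_location` (reading the leading 8 bytes of the key as a fraction) is an assumption made for illustration, since the exact conversion used by Freenet is not specified here. Both function names are hypothetical.

```python
import hashlib


def chk_routing_key(data: bytes) -> bytes:
    """Routing key as the SHA-256 hash of the content (32 bytes)."""
    return hashlib.sha256(data).digest()


def key_to_location(key: bytes) -> float:
    """Map a routing key to a location in [0, 1).

    Assumed conversion: keep the top 53 bits of the leading 8 bytes so the
    resulting fraction is exactly representable as a float below 1.0.
    """
    top53 = int.from_bytes(key[:8], "big") >> 11
    return top53 / 2**53
```

The same content always yields the same key and hence the same location, which is what allows an inserter and a later retriever to route towards the same region of the network.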

In the following we will first describe the handling of the CHK content request message, which is issued by a node when the user requests a file on Freenet. Each message is associated with a CHK routing key, a hops-to-live (HTL) value, and a unique identifier (UID). When a node receives a content request message, it will check if it has the corresponding data in its local data store. If it does, it will return the data along the reverse path of the request message, and nodes along the path may cache the data to better serve potential later requests for the same file. Data caching also helps to spread popular data, which are requested by many Freenet users. If the current node does not have the requested data, it will forward the request message to the next closest neighbor based on the routing key.

Figure 2.3: Freenet routing scheme

In order to improve the likelihood that a message is routed to a destination node storing the requested data, the routing decision is made based on the distance between the CHK routing key (after being converted to a value between 0 and 1), the locations of the neighbors of the current node, and the locations of the neighbors of its neighbors. That is, Freenet uses two-hop routing lookup instead of one-hop lookup (only based on the locations of the immediate neighbors), which helps improve the routing efficiency and avoid local minima in the Freenet topology. For this purpose, each node in Freenet has the location information of its immediate neighbors, and the neighbors of its immediate neighbors.
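The two-hop lookup can be illustrated with a small Python sketch. It assumes one plausible scoring rule, namely that a neighbor is scored by the smallest distance to the key among its own location and the locations of its neighbors; the text does not specify how Freenet combines the two kinds of locations, so this rule, the function `choose_next_hop`, and the `exclude` parameter (used to skip neighbors already tried) are illustrative assumptions.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple


def circular_distance(a: float, b: float) -> float:
    """Distance between two locations on a circle where 0 and 1 coincide."""
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)


def choose_next_hop(
    key_location: float,
    neighbors: Dict[str, Tuple[float, List[float]]],
    exclude: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Pick the neighbor that looks closest to the key within two hops.

    `neighbors` maps a neighbor name to (its location, locations of its own
    neighbors). Returns None when every neighbor is excluded.
    """
    best_name, best_score = None, float("inf")
    for name, (location, peer_locations) in neighbors.items():
        if name in exclude:
            continue
        score = min(circular_distance(loc, key_location)
                    for loc in [location, *peer_locations])
        if score < best_score:
            best_name, best_score = name, score
    return best_name
```

With a key at 0.5 and neighbors `a` at 0.30, `b` at 0.45, and `c` at 0.70 whose own neighbor sits at 0.52, a one-hop lookup would prefer `b`, whereas this two-hop rule prefers `c` because a node very close to the key is reachable through it.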

The HTL value in a content request message is used to determine the number of hops the message should be forwarded along a forwarding path. Each intermediate node will decrease the value, and when it reaches 0, the corresponding request message will be discarded instead of being forwarded. In addition, a Data not Found failure message will be sent back to the upstream node, which will be further propagated back to the content requester to indicate the failure of the content request. In Freenet, for security reasons, a node that receives a message whose HTL value has already reached 1 decreases it to 0 only with a configured probability. Otherwise, an attacker could precisely control how far a content request message can be forwarded. Similarly, the HTL value is only decreased by 1 with a preconfigured probability (the default probability is 50%) when it equals the maximum value. Otherwise, an attacker could infer the distance between him/her and the originator of a message. As a consequence of this random behavior, a content request message may be forwarded along a path longer than the specified HTL value in the message.
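The probabilistic HTL handling can be sketched in Python as follows. The maximum of 18 and the 50% probability at the maximum value come from the text; the probability used when HTL equals 1 (`P_DECREMENT_AT_ONE`) is a placeholder, because the text only says it is configured. The function names are hypothetical, and `rng` is any object with a `random()` method, such as `random.Random`.

```python
import random

MAX_HTL = 18                 # default maximum initial HTL stated in the text
P_DECREMENT_AT_MAX = 0.5     # default probability stated in the text
P_DECREMENT_AT_ONE = 0.25    # placeholder: the text gives no concrete value


def next_htl(htl: int, rng) -> int:
    """Return the HTL value a node would put on the forwarded message."""
    if htl >= MAX_HTL:
        # At the maximum, decrement only with some probability.
        return MAX_HTL - 1 if rng.random() < P_DECREMENT_AT_MAX else MAX_HTL
    if htl == 1:
        # At 1, drop to 0 only with some probability.
        return 0 if rng.random() < P_DECREMENT_AT_ONE else 1
    return htl - 1


def hops_until_drop(htl: int, rng) -> int:
    """Count the HTL updates applied before the value reaches 0."""
    hops = 0
    while htl > 0:
        htl = next_htl(htl, rng)
        hops += 1
    return hops
```

Calling `hops_until_drop(MAX_HTL, random.Random())` never returns fewer than 18 and usually returns more, which mirrors the observation that a message may travel along a path longer than its HTL value suggests.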

When a request message cannot be forwarded due to reasons other than HTL = 0 (for example, no additional neighbors are available), the request will be backtracked to the upstream node where it comes from, in the sense that the upstream node will forward the request onto the next closest neighbor (if it is available). This process continues until either HTL becomes 0, the requested data is found, or all possible routes have been tried but the data cannot be found. The default maximum initial value of HTL is 18.

The UID in a content request message is used by nodes to uniquely identify a message and to prevent routing loops. UIDs are randomly generated and are 8 bytes long, so it is unlikely that two unrelated messages will have the same UID value in Freenet. When a node receives a request message, it checks whether it has seen this UID before. If it has, a reject with loop message will be sent back to the upstream neighbor from which the message came. Each node maintains a list of UIDs that it has seen but whose associated request messages it has not finished processing (the corresponding reply has not been received). It also maintains a queue of the UIDs that the node has finished processing (the corresponding reply has come back), which can hold up to 10,000 UIDs. When the UID of a newly completed message needs to be inserted into the queue and the queue is already full, the oldest UID is deleted from the queue. Completed request messages of different types share the same UID queue. As a consequence, a reject with loop message will be sent back as long as the current node has seen the UID carried in the incoming request message, regardless of the type of the request message.
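The UID bookkeeping described above can be sketched as follows. This is a simplified model rather than Freenet's actual data structures; the class name `UidTracker` and its method names are hypothetical, and only the queue capacity of 10,000 comes from the text.

```python
from collections import deque


class UidTracker:
    """Tracks UIDs of in-progress and recently completed requests."""

    def __init__(self, max_completed=10_000):
        self.in_progress = set()      # requests whose reply has not come back
        self.completed = deque()      # FIFO of finished UIDs, oldest first
        self.completed_set = set()    # same UIDs, for fast membership tests
        self.max_completed = max_completed

    def seen(self, uid):
        return uid in self.in_progress or uid in self.completed_set

    def accept(self, uid):
        """Return False (i.e., reject with loop) if uid was seen, else record it."""
        if self.seen(uid):
            return False
        self.in_progress.add(uid)
        return True

    def complete(self, uid):
        """Move uid to the completed queue, evicting the oldest entry if full."""
        self.in_progress.discard(uid)
        if uid in self.completed_set:
            return
        if len(self.completed) >= self.max_completed:
            oldest = self.completed.popleft()
            self.completed_set.discard(oldest)
        self.completed.append(uid)
        self.completed_set.add(uid)
```

Because one tracker is shared by all message types in this sketch, any message carrying a previously seen UID is rejected, regardless of its type.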

The handling of CHK data insertion messages is similar to that of data request messages. A data insertion message is routed towards a destination location based on the CHK routing key. The message is forwarded until HTL reaches 0. After the HTL drops below a configured threshold, intermediate nodes along the path may write the data into their local data stores, based on a few conditions. By relying on this threshold on the HTL value, Freenet prevents a file from being stored too close to the inserting node, so as to improve the security of the content inserter.

Another relevant message type is the probe message, which is mainly used for administrative and debugging purposes. For example, when a node receives a probe message, it will send back its routing table information. Like a request message, each probe message is associated with a UID. In addition, it carries a destination location, to which the probe message should be forwarded. A valid destination location value must be in the range [0, 1]. When a probe message with an invalid location value (outside the range [0, 1]) arrives at a node, the message will be discarded instead of being forwarded.

2.2.2 GNUnet

In the following we describe the basic operations of GNUnet. GNUnet is a peer-to-peer content sharing network. Its design goal is to provide strong user anonymity and censorship-resistance. In GNUnet, each node contributes a portion of its hard disk to the global network storage. Nodes can join and depart from GNUnet dynamically at any time without approval by a certified authority or central server.

In GNUnet, a node has an identification number, the node ID. This ID can also represent a location. In fact, the node ID provides a direct mapping between a file request and the node storing the requested content. In order to route messages effectively, nodes form a Kademlia-like network topology [39]. Simply speaking, a node is directly connected to other nodes whose locations are closest to its own. This kind of network topology is strictly structured.

In terms of routing, GNUnet uses random routing along with Kademlia routing. If Kademlia routing is used, a message associated with a location will always be forwarded to the node whose location is the closest to the location carried in the message. With the Kademlia network topology and routing algorithm, the length of a routing path between two nodes is highly likely to be within O(log(n)), where n is the size of the network. Slightly different from Kademlia, GNUnet’s message routing is actually carried out in two stages. In the first stage, a request message is routed randomly in the network. After the message has traversed a sufficient number of hops (roughly log(n), where n is the number of nodes in a GNUnet network), the second stage begins, in which the request message is forwarded according to the Kademlia protocol, with the exception that the routing is carried out in a recursive fashion instead of an iterative fashion as in the original Kademlia system, due to the anonymity requirement of GNUnet.
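A minimal sketch of the two-stage next-hop choice is given below. It is an illustration only and not GNUnet's implementation: the function names are hypothetical, identifiers are modeled as plain integers, the base-2 logarithm is an assumption (the text only says "roughly log(n)"), and the XOR metric is the standard Kademlia distance.

```python
import math
import random


def xor_distance(a, b):
    """Kademlia distance between two integer identifiers."""
    return a ^ b


def next_hop(key, hops_so_far, neighbors, network_size, rng=random):
    """Pick the next hop for a request in a two-stage routing scheme.

    Stage 1: forward to a random neighbor for roughly log2(n) hops
             (the base of the logarithm is an assumption).
    Stage 2: forward to the neighbor closest to the key (XOR metric).
    """
    random_hops = max(1, round(math.log2(network_size)))
    if hops_so_far < random_hops:
        return rng.choice(neighbors)
    return min(neighbors, key=lambda n: xor_distance(n, key))
```

For example, with neighbors `[0b0001, 0b0110, 0b1100]` and key `0b0111`, the second stage selects `0b0110`, whose XOR distance to the key is 1.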

The rationale behind the random routing in the first stage is to make the lookup of a file independent of the location of the originating machine. Although it was not explicitly stated, we believe that the random routing also helps to improve the anonymity strength of GNUnet. We note that Kademlia is a structured network topology; should the set of nodes that have originated or forwarded a request message become known, the complete forwarding path of a message in Kademlia can be reconstructed. With a random routing stage included, an attacker can only trace a request message back to the last node involved in the random routing stage based on the routing protocol of Kademlia, but not to the originating machine of the message. Therefore, the random routing stage helps improve the overall anonymity strength of GNUnet.

However, random routing also introduces a new problem into GNUnet. Due to random routing, loops can be formed in the forwarding of a request message. To prevent this problem, each request message in GNUnet carries the information of the nodes traversed by the message using a Bloom filter. When a node needs to decide the next hop to forward a message to, the Bloom filter carried in the message is used to exclude the nodes that have seen the message before. This approach has false positives but no false negatives, which guarantees the prevention of routing loops.
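The loop-prevention idea can be sketched with a small Bloom filter. The sizes, the hashing scheme, and the names `BloomFilter` and `eligible_next_hops` are assumptions chosen for illustration; GNUnet's actual filter parameters are not given in the text.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def eligible_next_hops(neighbors, visited):
    """Exclude every neighbor that the filter reports as already visited."""
    return [n for n in neighbors if n not in visited]
```

A node that was added is always reported as visited (no false negatives), so it can never be chosen again. A false positive only removes a legitimate candidate, which costs routing choices but cannot create a loop.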

GNUnet utilizes a credit-based economic system for message routing. The goal is to prevent denial of service attacks by limiting the resources available to an attacker. Simply speaking, the system rewards reciprocal behavior. The more queries a node n answers for its neighbor node m, the more likely node m will forward and respond to node n’s queries. In an extreme case, node m may not forward any messages from node n. It is much more difficult for an attacker to flood its neighbor nodes because he or she has to respond to a certain number of queries from them. In addition, GNUnet employs a special mechanism, known as shortcut [16], to maintain load balancing and reduce network latency.

After a node receives a query message from a neighbor node, it chooses to either indirect or forward the query if it does not possess the query results. Indirecting a message means overwriting the return IP address with the node's own address. Forwarding does not change the return address, which remains that of the upstream node. Without return address rewriting, the query response will be sent to the preceding node directly, which leads to a shorter response time. A node decides to indirect or forward a query depending on its current load and bandwidth. If a node is idle, it will indirect queries; if a node is busy, it will forward queries; if a node is very busy, it may discard queries. Figure 2.4 illustrates the concepts of indirecting and forwarding.

2.2.3 OneSwarm

Like Freenet and GNUnet, OneSwarm is another peer-to-peer anonymous content sharing system ([42] and [15]). Each node provides a portion of its hard disk for the network storage space. The design goal of OneSwarm is to allow a user to share contents with others with precise control over the level of privacy. Unlike Freenet and GNUnet, OneSwarm allows a node to connect to both strangers and friends simultaneously. A node can share its contents publicly with a trusted friend and anonymously with a complete stranger. This provides flexibility and better downloading performance for users compared to Freenet and GNUnet.

Figure 2.4: Indirecting and forwarding

In OneSwarm, nodes form an unstructured, random network topology. A user can connect to a trusted friend by manually entering his or her IP address. If a user has few or no friends who are willing to connect to him, then he can connect to random nodes through a rendezvous service or community servers. A community server contains a list of nodes so that a user can bootstrap onto the network through a set of random untrusted peers.

In order to improve the content lookup performance, it is better to find the shortest path from origin to destination. For this reason, and because of the random network topology, a search message is flooded by a node to its neighbors. In order to prevent a search message from being flooded more than once at a node, each node maintains a set of rotating Bloom filters to keep track of the search messages that have been recently flooded by the node. When a previously seen search message arrives at a node, the message will not be further forwarded and no response message will be returned to the upstream node from which the message came. A node will forward a query message to its untrusted peers with some probability rather than always. The rationale is to prevent collusion attacks. If a node possesses the file that matches a content search message, it responds with a reply message. The response will propagate back to the originator hop by hop. Among trusted peers, the response is transferred immediately. However, among untrusted peers, the response is delayed to emulate the delay of a longer path. This prevents an attacker from inferring the distance between himself or herself and the query responder, which in turn preserves the anonymity of the content provider.

In short, OneSwarm utilizes some randomness along with a mix of trusted and untrusted peers, in order to provide a balance between good system performance and strong user anonymity.

CHAPTER 3

TRACEBACK ATTACK ON FREENET

3.1 Introduction

Freenet has undergone more than a decade of active development and deployment, and is widely used by privacy-conscious users for sharing files [7]. The high-level security mechanisms adopted by Freenet, such as hop-by-hop message forwarding and address rewriting, are time-proven means to support user anonymity; in addition, the cryptographic algorithms used in Freenet, such as hash algorithm, symmetric and asymmetric key algorithms, are all well-established. However, we note that various finer-grained design and development decisions of Freenet have not been thoroughly investigated, and it remains unanswered how well the anonymity objective of the original Freenet design has been met.

A number of watermarking-based traceback attacks (see, for example, [14, 18, 33]) have been developed on low-latency peer to peer anonymous networks such as Tor [13, 6], which aim to support anonymous communication services for interactive applications. In such low-latency anonymous networks, the message forwarding delay budget at each node is limited. Consequently, watermarking-based traceback attacks can be successfully carried out on such networks. In contrast to low-latency anonymous networks, anonymous content sharing systems such as Freenet do not have much constraint on the message forwarding delay budget. Any traffic patterns that may be embedded in messages of such networks can be easily destroyed. As a result, existing watermarking-based traceback attacks on low-latency anonymous networks will not work well on anonymous content sharing systems such as Freenet.

We explore a few fine-grained design and development decisions made in Freenet and develop a traceback attack on Freenet. In particular, we show that the originating machine of a content request message can be identified. That is, the anonymity of a content retriever can be broken in Freenet, even if only a single request message has been issued from the corresponding machine. In developing this traceback attack, we exploit a few design and development features in the Freenet system, including methods to prevent routing loops of content request messages, the handling of various messages in Freenet, and mechanisms for a Freenet node to populate and update its routing table [27].

In the developed traceback attack, an attacker will deploy a number of monitoring nodes in Freenet to passively observe content request messages passing through the nodes. Once an interested request message (based on routing key) is observed, the attacker will iteratively connect to the neighbors of a node n that has seen (either forwarded or initiated) the interested request message, and query these neighbors to determine other nodes that have seen the message. After all nodes that have seen the message have been identified, the originating machine of the message can be determined if the message forwarding path satisfies certain conditions.

In this chapter, we will present the details of the developed traceback attack on Freenet, and perform both Emulab-based experimental studies and Thynix-based simulation studies to investigate the feasibility and effectiveness of the traceback attack [1, 5]. The Emulab-based experiments are carried out using the source code of Freenet 0.7 (the current version of Freenet), extended to support the traceback attack. With randomly chosen nodes initiating content requests to random files stored in the Emulab-based Freenet testbed, our experimental studies show that, for 24% to 43% of content request messages, we can identify their originating machines. Similarly, the Thynix-based simulation studies also confirm the effectiveness of the developed traceback attack. We can identify the originating machines of 42.9% to 49.2% of content request messages in the simulation studies. For the rest of the content request messages, for which we cannot uniquely determine the originating machines, we are able to identify all the nodes that have either initiated or forwarded a content request message.

Furthermore, we briefly explore a few potential countermeasures to address the developed traceback attack, and provide a simple yet powerful insight into the design and development of peer to peer anonymous networks so that similar traceback attacks can be effectively mitigated. By attacking and providing proper security countermeasures on Freenet, we hope to enhance the anonymity strength of Freenet, and improve the user confidence in this anonymous content sharing system.

We note that although the traceback attack and the solutions are developed specifically on Freenet, the basic principles of the traceback attack and the solutions have important security implications for the design and development of similar peer to peer, anonymous content sharing systems.

The remainder of the chapter is organized as follows. We present the traceback attack on Freenet in Section 3.2, and perform experimental studies in Section 3.3.

3.2 Traceback Attack on Freenet

In this section we will present the design of the traceback attack on Freenet. The traceback attack has two important components: connecting an attack node to a suspect node in Freenet, and querying a neighbor to determine if it has seen a content request message with a particular UID value. In the following we will first describe the two important components of the traceback attack, and then we will describe the traceback process to identify all the nodes that have seen (either initiated or forwarded) a content request message, and the difficulties and opportunities in identifying the originating machine of a content request message. Towards the end of this section, we present techniques to identify the originating machine of a content request message when the forwarding path satisfies certain conditions.

3.2.1 Connecting to a Freenet Node

As one of the important steps in the traceback attack, an attacker needs to connect an attack node a to a suspect Freenet node n so that a and n become neighbors of each other. The assumption is that the attacker knows the location of the suspect node n. We have developed an effective method to carry out this task and the details are reported in [27]. In the following we provide a brief overview of the method. The key insight of the method is to exploit the neighbor addition and replacement approach adopted by nodes in Freenet.

In Freenet, each node can have a pre-specified maximum number of neighbors (40 by default). When an announcement message arrives at a node and the node does not have enough neighbors yet, the requesting node will be automatically accepted as a new neighbor. Otherwise (the node already has the maximum number of neighbors), the node will check a key condition to determine if an existing neighbor can be replaced by the requesting node. Neighbors at a node are classified into a few pre-defined categories, depending on how they get connected to the node. For example, one category is the set of neighbors that are connected to the node via announcement messages.

The key condition to determine if the node should perform a neighbor replacement operation is whether any of the neighbor categories has successfully served at least a pre-configured minimum number of content requests. The exact intention of this condition is not explicitly stated in the Freenet design (or its source code); we can only speculate that this condition is used to make sure that the node has accumulated enough knowledge of the neighbors' capacity in serving content requests. When this condition is satisfied, the least recently used (LRU) neighbor of the node will be replaced by the requesting (attack) node, regardless of the category of the neighbor. We note that this condition can be easily satisfied at “busy” nodes, which forward a large number of requests and replies. The default minimum number of requests to be successfully served for performing neighbor replacement is 10.
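The neighbor addition and replacement rule described above can be summarized in a short sketch. The class and field names are hypothetical and the model is deliberately simplified (for instance, the neighbor list is assumed to be kept in least-recently-used order); only the defaults of 40 neighbors and 10 served requests come from the text.

```python
from dataclasses import dataclass, field

MAX_NEIGHBORS = 40        # default stated in the text
MIN_SERVED_REQUESTS = 10  # default stated in the text


@dataclass
class Node:
    # Neighbors ordered from least to most recently used (modeling assumption).
    neighbors: list = field(default_factory=list)
    # Successful content requests served, per neighbor category.
    served_by_category: dict = field(default_factory=dict)

    def handle_announcement(self, requester,
                            max_neighbors=MAX_NEIGHBORS,
                            min_served=MIN_SERVED_REQUESTS):
        """Return True if the requester becomes a neighbor."""
        if len(self.neighbors) < max_neighbors:
            self.neighbors.append(requester)
            return True
        if any(c >= min_served for c in self.served_by_category.values()):
            self.neighbors.pop(0)  # drop the LRU neighbor, whatever its category
            self.neighbors.append(requester)
            return True
        return False
```

In this model, an attack node is rejected by a full node until some category counter reaches the threshold, which is the condition that the attacker tries to enforce.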

If the condition is not currently satisfied at the suspect node n, we can repeatedly perform file insertion and retrieval operations to enforce this condition. Given that we know the location of the suspect node n, we can insert files with routing keys surrounding its location. Given that the routing key of a file is the SHA-256 hash of the file, a large number of files can be pre-composed so that the location range [0, 1] can be reasonably covered by their routing keys. Note that, due to the nature of hash functions, we do not need sophisticated file structure or content in order to have a reasonable coverage of the complete location range. (As a matter of fact, each file used for this purpose in the experimental studies in Section 3.3 consists of a one-line text string, and we only slightly change the text string in different files in order to obtain a totally different routing key.)
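The pre-composition of files can be sketched as a brute-force search over one-line strings. The way a hash is mapped to a location here (the first 8 bytes of the digest read as a fraction) and the circular distance are assumptions made for illustration, and the function names are hypothetical; the text only states that the routing key is the SHA-256 hash of the file and that locations lie in [0, 1].

```python
import hashlib


def location_of(content: bytes) -> float:
    """Map content to a location in [0, 1] via SHA-256 (assumed mapping)."""
    digest = hashlib.sha256(content).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def circular_distance(a, b):
    """Distance on a circular location space (assumption)."""
    d = abs(a - b)
    return min(d, 1.0 - d)


def files_near(target, radius, count, max_tries=1_000_000):
    """Search one-line text files whose location is within radius of target."""
    found = []
    for i in range(max_tries):
        content = f"probe file {i}\n".encode()
        if circular_distance(location_of(content), target) <= radius:
            found.append(content)
            if len(found) == count:
                break
    return found
```

Since hash outputs are spread roughly uniformly, each candidate lands within the radius with probability of about twice the radius, so a modest number of trials is enough to find files close to any suspect location.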

In order to enforce the neighbor replacement condition at the suspect node n, we choose the files with routing keys that are close to the suspect node and insert them into Freenet. Through this file insertion operation, we know a number of files that are located close to the suspect node. We then request the inserted files from a different attack node. After we have successfully retrieved the files a number of times exceeding the minimum threshold, we will announce a node into Freenet with a location that is close to the suspect node. If the new node becomes a neighbor of the suspect node, we are done. Otherwise, we will repeat this process until the new node becomes a neighbor of the suspect node.

3.2.2 Querying a Neighbor

Another important component of the developed traceback attack is to determine if a neighbor has seen a message with a particular UID. Recall that each Freenet node maintains a list of UIDs associated with request messages that the node has not finished processing (the corresponding reply has not come back), and a queue of UIDs associated with request messages that the node has finished processing (the corresponding reply has come back). For simplicity, we refer to both as the set of UIDs maintained by the node. In order to determine if a neighbor has seen a content request message with a UID value, we can send a request message with the same UID value.

A key requirement of sending this request message to a neighbor is that the message should not be forwarded any further by the neighbor. Should this occur, this (forged) request message may pollute Freenet in terms of the nodes that have seen the UID value. More specifically, let N denote the set of nodes that have initiated or forwarded an interested content request message with a particular UID value. If the forged request message is forwarded beyond the intended neighbor, nodes that have not previously seen the interested content request message will now see the corresponding UID value. Our traceback attack algorithm may falsely identify these nodes as members of N, and the result of the traceback attack could be wrong.

Our first attempt was to send a content request message to a neighbor with the desired UID value, but with the initial value of HTL set to 1. However, it turns out that this cannot prevent the content request message from being further forwarded by the neighbor. As we have discussed in Section 2.2.1, with a configured probability, the value of HTL will not be decreased when it has already reached 1, and the corresponding content request message will be further forwarded (a message is only discarded when HTL reaches 0 or it cannot be forwarded due to routing issues). Due to this issue, instead of sending a forged content request message, we send a probe message with the desired UID value to a neighbor.

The trick to prevent this probe message from being further forwarded by the neighbor is to select an invalid destination location value outside the range [0, 1]. Recall that different request messages, including both content request messages and probe messages, share the same data structures maintained by a node to record recently observed UIDs. Moreover, a probe message carrying an invalid destination location value will be discarded by the receiving node. Combining these two features, we know that, when a neighbor receives a probe message constructed in this way, it will return a reject with loop message if the neighbor has seen a message with the UID value previously. More importantly, regardless of whether the neighbor has seen the UID value before, it will not further forward this probe message, so that no other nodes on Freenet will be polluted by this forged probe message.
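The querying step can be modeled with a toy node that checks the UID before it validates the destination location. The class `SimNode`, the reply constants, and the choice of -1.0 as the invalid location are hypothetical. The sketch also assumes that a node does not record the UID of a discarded probe, which the text does not specify.

```python
REJECT_LOOP = "reject-with-loop"
DISCARDED = "discarded"
FORWARDED = "forwarded"


class SimNode:
    """Toy model of how a node reacts to a probe message."""

    def __init__(self, seen_uids=()):
        self.seen_uids = set(seen_uids)

    def handle_probe(self, uid, location):
        # The UID check comes first, so a known UID is always reported.
        if uid in self.seen_uids:
            return REJECT_LOOP
        # Invalid destination: drop the probe without forwarding it.
        # Assumption: the UID of a discarded probe is not recorded.
        if not 0.0 <= location <= 1.0:
            return DISCARDED
        self.seen_uids.add(uid)
        return FORWARDED


def has_seen(node, uid):
    """Query a neighbor with a probe carrying an invalid location."""
    return node.handle_probe(uid, location=-1.0) == REJECT_LOOP
```

In this model, `has_seen` never causes a probe to be forwarded, so repeated queries leave the set of nodes that know the UID unchanged.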

Figure 3.1: Illustration of the traceback attack.

Figure 3.2: Case 1: nj forwarding request to nk−1.

Figure 3.3: Case 2: nk−1 forwarding request to nj, but backtracked from nj.

Figure 3.4: Case 3: No message forwarding between nj and nk−1.

3.2.3 Identifying All Nodes Seeing A Content Request Message

In this subsection we present the details of the traceback process to identify all nodes that have seen a content request message on Freenet (see Figure 3.1). After all such nodes have been identified, in the next subsection we present techniques to identify the originating machine of a message. An attacker will deploy a number of monitoring nodes in Freenet, with each maintaining a set of interested routing keys to be monitored (the routing keys are calculated based on the files to be monitored). A monitoring node will passively observe the content request messages passing through the node and try to match their routing keys with the routing keys to be monitored. To improve the chance for the attacker to catch an interested request message on Freenet, the monitoring nodes should be spread over the location space of [0, 1].

When an interested content request message is identified, a few pieces of information will be forwarded to an attack node, including the content request message itself and the set of neighboring nodes to determine which of them (if any) has seen the corresponding UID value. We note that we already know the upstream node nk from which the request message comes at the monitoring node. In this case, the neighbors of nk will be sent to an attack node (instead of neighbors of the monitoring node). Note that we have the neighbor information of nk at the monitoring node, due to the two-hop routing scheme of Freenet. To ease exposition, we will refer to the set of neighbor nodes forwarded to the attack node (along with the content request message) as the suspect nodes.

Note that the set of suspect nodes will not include the downstream node along the forwarding path of the request message, which we know has seen the corresponding UID before. In the initial step of the traceback attack, the downstream node is the monitoring node. For example, as shown in Figure 3.1, the monitoring node is a neighbor of nk, and is the downstream node of nk along the forwarding path of the message. We do not need to include it as a suspect node. In the later steps of the traceback attack, the downstream node is the neighbor from which we are tracing back to the current node. For example, as shown in Figure 3.1, assuming that we have traced back from nk to nk−1, when we try to determine if the neighbors of nk−1 have seen the corresponding UID before, we will not include nk, the downstream node of nk−1 along the forwarding path of the message, as one of the suspect nodes.

When an attack node a receives the information, it will try to determine one by one if any of the suspect nodes has seen the corresponding UID value (i.e., the content request message) by utilizing the two components that we have discussed above. In particular, for each suspect node n, the attack node a will first connect to the node (Section 3.2.1), and then it will send a probe message with the corresponding UID value to the node to determine if the suspect node has seen the UID before (Section 3.2.2). Conceptually, we can consider that the attack node maintains a queue of the suspect nodes, and each time it removes one suspect node from the queue to determine if the suspect node has seen a particular UID value.

If a suspect node nk−1 has seen the UID value before, the neighbors of nk−1 will be added into the queue, and the traceback process continues (by removing the next suspect node from the queue). Note that, given that the attack node is a neighbor of nk−1, we have the neighbor information of nk−1, due to the two-hop routing scheme of Freenet. When the queue becomes empty, the complete traceback process to identify all nodes that have seen the corresponding content request message is finished.
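The queue-based traceback process can be sketched as a breadth-first search. The function name and its parameters are hypothetical: `neighbors_of` stands in for the neighbor information obtained through two-hop routing, and `has_seen` stands in for connecting to a suspect node and probing it (Sections 3.2.1 and 3.2.2).

```python
from collections import deque


def trace_back(upstream, monitor, neighbors_of, has_seen):
    """Return the set of nodes found to have seen the UID.

    upstream:     node n_k that forwarded the request to the monitoring node
    monitor:      the attacker's monitoring node (known downstream, not queried)
    neighbors_of: callable, node -> iterable of its neighbors
    has_seen:     callable, node -> bool (connect to the node and probe it)
    """
    seen_nodes = {upstream}
    queued = {upstream, monitor}  # nodes that never need to be (re)queued
    queue = deque()
    for n in neighbors_of(upstream):
        if n not in queued:
            queued.add(n)
            queue.append(n)
    while queue:
        suspect = queue.popleft()
        if has_seen(suspect):
            seen_nodes.add(suspect)
            for n in neighbors_of(suspect):
                if n not in queued:
                    queued.add(n)
                    queue.append(n)
    return seen_nodes
```

Each suspect is queried at most once in this sketch, and the neighbors of a suspect are expanded only when the suspect reports that it has seen the UID, so the search stays within the neighborhood of the nodes that handled the request.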

3.2.4 Difficulties in Identifying Originating Machine

In order to understand the difficulties in tracing back a general content request message to its originating machine, we consider two different traceback situations. In the first case, at each step of the traceback process, there is only one suspect node that has seen the concerned UID value before; in the second case, multiple suspect nodes have seen the UID value before. We refer to the traceback paths in the first case as linear reverse paths, and those in the second case as non-linear reverse paths. Note that a reverse path associated with a content request message is the traceback path starting at a monitoring node (back towards the origin), which is different from the forwarding path that the message takes from the origin towards the destination (up to the monitoring node for traceback purposes).
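The distinction between linear and non-linear reverse paths can be expressed as a short check. The function name and parameters are hypothetical, and `has_seen` again stands in for the probe-based query of Section 3.2.2.

```python
def linear_reverse_path(upstream, monitor, neighbors_of, has_seen):
    """Return the reverse path [n_m, ..., n_0] if it is linear, else None.

    At every step the suspects are the neighbors of the current node,
    excluding the downstream node that we traced back from.
    """
    path = [monitor, upstream]
    while True:
        current, downstream = path[-1], path[-2]
        hits = [n for n in neighbors_of(current)
                if n != downstream and has_seen(n)]
        if not hits:
            return path      # no suspect has seen the UID: path ends here
        if len(hits) > 1:
            return None      # several suspects have seen it: non-linear
        if hits[0] in path:
            return None      # safety guard against revisiting a node
        path.append(hits[0])
```

The last node of a returned path is the node whose suspects have all not seen the UID. Whether that node is the originating machine still depends on whether a backtrack occurred along the path.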

In Freenet, a message forwarding path will be linear if the message does not backtrack along the forwarding path (see Section 2.2.1). However, as we will show later, a linear forwarding path does not always result in a linear reverse path, which makes it important for us to make the distinction between forwarding paths and reverse paths. In the following we will illustrate the difficulties in identifying the originating machine of a message along a non-linear reverse path. More specifically, we consider the situation where, during the traceback process of a message, we trace back from a node nk−1 and query the corresponding suspect nodes (i.e., neighbors of nk−1, excluding the downstream neighbor node nk, from which we trace back to node nk−1; see Figure 3.1), and more than one suspect node has seen the interested UID value.

We note that when we query if a suspect node has seen a UID value, we cannot determine the time when the corresponding content request message was received, nor can we determine the direction of the message forwarding. What we can obtain is only whether the node has seen the UID value. This makes it hard to determine which of the suspect nodes is the upstream node of nk−1 along the forwarding path of the request message, when more than one of them have seen the corresponding UID value.

To illustrate the difficulties in determining the upstream node of nk−1 in this case, we show three possible forwarding situations in Figures 3.2, 3.3, and 3.4. We note that they are not all the possible cases, but rather a few representative examples. A neighbor nj of node nk−1 may have seen a particular UID value because it forwards the corresponding request message to nk−1 (see Figure 3.2). However, as shown in Figure 3.3, it is also possible that node nk−1 forwards the request to nj, but then the message is backtracked from nj to nk−1, because the message cannot be further forwarded. (It is possible that the message has been further forwarded by node nj to other nodes, before the backtrack from nj to nk−1 occurs.)

Moreover, as shown in Figure 3.4, it is also possible that the two neighbors nj and nk−1 have no direct interaction regarding the forwarding of the message, although both of them have seen the UID value. In this case, node nj did not directly forward the message to nk−1, and node nk−1 also did not directly forward the message to node nj. As shown in the figure, node nj receives the message with the corresponding UID value but forwards the message to a different node (rather than node nk−1), and similarly, node nk−1 receives the message from a different node (instead of node nj). When this occurs, we note that the traceback process will observe a non-linear reverse path, even if the forwarding path is linear.

Without the information of message forwarding time or direction, in general it is hard for us to distinguish different cases, and uniquely identify the upstream node at a node nk−1, when multiple suspect nodes have seen a particular UID value. Consequently, we will aim to identify the originating machine of a content request message only if the reverse path is linear. However, we note that, identifying the originating machine of a request message associated with a linear reverse path is not trivial, and we cannot always successfully determine the originating machine of a message in this case. The key challenge is that, when we trace back along a linear reverse path from nk−1 to a single suspect node nj, we still need to determine which of the two cases presented in Figures 3.2 and 3.3 is true, that is, to determine if a backtrack has occurred. Note that the case presented in Figure 3.4 will not occur on a linear reverse path; otherwise, the reverse path will not be linear.

In the next subsection we will present a few techniques to identify the conditions under which we can uniquely determine the originating machine of a message associated with a linear reverse path, by exploiting the routing policy of Freenet. We point out that such conditions can be applied to the traceback of certain messages associated with non-linear reverse paths. However, in this work we will only apply them to messages associated with linear reverse paths for two reasons. First, although they can be applied to messages associated with non-linear reverse paths, the complexity of determining the originating machines of such messages will be much higher than that for messages associated with a linear reverse path. Moreover, the rate of successfully determining the originating machine of such messages can be potentially lower than that for messages associated with a linear reverse path. Second, the experimental studies in Section 3.3 based on realistic Freenet testbeds show that it is not uncommon for a message to be associated with a linear reverse path on Freenet. This phenomenon is likely caused by the interplay of a number of factors in Freenet, including the semi-structured network topology of Freenet, strong connectivity among Freenet nodes and two-hop routing lookup, and a reasonably large HTL value.

The semi-structured network topology ensures that a content request message will be forwarded towards the destination node (where the content is stored) rather quickly, instead of being forwarded as a random walk. The strong connectivity among Freenet nodes and the two-hop routing lookup ensure that there is likely a path to any destination from any originating machine of a content request message. A reasonably large HTL value, coupled with the other factors above, makes it unlikely for a content request message to backtrack. All these factors help to produce linear forwarding paths of request messages in Freenet. We note that a linear forwarding path will result in a linear reverse path if the situation presented in Figure 3.4 does not occur, and a high percentage of linear forwarding paths will in general imply a high percentage of linear reverse paths, which helps to identify the originating machine of the corresponding content request message. In addition, we would like to emphasize that, even if only a small number of content request messages can be traced back, it still presents a significant security threat to users of Freenet.

3.2.5 Identifying Originating Machine

In this subsection we present techniques to identify the conditions under which we can uniquely determine the originating machine of a content request message associated with a linear reverse path (unless otherwise specified, all messages considered in this subsection are associated with a linear reverse path).

Recall that, as we have discussed in Section 2.2.1, a Freenet node n will choose the next closest neighbor to forward a message to based on the distance between the routing key and the location of neighbors (and their neighbors). Consequently, it is possible for us to determine the forwarding direction of a message by exploiting the routing policy of Freenet. We first define a few notations.

Consider the forwarding path of a message. We let n → h denote that the message is forwarded from node n to node h, and n ⇆ h denote that the message is forwarded from node h to node n, and then backtracked from n to h (it is possible that the message has been further forwarded by n to other nodes, before being backtracked to node h).

Similarly, consider the reverse path associated with a message. We let n ← h denote that we trace back the message from node h to node n. In addition, we let n0 ← n1 ← ... ← nk−1 ← nk ← ... ← nm denote the complete reverse path, where nm is the attacker’s monitoring node, and n0 is the last node along the reverse path, that is, the node none of whose suspect nodes have seen the concerned UID value. Furthermore, we let d(n) denote the distance from the node n to the destination implied by the routing key of the message. For the convenience of discussion, we define the length of a (linear) path as the number of nodes on the path. In the following we establish the conditions under which we can uniquely identify the originating machine of a message, through a series of lemmas.

First, we consider a trivial case where the length of the reverse path of a message is two, i.e., n0 ← nm. In this case, it is easy for us to see that n0 is the originating machine of the message. We state this fact in the following lemma.

Lemma 1 (C1: Path with length of two). Given a linear reverse path n0 ← nm, which is of length two, n0 is the originating machine of the message.

In the following we focus on linear reverse paths that are longer than two.

Lemma 2. Given a linear reverse path n0 ← ... ← nk−1 ← nk ← nk+1 ← ... ← nm, a backtrack can be initiated at most once during the forwarding of the message, along the corresponding forwarding path.

We prove this lemma by contradiction. Assume two (or more) instances of backtracks have occurred along the forwarding path of the message. We consider two cases. In the first case, the two instances of backtracks occur at the same node, say node nk. We first note that, given that we are tracing back from node nk+1 to node nk, node nk+1 is not part of the backtracks, which have occurred before we start the traceback process. Moreover, the two instances of backtracks must involve two different neighbors of nk. As a consequence, two suspect nodes of nk should have seen the corresponding UID value, and therefore, the reverse path cannot be linear, which contradicts the assumption that the reverse path is linear.

In the second case, the two instances of backtracks occur at two different nodes, say nodes nj and nk. Without loss of generality, we assume that nj is ahead of node nk along the forwarding path, that is, node nj forwards the message to node nk (possibly passing through a few other nodes). Therefore, there is a forwarding path nj → ... → nk−1 → nk. Note that, the instance of backtrack initiated by node nk must use a neighbor different from nk−1 as the next hop, given that nk−1 is the upstream node of nk. Therefore, similarly, two suspect nodes of nk should have seen the corresponding UID value, which again contradicts the assumption that the reverse path is linear.

Lemma 3 (Neighbor preference condition at a node). Given a linear reverse path n0 ← ... ← nk−1 ← nk ← nk+1 ← ... ← nm, consider an arbitrary node nk between n1 and nm−1 (inclusive). Node nk cannot have initiated a backtrack instance to node nk−1 if the condition d(nk+1) < d(nk−1) holds.

We prove this lemma by contradiction. Assume that node nk has indeed initiated a backtrack instance to node nk−1. From Lemma 2 we know that no other node on the corresponding forwarding path has started another backtrack. In particular, we know that nk → nk+1 and we know that this message forwarding must occur after nk−1 ⇆ nk, given that we are tracing back from nk+1 to nk. Put another way, when node nk first decided the next hop to forward the message to, it selected node nk−1, and then selected nk+1 after the message was backtracked from nk−1. However, this contradicts the assumption that d(nk+1) < d(nk−1).

Lemma 4 (C2: Neighbor preference along path). Given a linear reverse path n0 ← ... ← nk−1 ← nk ← nk+1 ← ... ← nm, n0 is the originating machine of the message if, for every node nk between n1 and nm−1 (inclusive), the condition d(nk+1) < d(nk−1) holds.

Based on Lemma 3, we know that no nodes between n1 and nm−1 (inclusive) have initiated an instance of backtrack. Therefore, n0 must be the originating machine of the message.

We comment that C2 (Lemma 4) does not require d(n) to be monotonically decreasing along the forwarding path of the message. It only requires the condition to hold for each pair of nodes that are two hops apart along the path (that is, nk−1 and nk+1). As an example, consider a simple reverse path 0.76 ← 0.87 ← 0.10 ← 0.05 (for destination 0.06). For simplicity, let us only use one-hop routing (instead of two-hop routing), that is, the distance from a node to a destination is only based on the location of the node (instead of locations of the node and its neighbors). It is easy to verify that the distance to the destination does not monotonically decrease along the corresponding forwarding path. However, it satisfies C2, and we can determine that 0.76 is the originating machine of the corresponding message.
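The check on this example can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiments: the function names are hypothetical, and the distance is assumed to be the plain absolute difference between a node location and the destination, chosen to match the one-hop simplification of the example (a circular keyspace distance would give different values).

```python
def distance(location, destination):
    # Assumption: plain absolute difference, as in the simplified example.
    return abs(location - destination)


def satisfies_c2(reverse_path, destination):
    """reverse_path lists node locations in the order n0, n1, ..., nm."""
    d = [distance(loc, destination) for loc in reverse_path]
    # C2: d(n_{k+1}) < d(n_{k-1}) for every k between 1 and m-1.
    # For a path of length two the range is empty (C1 applies there instead).
    return all(d[k + 1] < d[k - 1] for k in range(1, len(d) - 1))


def is_monotonically_decreasing(reverse_path, destination):
    d = [distance(loc, destination) for loc in reverse_path]
    return all(d[k + 1] < d[k] for k in range(len(d) - 1))


path = [0.76, 0.87, 0.10, 0.05]
print(satisfies_c2(path, 0.06))                 # True
print(is_monotonically_decreasing(path, 0.06))  # False
```

Under this distance, the step from 0.76 to 0.87 moves away from the destination, yet both comparisons required by C2 still hold.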

Lemma 5 (C3: Neighbor preference at n1). Given a linear reverse path n0 ← n1 ← ... ← nk−1 ← nk ← nk+1 ← ... ← nm, n0 is the originating machine of the message if there exists at least one neighbor n of n1 that has not seen the UID value, but the condition d(n) < d(n0) holds.

We prove this lemma by contradiction. Assume that n0 is not the originating machine of the message, that is, n0 → n1 does not hold. Then we must have n0 ⇆ n1, that is, the message was sent from n1 to n0 but backtracked to n1. The instance of backtrack may have been initiated by n0 or by some other node along the path. However, both cases can be proved similarly and we do not make the distinction. Importantly, note that when n1 determined the next hop to forward the message, it selected n0 over n (which has not seen the UID value), which contradicts the assumption that d(n) < d(n0).

We will use C1, C2, and C3 (Lemmas 1, 4, and 5) to identify the originating machine of a content request message. Given a linear reverse path identified by the traceback process presented in the last subsection, we check whether any of C1, C2, or C3 is satisfied.
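The combined check can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the scripts used in the attack: `identifies_origin`, its parameters, and `unseen_neighbors_of_n1` (the neighbors of n1 that report not having seen the UID) are hypothetical names, `d` is assumed to be a caller-supplied distance function, and the C3 inequality d(n) < d(n0) is the one used in the proof of Lemma 5.

```python
def identifies_origin(reverse_path, d, unseen_neighbors_of_n1=()):
    """Return the first condition ("C1", "C2", "C3") that holds, else None.

    reverse_path: nodes [n0, n1, ..., nm] of a linear reverse path.
    d: function mapping a node to its distance to the destination (assumed given).
    unseen_neighbors_of_n1: neighbors of n1 that have not seen the UID.
    """
    if len(reverse_path) < 2:
        return None
    if len(reverse_path) == 2:
        return "C1"  # path of length two
    m = len(reverse_path) - 1
    # C2: d(n_{k+1}) < d(n_{k-1}) for every k between 1 and m-1.
    if all(d(reverse_path[k + 1]) < d(reverse_path[k - 1]) for k in range(1, m)):
        return "C2"
    # C3: some neighbor of n1 has not seen the UID but is closer than n0.
    n0 = reverse_path[0]
    if any(d(n) < d(n0) for n in unseen_neighbors_of_n1):
        return "C3"
    return None


# Hypothetical distances to the destination, for illustration only.
near = {"a": 0.30, "b": 0.10, "c": 0.20, "x": 0.05}.get
far = {"a": 0.15, "b": 0.10, "c": 0.20, "x": 0.05}.get

print(identifies_origin(["a", "b"], near))             # C1
print(identifies_origin(["a", "b", "c"], near))        # C2
print(identifies_origin(["a", "b", "c"], far, ["x"]))  # C3
print(identifies_origin(["a", "b", "c"], far))         # None
```

The function returns only the first matching rule; a path may satisfy both C2 and C3, so a tally like the one reported later would test each rule separately.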

3.3 Performance Evaluation

In this section we perform both Emulab-based experimental studies and Thynix-based simulation studies to evaluate the feasibility and effectiveness of the developed traceback attack. In the following we will first describe the setup and results of the experimental studies, and then we will briefly present the setup and results of the simulation studies.

3.3.1 Experimental Studies

Setup. We carry out the experimental studies using the Emulab testbed [1], and Freenet 0.7. We extend the source code of Freenet 0.7 to add the functionalities to support the traceback attack. A number of bash scripts have also been written to largely automate the traceback attack.

The Freenet networks we used in the experimental studies consist of 70 nodes. 4 out of the 70 nodes are seed nodes, through which other nodes can get connected to the Freenet testbed. The set of 70 nodes in each Freenet network does not include the attack nodes (see Figure 4.1), which are not connected to the network before an attack starts. We use a set of 5 additional nodes as attack nodes (theoretically, one attack node is sufficient to carry out the attack). We perform 3 sets of experiments, each consisting of 100 experimental runs. Each set of experiments uses an identical Freenet topology, which is randomly constructed as follows. When we first start a set of experiments, each node will randomly select a location and contact a seed node to join the Freenet. The locations of the seed nodes are also randomly selected by the individual seed nodes, and seed nodes are started before other general nodes are started. After all nodes have joined the Freenet testbed (and therefore the network topology of the Freenet is formed), we then run 100 experimental studies on the Freenet testbed (or simply the Freenet).

The 100 experimental runs in each set are grouped into 10 groups, each consisting of 10 runs. For each group of 10 runs, we insert a random file into the Freenet; for each run in the group, we then randomly choose a node to retrieve the file. To make an experiment meaningful, we ensure that the randomly selected originating machine does not already have the file. After each experimental run, we restore the Freenet to the original state before the file is requested (i.e., we remove the file from all the caches due to this file request, and the data store of the file requester), before we perform the next experimental run, so that it will not be affected by previous experimental runs. After a group of 10 experimental runs, a different file is inserted into the Freenet, and the experiments are repeated.

After a set of 100 experimental runs, we will re-start the Emulab-based Freenet testbed, so that a different Freenet topology will be constructed for the next set of 100 experiments. (Different randomly selected node locations will cause different Freenet topologies.) We repeat this process three times (for the three sets of 100 experimental runs).

In our experiments, each node can have at most 4 neighbors. Without loss of generality, we use the nodes storing a requested file as the monitoring nodes. We note that this maximizes the length of the path that we need to trace back. In addition, a traceback attack is initiated after the monitoring node has sent back the requested file. That is, we rely on the queue of UIDs of the request messages that a node has finished processing to determine whether it has seen a UID value before. We believe that, in real traceback attacks on the public Freenet, requested files should also be returned to help minimize the suspicion of a file requester that it is being traced back, as the requested file comes back as expected. We refer to the three sets of experimental runs as S1, S2, and S3, respectively.

Results. In this subsection we present the results of the Emulab-based experimental studies. First we investigate how well we can determine the originating machine of a content request message. Table 3.1 shows the results. For the three sets of experiments, we can successfully determine the originating machine of 43%, 24%, and 41% of request messages, respectively.

Table 3.1: Results of experimental studies.

Set   Total   Successful (number)   Successful (percentage)
S1    100     43                    43%
S2    100     24                    24%
S3    100     41                    41%

We make two observations from the results. First, the success rate in determining the originating machine of a content request message is reasonably high (ranging from 24% to 43%). As we have discussed in Section 3.2, this is likely caused by a number of factors of Freenet, including semi-structured network topology, strong connectivity among nodes and two-hop routing lookup, and reasonably large HTL value. Second, the success rate varies greatly across the three sets, from 24% to 43%. The specifics of the Freenet network topology, the location of the file to be requested, and the location of the node to initiate a file request will all likely affect the forwarding path of the content request message, and consequently, the chance for the originating machine of the request message to be determined.

Despite the variation in the success rate, we emphasize that, as shown in Table 3.1, the probability of determining the originating machine of a content request message is reasonably high in the performed experiments. In addition, as we have discussed in Section 3.2, for the rest of the content request messages, whose originating machine we cannot determine, we can identify all the machines that have either initiated or forwarded the message, which could be helpful forensic information in some investigative cases.

Table 3.2: Classification of messages successfully traced back.

Set   Total successful   C1   C2   C3   C2 & C3
S1    43                 17   19   24   17
S2    24                 11   4    13   4
S3    41                 25   12   12   8

In order to better understand the results of the experimental studies, in Table 3.2 we show the number of messages whose originating machines are successfully identified by rules C1, C2, and C3, respectively. In the table we also show the number of messages that are successfully traced back by both rules C2 and C3. From the table we can see that, although the originating machines of a large number of messages are determined because of the path length of two (i.e., C1), C2 and C3 are indeed effective in identifying originating machines of messages that traverse a long path before encountering a monitoring node. In particular, it shows that the rule C3 alone is already very effective in helping determine the originating machine of a message.
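As a quick consistency check on Table 3.2, the per-rule counts can be combined by inclusion-exclusion. The sketch below assumes that C1 is disjoint from C2 and C3, since C1 covers paths of length two while C2 and C3 are stated for longer paths; the dictionary layout and names are illustrative only.

```python
# Counts copied from Table 3.2: (total successful, C1, C2, C3, C2 & C3).
table_3_2 = {
    "S1": (43, 17, 19, 24, 17),
    "S2": (24, 11, 4, 13, 4),
    "S3": (41, 25, 12, 12, 8),
}


def union_count(c1, c2, c3, c2_and_c3):
    # Size of the union when C1 is disjoint from C2 and C3 (assumption).
    return c1 + c2 + c3 - c2_and_c3


for name, (total, c1, c2, c3, both) in table_3_2.items():
    c3_only = c3 - both  # identified by C3 but not by C2
    c2_only = c2 - both  # identified by C2 but not by C3
    print(name, union_count(c1, c2, c3, both) == total, c2_only, c3_only)
```

For all three sets the reconstructed total matches the reported one, and the number of messages identified by C3 but not C2 exceeds the number identified by C2 but not C3.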

Table 3.3: Properties of message paths.

        Forwarding              Reverse
Set     Linear   Non-linear     Linear (Failed)   Non-linear
S1      98       2              80 (37)           20
S2      85       15             55 (31)           45
S3      94       6              69 (28)           31

So far we have argued that, due to a number of protocol features of Freenet, it is likely that a message will traverse a linear forwarding path (without backtrack). In Table 3.3 we show the properties of message paths. In particular, we show the number of forwarding paths that are linear and that are non-linear, respectively (columns 2 and 3 in the table). As we have discussed in Section 3.2, we cannot determine the originating machine of a message if the corresponding forwarding path is not linear. As we can see from the table, it is quite common for a forwarding path to be linear in the performed experiments. As we have discussed above, the combination of semi-structured Freenet topology, strong connectivity among nodes and two-hop routing lookup, and reasonably large HTL value will likely cause the forwarding paths of content request messages to be linear, instead of containing backtracked branches (as shown in Figure 3.3).

A linear forwarding path increases the chance for us to determine the originating machine of the corresponding request message. However, note that a linear forwarding path does not guarantee the identification of the originating machine. As shown in Figure 3.4, a linear forwarding path may contain two neighboring nodes that have not directly interacted with each other regarding the forwarding of the request message. This complicates the identification of the originating machine of the message, as it appears to the traceback algorithm that multiple neighbors have seen the same UID value; that is, we have a non-linear reverse path. Whenever we have a non-linear reverse path, the traceback algorithm will not try to determine the originating machine of the message, given the traceback difficulties presented in Section 3.2.

In Table 3.3 we also show the number of reverse paths that are linear and non-linear, respectively. In addition, we show the number of linear reverse paths for which we failed to identify the originating machines of the corresponding messages. We note that the majority of linear forwarding paths indeed result in linear reverse paths. However, a large number of them do not meet any of the conditions C1, C2, or C3, and therefore, we cannot determine the corresponding originating machines.

We did verify that, for all these linear reverse paths that do not meet either C1, C2, or C3, the last node along a traceback path (i.e., n0) is indeed the originating machine of the corresponding message. However, we do not claim that we can successfully identify them, given that in a real-world attack (where we do not have access to the forwarding path), we cannot determine if a backtrack has occurred. On the other hand, we note that this could provide additional forensic information in investigative cases.
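The relationship between the reverse-path columns of Table 3.3 and the success counts of Table 3.1 can be made explicit with a few lines of arithmetic. The sketch below is only a reading aid; the tuple layout and the names are illustrative.

```python
# Counts copied from Table 3.3: (linear reverse paths, of which failed).
table_3_3_reverse = {
    "S1": (80, 37),
    "S2": (55, 31),
    "S3": (69, 28),
}

# Successful identifications copied from Table 3.1.
table_3_1_success = {"S1": 43, "S2": 24, "S3": 41}


def identified(linear_reverse, failed):
    # Linear reverse paths that meet at least one of C1, C2, or C3.
    return linear_reverse - failed


for name, (linear_reverse, failed) in table_3_3_reverse.items():
    ok = identified(linear_reverse, failed)
    share = ok / linear_reverse
    print(name, ok, ok == table_3_1_success[name], round(share, 3))
```

In each set, the linear reverse paths that did not fail account exactly for the successes in Table 3.1, and roughly 44% to 59% of linear reverse paths meet one of the three conditions.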

Figure 3.5 shows the empirical cumulative distribution function (CDF) of the lengths of all linear forwarding paths and all linear reverse paths. From the figure we can see that content request messages in S2 in general traverse a longer forwarding path than those in S1 and S3, which could partially explain the worse performance obtained in S2 (as shown in Table 3.3, S2 also contains more non-linear forwarding paths). However, we caution that forwarding path length is only one factor that will affect the chance for the originating machine of a content request message to be determined. As we have discussed, other factors such as the specifics of Freenet topology will also likely affect the probability to determine the originating machines.

In the figure, we also show the length distribution of linear reverse paths. As we can see from the figure, on average, linear reverse paths are shorter than linear forwarding paths, which indicates that, as the length of a linear forwarding path increases, the chance for the corresponding reverse path to be non-linear also increases; that is, the chance for the case presented in Figure 3.4 increases, which is intuitively reasonable.

Figure 3.5: Length distribution of linear paths. (Cumulative distribution function versus path length; curves S1-forward, S2-forward, S3-forward, S1-reverse, S2-reverse, S3-reverse.)

In order to examine whether path lengths have an impact on the possibility for a path to satisfy the conditions C2 and C3, in Figure 3.6 we show the CDF of the lengths of all linear reverse paths with a length greater than 2, including the ones for which we can successfully determine the originating machines of the corresponding messages (the succeed curves), and the ones for which we cannot (the fail curves). (We exclude all paths of length 2 because they meet condition C1.) As we can see from the figure, path lengths do not have much impact on the possibility for a path to meet conditions C2 or C3. This is intuitively sound because the specifics of the connectivity between neighboring nodes (i.e., the location) along a path should play a larger role in meeting the conditions than the length of the path.

Figure 3.6: Length distribution of linear reverse paths. (Cumulative distribution function versus path length; curves S1-succeed, S2-succeed, S3-succeed, S1-fail, S2-fail, S3-fail.)

3.3.2 Simulation Studies

The number of Freenet nodes in the Emulab-based experimental studies is constrained by the resources we can obtain from the Emulab project. In order to investigate the effectiveness of the developed traceback attack in larger-scale network topologies, in this subsection we perform simulation studies using the Thynix simulator that comes with the Freenet project [5].

Setup. Thynix is a simulator developed to investigate Freenet behaviors including probe routing and path folding. It supports the routing of Freenet request messages in the sense that, given a pair of source and destination nodes in Freenet, it can determine the path that a request message will follow in the Freenet. However, in order to scale to large Freenet network topologies, it does not support functions such as file insertion, storage, or retrieval. For this reason, in all the simulation studies, we will not insert and retrieve files as in the experimental studies. Instead, we collect the simulation traces (in particular, the route between a pair of source and destination nodes) and analyze the traces to determine if the originating machine of a request message can be identified by the traceback attack, should the request message have been initiated by the source node.

We consider three network sizes with 4000, 8000, and 10000 nodes, and two node degrees of 8 and 16. A node degree specifies the maximum number of neighbors that a node can have in a topology. In combination, we have 6 different sets of network topologies, (4000, 8), (8000, 8), (10000, 8), (4000, 16), (8000, 16), (10000, 16), in the format of (network size, node degree). We refer to them as S1 to S6, respectively. For simplicity, we also use S1 to S6 to refer to the six sets of simulation studies we perform using the sets of network topologies, respectively. We note that the current Freenet has about 3000 to 4000 nodes simultaneously online on average. In order to emulate the network topology of the current Freenet, we construct all the network topologies in the simulation studies to have the small-world property [5]. Furthermore, since we focus on the investigation of the effectiveness of the traceback attack, in all the simulation studies we set HTL to a large value (200) so that with a high probability we can always find a path from any source node to any destination node.

In each set of simulation studies, we randomly generate 10 network topologies. For each network topology, we randomly choose 100 pairs of source and destination nodes in the topology and collect the routing information from the source node to the destination node. Put in another way, we emulate 1000 content request messages in each set of simulation studies.

Results. In this subsection we present the results of the Thynix-based simulation studies. Overall, the results of the simulation studies are consistent with those of the Emulab-based experimental studies, and therefore, we will only briefly discuss the results in this subsection.

Table 3.4 shows the overall performance of the traceback attack in the 6 sets of simulation studies; for 42.9% to 49.2% of content request messages, we can successfully determine the originating machine of a message. First we note that the results in the simulation studies vary less than those in the experimental studies (see Table 3.1). This could be related to the fact that the topologies created in the simulation studies are more structurally similar than the ones created in the experimental studies, in the sense that the topologies in the simulation studies are created using an algorithm with the specified network size and degree; in contrast, the ones in the experimental studies are created dynamically based on the arrivals of nodes joining a Freenet. Despite the minor difference between the specific numbers in the results, we emphasize that both experimental studies and simulation studies confirm the effectiveness of the developed traceback attack. In addition, the simulation results also corroborate the soundness of the results in the experimental studies, although the network topologies in the latter contain only a small number of nodes.

Table 3.4: Results of simulation studies.

Set   Total   Successful (number)   Successful (percentage)
S1    1000    432                   43.2%
S2    1000    429                   42.9%
S3    1000    441                   44.1%
S4    1000    472                   47.2%
S5    1000    474                   47.4%
S6    1000    492                   49.2%

Table 3.5: Classification of messages successfully traced back (simulation).

Set   Total successful   C1   C2    C3    C2 & C3
S1    432                2    125   425   120
S2    429                1    100   426   98
S3    441                0    94    435   88
S4    472                2    240   469   239
S5    474                1    206   469   202
S6    492                3    211   488   210

Table 3.5 shows the number of messages that can be traced back by rules C1, C2, and C3, respectively. In addition, we also show the number of messages that can be successfully traced back by both C2 and C3. As we can see from the table, in the simulation studies, we only have a very small number of paths of length two (C1). This is likely related to the large network topologies used in the simulation studies (compared to those of the experimental studies). Furthermore, we can see again that the rule C3 is very effective in helping determine the originating machine of a message. The originating machine of the vast majority of messages can be determined by this rule alone.

Table 3.6: Properties of message paths (simulation).

        Forwarding              Reverse
Set     Linear   Non-linear     Linear (Failed)   Non-linear
S1      1000     0              920 (488)         80
S2      1000     0              915 (486)         85
S3      998      0              898 (457)         100
S4      1000     0              989 (517)         11
S5      1000     0              988 (514)         12
S6      1000     0              994 (502)         6

Table 3.6 shows the properties of message paths. We first note that all forwarding paths are linear (we cannot find a path for two pairs of source and destination nodes in S3, because HTL reaches 0). We caution that this observation could be partially caused by the stronger structure of the network topologies used in the simulation studies compared with that of the Emulab-based experimental studies. Network topologies in the simulation studies are randomly created; however, they are created by an algorithm to follow the small-world property. In contrast, the network topologies used in the experimental studies are created dynamically as nodes arrive at Freenet. The topological structure of a network created in this manner is likely to be between a random network and a small-world network. In addition, the relatively large node degrees also help to produce linear forwarding paths (we note that the maximum node degree in the current Freenet is bigger than the ones used in the simulation studies). The results in the table again confirm that a linear forwarding path will likely result in a linear reverse path, which increases the chance for us to determine the originating machine of a message. As in the experimental studies, we cannot determine the originating machine of a large number of messages, despite the linear reverse path they are associated with. Such paths do not meet any of the rules C1, C2, or C3.

Figure 3.7: Length distribution of linear paths (simulation). (Cumulative distribution function versus path length, logarithmic length axis; curves S1-forward to S6-forward and S1-reverse to S6-reverse.)

Figure 3.7 shows the CDF of the lengths of all the linear forwarding paths and linear reverse paths. Similarly, on average, linear reverse paths are shorter than linear forwarding paths, although the difference is not as great as in the experimental studies (see Figure 3.5).

Figure 3.8 shows the CDF of all the linear reverse paths with a length greater than 2. We can draw a similar conclusion as in Figure 3.6 that path length does not have much impact on the possibility for a path to meet conditions C2 or C3 for a linear reverse path.

3.4 Discussion

We note that the key capability that the traceback attack relies on is being able to query a Freenet node to determine its state regarding the forwarding of a message (or rather the corresponding UID value). That is, by certain means, an attacker can distinguish a node that has forwarded a message from a node that has not forwarded a message. If this distinction can be somehow identified, an attacker can then determine the forwarding of a message, and a traceback attack can be potentially carried out. We therefore believe that, in order to prevent this and other traceback attacks on Freenet (and similar peer-to-peer anonymous networks), special attention should be paid to the design and development of the system so that anyone external to a node is not able to determine the message forwarding state at the node.

Figure 3.8: Length distribution of linear reverse paths (simulation). (Cumulative distribution function versus path length; curves S1-succeed to S6-succeed and S1-fail to S6-fail.)

Following this simple yet powerful observation, a more proper solution to the traceback attack is to change the response of a Freenet node to an incoming request (or probe) message with a UID that has been observed by the node. Instead of sending back a Reject with Loop failure message to inform the requester of this fact, a more general failure message should be sent back, or no message should be sent back, so that the requester cannot determine the message forwarding state of the node. This is the essential idea of one of our solutions to this traceback attack. We will describe our solution in detail in chapter 5.
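The effect of this change can be illustrated with a toy model. The sketch below is not Freenet code: the class `ToyNode`, the reply strings, and the `generic_replies` flag are hypothetical, and the model makes the simplifying assumption that a probe for an unseen UID also ends in some failure reply. It only captures the idea that the reply should not depend on whether the UID has been seen.

```python
class ToyNode:
    """Toy model of a node's reply to a probe carrying a given UID."""

    def __init__(self, generic_replies=False):
        self.seen_uids = set()
        self.generic_replies = generic_replies

    def forward(self, uid):
        # Record that this node has originated or forwarded the message.
        self.seen_uids.add(uid)

    def probe(self, uid):
        if self.generic_replies:
            # Proposed behavior: the same reply whether or not the UID was seen.
            return "GenericFailure"
        # Original behavior: a loop rejection reveals the forwarding state.
        return "RejectedLoop" if uid in self.seen_uids else "OtherFailure"


def distinguishable(make_node, uid=42):
    """True if a prober can tell a node that saw the UID from one that did not."""
    seen, unseen = make_node(), make_node()
    seen.forward(uid)
    return seen.probe(uid) != unseen.probe(uid)


print(distinguishable(lambda: ToyNode(generic_replies=False)))  # True
print(distinguishable(lambda: ToyNode(generic_replies=True)))   # False
```

In this model the prober loses its only signal once the replies are made uniform; sending no reply at all would have the same effect.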

3.5 Related Work

The Freenet project has documented a few potential attacks on Opennet [4], including node harvesting on Freenet, mobile attacker source tracing, and routing table takeover. The traceback attack is a complete, practical, and efficient attack on Freenet to trace back a content request message to its originating machine. Importantly, this work helps to illustrate the security issues in the design and development of Freenet, and to provide insights on how to improve Freenet and similar p2p anonymous networks to proactively defend against similar traceback attacks.

Shortly after the traceback attack was identified on Freenet, the Freenet project developed a quick patch to mitigate the traceback attack, by removing the queue of UIDs associated with the completed request messages [29]. This scheme limits the flexibility of the traceback attack in terms of the time window in which a traceback attack can be carried out. However, given that the list of UIDs associated with active request messages is still maintained by each node, the traceback attack can still be carried out, albeit within a smaller time window. One of the potential solutions discussed in the previous section is being considered by the Freenet project [41].

CHAPTER 4

DYNID: THWARTING THE TRACEBACK ATTACK ON FREENET

4.1 Introduction

The traceback attack we developed can break the anonymity of content retrievers on Freenet [25]. More specifically, the traceback attack can identify the originating machine of a content request message, even if only a single content request message has been issued by the content retriever.

In the traceback attack, after a monitoring node of an attacker observes a content request message, nodes controlled by the attacker will iteratively connect to suspect nodes, and query them to determine if they have seen the message previously, based on the reply message from a suspect node. The traceback attack exploited a few fine-grained design and development decisions made in Freenet. A key Freenet feature utilized by the traceback attack in determining if a node has seen a message is the unique identifier (UID) associated with each content request message.

The UID carried in a content request message is used to prevent routing loops of the message. In Freenet, each node maintains a set of UIDs carried by the messages that it has seen, i.e., it has either originated or forwarded. When a request message with an old UID arrives at a node, the node will discard the message and reply to the upstream node with a Reject with Loop failure message. The traceback attack exploited this feature by sending to a suspect node a specially crafted probe message with the same UID as that of the content request message of interest, in order to determine if the node has seen the content request message previously.
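The iterative probing can be sketched over a toy topology as follows. This is a simplified illustration under stated assumptions, not the attack implementation: `neighbors` (an adjacency map), `seen` (the set of nodes that hold the UID), and `trace_back` are hypothetical, and a probe is modeled as a plain set lookup standing in for the Reject with Loop reply.

```python
def trace_back(monitor, neighbors, seen):
    """Follow the reverse path starting from the monitoring node.

    Returns (reverse_path, is_linear), with reverse_path ordered from the
    last traced node to the monitor.
    """
    path = [monitor]
    visited = {monitor}
    current = monitor
    while True:
        # Probe every neighbor not yet on the path; membership in 'seen'
        # stands in for a loop rejection of a probe carrying the same UID.
        suspects = [n for n in neighbors[current] if n not in visited and n in seen]
        if not suspects:
            return list(reversed(path)), True
        if len(suspects) > 1:
            # Several neighbors have seen the UID: non-linear reverse path.
            return list(reversed(path)), False
        current = suspects[0]
        visited.add(current)
        path.append(current)


# Toy topology: a - b - c - m, with an extra neighbor x attached to b.
neighbors = {
    "m": ["c"],
    "c": ["m", "b"],
    "b": ["c", "a", "x"],
    "a": ["b"],
    "x": ["b"],
}
print(trace_back("m", neighbors, seen={"a", "b", "c", "m"}))
print(trace_back("m", neighbors, seen={"a", "b", "c", "m", "x"}))
```

In the first call the trace ends at `a` with a linear reverse path; in the second, both `a` and `x` answer as having seen the UID, so the trace stops at `b` and reports a non-linear path.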

In this part of our work we develop a simple yet effective scheme named dynID (dynamic UID) to mitigate the traceback attack on Freenet. In dynID, the UID of a content request message will be dynamically changed at the beginning portion of the message forwarding path. Let n denote the node where the UID of a content request message is last changed along the message forwarding path. In dynID, an attacker can only trace back the message to n; it cannot uniquely determine if n or any other node is the originating machine of the message.
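The intuition can be illustrated with a toy model. The sketch below is not the dynID implementation: the function `forward_with_dynid`, the parameter `rewrite_hops` (how many nodes at the start of the path replace the UID), and the use of random 64-bit values are all assumptions made for illustration. It only shows that a probe carrying the UID observed downstream matches no node before n.

```python
import random


def forward_with_dynid(path, rewrite_hops, rng):
    """Return ({node: UIDs it has seen}, final UID) for one message.

    Assumption: each of the first `rewrite_hops` nodes on the path replaces
    the UID with a fresh random value before forwarding the message.
    """
    if rewrite_hops < 1:
        raise ValueError("the originating node must assign the first UID")
    seen = {node: set() for node in path}
    uid = None
    for i, node in enumerate(path):
        if uid is not None:
            seen[node].add(uid)        # UID carried by the incoming message
        if i < rewrite_hops:
            uid = rng.getrandbits(64)  # fresh UID for the outgoing message
            seen[node].add(uid)
    return seen, uid


path = ["origin", "a", "b", "c", "monitor"]
seen, final_uid = forward_with_dynid(path, rewrite_hops=3, rng=random.Random(0))
matches = [node for node in path if final_uid in seen[node]]
print(matches)
```

With three rewriting hops, node `b` plays the role of n: probing with the UID seen at the monitor matches `b`, `c`, and `monitor`, but neither `origin` nor `a`.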

A key concern over dynID is that, given that a content request message is no longer uniquely associated with a UID, loops may be formed along the message forwarding path, which may limit the search scope of a content request message and result in a failure to locate the corresponding content on Freenet. We note that a number of design decisions of dynID and Freenet make this unlikely, and dynID should have only a negligible impact on the performance of Freenet in locating content on the network. Our simulation studies using the original Freenet source code, extended with dynID, show that all content requests successfully locate the corresponding content, despite the existence of short-lived loops along the forwarding path of a content request message (in some of the simulation studies).

Shortly after the traceback attack was identified on Freenet, the Freenet project developed a quick patch to mitigate the traceback attack, by removing the queue of UIDs associated with the completed request messages [29]. This scheme limits the time window in which a traceback attack can be carried out. However, given that the list of UIDs associated with active request messages is still maintained by each node, the traceback attack can still be carried out, albeit within a smaller time window.

The lightweight dynID scheme can more effectively thwart the traceback attack on Freenet.

In particular, an attacker can only trace back a content request message to the node where the UID value is last changed; it cannot uniquely determine the originating machine of the message.

It is worth noting that the two schemes complement each other, and can be combined to provide stronger user anonymity for content retrievers on Freenet.

We present the details of the dynID scheme in thwarting the traceback attack on Freenet, and conduct simulation studies to investigate its performance impacts on Freenet in locating content on the network. The remainder of the chapter is organized as follows. In Section 4.2 we provide the essential background on the traceback attack. In Section 4.3 we detail the proposed dynID scheme in thwarting the traceback attack on Freenet. We perform simulation studies on the impacts of dynID on the performance of Freenet in Section 4.4.

4.2 Background

In this section we provide the necessary background on the traceback attack on Freenet, and the critical Freenet features that were exploited by the traceback attack. We refer interested readers to [25] and [10] for the details of the traceback attack and Freenet, respectively.

Figure 4.1: Basic structure of the traceback attack. (Diagram labels: monitoring node, nodes n_{k−1} and n_k, routing keys to be monitored, attack nodes.)

Figure 4.1 illustrates the basic structure of the traceback attack. In the traceback attack, an attacker deploys a number of monitoring nodes in Freenet, each maintaining a set of routing keys of interest to be monitored. (A routing key is calculated based on the content to be requested and is carried in a content request message.) A monitoring node passively observes the content request messages passing through it and tries to match their routing keys against the routing keys to be monitored.

When a content request message of interest is identified by a monitoring node m, it will forward a few pieces of information to an attack node controlled by the attacker, including the set of suspect nodes, which are the nodes that may have seen (i.e., either originated or forwarded) the message.

Let n → m denote that a message is forwarded from node n to node m, that is, node m received the message of interest from node n. Then the suspect nodes are the neighbors of node n (excluding node m since we know that m has received the message). The neighbor information of node n is available at node m due to the two-hop lookup algorithm employed by Freenet [25].

After an attack node receives the set of suspect nodes, it will connect to (i.e., become a neighbor of) each of the suspect nodes one by one. This component of the attack is carried out by exploiting two Freenet features: the support for an arbitrary node to join Freenet, and the neighbor update mechanism at a Freenet node [25, 27]. After an attack node is connected to a suspect node, it will send a specially crafted probe message to the suspect node to determine if the node has seen the corresponding content request message previously. This process exploits the Freenet mechanism for preventing routing loops of content request messages.

In Freenet, each content request message is associated with a unique identifier (UID), and each node also maintains the set of UIDs that it has seen. The UIDs maintained by a node are classified into two categories. The first category is a list of UIDs associated with active request messages (whose corresponding reply message has not come back from the downstream node), and the second is a queue of UIDs of the completed request messages (whose corresponding reply message has come back from the downstream node). When a request message arrives at a node, the node will first check if it has seen the UID carried in the message (in the set of UIDs of either category). If the node has seen the UID previously, it will immediately send back a Reject with Loop failure message to the upstream node where the message comes from. Otherwise, it will process the request message normally.
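The two UID categories and the Reject with Loop check described above can be sketched as follows. This is a minimal illustration, not Freenet's actual implementation: the class name, the reply labels, and the capacity of the completed queue are all assumptions made for the example.

```python
from collections import deque

REJECT_WITH_LOOP = "RejectWithLoop"  # hypothetical label for the failure reply
ACCEPTED = "Accepted"                # hypothetical label for normal processing


class UidTracker:
    """UIDs seen by one node, split into the two categories from the text."""

    def __init__(self, completed_capacity=1000):
        # UIDs of active requests (reply has not yet come back).
        self.active = set()
        # Bounded FIFO of UIDs of completed requests (capacity is an assumption).
        self.completed = deque(maxlen=completed_capacity)

    def has_seen(self, uid):
        return uid in self.active or uid in self.completed

    def on_request(self, uid):
        """Return the reply sent upstream for an incoming request."""
        if self.has_seen(uid):
            return REJECT_WITH_LOOP
        self.active.add(uid)
        return ACCEPTED

    def on_reply(self, uid):
        """Move a UID from the active set to the completed queue."""
        if uid in self.active:
            self.active.remove(uid)
            self.completed.append(uid)


tracker = UidTracker()
first = tracker.on_request(42)   # genuine request, processed normally
tracker.on_reply(42)             # request completes
probe = tracker.on_request(42)   # a probe reusing the same UID is rejected
```

In this sketch the probe is rejected even after the request has completed, which is the behavior that the queue of completed UIDs makes possible.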

Based on the reply from a suspect node, the attack node can determine if the suspect node has seen the message previously. After a node n that has seen the message is identified, its neighbors will be in turn considered as suspect nodes and will be contacted and probed by an attack node to determine if any of them has seen the message previously. Similarly, the neighbor information of node n is available at the attack node because of the two-hop lookup algorithm of Freenet. This process continues until no new suspect nodes are identified. At the end of this process, we have the set of all nodes that have either originated or forwarded the content request message.
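The iterative probing described above amounts to a breadth-first expansion of the suspect set. The sketch below illustrates that process on a toy topology; `neighbors_of` and `has_seen_uid` are hypothetical stand-ins for the two-hop neighbor information and for the connect-and-probe step, and are not part of any Freenet API.

```python
from collections import deque


def trace_back(start_suspects, neighbors_of, has_seen_uid):
    """Return the set of probed nodes that have seen the message.

    start_suspects: nodes initially reported by the monitoring node.
    neighbors_of:   hypothetical callable, node -> iterable of neighbors.
    has_seen_uid:   hypothetical callable, node -> bool (result of a probe).
    """
    seen_message = set()
    probed = set()
    queue = deque(start_suspects)
    while queue:
        node = queue.popleft()
        if node in probed:
            continue
        probed.add(node)
        if has_seen_uid(node):
            seen_message.add(node)
            # Neighbors of a node that saw the message become new suspects.
            for nb in neighbors_of(node):
                if nb not in probed:
                    queue.append(nb)
    return seen_message


# Toy example: the message travelled A -> B -> C -> M, where M monitors.
topology = {
    "A": ["B", "X"],
    "B": ["A", "C", "Y"],
    "C": ["B", "M", "Z"],
    "X": ["A"], "Y": ["B"], "Z": ["C"], "M": ["C"],
}
on_path = {"A", "B", "C"}
# M received the message from C, so the suspects are C's neighbors except M.
initial = [n for n in topology["C"] if n != "M"]
found = trace_back(initial, topology.__getitem__, lambda n: n in on_path)
```

The loop stops once no unprobed suspects remain, which corresponds to the termination condition stated in the text.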

A set of lemmas was developed in [25] to help determine the originating machine of a content request message, after all nodes that have seen the message have been identified, by exploiting the routing algorithms employed by Freenet to forward a request message. In essence, the originating machine of a content request message can be uniquely determined if the forwarding path of the message satisfies certain conditions defined in the lemmas. The experimental studies performed in Chapter 2 showed that the originating machines of 24% to 49.2% of content request messages can be uniquely determined.

4.3 DynID to Thwart Traceback Attack

In this section we provide the details of the dynID scheme in thwarting the traceback attack on Freenet. We will present the intuition and the design choices of the scheme, and illustrate how it can help thwart the traceback attack on Freenet.

As we have discussed above, Freenet relies on a UID based mechanism to prevent routing loops of request messages, and this mechanism is one of the key features being exploited by the traceback attack. By iteratively connecting to and probing suspect nodes starting from a monitoring node where an interested content request message is observed, an attacker can identify all the nodes that have seen the message (based on the UID value). The originating machine of the message can then be determined if the message forwarding path satisfies certain conditions [25].

Given that the UID associated with a content request message plays a critical role in guiding the progress of a traceback attack, one way to thwart the traceback attack is to dynamically change the UID value associated with the message along the message forwarding path, which is the basic idea of the dynID scheme. In designing dynID, we must take into consideration a few factors that may affect the performance of Freenet, and whether or not an attacker can infer if the UID value of a message has been changed. In particular, we note that the UID associated with a request message is used by Freenet to prevent routing loops. If a routing loop is formed along the forwarding path of a content request message, Freenet may not be able to locate the desired content even if the content exists on the network. Therefore, dynID should be designed in a way that can reduce the likelihood of forming routing loops, which is strongly related to when the UID value should be changed. In the following we will discuss these design choices of dynID.

We first provide some necessary Freenet background on controlling how far a request message can traverse in the network. In Freenet, in addition to the UID and routing key, a content request message is also associated with an HTL (hops to live) value, which is used to control how many hops (nodes) the message can traverse. The HTL field is initialized to a maximum value at the originating machine, and reduced by one at each node. When the HTL value becomes 0 at a node, the message is discarded by the node, and a Data not Found failure message is returned to the upstream node (and propagated back to the originating machine).

In Freenet, for security reasons, a node may not decrease the HTL value by 1 when the HTL value is the maximum value, so that a receiving node cannot infer if the upstream neighbor is the originating machine of the message, even if HTL has a value that is 1 less than the maximum value. Instead, the HTL value is only decreased by 1 with a preconfigured probability (default probability is 50%) when it equals the maximum value. Similarly, the HTL value may not be decreased to 0 when it already reaches 1. As a consequence of this behavior, a message may traverse a path that is longer than the initial (maximum) HTL value.
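A small simulation can illustrate why a path may be longer than the maximum HTL value. This is a sketch under stated assumptions: the probability of decrementing at HTL = 1 is not given in the text and is set to 50% here, every node (including the originating machine) is assumed to apply the same rule, and the function names are hypothetical.

```python
import random

MAX_HTL = 18            # default maximum HTL stated in the text
DEC_AT_MAX_PROB = 0.5   # default probability of decrementing at the maximum
DEC_AT_ONE_PROB = 0.5   # assumption: the text gives no value for HTL = 1


def next_htl(htl, rng):
    """HTL value a node puts on the outgoing message."""
    if htl == MAX_HTL:
        return htl - 1 if rng.random() < DEC_AT_MAX_PROB else htl
    if htl == 1:
        return 0 if rng.random() < DEC_AT_ONE_PROB else 1
    return htl - 1


def path_length(rng):
    """Number of nodes that process a message before its HTL reaches 0."""
    htl, hops = MAX_HTL, 0
    while htl > 0:
        htl = next_htl(htl, rng)
        hops += 1
    return hops


rng = random.Random(0)
lengths = [path_length(rng) for _ in range(1000)]
print(min(lengths), max(lengths))
```

Because each node lowers the HTL by at most 1, no simulated path is shorter than 18 hops, and the probabilistic steps at both ends make many paths longer.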

Location where UID value should be changed. Now let us discuss when the UID value of a request message should be changed. To simplify the discussion, we first assume that the UID is only changed once along the forwarding path. As mentioned above, a consequence of dynamically changing the UID value of a message is that routing loops may be formed along the forwarding path, and more importantly, the desired content may not be located because of routing loops.

Figure 4.2 illustrates an instance where a routing loop is formed along the forwarding path. In the figure, node A originates a message with UID value of uid1, which is forwarded along the solid line from A to B, and then to C, D, and E, in that order. Assume the UID value is changed at node E from uid1 to uid2, and the message is forwarded back to node B with the new UID value. Given that the incoming message carries a new UID value, node B considers the message a new message, and (likely) the message will be forwarded along the dashed line to nodes C and D in that order. Node D may again forward the message to node E. However, given that node E has seen uid2 previously, it will directly reject the message, and node D will choose a different node to forward the message to (if it has one). We refer to the number of nodes involved in a loop (i.e., the nodes that forward the same message more than one time) as the size of the loop, and denote it as |L|. In Figure 4.2, |L| = 3.

Figure 4.2: A forwarding path with loop.

We note that the larger the value of |L| is, the greater the impact a loop will have on the decrement of the HTL value, and consequently, on the search scope of a content request message. (Note that the HTL value may not always be decreased by 1 at a node.) In the worst case, the HTL value could be reduced by |L|, that is, each node in the loop will decrease HTL by 1. If the value of |L| is large, this can greatly affect the search scope of a content request message. For example, assume the maximum value of HTL is 18 (the default HTL maximum value in Freenet), and further assume |L| to be 15. Then the content request message can only visit roughly 3 nodes that are not part of the loop along a forwarding path, and the likelihood for the content request message to locate the desired content will be small. Therefore, changing the UID value after a request message has traversed a large number of nodes may not be desired, due to the impacts on the performance of Freenet in locating requested content on the network.
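The worst-case arithmetic in this example can be written out as a short helper. The function name is hypothetical, and the calculation assumes the worst case from the text, in which every node in the loop decreases the HTL by exactly 1.

```python
MAX_HTL = 18  # default maximum HTL stated in the text


def worst_case_nodes_outside_loop(loop_size, max_htl=MAX_HTL):
    """Hops left for nodes outside the loop if each loop node costs one HTL."""
    return max(max_htl - loop_size, 0)


for size in (2, 3, 15):
    print(size, worst_case_nodes_outside_loop(size))
```

With |L| = 15 the helper returns 3, matching the figure quoted in the text, while a loop of size 2 or 3 leaves most of the HTL budget intact.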

Another related issue is whether the UID value of a request message should be changed at a fixed (relative) location along the forwarding path, or each node should independently decide whether or not it will change the UID value. In order to illustrate the issue, we present a failed solution that we first worked on. In this failed solution, we choose to change the UID value after a message has traversed at least two nodes. More formally, let max denote the default maximum HTL value (18 in Freenet), let htl_i denote the HTL value carried in a content request message when it arrives at a node n, and let htl_o denote the new HTL value (which may be htl_i − 1) to be carried by the outgoing message. A new random UID value will be assigned to the message by node n if htl_o = max − 2 and htl_i > max − 2.

We note that the two conditions ensure that the UID value is changed only one time along the forwarding path (when htl_o = 16). We also note that the UID value may be changed at a node that is more than 2 hops away from the originating machine, given that the HTL value may not be decreased when it equals the maximum value. However, this solution cannot completely prevent the traceback attack. For example, if a monitoring node observes a content request message with HTL = 17, it can safely infer that the UID value of the message has not been changed, and therefore, it can carry out the traceback attack to determine the originating machine of the message. Based on this observation, we conclude that changing the UID value at a fixed location (relative to the HTL value) cannot provide the desired anonymity strength. Instead, we need to let each node independently decide if the UID value should be changed.
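The weakness of the fixed-location rule can be made concrete with a short sketch. The two function names are hypothetical; the first encodes the rule of the failed solution as stated in the text, and the second encodes what a monitoring node could infer from the HTL value of an arriving message under that rule.

```python
MAX_HTL = 18  # default maximum HTL stated in the text


def changes_uid_fixed(htl_in, htl_out):
    """Failed rule: assign a new UID exactly when HTL drops to max - 2."""
    return htl_out == MAX_HTL - 2 and htl_in > MAX_HTL - 2


def uid_certainly_unchanged(observed_htl):
    """Inference available to a monitoring node under the failed rule.

    A message arriving with HTL above max - 2 has not yet passed the
    single point where the UID is replaced.
    """
    return observed_htl > MAX_HTL - 2


print(changes_uid_fixed(17, 16))    # the only transition that changes the UID
print(uid_certainly_unchanged(17))  # the observer knows the UID is original
print(uid_certainly_unchanged(16))  # the UID has already been changed once
```

Since the inference depends only on the HTL value visible in the message, the attacker needs no extra probing to learn whether a traceback on the observed UID can reach the originating machine.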

In order to balance the two requirements (the UID value should not be changed after a content request message has traversed a large number of nodes, and each node should independently decide if the UID value should be changed), dynID will only dynamically change the UID value of a message at the beginning portion of the message forwarding path. More specifically, the UID value will only be changed with a preconfigured probability at the nodes where the HTL of a message still equals the maximum value (i.e., htl = max). Recall that the HTL value of a message may not be decreased by 1 when it equals the maximum value; therefore, multiple nodes may independently change the UID value of the message. In essence, the basic process of changing UID in dynID is similar to that of updating the HTL values in the original Freenet. If the UID value is changed at a node n, node n will also maintain the mapping between the two UIDs so that a reply message can be properly processed and propagated back to the upstream node.
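The dynID rule and the UID mapping kept for replies can be sketched as follows. This is an illustrative sketch rather than the code added to Freenet: the class and method names are hypothetical, the 64-bit UID width is an assumption, and the check is applied to the HTL value of the incoming message, which the text does not state explicitly.

```python
import random

MAX_HTL = 18           # default maximum HTL stated in the text
CHANGE_UID_PROB = 0.5  # probability used in the simulations in Section 4.4


class DynIdNode:
    """Per-message UID replacement while htl = max (hypothetical sketch)."""

    def __init__(self, rng):
        self.rng = rng
        # new UID -> UID known upstream, so replies can be relayed back.
        self.uid_map = {}

    def outgoing_uid(self, uid, htl):
        """Return the UID to carry on the forwarded request."""
        if htl == MAX_HTL and self.rng.random() < CHANGE_UID_PROB:
            new_uid = self.rng.getrandbits(64)  # assumed UID width
            self.uid_map[new_uid] = uid
            return new_uid
        return uid

    def upstream_uid(self, uid):
        """Translate the UID of a reply back to the UID known upstream."""
        return self.uid_map.get(uid, uid)


node = DynIdNode(random.Random(1))
unchanged = node.outgoing_uid(7, MAX_HTL - 1)  # htl < max: UID is kept
outs = [node.outgoing_uid(7, MAX_HTL) for _ in range(200)]
```

Each call draws a fresh random decision, so the same node may change the UID for one message and keep it for the next.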

Given this design of the dynID scheme, a natural question is if the UID value should also be changed at other parts of the forwarding path. We note that the objective of this work is to thwart the traceback attack that can uniquely identify the originating machine of a content request message. In dynID, an attacker cannot deterministically infer if the UID value has been changed at the beginning portion of the forwarding path. Therefore, it cannot uniquely determine if the last node being traced back to is the originating machine. Consequently, it is not necessary to change the UID value again at other parts of the forwarding path.

It is also worth noting that the UID value of a request message may not be changed along the forwarding path, depending on the length of the forwarding path, the dynamics of the HTL value of the message, and the probabilistic behavior of each node in deciding if the UID value should be changed. The desired content may be located close to the originating machine of the message (before the UID value is changed), or all the nodes where the HTL value equals the maximum value may decide not to change the UID value. A related question is whether or not we should guarantee that the UID value is changed at least one time along the forwarding path. By the same argument as above, even if htl = max at the monitoring node, an attacker still cannot deterministically infer how many nodes the message has traversed or if the UID value of a message has been changed. Therefore, we do not need to guarantee that the UID value is changed at least one time.

Based on the above discussions, the UID value of a content request message may only be changed at the beginning portion of the message forwarding path while htl = max, for the purpose of thwarting the traceback attack that can uniquely identify the originating machine of a content request message. We leave the investigation of probabilistic identification of the originating machine of a request message, and further changing the UID value along the message forwarding path as future work.

How secure is dynID? In dynID, an attacker can trace back a message from a monitoring node to the node n where the UID value is last changed along the forwarding path of the message. However, based on the information carried in the request message alone, an attacker cannot determine if node n is the originating machine or it is merely a node where the UID value is changed. (An attacker can compute the probability that node n is the originating machine, but it cannot uniquely determine if it is so.)

One complicating factor is the availability of the routing tables at the nodes in Freenet. Due to the two-hop message lookup mechanism adopted in Freenet, the routing table of any node in Freenet can be obtained by an attacker (see [27] and [25]). At the high level, Freenet adopts a two-hop shortest-distance routing mechanism. Consider a message msg, and let d(k) denote the distance from node k to the routing key carried in the message msg (based on the virtual location of node k and its neighbors). Let m denote a node on which the message msg arrives, and let N denote the set of all neighbors of node m. In the two-hop shortest-distance routing mechanism of Freenet, node m will choose the neighbor with the smallest distance to the routing key of the message msg, that is, $\arg\min_{k \in N} d(k)$.
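The neighbor selection rule can be sketched as below. The sketch makes two assumptions that go beyond the text: distance is measured on the circular location space [0, 1), and d(k) is taken as the smallest such distance over node k and its own neighbors, as one plausible reading of "based on the virtual location of node k and its neighbors". All function and variable names are hypothetical.

```python
def circular_distance(a, b):
    """Distance between two locations on the unit circle."""
    diff = abs(a - b) % 1.0
    return min(diff, 1.0 - diff)


def two_hop_distance(node, key, location, neighbors):
    """Assumed d(k): best distance to the key over k and k's neighbors."""
    candidates = [node] + list(neighbors.get(node, []))
    return min(circular_distance(location[c], key) for c in candidates)


def choose_next_hop(m, key, location, neighbors):
    """Pick the neighbor of m with the smallest two-hop distance to the key."""
    return min(
        neighbors[m],
        key=lambda k: two_hop_distance(k, key, location, neighbors),
    )


location = {"m": 0.10, "a": 0.30, "b": 0.70, "c": 0.52}
neighbors = {"m": ["a", "b"], "a": ["m"], "b": ["m", "c"], "c": ["b"]}
next_hop = choose_next_hop("m", 0.50, location, neighbors)
```

In this toy example nodes a and b are equally far from the key on their own, but b is preferred because its neighbor c is very close to the key.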

In [25], a few lemmas were developed to help determine the originating machine of a message, by exploiting the routing algorithm used in Freenet. A natural question is whether the routing algorithm (and the availability of routing tables of all nodes) can help an attacker determine if the node n where the UID is last changed is indeed the originating machine. To ease exposition, let us assume that node k in Figure 4.3 is the last node that a traceback attack has identified to have seen a UID value uid of interest (carried by a request message msg of interest), and that none of the neighbors of node k has seen uid (except node n, from which the attacker traced the message back to node k). Without loss of generality, we further assume that node k is the source node of message msg among all the nodes that the traceback attack has identified to have seen the corresponding UID value. That is, the message is first forwarded to node k, before being further forwarded to any other node identified by the traceback attack, based on the UID value observed by the monitoring node.

Let N(k) denote the set of neighbors of node k (excluding node n for the simplicity of discussion).

If the attacker can conclude, based on the routing tables of N(k) and the routing algorithm in Freenet, that none of the nodes in N(k) can have forwarded the message to node k, then it can also conclude that node k is the originating machine of the message (recall that we assume node k is the source node of all nodes that have been identified by the traceback attack based on the UID value).

Figure 4.3: Can node j forward a message to node k if node h is more preferred?

In the following we show that, although the routing tables of all nodes and the specifics of the routing algorithm of Freenet are available to an attacker, it cannot determine if a neighbor of node k can have forwarded a message to the node. We note that despite the similarity between the problem faced here and the problem faced in [25], the problem here is much harder. In [25], an attacker knows all the nodes that have seen a particular UID value; it only needs to determine the direction of message forwarding in order to determine if a machine is the originator of a message. In contrast, an attacker in the dynID scheme does not know all the nodes that have either originated or forwarded a message.

One potential approach an attacker may try to determine if node k is the originating machine is to check if it is possible for any of the neighbors j ∈ N(k) to forward the message msg to node k, based on the routing tables of the neighbors. If it is impossible for any of the neighbors to forward the message to node k, then the attacker can conclude that node k is the originating machine of the message. Following the methodology in [25], it is tempting to believe that if a neighbor j has a more preferred neighbor h than node k (i.e., d(h) < d(k)), node j will not forward the message to node k, and if this applies to all neighbors of node k, none of them could have forwarded the message to node k. However, this observation is incorrect. Even if d(h) < d(k) holds, it is still possible for node j to forward the message to node k.

One possible scenario is that node j tried to forward the message to node h first (because it is more preferred); however, the desired content is not located along that path, and the message is backtracked to node j, which then chose node k. Another possible scenario is that the message was actually forwarded by node h to node j, and then forwarded by node j to node k. There are other scenarios where it is possible for node j to forward the message to node k, even if more preferred neighbors of node j exist. Given that it is always possible to make a case that the message may have been forwarded by a neighbor to node k, the attacker cannot conclude that node k is the originating machine, even if it is the source node among all the nodes that have been identified by the traceback attack to have seen a request message.

Per node or per message probability to change the UID value. In the original Freenet, the decision to keep the HTL value at the maximum value is made on a per node basis. More specifically, the behavior of a node to decrease or not to decrease the HTL value when htl = max is determined when the node is started (joining Freenet). After the behavior is determined, it will be applied to all content request messages in the same manner. However, this creates a security hole in the sense that an attacker can easily determine the behavior of a node in terms of the HTL update. For example, an attacker can connect two nodes to a target node, and send a content request message from one attack node to request content stored on another attack node. By observing the HTL value at the receiving attack node, it can infer if the target node will decrease the HTL value or not (when it equals the maximum value).

Similarly, using the same method, an attacker can determine if a target node will change the UID value (when htl = max), should changing UID value be performed on a per node basis.

Moreover, an attacker can profile all nodes on Freenet beforehand, instead of during a traceback attack. Profiling the behavior of all benign nodes on Freenet (in terms of HTL and UID update) will help an attacker in determining the originating machine of a message, at least in some special cases. For example, assume that node k is determined as the source node of all nodes that have been identified by a traceback attack, and assume (in an extreme case) that all neighbors of node k will decrease the HTL value (when htl = max), then the attacker can safely infer that node k is the originating machine if, for example, the observed HTL value equals the maximum value. For this reason, in dynID, both the HTL value and the UID value are updated on a per message basis instead of on a per node basis. That is, a node will determine if the HTL value and UID value of a message should be updated, independent from other messages that have been forwarded by the node.

4.4 Performance Evaluation

In this section we conduct simulation studies to investigate the impacts of the dynID scheme on the performance of Freenet in locating content on the network. We first describe the set-up of the simulation studies, and then we present the performance results.

4.4.1 Simulation Set-up

The performance studies are carried out using the simulator that comes with the Freenet project. Unlike other simulators that re-implement (and normally simplify) a production system, the simulator that comes with Freenet uses the original Freenet source code (version 0.7, the latest version of Freenet). Put another way, the behavior of nodes in the simulator is identical to that of nodes in the real-world public Freenet. We extend the source code of Freenet to support the dynID scheme.

All the Freenet networks we used in the simulation studies consist of 400 nodes, and each node can have up to 6 neighbors. The networks are Kleinberg (small-world) networks and constructed in the following manner [28]. As in the real-world Freenet, each node in a network will be assigned with a location in the circular space [0, 1], where location 0 and 1 are considered identical. The set of locations used are evenly distributed in the circular space, with a distance of 1/N, where N is the number of nodes in the network (400). The first location is 0, the second 0.0025, the third 0.0050, and so on.

Next we describe how a node selects neighbors. A node n will select half of the maximum number of neighbors based on a probability that is inversely proportional to the distance between node n and a candidate node. After all nodes have selected half of the maximum number of neighbors (i.e., 3 in our case), node n may have between 3 and 6 neighbors (node n may be selected by other nodes as a neighbor; therefore, it may have more than 3 neighbors). For each node n with b < 6 neighbors, we randomly select 6 − b nodes as its neighbors among all the nodes that still need more neighbors.
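The construction described above can be sketched in two phases. This is a rough reconstruction, not the generator shipped with the Freenet simulator: the function names are hypothetical, and the way the degree cap of 6 is enforced during the first phase (a node is never linked once it already has 6 neighbors) is an assumption, since the text does not spell it out.

```python
import random


def circular_distance(a, b):
    """Distance between two locations on the unit circle."""
    diff = abs(a - b)
    return min(diff, 1.0 - diff)


def build_network(n_nodes=400, max_degree=6, seed=0):
    rng = random.Random(seed)
    location = [i / n_nodes for i in range(n_nodes)]  # 0, 0.0025, 0.0050, ...
    adj = {i: set() for i in range(n_nodes)}
    half = max_degree // 2

    def open_candidates(n, pool):
        # Nodes n can still link to without any node exceeding the cap.
        return [c for c in pool
                if c != n and c not in adj[n] and len(adj[c]) < max_degree]

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    # Phase 1: each node picks up to `half` neighbors, weighted by 1/distance.
    for n in range(n_nodes):
        for _ in range(half):
            cands = open_candidates(n, range(n_nodes))
            if len(adj[n]) >= max_degree or not cands:
                break
            weights = [1.0 / circular_distance(location[n], location[c])
                       for c in cands]
            link(n, rng.choices(cands, weights=weights, k=1)[0])

    # Phase 2: top up nodes below the cap with random nodes that need more.
    needy = [n for n in range(n_nodes) if len(adj[n]) < max_degree]
    for n in needy:
        while len(adj[n]) < max_degree:
            cands = open_candidates(n, needy)
            if not cands:
                break
            link(n, rng.choice(cands))

    return location, adj


location, adj = build_network()
degrees = [len(adj[n]) for n in adj]
```

The second phase may leave a few nodes below 6 neighbors when no other node still needs one, so the sketch only guarantees that no node exceeds the cap.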

We run 10 sets of simulation studies, with each set containing 100 simulation runs. All runs within a set use the same network topology, and each set uses a different topology (i.e., 10 different network topologies are used).

In each set of simulation studies, we insert 100 different files at randomly selected nodes, and then retrieve them one by one from randomly selected nodes. We refer to the 10 sets of simulation studies as S1 to S10, respectively. In all simulation studies, we use the Freenet default values unless otherwise stated. In particular, we use the default HTL value (18), and the default probability 50% to decrease HTL when it has the maximum value. The probability to change the UID value in dynID when htl = max is also 50%.

4.4.2 Results

In this section we present the results of the simulation studies; we focus on the impacts of the dynID scheme on the performance of Freenet in locating content on the network. As we have discussed earlier, loops may be formed along a message forwarding path in dynID, which will (likely) decrease the HTL value and limit the search scope of a content request message. As a consequence, we may not be able to return the desired content for a content request message, if the loop situation becomes very bad (which has been taken into consideration in the design of dynID).

Table 4.1: The number of successful content lookup requests

Set       Total   Successful
S1-S10    1000    1000

Table 4.2: Properties of message forwarding paths.

                 Non-linear
Set    Linear    Total    Loop (size)    Rejected loop    L&R
S1     99        1        0              1                0
S2     100       0        0              0                0
S3     99        1        0              1                0
S4     98        2        0              2                0
S5     99        1        0              1                0
S6     99        1        0              1                0
S7     100       0        0              0                0
S8     99        2        1 (2)          2                1
S9     100       0        0              0                0
S10    98        2        0              2                0

As shown in Table 4.1, for all the 1000 content requests, we are able to successfully locate the desired content. A number of factors likely contributed to this performance of dynID. First, dynID limits the change of the UID values only to the beginning portion of a message forwarding path. This design coupled with the probabilistic behavior of decreasing HTL values implies that the size of a loop, even if one is formed, will not be large. (The number of nodes with htl = max along a message forwarding path follows a geometric distribution with p = 0.5.) Moreover, the two-hop lookup algorithm and the semi-structured network topology also help to route a content request message towards the right direction [25]. In addition, a large initial value of HTL (18) also reduces the chance for a content request message to backtrack (to the originating machine). Combining all these factors, the chance for a forwarding path to enter a loop caused by dynID is small. In addition, even if a loop is formed, its impact on HTL (and therefore on the search scope of the request message) should also be small (see below).
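The geometric distribution mentioned above can be written out explicitly. As a sketch, let K denote the number of consecutive nodes, starting from the originating machine, at which the message still carries htl = max, and assume each such node independently decreases the HTL with probability p (the symbol K and the independence assumption are introduced here for illustration):

```latex
\[
P(K = k) = (1-p)^{k-1}\,p, \quad k = 1, 2, \dots, \qquad
E[K] = \frac{1}{p} = 2 \quad \text{for } p = 0.5,
\]
\[
P(K > k) = (1-p)^{k}, \qquad \text{e.g. } P(K > 5) = 0.5^{5} \approx 0.031 .
\]
```

Under these assumptions the UID can only be changed within the first few hops with high probability, which is consistent with the small loop size observed in the simulations.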

In order to better understand the results of the simulation studies, in Table 4.2 we show the number of message forwarding paths of various properties. A forwarding path is considered linear if every node n on the path only forwards the message to a single downstream node one time. Consequently, paths containing loops caused by dynID (Loop in the table) and paths where Reject with Loop occurs (Rejected loop in the table) are non-linear paths. In the table the column L&R shows the number of paths that contain both loops and Rejected loops. As we can see from the table, the majority of message forwarding paths are linear. Moreover, we only have one loop that is caused by dynID, and the size of the loop is very small (2), which should only impose a minimal impact on the search scope of a content request message. Note that S8 contains more than 100 forwarding paths. When a Data not Found failure (because htl = 0) is returned to the originating machine, it will automatically start a new content request message with a different UID value in the original Freenet [2]. We consider the path followed by this new content request message as a separate path from the previous request message.

In conclusion, we have developed a simple yet effective scheme named dynID to thwart the traceback attack on Freenet. In dynID, an attacker can only trace back a content request message to the node where the UID value is last changed; it cannot uniquely determine the originating machine of the message. Importantly, the developed dynID scheme only has negligible impacts on the performance of Freenet in locating content on the network.

CHAPTER 5

ROL: REROUTE-ON-LOOP IN ANONYMOUS P2P CONTENT SHARING NETWORKS

5.1 Introduction

In order to support censorship-resistant content publishing and user privacy on the Internet, a number of anonymous peer-to-peer content sharing networks (apCSNs) have been developed and deployed, including Freenet and GNUnet [10, 2, 34, 35]. One of the critical problems in the design of such a system is how to detect and handle routing loops. At the high level, two different approaches have been developed. The first one targets proactive routing loop prevention. In such an approach, a request message will carry the information of the nodes that it has traversed. When a node needs to forward a request message to a neighbor, the carried information will be used to prevent the message from being forwarded to a node that the message has traversed before. GNUnet adopts this approach, where the information of the nodes that a request message has traversed is carried in the message using a bloom filter [35].

The second approach aims to detect routing loops and react accordingly. Freenet adopts this approach. In Freenet, a unique identifier (UID) is carried by each request message and maintained by the nodes that have processed the message. When a request message with an old UID value arrives at a node, the node will send a failure message Reject with Loop to the upstream node n where the message comes from, so that node n will choose a different neighbor to forward the message to.

A principal requirement of any loop handling scheme in apCSNs is that it should not leak any message forwarding information that can undermine the anonymity of the user who originates the message. However, both approaches leak a certain level of message forwarding information, which can be exploited by attackers to undermine or compromise user anonymity. For example, based on the bloom filter carried in a request message in GNUnet, an attacker can determine the set of nodes that have seen (either originated or forwarded) the message, and in many cases, a partial forwarding path can be reconstructed based on the routing protocol of GNUnet, which further deteriorates user anonymity.

Similarly, reactive loop detection schemes such as the one adopted in Freenet can also be exploited to determine the set of the nodes that have seen a request message. For example, in order to determine if a node has seen a request message with a particular UID value before, an attacker can send a specially crafted probe message with the UID value of interest to the node. The attacker can confirm that the node has seen the request message if a Reject with Loop failure message is returned. Moreover, as shown in [25], for a large portion of content request messages, the complete forwarding path can be determined, and the originating machine can be identified.

More discussions on the impact of loop handling schemes on the user anonymity of apCSNs will be provided in Section 5.2.

In this chapter we will develop a new routing loop handling scheme named Reroute On Loop (ROL). In ROL, each request message will carry a UID value, and each node in the network will maintain the history of the UID values of the recent request messages that have traversed the node. In addition, for each UID value, the node will also record the set S of the neighbors to which the corresponding request message has been forwarded by the node and the neighbors from which the message has come. When a node n receives a request message m with an old UID value, node n will forward the message m to the next closest neighbor (based on the routing protocol of the apCSN), excluding the neighbors in set S. In this way an attacker cannot determine if a node has seen a request message before by sending a specially crafted message with an old UID value. An old message at node n will be rerouted to an unused neighbor (or discarded due to other properties of the request message), and critically, no failure message revealing the fact that node n has seen the message will be returned to the upstream node where the message comes from.

A critical concern of ROL is its performance impact on the forwarding of request messages on the resulting apCSN. Given that a request message may traverse a node multiple times, messages in an apCSN with ROL may traverse a longer path compared to the ones without ROL. Moreover, many apCSNs have a bound on the number of nodes that a request message can traverse, and therefore, ROL may limit the search scope of a request message. Consequently, a content insert message may not be able to identify the ideal location where the content should be inserted, and a content request message may not be able to reach the target location where the message should be routed to. In order to understand the efficacy and effectiveness of ROL, we will perform simulation studies using the Thynix simulator developed by the Freenet project [5], with a number of different network topologies, ranging from small-world topologies to random network topologies [22, 5].

Our simulation studies show that, compared to the current loop handling scheme in Freenet, ROL only has minor performance impacts on the lengths of message forwarding paths on various network topologies (and consequently the search scope of a request message if a message is bounded by the number of hops it can traverse). For example, the average routing path lengths of messages with ROL are only increased by less than 1 hop compared to the current loop handling scheme of Freenet on small-world network topologies. Our simulation results confirm that ROL is a practical scheme, and can be deployed on Freenet and similar apCSN systems.

The remainder of the chapter is structured as follows. In Section 5.2, we provide the necessary background on existing apCSNs and their routing schemes to illustrate the impact of loop handling on the user anonymity. In Section 5.3 we develop the new ROL scheme. We perform simulation studies to investigate the performance of ROL in Section 5.4, and discuss related work in Section 5.5. We conclude this work in Section 5.6.

5.2 Background

In this section we provide the necessary background on the loop-handling operations of two representative apCSNs, Freenet and GNUnet, including their formation of network topologies, their routing algorithms, and how they handle routing loops. Towards the end of this section, we will also briefly discuss the operations of another apCSN named OneSwarm. We refer interested readers to [10, 2, 34, 35, 36] for more details on these apCSNs.

5.2.1 Freenet

Freenet nodes try to form a small-world network topology [22], where, with a high probability, the majority of neighbors of a node n have a location that is close to the location of node n. At the same time, a node may also connect to neighbors with a far-away location, which provide short-cuts for routing messages to a remote target location. By default, each Freenet node can have up to 40 neighbors.

The UID in a content request message is used by nodes to uniquely identify a message, and to detect routing loops. In Freenet, UIDs are randomly generated and are 8 bytes long, so it is unlikely that two unrelated messages will have the same UID value. We note that, although nodes in Freenet aim to form a small-world network topology, routing loops may be formed on Freenet due to a number of factors.

First, the small-world network topology is not as structured as the structured peer-to-peer (P2P) systems such as Chord [37]. In the structured P2P systems, deterministic routing can be used, and it can be guaranteed that routing loops will not be formed (at least in static cases). However, structured P2P systems can themselves leak too much message forwarding information [38], based on the network topologies and the routing protocols. The greedy routing used in small-world networks cannot guarantee that a request message is always forwarded to the ideal target location. The greedy routing protocol, to a degree, is only a best-effort approach based on the local information available at a forwarding node. As such, multiple tries may be carried out in forwarding a message, which increases the chance to form a routing loop. In addition, in order to deal with local minima and in an effort to locate the ideal target location of a message, some special techniques are also adopted (for example, forwarding a message to the next best neighbor, even if the next one is farther away from the target location of the message compared to the current node), which further increases the chance of routing loop formation.

Second, due to the nature of P2P systems, the network topology of Freenet is formed in a distributed fashion, and may not be an ideal small-world network, which further degrades the performance of the greedy routing and increases the chance to form routing loops.

In order to detect routing loops, each node maintains the history of the messages that it has recently seen in the form of UID values. When a node receives a request message, it will first check if it has seen the corresponding UID before. If it has, a Reject with Loop failure message will be returned to the upstream neighbor where the message comes from. Otherwise, the message is processed according to the routing protocol of Freenet.

However, as shown in [25], the loop handling mechanism in Freenet can be exploited by an attacker to identify all the nodes that have seen a request message. Moreover, when the path traversed by a request message satisfies certain conditions, the complete forwarding path can be re-constructed and the originating machine of the message can be identified. One of the key insights utilized by the traceback attack in [25] is that, by observing the responding message from a node to a specially crafted probe message with an interested UID value, an attacker can infer whether or not the node has seen a concerned content request message with that UID value. In order to prevent the leakage of message forwarding information while detecting and responding to routing loops, we need a new loop handling scheme in Freenet.
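The inference step described above can be illustrated with a minimal Python sketch. The names `FreenetStyleNode` and `probe_reveals_history`, as well as the `Accepted` response label, are hypothetical; the sketch models only the UID check and does not attempt to reproduce Freenet's routing or the full traceback attack of [25].

```python
REJECT_WITH_LOOP = "RejectWithLoop"
ACCEPTED = "Accepted"  # hypothetical label for "message processed normally"


class FreenetStyleNode:
    """Keeps a history of recently seen UIDs and rejects repeated ones."""

    def __init__(self):
        self.seen_uids = set()

    def handle_request(self, uid):
        # A UID that is already in the history signals a routing loop.
        if uid in self.seen_uids:
            return REJECT_WITH_LOOP
        self.seen_uids.add(uid)
        return ACCEPTED


def probe_reveals_history(node, uid):
    """Attacker's view: the response type alone leaks whether uid was seen."""
    return node.handle_request(uid) == REJECT_WITH_LOOP


on_path = FreenetStyleNode()
on_path.handle_request(0xABCD)   # the node forwarded the victim's request
off_path = FreenetStyleNode()    # this node never saw the request

print(probe_reveals_history(on_path, 0xABCD))
print(probe_reveals_history(off_path, 0xABCD))
```

The two probes receive different responses, which is exactly the distinguishing signal that a loop handling scheme should avoid exposing.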

5.2.2 GNUnet

In the following we briefly discuss the operations of GNUnet, and illustrate how the handling of routing loops in GNUnet may leak message forwarding information that can be exploited. GNUnet nodes form a Kademlia-like network topology [39], and message routing is carried out in two stages. In the first stage, a request message is routed randomly in the network. After the message has traversed a sufficient number of hops (roughly log(n), where n is the number of nodes in a GNUnet network), the second stage begins, in which the request message is forwarded according to the Kademlia protocol, with the exception that the routing is carried out in a recursive fashion instead of an iterative fashion as in the original Kademlia system, due to the anonymity requirement of GNUnet.

The rationale behind the random routing in the first stage is to make the lookup of a file independent of the location of the originating machine. Although it was not explicitly stated, we believe that the random routing also helps to improve the anonymity strength of GNUnet. We note that Kademlia is a structured network topology; should the set of nodes that have originated or forwarded a request message become known, the complete forwarding path of a message in Kademlia can be re-constructed. By including a random routing stage, an attacker can only trace a request message back to the last node involved in the random routing stage based on the routing protocol of Kademlia, but not to the originating machine of the message. Therefore, the random routing stage helps improve the overall anonymity strength of GNUnet.

However, random routing also introduces a new problem into GNUnet. Due to random routing, loops can be formed in the forwarding of a request message. To prevent this problem, each request message in GNUnet carries the information of the nodes traversed by the message using a bloom filter. When a node needs to decide the next hop to forward a message to, the bloom filter carried in the message is used to exclude the nodes that have seen the message before. This approach may produce false positives but never false negatives, which guarantees the prevention of routing loops.

However, given that the bloom filter is carried in the message, an attacker receiving the message can determine all the nodes that have seen the message before (it may misidentify some nodes that have not seen the message, but that probability should be very small, given the purpose of the bloom filters used in GNUnet), which degrades the anonymity of GNUnet. Furthermore, after the set of all nodes that have seen a message is identified, in certain cases, a partial message forwarding path may be re-constructed for the nodes involved in the second routing stage, which further degrades the anonymity of GNUnet. Overall, loop prevention techniques based on information carried in a request message have some undesired implications on the anonymity strength of the resulting apCSNs, given that the information is readily available to an attacker who can observe the request message.
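To make the leakage concrete, the following is a minimal Python sketch of a bloom filter carried in a message and tested by an observer. The filter size, the number of hash functions, the SHA-256-based hashing, and the node names are all assumptions chosen for illustration; they are not GNUnet's actual parameters or wire format.

```python
import hashlib


class BloomFilter:
    """A small bloom filter; the bit array is stored in a Python integer."""

    def __init__(self, num_bits=128, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


# Nodes that forwarded the message add themselves to the carried filter.
carried = BloomFilter()
traversed = ["node-17", "node-42", "node-99"]
for node_id in traversed:
    carried.add(node_id)

# An observer who knows candidate node IDs simply tests each one.
candidates = [f"node-{i}" for i in range(100)]
suspects = [c for c in candidates if c in carried]
print(suspects)
```

Every traversed node is guaranteed to appear in `suspects` because a bloom filter has no false negatives; any additional entries would be false positives.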

5.2.3 OneSwarm

The loop handling schemes adopted in Freenet and GNUnet are the representative ones used in existing apCSNs. OneSwarm adopts a slightly different loop handling scheme [36]. In OneSwarm, nodes form an unstructured, random network topology. As such, a search message is flooded by a node to its neighbors (with certain restrictions), instead of being routed as in Freenet or GNUnet. In order to prevent a search message from being flooded more than one time at a node, each node maintains a set of rotating bloom filters to keep track of the search messages that have been recently flooded by the node. When an old search message arrives at a node, the message will not be further forwarded and no response message will be returned to the upstream node where the message comes from. This is different from the loop handling scheme used in GNUnet, where a bloom filter is carried in a request message.

We note that this scheme works in OneSwarm because of the message flooding mechanism used in OneSwarm. It will not work if messages are routed instead of being flooded. Due to scalability concerns of flooding, we will only focus on loop handling schemes that can work with message routing mechanisms instead of only message flooding mechanisms.

5.3 Reroute On Loop

The two loop handling schemes adopted by Freenet and GNUnet both leak a certain level of message forwarding information that can be exploited by an attacker to compromise or undermine the user anonymity of these networks. It is critical to develop a secure loop handling scheme in order to improve the anonymity strength of these apCSNs. We note that loop prevention schemes such as the one adopted in GNUnet would require node traversal information to be carried in a message itself, which naturally leaks certain message forwarding information. In this work, we only consider loop detection and handling schemes, which do not have this requirement.

The loop handling scheme in Freenet leaks message forwarding information because a node will respond with a Reject with Loop failure message if it receives an old request message, which can be exploited by an attacker. A potential approach to addressing this problem is to design and use a general failure message, instead of using a failure-specific response message, as briefly discussed in [25]. In particular, when a forwarding loop is detected at a node, instead of sending a Reject with Loop failure message to indicate there is a forwarding loop, the node should send to the upstream neighbor a general failure message, so that the neighbor cannot infer the specific reason of the failure.

However, this approach has some important implications on the optimization and performance of Freenet, and more importantly, after careful examination, we note that it can still be exploited by an attacker. For example, by retrying a number of request messages with different UID values and contents, an attacker can determine whether a failure is caused by a specific UID value (a routing loop) or by a failure of content lookup. As a consequence, an attacker can determine if a node has seen a request message before. Another approach is to not return any failure message at all to the upstream neighbor when a forwarding loop is detected by a node (as in OneSwarm). However, without the failure message, the upstream neighbor cannot detect the routing problem and cannot forward the request message to a different node. Therefore, the critical issue is how to ensure that a request message encountering a forwarding loop can be routed continuously towards its target location.

In this section we will develop a new loop handling scheme, named Reroute On Loop (ROL), that will not leak any message forwarding information. In essence, ROL is very similar to the loop detection and handling scheme in Freenet, with a minor but critical difference. ROL can be adopted in many different apCSNs. However, in order to make our discussion more concrete, we present ROL in the framework of Freenet (note that ROL is only concerned with loop handling; other aspects of message routing are apCSN specific). In ROL, each request message is associated with a UID value, and each node maintains a history of the UID values of the recent request messages that the node has seen so as to detect routing loops. In addition, for each UID value, a node n will also record the set S of the neighbors to which the corresponding message has been forwarded by node n and the neighbors where the message came from.

When a request message arrives at a node n, the node will first check if it has seen the message before, based on the UID of the message. (In order to focus on message routing and loop handling, we assume node n does not have the content that the message is looking for. Otherwise, the content will be returned on the reverse path of the request message, and the message will not be further forwarded.) If it has not seen the message before, the message is forwarded according to the routing protocol of the apCSN, for example, forwarding the message to the neighbor whose location is closest to the target location of the message, based on the (CHK) routing key of the message. If node n has seen the message before, it will continue forwarding the message to the next closest neighbor, excluding the ones in set S. Importantly, no failure message will be returned to the upstream neighbor to indicate the detection of the routing loop. (A failure message may be returned later for reasons other than routing loops, for example, data cannot be found or route cannot be found.)

ROL impacts on message path lengths and HTL. Note that a node n determines the next closest neighbor cn to which a message m should be forwarded only based on the local forwarding information available at node n. Therefore, node n may forward message m to a neighbor cn who has seen the message before, as long as node n has not used neighbor cn before (for message m). Consequently, a message may traverse a node multiple times in ROL. Figure 5.1 shows an example where a node is traversed 2 times by a content request message. In the figure, the numbers along the edges show the order of the message forwarding. Node A originates a content request message; it is forwarded to node B, and then to nodes C, D, E, and F, in that order. Node F then forwards the message back to node C, without knowing that node C has seen the message before. When node C receives this message from node F, it notices that this is an old request message and selects the next best neighbor to forward the message to, excluding nodes B and F (from which node C received the message) and node D (to which node C has forwarded the message previously). In the example, node C selects node G as the next hop to forward the message to.

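[Figure: nodes A, B, C, D, E, F, G, H, and M. Edge labels 1 to 8 give the forwarding order A → B → C → D → E → F → C → G → H; node M is a neighbor of node F that is not used.]

Figure 5.1: Forwarding of a content request message.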
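The forwarding rule of ROL and the example of Figure 5.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the class `RolNode`, the function `route`, the `hop_budget` parameter, and the per-node neighbor rankings are hypothetical. In particular, each node's neighbors are listed in a fixed order of preference that stands in for the apCSN's own "closest to the target location" decision, and the rankings were chosen so that the trace matches the path described in the text.

```python
from collections import defaultdict


class RolNode:
    """A node that reroutes, rather than rejects, a request it has seen."""

    def __init__(self, ranked_neighbors, has_content=False):
        # Neighbors ordered from most to least preferred for the target
        # (an assumed stand-in for greedy routing on node locations).
        self.ranked_neighbors = ranked_neighbors
        self.has_content = has_content
        # UID -> set S: neighbors the message came from or was sent to.
        self.used = defaultdict(set)

    def next_hop(self, uid, upstream):
        s = self.used[uid]
        if upstream is not None:
            s.add(upstream)
        for candidate in self.ranked_neighbors:
            if candidate not in s:
                s.add(candidate)
                return candidate
        return None  # every neighbor is already in S


def route(nodes, origin, uid, hop_budget=20):
    """Follow a request from origin until content is found or routing stops."""
    path, upstream, current = [origin], None, origin
    for _ in range(hop_budget):
        node = nodes[current]
        if node.has_content:
            break
        nxt = node.next_hop(uid, upstream)
        if nxt is None:
            break
        path.append(nxt)
        upstream, current = current, nxt
    return path


def figure_5_1_nodes():
    """Topology of Figure 5.1 with assumed neighbor rankings."""
    return {
        "A": RolNode(["B"]),
        "B": RolNode(["C", "A"]),
        "C": RolNode(["D", "G", "B", "F"]),
        "D": RolNode(["E", "C"]),
        "E": RolNode(["F", "D"]),
        "F": RolNode(["C", "M", "E"]),
        "G": RolNode(["H", "C"]),
        "H": RolNode(["G"], has_content=True),
        "M": RolNode(["F"]),
    }


nodes = figure_5_1_nodes()
print(" -> ".join(route(nodes, "A", uid=1)))
```

In this sketch node C is visited twice. On the second visit its set S contains B, D, and F, so it forwards to G, and no response that depends on the loop is ever sent back to node F.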

Given that loops are admitted in ROL and a node along the forwarding path of a message may be traversed multiple times by the message, ROL may have a deteriorating impact on the performance of Freenet in terms of message path lengths, which is the key concern over the adoption of ROL in real-world apCSNs such as Freenet. In the next section, we will carry out extensive simulation studies with various network topologies to investigate the performance of ROL. However, we first note that previous studies have shown that the probability to form a forwarding loop in small-world network topologies is small [26]. Therefore, ROL is rarely triggered in the normal forwarding of request messages (it is mainly used to prevent attackers from exploiting the loop handling scheme).

In addition, we also make an observation here to show that the performance of ROL may not be as bad as first conceived. Many factors including the specifics of network topology and the neighbor connectivity will have an impact on the performance of ROL, when a message does encounter a routing loop.

In Freenet, when node C receives the request message from node F (see Figure 5.1), it will return a Reject with Loop failure message, so that node F will select another neighbor to forward the message to. Without loss of generality, let M be the next hop to which the message will be forwarded by node F in Freenet. In contrast, in ROL, it is node C that decides how the message should be further forwarded. Let d(n) denote the distance from node n to the destination of a request message. We note that, under normal conditions, d(C) < d(M), because node F selected node C over node M when it first decided to which node the message should be forwarded. Put another way, node C is closer to the target location of the request message than node M. It is inconclusive which of the two nodes (C and M) is in a better position to forward the message to its target location: node C is closer to the target, but some of the neighbors of node C have been used before, which decreases the search capability of node C.

[Figure: attack nodes C1 and C2 are both connected to node V; the links from C1 to V and from V to C2 are each labeled HTL = k.]

Figure 5.2: Implication of HTL operation.

Recall that each request message is associated with an HTL value, which is used to prevent a message from looping forever in Freenet. HTL puts a constraint on the search scope of a request message. Given that a message may traverse a node n multiple times in ROL, the search scope of a request message is further limited if HTL is decreased each time the message passes node n. One naive solution to the problem would be for a node not to decrease the HTL value of a message if it has seen the message before (therefore, the HTL value is only decreased once, when a node first receives the message). However, this solution may be exploited by an attacker to determine if a node has seen a message before. See Figure 5.2 for an example. In the figure, let us assume that an attacker wishes to determine if node V has seen a request message with a particular UID value. It can connect two attack nodes C1 and C2 to node V [27, 25], and then send a specially crafted request message to node V from node C1 with a particular HTL value, say k. If attack node C2 receives the request message from node V with an unchanged HTL value (HTL = k), the attacker can infer that node V has seen the request message before.

One way to address this issue is to let each node decrease the HTL value with a preconfigured probability (as done when HTL equals the maximum initial value or 1 in Freenet). In this way, an attacker cannot determine if an unchanged HTL is caused by the forwarding of an old message, or due to the probabilistic behavior of HTL manipulation. However, it has a side effect that the message search paths may become much longer, which may not be desirable as longer search paths will degrade the performance of Freenet.
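A minimal Python sketch of such a probabilistic decrement is given below. The function name `next_htl`, its parameters, and the injectable random source `rand` are assumptions made for illustration; the sketch applies the same rule to every message, whether or not the node has seen it, which is what hides the distinction from an observer.

```python
import random


def next_htl(htl, decrement_probability, rand=random.random):
    """Return the HTL to forward with.

    The HTL is decreased by one only with the given probability, and the
    decision does not depend on whether the message was seen before.
    """
    if htl > 0 and rand() < decrement_probability:
        return htl - 1
    return htl


# With a fixed random draw the two possible outcomes are easy to see.
print(next_htl(5, 0.5, rand=lambda: 0.1))  # draw below 0.5: decremented
print(next_htl(5, 0.5, rand=lambda: 0.9))  # draw above 0.5: unchanged
```

Under this rule, the number of nodes visited before a single decrement follows a geometric distribution with mean 1/p for decrement probability p, which is one way to see why search paths can become much longer.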

As we will show in the next section, the performance impact of ROL on the message path lengths is minor on various network topologies. Given this observation, we will not change the behavior of HTL manipulation due to ROL. That is, each node will decide if it will decrease the value of HTL according to the original protocol of the corresponding apCSN.

How should content be transferred back to the requester? In ROL, temporary routing loops can be formed in the sense that a node may see and forward a request message multiple times. This presents a unique issue on how the requested content should be propagated back to the requester of the content. Let us again use the network topology in Figure 5.1 as an example. Recall that a request message is originated at node A and forwarded along the path A → B → C → D → E → F → C → G → H, and assume that node H has the content that is being requested. When the requested content is delivered back to the content requester, there are two different paths (in this example). One is the original reverse path H → G → C → F → E → D → C → B → A; the other is a short-cut path H → G → C → B → A. That is, when the content is propagated back to node C from node G, node C can either forward the data onto the original reverse path (to node F), or directly forward the data to the earliest neighbor from which node C received the corresponding request message (in this case, node B).

Both approaches have their advantages and limitations. Using the short-cut path will minimize the response time to a content request message, and likely improve the user experience on the resulting apCSN. However, it also has its own shortcomings. First, in order to remove the state maintained at the nodes not on the short-cut path, some kind of special messages should still be sent along the forwarding path for the nodes not on the short-cut path, which complicates the apCSN protocol. For example, a new request cancel message can be sent from node C to the next hop along the original forwarding path, that is D, if the short-cut path is used to propagate the content back to node A. When a request cancel message is received by a node, the corresponding state related to the request message will be removed, and the cancel message is further forwarded along the forwarding path. When the request cancel message is forwarded back to node C from node F, node C can simply discard the cancel message.

Second, using a short-cut path also has performance implications on Freenet. In Freenet, aggressive content caching is used to improve the probability that data is located and returned in a timely manner. Specifically, when content is returned along the reverse path in Freenet, nodes along the path will cache the received content (with certain restrictions related to security). Using short-cut paths will reduce the chance of data caching in the network.

Propagating content along the original reverse path is the simplest, without any change to Freenet. However, it is certainly undesirable, given that content can be returned to the requester on the short-cut path to improve the user experience on Freenet. Another subtle issue is that, given that the message paths could be slightly longer in ROL compared to those in Freenet, more copies of content could be cached in the network. However, given that each node only has limited cache (storage) space, spreading more copies of the same content could affect the availability of other content in the network.

We propose a hybrid approach to propagate content back to its requester, where content is forwarded onto both the short-cut path and the original reverse path. In the hybrid approach, a node along the reverse path of a message will forward the data back to all the upstream neighbors (in particular, the one on the short-cut path). Instead of caching content with a probability of 1 (with some restriction related to security), content is only cached at a node with a certain probability. In this way, the impact of longer message paths on caching should be minimized. When a node receives the content again (maybe multiple times, depending on the number of upstream neighbors), it will simply discard the content.

As an example, when node C receives the content from node G, it will forward it back to node B on the short-cut path; in addition, it will also forward the content to node F on the reverse path. When node C receives the content from node D, it will simply discard the content. Note that, if there are multiple loops at node C, it will need to forward the content to the upstream neighbor in each loop.
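The behavior of node C in the hybrid approach can be sketched in Python as follows. The class `HybridReturnNode`, the default `cache_probability` of 0.5, and the injectable random source `rand` are hypothetical; the text does not specify a caching probability, and security-related caching restrictions are not modeled.

```python
import random


class HybridReturnNode:
    """Per-request state of one node when content flows back to it."""

    def __init__(self, upstream_neighbors, cache_probability=0.5,
                 rand=random.random):
        # All neighbors from which this node received the request.
        self.upstream_neighbors = list(upstream_neighbors)
        self.cache_probability = cache_probability  # assumed value
        self.rand = rand
        self.delivered = False
        self.cached = False

    def on_content(self):
        """Return the neighbors to which the content is forwarded."""
        if self.delivered:
            return []  # repeated delivery: discard the content
        self.delivered = True
        # Cache only with a certain probability, not always.
        self.cached = self.rand() < self.cache_probability
        return list(self.upstream_neighbors)


# Node C in Figure 5.1 received the request from B and later from F.
node_c = HybridReturnNode(["B", "F"], rand=lambda: 0.0)
print(node_c.on_content())  # first arrival (from G): forward to B and F
print(node_c.on_content())  # second arrival (from D): discarded
```

The first call corresponds to the content arriving from node G, and the second to the copy that later arrives from node D along the original reverse path.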

What if a request message cannot be forwarded under ROL? In an extreme situation, it may occur that all the neighbors of a node n have been involved in the forwarding of a request message. When this happens, node n cannot further forward the message to any other nodes. Should a failure message (such as Route not Found) be returned immediately, the upstream neighbor can infer that the more likely cause of the failure is that node n has seen the request message before, instead of other routing problems. As a consequence, an attacker can exploit this behavior to determine if node n has seen a request message before. However, we first note that, given the large number of neighbors that a node can have (up to 40 in Freenet), this situation should rarely occur. Second, whenever this does happen, node n can delay the delivery of the failure message for a certain amount of time (for example, the average processing time for a message to traverse a few nodes) to prevent the upstream neighbor from inferring the specific reason of a failure.

5.4 Performance Evaluation

In this section, we perform simulation studies to investigate the performance of ROL. We will first describe the simulation setup, and then we will provide and discuss the results of the simulation studies.

5.4.1 Simulation Setup

The simulation studies are carried out using the Thynix simulator that comes with the Freenet project [5]. Thynix is a simulator developed to investigate Freenet behaviors including probe routing and path folding. It supports the routing of Freenet request messages in the sense that, given a pair of source and destination nodes in Freenet, it can determine the path that a request message will follow in Freenet using the Freenet (greedy) routing protocol. However, in order to scale to large Freenet network topologies, it does not support functions such as file insertion, storage/caching, or retrieval. We extend the simulator to support ROL. To ease exposition, we refer to the current loop handling scheme in Freenet simply as Freenet. We note that ROL and Freenet only differ in the loop handling behavior; they are identical in all other aspects of the Freenet operation. In particular, both of them use the greedy routing in order to forward a request message to its destination.

In order to thoroughly investigate and understand the performance of ROL compared to Freenet, we consider a number of key network properties in the simulation design, including network size (number of nodes), node degree, and network topology. We consider network sizes with 2000, 4000, 8000, and 10000 nodes, and three node degrees of 8, 16, and 24. A node degree specifies the maximum number of neighbors that a node can have in a network. In combination, we have 12 different sets of networks, (2000, 8), (4000, 8), (8000, 8), (10000, 8), (2000, 16), (4000, 16), (8000, 16), (10000, 16), (2000, 24), (4000, 24), (8000, 24), (10000, 24), in the format of (network size, node degree). We refer to them as S1 to S12, respectively. For simplicity, we also use S1 to S12 to refer to the set of simulation studies performed on the corresponding network. We note that the current Freenet has about 3000 to 4000 nodes simultaneously online on average.
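The labeling of the 12 network sets can be reproduced with a few lines of Python, which makes explicit that the node degree varies slowest and the network size varies fastest. The variable names are illustrative only.

```python
SIZES = (2000, 4000, 8000, 10000)
DEGREES = (8, 16, 24)

# S1..S12 in the order listed in the text: for each degree, all sizes.
NETWORK_SETS = {
    f"S{i}": (size, degree)
    for i, (degree, size) in enumerate(
        ((d, s) for d in DEGREES for s in SIZES), start=1
    )
}

for label, (size, degree) in NETWORK_SETS.items():
    print(label, size, degree)
```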

In terms of network topology (how nodes are connected), we consider a number of different network topologies, including both small-world topologies and random topologies. In the following we describe how nodes are connected in different topologies. As in the real-world Freenet, each node in a network is assigned a location randomly selected in the circular space [0, 1], where locations 0 and 1 are considered identical. In a small-world topology, two nodes are connected (becoming neighbors of each other) with a probability that is inversely proportional to the distance between the two nodes [5, 22]. In a random network topology, nodes are randomly connected, regardless of their distance.
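The distance on the circular location space, and the way it drives both link formation and greedy forwarding, can be sketched in Python as follows. The function names are hypothetical, the link weight is left unnormalized because the exact normalization used by the simulator is not stated here, and the two locations passed to `small_world_weight` are assumed to be distinct.

```python
def circular_distance(a, b):
    """Distance on the circle [0, 1], where 0 and 1 are the same point."""
    diff = abs(a - b)
    return min(diff, 1.0 - diff)


def small_world_weight(a, b):
    """Unnormalized link weight, inversely proportional to distance.

    Assumes a and b are distinct locations.
    """
    return 1.0 / circular_distance(a, b)


def closest_neighbor(neighbor_locations, target):
    """Greedy choice: the neighbor whose location is nearest the target."""
    return min(neighbor_locations,
               key=lambda loc: circular_distance(loc, target))


# Locations 0.1 and 0.9 are close because the space wraps around.
print(circular_distance(0.1, 0.9))
# A nearby pair gets a larger link weight than a distant pair.
print(small_world_weight(0.1, 0.2) > small_world_weight(0.1, 0.5))
# Greedy forwarding toward target 0.05 picks the neighbor at 0.95.
print(closest_neighbor([0.2, 0.5, 0.95], 0.05))
```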

Nodes in the real-world Freenet attempt to form a small-world topology, but there is no guarantee that they can achieve this goal. The network topology of the real-world Freenet is more likely to be some variation between a small-world topology and a random topology. For this reason, we will also consider hybrid network topologies, where a fraction x of the neighbors of a node are selected randomly, and the remaining neighbors of the node are selected according to the small-world criterion. We consider x = 5%, 10%, 20%, and 30%, respectively. Furthermore, since we focus on the investigation of the performance of ROL, in all the simulation studies we set HTL to a large value (2000) so that with a high probability we can always find a path from any source node to any destination node.

For ROL and Freenet, we perform 12 sets of simulation studies, S1 to S12, with each set consisting of two groups of simulation studies. One group uses a small-world topology, and the other uses a random topology. In each simulation study (with a fixed network topology), we randomly select two nodes in the network as the source and the destination, determine the route from the source node to the destination node (using ROL or Freenet), and then record the routing path length as the number of nodes along the path. We perform 1000 simulation studies in each group of simulation studies (with randomly selected pairs of source and destination nodes in each simulation study), which simulates 1000 random content requests on the network. We use the average routing path length in each group as an indicator of the performance of a loop handling scheme. In general, a shorter average routing path length is preferred.
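The measurement procedure above can be summarized by a small harness. This is only a sketch: `average_routing_path_length` and its `route` callback are hypothetical names, and the callback stands in for whichever routing scheme (Freenet or ROL) is being evaluated.

```python
import random


def average_routing_path_length(route, num_nodes, requests=1000, seed=0):
    """Average number of nodes on the routing path over random requests.

    route(src, dst) is assumed to return the list of nodes visited,
    including the source and the destination.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(requests):
        # Pick two distinct nodes as source and destination.
        src, dst = rng.sample(range(num_nodes), 2)
        total += len(route(src, dst))
    return total / requests
```

The path length is counted in nodes rather than links, matching the convention stated above.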

Table 5.1 summarizes the properties of the 12 sets of networks used in the simulation studies. In the table we also show the average network path length of the corresponding network topology. The average network path length is a graph property independent of the (greedy) routing used in an apCSN. It allows us to combine both network size and node degree into a single parameter of the network. In general, a large average network path length indicates that nodes in a network are more spread out, and the topology likely has a larger network diameter. As we can see from the table, the average network path length of a network is strongly affected by the node degree. As the node degree increases, the average network path length becomes smaller for a fixed network size. On the other hand, given a fixed node degree, the average network path length becomes greater as we increase the network size. Both are intuitively sound. It is also interesting to note that random networks have a shorter average network path length compared to the corresponding small-world networks. This could be related to the fact that random networks have fewer restrictions on connecting two nodes than small-world networks.
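To make the notion of average network path length concrete, the sketch below computes the mean shortest-path hop count over all pairs of connected nodes using breadth-first search. We assume this standard graph-theoretic definition here; the text does not spell out the exact formula, and the function name `average_path_length` is hypothetical.

```python
from collections import deque


def average_path_length(neighbors):
    """Mean shortest-path hop count over all ordered pairs of distinct, connected nodes.

    neighbors[u] is an iterable of the nodes adjacent to node u.
    """
    total = 0
    pairs = 0
    for source in range(len(neighbors)):
        # Breadth-first search from the source gives hop counts to all reachable nodes.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0
```

For a ring of four nodes, for example, each node reaches two nodes in one hop and one node in two hops, so the function returns 4/3.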

Table 5.1: Properties of the networks used in simulations.

                        Average network path length
Set    Size     Degree  Small-world    Random
S1     2000     8       4.230          4.053
S2     4000     8       4.589          4.413
S3     8000     8       4.949          4.765
S4     10000    8       5.076          4.882
S5     2000     16      3.265          3.083
S6     4000     16      3.528          3.377
S7     8000     16      3.789          3.638
S8     10000    16      3.872          3.709
S9     2000     24      2.879          2.772
S10    4000     24      3.133          2.954
S11    8000     24      3.400          3.195
S12    10000    24      3.478          3.281

5.4.2 Simulation Results

In this subsection we present the results of the simulation studies. First we present the results on small-world topologies and on the random network topologies using the 12 sets of networks.

Then, we present the results of the simulation studies using hybrid network topologies. Towards the end of this section we show the results of reverse path lengths of Freenet and ROL.

Small-world and Random Network Topologies. Table 5.2 shows the average routing path lengths of ROL and Freenet in the 12 sets of networks. From the table we can see that, overall, ROL has only a minor performance impact on the average routing path lengths compared to Freenet. In particular, the increment of the average routing path length of ROL relative to Freenet is negligible in small-world networks: all the increments are well below 1. Moreover, in certain cases (S3 and S4), ROL actually has a shorter average routing path than Freenet, and in some other cases (S5, S6, S7, S9, S10, S11, and S12), there is no change in the average routing path length between ROL and Freenet. For the last case, we have checked that there are no routing loops caused by ROL, and ROL and Freenet have the same message forwarding paths.

Table 5.2: Average routing path lengths of Freenet and ROL.

       Small-world networks    Random networks
Set    Freenet    ROL          Freenet     ROL
S1     7.411      7.553        43.448      45.678
S2     8.706      8.880        86.725      93.604
S3     9.910      9.839        156.186     188.381
S4     11.071     10.997       208.433     215.617
S5     4.790      4.790        11.736      12.424
S6     5.257      5.257        21.165      21.522
S7     5.733      5.733        42.062      46.118
S8     5.951      5.957        50.307      58.128
S9     4.184      4.184        6.513       6.651
S10    4.573      4.573        11.073      12.287
S11    4.942      4.942        18.472      19.696
S12    5.084      5.084        24.414      25.536

The performance of ROL is somewhat worse on random networks compared to small-world networks. However, we note that Freenet also works worse in random networks compared to small-world networks. Therefore, although ROL has a greater absolute increment in the average routing path length in random networks, the relative increment compared to Freenet is still small. For example, the increase in average routing path length caused by ROL is no more than 10% for the majority of random networks.
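The relative increments referred to above can be recomputed directly from the random-network columns of Table 5.2. The short script below is only a worked check; the dictionary and function names are our own.

```python
# Average routing path lengths on random networks, copied from Table 5.2.
FREENET_RANDOM = {
    "S1": 43.448, "S2": 86.725, "S3": 156.186, "S4": 208.433,
    "S5": 11.736, "S6": 21.165, "S7": 42.062, "S8": 50.307,
    "S9": 6.513, "S10": 11.073, "S11": 18.472, "S12": 24.414,
}
ROL_RANDOM = {
    "S1": 45.678, "S2": 93.604, "S3": 188.381, "S4": 215.617,
    "S5": 12.424, "S6": 21.522, "S7": 46.118, "S8": 58.128,
    "S9": 6.651, "S10": 12.287, "S11": 19.696, "S12": 25.536,
}


def relative_increase(freenet, rol):
    """Relative increment of ROL over Freenet, as a fraction of the Freenet value."""
    return (rol - freenet) / freenet


increases = {s: relative_increase(FREENET_RANDOM[s], ROL_RANDOM[s])
             for s in FREENET_RANDOM}
within_10_percent = [s for s, r in increases.items() if r <= 0.10]
above_10_percent = [s for s, r in increases.items() if r > 0.10]

for s, r in increases.items():
    print(f"{s}: {100 * r:.1f}%")
```

With these values, nine of the twelve random networks stay within a 10% increase; S3, S8, and S10 exceed it.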

In order to better illustrate the performance of ROL and Freenet with respect to the network properties, we show in Figures 5.3 and 5.4 the average routing path length as a function of the average network path length. From the figures we can see that as the average network path length increases, in general the average routing path length also increases (with a notable dip in the random networks in Figure 5.4). This is expected because a longer average network path means that the nodes in the network are more spread out, and the network diameter is likely larger. In general, a message will traverse more nodes in order to reach a destination in a more spread-out network, for any routing algorithm.

Figure 5.3: Average routing path length (small-world networks). [Plot: average routing path length versus network path length, for Freenet and ROL.]

Figures 5.5 and 5.6 show the empirical cumulative distribution function (CDF) of the routing path lengths for both ROL and Freenet, in small-world networks and random networks, respectively. To make the figures more legible, we use two networks (S2 and S3) as the representative examples. Data with other networks show a similar trend. From Figure 5.5 we can observe that ROL and Freenet have a very similar CDF of routing path lengths, which again confirms that the impact of ROL on message path lengths is very small compared to Freenet on small-world networks. In addition, both ROL and Freenet have relatively short routing path lengths; for example, more than 95% of messages have a routing path that is no greater than 18 hops, which is the default maximum initial value of HTL on Freenet.
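For readers unfamiliar with empirical CDFs, the sketch below shows how such a curve, and a statement like "the fraction of paths no longer than 18 hops", can be derived from a list of measured path lengths. The function names are hypothetical, and the sample values in the usage note are made up purely for illustration; they are not simulation results.

```python
def empirical_cdf(samples):
    """Return sorted (value, cumulative fraction) points of the empirical CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, value in enumerate(ordered, start=1):
        if points and points[-1][0] == value:
            # Repeated value: keep one point with the largest cumulative fraction.
            points[-1] = (value, i / n)
        else:
            points.append((value, i / n))
    return points


def fraction_within(samples, limit):
    """Fraction of samples that are no greater than limit."""
    return sum(1 for s in samples if s <= limit) / len(samples)
```

For the illustrative list `[4, 6, 6, 9, 25]`, `fraction_within(samples, 18)` evaluates to 0.8, since four of the five values do not exceed 18.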

In contrast, from Figure 5.6 we can see that both ROL and Freenet have much longer routing path lengths on random networks compared to small-world networks. (Note that the routing path length of 2000 of Freenet in S3 is caused by our limit on the HTL value in the simulation studies. The actual routing paths could be longer.) For example, the majority of routing paths have a length that is greater than 50 hops, and a large number of routing paths have a length that is greater than 500 hops. Given that nodes in random networks are connected randomly, independent of their locations, we do not expect any routing algorithm, and greedy routing in particular, to work well in this type of network. Despite the relatively large routing path lengths, we emphasize that ROL performs similarly to Freenet; put another way, ROL does not have a major impact on routing path lengths compared to Freenet in random networks either.

Figure 5.4: Average routing path length (random networks). [Plot: average routing path length versus network path length, for Freenet and ROL.]

In order to better understand the impact of ROL on message forwarding, in Table 5.3 we show the number (and percentage) of messages that encounter a loop (traverse a node multiple times) while being forwarded from the source to the destination, for both small-world networks and random networks. Recall that in each group of simulation studies we perform 1000 content requests. From the table we can see that in small-world networks only a very small number of messages encounter a loop in each group of simulation studies, ranging from 0 to 65, which is less than 7% of messages in all groups of simulation studies (with small-world networks). This shows that messages in small-world networks will rarely encounter loops with greedy routing, which has been observed in some previous studies [26]. The ROL in this context is mainly used to prevent attackers from exploiting the loop handling scheme.

Figure 5.5: Distribution of routing path lengths (small-world networks). [Plot: cumulative distribution function (CDF) of routing path length, for Freenet and ROL in S2 and S3.]

In contrast, a large percentage of messages will encounter loops in all the groups of simulation studies on random networks. Again, given that nodes are randomly connected in random networks, it is not surprising that a message will be forwarded back to a node that has seen the message previously. We also observe that the node degree plays a key role in the formation of routing loops in both small-world networks and random networks. As the node degree increases (with a fixed network size), the probability for a message to encounter a routing loop becomes smaller. This is understandable; in an extreme case when a network becomes a clique (nodes have the largest degree), there will be no forwarding loops.
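The loop statistic reported in Table 5.3 follows from the definition given earlier: a message encounters a loop if it traverses some node more than once. A minimal sketch of this check is shown below; the function names are our own, and a path is assumed to be the list of nodes a message visits in order.

```python
def encounters_loop(path):
    """True if some node appears more than once on the forwarding path."""
    return len(set(path)) < len(path)


def loop_share(paths):
    """Return (number of paths with a loop, percentage of all paths)."""
    looped = sum(1 for p in paths if encounters_loop(p))
    return looped, 100.0 * looped / len(paths)
```

With 1000 requests per group, the percentage is simply the count divided by ten, as in Table 5.3.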

Figure 5.6: Distribution of routing path lengths (random networks). [Plot: cumulative distribution function (CDF) of routing path length, for Freenet and ROL in S2 and S3.]

Hybrid Networks. Given that the network topology of the real-world Freenet is more likely to be a variation between a small-world network and a random network, in the following we perform simulation studies using hybrid networks. The hybrid networks are constructed using the parameters of S2 (network size 4000, node degree 8), S3 (8000, 8), and S11 (8000, 24). Using other sets of networks provides similar observations.

Table 5.5 shows the average routing path length on various hybrid networks, with x = 5%, 10%, 20%, and 30%. For comparison, we also include the results for the small-world topology (0% random) and the random topology (100% random). As we can see from the table, even with added randomness in networks, ROL can still perform well compared to Freenet, and in some cases outperforms Freenet, in terms of average routing path lengths. We also show the results under different network sizes and node degrees. Tables 5.4 and 5.6 show that ROL and Freenet have similar performance. Furthermore, as we can see from those tables, there is a trend that a higher node degree reduces the performance impact of ROL. Overall, the simulation studies confirm that ROL is a practical loop handling scheme that can be deployed on apCSN systems such as Freenet, which aim to form a small-world network topology.

Table 5.3: Number of messages in loops.

       # of messages in loops (%)
Set    Small-world networks    Random networks
S1     35 (3.5%)               556 (55.6%)
S2     55 (5.5%)               664 (66.4%)
S3     45 (4.5%)               781 (78.1%)
S4     65 (6.5%)               789 (78.9%)
S5     1 (0.1%)                231 (23.1%)
S6     0 (0%)                  377 (37.7%)
S7     0 (0%)                  519 (51.9%)
S8     3 (0.3%)                586 (58.6%)
S9     0 (0%)                  90 (9.0%)
S10    0 (0%)                  225 (22.5%)
S11    0 (0%)                  340 (34.0%)
S12    0 (0%)                  414 (41.4%)

Table 5.4: Average routing path lengths on hybrid networks with parameters of S2.

Network topology    Freenet    ROL
Small-world         8.706      8.880
5% random           8.832      10.371
10% random          9.552      9.638
20% random          11.056     11.406
30% random          12.365     15.235
100% random         86.725     93.604

Table 5.5: Average routing path lengths on hybrid networks with parameters of S3.

Network topology    Freenet    ROL
Small-world         9.910      9.839
5% random           10.373     11.036
10% random          10.947     12.091
20% random          15.201     15.152
30% random          18.015     23.611
100% random         156.186    188.381

Table 5.6: Average routing path lengths on hybrid networks with parameters of S11.

Network topology    Freenet    ROL
Small-world         4.942      4.942
5% random           4.943      4.862
10% random          4.941      4.904
20% random          4.967      4.943
30% random          4.958      4.981
100% random         18.472     19.696

Reverse Path Lengths of Freenet and ROL. In order to better understand the performance impact of ROL, we also compare the reverse routing path length of a given request under the Freenet routing scheme with the reverse routing path length of the same request under ROL. We only show the reverse path lengths of requests in which a loop has occurred, because Freenet and ROL have the same routing path if no loop occurs. To effectively show the results, we calculate the difference between the reverse routing path lengths of ROL and Freenet on small-world topologies. The value is the reverse routing path length of ROL minus that of Freenet. If the difference is zero, ROL and Freenet have the same performance; if the difference is negative, ROL has a shorter reverse path than Freenet; otherwise, Freenet produces a shorter reverse path than ROL, which means that Freenet performs better than ROL. We only show the results with the network parameters of S2, S3, and S4. The reason we choose these three sets is that they have the worst performance among all the sets. It is more important to see their path length differences than those of other sets that barely have a loop.
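The sign convention for the path length difference can be stated compactly in code. This is only an illustrative restatement of the definition above; the function names and labels are hypothetical.

```python
def reverse_path_difference(rol_length, freenet_length):
    """Reverse routing path length of ROL minus that of Freenet."""
    return rol_length - freenet_length


def interpret_difference(diff):
    """Map the sign of the difference to the scheme with the shorter reverse path."""
    if diff == 0:
        return "same"
    return "ROL shorter" if diff < 0 else "Freenet shorter"
```

A negative value therefore favors ROL and a positive value favors Freenet.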

First, we compare the reverse routing path lengths of Freenet and ROL with the original reverse path. Figure 5.7 shows the Cumulative Distribution Function (CDF) of the reverse path length differences of Freenet and ROL with the original reverse path. There are two important observations. First, as we can see, ROL can have shorter reverse paths than Freenet. This explains why the average routing path length of ROL can be less than that of Freenet. Second, about 87% of the reverse path length differences are less than 17 hops, despite the fact that a few difference values are more than 50. This suggests that ROL could produce a forwarding path that is more than 10 hops longer than that of Freenet, which raises a concern about content lookup performance. However, we will explain how to mitigate this problem and why this is not a big issue.

Figure 5.7: Comparison of forwarding path lengths between Freenet and ROL (small-world networks). [Plot: cumulative distribution function (CDF) of routing path length difference, for S2, S3, and S4.]

Figure 5.8 shows the Cumulative Distribution Function (CDF) of the reverse path length differences between Freenet and ROL with the short-cut path. As stated in Section 5.3, using the short-cut path can improve the response time of a data request message, which is good for user experience. From Figure 5.8, it is clear that in most cases a short-cut path is shorter than its corresponding Freenet path. Even in the cases where the short-cut path is longer, the increment is no more than 12 hops. This suggests that ROL can have a shorter reverse path although it may have a longer forwarding path. We emphasize that a shorter reverse path matters more than a longer forwarding path, because in general the performance bottleneck in a P2P content sharing network is the download speed. Put another way, file downloading time is much longer than content lookup time. Therefore, the benefit of a shorter reverse path clearly outweighs the cost of a few extra hops on a forwarding path. In conclusion, the comparison of reverse path lengths of Freenet and ROL confirms that ROL is a feasible loop-handling scheme that can be deployed on apCSN systems such as Freenet, which tend to form a small-world network topology.

Figure 5.8: Comparison of returning path lengths between Freenet and ROL’s shortcut (small-world networks). [Plot: cumulative distribution function (CDF) of returning path length difference, for S2, S3, and S4.]

5.5 Related Work

In response to the traceback attack on Freenet [25], Ian Clarke has proposed the same idea of ROL [40]. We independently developed the ROL scheme, and critically, we carried out extensive simulation studies on the performance of ROL. As discussed in Section 5.2, Freenet and GNUnet have their own loop handling schemes [2, 34]; however, they both leak a certain level of message forwarding information that can be exploited to compromise or undermine the user anonymity of these networks. OneSwarm has a slightly different loop handling scheme [36]; however, it only works in flooding-based apCSNs instead of routing-based apCSNs. ROL can work in routing-based apCSNs.

5.6 Conclusion

In summary, we have developed a new loop handling scheme named Reroute-on-Loop (ROL) that would not leak any message forwarding information so as to improve the anonymity strength of the resulting apCSN. Using the Thynix simulator we have also shown that overall ROL only has minor performance impacts on routing path lengths compared to Freenet. Our simulation studies confirmed that ROL is a practical loop handling scheme that can be deployed on apCSN systems such as Freenet.

CHAPTER 6

RELATED WORK

In this chapter we briefly discuss the related work about general and specific attacks on p2pANs.

We first describe general traceback attacks on peer-to-peer anonymous networks, including attacks on low-latency p2p anonymous networks such as Tor [13, 17, 18, 19]. After that, we discuss specific traceback attacks on p2pCSNs, such as GNUnet and OneSwarm. Towards the end of this chapter we discuss attacks specific to Freenet, and the countermeasure adopted by the Freenet project to mitigate the traceback attack developed in this dissertation.

A number of theoretical attack models on p2p anonymous networks have been developed, including the predecessor attack [32], the intersection attack [11], the sybil attack [8], and the eclipse attack [24]. They do not target any particular p2p anonymous networks; rather, they investigate the relationship between the number and coverage of attacking nodes and the likelihood that any message can be traced back. They were developed based on the fact that the members of p2p anonymous networks are dynamic. If a sufficient portion of a p2p anonymous network consists of attack nodes, it is likely that they can collaborate to identify the possible origin of a message.

Such studies provide us with some guidelines on the deployment of attacking nodes on a p2p anonymous network. However, in this dissertation we are more interested in attacks that exploit the operational features of a p2p anonymous network instead of attacks that rely on a large number of attack nodes to cover the critical regions of a p2p anonymous network. In addition, real-world p2p anonymous networks including Freenet have taken steps to prevent a large number of nodes in any individual network domain from joining the networks, so that it is becoming hard to launch such attacks from a single network domain. The traceback attack developed in our work requires far fewer resources from the attacker compared to this type of attack.

A number of watermarking techniques have been developed to trace back traffic on low-latency p2p anonymous networks such as Tor (see, for example, [14, 18, 33]). A possible solution is to introduce a random and long-enough delay into the network traffic, in order to blur or remove watermarks. As we have discussed in Section 3.1, existing watermarking-based traceback attacks on low-latency anonymous networks will not work well on anonymous content sharing systems such as Freenet, given that nodes in such networks do not have much constraint on the time to process and forward a message. Any traffic pattern that may be embedded in messages of such networks for the traceback purpose can be easily destroyed.

In the following, we will discuss traceback attacks that are specific to p2pCSNs, such as GNUnet and OneSwarm. First, in GNUnet, as we mentioned in Section 2.2.2, a node either indirects or forwards a query message depending on its current bandwidth and load. If the node is not busy, it will indirect a content lookup message to one of its peers; if the node is busy, especially with heavy outbound traffic, it will choose to forward a content query message to one of its peers. A traceback attack called the Shortcut Attack has been developed on GNUnet [16]. In this attack, an attacker connects to a suspected node s. The attacker tries to entice node s to forward traffic instead of indirecting it by increasing the node’s outbound traffic. When the suspected node s forwards a query, it does not overwrite the return address of the preceding node n. Therefore, the attacker knows that node s received the query message from node n, and can determine that node s is not the originator of the content query message. Then, the attacker will try to connect to node n and repeat the operation to determine whether or not node n is the originator of the query message. If no node sends the query message to node n, then the attacker can conclude that node n is the originator, which deanonymizes GNUnet. This Shortcut Attack is only feasible if GNUnet networks are very stable, because forcing a node to use forwarding instead of indirecting requires a lot of resources and time.

OneSwarm is an anonymous peer-to-peer file sharing system, similar to Freenet and GNUnet.

Its main design goal is to resist traffic analysis attacks and timing attacks that are easily performed on P2P file sharing networks such as . It uses probabilistic forwarding of queries to

thwart traffic analysis and application-level delays to mitigate timing attacks. However, OneSwarm is vulnerable to a novel timing attack named the twin timing attack. This attack uses simultaneous queries from two attackers who directly connect to the targeted node. The basic idea of the twin timing attack is for the two attackers to send the same content request to the targeted node. Let the total delay (network delay and application-level delay) between the first attacker and the targeted node be d1. Similarly, let the total delay between the second attacker and the targeted node be d2. If the difference between d1 and d2 is no more than a threshold value (i.e., 600 ms), then the targeted node is the source of the requested content; otherwise, the targeted node is just a forwarder, not the source of the requested content. For more details about this new timing attack, we refer to the paper [42]. One potential solution to this timing attack is caching query results, which is not employed by OneSwarm. Caching content lookup messages and their responses will thwart the attack because it obscures whether a response comes from a caching peer or the original source.
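The decision rule of the twin timing attack, as summarized above, reduces to a single threshold comparison. The sketch below restates that rule only; the function name and return labels are hypothetical, and the 600 ms default is the threshold value quoted in the text. How d1 and d2 are measured is outside the scope of this sketch (see [42]).

```python
def classify_target(d1_ms, d2_ms, threshold_ms=600.0):
    """Apply the threshold rule to the two total delays, in milliseconds.

    Returns "source" if the delays differ by no more than the threshold,
    and "forwarder" otherwise.
    """
    if abs(d1_ms - d2_ms) <= threshold_ms:
        return "source"
    return "forwarder"
```

The caching countermeasure mentioned above works against exactly this rule: a response served from a cache can produce a small delay difference even when the responding node is not the original source.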

An attack called Pitch Black has been developed on Freenet [43]. Pitch Black works in the

Darknet mode of Freenet, which allows the swapping of locations between neighboring nodes, based on the distance to their respective neighbors. By continuously lying about the distance to its current neighbors, an attack node can force benign nodes to swap to a highly concentrated location region. This results in an imbalanced distribution of nodes in the location range of [0, 1], where a small number of nodes will be responsible for a large portion of the content stored in Freenet, which can cause unnecessary congestion and even file loss in Freenet. However, the Pitch Black attack targets deteriorating the performance of Freenet instead of tracing back the origin of any messages. In addition, it only works in the Darknet mode of Freenet. Freenet in the Opennet mode does not support location swapping between neighboring nodes. The Freenet project has considered a potential solution to the Pitch Black attack. The solution is to randomly change the location of a node periodically, in order to spread the locations of nodes uniformly over the range [0, 1] on Freenet. However, this method could cause network churn and a less stable network, which may have a negative impact on the routing performance of Freenet.

CHAPTER 7

SUMMARY

In this dissertation we thoroughly investigate the fine-grained decisions made in the Freenet project, including methods to prevent routing loops of content request messages, the handling of various messages in Freenet, and mechanisms for a Freenet node to populate and update its routing table. The objective of this investigation is to learn how well the fine-grained design and development decisions of Freenet have been made to meet the anonymity goals of the network and to obtain insights in developing fine-grained decisions to better support user anonymity. After the thorough examination of the fine-grained decisions made in Freenet, we have developed an effective traceback attack that can identify the originating machine of a content request message. That is, the anonymity of a content retriever can be broken in Freenet. The traceback attack exploited a few fine-grained design and development decisions made in Freenet, including the unique identifier (UID) based mechanism to prevent routing loops of content request messages.

Since the traceback attack is detrimental to user anonymity, we investigate mechanisms to enhance the anonymity of Freenet. We have developed a simple and effective scheme named dynID to thwart the traceback attack on Freenet. In dynID, the UID associated with a content request message is dynamically changed at the beginning portion of the message forwarding path. As a consequence, an attacker can only trace back a content request message to the node where the

UID value is last changed; it cannot uniquely determine the originating machine of the message.

Importantly, dynID only has negligible impacts on the performance of Freenet in locating content on the network.

We note that dynID prevents an attacker from deterministically identifying the originator of a message request, but attackers can still probabilistically trace back to the originator. In order to prevent any traceback attack, we developed a generic solution, Reroute-On-Loop (ROL), to prevent routing information leakage. ROL prevents an attacker from distinguishing a node that has seen a particular message from a node that has not seen the message. In ROL, when a message re-visits a node n, the node n will not send any failure message to the upstream node where the message comes from. Instead, the node n will continue forwarding the message to the next closest peer that has not forwarded the message to node n or received the message from node n. The fact that the node n does not send a failure message to the upstream node when a loop happens prevents attackers from knowing whether or not the node n has seen a particular message. If an attacker cannot distinguish a node that has seen a particular message from a node that has not seen the message, it will become extremely difficult for the attacker to carry out any kind of traceback attack.
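The next-hop choice of ROL described above can be sketched as a simple selection over a node's peers. This is a minimal illustration under our own assumptions rather than the implementation evaluated in Chapter 5: the names `rol_next_hop`, `peer_locations`, and `already_used` are hypothetical, closeness is measured by circular distance between a peer's location and the target location, and returning `None` when every peer has been used is an assumed placeholder that leaves the handling of that case to the caller.

```python
def circular_distance(a, b):
    """Distance between two locations on the circle [0, 1], where 0 and 1 coincide."""
    d = abs(a - b)
    return min(d, 1.0 - d)


def rol_next_hop(target, peer_locations, already_used):
    """Pick the next peer for a message that has re-visited this node.

    peer_locations maps each peer of this node to its location.
    already_used holds the peers that sent this message to the node
    or that the node already forwarded it to.
    No failure message is generated here; the node simply reroutes.
    """
    candidates = [p for p in peer_locations if p not in already_used]
    if not candidates:
        return None  # Assumption: the caller decides what happens in this case.
    return min(candidates,
               key=lambda p: circular_distance(peer_locations[p], target))
```

For example, with peers at locations 0.10, 0.45, and 0.95 and a target location of 0.0, the peer at 0.95 is chosen first; once it is in `already_used`, the peer at 0.10 is chosen instead.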

Together, our three findings will help us to better understand the fine-grained design and development decisions of p2pANs and their impacts on such systems’ security strength and performance. More specifically, our studies have discovered operational-level vulnerabilities to deanonymize p2pANs.

We also developed fine-grained schemes and mechanisms to enhance anonymity of p2p content sharing systems. We believe that our work will contribute towards improving current p2pANs and building a new system that is privacy-preserving and censorship-resistant on the Internet.

BIBLIOGRAPHY

[1] Emulab. Network emulation testbed. http://www.emulab.net/.

[2] Freenet. ://freenetproject.org/.

[3] OneSwarm. https://oneswarm.org/.

[4] Freenet. Opennet attacks. https://wiki.freenetproject.org/Opennet_attacks/.

[5] Thynix. Freenet Simulator. https://wiki.freenetproject.org/Simulator.

[6] Tor. https://www.torproject.org/.

[7] C. Callanan, H. Dries-Ziekenheiner, A. Escudero-Pascual, and R. Guerra. Leaping over the firewall: A review of censorship circumvention tools. Report by Freedom House, Apr. 2011.

[8] John Douceur and Judith S. Donath. The sybil attack. pages 251–260, 2002.

[9] Ian Clarke, Scott G. Miller, Theodore W. Hong, Oskar Sandberg, and Brandon Wiley. Protecting free expression online with freenet, 2002.

[10] Ian Clarke, Oskar Sandberg, Brandon Wiley, and Theodore W. Hong. Freenet: A distributed anonymous information storage and retrieval system.

[11] George Danezis and Andrei Serjantov. Statistical disclosure or intersection attacks on anonymity systems. In Proceedings of 6th Information Hiding Workshop (IH 2004), pages 293–308, 2004.

[12] Roger Dingledine, Michael J. Freedman, and David Molnar. The : Distributed anonymous storage service. In Proceedings of the Workshop on Design Issues in Anonymity and Unobservability, pages 67–95, 2000.

[13] Roger Dingledine, Nick Mathewson, and Paul Syverson. Tor: The second-generation onion router. In Proceedings of the 13th USENIX Security Symposium, pages 303–320, 2004.

[14] Junwei Huang, Xian Pan, Xinwen Fu, and Jie Wang. Long pn code based dsss watermarking. In INFOCOM, pages 2426–2434, 2011.

[15] Tomas Isdal, Michael Piatek, Arvind Krishnamurthy, and Thomas Anderson. Privacy-preserving p2p data sharing with oneswarm. In ACM SIGCOMM, 2010.

[16] Dennis Kügler. An analysis of and the implications for anonymous, censorship-resistant networks. In Proceedings of the 3rd International Workshop on Privacy Enhancing Technologies (PET 2003), pages 161–176. Springer-Verlag, 2003.

[17] Brian N. Levine, Michael K. Reiter, Chenxi Wang, and Matthew Wright. Timing attacks in low-latency mix systems (extended abstract). In Proceedings of the 8th International Financial Cryptography Conference (FC 2004), Key West, FL, USA, February 2004, volume 3110 of Lecture Notes in Computer Science, pages 251–265. Springer, 2004.

[18] Zhen Ling, Junzhou Luo, Wei Yu, Xinwen Fu, Dong Xuan, and Weijia Jie. A new cell counter based attack against tor. ACM, November 2009.

[19] Steven J. Murdoch and George Danezis. Low-cost traffic analysis of tor. In Proceedings of the 2005 IEEE Symposium on Security and Privacy. IEEE CS, pages 183–195, 2005.

[20] Pai Peng Peng. On the secrecy of timing-based active watermarking trace-back techniques, 2006.

[21] Ryan Pries, Wei Yu, Xinwen Fu, and Wei Zhao. A new replay attack against anonymous communication networks. In Proceedings of IEEE International Conference on , ICC 2008, Beijing, China, 19-23 May 2008, pages 1578–1582. IEEE, 2008.

[22] Oskar Sandberg. Distributed routing in small-world networks, 2007.

[23] Atul Singh, Miguel Castro, Peter Druschel, and Antony Rowstron. Defending against eclipse attacks on overlay networks, 2004.

[24] Atul Singh, Tsuen-Wan Johnny Ngan, Peter Druschel, and Dan S. Wallach. Eclipse attacks on overlay networks: Threats and defenses. In IEEE INFOCOM, 2006.

[25] Guanyu Tian, Zhenhai Duan, Todd Baumeister, and Yingfei Dong. A traceback attack on freenet. In Proceedings of IEEE INFOCOM, April 2013.

[26] G. Tian, Z. Duan, T. Baumeister, and Y. Dong, “Thwarting traceback attack on Freenet,” in Proc. IEEE GLOBECOM, Atlanta, USA, Dec. 2013.

[27] T. Baumeister, Y. Dong, Z. Duan, and G. Tian. A routing table insertion attack on Freenet. In Proceedings of ASE International Conference on Cyber Security, Washington D.C., USA, Dec. 2012.

[28] Freenet. Kleinberg networks. https://wiki.freenetproject.org/Kleinberg_network/.

[29] Toad. How safe is Freenet anyway? http://amphibian.dyndns.org/flogmirror/#20120911-security.

[30] Xinyuan Wang, S. Chen, and S. Jajodia. Tracking anonymous peer-to-peer voip calls on the internet, 2005.

[31] Xinyuan Wang, Shiping Chen, and Sushil Jajodia. Network flow watermarking attack on low-latency anonymous communication systems, 2007.

[32] Matthew K. Wright, Micah Adler, Brian Neil Levine, and Clay Shields. The predecessor attack: An analysis of a threat to anonymous communications systems. ACM Trans. Inf. Syst. Secur, 7:2004, 2004.

[33] Wei Yu, Xinwen Fu, Steve Graham, Dong Xuan, and Wei Zhao. DSSS-based flow marking technique for invisible traceback. In Proceedings of IEEE Symposium on Security and Privacy (SP), 2007.

[34] GNUnet, https://gnunet.org/

[35] N. S. Evans and C. Grothoff, “R5N: Randomized recursive routing for restricted-route networks,” in Proc. 5th International Conference on Network and System Security (NSS 2011), Milan, Italy, Sep. 2011.

[36] T. Isdal, M. Piatek, A. Krishnamurthy, and T. Anderson, “Privacy-preserving P2P data sharing with OneSwarm,” in Proc. ACM SIGCOMM, 2010.

[37] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, “Chord: a scalable peer-to-peer lookup protocol for internet applications,” IEEE/ACM Trans. Netw., vol. 11, no. 1, pp. 17–32, 2003.

[38] G. Ciaccio, “Improving sender anonymity in a structured overlay with imprecise routing,” in Proc. International Conference on Privacy Enhancing Technologies, 2006, pp. 190–207.

[39] P. Maymounkov and D. Mazières, “Kademlia: A peer-to-peer information system based on the XOR metric,” in Proc. First International Workshop on Peer-to-Peer Systems, 2002, pp. 53–65.

[40] Toad, “Consider reroute-on-loop,” https://bugs.freenetproject.org/view.php?id=5467.

[41] Toad. Reduce the number of distinguishable failure modes. https://bugs.freenetproject.org/view.php?id=5466#bugnotes.

[42] Swagatika Prusty, Brian Neil Levine, and Marc Liberatore. Forensic investigation of the OneSwarm anonymous filesharing system, 2011.

[43] Nathan S. Evans, Chris GauthierDickey, and Christian Grothoff. Routing in the dark: Pitch black. In Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC). IEEE Computer Society, 2007.

BIOGRAPHICAL SKETCH

The author was born in Sichuan Province, China, in 1987. He earned a Bachelor of Science degree in Computer Science at Ramapo College of New Jersey and then began his doctoral studies at Florida State University. This dissertation is the final requirement for his Doctoral Degree in Computer Science from Florida State University. His research interests are Internet security, anonymous networks, and anonymous P2P content sharing systems.
