Detecting and Recovering from Overlay Routing Attacks in Peer-to-Peer Distributed Hash Tables

A thesis for the degree of Master of Science in Computer Science

Keith Needels [email protected]

Department of Computer Science, Rochester Institute of Technology

February 22, 2008

Committee:

Professor James Minseok Kwon, Chair

Professor Alan Kaminsky, Reader

Professor Warren R. Carithers, Observer

Abstract

Distributed hash tables (DHTs) provide efficient and scalable lookup mechanisms for locating data in peer-to-peer (P2P) networks. A number of issues, however, prevent DHT-based P2P networks from being widely deployed. One of these issues is security. DHT protocols rely on the users of the system to cooperate for lookup requests to successfully reach the correct destination. Users who fail to run the protocol correctly can severely limit the functionality of these systems. The fully distributed nature of DHTs compounds these security issues, as any security mechanism must be implemented in a non-centralized fashion for the system to remain truly P2P.

This thesis examines the security issues facing DHT protocols, and we propose an extension to one such protocol (called Chord) to mitigate the effects of attacks on the underlying lookup message routing mechanism when a minority of nodes in the system are malicious. Our modifications require no trust to exist between nodes in the network except during the joining process. Instead, each node makes use of locally known information about the network to evaluate hops encountered during the lookup routing process for validity. Hops that are determined to be invalid are avoided. These modifications to the Chord protocol have been implemented in a simulator and evaluated in the presence of malicious nodes. We present the results of this evaluation and compare them to the results obtained when running the unmodified Chord protocol.


Table of Contents

1. Introduction
2. Peer-to-Peer Protocols
   2.1. Overlay Networks
   2.2. Napster and Gnutella
   2.3. Distributed Hash Tables
        2.3.1. Chord
        2.3.2. Pastry
        2.3.3. Content Addressable Networks (CANs)
3. Peer-to-Peer Protocol Security Issues and Related Work
   3.1. Data Attacks
   3.2. Identifier Attacks
   3.3. Routing Attacks
4. Chord Secure Routing Design
   4.1. Threat Model
   4.2. Design Overview
   4.4. The Backtracking Algorithm
   4.5. The Hop Verification Algorithm
   4.6. Maintaining Statistical Data
   4.7. Joining the Network
   4.8. Updating Finger Table Entries
5. Simulator Design
   5.1. Using the GUI Utility
        5.1.1. Experiment Setup
        5.1.2. Viewing Experiment Results
   5.2. Writing Tests in Java
6. Evaluation
   6.1. Dropped Lookup Requests
   6.2. Incorrect Random Routing
   6.3. Malicious Sub-ring Routing
        6.3.1. Effect of the Standard Deviation Parameter
        6.3.2. Effect of the Pruning Parameter
7. Conclusion
8. References


1. Introduction

The popularity of peer-to-peer (P2P) networks took off in 1999 thanks to the success of Napster [9], and P2P systems have been a hot research topic ever since. Although applications that could be considered peer-to-peer existed before Napster, it was Napster that made P2P technology known throughout the world by giving the average Internet user the ability to easily obtain music and movies for free. Research in this area has resulted in a powerful class of P2P lookup protocols called distributed hash tables. While these protocols are scalable and efficient, they suffer from security vulnerabilities that prevent them from being widely deployed in open networks.

A peer-to-peer network can be defined as a network where there are no central servers. Instead, each user of the system is both a client and a server, and is referred to as a peer. Peers connect directly to each other to transfer data. In a true peer-to-peer network, peers also locate data without using a central server, without using any kind of hierarchical organization, and without making some peers more important than others. There is no single point of failure. If a peer fails, other peers can continue to use the system. If a single peer is using all of its bandwidth, other peers are not affected.

While illegal file sharing is perhaps the most well known application of peer-to-peer networks, there are many useful legal and ethical uses for these decentralized systems. These uses include overlay multicast, data backup, distributed file systems, distributed databases, instant messaging, DNS, and so on. Many large scale distributed systems that can benefit from an architecture where there is no single point of failure can make use of P2P technology.

Unfortunately for Napster, it was not a true peer-to-peer network, since file lookups were handled by a central server. This directory server was an easy legal target, and Napster was shut down. In the wake of Napster’s demise, many new peer-to-peer systems were developed. These systems were fully decentralized, but most of them relied on flooding techniques to locate peers containing desired data, which is inefficient and not guaranteed to find sought-after files. To solve these problems, yet another class of P2P systems was developed: distributed hash tables (DHTs). DHTs are fully decentralized systems that locate data efficiently, and they are today a popular research topic in the field of distributed systems. A few of these systems will be discussed in Section 2.

The fundamental purpose of a DHT is to find the peer (also called a node) responsible for a resource, given a key for that resource. Since it is not practical for a node to keep track of every node in the network, each node should only be responsible for keeping track of a small subset of other nodes. Finding the node responsible for a key is done by forwarding lookup requests through a structured overlay network. The nodes that a particular node keeps references to are the links that node has in the overlay. The number of hops needed to complete a lookup should also be small.

While DHTs are much more efficient and scalable than the flooding systems that were popular immediately after the fall of Napster, serious issues prevent them from being deployed in large, open networks, one of which is security. With DHTs, we must rely on other peers to correctly forward our lookup requests in order to find the peer responsible for a key. Unlike physical network routing, with overlay routing in an open DHT system, anybody can become a router.

An individual attacking a DHT has many types of attacks available to them. Attackers can modify, drop, and misroute lookup requests. They can take responsibility for certain data in order to deny its availability or provide modified data to other nodes. They can forward incorrect overlay routing table updates. The list goes on and on. DHTs, in their original forms, are an easy target for attackers. Large open P2P networks that use DHTs cannot exist until the fundamental security issues of the underlying DHT protocols are addressed.

The purpose of this thesis is to examine security issues facing DHT protocols and present an extension to the Chord DHT protocol to mitigate some of these available attacks. Our goal is to allow Chord to make use of readily available information that is obtained through the normal operation of the protocol to evaluate the lookup routing process and respond to attacks. This is done in a fully distributed fashion. We assume that a node trusts no node besides itself, except the bootstrap nodes that it uses to join the network. Our results show that our proposed extension to Chord is able to correctly complete lookups over 90% of the time with up to half of the network consisting of compromised nodes performing naïve attacks, and our extension offers significant improvement over the base protocol in the face of sophisticated attacks.

In order to understand the security issues that DHT protocols face, we first must understand P2P/DHT protocols, which is the goal of Section 2. In Section 3, we survey the vulnerabilities that are present in DHT protocols and the solutions that have been proposed. In Section 4, we detail our extension to the Chord protocol to avoid some of these attacks. We give an overview of the architecture of the software simulator used for testing these proposed changes in Section 5, and we evaluate these changes in the simulator in Section 6. We summarize and conclude in Section 7.


2. Peer-to-Peer Protocols

The purpose of this section is to provide background information on Peer-to-Peer networks. It is necessary to understand how these protocols work in order to understand the security issues involved with them. Although several protocols from the last century could be considered P2P (such as Usenet), we have chosen to focus on P2P protocols starting with Napster, which was released in 1999. This is when P2P protocols really became a popular research topic. While a few basic security issues will be laid out in this section, Section 3 will contain a detailed overview of these issues and research that has been performed.

2.1. Overlay Networks

A peer-to-peer network is a type of overlay network, meaning that nodes in the network are not linked together physically but are instead connected with virtual links over the underlying physical network (normally, the Internet). Peers typically communicate with the TCP protocol, and a link in a P2P network can be thought of as a TCP connection over the physical network and not a direct physical link.

One major difference between an overlay network and a physical network is that overlay networks can be organized in any way desired by the designer. Physical links are generally restricted only to machines that are within close proximity to each other with a physical connection (such as an Ethernet connection) between them. It is not possible to create and change physical links between nodes on the fly. It is, however, very easy to create an overlay link between two nodes simply by initiating a TCP connection.

Peer-to-peer protocols attempt to create an overlay network with a structure that allows any node to find other nodes responsible for desired data quickly. These protocols must define how the overlay network is to be organized, how to handle nodes that join and leave the network, how to route lookup requests through the network, and so on. To limit overhead, we only want each node to have a small number of links in the overlay network, but at the same time we want lookups to occur quickly. We also want to do this in a fully distributed manner where there are no single points of failure and no nodes that are more important than others in the operation of the system.

There have been many proposed peer-to-peer protocols. These protocols take various approaches to organizing an overlay network. Each has its own strengths and weaknesses, and some sacrifice being fully distributed for various reasons. The next few sections will give brief overviews of five of these protocols: Napster [9], Gnutella [7], CAN [10], Chord [13], and Pastry [11]. We will focus on Chord since it has been chosen as the target protocol of this thesis.


2.2. Napster and Gnutella

Although Napster [9] was not the first protocol to make use of decentralized, distributed resources, it is the protocol that made P2P technology famous by making the multimedia files of many individuals accessible to the world. While the files shared on Napster were fully distributed, the directory of those files was not. The Napster overlay network consisted of a single file directory server with every Napster peer connected directly to it and only to it. When users joined the network, they sent this directory server a list of all files they were sharing. When a user wanted to find a file, they sent their search phrase to the directory server, which returned a list of peers that had the file. Peers then connected directly to one another to transfer files.

The use of a central directory server means that Napster is not considered to be a “true” Peer-to-Peer system. This single point of failure eventually did fail when legal attacks forced it down, and the Napster network ceased to exist. After Napster failed, developers looked for a way to fully distribute the file lookup mechanism to prevent the network from being vulnerable to a single point of failure. One of the more well known protocols developed was called Gnutella.

Gnutella did away with the central directory server that Napster contained. Instead, each node was a directory server for its own files. Nodes in the overlay were connected to one another in a more or less arbitrary fashion. To perform a search, a node would send its search request to each node it was connected to in the overlay. Each node receiving the lookup request would then send the request to every node it was connected to, and so on. A node that contained the desired data would inform the searching node. This method is referred to as flooding. To keep a lookup request from flooding indefinitely, each lookup request had associated with it a time-to-live counter that each node would decrement before flooding the request to its neighbors. When this time-to-live counter reached zero, it would no longer be flooded.

There are several disadvantages to the flooding approach taken by the original Gnutella protocol. First, it is a very inefficient method of searching. A search request might result in thousands of messages being sent to and from thousands of nodes. Another side effect is that if a file exists in the system, there is no guarantee that it will be found. If the file resides on a node outside of the time-to-live radius, it will never be discovered. This means Gnutella is fundamentally not scalable. As the network grows, the search capability of a node does not necessarily grow, as search is limited by the time-to-live counter. Later releases of Gnutella were able to address these limitations to some extent by using a hierarchy of regular nodes and “supernodes.” The regular nodes would communicate only with supernodes, and supernodes would communicate with each other.
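The TTL-limited flooding just described can be sketched in a few lines. The topology, search predicate, and TTL values below are hypothetical, chosen to illustrate the point above: a file hosted outside the time-to-live radius is never found.

```python
def flood(graph, start, is_hit, ttl):
    """Gnutella-style flood: each round, every frontier node forwards the
    query to all of its neighbors; the TTL bounds the number of rounds."""
    hits, frontier, seen = [], {start}, {start}
    for _ in range(ttl):
        next_frontier = set()
        for node in frontier:
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
                    if is_hit(neighbor):
                        hits.append(neighbor)
        frontier = next_frontier
    return hits
```

On a simple line topology {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}, a query started at node 0 for data held by node 4 fails with a TTL of 2 but succeeds with a TTL of 4, regardless of how many copies of the query are sent.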

Gnutella’s initial scalability issues fueled even more research into peer-to-peer networks. This research resulted in what are known as distributed hash table (DHT) protocols. These protocols use structured overlay networks, unlike Gnutella, which allowed the network to be arranged in any way imaginable. Location of desired data is guaranteed, and within a bounded number of hops.


2.3. Distributed Hash Tables

Unlike flooding protocols such as Gnutella, Distributed Hash Tables are structured. This means that nodes that want to participate cannot create arbitrary links in the overlay network. Each DHT protocol defines how nodes should form connections in the overlay, how those nodes should deal with new nodes joining, and how they should deal with nodes leaving and failing. More importantly, the protocols also define how lookup messages should be forwarded through the system. Three popular DHTs that have received a great deal of research attention include CAN [10], Chord [13], and Pastry [11]. Although DHTs can vary greatly from one to another, there are a few things that almost all of them share in common.

The entire purpose of a DHT, like any other P2P system, is to locate the node(s) responsible for a desired data item. Each data item has associated with it a key, which is known beforehand, such as a file name. Each key is associated with a node or a group of replica nodes that are responsible for maintaining the desired data or a reference to where the desired data might be found. Unlike Napster and Gnutella, most DHTs only support exact key searches and do not support keyword searches; supporting keyword search is a popular open research topic. All functionality other than locating the node responsible for a key, such as actually retrieving the resource being sought, is the responsibility of higher layers of the peer-to-peer application.

Each node in the DHT network is responsible for storing a subset of the overall key-data pair set. Each node maintains overlay links to a small number of nodes in the network for the purpose of routing lookup requests. Lookups are performed by forwarding the lookup request through the overlay network as defined by the DHT protocol. Typically, the number of links maintained by each node is O(log n) and the number of hops for a lookup request to complete is also O(log n), where n is the number of nodes in the system.

Nodes and keys in DHTs are both mapped to identifiers, usually by a hash function with well-distributed output. In Chord and Pastry, an identifier is simply an integer, and a hash function such as SHA-256 can be used to hash nodes and data keywords to their identifiers. In CAN, identifiers are points in a multi-dimensional coordinate system.

Identifiers are used to determine which node is responsible for which key. In Chord, a key is stored on the first node whose identifier is equal to or follows the key’s identifier on the ring. In Pastry, a key is stored on the node with the identifier numerically closest to the key’s identifier. In CAN, a key is stored on the node whose “zone” (centered on the node’s identifier coordinates) contains that key’s identifier coordinates.

For routing lookup requests, each node contains a routing table that contains some small subset of the nodes in the system. These routing tables are used to forward a lookup request progressively closer towards its destination. With Chord, for example, a lookup request is forwarded to the node in the routing table with the identifier that most closely precedes the key’s identifier.


While these are common aspects of most DHTs, the details of each protocol vary, and so do the routing table size and average lookup hop count. Chord and Pastry, for example, have a routing table size of O(log n), and lookup requests take O(log n) hops to reach their destination. CAN, on the other hand, has a constant-sized routing table, but lookup requests take O(n^(1/d)) hops, where d is a constant system parameter.

The next three subsections will provide more detail on Chord, Pastry, and CANs, with most of the attention focused on Chord since it is the target system of this thesis.

2.3.1. Chord

Chord’s identifiers are integers. The identifier for a key is obtained by hashing that key with a hash function, shared by all nodes in the system, that returns integers of some bit length m. This can be any well-distributed hash function; SHA-1 is used in the original Chord paper, giving m = 160. A node is assigned an identifier by hashing its IP address. Nodes and keys are then arranged on an identifier ring modulo 2^m. Each key’s value is stored on the first node with an identifier equal to or following that key’s identifier in the clockwise direction around the ring.
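Identifier assignment can be sketched as follows. This is a minimal sketch, not the thesis implementation: m = 10 is a toy value matching the small ring used in the figures, while SHA-1 with m = 160 matches the original paper, and the sample inputs are hypothetical.

```python
import hashlib

M = 10  # toy identifier bit length; the original Chord paper uses m = 160

def chord_id(value: str, m: int = M) -> int:
    """Map a key or a node's IP address onto the identifier ring [0, 2^m)
    by truncating a SHA-1 digest to m bits."""
    digest = hashlib.sha1(value.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)
```

For example, chord_id("198.51.100.4") would place a node on the toy ring, and chord_id("some-file.mp3") would place a key; every node computes the same mapping because the hash function is shared.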

In order to find the nodes responsible for keys, each node has to store some routing information in a table. In Chord, this routing table is called a “finger table.” The finger table for a node with identifier id contains m entries, numbered from 0 to m-1. Finger table entry i stores the first node whose identifier succeeds id + 2^i (mod 2^m) in the clockwise direction. It is possible (and often probable) to have duplicate entries in the finger table. Figure 2.1 shows a sample finger table with an illustration of how the finger table is derived for a node with identifier 770. Node 770’s last finger table entry should be the node that succeeds 770 + 2^9. This node is Node 275, so a reference to Node 275 is stored in the last entry of Node 770’s finger table. The rest of the finger table entries are filled in by the same process for i = 0 through 8. Note that the first several finger table entries all point to the same node. This is because those entries’ targets are all very close to node 770, falling between 770 and 788.


Figure 2.1. An illustration of a sample finger table.
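The construction behind Figure 2.1 can be sketched as follows. The ring membership below is hypothetical except for the nodes named in the text (770, 788, and 275), chosen so that a 10-bit ring reproduces the entries described above.

```python
def successor(ident: int, nodes: list[int], m: int = 10) -> int:
    """First node identifier equal to or following ident, clockwise."""
    ring = 2 ** m
    ident %= ring
    ordered = sorted(nodes)
    for n in ordered:
        if n >= ident:
            return n
    return ordered[0]  # wrapped past zero

def finger_table(node_id: int, nodes: list[int], m: int = 10) -> list[int]:
    """Entry i holds the successor of node_id + 2^i (mod 2^m)."""
    return [successor(node_id + 2 ** i, nodes, m) for i in range(m)]
```

With the hypothetical ring [32, 275, 678, 770, 788, 950], finger_table(770, ...) yields Node 788 for the first five entries (their targets, 771 through 786, all fall between 770 and 788) and Node 275 for the last entry (the successor of 770 + 512 mod 1024 = 258), matching the figure's description.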

As Figure 2.1 illustrates, each node only has information about a subset of the nodes in the overall system. As the system grows much larger, the number of unique nodes in each node’s finger table becomes a smaller fraction of the overall number of nodes. The size of the finger table has been shown in [13] to be O(log n), where n is the number of nodes in the system. The advantage of the finger table is that when performing a lookup we can jump about half of the remaining distance between the node doing the routing and the node responsible for the key. This divide-and-conquer approach to routing lookup requests has been shown in [13] to use O(log n) hops per route. The algorithm for routing a lookup request from a node is simple: forward the request to the last finger table entry that precedes the identifier of the key. The node preceding the destination node will detect that the key falls between itself and its successor and return information about its successor to the node performing the lookup. Figure 2.2 shows an example of the route a lookup request might take through a Chord network. In this figure, Node 770 is performing a lookup request for Key 665, which it finds stored at Node 678.


Figure 2.2. An example of the route taken by a lookup in a Chord network.
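The routing rule can be sketched as below. The ring membership is hypothetical, so the intermediate hops need not match the figure, but the lookup for Key 665 starting at Node 770 again terminates at Node 678.

```python
def successor(ident, nodes, m=10):
    """First node identifier equal to or following ident on the ring."""
    ordered = sorted(nodes)
    ident %= 2 ** m
    return next((n for n in ordered if n >= ident), ordered[0])

def between(x, a, b):
    """True if x lies in the clockwise half-open ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)

def route(start, key, nodes, m=10):
    """Trace a lookup for `key` from `start`: at each hop, forward to the
    last finger table entry that precedes the key's identifier."""
    ring = 2 ** m
    key %= ring
    path, current = [start], start
    while True:
        succ = successor((current + 1) % ring, nodes, m)
        if between(key, current, succ):   # succ is responsible for the key
            path.append(succ)
            return path
        nxt = succ                        # fall back to immediate successor
        for i in range(m):
            f = successor((current + 2 ** i) % ring, nodes, m)
            if between(f, current, (key - 1) % ring) and \
               (f - current) % ring > (nxt - current) % ring:
                nxt = f
        path.append(nxt)
        current = nxt
```

On the hypothetical ring [32, 275, 678, 770, 788, 950], route(770, 665, ...) returns [770, 275, 678]: the first hop wraps past zero to Node 275 (the last finger preceding the key), which then recognizes that Key 665 falls between itself and its successor, Node 678.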

For a new node to join a Chord network, it needs to know of any one node that is already in the network; finding such a node is done out of band. The joining node uses this node as a bootstrap node to perform a lookup on its own identifier. The node returned by this lookup will be the new node’s successor in the Chord ring. The new node then notifies its successor that it is that node’s new predecessor, and the successor informs its former predecessor that the newly joined node is now its successor. The joining node then uses its successor to perform the lookups needed to fill in its finger table. Since nodes join and leave continuously, each node must periodically re-perform these lookups, using a method called fix_fingers(), in order to keep its finger table up to date.
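The splice performed at join time can be sketched with a doubly linked ring. This is a simplified sketch: the naive successor walk below stands in for a real finger-table lookup, and finger table maintenance is omitted.

```python
class Node:
    def __init__(self, ident: int):
        self.id = ident
        self.successor = self      # a lone node is its own neighbor
        self.predecessor = self

def find_successor(start: Node, key: int) -> Node:
    """Naive walk around the ring (a real node would use its fingers)."""
    n = start
    while True:
        s = n.successor
        if n.id < s.id:
            found = n.id < key <= s.id
        else:                      # interval wraps past zero (or single node)
            found = key > n.id or key <= s.id
        if found:
            return s
        n = s

def join(new: Node, bootstrap: Node) -> None:
    """Look up our own identifier through a bootstrap node, then splice
    into the ring between that successor and its former predecessor."""
    succ = find_successor(bootstrap, new.id)
    pred = succ.predecessor
    new.successor, new.predecessor = succ, pred
    pred.successor = new
    succ.predecessor = new
```

For example, starting from a ring of nodes 100, 500, and 900, a node with identifier 300 that joins through any bootstrap ends up linked between 100 and 500.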

2.3.2. Pastry

Pastry is in many respects similar to Chord. As with Chord, nodes and keys are hashed to integer identifiers that are placed on a ring and range between zero and the maximum hash output value. The identifier length used in [11] is 128 bits. Unlike Chord, the node responsible for a key is the node whose identifier is numerically closest to the key’s identifier.


Nodes keep track of identifiers in base 2^b, where b is a configurable parameter that usually has a value of 4. The first major difference between Pastry and Chord is the organization of each node’s routing table. A Pastry routing table is arranged into log_(2^b) N rows, where N is the number of nodes in the system. Each row has 2^b - 1 possible column entries. The entry at row i and column j is any node whose identifier shares its first i digits with the routing node’s identifier and has j as the next digit.

Each node also keeps track of additional nodes in a leaf set and a neighbor set. The leaf set contains L nodes, where L is another configuration parameter, and consists of the L nodes with identifiers numerically closest to this node’s: the first L/2 precede this node on the ring and the last L/2 succeed it. The neighbor set consists of the “physically closest” nodes to this node (as opposed to numerically), where closeness is defined by a proximity metric such as ping time.

When routing a lookup request, the first place a node looks is its leaf set. If the sought-after identifier lies between the identifier of the first node in the leaf set and that of the last node in the leaf set, then the node responsible for the key is known and the lookup request can be forwarded to its destination. If the identifier is not in range of the leaf set, the node uses its routing table instead. The lookup request is forwarded to a routing table entry whose identifier shares a prefix with the key’s identifier that is at least one digit longer than the prefix the current node shares with it. If no such node exists, the lookup request is simply forwarded to the node closest to the destination from among all nodes in the routing table, leaf set, and neighbor set. The expected number of hops for a lookup to complete is log_(2^b) N.

To join the network, a node sends a lookup request for its own identifier through a bootstrap node. The joining node creates its routing table by copying, from each node along this route, the routing table row that node used to forward the request. Broken routing table entries are filled in reactively as the missing entries are detected.

The peers that a node can keep in its routing table are flexible in Pastry. Each routing table entry can be any node that meets the prefix requirement. This can be used to exploit locality, and each node can attempt to fill its routing table with the entries that offer the best performance. This is in contrast to Chord, where there is only one correct entry for each routing table row.

2.3.3. Content Addressable Networks (CANs)

A CAN is quite different from Chord and Pastry. In a CAN, node and key identifiers are points on a d-torus in a d-dimensional space, where d is a configuration parameter. Each node is responsible for a “zone,” which is a bounded area of the overall CAN space surrounding the node’s identifier point. All keys that hash to points within a node’s zone are the responsibility of that node.


For routing, each node needs to know only about the nodes with bordering zones, which is a fixed, small number of nodes. Nodes forward lookup requests to any node whose zone is closer to the destination. In many cases, this might be the node whose coordinates make the most progress toward the destination, but when routing a node may take into consideration a tradeoff between distance gained and locality. The number of hops required to reach the destination is O(n^(1/d)). Since d is fixed for any one CAN, this means that as network size increases, the number of hops on a CAN route increases faster than it does with Chord or Pastry.
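The greedy forwarding step can be sketched as follows. The coordinates are hypothetical points on the unit 2-torus, and the locality tradeoff mentioned above is ignored: the sketch always picks the neighbor closest to the destination.

```python
import math

def torus_distance(p, q, size=1.0):
    """Euclidean distance where every coordinate wraps (a d-torus)."""
    return math.sqrt(sum(min(abs(a - b), size - abs(a - b)) ** 2
                         for a, b in zip(p, q)))

def next_hop(neighbors, dest):
    """Greedy CAN forwarding: hand the request to whichever bordering
    zone's node is closest to the destination point."""
    return min(neighbors, key=lambda n: torus_distance(n, dest))
```

Note the effect of the torus: a node at (0.05, 0.1) is only 0.15 away from a destination at (0.9, 0.1) because the first coordinate wraps, so it beats a neighbor at (0.3, 0.1) that looks closer on a flat plane.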


3. Peer-to-Peer Protocol Security Issues and Related Work

Distributed hash tables face many security issues that Napster and Gnutella did not. Since the Napster directory was centrally managed, all of the security mechanisms for performing lookups needed to exist in only one place. That central management, however, proved to be Napster’s fatal weakness. With Gnutella, the lack of network structure meant that a node only needed to create overlay links to as many nodes as possible to be reasonably assured that its flooded lookup requests would propagate correctly through the system.

DHTs are both fully distributed and structured. Users must rely on other nodes in the system to follow the structured protocol correctly for the system to work. In the physical networks that make up the Internet, routers are controlled by trusted corporations and other entities that are unlikely to attack the overall system. With an overlay network, on the other hand, end users control the virtual routers in the system. The relatively small size of an overlay network and the ability of a user to control multiple nodes in the overlay give attackers a great opportunity to compromise the system.

Survey papers have been written that enumerate the attacks available to a malicious user of a DHT system [12, 14]. These papers provide useful information that should be considered by anyone trying to secure a DHT. This section reviews those attacks and presents work that has attempted to address them. We will also discuss how those approaches motivate, and differ from, the approach taken in this thesis.

3.1. Data Attacks

A simple attack that can be performed on DHT protocols is an attack on the data stored in the system. An attacker can deny the existence of data that his or her nodes are responsible for and can modify any legitimate data those nodes store. The attacker can also introduce compromised data into the system.

Data integrity is an application level security issue. The sole purpose of a DHT protocol is, given a key, to find the node responsible for that key. The behavior of that node after it is found is not the responsibility of the lookup protocol used to find it. However, DHT protocols can help in the response to these attacks once they are detected by associating multiple nodes with each key, a technique known as replication. Replication is needed not only in case of an attack, but also in case of node failures.

Many DHT protocols include replication features. In Chord, replicas are stored on the set of nodes that immediately succeed the node that the protocol specifies should store a given key. That way, if the responsible node fails, the node that becomes responsible for that key already holds the associated data at the time of the failure. Pastry stores replicas on the nodes closest to the responsible root node on the ring. CAN uses multiple hash functions to generate multiple identifiers for each key, which results in a random distribution of replica identifiers throughout the network.

From a security standpoint, it can be argued that the replication approach taken by CAN is the most resistant to attacks, as [14] does. If replicas are all stored in a cluster of contiguous nodes, a malicious node in the area could potentially deny access to the entire replica set during the lookup process. By spreading the replica nodes out over the system with multiple hash functions (or by simply hashing the key multiple times with the same function), we can reduce the likelihood of such an attack succeeding. This type of replication can easily be adapted to both Chord and Pastry, although doing so results in more overhead.
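The "hash the key multiple times" variant mentioned above can be sketched in a few lines; the scheme of salting the key with a replica index is one possible realization, not the thesis's design.

```python
import hashlib

def replica_ids(key: str, replicas: int, m: int = 160) -> list[int]:
    """Derive several well-spread identifiers for one key by hashing the
    key together with each replica index."""
    ids = []
    for i in range(replicas):
        digest = hashlib.sha1(f"{key}#{i}".encode()).digest()
        ids.append(int.from_bytes(digest, "big") % (2 ** m))
    return ids
```

Because each salted hash is effectively independent, the resulting identifiers land at unrelated points on the ring, so a single attacker region cannot cover the whole replica set.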

Since an attacker can deny the existence of data that he or she should be responsible for, when performing a lookup we need to check multiple replicas to be sure that the data really does not exist. Looking up multiple replicas in systems using multiple hash functions can be done in parallel so that the process does not add significant waiting time to the lookup process. Verifying the integrity of the received data is, again, outside the scope of a DHT protocol and outside of the scope of this thesis.

3.2. Identifier Attacks

If an attacker can position the nodes he controls in the network in such a way as to control all replica nodes for a data item, then replication may be rendered ineffective. This type of attack is possible when nodes are allowed to choose their own identifiers. An attacker can simply compute the identifiers of all of the replicas of a key and create nodes with those identifiers. An attacker can also place itself in strategic positions in order to force a victim to use the attacker’s routers for all routing table entries.

Any truly secure DHT protocol cannot allow nodes to choose their own identifiers. Identifiers must be assigned in a secure and verifiable fashion. We also cannot allow a node to simply keep generating new identifiers quickly, as this would, given enough time, let it obtain identifiers near the keys it wishes to attack.

One simple solution might be to force nodes to use the hash of their IP address as their identifier. This allows other nodes to easily verify the legitimacy of a node’s identifier and to ignore messages from nodes that are not using the correct identifier. However, in some cases an attacker may have a large range of IP addresses at his disposal, especially if IPv6 is being used. In this case, the attacker could hash IP addresses until he finds one that is close to an identifier he seeks and then use that IP address. Even when this is not the case, multiple users may be running nodes behind a NAT router and thus share the same IP address. We can hash the port that the P2P application is running on as well, but since users can choose their ports, this gives an attacker more available identifiers. In Chord, a node’s identifier is the hash of its IP address, port, and virtual node number. Since each user can run many virtual nodes, this gives an attacker access to an even wider array of available identifiers.
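The size of the attacker's search space is easy to see in a sketch. The exact serialization hashed here (and the example address) is an assumption for illustration; the point is only that ports and virtual node numbers multiply the identifiers one host can legitimately claim.

```python
import hashlib

def node_id(ip: str, port: int, vnode: int, m: int = 160) -> int:
    """Chord-style identifier from IP address, port, and virtual node
    number (the "ip:port:vnode" serialization is a hypothetical choice)."""
    digest = hashlib.sha1(f"{ip}:{port}:{vnode}".encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def identifiers_for_host(ip: str, ports: range, vnodes: int) -> set[int]:
    """Every identifier a single host can claim across its port and
    virtual node choices."""
    return {node_id(ip, p, v) for p in ports for v in range(vnodes)}
```

A single host free to pick among 100 ports and 10 virtual node numbers can already mint 1,000 distinct identifiers, which is what makes hashing candidates until one lands near a target key feasible.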


The allocation of IP address blocks is, technically, centrally managed by ICANN, so any application using IP addresses over the Internet can never be truly fully distributed. Working solutions to the identifier assignment problem can be achieved if we are willing to accept another centralized concept: certificate authorities (CAs).

A certificate authority can take a public key from a user and bind it to a random identifier chosen by the certificate authority. Nodes can verify the authenticity of other nodes' identifiers by checking the CA's signature. This has the added benefit of providing a public key infrastructure that can be used for exchanging messages between peers. The disadvantage is, of course, that a CA is a single point of trust and a single point of failure.
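As a rough illustration of the CA scheme (not an implementation from this thesis), the sketch below binds a public key to a CA-chosen random identifier. An HMAC under a CA-held secret stands in for a real asymmetric signature, purely to keep the example self-contained; all names are ours:

```python
import hashlib
import hmac
import os

# Secret held only by the hypothetical CA. A real deployment would use
# an asymmetric keypair so nodes can verify without any shared secret.
CA_SECRET = os.urandom(32)

def issue_certificate(public_key: bytes) -> dict:
    """Bind the node's public key to a random, CA-chosen identifier."""
    identifier = int.from_bytes(os.urandom(20), "big")  # 160-bit id
    payload = public_key + identifier.to_bytes(20, "big")
    sig = hmac.new(CA_SECRET, payload, hashlib.sha256).hexdigest()
    return {"public_key": public_key, "id": identifier, "sig": sig}

def verify_certificate(cert: dict) -> bool:
    """Check that the (public key, identifier) binding is CA-signed."""
    payload = cert["public_key"] + cert["id"].to_bytes(20, "big")
    expected = hmac.new(CA_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["sig"])
```

Because the identifier is chosen randomly by the CA, an attacker cannot steer his nodes toward a chosen key, and forging a binding requires breaking the signature.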

An attack related to the identifier attack is the Sybil attack [4]. A Sybil attack is an attack where a single attacker joins a peer-to-peer network with numerous identities, giving that attacker control of a large portion of the network. If an attacker gains control of a large enough portion of the network, the redundancy features used to recover denied or corrupted data can be rendered ineffective. An attacker who controls a large enough fraction of the network will control almost all of the data in the overall system. The attacker will also control most of the routers in the system, and can disrupt lookup requests travelling through the network.

This attack can occur when a system does not take measures to associate distinct entities with distinct identities. We would like each entity to be associated with a maximum of some small, constant number of identities. In a perfect world, we would also like identity assignment and verification to be performed in a completely distributed way, such as with a web of trust. Unfortunately, [4] shows that no system that uses a fully distributed identity verification method will be completely invulnerable to a Sybil attack.

While some papers have put forward distributed solutions that prevent Sybil attacks to a certain extent (for example, [2, 3]), the only solution that works completely is to use a central certificate authority. Trusted certificate authorities are proposed for DHTs by [1, 12, 14]. Since it may be unreasonable to expect the trusted authority to verify real-world identities, [1] proposes charging a fee for each certificate to limit the number of identities an attacker is willing to obtain. Another proposal is to force nodes to solve computational puzzles, an idea rejected by both [1] and [4]: puzzles must be easy enough for the slowest machines to solve in a reasonable amount of time, yet hard enough to prevent an attacker with large resources from obtaining many certificates quickly, which is not a reasonable combination of requirements.

In this thesis, we are not proposing a defense against Sybil attacks. Our defense is for a system with some minority fraction of the nodes compromised. Our main goal is to prevent routing attacks in a system where an attacker (or group of attackers) manages to compromise a subset of the legitimate nodes in the system. We will assume some Sybil attack defense mechanism is in place, such as a certificate authority charging money for certificates. The centralized nature of the certificate authority is unfortunate, but unavoidable as shown by [4].

3.3. Routing Attacks

Building on section 3.2., we will now assume that malicious nodes cannot choose their location in the overlay network and that an attacker cannot completely overwhelm the network by creating an unlimited number of identities. Even with these security mechanisms in place, an attacker controlling even a small fraction of randomly placed nodes can seriously disrupt the system. While these nodes can compromise the data they are supposed to be storing, replication allows us to find a good alternate node provided there are enough replicas. The real problem posed by the attacking nodes is their ability to compromise the DHT lookup routing protocol.

There are two ways to route through a DHT-based peer-to-peer network: recursively and iteratively. Recursive routing means that a lookup request is sent from hop to hop through the overlay network until it reaches its destination, which can then respond either directly to the node performing the lookup or by sending a response backwards along the lookup's path. Iterative routing means that the node performing a lookup contacts each node on the route one by one and asks for the next hop towards the destination. The disadvantage of iterative routing is that we must send a query to and receive a response from every node on the route, so lookups take about twice as long as they do with the recursive method when the destination responds directly to the node performing the lookup. The advantage is that iterative routing gives the node performing a lookup complete control over the routing process.

Both recursive and iterative routing can be compromised if a malicious node is encountered on the path to a lookup's destination. A malicious node can drop the lookup request, forward it to the wrong node, or respond with the wrong destination. With iterative routing, we are also vulnerable to an attack where malicious nodes keep sending us from one incorrect malicious node to another indefinitely, without ever reaching the destination. With recursive routing, this indefinite routing attack is treated the same way as a dropped packet: the lookup request was sent out and no response was received. It is important to note that since all DHTs must be fault tolerant, they all must deal with dropped lookup requests to some extent, as drops occur occasionally through non-malicious node failures. A malicious node may choose not to behave the same way at all times or when handling lookup requests from different parts of the system. While a normal node that has failed would be removed from the network, a malicious node can behave just well enough to remain in the network and then drop all lookup requests that it receives.

A lookup request needs to reach only one malicious node before the lookup is compromised. If the average hop count is h and the fraction of malicious nodes is f, then the probability of a route not containing any malicious nodes is (1-f)^h [14]. In Chord, the average hop count in an n-node network is approximately (1/2) log2(n). With a 1,000 node network, this means we can expect an average hop count of around 5. If 25% of nodes in the system are compromised, the probability of a lookup request avoiding every malicious node is 0.75^5, which is about 24%. So in this case, an attacker only needs to control 25% of nodes to disrupt 76% of lookups.
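The arithmetic above can be checked with a few lines of Python (the function name is ours):

```python
import math

def clean_route_probability(n: int, f: float) -> float:
    """Probability that a lookup avoids all malicious nodes, using the
    Chord approximation h = (1/2) * log2(n) hops and (1 - f)^h [14]."""
    h = 0.5 * math.log2(n)
    return (1 - f) ** h

# 1,000 nodes, 25% malicious: about 5 hops, roughly 24% of lookups
# traverse only honest nodes.
p = clean_route_probability(1000, 0.25)
```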

The effects of a routing attack may be exacerbated in systems that do not have a constrained routing table. A constrained routing table is a table where each entry has only one possible correct value. Chord has a constrained routing table: for a node with identifier n, the only correct entry for finger table entry i is the node that succeeds the value n + 2^i (mod 2^m) on the identifier ring. Pastry, on the other hand, does not have a constrained routing table. For a particular routing table entry in Pastry, any node that meets the prefix requirements is valid. Pastry tries to fill these entries with the matching nodes that have the best locality measurement in order to optimize performance. This can allow attackers to fake locality and increase the odds that their nodes are used as routing table entries by others, as shown by [1]. [1] also shows that it is easier in Pastry for an attacker to supply malicious nodes as routing table updates, especially for the top rows, since it is likely that an attacker controlling any significant fraction of the system will control at least one node for each short prefix.

Aside from the routing tables being constrained, routing table entry selection should be constrained as well, as pointed out in [12]. Otherwise, a malicious node can simply route using only the malicious nodes that appear in its constrained routing table. CAN, for example, allows each node to decide which node to route to next based on a tradeoff between progress towards the destination and round-trip time to the next hop. Since next hop selection is not constrained, we cannot verify that our lookup request is being routed correctly.

There are several design principles proposed by [12] for securing DHT protocols. The first of these is to define verifiable system invariants and verify them, and the second is to allow the node performing a lookup to observe lookup progress. The idea here is to use constrained, iterative routing. We should verify that those constraints are being met as we are routing. This is one of the major principles behind the proposals that we are making in this thesis. We propose here a way for verifying system invariants and for reacting to situations in which those invariants are not met.

One solution for avoiding routing attacks is proposed in [1]. This is a solution that works with the Pastry DHT. The goal is to successfully retrieve a set of replicas for a given key, where the replicas are a subset of the neighbor set for the root node responsible for a key. This is a contiguous set of nodes. A node performing a lookup will use its own neighbor set to compute the average numerical distance between node identifiers in the identifier ring. This value is then compared to the average distance between node identifiers in the replica node set that is returned from a lookup request. If the average distance between identifiers in the replica set is too large compared to our own computed average then it is determined that a malicious replica set was received.


If a node performing a lookup determines that a replica set is malicious, numerous lookup requests are then sent through the node’s neighbor set. These neighbor nodes will use a separate, constrained routing table to route the lookup requests through the network. The original Pastry protocol does not use constrained routing tables, so a separate table is kept. Since each node’s constrained routing table is different and not based on performance metrics, when a lookup request is sent through different neighbors it should take a diverse set of routes towards the destination. The set of replica sets received in response is combined, and all of these nodes are contacted and asked to provide their neighbor set. Any new nodes found are then asked to provide their neighbor set as well, and as long as new neighbor nodes are provided, this process is repeated up to three times. When this is completed, the closest nodes found to the key’s identifier are determined to be the correct replica set. This method was shown to find the correct replica set over 99.9% of the time when up to 30% of the nodes in a 100,000 node system are compromised.

Our proposed system also makes use of the average node identifier distance in the network, but it differs in that we use this information to actively avoid routing attacks during the routing process instead of reacting to them after the lookup request has completed. We use the average node identifier distance to verify system invariants while observing the lookup process, the two principles proposed by [12]. These ideas are explicitly rejected for Pastry by [1], which claims that this approach would add too many extra hops and not be very accurate. Our results with Chord, however, show a significant increase in routing success over the base Chord protocol. [1] relies on performing a large number of parallel lookups for the same identifier to find the responsible replica set, while our system does not.

A mechanism for defending against a “Byzantine join attack” is proposed by [5]. This proposed system, called S-Chord, modifies Chord routing tables to make use of swarms instead of individual peers. Swarms consist of the set of all nodes whose identifiers fall within (C ln n) / n of the swarm's point location, where C is a configurable constant and n is the number of nodes in the system. Each lookup request is forwarded from all of the nodes in one swarm to the next. If the number of Byzantine peers joining in a time period is below some configurable threshold, then the correct successor swarm for a key is found with high probability. This defense mechanism, however, requires each node to keep track of O(log^2 n) nodes, and each lookup request consists of O(log^2 n) messages.

Another mechanism for secure routing with Pastry is proposed by [8]. The idea is to move untrusted nodes into separate Pastry rings that interface with the main Pastry ring via two anchor nodes. Messages traveling through the main Pastry ring can then bypass the untrusted rings. The mechanism for deciding which nodes should be trusted and which should not is left as future work. With the ability to perfectly detect untrustworthy nodes, the probability of a lookup request completing successfully equals the fraction of nodes in the system that are trustworthy. An actual trust system is not provided by [8], however, and creating one remains an open problem. Our proposed modifications to Chord do not rely on any trust system.


4. Chord Secure Routing Design

This section describes the changes we are making to the Chord protocol in order to avoid routing attacks.

4.1. Threat Model

In order to design a defense, we first need to understand the attacks we are defending against. The purpose of this thesis is to propose a method of avoiding routing attacks. We will assume that attackers cannot choose their node identifiers; this can be achieved by using a certificate authority, as shown in Section 3.2. We will assume that some fraction of the overall set of nodes is compromised and that all of these nodes can collude with each other. We will also assume that the attacking nodes are a minority of the nodes in the system. This is an important assumption: we are not designing a defense against Sybil attacks.

We assume an attacker has the following basic capabilities:

• Attackers can drop lookup requests.
• Attackers can forward lookup requests to incorrect nodes.
• Attackers can direct lookup requests to other malicious nodes in any manner they wish.
• Attackers can be selective about which lookup requests they respond to correctly and which they do not.

In short, an attacker can do anything it wants with a lookup request that is sent to one of its nodes.

To evaluate the performance of this system, we will test against three different types of attacks. These attacks are designed to represent the most effective lines of attack available to a malicious node.

• Dropping Lookup Requests. This is a simple type of attack. When a malicious node performing this type of attack receives a lookup request, the node simply does not respond. The system must be designed to recover from nodes that drop lookup requests.

• Randomly Misrouting Lookup Requests. With this type of attack, the attacker does not drop the lookup request but instead sends the victim to some random next hop. This may be another misrouting node that sends the victim off in yet another random direction, preventing the victim from ever reaching the destination. This is more difficult to defend against than the lookup-dropping attack, since the attacking node appears to cooperate by providing a next hop.

• Performing a Sub-ring Attack. In this type of attack, a group of colluding attackers tries to cause lookup requests to end up at malicious nodes. Each attacker keeps two finger tables: the correct finger table that it uses for itself, and a second table consisting of the first malicious node succeeding each entry of its correct finger table. When an attacker receives a lookup request, it forwards the request using the malicious finger table. The lookup request is therefore “captured” by the attackers: it will only be forwarded through malicious nodes and will reach a malicious destination, which will often not be the correct destination. This attack is the most difficult to detect since the attackers appear to cooperate and route correctly. Each hop makes progress towards the destination, but the ultimate destination will always be a compromised node.

4.2. Design Overview

The main idea behind the proposed system is to use locally known statistical data about the average numerical difference between consecutive node identifiers to detect routing attacks during the routing process and to recover from detected attacks. We store the identifiers of the successors and predecessors of the nodes in our finger table for the purpose of computing the average numerical distance between node identifiers. We use a pruning mechanism to remove distance samples that are likely the result of malicious nodes in our finger table. As we route towards our destination, we use the computed average distance to determine whether the hops we encounter are likely valid or invalid based on their distances from routing table reference points called finger pointers. When an invalid hop is detected, we backtrack to the previous node on the route and request a next hop that makes less progress towards the destination, in an attempt to avoid the node that provided us with the invalid hop.
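The following hypothetical sketch shows one way the locally known statistics could be computed from identifier gaps; the exact pruning rule our system uses is described later, so the one-pass outlier cut below is only an illustrative placeholder:

```python
import statistics

def identifier_gaps(sorted_ids, m_bits):
    """Circular gaps between consecutive identifiers on a 2^m ring."""
    ring = 2 ** m_bits
    n = len(sorted_ids)
    return [(sorted_ids[(i + 1) % n] - sorted_ids[i]) % ring
            for i in range(n)]

def gap_statistics(samples, prune_factor=3.0):
    """Mean and population stdev of gap samples after pruning samples
    more than prune_factor standard deviations above the mean.
    (Placeholder rule; the thesis's pruning parameter is separate.)"""
    mean = statistics.mean(samples)
    sd = statistics.pstdev(samples)
    kept = [s for s in samples if s <= mean + prune_factor * sd]
    return statistics.mean(kept), statistics.pstdev(kept)
```

Pruning matters because a single malicious (or simply unlucky) sample with a huge gap would otherwise inflate both the mean and the standard deviation, loosening every later verification check.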

In order to control the routing process, we will use iterative routing as opposed to recursive routing. A recursive lookup occurs when a node performing a lookup sends the request out into the network and lets the other nodes forward it towards its destination. This gives the user no control over the route their lookup request takes. With iterative routing, the node performing the lookup contacts each node on the route towards the destination one by one and requests the next hop. This gives the node performing the lookup control over the routing process.

Finger pointers are the identifier values that a node looks up in order to fill in a finger table entry. For a node with identifier id and finger table row i, the value of the finger pointer is id + 2^i (mod 2^m). Since a finger table entry's pointer falls somewhere between two consecutive nodes, the distance between the pointer and the identifier of the node in the entry should be less than the distance between those two nodes. We compare this distance to the average distance between nodes, which we compute from our own finger table. If the distance is too large, the hop fails verification. We backtrack around nodes that either do not respond to our requests or provide a hop that fails verification.

4.4. The Backtracking Algorithm

Normally, the hop we take during the Chord routing process is the hop that most closely precedes the identifier of the key we are seeking. In our system, when a faulty node is detected during the routing process, we fall back to the previous node on the route and use its next closest preceding node to the destination. This offers less progress, but gives us a way to route around the faulty node. If a node runs out of hops to give because it has no more nodes in its routing table that precede the destination, we fall back to the previous node on the route and use its routing table, repeating the process.

All nodes that are determined to be faulty or malicious, or that have run out of hops to give, are stored in a temporary “black list” created for each lookup request, so that these nodes are never used again during that lookup. To avoid querying the same node multiple times, we request the entire finger table of a node the first time it appears on the path towards the destination and cache it for the duration of the lookup attempt.

There is a limitation in the Chord protocol that we must address. With Chord, every lookup request must be routed through the node that immediately precedes the node responsible for the key being sought. This means that if the node preceding the destination node is faulty, backtracking by itself cannot find a way around this node. To address this, each node is made aware of the identifier of the predecessor of every node in its finger table, and this identifier is stored as an extra column in the finger table. As long as we can find a non-faulty node with the destination node in its finger table, we can identify that node as the destination node by verifying that the key identifier we are seeking falls between that node’s identifier and its predecessor’s identifier. This allows us to bypass faulty nodes immediately preceding the destination. Since nodes often have many finger table entries for nearby identifiers, as we get closer to the destination we have a good chance of being able to find a node with the correct destination in its finger table, allowing us to bypass faulty nodes preceding the destination.
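The predecessor check described above amounts to a circular interval test. A minimal Python sketch (function names are ours) might look like:

```python
def in_ring_range(x, a, b, m_bits=160):
    """True if x lies in the half-open ring interval (a, b] of a
    2^m identifier ring, handling wrap-around at zero."""
    ring = 2 ** m_bits
    return (x - a) % ring != 0 and (x - a) % ring <= (b - a) % ring

def is_destination(key, node_id, predecessor_id):
    """A node is the destination for `key` when the key falls between
    its predecessor's identifier (exclusive) and its own (inclusive),
    which is exactly the check used to identify a destination from a
    finger table entry and its stored predecessor column."""
    return in_ring_range(key, predecessor_id, node_id)
```

With the predecessor stored alongside each finger table entry, any honest node whose table contains the destination lets us run this test locally, without asking the (possibly faulty) node immediately preceding the destination.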

We need to be careful about the circumstances under which we bypass nodes on the route to the destination. If we request a node's finger table and see that it contains a reference to what appears to be the destination node, we might be tempted to use this reference and bypass the rest of the routing process. The problem with bypassing this way is that nodes further from the destination are more likely to have out-of-date successor information: the destination may have changed while that node has not yet called its fix_fingers() method. Therefore, we only use bypassing as a last resort. We do not immediately bypass the rest of the nodes on a route when a node on the route knows of the destination. Instead, we bypass only if that node has run out of any other hops that we can use.

We illustrate the concept of backtracking and bypassing in Figure 4.1. In this situation, the first and second hops have passed verification. The node reached during the second hop has a reference to the destination node in its finger table, which we have verified by checking that the identifier of the key we are looking for falls between that node and its predecessor. We do not immediately bypass, and instead route to the closest preceding node (hop 3). Hop 3 provides a next hop that fails verification, so we fall back and ask for the second closest preceding node (hop 4). Hop 4 also provides a next hop that fails, so we fall back again and try the third closest preceding node (hop 5), which again provides a next hop that fails verification. This time when we fall back, the node we are using has no more preceding hops, so we now bypass and reach the destination with hop 6.

Figure 4.1. An illustration of backtracking and bypassing.

The modified closest_preceding_node() algorithm is shown in Figure 4.2. Since we are not always returning the closest preceding node, we have renamed this algorithm next_hop(). This function takes four input variables. The input variable id is the identifier being sought. The input variable index specifies which preceding node we want to obtain from the finger table: an index value of 1 means we want the closest preceding node, a value of 2 means the second closest preceding node, and so on. The index variable will have a value greater than 1 when we are trying to route around the node that was the closest preceding node because it either provided a hop that failed verification or ran out of preceding hops that we could use. The input variable nodeid is the identifier of the node that we are obtaining the next hop from, and the input variable fingertable is the finger table of that node.

The local variable uc in next_hop(), which is short for “unique count,” is the number of unique preceding nodes that have been counted while looking for the index-th closest preceding node.

The first thing the next_hop() algorithm does is check whether the identifier we are looking for falls between the node we are using and its successor. If it does, and the value of index is 1, then we can return the successor of the node we are using, which is the destination node. If index is greater than 1, it means we are trying to find a preceding node further back than this node's successor, which is impossible, so we return null.


n.next_hop(id, index, nodeid, fingertable):
    uc = 0
    if id in range (nodeid, fingertable[1]]:
        if index == 1:
            return fingertable[1]
        else:
            return null
    bypassnode = null
    for i = m down to 1:
        if id in range (fingertable[i].predecessor, fingertable[i]]:
            bypassnode = fingertable[i]
        if fingertable[i] in range (nodeid, id]:
            if i == m or fingertable[i] != fingertable[i+1]:
                uc = uc + 1
                if uc == index:
                    return fingertable[i]
    return bypassnode

Figure 4.2: Revised closest preceding node algorithm.
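For concreteness, here is an illustrative Python port of the Figure 4.2 pseudocode over a toy 6-bit ring (the node identifiers below follow the classic Chord example network; the data layout and names are our own, not part of the thesis's simulator):

```python
from collections import namedtuple

# fingers[0] is unused so that fingers[i] matches finger table entry i,
# whose pointer value is (id + 2^i) mod 2^m as in the text.
Finger = namedtuple("Finger", ["id", "predecessor"])
M_BITS = 6
RING = 2 ** M_BITS

def in_range(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    return (x - a) % RING != 0 and (x - a) % RING <= (b - a) % RING

def next_hop(key, index, node_id, fingers):
    """Return the index-th unique preceding node for key, the
    destination if key falls between this node and its successor,
    or the bypass candidate / None when no usable hop remains."""
    uc = 0  # unique preceding nodes counted so far
    if in_range(key, node_id, fingers[1].id):
        return fingers[1] if index == 1 else None
    bypassnode = None
    for i in range(M_BITS, 0, -1):
        f = fingers[i]
        if in_range(key, f.predecessor, f.id):
            bypassnode = f  # looks like the destination; keep as last resort
        if in_range(f.id, node_id, key):
            if i == M_BITS or f.id != fingers[i + 1].id:
                uc += 1
                if uc == index:
                    return f
    return bypassnode
```

For node 8 in the classic ring {1, 8, 14, 21, 32, 38, 42, 48, 51, 56}, entries 1-3 all point at node 14, so backtracking with a higher index skips the duplicates, and when the preceding entries are exhausted the function falls through to the bypass reference.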

The next step is to go through the finger table, from the last entry down to the first, just as in the unmodified closest_preceding_node() algorithm described in [13]. If we look at a finger table entry and determine that the identifier we are looking for falls between it and its predecessor, we save a reference to that node so it can be returned as the destination later, but only if no other valid next hop is found. As soon as we find a node that precedes the identifier we are looking for, we start increasing our unique counter (uc). After that, we only increase the unique counter when a finger table entry is different from the finger table entry that follows it. Once the unique counter equals the index we are looking for, we return the finger table entry that we were checking.

If we get all the way through our finger table and determine that we do not have enough entries for the requested index, then we simply return null. A return value of null indicates to the caller that this node has no more next hops to give. A return value that precedes the requested identifier indicates to the caller that the returned node is the next hop to the destination. A return value that succeeds the requested identifier indicates to the caller that the returned node is the destination that is responsible for the key being sought.

The next algorithm, shown in Figure 4.3, is the main routing algorithm: a modified version of find_successor(). This algorithm makes use of the next_hop() algorithm already described and of a verify_hop() algorithm that will be described later.


Our find_successor() algorithm keeps track of a stack of Chord nodes that are used during the routing process, called the routing stack. We use the word “router” to describe a node whose finger table is being used for routing purposes. This stack contains the identifier, finger table, and current backtracking index of every router on it; we show this as a stack of <id, fingertable, index> tuples that represent these three pieces of information. We also keep track of a blacklist, which is a list of routers that have either given us hops that failed verification or that have run out of nodes in their finger tables to give us as next hops. We also keep track of the number of hops taken so far in the variable attempts, and we allow the user to specify a limit on the maximum number of hops that will be used. The first thing pushed onto the routing stack is the routing table information for the node that is performing the lookup.

n.find_successor(id, hoplim):
    routerStates = new STACK of <id, fingertable, index> tuples
    blackList = new LIST of node identifiers
    attempts = 0

    routerStates.push(<n.identifier, n.fingertable, 0>)

    while !routerStates.isEmpty() and attempts < hoplim:
        curRouterState = routerStates.pop()
        curRouterState.index++
        nextHop = next_hop(id, curRouterState.index,
                           curRouterState.id, curRouterState.fingertable)
        if nextHop != null and ((verify_hop(curRouterState.id, nextHop.id,
                curRouterState.index) and !routerStates.contains(nextHop))
                or curRouterState.id == n.id):
            if id in range (curRouterState.id, nextHop.id]:
                return nextHop
            else if blackList.contains(nextHop.id):
                routerStates.push(curRouterState)
            else:
                routerStates.push(curRouterState)
                routerStates.push(<nextHop.id, nextHop.fingertable, 0>)
                attempts++
        else:
            blackList.add(curRouterState.id)

    return null

Figure 4.3. The algorithm for finding the successor node of an identifier.

Our main routing loop runs as long as there are routers still available on the routing stack and we have not exceeded the maximum number of hops. During each iteration, we pop the top router off the stack, increment its backtrack index, and perform a next hop lookup with its information. The first time a router is popped off the stack, its index value will be set to 1, meaning our next_hop() algorithm will use the closest preceding node in that router's finger table. If it is ever popped off the stack a second time, that value will be set to 2, meaning next_hop() will return the second unique preceding node in that router's finger table. This is how backtracking to route around problem nodes is performed. Once we have performed a next hop lookup, we make sure the lookup actually returned a next hop, use verify_hop() to check that the hop is valid, and check that the next hop is not already on the routing stack. If it passes these tests (and we allow our own node to always pass these tests), then we perform another test. If the returned node succeeds the identifier we are seeking, then it is the destination node, and we return it. If the returned node is on the blacklist, then we simply push the router we were using back onto the stack; during the next iteration its index will be increased and an earlier finger table entry will be used. If the next hop is not the destination and is not on the blacklist, then we contact that node and request its finger table. We then add this next hop to the routing stack with an initial index of 0.

If a hop fails verification, then we add the router that provided the hop to the blacklist. We do nothing to the routing stack. The previous router on the path will be used during the next iteration with a higher index value, so we will use a finger table entry that offers less progress.

If our router stack ever empties or we exceed the maximum number of hops, the lookup has failed and we return null to indicate that this has happened.

Figure 4.4. A sample route using the modified algorithm

Figure 4.4 shows an example of how routing works in the modified system. In this example, the dark nodes are behaving and the white nodes are compromised. The source node and the node responsible for the key being sought are both labeled. The first hop uses the closest preceding node in the source node's finger table. This node then provides a reference to a next hop that fails the verification algorithm. Our algorithm then goes back to the source node's finger table and calls next_hop() again with an index value of 2, which returns a reference to the second closest preceding node, shown as hop 2. The rest of the routing process completes without incident and the destination node is found.

4.5. The Hop Verification Algorithm

As you will recall, a finger table is a table with m entries, where m is the length in bits of the identifiers used in the network. Entry number i in the finger table points to the first node that succeeds the value (id + 2^i) mod 2^m; we call this the pointer value for entry i. We know that the pointer value points to a spot on the Chord ring that falls between the finger table entry and its predecessor. In order to verify that a hop is legitimate, we verify that the distance between a finger table pointer and the identifier of the node for that entry falls within the typical range we would expect between nodes in the Chord ring. This typical value is computed by averaging locally known distance values obtained from the data stored in our finger table, as described in the next subsection. The verify_hop() algorithm is shown in Figure 4.5.

n.verify_hop(firstNodeId, secondNodeId, indexUsed)
    fingerPointer = (firstNodeId + pow(2, indexUsed)) % pow(2, m)
    distance = (secondNodeId - fingerPointer) % pow(2, m)
    acceptableDistance = AVG_DISTANCE + (SD_MOD * STD_DISTANCE)
    if (distance > acceptableDistance):
        return false
    else:
        return true

Figure 4.5. The verify_hop() algorithm

The input parameters are the identifier of the node that provided the hop, the identifier of the node at the end of the hop, and the index we used to look up this hop in the routing table of the providing node. The distance variable is the clockwise difference between the identifier of the node at the end of the hop and the finger pointer used to point to it from the providing node. If this value is higher than the value we are willing to accept, we reject the hop; otherwise we accept it.

The acceptable distance is computed using three values. AVG_DISTANCE is the average distance between nodes, computed from locally known distance samples. STD_DISTANCE is the standard deviation of those samples. SD_MOD is a system parameter that controls how many standard deviations over the average we are willing to allow a node/finger pointer distance to be; it provides a way to balance false positives against false negatives. The acceptable distance is, as shown, the average distance plus the standard deviation scaled by SD_MOD.
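A concrete Java rendering of the check in Figure 4.5 may make the arithmetic clearer. Since Chord identifiers are m-bit values, BigInteger is used; AVG_DISTANCE, STD_DISTANCE, and SD_MOD would normally come from the node's locally maintained statistics, but in this sketch they are simply passed in as parameters.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Minimal sketch of the verify_hop() check from Figure 4.5. The class and
// parameter names are illustrative, not the simulator's actual API.
public class HopVerifier {
    public static boolean verifyHop(BigInteger firstNodeId, BigInteger secondNodeId,
                                    int indexUsed, int m,
                                    BigInteger avgDistance, BigInteger stdDistance,
                                    double sdMod) {
        BigInteger ringSize = BigInteger.ONE.shiftLeft(m);               // 2^m
        // fingerPointer = (firstNodeId + 2^indexUsed) mod 2^m
        BigInteger fingerPointer =
                firstNodeId.add(BigInteger.ONE.shiftLeft(indexUsed)).mod(ringSize);
        // clockwise distance from the pointer to the hop's endpoint
        BigInteger distance = secondNodeId.subtract(fingerPointer).mod(ringSize);
        // acceptableDistance = AVG_DISTANCE + SD_MOD * STD_DISTANCE
        BigInteger acceptable = avgDistance.add(
                new BigDecimal(stdDistance).multiply(BigDecimal.valueOf(sdMod))
                                           .toBigInteger());
        return distance.compareTo(acceptable) <= 0;
    }
}
```

Note that BigInteger.mod always returns a non-negative result, which handles hops that wrap around the top of the identifier space.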


By forcing the next hop to fall within an acceptable distance of the routing node’s finger table pointer value, we are tightly restricting where in the Chord ring the next hop may be. If an attacker does not have control of any nodes in that acceptable area, the attacker cannot fool us into using another attacking node as the next hop. Since nodes cannot arbitrarily place themselves wherever they wish in the Chord ring, it becomes much more difficult for a malicious node to have our lookup request forwarded to any node except the correct next node on the route.

4.6. Maintaining Statistical Data

In order to compute the average distance between nodes in the system, we will store some extra information about each node in our finger table. Figure 4.6 shows what a row in our finger table looks like.

The index, node identifier, and node remote reference columns appear in the normal Chord protocol's finger table. We also store the identifier of the node's predecessor and the identifiers of the nodes in its successor list. Knowing these identifiers allows us to generate samples for computing the average distance between nodes and the standard deviation of those distances. All of this data is obtained from a node when we call our fix_fingers() method to update a finger table entry.

Index | Node Identifier | Node Remote Reference | Node Predecessor Identifier | Node Successor List Identifiers

Figure 4.6. Row format for the modified finger table.

It is possible for a node to lie to us about the identifiers of its predecessor and successors if we do not take extra precautions. However, since we are operating under the assumption that nodes are not allowed to decide their own identifiers and that identifiers are verifiable, we can make sure that the provided identifiers are valid by storing some extra information. For example, if a certificate authority is being used, we can store signed node certificates, which contain a node's identifier, IP address, public key, and a signature of the granting authority. We also need a mechanism to prevent a node from using certificates of nodes that are not currently in the network. This can be done by requiring that nodes occasionally request the current date and time, signed with the responder's private key, from their predecessor and successors. This data can be requested by nodes when they update their finger tables, and the signature can be verified with the responding node's public key, allowing us to confirm that the node has recently been in the network. In the worst case, we can at least ping each of these nodes occasionally to make sure they really are in the network.

Even with these protection mechanisms in place, the attacker can still provide real node certificates for nodes that are not actually its predecessor and successor. This would cause the victim to calculate an average node distance that is higher than the actual


average that should have been calculated. To prevent this type of attack, we prune our data set.

The distances between node IDs on the Chord ring are exponentially distributed, as they are for any ring-based DHT with randomly assigned node IDs [1]. A useful property of the exponential distribution is that its mean equals its standard deviation. This means that the average distance between nodes and the standard deviation of these distances should be approximately equal. Since malicious nodes can provide successor and predecessor identifiers that are not consecutive to their own, we would expect the distances derived from those malicious nodes to be greater than the true distances, which would cause our computed standard deviation to exceed our computed average.
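The mean-equals-standard-deviation property is easy to check numerically. The sketch below places random identifiers on a ring, measures the gaps between consecutive identifiers, and compares their mean and standard deviation; the ring size, node count, and seed are arbitrary choices for illustration and are not taken from the simulator.

```java
import java.util.Arrays;
import java.util.Random;

// Quick numerical check of the claim above: for randomly placed node IDs on
// a ring, the consecutive gaps are approximately exponentially distributed,
// so their mean and standard deviation come out roughly equal.
public class GapStats {
    /** Returns {mean, standardDeviation} of the ring gaps between sorted ids. */
    public static double[] meanAndStd(long[] sortedIds, long ringSize) {
        int n = sortedIds.length;
        double[] gaps = new double[n];
        for (int i = 0; i < n; i++) {
            long next = sortedIds[(i + 1) % n];
            gaps[i] = Math.floorMod(next - sortedIds[i], ringSize); // wraps around
        }
        double mean = 0;
        for (double g : gaps) mean += g;
        mean /= n;
        double var = 0;
        for (double g : gaps) var += (g - mean) * (g - mean);
        var /= n;
        return new double[]{mean, Math.sqrt(var)};
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 10000;
        long ringSize = 1L << 40;
        long[] ids = new long[n];
        for (int i = 0; i < n; i++) ids[i] = Math.floorMod(rnd.nextLong(), ringSize);
        Arrays.sort(ids);
        double[] ms = meanAndStd(ids, ringSize);
        System.out.printf("mean=%.1f std=%.1f ratio=%.3f%n", ms[0], ms[1], ms[1] / ms[0]);
    }
}
```

With 10,000 identifiers the ratio of standard deviation to mean lands close to 1, which is exactly the regularity the pruning step below relies on.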

We prune our sample set by throwing out the highest values until the computed standard deviation is “close enough” to the average. We do this with another system parameter, called the pruning parameter. We remove the highest distance sample values until the standard deviation of what is left is less than the average distance scaled by the pruning parameter. This algorithm is shown in Figure 4.7.

n.calculate_statistics(distanceSamples, pruningParameter)
    done = false
    average = stdeviation = 0
    distanceSamples.sortAscending()
    while (!done):
        average = AVG(distanceSamples)
        stdeviation = STDEV(distanceSamples)
        if (stdeviation > average * pruningParameter):
            distanceSamples.remove(distanceSamples.size() - 1)
        else:
            done = true
    return (average, stdeviation)

Figure 4.7. The calculate_statistics() algorithm

The justification for throwing out the largest distances when pruning (to obtain an average and standard deviation that are reasonably close) is that a node can lie by providing nodes that are further away than its actual successors and predecessor, but it cannot lie about a node that is too close, because no node exists that is closer than the correct successors and predecessor. The effectiveness of pruning is evaluated in Section 6.3.2.
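The pruning loop of Figure 4.7 translates directly into Java. This sketch uses doubles for brevity (the simulator itself works with BigInteger identifiers) and adds the "stop once half of the samples are gone" guard that is described with the GUI's pruning parameter in Section 5; the class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Java translation of the calculate_statistics() pruning loop from Figure 4.7.
public class DistancePruner {
    /** Returns {average, standardDeviation} after pruning the largest samples. */
    public static double[] calculateStatistics(List<Double> distanceSamples,
                                               double pruningParameter) {
        List<Double> s = new ArrayList<>(distanceSamples);
        Collections.sort(s);                       // ascending: largest at the end
        int minSize = Math.max(1, distanceSamples.size() / 2);
        while (true) {
            double average = mean(s);
            double stdev = std(s, average);
            if (stdev > average * pruningParameter && s.size() > minSize) {
                s.remove(s.size() - 1);            // drop the largest (possibly
            } else {                               // lied-about) distance sample
                return new double[]{average, stdev};
            }
        }
    }
    private static double mean(List<Double> s) {
        double sum = 0;
        for (double v : s) sum += v;
        return sum / s.size();
    }
    private static double std(List<Double> s, double m) {
        double var = 0;
        for (double v : s) var += (v - m) * (v - m);
        return Math.sqrt(var / s.size());
    }
}
```

Feeding the loop a sample set with a few inflated distances shows the intended effect: the outliers drive the standard deviation above the average, so they are discarded until the two statistics are in line.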

4.7. Joining the Network

In order to successfully join the Chord network in the presence of attackers, we need to make a few changes. We must know of some set of uncompromised bootstrap Chord nodes that are already in the system in order to join it securely. If we try to join with

compromised bootstrap nodes, they can simply put us into any Chord network they would like. These nodes must be found out of band, and this is the only situation in which we require trust to exist between nodes. We use a set of bootstrap nodes instead of a single node because, even with our security mechanisms in place, some incorrect lookups occur in the presence of attackers. To populate our initial finger table, we ask each bootstrap node to perform a lookup of each of our finger pointer identifiers ((nodeid + 2^i) mod 2^m for i from 0 to m-1). For each finger table entry, we use the node with the identifier closest to that entry's finger table pointer. The reason for this is, again, that nodes can lie about nodes that are too far away, but cannot lie about nodes that are too close.
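The selection rule above, keeping the candidate whose identifier most closely succeeds the finger pointer, can be sketched as follows. The names are hypothetical; in particular, closestSuccessor simply stands in for comparing the answers returned by the bootstrap nodes for one finger pointer lookup.

```java
import java.math.BigInteger;
import java.util.List;

// Sketch of the bootstrap finger selection described above. A lying bootstrap
// node can only offer a candidate that is further from the pointer than the
// true successor, never closer, so taking the minimum clockwise distance is
// safe as long as at least one bootstrap node answers honestly.
public class BootstrapJoin {
    public static BigInteger closestSuccessor(BigInteger fingerPointer,
                                              List<BigInteger> candidates, int m) {
        BigInteger ringSize = BigInteger.ONE.shiftLeft(m);   // 2^m
        BigInteger best = null;
        BigInteger bestDist = null;
        for (BigInteger c : candidates) {
            // clockwise distance from the pointer to the candidate
            BigInteger d = c.subtract(fingerPointer).mod(ringSize);
            if (bestDist == null || d.compareTo(bestDist) < 0) {
                best = c;
                bestDist = d;
            }
        }
        return best;
    }
}
```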

4.8. Updating Finger Table Entries

The next issue we must deal with is the case where a finger table update call (fix_fingers()) receives an incorrect entry. To avoid this, we modify fix_fingers(). If, during an update, we receive a new finger entry that is closer to the finger pointer than the old one, we accept it and make the change. However, if the new finger is further away than the old one, we check the old entry and the nodes in its successor list to make sure that all known nodes whose identifiers precede the new node's identifier have actually left the network. If none of these nodes are still in the network, we accept the update; otherwise we reject it and use the closest succeeding node that we were aware of as the finger table entry.
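The acceptance rule just described can be sketched as a small decision function. Identifiers are longs on a small ring for readability, and isAlive stands in for actually contacting a node; none of these names come from the simulator itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch of the modified fix_fingers() acceptance rule described above.
public class FingerUpdatePolicy {
    public static long chooseEntry(long ringSize, long fingerPointer,
                                   long oldEntry, List<Long> oldSuccessors,
                                   long newEntry, Predicate<Long> isAlive) {
        long newDist = Math.floorMod(newEntry - fingerPointer, ringSize);
        long oldDist = Math.floorMod(oldEntry - fingerPointer, ringSize);
        if (newDist < oldDist) {
            return newEntry;          // closer to the pointer: accept immediately
        }
        // The new entry is further away. Keep the closest node we already know
        // about that precedes it and is still alive; accept the new entry only
        // if every such node has actually left the network.
        long best = newEntry;
        long bestDist = newDist;
        List<Long> known = new ArrayList<>(oldSuccessors);
        known.add(oldEntry);
        for (long id : known) {
            long d = Math.floorMod(id - fingerPointer, ringSize);
            if (d < bestDist && isAlive.test(id)) {
                best = id;
                bestDist = d;
            }
        }
        return best;
    }
}
```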


5. Simulator Design

In order to test and evaluate the proposed changes to the Chord protocol, we built our own Chord simulator from scratch. The simulator is written in Java and requires Java version 1.5 or later to compile and run. Individual tests can be set up and run with an included GUI utility, and batches of tests can be written in Java by using the ChordController class. The source code is available online at http://www.csh.rit.edu/~keithn/thesis.htm.

Our simulation works in an iterative fashion rather than being multithreaded. Running a separate thread for each node is not reasonable with large Chord networks, and we are able to achieve the same functionality with an iterative simulation. Each node has a tick() method that is repeatedly called by the simulator, and the tick method calls the functions that need to be periodically called for the Chord protocol to run correctly.

The original Chord protocol as described in [13] is implemented in the class ChordNode. Three malicious node classes have been written that extend ChordNode to test the effectiveness of three types of attacks against the unmodified Chord protocol. The first is MDropperChordNode, a node that drops all lookup requests. The second is MRandomChordNode, a node that works with other MRandomChordNodes to forward lookup requests around the Chord ring randomly without ever reaching the final destination. The last is MColludingChordNode, a node that colludes with other MColludingChordNodes to form a sub-ring of the main Chord ring in order to capture lookup requests and forward them through the sub-ring.

The changes that we propose are implemented in the class SecureChordNode, which extends ChordNode. The same types of attacks have also been implemented against the modified protocol in order to test their effectiveness against our changes; the nodes that perform them are called MSDropperChordNode, MSRandomChordNode, and MSColludingChordNode. These classes all extend SecureChordNode and implement the same attacks as the nodes that extend the base ChordNode class. This class structure is illustrated in Figure 5.1.

It should be noted that the nodes of the default protocol are not compatible with the nodes of the extended protocol. ChordNodes cannot be mixed with SecureChordNodes due to protocol differences.


Figure 5.1. An overview of the various Chord node classes

During a simulation, a ChordController object is responsible for managing all of the nodes in the system. ChordController objects store nodes in ChordRing objects. A ChordRing acts as a data structure for the nodes and contains convenient operations that may be performed on the entire system. A ChordController may also have a ChordGUIUtil and a StatKeeper object associated with it. The ChordGUIUtil displays graphical results for an experiment, and the StatKeeper collects statistics during an experiment and displays them when requested.

We designed and developed all classes ourselves except for BigSquareRoot , which is a free utility class for calculating the square root of Java BigIntegers taken from [6]. This class is necessary for computing the standard deviation of the average node distance.

There are two main mechanisms for running tests. The first is to use the included ChordGUIUtil class, which contains a main method that displays a GUI allowing the user to run individual tests. The GUI utility will create a network with the specified number of nodes and the specified parameters, fill in the finger tables of all those nodes, and then simulate a specified number of test lookups. Output is displayed to the screen. The second mechanism involves coding tests in Java that make use of the ChordController class. This allows batches of tests to be set up and run, and the output is in comma-separated value format, which allows the data to be easily imported into a spreadsheet application. We explain how to use both of these mechanisms in the next two sections.

5.1. Using the GUI Utility

The included GUI utility is a convenient way to run simple experiments and to graphically observe how the unmodified Chord protocol and the secured Chord protocol operate. The tests you can run with this utility are very useful, but to have full control over your tests you will need to code your own tests to make use of the ChordController class.


Figure 5.2. A screenshot of the GUI utility

To use the Chord GUI utility, you will need to obtain the simulator source code from http://www.csh.rit.edu/~keithn/thesis and extract the provided archive file. On a machine with Java 1.5 or later installed, you can compile this code by typing:

>javac *.java

And you can run the GUI utility by typing:

>java ChordGUIUtil

When you execute the utility, the GUI window will appear. The utility consists of four main components:

• The left side of the screen contains the experiment setup panel. This panel allows you to enter parameters for an experiment and then run it.

• The middle of the screen is the Chord display panel. After an experiment is complete, you can graphically view the routes that lookups took during the experiment.

• The right side of the screen is the lookup panel. After an experiment is complete, this will display some information about each lookup that occurred during the experiment. You can click on one of the lookups and view it in the Chord display panel.

• The bottom of the screen contains the status area. After an experiment is complete, this box will display information about the experiment, such as the number of correct lookups and average hop counts.

Figure 5.2 shows a screenshot of the GUI utility.

5.1.1. Experiment Setup

The experiment setup panel allows you to set up and run a new experiment. In the current version of the software, an experiment consists of creating a Chord network with the given number of each type of nodes, forcing it to converge, and then performing a given number of random identifier lookups from random nodes. More advanced experiments can be run by coding them manually, which will be described in section 5.2.

Here are the fields on the experiment setup panel and their descriptions:

• Uncompromised Nodes: This is the number of good, uncompromised nodes you want in the system. These nodes always behave and run the protocol in the correct manner.

• Malicious Dropper Nodes: This is the number of bad "dropper" nodes. A dropper node simply drops any lookup request that it receives.

• Malicious Random Routing Nodes: This is the number of nodes that collude to forward lookup requests among themselves randomly. The goal is to cause a lookup request to be forwarded indefinitely without ever reaching its destination.

• Malicious Colluding Nodes: This is the number of bad "colluding" nodes. Colluding nodes work together to trick a good node that is performing a lookup by routing its lookup request only through other colluding nodes. Colluding nodes form a sub-ring of the overall network. When a lookup request hits one of these nodes, the request goes into the sub-ring and never comes out. This is intended to guarantee that the final node found will be a compromised node. To the node performing the lookup, it appears as if all is well.

• Default or Secure Nodes: This radio button lets you choose between the default, unmodified Chord protocol and the secure protocol proposed by this thesis.

• Standard Deviation Parameter: This parameter is used only in secure node mode. It is the number of standard deviations over the average node distance that you will allow a finger pointer/finger entry distance to be, as described in Section 6. If left blank, no hop verification will be performed. You may enter a non-integer value here. Negative values are also allowed.

• Pruning Parameter: This parameter is used only in secure node mode. It controls how tightly you want each node to prune its node distance sample set. In a perfect world, the standard deviation of node distances should equal the average node distance itself, since node distances form an exponential distribution. The pruning parameter is the maximum number of multiples of the average distance that you will allow the standard deviation to be. The highest distance values are thrown out of a node's sample set until this requirement is met (or half of all samples have been thrown out, in which case pruning stops). Normal values range from 0.8 to 1.2, with lower values working better when there are more nodes in the system. If left blank, no pruning will be performed. You may enter a non-integer value here. See Section 6 for more information on the pruning parameter.

• Maximum Number of Hops: This parameter is used only in secure node mode. Since secure node lookups may (rarely) go on for a very long time trying to route around many bad hops, this limits how many hops you are willing to allow a lookup to take.

• Number of Test Lookups: This is the number of test lookups to perform in the experiment. Using powers of 10 is recommended so that you can easily compute percentages from the statistics shown in the experiment's output.

• Random Seed: This is the random seed that will be used, which allows experiments to be repeated. It is recommended that you use a seed so that you can replicate any interesting behavior that is observed. Leaving this field blank will use a random seed.

Once you have entered all of the parameters needed, you can click the Run Test button to run the experiment. This will set up all node finger tables and generate statistical information if secure node mode is being used. It will then perform the number of random lookups that you specified. A progress bar will tell you what task is being performed and how much of that task is complete.

Once the experiment is complete, data will be displayed on the other three GUI panels.

5.1.2. Viewing Experiment Results

After an experiment run has been completed, you will see aggregate experiment data printed to the status panel, information for each lookup shown in the lookups panel, and nodes displayed along the ring in the Chord display panel. Blue dots are uncompromised nodes and red dots are compromised nodes.

The aggregate data shown at the bottom includes the total number of lookups performed and how many of them were successful and unsuccessful. Successful lookups completed and returned a reference to the correct destination node. Incorrect lookups completed but were fooled by colluding nodes into returning a reference to a compromised node. Failed lookups simply did not complete. You will also see hop count data and a hop count histogram, along with data about hop verification accuracy.

The lookup panel shows you lookups that have completed. From the panel, you can see which lookups were successful, corrupt (returned a reference to a wrong and malicious

node), and incomplete, along with how many hops these lookups took. Click on a lookup and then click "View" to see the route that lookup took displayed on the Chord panel. Blue lines represent hops that passed the verify_hop() algorithm, and red lines indicate hops that did not. Invalid hops are not counted in the hop count because the node does not actually contact the next node on that hop.

The Zoom In and Zoom Out buttons are useful for zooming in when the hop count is high for a particular lookup and the screen is cluttered.

5.2. Writing Tests in Java

While the provided GUI utility is useful for running quick tests and for visualizing how the proposed algorithms work, in order to run batches of tests to get more meaningful results you must write your own tests in Java. Tests are written using the ChordController class.

The major methods used from a ChordController object are listed below, along with brief descriptions of what they do. To get a complete understanding of how to use ChordController, you should read the Javadoc documentation that is available with the code. The tests used to obtain the evaluation results in Section 6 can be found in the file Tests.java. Viewing this file should help you understand how to write your own tests.

The major ChordController methods are:

• Constructor:
  o Takes no arguments.

• addChordNodes(int num):
  o Adds num Chord nodes to the network that implement the default, unmodified Chord protocol outlined in [13].

• addMDropperChordNodes(int num):
  o Adds num Chord nodes to the network that run the unmodified protocol and simply drop all lookup requests.

• addMRandomChordNodes(int num):
  o Adds num Chord nodes to the network that run the unmodified protocol and randomly route lookup requests among each other.

• addMColludingChordNodes(int num):
  o Adds num Chord nodes to the network that form a malicious sub-ring. All lookup requests received by one of these nodes are forwarded into the malicious sub-ring instead of through the full network.

• addSecureChordNodes(int num, double std_param, double prune_param, int maxhops):
  o Adds num Chord nodes to the network that implement the security modifications proposed by this thesis. The two system parameters are described in detail in Section 6. The maxhops parameter specifies the maximum number of hops we are willing to attempt before giving up on a lookup request.

• addMSDropperChordNodes(int num, double std_param, double prune_param, int maxhops):
  o Adds num Chord nodes that run the same attack as MDropperChordNodes, except using the secured protocol.

• addMSRandomChordNodes(int num, double std_param, double prune_param, int maxhops):
  o Adds num Chord nodes that run the same attack as MRandomChordNodes, except using the secured protocol.

• addMSColludingChordNodes(int num, double std_param, double prune_param, int maxhops):
  o Adds num Chord nodes that run the same attack as MColludingChordNodes, except using the secured protocol.

• startTracking():
  o Instructs the simulator to start keeping track of lookup statistics. You should call this before expediateConvergence() if you want to keep track of the average node distance and standard deviation statistics computed by the individual nodes.

• expediateConvergence():
  o Fills in the finger tables of all nodes in the system with the correct entries. This is meant to get the network up and running quickly so that tests can be performed.

• performRandomLookups(int num):
  o Performs num lookups of random identifiers from random uncompromised nodes in the system. You should have called startTracking() first so that the results of these lookups are recorded.

• showHumanStats():
  o Prints the recorded lookup statistics in a human-readable format.

• showCSVStats():
  o Prints the recorded lookup statistics in comma-separated form. This format is easy to import into spreadsheet applications.

• runTicks(int num):
  o Simulates num rounds of calling the periodic Chord protocol functions (fix_fingers, stabilize, and check_predecessor). This simulates how the network performs over time as nodes join the network.


6. Evaluation

This section will evaluate the proposed Chord modifications. Our evaluation seeks to answer the following questions:

• How well does the system perform in the presence of malicious nodes that simply drop lookup requests? (Section 6.1)

• How well does the system perform in the presence of malicious nodes that route lookup requests to random malicious nodes in the system instead of the correct next hop? (Section 6.2)

• How well does the system perform in the presence of malicious nodes that form a sub-ring (partition) of the overall Chord network and forward all received lookup requests into this sub-ring? (Section 6.3)

• How does the size of the network impact the success rate of the system? (Sections 6.1-6.3)

• How does the standard deviation parameter impact the success rate of the system? How does it impact the average hop count? How does it impact the false positive/false negative rate of the verify_hop algorithm? (Section 6.3.1)

• How does the pruning parameter impact the success rate of the system? How does it impact the average hop count? How well does it estimate the average size of the network? (Section 6.3.2)

6.1. Dropped lookup requests

One simple attack available to a malicious user is to refuse to respond to any lookup requests they receive. Chord deals with nodes that have failed through the stabilize() and check_predecessor() functions, but a malicious node can respond to these failure detection mechanisms in order to remain in the Chord ring while still ignoring lookup requests. Our routing algorithm was designed to deal with these types of nodes, as well as any failed nodes that have not yet been cleaned up by the failure checking functions.

Testing the performance of the system in the presence of nodes that drop lookup requests allows us to test the routing algorithm independently of the hop verification algorithm. For the following tests, the standard deviation parameter was set to infinity, which has the effect of disabling the hop verification functionality while still keeping the modified routing functionality. The networks consist of 1000 nodes, and the number of nodes dropping lookup requests was varied from 0 to 500. With each variation in the number of nodes dropping lookup requests, we created 100 different networks and performed 1000 random lookups on each network from a non-malicious node.

The experiment was run by converging each network to a stable state and then enabling the lookup request drop attack on randomly selected nodes. For comparison, we also

tested the default Chord protocol in the same scenario. The percentage of all lookups that resulted in success is shown in Figure 6.1.

Figure 6.1. Success rate when testing in the presence of nodes that drop lookup requests.

With the default Chord protocol, if any one node on the path to the destination drops the lookup request, the lookup fails. Without any malicious nodes, our simulations show that the average path length in a 1000 node Chord network is approximately 4.8 hops. If f is the fraction of uncompromised nodes, then the chance of there being no malicious node on the path to the destination is f^4.8, and this is reflected in the test results.
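This back-of-the-envelope model can be written out directly. The sketch below is illustrative only; the 4.8-hop figure is the simulated average path length quoted above.

```java
// With average path length h and fraction f of uncompromised nodes, an
// unmodified lookup succeeds only if every hop lands on a good node, so the
// success probability is f^h.
public class DropAttackModel {
    public static double successProbability(double goodFraction, double avgHops) {
        return Math.pow(goodFraction, avgHops);   // f^h
    }

    public static void main(String[] args) {
        for (double f : new double[]{1.0, 0.9, 0.7, 0.5}) {
            System.out.printf("f=%.1f -> %.1f%% of lookups succeed%n",
                    f, 100 * successProbability(f, 4.8));
        }
    }
}
```

At f = 0.5 the model predicts a success rate of roughly 3.6%, consistent with the "less than 10%" observed for the unmodified protocol below.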

The modified Chord protocol that we have proposed in this thesis fared much better than the unmodified version. With even 50% of nodes in the system dropping all lookup requests, we were able to route around those compromised nodes and find the node responsible for a key approximately 95% of the time. Routing around all those malicious nodes takes more hops, of course, and Figure 6.2 shows the effect of the fraction of compromised nodes on the average hop count for the modified Chord protocol.

Figure 6.2 shows that the average hop count doubles when 28% of nodes in the system are dropping lookup requests. The average hop count doubles again when 50% of nodes in the system are dropping lookup requests. The advantage, however, is that 95% of lookup requests will succeed, compared to less than 10% with the default Chord protocol.


Figure 6.2. Hop count results when testing in the presence of nodes that drop lookup requests.

6.2. Incorrect Random Routing

While the dropped lookup request tests show that the routing algorithm works well, it does not test the hop verification algorithm. This next type of attack tests both the routing algorithm and the hop verification algorithm. With this type of attack, the malicious nodes in the system direct lookup requests to other random malicious nodes in the system. The effect is to bounce the lookup request all over the network while never reaching the destination.

We test the effectiveness of this attack against both our modified system and the default Chord protocol. Since the modified system requests entire finger tables rather than single hops, our modified system’s random routing attackers will return a finger table full of random malicious nodes instead of the properly constructed finger table. Although this attack does test the hop verification algorithm, we would expect the hops to easily fail verification. This is because a random node in the system is likely to be many standard deviations of the average node identifier distance away from that entry’s finger pointer.

This experiment was run on a network of 1000 nodes. The number of compromised nodes was varied from 0 to 500. For each variation, we created 100 networks and ran 1000 tests on each network. This time, we needed to set the standard deviation parameter and pruning parameter. The standard deviation parameter was set to 3.0 and the pruning parameter was set to 1. We set the default Chord protocol to give up after 1000 hops to prevent an infinite loop. The results are shown in Figure 6.3.


Figure 6.3. Success rate results of testing in the presence of nodes that randomly misroute lookup requests among themselves.

We can see that the success rate is high for the modified Chord protocol when random routing nodes are present. The verify_hop() algorithm easily detects malicious hops when the hop is to a random location. The unmodified protocol, on the other hand, fails when a malicious node appears on the route to the destination. Its lookup requests will be routed throughout the set of malicious nodes endlessly. When a large number of nodes in the system are compromised, the modified system does increasingly return incorrect result nodes, but the success rate is still over 90%.

Figure 6.4 shows the effect of random router nodes on the average hop count in the modified protocol. Again, the average hop count in a Chord network of 1000 nodes running the default protocol is 4.8 when there are no malicious nodes in the system. Notice that even with no nodes compromised, our system had an average hop count of 6.3. This is because we are now using an SD_PARAM of 3.0, which increases our average hop count: we occasionally get false negatives from the verify_hop() algorithm, causing us to backtrack even when it is not actually necessary.


Figure 6.4. Average hop count when testing in the presence of nodes that randomly misroute lookup requests among themselves.

6.3. Malicious Sub-ring Routing

While the random routing tests show that the verify_hop() algorithm works, the hops it was given to check were easy to reject because they were completely random and often very far from where the legitimate next hop should be. The next tests are designed around a type of routing attack that is much more difficult to detect: all of the malicious nodes in the system collude and form a sub-ring of the overall ring. Each malicious node does this by maintaining an extra finger table in which each entry contains the first malicious node that succeeds that entry's finger pointer. This malicious finger table is the one returned to uncompromised nodes performing a lookup. Any lookup request that reaches a malicious node on its route will be captured by the colluding sub-ring.

The goal of the attacker is to emulate correct behavior as closely as possible, but still control the lookup process. The attacker wants to cause as many lookups as possible to end at a malicious node, so that the attacker can either deny the existence of the sought after data or provide compromised data. Since the nodes used in the finger table are the closest malicious nodes to the correct entry, our verify_hop() algorithm should have a more difficult time recognizing malicious hops, and should be less accurate.

The first test shows the effect of varying the number of these compromised nodes from 0 to 500 in a 1000 node network. For each variation, we created 100 networks and ran 1000 test lookups from random uncompromised nodes on random identifiers on each network. We tested our modified protocol with standard deviation parameters of 1.75 and 3.0, keeping the pruning parameter at 1.0, and we also tested the unmodified Chord protocol for comparison. Figure 6.5 shows the lookup success rate achieved when running this experiment.

Figure 6.5. Success rate when testing in the presence of malicious sub-ring nodes on a 1,000 node network.

While the lookup success rate was not as high as in the previous two tests, it was still much better than that of the unmodified Chord protocol. For example, with 20% of the nodes compromised, only 48.6% of lookups succeeded using the unmodified protocol, while 86% succeeded with the modified protocol using a standard deviation parameter of 1.75.

It should be noted that as the fraction of colluding nodes increases, the protocols eventually see their success rate “improve.” To see why, consider the case when 40% of nodes are compromised. Even if we were fooled by the attackers every time, 40% of our lookups would still return the correct destination node, because 40% of the correct destination nodes are themselves malicious. Therefore, we can never perform worse than the line y = x on this graph; that line is our lower bound.

Figure 6.6 shows the average hop count for a 1000 node system as the fraction of colluding sub-ring nodes varies.


Figure 6.6. Average hop count when testing in the presence of malicious sub-ring nodes on a 1,000 node network.

The modified protocol increases the average hop count even when no nodes are compromised, due to false positives in the verify_hop() algorithm. With an SD_PARAM of 1.75, we have a higher hop count than with an SD_PARAM of 3.0, but we also have a higher success rate. The SD_PARAM thus trades off false positives against false negatives, and we will see this more clearly in Section 6.3.1.

We repeated this experiment on a larger network to investigate how network size affects performance. Each network in this experiment has 10,000 nodes, and we varied the percentage of compromised nodes between 0% and 50%. For each variation, we created 10 networks and ran 1000 test lookups from random uncompromised nodes on random identifiers on each network. Figure 6.7 shows the results of this experiment. While the success rate is slightly lower with 10,000 nodes than it was with 1,000 nodes, the difference between the success rates of the modified and unmodified protocols is greater.

Figure 6.8 shows the average hop count for these experiments. Here we see the average hop count actually start to decrease after a high enough percentage of nodes have been compromised. This is because we are often being fooled by the malicious nodes and not correctly backtracking around them.


Figure 6.7. Success rate when testing in the presence of malicious sub-ring nodes on a 10,000 node network.

Figure 6.8. Average hop count when testing in the presence of malicious sub-ring nodes on a 10,000 node network.


6.3.1. Effect of the Standard Deviation Parameter

While the previous experiments demonstrate some of the effects of modifying the standard deviation parameter, this section looks at it more closely. The purpose of the standard deviation parameter is to allow the user to trade security for performance. A higher standard deviation parameter value allows more invalid hops to slip through the cracks, while a lower value erroneously flags more valid hops as invalid.
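The shape of this tradeoff can be sketched as a simple threshold test. Note that this is an illustrative sketch only: the names and the exact formula below are our assumptions, not the verify_hop() definition from Chapter 4. The idea is that a hop is accepted when its observed ring distance is no more than SD_PARAM standard deviations above the locally estimated mean inter-node distance, so raising SD_PARAM admits more hops (fewer false positives, more false negatives) and lowering it rejects more.

```java
// Hypothetical sketch of an SD_PARAM-style hop verification check.
class HopVerifier {
    final double meanDistance;  // locally estimated average gap between nodes
    final double stdDeviation;  // locally estimated standard deviation of gaps
    final double sdParam;       // the SD_PARAM security/performance knob

    HopVerifier(double meanDistance, double stdDeviation, double sdParam) {
        this.meanDistance = meanDistance;
        this.stdDeviation = stdDeviation;
        this.sdParam = sdParam;
    }

    // observedDistance: clockwise ring distance from the ideal finger
    // position to the node the previous hop actually handed us. Accept
    // the hop only if it falls within the statistical threshold.
    boolean verifyHop(double observedDistance) {
        return observedDistance <= meanDistance + sdParam * stdDeviation;
    }
}
```

Under this reading, an SD_PARAM near -1 drives the threshold toward zero (almost every hop is rejected), which matches the near-total lookup failure observed at the low end of the parameter sweep below.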

In order to see the effects of the standard deviation parameter more clearly, we ran an experiment in which all other aspects were kept constant while the standard deviation parameter was varied. For the experiment, we created networks of 1000 nodes, of which 750 were correctly running the Chord protocol and 250 were colluding nodes performing the sub-ring attack. The pruning parameter was kept constant at 1. We varied the standard deviation parameter between -1 and 9 in steps of 0.1. For each value, we created 100 networks and performed 1000 lookups on random identifiers from random non-compromised nodes. The success rate of these experiments is shown in Figure 6.9.

Figure 6.9. Success rate with various SD_PARAM values

With a very small standard deviation parameter value, most lookup attempts fail, but not because the node performing the lookup was deceived. These lookups fail because so few hops can pass the verify_hop test that no route to the destination can be found consisting entirely of hops that pass verification. As the standard deviation parameter is increased, more lookups succeed as the false positive rate is reduced. At the same time, however, the false negative rate increases and more invalid hops begin to pass verification, resulting in more lookup requests that end in deception (meaning a node was found, but it was not the correct node). The successful lookup rate tops out at a little over 80%, which is significantly higher than the 45% success rate that results when no hop verification is used.

The advantage of using a higher standard deviation parameter value is, of course, lower hop counts. Figure 6.10 shows the average hop count for the various standard deviation parameter values. We see an initial increase in the hop count for small SD_PARAM values because increasing the SD_PARAM results in fewer “dead ends” where no more valid hops can be taken. After peaking at an SD_PARAM value of around 0, the average hop count decreases as the SD_PARAM increases, since fewer hops fail verification. Comparing the average hop count graph to the success rate graph shows that for SD_PARAM values between 1.0 and 3.0, increasing the SD_PARAM slightly decreases lookup success while significantly decreasing the average hop count.

Figure 6.10. Average hop count with various SD_PARAM values

To further see the effects of varying the standard deviation parameter, we recorded the false positive and false negative rates for all calls to the verify_hop function. Our definition of the false positive rate is the total number of false positives divided by the total number of hops tested; likewise, the false negative rate is the total number of false negatives divided by the total number of hops tested. The results are shown in Figure 6.11.
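These rate definitions can be captured in a few lines of bookkeeping. The sketch below is ours (the simulator's actual counters may differ): a false positive is a valid hop flagged invalid, a false negative is an invalid hop that passed verification, and both rates are normalized by the total number of hops tested.

```java
// Tallies implementing the false positive / false negative rate
// definitions used in this section.
class VerificationStats {
    long falsePositives;  // valid hops flagged invalid by verify_hop
    long falseNegatives;  // invalid hops that passed verify_hop
    long hopsTested;      // every hop submitted for verification

    void record(boolean hopWasValid, boolean passedVerification) {
        hopsTested++;
        if (hopWasValid && !passedVerification) falsePositives++;
        if (!hopWasValid && passedVerification) falseNegatives++;
    }

    double falsePositiveRate() {
        return hopsTested == 0 ? 0.0 : (double) falsePositives / hopsTested;
    }

    double falseNegativeRate() {
        return hopsTested == 0 ? 0.0 : (double) falseNegatives / hopsTested;
    }
}
```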


Figure 6.11. False positive and false negative rates with various SD_PARAM values.

The effects of the SD_PARAM on the false positive and false negative rates are as expected. Figure 6.11 gives us a better feel for how accurate the verify_hop function really is.

6.3.2. Effect of the Pruning Parameter

The other major system configuration parameter is the pruning parameter. Pruning, as you will recall, is our attempt to remove incorrect data samples that were computed using information provided by malicious nodes. The distance samples in the Chord ring follow an exponential distribution, and in an exponential distribution we expect the standard deviation of the samples to equal the mean. Pruning is therefore done by removing the largest data samples until the computed standard deviation is less than the average scaled by the pruning parameter.
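The pruning loop just described can be sketched directly (class and method names are ours; the simulator's implementation may differ in detail): sort the samples, then repeatedly discard the largest remaining sample until the standard deviation is no more than the mean scaled by the pruning parameter. Because an exponential distribution's standard deviation equals its mean, honest samples should mostly survive with a pruning parameter near 1.0, while outsized samples injected by malicious nodes are the first to go.

```java
// Sketch of the pruning step: drop the largest distance samples until
// stdDev <= pruningParam * mean over the surviving samples.
class SamplePruner {
    // Mean of the first n elements of xs.
    static double mean(double[] xs, int n) {
        double s = 0;
        for (int i = 0; i < n; i++) s += xs[i];
        return s / n;
    }

    // Population standard deviation of the first n elements of xs.
    static double stdDev(double[] xs, int n) {
        double m = mean(xs, n), s = 0;
        for (int i = 0; i < n; i++) s += (xs[i] - m) * (xs[i] - m);
        return Math.sqrt(s / n);
    }

    static double[] prune(double[] samples, double pruningParam) {
        double[] xs = samples.clone();
        java.util.Arrays.sort(xs);
        int n = xs.length;
        // Shrinking n discards the largest remaining (sorted) sample.
        while (n > 1 && stdDev(xs, n) > pruningParam * mean(xs, n)) {
            n--;
        }
        return java.util.Arrays.copyOf(xs, n);
    }
}
```

For example, one inflated sample among a handful of honest ones pushes the standard deviation well above the mean, so it is pruned on the first iteration.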

To test the pruning parameter, we ran an experiment in which we varied the pruning parameter between 0.5 and 2.0 while keeping the standard deviation parameter constant at 1.75, on networks of 1000 nodes with 25% of nodes colluding to run the sub-ring attack. The resulting success rates are shown in Figure 6.12.


Figure 6.12. Success rate with various pruning parameter values.

When the pruning parameter is very small, too much data is pruned and the resulting computed average is too small. This causes the verify_hop() function to fail often, producing a high false positive rate. As a result, it is difficult to find a path from source to destination consisting of verified hops, and the failure rate is high. As the pruning parameter is increased, the computed average distance grows and lookups succeed more often. When the pruning parameter is too high, data from the malicious nodes inflates the computed average distance, and the false negative rate increases, resulting in more deceived lookups.

An individual node can estimate the size of the network by dividing the maximum identifier value by the average distance between nodes that the node has computed. In Figure 6.13, we show what the average estimated size of the network is for the various pruning parameters in this experiment.
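The size estimate described above is a one-line computation; the sketch below (names are ours) makes the relationship explicit and shows why an inflated average distance, as produced by an overly permissive pruning parameter, directly translates into an underestimate of the network size.

```java
// A node's local estimate of the network size: the identifier space
// divided by the locally computed average inter-node distance.
class NetworkSizeEstimator {
    static double estimateSize(double maxIdentifier, double avgDistance) {
        return maxIdentifier / avgDistance;
    }
}
```

For instance, on a 16-bit identifier ring an average gap of 64 yields an estimate of 1024 nodes; if malicious data doubles the computed average gap, the estimate halves.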


Figure 6.13. Average estimated network size for various pruning parameter values.

Obviously, we would like the average estimated size to equal the actual network size of 1000. This is achieved with a pruning parameter value of 0.8. We found 0.8 to be a good pruning parameter value in nearly every test we ran, although when very few nodes are compromised a value of 1.0 may be a better choice.


7. Conclusion

Misbehaving nodes in distributed hash tables can cause severe disruptions, even when they only exist in small numbers. Several security concerns must be addressed in order to use distributed hash tables in situations where users cannot be trusted. In this thesis, we proposed a mechanism for mitigating the effects of one of those concerns: routing attacks.

Our solution does not require any overly complex modifications to the target DHT (Chord). We make use of information that is readily available to each node, and we do not require a trust relationship between nodes. This relative simplicity means that our defense mechanism should work well alongside other modifications to the Chord algorithm and can easily be combined with other security features.

Our system easily deals with simple attacks, such as dropped lookup requests and randomly misrouted lookup requests. For attacks that are difficult to detect, such as the sub-ring attack, our success rate drops but remains much greater than that of the unmodified Chord protocol, especially when less than 20% of the nodes in the network are compromised.

Our security features do increase the number of hops required to reach destinations. Different users have different security and performance needs, and so we provide parameters that can be used to balance these two needs.

Security in structured peer-to-peer networks is difficult because of their fully distributed nature, but we have shown that routing security can be greatly improved using only a relatively small amount of locally known information. Our technique should transfer to other DHTs that make use of constrained routing, and can serve as a crucial piece of a total security solution.


8. References

1. Castro, M., Druschel, P., Ganesh, A., Rowstron, A., Wallach, D.: Security for Peer-to-Peer Routing Overlays. Proc. of the Fifth Symposium on Operating Systems Design and Implementation (OSDI '02), Boston, MA (2002)
2. Danezis, G., Lesniewski-Laas, C., Kaashoek, M. F., Anderson, R.: Sybil-resistant DHT routing. Proc. of the 10th European Symposium on Research in Computer Security, Milan, Italy (2005)
3. Dinger, J., Hartenstein, H.: Defending the Sybil Attack in P2P Networks: Taxonomy, Challenges, and a Proposal for Self-Registration. Proc. of the First International Conference on Availability, Reliability and Security, Washington, DC (2006)
4. Douceur, J.: The Sybil Attack. Proc. of the First International Workshop on Peer-to-Peer Systems (IPTPS '02), Cambridge, MA (2002)
5. Fiat, A., Saia, J., Young, M.: Making Chord Robust to Byzantine Attacks. Proc. of the European Symposium on Algorithms, Heidelberg, Germany (2005)
6. Gilleland, M.: Big Square Root. Accessed via the web: http://www.merriampark.com/bigsqrt.htm
7. Gnutella. Accessed via the web: http://www.gnutella.com
8. Heinbockel, W., Kwon, M.: Phyllo: A Peer-to-Peer Overlay Security Framework. Proc. of the First Workshop on Secure Network Protocols (NPSec), Boston, MA (2005)
9. Napster. Accessed via the web: http://www.napster.com
10. Ratnasamy, S., Francis, P., Handley, M., Karp, R., Shenker, S.: A Scalable Content Addressable Network. Proc. of ACM SIGCOMM '01, San Diego, CA (2001)
11. Rowstron, A., Druschel, P.: Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems. Proc. of IFIP/ACM Middleware 2001, Heidelberg, Germany (2001)
12. Sit, E., Morris, R.: Security Considerations for Peer-to-Peer Distributed Hash Tables. Proc. of the First International Workshop on Peer-to-Peer Systems (IPTPS '02), Cambridge, MA (2002)
13. Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., Balakrishnan, H.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. Proc. of ACM SIGCOMM '01, San Diego, CA (2001)
14. Wallach, D.: A Survey of Peer-to-Peer Security Issues. Proc. of the International Symposium on Software Security, Tokyo, Japan (2002)
