Sharing Network Measurements on Peer-to-Peer Networks

Bo FAN

A thesis submitted in fulfilment of the requirements for the degree of Master of Philosophy (Research) in Electrical Engineering and Telecommunications

School of Electrical Engineering and Telecommunications The University of New South Wales February, 2007

THE UNIVERSITY OF NEW SOUTH WALES Thesis/Dissertation Sheet

Surname or Family name: Fan

First name: Bo Other name/s:

Abbreviation for degree as given in the University calendar: MPhil

School: School of Electrical Engineering and Faculty: Engineering Telecommunications

Title: Miss

Abstract (350 words maximum):

With the extremely rapid development of the Internet in recent years, emerging peer-to-peer network overlays are meeting the requirements of a more sophisticated communications environment, providing a useful substrate for applications such as scalable file sharing, data storage, large-scale multicast, web caching, and publish-subscribe services. Due to their design flexibility, peer-to-peer networks can offer features including self-organization, fault-tolerance, scalability, load-balancing, locality and anonymity. As the Internet grows, there is an urgent requirement to understand real-time network performance degradation. The measurement tools currently used are ping, traceroute and variations of these; SNMP (Simple Network Management Protocol) is also used by network administrators to monitor local networks. However, ping and traceroute provide only transient measurements, SNMP can only be deployed at certain points in networks, and none of these tools can share network measurements among end-users. Due to the distributed nature of networking performance data, peer-to-peer overlay networks present an attractive platform for distributing this information among Internet users.

This thesis aims at investigating the desirable locality property of peer-to-peer overlays to create an application for sharing Internet performance measurements. When measurement data are distributed amongst users, they need to be localized in the network, allowing users to retrieve them when external Internet links fail. Thus, network locality and robustness are the most desirable properties. Although some unstructured overlays also integrate locality in their design, they fail to reach rarely located data items. Consequently, structured overlays are chosen because they can locate a rare data item deterministically and they can perform well during network failures. Among structured peer-to-peer overlays, Tapestry, Pastry and Chord with Proximity Neighbour Selection were studied due to their explicit notion of locality. To differentiate the level of locality and resiliency in these protocols, P2Psim simulations were performed. The results show that Tapestry is the more suitable peer-to-peer substrate on which to build such an application due to its superior data-localizing performance. Furthermore, due to the routing similarity between Tapestry and Pastry, an implementation that shares network measurement information was developed on Freepastry, verifying the feasibility of the application. This project also contributes an extension of P2Psim that integrates it with GT-ITM and adds support for link failures.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signature ……………………………………………
Witness ……………………………………………
Date ……………………………………………

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY Date of completion of requirements for Award:


ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………………………………......

Date ……………………………………………......

Acknowledgement

Most of all I would like to thank my supervisor Dr. Tim Moors for giving me this opportunity to do research with him and for his encouragement, patience and guidance throughout the whole research journey. His insightful opinions have always led me to make the right decisions at critical stages of my research, keeping me on the right track.

I would like to thank Tim Hesketh, Head of the School of Electrical Engineering, for introducing this Master of Philosophy programme to me. I also wish to thank Phillip Allen from the School of Electrical Engineering for his help in building the experimental network and for all the equipment access he provided. I would also like to express my gratitude to my research colleagues in 343E: Tim Hu for his valuable hints and Sameer Qazi for proofreading my thesis.

Especially, I would like to thank my parents for their emotional and long-term financial support.

Abstract

With the extremely rapid development of the Internet in recent years, emerging peer-to-peer network overlays are meeting the requirements of a more sophisticated communications environment, providing a useful substrate for applications such as scalable file sharing, multicasting and publish-subscribe services. Peer-to-peer networks can offer desirable features such as self-organization, fault-tolerance, scalability, load-balancing, locality (proximity) and anonymity. As the Internet grows, there is an urgent requirement to understand real-time network performance degradation. While measurements can be made using tools such as ping and traceroute, and network administrators can monitor local networks using SNMP (Simple Network Management Protocol), there are currently no tools for sharing measurements among end-users. Due to the distributed nature of networking performance data, peer-to-peer overlay networks present an attractive platform to distribute this information among Internet users.

This thesis investigates the desirable locality property of peer-to-peer overlays to determine their suitability for creating an application to share Internet performance measurements. When measurement data are distributed amongst users, users need to be able to retrieve localised copies to improve the chance of access when external Internet links fail. Thus, network locality and robustness are the most desirable properties of a peer-to-peer network for sharing measurements. Although some unstructured overlays provide localised access, they fail to reach rarely located data items. Relevant measurement data may be hard to find when a client seeks measurements about a niche network element (e.g. an obscure server) or seeks measurements that are highly relevant to itself by virtue of having been made by nearby nodes. Consequently, structured overlays are chosen because they can locate a rare data item deterministically and they can perform well during network failures. Among structured peer-to-peer overlays, Tapestry, Pastry and Chord with Proximity Neighbour Selection were studied due to their explicit notion of locality.

To differentiate the level of locality and resiliency in these protocols, P2Psim simulations were performed. The results show that Tapestry is the more suitable peer-to-peer substrate for building such an application due to its superior data-localizing performance. Furthermore, due to the routing similarity between Tapestry and Pastry, an implementation that shares network measurement information was developed on Freepastry, verifying the feasibility of the application. This project also contributes by extending P2Psim to integrate it with the GT-ITM network topology generator and by enabling simulation of link failures.

Table of Contents

Acknowledgement ...... i
Abstract ...... ii
Table of Contents ...... iv
Chapter 1 Introduction ...... 1
1.1. Background and motivation ...... 1
1.2. Contribution and challenge ...... 4
1.3. Thesis organization ...... 5
Chapter 2 Background ...... 6
2.1 Introduction ...... 6
2.2 Overlay algorithms ...... 6
2.2.1 Chord ...... 7
2.2.1.1 Chord protocol routing scheme ...... 8
2.2.1.2 Chord Maintenance mechanisms ...... 10
2.2.1.3 Chord Proximity Neighbour Selection ...... 11
2.2.2 Plaxton ...... 12
2.2.3 Tapestry ...... 14
2.2.3.1 Tapestry basic routing scheme ...... 14
2.2.3.2 Fault handling ...... 14
2.2.3.3 Tapestry surrogate routing ...... 15
2.2.3.4 Tapestry maintenance algorithm ...... 15
2.2.3.5 Tapestry locality ...... 17
2.2.4 Pastry ...... 18
2.2.4.1 Pastry basic routing scheme ...... 18
2.2.4.2 Pastry maintenance mechanism ...... 20
2.2.4.3 Locality in Pastry ...... 21
2.2.5 Tapestry and Pastry Comparison ...... 22
2.2.6 Locality discussion and optimization ...... 23
2.2.6.1 Proximity neighbour selection ...... 23
2.2.6.2 Optimizations in Tapestry ...... 23
2.3 Simulation tools ...... 26
2.3.1 P2Psim ...... 26
2.3.2 GT-ITM ...... 27
2.4 Relevant overlay applications ...... 30
2.4.1 Resilient Overlay Networks ...... 30
2.4.2 M-Coop ...... 33
2.4.3 PeerCQ ...... 34
2.5 Conclusion ...... 37
Chapter 3 Simulation Comparison of DHTs ...... 39
3.1 Simulation methodology ...... 39
3.1.1 Counting underlay hops ...... 39
3.1.2 Calculating path stretch ...... 41
3.1.3 Calculating success rates/failure rates ...... 42
3.2 Simulation setup ...... 43
3.3 Simulation analysis ...... 46
3.3.1 Results on networks with static link connections ...... 46
3.3.1.1 Impact of network size ...... 47
3.3.1.2 Distribution of path length ...... 51
3.3.1.3 Path length factors ...... 55
3.3.1.4 Impact of stabilization frequency ...... 63
3.3.2 Results on networks with link failures ...... 64
3.3.2.1 Impact of link failures frequency ...... 64
3.3.2.2 Impact of network size ...... 65
3.3.2.3 Impact of stabilization frequency ...... 68
3.4 Conclusion ...... 70
Chapter 4 Implementation with Pastry ...... 71
4.1 Introduction ...... 71
4.2 Implementation design ...... 71
4.3 Implementation details ...... 73
4.4 Implementation results discussion ...... 76
4.5 Future Work ...... 80
4.6 Conclusion ...... 80
Chapter 5 Conclusions and Future work ...... 81
5.1 Results summary ...... 81
5.2 Future work ...... 82
Reference
Appendix I P2Psim extension
1. Extending P2Psim to support GT-ITM
2. Adding link event
Appendix II Implementation Procedure and Results


Chapter 1 Introduction

1.1. Background and motivation

The Internet has been growing extremely fast during the last few decades. It has evolved from a relatively obscure, experimental research and academic network to a commodity, mission-critical component of the public telecommunication infrastructure [1]. As a consequence, the study of network performance and of the failures that affect users' experience becomes crucial in order to make efficient use of network resources and to detect and recover from failures. Furthermore, end-users are more willing than before to learn about network performance in real time to check availability for their own interests (e.g. the availability of the connection from UNSW to AARNet).

Network performance can be inferred from metrics such as path latency, available bandwidth, loss rate, jitter profile and packet reordering probability between two network end points. The basic approaches to measuring network performance can be classified into two categories: passive monitoring and active probing [1]. Passive monitoring measures network performance with a management protocol such as the Simple Network Management Protocol (SNMP), which does not add traffic that disturbs routine network operations.

On the other hand, active probing requires packets to be injected into the network and collected later on to infer network metrics such as delay, loss rate and bandwidth [2]. The most common probing tools include ping, traceroute, pathchar [3, 4] and cing [5]. Repeated pings can be used to test the delivery of packets, and the packet delays and packet loss observed in ping results can be used to infer general network performance [6].
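As an illustration of active probing, the following sketch infers average delay and loss rate from repeated pings; it assumes a Unix-like ping command, and the target host is hypothetical:

```python
import re
import subprocess

def probe(host: str, count: int = 10):
    """Actively probe `host` with ICMP echo requests and infer delay/loss.

    Assumes a Unix-like `ping` whose output contains lines such as
    `64 bytes from ...: icmp_seq=1 ttl=53 time=12.3 ms`.
    """
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    rtts = [float(m) for m in re.findall(r"time=([\d.]+) ms", out)]
    loss = 1.0 - len(rtts) / count
    avg_rtt = sum(rtts) / len(rtts) if rtts else None
    return avg_rtt, loss

if __name__ == "__main__":
    rtt, loss = probe("example.com")      # hypothetical target
    print(f"average RTT: {rtt} ms, loss rate: {loss:.0%}")
```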


Current work related to Internet performance measurement concentrates on the statistical analysis of backbone failures [7-10]. In these works, the frequency and duration of backbone failures, together with failure classifications, are studied. Whereas the results are useful for developing failure models for simulations and for understanding the impact of failures on network availability, little attention has been given to sharing network failure information among Internet end-users.

To allow end-host users to learn network performance information continuously and immediately, we propose utilizing the distributed nature of peer-to-peer networks to gather network measurement information and share it with other peers/users in the network. The heuristic is that users are more interested in information from users nearby (e.g. in the same network), because they are likely to experience similar networking performance, such as delay and loss rates. For example, a user at UNSW is more interested in the network performance that other UNSW users have experienced than in that experienced by users in an MIT laboratory. In the face of network failures affecting the external Internet, users in the local area network should still be able to share measurement information with each other. Thus, a certain correlation is required between the measurement data and the location of the data holders, and the measurement data need to be localized.

Peer-to-Peer networks [11] have emerged in the last few years to cope with the rapid growth of the Internet and to meet the requirement for a more complex and chaotic communication environment. A peer-to-peer system, unlike the traditional client-server model, considers each host as a symmetric peer. Without any hierarchical or centralized administrative control, each peer shares its resources with other peers and forms a self-organized overlay which is built on top of Internet Protocol networks. Peer-to-peer systems can provide a variety of features and advantages over traditional client-server systems, including a robust routing infrastructure, a more efficient search scheme, fault tolerance, redundant storage, permanence, trust and authentication, anonymity and scalability.

A typical design of P2P (peer-to-peer) overlay network architecture can be found in [11]. Two classes of P2P overlay structure have been identified: unstructured and structured overlays. Unstructured overlays, such as Gnutella [12], Freenet [13], Fasttrack [14]/KaZaA [15], BitTorrent [16] and Overnet/eDonkey [17], use flooding or random walks as the mechanism to query content stored by overlay peers. Unstructured overlays can neither guarantee to find a data item within a certain number of hops, nor guarantee to find a data item at all if it is rarely located. There is also no coupling between data items and peers. Thus, unstructured overlays are not suitable as the substrate for sharing measurement data. On the other hand, structured overlays, such as Chord [18], Tapestry [19], Pastry [20], Content Addressable Network (CAN) [21], [22] and Viceroy [23], assign keys to data items and arrange the peers in the network to form a graph that maps each key to a peer. Structured overlays are also referred to as DHT (Distributed Hash Table)-based overlays [24]. Compared with unstructured overlays, structured overlays can efficiently locate even a rare item within a bounded distance. Some DHT designs for structured overlays have also considered network proximity, which is more appealing for meeting the requirement of sharing network measurements.

A proposed infrastructure, M-Coop [25], discusses the similar problem of sharing measurements on peer-to-peer overlay networks. M-Coop consists of two layers: a measurement layer and a DHT layer. While M-Coop relies on assigning the closest AS to a joining node using the locally available information in the measurement layer, it does not specify the properties of the underlying DHT.

We aim at designing a system for sharing network measurement information on top of a specific DHT that implements the locality property. The end-hosts on which this system is deployed collect measurement information either actively, through pinging (and its many variants), or passively, through SNMP or daily-use applications such as email clients or web browsers. The end-hosts then publish the relevant information by joining an existing peer-to-peer network or starting a new one.

To fulfil this task, we first investigate the existing DHTs and their optimizations to find an appropriate substrate. It can be seen that both Chord (with Proximity Neighbour Selection [26]) and Tapestry have implemented locality in their routing tables. To find the better DHT for sharing measurement information, we integrate P2Psim [27] and GT-ITM [28] to simulate and compare the locality properties of Chord and Tapestry. A small application on the Pastry substrate is then developed to verify the feasibility of such an application.
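The overall idea can be summarised with the following sketch; the put/get interface and the keying scheme are assumptions made for illustration and are not the implementation described in Chapter 4:

```python
import hashlib
import json
import time

def key_for(target: str) -> int:
    """Map a measured target (e.g. 'www.aarnet.edu.au') to a 160-bit DHT key."""
    return int(hashlib.sha1(target.encode()).hexdigest(), 16)

def publish_measurement(dht, source: str, target: str, rtt_ms: float, loss: float):
    """Store a measurement record under the target's key.

    `dht` is any object exposing put(key, value)/get(key): a stand-in for the
    key-based routing layer of a structured overlay such as Pastry or Tapestry.
    """
    record = {"source": source, "target": target, "rtt_ms": rtt_ms,
              "loss": loss, "timestamp": time.time()}
    dht.put(key_for(target), json.dumps(record))

def lookup_measurements(dht, target: str):
    """Retrieve measurements that other peers have published about `target`."""
    return dht.get(key_for(target))
```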

1.2. Contribution and challenge

As discussed in the previous section, five contributions are recognized in this project:
1) Detailed investigations of Chord, Tapestry and Pastry, including their improvements and optimizations, have been carried out and summarized.
2) GT-ITM [28] and P2Psim [27] have been studied and then integrated. This is important because there has been considerable demand from researchers and little support for the integration of GT-ITM and P2Psim.
3) A link event has been added to P2Psim to simulate networks with unstable link changes, compensating for the shortcoming that P2Psim does not support underlying link failures.
4) Chord and Tapestry have been compared in P2Psim simulations under both stable and unstable underlying networks, and a number of useful results have been obtained.
5) A prototype for sharing network measurements has been implemented on the Freepastry [29] interface to test the feasibility of the application.

One of the big challenges in this project lies in understanding and enhancing the current version of P2Psim. P2Psim, started at the end of 2004, is still at an early stage of development (alpha version), and there is no organized documentation to help comprehend the meaning of each class and variable.

It is therefore very time-consuming to correlate the protocols with their programs in order to extract useful information. Since each protocol has a different routing scheme and is developed by different researchers, there is no shortcut to understanding and comparing the protocols without also knowing the basic principles and program organization of each protocol.

1.3. Thesis organization

The thesis is organized as follows. Chapter 1 reveals the necessity of sharing network measurement information among Internet users, reviews the previous work in this area and proposes the use of structured peer-to-peer networks as a solution. Chapter 1 also proposes that network proximity (locality) is the key property that needs to be considered for this application. Chapter 2 reviews the detailed routing schemes and node joining/leaving schemes of the currently predominant structured overlay protocols: Chord, Tapestry and Pastry. A comparison of Tapestry and Pastry, together with their optimizations, provides the reason for using Pastry in Chapter 4. Chapter 2 also provides background knowledge of the simulators that are used in Chapter 3 and reviews some relevant overlay applications for sharing network measurement information. Chapter 3 uses the integration of P2Psim and GT-ITM to compare Chord and Tapestry in their average underlay distance, their stretch in distance¹ and their stability in a network with unstable links². Based on the results of Chapter 3, a small application is implemented on Freepastry in Chapter 4. Chapter 5 provides some concluding remarks and suggestions for future work.

1 Stretch in distance is defined as the ratio of the distance a query travels through the overlay network to an object to the minimal distance to that object (i.e. through IP).

2 This “unstable” refers to the link status in the IP layer. In an overlay network, “stable” could also refer to node status, i.e. the frequency of nodes joining/leaving the overlay network.

Chapter 2 Background

2.1 Introduction

This chapter reviews background materials for the remainder of the thesis. We first review Distributed Hash Tables used for constructing overlay networks. Then the P2Psim simulator that is used to simulate such overlay networks is studied. We also review the GT-ITM topology generator that we use to create the network topologies to cooperate with P2Psim. Finally, we review some papers that are related to the application that we describe in Chapter 4, in that they use overlay networks for sharing network measurements.

2.2 Overlay algorithms

The basic idea of structured P2P overlays is to hash each node to a nodeID and hash each data object to a key, so that the object is placed onto a specific node and subsequent queries become deterministic and efficient, compared with the aimless flooding or random-walk routing in unstructured overlays. The structured P2P overlay algorithms implement a key-based routing interface and support higher-level interfaces such as a DHT [30, 31] or DOLR (decentralized object location and routing) [19] layer. However, they differ in their routing strategies and organization schemes.

One key difference between these overlay algorithms is that neither Chord (in its original design) nor CAN considers locality when the routing overlay is constructed. A later version of Chord uses Proximity Neighbour Selection to maintain its routing tables. On the other hand, Tapestry and Pastry take network proximity into account when the routing table is initially built and also during maintenance, so that the path stretch can be reduced as much as possible.

This locality property is essential for our application of sharing network measurements among Internet end-users. This is because not only are Internet end-users more interested in the networking experience of other geographically close-by users, but the measurement sharing application should also be robust despite network link failures. In this chapter, we therefore specifically study the locality property of each protocol.

2.2.1 Chord

Chord is the most prominent protocol in the second generation of peer-to-peer overlay algorithms. It solves the problem of efficiently locating the node that stores a particular data item in peer-to-peer applications. The Chord protocol supports one important operation: it associates each data item with a key and maps the key onto a node. Depending on the application that uses Chord, each node might be responsible for one value associated with a key [18].

Chord uses consistent hashing to assign keys to nodes in the network. It is assumed that each node in Chord is equivalent in its ability to donate resources. Consistent hashing tends to balance load, since each node receives approximately the same number of keys; these keys may move between nodes as nodes join and leave. Research in [18] shows that Chord successfully solves the problems of load balancing, decentralization, scalability, availability and flexible naming. Chord [18] can also provide good foundations for applications such as cooperative mirroring, time-shared storage, distributed indexes and large-scale combinatorial search.

Chord is used in Internet Indirection Infrastructure (I3) [32] as the overlay routing protocol. The project aims at eliminating barriers to provide the service of multicast, anycast and mobility.


2.2.1.1 Chord protocol routing scheme

The basic design of the Chord protocol includes the approach for finding the locations of keys and for handling nodes joining and departing the system, as well as node failures. Each node maintains its successor, its predecessor and an m-entry finger table (ideally, the number of finger table entries equals the number of bits, m, in the node identifiers). Since each node only needs to know a small amount of routing information about other nodes, Chord systems can scale well with consistent hashing.

Basic routing design Each node identifier and key identifier is calculated under one hash function such as SHA-1. An m-bit node identifier is chosen by hashing the node's IP address, whereas an m-bit key identifier is chosen by hashing the key. The length of the identifier is usually large enough to make the chance of two items hashing to the same identifier negligible. These identifiers are ordered on an identifier circle modulo 2^m. The node responsible for key k is the first node whose identifier equals or follows k on this identifier circle. This node is referred to as successor(k); thus successor(k) is the first node clockwise from k on the identifier circle, whose positions are numbered from 0 to 2^m - 1.

If each node is only aware of its own successor pointer on the identifier circle, it is guaranteed, in the absence of faults, that all lookups can correctly find the first node that succeeds a key. In addition, Chord maintains additional routing information at each node to improve routing efficiency. This information takes the form of an m-entry finger table at each node, where m is the number of bits in the identifiers. Each entry in this table is calculated as follows: the ith entry s in the finger table of node n is the successor of (n + 2^(i-1)) mod 2^m in the identifier space, where 1 <= i <= m. A finger table entry consists of both the Chord identifier and the IP address (and port number) of the node.


A finger interval spans from the start of the current finger to the start of the next finger, i.e. from finger[k].start to finger[k+1].start, where

finger[k].start = (n + 2^(k-1)) mod 2^m.

Each node knows its own finger intervals. This is essential when node n does not have any information about the key k. An example of finger intervals is shown in Figure 1, and an example of finger tables is shown in Figure 2.

The lookup works as follows. When a node n does not know the successor of a key k, it searches its finger table for the node whose identifier most closely precedes k and asks that node for a node closer to k in ID space. This process is repeated until the predecessor of k is found, whose successor is the node responsible for k.

Figure 1 Finger interval with m = 3 in Chord

For instance, in Figure 2, suppose that node 5 wants to find the successor of key identifier 2 but has no information about key ID 2. After searching its finger list, node 5 knows that identifier 2 belongs to the interval [1, 5). Node 5 then forwards the query to the successor in its third entry, which is node 1, expecting that node 1 knows about key ID 2. By searching the first entry in its finger list, node 1 infers that key ID 2 is hosted by node 2 and returns the result to node 5.
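The walk-through above can be reproduced with a minimal sketch of Chord's finger tables and iterative lookup; Python is used purely for illustration, with m = 3 and live nodes 1, 2 and 5 as in Figures 1 and 2:

```python
M = 3                          # identifier bits: ring positions 0 .. 2**M - 1
NODES = sorted([1, 2, 5])      # live node identifiers, as in Figures 1 and 2

def successor(ident):
    """First live node clockwise from `ident` on the identifier circle."""
    return next((n for n in NODES if n >= ident), NODES[0])

def fingers(n):
    """Finger i of node n is successor((n + 2**(i-1)) mod 2**M), 1 <= i <= M."""
    return [successor((n + 2 ** (i - 1)) % 2 ** M) for i in range(1, M + 1)]

def in_interval(x, a, b):
    """Circular interval test: x in (a, b] on the ring."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def find_successor(start, key):
    """Iterative lookup: forward to closer fingers until the key's predecessor."""
    node, route = start, [start]
    while not in_interval(key, node, successor((node + 1) % 2 ** M)):
        closer = [f for f in fingers(node)
                  if f != node and in_interval(f, node, (key - 1) % 2 ** M)]
        node = closer[-1] if closer else successor((node + 1) % 2 ** M)
        route.append(node)
    return successor(key), route

print(fingers(5))              # [1, 1, 1]  -> node 5's finger table
print(find_successor(5, 2))    # (2, [5, 1]): node 5 forwards to node 1, which resolves key 2
```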


Figure 2 Finger table with node 1, 2 and 5 in Chord

2.2.1.2 Chord Maintenance mechanisms

Node Joins As a major challenge, Chord needs to preserve the correctness and consistency of the routing tables in a dynamic network environment. When a node joins, three steps need to be fulfilled to preserve Chord's ability to locate every key in the network.
1) The new node n initializes its predecessor and fingers by asking another node to look them up in the network. This process can be simplified by asking a neighbour in the identifier space for a copy of its fingers and predecessor; due to the similarity between the tables of node n and its neighbour, n can use them as hints to build its own routing tables.
2) The fingers of other nodes have to be updated to reflect node n's joining. The update starts with the ith finger of node n, proceeds counter-clockwise and finishes at the node whose ith finger precedes n.
3) Keys, together with their data items, need to move to the newly responsible node. Node n contacts its successor to obtain all the keys it is now responsible for.

As a whole, research in [18] shows that Chord needs to take O(log N) hops to reach destinations. The size of the routing table is O(log N). Upon a node joining, the number of nodes to be updated is O(log N) and the time is O(log^2 N) [18].


Node fails The stabilization process runs in the background periodically to discover failed nodes. The essential step in recovering from failures is to maintain correct successors. If a node n finds out, during stabilization, that its successor's predecessor p falls between n and its successor, then it adopts p as its new successor and notifies p so that p can update its predecessor to n. To improve correctness in a dynamic network environment, a successor list is added to the Chord ring. With a node's r closest successors in ID space forming a successor list, the node can simply route queries to the first live successor in the list, adopting it as the new successor, when its immediate successor fails. After some time, both the successor list and the finger tables have correct values for their entries. During the stabilization process, their functions complement each other to guarantee a lower-fault and faster-responding overlay network.
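A minimal sketch of this stabilization logic, loosely following the pseudocode in [18] (the Node class and liveness flag are illustrative only), is:

```python
def in_interval(x: int, a: int, b: int) -> bool:
    """Circular interval test: x in (a, b] on the identifier ring."""
    return (a < x <= b) if a < b else (x > a or x <= b)

class Node:
    def __init__(self, ident: int):
        self.ident = ident
        self.alive = True
        self.successors = []      # the r closest successors in ID space
        self.predecessor = None

    def first_live_successor(self):
        # route around failures: use the first live entry of the successor list
        for s in self.successors:
            if s.alive:
                return s
        raise RuntimeError("all successors failed")

    def stabilize(self):
        """Run periodically: verify the immediate successor and adopt newcomers."""
        succ = self.first_live_successor()
        x = succ.predecessor
        if x is not None and x.alive and in_interval(x.ident, self.ident, succ.ident):
            succ = x                       # a node now sits between us and succ
            self.successors.insert(0, x)
        succ.notify(self)                  # tell succ we believe we are its predecessor

    def notify(self, candidate: "Node"):
        if (self.predecessor is None or not self.predecessor.alive or
                in_interval(candidate.ident, self.predecessor.ident, self.ident)):
            self.predecessor = candidate
```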

2.2.1.3 Chord Proximity Neighbour Selection

Although the original Chord design [18] does not provide locality, a later version [26] achieves it by employing the mechanism of Proximity Neighbour Selection to reduce lookup latency and network resource consumption.

Chord with Proximity Neighbour Selection introduces another parameter, the base b, to control the routing table size. A Chord node n keeps a finger list of (b-1) x log_b N fingers, where N stands for the size of the namespace (2^64 in our simulation). Therefore, with an increasing value of b, there are more finger table levels, resulting in a larger finger table. The original version of Chord [18] effectively uses a base b of 2; each node then maintains 64 fingers in a namespace of size 2^64.

In ChordFingerPNS, any node whose ID lies within the range from n + ((b-1)/b)^(i+1) x 2^64 to n + ((b-1)/b)^i x 2^64 (modulo 2^64) can be used as the ith finger of n [33]. The PNS(x) algorithm checks the latency of the first x nodes among these candidate nodes and selects the one with the lowest latency. An ideal PNS would use x equal to the number of nodes in the simulated network. However, an ideal PNS would be expensive to implement in a large network, since the latencies from one node to all the other nodes need to be collected. The simulation results in [26] show that x = 16 approximates the ideal PNS well in finding the closest fingers. It is also found that the lookup latency using PNS approximates 1.5 times the average underlay round-trip time, regardless of the number of DHT nodes. Our simulation in Chapter 3 simulates Chord with Proximity Neighbour Selection.
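A sketch of the PNS(x) selection rule, using the candidate range given above (the node list and latency function are stand-ins for the real lookup and probing machinery), is:

```python
import random

BITS = 64                       # identifier space of 2**64, as in our simulations

def pns_finger(n: int, i: int, b: int, x: int, nodes, latency):
    """Pick the i-th finger of node n under PNS(x).

    `nodes` is an iterable of known node IDs and `latency(a, b)` returns the
    measured round-trip latency between two nodes; both are assumptions made
    so the sketch is self-contained.
    """
    space = 2 ** BITS
    lo = (n + int(((b - 1) / b) ** (i + 1) * space)) % space
    hi = (n + int(((b - 1) / b) ** i * space)) % space
    in_range = lambda v: (lo <= v < hi) if lo < hi else (v >= lo or v < hi)
    candidates = [m for m in nodes if in_range(m)]
    # probe only the first x candidates and keep the one with the lowest latency
    probed = random.sample(candidates, min(x, len(candidates)))
    return min(probed, key=lambda m: latency(n, m)) if probed else None
```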

2.2.2 Plaxton

Tapestry and Pastry are structured overlays designed based on the Plaxton mechanisms. Plaxton et al. present in [34] a distributed data structure, optimized to support a network overlay, for locating named objects and routing messages to those objects [35]. Plaxton employs a neighbour map, which is a local routing map at each node. The map resolves a lookup by incrementally routing overlay messages towards the destination ID one digit at a time, from right to left. The neighbour map at each node has multiple levels, each matching a longer suffix of an ID. For instance, the 3rd entry of the 3rd level for node 86AB is one of the nodes closest in network proximity among *3AB. A Plaxton routing example is shown in Figure 3.

Figure 3 Plaxton routing with key 86AB (L denotes routing level)
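The digit-at-a-time suffix matching can be illustrated with a small sketch; the node IDs and neighbour entries below are hypothetical, chosen only to show how each hop resolves one more digit of the key:

```python
def shared_suffix_len(a, b):
    """Number of trailing (right-to-left) digits that identifiers a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def next_hop(current, key, neighbours):
    """Suffix routing: forward to a neighbour matching one more trailing digit.

    `neighbours` stands in for the entries of the current node's neighbour map;
    in a real Plaxton mesh each entry is the closest such node in network distance.
    """
    level = shared_suffix_len(current, key)        # digits already resolved
    wanted = key[-1 - level]                       # next digit to resolve
    for n in neighbours:
        if shared_suffix_len(n, key) > level and n[-1 - level] == wanted:
            return n
    return None                                    # no closer node known

# Hypothetical walk towards key 86AB: each hop fixes one more trailing digit.
table = {"C137": ["F49B"], "F49B": ["E8AB"], "E8AB": ["26AB"], "26AB": ["86AB"]}
node, route = "C137", ["C137"]
while node != "86AB":
    node = next_hop(node, "86AB", table[node])
    route.append(node)
print(route)    # ['C137', 'F49B', 'E8AB', '26AB', '86AB']
```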

The Plaxton mesh uses a root node R and a server S for each object O. When server S hosts O, it publishes this information to R, and the relevant information about the mapping between O and S is stored at each node along the publication path. To locate object O, a client first sends a query message towards the root node; upon reaching a node that holds the mapping to the server responsible for O, the message is redirected to that server instead. This routing scheme exploits locality and improves system responsiveness to a great extent. Furthermore, the root node is chosen using globally deterministic consistent hashing and has no correlation with the object. However, the use of a single root node imposes a single point of failure on the Plaxton system.

As well as the single point of failure, the Plaxton system is also unable to adapt to dynamic query patterns, since its design is based on the assumption of a static data structure.

Both Tapestry and Pastry are built on top of the Plaxton algorithm with additional fully self-organizing and failure-recovery schemes. They are similar in their suffix (Tapestry) / prefix (Pastry) address routing, their approach to node insertion and deletion, and their storage overhead costs. Both Tapestry and Pastry have a deterministic overlay path of O(log_b N) hops (where N is the number of peers) and incur about log_b N hops of maintenance traffic when peers join or leave the system. They also have the same level of fault resiliency [11].

The Plaxton system also uses a base b as a parameter to control the routing table size. The base b determines the numeric base used to interpret the identifier bit string. For example, if a node ID is viewed as a sequence of l base-b digits, then its routing table has l levels with b entries in each level. Thus, the number of routing entries is b x log_b N, where N stands for the size of the namespace (2^64 in the simulation). Therefore, the routing table size increases with an increasing base b.


2.2.3 Tapestry

2.2.3.1 Tapestry basic routing scheme

The location and routing scheme in Tapestry is similar to the one in Plaxton. The routing table of each Tapestry node can be seen in Figure 4. It is also organized into multiple routing levels. Each level contains pointers to a group of nodes, chosen by network distance, whose identifiers match the suffix for that level. Each node also maintains a backpointer list of neighbouring nodes. Furthermore, Tapestry allows the upper application to specify how objects are chosen, as opposed to choosing the first available object within some distance as in Plaxton.

2.2.3.2 Fault handling

Handling routing faults Tapestry has the ability to detect and recover from multiple failures. The failures that may occur during the routing process include server overloading, router outages and underlying link faults.

Figure 4 Tapestry routing table for node 5261

To deal with these failures, Tapestry sends heartbeat packets to the nodes in its backpointer list and keeps the status of each node up-to-date. Instead of maintaining only one primary neighbour, Tapestry also maintains information from two other neighbours.

These backup neighbours become ready for use when the primary one fails to respond.

Handling location faults To solve the problem of a single point of failure, Tapestry assigns multiple roots to the same object. By applying a “salt” value (e.g. 1, 2, 3) to each object ID, multiple roots can be produced by hashing the object ID together with the salt value. With these multiple roots, there is a greater chance of locating an object in the network even under severe network failure.

2.2.3.3 Tapestry surrogate routing

Tapestry improves on the Plaxton routing scheme with a different routing mechanism, surrogate routing, to eliminate the requirement for global knowledge of other nodes and to cope with a dynamic network environment.

Surrogate routing treats the object ID as the ID of the root node and attempts to route to this node first. Since it is unlikely that such a node exists, routing proceeds as long as the neighbour map at the current node has a non-empty entry towards the object ID; the node at which the routing procedure terminates is assigned as the root node for this object. It is proved that, with surrogate routing, any object identifier can be mapped to a unique node in the network.

2.2.3.4 Tapestry maintenance algorithm

Tapestry uses adaptive algorithms with soft state to maintain fault tolerance in the face of changing node membership and network faults [19]. In the maintenance process, the soft-state approach of republishing at regular intervals usually introduces a large amount of network overhead. A method of proactive explicit updates is consequently introduced to reduce the overhead while keeping location pointers up-to-date. This combined approach is also used to handle mobile objects, which tend to move between two servers.

Node insertion There are two major steps for a node to join a Tapestry overlay network. Firstly, the new node contacts a nearby bootstrap node. Then all the entries of the different levels in the new node's neighbourhood map are populated by routing towards its new ID; during this process, the entries at level i of its neighbourhood map are gathered from its ith hop. Each entry contains up to c nodes, sorted by latency, and the closest node is marked as the primary neighbour. Tapestry periodically checks the availability of the primary neighbour in each routing entry and replaces it with the next-closest neighbour if the primary one is found to be dead. Finally, the data that the new ID should be in charge of is moved to the new node.

Secondly, the relevant nodes are informed of the new node's existence. The surrogate's backpointers are transferred back level by level to the nodes that have an empty entry which the new node can fill. Also, other nodes in proximity, on receiving the multicast message from the new node, are able to decide whether the new node is more appropriate than the existing entries at a certain level.

Node disconnection or failure Tapestry proposes a new approach, which we now describe, of proactive explicit republishing in addition to regular updates for both nodes and objects. The new mechanism adds a LasthopID and epoch numbers to the object location mappings.

A node that plans to leave the network proactively notifies the servers whose object mappings it stores, as shown in Figure 5.

Figure 5 Scenarios for node failures

Node E stores location pointers for objects that server S hosts. On receiving the leaving notification from E, server S republishes to the network with a new epoch number and E's nodeID. Any node that is not affected by E's departure simply forwards this refresh message and updates its epoch number. The previous hop P, on receiving this update message, automatically selects the secondary available route for all messages. While routing republish messages on the new route, every hop recognizes and records the epoch number until the crosspoint G with the old route is encountered. This crosspoint then notifies every node on the original route to delete the stale location mappings. However, a soft-state approach is still needed to cope with node failures that may happen on the old route while the location mappings are being deleted.

To conclude, Tapestry improves adaptability to a dynamic network environment, is fault-tolerant to simultaneous faults and provides optimization solutions for hotspots. Tapestry also presents special mechanisms that can quickly detect network connectivity problems and hotspots and provide suggestions to solve them.

Tapestry infrastructure has been used for many applications such as a global-scale storage utility application OceanStore [36], a self-organizing application-level multicast system Bayeux [37] and a decentralized spam-filtering system Spamwatch [38].

2.2.3.5 Tapestry locality

Tapestry's notion of network proximity is based on scalar proximity metrics such as the number of underlay hops or geographical distance. The locality property in Tapestry is similar to the one in Plaxton. In Plaxton routing, resolving one digit at a time reduces the number of potential candidates geometrically. Thus, the path taken to the root node by the publisher (or server S) and the path from the client are likely to converge quickly. Therefore, queries for local objects are likely to quickly find a pointer to the location of the object.

Since Tapestry employs surrogate routing to deal with the dynamic networking environment, it may take a small number of additional hops to reach a root when compared with the Plaxton algorithm.

2.2.4 Pastry

Pastry is another Plaxton-style overlay substrate for a variety of Internet peer-to-peer applications. It can be used for global data storage, data sharing, group communication and naming. Several applications have been developed on top of Pastry, including a global, persistent storage utility PAST [39], a scalable publish-subscribe system SCRIBE [40] and a decentralized web cache system Squirrel [41] [20].

The routing mechanism in Pastry is similar to Plaxton's. Each node has a unique 128-bit identifier, its nodeID. Given a message with a key, a Pastry node routes it towards the node with the numerically closest nodeID among all live Pastry nodes. Pastry is also designed to minimize the underlying network distance, such as the number of underlay hops. Pastry is completely self-organized, adapting to node arrival and departure automatically.

2.2.4.1 Pastry basic routing scheme

Apart from a routing table, which has the same function as the neighbour map in Plaxton, each Pastry node also maintains its routing information in the form of a neighbourhood set and a leaf set. The routing table is organized into approximately log_(2^b) N rows with 2^b - 1 entries in each row, where N denotes the number of nodes in the network and b is a configuration parameter³. Each of the 2^b - 1 entries at row n refers to a node that shares the first n digits of its nodeID with the current node, while its (n+1)th digit takes one of the values other than the corresponding digit of the current nodeID, as shown in Figure 6.

3 b corresponds to _base in our simulations in Chapter 3.

Figure 6 also shows the neighbourhood set and the leaf set. The neighbourhood set contains the identifiers and IP addresses of the nodes that are closest to the local node in network proximity. The leaf set is a group of nodes, half with the numerically closest larger nodeIDs and half with the numerically closest smaller nodeIDs, compared with the local nodeID. The leaf set and the routing table are used for message routing, whereas the neighbourhood set is used for maintaining Pastry's locality property.

Figure 6 The routing table for a Pastry node with NodeID 31200123

During the routing procedure, a node first examines its leaf set upon receiving a request. If the key is within the range of the nodeIDs in the leaf set, the message is delivered directly to the destination node. Otherwise, the message is forwarded to a node in the routing table whose nodeID shares a prefix with the key that is at least one digit longer than the local node's. If no such node exists or it is unreachable, the message is forwarded to a node that shares the same prefix length with the key but is numerically closer to the key than the local node.
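The three-step forwarding rule can be sketched as follows; this is a simplified illustration only, and the data structures are stand-ins for a real Pastry node's state:

```python
def shared_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def pastry_next_hop(node, key, leaf_set, routing_table, known):
    """One forwarding decision for a message with `key` at `node` (hex-string IDs).

    `leaf_set` is a non-empty set of the numerically closest nodeIDs,
    `routing_table[l][d]` holds a node sharing a prefix of length l whose next
    digit is d (if any), and `known` is the union of all entries the node keeps.
    """
    dist = lambda a, b: abs(int(a, 16) - int(b, 16))
    # 1) key within the leaf set's range: deliver to the numerically closest node
    lo = min(leaf_set, key=lambda x: int(x, 16))
    hi = max(leaf_set, key=lambda x: int(x, 16))
    if int(lo, 16) <= int(key, 16) <= int(hi, 16):
        return min(leaf_set | {node}, key=lambda x: dist(x, key))
    # 2) otherwise use the routing table entry sharing one more prefix digit
    l = shared_prefix_len(node, key)
    entry = routing_table.get(l, {}).get(key[l])
    if entry is not None:
        return entry
    # 3) fall back to any known node with at least as long a prefix that is
    #    numerically closer to the key than the local node
    fallback = [m for m in known
                if shared_prefix_len(m, key) >= l and dist(m, key) < dist(node, key)]
    return min(fallback, key=lambda m: dist(m, key)) if fallback else node
```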


It is proved in [20] that the size of the routing table is approximately log_(2^b) N x (2^b - 1) entries, and that the upper bound on the number of overlay hops for an N-node network is log_(2^b) N.
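As a worked example (the parameter values are chosen only for illustration), with b = 4 and N = 10^6 nodes these bounds give:

```latex
\log_{2^{b}} N = \log_{16} 10^{6} \approx 4.98 \;\Rightarrow\; 5 \text{ rows},\qquad
\text{routing table} \approx 5 \times (2^{4}-1) = 75 \text{ entries},\qquad
\text{route length} \le 5 \text{ overlay hops.}
```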

2.2.4.2 Pastry maintenance mechanism

As with all of the other self-organized peer-to-peer DHT substrates, Pastry also has mechanisms to deal with node insertion and disconnection/failure without any centralized control.

Node insertion The approach to populating the routing table in Pastry is similar to the one in Tapestry. When a new node joins a Pastry network, it first searches for an existing nearby Pastry node by IP multicast or expanding ring search. Assume the new node is N and this nearby bootstrap node is A. N populates its routing table by asking A to route a special “join” message with key N. Assume M is the node that eventually hosts this key. The ith row of the routing table of N is populated by copying the ith row from the ith node encountered on the route to M. For instance, if N0 and N1 denote the zeroth and first rows of the routing table of N, then N can copy the zeroth row of A for N0 and the first row of B (the first node encountered after A) for N1.

N takes its initial neighbourhood set from node A, since they are close in network proximity. Furthermore, since the destination node M is close to N in the node identifier space, N builds its leaf set based on the leaf set of M.

Finally, N informs all the other nodes in its routing table, neighbourhood set and leaf set of its resulting state. These nodes update their routing tables upon receiving the information.

Node disconnection The failure of a node in the routing table does not usually affect message forwarding, since the message can be directed to another node. However, upon node departures or failures, a replacement entry must be found to maintain the integrity of routing tables, neighbourhood sets and leaf sets.

To repair a failed routing table entry R_l^d (the entry in row l for digit d), the node first contacts another node referred to by an entry R_l^i (i != d) in the same row and asks that node for its entry for R_l^d. If all the entries in row l fail simultaneously, the node contacts an entry in row l+1 and asks for routing information covering a wider range. To replace a dead node entry in a leaf set, the local node contacts the live node with the largest index on the side of the failed node and copies its leaf set; an appropriate replacement entry is then selected by comparing the two leaf sets. The neighbourhood set is updated periodically to refresh its entries.

Pastry is proved to be robust against concurrent node failures. Eventual delivery can be guaranteed unless half of the leaf set with adjacent node identifiers fails simultaneously.

2.2.4.3 Locality in Pastry

Pastry's notion of network proximity is similar to that of Tapestry. It realizes locality in three different ways: locality in the routing table, route locality and locating the nearest replica among k nodes. First of all, a node initializes its routing table and neighbourhood set through the joining procedure. During this procedure, the node asks an existing node to send a query message using its nodeID as the key. The routing table of the node is then initialized by obtaining the ith row of the routing table from the ith node encountered along the query path. This procedure approximately achieves the desired locality property in the routing table and neighbourhood set. Moreover, the node will also request information from each node in its routing table and neighbourhood set and update its own state upon finding any closer nodes.

Second, since the routing entries in each Pastry node are chosen to be close to that node among all nodes with the same prefix, messages from that node tend to travel the minimum distance while moving towards their destination in the nodeID space. Third, along a route from the source to the numerically closest node in the identifier space, the message first reaches a node which is close to the source among all k numerically close nodes.

While Pastry realizes its locality property in a way quite similar to Tapestry, differences still exist in other aspects of the two protocols, which we discuss in the next section.

2.2.5 Tapestry and Pastry Comparison

Tapestry and Pastry differ in some essential properties, such as their approaches to achieving locality and replication.

1) Replications: Pastry replicates objects onto several nodes without control from the objects' owners; these nodes are the ones whose identifiers are closest to the object identifier in the namespace. Tapestry places pointers to the object location on the nodes between the servers and the roots.

2) Locality: Pastry clients use the objectID as a key to find the closest node where object replicas are stored, while Tapestry only stores pointers to the objects on the route from the servers to the object roots. Pastry certainly reduces retrieval latency by placing real objects at multiple nodes. Yet there is the overhead of each replica node needing to store an extra object, and Pastry needs to address object security, confidentiality and consistency.


2.2.6 Locality discussion and optimization

The difference between Chord and Tapestry/Pastry is that Chord does not have a built-in locality property; it instead realizes locality through a runtime heuristic called Proximity Neighbour Selection. In this section, we investigate this mechanism. We also study the mechanisms by which Tapestry optimizes its locality property.

2.2.6.1 Proximity neighbour selection

Proximity Neighbour Selection, as proposed for Chord, has also been implemented on Pastry, together with two heuristic approximations: perfect PNS and PNS with constrained gossiping (PNS-CG). These PNS mechanisms take over the role of the neighbourhood set in the original Pastry.

Perfect PNS can achieve low delay stretch and local route convergence fairly well, but it is too expensive to implement in a large and dynamic network due to its large overhead. Constrained-gossiping PNS, on the other hand, reduces the overhead compared with the original Pastry and proposes a new algorithm to locate the seed node for joining using the existing routing state [42].

2.2.6.2 Optimizations in Tapestry

The routing design of Tapestry is locality-aware and attempts to locate an object in the local area first, before forwarding the query to a remote area. However, it may still take one or two overlay hops to route to the destination even if the object is in the same local area as the source of the query. This is particularly inefficient for locality-sensitive applications (e.g. measurement sharing applications). Tapestry has proposed three mechanisms to reduce the distance stretch [43].

Maintaining backup neighbours Instead of maintaining one primary neighbour, Tapestry keeps multiple backup neighbours in each routing entry. During the publication process, the local node forwards the publish message to as many neighbours as possible, together with the primary neighbour. Since only the first few nodes from the object source are likely to be within the object's local area, this optimization is only used along the first few hops of the publication path. This optimization functions the same way as the neighbourhood set in Pastry, except that the neighbourhood set is only used for maintenance. An example can be seen in Figure 7.

In Figure 7, node 57832 is a backup neighbour for node 532AB in node 3A12C's routing table. Node 57832 will receive the location mapping for object 56788 during the publication process if the optimization applies. When node 57832 later issues a query, it is able to route to node 3A12C directly, reducing the routing latency, i.e. the RDP⁴.

Figure 7 Publishing location pointers of Item 56788 to backup neighbours

Publishing objects to nearest neighbours Rather than flooding the object pointers to backup neighbours as discussed in the first technique, object pointers can also be copied to any nearby neighbour at a given level in the routing table. This should also be restricted to the first few hops along the publishing path to limit the overhead.

4 RDP (Relative Delay Penalty) is defined as the ratio of the delay on the overlay path to the delay on the shortest path through IP layer unicast.

Assigning local surrogate To prevent a query venturing into a wider-area network for an object when the object exists in the local area, a local surrogate node is assigned to the object during the publication process. This surrogate node is checked first during a lookup to avoid an expensive wide-area hop. The approach used to judge whether a query is leaving the local area is to compare the current hop's latency with the latency of the previous hop: it is considered a wide-area hop if this latency is t times larger than the previous latency. This optimization process can be seen in Figure 8.

In this figure, to avoid the query leaving the UNSW campus network for its root node at Melbourne University, a pointer to object 56788 is placed on its local surrogate node 5C32E, so that node 5C32E can find the object directly within UNSW without referring to a remote network. It can be concluded that the three optimization techniques reduce the distance stretch incrementally and dramatically, and the last optimization is particularly appealing for our goal of sharing measurement data locally.
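The two heuristics described above, detecting a wide-area hop and consulting the local surrogate first, can be sketched as follows; the threshold t and the data structures are illustrative assumptions, not values prescribed by Tapestry:

```python
def is_wide_area_hop(prev_latency_ms: float, curr_latency_ms: float, t: float = 3.0) -> bool:
    """A hop is treated as a wide-area hop if its latency is t times larger than
    the previous hop's latency (t = 3.0 is an illustrative value only)."""
    return curr_latency_ms > t * prev_latency_ms

def lookup_with_local_surrogate(obj_id, local_surrogate, wide_area_lookup):
    """Check the local surrogate first; only go wide-area if it has no pointer.

    `local_surrogate` maps objectIDs to location pointers published within the
    local domain, and `wide_area_lookup` is the normal overlay lookup; both are
    stand-ins introduced for this sketch.
    """
    pointer = local_surrogate.get(obj_id)
    return pointer if pointer is not None else wide_area_lookup(obj_id)
```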

Figure 8 Assigning local surrogate to reduce stretch


2.3 Simulation tools

2.3.1 P2Psim

Researchers investigating P2P routing algorithms usually tend to write customized code for their simulations. There are many peer-to-peer simulators used by researchers, e.g. PeerSim [44], PlanetSim [45] and P2Psim [27]. However, P2Psim can easily support various protocols and is more popular and better recognized by the research community.

There are seven modules interacting with each other in P2Psim: main, protocols, topologies, observers, events, eventgenerators and failuremodels.
i. Chord and Tapestry are supported in the protocol module, together with other popular overlay algorithms such as Kelips [46], Koorde [47], Accordion [48], Vivaldi [49], etc.
ii. P2Psim supports different types of conceptual topologies. Constdisttopology specifies the network size and a constant latency between any two nodes. E2egraph assigns symmetric link latencies between two IP addresses, whereas G2graph selects random values for these latencies. In randomgraph, every node has a few links to other nodes, with randomly chosen latencies. The Euclidean topology calculates the Euclidean distance as the latency, based on the coordinates of nodes on the 2-dimensional topology plane. GT-ITM generates topologies using randomgraph in its stub domains but calculates the Euclidean distance as the latency.
iii. P2Psim only supports p2pevents, which include the actions of node join, leave and lookup, and simevents, which control simulation exit. After our extension of P2Psim, it is also able to support link events (i.e. a link failing and rejoining the network).
iv. In eventgenerators, churneventgenerator is chosen for our simulations. It determines the frequency and duration of node churn.

v. Failuremodels include nullfailuremodel, constantfailuremodel and roundtripfailuremodel; we only consider the situation without failures, i.e. nullfailuremodel.

However, P2Psim only simulates key lookups, without any data exchange. Thus, it does not provide an interface for building applications on DHTs. It also neglects link transmission and queuing delays, which makes simulated query latencies shorter than real ones but does not affect the number of underlay hops a query takes. Addressing these two issues is beyond the scope of our study due to time limitations.

2.3.2 GT-ITM

GT-ITM is an acronym for Georgia Tech Internetwork Topology Models. It models real networks closely and is widely used to simulate overlay networks [50, 51]. a) Use of GT-ITM topology generator GT-ITM provides two types of network topologies: random graphs and transit-stub graphs. The transit-stub model is a better abstraction of real networks as a whole, while the random graph model is appropriate for the topology within a single transit or stub domain. The model of the topology is shown in Figure 9.

The source file specifies the type of graph model (random or transit-stub), the number of graphs to be generated, the number of transit domains, the average number of stub domains attached per transit node, the average number of nodes per domain, and the method for generating edges. A hypothetical specification is sketched below.
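For illustration only, a transit-stub specification might look roughly like the fragment below, where the first line requests one transit-stub graph with a random seed, the second line gives the number of stub domains per transit node and the numbers of extra transit-stub and stub-stub edges, and the remaining three lines describe the transit-domain interconnection graph, the graphs inside each transit domain and the graphs inside each stub domain (each as node count, scale, edge-generation method and alpha). The values are invented, and the exact field order should be checked against the GT-ITM documentation.

    ts 1 47
    3 0 0
    1 20 3 1.0
    4 20 3 0.8
    3 10 3 0.5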


Figure 9 Transit-stub topology model in GT-ITM (a transit domain interconnecting several stub domains)

After a topology file is generated, it can be converted to a .txt file so that it can be read and evaluated by the GT-ITM evaluation process. b) Understanding of topology files A typical topology file in GT-ITM is shown in Figure 10.

Figure 10 GT-ITM topology file
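Since the original figure cannot be reproduced here, a hypothetical fragment in the spirit of such a converted file, following the field layout explained below (node index, domain label and coordinates in the VERTICES section; node indices, latency and minimum hop count in the EDGES section), might look like this; all numbers are invented and the exact column layout depends on the GT-ITM version and the conversion used.

    VERTICES (index, domain, x, y):
    0   T:0.0       52  61
    1   S:0.0/0.1   48  70

    EDGES (from, to, latency, minimum hops):
    0   1   12   1
    1   5   30   2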

VERTICES describes the node information such as its index, position (transit or stub domain) and coordinates on the 2-dimensional topology plane. Each transit node is indexed to identify which domain it belongs to. For example, T:0.1 means that this transit node is node 1 in transit domain 0. Similarly, each stub node is indexed to indicate which domain it belongs to and which transit node it connects with. For example, S:0.0/0.1 means that this stub node is connected with transit node 0.0 and is number 1 in stub domain 0. EDGES shows the latency and minimum hops between two nodes, which are identified by their indices. The last column, the minimum number of hops, is the most useful information for measuring underlay hops between arbitrary nodes.

After topology files are generated, the edriver program is used to evaluate the node degree, hop-depth, length-depth, routing-depth distribution and number of bi-connected components. A typical evaluation file (14 nodes) is shown in Figure 11. The evaluation file reflects the topological properties. For a topology with n nodes and m edges, avgdeg is the average node degree 2m/n. diam-hh is the largest hop-depth and avgdepth-hh is the average hop-depth; the hop-depth at node n is the depth of the shortest path tree rooted at node n towards the other nodes, with all edges measured in unit weight. diam-ll and avgdepth-ll reflect the same depths with the length metric. diam-hl and avgdepth-hl are the route-based diameter and the average route-based depth; they are the depths of the shortest path tree constructed with the routing-policy weights. bicomp is the number of bi-connected components and reflects the degree of ‘connectedness’ and ‘edge redundancy’ in a graph.

Figure 11 GT-ITM topology evaluation file

It should be noted that GT-ITM identifies nodes with indices rather than IP addresses, which is consistent with P2Psim.

2.4 Relevant overlay applications

As was discussed in Chapter 1, a peer-to-peer overlay network is a potentially ideal platform for sharing network measurements due to its distributed nature. However, this is not a completely novel idea. Researchers from the Georgia Institute of Technology proposed M-Coop [25] in 2003 as a measurement infrastructure for distance metrics such as latency or hop count. PeerCQ [52] was proposed in the same year to address real-time monitoring of information updates. Resilient Overlay Networks (RON) [53] provide an overlay architecture to detect and recover from link and path outages. In the second part of this chapter, we give details of these peer-to-peer applications and describe how they relate to our design.

2.4.1 Resilient Overlay Networks

An essential aim of sharing network measurement information on the Internet is to improve the reliability of Internet packet delivery in the event of link or path failures.

Today’s Internet is vulnerable to router and link failures, configuration errors and malicious attacks because of the inefficient failure recovery in BGP-4. Resilient Overlay Networks [53], on the other hand, utilize distributed Internet applications to detect and recover from link or path outages and periods of degraded performance within only a few seconds, improving on the performance of the Internet's current routing protocols.

Three goals were claimed in the design of Resilient Overlay Networks. The first goal is to enable a group of nodes to communicate with each other normally when the underlying links connecting them fail to function properly. RON first detects link problems by active probing or passive monitoring. Then it decides between the directly connected path and a detour path via a RON node. The second essential goal is to integrate overlay routing and path selection more tightly with applications than before. In this sense, the requirements of specific distributed applications are considered and RONs can be deployed in various ways. The third goal of designing RON is to enforce policy routing, so that policy routing can be more precise and more sets of paths will be available in the case of failures. Furthermore, RON can potentially provide fine-grained policy routing.

The principle of Resilient Overlay Networks is shown in Figure 12.

Figure 12 Conceptual design of Resilient Overlay Networks [53]

Every module in Figure 12 stands for a RON node. Each RON node is deployed at a different location on the Internet and communicates with the others through overlay virtual links. Each RON node monitors the quality of the underlying path between itself and other nodes on the overlay network, and determines the best routes to transfer packets. Resilient Overlay Networks follow a specific routing protocol to discover the overlay topology and exchange link information between overlay nodes. The link information can be latency, packet loss rate, throughput, etc. It is claimed that forwarding packets via at most one intermediate RON node is sufficient to overcome faults and improve performance in most cases [53].

Path selection is the most essential procedure for effective overlay routing in Resilient Overlay Networks. Three different routing metrics can be followed to minimize latency, minimize loss, or optimize TCP throughput. An active probing method is employed to collect samples for estimating latency, loss and bandwidth. Another active probing method, outage detection, is deployed to examine the virtual link status on the overlay network. To reduce the overhead of maintaining detailed routing information, a RON performance database is used to store historical samples. The system also needs a summarization mechanism so that the same data can be used for different purposes.

Resilient Overlay Networks aim at improving the reliability and robustness of the Internet. However, they have some essential shortcomings that may hinder the wide-area deployment of RONs. Firstly, RON may violate AUPs5 and BGP transit policies. This issue can be solved if overlay ISPs make appropriate agreements with each other. Cryptographic authentication and access controls should also be deployed to prevent misuse of an established RON. Secondly, RON faces a scalability issue because of its aggressive maintenance procedure and path selection mechanism. Finally, RONs need to solve two problems caused by NATs. A host behind a NAT does not have a globally unique IP address; this problem can be solved by using an address/port pair for that host. Furthermore, firewalls and NATs usually block active probing from the other side of the Internet, invalidating the communication between two hosts behind firewalls and causing false path outages and sub-optimal routing.

In addition, it is said that RONs are deployed between small groups of cooperating entities [53]. It can be expected that, instead of each node selecting its own optimal paths, the edge RON router makes the decision for its group members, so that each node in a small group experiences similar routing metrics (latency, loss, throughput, etc.). This might also cause sub-optimal routing.

5 AUP-- acceptable use policies


RON is designed to compensate for the shortcomings of the BGP-4 routing protocol. However, to improve scalability and avoid sub-optimal routing, RON needs to cooperate with an application that has the flexibility to obtain and share measurement information with less overhead.

2.4.2 M-Coop

A network measurement cooperative enterprise M-Coop (Measurement Cooperative) has been proposed to share network measurement information on peer-to-peer networks [25, 54]. M-Coop is a scalable, incrementally-deployable, peer-to-peer measurement infrastructure to estimate distance metrics (such as latency, hop count, etc.) between any pair of IP addresses on the Internet [25].

M-Coop is a system architecture created to answer queries for a given measurement type between two IP addresses IP1 and IP2, where the measurement type can be latency, hop count, bandwidth, jitter, etc. M-Coop is designed to be built on a peer-to-peer network infrastructure, so it inherits the peer-to-peer property of requiring no global coordination between nodes, and nodes can join and leave the system on their own.

The M-Coop architecture consists of two overlay networks: the control overlay (c-overlay) and the measurement overlay (m-overlay). Each node in the network needs to participate in both overlays. The m-overlay is connected using the AS-level graph of the Internet. Each node is assigned an area within which it can answer queries from other IP addresses; this area is named the area of responsibility (AOR). Each node in M-Coop collects measurement information by actively sending probing packets to other nodes. The m-overlay is also in charge of node arrivals and departures. The c-overlay, used as a control layer, is based on a Distributed Hash Table (DHT) such as Chord [18], CAN [21], Pastry [20] or Tapestry [19] for efficient information storing and searching.

The cooperation of the m-overlay and c-overlay works as follows. A query sent in the form (IP1, IP2, measurement type) is first taken by the control overlay to the node which is in the same AOR (area of responsibility) as IP1. If this node has the measurement information, it returns the relevant value for this query. Otherwise, a new query is sent along the path to the node which is in the same area as IP2. The metric data of all the traversed links is collected and returned as a reply to the new query and subsequently to the original query. If all queries fail, a new measurement may also be triggered to serve the query.

The M-Coop system can be built on top of any distributed hash table (Chord, Tapestry, Pastry, CAN, etc.). However, the m-overlay deals with node joining and leaving and is formed based on the BGP policies of each AS along the path, whereas the distributed hash table is only used for forwarding queries and storing information items. Therefore, the design of the m-overlay is essential for successfully finding the correct information, but no specific property of the c-overlay DHT is specified or demonstrated in M-Coop. Our proposed design, in contrast, investigates the properties of the DHT and relies on the supporting DHT to localize data items, so that data items can still be retrieved locally when the external links are disconnected.

2.4.3 PeerCQ

PeerCQ is proposed as a totally decentralized peer-to-peer system that performs information monitoring tasks over a large group of heterogeneous peers [52]. The information monitoring subscriptions are called continual queries (CQs) and are triggered when the monitored information is updated.

In PeerCQ, each node participates in evaluating continual queries, and any peer can post a continual query (CQ) of its own interest. When a new CQ is issued by a peer, this peer needs to find the appropriate peer to handle the CQ so that the system resources are utilized most efficiently and the load on peers is balanced.

The mechanisms that differentiate PeerCQ from other peer-to-peer protocols are its ability to balance load and to utilize system resources efficiently. PeerCQ extends existing P2P protocols, such as Chord or Pastry, to incorporate a capability-sensitive service partitioning scheme. PeerCQ is capability-sensitive in that both peer-awareness and CQ-awareness are included in the protocol. Peer-awareness takes into account the heterogeneous ability of peers to donate resources (CPU, memory, disk, and network bandwidth) and assigns more identifiers to a peer that donates more resources. CQ-awareness assigns CQs with similar trigger conditions to the same peers.

The distinct features of the service partitioning scheme are the creation of CQ identifiers and peer identifiers, and the two-phase matching (strict matching and relaxed matching) between CQs and peers. Strict matching is simply an extension of consistent hashing with peer-awareness and CQ-awareness. Relaxed matching incorporates additional characteristics of information monitoring applications, such as the network distance of peers from the source, cache affinity and peer load. This algorithm enables efficient CQ processing, reduces the overall bandwidth requirement and finely balances the workload across peers.

The conceptual design of the PeerCQ system as seen from the end-host is shown in Figure 13. The PeerCQ end-host middleware consists of two layers.


Figure 13 PeerCQ Architecture [52]

The lower layer is the PeerCQ Protocol Layer, which handles peer-to-peer communications, whereas the higher layer is the Information Monitoring Layer, which handles CQ subscription, trigger evaluation, and change notification. The system works as shown in Figure 13.

An end-user composes a CQ based on its own interests and first posts it on the entry peer, Peer A. Based on the service partitioning mechanism specific to PeerCQ, Peer A triggers the PeerCQ lookup function and determines the appropriate peer (Peer B) to execute this CQ. After Peer B receives the query, it starts to detect updates to the information of interest. Upon detecting an update, a notification is issued and the owner of this CQ, Peer A, can be notified through email or direct contact.

It can be seen from the design of PeerCQ shown in Figure 13 that Peer B is responsible for collecting data for similar continual queries in the same group (which includes Peer A's CQ). This implies that all the clients in this group share the same kinds of information and only one peer is in charge of collecting the related data.


2.5 Conclusion

This chapter provides the background knowledge for this thesis. It has summarized Chord, Pastry and Tapestry and compared their locality properties. In summary, locality is a heuristic present in the original design of Pastry and Tapestry. Furthermore, Chord and Pastry have also integrated Proximity Neighbour Selection to select the closest entries in the vicinity for their routing tables, and Pastry has additionally implemented constrained gossiping PNS to reduce the overheads and to help find the seed node for joining the network. As far as locality is concerned, Tapestry has the most appealing properties for a location-sensitive application (e.g. a measurement publishing application), once the optimization described above is employed. This survey has provided a solid background for choosing the appropriate overlay protocols for our application. To obtain more convincing results on the locality and reliability properties, comparative simulations between Chord and Tapestry are carried out in Chapter 3. P2Psim is currently the most popular peer-to-peer overlay DHT simulator in the research community, whilst GT-ITM is a topology generator that can generate transit-stub topologies to approximate the Internet closely. P2Psim and GT-ITM are introduced in this chapter to provide a background for our simulations in Chapter 3. The integration of these two tools and the extension of P2Psim are described in Appendix I.

This chapter also introduces some applications that are potentially useful to our application proposal. Resilient Overlay Networks deal with underlying link failures, since BGP-4 is not able to handle link failures efficiently [55]; they provide a heuristic for our simulations (refer to section 3.1.3). M-Coop solves the problem of measuring distance between two IP addresses. However, its measurement layer shares the responsibility of searching for the nearest neighbour holding useful measurement information, and, unlike the desirable locality property in our proposal, there is no specific requirement on the control layer. PeerCQ is used for monitoring information changes on the Internet. The information update is only triggered upon receiving requests from other peers, i.e. it is updated actively, whereas our proposal could update the measurement information alongside other daily-use applications. PeerCQ is also constrained to sharing the same type of information within one group. None of these designs solves our specific problem of updating and sharing network measurement information in real time. In Chapter 4, we will give details of our own design.


Chapter 3 Simulation Comparison of DHTs

Three popular structured DHTs, Chord, Tapestry and Pastry, were investigated in the previous chapter. All of these protocols share the property of DHT-based systems that they can locate a data object in O(logN) overlay hops on average. As was described in Chapter 2, Tapestry and Pastry have considered network locality from the outset, whereas Chord only considers locality in its later version by integrating Proximity Neighbour Selection. A one-hop DHT such as Accordion, however, does not need to consider this locality problem because of its short overlay path.

To investigate the differences in locality among these popular protocols, achieved either by Proximity Neighbour Selection or by in-built network proximity, simulations are carried out to measure underlying network paths and to determine the robustness of these protocols when network link failures occur. For our simulations, we use a publicly available and widely used peer-to-peer simulator, P2Psim [27], to evaluate the overlay protocols, and GT-ITM [28, 50] to generate the underlying network topology. The usage of these two tools has been described in the previous background chapter. Due to the lack of a Pastry implementation in P2Psim and the routing similarity between Pastry and Tapestry (as was described in Chapter 2), we only compare Chord and Tapestry.

3.1 Simulation methodology

3.1.1 Counting underlay hops

In P2Psim, three mandatory files are needed to commence the simulation: the protocol, topology and event files. At the initialization of P2Psim, these three files, together with some optional arguments, are parsed by the main thread to create nodes running the protocol and a network with the appropriate underlying topology. In our simulations, we only apply the same number of overlay nodes as underlay nodes, e.g. 100 overlay nodes on a 100-node underlay. However, this is not realistic since a physical node could be a router or switch which does not support high-level peer-to-peer DHTs. Due to the complexity of differentiating transit and stub nodes in P2Psim, we leave the implementation of applying fewer overlay nodes on a larger underlay network for future work.

During topology initialization, a GT-ITM topology is created and the minimum numbers of underlay hops between arbitrary node pairs are calculated using the Floyd-Warshall algorithm [56] and stored in arrays. Then the events of nodes joining, leaving or sending queries are created from the event file and the simulation starts from this point.
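A minimal sketch of this all-pairs computation is shown below, assuming the direct-connectivity information has already been loaded into a hop-count matrix; the function and variable names are ours, not P2Psim's.

    #include <vector>
    #include <algorithm>

    // All-pairs minimum underlay hop counts via the Floyd-Warshall algorithm.
    // On entry, hops[i][j] is 1 for directly connected node pairs, 0 on the
    // diagonal, and a large value (e.g. the node count) for unconnected pairs;
    // on return it holds the minimum number of underlay hops for every pair.
    void floydWarshall(std::vector<std::vector<int>>& hops) {
        const int n = static_cast<int>(hops.size());
        for (int k = 0; k < n; ++k)
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j)
                    hops[i][j] = std::min(hops[i][j], hops[i][k] + hops[k][j]);
    }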

During initialization, the churneventgenerator routine is invoked and nodes start to join the network at random times (the bootstrap node starts at time 1). Upon joining, nodes begin to schedule their first queries. After a node has been alive for a period, it leaves the network. The intervals between issuing queries and between joining or leaving the network are drawn from exponential distributions. All of the node actions, ‘join’, ‘crash’ and ‘lookup’, are scheduled based on their frequencies or intervals, which are set in the event files.
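As a minimal sketch of how such exponentially distributed intervals can be drawn (the function name is ours; P2Psim's own implementation may differ):

    #include <random>

    // Draw the next event interval (in milliseconds) from an exponential
    // distribution with the given mean, as used for join/leave and lookup
    // scheduling. Means of 60 and 10 simulated minutes correspond to the
    // default churn and lookup settings described in Section 3.2.
    double nextIntervalMs(double meanMs, std::mt19937& rng) {
        std::exponential_distribution<double> dist(1.0 / meanMs);
        return dist(rng);
    }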

In churneventgenerator mode, queries are issued upon the invocation of lookup events. If the source node is not the node responsible for the key, the query is sent to the node closest in identifier space towards the destination node which is supposed to host the key.

Upon reception of a query at each intermediate overlay node, the number of overlay hops is incremented and the IP address (also known as the index) of the overlay node is registered in the query. As can be seen from Figure 14, the number of underlay hops between the previous node and the current node is then retrieved and added to the total number of underlay hops for the query. Note that in our simulations, only successful queries are counted in the final result for the number of underlay hops. This is because only a small portion of the queries (less than 5%) failed with our simulation settings, and for queries which cannot find their destinations, the network gives up resending after the maximum lookup time. Thus the unsuccessful queries will not overwhelm the network resources and consequently will not affect the results for successful queries.


Figure 14 Scenarios without link failures

3.1.2 Calculating path stretch

The original definition of stretch in P2Psim is the lookup time of each query divided by the round-trip time, for complete and correct lookups. Some peer-to-peer researchers use this stretch metric in their work [18, 42, 57]. However, we found that this delay-based metric is not suitable for our simulations, as our interest lies in geographic locality. A more applicable definition from [43] is employed in our simulations: stretch is the ratio between the cost of the path taken by the routing protocol and the minimum-cost path from source to destination.

We use the number of underlay hops as the cost in the above stretch definition. The number of underlay hops obtained in section 3.1.1 is the cost of the path taken by the routing protocol. The minimum-cost path from source to destination can also be calculated with the Floyd-Warshall algorithm. The final path stretch is obtained by dividing the number of underlay hops taken using a DHT by the number of underlay hops of a direct IP connection.
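A sketch of the per-query stretch calculation under these assumptions is given below; overlayPath is the list of node indices recorded in the query and minHops is the all-pairs result from the Floyd-Warshall step, and the names are ours rather than P2Psim's.

    #include <vector>
    #include <cstddef>

    // Path stretch for one query: the total number of underlay hops taken
    // along the overlay path divided by the minimum number of underlay hops
    // between the query's source and destination nodes.
    double pathStretch(const std::vector<int>& overlayPath,
                       const std::vector<std::vector<int>>& minHops) {
        int taken = 0;
        for (std::size_t i = 1; i < overlayPath.size(); ++i)
            taken += minHops[overlayPath[i - 1]][overlayPath[i]];
        int direct = minHops[overlayPath.front()][overlayPath.back()];
        return direct > 0 ? static_cast<double>(taken) / direct : 1.0;
    }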

3.1.3 Calculating success rates/failure rates

Two kinds of failures can occur on the underlying IP path. Link failures happen at a router or a link connecting two routers due to software errors or hardware outages. Path failures can occur under many circumstances, including DoS attacks or traffic congestion. However, BGP-4, on which today's Internet routing system is based, does not handle failures well [53, 55]. It usually takes a few minutes to converge to a new valid route after a link failure causes an outage. This provides a heuristic for our simulations: a query is considered failed if one of the links that it traverses is out of use, as can be seen in Figure 15.


Figure 15 Scenarios with link failures

Presently, P2Psim only supports the scheme in which nodes join and leave the overlay network. However, we want to investigate the robustness of different DHTs (Chord and Tapestry) when links between two nodes break down and are restored. To simulate the scenario of links breaking down, several functions and classes need to be added to P2Psim (refer to Appendix I).

In simulations with link events, the minimum number of underlay hops between two nodes has to be recalculated every time one hop of the overlay lookup succeeds. If this number is not the same as the initial number of underlay hops when the network was stable, the query is considered to have failed. In other words, if a query happens to go through a failed link, it is considered failed; otherwise it succeeds.
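A rough sketch of this per-hop check, again with names of our own choosing: a query is marked as failed as soon as the minimum hop count for one of its overlay hops, recomputed on the current topology, differs from the value recorded on the stable topology.

    #include <vector>
    #include <cstddef>

    // Returns true if the query is considered failed under link failures:
    // for each overlay hop, the minimum underlay hop count on the current
    // (possibly damaged) topology must equal the value on the stable
    // topology, otherwise the query has traversed a failed link.
    bool queryFailed(const std::vector<int>& overlayPath,
                     const std::vector<std::vector<int>>& stableMinHops,
                     const std::vector<std::vector<int>>& currentMinHops) {
        for (std::size_t i = 1; i < overlayPath.size(); ++i) {
            int a = overlayPath[i - 1], b = overlayPath[i];
            if (currentMinHops[a][b] != stableMinHops[a][b])
                return true;   // this hop was affected by a failed link
        }
        return false;
    }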

3.2 Simulation setup

This simulation uses the implementation and extension of Chord and Tapestry in P2Psim. We use the later version of Chord which has implemented Proximity Neighbour Selection.

For all the simulations, we use GT-ITM as the topology model to generate networks with 14, 36, 100, 300, 600, 1300 and 3066 nodes so that we can study a wide range of network sizes. These unusual node counts result from attempting to generate topologies with powers-of-two numbers of nodes (16, 32, 128, ...); the achievable numbers are constrained by the GT-ITM transit-stub structure, which prevents exact powers of two. P2Psim does not support more than 3000 nodes for some protocols such as Tapestry. For GT-ITM topologies, P2Psim uses Dijkstra's algorithm to calculate the network latency.

Each node joins and leaves the network alternately and the time interval between two consecutive actions follows an exponential distribution with a mean of 60 simulated minutes. Each node issues a query with a unique key every 10 minutes. These are the default simulation settings in P2Psim. It is hard to know the real frequencies with which nodes join, leave or issue queries on the Internet, since the situation may vary dramatically from one network to another; for example, computers at work could be online much longer than computers at home. Thus, we use the default time intervals provided by P2Psim.

The keys generated by each node are distributed randomly. For an application sharing network measurements, where the measurement destination IP address is hashed into a key using the SHA-1 hash function, there is a high probability that the keys are also distributed randomly, provided the queries are sent over a wide area and the application runs for a long time.
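For illustration only, a key could be derived from the measurement destination address as sketched below; this is not the thesis implementation (which was built on FreePastry) but shows the idea of hashing the destination IP address with SHA-1 (here using OpenSSL) and truncating the digest to a 64-bit identifier.

    #include <openssl/sha.h>
    #include <cstdint>
    #include <cstring>
    #include <string>

    // Illustrative only: derive a 64-bit key for a measurement record by
    // hashing the destination IP address string with SHA-1 and keeping the
    // first 8 bytes of the digest.
    uint64_t measurementKey(const std::string& destinationIp) {
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA1(reinterpret_cast<const unsigned char*>(destinationIp.data()),
             destinationIp.size(), digest);
        uint64_t key = 0;
        std::memcpy(&key, digest, sizeof(key));  // first 64 bits of the digest
        return key;
    }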

Based on our earlier discussion in Chapter 1, only the characterization of backbone failures has been investigated. In [10], the failure distribution characteristics across pairs of links are studied, whereas we need to simulate the failure distribution on a single link. Furthermore, the failure characteristics of links may vary with their maintenance, age, technology and other specific traits. Due to the complexity of network link failures, we only consider a plausible situation with independent link failures and investigate the impact on the different protocols. Each link stays active for an average time of 1 hour (denoted MTTF6) from the beginning of the simulation and subsequently fails for a mean time of 18, 24, 36, 45, 72, 120 or 360 seconds (denoted MTTR7), both following exponential distributions. This process repeats until the end of the simulation. In this case, the ratio of MTTF to MTTR ranges from 10 to 200, so that we may learn the behavioural difference between Chord and Tapestry under different link characteristics. Although the ratio of MTTF to MTTR here is higher than typical link conditions, we are investigating the locality property for situations of network failure, in which users would have the most need to access network measurements. We set the stabilization interval to 72 seconds, as it is the most frequent stabilization that Tapestry could handle for any network size in our simulation settings. A more frequent stabilization interval would create more Tapestry overhead, which could exhaust our entire 8 gigabytes of RAM and terminate the simulation. In Chord, both the PNS timer and the basic timer are set to 72 seconds. The stabilization process runs every 72 seconds to give each DHT an opportunity to adapt around faults caused by network failures. Each simulation runs for 6 hours, with 19 queries issued by each node.

6 MTTF = Mean Time To Fail 7 MTTR = Mean Time To Repair
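As a rough sketch of the independent link-failure model just described (the function and variable names are ours, not P2Psim's), the failure/repair schedule of a single link over the whole run can be generated as follows:

    #include <random>
    #include <utility>
    #include <vector>

    // Generate the failure/repair schedule of one link for the whole run:
    // the link stays up for an exponentially distributed time with mean MTTF,
    // then down for an exponentially distributed time with mean MTTR,
    // repeating until the end of the simulation.
    std::vector<std::pair<double, double>>                 // (failure time, repair time) in ms
    linkSchedule(double mttfMs, double mttrMs, double simEndMs, std::mt19937& rng) {
        std::exponential_distribution<double> up(1.0 / mttfMs);
        std::exponential_distribution<double> down(1.0 / mttrMs);
        std::vector<std::pair<double, double>> events;
        double t = up(rng);                                 // link first stays up
        while (t < simEndMs) {
            double repair = t + down(rng);                  // down for an exponential time
            events.emplace_back(t, repair);
            t = repair + up(rng);                           // next uptime starts after repair
        }
        return events;
    }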

Five topology files, each generated from the same topology source file, are applied to the Chord and Tapestry overlay networks in order to obtain general results. These five files are similar in their gross topology features, since they are generated from the same source file, but differ slightly in their average node degrees, numbers of bi-connected components, and average and largest hop-depths with and without the length metric (refer to section 2.3.2).

The sizes of the finger tables in Chord and the routing tables in Tapestry are both set to 64, matching the number of digits of the nodeIDs and keys in both protocols. Chord also looks up 16 of its successors to fill in the finger tables.

Chord can process a query in an iterative or recursive style. In the iterative style, a node sends queries to a few nodes in its finger table, each time moving closer to the desired successor. In the recursive style, each intermediate node simply forwards the query to the next node until the correct successor is found. Our simulations employ the standard style, which is recursive, as it results in shorter lookups. After the node responsible for the key is found, a notification message is sent back to the sender directly without going through the overlay network.

We evaluate the performance of Chord and Tapestry using three different metrics, the number of underlay hops, stretch in distance (path stretch) and failure rates, under two different network scenarios: with and without link failures. When there are no link failures, we are interested in the difference in underlying network distance between

Chord and Tapestry. When link failures are involved in the network, we are more interested in the percentage of successful queries, which reflects the robustness of the DHTs (i.e. the less the success rate is affected, the more robust the DHT is).

In the scenario with no link failures in the underlay network, only successful lookups are taken into consideration when calculating the average number of underlay hops, including any timeout penalties they incur. Both Chord and Tapestry resend failed lookups as long as the 8-second timeout has not been reached. In the scenario with link failures, we only focus on success rates or failure rates.

3.3 Simulation analysis

The simulations are carried out under two circumstances: networks with static underlying link connections and networks with frequent link failures. In both cases, nodes still join and leave due to the design limitations of the P2Psim simulator. In our simulations, we use the default parameter settings for node events in P2Psim (described in the simulation setup). This means that only the underlying topologies are static in the first scenario; the simulated peer-to-peer networks are dynamic in terms of node churn throughout the simulations. We repeated all simulations five times on each of the five topology files and averaged the results. When we repeated the simulations ten times on each topology file, the results did not vary by more than 5% from those obtained with five repetitions per topology file.

3.3.1 Results on networks with static link connections

The underlying distance from source to destination in a peer-to-peer network is essential when the locality property is investigated. The further the distance from source to destination, the more likely the underlying path is to be affected by possible link changes in the network. Two metrics, the number of underlay hops and the stretch in distance, are employed to characterize and compare the locality property of

Chord and Tapestry.

3.3.1.1 Impact of network size

a) Underlay hops Three different graphs are produced to characterize and compare the physical distance from source to destination in Chord and Tapestry. We sort the number of underlay hops used by different queries and collect the values at the 10th percentile, the median and the 90th percentile. Finally, we calculate the average number of underlay hops and plot it against network size, as can be seen in Figure 16 for Chord and Figure 17 for Tapestry; the mean values of Chord and Tapestry are compared in Figure 18.


Figure 16 Number of underlay hops in 10th, mean, median and 90th and its approximation in Chord

The distributions of the 10th percentile, 90th percentile, mean and median values for Chord are plotted in Figure 16. In the 10th percentile, 90th percentile, mean and median lines, the number of underlay hops increases linearly up to 300 nodes. After 300 nodes, it increases linearly with a flatter slope. This is understandable because the simulation operates on 2-dimensional plane topologies: with a linear increase in the total number of nodes, the number of nodes along one dimension tends to increase logarithmically.

As Figure 17 shows, Tapestry exhibits the same trend in the number of underlay hops. The 10th percentile, 90th percentile and mean values increase quickly with network size up to 300 nodes, and then increase more slowly. The median number of underlay hops increases more slowly after the 600-node turning point. All of these curves are approximately logarithmic. For the second largest network in our simulations, with 1300 nodes, 90 percent of queries take 49 or fewer underlay hops to reach their destinations. However, Tapestry is not able to accommodate more than 3000 nodes with 8 GB of memory because of the excessive overhead when nodes join, leave or run the stabilization process frequently. This may be due to a memory leak in P2Psim.


Figure 17 Number of underlay hops in 10th, mean, median and 90th and its approximation in Tapestry


Figure 18 Comparison of average number of underlay hops between Chord and Tapestry

Figure 18 shows how the average number of underlay hops differs between Chord and Tapestry as the network size increases. We also use a logarithmic distribution to approximate these two curves. As can be seen, the number of underlay hops rises steeply with network size up to 300 nodes and then flattens. In small networks with fewer than 100 nodes, a Chord network incurs twice the number of underlay hops that Tapestry does. For networks with more than 100 nodes, a Tapestry network experiences approximately 20 fewer underlay hops than a Chord network. Not only can the Tapestry protocol find a destination for a query in fewer overlay hops, it is also able to find an overlay node closer to the query's origin, compared with the performance of Chord with Proximity Neighbour Selection.

b) Stretch in distance The stretch defined in P2Psim is the delay stretch measured in the time domain, while path stretch in the space domain is more relevant and accurate when the locality property is investigated. Path stretch is the ratio between the underlying IP path using a specific overlay protocol and the shortest IP path without overlay interference. The shorter the path a message has to travel to find the destination, the less likely the message is to be affected by potential link breaks in the network, and the smaller the path stretch, the better the locality a DHT presents.


Figure 19 Comparison in path stretch between Chord and Tapestry

Our simulations measure the trend of path stretch with increasing network size. As can be seen from Figure 19, in smaller networks with fewer than 30 nodes, Tapestry is almost able to find a destination without any extra path cost, making it as effective as a one-hop DHT. Chord, however, still needs to route around the ring overlay topology, which doubles the IP path. The path stretch then increases with the number of nodes in the network for both Chord and Tapestry, continuing to ascend up to 300 nodes. After 300 nodes, the path stretch in both protocols remains almost constant: Chord stays at about 2.7 while Tapestry stays at about 1.5.

The comparison of path stretch between Chord and Tapestry shows that Tapestry gives shorter underlying IP paths and has better locality performance than Chord for any network size. Tapestry is also more efficient in overlay routing compared with Chord.


3.3.1.2 Distribution of path length

a) Previous results The Chord paper [18] shows that the overlay path length increases logarithmically with network size, as shown in Figure 20. That simulation was performed on a network with N = 2^k nodes and 100 x 2^k keys, in which k varies from 3 to 14. Each node in the simulation selected a random set of keys to query and the overlay path length to resolve each query was measured. Figure 20 [18] plots the mean and the 1st and 99th percentiles of overlay path length as a function of network size. It can be seen that not only the average path length but also the 1st and 99th percentiles increase logarithmically with the number of nodes.

Figure 20 The overlay path length in Chord as a function of network size [18]

Figure 21 [18] also plots the PDF of the overlay path length for a network with 2^12 nodes. It shows that the overlay path length for a 2^12-node network peaks at 6 hops and does not exceed 12 hops. This verifies the property of Chord structured networks that it takes fewer than the upper bound of O(logN) overlay hops to find the destination, where N is the number of nodes in the overlay. This simulation was performed on a static overlay network without node churn; no results on the distribution of overlay path length or underlying IP path length under node churn have been reported.


Figure 21 The PDF of overlay path length in Chord in a 2^12-node network [18]

The Tapestry papers [19, 35] only investigate the correlations between
• the Relative Delay Penalty and increasing object distance,
• the Relative Delay Penalty and the Tapestry base,
• the average time to coalesce and the number of fragments requested, with and without link failures,
• the latency and the client's distance to the object, etc.
They do not consider the distribution of overlay or underlying hops. Comparisons between Chord and Tapestry have only been conducted in terms of query latencies; previous work [33] has not compared overlay and underlay path lengths. b) Overlay path length Chord can be simulated in a static network environment without node churn by setting the variable static_sim to true. Tapestry in P2Psim does not support static simulation. Our simulation deals with nodes joining and leaving the network for both the Chord and Tapestry protocols. We find that simulations in a dynamic network environment are more applicable to real situations in the network.

This simulation is performed on a 600-node network for both the Chord and Tapestry overlay networks. In theory, it takes at most O(logN) and O(log_B N) [11] overlay hops in Chord and Tapestry respectively to complete a query, where B is the base. We chose a base of 64 for node identifiers in both protocols. Paper [18] shows that the average number of overlay hops in the Chord protocol follows the formula (1/2)log2(N), where N stands for the network size. However, this result was obtained for the original Chord protocol (not ChordFingerPNS) under the assumption that the overlay network is stable. In our simulation, we have used an unstable network with node churn. We have also used a different base parameter, which results in different overlay and underlay path lengths. As a result, it takes 2-3 hops in Chord and 1-2 hops in Tapestry to find the desired node in our simulations. More details about the effect of the base parameter on the path length are given in section 3.3.1.3.

As can be seen from our simulation results in Figure 22, the number of overlay hops in Tapestry mostly falls at 1 hop (26%) and 2 hops (68.8%), peaking at 2 hops. A smaller fraction of queries (4.76%) takes 3 hops to reach their destinations, and very few queries traverse more than 3 hops. The shape of the PDF for Chord is similar to that of Tapestry but biased one hop to the right. Very few queries are able to find the desired node within 1 hop (0 or 1 hop); most lookups find the destination in 2 (27.7%), 3 (65.7%) or 4 (5.8%) hops. A query is seldom completed in more than 4 hops in Chord.


Figure 22 The PDF of overlay path length in a 600-node network


Some queries may need more than 4 hops to find the destination, which exceeds the theoretical upper bound. This is because the upper bound assumes a static network environment. When nodes leave the network and the routing state on related nodes has not yet been updated, a query forwarded by those nodes may fail to reach the destination. In this situation, the previous node retries other entries in its routing table, but the failed hop is still counted in the overall hop count. This mechanism accounts for the deviation from the theoretical bound. c) Underlying IP path length We collect the number of underlay hops and investigate its PDF in Chord and Tapestry. As we can see from Figure 23, it corresponds well to the overlay path length distribution in Figure 22.

In Tapestry, there are four crest values, positioned at 0, 18, 36 and 54 underlay hops, which correspond to 0, 1, 2 and 3 overlay hops. A small fraction (0.37%) of queries find that the issuing node itself is the desired node for the key (0 hops) in Tapestry. Both the nodeID and the key identifier are 64 bits long and nodes issue lookups with randomly selected key identifiers, so the chance of identifier collision is very small when the total number of lookups is within 210,000. However, it is possible for one node to host many key identifiers, because the network consists of only 600 nodes whereas the total number of keys is 2^64. This accounts for the fact that some queries can find the desired destination at their origin. The curve then descends to its lowest point at 5 hops and ascends to the first peak at 10 hops; about 0.48% of queries take 10 hops to reach their destination. The second peak, at 18 hops, accounts for 4.09% of total queries. After a deep trough at 27 hops, the Tapestry PDF curve reaches its highest peak at 36 hops; this group of lookups accounts for 5.91% of the total. There is a small peak at 54 hops which takes up 0.25% of the entire set of lookups.

Chord presents a more dispersed distribution of underlay hops. There are three crests, at 21, 38 and 54 underlay hops, which correspond to 1, 2 and 3 overlay hops. These three peaks account for 0.45%, 3.13% and 4.36% of total queries respectively. We find that, under node churn, Chord does not follow the O(logN) rule as closely as Tapestry does.


Figure 23 The PDF of underlying IP path length in a 600-node network

Each peak in the PDF chart corresponds to an integer number of overlay hops. It is clearly seen in Figure 23 that Chord takes more underlying hops (21 hops) than Tapestry (18 hops) for 1 overlay hop. Tapestry also performs better for 2 overlay hops, with 36 underlay hops compared with 38 underlay hops in Chord. However, the performance of Chord and Tapestry is similar for more than 3 overlay hops. It can be seen from Figure 23 that the reduced number of overlay hops contributes most to the locality of Tapestry; another reason for Tapestry's better locality is its smaller number of underlying hops per overlay hop.

3.3.1.3 Path length factors

In this section, we outline the factors which affect the path length in both the Chord and Tapestry protocols, and explain the difference between the theoretical values and the simulation results. The overlay path length (measured in overlay hops in this thesis) is affected by two main factors: firstly and most importantly, by the size of the routing table maintained in each peer-to-peer node and, secondly, by each protocol's unique routing scheme. We discuss these two factors in detail and give evidence to support them. In this section, “Chord” refers to the revised Chord protocol with Proximity Neighbour Selection. a) Impact of base The size of the routing table is mainly affected by the base in both the Chord and Tapestry protocols. As was introduced in section 2.2, each Chord node keeps a finger list of (b-1)*log_b(N) fingers and each Tapestry node keeps b*log_b(N) routing entries. The theoretical routing table sizes are outlined in Table 1. It should be noted that the theoretical values only apply when the network size is comparable to the namespace, for example millions of nodes in a 64-bit namespace.

Number of routing table entries

Base      Chord     Tapestry
2         64        128
4         96        128
8         149       168
16        240       256
32        396       409
64        672       682
128       1161      1170

Table 1 Theoretical average number of routing table entries
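As a worked check of Table 1, consider base b = 64 in a 64-bit identifier space: the number of routing levels is log_64(2^64) = 64/6, roughly 10.7, so Chord keeps about (64-1) x 10.7 = 672 fingers while Tapestry keeps about 64 x 10.7 = 682 entries, matching the corresponding row of the table; for b = 2 the same formulas give 64 and 128 entries respectively.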

Table 1 shows that the difference in the number of routing table entries between ChordFingerPNS and Tapestry decreases with increasing base b. With more routing table entries, the overlay path length should also decrease accordingly. Figure 24 and Figure 25 show the effect of the base on overlay path length (measured in hops) in ChordFingerPNS and Tapestry respectively. The simulation was performed on a 1024-node network because of the limited memory in our simulation settings.


Figure 24 Effect of Base on overlay path length in ChordFingerPNS


Figure 25 Effect of Base on overlay path length in Tapestry

Both Figure 24 and Figure 25 show that the average overlay path length decreases with increasing base. Figure 24 also shows that the average path length is 4.8 when the base equals 2 in ChordFingerPNS. This value is consistent with the theoretical value of 5 obtained from the formula (1/2)log2(N) for the original version of Chord (N = 1024) [18]. Furthermore, it can be seen that the numbers of routing entries are far smaller than the theoretical values in Table 1. This is caused by the significant difference between the network size (2^10) and the namespace (2^64). b) Impact of redundancy Both ChordFingerPNS and Tapestry maintain redundant routing table entries to deal with network failures, as introduced in the protocol maintenance discussion in section 2.2. In our simulations, both Chord and Tapestry must keep a redundancy of at least 4 nodes in order to ensure that most queries are successful under frequent churn.

By keeping r nodes in a Chord node's successor list, ChordFingerPNS also allows up to r nodes in each level of its finger list, sorted by latency. These nodes are not only used in the face of network failures but also contribute to shortening the overlay path, since they provide extra routing choices, as can be seen in Figure 26. Redundancy in Tapestry, however, only enhances the protocol's resilience during network failures: as can be seen in Figure 27, it does not affect the overlay path length.


Figure 26 Effect of redundancy on overlay path length in ChordFingerPNS


Figure 27 Effect of redundancy on overlay path length in Tapestry

c) Impact of routing schemes As was explained in Chapter 2, the routing schemes of Chord and Tapestry are completely different due to their different routing table structures and identifier matching rules. This difference can also affect the performance difference in various applications.

In Chord, a message usually knows where the responsible node is as soon as it arrives at the responsible node's predecessor, whereas in Tapestry a message only knows it has arrived when it cannot find any numerically closer node in the responsible node's routing table. In Chord, a query never overshoots the key: it will never be routed to a node whose identifier is larger than the key identifier. For example, if a node x that issues a query for key k happens to be k's successor, the query still needs to travel clockwise around the ring and find k's predecessor via x's successor list and finger table. This routing strategy follows from Chord's unidirectional routing scheme [58]. In other words, the destination for Chord is the responsible node's predecessor, whereas the destination for Tapestry is the responsible node itself.


In our simulation of key lookups, we terminate the routing procedure as soon as the key finds its responsible node's predecessor in ChordFingerPNS [33, 59]. We then record the number of overlay hops taken by queries for each base from 2 to 128 and the corresponding number of routing entries, as shown in Figure 28 and Figure 29. The simulation was performed on a 100-node network, and the redundancy is 4 in both ChordFingerPNS and Tapestry.


Figure 28 Overlay hops comparison for queries


Figure 29 Routing Table Comparison

As we can see from Figure 28, the overlay path for queries to reach the destination is always longer for Tapestry than for ChordFingerPNS. This is because Tapestry holds less routing table state. However, we can also see that with increasing base, their routing table entries as well as their average numbers of overlay hops draw closer to each other. With a base of 128, ChordFingerPNS and Tapestry hold approximately the same number of routing entries and bear similar overlay path lengths. This is consistent with the theoretical routing table sizes in Table 1.

In this scheme, a Chord message does not need to visit the responsible node itself. This scheme should work well for applications which need to transfer large amounts of data. For example, if an application needs to publish a large file onto a certain node according to the Chord routing scheme, it first sends the key in a query without the file. As soon as the query learns where the responsible node is by finding its predecessor, it informs the source, so that the source node can put the file onto the responsible node directly through the underlay network. This approach certainly reduces the latency as well as the burden on the peer-to-peer network.

However, this approach of routing a message to a responsible node's predecessor is not suitable for our application of sharing network measurements, for three reasons. First of all, network measurement messages are usually comparable in size to key messages (e.g. a few bytes), so the back-and-forth routing procedure would create more overhead. Second, the key feature of our application is to update the measurement information in real time, and this approach would prolong either the message insertion or the message retrieval process. Third, when a message is routed from the source to its responsible node, even if the first few overlay nodes are close to the source in network proximity, it is still not guaranteed that the responsible node's predecessor is close to the source node; these nodes may be far away from each other (e.g. on two different continents). Therefore, it is essential for our application to route the measurement message together with the key message to the responsible node in Chord, instead of only to its predecessor. With this application analysis, we compare the performance of ChordFingerPNS and Tapestry in Figure 30.


Figure 30 Overlay hops comparison for measurements messages

As we can see from Figure 30, when the base is larger than 2, Tapestry takes fewer overlay hops than Chord even though its routing table is also smaller than Chord's. This is because, if we use the key's successor as the query's destination, the minimum number of overlay hops for Chord is 1; this happens when the key identifier lies between the source node and its successor. With the same amount of routing information, this scenario is equivalent to 0 hops in Tapestry.

Figure 30 explains the difference in overlay hops between Chord and Tapestry observed in the previous section. In our simulations we used a base of 64, so the average number of overlay hops taken by Chord exceeds that of Tapestry by about 1 hop.

From the above discussion, we can conclude that while Tapestry holds less routing state, it still bears a shorter overlay path than Chord when these two protocols are utilized in our application of sharing network measurements. At the same time, Chord can manage to have a shorter overlay path by employing a bigger successor list.


3.3.1.4 Impact of stabilization frequency

With nodes joining and leaving the network, the protocols need to run a stabilization process to keep the routing tables up to date and to ensure that all lookups find their destinations within the maximum lookup time. As described in section 2.2.1.2, Chord runs stabilization periodically and replaces each node's predecessor or successor when nodes join or leave, if necessary. The stabilization mechanism in Tapestry (section 2.2.3.4) is more complicated than Chord's. Each new node joining the Tapestry network needs to multicast its existence to all the other nodes in its proximity, and all the other nodes, upon receiving the notification message, check the correctness of their routing tables and surrogate routes for some nodes.


Figure 31 Comparison of Stabilization Frequency Effect without link failures in a 600-node network

Figure 31 shows the performance difference under different stabilization frequencies in a 600-node network. In this network, the average time a node stays online and offline is set to 1 hour, and each node issues a query every 10 minutes. We vary the stabilization frequency in both Chord and Tapestry to investigate their difference in failure rates; the stabilization intervals are set to 72, 144, 288, 576, and 1152 seconds. It can be seen that the performance of Chord does not degrade noticeably with less frequent stabilization. The success rate of Tapestry is similar to Chord's when stabilization runs most frequently, but its performance degrades rapidly as the stabilization frequency decreases.

The result shows that the stabilization mechanism in Tapestry is not as effective as the one in Chord: it not only introduces significant overhead to the network, but also struggles to handle frequent node churn.

3.3.2 Results on networks with link failures

Link failures can occur anytime and anywhere in the network. In this section, we investigate the locality performance of the two protocols when the underlying network is unstable. Presumably, a query has a higher chance of failing if it has to traverse more underlying links to find the desired node. The performance metric we employ is the success rate during a 6-hour simulation.

MTTF and MTTR are the mean time intervals during which a link is available or failed, respectively. It is found that the absolute values of MTTF and MTTR affect success rates only to a small extent in both Chord and Tapestry; however, the ratio of MTTF to MTTR affects success rates dramatically.
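The link-failure model used in these simulations can be illustrated with a short sketch (assumptions: up and down durations are exponentially distributed with means MTTF and MTTR, as in the extended P2Psim link events described in Appendix I; the class name is hypothetical):

import java.util.Random;

// Illustrative generator of alternating link up/down durations, assuming
// exponentially distributed lifetimes with means MTTF and MTTR.
class LinkEventModel {
    private final double mttf;   // mean time to failure (link stays up), in ms
    private final double mttr;   // mean time to repair (link stays down), in ms
    private final Random rng = new Random();

    LinkEventModel(double mttf, double mttr) {
        this.mttf = mttf;
        this.mttr = mttr;
    }

    // Sample an exponentially distributed duration with the given mean.
    private double exponential(double mean) {
        return -mean * Math.log(1.0 - rng.nextDouble());
    }

    double nextUpDuration()   { return exponential(mttf); }
    double nextDownDuration() { return exponential(mttr); }

    public static void main(String[] args) {
        // MTTF fixed at one simulated hour, MTTF/MTTR = 100 as an example.
        LinkEventModel m = new LinkEventModel(3600000, 36000);
        System.out.printf("up for %.0f ms, then down for %.0f ms%n",
                m.nextUpDuration(), m.nextDownDuration());
    }
}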

3.3.2.1 Impact of link failures frequency

We keep MTTF constant at 1 simulated hour and vary MTTR to obtain MTTF/MTTR ratios of 10, 30, 50, 80, 100, 150 and 200. Because simulations on networks with more than 600 nodes take extremely long (up to a few days), a 300-node network is used for this group of simulations. The difference in success rate under different MTTF/MTTR ratios is shown in Figure 32.


Figure 32 Comparison of the effect of different MTTF/MTTR ratios in Chord and Tapestry in a 300-node network (failure rate versus ratio of lifemean to deathmean)

It is clearly seen in Figure 32 that, as the ratio of MTTF to MTTR increases, the mean time that a link stays failed decreases and the success rates increase in both Chord and Tapestry. However, as the figure also shows, Chord performs better than Tapestry when the ratio increases to 150 or more. This is consistent with the earlier result that Chord performs better than Tapestry in a network without link failures.

It can be inferred from this result that Chord is expected to perform better than Tapestry when it is built on a network with stable link connections.

3.3.2.2 Impact of network size

Networks tend to be more unstable when more node churn and more link failures are involved. In this section, we investigate the trend of success rates as the number of nodes in the network increases. We vary the ratio of MTTF to MTTR from 10 to 100 and plot Figures 33 and 34 for Chord and Tapestry respectively. The stabilization interval in this group of simulations is set to 72000 milliseconds.


Figure 33 Failure rates in Chord under different link event ratios of MTTF/MTTR (10, 50 and 100), plotted against the number of nodes

It can be seen in Figure 33 and Figure 34 that the success rates decrease with increasing network size under any ratio of MTTF/MTTR in both protocols. The success rates tend to decrease more quickly for a small MTTF/MTTR value (10), whereas the success rate curves move closer to each other for larger MTTF/MTTR ratios (50 and 100).

Figure 34 Failure rates in Tapestry under different link event ratios of MTTF/MTTR (10, 50 and 100), plotted against the number of nodes

As seen in previous sections, the larger the network, the more underlying hops a query has to travel; the query is therefore more likely to be affected by link failures. This accounts for the simulation result.

We also compare the difference in success rates when different ratios of MTTF to MTTR apply, again choosing MTTF/MTTR ratios of 10, 50 and 100 to plot Figures 35, 36 and 37. It is clearly seen that the success rates of queries decline with increasing network size in both Chord and Tapestry. Meanwhile, Tapestry always achieves higher success rates than Chord when MTTF/MTTR is less than 100; the smaller the MTTF/MTTR ratio, the larger the gap between the success rates of Chord and Tapestry. When the ratio increases to 100, as seen in Figure 37, the success rate curves obtained from the two protocols almost overlap. The smaller the MTTF/MTTR ratio, the longer a link stays failed. This result suggests that Tapestry tends to perform better than Chord in an unstable network with many link failures.

Figure 35 Comparison of failure rates when MTTF/MTTR = 10 (failure rate versus number of nodes, for Chord and Tapestry)


Figure 36 Comparison of failure rates when MTTF/MTTR = 50 (failure rate versus number of nodes, for Chord and Tapestry)

Figure 37 Comparison of failure rates when MTTF/MTTR = 100 (failure rate versus number of nodes, for Chord and Tapestry)

3.3.2.3 Impact of stabilization frequency

In this subsection, we investigate the effect of stabilization frequency in a network with frequent link failures. As we did for a stable underlying network in the previous section, we tune the stabilization intervals to 72000, 144000, 288000, 576000 and 1152000 milliseconds and examine the success rates of both the Chord and Tapestry protocols. We set the ratio of MTTF to MTTR to 100, since the difference in success rates between Chord and Tapestry in this range is not pronounced, so the effect of the stabilization interval can differentiate the performance of the two protocols.

Figure 38 shows that Tapestry performs much worse than Chord with a less frequent maintenance procedure, while Chord’s performance stays roughly the same as the maintenance interval increases. The failure rates shown in Figure 38 reflect the combined effect of node failures and link failures. Compared with the stabilization frequency results without link failures in Figure 31, the failure rates in Chord increase more than those in Tapestry at every stabilization interval. It can therefore be expected that the performance of Chord degrades quickly with a decreasing ratio of MTTF to MTTR, while the performance of Tapestry degrades quickly with a decreasing stabilization frequency.

Figure 38 Comparison of stabilization frequency effects with link failures (MTTF/MTTR = 100; failure rate versus stabilization interval, for Chord and Tapestry)


3.4 Conclusion

This chapter described the simulation procedure used to compare two popular DHTs: Chord and Tapestry. The simulations used the P2Psim simulator and the GT-ITM topology generator; the integration of P2Psim with GT-ITM and the extension of P2Psim to support link events are described in Appendix I. The methodology and setups were also specified to form the foundation of our simulations. The results show that in a stable network environment without link failures, Tapestry not only takes fewer overlay hops than Chord, but also has a shorter underlay distance per overlay hop. This indicates that Tapestry has better locality than Chord. The results also show that, in a network with link failures, Tapestry is more resilient than Chord, especially with more frequent link failures and longer failure durations. However, Tapestry’s maintenance mechanism is vulnerable when dealing with node departures: it performs poorly if the stabilization procedure does not run frequently enough. This may be due to the complexity of the maintenance mechanism in Tapestry.

Furthermore, a frequent and complex maintenance procedure creates more overhead in a Tapestry network. For example, in our simulations of a 300-node network, Chord takes only 0.3% of our 8-gigabyte RAM whereas Tapestry takes up to 4.3%. Thus, Tapestry is clearly less scalable than Chord.


Chapter 4 Implementation with Pastry

4.1 Introduction

We concluded in Chapter 3 that Tapestry performs better than Chord in terms of the locality property. In this chapter, a design of peer-to-peer overlay networks to share network measurements is proposed. Due to the popularity of the Pastry interface Freepastry, and the routing scheme similarities between the Tapestry and Pastry protocols revealed in Chapter 2, a small overlay network built on the Freepastry substrate is created to share the network measurement information.

4.2 Implementation design

Our utilization of peer-to-peer networks focuses on sharing network measurement information among peers on the Internet. The conceptual design is shown in Figure 39. The system consists of end-hosts and a variety of servers, such as file servers and web servers. Each end-host installs software which consists of two layers: a measurement collection layer and a peer-to-peer protocol layer.

The top layer is the measurement information collection layer, which collects measurement information through active probing or passive monitoring. This can be achieved either by periodically sending probing packets to a specific server to obtain useful information, or by collecting the relevant measurement information through everyday applications such as web browsing or file downloading. In our implementation, we use the ping utility hrping [60] to collect this information. This layer then passes the information to the lower layer, the peer-to-peer protocol layer. Each end-host publishes its own network experience through this layer for other end-hosts to share. The functions of these two layers are independent of each other.
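A minimal sketch of this layering is given below (illustrative only; MeasurementCollector, OverlayPublisher and SharingAgent are hypothetical names used to show the separation of concerns, not Freepastry classes):

// Illustrative layering: the collection layer knows how to measure,
// the peer-to-peer layer only knows how to publish/retrieve by key.
interface MeasurementCollector {
    double collectDelayMs(String targetHost) throws Exception;
}

interface OverlayPublisher {
    void publish(String key, double delayMs);   // insert into the overlay
    Double retrieve(String key);                // null if the key is not found
}

class SharingAgent {
    private final MeasurementCollector collector;
    private final OverlayPublisher overlay;

    SharingAgent(MeasurementCollector collector, OverlayPublisher overlay) {
        this.collector = collector;
        this.overlay = overlay;
    }

    // Measure a server and share the result; the key is derived from the
    // server's address so that other peers can look it up.
    void measureAndShare(String serverAddress) throws Exception {
        double delay = collector.collectDelayMs(serverAddress);
        overlay.publish(serverAddress, delay);
    }
}

Either layer can be replaced independently, e.g. swapping hrping for passive monitoring without touching the overlay side.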

Figure 39 Conceptual design of the network measurement sharing system (end-hosts A–E, each with a measurement collection layer and a peer-to-peer protocol layer, publishing measurements about the file server to peer-to-peer networks 1 and 2 and retrieving them from the local network first)

The approach to localizing the measurement information in this design is to create multiple overlays, as demonstrated in Figure 39. The aim of creating multiple overlays is to keep the measurement information accessible during network failures. A host should retrieve the information measured by other local hosts first, if possible, and only then look at the information measured by hosts in a larger network. Furthermore, when searching for the measurement information, a host should search its local-area overlay first, so that it can still retrieve the information when the link to the external Internet fails.

The system in Figure 39 works as follows. End-hosts A, B and C stand for any end-hosts in peer-to-peer network 1. End-host A collects the measurement information from the file server and publishes it to the peer-to-peer networks, so that other end-hosts interested in end-host A’s experience with the file server can retrieve it by joining the overlay network. The local network (e.g. the UNSW campus network) creates peer-to-peer network 1, whilst a larger network (e.g. the Sydney network) consisting of end-hosts A, B, C, D and E creates peer-to-peer network 2. Peer-to-peer network 2 covers peer-to-peer network 1 geographically.

When end-host A collects the network information (delay, bandwidth, etc.) to the file server, it voluntarily publishes this information on multiple overlays (at peer C1 in peer-to-peer network 1 and peer D2 in peer-to-peer network 2 in the diagram). As a result, end-host B, which is located in the same local network as end-host A (which participates as peer A1 and peer A2), will be able to retrieve this information by participating in the same peer-to-peer networks, and can thus infer its own experience with the file server when the link to the external Internet fails.

First, end-host B joins the local peer-to-peer network 1 as peer B1 and sends a request to this overlay to check whether the local overlay network holds any information about the file server; if peer C1 holds the information, B retrieves it. Second, if no such measurement information about the file server exists in peer-to-peer network 1, B initiates a query in the larger peer-to-peer network, peer-to-peer network 2, by joining as peer B2 to check whether such measurement information can be found there. Third, during network failures to the external Internet, end-host B should still be able to retrieve the information about the file server if such information exists in the local peer-to-peer overlay network.
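This retrieval order can be summarised with a short sketch (illustrative only; the two overlays are represented abstractly as lookup functions and do not correspond to specific Freepastry classes):

import java.util.function.Function;

// Illustrative lookup policy: try the local overlay first, then the wider one.
// Each overlay is modelled as a function from key to (possibly null) delay.
class TwoScopeLookup {
    private final Function<String, Double> localOverlay;  // e.g. campus scope
    private final Function<String, Double> widerOverlay;  // e.g. city scope

    TwoScopeLookup(Function<String, Double> local, Function<String, Double> wider) {
        this.localOverlay = local;
        this.widerOverlay = wider;
    }

    Double lookup(String serverAddress) {
        Double local = localOverlay.apply(serverAddress);
        if (local != null) {
            return local;            // found within the local scope
        }
        // Fall back to the wider overlay; during an external link failure this
        // step may be unreachable, which is exactly what the local scope covers.
        return widerOverlay.apply(serverAddress);
    }
}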

4.3 Implementation details

The aim of the implementation is to build a network measurement sharing system on top of Pastry. Freepastry is an implementation of Pastry which can be deployed on the Internet; we use it as the platform on which to build our application.

We use active probing with ping for simplicity. Due to the small scale of the network, the standard Windows ping is not accurate enough to reflect the time differences between hosts, so hrping is employed to measure ping times at a higher resolution (microseconds) to different hosts under Windows XP. A screenshot of hrping is shown in Figure 40.

Figure 40 hrping screenshot to a host in another subnet

The number of echo requests can be specified with the –n option. It can be seen that the ping time returned by the first echo packet is extremely large and does not reflect the real ping time to a destination; it would distort the measured time to a host in the local subnet to a great extent. As a result, we ignore the first packet and average over the remaining n-1 echo replies in our program.
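The averaging step can be illustrated as follows (a sketch under assumptions: the per-echo round-trip times have already been extracted from the hrping output into a list; the parsing itself is omitted because the exact output format is tool-specific):

import java.util.Arrays;
import java.util.List;

// Illustrative averaging of hrping results: the first echo is discarded
// because its round-trip time is inflated, and the remaining n-1 replies
// are averaged.
class PingAverager {
    static double averageIgnoringFirst(List<Double> roundTripTimesMs) {
        if (roundTripTimesMs.size() < 2) {
            throw new IllegalArgumentException("need at least two echo replies");
        }
        double sum = 0.0;
        for (int i = 1; i < roundTripTimesMs.size(); i++) { // skip index 0
            sum += roundTripTimesMs.get(i);
        }
        return sum / (roundTripTimesMs.size() - 1);
    }

    public static void main(String[] args) {
        // Hypothetical per-echo times in milliseconds; the first is inflated.
        List<Double> samples = Arrays.asList(412.0, 95.1, 95.4, 95.3, 95.4);
        System.out.printf("average delay: %.2f ms%n", averageIgnoringFirst(samples));
    }
}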

The implementation setup is shown in Figure 41. Three subnets, denoted by three different IP prefixes, were set up with four end-hosts, two switches and two routers. The two routers are connected by a low-speed cable, and each router connects to a switch via an Ethernet cable.

Figure 41 Setup of the implementation to share network measurement information (hosts Alpha 10.1.0.2 and Beta 10.1.0.3 behind router interface 10.1.0.1; hosts Gamma 10.2.0.2 and Delta 10.2.0.3 behind router interface 10.2.0.1; the two routers connected by a low-speed cable on 10.3.0.1/10.3.0.2)

In our implementation, we create two scopes of Pastry overlay networks by using different port numbers. Hosts Alpha and Beta form a local peer-to-peer network on port 5008, while hosts Alpha, Beta, Gamma and Delta create a larger-scope peer-to-peer network on port 5009.

Host Alpha creates a local Pastry network on port 5008, using itself as the bootstrap node, and then hrpings host Delta. Host Beta later joins this local network using Alpha as the bootstrap node on port 5008. The ping time information is then published in this local Pastry network. At the same time, Delta and Gamma form another Pastry network on port 5009; Alpha and Beta join this network by bootstrapping from host Delta on port 5009. Alpha then publishes its measurement information by inserting the key into the network. The key is produced by hashing the IP address of host Delta, 10.2.0.3.
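Deriving the key from the destination IP address can be sketched with standard Java hashing (illustrative only; Freepastry's own Id construction may differ, so this merely shows the idea of all peers computing the same fixed-length key from the string "10.2.0.3"):

import java.security.MessageDigest;

// Illustrative key derivation: hash the measured host's IP address so that
// every peer computes the same key for the same destination.
class MeasurementKey {
    static byte[] keyFor(String destinationIp) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        return sha1.digest(destinationIp.getBytes("UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        byte[] key = keyFor("10.2.0.3");
        StringBuilder hex = new StringBuilder("0x");
        for (byte b : key) {
            hex.append(String.format("%02X", b));
        }
        System.out.println("key for 10.2.0.3 = " + hex);
    }
}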


When Beta wants to retrieve the measurement information to host Delta, it first sends a request with the same key to the local Pastry network on port 5008. If it receives no response, it initiates another query to the larger-scope Pastry network and tries to locate the relevant information there.

The program is written in Java on the Freepastry platform with JDK 5.0 under Windows XP. The measurement program adds three classes to Freepastry: DistMeasurement, MeasurementApp and MeasurementData. Every computer that participates in sharing measurements needs to install Freepastry together with these three classes. DistMeasurement accomplishes the tasks of creating or joining a Pastry network according to a few arguments; these arguments specify the action (insert or retrieve), the number of messages to send to the network, a port number (so that other nodes can join the same overlay network) and the IP address of a bootstrap node. This class also constructs a Pastry node and, when the insert action is taken, collects measurement data towards a given IP address with hrping. MeasurementApp describes how a message is handled at the current node: the node issues, forwards or processes the message depending on whether it is the source, an intermediate node or the destination node for that message. Processing at the destination node includes reading from or writing to a file according to the message type (insert or retrieve). MeasurementData defines the message format stored in the file. The program can be started with the command “java DistMeasurement ping address –insert/retrieve”. Details can be found in Appendix II.
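As an illustration of what a measurement message carries, the data can be thought of as a destination address, a measured delay and the identifier of the measuring node (a simplified sketch; the field names and types are assumptions, not the exact MeasurementData class):

import java.io.Serializable;

// Simplified sketch of the data carried in a measurement message.
class MeasurementRecord implements Serializable {
    final String destinationIp;  // host that was measured, e.g. "10.2.0.3"
    final double delayMs;        // averaged hrping time; -1.0 marks a retrieve request
    final String sourceNodeId;   // overlay node that performed the measurement

    MeasurementRecord(String destinationIp, double delayMs, String sourceNodeId) {
        this.destinationIp = destinationIp;
        this.delayMs = delayMs;
        this.sourceNodeId = sourceNodeId;
    }

    // Mirrors the on-disk format shown in Appendix II: "ip->delay=><nodeId>"
    @Override
    public String toString() {
        return destinationIp + "->" + delayMs + "=>" + sourceNodeId;
    }
}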

4.4 Implementation results discussion

Since Freepastry runs on distributed computers, log files are generated separately on each host. We present the results following the implementation process described in section 4.3.

Hosts Alpha and Beta first form a local Pastry network, and Beta initiates an insertion request for ping information to host Delta. Figure 42 shows the log file for creating a Pastry network, in chronological order.

Figure 42 Create a local Pastry network

As can be seen in Figure 42, Alpha initiates the Pastry network on port 5008 with node ID <43A93C…>, which Beta then joins with its own node ID. Alpha then adjusts its route set and leaf set to include the newly joined node.

Figure 43 and Figure 44 show the procedures for Beta to insert measurement information and to retrieve it later on.

Figure 43 Insert information in the local Pastry network

As can be seen in Figure 43, Beta sends the insert request to the local network and finds out that Alpha should hold this measurement information. Then this information is received by Alpha and is written to a file named with the IP address of the hrping destination Delta/10.2.0.3 and stored in the local computer.

Figure 44 Retrieve information in the local Pastry network

The Delay field is initially set to -1.0 in a retrieve message. When the retrieve message is sent out, the source host creates a server socket and waits for a response.

Upon finding the desired information, the destination host creates a client socket and sends the information back through this socket. In Figure 44, Beta/10.1.0.3 has received the answer from Alpha/10.1.0.2 and prints out the retrieved ping time.
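The return path can be sketched with plain Java sockets (illustrative only; the port number and the single-line message format are assumptions): the requester listens on a server socket, and the node holding the data connects back and writes the answer.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative response channel: the requesting host waits for the answer on a
// server socket, and the host holding the measurement connects back to it.
class ResponseChannel {

    // Requesting side: block until the answer arrives and return it.
    static String awaitAnswer(int port) throws Exception {
        try (ServerSocket server = new ServerSocket(port);
             Socket peer = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(peer.getInputStream()))) {
            return in.readLine();   // e.g. "10.2.0.3->95.29867"
        }
    }

    // Answering side: connect back to the requester and send the stored value.
    static void sendAnswer(String requesterHost, int port, String answer) throws Exception {
        try (Socket socket = new Socket(requesterHost, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            out.println(answer);
        }
    }
}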

If the information cannot be found in the local-area Pastry network, as shown in Figure 45, host Beta searches the wider-area Pastry network.

Figure 45 Failed to retrieve information from the local Pastry network

At the same time, another wider area Pastry network is created with port 5009. The process of inserting and retrieving is the same as in the local Pastry network. More details are given in Appendix II.


4.5 Future Work

Our implementation involves only four Pastry nodes, which demonstrates the operation only at a small scale. For future work, we could consider using PlanetLab [61] for a large-scale implementation. PlanetLab is a collection of computers distributed across the globe and hosted by research institutions; it currently consists of 723 machines spanning over 25 countries.

Among the services provided by PlanetLab, OpenDHT offers a publicly accessible distributed hash table (DHT) service. OpenDHT runs on a collection of 300 PlanetLab nodes, each of which continuously runs the Bamboo DHT implementation [62], a variation of the Pastry protocol. A PlanetLab implementation could also build upon Scriptroute [63], which provides implementations of ping, sprobe, traceroute and bandwidth measurement tools.

4.6 Conclusion

This chapter has proposed a new design for sharing network measurements across different scopes (in network proximity) of peer-to-peer overlay networks, and has implemented it in a small-scale network with two scopes. The implementation demonstrates that using the locality-aware DHT Pastry to share network measurement information is a feasible solution. However, a larger network needs to be set up to test the stability and efficiency of this approach. As future work, we could consider using PlanetLab to build our application.


Chapter 5 Conclusions and Future work

This chapter outlines the results of the thesis and lists possible future work related to this area.

5.1 Results summary

The thesis investigated the basic approaches to measuring network performance and reviewed the characterization of failures in IP backbones. To meet the increasing requirement for Internet users to understand network performance degradation, we intend to use structured peer-to-peer networks to share network measurement data among end-users continuously and immediately. To fulfil this task:

1. The algorithms of Chord, Tapestry and Pastry, along with their optimizations, have been studied, concluding that the performance of Tapestry is the most applicable to sharing network measurement information.
2. The P2Psim simulator [27] has been enhanced with GT-ITM topology support and a link event class.
3. By comparing the underlay distance and the stretch in distance between Chord and Tapestry in a network with stable link connections, it has been found that Tapestry covers a shorter underlay distance than Chord when handling queries. However, Tapestry is not as resilient as Chord in a network with constant link failures. It has been shown in Chapter 3 that the performance of DHTs is affected by the ratio of average link live time to average link failure time: the smaller the ratio, the worse Tapestry performs.
4. A small-scale application using hrping to measure network performance has been deployed on Freepastry [29].

5.2 Future work

Due to time constraints, several tasks have been left for future work.

1. Comparisons with other DHTs which incorporate locality, such as Leopard [64], Skipnet [65] and Foreseer [66], require a common platform to achieve reliable results.
2. The link failure model in the simulation needs to be refined.
3. For the implementation, instead of using Freepastry, we could consider using the OpenDHT service and the Scriptroute tool on PlanetLab to build our application on a global-scale network.
4. The measurement information that is collected could also help solve networking problems in other performance-sensitive applications, such as Resilient Overlay Networks, or other services such as integrated and differentiated services.

References

1. Huston, G., Measuring IP Network Performance. The Internet Protocol Journal, 2003. 6(1).
2. Kar, D., Internet path characterization using common internet tools. J. Comput. Small Coll., 2003. 18(4): p. 132.
3. Jacobson, V. pathchar - a tool to infer characteristics of Internet paths. April, 1997. Available from: ftp://ftp.ee.lbl.gov/pathchar/.
4. Downey, A. Using pathchar to estimate Internet link characteristics. in SIGCOMM '99: Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication. August, 1999: ACM.
5. Anagnostakis, K.G., M. Greenwald, and R.S. Ryger. cing: measuring network-internal delays using only existing infrastructure. in INFOCOM 2003, Twenty-Second Annual Joint Conference of the IEEE Computer and Communications Societies. March, 2003: IEEE.
6. Chen, T., Increasing the observability of Internet behavior. Commun. ACM, 2001. 44(1): p. 98.
7. Internet Performance Measurement and Analysis project (IPMA). Available from: http://www.merit.edu/ipma.
8. Labovitz, C., A. Ahuja, and F. Jahanian. Experimental study of Internet stability and backbone failures. in Fault-Tolerant Computing, 1999. Digest of Papers. Twenty-Ninth Annual International Symposium on. June, 1999.
9. Iannaccone, G., et al. Analysis of link failures in an IP backbone. in IMW '02: Proceedings of the 2nd ACM SIGCOMM Workshop on Internet measurement. November, 2002: ACM.
10. Markopoulou, A., et al. Characterization of failures in an IP backbone. in INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies. March, 2004.
11. Lua, E.K., et al., A Survey and Comparison of Peer-to-Peer Overlay Network Schemes. IEEE Communications Surveys and Tutorials, Second Quarter 2005.
12. Gnutella development forum, the gnutella protocol. 2001. Available from: http://www9.limewire.com/developer/gnutella_protocol_0.4.pdf.
13. Clarke, I., O. Sandberg, B. Wiley, and T.W. Hong, Freenet: A Distributed Anonymous Information Storage and Retrieval System. Freenet White Paper, 1999.
14. Fasttrack peer-to-peer technology company. 2001. Available from: http://www.fasttrack.nu/.
15. KaZaA media desktop. 2001. Available from: http://www.kazaa.com/.
16. Bittorrent. 2003. Available from: http://bitconjurer.org/BitTorrent.
17. Overnet/edonkey2000. 2002. Available from: http://www.overnet.com/.
18. Stoica, I., et al., Chord: a scalable peer-to-peer lookup protocol for Internet applications. IEEE/ACM Transactions on Networking, February, 2003. 11(1): p. 32.
19. Zhao, B.Y., et al., Tapestry: a resilient global-scale overlay for service deployment. IEEE Journal on Selected Areas in Communications, January, 2004. 22(1): p. 53.
20. Rowstron, A. and P. Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. in Middleware '01: Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms Heidelberg. November, 2001: Springer-Verlag.
21. Ratnasamy, S., et al. A scalable content-addressable network. in SIGCOMM '01: Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications. August, 2001: ACM.
22. Maymounkov, P. and D. Mazières. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. in IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems. March, 2002: Springer-Verlag.
23. Malkhi, D., M. Naor, and D. Ratajczak. Viceroy: a scalable and dynamic emulation of the butterfly. in PODC '02: Proceedings of the twenty-first annual symposium on Principles of distributed computing. July, 2002: ACM.
24. Karger, D., et al. Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web. in STOC '97: Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. May, 1997: ACM.
25. Srinivasan, S. and E. Zegura. M-coop: a scalable infrastructure for network measurement. in WIAPP 2003: Proceedings of the Third IEEE Workshop on Internet Applications. June, 2003.
26. Dabek, F., J. Li, E. Sit, J. Robertson, M.F. Kaashoek, and R. Morris. Designing a DHT for Low Latency and High Throughput. in NSDI '04. June, 2004.
27. P2Psim. Available from: http://pdos.csail.mit.edu/p2psim/.
28. Georgia Tech Internetwork Topology Models. Available from: http://www-static.cc.gatech.edu/projects/gtitm/.
29. Freepastry interface. Available from: http://freepastry.org/FreePastry/download.html.
30. Chawathe, Y., et al. A case study in building layered DHT applications. in SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. August, 2005: ACM.
31. Rhea, S., et al. OpenDHT: a public DHT service and its uses. in SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications. August, 2005: ACM.
32. Internet Indirection Infrastructure. Available from: http://i3.cs.berkeley.edu/.
33. Li, J., et al. Comparing the performance of distributed hash tables under churn. in The 3rd International Workshop on Peer-to-Peer Systems. February, 2004. San Diego, CA, USA.
34. Plaxton, G., R. Rajaraman, and A. Richa. Accessing nearby copies of replicated objects in a distributed environment. in SPAA '97: Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures. June, 1997. Newport, Rhode Island, United States: ACM.
35. Zhao, B.Y., J. Kubiatowicz, and A.D. Joseph, Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing. 2001, Computer Science Division, University of California, Berkeley. p. 28.
36. Kubiatowicz, J., et al. OceanStore: an architecture for global-scale persistent storage. in ASPLOS-IX: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems. November, 2000: ACM.
37. Zhuang, S.Q., B.Y. Zhao, A.D. Joseph, R.H. Katz, and J.D. Kubiatowicz. Bayeux: an architecture for scalable and fault-tolerant wide-area data dissemination. in NOSSDAV '01: Proceedings of the 11th international workshop on Network and operating systems support for digital audio and video. January, 2001: ACM.
38. Spamwatch. Available from: http://www.cs.berkeley.edu/zf/spamwatch/.
39. Druschel, P. and A. Rowstron. PAST: a large-scale, persistent peer-to-peer storage utility. in Hot Topics in Operating Systems, 2001. Proceedings of the Eighth Workshop on. May, 2001.
40. Castro, M., et al., Scribe: a large-scale and decentralized application-level multicast infrastructure. IEEE Journal on Selected Areas in Communications, October, 2002. 20(8): p. 1499.
41. Iyer, S., A. Rowstron, and P. Druschel. Squirrel: a decentralized peer-to-peer web cache. in PODC '02: Proceedings of the twenty-first annual symposium on Principles of distributed computing. July, 2002: ACM.
42. Castro, M., et al., Proximity neighbor selection in tree-based structured peer-to-peer overlays. 2003, Microsoft Research, Rice University, Purdue University: Redmond. p. 12.
43. Stribling, J., K. Hildrum, and J.D. Kubiatowicz, Optimizations for Locality-Aware Structured Peer-to-Peer Overlays. 2003, Computer Science Division (EECS), University of California, Berkeley. p. 10.
44. Peersim. Available from: http://peersim.sourceforge.net/.
45. Planetsim. Available from: http://planet.urv.es/planetsim/.
46. Gupta, I., et al., Kelips: Building an Efficient and Stable P2P DHT, in 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03). February, 2003: Berkeley, CA, USA.
47. Kaashoek, M.F. and D.R. Karger, Koorde: A simple degree-optimal distributed hash table, in 2nd International Workshop on Peer-to-Peer Systems (IPTPS '03). February, 2003: Berkeley, CA, USA.
48. Li, J., et al. Bandwidth-efficient Management of DHT Routing Tables. in NSDI '05. May, 2005. Boston, MA.
49. Dabek, F., et al. Vivaldi: a decentralized network coordinate system. in SIGCOMM '04: Proceedings of the 2004 conference on Applications, technologies, architectures, and protocols for computer communications. August, 2004: ACM.
50. Calvert, K., et al. Extending and enhancing GT-ITM. in MoMeTools '03: Proceedings of the ACM SIGCOMM workshop on Models, methods and tools for reproducible network research. August, 2003: ACM.
51. Zhao, B., et al. Brocade: Landmark Routing on Overlay Networks. in IPTPS '01: Revised Papers from the First International Workshop on Peer-to-Peer Systems. March, 2002: Springer-Verlag.
52. Gedik, B. and L. Liu. PeerCQ: A Decentralized and Self-Configuring Peer-to-Peer Information Monitoring System. in ICDCS '03: Proceedings of the 23rd International Conference on Distributed Computing Systems. May, 2003: IEEE.
53. Andersen, D., et al. Resilient overlay networks. in SOSP '01: Proceedings of the eighteenth ACM symposium on Operating systems principles. October, 2001: ACM.
54. Srinivasan, S. and E. Zegura, Network Measurement as a Cooperative Enterprise, in International Workshop on Peer-to-Peer Systems. March, 2002: MIT Faculty Club, Cambridge, MA, USA.
55. Labovitz, C., et al., Delayed Internet routing convergence. IEEE/ACM Transactions on Networking, August, 2000.
56. Floyd-Warshall Algorithm. Available from: http://www.algorithmist.com/index.php/Floyd-Warshall's_Algorithm.
57. Castro, M., P. Druschel, Y.C. Hu, and A. Rowstron, Topology-aware routing in structured peer-to-peer overlay networks. 2002, Microsoft Corporation: Redmond, WA 98052. p. 19.
58. Jun-jie, J., et al., Using bidirectional links to improve peer-to-peer lookup performance. Journal of Zhejiang University SCIENCE A, 2006. 7(6): p. 945-951.
59. Li, J., et al. A performance vs. cost framework for evaluating DHT design tradeoffs under churn. in IEEE Infocom 2005. March, 2005. Miami.
60. hrping. Available from: http://www.cfos.de/ping/ping.htm.
61. PlanetLab. 2002. Available from: http://www.planet-lab.org/.
62. The Bamboo Distributed Hash Table. 2004. Available from: http://www.bamboo-dht.org/.
63. Scriptroute: A facility for distributed Internet debugging and measurement. 2002. Available from: http://www.cs.washington.edu/research/networking/scriptroute/.
64. Rezvani, P., et al. LEOPARD: a Logical Effort-based fanout OPtimizer for ARea and Delay. in ICCAD '99: Proceedings of the 1999 IEEE/ACM international conference on Computer-aided design. November, 1999: IEEE.
65. Harvey, N.J.A., M.B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A Scalable Overlay Network with Practical Locality Properties. in 4th USENIX Symposium on Internet Technologies and Systems. March, 2003. Seattle, WA.
66. Cai, H. and J. Wang. Foreseer: a novel, locality-aware peer-to-peer system architecture for keyword searches. in Middleware '04: Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware. October, 2004: Springer-Verlag.

Appendix I

P2Psim extension

1. Extending P2Psim to support GT-ITM

The current version of P2Psim contains GT-ITM in its topology folder, but it is not yet supported by the authors. To integrate GT-ITM with P2Psim:
a) Install libgb.a to /usr/local/lib, and eval.h, gb_dijk.h, gb_flip.h, gb_graph.h, gb_save.h and geog.h to /usr/local/include;
b) Revise the external function declaration extern long dijkstra(); to extern long dijkstra(Vertex*, Vertex*, Graph*, long(*)); and revise extern Graph *restore_graph(); to extern Graph *restore_graph(char*);
c) Revise the latency function in gtitm.C from Time ret = 1000*dijkstra(a, b, g, NULL) to Time ret = dijkstra(a, b, g, NULL) to ensure millisecond accuracy;
d) Reconfigure and recompile P2Psim.

2. Adding link event

To simulate the scenario where links break down, several functions and classes need to be added to P2Psim:
a) Once a topology file is parsed at the beginning of the simulation, a GT-ITM graph is restored from the input file, and every link in this graph is assigned a linkID in its utility field.
b) An event class called linkevent is created as one type of event in the events folder. Linkevent is also added to eventfactory and churneventgenerator. In churneventgenerator, there are two link actions, “down” and “up”.
c) An update_link function is created in gtitm.C to process the link event. The length of a link is set to infinity or to its initial value for the “down” or “up” action respectively.
d) At the beginning of the simulation, all links join the network. After a link has been up or down for some time, drawn from an exponential distribution, it leaves or rejoins the network by invoking a linkevent, which processes the request by calling the update_link function in gtitm.C.

Appendix II

Implementation Procedure and Results

Installing Freepastry:
1) Freepastry is installed under Windows XP.
2) Download Freepastry from http://freepastry.org/FreePastry/download.html.
3) Unzip the file and install it into the folder C:\develop.
4) Add MeasurementData.java, MeasurementApp.java and DistMeasurement.java into the folder C:\develop\pastry\src\rice\pastry\testing>.

Installing ant:
1) Download the ant tool from http://ant.apache.org/.
2) Unzip it and install ant.
3) Add the ant lib folder path C:\ant-current-bin\apache-ant-1.6.5\lib to the environment variable CLASSPATH and the bin folder path C:\ant-current-bin\apache-ant-1.6.5\bin to PATH. Create a new environment variable JAVA_HOME pointing to where the JDK is installed.

Installing hrping:

1) Download hrping from http://www.cfos.de/ping/ping.htm and unzip it to the folder C:\develop.
2) Add C:\develop to PATH so that hrping can be run from any folder prompt.
3) Turn off the firewall on each computer.

Compiling freepastry:
1) Under the path C:\develop\pastry> (where build.xml exists), use the “ant” command to compile. If compilation succeeds, the following result is shown:
C:\develop\pastry>ant
Buildfile: build.xml
init:
compile:
BUILD SUCCESSFUL
Total time: 3 seconds

Running freepastry:
1) The execution is carried out under the folder C:\develop\pastry\classes>.
2) The program can be terminated at any time with Ctrl+C.
3) Each computer holds only one overlay node.
4) Running the measurement program:

I. Commands
To insert or retrieve the result of pinging host Delta, we specify only one node and one message on each computer, port 5009 for the wider-area network, and host Gamma as the bootstrap node:
java rice.pastry.testing.DistMeasurement –insert/retrieve Delta –bootstrap Gamma –nodes 1 –msgs 1 –port 5009

II. Results
A Pastry network with four hosts is created; host Alpha inserts the hrping measurement to host Delta, and host Beta then retrieves it. The screen output on each host is shown below.

Host Delta node Delta/10.2.0.3 starts a pastry network

---The Address is EE343A-DELL-07/10.2.0.3:5009 Error connecting to address EE343A-DELL-07/10.2.0.3:5009: java.net.ConnectExcept ion: Connection refused: no further information Couldn't find a bootstrap node, starting a new ring... -----The Node Handle is null Node <0x9B3FDD..> ready, waking up any clients created SocketNodeHandle (<0x9B3FDD..>/EE343A-DELL-07/10.2.0.3:5009 [-242837193044443812]) idTarget <0x9BAF73..> 1 nodes constructed after node Gamma/10.2.0.2 joins

In <0x9B3FDD..>'s route set, node <0x97DC4B..> was added In <0x9B3FDD..>'s leaf set, node <0x97DC4B..> was added after node Alpha/10.1.0.2 joins and insert pinging information

In <0x9B3FDD..>'s route set, node <0x6F60DE..> was added In <0x9B3FDD..>'s leaf set, node <0x6F60DE..> was added Enroute {10.2.0.3#1 Delay :95.29867! from (<0x6F60DE..>) to *<0x9BAF73..>} at <0x9B3FDD..> Received {10.2.0.3#1 Delay :95.29867! from (<0x6F60DE..>) to *<0x9BAF73..>} at <0x9B3FDD..> Writing to file

key10.2.0.3.txt File 10.2.0.3.txt after node Beta/10.1.0.3 joins retrieving the information

In <0x9B3FDD..>'s route set, node <0x73450E..> was added In <0x9B3FDD..>'s leaf set, node <0x73450E..> was added Enroute {10.2.0.3#1 Delay :-1.0! from (<0x73450E..>) to *<0x9BAF73..>} at <0x9B3FDD..> Received {10.2.0.3#1 Delay :-1.0! from (<0x73450E..>) to *<0x9BAF73..>} at <0x9B3FDD..> start reading items File 10.2.0.3.txt Connected to requesting side...sending...

Host Gamma node Gamma/10.2.0.2 starts joining the pastry network

---The Address is /10.2.0.3:5009 -----The Node Handle is [SNH: <0x9B3FDD..>//10.2.0.3:5009 [-242837193044443812]] In <0x97DC4B..>'s route set, node <0x9B3FDD..> was added In <0x97DC4B..>'s leaf set, node <0x9B3FDD..> was added Node <0x97DC4B..> ready, waking up any clients created SocketNodeHandle (<0x97DC4B..>/EE343A-DELL-08/10.2.0.2:5009 [7148284710053293696]) idTarget <0x9BAF73..> 1 nodes constructed after node Alpha/10.1.0.2 joins

In <0x97DC4B..>'s route set, node <0x6F60DE..> was added In <0x97DC4B..>'s leaf set, node <0x6F60DE..> was added after node Beta/10.1.0.3 joins

In <0x97DC4B..>'s route set, node <0x73450E..> was added In <0x97DC4B..>'s leaf set, node <0x73450E..> was added

Host Alpha bootstrap from Delta/10.2.0.3 and insert measurement information

---The Address is /10.2.0.3:5009 -----The Node Handle is [SNH: <0x9B3FDD..>//10.2.0.3:5009 [-242837193044443812]] In <0x6F60DE..>'s route set, node <0x97DC4B..> was added In <0x6F60DE..>'s leaf set, node <0x9B3FDD..> was added In <0x6F60DE..>'s leaf set, node <0x97DC4B..> was added created SocketNodeHandle (<0x6F60DE..>/EE343A-DELL-10/10.1.0.2:5009 [-1878654285473777314]) idTarget <0x9BAF73..> Node <0x6F60DE..> ready, waking up any clients 1 nodes constructed Sending message from <0x6F60DE..> with key <0x9BAF73..> Enroute {10.2.0.3#1 Delay :95.29867! from (<0x6F60DE..>) to *<0x9BAF73..>} at <0x6F60DE..> In <0x6F60DE..>'s route set, node <0x97DC4B..> was removed In <0x6F60DE..>'s route set, node <0x9B3FDD..> was added after Beta/10.1.0.3 joins

In <0x6F60DE..>'s route set, node <0x73450E..> was added In <0x6F60DE..>'s leaf set, node <0x73450E..> was added

Host Beta bootstrap from Delta/10.2.0.3

---The Address is /10.2.0.3:5009 -----The Node Handle is [SNH: <0x9B3FDD..>//10.2.0.3:5009 [-242837193044443812]] created SocketNodeHandle (<0x73450E..>/EE343A-DELL-09/10.1.0.3:5009 [6008721163289032215]) idTarget <0x9BAF73..> In <0x73450E..>'s route set, node <0x6F60DE..> was added In <0x73450E..>'s route set, node <0x9B3FDD..> was added In <0x73450E..>'s leaf set, node <0x6F60DE..> was added In <0x73450E..>'s leaf set, node <0x9B3FDD..> was added In <0x73450E..>'s leaf set, node <0x97DC4B..> was added Node <0x73450E..> ready, waking up any clients

1 nodes constructed Sending message from <0x73450E..> with key <0x9BAF73..> Enroute {10.2.0.3#1 Delay :-1.0! from (<0x73450E..>) to *<0x9BAF73..>} at <0x73450E..> Sending requesting message

Receiving information at 10.2.0.3 on port 1239 10.2.0.3->95.29867

III. File Format
The file is named after the IP address of the host that is pinged. The data is stored as:

10.2.0.3->95.29867=><0x6F60DE..>

This data shows that the hrping time from the node with ID <0x6F60DE..> to Delta/10.2.0.3 is 95.29867 milliseconds.