
Development and Implementation of the B-Tracker Approach on a BitTorrent Client

Master Thesis

Andri Lareida
Zürich, Switzerland
Student ID: 06-700-389

Supervisor: Fabio Hecht, Thomas Bocek
Date of Submission: April 30, 2012

University of Zurich
Department of Informatics (IFI)
Communication Systems Group (CSG), Prof. Dr. Burkhard Stiller
Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
URL: http://www.csg.uzh.ch/

Zusammenfassung

The main problems addressed by the B-Tracker approach are efficiency and load balancing in distributed BitTorrent (BT) trackers. Historical data and simulations show that the uneven load distribution in the so-called distributed hash table (DHT) tracker has two causes: the unequal popularity of torrents and the internal mechanisms of a DHT. The second existing solution, PEX, is known for its inefficiency, since PEX messages are regularly flooded to a peer's neighbors. First lab simulations of B-Tracker showed that B-Tracker is superior to the existing solutions in efficiency and load balancing. The goal of this thesis is to implement B-Tracker for a real BT client and to evaluate it under realistic conditions.

B-Tracker is implemented as a plugin for the well-known BT client Vuze. The original B-Tracker concept is developed into a design that both fulfills the original concept and fits the fixed Vuze plugin interface; this design is then implemented. For evaluation purposes, some changes to the Vuze source code are also required, mainly concerning the measurement of messages. Developing B-Tracker as a Vuze plugin brings several advantages: Vuze already provides a DHT, which is also used by the DHT tracker, and the plugin can be used in all current Vuze installations, since running the plugin alone requires no changes to the Vuze source code. Three new message types are introduced to support the B-Tracker protocol. The plugin implementation also has some disadvantages; among others, the plugin interface does not support all the capabilities the DHT offers, so some compromises are necessary to implement the original concept.

The setup required for the evaluation comprises experiments for the three tracker variants and three different churn rates. Each experiment involves 100 peers that download a given file from initially two seeders. An experiment is considered finished as soon as all peers have downloaded the file. The results show that B-Tracker retains its advantages in a real setting and, under these conditions, performs at least as well as and mostly better than its competitors. With the churn rate set to 0, the DHT tracker has an efficiency advantage; at 15% both are similar; and at 30% B-Tracker is ahead. In load balancing, B-Tracker performs consistently better.

The implementation and evaluation of B-Tracker as a Vuze plugin can be regarded as a successful second step on the way to a more efficient and fairer distributed tracker.

Abstract

BitTorrent (BT) is a popular Peer-to-Peer (P2P) system which, by its use of centralized trackers, breaches the P2P paradigm. For trackers to profit from P2P properties, they have to be implemented in a distributed approach. Peer Exchange (PEX) and the Azureus Distributed Hash Table (AZDHT) are two widely deployed distributed tracking mechanisms. Both suffer from inefficiency and poor load balancing, where load is defined as the upload bandwidth used. The B-Tracker approach, which promises to solve these issues, is designed and implemented as a plugin for the Vuze BT client. For the evaluation, simulations of a BT swarm under realistic conditions including churn are run for all the trackers. The results show that B-Tracker improves efficiency and load balance under realistic conditions: B-Tracker shows better load balance than DHT and PEX in all simulated conditions, and it is more efficient than PEX, similar to AZDHT at 15% churn, and better at 30% churn. Most B-Tracker traffic is generated by the underlying AZDHT, which cannot be changed by a plugin. B-Tracker shows that it is possible to improve load balancing and efficiency compared to existing solutions.

Acknowledgments

First and foremost, I would like to thank my supervisors Fabio Hecht and Thomas Bocek for developing the B-Tracker idea which formed the basis of my thesis and for the constructive discussions we had. A special thanks goes to Fabio for proofreading the thesis and for the support he gave me whenever I was in need of help. This thesis would not have been possible without the two of you.

I am obliged to many members of the Communication Systems Group at the University of Zurich, led by Prof. Dr. Burkhard Stiller, for giving me the opportunity to write this thesis and for providing me with an excellent server infrastructure and support.

Last but not least I thank my parents for supporting me and for giving me motivation when needed.

Contents

Zusammenfassung i

Abstract iii

Acknowledgments v

1 Introduction 1

1.1 Motivation...... 1

1.2 Description of Work ...... 2

1.3 Thesis Outline ...... 4

2 Related Work 5

2.1 BitTorrent...... 5

2.1.1 Tracker Protocol ...... 7

2.1.2 BT Protocol ...... 8

2.1.3 Azureus Extensions ...... 11

2.2 Peer Exchange ...... 11

2.3 DHT-Tracker Extension ...... 12

2.4 Bloom Filters ...... 13

2.5 Related Work Summary ...... 13


3 Design 15

3.1 Hash Table Load Analysis ...... 15

3.2 B-Tracker Approach ...... 17

3.2.1 Primary Tracker Look Up ...... 17

3.2.2 Secondary Tracker Query ...... 17

3.2.3 Main Tracker ...... 18

3.2.4 DHT Manager ...... 20

3.2.5 Messaging ...... 21

3.2.6 Parameters ...... 22

4 Implementation 23

4.1 Plugin Interface ...... 23

4.2 DHT Interface ...... 24

4.2.1 Measurement ...... 25

4.3 Further Development ...... 25

5 Evaluation 27

5.1 Evaluation Environment ...... 27

5.2 Experiment Design ...... 28

5.2.1 Files, Bandwidth and Seeding ...... 28

5.2.2 Churn ...... 29

5.2.3 DHT ...... 30

5.2.4 Performance Issues ...... 30

5.2.5 Parameters ...... 30

5.3 Execution and Results ...... 31

5.3.1 Execution ...... 31

5.3.2 Efficiency ...... 32

5.3.3 Load Balancing ...... 34

5.3.4 Messages ...... 36

5.4 Run Times ...... 40

6 Summary and Conclusions 41

6.1 Summary ...... 41

6.2 Conclusion ...... 42

Abbreviations 47

Glossary 49

List of Figures 49

List of Tables 52

A Installation Guidelines 55

B Contents of the DVD 57

B.1 B-Tracker Plugin ...... 57

B.2 Data...... 57

B.3 Experiment ...... 58

B.4 Related Work ...... 58

B.5 Sources ...... 58

B.6 Thesis ...... 58

Chapter 1

Introduction

Peer-to-Peer (P2P) technology allows systems to operate without the need of a central entity (e.g., a server) controlling them. P2P systems use the resources provided by their members, the so-called peers. They allow peers to share resources such as CPU time, memory, or disk space. Typical properties of P2P systems are scalability, because every peer brings resources, and the absence of a single point of failure, since there is no central server.

According to a study by ipoque [18], the most popular P2P system is BitTorrent (BT), accounting for 80% of P2P traffic; P2P systems in general account for roughly 50% of Internet traffic. BT is mostly used to share large files or large collections of small files such as movies, software distributions, or music collections. A file server infrastructure providing similar performance would incur huge costs in hardware and maintenance. With BT, a user contributes to the infrastructure by providing some bandwidth and disk space and therefore covers a part of the costs. This means that a peer downloading a file uploads it at the same time.

Traditionally, a centralized server called a tracker maintains a database containing the peers that share a certain file and the amount of data they uploaded and downloaded [7]. Peers sharing the same file are called a swarm. The tracker as a central entity is a breach of the P2P paradigm. Past events involving one of the largest free trackers [3] have shown that scalability and a single point of failure are an issue in BT trackers.

1.1 Motivation

In order to overcome the drawbacks of centralized trackers, two approaches to distributed trackers have been added to the original BitTorrent protocol. One is based on distributed hash table (DHT) technology; the second one is called Peer Exchange (PEX). PEX is used by most current BT clients; it is a gossiping-based protocol where peers send their peer list to connected peers at certain time intervals. It is totally unstructured and can reveal only peers that are already connected to the swarm [28]. Furthermore, PEX is not very efficient in terms of bandwidth consumption [13], due to the chatty nature of the protocol.

DHTs are based on the assumption of roughly equal data distribution [19]. Other work [5, 11, 17] states that in real-world applications of DHTs the identifiers, and therefore the load, are not uniformly distributed among participating nodes. Research has so far focused on uneven load distribution in DHTs, but the fact that not all torrents are equally popular has been ignored. A torrent describes a file being shared and gives the address of at least one tracker. It has been shown that the popularity of torrents follows a Zipf-like distribution except for the end of the curve [9]. This characteristic means that very few torrents are very popular, while most torrents are shared by only a small minority of peers.

Figure 1.1 depicts a dataset acquired from The Pirate Bay tracker in 2008 [12], which shows the distribution of torrent popularity. To show the uneven distribution, the popularity value was calculated as the sum of all peers taking part in a torrent at the time of data capture, since a tracker has to handle a database containing exactly these peers. The torrents were then ranked according to their popularity score and plotted on logarithmic scales. The resulting graph resembles a Zipf distribution very closely; one can also see the exponential cutoff described in [9]. This distribution poses a problem to load balancing in BT's DHTs which had not been addressed in research before the B-Tracker approach [13]. Section 3.1 presents a deeper analysis of this issue.

The motivation for this thesis is to overcome the mentioned drawbacks of existing distributed trackers by implementing the B-Tracker approach. The thesis has the following goals: the first goal (G1) is to design and implement the B-Tracker approach on a real BT client; the second goal (G2) is to compare the solution developed in G1 to DHT and PEX; the last goals are to show that B-Tracker is superior in terms of efficiency (G3) and load balancing (G4) under realistic conditions.

1.2 Description of Work

This thesis is based on the work on B-Tracker [13], which proposes a new approach to distributed tracking in BT networks. So far B-Tracker has not been integrated into a real BT program. In order to conduct simulations, a plugin implementation of B-Tracker for the very popular BT program Vuze [24] is necessary. The main work in the context of this thesis is the design and implementation of the proposed B-Tracker approach. The design has to be adapted to fit into Vuze's plugin interface and to use existing Vuze facilities. The implementation has to be compatible with Vuze and also support measurement capabilities in order to compare the three approaches; for this purpose, the Vuze source code has to be extended. For the evaluation of B-Tracker, a whole framework is built in order to simulate swarms of 100 peers including churn. Finally, all the results are parsed and graphs are produced. These graphs are analyzed and discussed.


Figure 1.1: Plot of torrent popularity on logarithmic scales, based on The Pirate Bay dataset 2008 [12].

1.3 Thesis Outline

Chapter 2 looks at related work in detail. The BitTorrent system is explained, as well as the two previously mentioned distributed tracker approaches, DHT and PEX. In Chapter 3 the design of the B-Tracker approach is explained thoroughly. Chapter 4 covers the implementation of B-Tracker as a Vuze plugin, with a focus on Vuze's plugin interface. In Chapter 5 experiments are defined in order to compare the three approaches and conduct an in-depth evaluation; results are presented, analyzed, and discussed. In Chapter 6 the whole work is summarized and concluded.

Chapter 2

Related Work

This chapter introduces related work in the areas needed to understand the thesis. It starts with an overview of BT, which also covers extensions made by the Azureus client. It then continues with Peer Exchange (PEX) and the Azureus DHT (AZDHT). Finally, Bloom filters are explained.

2.1 BitTorrent

In order to understand the benefits of B-Tracker it is important to understand what BT is and how it works. BT is a file distribution system that uses P2P technology to benefit from properties like scalability, reliability and flexibility [20]. The idea behind P2P is that every client is also a server; all actors are therefore called peers because they are equal. BT implements this paradigm by enforcing a “tit-for-tat” principle [6], which encourages a peer to upload in order to download a file. This way files are distributed to large masses. The beauty is that the more peers want a file, the more bandwidth is available, since each peer has to provide a share. One problem in P2P systems is the so-called bootstrapping, which describes the initial connection process a new peer needs to go through to get first contacts. In BT it is solved by the use of a tracker, which acts as a broker between peers. Standard BT trackers are centralized servers. There are multiple implementations of BT and they all have their own extensions to the standard protocol; this section focuses on the standard protocol.

The BT system consists of several components which need to be explained. The following list explains the components needed for a BT file distribution, according to [7, 22].

A web server: used to serve the metainfo file (.torrent) to the end users.

A metainfo file: containing all the necessary information for an end user to connect to a swarm. The first part of the file (announce) contains the tracker URL; the second part (info) consists of several keys which are explained later.


A BT tracker: keeping a database of all the peers in a swarm. Peers can ask a tracker for other peers sharing a specific file. A file is identified by the 20 byte SHA1 hash of the info part of the metainfo file. Usually a tracker is responsible for several torrents at the same time.

A seeder: a peer in possession of the complete file. Other peers will initially have to download from this “original” peer. A seeder is only uploading. A peer can be a seeder and a downloader at the same time for different torrents.

A web browser: helping the user to find and download a .torrent file with the metainfo.

The BT client: an application following the BT protocol with a user interface. The .torrent file is loaded into the application and the download can start. When the download is finished, the application continues to upload until the user actively stops it; this is called seeding. There are numerous client applications which implement their own protocol extensions.

Peers can have different roles over their lifetime. They can even have several roles at the same time depending on their relationship to their neighbors. Here is a description of the different roles:

Seeder: a peer only uploading as explained before.

Downloader: a peer that is downloading a file.

Provider: a peer that can provide file pieces to another peer. It can be a downloader or a seeder; it just means that the peer can provide new pieces. Provider is a role a peer has from the perspective of another peer.

Neighbor: the peers or providers to which a peer is connected.

The metainfo file consists of the announce and the info part. The announce part contains at least the so-called announce URL, which is a link to the tracker. The info part consists of at least the following fields:

Name: a UTF-8 encoded String which represents a name for the torrent. This field is only advisory. It can be a file name in case of a single file or a directory if there is a number of files.

Piece length: for ease of transfer, files are split into chunks of the same size. That way a file can be uploaded by a peer before it has finished downloading it, and the replication factor in a swarm can be improved. These chunks all have the same size; most commonly this defaults to 2^18 bytes = 256 KiB.

Pieces: a string concatenation of the SHA1 hashes of all the file pieces. Thus, a peer can verify the correctness of a piece after downloading it.

Length: only used if the torrent consists of a single file. In that case it tells the file size in bytes.

Figure 2.1: The main contents of a .torrent file for a single-file download. Further optional fields exist.

Files: used if there is more than one file in the torrent. It is a dictionary containing the length and a sub-directory path for each file.

Figure 2.1 gives an overview of the standard fields in a .torrent file. All .torrent files are b-encoded. B-encoding uses UTF-8 characters as delimiters and values; using an 8-bit encoding makes it simple to encode bytes. There are four types of values: byte strings, integers, lists and dictionaries, each with its own delimiters. Since b-encoded files are not easy to read, the figure shows plain text. More information on b-encoding can be found in [2]. The info part is a dictionary, which means that all its elements are key-value pairs. A closer look at the info part reveals that the file is split into pieces of 262144 bytes (piece length) for downloading purposes. The pieces value is a concatenation of the 20 byte SHA1 hash of each piece; thus the number of pieces can be derived from the size of the pieces value. Note that the file size can be calculated from the piece length and the number of pieces. However, there is a difference between this calculated size and the value in the length field, which is the result of splitting the file into parts of equal size (the last part may not be filled). The whole info dictionary is used to create a key that identifies the torrent, by calculating its 20 byte SHA1 hash. This key is used when a tracker is queried or a neighbor relationship is initiated.
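To make this last step concrete, the following minimal Java sketch derives the torrent ID from the raw b-encoded bytes of the info dictionary. Extracting those bytes from the .torrent file is assumed to have happened already; only the standardized hashing step is shown.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute the 20 byte torrent ID (info hash) from the raw
// b-encoded bytes of the info dictionary.
public class InfoHash {
    public static byte[] infoHash(byte[] bencodedInfoDict) throws NoSuchAlgorithmException {
        // SHA1 yields the 20 byte key used towards trackers and peers.
        return MessageDigest.getInstance("SHA-1").digest(bencodedInfoDict);
    }
}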

2.1.1 Tracker Protocol

Since a distributed tracker will be designed and implemented later in this thesis, it is necessary to understand how a tracker works in more detail. The tracker is contacted with an HTTP GET request; the URL is given in the announce field of a .torrent file. The most important request parameter is the info hash, which identifies the torrent. Further required parameters are peer id, ip and port, because the tracker might need to return the response to another address due to NAT. Additionally, the total bytes downloaded, the bytes still left, and the event type are sent with a request. The event parameter gives the reason why the request has been sent. Events can be: started, stopped, and completed; the field can also be blank or absent, which means that the request is a regular announcement done in the interval the tracker specifies in its response. There are additional parameters which are not required but are documented in [22].

Figure 2.2: Example communication between a peer and a tracker. The peer sends an HTTP GET request with parameters and receives a list of peers and an interval.

Figure 2.2 gives a sample of what a request and an answer could look like. The answer is simply a list of peers, where each entry consists of id, ip and port, plus the interval, which tells how many seconds a peer should wait before sending a new request.
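As an illustration of the request side, the following Java sketch assembles such an announce URL. The parameter names follow the standard tracker protocol; the helper class and the byte-wise percent-encoding via ISO-8859-1 are choices of this sketch, not part of any particular client.

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build a tracker announce request URL from the required parameters.
public class AnnounceUrl {
    public static String build(String announceUrl, byte[] infoHash, byte[] peerId,
                               int port, long downloaded, long left, String event)
            throws UnsupportedEncodingException {
        // Binary values must be percent-encoded byte by byte; ISO-8859-1 maps
        // each byte to exactly one character, so the encoding round-trips.
        return announceUrl
                + "?info_hash=" + URLEncoder.encode(new String(infoHash, "ISO-8859-1"), "ISO-8859-1")
                + "&peer_id=" + URLEncoder.encode(new String(peerId, "ISO-8859-1"), "ISO-8859-1")
                + "&port=" + port
                + "&downloaded=" + downloaded
                + "&left=" + left
                + (event == null ? "" : "&event=" + event); // started, stopped, completed or absent
    }
}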

2.1.2 BT Protocol

After receiving a peer list from a tracker, a downloader will contact the peers in the list, which are potential providers. Peers follow a strict protocol in their communication, generally referred to as the BitTorrent protocol [7].

Figure 2.3 shows a handshake message which a peer sends to a provider it wants to contact. A peer can contact multiple providers simultaneously, but the process is always the same. The handshake message for the standard BT protocol begins with a one-byte integer length prefix. The prefix indicates the length of the following protocol description, which for the standard protocol is “BitTorrent protocol”. Then 8 bytes of zeros follow; this space is reserved for protocol extensions, which are already used by some implementations. The last two fields are the torrent ID and the peer ID, both 20 bytes long. The torrent ID is the same info hash as described before. The peer ID is just an identification string; the BT protocol does not give any rules on how to determine it. A sample communication between two peers is illustrated in Figure 2.4.
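The fixed layout of this handshake makes it easy to assemble; the following Java sketch builds the 68 byte message under the assumption that the 20 byte info hash and peer ID are already available.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch: assemble the standard BT handshake described above.
public class Handshake {
    private static final byte[] PROTOCOL = "BitTorrent protocol".getBytes(StandardCharsets.US_ASCII);

    public static byte[] build(byte[] infoHash, byte[] peerId) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(PROTOCOL.length);   // one-byte length prefix (19)
        out.write(PROTOCOL);          // protocol identifier string
        out.write(new byte[8]);       // reserved bytes for protocol extensions
        out.write(infoHash);          // 20 byte torrent ID
        out.write(peerId);            // 20 byte peer ID
        return out.toByteArray();
    }
}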

Figure 2.3: The diagram depicts the two basic message types used in the BitTorrent protocol. The standard message has a payload depending on the type of the message. The numbers on top of the fields give the fields' sizes in bytes.

Figure 2.4 shows a sample communication between two peers. After receiving a handshake, the provider immediately returns a handshake with its own torrent and peer ID. If either the peer or the provider notices that the torrent ID is wrong, the connection is dropped. Wrong in this case means that either the peer does not share that particular torrent or the format is wrong. The provider's handshake reply is followed by a Bit Field message indicating which pieces of the file can be downloaded from it.

Figure 2.4 also shows that a link between provider and peer can have different states. A provider can choke a connection, for instance if it has no capacity left. Choking means that no pieces will be served to the connected peer. If capacity is freed up, the provider sends an Unchoke message to the client. A newly created connection starts as choked on both sides. The second flag is called Interested. It tells whether a peer is interested in something the provider on the other side of the connection has already downloaded. There are also messages to set or remove the interested flag. Therefore, a peer must maintain four state bits: one for its own choke state, one for its own interested state, and two for the provider's states. The states of the sample connection between the peer and the provider are shown in the boxes.

Upon receiving a Bit Field message, the peer can determine whether it is interested in a piece of the provider. Its link state is updated to interested and an Interested message is sent to the provider, which uses this message to update its own link state. The provider now knows that as soon as it unchokes the peer, the peer will start sending requests. A request is answered by sending a portion of the requested piece, since pieces are chopped into smaller blocks for easier transport. After the peer has received a complete piece, it sends a Have message to all of its neighbors to let them know that it has the piece. Eventually, some neighbor may get interested upon receiving the Have message. Based on this new piece, the peer has to adapt its link states; maybe it is not interested in a peer anymore and therefore has to send a Not Interested message. This process continues until a peer has completed the download. Then it will start seeding, meaning the peer will serve pieces without downloading.

Figure 2.4: This sequence diagram shows a typical message exchange between a peer and a provider following the BT protocol.

2.1.3 Azureus Extensions

Many different implementations of the BitTorrent protocol exist; Vuze is among the five most popular programs [10]. Azureus started as an open source project and was later renamed to Vuze, which is developed by Vuze Inc. The official Vuze distribution is not open anymore, but the core, which is in fact still Azureus, remains open source. Vuze also provides an API for plugin development and is therefore ideally suited for experimenting with new extensions.

Most BT implementations bring their own protocol extensions, and so does Vuze. The Vuze protocol is also called the Azureus protocol. This protocol is most relevant to the implementation of B-Tracker, since B-Tracker is realized as a Vuze plugin. The Azureus extension is documented in the Azureus Wiki [26]. The main difference to the standard protocol is the additional AZ_HANDSHAKE, which has to be supported by an Azureus client, and the AZ_PEER_EXCHANGE, which is needed for PEX since PEX is not part of the standard BT protocol.

The AZ_HANDSHAKE message is exchanged before two peers start using the AZ protocol. Besides the standard handshake contents (IP address, TCP port, UDP port), the AZ_HANDSHAKE contains the exact version of Azureus used by the sender. Additionally, all the supported message types and numbers are included. Thus, the only message that has to be supported by an Azureus client is the AZ_HANDSHAKE message, because the rest is negotiated with the handshake. The AZ_PEER_EXCHANGE message is used to exchange neighbor information with peers; PEX is discussed in the next section. From the documentation of the AZ protocol [26] it cannot be told what the other messages are used for and what exactly their content looks like. They are listed for completeness.

2.2 Peer Exchange

The idea behind Peer Exchange (PEX) is simply to allow peers to exchange lists of providers. A peer sends a list of the peers it is connected to to other peers. So a peer can receive new peers not only from the tracker, but also from other peers. This allows a swarm to stay together even if the tracker fails, and it can also reduce the load on a tracker. However, PEX cannot substitute a tracker completely, since it has no bootstrapping capabilities: in order to benefit from PEX, a peer must already be in a swarm.

Experiments have shown that PEX can improve download speed [28], but also that PEX messages have a significant degree of redundancy. There are two main implementations of PEX: the Azureus PEX (AZPEX) and the uTorrent PEX (UTPEX). The main difference is that UTPEX sends separate lists for IPv4 and IPv6 peers. Developers of both implementations have agreed to send a maximum of 50 peers per message [21]. They also agreed that messages should only be sent every minute. Therefore, peer discovery is very quick for the first 30%-40% of the peers, but then slows down; even after 3000 seconds less than 60% of the total peers are discovered [28].

Kademlia     Mainline        Azureus
PING         PING            PING
STORE        ANNOUNCE_PEER   STORE
FIND_NODE    FIND_NODE       FIND_NODE
FIND_VALUE   GET_PEERS       FIND_VALUE
N/A          N/A             KEY_BLOCK

Table 2.1: Comparison of the two BitTorrent DHT implementations and the Kademlia standard queries.

2.3 DHT-Tracker Extension

The BitTorrent DHT-Tracker is a fully distributed tracker. In contrast to PEX, it also includes a solution to the bootstrapping problem and can therefore replace a classical tracker. There are two DHTs in use, the Azureus implementation (AZDHT) [27] and the official implementation [14], known as Mainline DHT (MDHT). Both versions are based on the Kademlia DHT [8]. The focus of this thesis lies on AZDHT, since it is relevant for the implementation. The DHT-Tracker uses a DHT to store providers.

A DHT works like a normal hash table: it can store key-value pairs and retrieve a value for a key. A distributed hash table is, as the name implies, distributed onto several nodes, where a node is an instance of the DHT implementation. In the AZDHT, each node gets a 160 bit identifier and is responsible for the keys closest to its identifier. Closeness in Kademlia is defined by an XOR metric [16]. Furthermore, keys are replicated to the 20 closest nodes in order to increase robustness against churn.
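The XOR metric itself is simple; the following Java sketch shows how the distance between two 160 bit identifiers can be computed, assuming both identifiers have the same length.

import java.math.BigInteger;

// Sketch: Kademlia's XOR closeness metric. The distance between two node or
// key identifiers is their bitwise XOR, interpreted as an unsigned integer;
// a smaller distance means the identifiers are closer.
public class XorDistance {
    public static BigInteger distance(byte[] idA, byte[] idB) {
        byte[] xor = new byte[idA.length];
        for (int i = 0; i < idA.length; i++) {
            xor[i] = (byte) (idA[i] ^ idB[i]);
        }
        return new BigInteger(1, xor); // treat the result as a non-negative magnitude
    }
}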

Table 2.1 compares the queries used by the two implementations used in BT and the Kademlia standard. Both AZDHT and MDHT stick close to the original query set. The queries may have different names in the different implementations, but the functionality is the same. Only the Azureus implementation adds a KEY_BLOCK query, which can be used to request blocking and unblocking of keys; how exactly this works was not documented at the time of writing. The two BT implementations added an error message as a possible return, which is not a query and is therefore not in the table.

Another difference can be found in the way bootstrapping is handled. The AZDHT uses a hard-coded URL, dht.vuze.com:6881, to contact the bootstrapping node. In order to reduce the load on the bootstrapping node, AZDHT contacts are saved upon shutdown of the client, and providers learned from a peer can also be used to bootstrap the AZDHT. The MDHT goes one step further, since it stores several known good nodes in the .torrent file. These nodes could be an original seeder or a node that is especially kept alive for bootstrapping purposes, like the bootstrapping node used in the AZDHT. At this point it should be mentioned that Vuze supports both AZDHT and MDHT.

2.4 Bloom Filters

Bloom filters were named after their inventor Burton Bloom, who first described how hash coding can be used for filtering [4]. Bloom filters use hash functions to map values into an identifier space. In contrast to traditional hash coding, Bloom filters intentionally allow a certain error rate; thus, the identifier space can be reduced significantly. The space of a Bloom filter is represented as a bit sequence. If an element is added, it is passed through one or more hash functions, each returning a position in the bit sequence, and the bits at the returned positions are set to one. Using the same hash functions, it can then be checked whether an element is in the filter. However, it is by design not possible to retrieve an element out of the filter.
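The following Java sketch illustrates this behavior with two simple hash functions derived from Java's hashCode; real deployments (and B-Tracker's fs parameter, introduced in Chapter 3) would size the bit array and the number of hash functions to the tolerated false positive rate.

import java.util.BitSet;

// Sketch of a Bloom filter: add() sets the bits chosen by the hash functions,
// mightContain() checks them. False positives are possible by design; false
// negatives are not.
public class BloomFilter {
    private final BitSet bits;
    private final int size;

    public BloomFilter(int sizeInBits) {
        this.size = sizeInBits;
        this.bits = new BitSet(sizeInBits);
    }

    private int h1(String element) { return Math.floorMod(element.hashCode(), size); }
    private int h2(String element) { return Math.floorMod(element.hashCode() * 31 + 17, size); }

    public void add(String element) {
        bits.set(h1(element));
        bits.set(h2(element));
    }

    public boolean mightContain(String element) {
        return bits.get(h1(element)) && bits.get(h2(element));
    }
}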

2.5 Related Work Summary

Two approaches exist to overcome the scalability problem of centralized BT trackers. Although the DHT-Tracker can fully replace a centralized tracker, the load created by tracking is not fairly distributed; fairness and equality are basic principles of P2P systems [20]. PEX, on the other hand, might be fair in terms of load balancing, but it does not use resources (upload bandwidth) efficiently. Furthermore, PEX cannot replace a centralized tracker; it only reduces its load. In order to overcome these shortcomings, B-Tracker is introduced, which improves load balancing and efficiency compared to the existing solutions.

Chapter 3

Design

This chapter covers the design of the B-Tracker plugin. It begins with a more detailed analysis of load distribution in hash tables; then the B-Tracker design is explained in detail.

3.1 Hash Table Load Analysis

Section 1.1 stated that load is unevenly distributed among nodes in DHTs; this fact is investigated further in this section. The problem consists of two parts: torrent popularity follows a Zipf-like distribution, and keys in a DHT's address space are not uniformly distributed. In fact, using a hash function to determine a key is like using random keys. This fact in combination with the Zipf law makes load balancing even worse, as is shown here.

Figures 3.1 and 3.2 show the results of a simulation based on random identifiers and a Zipf distribution. The simulation featured 10'000 nodes and 20'000 keys. First, the nodes were generated with a random key, which is the SHA1 hash of a random integer. For the SHA1 hash the methods from the Azureus project were used in order to have a realistic identifier space. Then keys were generated with the same random hash and assigned to the next higher node. Also, a popularity factor is assigned to each key; a Zipf function is used for this factor to recreate the distribution observed earlier. This factor can be thought of as the total number of peers in a swarm, which indicates the load on the tracker. At minimum, peers send one announce message per minute to the tracker, so the popularity factor is proportional to the number of requests; it will therefore be treated as the number of requests per interval. The values presented here are the average results of 1000 simulation runs.
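The following Java sketch outlines such a simulation under stated assumptions: identifiers are SHA1 hashes of random integers, each key is assigned to the next higher node ID (wrapping around), and popularity follows a Zipf function with an illustrative exponent of 1. It is a simplified recreation, not the exact simulation code used for the figures.

import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.Random;
import java.util.TreeMap;

// Sketch: assign Zipf-popular keys to random 160 bit node IDs and sum up
// the per-node request load.
public class DhtLoadSimulation {
    public static void main(String[] args) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        Random rnd = new Random(42);
        TreeMap<BigInteger, Long> load = new TreeMap<>();

        for (int i = 0; i < 10_000; i++) {               // generate node IDs
            load.put(hash(sha1, rnd.nextInt()), 0L);
        }
        for (int rank = 1; rank <= 20_000; rank++) {     // keys with Zipf popularity
            BigInteger key = hash(sha1, rnd.nextInt());
            BigInteger node = load.ceilingKey(key) != null
                    ? load.ceilingKey(key) : load.firstKey(); // next higher node, wrapping
            long requests = (long) (100_000 / Math.pow(rank, 1.0)); // illustrative Zipf load
            load.merge(node, requests, Long::sum);
        }
        long max = load.values().stream().mapToLong(Long::longValue).max().orElse(0);
        System.out.println("Heaviest node handles " + max + " requests per interval");
    }

    private static BigInteger hash(MessageDigest sha1, int value) {
        return new BigInteger(1, sha1.digest(BigInteger.valueOf(value).toByteArray()));
    }
}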

Figure 3.1 shows the number of keys maintained by a node on the X axis; the Y axis shows the actual number of nodes maintaining a certain amount of keys. This is only the result of the random distribution of keys, but it already shows the problem: more than 50% of all nodes maintain only one or zero keys, while some nodes maintain up to 21 keys.



Figure 3.1: Number of nodes managing a given number of keys.


Figure 3.2: Total requests per node per query interval.

Figure 3.2 plots requests per node, where the nodes are sorted according to their rank (most popular nodes first). That means the requests of all the keys managed by one node were summed up to get that node's number of requests. Note that the curve is cut off to the right, since it would continue to 10'000 nodes. This graph reflects both problems (key and request distribution): the Zipf-like popularity is applied to the unevenly distributed keys. 25% of the nodes (2500) handle 97.8% of requests, and 10% of all nodes handle 73.5% of requests.

The simulation of a DHT has shown that the effects of popularity distribution and the use of random keys intensify the load balancing problem.

3.2 B-Tracker Approach

This section covers the B-Tracker approach and its benefits as proposed in [13]. The B-Tracker protocol has two phases. In the first, a DHT is used to learn about initial providers; this process can be viewed as bootstrapping. After initial contacts are learned, the second phase starts, which relies on requesting providers from peers. In other words, peers learn about new providers from already known peers. So far this is a combination of the DHT and PEX approaches; B-Tracker goes one step further by introducing Bloom filters [4] to increase the efficiency of the gossip.

3.2.1 Primary Tracker Look Up

In the first phase, so-called primary trackers are queried. A primary tracker in the B-Tracker protocol is a node in the DHT which stores initial contacts. Primary trackers are just DHT nodes; they are selected by the DHT algorithm, usually by numerical closeness of node ID and storage key. Thus, primary trackers do not necessarily take part in the B-Tracker protocol. A B-Tracker peer simply adds its IP address and port number to the DHT. These can be used by others to connect to an initial set of B-Tracker peers.

DHT values are replicated to a set of nodes in order to protect data against node failure. This factor is called the primary replication rp. The number of providers stored per resource is determined by the primary tracker storage capacity parameter cp.

3.2.2 Secondary Tracker Query

The second phase starts when a peer has received providers from a primary tracker. A peer starts requesting providers from secondary trackers after it has queried a defined number of primary trackers; this number is represented by the np parameter. A peer can add itself as a provider to primary and secondary trackers. Primary trackers have a capacity limit of cp. A peer will try to register itself with a certain number of secondary trackers, determined by the secondary replication rs.

Once a peer has queried np primary trackers, it starts sending requests to secondary trackers. This will typically be the case at the beginning of a peer's lifetime and when neighbors are going offline. In order to increase the efficiency of such requests, a request includes a Bloom filter containing all known providers of the requesting peer. Bloom filters decrease reply message size because only unknown peers are returned. Bloom filters can produce false positives, so some peers might not be discovered; this chance depends on the Bloom filter size. However, a peer does not need to know the whole swarm to successfully download a file.

The importance of Bloom filters for B-Tracker is the reduced filter size, which saves bandwidth when sending requests. Further, the use of filters increases the efficiency of responses, since only unknown peers are returned. By allowing errors, as Bloom filters do, the risk of having unnecessary peer information delivered is introduced into the system. However, the application of Bloom filters is only feasible if a small filter size is required and some errors can be tolerated.

B-Tracker also defines a time to live (TTL), which defines the time an entry on a tracker is valid. This is required since providers might leave the system without posting a leave message, for example due to a hardware crash or an interrupted connection. Therefore, a provider has to re-register itself before the TTL has expired.

3.2.3 Main Tracker

For each download that is added to Vuze, the B-Tracker plugin creates and starts a new Tracker object. It contains the main routine that is responsible for executing requests to other trackers or for telling the DHT manager to look up trackers. The Tracker class is also in charge of handling incoming messages after they have been routed to the appropriate Tracker object.

Figure 3.3 shows the main flow of B-Tracker, which is executed in the Tracker object. In short, tracking consists of a loop in which the B-Tracker status is logged, trackers can be requested from the DHT, keep-alives are sent, the thread sleeps for a defined amount of time, and B-Tracker requests can be sent. Tracking is started with the start of a download in Vuze. At the beginning of the loop it is checked whether there is at least one active tracker. If no active trackers are known, a DHT look-up has to be executed; this is always the case in the first execution of the loop. DHT queries are delegated to a DHTManager object, which handles all the DHT-related logic. Since no active trackers are present, no keep-alive messaging has to be performed, and the thread sleeps for a defined time. This time is variable but always between TSI and TMI. In the meantime, providers could be discovered through the DHT or incoming messages; these are handled outside the main loop since their arrival is asynchronous. After the first timeout, some active trackers are expected to be found. A tracker is considered active if the last contact was no longer ago than the TTL. The number of trackers returned by the DHT depends on the cp parameter. The behavior after initial contacts have been discovered depends on rs: if rs trackers are present, no more requests have to be sent. A peer tries to know at least rs active trackers. This practice is equivalent to registering itself with rs trackers, since connections are mutual.

Figure 3.3: Flowchart of the main tracker algorithm.

Figure 3.4: Flowchart of the message handling algorithm.

If the active tracker count is lower than rs, a request will be sent to one or more random trackers.

The DHTManager passes potential trackers found in the DHT to its Tracker. The Tracker then checks whether it has already heard of that peer. If the potential tracker is not known yet, it is added to a DHT-tracker list and a Handshake message is sent to it.

Figure 3.4 depicts the flow of incoming messages in the Tracker class. For any message received, the sender is added to the active tracker list, or its TTL is renewed if it is already in the list. Depending on which message type is received, the peer behaves differently. For a Handshake message, a handshake is sent back; the message manager prevents handshake loops since it allows only one handshake to a certain peer within two minutes. Naturally, for a request a response is sent back. The response contains a randomly selected subset of the peer's active tracker list. Furthermore, the Bloom filter contained in the Request message is used to select only peers which will be beneficial to the requesting peer. If a Response message is received, handshakes are sent to the potential trackers found in the message.

3.2.4 DHT Manager

The Azureus DHT is used for the first phase of the B-Tracker protocol. That means if a peer starts downloading a file, it first queries the DHT for the key of the download, where the key is a hash of the info hash plus a prefix. The prefix was introduced to separate the classical DHT tracker from B-Tracker; if identical keys had been used, the DHT would contain two types of values for one key. In case no values are found for the given key, the peer writes its IP address and port number (UDP or TCP) into the DHT. The DHT can store multiple values for one key, which means that several peers can write their details into the DHT. Upon a read request these values are returned by the responsible DHT nodes. Values are returned one by one and by all the responsible nodes. In other words, if there is a single key-value pair there will be up to 20 replies, since the replication factor of the Azureus DHT is 20. Furthermore, there is no strict order of the replies; a listener just receives replies until the timeout is reached and then stops. Therefore, the DHT manager has to check whether it has already received a value or not.

Values returned from the DHT are just IPv4 addresses and ports serialized to a binary string of 6 bytes, where 4 bytes are used for the IPv4 address and 2 for the port number. The port number can be either UDP or TCP. The DHT Manager needs to know whether cp active trackers are in the DHT or whether more are needed. Since there is no status information in the DHT, a new contact first has to be sent a handshake message. This part is done by the main tracker routine; the DHT Manager just passes on contacts from DHT values. The main tracker remembers addresses received from the DHT and sends handshakes. Upon a handshake reply, the tracker checks whether the address was learned from the DHT. If it was, the main tracker confirms the activeness of the peer to the DHT Manager, which in turn increases a counter. If this counter is less than cp, the DHT Manager adds the local contact to the DHT by issuing a write request.
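The 6 byte value format can be encoded and decoded as in the following Java sketch; the class name is illustrative, while the byte layout follows the description above.

import java.net.InetAddress;
import java.net.UnknownHostException;

// Sketch: serialize an IPv4 contact to the 6 byte DHT value format
// (4 bytes address, 2 bytes port in network byte order) and back.
public class ContactCodec {
    public static byte[] encode(InetAddress ipv4, int port) {
        byte[] value = new byte[6];
        System.arraycopy(ipv4.getAddress(), 0, value, 0, 4); // assumes an IPv4 address
        value[4] = (byte) (port >> 8);   // high byte of the port
        value[5] = (byte) (port & 0xFF); // low byte of the port
        return value;
    }

    public static int decodePort(byte[] value) {
        return ((value[4] & 0xFF) << 8) | (value[5] & 0xFF);
    }

    public static InetAddress decodeAddress(byte[] value) throws UnknownHostException {
        byte[] addr = { value[0], value[1], value[2], value[3] };
        return InetAddress.getByAddress(addr);
    }
}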

3.2.5 Messaging

The B-Tracker protocol introduces three message types: Handshake, Request and Response. They are sent using messaging facilities provided by Vuze. Responsibilities for messaging are encapsulated in BTrackerMessageManager, which handles the sending of the three message types. Messages are received by a special listener class, which passes them on to the B-Tracker main class, where they are routed to the appropriate Tracker object.

A handshake contains the info hash of a torrent, the peer ID, the sender's IP address and its port numbers (TCP and UDP). With the info hash a receiver can check whether it is sharing the torrent; if not, the message might be spam and is dropped. The peer ID is needed to identify the peer and update its TTL timeout. IP address and port numbers are needed in case the receiving peer does not yet know the sender; by including contact information, the sender can be added as a tracker. The Message Manager keeps track of handshakes sent in order to prevent handshake loops: it only allows a handshake to the same peer 120 seconds after the last one.

There is another message type for the request; again it contains the info hash and the peer ID, but also a Bloom filter which contains all the neighbors of the requesting peer. This filter is included to increase the efficiency of responses to these requests. In contrast to PEX, which just sends a defined number of peers to all its neighbors, B-Tracker sends unknown peers only, and only upon explicit request. The size of the Bloom filter can be configured by the filter size parameter fs. By using already existing messaging facilities, the risk of introducing potential security issues is minimized. Assuming that Vuze messaging is secure, the new messages do not expose any information that would not be used in BitTorrent anyway. In the worst case an attacker could find contact details of other peers. However, this information is no secret, since traditional trackers and/or distributed ones give this information to anyone. Private trackers are an exception to the general behavior, but B-Tracker is not used for private trackers. It can be said that no new threat exposures are introduced by B-Tracker.

3.2.6 Parameters

Here is an exhaustive list of all the parameters used in the B-Tracker implementation, together with their descriptions. An illustrative set of defaults is sketched after the list.

cp: The Primary Tracker Capacity determines the number of contacts stored in the DHT per torrent. If this value is very large (e.g. >20), it causes more DHT traffic; if it is too small, the swarm can get separated, especially under heavy churn.

rp: The Primary Tracker Replication tells how many times a value stored on a primary tracker is replicated to other primary trackers. For the Vuze-based implementation this parameter is fixed to 20 by the underlying DHT.

rs: The Secondary Tracker Replication defines how many peers a tracker has in its tracker list. Since connections are always mutual, this is the same as registering itself at this number of trackers.

fs: The Filter Size controls the length of the Bloom filter array which is sent with a request. The number has to be a multiple of 8, since the filter is represented as a byte array.

mr: The Max Response parameter defines the maximum number of trackers returned in a B-Tracker Response message.

TSI: The Tracker Start Interval is the time B-Tracker sleeps between look-ups at the beginning. If the peer knows enough trackers, the interval is doubled until it reaches the maximum interval. It is given in milliseconds.

TMI: The Tracker Maximum Interval is the maximum interval possible when doubling the interval. It is given in milliseconds.

TTL: The Time To Live defines after what amount of time a tracker entry is considered old and liveness has to be checked. It is given in milliseconds.
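The following Java sketch collects the parameters in one constants class. Except for rp, which is fixed to 20 by the underlying DHT, the concrete values are assumptions chosen for illustration; the values actually used in the evaluation are given in Chapter 5.

// Illustrative defaults for the B-Tracker parameters listed above.
public final class BTrackerParams {
    public static final int CP = 8;          // primary tracker capacity: contacts per torrent in the DHT (assumed)
    public static final int RP = 20;         // primary replication, fixed by the Azureus DHT
    public static final int RS = 10;         // secondary replication: target size of the tracker list (assumed)
    public static final int FS = 256;        // Bloom filter size in bits, a multiple of 8 (assumed)
    public static final int MR = 25;         // max trackers per Response message (assumed)
    public static final long TSI = 10_000L;  // tracker start interval in ms (assumed)
    public static final long TMI = 300_000L; // tracker maximum interval in ms (assumed)
    public static final long TTL = 600_000L; // time to live of a tracker entry in ms (assumed)

    private BTrackerParams() { }
}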

In contrast to the design, there is no cs (secondary capacity) parameter in the implementation. A peer has no capacity limit for known trackers: since there is no active connection, the resources used by a tracker entry are small. A tracker will also not send keep-alive handshakes unless its active peer count drops under rs. Therefore, it is not required to actively remove trackers from the list.

Chapter 4

Implementation

This chapter describes the implementation of B-Tracker as a Vuze plugin, where the focus lies on the interface between B-Tracker and Vuze. Since B-Tracker is a Vuze plugin, it has to implement a plugin interface and also adhere to certain conventions described in the Vuze plugin development guide [25]. This chapter helps developers to extend B-Tracker or to implement a new Vuze plugin using the AZDHT. The BTracker class is the root of the implementation; it creates new Tracker objects as new downloads are added. Furthermore, it initializes the B-Tracker properties, logging and the MessageManager class.

4.1 Plugin Interface

A Vuze plugin has to implement the Plugin interface defined in the Vuze source code. Upon initialization of the Vuze client, all plugins are loaded by calling the initialize method of the Plugin implementation. In turn, Vuze passes a reference to the PluginInterface object to the plugin, through which Vuze features like the DHT can be accessed.

Listing 4.1 shows the BTrackerPlugin class which implements the Plugin interface; in fact, BTrackerPlugin implements an extension of the interface called UnloadablePlugin. The UnloadablePlugin interface adds a method which is called by Vuze when a plugin is unloaded. This can happen manually or when the client is closed. The two methods implemented by the BTrackerPlugin class simply delegate their duties to the BTracker singleton class (lines 5 and 10). This way the connection of B-Tracker and the Vuze core is encapsulated and the static BTracker class does not need to implement any interfaces. The PluginInterface object that is passed to B-Tracker offers access to the features of Vuze; the most important for B-Tracker are the DHT and peer manager interfaces.

1  public class BTrackerPlugin implements UnloadablePlugin {
2
3      @Override
4      public void initialize(PluginInterface arg0) throws PluginException {
5          BTracker.instance().initialize(arg0);
6      }
7
8      @Override
9      public void unload() throws PluginException {
10         BTracker.instance().unload();
11     }
12 }

Listing 4.1: Implementation of the Vuze UnloadablePlugin interface.

4.2 DHT Interface

In the Vuze source code, the AZDHT is called the “distributed database”; it is accessed through the DistributedDatabase interface. Its most important methods are read, write and delete. An important constraint is the DHT's fixed replication factor of 20; in B-Tracker terms this means that rp is fixed to 20. The Azureus DHT supports multiple values per key, so new peers can easily add their contacts. The DHT functionality is implemented in DHTManager.java.

...
public void write(DistributedDatabaseListener listener, DistributedDatabaseKey key,
        DistributedDatabaseValue value) throws DistributedDatabaseException;

public void write(DistributedDatabaseListener listener, DistributedDatabaseKey key,
        DistributedDatabaseValue[] values) throws DistributedDatabaseException;
...

Listing 4.2: Write methods of the DistributedDatabase interface.

Listing 4.2 shows the write methods provided by the DistributedDatabase interface. The write methods require a listener which has to be implemented by the caller. Listeners are used as callbacks for DHT events resulting from the write request. For instance, if a write request was successful, the listener is called with the appropriate event type, and the event can then be handled by the listener. B-Tracker uses write events only for logging purposes. Both methods require a key, which can be generated from any byte sequence. It is possible to store one or several DistributedDatabaseValue values with one request. Such a DistributedDatabaseValue can be created from any byte sequence or from a Map object.
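The following Java sketch shows how such a write with a logging listener could look. The package name, the factory methods for keys and values, and the single-method listener callback are assumed from the Vuze plugin API and may differ in detail from the actual B-Tracker code.

import org.gudy.azureus2.plugins.ddb.DistributedDatabase;
import org.gudy.azureus2.plugins.ddb.DistributedDatabaseEvent;
import org.gudy.azureus2.plugins.ddb.DistributedDatabaseException;
import org.gudy.azureus2.plugins.ddb.DistributedDatabaseKey;
import org.gudy.azureus2.plugins.ddb.DistributedDatabaseListener;
import org.gudy.azureus2.plugins.ddb.DistributedDatabaseValue;

// Sketch: store a 6 byte contact under a torrent-derived key and log the
// resulting events, roughly as B-Tracker does.
public class DhtWriteExample {
    public static void announce(DistributedDatabase ddb, byte[] key, byte[] contact)
            throws DistributedDatabaseException {
        DistributedDatabaseKey ddbKey = ddb.createKey(key);       // assumed factory method
        DistributedDatabaseValue ddbValue = ddb.createValue(contact); // assumed factory method
        DistributedDatabaseListener logger = new DistributedDatabaseListener() {
            public void event(DistributedDatabaseEvent event) {
                // B-Tracker only logs write events; a real handler would
                // inspect the event type for success or timeout.
                System.out.println("DHT write event: " + event.getType());
            }
        };
        ddb.write(logger, ddbKey, ddbValue);
    }
}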

...
public void read(DistributedDatabaseListener listener, DistributedDatabaseKey key,
        long timeout) throws DistributedDatabaseException;

public void read(DistributedDatabaseListener listener, DistributedDatabaseKey key,
        long timeout, int options) throws DistributedDatabaseException;
...

Listing 4.3: Read methods of the DistributedDatabase interface.

Listing 4.3 shows the read methods provided by the DistributedDatabase interface. As for write requests, a DistributedDatabaseListener is required. Successful read events contain the value read and need to be handled in order to make use of the value. Since the DHT replication factor is set to 20, a value is returned up to 20 times, unless a responsible node failed before answering the request. Therefore, the DHTManager class filters already received values before passing the potential trackers to the main tracker algorithm. The timeout parameter defines how long the request is valid; after the timeout has expired, a timeout event is handed to the listener. As options, predefined constants are used; at the time of writing two of them existed, a high-priority and an exhaustive-read flag. They are not used in B-Tracker.
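The duplicate filtering can be as simple as remembering which values have been seen, as in the following Java sketch; class and method names are illustrative, not taken from the plugin source.

import java.util.HashSet;
import java.util.Set;

// Sketch: filter replicated read replies so that each contact value is
// passed on to the main tracker algorithm only once.
public class ReplyDeduplicator {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a given 6 byte contact value arrives.
    public boolean isNew(byte[] contactValue) {
        StringBuilder hex = new StringBuilder();
        for (byte b : contactValue) {
            hex.append(String.format("%02x", b));
        }
        return seen.add(hex.toString());
    }
}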

In order to provide the B-Tracker implementation with the same possibilities that the DHT-Tracker uses, the DistributedDatabase interface was extended with another read method. This read method adds a parameter with which the number of wanted return values can be defined. Setting this value to 30, like the DHT tracker does, reduces DHT traffic when churn is high.

4.2.1 Measurement

In order to compare the different approaches in terms of load balancing and efficiency, their usage of upload bandwidth needs to be measured. Upload bandwidth used for tracking is equivalent to the cumulative size of messages sent by each approach. There are DHT, PEX and B-Tracker messages that have to be recorded in log files. Logging B-Tracker messages is simple, since only the plugin implementation has to be changed. PEX and DHT messages, however, are handled by Vuze, and to log them the Vuze source code had to be changed. For this purpose, a static class called Measurement is introduced, and the code is extended at the points where PEX and DHT messages are being sent. The B-Tracker plugin uses the same class to log messages. The result is a measurement log file which contains all relevant messages from one run and from one peer.
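The exact shape of the Measurement class is not shown here; the following Java sketch is one plausible form, under the assumption that each sent message is appended to a per-peer log file with a timestamp, protocol, message type and size in bytes.

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch: a static measurement logger appending one line per sent message.
public final class Measurement {
    private static PrintWriter log;

    public static synchronized void init(String logFile) throws IOException {
        log = new PrintWriter(new FileWriter(logFile, true), true); // append, auto-flush
    }

    // protocol is e.g. "DHT", "PEX" or "BTRACKER"; size is the message size in bytes.
    public static synchronized void record(String protocol, String messageType, int size) {
        if (log != null) {
            log.println(System.currentTimeMillis() + ";" + protocol + ";" + messageType + ";" + size);
        }
    }

    private Measurement() { }
}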

4.3 Further Development

The B-Tracker implementation developed in the course of this thesis still has prototype status. In order to be used broadly outside the lab, some improvements are required; these are explained in the following paragraphs.

At the moment B-Tracker supports only peers running B-Tracker. This should be enhanced so that peers can also discover new providers not running B-Tracker. This feature is required since, at the beginning of a broad deployment, only a few peers will support B-Tracker and the benefit would be low. For a working enhancement, a way has to be defined in which status information of non-B-Tracker peers can be tracked.

As it is, B-Tracker's only bootstrapping mechanism is the DHT. This is fine for totally decentralized tracking, as is the scope of this thesis. In a more typical BitTorrent scenario there is usually a tracker, which might fail at some point; however, peers are already known by then and a DHT query could be obsolete. B-Tracker should be enhanced with a possibility to learn peers from other sources, such as peers returned by a classical tracker.

The B-Tracker implementation described in this thesis lacks support for IPv6. This drawback does not influence the outcome of the thesis, but it prevents wide adoption of the plugin should it be released for public use.

To be widely adopted, B-Tracker would need implementations for the other main BitTorrent clients. This would mean that the Vuze messaging system used in the scope of this thesis could not be used anymore, since it works only between Vuze peers.

Chapter 5

Evaluation

This chapter concentrates on the evaluation of the B-Tracker approach. The evaluation's goal is to show that the B-Tracker approach first described in [13] improves load balancing compared to the pure DHT tracker and is more efficient than PEX. The evaluation is based on the three different approaches under three different scenarios, where the scenarios are merely different churn rates. This results in 9 experiments, each of which is executed 10 times, so each experiment has 10 runs.

Table 5.1 summarizes the different experiment configurations. There is one column for each approach and one row for each churn rate, resulting in 9 cells for the nine experiments. Experiment identifiers are a combination of approach shortcut and churn rate.

5.1 Evaluation Environment

Table 5.2 shows the hardware specification of the server used for the experiments. The experiments are conducted on a single server; in order to increase the number of peers in the experiments, more servers would need to be added, or alternatively faster hardware could be used. Distributing the experiments onto several servers adds complexity to the system but increases the maximum peer count. The server can connect to the Internet, but no connections are accepted from the outside. This way experiments are executed in a closed environment, which provides more accurate results. The Azureus DHT is especially affected by this constraint; thus, a private Azureus DHT is built up each time a run starts. This requires one peer to take over the role of the bootstrapping node.

Churn    B-Tracker    DHT      PEX

0%       BT00         DHT00    PEX00
15%      BT15         DHT15    PEX15
30%      BT30         DHT30    PEX30

Table 5.1: Listing of the 9 experiments used in the evaluation.


Server Hardware

CPU     2x AMD Opteron(tm) 6180 SE
RAM     64 Gbyte
HDD1    134 Gbyte
HDD2    400 Gbyte
HDD3    1.8 Tbyte (RAID 0)

Table 5.2: List of the relevant hardware specifications of the server in use.

Furthermore, all peers communicate over their own network interface, which is a sub-interface of the loopback interface. Using the same interface for 100 or more peers results in unpredictable port conflicts, since Vuze maintains at least 30 connections to other peers; 100 peers using at least 30 ports each block a considerable share of the available ports, and conflicts become very likely.
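
A sketch of how such loopback sub-interfaces could be created on a Linux server is shown below; the actual script shipped on the DVD (create_interfaces.sh, see Appendix B) may differ.

    #!/bin/sh
    # Hypothetical sketch: one loopback sub-interface per peer, so that each
    # Vuze instance can bind to its own address and port conflicts are avoided.
    i=1
    while [ $i -le 100 ]; do
        ifconfig lo:$i 127.0.0.$((i + 1)) netmask 255.0.0.0 up
        i=$((i + 1))
    done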

5.2 Experiment Design

Table 5.1 summarizes the different experiment configurations on which the evaluation is based. Under these configurations, the distribution of a single ISO file is executed. The file is considered to be new: it is very popular in the beginning, but its popularity decreases over time, so the time between two peers joining the torrent increases. This time is called the peer inter-arrival time. A run is finished as soon as all peers have finished the download. The setup described here is the same for all experiments.

5.2.1 Files, Bandwidth and Seeding

Since the most popular files shared over BitTorrent are large, it is essential to use a large file for the experiment [12]. Popular high-resolution movie torrents reach sizes of 8 GB or more. Two reasons speak against using such a file for the evaluation. First, such a large file takes several hours to download, which is not feasible for experiments consisting of several runs. Second, movie files are often shared outside the frame of legality. Therefore, the Ubuntu 11.10 64-bit PC (AMD64) desktop ISO [15] was chosen. It has a size of roughly 715 MB and is also officially distributed over the BitTorrent network.

The scarcest resource of the typical P2P node is upload bandwidth capacity. Through the wide adoption of DSL Internet connections, download bandwidth capacity is much higher than upload capacity. The evaluation reflects this fact, especially since upload overhead traffic is compared. As a basis for bandwidth capacity limits, the fastest offering of Switzerland's largest ISP (Swisscom) [1] was selected, which is comparable in speed to intermediate offerings of Switzerland's largest cable network provider (upc cablecom) [23]. The Swisscom DSL offering has 20 Mbit/s download and 2 Mbit/s upload bandwidth capacity. For the evaluation, these values are used as the mean of a normal distribution of upload bandwidth capacity, where download bandwidth is always ten times the upload bandwidth capacity. This makes the setup more realistic, since not all real-world peers have the same bandwidth capacity on their Internet connections.
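
The assignment of bandwidth capacities can be sketched as follows; the mean of 2 Mbit/s and the factor of ten are taken from the text, while the standard deviation of the normal distribution is an assumption for illustration only.

    // Minimal sketch of the bandwidth capacity assignment described above.
    import java.util.Random;

    public class BandwidthAssigner {
        private static final double MEAN_UPLOAD_KBIT = 2000.0; // 2 Mbit/s, from the text
        private static final double STD_UPLOAD_KBIT = 500.0;   // assumed, not from the thesis

        // Returns {upload, download} in kbit/s for one peer.
        public static double[] assign(Random rnd) {
            double upload = Math.max(100.0, MEAN_UPLOAD_KBIT + rnd.nextGaussian() * STD_UPLOAD_KBIT);
            double download = 10.0 * upload; // download is always ten times the upload capacity
            return new double[] { upload, download };
        }
    }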

A BitTorrent file distribution needs to start with at least one seed, and this evaluation is no exception. For all experiments two seeds are used, and they differ from the other peers. Seed download and upload speeds are fixed at the mean rates described in the last paragraph, so that experiments are comparable with each other: if a seed were assigned a very low speed by the random distribution, the outcome of a run could differ greatly from one where the seed got a high or medium speed. Keeping the seeds' upload speed limited ensures that peers need to find other sources in order to complete the download in a reasonable amount of time. For the same reason, the seeds are not affected by churn. Naturally, the two seeds are the first ones to be online, since it would not make sense to publish a .torrent before activating at least one seed.

After the seeds are online, the other peers start arriving. Popularity is very high in the beginning of a run; therefore, the peer inter-arrival time is very short at first, so that a large number of peers quickly join the swarm. The peer inter-arrival time follows an exponential function and ranges from 1 to 35 seconds for the 100 peers simulated. In a real-life situation, a peer also leaves the swarm after a certain time or after reaching a certain share ratio. For the experiment, a share ratio of 1 was chosen. This ensures that the experiments can finish in reasonable time and also forces peers to find new providers once all their neighbors have reached the limit. Peers which reach this ratio do not shut down; they just stop uploading. Since this rule only applies after a download has finished, it is still possible for a peer to have a share ratio larger than 1. The seeds are not affected by this rule, since it is not realistic that an original seeder simply stops uploading as long as the file is still wanted.
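
The exponentially growing inter-arrival time can be sketched as follows; the exact curve used in the experiments is not specified in the text, so the interpolation between the 1 s and 35 s endpoints is an assumption.

    // Sketch: the delay before peer i joins grows exponentially from
    // 1 s (first peer) to 35 s (last peer), i.e. 35^(i/(n-1)) seconds.
    public class ArrivalSchedule {
        public static double interArrivalSeconds(int peerIndex, int totalPeers) {
            return Math.pow(35.0, (double) peerIndex / (totalPeers - 1));
        }

        public static void main(String[] args) {
            for (int i = 0; i < 100; i++) {
                System.out.printf("peer %d joins %.1f s after the previous one%n",
                        i, interArrivalSeconds(i, 100));
            }
        }
    }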

5.2.2 Churn

A very important factor in P2P systems is churn. In the real world, peers leave the system for various reasons, among them computer shutdown (e.g., at night), connection loss, or crashes. A P2P system needs the ability to cope with this unpredictable behavior. Churn is important for the evaluation because with more churn, more tracker look-ups are performed and some of the returned addresses become useless.

Each approach has a churn rate parameter which defines the amount of churn in the system. The churn rate is expressed as the percentage of peers that quit at each churn interval. The interval is set to 240 seconds for all experiments, and the churn rate varies between 0, 15, and 30 percent. In other words, with a churn rate of 30, every four minutes 30% of all peers except the original seeds go offline. They come back online with a different port number, which makes them new peers, since the old address has become useless. The new peer keeps the old download state, however, because otherwise a run would most likely never finish. Varying the churn rate allows a comparison of the three approaches under different conditions.
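
A minimal sketch of this churn mechanism is given below; the Peer interface and its methods are hypothetical stand-ins for the experiment scripts controlling the Vuze instances.

    // Sketch of the churn procedure described above: every 240 s a random
    // share of the non-seed peers goes offline and rejoins with a new port,
    // keeping its download state.
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class ChurnDriver {
        private final double churnRate; // e.g. 0.30 for the 30% experiments
        private final Random rnd = new Random();

        public ChurnDriver(double churnRate) { this.churnRate = churnRate; }

        // Invoked once per 240-second churn interval.
        public void churnOnce(List<Peer> peers) {
            List<Peer> candidates = new ArrayList<Peer>();
            for (Peer p : peers) {
                if (!p.isSeed()) candidates.add(p); // seeds are never churned
            }
            Collections.shuffle(candidates, rnd);
            int victims = (int) Math.round(candidates.size() * churnRate);
            for (Peer p : candidates.subList(0, victims)) {
                p.shutdown();
                p.restartWithNewPort(); // the old address becomes useless
            }
        }

        interface Peer {
            boolean isSeed();
            void shutdown();
            void restartWithNewPort();
        }
    }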

5.2.3 DHT

The Azureus DHT is used by around 1 million users, depending on the time of day. Using such a large DHT is not feasible for the evaluation, since it is impossible to measure all the nodes taking part in it. Furthermore, it would be almost impossible to separate the traffic belonging to different torrents. Measuring only a subset of the nodes would make the experiments incomparable, because values might be stored outside the measured nodes. Therefore, a local DHT inside the closed environment is used, run by the peers taking part in the run. Thus, all DHT messages are measured and no unrelated downloads interfere with the run.

Starting the DHT from scratch requires a bootstrapping node. By default, Vuze queries dht.vuze.com:6881 for a bootstrapping node. For the evaluation this means that one node always has to use port 6881 and that the hosts file of the server has to be edited to point dht.vuze.com at that node's IP address. Initialization of the DHT takes around 15 seconds per node; for that reason, all peers are started first and .torrents are added to the running instances only after a wait time of 15 seconds.
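
The hosts file change described above could look as follows; the IP address is a placeholder for the bootstrapping node's actual address.

    # /etc/hosts entry redirecting Vuze's DHT bootstrap lookup to the
    # local bootstrapping node (IP address is a placeholder)
    127.0.0.2    dht.vuze.com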

5.2.4 Performance Issues

During the first runs with large files (e.g., 700 MB), several performance issues were discovered. Hard disk utilization was constantly at 100%, and memory usage was so high that the server started using the swap file extensively. CPU usage was not a problem, though.

The high hard disk usage has several causes. One is the swapping caused by the excessive memory usage of Vuze; others are the downloaded files that have to be written to disk and the logging, which also uses the disk. In order to reduce disk usage, the Vuze disk write cache was increased from the standard 4 MB to 100 MB. Since this did not solve the issue completely, a RAID 0 disk array was added to the server to provide storage space for the downloaded files, while the logs are still written to the system disk. Disk access was the major performance issue discovered during the experiments.

Vuze by default uses up to 1 Gbyte of memory, so much that the server's 64 GB of RAM are not sufficient to run 100 Vuze instances. The Java Virtual Machine parameter -Xmx is used to set the maximum amount of memory a Vuze instance is allowed to use. With this restriction, swapping is prevented and disk usage of the system disk is greatly reduced. For 100 peers, the -Xmx value is set to 512 MB, which is sufficient for Vuze to run properly, at least for Vuze instances without a GUI as they are used in this evaluation.
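
A hedged example of starting a headless Vuze instance with this restriction is shown below; only the -Xmx flag is taken from the text, while the jar name and console UI option reflect common Azureus usage and may differ between versions.

    # Start one Vuze instance limited to a 512 MB heap (flags partly assumed)
    java -Xmx512m -jar Azureus2.jar --ui=console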

5.2.5 Parameters

Table 5.3 shows the B-Tracker parameters used in the evaluation. The Primary Tracker Capacity is set to 10; this value ensures that the system can handle the churn rates used while not overusing the DHT. Primary Tracker Replication is fixed at 20 by the Azureus DHT and is left unchanged to remain comparable to a realistic system. Secondary Tracker Replication is set to 35, since this is a common value in P2P systems and was also used in the early B-Tracker simulations [13]. The Bloom Filter Size of 512 showed good performance in simulations; since a peer only needs 35 neighbors, a few false positives will not harm the system. A Maximum Response Size of 20 is used to be consistent with the first simulations. The Timeout Start Interval and Timeout Maximum Interval are set to 15 seconds and 2 minutes, respectively: a small start value speeds up initial peer discovery, while increasing the timeout to 2 minutes saves resources when churn is low or nonexistent. The TTL is set to 5 minutes.

Experiment Parameters

CP  (Primary Tracker Capacity)         10
RP  (Primary Tracker Replication)      20
RS  (Secondary Tracker Replication)    35
FS  (Bloom Filter Size)                512
MR  (Maximum Response Size)            20
TSI (Timeout Start Interval, ms)       15000
TMI (Timeout Maximum Interval, ms)     120000
TTL (Time To Live, ms)                 300000

Table 5.3: B-Tracker parameters used in the experiment.
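
These parameters are read from the plugin's btracker.properties file (see Appendix A). A hypothetical file mirroring Table 5.3 could look as follows; the key names are assumptions, not necessarily the plugin's actual property names.

    # Hypothetical btracker.properties mirroring Table 5.3
    cp=10
    rp=20
    rs=35
    fs=512
    mr=20
    tsi=15000
    tmi=120000
    ttl=300000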

5.3 Execution and Results

Now that the evaluation setup has been explained, the results are presented. First, the execution is explained in more detail; then the results and an analysis are presented.

5.3.1 Execution

An experiment is executed for each tracker approach with churn rates of 0, 15, and 30 percent. Furthermore, each experiment consists of 10 runs in order to cancel out random effects influencing a single run. Such effects are produced by churn, since it selects peers by chance: there is always a chance that one peer gets restarted repeatedly, delaying the experiment.

Messages are measured just before they are sent over the network. The measured size includes only the raw data being sent; no TCP/IP headers are included. Messages are logged together with a time stamp, message type, and size. This allows for a detailed analysis of the communication between peers, which is also valuable for debugging. For example, it can be determined how many FIND_VALUE messages were sent by the DHT.
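
As an illustration of such an analysis, the following sketch counts REQUEST-FIND-VALUE messages in a measurement log; the semicolon-separated "timestamp;type;size" line format matches the Measurement sketch in Chapter 4 and is an assumption.

    // Sketch: count REQUEST-FIND-VALUE messages in one measurement log file.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LogStats {
        public static void main(String[] args) throws IOException {
            long count = 0;
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(";");
                if (fields.length == 3 && fields[1].equals("REQUEST-FIND-VALUE")) {
                    count++;
                }
            }
            in.close();
            System.out.println("REQUEST-FIND-VALUE messages: " + count);
        }
    }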

[Figure: bar chart; y-axis: Mean Load (Mbyte), x-axis: Churn Rate (%); series: B-Tracker (Total), DHT (Total), PEX (Total).]

Figure 5.1: Mean load per peer under different churn conditions.

5.3.2 Efficiency

First, the efficiency of the three approaches is compared. Efficiency means wasting as few resources as possible in order to accomplish a certain task. In the context of this evaluation, the resource is load, defined as the aggregated size of sent messages, also called traffic; the task is to download a file. To obtain the numbers for the comparison, each experiment's log files were parsed, the message sizes were added up and divided by the number of peers, and the result was averaged over the 10 runs. This yields the mean upload traffic used for tracking per peer across 10 runs.

Figure 5.1 presents the mean outgoing traffic for all experiments. Each bar represents the mean total outgoing traffic per peer; the error bars show the standard deviation (STD) between the 10 runs of each experiment. It can be seen that without churn, B-Tracker lies between the DHT-Tracker and PEX. As soon as churn is active, load starts to increase, which is expected since joining and leaving the DHT causes traffic and all three approaches use the DHT. What is rather surprising is that the PEX load increases more than threefold from 0% to 15% churn. One explanation is that outdated provider information is still passed from peer to peer, generating additional traffic. The B-Tracker and DHT loads also increase with churn; noteworthy is that the B-Tracker load increases less than the DHT load, making B-Tracker more efficient than the DHT approach at 30% churn. Without churn, however, the DHT approach is significantly more efficient. Therefore, further analysis of the load composition is required. The error bars grow with churn, which justifies the execution of multiple runs.

Figure 5.2 again shows the mean load per peer, but also how much of it is accounted for by DHT, B-Tracker, or PEX messages. Looking at the zero-churn bars reveals that B-Tracker's DHT portion of the load is the same as the DHT-Tracker's traffic.

[Figure: bar chart; y-axis: Mean Load (Mbyte), x-axis: Churn Rate (%); series: B-Tracker (BT), B-Tracker (DHT), B-Tracker (Total), DHT (Total), PEX (PEX), PEX (DHT), PEX (Total).]

Figure 5.2: Decomposed mean load per peer under different churn conditions.

Having B-Tracker load on top of the DHT load makes it impossible for B-Tracker to be more efficient in this scenario. A zero-churn environment is not realistic, but it reveals interesting properties of the system. With a churn rate of 15%, B-Tracker and DHT are very close to each other, and with 30% B-Tracker is more efficient. An explanation for the equal DHT traffic of B-Tracker and DHT is that without churn the DHT needs to be queried rarely or only once, and B-Tracker does exactly the same in its first phase. In contrast to the simulations of Bocek and Hecht [13], in which the replication factor of the DHT was reduced for B-Tracker, this is not feasible in a realistic evaluation using the Azureus DHT: the DHT properties cannot easily be changed from the Vuze-specific plugin implementation of B-Tracker, and the Azureus DHT has to support the DHT-Tracker at the same time. On the other hand, pure B-Tracker becomes more efficient with rising churn, where the main driver of load is the DHT.

Figure 5.2 also shows that using PEX together with the DHT produces slightly more DHT load than the pure DHT approach. PEX itself generates load by flooding messages to its neighbors as soon as churn is introduced. With PEX enabled, queries to other trackers (DHT or centralized tracker) are reduced to the minimum; therefore, the download time increases with the use of PEX, which also explains the higher DHT load. The error bars again show that the error grows with churn. The error bars for PEX (DHT) traffic at 30% churn are smaller than those for the DHT-Tracker, an effect of PEX's reduced DHT queries, since the DHT's housekeeping load is more constant than its query load.

[Figure: bar chart; y-axis: Standard Deviation of Peer Load (kbyte), x-axis: Churn Rate (%); series: B-Tracker (Total), DHT (Total), PEX (Total).]

Figure 5.3: Absolute load balancing.

5.3.3 Load Balancing

One goal in the development of B-Tracker was to provide better load balancing than the DHT-Tracker. Load was defined above; load balancing is measured as the standard deviation of the peers' loads. Figure 5.3 shows the absolute standard deviation of the peers' loads: the larger the standard deviation, the less balanced the load. The error bars give the standard deviation between runs. As expected, B-Tracker has the lowest standard deviation in all scenarios and, therefore, the best load balancing. Since the values are absolute, it is normal that the standard deviation rises with churn: when the overall load grows, the absolute difference between a highly loaded and a lightly loaded peer grows as well, even if the relative balance stays the same. Since the PEX load is much higher than the B-Tracker or DHT load, the absolute STD is not suited for comparing PEX to the other approaches. The high PEX values are nevertheless suspicious, and it has to be investigated whether they are only due to the higher load.

Figure 5.4 therefore plots the coefficient of variation (CV), which is better suited for comparison than the standard deviation because it expresses the STD as a fraction of the mean. This way the very high PEX values can be compared to the others, since the CV is a dimensionless indicator of how well values are distributed. The biggest difference to the standard deviation can be seen in PEX: it suddenly looks much more balanced than the others, except in the zero-churn experiment. The mean load shown in Figure 5.2 explains this effect. With zero churn, PEX messages account for only the smaller portion of the load; the other portion is produced by the DHT, which is not very well balanced. As soon as churn is introduced, PEX messages start to produce more load, and since PEX messages are sent regularly after a certain time interval, their load is distributed evenly. Furthermore, the PEX CV rises slightly with higher churn, while the B-Tracker and DHT CVs fall. The DHT CV is notably lower with high churn; this is due to the fact that node IDs are created anew every time Vuze restarts, which flattens out the random distribution effects of the DHT.

[Figure: bar chart; y-axis: Coefficient of Variation, x-axis: Churn Rate (%); series: B-Tracker (Total), DHT (Total), PEX (Total).]

Figure 5.4: Relative load balancing expressed as Coefficient of Variation.

Although PEX has the lowest CV at churn rates greater than zero, it would be wrong to promote it as the best solution. PEX wastes considerable amounts of bandwidth, which is not acceptable when the wasted resource is as scarce as upload bandwidth. Therefore, one has to consider not only the CV but also the STD and the efficiency. Considering both relative and absolute load balancing, B-Tracker clearly shows the best results.
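
For reference, the coefficient of variation used above is computed from the per-peer loads as follows; this is the standard definition, not code taken from the evaluation scripts.

    // CV = standard deviation / mean; a dimensionless measure of load balance.
    public class LoadBalance {
        public static double coefficientOfVariation(double[] peerLoads) {
            double mean = 0;
            for (double l : peerLoads) mean += l;
            mean /= peerLoads.length;

            double var = 0;
            for (double l : peerLoads) var += (l - mean) * (l - mean);
            var /= peerLoads.length;

            return Math.sqrt(var) / mean;
        }
    }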

Figures 5.5, 5.6, and 5.7 visualize load balancing in another way. After each run, the peers are ranked according to their total traffic. After the experiment, the mean of all peers with rank 1 is calculated, then of those with rank 2, and so on. The result, plotted in the figures, shows the distribution of load in one experiment.
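
The ranking procedure can be sketched as follows; loads[run][peer] stands for the total tracking traffic of one peer in one run, and the class and method names are hypothetical.

    // Sort the peer loads within each run, then average across runs per rank.
    import java.util.Arrays;

    public class RankedLoad {
        public static double[] rankedMeans(double[][] loads) {
            int peers = loads[0].length;
            double[] means = new double[peers];
            for (double[] run : loads) {
                double[] sorted = run.clone();
                Arrays.sort(sorted); // ascending: higher ranks carry more load
                for (int rank = 0; rank < peers; rank++) {
                    means[rank] += sorted[rank] / loads.length;
                }
            }
            return means;
        }
    }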

Figure 5.5 shows the results without churn. The PEX and DHT curves resemble each other in shape, but the PEX curve is much higher, because the overall load is higher while the pure PEX load is well distributed. The B-Tracker curve is much smoother than the others. The figure also reflects the fact that the B-Tracker load is higher than the DHT-Tracker load. The DHT-Tracker curve has a small jump between peers 20 and 25, and its slope is steeper from rank 0 to 25 than for the rest; the PEX curve shows the same change in slope. The jump and the change in slope come from the replication factor of 20, which is the number of peers queried during a DHT look-up. The jump occurs at ranks higher than 20 because a look-up has to query the 20 closest nodes; in order to be sure of having queried the 20 closest nodes, some additional nodes are queried, and once several nodes are found which are not closer, the algorithm stops searching.

Figure 5.6 shows the results with 15% churn. Compared to Figure 5.5 the curves are smoother, the PEX curve is far higher, and the B-Tracker and DHT-Tracker curves are much closer and cross each other. The smoother-looking curves are an effect of the compressed y-axis scale, since PEX now has roughly three times the load it has with zero churn. The DHT-Tracker curve is higher than the B-Tracker curve until the jump around rank 25, from where on it is slightly lower. In total the DHT-Tracker is still a little more efficient than B-Tracker, as can be seen in Figure 5.1. The change in slope is more gradual for PEX than with zero churn, and the jump cannot be observed at all.

[Figure: line plot; y-axis: Load (Mbyte), x-axis: Peer Rank (0-100); series: B-Tracker, DHT, PEX.]

Figure 5.5: Ranked mean load per peer for 0% churn.

Figure 5.7 shows the results with 30% churn. The PEX curve is even higher than with 15% churn, which compresses the y-axis scale further. The jump in the DHT-Tracker curve is still visible, but the DHT-Tracker curve now crosses the B-Tracker curve between ranks 50 and 60. In total, B-Tracker is more efficient.

The figures reflect all observations made earlier. Total load is higher for B-Tracker with zero churn but lower with 30% churn, where the B-Tracker curve lies below the DHT curve for most ranks. Load balancing is reflected in the steepness of the curves; it is clearly visible that the B-Tracker curves are flatter and smoother than the DHT curves.

5.3.4 Messages

Figure 5.8 shows the mean load per peer produced by each DHT message type at a churn rate of 15%; the error bars represent the STD between runs. The figure reveals that REPLY-FIND-NODE messages account for most of the load. REPLY-FIND-NODE messages are used in DHT routing to find the closest nodes for a read or write request; they are also used when a node needs more contacts because of losses due to churn. Furthermore, it can be seen that B-Tracker uses the DHT more economically than the DHT-Tracker, since its REQUEST-FIND-VALUE and REQUEST-STORE portions are smaller. This is due to the small size of the values stored in the DHT and the fewer read and write requests. The DHT has a higher STD for most message types, which reflects the overall higher STD of the DHT shown in Figure 5.1.

[Figure: line plot; y-axis: Load (Mbyte), x-axis: Peer Rank (0-100); series: B-Tracker, DHT, PEX.]

Figure 5.6: Ranked mean load per peer for 15% churn.

[Figure: line plot; y-axis: Load (Mbyte), x-axis: Peer Rank (0-100); series: B-Tracker, DHT, PEX.]

Figure 5.7: Ranked mean load per peer for 30% churn.

[Figure: bar chart; y-axis: Aggregated Message Size (kbyte); x-axis: DHT message types (DATA, REPLY-FIND-NODE, REPLY-FIND-VALUE, REPLY-PING, REPLY-STORE, REQUEST-FIND-NODE, REQUEST-FIND-VALUE, REQUEST-PING, REQUEST-STORE); series: B-Tracker, DHT, PEX.]

Figure 5.8: Mean DHT load of a peer by message type.

[Figure: bar chart; y-axis: Number of Messages; x-axis: DHT message types as in Figure 5.8; series: B-Tracker, DHT, PEX.]

Figure 5.9: Mean number of DHT messages sent per experiment.

[Figure: bar chart; y-axis: Aggregated Message Size (kbyte); x-axis: B-Tracker message types (HANDSHAKE, REQUEST, RESPONSE); series: Churn 0%, Churn 15%, Churn 30%.]

Figure 5.10: Mean B-Tracker load of a peer by message type.

Figure 5.9 shows the mean number of messages sent by a peer during an experiment; the error bars represent the STD between runs. Comparing this graph to Figure 5.8, it is apparent that slightly more REQUEST-FIND-NODE than REPLY-FIND-NODE messages are sent, so REPLY-FIND-NODE messages must be much bigger than the corresponding request messages. These numbers make sense, since a reply is only sent after a request was received and some requests may go unanswered. A comparison of the REPLY-FIND-VALUE message sizes and counts confirms that B-Tracker issues fewer DHT requests and stores smaller values in the DHT: B-Tracker sends only slightly fewer REPLY-FIND-VALUE messages, but they cause less than half the load of the DHT-Tracker's messages. The DHT STD is larger than the others for most message types; it is therefore the number of messages which varies between runs, not the message size.

Figure 5.10 shows the mean load produced by each B-Tracker message type for one peer; the error bars are the STD between runs. Most of the load is produced by HANDSHAKE messages. To further optimize B-Tracker, reducing the number of HANDSHAKE messages sent could lower the load greatly, although it cannot be told how much the system benefits from the additional peer state (online/offline) information they provide. If a peer were to further distribute providers that went offline due to churn, this could lead to the same issues discovered with PEX. Another peculiarity visible in the graph is the reduction in messages when moving from zero to 15% churn. The likely explanation is that fewer handshakes are sent because peers shut down; at 30% churn this effect is compensated by the additional overhead. The error of the RESPONSE and REQUEST messages is significantly larger at a churn rate of 30%, which must be an effect of churn's random peer selection.

[Figure: bar chart; y-axis: Run Time (minutes), x-axis: Churn Rate (%); series: B-Tracker, DHT, PEX.]

Figure 5.11: Mean time to finish one run.

5.4 Run Times

The measurements made during the experiments allow a high-level comparison of run times. Comparing the time needed to complete a run, i.e., for all peers to download the file, is only possible because a real BT client was used. If the three approaches were used in the real world, these differences would be directly perceptible to users.

Figure 5.11 shows the mean time taken to complete a run of an experiment; the error bars show the standard deviation between runs. The run times show that PEX takes more time than the others, which is surprising since it uses the DHT-Tracker as well as PEX messages and would therefore be expected to show run times similar to the DHT-Tracker. One reason for this effect is the way PEX interacts with other trackers in Vuze: if PEX is used, the regular tracker's announce interval is set to the maximum allowed value, which means fewer tracker queries. The results show that PEX cannot compensate for the longer announce interval. The DHT-Tracker behaves as expected: the run time is higher for higher churn rates, and the STD rises significantly as well. B-Tracker shows an interesting behavior, since its run times are lower for 15% churn than for zero churn, a hint that improvements can still be made. Further investigation is necessary to explain the STD increase of the DHT-Tracker and the low STD of B-Tracker at 15% churn.

Chapter 6

Summary and Conclusions

6.1 Summary

The main problems addressed by the B-Tracker approach are efficiency and load balancing in distributed BitTorrent trackers. Historical data and simulations showed that the uneven load distribution results from the uneven popularity distribution of torrents and from the mechanics of DHTs in general. The second distributed tracking approach, PEX, is known to be inefficient because it relies on flooding of messages. Abstract simulations of the B-Tracker approach and the other distributed trackers have shown a significant improvement of B-Tracker over PEX and DHT in efficiency as well as load balancing.

B-Tracker was implemented as a plugin for the popular BT program Vuze. The original B-Tracker concept had to be developed into a design fitting the Vuze plugin interface; implementing this design resulted in a working Vuze plugin. In order to conduct the evaluation, changes to the Vuze source code had to be made. Developing B-Tracker as a plugin had several advantages. One was the already existing DHT, which could be used for B-Tracker's primary tracker look-up; using the existing DHT also ensured that B-Tracker can be used in the real world outside the lab. Furthermore, running the plugin itself does not require changes to the Vuze source code. Three new message types were introduced in order to run the B-Tracker protocol. The plugin approach also had some drawbacks. One of them is the DHT plugin interface, which is simple to use but does not offer all the features of the DHT. Several things had to be changed when moving from the original design to the plugin implementation.

The evaluation required a large setup of several experiments testing all three tracker approaches under three different churn rates. The parameters of the experiments were fixed at certain values, and file size, churn rate, and seeding policy were defined. An experiment involved 100 peers downloading a file from two seeders with one tracker type and a fixed churn rate, and was considered done as soon as all participating peers had finished downloading. In order to measure the messages sent by the DHT and PEX, the Vuze source code had to be complemented with a few logging statements; logged were the time, type, and size of each message. The results were thoroughly analyzed and discussed.

6.2 Conclusion

G1, the first goal, was designing and implementing B-Tracker. It was reached by developing a Vuze plugin which can be used on any Vuze client with version 4.7 or higher. Furthermore, the evaluation showed clearly that the B-Tracker plugin is a feasible alternative to centralized trackers, PEX, and the DHT-Tracker.

G2 required the setup of a realistic evaluation in order to compare the B-Tracker plugin to the other approaches. Chapter 5 describes in detail the framework developed to compare B-Tracker, PEX, and DHT. By introducing and varying churn between experiments, and by selecting parameters such as bandwidth capacity and file size according to real-world values, the evaluation was made more realistic than the existing B-Tracker simulations.

G3 demanded that B-Tracker improve efficiency compared to PEX and DHT. The evaluation results show that B-Tracker is equally or more efficient than the DHT-Tracker as soon as churn is involved. In contrast to the expectations set by the B-Tracker concept simulations [13], the B-Tracker plugin could not outperform its competitors in efficiency in every experiment: the DHT-Tracker performed better under zero-churn conditions, both are similar at 15% churn, and B-Tracker wins at 30%. The difference to the expected outcome can be explained by the DHT implementation. In order to be comparable, the concept evaluation [13] used a DHT replication factor of 20 for DHT tracking but a factor of 2 for B-Tracker; this is not feasible in a realistic evaluation using the Azureus DHT, since the DHT properties cannot easily be changed from the Vuze-specific plugin implementation, and the Azureus DHT has to support the DHT-Tracker at the same time. Since B-Tracker already uses fewer read and write requests and smaller DHT values than the DHT-Tracker, this is the only remaining explanation for the relatively high load under zero churn. Thus, the B-Tracker values stored in the DHT were reduced in size and DHT look-ups were reduced to a minimum. However, a churn-free environment is not a realistic assumption. The main part of B-Tracker's load was generated by the DHT, and it is therefore difficult to be more efficient than the DHT-Tracker itself without further changing the DHT's behavior.

DHT tracking supported by PEX creates, under churn, more than three times the load of the pure DHT-Tracker. In terms of load balancing, B-Tracker showed the expected improvements over the DHT-Tracker. Comparing PEX to the other approaches was not simple: since PEX uses a lot of bandwidth, its standard deviation is high, but its coefficient of variation is lower than the others'. Furthermore, PEX wastes resources without an apparent benefit. Other work [28] has shown that PEX can improve download times, so it might be the combination of DHT and PEX which is not ideal, or the PEX implementation in Vuze might have unknown issues.

G4 was to improve load balancing in distributed trackers. Compared to the DHT-Tracker, B-Tracker improved the load balance among peers in absolute and relative terms. This shows that B-Tracker partly solves the problem of uneven load distribution in DHTs; it cannot solve it completely as long as it relies on a DHT for primary tracking. Still, B-Tracker improves the load balance although most of the load is generated by the DHT. Since B-Tracker executes fewer DHT read and write requests, this leads to the conclusion that the regular housekeeping activities of the DHT are better distributed than the reads and writes. Comparing PEX and B-Tracker in load balancing is like comparing apples and oranges, because PEX uses so much bandwidth that the effects of the DHT are canceled out entirely. However, the evaluation proves that B-Tracker is superior in load balancing to the pure DHT approach and therefore reached G4.

Goals G1 and G2 were successfully reached, as the evaluation clearly showed. B-Tracker shows better efficiency than the DHT-Tracker at a churn rate of 30%, which suggests that it will also perform better with even more churn. Compared to PEX, B-Tracker makes a very good impression. G3 was achieved almost completely, taking into account that a churn rate of 0% is not realistic and that under the other evaluated conditions B-Tracker was similar to or better than DHT and PEX. B-Tracker's main advantage is its improved load balancing compared to the other evaluated approaches; PEX cannot be considered better in load balancing than B-Tracker or DHT, since it uses so much bandwidth that load balancing becomes irrelevant. G4 was reached. It can be said that the implementation of the B-Tracker plugin is a successful second step towards better load balancing and efficiency in distributed BitTorrent trackers. In the course of this thesis it became clear that using a real-world application and a realistic evaluation scenario yields different results than an abstract simulation. Some compromises had to be made in order to make B-Tracker a piece of software useful in the real world, but it reached its goals almost completely.

Bibliography

[1] Swisscom AG. Internet zu Hause: Für jeden das richtige DSL. http://www.swisscom.ch/res/internet/dsl/index.htm?languageId=de, last visited: April 2012.

[2] The Wikipedia Authors. Bencode. http://en.wikipedia.org/wiki/Bencode, last visited: February 2012.

[3] The Pirate Bay. About the pirate bay. http://thepiratebay.se/about, last visited: April 2012.

[4] B.H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13:422–426, July 1970.

[5] J. Byers, J. Considine, and M. Mitzenmacher. Simple load balancing for distributed hash tables. In M. Kaashoek and Ion Stoica, editors, Peer-to-Peer Systems II, volume 2735 of Lecture Notes in Computer Science, pages 80–87. Springer Berlin / Heidelberg, 2003. 10.1007/978-3-540-45172-3_7.

[6] B. Cohen. Incentives build robustness in BitTorrent, 2003.

[7] B. Cohen. The BitTorrent protocol specification. http://www.bittorrent.org/beps/bep_0003.html, last visited: November 2011.

[8] S.A. Crosby and D.S. Wallach. An analysis of BitTorrent's two Kademlia-based DHTs. Technical report, Department of Computer Science, Rice University, Houston, 2007.

[9] G. Dán and N. Carlsson. Power-law revisited: large scale measurement study of P2P content popularity. In Proceedings of the 9th International Conference on Peer-to-Peer Systems, IPTPS'10, pages 12–12, Berkeley, CA, USA, 2010. USENIX Association.

[10] P. Gil. The best torrent downloading software, 2012. http://netforbeginners.about.com/od/downloadingfiles/tp/best-torrent-downloading-software-2012.htm, last visited: April 2012.

[11] B. Godfrey, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica. Load balancing in dynamic structured P2P systems. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4, pages 2253–2262, March 2004.

[12] F.V. Hecht, T. Bocek, and D. Hausheer. The Pirate Bay 2008-12 dataset. http://www.csg.uzh.ch/publications/data/piratebay/, 2008.


[13] F.V. Hecht, T. Bocek, and B. Stiller. B-Tracker: Improving load balancing and efficiency in distributed P2P trackers. In Peer-to-Peer Computing (P2P), 2011 IEEE International Conference on, pages 310–313, August/September 2011.

[14] A. Loewenstern. DHT protocol. http://bittorrent.org/beps/bep_0005.html, last visited: November 2011.

[15] Canonical Ltd. ubuntu-11.10-desktop-amd64.iso. http://mirror.switch.ch/ftp/mirror/ubuntu-cdimage//oneiric/ubuntu-11.10-desktop-amd64.iso, last visited: April 2012.

[16] P. Maymounkov and D. Mazières. Kademlia: A peer-to-peer information system based on the XOR metric. In Peter Druschel, Frans Kaashoek, and Antony Rowstron, editors, Peer-to-Peer Systems, volume 2429 of Lecture Notes in Computer Science, pages 53–65. Springer Berlin / Heidelberg, 2002. 10.1007/3-540-45748-8_5.

[17] S. Rieche, L. Petrak, and K. Wehrle. A thermal-dissipation-based approach for balancing data load in distributed hash tables. In Local Computer Networks, 2004, 29th Annual IEEE International Conference on, pages 15–23, November 2004.

[18] H. Schulze and K. Mochalski. Internet study 2008/2009. http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf, 2009.

[19] R. Steinmetz and K. Wehrle. P2P Systems and Applications, pages 119–135. Springer-Verlag Berlin Heidelberg, 2005.

[20] R. Steinmetz and K. Wehrle. P2P Systems and Applications, pages 9–16. Springer-Verlag Berlin Heidelberg, 2005.

[21] Theory.org. BitTorrent peer exchange conventions. http://wiki.theory.org/BitTorrentPeerExchangeConventions, last visited: November 2011.

[22] Theory.org. BitTorrent protocol specification v1.0. http://wiki.theory.org/BitTorrentSpecification, last visited: November 2011.

[23] upc cablecom. Fiber power internet 25. http://www.upc-cablecom.ch/b2c/internet/fiberpower25.htm, last visited: April 2012.

[24] Vuze, Inc. Vuze. http://www.vuze.com/, 2003-2011.

[25] Vuze, Inc. Plugin development guide. http://wiki.vuze.com/w/Plugin_Development_Guide, last visited: February 2012.

[26] Vuze, Inc. Azureus messaging protocol. http://wiki.vuze.com/w/Azureus_messaging_protocol, last visited: November 2011.

[27] Vuze, Inc. Distributed hash table. http://wiki.vuze.com/w/Distributed_hash_table, last visited: November 2011.

[28] Di Wu, P. Dhungel, Xiaojun Hei, Chao Zhang, and K.W. Ross. Understanding peer exchange in BitTorrent systems. In Peer-to-Peer Computing (P2P), 2010 IEEE Tenth International Conference on, pages 1–8, August 2010.

Abbreviations

API     Application Programming Interface
AZ      AZureus
AZDHT   AZureus DHT
BT      BitTorrent
CSG     Communication Systems Group
CPU     Central Processing Unit
CV      Coefficient of Variation
DHT     Distributed Hash Table
DSL     Digital Subscriber Line
GUI     Graphical User Interface
JAR     Java ARchive
P2P     Peer to Peer
PEX     Peer EXchange
PDF     Portable Document Format
RAM     Random Access Memory
SHA     Secure Hash Algorithm
STD     STandard Deviation
TMI     Tracker Maximum Interval
TSI     Tracker Start Interval
TTL     Time To Live
URL     Uniform Resource Locator
UCS     Universal Character Set
UT      µTorrent
UTF-8   8 bit UCS Transformation Format

Glossary

.torrent A short form for torrent file. The torrent file contains the necessary information to share a file or a set of files over BitTorrent.

Bootstrapping The process of joining a P2P system. The problem is that initially no other peers are known.

Churn The term for instability in a P2P network. It can have several causes, such as hardware failure or peers intentionally leaving the system.

Peer A member of a system in which every member is server and client at the same time. The term is also used for members of a BitTorrent system.

Provider A peer in a BitTorrent system which can provide a file to a requesting peer.

Swarm The term for a group of peers sharing the same file in a BitTorrent network. Peers of a swarm are always connected to each other, directly or indirectly.

Tracker A broker system for peers in the BitTorrent system. Original BitTorrent trackers are centralized servers; newer approaches distribute the responsibility among peers. In the context of B-Tracker a peer is also a tracker, since it can be queried for providers by other peers.

List of Figures

1.1 Plot of torrent popularity on logarithmic scales based on The Pirate Bay Dataset 2008 [12].

2.1 The main contents of a .torrent for a single file download. More possible fields exist which are optional.

2.2 Example communication between a peer and a tracker. The peer sends a HTTP GET request with parameters and receives a list of peers and an interval.

2.3 The diagram depicts the two basic message types used in the BitTorrent protocol. The standard message has a payload depending on the type of the message. The numbers on top of the fields tell the fields' sizes in Byte.

2.4 This sequence diagram shows a typical message exchange between a peer and a provider following the BT protocol.

3.1 How many nodes manage a number of keys.

3.2 Total requests per node per query interval.

3.3 Flowchart of the main tracker algorithm.

3.4 Flowchart of the message handling algorithm.

5.1 Mean load per peer under different churn conditions.

5.2 Decomposed mean load per peer under different churn conditions.

5.3 Absolute load balancing.

5.4 Relative load balancing expressed as Coefficient of Variation.

5.5 Ranked mean load per peer for 0% churn.

5.6 Ranked mean load per peer for 15% churn.

5.7 Ranked mean load per peer for 30% churn.

5.8 Mean DHT load of a peer by message type.

5.9 Mean number of DHT messages sent per experiment.

5.10 Mean B-Tracker load of a peer by message type.

5.11 Mean time to finish one run.

List of Tables

2.1 Comparison of the two BitTorrent DHT implementations and the Kademlia standard queries.

5.1 Listing of the 9 experiments used in the evaluation.

5.2 List of the relevant hardware specifications of the server in use.

5.3 B-Tracker parameters used in the experiment.

Appendix A

Installation Guidelines

In order to install the plugin into Vuze, a Vuze installation of version 4.7.0.2 is recommended. The plugin might also work with older versions of Vuze, but these were not tested.

To install the plugin into Vuze, simply copy the folder /B-Tracker Plugin/btracker from the CD to the Vuze plugin directory. The plugin directory can be found in the Vuze properties menu under Plugins. Restart Vuze and B-Tracker will be activated for active downloads. To verify that B-Tracker is working, the B-Tracker plugin log can be opened via the Tools > Plugin menu.
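
On a Linux system the copy step could look as follows; the mount point of the CD and the location of the Vuze plugin directory (commonly ~/.azureus/plugins) are assumptions and should be checked in the Vuze preferences.

    # Copy the plugin folder from the CD into the Vuze plugin directory
    cp -r "/media/cdrom/B-Tracker Plugin/btracker" ~/.azureus/plugins/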

To change B-Tracker parameters, the btracker.properties file in the btracker folder must be edited. It does not make sense to use different properties on different peers.

Appendix B

Contents of the DVD

On the root level of the DVD there are the thesis as a PDF file, a German and an English version of the abstract as plain text files, and the final presentation slides in presentation.odp.

B.1 B-Tracker Plugin

The folder B-Tracker Plugin contains:

- A folder btracker which contains the B-Tracker JAR file and the btracker.properties file. This B-Tracker plugin build can be used with an unmodified Vuze client of version 4.7.0.2 or higher.

- A folder B-TrackerPlugin which contains the Eclipse project and the sources of the plugin build. In order to build the plugin, the Vuze source code or JAR is required.

B.2 Data

This folder contains the raw log file data in the archive data.tar.gz. To regenerate the plot data and plots, extract the archive and copy the *.gnuplot files into the same directory. Run parselogs.sh and then run gnuplot on the gnuplot files; shell commands for these steps are sketched after the following list.

- data.tar.gz  Archive containing all the evaluation data.

- *.gnuplot  Gnuplot scripts to generate the graphs.

- *.pdf  Graphics generated from the data in data.tar.gz.
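
The regeneration steps described above, expressed as shell commands; the location of the *.gnuplot files relative to the extracted data is an assumption.

    tar xzf data.tar.gz       # extract the raw log data
    cp ../*.gnuplot .         # copy the gnuplot scripts next to the data (path assumed)
    ./parselogs.sh            # regenerate the plot data from the logs
    for f in *.gnuplot; do gnuplot "$f"; done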

B.3 Experiment

This folder contains all files needed to run experiments. Copy this folder to a Linux system and adapt the file paths in the following properties files:

- btdhtpx100.seq  Sequence properties; properties defined here override the properties in the other files.

- btrackerparam.properties  Basic B-Tracker properties.

In order to execute a full series of experiments, run the create_interfaces.sh script before starting the experiment with nohupseries.sh.

Running the experiments requires the following prerequisites:

- Python must be installed.

- A Java version of 1.6 or higher must be installed and added to the path.

- For building B-Tracker, an Ant installation is required.

The build script executes an svn update; username and password have to be set.

B.4 Related Work

Contains all papers used in the process of writing this thesis.

B.5 Sources

This folder contains the Eclipse projects for the B-Tracker Plugin and the adapted source code of Vuze which was used for the evaluation.

B.6 Thesis

Contains the LaTeX sources of this thesis, including images.