Monitoring BitTorrent Swarms

António Manuel Rebelo Alves Homem Ferreira

Dissertation for the degree of Master in Communication Networks Engineering

Jury
President: Prof. Doutor Paulo Jorge Pires Ferreira
Supervisor: Prof. Doutor Ricardo Jorge Feliciano Lopes Pereira
Co-supervisor: Prof. Doutor Fernando Henrique Corte Real Mira da Silva
Members: Prof. Doutor João Coelho Garcia

September 2011

Acknowledgments

To all those who helped me and supported me over this journey, from teachers to colleagues, friends to family, I thank you all.

Resumo

Peer-to-Peer protocols, especially BitTorrent, are responsible for most of the traffic generated on the Internet, having a great impact on inter-ISP traffic and, consequently, on ISPs' peering costs. Through the monitoring of more than 3200 live Internet swarms, we found that there is a large amount of locality that can be exploited and used to decrease inter-ISP traffic. We discuss the relationship between swarm size, content popularity and the existing locality, and show that even small swarms have locality properties. We also observed swarms sharing content specific to a region, exhibiting high locality. During the experiment we also found a significant amount of repeated content being shared on the network. Several peers tend to publish the same content through different torrent files, creating several independent swarms that end up sharing a large number of common parts. This redundancy can be exploited to increase data availability and source diversity, as well as the existing locality. To exploit this redundancy, we propose a new technique, called Partial Swarm Merger, which adds a new component to the BitTorrent infrastructure, allowing peers to discover other swarms sharing common content. With this information, the different peers can participate in the different swarms, announcing and requesting from each swarm the parts they have in common with their own download. This way, the availability of the parts common to the several swarms will increase.

Palavras-chave: BitTorrent, Peer-to-Peer, Monitoring, Locality, Repeated content, Availability

Abstract

Peer-to-Peer protocols, especially BitTorrent, account for most traffic generated on the Internet, having a great impact on inter-ISP traffic and thus on ISPs' peering costs. However, through locality mechanisms, P2P traffic can be kept nearby in the network, even within the same ISP, decreasing inter-ISP traffic. Through the monitoring of over 3200 live Internet swarms, we found that there is a lot of locality that can be exploited and used to decrease inter-ISP traffic. We discuss the relationship between swarm size, content popularity and the existing locality, and find that even small swarms have some locality properties. We also observed swarms sharing content specific to a region and thus showing a great amount of locality. During the experiment we also discovered that there is a significant amount of repeated content being shared. Various publishers tend to publish the same content through different torrent files, creating independent swarms that end up sharing a large number of common parts. This redundancy can be exploited to increase data availability and source diversity, as well as the existing locality. To deal with this redundancy, we propose a novel technique, called Partial Swarm Merger, which adds a new component to the BitTorrent infrastructure, allowing peers to learn about swarms with common content. With this information, peers can combine the different swarms, announcing and requesting from each swarm the pieces in common with their download. This increases the availability of the parts which are common to the several swarms.

Keywords: BitTorrent, Peer-to-Peer, Monitoring, Measurements, Locality awareness, Repeated content, Availability

Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures

1 Introduction
  1.1 The Problem
  1.2 Work description
  1.3 Structure of this thesis
  1.4 Publications

2 BitTorrent Protocol

3 State of the art
  3.1 Locality Solutions
    3.1.1 Locality through Client
    3.1.2 Locality through peer and ISP cooperation
    3.1.3 Locality through ISP alone
    3.1.4 Comparison between solutions
  3.2 Locality studies
    3.2.1 Studies of BitTorrent's locality
    3.2.2 Comparison between related studies
  3.3 Content availability

4 Methodology for gathering and analysing data
  4.1 System architecture
  4.2 Data analysis methodology

5 Results
  5.1 Content Analysis
    5.1.1 Content pollution
    5.1.2 Content repetition
  5.2 Locality Analysis
    5.2.1 Repeated content
    5.2.2 All content
    5.2.3 Regional content
    5.2.4 Large swarms
    5.2.5 Two-hour period
  5.3 Peer and Tracker behavior
    5.3.1 Tracker behavior
    5.3.2 Peer behavior
  5.4 Summary

6 Partial Swarm Merger
  6.1 PSM
  6.2 Use Case

7 Conclusions and future work
  7.1 Conclusions
  7.2 Future Work

Bibliography

List of Tables

3.1 Comparison between some solutions already implemented.
3.2 Comparison between studies of the locality potential in BitTorrent.

5.1 Torrent aggregation benefits.

List of Figures

2.1 (1) Obtaining peers from tracker to be able to join the swarm, (2) exchanging data with other peers and (3) obtaining more active peers on the swarm.

4.1 Work flow of the system.

5.1 CDF with the percentage of unique pieces for each torrent file.
5.2 Maximum swarm size and number of seeders for all swarms representing pollution with a maximum swarm size value above 50 peers.
5.3 Time, in hours, swarms sharing polluted content take to drop to 20% of their maximum size.
5.4 CDF with the shared percentage of pieces per number of torrent pairs with at least one common piece.
5.5 CDF with the shared MegaBytes per number of torrent pairs with at least one common piece.
5.6 Number of torrents published by team.
5.7 Histogram of the content repetition frequency.
5.8 Average number of peers per content.
5.9 Average number of seeders per content.
5.10 Number of peers per country for similar content.
5.11 Number of peers per ISP for similar content.
5.12 Increase in swarm size at each ISP by aggregating similar content.
5.13 CDF with the content size for all content considered as being pollution.
5.14 Average number of peers obtained and average number of peers per country for the one-day data aggregation period.
5.15 Average number of peers obtained and average number of peers per ISP for the one-day data aggregation period.
5.16 Distribution of the percentage of the median number of peers that belong to the same country or ISP for the one-day data aggregation period.
5.17 Countries with an average above 30 peers per day for a regional torrent.
5.18 ISPs with an average above 10 daily peers for a regional torrent.
5.19 Regional torrents with 60% of all peers belonging to the same country, at least 75% of the times.
5.20 Regional torrents with 30% of all peers belonging to the same ISP, at least 75% of the times.
5.21 Torrent size and maximum number of seeders for swarms with a maximum number of seeders above 5000.
5.22 Average number of peers obtained vs average number of peers per country for the two-hour period data aggregation.
5.23 Average number of peers obtained vs average number of peers per ISP for the two-hour period data aggregation.
5.24 Distribution of the percentage of the median number of peers that belong to the same country or ISP for the two-hour period data aggregation.
5.25 Distribution of peers per tracker per torrent, for torrents announced in more than 8 trackers.
5.26 Swarm size throughout time per tracker, for a video file torrent.
5.27 Day-night behavior for a video file torrent.
5.28 Number of seeders per swarm size.
5.29 Swarm size per content size.
5.30 Swarm size throughout time.

6.1 PSM - Populating databases.
6.2 PSM's workflow.

Chapter 1

Introduction

1.1 The Problem

In the last few years, Peer-to-Peer (P2P) communication has increased exponentially [37] and proven to be one of the most successful architectures for providing a number of services such as VoIP, video streaming and, of course, file sharing. Due to this rapid growth, Internet Service Providers (ISPs) have seen their inter-ISP traffic increase, mainly because of file-sharing protocols. The main contributor is BitTorrent [9], which represents most of the P2P traffic generated worldwide [37].

P2P, and especially BitTorrent, owe their popularity to their unique properties. In a P2P network every peer can be both server and client, can connect to any other peer, and the network doesn't depend on any single peer, allowing peers to enter and leave the network at any time. This means that there are no infrastructure costs: the only cost is residential-grade equipment, so it is easier and faster for a user to put his content online to be shared. P2P protocols also offer very high scalability. Besides these P2P characteristics, the BitTorrent protocol also has robustness mechanisms, such as the random selection of the peers to which it connects and a tit-for-tat algorithm that ensures a consistent download rate and guarantees that the client is connected to peers with roughly the same bandwidth as itself. These characteristics, among others, allow BitTorrent to work well in networks with few peers and even better in networks with many peers.

There are a number of solutions for decreasing the inter-ISP traffic generated by BitTorrent, but some, like traffic shaping, are difficult to implement because BitTorrent doesn't use standard ports and can encrypt its traffic. Other solutions, like caching, are discouraged because some of the shared files infringe copyrights. One solution to the inter-ISP traffic problem is locality mechanisms.
These mechanisms make peers prefer connections to other peers in the same ISP or country rather than to peers halfway around the globe while, at the same time, maintaining BitTorrent's performance. The real goal is to provide a win-win solution for both user and ISP: users get faster downloads, since they are connected to other peers in a nearby network, and ISPs decrease the amount of inter-ISP traffic by increasing their intra-ISP traffic. However, implementing a locality mechanism isn't as trivial as it may sound. In a P2P network there is no infrastructure and a peer can enter and leave the network at any time. This makes the network very unpredictable and thus poses some difficulties for the implementation of locality mechanisms. BitTorrent also has other mechanisms that represent major barriers to the implementation of locality mechanisms, such as the random selection of the peers to which it connects.

Another characteristic of P2P protocols is the low barrier of entry for publishing content, which has enabled many to publish the works of others. However, users tend to download content from sources they trust. These sources are usually groups of individuals (publisher teams) that have built a reputation for themselves by competing with each other to be the first group to publish a specific content. This competition between groups often results in the creation and publishing of different torrent files that represent the same or very similar content. This is a source of redundancy which is not exploited by the conventional BitTorrent protocol. In BitTorrent, swarms are identified by a hash of information regarding the content being shared. Even if two swarms are sharing the same content, just by having a different file name these swarms will be isolated from each other and won't exchange any kind of data.

To fully understand the BitTorrent protocol, the peers and the shared content, it is necessary to monitor and analyse live Internet swarms. This enables us to extract important information about the peers' and swarms' behavior and the existing locality in these swarms, and to identify patterns in the network. With all this information, we can check whether the implementation and deployment of locality mechanisms, or of other mechanisms like the ones that exploit content redundancy, is viable, and what kind of modifications to the protocol are needed to take the most advantage of such mechanisms.

1.2 Work description

Through the monitoring of over 3200 BitTorrent swarms, we were able to gather peer data, as well as information regarding the content being shared in each swarm. With this data, the work was divided into three studies: (1) a locality study, which presents all findings regarding locality in each swarm, (2) a content study, which focuses on the analysis of the content being shared across the different swarms, and (3) a peer-related study, which discusses peers' behavior and other findings.

The locality study shows that there is a great amount of locality to be explored in BitTorrent swarms, and analyses all monitored swarms to quantify this locality. We show that the existing locality is correlated with the swarm size and thus with the content's popularity, since popular content is expected to have very large swarms. However, we also found swarms that represented content specific to a given country, region or language, which showed a remarkable locality potential despite their often small size. Another observation was that even small, non-regional swarms showed some locality that can be exploited. This locality was analysed in periods of one day and in periods of two hours, and the results show that, for small-sized content and fast download speeds, a locality mechanism would benefit greatly if it were also implemented in the tracker, substituting the random peer selection algorithm.

Regarding the content analysis, we discuss the redundancies found in the repeated content being shared across the network. We show that the competition between publisher teams results in the existence of different swarms, sharing the same or similar content but isolated from each other, and discuss how this affects the performance of BitTorrent. We demonstrate that combining these different swarms into a larger one improves content availability and the existing locality. We propose a solution called Partial Swarm Merger (PSM) as an efficient way to exploit the content redundancy. This solution would be based on a service outside the BitTorrent network and would not require any modifications to the BitTorrent protocol. However, BitTorrent client applications would need an extension in order to use the service.

As for the peer-related study, we present some findings regarding peer behavior. We show a day-night behavior, where the swarm periodically decreases and increases in size. We also present the distribution of peers across the different trackers for torrent files announced in several different trackers. We show that the current lack of load-balancing mechanisms between trackers results in some trackers having much larger swarms than others, affecting availability and locality. Another behavior we found was that the percentage of seeders in the swarm tends to increase with the swarm size, getting very close to 100% for very large swarms. This behavior shows that free-riding is not a significant problem in BitTorrent, and that most peers are willing to share what they download.

The data for all these studies was obtained by downloading and analysing torrent files, and by gathering information on the associated swarms. The latter was performed using an instrumented BitTorrent client application running on several PlanetLab [8] nodes that periodically queried trackers for swarm and peer information. The peers' IP addresses were later converted to a geographic position and to the corresponding ISP, with the help of MaxMind's GeoIP databases.
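The peer lists gathered this way typically arrive from trackers in "compact" form: each IPv4 peer is packed into 6 bytes, 4 for the address and 2 for a big-endian port. The following sketch (illustrative only; it is not the instrumented client used in this work, and the names are ours) decodes such a response field:

```python
import struct
from typing import List, Tuple

def parse_compact_peers(blob: bytes) -> List[Tuple[str, int]]:
    """Decode a compact peer list: 6 bytes per peer
    (4-byte IPv4 address + 2-byte big-endian port)."""
    if len(blob) % 6 != 0:
        raise ValueError("compact peer list length must be a multiple of 6")
    peers = []
    for off in range(0, len(blob), 6):
        ip = ".".join(str(b) for b in blob[off:off + 4])
        (port,) = struct.unpack(">H", blob[off + 4:off + 6])
        peers.append((ip, port))
    return peers

# Example: two peers, 192.0.2.1:6881 and 198.51.100.7:51413
blob = bytes([192, 0, 2, 1]) + struct.pack(">H", 6881) \
     + bytes([198, 51, 100, 7]) + struct.pack(">H", 51413)
print(parse_compact_peers(blob))  # → [('192.0.2.1', 6881), ('198.51.100.7', 51413)]
```

The decoded addresses can then be fed to a GeoIP lookup to obtain country and ISP, as done in this study.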

1.3 Structure of this thesis

This thesis is organized as follows: Chapter 2 explains the BitTorrent protocol in detail. In Chapter 3 we present and discuss the related work regarding locality mechanisms and studies, as well as studies and approaches for improving content availability. Chapter 4 then presents our methodology for gathering and analysing data, and Chapter 5 presents all our findings. In Chapter 6 we present our approach, PSM, for improving availability and locality through redundancy exploitation, along with a use case scenario. Finally, Chapter 7 presents the final conclusions, future work and achievements.

1.4 Publications

So far, the work developed in this thesis has resulted in one paper, which was submitted to a conference and is currently under evaluation.

* António Homem Ferreira, Ricardo Lopes Pereira and Fernando M. Silva. Partial Swarm Merger: Increasing BitTorrent content availability. Submitted to INFOCOM 2012. This article discusses the existence of isolated swarms sharing the same content in BitTorrent networks, and how merging these swarms can increase content availability. In the paper, we introduce PSM, explaining how it would work and how it could be implemented.

Chapter 2

BitTorrent Protocol

Before presenting the work described in this thesis, we provide a brief review of the BitTorrent protocol. BitTorrent is a P2P file-sharing protocol where peers share files among themselves, supporting the upload costs. Unlike other file-sharing P2P protocols, such as eMule1 or Kazaa2, BitTorrent doesn't provide any mechanism for file search; its goal is just to exchange and replicate files. This means that all file searches are done outside the network. There are two main components in the BitTorrent protocol:

1. Tracker: Provides a list of peers sharing a given file. It can also receive and log information about upload/download rates and other details for statistical purposes.

2. Peers: Share a given file among them. There are two types of peers: the ones that have already finished the download of the file, called seeders, and the ones still downloading the file, called leechers.

To share content, a peer needs to create a torrent file and publish it. This file contains meta-information such as: (1) file names and sizes, (2) tracker Uniform Resource Locators (URLs), (3) the hashes for each file part (piece) and the fixed piece size, and (4) comments, creation date, encoding and other information on the content and files. After the torrent file is published, usually on a webpage, interested users can download and open it with a BitTorrent client application. This application reads the file and queries the tracker for a list of active peers for that same file. After receiving the list, it connects to the peers and starts downloading the file. All file distribution is done between peers; trackers don't get involved in the file-sharing process. These two steps of the file download are represented in Figure 2.1 as steps 1 and 2.
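Torrent files store this meta-information in "bencoded" form (integers, byte strings, lists and dictionaries). A minimal decoder, given here as an illustrative sketch rather than the parser used in this work, is enough to read the fields listed above (the example metainfo is made up):

```python
def bdecode(data: bytes, i: int = 0):
    """Decode one bencoded value starting at offset i.
    Returns (value, next_offset)."""
    c = data[i:i + 1]
    if c == b"i":                       # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                       # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            v, i = bdecode(data, i)
            items.append(v)
        return items, i + 1
    if c == b"d":                       # dictionary: d<key><value>...e
        i += 1
        d = {}
        while data[i:i + 1] != b"e":
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            d[k] = v
        return d, i + 1
    colon = data.index(b":", i)         # byte string: <length>:<bytes>
    n = int(data[i:colon])
    return data[colon + 1:colon + 1 + n], colon + 1 + n

meta, _ = bdecode(b"d8:announce31:http://tracker.example/announce"
                  b"4:infod4:name8:file.iso12:piece lengthi262144eee")
print(meta[b"info"][b"name"])  # → b'file.iso'
```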

Peers exchange blocks of data from a data aggregate which contains one or more files concatenated. The exchange unit is the piece, which has a fixed size. Each piece is associated with a SHA1 hash, found in the torrent file and used to verify its integrity. After downloading a piece and verifying its hash, a peer informs every peer connected to it that it now has that piece available for upload. Hashes are the same for equal pieces of the same size, and it is very unlikely that different pieces produce the same hash.

1 http://www.emule.com/ , last accessed August 2011
2 http://www.kazaa.com/ , last accessed August 2011

Figure 2.1: (1) Obtaining peers from the tracker to be able to join the swarm, (2) exchanging data with other peers and (3) obtaining more active peers in the swarm.

Peers periodically query the tracker for other peers sharing the same content. However, peers can also query each other to obtain other active peers in their swarm. This can be done using Peer Exchange (PEX) or through DHTs [13], and is represented as step 3 in Figure 2.1. Each swarm is identified by the infohash generated from the torrent file. The infohash is a URL-encoded 20-byte SHA1 hash generated from the information in the info key of the torrent file. This information includes the piece size, the hashes of all pieces, the file name, the file size and other details. Thus, two torrent files that refer to the same file can produce two different infohashes just by changing the file's name. Despite sharing the same content, these two torrent files will be associated with two different and isolated swarms that do not share any information with each other. As for trackers, they typically provide 50 peers chosen using a random algorithm, but are free to use better peer selection algorithms.
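Because the swarm identifier is a SHA1 over the bencoded info dictionary, changing only the file name yields a different infohash, and therefore an isolated swarm. This can be illustrated with a small sketch (the encoder is minimal and the field values are made up):

```python
import hashlib

def bencode(v) -> bytes:
    """Minimal bencoder for ints, byte strings, lists and dicts
    (dictionary keys sorted, as the format requires)."""
    if isinstance(v, int):
        return b"i%de" % v
    if isinstance(v, bytes):
        return b"%d:%s" % (len(v), v)
    if isinstance(v, list):
        return b"l" + b"".join(bencode(x) for x in v) + b"e"
    if isinstance(v, dict):
        return b"d" + b"".join(bencode(k) + bencode(v[k]) for k in sorted(v)) + b"e"
    raise TypeError(type(v))

def infohash(info: dict) -> str:
    """Swarm identifier: SHA1 of the bencoded 'info' dictionary."""
    return hashlib.sha1(bencode(info)).hexdigest()

info_a = {b"name": b"movie.mkv", b"piece length": 262144,
          b"pieces": b"\x00" * 20, b"length": 1000}
info_b = {**info_a, b"name": b"movie-renamed.mkv"}  # same data, new file name
print(infohash(info_a) != infohash(info_b))  # → True: two isolated swarms
```

This is exactly the property Partial Swarm Merger exploits: identical pieces shared under different infohashes end up in swarms that never exchange data.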

When a peer starts downloading a file, it doesn't have any part of that file to share, so it is very important to quickly complete a first piece. As such, the first piece is chosen by a random algorithm. The remaining pieces are chosen by a "rarest-first" algorithm: a peer selects for download the pieces that fewest of the peers connected to it have, leaving popular pieces for later. This way, file sharing in the network is faster, since rare parts tend to become popular over time, and the protocol's robustness is increased, since it ensures that rare parts don't become unavailable when a peer leaves the network. Rarest-first also guarantees good performance even when there is only one seed sharing a given file: peers will download different rare parts from it and eventually start sharing parts between themselves, not depending entirely on the one seed, which makes the download faster. There is another property in the piece selection algorithm to get pieces downloaded quickly: if a piece has already been requested, the algorithm prefers finishing that piece before requesting a new one. As for the end of the download, there is also a mechanism to prevent delays caused by requesting a piece from a peer with a low transfer rate: the last sub-pieces are requested from all peers.
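The core of rarest-first can be sketched as follows (an illustrative simplification: real clients work over bitfields and break ties randomly, as done here, but also handle pending requests and end-game mode):

```python
import random
from collections import Counter

def pick_rarest(have, neighbor_bitfields):
    """Rarest-first: among pieces we still need, pick one held by the
    fewest connected peers (ties broken randomly).
    `have` is the set of piece indices we already own;
    `neighbor_bitfields` is a list of piece-index sets, one per neighbor."""
    counts = Counter()
    for bitfield in neighbor_bitfields:
        counts.update(bitfield)
    candidates = [p for p in counts if p not in have]
    if not candidates:
        return None  # nothing useful among our neighbors
    rarest = min(counts[p] for p in candidates)
    return random.choice([p for p in candidates if counts[p] == rarest])

# Piece 2 is held by only one neighbor, so it is chosen first.
neighbors = [{0, 1}, {0, 1, 2}, {0, 1}]
print(pick_rarest(set(), neighbors))  # → 2
```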

BitTorrent also handles peer communication through its tit-for-tat algorithm, also known as the choking algorithm. Through this mechanism, peers choose to whom they upload their pieces, "choking" (not uploading to) the others. There is a fixed number of upload slots and, every 10 seconds, the average download rate over the last 20 seconds is calculated for each peer. After this calculation, the peers with the highest averages are unchoked: they are given a slot and uploading to them starts. The remaining peers receive no upload. However, this mechanism has a problem: there may come a time when it is impossible to find other peers that may allow higher download rates. To solve this problem, there is a single slot that uses "optimistic unchoking", which unchokes a peer every 30 seconds (by default) regardless of its download rate. This is equivalent to always cooperating in the Prisoner's Dilemma. This way, every peer has a chance to compete for an upload slot, and it is assured that if there is a peer allowing a higher download rate than the ones occupying the upload slots, it will eventually get a slot for itself. All these properties make BitTorrent very scalable, provide a very fast way to share files and discourage "free-riding" [19], where peers don't share what they download.
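One round of this choking logic can be sketched as below (a simplification: the slot count and the `optimistic` parameter are illustrative; real clients rotate the optimistic unchoke on its own 30-second timer):

```python
import random

def choose_unchoked(download_rates, slots=4, optimistic=None):
    """Tit-for-tat round: unchoke the `slots` peers with the highest
    average download rate (measured over the last 20 s), plus one
    optimistically unchoked peer picked regardless of its rate.
    `download_rates` maps peer id -> average download rate."""
    by_rate = sorted(download_rates, key=download_rates.get, reverse=True)
    unchoked = set(by_rate[:slots])
    choked = [p for p in download_rates if p not in unchoked]
    if choked:  # the single optimistic-unchoke slot
        unchoked.add(optimistic if optimistic is not None else random.choice(choked))
    return unchoked

rates = {"A": 120.0, "B": 80.0, "C": 300.0, "D": 10.0, "E": 0.0, "F": 55.0}
# C, A, B, F win slots on rate; E gets the optimistic slot despite a 0 rate.
print(sorted(choose_unchoked(rates, slots=4, optimistic="E")))
```

The optimistic slot is what lets a newcomer with nothing to trade (rate 0) bootstrap its first pieces.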

Chapter 3

State of the art

Many studies have been done regarding P2P network monitoring and, especially, BitTorrent monitoring. In this chapter, we focus on the work done in the fields of locality and content availability.

Regarding the locality issue, there are three main classes of solutions: (1) solutions implemented only in the BitTorrent client, (2) solutions based on cooperation between BitTorrent and ISPs, and (3) solutions implemented by the ISP alone. In solutions implemented only in the BitTorrent client, a client computes the distance between itself and other peers and chooses to connect to the ones "close" to it. Another approach is to have BitTorrent clients and trackers cooperate with ISPs. Since ISPs have inside information about their networks' topology, they know exactly where the peers in their own network are and can specify rules and paths for the communication between them. These two kinds of solutions are based on peer selection, where a peer is supposed to connect to and download the file from other peers that are "close" to it. However, there are also solutions implemented by ISPs alone, like caching, traffic shaping or even traffic blocking. These are the most difficult to implement because of the characteristics of P2P networks and of the BitTorrent protocol, such as the ability to conceal its traffic.

Regarding content availability, most of the work presented focuses on bundling and other techniques that can be used to increase the availability of unpopular content by grouping and sharing related content. The main goal of these solutions is to supply a mechanism that enables BitTorrent and P2P file-sharing protocols to maintain high availability for a given content, even after the initial flash-crowd moment.

3.1 Locality Solutions

3.1.1 Locality through Client

A BitTorrent client uses a random algorithm to choose the peers it connects to. This is the first thing to be changed when a locality mechanism is implemented. In a client solution, a different algorithm is used, one that makes the peer connect to other peers "close" to it, belonging to the same ISP, for example.

Ono [6] is an Azureus1 extension that uses a biased peer selection mechanism. This mechanism relies on Content Distribution Networks (CDNs). These networks have replica servers all over the globe, in many ISPs, and through dynamic DNS redirection a client is sent to the nearest CDN servers. This means that two clients that are redirected to the same CDN servers are likely to be close to each other. Ono uses this information to build ratio maps that represent the proximity of a given peer to a CDN server, and peers then use these ratio maps to determine whether other peers are close to them. If two peers have similar ratio map values, there is a high probability that they are close to each other. This solution needs no additional information about the network topology and no new infrastructure, but gives no guarantee that it will decrease inter-ISP traffic, since not every ISP hosts a CDN server. However, the authors' evaluation shows that it does reduce inter-ISP traffic since, over 30% of the time, it finds peers that belong to the same AS as the client. The authors also show that their solution increases download and upload rates whenever bandwidth is available, and that peers are located along paths with lower RTTs than those chosen randomly.

Other solutions suggest modifications either to the tracker's peer list (the one requested by and sent to peers) and the client's peer selection algorithm, or just to P2P traffic shaping devices [3]. The peer list is built with the help of Internet topology maps or the ISP's information about its IP address ranges, so that peers connect to others close to them. However, peers don't connect only to the ones from the same ISP; they also establish and maintain some connections to peers outside the ISP so that they can keep a global view of the pieces existing in the network. To evaluate the usage of Biased Neighbour Selection mechanisms, a discrete-event simulator was built and used.

1 Azureus is a BitTorrent client. http://azureus.sourceforge.net/ , last accessed August 2011
Their conclusion is that a biased peer selection method "works well" but should be combined with other mechanisms such as bandwidth throttling or caching. For the network to work as desired, a peer's list should also contain a number of peers outside its own ISP.

In [5], another approach was taken. To study how a BitTorrent locality mechanism would improve the user and ISP experience, a number of solutions were implemented, analyzed and measured on a real Internet AS topology. The solutions include changes to the BitTorrent protocol at the peer selection level, at the choking/unchoking level, and a locality-aware piece picker. The peer selection method the authors implemented is very similar to the one in [3]: the tracker keeps an association between peers and AS hop counts and uses this list to select the peers to send when a request is made. The modification to the choking/unchoking algorithm consisted of choking and unchoking peers based on the network distance between them (AS hop count) instead of the peers' upload speeds; optimistic unchoking, however, was left untouched. As for the locality-aware piece picker, it substituted rarest-first so that a peer is encouraged to download first the pieces close to it. The evaluation shows that tracker locality (the peer selection method) is able to achieve a low AS hop count, while choker and piece picker locality decrease the download time in comparison with standard BitTorrent. Although tracker locality was able to retain traffic nearby in the network, it had a problem: the peer workload was not distributed evenly. In their conclusions, the authors suggest that there should be a tradeoff between locality and peer workload.

The authors of [33] also focused their work on Biased Neighbour Selection and Biased Unchoking methods. Like [5], they modified the peer selection algorithm and the choking algorithm to prefer connections to peers based on the AS hop count. However, they also decided to change the optimistic unchoking algorithm. While in the standard optimistic unchoking method all choked peers have the same probability of being optimistically unchoked, in their approach this probability depends on the distance to those peers: the closer a peer is, the higher the probability of it being optimistically unchoked. The authors studied and compared the usage of Biased Neighbour Selection, Biased Unchoking and both, and the results show that the combination of both methods achieves the best performance in both locality and download speed. Their experiment also showed that Biased Unchoking works best in high-load situations.

There are also other solutions that include a biased peer selection through real-time measurements of pings, traceroutes, AS hops, etc. An example of work in this area is a full BitTorrent client application called TopBT [40]. TopBT is a topology-aware client that needs neither ISPs nor services like CDNs to find nearby peers. It uses traceroute and ping probes from time to time to determine the proximity of a peer, and maps the already established connections to the corresponding AS hops and link hops. After obtaining this information, it unchokes peers based on the routing hops, download rates and reciprocal upload rates. This way, there isn't even the need to have the platform widely deployed to see results in the network's traffic. However, this work focuses more on decreasing the download time and the BitTorrent traffic in the network, since it tries to connect to peers with a lower hop count, and not necessarily on solving the inter-ISP traffic problem (although it can contribute to solving it). The AS hop count determines the number of ASes that a packet has to go through from source to destination. Although a low AS hop count can help keep traffic nearby in the network, there is no guarantee that it will be contained inside the same ISP unless the AS hop count is equal to zero.
Their evaluation of the BitTorrent client was done by deploying it onto 106 PlanetLab and residential hosts and downloading various torrent files on each node/host. The results, when compared to the ones obtained from a regular BitTorrent client, show that TopBT has a lower download time and can reduce induced BitTorrent traffic by up to 25%. Another work that focuses on exploiting locality in P2P networks is Adaptive Search Radius (ASR) [35]. The authors show that both ISPs and peers could gain if the network was used more efficiently. ASR is a peer selection mechanism that defines a search radius, measured in network hops with the peer at the center, and only connects to other peers inside that radius. However, this mechanism dynamically changes the search radius according to the availability of the file's parts. This way, download time is not affected and the client application still uses the network efficiently. The search radius only affects downloads, not uploads. Through simulations, the authors show that, although ASR achieves better results than BNS, the two approaches complement each other and would benefit from being combined. The remaining results show that both peers and ISPs would benefit from the usage of ASR in an Internet-like topology.
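The adaptive-radius idea can be sketched as follows, assuming the client knows, for each candidate peer, its hop distance and the set of pieces it holds. The field layout and the radius-doubling policy are illustrative assumptions, not taken from [35].

```python
def adaptive_search_radius(peers, needed_pieces, start_radius=2, max_radius=32):
    """ASR-style download peer selection (a sketch): connect only to
    peers within a hop radius, widening the radius until every needed
    piece is available from some in-radius peer.
    `peers` maps peer id -> (network_hops, set_of_pieces_held)."""
    radius = start_radius
    while radius <= max_radius:
        in_range = {p for p, (hops, _) in peers.items() if hops <= radius}
        available = set().union(*(peers[p][1] for p in in_range)) if in_range else set()
        if needed_pieces <= available:   # all pieces reachable: stop widening
            return radius, in_range
        radius *= 2                      # availability too low: widen radius
    return max_radius, set(peers)        # fall back to the whole swarm
```

The key property is that the radius grows only when piece availability forces it to, so popular content is fetched from nearby peers while rare pieces can still be obtained from far away.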

3.1.2 Locality through peer and ISP cooperation

There are also some approaches that rely on a tight cooperation between peers and ISPs, like the one presented in [1]. This work focuses on a service provided by the ISP, an Oracle Service that users can query to get information on the underlying network, enabling peer selection according to ISP-defined criteria. When peers query the Oracle, they send it a list of possible P2P neighbors and the Oracle ranks each of them according to criteria defined by the ISP. Since the service is provided by the ISP, which has full access to information regarding the underlay network, the Oracle can rank P2P neighbors based on anything from topological information to link congestion or cost. This service has many advantages over the other solutions, which lack cooperation between peers and ISPs: (1) peers don't need to waste valuable time and resources measuring path performance, (2) there is no need to infer network distances, since the ISP has real-time information about all of its links, (3) bottlenecks can be avoided, thus improving user experience, and (4) ISPs can direct traffic away from expensive or congested links. Despite these advantages, there is a problem regarding the mutual trust between peers and ISPs. Since a lot of files are shared illegally, users tend to avoid using such services; as for ISPs, they are very reluctant to give away information about their own network. The system was evaluated in a test lab, using a modified Gnutella1 protocol to query the Oracle upon peer entry. The results show that the number of Query messages flowing in the network decreases and that messages tend to stay inside the same AS. Future work for this project includes evaluating it on PlanetLab. Another solution created in this area is P4P [44]. P4P follows the same objective as the Oracle, that is, to supply underlay network information to the overlay P2P network, but does it in a different way. In this approach there is an iTracker located in each ISP that supplies network-layer information about the network it is in.
This component has all the information needed to rank P2P neighbors through both network distance and ISP-defined criteria. There is also another component, called an appTracker. When a peer wants to find other peers close to it, it queries the appTracker for a peer list. In order to create this list, the appTracker communicates with the iTracker of each ISP to build a list with peers mainly from the same AS. However, some peers in the list should be from different ASes than the requesting peer, so that the robustness of the P2P protocol is maintained. In the end, the list of peers is sent back to the requesting peer. This approach was evaluated through simulation and real Internet experiments and is already deployed. The results show that it can be a promising approach to solving the inter-ISP traffic problem while maintaining P2P performance. The IETF has also formed a working group for ALTO (Application Layer Traffic Optimization) [39] with the objective of defining a protocol for P2P applications to query a service about the underlying network topology. They discuss the requirements for a standard ALTO protocol and also focus on issues like security, privacy and service discovery for ALTO usage.
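The appTracker's list-building step might be sketched as below, assuming the iTracker's view is reduced to a peer-to-AS map. The 20% external share is an illustrative parameter, not a value from [44]; it only captures the rule that most peers are same-AS while a few external ones are kept for robustness.

```python
import random

def build_peer_list(requester_as, swarm, want=50, external_share=0.2, rng=random):
    """appTracker sketch: answer a peer's request with a list that is
    mostly same-AS peers (per the iTracker's view) plus a slice of
    external peers to preserve swarm robustness.
    `swarm` maps peer id -> AS number."""
    local = [p for p, asn in swarm.items() if asn == requester_as]
    remote = [p for p, asn in swarm.items() if asn != requester_as]
    # Reserve at least one external slot so the overlay never partitions
    # into per-AS islands.
    n_ext = min(len(remote), max(1, int(want * external_share)))
    n_loc = min(len(local), want - n_ext)
    return rng.sample(local, n_loc) + rng.sample(remote, n_ext)
```

Keeping a fixed fraction of cross-AS peers is the simplest way to encode the robustness requirement the paragraph above describes; a real deployment would also weight external ASes by the iTracker's cost information.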

3.1.3 Locality through ISP alone

The paper [36] discusses the major locality-aware solutions that can be implemented: (1) blocking traffic, (2) using network caching, (3) shaping traffic and (4) using stateful policy management. Each one is examined to determine whether it is able to solve the problem.

1Gnutella is a P2P file sharing protocol. http://rfc-gnutella.sourceforge.net/ , last accessed August 2011

1. Blocking traffic [36]: The aim of this solution is to reduce the traffic, and consequently the bandwidth, associated with P2P communication. This is usually done by blocking ports that are associated with P2P networks. However, P2P applications often use dynamic ports and allow users to choose which ports to use, making this solution very difficult to implement. In the case of BitTorrent, by blocking P2P traffic to the outside network, the tit-for-tat algorithm might start choosing peers from the same ISP to share the contents. However, if there aren't enough peers in the same ISP sharing that same content, the users' experience will be harmed. Although this mechanism can reduce the cost of P2P traffic to zero, its disadvantages are enough to discourage its implementation.

2. Using network caching [36]: This is a workable solution that consists of the ISP maintaining a cache of popular P2P files and redirecting clients to that cache. By doing this, traffic is reduced and kept as much as possible inside the ISP's network. However, this solution has problems with illegally shared content, where no copyrights are paid: it can only cache legal content, and so cannot be used in every situation, since files are shared illegally in many P2P networks.

3. Shaping traffic [36]: Its objective is to analyze and label traffic with different priority levels. This way, P2P traffic is identified, given a low priority and thus consumes less bandwidth. With this approach, the ISP can have great control over the traffic in its network. However, P2P applications have adopted several mechanisms to avoid being labeled as such, hiding the nature of the packets exchanged in the network. For example, BitTorrent can encrypt the data exchanged between peers, making it very hard to examine and label. For this reason, traffic shaping is very difficult to perform accurately. There are also several problems regarding user experience: since all packets need to be examined, tagged and queued for transmission based on their priority, a delay is likely to be added to the communication. Despite the cost reduction obtained from this mechanism, it is not an effective way of managing all P2P traffic.

4. Stateful Policy Management [36]: Uses stateful, deep-packet inspection to intelligently identify, label and redirect P2P traffic away from expensive links. It can manage both downstream traffic, through redirections, and upstream traffic, by controlling the number of connections to outside networks. This is achieved by facilitating the connection between peers inside the same ISP: if a peer is trying to download a file from an external source, the Stateful Policy Management mechanism can check if another peer inside its network is sharing that same file and redirect the connection to this internal source. This solution is transparent to the subscriber and doesn't have the problems associated with traffic shaping, since the user experience remains the same. It is the best of the solutions compared.
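The redirection rule in item 4 can be sketched as a lookup in a DPI-maintained index of what internal peers are sharing. The data structures and function names below are hypothetical illustrations, not from [36].

```python
def stateful_redirect(dst_ip, content_id, internal_index, is_internal):
    """Stateful-policy sketch: if a subscriber opens a connection to an
    external peer for content that an internal peer already shares,
    return an internal peer's address instead; otherwise keep the
    original destination.  `internal_index` maps content id -> list of
    internal peers sharing it (a hypothetical DPI-built index)."""
    if not is_internal(dst_ip) and internal_index.get(content_id):
        return internal_index[content_id][0]  # prefer the internal source
    return dst_ip
```

Because the rewrite happens in the ISP's middlebox, the subscriber's client still believes it is talking to the peer it chose, which is why the mechanism is transparent to the user.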

Besides these solutions, there are others, like LiteLoad [18], that follow a Stateful Policy Management approach. LiteLoad is a system that detects and manages P2P communications arriving from or directed to a peer, completely unaware of the content being shared. It requires no blocking, caching or shaping of traffic; it simply looks for patterns of communication that identify existing P2P networks. Through filters in the ISP's network, the packets that match known messages of one or more P2P protocols (for example, session initiation messages) are sent to LiteLoad and, based on predefined rules, the system either lets each packet reach its original destination or changes its destination IP address to another peer's IP address. The new destination can be a peer internal to the ISP or a specific external one, avoiding expensive links. LiteLoad keeps a pool of both internal and external peers it has found, so there is no problem in redirecting communications to a peer other than the original destination. Despite filtering and redirecting messages, it has no interest in the content and doesn't analyze it or build any kind of index with it. This is done for two reasons: (1) to keep the system unaware of the content being shared, avoiding any ability to identify illegal content, and (2) to make the system work with both exposed and encrypted protocols, since it only needs the header to do its job. The system was evaluated through a simulation with real peers and a proof of concept was made. It was also tested with eMule in obfuscated mode (data encrypted) and proven to work. The authors' future work is to implement the system in a large-scale ISP and to make the system more flexible, so it can deal with more behavioral patterns of the users.
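A toy version of the pattern-match-and-redirect rule might look like the following. The packet representation, the pool policy and the address-prefix test are simplifications; LiteLoad matches protocol-level patterns, of which the plaintext BitTorrent handshake prefix used here (the byte 0x13 followed by the string "BitTorrent protocol") is just one example.

```python
def liteload_redirect(packet, internal_pool, local_prefix="10."):
    """LiteLoad-style sketch: when a packet matches a known P2P
    session-initiation signature and is headed outside the ISP,
    rewrite its destination to a peer from the internal pool.
    Only headers and the first payload bytes are inspected; the
    content itself is never indexed.
    `packet` is a dict with 'dst_ip' and 'payload_head'."""
    is_bt_handshake = packet["payload_head"].startswith(b"\x13BitTorrent protocol")
    leaving_isp = not packet["dst_ip"].startswith(local_prefix)
    if is_bt_handshake and leaving_isp and internal_pool:
        # Rewrite the destination; a real system would balance across
        # the pool rather than always picking the first entry.
        packet = dict(packet, dst_ip=internal_pool[0])
    return packet
```

Note that this check fails for encrypted handshakes, which is exactly why the real system relies on broader communication patterns rather than a single plaintext signature.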

3.1.4 Comparison between solutions

To summarize, Table 3.1 shows a simple comparison between the different implementations of the locality mechanisms mentioned in the previous sections. For each solution, it lists the locality mechanism used, where it is implemented and what dependencies it has. By analyzing the table, we can see that the locality mechanisms used by the different solutions have different dependencies, depending on where they are implemented. Implementations that rely only on modifications to the BitTorrent client application tend to depend on other services already deployed, like CDN networks, or on Internet topology maps for retrieving information on the AS hop count between each peer and itself. ASR and TopBT are the exceptions: while TopBT only needs an IP-to-AS map and its own traceroute results to calculate the AS hop count between itself and other peers, ASR has no dependencies. However, its results depend on the percentage of peers in the network that also use ASR. Solutions like the Oracle or P4P give ISPs a method to redirect P2P traffic as they wish and only depend on the ISP for the network topology information. Solutions implemented by the ISP alone also have the goal of redirecting P2P traffic away from undesirable links; however, they depend on machines and methods capable of identifying this same traffic.

3.2 Locality studies

3.2.1 Studies of BitTorrent’s locality

Despite an increasing number of solutions, there is still a lot of work to be done in studying the viability of locality mechanisms. The main objective of these approaches is to decrease inter-ISP traffic in a way that doesn't affect the user experience. Since P2P networks are very dynamic and thus very unpredictable, one must first understand the protocol, how it works and how users take advantage of it. One of the first studies regarding the problem of the inter-ISP traffic caused by P2P communications

Solution | Locality Mechanism | Implementation | Dependencies
Ono | Proximity to a CDN server | BitTorrent client application | CDN network
TopBT | Proximity to other peers, both AS hop count and delay | BitTorrent client application | IP to AS maps (for calculating AS hops)
ASR | Proximity to other peers based on network hops | P2P client application | User experience depends on the percentage of peers in the network running ASR
Biased Peer Selection | Proximity to other peers based on AS hop count | BitTorrent client application and/or tracker | Internet topology maps and/or ISP (for IP address information)
Biased Unchoking | Proximity to other peers based on AS hop count | BitTorrent client application | Internet topology maps. Should be combined with Biased Peer Selection
Oracle/P4P | Proximity to other peers based on rules defined by ISPs | ISP or independent service (small modifications to the BitTorrent client might also be needed) | ISP (for network topology information)
Traffic shaping/Caching/etc. | Redirecting traffic away from expensive links | ISP | Capability for identifying/caching/shaping traffic
LiteLoad | Redirecting traffic away from expensive links | ISP | Capability of identifying patterns of user communication in P2P networks

Table 3.1: Comparison between some solutions already implemented.

and how the costs of content distribution are shifting from CDNs to ISPs was [21]. This study analyzed BitTorrent tracker logs and "payload packet traces collected at the edge of a 20,000 user access network". After analyzing all this data, the conclusion was that there was enough locality to be exploited. In the end, the performance of some solutions was evaluated, such as placing proxy-trackers at the edge of a network or clustering based on domain names, matching rules or network-aware clustering [22]. Another study [30] debates the difficulties related to locality mechanisms in BitTorrent and compares mechanisms such as Ono, latency, shortest path and same AS. Their conclusion is that locality exists but cannot be exploited without degrading the network's robustness. Although the previous solutions and studies are also related to the work described in this document, they focus their attention on getting results through the implementation of solutions. This work aims to monitor several real-life swarms in real time and on the real network topology. One of the first approaches to the monitoring of real-life swarms was [23]. In this paper, the lifetime of a torrent was followed over 5 months. The torrent observed was a Linux RedHat 9 distribution, and 180 thousand clients downloaded it during those 5 months. This monitoring wasn't aimed at locality issues but rather at BitTorrent itself; still, geographic information associated with each peer was logged. Since only one torrent was monitored (which was known from the start to be popular), this paper can show every step in a torrent's life, especially the flash crowd moment. There was one work [42] where the authors, claiming that studies until then seldom discussed content and peer diversity, monitored BitTorrent swarms for a few thousand video and non-video files.

They automatically downloaded every torrent they could from a specific web page that advertises torrent files, and had an application running on PlanetLab to gather as much information as possible about each file and the peers downloading and sharing it. The results of their real-world measurements show that a global locality approach is not the best choice, since most AS clusters don't have potential for locality. In [11], the authors' main goal was to answer questions such as "what are the win-win boundaries for ISPs and their users?" or "what is the maximum amount of transit traffic that can be localized without requiring fine-grained control of inter-AS overlay connections?", among others. They collected 100k torrent files and then constantly queried the trackers to gather all information associated with each torrent. Each peer was associated with its own ISP and download speed. This speed wasn't measured in real time but taken from available speed-test services. With this data, they studied the performance of several mechanisms as means for locality. Their conclusions say that, through locality mechanisms, in most cases there is a win-win situation. Although the last two papers got real results from real-world measurements, they only studied public trackers, where there are no incentives for the prolonged sharing of a file. For this reason, the authors of [43] took a different approach. They followed torrents in both private trackers and public trackers and compared the results. Since in private trackers peers have a reputation, which consists of an upload/download ratio, peers need to share what they download for much longer than in public trackers. This is the so-called Share Ratio Enforcement [43]. Also, private trackers are usually not open to everyone, like public ones are, and often ban users that fail to share enough. The need to upload as much as a peer can is called, as referred to in the paper, "uploading starvation".
By monitoring both torrent and user activity in private trackers and comparing it to public ones, it can be seen that they have very different behavior and make very different contributions to BitTorrent traffic in the network. A more recent study, [17], follows the one before, but also focuses heavily on the popularity of the torrent file. As expected and confirmed by measurements, different torrents and their associated swarms have different locality awareness. For example, English video files tend to be popular worldwide, while Spanish or German video files tend to be popular only in some regions, as expected. Another aspect observed in this work was a "clear statistical indication for day-night behavior". Their conclusions show that the bigger the swarm, the more potential it has for locality, as expected, and that the variation in the number of peers per AS is quite high, which can impose some difficulties on the implementation of locality mechanisms.

3.2.2 Comparison between related studies

Table 3.2 shows a comparison between some of the studies of the BitTorrent network previously mentioned, showing the different aspects monitored in each of these studies. As we can see from the compared studies, [23] was the one that collected the most information, especially regarding swarm and individual peer lifetime. However, it only followed one torrent file, so only one swarm was observed. As for [17], it measured the swarm and individual peer lifetime on selected swarms and only over the course of several days. For this reason, their measurement was considered limited. As for peer

bandwidth, [23] calculated it based on the session download time, obtained through tracker logs. On the other hand, the authors of [11] used the results of Ookla's speedmeter service2 to obtain the peer bandwidth. It is easy to understand why this method was considered a limited and indirect measurement, since Ookla's speedmeter only allows users to check an average download speed for a specific IP address range: no measurement of the peer's bandwidth was made. Regarding the tools used, the authors of [17] used both PlanetLab and G-Lab3 [38]. Summarizing, apart from [23], the studies presented relied more on snapshots of the swarm than on the lifetime of peers, seeders and the swarm. There is also very little information on the actual download rate of each peer. As for the tracker's availability, no study gathered any information whatsoever. When the tracker is down, peers can only connect to and exchange data with the ones they already know, and are not able to request new ones. This way, the tracker's availability should be considered when studying BitTorrent swarms.

3.3 Content availability

Despite the fact that BitTorrent has very high scalability, dealing very well with flash crowd moments, it struggles to replicate content with few peers exchanging it. When content is either unpopular or has a very small number of seeders sharing it, BitTorrent downloads tend to last longer, which makes it less effective. Most solutions to this problem involve either analysing content represented by the same torrent file being shared between different swarms, or using mechanisms such as bundling. One of the first approaches to increase availability through swarm merging is DISM [12]. This work focuses on load-balancing between trackers responsible for the same torrent file. Since many torrent files are published across a number of trackers, by re-allocating peers among these trackers, small swarms can be merged into bigger ones while avoiding the re-allocation of too many peers to a specific tracker. This way, there is load-balancing between trackers and swarms, and availability is increased for the swarms with few peers. However, this approach only works for the same torrent file and relies on the fact that the same torrent file is shared across a significant number of trackers. Their proposed solution would be implemented through a small change in the BitTorrent protocol, adding a new protocol message, tracker redirect, for redirecting peers to other trackers. A different study determined, through measurements, that more than 85% of all peers participate in more than one torrent [14]. This means that most peers are, at a given time, sharing several different files and contents. Through measurements, analysis and modeling, the authors concluded that availability is a big problem in BitTorrent, mainly because of the decreasing peer arrival rate and free-riding.
Since most peers participate in various torrents at a given time, they propose incentives based on inter-torrent collaboration, instead of the current incentives, for a prolonged seed lifetime. This mechanism follows the same path as current BitTorrent mechanisms, namely tit-for-tat, to also obtain instant collaboration. However, this mechanism would calculate the incentives and collaboration based

2http://www.ookla.com/ , last accessed August 2011
3http://www.german-lab.de/ , last accessed August 2011

Study | Dissecting bittorrent: Five months in a torrent's lifetime [23] | On the Locality of BitTorrent-based Video File Swarming [42] | Deep Diving into BitTorrent Locality [11] | Measurement of BitTorrent Swarms and their AS Topologies [17]
IP to AS mapping | NO | YES | YES | YES
Geographic position | YES | NO | YES | YES
Swarm lifetime | YES (one torrent for 5 months) | NO | NO | YES (limited)
Individual peer lifetime | YES (one torrent only) | NO | NO | YES (limited)
Seeders' lifetime | NO | NO | NO | NO
Swarm size | YES (one torrent only) | YES | YES | YES
Number of seeders and leechers | YES (one torrent only) | NO | YES | YES
File type/size | YES (one torrent only) | YES | YES | YES
Peer bandwidth | YES (based on session download time) | NO | YES (indirectly and limited) | NO
Tracker's availability | NO | NO (abnormal results due to tracker failure were not included in the study) | NO | NO
Tools used | Caida's NetGeo; tracker logs; BitTorrent client application logs | Modified version of CTorrent; PlanetLab; Whois | Own script/program; GeoIP; Ookla's speedmeter; iPlane project | Own script/program; GeoIP; PlanetLab and G-Lab; Whois

Table 3.2: Comparison between studies to the locality potential in BitTorrent.

on all the torrent files being shared by a peer, whereas current mechanisms focus on only one torrent at a time. The authors argue that this approach could be applied to an exchange-based incentive mechanism that would be fairer and would improve content availability.

Another approach to the problem is bundling [29]. By grouping and sharing related content, availability can be improved for unpopular content. The authors quantify content availability and explain how bundling can improve it. Their measurements show that 40% of the swarms have no publishers (seeders) available more than half the time. However, bundling can increase content availability: by grouping related content, unpopular content can become much more popular, improving its availability in the network. This is a technique already used by many publishers; the best example of bundling is the sharing of music albums instead of single songs. Their measurements show that bundled content has more availability than comparable isolated content and that download times for unpopular content decrease when bundling is used. However, this method

forces peers to download content they don't want along with the content they do want. This way, peers consume more resources, mainly bandwidth and disk space, for content they don't want or need. Throughout the paper, the authors also develop a model for content availability. In [16], the authors also addressed availability and the bundling solution, but in a rather different way. First they show that content bundling is already widely deployed in BitTorrent and quantify it; then, in [15], the same authors propose that bundling should be done automatically by the system and not manually by the publisher, as happens currently. However, they propose to judge content similarity based only on the torrent file name. File size, content hash and even category are not used as criteria, due to the fact that BitTorrent users use the name of the file to find the content they are looking for. By comparing three text classification algorithms, they concluded that the "cosine" one is the most accurate, and show that it is possible to obtain benefits from "title-based bundling". A study of what drives publishers to publish content was also very important for understanding the amount of repeated content being shared in the network [10]. Through swarm measurements, the authors discovered that most publishers fall into three categories: antipiracy agencies that publish fake or malicious content, altruistic publishers, and profit-driven publishers. With the help of the RSS feeds from BitTorrent tracker sites, they could identify the initial publishers, as these were the first to join a given very recent swarm, being its only seed. Their study shows that a very small number of publishers is responsible for about 67% of all published content. In their conclusions, the authors state, based on their measurements and analysis, that if the profit-driven publishers were unable to continue publishing new content, BitTorrent's popularity could be at risk.
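As an illustration of the title-based comparison mentioned above, a cosine similarity over word-count vectors can be computed as below. The tokenisation is a simplification, not the preprocessing used in [15].

```python
import math
import re
from collections import Counter

def title_cosine(a, b):
    """Cosine similarity between two torrent titles, computed over
    word-count vectors (a sketch of the 'title-based bundling'
    comparison; release-name noise like codec tags is not stripped
    here, which a real classifier would likely handle)."""
    va = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Titles of the same movie in different releases share most tokens and score close to 1, while unrelated titles score near 0, so a simple threshold can propose bundle candidates automatically.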

Chapter 4

Methodology for gathering and analysing data

In this chapter, we describe the methodology used for gathering and analyzing all the information, as well as the results obtained. First, Section 4.1 presents the system's architecture for gathering data on each torrent file and the peers sharing it. It shows the workflow of the system and presents all the components developed and used. Then, Section 4.2 presents and explains the criteria used and the different choices made in the data analysis. It also shows how long the experiment took, how many swarms were monitored and where the torrent files were obtained.

4.1 System architecture

The monitoring system was composed of three components:

1. Rich Site Summary (RSS) reader script

2. BitTorrent client application

3. Data Storage Server

Figure 4.1 represents the data flow between the different components.

First, the RSS reader script periodically reads RSS feeds from three major BitTorrent search engines: PirateBay1, isohunt2 and btjunkie3. These RSS feeds contain, among other information, a URL for downloading the torrent file, which the RSS reader script used to obtain the file. These torrent files were then stored on a local webserver. While these files were being downloaded by the RSS reader script, several instances of the BitTorrent client application ran on each PlanetLab node. This application was an instrumented version of BitTornado (a Python BitTorrent client application)4. It was modified so

1http://thepiratebay.org/ , last accessed August 2011
2http://isohunt.com/ , last accessed August 2011
3http://btjunkie.org/ , last accessed August 2011
4http://www.bittornado.com/ , last accessed August 2011

Figure 4.1: Work flow of the system.

that it would download the torrent files from the webserver where the RSS reader script put them, read the torrent files and query the corresponding trackers to obtain peer-related information such as swarm size, number of leechers and number of seeders. PlanetLab was found to be very useful in this experiment, since its usage made it possible to gather more diverse information concurrently and to prevent the constant queries made to the trackers from being misinterpreted as a DoS attack. BitTornado was used to query trackers for peers at intervals of approximately 20 minutes. Since BitTornado doesn't support the usage of Distributed Hash Tables (DHTs), tracker queries were the only way to obtain peers for each swarm. After obtaining the peers' IP addresses from the tracker, the application would connect to each peer in order to learn how many pieces it had already downloaded. The application was also modified so as not to download the content described by the torrent files. As for the Data Storage Server, it would download the logs generated by each BitTorrent client application instance and group them by date. This was done twice a day, at twelve-hour intervals. After collecting all the logs, this component processed them to extract specific information, joining all gathered data to create a global snapshot of each swarm. Then, using MaxMind's GeoIP databases, it converted the obtained IP addresses to geographic positions (such as country) and the corresponding ISPs. The RSS reader script component was also responsible for launching new instances of the BitTorrent client application for the newly downloaded torrent files, and for stopping the instances that were monitoring "dead" swarms. To launch new instances, it would select the PlanetLab nodes with the fewest application instances running at that time, to evenly distribute the load over the different nodes. However, for each

torrent file there were always at least 20 nodes monitoring its associated swarm. As for stopping the monitoring process, since both the RSS reader script and the Data Storage Server were running on the same machine, data sharing was easy and the RSS reader script was able to identify the swarms that were not generating relevant information (as explained in Section 4.2). By having a modular architecture, with a single server coordinating all the BitTorrent clients running on all the PlanetLab nodes, the system gathers data much more efficiently and in higher quantity. This experiment ran for almost 3 months, so, in order for the data collected to represent exactly what happened in the network during that time, the system needed to have a very high availability. As for the developed software, most of it was written in Java, Python and shell script. These choices were based on the simplicity of the syntax and the ability to process large data files quickly and easily. Both the BitTorrent client application and the RSS reader script were developed in the Python programming language, whereas shell script was used for the file processing. The Java programming language was only used for the IP to geographic position and ISP mapping.
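The load-spreading rule for launching new monitoring instances can be sketched as below. The data structure and function name are hypothetical; only the "fewest instances first" policy and the minimum of 20 nodes per torrent come from the description above.

```python
import heapq

def assign_nodes(load, min_nodes=20):
    """Pick the PlanetLab nodes that should monitor a new torrent:
    the `min_nodes` nodes currently running the fewest client
    instances, keeping the load evenly spread across nodes.
    `load` maps node -> number of running instances; the chosen
    nodes' counters are incremented to reflect the new assignment."""
    chosen = heapq.nsmallest(min_nodes, load, key=load.get)
    for node in chosen:
        load[node] += 1
    return chosen
```

Because the counters are updated on assignment, repeated calls naturally rotate new torrents onto whichever nodes are currently least loaded.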

4.2 Data analysis methodology

The first step in this study was the download of the BitTorrent torrent files. The files were obtained through a Rich Site Summary (RSS) reader script that read RSS feeds from PirateBay and btjunkie and downloaded the files in each feed. The RSS feeds were followed from the 25th of April 2011 to the 18th of July 2011. The torrent database was primed with PirateBay's one hundred most popular files on April 25th and complemented with isohunt's twenty most popular files for each major category (audio, tv and video) on May 24th. As for the swarm monitoring, a swarm stopped being monitored when the number of peers dropped below 30 and the number of seeders dropped to 0 for a period of a week. This criterion was based on the fact that such swarms were found to be very unpopular, and thus no peer would want to download their content. Since this work was divided into two studies, one regarding the repeated content found to be shared in the network and the other a locality study, different criteria were established for the data analysis of each study. Regarding the repeated content, in order to determine which torrent files represented the same content, or at least very similar content, the torrent files (metainfo) were compared based on the following similarity criteria (in order):

1. Piece size - In order to be able to compare pieces’ hash values, it is necessary for the pieces of both torrent files to have the same length;

2. Overall content size - Torrent files were only compared if either both had the same size or their sizes were within a margin of 5% of each other;

3. Pieces’ hash value - For two pieces to be considered as corresponding to the exact same content, their hash needs to be exactly the same;

4. Number of pieces in common - We decided that files with more than 75% of the total number of their pieces in common should be considered as corresponding to the same or very similar content. These pieces also need to be in order, starting at the beginning of both torrent files.
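The four criteria above can be sketched as a single predicate. This is a minimal illustration, assuming each torrent is reduced to its piece length, total size and ordered list of piece hashes; the thesis does not specify whether the 75% threshold is taken over the smaller or the larger torrent, so the larger is used here as the more conservative choice:

```python
def similar_content(t1, t2, overlap_threshold=0.75, size_margin=0.05):
    """Apply the similarity criteria, in order, to two parsed torrents.

    Each torrent is assumed to be a dict with 'piece_length' (bytes),
    'total_size' (bytes) and 'pieces' (ordered list of 20-byte SHA-1
    piece hashes) -- a simplified stand-in for the real metainfo.
    """
    # 1. Piece size: hashes are only comparable for equal piece lengths.
    if t1['piece_length'] != t2['piece_length']:
        return False
    # 2. Overall content size: sizes must be within 5% of each other.
    if abs(t1['total_size'] - t2['total_size']) > \
            size_margin * min(t1['total_size'], t2['total_size']):
        return False
    # 3 & 4. Piece hashes must match exactly, in order, starting at the
    # beginning of both torrents (a common-prefix comparison).
    common = 0
    for h1, h2 in zip(t1['pieces'], t2['pieces']):
        if h1 != h2:
            break
        common += 1
    # More than 75% of the pieces must be in common.
    return common > overlap_threshold * max(len(t1['pieces']), len(t2['pieces']))
```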

The cut-off value of 75% common pieces was chosen after analysing all the gathered torrent files. Figure 5.4 shows the cumulative distribution function (CDF) of the number of common pieces for every torrent pair combination with at least one common piece. It shows that torrent pairs either have less than 20% or more than 95% of their pieces in common. We chose to focus on the torrents with more pieces in common, where the gains would be larger, thus compensating for the overhead of the torrent similarity discovery process. Although two torrents are determined to be similar without ever analysing their content, it is nearly impossible for so many pieces to have common hash values and yet different content. The torrent file name and corresponding content file name were not criteria for comparing contents, since torrent files often have a serial number or their infohash as name, and the torrents' file content often does not have a name corresponding to its content (for example, a movie might be named movie.avi instead of the actual movie name). However, this data was useful for manually validating a sample of the findings. As for the geographic and ISP information, using MaxMind's GeoIP databases we were able to convert the collected IP addresses into the corresponding country and ISP. The collected IP addresses were aggregated and analysed over periods of one day and periods of two hours. This means that we calculated the existing locality for each swarm based on all IP addresses gathered either in a one-day period or in a two-hour period. We chose these periods based on results obtained from the monitoring and on the average content size, which is expected to affect the download time and thus the peer's lifetime. As we will see in Section 5.3, the swarm size shows a periodic behavior, with a period of approximately one day. As for the two-hour period, it was chosen to show that much locality still exists over a shorter period. This choice also allowed us to show that the existing locality depends on the peer's lifetime and download time. Swarms that never reached a minimum of 50 simultaneous peers or never had a seed online during the monitoring were excluded from the locality study.
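The aggregation described above can be sketched as follows. The GeoIP lookup is stubbed out with a hypothetical hard-coded table (the study used MaxMind's databases); the window width selects between the one-day and two-hour aggregation periods:

```python
from collections import Counter, defaultdict

def ip_to_country(ip):
    """Placeholder for the MaxMind GeoIP lookup used in the study.

    The real implementation queried MaxMind's GeoIP databases; here a
    tiny hard-coded table keeps the sketch self-contained."""
    table = {'188.81.0.1': 'PT', '151.38.0.1': 'IT', '24.0.0.1': 'US'}
    return table.get(ip, '??')

def aggregate_by_period(observations, period_seconds=86400):
    """Group (timestamp, ip) observations into fixed-width windows
    (86400 s for one day, 7200 s for two hours) and count the distinct
    peers seen per country within each window."""
    windows = defaultdict(set)
    for ts, ip in observations:
        windows[ts // period_seconds].add(ip)
    return {w: Counter(ip_to_country(ip) for ip in ips)
            for w, ips in windows.items()}
```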

Chapter 5

Results

In this Chapter, we present and describe our findings. Section 5.1 presents the results related to the content analysis study. Section 5.2 describes the results for the locality study, analysing and describing the locality found in BitTorrent swarms, and Section 5.3 discusses other interesting findings, such as tracker and peers’ behavior.

5.1 Content Analysis

This first part of the study refers to the content analysis. This Section presents the results for the content study, namely the amount of polluted content [7] discovered to be shared in the network and the repeated content shared in different and isolated swarms. We also show some patterns and behaviors related to the content shared in the swarms.

5.1.1 Content pollution

We obtained 3211 torrent files, of which 221 had more than 70% of their own pieces' hash values repeated. These files were excluded from this part of the study since they were deemed as representing no real content, but rather pollution [7]. However, they were included in the locality study, since they were associated with large swarms and thus a great amount of network traffic. The cut-off value of 70% was chosen based on the results shown in Figure 5.1, which represents a CDF with the percentage of unique own pieces for all torrent files. Torrent files that had more than 80% of their own pieces unique were considered legitimate content, since it is common for shared content to include samples or even repeat parts of its own content, for example, repeated scenes within a movie or TV show. Furthermore, the swarms associated with these contents showed a behaviour that led us to believe they were actually pollution. Most of these swarms either had a very small number of peers (below 50) or went from a very high number of peers down to 20% of their maximum size within a day, as if the peers abandoned the download. After reaching this value, these swarms continued to decrease, never increasing in size again, which excludes the day-night behavior described in Section 5.3. Figure 5.2 represents the different swarms' maximum sizes and Figure 5.3 shows the time, in hours, these swarms took to drop to 20% of their maximum size during the monitoring.

Figure 5.1: CDF with the percentage of unique pieces for each torrent file.

As we can see, almost 60% of the swarms sharing polluted content take less than twenty-four hours to go from their maximum size down to 20% of that value. As for the swarms that took more than 100 hours (about 4 days) to drop to 20% of their maximum size, these are associated with a smaller number of peers, from as few as 50 to a few hundred. With fewer peers, the download is expected to take longer, and this is the main reason why these swarms take much longer to disappear. Peers can only be sure a content is pollution after downloading it or after being informed by others that have already finished the download.
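The pollution filter described above can be sketched as follows, flagging torrents whose fraction of unique own piece hashes falls below 30% (i.e. more than 70% repeated). The list-of-hashes representation is a simplification of the real metainfo:

```python
from collections import Counter

def unique_piece_fraction(pieces):
    """Fraction of a torrent's pieces whose hash value is not repeated
    within the torrent itself ('pieces' is the ordered list of piece
    hashes taken from the metainfo)."""
    counts = Counter(pieces)
    unique = sum(1 for h in pieces if counts[h] == 1)
    return unique / len(pieces)

def looks_like_pollution(pieces, cutoff=0.30):
    """Flag torrents with more than 70% repeated own pieces, i.e. fewer
    than 30% unique pieces -- the cut-off used in this study."""
    return unique_piece_fraction(pieces) < cutoff
```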


Figure 5.2: Maximum swarm size and number of seeders for all swarms representing pollution with a maximum swarm size value above 50 peers.

5.1.2 Content repetition

Figure 5.3: Time, in hours, swarms sharing polluted content take to drop to 20% of their maximum size.

After comparing all of the remaining torrent files with each other, we determined the percentage of pieces in common for each torrent pair. Figure 5.4 shows that CDF. As we can observe, most torrent pairs had either between 0% and 16% or between 95% and 100% of all their pieces in common. No torrents were found with 30% to 75% of pieces in common. As for the amount of data in common, Figure 5.5 shows that most torrent pairs have 10 to 200 MB of data in common; however, there is also a significant number of torrent pairs with 400 to 850 MB of data in common. This is the amount of data that could be shared among peers currently in different swarms.


Figure 5.4: CDF with the shared percentage of pieces per number of torrent pairs with at least one common piece.

Figure 5.4 also shows that there is a significant number of torrent pairs that share 100% of their pieces. In this case, we had to compare the infohashes of these pairs to identify the ones that shared the same swarm (pairs with the same infohash) and the ones that had different and isolated swarms (pairs with different infohashes). Of the 66 torrent files found to have at least one other torrent file with 100% equal pieces, none shared the same infohash, which means that the exact same content was being shared over different and isolated swarms. This happens because the contents have different names despite having the exact same piece hashes.

Figure 5.5: CDF with the shared MegaBytes per number of torrent pairs with at least one common piece.
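The infohash comparison above relies on the fact that the infohash is the SHA-1 of the bencoded info dictionary, which includes the content name; renaming a file therefore changes the infohash even when the piece hashes are identical. A minimal sketch, with a simplified bencoder that covers only the types needed here:

```python
import hashlib

def bencode(obj):
    """Minimal bencoder, enough to serialize an 'info' dictionary."""
    if isinstance(obj, int):
        return b'i%de' % obj
    if isinstance(obj, bytes):
        return b'%d:%s' % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b'l' + b''.join(bencode(x) for x in obj) + b'e'
    if isinstance(obj, dict):
        # Bencoded dictionaries must have their keys sorted as raw strings.
        items = sorted((k.encode() if isinstance(k, str) else k, v)
                       for k, v in obj.items())
        return b'd' + b''.join(bencode(k) + bencode(v) for k, v in items) + b'e'
    raise TypeError(type(obj))

def infohash(info_dict):
    """The BitTorrent infohash: SHA-1 of the bencoded 'info' dictionary.
    Two torrents with different 'name' fields get different infohashes
    even when their piece hashes are identical."""
    return hashlib.sha1(bencode(info_dict)).hexdigest()
```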

After comparing the different torrents, we decided that only torrent files sharing more than 75% of their pieces should be considered as referring to similar content. We obtained 2933 torrent files representing unique content and 278 torrent files representing similar content. This means that about 8.6% of all torrent files analysed represented content repeated in other BitTorrent swarms. As stated before, of these 278 torrent files, 66 had at least one other torrent file with 100% equal pieces, and all had unique infohashes, due to the different names assigned to the files being shared.

As mentioned in the introduction, most of the similar torrent files exist due to different publishing teams repackaging content first published by others. Figure 5.6 shows the number of torrent files found to have been published by each team and the number of those found to be repeated. From this graphic we can conclude that the three teams that uploaded the most torrent files account for about 4.8% of all 3211 torrent files analysed and for about 14% of the torrent files found to be repeated. This shows the impact these teams have on the publishing of repeated content. We did not analyse who published what first, as this was irrelevant to our study.

Figure 5.7 shows a histogram representing the content repetition frequency. For example, content that was found to be repeated twice was observed 44 times, which means that there were 88 torrent files referring to a content that was already referred to by exactly one other torrent file. By analysing Figure 5.7, we observe that most content is unique and that most repeated content is repeated only once. As expected, we can also observe that content found to be repeated more than five times tends to be rare. Even so, one content was found to be repeated in 24 torrent files. If these torrent files each had at least one swarm associated, this means that there were at least twenty-four independent swarms sharing the same content. From these results, one may conclude that there is a strong potential to increase availability through the combination of BitTorrent swarms. However, we also need to know how many peers all the swarms corresponding to each content have, in order to determine whether swarm merging would actually be useful.
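Counting repetition frequency amounts to grouping torrents into content classes under the pairwise similarity test of Section 4.2 and counting class sizes (a class of size n means the content is repeated n-1 times). A sketch using union-find, with the similarity predicate left as a parameter:

```python
from collections import Counter

def repetition_histogram(torrents, similar):
    """Group torrents into content classes with union-find, using a
    pairwise similarity predicate, then count how often each class
    size occurs."""
    parent = list(range(len(torrents)))

    def find(i):
        # Find the class representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(torrents)):
        for j in range(i + 1, len(torrents)):
            if similar(torrents[i], torrents[j]):
                parent[find(i)] = find(j)

    class_sizes = Counter(find(i) for i in range(len(torrents)))
    return Counter(class_sizes.values())
```

The `similar` predicate would be the metainfo comparison used in the study; any symmetric pairwise test works.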

Figure 5.6: Number of repeated torrents and total number of torrents published by each team (teams shown: 2HD, ARROW, ASAP, CORE, CTU, DEFACED, DOH, FPM, FQM, IFLIX, IMAGiNE, LOL, MAX, MEM, MOMENT, RELOADED, TASTE, T0XiCiNK, WHOA).


Figure 5.7: Histogram of the content repetition frequency.

Figure 5.8 represents, for each repeated content, the average number of peers per torrent announced by the tracker (points) and the sum of all peers per content (line). Several conclusions can be derived from this graphic. First, there are contents where only one torrent has a swarm with peers. This can be explained in two different ways: there might be much trust in the publisher of that torrent file and very little trust in the publishers of the remaining torrent files representing that same content; or the first torrent file was published much sooner than the remaining ones, making them obsolete when published. In this case, they behave as unique content does, where there is only one swarm per content. Second, the more a content is repeated, the more its peers are distributed over different swarms and the smaller the swarms corresponding to each torrent file for that content are. By analysing both the points and the line in the graphic, we can conclude that, for most of these cases, merging all swarms into one would yield a bigger swarm for that content, and the more peers there are in a swarm, the higher the availability can be. Third, about 17 contents did not have a single peer in any of their swarms.


Figure 5.8: Average number of peers per content.

Figure 5.8 shows us that we can expect a significant growth in the number of peers by combining the different swarms sharing similar content. However, it is also important to analyse the different swarms regarding the number of seeders, since these are the peers that have all the file's pieces available for sharing. Furthermore, as seeders do not download, they are not subject to the tit-for-tat algorithm, providing newly joined peers with their first pieces for trading with others. Figure 5.9 represents, for each repeated content, the average number of seeders per torrent (points) and the sum of all seeders per similar content (line). From this graphic, we can observe that, for most contents, there is only one swarm with seeders. This shows that, by joining all swarms for a given content, we can definitely improve availability. As for the contents with multiple seeded swarms, since the number of seeders grows, availability will also improve.


Figure 5.9: Average number of seeders per content.

5.2 Locality Analysis

This Section presents the locality study. The main reason for the locality study is to determine whether there is enough locality in BitTorrent swarms to be exploited to decrease inter-ISP traffic and overall network traffic. First, Section 5.2.1 shows the growth in locality properties obtained by merging swarms sharing the same or very similar content. Then, Section 5.2.2 presents the existing locality in all swarms. After this, Sections 5.2.3 and 5.2.4 show some locality properties of regional and large swarms and compare the results to those obtained for all torrent files. Popular content is expected to generate the most network traffic, since it is associated with the largest swarms. As for regional swarms, these are the ones sharing content specific to a region, country or language and, for this reason, they show a remarkable locality potential. Finally, Section 5.2.5 presents the existing locality for swarms, aggregating data over periods of two hours.

5.2.1 Repeated content

Increasing the number of peers available for downloading each piece will also increase the opportunities for exploiting locality. By having peers exchange data with other peers within the same ISP or in ISPs with favorable peering agreements, locality algorithms and protocols enable ISPs to reduce their outside connectivity costs, while improving the performance of other applications and even of P2P file exchange [41][4][35]. Having several peers within the same ISP downloading the same content will also increase the hit ratio of eventual caching systems [34].
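A locality-biased neighbor selection of the kind performed by the algorithms cited above can be sketched as follows. This is a minimal illustration, not the mechanism of any specific cited work; the (ip, country, isp) tuple layout and the 20% random share are assumptions made for the sketch:

```python
import random

def pick_neighbors(candidates, my_isp, my_country, k=50, random_share=0.2):
    """Locality-biased neighbor selection: prefer peers in the same
    ISP, then the same country, but keep a share of random remote
    peers so rare pieces outside the local cluster remain reachable.

    'candidates' is a list of (ip, country, isp) tuples."""
    same_isp = [p for p in candidates if p[2] == my_isp]
    same_cc = [p for p in candidates if p[1] == my_country and p[2] != my_isp]
    rest = [p for p in candidates if p[1] != my_country]
    random.shuffle(rest)
    n_random = int(k * random_share)
    # Fill most slots locally, then top up with random remote peers.
    chosen = (same_isp + same_cc)[:k - n_random] + rest[:n_random]
    return chosen[:k]
```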


Figure 5.10: Number of peers per country for similar content.

Figure 5.10 shows the number of peers found to be sharing the same content during one day. Three torrents are shown, which were found to have similar content. For each, the number of peers per country is shown, as is the number of peers per country obtained by combining the three torrents. Each line is sorted in descending order of the number of peers per country. Countries with a single peer are not shown.

Figure 5.11: Number of peers per ISP for similar content.

Table 5.1: Torrent aggregation benefits.

                     Torrent 1  Torrent 2  Torrent 3  Total
Num. Countries              87         68         38    101
Avg. Peers/Country        6.20       5.25       2.61   9.85
Num. ISPs                  238        178         54    342
Avg. Peers/ISP            2.26       2.01       1.83   2.91

We can observe that the number of countries with peers increases, as does the number of peers in each country. This is substantiated by the average values shown in Table 5.1. In countries where internal connectivity is better than external connectivity, the increased national availability should improve download performance. The peers considered were those provided by the trackers to our modified peers. This number is lower than the total number of peers announced by the tracker-provided statistics; as such, we would expect the gains to be even higher. Figure 5.11 shows the same data aggregated by ISP. We also observe an increase in the number of ISPs with peers, but, more significantly, the number of peers per ISP increases noticeably. This will provide ISPs with more internal sources on their networks, creating opportunities for diminishing external BitTorrent traffic. Figure 5.12 provides an overview of the gains in swarm dimension throughout the torrent life and our monitoring period. For each torrent and for each day, we calculated the average number of peers per ISP by itself and by aggregating all the similar content torrents. For each torrent, we show the minimum, maximum and 25th, 50th and 75th percentiles for the increase in the average number of peers per ISP. We notice that, for most torrents, the gains are significant. Only for a few was there a decrease in the number of peers per ISP, which is explained by the small overlap in the ISPs of each similar torrent, causing the increase in the number of ISPs to exceed the increase in the number of peers.

Figure 5.12: Increase in swarm size at each ISP by aggregating similar content.

As we can see from our findings, there is a high percentage of repeated content shared across BitTorrent networks, resulting in multiple swarms that, despite sharing the same content, do not exchange data among themselves. Merging these isolated swarms sharing the same content can, in many cases, improve content availability, as well as the existing locality.

5.2.2 All content

As discussed in Section 5.1, content considered to be pollution was included in the locality study because of its swarm sizes and the amount of data exchanged between peers. Figure 5.2 represents the maximum number of peers for the swarms that represented pollution and the respective maximum number of seeders, in descending order, for swarms with a maximum size above 50 peers. Figure 5.13 shows the content size for all contents considered pollution. As can be observed, 90% of all content has a size between 370 MB and 1 GB. When the swarm sizes and seeder numbers are analysed together with the contents' size, we can conclude that, even though it is polluted content, it generates a great amount of traffic in the network. For this reason, it is very important for this data to be used in this part of the study.

First, we focus on popular content and large swarms. Figure 5.14 shows the relationship between the number of peers obtained from the tracker queries and the average number of peers per country. As expected, the more peers available, the higher the average number of peers per country. The number of peers available from the tracker is related to the swarm size. Therefore, the bigger the swarm, the more locality potential it has. Figure 5.15 shows the same for the average number of peers per ISP. Once again, we get the same result; however, since the number of ISPs is much higher than the number of countries, Figure 5.14 shows a steeper tendency.

These two figures show there is a high locality potential, especially for large swarms. However, they only focus on the average number of peers per country and ISP, showing only a trend. In order to obtain


Figure 5.13: CDF with the content size for all content considered as being pollution.

Figure 5.14: Average number of peers obtained and average number of peers per country for the one-day data aggregation period.

results with stronger statistical value, we need to focus on maximum, minimum and median values. To obtain these results, we calculated, for each torrent file, for every day, the median percentage of the obtained peers that belonged to the same country and ISP. Then, over all the results, we calculated the minimum, maximum and 25th, 50th and 75th percentiles. Figure 5.16 represents the results obtained for all swarms that reached a maximum size between 50 and 500, 500 and 1000, 1000 and 5000, 5000 and 10000, and above 10000 peers. These aggregated swarm classes represent 1053, 107, 188, 125 and 96 torrents, respectively.
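The statistics described above can be sketched as follows. The per-day {country: count} representation and the dominant-country simplification are assumptions of this sketch (the study computed, per torrent-day, the median percentage of obtained peers in the same country over the day's queries):

```python
import statistics

def locality_distribution(daily_counts):
    """For each torrent-day, compute the percentage of obtained peers
    in the dominant country; then summarize all torrent-days with the
    minimum, maximum and 25th/50th/75th percentiles, as plotted per
    swarm-size class in Figure 5.16.

    'daily_counts' is a list of {country: peer_count} dicts, one per
    torrent-day -- a simplified stand-in for the collected data."""
    pct = [100.0 * max(c.values()) / sum(c.values()) for c in daily_counts]
    pct.sort()
    q = statistics.quantiles(pct, n=4)  # 25th, 50th, 75th percentiles
    return {'min': pct[0], 'p25': q[0], 'median': q[1],
            'p75': q[2], 'max': pct[-1]}
```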

Despite identifying a great amount of locality, we did not get the expected results. After analysing Figures 5.14 and 5.15, one would expect the 25th, 50th and 75th percentiles to increase with the swarm size. However, Figure 5.16 shows that this does not happen. The reason is that peers tend to be more evenly distributed over the different countries for larger swarms. In


Figure 5.15: Average number of peers obtained and average number of peers per ISP for the one-day data aggregation period.


Figure 5.16: Distribution of the percentage of the median number of peers that belong to the same country or ISP for the one-day data aggregation period.

Figure 5.16 this translates into a decrease of the median value and the convergence of the 25th and 75th percentile values for the larger swarms. Since peers are more evenly distributed over the different countries, a locality mechanism is expected to find a large enough number of peers to connect to without harming the user experience. The smaller swarms also show locality properties; however, most of these swarms were either regional, as explained later, or their peers belonged to large countries such as the US.

Regarding the ISP peer distribution, Figure 5.16 shows no clear pattern. However, it can be seen that a significant number of peers belong to the same ISP in most swarms. As with the country distribution, small swarms tend to have a high percentage of peers belonging to the same ISP. Comcast, one of the biggest ISPs in the world, was also one of the most frequently observed in these swarms.

5.2.3 Regional content

The results obtained for the smaller swarms motivated a search for regional swarms among them. These swarms represent groups of peers that share content specific to a given region, country or language. Before analysing all the swarms considered regional, we present an example of a regional torrent. Figures 5.17 and 5.18 show the results for a torrent file representing a movie translated into Italian. For a swarm with a maximum size of 2500 peers, Figure 5.17 shows a maximum of almost 1800 peers located in Italy. Throughout the monitoring of this torrent, 95% of all peers obtained from tracker queries belonged to Italy. As for Figure 5.18, it represents the ISPs with an average of more than 10 peers. We obtained 7 Italian ISPs which together represent over 85% of all monitored peers. From the graphic we can also see that there was an ISP that had a maximum of 700 peers in the swarm and a median value of approximately 100 peers. This result clearly indicates a very high locality potential that could be used to decrease inter-ISP traffic by increasing intra-ISP traffic. This torrent is one example of a regional swarm; however, we found several others like it.

Figure 5.17: Countries with an average above 30 peers per day for a regional torrent.

Figure 5.18: ISPs with an average above 10 daily peers for a regional torrent (ISPs shown: Tiscali SpA, IUnet, Infostrada, Fastweb, Telecom Italia, Vodafone Omnitel N.V., Telecom Italia Wireline Services).

For a torrent to be considered regional, we decided that at least 60% of all peers obtained from tracker queries needed to belong to the same country at least 75% of the time. Figure 5.19 represents the minimum, maximum and 25th, 50th and 75th percentiles of all torrents found to match this criterion. As we can see from the results, several torrent files, despite matching the above criterion, had a low minimum value. We attribute these results to a very small number of tracker queries that returned few peers, scattered across many countries. This is probably due to the random algorithm used by the tracker for peer selection. From the previous results, we searched for the torrents that had at least 30% of all obtained peers belonging to the same ISP, at least 75% of the time. Figure 5.20 represents the obtained results.
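The regional classification rule can be sketched as follows, assuming per-query {country: peer_count} dicts and interpreting the rule as requiring the same dominant country across queries (a detail the criterion leaves implicit):

```python
from collections import Counter

def is_regional(query_results, country_share=0.60, time_share=0.75):
    """Classify a torrent as regional: in at least 75% of the tracker
    queries, at least 60% of the returned peers belong to the same
    single country. Returns that country, or None."""
    dominant = Counter()
    for counts in query_results:
        total = sum(counts.values())
        for country, n in counts.items():
            # At most one country can hold >= 60% of a query's peers.
            if n >= country_share * total:
                dominant[country] += 1
    if not dominant:
        return None
    country, hits = dominant.most_common(1)[0]
    return country if hits >= time_share * len(query_results) else None
```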


Figure 5.19: Regional torrents with 60% of all peers belonging to the same country, at least 75% of the times.


Figure 5.20: Regional torrents with 30% of all peers belonging to the same ISP, at least 75% of the times.

Although these results suggest a high gain from the usage of locality mechanisms, we also need to compare the size of these files with the maximum number of peers registered for each swarm. Of the 57 torrent files obtained, more than 35 had over 300 MB and about 15 had over 1 GB, approximately 63% and 8.5%, respectively. As for the swarm size, despite the generally low swarm sizes, we still obtained 5 swarms that reached a maximum size of at least 10000 peers, two of which were sharing a file of over 7 GB. However, 46 torrent files, approximately 80% of all files represented in Figure 5.19, had a maximum swarm size below 500 peers. Despite representing a low percentage of the total number of swarms above 50 peers, these regional torrents show very high median values for the percentage of peers belonging to the same country, which have a great impact on the results obtained in Figure 5.16. If we were to change the cut-off value from 60% to 50%, we would get 81 torrent files with a maximum swarm size between 50 and 500 peers. This represents more than 5.1% of the total number of torrent files with a maximum swarm size above 50 peers. This shows that many small swarms, rather than being unpopular, are popular only within a specific group that shares the same interests, which, most of the time, are related to the country most of the peers in the group are from. From these results, we can already conclude that there is a great amount of locality to be explored in BitTorrent swarms. There is a significant number of small swarms, between 50 and 500 peers, with much locality that can be explored. However, due to the low peer numbers and despite the number of torrent files, these swarms represent a small percentage of the overall network traffic generated by BitTorrent.

5.2.4 Large swarms

We also need to focus on the most popular content, which leads to very large swarms. These swarms are responsible for most of the traffic generated in the BitTorrent network. Figure 5.21 shows the content size and the maximum number of seeders observed for swarms with a maximum number of seeders above 5000. As we can see, of the 150 swarms, approximately 40% reach a maximum of over 9000 seeders. As for size, about 80% had a size above 370 MB and 26% a size above 1 GB. One important aspect to keep in mind is that the total number of completed downloads is expected to be much higher than the maximum number of seeders observed over the course of the experiment, since peers are always entering and leaving the network and not all share what they download.

These large swarms achieve a very high average number of peers per country and ISP and have been shown to have these peers evenly distributed over the different countries and ISPs (Figures 5.14 and 5.15). This is an important aspect, since it enables a locality mechanism to create several country or ISP clusters of peers so that the traffic can be kept closer to the peer's network position. This reduces inter-ISP traffic and, at the same time, since there are many of these clusters with roughly the same, or at least a high, number of peers, user experience should not suffer with the use of a locality mechanism.

5.2.5 Two-hour period

After obtaining all the previous results, we studied the locality for a 2-hour data aggregation period. This means that, instead of aggregating all the peer information obtained over a day, we did it over a period

Figure 5.21: Torrent size and maximum number of seeders for swarms with a maximum number of seeders above 5000.

of two hours and calculated the existing locality. However, we only did this for the larger swarms, since these are the ones where the download time is expected to be lower because of the number of peers with data to upload. This low download time affects the number of peers one can obtain and choose to download from. Figures 5.22 and 5.23 represent the number of peers obtained from the tracker queries and the relation it has with the average number of peers per country and ISP for the two-hour period. As expected, these results are much lower than the ones for the one-day period. This can be justified by the much lower number of peers obtained from the tracker queries and the fact that trackers use a random algorithm to select the active peers for a given swarm. Trackers usually send a minimum of 50 peers and a maximum of 200 peers when queried. There is also the issue of the query time interval. For most trackers, the time interval is between 15 and 20 minutes; however, if a peer requests more than 50 peers, this interval may grow larger. The bigger the query time interval, the lower the number of peers one can obtain in a two-hour period.

Despite the lower results, there is still much locality to be explored, especially for the large size swarms. Figure 5.24 shows that, with a lower period and thus a smaller number of peers obtained, we still achieve better results when compared to Figure 5.16. However, these results also translate into peers not being evenly distributed over the different countries and ISPs. This way, a locality mechanism needs to be aware of the total amount of peers it can obtain during the download time. According to these results, for small size content and low download time, a locality mechanism would achieve very different results for different peers, affecting user experience.

After comparing Figures 5.24 and 5.16, we can conclude that the content’s size and corresponding


Figure 5.22: Average number of peers obtained vs average number of peers per country for the two-hour period data aggregation.


Figure 5.23: Average number of peers obtained vs average number of peers per ISP for the two-hour period data aggregation.

download time will, most definitely, have a great impact on the existing locality. If the content's size is small, the download time will be low as well. This results in a smaller number of peers obtained from the tracker queries, which in turn results in a lower locality value for many peers. The random peer selection algorithm used by the tracker is the main reason why this happens.

From our results, we conclude that a locality mechanism would get better results if it were implemented in the tracker as well, rather than only in the client application. With ever-increasing access speeds, downloads tend to take less time, which limits the number of peers from which the user can choose to download, as shown by the results obtained. This affects the amount of locality a peer can exploit, so even if a client application implements a locality mechanism, it is expected to achieve worse results than if this mechanism were also implemented in the tracker and used instead of the random peer selection algorithm. Because of the tracker's query time interval, we also show that a client would benefit, and most likely increase locality, if DHTs were used, since a higher number of peers can be obtained in a shorter time interval.
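As an illustration of the tracker-side change argued for here, the sketch below (hypothetical data model; a real tracker works on compact peer lists, and `select_peers` is our own name) biases the returned peer set towards the requester's ISP and country before falling back to random remote peers:

```python
import random

def select_peers(swarm, requester_isp, requester_country, n=50):
    """Hypothetical locality-aware replacement for the tracker's random
    peer selection: peers from the same ISP first, then the same country,
    then randomly shuffled remote peers. Each peer is a dict with
    'addr', 'isp' and 'country' keys."""
    same_isp = [p for p in swarm if p["isp"] == requester_isp]
    same_country = [p for p in swarm
                    if p["country"] == requester_country
                    and p["isp"] != requester_isp]
    rest = [p for p in swarm
            if p not in same_isp and p not in same_country]
    random.shuffle(rest)  # keep some diversity among remote peers
    return (same_isp + same_country + rest)[:n]
```

A production design would cap the local fraction rather than exhaust it, so that swarms with few local peers still receive a diverse remote set.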


Figure 5.24: Distribution of the percentage of the median number of peers that belong to the same country or ISP for the two-hour period data aggregation.

5.3 Peer and Tracker behavior

5.3.1 Tracker behavior

From the 3211 torrent files observed, 2701 had an announce list with more than one tracker. However, only for 1094 did we find more than one tracker responding to queries; for the remaining ones, only one tracker from the list responded. Thus, we observed 2517 torrent files that were announced in only one tracker and 1094 that were announced in more than one. Figure 5.25 represents the minimum, maximum and 25th, 50th and 75th percentiles of each tracker's share of the sum of all the trackers' swarm sizes. Of the more than 100 torrents announced in 8 or more trackers, few have the same or similar swarm size on every tracker. These are represented in the figure as having the minimum, maximum and 25th, 50th and 75th percentiles very close to each other. As can be seen, many trackers have very high maximum values or very low minimum values, but have the 25th, 50th and 75th percentiles very close to each other. This means that most of these torrent files, despite being announced on 8 or more trackers, have one or two swarms much larger than the others. This affects the availability and locality for the smaller swarms. It also tells us that peers tend to query only a few of the available trackers, which is the main reason why the swarms have such different sizes.
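The per-torrent statistics plotted in Figure 5.25 amount to a simple computation over the scraped swarm sizes. A sketch with illustrative numbers (one dominant tracker among eight, the pattern observed for most of these torrents; function names are ours):

```python
from statistics import quantiles

def tracker_shares(swarm_sizes):
    """Percentage of the total peer count held by each tracker's swarm."""
    total = sum(swarm_sizes)
    return [100.0 * s / total for s in swarm_sizes]

def summary(shares):
    """Min, 25th/50th/75th percentiles and max of the per-tracker shares,
    i.e. one column of Figure 5.25."""
    q1, q2, q3 = quantiles(shares, n=4)
    return min(shares), q1, q2, q3, max(shares)

# One dominant swarm among 8 trackers:
shares = tracker_shares([5000, 300, 250, 200, 150, 100, 50, 25])
```

With these numbers the maximum share is above 80% while the three middle percentiles stay within a few percent of each other, reproducing the "very high maximum, tightly clustered percentiles" shape described above.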

Figure 5.26 shows the swarm size on every tracker for a given torrent file. As can be seen, the torrent was announced on five different trackers. However, while four trackers hold about the same number of peers, one tracker holds a much smaller number, about a quarter. The smaller swarm is expected to have lower performance than the others, since it has less than half their size. The figure also shows several rapid decreases and increases in the number of peers for some trackers. These represent tracker downtime. Despite happening quite often, the trackers recover very quickly and, since there is always at least one tracker working correctly and/or peers can


use DHTs to find other peers, this downtime doesn't seem to have a significant effect on the peers' ability to find others, even for torrent files announced in only one tracker.

Figure 5.25: Distribution of peers per tracker per torrent, for torrents announced in more than 8 trackers.


Figure 5.26: Swarm size throughout time per tracker, for a video file torrent.

5.3.2 Peer behavior

As for peer behavior, during our experiment we noticed a periodic day-night pattern. This behavior can be observed in Figure 5.26, which shows how the swarm shrinks and grows over periods of approximately a day. Figure 5.27 also shows this behavior for a single non-regional torrent; most swarms behaved the same way. This behavior can be justified by the results in Figure 5.16, where a significant percentage of peers belongs to the same country in most swarms. By analysing all swarms, we observed periodic changes in the swarm size that went from almost 5% to approximately

60%. This behavior is more evident in larger swarms because they tend to have a larger number of peers entering and leaving the network, as observed in Figure 5.14.


Figure 5.27: Day-night behavior for a video file torrent.

Another peer- and swarm-related behavior observed was how the number of seeders and the swarm size relate. Figure 5.28 shows that the number of seeders in a swarm tends to be very close to the total number of peers in the swarm. Each point represents the number of seeders and the swarm size; the line represents the points for which there are only seeders in the swarm. From this figure we can see that free-riding is not significant in BitTorrent swarms and, for this reason, availability increases with the swarm size.


Figure 5.28: Number of seeders per swarm size.

Regarding the swarm size and the content size, Figure 5.29 shows that they are not related whatsoever. The swarm size is only related to the content itself. If content is popular, it is associated with a large swarm, since a great number of peers want that content. If content isn't popular, fewer peers will want it and it will thus be associated with a smaller swarm. However, from this figure we can also see that there is a significant number of torrent files at the 370 MB, 700 MB and approximately 2 GB sizes. These are standard sizes for music albums, TV shows and movies.


Figure 5.29: Swarm size per content size.

Through the continuous monitoring of swarms, we also observed their lifetime. Figure 5.30 shows the lifetime of a popular video file torrent. As expected, the swarm shrinks over time. However, since it was popular video content (one of The Pirate Bay's one hundred most popular files), it takes over 2 months to drop from approximately 23000 peers to about 3000 peers. The figure also shows that no data was collected during the 22nd of June, due to a problem with the tracker queries for this torrent. Figure 5.30 also confirms that free-riding is not an issue in BitTorrent, since the size of the swarm tends to be the same as the number of seeders.


Figure 5.30: Swarm size throughout time.

5.4 Summary

In this chapter, all the experimental results were presented. It is shown that BitTorrent networks have properties in many fields that can be used to improve their performance and minimize their impact on the

ISPs' peering costs. There are many BitTorrent swarms that share equal or very similar content but are isolated from each other. This is due to some of BitTorrent's properties, such as how the torrent file's identifier, the infohash, is generated, which results in the same content having different identifiers just by changing its name. Our measurements and results show that this repeated content in the network is caused by the competition between publisher teams that want to be the first to upload a given content and to have a high reputation among the users. We also show that, by combining these different swarms, we can improve the availability of the repeated content and increase the locality properties of these swarms.

Most BitTorrent swarms showed enough locality that exploiting it would benefit both user and ISP. These locality properties increase with the swarm size, as expected. Even so, small swarms also present locality that can be exploited. Some of the monitored swarms showed locality properties much above the average. These swarms were identified as sharing regional content, which is content specific to a given region, country or language. Despite identifying a great amount of locality, we also show that this locality depends greatly on the number and diversity of peers obtained throughout the download time. If the download time is short, a lower number of peers is expected to be obtained during the download of the file. This can have a negative effect on the existing locality for some peers, since there are few peers from which to download. For this reason, we propose that, when using a locality mechanism, the tracker's random peer selection algorithm should also be changed, so that it is aware of the peer's network position.

The obtained results also show that BitTorrent isn't as unpredictable as one would think. Swarms show a day-night behavior that leads to a fluctuation in the swarm size with a period of approximately one day. This oscillation was more evident for larger swarms; however, it was also observed in medium and small swarms. Another important behavior observed was the willingness of most peers to share what they download, which was observed when comparing the swarm size with the number of seeders for each swarm.
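The infohash issue mentioned above can be made concrete: the infohash is the SHA-1 hash of the bencoded info dictionary of the torrent file, so changing only the name key produces a different infohash, and therefore an isolated swarm, even though the piece hashes are unchanged. A minimal sketch (hand-rolled bencoder covering only the types needed here; the info dictionary is simplified relative to real torrents):

```python
import hashlib

def bencode(obj):
    """Minimal bencoding for the types found in an info dictionary."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):  # keys must be sorted, per the spec
        return (b"d" + b"".join(bencode(k) + bencode(v)
                                for k, v in sorted(obj.items())) + b"e")
    raise TypeError(type(obj))

def infohash(info):
    """SHA-1 of the bencoded info dictionary, as a hex string."""
    return hashlib.sha1(bencode(info)).hexdigest()

pieces = b"\x01" * 20  # one fake 20-byte piece hash
a = {"name": "some-content.iso", "piece length": 262144,
     "pieces": pieces, "length": 262144}
b = dict(a, name="same_content_renamed.iso")  # same data, new name

assert a["pieces"] == b["pieces"]  # identical content pieces...
assert infohash(a) != infohash(b)  # ...but two distinct swarms
```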

Chapter 6

Partial Swarm Merger

In the previous chapter, we identified the sharing of repeated content in the network. This redundancy is caused by competition between publishers who, just by changing the content's name, generate different swarms that are isolated from each other while sharing the same, or at least very similar, content. We also showed that it is possible to join these swarms and increase the availability of the content being shared across them. This chapter presents Partial Swarm Merger, our proposed solution for joining these different, isolated swarms and having peers participate in multiple ones.

6.1 PSM

We envisioned several solutions to take advantage of the repeated content problem, using it to increase content availability:

1. A BitTorrent client modified to search for different torrent files that refer to the same or very similar content, using the torrent names for searches on torrent databases, and to query the different swarms for different pieces. This solution would require modifications to the BitTorrent client application;

2. Trackers which analyse each published torrent file and automatically notify peers, in order to merge swarms within the tracker that share the same or very similar content. Modifications to the BitTorrent tracker application, client and protocol would be needed;

3. A service outside BitTorrent which peers could query for other torrent files with the same or very similar content. This service would maintain an index of torrent files referring to the same content and would supply this information to peers. This would require no modifications to the BitTorrent protocol or to the BitTorrent trackers; only modifications to the BitTorrent client applications would be necessary.

Of these three possible solutions, the simplest to implement and deploy is the third. The first would require the modified BitTorrent client application to search for files with a name similar to the one the peer is downloading and then download all the torrent files so they could be analysed. This process is

slow, could only search for torrents based on the name of the torrent file, not its content, and would introduce great delay and overhead, only being useful for very large files. The second solution, where the tracker would analyse and merge the swarms sharing common pieces, has many disadvantages that make it hard to implement. Most trackers don't keep the torrent files; they just keep an association between peers and the infohash of the swarm they belong to. Another disadvantage is the fact that trackers don't normally communicate or exchange any kind of information with each other. For these reasons, this solution would require major modifications to BitTorrent trackers.

Having this in mind, we propose a solution with an architecture based on a service outside the BitTorrent network, called Partial Swarm Merger (PSM). PSM is a service that peers query for information on other swarms in the BitTorrent network that are, at that time, sharing the same content. With this information, peers participate in multiple swarms but only announce and request the pieces the content they are downloading has in common with the content being shared in the other swarms. This solution requires no modification to the BitTorrent protocol. However, the BitTorrent client application would require an extension or add-on to be able to use the service. By having the service outside the network, we can also have information regarding all swarms from different trackers. As this service is not critical, but rather a performance enhancer, it need not be a part of the core protocol.

Figures 6.1 and 6.2 show the workflow of the service. Figure 6.1 demonstrates how PSM builds and populates its databases with the information from the different torrent files. First, a peer sends a magnet URI1 with the infohash and other relevant information, such as tracker names.
If the PSM service doesn't have that infohash in its databases, it requests the info key from the peer using the distributed hash table (DHT) supported by modern BitTorrent clients [31]. This info key contains the piece hashes, the file name and the piece size. When the peer sends the info key to the PSM service, the latter associates the infohash with all this information. It then analyses and compares that information with the data in its databases and returns the magnet links to any swarms sharing, to some extent, the same content. If the infohash received in the first message is already in the databases, messages two and three are not exchanged. After receiving the magnet links from the PSM service, peers use them to request peers from the trackers and join the other swarms. However, peers only announce and request, from the newly joined swarms, pieces in common with their original swarm. Figure 6.2 represents the workflow of the solution. Peers should periodically contact the PSM server in order to learn of other similar torrents that the PSM server has found since the previous request.

This architecture allows PSM servers to be deployed autonomously, as they don't rely on other services. Ideally, in order to maximise their efficiency, PSM servers should communicate among themselves, to guarantee that they know as many torrents as possible. However, for the purpose of increasing the locality opportunities within a single ISP, an isolated PSM server would produce the optimal result as long as all the peers within that ISP used it.
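The comparison the PSM service performs can be sketched as an overlap computation on the torrents' piece-hash lists (the pieces value is the concatenation of 20-byte SHA-1 hashes; the function names are ours, and a real implementation would also have to verify that the piece sizes match):

```python
def piece_hashes(pieces_blob):
    """Split the concatenated 'pieces' value into 20-byte SHA-1 hashes."""
    return [pieces_blob[i:i + 20] for i in range(0, len(pieces_blob), 20)]

def common_pieces(pieces_a, pieces_b):
    """Indices, in torrent A, of pieces whose hash also appears in B.
    Only these indices would be announced/requested in B's swarm."""
    in_b = set(piece_hashes(pieces_b))
    return [i for i, h in enumerate(piece_hashes(pieces_a)) if h in in_b]

def overlap_ratio(pieces_a, pieces_b):
    """Fraction of A's pieces that B's swarm can also serve."""
    n = len(piece_hashes(pieces_a))
    return len(common_pieces(pieces_a, pieces_b)) / n if n else 0.0
```

The PSM service could index every 20-byte hash it has seen and return, for a new torrent, the magnet links of swarms whose overlap ratio exceeds some threshold.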

1http://magnet-uri.sourceforge.net/ , last accessed August 2011


Figure 6.1: PSM - Populating databases

6.2 Use Case

Besides the opportunities described in the previous section, PSM could also be used in other scenarios. All that is required is for someone packaging different content with common parts into different torrent files to place the common content first and in the same order. For instance, a GNU/Linux distribution supporting different architectures could arrange the common, non-architecture-specific files into the first pieces and place the architecture-specific binaries at the end. Peers downloading different architecture versions could then share the common pieces among themselves. The same principle could be applied by any software publisher whose different products or product versions share common library packages. This small change in publishers' behaviour could make PSM beneficial in more situations.

Another advantage would be more efficient load balancing. Even when content is first published, if it has pieces in common with other versions of the software already being shared in the network, peers are not fully dependent on what the original seeders are sharing. Peers can download the common pieces from other peers in the network and request just the unique pieces from the original seeders. This way, peers download the content faster and become seeders sooner.
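This packaging advice works because pieces are cut at fixed offsets from the start of the concatenated content: with the same piece length and an identical prefix, the leading piece hashes of the two torrents coincide. A small sketch over synthetic data (function name and payloads are illustrative only):

```python
import hashlib

def leading_common_pieces(data_a, data_b, piece_len):
    """Number of leading pieces with identical SHA-1 hashes, as would
    happen when the common files are placed first and in the same order."""
    count = 0
    for off in range(0, min(len(data_a), len(data_b)), piece_len):
        ha = hashlib.sha1(data_a[off:off + piece_len]).digest()
        hb = hashlib.sha1(data_b[off:off + piece_len]).digest()
        if ha != hb:
            break
        count += 1
    return count

common = b"x" * 4096            # shared, non-architecture-specific files
a = common + b"amd64-binaries"  # one torrent's payload
b = common + b"i386-binaries"   # another torrent's payload
print(leading_common_pieces(a, b, 1024))  # prints 4
```

Conversely, if the unique files were placed first, every piece offset past them would shift and no hashes would match, which is why the ordering matters.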


Figure 6.2: PSM’s workflow

Chapter 7

Conclusions and future work

7.1 Conclusions

The increase in inter-ISP traffic caused by Peer-to-Peer communications is becoming more and more of a problem for ISPs because of the cost of these connections. One of the P2P protocols generating the most traffic is BitTorrent, used for file sharing. This protocol, like many P2P protocols, has very distinctive properties that provide it with very high robustness. The best solution for the inter-ISP problem is a locality mechanism: by having peers connect and exchange data with others close to them, inter-ISP traffic can be decreased by increasing intra-ISP traffic. Our study showed that there is a great amount of locality in BitTorrent swarms that can be exploited. Through the monitoring of more than 3200 live Internet swarms, our findings suggest that the usage of a locality mechanism is not expected to have a significant impact on the peers' usage experience. We show that large swarms have a great amount of locality, having peers evenly distributed over the different countries and ISPs. However, we also demonstrate that smaller swarms have a significant percentage of their peers belonging to the same country or ISP, mainly the big ones, such as the US or Comcast. We also found that there is a significant percentage of regional swarms. These are groups of peers that share content specific to a region, country or language, showing a remarkable amount of locality. Our results suggest that a locality mechanism would benefit greatly from being implemented in both the client application and the tracker, due to short download times, which limit a peer's ability to discover nearby peers.

During the experiment, we also found a significant amount of redundancy in the network, where different, isolated swarms share the same or very similar data. The cause of this problem is the competition between the different publisher teams, which want to be the first to publish a specific content and gain reputation within the network.
To solve this problem, we propose a solution named Partial Swarm Merger (PSM), a service outside the BitTorrent network that helps peers find other swarms sharing the same content or just most parts of that content. Peers only announce and request from other swarms the parts that are common to their own swarm. Through this partial swarm merging, the availability of the

redundant content is increased, as is the existing locality for those swarms, as supported by our findings. Finally, we also show that BitTorrent swarms aren't as unpredictable as one would expect. We found a day-night behavior for most swarms and showed that free-riding isn't an issue in BitTorrent networks. All these findings are important information for a locality mechanism developer to keep in mind. We believe that we collected and analysed enough information to show that BitTorrent's performance can be increased, and the traffic it generates kept close to its origin in the network, if a locality mechanism is used.

7.2 Future Work

As for future work, we intend to develop and implement a working version of PSM and then test it. These tests should first be performed in a simulated environment and then in an Internet deployment using instrumented BitTorrent clients on PlanetLab nodes. For the Internet deployment test, a popular BitTorrent client application should be modified, or an extension added, to make the best use of the service. Regarding the locality properties of the swarms, in the future we want to repeat the experiment using a BitTorrent client application that supports both tracker queries and DHTs. This can help us understand the exact number of peers one can obtain during the download time, and thus determine what kind of impact a locality mechanism would have on small torrents and fast downloads.

Bibliography

[1] Vinay Aggarwal, Anja Feldmann, and Christian Scheideler. Can isps and p2p users cooperate for improved performance? SIGCOMM Comput. Commun. Rev., 37:29–40, July 2007.

[2] Satish Balay, Kris Buschelman, Victor Eijkhout, William D. Gropp, Dinesh Kaushik, Matthew G. Knepley, Lois Curfman McInnes, Barry F. Smith, and Hong Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 2.3.0, Argonne National Laboratory, 2004.

[3] R. Bindal, Pei Cao, W. Chan, J. Medved, G. Suwala, T. Bates, and A. Zhang. Improving traffic locality in bittorrent via biased neighbor selection. In ICDCS 2006. 26th IEEE International Conference on Distributed Computing Systems, pages 66–66, July 2006.

[4] Ruchir Bindal, Pei Cao, William Chan, Jan Medval, George Suwala, Tony Bates, and Amy Zhangan. Improving Traffic Locality in BitTorrent via Biased Neighbor Selection. In Proceedings of the 26th International Conference on Distributed Computing Systems (ICDCS 2006), Lisboa, Portugal, July 2006.

[5] Bo Liu, Yi Cui, Yansheng Lu, and Yuan Xue. Locality-awareness in bittorrent-like p2p applications. IEEE Transactions on Multimedia, 11:361–371, April 2009.

[6] David R. Choffnes and Fabián E. Bustamante. Taming the torrent: a practical approach to reducing cross-isp traffic in peer-to-peer systems. In Proceedings of the ACM SIGCOMM 2008 conference on Data communication, SIGCOMM '08, pages 363–374, New York, NY, USA, 2008. ACM.

[7] Nicolas Christin, Andreas S. Weigend, and John Chuang. Content availability, pollution and poisoning in file sharing peer-to-peer networks. In Proceedings of the 6th ACM conference on Electronic commerce, EC '05, pages 68–77, New York, NY, USA, 2005. ACM.

[8] Brent Chun, David Culler, Timothy Roscoe, Andy Bavier, Larry Peterson, Mike Wawrzoniak, and Mic Bowman. Planetlab: an overlay testbed for broad-coverage services. SIGCOMM Comput. Commun. Rev., 33:3–12, July 2003.

[9] Bram Cohen. Incentives Build Robustness in BitTorrent. In Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley, USA, June 2003.

[10] Ruben Cuevas, Michal Kryczka, Angel Cuevas, Sebastian Kaune, Carmen Guerrero, and Reza

Rejaie. Is content publishing in bittorrent altruistic or profit-driven? In Proceedings of the 6th International Conference, Co-NEXT '10, New York, NY, USA, 2010. ACM.

[11] Rubén Cuevas, Nikolaos Laoutaris, Xiaoyuan Yang, Georgos Siganos, and Pablo Rodriguez. Deep diving into bittorrent locality. In Proceedings of the ACM SIGMETRICS international conference on Measurement and modeling of computer systems, SIGMETRICS '10, pages 349–350, New York, NY, USA, 2010. ACM.

[12] G. Dan and G. Carlsson. Dynamic swarm management for improved bittorrent performance. In Proceedings of the 8th international conference on Peer-to-peer systems, IPTPS’09, Berkeley, CA, USA, 2009. USENIX Association.

[13] Frank Dabek, Russ Cox, M. Frans Kaashoek, and Robert Morris. Vivaldi: a decentralized network coordinate system. In Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM 2004, 2004.

[14] Lei Guo, Songqing Chen, Zhen Xiao, Enhua Tan, Xiaoning Ding, and Xiaodong Zhang. Measurements, analysis, and modeling of bittorrent-like systems, 2005.

[15] Jinyoung Han, Taejoong Chung, Hyunchul Kim, Ted ”Taekyoung” Kwon, and Yanghee Choi. Systematic support for content bundling in bittorrent swarming. In INFOCOM IEEE Conference on Computer Communications Workshops, 2010, 2010.

[16] Jinyoung Han, Taejoong Chung, Seungbae Kim, Hyunchul Kim, Ted ”Taekyoung” Kwon, and Yanghee Choi. An empirical study on content bundling in bittorrent swarming system. In CoRR, 2010.

[17] Tobias Hoßfeld, David Hock, Simon Oechsner, Frank Lehrieder, Z. Despotovic, W. Kellerer, and M. Michel. Measurement of bittorrent swarms and their as topologies. Technical Report 464, Institut für Informatik, November 2009.

[18] D. Horovitz, S. Dolev. Liteload: Content unaware routing for localizing p2p protocols. In IPDPS 2008. IEEE International Symposium on Parallel and Distributed Processing, pages 1 – 8, April 2008.

[19] Daniel Hughes, Geoff Coulson, and James Walkerdine. Free riding on gnutella revisited: The bell tolls? IEEE Distributed Systems Online, 6:1–, June 2005.

[20] Antony Jameson, Niles A. Pierce, and Luigi Martinelli. Optimum aerodynamic design using the Navier–Stokes equations. In Theoretical and Computational Fluid Dynamics, volume 10, pages 213–237. Springer-Verlag GmbH, January 1998.

[21] Thomas Karagiannis, Pablo Rodriguez, and Konstantina Papagiannaki. Should internet service providers fear peer-assisted content distribution? In Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement, IMC ’05, pages 6–6, Berkeley, CA, USA, 2005. USENIX Association.

[22] Balachander Krishnamurthy and Jia Wang. On network-aware clustering of web clients. In Proceedings of the conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '00, pages 97–110, New York, NY, USA, 2000. ACM.

[23] M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. A. Hamra, and L. Garces-Erice. Dissecting bittorrent: Five months in a torrent's lifetime. In Proceedings of Passive and Active Measurements (PAM), 2004.

[24] Harsha V. Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon, Thomas Anderson, Arvind Krishnamurthy, and Arun Venkataramani. iplane: an information plane for distributed services. In Proceedings of the 7th symposium on Operating systems design and implementation, OSDI '06, pages 367–380, Berkeley, CA, USA, 2006. USENIX Association.

[25] Marcel Dischinger, Massimiliano Marcon, Saikat Guha, Krishna P. Gummadi, Ratul Mahajan, and Stefan Saroiu. Glasnost: enabling end users to detect traffic differentiation. In NSDI'10, Proceedings of the 7th USENIX conference on Networked systems design and implementation, April 2010.

[26] A. C. Marta, C. A. Mader, J. R. R. A. Martins, E. van der Weide, and J. J. Alonso. A methodology for the development of discrete adjoint solvers using automatic differentiation tools. International Journal of Computational Fluid Dynamics, 21(9–10):307–327, October 2007.

[27] Andre C. Marta, Sriram Shankaran, D. Graham Holmes, and Alexander Stein. Development of adjoint solvers for engineering gradient-based turbomachinery design applications. In Proceedings of the ASME Turbo Expo 2009: Power for Land, Sea and Air, number GT2009-59297, June 2009.

[28] Joaquim R. R. A. Martins, Juan J. Alonso, and James J. Reuther. High-fidelity aerostructural design optimization of a supersonic business jet. Journal of Aircraft, 41(3):523–530, May 2004.

[29] Daniel S. Menasche, Antonio A.A. Rocha, Bin Li, Don Towsley, and Arun Venkataramani. Content availability and bundling in swarming systems. In Proceedings of the 5th international conference on Emerging networking experiments and technologies, CoNEXT ’09, New York, NY, USA, 2009. ACM.

[30] Michael Piatek, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, and Thomas Anderson. Pitfalls for isp-friendly p2p design. In Proceedings of the 8th ACM Workshop on Hot Topics in Networks (HotNets'09), October 2009.

[31] G. Neglia, G. Reina, Honggang Zhang, D. Towsley, A. Venkataramani, and J. Danaher. Availability in bittorrent systems. In INFOCOM 2007. 26th IEEE International Conference on Computer Communications, pages 2216–2224, Anchorage, USA, May 2007.

[32] Jorge Nocedal and Stephen J. Wright. Numerical optimization. Springer, 1999.

[33] S. Oechsner, F. Lehrieder, T. Hossfeld, F. Metzger, D. Staehle, and K. Pussep. Pushing the performance of biased neighbor selection through biased unchoking. In P2P '09. IEEE Ninth International Conference on Peer-to-Peer Computing, pages 301–310, September 2009.

[34] PeerApp. Comparing P2P Solutions. White Paper, March 2007.

[35] Ricardo Lopes Pereira, Teresa Vazão, and Rodrigo Rodrigues. Adaptive Search Radius - Lowering Internet P2P File-Sharing Traffic through Self-Restraint. In Proceedings of the 6th IEEE International Symposium on Network Computing and Applications (IEEE NCA07), Cambridge, USA, July 2007.

[36] Sandvine. Meeting the challenge of today’s evasive p2p traffic, 2004.

[37] Hendrik Schulze and Klaus Mochalski. Internet Study 2008/2009, 2009.

[38] Dennis Schwerdel, Daniel Günther, Robert Henjes, Bernd Reuther, and Paul Müller. German-lab experimental facility. In Proceedings of the Third future internet conference on Future internet, FIS'10, pages 1–10, Berlin, Heidelberg, 2010. Springer-Verlag.

[39] J. Seedorf, S. Kiesel, and M. Stiemerling. Traffic localization for p2p-applications: The alto approach. In P2P '09. IEEE Ninth International Conference on Peer-to-Peer Computing, 2009, pages 171–177, September 2009.

[40] Shansi Ren, Enhua Tan, Tian Luo, Songqing Chen, Lei Guo, and Xiaodong Zhang. Topbt: A topology-aware and infrastructure-independent bittorrent client. In 2010 Proceedings IEEE INFOCOM, pages 1–9, March 2010.

[41] Ao-Jan Su, David R. Choffnes, Aleksandar Kuzmanovic, and Fabián E. Bustamante. Drafting Behind Akamai (Travelocity-Based Detouring). In Proceedings of the 2006 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM 2006), Pisa, Italy, September 2006.

[42] Haiyang Wang, Jiangchuan Liu, and Ke Xu. On the locality of bittorrent-based video file swarming. In Proceedings of the 8th international conference on Peer-to-peer systems, IPTPS’09, pages 12– 12, Berkeley, CA, USA, 2009. USENIX Association.

[43] Xiaowei Chen, Xiaowen Chu, Yixin Jiang, and Fengyuan Ren. Measurements, analysis and modeling of private tracker sites. In 2010 18th International Workshop on Quality of Service (IWQoS), pages 1–2, August 2010.

[44] Haiyong Xie, Y. Richard Yang, Arvind Krishnamurthy, Yanbin Grace Liu, and Abraham Silberschatz. P4p: provider portal for applications. In Proceedings of the ACM SIGCOMM 2008 conference on Data communication, SIGCOMM ’08, pages 351–362, New York, NY, USA, 2008. ACM.
