<<

CUEVAS LAYOUT 8/23/11 9:38 AM Page 144

TOPICS IN NETWORK TESTING Measuring the BitTorrent Ecosystem: Techniques, Tips, and Tricks

Michal Kryczka, University Carlos III Madrid and Institute IMDEA Networks Rubén Cuevas, Carmen Guerrero, and Arturo Azcorra, University Carlos III Madrid Angel Cuevas, Institut Telecom, Telecom SudParis

ABSTRACT far; however, to the best of the authors’ knowl- edge there is no document that compiles, BitTorrent is the most successful peer-to-peer describes, and classifies these techniques. In this application. In the last years the research com- article we first describe the main aspects and munity has studied the BitTorrent ecosystem by functionality of the complete BitTorrent ecosys- collecting data from real BitTorrent swarms tem. Afterward, we present a survey of the exist- using different measurement techniques. In this ing BitTorrent measurement techniques. Finally, article we present the first survey of these tech- we describe the main challenges these tech- niques that constitutes a first step in the design niques face and possible solutions to them before of future measurement techniques and tools for concluding the article. analyzing large-scale systems. The techniques are classified into macroscopic, microscopic, and IT ORRENT COSYSTEM complementary. Macroscopic techniques allow B T E us to collect aggregated information of torrents BitTorrent [5] is the name used by Brian Cohen and present very high scalability, able to monitor to define the peer-to-peer file protocol up to hundreds of thousands of torrents in short he designed a decade ago. Due to the great suc- periods of time. Microscopic techniques operate cess of this protocol, a complex system was cre- at the peer level and focus on understanding ated around it. In this article we adopt the performance aspects such as the peers’ down- terminology used by Zhang et al. [16] to refer to load rates. They offer higher granularity but do this complex system as the BitTorrent ecosystem. not scale as well as macroscopic techniques. In this section we describe the main players of Finally, complementary techniques utilize recent this ecosystem as well as its functionality. This is extensions to the BitTorrent protocol in order to summarized in Fig. 1. obtain both aggregated and peer-level informa- tion. The article also summarizes the main chal- DESCRIPTION OF lenges faced by the research community to BITTORRENT FUNCTIONAL ELEMENTS accurately measure the BitTorrent ecosystem such as accurately identifying peers and estimat- •A BitTorrent portal is a server into which ing peers’ upload rates. Furthermore, we provide content publishers upload .torrent files and Bit- possible solutions to address the described chal- Torrent clients download those .torrent files. lenges. •A BitTorrent swarm is formed by a set of peers downloading a given content using the Bit- INTRODUCTION Torrent protocol. •A BitTorrent tracker is a server that main- BitTorrent [5] is the most successful peer-to- tains a list of clients forming the BitTorrent peer application. Indeed, it is respon- swarm associated with given content. Further- sible for a major portion of the Internet traffic more, the tracker is aware of the download [10] and is used daily by dozens of millions progress of each peer within the swarm. of users. This has attracted the interest of the •A BitTorrent or peer is an entity that research community, which has thoroughly eval- participates in a BitTorrent swarm by download- uated the performance and demographic aspects ing and/or uploading pieces of the content. Two of BitTorrent. Due to the complexity of the sys- categories of peers may be distinguished: A seed- tem, the most relevant studies have tried to er is a client that has a complete copy of the con- understand different aspects by performing real tent, and thus only uploads pieces to other peers. measurements of BitTorrent swarms in the wild, A leecher is a client that does not have a com- that is, inferring information from real swarms in plete copy of the content, and thus uploads and real time. downloads pieces to and from other peers, Several techniques have been used in order respectively. to measure different aspects of BitTorrent so •A . is a meta-information file asso-

144 0163-6804/11/$25.00 © 2011 IEEE IEEE Communications Magazine • September 2011 CUEVAS LAYOUT 8/23/11 9:38 AM Page 145

ciated with content shared through BitTorrent. The .torrent file includes the following informa- User connects to tracker using hash tion: content name, file size, number and size of Public bit torrent of the swarm from .torrent file. Tracker identifies swarm and responds the pieces that form the content (named chunks), with random pool of IP’s which torrent infohash (an identifier that uniquely Tracker participate in swarm. identifies the swarm associated to the .torrent file) and IP address(es) of the Tracker(s) man- aging the swarm associated to the file. Swarm UBLISHING ONTENT IN IT ORRENT Seeder Leecher [82%] Leecher [21%] Leecher [0%] P C B T [100%] In order to make available content C in Bit- Torrent, the content publisher creates a .- C User connects with rent file associated with . After creating the peers and downloads/ .torrent file, the content publisher uploads it to uploads chunks User User downloads Bit torrent a BitTorrent portal. A detailed analysis of the .torrent file portal BitTorrent content publishing phenomenon connected with content he is can be found at [6]. There are a few BitTorrent interested in. portals such as The Pirate Bay1 indexing mil- lions of torrents and receiving millions of daily Figure 1. BitTorrent Ecosystem basic functionality: i) the BitTorrent client con- visits. These portals are critical for the BitTor- tacts a BitTorrent portal to download the .torrent file associated to the desired rent ecosystem as demonstrated by Zhang et al. content (the .torrent file includes the IP address of the Tracker managing the [16]. They offer detailed information regarding swarm associated to the desired content); ii) the BitTorrent client contacts the each indexed torrent. This information slightly Tracker that provides the IP addresses of a set of peers within the swarm; and varies from one portal to another, but in gen- iii) the BitTorrent client connects to these peers for downloading the content. eral it includes category of the content, num- ber of associated files, size of the whole content in the torrent, complete name of the file, BITTORRENT DELIVERY PROCEDURE uploading date, username who uploaded the torrent, number of seeders and leechers partic- In BitTorrent two peers communicate using the ipating in the torrent swarm (this data is updat- peer wire protocol. Every communication starts ed every few minutes), and a description text with an initial handshake. Just after the handshake giving more detailed information regarding the sequence is completed (and before any other mes- content. Figure 2 shows an example of a tor- sages are sent), the peers exchange bitfields by rent web page from portal. using a BITFIELD message. The bitfield indicates Finally, it is worth noting that some of these which chunks of the file a peer has already down- major portals offer a Really Simple Syndica- loaded. Furthermore, every time a peer gets a new tion (RSS) feed to announce new published chunk, it informs its neighbors by using a HAVE torrents. message. Hence, every peer is aware of the chunks each neighbor has at any moment. JOINING A BITTORRENT SWARM AND BitTorrent uses Tit-for-Tat as the incentive DISCOVERING PEERS model for the delivery mechanism; basically, each leecher uploads chunks to those other leechers When a BitTorrent user wants to download a from whom it is downloading more chunks. The given content C, it looks for the .torrent file Choking Algorithm is responsible for providing associated with C in a BitTorrent portal and this behavior. It is a periodical operation where downloads it. The .torrent file can be opened every 10 s a leecher selects (unchokes) n other with any of the existing BitTorrent clients.2 leechers from its neighborhood to upload chunks Upon opening the .torrent file, the BitTorrent to. These n (typically 4) unchoked leechers are client connects to one of the trackers included in those from whom the peer downloaded more this one. A new peer first contacts the tracker chunks during the last 20 s. The rest of the using an announce started request that is neighbors are blocked (choked). In the case of a answered by the tracker with the number of seeder, it unchokes n (typically 4) leechers to seeders and leechers participating in the swarm whom more chunks it uploaded in the last 20 s along with the IP addresses of N (between 40 (i.e., those with higher download rate).4 In addi- 1 This is currently the and 200) randomly selected peers. These N tion to the regular unchoke operation, BitTor- largest BitTorrent portal peers form the initial neighborhood of the new rent implements the optimistic unchoke based on Alexa ranking. node. Furthermore, if a peer’s neighborhood operation. Every 30 s (i.e., every 3 regular size falls below a given threshold (typically 20), it unchoke operations) both leechers and seeders 2 http://en.wikipedia.org/ again sends an announce started request to the randomly select one choked neighbor to upload wiki/Comparison_of_ tracker in order to get new neighbors. Finally, chunks to. Finally, when a leecher is unchoked BitTorrent_clients when a peer leaves the swarm, it sends an by a neighbor, it applies the Rarest First Policy in announce stopped request to the tracker that order to choose which chunk to request of this 3 http://www.openbittor- removes this peer from the list of participants in neighbor. Since leechers have full knowledge rent.org and the swarm. about the availability of every chunk in their http://www.publicbt.org It is worth noting that, in practice, the Bit- neighborhood, it always requests the rarest one. Torrent ecosystem relies on a few trackers that 4 Note that different Bit- manage a large number (up to a few million) of BITTORRENT EXTENSION Torrent clients may imple- torrents in parallel. The OpenBitTorrent and Several extensions to BitTorrent have been pro- ment different variations PublicBitTorrent trackers3 are currently the posed so far. Here we just mention those rele- of the explained unchoke most important ones. vant to measurement studies. algorithms.

IEEE Communications Magazine • September 2011 145 CUEVAS LAYOUT 8/23/11 9:38 AM Page 146

Distributed hash table (DHT): Trackers are a mance information. A summary of different single point of failure in the BitTorrent ecosys- techniques is presented in Table 1. tem. Indeed, they are typically threatened by legal actions. The BitTorrent developers reacted MACROSCOPIC TECHNIQUES to this by designing a trackerless mechanism that The main objective of these techniques is under- allows a BitTorrent user to learn the IP address- standing the demographics of the BitTorrent es of peers without contacting the tracker. This ecosystem: the type of content published, the mechanism is based on a DHT [15]. popularity of this content, the distribution of Bit- (PEX): This is a simple gossip- Torrent users per country (or Internet service ing protocol used to get IP addresses of peers providers, ISPs), the relevance of the different participating in the swarm. In more detail, PEX portals and trackers, and so on. Furthermore, works as follows: a given peer P sends a PEX the macroscopic measurements also allow some request to one of its neighbors (e.g., N). If N sup- performance aspects to be studied, such as the ports PEX, it responds with the list of IP address- ratio of seeders/leechers, the session time of Bit- es of all its neighbors. Hence, by using a few Torrent users, the arrival rate of peers, the seed- PEX queries a given peer can learn the IP less state (period the torrent is without a seeder) addresses of a large number of participants in the duration, and so on. swarm without requesting them from the tracker. We classify the macroscopic techniques into two subcategories: BitTorrent portals crawling and BitTorrent trackers crawling. TECHNIQUES FOR MEASURING THE IT ORRENT COSYSTEM BitTorrent Portals Crawling — As shown ear- B T E lier, the (major) BitTorrent portals index mil- In this section we describe the BitTorrent mea- lions of torrents in a structured way. surement techniques defined in the literature so Furthermore, they provide detailed information far. We classify them into two main categories, about each indexed torrent (typically) in a spe- macroscopic and microscopic, depending on the cific torrent web page. For instance, in the case retrieved information. The former obtains demo- of The Pirate Bay, the torrent web page associat- graphic and high-level performance information, ed with a torrent with an assigned torrent-id whereas the latter gathers peer-level perfor- equal to i can be accessed through the URL http://thepiratebay.org/torrent/i (Fig. 2). Hence, once we know the id assigned to a given torrent in The Pirate Bay, we just need to access its web page and parse it (using an html parser) to retrieve the torrent information. However, in order to analyze the demographics of BitTorrent we need to crawl a large number of torrents. Next we describe two types of crawling tech- niques that can be used in order to systematical- ly crawl up to millions of torrents from a specific portal (we use The Pirate Bay as our example).

Backward Crawling — In this case the aim is to retrieve the information associated with the alive torrents published in The Pirate Bay from a given past date to the current instant. For this purpose the crawler sequentially parses all the torrents’ web pages from the last published tor- rent (http://thepiratebay.org/torrent/last_tor- rent_id/) decreasing up to the first torrent published on the target date, for instance, with torrent id k (http://thepiratebay.org/torrent/k/). The last published torrent-id can be identified either manually or using the RSS feed.

Upward Crawling — In this case the aim is to retrieve the information associated with every torrent published in The Pirate Bay from now through a given time (e.g., one month). In this case, each new torrent will be assigned a torrent- id that can be learned from the RSS feed. We will use these learned torrent-ids to crawl the torrents’ web pages. Figure 2. Example of a Pirate Bay torrent web page. HTML parsing techniques By post-processing the retrieved data from can retrieve the following information: content name (Predators 2010 R5), BitTorrent portal crawling, very relevant aspects content category and subcategory (video and movies), number of files (3), size of BitTorrent ecosystem demographics can be of the whole content (1.36 Gbytes), language (English), upload date (2010-09- characterized. Next we describe a few represen- 24), username uploading the .torrent file (cgaurav007), current number of tative examples. We refer the reader to [16] for seeders and leechers (4535 and 6671), and a text box with further information a detailed analysis of the BitTorrent ecosystem regarding the content. demographics.

146 IEEE Communications Magazine • September 2011 CUEVAS LAYOUT 8/23/11 9:38 AM Page 147

Property Portal crawling Tracker crawling Peers crawler Own client/plugin

Category Macroscopic Macroscopic Microscopic Microscopic

Demographics and High Type of information Torrents Peer Level Performance Peer Level Performance level performance

Cost of crawler preparation Low Medium High High

Scalability Very High High Medium Medium-Low

Obtained details Basic Medium Advanced Very Advanced

Completeness of torrent pop- — High Very High Low ulation

Table 1. Comparison of main BitTorrent measurement techniques.

Content popularity distribution: For this pur- crawler that implements part of the BitTor- pose we obtain the number of leechers and seed- rent protocol to communicate with the ers for each specific torrent from the tracker. More specifically, this crawler html-parsing. Note that if we want to study the works as follows. First, we define the list of evolution of popularity for a given torrent we torrents whose participants’ IP addresses have to periodically parse its web page to we want to obtain. This list of torrents can retrieve the evolution of the torrent population be retrieved, for instance, from a BitTor- (i.e., number of leechers and seeders). rent portal. For each torrent in the list, the Distribution of number of published content crawler performs an initial announce start- per category and subcategory: For this purpose ed request to the correspondent tracker. we obtain the category and subcategory for each From this request the crawler retrieves the specific torrent from the html parsing. number of participants (seeders and leech- Torrents Publishing Rate per date: For this ers) in the swarm and an initial list of IP purpose we obtain the date when each specific addresses. Afterward, the crawler performs torrent was uploaded from the html parsing. as many announce started requests as need- By applying the described measurement study ed to obtain as many IP addresses as the to different portals, we can perform a compara- number of participants in the swarm. tive study of the relevance of these portals in the Hence, by using any of the previous tech- BitTorrent ecosystem. niques we are able to collect the IP addresses of Finally, by tracking the evolution of the num- the participants in a large number of torrents. ber of seeders and leechers for a given torrent, This data allows us to study some relevant demo- we can also infer some performance metrics graphics and BitTorrent performance features. such as the seeder-to-leecher ratio and its evolu- Next, we briefly describe some of them: tion along time. The distribution of clients per country or ISP: Some studies have applied the described BitTorrent Tracker Crawling — The crawling of crawling technique to a large number (even mil- a BitTorrent portal gives detailed information lions) of torrents [16]. Afterward, the IP address regarding the torrents (type, publishers) and some of each client is mapped to its country and ISP aggregated numbers such as the number of seeders (e.g., using the MaxMind database [1]). From and leechers. However, this does not suffice if we this data we can compute the distribution of Bit- aim to study more detailed demographics parame- Torrent users per country and/or ISP. ters such as the distribution of BitTorrent users per Heavy hitters: By doing a cross-torrent country (or ISP) or relevant performance aspects inspection we can find those users (IP addresses) such as peers arrival rate and peers session time. In present in a large number of torrents [3]. We order to study these issues we need to collect the call these users heavy hitters. IP addresses of the peers participating in the BitTorrent traffic: The authors of [7] per- swarms. This can be obtained from trackers formed the described crawling technique in a (remember that a tracker managing a given swarm short period of time (90 min) over the most knows the IP addresses of all the participants). recent 40,000 torrents announced by a BitTor- There are various ways to access the informa- rent portal. This can be viewed as a snapshot of tion of a tracker (i.e., IP addresses of partici- a portion of the BitTorrent ecosystem. By com- pants in the swarms managed by the tracker): puting the traffic flowing between the BitTorrent •Getting access to the tracker logs [12]. This clients in the different torrents, the authors esti- requires the tracker owner’s collaboration. mate the intra-ISP and inter-ISP traffic generat- •Using a tracker where the information is ed by BitTorrent in a large number of ISPs. publicly available [9]. Unfortunately, only Peers’ arrival rate and session time: If we minor trackers offer this functionality. apply any of the described techniques periodical- •Using measurement techniques (i.e., crawl- ly on a given torrent, we are able to continuously ing the Tracker as depicted in Fig. 3). In monitor the peers participating in the torrent. this case we need to use a BitTorrent Therefore for each single user (i.e., IP address)

IEEE Communications Magazine • September 2011 147 CUEVAS LAYOUT 8/23/11 9:38 AM Page 148

To perform Micro-

scopic techniques we PublicBit torrent need to implement different parts of the BitTorrent peer wire Crawler BitTorrent portal Tracker protocol. Any Microscopic Crawler Get torrent file i has to implement the functions to i Fetch torrent file perform the hand- Get tracker shaking procedure. managing swarm This is essential to associated to torrent i connect to other Announce started req peers.

Number of seeders, number of leechers, n IPs (at random) Announce started req

N IPs (at random)

Send as many announce started requests as needed until obtain a number of different IPs equal to number of seeders + number of leechers

Figure 3. BitTorrent tracker crawler basic functionality: The BitTorrent crawler retrieves the .torrent file from a BitTorrent portal and obtains the IP address of the tracker managing the swarm from it. Afterward, it sends as many announce started requests as needed until it obtains the IP addresses of all the peers partici- pating in the swarm.

we can approximately determine the instant in literature to measure the most important peer- which it joins and leaves the torrent, thus being level performance aspects of BitTorrent. able to define the session time for each user. Furthermore by looking at the time between the Peer Type — After the handshake procedure subsequent arrivals of peers we can infer the succeeds with a peer, this one immediately sends arrival rate. Authors of [9, 13] have performed a BITFIELD message to the crawler. By analyz- this analysis on a large number of torrents. ing the bitfield, the crawler classifies the peer as seeder or leecher [13, 14]. MICROSCOPIC TECHNIQUES Furthermore, when using an active crawler The described macroscopic techniques exclusively there are some peers that do not respond to the retrieve the peers’ IP addresses, so only metrics crawler’s handshake messages. These peers are associated with the presence/absence of the peer typically located behind network address transla- can be studied. Unfortunately, an IP address tion (NAT) or a firewall that prevents the estab- does not suffice to infer relevant performance lishment of incoming connections. Thus, these metrics at the peer level such as peers’ download peers are classified as NATed [13, 14]. In order and upload rate. For this purpose we need to to infer if a NATed peer is a seeder or a leecher, apply more sophisticated (but less scalable) tech- we need to apply passive techniques and wait niques that we call microscopic techniques. until the peer contacts the crawler. To perform microscopic techniques we need to implement different parts of the BitTorrent Instantaneous Download Rate — After the peer wire protocol. Any microscopic crawler has handshake procedure is completed (either pas- to implement the functions to perform the hand- sively or actively), the crawler waits until it shake procedure. This is essential to connect to receives two HAVE messages from a given peer. other peers. The handshake procedure can be The size of the chunk (e.g., 4 Mbytes) used in a done actively (the crawler initiates it) or passive- given torrent is well known.5 Furthermore, the ly (the crawler waits until a peer starts the hand- crawler measures the time between the reception shake). Once the crawler is connected to a peer, of these two consecutive HAVE messages from it exploits different messages of the peer wire a peer, which is approximately the time needed 5 This information is protocol in order to measure different parame- to download a chunk. Hence, by dividing the size available in the .torrent ters. This process is illustrated in Fig. 4. Next we of the chunk by the time needed to download it, file. describe the specific techniques proposed in the we can infer the instantaneous download rate of

148 IEEE Communications Magazine • September 2011 CUEVAS LAYOUT 8/23/11 9:38 AM Page 149

When using an

Public BitTorrent active Crawler there are some peers that do not respond to Peer Crawler the Crawler’s BitTorrent portal Tracker (seeder or leecher) handshake mes-

Get Torrent file sages. These peers i are typically located i behind a NAT or a Fetch Torrent file firewall that prevents Get tracker managing swarm the establishment associated to torrent i of incoming announce started req connections.

#seeders, #leechers, N IPs (at random) For each retrieved IP performs the following operations handshake request

handshake response

Connection established with the peer

Once the connection is established, different types of messages are exchanged

BITFIELD msg HAVE msg

Figure 4. BitTorrent peer crawler basic functionality: The crawler retrieves the IP addresses of peers partici- pating in a given swarm as explained in the macroscopic tracker crawling technique. Afterward, the crawler contacts each individual peer, performs the handshake procedure, and exchanges different messages (BIT- FIELD, HAVE) to obtain different peer-level performance information.

the peer. By repeating this operation periodical- properly measured the upload bandwidth. ly, we can obtain the evolution of the instanta- Rather, some have measured some parameters neous download rate of a given peer [14]. related to the upload rate. On one hand, Isdal et al. [11] measure the physical upload capacity (* Average Download Rate — In this case the upload rate dedicated to BitTorrent). For this crawler connects to a peer, obtains its bitfield, purpose, the authors implement a passive crawler and disconnects. After some time (e.g., 1 h) the that measures the peers’ upload capacity using crawler repeats the same operation on the same the chunks sent by these peers to the crawler peer. Then, by comparing the two bitfields, we during optimistic unchokes. On the other hand, can compute the number of downloaded chunks Siganos et al. [14] measure the number of IP between the two connections to the peer. Since packets sent by a node. For this purpose, the we know the size of each chunk (S), the number authors implement an active technique that uses of downloaded chunks (D), and the time a special type of Internet Control Message Pro- between the two connections to the peer (T), we tocol (ICMP) message. The peers’ answer to this can easily compute the peer’s average download ICMP packet includes the number of IP packets rate as (S * D)/T. sent since the last time the computer was switched on. Hence, this crawler sends two of Upload Rate — This is probably the hardest these ICMP packets separated a given time T. parameter to measure. Indeed, to the best of the The answers to the first and second ICMP mes- authors’ knowledge there is no work that has sages indicate a number of packets equal to P1

IEEE Communications Magazine • September 2011 149 CUEVAS LAYOUT 8/23/11 9:38 AM Page 150

and P2. Therefore the rate of IP packets sent by thousand torrents in parallel. This means at least An important aspect the peer is computed as (P2-P1)/T. Note that one order of magnitude less than the macroscop- this rate includes IP packets associated with Bit- ic counterpart. to consider is the Torrent as well as with other applications. Moreover, we have briefly defined a set of simplicity of the complementary techniques. On the one side, the measurement tool. Chunk Distribution (Rarest First Perfor- usage of DHT and/or PEX to learn the IP mance) — An important aspect of the BitTor- addresses of the peers participating within a In this case, rent delivery mechanism is the Rarest First swarm compete directly with the traditional measurement tools Algorithm. In order to study its performance we technique of learning the peers from the tracker. have to analyze how the distribution of the num- PEX and DHT allow the measurement software based on traditional ber of available copies of each chunk in a swarm to speed up the IP addresses collection and tracker crawling are looks like. For this purpose we implemented a eliminate the risk of being blacklisted by the simpler than crawler that collects the bitfields of a large num- tracker. However, the tracker provides relevant ber of peers in a swarm (ideally all) in a relative information such as the number of peers partici- enhanced versions short period of time (a few minutes). By analyz- pating in the swarm; thus, even when using PEX that implement a ing the collected bitfields we achieve our objec- and DHT to learn peers, it is strongly recom- tive: computing the number of available copies mended that the measurement software query PEX and/or DHT of each chunk in the swarm and calculating its the tracker regularly in order to have an estima- module. distribution. We performed this study in [13], tion of the the number of peers participating in demonstrating that the Rarest First Algorithm the swarm. An important aspect to consider is guarantees a uniform distribution of pieces. the simplicity of the measurement tool. In this case, measurement tools based on traditional COMPLEMENTARY TECHNIQUES tracker crawling are simpler than enhanced ver- Some researchers have used measurement tech- sions that implement a PEX and/or DHT mod- niques that can complement the above described ule. macroscopic and microscopic techniques. On On the other side, the collection of data one hand, some crawlers [13, 14] have imple- based on a specific client or plug-in implementa- mented the DHT and/or PEX functionalities in tion competes directly with the microscopic tech- order to learn the IP addresses of the peers par- niques. Using a client that reports logs to a ticipating in a given swarm. On the other hand, server is the most accurate method to measure some research groups have implemented their the activity of a peer (download rate, upload own client [2] or a plug-in for a popular BitTor- rate, etc.) and surely provides more accurate rent client such as [4]. These clients (or results than the traditional microscopic tech- plug-ins) report information to a log server. This niques. On the downside, the scalability of this technique complements the microscopic mea- technique is limited to the number of clients surement mechanisms since it gives very accu- running the BitTorrent client (or plug-in), which rate information regarding peer-level means we have a partial view of the analyzed performance parameters; for instance, it can torrent. Moreover, the retrieved data is only rep- provide precise information about a peer’s down- resentative of a specific client with a specific load and upload rates. implementation, so the obtained results may not be generalized to other clients. Traditional TECHNIQUE COMPARISON microscopic techniques lack the level of accuracy In this subsection we compare the different mea- of client-based measurement but offer better surement techniques introduced above, stating scalability and coverage of the analyzed torrents. the pros and cons of each. Macroscopic tech- Finally, it is important to highlight that the niques are the most scalable, allowing us to ana- described techniques are not necessarily exclu- lyze up to hundreds of thousands of torrents. sive. Therefore, it is strongly recommended to These techniques are valid to retrieve: perform a study of the required data to be col- •Aggregated information at the swarm level lected and then decide which of the described •Specific information regarding the presence techniques the measurement software has to of a peer (represented by the IP+port) in a implement. given swarm Therefore, they are useful to characterize impor- HALLENGES tant information such as the content publishing C phenomenon (i.e., which users are responsible In this section we enumerate the main chal- for making available the content shared through lenges faced by the previously described tech- BitTorrent), popularity distribution of torrents, niques as well as possible solutions for some of seeder/leecher ratio, peers’ session time, and them. cross-torrent interactions (e.g., peers participat- ing in multiple torrents). However, these tech- Peer Identification niques cannot provide information at the peer level (e.g., a peer’s download progress, download Description — In BitTorrent the peers do not rate, chunk distribution) since this requires con- have a permanent Peer-ID. Every time a BitTor- tacting the peer. Microscopic techniques were rent client is started, a new random Peer-ID is designed to perform peer-level analysis. For this generated. Thus, it is not possible to follow a purpose the measurement software connects peer across multiple sessions using its Peer-ID. periodically to a large number of peers (poten- Most of the studies performed so far utilize the tially to all the peers participating in the set of IP address or IP address+port to identify a sin- analyzed torrents). This makes microscopic tech- gle user across multiple sessions. This works for niques to scale up to analyze (at most) a few all those users having static IP addresses. How-

150 IEEE Communications Magazine • September 2011 CUEVAS LAYOUT 8/23/11 9:38 AM Page 151

ever, most BitTorrent users are residential users address. Note that TOR is used by tens of thou- with a dynamic IP address that is frequently sands of BitTorrent clients in order to preserve In the case of changed by their ISP. Hence, identifying these their privacy while downloading content through peers by their IP addresses introduces inaccura- BitTorrent. This is well known by the trackers’ microscopic cies in the obtained data. Furthermore, in the administrators that do not ban the TOR proxies’ measurements the current Internet a single IP address may be IP addresses. Furthermore, the load created by crawler always shared by multiple users located behind NAT the BitTorrent measurement tools in the TOR [8], so using the IP address to identify a peer proxies is low compared to that created by the performs the hand- may lead to wrongly mapping several users as a tens of thousands of BitTorrent clients using this shake procedure single peer. service. Finally, we can increase the rate of requests with the target peer. Possible Solutions — One way of guaranteeing to the tracker using several instances of the Afterward, it the correct identification of a peer across ses- crawler distributed among different machines retrieves the needed sions is using measurement techniques based on with different IP addresses. the implementation of your own BitTorrent information, and can client/plug-in. Each installed instance of your Crawler’s IP Address Blacklisted by the then either stay client is assigned a unique and permanent ID Client (different than the Peer-ID used in the swarms) connected or which is used by the client to report the logs to Description — In the case of microscopic mea- disconnect and the log server. Another option is to get access to surements the crawler always performs the hand- reconnect after a the logs of private trackers. In most private shake procedure with the target peer. Afterward, trackers the users are required to register with a it retrieves the needed information (e.g., the bit- while. username and password. Each time a user initi- field), and can then either stay connected or dis- ates a session in the tracker it has to log in, and connect and reconnect after a while. In the first thus can be uniquely identified across sessions. case, since our crawler does not provide any Unfortunately, both described techniques have chunk to the peer, due to the optimistic connect scalability limitations. On the other hand, identi- algorithm implemented by the most important fying a peer by the combination of IP+port typi- BitTorrent clients, the peer is likely to substitute cally eliminates the problem of wrongly mapping the crawler by another peer in its neighborhood. users with the same IP address as a single peer. Once the crawler has been removed from the BitTorrent clients typically select a random port peer’s neighborhood, it is typically hard to recon- to operate, so it is unlikely that two peers behind nect since the peer recognizes the crawler as a a NAT select the same port. useless peer. In the second case, after the crawler connects and disconnects from a given peer a Crawler’s IP Address Banned by the Track- few times (two or three), this peer also blacklists er the crawler’s IP address. The IP addresses in the blacklist have a timer associated with them; after Description — The described macroscopic this timer expires, an IP address is removed tracker crawling technique may cause the from the blacklist. This means that the crawler crawler’s IP address to be banned by the tracker. can contact a given peer in intervals * blacklist In some studies [3, 6, 16] the crawler performs timer.7 large-scale crawling by continuously sending announce started requests to a specific tracker Possible Solution — When we want to monitor for a large number (e.g., tens of thousands) of the peers with a higher resolution than that torrents. Then the rate of announce started imposed by the peer’s blacklist timer, the solu- requests is very high, which is detected by the tion is using several instances of our crawler, tracker. The reaction of the tracker is to block each with a different IP address, and contact a the IP address showing this anomalous behavior. given peer following a round-robin schedule Therefore, the crawler has to limit its announce [13]. We could also use TOR; if two connections started requests rate to avoid being banned by to the same destination are at least 10 min apart, the tracker. TOR establishes a new overlay path with a new egress node. Thus, TOR guarantees a 10 min Possible Solutions — LeBlond et al. [3] resolution. describe a technique to avoid being banned while keeping a very high rate of announce start- Completeness of a Torrent Population ed requests. The technique consists of sending an announce stopped just after the announce Description — BitTorrent developers have started request. Then the tracker removes the IP recently implemented magnet links. Basically, address of the crawler from its log just after these are ids that allow a peer to learn IP answering the announce started request. By addresses of peers participating in the swarm using this simple technique, the authors report directly from the DHT service without connect- that they are able to crawl up to 750,000 torrents ing to the tracker. Therefore, the tracker is in around 30 min. unaware of the presence of these peers. 6 http://www.torproject. A second option is using an anonymization Although clients using magnet links to access a org/ service such as TOR.6 By using this service, the swarm are still a minority, this brings some diffi- messages sent by the crawler pass through an culties in obtaining the complete list of peers 7 This timer value varies overlay of proxies before reaching the tracker. participating within specific swarms because we among different clients. A Then the IP address seen by the tracker is that need to obtain those that are available from conservative estimate of the egress node from the proxies overlay, so tracker information and those that are available based on our studies is the tracker cannot block the actual crawler’s IP from the DHT service (note that some peers will two hours.

IEEE Communications Magazine • September 2011 151 CUEVAS LAYOUT 8/23/11 9:38 AM Page 152

be available from both). Unfortunately, this Measurements,” PAM, 2007. problem is even more complicated. Some tor- [12] M. Izal et al., “Dissecting BitTorrent: Five Months in a We believe that the Torrent’s Lifetime,” Proc. PAM ’04. rents are associated with multiple trackers that described techniques [13] S. Kaune et al., “Unraveling BitTorrent’s File Unavail- form separated swarms. In short, to retrieve the ability: Measurements, Analysis and Solution Explo- can constitute whole set of peers downloading given content, ration,” IEEE P2P ’10, 2010. we should collect the peers from each of the [14] G. Siganos, X. Yang, and P. Rodriguez, “Apollo: the basis for Remotely Monitoring the Bittorrent World,” tech. rep.; trackers and those available through the DHT http://research.tid.es/georgos/images/apollo_imc09., the design of service. Telefonica Research, 2009. measurement tools [15] M. Steiner and E. W. Biersack, “Crawling Azureus,” Possible Solution — In order to retrieve the tech. rep., Institut Eurecom; http://www.eurecom.fr/ for the analysis of ~btroup/BPublished/RR-08-223.pdf, 2008. whole set of peers downloading given content, [16] C. Zhang et al., “Unraveling the BitTorrent Ecosystem,” current and future we need to crawl all the trackers included in the IEEE Trans. Parallel and Distrib. Sys., 2010. large-scale systems .torrent file. Furthermore, we have to retrieve the list of peers that use the DHT instead of BIOGRAPHIES in the Internet, using a tracker. This crawling can be quite costly MICHAL KRYCZKA received his M.Sc. in computer science as well as other since some torrents can use tens of trackers. from the Technical University at Lodz, Poland, and his M.Sc. in telematics engineering from the University Carlos environments. Upload Rate Estimation III of Madrid, Spain, in 2008 and 2009, respectively. Since 2008 he is a research assistant at Institute IMDEA Net- works and a Ph.D. candidate in the Telematic Engineering Description — We have discussed above the Department at University Carlos III of Madrid. His current difficulties in measuring the upload rate and research lines include peer-to-peer technologies, Internet what other parameters have been measured as measurements, social networking, and content distribution. an approximation of the upload rate so far. RUBÉN CUEVAS obtained his Ph.D. and M.Sc. in telematics engineering and M.Sc. in telecommunications engineering Possible Solutions — The only available tech- at University Carlos III of Madrid in 2010, 2007, and 2005, nique that allows to accurately measure the respectively. Furthermore, he received his M.Sc. in network upload rate of a peer is the one based in our own planning and management from Aalborg University, Den- mark, in 2006. He is a member of the Telematics Engineer- BitTorrent client (or plugging) implementation. ing Department and NETCOM research group at University Carlos III of Madrid since 2006, where he is currently an assistant professor. He has been involved in several interna- CONCLUSION tional and national research projects. Moreover, he is coau- thor of more than 20 papers in prestigious international In this article we have presented and classified journals and conferences such as IEEE Network, Elsevier the main measurement techniques applied in , ACM CoNEXT, IEEE INFOCOM, IEEE order to understand different aspects of one of P2P, IEEE ICC, and IEEE VTC. His main research interests are the largest-scale systems in the current Internet, content distribution, P2P networks, online social networks, and Internet measurements. BitTorrent. We believe that the described tech- niques can constitute the basis for the design of ANGEL CUEVAS received his Master’s in telecommunication measurement tools for the analysis of current engineering, and M.Sc. and Ph.D. in telematics engineering and future large-scale systems in the Internet, as from the University Carlos III of Madrid in 2006, 2007, and 2011, respectively. He is currently a postdoc researcher at well as other environments. Institut Telecom SudParis in the Department of Wireless Networks and Multimedia Services. His research is focused ACKNOWLEDGMENT on the areas of sensor networks, peer-to-peer systems, and The research leading to these results has social networking. He has published more than 20 papers in high-quality international conferences and journals. He received funding from the European Union Sev- was recipient of the Best Paper award at the ACM Interna- enth Framework Program (FP7/2007-2013) tional Conference on Modeling, Analysis and Simulation of through the TREND NoE project (grant agree- Wireless and Mobile Systems 2010. ment no. 25774) and the Regional Government CARMEN GUERRERO received her Telecommunication Engi- of Madrid through the MEDIANET project (S- neering degree in 1994 from the Technical University of 2009/TIC-1468). Madrid (UPM), Spain, and her Ph.D. in computer science in 1998 from the Universidade da Coruna (UDC), Spain. She is currently associate professor since 2003 at Universidad Car- REFERENCES los III de Madrid, teaching computer networks courses. She [1] MaxMind-GeoIP, http://www.maxmind.com/app/ip-loca- has been involved in several national and international tion. research projects related to green networking, future Inter- [2] , http://www.tribler.org. net, and content distribution. More details are available at [3] S. Le Blond et al., “Spying the World from Your Lap- http://www.it.uc3m.es/carmen. top,” LEET ’10, 2010. [4] D. R. Choffnes and F. E. Bustamante, “Taming the Tor- ARTURO AZCORRA received his M.Sc. degree in telecommuni- rent: A Practical Approach to Reducing Cross-ISP traffic cations engineering from UPM in 1986 and his Ph.D. from in Peer-to-Peer Systems,” SIGCOMM Comp. Commun. the same university in 1989. In 1993 he obtained an Rev., vol. 38, no. 4, 2008, pp. 363–74. M.B.A. from the Instituto de Empresa. He has recently [5] B. Cohen, “Incentives Build Robustness in BitTorrent,” been appointed director general at the Centre for the Proc. 1st Wksp. Economics of Peer-to-Peer Sys., Berke- Development of Industrial Technology (CDTI), an agency of ley, CA, June 2003. the Spanish Ministry of Science and Innovation. As a result, [6] R. Cuevas et al., “Is Content Publishing in Bittorrent he is on leave from his double appointment as full profes- Altruistic or Profit-Driven?,” ACM CoNEXT 2010, 2010. sor (with chair) at the Telematics Engineering Department [7] R. Cuevas et al., “Deep Diving Into Bittorrent Locality,” of University Carlos III of Madrid and director of IMDEA Proc. IEEE INFOCOM ’11, 2011. Networks. He has participated in and directed 49 research [8] M. Ford et al., “Issues with IP Address Sharing,” draft- and technological development projects, including Euro- ietf-intarea-shared-addressing-issues-05, 2011. pean ESPRIT, RACE, ACTS, and IST programs. He has served [9] L. Guo et al., “Measurements, Analysis, and Modeling as a Program Committee member in many international of Bittorrent-Like Systems,” Proc. ACM IMC ’05. conferences, including several editions of IEEE PROMS, [10] Ipoque, Ipoque Internet Study 2007, http://www. IDMS, QofIS, CoNEXT, and IEEE INFOCOM. He has pub- ipoque.com/userfiles/file/internet_study_2007.pdf. lished over 100 scientific papers in books, international [11] T. Isdal et al., “Leveraging Bittorrent for End Host magazines, and conferences.

152 IEEE Communications Magazine • September 2011