Rainbow: a Robust and Versatile Measurement Tool for Kademlia-based DHT Networks

Xiangtao Liu∗†, Tao Meng∗, Kai Cai∗, and Xueqi Cheng∗ ∗Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 100080 †Graduate University, Chinese Academy of Sciences, Beijing, China, 100049 Email: {liuxiangtao, mengtao, caikai}@.ict.ac.cn, [email protected]

Abstract: In recent years, peer-to-peer (P2P) file sharing applications have dominated Internet traffic volumes, and among them BitTorrent and eMule constitute the majority. BitTorrent and eMule deploy their distributed networks based on Kademlia, a robust distributed hash table (DHT) protocol, to facilitate the delivery of content. Kademlia-based DHT networks have intrigued researchers in the P2P community to measure and analyze them. However, to the best of our knowledge, there is still no well-designed crawler for carrying out intensive measurement and analysis on them. In this paper, we develop Rainbow, a robust and versatile crawler for Kademlia-based DHT networks. For the first time, we theoretically analyze its convergence (a main issue of robustness), that is, we show that Rainbow can complete the crawl within a limited time. Our analysis can also be applied to other P2P crawlers with the same sampling nature. Finally, we demonstrate that Rainbow can be applied as a versatile measurement tool to identify various characteristics of Kademlia-based DHT networks at a deep level.

Index Terms: Peer-to-peer, Kademlia, measurement, Rainbow, convergence.

I. INTRODUCTION

In today's Internet, peer-to-peer (P2P) file sharing applications have become increasingly popular [1]. According to the 2008/2009 Internet traffic report of Ipoque [2], 43% to 70% of Internet traffic (e.g. 43% in Northern Africa and 70% in Eastern Europe) comes from P2P applications and services. Among P2P traffic, BitTorrent and eMule constitute the majority. Specifically, BitTorrent accounts for 30% to 81% (e.g. 30% in South America and 81% in Eastern Europe) and eMule accounts for up to 47%.

Kademlia is a robust distributed hash table (DHT) protocol designed by Maymounkov and Mazières [3]. In this paper, Kademlia-based DHT networks refer to the DHT networks which are implemented and deployed by P2P applications based on Kademlia. Both BitTorrent and eMule have deployed their own Kademlia-based DHT networks to facilitate the delivery of content. These networks are called BTDHT and KAD in BitTorrent and eMule, respectively. Each peer in BTDHT/KAD has an identifier (or ID), which is randomly generated using a hash function. Usually a BTDHT ID is 160 bits in length, and a KAD ID is 128 bits in length.

The wide use of Kademlia-based DHT networks has intrigued many researchers in the P2P community to measure them. The usual solution is to develop crawlers, which access the online peers in these DHT networks and record their information for statistics. Representative crawlers include Cruiser [4] and Blizzard [5]. However, none of these crawlers can be directly used to measure both BTDHT and KAD at a deep level. This is because of the distributed nature of these DHT networks (i.e., they have no central directory servers), which makes it difficult to obtain detailed information about the peers in the networks.

In this study, we develop Rainbow, a robust and versatile crawler for Kademlia-based DHT networks. We theoretically analyze the convergence (i.e., whether the crawler can complete its task within a limited time, which is also a main issue of robustness) of P2P crawlers that have the same sampling nature as Rainbow. Moreover, we demonstrate that Rainbow can be applied as a versatile measurement tool to identify various characteristics of BTDHT and KAD at a deep level. Our primary contributions are listed below.
• We develop Rainbow, a well-designed crawler for Kademlia-based DHT networks. Compared with previous P2P crawlers (e.g. Cruiser [4] and Blizzard [5]), it is robust and versatile.
• For the first time, we analyze the convergence (a main issue of robustness) of Rainbow. Our analysis can also be applied to other P2P crawlers with the same sampling nature.
• We add a new module, the peer information gatherer, to Rainbow. This module enables Rainbow to measure various detailed characteristics. In particular, we find that the popularity of the files in KAD fits a power law distribution $f(x) \sim x^{-\alpha}$ with $\alpha = 0.6732$.

The rest of the paper is organized as follows. Section II introduces related work. In Section III, we present the framework of Rainbow and its crawling algorithm. In Section IV, we theoretically prove the convergence of Rainbow. In Section V, we demonstrate how Rainbow can be applied as a versatile tool to measure and analyze BTDHT and KAD. Finally, we conclude our work in Section VI.

II. RELATED WORK

The measurement of P2P networks has intrigued many researchers [4] [5] [6] [7] [8] [9] [10] [11] [12]. The measurement tools (i.e., the P2P crawlers) can work in three different modes: k-bit zone crawl, full crawl and random crawl. Firstly, the k-bit zone crawl, often abbreviated as the zone crawl, only collects the peers which share the same k-bit BTDHT/KAD ID prefix. It has the advantage of a short processing time. For example, Blizzard [5] takes only 2.5 seconds to complete an 8-bit zone crawl. However, it cannot obtain a complete snapshot, e.g., an 8-bit zone crawl can only collect the information of 1/256 of all peers. Secondly, the full crawl aims at collecting information about all peers, but it has the disadvantage of needing more time to complete. This can distort the trace data, because peer churn (i.e., the phenomenon that peers join or leave a network frequently) may lead to the overlapping of peers in a snapshot, that is, peers which have already left the P2P network would be mistakenly deemed online in the current snapshot. Therefore, some researchers (e.g. [4] [5]) prefer to use the trace data of the zone crawl rather than that of the full crawl. Nevertheless, in Section V-A, we measure some characteristics of BTDHT and KAD using the full crawl, considering that the full crawl may reveal some important properties even though the trace data is partly distorted. Thirdly, the random crawl randomly collects a part of all peers.

Kademlia-based DHT networks include Overnet, BTDHT and KAD. Among them, Overnet has been measured by many researchers such as [6] [7]. Bhagwan et al. [6] measured the peer availability (i.e., the degree to which a peer is online) of Overnet. They found that one peer might have multiple IP addresses at different times, which they called "IP aliasing". They also pointed out that IP aliasing could lead to the underestimation of peer availability. Kutzner and Fuhrmann [7] measured Overnet for two weeks in many aspects, such as network size, peer availability, peer distribution and message delays.

For BTDHT, much measurement work has been carried out. For instance, Falkner et al. [8] measured the Azureus DHT, an implementation of the Kademlia protocol by a BitTorrent client named Azureus. They focused on measuring the session time of peers and the overheads of peers during bootstrapping. Furthermore, Sadafal [9] implemented a BTDHT crawler and statistically analyzed its trace data, focusing on characteristics such as the lifetime of peers and the preference of peers towards special torrents.

For KAD, Stutzbach et al. [10] proposed an analysis framework to compute lookup performance. They developed two measurement tools, kFetch and kLookup, to collect data for computing the parameters of lookup performance. Stutzbach et al. [4] measured the peer churn of three P2P file sharing networks, Gnutella, KAD and BitTorrent, using the modified crawler "Cruiser", which was originally used in Gnutella. Steiner et al. [5] [11] [12] developed the crawler "Blizzard" and measured KAD for 179 days for common characteristics such as geographic distribution, session time and peer availability.

Although many measurement tools have been proposed for the Kademlia-based DHT networks (Overnet, BTDHT and KAD), they mainly focused on common characteristics such as peer availability and geographic distribution. Furthermore, each of them could only be used for one particular P2P network, and could only collect limited information. In this study, we develop Rainbow, a well-designed P2P crawler for the widely used Kademlia-based DHT networks BTDHT and KAD, considering that Overnet is unavailable due to legal issues. Compared with previous P2P crawlers, Rainbow will be shown to be robust and versatile.

In [5], Steiner et al. considered the convergence of their crawler. They regarded the crawler as converged when the UDP query messages had been sent to 99% of the currently collected peers. This point of view seems reasonable, but it requires an accurate estimate of the total peer number $N$ in the system. For the first time, we theoretically analyze this convergence problem of P2P crawlers. Note that our analysis is based on Rainbow, but it can be applied to other P2P crawlers with the same sampling nature.

III. RAINBOW

A. The Framework of Rainbow

The framework of Rainbow is shown in Fig. 1. It comprises three modules: the peer crawler, the peer information gatherer and the writing module. The peer crawler adopts an iterative crawling method. Specifically, it starts the crawl by sending UDP query messages (footnote 1) to some initial peers; after a while, it acquires a batch of new peers by parsing the UDP response messages (footnote 2); next it queries those newly acquired peers to obtain the next batch of new peers, and so on. Using this method, the peer crawler collects partial information (e.g. IP and UDP port) of a peer set $S_{UDP}$. The peer information gatherer attempts to collect more detailed information (e.g. the client version) of the peers in $S_{UDP}$ through TCP communication. The writing module writes the information of the peers in $S_{UDP}$ into a database or log files.

Fig. 1. The framework of Rainbow. (Components: the BTDHT/KAD network, initial peers, the peer crawler using UDP communication, the peer information gatherer using TCP communication, and the writing module, whose output is the database table "PeerInfo" or a txt format log file.)

Footnote 1: In BTDHT, the UDP query message is find_node for the zone/full/random crawl; in KAD, the UDP query message is kademlia_req when carrying out the zone crawl and bootstrap_req when carrying out the full/random crawl.

Footnote 2: In BTDHT, the response message to find_node is find_node_res, which usually returns 8 peers; in KAD, the response message to kademlia_req (or bootstrap_req) is kademlia_res (or bootstrap_res), which usually returns 11 (or 20) peers.

B. The Crawling Algorithm of Rainbow

Some of the data structures of Rainbow are listed below.
• qelem: {IP, UDP port, TCP port};
• key: {IP, UDP port, BTDHT/KAD ID};
• status: {UDP_Requested, UDP_Responded, TCP_Requested, TCP_Responded, TCP_Cantcon};
• peer: {IP, UDP port, TCP port, BTDHT/KAD ID, status, client version, ...};
• shared queue qpeersUDP/qpeersTCP: queue of qelem elements;
• shared map mpeers: map of <key, peer> elements;
• UDP_over/TCP_over: the Boolean variable that ensures the convergence of UDP/TCP communication.

We develop Rainbow by customizing the rTorrent and eMule clients. The three modules presented above are implemented in five threads, as shown in Alg. 1 to Alg. 5. Among them, the UDP send thread (Alg. 1) and the UDP receive thread (Alg. 2) correspond to the peer crawler; the TCP send thread (Alg. 3) and the TCP receive thread (Alg. 4) correspond to the peer information gatherer; the writing thread (Alg. 5) corresponds to the writing module.

Algorithm 2 UDP receive thread
Input: UDP response message
Output: updated qpeersUDP, qpeersTCP and mpeers
 1: while true do
 2:   if UDP_over then
 3:     exit thread
 4:   wait for UDP response message
 5:   if wait(UDP response message) > Ti then
 6:     sample some peers from mpeers and send UDP query messages to them {incentive mechanism}
 7:   for all p ∈ UDP response message do
 8:     qelem ← p.info, key ← p.info, peer ← p.info
 9:     peer.status ← UDP_Responded
10:     if key ∉ mpeers then
11:       mpeers.add(<key, peer>)
12:       qpeersUDP.add(qelem)
13:       qpeersTCP.add(qelem)
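The bookkeeping described above (a queue of peers still to be queried, and a map that keeps only unseen peers) can be sketched in a few lines. This is a minimal single-threaded illustration, not Rainbow's implementation: the in-memory `build_network` stand-in, the function names, and the integer peer IDs are all assumptions made for the example, and timeouts, the incentive mechanism and the TCP stage are omitted.

```python
import random
from collections import deque


def build_network(num_peers, fanout, seed=0):
    """Hypothetical stand-in for a DHT: every peer answers a query
    with `fanout` peers chosen at random from the whole network."""
    rng = random.Random(seed)
    ids = list(range(num_peers))
    return {p: rng.sample(ids, fanout) for p in ids}


def crawl(network, initial_peers, target):
    """Iterative crawl: query queued peers and keep only unseen peers.

    Mirrors the roles of qpeersUDP (queue) and mpeers (map); stops once
    at least `target` peers are stored or the queue runs empty.
    """
    qpeers_udp = deque(initial_peers)  # peers still to be queried
    mpeers = {}                        # key -> status of every stored peer
    queries = 0
    while qpeers_udp and len(mpeers) < target:
        p = qpeers_udp.popleft()       # send a query to the first element
        queries += 1
        for q in network[p]:           # parse the simulated response
            if q not in mpeers:        # duplicate peers are dropped here
                mpeers[q] = "UDP_Responded"
                qpeers_udp.append(q)
    return mpeers, queries
```

Calling `crawl(build_network(200, 8), [0], target=100)` returns the stored peers together with the number of queries issued; the gap between that number and `target / 8` reflects the extra queries spent on responses that contain already-known peers.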

Algorithm 1 UDP send thread
Input: k initial peers in local routing table
Output: updated qpeersUDP
 1: initialize qpeersUDP with k initial peers
 2: read those peers with status = UDP_Responded from the latest log file into qpeersUDP {positive feedback mechanism}
 3: start thread 2, 3 and 4
 4: while size(mpeers) ≤ n do
 5:   repeat
 6:     send UDP query message to the first element p of qpeersUDP
 7:     remove p from qpeersUDP
 8:     if p ∈ mpeers then
 9:       p.status ← UDP_Requested
10:   until size(qpeersUDP) == 0
11: wait(Tw) {Tw is the waiting time needed for the last response message}
12: UDP_over ← 1
13: start thread 5

Algorithm 3 TCP send thread
Input: qpeersTCP
Output: updated qpeersTCP
 1: while true do
 2:   if TCP_over then
 3:     exit thread
 4:   repeat
 5:     wait while the number of current TCP connections > m {m is the maximum number of concurrent TCP connections}
 6:     try to establish asynchronous TCP connection with the first element p of qpeersTCP, and send TCP query message to p
 7:     p.status ← TCP_Requested
 8:     remove p from qpeersTCP
 9:   until size(qpeersTCP) == 0

Algorithm 4 TCP receive thread
Input: TCP response message
Output: updated mpeers
 1: while true do
 2:   if TCP_over then
 3:     exit thread
 4:   wait for TCP response message
 5:   parse TCP response message and write detailed information into corresponding peer p in mpeers
 6:   disconnect TCP connection with p
 7:   p.status ← TCP_Responded
 8:   for all TCP connection tc ∈ currently established TCP connections do
 9:     if the waiting time of tc ≥ T then
10:       disconnect TCP connection tc
11:       set the status of corresponding peer as TCP_Cantcon

Algorithm 5 Writing thread
Input: mpeers
Output: updated database or log files
 1: for all p ∈ mpeers do
 2:   wait p.status = TCP_Responded or TCP_Cantcon
 3:   write the information of p into database or log files

More specifically, the UDP send thread sends UDP query messages to the peers in qpeersUDP; the UDP receive thread parses UDP response messages and stores new peers into qpeersUDP, qpeersTCP and mpeers; the TCP send thread attempts to establish asynchronous TCP connections with the peers in qpeersTCP, and then sends TCP query messages (e.g. in KAD, the query message is hello or view_shared_files) to acquire more detailed information about the peers; the TCP receive thread parses TCP response messages (e.g. in KAD, the response message is hello_answer or view_shared_files_answer) and stores the more detailed information into the corresponding peers of mpeers; the writing thread writes the information into a database or log files.

Rainbow has two mechanisms to ensure its robustness: the positive feedback mechanism (see statement 2 in Alg. 1) and the incentive mechanism (see statements 5 and 6 in Alg. 2). The positive feedback mechanism reads the responsive peers of the latest crawl into qpeersUDP to accelerate the crawl. The incentive mechanism prevents the accidental deceleration or even suspension of Rainbow. Specifically, whenever Rainbow does not receive any UDP response message in time, it samples some peers which have already been collected and sends UDP query messages to them.

These five threads can be executed concurrently as long as the data dependency is met for each peer, that is, a TCP query message is sent to a peer only after the corresponding UDP query message has been answered, and the database or log file is appended only after a TCP query message has been answered. The time order of the three modules is depicted in Fig. 2. In real crawling, the peer information gatherer may sometimes be bypassed to accelerate the crawl if the detailed information of each peer is not needed.

Fig. 2. The pipeline chart of the modules of Rainbow. (Stages, in time order: peer crawler, peer information gatherer, writing module.)

Under such a design, Rainbow is robust and versatile. Its robustness is ensured by the positive feedback mechanism, the incentive mechanism, and the convergence, which will be theoretically analyzed in Section IV. Its versatility is provided by its special module, the peer information gatherer. This module enables Rainbow to acquire more detailed information about peers than previous P2P crawlers (e.g. Cruiser [4] and Blizzard [5]).

It should be mentioned that Rainbow can be configured to work in both the BTDHT and KAD networks, and can also be configured to launch a random crawl in addition to the commonly used crawl modes, the zone crawl and the full crawl.

IV. CONVERGENCE ANALYSIS OF RAINBOW

Assume that there are $N$ peers in total in a system and that Rainbow only keeps peers it has not collected before. Is collecting $n$ ($n \in [1, N]$) peers out of the $N$ peers convergent? We formalize this convergence problem as the following sampling problem.

Given a set $S_1$ with $N$ elements (corresponding to the peers in the network), we draw $c$ elements from it "with replacement" each time. We call $c$ the sampling granularity and denote it as $G_s = c$. Let $S_2$ denote the set of elements drawn from $S_1$. Then the number of sampling times $X$ needed to reach $|S_2| \geq n$ ($n \in [1, N]$) is a random variable. Our problem is: what is $E(X)$?

For example, let $S_1 = \{1, 2, 3, 4, 5\}$, $c = 2$ and $n = 4$. If we draw $\{1, 2\}$ the first time and $\{2, 3\}$ the second time, then $S_2 = \{1, 2, 3\}$ after two samplings. If we draw $\{1, 3\}$ the third time, then $S_2$ is still $\{1, 2, 3\}$, and if we draw $\{3, 5\}$ the fourth time, then $S_2$ becomes $\{1, 2, 3, 5\}$ and we finally have $|S_2| = 4$. This sampling process is a valid sampling with $|S_2| \geq 4$ and sampling times $X = 4$.

To solve this problem, we first consider the case $G_s = 1$. The problem is illustrated by the state transition chart in Fig. 3, where the circles denote the states, the number in a circle denotes the current size of $S_2$, and the number on an arrow denotes the transition probability between the corresponding states. For example, the transition probability from state $|S_2| = 0$ to state $|S_2| = 1$ equals 1.

Fig. 3. The chart of state transition. (States $0, 1, 2, \ldots, i, \ldots, N$; state $i$ loops back to itself with probability $i/N$ and moves to state $i+1$ with probability $(N-i)/N$.)

Lemma 1: Let $|S_1| = N$ and $G_s = 1$. Let $q_i$ be the random variable of the sampling times from the state $|S_2| = i - 1$ to the state $|S_2| = i$, for $i \in [1, N]$. Then,

$$E(q_i) = \frac{N}{N - i + 1}. \tag{1}$$

Proof of Lemma 1: From Fig. 3, we can deduce that the transition probability from state $|S_2| = i - 1$ to state $|S_2| = i$ equals $\frac{N-i+1}{N}$, and the probability of remaining at state $|S_2| = i - 1$ equals $\frac{i-1}{N}$. Therefore the probability of $q_i = 1$ is $\frac{N-i+1}{N}$. Similarly, the probability of $q_i = 2$ is $\frac{N-i+1}{N} \times \frac{i-1}{N}$. Generally, we have $\Pr(q_i = j) = \frac{N-i+1}{N} \times \left(\frac{i-1}{N}\right)^{j-1}$. Therefore,

$$E(q_i) = \frac{N-i+1}{N} \sum_{j=1}^{+\infty} j \left(\frac{i-1}{N}\right)^{j-1} = \frac{N}{N-i+1},$$

and Lemma 1 is proved.

Lemma 2: Let $|S_1| = N$ and $G_s = 1$. Let $X(N, n, 1)$ be the random variable of the total sampling times from state $|S_2| = 0$ to state $|S_2| = n$, for $n \in [1, N]$. Then,

$$E[X(N, n, 1)] = \sum_{i=1}^{n} \frac{N}{N - i + 1}. \tag{2}$$

Proof of Lemma 2: It is easy to see that $E[X(N, n, 1)] = E(q_1 + q_2 + \ldots + q_n) = \sum_{i=1}^{n} E(q_i)$. By Lemma 1, the sampling times $q_i$ satisfy $E(q_i) = \frac{N}{N-i+1}$, $i \in [1, n]$, for a single-step state transition from state $|S_2| = i - 1$ to state $|S_2| = i$. Therefore, $E[X(N, n, 1)] = \sum_{i=1}^{n} \frac{N}{N-i+1}$ and Lemma 2 holds.

By (2), one obtains that $E[X(N, 1, 1)] = 1$ and $E[X(N, n, 1)] > n$ when $n > 1$. With the case $G_s = 1$ prepared, we can obtain the result for the case $G_s = c \in [1, n]$ as follows.

Theorem 1: Let $|S_1| = N$ and $G_s = c \in [1, n]$, and suppose we need to sample $n$ ($n \in [1, N]$) elements into $S_2$ with replacement. Let $X(N, n, c)$ be the random variable of the sampling times from state $|S_2| = 0$ to state $|S_2| = n$, for $n \in [1, N]$. Then,

$$E[X(N, n, c)] = \frac{E[X(N, n, 1)]}{E[X(N, c, 1)]}. \tag{3}$$

Proof of Theorem 1: To prove the result, we convert the "original sampling problem" of $G_s = c$ to an "equivalent sampling problem" of $G_s = 1$. These two problems are shown in the left part and the right part of Fig. 4, respectively.

Consider the $i$-th time of sampling $c$ elements, which corresponds to the $i$-th box in the left part of Fig. 4. If we equivalently use $G_s = 1$, on average we need $\beta_i = E[X(N, c, 1)]$ times of sampling, as shown in the right part of Fig. 4. Note that the elements are sampled with replacement, thus we have $\forall i \in [1, E[X(N, n, c)]]$, $\beta_i \equiv E[X(N, c, 1)]$. Then by adding up all $E[X(N, n, c)]$ times of single-step sampling of the left part of Fig. 4, we have

$$E[X(N, n, 1)] = \sum_{i=1}^{E[X(N, n, c)]} \beta_i.$$

Therefore, $E[X(N, n, 1)] = E[X(N, n, c)] \times E[X(N, c, 1)]$ and thus (3) holds.

In (3), if $c = 1$, then $E[X(N, n, c)] = \frac{E[X(N, n, 1)]}{E[X(N, 1, 1)]} = E[X(N, n, 1)]$, which means that (3) is a more general form of (2). If $c = n$, then $E[X(N, n, c)] = \frac{E[X(N, n, 1)]}{E[X(N, n, 1)]} = 1$, which means that we only need one sampling to move $n$ elements from $S_1$ to $S_2$ when $G_s = n$; this is consistent with the real situation.

Theorem 2: The crawl algorithm of Rainbow is convergent when carrying out the full/zone/random crawl.

Proof of Theorem 2: 1) Firstly, we consider the convergence of threads 1 and 2.

Without considering the sampling times spent on removing duplicate peers, the convergence of thread 1 is ensured by the threshold judgment (see statement 4 in Alg. 1) and the incentive mechanism (see statements 5 and 6 in Alg. 2); the convergence of thread 2 is ensured by the Boolean variable UDP_over and statements 2 and 3 in Alg. 2.

When considering the additional sampling times spent on removing duplicate peers, from statement 6 in Alg. 1 we see that Rainbow carries out the iterative crawl by sending a UDP query message to a peer $p$. When $p$ receives the query message, it usually answers with a UDP response message containing $c$ ($c = 8$, $11$, or $20$, see footnote 2) peers. When thread 2 receives this response message, it parses the message and stores the peers in mpeers. Therefore, the sampling granularity of Rainbow satisfies $G_s = c$. By Theorem 1, the mathematical expectation of the total sampling times equals $E[X(N, n, c)]$.

a) When carrying out the full crawl, we have $n = N$ and

$$E[X(N, N, c)] = \frac{\sum_{i=1}^{N} \frac{N}{N-i+1}}{\sum_{i=1}^{c} \frac{N}{N-i+1}}. \tag{4}$$

According to Steiner et al. [5] and our careful observation, $N \sim 10^7$ (or $10^6$) $\gg c$ in BTDHT (or KAD), so we have $\sum_{i=1}^{c} \frac{N}{N-i+1} \approx c$. Therefore,

$$E[X(N, N, c)] \approx \frac{N[\ln(N + 1) + \gamma]}{c}, \tag{5}$$

where $\gamma \approx 0.5772$ is Euler's constant. From (5), we know that $E[X(N, N, c)] = O(N \ln N)$, so removing duplicate peers only increases the sampling times of threads 1 and 2 by a factor of $\ln N$. Therefore threads 1 and 2 are convergent when carrying out the full crawl.

b) When carrying out the zone/random crawl, we have $n < N$ and $E[X(N, n, c)] < E[X(N, N, c)]$. Therefore, on average, the sampling times of the zone/random crawl are fewer than those of the full crawl. Since threads 1 and 2 are convergent when carrying out the full crawl, they are also convergent when carrying out the zone/random crawl.

2) Secondly, we consider the convergence of threads 3, 4 and 5. Note that threads 3, 4 and 5 do not carry out the iterative crawl. More precisely, threads 3 and 4 only communicate over TCP with the peers which have been collected by threads 1 and 2; thread 5 only writes the detailed information of peers into a database or log files. Therefore we need not consider additional sampling times spent on removing duplicate peers.

From the description of threads 3 to 5, we see that the convergence of thread 3 is ensured by the Boolean variable TCP_over and statements 2 and 3 in Alg. 3; similarly, the convergence of thread 4 is ensured by the Boolean variable TCP_over and statements 2 and 3 in Alg. 4; thread 5 only has a simple "for" loop, so it is convergent.

By all the discussions above, Theorem 2 holds.

V. THE VERSATILITY OF RAINBOW

Rainbow is a versatile measurement tool for Kademlia-based DHT networks. It can be applied to identify more detailed characteristics of BTDHT and KAD than previous P2P crawlers (e.g. Cruiser [4] and Blizzard [5]) can.

In our experiments, the data are collected with two servers, one for BTDHT and the other for KAD. Each is configured with a 2.21 GHz dual-core processor, 8 GB of RAM and a 10 Mbit/s network connection. We carried out 443 random crawls on BTDHT/KAD from May 29, 2009 to June 9, 2009 without bypassing the peer information gatherer (see Fig. 1), in order to collect detailed information about peers. Furthermore, we carried out 24 full crawls on BTDHT and KAD from Aug. 7, 2009 to Aug. 8, 2009 with the peer information gatherer bypassed. The analysis based on these trace data is presented as follows.

Fig. 4. The chart of equivalent sample problem. (Left: the original problem with sampling granularity $c$, which takes $E[X(N,n,c)]$ steps. Right: the equivalent problem with sampling granularity 1, where each step on the left corresponds to $E[X(N,c,1)]$ single draws, so that $E[X(N,n,c)] \times E[X(N,c,1)] = E[X(N,n,1)]$.)
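The closed forms of Section IV are easy to evaluate numerically. The sketch below computes (2), (3) and the approximation (5), and checks (2) against a Monte Carlo simulation of the $G_s = 1$ case only. The function names, the trial count and the random seed are assumptions made for illustration, and the value used for $\gamma$ is the standard value of Euler's constant.

```python
import math
import random


def expected_single(N, n):
    """E[X(N, n, 1)] from Lemma 2: sum of N / (N - i + 1) for i = 1..n."""
    return sum(N / (N - i + 1) for i in range(1, n + 1))


def expected_batch(N, n, c):
    """E[X(N, n, c)] as stated in Theorem 1, equation (3)."""
    return expected_single(N, n) / expected_single(N, c)


def full_crawl_approx(N, c):
    """Approximation (5) for a full crawl, valid when N is much larger than c."""
    euler_gamma = 0.5772156649
    return N * (math.log(N + 1) + euler_gamma) / c


def simulate_single(N, n, trials=2000, seed=1):
    """Monte Carlo estimate of E[X(N, n, 1)]: draw one element at a time,
    with replacement, until n distinct elements have been seen."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(rng.randrange(N))
            draws += 1
        total += draws
    return total / trials
```

For the example of Section IV ($N = 5$, $n = 4$, $c = 2$), `expected_single(5, 4)` evaluates to about 6.42 and `expected_batch(5, 4, 2)` to about 2.85. Comparing `full_crawl_approx(N, c)` with `expected_batch(N, N, c)` for growing `N` shows how quickly (5) approaches (4).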

A. Geographical Distribution Regularity of Peers

We selected two full-crawl snapshots, a BTDHT snapshot taken at 4:37 a.m. on Aug. 7, 2009 (Beijing time) and a KAD snapshot taken at 4:33 a.m. on Aug. 7, 2009, to compare the geographical distribution of BTDHT and KAD peers. We use the latest IP-to-Country database of webnet77.com [13] to determine the geographical location of each peer. We observe that the users of BTDHT and KAD are distributed across nearly 200 countries; the top 20 countries are shown in Fig. 5. For BTDHT, 14% of the peers are located in America, 8% in China, 8% in Russia, 4% in Canada, 2% in Japan, and the others mainly in Europe; for KAD, 30% of the peers are located in China, only 2% in America, 0.5% in Russia, and the others also mostly in Europe. Thus, we can infer that American and Russian users are accustomed to using BitTorrent, while Chinese and European users are accustomed to using both.

Fig. 5. The peer geographic distribution in BTDHT and KAD. (Axes: country vs. peer ratio. (a) BTDHT, top 20 in order: US, CN, RU, GB, FR, CA, ES, PL, RO, IT, UA, BR, JP, TW, DE, BG, HU, IN, SE, NL. (b) KAD, top 20 in order: CN, ES, IT, FR, BR, DE, IL, EU, TW, US, PL, KR, AR, PT, CA, GB, HK, BE, CH, NL.)
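Mapping peers to countries and computing per-country peer ratios can be sketched as a range lookup followed by a tally. The example below is only an illustration under stated assumptions: the three address ranges and the country codes "AA", "BB" and "CC" are hypothetical placeholders (documentation-only address blocks), not rows of the webnet77.com database, and the function names are invented for this sketch.

```python
import bisect
import ipaddress
from collections import Counter


def _ip(text):
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.ip_address(text))


# Hypothetical IP-to-country rows (start, end, code), sorted by start address.
RANGES = [
    (_ip("192.0.2.0"), _ip("192.0.2.255"), "AA"),
    (_ip("198.51.100.0"), _ip("198.51.100.255"), "BB"),
    (_ip("203.0.113.0"), _ip("203.0.113.255"), "CC"),
]
STARTS = [row[0] for row in RANGES]


def country_of(ip):
    """Return the code of the range containing `ip`, or None if unmapped."""
    value = _ip(ip)
    idx = bisect.bisect_right(STARTS, value) - 1  # last range starting <= value
    if idx >= 0 and value <= RANGES[idx][1]:
        return RANGES[idx][2]
    return None


def peer_ratio(ips):
    """Fraction of peers per country code, most common first."""
    counts = Counter(country_of(ip) for ip in ips)
    total = sum(counts.values())
    return [(code, n / total) for code, n in counts.most_common()]
```

Binary search over sorted range starts keeps each lookup logarithmic in the number of ranges, which matters when a snapshot holds millions of peers.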

B. Client Version Distribution

We sampled 14,526 peers in BTDHT and 38,236 peers in KAD from the trace data of the random crawls. We observed 22 kinds of clients in BTDHT, and Fig. 6(a) shows the distribution of the top 6. One can see that µTorrent accounts for 60.1%, which we attribute to its open-source nature, conciseness and high performance. In KAD, by contrast, we observed only four kinds of clients (Fig. 6(b)): eMule, aMule, lphant, and Mldonkey. One can see that eMule accounts for 94.4%, of which its latest versions, eMule v0.49 and eMule v0.48, account for 65.3% and 24.2%, respectively. We speculate that these versions may have fixed some bugs of the earlier versions and are thus more welcomed by P2P users.

Fig. 6. The distribution of client versions in (a) BTDHT and (b) KAD.

C. The Regularities of Files

In this section, we take KAD as an example to demonstrate how Rainbow can be applied to identify the regularities of files. We focus on the popularity of files (i.e. the distribution of the number of users per file) and on file types, and we sample 30,845 file records from the trace data of the random crawls for analysis. The popularity of files is presented in Fig. 7(a), from which we see that it fits a power law distribution $f(x) \sim x^{-\alpha}$ with $\alpha = 0.6732$. This indicates that only a small fraction of files are especially welcomed by eMule users. Specifically, we observed that only 3.5% of the files are shared by more than ten users, and on average 2.78 users share one file. The distribution of file types is presented in Fig. 7(b), from which we see that over 50% of the files are video and audio files. Hence we infer that movies and music are preferred by eMule users.

Fig. 7. The regularities of files in KAD. ((a) The popularity of files: number of shared users corresponding to a file vs. rank, on log-log axes, with the power law fit $f(x) = 1122 \cdot x^{-0.6732}$. (b) The distribution of file types: video 27.9%, audio 25.8%, other 16.3%, image 13.2%, arc 10.5%, doc 6.3%.)

VI. CONCLUSION

In this paper, we presented a robust and versatile measurement tool (Rainbow) for the Kademlia-based DHT networks BTDHT and KAD. We have shown that its robustness is ensured by the positive feedback mechanism, the incentive mechanism and the convergence. We have theoretically proven the convergence of Rainbow by formalizing the convergence problem as a sampling problem. It should be mentioned that our theoretical results can also be applied to other P2P crawlers with the same sampling nature. Moreover, the peer information gatherer module enables Rainbow to measure various detailed characteristics of BTDHT and KAD, such as the version of clients and the popularity of files.

ACKNOWLEDGMENT

This work is supported in part by China NSF under Grants No. 60872036, No. 60803085 and No. 60873245.

REFERENCES

[1] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy, "Transport layer identification of P2P traffic," in Proceedings of the Internet Measurement Conference (IMC), Taormina, Sicily, Italy, Oct. 2004, pp. 121–134.
[2] Ipoque, http://www.ipoque.com/resources/internet-studies/.
[3] P. Maymounkov and D. Mazières, "Kademlia: a peer-to-peer information system based on the xor metric," in International Workshop on Peer-to-Peer Systems (IPTPS), Cambridge, MA, USA, Mar. 2002, pp. 53–65.
[4] D. Stutzbach and R. Rejaie, "Understanding churn in peer-to-peer networks," in Proceedings of the Internet Measurement Conference (IMC), Rio de Janeiro, Brazil, Oct. 2006, pp. 189–202.
[5] M. Steiner, T. En-Najjary, and E. Biersack, "Long term study of peer behavior in the KAD DHT," IEEE/ACM Transactions on Networking (ToN), vol. 17, no. 5, pp. 1371–1384, Oct. 2009.
[6] R. Bhagwan, S. Savage, and G. Voelker, "Understanding availability," in Proceedings of the 2nd International Workshop on Peer-to-Peer Systems (IPTPS), Berkeley, CA, USA, Feb. 2003, pp. 256–267.
[7] K. Kutzner and T. Fuhrmann, "Measuring large overlay networks - the overnet example," in Proceedings of the 14th KiVS, Kaiserslautern, Germany, Mar. 2005, pp. 193–204.
[8] J. Falkner, M. Piatek, J. John, A. Krishnamurthy, and T. Anderson, "Profiling a million user DHT," in Proceedings of the Internet Measurement Conference (IMC), New York, NY, USA, Oct. 2007, pp. 129–134.
[9] V. Sadafal, "Measurement and analysis of BitTorrent," Texas A&M University, Aug. 2008, master thesis.
[10] D. Stutzbach and R. Rejaie, "Improving lookup performance over a widely-deployed DHT," in Proc. INFOCOM, Barcelona, Spain, Apr. 2006, pp. 1–12.
[11] M. Steiner, T. En-Najjary, and E. Biersack, "A global view of KAD," in Proceedings of the Internet Measurement Conference (IMC), San Diego, CA, USA, Oct. 2007, pp. 117–122.
[12] M. Steiner, E. Biersack, and T. En-Najjary, "Actively monitoring peers in KAD," in Proceedings of the 6th International Workshop on Peer-to-Peer Systems (IPTPS), Bellevue, WA, USA, Feb. 2007.
[13] "Ip-to-country database of webnet77.com," http://webnet77.com/.