Rainbow: a Robust and Versatile Measurement Tool for Kademlia-based DHT Networks

Xiangtao Liu∗†, Tao Meng∗, Kai Cai∗, and Xueqi Cheng∗
∗Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 100080
†Graduate University, Chinese Academy of Sciences, Beijing, China, 100049
Email: liuxiangtao, mengtao, [email protected], [email protected]

Abstract—In recent years, peer-to-peer (P2P) file sharing applications have dominated Internet traffic volumes, and among them, BitTorrent and eMule constitute the majority. BitTorrent and eMule deploy their distributed networks based on Kademlia, a robust distributed hash table (DHT) protocol, to facilitate the delivery of content. Kademlia-based DHT networks have attracted researchers in the P2P community to measure and analyze them. However, to the best of our knowledge, there is still no well-designed crawler for carrying out intensive measurement and analysis on them. In this paper, we develop Rainbow, a robust and versatile crawler for Kademlia-based DHT networks. For the first time, we theoretically analyze its convergence (a main issue of robustness), that is, whether Rainbow can complete the crawling within a limited time. Our analysis can also be applied to other P2P crawlers with the same sampling nature. Finally, we demonstrate that Rainbow can be applied as a versatile measurement tool to identify various characteristics of BTDHT and KAD at a deep level.

Index Terms—Peer-to-peer, Kademlia, measurement, Rainbow, convergence.

I. INTRODUCTION

In today's Internet, peer-to-peer (P2P) file sharing applications are becoming more and more popular [1]. According to the 2008/2009 Internet traffic report of Ipoque [2], 43% ∼ 70% of Internet traffic (e.g. 43% in Northern Africa and 70% in Eastern Europe) comes from P2P applications and services. Among P2P traffic, BitTorrent and eMule constitute the majority. Specifically, BitTorrent accounts for 30% ∼ 81% (e.g. 30% in South America and 81% in Eastern Europe) and eMule accounts for up to 47%.

Kademlia is a robust distributed hash table (DHT) protocol designed by Maymounkov and Mazières [3]. In this paper, Kademlia-based DHT networks refer to the DHT networks which are implemented and deployed by P2P applications based on Kademlia. Both BitTorrent and eMule have deployed their Kademlia-based DHT networks to facilitate the delivery of content. These networks are called BTDHT and KAD in BitTorrent and eMule, respectively. Each peer in BTDHT/KAD has an identifier (or ID), which is randomly generated by the client using a hash function. Usually a BTDHT ID is 160 bits in length, and a KAD ID is 128 bits in length.

The wide use of Kademlia-based DHT networks has attracted many researchers in the P2P community to measure them. The usual solution is to develop crawlers, which access the online peers in these DHT networks and record their information for statistics. Representative crawlers include Cruiser [4] and Blizzard [5]. However, none of these crawlers can be directly used to measure both BTDHT and KAD at a deep level. This is because of the distributed nature of these DHT networks (i.e., they have no central directory servers), which makes it difficult to obtain detailed information about the peers in the networks.

In this study, we develop Rainbow, a robust and versatile crawler for Kademlia-based DHT networks. We theoretically analyze the convergence (i.e., whether the crawler can complete its task within a limited time, which is also a main issue of robustness) of P2P crawlers that use the same sampling nature as Rainbow. Moreover, we demonstrate that Rainbow can be applied as a versatile measurement tool to identify various characteristics of Kademlia-based DHT networks at a deep level. Our primary contributions are listed below.

• We develop Rainbow, a well-designed crawler for Kademlia-based DHT networks. Compared with previous P2P crawlers (e.g. Cruiser [4] and Blizzard [5]), it is robust and versatile.
• For the first time, we analyze the convergence (a main issue of robustness) of Rainbow. Our analysis can also be applied to other P2P crawlers using the same sampling nature.
• We add a new module, the peer information gatherer, to Rainbow. This module enables Rainbow to measure various detailed characteristics. In particular, we find that the popularity of the files in KAD fits a power law distribution f(x) ∼ x^(−α) with α = 0.6732.

The rest of the paper is organized as follows. Section II introduces related work. In Section III, we present the framework of Rainbow and its crawling algorithm. In Section IV, we theoretically prove the convergence of Rainbow. In Section V, we demonstrate how Rainbow can be applied as a versatile tool to measure and analyze BTDHT and KAD. Finally, we conclude our work in Section VI.

II. RELATED WORK

The measurement of P2P networks has attracted many researchers, such as [4] [5] [6] [7] [8] [9] [10] [11] [12]. The measurement tools (i.e., the P2P crawlers) can work in three different modes: k-bit zone crawl, full crawl and random crawl. Firstly, the k-bit zone crawl, which is often abbreviated as the zone crawl, only collects the peers which share the same k-bit BTDHT/KAD ID prefix. It has the advantage of a short processing time. For example, Blizzard [5] takes only 2.5 seconds to complete an 8-bit zone crawl. However, it cannot obtain a complete snapshot, e.g., the 8-bit zone crawl can only collect the information of one-256th of all peers. Secondly, the full crawl aims at collecting information of all peers, but it has the disadvantage of needing more time to complete. This disadvantage would distort the trace data, because peer churn (i.e., the phenomenon that peers join or leave a network frequently) may lead to the overlapping of peers in a snapshot, that is, peers which have left the P2P network would be mistakenly deemed online peers in the current snapshot. Therefore, some researchers (e.g. [4] [5]) prefer to use the trace data of the zone crawl rather than that of the full crawl. Nevertheless, in Section V-A, we try to measure part of the characteristics of BTDHT and KAD using the full crawl, considering that the full crawl may reveal some important properties even though the trace data is partly distorted. Thirdly, the random crawl randomly collects a part of all peers.

Kademlia-based DHT networks include Overnet, BTDHT and KAD. Among them, Overnet has been measured by many researchers, such as [6] [7]. Bhagwan et al. [6] measured the peer availability (i.e., the degree to which a peer is online) of Overnet. They found that one peer might have multiple IP addresses at different times, which they called "IP aliasing". They also pointed out that IP aliasing could lead to the underestimation of peer availability. Kutzner and Fuhrmann [7] measured Overnet for two weeks in many aspects, such as network size, peer availability, peer distribution and message routing delays.

For BTDHT, much measurement work has been carried out. For instance, Falkner et al. [8] measured the Azureus DHT, an implementation of the Kademlia protocol by a BitTorrent client named Azureus. They focused on measuring the session time of peers and the overheads of peers during bootstrapping. Furthermore, Sadafal [9] implemented a BTDHT crawler and statistically analyzed its trace data, focusing on such characteristics as the lifetime of peers and the preference of peers towards special torrents.

For KAD, Stutzbach et al. [10] proposed an analysis framework to compute lookup performance. They developed two measurement tools, kFetch and kLookup, to collect data for computing parameters of lookup performance. Stutzbach et al. [4] measured the peer churn for three P2P file sharing networks, Gnutella, KAD and BitTorrent, using the modified

develop Rainbow, a well-designed P2P crawler for the widely used Kademlia-based DHT networks BTDHT and KAD, considering that Overnet is unavailable due to legal issues. Compared with previous P2P crawlers, Rainbow will be shown to be robust and versatile.

In [5], Steiner et al. considered the convergence of their crawler. They regarded the crawler as converged when UDP query messages had been sent to 99% of the currently collected peers. This view seems reasonable, but an accurate estimate of the total number of peers N in the system is still needed. For the first time, we theoretically analyze this convergence problem of P2P crawlers. Note that our analysis is based on Rainbow, but it can be applied to other P2P crawlers using the same sampling nature.

III. RAINBOW

A. The Framework of Rainbow

The framework of Rainbow is shown in Fig. 1. It comprises three modules: the peer crawler, the peer information gatherer and the writing module. The peer crawler adopts an iterative crawling method. Specifically, it starts the crawl by sending UDP query messages¹ to some initial peers; after a while, it acquires a batch of new peers by parsing UDP response messages²; next it queries those newly acquired peers to obtain the next batch of new peers, and so on. Using this method, the peer crawler collects partial information (e.g. IP and UDP port) of a peer set S_UDP. The peer information gatherer attempts to collect more detailed information (e.g. the client version) of peers in S_UDP through TCP communication. The writing module writes the information of peers in S_UDP into a database or log files.

[Figure: framework diagram. Labels: Initial peers; Peer crawler; Peer information gatherer; Writing module; BTDHT/KAD Network; UDP communication; TCP communication; Database; Log files; database table "PeerInfo" or txt format file.]

Fig.