User contribution in peer-to-peer communities


Dissertation

for the degree of doctor at Delft University of Technology, by the authority of the Rector Magnificus Prof. ir. K. C. A. M. Luyben, chair of the Board for Doctorates, to be defended in public on Monday, 6 July 2015 at 10:00.

by

Mihai Capotă
graduate engineer (inginer diplomat), Politehnica University of Bucharest, Romania,
born in Buzău, Romania.

This dissertation has been approved by:
Promotor: Prof. dr. ir. D. H. J. Epema
Copromotor: Dr. ir. J. A. Pouwelse

Composition of the doctoral committee:
Rector Magnificus, chair
Prof. dr. ir. D. H. J. Epema, Delft University of Technology & Eindhoven University of Technology, promotor
Dr. ir. J. A. Pouwelse, Delft University of Technology, copromotor

Independent members:
Prof. dr. F. M. T. Brazier, Delft University of Technology
Prof. dr. ir. G. J. P. M. Houben, Delft University of Technology
Prof. dr. F. E. Bustamante, Northwestern University, United States of America
Prof. dr. P. Michiardi, EURECOM, France
Dr. S. Voulgaris, Vrije Universiteit Amsterdam

Advanced School for Computing and Imaging

This work was partly carried out in the ASCI graduate school. ASCI dissertation series number 330.

Printed by: Uitgeverij BOXPress, ’s-Hertogenbosch

© 2015 Mihai Capotă
ISBN 978-94-6295-269-0
doi:10.4233/uuid:a03b025b-1fd5-4cd1-9423-a5c1bd85fefe

Acknowledgments

A PhD represents proof of independence in research. At the same time, obtaining a PhD can only be done with help from others. I am thankful to all who have contributed to this thesis as co-authors of my publications, colleagues, and friends.

Dick, you are first in the list not because you are my PhD supervisor, but because you really helped me the most. You believed in me when I was behind with my publications halfway through my contract and also towards the end when, with one chapter of my dissertation missing, I switched focus away from my PhD topic toward the exciting new area of data-intensive computing. Your input helped me at all stages of research, from extracting a problem statement out of the many ideas floating around in our conversations to designing sound experiments. You always supported me, but never let me forget I had a PhD to finish. Our conversations unrelated to work were always fascinating, especially the ones about the Netherlands. I could not have asked for a better supervisor.

Johan, you gave me the opportunity to come to Delft and meet all the wonderful people you gathered around you. I will always be immensely grateful for that. Furthermore, a large part of the work in this dissertation started during our brainstorming sessions, the remains of which will probably still be visible on the whiteboards in our offices for generations to come. Outside work, you were always a great discussion partner, whether the topic was technology, privacy, or politics.

Alex, you taught me how to focus on the important things and forget about the minutiae that were preventing me from making progress; in other words, you taught me how to get things done in research. Most importantly, you helped me find a path beyond my PhD, and I can confidently face the fast-approaching next step in my career thanks to you. You and Ana were also real friends to me from the first day I arrived in the Netherlands.
You welcomed me to your home to celebrate birthdays and other special events, we travelled to exciting places together, and we simply went out for a drink and a chat many times. Thank you both for being my friends.

Naza, you were there for me when I needed you most, at the beginning of my PhD, and you taught me how to conduct research on my own. You offered plenty of help, but made it clear that I was the one responsible for my PhD. I can certainly say that the research visit to your university in Brazil, which you arranged together with Dick for me, was the highlight of my PhD, for many more reasons than just the paper we published as a result. Outside of research, I will only say one thing: you and Aida were the closest thing to a family I had in the Netherlands.

Lucia, you are a great friend. We shared several fantastic years in the same office, during which we had plenty of fun, and we even wrote a paper together! I will not attempt to list all the wonderful moments we shared, but I have to thank you for introducing me to Jelle, the coolest Dutch person I have met. I hope I will be hanging out with both of you for many years to come.


Niels, from writing papers together, to designing and enjoying the social room, to home improvement, to partying, you were always there for me. Thank you for being such a good friend. You and Janna even introduced me to the Sint Jansbrug student society, certainly the most authentic aspect of Delft student life I have experienced.

Dimitra, it was very nice sharing the journey toward a PhD with you. You always lightened the atmosphere in our group with your cheerfulness. While we never got to write a paper together, our discussions were very helpful for my research. Outside the office, I fondly remember our great chats about art, especially about film. And the partying, also with Georgios!

Henk, thank you for accepting my PhD application, for teaching me Dutch during lunch breaks and random conversations at the office, and for sharing your insights into research, education, TU Delft, and life in the Netherlands.

I would like to thank Nicolae Țăpuș for being an excellent MSc supervisor and for giving me the opportunity to come to Delft through the collaboration between Politehnica University of Bucharest and TU Delft.

Bogdan, you helped me explore new avenues for research in data-intensive computing. The papers we wrote together opened doors on my career path and I am very grateful for that. I also want to thank you for the many conversations we had about life, the universe, and everything.

Maciek, we see eye to eye on so many things... I was very lucky to have met you at the beginning of my adventure in the Netherlands. Your friendship in the first year of my PhD was crucial for my sticking with the program. You are an example of honesty and fairness in research and in life. I hope we will be geographically close again soon.

Tamás, your mathematics skills helped me make crucial progress in my PhD. More importantly, in research and outside research, your warm character was a bright spot in my life. Thank you for everything.
Riccardo, you were always one of the most cheerful persons around and I always enjoyed hanging out with you. Incredibly, we also managed to publish a paper together, and you presented it so well that we obtained a best paper award for it! This has already been very useful in my career, thank you.

Dave, you helped me see that there is more to science than the narrow field of one’s PhD. Your interdisciplinary vision will always be a model for my career. Thank you for sharing your wisdom and helping me become a more open-minded person.

Tim, you are the best student anyone can wish for. I am thankful for your independent hard work that got our data-intensive computing papers published and allowed me to wrap up my dissertation.

Michel, you are the main author of my first paper and you gave me the opportunity to present it and attend NSDI at the same time. Needless to say, that made quite an impression on me and instilled in me a desire to reach the highest level of expertise. Thank you.

Nitin, thank you for being a great companion during our visit to Brazil. During both work and holidays we shared an experience that I will cherish forever. I would like to thank you as well for the helpful comments you made on my work. I will also remember our lively conversations about US politics, which nobody else seemed to understand.

Rahim, while we were colleagues for so many years, we collaborated best outside the office, on the tennis courts and at PhD parties, in Leiden and elsewhere.

Adele, thank you for sharing your passion for music with me, and for putting up with my ramblings about politics.

Boxun, I learned a lot about alternative Chinese culture from you, as well as about old-school video games. You made life at the office very entertaining. Thank you.

Otto, you were my go-to person for everything Dutch related. I hope I returned the favor and helped you and Corina with Romanian info at least a bit.
Corina, thank you for teaching me computer science at Politehnica University in Bucharest. And thank you both for the wonderful events we celebrated together in the Netherlands.

Elric, thank you for being a great nerd and talking to me about nerdy things for the past couple of years. And also for helping me with the Tribler infrastructure.

Lipu, you were a great office mate. I will remember all the fun lunch breaks we had and the delicious tea you brought from China.

Egbert, thank you for your help with the Tribler code. I would not have completed the experiments without your help.

Flávio, thank you for our fruitful collaboration which resulted in the publication of two papers. It was also nice hanging out with you in Amsterdam.

Yong, you started the Graphalytics project and I am very thankful for that. I am sure the project will make a big impact in academia and industry and will help your career as it has already helped mine.

Victor, it was a lot of fun going with you on the trips to California and Hungary. Your spontaneity always brought something unexpected to the conversation, both inside and outside the office.

Rameez, our heated debates about politics helped me understand the world better. I have changed a lot during my PhD and your insights are surely an important reason for this change.

Andrei, I am very grateful for your friendship. Throughout the years, we shared many experiences that became pleasant memories. Many are related to our (mostly futile) attempts at finding girlfriends for ourselves. Having said that, the best memories come from spending time with you and Andra together. Andra, you helped me not feel too homesick in Delft through the wonderful parties you organized, always involving delicious cooking. Thank you both.

I would also like to thank my other Romanian friends in Delft, Ștefan, Bogdan, Mihai, Iulia, Marius, Maria, George, and Mădălina. Thank you for inviting me to your homes and for celebrating Romanian holidays together.
I am thankful to my colleagues in the Parallel and Distributed Systems group, Jie, Jianbin, Siqi, Arno, Boudewijn, Cor-Paul, and Alexey, who contributed to a pleasant atmosphere at the office. Also, thank you Paulo, Stephen, and Munire for the technical support. And thank you Ilse, Rina, and Shemara for the administrative support.

Alex, George, Mircea, and Mugurel, you kept me connected to Romania during your research visits to TU Delft. Thank you.

Andrei, you organize the best holidays. I hope I will be able to join them for a very long time. I am also grateful to my other friends who made my time away from TU Delft so enjoyable all these years: Andra, Silviu, Aur, Șerban, Luana, Georgiana, Tudor, Călin, Vali, Cosmin, Eliana, Alin, and Ciprian.

It is impossible to put into words the gratitude I feel toward my parents. I would not have achieved much without their unconditional and full support. Mamă, tată, vă mulțumesc mult! (Mom, Dad, thank you very much!)

For the past couple of years I have been sharing my life with my amazing girlfriend, Malwina. I feel extraordinarily lucky to have met her and I am thankful for her accepting me with all my flaws. I love you, Malwina!

Mihai Capotă
Holloways Beach, 8 June 2015

Contents

1 Introduction
  1.1 Peer-to-peer systems
  1.2 Research context
  1.3 The user contribution problem
  1.4 Thesis outline
2 Towards observing the global BitTorrent file-sharing network
  2.1 Problem statement
  2.2 The design of BTWorld
  2.3 A week in the life of 769 BitTorrent trackers
  2.4 New BitTorrent phenomena
  2.5 Related work
  2.6 Conclusion
3 Inter-swarm resource allocation
  3.1 The inter-swarm resource allocation problem
  3.2 Simulators for current RAP solutions
  3.3 Optimal bandwidth allocation in BitTorrent communities
  3.4 Datasets
  3.5 Can current algorithms provide high throughput?
  3.6 Are current algorithms appropriate for video streaming?
  3.7 Related work
  3.8 Conclusion
4 Investment strategies for credit-based P2P communities
  4.1 Problem statement
  4.2 Speculative download mechanism design
  4.3 Trace-based evaluation
  4.4 Results
  4.5 Related work
  4.6 Conclusion
5 Towards a P2P bandwidth marketplace
  5.1 Bandwidth marketplace design
  5.2 A model for helper contribution bounds
  5.3 Helper algorithm
  5.4 Evaluation of helper contribution impact
  5.5 Related work
  5.6 Conclusion


6 Decentralized credit mining in P2P systems
  6.1 Background
  6.2 Credit Mining System design
  6.3 Internet deployment challenges
  6.4 Tribler implementation
  6.5 Evaluation of credit mining in Tribler
  6.6 Conclusion
7 Conclusion
  7.1 Summary
  7.2 Future work
Bibliography
Summary
Samenvatting
Curriculum Vitæ
List of Publications

1 Introduction

Online communities are emerging as a new way for people to interact with each other, unobstructed by geographical bounds. Online communities enable collaboration leading to the creation through peer production of unprecedented artifacts, like Wikipedia, the online encyclopedia, and Linux, the open-source operating system kernel. Underlying the success of peer production is of course the contribution of each individual in the community. Thus, one of the main challenges faced by community organizers is stimulating this contribution. Incentive systems aimed at providing extrinsic motivation, like publishing top-contributor lists or excluding non-contributors, are frequently employed to increase user contribution. However, online communities are more exposed than traditional communities to the free-rider problem, i.e., people taking resources without contributing back. Because of the volatile nature of online identities, free riders can whitewash their tarnished identities and adopt new, blank identities, often as trivially as creating new email addresses. Therefore, incentive systems should be accompanied by appropriate security measures to prevent free riding. At the same time, community organizers must consider altruism and intrinsic motivation, which ensure people will always contribute to a certain extent, irrespective of any incentive system. If free riding is cumbersome and contributing is easy, users will contribute, because they make decisions based on bounded rationality. Thus, online communities should strike a balance between preventing free riding and removing barriers to user contribution.

Simple computer communication protocols are used to create complex online peer-to-peer communities where users share files and computing resources, such as storage space and network bandwidth, in a peer-to-peer fashion. In this thesis, we study and aim to increase the contribution of users to these peer-to-peer communities.
We design a system capable of observing communities on a global scale without requiring costly infrastructure, and use it to discover and describe new peer-to-peer phenomena. We evaluate the efficiency of peer-to-peer communication protocols, expanding previous work to the community scale, and consider the viability of new types of peer-to-peer communities centered on video streaming instead of file sharing. Throughout the thesis we focus on the conflict between the self-interest of users and the overall interest of the community and try to harmonize the two in our designs. Ultimately, our goal is to create an automatic and autonomous system that helps honest users increase their contribution to the communities in which they participate.

1.1. Peer-to-peer systems

Large-scale distributed computing systems are used pervasively these days in many areas of our lives. The current growth in applications running on mobile devices, for example, is enabled by the distributed systems that connect the applications with their cloud back ends. Peer-to-peer (P2P) systems are a popular type of distributed system in which all computers have the same role, in contrast with client-server systems, where client computers make requests that are fulfilled by server computers. The architecture of P2P systems enables them to scale to hundreds of millions of nodes, to provide reliability in the presence of network failures, and to be resilient to attacks that target centralized control points. In this section, we introduce the key concepts from the field of P2P systems required to understand the rest of the thesis.

1.1.1. Early systems

Arguably, the Internet itself is the very first P2P system, given that Internet hosts can communicate with each other unrestricted, and that each host can function both as a client and as a server at the same time, depending on the application. Other examples of early distributed systems using P2P elements in their architecture are the Usenet news network [1], introduced in 1980, and SMTP email [2], introduced in 1981. Usenet servers receive news messages from users and forward them through P2P connections to other Usenet servers, where users can read them. In SMTP email, servers act as peers, sending and receiving email messages from each other according to the address domains of the message recipients.

The system that popularized P2P was Napster [3], a file-sharing system that enabled users to download music files from each other. Napster still included in its architecture a server that indexed the files shared by users and replied to user searches. This element of centralization was also the cause of Napster’s demise, as copyright infringement lawsuits shut down the whole P2P system by shutting down the indexing server.

Napster was followed by a wave of P2P file-sharing systems characterized by varying levels of centralization, including Gnutella [4], DirectConnect [5], FastTrack [6], and eDonkey [7]. Even though these systems survived legal challenges thanks to the absence of single points of control, their popularity declined in time, largely because of poor download performance and spam. These systems failed to incentivize user contribution and were affected by high levels of free riding [4].

1.1.2. BitTorrent

In 2001, Cohen introduced BitTorrent [8, 9], a P2P file-sharing system whose popularity grew fast, reaching 30% of the whole Internet bandwidth in 2005 [10], and 50% in 2008 [11]. Despite the recent growth of HTML video streaming, BitTorrent still accounted for 35% of residential upload bandwidth in the US in 2013 [12].

In BitTorrent, a user who wants to share a file must first use a BitTorrent client to create a torrent—a metadata file that records required information about the file to be shared. The BitTorrent client scans the file and records in the torrent, for each fixed-size piece of the file, a SHA-1 cryptographic hash [13]. The piece hashes are used to uniquely identify the file and verify the correctness of data transfers, preventing errors introduced by network failure or malicious users. Once the torrent is created, the user must distribute it to other users out-of-band, e.g., through the Web or email. The BitTorrent protocol itself has no provision for distributing torrents. Other users receiving the torrent add it to their BitTorrent clients, which find other clients sharing the file with the help of trackers—servers that keep track of clients. The BitTorrent clients sharing a file are said to form a swarm. The clients that have the complete file are called seeding peers, or seeders, and the clients that are downloading the file are called leeching peers, or leechers.

The most innovative feature of BitTorrent compared to the previous P2P file-sharing systems is enabling leechers to upload to other leechers, thereby improving download speed in the system. As soon as a piece of the file is downloaded, a leecher can verify the correctness of the download using the SHA-1 hash recorded in the torrent, and immediately start uploading the piece to other leechers. Furthermore, leechers allocate their upload bandwidth using an algorithm inspired by tit for tat, a strategy proven to be efficient in game theory [14].
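As an illustration of the piece hashing just described, the following is a minimal sketch of how a client might compute and verify per-piece SHA-1 hashes; the piece size is a hypothetical choice, since real clients let the torrent creator pick it.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # hypothetical; actual piece sizes vary per torrent

def piece_hashes(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Split the file content into fixed-size pieces and hash each with SHA-1,
    as done when creating a torrent."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]

def verify_piece(piece: bytes, expected_hash: bytes) -> bool:
    """Check a downloaded piece against the hash recorded in the torrent."""
    return hashlib.sha1(piece).digest() == expected_hash
```

A leecher that receives a piece failing this check simply discards it and requests it again from another peer.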
Tit for tat means that a leecher ranks the download requests of other leechers according to the upload bandwidth they provided in the past, which ensures fairness: leechers that contribute more are rewarded with better performance.

In addition to improved performance, BitTorrent also offers improved reliability compared to previous systems, again thanks to piece-based data transfers. After joining a swarm and downloading a few random pieces, a leecher selects for download the rarest piece, i.e., the piece that is least replicated among the peers to which it is connected in the swarm. Using this strategy, the initial seeder can leave the swarm after it has uploaded a single copy of the file, even before any single leecher has the complete copy, because pieces will be uniformly spread amongst leechers who will upload to each other.

Developers made numerous improvements to BitTorrent over the years through a standardized process for adding extensions to the protocol [15]. For example, the original protocol only permitted a single tracker to be used for a torrent, which represents a single point of failure; extensions added multiple trackers [16], and a distributed hash table (DHT) [17]—a distributed network used by peers to find each other.


Figure 1.1: A BitTorrent community.

1.1.3. Communities

The lack of support in the BitTorrent protocol for distributing torrents and searching for files led to the emergence of indexing websites that provide these services. Frequently, indexing websites also offer other functionality, such as a BitTorrent tracker, or a message board. As a result, communities emerge around these websites. In Figure 1.1, we depict such a community. Users interact with the website to search for files and download torrents, or to upload new torrents and make files available to the community. The torrents usually contain the address of the community tracker, and BitTorrent clients contact the tracker to discover peers and join swarms. At any moment, there are many swarms in the community, and users can be part of several swarms concurrently. Furthermore, there are hundreds of communities on the Internet and users can be members of any number of communities at the same time.

Most BitTorrent communities offer centralized user accounts that can be used to identify the uploaders of torrents or the authors of messages posted to the message board. Sometimes, new torrents can only be added by privileged user accounts, in order to reduce duplicate content and prevent spam. In private communities [18], the website itself is not accessible without a user account, so non-members cannot download torrents. Furthermore, private communities enforce membership through the tracker, which does not reply to non-members. Also, the private flag is used in torrents [19] to stop non-members from bypassing the tracker through the DHT.

The BitTorrent protocol requires peers to report to the tracker their activity statistics, consisting of the total number of bytes uploaded and downloaded. These statistics were originally meant to provide information to the tracker operator and cannot be rigorously verified automatically to detect misreporting.
However, most private communities use these self-reported statistics to provide a centralized accounting mechanism, based either on credit (i.e., the difference between uploaded and downloaded bytes) or on sharing ratio (i.e., the ratio between uploaded and downloaded bytes). Furthermore, many private communities employ the accounting mechanism for sharing ratio enforcement [18] and restrict certain services, like access to the newest files, to users who have a minimum sharing ratio or amount of credit. Some private communities that aim to offer high download speeds even expel users who do not contribute enough to maintain a required minimum amount of credit.
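The two accounting variants described above can be sketched as follows; the enforcement threshold is hypothetical, since each community sets its own policy:

```python
def credit(uploaded_bytes: int, downloaded_bytes: int) -> int:
    """Credit: difference between uploaded and downloaded bytes."""
    return uploaded_bytes - downloaded_bytes

def sharing_ratio(uploaded_bytes: int, downloaded_bytes: int) -> float:
    """Sharing ratio: uploaded divided by downloaded bytes."""
    if downloaded_bytes == 0:
        return float("inf")  # a user who has only uploaded has a perfect ratio
    return uploaded_bytes / downloaded_bytes

def may_access_newest_files(uploaded_bytes: int, downloaded_bytes: int,
                            min_ratio: float = 0.7) -> bool:
    # hypothetical sharing-ratio enforcement rule
    return sharing_ratio(uploaded_bytes, downloaded_bytes) >= min_ratio
```

Because both quantities are computed from self-reported statistics, the accounting is only as trustworthy as the reports themselves, which is precisely the limitation noted above.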

1.1.4. Tribler

Tribler [20] is a research project started in 2004 in the Parallel and Distributed Systems Group at the Delft University of Technology with the aim of extending the design of P2P systems. Tribler is backwards compatible with BitTorrent, but adds multiple missing features, like decentralized content discovery, video-on-demand and live streaming, permanent peer identification, and a decentralized reputation system. Several features of Tribler were adopted by the larger P2P community, such as a new data transfer protocol [21] and an improved torrent format [22]. The Tribler client software is released under an open-source license and is written in the Python programming language for cross-platform compatibility—Windows, Mac OS, and Linux are all supported.

BitTorrent only operates at the swarm level; it is not possible to provide higher-level services without extending the protocol. A BitTorrent client creates a separate random identifier for each swarm it joins. Instead, Tribler uses permanent peer identifiers based on public key cryptography [23]. Furthermore, Tribler uses a message synchronization protocol, called Dispersy [24], to provide peer connections that are independent of activity in swarms. Thus, peers are constantly connected to each other through Dispersy overlays, whether participating in swarms or not. All other Tribler higher-level features are based on this basic functionality.

Tribler uses the overlay peer connections for decentralized torrent collection and decentralized search. When Tribler peers connect, they exchange lists of torrents and download new torrents from each other. Each peer uses its cache of torrents to build an SQL database of file names which is used for decentralized keyword search. Additionally, peers forward user queries to other peers for improved search results.

Video-on-demand streaming is possible in BitTorrent if the piece selection strategy is changed from rarest first to sequential.
Nevertheless, the BitTorrent tit for tat algorithm for leechers is not appropriate for sequential piece selection, because two leechers starting playback at different times will not be interested in the same set of pieces. Tribler introduces give-to-get [25], an algorithm that divides the file into high, medium, and low priority sections and applies a different strategy to each section: sequential for the high priority one, and rarest first for the others. Peers download pieces from the section with the highest priority that is not yet complete.

This increases the chances of obtaining uninterrupted playback.

Live streaming is impossible with BitTorrent, because there is no initial file to be hashed to create a torrent; instead, data is produced constantly by the live streaming source. In Tribler, the peer generating the live stream assigns pieces of data a cryptographic signature produced with its private key and a sequence number [26]. The peer then only needs to distribute its own public key and the name of the live stream to other peers, which will be able to verify the correctness of piece transfers using the signatures.

The BitTorrent incentive mechanism only affects leechers; seeders are not rewarded for contributing. Tribler addresses this gap through BarterCast [27], a decentralized reputation system. Each data transfer between two Tribler peers produces a certificate recording the amount of data transferred each way. The certificate is signed by both peers to prevent forgery and is then sent to other peers through a Dispersy overlay. Each Tribler peer builds its own view of the reputation of other peers based on the certificates it receives.

Tribler also includes a decentralized publish-subscribe mechanism for user-created content. Users can create personal channels and publish content through channels, as well as subscribe to the channels of other users to be automatically notified when new content is published. Channels are implemented using Dispersy overlays [28]. All content in a channel is indexed in a Bloom filter [29], which is periodically sent to other peers in the channel overlay. Peers discover newly created channels through an overlay included in Tribler, to which all peers are subscribed. Users can add comments and even edit the content in channels, depending on the permissions granted by the channel creator.
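To illustrate the channel indexing mentioned above, here is a toy Bloom filter; this is not Tribler's actual implementation, and the filter size and hash construction are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter deriving k hash positions from salted SHA-1 digests."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored as one big integer

    def _positions(self, item: bytes):
        # derive num_hashes independent positions by salting the hash input
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item: bytes) -> bool:
        # no false negatives; a small chance of false positives
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```

A peer receiving such a filter can test whether a given item is in the channel without holding the full content list, at the cost of occasional false positives, which is why the filter is compact enough to be sent around the overlay periodically.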

1.2. Research context

The work in this thesis was carried out in the context of two projects funded by the European Union through the Seventh Framework Programme for Research and Technological Development. We summarize the projects and highlight the work carried out concurrently with the work presented in this thesis by fellow PhD candidates from the Parallel and Distributed Systems Group.

1.2.1. P2P-Next

P2P-Next [30] was an integration project running between 2008 and 2012, bringing together 21 partners in 12 different countries, representing both academia and industry. The goal of the project was to build a P2P content delivery platform capable of concurrently streaming content to millions of people. P2P-Next aimed to maximize access to the technology it developed, and therefore released open-source software based on open standards.

The project was based on the core P2P technology developed in Tribler. As part of P2P-Next, the technology has been ported to new platforms, including set-top boxes and Android mobile devices [21]. At the same time, a collaboration with Wikipedia was established, which enabled the serving of videos embedded in the encyclopedia website using P2P technology [31].

PhD candidates from the Parallel and Distributed Systems Group contributing to the project made considerable scientific advances over the years. In the area of video streaming, D’Acunto et al. [32] proposed improved peer selection strategies for video streaming in swarms composed of heterogeneous peers. Zhang et al. [33] characterized flash crowds—periods exhibiting exceptionally high demand—in BitTorrent, and D’Acunto et al. [34] introduced a bandwidth allocation policy that improves playback continuity in flash crowd conditions. Petrocco et al. [35] introduced Deftpack, a new piece-picking algorithm designed for video encoded using layered coding techniques such as Scalable Video Coding (SVC), that minimizes the frequency of layer changes during playback.

In terms of security, P2P benefits from decentralization compared to traditional broadcast networks, but it also introduces challenges that must be addressed. Chiluka et al. [36] proposed a novel technique to mitigate Sybil attacks which leverages negative feedback from honest users, as well as an improvement to the established EigenTrust algorithm, which takes into account the connections explicitly trusted by each user [37].

1.2.2. QLectives

The second project providing context for this work is QLectives [38], which ran between 2009 and 2013 and aimed to bring together social networks, peer production, and P2P systems to form quality collectives—self-organizing communities that produce quality artifacts with the help of information technology systems. There were eight partners from seven countries collaborating in the project, with expertise in a wide range of areas, including sociology, physics, psychology, artificial intelligence, and distributed systems.

The highly interdisciplinary QLectives team conducted research in two living labs, a media sharing community called QMedia, and a platform for scientific publications called QScience. The project organization was based on repeating cycles consisting of data collection, processing and analysis, theory and algorithm reformulation, and living lab deployments.

The scientific contributions made by Parallel and Distributed Systems Group PhD candidates as part of QLectives concerned first and foremost communities. Meulpolder et al. [18] studied public and private BitTorrent communities and contrasted their characteristics to understand the cause of performance differences. Delaviz et al. [39] characterized the reputation mechanism in the Tribler community (which forms the basis of QMedia), using network science techniques. Gkorou et al. [40] explored the effect of discarding community history for the purpose of enabling scaling and capturing emerging community phenomena. Jia et al. [41] proposed a model for credit-based P2P communities that is able to predict systemic problems caused by too much or too little available credit. Additionally, Delaviz proposed SybilRes [42], a protocol designed to guard the Tribler reputation mechanism against Sybil attacks.
Also on the topic of free riding prevention, Gkorou proposed a scalable and attack-resistant method for collecting trusted and relevant information for computing peer reputation, called EscapeLimit [43], which is based on random walks with restarts.

1.3. The user contribution problem

User contribution is central to the functioning of P2P communities. Scalability, reliability, and all other desirable properties of the P2P architecture depend on it. In this thesis, we aim to study user contribution and ultimately build a system for increasing it. We identify the following research questions, which we must answer in order to reach this goal.

RQ1: How to study the operation of global P2P networks?

Studying and understanding a system is a prerequisite for improving it. The study of global P2P networks enables the creation of models that can drive simulations and eventually lead to better performance for the users. However, this is not a trivial undertaking, because of the scale involved: there are thousands of communities with hundreds of millions of users [44] sharing content on the Internet. Studying them at a global level is challenging in an academic laboratory with limited capabilities for monitoring. In particular, studying BitTorrent poses many problems. For example, the data collected regards private individuals and must be anonymized in accordance with data protection laws, which raises privacy issues. The lack of centralized control over the global network is also a problem for researchers, because there is no centralized directory listing all BitTorrent communities, so collecting data on communities is a laborious process.

RQ2: How do P2P protocols perform at the community level?

In the context of individual file transfers, there have been numerous studies of P2P protocols, and BitTorrent has been shown to perform almost optimally in terms of utilizing user contributed bandwidth [45]. Actual P2P usage, however, takes place in communities where multiple file transfers take place at the same time, with complex interdependencies [46]. Thus, there is a need to understand the performance of P2P protocols at this higher level.

It is also worth investigating whether the type of community influences the performance of the protocol. Most communities are oriented towards file sharing, but in the context of the Tribler project, video streaming represents an equally relevant target. While there has been research into video streaming [25], there is little analysis of the community-level allocation of resources. For the community as a whole, the goal is maximizing the number of peers benefiting from uninterrupted video playback. It is not clear how BitTorrent accomplishes this goal.

RQ3: Is it possible to increase the contribution of honest users?

Incentive mechanisms deployed in P2P communities encourage users to contribute. BitTorrent is the foremost example, with its tit for tat reciprocation mechanism attracting more user upload bandwidth contribution than earlier P2P systems. Furthermore, accounting mechanisms used by the Tribler community or by private BitTorrent communities were shown to be effective: private communities with strict sharing ratio enforcement for members offer better performance than public communities that lack an accounting mechanism [27].

On the other hand, honest users who intend to contribute are limited by technological barriers (such as a limited Internet connection bandwidth [47]), or by expertise barriers (such as the lack of knowledge regarding the economics of P2P communities [48]). In this context, it is clear that simply providing more incentives is not the answer to facilitate the contribution of these honest users. There is a need to explore ways to increase user contribution that break the barriers users encounter nowadays in P2P communities.

RQ4: Can user self-interest be aligned with overall community interest?

The design of P2P systems should take into account both user self-interest and the overall community interest, and align the two as much as possible. Users control the behavior of P2P software, and can alter the default behavior to suit their needs, very easily in the case of open-source software. Thus, if system designers put community interest at odds with user self-interest, usually the latter prevails at the expense of the former, resulting in poor general performance [3, 4]. We must, therefore, investigate whether proposed mechanisms for increasing user contribution align user and community interests. Small-scale testing is not sufficient: proposed mechanisms must be tested in conditions corresponding to widespread deployment.

1.4. Thesis outline

In this thesis, we answer the research questions we identified in the previous section through the following research contributions.

A study of the global BitTorrent network

In Chapter 2 we design, implement, and deploy BTWorld, a system for monitoring the global BitTorrent network. The design choices we make allow us to use only a few machines to capture the activity of a large part of the peers in the network. BTWorld monitors the trackers that connect peers, instead of monitoring peers directly. The trackers offer sufficient information to analyze the performance, scalability, and reliability of BitTorrent, without a prohibitive investment in monitoring infrastructure. BTWorld also includes a component for discovering new trackers, which ensures that our monitoring covers a large part of the global network. Using BTWorld data, we give insight into BitTorrent-specific phenomena, like spam trackers and giant swarms. This chapter is based on the following paper:

M. Wojciechowski, M. Capotă, J. Pouwelse, and A. Iosup, “BTWorld: Towards observing the global BitTorrent file-sharing network”, in Workshop on Large-Scale System and Application Performance (LSAP) in conjunction with ACM International Symposium on High Performance Distributed Computing (HPDC), doi:10.1145/1851476.1851562 (2010).

Chapter 2 answers research question RQ1.

A study of community-level resource allocation

In Chapter 3 we study the inter-swarm resource allocation mechanisms used by the most popular BitTorrent clients, i.e., the mechanisms for peers to divide their bandwidth between multiple swarms that are active at the same time. The original BitTorrent client only supported downloading one torrent at a time, but it evolved to support multiple torrents. Similarly, other BitTorrent clients, like Azureus and uTorrent, support multiple torrents. However, little attention has been given in the scientific community to the mechanisms these clients use for allocating resources across swarms. We model the problem of resource allocation in BitTorrent using a flow network and use graph-theoretical techniques to derive performance bounds. Using trace-based simulation, we compare BitTorrent mechanisms to the performance bounds, considering the two use cases encountered in practice: file sharing and video streaming. This chapter is based on the following paper:

M. Capotă, N. Andrade, T. Vinkó, F. Santos, J. Pouwelse, and D. Epema, “Inter-swarm resource allocation in BitTorrent communities”, in IEEE International Conference on Peer-to-Peer Computing (P2P), doi:10.1109/P2P.2011.6038748 (2011).

Chapter 3 answers research question RQ2.
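The flow-network formulation can be sketched informally. The example below is a toy illustration, not the chapter's actual model: seeders, swarms, and leechers become nodes, upload and download capacities become edge capacities, and the maximum flow from a virtual source to a virtual sink upper-bounds the aggregate download speed any allocation can achieve. All names and bandwidth values are hypothetical.

```python
from collections import defaultdict, deque

def max_flow(cap, source, sink):
    """Edmonds-Karp max flow on a nested capacity dict cap[u][v]."""
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual network.
        parent = {source: None}
        q = deque([source])
        while q and sink not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return flow
        # Reconstruct the path and find its bottleneck capacity.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap[v][u] += aug  # residual (reverse) edge
        flow += aug

# Hypothetical community: two seeders, two swarms, three leechers.
cap = defaultdict(lambda: defaultdict(int))
def edge(u, v, c):
    cap[u][v] += c

INF = 10**9
edge("S", "seeder1", 100)       # seeder1 uploads at 100 KB/s
edge("S", "seeder2", 50)
edge("seeder1", "swarmA", INF)  # swarm membership carries unlimited capacity
edge("seeder2", "swarmB", INF)
edge("swarmA", "leecher1", INF)
edge("swarmA", "leecher2", INF)
edge("swarmB", "leecher3", INF)
edge("leecher1", "T", 60)       # leecher download capacities
edge("leecher2", "T", 60)
edge("leecher3", "T", 40)

bound = max_flow(cap, "S", "T")
print(bound)  # upper bound on aggregate download speed: 140
```

Any real allocation mechanism can then be compared against this bound, which is how the chapter's trace-based evaluation is framed.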

A method for predicting upload potential for swarms

In Chapter 4, we take the first step towards increasing user contribution by developing a predictor for upload potential in swarms, i.e., a method of selecting from a set of swarms the swarm where demand is highest. Previous work identified the idle seeding problem, where seeders offering their upload bandwidth do not actually contribute to the community because of seeding over-supply [47]. Furthermore, it was shown that a major cause of the idle seeding problem in high-performance private communities is selecting swarms without upload potential [48]. We study a trace of activity in a private community to determine a way to select swarms with high upload potential. Through simulation, we show that we can correctly rank swarms according to upload potential, both using a complex predictor based on multivariate adaptive regression splines, and a simple heuristic. Our ranking does not require any BitTorrent protocol extensions and ensures that seeders contribute their bandwidth to the swarms that have the most leechers. This chapter is based on the following paper:

M. Capotă, N. Andrade, J. Pouwelse, and D. Epema, “Investment strategies for credit-based P2P communities”, in Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), doi:10.1109/PDP.2013.70 (2013).

Chapter 4 (together with Chapter 6) answers research question RQ3.
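The idea of a simple ranking heuristic can be illustrated with a sketch. The code below is a hypothetical stand-in for the chapter's actual predictor: it orders swarms by leechers per seeder, on the intuition that many leechers competing for few seeders signals unmet demand; all swarm statistics are invented.

```python
# Hypothetical swarm statistics: (name, seeders, leechers).
swarms = [
    ("swarm-a", 120, 15),
    ("swarm-b", 4, 80),
    ("swarm-c", 30, 30),
]

def upload_potential(seeders, leechers):
    # More leechers competing for fewer seeders suggests unmet demand;
    # +1 avoids division by zero for seederless swarms.
    return leechers / (seeders + 1)

ranked = sorted(swarms, key=lambda s: upload_potential(s[1], s[2]),
                reverse=True)
print([name for name, *_ in ranked])  # → ['swarm-b', 'swarm-c', 'swarm-a']
```

A seeder following this ranking would join swarm-b first, where its bandwidth is least likely to sit idle.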

A model for user contribution bounds

In Chapter 5, we introduce a model that shows the effect of increased user contribution in BitTorrent, based on the established BitTorrent fluid models of Qiu and Srikant [49] and Guo et al. [50]. Intuitively, user contribution cannot increase performance beyond the bottleneck represented by the injector of data in the system, i.e., the original seeder. Our model gives the bound for the maximum upload bandwidth that can be added through user contribution before the injector bottleneck limit is reached. Additionally, we test the model predictions in practice using a minimal BitTorrent client based on the Libtorrent [51] library running in a laboratory environment. Our experiments show that widespread deployment of our client results in diminishing user gains, but does not degrade community performance. This chapter is based on the following paper:

M. Capotă, J. Pouwelse, and D. Epema, “Towards a peer-to-peer bandwidth marketplace”, in International Conference on Distributed Computing and Networking (ICDCN), doi:10.1007/978-3-642-45249-9_20 (2014).

Chapter 5 (together with Chapter 6) answers research question RQ4.
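For reference, the fluid model of Qiu and Srikant [49] that this chapter builds on tracks the number of leechers x(t) and seeders y(t); with arrival rate λ, per-peer upload rate μ, download rate c, sharing efficiency η, abort rate θ, and seeder departure rate γ, it reads:

```latex
\frac{dx}{dt} = \lambda - \theta\, x(t) - \min\bigl\{\, c\, x(t),\ \mu\bigl(\eta\, x(t) + y(t)\bigr) \bigr\},
\qquad
\frac{dy}{dt} = \min\bigl\{\, c\, x(t),\ \mu\bigl(\eta\, x(t) + y(t)\bigr) \bigr\} - \gamma\, y(t).
```

Intuitively, once the download-side term cx becomes the binding minimum, extra upload bandwidth no longer improves performance; Chapter 5 derives the analogous bound imposed by the injector's upload capacity.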

A system for increasing user contribution

In Chapter 6, we design, implement, and evaluate with real-world swarms a decentralized, autonomous system for increasing user contribution in Tribler. When implementing our system, we also consider practical issues that we encounter in Internet deployments, such as duplicate content and spam. Furthermore, we tackle the disappearance of old content, a known problem in BitTorrent communities, through an archival mode of operation. Using our system is in the best interest of both users, who gain credit in Tribler for using it, and the community as a whole, which benefits from increased performance. This chapter is based on the following paper:

M. Capotă, J. Pouwelse, and D. Epema, “Decentralized credit mining in P2P systems”, in IFIP Networking Conference (2015).

Chapter 6 (together with Chapters 4 and 5) answers research questions RQ3 and RQ4.

2 BTWorld: Towards observing the global BitTorrent file-sharing network

BitTorrent is an Internet-based P2P file-sharing application that has experienced phenomenal growth due to a variety of factors, such as robustness to content pollution [53] and good performance [53–55]. BitTorrent deployments of different community sizes and shared content types serve hundreds of millions of daily users. However, the growth of BitTorrent into a global file-sharing network has begun to raise issues similar to those the Internet raised more than a decade ago [56], and researchers have begun to recognize the importance of trace-driven research and development. Many empirical studies [53–55, 57, 58] have been dedicated to collecting traces from BitTorrent and to understanding its properties, but these studies have been limited in size and have thus introduced various sampling biases [59]. Complementing and improving upon these studies, in this chapter we introduce BTWorld, which is a system for observing and analyzing the global BitTorrent network, and we use it to characterize BitTorrent deployments world-wide.

The need for new data is not unique to BitTorrent among Internet applications. The research community can greatly benefit from usage traces taken from real applications, for example to understand the trends and characteristics of real deployments, to validate models and drive simulations, and ultimately, to improve the quality of service for millions of users. Motivated by the wide applicability of trace-based research, the US National Academy of Sciences has challenged the Internet research community to develop the means to capture a day in the life of the

This chapter is based on work published in ACM LSAP 2010 [52].


Internet in 2001 [56], and gathered the data collected by small independent measurements into public archives such as DatCat [60] and CRAWDAD [61]. BitTorrent has grown to become a significant part of the Internet and the BitTorrent research community has produced numerous measurement studies. However, we have not yet produced a comprehensive view of the global BitTorrent network, that is, we have not yet formulated our “one day in the life of …” grand challenge and we have not created a public archive with our traces. What should be included in a global BitTorrent view? To guide the development of the network we envision obtaining, from a large and diverse swath of BitTorrent deployments, information about BitTorrent’s performance, scale, and reliability of service.

The problem of observing the global BitTorrent network is not trivial. While large Internet studies are enabled by cooperation with the ISPs, for BitTorrent few ISPs will agree to observe the network, for fear of legal consequences; for example, the RIAA has subpoenaed ISPs to reveal the identities of customers involved in alleged copyright infringements [62]. Thus, the BitTorrent community has been forced to perform its own measurements, without assistance from major ISPs. Due to the scale, complexity, and rapid growth of the network, the results of numerous previous BitTorrent measurements are difficult to interpret coherently. It is symptomatic of the current (lack of) understanding of the global status of BitTorrent that there is no consensus on the Internet traffic share generated by BitTorrent users—caching companies have proposed estimates of 30% in 2005 [10] and over 50% in 2008 [11]. Moreover, academic studies in consecutive years report the doubling of the average download capacity [53, 55], studies spread five years apart of the same BitTorrent deployments acknowledge significant network changes [59], etc.
Furthermore, the existing BitTorrent studies are made on a changing network: while the BitTorrent protocol has remained virtually unchanged since its introduction, a number of additional protocols have made the network less centralized, more available, and more difficult to observe.

In this chapter, our main contributions toward a system for observing the global status of BitTorrent are:

1. We formulate the requirements for the problem of observing the global status of BitTorrent (Section 2.1);

2. We design the BTWorld architecture for observing the global status of BitTorrent (Section 2.2);

3. We implement BTWorld and use the resulting system to observe and analyze one week in the life of a large part of the BitTorrent network (Section 2.3);

4. We show that the wealth of data collected by BTWorld can shed new light on BitTorrent phenomena such as spam trackers and giant swarms (Section 2.4).

2.1. Problem statement

In this section we formulate the problem of observing the global BitTorrent network. We first introduce BitTorrent, and then the system requirements for addressing the problem.

Table 2.1: Major changes in BitTorrent.

Change                 Year  Effect
Multitracker Metadata  2003  Higher tracker availability
DHT Protocol           2005  Global BitTorrent network
Peer Exchange (PEX)    2007  Trackerless BitTorrent
P2P Metadata           2008  Webless BitTorrent search

2.1.1. The BitTorrent file-sharing network

BitTorrent is a peer-to-peer file-sharing system for exchanging content in the form of (bundles of) files. Some BitTorrent file-sharing is private [18], with access restricted to a group of authenticated users. In this chapter, we focus only on public BitTorrent file-sharing activities. In our terminology [59], BitTorrent has three levels of operation: the peer level, the swarm level, and the community level, which we describe in turn.

The peer level comprises individual peers, that is, individual participants in the process of exchanging content that may contribute to or use the system’s resources and services. Peers are present in the system during non-contiguous intervals called sessions; the system size at any moment in time is given by the number of sessions overlapping at that moment in time.

The swarm level groups together, from all the peers in a P2P system, the peers that exchange the same content, for example the latest Linux distribution [54]. A swarm starts being active (is born) when the first peer joins that swarm, and ends its activity (dies) when its last peer leaves. The lifetime of a swarm is the period between the start and the end of the swarm; the swarm is alive during its lifetime.

The community level comprises all the swarms and all the peers in a BitTorrent deployment such as SuprNova [53], ThePirateBay [55, 57], Etree [63], etc. The main function of the community level is to facilitate the formation of swarms, that is, to allow the peers interested in exchanging the same content to start collaborating. This collaboration is facilitated by a tracker, which is a specialized system component that operates at both the community and the swarm levels, and acts as a meeting point for peers. A tracker can manage several swarms. The content transferred in BitTorrent is accompanied by a .torrent metadata file.
The metadata includes, among other pieces of information, a hash that uniquely identifies the content, and a directory for the trackers that manage swarms exchanging the content. Peers interested in a file obtain the file’s metadata from a web site (the community level of BitTorrent) and use the peer location services offered by a tracker (the community and swarm levels of BitTorrent) to find other peers interested in sharing the file. The raw file is then exchanged between peers (the peer level of BitTorrent). Thus, to obtain a file, a user partakes in all three operational levels of BitTorrent.

Several important shortcomings in the original BitTorrent design have been addressed over time, and four main changes are summarized in Table 2.1. First, the directory part of the metadata was initially only able to specify a single tracker, which raised an availability problem; the Multitracker Metadata Extension [16] introduces support for multiple trackers. Second, peers were able to discover other peers only through the services of trackers, which raised anonymity concerns; the DHT Protocol [17] introduces support for finding and exchanging information directly between peers and creates a coarse global BitTorrent network. Third, even when connected through the DHT into a global BitTorrent network, peers had no protocol to inform each other of the presence of other peers in a swarm; the Peer EXchange (PEX) protocol [64] implements this feature, thus enabling swarms to function trackerless. Fourth, to exchange content, peers needed to know the content metadata, which was only obtainable from BitTorrent (web) sites. With the Extension for Peers to Send Metadata Files [65], peers can send each other metadata, effectively creating a webless BitTorrent network.
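The metadata format described above can be sketched briefly. The snippet below builds a toy metadata dictionary, including a multitracker directory, and derives the SHA-1 infohash that identifies the swarm; the bencoder is a minimal illustration of the encoding BitTorrent uses, and all names and URLs are made up.

```python
import hashlib

def bencode(obj):
    """Minimal bencoding for ints, byte strings, lists, and dicts."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Bencoded dictionaries list their keys in sorted order.
        return b"d" + b"".join(
            bencode(k) + bencode(v) for k, v in sorted(obj.items())
        ) + b"e"
    raise TypeError(obj)

# A toy metadata dictionary with a multitracker directory.
meta = {
    b"announce": b"http://tracker.example.org/announce",
    b"announce-list": [[b"http://tracker.example.org/announce"],
                       [b"udp://backup.example.net/announce"]],
    b"info": {b"name": b"example.iso",
              b"piece length": 262144,
              b"pieces": b"",
              b"length": 1048576},
}

# The swarm is identified by the SHA-1 hash of the bencoded info dict.
info_hash = hashlib.sha1(bencode(meta[b"info"])).hexdigest()
print(len(info_hash))  # 40 hex characters
```

This infohash is exactly the identifier that trackers, the DHT, and the scrape service discussed later use to refer to a swarm.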

2.1.2. System requirements

The global BitTorrent network comprises all the BitTorrent communities in the world. Thus, the problem of observing the global BitTorrent network is, in a first formulation, the problem of building a system capable of observing and recording the complete state (traffic, connections, and participation at different BitTorrent levels) for all the public BitTorrent communities in the world. A similar problem has been proposed by the general networking research community as “a day in the life of the Internet”1, but with a focus on network instead of application operation.

The first formulation of the problem is the expression of an archival effort, which would require observing and recording everything in the BitTorrent network. However, our goal is more modest: to reveal the performance, scalability, and reliability of a large, diverse swath of the BitTorrent network. This allows us to make the problem tractable by focusing on BitTorrent trackers, and not each individual peer. Besides the good properties of any measurement system, that is, fault-tolerance, scalability, and non-intrusiveness, we identify three system requirements specific to our problem:

1. The system must be able to find the important trackers serving public BitTorrent communities. Since most BitTorrent trackers are short-lived [53, 66], we define the important trackers as the long-lived trackers that serve either large communities or communities with special characteristics (e.g., religious content);

2. The system must be able to observe and record information about the performance, the scalability, and the reliability of the important trackers. To increase system reliability, some of the trackers have been replicated; the system should observe only one instance of each replicated set of trackers;

3. The system must be able to facilitate data analysis. The volumes expected for the observation data may easily run into the tens of terabytes per year.

1A paraphrase of Solzhenitsyn’s book One Day in the Life of Ivan Denisovich, which depicted the horrors of the Soviet labor camps in the 1950s by recounting the details of a day spent in a camp by the eponymous prisoner.


Figure 2.1: The architecture of BTWorld.

Moreover, the observation process may take days or even weeks before accumulating sufficient data for the problem at hand, so timely detection of errors is critical in the observation process. Thus, the system must provide periodical summaries of the recorded data.

2.2. The design of BTWorld

In this section we present the design of BTWorld, which follows the system requirements introduced in Section 2.1.

2.2.1. Overview

Our goal is to create a global view of the publicly accessible BitTorrent network that will enable us to understand system characteristics such as performance and scalability, and to identify and characterize BitTorrent phenomena. We design this architecture around data observed from trackers (see Section 2.2.2). A distinguishing feature of the BTWorld architecture is its focus on low resource usage. BTWorld needs to balance two conflicting requirements: scaling to the global size of BitTorrent and keeping the amount of consumed resources (such as machines and IP addresses) low. We specifically want to use an order of magnitude fewer machines than previous infrastructures [55, 57], which have each employed several hundred machines; we use only two machines (eight cores) to perform a large-scale BitTorrent measurement and process its results (see Section 2.3), and to observe and analyze new BitTorrent phenomena (see Section 2.4).

Figure 2.1 shows the four BTWorld components and the data flows between them. The Measurement component (component 1) connects to trackers in the public BitTorrent network and collects their status (the raw data). The Tracker Search component (2) gathers information about new trackers, and produces a list of the most important trackers based on various sources of information. The BTWorld design specifically permits domain experts to configure and guide the tracker selection process. The Measurement component contacts only the most important trackers (see Section 2.2.3), which are provided by the Tracker Search component. The Data Processing component (3) creates statistics from the raw data. Last, the Reporting component (4) transforms the output of the Data Processing component into useful reports, including tables, graphs, and animations.

Table 2.2: Advantages and disadvantages of the BitTorrent measurement techniques considered.

Level      Advantage           Disadvantage
Internet   Excellent coverage  ISP collaboration
Community  Implementation      Peer details
Swarm      Details             Context
Peer       Details             Scalability

2.2.2. Measurement techniques

While designing BTWorld, we have identified four different measurement techniques, distinguished by the level at which the measurement is performed. We describe in the following paragraphs each of these techniques and summarize their main advantages and disadvantages in Table 2.2.

Internet-level measurements involve monitoring the traffic exchanged between peers, for example by observing the traffic passing through the routers of an ISP [67, 68]. While this approach can result in observing the complete BitTorrent activity (perfect coverage), it also requires access to the routers of many large ISPs. Moreover, interpreting the traffic is difficult due to the use of encryption in BitTorrent.

The next option we have considered is to observe BitTorrent at the community level. The passive measurement technique of observing the logs produced by trackers [54] leads to good measurements, but requires the collaboration of community administrators, which raises legal concerns. We opt instead to use the scrape service available on most of the trackers. The scrape service is a lightweight protocol running over HTTP, which provides swarm status information to users. It includes, for each swarm, the numbers of seeders, leechers, and downloads.

In contrast to community-level measurements, swarm-level measurements focus on individual swarms. They can also use the scrape service, but query information for a single swarm [53, 55]. In addition, they can collect information through PEX [57]. Swarm-level measurements lack the context information provided by community-level measurements, e.g., multi-swarm participation of users [50].

Peer-level measurements rely on communicating with peers using a custom BitTorrent client [55, 69], which can log activity and even deviate from the strict BitTorrent protocol [55, 69]. It is difficult to deploy this approach at the global scale required by BTWorld.
Alternatively, users can voluntarily install an instrumented client that reports data to measurement servers [70], although it is difficult to convince a large number of users to do this.

Considering the requirements in Section 2.1.2, we base our architecture on community-level measurements performing tracker scrapes and web site scrapes. Additionally, we employ torrent file retrieval—obtaining the torrent file corresponding to a hash observed in a tracker scrape. We have a twofold objective for doing this: improving our tracker knowledge with multi-tracker torrents as we show in the next subsection, and determining the size of the content seen in tracker scrapes. However, torrent file retrieval incurs a significant overhead for millions of torrents—to obtain file size information it collects full torrent files, sized 30-100 kB each. Instead, we use custom scripts that parse the directory pages of well-known BitTorrent sites such as thepiratebay.org, .com, and yourbittorrent.com, and extract content size information. The raw data that we collect includes the hashes identifying shared content in each community, the number of peers participating in swarms (both as seeders and as leechers), the number of downloads for each swarm, and the size of the content being shared.

Table 2.3: Advantages and disadvantages of the tracker search techniques considered.

Technique           Advantage       Disadvantage
Well-known sites    Implementation  Coverage
Pre-compiled lists  Niche sites     Reliability
Parsing torrents    Exploit data    Data size
Metadata extension  Hash analysis   Success rate
Web search engines  Unknown sites   Implementation
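As a concrete illustration of the scrape service used above: trackers return a bencoded dictionary keyed by infohash, with "complete" (seeders), "downloaded" (total downloads), and "incomplete" (leechers) counts. The sketch below decodes a hand-constructed response for a single swarm; the decoder is a minimal illustration, not BTWorld's implementation, and the infohash and counts are invented.

```python
def bdecode(data, i=0):
    """Minimal bencode decoder; returns (value, next_index)."""
    c = data[i:i+1]
    if c == b"i":                       # integer: i<digits>e
        e = data.index(b"e", i)
        return int(data[i+1:e]), e + 1
    if c == b"l":                       # list
        i += 1; out = []
        while data[i:i+1] != b"e":
            v, i = bdecode(data, i)
            out.append(v)
        return out, i + 1
    if c == b"d":                       # dictionary
        i += 1; out = {}
        while data[i:i+1] != b"e":
            k, i = bdecode(data, i)
            v, i = bdecode(data, i)
            out[k] = v
        return out, i + 1
    colon = data.index(b":", i)         # byte string: <length>:<bytes>
    n = int(data[i:colon])
    return data[colon+1:colon+1+n], colon + 1 + n

# A hypothetical scrape response for one swarm (infohash "a" * 20).
resp = (b"d5:filesd20:" + b"a" * 20 +
        b"d8:completei10e10:downloadedi250e10:incompletei45eeee")

decoded, _ = bdecode(resp)
stats = decoded[b"files"][b"a" * 20]
print(stats[b"complete"], stats[b"incomplete"], stats[b"downloaded"])
# seeders, leechers, downloads: 10 45 250
```

Each periodic BTWorld sample is, in essence, one such dictionary per tracker, covering all the swarms the tracker manages.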

2.2.3. Tracker search

The Tracker Search component selects the trackers observed during measurements. We have identified five techniques for finding BitTorrent trackers. We describe in the following each of these techniques and summarize their main advantages and disadvantages in Table 2.3.

The Well-known BitTorrent sites technique uses common knowledge to select trackers. For example, we have selected SuprNova in 2004 [53] and PirateBay in 2005 [55] as the largest from a set of well-known BitTorrent sites; Andrade et al. [63] select one well-known site among the three used in their study; etc. Such an approach requires domain expertise, does not react to changes in the status of the sites (SuprNova is no longer online, PirateBay was shut down several times for legal reasons), and may disregard small BitTorrent sites that host important trackers.

Lists of trackers are publicly accessible on the web. The largest list we know about is located at p2ptrace.com; previous studies have used .com (now defunct). Other lists use expert knowledge to include sites focused on niche areas (e.g., Bollywood movies, Japanese manga, etc.). Reliability is a concern with this solution, since unmaintained lists include a quickly increasing number of dead trackers. p2ptrace.com traces about 900 trackers and is actively updated, but has a complex presentation format and its server is often overloaded.

Parsing torrent files can lead to the discovery of new trackers because most of the

torrent files we encountered use multiple trackers. We extend our tracker coverage by parsing the torrent files collected through the other methods, and selecting from them the trackers that appear often.

The BitTorrent metadata sending extension enables us to obtain torrent files for parsing using the hashes observed in scrapes. BTWorld implements this functionality as an automated process built on top of a BitTorrent client.

Web search engines provide an alternative way to collect torrent files. Specialized search engines such as .com allow us to retrieve torrents corresponding to specific hashes, but limit the transfer rate offered to each IP and are thus very slow. Alternatively, a custom search on Google (“filetype:torrent”) returns about 5.5 million results at the time of writing. Parsing these files can extend the coverage of BTWorld, but requires effort to retrieve and integrate data.

We have implemented and employed in BTWorld all these techniques, and found empirically that using them simultaneously increases the coverage of the tracker search, each of the methods adding certain trackers omitted by the other methods.
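The torrent-parsing step can be sketched as a frequency count over announce-lists. The torrents and tracker URLs below are hypothetical; the idea is simply to keep the trackers that recur across independently obtained torrent files.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical announce-lists (tiers of tracker URLs) from parsed torrents.
torrents = [
    [["http://tracker.example.org/announce"],
     ["udp://backup.example.net/announce"]],
    [["http://tracker.example.org/announce"]],
    [["udp://other.example.com/announce"],
     ["http://tracker.example.org/announce"]],
]

counts = Counter()
for tiers in torrents:
    seen = set()
    for tier in tiers:
        for url in tier:
            seen.add(urlparse(url).netloc)  # one vote per torrent per tracker
    counts.update(seen)

# Keep trackers that appear in more than one torrent.
frequent = [host for host, n in counts.most_common() if n > 1]
print(frequent)  # → ['tracker.example.org']
```

In BTWorld, such frequent trackers are merged with those found by the other four techniques before being handed to the Measurement component.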

2.2.4. Data processing

BTWorld generates large amounts of data, and processing may build in-memory working sets on the order of gigabytes. Thus, it is impractical to process the raw measurement data once for every statistics output. We are thus facing the problem of using a MapReduce-like approach vs. a parallel DBMS [71]. With a MapReduce-like approach, the raw measurement data are first pre-processed into intermediate data, which are then used as input to several statistics. With a parallel DBMS, the raw measurement data are loaded into the database, and then the execution queries for the different statistics are optimized by the database engine. The database community has recently observed [71] that the selection of either of these two alternatives depends on the parameters of the data processing problem (data size, duration of data loading, complexity of processing operations, etc.). We perform our own experiments with the dataset described in Section 2.3.1 and find that, for our problem and experimental setup, the MapReduce-like approach produces better results than the parallel DBMS.
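The MapReduce-like pre-processing can be sketched in miniature. The sample tuples below are invented; the point is the shape of the computation: a map phase emits per-tracker records from raw scrape samples, and a reduce phase aggregates them into intermediate data that several statistics can then reuse.

```python
from collections import defaultdict

# Hypothetical raw scrape samples: (tracker, swarm_hash, seeders, leechers).
samples = [
    ("trk1", "h1", 10, 40),
    ("trk1", "h2", 0, 5),
    ("trk2", "h3", 7, 3),
    ("trk1", "h1", 12, 38),
]

# "Map": emit (tracker, (concurrent_sessions, 1)) pairs from each sample.
mapped = ((t, (s + l, 1)) for t, _, s, l in samples)

# "Reduce": sum sessions and sample counts per tracker key.
acc = defaultdict(lambda: [0, 0])
for key, (sessions, n) in mapped:
    acc[key][0] += sessions
    acc[key][1] += n

for tracker in sorted(acc):
    total, n = acc[tracker]
    print(tracker, total, n)
```

The resulting per-tracker aggregates are small enough to feed many different statistics without re-reading the raw data.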

2.2.5. BitTorrent analysis We aim to analyze BitTorrent in terms of global system characteristics and unique phenomena, both qualitatively and quantitatively. For the quantitative analysis, we use the following metrics, grouped by system characteristic: Performance - to quantify system performance, we look at the total number of downloads per swarm. We also consider the total downloaded volume per swarm, defined for a swarm as the product of its content size and its total number of downloads. We further define the total number of downloads per tracker as the sum of the number of downloads for all the swarms managed by the tracker, and the total downloaded volume per tracker as the sum of the downloaded volume of the swarms managed by the tracker. We also present the number of downloads per hour for a tracker, defined as the difference in the reported number of downloads for all the swarms of the tracker between two samples taken one hour apart.
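As an illustration, the performance metrics can be computed from per-swarm records as follows (the record layout and tracker names are hypothetical):

```python
from collections import defaultdict

# Hypothetical per-swarm records: (tracker, content_size_bytes, downloads).
swarms = [
    ("trk-a", 700_000_000, 120),
    ("trk-a", 350_000_000, 10),
    ("trk-b", 4_000_000_000, 55),
]

# Downloaded volume per swarm: content size times number of downloads.
volume_per_swarm = [(t, size * dl) for t, size, dl in swarms]

# Per-tracker metrics: sums over all swarms managed by the tracker.
downloads_per_tracker = defaultdict(int)
volume_per_tracker = defaultdict(int)
for tracker, size, dl in swarms:
    downloads_per_tracker[tracker] += dl
    volume_per_tracker[tracker] += size * dl
```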

Table 2.4: Aggregate characteristics of all the traces collected from trackers.

Metric                             Value
Tracing period                     3 to 9 January 2010
Number of trackers                 912
Number of alive trackers           769
Number of swarms                   10 329 950
Number of hashes                   6 314 318
Number of hashes with known size   1 024 573
Number of swarm samples            899 537 250

Scalability - to analyze scalability, we define the following two metrics: number of concurrent sessions and content size. The number of concurrent sessions per swarm is the sum of the numbers of seeders and leechers present concurrently in a swarm, as reported by the tracker in the scrape. We further define the number of concurrent sessions per tracker as the sum of the number of concurrent sessions for every swarm managed by the tracker. The content size for a swarm is the size appearing in the torrent file corresponding to the swarm. The content size for a tracker is the sum of the content size for every swarm managed by that tracker. Reliability - we express reliability by looking at the number of dead torrents, defined as the number of torrents with no seeders reported by the tracker in the scrape. We also investigate the seeders to leechers ratio of a swarm, which is the ratio of the number of seeders to the number of leechers in that swarm, as reported by the tracker.
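A sketch of the scalability and reliability metrics for one tracker, computed from hypothetical (seeders, leechers) scrape rows, one per swarm:

```python
# Hypothetical scrape rows for the swarms of one tracker: (seeders, leechers).
scrape = [(10, 2), (0, 7), (25, 5), (0, 0)]

# Scalability: concurrent sessions per tracker, summed over its swarms.
sessions_per_tracker = sum(s + l for s, l in scrape)

# Reliability: dead torrents (no seeders) and the seeders to leechers ratio.
dead_torrents = sum(1 for s, _ in scrape if s == 0)
total_seeders = sum(s for s, _ in scrape)
total_leechers = sum(l for _, l in scrape)
slr = total_seeders / total_leechers
```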

2.3. A week in the life of 769 BitTorrent trackers In this section we analyze a large-scale BitTorrent dataset collected by BTWorld over one week of operation.

2.3.1. BitTorrent traces We started collecting traces using BTWorld in December 2009, and the collection process still continues. For this chapter we have selected a one-week subset of the entire trace collection, starting on January 3, 2010. January is typically a slow month in user content distribution, especially in comparison with the usual peak loads occurring in December; many of the users in the US and Europe (the majority of the users of , ThePirateBay [55], and other large sites focusing on English-speaking content) return from vacations and start the work year. This dataset serves two purposes: as an assessment of the BTWorld capabilities, and as a demonstration of the usefulness of the raw data acquired so far by BTWorld. BTWorld was able to collect in one week a large-scale dataset, as summarized by Table 2.4. The total number of swarms present in our traces exceeds 10 million, and overall we have collected close to 900 million swarm samples with swarm information (number of seeders and leechers, etc.). The collected data reveal interesting features

Figure 2.2: The CDF of the mean number of concurrent sessions per tracker (log scale; median: 964.42).

of the global BitTorrent network; we comment on two such features. First, we identify alive trackers as those that replied at least once with a meaningful response (i.e., we were able to parse the response and produce useful data). Out of our initial list of trackers, only 769 (84%) were alive during the selected week; this re-confirms the findings of previous tracker availability studies [53, 66]. Second, the total number of swarms is 1.63 times the number of hashes, indicating significant overlaps in the content of different trackers.

2.3.2. Global analysis We first analyze the trace from a global perspective, trying to identify characteristics of the BitTorrent network as a whole. We exclude here the spam trackers presented in detail in Section 2.4.1; these trackers return, via the scrape interface, information that cannot be trusted. Large diversity in tracker sizes. We investigate the number of users served by individual BitTorrent trackers. Figure 2.2 shows the cumulative distribution function (CDF) of the number of concurrent sessions for all the trackers. We observe in the figure and in Table 2.5 a wide range of values, between 75 and 27.6 million concurrent sessions. The median of around 1000 concurrent sessions shows that most BitTorrent trackers investigated are relatively small compared to the values reported in previous work [50, 53–55]. High proportion of seeders. We analyze the availability of content in the swarms by looking at the seeding levels per tracker. Figure 2.3 shows the CDF of the seeders to leechers ratio (SLR) for each tracker. We observe a median SLR of 5.64, which we find high. For 406 trackers (almost 70%) we observe a seeders to leechers ratio above 1. This shows that public BitTorrent trackers currently exhibit high content availability, and the “few seeders, many leechers” scenario is not common anymore. The result is in accordance with another recent study, which

Table 2.5: Statistics for the total number of concurrent sessions (aggregate number of seeders and leechers) per tracker per scrape sample.

Trackers   All           First         Top-10        Top-100
Min        0             17.98 × 10⁶   116 766       0
Q1         75            19.16 × 10⁶   591 669       31 420
Median     964           21.91 × 10⁶   773 222       54 982
Mean       154 300       22.15 × 10⁶   5.36 × 10⁶    893 445
Q3         7 077         25.03 × 10⁶   8.70 × 10⁶    122 630
Max        27.60 × 10⁶   27.60 × 10⁶   27.60 × 10⁶   27.60 × 10⁶
SD         1.48 × 10⁶    2.97 × 10⁶    7.35 × 10⁶    3.49 × 10⁶
Sum        7.53 × 10⁹    3.70 × 10⁹    7.01 × 10⁹    7.44 × 10⁹

Table 2.6: Statistics for the sizes of files.

Trackers   All         First       Top-10      Top-100
Min        13 B        1.47 GB     186.37 MB   13 B
Q1         61.17 MB    1.48 GB     1.75 GB     243.81 MB
Median     366.77 MB   1.49 GB     5.69 GB     735.42 MB
Mean       1.50 GB     20.46 GB    14.27 GB    3.01 GB
Q3         1.05 GB     21.79 GB    12.00 GB    2.67 GB
Max        423.55 GB   101.06 GB   329.13 GB   423.55 GB
SD         5.18 GB     34.18 GB    26.12 GB    9.40 GB
Sum        3.43 PB     225.09 GB   10.08 TB    716.02 TB

has investigated private trackers [18], and shows that BitTorrent is overall a well-provisioned network. BitTorrent users share petabytes of unique content, mostly in large bundles of files. We focus on the sizes of the (bundles of) files shared, and study the content of 2.29 million swarms; Table 2.6 summarizes their basic statistics. The total size of the content shared by these 2.29 million swarms is 3.43 PB; assuming that all the 10.33 million swarms observed in our study have similar properties, we conclude that the BitTorrent network comprises at least 15 PB of content. The mean content item size is 1.50 GB, which shows that BitTorrent is mostly used for exchanging large (bundles of) files. The distributions of the studied characteristics, including the average number of concurrent sessions per tracker and the seeders to leechers ratio per tracker, show elements of an exponential probability distribution in the bulk of the distribution, but have significant non-exponential features in the tails. For example, the CDF of the average number of sessions, depicted in Figure 2.2, shows elements of an exponential distribution between its 7th and 93rd percentiles. However, we notice that for the

Figure 2.3: The CDF of the mean seeder to leecher ratio per tracker (log scale; median: 5.64).

93rd–100th percentiles, the exponential distribution assumption does not hold: there are trackers with more than 20 million concurrent sessions, an order of magnitude more than the next-largest trackers.

2.3.3. Per-tracker analysis To present tracker-level details, we first rank the trackers according to the average number of sessions per swarm. We then present in this section the characteristics of the largest ten trackers. We use the same color and symbol to depict the same tracker in Figure 2.4. Daily usage pattern. We focus on the number of users a tracker is serving. Figure 2.4 shows the number of concurrent sessions for the top ten trackers with respect to time. For almost all the trackers the daily usage pattern is visible. This seems to be a result of human activity, as there exists a strong day-night variation in the recorded data. Relatively constant seeders to leechers ratio. We study how the content availability evolves in time. Figure 2.5 shows the seeders to leechers ratio for the top ten trackers, with respect to time. Apart from one tracker, we do not observe large fluctuations in the ratio. Content availability is relatively constant over time, with some daily usage patterns and minor fluctuations visible.

2.4. New BitTorrent phenomena In this section we show evidence that BTWorld enables the study of new BitTorrent phenomena. In the following, we study two such phenomena, namely spam trackers and giant swarms, using the traces presented in Section 2.3.1.

Figure 2.4: The number of sessions in time for the Top-10 trackers.

Figure 2.5: The seeder to leecher ratio in time for the Top-10 trackers.

Table 2.7: Spam Trackers: Number and percentage of trackers found by criterion and mixtures of criteria. Percentage computed from the number of alive trackers (see Table 2.4).

                                   Found spam trackers
Applied criteria                   Number   Percentage
C1: TSessN ≥ 50 × 10⁶ peers        83       11
C2: mean(SDldN) ≥ 10⁶ downloads    7        1
C3: CV(TSessN) ≤ 2                 41       5
C1 or C2                           89       12
C1 or C3                           124      16
C2 or C3                           48       6
Total (C1 or C2 or C3)             130      17

2.4.1. Spam trackers Web spamming is understood as the process of misleading web search engines into giving a web site a rank that is unjustifiably high for its actual content [72]. Similarly, we define spam trackers as trackers that advertise false information about swarms for the purpose of spamming specialized BitTorrent search engines. There are multiple reasons for maintaining spam trackers, such as disrupting the illegal sharing of copyright-protected works, or spreading malicious software; we do not investigate these, but focus instead on the quantitative characterization of the phenomenon. We propose a rule-based system for identifying spam trackers. Currently, the domain expert generates the rules based on studying the data, but in the future we plan to automatically generate the rules from the measurement data. Our system is based on three rules (criteria). Criterion C1 defines a spam tracker as a tracker advertising an improbably high number of peers participating in the swarms it manages; the current cut-off point is 50 million concurrent sessions (TSessN). Criterion C2 considers the mean number of downloads per swarm (SDldN) for all the swarms of a tracker; if this number is at any time higher than 1 million, the tracker is spamming. Finally, we define criterion C3 using the coefficient of variation (CV) of the number of concurrent sessions. Based on the collected data, we empirically set the cut-off point for criterion C3 at 2. All known trackers are well below the cut-off points for all criteria, thus false positives are not an issue. We will investigate false negatives in future work. A summary of the spam levels resulting from applying these criteria is presented in Table 2.7. We observe that most of the spam trackers advertise a very large number of peers, and are thus identified by criterion C1. A considerable number of trackers advertise numbers of peers that vary little over time, and are identified by criterion C3.
Last, there exist seven trackers that try to spam using false download numbers (C2). Overall, we notice a lower percentage of spamming compared to other areas—less than 20% vs. 50–80% for email and the Web—but our spam identification methods are still coarse in comparison with the state-of-the-art [72].
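The three rules can be sketched as a small classifier (an illustration only; the thresholds are the cut-off points from Table 2.7, and the input time series are hypothetical):

```python
from statistics import mean, pstdev

def spam_criteria(tsessn, sdldn):
    """Evaluate the three spam rules for one tracker.

    tsessn: total concurrent sessions per scrape sample (TSessN).
    sdldn: per-sample mean number of downloads per swarm (SDldN).
    """
    m = mean(tsessn)
    cv = pstdev(tsessn) / m if m else 0.0          # coefficient of variation
    return {
        "C1": max(tsessn) >= 50_000_000,           # improbably many peers
        "C2": max(sdldn) >= 1_000_000,             # improbably many downloads
        "C3": cv <= 2,                             # lowly varying session counts
    }

# A hypothetical tracker advertising over 50 million concurrent sessions.
flags = spam_criteria([60_000_000, 61_000_000], [500])
is_spam = any(flags.values())
```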

Table 2.8: Statistics for the number of concurrent sessions per swarm (SSessN) per scrape sample for the largest active swarms.

          Top-1%              Top-0.1%            Top-0.01%
Samples   15 711              1 571               157
SSessN    Max       Mean      Max       Mean      Max       Mean
Min       546       9         4 301     255       31 490    9 055
Q1        708       448       7 292     2 088     33 680    15 150
Median    1 025     643       8 185     4 483     36 080    18 860
Mean      2 838     1 509     15 850    7 538     48 920    28 100
Q3        2 046     1 160     17 040    8 147     45 270    19 460
Max       372 800   365 600   372 800   365 600   372 800   365 600

2.4.2. Giant swarms Previous studies [53, 54] identify at most 5000 concurrent peers in the largest swarms observed. We classify swarms with a larger number of concurrent sessions as giant swarms. To analyze swarm sizes, we first introduce the notion of active swarms as swarms whose number of concurrent sessions exceeds 10 at some time during their operation. Eliminating inactive swarms prunes the dataset of information irrelevant for P2P transfers. The spam trackers identified in the previous section are also excluded from the dataset used for analyzing giant swarms. We now investigate the characteristics of the largest swarms present in our dataset. Table 2.8 summarizes the basic statistical properties of the number of concurrent sessions per swarm (SSessN) across scrape samples. We present for each statistical property both the maximum and the mean of the values observed over time across scrape samples for each swarm. Giant swarms represent about half of the Top-0.1% of our sample. For the Top-0.01% of the swarms we observe a median of 36 080 peers for the maximum number of concurrent sessions, indicating much higher user participation than in previous measurement studies [53, 54]. Notably, there exist swarms with hundreds of thousands of users. We focus now on the trackers managing at least one swarm appearing in one of the largest-swarms lists. Table 2.9 shows information about these trackers. The number of unique trackers that together serve all the large swarms is 94 for the Top-1%, 18 for the Top-0.1%, and only 4 for the Top-0.01%. We notice that a small percentage of the trackers manage all of the largest swarms. Furthermore, many of the largest swarms are managed by single trackers. Table 2.9 shows that about half of the swarms in the Top-1% and Top-0.1%, and almost all of the swarms in the Top-0.01%, are managed by single trackers.
Thus, the most popular files are largely served by single trackers, which are effectively single points of failure, raising availability concerns.
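The pruning and classification steps above can be sketched as follows (the swarm time series are hypothetical; the 10-session activity threshold and the 5000-peer giant-swarm boundary are from the text):

```python
# Hypothetical per-swarm time series of concurrent sessions (seeders + leechers).
swarms = {
    "h1": [3, 8, 9],           # never above 10: inactive, pruned from the dataset
    "h2": [4, 12, 7],          # active
    "h3": [6000, 7500, 7200],  # active and giant (more than 5000 concurrent peers)
}

# Active swarms: concurrent sessions above 10 at some point in their operation.
active = {h: ts for h, ts in swarms.items() if max(ts) > 10}

# Giant swarms: more concurrent sessions than the largest previously observed.
giant = {h for h, ts in active.items() if max(ts) > 5000}
```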

Table 2.9: Largest active swarms and their trackers.

                                              Top-1%   Top-0.1%   Top-0.01%
Trackers of largest active swarms
  Number of trackers                          94       18         4
  Percentage of alive trackers                12.22    2.34       0.52
Largest active swarms with a single tracker
  Number of swarms                            6 744    946        141
  Percentage of large swarms                  42.93    60.21      89.75

2.5. Related work We survey in this section related research efforts focusing on BitTorrent and other peer-to-peer file-sharing networks. BitTorrent There have been many BitTorrent measurement and analysis studies in the past five years. Table 2.10 characterizes previous studies by size, measurement technique, and special feature. Size-wise, our study provides one of the largest datasets on the status of the BitTorrent network (though we realize that more is not necessarily better). Concerning measurement techniques, our study proposes novel tracker search functionality, including automated search. We also note that BTWorld involves many non-trivial design choices. In contrast, most of the studies reported here use improvised measurement infrastructure, without reporting design or technical details. The remaining two studies, Iosup [55] and Siganos [57], develop the MultiProbe and the APOLLO infrastructures, respectively, but these both focus on ThePirateBay, and therefore face more limited data collection and processing challenges. They also assume that the measurement infrastructure can comprise hundreds of globally-distributed machines, which contrasts with our single-location, few-machines measurement infrastructure. Last, we compare studies by special feature, that is, the feature or aspect of BitTorrent the study was the first to investigate comprehensively. Our study is the first to focus on observing the global BitTorrent network; this enables a better understanding of the network, including the identification and quantification of two new BitTorrent phenomena, giant swarms and spam trackers. Two recent studies by Dán and Carlsson [75, 76] are closely related to our work. In particular, they collect information from a large number of BitTorrent trackers (including the largest trackers we investigate), using the “well-known BitTorrent site” and “torrent file parsing” techniques.
However, these studies have a different focus from ours, do not use several of the tracker search techniques described in Section 2.2.3, collect information about fewer swarms (3.3 and 5.2 million hashes, respectively, compared to our 10.3 million), and obtain a majority of their unique hashes from a single BitTorrent community (Mininova).

Table 2.10: Monitoring studies.

More recently, after the start of BTWorld, Otto et al. [58] studied data gathered

from 500 000 users who installed the plugin developed for the Ono project [70]. Their study benefits from custom code running on the computers of users that collects data related to the geographic and network locality of BitTorrent peers. This measurement approach has the disadvantage of requiring a large number of users to install a plugin. Other peer-to-peer file-sharing networks Similarly to the case of BitTorrent, much work has been put into measuring peer-to-peer file-sharing networks such as Gnutella [77, 78], FastTrack [3], eDonkey [79], and Overnet [80, 81]. Due to the differences between BitTorrent and these networks, the data collection infrastructures used in these studies face different challenges than ours.

2.6. Conclusion As BitTorrent increases in scale and complexity, new measurements are needed for system optimization. Motivated by the experience of the Internet community more than a decade ago, in this chapter we formulate and take initial steps towards meeting an ambitious challenge: to obtain a global view of the BitTorrent network. We have first presented the design of BTWorld, an architecture for observing the global status of BitTorrent through information provided simultaneously by a large number of popular trackers. Then, we have implemented and deployed BTWorld, periodically collecting information from hundreds of trackers. We have used BTWorld to analyze one week of global BitTorrent operation, which resulted in a large-scale measurement—over 10 million swarms—and a holistic analysis of BitTorrent’s performance, scale, and reliability. Last, this broad analysis has enabled us to provide new and more detailed insights about two important BitTorrent phenomena, spam trackers and giant swarms. We continue to work towards obtaining comprehensive, long-term data about the global BitTorrent status. We are working towards making this dataset available to the research community as part of the Peer-to-Peer Trace Archive [82].

3 Inter-swarm resource allocation in BitTorrent communities

A large body of research (e.g., [45, 46]) shows that BitTorrent’s resource allocation mechanisms create efficient and scalable peer-to-peer swarms for content distribution. However, nearly all evidence of BitTorrent’s efficiency has been found exclusively in the context of single-swarm operation. At the same time, measurements show that most BitTorrent users participate in multiple swarms, or torrents, simultaneously [50]. It is customary for users to download, or leech, multiple files concurrently, and to continue uploading the files they finished downloading, or seed, at the same time. In order to enable this multiple-swarm operation, designers of BitTorrent clients have introduced two mechanisms to perform inter-swarm resource allocation: the selection of torrents to seed in, and the allocation of upload bandwidth across all torrents in which a peer participates, either as a seeder or as a leecher. Although these mechanisms are routinely used by millions of users, their performance has not been evaluated, and it is not known whether they can be improved. In this chapter, we provide an analysis of their performance. The evaluation of inter-swarm resource allocation mechanisms presumes a theoretical understanding of the problem they address. We thus build a foundation for this chapter by formally defining the inter-swarm resource allocation problem. Our formalization shows that this problem is NP-hard. We find that to make it tractable, it is necessary to divide it into two parts, torrent selection and bandwidth allocation. This allows us to derive theoretical upper bounds for the performance of resource allocations in a BitTorrent community. In this chapter, we consider two types of communities: (1) file-sharing communities, which target maximizing

This chapter is based on work published in IEEE P2P 2011 [83].


aggregate download speed, and (2) video-streaming communities, which target maximizing the number of users receiving sufficient download speed for streaming. We derive performance bounds by modeling the community as a flow network and using graph theory. The performance upper bound for a file-sharing community is obtained by solving the maximum flow problem; for video-streaming communities, we introduce a novel algorithm to find the max-min fair allocation of flow. The theoretical results provide the context for the evaluation of the inter-swarm resource allocation mechanisms deployed in BitTorrent clients. Our analysis focuses on the mechanisms of uTorrent and Azureus, two clients that together account for 80% of BitTorrent’s usage [46]. We describe their behavior and recreate it with simulators for torrent selection and bandwidth allocation. To obtain realistic results, we use traces of real-world BitTorrent usage from two BitTorrent communities to drive our simulations. Each trace captures the behavior of nearly 100 000 users over several months, representing large-scale real instances of the inter-swarm resource allocation problem. Our results show that it is possible to significantly improve inter-swarm resource allocation mechanisms, but that efforts should focus on bandwidth allocation. The torrent selection mechanism does not hamper performance, partly because of a reduced space of possibilities for selection algorithms in practice. In contrast, the bandwidth allocation mechanism performs poorly compared to the optimal: as low as 50% for file sharing, and 25% for video streaming. In this chapter, our main contributions are:

1. We formally define the inter-swarm resource allocation problem, show that it is NP-hard, and decompose it into two tractable subproblems (Section 3.1);

2. We introduce a graph-theoretical model for BitTorrent communities and use it to derive performance upper bounds for inter-swarm resource allocations (Section 3.3);

3. We analyze through trace-based simulation the performance of the BitTorrent inter-swarm resource allocation mechanisms (Sections 3.5 and 3.6).

3.1. The inter-swarm resource allocation problem In this section we formulate and formalize the problem of inter-swarm resource allocation in BitTorrent. We first describe the need for inter-swarm resource allocation. Next, a formulation of the general resource allocation problem is presented. Finally, we define the two goals for resource allocation that are considered in this chapter.

3.1.1. Background Consider a BitTorrent community with a set of users I and a set of torrents T. A user participating in multiple torrents simultaneously is said to have a session in each of these torrents, and these torrents are said to form the user’s active torrents set. Users downloading a torrent are called leechers, and their sessions are called leeching sessions, while users that have finished downloading a torrent and are only uploading it are called seeders, and their sessions are called seeding sessions. We denote by S_t and L_t the sets of seeders and leechers in a torrent t, respectively. Efficiently participating in a torrent requires a BitTorrent client to maintain a certain number of open TCP connections. However, too many simultaneous connections may reduce the overall upload or download speed of the client. Therefore, most BitTorrent clients have a configuration parameter to set the maximum number of active torrents, thus limiting the total number of open TCP connections in the client. Furthermore, limiting the number of active torrents ensures that disk activity is also limited, thus preventing disk thrashing. A BitTorrent client decides which torrents to keep active at any given moment. Clients usually always keep their leeching sessions active. If a client has fewer leeching sessions than the maximum number of active torrents, it will start seeding sessions in torrents from its seeding library, i.e., the set of torrents that have been fully downloaded and that have not been removed. The seeding library of user i is denoted by Λ_i. The maximum number of seeding sessions of a client, called its seeding capacity and denoted by k_i, is determined by the maximum number of active torrents, the number of leeching sessions, and the size of the seeding library.

3.1.2. Problem formulation An inter-swarm resource allocation represents the decisions of all users in the community about which leeching sessions they serve, and how much bandwidth they assign to each of these leeching sessions. An inter-swarm resource allocation must satisfy two constraints. Constraint C1 is the seeding capacity constraint: users cannot exceed their seeding capacity k_i. Constraint C2 is the bandwidth constraint: users cannot offer more bandwidth than their upload bandwidth µ_i, and they cannot accept more bandwidth than their download bandwidth δ_i. More formally, a resource allocation is a function A : P → R which satisfies:

(C1) ∀i ∈ I : |{t ∈ Λ_i | ∃j ∈ L_t s.t. A(i, j, t) > 0}| ≤ k_i; and
(C2) ∀i ∈ I : ∑_{t∈T, j∈L_t} A(i, j, t) ≤ µ_i, and ∀j ∈ I : ∑_{t∈T, i∈S_t∪L_t} A(i, j, t) ≤ δ_j,

where P is the set of triplets (i, j, t) representing the potential data transfer connections, P := {(i, j, t) ∈ I × I × T | (i ∈ S_t ∨ i ∈ L_t) ∧ j ∈ L_t}. The Resource Allocation Problem (RAP) is finding a resource allocation that achieves a specific goal. Such a goal can be either the optimization of a metric or the satisfaction of a set of constraints. The RAP for a particular community may be to maximize the total throughput in the community, whereas another community may be interested in maximizing the median download speed across all leeching sessions. Regarding constraint satisfaction, there can be communities interested in guaranteeing a certain minimum download speed for all users, or communities aiming at achieving some form of max-min fairness. We give precise definitions for the goals we consider in the next subsection. In general, a RAP is a mixed-integer (non-)linear optimization problem, which is, regardless of the goal, NP-hard. Nevertheless, it is possible to divide RAP into two

subproblems, which are the seeding sessions allocation problem and the bandwidth allocation problem. This decomposition may lead to an approximate solution for a RAP, and maps parts of the problem to more tractable equivalents. The Seeding Sessions Allocation Problem (SSAP) consists of selecting a subset P_s of P such that the number of torrents in which any seeder uploads does

not exceed its seeding capacity: ∀i ∈ I : |{t ∈ Λ_i | ∃j ∈ L_t s.t. (i, j, t) ∈ P_s}| ≤ k_i. Solving SSAP yields a set of possible data transfer connections P_s that satisfies the seeding capacity constraint C1. This corresponds to the torrent selection done by BitTorrent clients. We define a bandwidth allocation as a function B : P_s → R such that the bandwidth constraint C2 holds. Because of the definition of P_s, the bandwidth allocation also satisfies C1. The Bandwidth Allocation Problem (BAP) is then finding a bandwidth allocation that achieves a RAP goal. BAP is a tractable relaxation of RAP. It is possible to map BAP to equivalent problems in flow networks, as we show in Section 3.3. Framing a RAP as being composed of SSAP and BAP also allows us to derive upper bounds for its solution. Applying an algorithm that optimally solves BAP to the complete set P will produce an upper bound for RAP. This is equivalent to relaxing RAP by ignoring C1. Although this upper bound is not necessarily a feasible solution of RAP, it can be used as a reference for the performance of heuristic solutions.

3.1.3. Maximizing download speed and optimizing streaming We consider two goals for BitTorrent communities in this chapter. The first one is suitable for a community interested in maximizing the average download speed of its users. This goal, named Maximum throughput, is in line with many existing file-sharing communities. It is formally defined as finding an allocation A that maximizes ∑_{(i,j,t)∈P} A(i, j, t). The second goal reflects the requirements of video-streaming systems. In this case, the community intends to provide as many users as possible with enough download speed for streaming. One way to formalize this objective is to aim at providing a certain minimum streaming rate to as many leeching sessions as possible. However, we opt for a stronger formulation named Max-min fairness. In a max-min fair allocation, the download speed of a leeching session can only be increased at the cost of decreasing the download speed of another leeching session that has a lower speed. Formally, an allocation A is max-min fair iff

∀A′ : if ∃p ∈ P | A′(p) > A(p) then ∃q ∈ P | A(q) ≤ A(p) and A′(q) < A(q).
Intuitively, the allocation should provide the highest possible streaming rate for the lowest-bandwidth user, then the highest possible streaming rate for the second lowest-bandwidth user, and so on. With the resulting allocation, users that download at a rate lower than the streaming rate will experience some startup delay, but will still have the best possible quality of experience. Furthermore, max-min fairness enables the community to work with multiple streaming rates of varying qualities and to minimize the number of users experiencing low-quality streams.
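For finite allocations, this definition can be checked directly against a proposed alternative allocation. A minimal sketch (the dictionary representation of allocations and the function name are our assumptions, not part of the model):

```python
def maxmin_objection(A, A_prime):
    """Check the max-min fairness condition of allocation A against a
    candidate alternative A_prime.

    A and A_prime map leeching-session identifiers to download speeds
    (hypothetical representation). Returns a session p that A_prime
    improves without decreasing any session q with A(q) <= A(p), i.e.,
    a counterexample to the max-min fairness of A; returns None if
    A_prime yields no such counterexample.
    """
    for p in A:
        if A_prime[p] > A[p]:
            # The definition requires a decreased session q whose old
            # speed was at most A(p).
            witnesses = [q for q in A
                         if A[q] <= A[p] and A_prime[q] < A[q]]
            if not witnesses:
                return p  # p was improved "for free": A is not max-min fair
    return None
```

For example, raising the slower of two sessions {a: 1, b: 3} to {a: 2, b: 2} only costs the faster session, so it is a valid objection against {a: 1, b: 3}; the reverse move against {a: 2, b: 2} is not.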

3.2. Simulators for current RAP solutions
We now describe in detail the inter-swarm resource allocation mechanisms currently deployed in BitTorrent clients. In addition, we introduce simulators that replicate their behavior, and present a simulator validation experiment.

3.2.1. Current mechanisms in BitTorrent clients
Current BitTorrent communities tackle RAP in a decentralized manner using various heuristics implemented by BitTorrent clients. We investigate for this analysis the two most popular clients, uTorrent and Azureus, which account for 80% of BitTorrent usage [46]. Examining the configuration and documentation of these clients shows that they solve SSAP and BAP using similar behavior.
With regard to SSAP, or torrent selection, these clients employ a heuristic based on the proportion of leechers in each swarm. First, all torrents in the user's seeding library are sorted according to their proportions of leechers. Then, the torrents with the highest proportions of leechers are chosen to be part of the active torrents set. The torrents that fall outside the seeding capacity are paused. This heuristic relies on the assumption that the proportion of leechers is a good approximation for the bandwidth need in a torrent. Note that this heuristic does not take into account the bandwidth of seeders and leechers. However, the impact of this omission on the quality of the solution will depend on the problem instance at hand.
The solution to BAP, or bandwidth allocation, involves three steps. First, the client allocates the same number of upload connections to each active torrent, five by default. Second, the connections are allocated to leechers inside each torrent. This is done differently by seeders and leechers. Seeders allocate connections in a round-robin fashion to all leeching sessions. Leechers allocate one connection randomly in order to bootstrap the discovery of new peers.
The rest of the connections are allocated to the fastest reciprocating peers following a tit-for-tat strategy. In the third step, after connections are allocated, each upload connection receives an equal share of the peer's total upload bandwidth.
There are various reasons for the current BAP solution. The equal division of connections across torrents stems from the assumption that a small fixed number of connections is sufficient for good performance in a torrent. The round-robin allocation of upload connections used by seeders gives every leecher an equal share of the seeder's bandwidth. The allocation of upload connections inside a torrent by leechers incentivizes cooperation. Note that in principle tit-for-tat leads to an emergent clustering of peers by upload bandwidth: peers tend to exchange data with others with similar bandwidth. However, the randomly allocated connection, called optimistic unchoke in BitTorrent terminology, affects the clustering: a low-bandwidth leecher receiving a random connection from a high-bandwidth leecher will frequently reciprocate with a regular connection, potentially disconnecting a leecher with similarly low bandwidth. This bias of slow peers towards faster uploaders is

well documented in the literature [84, 85].
Allocation mechanisms for network resources must also interact with lower network layers. In the context of large data transfers, a paramount issue is the interplay between the application and transport layers. BitTorrent clients typically use TCP connections, hence in case a leecher's download connection gets congested, TCP congestion control mechanisms come into effect, interfering with the bandwidth allocation of the BitTorrent client. TCP congestion control divides the bandwidth of a congested download connection equally among all uploading connections.
The interplay of the current BAP solution and TCP congestion control has non-obvious effects on the overall resource allocation. For instance, consider a scenario in which there are two torrents, each with one leecher. There are also two seeders, s1 and s2. The upload and download bandwidth of all peers is c. Seeder s1 has seeding capacity 1 and is thus only active in one torrent, while s2 has seeding capacity 2 and is active in both torrents. If the seeders allocate their bandwidth according to the current BAP solution, the leecher served by both seeders will be a bottleneck, and the download capacity of this leecher will be divided equally among the two seeders. The other leecher would be served by a seeder with spare capacity 0.5c. The resulting aggregate download speed will be 1.5c, instead of the maximum of 2c. If the system aims at maximizing throughput, this is a suboptimal solution.
Finally, note that the SSAP and BAP solutions we describe in this section are decentralized. Each peer acts autonomously based on local information about other peers. We create simulators to determine the resource allocation that results from applying these decentralized solutions by all peers. The next two subsections explain the simulators for the current SSAP and BAP solutions, respectively.
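The arithmetic of the two-torrent scenario above can be made concrete. A small sketch (variable names are ours, and TCP's equal sharing is reduced to a simple cap on the congested download link):

```python
c = 1.0  # common upload and download bandwidth of every peer (arbitrary unit)

# Current BAP behavior: s2 splits its upload equally over its two torrents,
# while s1 sends everything to its single leecher.
s1_to_l1 = c
s2_to_l1 = c / 2
s2_to_l2 = c / 2

# Leecher 1's download link is congested (1.5c offered, capacity c), so its
# total intake is capped at c; leecher 2 receives only s2's half share.
l1_speed = min(c, s1_to_l1 + s2_to_l1)
l2_speed = s2_to_l2

aggregate = l1_speed + l2_speed  # 1.5c under the current BAP solution
optimal = 2 * c                  # s1 serves leecher 1, s2 serves leecher 2
```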

3.2.2. A simulator for the current SSAP solution
We approximate the current SSAP solution with a simulator that calculates the allocation to which the community eventually converges given its instantaneous configuration. The algorithm for this simulation is named cSSAP and is presented in Algorithm 1. The simulator repeatedly iterates over the peers with seeding sessions and runs the observed torrent selection heuristic for each of these peers. The information available to the seeders about leecher proportions in the torrents is updated after each seeder decision. The simulation stops when the peers stop changing their allocations.
Note that we are interested in the effect of the current SSAP mechanism on the composition of the active torrents sets. We isolate this effect by examining the solution to which the mechanism converges; we do not consider the convergence time. In reality, peers get information about the proportions of leechers in torrents only periodically. Depending on the rate of state changes in the system, this may hamper convergence. Nevertheless, this is essentially an information dissemination concern, which is outside the scope of this work.
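As an illustration, the convergence loop just described might look as follows (a sketch; the data structures are assumptions, and unlike the real simulator the leecher proportions are held fixed here rather than updated after each seeder decision):

```python
def cssap(seeders, proportion_of_leechers):
    """Sketch of the cSSAP convergence loop.

    seeders: dict i -> {'library': set of torrents, 'capacity': k_i}
    proportion_of_leechers: dict torrent -> proportion of leechers
    (held fixed in this simplification).
    Returns dict i -> active torrent set after convergence.
    """
    active = {i: set() for i in seeders}
    consensus = False
    while not consensus:
        consensus = True
        for i, s in seeders.items():
            # Sort the seeding library by proportion of leechers and
            # keep the top-k_i torrents active.
            ranked = sorted(s['library'],
                            key=lambda t: proportion_of_leechers[t],
                            reverse=True)
            chosen = set(ranked[:s['capacity']])
            if chosen != active[i]:
                consensus = False
            active[i] = chosen
    return active
```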

Algorithm 1 cSSAP: Current SSAP solution simulator

∀i ∈ I : Σ_i := ∅
repeat
    consensus := true
    for all seeders i do
        order-descending-by-proportion-of-leechers(Λ_i)
        Σ′_i := top-k_i(Λ_i)
        if Σ′_i ≠ Σ_i then
            consensus := false
        end if
        Σ_i := Σ′_i
    end for
until consensus

3.2.3. A simulator for the current BAP solution
Similar to SSAP, our analysis is concerned with the result of the current BAP solution after convergence. To approximate this result, we use a simulator named cBAP, that repeatedly performs the following steps:

1. Allocate the available upload bandwidth of each peer according to the current BAP solution without considering the download bandwidths of leechers, excluding connections to congested leechers on which the uploader already has a share of the download bandwidth equal to that of the other uploaders;

2. Check every leecher for download congestion: if there is no congestion, subtract the allocated bandwidth of uploaders from their available bandwidth; if there is congestion, subtract only an equal share of the leecher’s download bandwidth from the available bandwidth of every uploader.
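The two steps above could be sketched as follows (a simplified, single-round illustration; the dict-based peer representation is an assumption, and the iterative bandwidth bookkeeping of the full simulator is omitted):

```python
def cbap_round(upload_bw, download_bw, connections):
    """One round of the cBAP allocation (simplified sketch).

    upload_bw, download_bw: dicts peer -> bandwidth.
    connections: dict uploader -> list of leechers it uploads to.
    Returns dict (uploader, leecher) -> allocated rate.
    """
    # Step 1: each uploader splits its upload equally over its connections.
    alloc = {}
    for u, leechers in connections.items():
        share = upload_bw[u] / len(leechers)
        for l in leechers:
            alloc[(u, l)] = share
    # Step 2: congestion check. If a leecher's incoming total exceeds its
    # download bandwidth, TCP-style sharing caps each uploader at an equal
    # share of the leecher's download capacity.
    incoming = {}
    for (u, l), rate in alloc.items():
        incoming.setdefault(l, []).append(u)
    for l, uploaders in incoming.items():
        total = sum(alloc[(u, l)] for u in uploaders)
        if total > download_bw[l]:
            equal = download_bw[l] / len(uploaders)
            for u in uploaders:
                alloc[(u, l)] = min(alloc[(u, l)], equal)
    return alloc

# Example with hypothetical peers: the connection of s2 to the congested
# leecher l1 is capped at an equal share of l1's download bandwidth.
alloc = cbap_round({'s1': 2, 's2': 4}, {'l1': 2, 'l2': 8},
                   {'s1': ['l1', 'l2'], 's2': ['l1', 'l2']})
```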

The simulation stops when for each peer, either there is no more upload bandwidth available, or every leecher the peer is uploading to is congested and the peer has a share of that leecher's download bandwidth that is at least equal to the average share of the other uploading peers.
We validate cBAP with an experiment using regular BitTorrent clients that are instrumented to provide detailed reports on the data exchanges. The output logs of clients contain sufficient information to determine the bandwidth allocations among the peers during download. The validation consists of simultaneously starting ten peers that participate in three torrents. Some of these peers participate in multiple torrents, creating the need for inter-swarm resource allocation. Moreover, peers have heterogeneous bandwidths, so that clustering can be observed in the experiment. The characteristics of all peers are summarized in Table 3.1. Each peer uses an asymmetric connection with a download bandwidth equal to 8 times the upload bandwidth, and the size of the files distributed in each torrent is 200 MiB.
To compare the cBAP simulator to the actual BitTorrent clients, we discard the warm-up and end phases of the experiment. The warm-up phase is the period before all peers have downloaded at least 10% of the torrent. During this period, it is the piece availability—and not the RAP solutions—that chiefly determines the resource allocation. The end phase of the experiment is the period after which one leecher

Table 3.1: Parameters for the validation experiment for the cBAP simulator. L and S indicate if a peer is a leecher or a seeder in a torrent, respectively.

Peer ID   Download bandwidth (KiB/s)   Upload bandwidth (KiB/s)   Torrents (A, B, C)
1         512                          64                         S
2         1024                         128                        L
3         2048                         256                        L, L
4         2048                         256                        L, S
5         2048                         256                        L, L
6         512                          64                         L
7         512                          64                         L, L, S
8         2048                         256                        L, L, L
9         1024                         128                        L, L, L
10        512                          64                         L, L, L

has become a seeder. When this happens, the configuration of the community has changed, and must be compared against a new solution from the simulator.
A comparison between the results of the simulation and the experiment, using the same configuration, is shown in Figure 3.1. Overall, the bandwidth allocation patterns are similar: peers exchange data with peers that have a similar bandwidth. For example, peers 5 and 8, both having high bandwidth, allocate more bandwidth to each other than peers 3 and 7, which have different bandwidths. We note that the experiment displays marginally less stable clustering than the simulator, i.e., the bandwidth allocations in the simulator are higher between peers with similar bandwidth. We also note that the absolute values of the bandwidths allocated by the peers are close, with the highest pairwise transfer speed approximately 60 KiB/s in both cases.

3.3. Optimal bandwidth allocation in BitTorrent communities
In this section, we introduce a graph-theoretical model of resource allocation in BitTorrent communities that allows us to map BAP to flow network problems. This mapping, in turn, permits us to apply graph-theory solutions for BAP targeting throughput maximization in the community, and to devise an algorithm to find the max-min fair allocation of bandwidth to the leeching sessions in the community.

3.3.1. A graph-theoretical model for BitTorrent communities
We model the state of a BitTorrent community at a certain instant as a flow network G = (V, E, f, c), where V is the set of vertices, E is the set of edges, f is the flow function, and c is the capacity function. The flow network G is a directed tripartite graph, with V being the union of three disjoint subsets of vertices, U, L, and D, and with each edge in E connecting two vertices that are in distinct subsets. These three

[Figure 3.1 consists of two heatmaps of allocated bandwidth (0–60 KiB/s) between uploader and downloader IDs 1–10, one for the cBAP simulator and one for the instrumented BitTorrent clients.]

Figure 3.1: The bandwidth allocations produced by the cBAP simulator and an experiment with real BitTorrent clients.

sets of vertices are defined in the following way:

• the upload nodes U = {u_1, . . . , u_m} represent the upload potential of the m users (both seeders and leechers) who are active in the community at the instant considered;
• the leeching session nodes L = ∪_i L_i represent the presence of leechers in torrents, where L_i = {l_i^t | i ∈ L_t} is the set of leeching sessions of user i; and
• the download nodes D = {d_1, . . . , d_n} represent the download potential of the leechers.

In the graph G, a user i is represented by the set of nodes {u_i, d_i} ∪ L_i. Figure 3.2 shows an example mapping.
The set of edges E represents the transfer potential from the upload nodes to the leeching sessions and from the leeching sessions to the download nodes. If user i is leeching in a torrent t, then there are edges from u_i to the leeching sessions of all other users in torrent t. Using the notation introduced in Section 3.1, this formally means that ∀t ∈ T, ∀i, j ∈ L_t (i ≠ j) : (u_i, l_j^t) ∈ E. Similarly, if a user i is seeding in a torrent t, then there are edges from u_i to the leeching sessions of all other users in torrent t. Formally, ∀t ∈ T, ∀i ∈ S_t, ∀j ∈ L_t : (u_i, l_j^t) ∈ E. All of the edges thus defined are called upload edges. Finally, to represent downloading, the graph also has edges from the leeching session nodes to download nodes: ∀t ∈ T, ∀i ∈ L_t : (l_i^t, d_i) ∈ E. These edges are called download edges.
The capacity and flow functions of G are defined as follows:
• the capacity function c : U ∪ L ∪ D → Z represents the bandwidth of peers, where c(u_i) := µ_i is the upload bandwidth of user i, c(l_i^t) := ∞, and c(d_i) := δ_i

[Figure 3.2 depicts a community with two torrents (user 1 leeching in torrents 1 and 2, user 2 leeching in torrent 1, and user 3 seeding in torrents 1 and 2) next to its flow network model with upload, leeching session, and download nodes and their capacities.]

Figure 3.2: An example of a BitTorrent community with three users and two torrents and its representation as a flow network.

is the download bandwidth of user i, and
• the flow function f : E → R represents the bandwidth allocation, having the property of flow conservation: ∀l_j^t ∈ L : ∑_{u_i∈U} f(u_i, l_j^t) = f(l_j^t, d_j).
It is easy to see that any flow in G is equivalent to a bandwidth allocation B, and that the Seeding Sessions Allocation Problem is equivalent to selecting a subset E′ ⊆ E such that ∀i ∈ I : |{t ∈ T | (u_i, l_j^t) ∈ E′}| ≤ k_i.
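As an illustration, the tripartite network can be built from a community snapshot with plain dictionaries (a sketch; the tuple encoding of nodes and the input format are our assumptions):

```python
import math

def build_flow_network(torrents, upload_bw, download_bw):
    """Build the tripartite flow network of this section (sketch).

    torrents: dict t -> {'seeders': set of users, 'leechers': set of users}
    Returns (nodes, edges, cap) with node keys ('u', i), ('l', i, t), ('d', i).
    """
    nodes, edges, cap = set(), set(), {}
    for t, members in torrents.items():
        for i in members['leechers']:
            session = ('l', i, t)
            nodes.add(session)
            cap[session] = math.inf          # leeching-session nodes are uncapacitated
            nodes.add(('d', i))
            cap[('d', i)] = download_bw[i]   # delta_i
            edges.add((session, ('d', i)))   # download edge
            # upload edges from every other leecher and every seeder of t
            for j in (members['leechers'] - {i}) | members['seeders']:
                nodes.add(('u', j))
                cap[('u', j)] = upload_bw[j]  # mu_j
                edges.add((('u', j), session))
    return nodes, edges, cap

# The community of Figure 3.2: user 1 leeches torrents 1 and 2, user 2
# leeches torrent 1, user 3 seeds both (bandwidth values are arbitrary).
nodes, edges, cap = build_flow_network(
    {1: {'seeders': {3}, 'leechers': {1, 2}},
     2: {'seeders': {3}, 'leechers': {1}}},
    upload_bw={1: 20, 2: 30, 3: 10},
    download_bw={1: 50, 2: 80})
```

For this community the construction yields eight edges: five upload edges and three download edges.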

3.3.2. Maximizing throughput
Using the flow network model defined above, solving BAP to maximize throughput is equivalent to solving the maximum flow problem for G, a problem for which several algorithms exist [86]. In this chapter, a linear programming (LP) problem formalization is used, which we denote MaxFlow:

max ∑_{(u_i, l_j^t)∈E} f(u_i, l_j^t),
subject to ∑_{t,j} f(u_i, l_j^t) ≤ µ_i  ∀u_i ∈ U,
           ∑_t f(l_j^t, d_j) ≤ δ_j  ∀d_j ∈ D.

Note that the flow is not required to take integer values. Because the capacity function does take integer values, the integral flow theorem states that there exists an integral maximum flow; this solution can be found in polynomial time using an LP solver. In our experiments we use MOSEK [87].
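MOSEK is a commercial LP solver; on small instances the same optimum can also be computed with a textbook augmenting-path algorithm. A stdlib-only sketch (the super-source/super-sink encoding of node capacities and the example network are illustrative assumptions, not the chapter's implementation):

```python
from collections import defaultdict, deque

def max_flow(edges, source='S', sink='T'):
    """Edmonds-Karp maximum flow. edges: dict (u, v) -> capacity."""
    residual = defaultdict(dict)
    for (u, v), c in edges.items():
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)  # reverse residual edge
    total = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck along the path and augment
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

# The two-torrent scenario of Section 3.2.1 with c = 10: upload
# capacities appear as edges from a super-source S, download
# capacities as edges to a super-sink T.
BIG = 10**9  # effectively infinite capacity for internal edges
net = {
    ('S', 'u_s1'): 10, ('S', 'u_s2'): 10,
    ('u_s1', 'l1'): BIG, ('u_s2', 'l1'): BIG, ('u_s2', 'l2'): BIG,
    ('l1', 'd1'): BIG, ('l2', 'd2'): BIG,
    ('d1', 'T'): 10, ('d2', 'T'): 10,
}
print(max_flow(net))  # 20, i.e., the optimum of 2c versus 1.5c under cBAP
```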

3.3.3. Max-min fairness algorithm
Our second goal from Section 3.1.3 is to find the max-min fair allocation. In order to do so, we establish an iterative algorithm, which we denote MaxMin, that maintains an increasing set F of download edges for which the flow values are fixed to their proper value in the max-min fair allocation. This set F is initially empty, and in every iteration, new download edges are added to F. In every iteration k, our algorithm solves the following linear programming problem MM_k:

max ϕ_k,
subject to f(l_i^t, d_i) ≥ ϕ_k  ∀(l_i^t, d_i) ∈ E_k,
           ∑_{t,j} f(u_i, l_j^t) ≤ µ_i  ∀u_i ∈ U,
           ∑_t f(l_j^t, d_j) ≤ δ_j  ∀d_j ∈ D,

where E_k is the set of edges whose flows have not yet been fixed in previous iterations, i.e., E_k = E \ F. The MaxMin algorithm stops when F contains all download edges.
In the main loop of our algorithm, having solved the actual instance of MM_k, an LP solver returns with a flow f on the graph. In this flow, there may be multiple download edges with the max-min flow value ϕ_k. These edges are collected into a set Φ, which contains the candidate edges to be added to the set F. However, among the edges in Φ, there may be edges on which the flow value can be increased. In order to check this property, the algorithm continues with two inner loops; to explain these, we use the following terminology. We say that an upload node u_i is saturated if ∑_{t,j} f(u_i, l_j^t) = µ_i, that a download node d_i has the max-min property if there is no torrent t for which f(l_i^t, d_i) > ϕ_k, and that a download node d_i is saturated if ∑_t f(l_i^t, d_i) = δ_i.
We now describe the two inner loops. The first one checks for all elements of Φ whether they should actually be included in F. We only keep an edge (l_i^t, d_i) for which either a) the download node d_i is saturated and has the max-min property, or b) all upload nodes u_j connected to it (through an upload edge (u_j, l_i^t)) are saturated and all other upload edges (u_j, l_{i′}^{t′}) with positive flows are connected to download edges (l_{i′}^{t′}, d_{i′}) for which the flow is ≤ ϕ_k (i.e., u_j cannot be desaturated).
The second loop considers all download edges (l_i^t, d_i) in Φ for which condition b) but not condition a) from the first loop holds.
We discard those download edges (l_i^t, d_i) for which there exist upload edges (u_j, l_i^t) with a saturated upload node u_j, which has other edges (u_j, l_{i′}^{t′}) with positive flows on them in such a way that (l_{i′}^{t′}, d_{i′}) is not in Φ, but has f(l_{i′}^{t′}, d_{i′}) = ϕ_k. Finally, the MaxMin algorithm takes the elements of Φ, fixes the flows on them, and adds them to F.
Properties of the MaxMin algorithm. We first observe that in each iteration at least one element is added to F, since at least one edge with the minimum flow ϕ_k is kept in Φ after filtering. On the other hand, the flow on these edges (l_i^t, d_i) ∈ Φ can be increased only by decreasing flows on those edges (l_j^t, d_j) which have at most flow value ϕ_k, which assures that all edges in Φ belong to the max-min fair allocation. Since the set F of edges with fixed flows is incrementally constructed using the edges from Φ, it follows that the algorithm finds the max-min fair allocation for a given flow network G.
As a consequence, we find that the number of iterations in our algorithm is at most equal to |E|. For each iteration there is a linear program, MM_k, to be solved,

which happens in polynomial time (we again use MOSEK). The filtering part of the algorithm has complexity O(|E| · |U| · |L|), as it only considers the download edges and edges connected to them through paths of length at most two. Regarding the existence of the solution of the max-min fair allocation problem, note that our MaxMin algorithm runs on a continuous, convex set (bounded by the finite number of constant capacities), on which the max-min fair allocation exists [88]; moreover, the allocation is unique [89].

Table 3.2: Summary of datasets (95% confidence intervals for means).

Property                       Bitsoup               Filelist
Registered users               84 007                91 745
Total torrents                 13 741                3 236
Mean active torrents           6 869.6 ± 30.8        512.2 ± 10.2
Mean active sessions           76 370.3 ± 1 135.0    32 829.4 ± 672.8
Mean seeders/leechers ratio    5.1 ± 0.1             3.6 ± 0.2

3.4. Datasets To evaluate current inter-swarm resource allocation mechanisms in realistic condi- tions, we derive RAP instances from traces of real-world BitTorrent usage. In the following, we describe the datasets we use and the method for extracting problem instances.

3.4.1. Communities studied
We use data from two BitTorrent communities: Bitsoup and Filelist¹. Both traces were collected by periodically crawling web pages published in these communities containing user activity information. These pages include, for each user in each torrent, the user name, the session duration, and the amount of data uploaded and downloaded in the session. The pages were crawled hourly for Bitsoup, and every six minutes, on average, for Filelist. Table 3.2 summarizes the datasets. In total, there are around 100 000 BitTorrent sessions active at every moment, allowing us to form large-scale problem instances.
BitTorrent communities that require registration such as those we analyze are known to have lower resource contention than open BitTorrent sites [46]. This is caused by accounting mechanisms used by the community administrators to keep track of the contributions of users and to expel those users who fail to contribute enough. As a consequence, users seed for longer, and the proportion of leechers is low. To investigate if our results are affected by this, we generate problem instances with higher proportions of leechers than those in the traces. In order to do so, we start with the traces and we reduce the seeding capacities of all users to produce a second dataset with an overall ratio of two leeching sessions per seeding session.

¹We note that some of this data has been analyzed before (e.g., [59, 63]). Nevertheless, the aspects evaluated in this chapter have never been considered.

3.4.2. Extracting seeding capacities and seeding libraries

Given the set of users who are online at a certain instant, we define for each user a seeding capacity and a seeding library. The seeding capacity of a user at a certain time is taken directly from the traces as the number of torrents the user is seeding at that time. Defining the contents of the seeding libraries of the users is a more complex task because the traces only contain a series of times when the user was seeding a torrent, but no information about when that torrent was removed. We circumvent the absence of such information by considering the two extreme scenarios for reconstructing the seeding libraries. The first one, named minimal libraries, assumes a user deletes a torrent immediately after the last time the user is observed seeding it. In this scenario, a torrent is in the seeding library of a user at a certain time only if that user was observed seeding it both before and after that time. The second scenario, named maximal libraries, assumes users never delete torrents. Then, a torrent is in the seeding library of a user if the user was observed seeding it at least once in the past. The seeding library size distribution is heavily skewed in both communities. In Bitsoup, using minimal and maximal estimation, the median library sizes are 3 and 6, respectively, while the maximums are 290 and 542, respectively. In Filelist, the medians are 2 and 4, while the maximums are 72 and 566.
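The two estimation rules can be stated compactly. A sketch (the trace representation as per-torrent observation times is an assumption):

```python
def seeding_library(observations, now, mode='minimal'):
    """Estimate a user's seeding library at time `now` from a trace.

    observations: dict torrent -> list of times the user was observed
    seeding it (hypothetical trace format).
    'minimal': torrent counted only if observed both before and after `now`.
    'maximal': torrent counted if observed at least once before `now`.
    """
    library = set()
    for torrent, times in observations.items():
        seen_before = any(t <= now for t in times)
        seen_after = any(t >= now for t in times)
        if mode == 'maximal' and seen_before:
            library.add(torrent)
        elif mode == 'minimal' and seen_before and seen_after:
            library.add(torrent)
    return library
```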

To obtain unbiased comparisons using different times in different traces, we define limited time windows for analyzing past and future events relative to each instant. In addition, we use a random selection of instants from one whole week of each of the two BitTorrent community traces. This allows us to account for most short-term seasonality in BitTorrent usage, which is daily or weekly, and to have a time window for seeding library estimation of 28 days in both traces. In Sections 3.5 and 3.6, we use 55 problem instances derived from Bitsoup and 45 from Filelist.

3.4.3. Upload and download bandwidths

The traces do not contain information about the upload and download bandwidths of users. To obtain realistic numbers, we turn instead to a trace obtained by Isdal et al. [90], who measured the upload bandwidths of a large sample of BitTorrent peers using passive measurement tools. We derive random samples from this dataset to assign to the users in our problem instances, preserving the distribution of bandwidths in [90]. Additionally, we consider two types of user connections. If a user’s upload bandwidth is less than 100 Mbit/s, we assume the connection to be asymmetric with download bandwidth equal to eight times the upload bandwidth (in line with connections in Europe and North America). On the other hand, if the upload bandwidth of a user is higher than 100 Mbit/s, the user is assumed to have a symmetric connection with equal download and upload bandwidth.
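The assignment rule can be summarized as follows (a sketch; the unit, kbit/s, and the tie-breaking at exactly 100 Mbit/s are our assumptions):

```python
def assign_download_bw(upload_kbps, threshold_kbps=100_000):
    """Derive a user's download bandwidth from a measured upload
    bandwidth, following the rule used in this chapter.

    Below 100 Mbit/s (= 100 000 kbit/s): asymmetric link, download
    is eight times upload. At or above (tie-breaking assumed here):
    symmetric link, download equals upload.
    """
    if upload_kbps < threshold_kbps:
        return 8 * upload_kbps
    return upload_kbps
```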

[Figure 3.3 plots the average download speed per leeching session (KiB/s) for Bitsoup and Filelist, under minimal and maximal library estimation and high and observed leecher proportions.]

Figure 3.3: Upper bound throughput for all scenarios, as produced by MaxFlow in relaxed RAP instances (means with 95% confidence intervals).

3.5. Can current algorithms provide high throughput?
In this section, we evaluate the performance of current SSAP and BAP solutions, as resulting from the cSSAP and cBAP simulators, respectively, in the context of file-sharing communities interested in maximizing aggregate throughput.

3.5.1. Baseline
We first establish a baseline for the performance of any allocation mechanism. Recall that, for a SSAP solution, the bandwidth allocation that maximizes the aggregate download speed of all users is that given by MaxFlow (as discussed in Section 3.3). Our baseline is thus the performance of MaxFlow on an unconstrained input, where all seeders can use their whole seeding libraries—a relaxed version of RAP. The results represent an upper bound for solutions of any RAP instance where the seeding capacity constraint holds.
Figure 3.3 depicts the mean download speed across leeching sessions, considering different leecher proportion levels and library estimation methods. We observe that changes in leecher proportion have a sizeable effect in Filelist, but little effect in Bitsoup. This implies that a considerably larger fraction of seeding sessions in Bitsoup is not contributing to the maximum flow in the community. Removing these seeding sessions has no effect on community performance. At the same time, the mean download speed in Filelist is 2–3 times the speed in Bitsoup, suggesting the configuration of Filelist is such that leechers and seeders are balanced more evenly in torrents.
It is also notable that the library estimation method has no significant influence on the results. Maximal library estimation provides considerably more options for peers, but this does not lead to higher performance. Investigating the allocations produced by MaxFlow shows that in all solutions, the majority of seeders do not use any files from their libraries. This suggests that it is possible to attain the maximum flow in the RAP instances we consider even if most peers do not allocate bandwidth to seeding sessions.

3.5.2. Torrent selection
Our next experiment assesses whether the current solution for SSAP limits the performance of solutions based on it for the complete resource allocation problem. Hereafter, we use the notation Algorithm1+Algorithm2 to denote a RAP solution composed of two algorithms Algorithm1 and Algorithm2 that address SSAP and BAP, respectively. This experiment compares the performance of cSSAP and MaxFlow running in succession to the established baseline—which is an upper bound. The solution cSSAP+MaxFlow would be equivalent to having clients solve torrent selection in a decentralized manner, and then obtaining from an oracle the optimal bandwidth allocation for their SSAP solution.
If cSSAP+MaxFlow performs similarly to the baseline, it is possible to affirm that the current torrent allocation does not limit the performance of a complete solution for RAP. At the same time, it may happen that cSSAP+MaxFlow performs well in the experiment because the space of possible allocations does not allow a different outcome. To test for this, we examine the performance of a random torrent allocation coupled with MaxFlow.
Results comparing the performance upper bound, cSSAP+MaxFlow, and Random+MaxFlow are presented in Figures 3.4 and 3.5. Performance is measured as the aggregate download speed of all peers relative to the baseline. For Filelist, there are negligible or no differences between solutions from cSSAP+MaxFlow and the upper bound. Moreover, this happens regardless of the library estimation and the proportion of leechers. In Bitsoup, cSSAP+MaxFlow is equivalent to the upper bound in most scenarios, but 10–15% worse than the upper bound for the scenarios with high leecher proportion.
However, we cannot know if the upper bound performance is attainable in a given scenario, so we cannot conclude whether cSSAP is affected by the change in the proportion of leechers or whether the upper bound is simply not achievable for the high leecher proportion scenario. Finally, there is a slight difference in the performance obtained with minimal and maximal libraries in Bitsoup. These differences can be attributed to an effect of larger choice on cSSAP, because all solutions possible with minimal library estimation are also possible with maximal library estimation.
Overall, it follows that it is possible to attain optimal or nearly optimal performance using the current torrent selection mechanism. This is notable given that this mechanism ignores bandwidth information. Our results suggest that an efficient bandwidth allocation algorithm can cope with this limitation. Finally, an analysis of the Random+MaxFlow results shows that a heuristic that does not consider any torrent information can only hamper the performance of an efficient BAP solution to a limited extent. Our results thus suggest a reduced space of possibilities for


Figure 3.4: Throughput produced with observed proportions by current and random SSAP solu- tions coupled with MaxFlow and relative to the performance upper bound for each RAP instance (means with 95% confidence intervals).


Figure 3.5: Throughput produced with high leecher proportions by current and random SSAP solutions coupled with MaxFlow and relative to the performance upper bound for each RAP instance (means with 95% confidence intervals).


Figure 3.6: Throughput of cBAP coupled with cSSAP relative to optimal, cSSAP+MaxFlow (means with 95% confidence intervals, minimal libraries).

torrent selection.

3.5.3. Bandwidth allocation
We now examine the efficiency of the current bandwidth allocation mechanism. Our experiment compares cBAP coupled with cSSAP to the optimal solution for the cSSAP allocation, cSSAP+MaxFlow. Figure 3.6 presents the results of this experiment. Only minimal libraries are considered, as we know from our previous experiment that library estimation has a negligible effect on solutions (Figure 3.4).
Results are similar for both communities: current BAP solutions achieve less than 80% of the optimal throughput, and the performance of cBAP is affected by resource contention. In the scenarios with high leecher proportion, cSSAP+cBAP achieves only 50–60% of the performance of cSSAP+MaxFlow. Such results highlight that current decentralized mechanisms fall short of the performance that can be achieved in multi-swarm scenarios. Furthermore, note that the decrease in performance relative to cSSAP+MaxFlow suggests that the more resource contention there is in the community, the further current methods are from the optimal. This is particularly relevant for communities that have less seeding than those we study. Finally, the similarity between the relative performance of cSSAP+cBAP in the two communities suggests the performance of these mechanisms is independent of the structure of the community.

Summary We find that the current torrent selection mechanism does not limit the performance of file-sharing communities, often allowing for optimal solutions if combined with optimal bandwidth allocation. On the other hand, the bandwidth allocation mechanism presently implemented in BitTorrent clients significantly hampers the performance of file-sharing communities, and performs particularly poorly in communities with a high leecher proportion. Finally, the upper-bound performance of a file-sharing community is not affected by the size of the seeding libraries and is only slightly affected by the variation in leecher proportion we consider.

[Figure 3.7 plots the 5th percentile download speed across all leeching sessions (KiB/s), per community and scenario.]

Figure 3.7: Upper bound throughput for all scenarios, as produced by MaxMin in relaxed RAP instances (means with 95% confidence intervals).

3.6. Are current algorithms appropriate for video streaming?

We now turn to the video-streaming use-case. In this context, the ideal resource allocation is max-min fair for all leeching sessions. Such an allocation provides the best possible service for the sessions most exposed to streaming interruptions while guaranteeing that the rest of the sessions obtain a service at least as good. We use a performance metric appropriate for video streaming, the download speed of the fifth percentile worst-performing leeching session. If the metric has value v, 95% of the sessions in the community receive a download speed at least equal to v.
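The metric can be computed directly from the per-session speeds; the sketch below is illustrative (the helper name and data are not from the dissertation's tooling) and picks the speed of the session at the fifth percentile:

```python
def fifth_percentile_speed(session_speeds):
    """Return v such that at least 95% of sessions download at speed >= v.

    session_speeds: per-leeching-session download speeds (e.g., in KiB/s).
    Hypothetical helper; the text defines only the metric, not the code.
    """
    s = sorted(session_speeds)
    k = int(0.05 * len(s))  # index of the fifth-percentile session
    return s[k]
```

For 100 sessions with speeds 1..100 KiB/s, the metric is 6 KiB/s: exactly 95 sessions download at least that fast.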

3.6.1. Baseline

The baseline for this experiment is obtained by running the MaxMin algorithm introduced in Section 3.3 on relaxed RAP instances. The average download speed for the fifth percentile leeching session across all problem instances is presented in Figure 3.7. The performance in Filelist is considerably higher than in Bitsoup, similarly to what we observed for aggregate download speed in Section 3.5. The fifth percentile download speed in Filelist is approximately 6 times that in Bitsoup. Within a community, the results are affected neither by seeding library estimation nor by the leecher proportion. This implies that the performance of the worst performing leeching sessions cannot be improved just by having seeders participate in more torrents from their libraries.
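For intuition, max-min fairness can be sketched with progressive filling on a single shared resource. The toy allocator below is an illustration only, not the flow-network MaxMin algorithm of Section 3.3: it repeatedly raises the allocation of unsatisfied sessions equally until capacity runs out.

```python
def max_min_fair(capacity, demands):
    """Progressive filling: max-min fair split of one capacity (toy model)."""
    alloc = [0.0] * len(demands)
    active = list(range(len(demands)))        # sessions still unsatisfied
    while active and capacity > 1e-12:
        share = capacity / len(active)        # equal share of what is left
        satisfied = [i for i in active if demands[i] <= share]
        if not satisfied:                     # nobody saturates: split evenly
            for i in active:
                alloc[i] = share
            return alloc
        for i in satisfied:                   # cap the saturated sessions
            alloc[i] = demands[i]
            capacity -= demands[i]
        active = [i for i in active if i not in satisfied]
    return alloc
```

With capacity 10 and demands [2, 8, 10], the result is [2, 4, 4]: the small demand is met in full, and the remaining capacity is split evenly, so the worst-off session gets the largest feasible share.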


Figure 3.8: Fifth percentile session download speed with observed proportions produced by current and random SSAP solutions coupled with MaxMin and relative to unrestricted MaxMin (means with 95% confidence intervals).

3.6.2. Torrent selection
Analogously to our analysis of aggregate download speed, we first determine whether the current SSAP solution hinders the performance of the optimal BAP solution for video streaming. This is done by comparing cSSAP+MaxMin to MaxMin running on the relaxed RAP. At the same time, we establish the extent to which any SSAP solution can affect the overall RAP solution by analyzing the results of a random torrent allocation (Random+MaxMin) in relation to the same unconstrained MaxMin solution. The results are shown in Figures 3.8 and 3.9.

For the scenarios using the observed leecher proportion, there are only negligible differences between the current, random, and unrestricted SSAP solutions. This suggests that cSSAP is adequate for maximizing the fifth percentile performance. At the same time, the close-to-optimal result of random selection points to a limited potential for choice; it seems SSAP solutions can only have a limited effect on the RAP solution in the problem instances we study. In summary, given the observed proportion of leechers, solving SSAP without bandwidth information does not affect streaming performance when an efficient BAP solution is used—a similar outcome to the throughput maximization use-case.

With regard to the high leecher proportion scenario, the Filelist results for cSSAP and random selection show performance drops of 10-20% compared to the baseline, for both seeding library configurations. On the other hand, using the Bitsoup data, it is remarkable to see that cSSAP generates a torrent selection in which MaxMin cannot provide the lowest five percent of the peers with any bandwidth at all. However, since our baseline applies to an unconstrained RAP, it may be that a higher performance is not attainable by any solution that respects the seeding constraint.


Figure 3.9: Fifth percentile session download speed with high leecher proportions produced by current and random SSAP solutions coupled with MaxMin and relative to unrestricted MaxMin (means with 95% confidence intervals).

3.6.3. Bandwidth allocation
Next, we investigate the performance of cBAP for video streaming. Starting with the torrent allocation produced by cSSAP, we compare the solution of cBAP to the optimal BAP solution generated by our MaxMin algorithm. The results are depicted in Figure 3.10. Note that only minimal libraries are considered, as we have observed that library estimation has no significant effect on the performance that can be achieved by BAP solutions. Moreover, Bitsoup is not considered in the case with a high proportion of leechers, since cSSAP generates a torrent allocation in which it is impossible to provide any bandwidth to the worst five percent of streaming sessions.


Figure 3.10: Fifth percentile leeching session download speed produced by cBAP coupled with cSSAP, relative to the speed of the max-min fair allocation cSSAP+MaxMin (means with 95% confidence intervals, minimal libraries).

Results are similar for the remaining scenarios in both communities: fifth percentile download speeds produced by cBAP are around 30-40% of the baseline values. This shows that current BAP solutions deployed in BitTorrent are far from ideal in the video-streaming use-case. However, differently from what we observed for aggregate throughput, there is no sizeable effect of leecher proportion on the relative performance of cBAP.

Summary Akin to our results for file sharing, we find that the current torrent selection mechanism does not hamper performance in video-streaming communities. Nevertheless, randomly selecting torrents allows for similar performance using our datasets. Regarding bandwidth allocation, we again find that the current mechanism's performance is substantially worse than the optimal. Finally, the upper bound is not affected by seeding library estimation or leecher proportion.

3.7. Related work
Considerable research and development effort has been invested in designing and evaluating BitTorrent's intra-swarm resource allocation methods. Experimental investigations by Legout et al. suggest that the current algorithms for choosing upload partners inside a swarm need no further improvement [45] and document the high utilization of upload bandwidth inside a swarm [85]. BitTorrent has also been studied at the community level. Zhang et al. [46] show how an entire ecosystem forms around the P2P protocol. Guo et al. [50] and Andrade et al. [63] analyze traces of multiple BitTorrent communities. Nevertheless, previous work investigating multi-swarm systems has not considered the community-level metrics we use, nor evaluated the effect of current inter-swarm resource allocation mechanisms.

More similar to our work, Dunn et al. [91] explore seeding strategies for a BitTorrent-like system centered around a content provider. Their goal is minimizing the bandwidth demand at the provider—equivalent to maximizing P2P throughput. Using synthetic scenarios, they find that the behavior of current BitTorrent algorithms can be improved. Our results do not contradict this finding, but question whether improvements for SSAP solutions are relevant for most real communities.

Peterson et al. [92] design a BitTorrent-inspired content distribution system with a central bandwidth allocation algorithm. Similarly to us, they envisage different goals for the system, such as guaranteeing a minimum service level in swarms or avoiding starvation. However, they only present results for the throughput maximization goal, for which they also find BitTorrent to perform suboptimally. We corroborate this finding, adding that it happens in real BitTorrent communities. Furthermore, we expand the results of Peterson et al. by examining the video-streaming use-case.

3.8. Conclusion
In this chapter, we present a performance evaluation of de facto mechanisms for inter-swarm resource allocation in BitTorrent communities. This paves the way for informed developments of these mechanisms by identifying requirements and relevant factors that affect the performance of BitTorrent communities from a multi-swarm perspective.

We conclude that, for both file-sharing and video-streaming communities, the current torrent selection mechanism is suitable if coupled with efficient bandwidth allocation algorithms, but the current bandwidth allocation mechanism performs significantly worse than optimal in multi-swarm operation, especially in the case of high leecher proportions. In a way, our results highlight that there is a price of anarchy in BitTorrent communities: with individuals allocating resources solely in their own interest, they do not fulfill the global objective optimally. Nevertheless, this does not imply that globally optimal mechanisms cannot be incentive-compatible. Instead, future work should ideally improve these mechanisms considering multi-swarm incentives.

The observation that the size of the seeding libraries does not influence the upper bound performance is relevant for the design of future BitTorrent clients. Our simulations suggest there is little to gain through peer-level caching of user-downloaded torrents. Furthermore, the development of real-time implementations of our optimal bandwidth allocation algorithms could lead to efficient inter-swarm resource allocation in cooperative scenarios where peers follow the directions of a centralized coordinator.

4 Investment strategies for credit-based P2P communities

User-contributed resources are the basis of high performance in P2P systems. Therefore, an adequate incentive structure that encourages users to contribute is paramount to the success of a P2P system. BitTorrent stands at a crossroads in this respect. On the one hand, BitTorrent has an effective built-in incentive mechanism [45], tit for tat, which rewards leechers with better performance if they contribute upload bandwidth to help other leechers. On the other hand, seeding is not incentivized in the BitTorrent protocol, and there is insufficient attention given to users participating in multiple swarms simultaneously. The result of this drawback is the creation of hundreds of private BitTorrent communities that employ extra incentive mechanisms to encourage their users to contribute.

Private BitTorrent communities address the lack of incentives for seeding in BitTorrent through a credit system that tracks the sharing behavior of users across swarms [18, 94]. Typically, sharing ratio enforcement is used to penalize users who do not keep a minimum ratio of uploaded and downloaded data. This leads to an oversupply of upload bandwidth, which gives users of private communities very good download performance [18].

Unfortunately, the upload bandwidth oversupply in credit-based private communities has a negative side effect. It makes it difficult for users who are willing to contribute to do so. A user may have to compete with too many other users to upload in a swarm, and may be unable to attain a sufficient sharing ratio to maintain membership in the community [47]. We argue that while a credit system encourages contribution, the skewed distribution and temporal variability of content popularity

This chapter is based on work published in Euromicro PDP 2013 [93].


often make it exceedingly hard for users to contribute effectively in private BitTorrent communities. In short, for a user, there is a fundamental mismatch between the acts of downloading and uploading when it comes to maintaining a good sharing ratio while pursuing their own interest in files.

In this chapter, we propose as a solution to this problem the decoupling of downloading and seeding, and put forward an automatic mechanism that lets users who are willing to contribute resources to the community do so easily. This peer-level mechanism speculatively downloads and seeds swarms with the sole purpose of increasing the upload/download ratio. As long as the choice is made by an informed algorithm, the user stands to gain, and so does the community, since the download performance of the other users increases.

The solution we introduce can be described in terms of a cache that contains the swarms that have been downloaded speculatively. Its most important component is the swarm replacement algorithm: how to select the new swarm that should go into the cache, and which old swarm should be removed? To answer this question, we build a regression model with as input the swarm characteristics available to each peer, and obtain a prediction of the number of bytes the peer will be able to upload to that swarm in the near future. Note that, as opposed to a traditional cache, we do not store the swarms downloaded by the user of the BitTorrent client for their own interest. Instead, we consider for adding to the cache any swarm available in the community, irrespective of user interest.

Our main contributions in this chapter are:

1. We design a speculative download mechanism that can be adopted by BitTorrent clients to automatically and autonomously download swarms in order to gain credit in a private BitTorrent community (Section 4.2);

2. We introduce a method for predicting the upload potential of swarms based only on information available to BitTorrent clients, without protocol extensions (Section 4.2.2);

3. We show through trace-based simulation that using the speculative download mechanism driven by the predictor for swarm upload potential leads to credit gain (Section 4.4).

4.1. Problem statement
Because of oversupply, in many private BitTorrent communities it is hard for users who are willing to contribute upload bandwidth to actually do so. This unnecessarily limits the sharing ratio of these users, as well as the global download performance in the community. The goal of our work is to make use of any idle upload bandwidth users have in an automatic and decentralized manner that is compatible with regular BitTorrent clients, such that the sharing ratio of users is increased. We name the problem of deciding whether and what to automatically download in order to increase upload bandwidth usage the Bandwidth Investment Problem.

To solve the bandwidth investment problem, a BitTorrent client must identify swarms—also called torrents—that are undersupplied in the community, download some portion of their content, and seed those torrents using the idle upload bandwidth. In some sense, this is what experienced users often do manually, in addition to downloading torrents they are interested in. We aim to automate this process.

As a success criterion, if a solution to this problem is applied and used for a reasonable amount of time—on the order of hours, it should cause an increase in the sharing ratio of its user. To compare the quality of several solutions, we can run them at the same time and compare the sharing ratios they produce. From the perspective of bandwidth investment, we are comparing the profits produced by the different strategies.

There are also two constraints that solutions must satisfy. Any mechanism that qualifies as a solution must be integrated into a regular BitTorrent client which is under user control. As such, it always has to give priority to user actions. If the user decides to download a torrent, the mechanism must yield bandwidth so that the user download can proceed as fast as possible.
Similarly, it must limit storage usage so that the user always has available storage. In addition to the previous constraint, the information used to decide which swarms are undersupplied must already be available to regular BitTorrent peers. We do not believe a mechanism that requires changes to the established BitTorrent protocol has a chance of being adopted.

4.2. Speculative download mechanism design In this section, we introduce and describe in detail the speculative downloading mechanism, the solution we propose for the Bandwidth Investment Problem.

4.2.1. Speculative download
The main contribution of this chapter is the speculative download mechanism depicted in Figure 4.1, which identifies, downloads, and seeds undersupplied swarms in private communities. The new module to be added to BitTorrent clients is called the Speculative Download Engine (SDE). All user commands pass through it, so that it can react appropriately, e.g., yield bandwidth when the user is downloading. At the same time, it gets input from a Discovery module, which provides a list of available swarms, together with their characteristics. This module already exists in many BitTorrent clients as an RSS feed parser, and many communities offer customizable RSS feeds for announcing newly available torrents. The commands issued by the SDE are executed by the Data transfer module, which implements the BitTorrent protocol to transfer data between other peers and the Storage module.

After a warm-up period, during which the engine adds data to the storage, the mechanism will start operating in a full-cache mode, where every new swarm has to replace an old one. For simplicity, and since the user is not interested in the actual data, we establish a standard size for downloads instead of downloading all data from a swarm, and use partial seeding [95]. This way, the storage is divided into pieces of equal size and the replacement policy does not have to consider the space swarm data occupies.

Considering the speculative download mechanism is similar to a cache, we can


Figure 4.1: Speculative download mechanism.

also analyze its functionality in terms of cache replacement: given a set of available swarms in the community and a set of swarms already in the cache, find the best swarm not in the cache and the worst swarm in the cache, and decide whether to replace the latter with the former. The decision is based on the predicted upload speeds in the swarms. In the next subsection, we show how this decision can be made.

4.2.2. Predicting future upload speed
For any BitTorrent peer and swarm, given the set P of characteristics of the swarm that are available to all peers (number of seeders, number of leechers, file size, swarm age), the set I of characteristics of the peer (download and upload bandwidth), and, optionally, the set H of records of the peer's historical activity in the swarm (download and upload history), we must give an estimate of the expected upload speed the peer will achieve in the swarm. We can assess the accuracy of an upload speed predictor by comparing the prediction to the actual upload speed a peer achieves when participating in the swarm. We then use this predictor to drive the SDE.

In this chapter, we predict upload speed using regression analysis. Given a trace of the behavior of a peer with s samples, each describing the activity of the peer in a swarm during a certain time interval, we construct a multiple regression model,

yi = β0 + β1x1i + β2x2i + ··· + βnxni,   i = 1, . . . , s

where the upload speed of the peer in the swarm during the interval when sample i was captured, yi, is a dependent variable determined by the independent variables {x1i, . . . , xni} = P ∪ I ∪ H. For a trace with s samples, we have s equations. Usually we have more samples than independent variables, so s > n, which means we have an overdetermined system of equations that is inconsistent and has no exact solution. We choose the parameters β1, . . . , βn and the intercept β0 so that they fit the data best using the least squares method. We then use the parameters β to predict the upload speed of the peer in any new swarm N at any time, given its characteristics x1N, . . . , xnN.

Because our data has unknown distributions for the characteristics, we use a technique called multivariate adaptive regression splines (MARS) [96], which is an extension of the linear regression we presented. MARS builds a model for the dependent variable using an iterative process. Starting from a model using only the intercept, MARS creates intermediary models by gradually adding independent variables such that the least squares error of the model is minimized.
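The least squares fit described above can be illustrated in a few lines. The sketch below uses synthetic data and ordinary least squares only (the MARS extension is not reproduced here); the variable names and the chosen coefficients are purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
s, n = 200, 4                    # s samples, n characteristics (P, I, H)
X = rng.uniform(size=(s, n))     # e.g., seeders, leechers, file size, age
beta_true = np.array([1.0, 2.0, -1.0, 0.5, 3.0])   # intercept + n slopes
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.01, size=s)

# Overdetermined system (s > n): least squares picks the best-fitting beta.
A = np.hstack([np.ones((s, 1)), X])            # prepend intercept column
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predicted upload speed for a new swarm N with characteristics x_N:
x_N = rng.uniform(size=n)
y_hat = beta[0] + x_N @ beta[1:]
```

With low-noise synthetic data, the fitted beta is close to the generating coefficients, which is what makes the prediction for a new swarm meaningful.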

4.2.3. Putting it all together We can now present in detail the operation of the cache replacement mechanism of the SDE using upload prediction. This is depicted in Algorithm 2.

Algorithm 2 Cache replacement algorithm
1: N = {Swarms not in cache}
2: C = {Swarms in cache}
3: for all s ∈ N do
4:     prediction[s] = predict_upload_speed_new(s)
5: end for
6: for all s ∈ C do
7:     prediction[s] = predict_upload_speed_old(s)
8: end for
9: h = highest(N, prediction)
10: l = lowest(C, prediction)
11: if prediction[h] > prediction[l] then
12:     remove_from_cache(l)
13:     add_to_cache(h)
14: end if

The algorithm predicts the upload speed of all the swarms that have not been downloaded yet (N), and, separately, it predicts the upload speed of all the swarms that have been downloaded already (C). Note that we can use different regression models for the two sets of swarms, since we have more information for C (set H is available, i.e., the history of downloads and uploads). Next, the algorithm sorts the two sets of swarms according to the predicted upload speeds and chooses the swarm

with the highest predicted upload speed from the new set, and the swarm with the lowest predicted upload speed from the old set. If the prediction for the former is higher than the prediction for the latter, it does the cache replacement. This algorithm can be run at regular intervals, or whenever the upload speed drops below a preconfigured threshold.

To have a realistic model, it is important to also consider the cost of downloading the new swarm that is to be put in the cache. Recall that our speculative download engine uses partial seeding and that we only download a fixed amount of data from every swarm we put in the cache. We are free to set this cache block size, denoted b, to maximize the efficiency of our algorithm; 16 MiB is a good value according to our experiments. Thus, for every cache replacement operation, we have to not only achieve a better upload speed, but also make sure we recuperate the cost of downloading the new cache block. It is necessary to introduce a new parameter for this, a time window w after which we expect to start profiting from the investment in the new swarm. The cache replacement condition on Line 11 of Algorithm 2 becomes:

(prediction[h] − prediction[l]) × w > b
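In a client, the replacement step with this cost condition might look as follows. This is a sketch with illustrative names, and it assumes consistent units (predicted speeds in bytes per unit of w, b in bytes):

```python
def maybe_replace(new_swarms, cached, predict_new, predict_old, w, b):
    """One SDE cache-replacement step (sketch of Algorithm 2 plus the
    download-cost condition). predict_new / predict_old map a swarm to its
    predicted upload speed; w is the payoff window, b the cache block size."""
    h = max(new_swarms, key=predict_new)   # best swarm not in the cache
    l = min(cached, key=predict_old)       # worst swarm in the cache
    # Swap only if the extra upload expected over window w repays the
    # b bytes spent downloading the new cache block.
    if (predict_new(h) - predict_old(l)) * w > b:
        cached.remove(l)
        cached.append(h)
    return cached
```

For example, with predicted speeds {a: 5, b: 2} outside the cache and {c: 1, d: 4} inside, w = 10 and b = 16, the expected surplus (5 − 1) × 10 = 40 exceeds b, so c is evicted in favor of a.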

4.3. Trace based evaluation In this section, we evaluate the efficiency of the SDE using traces of activity from a BitTorrent community.

4.3.1. Dataset
Our dataset consists of records of the activity of 84 025 peers in the Bitsoup private community between April and July 2007. The data was collected by crawling the website of the community once per hour (thus w = 1 h). It includes, for every peer in every torrent, the amount of data uploaded and downloaded, and the session duration. The mean number of sessions observed was 76 370. In addition, we have data about the file size of every torrent from the website. For the swarms that started after our monitoring began, we also have the age of the swarm. This gives us all the variables for the prediction algorithm belonging to sets P and H, regarding swarms and past peer activity, respectively (see Section 4.2.2). We approximate the variables in set I describing the characteristics of the peer, download and upload bandwidth, by using the maximum values observed in the trace for download and upload speed.

We only use samples in which the peer is participating in a single swarm at a time. This gives us a record of the download and upload performance of the peer at a time when the peer is not involved in any other activity in the community, creating a level playing field where each swarm is likely to receive the full bandwidth of the peer.

We differentiate between two types of BitTorrent sessions recorded in the trace. Sessions can either be considered new, where the peer is observed downloading in the trace, or old, where the peer has finished downloading and is now seeding. For

Table 4.1: Characteristics of the dataset

Community name: Bitsoup
Collection interval: April to July 2007
Number of users observed: 84 025
Sampling interval: 1 h
Mean number of sessions: 76 370

the old set, we can use the past peer activity in the swarm to inform the prediction algorithm and improve prediction accuracy.

Algorithm 3 Evaluating prediction efficiency
1: N = {Swarms not in cache}
2: C = {Swarms in cache}
3: for all s ∈ N do
4:     prediction[s] = predict_upload_speed_new(s)
5:     trace[s] = get_upload_from_trace(s)
6: end for
7: for all s ∈ C do
8:     prediction[s] = predict_upload_speed_old(s)
9:     trace[s] = get_upload_from_trace(s)
10: end for
11: hp = highest(N, prediction)
12: lp = lowest(C, prediction)
13: ht = highest(N, trace)
14: lt = lowest(C, trace)
15: if (prediction[hp] − prediction[lp]) × w × bandwidth > b then
16:     prediction_gain = trace[hp] − trace[lp]
17:     optimal_gain = trace[ht] − trace[lt]
18: end if
19: return prediction_gain, optimal_gain

4.3.2. Evaluation methodology
To evaluate the speculative download mechanism using the Bitsoup trace, we use Algorithm 3. This is very similar to Algorithm 2, which describes the normal operation of the SDE cache replacement when deployed in a BitTorrent client. However, in addition to predicting the upload speeds of swarms, it also checks the quality of those predictions against the data in the trace. The indicator for prediction quality is the gain in upload speed resulting from the cache swap:

gain = (upload_speed(added_swarm) − upload_speed(removed_swarm)) / upload_bandwidth

The gain is a dimensionless quantity and it can take values between 1, when a peer replaces an idle swarm with a swarm with an upload speed that saturates the peer's upload bandwidth, and −1, when a peer replaces a swarm that was saturating its upload bandwidth with a swarm with no upload potential.

For every peer in the community, we select the best swarm not in the cache (i.e., from set N), and the worst swarm in the cache (i.e., from set C). We do this selection according to the upload speed prediction obtained using our predictor function (Lines 11 and 12), just like Algorithm 2 would do. In addition, we also select the best new swarm and the worst old swarm according to the upload speed recorded in the trace (Lines 13 and 14). The decision to do the cache replacement is similar to the one in normal operation, taking into account the time window w and cache block size b. Additionally, bandwidth is added as a factor because the trace records upload speeds relative to the upload bandwidth of the peer. If the replacement is made, the resulting gain is evaluated using the actual data from the trace (Line 16). This represents the extra upload speed the peer gains by replacing the swarm it predicted was the worst in its cache with the swarm it predicted was the best available and not already in its cache. We also compute the optimal gain that is achievable by the peer if the prediction is perfect and identifies the real best and worst swarms in the trace (Line 17).

Algorithm 3 is run once for every peer in the trace. The sets of new and old swarms are built from the sessions of the peer in the trace. All leeching sessions of the peer are put in the new set N, and all seeding sessions of the peer are put in the old set C. It is important to note that time does not play a role in the evaluation, as we are not considering the effects of running the algorithm on the trace, but only evaluate the quality of each peer decision individually.
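The per-peer comparison of Algorithm 3 can be sketched as follows. The names are illustrative; *_pred and *_trace are assumed to map swarm identifiers to predicted and trace-recorded relative upload speeds:

```python
def evaluate_swap(new_pred, old_pred, new_trace, old_trace, w, b, bandwidth):
    """Compare the predicted cache swap against the trace ground truth
    (sketch of Algorithm 3). Returns (prediction_gain, optimal_gain),
    or None when the cost condition rejects the swap."""
    hp = max(new_pred, key=new_pred.get)     # predicted best new swarm
    lp = min(old_pred, key=old_pred.get)     # predicted worst cached swarm
    ht = max(new_trace, key=new_trace.get)   # actual best new swarm
    lt = min(old_trace, key=old_trace.get)   # actual worst cached swarm
    if (new_pred[hp] - old_pred[lp]) * w * bandwidth > b:
        prediction_gain = new_trace[hp] - old_trace[lp]
        optimal_gain = new_trace[ht] - old_trace[lt]
        return prediction_gain, optimal_gain
    return None
```

A perfect predictor makes hp = ht and lp = lt, so prediction_gain equals optimal_gain; the gap between the two is exactly what Section 4.4 measures.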
The cache replacement algorithm will try to replace a swarm from C with one from N, according to their predicted upload speeds. The fact that the sessions have taken place at different times in the trace is irrelevant because this is not one of the parameters that influence the upload speed. Notice also that the trace is only used for evaluating the prediction. It is not in any way “replayed” by our algorithm, and the algorithm is not running a simulation of the activity in the community. We are only interested in evaluating the upload speed prediction of single peers using real-world data. The practical performance of a prediction algorithm depends on the size of the input data. The more swarms used as input, the lower the chances the prediction algorithm will find the cache swap producing the maximum upload speed gain. Because of the importance of input data size, we create from the dataset using random sampling three constrained input data scenarios, in addition to the complete set containing all sessions. Table 4.2 shows the size of swarm sets for all input data scenarios, small (S), medium (M), large (L), and unconstrained (U). We use more new swarms than old swarms in our evaluation because this is the case in real world usage. Because we use random sampling when creating S, M, and L, we repeat the prediction 5 times for every peer and take the mean. One threat in using regression analysis is over-fitting. This means obtaining a model that gives highly accurate results for the data that was used to create it, but 4.4. Results .61

Table 4.2: Size of swarm sets per peer for all input data scenarios.

        S    M    L    U
|N|    10   20   40   All leeching sessions in trace
|C|     5   10   20   All seeding sessions in trace

that has poor prediction performance. We avoid over-fitting by splitting our data into two sets, one that we use for creating the regression model and one that we use for testing the prediction, which we present in Section 4.4. The ratio for splitting in our experiments is 80/20.

In addition to SDE, we define three other algorithms to put the results into context. Random shows the results of a simulation of repeatedly selecting a random new swarm to replace a random old swarm. On average, this produces a gain equal to the difference between the mean upload speeds of the new and old swarms. SeederRatio represents a simple heuristic that selects the new swarm with the lowest ratio of seeders to replace the old swarm with the highest ratio of seeders. Finally, we also present an algorithm called Optimal, which always replaces the worst old swarm with the best new swarm, thereby producing the best possible gain in upload speed.
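The SeederRatio baseline can be sketched compactly. Each swarm is represented here as a hypothetical dict of seeder and leecher counts, and we take "ratio of seeders" to mean the fraction of seeders among all peers (the text does not pin down the exact definition):

```python
def seeder_ratio_swap(new_swarms, cached):
    """SeederRatio heuristic (sketch): pick the most undersupplied new swarm
    and the most oversupplied cached swarm as the replacement candidates."""
    def ratio(s):
        peers = s['seeders'] + s['leechers']
        return s['seeders'] / peers if peers else 0.0
    h = min(new_swarms, key=ratio)   # lowest seeder ratio: undersupplied
    l = max(cached, key=ratio)       # highest seeder ratio: oversupplied
    return h, l
```

Unlike SDE, this heuristic uses no peer-specific information, which is why it serves as a simple baseline in the comparison.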

4.4. Results

We start the presentation of the results by looking at gains in upload speed, i.e., how much faster the peer uploads after the cache swap relative to its maximum upload speed. Figure 4.2 shows the CDFs of gains for SDE compared to Random, SeederRatio, and Optimal, for the situations in which there is a potential gain, i.e., the Optimal gain is positive. Random makes no difference to the upload speed in the median case. Moreover, when factoring in the cost of downloading a new swarm, it is immediately obvious that Random would decrease the sharing ratio in the median case (as we will show in Figure 4.4). SeederRatio produces a small increase in upload speed in the median case, and produces significantly fewer negative results than Random. We conclude that SeederRatio is a simple-to-implement strategy that is effective at increasing upload speed. For virtually all peers, SDE performs better than SeederRatio. Although there is some variation with input data size, the performance of the different algorithms does not vary significantly. Note that in some cases, SDE decides not to make the cache replacement; these cases are included in the CDF as having 0 gain.

While SDE does not attain optimal performance, it is important to note that all gains presented in this section are the average case for a single cache replacement. In normal usage, the algorithm will be run many times by a peer, every time its upload speed drops below a certain threshold, so the effects will be cumulative. Furthermore, we expect the upload speed of a peer using our algorithm to be constantly close to the maximum, so the gain necessary to reach the maximum will normally be low.

[Figure 4.2 plots CDFs of the gain (proportion of upload bandwidth, −1.0 to 1.0) for scenarios S: 5/10 (n=3947), M: 10/20 (n=1203), L: 20/40 (n=119), and U: all swarms (n=10072); series: SDE, SeederRatio, Random, Optimal.]

Figure 4.2: The CDF of the gain in upload bandwidth utilization for the SDE, SeederRatio, Random, and Optimal algorithms, in all four scenarios, when Optimal gain > 0. (n denotes the number of users in the experiment.)

[Figure 4.3 plots the CDF of the gain (proportion of upload bandwidth, −1.0 to 1.0) for scenario U: all swarms (n=8406); series: SDE, SeederRatio, Random, Optimal.]

Figure 4.3: The CDF of the gain in upload bandwidth utilization for the SDE, SeederRatio, Random, and Optimal algorithms, in scenario U, when Optimal gain > 0 and SDE performed a cache replacement. (n denotes the number of users in the experiment.)

In Figure 4.3, we present the cases when SDE decides to make a replacement. This time we only show the unconstrained scenario U, since the others are similar. We observe that SDE manages to make a cache replacement that results in an upload bandwidth usage increase in 75% of the cases. To put this into perspective, note that Random produces no effect for the median case. SeederRatio is once again better than Random, but not as good as SDE. We analyze the return on investment (ROI) for cache replacement decisions in potential gain situations in Figure 4.4. For every cache replacement, we compute the ROI considering that for one hour (the time window w for our dataset) the peer will benefit from the upload speed gained. The investment represents downloading 16 MiB (the cache block b we use) from a new swarm.

ROI = (gain × upload_bandwidth × w) / b

We can see that making random swarm choices results in a negative ROI for the majority of peers in all input data scenarios. Using SeederRatio choices instead gives a positive median ROI for all scenarios except L. Using SDE, on the other hand, the median ROI is always positive, ranging from 74% for the L scenario to 222% for the S scenario. For U, the median ROI is 207%. We conclude that SDE makes highly profitable investment decisions.

The converse of the potential gain situation is when there is no profitable replacement, i.e., no new swarm has a better upload potential than any of the old swarms, and the Optimal gain < 0. Intuitively, this situation should be rare in private communities. Nevertheless, we examine the performance of SDE in such situations in Table 4.3. For all scenarios, we see that SDE makes the correct choice of not doing
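The ROI computation can be written out directly; the formula below is reconstructed from the text (one-hour time window w, 16 MiB cache block b), and the function name is ours:

```python
def roi(gain, upload_bandwidth, window=3600, block=16 * 1024):
    """ROI of one cache replacement: extra data uploaded during the time
    window w (in KiB, for upload_bandwidth in KiB/s and window in
    seconds), divided by the b KiB downloaded to join the new swarm."""
    return gain * upload_bandwidth * window / block

# For example, a 1000 KiB/s peer that gains 10% utilization for one hour
# uploads 360 000 KiB extra against a 16 384 KiB investment (ROI ≈ 22).
```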

[Figure 4.4 plots CDFs of the return on investment (−10 to 20) for scenarios S: 5/10 (n=3425), M: 10/20 (n=1088), L: 20/40 (n=103), and U: all swarms (n=8406); series: SDE, SeederRatio, Random, Optimal.]

Figure 4.4: The CDF of the return on investment for the SDE, SeederRatio, Random, and Optimal algorithms, in all four scenarios, when Optimal gain > 0. (n denotes the number of users in the experiment.)

Table 4.3: Performance when there is no potential gain

                       S      M      L     U
Correct (%)           68     75    100    63
Median mistakes (%)  100     33     NA   100
First quartile loss  0.008  0.033   NA   0
Median loss          0.033  0.023   NA   0.091
Third quartile loss  0.132  0.019   NA   0.095

any cache replacement most of the time (for scenario L, the perfect result comes from the small number of samples). When SDE does make a mistake, the cost is usually low: 2–9% of upload bandwidth is lost in the median case. Note that the statistics for mistakes and loss refer only to the situations in which SDE makes a cache replacement.

4.5. Related work

A considerable body of research focuses on incentive mechanisms in BitTorrent, especially considering individual download sessions [45, 49, 97]. However, users can be observed downloading many files, often in overlapping download sessions. Since the pure BitTorrent protocol does not provide any way to identify peers across download sessions, many other works propose reputation systems intended to keep track of user behavior. Research has focused mainly on decentralized methods for building reputation systems [27, 98]. In practice, however, centralized solutions are widespread, in the form of private BitTorrent communities that use a centralized server to keep track of user identity and activity [18]. Although we use data from a private community as opposed to a decentralized reputation system, we note that our work is orthogonal to the practical implementation of such a system; the only requirement for our solution to work is a credit system that keeps track of user contribution across swarms. The oversupply problem of private BitTorrent communities is well documented in the literature, along with the problems it raises for maintaining membership [18, 47, 94].

There are many examples of caching in P2P systems [99–102]. While our system resembles a cache, it plays an active role in selecting the swarms it stores and does not rely on the user for this. In fact, one of the goals of our work is to decouple a user's contribution to the community from the user's preference for content, since the two are not likely to match. On this topic, Wu et al. [103] propose a solution based on a centralized server that assigns users to semi-static groups that contribute upload bandwidth to specific channels in a live video streaming system. Instead, we solve the problem for file-sharing systems, but in a completely decentralized manner. Carlsson et al.
[104] propose another solution, again based on central servers, that “inflates” swarms with peers that are not currently fully utilizing their upload bandwidth. Garbacki et al. [105] propose a

mechanism for bandwidth exchange between peers that decouples peer contribution from peer interest, but it is based on explicit help requests that peers send to one another to request the download of specific swarms. Regression models have been used before to predict characteristics of BitTorrent, such as online time [106] or peer download speed [107]. However, to the best of our knowledge, we are the first to predict the upload speeds that peers can achieve in BitTorrent swarms.

4.6. Conclusion

In this chapter, we provide a solution to the problem of bandwidth investing in private BitTorrent communities. We show how this problem occurs in the first place, when oversupply makes it hard for honest users to contribute their upload bandwidth. The solution design is based on speculative downloading: identifying swarms in need of upload bandwidth in the community, downloading them, and seeding them. At the heart of the solution is the Speculative Download Engine (SDE), a prediction algorithm which we implement using multiple regression. We test our solution with real-world data from a private BitTorrent community with more than 80 000 peers. The results show that SDE is successful for different types of input scenarios and that it produces a sizeable return on investment. In most situations in which there is no possibility for gain, SDE correctly predicts that a cache replacement should not take place.

Deploying SDE in a BitTorrent client and conducting a live Internet performance evaluation will reveal the effects of repeatedly running SDE for a peer, as well as the effects of multiple peers collectively using SDE. We hypothesize that, while the positive effect of using SDE will diminish when deployed throughout the community, there will be no negative effect on download performance, since using SDE leads to an increase in available upload bandwidth. Because the presence of multiple SDE-enabled peers in swarms will be reflected in the swarm characteristics, i.e., the number of seeders will increase, SDE will predict less potential for gain and will make fewer replacement decisions.

5. Towards a P2P bandwidth marketplace

Measurements from 2013 show that peer-to-peer (P2P) file transfer systems continue to be very popular, generating a third of the end-user upload traffic on the Internet [12]. Swarm-based systems like BitTorrent and its descendants (uTP, Swift) are very efficient at transferring files. At the same time, they are susceptible to poor performance caused by flash crowds and other types of supply and demand misalignment. In fact, recent studies [48, 109] have documented severe imbalances in existing P2P file transfer communities, with some users experiencing limited download speeds while other users who are willing to donate their upload bandwidth are waiting idle. The problem stems from the mismatch between supply and demand. Because BitTorrent communication takes place only in the context of a single swarm, peers cannot align their supply and demand based on information from multiple swarms.

In this chapter, we introduce the design of a bandwidth marketplace in which peers can independently initiate cross-file bandwidth trades. Peers monitor the supply and demand of individual files and alert other peers when they discover an undersupply situation. Idle peers can respond to these alerts and become helpers that alleviate the undersupply by contributing their upload bandwidth using a helper algorithm.

To study the behavior of helpers, we develop an analytical model based on the BitTorrent fluid model [49, 110]. Using our model, we identify several bottlenecks that can affect the download performance of peers. We show that the bandwidth marketplace we design can increase performance in most typical situations, but conclude that there are cases where performance improvements are impossible.

We identify an existing algorithm, part of the popular BitTorrent library Libtorrent [51], that can be used by bandwidth marketplace helpers, and present it in detail.
To study the behavior of the algorithm, we test it through experiments emulating two different types of oversupply, one static with a fixed set of peers, and one dynamic with a peer arrival process. The experiments involve running purposefully instrumented BitTorrent clients in a laboratory environment with up to 210 peers. (This chapter is based on work published in ICDCN 2014 [108].)

The contributions of this chapter are as follows:

1. We design a bandwidth marketplace that facilitates the matching of bandwidth supply and demand in P2P communities. The design is completely decentralized, is incentive compatible, and permits each participating peer to maintain its autonomy (Section 5.1);

2. We develop a model for the behavior of helper peers within swarms. Using the model, we derive performance bounds for the general operation of swarms and for the performance impact helpers can have (Section 5.2);

3. We evaluate an algorithm recently included in Libtorrent that can be used by helpers. Using realistic experiments, we show that it fulfils the requirements of a bandwidth marketplace helper algorithm (Section 5.4).

5.1. Bandwidth marketplace design

As a prerequisite, the bandwidth marketplace design depends on the existence of a credit system, e.g., as used by private communities [109], to incentivize users to contribute upload bandwidth across swarms. The plain BitTorrent tit-for-tat incentive mechanism is not sufficient, because seeders are not incentivized to continue uploading after they finish downloading. Our design is based on a decentralized algorithm, executed independently by all peers in the system, which is summarized in Figure 5.1. The operations executed in the algorithm are as follows:

1. All peers constantly monitor their own download progress to detect a potential supply shortage, as in swarm A in the figure. Whenever their download speed drops below a percentage of their historical download speed, they use MultiQ [111] to detect the real upload bandwidth of the seeders in the swarm. If they determine that the low download speed is caused by a leecher bottleneck (i.e., the seeders are fast, but the leechers cannot replicate the data they receive from seeders fast enough, Section 5.2.3), they execute the next operation in the algorithm.

2. A peer that identifies an undersupplied swarm broadcasts a help request containing the swarm ID to a limited set of marketplace partners, called its set of helpers. The set is bootstrapped with the best tit-for-tat partners from existing swarms. The helpers are ranked according to their responses to past requests. New helpers are constantly added optimistically to replace badly performing helpers, in order to improve the chances of receiving help, similarly to the way BitTorrent discovers new partners through optimistic unchoking.

3. All peers listen for help requests, but are free to act upon them autonomously. If a peer has a lower upload bandwidth utilization compared to historical values, for example because it is only part of oversupplied swarms, like swarm

Figure 5.1: Overview of bandwidth marketplace operation.

B in the figure, it joins the undersupplied swarm as a helper. Note that this does not imply connecting directly to the peer that sent the help request. The helper simply offers its excess upload bandwidth to the swarm, which inevitably also benefits the requester. Helpers keep track of their own upload in the new swarm and use it to rank requesting peers in the future. Peers that request help in severely undersupplied swarms are helped more frequently than those who request help in swarms that are only mildly undersupplied; this also provides security against malicious peers. Peers that are part of balanced swarms, like swarm C in the figure, ignore the help requests.

4. As a helper, a peer tries to upload as much as possible in the swarm while downloading as little as possible. Ideally, adding helpers to a swarm should have the same effect as increasing the upload bandwidth of existing peers in the swarm. We investigate the behavior of helper algorithms in general in Section 5.2 and we present an existing helper algorithm recently introduced by Libtorrent in Section 5.3.
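Steps 1 and 3 above amount to two local decision rules, which can be sketched as follows; the threshold values are illustrative assumptions, not part of the design:

```python
def needs_help(download_speed, historical_speed, leecher_bottleneck,
               threshold=0.5):
    """Step 1: a peer broadcasts a help request when its download speed
    drops below a fraction of its historical speed AND MultiQ-style
    detection attributes the drop to a leecher bottleneck."""
    return leecher_bottleneck and download_speed < threshold * historical_speed

def should_help(upload_utilization, historical_utilization, margin=0.8):
    """Step 3: a peer volunteers as a helper only when its own upload
    bandwidth is underused compared to historical values; it remains
    autonomous and may ignore any request."""
    return upload_utilization < margin * historical_utilization
```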

5.2. A model for helper contribution bounds

In this section, we present a model for predicting the performance of helper algorithms using analytically derived bounds, starting from the well-established BitTorrent fluid models [49, 110]. We describe corner cases of peer interaction and show how adding helpers to the system changes performance.

Table 5.1: Model notation

x(t)   Number of leechers
y(t)   Number of seeders
z(t)   Number of helpers
µx     Leecher upload bandwidth
µy     Seeder upload bandwidth
µz     Helper upload bandwidth
ηx     Leecher effectiveness
ηz     Helper effectiveness
ϵ      Helper download effect

5.2.1. Overview

Throughout this section and the rest of the chapter we use the notation summarized in Table 5.1. Consider a system of identical peers with a download bandwidth much greater than their upload bandwidth. According to the fluid model of Guo et al. [110], the download speed of a leecher at time t is

u(t) = µ (ηx x(t) + y(t)) / x(t)    (5.1)

where µ is the upload bandwidth, ηx is the effectiveness of leecher upload (i.e., the fraction of upload bandwidth leechers are able to use), x is the number of leechers, and y is the number of seeders. Considering further that leechers have upload bandwidth µx different from the upload bandwidth of seeders µy, the download speed becomes

u(t) = (ηx µx x(t) + µy y(t)) / x(t)    (5.2)

We introduce helper peers and model their population through z(t). Each helper has upload bandwidth µz. Helpers have a different effectiveness in using their upload bandwidth than leechers, which we denote ηz. Adding helpers to the system generates a certain increase in download requests, modeled through the parameter ϵ, which can vary from 0 (minimal effect) to 1 (the helper downloads everything at full speed, just like a leecher). The goal of the helper algorithm is to maximize ηz while minimizing ϵ. The download speed of leechers becomes

u(t) = (ηx µx x(t) + µy y(t) + ηz µz z(t)) / (x(t) + ϵ z(t))    (5.3)
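The extended fluid model can be evaluated directly as a function; parameter names follow Table 5.1, and this is our own sketch, not thesis code:

```python
def leecher_download_speed(x, y, z=0, mu_x=0, mu_y=0, mu_z=0,
                           eta_x=1.0, eta_z=1.0, eps=0.0):
    """Leecher download speed of the extended fluid model (Equation 5.3).
    With z = 0 it reduces to Equation 5.2, and with eta_x = eta_z = 0 it
    gives the bootstrapping regime of Equation 5.4."""
    return (eta_x * mu_x * x + mu_y * y + eta_z * mu_z * z) / (x + eps * z)
```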

5.2.2. Swarm bootstrapping

We define the bootstrapping phase of the swarm as the time before any of the leechers and helpers download a complete piece from the seeder. During this phase,

ηx = 0, because leechers have no pieces to upload to one another. Similarly, ηz = 0. Thus

u(t) = µy y(t) / (x(t) + ϵ z(t))    (5.4)

It is especially important for the helper algorithm to recognize and adapt to this phase of the swarm lifetime, so that the duration of the phase is minimized and the swarm moves to the phase in which leechers and helpers can contribute upload bandwidth.

5.2.3. Download speed bottlenecks

The leecher download speed is determined by one of three mutually exclusive bottlenecks that can exist in the system. The first is a seeder bottleneck (SB), which occurs when the aggregate upload speed of the seeders is lower than the upload speed of a leecher, µy y(t) < µx. In other words, the leechers in the swarm can replicate among themselves the pieces they receive from the seeders faster than the seeders can upload them. In this case, in the absence of helpers, the download speed is bound by:

uSB(t) = µy y(t)    (5.5)

Helpers cannot increase the leecher download speed in this situation. The best a helper algorithm can do is maintain an ϵ as low as possible, so that u(t) does not decrease.

The second bottleneck is the leecher bottleneck (LB). In this case, µx < µy y(t), so that the seeders can upload data faster than the leechers can distribute it among themselves. Qiu and Srikant [49] show that the leecher upload effectiveness is very high:

ηx ≈ 1 − (log N / N)^k    (5.6)

where N is the number of file pieces and k is the number of connections each peer has. In practice, with files having at least hundreds of pieces and peers having tens of connections, ηx is considered 1, which is confirmed by actual measurements [85]. In the absence of helpers, the download speed is bound by:

uLB(t) = (µx x(t) + µy y(t)) / x(t)    (5.7)

Adding helpers to a swarm operating in this regime can increase download speed up to the limit given by the upload speed of the seeders (Equation 5.5). Thus, the speedup helpers can achieve is bounded by:

Sb1 = µy y(t) x(t) / (µx x(t) + µy y(t))    (5.8)

Introducing helpers into the system can create a helper bottleneck (HB), derived from the total upload bandwidth that helpers add to the swarm. Consider the ideal

situation where helpers do not download anything (ϵ = 0) and use their full upload bandwidth (ηz = 1), which is not possible in reality. The download speed is:

uHB(t) = (µx x(t) + µy y(t) + µz z(t)) / x(t)    (5.9)

Thus, the second speedup bound is:

Sb2 = (µx x(t) + µy y(t) + µz z(t)) / (µx x(t) + µy y(t))    (5.10)

We compute zmax(t), the maximum useful number of helpers; having more than zmax(t) helpers does not increase the download speed any further. Equating the two download speed bounds, uSB(t) (Equation 5.5) and uHB(t) (Equation 5.9):

µy y(t) = (µx x(t) + µy y(t) + µz zmax(t)) / x(t)    (5.11)

zmax(t) = (µy y(t) (x(t) − 1) − µx x(t)) / µz    (5.12)

5.3. Libtorrent helper algorithm

Libtorrent [51] is a library implementing the BitTorrent protocol, designed to be used by third-party BitTorrent clients. Its latest major version (0.16) introduced a share mode for downloading, in which file pieces are downloaded only when a heuristic determines it is possible to upload them to other leechers. Like many other BitTorrent engines, Libtorrent supports choosing a subset of files to download from a multi-file torrent. Internally, the choice is implemented through a bitmap that marks each file piece for download or not. This bitmap is also used in share mode. Initially, the bitmap is empty, so Libtorrent does not download any data. The bitmap is updated by Algorithm 4, which is triggered by new peer connections, the closing of connections, the completion of piece downloads, and various other events. The parameters are controlled by Libtorrent code, except for T, which we set to 1, so that the upload is at least equal to the download.

The algorithm tries to heuristically estimate the feasibility of downloading one of the rarest pieces observed at neighboring peers. It starts by counting the total number of pieces missing at neighboring leechers (Line 4). Only the number of pieces is taken into consideration, not their IDs; however, given the random rarest-first piece picking policy of BitTorrent clients [8], it is very likely that the sets of missing pieces of leechers are disjoint, hence the choice of not storing piece IDs. Note that other helpers are not taken into consideration when counting missing pieces. The number of missing pieces is decreased considering that each seeder can upload a number of pieces given by parameter B while the helper is downloading one piece. If this results in no missing pieces being left among the connected leechers, the algorithm returns and the helper does not start downloading another piece.
Next, the helper examines the types of peers it is connected to (Line 11). Having seeders take up too high a proportion of all connections is detrimental to the chances

Algorithm 4 Libtorrent helper algorithm for piece selection
Require: B, T, S, D
 1: missing_pieces = 0
 2: for all p ∈ connected_peers do
 3:   if p is a leecher then
 4:     missing_pieces += total_pieces − pieces(p)
 5:   end if
 6: end for
 7: missing_pieces −= B × |connected_seeders|
 8: if missing_pieces ≤ 0 then
 9:   return
10: end if
11: if |connected_seeders| / |connected_peers| > S then
12:   disconnect excess seeders
13: end if
14: if downloaded × T > uploaded then
15:   return
16: end if
17: if downloading × D > downloaded then
18:   return
19: end if
20: if |connected_peers| − min(piece_rarities) < T then
21:   return
22: end if
23: download random(rarest_pieces)

of uploading, because seeders are never interested in new pieces. However, being connected to some seeders is beneficial overall, because it offers the opportunity to download pieces that are rare among leechers. The parameter S controls the proportion of seeder connections; it depends on a number of factors, like the total number of connected peers and the maximum number of connections allowed. In case the helper is connected to too many seeders, it disconnects some of them.

The algorithm checks the upload and download statistics to make sure the sharing ratio remains above the target given by parameter T. Note that this check does not stop the bootstrapping of the helper (the sharing ratio while downloading the first piece is inevitably 0), because it only involves fully downloaded pieces. This check is a feedback mechanism that stops the download of new pieces if the upload is not proceeding as expected. Furthermore, there is also a check on the pieces whose download is in progress, controlled by parameter D (Line 17), which ensures the helper does not request too many pieces at the same time. Finally, the availability of the rarest pieces is computed and compared to the number of connected peers, to ensure there are enough peers potentially interested in a future download (Line 20).
One piece is selected at random from the set of rarest pieces and is marked for download.
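A condensed Python rendering of Algorithm 4 may clarify the control flow; the peer representation and the parameter defaults are illustrative assumptions (not Libtorrent's), and the seeder-disconnection step is only noted, not performed:

```python
import random

def pick_share_mode_piece(peers, piece_rarities, total_pieces,
                          uploaded, downloaded, downloading,
                          B=5, T=1, S=0.5, D=1):
    """Sketch of the Libtorrent share-mode heuristic (Algorithm 4).
    Returns the index of a piece to download, or None."""
    leechers = [p for p in peers if not p["is_seeder"]]
    seeders = [p for p in peers if p["is_seeder"]]

    # Count pieces still missing at connected leechers; piece IDs are
    # ignored, since rarest-first makes the missing sets likely disjoint.
    missing = sum(total_pieces - p["pieces"] for p in leechers)
    # Each seeder will fill B pieces while we download one piece.
    missing -= B * len(seeders)
    if missing <= 0:
        return None

    # Here Libtorrent would disconnect excess seeders when their share of
    # connections exceeds S (omitted in this sketch).

    # Feedback checks: keep the sharing ratio above T, and limit the
    # number of simultaneously requested pieces via D. Both pass when
    # nothing has been downloaded yet, so bootstrapping is not blocked.
    if downloaded * T > uploaded:
        return None
    if downloading * D > downloaded:
        return None

    # Require enough potentially interested peers for the rarest pieces.
    if len(peers) - min(piece_rarities.values()) < T:
        return None

    rarest = min(piece_rarities.values())
    candidates = [i for i, r in piece_rarities.items() if r == rarest]
    return random.choice(candidates)
```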

Table 5.2: Experiment parameters

Parameter                    Static     Dynamic
File size [GiB]              1          1
Number of seeders            1          1
Seeder bandwidth [KiB/s]     1000–5000  5000
Number of leechers           20         70
Leecher bandwidth [KiB/s]    1000       1000
Number of helpers            10–70      70
Helper bandwidth [KiB/s]     500–3500   1000
Run time [s]                 200        300

5.4. Evaluation of helper contribution impact

We evaluate the helper algorithm in a controlled environment using a Libtorrent-based BitTorrent client, through a static experiment in which we compare the performance of the client to the bound given by the model in Section 5.2, and a dynamic experiment in which we replicate a flash-crowd peer arrival pattern.

5.4.1. Experimental setup

The experimental setup is based on the P2P Test Framework [112], a collection of scripts that facilitates testing P2P clients in controlled conditions. We build a C++ BitTorrent client based on Libtorrent that supports share-mode peers acting as helpers, starting from the minimal client included in the framework. Our client does not have a graphical user interface and is thus only suited for laboratory experiments, not daily use. Nevertheless, our client is a fully conforming BitTorrent client, not an emulator.

The common parameters of our experiments are summarized in Table 5.2. We only limit the upload bandwidth of peers and leave the download unconstrained, in accordance with the model we use [49, 110] and with typical ISP connections, which are highly asymmetrical. All clients are stopped after the given run time. Considering the file size and upload bandwidths we use, this means no leechers become seeders. In other words, our results describe exclusively the transient undersupplied phase of the swarm, during which helpers have the biggest impact. The number of peers is sufficient to allow choking to occur given the usual number of upload slots, 5 [8]; experiments using fewer peers would be unrealistic because the peers would always upload to each other. The run time is chosen to allow sufficient choke rounds between peers to minimize outliers, which is visible in the low standard deviation of the results.

5.4.2. Static experiment

In the first experiment, which we call static, we observe a swarm for the first 200 seconds of its lifetime. The swarm has a single seeder, which is the content injector in this case. The seeder uses the Libtorrent version of the efficient super-seeding algorithm [113], which presents to each leecher only the rarest subset of the available pieces, thus maximizing the seeding efficiency by minimizing the uploading of content multiple times to different leechers. There are 20 identical leechers in the swarm, with an upload bandwidth of 1000 KiB/s, and they all join at the beginning.

The baseline performance in our evaluation is given by the normal BitTorrent execution. We start the seeder and leechers and record the mean leecher download speed over the duration of the experiment. We compare the baseline with the effect of introducing helpers originating from our bandwidth marketplace. We again record the mean download speed of the original leechers, which is now affected by the additional helpers. In this experiment, all additional helpers are added to the swarm at the same time, 30 seconds after the start, to simulate the delay introduced by the bandwidth marketplace protocol.

Along with the additional helpers running the Libtorrent helper algorithm, we use two other types of additional peers, in order to give insight into the performance of the Libtorrent helper algorithm. First, to replicate a naive approach to designing a bandwidth marketplace, in which the additional peers trying to help are actually normal peers, we inject peers running the normal leecher algorithm. Second, we simulate the addition of ideal helpers using the model bound for the download speed (Equation 5.9); this ideal bound provides a performance target for helper algorithms, although it cannot be reached by an actual implementation.
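Under the static-experiment parameters of Table 5.2, the model bounds of Section 5.2 can be evaluated directly; this sketch (our own, with names following Table 5.1) reproduces the zmax = 75 limit used in this experiment:

```python
def ideal_download_speed(x, y, z, mu_x, mu_y, mu_z):
    """Ideal helper bound of Equation 5.9 (eps = 0, eta_z = 1)."""
    return (mu_x * x + mu_y * y + mu_z * z) / x

def z_max(x, y, mu_x, mu_y, mu_z):
    """Maximum useful number of helpers (Equation 5.12); beyond it the
    seeder bound of Equation 5.5 caps the download speed."""
    return (mu_y * y * (x - 1) - mu_x * x) / mu_z

# Leecher-bottleneck case of the static experiment: 20 leechers at
# 1000 KiB/s, one seeder at 5000 KiB/s, helpers at 1000 KiB/s.
# z_max(20, 1, 1000, 5000, 1000) → 75.0
```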
We first explore the leecher bottleneck case, when the download speed is constrained by the upload speed of leechers. To reproduce this bottleneck, we set the seeder upload bandwidth to 5000 KiB/s, corresponding to a well-provisioned content injector. Figure 5.2 shows the effect of adding additional peers compared to the baseline BitTorrent performance. The maximum number of additional peers used is 70 (it must be less than zmax = 75, Equation 5.12), and the minimum number is 10, half the number of leechers. All additional peers have the same bandwidth, 1000 KiB/s, which is equal to that of the leechers.

The Libtorrent helper algorithm is successful in alleviating undersupply in the swarm. It shifts the bottleneck from a leecher bottleneck to a helper bottleneck. The addition of helpers produces a linear effect from 10 to 50 helpers, at which point the leecher download speed is more than twice as high as the baseline speed. Adding more helpers gives diminishing returns; the download speed increase from 50 to 70 helpers is sub-linear. Examining the use of normal peers as additional peers, we note that it does not improve the performance of the original leechers. This is in accordance with our model, which predicts the slight drop in performance visible in the figure, caused by the division of seeder bandwidth over more peers (x increases in Equation 5.7).

The simulated ideal performance bound shown in the figure is approximately twice as high as the performance generated by Libtorrent helpers in the 10–50 additional peers range. To investigate the cause of this difference, we analyze the bandwidth usage of Libtorrent helpers in Figure 5.3. The mean upload and download speed of helpers can be used to approximate the parameters ηz and ϵ in Equation 5.3. While the standard deviation of these mean speeds is not as low as the standard deviation of leecher performance in Figure 5.2, we do note a trend of decreasing

[Figure 5.2 plots the mean download speed (KiB/s) of the original leechers against the number of additional peers (10–70), for the Baseline, Helpers, Ideal, and Normal peers configurations.]

Figure 5.2: Effect of number of additional peers on mean download speed of original leechers in swarm for static experiment (standard deviation error bars, n = 20).

[Figure 5.3 plots the mean download and upload speed (KiB/s) of the helpers against the number of helpers (10–70).]

Figure 5.3: Effect of number of helpers on their own mean download and upload speed (standard deviation error bars, n indicated by horizontal axis).


Figure 5.4: Effect of upload bandwidth of additional peers on mean download speed of original leechers in swarm (standard deviation error bars, n = 20).

effectiveness of the Libtorrent helper algorithm correlated with the increasing number of helpers. The upload effectiveness for the 10 helpers case is a remarkable 89% (recall the 1000 KiB/s upload bandwidth). For the 70 helpers case, the upload effectiveness is only 54%. At the same time, the Libtorrent helper algorithm detects the reduced effectiveness and decreases the download burden on the swarm: the download burden is 22% for the 10 helpers case and only 8% for the 70 helpers case. Overall, we conclude that the algorithm behaves as expected, but that the download burden ϵ can be improved, and so can the upload effectiveness ηz for swarms with a high number of helpers.

We also investigate the effect of varying the upload bandwidth of helpers while keeping their number constant. All additional peers are again identical. The upload bandwidth varies from 500 KiB/s, which is half that of the leechers, to 3500 KiB/s (derived from Equation 5.12). We depict the results in Figure 5.4. The Libtorrent helper algorithm shows a behavior similar to Figure 5.2. However, the limit of the linear performance increase is lower: starting with the 2000 KiB/s case, increasing the upload bandwidth of helpers results in smaller and smaller download speed increases for the leechers. On the other hand, using normal peers as additional peers has the potential to generate good results. If low-bandwidth peers (500 KiB/s) are used, there is a performance drop, but starting with moderately fast peers (1500 KiB/s), there is a positive effect on the download speed of the original leechers. At the same time, we note that it is still better to use helpers instead of normal peers; even at the highest upload bandwidth in the experiment (3500 KiB/s), helpers generate a bigger performance increase than normal peers, 112% vs. 70%.
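The two quantities discussed above can be computed directly from the measured helper transfer speeds. The sketch below uses function names of our own choosing and assumes both quantities are expressed relative to the helper upload bandwidth, which is consistent with the percentages quoted; the example speeds are hypothetical values matching the 10-helper case.

```python
def upload_effectiveness(mean_upload_speed, upload_bandwidth):
    """Approximate eta_z: the fraction of a helper's upload
    bandwidth that is actually used for uploading."""
    return mean_upload_speed / upload_bandwidth

def download_burden(mean_download_speed, upload_bandwidth):
    """Approximate epsilon: the helper's download traffic
    relative to its upload bandwidth."""
    return mean_download_speed / upload_bandwidth

# Hypothetical speeds for the 10-helper case: 1000 KiB/s helpers
# uploading at 890 KiB/s and downloading at 220 KiB/s on average.
print(upload_effectiveness(890, 1000))  # 0.89
print(download_burden(220, 1000))       # 0.22
```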
Figure 5.5 shows the effect of helper upload speed on their upload effectiveness


Figure 5.5: Effect of upload bandwidth of helpers on their own mean download and upload speed (standard deviation error bars, n = 20).

and their download burden on the swarm. For low-bandwidth helpers, the effectiveness is very high. Starting from 2000 KiB/s, however, it flattens out. This is accompanied by a decrease in download burden, so the overall effect on leecher performance is still positive.

While in practice the leecher bottleneck case we have explored so far is the most common, we also investigate the seeder bottleneck case. For this, we set the seeder upload speed to 1000 KiB/s, equal to that of the leechers, generating a seeder bottleneck slightly lower than the theoretical swarm dissemination speed of 1050 KiB/s (Equation 5.7). Figure 5.6 shows the results. Recall from Section 5.2.3 that even ideal helpers cannot increase the leecher download speed in this case, because it is strictly bounded by the upload speed of the under-provisioned seeder. The additional peers have an upload bandwidth of 1000 KiB/s, just like the leechers. Note that the baseline performance does not coincide with the model, due to overhead in the practical implementation of the protocol. The Libtorrent helper algorithm performs ideally, generating no download burden on the swarm and keeping the download speed of leechers unchanged. On the other hand, adding normal peers to the swarm causes a significant performance drop. This is due to the bootstrapping period (Section 5.2.2), which lasts for approximately 10% of our experiment.

Figure 5.6: Effect of additional peers on mean download speed of original leechers in swarm in case of a seeder bottleneck (standard deviation error bars, n = 20).

5.4.3. Dynamic experiment
The second experiment consists of emulating a flash crowd where increasing numbers of leechers join the swarm. To record the baseline BitTorrent performance, we create 5 waves of leechers spaced 30 seconds apart, starting at the beginning of the experiment. The number of leechers in each wave is 10, 12, 14, 16, and 18, respectively. All leechers have the same upload bandwidth, 1000 KiB/s, while the seeder has 5000 KiB/s, meaning the performance bottleneck is caused by the leechers. The experiment lasts for 300 seconds and no leecher finishes downloading. We report the mean download speed of leechers grouped by their join time, to see if the join time has an effect on performance. To test the effect of helpers, each leecher summons two helpers to join the swarm. All helpers have the same upload bandwidth as the leechers, 1000 KiB/s. Each helper joins the swarm 10 seconds after the leecher that summons it.

Figure 5.7 shows the results. The baseline performance of early leechers is better than that of late leechers, because early leechers benefit from a higher bandwidth allocation from the seeder. The Libtorrent helper algorithm works very well in this dynamic flash crowd situation, which is the ideal use case for the bandwidth marketplace. The presence of helpers more than offsets the performance decrease of leechers joining the swarm in the middle of the experiment, in the 30-second and 60-second waves. These leechers benefit from the presence of helpers summoned by previous waves of leechers; those helpers have already downloaded pieces and are ready to share them. Specifically, the download speed of leechers in the 60-second wave almost doubles, from 827 KiB/s to 1518 KiB/s. The leechers arriving latest do not have the chance to encounter enough helpers before the experiment ends, hence their lower performance.

Figure 5.7: Effect of arrival time in seconds on mean leecher download speed, when normal peers are added (Baseline), and when normal peers and helpers are added (Helpers) (standard deviation error bars, n = 10 + wave × 2).

5.5. Related work
Several designs have been proposed to solve the supply and demand problem in P2P file transfer systems. None of them has seen wide-scale adoption, and plain BitTorrent continues to be very popular. Compared to previous work, our solution is decentralized, incentive-compatible, and based on a simple algorithm that does not require user intervention. Garbacki et al. [105] attempt to solve the problem using "bandwidth borrowing". While effective, their solution has a clear scalability problem, as a peer can only recuperate bandwidth from its debtors. Aperjis et al. [114] propose PACE, a credit-based system for multilateral transfers, where each peer asks a specific price for its services. The main problem with the design of PACE is the complexity of the pricing algorithm which, although allegedly simple, does take into account complex parameters like "network-friendly" behavior. Peterson and Sirer [92] propose Antfarm, a P2P system that efficiently balances bandwidth supply and demand. Antfarm, however, is not a truly decentralized solution, relying instead on a central server to coordinate peer actions. Wu et al. [103] design a solution specific to video streaming, where peers are assigned to static groups responsible for certain streams. Again, the grouping of peers is done by a centralized server. Most recently, Dán and Carlsson [115] propose the merging of under- and oversupplied swarms, with good results. However, their design also includes centralized servers, used for merging swarms.

5.6. Conclusion
The bandwidth marketplace we designed in this chapter is a completely decentralized and incentive-compatible solution to the problem of bandwidth supply and demand mismatch in P2P file transfer systems. According to the analytical model we introduced, it has the potential to remove the performance bottleneck in swarms with well-provisioned seeders. We identified a suitable algorithm for helper peers as part of Libtorrent and described its operation. Through realistic experiments, we tested the helper algorithm and discovered that it significantly increases the download performance of peers. As seen in the comparisons with idealized helpers, the effectiveness of the algorithm can still be improved.


6 Decentralized credit mining in P2P systems

User contribution is key to the functioning of P2P systems, as we have shown in the previous chapters. Credit-based private BitTorrent communities offer high download speeds because they are successful at incentivizing users to contribute [18]. The Tribler community [23] uses the decentralized BarterCast credit system to incentivize users to seed [27], similar to private communities, while remaining open to everyone. However, Tribler lacks a mechanism for increasing the contribution of honest users who are limited by low upload bandwidth [47] or by a lack of knowledge of BitTorrent economics [48].

In this chapter, we design, implement, and evaluate through Internet experiments a decentralized Credit Mining System (CMS) aimed at helping users contribute to Tribler. The goal of the CMS is simple: earn BarterCast credit on behalf of the user—without requiring user intervention—by contributing upload bandwidth to the community. The user can then spend the credit earned, for example, to obtain a fast download speed while downloading new content in case of a flash crowd [33]. (Tribler peers can rank download requests in order of requester credit, and can give priority to the requests of the peers with the most credit.)

Our CMS is completely decentralized—it is part of the Tribler P2P client and does not require the collaboration of any other Tribler peer to function. The operation of the CMS can be seen as a sequence of three steps. First, the user selects a source of swarms for the CMS to take into consideration for credit mining. This is a form of white-listing and ensures the user has control over the content shared through their computer. Second, the CMS periodically selects a subset of swarms for active credit mining. The user may provide a large number of swarms and it is not technically feasible for the CMS to actively participate in all of them.
Third, the CMS joins the selected swarms and attempts to maximize the credit earned by downloading as little as possible and uploading as much as possible. This chapter is based on work published in IFIP Networking 2015 [116].


Long-term content availability is a problem in P2P systems, caused by the gradually falling user demand for old content [48]. This also makes credit mining old swarms inefficient. However, users may want to improve the availability of old content, and we provide a special mode of operation for the CMS to help them. In archival mode, the CMS selects swarms not based on upload potential, but on the number of replicas present in the system.

In addition to the functionality outlined above, we include in our CMS design several subsystems aimed at tackling challenges arising from implementing and deploying the CMS over the Internet. We identify duplicate content and select for credit mining only the swarm with the most peers. We detect spam using collaborative filtering. Finally, we optimize the network traffic necessary to maintain up-to-date information on the swarms in the CMS.

We implement our design in Tribler and deploy it over the Internet, using swarms from a real-world BitTorrent community. In our evaluation, we explore several parameters that influence the functioning of the CMS and use the results to select default values that lead to good performance. Furthermore, we verify the feasibility of widespread adoption of the CMS through an experiment where a high proportion of peers use it. The contributions we make in this chapter are the following:

1. We design CMS, a credit mining system for Tribler, which automatically selects from a whitelist of swarms provided by the user the swarm with the best upload potential, and joins this swarm with the goal of maximizing its upload/download ratio (Section 6.2);

2. We solve problems arising from deploying CMS on the Internet by detecting duplicate content, removing spam, and minimizing overhead network traffic (Section 6.3);

3. We implement CMS in Tribler and test it with real-world swarms, showing its credit mining effectiveness, and proving its compatibility with widespread community deployment (Sections 6.4 and 6.5).

6.1. Background In this section, we describe accounting in P2P systems, focusing on centralized, as well as decentralized solutions. We also introduce Tribler, the system we use for the implementation of the CMS.

6.1.1. Accounting and credit in P2P systems
Accounting mechanisms are widespread in P2P systems because they incentivize users to contribute. Arguably the most successful P2P system, BitTorrent, uses tit for tat [8], a pairwise accounting mechanism in which peers keep track of their interactions with other peers and reward the peers with the biggest contributions. However, tit for tat does not incentivize long-term contribution beyond the transfer of a single file.

Several BitTorrent communities use centralized accounting to track the overall contributions of users over time [18]. BitTorrent clients report the upload and download to the community servers, such that each user has an associated sharing ratio—the ratio between upload and download. Communities employ sharing ratio enforcement to provide certain privileges, like access to the newest content, only to the users with sharing ratios above certain thresholds. The sharing ratio is a form of credit, and in certain communities it can be transferred between users.

Decentralized accounting mechanisms spread the burden of tracking peer contributions from a centralized server to the individual peers in the network [98]. Decentralized accounting mechanisms that require every peer to store the complete and up-to-date contribution information of all peers suffer from limited scalability. Instead, BarterCast [27] uses only the local view of each peer to compute the relative contribution of other peers, by applying flow network techniques on the peer interaction graph.

Accounting is used to provide benefits to users who earn credit. In certain private communities, users are required to maintain a minimum credit balance, otherwise they are expelled from the community [18]. In other communities, new content is first made available to users with sufficient credit, and only later to the other users.
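Sharing ratio enforcement as described above can be sketched in a few lines. The function names and the 0.7 threshold below are illustrative, not taken from any particular community.

```python
def sharing_ratio(uploaded_bytes, downloaded_bytes):
    """Sharing ratio = upload / download; a user who has not
    downloaded anything yet gets an infinite ratio."""
    if downloaded_bytes == 0:
        return float("inf")
    return uploaded_bytes / downloaded_bytes

def has_privileges(uploaded_bytes, downloaded_bytes, threshold=0.7):
    """Sharing ratio enforcement: grant privileges (e.g. access to
    the newest content) only above a community-chosen threshold."""
    return sharing_ratio(uploaded_bytes, downloaded_bytes) >= threshold

# A user who uploaded 7 GiB against 10 GiB downloaded has ratio 0.7.
print(has_privileges(7 * 2**30, 10 * 2**30))  # True
```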

6.1.2. Tribler
Tribler is a P2P system developed at the Delft University of Technology and released as open-source software. It is based on a custom protocol, called Swift [21], but is also compatible with BitTorrent. In addition to file transfer, it provides collaborative wiki-style editing and decentralized search, and integrates the BarterCast accounting mechanism.

As opposed to BitTorrent, where users have no long-term identifiers, each Tribler user is assigned a permanent identity, called PermID, which is used by BarterCast and other subsystems. The PermID is actually an automatically generated public key, which also enables Tribler to encrypt the communication between peers. At the same time, each data transfer between two Tribler peers generates a BarterCast record that is cryptographically signed by both peers. BarterCast records provide unforgeable proof of contribution and are the basis of the Tribler accounting mechanism.

Tribler gives users the possibility to publish their own content in a decentralized way. Any Tribler user can create a personal channel and add content to it. Other users can subscribe to the channel and receive new content as it is published. The P2P infrastructure for publishing and subscribing to channels, as well as for collaborative editing and other Tribler features, is provided by Dispersy [24], a generic message synchronization system. Dispersy uses Bloom filters to enable peers to exchange messages in a P2P network under challenging conditions, including high churn and lack of end-to-end connectivity.

Applications running on top of Dispersy define the message semantics—Dispersy is only concerned with synchronization. For example, Tribler uses several types of messages to implement channels, such as a message for creating a channel, a message for adding content to a channel, a message for removing content from a channel,


Figure 6.1: An overview of the three steps of the credit mining process.

etc. Each peer adds all messages to a Bloom filter and periodically exchanges the Bloom filter with other peers to determine what new messages must be exchanged. For example, if peer A is subscribed to a channel, when A and another peer, B, exchange Bloom filters, if A observes in the Bloom filter of B a message adding new content to the channel, A will request the message from B, thereby discovering the new content.
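The synchronization idea can be illustrated with a toy Bloom filter. This is a simplified sketch and not Dispersy's actual implementation; the class, parameters, and message strings are all invented for illustration.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over a fixed bit array."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# Peer B advertises its messages as a Bloom filter; peer A checks
# which of its own messages B appears to be missing.
b_filter = BloomFilter()
b_messages = {"create-channel:music", "add-content:torrent-1"}
for msg in b_messages:
    b_filter.add(msg)

a_messages = b_messages | {"add-content:torrent-2"}
missing_at_b = [m for m in a_messages if m not in b_filter]
# missing_at_b holds the new message, barring rare false positives.
```

Note the asymmetry that makes Bloom filters suitable here: membership tests can yield false positives (a message wrongly assumed present), but never false negatives, so no message is requested twice and missed messages are caught in later exchanges.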

6.2. Credit Mining System design
In this section, we describe the design of the CMS. Considering the three steps involved in the credit mining process, depicted in Figure 6.1, we discuss first the supply of swarms to the CMS, second, the algorithm for selecting swarms for investment, and third, the behavior of the CMS within a swarm. Finally, we describe the design of the archival mode of operation of the CMS.

6.2.1. Supply of swarms
We allow the user to select the swarms that are to be credit mined by the CMS, thereby creating a form of white-list. In other words, we make the credit mining process opt-in: while the CMS is autonomous and does not require user input, it does require a source of swarms to start. The user can supply swarms to the CMS using three sources, which we describe in turn.

The first source is a list of Tribler channels selected by the user for boosting. Whenever the user wants to support a channel, without necessarily downloading the content shared through the channel, they can add the channel to the CMS so that its swarms are taken into consideration.

The second source is a web feed coming from a web server, which allows the CMS to access swarms as soon as they are created. Web feeds are a popular means for communities to distribute swarms and we have used them successfully for this purpose before [18].

The third source is a folder containing torrent files. This can be a folder with torrents downloaded by the user from the web, or it can be the folder with torrents collected by Tribler through gossiping during normal operation. When two Tribler peers meet, they exchange lists of swarms to determine the similarity between them, a feature useful for search. At the same time, the torrents corresponding to the swarms are saved in a local cache, which is accessible to the user.

6.2.2. Swarm selection
The number of swarms available through a source is not bounded, so the CMS has to explicitly select a subset of active swarms to which to dedicate the bandwidth and storage resources available. Furthermore, swarm selection is a periodic process, because the characteristics of swarms change constantly and the CMS must be able to improve the credit mining by changing the set of active swarms. The design of the CMS includes a pluggable policy for swarm selection.

We have previously studied methods to predict the upload potential of swarms using simulation [117]. We have shown that a multivariate adaptive regression splines (MARS) model can select swarms with good upload potential for the majority of peers in our simulation. At the same time, we obtained positive results for the same simulation setup using a simpler swarm selection policy, namely selecting the swarm with the lowest proportion of seeders. Because the regression model we used for simulation is difficult to integrate into Tribler, in this chapter we use simpler policies, which still give good results, as we will show in Section 6.5.

The first policy is the seeder ratio (SeederRatio) policy, which selects for credit mining the swarms with the lowest ratio of seeders to all peers (seeders and leechers). Intuitively, this policy selects the most undersupplied swarms, i.e., the swarms with the least bandwidth supply—seeders—compared to the total bandwidth demand—leechers.

The second policy is the swarm age (SwarmAge) policy, which selects swarms based on their age. Previous work has shown that the age of a swarm plays a significant role in the potential for upload [48]. Specifically, the newest swarms offer the best potential. Most users will download any content only once or a few times, and there is a bounded number of users. It follows that there is a bounded number of times any content will be downloaded in total.
So it is expected that an early download will have more potential for upload—because of subsequent downloads by other peers—than a late download. The third policy, Random, is used mainly as a baseline for evaluating the first two policies, but also to stress test the other components of the CMS. Using the Random policy, the CMS selects swarms for credit mining uniformly at random from the swarms available.
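The two main policies amount to ranking swarms by a key function, as the sketch below shows. This is our own illustration, not Tribler code; the dictionary field names are invented. The Random policy would simply be `random.sample(swarms, limit)`.

```python
import time

def seeder_ratio_key(swarm):
    """SeederRatio: lower seeders/(seeders + leechers) ranks first."""
    peers = swarm["seeders"] + swarm["leechers"]
    return swarm["seeders"] / peers if peers else 1.0

def swarm_age_key(swarm, now=None):
    """SwarmAge: smaller age (newer swarm) ranks first."""
    return (now if now is not None else time.time()) - swarm["created"]

def select_swarms(swarms, key, limit):
    """Pick the `limit` best swarms according to a policy key."""
    return sorted(swarms, key=key)[:limit]

swarms = [
    {"name": "old", "seeders": 9, "leechers": 1, "created": 0},
    {"name": "hot", "seeders": 2, "leechers": 8, "created": 900},
]
print(select_swarms(swarms, seeder_ratio_key, 1)[0]["name"])  # hot
```

Because the policy is just a key function over swarm attributes, swapping SeederRatio for SwarmAge (or any new policy) requires no change to the selection machinery, which mirrors the pluggable-policy design described above.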

6.2.3. Behavior within a swarm
The behavior of the CMS within a swarm is crucial to the success of credit mining. The goal of the CMS when joining a swarm is the same as its global goal: to upload as much as possible while downloading as little as possible, in order to maximize the credit gained. In previous work, we studied the theoretical properties of in-swarm behavior algorithms through a mathematical model [108]. We also identified a heuristic that provides the desired in-swarm behavior. The heuristic, introduced by the Libtorrent

open-source software library [51], downloads the content in a swarm one piece at a time and only downloads additional pieces when it estimates there is enough potential for upload in the swarm. The algorithm is governed by one parameter, the sharing ratio target, which represents the desired ratio between upload and download for swarms in the CMS. A value of 3 for this target—the default in Libtorrent—means that the CMS will only download a piece from a swarm if it estimates it will be able to upload the piece back to at least 3 other downloaders in the swarm. Furthermore, once a piece is downloaded, the CMS waits for it to actually be uploaded 3 times before it proceeds to download other pieces. This condition acts as a fail-safe mechanism for the situation when the upload potential estimation is wrong, and guarantees that in the worst case the loss of the CMS in a swarm is equal to the size of one piece.

A crucial characteristic of the algorithm is its resilience to widespread deployment in the community. If two CMS peers meet, they recognize each other through a flag present in the BitTorrent handshake message. Thus, the two CMS peers do not needlessly upload and download data from each other, preventing the waste of resources.
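The gating condition of the heuristic can be sketched as a single predicate. This is a simplification of the Libtorrent behavior described above, not its actual code; the function name and arguments are ours.

```python
def may_download_next_piece(uploaded_bytes, downloaded_bytes,
                            estimated_uploads, sharing_ratio_target=3):
    """Gate piece downloads: fetch another piece only if (1) the
    estimated upload potential in the swarm meets the target and
    (2) data downloaded so far has been uploaded back at least
    `sharing_ratio_target` times (the fail-safe condition)."""
    if estimated_uploads < sharing_ratio_target:
        return False
    return uploaded_bytes >= sharing_ratio_target * downloaded_bytes

piece = 2**18  # a hypothetical 256 KiB piece
print(may_download_next_piece(0, 0, 5))              # True
print(may_download_next_piece(2 * piece, piece, 5))  # False: ratio 2 < 3
```

The second condition is what bounds the worst-case loss to one piece: a wrong upload estimate stops further downloads until the already-fetched piece has been uploaded back the required number of times.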

6.2.4. Archival mode
Previous work has shown that seeding new files is more advantageous than seeding old files for earning credit in a community [48]. At the same time, users are frequently interested in the long-term availability of certain files. Therefore, we provide a special mode of operation for the CMS, called archival mode. All swarm sources—channels, web feeds, and folders—are compatible with archival mode. When the user adds a new source to the CMS, they can mark it for archival.

When a swarm is in archival mode, the CMS will ensure that it is always seeded by a minimum of two seeders—the minimum necessary to provide fault tolerance. In practice, if the CMS observes that an archival swarm has two or fewer seeders, it downloads and seeds it. Whenever the number of seeders rises above three again, the CMS pauses seeding, while continuing to monitor the swarm through tracker scrapes.

Archival mode swarms generate less credit than the normal swarms in the CMS. However, they always get priority over normal swarms, because the user explicitly asked the CMS to seed them by using archival mode. At the same time, the CMS does not necessarily seed all archival mode swarms at all times. When other peers are seeding, the CMS pauses seeding and starts instead uploading in normal swarms, which have the potential to generate more credit. Consequently, archival mode offers more flexibility than simply having the user seed swarms in Tribler outside of the CMS.
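The archival-mode decision reduces to a small check on the observed seeder count. The thresholds come from the text; the function name and action labels are ours.

```python
MIN_SEEDERS = 2    # replicas the CMS guarantees for archival swarms
RESUME_ABOVE = 3   # pause once the seeder count rises above this

def archival_action(seeders, currently_seeding):
    """Decide the CMS action for one archival swarm, based on the
    seeder count observed through tracker scrapes."""
    if seeders <= MIN_SEEDERS and not currently_seeding:
        return "start-seeding"
    if seeders > RESUME_ABOVE and currently_seeding:
        return "pause-seeding"
    return "no-change"

print(archival_action(1, False))  # start-seeding
print(archival_action(5, True))   # pause-seeding
```

The gap between the two thresholds provides hysteresis: a swarm hovering around two seeders does not cause the CMS to rapidly start and stop seeding.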

6.3. Internet deployment challenges
In this section, we focus on several issues tangential to credit mining that are crucial for a system deployed on the Internet.

Table 6.1: Duplicate content example: ubuntu-14.04-desktop-amd64.iso shared in different swarms because of different piece sizes and private flag usage

Hash                                      Piece size [B]  Private flag
18AC50D74C61883B3AB4C40F5DD3E35F157DE1A2  1 048 576       N/A
4D753474429D817B80FF9E0C441CA660EC5D2450  524 288         N/A
F88ED0C16CF7F452A5C737A0B7503F925E11FE00  524 288         0

6.3.1. Duplicate content detection
In many P2P communities where the addition of new content is open to all users, duplicate content is a problem that leads to a poor distribution of resources [118]. For example, searching for "Ubuntu 14.04 AMD64 torrent" on Google at the time of writing yields several results. Three contain exactly the same content, the Ubuntu ISO file named ubuntu-14.04-desktop-amd64.iso, 1 010 827 264 bytes in size. However, they have different hashes, because they use different piece sizes or needlessly use the BitTorrent private flag, as summarized in Table 6.1.

We design a duplicate content detection subsystem for the CMS that groups files based on readily available metadata, specifically file size and file name. Our design deliberately does not include inspecting the actual data, therefore providing fast results. The subsystem uses the file size information recorded in the torrents as the first criterion for duplicate detection. If two files have the same size, the subsystem compares their names. It uses the Levenshtein distance [119]—a type of edit distance—to decide whether two equally sized files are duplicates or not. Whenever the file names are less than five edits apart, they are considered duplicates.

For every group of duplicates detected, the CMS only uses the swarm with the largest number of seeders for credit mining. This peer-level behavior aligns with the community goal of consolidating swarms, which has been proven to increase performance [115]. However, it may seem counterintuitive at first sight from a credit mining perspective: were the peer to choose the swarm with fewer seeders, the potential for upload would be greater. However, in the long term, the swarm with fewer seeders will disappear from the community first, so the long-term potential for upload is greater for swarms with more seeders. Each peer runs the duplicate detection algorithm whenever a new torrent is added to the CMS.
If the new torrent is a duplicate of an existing torrent, the selection for credit mining may stop the existing torrent if it has fewer seeders. Furthermore, the CMS reevaluates the selection periodically to ensure the torrent with the most seeders is used for credit mining.
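The duplicate test described above can be sketched directly. The Levenshtein implementation below is the standard dynamic program; the function names and dictionaries are illustrative, with the Ubuntu example taken from Table 6.1.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def are_duplicates(file_a, file_b, max_edits=5):
    """Duplicates: equal size and names less than `max_edits` apart."""
    return (file_a["size"] == file_b["size"]
            and levenshtein(file_a["name"], file_b["name"]) < max_edits)

a = {"name": "ubuntu-14.04-desktop-amd64.iso", "size": 1010827264}
b = {"name": "Ubuntu 14.04 desktop amd64.iso", "size": 1010827264}
print(are_duplicates(a, b))  # True: the names differ by 4 substitutions
```

Checking the cheap size criterion first means the quadratic edit-distance computation runs only on the small set of equally sized files.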

6.3.2. Spam detection
Spam, defined as misrepresenting content in order to circumvent established retrieval and ranking techniques, is a pervasive phenomenon in P2P systems [120]. For the CMS implementation, we use an explicit spam detection mechanism based on the votes of users in the Tribler community. Dispersy, the Tribler subsystem that implements channels, enables users to do collaborative spam filtering at the channel level. A specific Dispersy message is created for every vote cast by a user on a channel, classifying the channel as spam or not spam. Each peer adds the Dispersy message representing the user vote to its Bloom filter. Dispersy then distributes the messages such that a global consensus can be formed. Each peer is thus aware of the spam status of channels as resulting from collaborative filtering. The CMS uses this information for detecting spam among the swarms in its sources. If a swarm is present in more channels voted spam than channels voted not spam, the CMS considers it spam and does not take it into consideration for credit mining.
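The channel-vote rule can be sketched as a simple tally. The channel names and vote labels below are illustrative, not Tribler identifiers.

```python
from collections import Counter

def is_spam(swarm_channels, channel_votes):
    """A swarm counts as spam when it appears in more channels
    voted 'spam' than channels voted 'not-spam'."""
    tally = Counter(channel_votes.get(ch) for ch in swarm_channels)
    return tally["spam"] > tally["not-spam"]

votes = {"chanA": "spam", "chanB": "spam", "chanC": "not-spam"}
print(is_spam(["chanA", "chanB", "chanC"], votes))  # True (2 vs. 1)
print(is_spam(["chanB", "chanC"], votes))           # False (a tie)
```

Requiring a strict majority of spam channels means a tie, or a swarm only in unvoted channels, is given the benefit of the doubt and remains eligible for credit mining.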

6.3.3. Updating information
It is important that the information used as input by the CMS swarm selection algorithm is constantly updated to reflect changes in the P2P community. The network traffic generated while updating this information may constitute a non-negligible overhead compared to the upload generated by the CMS. Each peer periodically retrieves updated information from trackers through a standard BitTorrent protocol called scraping.

We design an efficient tracker scraping subsystem that minimizes network traffic. As a result, the tracker load is also minimized, which again represents an alignment of peer- and community-level goals. The main idea is for each peer to group torrents by tracker and make a single scrape request for all torrents in a group. Naturally, there has to be a limit on the number of torrents that are part of a single request, i.e., on the maximum size of the message to be sent over the network. In accordance with the BitTorrent UDP Tracker Protocol, we limit the number of torrents in a single scrape request to 74 [121]. Each peer scrapes trackers at an interval equal to the swarm selection interval, which we experimentally explore in Section 6.5.
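The batching scheme can be sketched as follows. The 74-hash limit comes from the text; the function name and torrent fields are illustrative.

```python
from collections import defaultdict

MAX_SCRAPE_HASHES = 74  # UDP tracker protocol limit per scrape request

def scrape_batches(torrents):
    """Group torrents by tracker, then split each group into
    batches of at most 74 info-hashes (one request per batch)."""
    by_tracker = defaultdict(list)
    for torrent in torrents:
        by_tracker[torrent["tracker"]].append(torrent["infohash"])
    for tracker, hashes in by_tracker.items():
        for i in range(0, len(hashes), MAX_SCRAPE_HASHES):
            yield tracker, hashes[i:i + MAX_SCRAPE_HASHES]

torrents = [{"tracker": "udp://tracker.example", "infohash": f"{i:040x}"}
            for i in range(100)]
batches = list(scrape_batches(torrents))
print([len(hashes) for _, hashes in batches])  # [74, 26]
```

For 100 torrents on one tracker, this sends 2 requests instead of 100, which is the traffic reduction the subsystem aims for.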

6.4. Tribler implementation
From a software engineering perspective, Tribler consists of modules divided into two layers, the core and the graphical user interface, to facilitate the development of alternative user interfaces. Our CMS implementation covers both layers. Like the rest of Tribler, our implementation uses the Python programming language and consists of approximately 2000 lines of changes on top of Tribler, including the unit tests.

We allow for easy extension of the CMS through new pluggable policies. For swarm selection, new policies must implement a comparison function for swarms, which has access to all swarm parameters in Tribler, including the number of seeders and leechers and the creation date, which we used for the policies tested in this chapter.

The CMS can be configured through a configuration file which specifies parameters such as the swarm selection policy or the sharing ratio target. However, we have also implemented a full graphical user interface for the CMS, which is depicted in Figure 6.2. First of all, the user interface allows users to manage the credit mining sources, i.e., add and remove sources. Second, it shows an overview of the credit mining

Figure 6.2: The Tribler credit mining user interface, showing an overview, the list of potential swarms, and one active swarm.

activity, including the total amount of bytes uploaded and downloaded, and the upload and download speed of the CMS. Finally, users can check the list of swarms that are part of the CMS and understand exactly what the CMS is doing. We believe this transparency will encourage people to actively use credit mining in Tribler.

6.5. Evaluation of credit mining in Tribler
In this section, we evaluate the CMS as implemented in Tribler through experiments we conduct over the Internet. We use Tribler instances that we control to connect to real-world swarms we do not control, and record the activity of the CMS. First, we evaluate the CMS configuration parameters and propose a default value for each parameter. We then evaluate the effect of a community-wide CMS deployment by inserting many CMS peers at the same time in a small number of swarms.

6.5.1. Experimental setup
As a source of swarms, we use the RSS web feed of etree.org, a community that shares music with permission from the authors. (In fact, BitTorrent was originally designed to serve this community [122].) The RSS web feed contains the most recent 30 torrents published in the community. New torrents are added at a rate of approximately 20 per day, according to our observations. We check the RSS web feed for new torrents every 30 minutes, the default Tribler interval. At any moment, we credit mine a maximum of 13 swarms simultaneously, the Tribler default for the maximum number of active swarms. In Table 6.2 we summarize the CMS parameters we change during the experiments. These parameters cover both the swarm selection stage—policy and interval—and the in-swarm behavior stage—sharing ratio target. We evaluate each

Table 6.2: Parameters used in the experiments

Parameter                   Values
Swarm selection policy      SeederRatio, SwarmAge, Random
Swarm selection interval    5 min, 15 min, 30 min
Sharing ratio target        1, 3, 5

of the three parameters in turn. For each of the parameters, we run three instances of Tribler, one for each value of the parameter, while the other parameters have a fixed value. The full range of potential values for each parameter is impossible to explore exhaustively using real-world experiments. We limit this experiment to three values per parameter, which we select based on our knowledge gained from designing and implementing the CMS. Throughout the experiments, we use as a performance evaluation metric for the CMS the net credit gain in Tribler, which is equal to the net upload gain in bytes:

net upload gain = uploaded bytes − downloaded bytes

When discussing the efficiency of the CMS, we also refer to the normalized upload gain, which normalizes the net upload gain by the amount of bytes downloaded:

normalized upload gain = net upload gain / downloaded bytes
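Both metrics are straightforward to compute from the byte counters a peer already maintains; a minimal sketch (function names hypothetical):

```python
def net_upload_gain(uploaded_bytes, downloaded_bytes):
    """Credit earned, in bytes: upload contributed minus download consumed."""
    return uploaded_bytes - downloaded_bytes

def normalized_upload_gain(uploaded_bytes, downloaded_bytes):
    """Net upload gain per byte downloaded, i.e., credit mining efficiency."""
    return net_upload_gain(uploaded_bytes, downloaded_bytes) / downloaded_bytes
```

For example, a peer that uploads 4 GB while downloading 1 GB has a net upload gain of 3 GB and a normalized upload gain of 3.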

6.5.2. Choosing a swarm selection policy
In our first experiment, we evaluate the swarm selection policy. Figure 6.3 depicts the evolution over a two-day period of the credit earned by the CMS expressed as net upload gain. The colored dots represent individual swarms while the black line represents the overall result. The swarm selection policy differs between the three subgraphs, while the other two parameters, swarm selection interval and sharing ratio target, are fixed to 5 minutes and 3, respectively. Note that we use a logarithmic vertical axis in Figure 6.3 (as well as in Figures 6.4 and 6.5), to better illustrate the evolution of each swarm. At the beginning of credit mining in a swarm, the net upload gain is always negative, because it is necessary to download at least one piece before uploading. Furthermore, because of transfer errors, some pieces require more than one download attempt, also resulting in a negative upload gain. In order to take the logarithm given these negative values, we apply a translation to the data, adding 100 MB to each swarm and to the total. Analyzing Figure 6.3, we conclude that the SeederRatio policy is the most efficient for credit mining. The credit earned by the CMS during the experiment with the SeederRatio policy, 24.27 GB, is more than twice as high as the 11.25 GB earned with the second best policy, SwarmAge, and more than 20 times as high compared to the 1.12 GB of the worst policy, Random. Following the evolution of individual swarms (identified by color), we notice a similar relative evolution of swarms


Figure 6.3: The net upload gain for the three swarm selection policies. Colored lines represent individual swarms, the black line represents the total. Logarithmic vertical axes with data translated by 100 MB.

in each of the subgraphs; even for the Random policy, we note that the successful swarms are also successful when using the other two policies. The individual swarm analysis also reveals why the SwarmAge policy performs worse than SeederRatio: while the beginning of credit mining is almost identical for both policies, SwarmAge stops mining a swarm when newer swarms appear, while the SeederRatio policy returns to swarms in need of upload throughout the experiment.

6.5.3. Choosing a swarm selection interval
In Figure 6.4, we plot the results of evaluating the second parameter, the swarm selection interval. It is important to note that we set tracker scraping to use the same interval. Intuitively, the smaller the interval, the more up-to-date the CMS information will be, and the more credit should be gained. The other two CMS parameters are set to SeederRatio and 3. Indeed, we observe an inverse correlation between the interval duration and the amount of credit gained, with values 5 min, 15 min, and 30 min producing 7.87 GB, 7.10 GB, and 5.23 GB, respectively. Note that the numbers for credit gained should only be compared among themselves and not with the numbers in Figure 6.3, because the set of swarms used differs. Individual swarm behavior (identified by color) is almost identical across all subgraphs, the only difference being the amount of credit gained for each swarm. Recall that the three Tribler instances in the experiment use the same set of swarms for credit mining. Therefore, by the time swarms appear in the Tribler instance using the 30 min interval, the potential for upload is already largely fulfilled by the other two Tribler instances with shorter intervals. However, even the 30 min interval generates substantial credit, so it may be desirable to use it as a default value, considering the reduced overhead it brings to both the peer and the community tracker.

6.5.4. Choosing a sharing ratio target
Finally, in Figure 6.5 we depict the results of the third experiment, evaluating the last CMS parameter, the sharing ratio target; the other two parameters are set to SeederRatio and 5 minutes. Considering the total net credit gain, setting the sharing ratio target to 1 produces the best results, 15.86 GB, followed by 3, with 10.71 GB, and 5, which produces 6.28 GB. However, if we compute the normalized credit gain, we observe that it is proportional to the sharing ratio target; the normalized credit gains are 1.08, 3.81, and 5.92 for sharing ratio targets 1, 3, and 5, respectively. Using a sharing ratio target of 1 is inefficient (there are only 1.08 GB of upload gained for each GB downloaded), but this lack of efficiency is compensated by an increase in the activity of the CMS: data is transferred in many swarms, which accumulates to an overall high net upload gain. Thus, if we consider the resources used for credit mining (e.g., network traffic, storage space), it is desirable to select a higher sharing ratio target instead of 1. Overall, we conclude that the Libtorrent default of 3 is a good value for the sharing ratio target, resulting in a good balance between the amount of credit gained and resources used.
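Assuming the in-swarm behavior stage stops seeding a mined swarm once its sharing ratio reaches the target (a simplified sketch; the function name is hypothetical), the stopping condition can be expressed as:

```python
def should_keep_seeding(uploaded_bytes, downloaded_bytes, ratio_target=3):
    """Return True while a mined swarm is below its sharing ratio target.

    The default target of 3 matches the Libtorrent default discussed above.
    """
    if downloaded_bytes == 0:
        return True  # nothing downloaded yet, keep the swarm active
    return uploaded_bytes / downloaded_bytes < ratio_target
```

A higher target thus keeps each swarm active longer, trading total activity for per-swarm efficiency, which is the trade-off observed in the experiment.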


Figure 6.4: The net upload gain for the three swarm selection intervals. Colored lines represent individual swarms, the black line represents the total. Logarithmic vertical axes with data translated by 100 MB.


Figure 6.5: The net upload gain for the three sharing ratio targets. Colored lines represent individual swarms, the black line represents the total. Logarithmic vertical axes with data translated by 100 MB.


Figure 6.6: The net upload gain of 20 credit mining peers deployed simultaneously, represented by colored lines. The thick line represents the median and the shaded area represents the 10th to 90th percentile region.

6.5.5. Effect of widespread credit mining

In this fourth experiment, we emulate the widespread deployment of the CMS by running 20 instances of Tribler simultaneously using the same source of swarms, the etree.org RSS web feed. From our observations, most of the swarms have fewer than 5 downloaders, not including our Tribler peers. Thus, during this experiment, credit mining peers constitute a high proportion of the peers in these swarms.

Figure 6.6 illustrates the effect of simultaneously running the 20 credit mining peers, all using the SeederRatio swarm selection policy, a swarm selection interval of 5 minutes, and a sharing ratio target of 3. For clarity, we only plot the overall net upload gain for each peer, without the evolution of individual swarms. Each colored line represents one peer, the thicker line represents the median of all peers, and the shaded area represents the gap between the 10th and 90th percentiles.

Even taking into account the variability of the swarms used as input, we notice a significantly lower net upload gain for the individual credit miners compared to the previous experiments, with a median of 1.20 GB after one day of credit mining. However, there are no catastrophic effects of widespread CMS deployment. In fact, all peers still obtain a positive net upload gain after one day of activity, with a minimum of 1.00 GB. The distance between the 10th percentile and the 90th percentile is only 0.54 GB, showing that credit mining peers have similar chances of contributing to the community.

6.6. Conclusion
The evolution of P2P systems has been tightly coupled to the evolution of incentive mechanisms. Accounting mechanisms based on credit or on sharing ratio are currently a widespread method for incentivizing seeding in BitTorrent communities. In this chapter, we have introduced the CMS, a decentralized system for credit mining that enables honest users to contribute their idle upload bandwidth to the community. Using the CMS is beneficial both for the user, who earns credit, and for the community, where the contributed upload speed results in faster downloads for other users. We have explored the parameters that control the CMS behavior and provided insight into their effect on credit mining efficiency and resource usage. Ultimately, the balance between credit earned and resources used is best left to the user, and we plan to implement in Tribler advanced settings for changing this balance. Furthermore, we have shown that the use of the CMS is reasonable even considering widespread deployment in the community. The behavior of CMS peers within a swarm prevents them from needlessly wasting resources on uploading and downloading from each other. Instead, the CMS peers take turns at providing upload bandwidth to the actual downloaders in the community, thus always obtaining a positive net upload gain.

7. Conclusion

The aim of this thesis has been to analyze user contribution in P2P communities and to introduce a system for increasing it. To this end, we have focused on communities based on the most prevalent P2P protocol, BitTorrent. We have explored the file-sharing use case, while also considering the alternative use case of video streaming. Our research ranges from low-level analytical modeling to global-scale monitoring; from private communities which use sharing ratio enforcement to public communities which use decentralized reputation systems; from trace-based simulation of simple prediction algorithms to the deployment and measurement over the Internet of a complete credit mining system. In this chapter, we present a summary of our work in relation to the research questions we have identified in the introduction, together with suggestions for future work.

7.1. Summary
We address each of the research questions introduced in Chapter 1 in turn and summarize the part of the thesis that provides an answer to the question.

RQ1: How to study the operation of global P2P networks?
We have presented the design of BTWorld, a system that can monitor the global BitTorrent file-sharing network in a cost-effective manner. The key feature of BTWorld is the usage of BitTorrent trackers for obtaining information, instead of contacting users directly. The system allows the inclusion of new trackers in the monitoring list from several sources to ensure comprehensive coverage of the network. Looking at one week of collected data, we have observed a large diversity in tracker sizes, with some trackers serving fewer than 100 users concurrently, while others serve millions. We have noted a high proportion of seeders, with a median of 5.64 seeders for every leecher. Studying the data collected through BTWorld gives insight into phenomena that are difficult to understand without a global view. Almost all giant swarms in our trace (defined as having at least 5000 peers) use a single tracker, which raises availability concerns. Out of the 769 trackers we monitor, we have determined 17% to be spam trackers.

RQ2: How do P2P protocols perform at the community level?
We have extended to the community level previous analyses that had proved BitTorrent to be very effective at utilizing upload bandwidth at the swarm level. We have proposed a model that represents a BitTorrent community as a flow network, and used this model to solve BitTorrent resource allocation problems using graph theory. Through trace-based simulation, we have compared the performance of algorithms used by popular BitTorrent clients with ideal bandwidth allocations obtained in flow network representations of BitTorrent communities. Our research has shown there is room for improvement in the way BitTorrent clients allocate bandwidth between multiple concurrent swarms, especially when considering video streaming, where clients only provide around 35% of the ideal bandwidth allocation.
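As an illustration of the flow network formulation (a toy sketch with hypothetical node names and bandwidth values, not our actual model), the ideal aggregate allocation from seeders to leechers can be computed with a standard max-flow algorithm such as Edmonds-Karp:

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp max-flow on a capacity dict: capacity[u][v] = units."""
    # Build the residual graph, including zero-capacity reverse edges.
    residual = {u: dict(vs) for u, vs in capacity.items()}
    for u, vs in capacity.items():
        for v in vs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Find the bottleneck along the path and push that much flow.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

# Toy community: edges from the virtual source encode seeder upload
# bandwidth, edges to the virtual sink encode leecher download bandwidth,
# and seeder-leecher edges exist where the two share a swarm.
capacity = {
    "S":  {"s1": 10, "s2": 5},   # seeder upload bandwidth
    "s1": {"l1": 10, "l2": 10},  # s1 shares a swarm with both leechers
    "s2": {"l2": 5},             # s2 only shares a swarm with l2
    "l1": {"T": 8},              # leecher download bandwidth
    "l2": {"T": 6},
}
```

In this toy network the ideal allocation is 14 bandwidth units, limited by the leechers' combined download capacity; comparing such an optimum with what deployed clients achieve is the essence of the community-level analysis.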

RQ3: Is it possible to increase the contribution of honest users?
The first problem faced by altruistic users willing to increase their upload bandwidth contribution is selecting swarms with good upload potential. We have introduced a predictor algorithm that ranks swarms according to their future upload potential using only data readily available to BitTorrent peers. Our trace-based simulations show that our algorithm makes correct predictions in 75% of the cases. Based on this upload potential predictor, and on the Libtorrent open-source library, we have built a complete decentralized credit mining system for the Tribler community. Through tests conducted over the Internet with real-world swarms, we have demonstrated that the credit mining system is effective, generating without user intervention 24 GB of upload in two days, which represents sufficient credit to download more than ten average-sized swarms.

RQ4: Can user self-interest be aligned with overall community interest?
To analyze the conflict between user self-interest and overall community interest, we have introduced a model for estimating the impact of increased user upload bandwidth contribution on the average download performance in a community. The model shows that increased contribution does benefit the community in the prevailing situation when the seeder is well provisioned. Furthermore, we have shown in an emulation experiment that adding normal BitTorrent leechers to a swarm when the seeder represents a bottleneck decreases overall performance in the community by approximately 15%. However, adding credit mining peers to the same seeder-bottlenecked swarm does not harm the community. Finally, we have analyzed the potential widespread Internet deployment of our credit mining system. In swarms where the majority of peers were credit mining peers, the minimum contribution increase was 1 GB per day, which represents sufficient credit to download one average-sized swarm. We conclude that widespread deployment of our credit mining system is desirable both for users individually and for the community as a whole.

7.2. Future work
Based on the findings presented in this thesis, we suggest the following topics for future work:

1. User contribution can also benefit other types of functionality offered by P2P communities. Specifically, work is underway in the Tribler research group to establish a Tor-like P2P system for Internet anonymity based on user bandwidth contribution. Currently, Tor provides anonymity through onion routing, but requires the use of centralized directory servers to introduce users to each other. Removing these servers from the architecture would increase the resiliency of the whole system, because currently an adversary can take down the whole system by attacking the centralized servers. One of the most difficult challenges in building a P2P onion routing system is exchanging cryptographic keys between users, a task carried out by the directory servers in Tor.

2. Music and video streaming providers have started using content delivery networks and clouds more than P2P networks in recent years. For example, Spotify and Skype migrated away from P2P towards centralized network architectures, mainly to have better control of the content distribution and to support lawful interception, but also because of the falling price of bandwidth. However, companies are reconsidering the use of P2P networks, thanks to the resiliency they bring in the absence of network neutrality. P2P networks cannot be easily discriminated against (e.g., through bandwidth throttling), because their traffic does not pass through central servers that are easy to identify. Thus, there is room for future work on combining P2P networks with centralized systems in the context of network neutrality.

3. The BTWorld measurement project described in Chapter 2 has been running since 2009 and has collected over 15 TB of data, which we are in the process of analyzing. Combining this data with the results of previous studies conducted by researchers from the Parallel and Distributed Systems group will result in the first comprehensive history of P2P networks, covering over 10 years of activity, legal challenges, technical advances, and the evolution of user contribution.

4. Personal computing is evolving towards mobile and embedded devices. Thanks to increasing IPv6 adoption and manufacturing improvements, more and more devices are connected to the Internet of Things. While this evolution brings advances to many domains, such as healthcare, home automation, and transportation, it also raises serious concerns about privacy and security. In this context, building networks based on P2P architectures and algorithms is key to maintaining privacy and security, and hence autonomy and control for individuals. Furthermore, P2P networks are designed to work in an environment with intermittent connectivity and strict power usage requirements that favors network locality. P2P researchers have a great opportunity to apply their expertise to this emergent field of computer science.

Bibliography

[1] R. Allbery and C. H. Lindsey, “Netnews Architecture and Protocols,” RFC Editor, Internet Requests for Comments RFC 5537, 2009. http://tools.ietf.org/html/rfc5537

[2] J. C. Klensin, “Simple Mail Transfer Protocol,” RFC Editor, Internet Requests for Comments RFC 5321, 2008. http://tools.ietf.org/html/rfc5321

[3] S. Saroiu, K. P. Gummadi, and S. D. Gribble, “Measuring and analyzing the characteristics of Napster and Gnutella hosts,” Multimedia Systems, vol. 9, no. 2, pp. 170–184, Aug. 2003. https://doi.org/10.1007/s00530-003-0088-1

[4] E. Adar and B. A. Huberman, “Free Riding on Gnutella,” First Monday, vol. 5, no. 10, Oct. 2000. https://doi.org/10.5210/fm.v5i10.792

[5] H. Bleul and E. P. Rathgeb, “A Simple, Efficient and Flexible Approach to Measure Multi-protocol Peer-to-Peer Traffic,” in International Conference on Networking (ICN), ser. Lecture Notes in Computer Science, P. Lorenz and P. Dini, Eds. Springer Berlin Heidelberg, 2005, vol. 3421. https://doi.org/10.1007/978-3-540-31957-3_69

[6] N. Leibowitz, M. Ripeanu, and A. Wierzbicki, “Deconstructing the Kazaa network,” in IEEE Workshop on Internet Applications (WIAPP). IEEE, 2003, pp. 112–120. https://doi.org/10.1109/WIAPP.2003.1210295

[7] K. Tutschku, “A Measurement-Based Traffic Profile of the eDonkey Filesharing Service,” in Passive and Active Network Measurement (PAM), ser. Lecture Notes in Computer Science, C. Barakat and I. Pratt, Eds. Springer Berlin Heidelberg, 2004, vol. 3015. https://doi.org/10.1007/978-3-540-24668-8_2

[8] B. Cohen, “Incentives Build Robustness in BitTorrent,” in Workshop on Economics of Peer-to-Peer Systems (P2PECON), Berkeley, California, USA, 2003. http://sims.berkeley.edu/research/conferences/p2pecon/papers/s4-cohen.pdf

[9] B. Cohen, “The BitTorrent Protocol Specification,” BitTorrent extension proposal BEP 3, 2008. http://bittorrent.org/beps/bep_0003.html

[10] A. Parker, “P2P: Threat or opportunity,” in IEEE International Workshop on Web Content Caching and Distribution, 2005. http://2005.iwcw.org/slides/Panel/Andrewwcw2005.zip


[11] H. Schulze and K. Mochalski, “Internet Study 2008/2009,” Ipoque, Tech. Rep., 2009. http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf

[12] Sandvine Incorporated, “Global Internet Phenomena.” https://www.sandvine.com/trends/global-internet-phenomena

[13] “Secure Hash Standard (SHS) (FIPS PUB 180-4),” National Institute of Standards and Technology (NIST), Federal Information Processing Standards Publication, 2012. http://csrc.nist.gov/publications/fips/fips180-4/fips-180-4.pdf

[14] R. M. Axelrod, The Evolution of Cooperation, 2nd ed. Perseus Books Group, 2006.

[15] D. Harrison, “The BitTorrent Enhancement Proposal Process,” BitTorrent extension proposal BEP 1, 2008. http://bittorrent.org/beps/bep_0001.html

[16] John Hoffman, “Multitracker Metadata Extension,” BitTorrent extension proposal BEP 12, 2008. http://bittorrent.org/beps/bep_0012.html

[17] A. Loewenstern and A. Norberg, “DHT Protocol,” BitTorrent extension proposal BEP 5, 2008. http://bittorrent.org/beps/bep_0005.html

[18] M. Meulpolder, L. D’Acunto, M. Capotă, M. Wojciechowski, J. A. Pouwelse, D. H. J. Epema, and H. J. Sips, “Public and private BitTorrent communities: A measurement study,” in International Workshop on Peer-to-Peer Systems (IPTPS). San Jose, California, USA: USENIX, 2010. https://www.usenix.org/conference/iptps-10/public-and-private-bittorrent-communities-measurement-study

[19] D. Harrison, “Private Torrents,” BitTorrent extension proposal BEP 27, 2008. http://bittorrent.org/beps/bep_0027.html

[20] “Tribler.” http://tribler.org

[21] R. Petrocco, J. A. Pouwelse, and D. H. J. Epema, “Performance analysis of the Libswift P2P streaming protocol,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Sep. 2012, pp. 103–114. https://doi.org/10.1109/P2P.2012.6335790

[22] A. Bakker, “Merkle hash torrent extension,” BitTorrent extension proposal BEP 30, 2009. http://bittorrent.org/beps/bep_0030.html

[23] N. Zeilemaker, M. Capotă, A. Bakker, and J. A. Pouwelse, “Tribler: P2P media search and sharing,” in ACM International Conference on Multimedia (MM). ACM Press, 2011, pp. 739–742. https://doi.org/10.1145/2072298.2072433

[24] N. Zeilemaker, B. Schoon, and J. A. Pouwelse, “Large-scale message synchronization in challenged networks,” in ACM Symposium on Applied Computing (SAC). ACM Press, 2014, pp. 481–488. https://doi.org/10.1145/2554850.2554908

[25] J. J. D. Mol, J. A. Pouwelse, M. Meulpolder, D. H. J. Epema, and H. J. Sips, “Give-to-Get: free-riding resilient video-on-demand in P2P systems,” in Multimedia Computing and Networking (MMCN), R. Rejaie and R. Zimmermann, Eds., Jan. 2008. https://doi.org/10.1117/12.774909

[26] J. J. D. Mol, A. Bakker, J. A. Pouwelse, D. H. J. Epema, and H. J. Sips, “The Design and Deployment of a BitTorrent Live Video Streaming Solution,” in IEEE International Symposium on Multimedia (ISM). IEEE, 2009, pp. 342–349. https://doi.org/10.1109/ISM.2009.16

[27] M. Meulpolder, J. A. Pouwelse, D. H. J. Epema, and H. J. Sips, “BarterCast: A practical approach to prevent lazy freeriding in P2P networks,” in IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, May 2009, pp. 1–8. https://doi.org/10.1109/IPDPS.2009.5160954

[28] N. Zeilemaker, M. Capotă, and J. A. Pouwelse, “Open2Edit: A peer-to-peer platform for collaboration,” in IFIP Networking Conference, 2013. http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6663524

[29] B. H. Bloom, “Space/Time Trade-offs in Hash Coding with Allowable Errors,” Communications of the ACM, vol. 13, no. 7, pp. 422–426, Jul. 1970. https://doi.org/10.1145/362686.362692

[30] “P2P-Next.” http://p2p-next.org

[31] A. Bakker, R. Petrocco, M. Dale, J. Gerber, V. Grishchenko, D. Rabaioli, and J. A. Pouwelse, “Online Video Using BitTorrent and HTML5 Applied to Wikipedia,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2010. https://doi.org/10.1109/P2P.2010.5569984

[32] L. D’Acunto, N. Andrade, J. A. Pouwelse, and H. J. Sips, “Peer Selection Strategies for Improved QoS in Heterogeneous BitTorrent-Like VoD Systems,” in IEEE International Symposium on Multimedia (ISM). IEEE, Dec. 2010, pp. 89–96. https://doi.org/10.1109/ISM.2010.22

[33] B. Zhang, A. Iosup, J. A. Pouwelse, and D. H. J. Epema, “Identifying, analyzing, and modeling flashcrowds in BitTorrent,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2011, pp. 240–249. https://doi.org/10.1109/P2P.2011.6038742

[34] L. D’Acunto, T. Vinkó, and H. J. Sips, “Bandwidth allocation in BitTorrent-like VoD systems under flashcrowds,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2011, pp. 192–201. https://doi.org/10.1109/P2P.2011.6038735

[35] R. Petrocco, M. Eberhard, J. A. Pouwelse, and D. H. J. Epema, “Deftpack: A Robust Piece-Picking Algorithm for Scalable Video Coding in P2P Systems,” in IEEE International Symposium on Multimedia (ISM). IEEE, Dec. 2011, pp. 285–292. https://doi.org/10.1109/ISM.2011.52

[36] N. Chiluka, N. Andrade, J. A. Pouwelse, and H. J. Sips, “Leveraging Trust and Distrust for Sybil-Tolerant Voting in Online Social Media,” in Workshop on Privacy and Security in Online Social Media (PSOSM). ACM Press, 2012, pp. 1–8. https://doi.org/10.1145/2185354.2185355

[37] N. Chiluka, N. Andrade, D. Gkorou, and J. A. Pouwelse, “Personalizing EigenTrust in the Face of Communities and Centrality Attack,” in IEEE International Conference on Advanced Information Networking and Applications (AINA). IEEE, Mar. 2012, pp. 503–510. https://doi.org/10.1109/AINA.2012.48

[38] “QLectives.” http://qlectives.eu

[39] R. Delaviz, N. Zeilemaker, J. A. Pouwelse, and D. H. J. Epema, “A Network Science Perspective of a Distributed Reputation Mechanism,” in IFIP Networking Conference, 2013. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6663517

[40] D. Gkorou, “Reducing the History in Decentralized Interaction-Based Reputation Systems,” in IFIP Networking Conference, ser. Lecture Notes in Computer Science, R. Bestak, L. Kencl, L. E. Li, J. Widmer, and H. Yin, Eds. Springer Berlin Heidelberg, 2012, vol. 7290. https://doi.org/10.1007/978-3-642-30054-7_19

[41] A. L. Jia, R. Rahman, T. Vinkó, J. A. Pouwelse, and D. H. J. Epema, “Systemic Risk and User-Level Performance in Private P2P Communities,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 24, no. 12, pp. 2503–2512, Dec. 2013. https://doi.org/10.1109/TPDS.2012.332

[42] R. Delaviz, N. Andrade, J. A. Pouwelse, and D. H. J. Epema, “SybilRes: A sybil-resilient flow-based decentralized reputation mechanism,” in IEEE International Conference on Distributed Computing Systems (ICDCS), 2012, pp. 203–213. https://doi.org/10.1109/ICDCS.2012.28

[43] D. Gkorou, “Exploiting Graph Properties for Decentralized Reputation Systems,” Ph.D. dissertation, Delft University of Technology, 2014. https://doi.org/10.4233/uuid:0fc40431-61ce-4bf6-b028-37c316e3e6b7

[44] BitTorrent, Inc., “BitTorrent and µTorrent Software Surpass 150 Million User Milestone,” 2012. http://www.bittorrent.com/company/about/ces_2012_150m_users

[45] A. Legout, G. Urvoy-Keller, and P. Michiardi, “Rarest first and choke algorithms are enough,” in ACM SIGCOMM Conference on Internet Measurement (IMC). ACM Press, 2006, pp. 203–216. https://doi.org/10.1145/1177080.1177106

[46] C. Zhang, P. Dhungel, D. Wu, and K. W. Ross, “Unraveling the BitTorrent Ecosystem,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 22, no. 7, pp. 1164–1177, Jul. 2011. https://doi.org/10.1109/TPDS.2010.123

[47] A. L. Jia, R. Rahman, T. Vinkó, J. A. Pouwelse, and D. H. J. Epema, “Fast download but eternal seeding: The reward and punishment of Sharing Ratio Enforcement,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2011, pp. 280–289. https://doi.org/10.1109/P2P.2011.6038746

[48] I. A. Kash, J. K. Lai, H. Zhang, and A. Zohar, “Economics of BitTorrent communities,” in International Conference on World Wide Web (WWW). ACM Press, 2012, p. 221. https://doi.org/10.1145/2187836.2187867

[49] D. Qiu and R. Srikant, “Modeling and performance analysis of BitTorrent-like peer-to-peer networks,” ACM SIGCOMM Computer Communication Review, vol. 34, no. 4, p. 367, Oct. 2004. https://doi.org/10.1145/1030194.1015508

[50] L. Guo, S. Chen, Z. Xiao, E. Tan, X. Ding, and X. Zhang, “Measurements, analysis, and modeling of BitTorrent-like systems,” in ACM SIGCOMM Conference on Internet Measurement (IMC). Berkeley, California, USA: USENIX, 2005. https://www.usenix.org/conference/imc-05/measurements-analysis-and-modeling-bittorrent-systems

[51] A. Norberg, “Libtorrent.” http://libtorrent.org

[52] M. Wojciechowski, M. Capotă, J. A. Pouwelse, and A. Iosup, “BTWorld: Towards Observing the Global BitTorrent File-Sharing Network,” in Workshop on Large-Scale System and Application Performance (LSAP), in conjunction with ACM International Symposium on High Performance Distributed Computing (HPDC). ACM Press, 2010, p. 581. https://doi.org/10.1145/1851476.1851562

[53] J. A. Pouwelse, P. J. Garbacki, D. H. J. Epema, and H. J. Sips, “The Bittorrent P2P File-Sharing System: Measurements and Analysis,” in International Workshop on Peer-to-Peer Systems (IPTPS), ser. Lecture Notes in Computer Science, M. Castro and R. van Renesse, Eds., vol. 3640. Springer Berlin Heidelberg, 2005. https://doi.org/10.1007/11558989_19

[54] M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. Al Hamra, and L. Garcés-Erice, “Dissecting BitTorrent: Five Months in a Torrent’s Lifetime,” in Passive and Active Network Measurement (PAM), ser. Lecture Notes in Computer Science, C. Barakat and I. Pratt, Eds. Springer Berlin Heidelberg, 2004, vol. 3015. https://doi.org/10.1007/978-3-540-24668-8_1

[55] A. Iosup, P. J. Garbacki, J. A. Pouwelse, and D. H. J. Epema, “Correlating Topology and Path Characteristics of Overlay Networks and the Internet,” in IEEE International Symposium on Cluster Computing and the Grid (CCGrid) International Workshop on Global and Peer-to-Peer Computing (GP2P). IEEE, 2006. https://doi.org/10.1109/CCGRID.2006.1630905

[56] D. A. Patterson, D. D. Clark, A. Karlin, J. Kurose, E. D. Lazowska, D. Liddle, D. McAuley, V. Paxson, S. Savage, and E. W. Zegura, “Looking Over the Fence at Networks: A Neighbor’s View of Networking Research,” Committee on Research Horizons in Networking, Computer Science and Telecommunications Board, National Research Council, Tech. Rep., 2001.

[57] G. Siganos, J. M. Pujol, and P. Rodriguez, “Monitoring the Bittorrent Monitors: A Bird’s Eye View,” in Passive and Active Network Measurement (PAM), ser. Lecture Notes in Computer Science, S. B. Moon, R. Teixeira, and S. Uhlig, Eds. Springer Berlin Heidelberg, 2009, vol. 5448. https://doi.org/10.1007/978-3-642-00975-4_18

[58] J. S. Otto, M. A. Sánchez, D. R. Choffnes, F. E. Bustamante, and G. Siganos, “On Blind Mice and the Elephant: Understanding the Network Impact of a Large Distributed System,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 110–121, Aug. 2011. https://doi.org/10.1145/2043164.2018450

[59] B. Zhang, A. Iosup, J. A. Pouwelse, D. H. J. Epema, and H. J. Sips, “Sampling Bias in BitTorrent Measurements,” in Euro-Par, ser. Lecture Notes in Computer Science, P. D’Ambra, M. Guarracino, and D. Talia, Eds. Springer Berlin Heidelberg, 2010, vol. 6271, pp. 484–496. https://doi.org/10.1007/978-3-642-15277-1_46

[60] C. Shannon, D. Moore, K. Keys, M. Fomenkov, B. Huffaker, and K. Claffy, “The Internet Measurement Data Catalog,” ACM SIGCOMM Computer Communication Review, vol. 35, no. 3, p. 97, Oct. 2005. https://doi.org/10.1145/1096536.1096552

[61] J. Yeo, D. Kotz, and T. Henderson, “CRAWDAD: A Community Resource for Archiving Wireless Data at Dartmouth,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp. 21–22, Apr. 2006. https://doi.org/10.1145/1129582.1129588

[62] A. G. Miklas, S. Saroiu, A. Wolman, and A. K. Demke Brown, “Bunker: A Privacy-Oriented Platform for Network Tracing,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX Association, 2009, pp. 29–42. https://www.usenix.org/conference/nsdi-09/bunker-privacy-oriented-platform-network-tracing

[63] N. Andrade, E. Santos-Neto, F. Brasileiro, and M. Ripeanu, “Resource demand and supply in BitTorrent content-sharing communities,” Computer Networks, vol. 53, no. 4, pp. 515–527, Mar. 2009. https://doi.org/10.1016/j.comnet.2008.09.029

[64] D. Wu, P. Dhungel, X. Hei, C. Zhang, and K. W. Ross, “Understanding Peer Exchange in BitTorrent Systems,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2010. https://doi.org/10.1109/P2P.2010.5569967

[65] G. Hazel and A. Norberg, “Extension for Peers to Send Metadata Files,” BitTorrent extension proposal BEP 9, 2008. http://bittorrent.org/beps/bep_0009.html

[66] G. Neglia, G. Reina, H. Zhang, D. Towsley, A. Venkataramani, and J. Danaher, “Availability in BitTorrent Systems,” in IEEE International Conference on Computer Communications (INFOCOM). IEEE, 2007, pp. 2216–2224. https://doi.org/10.1109/INFCOM.2007.256

[67] S. Sen and J. Wang, “Analyzing Peer-To-Peer Traffic Across Large Networks,” IEEE/ACM Transactions on Networking (TON), vol. 12, no. 2, pp. 219–232, Apr. 2004. https://doi.org/10.1109/TNET.2004.826277

[68] T. Karagiannis, A. Broido, M. Faloutsos, and K. C. Claffy, “Transport layer identification of P2P traffic,” in ACM SIGCOMM conference on Internet measurement (IMC). ACM Press, 2004. https://doi.org/10.1145/1028788.1028804

[69] J. Falkner, M. Piatek, J. P. John, A. Krishnamurthy, and T. Anderson, “Profiling a Million User DHT,” in ACM SIGCOMM conference on Internet measurement (IMC). San Diego, California, USA: ACM Press, 2007. https://doi.org/10.1145/1298306.1298325

[70] D. R. Choffnes and F. E. Bustamante, “Taming the Torrent: A Practical Approach to Reducing Cross-ISP Traffic in Peer-to-Peer Systems,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 4, pp. 363–374, Oct. 2008. https://doi.org/10.1145/1402946.1403000

[71] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, “A Comparison of Approaches to Large-Scale Data Analysis,” in International Conference on management of data (SIGMOD). ACM Press, 2009. https://doi.org/10.1145/1559845.1559865

[72] Z. Gyöngyi and H. Garcia-Molina, “Web Spam Taxonomy,” in International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005. http://airweb.cse.lehigh.edu/2005/gyongyi.pdf

[73] M. Piatek, T. Isdal, T. Anderson, A. Krishnamurthy, and A. Venkataramani, “Do Incentives Build Robustness in BitTorrent?” in USENIX Symposium on Networked Systems Design and Implementation (NSDI). USENIX, 2007. https://www.usenix.org/conference/nsdi-07/do-incentives-build-robustness-bittorrent

[74] D. S. Menasche, A. A. A. Rocha, B. Li, D. Towsley, and A. Venkataramani, “Content Availability and Bundling in Swarming Systems,” in International conference on Emerging networking experiments and technologies (CoNEXT). ACM Press, 2009. https://doi.org/10.1145/1658939.1658954

[75] G. Dán and N. Carlsson, “Dynamic Swarm Management for Improved BitTorrent Performance,” in International Workshop on Peer-to-Peer Systems (IPTPS). USENIX, 2009. https://www.usenix.org/conference/iptps-09/dynamic-swarm-management-improved-bittorrent-performance

[76] G. Dán and N. Carlsson, “Power-law Revisited: A Large Scale Measurement Study of P2P Content Popularity,” in International Workshop on Peer-to-Peer Systems (IPTPS). San Jose, California, USA: USENIX, 2010. https://www.usenix.org/conference/iptps-10/power-law-revisited-large-scale-measurement-study-p2p-content-popularity

[77] M. Ripeanu, A. Iamnitchi, and I. Foster, “Mapping the Gnutella network,” IEEE Internet Computing, vol. 6, no. 1, pp. 50–57, 2002. https://doi.org/10.1109/4236.978369

[78] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger, “On Unbiased Sampling for Unstructured Peer-to-Peer Networks,” IEEE/ACM Transactions on Networking (TON), vol. 17, no. 2, pp. 377–390, Apr. 2009. https://doi.org/10.1109/TNET.2008.2001730

[79] F. Le Fessant, S. B. Handurukande, A.-M. Kermarrec, and L. Massoulié, “Clustering in Peer-to-Peer Workloads,” in International Workshop on Peer-to-Peer Systems (IPTPS), ser. Lecture Notes in Computer Science, G. M. Voelker and S. Shenker, Eds. Springer Berlin Heidelberg, 2004, vol. 3279. https://doi.org/10.1007/978-3-540-30183-7_21

[80] R. Bhagwan, S. Savage, and G. M. Voelker, “Understanding Availability,” in International Workshop on Peer-to-Peer Systems (IPTPS), ser. Lecture Notes in Computer Science, M. F. Kaashoek and I. Stoica, Eds. Springer Berlin Heidelberg, 2003, vol. 2735. https://doi.org/10.1007/978-3-540-45172-3_24

[81] M. Steiner, T. En-Najjary, and E. W. Biersack, “Long Term Study of Peer Behavior in the KAD DHT,” IEEE/ACM Transactions on Networking (TON), vol. 17, no. 5, pp. 1371–1384, Oct. 2009. https://doi.org/10.1109/TNET.2008.2009053

[82] B. Zhang, A. Iosup, J. A. Pouwelse, and D. H. J. Epema, “The Peer-to-Peer Trace Archive: Design and Comparative Trace Analysis,” in ACM CoNEXT Student Workshop. ACM Press, 2010. https://doi.org/10.1145/1921206.1921229

[83] M. Capotă, N. Andrade, T. Vinkó, F. Santos, J. A. Pouwelse, and D. H. J. Epema, “Inter-swarm resource allocation in BitTorrent communities,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2011, pp. 300–309. https://doi.org/10.1109/P2P.2011.6038748

[84] M. Meulpolder, J. A. Pouwelse, D. H. J. Epema, and H. J. Sips, “Modeling and analysis of bandwidth-inhomogeneous swarms in BitTorrent,” in IEEE International Conference on Peer-to-Peer Computing (P2P). Seattle, Washington, USA: IEEE, Sep. 2009, pp. 232–241. https://doi.org/10.1109/P2P.2009.5284523

[85] A. Legout, N. Liogkas, E. Kohler, and L. Zhang, “Clustering and sharing incentives in BitTorrent systems,” in ACM International conference on measurement and modeling of computer systems (SIGMETRICS). ACM Press, 2007, p. 301. https://doi.org/10.1145/1254882.1254919

[86] A. V. Goldberg and R. E. Tarjan, “A new approach to the maximum-flow problem,” Journal of the ACM, vol. 35, no. 4, pp. 921–940, Oct. 1988. https://doi.org/10.1145/48014.61051

[87] Mosek ApS, “MOSEK.” http://mosek.com

[88] B. Radunović and J.-Y. Le Boudec, “A Unified Framework for Max-Min and Min-Max Fairness With Applications,” IEEE/ACM Transactions on Networking (TON), vol. 15, no. 5, pp. 1073–1083, Oct. 2007. https://doi.org/10.1109/TNET.2007.896231

[89] D. Bertsekas and R. Gallager, Data Networks, 2nd ed. Upper Saddle River, New Jersey, USA: Prentice-Hall, Inc., 1992.

[90] T. Isdal, M. Piatek, A. Krishnamurthy, and T. Anderson, “Leveraging BitTorrent for End Host Measurements,” in Passive and Active Network Measurement (PAM), ser. Lecture Notes in Computer Science, S. Uhlig, K. Papagiannaki, and O. Bonaventure, Eds. Springer Berlin Heidelberg, 2007, vol. 4427, pp. 32–41. https://doi.org/10.1007/978-3-540-71617-4_4

[91] R. J. Dunn, S. D. Gribble, and H. M. Levy, “The Importance of History in a Media Delivery System,” in International Workshop on Peer-to-Peer Systems (IPTPS), 2007. http://www.iptps.org/papers.html

[92] R. S. Peterson and E. G. Sirer, “Antfarm: efficient content distribution with managed swarms,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI). Boston, Massachusetts, USA: USENIX, 2009, pp. 107–122. https://www.usenix.org/conference/nsdi-09/antfarm-efficient-content-distribution-managed-swarms

[93] M. Capotă, N. Andrade, J. A. Pouwelse, and D. H. J. Epema, “Investment strategies for credit-based P2P communities,” in Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP). IEEE, Feb. 2013, pp. 437–443. https://doi.org/10.1109/PDP.2013.70

[94] Z. Liu, P. Dhungel, D. Wu, C. Zhang, and K. W. Ross, “Understanding and Improving Ratio Incentives in Private Communities,” in IEEE International Conference on Distributed Computing Systems (ICDCS). Genoa, Italy: IEEE, Jun. 2010, pp. 610–621. https://doi.org/10.1109/ICDCS.2010.90

[95] A. Norberg, G. Hazel, and A. Grunthal, “Extension for partial seeds,” BitTorrent extension proposal BEP 21, 2008. http://bittorrent.org/beps/bep_0021.html

[96] J. H. Friedman, “Multivariate Adaptive Regression Splines,” The Annals of Statistics, vol. 19, no. 1, pp. 1–67, Mar. 1991. https://doi.org/10.1214/aos/1176347963

[97] R. Rahman, T. Vinkó, D. Hales, J. A. Pouwelse, and H. J. Sips, “Design space analysis for modeling incentives in distributed systems,” ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 182–193, Oct. 2011. https://doi.org/10.1145/2018436.2018458

[98] M. Piatek, T. Isdal, A. Krishnamurthy, and T. Anderson, “One hop reputations for peer to peer file sharing workloads,” in USENIX Symposium on Networked Systems Design and Implementation (NSDI). San Francisco, California, USA: USENIX, 2008, pp. 1–14. https://www.usenix.org/conference/nsdi-08/one-hop-reputations-peer-peer-file-sharing-workloads

[99] J. Terrace, H. Laidlaw, H. E. Liu, S. Stern, and M. J. Freedman, “Bringing P2P to the Web: Security and Privacy in the Firecoral Network,” in International Workshop on Peer-to-Peer Systems (IPTPS). USENIX, 2009. https://www.usenix.org/conference/iptps-09/bringing-p2p-web-security-and-privacy-firecoral-network

[100] S. Iyer, A. Rowstron, and P. Druschel, “Squirrel: A decentralized peer-to-peer web cache,” in Symposium on Principles of distributed computing (PODC). ACM Press, 2002, p. 213. https://doi.org/10.1145/571825.571861

[101] A. Wierzbicki, N. Leibowitz, M. Ripeanu, and R. Wozniak, “Cache Replacement Policies Revisited: The Case of P2P Traffic,” in IEEE International Symposium on Cluster Computing and the Grid (CCGrid). Chicago, Illinois, USA: IEEE, 2004, pp. 182–189. https://doi.org/10.1109/CCGrid.2004.1336565

[102] M. Hefeeda and O. Saleh, “Traffic Modeling and Proportional Partial Caching for Peer-to-Peer Systems,” IEEE/ACM Transactions on Networking (TON), vol. 16, no. 6, pp. 1447–1460, Dec. 2008. https://doi.org/10.1109/TNET.2008.918081

[103] D. Wu, Y. Liu, and K. W. Ross, “Modeling and Analysis of Multichannel P2P Live Video Systems,” IEEE/ACM Transactions on Networking (TON), vol. 18, no. 4, pp. 1248–1260, Aug. 2010. https://doi.org/10.1109/TNET.2009.2038910

[104] N. Carlsson, D. L. Eager, and A. Mahanti, “Using Torrent Inflation to Efficiently Serve the Long Tail in Peer-Assisted Content Delivery Systems,” in IFIP Networking Conference, ser. Lecture Notes in Computer Science, M. Crovella, L. M. Feeney, D. Rubenstein, and S. V. Raghavan, Eds. Springer Berlin Heidelberg, 2010, vol. 6091, ch. 1, pp. 1–14. https://doi.org/10.1007/978-3-642-12963-6_1

[105] P. J. Garbacki, D. H. J. Epema, and M. van Steen, “An Amortized Tit-For-Tat Protocol for Exchanging Bandwidth instead of Content in P2P Networks,” in International Conference on Self-Adaptive and Self-Organizing Systems (SASO). Cambridge, Massachusetts, USA: IEEE, Jul. 2007, pp. 119–128. https://doi.org/10.1109/SASO.2007.9

[106] D. Nie, Q. Ma, L. Ma, and W. Tan, “Predicting Peer Offline Probability in BitTorrent Using Nonlinear Regression,” in IFIP International Conference on Entertainment Computing (ICEC), ser. Lecture Notes in Computer Science, L. Ma, M. Rauterberg, and R. Nakatsu, Eds. Springer Berlin Heidelberg, 2007, vol. 4740, ch. 40, pp. 339–344. https://doi.org/10.1007/978-3-540-74873-1_40

[107] A. H. Rasti and R. Rejaie, “Understanding Peer-level Performance in BitTorrent: A Measurement Study,” in International Conference on Computer Communications and Networks (ICCCN). Honolulu, Hawaii, USA: IEEE, Aug. 2007, pp. 109–114. https://doi.org/10.1109/ICCCN.2007.4317805

[108] M. Capotă, J. A. Pouwelse, and D. H. J. Epema, “Towards a Peer-to-Peer Bandwidth Marketplace,” in International Conference on Distributed Computing and Networking (ICDCN), M. Chatterjee, J.-n. Cao, K. Kothapalli, and S. Rajsbaum, Eds. Coimbatore, Tamil Nadu, India: Springer Berlin Heidelberg, 2014, pp. 302–316. https://doi.org/10.1007/978-3-642-45249-9_20

[109] A. L. Jia, X. Chen, X. Chu, J. A. Pouwelse, and D. H. J. Epema, “How to Survive and Thrive in a Private BitTorrent Community,” in Distributed Computing and Networking (ICDCN), ser. Lecture Notes in Computer Science, D. Frey, M. Raynal, S. Sarkar, R. K. Shyamasundar, and P. Sinha, Eds. Springer Berlin Heidelberg, 2013, vol. 7730, pp. 270–284. https://doi.org/10.1007/978-3-642-35668-1_19

[110] L. Guo, S. Chen, Z. Xiao, E. Tan, X. Ding, and X. Zhang, “A performance study of BitTorrent-like peer-to-peer systems,” IEEE Journal on Selected Areas in Communications (JSAC), vol. 25, no. 1, pp. 155–169, Jan. 2007. https://doi.org/10.1109/JSAC.2007.070116

[111] S. Katti, D. Katabi, C. Blake, E. Kohler, and J. Strauss, “MultiQ: Automated Detection of Multiple Bottleneck Capacities Along a Path,” in ACM SIGCOMM conference on Internet measurement (IMC). ACM Press, 2004, p. 245. https://doi.org/10.1145/1028788.1028820

[112] T. M. Schaap, “P2P Testing Framework.” https://github.com/schaap/p2p-testframework

[113] Z. Chen, Y. Chen, C. Lin, V. Nivargi, and P. Cao, “Experimental Analysis of Super-Seeding in BitTorrent,” in IEEE International Conference on Communications (ICC). IEEE, 2008, pp. 65–69. https://doi.org/10.1109/ICC.2008.20

[114] C. Aperjis, M. J. Freedman, and R. Johari, “Peer-assisted content distribution with prices,” in ACM International Conference On Emerging Networking Experiments And Technologies (CoNEXT). ACM Press, 2008, pp. 1–12. https://doi.org/10.1145/1544012.1544029

[115] G. Dán and N. Carlsson, “Centralized and Distributed Protocols for Tracker-Based Dynamic Swarm Management,” IEEE/ACM Transactions on Networking (TON), vol. 21, no. 1, pp. 297–310, Feb. 2013. https://doi.org/10.1109/TNET.2012.2198491

[116] M. Capotă, J. A. Pouwelse, and D. H. J. Epema, “Decentralized credit mining in P2P systems,” in IFIP Networking Conference, 2015.

[117] M. Capotă, N. Andrade, J. A. Pouwelse, and D. H. J. Epema, “Investment strategies for credit-based P2P communities,” Delft University of Technology, Tech. Rep., 2014. http://www.pds.ewi.tudelft.nl/fileadmin/pds/reports/2014/PDS-2014-005.pdf

[118] O. Papapetrou, S. Ramesh, S. Siersdorfer, and W. Nejdl, “Optimizing Near Duplicate Detection for P2P Networks,” in IEEE International Conference on Peer-to-Peer Computing (P2P). IEEE, Aug. 2010, pp. 1–10. https://doi.org/10.1109/P2P.2010.5570001

[119] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.

[120] D. Jia, W. G. Yee, and O. Frieder, “Spam characterization and detection in peer-to-peer file-sharing systems,” in ACM Conference on Information and Knowledge Management (CIKM). ACM Press, 2008, p. 329. https://doi.org/10.1145/1458082.1458128

[121] O. van der Spek, “UDP Tracker Protocol for BitTorrent,” BitTorrent extension proposal BEP 15, 2008. http://bittorrent.org/beps/bep_0015.html

[122] K. Peterson, “BitTorrent file-sharing program floods the Web,” Seattle Times, 2005. http://seattletimes.com/html/businesstechnology/2002146729_bittorrent10.html

Summary

Peer-to-peer (P2P) communities enable users to share with each other photos, videos, and other files over the Internet without any centralized control or infrastructure. Compared to the traditional client-server architecture, the P2P architecture utilized by these communities enables them to scale better and be more resilient when faced with network failures and attacks. At the same time, the operation of P2P communities depends on the contribution of computer resources by users, most importantly storage and bandwidth. Some free-riding users take resources from the community without contributing back, for example by downloading files without uploading. Many P2P communities try to stop free riding by employing security measures, but in doing so, they create a barrier to entry for honest users. Additionally, the temporal variability in the popularity of files, coupled with the asymmetric upload to download bandwidth ratios typical of residential Internet connections, further restricts user contribution.

In this thesis, we conduct empirical and theoretical studies of user contribution in P2P communities and design, implement, and test a system to help honest users overcome barriers to entry and increase their contribution. We identify four research questions that we answer in the course of solving the user contribution problem.

In Chapter 2 we answer the first research question, RQ1: How to study the operation of global P2P networks? The decentralized nature of P2P networks makes it particularly hard to study them. We propose BTWorld, a system for observing the global BitTorrent network, the most popular type of P2P network. In BTWorld, we do not record the activity of individual users, which is prohibitive on a global scale; instead, we record the activity of the servers that intermediate the communication between users.
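To give a flavor of this server-side measurement, the sketch below parses a tracker "scrape" response, which in the conventional format is a bencoded dictionary mapping each infohash to aggregate swarm statistics (seeders, leechers, completed downloads), so a monitor never observes individual users. This is a hypothetical minimal illustration, not the actual BTWorld code:

```python
# Minimal bencode decoder, sufficient for a tracker scrape response.
# Bencode has four types: integers (i42e), byte strings (4:spam),
# lists (l...e), and dictionaries (d...e).

def _decode(data: bytes, i: int):
    """Decode one bencoded value at offset i; return (value, next offset)."""
    c = data[i:i + 1]
    if c == b"i":                        # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                        # list: l<items>e
        i, items = i + 1, []
        while data[i:i + 1] != b"e":
            item, i = _decode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                        # dictionary: d<key><value>...e
        i, d = i + 1, {}
        while data[i:i + 1] != b"e":
            key, i = _decode(data, i)
            d[key], i = _decode(data, i)
        return d, i + 1
    colon = data.index(b":", i)          # byte string: <length>:<bytes>
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length

def bdecode(data: bytes):
    value, _ = _decode(data, 0)
    return value

# A scrape response maps "files" to {infohash: {"complete": seeders,
# "downloaded": total completions, "incomplete": leechers}}.
# The 20-byte infohash here is a made-up placeholder.
scrape = b"d5:filesd20:AAAAAAAAAAAAAAAAAAAAd8:completei10e10:downloadedi50e10:incompletei5eeee"
stats = bdecode(scrape)[b"files"][b"A" * 20]
print(stats[b"complete"], stats[b"incomplete"])  # seeders and leechers
```

Periodically collecting such per-swarm records from many trackers yields a global, user-free view of BitTorrent activity.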
This design choice enables us to collect sufficient information for characterizing the performance and reliability of BitTorrent using only a few monitoring machines. Furthermore, we shed light on the problem of spam trackers and on the availability problem of giant swarms, both previously undocumented BitTorrent phenomena.

In Chapter 3 we answer RQ2: How do P2P protocols perform at the community level? A wealth of previous work showed that the performance of the BitTorrent P2P protocol is close to ideal at the swarm level. We examine the community-level performance of BitTorrent, taking into account the simultaneous participation of peers in multiple swarms. We propose a mathematical model in which we represent BitTorrent communities as flow networks and compute performance bounds for P2P protocols using graph-theory techniques. Through simulation, we compare the performance of BitTorrent to the bounds given by our model. Considering the popular use case of video streaming, we show that BitTorrent’s performance is three times lower than the theoretical bound.

In Chapter 4 we answer RQ3: Is it possible to increase the contribution of honest
users? The mismatch between supply and demand in P2P communities has been thoroughly documented in previous work. Users are often idle in P2P communities, with no other users interested in the resources they offer. We propose a method for predicting demand in P2P communities using only the information available to each user in the community. We show through trace-based simulation that we can correctly rank swarms in a BitTorrent community according to their future demand.

In Chapter 5 we answer RQ4: Can user self-interest be aligned with overall community interest? The self-interest of free-riding users is clearly not aligned with overall community interest. If all users were free riders, no one would contribute and the service level in the community would collapse. We investigate whether the behavior of honest, non-free-riding users can also have negative effects for the community as a whole. We propose a model for assessing the effect of increased user upload on the average download speed in the community. We show that an algorithm specifically designed for increasing user contribution always has a positive effect on the community.

In Chapter 6 we design, implement, and test a system based on ideas from the preceding chapters that automatically and autonomously increases user contribution in the Tribler P2P community. Our Credit Mining System (CMS) selects from a whitelist provided by the user the swarms with the highest predicted demand. The CMS participates in these swarms with an algorithm designed to maximize the ratio of uploaded to downloaded data. We show with Internet tests that the CMS is effective at increasing user contribution and has a positive effect on the service level in the community.

Samenvatting

Peer-to-peer (P2P) gemeenschappen laten gebruikers foto’s, video’s en andere bestanden met elkaar delen via het internet, zonder enige gecentraliseerde controle of infrastructuur. Vergeleken met de traditionele client-server-architectuur laat de P2P architectuur gebruikt door deze gemeenschappen ze beter schalen en zorgt ervoor dat ze bestand zijn tegen fouten en aanvallen. Tegelijkertijd vereist de werking van een P2P gemeenschap de inzet van computerhulpbronnen van zijn gebruikers, vooral met betrekking tot opslag en bandbreedte. Sommige gebruikers nemen middelen van de gemeenschap zonder terug te geven, bijvoorbeeld door het downloaden van bestanden zonder te uploaden. Veel P2P gemeenschappen proberen dit te voorkomen door het gebruik van veiligheidsmaatregelen, maar daarmee creëren ze een toetredingsdrempel voor eerlijke gebruikers. Bovendien beperkt de tijdsafhankelijkheid van de populariteit van bestanden in combinatie met de asymmetrische verhouding tussen de upload- en download-bandbreedte die thuis-internetverbindingen kenmerkt, de bijdrage van gebruikers.

In dit proefschrift onderzoeken we proefondervindelijk en theoretisch de bijdrage van gebruikers in P2P gemeenschappen, en ontwerpen, implementeren, en testen we een systeem dat belemmeringen voor eerlijke gebruikers wegneemt en zodoende hun bijdrage verhoogt. We identificeren vier onderzoeksvragen die we beantwoorden bij het oplossen van het probleem van de bijdrage van gebruikers.

In Hoofdstuk 2 beantwoorden we de eerste onderzoeksvraag, RQ1: Hoe kun je de werking van globaal opererende P2P-netwerken bestuderen? Het decentrale karakter van P2P-netwerken maakt het bijzonder moeilijk om ze te bestuderen. Wij stellen BTWorld voor, een systeem geschikt om het wereldwijde BitTorrent-netwerk te observeren, het meest populaire type P2P-netwerk.
In BTWorld meten we niet de activiteit van een individuele gebruiker, dat is onmogelijk op een wereldwijde schaal, maar meten we in plaats daarvan de activiteit van de servers die de communicatie tussen de gebruikers mogelijk maken. Door deze ontwerpkeuze kunnen wij voldoende informatie verzamelen om de karakteristieken, prestaties, en betrouwbaarheid van BitTorrent te meten met behulp van slechts enkele observatie-machines. Verder laten we licht schijnen op het probleem van spam trackers en op het beschikbaarheidsprobleem in reuze-swarms, beide nog niet eerder beschreven verschijnselen in BitTorrent.

In Hoofdstuk 3 beantwoorden we RQ2: Hoe presteren P2P-protocollen op gemeenschapsniveau? Uit eerder werk is gebleken dat de prestaties van het BitTorrent P2P-protocol vrijwel ideaal zijn op swarmniveau. We onderzoeken de prestaties van BitTorrent op gemeenschapsniveau, rekening houdend met de gelijktijdige deelname van peers in meerdere swarms. We formuleren een wiskundig model waarin we BitTorrent gemeenschappen uitdrukken in een stroomnetwerk, en we berekenen de prestatiegrenzen van P2P-protocollen met behulp van graaf-theoretische technieken. Door middel van simulatie vergelijken we de prestaties van BitTorrent met de grenzen die door ons model worden gevonden. In het populaire geval van video streaming tonen we aan dat de prestaties van BitTorrent drie keer lager liggen dan de theoretische grens.

In Hoofdstuk 4 beantwoorden we RQ3: Is het mogelijk om de bijdrage van eerlijke gebruikers te verhogen? In P2P gemeenschappen is de mismatch tussen vraag en aanbod uitvoerig beschreven in eerder werk. Gebruikers zijn vaak inactief in P2P gemeenschappen omdat geen enkele andere gebruiker geïnteresseerd is in de middelen die ze aanbieden. We stellen een methode voor die de vraag in P2P gemeenschappen kan voorspellen en daarvoor alleen de informatie nodig heeft die beschikbaar is voor elke gebruiker. We tonen door middel van trace-gebaseerde simulaties aan dat we de swarms in een BitTorrent gemeenschap correct kunnen sorteren op basis van hun toekomstige vraag.

In Hoofdstuk 5 beantwoorden we RQ4: Kan het eigenbelang van gebruikers in overeenstemming worden gebracht met het gemeenschappelijk belang? Het eigenbelang van free-riders ligt duidelijk niet in lijn met het algemene belang. Als alle gebruikers free-riders waren, zou niemand bijdragen en zou het serviceniveau in de gemeenschap instorten. We onderzoeken of het gedrag van eerlijke, non-free-riding gebruikers ook negatieve gevolgen kan hebben voor de gemeenschap als geheel. We stellen een model voor dat het effect van het verhogen van de upload van gebruikers bepaalt op de gemiddelde downloadsnelheid in de gemeenschap. We tonen aan dat een algoritme specifiek voor het verhogen van de bijdrage van gebruikers altijd een positief effect heeft op de gemeenschap in zijn geheel.

In Hoofdstuk 6 ontwerpen, implementeren en testen we een systeem dat, gebaseerd op ideeën uit de voorgaande hoofdstukken, automatisch en autonoom de bijdrage van gebruikers in de Tribler P2P-gemeenschap verhoogt.
Ons Credit Mining System (CMS) selecteert uit een lijst die door de gebruiker is samengesteld de swarms met de hoogste voorspelde vraag. Het CMS participeert in deze swarms met een algoritme dat ontworpen is om de verhouding tussen upload en download te maximaliseren. We laten door middel van Internet tests zien dat het CMS de bijdrage van gebruikers verhoogt en een positief effect heeft op het serviceniveau in de gemeenschap.

Curriculum Vitæ

Mihai Capotă

1983-06-29 Born in Buzău, Romania.

Education

1998–2002  Baccalaureate, B. P. Hasdeu National College, Buzău, Romania

2002–2007  Engineer’s degree in computer science, Politehnica University of Bucharest, Romania
           Thesis: Mobility Enhancements for Real-time Communication in an IPv6 Environment
           Supervisor: Prof. dr. ing. Valentin Cristea

2007–2009  Master’s degree in computer science, Politehnica University of Bucharest, Romania
           Thesis: Quantifying Swarm Health to Improve BitTorrent Performance
           Supervisor: Prof. dr. ing. Nicolae Țăpuș

International collaboration

2007  Engineering intern, Fraunhofer FOKUS Institute, Berlin, Germany

2012  Visiting research assistant, Federal University of Campina Grande, Brazil

Awards

2000  Romanian Olympiad in Informatics honorable mention

2014 IEEE SCALE Challenge winner

2014 ACM SAC best paper award


List of Publications

1. Mihai Capotă, Tim Hegeman, Alexandru Iosup, Arnau Prat-Pérez, Orri Erling, and Peter Boncz, “Graphalytics: A Big Data Benchmark for Graph-Processing Platforms”, in Workshop on Graph Data Management Experiences and Systems (GRADES), doi:10.1145/2764947.2764954 (2015).

2. Mihai Capotă, Johan A. Pouwelse, and Dick H. J. Epema, “Decentralized credit mining in P2P systems”, in IFIP Networking Conference, (2015).

3. Bogdan Ghiț, Mihai Capotă, Tim Hegeman, Jan Hidders, Dick H. J. Epema, and Alexandru Iosup, “V for vicissitude: The challenge of scaling complex big data workflows”, in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) SCALE Challenge, doi:10.1109/CCGrid.2014.97 (2014). (IEEE SCALE Challenge winner)

4. Mihai Capotă, Johan A. Pouwelse, Dick H. J. Epema, “Towards a Peer-to-Peer Bandwidth Marketplace”, in International Conference on Distributed Computing and Networking (ICDCN), doi:10.1007/978-3-642-45249-9_20 (2014).

5. Riccardo Petrocco, Mihai Capotă, Johan A. Pouwelse, Dick H. J. Epema, “Hiding User Content Interest while Preserving P2P Performance”, in ACM Symposium on Applied Computing (SAC), doi:10.1145/2554850.2555006 (2014). (Best paper award)

6. Niels Zeilemaker, Mihai Capotă, Johan A. Pouwelse, “Open2Edit: A peer-to-peer platform for collaboration”, in IFIP Networking Conference, http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6663524 (2013).

7. Mihai Capotă, Nazareno Andrade, Johan A. Pouwelse, Dick H. J. Epema, “Investment Strategies for Credit-Based P2P Communities”, in Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), doi:10.1109/PDP.2013.70 (2013).

8. Tim Hegeman, Bogdan Ghiț, Mihai Capotă, Jan Hidders, Dick H. J. Epema, Alexandru Iosup, “The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation”, in IEEE International Conference on Big Data, doi:10.1109/BigData.2013.6691631 (2013).

9. Tamás Vinkó, Flávio Santos, Nazareno Andrade, Mihai Capotă, “On swarm-level resource allocation in BitTorrent communities”, Optimization Letters, doi:10.1007/s11590-012-0477-5 (2012).

10. Mihai Capotă, Nazareno Andrade, Tamás Vinkó, Flávio Santos, Johan A. Pouwelse, Dick H. J. Epema, “Inter-swarm resource allocation in BitTorrent communities”, in IEEE International Conference on Peer-to-Peer Computing (P2P), doi:10.1109/P2P.2011.6038748 (2011).


11. Maciej Wojciechowski, Mihai Capotă, Johan A. Pouwelse, and Alexandru Iosup, “BTWorld: Towards observing the global BitTorrent file-sharing network”, in Workshop on Large-Scale System and Application Performance (LSAP) in conjunction with ACM International Symposium on High Performance Distributed Computing (HPDC), doi:10.1145/1851476.1851562 (2010).

12. Michel Meulpolder, Lucia D’Acunto, Mihai Capotă, Maciej Wojciechowski, Johan A. Pouwelse, Dick H. J. Epema, and Henk J. Sips, “Public and private BitTorrent communities: A measurement study”, in International Workshop on Peer-to-Peer Systems (IPTPS), https://www.usenix.org/conference/iptps-10/public-and-private-bittorrent-communities-measurement-study (2010).