University POLITEHNICA of Bucharest

Automatic Control and Computers Faculty, Computer Science and Engineering Department

PhD THESIS

Protocol Measurements and Improvements in Peer-to-Peer Systems

Scientific Adviser: Prof. dr. ing. Nicolae Țapuș

Author: drd. Răzvan Deaconescu

Bucharest, 2011

Universitatea POLITEHNICA din București

Facultatea de Automatică și Calculatoare, Catedra de Calculatoare

TEZĂ DE DOCTORAT

Evaluarea și îmbunătățirea protocoalelor în sisteme Peer-to-Peer

Conducător Științific: Prof. dr. ing. Nicolae Țapuș

Autor: drd. Răzvan Deaconescu

București, 2011

To those who dream of making the impossible possible

Acknowledgements

The work and contributions highlighted in this thesis are the result of fruitful collaboration with a diversity of people. People who provided me with advice, with motivation, with competition and enthusiasm. I have to thank everyone who stood by me for the past years and supported my research and my professional activity and, at the same time, tolerated my sometimes abrasive behavior and harsh attitude.

Firstly, I am greatly indebted to the effort and careful advice provided by my supervisor, prof.

Nicolae Țapuș. His calm attitude and insightful remarks have been of excellent value; I owe him a lot of gratitude for the directions he pointed out and the support he offered. Our research collaboration was strengthened by working together in the P2P-Next project, where most of my research work took place.

Huge thanks go to my friend, colleague and collaborator, Răzvan Rughiniș. Răzvan has been by my side for the whole duration of my PhD and has provided me with constant motivation to push forward. Though we've had our share of differences, Răzvan has never failed to provide me his full support towards doing good quality research and putting it to good use.

My “PhD brethren”, Mircea Bardac and George Milescu, have been my major collaborators and fellow researchers in the Peer-to-Peer field. All three of us have worked closely and collaborated on many research tracks and published many scientific articles together. We have learned a lot from each other. I thank Mircea for providing us with new information regarding research tools and applications. I thank George for his never-ending effort and perfectionism and his constant desire to do things the right way.

Working within the P2P-Next project, I've had the pleasure of working close to high quality researchers and engineers. People such as Johan Pouwelse, Keith Mitchell and Johnathan Ishmael have been instrumental in teaching me clever approaches to research and the skills required to be part of a large project. With a high level of expertise in distributed systems research, prof. Valentin Cristea provided me with valuable advice and directions for my PhD. Prof. Cristea managed to transfer to me some of his passion for research activities and his insight on dealing with scientific challenges.

Andrei Pitiș and Tavi Purdilă have ensured mentorship in team-based environments and technical matters. Their vast knowledge and skills and their never-ending trust in me mean a lot and have acted as a continuous source of enthusiasm.

Within the P2P-Next project in particular and the Peer-to-Peer research area in general, collaborators such as

Călin Burloiu, Adriana Drăghici, Marius Sandu-Popa, Costin Lupu, Aurelian Bogdan, Bogdan Druțu and Oana Baron supported my research and allowed the birth of new ideas and the expansion of old ones.

Thanks go to my colleagues and fellow researchers Laura Gheorghe, Mugurel Andreica, Alex

Costan, Eliana Tîrșa, Vlad Posea, Traian Rebedea and Andreea Urzică. We've interacted plenty of times in various projects and exchanged ideas, opinions and skills.

As part of the Systems group, people such as Vlad Dogaru, Alex Juncu, Daniel Rosner, Laura Gheorghe, Mihai Bucicoiu and Bogdan Doinea have taken part in fruitful discussions regarding research tracks. The diversity and activism of the group are a constant source of motivation for my taking on as many challenges as possible.

Working close to clever and diligent students in engineering and research projects has been instrumental to my constant yearning to try out new things. Many thanks go to people such as Andrei Buhaiu, Lucian Cojocar, Alex Eftimie, Vali Priescu, Cătălin Moraru and Sergiu Iordache. We've worked together on a variety of projects that are continuing today with more and more people and ideas.

I thank all members of the ROSEdu community for their constant involvement in open-source and educational projects, for taking up challenges and for helping me whenever possible. I thank Mihai Maruseac, Andrada Georgescu, Laura Vasilescu, Victor Cărbune, Adrian Scoică, Andrei Maruseac,

Vlad Voicu, Dragoș Dena and others for supporting me in all the good or bad ideas I've come up with and for providing the best outcome to adjacent activities.

My family has always played a big part in my life and professional activity. Their constant support, unending trust and continuously positive attitude form an infinite reservoir of energy for all the activities I'm involved in. I thank my parents, Ion and Cornelia, my sister Alina and my brother-in-law and good friend Alex for always being there for me and for providing help whenever required.

For the past years, I've had the privilege to meet, interact and collaborate with a diversity of people. I've learned a lot, gathered experience and received constant support from the ever growing crowd of enthusiastic and passionate people around me. Thank you all for your patience, your stamina and your trust in me. This work is imbued with your effort and I will strive to give you back my whole appreciation.

Abstract

In the ever evolving Internet, Peer-to-Peer systems have grown from a fan-driven file sharing solution to account for the highest percentage of Internet bandwidth. The development of hardware systems and network links has ensured powerful resources for edge nodes in the Internet. Peer-to-Peer systems took advantage of these newly available resources and provided the means to deliver new services to users. While traditionally used for file sharing, Peer-to-Peer solutions have also been developed for streaming (both live and video-on-demand), social networking, collaborative work and anonymity.

Peer-to-Peer protocols in general, and the BitTorrent protocol in particular, form the subject of this work. We've undertaken a thorough analysis of current protocols and their particularities and provided enhancements. We consider the current design to be a good one, drawing on about 10 years of continuous research and development. As we are focused on research, our aim is to take protocols that are doing things right and work to make them better. This generally means making them faster. Protocol design contributions are present in the design and implementation of the Tracker Swarm Unification Protocol (TSUP) and in contributions to a multiparty protocol. The former provides an enhancement to an existing protocol, BitTorrent, while the latter provides updates to a newly designed Peer-to-Peer protocol, challenging the current state of data distribution.

Peer-to-Peer streaming has been tackled in the local LivingLab, which has hosted multiple experiments using the P2P-Next framework. These experiments have provided information and measurement data for analysis and comparison between classical Peer-to-Peer distribution and streaming. As formal results, metrics for virtualization solutions and Peer-to-Peer measurements are proposed. The aim is to provide the context for analyzing and improving Peer-to-Peer protocols and node distribution.

In order to run experiments and to provide insight into the inner protocol workings and the network, measurement frameworks have been designed and implemented. Our approach uses logging information provided by Peer-to-Peer implementations to gather protocol performance data. This has been made possible by designing, building and using a lightweight virtualized infrastructure. The infrastructure is used to configure and deploy a complete Peer-to-Peer network and to gather information afterwards.

Considering the sheer size of research and development in Peer-to-Peer systems, this work couldn't possibly present an exhaustive approach to improving Peer-to-Peer protocols. We have constructed a Peer-to-Peer infrastructure stack for experimentation and provided protocol improvements and updates. We hope that we've laid new bricks in the right direction and are eager to see new challenges being taken up.

Contents

Acknowledgements

Abstract

1 Introduction
1.1 Thesis Objective
1.2 Thesis Scope
1.3 Related Work
1.4 Context
1.5 Contents

2 Peer-to-Peer Systems and Protocols
2.1 The Peer-to-Peer Paradigm
2.1.1 Features and Models of Peer-to-Peer Systems
2.1.2 P2P Resource Management
2.1.3 P2P Topologies for File Sharing
2.2 Peer-to-Peer Systems Deployed in the Internet
2.2.1 Napster
2.2.2 Gnutella
2.2.3 FastTrack
2.2.4 eDonkey Network
2.2.5 DirectConnect
2.2.6 Instant Messaging. Jabber/XMPP
2.2.7 P2P Collaborative Spaces. JXTA
2.2.8 BitTorrent
2.3 Content Distribution in Peer-to-Peer Systems
2.3.1 Peer-to-Peer Streaming
2.3.2 Video-on-Demand
2.3.3 Live Streaming
2.3.4 Existing Solutions
2.4 Issues and Challenges in Peer-to-Peer Systems

3 Deploying Peer-to-Peer Network Infrastructures through Virtualization
3.1 Virtualization Solutions
3.1.1 Benefits of Virtualization
3.1.2 Types of Virtualization Solutions
3.1.3 OpenVZ
3.2 Virtualized Peer-to-Peer Infrastructure Setup
3.2.1 Overall View
3.2.2 Virtualized Network Configuration
3.2.3 Virtualized Hardware Configuration
3.3 Evaluating Virtualization


3.4 Automating Deployment and Management of Peer-to-Peer Clients
3.4.1 Instrumenting Peer-to-Peer Clients
3.4.2 Framework Setup
3.4.3 Client Startup and Management
3.5 Simulating Connection Dropouts
3.5.1 Reproducing Network Behavior
3.5.2 Experimental Setup
3.5.3 Results
3.6 Deployed Setup and Experimental Scenarios
3.6.1 Virtualized Configuration
3.6.2 Monitoring Experiment
3.6.3 Performance Evaluation Experiments
3.6.4 Results
3.6.5 Deploying the Rendering Engine
3.7 Conclusion

4 Protocol Measurement and Analysis in Peer-to-Peer Systems
4.1 BitTorrent Messages and Parameters
4.1.1 Protocol Messages
4.1.2 Measured Data and Parameters
4.1.3 Approaches to Collecting and Extracting Protocol Parameters
4.2 A Generic BitTorrent Protocol Logging Library
4.2.1 Library APIs
4.2.2 Library Initialization and Cleanup
4.2.3 Peer Logging Structure
4.2.4 Core Module
4.2.5 Buffering Module
4.2.6 Providing Logging Output
4.2.7 Clients Instrumented to Use the Library
4.3 Log Collecting for BitTorrent Peers
4.3.1 Instrumenting BitTorrent Clients for Logging Support
4.3.2 Storage of Logging and Processed Data
4.3.3 Peer-to-Peer Log-Centric Experiments
4.3.4 Approaches to Processed Data Analysis
4.4 Protocol Data Processing Engine
4.4.1 Post Processing Framework for Real-Time Log Analysis
4.4.2 Monitoring Peer-to-Peer Swarms using MonALISA
4.5 Evaluation Based on Peer-to-Peer Protocol Measurements
4.6 Conclusion

5 A Novel BitTorrent Tracker Overlay for Swarm Unification
5.1 Context and Objective
5.2 Swarm Unification Protocol
5.2.1 Protocol Overview
5.2.2 Tracker Awareness
5.2.3 Tracker Networks
5.3 Protocol Implementation
5.4 Experimental Setup
5.5 Experimental Results and Evaluation
5.6 Conclusion

6 Designing an Improved Multiparty Protocol in the Kernel
6.1 The swift Multiparty Protocol
6.2 Designing a Multiparty Protocol

6.3 Raw Socket Wrapper Implementation
6.3.1 Raw Socket-Based Architecture
6.3.2 Testing of Raw Socket Implementation
6.4 Kernel Framework for Multiparty Protocol Implementation
6.4.1 Transport Protocol Implementation on the Linux Kernel
6.4.2 Testing the Kernel Protocol Implementation
6.5 Conclusion

7 Measurements of Multimedia Distribution in Peer-to-Peer Networks
7.1 Video Content Creation and Processing
7.2 Adding Video Streaming Support in libtorrent
7.2.1 Employed Streaming Solutions
7.2.2 Design and Implementation of a Streaming Component in libtorrent
7.2.3 Experimental Evaluation of Streaming Support
7.3 Using Next-Share Technology for P2P Streaming
7.3.1 Next-Share Architecture and Features
7.3.2 NextSharePC
7.3.3 NextShareTV
7.3.4 LivingLab
7.4 Evaluation of P2P Streaming Solutions
7.4.1 LivingLab Infrastructure
7.4.2 Evaluation Goals of Streaming Trials in the LivingLab
7.4.3 Experiments
7.4.4 Issues and Feedback
7.5 Conclusion

8 Conclusion
8.1 Summary
8.2 Contributions
8.3 Future Work
8.4 Publications and Talks
8.4.1 Papers
8.4.2 Books
8.4.3 Workshops

Chapter 1

Introduction

“Make it run, then make it right, then make it fast.” – Kent Beck

Technology is strongly related to progress. As time goes by, technology is the prime ingredient for progress. Progress, in its turn, forces technology to adapt itself, to become better and more useful, to surpass its original design and provide new outcomes.

Computer Science and Computer Engineering are among the most dynamic fields of human knowledge, with continuous transformations of the technology, scope and principles they enable.

Among the myriad of inventions, discoveries and contributions to the development of human civilization, the Internet stands out as one of the most impressive and extraordinary. With only a few decades of existence, the Internet has managed to provide the means for uniting the entire planet, for ensuring access to knowledge, information and digital resources. The Internet is also a catalyst for all major technological advances in many other fields of study, where researchers and engineers alike have been challenged to provide better services, better connectivity, better user experience, better performance and increased diversity.

On the frontier between two millennia, when the Internet had become a mainstream technology and de facto service, a new technology emerged, fulfilling the need for data exchange among edge users: Peer-to-Peer systems. The exploding evolution of the Internet had provided users everywhere with higher and higher bandwidth. The similarly explosive evolution of computer technology ensured faster CPUs, larger hard disks, more memory. As even fringe users had access to good quality resources, the need to easily interconnect and share information became preeminent. Peer-to-Peer systems thus filled the technological vacuum enabled by the ever progressing Internet technology.

Peer-to-Peer technology is as old as the Internet. In the beginnings of the Internet, all stations had limited power and limited bandwidth and had to cooperate to achieve a common goal. A prime example of this is the email service. Peer-to-Peer systems reemerged in the late 1990s, when Napster1 became a truly powerful service used by a multitude of users. Through more than 10 years of research, engineering and updates, Peer-to-Peer concepts, applications and protocols have nowadays become common knowledge.

As one of the most successful and most heavily used protocols in the Internet2, the BitTorrent protocol [11] has captured the attention of research and commercial entities alike. With a “tit-for-tat”3 algorithm at its base, the BitTorrent protocol is nowadays the de facto protocol used for large file distribution, offering advantages such as high speed, uniform bandwidth consumption, low load on stations and extensibility.

1 http://www.napster.com/
2 http://www.sandvine.com/news/global_broadband_trends.asp
3 http://www.gametheory.net/dictionary/TitforTat.html


The BitTorrent protocol is, in this thesis author's opinion, an example of a protocol that had first been made to work, and then has been made to work right, though, like every protocol, it has its own downsides [24]. Simple, yet powerful, BitTorrent shines when used for large content distribution and has subsequently been adapted for other features such as streaming [56].

It is around the BitTorrent protocol that this thesis is centered. Our aim, as highlighted in the thesis, is to take the protocol, run measurements, analyze it and make it better: make it faster, enhance its performance and usability, broaden its scope. We do not aim to fill every possible spot of improvement, but rather to signal relevant aspects to be taken into account to ensure increased performance.
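To make the tit-for-tat idea concrete, the sketch below shows a simplified choking round in the spirit of BitTorrent's reciprocation algorithm: unchoke the peers that upload to us fastest, plus one optimistic slot for discovering new partners. The slot counts and helper names are illustrative assumptions, not the reference implementation.

# Illustrative sketch of BitTorrent-style tit-for-tat unchoking (not the
# reference implementation): every choking round, reciprocate with the
# peers that uploaded to us fastest, plus one optimistic unchoke slot.
import random

UPLOAD_SLOTS = 3          # regular unchoke slots (assumed value)
OPTIMISTIC_SLOTS = 1      # slots reserved for discovering new partners

def choking_round(peers):
    """peers: dict mapping peer_id -> observed download rate from that peer
    (bytes/s). Returns the set of peer_ids to unchoke this round."""
    # Reciprocate: unchoke the peers we are downloading from the fastest.
    by_rate = sorted(peers, key=peers.get, reverse=True)
    unchoked = set(by_rate[:UPLOAD_SLOTS])
    # Optimistic unchoke: give a random choked peer a chance to prove itself.
    choked = [p for p in peers if p not in unchoked]
    unchoked.update(random.sample(choked, min(OPTIMISTIC_SLOTS, len(choked))))
    return unchoked

if __name__ == "__main__":
    rates = {"peer-a": 120_000, "peer-b": 45_000, "peer-c": 300_000,
             "peer-d": 10_000, "peer-e": 0}
    print(choking_round(rates))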

1.1 Thesis Objective

When involved in Peer-to-Peer systems, updates, enhancements and improvements have to take into account the rules, diversity and particularities of such systems. Carefully crafted experiments, sensible measurements, reasonable conclusions and useful proposals are important components of research activities in this domain.

The objective of this thesis is to provide a series of improvements to Peer-to-Peer systems at protocol level, either by updating and enhancing existing protocols or by designing and implementing new ones. Formal and experimental evaluations are employed to provide arguments regarding the advantages of these updates. Providing improvements is based on careful trial deployment and analysis of Peer-to-Peer protocols. The use of testing environments and the employment of measurement and evaluation techniques are critical to successfully delivering protocol updates and improvements.

As a secondary objective, the thesis highlights the methods and mechanisms for creating realistic, scalable and automated trial environments and the approaches available for collecting, measuring and interpreting Peer-to-Peer protocol parameters. Proper evaluation of deployed trial information is critical to the relevance of collected data. An important part of the thesis is dedicated to evaluating architectures, protocols, actual implementations, internal messages and network topologies with the help of versatile infrastructures. Monitoring, log processing and internal processing are all employed to allow an in-depth view and analysis of P2P protocols.

As much of the work highlighted in this thesis has been part of the P2P-Next project1, extensive focus has been given to using P2P technology for providing content streaming. With BitTorrent at the focus of the research in place, various adaptations and extensions have been analyzed. BitTorrent's rarest-piece-first strategy is not suitable for streaming, and adaptations to the piece-picker policy have to be taken into account. The NextShare technology, used in the P2P-Next project, is the core of analysis and measurements.

Measurements, evaluation and analysis were aimed at providing better network architectures and protocols for improved performance in P2P systems. A tracker overlay that ensures swarm unification is a proposal for providing healthy swarms and content unification. Providing a proper inter-tracker communication protocol, together with relevant measurements and evaluation, is part of the objective of providing enhancements to existing solutions. So is the kernel-based implementation of a BitTorrent alternative dubbed swift, designed from the ground up as an OSI-stack Transport-layer protocol.

1 http://www.p2p-next.org

1.2 Thesis Scope

In accordance with the stated objective of providing improvements and updates to Peer-to-Peer systems (as also stated in the title), this thesis primarily defines its scope in the field of Peer-to-Peer systems, with an overall view presented in Chapter 2. In particular, most of the research and development of this work is centered around the BitTorrent protocol, one of the most popular protocols in the Internet.

Motivated by the desire to provide new and useful updates to existing applications and to create new ones, the approach employed is one that carefully deploys and monitors Peer-to-Peer applications, collects protocol parameters, puts them under scrutiny and analysis and provides the input required for improvements, extensions and updates.

While multiple approaches may be considered for client, protocol, network and swarm analysis, the approach chosen relies on low level information, considered to be the basis of protocol parameters. We are not (directly) concerned with the overall view of the Peer-to-Peer system, but rather with an in-depth view of client behavior. Hook points in Peer-to-Peer client implementations, client logging or network traffic analysis provide the means to gather the valuable low level Peer-to-Peer protocol information.

Protocol messages contain valuable information such as client download rate, client upload rate, number of connected peers and protocol events. These parameters are analyzed and conclusions are drawn, allowing protocol improvements to be proposed, analyzed and then put to use. An important approach is using collected parameters for comparison between various implementations, protocols or behaviors in diverse situations. Protocol improvements may result in updates to existing protocols and solutions, as is the case with the tracker overlay presented in Chapter 5, or in the design of a novel protocol, as is the case with the multiparty protocol implementation in the Linux kernel detailed in Chapter 6.

1.3 Related Work

Measurements and improvements in Peer-to-Peer systems have been the focus of commercial and scientific players alike. Thorough analysis, a diversity of approaches and different interests have sparked a plethora of novel mechanisms, architectures, updates and extensions to existing applications, as well as new and improved implementations for both old and new problems. Recent years have shown particular interest in the streaming capabilities of Peer-to-Peer solutions, resulting in a surge of updates, extensions and trials. Novel protocols such as swift1 are proof of the continuous birth of new protocols in the ever evolving Internet.

Most measurements and evaluations involving the BitTorrent protocol and applications are concerned either with the behavior of a real-world swarm or with the internal design of the protocol. There has been little focus on creating a self-sustained swarm management environment capable of deploying hundreds of controlled peers, and subsequently gathering results and interpreting them.

The PlanetLab2 infrastructure provides a realistic testbed for Peer-to-Peer experiments. PlanetLab nodes are connected to the Internet, and experiments have a more realistic testbed where delays, bandwidth and other parameters are subject to change. Tools are also available to aid in conducting experiments and data collection [73].

Pouwelse et al. [58] have done extensive analysis on the BitTorrent protocol using large real-world swarms, focusing on overall performance and user satisfaction. Guo et al. [28] have modeled the BitTorrent protocol and provided a formal approach to its functionality.

1 http://tools.ietf.org/html/draft-grishchenko-ppsp-swift-03
2 http://www.planetlab.org

Bharambe et al. [6] have done extensive studies on improving BitTorrent's network performance. Garbacki et al. [25] have created a simulator for testing 2Fast, a collaborative download protocol. The simulator is useful only for small swarms that require control. Real-world experiments involved using real systems communicating with real BitTorrent clients in the swarm.

A testing environment involving four major BitTorrent trackers for measuring topology and path characteristics has been deployed by Iosup et al. [31], using nodes in PlanetLab. The measurements were focused on geo-location and required access to a set of nodes in PlanetLab. The authors' efforts are directed towards correlating characteristics of BitTorrent and its Internet underlay, with focus on topology, connectivity and path-specific properties. For this purpose they designed and implemented Multiprobe, a framework for large-scale P2P file sharing measurements. The main difference between their implementation and our approach is that we focus on an in-depth client-level analysis and not on the whole swarm.

Dragos Ilie et al. [30] developed a measurement infrastructure with the purpose of analyzing P2P traffic. The measurement methodology is based on application logging and link-layer packet capture.

Meulpolder et al. [47] present a mathematical model for bandwidth-inhomogeneous BitTorrent swarms. Based on a detailed analysis of BitTorrent's unchoke policy for both seeders and leechers, they study the dynamics of peers with different bandwidths, monitoring their unchoking and uploading/downloading behavior. Their analysis showed that having only peers with the same bandwidth is not enough to determine the peers' behavior in depth. In their experiments they split the peers into two bandwidth classes, slow and fast, and observed that slow peers usually unchoked other slow peers, their data being transferred from fast peers. Although they do not offer precise details about the experimental part of monitoring unchoking behavior and transfer rates, their work relates to what we intend to do: put to use the BitTorrent logging messages that a Peer-to-Peer network generates, parses and stores.

While Meulpolder et al. [47] provide a peer-level analysis, another approach is to study BitTorrent at tracker level, as described by Bardac et al. [4]. This paper implements a scalable and extensible BitTorrent tracker monitoring architecture, currently used in the Ubuntu Torrent Experiment1 at University POLITEHNICA of Bucharest (UPB), the Computer Science and Engineering Department. The system analyzes the Peer-to-Peer network considering both the statistical data variation and the geographical distribution of data. This study is based on an infrastructure similar to the one we use for our client and protocol level analysis.

As video-based traffic has continuously filled the Internet backbone, effort has been invested in providing streaming functionality to Peer-to-Peer solutions. As Peer-to-Peer file sharing solutions use the “offline playback” model, protocol and overlay network topology updates had to be considered.

Guo et al. [40] have put up a survey of available P2P streaming solutions in late 2007.
They created a state of the art of streaming solutions and took into consideration two basic overlay topologies: tree-based systems and mesh-based systems (similar to swarms). For each of them, they surveyed existing Peer-to-Peer solutions for VoD and for live streaming. They concluded that P2P streaming systems are capable of streaming video to a large population at low server cost and with minimal dedicated infrastructure. However, they signaled fundamental limitations such as reduced Quality of Experience and the non-ISP-friendly status of P2P applications.

An important paper referred to by the above survey when considering VoD on mesh-based systems was “BiToS: Enhancing BitTorrent for Supporting Streaming Applications”. In this article, Vlavianos et al. [72] propose updates to the popular BitTorrent protocol to turn it into a VoD-friendly protocol, which they refer to as view-as-you-download. They acknowledge the fact that BitTorrent is an adequate protocol for large data distribution but lacks support for basic streaming due to its “rarest-piece-first” piece selection policy.

1 http://torrent.cs.pub.ro/

Therefore, they split the pieces that need to be collected into two sets: a high priority set and a remaining piece set. They attach a probability to a given piece and conclude that an updated, priority-enhanced rarest-piece-first policy retains the good performance specific to the BitTorrent protocol and provides VoD streaming support, allowing the construction of a view-as-you-download service.

Lessons from BiToS were used and enhanced in “Give-to-Get”, an updated form of “tit-for-tat” within BitTorrent developed at TU Delft by Mol et al. [51]. Give-to-Get assumes altruistic behavior for peers, that is, a peer will voluntarily give out pieces to another peer in the hope that the peer will reciprocate. Give-to-Get takes into account that a peer's good history is mostly irrelevant, as a peer is concerned with data that needs to be played out “now”; peer ranking is built up accordingly. The solution using Give-to-Get makes two important updates to BitTorrent: the neighbor selection policy and the piece selection policy. Give-to-Get borrows heavily from BiToS and uses priority sets for altering the piece selection policy. Unlike BiToS, however, Give-to-Get uses three sets: a high priority set, a medium priority set and a low priority set. Give-to-Get has been designed with fairness in mind, and the issue of free riding is addressed heavily in the paper.

Xing et al. [33] proposed methods to detect malicious peers in a streaming network, which may be used to increase the health of the swarm. Somewhat similarly, Lehrieder et al. [36] proposed mechanisms to mitigate unfairness in locality-aware Peer-to-Peer networks.

Mol et al. [50] have also worked on creating a live streaming update to BitTorrent, as described in the technical paper “The Design and Deployment of a BitTorrent Live Video Streaming Solution”. As with Give-to-Get, their solution is a mesh/swarm-based one and required important updates to the non-sequential BitTorrent protocol. A “.tstream” file is generated to describe content that, being a live stream, has no pre-determined length. A sliding window is added for each peer, such that only pieces around the current playback point are requested. The initial seeder, also dubbed the injector, provides the stream to other peers that may, in turn, provide it to others. Dealing with the constraints and specifics of a live streaming network, a seeder is redefined as a peer which is always trusted and unchoked by the injector and thus has access to the whole stream. Seeders are required to ensure the stability of the stream.

The latter two approaches (designed and implemented by Mol et al. [51][50]) form the basis for the NextShare P2P streaming solution described in Section 7.3.
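To illustrate the priority-set idea shared by BiToS and Give-to-Get, the following sketch combines an in-order high-priority window near the playback position with rarest-first selection for the remaining pieces. The window size and the probability split are invented for illustration and do not reproduce the parameters published in either paper.

# Hedged sketch of a BiToS/Give-to-Get-style piece picker: pieces near the
# playback position go into a high-priority set picked (mostly) in order,
# while the rest fall back to rarest-first. Set sizes and the 80/20 split
# are illustrative assumptions, not the published parameters.
import random

HIGH_SET = 10   # pieces immediately ahead of the playback position (assumed)

def pick_piece(missing, availability, playback_pos):
    """missing: set of piece indices still needed; availability: dict
    piece -> number of peers holding it; playback_pos: next piece to play."""
    high = sorted(p for p in missing if playback_pos <= p < playback_pos + HIGH_SET)
    rest = [p for p in missing if p >= playback_pos + HIGH_SET]
    if high and (not rest or random.random() < 0.8):
        return high[0]                           # in-order pick keeps playback fed
    if rest:
        return min(rest, key=availability.get)   # rarest-first for the remainder
    return None

if __name__ == "__main__":
    missing = {3, 4, 7, 20, 21, 35}
    avail = {p: random.randint(1, 10) for p in missing}
    print(pick_piece(missing, avail, playback_pos=3))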
A novel approach to Peer-to-Peer file sharing, the swift protocol1 aims to transform the virtually infinite capacity of nodes in the Internet into the ground for continuous file sharing. While still in its infancy, the swift protocol has proven to be a serious candidate in the Peer-to-Peer protocol market. Inspired by the BitTorrent protocol, swift has been dubbed “BitTorrent at the Transport layer”. The current implementation is user-space based, on top of UDP, but progress has been made toward a complete implementation of swift in kernel space.

swift discards options it considers obsolete, such as the use of TCP, and thus ordered delivery and reliable transfer, and proposes new approaches to storing and exchanging metadata, such as binmaps [26], an integrated tracker and the effective use of Merkle hashes and trees. One major asset of swift is the current effort (as of the writing of this thesis) towards IETF standardisation. There has been some progress, but the road is still long to transform swift into a fully fledged Transport layer protocol serving as a multiparty Peer-to-Peer protocol. A current draft has been submitted to the IETF PPSP team.
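As a small illustration of the Merkle hash trees mentioned above, the sketch below computes a root hash over a list of pieces, which is the mechanism that lets a swift-like protocol identify and verify content without a separate metadata file. The SHA-1 choice and the padding rule for odd levels are assumptions of this sketch, not the exact scheme of the swift draft.

# Minimal sketch of a Merkle hash tree for content integrity: each piece is
# hashed, then adjacent hashes are combined pairwise up to a single root
# that identifies the whole content. Padding with empty hashes for odd
# counts is an assumption of this sketch, not the swift draft rule.
import hashlib

def merkle_root(pieces):
    level = [hashlib.sha1(p).digest() for p in pieces]
    while len(level) > 1:
        if len(level) % 2:                       # pad odd levels
            level.append(hashlib.sha1(b"").digest())
        level = [hashlib.sha1(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

if __name__ == "__main__":
    content = [b"piece-0", b"piece-1", b"piece-2"]
    print(merkle_root(content).hex())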

1 http://libswift.org

1.4 Context

The context of this work is defined by several key components, ranging from the EU FP7 P2P-Next project to protocols such as BitTorrent and swift, from the streaming updates and the LivingLab used for trials to approaches such as virtualization and logging information that formed the basis for the experimental infrastructure.

The EU FP7 P2P-Next project1 is the major component in the context of this work. Much of the effort and results presented here have been part of the P2P-Next project. The project has stated the ambitious goal of “providing the next generation content delivery platform”. With high-ranked commercial and academic partners in Europe, it incorporates P2P technology to provide streaming (both video-on-demand and live streaming) for PCs and for consumer electronics (CE) devices such as set-top boxes. The technology used by the P2P-Next project is dubbed the NextShare technology.

The BitTorrent protocol may be considered the flagship of today's P2P file sharing networks. A highly successful and powerful protocol2, BitTorrent stands at the basis of the NextShare core of the P2P-Next project. BitTorrent's success story has spawned a wide number of implementations, leading to our analysis of its inner workings for a variety of applications and topologies. The BitTorrent protocol is easily identified by the two core algorithms used for sharing content: tit-for-tat for fair piece exchange and rarest-piece-first for the piece picking policy.

swift is a radically new approach to content distribution in a Peer-to-Peer fashion. It discards many of the characteristics of TCP it deems non-useful and proposes a Transport layer protocol (in the OSI stack) able to intrinsically communicate with multiple peers. It does not follow the usual end-to-end approach of other Transport layer protocols, but rather creates a “multiparty” approach where a request is simultaneously sent to multiple peers. Facilities such as in-order delivery or reliability are discarded, making swift act as a “BitTorrent protocol at the Transport layer”.

A significant component of this work, mostly highlighted in Chapter 7, has been based around the local LivingLab3. A LivingLab is a user-centric testing ground for use within the P2P-Next project. Major partners of the project operate LivingLabs as a real environment for evaluating, testing and comparing existing applications. The focus of the local UPB LivingLab is the comparison of various Peer-to-Peer streaming implementations and classical BitTorrent distribution implementations.

Within the LivingLab, the major keyword is Peer-to-Peer streaming. A major interest in recent years, streaming requires important (and sometimes severe) updates to existing protocols. The BitTorrent protocol, for example, has to be updated: its “rarest piece first” policy is replaced by a set of priority queues [72], some of which use a sequential piece selection policy, in order to ensure proper delivery and streaming playback, especially when considering live streaming.

Testbeds and experimental setups heavily employed throughout this work have made use of virtualization technology. Virtualization (especially through the use of OpenVZ) has provided the means for deploying medium to large scale swarms in a realistic, automated and efficient manner. As it is difficult to provide a high enough number of systems for a realistic swarm, virtualization offers the benefit of providing a close-to-real environment on top of a much lower number of physical systems (on the order of tens) [17].
Information required for thorough analysis of protocol behavior, messages and parameters is typically collected through client-based logging (and, to some extent, tracker logging). Client logging contains all relevant information for analysis, with parameters such as download rate, upload rate, connected peers, peer IDs, protocol events and protocol message types. This information may be filtered and parsed to provide an easy to access, use and process storage of Peer-to-Peer parameters.
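As a sketch of how such logging output can be turned into structured protocol parameters, the snippet below parses a hypothetical log line format into typed samples. The line layout and field names are invented for illustration; each instrumented client has its own log format.

# Hypothetical sketch of turning client log lines into protocol parameters.
# The line format below is invented for illustration; real clients
# (libtorrent, Transmission, Tribler) each use their own log layout.
import re
from dataclasses import dataclass

LINE_RE = re.compile(
    r"(?P<ts>\d+) peer=(?P<peer>\S+) down=(?P<down>\d+) up=(?P<up>\d+) "
    r"peers=(?P<num>\d+) event=(?P<event>\w+)")

@dataclass
class Sample:
    timestamp: int
    peer_id: str
    download_rate: int   # bytes/s
    upload_rate: int     # bytes/s
    connected_peers: int
    event: str

def parse_line(line):
    m = LINE_RE.match(line)
    if m is None:
        return None                      # skip lines we cannot interpret
    return Sample(int(m["ts"]), m["peer"], int(m["down"]), int(m["up"]),
                  int(m["num"]), m["event"])

if __name__ == "__main__":
    print(parse_line("1300000000 peer=A1 down=52000 up=13000 peers=38 event=unchoke"))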

1 http://www.p2p-next.org
2 http://www.sandvine.com/news/global_broadband_trends.asp
3 http://p2p-next.cs.pub.ro

1.5 Contents

The thesis contents follow both a chronological order, considering the work employed, and a logical order, as later chapters rely on information presented in earlier ones. Forward and backward references are provided where necessary, while always making sure the initial information is provided in a previous chapter.

Chapter 2 – Peer-to-Peer Systems and Protocols presents a state of the art of Peer-to-Peer systems and implementations. It provides insight into the birth of P2P systems, their evolution and current challenges. Particular focus is given to the BitTorrent protocol and content distribution, as they collectively form the context of this work. We insist on streaming features for P2P systems, as they are a relevant aspect presented in this thesis. Issues and challenges are also identified and described.

The backbone of most of the experiments and trials involved is the virtualized network infrastructure presented in Chapter 3. We argue the benefits of using virtualization for creating P2P environments and highlight the characteristics of the employed infrastructure. Thorough information is provided about automating the deployment of the infrastructure and Peer-to-Peer clients and about gathering results. We have designed, developed and deployed a novel virtualization infrastructure providing features such as scalability, extensibility and automation. Based on OpenVZ and using a plethora of tools, the infrastructure has been used to create experimental setups for scenarios including performance measurements, implementation comparison and streaming versus classical distribution. We propose a set of metrics that provide performance information about various virtualization solutions, including OpenVZ, Xen and LXC.

Chapter 4 is concerned with methods and approaches for gathering information from Peer-to-Peer clients, parsing that information into protocol parameters and subjecting it to analysis. Two approaches used for collecting logging information are the implementation of a generic library whose API is “hooked” into existing applications and the use of client logging data. A custom processing framework is provided to interpret data provided by clients. Information may be analyzed either after the log collection activity (post-processing) or at the same time (live/real-time). The latter approach may be used to provide monitoring. The generic library has been integrated into the libtorrent-rakshasa and Transmission BitTorrent clients, while the custom processing framework has been integrated with a variety of clients, due to its less intrusive nature. Storage, parsing and analysis components allow analysis and measurements of BitTorrent swarms. We propose a formal analysis framework based on client speed parameters; it is focused on providing peer and swarm performance information.

A first improvement to BitTorrent swarms is the design, implementation, testing and evaluation of a novel protocol that allows swarm unification, as described in Chapter 5. The protocol, dubbed TSUP (Tracker Swarm Unification Protocol), allows swarms that share the same .torrent file, yet are distinct, to be unified, such that peers are able to communicate with one another even if initially located in different swarms. Intermediation is enabled by trackers, and the new protocol forms a tracker overlay network, with each tracker initially serving a single swarm. TSUP has been designed from scratch and has been implemented in the XBT tracker. The implementation has been evaluated both formally and experimentally on top of the virtualized infrastructure. The experimental setup made use of swarms that are diverse with respect to the number of trackers, seeders and leechers.
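The following toy model illustrates the unification idea only: trackers serving the same content periodically exchange their peer sets, so peers announced to different trackers end up seeing each other. The class and message structures are hypothetical; the actual TSUP message exchange is detailed in Chapter 5.

# Hypothetical illustration of the swarm unification idea: trackers serving
# the same .torrent exchange their peer sets so a peer announced to one
# tracker becomes visible to peers of the others. This is not the TSUP
# wire format, only a toy model of the overlay's effect.
class Tracker:
    def __init__(self, name):
        self.name = name
        self.peers = set()        # peers announced directly to this tracker
        self.neighbors = []       # other trackers in the overlay

    def announce(self, peer):
        self.peers.add(peer)

    def exchange(self):
        # Periodically push our peer set to every neighbor tracker.
        for other in self.neighbors:
            other.peers.update(self.peers)

if __name__ == "__main__":
    a, b = Tracker("tracker-a"), Tracker("tracker-b")
    a.neighbors, b.neighbors = [b], [a]
    a.announce("peer-1"); b.announce("peer-2")
    a.exchange(); b.exchange()
    print(a.peers, b.peers)   # both trackers now know both peers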
Chapter 6 presents the design and initial implementation of a multiparty protocol in the Linux kernel. Heavily based on swift, this implementation aims to provide a true Transport layer multiparty protocol directly within the Linux networking stack. An intermediate approach, using raw sockets, is employed to make use of the favorable user space environment while still providing the same interface.

The streaming part of this work and the description of trials and results in the local LivingLab are presented in Chapter 7. The addition of streaming facilities to the popular libtorrent-rasterbar implementation1 is an important component, highlighting the steps required to provide streaming. The major part of the chapter is dedicated to the deployment, analysis and evaluation of the NextShare technology2, an advanced Peer-to-Peer implementation based on BitTorrent. Results and analysis of Peer-to-Peer streaming deployment in the LivingLab form an important contribution of this work. Both feedback from users and information collected from clients are used to provide insight into the performance of video-on-demand streaming in a Peer-to-Peer network.

The last chapter (Conclusion) highlights the major contributions and results of this work and also points out what should be undertaken in the near future. It rounds up the scientific relevance of this work and sets the ground for further actions on the explored directions.

1 http://www.rasterbar.com/products/libtorrent/
2 http://www.p2p-next.org

Chapter 2

Peer-to-Peer Systems and Protocols

Peer-to-Peer (P2P) systems became a “buzzword” in the early 21st century and are nowadays one of the most common technologies used in the Internet. Although until 2000 the most common paradigm in the Internet was client-server, the emergence of Napster1 pushed Peer-to-Peer systems to become one of the most used (and controversial) technologies.

With a total of 60 million users in 2001 and many terabytes of music shared2, Napster became the center of public attention and the promoter of P2P technology. Soon after, a series of Peer-to-Peer solutions led to new ways of sharing information, files, knowledge and data (JXTA, Gnutella, BitTorrent, DirectConnect, Jabber).

In its original meaning, “Peer-to-Peer” refers to communication between equals, between two autonomous entities that are independent of each other. Generally, a “Peer-to-Peer” system is a decentralized system: each node in the network plays the same role as any other node. The opposite is a client-server system, which is centered on the server. Immediate advantages of the Peer-to-Peer architecture are scalability and fault tolerance.

In order to claim a system as being “Peer-to-Peer”, it must provide a positive response to the following questions:

1 http://www.napster.com/
2 http://money.cnn.com/2010/02/02/news/companies/napster_music_industry/

Figure 2.1: Client-Server versus Peer-to-Peer


• Does the system deal with variable connectivity and temporary addresses as a matter of course?
• Does the system provide autonomy for nodes at the edge of the network?

In addition, there is a third, implicit condition, which states that communication between nodes must exist for the system to work. It is noticeable that these conditions do not require a decentralized architecture, as is usually claimed of Peer-to-Peer systems. These conditions require that each network node is an independent player and that its presence is not mandatory for the system to work.

The Internet is a shared resource, a reservoir of information and a cooperative network built of millions of nodes worldwide. Since the 1990s, the Internet has experienced an explosive growth which generated an acute demand for one of the basic network resources: bandwidth. The firewall arose from the need to support a variety of applications with constraints regarding security.

However, in 2000 everything changed or, better said, returned to the initial state. Passive elements became active elements. Through the sharing of music files and through a larger movement (called Peer-to-Peer), millions of users connected to the Internet began using their systems for more than reading e-mail and browsing. Systems connected through the Internet could form groups and collaborate to meet various needs. Since 2000, Peer-to-Peer applications have become more common and widely used. Peer-to-Peer applications allow easy publication of information and the design of decentralized applications, which is simultaneously a challenge and an opportunity.

In this case, the Peer-to-Peer model is closely linked with the idea of decentralization. In a fully decentralized system, each node is an equal participant, without any special role. DNS (Domain Name System) is a Peer-to-Peer protocol that carries the implicit idea of hierarchy. Many other Peer-to-Peer systems have a semi-centralized organization, with some nodes having special roles.

However, Peer-to-Peer applications encounter severe problems in the current Internet architecture: firewalls and NAT (Network Address Translation) make it more difficult to create connections between stations; asymmetric bandwidth prevents users from efficiently serving files from their computers.

Beyond technical problems, Peer-to-Peer applications face social problems. Sharing files in a Peer-to-Peer system must take free-riders into account. Free-riders are those who use resources shared by others (files, bandwidth, memory, CPU power) without giving anything back (or giving too little). The resulting lack of reciprocity is damaging to the health of the Peer-to-Peer network. As such, a form of accounting for the effort of each participant has to be employed.

At the same time, Peer-to-Peer applications have attracted many issues related to sharing copyrighted materials, such as audio, video, electronic documentation, etc. Legal actions, such as the one involving The Pirate Bay1, have highlighted the troubled environment in which Peer-to-Peer applications are used.

2.1 The Peer-to-Peer Paradigm

The Peer-to-Peer paradigm has established itself as an alternative (if not an opposite) to the client-server paradigm. Providing scalability and taking advantage of the continuous increase in Internet network bandwidth, the Peer-to-Peer paradigm has been used in various situations such as file sharing, communication, collaboration, social networking and load distribution.

1 http://thepiratebay.org/

2.1.1 Features and Models of Peer-to-Peer Systems

As mentioned previously, the characteristics of P2P networks are sharing and distribution of resources, decentralization and autonomy.

Sharing and distribution of resources refers to the functionality provided by each node of the Peer-to-Peer network. A node can function as a server or as a client and may act as both a provider and a consumer of resources and services (information, files, bandwidth, storage, processor cycles). Such nodes are also called “servents”.

Decentralization refers to the absence of a central coordinating authority for organizing the network, for resource usage and distribution, or for communication between nodes. Communication takes place directly between nodes. There is a frequent distinction between pure P2P networks and hybrid networks. In hybrid P2P networks, certain functions (such as indexing and authentication) are allocated to a subset of nodes that have a coordinating role.

Autonomy means that each node independently decides what resources it provides to other nodes.

The lower cost of CPU power, bandwidth and storage space, combined with the growth of the Internet, has created new areas of use for P2P networks. This has led to an explosive growth of P2P applications, but also to discussions about limitations, performance and economic, social and legal topics.

A basic model of P2P consists of three components/levels:

• P2P infrastructures: communication mechanisms and techniques and integration of IT components;
• P2P applications: applications based on P2P infrastructures;
• P2P communities: cooperative interaction between persons with similar interests or rights.

P2P infrastructures are positioned over existing communication networks and act as a foundation for the other levels. They provide communication, integration and translation between various IT components, as well as search services, communication and sharing between network nodes, and the initiation of security processes such as authentication and authorization. Level 2 is represented by P2P applications that use P2P infrastructure services. Applications are used to allow communication and collaboration among different entities in the absence of a central supervising/coordinating node. Level 3 focuses on the social networking phenomenon, in particular on building communities and their dynamics.

2.1.2 P2P Resource Management

In scientific and IT communities, P2P applications are often classified into categories such as instant messaging, file sharing, grid computing and collaboration. This classification has been developed over time and fails to make clear distinctions between different applications; nowadays, there are many situations where several categories may be considered equivalent. For these reasons, another classification could take into account resource usage in P2P systems. As such, P2P systems may be classified with regard to the resources they make available.

Data

Information is used in collaborative spaces in the form of sharing or exchanging information and document management. Information about the presence of nodes in P2P systems is fundamental for the organization of a P2P system and for knowledge of the resources nodes make available. Information regarding free CPU cycles on a computer system may be successfully used in a Grid application.

DMS (Document Management Systems) are storage spaces that enable shared storage, management and use of data. The formation of collaborative groups for document management is a result of information-sharing applications (P2P groupware). In client-server groupware systems, a workspace for information management is needed. A P2P system can avoid all tasks related to the administration of that workspace.

Files

Sharing files is the most common form of P2P application. 70% of Internet traffic may be attributed to file exchange (mostly media files). One of the main problems of P2P systems (and file sharing applications) is locating resources. In the context of file sharing, P2P systems have developed three algorithms: the flooded request model, the centralized directory model and the document routing model. These models are best illustrated by prominent implementations such as BitTorrent, Gnutella, Napster and Freenet.

Bandwidth

With increasing demand for network transmission capacity (mainly due to increasing multimedia transfer), effective use of bandwidth becomes more important. Usually, a centralized approach where files are kept on the server leads to a bottleneck in the network; the bottleneck is generated by a set of requests being sent simultaneously from multiple clients to a single server. In these situations, using a P2P system has two advantages:

• increased load balancing through the use of unexploited transmission routes;
• usage of shared bandwidth, implying acceleration of download speed and rapid transfer of large files simultaneously requested by various entities.

Commonly, files are divided into small blocks downloaded simultaneously from different nodes of the network.

Storage

Increasing connectivity and bandwidth availability allow alternative forms of storage management that solve the problems of centralized storage and require less administrative effort. In P2P storage networks, only a portion of a desktop PC's space is used. A P2P storage network is a cluster of networked computers sharing existing storage. When a file is stored in the network, a unique identification number (obtained by applying a hash function to the file name or contents) is assigned to it. In addition, a number of copies of the file are also stored on the network.
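A minimal sketch of the identification scheme described above, assuming the ID is derived by hashing the file name together with its contents (the exact input to the hash function varies between systems):

# Small sketch of the identification scheme described above: a file stored
# in the P2P storage network gets an ID derived from a hash. Hashing name
# plus contents together is an assumption of this example.
import hashlib

def storage_id(file_name: str, data: bytes) -> str:
    return hashlib.sha1(file_name.encode() + data).hexdigest()

if __name__ == "__main__":
    print(storage_id("report.pdf", b"...file contents..."))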

CPU Cycles

Realizing that computer networks possess a large quantity of unused computing power, P2P applications have been engaged in harnessing that power. Using P2P applications to collect CPU cycles allows a potentially higher use of computing power; such power may be difficult to provide even with the employment of expensive supercomputers. The approach of using coordinated, distributed computing resources that extend beyond a single institution falls under the auspices of the domain of “grid computing”. One of the most common examples of grid projects in the context of P2P networks is SETI@home1.

SETI@home is a scientific initiative launched by the University of California, Berkeley, in order to discover radio signals from extraterrestrial intelligence. For this purpose, a radio telescope in Puerto Rico records a portion of the electromagnetic spectrum from space. Data is transmitted to the central server based in California. Instead of analyzing data using supercomputers, the server divides the information into small units and sends them for processing to the millions of computers volunteering to participate in this project.

2.1.3 P2P Topologies for File Sharing

The topology of P2P systems refers to:

• how nodes are connected;
• the existence of specialized nodes;
• the transfer mode for shared files or meta-information.

In terms of network topology, P2P systems have two main characteristics:

• scalability: there is no technical or algorithmic limitation to the size of the system; system complexity must remain relatively constant with respect to the number of nodes in the system;
• reliability: the absence of a given node or its poor performance will not affect the whole system (and, if possible, none of the other nodes).

P2P systems may be grouped into two categories: pure P2P systems and hybrid P2P systems. Pure P2P systems (such as Gnutella or Freenet) possess no central server. The hybrid P2P model (such as Napster, eDonkey2000 or Groove) uses a central server to obtain meta-information or to verify credentials. In a hybrid system, nodes always contact a central server before directly contacting other nodes.

All P2P topologies, regardless of differences, have one common feature: all data transfer sessions between nodes are achieved through direct connections. However, controlling the process may be implemented in several ways. As a result, P2P networks may be classified into four major categories: centralized, decentralized, hierarchical and ring. Although these topologies can exist by themselves (independently), distributed systems use more complex topologies by combining basic ones into what is usually referred to as a hybrid system.

Centralized

The centralized topology is based on the traditional client-server paradigm: a central server is used to manage files and the databases of user nodes that enter the system. Clients contact the server to inform it about their current IP address and the files they want to share. This is performed whenever the application is launched. Information collected from nodes is subsequently used by the server to create a centralized dynamic database that maps file names to sets of IP addresses. All search requests are submitted to the server, which consults its local database. If a match occurs, a direct link is created to the node sharing the file and the transfer is run between the nodes. In no situation is the shared file present on the server.
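A minimal sketch of such a centralized directory, with illustrative names: the server only maps file names to the addresses of the peers sharing them, while the transfer itself happens directly between peers.

# Sketch of a centralized-directory index (Napster-style): the server maps
# file names to the addresses of peers sharing them; transfers then happen
# directly between peers. All names and the hit cap are illustrative.
from collections import defaultdict

class Directory:
    def __init__(self):
        self.index = defaultdict(set)   # file name -> set of peer addresses

    def register(self, peer_addr, shared_files):
        # Called whenever a client launches and announces its shared files.
        for name in shared_files:
            self.index[name].add(peer_addr)

    def search(self, name, limit=100):  # cap on hits (cf. Napster's limit)
        return list(self.index.get(name, ()))[:limit]

if __name__ == "__main__":
    d = Directory()
    d.register("10.0.0.5:6699", ["song.mp3", "talk.ogg"])
    d.register("10.0.0.7:6699", ["song.mp3"])
    print(d.search("song.mp3"))   # download proceeds peer-to-peer from here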

1 http://setiathome.berkeley.edu/

Ring

The most important disadvantage of a centralized topology is the central server itself: it can easily become a bottleneck in the system (in case of a heavy load of client requests) and a single point of failure. As a way to circumvent this disadvantage, ring topologies have been employed. A ring topology is made of a cluster of systems arranged in a ring shape to act as a distributed server. This cluster provides load balancing and higher availability. This topology is commonly used when systems are topologically close and, quite likely, held by a single organization, where anonymity is of little importance.

Hierarchical

Hierarchical topologies are probably the most common topological example, having existed since the beginning of human civilization. The most famous example of a hierarchical system in the Internet is DNS (the Domain Name System), where authority is delegated from the root name servers to registered name servers. The topology is suitable for systems that require a governing entity involving delegation of rights or authority. Another example is a certification authority (CA), which certifies the validity of entities in the Internet and takes part in the creation of a PKI (Public Key Infrastructure).

Fully Decentralized

In a pure P2P architecture there are no centralized servers. All nodes are equal; this results in a flat, unstructured topology. In order to join the network, a node contacts a bootstrap node (a node which is always online), which provides the new node with the IP addresses of one or more nodes already in the network, making it officially part of the dynamic network. Most of the time, each node only has information about its neighbors – the nodes directly connected to it. Since there is no central server for searching, file requests are flooded through the network. As a consequence, flooding places additional load on network bandwidth.
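A flooded search can be sketched as follows (a minimal example assuming a hypothetical Node class with a neighbors list and a matches(query) method; real protocols track message identifiers rather than sharing a visited set):

    def flood(node, query, ttl, visited=None):
        # Forward the query to all neighbors, decrementing the TTL at each
        # hop, as in an unstructured (Gnutella-style) search.
        visited = visited if visited is not None else set()
        if ttl == 0 or node in visited:
            return set()
        visited.add(node)
        results = set(node.matches(query))
        for neighbor in node.neighbors:
            results |= flood(neighbor, query, ttl - 1, visited)
        return results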

Hybrid

Nowadays P2P networks combine many of the basic topologies shown above; they are also called hybrid architectures. In such systems, nodes will often play several roles.

Centralized Ring

This hybrid topology is very common for Web hosting systems. As mentioned previously, Web hosting systems typically employ a ring of servers for load balancing and fault tolerance. As such, the web servers providing content for a single entity/organization are commonly arranged in a ring topology, and clients connect to the ring using a centralized topology.

Centralized-Centralized

It can often happen that a server in a P2P network is a client in a larger network. This hybrid topology is also common in organizations that provide Web services. CHAPTER 2. PEER-TO-PEER SYSTEMS AND PROTOCOLS 15

Centralized-Decentralized

In this topology, some nodes function as group leaders (usually called Super-Nodes). Super-Nodes provide server tasks for a subset of nodes, and the Super-Nodes themselves are connected in a decentralized manner. As in a centralized topology, each Super-Node maintains a database that maps IP addresses to files; the database holds information only about the nodes in its group. A sample application that uses such a P2P topology is FastTrack. Another example of such a topology is the Internet e-mail service.

2.2 Peer-to-Peer Systems Deployed in the Internet

Since the early 2000s, a variety of Peer-to-Peer solutions have emerged. Though made popular by the diversity of P2P file sharing solutions, other applications such as collaborative software (as in JXTA), video streaming and chat (such as Jabber) have made their way into the Internet. Each type of application possesses specific features and approaches towards Peer-to-Peer communication. Napster, similar to DirectConnect, provides a central indexing server, while the actual communication (sharing) happens between peers. Gnutella is a fully decentralized approach to file sharing: peer connections and interactions require no central server, apart from the nodes required to bootstrap the network. BitTorrent, one of the most successful protocols, isolates a sharing session within a swarm, itself described by a metadata file, dubbed a .torrent file.

2.2.1 Napster

Napster1 was the first successful example of a Peer-to-Peer system. Napster was permanently closed in June 2001, an event that marked the end of an outstanding era: 65 million users in only 20 months. Napster's central control model was possible due to its tightly focused use case. In the common architecture, communication between Napster components is mediated by the server. Clients connect to a well known meta-server, which assigns them a lightly loaded server from one of the clusters. A cluster consists of 5 servers, each of which can handle about 15,000 users. The client registers with the chosen server, providing information about its identity and shared files. Afterwards, it receives information about other online users and available files. Users are anonymous to each other and the local directory structure is not directly interrogated. The main interest is searching for content and determining where the requested resource may be downloaded from. The server directory is only used to translate between station identity, the desired resource and the IP address necessary to initiate the connection. Because network performance is essential, searches are constrained to a maximum of 100 hits and a maximum duration of 15 seconds. Search actions take place in the local database maintained by the server but may sometimes be passed to neighboring servers within the same cluster. Conceived originally as a system for sharing MP3 files, it spawned a whole culture of clones and utilities. However, this posed disadvantages too. For example, certain tools (such as Wrapster) could make other kinds of files look like MP3 files. Other utilities (such as Napigator) allowed users to connect to specific servers, bypassing the meta-server arbitration phase.

1http://www.napster.com/

However, even though consistent effort was put into redesigning Napster, it still relied on a centralized server. This was a serious constraint for Napster networks and clone implementations. In order to overcome this limitation, several protocols and architectures were designed: serverless implementations in the style of Gnutella.

2.2.2 Gnutella

Initially, Gnutella1 was the name of a prototype client developed in just a few weeks in March 2000 by the same team that created WinAmp. At the launch of its beta version, almost everyone saw Gnutella as a competitor for Napster, designed to overcome many of its limitations. However, AOL, which had just bought Nullsoft, decided to immediately stop work on the client. Everything would have ended there had it not been for Bryan Mayland, who managed to deduce the Gnutella protocol and published its specification on the Internet. Thus began the development of the Gnutella open-source project. Nowadays Gnutella is a generic term with various meanings: the protocol, the open-source technology and the Internet network using it. There are several clients for the Gnutella protocol and, although most of them focus on sharing and searching for files, many other operations may be enabled. Gnutella is mainly a file sharing network that allows arbitrary types of files. There is no central server and therefore no single point of failure. Public or private networks are defined only by the clients currently communicating with each other. Each user can build a local map of the network by capturing messages from other clients. There may be multiple networks, depending on how clients are configured to interconnect. The lack of a central control point in Gnutella means that legal responsibility for file transfers remains in the user's hands. Depending on the point of view, this may turn out to be a good or a bad thing: on gNET, one may find illegal copies of almost everything one can think of. There is no standard client, but there are several clients built on top of the Gnutella protocol. These clients may communicate with each other, but developers have the liberty to add functionality and extend the implementation. The Gnutella specifications and most of the clients are open-source, but there are also proprietary implementations. Interaction in a network like Gnutella resembles a group of people looking in different directions. A user only communicates with immediate neighbors and, through this process, neighbors communicate with the others. In each session there is a different selection of people and resources. Normally a node is able to connect to any node in its neighborhood. But just as in real life, some nodes may be too busy to talk and will pay no attention. Other nodes will completely ignore the connection request. Others will exchange a few words and move on. Eventually, a node is reached with which long-term contact can be established and to which requests can be sent and from which results can be received. Nodes appear and disappear continuously and local connections change constantly; over time, a node will connect to multiple nodes. Atomistic P2P networks such as Gnutella are very dynamic and have no central address lists. From a pragmatic point of view, a list of bootstrap nodes for initial discovery is needed. The inherent virtual network segmentation results in the so-called “horizon effect”: each node's visibility is limited to a given number of nodes. In gNet, this number is 15,000 nodes. Nodes continuously enter and leave the network, thus influencing which nodes are reachable. The main reason for the “horizon effect” is the existence of a TTL (time-to-live) counter similar to that of IP packets. Typically, the TTL is set between 5 and 7 and is decremented by each node through which the message passes.
The TTL value and the number of connected nodes, together with the bandwidth and capacity of a node, combine to determine the network's performance and stability. Some clients allow the user to manually adjust the TTL and the number of active nodes, thus expanding the horizon.
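The horizon can be estimated with a small calculation (an upper bound that ignores overlap between neighbor sets):

    def horizon(fan_out, ttl):
        # Upper bound on the number of nodes reachable within 'ttl' hops
        # when each node forwards a query to 'fan_out' further neighbors.
        return sum(fan_out ** hop for hop in range(1, ttl + 1))

    # With 4 neighbors per node and a TTL of 7, at most 4 + 16 + ... + 4^7
    # nodes are reachable -- on the order of the ~15,000-node gNet horizon.
    print(horizon(4, 7))  # 21844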

1http://rfc-gnutella.sourceforge.net/

2.2.3 FastTrack

One of the more recent and most innovative applications of Peer-to-Peer architectures is the FastTrack network. FastTrack arrived as a solution to Napster's and Gnutella's problems. The FastTrack architecture is hybrid by nature, which, as noted above, means an intersection of two or more basic network topologies; in FastTrack's case, the intersection of the centralized and decentralized topologies. The FastTrack protocol is proprietary; usage rights must be obtained from Sharman Networks. Therefore, limited information is available about the current protocol. There have been several attempts to understand the FastTrack protocol through reverse engineering. The best known is the giFT1 project, which came very close to “breaking” the protocol. FastTrack reacted by changing the encryption used, making it almost impossible to deduce the protocol. However, the effort was enough to establish the protocol's mode of operation. FastTrack uses two levels of control within its network. The first level consists of clusters of nodes connected to common Super-Nodes (ordinary systems that offer high-speed connections). As discussed previously, this type of connection is similar to a centralized topology. The second level consists of Super-Nodes connected in a decentralized manner. The number of nodes acting as Super-Nodes may vary from tens to hundreds, because Super-Nodes are ordinary nodes that can enter or leave the network at will. As a result, the network is dynamic and constantly changing. To ensure constant availability, the presence of several special nodes is required. Such nodes are called bootstrap nodes and must always be active (online). When a FastTrack client (for example Kazaa) starts on a node, it initially contacts a bootstrap node. The bootstrap node then determines whether that node can be considered a Super-Node. If so, it is given some (or perhaps all) of the IP addresses of other Super-Nodes. If it qualifies only as a common node, the bootstrap node responds with the IP address of a Super-Node. Some clients (like Kazaa2) use a method named the reputation system. A reputation is associated with each user and is reflected by the level of participation (a value between 0 and 1000) in the network. The longer a user is connected, the higher the level of participation, and the more the user is favored in priority policies and receives the best service. This measure aims to encourage users to share files and to reduce the number of “passenger clients” on the network. Resources are discovered through diffusion between Super-Nodes. When a node from the second level makes a request, it is directed through the partner Super-Node, which sends the request to the other Super-Nodes connected to it. Diffusion is repeated until the TTL value reaches 0. This provides the FastTrack client with greater coverage and better search results. A disadvantage is the resource usage of the diffusion method, which may lead to a significant bandwidth load between Super-Nodes. However, this problem is not as serious as in Gnutella, due to the fast connections available to Super-Nodes. Each request received by a Super-Node is looked up in its database, which contains information about files shared through the network. When a match is found, a reply is sent along the same path to the originating node. The method is similar to response propagation in the Gnutella network, but the problem of packet loss is not as severe as in Gnutella.
In Gnutella, the backbone consists of nodes that connect and disconnect from the network at will. In FastTrack, the backbone consists of nodes with high connectivity speed (Super-Nodes) and, therefore, return paths may be considered safer.

1http://gift-fasttrack.berlios.de/
2http://www.kazaa.com/

2.2.4 eDonkey Network

The eDonkey network1 is a centralized P2P file sharing system. Several central servers maintain information about the various shared files, while the actual transfer of files takes place directly between nodes. The original eD2K protocol was not formally documented. However, given that most of today's clients (*Mule) are open-source, the de facto specification of the eDonkey network is defined by how the eMule client and the eDonkey server interact. By running an eDonkey server on a system with an Internet connection, any user can add a server to the network. Clients frequently update their local list of servers, as IP addresses continue to change. In the eDonkey network, files are uniquely identified by an MD4 digest; files with identical content but different names are therefore treated as the same file. Files are divided into “blocks” of 9,728,000 (9500 x 1024) bytes plus a residual block. A 128-bit MD4 checksum is computed for each block. Transmission errors may thus be detected, and corruption is confined to the block level rather than the file level. In addition, successfully downloaded blocks are available for sharing before the entire file is downloaded (similar to the BitTorrent protocol). Initially, the eDonkey protocol used a set of servers provided by users who donated the necessary bandwidth, processing power and disk space. Such servers generate heavy traffic and, as a consequence, are more vulnerable to attacks. As a result, Overnet2 was developed by MetaMachine, the initial developer of the eDonkey client, and Kademlia [45] was adopted by the eMule project. Kademlia is the serverless version of the eDonkey network, similar to Gnutella. Unlike Gnutella, Kademlia uses distributed hash tables (DHT – Distributed Hash Tables) to store information about files across the network. Thus, while Gnutella search operations are achieved by “flooding” the network with Query messages, Kademlia searching relies on a hash function that identifies the nodes on which information about file location is recorded. Location information is stored on several nodes, so that nodes may enter or leave the system without the information being lost. Joining the network is achieved, as in the case of Gnutella, through a bootstrap node known beforehand by the Kademlia client.
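The per-block hashing described above can be sketched as follows (a minimal example; note that Python's hashlib exposes MD4 only when the underlying OpenSSL build still provides it, so availability is an assumption):

    import hashlib

    BLOCK_SIZE = 9728000  # 9500 * 1024 bytes, the eDonkey block size

    def ed2k_block_hashes(path):
        # Hash each 9,728,000-byte block with MD4, so corruption can be
        # detected (and re-downloaded) at block rather than file level.
        hashes = []
        with open(path, 'rb') as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                hashes.append(hashlib.new('md4', block).hexdigest())
        return hashes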

2.2.5 DirectConnect

DirectConnect3 is a centralized protocol for P2P file sharing. Its architecture is similar to Napster's, but while Napster used a single central server (and therefore a single point of failure), DirectConnect provides multiple dedicated servers, called hubs, to which clients connect. DirectConnect is a text protocol; commands and information are sent in plain text, unencrypted. As clients connect to a central point of information distribution, the hub must have substantial bandwidth. There is no official specification of the protocol; each client or hub implementation obtained information about its functionality by reverse engineering the original protocol. The P2P part of the protocol is based, as in eDonkey's case, on the concept of a slot: a number that denotes how many users may simultaneously download files from the current user. These slots are controlled by the client. Downloads use TCP, while active requests use UDP. A client may be in one of two states: active or passive. Clients in the active state may download from anyone in the network; clients in the passive state may only download from active users.

1http://www.emule-project.net/
2http://www.overnet.org/
3http://dcplusplus.sourceforge.net/

Hubs hold information about clients and offer features such as search and chat. File transfers take place directly between clients, as in a common P2P system. Many hubs impose requirements on the number of files their members share, as well as restrictions on the quality and content of shared files. Additionally, a hub may appoint some users as operators to enforce certain rules. One problem of DirectConnect hubs is scalability, which the central-server architecture prevents.

2.2.6 Instant Messaging. Jabber/XMPP

One of the most common forms of P2P application is instant messaging. In the early age of the Internet, e-mail was inherently Peer-to-Peer: a message was sent through a connection from one user terminal to another. As time went by, however, communication between users became indirect. Most personal computers could not support a full communication protocol implementation and used dedicated servers (MTAs – Mail Transfer Agents) for sending messages. One problem of the e-mail service was that it did not act as a real-time service: messages could be delayed or even lost. The result was the inception of chat applications that allowed real-time communication. At the beginning, applications emerged in which communication was mediated by servers (server-relayed chat); an example is IRC (Internet Relay Chat). This type of application is not Peer-to-Peer. However, IRC is the precursor of a type of P2P usage that allowed nodes to discover each other and initiate contact using other client technologies. The term “chat” is used both for transmissions broadcast to multiple users (broadcast relay) and for private messages transmitted over a point-to-point connection. A separate group of applications is dedicated to the second category; in other words, “instant messaging” refers to types of chat that imply direct connectivity between communicating nodes. The main purpose of Jabber/XMPP (eXtensible Messaging and Presence Protocol)1 is to provide a “conversation technology” for P2P in the general sense, covering not only user-user communication but also user-application and application-application interaction. Jabber evolved as a project that brings together various applications to create an open and distributed P2P architecture as a basis for specialized actions. The design involved choosing XML as a consistent way of encapsulating information, regardless of the application, while allowing the addition of extensions. Application payloads are expressed in XML and translated transparently into the native format. Some other projects developed around Jabber are:

• gateways for a large number of instant messaging protocols;
• libraries for most programming languages;
• a modular open-source server and client implementations for any platform;
• a number of specialized server functions, such as language translation.

Special emphasis is given to user-facing applications and to designing a protocol that facilitates all forms of P2P communication. The first application of the Jabber technology is an instant messaging system that takes into account security, usability, network accessibility (from anywhere, using any type of device) and interoperability with Web-based instant messaging or telephony services. At the center of the open architecture lies the idea that communication takes place in XML format, enabling Jabber to work both as a storage service and as a relay. Meta-information and structure are similarly defined in XML, with established sets of tags to describe documents, audio, video, etc. Although Jabber can be used for multiple purposes, the most common implementation is an instant messaging client.
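As a concrete illustration of the XML encapsulation idea, the following sketch builds a minimal message stanza with Python's standard library (the addresses are hypothetical; a real client would send the stanza over an authenticated XML stream):

    import xml.etree.ElementTree as ET

    # Build a minimal XMPP message stanza; the JIDs are hypothetical and
    # the stanza would normally travel inside an authenticated stream.
    msg = ET.Element('message', attrib={
        'to': 'bob@example.org',
        'from': 'alice@example.org',
        'type': 'chat',
    })
    ET.SubElement(msg, 'body').text = 'Hello over XMPP'

    print(ET.tostring(msg, encoding='unicode'))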

1http://xmpp.org/

2.2.7 P2P Collaborative Spaces. JXTA

Many of the innovative aspects of the Internet have evolved from people's need to collaborate regardless of location. Peer technologies are natural solutions for informal collaboration and encourage the formation of collaborative groups. In the absence of physical presence, many people need to work together, and it is vital that there be no barriers in the way of effective communication. Informal P2P groups are among the best suited mechanisms for such work. Moreover, these informal groups are widely used by many people, regardless of the formal structures that may be centrally imposed. The trend is so pervasive that people don't realize what is happening: when asked, they will say that they follow the working methodology of a hierarchical structure; however, studies performed by outside observers show a rapid succession of contacts in P2P form. The JXTA project1 (short for “juxtaposition”, i.e. side by side) is a Sun Microsystems initiative that seeks to integrate P2P technology as a set of complementary distributed tools in the Internet. The basic platform is a fully decentralized network architecture; however, both centralized and decentralized services may be developed on top of it.

2.2.8 BitTorrent

One of the disadvantages of P2P systems is the “opening” for free riding [41], or free loading. While peers are generally assumed to be altruistic, this behavior is not enforced. Free riding is an egoistic behavior in which a peer gets information from other peers and gives nothing back. This behavior undermines the very nature of P2P systems: sharing information. If a given network contains a high number of free riders, the data load among peers becomes unbalanced. When a new node shares a popular file in a naive P2P system, it will attract a large number of connections; in the presence of free riding, this node must provide bandwidth for the additional clients. BitTorrent2 is a protocol designed to solve the free riding problem, though not completely [41]. The idea behind the protocol is fairly simple: the node that plays the role of the server breaks the file into pieces. If the file is requested by several clients simultaneously, each client will get a different piece of the file. When a client completes a piece, it allows other clients to download that piece while it continues downloading the next one. In other words, after receiving the first piece, the node acts simultaneously as a client and a server. The process continues until the download completes. The BitTorrent protocol is highly suitable for downloading large files, where the download process takes a long time; as a direct consequence, more “server” peers are available for a longer duration. More clients do not result in a performance decrease of the whole network, as the load is distributed. Consider a situation where the “server”, or seeder, possesses 4 pieces and each client (leecher) requests one piece from the server: a client eventually gets all pieces, but from different nodes. The order in which pieces reach the client may not respect the initial piece succession. The BitTorrent algorithm tries to maximize the number of pieces available for download at any point in time – that is, to balance the availability of pieces among the peers in a network. In order to download a file in a BitTorrent network, a user would follow the steps below:

• install a BitTorrent application;
• surf the web;

1http://java.net/projects/jxta
2http://www.bittorrent.com/

• find a file on the server;
• select the location where the file will be saved;
• wait for the completion of the process;
• instruct the application to close (the upload process, also called “seeding”, continues until the user closes the application).

BitTorrent Keywords

A peer is the basic unit of action in any P2P system and, as such, in a BitTorrent system. It is typically a computer system or program actively participating in sharing a given file in a P2P network. A peer is generally characterized by its implementation, download/upload bandwidth capacity (or limitation), download/upload speed, number of download/upload slots, geolocation and general behavior (aggressiveness, entry–exit time interval, churn rate). The context in which a BitTorrent content distribution session takes place is defined by a BitTorrent swarm: the ensemble of peers participating in sharing a given file. It is characterized by the number of peers, number of seeders, file size and peer upload/download speeds. A healthy swarm, one that allows rapid content distribution to peers, is generally defined by a good proportion of seeders and stable peers (peers that are part of the swarm for prolonged periods of time). A swarm consists of two types of peers: leechers and seeders. A seeder is a peer that has complete access to the distributed content and is thus only uploading data. A leecher has incomplete access to the distributed content and is both uploading and downloading. The BitTorrent negotiation protocol uses a form of tit-for-tat that forces peers to upload in order to download (though this has been shown to be open to abuse [41]), thereby ensuring fairness and rapid distribution [16]. A good number of healthy seeders is a requirement for healthy swarms that allow rapid distribution of data. A swarm is brought to life by an initial seeder – a peer sharing a file to which it has complete access. The seeding/sharing process within a swarm is bootstrapped through a metafile, a .torrent file, which defines the swarm trackers and the piece hashes that ensure content integrity. Typically, a peer uses a web server to download a .torrent file and then uses a BitTorrent client to interpret it and take part in a swarm. It is possible to create different swarms for the same content by using different trackers in a .torrent file. Communication between peers in a swarm is typically mediated by a BitTorrent tracker (or several trackers) defined in the .torrent file. The tracker is periodically contacted by peers to provide information regarding piece distribution within the swarm. A peer receives a list of peers from the tracker and then connects to these peers in a decentralized manner. As the tracker is centralized, the swarm may suffer if it is inaccessible or crashes. Several solutions have been devised, such as PEX (Peer Exchange)1 or DHT (Distributed Hash Table) [2]. As shown in Figure 2.2, there are two types of communication paths. The control path is established between each client and the tracker to request information about other peers in a swarm. The data paths are the typical paths used between peers, which exchange information according to the internal BitTorrent algorithm.
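The integrity role of the piece hashes can be illustrated with a small sketch (the helper name is hypothetical; pieces_field stands for the concatenated 20-byte SHA-1 digests stored in the metafile's info dictionary):

    import hashlib

    def verify_piece(piece_data, piece_index, pieces_field):
        # Compare the SHA-1 of a downloaded piece against the 20-byte
        # digest recorded for it in the .torrent metafile.
        expected = pieces_field[piece_index * 20:(piece_index + 1) * 20]
        return hashlib.sha1(piece_data).digest() == expected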

BitTorrent Internals

The core of the BitTorrent protocol is the tit-for-tat mechanism, implemented through choking and unchoking, which allows upload bandwidth to be exchanged for download bandwidth. A peer hopes that another peer will provide data; if that peer does not upload, it will be choked. Another important mechanism in BitTorrent is rarest piece first, which allows rapid distribution of content across peers.

1http://wiki.vuze.com/w/Peer_Exchange

Figure 2.2: Overview of BitTorrent Components

If a piece of the content is owned by only a small group of peers, it will be rapidly requested in order to increase its availability and, thus, the overall swarm speed and performance. The use of the rarest piece first algorithm means that the pieces of a file tend to be equally distributed among the peers in a swarm, providing rapid access to each of them. This approach has the downside of not delivering data sequentially, which is problematic for streaming applications; in such applications the algorithm is updated to use priority sets, and only pieces that are not part of a priority set are fetched in a “rarest piece first” manner. In order to allow client bootstrapping and access to piece information, new peers must be able to gather some initial data without giving anything back. To accomplish this bootstrapping phase, BitTorrent provides an “optimistic unchoke” slot to each client. This is also useful for establishing potentially better connections than the ones currently used. The use of choking and unchoking is the tit-for-tat component of the BitTorrent protocol. Peers reciprocate by uploading to peers which upload to them, with the goal of having multiple connections actively transferring in both directions. Clients choose peers to choke or unchoke based on their current transfer rate, such that the connections used tend to be the best available from a download speed point of view.
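A minimal sketch of rarest-piece-first selection (peer bitfields are modeled as plain sets of piece indices, an assumption for illustration; real clients work on binary bitfields):

    from collections import Counter

    def rarest_first(needed_pieces, peer_bitfields):
        # Count how many peers advertise each still-needed piece, then
        # pick the least replicated one so its availability increases.
        availability = Counter()
        for bitfield in peer_bitfields:
            for piece in bitfield:
                if piece in needed_pieces:
                    availability[piece] += 1
        candidates = [p for p in needed_pieces if availability[p] > 0]
        return min(candidates, key=availability.__getitem__) if candidates else None

    # Hypothetical swarm: three peers advertising the pieces they hold.
    print(rarest_first({0, 1, 2}, [{0, 1}, {0, 1}, {0, 2}]))  # 2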

2.3 Content Distribution in Peer-to-Peer Systems

Peer-to-Peer solutions have traditionally been used for file sharing among users in the Internet. Generally, file transfer occurs in small chunks (also called pieces) that are transferred or exchanged from one client to another. Two different pieces may be transferred from two different clients, depending on their availability. Piece transfer may not be (and generally isn't) sequential; that is, a piece is transferred according to its availability, peer approval and the implementation of the Peer-to-Peer protocol. As multimedia content forms a large part of the data distributed over the Internet, recent years have witnessed the rise of streaming solutions and their introduction into Peer-to-Peer systems. Both live streaming and on-demand video streaming have been added to existing Peer-to-Peer solutions, and new applications have been designed. We introduce the main methods for providing streaming in P2P systems: tree-based, multi-tree-based and mesh-based systems. Streaming video over the Internet has mostly relied on the client-server paradigm: a client connects to a stream server and video content is streamed from the server directly to the client. A specialized variant of the client-server model is the Content Delivery Network (CDN). In CDN-based solutions, the video source server pushes the content to a set of delivery servers which are subsequently accessed by clients. Such technology is employed by YouTube1.

2.3.1 Peer-to-Peer Streaming

In a P2P network the user may not only download a video asset but also upload it; a user thus becomes an active participant in the streaming process. Recent years have seen the emergence of video streaming applications based on P2P solutions. P2P streaming solutions [40] create an overlay network topology for delivering content (that is, a virtual network topology over a physical one). The overlay network typically follows one of two structures: tree-based or mesh-based. In a tree-based topology, data is pushed from the root node to its children nodes, then to their children and so on. Its main disadvantage is peer departure (peer churn): a peer departure temporarily disrupts video delivery to the child peers of the departed node. In a mesh-based topology, peers are able to communicate with other peers without having to adhere to static topologies. Peer relations are established based on data availability and network bandwidth. A peer periodically initiates connections to other peers in the network, exchanges information regarding data availability and pulls data from neighboring peers. This has the advantage of robustness to churn. It does, however, suffer from video playback quality degradation when no clear data distribution path exists. As described below, mesh-based and tree-based systems are used both for live streaming and for video-on-demand, depending on protocol internals and application design.

2.3.2 Video-on-Demand

Video-on-Demand (VoD) ensures the availability of the whole video at the time of the transfer. No content is generated or updated while data is transferred and rendered, which allows users to watch any point of the video at any time. As the peers in a given network are playing different parts of the file, pieces of that file have to be made available to other potential peers. A peer may discard pieces of the VoD asset as time goes by, or preserve the whole file for seeding. While mesh-based topologies are usually employed for VoD delivery, tree-based topologies are also used. A tree-based P2P VoD service arranges users in sessions based on their arrival time, using a threshold: users that arrive close to one another in time (within the threshold) constitute a session. The server streams the video over the base tree as in tree-based P2P live streaming. Users cache certain pieces/patches of the video to make them available to other users. Users thus provide two functionalities: base stream forwarding and patch serving. In Figure 2.3, there are two sessions (3 and 4), starting at times 20.0 and 31.0 respectively, with a threshold of 10. Each user is marked with its arrival time in the system. A straight arrow represents a base stream, while a dotted arrow represents a patch stream. Each user may receive information from multiple users and may also send information to multiple users. Mesh-based P2P VoD networks use swarming (a collection of cooperating peers) and fit naturally with BitTorrent-like protocols. In a mesh-based network, peers connect to other available peers depending on various factors such as availability, bandwidth and protocol internals. Data delivery is highly similar to a BitTorrent environment: there are seeders that possess all the data and leechers that possess only a subset of it. Data is fragmented into pieces which are subsequently delivered (upon request) to various peers.

1http://www.youtube.com/

Figure 2.3: Tree-based Video-on-Demand [40]
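The arrival-time grouping of Figure 2.3 can be sketched as follows (the arrival values other than 20.0 and 31.0 are hypothetical):

    def group_into_sessions(arrival_times, threshold):
        # Group users into VoD sessions: a new session starts whenever a
        # user arrives more than 'threshold' time units after the session
        # began (the first element of a session is its start time).
        sessions = []
        for t in sorted(arrival_times):
            if sessions and t - sessions[-1][0] <= threshold:
                sessions[-1].append(t)
            else:
                sessions.append([t])
        return sessions

    # Users arriving at 20.0 and 31.0 start separate sessions.
    print(group_into_sessions([20.0, 25.0, 31.0, 38.0], 10))
    # [[20.0, 25.0], [31.0, 38.0]]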

Figure 2.4: Video-on-Demand in BiToS [72]

One of the first attempts to design a mesh-based P2P VoD service was BiToS [72]. There are three components within the piece processing activity in BiToS, as seen in Figure 2.4: the receive buffer, consisting of all information received so far; the high priority set, consisting of the pieces required to allow smooth prerendering; and the remaining set. The optimal allocation of resources is an important design challenge.
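A minimal sketch of BiToS-style piece selection (the probability parameter p and the rarest-first ordering within each set are assumptions for illustration):

    import random

    def next_piece(high_priority, remaining, availability, p=0.8):
        # With probability p choose from the high-priority set (pieces
        # near the playback point), otherwise from the remaining set;
        # within a set, prefer the rarest piece in the swarm.
        pool = high_priority if (high_priority and random.random() < p) else remaining
        pool = [piece for piece in pool if availability.get(piece, 0) > 0]
        return min(pool, key=lambda piece: availability[piece]) if pool else None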

2.3.3 Live Streaming

Live streaming consists of a server generating video content in real time, with content dissemination carried out by the other peers (clients). Video playback is synchronized among all peers, unlike in a VoD-like network, where each peer may be positioned at a different point in the video. Live streaming usually makes use of tree-based structures: a root server constantly generates content in real time, and peers connect as clients, forming a tree. One of the basic frameworks for creating a tree-based overlay is IP-level multicast; due to lack of adoption, however, IP-level multicast has not been widely deployed, and the multicast function has instead been implemented at the application level. A tree-based approach may use either single-tree or multi-tree streaming.

Figure 2.5: Single-tree Streaming

In single-tree streaming (Figure 2.5), a tree is formed at the application level, rooted at the node acting as the server. Each user is a client and, subsequently, a server, joining the tree at a certain level: it receives the live stream from the node above and delivers the information to its children peers. A node receives data from a single parent node but may deliver it to multiple nodes. An important issue in single-tree streaming is tree construction. One would like both a small fan-out and a small number of tree levels. A small fan-out implies that nodes use their upload bandwidth to provide information to a small number of nodes; a small number of levels means reduced delay in propagating the stream. As one cannot achieve both a small number of levels and a small fan-out, a compromise must be made. Typically, one would determine the available upload bandwidth, compute the maximum fan-out and construct the tree accordingly. As more nodes join the network, the number of levels increases. Another important issue is tree maintenance. Churn is an integral part of a Peer-to-Peer network: peers are dynamic and may leave the network either gracefully or unexpectedly. The child peers of a departed node are no longer able to receive the stream, and the tree has to be reconstructed as fast as possible. This is done by reassigning the affected peers, as for the arrival of new peers. The recovery may be coordinated by a centralized server or performed in a decentralized fashion. The single-tree approach has important drawbacks. One is that such a topology cannot recover fast enough under frequent peer churn. Another is the lack of contribution from leaves: leaf peers make no use of their upload bandwidth, as no other peers are connected to them to request streaming information. The issue of leaf nodes is addressed through multi-tree based approaches. A server divides the streamed information into sub-streams; each peer joins all sub-streams, and information flows from the root node to the leaf nodes of each tree. A node need not (and usually should not) be placed in the same position in every tree: in one tree the node may be a leaf, meaning that it only receives information from that sub-stream, while in another it may be an internal node, using its upload bandwidth to provide information to other peers. Multi-tree based systems still suffer from peer churn, so tree maintenance has to be taken into account.
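The fan-out/level trade-off can be quantified with a small sketch (the bandwidth figures in the example are hypothetical):

    import math

    def max_fan_out(upload_kbps, stream_kbps):
        # A node can forward the full stream to at most this many children.
        return upload_kbps // stream_kbps

    def tree_levels(num_peers, fan_out):
        # Minimum number of levels in a tree of 'num_peers' nodes where
        # each node forwards to at most 'fan_out' children (fan_out >= 2):
        # a full tree with L levels holds (fan_out**L - 1)/(fan_out - 1) nodes.
        return math.ceil(math.log(num_peers * (fan_out - 1) + 1, fan_out))

    # A 2000 kbps uplink carrying a 500 kbps stream supports fan-out 4,
    # and 1000 peers then fit in 6 levels.
    print(max_fan_out(2000, 500), tree_levels(1000, 4))  # 4 6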

2.3.4 Existing Solutions

Several commercial solutions have been developed around the concepts of Peer-to-Peer streaming, making use of its scalability benefits and low distribution cost. One of the most successful media streaming platforms, Octoshape1, uses various optimizations to deliver high quality streaming to users. Octoshape was responsible for streaming events such as the US inauguration of President Barack Obama and the wedding of Prince William to Kate Middleton. Octoshape maintains strategic partnerships with different CDNs, allowing a distribution grid that minimizes cost while ensuring HD content availability to average users. It resembles a Peer-to-Peer engine that uses a plethora of high quality peers to deliver content all over the world. Its optimizations include a specialized Octoshape streaming protocol, multi-point failover, loss resilience and a secure suite of multicast technologies. While not a Peer-to-Peer streaming application, Akamai2 is an extensive content delivery network that positions servers in various locations to allow rapid access. Akamai created a globally distributed network of tens of thousands of servers, controlled by proprietary software. Akamai servers typically mirror content which is subsequently served to users: Akamai clients provide their content to Akamai servers, which mirror and distribute it across the Akamai network, and users typically access the server closest to their location. Peer-to-Peer video streaming networks include PPTV and PPS.tv, both solutions oriented towards the Chinese market. PPS.tv holds 8.9% of the market share in China. Through the use of P2P streaming it supports high-volume traffic, allowing several hundred users to watch a live stream at 300-500 Kbps with a server possessing only 5-10 Mbps. PPTV offers both live streaming and video-on-demand, covering a large spectrum of video assets for the Chinese audience. Recently, the PPLive team behind PPTV has migrated to a hybrid CDN-P2P cloud technology, called PPCloud, considered scalable and cost-effective. Several other solutions have failed to gain momentum due to legal issues and locality constraints. BitLet3, an abbreviation of BitTorrent Applet, is a BitTorrent program enabling the use of the file sharing protocol inside a Java-enabled web browser. The client downloads and uploads content as long as the program window is open. As a specialized functionality, BitLet allows streaming audio; it also possesses an experimental streaming video feature (based on Theora). RTMFP4 is a proprietary protocol developed by Adobe Systems that enables direct Peer-to-Peer communication between multiple Adobe Flash Players and applications built using the Adobe AIR framework. RTMFP retains the benefits of P2P-based streaming, such as reduced bandwidth cost and scalability. Cirrus5 is the technology that enables peer-assisted networking using RTMFP within the Adobe Flash platform. Cirrus 2, available in Flash Player 10.1, supports groups, allowing application-level multicast and reducing the load on the source publisher. BitTorrent DNA6 (BitTorrent Delivery Network Accelerator) is a program designed to speed up the viewing of streaming video through the use of the BitTorrent protocol. BitTorrent DNA is UDP based, replacing regular TCP-based bandwidth throttling with a much more sensitive technique. Users are able to download pieces of content from other users, relieving bandwidth pressure on the original content source. It is targeted at streaming and online video games.

1http://www.octoshape.com/
2http://www.akamai.com/
3http://www.bitlet.org/
4http://www.adobe.com/products/flashmediaserver/rtmfp_faq/
5http://labs.adobe.com/wiki/index.php/Cirrus
6http://www2.bittorrent.com/dna

2.4 Issues and Challenges in Peer-to-Peer Systems

The plethora of Peer-to-Peer solutions in the Internet makes it difficult to choose the best one for a given job. Each application or application type is specialized in a given kind of activity; the variations within each kind of application are themselves an issue when aiming to choose the best one. Measurements, metrics and evaluations have to be considered to allow a formal classification of these solutions. Streaming (both VoD and live) has been integrated into many Peer-to-Peer solutions. However, performance has been reported to still lag behind a medium-load HTTP streaming server and far behind a classical Content Delivery Network (CDN). Better bandwidth utilization, stronger peer incentives, new business models and hybrid P2P/CDN infrastructures may provide increased performance for Peer-to-Peer applications when dealing with streaming requests. In order to evaluate the various Peer-to-Peer solutions out there, one would turn either to experimentation and validation or to formal analysis. While formal analysis is achievable through a “pen and paper” approach, experimentation requires a large number of resources and carefully placed “information probes” to gather data. A common approach involves “hooking” probes into existing Peer-to-Peer swarms [31] and gathering information at those points. Another involves creating an infrastructure [14] able to realistically simulate a Peer-to-Peer network; this approach would, however, require a large number of systems – close to the size of an average swarm – in order to ensure a realistic trial environment. Apart from the technical challenges surrounding Peer-to-Peer based solutions, legal issues regarding data distribution are constantly present. As most content is owned by publishing houses, even tens of years after its creation, transferring that content is implied to be illegal. A typical BitTorrent network does not integrate features such as encryption, DRM or ownership information, meaning that it is highly plausible that, by downloading a piece of content, one runs into a legal issue. Often vague and inconsistent legislation only adds to the problem. The plethora of legal issues concerning BitTorrent is made apparent by the high number of legal disputes regarding various distribution sites, one of the most famous being The Pirate Bay1. Various Peer-to-Peer streaming solutions have failed to gain momentum, or have even been closed or disabled, due to legal troubles. The legal aspects of Peer-to-Peer streaming solutions make it more difficult to provide a successful business model. There are two approaches. The first considers using a proprietary and secure platform that selectively transfers information to users, possibly in encrypted form, such that other users in the swarm may not access that information. This approach is adequate for large media creators that want to deliver their content to users but want to keep the information private so that only paying users can access it – as is typical in CDN-based solutions such as Akamai. The second approach accepts that protecting content ownership and profiting from it directly is not a priority; rather, profit comes from side activities such as commercials, user credits, premium content or social networking features. This model is close to the functionality provided by YouTube, where all content is available, any user may choose to ingest content, and profit is made from commercials.
The selective insertion of commercials into video streams is a possible solution for profiting from “free” content in Peer-to-Peer networks. There are still lingering questions regarding the chances of success of any of the above solutions. Given the vague legislation and the continuous debates between anti- and pro-piracy groups, a suitable solution is yet to emerge. Even after 10 years of continuous development, Peer-to-Peer systems raise interesting questions and pose significant challenges to Internet scientists and engineers. Thorough analysis and evaluation, coupled with cleverly designed and efficient solutions, are key to improving Peer-to-Peer applications and their usage in the Internet.

1http://thepiratebay.org/

Chapter 3

Deploying Peer-to-Peer Network Infrastructures through Virtualization

Due to the large number of research and development efforts directed at designing and implementing new applications, hardware components, network protocols and services, the existence of suitable test environments is essential. Functionality testing, performance evaluation and user experience evaluation have to be undertaken to ensure the quality and compliance of future products. Test environments for software testing are nowadays an integral part of the software development cycle. Specialized systems or setups allow developers or quality engineers to subject their applications to various conditions in order to grade their functionality, standard compliance, performance etc. The rapid advance of virtualization technologies and cloud computing has enabled rapid deployment of testing environments and quick decommissioning when they are no longer required. With respect to network protocols and applications, two kinds of environments are generally used for testing, measurement, evaluation and analysis: lab setups and real world experiments. Lab setups, or experimental setups, are custom setups for protocol evaluation in which the experimenter has full control of the environment. This is typically the case for functionality testing or for experiments in which the experimenter needs full control of the process. The experimenter may use a lab-based infrastructure, a cluster/cloud based one, or one provided by the community, such as PlanetLab1. Rias [39] is an example of an overlay topology created on top of a PlanetLab infrastructure. Large scale scenarios are deployed in the Internet and allow limited control for the experimenter. Large BitTorrent experiments such as those employed by Pouwelse et al. [58] use existing infrastructures and participants. The purpose of large scale scenarios is to collect statistical information from real world sessions and disseminate it. The experimenter has little or no control over the setup; the purpose is to analyze a protocol, a class of protocols, applications or participants in a real world environment. Most challenges revolve around the ability to collect information, hampered by peer connectivity and relevance. Our work presents a custom Peer-to-Peer infrastructure to be used mostly for lab setups; it can also be integrated into the larger “Internet cloud” as part of a larger swarm. The main goal is to ensure an easily deployable and customizable platform for Peer-to-Peer experiments with the possibility of completely defining peer behavior and network characteristics. The infrastructure has been successfully employed in several experiments regarding BitTorrent implementations [16] and provided important results regarding peer and overall swarm performance.

1http://www.planet-lab.org/


The aim is to provide an automated, extensible, scalable, realistic, easy to use and deploy, efficient and general purpose infrastructure for Peer-to-Peer networks. The infrastructure uses virtualization [17] as the means of providing a large number of Peer-to-Peer “nodes” and employs a plethora of tools required to create a realistic environment, such as bandwidth limitation, node characteristic selection and automation. The infrastructure forms the basis for deploying medium-scale experiments that provide valuable insight into the behavior of Peer-to-Peer protocols through the analysis of protocol parameters.

3.1 Virtualization Solutions

An important “buzzword” of the last decade, virtualization technology has developed from its theoretical base in the late 70s into a diversity of full-fledged implementations. With the continuously increasing capacity of HDD space, memory and CPU power (mostly in number of cores), virtualization solutions provide an effective way of properly allocating resources. New features such as hardware-assisted virtualization, I/O improvements and live migration have ignited the demand for efficient virtualization solutions that consolidate current and future hardware resources. Cloud computing, an already established technology in the modern Internet, makes heavy use of virtualization in order to satisfy user demands while cutting hardware and software infrastructure costs. Major players such as Amazon, Microsoft and Rackspace deploy important virtualization solutions such as Xen, KVM and Hyper-V. Generally speaking, virtualization refers to mechanisms for providing a similar interface on top of an existing entity. That entity may be a program (software virtualization, e.g. the Java Virtual Machine), a concept (e.g. memory virtualization) or a computer system (hardware virtualization). We will focus on hardware virtualization, as it is the most common use of the term and as it forms the basis of our work. Although it is difficult to give a proper definition of virtualization, Dittner et al. [22] offer the following: a framework or methodology of dividing the resources of a computer hardware into multiple execution environments, by applying one or more concepts or technologies such as hardware and software partitioning, time-sharing, partial or complete machine simulation, emulation, quality of service, and many others. Virtualization solutions allow instances of operating systems to run on top of another operating system. The base operating system offers an interface very similar to the hardware it runs on. The base OS or application that offers the hardware-like interface is called a virtual machine monitor (VMM) or hypervisor. Depending on the virtualization solution, the hypervisor may be an operating system running directly on top of the bare hardware (e.g. Xen, Hyper-V) or an application running within an operating system (e.g. VMware Workstation, OpenVZ). The existence of the VMM, just like the kernel mode/user mode separation, is possible in modern processors because of the existence of a supervisor mode allowing privileged instructions. Modern virtualization solutions try to execute as many of the virtual machine's instructions as possible on the native hardware. Recent additions to the x86 architecture, such as Intel VT and AMD-V, have enabled proper hardware-assisted virtualization with high performance. The performance of virtual machines has been considered by Bratanov et al. [9], who created a testing and benchmarking environment and considered aspects such as overhead, precision and scope.

3.1.1 Benefits of Virtualization

Virtualization solutions have made their way into day-to-day use as they provide important benefits, especially regarding costs. A given hardware system may be able to run different instances of operating systems where, in the absence of virtualization, more systems would be required. Three important benefits have been identified [22] for virtualization:

• consolidation;
• reliability;
• security.

Consolidation, a particularly important aspect in the business world, allows the “unification” of resources on top of a few physical platforms. The ever increasing CPU power, I/O speed and memory capacity may now be used to provide sufficient resources for multiple virtualized operating systems. Modern data centers must be powered even when nothing is really running, resulting in infrastructure waste; the use of virtualization solutions and migration techniques allows the consolidation of multiple virtual servers on shared hardware. Consolidation also allows old applications to run on legacy systems. Migrating an application to another server raises compatibility issues; with the help of virtualization, applications keep running on the same virtualized platform, which may itself be migrated to a different physical platform. Last but not least, development and test environments may be easily deployed and decommissioned. As development and test environments do not imply perennial use, unlike production environments, one may allocate several virtual machines, consolidate the workload on a few physical servers and then scrap or disable those virtual machines and reclaim the resources. This allows both flexibility of the development and testing environment and infrastructure savings. Virtualization solutions provide reliability through isolation. A failure in a given virtual machine will not affect another virtual machine. The “partitioning” employed by a virtualization solution means that each virtual machine runs on dedicated, specialized, simulated hardware. The isolation and partitioning allow the dynamic allocation of new systems whenever required, without the need to acquire additional hardware. Just as typical data centers provide failover servers to maintain system availability, virtualization may be used to provide software based provisioning of additional systems when required. In a typical cluster, a failover node has to be active and online; using virtualization, failover nodes may be collocated on fewer physical hosts, reducing hardware investments. The isolation provided by virtualization technology is a key feature for ensuring the security of virtual machines. If a given virtual machine faults or has been compromised, it does not affect other systems and, if needed, it may be rapidly disabled. This would be very difficult to achieve on a physical infrastructure. Each virtual machine possesses its own security settings and configuration directives. This means that each virtual machine may be administered separately: an administrative (root) account on a given virtual machine has no impact on another machine. Each administrator configures their own virtual machine as needed, leaving the physical platform to be administered by the solution provider. Virtualization thus allows the delegation and isolation of full administrative rights on the same hardware system.

3.1.2 Types of Virtualization Solutions

As mentioned before, virtualization, in the general sense, may refer to hardware/server virtualization, software/application virtualization, or storage/memory/network virtualization. As this work uses a particular type of server/hardware virtualization (namely operating system level virtualization), focus will be given to this. The most dominant form of virtualization, server virtualization, allows the coexistence of multiple operating systems on the same physical platform. A Virtual Machine Monitor (VMM) or hypervisor provides a hardware-like interface for the virtual machines and the operating systems running within them.

Machine Monitor (VMM) or hypervisor provides a hardware like interface for virtual machines and the operating systems running within those. Although various classifications exist, we will present the types described by Dittner et al. [22]: • full virtualization; • paravirtualization; • native virtualization; • operating system virtualization. Full virtualization is a virtualization technique that provides complete simulation of the underlying hardware. The result is a system in which all software capable of execution on the raw hardware can be run in the virtual machine. Full virtualization has the widest range of support in guest operating systems. Due to the specifics of the x86 architecture, full virtualization may not be implemented in its pure form. Examples of full virtualization solutions are host-based virtualization solutions such as VMware Workstation, Virtual Box, Microsoft Virtual PC. Paravirtualization is a virtualization technique that provides partial simulation of the underlying hardware. Most but not all of the hardware features are simulated. The key feature is address space virtualization, granting each virtual machine its own unique address space. However, without proper hardware support, operating systems running in paravirtualized virtual machines have to be modified in order to run, which may pose problems to the solution provider. Example of such solutions are Xen, Microsoft Hyper-V, VMware ESX(i), Parallels Workstation. Native virtualization is the newest to the x86 group of virtualization technologies. Often referred to as hybrid virtualization, this type is a combination of full virtualization or paravirtualization and I/O acceleration techniques. Similar to full virtualization, guest operating systems can be installed with- out modification. It takes advantage of the latest CPU technology fo x86, Intel VT and AMD-V. Linux KVM1 is the most well known implementation of native virtualization with VMware Workstation, Xen and Microsoft Hyper-V enabling it in recent versions. Operating system-level virtualization is a virtualization technique where the operating system kernel allows multiple isolated (jailed) user-space instances. Virtual machines are commonly called containers, virtual environments (VEs) or virtual private servers (VPS). OS-level virtualization allows all processes to run on top of the basic operating system kernel thus enabling minimal overhead. While in normal virtualization solutions, a request would pass through the virtual machine kernel and then, if required, through the hypervisor, an OS-level virtualization solution required passage through the base kernel. Example solutions include OpenVZ, LXC, BSD jails, Linux V-Server. A graphical depiction of OS-level virtualization solutions is presented in Figure 3.1. Due to its low overhead, OS-level virtualization is the preferred form of virtualization for virtual private servers and virtual hosting. Its good isolation, low overhead and efficiency allows the rapid creation and deployment of a large number of containers. Its main disadvantages are the lack of flexibility in running different kinds of operating systems (only an operating system that may use the base kernel can be run) and the fact that a kernel error is common to all containers and base OS. The advantages of OS-level virtualization have boosted it as the number one solution for creating the P2P network infrastructure described in this work. 
With low overhead and easy deployment, OS-level virtualization allows the creation of a high number of virtual machines and required re- sources. The infrastructure has been deployed and put to use and results are excellent [17]; while some issues still persist (such as the inability to use traffic shaping tools within a containers) the benefits of using this technology outweigh the disadvantages. A case study of the scalability of OS-virtualization solutions for P2P environments has been presented by Bardac et al. [3].

1http://www.linux-kvm.org/

Figure 3.1: Operating System-Level Virtualization

3.1.3 OpenVZ

As OS-level virtualization had proven to be the way to go for building a Peer-to-Peer infrastructure, we had to choose between existing solutions such as OpenVZ, LXC and V-Server. As we have good experience with Linux, the solution had to be Linux-based. Though LXC currently has the advantage of being included in the mainline kernel, it was still in its infancy at the time we started building the infrastructure; apart from that, its documentation is quite scarce. Between OpenVZ and V-Server, we chose OpenVZ due to its high quality documentation and to peer recommendation and experience.
Just as other OS-level virtualization solutions, OpenVZ is a series of patches to the Linux kernel that enhance process and resource isolation among containers. Major Linux distributions typically ship an OpenVZ-enabled kernel that is easily installable. A series of user-space tools allow the installation and configuration of virtual machines. Typically, most commands and keywords start with the vz prefix. OpenVZ is an open-source project provided by Parallels Inc. with the help of an online community. Parallels provides a commercial product, Virtuozzo, as an extended version of the OpenVZ project.
OpenVZ uses specific nomenclature and identifiers for physical platforms and containers. The physical platform is typically called the hardware node and is referred to as HN. As it is the container manager, it is also referred to as CT0 or VE0. Virtual machines are typically called containers (CTs) or virtual environments (VEs) (the preferred term is container). Each container is identified by a number, called VEID or CTID (container identifier). CTID 0 is reserved for the hardware node.
User-space tools (such as vzctl) are used for creating and configuring OpenVZ containers. From a simplistic point of view, a container is a Linux filesystem structure. The creation of a container requires creating a complete Linux filesystem and then creating an OpenVZ-specific configuration file. The Linux filesystem is typically distributed in tarballs, also called OpenVZ templates. A template is uncompressed and deployed as a container. For each OpenVZ container a user may configure:
• hostname;
• network interfaces;
• mount point;
• quotas;
• modules;
• resource limits.
The ability to limit resources is one of the most important OpenVZ features with respect to building a testing environment. OpenVZ uses user beancounters (UBCs) as its resource management feature. The interface is accessible in OpenVZ containers as the /proc/user_beancounters entry, soon to be changed to the /proc/bc/ folder. The resource management interface of OpenVZ allows the limitation of a variety of resources such as memory, number of processes, number of sockets, number of open files, number of terminals and the size of send/receive buffers for TCP and UDP communication.
With respect to networking, OpenVZ provides two types of network interfaces: venet and veth. The venet interface is a basic interface that enables the host network to act as a router for the container. The container uses a virtualized direct link on an interface dubbed venet0. A more flexible configuration, also used in our infrastructure, is the veth interface. This interface is presented as a virtualized Ethernet interface in the container and enables bridging between different containers, even those residing on different hardware nodes. A configuration sketch using vzctl is given below.
As an OS-level solution with easy network configuration and a flexible resource limit configuration interface, OpenVZ was considered the most suitable solution for deploying a virtualized network infrastructure. The later chapters provide insight into the process and tools employed for creating the infrastructure and deploying Peer-to-Peer swarm scenarios.
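To make the configuration interface concrete, the console sketch below creates and tunes a container; the template name, CTID and limit values are illustrative, not taken from our deployment:

vzctl create 101 --ostemplate debian-5.0-amd64-minimal
vzctl set 101 --hostname p2p-test-01 --save
# add a veth-type interface (eth0 inside the container), bridgeable on the host
vzctl set 101 --netif_add eth0 --save
# UBC limits are specified as barrier:limit pairs
vzctl set 101 --numproc 200:200 --save
vzctl set 101 --tcpsndbuf 1048576:2097152 --save
vzctl start 101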

3.2 Virtualized Peer-to-Peer Infrastructure Setup

Creating a virtualized environment requires the hardware nodes where the virtual machines will be deployed, the network infrastructure, a set of OpenVZ templates for installation and a framework that enables commanding clients inside the virtual machines. Each virtual machine runs a single BitTorrent application that has been instrumented to use an easily-automated CLI.
OpenVZ's advantages are low resource consumption and fast creation times. As it uses the same kernel as the host system, OpenVZ's memory and CPU consumption is limited. At the same time, an OpenVZ filesystem is a sub-folder in the host's filesystem, enabling easy deployment of and access to the VE. Each VE is thus a part of the main filesystem and can be encapsulated in a template for rapid deployment: one simply has to uncompress an archive, edit a configuration file and set up the virtual machine (hostname, passwords, network settings).
OpenVZ's main limitation is the environment in which it runs: the host and guest system must both be Linux. At the same time, certain kernel features that are common in a hardware-based Linux system are missing by design: NFS support, NAT etc.
Despite its limitations, OpenVZ is the best choice for creating a virtualized environment for evaluating BitTorrent clients. Its minimal overhead and low resource consumption enable running tens of virtual machines on the same hardware node with little penalty.

3.2.1 Overall View

Through experimentation, we concluded that a virtualized testing environment based on OpenVZ provides testing capabilities similar to those of a non-virtualized cluster at no more than 10% of the cost.

Figure 3.2: Infrastructure Overview

Our experimental setup, consisting of just 10 computers, is able to run at least 100 virtualized environments with minimal loss of performance. Figure 3.2 presents a general overview of the BitTorrent Testing Infrastructure.
The infrastructure is designed to run on commodity hardware systems. Each system uses the OpenVZ virtual machine implementation to run multiple virtual systems on the same hardware node. Each virtual machine contains the basic tools for running and compiling the BitTorrent clients. The BitTorrent implementations have been instrumented for automated commanding and for outputting the status and logging information required for subsequent analysis and result interpretation. As the infrastructure aims to be generic across different implementations, the addition of a new BitTorrent client only means adding the required scripts and instrumentation.
Communication with the virtual machines is enabled through the use of DNAT and iptables. tc (traffic control) is used for controlling the virtual links between virtual systems. Each virtual machine uses a set of scripts that enable starting, configuring, stopping and gathering results from BitTorrent clients. A test/command system can be used to start a series of clients automatically through a scripted interface. The command system uses SSH to connect to the virtual machines (SSH is installed and communication is enabled through DNAT/iptables) and command the BitTorrent implementations. The SSH connection uses the virtual machines' local scripts to configure and start the clients.
The virtualized environment is thus a cheaper and more flexible alternative to a full-fledged cluster, with little performance loss. Its main disadvantage is the asymmetry between virtualized environments that run on different hardware systems. The main issue is the difference in network bandwidth between VEs running on the same hardware node and VEs running on different hardware nodes. This can be corrected by using traffic limitation, ensuring complete network bandwidth symmetry between the VEs.

3.2.2 Virtualized Network Configuration

To simulate real network bandwidth restrictions we use Linux traffic control (the tc tool) or client-centric options to limit peer upload/download speed. As virtualized systems are usually NAT-ed, iptables is also used on the base stations. tc rules are applied both to OpenVZ containers and to physical systems in order to ensure bandwidth limitation. As all stations use common scripts and the same BitTorrent clients, important parts of the filesystem are accessed through NFS (Network File System). Thus, in the case of 100 virtualized systems, only one of them actually stores configuration, executable and library files; the other systems use NFS. Easy system administration has been ensured through the use of cluster-oriented tools such as Cluster SSH or Parallel SSH.
As each container runs on top of a hardware node (base system), most connections use the hardware node as a switch/router. The basic venet interface allows the container to use the hardware node as a gateway, with the help of the iptables tool. Connections through the "hardware node gateway" are thus NAT-ed. This poses some configuration overhead, as peers must enable upload slots that must pass through the gateway.
In order to remove the overhead of configuring NAT, a different approach may be undertaken: the use of veth interfaces and host bridges. veth interfaces provide the containers with additional interfaces and may be bridged together using brctl on the host system. The host system acts as a switch for containers. At the same time, all containers, even when residing on different systems, are able to communicate with each other as part of the same LAN. All host systems and the physical interconnecting switch act as a switched network topology. This makes it very easy to achieve rapid connectivity among peers running in different OpenVZ containers.
In order to ensure rapid communication of information among containers, NFS (Network File System) has been employed. NFS is used to share scripts and repositories among physical systems and among containers. Each update is instantly available on all systems. Apart from the update improvement, space is also saved by unifying similar content in a single storage area. The NFS server runs on one of the base systems/hardware nodes; mount points are defined on each of the other base systems and on every container.
A specialized bin/ folder stores useful scripts for the virtualized network infrastructure. Various scripts account for various functionalities, either for the base system or for the containers on top of it (a bridging and traffic limitation sketch follows the list):
• creating and destroying OpenVZ containers;
• basic network configuration for each container;
• checking Internet connectivity for OpenVZ containers;
• checking container-to-container connectivity using host bridges and veth interfaces;
• running a command on each container;
• copying a file to each container;
• applying and testing tc-based bandwidth limitation rules.
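The host-side commands below sketch the bridged veth setup and a tc-based limit; the bridge name and rate values are illustrative, and the interface name follows OpenVZ's usual host-side veth naming:

brctl addbr vzbr0
# veth101.0 is the host-side end of container 101's veth interface
brctl addif vzbr0 veth101.0
ifconfig vzbr0 up
# token bucket filter limiting the container link to 1 Mbit/s
tc qdisc add dev veth101.0 root tbf rate 1mbit burst 32kbit latency 400ms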

3.2.3 Virtualized Hardware Configuration

The physical infrastructure is currently hosted in the UPB NCIT cluster1 and consists of 10 identical hardware systems. A thin OpenVZ virtualization layer allows easy multiplication of base systems. We are able to safely deploy 100 virtualized systems; most scenarios use a virtual environment as a sandbox for running a single BitTorrent client. Tools such as brctl, iptables or tc have been employed for ensuring proper network configuration between virtual environments.

1http://cluster.grid.pub.ro/

Our current setup consists of 10 computers, each running 10 OpenVZ virtual environments (VEs). All systems are identical with respect to CPU power, memory capacity and HDD space and are part of the same network. The network connections are 1Gbit Ethernet links. At the time of this writing, the BitTorrent Testing Infrastructure consists of 8 hardware systems running Debian GNU/Linux 5.0 (Lenny). Any other Linux distribution can be used, as the tools used in the framework are common among distributions. The hardware specifications are:
• HDD – SATA Western Digital, WDC WD3000JD-98K, 300GB (279GiB)
• Intel Pentium 4 CPU 3.00GHz, dual-core, clock 800MHz, 64 bits
• 2GiB RAM, DIMM 533MHz
• L1 cache – 16KiB, L2 cache – 2MiB
• 82545GM Gigabit Ethernet Controller, driver=e1000, speed=1Gbit/s
• Motherboard, D2151-A1, FUJITSU SIEMENS
All systems run the same operating system (Debian GNU/Linux Lenny/testing) and the same software configuration.
With 5 VEs active and running a BitTorrent client, memory consumption is 180MB-250MB per system. With no VE running, the memory consumption is around 80MB-110MB. The BitTorrent client used pervasively is hrktorrent1, a libtorrent-rasterbar2 based implementation. hrktorrent is a light addition to the libtorrent library, so memory consumption is quite small while still delivering good performance. On average, each virtual machine uses 40MB of RAM when running hrktorrent in a BitTorrent swarm.
The current partitioning scheme for each system leaves 220GB of HDD space for the VEs. However, one could safely upgrade that limit to around 280GB. Each basic complete VE (that is, one with all clients installed) uses 1.7GB of HDD space. During a 700MB download session, each client outputs log files using around 30-50MB of space. Processor usage is not an issue, as BitTorrent applications are mostly I/O intensive.
In order to install a new virtual machine, one should use an OpenVZ template, allowing for easy installation. A template contains all basic clients and tools necessary for running automated tests. The deployment of a new virtual machine is enabled through the vzctl create command. We recommend the use of the makevz and destroyvz scripts from the repository, which also configure all necessary aspects of the virtual machines (hostname, memory, disk space, iptables rules etc.):

p2p-next-09:~/bin# ./makevz 106 p2p-next-09-01
Creating VE private area (debian-4.1-amd64-p2p-clean)
Performing postcreate actions
VE private area was created
Saved parameters for VE 106
Saved parameters for VE 106
Saved parameters for VE 106
Saved parameters for VE 106
Saved parameters for VE 106
Saved parameters for VE 106
Saved parameters for VE 106
Saved parameters for VE 106
Starting VE ...

1http://50hz.ws/hrktorrent/
2http://www.rasterbar.com/products/libtorrent/

VE is mounted
Adding IP address(es): 172.16.10.6
Setting CPU units: 1000
Configure meminfo: 131072
Set hostname: p2p-next-09-01
File resolv.conf was modified
VE start in progress...
p2p-next-09:~/bin# vzlist
VEID  NPROC  STATUS   IP_ADDR      HOSTNAME
101   7      running  172.16.10.1  p2p-next-09-01
102   7      running  172.16.10.2  p2p-next-09-02
103   7      running  172.16.10.3  p2p-next-09-03
104   7      running  172.16.10.4  p2p-next-09-04
105   7      running  172.16.10.5  p2p-next-09-05
106   7      running  172.16.10.6  p2p-next-09-06
p2p-next-09:~/bin# ls /home/p2p/ve/
101 102 103 104 105 106
p2p-next-09:~/bin# ./destroyvz 106
Stopping VE ...
VE was stopped
VE is unmounted
Destroying VE private area: /var/lib/vz/private/106
VE private area was destroyed
p2p-next-09:~/bin# vzlist
VEID  NPROC  STATUS   IP_ADDR      HOSTNAME
101   7      running  172.16.10.1  p2p-next-09-01
102   7      running  172.16.10.2  p2p-next-09-02
103   7      running  172.16.10.3  p2p-next-09-03
104   7      running  172.16.10.4  p2p-next-09-04
105   7      running  172.16.10.5  p2p-next-09-05
p2p-next-09:~/bin# ls /home/p2p/ve/
101 102 103 104 105

Most OpenVZ functionality is enabled through the vzctl parent command. Apart from the create and destroy subcommands, other subcommands are useful:
• enter <veid> – for entering the virtual machine with the given identifier;
• start | stop | restart <veid> – for starting, stopping or restarting a virtual machine;
• exec <veid> command – for executing a certain command on a virtual machine;
• set <veid> [parameters] --save – for altering the configuration of a virtual machine.

p2p-next-09:~/bin# vzctl enter 101
entered into VE 101
p2p-next-09-01:/# hostname
p2p-next-09-01
p2p-next-09-01:/# logout
exited from VE 101
p2p-next-09:~/bin# vzctl exec 101 'ps -ef'
UID    PID    PPID  C  STIME  TTY  TIME      CMD
root   1      0     0  Feb27  ?    00:00:12  init [2]
root   340    1     0  Feb27  ?    00:00:01  /sbin/syslogd
102    346    1     0  Feb27  ?    00:00:00  /usr/bin/dbus-daemon --system
root   353    1     0  Feb27  ?    00:00:00  /usr/sbin/sshd
avahi  360    1     0  Feb27  ?    00:00:00  avahi-daemon: running [p2p-next-09-01.local]
avahi  361    360   0  Feb27  ?    00:00:00  avahi-daemon: chroot helper
root   376    1     0  Feb27  ?    00:00:08  /usr/sbin/cron
root   30709  0     0  09:57  ?    00:00:00  ps -ef

A useful command is vzlist, allowing the listing of all virtual machines installed on the hardware node:

p2p-next-09:~/bin# vzlist
VEID  NPROC  STATUS   IP_ADDR      HOSTNAME
101   7      running  172.16.10.1  p2p-next-09-01
102   7      running  172.16.10.2  p2p-next-09-02
103   7      running  172.16.10.3  p2p-next-09-03
104   7      running  172.16.10.4  p2p-next-09-04
105   7      running  172.16.10.5  p2p-next-09-05

3.3 Evaluating Virtualization

Given the plethora of virtualization solutions, an infrastructure for deploying Peer-to-Peer scenarios has to take into account the adequacy of each solution. Through empirical studies we have chosen OpenVZ as a suitable solution for our needs, lacking, however, a formal evaluation method. Considering the use of virtualization solutions for our purpose, we consider three important dimensions:
• efficiency (scalability) – how many virtual machines/containers may be deployed on a virtualization host while still allowing proper simulation of an environment;
• isolation – how well virtual machines' resources are separated;
• reliability – how many software crashes happen for a given solution; these may be due to the implementation or to resource overuse/abuse.
Similar metrics have been proposed by Ismail et al. [32]. They have experimentally measured and defined overhead, variance and isolation metrics in virtualization, using KVM for testing and experimenting. Our proposed efficiency metric is similar to their variance metric, while their isolation metric is mostly concerned with CPU and memory fairness among various VMs. Their variance metric is mainly concerned with the resources used by each virtual machine; the efficiency/scalability metric we introduce refers to the number of virtual machines that may be deployed. Wood et al. [74] have investigated VMware ESX and Xen and created an I/O model and profile. An approach similar to ours was undertaken by Soltesz et al. [65]. They considered Xen and Linux VServers as the representatives of para- and paene-virtualization. Their metrics included performance, isolation and scalability, with meanings similar to ours.
Efficiency is a sheer measure of deploying large numbers of virtual machines and containers on top of a given hardware system. Padala et al. [55] have shown that OpenVZ possesses a smaller overhead compared to Xen. However, no formal method was employed and their study does not take into account other OS-level virtualization solutions. A formal measurement of virtualization efficiency/performance should consider three aspects:
• hardware resources;
• software implementation (in our case, BitTorrent client implementation);
• virtualization solution.

As such, we formalize efficiency as a function of the three dimensions above:

Eff = f(HW, SW, VS) (3.1)

Eff = f(RAM, HDD, CPU, NET, OS, PS, BT, VS, NVM) (3.2)

Eff = VMB / HNB (3.3)

where:

• Eff: efficiency/performance
• HW: hardware resources
• SW: software implementation
• VS: virtualization solution
• RAM: system RAM
• HDD: I/O space
• CPU: processor power
• NET: networking features
• OS: operating system implementation
• PS: basic container processes
• BT: BitTorrent implementation
• NVM: number of virtual machines
• VMB: virtual machine behavior
• HNB: hardware node behavior

Measuring efficiency must then consider the behavior of the system and how similar a container/virtual machine is to an actual system. For that, one may take into account resource usage, resource contention and responsiveness. These may be measured through experimental means and, subsequently, an approximate empirical formula, taking into account all the variables mentioned above, may be determined. For example, if a reference BitTorrent download behaves almost identically inside a container and on the hardware node, the VMB/HNB ratio approaches 1 and the solution may be considered efficient for that workload.
Isolation is a means of stating how well a virtual machine is separated from other virtual machines and from the base system. As OpenVZ VEs use the same kernel, their isolation value is obviously lower than that of Xen or KVM. Isolation serves as an important dimension for measuring the adequacy of a virtualization solution. The ability to completely isolate a virtual machine and properly specify its resource usage allows a better resemblance to a bare hardware system.
While we find it difficult to properly formalize isolation, we consider achievable a "virtualization isolation scale" on which virtualization solutions may be compared against each other: a relative value rather than an absolute one. Thus we consider it safe to state that:

Iso(normal processes) < Iso(chroot) < Iso(OpenVZ, LXC) < Iso(Xen, KVM) (3.4)

A careful analysis of virtualization solutions, of the facilities they employ and of how well separation is achieved and resource limitation enforced must be undertaken. It will allow properly placing virtualization solutions on this relative scale.
Reliability must be considered when dealing with a heavily used infrastructure and when evaluating multiple virtualization solutions. In our experience we have encountered several software problems, such as disk inconsistency, operating system level errors and lack of responsiveness due to the high (ab)use of hardware resources. Our OpenVZ-based infrastructure must be evaluated accordingly. Fortunately, there are a number of well defined and thoroughly analysed measures, such as failure rate or mean time between failures, that can be taken into account. These have to consider the evaluation environment, as with the efficiency dimension:

Rel = f(HW, SW, VS) (3.5)

Rel = f(RAM, HDD, CPU, OS(filesystem), PS, BT, VS, NVM) (3.6)

Properly defined and evaluated dimensions such as efficiency, isolation and reliability provide an overall view of virtualization solutions and of their adequacy for our specific purpose. We consider there is no "silver bullet" solution, but rather solutions that prove suitable for a given class of necessities. The weight one assigns to each of the dimensions depends on one's needs and environment specifics. Moreover, it may not be feasible (nor possible) to extract a formula taking into account all input data. One should rather assign a numeric value to each metric and compare these values, so that one may conclude that changing one input variable in one direction (increase or decrease) has a certain impact on the metric. This is similar to the approach of Ismail et al. [32].
We have experimented with OpenVZ, LXC and Xen-based environments in order to create a proper experimental testbed for the above metrics.
In OpenVZ's case, our study focused on efficiency/scalability. Active VEs running the hrktorrent client use about 70 to 170MB of RAM, which gives a rough estimate of 20MB to 40MB of memory consumption per VE. The base system also uses at most 120MB of RAM. From the HDD perspective, the base system can be tuned to use 20GB of space with no major constraints on the software configuration. Each complete VE able to run all CLI-based BitTorrent clients uses 1.7GB of space. At the same time, 1GB of space should be left for each system for testing and logging purposes and 5GB for file transfer and storage. This means that 8GB of space should be reserved for each VE.
The above values are very lax; a carefully tuned system would manage to use fewer resources. However, we aim to show that, even given these high values, an average PC can still sustain a significant number of VEs with little overhead and resource penalty. Table 3.1 gives a lower limit for the number of OpenVZ virtual environments able to run on a basic PC. Italics mean the limitation is due to RAM capacity, while bold means the limitation is due to HDD space.

HDD \ Memory  1GB  2GB  4GB  8GB  16GB
80GB          7    7    7    7    7
120GB         12   12   12   12   12
200GB         22   22   22   22   22
300GB         22   35   35   35   35
500GB         22   47   60   60   60
750GB         22   47   91   91   91
1TB           22   47   97   122  122

Table 3.1: Scalability in OpenVZ (Number of Containers)
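The entries in Table 3.1 follow directly from the budgets above. A minimal Python sketch of the arithmetic, assuming 40MB of RAM and 8GB of disk per VE, 120MB of RAM and 20GB of disk for the base system, and treating 1GB of RAM as 1000MB (which matches the table values):

def max_containers(ram_gb, hdd_gb):
    # RAM budget: 120MB for the base system, at most 40MB per VE
    ram_limit = (ram_gb * 1000 - 120) // 40
    # HDD budget: 20GB for the base system, 8GB reserved per VE
    hdd_limit = (hdd_gb - 20) // 8
    return min(ram_limit, hdd_limit)

# e.g. max_containers(4, 1000) == 97 (RAM-limited),
#      max_containers(8, 1000) == 122 (HDD-limited)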

In the case of LXC, we ran a scenario involving the Unix gzip command, used for file compression. This allowed us to measure both CPU and hard-disk related values concerned with scalability. The tables below present CPU usage, I/O operations per second and disk busy percentage.
Experimentation with LXC shows a linear growth of CPU usage and disk utilization. Memory is always consumed up to the maximum available, so it is not taken into account. LXC is able to scale linearly for a heavily used application, with the possibility of better-than-linear scaling provided that not all containers are heavily used at the same time.

No. containers  Average (%)  Maximum (%)  Minimum (%)  Time (s)
2               19.2         21.3         18.7         2626
4               39.0         44.0         35.4         2786
8               82.3         95.6         70.6         2401

Table 3.2: LXC CPU Usage for gzip

No. containers  Average (ops/s)  Maximum (ops/s)  Minimum (ops/s)  Time (s)
2               41.5             104.4            17.0             2626
4               39.0             174.2            29.2             2786
8               176.6            275.4            84.6             2401

Table 3.3: LXC I/O Operations for gzip

In the case of Xen we used four virtual machines, each using 256MB of RAM and running on a single core of the base system. These virtual machines were running ffmpeg, an application for video conversion. Results for CPU and disk usage are presented in Table 3.5. The evolution of the various metrics is slightly over-linear for disk usage and close to linear for running time.
Xen scaling is similar to LXC when running resource-intensive applications such as ffmpeg. However, each virtual machine uses a predefined amount of memory, which is generally higher than in an operating system-level virtualization solution due to the additional kernel memory and the hypervisor overhead.
The use of virtualization metrics eases comparison among virtualization solutions and between use cases of a given solution. One may choose in which situations to deploy a virtualization solution and may tune it to provide the desired outcome. The proposed metrics are difficult to capture in a mathematical formula and are best used for comparison. Note that a solution may be suitable when considering one metric, but unsuitable when considering another. The user/experimenter is the one deciding what is the most relevant aspect to be taken into account.

3.4 Automating Deployment and Management of Peer-to-Peer Clients

In order to keep up with recent advances in Internet technology, streaming and content distribution, Peer-to-Peer systems (and BitTorrent in particular) have to adapt and develop new, attractive and useful features. Extensive measurements, coupled with carefully crafted scenarios and dissemination, are important for discovering the weak/strong spots of Peer-to-Peer based data distribution and for ensuring efficient transfer.
On top of the virtualized infrastructure, we developed a framework for running, commanding and managing BitTorrent swarms. The purpose is to have access to an easy-to-use system for deploying simple to complex scenarios, making extensive measurements and collecting and analyzing swarm information (such as protocol messages, transfer speed, connected peers).
The swarm management framework [62] is a service-based infrastructure that allows easy configuration and commanding of BitTorrent clients on a variety of systems. A client application (commander) is used to send commands/requests to all stations running a particular BitTorrent client. Each station runs a dedicated service that interprets the requests and manages the local BitTorrent client accordingly.

No. containers  Average (%)  Maximum (%)  Minimum (%)  Time (s)
2               11.8         32.3         5.3          2626
4               26.1         65.9         12.4         2786
8               61.3         85.4         36.2         2401

Table 3.4: LXC Disk Busy for gzip

Metric                 1 VM     2 VMs    3 VMs
CPU Usage (%)          18.49    22.07    26.85
Disk writes (bytes/s)  1008.11  1820.96  5074.22
Running Time (s)       483      908      1808

Table 3.5: Xen Metrics for ffmpeg

The framework is designed to be as flexible and expandable as possible. At this point it allows running/testing a variety of scenarios and swarms. Based on the interests of the one designing and running the scenario, one may configure the BitTorrent client implementation for a particular station, alter the churn rate by configuring entry/exit times in the swarm, add rate limiting constraints, alter the swarm size, the file size etc. Its high reconfigurability allows one to run relevant scenarios and to collect important information to be analyzed and disseminated. Through automation and client instrumentation, the management framework allows rapid collection of status and logging information from BitTorrent clients. The major advantages of the framework are:
• automation – user interaction is only required for starting clients and investigating their current state;
• complete control – the swarm management framework allows the user/experimenter to specify swarm and client characteristics and to define the context/environment where the scenario is deployed;
• full client information – instrumented clients output detailed information regarding the inner protocol implementation and the transfer evolution; information is gathered from all clients and used for subsequent analysis.
Depending on the level of control over the swarm, we define two types of environments. A controlled environment, or internal swarm, uses only instrumented, controlled clients; we have complete control over the network infrastructure and peers. A free environment, or external swarm, is usually created outside the infrastructure and consists of a larger number of peers, some of which are the instrumented controlled clients. Our experiments so far have focused on controlled environments; we aim to extend our investigations to free environment swarms.

3.4.1 Instrumenting Peer-to-Peer Clients

For our experiments [16] we selected the BitTorrent clients that are most significant today, based on the number of users, reported performance, features and history. We used Azureus, Tribler, Transmission, Aria, libtorrent-rasterbar/hrktorrent, BitTornado and the mainline client (open source version). All clients are open source, as we had to instrument them to use a command line interface and to output verbose logging information.
Azureus, now called Vuze, is a popular BitTorrent client written in Java. We used Azureus version 2.3.0.6. The main issue with Azureus was the lack of a proper CLI that would enable automation. Though limited, a "Console UI" module enabled automating the tasks of running Azureus and gathering download status and logging information.
Tribler is a BitTorrent client written in Python and one of the most successful academic research projects. Developed by a team at TU Delft, Tribler aims at adding various features to BitTorrent, increasing download speed and improving user experience. We used Tribler 4.2. Although a GUI-oriented client, Tribler offers a command line interface for automation. Extensive logging information is enabled by updating the values of a few variables.
Transmission is the default BitTorrent client in the popular Ubuntu Linux distribution. Transmission is written in C and aims at delivering a good amount of features while still keeping a small memory footprint. The version we used for our tests was transmission 1.22. Transmission has a fully featured CLI and was one of the clients that were very easy to automate. Detailed debugging information regarding connections and chunk transfers can be enabled by setting the TR_DEBUG_FD environment variable.
Aria2 is a multiprotocol (HTTP, FTP, BitTorrent, Metalink) download client written in C++. Throughout our tests we used version 0.14. aria2 natively provides a CLI and it was easy to automate. Logging is also enabled through CLI arguments.
libtorrent-rasterbar is a BitTorrent library written in C++. It is used by a number of BitTorrent clients, such as BitTorrent and SharkTorrent. As we were looking for a client with a CLI, we found hrktorrent to be the best choice. hrktorrent is a lightweight implementation over libtorrent-rasterbar and provides the necessary interface for automating a BitTorrent transfer, although some modifications were necessary. Rasterbar libtorrent provides extensive logging information when built with the TORRENT_LOGGING and TORRENT_VERBOSE_LOGGING macros defined. We used version 0.13.1 of libtorrent-rasterbar and the most recent version of hrktorrent.
BitTornado is an old BitTorrent client written in Python. The reason for choosing it was its common background with Tribler. However, as testing revealed, it had its share of bugs and problems and it was eventually dropped.
BitTorrent Mainline is the original BitTorrent client, written by Bram Cohen in Python. We used version 5.2 during our experiments, the last open-source version. The mainline client provides a CLI, and logging can be enabled through minor modifications of the source code.
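As an illustration, verbose logging for the two mechanisms mentioned above can be enabled roughly as follows; the log file names are ours, the CLI binary name may differ between Transmission versions, and the exact build line for libtorrent-rasterbar will depend on its build system:

# Transmission: TR_DEBUG_FD names the file descriptor receiving debug output
TR_DEBUG_FD=2 transmissioncli test.torrent 2> transmission-verbose.log

# libtorrent-rasterbar: verbose logging is a compile-time switch
./configure CXXFLAGS="-DTORRENT_LOGGING -DTORRENT_VERBOSE_LOGGING"
make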

3.4.2 Framework Setup

[Figure 3.3 diagram: the Commander station exchanges command and ack/error messages with a Server running in each OpenVZ container (bootstrapped over SSH); each Server manages local BitTorrent clients such as Tribler, hrktorrent and Transmission; XML configuration files drive the Commander.]

Figure 3.3: Software Service System Overview

The software service infrastructure was designed with the goal of remotely controlling BitTorrent clients. Its architecture (see Figure 3.3) is built on a client-server model, with a single client addressed as Commander and multiple servers. The BitTorrent clients reside in OpenVZ virtual containers and are controlled only through the Server service, by interacting with the Commander interface. An SSH connection is used by the Commander for the initial bootstrapping, in case the service is not active.
The BitTorrent scenarios are defined using XML configuration files, which can be considered input to the Commander. These files contain information not only about each container that should be used, but also about the torrent transfers, such as file names and paths. As we wanted to make it as easy as possible to deploy new BitTorrent swarms, we designed our architecture to support two XML configuration files: one for the physical nodes configuration and one for the BitTorrent swarms configuration.
The nodes XML file describes the physical infrastructure configuration. It stores information about:
• physical nodes/OpenVZ containers IP addresses and NAT ports;
• SSH port and username;
• server and BitTorrent clients paths.

141.85.224.201 10169 eth0
141.85.224.201 10122 p2p
10104 /home/p2p/cs-p2p-next/autorun/server/Server.py
/home/p2p/p2p-clients/transmission/cli
/home/p2p/p2p-clients/hrktorrent
/home/p2p/p2p-clients/tribler
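The XML markup itself was lost when the listing above was extracted; matching the values against the fields enumerated before, the entry for one node plausibly has a structure along these lines (all element and attribute names here are hypothetical):

<node>
  <!-- container address: IP, NAT port, interface -->
  <address ip="141.85.224.201" port="10169" iface="eth0"/>
  <!-- SSH access used for bootstrapping the Server -->
  <ssh ip="141.85.224.201" port="10122" user="p2p"/>
  <server port="10104" path="/home/p2p/cs-p2p-next/autorun/server/Server.py"/>
  <client name="transmission" path="/home/p2p/p2p-clients/transmission/cli"/>
  <client name="hrktorrent" path="/home/p2p/p2p-clients/hrktorrent"/>
  <client name="tribler" path="/home/p2p/p2p-clients/tribler"/>
</node>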

The swarm XML file is used to describe the swarm configuration. It maps a BitTorrent client to a physical node from the nodes XML configuration file and contains the following information:
• the torrent file for the experiment (same path on all containers);
• BitTorrent client upload/download speed limitations;
• output options (download path, logs paths).
The speed limitations are enforced using the tc Linux tool or the clients' internal bandwidth limitation options.

/home/p2p/p2p-meta/test.torrent

1 tribler 512 256
/home/p2p/p2p-dld/tribler
/home/p2p/p2p-dld/tribler tribler-test.log
/home/p2p/p2p-log/tribler tribler-test.out
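As with the nodes file, the markup was lost in extraction; given the field list above, a client entry plausibly looks like this (hypothetical element names; we assume 512 and 256 are the upload and download limits):

<swarm torrent="/home/p2p/p2p-meta/test.torrent">
  <client node="1" type="tribler" upload="512" download="256">
    <download path="/home/p2p/p2p-dld/tribler"/>
    <status path="/home/p2p/p2p-dld/tribler" file="tribler-test.log"/>
    <verbose path="/home/p2p/p2p-log/tribler" file="tribler-test.out"/>
  </client>
</swarm>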

3.4.3 Client Startup and Management

The Server application is a daemon that listens for incoming connections and manages BitTorrent clients. Upon start-up, the server receives as input from the Commander the IP address on which to bind itself for socket connections. The port on which it listens is predefined in a configuration file visible to both Server and Commander. As for the Commander application, the language chosen for the implementation is Python, which offers several C-like functionalities, such as the socket module for communication and the subprocess module for process spawning (the server is responsible for starting and stopping the BitTorrent clients); a stripped-down sketch of this loop is given after the message list below. The BitTorrent swarm analysis system described in Section 4.4 is also entirely implemented in Python, and the Server uses its status file parsers in order to obtain the latest information about a transfer's status. The Server is separated from the BitTorrent clients by a thin layer of classes, implemented for each client, which provide the interface needed for commanding their execution and establishing their input parameters.
The system design implies that BitTorrent clients reside on remote machines and are managed through a Server application, which runs as a daemon on their system. This Server is remotely controlled, being started, restarted and stopped using SSH commands initiated through the Commander application. Once the Server is started, the Commander acts as its client, communicating with it in order to control the BitTorrent applications. Our protocol implies that each BitTorrent client started by the Server is associated with only one torrent file. Currently, the software service infrastructure supports the following messages:
• START-CLIENT – the server starts a client with the given parameters;
• STOP-CLIENT – the server stops the client with the given identifier;
• GET-CLIENTS – the server replies with a list of running clients;
• GET-OUTPUT – the server replies with information about clients' output (running or not);
• ARCHIVE – the server creates archives with the files indicated in the message, then deletes the files;
• GET-STATUS – returns information about an active transfer;
• CLEANUP – removes files; the message carries a dictionary mapping the types of files that need to be removed and is extendible to other file types. In the current version of the implementation it supports the following keys:
• ALL – if True, erases all files related to the experiment;
• DOWN – if True, erases all downloaded files;
• VLOGS – if True, erases all verbose log files;
• SLOGS – if True, erases all status log files;
• ARCHIVE – if True, erases all archives related to the experiment.
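The sketch below illustrates the design; it is not the actual implementation, and the wire format of the messages (whitespace-separated words) and the port value are assumptions:

import socket
import subprocess

PORT = 10104      # predefined port, shared with the Commander (value illustrative)

def serve(bind_ip):
    clients = {}  # client identifier -> subprocess.Popen handle
    next_id = 1
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_ip, PORT))
    srv.listen(5)
    while True:
        conn, _ = srv.accept()
        request = conn.recv(4096).decode().split()
        if request[0] == "START-CLIENT":
            # e.g. START-CLIENT /home/p2p/p2p-clients/hrktorrent <args...>
            clients[next_id] = subprocess.Popen(request[1:])
            conn.sendall(str(next_id).encode())
            next_id += 1
        elif request[0] == "STOP-CLIENT":
            clients[int(request[1])].kill()
            conn.sendall(b"ack")
        elif request[0] == "GET-CLIENTS":
            running = [cid for cid, p in clients.items() if p.poll() is None]
            conn.sendall(repr(running).encode())
        else:
            conn.sendall(b"error")
        conn.close()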

• ALL – if True, erases all files related to the experiment; • DOWN – if True, erases all downloaded files; • VLOGS – if True, erases all verbose log files; • SLOGS – if True, erases all status log files; • ARCHIVE – if True, erases all archives related to the experiment. The Commander initiates transfers by starting a client with a specific torrent file and options (down- load path, log files paths and names), and the Server returns a corresponding ID, which can be used to check the transfer status. The status information is retrieved from the status log files, and currently supports the following parameters: download speed, upload speed, downloaded size, up- loaded size, eta (estimated time of arrival), number of peers. In the reply message body, each parameter uses a string identifier (parameter_name) and is followed by its corresponding value. The use-cases involved swarms with different numbers of peers on torrent files of different sizes (ranging from tens of MB to a few GB for Linux distribution ISO files). In a typical scenario, the Commander loads the XML configuration files representing the swarm and then it bootstraps the Server daemons from all the virtual containers. Immediately afterwards, the Commander uses the Server daemons to start BitTorrent clients on a target torrent. Once the swarm is formed, the Commander then periodically checks the state of the BitTorrent clients in the swarm. It also stops/restarts some of the clients. The logs generated by the BitTorrent clients contain lines of the following form: • Hrktorrent status log: ps: 32, dht: 21 <> time: 17-08-2010 12:40:05 <> dl: 2.57kb/s, ul: 3.09kb/s <> dld: 0mb, uld: 0mb, size: 3125mb <> eta: 5d 46h 23m 8s • Tribler status log: 03-06-2010 12:19:04 sample1.mpeg DLSTATUS_DOWNLOADING 93.89% None up 0.00KB/s down 5440.21KB/s eta 1.67072531777 peers 2 • Hrktorrent verbose log: Jun 08 22:20:48 <== HAVE [ piece: 839] • Tribler verbose log: 14-11-2009 23:11:13 connecter: Got HAVE( 14 ) from 141.85.37.41

3.5 Simulating Connection Dropouts

In order to create realistic behavior in a Peer-to-Peer swarm, the experimenter must take into account the way connections are created and then destroyed; in other terms, churn must be considered. We define simulating churn as simulating connection dropouts [15]. These techniques provide valuable updates to the virtualized network infrastructure. We analyse two solutions for reproducing realistic network dropouts in simulation environments and compare them with an induced network failure.
Considering infrastructures where complete control exists over the client processes and machines, the peer connections to the swarm are dropped and resumed by stopping and restarting the client, by suspending and resuming the client, or by disabling and enabling the network interface. From the swarm's point of view, the client behavior in the first two solutions is compared against the third, network-dropout solution.

3.5.1 Reproducing Network Behavior

One of the main difficulties in simulating network environments is reproducing network unreliability. Given the heterogeneous nature of both end-user computers and ISP equipment and policies, a real-life deployment of Peer-to-Peer systems encounters multiple types of underlying network issues: dropped packets, connection delays, connection dropouts.
Each of the above mentioned network behaviors may have an influence on application behavior. While connection delays and dropped packets are most of the time covered by TCP functionality, connection dropouts have a direct influence at the application level (clients joining and leaving are a fundamental process of Peer-to-Peer systems [27]). Considering a simple example of a swarm composed of an initial seeder and six leechers, if the seeder periodically leaves and rejoins the swarm, the nodes will require a longer period of time to complete the data transfer. BitTorrent clients create and maintain reputation traces for the peers with whom data was exchanged. If a peer exhibits unreliable behavior, it will not be preferred when opening new connection slots and it will thus experience a decreased level of performance from the rest of the swarm.
Connection dropouts also contribute to peer population churn [67]. As nodes are disconnected from the swarm, the rest of the peer population may see improved or diminished performance, depending on the swarm state. Previous studies have estimated and analysed the impact of churn on the behavior of Peer-to-Peer protocols [7]. However, in most cases, the analysis was based on simulations of protocols ([44], [34], [54]) rather than on real implementations.
From the point of view of the swarm, a client connection dropout is equivalent to the client abruptly leaving the system. In this case neither the BitTorrent connections nor the TCP links are closed in a graceful manner, and swarm peers will experience multiple timeouts before declaring the connections closed.
Such behavior can be reproduced using three solutions: stopping and restarting the clients, suspending them, and disabling the network interface (sketched below). Clients can be instantly terminated using the POSIX SIGKILL signal. When a process receives this signal, it is immediately terminated and all its open connections are closed by the operating system. In order for the client to be resumed, it has to be restarted. When a process is suspended, it is placed in a temporarily inactive state; the operating system does not close its connections and open files. A process can easily be suspended by sending it the POSIX SIGSTOP signal. To resume the process (and place it in an active state), the POSIX SIGCONT signal has to be sent to it.
The two solutions differ from the point of view of TCP connection management: stopped clients have all their connections closed, while suspended clients can resume their connections if a timeout has not been reached at the moment of their resumption.
Finally, the client connection dropout can be induced by disabling the network interface. With this approach, the client continues to run without any intervention from the operating system, while its TCP connections will be closed. Disabling the network interface closely reproduces network dropouts occurring in the Internet.
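A minimal Python sketch of the three mechanisms, using the signals discussed above; the function names are ours and the interface name is illustrative:

import os
import signal
import subprocess
import time

def dropout_stop(pid, client_cmd, pause):
    # stop method: SIGKILL terminates the client; it is restarted after the pause
    os.kill(pid, signal.SIGKILL)
    time.sleep(pause)
    return subprocess.Popen(client_cmd)      # new process, new connections

def dropout_suspend(pid, pause):
    # suspend method: SIGSTOP freezes the client, SIGCONT resumes it;
    # the operating system keeps its TCP connections open in the meantime
    os.kill(pid, signal.SIGSTOP)
    time.sleep(pause)
    os.kill(pid, signal.SIGCONT)

def dropout_ifdown(iface, pause):
    # network failure method: bring the container interface down, then up
    subprocess.call(["ifconfig", iface, "down"])
    time.sleep(pause)
    subprocess.call(["ifconfig", iface, "up"])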

3.5.2 Experimental Setup

The practical demonstration and evaluation of the connection dropout features have employed the virtualized infrastructure and the scripted framework running on top of it. The infrastructure allowed us to deploy up to 80 peers, each running in a single virtual machine. All peers have been grouped in pairs to evaluate the proposed solutions for simulating connection dropouts. The thin scripted framework was also responsible for collecting output from all peers (in the form of log files) and for parsing it.
The infrastructure is constructed on top of several commodity hardware systems in the NCIT cluster at UPB. In order to deploy a large number of peers, the thin virtualization layer employed OpenVZ. OpenVZ is a lightweight solution that allows rapid creation of virtual machines (also called containers). As an operating system-level virtualization solution, OpenVZ enables the creation of a large number of virtual machines (30 virtual machines are sustainable on a system with 2GB of RAM). In our infrastructure each container runs a single BitTorrent client instance. All hardware systems used were identical with respect to hardware and software components: 2GB RAM, 3GHz dual-core CPU, 300GB HDD, 1Gbit NIC, running Debian GNU/Linux 5.0 Lenny.
The deployed experiments used a single OpenVZ container for each peer taking part in a swarm. The virtualized network allowed direct link-layer access between systems – all systems are part of the same network; this allows easy configuration and interaction.
A separate hardware system, also called the Commander, was used to start and handle scenarios. The Commander uses SSH for communicating with the virtualized environment. Connections between the Commander and the containers are handled through secondary, specialized venet interfaces; the presence of the Commander connections and the virtualized network created an easy testing ground for disabling and enabling interfaces – the interfaces employed for dropping connections were those that are part of the virtualized network.
The experiments made use of an updated version of hrktorrent, a lightweight application built on top of libtorrent-rasterbar. Previous experiments [14] have shown libtorrent-rasterbar outperforming other BitTorrent implementations, leading to its usage in the current experiments. hrktorrent has been updated to make use of the bandwidth limitation facilities provided by libtorrent-rasterbar. The deployed scenarios have enforced a 100 KB/s download limit.
To evaluate the impact of the dropout, clients have been grouped into pairs (one seeder and one leecher), the leecher's downloading capacity being temporarily disabled. Together with the seeder and leecher, a tracker is started on the same container as the seeder to allow BitTorrent communication between the two nodes. The use of a two-peer swarm reduces the impact of non-deterministic BitTorrent communication inside the swarm. At the same time, this setup emphasizes the activity of the target leecher by clearly targeting one BitTorrent connection. Placing the tested leecher in a larger swarm would make the results less accurate.
The duration of the download has to be large enough to allow the experiments to avoid the BitTorrent-specific start-up behavior. Given the leecher's bandwidth limitation of 100 KB/s, a 100MB file was chosen to be shared by the seeder. This creates a theoretical duration of 1000 seconds for each complete download.
However, the size of the shared file has no effect on the connection dropout behavior: the leecher's bandwidth will always be saturated at the moment of the dropout simulation.
The recovery time is the interval of inactivity of a given peer and, thus, of a connection between two peers. Through various methods, the connection between peers is disabled and, after a given period of time, re-enabled. Recovery time is the focus of our experiments and may or may not differ significantly from the connection interrupt timeout. The connection behavior of each of the methods provides insight into its suitability for various simulation scenarios. Each seeder-leecher pair executes the following scenario schedule (a sketch of the suite loop follows the list):
• start the tracker, create .torrent files;
• start the seeder and the leecher;
• wait a predetermined amount of time for swarm initialization;
• disable the leecher, in accordance with the respective connection dropout solution (suspend the process, terminate it or disable the network interface);

• wait a test-specific amount of time;
• enable the leecher;
• wait for swarm stabilization.
The amounts of time before the leecher is disabled and after it is enabled are used to ensure swarm stabilization and proper results. At this point, the "swarm initialization time" is 15 seconds, while the "swarm stabilization time" is 45 seconds. These time intervals have been chosen empirically, from experimentation and experience. A thorough analysis of the optimum values for these time intervals is set as further work.
A leecher-seeder pair is terminated after each scenario and restarted in order to take part in the next one. A complete test suite for a given pair implies using increasing timeouts between disabling and enabling the leecher. This is, in every situation, equivalent to an increase in the duration of the connection dropout and forms the basis for the subsequent analysis. The timeout values, measured in seconds, were chosen in geometric progression: 4, 8, 16, 32, 64, 128, 256 and 512 seconds. They cover simulated dropout timeouts ranging from a few seconds to close to 10 minutes.
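The schedule above translates into a simple loop; the sketch below is illustrative, with the pair-handling callbacks standing in for the framework's actual scripts:

import time

PAUSES = [4, 8, 16, 32, 64, 128, 256, 512]  # geometric progression of dropouts (s)
INIT_TIME = 15                              # empirical swarm initialization time (s)
STABILIZE_TIME = 45                         # empirical swarm stabilization time (s)

def run_suite(start_pair, disable_leecher, enable_leecher, stop_pair):
    # one complete test suite for a seeder-leecher pair
    for pause in PAUSES:
        start_pair()                # start tracker, seeder and leecher
        time.sleep(INIT_TIME)       # wait for swarm initialization
        disable_leecher()           # suspend/terminate/ifdown, depending on method
        time.sleep(pause)           # simulated dropout duration
        enable_leecher()
        time.sleep(STABILIZE_TIME)  # wait for swarm stabilization
        stop_pair()                 # the pair is restarted for the next scenario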

3.5.3 Results

The virtualized infrastructure and the scripted framework have been used as a testing ground for the evaluation suite of the proposed dropout-equivalent solutions. The evaluation suite has employed virtualized containers to create leecher-seeder pairs. Each pair is used to transfer a 100MB file from an initial seeder to one leecher. A tracker is also started on the same container as the seeder to mediate communication. The leecher uses a 100 KB/s limitation. Information from both the leecher and the seeder is collected as log files and parsed subsequently.
The availability of a high number of virtualized systems and the small swarm size (two peers: one leecher and one seeder) allow rapid deployment of scenarios and easy repeatability. A single running suite employs 39 swarm pairs, 13 pairs for each proposed solution. Through the scripted interface, each pair sequentially simulates a series of connection dropouts. Each series consists of dropping the connection for 4 seconds, 8 seconds, 16 seconds and so on, up to 512 seconds. A simulated dropout is equivalent to a connection interrupt for the given amount of time.
The information used for the analysis has been collected in the form of log files from the BitTorrent clients. SSH, rsync and shell scripts have been glued together in order to collect and parse the relevant information. Statistical processing has been employed in the form of R language scripts for mean values, standard deviations, graphics, etc.
The main goal of the employed experiments is to measure and compare the recovery time after each connection dropout for each of the three proposed solutions. By employing diverse timeout intervals, we present the similarities and differences between suspending and terminating clients or disabling network interfaces in order to simulate connection dropouts.
Table 3.6 summarizes the results of the employed experiments. The three main methods used for simulating connection dropouts are identified by ifdown, suspend and stop. For each method, the mean value of the recovery timeout and the relative standard deviation have been computed. The column dubbed pause holds the timeout imposed on the given process. The mean column is the mean of the measured values, in seconds, while the rsd column is the relative standard deviation (percentage).
All three methods offer similar results, with the recovery time values getting closer to the scheduled pause time as it increases. The suspend and stop values are usually very close to the real pause value, while the ifdown method is usually further from that value.

Table 3.6: Recovery Timeout for Different Scenarios

           ifdown            suspend           stop
pause(s)   mean(s)  rsd(%)   mean(s)  rsd(%)   mean(s)  rsd(%)
8          12.10    10.88    9.80     4.30     12.21    11.61
16         23.84    16.73    17.12    2.33     19.95    9.04
32         47.16    4.50     32.33    8.87     36.36    4.11
64         79.66    13.4     72.16    26.83    67.61    2.45
128        146.37   8.05     142.11   11.56    131.80   1.14
256        288.58   2.15     259.90   1.56     260.26   0.69
512        535.42   0.09     532.42   2.11     515.81   0.33

Due to their similar results, we consider it safe to assume that the suspend and stop methods may be used interchangeably in order to simulate connection dropouts. As a connection dropout is usually caused by a client process being terminated, the stop method is the most appropriate choice. If available or easier to deploy, the suspend method may be used with similar effects. Figure 3.4 shows a graphical representation of the recovery time for the solutions involving disabling the interface and terminating the process, with respect to the leecher timeout interval. We have found the results regarding suspending the process to be inconclusive, due to improper swarm initialization time, and they have been left out of the figure.

Figure 3.4: Time Recovery with Respect to Dropout Interval
[The plot, titled “Time Recovery in Simulated Dropouts”, shows recovery time (s) versus leecher dropout time (8, 16, 32, 64, 128, 256, 512 s) for the ifdown, sigkill and sigstop dropout types.]

As concluding remarks, the first solution (ifdown) brings down the network interface of the peer in order to simulate the end of the connection. Results have shown that its recovery time, although in the same range, is higher than that of the other methods; we conclude that this solution is mostly suitable for simulating connection dropouts caused by network failure. The second solution (suspend) suspends clients during the simulated dropout (using SIGSTOP) and resumes them afterwards (using SIGCONT). Although suspending peers is not a common action, results have shown that this method is similar to the stop method with respect to recovery time. In an environment where suspending peers is easier to achieve than stopping them, such a solution could prove suitable and provide realistic results.

The third solution (stop) consists of stopping (using SIGKILL) and restarting the clients. This solution, although aggressive, is considered the best approximation of the realistic behaviour of a connection dropout, as peers typically display high dynamics when entering and exiting a swarm.
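To make the three solutions concrete, here is a minimal sketch, assuming the leecher runs as a local process with a known PID and that the container's network interface is eth0; the hrktorrent restart command in the stop case is a hypothetical example, not the framework's actual script:

#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>

enum dropout_method { IFDOWN, SUSPEND, STOP };

static void simulate_dropout(enum dropout_method m, pid_t leecher, int seconds)
{
    switch (m) {
    case IFDOWN:    /* network failure: bring the interface down, then up */
        system("ip link set dev eth0 down");
        sleep(seconds);
        system("ip link set dev eth0 up");
        break;
    case SUSPEND:   /* freeze the client, then resume it */
        kill(leecher, SIGSTOP);
        sleep(seconds);
        kill(leecher, SIGCONT);
        break;
    case STOP:      /* kill the client, then restart it */
        kill(leecher, SIGKILL);
        sleep(seconds);
        system("hrktorrent file.torrent &");   /* hypothetical restart command */
        break;
    }
}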

3.6 Deployed Setup and Experimental Scenarios

The complete infrastructure, using automation, virtualization and connection dropout simulation, allows the automatic deployment and management of a wide variety of Peer-to-Peer scenarios. Each OpenVZ virtualized host runs a single BitTorrent client (or tracker) and collects relevant information for subsequent analysis. The commander station defines the configuration of a new scenario and then deploys it on top of the virtualized infrastructure. Bandwidth limitations, client types, ports to be used, churn rate and torrent files are set up for the given scenario.

3.6.1 Virtualized Configuration

Our current setup consists of 10 computers, each running 10 OpenVZ virtual environments (VEs). All systems are identical with respect to CPU power, memory capacity and HDD space, and are part of the same network. The network connections are 1Gbit Ethernet links. At the time of this writing, the BitTorrent Testing Infrastructure consists of 8 hardware systems running Debian GNU/Linux 5.0 (Lenny). Any other Linux distribution can be used, as the tools used in the framework are common among distributions. With 5 VEs active and running a BitTorrent client, memory consumption is 180MB-250MB per system. With no VE running, the memory consumption is around 80MB-110MB. The BitTorrent client used pervasively is hrktorrent, a libtorrent-rasterbar based implementation. hrktorrent is a light addition to the libtorrent library, so memory consumption is quite small while still delivering good performance. On average, each virtual machine uses 40MB of RAM when running hrktorrent in a BitTorrent swarm. The current partitioning scheme for each system leaves 220GB of HDD space for the VEs. However, one could safely raise that limit to around 280GB. Each basic complete VE (that is, one with all clients installed) uses 1.7GB of HDD space. During a 700MB download session, each client outputs log files using around 30-50MB of space. Processor usage is not an issue, as BitTorrent applications are mostly I/O intensive. Experiments using a large number of peers have been deployed on the virtualized infrastructure. These experiments use virtualized peers to deploy realistic swarms, create a network topology and overlay, and collect information for analysis.

3.6.2 Monitoring Experiment

One such experiment simulates swarms comprising a single seeder and 39 initial leechers. 19 leechers are high bandwidth peers (512KB/s download speed, 256KB/s upload speed) and 20 leechers are low bandwidth peers (64KB/s download speed, 32KB/s upload speed). The total time of an experiment involving all 40 peers and a 700MB CD image file is around 4 hours; it only takes about half an hour for the high bandwidth clients to complete the download. We have been using the Linux Traffic Control (tc) tool, combined with the iptables set-mark option, to limit download and upload traffic to and from a VE.

Figure 3.5: Download Speed Evolution – Start Phase

Figure 3.6: Download Speed Evolution – End Phase

Figure 3.5 and Figure 3.6 are real-time representations of download speed evolution using MonALISA. The first figure shows the initial phase (first 10 minutes) of an experiment, with the low bandwidth clients limited by the 64KB/s download speed line and the high bandwidth clients running between 100KB/s and 450KB/s. The second figure presents the phase of the experiment in which the high bandwidth clients finished downloading. The two horizontal red lines are the download speed limitations for the two kinds of clients: 512KB/s for high speed clients and 64KB/s for low speed clients. Figure 3.5 shows the limitation of the low bandwidth peers, while the high bandwidth peers have sparse download speeds. Each high bandwidth client's speed usually follows an up-down evolution, with an increasing median as time goes by. Low bandwidth clients are limited to 64KB/s, while high bandwidth clients fill the range between 64KB/s and 512KB/s. As time goes by, the high bandwidth peers tend to stabilize. Figure 3.6 presents the end phase for the high bandwidth peers. At around 13:05, the high bandwidth clients have completed their download or are completing it in the following minutes, while the low bandwidth clients are still downloading. The high bandwidth clients span a large speed interval, while the low bandwidth clients are “gathered” around the 64KB/s limitation.

3.6.3 Performance Evaluation Experiments

This experimental setup used only six hardware nodes from the infrastructure. Most of our experiments were simultaneous download sessions: each system ran a specific client at the same time and under the same conditions as the other clients. Results and logging data were collected after each client completed its download.

3.6.4 Results

Our framework has been used to test different swarms (different .torrent files). Most scenarios involved simultaneous downloads for all clients. At the end of each session, download status information and extensive logging and debugging information were gathered from every client. Figure 3.7 and Figure 3.8 are comparisons between different BitTorrent clients running in the same environment in the same download scenario.

             Test1   Test2   Test3   Test4
file size    908MB   4.1GB   1.09GB  1.09GB
seeders      2900    761     521     496
leechers     2700    117     49      51

Client       Download Time (seconds)
aria2c       4620    3233    580     623
azureus      1961    2313    N/A     420
bittorrent   17580   3639    1560    840
libtorrent   581     913     150     134
transmission 2446    3180    420     300
tribler      2040    1260    N/A     N/A

Table 3.7: Test Swarms Results

Table 3.7 presents a comparison of the BitTorrent clients in four different scenarios. Each scenario means a different swarm. Although much data was collected, only the total download time is featured in the table.

Figure 3.7: Test Swarm 1

Figure 3.8: Test Swarm 2

The conclusions drawn after result analysis were:
• hrktorrent/libtorrent continuously surpasses the other clients in different scenarios;
• tribler, azureus and transmission are quite good clients but lag behind hrktorrent;
• mainline and aria2 are very slow clients and should be dropped from further tests;
• swarms that use sharing ratio enforcement offer better performance; a file is downloaded at least 4 times faster within a swarm using sharing ratio enforcement.

3.6.5 Deploying the Rendering Engine

A class of specific experiments made use of 40 virtualized peers which were configured to use bandwidth limitations. Half of the peers (20) were considered to be high-bandwidth peers, while the other half were considered to be low-bandwidth peers. The high-bandwidth peers were limited to 512KB/s download speed and 256KB/s upload speed, and the low-bandwidth peers were limited to 64KB/s download speed and 32KB/s upload speed.

Figure 3.9: Download speed/acceleration evolution (libtorrent BitTorrent client)
[The plot shows download speed (ds) and acceleration (acc_ds) for one client, in KB/s, over the first 20 seconds of the session.]

Figure 3.10: BitTorrent protocol messages (20 seconds)
[The plot shows per-message-type counts for the first 20 seconds: REQUEST 721, PIECE 638, HAVE 39, DHT_PORT 2, INTERESTED 1, UNCHOKE 1, and 0 for CANCEL, BITFIELD, NOT_INTERESTED and CHOKE.]

Figure 3.9 displays a 20-second time-based evolution of the download speed and acceleration of a peer running the libtorrent client. Acceleration is high during the first 12 seconds, when the peer reaches its maximum download speed of around 512KB/s. Afterwards, the peer's download speed stabilizes and its acceleration remains close to 0.

All non-seeder peers display a similar start-up pattern: an initial 10-12 second bootstrap phase with high acceleration, in which the peer rapidly reaches its download limit, followed by a stable phase with acceleration close to 0. Figure 3.10 displays the messages exchanged during the first 20 seconds of a peer's download session, in direct connection with Figure 3.9. The peer is quite aggressive in its bootstrap phase and manages to request and receive a high number of pieces. Almost all requests sent were replied to with a block of data from a piece of the file. The download speed/acceleration time-based evolution graph and the protocol message counts are usually correlated and allow detailed analysis of a peer's behaviour. Our goal is to use this information to discover weak spots and areas to be improved in a given implementation, swarm or network topology.
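The acceleration curve (acc_ds) in Figure 3.9 can be computed as the discrete derivative of the sampled download speed. A minimal sketch, with hypothetical per-second samples (not measured data):

#include <stdio.h>

int main(void)
{
    /* hypothetical per-second download-speed samples (KB/s) for one peer */
    double ds[] = { 0, 40, 120, 260, 390, 480, 510, 512, 512, 511 };
    int n = sizeof(ds) / sizeof(ds[0]);

    /* acceleration (acc_ds) as the speed difference over a 1-second step */
    for (int t = 1; t < n; t++)
        printf("t=%2ds  ds=%3.0f KB/s  acc_ds=%+4.0f KB/s^2\n",
               t, ds[t], ds[t] - ds[t - 1]);
    return 0;
}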

3.7 Conclusion

In order to create an extensible, realistic and automated environment for deploying Peer-to-Peer trials, we have constructed a virtualized infrastructure. The infrastructure provides the means for creating Peer-to-Peer nodes and designing the network connections between them into a topology as close to reality as possible. Through the use of virtualization, we were able to reliably and efficiently use fewer than a dozen commodity hardware systems to simulate a complete set of more than a hundred peers. OpenVZ was chosen as the virtualization solution. A lightweight operating system-level virtualization solution, OpenVZ provided the low overhead and easy deployment and interaction required, though problems were encountered when applying bandwidth limitation options. OpenVZ provides a low memory footprint and low overhead, albeit at the price of less separation between various processes. Due to its low memory footprint, it was fairly easy to deploy tens of virtual nodes on top of a modest commodity hardware system. A variety of tools have been employed to allow realistic behavior and ensure automation. SSH has been used as the protocol for directly interacting with nodes and sending commands for building up the infrastructure or interacting with clients. iptables and tc have been employed for networking features such as firewalls and bandwidth limitations. Shell scripting has been used for automating various components of the infrastructure. A diversity of BitTorrent clients have been used as a basis for comparison and have been integrated through scripting into the infrastructure. Some of these clients have been instrumented for easy automation and for providing the required information. Ranging from Tribler to Vuze, clients are configured to run within a given OpenVZ container and be part of the simulated swarm. Connection dropouts have been simulated through various means, such as stopping the client, suspending it or disabling the interface. Connection dropout simulation has been integrated in the virtualized infrastructure in order to ensure churning in a Peer-to-Peer swarm. Results have signaled a similarity of the methods and supported the use of suspending the client, due to its easy integration in the swarm management framework. Experiments and trials make use of the infrastructure such that it is able to simulate a swarm consisting of more than 100 clients. We differentiate between an internal swarm – consisting solely of simulated virtualized nodes – and an external swarm – combining simulated virtualized nodes with real nodes from the Internet. A specific type of experiment has been directed at providing a performance overview and a comparison between various clients.

Chapter 4

Protocol Measurement and Analysis in Peer-to-Peer Systems

Based on BitTorrent's success story (it has managed to become the number one protocol of the Internet in a matter of years), the scientific community has delved heavily into analysing, understanding and improving its performance. Research focus has ranged from measurements [58] to protocol improvements [71], from social networking [60] to moderation techniques [59], from content distribution enhancements [72] to network infrastructure impact [12]. We undertook a novel approach involving client-side information collection regarding client and protocol implementation. We have instrumented a libtorrent-rasterbar client1 and a Tribler2 client to provide verbose information regarding the BitTorrent protocol implementation. These results are collected and subsequently processed and analysed through a rendering interface. Our aim is to measure and analyze protocol messages in real-world environments. As described in Chapter 3, a virtualized infrastructure has been used to provide realistic environments; apart from that, clients and trackers running in real-world swarms have been used and instrumented to provide valuable protocol information and parameters. No simulators have been used for collecting, measuring and analyzing protocol parameters; we rather followed a “keep it as real as possible” approach. Information, messages and parameters are collected directly from peers and trackers that are part of a real-world Peer-to-Peer swarm. Simulator approaches may go as far as a complete simulation of the BitTorrent protocol in a network emulator (OMNeT++) [13]. The chronology of actions for measuring parameters has been collecting data, parsing and storing it, and then subjecting protocol parameters to processing and analysis. The rest of this chapter presents the measured parameters, approaches to collecting, parsing and storing information in an easy-to-use format, and then putting it to analysis and interpretation.

4.1 BitTorrent Messages and Parameters

Analysis of BitTorrent client-centric behavior and, to some extent, swarm behavior, is based on BitTorrent protocol messages3. Messages are used for handshaking, closing the connection, requesting and receiving data.

1http://www.rasterbar.com/products/libtorrent/
2http://www.tribler.org/trac/
3http://www.bittorrent.org/beps/bep_0003.html


At startup, the BitTorrent client generates a unique identifier for itself, known as the peer id. This identifier is client dependent, each client encoding its peer id based on its own implementation.

4.1.1 Protocol Messages

Each torrent is uniquely identified by a 20-byte SHA1 hash of the value of the info key in the torrent file dictionary; this value is known as the info hash. The peer id and info hash values are important in establishing the TCP connection and are typically logged by trackers. The handshake is the first message sent. It uses the format:

<length><protocol><reserved><info_hash><peer_id>

The protocol parameter represents the protocol identifier string and the length parameter represents the protocol name length. reserved represents eight reserved bytes whose bits can be used to modify the behavior of the protocol; standard implementations leave them zero-filled. info_hash identifies the shared resource desired by the initiator of the connection. peer_id is the initiator's unique identifier. The receiver of the handshake must verify the info_hash in order to decide whether it can serve it; if it is not currently serving that resource, it drops the connection. Otherwise, the receiver sends its own handshake message to the initiator of the connection. If the initiator receives a handshake whose peer_id does not match the expected one – it must keep a list of peer addresses and ports and their corresponding peer_ids – then it also must drop the connection.
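For illustration, the handshake layout can be written down as a packed C structure. This is a sketch assuming the standard wire format (a 19-byte protocol name, “BitTorrent protocol”), not code taken from the instrumented clients:

#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
struct bt_handshake {
    uint8_t length;         /* protocol name length: 19 */
    char    protocol[19];   /* "BitTorrent protocol", no NUL terminator */
    uint8_t reserved[8];    /* extension bits; zero-filled by standard clients */
    uint8_t info_hash[20];  /* SHA1 identifying the shared resource */
    uint8_t peer_id[20];    /* the sender's unique identifier */
};
#pragma pack(pop)

static void fill_handshake(struct bt_handshake *h,
                           const uint8_t info_hash[20],
                           const uint8_t peer_id[20])
{
    h->length = 19;
    memcpy(h->protocol, "BitTorrent protocol", 19);
    memset(h->reserved, 0, sizeof(h->reserved));
    memcpy(h->info_hash, info_hash, 20);
    memcpy(h->peer_id, peer_id, 20);
}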

Remaining protocol messages use the format:

<length prefix><message ID><payload>

The length prefix is a four-byte big-endian value representing the sum of the message ID and payload sizes. The message ID is a single decimal byte. The payload is message dependent.
• keep-alive: <len=0000>
The keep-alive message is the only message without a message ID and payload. It is sent to keep the connection alive if no other message has been sent for a given amount of time, which is about two minutes.
• choke: <len=0001><id=0>
The choke message is sent when the client wants to choke a remote peer.
• unchoke: <len=0001><id=1>
The unchoke message is sent when the client wants to unchoke a remote peer.
• interested: <len=0001><id=2>
The interested message is sent when the client is interested in something that the remote peer has to offer.
• not interested: <len=0001><id=3>
The not interested message is sent when the client is not interested in anything that the remote peer has to offer.
• have: <len=0005><id=4><piece index>
The piece index is a 4-byte value representing the zero-based index of a piece that has just been successfully downloaded and verified via its hash value present in the torrent file.
• bitfield: <len=0001+X><id=5><bitfield>
The bitfield message may only be sent immediately after the handshake sequence has occurred and before any other message is sent. It is optional and need not be sent if a client has no pieces. The bitfield payload has length X and its bits represent the pieces that have been successfully downloaded. The high bit in the first byte corresponds to piece index 0. A set bit indicates a valid and available piece; a cleared bit indicates a missing piece. Any spare bits are set to zero.
• request: <len=0013><id=6><index><begin><length>
The request message is sent when requesting a block. index is the zero-based index of the piece containing the requested block, begin is the block offset inside the piece and length represents the block size.
• piece: <len=0009+X><id=7><index><begin><block>
The piece message is sent when delivering a block to an interested peer. index is the zero-based index of the piece containing the delivered block, begin is the block offset inside the piece and block represents the X-sized block data.
• cancel: <len=0013><id=8><index><begin><length>
The cancel message is sent when canceling a block request sent before. index is the zero-based index of the piece containing the requested block, begin is the block offset inside the piece and length represents the block size.
• port: <len=0003><id=9><listen-port>
The port message is sent by clients that implement a DHT tracker. The listen port is the port the client's DHT node is listening on.
(A short C sketch of these message IDs and of the request framing is given at the end of this section.)

Swarm measurement data are usually collected from trackers. While this offers a global view of the swarm, it carries little information about client-centric properties such as protocol implementation, neighbour set, number of connected peers, etc. A more thorough approach has been presented by Iosup et al. [31], using network probes to interrogate various clients. A similar approach has been undertaken by Zhang et al. [75], who created an online Peer-to-Peer trace archive consisting of information from previous experiments and covering a diversity of Peer-to-Peer protocols. Our approach [19], while not as scalable as the above-mentioned ones, aims to collect client-centric data and to store and analyse it in order to provide information on the impact of network topology, protocol implementation and peer characteristics. Our infrastructure provides micro-analysis rather than macro-analysis of a given swarm: we focus on detailed peer-centric properties rather than less-detailed global, tracker-centric information. The data provided by controlled instrumented peers in a given swarm is retrieved, parsed and stored for subsequent analysis. We differentiate between two kinds of BitTorrent messages: status messages, which clients provide periodically to report the current session's download state, and verbose messages, which contain protocol messages exchanged between peers (chokes, unchokes, peer connections, piece transfers etc.). Another type of message is provided by tracker logging [4]. Tracker-based messages provide an overall view of the entire swarm, albeit at the cost of less-detailed information. Tracker logging typically consists of periodic messages sent by clients as announce messages. However, these messages' period is quite large (usually 30 minutes – 1800 seconds), resulting in less detailed information. Their overall swarm vision is an important addition to status and verbose client messages.
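The sketch announced above records the message IDs from the list in Section 4.1.1 and shows how a request message is framed. The ID values follow the BitTorrent specification referenced earlier; the helper names are ours:

#include <stdint.h>

enum bt_msg_id {
    BT_CHOKE          = 0,
    BT_UNCHOKE        = 1,
    BT_INTERESTED     = 2,
    BT_NOT_INTERESTED = 3,
    BT_HAVE           = 4,
    BT_BITFIELD       = 5,
    BT_REQUEST        = 6,
    BT_PIECE          = 7,
    BT_CANCEL         = 8,
    BT_PORT           = 9    /* DHT listen port */
};

/* store a 32-bit value in big-endian byte order */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = v >> 24; p[1] = v >> 16; p[2] = v >> 8; p[3] = v;
}

/* build a request message: <len=0013><id=6><index><begin><length> */
static void make_request(uint8_t buf[17],
                         uint32_t index, uint32_t begin, uint32_t length)
{
    put_be32(buf, 13);           /* length prefix: ID byte + 12-byte payload */
    buf[4] = BT_REQUEST;         /* message ID */
    put_be32(buf + 5, index);    /* piece index */
    put_be32(buf + 9, begin);    /* block offset inside the piece */
    put_be32(buf + 13, length);  /* block size */
}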

4.1.2 Measured Data and Parameters

The measured data and parameters are those particular to BitTorrent clients and swarms that provide support for evaluation and improvements at the protocol level. The measured parameters are described in Table 4.1, Table 4.2 and Table 4.3, depending on their source (status messages, verbose messages or tracker messages).

Table 4.1: Parameters from Status Messages

Parameter                        Explanation
Download speed                   Current peer download speed – number of bytes received
Upload speed                     Current peer upload speed – number of bytes sent
ETA                              How long before the complete file is received
Number of connections            Number of remote peers currently connected to this client
Download size                    Bytes downloaded so far
Upload size                      Bytes uploaded so far
Remote peers ID                  IP address and TCP port of remote peers
Per-remote peer download speed   Download speed of each remote connected peer
Per-peer upload speed            Upload speed of each remote connected peer

Table 4.2: Parameters from Verbose Messages

Parameter        Explanation
CHOKE            Disallow remote peer to request pieces
UNCHOKE          Allow remote peer to request pieces
INTERESTED       Mark interest in a certain piece
NOT_INTERESTED   Unmark interest in a certain piece
HAVE             Remote peer possesses current piece
BITFIELD         Bitmap of the file
REQUEST          Ask for a given piece
PIECE            Send piece
CANCEL           Cancel request of a piece
DHT_PORT         Present DHT port to DHT-enabled peers

Table 4.3: Parameters from Tracker Messages

Parameter                  Explanation
Swarm size                 The number of peers in the swarm
Client IP/port             Remote peer identification (IP address and TCP port in use)
Client type                BitTorrent implementation of each client
Per-client download size   Download size for each client
Per-client upload size     Upload size for each client

4.1.3 Approaches to Collecting and Extracting Protocol Parameters

Peer-to-Peer clients and applications may be instrumented to provide various pieces of internal information for analysis. This information may also be obtained by enabling client logging. Such data features parameters describing client behavior, protocol messages, topology updates and even details of internal algorithms and decisions.

We “aggregate” this information as messages and focus on protocol messages, that is, messages regarding the status of the communication (such as download speed and upload speed) and those with insight into protocol internals (requests, acknowledgements, connects, disconnects). As such, there is a separation between periodic, status-reporting messages and internal protocol messages that mostly relate to non-periodic events in the way the protocol works. These have been dubbed status messages and verbose messages. Status messages are periodic messages reporting session state. They are usually output by clients every second, with updated information regarding the number of connected peers, current download speed, upload speed, estimated time of arrival, download percentage, etc. Status messages are suited to real-time analysis of peer behaviour, as they are lightweight and periodically output (usually every second). Status messages may also be used for monitoring, due to their periodic arrival. When using logging, each status message is typically provided as one line in a log file and parsed to provide valuable information. Graphical evolution and comparison of various parameters result easily from processing status message log files. Verbose messages, or log messages, provide a thorough inspection of a client's implementation. The output is usually of large quantity (hundreds of MB per client for a one-day session). Verbose information is usually stored in client-side log files and is subsequently parsed and stored. Verbose information may not be easily monitored, due to its event-based creation. In the case of the BitTorrent protocol, these messages are closely related to BitTorrent specification messages such as CHOKE, UNCHOKE, REQUEST, HAVE, or to internal events in the implementation. Verbose information may be logged through instrumentation of the client implementation or activation of certain variables; it may also be determined through investigation of network traffic. Apart from protocol information provided in status and verbose messages, one may also collect information regarding application behavior, such as the piece picking algorithm, the size of buffers used and overhead information. This data may be used to complete the picture of the overall behavior and provide insight into possible enhancements and improvements. There are various approaches to collecting information from running clients, depending on the level of intrusiveness. Some approaches may provide highly detailed information while requiring access to the client source code, while others provide general information but limited intrusiveness. The most intrusive approach requires placing hook points into the application code that provide information. This information may be sent to a monitoring service, logged, or sent to a logging library. Within the P2P-Next project, for example, the NextShare core provides an internal API for providing information. This information is then collected either through a logging service that collects all information or through the use of a monitoring service with an HTTP interface and MRTG graphics rendering tools. Section 4.2 presents a logging library with the purpose of collecting logs in a uniform way. Another approach makes use of logging information directly provided by BitTorrent clients. There are two disadvantages to this approach. The first one is that each client provides information in its own way and a dedicated message parser must be enabled for each application.
The second one is related to receiving verbose messages. In order to be able to receive verbose messages, one has to turn on verbose logging. This may be accomplished through a startup option, an environment variable or a compile-time option. It may be the case that non-open source applications possess none of these options and cannot provide the requested information. Finally, a network-oriented approach requires a thorough analysis of network packets, similar to deep packet inspection. It allows an in-depth view of all packets crossing a given point. Its main advantage is ubiquity: it may be applied to all clients and implementations regardless of access to the source code. The disadvantages are the difficulty of parsing all packets and extracting the required information (specific to the BitTorrent protocol) and, perhaps more pressing, the significant processing overhead introduced. The messages and information collected are concerned with client behavior. As such, the applications in place work at the edge of the P2P network, on each client. No information is gathered from the core of the network, inner routers or the Internet. In order to provide an overall profile of the swarm or P2P network, information collected from all peers must be aggregated and unified. While having only edge-based information means some data may be lacking, it provides a good perspective on the protocol internals and client implementation. We dub this approach client-centric investigation. Collected data may be monitored, with values rendered in real time, or it may be archived and compressed for subsequent use. The first approach requires engaging parsers while data is being generated, while the other allows the use of parsers afterwards. When using parsers with no monitoring, data is usually stored in a “database”. “Database” is a generic term which may refer to an actual database engine, file system entries, or even in-memory information. A rendering or interpretation engine is typically employed to analyze information in the database and provide it in a valuable form to the user.

4.2 A Generic BitTorrent Protocol Logging Library

One approach to providing a unified model for collecting information is a standard for developing the logging implementation: if a common, easy-to-parse output format is used, all the information about the messages exchanged between the active participants can be centralized, and new implementations can follow it. A logging library takes into consideration all the features of the protocol and its official extensions, even if not all current clients are using them. Taking advantage of the provided library APIs and their results, the client developer can discover the weaknesses of his/her project, or whether there is any class of peer participants that does not play by the fairness rules of the protocol. The main components of the library are designed as modules, each of them having an exact role. The modules were created so that they can be used without major modifications in future library versions. Figure 4.1 shows the general structure of the logging library. The developer can choose to use one of the two provided APIs. The choice may depend on the BitTorrent client implementation. The second API takes into consideration the state of the peer whose message is being logged. It is based on the first API, which does not save any of the peer information. The parameters used for calling functions of the first API are passed to the second one. Next, parameters are sent to the core module, which verifies and processes them. The core module may choose to pass some of the parameters to the DHT module. If so, the DHT module passes the same parameters to the module for decoding bencoded strings. The latter creates a bencoded structure that is passed back to the DHT module. Based on the received result, the DHT module creates a DHT structure that is also passed back to the core module. The core sends all the prepared data to the buffering module, which returns a buffered string. Finally, the string is printed to standard output, standard error or a file.

4.2.1 Library APIs

The library provides two APIs that can be used by the developer. Designing the library structure involved a trade-off between easing the use of library functions and capturing all the message information supported by the BitTorrent protocol.

Figure 4.1: Logging Library Architecture

In the first phase, a set of API functions was developed considering only the latter aspect. This phase of development had as its main aim building an interface that offers true and valid functionality. It does not take into consideration the peer state and information, which is why it can be classified as a stateless API. The aim of the next phase was to create a more developer-friendly interface. It was based on the first set of functions for two reasons: first, to allow the programmer to use the first set as well, if it maps more easily onto his/her implementation, and second, to reuse code. Functions in the second set use a structure that gathers the peer information needed for logging, such as peer address, peer client, torrent hash and name. The structure is initialized when the client builds its own corresponding peer structure. A reference to the structure representing the peer logging information is saved in the associated peer structure used by the client. This approach is recommended because the first API's functions need a longer list of arguments containing the peer parameters besides the particular information provided by the BitTorrent message, while the second API only uses the reference to the corresponding peer logging structure for the peer information.

4.2.2 Library Initialization and Cleanup

The library provides initialization and exit operations. At initialization, the library saves the configuration parameters given as arguments to the corresponding function call. The output configuration may be hard-coded by the programmer or set by the user via the client's GUI. It may be a given format using the list of output variables, or an XML format. The destination for the output must be specified: it may be standard output, standard error, or a file. The destination may be specified using an environment variable named “BTLOG” or a string parameter given as an argument to the initialization function call. The initialization supplies an extra option for using a list of peers. The developer may choose this option to leave the responsibility of freeing the memory allocated for peer structures to the library. If the chosen API is the peer-based one, then in order to log messages sent to or received from a peer, a peer_log_t structure must be created beforehand. In this case, the peer address is passed using the structures provided by Linux platforms (struct in_addr and struct in6_addr); on each function call, the peer address is converted to a string by the library. When the client session ends, the library exit function must be called. If the configured output destination is a file, it is closed on exit. Also, if the text output format mode is set, the array of strings created by the core module for printing the information is cleared. If the developer chose to use the peer logging structure list provided by the library, the exit function releases the memory allocated for those structures.
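A usage sketch of the initialization and exit sequence follows. The btlogInit prototype is inferred from the call that appears later in Listing 4.3; the meaning of its second argument (taken here to be the peer-list option) and the btlogExit name are our assumptions:

#include <stdlib.h>

/* prototype inferred from the call in Listing 4.3 (an assumption) */
void btlogInit(const char *client, int use_peer_list,
               const char *format, int xml_enabled);
void btlogExit(void);   /* hypothetical name for the exit function */

int main(void)
{
    /* the output destination may be given via the BTLOG environment variable */
    setenv("BTLOG", "/tmp/session.btlog", 1);

    /* client name, library-managed peer list, text format, XML disabled */
    btlogInit("myclient 0.1", 1, "[%time] %address (%inout) %type: %msg", 0);

    /* ... run the BitTorrent session, calling the logging API functions ... */

    btlogExit();   /* closes the output file and frees peer structures */
    return 0;
}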

4.2.3 Peer Logging Structure

The peer_log_t structure is handled in an object-oriented manner. Variables of this type are created using a function acting as the constructor does for object-oriented classes. Modifications are made using a set of functions, just as setters are used in the object-oriented paradigm. This approach spares the developer from allocating resources for the data needed in logging. The peer_log_t instance should be created immediately after a successful handshake is received. According to the protocol, the connection between peers is established after the initiator of the handshake receives the handshake message sent as a response. The message contains the peer's identifier and the SHA1 hash associated with the torrent. The IP and the port of the peer must be known whenever a message is sent or received on the TCP connection. The name of the torrent is known after checking the info_hash parameter of the handshake. If a peer sends a handshake, it is implicitly assumed that it knows the torrent name. A handshake receiver obtains the torrent name after checking its availability in the client's list of active torrents. A peer_log_t instance may need modifications during the session, for different reasons. One reason is that the client may not be able to complete the fields of the structure immediately after the handshake message is sent or received. This happens in two situations. First, if processing the handshake involves an analysis of each parameter in different contexts – for example, if different functions decode and verify each parameter, without saving or passing it. Second, if a protocol extension negotiation occurs and the handshake is carried out in several steps; for the connection encryption extension, the steps involved follow the Diffie-Hellman key exchange protocol. Another reason is the reception of a “port” message. A “port” message is sent by client versions that implement a DHT tracker; the listen port is the port the peer's DHT node is listening on. Only then can the peer's DHT port be set. When modifications must be made to the peer_log_t structure, the developer should use the setters provided by the logging library. The peer-state-aware set of functions is built on the basic set of functions of the first API. The peer_log_t reference variable given as a parameter is translated into the parameters used for calling the basic functions. The address of the peer is translated into a string using the standard representation. For IPv4, the dotted decimal notation is used, followed by a colon and the ASCII representation of the port. For IPv6, the hexadecimal notation inside brackets is used, followed by a colon and the ASCII representation of the port.
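A lifecycle sketch for peer_log_t, using the constructor and setter names that appear in Listings 4.4 and 4.5; the exact prototypes are inferred from those calls and should be treated as assumptions:

#include <netinet/in.h>

typedef struct peer_log peer_log_t;

/* prototypes inferred from Listings 4.4 and 4.5 (assumptions) */
peer_log_t *newPeerLogV4(const struct in_addr *addr, unsigned short port,
                         const char *peer_id, const char *torrname,
                         const char *torrhash);
void logSetPeerClientFromId(peer_log_t *pl, const char *peer_id);
void logSetPeerTorrName(peer_log_t *pl, const char *name);
void logSetPeerTorrHash(peer_log_t *pl, const char *hash);

/* called once the handshake has completed successfully */
static peer_log_t *on_handshake_done(const struct in_addr *addr,
                                     unsigned short port,
                                     const char *peer_id,
                                     const char *torrname,
                                     const char *torrhash)
{
    /* the address and port are known first; the remaining fields are
       filled in via setters once the handshake has been checked */
    peer_log_t *pl = newPeerLogV4(addr, port, NULL, NULL, NULL);

    logSetPeerClientFromId(pl, peer_id);   /* mapped to a client name */
    logSetPeerTorrName(pl, torrname);
    logSetPeerTorrHash(pl, torrhash);
    return pl;
}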

4.2.4 Core Module

When a basic function of the first API is called, the arguments of the message are sent to the core of the library. The library core is the part where the message string is created. First of all, the core verifies the parameters' validity in order to ensure a well-formatted and coherent message output string. Depending on the parameters and the session state, the core decides how the printed message will look. The string is buffered into an inner library structure called btl_buffer_t.

The message is composed of different parts. For the text output mode, each part represents one of the following:
• a format variable – one of the symbols of the format, prefixed with the percent (‘%’) character
• a padding string – a component of the string other than a format variable
When a message component is created, it is added by the core to the buffer. Inside it, the buffer constructs the string containing the whole message, expanding its allocated memory whenever needed. All the format variables except the %msg variable are printed in the same way for every message. The %msg variable is expanded in a different manner, depending on message-specific characteristics. The messages belonging to the BitTorrent protocol, the Fast Peers extension and the Connection Encryption extension are all treated the same. Except for the handshake message, the BitTorrent protocol message parameters are integer values; they represent block coordinates relative to piece indexes. The handshake message has a string parameter which contains the protocol name and eight reserved bytes indicating the supported extensions of the initiating client. The Fast Peers messages also have integer parameters, corresponding to the blocks being advertised. For the Connection Encryption extension, parameters are rather few: only the crypto_provide and crypto_select messages have integer parameters, which are printed as hexadecimal values, according to the extension specification. An exceptional case is when Distributed Hash Table extension messages are sent or received. Since DHT messages can have a significant number of parameters, it would have been difficult not only to implement a different API function for every type, but also to use them in BitTorrent client development. Thus, the chosen solution was to create just two functions for incoming and outgoing DHT messages, whose bencoded string is decoded in the library core. The bencoded string is passed as an argument by the developer. The core passes the bencoded string to the DHT module, which has the role of returning a DHT structure based on the given encoded string. The DHT module calls the bencoded-string decoding module, which is able to analyze and verify the bencoded string. If the string is well encoded, a bencoded structure is returned as a response; otherwise the response is null.
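The DHT path relies on bencoding. The fragment below is not the library's decoder, only a minimal sketch of bencoding itself, showing how an integer (“i42e”) and a string (“4:spam”) are parsed:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* parse "i<digits>e" at *s; returns 1 on success and advances *s */
static int parse_int(const char **s, long *out)
{
    if (**s != 'i')
        return 0;
    char *end;
    long v = strtol(*s + 1, &end, 10);
    if (*end != 'e')
        return 0;
    *out = v;
    *s = end + 1;
    return 1;
}

/* parse "<len>:<bytes>" at *s into buf; returns the length or -1 */
static int parse_str(const char **s, char *buf, int bufsz)
{
    char *end;
    long len = strtol(*s, &end, 10);
    if (end == *s || *end != ':' || len < 0 || len >= bufsz)
        return -1;
    memcpy(buf, end + 1, len);
    buf[len] = '\0';
    *s = end + 1 + len;
    return (int)len;
}

int main(void)
{
    const char *msg = "i42e4:spam";
    long n;
    char word[16];

    if (parse_int(&msg, &n) && parse_str(&msg, word, sizeof(word)) >= 0)
        printf("int=%ld string=%s\n", n, word);   /* int=42 string=spam */
    return 0;
}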

4.2.5 Buffering Module

The message is buffered using the functions provided by the buffering module. The buffering module has the role of creating a btl_buffer_t structure which contains the characters that will be printed out. It provides functions for adding new information depending on the output format type, text or XML, and on the parameter type. For each new addition, the buffer is expanded if the remaining space is not enough for the information to be inserted. After the buffer is complete, the characters are written to the output destination: standard output, standard error or the configured file. After the writing process ends, all the allocated resources are released.
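Hypothetical internals for a btl_buffer_t-like growable buffer (the real structure layout is not shown here); the expansion-on-demand behaviour described above might look like this:

#include <stdlib.h>
#include <string.h>

typedef struct {
    char   *data;   /* accumulated message characters */
    size_t  len;    /* bytes used */
    size_t  cap;    /* bytes allocated */
} sketch_buffer_t;

/* append a string, doubling the allocation whenever space runs out */
static int buf_append(sketch_buffer_t *b, const char *s)
{
    size_t n = strlen(s);

    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n + 1)
            cap *= 2;
        char *p = realloc(b->data, cap);
        if (!p)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, s, n + 1);   /* keep the buffer NUL-terminated */
    b->len += n;
    return 0;
}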

4.2.6 Providing Logging Output

The logging output destination and format can be configured by either the developer or the user to suit one's needs. The output can be redirected to standard output, standard error or a file, given as a parameter. The format of the output can be text mode, each line representing a message, or XML, each message representing a main node in the XML tree structure. In the XML format, details regarding the message are given in their own tags.

Text Mode

When the XML-enabling parameter is omitted, the text output format is used. The output format is a string that must contain the logging-specific variables. A logging-specific variable is a variable from Table 4.4 that has to be inserted in the format string in order to be printed. The variable will be expanded into its corresponding value. All the other characters will be printed as they are given in the format string, acting as variable separators or padding.

Table 4.4: Library Logging Variables

Variable    Description
%time       Time
%address    Peer address
%client     Peer client
%torrname   Torrent name
%torrhash   Torrent hash
%inout      Message direction
%type       Message type
%msg        Message content

The format is processed at logging initialization and represented as an array of strings, each string being associated with a format variable or a padding string. When all the data for printing a message is available, the library core walks the array in order to print the message. If an array element represents a format variable, its corresponding logging data item is printed; otherwise, the respective string is printed as-is. Each message is printed on a single line; thus, when the logging output is parsed, a line-by-line read is straightforward. Example for a “request” message:

[%time] %address <%client> "%torrname" (%inout) %type: %msg

• library output:

[15:54:47.287] 24.132.173.70:54404 <µTorrent 1.8.1> "Ubuntu_8.10" (Send) Request: index=1532 offset=262144 size=16384

As we can see, the time is shown between brackets, the address has the standard IPv4 representation, the peer's client name is “µTorrent 1.8.1” and the torrent name is “Ubuntu_8.10”. The message is being sent, as the Send label indicates, and the message content contains the index of the piece that holds the requested block, the offset of the block inside the piece and the block size. The output respects the given format as-is.
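Given the one-message-per-line guarantee, the text output can be consumed with a simple line parser. A sketch for the request line above (the µ in the client name is replaced with u here to keep the source ASCII):

#include <stdio.h>

int main(void)
{
    const char *line =
        "[15:54:47.287] 24.132.173.70:54404 <uTorrent 1.8.1> "
        "\"Ubuntu_8.10\" (Send) Request: index=1532 offset=262144 size=16384";

    char tm[32], addr[64], client[64], torrname[64], dir[16];
    int index, offset, size;

    /* the conversion fields mirror the user-provided format string above */
    int n = sscanf(line,
                   "[%31[^]]] %63s <%63[^>]> \"%63[^\"]\" (%15[^)]) "
                   "Request: index=%d offset=%d size=%d",
                   tm, addr, client, torrname, dir, &index, &offset, &size);
    if (n == 8)
        printf("%s requested piece %d, offset %d, %d bytes\n",
               client, index, offset, size);
    return 0;
}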

XML Mode

The XML output file contains well-formed, tree-based information. Messages are logged as XML elements, each message argument being represented as a nested element. The argument order is hard-coded, thus it cannot be configured or reorganized.

The XML format was chosen so that the output can be easily parsed and analysed. The output is well-formed and easy to read. Messages have a clear structure, each argument being constructed as an XML element. The message parameters are gathered in an XML params element. Depending on the message-specific data, message elements may vary. The root element is named btlog and holds as attributes the starting date and time, and the logging client name. Each message is represented by a message element nesting the arguments. Arguments are also XML elements, each one being named after the argument itself. The SHA1 20-byte hashes are printed as 40 hexadecimal digits, 2 digits per hash byte. A printed message may look like this:

<message>
  <time>…</time>
  <address>82.121.44.31:22831</address>
  <client>"Transmission 1.71"</client>
  <torrname>"Ubuntu 8.10"</torrname>
  <torrhash>3132333435363738393031323334353637383930</torrhash>
  <dir>Send</dir>
  <type>KeepAlive</type>
  <params/>
</message>
Listing 4.1: XML Output Message

As we can see in the example above, a KeepAlive message has no parameters, as specified in the BitTorrent protocol. Time is expressed in hours, minutes, seconds and milliseconds. The address is a concatenation of the standard IPv4 representation, a colon separator and the TCP port; IPv6 addresses are also supported. The dir element may contain the values Send or Got, indicating the message direction. The params element nests the message-specific parameters, which are included in the BitTorrent message payload. For example, for a request message the parameters are printed like this:

<params>
  <index>49</index>
  <begin>12</begin>
  <length>1002</length>
</params>
Listing 4.2: params Element in XML Output Message

index indicates the piece containing the requested block, begin – its piece offset, and length – its block size.

4.2.7 Clients Instrumented to Use the Library

The library was designed to cover a wide range of BitTorrent clients' needs. Inserting API functions into an existing BitTorrent client's source code may be a difficult task; that is why the library provides functions that help the developer adjust the stored items, such as peer information or message arguments, to the library's demands. As long as a developer does not include a detailed code specification or enough comments, it is difficult for a third-party developer to modify the whole code so that all the exchanged messages are captured. There are a few widely used BitTorrent libraries: libtorrent-rakshasa, libtorrent-rasterbar, MonoTorrent and BTSharp. Except for the latter, they are released under open source licenses. They are not used by a vast majority of clients, and some clients do not save the information needed for later logging. The two library APIs provide sufficient functions for good functionality; this does not necessarily imply development convenience, as that depends on the client implementation. In order to provide a valuable use of the library, we updated two such instances: the libtransmission component of the Transmission client and libtorrent-rakshasa, linked against the rTorrent client.

rtorrent

rTorrent is a C++ client written by Jari Sundell; it uses as its back end the free (GNU GPL) libtorrent-rakshasa library1. The rakshasa library does not log the messages exchanged; it only outputs error messages for debugging. For testing the btlog library, modifications to the rakshasa library were required. The modified files of the libtorrent library are in the src/ directory. The logging library is initialized in the initialize function from the torrent/torrent.cc file. For setting up the btlog library, a wrapper function was created, as seen in Listing 4.3.

1http://libtorrent.rakshasa.no/

static void btlogInitialization(void)
{
    const char * format = NULL;
    const char * xml_env = getenv("XML_ENABLED");
    int xml_enabled = 0;

    if (xml_env)
        xml_enabled = atoi(xml_env);
    else
        format = getenv("BTLOG_FORMAT");

    btlogInit(PACKAGE_STRING, 1, format, xml_enabled);
}
Listing 4.3: Initializing Logging Library for libtorrent-rakshasa

The output format is given by the BTLOG_FORMAT environment variable. The flag enabling XML output is set via the XML_ENABLED environment variable. If XML is enabled, the text format is neither passed nor read. Rakshasa uses a PeerInfo class for every peer it exchanges messages with. The peer logging structure reference was linked as a field of this class. When an object of this class is instantiated, the peer logging structure is also created, but only the address and the port are set. The other fields are set after the handshake successfully ends. This is done in the insert function from torrent/peer/connection_list.cc. The wrapper used for setting the other fields is shown in Listing 4.4.

void initPeerLog(PeerInfo * peerInfo, DownloadInfo * downloadInfo)
{
    const char * peer_id = peerInfo->id().c_str();
    const char * name = downloadInfo->name().c_str();
    const char * hash = downloadInfo->hash().c_str();
    char torrHash[HASH_SIZE];

    memset(torrHash, 0, HASH_SIZE);
    memcpy(torrHash, hash, HASH_SIZE);

    if (!peerInfo->peer_log)
        peerInfo->peer_log = newPeerLog(peerInfo->socket_address(),
                                        peerInfo->listen_port(),
                                        peer_id, name, torrHash);
    else {
        logSetPeerClientFromId(peerInfo->peer_log,
                               peerInfo->id().c_str());
        logSetPeerTorrName(peerInfo->peer_log, name);
        logSetPeerTorrHash(peerInfo->peer_log, torrHash);
    }
}
Listing 4.4: Library Logging in libtorrent-rakshasa

The PeerInfo class saves the peer ID, while the DownloadInfo class contains the torrent hash and name. The peer ID is translated by the library into a human-readable client name.

Transmission

Transmission1 is a BitTorrent client written in C. It has its own BitTorrent library, called libtransmission. It was the first client tested with the logging library and it influenced the first implementation changes. Transmission already has a strong logging system, but most of the logged data contains information on inner events; thus, its log files grow very large, while the logging library output contains only information regarding the BitTorrent protocol. The logging library is used by the Transmission client as a bundled static library. Its sources are saved in the third-party directory and are compiled once, with the other third-party modules. This was made possible by adding a Makefile.am file for the automake tool, which is used for automated client compilation; it contains the file names of the library and the built library name: libbtlog.a. Library initialization is made in the tr_sessionInit function of the session.c source file. The peer logging structure reference (peer_log_t) was hooked into the tr_peerIo structure of the client, which is used for every event regarding a valid connection with a peer. The peer logging structure is initialized in the tr_peerIoNew function from peer-io.c, as described in Listing 4.5.

1http://www.transmissionbt.com/

static tr_peerIo * tr_peerIoNew(...,
                                const tr_address * addr,
                                tr_port port,
                                const uint8_t * torrentHash,
                                ...)
{
    tr_peerIo * io;

    ...

    /* BTLOG create peer log */
    if (addr->type == TR_AF_INET)
        io->peer_log = newPeerLogV4(&addr->addr.addr4, port,
                                    NULL, NULL, torrentHash);
    else
        io->peer_log = newPeerLogV6(&addr->addr.addr6, port,
                                    NULL, NULL, torrentHash);

    ...
}
Listing 4.5: Library Logging in Transmission

Only the address, the port and the torrent hash are known when the structure is created. The other fields are completed using the setter functions in later contexts. Transmission processes the client and torrent names only after the handshake is successful; this is done by the myHandshakeDoneCB function of the peer-mgr.c source file. The API message logging functions are called in the functions of the peer-msgs.c source file.

4.3 Log Collecting for BitTorrent Peers

The log collection approach that implies a less intrusive activity while still providing a great deal of protocol parameters is the use of logging information from clients. Each client typically presents status information (the dubbed status messages), consisting of periodic information such as download speed, upload speed and number of connections, and, if enabled, a set of enhanced pieces of information (the dubbed verbose messages). The types of messages and their content have been thoroughly described in Section 4.1. All or most BitTorrent clients provide status messages, but some sort of activation or instrumentation is required to obtain verbose messages.

4.3.1 Instrumenting BitTorrent Clients for Logging Support

Throughout the experiments we have used multiple open-source clients. All of them provided basic status information, while some were updated or altered to provide verbose information as well. Transmission, Aria2, Vuze, Tribler, libtorrent-rasterbar and the mainline client have been used to provide status parameters, while Tribler and libtorrent-rasterbar have also been instrumented to provide verbose parameters. Several approaches have been put to use to collect status information, depending on the client implementation:
• The main issue with Azureus was the lack of a proper CLI that would enable automation. Though limited, a “Console UI” module enabled automating the tasks of running Azureus and gathering download status and logging information.
• Although a GUI-oriented client, Tribler does offer a command line interface for automation.
• Transmission has a fully featured CLI and was one of the clients that were very easy to automate. Detailed debugging information regarding connections and chunk transfers can be enabled by setting the TR_DEBUG_FD environment variable.
• aria2 natively provides a CLI and was easy to automate. Logging is also enabled through CLI arguments.
• hrktorrent is a lightweight implementation on top of libtorrent-rasterbar and provides the necessary interface for automating a BitTorrent transfer, although some minor modifications have been necessary.
• BitTorrent Mainline provides a CLI, and logging can be enabled through minor modifications of the source code.

In order to examine BitTorrent transfer parameters at the protocol implementation level, we propose a system for storing and analysing logging data output by BitTorrent clients. It currently offers support for hrktorrent/libtorrent1 and Tribler2. Our study of logging data takes into consideration two open-source BitTorrent applications: Tribler and hrktorrent3 (based on libtorrent-rasterbar). While the latter needed minimal changes in order to provide the necessary verbose and status data, Tribler had to be modified significantly. The process of configuring Tribler for logging output is completely automated using shell scripts and may be reversed. The source code alterations are focused on providing both status and verbose messages as client output information. Status message information provided by Tribler includes transfer completion percentage and download and upload rates. In the modified version, it also outputs the current date and time, transfer size, estimated time of arrival (ETA), number of peers, and the name and path of the transferred file. In order to enable verbose message output, we took advantage of the fact that Tribler uses flags that can trigger printing of various implementation details to standard output, among which are the actions related to receiving and sending BitTorrent messages. The files we identified as responsible for protocol data are changed using scripts, in order to print the necessary information and to associate it with a timestamp and date. Since most of the protocol exchange data is passed through several levels in Tribler's class hierarchy, attention had to be paid to avoiding duplicate output and reducing file size. In contrast to libtorrent-rasterbar, which, at each transfer, creates a separate session log file for each peer, Tribler stores verbose messages in a single file. This file is passed to the verbose parser, which extracts the relevant parts of the messages and writes them into the database. Unlike Tribler, hrktorrent's instrumentation did not involve modifying its source code, but defining the TORRENT_LOGGING and TORRENT_VERBOSE_LOGGING macros before building (recompiling) libtorrent-rasterbar. Minor updates had to be made to the compile options of hrktorrent in order to enable logging output. Although our system processes and stores all protocol message types, the most important messages for our swarm analysis are those related to changing a peer's state (choke/unchoke) and requesting/receiving data. Correlations between these messages are at the heart of providing information about the peers' behaviour and the BitTorrent clients' performance.

1http://www.rasterbar.com/products/libtorrent/
2http://www.tribler.org/trac
3http://50hz.ws/hrktorrent/

4.3.2 Storage of Logging and Processed Data

Logging information is typically stored in log files. In libtorrent-rasterbar's case, logging uses a whole folder, with the logging information for each remote peer placed in a separate file in that folder. Usually, information is redirected from standard output and standard error towards the output file. In a given experiment, logging information occupies a large portion of disk space, especially verbose messages; as such, files and folders are compressed into archive files. There would generally be a log archive for each client session. When information is to be processed, logging archives are provided to the data processing component. A log archive contains both status messages and verbose messages.

Logging information may be stored in archive files for subsequent use or it may be processed live – that is, parsing and interpreting parameters as log files are being generated. When running a live/real-time processing component, compressing logging information may not be required. However, in order to still preserve the original files, some experimenters may choose to retain access to the log archives. The usefulness of a live processing component lies primarily in relieving the burden of space consumption, in case archiving is disabled. Most of the logged information is not useful: some peers may not be connected to other peers, and the status information, though provided, consists of parameters that are equal to zero – no connections means 0 KB/s download speed, 0 KB/s upload speed and so on. On one occasion, a log file that had been in use for more than 3 weeks occupied more than 1 GB of data but yielded just 27 KB of valuable information.

Whether using live parsers or subsequent analysis, parameters are parsed for rapid use. The post-parsing storage is typically a relational database. The advantage of such a storage facility is its rapid access for post-processing. When inquiring about given swarm parameters, the user queries the database and rapidly obtains the necessary information. Without such storage, each inquiry would require a new parsing pass, resulting in large overhead and CPU consumption. Database storage is the final step of the logging and parsing stage. Parameter analysis, interpretation and advising activities are not concerned with logging information; they only query the database.

1http://www.rasterbar.com/products/libtorrent/
2http://www.tribler.org/trac
3http://50hz.ws/hrktorrent/
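A live parser of the kind described above may be sketched in Python as follows: it follows a growing status log file, drops the all-zero samples, and stores the remaining parameters in an SQLite database. The status line format, the file names and the table layout are assumptions used only for illustration.

import re
import sqlite3
import time

# Assumed status-line format: "<timestamp> <percent>% dl=<KB/s> ul=<KB/s>";
# real clients differ and each needs its own regular expression.
STATUS_RE = re.compile(
    r"(?P<ts>\S+) (?P<pct>[\d.]+)% dl=(?P<dl>[\d.]+) ul=(?P<ul>[\d.]+)")

def follow(path):
    """Yield lines appended to a log file while the client is running."""
    with open(path) as f:
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)   # no new data yet; poll again
                continue
            yield line

db = sqlite3.connect("swarm.db")
db.execute("CREATE TABLE IF NOT EXISTS status_messages "
           "(timestamp TEXT, percent REAL, download_speed REAL, upload_speed REAL)")

for line in follow("session.status.log"):
    m = STATUS_RE.match(line)
    if not m:
        continue                                      # skip unparsable lines
    if float(m["dl"]) == 0.0 and float(m["ul"]) == 0.0:
        continue                                      # drop all-zero samples
    db.execute("INSERT INTO status_messages VALUES (?, ?, ?, ?)",
               (m["ts"], float(m["pct"]), float(m["dl"]), float(m["ul"])))
    db.commit()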

4.3.3 Peer-to-Peer Log-Centric Experiments

In order to collect information specific to a swarm, one must have access to all clients and to the logging information from those clients. As such, either all clients are accessible to the experimenter, or users subsequently provide logging information to the experimenter. Some remote information may be replaced by that provided by a tracker log file. Tracker logs provide an overall swarm view, albeit with quite a large period (typically 30 minutes – 1800 seconds). An intermediate approach to collecting logging information is a form of aggregation of information on the client side. This information may be either sent to a logging service or stored to be subsequently provided to the user. The former approach is taken by the Logging Service within the P2P-Next project.

Typical experiments are those that allow full control to the user and provide all information rendered by clients. These experiments use the virtualization infrastructure described in Chapter 3. Deployment, log activation, log collection/archiving and even parsing are accomplished in full automation. One would create a configuration file and run experiments. Log archive files typically result from experimentation and, after gathering them all, the data may be subjected to analysis.

The inclusion of tracker information has been enabled in the UPB Linux Distro Experiments1. Tracker log files are parsed live and provide overall swarm parameters. Various information from tracker log files is provided as graphic images that show the evolution of swarm parameters.

Tracker logs have also been enabled in LivingLab experiments, as described in Section 7.3.4. These experiments rely on extensive logging information (verbose messages) provided by seeders and on tracker information. The lack of complete access to all clients in the swarm is balanced out by the usage of verbose logging on the seeders' side. However, remote peers' intercommunication is not logged in any way, such that a form of aggregation and collection of remote peers' intercommunication messages is still required.

1http://torrent.cs.pub.ro/

4.3.4 Approaches to Processed Data Analysis

Log processing, as described in Section 4.4, refers to parsing and interpreting BitTorrent protocol parameters. Data is parsed into an easily accessed database that is provided to the user. As described above, one may choose to store logging information and then enable analysis; we dub this approach post-processing. The other approach is live analysis of the provided parameters, resulting in client and swarm monitoring. The two approaches may, of course, be combined: while information is being parsed, it is also stored in a database and various parameters are monitored at the same time. An overview of a typical architecture for data processing is presented in Figure 4.5. Separate parsers are used for live parsing and classical parsing. Classical parsing results in a database "output", while live parsing results in both a database "output" and the possibility of deploying live client and swarm monitoring.

4.4 Protocol Data Processing Engine

As client instrumentation provides in-depth information on client implementation, it generates extensive input for data analysis. Coupled with carefully crafted experiments and message filtering, this allows the detection of weak spots and improvement possibilities in current implementations. It thus provides feedback to client and protocol implementations and swarm "tuning" suggestions, which in turn enable high performance swarms and rapid content delivery in Peer-to-Peer systems.

Due to the various types of modules employed (such as parser implementations, storage types, rendering engines), a data processing framework may provide different architectures. A sample view of the infrastructure consists of the following modules (a sketch of the Database Access layer follows the list):

• Parsers – receive log files provided by BitTorrent clients during file transfers. Due to differences between log file formats, there are separate pairs of parsers for each client. Each pair analyses status and verbose messages.
• Database Access – a thin layer between the database system and the other modules. Provides support for storing messages, updating and reading them.
• SQLite Database – contains a database schema with tables designed for storing protocol message content and peer information.
• Rendering Engine – consists of a GUI application that processes the information stored in the database and renders it using plots and other graphical tools.
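A minimal sketch of the Database Access module is given below, assuming an SQLite back end; the method names and table columns are illustrative, not the framework's actual interface.

import sqlite3

class DatabaseAccess:
    """Thin layer between the SQLite database and the other modules.
    Method and column names are assumptions for illustration."""

    def __init__(self, path="swarm.db"):
        self.conn = sqlite3.connect(path)

    def store_status_message(self, session_id, timestamp, dspeed, uspeed):
        # Called by the status parser for every parsed log line.
        self.conn.execute(
            "INSERT INTO status_messages "
            "(client_session_id, timestamp, download_speed, upload_speed) "
            "VALUES (?, ?, ?, ?)",
            (session_id, timestamp, dspeed, uspeed))
        self.conn.commit()

    def read_status_messages(self, session_id):
        # Called by the Rendering Engine when plotting a peer's evolution.
        return self.conn.execute(
            "SELECT timestamp, download_speed, upload_speed "
            "FROM status_messages WHERE client_session_id = ? "
            "ORDER BY timestamp", (session_id,)).fetchall()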

Figure 4.2: Logging System Overview (status and verbose logs from the BitTorrent swarm are fed through the status and verbose parsers into the Database Access Module and the SQLite database, which the Rendering Engine reads)

As shown in Figure 4.2, using parsers specific to each type of logging file, messages are sent as input to the Database Access module, which stores them in an SQLite database. In order to analyse peer behaviour, the Rendering Engine reads the stored logging data using the Database Access module and outputs it to a graphical user interface. Once all logging and verbose data from a given experiment is collected, the next step is the analysis phase. The testing infrastructure provides a GUI (Graphical User Interface) statistics engine for inspecting peer behaviour.

The GUI is implemented in Python using two libraries: matplotlib – for generating graphs – and TraitsUI – for handling widgets. It offers several important plotting options for describing peer behaviour and peer interaction during the experiment:

• download/upload speed – displays the evolution of download/upload speed for the peer;
• acceleration – shows how fast the download/upload speed of the peer increases/decreases;
• statistics – displays the types and amount of verbose messages the peer exchanged with other peers.

The last two options are important as they provide valuable information about the performance of the BitTorrent client and how this performance is influenced by the protocol messages exchanged by the client. Sample GUI screenshots may be observed in Figure 4.3 and Figure 4.4.

Figure 4.3: Rendering Engine for BitTorrent Parameters: Client Analysis

The acceleration option measures how fast a BitTorrent client is able to download data. High acceleration forms a basic requirement in live streaming, as it means starting playback of a torrent file with little delay. The statistics option displays the flow of protocol messages; we are mostly interested in the choke/unchoke messages. The GUI also offers two modes of operation: Single Client Mode, in which the user can follow the behaviour of a single peer during a given experiment, and Client Comparison Mode, allowing for comparisons between two peers.
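The sketch below shows how the download speed and acceleration plots may be produced with matplotlib, the plotting library named above; the TraitsUI widget layer is omitted and the synthetic data stands in for values read from the database.

import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for database values: sampling times (s) and download
# speeds (KB/s) for one peer.
t = np.arange(0, 600, 5)
speed = np.clip(np.cumsum(np.random.default_rng(1).normal(2, 5, t.size)), 0, None)

accel = np.gradient(speed, t)          # acceleration: rate of change of speed

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.plot(t, speed)
ax1.set_ylabel("Download speed (KB/s)")
ax2.plot(t, accel)
ax2.set_ylabel("Acceleration (KB/s^2)")
ax2.set_xlabel("Time (s)")
plt.show()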

Figure 4.4: Rendering Engine for BitTorrent Parameters: Client Comparison

4.4.1 Post Processing Framework for Real-Time Log Analysis

The post-processing framework dubbed above is used for storing logging information provided by various BitTorrent clients into a storage area (commonly a database). An architectural view of the framework is presented in Figure 4.5.

Figure 4.5: Post-Processing Framework Architecture

The two main components of the framework are the parser and the storage components. Parsers process log information and extract the measured protocol parameters to be subjected to analysis; storers provide an interface for database or file storage – both for writing and reading. Storers thus provide an easy to access, rapid to retrieve and extensible interface to parameters. Storers are invoked when parsing messages – for storing parameters – and when analyzing parameters – for retrieving/reading/accessing parameters.

Within the parser component, a LogParser module provides the interface to the actual parser implementations. There are two kinds of parsers: log parsers and real-time log parsers. The former are used for data already collected and subsequently provided by the experimenter. The other approach involves using parsers at the same time as the client generates logging information. This real-time parsing approach possesses three important advantages: monitoring may be enabled for status messages, less space is wasted as messages are parsed in real time, and processing time is reduced due to the overlapping of parsing time and storing time. The disadvantage of a real-time parser is a more complex implementation, as it has to consider the current position in the log file and continue from that point when data is available. At the same time, all clients must be able to access the same database, probably located on a single remote system.

The storage component is interfaced by the SwarmAccess module. This module is backed by database-specific implementations: RDBMS systems such as MySQL or SQLite, or file-based storage. Parameters are stored according to the schema described in Figure 4.6. The configuration of the log files and clients to be parsed is found in the SwarmDescription file. All data regarding the currently running swarm is stored in this file. Client types in the description file also determine the parser to be used. Selection of the storage module is based on the configuration directives in the AccessConfig file. For SQLite storage, this contains the path to the database file; for a MySQL database, it contains the username, password and database name common to database connection inquiries.

The user/developer is interfaced by a GlueModule that provides all methods deemed necessary. The user would call methods in the GlueModule for such actions as parsing a session archive or a swarm archive, updating a configuration file, or retrieving data that fills a given role.
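The storage selection described above may be sketched as follows; the INI section and option names are assumptions, and only the directives mentioned in the text (an SQLite database path versus MySQL credentials) are modelled.

import configparser

def make_storage(config_path="AccessConfig.ini"):
    # Instantiate a storage back end from AccessConfig directives.
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    backend = cfg["storage"]["backend"]
    if backend == "sqlite":
        import sqlite3
        return sqlite3.connect(cfg["storage"]["path"])
    if backend == "mysql":
        import MySQLdb      # assumes the MySQL-python (MySQLdb) bindings
        return MySQLdb.connect(user=cfg["storage"]["username"],
                               passwd=cfg["storage"]["password"],
                               db=cfg["storage"]["database"])
    raise ValueError("unknown storage backend: " + backend)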

Parameter Storage Database

The database schema as shown in Figure 4.6 is used for relational database engines such as MySQL or SQLite.

Figure 4.6: Database Schema

The database schema provides the means to efficiently store and rapidly retrieve BitTorrent protocol parameters from log messages. The database is designed to store parameters about multiple swarms in the swarms table; each swarm is identified by the .torrent file its clients are using. Information about the peers/clients that are part of the swarm is stored in the client_sessions table. Each client is identified by its IP address and port number. Multiple pieces of information such as the BitTorrent client in use, enabled features and hardware specifics are also stored. Three classes of messages result in three tables: status_messages, peer_status_messages and verbose_messages. The peer_status_messages table stores parameters particular to the remote peers connected to the current client, while status_messages stores parameters specific to the current client (such as download speed, upload speed and others). Each line in the *_messages tables points to an entry in the client_sessions table, identifying the peer it belongs to – the one that generated the log message.
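A sketch of this schema for SQLite is shown below; beyond the fields explicitly named in the text, the column lists are assumptions.

import sqlite3

SCHEMA = """
CREATE TABLE swarms (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    torrent_file TEXT                 -- swarm identified by its .torrent file
);
CREATE TABLE client_sessions (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    swarm_id    INTEGER REFERENCES swarms(id),
    ip          TEXT,                 -- client identified by IP address ...
    port        INTEGER,              -- ... and port number
    client_type TEXT,                 -- BitTorrent client in use
    features    TEXT,
    hardware    TEXT
);
CREATE TABLE status_messages (
    client_session_id INTEGER REFERENCES client_sessions(id),
    timestamp         TEXT,
    download_speed    REAL,
    upload_speed      REAL
);
CREATE TABLE peer_status_messages (
    client_session_id INTEGER REFERENCES client_sessions(id),
    timestamp         TEXT,
    remote_peer       TEXT,
    download_speed    REAL
);
CREATE TABLE verbose_messages (
    client_session_id INTEGER REFERENCES client_sessions(id),
    timestamp         TEXT,
    message_type      TEXT             -- e.g. CHOKE, UNCHOKE, REQUEST, PIECE
);
"""

sqlite3.connect("swarm.db").executescript(SCHEMA)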

Logfile-ID Mapping

When parsing log files, one has to know the ID of the client session that generated the log file. In order to automate the process, there needs to be a mapping between the log file (or log archive) and the client session ID. At the same time, the client session ID needs to exist in the client_sessions table in the database, together with information such as BitTorrent client type, download speed limitation, operating system, hardware specification etc. This information needs to be supplied by the experimenter in a form that is both easy to create (by the experimenter) and easy to parse.

A swarm description file is to be supplied by the experimenter. This file consists of all required swarm and peer information, including the name/location of the log file/archive. As we consider the INI format to be best suited for this – it is fairly easy to create, edit and update – it was chosen for populating the initial information. The experimenter may easily create an INI swarm description file and provide it to the parser together with the (compressed) log files.

The swarm description file is parsed by the description parser and SQL queries populate the database. One entry goes into the swarms table and a number of entries equal to the number of peers in the swarm description file go into the client_sessions table. As a result of these queries, swarm IDs and client session IDs are created when running the SQL insert queries (due to the AUTO_INCREMENT options). These IDs are essential for the message parsing process and are written down in the Logfile-ID-Mapping-File. The swarm description parser thus parses the description file and also generates a logfile-ID mapping file. The parser is responsible for three actions:

• parsing the swarm description file;
• creating and running SQL insert queries on the swarms and client_sessions tables;
• creating a logfile-ID mapping file consisting of mappings between client session IDs and log files.

The logfile-ID mapping file generated by the swarm description parser is subsequently used by the message parser (be it for status messages or verbose messages). The mapping file simply maps a client session ID to a log file or a compressed set of log files. A sample file is stored in the repository. The message parser doesn't need to know client session information; it just uses the mapping file and populates entries in the *_messages tables. The message parser uses the logfile-ID mapping file and the log file (or compressed set of log files) to populate the *_messages tables in the database (status_messages, peer_status_messages, verbose_messages).

The workflow of the entire process is highlighted in Figure 4.7. There is a separation between the experimenter – the one running trials and collecting information – and the parser – the one interpreting the information. Trials are run and the experimenter provides a log file, set of log files or archive of log files (the data) and a swarm description file (INI format) consisting of the characteristics of the clients in the swarm, the file used and the swarm itself (the metadata). The swarm description file is used to produce an intermediary logfile-ID mapping file, as described above. This file may be provided as a file system entry (typically INI), as in-memory information, or it may augment the existing swarm description file (only the client session ID needs to be added).
The logfile-ID mapping, the swarm description file and the log file(s) are then used by the message parser and the description parser to provide the actual BitTorrent parameters to be stored in the database. The parsers instantiate a specific storage class, as required by the user, and store the information there. A condensed sketch of the description-parsing step is shown below.
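In this sketch, the swarm description INI file is read, one entry is inserted into the swarms table and one per peer into client_sessions, and the generated IDs are written to a logfile-ID mapping file; the INI section and option names are assumptions.

import configparser
import sqlite3

desc = configparser.ConfigParser()
desc.read("swarm-description.ini")

db = sqlite3.connect("swarm.db")
cur = db.cursor()
cur.execute("INSERT INTO swarms (torrent_file) VALUES (?)",
            (desc["swarm"]["torrent_file"],))
swarm_id = cur.lastrowid                    # swarm ID created by the insert

mapping = configparser.ConfigParser()       # logfile-ID mapping file (INI)
for section in desc.sections():
    if not section.startswith("peer:"):
        continue
    peer = desc[section]
    cur.execute("INSERT INTO client_sessions (swarm_id, ip, port, client_type) "
                "VALUES (?, ?, ?, ?)",
                (swarm_id, peer["ip"], peer["port"], peer["client"]))
    # Map the generated client session ID to the peer's log file/archive.
    mapping[str(cur.lastrowid)] = {"logfile": peer["logfile"]}

db.commit()
with open("logfile-id-mapping.ini", "w") as f:
    mapping.write(f)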

Figure 4.7: Workflow of Log Parsing considering ID Mapping

4.4.2 Monitoring Peer-to-Peer Swarms using MonALISA

Client monitoring is enabled through the use of MonALISA1. MonALISA-specific scripts are used to provide the required information to the central repository. The script is typically invoked every 5 seconds. The service station collects the required information and sends it to the monitoring repository. Figure 4.8 is an architectural view of the monitoring infrastructure.

1http://monalisa.cern.ch/monalisa.htm

Figure 4.8: Monitoring with MonALISA

The core component of the monitoring framework is the MonALISA service. The service is able to discover other monitoring services and to be discovered by interested clients. Each service registers itself with a set of Lookup Services (LUSs) as part of a dedicated monitoring group and publishes a set of dynamic attributes that describe it. This way, any interested application can request specific services based on a set of matching attributes. The service consists of an ensemble of multi-threaded subsystems which carry out several functions in an independent fashion, being able to operate autonomously. Some of the most important functions it performs are:

• monitoring of a large number of entities (hosts running BitTorrent test scenarios), using several modules simultaneously, which interact actively with the entities, passively listen for information, or interface with other existing monitoring tools;
• filtering and aggregation of the monitoring data, providing additional information based on several data series;
• storage of the monitoring information for short periods of time, in memory or in a local database; this way, clients can also get the history of the monitoring information they are interested in, not only the real-time monitored values;
• web services for direct access to the monitored data from the monitoring service; local clients can get the information directly, without following the whole chain;
• triggers, alerts and actions based on the monitored P2P information, being able to react immediately when abnormal behaviour is detected; controlling through dedicated modules allows performing more complicated actions that cannot be taken based only on the local flow of information. This way, the service can act on monitoring information from several services at the command of a controlling client, or at a direct or indirect user request.

The repository collecting monitoring information from the service is basically a database. Information can be rendered through a web interface or by using the MonALISA interactive client interface. The monitoring data repository offers a long history for a few preconfigured parameters related to the BitTorrent testing environment. The time series for these parameters are stored in a database for long-term viewing purposes. Due to the large amount of data (from all the services in a community), the typical usage is for storing aggregated and summary values. For those, it offers a set of predefined charts in a web application format that allow selecting among the set of time series and the time interval to plot. The data repository client has been extended to support automated actions and alerts based on configurable triggers.

The two main functions of the data repository are adding new values and querying the storage for data matching a given predicate. Particular implementations are available for different purposes: plain text file logging of data or database back ends, values averaged in time or keeping the original data as it was produced, database structures optimized for keeping arbitrary parameters or structures optimized for a well-known, limited set of parameters. All database-backed storage structures have the option to re-sample the values on the fly, to provide a uniform time distribution of values. New storage implementations that fit particular needs can be easily added. The default database back end for all services is PostgreSQL. Both the service and the repository come with a precompiled PostgreSQL package that is used by default, but the configuration parameters allow pointing to a different back end. In addition to any permanent storage for the monitoring values, a volatile memory buffer is kept internally, serving as a cache for recent data. The size of this memory buffer is dynamically adjusted as a function of how much memory was allocated to the Java Virtual Machine and how much of it is free. This volatile storage is enough for running a light monitoring service, being able to serve history requests within the limit of how much data the service was able to keep in memory.

While the communication mechanism is shared with the interactive client and the storage layer is the same as the one in the basic MonALISA service, the repository web interface is the particularity of this component. Built around a Tomcat servlet engine and the JFreeChart charting library, the repository offers a few powerful servlets that cover most of the known use cases: matrix views of the values, bar, pie and histogram charts, geographical maps, scatter plots etc. These views are easily configurable either by editing text files or by using the web interface itself to create new views. The repository can also act as a global aggregator. It has the same support as the service for running filters on the collected data, producing derived values that can be stored alongside or instead of the original values. Custom filters can be implemented to intercept received data (or particular cuts of it) and act on the values. A particular kind of filter is the automatic actions framework: triggered by the monitored data values (comparison with predefined thresholds, absence of data, arbitrary correlations or custom pieces of code), automatic actions can be taken.
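As an indication of what such a monitoring script might look like, the sketch below sends one sample of BitTorrent parameters using the ApMon bindings distributed with MonALISA; the destination endpoint, cluster/node names and parameter names are assumptions, and a periodic loop (invoked roughly every 5 seconds, as above) would wrap the send call.

import apmon   # ApMon: application-monitoring bindings shipped with MonALISA

# Destination (service/repository UDP endpoint) is an assumption.
apm = apmon.ApMon(["monalisa.example.org:8884"])

# Send one sample of monitored BitTorrent parameters (illustrative values).
apm.sendParameters("p2p-testing", "peer-01", {
    "download_speed": 2450.0,   # KB/s, as read from the client status log
    "upload_speed": 310.0,
    "num_connections": 37,
})
apm.free()                      # release ApMon resources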

4.5 Evaluation Based on Peer-to-Peer Protocol Measurements

The virtualized infrastructure and automated framework provide the necessary platform for the experimental evaluation of Peer-to-Peer implementations and formal models. Our interest resides in evaluating BitTorrent clients, which mostly translates to sheer download speed; the higher the download speed, the better the performance. Obviously, we have to take into account both peer download speed and swarm download speed. A peer with high download speed that more or less "free rides" on top of the other peers is not desired in a given swarm.

In order to properly consider a formal evaluation mechanism for performance evaluation of a Peer-to-Peer ecosystem, several questions must be answered:

• What do we measure?
• What do we consider for evaluation?
• What do we vary? What is influencing the measurement?
• How do the above correlate?
• What do we consider to be "better"?

The answer to the first question is given by the data acquired from status and verbose log files: protocol messages, number of connections, download and upload speed, resource usage. This may be acquired through measurements, monitoring and log files. As mentioned above, we will consider download speed as the evaluation unit for performance. We consider the other measurable units to correlate with the download speed. The units we may vary include:

• hardware resources – base or virtualized system resources such as RAM, CPU, I/O and networking;
• peer/system characteristics – download speed limitation, upload speed limitation, connection limitation, existence of a firewall;
• Peer-to-Peer implementation – software application and protocol design;
• swarm and network characteristics – number of peers, network topology, peer connectivity, network bandwidth, etc.

The varying units influence the measured units. Though we will focus mostly on download speed, the other units also exert influence. If a given peer receives a boost in its upload speed, it will also boost the download speed of several other peers. If a given varying unit directly influences upload speed, it will also influence download speed, though perhaps not directly. It may be easier to detect the influence of varying units over secondary units, such as protocol messages. As such, we consider a correspondence between varying units and measured units:

Eval(hw, sys, impl, swarm, net) = (protomsg, speed, conn, ruse) (4.1)

Our goal of maximizing download speed translates into defining and/or adjusting the most suitable values for the varying units. A proper implementation, deployed in a proper environment, will ensure increased performance. This leaves answering the last question: what do we consider "better"? High download speed roughly translates to low download time. In order to achieve good/better performance, we require minimizing download time. Download time is, however, a static unit: we only measure it at the end of the given scenario. As it is influenced by the download speed evolution, which is, in its turn, influenced by other factors, we aim to correlate download time with the evolution of download speed. If we were to continuously monitor download speed, a pure mathematical formula for a given peer would be:

FS = \int_{0}^{DT} DS(t)\, dt \qquad (4.2)

where:

• FS – file size
• DT – download time
• DS – download speed (evolution)

As we only periodically monitor a peer, the formula translates to:

FS = \sum_{t=0}^{DT} DS_t \qquad (4.3)

This provides the necessary correlation between download speed evolution and download time, the file size being well known. In order to correlate download speed with other units, we have to consider download speed as the result of interaction with other peers: a peer may only download if other peers upload. We formalize this as a download matrix that may be built for each monitoring interval:

PDS_t = \begin{pmatrix}
ds_{1,1} = 0 & ds_{1,2} & \cdots & ds_{1,NP} \\
ds_{2,1} & ds_{2,2} = 0 & \cdots & ds_{2,NP} \\
\vdots & \vdots & \ddots & \vdots \\
ds_{NP,1} & ds_{NP,2} & \cdots & ds_{NP,NP} = 0
\end{pmatrix} \qquad (4.4)

where:

• ds_{i,j} – download speed of peer j from peer i;
• PDS_t – peer download speed matrix at time t;
• NP – number of peers in the swarm.

Peer-to-peer download speed is easily measurable and it is fairly easy to correlate it with the varying units. Using an array of such matrices, one matrix for each time slice, one possesses detailed measured information regarding peer download speed and, by summing up either rows, columns or the whole matrix, gains the ability to observe peer upload speed, peer download speed and swarm download speed. It follows that the upload matrix is the transpose of the download matrix: the download speed of peer j from peer i (ds_{i,j}) is equal to the upload speed of peer i to peer j (us_{j,i}). That is:

PUS_t = PDS_t^{T} \qquad (4.5)

The instant download speed of a peer is the sum of the download speeds of all its peer transfers. Similarly, the instant upload speed of a peer is the sum of the upload speeds of all its peer transfers. In other words, the instant download speed of peer k (DS_k) is the sum of the items in column k of the PDS matrix, while the instant upload speed of peer k (US_k) is the sum of the items in row k of the PDS matrix. That is:

DS_k = \sum_{i=1}^{NP} ds_{i,k} \qquad (4.6)

US_k = \sum_{i=1}^{NP} ds_{k,i} \qquad (4.7)
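These quantities are straightforward to compute from a stored PDS matrix; the sketch below uses illustrative, made-up speeds for a 4-peer swarm.

import numpy as np

# One PDS matrix per monitoring interval; zero diagonal as in (4.4),
# speeds in KB/s are illustrative values only.
pds_t = np.array([[0, 120,  80,  95],
                  [60,  0, 110,  40],
                  [75, 90,   0, 130],
                  [50, 85, 100,   0]], dtype=float)

pus_t = pds_t.T                    # upload matrix, eq. (4.5)
ds = pds_t.sum(axis=0)             # DS_k: column sums, eq. (4.6)
us = pds_t.sum(axis=1)             # US_k: row sums, eq. (4.7)
swarm_speed = pds_t.sum()          # total swarm download speed

print(ds, us, swarm_speed)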

The presence of the download speed matrix (actually, the array of download speed matrices) allows three major points of action:

• the discrete analysis of peer and swarm behavior;
• the discovery of weak spots or points of improvement; situations of download speed or upload speed drops may provide hints as to where enhancements are required;
• the prediction of download and upload speed based on the evolution of items in the matrix.

Please take into account that the matrix is a reporting tool. It does not provide any configuration option for the functioning of a swarm; it is a read-only database that is updated as the swarm evolves. The evolution of peers and swarm is made possible by peer interaction, network connections, network bandwidth availability and limitation; these are the action points that have to be updated/enhanced; the matrix will subsequently provide reporting information. Nevertheless, the matrix may be used to model ideal situations and provide an insight into the behavior of a protocol. For example, consider the ideal situation of a swarm where:

• The transferred file size is FS.
• There are NP peers in the swarm.
• All peers have equal bandwidth B.

• Bandwidth is shared equally among peer connections.
• There is only one seeder (consider the case of live streaming).
• All peers start at the same time.

In this situation, all peer connections use B/NP – this is the download speed and upload speed between any two peers. In the ideal situation, the seeder sends piece 1 to peer 1, piece 2 to peer 2 and so on (considering the bandwidth usage). In the next round it would send piece 1+NP to peer 1, piece 2+NP to peer 2 and so on. In each round, all peers connect to each other, deliver the piece they have to the other peers and receive their pieces in return. That is, at the end of each round, all peers (except for the seeder) have the same data – in a full mesh network. As such, all peers fully use their bandwidth; all connections to other peers are fully utilized. That means that the matrix is filled with the value B/NP, except for the first column (the seeder download column), which is filled with zeros. As such, the peer download speed is:

DS_k = \sum_{i=1}^{NP} \frac{B}{NP} = B \qquad (4.8)

The file size formula for all peers thus becomes:

FS = \sum_{i=1}^{DT} B \qquad (4.9)

It follows that the download time is:

DT = \frac{FS}{B} \qquad (4.10)

However, one has to consider the upload slot limitation of each peer. Consider NS to be the number of upload slots of each peer. In order to create a full mesh network such as the one above, only NS + 1 peers have to be present (one of which is considered the seeder). For these peers, this results in the same download time as above:

DT = \frac{FS}{B} \qquad (4.11)

The next stage uses each previous leecher as a seeder. That is, (NS + 1) × NS new leechers will get the full file. In the next phase, ((NS + 1) × NS + 1) × NS new leechers will get the full file, and so on. This approach poses two important disadvantages. Firstly, not all peers receive content from the very beginning, which is problematic for streaming. Secondly, it provides a high turnaround time, that is, some peers complete their downloads much later. An alternate approach is using an empty slot in the initial peers. When the seeder is started, pieces are sent to NS peers. Those peers use only NS − 1 slots, since they are unable to provide anything to the seeder. The empty slot may be used to provide data to another peer. That is, there will be another NS peers, one round behind, that will receive all pieces delivered by the seeder in the previous round. This goes on until all peers receive the information. As such, the download time would be:

• For the first NS peers: DT = FS/B
• For the next NS peers: DT = FS/B + t
• For the next NS peers: DT = FS/B + 2t
• And so on.

Here t is the time required for transferring one piece or item of the file. This approach is highly adequate for streaming, allows a good turnaround time and is also highly similar to the BitTorrent swarm model, allowing interactions with many other peers. The model has its limitations: it does not model peer arrival time, it assumes "upper" peers act as benevolent seeders for "lower" peers, and it assumes unchanged bandwidth among peers.

The download matrix and the other formal metrics provide the context for the formal analysis of Peer-to-Peer protocols. Agnostic of the way information is collected, the download matrix is a read-only tool storing the discrete evolution of transfer information among the peers in a swarm.

4.6 Conclusion

In order to provide a thorough analysis of Peer-to-Peer protocols and applications, realistic trials and careful measurements were shown to be required. Clients and applications provide the necessary parameters (such as download speed, upload speed, number of connections, protocol message types) that give an insight into the inner workings of clients and swarms.

Protocol analysis, centered around the BitTorrent protocol, relies on collecting protocol parameters such as download speed, upload speed, number of connections, number of messages of a certain type, timestamp, remote peer speed, client types and remote peer IDs. We consider two kinds of messages, dubbed status messages and verbose messages, that may be extracted from clients and parsed, resulting in the required parameters.

Various approaches to collecting messages were presented, with differences in the method's intrusiveness and the quantity and quality of data: certain methods may require important updates to existing clients and, as such, access to the source code, while others may only need access to information provided as log files.

A more intrusive, albeit highly customizable, approach is provided through the design and implementation of a generic logging library. The library provides an API that may be "hooked" in well-defined places in BitTorrent clients' source code. The API allows logging of all BitTorrent parameters and events – both status messages and verbose messages. Protocol parameters are output either in text or XML format, depending on the choice of the user. The library has been tested against popular BitTorrent implementations such as Transmission and rtorrent (based on libtorrent-rakshasa).

Collection, parsing, storage and analysis of logging information is the primary approach employed for protocol parameter measurements. A processing framework has been designed, implemented and deployed to collect and process status and verbose messages. Multiple parsers and multiple storage solutions are employed. Two types of processing may be used: post-processing, which relies on a previous collection of logging information into a log archive, and real-time processing, where data may be monitored as it is parsed in real time.

Protocol parameters are presented to the user through the use of a rendering engine that provides a graphical representation of parameter evolution (such as the evolution of download speed or upload speed). The rendering engine makes use of the database resulting from the processing framework and provides a user-friendly interface to the experimenter. MonALISA has been employed for monitoring protocol parameters where available.

A formal evaluation of the performance of a client has been created, considering lower download time to be the performance metric. A matrix is used to store all traffic information pertaining to a given moment in a swarm, such that the complete speed evolution of all BitTorrent clients in the swarm may be described by an array of matrices.

Chapter 5

A Novel BitTorrent Tracker Overlay for Swarm Unification

While Gnutella is a fully decentralized Peer-to-Peer protocol, BitTorrent rests on a dedicated service, called tracking, to ensure bootstrapping. The tracker is a metadata storage facility that provides "who's sharing what" information to peers. The tracker is contacted by peers to provide information about other peers sharing the same file. Based on this information, peers subsequently contact other peers and transfer pieces. The tracker only fills the role of piece possession mediator among peers.

The prime disadvantage of this architecture is the single point of failure role of the tracker. If the tracker fails, further contacts among peers are disabled – peers may still communicate with peers they are already acquainted with. A simple solution to this problem is the introduction of multiple trackers in a swarm; this is easily achieved by adding multiple tracker announce URLs to the .torrent file, as sketched below. In case a tracker fails, the peer may contact another tracker. Note that trackers are not synchronized among themselves. Still, new contacts may be provided through the new tracker. As a further update, the .torrent file may be simplified through the use of magnet links1.

The role of the tracker, and the problem of it being a single point of failure, may be circumvented through the use of PEX (Peer Exchange) and DHT (Distributed Hash Table) [2]. These technologies are new ways of discovering peers without having to rely on the BitTorrent tracker. There needs to be an initial bootstrap phase, in which control information is provided from one peer to another peer. Most implementations nowadays provide PEX or DHT, coupled with the use of a tracker. In many cases, such as libtorrent-rasterbar, the implementation may simultaneously use both DHT and the classical tracker data.

The role of extending tracker functionality is to provide redundancy and generally simplify the architecture. Our goal of building a tracker overlay for swarm unification aims to enhance swarms and allow the presence of multiple peers and seeders in a swarm, with the potential benefit of increased swarm speed. As the health of a swarm relies not only on the number of peers, but also on the ratio of seeders versus leechers, approaches such as building private swarms or providing moderation are important. These approaches provide the means for extending seeder availability within the swarm and providing a constant flow of data.

Trackers are server-only entities. They are contacted by peers to update internal information and to provide information to clients. The tracker database stores information about peers taking part in a swarm and statistical information about peers, such as download size. This information has been used heavily by Bardac et al. [4] in building a tracker monitoring service. Our proposal is to

1http://news.softpedia.com/news/BitTorrent-Magnet-Links-Explained-132536.shtml

enhance existing tracker messages and allow trackers to interconnect – that is, to communicate with each other on top of the existing communication with peers in the swarm.
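For illustration, the multiple-announce mechanism mentioned above stores tracker URLs in an announce-list key of the metainfo dictionary, organized as tiers that clients try in order; in the sketch below the tracker hostnames are assumptions and the info dictionary is omitted.

# Metainfo dictionary before bencoding; hostnames are illustrative and the
# "info" dictionary (piece hashes, file data) is omitted.
torrent = {
    "announce": "http://tracker1.example.org:6969/announce",
    "announce-list": [                  # multi-tracker extension: tiers of URLs
        ["http://tracker1.example.org:6969/announce"],
        ["http://tracker2.example.org:6969/announce"],   # backup tracker
    ],
    "info": {},
}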

5.1 Context and Objective

In this chapter, we address the issue of unifying separate swarms that take part in a session sharing the same file. We propose a tracker unification protocol that enables disparate swarms, using different .torrent files, to converge. We define swarm unification as enabling clients from different swarms to communicate with each other. The basis for the unification is a "tracker-centric convergence protocol" through which trackers form an overlay network and send peer information to each other.

Communication of peers in a swarm is typically mediated by a BitTorrent tracker, or several trackers, defined in the .torrent file. The tracker is periodically contacted by peers to provide information regarding piece distribution within the swarm. A peer receives a list of peers from the tracker and then connects to these peers in a decentralized manner. As it uses a centralized tracker, the swarm may suffer if the tracker is inaccessible or crashes. Several solutions have been devised, such as PEX (Peer EXchange)1 or DHT (Distributed Hash Table) [2]. The tracker swarm unification protocol presented in this chapter enables redundancy by integrating multiple trackers in a single swarm.

The goal of the tracker swarm unification protocol is the integration of peers taking part in different swarms that share the same file. These swarms, named common swarms, use the same content but different trackers. We have designed and implemented a tracker network overlay that enables trackers to share information and integrate peers in their respective swarms. The overlay is based on the Tracker Swarm Unification Protocol (TSUP), which allows update notification and delivery to the trackers in the overlay. The protocol design is inspired by routing protocols in computer networks.

At this point, as proof of concept, the tracker overlay is defined statically. All trackers know beforehand the host/IP addresses and ports of the neighboring trackers and contact them to receive the required information. The integration of dynamic tracker discovery is set as future work. Each tracker may act as a "router", sending updates to neighboring trackers that may themselves send them to other trackers.

The protocol proposed here, named Tracker Swarm Unification Protocol (TSUP), renders possible the unification of swarms which share the same files, by employing a tracker network overlay. A tracker which implements this protocol will be referred to here as a unified tracker. Torrent files created for the same content have the same info_hash. As such, in swarms that share the same file(s) (common swarms), peers will announce the same info_hash to the tracker. Therefore, TSUP-capable trackers can "unify" them by communicating with each other in order to exchange information about the peers which contribute to shared files with the same info_hash. In order to accomplish this, unified trackers send periodic updates to each other, containing information about the peers in the network.

5.2 Swarm Unification Protocol

As mentioned above, TSUP is the acronym for Tracker Swarm Unification Protocol. TSUP is responsible for the communication between trackers, exchanging peer information in common swarms.

1http://wiki.vuze.com/w/Peer_Exchange

Figure 5.1: Unified trackers

5.2.1 Protocol Overview

For transport layer communication, the protocol uses UDP (User Datagram Protocol) to reduce resource consumption. A tracker already maintains many TCP (Transmission Control Protocol) connections with peers. Adding a TCP connection to each neighboring tracker would increase TSUP overhead too much for a big tracker overlay. The messages passed from one tracker to another do not need TCP's flow control and need a lower level of reliability than TCP, as explained below. The simplicity of the UDP protocol gives the advantage of a smaller communication overhead. In TSUP's operation the following processes may be identified (a sketch of the timer-driven behaviour follows at the end of this section):

• Virtual connection establishment process: a three-way handshake responsible for establishing a UDP "connection" between two linked trackers at the application layer. The process is started by a SYN packet (synchronization packet).
• Unification process: the trackers exchange unification packets (named SUMMARY packets) during a three-way handshake in order to find out which are the common swarms.
• Updating process: the trackers exchange peers from common swarms, encapsulated in UPDATE packets.
• Election process: the establishment of a swarm leader, which is responsible for receiving all updates from the neighboring linked trackers, aggregating them and sending the results back.

The unification process includes an updating process in its three-way handshake, such that the two operations, unification and update, are run in a pipeline. Whenever a new torrent file is registered with the tracker, a new swarm is created. The unification process is triggered and a SUMMARY packet is immediately sent to each neighboring tracker, informing the others of the new swarm. The updating process is started periodically, such that UPDATE packets are sent at a configurable interval of time to each tracker in a common swarm. A typical update interval is 30 seconds.

In order to maintain the virtual connections between trackers, HELLO packets are sent periodically, acting as a keep-alive mechanism. A typical HELLO interval is 10 seconds, but its value may be changed in the protocol configuration. If no HELLO packet is received during a configurable interval, called the disconnect interval, the virtual connection is dropped and the virtual connection establishment process is restarted for that link by sending a SYN packet.

Some packets, such as UPDATE packets, must be acknowledged. If no answer or acknowledgement is received, the packet is retransmitted. For example, UPDATE packets are resent at each HELLO interval until an acknowledgement is received. It is not a problem if some UPDATE packets are lost and arrive at the destination later. However, they need to be acknowledged and they are retransmitted in order to increase the probability of their arrival. TCP, by offering reliability, provides a faster delivery of updates in case of a network failure, which is not needed in the case of TSUP. Lower overhead is considered here more important than fast retransmission. Thus TSUP implements timer-driven retransmission, as opposed to the data-driven retransmission used by TCP. Periodically sent packets, the keep-alive mechanism, acknowledgements and retransmissions contribute to the low overhead needed in TSUP. They help overcome the drawbacks of the UDP transport protocol, and also give a more efficient communication than a TCP one.
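The timer-driven behaviour described above (periodic HELLO keep-alives, periodic UPDATEs, and retransmission of unacknowledged UPDATEs at each HELLO interval) may be sketched as follows; packet contents, the neighbour list and port numbers are illustrative, and ACK reception, SYN/SUMMARY handling and the election process are omitted.

import socket
import time

HELLO_INTERVAL = 10       # seconds, typical value from the text
UPDATE_INTERVAL = 30      # seconds, typical value from the text

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
neighbours = [("tracker2.example.org", 2710)]   # assumed neighbour list
unacked_updates = {}                            # addr -> pending UPDATE payload

last_hello = last_update = 0.0
while True:
    now = time.time()
    if now - last_hello >= HELLO_INTERVAL:
        for addr in neighbours:
            sock.sendto(b"HELLO", addr)         # keep-alive
            if addr in unacked_updates:         # timer-driven retransmission:
                sock.sendto(unacked_updates[addr], addr)  # resend until ACKed
        last_hello = now
    if now - last_update >= UPDATE_INTERVAL:
        for addr in neighbours:
            payload = b"UPDATE ..."             # peers from common swarms
            sock.sendto(payload, addr)
            unacked_updates[addr] = payload     # cleared when an ACK arrives
        last_update = now
    time.sleep(1)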

5.2.2 Tracker Awareness

Tracker communication is conditioned by awareness. For this purpose, in the current version of the protocol, each tracker is configured statically with a list of other communicating trackers. Each element of the list represents a link, identified by the tracker host name (URL or IP address) and port. Other parameters for the link may be configured; some of them may be specific to the implementation. If the virtual connection establishment process is successful, the link becomes a virtual connection, which is maintained with keep-alive packets (HELLO packets). A future version of the protocol will incorporate a tracker discovery mechanism capable of generating the list of communicating trackers for each tracker dynamically, with the benefit of scalability and a reduction of the administrative overhead.

5.2.3 Tracker Networks

To improve TSUP’s scalability, trackers may be grouped together in networks named tracker net- works. Connections in all tracker networks are full mesh. Two networks are connected with the aid of border trackers (see Figure 5.2). To configure a topology which contains multiple networks, each link of each tracker must be set as an internal link or an external link (see Figure 5.2). Trackers connected with internal links are part of the same tracker network; trackers connected with external links are part of other net- works. However, the ones from the first category may also be classified as belonging to an internal network and the ones from the second category as being from an external network. A complete graph, using internal links as edges is an internal network, and a complete graph with external links as edges is an external network (see Figure 5.2). A tracker which has both internal and external links is a border tracker. Peer information received from an internal network originates from internal peers while information received from an external network originates from external peers. Peers connected to a tracker with TCP, via the BitTorrent protocol, are called own peers. The links between trackers in a network, whether internal or external, must be full mesh. In order to use a scalable and low resource consuming communication within a network, trackers are organized in groups depending on the unified swarm. Therefore a tracker may belong to more than one group at the same time, the number of groups it belongs being equal with the number of swarms present on that tracker. Each group contains a swarm leader responsible for sending CHAPTER 5. A NOVEL BITTORRENT TRACKER OVERLAY FOR SWARM UNIFICATION 91

Figure 5.2: Tracker Networks

Each group contains a swarm leader responsible for sending peer updates to the peers in the group. The other group members, instead of sending updates to all other members over a full mesh graph, send updates to the swarm leader over a tree graph, reducing updating overhead. These updates then propagate to the other peers in the group. In accordance with graph theory, the number of updates sent in a swarm without the swarm leader mechanism (full mesh) is computed using the formula below:

UPDATES_{fullmesh} = 2 \cdot \frac{n(n-1)}{2} = n(n-1) \qquad (5.1)

The number of updates sent within a swarm using the swarm leader mechanism (tree) is:

UPDATES_{swarmleader} = 2(n-1) \qquad (5.2)

As may be observed from the formulas above, the complexity decreases from O(n^2) in a full mesh update scheme to O(n) with the swarm leader scheme. For example, for n = 8 trackers in a swarm, a full mesh requires 56 updates, while the swarm leader scheme requires only 14.

Each swarm contains two swarm leaders: one for the internal network (which sends updates through internal links), called the internal swarm leader, and one for the external network (which communicates updates through external links), named the external swarm leader. As connections are full mesh in an internal network, the internal peers (received from other trackers in the internal network) are distributed to other internal trackers only by the internal swarm leader and in no other circumstance by another tracker. By analogy, in an external network, peers (received from other trackers in the external network) are distributed to other external trackers only through the external swarm leader. On the other hand, internal peers are distributed to external trackers and external peers to internal trackers. Own tracker peers are distributed both to the internal and the external network.

Swarm leaders are automatically chosen by trackers during the election process, which is started periodically. Several metrics are used in order to choose the most appropriate leader. The first and most important one prefers as swarm leader a tracker which possesses the smallest number of swarm leader mandates. The number of mandates is the number of swarms where a tracker is swarm leader. This balances the load of the trackers – as the number of mandates of a tracker increases, its load also increases. In the current version of the protocol, the grouping of trackers into

networks and the selection of border trackers is done manually (statically) by the system administrator.

When two network trackers A_1 and A_2 are connected indirectly through other network trackers B_j, if A_1 and A_2 share a common swarm and the B_j trackers do not use this swarm, then the A_i trackers cannot unify unless border trackers are specified in the configuration. This happens because the configured border trackers must unify with any swarm, even if they have no peers connected from that swarm. Grouping trackers in networks increases system scalability, but also increases network convergence time. The update timers can be set to a lower value for border trackers to limit convergence overhead. The system administrator should weigh scalability against convergence and adapt the protocol parameters to the specific topology.

5.3 Protocol Implementation

TSUP is currently implemented in the popular XBT Tracker1, written in C++. The extended, TSUP-capable tracker was dubbed XBT Unified Tracker. The original tracker implements an experimental UDP BitTorrent protocol known as UDP Tracker. Because TSUP also uses UDP and communication takes place on the same port, TSUP-specific packets use the same header structure as the UDP Tracker, enabling compatibility.

XBT Tracker uses a MySQL database for configuration parameters and for communication with a potential front end. XBT Unified Tracker adds parameters for configurations that are specific to TSUP and uses a new table in order to remember links with other trackers and their parameters. Tracker awareness, as described in Section 5.2.2, is implemented in the database. Besides the HTTP announce and scrape URLs, the original tracker serves other web pages for information and debugging purposes. The unified tracker adds two extra information web pages for monitoring. The trackers web page shows details about every link and the state of the connection for that link. For every swarm, the swarms web page shows the list of peers and the list of trackers connected to that swarm.

As any server, XBT Tracker uses an infinite loop waiting for the arrival of packets from clients. Depending on the host system configuration, the server uses select or epoll. In order to receive UDP packets, the recv function is used. This function is called when an event is generated through select or epoll. Our methods, dubbed process_... and send_..., are used to process UDP packets. The protocol sends packets to other trackers either as a reply to received packets, or in response to generated events, such as timer expiry. Both select and epoll are unblocked after 5 seconds in case no event is generated on the socket. Due to this, any protocol interval may be extended by at most 5 seconds, adding possible extra delays to the running code. A sample of the TSUP timer code is shown in Listing 5.1.

1http://xbtt.sourceforge.net/tracker/

/**
 * Check TSUP timers and if some of them expired take the appropriate actions.
 */
void Cserver::tsup_timers()
{
    ...

    // On update timeout build the updates and mark this event
    if (time() - m_update_time > m_config.m_update_interval) {
        b_update_timeout = true;
        m_update_time = ::time(NULL); // Reset update timer
    }
    else
        b_update_timeout = false;
    ...
}

Listing 5.1: TSUP Timer

The Cserver class, the main class of the implementation, has been used to implement the protocol. The structure in Listing 5.2 defines the components of another tracker to which there is a connection and stores the state of the link in the status field.

/**
 * Information about a tracker.
 */
struct t_tracker
{
    t_tracker();
    void clean_up_sent_peers();

    void clear();

    // From DB

    int tid;
    char name[256];
    in_addr_t host;
    string str_host;
    int port;
    int nat_port;
    char description[256];
    bool nat;
    bool external;
    int status;
    time_t recv_time;
    int retry_times;
    int reconnect_time;
    long long connection_id;
    ...
    int mandates;
};

Listing 5.2: Tracker Structure

A list of such structures forms the set of links to other trackers. Original data structures have been updated to satisfy the new protocol needs. For example, the t_file structure, shown in Listing 5.3, defines information about a .torrent file. It stores a list of pointers to t_tracker structures. This structure also stores the swarm leaders: for the internal network (swarm_leader) and for the external network (swarm_leader_external).

struct t_file
{
    //* TSUP members:
    std::set<t_tracker*> trackers;
    t_tracker *swarm_leader;
    t_tracker *swarm_leader_external;
    bool new_arrivals;
};

Listing 5.3: File Structure

Another important structure that was updated is t_peer, consisting of information about a swarm peer. New peers are added when announced by another tracker. The Cserver class also implements a set of methods required by the protocol, such as unify_swarm and unify_swarms for enabling unification, update, build_update, build_updates and append_update_peer.

5.4 Experimental Setup

TSUP testing activities used a virtualized infrastructure and a Peer-to-Peer testing framework running on top of it. We were able to deploy scenarios employing various swarms, ranging from a 4-peer and 1-tracker swarm to a 48-peer and 12-tracker swarm. Apart from testing and evaluation, the infrastructure has been used to compare the proposed tracker overlay network with classical swarms using a single tracker and the same number of peers. We will show that a unified swarm has similar performance when compared to a single-tracker (classical) swarm. In order to deploy a large number of peers we have used a thin virtualization layer employing OpenVZ1. OpenVZ is a lightweight solution that allows rapid creation of virtual machines (also called containers). All systems are identical with respect to hardware and software components. The deployed experiments used a single OpenVZ container for each tracker or peer taking part in a swarm. A virtualized network has been built allowing direct link layer access between systems – all systems are part of the same network; this allows easy configuration and interaction. The experiments made use of an updated version of hrktorrent, a lightweight application built on top of libtorrent-rasterbar. Previous experiments [14] have shown libtorrent-rasterbar outperforming other BitTorrent implementations, leading to its usage in the current experiments. The experiments we conducted used bandwidth limitations typical of ADSL2+ connections (24 Mbit download speed limitation, 3 Mbit upload speed limitation). An automatically generated sample output graphic, describing a 48-peer session (12 seeders, 36 leechers, 12 trackers) sharing a 1024 MB file, is shown in Figure 5.3. The image presents the download speed evolution for all swarm peers. All of them are limited to 24 Mbit download speed and 3 Mbit upload speed. All peers use a download speed between 2 Mbit and 5 Mbit during the first 2000 seconds of the swarm's lifetime. As the leechers become seeders, the swarm download speed increases rapidly, as seen in the last part of the swarm's lifetime, with the last leechers reaching the top speed of 24 Mbit.

1http://wiki.openvz.org

Figure 5.3: Sample Run Graphic

5.5 Experimental Results and Evaluation

In order to test the overhead added by TSUP to the BitTorrent protocol, we have created a set of test scenarios which compare the average download speed for a swarm with unified trackers and for another swarm with just one non-unified tracker, but the same number of leechers and seeders. Each test scenario is characterized by the shared file size, the number of peers and, in the case of tests with unified trackers, by the number of trackers. We shared 3 files of sizes 64MB, 256MB and 1024MB. In the test scenarios with unified trackers, for each file we tested the swarm with 1, 2, 4, 8 and 12 trackers. To each tracker 4 peers were connected, of which 1 is a seeder and 3 are leechers. So, for example, in a scenario with 8 trackers there are 8 seeders and 24 leechers, totaling 32 peers. Having 3 files and 12 trackers in the biggest scenario, we needed to create 36 .torrent files, because for each shared file we made a .torrent file for each tracker. In the corresponding test scenarios with non-unified trackers, there is just one .torrent file for each shared file. We varied the numbers of seeders and leechers connected to the tracker so that they have the same cardinality as in the corresponding unified-trackers scenarios. Each test scenario has been repeated 20 times in order to ensure statistical relevance. The average download speed was calculated as an average value over the 20 sessions. Results may be seen in Table 5.1, which depicts the results for each file size, in its top (64MB), middle (256MB) and bottom (1024MB) part, respectively. For each of the two situations, the mean download speed (“mean dspeed”) and relative standard deviation (“rel.st.dev.”) are depicted. The right part, titled “perf. decrease” (performance decrease), shows the percentage of download speed decrease induced by the overhead of TSUP. The left side of the table shows the number of seeders and leechers for each scenario. The percentage value for download speed decrease is computed using the standard formula:

\[ d = 100\% \cdot \frac{ds_{SingleTracker} - ds_{UnifiedTrackers}}{ds_{SingleTracker}}, \tag{5.3} \]

where ds is an abbreviation for mean download speed. All positive percentage values in the “perf. decrease” column mark a decrease of performance caused by TSUP overhead. A negative download speed decrease percentage shows that there is an increase instead of a decrease. The unification which takes place in XBT Unified Tracker introduces an overhead in the BitTorrent protocol in comparison to XBT Tracker. In theory, the performance decrease should therefore always be positive. However, there are situations where percentages are negative, which could suggest that TSUP increases the speed, thus reducing the download time. This performance increase is not due to TSUP, but is caused by another fact. In all Single Tracker experiments, all peers are started almost simultaneously at the beginning, creating a flash crowd. In this situation, communication between peers starts immediately; when multiple trackers are unified, however, TSUP imposes a delay before each peer finds out about all the others, because of the convergence time. It is known that it is sometimes better when peers enter the swarm later [6], explaining the presence of negative values for the performance decrease. Several conclusions can be drawn from the results in Table 5.1. The TSUP overhead decreases, on the one hand, when the number of peers increases (and, proportionally, the number of seeders) and, on the other hand, when the file size increases. When the overhead is insignificant, the percentages have lower values. The TSUP convergence time avoids a flash crowd at the beginning of each scenario, thus inducing a small performance increase. Starting from 8 seeders upwards, the TSUP performance decrease becomes smaller than the performance increase caused by avoiding the flash crowd. That is why some performance decrease values are negative. The relative standard deviation generally increases with the number of peers, but decreases when the file size increases. The values can be considered normal, taking into account the number of peers that are part of a swarm. Due to the small values of the performance decrease and relative standard deviation, we concluded that TSUP overhead is insignificant for small to medium-sized swarms (less than 50 peers) which share big files (1GB). BitTorrent is generally used for sharing large files and TSUP allows the increase of swarm size; these two factors come as an advantage for this technology. Swarm unification increases the number of peers for a shared file, but this fact does not always grant a larger download speed, as can be seen in Table 5.1. However, having a swarm with a bigger number of peers has three advantages. First of all, it increases the chance that more seeders will later be available, and a big proportion of stable seeders increases download speed. The second reason is that bigger swarms increase a shared file's availability by creating redundancy. The third reason is that a bigger swarm is more attractive for new users, giving the possibility of creating a big network which may be used for features such as social networking.

5.6 Conclusion

A novel overlay network protocol on top of BitTorrent, aiming at integrating peers in different swarms, has been presented. Dubbed TSUP (Tracker Swarm Unification Protocol), the protocol is used for creating and maintaining a tracker network enabling peers in different swarms to converge into a single swarm. Each initial swarm is controlled by a different tracker; trackers use the overlay protocol to communicate with each other and, thus, take part in a greater swarm. We have used an OpenVZ-based Peer-to-Peer testing infrastructure to create a variety of scenarios employing an updated version of the XBT Tracker, dubbed XBT Unified Tracker. The protocol incurs low overhead and has little impact on overall performance. The unified swarm is close to the performance of single-tracker swarms consisting of the same number of seeders and leechers, with the benefit of an increased number of peers. The increased number of peers provides the basis for improved information for various overlays (such as social networks) and allows a healthier swarm – given enough peers, if some of them decide to leave the swarm, other peers will still take part in the transfer session.

Table 5.1: Single Tracker vs. Unified Trackers Comparison

[For each file size – 64MB (top), 256MB (middle) and 1024MB (bottom) – and each seeders/leechers configuration, the table lists the mean download speed (KB/s) and relative standard deviation (%) of the Single Tracker scenario, followed by the number of trackers, the mean download speed (KB/s) and relative standard deviation (%) of the Unified Trackers scenario, and the resulting performance decrease (%).]

Experiments have involved different swarms, with respect to the number of peers and trackers, and different file sizes, using simulated asymmetric links. The table at the end of the chapter shows a summary of results, supporting the conclusion that the protocol has a low overhead. At this point each tracker uses a statically defined, pre-configured list of neighboring trackers. One of the main goals for the near future is to enable dynamic detection of neighboring trackers and ensure extended scalability. We are currently considering two approaches: the use of a tracker index where trackers' IP/host addresses and ports are stored, or the use of a completely decentralized tracker discovery overlay similar to DHT discovery methods. As a proof of concept, our test scenarios have focused on homogeneous swarms. All peers in the swarms use the same implementation and the same bandwidth limitation. All peers enter the swarm at about the same time, with some delay until swarm convergence in the case of the unified tracker protocol. We plan to create heterogeneous swarms that use different clients with different characteristics. We are also going to alter the number of seeders and leechers in the initial swarms and observe the changes incurred by using the unification protocol. Chapter 6

Designing an Improved Multiparty Protocol in the Linux Kernel

6.1 The swift Multiparty Protocol

The swift protocol is a generic multiparty transport protocol. Its mission is to disseminate content among a swarm of peers. Basically, it answers one and only one request: ’Here is a hash! Give me data for it!’. Entities such as storage, servers and connections are abstracted and are virtually invisible at the API layer. Given a hash, the data is received from whatever source is available and data integrity is checked cryptographically with Merkle hash trees [46]. If you need some data, it is somewhat faster and/or cheaper to download it from a nearby well-provisioned replica but, on the other hand, this process requires that multiple parties (e.g. consumers, the data sources, CDN sites, mirrors, peers) be coordinated. As Internet content keeps growing nowadays, the overhead of peer/replica coordination becomes higher than the mass of the download itself. Thus, the niche for multiparty transfers expands. Current and relevant technologies are still tightly coupled to a single use case or even to the infrastructure of a particular corporation. These are the reasons behind the appearance of the swift protocol, whose primary goal is to act as a generic content-centric multiparty transport protocol that allows seamless, effortless data dissemination over the big cloud represented by the Internet. Most features of the swift protocol are defined by its function as a content-centric multiparty transport protocol. A significant difference between swift and the TCP protocol is that TCP possesses no information regarding what data it is dealing with, as data is passed from user space, while the swift protocol has its data fixed in advance and many peers participate in distributing the same data. Because of this, and because for swift the order of delivery is of little importance and unreliability is naturally compensated for by redundancy, it entirely drops TCP's abstraction of sequential reliable data stream delivery. For example, out-of-order data could still be saved and the same piece of data might always be received from another peer. Being implemented over UDP, the protocol does its best to make every datagram self-contained, so that each datagram can be processed separately and the loss of one datagram does not disrupt the flow. Thus, a datagram carries zero or more messages, and neither messages nor message interdependencies should span multiple datagrams. The verification of data pieces is realized using Merkle hash trees [46], integrated as an extension in BitTorrent1. This means that all hashes necessary for verifying data integrity need to be put into the same datagram as the data.

1http://bittorrent.org/beps/bep_0030.html

For both use cases, streaming and downloading, a unified integrity checking scheme that works down to the level of a single datagram was developed. As a general rule, the sender should append to the data some metadata represented by the hashes necessary for the data verification. While some optimistic optimizations are definitely possible, the receiver should drop data if it is impossible to verify it. Before sending a packet of data to the receiver, the sender inspects the receiver's previous acknowledgments to derive which hashes the receiver already has for sure. The data is acknowledged in terms of binary intervals, with a base interval of one 1KB "packet". As a result, every single packet is acknowledged a logarithmic number of times. This mechanism provides some necessary redundancy of the acknowledgements and sufficiently compensates for the unreliability of the datagrams. The only function of TCP that is also critical for swift is congestion control. To facilitate delay-based congestion control, an acknowledgment contains, besides the size of the data received from its addressee, a timestamp. Our main objective is to integrate swift as a transport protocol in the Linux kernel networking stack. This will provide a notable performance improvement regarding data transfer. We intend to do this with minimal intrusion in the Linux kernel and also to change the current swift implementation as little as possible. Another goal is to provide a transparent API between the kernel and the user space. A developer will use a socket-like interface when building an application on top of the swift protocol. In order to achieve this goal we have implemented an intermediary step: we have simulated the kernel part in user space using raw sockets. This has the advantage of providing the means for modular functionality tests.
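To illustrate the datagram-level integrity checking described above, the following minimal sketch (not the libswift implementation; sha1() is assumed to be provided by a crypto library) verifies a received chunk against the swarm's root hash using the sibling ("uncle") hashes carried in the same datagram:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_SIZE 20

void sha1(const void *data, size_t len, uint8_t out[HASH_SIZE]); /* assumed external */

/* Verify a chunk at leaf position chunk_index against the root hash,
 * given one sibling hash per tree level. Returns 1 on success. */
int merkle_verify(const uint8_t *chunk, size_t len, uint64_t chunk_index,
                  const uint8_t uncles[][HASH_SIZE], unsigned n_levels,
                  const uint8_t root[HASH_SIZE])
{
	uint8_t cur[HASH_SIZE], buf[2 * HASH_SIZE];
	unsigned level;

	sha1(chunk, len, cur); /* leaf hash of the received chunk */

	for (level = 0; level < n_levels; level++) {
		/* the index bit at this level tells whether the running
		 * hash is the left or the right child of its parent */
		if ((chunk_index >> level) & 1) {
			memcpy(buf, uncles[level], HASH_SIZE);
			memcpy(buf + HASH_SIZE, cur, HASH_SIZE);
		} else {
			memcpy(buf, cur, HASH_SIZE);
			memcpy(buf + HASH_SIZE, uncles[level], HASH_SIZE);
		}
		sha1(buf, 2 * HASH_SIZE, cur); /* move one level up */
	}

	return memcmp(cur, root, HASH_SIZE) == 0;
}

If the hash at the top does not match the root, the receiver simply drops the data, as stated above.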

6.2 Designing a Multiparty Protocol

While designing our system, we have tackled a few different ideas, each with its strengths and weaknesses. We present some of those preliminary ideas that led to our current design choice. The first approach we thought of was to include all of the swift protocol in kernel space. This approach had the advantage of simplicity and would have implied minimal architectural changes. The current user space implementation could have been ported to a kernel module. Though simple, this approach could not be implemented because of the restriction on memory size in the kernel. For the integrity check, the swift protocol relies on Merkle hash trees. Keeping such a tree in kernel space memory is not scalable: the Internet content is too large to be stored in kernel space. Even if the tree retains only hashes of the disseminated data, the space is insufficient. The second approach to the swift implementation is represented in Figure 6.1. The swift transport should have been a new kernel interface allowing the creation of specialized swift sockets. It should have implemented the multiparty protocol allowing piece transport to/from other hosts in a Peer-to-Peer fashion. That implementation should have had specialized "request queues", metadata queues, to/from user space. A specialized system call API should have allowed user space applications to interact with the above mentioned queues and, thus, with the multiparty transport protocol implementation. Innate differences from classical one-to-one communication such as UDP or TCP mean the system call API shouldn't have followed the classical send/receive paradigm. In order to compensate for this and to provide a rather "friendly" interface to user space applications, a library was designed. Peer and piece discovery should have been the responsibility of the user space application. The SWIFT Library may also provide wrappers over a UDP-based channel for discovery.

Figure 6.1: Preliminary Architecture (an application uses the SWIFT library in user space; the swift transport sits alongside UDP at the transport layer in kernel space, above the IP, data link and physical layers connecting to the Internet)

Merkle hashes should have been stored and computed in user space. This approach couldn't be implemented because of the restrictions imposed on the library implementation (e.g. a user's application design would have been more restrictive). Moreover, the kernel implementation would have behaved like a UDP that supports multicast transfer. The third approach to the swift implementation was to detach the transport layer from the original swift implementation and to manage it separately. When we started to implement this, we found several inconveniences: our code duplicated a lot of application code, we could not implement the discovery protocol, and, again, our kernel implementation would have behaved like a multicast UDP. This approach also couldn't be implemented because of the complexity of the transport layer management; moreover, we found no strong arguments confirming that our implementation could be better than the original implementation.

6.3 Raw Socket Wrapper Implementation

6.3.1 Raw Socket-Based Architecture

In this section we present a proposed architectural design along with the motivation for our choices. We also detail the protocol and the packet structure used. In Figure 6.2 we see the main conceptual modules: the Application module, the wrapper library, the peer discovery overlay and the swift transport protocol layer. The Application module represents the remaining part of the old swift implementation. This is the part that remains in user space and contains the file management and hash management features. The wrapper library module defines a socket-like API for the user space applications. A regular program will use those calls instead of the normal socket ones to use the multiparty sockets. For the moment those calls are simulated system calls that are initially resolved by the raw socket implementation (still in user space). In the future this wrapper library will represent entry points into the kernel.

Figure 6.2: Overview Architecture

Figure 6.3: Detailed Architecture

The peer discovery overlay will remain unchanged. It is still going to work based on UDP sockets and link the same levels in the swift implementation as before. Peer discovery will be part of the application implementation and it will be the developer's choice how to implement and manage it. The multiparty protocol is implemented for now at user space level, through a raw socket layer, in order to validate our architecture. This has the advantage of simulating the real design modularization but also permits an easier debugging and testing procedure for the integration. In the next step this part will be represented by a kernel patch that will communicate through custom made system calls with the wrapper library. These two phases are described in Figure 6.3. A socket is one of the most fundamental technologies of computer networking. Sockets allow applications to communicate using standard mechanisms built into network hardware and operating systems. Raw mode basically allows bypassing some of the way the computer handles TCP/IP. Rather than going through the normal layers of encapsulation/decapsulation that the TCP/IP stack in the kernel performs, the packet is passed directly to the application that needs it. No TCP/IP processing takes place – so it is not a processed packet, it is a raw packet. The application using the packet is now responsible for stripping off the headers and analyzing the packet – all the things that the TCP/IP stack in the kernel normally does. The raw socket implementation will support all system calls and it will be a copy of the kernel implementation, using the same API and behavior as the kernel implementation. Still, in the first implementation, a swift socket will be available to act only as a seeder or as a leecher; only one role will be explicitly supported: transmitting data or receiving data.

Figure 6.4: Receiver Conceptual Model

Figure 6.5: Sender Conceptual Model

In the future implementation the swift protocol will be developed in kernel space, and it will be accessible through a datagram socket that supports all socket syscalls. It is intended to support both data operations (receive and send) over a single socket. Figure 6.4 presents the conceptual model of the Leecher. The Leecher is the one that requests data. In order to do this it must connect to the multiparty protocol by creating and binding to a multiparty socket. When it binds to a socket, it uses the hash as a parameter to find a connection with a peer that has access to the respective file. This discovery is done through the separate peer discovery overlay. The Leecher then waits for packets from seeders. Figure 6.5 presents the conceptual model of the Seeder. The Seeder is the one that serves data to other Leechers. In order to do this it must connect to the multiparty protocol by creating, binding and listening to a multiparty socket. When binding, the Seeder uses the hash as a parameter. This means that for every file hashed there will be a socket on which the seeder may receive and serve requests. The Seeder then waits for requests and sends data packets as requested.
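As an illustration of the Leecher's conceptual model, a hypothetical skeleton using the wrapper calls that appear later in the test suite (sw_socket, sw_recvfrom, fill_sockaddr_sw) could look as follows; the sw_bind signature is an assumption:

#include <stdio.h>
#include <sys/socket.h>

static void leecher_skeleton(void)
{
	int sockfd;
	struct sockaddr_sw local_addr, remote_addr;
	char buffer[BUFSIZ];
	ssize_t n;

	/* a multiparty socket, as in Listing 6.3 */
	sockfd = sw_socket(PF_INET, SOCK_DGRAM, IPPROTO_SWIFT);

	/* bind using the file hash; the peer discovery overlay is
	 * responsible for finding seeders for that hash */
	fill_sockaddr_sw(&local_addr, &remote_addr, "127.0.0.1", "myHash", "127.0.0.1");
	sw_bind(sockfd, (struct sockaddr *) &local_addr, sizeof(local_addr));

	/* wait for data packets from seeders */
	n = sw_recvfrom(sockfd, buffer, BUFSIZ, 0,
			(struct sockaddr *) &remote_addr, sizeof(remote_addr));
}

The Seeder follows the same pattern, with an additional listen step and a send loop instead of the receive call.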

6.3.2 Testing of Raw Socket Implementation

Our main focus when modifying the swift implementation is improving time performance. With a communication protocol, the greatest latency is usually generated by waiting for results from the network. The multiparty communication model already takes care of this, so the next best thing is to improve the time spent in the application itself. We are doing this by decreasing the time penalties due to context switches between user space and kernel space. The main idea is to reduce the number of system calls made from user space into the kernel, which implicitly reduces the number of preemption moments.

First, we have implemented a test suite for every socket system call, which tests all possible cases. These unit tests are created to ensure code quality and to validate that each system call behaves in the same way as other, similar system calls. Secondly, we have implemented a functional test suite to validate a simple workflow. For this purpose we tested a peer that acts as the seeder and a receiver that behaves as a leecher. A small file is transferred between the two entities to ensure proper delivery. The test suite is mainly implemented using arrays of function pointers or function pointer structures. A top level structure defines test suites for each function exported by the implementation. Each test suite is a series of methods, each testing a variant of the call of the function. The top-level test suite is defined by the test_fun_array, as seen in Listing 6.1.

static void (*test_fun_array[])(void) = {
	NULL,
	test_dummy,
	socket_test_suite,
	bind_test_suite,
	getsockname_test_suite,
	getsockopt_test_suite,
	sendto_test_suite,
	recvfrom_test_suite,
	sendmsg_test_suite,
	recvmsg_test_suite,
	close_test_suite,
};

Listing 6.1: Test Suite Top-Level Array

Each member of the array is a test suite for one of the functions exported by the implementation. The exported functions are basic socket API functions. Each such test suite is a sequential caller of individual test methods, as shown in Listing 6.2.

void recvfrom_test_suite(void)
{
	start_suite();
	recvfrom_dummy();
	recvfrom_invalid_descriptor();
	recvfrom_descriptor_is_not_a_socket();
	recvfrom_socket_is_not_bound();
	recvfrom_after_sendto_ok();
	recvfrom_after_sendmsg_ok();
}

Listing 6.2: Test Suite for recvfrom Call

Each individual method is used to test a single aspect of the test suite. For example, the recvfrom_socket_is_not_bound method shown in Listing 6.3 tests that the implementation returns an error when receiving on a socket that is not bound (that is, no bind call was made).

static void recvfrom_socket_is_not_bound(void)
{
	int sockfd;
	struct sockaddr_sw local_addr;
	struct sockaddr_sw remote_addr;
	ssize_t bytes_recv;
	char buffer[BUFSIZ];

	sockfd = sw_socket(PF_INET, SOCK_DGRAM, IPPROTO_SWIFT);
	DIE(sockfd < 0, "sw_socket");

	fill_sockaddr_sw(&local_addr, &remote_addr, "127.0.0.1", "myHash", "127.0.0.1");
	bytes_recv = sw_recvfrom(sockfd, buffer, BUFSIZ, 0,
			(struct sockaddr *) &remote_addr, sizeof(remote_addr));

	test(bytes_recv < 0 && errno == EAFNOSUPPORT);
}

Listing 6.3: Basic Method for Testing recvfrom Call

We will implement performance tests to validate our design and implementation. These tests will compare the number of system calls for both implementations (the original swift implementation and our kernel-based implementation).

6.4 Kernel Framework for Multiparty Protocol Implementation

The introduction of a multiparty protocol in the Linux kernel was a challenge, because of its particularities compared to common protocol implementations. A multiparty transport protocol uses multiple points in a communication, unlike a traditional communication protocol that allows one sender endpoint and one receiver endpoint. A traditional Transport Layer protocol header uses a port to differentiate between the various processes that take part in a communication, and uses Network Layer address information for host information. For a multiparty Transport Layer protocol, the port information would be substituted by a file (identified through its hash). While traditional protocols use “data” (either streams – TCP, or datagrams – UDP), a multiparty protocol is concerned with sending and receiving parts of a given file, which significantly alters its design. Currently, the swift protocol1 that forms the basis of the current design is implemented in user space, using UDP sockets. There is no need for TCP, as the ordered delivery offered by TCP is of no importance to swift. Packets may be received out of order and the same piece of information may be received multiple times from different peers. Redundancy is important to swift: the first arriving piece is added to the “stream” while the rest are discarded. The main feature of the swift protocol is that it is content-centric, unlike TCP, which possesses no knowledge of the transported data. Data possesses meaning for swift. Checking pieces that are received through swift is accomplished through the use of Merkle hash trees. This means that all information regarding integrity checking is already placed in the same data packet.
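To make the "file hash instead of port" idea concrete, a hypothetical address structure for a multiparty socket might look as follows; the layout is an assumption for illustration only (the actual struct sockaddr_sw is only named, not defined, in the listings above):

#include <stdint.h>
#include <netinet/in.h>

/* hypothetical layout: the root hash plays the role that the port
 * number plays for TCP/UDP sockets */
struct sockaddr_sw_example {
	sa_family_t    family;    /* address family */
	uint8_t        hash[20];  /* root hash identifying the file */
	struct in_addr addr;      /* IP address of the endpoint */
};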

6.4.1 Transport Protocol Implementation on the Linux Kernel

In order to implement a transport protocol in the Linux kernel, several design phases must be completed:

1http://libswift.org

• Defining an IPPROTO_* constant (IPPROTO_SWIFT in our case). This macro identifies the transport protocol and is subsequently used when creating a transport protocol socket.
• Defining a transport header. The framework we used defines two 8-bit ports, a source port and a destination port, and a 16-bit length field; the latter is the data length, including the transport protocol header (a sketch of this header is shown below).

After the above have been completed, several data structures have to be defined, as mentioned below. A data structure that defines the new socket type. This is where we must save information regarding the socket state, such as the destination or the source port.

struct swift_sock {
	struct inet_sock sock;
	/* swift socket specific data */
	uint8_t src;
	uint8_t dst;
};
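The transport header from the second bullet could be declared as follows; the field names are assumptions for illustration:

/* sketch of the swift transport header: two 8-bit ports and a 16-bit
 * total length (header plus data), kept in network byte order */
struct swifthdr {
	uint8_t  src;  /* source port */
	uint8_t  dst;  /* destination port */
	uint16_t len;  /* total length, including this header */
};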

The protocol definition used by the socket, including its name and size, is given in a struct proto structure.

static struct proto swift_prot = {
	.obj_size = sizeof(struct swift_sock),
	.owner    = THIS_MODULE,
	.name     = "SWIFT",
};

The most important structure to be defined describes the operations that are supported by a socket of a given type. For a datagram socket, the implementation of the release, bind, connect, sendmsg and recvmsg functions is sufficient.

static const struct proto_ops swift_ops = {
	.family      = PF_INET,
	.owner       = THIS_MODULE,
	.release     = swift_release,
	.bind        = swift_bind,
	.connect     = swift_connect,
	.socketpair  = sock_no_socketpair,
	.accept      = sock_no_accept,
	.getname     = sock_no_getname,
	.poll        = datagram_poll,
	.ioctl       = sock_no_ioctl,
	.listen      = sock_no_listen,
	.shutdown    = sock_no_shutdown,
	.setsockopt  = sock_no_setsockopt,
	.getsockopt  = sock_no_getsockopt,
	.sendmsg     = swift_sendmsg,
	.recvmsg     = swift_recvmsg,
	.mmap        = sock_no_mmap,
	.sendpage    = sock_no_sendpage,
};

The handler for newly received packets is set up in a struct net_protocol structure. New packets received directly from the network will have the protocol field in the IP header filled with the value of the implemented Transport Layer protocol.

static const struct net_protocol swift_protocol = {
	.handler   = swift_rcv,
	.no_policy = 1,
	.netns_ok  = 1,
};

A final data structure has to connect the new protocol to the operations on its socket. This structure uses two pointers to the struct proto and struct proto_ops data structures:

static struct inet_protosw swift_protosw = {
	.type     = SOCK_DGRAM,
	.protocol = IPPROTO_SWIFT,
	.prot     = &swift_prot,
	.ops      = &swift_ops,
	.no_check = 0,
};

As soon as the above structures are defined, the new protocol can be registered with the kernel. The sequence of operations used to accomplish this step is shown below.

proto_register(&swift_prot, 1);
inet_add_protocol(&swift_protocol, IPPROTO_SWIFT);
inet_register_protosw(&swift_protosw);
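These calls would typically be wrapped in a module initialization function; a sketch (error handling abbreviated, with unregistration mirrored in the module exit routine) could look as follows:

#include <linux/module.h>

static int __init swift_init(void)
{
	int rc;

	rc = proto_register(&swift_prot, 1);
	if (rc)
		return rc;

	rc = inet_add_protocol(&swift_protocol, IPPROTO_SWIFT);
	if (rc) {
		proto_unregister(&swift_prot);
		return rc;
	}

	inet_register_protosw(&swift_protosw);
	return 0;
}
module_init(swift_init);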

Before compiling and inserting the module in the kernel, several socket operation functions have to be implemented; for swift (a datagram based protocol), this means release, bind, connect, sendmsg and recvmsg. A handler for packets received from the network must also be implemented. At protocol level, one must keep a mapping between a port and a socket, meaning that the socket is bound to that port. Each implemented socket operation function fulfills a different role:
• release is used for freeing resources (most likely memory) used by the socket.
• bind checks the availability of the port and ties the socket to that port and a source IP address.
• connect maps the current socket to a destination port and IP address.
• sendmsg extracts the destination IP address and port, provided they were passed as arguments from user space. If they were not provided, the module checks whether the socket is connected (through a previous connect call), and, in case no socket is connected, an error is returned. After establishing the receiving end (IP address and port), an skb structure is allocated, accounting for the protocol header and the rest of the data. The swift header is filled in, user space data is copied and the packet is routed. After the routing process, the packet is queued for transmission using the ip_queue_xmit function call.
• recvmsg is responsible for the reverse operation of sendmsg. A datagram is read from the receive queue through the use of the skb_recv_datagram call. Several integrity checks are employed, after which data is copied from the skb structure to user space. If the sender address (IP address and port) has been requested (as with the recvmsg and recvfrom socket API calls), the required information is filled in the user space buffer.

In the end, the skb structure is freed through the use of the skb_free_datagram call.
• The function passed as a handler in the net_protocol structure is invoked when a packet is received. The protocol field in the packet's IP header is used to demultiplex the packet and call the required function. When the packet is received, several validations take place. Subsequently, the destination port information is extracted, and a lookup operation returns the socket mapped to the destination port. The skb->cb field is initialized with information regarding the sender. Afterwards, the packet is added to the receive queue through the use of ip_queue_rcv_skb. A condensed sketch of such a handler is shown below.
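The sketch is illustrative, not the actual module code: swift_lookup() is an assumed helper, and the header layout follows the earlier swifthdr sketch.

#include <linux/types.h>
#include <linux/skbuff.h>
#include <net/ip.h>

static struct swift_sock *swift_lookup(u8 port); /* assumed helper */

static int swift_rcv(struct sk_buff *skb)
{
	struct swifthdr *hdr;
	struct swift_sock *ssk;

	/* validate that the datagram holds at least a swift header */
	if (!pskb_may_pull(skb, sizeof(struct swifthdr)))
		goto drop;
	hdr = (struct swifthdr *) skb_transport_header(skb);

	/* find the socket bound to the destination port */
	ssk = swift_lookup(hdr->dst);
	if (ssk == NULL)
		goto drop;

	/* skb->cb would be filled here with sender information, then
	 * the packet is queued on the socket's receive queue */
	return ip_queue_rcv_skb((struct sock *) ssk, skb);

drop:
	kfree_skb(skb);
	return 0;
}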

6.4.2 Testing the Kernel Protocol Implementation

Unit tests have been employed for testing the protocol. Among these are a few simple tests that ensure basic functionality, and then a set of new tests that check error codes and protocol performance. Protocol performance is related to ensuring scalability – multiple simultaneous client connections. Basic functional testing means the design and implementation of tests for each function exposed by the protocol. That is, there is a given test for each of bind, connect, close, sendto, recvfrom, sendmsg, recvmsg, send and recv. The test suite used is the same one used for the raw socket implementation. Due to the identical API provided, we could easily reuse those tests for the kernel implementation. Negative testing, for unwanted/unwelcome situations, is similarly accomplished: each function exposed by the protocol is assigned a test. For example, the bind test always checks that an error is returned for a duplicate call using the same port, as is the case for the use of invalid IP addresses or invalid port numbers; a sketch of such a test follows this paragraph. A series of such tests is employed for each functionality. Performance testing is completed through the use of multiple protocol sockets. Data sent from a socket is checked against data arriving at the other end; this actually means checking whether the port and socket mapping is working according to the specifications. Another measure of protocol performance is the transfer speed on the newly created socket.
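A hypothetical negative test in the style of the suite above (the exact error value returned on a duplicate bind is an assumption):

static void bind_duplicate_port(void)
{
	int s1, s2, rc;
	struct sockaddr_sw addr, remote;

	s1 = sw_socket(PF_INET, SOCK_DGRAM, IPPROTO_SWIFT);
	s2 = sw_socket(PF_INET, SOCK_DGRAM, IPPROTO_SWIFT);

	fill_sockaddr_sw(&addr, &remote, "127.0.0.1", "myHash", "127.0.0.1");

	/* first bind must succeed */
	rc = sw_bind(s1, (struct sockaddr *) &addr, sizeof(addr));
	test(rc == 0);

	/* a second bind on the same port must fail */
	rc = sw_bind(s2, (struct sockaddr *) &addr, sizeof(addr));
	test(rc < 0);
}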

6.5 Conclusion

The swift protocol is a multiparty content-centric protocol that aims to disseminate content among a swarm of peers. This chapter proposed an approach for the optimization of the current swift implementation. The integration of communication in kernel space, as a multiparty transport protocol that is solely responsible for getting the bits moving, improves overall protocol performance. It ensures maximum efficiency of data transfer by decreasing the number of switches between user and kernel space and eliminating some of the performance penalties due to those switches. After we complete the implementation and the functional tests, we want to extensively test our new features in a real environment. We plan to do stress tests using a cluster. These tests will help create an overview of our implementation and allow comparison against the user-space implementation of swift to determine the performance gains. If the results are satisfactory, optimization and feature addition are to be continued. It would be very useful to have a real application on top of the swift protocol. If there is none, one solution would be to port an application strictly for this task. This would give us the opportunity to extend and refine our implementation, and also to extend the library API.

The multiparty protocol is planned to be implemented in the Linux kernel, by adapting it to the framework described above. By bringing the protocol into the kernel, its performance should increase due to the decrease in the number of user space/kernel space switches (as part of the decrease in system calls). Currently the framework lacks a packet fragmentation mechanism. The current implementation can't send data larger than the Maximum Transfer Unit (MTU) (1500 bytes for Ethernet) on a socket. A possible solution is borrowing the fragmentation mechanism used by the UDP implementation. Chapter 7

Measurements of Multimedia Distribution in Peer-to-Peer Networks

A large part of content distribution in the Internet is currently video content. Video content means large files, typically movie files of CD or DVD size. File size increases as the bitrate or resolution increases. Apart from movie files, sites such as YouTube, focusing on small videos published by average users, are attracting millions of users. Content is accessed many times and transferred across the Internet. A Peer-to-Peer protocol, and BitTorrent in particular, is one of the most adequate solutions for distributing large content. Rapid creation of swarms, tit-for-tat and rarest-piece-first allow BitTorrent to rapidly distribute files among peers. When thinking about video content, this mode is known as offline playback – the video file is downloaded and subsequently rendered. Separate approaches are to be undertaken for Video-on-Demand and Live Streaming, as described in Section 7.1. Video-on-Demand refers to rendering video content that is already stored (in full size) on a given server or peer. Video-on-Demand content may be fast-forwarded or rewound to the desired position. Sites such as YouTube or Vimeo are providing VoD content. Live Streaming content is content that is generated in real time; typically this also incurs significant constraints on network latency, as data has to be delivered as it is generated. Live Streaming uses cameras and video devices that capture real activities and deliver them to the network at the same time.

Video files are actually a set of frames (images) that are played back at a given rate. Most of the time, this rate is 24 fps (frames per second), which is enough to “fool” the human brain into perceiving continuous motion. A movie file possesses multiple features such as video rate, resolution, container and codec. Recent years have seen the proliferation of HD content. HD – High Definition – characterizes video files possessing a large resolution. Common HD video resolutions are 1280x720 pixels, also known as 720p, and 1920x1080 pixels, also known as 1080p. HD videos use a wide screen aspect ratio (16:9). HD videos owe their proliferation to ever increasing computing power (for decoding and rendering video files), storage space (for storing the actual files) and network bandwidth (allowing the transfer of information). Sites such as YouTube and Vimeo are able to provide HD videos. HD refers strictly to the resolution of the video file. It makes no assumption about the codec technology or video bitrate; one may find an HD asset to be of lower quality than an SD (standard definition) video because of its poorer codec or lower bitrate. The proliferation of HD content has also meant the proliferation of HD-able video formats (HDCAM, D5 HD, AVCHD) and video devices. Modern video cameras are HD-able, the recent Blu-ray disc typically uses HD media, and TV sets nowadays are capable of rendering HD content.

A specialized media device is the Set-top Box (STB). A set-top box connects to a TV and an external source of signal and generates the required content for the TV set. Set-top boxes are more generic devices than TV sets, as a set-top box may integrate multiple features in its firmware. For example, the Pioneer NextShare STB that is used in the P2P-Next project integrates P2P facilities, social networking, per-user customization, remote connections and others. Set-top boxes are main components in IPTV networks, as they may be easily connected to an IP network and decode video streams provided through the network connection.

7.1 Video Content Creation and Processing

Video content forms the most heavily used traffic class in the Internet. HTTP, Peer-to-Peer and Flash owe the greatest component of their bandwidth consumption to video content1. With the ever increasing bandwidth in the Internet, video content demand is itself increasing. Although beset by legal issues, Peer-to-Peer traffic is currently largely made up of video content. Video content is, at its base, a series of frames that are rendered at a fast pace by a video player. Video files may be streamed, allowing the user to view content without having access to the complete file. In order to reduce the size of a video file, codecs are typically used. A codec will use lossy or lossless data compression to diminish the size of a video file. The video content/stream is provided together with the audio stream inside a container. An audio/video container file is used to identify and interleave different multimedia data types. It usually stores different audio, video and subtitle streams. Video content is typically streamed from a video source (server) to a series of video consumers. Classical television uses air or cable as support for delivering data from the transmitter station to average users. We dub the source the video content producer and the users the video content consumers. With the advent of computers and large sized storage devices, users have been able to consume content in an offline mode; CDs, DVDs, Blu-ray discs and hard disk drives are able to store whole video assets that can be rendered on commodity hardware. A user would either use a specialized device such as a CD/DVD player or a PC with a video player. The devices or software applications implement the required codecs and provide video content to the user. The offline mode is advantageous as the user has access to the whole video file at a given time. The time it takes for the video to be available doesn't influence the playback or rendering speed; it may, however, affect the user's patience. There are several disadvantages:
• the user may not live stream – that is, consume content that is being produced in real time;
• disk space or the number of CDs/DVDs increases as content is stored;
• there is an “availability” overhead from discovering the video to rendering it;
• the user is not involved in the process; he/she is a non-participatory consumer; she may not put up content, give out feedback or involve other users.

1Sandvine: Global Broadband Trends, http://www.sandvine.com/news/global_broadband_trends.asp

The use of streaming solves or alleviates some of the above problems. Streaming is the means by which the user is able to consume video content immediately, as the source delivers it, similarly to classical television. The availability of streaming solutions has been enabled by the constant growth of network bandwidth. Users may find little problem in using sites such as YouTube or Vimeo or in watching Live TV content delivered over the Internet. As network bandwidth is deemed sufficient, HD content is becoming more and more prevalent in today's Internet. When streaming video, the user is able to consume content that is produced at that time, for example watching a sports event or a TV show. There is little or no overhead between discovering the video content and consuming it. The user does not need to store the whole video, only the pieces of information that are required for playback at a given moment in time. Disk space is thus saved; if required, the user may stream the video content again. A hybrid form between classical content distribution and network-based streaming is IPTV. IPTV is a system through which content is delivered using network links and methods particular to the IP stack. The consumer uses a specialized device, such as a Set-top Box, to render content. The device possesses an Ethernet controller and firmware that is able to receive content, decode it and render it to a TV or monitor. IPTV is the technology allowing Video-on-Demand to be delivered to TV sets – allowing a user to browse a film catalog and choose the content he/she wants to play. An important advantage, given the structure and ubiquity of the present-day Internet, is the possibility for a user to actively participate in the streaming network. She may rate movie files, provide feedback or even provide her own video content files that other users may stream. Sites such as YouTube have popularized the prosumer terminology, that is, a user that is both able to consume and to produce video files. Users may easily form communities, share video files and comment on each other's videos. The prosumer is an active component in a video-centric P2P streaming network. Amateur videos, such as those provided on YouTube, are also dubbed UGC – User Generated Content. Considering the availability of the whole video stream to be rendered, content may be considered to be either Video-on-Demand (VoD) or live streaming (classical streaming), as presented by Mol [51][50]. In Live Video Streaming, the video stream is generated at the same time as it is downloaded and viewed by the users. All of the users aim to watch the latest generated content. The end-user experience is thus comparable to a live TV broadcast or a webcam feed. The user needs a download speed at least equal to the playback speed if data loss is to be avoided. In Video-on-Demand, the video stream is generated beforehand, but the users view the stream while they are downloading it. An on-line video store, for example, would use this model to provide movies on demand to its customers. The users typically start watching the video stream from the beginning. At any moment, the users will be watching different parts of the video. Video clips available on websites such as YouTube are examples of video-on-demand streaming. Similar to live video streaming, the user needs a download speed at least equal to the playback speed to avoid data loss. The simple approach towards providing streaming is using a dedicated server.
This server may store content that is to be provided on a VoD basis, or may be connected to a video camera or device that is producing video content live. Dedicated applications or devices enable the serving of video content to users. However, the use of a single dedicated server raises the issue of scalability and fault tolerance. Considering the use of unicast protocols, each streaming request would require a new data channel between the client and the server. One solution is the use of multicast protocols. Other solutions to this problem are the use of Content Delivery Networks (see below) and Peer-to-Peer systems. Video-on-Demand, a popular service on the Internet, is usually provided by Content Delivery Networks (CDNs). A CDN is a client-server architecture in which the server that provides the video stream is replicated at multiple sites. Websites that host video clips or live video streams, such as YouTube, make use of a CDN in order to be able to deliver the bandwidth needed to serve all of their users. As with CDNs, Peer-to-Peer networks aim to solve the problem of scalability in a traditional single stream server architecture. A Peer-to-Peer approach makes use of the large potential bandwidth between peers and removes the connection bottleneck of the server. Peer-to-Peer based streaming has been a hot topic in the last years [40], with several solutions aiming to integrate live streaming and Video-on-Demand into traditional file sharing networks. While Peer-to-Peer networks could potentially ensure unrivaled scalability and cost reduction, issues remain regarding Quality of Experience and buffering overhead. More on this topic is discussed in the next section.

7.2 Adding Video Streaming Support in libtorrent

libtorrent-rasterbar1 is a popular solution implementing the BitTorrent specification2 and many of its enhancements. Written in C++, it uses advanced operating system features, allowing good performance. It is used by a large number of BitTorrent clients, including Deluge, qBitTorrent, ZipTorrent and LimeWire. Due to its library format, applications based on libtorrent-rasterbar may be easily extended to use the plethora of features provided. In experiments at University POLITEHNICA of Bucharest, libtorrent-rasterbar has been the most heavily used implementation, together with its lightweight application – hrktorrent3. Testing provided insight into the performance superiority of libtorrent-rasterbar-based applications, as hrktorrent consistently performed better than other applications [16]. As, from a performance point of view, libtorrent-rasterbar possesses a high rank among BitTorrent implementations, it was considered one of the best choices for adding streaming support. libtorrent-rasterbar is used only as a classical BitTorrent distribution solution, not as a streaming solution. The goal was to provide Video-on-Demand (VoD) support by altering the required components of the implementation. For an analysis of the streaming solutions versus classical BitTorrent downloading, we use the following parameters:
• M – the number of pieces in a given file;
• U – the maximum number of upload slots for a given peer;
• D – the maximum number of download slots for a given peer;
• C – the bandwidth capacity of the peer link;
• 1/µ – the seeder resident time in the swarm;
• T – the total transfer time for a given file.
Each file is split in M pieces. Each peer is allowed to use U simultaneous upload slots and D simultaneous download slots. Each connection has an average bandwidth capacity C. It takes T units of time for a new peer (one having no pieces of the file) to transfer the whole file – all M pieces – from swarm peers.

1http://www.rasterbar.com/products/libtorrent/
2http://bittorrent.org/beps/bep_0003.html
3http://50hz.ws/hrktorrent/

7.2.1 Employed Streaming Solutions

The piece selection strategy in a typical BitTorrent swarm is based on rarest-first. This means that each peer tries to acquire the rarest piece first, ensuring better performance than random piece selection [6][23]. An analysis has proven that rarest-first has the best performance when considering the efficiency vs. cost factor. Rarest-piece-first works as follows: each peer holds a list of the number of copies of each piece in its peer/neighbor set and uses that list to define its set of rare pieces. If m is the number of copies of the rarest piece, then the index of that piece appears m times in the piece set. Each peer updates the piece set each time a piece copy is added or removed from its set of peers. The rarest piece selection policy aims at making each piece equally distributed among peers. When a peer connects to other peers, it tries to get the rarest piece (which it doesn't already have) from the set of pieces available at neighboring peers. It has been computed [57] that the average download time is:

\[ T = \frac{1}{UC} - \frac{1}{\mu} \tag{7.1} \]

The above formula shows that the download time isn't affected by the arrival rate of new peers. As expected, download time improves for a high upload capacity and as the seeder resident time increases. A streaming-friendly algorithm for piece selection is an updated version of the BitTorrent protocol in which users download pieces of the file strictly in sequential order – “strictly in order”. Downloaders use local knowledge to request pieces in sequential order from connected peers. The download process takes place in ages. At each age, a downloader may request D sequential pieces. A subset of these requests is satisfied in a single age. Each peer enters the swarm at time t and completes the transfer at t+T. There are old and young peers, depending on their swarm entry time. As a consequence of the particularities of the “strictly in order” piece selection algorithm, tit-for-tat cannot be used within a single swarm. Peer relations are asymmetrical and a downloader may never serve pieces to an uploader. This means that piece requests are non-uniformly distributed among the whole swarm: old peers receive more requests than younger peers, but may only serve U of them. As presented in Figure 7.1, we define sequential progress, which refers to the ability of a piece selection policy to request the proper pieces in order to allow stream playback. Sequential progress is independent of download progress. The figure presents a comparison between the two policies (random and strictly in order). As expected, the random strategy results in poor sequential progress, albeit good download progress. Although not useful as a policy for Peer-to-Peer streaming, this policy makes a good point of reference, since no other algorithm may have a poorer sequential progress. One may notice a wide space between the random strategy (poorest choice) and the strictly in order strategy (best choice). This suggests the possibility of strategies that may be developed to still allow good sequential progress while not compromising the inherent behavior of the BitTorrent protocol. Presuming, as is usually the case, a connection that possesses a bandwidth capacity higher than the playback rate, efficient streaming doesn't need to employ strictly in order piece selection. It only needs a sufficient number of pieces that allow content playback to happen at a consistent rate, while other pieces may be retrieved using other algorithms. A compromise needs to be negotiated between sequential progress and download speed, download latency and swarm health. In order to achieve that compromise, we define deadline pieces. These pieces carry a deadline specifying when the piece needs to be downloaded. Such a piece receives priority in the piece selection if it is close to its deadline.
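As a minimal illustration of the strictly in-order selection described above (generic C, not the actual implementation), a downloader could scan forward from a cursor and request the first missing pieces that the remote peer owns, up to its D download slots:

#include <stdbool.h>
#include <stddef.h>

/* Pick up to d piece indices, in strictly increasing order, that we
 * are missing and the remote peer has; returns how many were picked. */
size_t pick_sequential(const bool *have, const bool *peer_has,
                       size_t num_pieces, size_t cursor,
                       size_t *out, size_t d)
{
	size_t n = 0;

	for (size_t i = cursor; i < num_pieces && n < d; i++)
		if (!have[i] && peer_has[i])
			out[n++] = i;

	return n;
}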

Figure 7.1: Sequential Progress [57]

With deadline pieces, some of the pieces may be downloaded through rarest-first while others are downloaded strictly in order. Deadlines may be computed by taking into account the media bitrate and the number of pieces of the video asset.
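For illustration, under the assumption of a constant bitrate, a per-piece deadline could be derived as follows (a sketch, not libtorrent code): piece i must be available before playback reaches its byte offset in the stream.

#include <stdint.h>

/* milliseconds from playback start until piece `index` is needed,
 * assuming a constant bitrate expressed in bytes per second */
static uint64_t piece_deadline_ms(uint64_t index, uint64_t piece_size,
                                  uint64_t bitrate_bytes_per_s)
{
	uint64_t byte_offset = index * piece_size;

	return byte_offset * 1000 / bitrate_bytes_per_s;
}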

7.2.2 Design and Implementation of a Streaming Component in libtorrent

The module implementing the piece selection is a central component in a BitTorrent implementation. In libtorrent it is optimized to rapidly find the rarest pieces. It keeps a list of available pieces, sorted by rarity, in which equally rare pieces are mixed. The rarest-first model is the main strategy for piece selection. Other models are supported, though their use is particular to certain situations. The piece selector allows combining the availability of a given piece with a certain piece priority. These parameters determine the criterion used for sorting the piece list. Zero priority pieces will never be selected, an option used for selective downloading. libtorrent-rasterbar encapsulates the important components of the implementation in several classes:
• session – manages torrent sessions;
• torrent – defines everything regarding a torrent and its file;
• piece_picker – the piece selection algorithms;
• peer_connection – characterization of connections to neighboring peers.
The piece_picker::pick_pieces() function is invoked each time a piece is selected. For the strictly in order implementation, a specialized option dubbed sequential is added to the piece picker and is checked inside the piece_picker::pick_pieces() function. A cursor variable is inserted that stores the index of the last downloaded piece. The interesting-block list is appended with the blocks not already downloaded that can be downloaded sequentially from the given peer. New blocks are added in the same manner until the requested block limit is reached or no more pieces are available. Note that piece requests are sequential, but not necessarily piece delivery. In order to implement the piece deadline strategy, a new concept was added: time_critical_pieces; these pieces differ from normal pieces by having a deadline attached (piece_deadline). Within the library, these pieces are requested in a different manner than normal pieces. Usually, after a peer completes a request, a new piece request is sent to that peer. For deadline pieces, the peer list is searched for peers possessing that piece. Peers are sorted by download speed and outstanding bytes.

Figure 7.2: Evolution of Different Piece Selection Policies. (a) Download Size Evolution; (b) Download Speed Evolution

Several methods were added that allow setting a deadline for pieces and requesting deadline pieces, such as torrent::request_time_critical_pieces(). Each peer holds a list of time-critical pieces; the first piece is dequeued and requested. A new option was added to the control interface to enable the deadline piece model. Both for the piece deadline strategy and for the strictly in order strategy, there are performance gains if partially available pieces are prioritized, that is, pieces whose blocks are already available to the peer.

Special thanks go to Arvid Norberg, the main developer behind libtorrent-rasterbar. He provided useful tips, suggestions and support, both on the IRC channel and on the developer's mailing list.
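For comparison, recent releases of libtorrent-rasterbar expose similar knobs through their public Python bindings: sequential downloading and per-piece deadlines. The sketch below uses the current public API names, which are not necessarily those of the thesis-era modified library; the torrent file name and media bitrate are assumptions.

```python
# A usage sketch against the Python bindings of a recent
# libtorrent-rasterbar; it illustrates the two strategies discussed
# above and is not the thesis-era code.
import time
import libtorrent as lt

ses = lt.session()
info = lt.torrent_info("video.torrent")          # hypothetical torrent file
h = ses.add_torrent({"ti": info, "save_path": "."})

# Strategy 1: strictly in order.
h.set_sequential_download(True)

# Strategy 2: deadline pieces; give every piece a deadline matching its
# playback time, leaving scheduling slack to the library.
bitrate = 1_400_000 // 8                         # bytes/s, assumed media bitrate
piece_time = info.piece_length() / bitrate       # seconds of playback per piece
for i in range(info.num_pieces()):
    h.set_piece_deadline(i, int(i * piece_time * 1000))  # deadline in ms

while not h.status().is_seeding:
    time.sleep(1)
```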

7.2.3 Experimental Evaluation of Streaming Support

In order to test the implementation, a specialized text-based user interface (TUI) was designed and implemented. The interface prints a small bar for each piece for which information is available; for a sequential algorithm, one can notice the ordered, piece-by-piece growth of availability at a peer.

A sample graph comparing the rarest-first (blue), sequential (red) and deadline piece (yellow) algorithms is shown in Figure 7.2. The two graphs show download evolution (in megabytes) and speed evolution (in KB/s). The swarm used consisted of several seeders and a high number of leechers; the file size was 37 MB, split into 578 pieces. During this experiment the best performing algorithm was, as expected, rarest-first. We estimate that the deadline piece algorithm was outperformed by the sequential algorithm due to the small file size, the large number of seeders and the implementation overhead of the former for a relatively small file.

7.3 Using Next-Share Technology for P2P Streaming

The P2P-Next project was started in 2008 as an EU FP7 project. It aims at delivering the next-generation Peer-to-Peer content delivery platform. The project takes into account the changes in the audio-visual media landscape, with increased demand for content availability on the Internet and for user participation in content generation. P2P-Next considers existing challenges, such as legal frameworks and business constraints, in order to create a viable solution with widespread adoption.

Intelligence, resource management, and explicit memory are the foundations of the P2P-Next content sharing platform, called Next-Share. Next-Share is a self-organizing, completely decentralised system and hence lacks any central bottleneck or choke point that may hamper performance, induce setup cost, or require maintenance. Network effects ensure that content, communities, communication, and commerce flourish as more participants join.

UPB has been part of the P2P-Next project and has invested significant effort in implementing core features of the Next-Share technology and in evaluating it through experiments and user-based trials. Part of our contribution to the project is highlighted in this thesis, with focus on the experimental component in this chapter. As a LivingLab implementer, maintainer and experimenter, UPB has been one of the core partners for evaluating Peer-to-Peer streaming technology against classical BitTorrent distribution.

As the Next-Share technology is intrinsically intertwined with video technology, choosing the best codecs and providing efficient resource usage are among its goals. The technology is currently able to provide efficient access to high definition (HD) video assets. As the codec industry is heavily fragmented, there is no single solution for encoding content; Next-Share therefore relies on available implementations for content delivery, being able to render videos using codecs such as H.264, Ogg Theora and VP8.

7.3.1 Next-Share Architecture and Features

NextShare is a complex architecture integrating features that are common to BitTorrent applications as well as features specific to streaming content.

NextSharePC runs on PCs, supporting the Windows, Linux and Mac OS X operating systems. The NextSharePC application is formed of two parts:

• a back-end server called the NextShare Agent (NSSA), which embeds the P2P core engine (NextShareCore);
• a front-end in charge of providing the user interface and content playback functionality.

The two parts typically run on the same PC and communicate through two local socket-based interfaces:

• the Control Interface, which permits the front-end to control the NSSA through an API that exports NextShareCore functionality, including asking the NSSA to retrieve a media content from the NextShare P2P network;
• the AV Streaming Interface, used by the NSSA to stream the retrieved content to the front-end; it is implemented as a local HTTP connection.

The split between front-end and back-end lets the P2P engine (NSSA) run in the background regardless of whether the front-end interface (GUI) is running. This enables the peer to share content within the NextShare network even if the user is not in front of the PC and/or is not interacting with NextSharePC. The stream reorganizer is a block which permits the delivery of SVC-encoded video content on top of the codec-agnostic mechanisms of NextShareCore.

For the design and implementation of the front-end part of the platform, and in particular for the design of the GUI, the choice of the project was to give priority to web technologies based on HTML5 and JavaScript. The core presents two interfaces: a GUI and a web-based plugin. The GUI is the first version of NextSharePC, while the web-based plugin is the second version. In both cases the front-end relies on the VideoLAN Client (VLC) to reproduce the content retrieved from the NextShare network.
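As a rough illustration of the AV Streaming Interface, a front-end only needs to read the assembled stream from the NSSA's local HTTP connection. The URL, port and content identifier below are hypothetical placeholders, not the actual NSSA endpoint.

```python
# A minimal sketch of a front-end consuming the AV Streaming Interface:
# the NSSA exposes the retrieved stream over a local HTTP connection,
# and the player simply reads from it. Endpoint details are assumed.
import urllib.request

STREAM_URL = "http://127.0.0.1:6878/content/example.tstream"  # assumed local endpoint

with urllib.request.urlopen(STREAM_URL) as stream, open("playback.buf", "wb") as out:
    while True:
        chunk = stream.read(64 * 1024)   # read the stream as it is assembled by the NSSA
        if not chunk:
            break
        out.write(chunk)                 # a real front-end would feed a video decoder instead
```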

Figure 7.3: Give-to-Get Piece Sets [51]

Initial versions of NextSharePC integrate standard releases of VLC; planned future support for SVC-encoded video streams and adaptive play-out of media content will require the integration of additional plug-ins and modifications which are currently being designed in the proof-of-concept prototyping part of the integration work. The use of SVC encoding would provide good performance in a P2P streaming network, as signaled by Mirshokraie et al. [49] and Bernardini et al. [5].

Video-on-Demand in Next-Share

Initially an improved BitTorrent core aiming for better performance and diverse features, the Next-Share core later integrated streaming support, mostly due to Jan David Mol's work [51][50]. Currently, Next-Share supports both Video-on-Demand and live streaming. Both streaming features are enabled through the use of mesh-like overlays, regarded as swarm topologies, since they work on top of BitTorrent.

VoD support in Next-Share is enabled through the deployment of Give-to-Get, a novel approach to negotiating piece exchange. Give-to-Get features many similarities to BitTorrent's own tit-for-tat protocol, but differs in certain areas which make it more suitable for video-on-demand. Specifically, while in offline download mode (classical distribution) the tit-for-tat protocol marks a peer's "altruism" by its entire history, Give-to-Get only checks its recent history. In VoD, a peer is interested in exchanging data with peers that can offer it current pieces, in order to sustain video playback; whether a peer was altruistic with early pieces is of little relevance to the current peer.

As with classical BitTorrent, Give-to-Get splits the file into pieces and, based on its neighbor set, decides which neighbor is allowed to make requests and when requests may be unchoked. Give-to-Get implements a new unchoking algorithm that ranks neighbors by how well they forward chunks. There are three sets a piece may belong to: the high-, mid- and low-priority sets, as presented in Figure 7.3; h is a parameter dictating the amount of clustering of chunk requests in the part of the video yet to be played back.
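A small sketch of the three piece sets, parameterized by the playback position and h, follows. The exact spans of the mid- and low-priority sets below are illustrative, following Figure 7.3 in spirit rather than the reference implementation.

```python
# A sketch of Give-to-Get's priority sets: pieces right after the
# playback position are high priority, a middle band follows, and the
# rest is low priority. The 3h boundary is an illustrative assumption.

def g2g_sets(playback_pos, num_pieces, h):
    """Partition the not-yet-played pieces into priority sets."""
    remaining = range(playback_pos, num_pieces)
    high = [p for p in remaining if p < playback_pos + h]
    mid = [p for p in remaining if playback_pos + h <= p < playback_pos + 3 * h]
    low = [p for p in remaining if p >= playback_pos + 3 * h]
    return high, mid, low

# High-priority pieces would be requested in order to sustain playback,
# while mid- and low-priority pieces can use rarest-first.
high, mid, low = g2g_sets(playback_pos=120, num_pieces=578, h=10)
```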

Live Streaming in Next-Share

While most implementations use a tree-based or multi-tree-based approach for live streaming, Next-Share's BitTorrent core required a swarm-based approach: each peer connects to a given set of peers and requests pieces from them. Major updates had to be undertaken due to the differences between offline playback (classical distribution) and live streaming: the size of the file is unknown, there is no seeder, there are no hashes for integrity checking, and pieces have to be retrieved in real time. This required careful updates to the piece picking component of the BitTorrent engine and to the selection of peers in the neighbor set. Three types of peers are defined:

Figure 7.4: Seeders in a Live Streaming Setting [50]

• injector – the video source;
• seeder – a peer which is always unchoked by the injector;
• leecher – a peer that is neither an injector nor a seeder.

The classical definition of a seeder is no longer applicable, as data is generated live. Since BitTorrent expects the whole file to be available within a swarm, updates had to be undertaken to enable live streaming; the video is thus assumed to be of "unlimited length". The injector obtains the video from a live source, such as a DV camera, and generates a .tstream file, which is similar to a torrent file but cannot be used by BitTorrent clients lacking our live streaming extensions. An end user (peer) interested in watching the video stream obtains the .tstream file and joins the swarm (the set of peers) as per the BitTorrent protocol.

A seeder is redefined in a live streaming environment as a peer which is always unchoked by the injector and is guaranteed enough bandwidth to obtain the full video stream. The injector has a list of peer identifiers (for example, IP addresses and port numbers) representing trusted peers which are allowed to act as seeders if they connect to the injector. The relationship between the injector, seeders and leechers is shown in Figure 7.4: seeders and leechers use the exact same protocol, but the seeders (bold nodes) are guaranteed to be unchoked by the injector (bold edges). The identity of the seeders is not known to the other peers, to prevent malicious behaviour targeted at them. The injector and seeders behave like a small Content Delivery Network (CDN). Even though the seeders diminish the distributed nature of the BitTorrent algorithm, they may be required if the peers cannot provide each other with enough bandwidth to receive the video stream in real time.
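A sketch of the injector's unchoke decision under this scheme: peers on the trusted list are unconditionally unchoked (becoming "seeders"), while the remaining slots are assigned through the usual unchoke round. Identifiers and slot counts are illustrative.

```python
# A sketch of the injector's unchoke logic for live streaming; trusted
# peers are always unchoked, everyone else competes on contribution.

TRUSTED_SEEDERS = {("192.0.2.10", 6881), ("192.0.2.11", 6881)}  # assumed trusted peer list

def injector_unchoke(peers, regular_unchoke_slots=4):
    """peers: list of (ip, port, upload_contribution) tuples."""
    unchoked = [p for p in peers if (p[0], p[1]) in TRUSTED_SEEDERS]
    others = [p for p in peers if (p[0], p[1]) not in TRUSTED_SEEDERS]
    # Remaining slots go to the best-contributing leechers, as in
    # BitTorrent's standard unchoke round.
    others.sort(key=lambda p: p[2], reverse=True)
    return unchoked + others[:regular_unchoke_slots]
```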

7.3.2 NextSharePC

NextShare is designed to run on the consumer hardware most commonly found nowadays. NextSharePC is the integrated application of the NextShare core and PC components; it is a BitTorrent core coupled with NextShare features and a user interface. NextSharePC is explicitly designed to attract users into the P2P infrastructure and to market the usage and advantages of the NextShare technology.

The basic interface of the NextShare core is the Tribler client [60]. Tribler is a user-friendly interface similar to that of other BitTorrent clients. It allows complete access to the NextShare features, such as uploading and downloading content, creating .torrent files and accessing social networking features. Tribler also features a command line interface which is useful for automatically deploying scenarios.

From a content point of view, NextShare features the SwarmPlayer interface, which is specifically directed towards video playback. The SwarmPlayer is a simple interface that makes use of VLC technology running on top of the NextShare core. It renders video streams identified by a .tstream file and subsequently acts as a BitTorrent client sharing streaming information.

The SwarmPlayer is a standalone application running on Linux and Windows. In order to allow easier and faster access to given content, the SwarmPlugin has been developed: a browser plugin that enables video playback directly inside the browser. Currently, Firefox, Internet Explorer and Safari are supported on the Windows and Mac platforms. The SwarmPlugin is essentially a SwarmPlayer inside a browser, making use of VLC technology, and from an interface point of view it is similar to the SwarmPlayer. Actions such as fast forward, adjusting the sound volume and others have to be enabled through the specialized VLC JavaScript API1.

VLC (VideoLAN Client) has been chosen due to its ubiquity and the fact that it is an open-source project, meaning availability of the source code and a capable community. VLC provides playback implementations for various codecs and could easily be integrated on top of the NextShare core. The core is codec agnostic: it sends and receives pieces of information, leaving the decoding to an external application, embodied by VLC. VLC support has been enabled from version 0.8.6f to version 1.1.5. VLC allowed the development of the web-based plugin using NextShare technology, the SwarmPlugin.

With the advent of HTML5, focus has been put on running video playback inside a browser using this technology. Currently this is enabled on Firefox, on any platform, as it uses HTML5. While the classical VLC SwarmPlugin is codec agnostic, HTML5-based rendering is bound to the use of Ogg Theora.

While the initial NextSharePC client, the SwarmPlayer, was designed to be a standalone video player application, the aim for the second version was to create a plugin for video playback in a Web browser. Allowing playback directly in a browser makes it easier for content providers to integrate Peer-to-Peer based video delivery into their existing Web based distribution mechanisms (e.g. portals). The next version, called the SwarmPlugin, consists of plugins for the Microsoft Internet Explorer and Mozilla Firefox browsers. The plugins support Peer-to-Peer based video-on-demand and live broadcasts, as did the SwarmPlayer.

Content providers can use the plugin on their Web sites. First, they set up the required publishing infrastructure, consisting of a Next-Share tracker server (to enable peers to find each other) and some Next-Share seeder servers (for providing the content initially). Second, as before, they use the Next-Share injection tools to create a .tstream metadata file for each piece of content (containing the tracker addresses, the name and bitrate of the content and information for integrity checking the received content). Third, they upload these .tstream files to a Web server.

The SwarmPlugin itself can be controlled from the browser using JavaScript. This API is called the NSPlugin JavaScript API and is currently a copy of the original VLC plugin's JavaScript API. As such, it provides methods for controlling playback (start, pause, resume), adjusting audio (volume) and video parameters (full screen) and receiving debug messages. In general, the SwarmPlugin architecture is similar to that of the SwarmPlayer, with the Graphical User Interface part of the SwarmPlayer being replaced by a browser plugin and JavaScript controls.

7.3.3 NextShareTV

Apart from being able to run on commodity hardware through the NextSharePC implementation, the P2P-Next consortium delivers the technology on top of consumer electronics. The device that runs the NextShare core is a set-top box named NextShareTV, delivered by Pioneer Digital Design.

The scope of NextShareTV is to develop a consumer device and user applications that aid the discovery, enjoyment, production and proliferation of a broad universe of legitimate digital media. All the aforementioned digital media shall be delivered via a next generation, open source, P2P media delivery network called NextShare. Users of the device shall be able to engage with a social network that shall amplify their ability to discover, enjoy, enhance and share digital media.

1http://wiki.videolan.org/Documentation:WebPlugin

Users of the device shall be able to purchase a broad spectrum of content, from niche or local programming, through semi-professional, to premium. The consumer device shall be able to interoperate with other networked peripherals, such as mass storage, camera, mobile and home PC devices residing on the premises' home network.

NextShareTV satisfies a range of commercial requirements, from content distribution and streaming to personalization and user interface. From a content point of view, NextShareTV offers access to live TV streams and to Video-on-Demand assets using Peer-to-Peer technology. Recommendation and tagging are available, and users are able to become prosumers and broadcast their own content live via a camera connected to the box. Social networking features are currently under development, including social groupings, chatting with friends, discovering what friends are watching and making live recommendations to friends. Special care is given to performance, with the aim that the prebuffering phase (the playback latency) take no more than 2 seconds. Security and content legitimacy are also taken into account, to allow secure/certified and controlled content to be rendered on the STB; security features may also be controlled through usernames and user profiles.

From an architectural point of view, NextShareTV combines the benefits of many worlds: a high quality set-top box, a specialized Linux-based OS provided by ST Microelectronics, the NextShare core and libraries, drivers and modules required for rendering, decoding and data management, and a remote management interface. In order to achieve good running times, a Python runtime was created, the NextShare core being written in the Python programming language.

NextShareTV devices have been deployed in various LivingLab locations within the P2P-Next project for testing, evaluation and Quality of Experience assessment. Different types of users have accessed the devices and provided valuable feedback and information. A remote management interface has been enabled, allowing remote control of devices and the running of various experiments. An internal testing framework has also been developed for stress testing the device and its applications.

7.3.4 LivingLab

The LivingLab is the mechanism by which Next-Share based applications get tested "in the wild", i.e. with real users and in real environments. In the P2P-Next project, LivingLabs are located at several partner sites, such as Lancaster (UK), Tromsø (Norway), Tampere (Finland), Ljubljana (Slovenia) and Bucharest (Romania). LivingLabs provide content to users, advice on how to install and use the NextShare technology, and support. From a user's point of view, a LivingLab is identified by a site/portal that gives access to the available resources; it is backed by an infrastructure and framework that deliver the benefits of the technology.

The UPB LivingLab consists of several commodity hardware systems kindly provided by the NCIT cluster. These systems store the content used for the LivingLab and the applications, scripts and services that power it. All systems are identical, both from a hardware perspective and in terms of software applications and content.

The LivingLab site, described in detail in Section 7.4.1, provides the interface through which users access content using NextShare technology. It possesses a large number of VoD assets from various events, available for playback using the SwarmPlugin; assets are HD content in MP4 and Ogg container formats. The LivingLab publishes useful information for users and provides a forum where questions may be asked and support may be requested.

Various experiments have taken place in the UPB LivingLab, the most recent ones being described in Section 7.4. As the UPB LivingLab is focused on performance management and analysis, experiments were keen on measuring BitTorrent swarm information and gathering relevant data.

An important set of experiments have been the "BitTorrent Distribution Experiments". With the help of the Ubuntu and Fedora communities, several versions of these popular distributions were made available through Peer-to-Peer technology. The UPB infrastructure was prepared with seeders to provide content in various swarms. The site1 was used to present images of BitTorrent swarm evolution, such as the number of peers, number of seeders, download speed and upload speed.

In order to collect information, the chosen approach involved gathering client-side data regarding the client and protocol implementation. BitTorrent clients were instrumented to provide verbose information regarding the BitTorrent protocol implementation; these results are collected and subsequently processed and analysed. The undertaken approach, while not scalable, aims to collect client-centric data, then store and analyse it in order to provide information on the impact of network topology, protocol implementation and peer characteristics. The infrastructure provides micro-analysis, rather than macro-analysis, of a given swarm: focus was given to detailed peer-centric properties, rather than less detailed global, tracker-centric information.

Seeders used the NextShare technology albeit, having to run non-interactively, they made no use of the VLC component; the CLI of the client was used instead. Specific scripts were used to monitor client activity and to restart the client in case of a crash. The typical activity of preparing a swarm was: creating the .torrent files for the content to be made available, distributing content to the other seeders, starting the initial seeder and using BitTorrent to distribute content to the other seeders, starting the monitoring script and making content available to users. In order to process tracker results, a specialized script was used to do real-time parsing of tracker log files, as sketched below.

The current and overall goal of the trials deployed in the UPB LivingLab is comparing Peer-to-Peer streaming performance and functionality to classical distribution. We aim to collect relevant internal information regarding the behavior of the technology, both when used for VoD streaming and for BitTorrent swarm download, and to use that information for signaling points that should be taken into account to improve streaming performance, mostly related to peer download speed.
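The sketch assumes a plain-text log with one announce entry per line; the field layout is an assumption for illustration, not the actual tracker format.

```python
# A sketch of real-time tracker log parsing, tail -f style, assuming
# lines of the form "<timestamp> <ip> <event> <downloaded> <uploaded>
# <left>". The file name and format are illustrative.
import time

def follow(path):
    """Yield lines as they are appended to the log file."""
    with open(path) as f:
        f.seek(0, 2)                 # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)      # wait for new announces
                continue
            yield line

swarm = {}                           # ip -> bytes left, to gauge swarm completion
for line in follow("tracker.log"):   # hypothetical log file name
    ts, ip, event, down, up, left = line.split()[:6]
    swarm[ip] = int(left)
    seeders = sum(1 for l in swarm.values() if l == 0)
    print(f"{ts}: {len(swarm)} peers, {seeders} seeders")
```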

7.4 Evaluation of P2P Streaming Solutions

Within the UPB LivingLab, experiments involving users, HD content and the NextShare streaming solutions have taken place and are currently taking place. The goal is to provide useful information on the functionality of the solution, the user experience and transfer performance. As the goal of the P2P-Next project is to build the next-generation Peer-to-Peer content delivery platform, our aim is to provide useful information regarding the key factors that influence its attractiveness. Questions such as the ones below need to be answered: Would people use this platform instead of YouTube? Would it outperform classical video streaming? How does it compare to classical BitTorrent distribution performance?

The experiments employed started as lab experiments and, subsequently, pushed towards gathering user experience and collecting feedback. Feedback has been used to provide information to the core team, to improve the site providing content, to update content and to spot bugs or inconsistencies. The LivingLab site is currently operational and logs information from connecting clients.

A specialized class of experiments involves gathering users for a short period of time to simultaneously access LivingLab resources and provide feedback. This approach results in a burst of connections and data transfer.

1http://p2p-next.cs.pub.ro/

Figure 7.5: LivingLab Components

Two such experiments have taken place, in December 2010 and April 2011, with more scheduled in the following months.

The UPB LivingLab is the user's window to the Next-Share technology. It provides instructions on using the technology and allows access to the content and facilities. From a user's point of view, the LivingLab is the site/portal/front-end allowing the playback of various video files through the use of P2P technology. In the backend, the LivingLab is supported by systems in the NCIT cluster that store and serve content in the form of Video-on-Demand assets. The components of the LivingLab and their interaction are highlighted in Figure 7.5; there are three main components: the NCIT cluster systems, the LivingLab site and the users.

7.4.1 LivingLab Infrastructure

The NCIT cluster systems are commodity hardware systems that provide content and act as seeders. Content is replicated among the NCIT cluster systems and is seeded through automated NextShare applications. There are currently nine systems available. The same content assets are found on multiple systems, so that there are multiple seeders for each asset. Logging is enabled at seeder level for subsequent processing. Swarm trackers may reside on an NCIT system or outside of it; as tracker communication is reduced, tracker placement is of little relevance. The hardware systems are commodity machines (2 GB RAM, dual-core 3 GHz CPU, 300 GB HDD) with high speed (1 Gbit Ethernet) Internet links. SSH-based access allows easy automation.

Figure 7.6: LivingLab Site

Figure 7.7: LivingLab Forum

LivingLab Site

The LivingLab site1, shown in Figure 7.6, is the front-end users interact with. It provides access to NextShare installation packages, content and support. The site is based on CMS Made Simple2; it has been updated to provide the features necessary for easy content publishing, content selection and organization. It currently provides Video-on-Demand assets (no live streaming). Content is presented to the user in a hierarchical manner, depending on the content category: the user may access a certain category and then select a preferred video asset.

The forum (http://p2p-next.cs.pub.ro/forum/), shown in Figure 7.7, is a Phorum installation (http://www.phorum.org/) providing easy support facilities to users. The forum consists of four main categories:

• Feedback and feature requests: request new features or provide feedback;
• General discussions: questions regarding the LivingLab trial site and general discussions;
• Support: request support;
• Announcements: latest information.

An RSS feed is provided for each category, useful for notification and syncing. There have been 29 posts and 604 views on the forum. The forum has been an easy way of communicating with the WP4 team regarding various user reports.

1http://p2p-next.cs.pub.ro/
2http://www.cmsmadesimple.org/

User feedback is collected through feedback forms that request information about user experience, plugins used and issues encountered. Detailed information about the feedback forms is provided in Section 7.4.3. The feedback form has been updated from a Google form-based one to one that is fully integrated into the site. At the same time, as the experiments evolved, questions were added or discarded: questions related to the user's OS, IP address and other technical information were dropped, as they could be deduced from connection or logging information. The current feedback version is integrated into the site and uses a MySQL backend for storing information. Feedback results and the resulting actions are presented in Section 7.4.3.

A specialized form of user feedback is user log files. During experiments, both on our side and on the WP4 team members' side, we have requested users to provide us with logging information for troubleshooting issues. As such, we have integrated a log upload facility where users can upload the basic logging information provided by NextShare. Log files are typically no greater than 100 KB, so there is little overhead or burden in uploading them.

Users are typically students at the Automatic Control and Computers faculty. Various trials have asked them to participate by accessing the site, using NextShare technology for P2P streaming and giving feedback. The LivingLab site provides links to NextShare installation packages, content and other required information. Users are contacted through e-mail and asked to participate, with links to instructions and trial goals. Communication is thereafter handled either through e-mail or through the forum.

A user has no information about the NCIT infrastructure systems: all requests and replies are handled by the LivingLab site. However, each time the local NextShare application is started, connection and communication are directed towards the seeder systems. As is usual with Peer-to-Peer technology, protocol information is exchanged both through client-seeder communication and through client-client communication; the initial bootstrap phase, however, requires communication with a seeder. Seeder logging is enabled in order to provide information about client connections and client-seeder communication quality.

LivingLab Content

For the LivingLab we use videos of technical presentations, courses, conferences and social events from our university. Most of them were shot with a Sony Full HD camera, specified as 1080i, i.e. 1080 pixels in height, interlaced. Although the resolution is 1440x1080, which corresponds to an aspect ratio of 4:3, the perceived aspect ratio is 16:9 because of the PAR (pixel aspect ratio). The raw movie stream loaded from the camera is packed into an .mts multimedia container, which is based on the MPEG-2 Transport Stream container format. The audio stream is in AC-3 (Audio Coding 3, Dolby Digital) format, which permits the coding of 6 surround channels; its sampling rate of 48 kHz is higher than the CD rate of 44.1 kHz, offering better sound fidelity. The video stream is coded in the H.264/AVC format, a high performance video codec defined in the MPEG-4 and ITU-T standards. The frame rate is 25 fps (frames per second) and the overall bit rate of the raw stream is 3982 kilobits per second, which is quite large; the stream therefore needs compression in order to be distributed over the Internet. Some of the movies need to be cut into pieces and others need to be joined together.

The number of video assets offered today is 480, occupying 61.77 GiB on our hard disks. The video lengths range from a few minutes (social events), through about half an hour (technical presentations), to several hours (movies); the longest asset, rendering a tour of Norway by train, lasts 5 hours and 39 minutes. Occasionally we provide live streams, but currently there is no live stream offered by the UPB LivingLab.

The availability of the videos has been ensured by seeding each video on 5 different systems, so that if a system fails, enough seeders remain alive in the swarm and the movies are still available. This replication also increases download speed when a movie is retrieved by a plugin.

Five of the systems are used for seeding Ogg videos and the other five for MP4 videos, offering load balancing of the hardware resources. If a seeder client crashes, a monitoring system automatically informs the administrator of the problem so that he can restart the seeder. The tracker and all the seeder clients provide logging: the IP address of any user can be obtained, together with statistics of its download speed evolution and connected clients, so that this information can afterwards be joined with the feedback provided by that user.

7.4.2 Evaluation Goals of Streaming Trials in the LivingLab

The current and overall goal of the trials deployed in the LivingLab is comparing Peer-to-Peer streaming performance and functionality to classical distribution. We collect relevant internal information regarding the behavior of the technology, both when used for VoD streaming and for BitTorrent swarm download, and use that information to signal points that should be taken into account to improve streaming performance, mostly related to peer download speed.

We evaluate the streaming performance of the NextShare technology through experimental means: trials are set up in the LivingLab and used to collect logging information, which in turn feeds analysis and advice. Providing advice regarding the performance of the streaming implementation is the prime concern of our evaluation.

We use status information, such as peer download speed, peer upload speed and number of connections, and extensive information (usually provided through verbose logging), such as protocol messages, protocol inner workings and the piece picker algorithm (piece selection), to compare classical (pure BitTorrent-based) data distribution against video streaming (through NextShare technology). We aim to distinguish among the various patterns employed by each type of distribution, measure the variation/penalty in performance and identify weak and strong spots in each of the approaches.

A basic form of data collection is from the swarm tracker. Though this information is pretty scarce (only available with each announce message provided by a peer), it still provides an overall view of swarms, the number of peers and their dynamics. We have been successfully using a tracker monitoring service for classical distribution experiments and we plan to integrate it in near-future experiments. An important place for information storage is the LivingLab statistics facility currently hosted at ULANC.

The main providers of status and extensive information are the peers themselves. Each peer uses log files to store information regarding its behavior and download sessions. As extensive logging results in a high volume of information, it may only be enabled on peers we have control over, such as the seeders residing in the NCIT cluster infrastructure. Verbose logging has to be enabled in a modified version of the NextShare plugin in order to allow retrieval of extensive information. Status information may be provided through uploading local log files or through the LivingLab statistics facility. Extensive logging information may still be collected from the local seeders residing on the NCIT cluster infrastructure; this provides valuable information for subsequent analysis.

As a comparison between classical distribution and video delivery should consider the same context, we aim to persuade our users to also use a classical distribution client (perhaps a non-streaming version of NextShare) within a session. The two sets of data will form the basis for the comparison. We still have to investigate whether this is achievable and how easy it would be, on the seeder side (where verbose logging is enabled), to differentiate between the two kinds of sessions.

7.4.3 Experiments

User-Based Evaluation of NextShare Technology

During December 2010 we undertook a medium-sized experiment for evaluating the NextShare technology and user satisfaction. The LivingLab site was the center of the experiment: it provided the interface for downloading the NextShare plugin, linking video files, publishing the feedback form and providing support through the forum. Version 1.1.0 of the NextShare plugin/SwarmPlugin (stable as of August 2010) was deployed for this trial. It allows the use of VLC-based NextShare technology as a browser plugin on Windows platforms using Internet Explorer or Mozilla Firefox. The lack of a plugin for the Linux platform was a recurring complaint on the feedback form, as users were part of the technical "geek crowd".

The trial goals revolved around the generic objective of providing feedback and bug reports to WP4 and related work packages, measuring user satisfaction and promoting NextShare technology, and around the particular objective of comparing classical P2P/BitTorrent distribution against P2P-based video streaming as provided by NextShare.

The video files deployed presented HD content: 3 to 30 minute Video-on-Demand assets. Most content had been converted from AVCHD to the AVI container using H.264. The choice of the AVI container did pose some problems and it was dropped for subsequent trials. All video files were replicated on all available NCIT cluster systems, and each system started a seeder for every video asset. As such, for each video asset there was a number of seeders equal to the number of cluster systems, allowing a potential overall bandwidth of 9 x 1 Gbit = 9 Gbit.

Users were mostly students at UPB. They were contacted through email, and support and communication were ensured through the forum or private e-mail messages.

The feedback form, accessible from the LivingLab site, consisted of usability inquiries, technical aspects and general user satisfaction. The feedback questions were the ones described below:

• Did you have any problems during the installation?
• Did you successfully install the SwarmPlugin?
• How would you classify the quality of the video playback?
• How would you classify the plugin's interface?
• What OS are you running?
• What browser did you use for testing the SwarmPlugin?
• What are your computer hardware specifications?
• Do you have any comments, remarks or suggestions?
• What was the average download speed?
• What version of the browser are you running?
• What is your external IP network address?
• What videos did you watch?
• In case of problems, what were the issues you encountered?

After the trial and discussions within the WP8 team, some of the questions were eliminated, as information regarding the operating system or browser version could be determined from peer connections to the seeders. The feedback form was implemented as a Google spreadsheet/form.

A total number of 55 users completed the feedback form. A brief summary of the results:

• users had problems installing the plugin and most of them were unable to properly start playback;
• video playback was deemed mediocre; some of this may be due to the problems in installing the plugin;
• the plugin interface was considered usable: content playback could be easily enabled;
• most users used the Firefox version of the plugin;
• users were on high bandwidth connections (> 512 KB/s) or fairly good ones (> 50 KB/s, < 100 KB/s);
• video freezing and the lack of a seek and volume bar were among the most significant remarks.

In order to collect information regarding the problems, an FTP-based upload form was set up to allow users to provide us with their log files. The most important issues were problems during installation and poor video playback (freezing). One of the main conclusions was that a similar experiment would need to provide proper video content (if the problem lies with the video files), a properly working version of the NextShare plugin and a seek bar on the playback interface.

The forum was heavily utilized during this period, as users experienced various problems with installing the plugin and making it work both on Internet Explorer and on Firefox. There were 29 posts and 604 views on the forum, and a total of 17 users created and used accounts. The forum was an easy way of communicating with the WP4 team regarding various user reports.

Living Lab as a Service

The April 2011 experiment focused on testing basic functionality for the comparison between classical distribution and streaming. The aim was to set up all the pieces required for experimentation, data collection and analysis. To accomplish this, data was collected both from classical BitTorrent clients, such as uTorrent and Transmission, and from the streaming functionality. Users were instructed to use both the NextSharePC plugin and a classical distribution BitTorrent client in order to provide the information required for comparison.

VoD assets were distributed over systems in the NCIT cluster: half of the systems were used as seeders for MP4+H.264 files, while the other half were used as seeders for Ogg+Theora files. The seeders have been running continuously ever since. Video files were accessed through the LivingLab site.

Logging was enabled at seeder level. Due to the high volume of information, only status information such as download speed, upload speed and peer connections was collected in log files. A high quantity of data was still collected, amounting to 100 GB of status log information per seeder in two months' time. In order to match IP information from seeder logs to BitTorrent clients, logging was also enabled at tracker level. Tracker log files contain global information regarding swarms and periodic information (typically once every 30 seconds) from the announce messages of clients.

Important updates had been implemented since the December experiment, based on user feedback and updated goals. A VLC API-based playback interaction (such as a progress bar, volume selection, pause and resume) was added.

Stream files (.tstream) were updated to provide the bitrate information required by the NextShare plugin. Two types of VoD assets were created with respect to resolution: for each asset there is an HD variant and an SD variant, the SD variant being designated for high latency systems.

Apart from the VLC-based NextSharePC plugin, the Firefox plugin using HTML5 technology was added to the site. Thus, for each VoD asset, there are two types of files: MP4+H.264 for the VLC-based plugin and Ogg+Theora for the HTML5-based plugin. Users were provided with links to the other version of each file. This allowed using NextShare technology on Linux-based systems.

There were several updates to the site and user interaction, partly as a result of user input or suggestions from inside the P2P-Next WP4 and WP8 teams. The log file upload form was changed from a rather user-unfriendly FTP-based form to one based on HTTP. The feedback form was also updated to discard information that could be obtained otherwise (from browser information, for example, the IP address, operating system and browser; or from BitTorrent client provided information, the average download/upload speed).

As most users had also taken part in the previous experiment, the forum saw no use; several questions and suggestions were delivered through email. As before, users were mostly students at University POLITEHNICA of Bucharest, part of the "geek" crowd and some open source communities. Apart from email, user interaction also occurred through the feedback form, albeit to a lower degree due to the different focus of the experiment: gathering logging information for an initial comparison of classical distribution and streaming.

As mentioned before, the feedback form was minimized, as some of the answers could be obtained from user connections to the LivingLab site or to the tracker/seeders. The feedback form was redesigned and is now backed by a MySQL database, with the whole form integrated into the site (the previous version had used Google Forms). The questions in the feedback form are:

• Did you have any problems during the installation?
• Did you successfully install the SwarmPlugin?
• In case of problems, what were the issues you encountered?
• How would you classify the quality of the video playback?
• How would you classify the plugin's interface?
• What are your computer hardware specifications? (CPU, RAM, HDD)
• Do you have any comments, remarks or suggestions regarding content, accessibility, plugin interface, development or design?

Several backend updates were considered to ensure the reliability and availability of the seeders and the experimental setup. These updates drew on previous experience, such as the December 2010 experiment and the BitTorrent distribution experiments. Startup scripts were created for easily running a seeder: as each seeder is responsible for a large number (close to a hundred) of video asset files, it needs a fast startup path, implemented as a script around the command line version of the NextShare client. The script is integrated as a system startup script so that it runs after a power failure or system restart; it can also be run manually when needed. The seeder process is monitored such that, in case of failure, an email is sent as notification; a failure may be a software failure of the client or a sudden problem of the system it runs on.
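A minimal sketch of such a watchdog follows, assuming the seeder is started through a command-line invocation and that a local mail relay is available; the command and addresses are placeholders, not the actual deployment.

```python
# A sketch of the seeder watchdog: restart the command-line client if
# it dies and email the administrator. All identifiers are assumed.
import smtplib
import subprocess
import time
from email.message import EmailMessage

SEEDER_CMD = ["python", "nextshare_cli.py", "--seed", "/srv/vod"]  # assumed CLI invocation
ADMIN = "admin@example.org"

def notify(reason):
    msg = EmailMessage()
    msg["Subject"] = "LivingLab seeder restarted"
    msg["From"] = "seeder@example.org"
    msg["To"] = ADMIN
    msg.set_content(reason)
    with smtplib.SMTP("localhost") as smtp:    # assumes a local mail relay
        smtp.send_message(msg)

while True:
    proc = subprocess.Popen(SEEDER_CMD)
    code = proc.wait()                         # block until the seeder exits
    notify(f"seeder exited with code {code}, restarting")
    time.sleep(10)                             # back off before restarting
```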
An important update was the removal of the hash checking phase when starting a seeder. As each seeder is responsible for a large number of VoD asset files, it would take a lot of time and processor power to check the consistency of all files on every start. Since a seeder has access to complete files and data inconsistency is infrequent, we chose to remove this step when starting the seeder; instead, we do a periodic check (twice a month) of the VoD files to ensure data consistency.

7.4.4 Issues and Feedback

Results of Experiments

42 users correctly reported their IP addresses. Users found the plugin and the technology fun to use and provided feedback, both from an average user's perspective and from a technical user's perspective. Features that were praised:

• fast download speed;
• easy installation;
• good video quality.

Users reported a variety of issues, most of which were tackled on the LivingLab site or for which support was requested from the NextShare core (WP4) team. Requests, suggestions and issues were:

• the lack of a plugin for Google Chrome;
• the lack of a plugin for the Linux platform;
• the inability to choose an installation directory or a caching directory, which resulted in running out of disk space;
• playback "freezing" or being choppy;
• the lack of a volume or time/progress bar;
• the lack of a full screen button;
• high CPU consumption (as signaled in Task Manager); many page faults were observed through Task Manager for the plugin process;
• content diversity should be improved;
• long buffering time;
• difficult playback from Edinburgh, UK;
• the lack of a reload button;
• providing lower resolution videos for playback problems and for lower download speeds;
• the installer should signal what applications need to be closed in order to enable the plugin;
• missing sound;
• the installer should detect a previous installation and point the user to uninstalling it first.

Several of these issues were signaled frequently; the lack of a volume or time/progress bar was present in almost all suggestions in the feedback form. Technical issues were also signaled on the forum, and help was requested from the WP4 core team. These included:

• the lack of generation of a log file for troubleshooting;
• incorrect permissions (on Windows 7) that prevent users from installing the application, requiring Administrative access rights;

• The lack of "dual" functionality for both Internet Explorer and Firefox. When the plugin is installed for one of the browsers it doesn’t work on the other. The important issues had been tackled after the trial, some of them being described in detail in Section 7.4.4. The most stringent issue deals with playback problems in the NextShare player for a diversity of video files. This was due to multiple reasons. One of them of the absence of a bitrate information in the .torrent/.tstream file; the client was unable to properly render the video. Another cause was the Due to our discovery of problems regarding bitrate variation influencing quality (or functionality) of video playback, we were considering the possibility of providing the end user with multiple versions of the same video file: different resolution, different bitrate, different codec, different container. That would imply some disparate statistics but may provide the user with better quality of experience when using a "less demanding" video file. More on this is discussed in Section 7.4.4. The lack of a Linux plugin has been addressed by integrating the HTML5-based plugin. The April 2011 trial had included dual video formats that are playable through the use of the classic VLC- based plugin and the HTML5-based plugin. Interface requests, the most demanding of the user’s replies, had been addresses through the use of VLC Javascript API. This enabled the addition of a progress bar, volume bar, time information, full screen button. More information is presented in Section 7.4.4. The focus of the April experiments was testing the experimental setup and premises for compar- ison between classical BitTorrent distribution and Peer-to-Peer streaming. As such, users were requested to use the NextSharePC plugin (either VLC-based or HTML5-based) and a classical BitTorrent client such as Transmission or uTorrent. There was reduced focus on getting feedback information. Due to users’ previous experience with the NextShare technology there were few questions regard- ing installation, playback and other features. As such the forum wasn’t used and the few reported issues were either mentioned on the feedback form or through e-mail. As mentioned, users were requested to use dual clients to provide sample information from clas- sical BitTorrent distribution clients and from the NextSharePC plugin as a representative of P2P streaming solutions. Download and upload rate for the NextSharePC is limited to 1MB/s for perfor- mance consideration such that preliminary results have little relevance other than providing sample information and testing the existing infrastructure. Sample results were: • “classical” clients used (no recommendation was given to users) were: uTorrent (16), Trans- mission (9) and BitTorrent mainline (2) • NextSharePC download speed varied between 300KB/s and 800KB/s • Transmission download speed: 500KB/s-1500KB/s • uTorrent download speed: 1000KB/s-4000KB/s Currently seeder logs only consider status information regarding peer and swarm evolution but no extensive/verbose information regarding peer behavior or protocol information. The addition of all that important information will need carefully crafted parser such that it would reduce the occupied disk space and post-experiment collected information parsing overhead. Figure 7.8 represents the download speed by client types during a wider window of the April 2011 trial. 
Figure 7.9 uses a smaller window, centered around a time interval with a high number of client connections (between 8:00 PM and 11:00 PM on April 13, 2011). Each client type uses a specific color in the figure. As can be noticed, the NextSharePC client is limited to 1 MB/s, while Transmission and uTorrent usually perform better. As users were requested to use NextSharePC and another client, the NextSharePC client is responsible for a large chunk of the plot.

Figure 7.8: April Trial – Download Speed Evolution by Client Type

A recurrent issue reported by users (which we had also signaled) is the poor and non-deterministic behavior of the HTML5-based plugin. This seems to be related to the rendering done by the HTML5 engine in Firefox 5, as the VLC-based plugin performs as expected. Further investigation and measurements are required.

An issue we discovered, whose context of appearance is still unclear, is the disk space consumption of the root partition when running the command line NextShare client. After a certain period of time, more than 15 GB of disk space is occupied by temporary files in the /tmp/ partition. However, after removing the files, hard disk space is still reported as occupied; we currently suspect that certain buffers within the kernel or the filesystem cache are not released properly. The problem is solved, albeit crudely, by restarting the NextShare client. A proper investigation and solution for this issue are under way.

Updates as Feedback from Results

The VLC-based plugin does not provide any user interface: it only displays a rectangle where the video is rendered. Several P2P-Next trial sites extend the plugin's interface by adding Play, Pause and Stop buttons. The UPB LivingLab used to have the same interface for NextSharePC, but users complained about this interface in their feedback, demanding several facilities. Based on their feedback, we extended the NextSharePC plugin with a JavaScript interface which uses jQuery. The new interface includes:

1. a progress bar, which shows the current position in the video and can be used to seek to a different position;
2. the current time of the video in minutes and seconds (MM:SS format), as well as the total time (in the same format);
3. a Fullscreen button;
4. the old Play, Pause and Stop buttons.

All the videos have been converted into two qualities:

Figure 7.9: April Trial – Download Speed Evolution by Client Type (8:00PM-11:00PM, April 13, 2011)

• SD (Standard Definition), with a resolution of 800x600 and a bitrate of 700 kb/s;
• HD (High Definition), with a 960x720 resolution and a 1400 kb/s bitrate.

Each video in each quality also needed to be converted using two different containers. The SwarmPlugin which uses HTML5 requires an Ogg container with the Theora video codec and the Vorbis audio codec, for compatibility with all browsers that support HTML5 (Mozilla Firefox and Internet Explorer). For the NextSharePC plugin, which is based on VLC, almost any common container and codec can be used; that is why we chose the H.264/AVC video codec, which provides high quality at still low bitrates, and an MP3 audio codec, packed in an MP4 container. Thus each video required encoding into four different versions; a transcoding sketch is given after the list below.

The decision to use the four versions described above was made after experimenting with many other versions, varying the:

• containers: AVI, MP4, Ogg, WebM;
• video codecs: H.264/AVC, Theora, VP8;
• audio codecs: MP3, AAC, Vorbis;
• resolutions: 800x600, 960x720, 1440x1080;
• bit rates: 525 kb/s, 700 kb/s, 875 kb/s, 1050 kb/s, 1400 kb/s;
• frame rates: 25 fps, 30 fps, 50 fps.
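The sketch below drives ffmpeg from Python to produce the four variants per asset. The flags follow current ffmpeg option names, which may differ from the tool versions used at the time, and the file names are illustrative.

```python
# A sketch of the four-variant encode (SD/HD x MP4/Ogg) using ffmpeg.
# Encoder names (libx264, libtheora, ...) and paths are assumptions.
import subprocess

QUALITIES = {"sd": ("800x600", "700k"), "hd": ("960x720", "1400k")}
CONTAINERS = {
    "mp4": ["-c:v", "libx264", "-c:a", "libmp3lame"],   # for the VLC-based plugin
    "ogg": ["-c:v", "libtheora", "-c:a", "libvorbis"],  # for the HTML5-based plugin
}

def encode(src):
    for quality, (size, vbitrate) in QUALITIES.items():
        for ext, codecs in CONTAINERS.items():
            out = f"{src.rsplit('.', 1)[0]}_{quality}.{ext}"
            subprocess.run(
                ["ffmpeg", "-i", src, "-s", size, "-b:v", vbitrate,
                 "-r", "25", *codecs, out],
                check=True)  # raise if a conversion fails

encode("lecture.mts")   # hypothetical source asset from the camera
```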

7.5 Conclusion

As a large part of today's content is video, Peer-to-Peer systems have sought out solutions for integrating streaming support. In this chapter, as a proof of concept of integrating streaming algorithms into BitTorrent applications, we developed extra features for the popular libtorrent-rasterbar library. We added two streaming algorithms that alter the classical rarest-piece-first policy and instead focus on delivering pieces of content sequentially.

An important solution for streaming has been developed by the P2P-Next consortium1 in the form of the NextShare technology. NextShare provides extensive features for delivering the next generation P2P media distribution framework: it provides support for VoD and live streaming, supports VLC and HTML5 based browser plugins, runs on PCs and STBs, and is constantly evaluated within the consortium's LivingLabs. NextShare uses Give-to-Get [51], an update to BitTorrent's unchoking policy with good performance for VoD. For live streaming, NextShare creates mesh-based (swarm-based) overlays where several nodes are dubbed "seeders" and are always unchoked by the source node [50].

The NextShare technology and its features are constantly evaluated in the P2P-Next consortium LivingLabs2. One such LivingLab, located at UPB, provides more than 60 GB of VoD assets that may be accessed by users through the technology in the form of browser plugin playback. Feedback is provided back to the NextShare core team or to the LivingLab as a whole, in order to improve attractiveness and quality of experience. Feedback is collected from users, in the form of answers to feedback forms, or from log files and messages [20].

As direct partners in core work packages of the P2P-Next project, we have contributed to the development and evaluation of the NextShare technology. Our work has contributed to improving the streaming protocol in use and to highlighting bugs and issues in the implementation. We have provided experimental evaluation of the technology and compared it against classical distribution using BitTorrent implementations. The overall aim of the LivingLab is to evaluate the possibility for the NextShare platform to be deployed and used in real environments. As with any Peer-to-Peer solution, both technical and legal challenges have to be addressed, and LivingLabs provide the framework in which real experiments take place.

The LivingLab at UPB relies on students, technical people and open-source enthusiasts to participate in various surveys and provide feedback on the technology. We are particularly interested in performance related aspects of the platform, using measures such as peer download speed, peer upload speed and swarm speed. We provide valuable insight regarding the behavior of the updated core and advice on improving performance and, thus, the quality of experience for users.

1 http://www.p2p-next.org/
2 http://livinglab.eu/

Chapter 8

Conclusion

In the ever evolving context of Internet technologies in general, and Peer-to-Peer technologies in particular, this work presented approaches to measuring protocol parameters and providing improvements, with a particular focus on performance. Considering the past decade's evolution of Peer-to-Peer technology, we have used the lessons learned in recent years to augment existing protocols or to design and implement new ones, while keeping a firm grip on running realistic trials and on collecting and interpreting protocol parameters.

Our measurement analysis has been centered at the protocol level, with focus on peer protocols. We did not aim to provide an overall swarm view, but rather a client-centric analysis. Information provided by clients has been the subject of analysis: protocol parameters such as download rate, upload rate, peer connections and protocol events are at its center. An "individualistic" client-centric approach was preferred over an overall swarm-centric one.

Recognizing the immense impact of the BitTorrent protocol, most of the measurements and improvements presented have made use of BitTorrent architecture and specifics. BitTorrent protocol message analysis, protocol parameters, streaming updates, tracker overlays and automated client actions have formed the primary set of activities in this work.

The approach used in this work is highlighted by the chronology of the chapters. We followed a deploy, run, analyze, evaluate, improve approach concerning Peer-to-Peer clients and protocol messages. An infrastructure is deployed, forming the basis for nodes running within a swarm. Automation is employed to easily configure and run various implementations on the infrastructure. Information is gathered and interpreted as protocol parameters, which are subsequently subject to evaluation. With the evaluation results in mind, improvements are implemented and proposed.

The infrastructure that has been used as the basis for trials and experimentation relies on virtualization, for efficiency and easy configuration. Various swarms and topologies have been deployed and run on top of it. The client-centric approach used in this work required data provided by peers; this data has been processed into protocol parameters which, in turn, have been stored and then used for analysis, evaluation and comparison. Various approaches to data collection and processing have been considered and employed.

Considering the recent focus on integrating streaming into P2P systems, measurements, analysis and evaluation of streaming technologies have also been presented. The local LivingLab, as part of the P2P-Next project, has been the testing ground for live trials involving users and streaming technologies. As the NextShare technology provides good support for both video-on-demand and live streaming, it has been chosen as the benchmark for our evaluation.

Improvements to Peer-to-Peer systems have been provided through the design and implementation of two new protocols.


The first is a tracker overlay protocol on top of an existing BitTorrent swarm. The second is the design of a multiparty protocol at the Transport level in the Linux kernel networking stack, relying heavily on the previous swift implementation. These updates provide extended features or improved performance to Peer-to-Peer systems.

The greater part of this work has been carried out within the P2P-Next project. The author is thankful and appreciative towards the P2P-Next team members, who have worked tirelessly and enthusiastically towards providing the next generation Peer-to-Peer content delivery platform.

8.1 Summary

This thesis presented work concerning measurements and improvements at the protocol level in Peer-to-Peer systems, with particular focus on the BitTorrent protocol. Various approaches, techniques and updates are shown throughout the chapters, generally keeping concepts and ideas in order, both chronologically and logically. The first chapters are concerned with measurements and parameter evaluation, while the last chapters focus on actual improvements.

The thesis objective, scope and context are presented in Chapter 1. The chapter is an overview of the work described in the rest of the thesis and highlights the most important keywords and the primary focus.

Chapter 2 is a state of the art regarding the evolution and current challenges of Peer-to-Peer systems. The Peer-to-Peer paradigm is presented, along with possible P2P topologies and implementations. Focus is given to the BitTorrent protocol and to the streaming capabilities of P2P systems, paving the way for the contributions highlighted in later chapters.

At the basis of every work dealing with real-world systems lies an environment, or set of environments, used as testing ground. The virtualized network infrastructure deployed for various trials is described in Chapter 3. The advantages and benefits of virtualization are presented, with focus on OpenVZ, the solution chosen as the basis for the infrastructure implementation. Other tools and approaches are also presented, with a general focus on efficiency, automation and ease of deployment.

Chapter 4 introduces the approaches for collecting Peer-to-Peer measurement information, the parameters in use, and approaches to parameter analysis. The two approaches used (hooking into client source code and collecting logs) are presented in the form of a generic logging library and of client instrumentation or configuration for providing log files. Parameters are stored in an easy-to-access database which is then queried by a rendering engine.

One of the improvements highlighted in this work is the development of a tracker overlay protocol, dubbed the Tracker Swarm Unification Protocol (TSUP), detailed in Chapter 5. The new protocol is used between trackers such that swarms centered around the same .torrent file, but coordinated by different trackers, are unified. Unification means that peers in the initially separate swarms are able to communicate with each other and transfer data. All updates and extra communication are tracker-centric; no modifications are required to client code.

Chapter 6 presents a novel design for a multiparty protocol at kernel level, heavily based on the swift protocol. The aim is to provide a true multiparty protocol running at the Transport level; the Linux kernel networking stack is the natural place for such an operation. An intermediate approach, relieving the burden of kernel programming while providing the same programming interface, is a raw socket-based implementation.

Trials and evaluations centered around the local LivingLab and content distribution using Peer-to-Peer technology are presented in Chapter 7. The chapter presents a contribution to the libtorrent-rasterbar engine for providing streaming, and then focuses on the inner workings and features of the NextShare technology used within the P2P-Next project. We present the work and results involved in deploying live trials in the LivingLab, together with evaluation results.

The final chapter (Conclusion) presents concluding remarks and highlights the contributions of this work.

8.2 Contributions

The work described in this thesis provided significant contributions to Internet technology and Peer-to-Peer technology. The research and engineering activities described provide updates to measurement and evaluation methods and to existing protocols used in Peer-to-Peer systems. With focus on client-centric behavior and performance evaluation, the contributions highlighted below amount to valuable progress in analysing and improving Peer-to-Peer protocols and applications.

Contributions of this thesis range from infrastructure building and automation to protocol designs and implementations and streaming experiments. The main contributions are:

• development of a fully automated and scalable virtualized infrastructure for Peer-to-Peer experiments;
• proposal of metrics for the evaluation of virtualization solutions;
• design, development and incorporation of a BitTorrent logging library;
• design and development of an automated framework for extracting, processing and analysing protocol data for BitTorrent implementations;
• proposal of a formal context for the evaluation of BitTorrent clients based on measured protocol parameters;
• design and implementation of a novel protocol for swarm unification in BitTorrent environments;
• design and implementation of a multiparty protocol in the Linux kernel;
• evaluation of Peer-to-Peer streaming technologies through user-centric experiments.

The creation of a fully automated and scalable infrastructure stands at the basis of extensive trials and experiments involving Peer-to-Peer systems. The infrastructure is able to integrate a variety of Peer-to-Peer clients and offer an in-depth view of their behavior and performance parameters.

The use of virtualization technology in the context of Peer-to-Peer systems offers an original approach towards rapid deployment of experimental setups and trials. Virtualization offers the benefit of deploying a medium-sized swarm (consisting of several hundred nodes) on top of a much smaller number (as low as 10) of modest commodity hardware systems. At the same time it provides an easy-to-use interface ensuring extensibility, configurability and automation.

We have proven the scalable use of OpenVZ for providing a realistic infrastructure. The use of OpenVZ, a lightweight virtualization solution, meant we could easily deploy a high number of virtualized systems and, thus, Peer-to-Peer nodes. Operating system-level virtualization solutions are to be taken seriously in any situation where scalability and realism are of importance.

The advantages and disadvantages of virtualization solutions have been explored and formalized as a set of virtualization metrics. These metrics mainly provide a comparative view of different virtualization solutions. While a complete numerical formalism is unlikely to be achievable, we do provide insight into how virtualization solutions compare against each other and into the major factors that influence their performance.

We undertook a set of approaches to simulating connection dropouts in BitTorrent environments, such as stopping/suspending the process, terminating the process, and filtering the connection through the use of a firewall. We drew several conclusions on the matter, presenting the advantages and disadvantages of each of the analysed approaches.
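Firewall-based filtering is the easiest of these approaches to script and to reverse. The snippet below is a minimal sketch of the idea, assuming a Linux host with iptables and a BitTorrent peer port of 6881; it is illustrative rather than the exact tooling used in our experiments.

    import subprocess

    BT_PORT = "6881"  # assumed BitTorrent peer port

    def drop_peer_traffic(peer_ip):
        """Simulate a connection dropout by silently discarding all TCP
        traffic towards a peer, without stopping or killing any process."""
        subprocess.check_call(["iptables", "-A", "OUTPUT", "-d", peer_ip,
                               "-p", "tcp", "--dport", BT_PORT, "-j", "DROP"])

    def restore_peer_traffic(peer_ip):
        """Remove the filtering rule, letting the connection recover."""
        subprocess.check_call(["iptables", "-D", "OUTPUT", "-d", peer_ip,
                               "-p", "tcp", "--dport", BT_PORT, "-j", "DROP"])

Unlike stopping or terminating the process, the firewall approach leaves the client running, so its reaction to the now silent connection can itself be observed.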

In the process of providing realistic and relevant trials with existing Peer-to-Peer clients, updates and patches have been applied to existing BitTorrent clients. Most of these have been applied to the clients with the heaviest use in our trials, namely Tribler/NextShare and hrktorrent/libtorrent-rasterbar. Updates have been sent to developers and integrated upstream. Instrumentation has also made use of hidden or non-obvious interfaces of existing clients, mostly in the process of gathering logging information.

We isolated and defined protocol message types and protocol parameters specific to Peer-to-Peer systems. We defined two types of messages, status messages and verbose messages; these messages carry valuable information that is translated by a logging engine into protocol parameters such as download rate, upload rate, number of peer connections, protocol events and others.

A generic logging library has been implemented to provide an API that may be hooked into existing clients (a hedged sketch of such a hook is given at the end of this section). The library provides an API for both status and verbose messages. It has been integrated and tested against the Transmission and rtorrent clients. It offers configurability of the API in use and of the type of logging information provided (either plain text or XML format).

Client-centric logging updates and configuration have formed the basis of the most heavily used approach to collecting protocol information from running clients. The logging facility in place, used in conjunction with the virtualized infrastructure, instruments and/or configures BitTorrent clients to provide extensive logging information. The information is collected and subjected to analysis.

A protocol parameter parsing and analysis engine has been developed to provide an in-depth look at the information provided by client log files. It consists of several components: parsers, storage engines and rendering engines. Parsers may be post-processing parsers or real-time parsers, the latter being able to monitor clients and swarms. Storage engines, typically databases, are ideal for providing an easy-to-access, low-overhead interface to protocol parameters. The rendering engine provides a GUI and a graphics-centered view of protocol information.

Within our protocol measurement activities, we defined a formal context for the performance evaluation of Peer-to-Peer protocols. It takes into account parameters extracted from logging messages and builds a model for analysis and interpretation. The formalism is centered around performance, mainly transfer speed and, thus, transfer time.

During our investigation of Peer-to-Peer streaming technologies, we provided streaming support to the popular libtorrent implementation. Several streaming algorithms have been put into place; the piece picker algorithm used by libtorrent had to be significantly updated in order to provide a realistic streaming experience.

Within the context of the P2P-Next project, the deployment and maintenance of the local LivingLab was essential for providing live user-centric experiments regarding Peer-to-Peer technology. The LivingLab provides access to a plethora of video asset files delivered through Peer-to-Peer technology, in particular NextShare, in the form of browser plugins that are easy for users to deploy and use. Various experiments have taken place, involving users and allowing them to provide feedback on the current technology, together with suggestions.
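A hedged sketch of what such a logging hook may look like is given below; the class name, message fields and output handling are hypothetical, chosen only to illustrate the status/verbose split and the plain-text/XML configurability described above.

    import sys
    import time

    class P2PLogger:
        """Illustrative logging hook in the spirit of the library described
        above; the API and field names are assumptions."""

        def __init__(self, out, fmt="text"):
            self.out = out  # file-like object (log file, pipe, stdout)
            self.fmt = fmt  # "text" or "xml"

        def _emit(self, kind, **fields):
            stamp = time.time()
            if self.fmt == "xml":
                attrs = " ".join('%s="%s"' % kv for kv in fields.items())
                self.out.write('<%s time="%f" %s/>\n' % (kind, stamp, attrs))
            else:
                pairs = " ".join("%s=%s" % kv for kv in fields.items())
                self.out.write("%f %s %s\n" % (stamp, kind.upper(), pairs))

        def status(self, download_rate, upload_rate, num_peers):
            # Periodic status message: transfer rates and peer connections.
            self._emit("status", down=download_rate, up=upload_rate, peers=num_peers)

        def verbose(self, event, peer):
            # Verbose message: an individual protocol event (choke, have, ...).
            self._emit("verbose", event=event, peer=peer)

    log = P2PLogger(sys.stdout)
    log.status(512.3, 128.7, 23)
    log.verbose("unchoke", "10.0.0.5:6881")

An instrumented client would typically call status() once per second and verbose() on every protocol event, leaving the parsing engine to turn the resulting stream into protocol parameters.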
A significant part of the LivingLab activities was, and still is, dedicated to the analysis and monitoring of BitTorrent streaming. We created the test ground for streaming trials involving users and Peer-to-Peer streaming technology. This has provided preliminary results in analysing BitTorrent-based streaming and a comparison between classical BitTorrent distribution and streaming.

A novel overlay network protocol on top of BitTorrent, aiming at integrating peers in different swarms, has been presented. Dubbed TSUP (Tracker Swarm Unification Protocol), the protocol is used for creating and maintaining a tracker network that enables peers in separate swarms to converge in a single swarm. Each initial swarm is controlled by a different tracker; trackers use the overlay protocol to communicate with each other and, thus, take part in a greater swarm. The implementation has been tested against the XBT Tracker implementation.

The TSUP protocol has been subject to experimentation within the XBT tracker. These experiments have used a variety of topologies for testing and evaluating the protocol, from simple one-to-one tracker networks to complex networks involving multiple trackers in dedicated tracker networks. The experiments showed the reduced overhead incurred by the protocol and the overall good behavior of the unified swarm.

We have proposed and designed an optimization of the current swift protocol. Integrating it into kernel space as a multiparty transport protocol that is solely responsible for getting the bits moving improves overall protocol performance. It ensures maximum efficiency of data transfer by decreasing switches between user space and kernel space and eliminating some of the performance penalties due to context switches. Currently in draft phase, there is an ongoing standardisation effort to provide swift and, thus, the current multiparty kernel implementation, as an IETF approved standard.

In order to provide a rapid development environment, we created a user-space implementation and test suite for the multiparty protocol; a sketch of the raw socket approach follows the code metrics list below. Considering the harsh kernel development environment, this implementation allows rapid design, implementation and testing. After features are tested against the user-space implementation, they are "ported" to the kernel-space implementation.

Most of the code required for this work has made use of Python, the C programming language, and shell scripting. Code metrics for the languages and tools that have been used throughout the project are listed below:

• Python: more than 20,000 lines of code
• C/C++: around 5,000 lines of code
• shell scripting and related (sed, awk): around 5,000 lines of code
• R scripts for processing: more than 500 lines of code
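As an illustration of the user-space raw socket approach, the sketch below opens a raw IP socket carrying a multiparty payload. The protocol number and function names are assumptions, since no protocol number is assigned to swift, and framing, integrity checking and congestion control are left out entirely.

    import socket

    # Hypothetical protocol number; 253 is IANA-reserved for
    # experimentation (RFC 3692), not an assigned swift value.
    IPPROTO_SWIFT = 253

    def open_multiparty_socket():
        """Open a raw IP socket for the experimental multiparty payload.
        Requires root; the kernel builds the IP header on our behalf."""
        return socket.socket(socket.AF_INET, socket.SOCK_RAW, IPPROTO_SWIFT)

    def send_chunk(sock, peer_ip, payload):
        # Raw sockets address hosts, not ports; the port field is unused.
        sock.sendto(payload, (peer_ip, 0))

Once a feature behaves correctly on top of such a user-space prototype, the equivalent kernel-space code path can be exercised with the same test suite.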

8.3 Future Work

As is the case with improvements, updates and extensions, no work could ever be complete. Continuous protocol evolution, the ever expanding Internet and new user requests demand further updates. At the same time, current approaches must be pushed towards a larger audience, high quality formal evaluations and increased reliability.

The multiparty protocol design and implementation in the Linux kernel has to be completed, tested and evaluated. The current goal is to push for mainline integration, in connection with the ongoing standardisation effort. The implementation presented, together with swift, is part of a draft that is to be submitted to the IETF for standardisation as a PPSP (Peer-to-Peer Streaming Protocol). An efficient implementation, backed by a standard, would simplify the work of providing the current implementation as an important contribution to the Linux kernel.

The current infrastructure (both from the virtualization point of view and from its measurement capabilities) needs enhancing. This means the addition of further clients, such as the powerful and popular uTorrent and swift, and the addition of other virtualization solutions (such as KVM, LXC, Xen). Scripts should be packaged to provide an easily installable base on top of other hardware configurations.

A draft model for virtualization adequacy has been presented at the end of Chapter 3, proposing metrics such as isolation, efficiency and performance. This model has not yet been applied to any virtualization solution and forms one of the desired directions for further work. Adding new virtualization solutions to the infrastructure would provide the experimental testing ground for properly evaluating and verifying the model.

The formal evaluation of Peer-to-Peer parameters, as highlighted in the final part of Chapter 4, should be improved. Currently the defined model provides little insight into the bottlenecks and problematic points on which the experimenter should focus. Enhancing the evaluation means providing the means through which the experimenter may detect anomalies or performance loss and also understand the factors influencing them, factors which should then be tuned.

On par with the formal evaluation of Peer-to-Peer parameters is the comparison between Peer-to-Peer streaming and classical distribution mentioned in Chapter 7. The focus of the LivingLab is to provide a sound comparison between the two approaches and, based on that, advice regarding the improvement of streaming extensions. Preliminary trials have only taken into account status message parameters from seeders in swarms; complete trials should take all available parameters into account and build the analysis on top of them.

The exciting and ever-evolving context of Peer-to-Peer systems offers a plethora of possibilities for further work on measuring and improving protocols. While primarily aiming to enhance current efforts, new research directions and challenges may be paved.

8.4 Publications and Talks

8.4.1 Papers

• Mircea Bardac, George Milescu, and Răzvan Deaconescu. Monitoring a BitTorrent Tracker for Peer-to-Peer System Analysis. In Intelligent Distributed Computing, pages 203–208, 2009 (ISI indexed)

• Călin-Andrei Burloiu, Răzvan Deaconescu, and Nicolae Țapuș. Design and Implementation of a BitTorrent Tracker Overlay for Swarm Unification. In International Conference on Network Services, 2011 (ISI indexed)

• Răzvan Deaconescu, George Milescu, Bogdan Aurelian, Răzvan Rughiniș, and Nicolae Țapuș. A Virtualized Infrastructure for Automated BitTorrent Performance Testing and Evaluation. International Journal on Advances in Systems and Measurements, 2(2&3):236–247, 2009

• Răzvan Deaconescu, George Milescu, and Nicolae Țapuș. Simulating Connection Dropouts in BitTorrent Environments. In EUROCON – International Conference on Computer as a Tool, IEEE, pages 1–4, 2011 (ISI indexed)

• Răzvan Deaconescu, Răzvan Rughiniș, and Nicolae Țapuș. A BitTorrent Performance Evaluation Framework. In Proceedings of the Fifth International Conference on Networking and Services, 2009. Best Paper Award (ISI indexed)

• Răzvan Deaconescu, Răzvan Rughiniș, and Nicolae Țapuș. A Virtualized Testing Environment for BitTorrent Applications. In Proceedings of CSCS'17, 2009

• Răzvan Deaconescu, Marius Sandu-Popa, Adriana Drăghici, and Nicolae Țapuș. Using Enhanced Logging for BitTorrent Swarm Analysis. In Proceedings of the 9th RoEduNet IEEE International Conference, Sibiu, 2010 (ISI indexed)

• Răzvan Deaconescu, Marius Sandu-Popa, Adriana Drăghici, and Nicolae Țapuș. BitTorrent Swarm Analysis through Automation and Enhanced Logging. International Journal of Computer Networks & Communications, 3(1):53–65, 2011

• Andreea Leța, Răzvan Deaconescu, and Răzvan Rughiniș. Extending Packet Altering Capacities in Simulated Large Networks. In Proceedings of the 17th International Conference on Control Systems and Computer Science (CSCS17), Bucharest, 2009

• Marius Sandu-Popa, Adriana Drăghici, Răzvan Deaconescu, and Nicolae Țapuș. A Peer-to-Peer Swarm Creation and Management Framework. In Proceedings of the 1st Workshop on Software Services: Frameworks and Platforms, Timișoara, Romania, 2010 (ISI indexed)

• George Milescu, Răzvan Deaconescu, and Nicolae Țapuș. Versatile Configuration and Deployment of Realistic Peer-to-Peer Scenarios. In International Conference on Network Services, 2011 (ISI indexed)

• Răzvan Rughiniș and Răzvan Deaconescu. Analysis of a QoS-based Traffic Engineering Solution in GMPLS Grid Networks. In 17th International Conference on Control Systems and Computer Science, Bucharest, 2009

• Răzvan Rughiniș and Răzvan Deaconescu. Optimization Strategies in MPLS Traffic Engineering. UPB Scientific Bulletin, Series C, 1/2009, ISSN 1454-234x, pp. 91–102

• Răzvan Rughiniș and Răzvan Deaconescu. Methods of Adjusting MPLS Network Policies. UPB Scientific Bulletin, Series C, 3/2009, ISSN 1454-234x, pp. 121–132

• Mircea Bardac, Răzvan Deaconescu and Adina Magda Florea. Scaling Peer-to-Peer Testing using Linux Containers. In 9th RoEduNet IEEE International Conference, June 24–26, 2010, Sibiu, Romania, pp. 287–292, ISSN 2068-1038, ISBN 978-1-4244-7335-9 (ISI indexed)

• Laura Gheorghe, Răzvan Rughiniș, Răzvan Deaconescu, and Nicolae Țapuș. Authentication and Anti-replay Security Protocol for Wireless Sensor Networks. In The Fifth International Conference on Systems and Networks Communications, 2010 (ISI indexed)

• Laura Gheorghe, Răzvan Rughiniș, Răzvan Deaconescu and Nicolae Țapuș. Reliable Authentication and Anti-replay Security Protocol for Wireless Sensor Networks. In The Second International Conferences on Advanced Service Computing, 2010 (ISI indexed)

• Laura Gheorghe, Răzvan Rughiniș, Răzvan Deaconescu and Nicolae Țapuș. Adaptive Trust Management Protocol Based on Fault Detection for Wireless Sensor Networks. In The Second International Conferences on Advanced Service Computing, 2010 (ISI indexed)

8.4.2 Books

• Răzvan Rughiniș, Răzvan Deaconescu, George Milescu and Mircea Bardac. Introducere în sisteme de operare. Editura Printech, Bucharest, 2009. ISBN 978-606-521-386-9

• Răzvan Rughiniș, Răzvan Deaconescu, Andrei Ciorba and Bogdan Doinea. Rețele locale. Editura Printech, Bucharest, 2008. ISBN 978-606-521-092-9

8.4.3 Workshops

• Performance of P2P Implementations. P2P'08 Workshop, Aachen, September 2008

• Peer-to-Peer Systems: Evolution and Challenges. Ixia HiTech Presentations, Bucharest, April 2011

Bibliography

[1] B. Ford, P. Srisuresh, and D. Kegel. Peer-to-Peer Communication Across Network Address Translators. http://www.brynosaurus.com/pub/net/p2pnat/.
[2] Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Looking Up Data in P2P Systems. Commun. ACM, 46:43–48, February 2003.
[3] Mircea Bardac, Răzvan Deaconescu, and Adina Magda Florea. Scaling Peer-to-Peer Testing using Linux Containers. In Proceedings of the 9th RoEduNet IEEE International Conference, pages 287–292, 2010.
[4] Mircea Bardac, George Milescu, and Răzvan Deaconescu. Monitoring a BitTorrent Tracker for Peer-to-Peer System Analysis. In Intelligent Distributed Computing, pages 203–208, 2009.
[5] Riccardo Bernardini, Roberto Cesco Fabbro, and Roberto Rinaldo. Peer-to-peer streaming based on network coding decreases packet jitter. In Proceedings of the 2010 ACM workshop on Advanced video streaming techniques for peer-to-peer networks and social networking, AVSTP2P '10, pages 13–18, New York, NY, USA, 2010. ACM.
[6] A. R. Bharambe, C. Herley, and V. N. Padmanabhan. Analyzing and Improving a BitTorrent Network's Performance Mechanisms. In 25th IEEE International Conference on Computer Communications (INFOCOM 2006), pages 1–12, April 2006.
[7] Andreas Binzenhöfer and Kenji Leibnitz. Estimating Churn in Structured P2P Networks. In Proceedings of the 20th international teletraffic conference on Managing traffic performance in converged networks, ITC20'07, pages 630–641, Berlin, 2007. Springer-Verlag.
[8] Lorenzo Bracciale, Francesca Lo Piccolo, Stefano Salsano, and Dario Luzzi. Simulation of Peer-to-Peer Streaming over Large-Scale Networks using OPSS. In ValueTools '07: Proceedings of the 2nd international conference on Performance evaluation methodologies and tools, pages 1–10, Brussels, Belgium, 2007. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering).
[9] Stanislav Bratanov, Roman Belenov, and Nikita Manovich. Virtual machines: a whole new world for performance analysis. SIGOPS Oper. Syst. Rev., 43:46–55, April 2009.

[10] Călin-Andrei Burloiu, Răzvan Deaconescu, and Nicolae Țapuș. Design and Implementation of a BitTorrent Tracker Overlay for Swarm Unification. In International Conference on Network Services, 2011.
[11] Bram Cohen. Incentives Build Robustness in BitTorrent. http://www.bittorrent.org/bittorrentecon.pdf.
[12] Shibsankar Das and Jussi Kangasharju. Evaluation of Network Impact of Content Distribution Mechanisms. In Proceedings of the 1st International Conference on Scalable Information Systems, pages 35–es, 2006.


[13] Karel De Vogeleer, David Erman, and Adrian Popescu. Simulating BitTorrent. In Proceedings of the 1st international conference on Simulation tools and techniques for communications, networks and systems & workshops, Simutools '08, pages 2:1–2:7, Brussels, Belgium, 2008. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering).

[14] Răzvan Deaconescu, George Milescu, Bogdan Aurelian, Răzvan Rughiniș, and Nicolae Țapuș. A Virtualized Infrastructure for Automated BitTorrent Performance Testing and Evaluation. International Journal on Advances in Systems and Measurements, 2(2&3):236–247, 2009.

[15] Răzvan Deaconescu, George Milescu, and Nicolae Țapuș. Simulating Connection Dropouts in BitTorrent Environments. In EUROCON – International Conference on Computer as a Tool, IEEE, pages 1–4, 2011.

[16] Răzvan Deaconescu, Răzvan Rughiniș, and Nicolae Țapuș. A BitTorrent Performance Evaluation Framework. In Proceedings of the Fifth International Conference on Networking and Services, 2009.

[17] Răzvan Deaconescu, Răzvan Rughiniș, and Nicolae Țapuș. A Virtualized Testing Environment for BitTorrent Applications. In Proceedings of the 17th International Conference on Control Systems and Computer Science (CSCS17), number 313, Bucharest, 2009.

[18] Răzvan Deaconescu, Răzvan Rughiniș, and Nicolae Țapuș. A Virtualized Testing Environment for BitTorrent Applications. In Proceedings of CSCS'17, 2009.

[19] Răzvan Deaconescu, Marius Sandu-Popa, Adriana Drăghici, and Nicolae Țapuș. Using Enhanced Logging for BitTorrent Swarm Analysis. In Proceedings of the 9th RoEduNet IEEE International Conference, Sibiu, 2010.
[20] Răzvan Deaconescu, Marius Sandu-Popa, Adriana Drăghici, and Nicolae Țapuș. BitTorrent Swarm Analysis through Automation and Enhanced Logging. International Journal of Computer Networks & Communications, 3(1):52–65, 2011.
[21] Tien Tuan Anh Dinh, Georgios Theodoropoulos, and Rob Minson. Evaluating Large Scale Distributed Simulation of P2P Networks. In DS-RT '08: Proceedings of the 2008 12th IEEE/ACM International Symposium on Distributed Simulation and Real-Time Applications, pages 51–58, Washington, DC, USA, 2008. IEEE Computer Society.
[22] Rogier Dittner and David Rule Jr. The Best Damn Server Virtualization Book Period: Including Vmware, Xen, and Microsoft Virtual Server. Syngress, December 2007.
[23] Pascal Felber and Ernst Biersack. Self-scaling Networks for Content Distribution. In IEEE Infocom, 2004.
[24] Bin Fan, John C. S. Lui, and Dah-Ming Chiu. The design trade-offs of BitTorrent-like file sharing protocols. IEEE/ACM Trans. Netw., 17:365–376, April 2009.
[25] Pawel Garbacki, Alexandru Iosup, Dick Epema, and Maarten van Steen. 2Fast: Collaborative Downloads in P2P Networks. In Peer-to-Peer Computing, IEEE International Conference on, pages 23–30, 2006.
[26] Victor Grishchenko and Johan Pouwelse. Binmaps: Hybridizing Bitmaps and Binary Trees. http://bouillon.math.usu.ru/articles/binmaps-alenex.pdf, 2009.
[27] Krishna P. Gummadi, Richard J. Dunn, Stefan Saroiu, Steven D. Gribble, Henry M. Levy, and John Zahorjan. Measurement, Modeling, and Analysis of a Peer-to-Peer File-Sharing Workload. SIGOPS Oper. Syst. Rev., 37:314–329, October 2003.
[28] Lei Guo, Songqing Chen, Zhen Xiao, Enhua Tan, Xiaoning Ding, and Xiaodong Zhang. Measurements, Analysis, and Modeling of BitTorrent-like Systems. In Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement, IMC '05, Berkeley, CA, USA, 2005. USENIX Association.
[29] Lei Guo, Songqing Chen, Zhen Xiao, Enhua Tan, Xiaoning Ding, and Xiaodong Zhang. A performance study of BitTorrent-like peer-to-peer systems. IEEE Journal on Selected Areas in Communications, 25(1), 2007.
[30] Dragos Ilie, David Erman, Adrian Popescu, and Arne Nilsson. Traffic Measurements of P2P Systems. 2004.
[31] Alexandru Iosup, Pawel Garbacki, Johan Pouwelse, and Dick Epema. Correlating Topology and Path Characteristics of Overlay Networks and the Internet. In Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, CCGRID '06, Washington, DC, USA, 2006. IEEE Computer Society.
[32] Bukhary Ikhwan Ismail, Devendran Jagadisan, and Mohammad Fairus Khalid. Determining Overhead, Variance & Isolation Metrics in Virtualization for IaaS Cloud. In Data Driven e-Science, pages 315–330. Springer Science + Business Media, LLC, 2011.
[33] Xing Jin and S.-H. Gary Chan. Detecting malicious nodes in peer-to-peer streaming by peer-based monitoring. ACM Trans. Multimedia Comput. Commun. Appl., 6:9:1–9:18, March 2010.
[34] K. Katsaros, V.P. Kemerlis, C. Stais, and G. Xylomenos. A BitTorrent Module for the OMNeT++ Simulator. IEEE, September 2009.
[35] Yoram Kulbak and Danny Bickson. The eMule protocol specification, January 2005.
[36] Frank Lehrieder, Simon Oechsner, Tobias Hossfeld, Dirk Staehle, Zoran Despotovic, Wolfgang Kellerer, and Maximilian Michel. Mitigating unfairness in locality-aware peer-to-peer networks. Int. J. Netw. Manag., 21:3–20, January 2011.
[37] Bo Leuf. Peer to Peer: Collaboration and Sharing over the Internet. Pearson Education, June 2002.

[38] Andreea Leța, Răzvan Deaconescu, and Răzvan Rughiniș. Extending Packet Altering Capacities in Simulated Large Networks. In Proceedings of the 17th International Conference on Control Systems and Computer Science (CSCS17), Bucharest, 2009.
[39] Jens Lischka and Holger Karl. Rias: overlay topology creation on a PlanetLab infrastructure. In Proceedings of the second ACM SIGCOMM workshop on Virtualized infrastructure systems and architectures, VISA '10, pages 9–16, New York, NY, USA, 2010. ACM.
[40] Yong Liu, Yang Guo, and Chao Liang. A Survey on Peer-to-Peer Video Streaming Systems. Peer-to-Peer Networking and Applications, 1(1):18–28, March 2008.
[41] Thomas Locher, Patrick Moor, Stefan Schmid, and Roger Wattenhofer. Free Riding in BitTorrent is Cheap. HotNets, 2006.
[42] Alfred Wai-Sing Loo. Peer-to-Peer Computing: Building Supercomputers with Web Technologies. Springer, December 2006.
[43] R. Love. Linux Kernel Development, 2004.
[44] Qiuming Luo, Yun Li, Wentao Dong, Gang Liu, and Rui Mao. A Novel Model and a Simulation Tool for Churn of P2P Network. IEEE, December 2010.
[45] Petar Maymounkov and David Mazières. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In Peter Druschel, Frans Kaashoek, and Antony Rowstron, editors, Peer-to-Peer Systems, volume 2429 of Lecture Notes in Computer Science, chapter 5, pages 53–65. Springer Berlin / Heidelberg, October 2002.

[46] R. Merkle. A Digital Signature Based on a Conventional Encryption Function. In Proceedings of CRYPTO '87, pages 369–378, Santa Barbara, CA, USA, 1987.
[47] M. Meulpolder, J.A. Pouwelse, D.H.J. Epema, and H.J. Sips. Modeling and Analysis of Bandwidth-Inhomogeneous Swarms in BitTorrent. In Proc. of IEEE P2P 2009, pages 232–241, 2009.

[48] George Milescu, Răzvan Deaconescu, and Nicolae Țapuș. Versatile Configuration and Deployment of Realistic Peer-to-Peer Scenarios. In International Conference on Network Services, 2011.
[49] Shabnam Mirshokraie and Mohamed Hefeeda. Live peer-to-peer streaming with scalable video coding and network coding. In Proceedings of the first annual ACM SIGMM conference on Multimedia systems, MMSys '10, pages 123–132, New York, NY, USA, 2010. ACM.
[50] J. J. D. Mol, A. Bakker, J. A. Pouwelse, D. H. J. Epema, and H. J. Sips. The Design and Deployment of a BitTorrent Live Video Streaming Solution. In IEEE International Symposium on Multimedia, 2009.
[51] J. J. D. Mol, J. A. Pouwelse, M. Meulpolder, D. H. J. Epema, and H. J. Sips. Give-to-Get: Free-Riding Resilient Video-on-Demand in P2P Systems. In Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 6818, 2008.
[52] S. Naicken, B. Livingston, A. Basu, S. Rodhetbhai, I. Wakeman, and D. Chalmers. The State of Peer-to-Peer Simulators and Simulations. SIGCOMM Comput. Commun. Rev., 37(2):95–98, 2007.
[53] Andy Oram. Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O'Reilly Media, March 2001.
[54] Zhonghong Ou, Erkki Harjula, Otso Kassinen, and Mika Ylianttila. Performance Evaluation of a Kademlia-based Communication-Oriented P2P System under Churn. Computer Networks, 54(5):689–705, April 2010.
[55] Padala Padala, Xiaoyun Zhu, Zhikui Wang, Sharad Singhal, and Kang G. Shin. Performance Evaluation of Virtualization Technologies for Server Consolidation. Technical report.
[56] Nadim Parvez, Carey Williamson, Anirban Mahanti, and Niklas Carlsson. Analysis of BitTorrent-like protocols for on-demand stored media streaming. SIGMETRICS Perform. Eval. Rev., 36(1):301–312, 2008.
[57] Nadim Parvez, Carey Williamson, Anirban Mahanti, and Niklas Carlsson. Analysis of BitTorrent-like Protocols for On-Demand Stored Media Streaming. SIGMETRICS Perform. Eval. Rev., 36(1):301–312, 2008.
[58] J. Pouwelse, P. Garbacki, D. Epema, and H. Sips. A Measurement Study of the BitTorrent Peer-to-Peer File-Sharing System, 2004.
[59] J. A. Pouwelse, P. Garbacki, D. H. J. Epema, and H. J. Sips. The BitTorrent P2P File-Sharing System: Measurements And Analysis. In 4th International Workshop on Peer-to-Peer Systems (IPTPS), 2005.
[60] J. A. Pouwelse, P. Garbacki, J. Wang, A. Bakker, J. Yang, A. Iosup, D. H. J. Epema, M. Reinders, M. R. van Steen, and H. J. Sips. TRIBLER: A Social-based Peer-to-Peer System: Research Articles. Concurr. Comput.: Pract. Exper., 20:127–138, February 2008.
[61] S. Shalunov and G. Hazel. Low Extra Delay Background Transport (LEDBAT). http://tools.ietf.org/html/draft-ietf-ledbat-congestion-03, 2010.

[62] Marius Sandu-Popa, Adriana Drăghici, Răzvan Deaconescu, and Nicolae Țapuș. A Peer-to-Peer Swarm Creation and Management Framework. In Proceedings of the 1st Workshop on Software Services: Frameworks and Platforms, Timișoara, Romania, 2010.
[63] S. Shalunov. Low Extra Delay Background Transport (LEDBAT). http://www.ietf.org/id/draft-ietf-ledbat-congestion-00.txt, 2009.
[64] Spyros Sioutas, George Papaloukopoulos, Evangelos Sakkopoulos, Kostas Tsichlas, and Yannis Manolopoulos. A Novel Distributed P2P Simulator Architecture: D-P2P-Sim. In CIKM '09: Proceedings of the 18th ACM conference on Information and knowledge management, pages 2069–2070, New York, NY, USA, 2009. ACM.
[65] Stephen Soltesz, Herbert Poetzl, Marc E. Fiuczynski, Andy Bavier, and Larry Peterson. Container-based operating system virtualization: a scalable, high-performance alternative to hypervisors. In EuroSys '07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, volume 41, pages 275–287, New York, NY, USA, 2007. ACM.
[66] R. Stevens. UNIX Network Programming, Volume 2, Second Edition, 1999.
[67] Daniel Stutzbach and Reza Rejaie. Understanding Churn in Peer-to-Peer Networks. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, IMC '06, pages 189–202, New York, NY, USA, 2006. ACM.
[68] Ramesh Subramanian and Brian Goodman. Peer to Peer Computing: The Evolution of a Disruptive Technology. Idea Group Pub, February 2005.
[69] Ramesh Subramanian and Brian Goodman. Peer to Peer Computing: The Evolution of a Disruptive Technology. Idea Group Pub, 2005.
[70] Dreamtech Software Team. Peer to Peer Application Development: Cracking the Code. Hungry Minds, April 2001.
[71] Ye Tian, Di Wu, and Kam-Wing Ng. Performance Analysis and Improvement for BitTorrent-like File Sharing Systems. Concurrency: Practice and Experience, 19:1811–1835, 2007.
[72] A. Vlavianos, M. Iliofotou, and M. Faloutsos. BiToS: Enhancing BitTorrent for Supporting Streaming Applications. In INFOCOM 2006: 25th IEEE International Conference on Computer Communications, pages 1–6, April 2006.
[73] Haiyang Wang, Jiangchuan Liu, and Ke Xu. Exploring BitTorrent Peer Distribution via Hybrid PlanetLab-Internet Measurement. In 18th International Workshop on Quality of Service (IWQoS), 2010.
[74] Timothy Wood, Ludmila Cherkasova, Kivanc Ozonat, and Prashant Shenoy. Profiling and modeling resource usage of virtualized applications. In Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Middleware '08, pages 366–387, New York, NY, USA, 2008. Springer-Verlag New York, Inc.
[75] Boxun Zhang, Alexandru Iosup, Johan Pouwelse, and Dick Epema. The peer-to-peer trace archive: design and comparative trace analysis. In Proceedings of the ACM CoNEXT Student Workshop, CoNEXT '10 Student Workshop, pages 21:1–21:2, New York, NY, USA, 2010. ACM.