Design and Implementation of a Peer-to-Peer based System to enable the Functionality in a Platform-independent Storage Overlay

Michael Ammann
Jona, Switzerland
Student ID: 06-920-805

Supervisors: Guilherme Sperb Machado, Prof. Dr. Burkhard Stiller
Date of Submission: January 31, 2013

Master Thesis
Communication Systems Group (CSG)
Department of Informatics (IFI)
University of Zurich
Binzmühlestrasse 14, CH-8050 Zurich, Switzerland
URL: http://www.csg.uzh.ch/

Abstract

Cloud storage services have aroused great interest in modern information technology, providing a simple way of storing and sharing files. However, they are mostly built on centralized architectures and are hence subject to their drawbacks. Peer-to-Peer (P2P) systems are a good alternative to deal with the limitations of centralized systems in the scope of file sharing, due to the shared usage of distributed resources and higher fault tolerance. Therefore, this thesis designs, implements, and evaluates a P2P-based file sharing system to be integrated into PiCsMu, a novel platform-independent cloud storage application. The proposed system features a modular design which relies on well-defined interfaces, high security, as well as a P2P storage concept with good performance. Additional features that distinguish this approach from other P2P file sharing systems include an integrated file search and the possibility to share files privately. The results of a qualitative and quantitative evaluation show that the proposed system performs well and as designed. Considering that the download of 750-Megabyte files takes on average 154.65 seconds, the overhead added by the proposed system (4.21 seconds) is minimal. Yet the system still leaves aspects to be optimized.

Zusammenfassung

Cloud-Speicherdienste haben ein grosses Interesse in der modernen Informationstechnologie erweckt, denn sie bieten eine einfache Methode an, um Daten zu speichern und zu teilen. Da solche Systeme jedoch meistens auf zentralisierten Architekturen aufbauen, unterliegen sie auch ihren Nachteilen. Peer-to-Peer (P2P) Systeme stellen eine hervorragende Alternative für den gemeinsamen Zugriff auf Daten dar, um mit den Limitierungen zentraler Systeme umzugehen. Das Ziel dieser Arbeit ist das Entwerfen, Implementieren und Auswerten eines P2P-Systems für den gemeinsamen Zugriff auf Daten, welches in PiCsMu, einer neuartigen Cloud-Speicher-Anwendung, integriert wird. Das vorgeschlagene System hat einen modularen Entwurf, der auf klar definierten Schnittstellen, hoher Sicherheit und einem leistungsfähigen P2P-Speicherkonzept basiert. Eine integrierte Suchfunktion sowie die Möglichkeit eines privaten Datenaustausches sind Zusatzfunktionen, die das System von anderen P2P-Systemen hervorheben sollen. Die Resultate der qualitativen und quantitativen Evaluation zeigen, dass das vorgeschlagene System funktioniert und leistungsfähig ist. Der Download von 750 Megabyte grossen Dateien benötigt im Durchschnitt 154.65 Sekunden, wobei die Zeitverlängerung mit dem entworfenen System nur 4.21 Sekunden beträgt. Die Verlängerung ist im Verhältnis zur Gesamtzeit minimal. Es besteht jedoch immer noch Potential zur Optimierung.

Acknowledgments

I would like to express my appreciation to Prof. Dr. Burkhard Stiller and the Communication Systems Group of the University of Zurich for giving me the opportunity to work on such an interesting topic and be part of their research.

I owe my deepest gratitude to my supervisor Guilherme Sperb Machado, whose research, guidance, enthusiasm, and persistent help allowed me to complete this thesis. His work showed me new prospects and again awakened my interest in science. In addition, special thanks go to Dr. Thomas Bocek for providing TomP2P and for always being ready to provide support in case of any problems.

Lastly, I am indebted to my family, Karl-Heinz Saxer, Lyn Shepard, and everyone who supported me during this thesis.

Contents

Abstract

Zusammenfassung

Acknowledgments

1 Introduction

1.1 Description of Work and Thesis Goals

1.2 Thesis Outline

2 Terminology and Related Work

2.1 Terminology

2.2 Cloud Storage Services

2.3 A Brief History of Peer-to-Peer File Sharing

2.3.1 Napster

2.3.2 Gnutella

2.3.3 BitTorrent

2.3.4 Anonymous P2P and Freenet

2.3.5 Comparison

3 Design

3.1 File Upload and Download Protocol

3.2 Index

3.3 System Architecture

3.3.1 Application

3.3.2 Central

3.3.3 Peer-to-Peer Network

3.3.4 Cloud Services

3.4 Distributed Hash Table

3.4.1 Storage Concept using a DHT

3.4.2 Enabling a Content-Based Search

3.5 Privacy and Secure Data Exchange

3.5.1 Cryptography

3.6 Share Functionality

3.6.1 Public Sharing

3.6.2 Share Notification

3.6.3 Private Sharing

4 Implementation

4.1 An Overview of the P2P Code Architecture

4.2 Data Exchange using Beans

4.3 File-Sharing Protocols

4.4 Asynchronous and Non-Blocking Communication

4.4.1 Interacting with the DHT using Futures

4.4.2 Asynchronous Callbacks and Event Handling

4.5 Implementing Security Measures

4.6 File Search Functionality

4.6.1 Keyword Generation

4.6.2 Fast Similarity Search

5 Evaluation

5.1 Testbed and Evaluation Setup

5.1.1 Test Cases

5.2 Qualitative Evaluation

5.2.1 Code Coverage using JUnit Tests

5.2.2 Functional Testing

5.3 Quantitative Evaluation

5.4 Additional Issues and Problems

6 Summary and Conclusions

Bibliography

Abbreviations

Glossary

List of Figures

List of Tables

List of Listings

A Installation Guidelines

B Contents of the CD

Chapter 1

Introduction

Modern Information Technology (IT) strives to provide computing as a utility. Utility computing is a concept based on delivering services and resources (e.g., processing, storage) to end-users while hiding their internal mechanisms. The acquisition of these offerings should be possible at minimal cost, with the advantage of being purchasable by anyone, from private users to large IT companies. As an example, developers can easily launch new services without the need to first invest a large amount of money in infrastructure. Thus, organizations are able to deploy services first and later check whether the demand meets predictions, e.g., the number of users simultaneously accessing the service [21]. Multiple distributed technologies such as cluster, grid, and cloud computing have emerged from this paradigm. They allow access to large amounts of computing power in a fully virtualized manner, by aggregating resources and offering a single system view [25]. Cloud computing (or simply cloud) has been one of the most used terms in IT over recent years. The name emerged as a metaphor for the Internet, which has typically been represented as a cloud symbol in network diagrams [59]. The symbol is an abstraction for the complex underlying mechanisms. A variety of cloud providers (i.e., companies offering cloud computing) have been incorporated over the last couple of years, including big names such as Amazon, Google, and Microsoft. They offer cloud services that can be classified into three types: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) [59]. Well-known examples of cloud services are, among others, [1], Google Picasa [9], or Microsoft Office 365 [11]. Section 2.2 introduces cloud services relevant to this thesis. In the scope of SaaS, cloud providers offer applications with a specific purpose and responsibility. Consequently, data that is uploaded to the services is restricted by the corresponding cloud provider's data validation process.
For example, Google Picasa only allows hosting images of a certain file type, resolution, and size. Machado et al. [42] investigated how to bypass the data validation process of cloud providers to store arbitrary data without restrictions. By applying different encoders (i.e., Steganography-related, FileFormatHeader-related, Appending-related), data is transformed into a format accepted by the cloud provider. A weak data validation process does not recognize this change in the data. Therefore, it is possible to upload any kind of data to a cloud service without the violation of service-specific restrictions being noticed by the cloud provider, causing impacts on security, accounting, and charging.


On the basis of the idea presented by Machado et al., its authors created the Platform-independent Cross Storage System for Multi-Usage (PiCsMu) [15]. PiCsMu is a prototypical cloud-based overlay (i.e., a network on top of another network) offering file-type-independent cloud storage using file-type-dependent cloud services. By applying encoding, the application can store arbitrary files on any cloud service. The way PiCsMu stores files is divided into three major steps: (1) fragmentation, (2) encoding, and (3) upload. Step (1) splits the file to be uploaded into multiple file parts. In the scope of (2), each file part is encoded specifically for the service to which it will be uploaded. Finally, step (3) uploads all encoded file parts to their corresponding cloud services. Advantages of this method are discussed in Section 2.2, and the individual steps are explained in more detail in Section 3.1. In order to download the file, the reverse process is applied. PiCsMu itself does not store content data, but index data (i.e., metadata, data about data), which contains information about the upload process, such as (1) the number and order of fragmented file parts, (2) the applied encoders, and (3) the used cloud services. Using the index data, PiCsMu is able to retrieve and reconstruct a previously stored file. Section 3.2 describes how the index data is composed. The denotation of PiCsMu as an overlay is based upon the characteristic that it runs on top of multiple cloud services, using indexes to manage the files distributed across them.
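The three-step store process and its reverse can be sketched as follows. This is a hypothetical illustration, not the actual PiCsMu code: Base64 stands in for the service-specific encoders, the upload step is omitted, and the list of encoded parts plays the role of the index data.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.List;

// Hypothetical sketch of the PiCsMu upload steps: (1) fragmentation,
// (2) per-service encoding (Base64 as a stand-in), (3) upload (omitted).
public class PicsmuPipelineSketch {

    // Step (1): split a file into fixed-size parts.
    static List<byte[]> fragment(byte[] file, int partSize) {
        List<byte[]> parts = new ArrayList<>();
        for (int off = 0; off < file.length; off += partSize) {
            parts.add(Arrays.copyOfRange(file, off, Math.min(off + partSize, file.length)));
        }
        return parts;
    }

    // Step (2): encode a part into a form the target cloud service accepts.
    static String encode(byte[] part) {
        return Base64.getEncoder().encodeToString(part);
    }

    // Reverse process: decode all parts and reassemble them in index order.
    static byte[] reconstruct(List<String> encodedParts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String p : encodedParts) {
            byte[] decoded = Base64.getDecoder().decode(p);
            out.write(decoded, 0, decoded.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] file = "some file content".getBytes();
        List<String> index = new ArrayList<>();          // plays the role of index data
        for (byte[] part : fragment(file, 5)) {
            index.add(encode(part));                     // step (3) would upload here
        }
        System.out.println(new String(reconstruct(index)));
    }
}
```

The index records everything needed to reverse the process, which mirrors why PiCsMu only needs to store index data, not the content itself.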

Since the file parts are stored in publicly available services, any authorized person can access and download them. Hence, by sharing the index data of a file with another PiCsMu user, he or she can obtain all necessary file parts from the corresponding cloud services and reconstruct the file using the index data information. As a client-server (C/S) application, PiCsMu offers a centralized way of sharing files. However, having one central authority bears drawbacks for file sharing. High network traffic from many concurrent users may rapidly lead to congestion at the server, causing a temporary denial of service or a complete breakdown. Further, the identity of the users (i.e., their Internet Protocol (IP) address) is known to the server and can be requested by governmental authorities for prosecution in case of sharing copyrighted or sensitive files. Having said this, it is also easy for an authority to shut down the entire system by disabling the central server.

Peer-to-Peer (P2P) systems have become popular on the Internet, especially in the field of file sharing and media streaming. Steinmetz et al. [54] provide the following definition: “A Peer-to-Peer system is a self-organizing system of equal, autonomous entities (peers) which aims for the shared usage of distributed resources in a networked environment avoiding central services.” The increasing availability of high-speed broadband connections made the concept of globally applicable P2P systems feasible. Average users have the computational capacity and resources to act as server and client simultaneously (i.e., the definition of a peer), providing and consuming data. P2P systems are designed to overcome the limitations and drawbacks of classical C/S systems. They are (1) extensible (easy to add new resources), (2) more fault tolerant, (3) scalable (the system can grow without loss in performance), and (4) resistant to lawsuits. Hence, this thesis investigates, implements, evaluates, and discusses the design of a P2P-based system to enable a distributed share functionality in PiCsMu to overcome its current architectural limitations.

1.1 Description of Work and Thesis Goals

The objective of this thesis is to design, implement, and evaluate a P2P-based system to be integrated into PiCsMu to extend and enhance its current functionality. This primarily includes the introduction of a distributed and decentralized file sharing and storing mechanism. Since PiCsMu already exists as a prototypical implementation, the system architecture and code design are analyzed in order to integrate new functions in the best possible way, minimizing changes to the original PiCsMu implementation. Furthermore, an investigation and comparison of related work in the field of cloud storage services is done to see how PiCsMu differs in features and how the work of this thesis can make a contribution. The proposed P2P solution is compared side-by-side with well-known P2P file-sharing systems in order to analyze advantages and drawbacks. The observation of the current implementation of PiCsMu, together with the analysis of related work, results in a design solution for the new PiCsMu with a P2P extension. The design is intended to be modular, in order to make future adaptations (e.g., new features, a different P2P architecture) possible with minimum effort. In addition, design decisions are based on how to make best use of the functionality that PiCsMu already offers. The implementation realizes all aspects of the design and is done in the Java programming language. In the scope of evaluation, the proposed solution is evaluated to survey the design's validity and feasibility. The evaluation process comprises defining different scenarios in which the system's functionality and performance are measured, and the results obtained are discussed. Finally, the end-to-end work is analyzed and open issues as well as future work are presented.

1.2 Thesis Outline

The remainder of this thesis is structured as follows: Chapter 2 presents the terminology and the related work in the area of cloud services and P2P file-sharing systems. Chapter 3 presents the design of the proposed system architecture and explains the system components thoroughly. Chapter 4 focuses on the technical part of the thesis and explains how the design decisions are implemented into PiCsMu. Chapter 5 presents and discusses results obtained from the evaluation. Finally, Chapter 6 summarizes and concludes what was achieved and presents future work.

Chapter 2

Terminology and Related Work

The first part of this chapter defines the general terms used throughout the thesis. The second part introduces related work. It includes a short comparison of cloud storage services, followed by a more detailed comparison of P2P file-sharing systems.

2.1 Terminology

Computer files are common resources of users interacting with cloud services based on file-sharing systems. “A computer file (or just file) is composed of organizational data and content data. The former is represented by headers or any kind of control information, usually following a certain standard. It is data about content data. The latter is the data which the organizational data describes” [42].

A file format or file type is a particular way that information is encoded for storage in a computer file, as files need a way to be represented as bits when stored on a digital storage medium. Such a file format is usually identified by an intuitive and human-readable file name extension [42].

Encoding is the process in which data is converted into a specific form. Possible encoding applications include reduction of the file size or hiding data inside other data type files to conceal the original content. Decoding is the reverse process to restore the encoded data to their original form.

Encryption and decryption are the most basic applications of cryptography. Classical cryptography deals with the problem of enabling secret communication over insecure communication media. Modern cryptography is further concerned with the construction of information systems that are robust against a variety of malicious attacks [45]. Encryption applies encoding to transform data into a format that cannot be easily understood by unauthorized parties. Decryption reverts the encrypted data to its original state. Usually, decryption can only be done by authorized parties.
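As a minimal illustration of these terms, the following sketch encrypts and decrypts data with AES via the standard Java Cryptography Architecture. The class and method names are assumptions of this example, and the default ECB cipher mode is used only to keep the sketch short; it is not a recommendation for real systems.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.util.Arrays;

// Minimal encryption/decryption sketch (illustrative names, not a real design).
public class EncryptionSketch {

    static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);                       // 128-bit AES key
        return kg.generateKey();
    }

    static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        Cipher c = Cipher.getInstance("AES");   // default mode, for brevity only
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plaintext);            // unreadable without the key
    }

    static byte[] decrypt(SecretKey key, byte[] ciphertext) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(ciphertext);           // restores the original bytes
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] secret = "private file content".getBytes();
        byte[] cipher = encrypt(key, secret);
        // Ciphertext differs from the plaintext; only the key holder can revert it.
        System.out.println(Arrays.equals(secret, cipher));
        System.out.println(Arrays.equals(secret, decrypt(key, cipher)));
    }
}
```

Whoever holds the key controls decryption, which is exactly the difference between server-side and client-side encryption discussed in Section 2.2.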


An overlay network (or simply overlay) is a virtual or logical network on top of another network in which endpoints are addressable [24]. It provides connectivity, routing, and messaging between endpoints. Overlays are often used to provide a routing topology not available from the underlying physical network. P2P systems are overlay networks that mainly run on top of the Internet.

“A Peer-to-Peer system is a self-organizing system of equal, autonomous entities (peers) which aims for the shared usage of distributed resources in a networked environment avoiding central services” [54]. P2P systems are based on a distributed network architec- ture that allows distribution of bandwidth usage to reduce bottlenecks while being more robust against failure.

“Cloud computing refers to both the applications delivered as services over the Internet and the hardware and systems software in data centers that provide those services” [21]. Cloud providers offer cloud computing as services. These cloud services can be classified into three types: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) [59].

2.2 Cloud Storage Services

There are a variety of file hosting services that provide free online storage using the cloud storage model. Table 2.1 presents well-known services that have proved themselves over the last years and compares their features. Otixo [14] is the closest approach to PiCsMu. Both systems are not directly considered file hosting services, nor do they provide storage themselves. However, they use an overlay to manage multiple other services. The advantage is the reduction of complexity delivered to end-users: users only need one application to access all other services and stored files, instead of accessing each service itself.

A high security standard is a prevailing topic relating to cloud services. Users are concerned that their files are kept private and that no malicious third party gains access to them. A simple way to make files unreadable is encryption. While all services provide server-side encryption, only SpiderOak [17], Wuala [20], and PiCsMu use client-side encryption. Otixo does not use native client-side encryption (i.e., encryption provided by Otixo itself), but may use a service that provides it. With server-side encryption, the service provider holds the encryption and decryption key pair. Technically, server-side encryption allows the provider to view and access all files. This type of encryption mainly serves as protection against a security breach, where malicious users gain access to the files. However, it does not give users the feeling of real privacy. Client-side encryption tends to solve such issues: all data is encrypted on the end-user machine before being sent to the cloud service. This ensures that a provider has no possibility of reading the files and also protects against security breaches. The biggest disadvantage of client-side encryption is the possibility of users losing the decryption key. Since the user is responsible for storing and protecting the key, a loss can result in losing the ability to decrypt all files, making all the data useless.

PiCsMu is the only service known to fragment and encode files in order to distribute fragments to multiple cloud services. The advantages are (1) increased security, (2) data redundancy (multiple fragments in multiple cloud services), (3) theoretically infinite storage space, and (4) file-type independent storage. The encoding adds another layer of security to the files, making it even harder for a provider to identify what is actually stored on its system. The fragmentation enables PiCsMu to use as many cloud services as desired, giving the user nearly infinite storage.

Dropbox [3], Google Drive [8], SkyDrive [16], SpiderOak [17], Wuala [20], Otixo [14], and PiCsMu offer a centralized solution to share the stored files with others. As a result of this thesis, PiCsMu is, in addition, extended to provide a decentralized approach using a P2P network. The advantages and drawbacks of a fully decentralized file-sharing system are further explained in Section 2.3.

Feature                       Dropbox  Google Drive  SkyDrive  SpiderOak  Wuala  Otixo  PiCsMu
Overlay                       No       No            No        No         No     Yes    Yes
Supports Additional Services  No       No            No        No         No     Yes    Yes
Fragment to Multiple Clouds   No       No            No        No         No     No     Yes
Client-side Encryption        No       No            No        Yes        Yes    No¹    Yes
Encoding                      No       No            No        No         No     No     Yes
Centralized Sharing           Yes      Yes           Yes       Yes        Yes    Yes    Yes
Decentralized Sharing         No       No            No        No         No     No     Yes

Table 2.1: A feature comparison of cloud storage services

2.3 A Brief History of Peer-to-Peer File Sharing

The main focus of this work is to enable share functionality in PiCsMu using a P2P net- work. In order to position PiCsMu as a P2P file-sharing system among other existing solutions, a brief analysis and comparison of main characteristics is necessary. Therefore, this section describes the most well-known P2P systems and highlights specific character- istics that are compared to PiCsMu. The detailed comparison appears in Section 2.3.5 and is summarized in Table 2.2.

¹ Encryption might be offered by one of the integrated services.

2.3.1 Napster

Napster is one of the first file-sharing systems to become widely popular. It was founded in early 1999 as an online service to share audio files. At its peak in 2001, approximately 1.6 million users were online at the same time. It is estimated that over 2 billion audio files were downloaded up until that point [33].

Despite its success, Napster had severe problems with copyright issues and subsequent lawsuits [55]. Based on a court decision in 2001, an injunction was passed ordering Napster to shut down. The service was relaunched in 2003. It now uses a pay-per-song charging model to avoid further copyright lawsuits.

Saroiu et al. [51] classify Napster as an unstructured, centralized P2P system. Unstructured systems are characterized by data placement in the network without any knowledge of its topology, in contrast to structured P2P systems. The main component in the system architecture is the cluster of dedicated central servers. These are responsible for bootstrapping as well as for providing the lookup service. The cluster maintains an index with all information on file locations, including a list of currently connected users and their files. Each time a peer starts Napster, it establishes a connection to the central server cluster. When looking for a file, the peer queries the index servers. The query is processed by checking each connected user for availability of the file. A list of possible trade partners is returned. Then, the requesting peer can choose a user to download from. A direct Hypertext Transfer Protocol (HTTP) connection is established between both peers, and the file can be downloaded. After the file exchange, the HTTP connection is closed.
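The central-index lookup described above can be modeled in a few lines. This is a toy sketch under assumed names and data structures, not Napster's actual protocol: the index server maps file names to the peers currently offering them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of a Napster-style central index server (names are illustrative).
public class CentralIndexSketch {

    // file name -> set of currently connected peers offering that file
    private final Map<String, Set<String>> index = new HashMap<>();

    // A peer connects and registers its shared files with the server.
    void register(String peer, List<String> files) {
        for (String f : files) {
            index.computeIfAbsent(f, k -> new HashSet<>()).add(peer);
        }
    }

    // A query returns the possible trade partners for a file; the requester
    // then downloads directly from one of them over HTTP.
    Set<String> query(String file) {
        return index.getOrDefault(file, Collections.emptySet());
    }

    public static void main(String[] args) {
        CentralIndexSketch server = new CentralIndexSketch();
        server.register("peerA", List.of("song1.mp3", "song2.mp3"));
        server.register("peerB", List.of("song1.mp3"));
        System.out.println(server.query("song1.mp3")); // both peers offer it
    }
}
```

The sketch makes the architectural weakness visible: the whole lookup lives on one server, so disabling it disables the system.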

2.3.2 Gnutella

Gnutella is a term that has different meanings. It mostly refers to an open-source distributed file-sharing protocol [48], originally developed by Justin Frankel and Tom Pepper in early 2000. The term can denote either the file-sharing network itself or the original client software used to connect to the network. In this work, only the protocol is of interest for comparison with PiCsMu.

The system developed based on the Gnutella 0.4 protocol is classified as an unstructured P2P network [41]. The network has no knowledge of file locations. There is no central server or index that could be in charge of this task. This is a major difference from the central index paradigm that Napster used.

In order to be part of the Gnutella network, clients have to bootstrap first to find an existing peer. Since the network topology changes constantly, finding peers already connected to the network can be a problem [37]. There are different possibilities implemented that address this issue. The most common one is to use a predefined list of well-known hosts that should always be reachable. This solution is similar to having multiple bootstrap servers. Gnutella Web caches or UDP (User Datagram Protocol) host caches are additional solutions. Web caches are programs placed on any Web server that store IP addresses of hosts online in the Gnutella network. These servers constantly refresh their cache to be up-to-date and can be queried by the Gnutella application. UDP host caches work in a similar way. Peer information is included in the UDP packets transferred within a Gnutella network. If a peer contacts a bootstrap peer (e.g., from the list of well-known hosts), it receives an additional list of online peers in the UDP response message. Since UDP messages are very small and do not take much bandwidth, this approach scales and performs better than using Web caches. In addition, every peer in the network can be a UDP host cache.

Files are searched and located based on the lookup protocol [41]. In Gnutella, this is done by flooding the network. Since a peer knows at least one other peer that runs the software (i.e., Gnutella), the initial search query message (QUERY) is sent to all directly known peers. A peer is identified by its IP address. Therefore, Gnutella and Napster are considered non-anonymous P2P networks. Such networks are characterized by the possibility to associate a file with an IP address (i.e., the identifier of a user). The concept of anonymous P2P networks is explained in Section 2.3.4. When a peer receives a QUERY message, it checks its local hard disk to evaluate whether the file is available. If the file exists, a QUERY RESPONSE message is returned. The response message contains the IP address as well as the located file name. In addition to checking for the file on its own machine, the peer forwards the QUERY to all peers it knows. By following this procedure, the original QUERY is forwarded and multiplied by a huge factor. To limit the network load, each QUERY contains a Time-to-live (TTL). The TTL limits how long a QUERY is forwarded on the network. Without it, messages would be routed forever, heavily decreasing the network's performance.

At the end of the process described above, the searching peer receives a list of file names and the corresponding peers (IP addresses) possessing them. The peer then directly connects to one other peer holding the desired file by sending a GET message. Upon receiving a GET message, the peer starts sending the file to the requesting peer via PUSH messages. This process repeats until the file is completely exchanged. Based on this architecture, Gnutella is only designed to be a public file-sharing system. It is not suited to enable private file sharing due to its flooding lookup. However, since it is an open protocol, everyone can use it to build their own private sharing network.
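A simplified model of the flooding lookup with a TTL might look as follows. All names and the message representation are assumptions of this sketch; real Gnutella messages carry more fields (e.g., descriptor IDs and hop counts).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of Gnutella 0.4 flooding: a QUERY is forwarded to all known
// neighbours until its TTL expires; every peer holding the file "responds"
// by being added to the result set.
public class FloodingSearchSketch {

    // neighbours: whom each peer knows; files: what each peer stores locally.
    static Set<String> query(Map<String, List<String>> neighbours,
                             Map<String, Set<String>> files,
                             String origin, String fileName, int ttl) {
        Set<String> responders = new HashSet<>();
        Set<String> visited = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>();   // [peer, remaining TTL]
        queue.add(new String[]{origin, String.valueOf(ttl)});
        while (!queue.isEmpty()) {
            String[] msg = queue.poll();
            String peer = msg[0];
            int remaining = Integer.parseInt(msg[1]);
            if (!visited.add(peer)) continue;         // already processed this QUERY
            if (files.getOrDefault(peer, Set.of()).contains(fileName)) {
                responders.add(peer);                 // QUERY RESPONSE
            }
            if (remaining == 0) continue;             // TTL expired: stop forwarding
            for (String n : neighbours.getOrDefault(peer, List.of())) {
                queue.add(new String[]{n, String.valueOf(remaining - 1)});
            }
        }
        return responders;
    }

    public static void main(String[] args) {
        Map<String, List<String>> nb = Map.of(
                "A", List.of("B"), "B", List.of("C"), "C", List.of());
        Map<String, Set<String>> files = Map.of("C", Set.of("movie.avi"));
        System.out.println(query(nb, files, "A", "movie.avi", 3)); // C is found
        System.out.println(query(nb, files, "A", "movie.avi", 1)); // TTL too small
    }
}
```

The second call illustrates the trade-off the TTL imposes: it bounds network load, but a file beyond the TTL horizon is simply not found.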

The latest protocol version (Gnutella 0.6) introduced new concepts to increase routing performance. The most important is the notion of ultra-peers. Ultra-peers possess better bandwidth connectivity and can therefore process queries on behalf of other peers. This new architecture made Gnutella a hybrid P2P system. A variety of software is built on top of Gnutella. A well-known example is the early version of FrostWire [7]. It was later moved to operate with BitTorrent, described in Section 2.3.3.

2.3.3 BitTorrent

The biggest player in P2P file sharing nowadays is the BitTorrent protocol. It accounts for 50-75% of the overall P2P traffic and a comparable share of all Internet traffic [32]. BitTorrent was established in 2001 and has since developed further, now having a variety of clients available. The main strength of BitTorrent lies in the possibility of downloading large files considerably fast by using parallel downloads from multiple peers [34]. The architecture models an unstructured overlay network in which peers participate to share and download files [41]. Peers that actively participate in the file exchange are called a swarm. A file is downloaded simultaneously from multiple peers inside the swarm instead of from just a single peer. This procedure allows downloading pieces (i.e., chunks) of a file from single sources. In order to organize the swarm, BitTorrent uses central servers called trackers. A tracker is responsible for finding and connecting peers that possess parts of the file. Unlike other P2P file-sharing systems discussed in this chapter, BitTorrent does not integrate a file search into the client application. In order to download a file, the user has to possess a torrent file. These are very small files that contain meta-data about the file to be shared and about a specific tracker, in order to join the swarm [47]. Torrent files are normally obtained through third-party Web servers. Peers can download a file if the peer providing the resource runs a BitTorrent client acting as the seed. The seed (or seeder) refers to a peer holding a complete file (i.e., possessing all chunks) [41]. After obtaining a torrent file, the peer contacts the corresponding tracker and receives a random list of seeders. The client then opens a Transmission Control Protocol (TCP) connection with some of them and starts to exchange file chunks.
The file sharing only stops when the peer removes the torrent file from the client application. The file exchange protocol follows two essential policies: peer selection and piece selection [29]. The peer selection algorithm uses the tit-for-tat strategy. While a peer is downloading chunks from the swarm, it also starts uploading chunks already obtained. The strategy therefore prioritizes peers that are providing chunks at a high rate, implying that downloading from these peers is faster. This solution aims to reduce leeching. Leeching denotes peers downloading content without providing content at the same time, which would drastically limit the use of BitTorrent. In addition to tit-for-tat, every 30 seconds a new random peer is obtained to update the list of available chunks and to identify new possible swarm peers [38]. Besides peer selection, the order in which chunks of a file are downloaded is crucial for the overall download speed. An inadequate algorithm can lead to situations where peers do not have any chunks left to upload. The piece selection algorithm tries to optimize this mechanism. The methods in use are rarest first, random first, and endgame mode. At first, a random chunk is downloaded. In this manner, the peer can immediately start to upload this chunk again, improving its chance of being selected by the peer selection algorithm. Afterwards, the rarest chunk is always selected. The endgame policy ensures that a download does not fail to complete because of a single low-capacity peer. The newest implementations of BitTorrent introduced a design in which trackers are no longer needed. The main point of failure of the original BitTorrent protocol is the centralized tracker. If it fails, a peer has no way of discovering other peers from which to download. A solution to this drawback was found by introducing distributed trackers, organized as a Distributed Hash Table (DHT) [58]. The general functionality of a DHT is explained in Section 3.3.3.
In the case of BitTorrent, the DHT stores peers that know the location of other peers. The lookup is performed efficiently by the underlying DHT. Popular

BitTorrent clients that implement this approach are µTorrent [12] and Vuze [19]. The DHT implementation is based on Kademlia [44].
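The rarest-first piece selection policy described above can be sketched in a few lines. The following is an illustrative model, not any client's actual implementation: given the set of chunks a peer already holds and the chunk bitmaps advertised by the swarm, the peer requests the missing chunk held by the fewest other peers.

```python
from collections import Counter

def rarest_first(have, swarm_bitmaps):
    """Pick the next chunk to request: among chunks we do not yet have,
    choose the one held by the fewest swarm peers (ties broken by the
    lower chunk index). `have` is a set of chunk indices this peer holds;
    `swarm_bitmaps` maps each known peer to the set of chunks it holds."""
    counts = Counter()
    for chunks in swarm_bitmaps.values():
        counts.update(chunks)
    # only chunks that exist somewhere in the swarm and are still missing
    candidates = [(counts[c], c) for c in counts if c not in have]
    if not candidates:
        return None  # nothing new is available in the swarm
    return min(candidates)[1]
```

In a swarm where chunk 0 is held by one peer and chunk 1 by two, a peer missing both would request chunk 0 first, so that the rare chunk spreads before its last holder possibly leaves.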

2.3.4 Anonymous P2P and Freenet

The growing erosion of privacy on the Internet is the driving force behind the idea of anonymous P2P. In these systems, peers that share information and files try to protect their identity. This is one of the motivations behind PiCsMu as well. There are many reasons to favor anonymity on the Internet. The distribution of content (e.g., sharing of audio and video files) may be illegal. Users could fear retribution from a government or an organization; the so-called whistle-blower affair around WikiLeaks is the most popular example. Further reasons to prefer anonymity include censorship and personal privacy preferences: users do not want data about their behavior to be stored or analyzed.

Ian Clarke and other developers behind Freenet [6] were among the first to represent this philosophy and provided a P2P file-sharing software that was built to protect anonymity. Freenet is a fully decentralized architecture with no central control. The basic principles are encryption, data forwarding, and data storage [52].

Contrary to the other P2P systems presented, peers provide two essential services. First, each peer provides local storage space (data store) to the Freenet network, building a large distributed cache. The user has no control over and no knowledge of what will be stored in his data store. Files are not only transmitted between peers but actually stored. Therefore, Freenet is referred to as a file storage service rather than a file-sharing service [27]. Second, peers are responsible for routing. Each peer holds a private routing table in which known routes to file keys are stored. The information contained in the routing table is known only to the peer itself. Each time data is inserted into the network, a file key is generated to locate the data. File keys are calculated using Secure Hash Algorithms (SHA), e.g., SHA-1 and SHA-2 [46]. This key-based approach is similar to the DHT approach used in PiCsMu and selected BitTorrent clients, e.g., µTorrent, Vuze, and BitTorrent (since version 4.2.0).
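Deriving a location key from a content hash can be sketched with the standard library. Note this is only the basic idea: real Freenet defines several key types (e.g., content-hash keys and signed-subspace keys) with more structure than a plain digest.

```python
import hashlib

def file_key(data: bytes, algorithm: str = "sha256") -> str:
    """Derive a location key for a piece of content by hashing it.
    Illustrative sketch only: Freenet's actual key types add further
    structure (signatures, encryption metadata) on top of the hash."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()
```

Because the key is a pure function of the content, any peer can verify that data it stores or forwards actually matches the key it is filed under.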

When a new peer wants to join the network, it has to generate a public-private key pair. This pair serves to identify the peer by signing its physical IP address. Afterwards, the new peer sends an announcement message, containing its public key and IP address, to a randomly chosen peer from its bootstrap list. The receiving node stores the information in its private routing table and forwards the message to another randomly chosen peer. This procedure repeats until the Time-To-Live (TTL) of the message runs out. Finally, all peers that received the announcement message collectively calculate a Globally Unique Identifier (GUID) for the new peer. Based on the GUID, the new peer is responsible for a specific range of file keys.

In order to insert a file into the network, a peer generates the file key and sends an insert message. The inserting peer looks up the closest key in its routing table and forwards the message to the corresponding peer. This procedure is repeated until the TTL runs out. Each peer along this path verifies the data against its file key, stores it, and creates a routing table entry. This routing algorithm provides anonymity to the users. Messages travel through P2P chains, in which each link is individually encrypted. Each peer only knows its direct neighbors. The peer inserting a file does not know where the file will be stored, and the other peers do not know where the file comes from. Even the peer immediately after the inserting peer cannot verify whether its predecessor is the message originator or was just forwarding it. To request a file, the system follows the same process as with insertion. The request is forwarded among neighbors until a peer has a matching file key in its local data store. The file is then returned back along the routing path. Each peer on this path can decide to store the file as well, increasing availability.
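The greedy key-based forwarding used by both insert and request messages can be sketched as follows. The data model here (integer keys, a per-peer table mapping known keys to the neighbour responsible for them, `None` marking locally stored keys) is hypothetical and chosen only to make the routing idea concrete.

```python
def route_request(target_key: int, start_peer: str, routing_tables: dict, ttl: int = 10):
    """Greedy key-based routing sketch: each peer forwards the request to
    the neighbour associated with the key numerically closest to the
    target, until some peer stores the key locally or the TTL expires.
    `routing_tables` maps peer -> {key: neighbour_or_None}; a None
    neighbour means the peer itself stores that key."""
    peer = start_peer
    path = [peer]
    while ttl > 0:
        table = routing_tables[peer]
        if target_key in table and table[target_key] is None:
            return path  # a peer holding the key was found
        # forward to the neighbour responsible for the closest known key
        closest = min(table, key=lambda k: abs(k - target_key))
        nxt = table[closest]
        if nxt is None or nxt in path:
            return None  # dead end or routing loop: the request fails
        peer = nxt
        path.append(peer)
        ttl -= 1
    return None  # TTL ran out before the key was found
```

The TTL check at the top mirrors why the lookup is not guaranteed: a stored key may exist in the network, yet the request can expire before the greedy walk reaches it.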

As an additional mechanism to protect peers from legal issues, all data in the local data store is encrypted. The encryption happens before inserting the data into the network, using a randomly generated encryption key. The corresponding decryption key is never stored with the file but is only published together with the file key, making the decryption key available only to end users requesting the file. Therefore, a peer has no way to read its local data store and remains ignorant of its content. This is intended to protect users in case of prosecution.

Since version 0.7, Freenet exists in two modes of operation: Darknet and Opennet. Opennet refers to the routing mode where connections between peers are established automatically. The user is connected to random users all over the world. As with any P2P system, this approach suffers from malicious peers trying to harm the network. Darknet therefore enables the user to establish manual connections to other peers [28]. This ensures that a peer is only connected to trustworthy users, building a private network.

2.3.5 Comparison

This section compares the P2P systems described in Sections 2.3.1 - 2.3.4 to the PiCsMu system with P2P capabilities. The detailed procedures behind PiCsMu are explained throughout this work. Although many more systems exist that are not directly considered in the comparison, each selected example represents the characteristics of a typical group of P2P systems. Table 2.2 summarizes the comparison of P2P file-sharing systems. A discussion of security-related issues is not part of this work. The reason is that security is mainly influenced by specific implementations rather than the general system design. Lua et al. [41] present a complete comparison of structured and unstructured P2P overlays. The following terminology is used:

• Topology - Categorizes the network topology as centralized or decentralized.

• Architecture - Categorizes the overlay scheme as structured or unstructured.

• Lookup - The implemented protocol to query other peers for information.

• Efficiency - Is the lookup guaranteed to find content, and how efficient is it?

• Bootstrapping - Describes the mechanism by which a peer joins the network.

• Storage - What do peers store, and what is exchanged?

• File Search - Describes how the user can search for files.

• Download - Identifies the entity from which data is downloaded.

• Upload - Identifies the entity to which data is uploaded.

• Public Sharing - Does the system support the sharing of files with all peers?

• Private Sharing - Does the system support the sharing of files with specific peers only?

As Table 2.2 shows, PiCsMu is the only system considered that has a structured overlay architecture. The network topology is controlled and known by the system. Content is placed at specific peers, which makes subsequent queries more efficient, in contrast to the random peer selection in most unstructured overlays [41]. The most common implementation for structured overlays is the DHT. PiCsMu is the only system presented that solely relies on a DHT, built on top of TomP2P [18]. This approach has been chosen because of the fully decentralized P2P network, the efficient and guaranteed lookup algorithm (O(log N)), and the availability of a DHT implementation developed by the research group supervising this work. Therefore, PiCsMu is also intended to be a contribution to the research on TomP2P.

Napster and BitTorrent have an efficient lookup as well. Peers in Napster only need one query to the central index server to get content information. If the content is available, the lookup is guaranteed to find it. While the central server makes the lookup extremely efficient, it can also lead to failure: if the server is down, the whole system becomes inoperable. The same applies to the trackers in BitTorrent. Therefore, the structured DHT solution is supported by some implementations. The efficiency of BitTorrent mainly relies on the availability of content. If a file is very popular (available on many peers), BitTorrent provides the best performance of all systems. Gnutella is in general the least efficient solution. Flooding the network generates a lot of unnecessary traffic. Furthermore, the lookup is not guaranteed to find the content, even if it is available somewhere in the network. This is due to the TTL in each file request; it may run out before a peer holding the file is found. Freenet provides a lookup solution in between flooding and DHT. Each peer holds a lookup table to select the most likely optimal peer to forward a message to. Since the table is maintained only by the peer itself, it is less powerful and accurate than a DHT. The lookup efficiency is ideally O((log N)²), which is unsatisfactory in comparison with a DHT approach as in PiCsMu. As with Gnutella, each message has a TTL and may time out before it is successful. Therefore, the lookup is not guaranteed.
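The gap between the two bounds is easy to quantify. The following back-of-the-envelope sketch simply evaluates the two asymptotic hop-count models with base-2 logarithms; the constants hidden by O-notation are ignored.

```python
import math

def dht_hops(n_peers: int) -> float:
    """Expected lookup cost in a DHT such as Kademlia: O(log2 N) hops."""
    return math.log2(n_peers)

def freenet_hops(n_peers: int) -> float:
    """Idealised Freenet lookup cost: O((log2 N)^2) hops."""
    return math.log2(n_peers) ** 2

# For a network of 2**20 (about one million) peers, the DHT model gives
# 20 hops, while the (log N)^2 model gives 400.
```

Even at modest network sizes the quadratic term dominates, which is why the thesis prefers the DHT-based lookup.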

The most popular bootstrap method is to use a bootstrap server or a cluster of well-known peers that are likely to be online. In addition, Gnutella caches highly available peers from previous sessions to connect to again after a restart. PiCsMu uses one central server in its prototype state. This solution was chosen because the server is already available for other tasks, as described in Section 3.3.2. Later implementations may offer a list of well-known peers or caching as well. Based on the information contained in the torrent files, peers in BitTorrent know which centralized tracker to bootstrap to. The responsible tracker gives the newly arrived peer the addresses of a set of other peers interested in the same file. This procedure makes the lookup more efficient. However, this approach is only viable due to the overall concept of BitTorrent. To conclude, the bootstrap method alone has no significant influence on the performance of the system. There is research in this area worth noting [26, 60], but the choice depends mainly on the good interaction of all components.

Criteria          Napster         Gnutella        BitTorrent      Freenet         PiCsMu
Thesis Section    2.3.1           2.3.2           2.3.3           2.3.4           This thesis
Topology          Centralized     Decentralized   Centralized     Decentralized   Decentralized
Architecture      Unstructured    Unstructured    Unstructured    Unstructured    Structured
Lookup            Central Index   Flooding        Tracker/DHT     Key-based       DHT
Efficiency        One-hop         Not Guaranteed  Guaranteed      Not Guaranteed  log N
Bootstrapping     Server          Server/Cache    Tracker         Known Peers     Server
Storage           Files           Files           Files           Files           Meta-files
File Search       Internal        Internal        External        External        Internal
Download          Peers           Peers           Peers           Peers           Cloud Services
Upload            Peers           Peers           Peers           Peers           Cloud Services
Public Sharing    Yes             Yes             Yes             Yes             Yes
Private Sharing   No              No              Yes             No              Yes

Table 2.2: A comparison of P2P file-sharing systems

The novelty of PiCsMu is its data storage. While peers in all other systems provide files (or parts of them) directly from their machines, PiCsMu provides meta-files. Similar to a torrent file, the meta-files contain information on where the file is actually stored. Napster, Gnutella, and BitTorrent do not store or replicate files in their systems. They maintain a global index and connect peers that are interested in the same file. On the other hand, Freenet is a file storage system, where a file is replicated among many peers. However, the actual data transfer between peers is always the desired file: a peer uploads or downloads the file directly from one or more other peers. In PiCsMu, peers only exchange meta-files. The actual file download and upload happens outside the P2P network, by directly connecting to cloud services.

Finally, some additional features are compared. PiCsMu offers an integrated file search. The user can type a key term into the application and retrieves a list of files available in the network. Napster and Gnutella offer the same functionality. While Napster and PiCsMu can find every file in the whole network that matches the search criteria, Gnutella can only return results from the peers it reached during the query. BitTorrent and Freenet do not have an integrated search function. The torrent files or file keys have to be searched for and obtained through other means. An advantage of the integrated search is that its results are up to date, so the chance that a file really exists is higher. There are situations in PiCsMu where a search result may lead to defective or nonexistent files; the functionality of the search is explained in detail in Section 3.4.2. Since an external search is not coupled in any way with the P2P system, the user has no guarantee at all that the file still exists.

Private sharing among peers is an additional feature offered by PiCsMu. A peer can share a specific file with a set of known peers. The same function is also available in BitTorrent, where the file contributor has to create its own tracker and adjust the torrent file. Napster and Gnutella have no option to enable private sharing due to their architecture. Freenet does not offer private sharing in its network. However, by following the Darknet approach described in Section 2.3.4, peers may establish manual connections to other peers. Therefore, users can build a friend-to-friend network for private sharing. Peers still have to share with everyone inside this network and have to establish new connections each time the set of users changes. This is time-consuming and still not as powerful as sharing in PiCsMu.

Chapter 3

Design

This chapter explains the composition of the PiCsMu software. The designed protocols are shown and explained with the help of selected figures and tables. The conception of the main components is described in detail, and design decisions are analyzed, showing their trade-offs.

3.1 File Upload and Download Protocol

This section provides an overview of the essential mechanisms of PiCsMu that are important for the further understanding of this thesis. An abstraction of the protocols for file upload and file download is introduced. These two are the most basic functions in the PiCsMu system.

Figures 3.1 and 3.2 show very simplified sequence diagrams of the file upload and download protocols. The Application Core forms the central point of the software and handles (1) initialization and (2) control flow. Each operation is executed from here once a specific start condition is met. A start condition can be a user input (e.g., clicking a button in the application) or the result of a previously finished operation. File encoding and fragmentation are the two main mechanisms provided by the Application File Handler. Its responsibilities are, on the one hand, to prepare a file to be uploaded and, on the other hand, to reconstruct a file from the downloaded file parts. The P2P/Central Server object represents the storage for the index (i.e., the set of all information needed to store and retrieve a file). The index is explained in more detail in Section 3.2. If a file is stored as private, the central server is used. Otherwise, if a file is being shared, the P2P network provides the index storage. Finally, Cloud Service represents the set of cloud services that provide the actual data storage for the encoded file parts.

A file upload in general works as follows: The user selects a file to upload for sharing or private usage. In order to make use of the weak data validation in cloud services, the file first needs to be split into several file parts. This fragmentation process is handled by the Application File Handler. Each of the file parts is then encoded into a file of a type that is accepted by the cloud services in use. To be more precise, the file types of the encoded file parts have to comply with the regulations of the cloud services. If only image files (e.g., PNG, JPEG, GIF) are allowed to be uploaded, the file parts have to be encoded into images. The number and order of the produced file parts, together with the applied encoding algorithms, are stored in the index.

Figure 3.1: Simplified PiCsMu protocol sequence diagram: File upload. [The diagram shows the Application Core instructing the Application File Handler to split the file into parts and encode them, uploading the encoded files to the Cloud Service, and storing the index on the P2P network or central server.]

Once all file parts are encoded, each is uploaded to one or more cloud services. The personal accounts of the user are used to gain access. In order to find all parts related to a file and to decode them using the correct algorithms, the index needs to be stored as well. Depending on the state of the file, shared or private, the index is stored on the P2P network or the central server, respectively.

The process of the file download works exactly the opposite way to the upload process. The Application Core obtains a Universally Unique Identifier (UUID) for the file to download. UUID is an identifier standard that enables distributed systems to uniquely identify information without central control. UUIDs are included in search results or share notifications. A notification is received when a PiCsMu user shares a file with a specific other user. The index is downloaded from the P2P/Central Server using the file UUID. With the information contained in the index, the Application Core knows all cloud services from which to download the necessary file parts in order to reconstruct the file.

Before the file can be reassembled, each file part needs to be decoded using the algorithm corresponding to its encoding. This way, the original data is obtained. With the number of file parts and their order known from the index, the file is reassembled and presented to the user.
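The decode-and-reassemble step can be sketched as follows. All names here (the index as a list of `(part_id, decoder_name)` pairs, the decoder table) are hypothetical and stand in for the actual PiCsMu data structures; the point is only the order-preserving concatenation driven by the index.

```python
def reassemble(index, fetched_parts, decoders):
    """Reassemble a file from downloaded encoded parts. `index` lists
    (part_id, decoder_name) pairs in the original fragmentation order,
    `fetched_parts` maps part_id to the encoded bytes downloaded from a
    cloud service, and `decoders` maps decoder names to functions that
    restore the original bytes of a part."""
    data = b""
    for part_id, decoder_name in index:
        encoded = fetched_parts[part_id]
        data += decoders[decoder_name](encoded)  # decode, then append in order
    return data
```

If a part were decoded with the wrong algorithm or appended out of order, the byte stream would no longer align and the file would be corrupted, which is exactly why both pieces of information live in the index.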

3.2 Index

The index is a collective term that describes the information resulting from the PiCsMu upload process. It contains all parameters necessary to locate and reconstruct a file in the PiCsMu network. The following parts build the index:

Figure 3.2: Simplified PiCsMu protocol sequence diagram: File download. [The diagram shows the Application Core retrieving the index from the P2P/Central Server, downloading the encoded files from the Cloud Service, and instructing the Application File Handler to decode the file parts and reassemble the file.]

Basic file information: Data to identify and describe a file in the PiCsMu system. Each file can be identified by a unique number. In addition, helpful user information such as the file name, description, tags, upload date, etc., is included.

Fragmentation: The information on how a file has been fragmented into multiple parts. In order to reconstruct a file, the exact order of fragmentation and the size of each file part need to be known. Otherwise, the bytes forming the data cannot be correctly aligned, and the file becomes corrupted and unusable.

Encoding/Decoding: Each file part is encoded separately. Hence, the individually used encoding algorithm and the embedded file type have to be stored in the index. Without this information, a receiver of the file would not be able to decode a file part in the correct way, resulting in loss of data.

Location: Knowledge of where a file part has been stored. This includes the individual locations (i.e., web addresses) of all cloud services used during the upload process as well as the authorization methods for them. Without proper authorization, PiCsMu would not be able to gain access to the file parts.
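The four parts listed above can be summarized in a small data model. The field names below are illustrative, not PiCsMu's actual schema; they only mirror the information the index must carry per file and per part.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FilePart:
    order: int            # position within the original file (fragmentation)
    size: int             # part size in bytes, needed for correct reassembly
    encoder: str          # encoding algorithm applied to this part
    embedded_type: str    # file type the part was embedded into (e.g. "PNG")
    location: str         # web address of the cloud service holding the part
    credential_id: str    # reference to the authorization method for that service

@dataclass
class Index:
    """Sketch of the index: basic file information plus one entry per part
    covering fragmentation, encoding, and location."""
    file_id: str                      # unique file identifier (UUID)
    name: str
    description: str = ""
    tags: List[str] = field(default_factory=list)
    parts: List[FilePart] = field(default_factory=list)
```

An instance of `Index` is what gets stored on the central server (private files) or in the DHT (shared files), possibly in encrypted form.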

An index is created each time a file is uploaded and is therefore specific and unique to each file. Only with all the information provided by the index can the application find all the corresponding file parts and reconstruct the original file. The reconstruction process consists of applying the correct decoding algorithm to each file part and assembling the parts in the right order. Private file storage and public file sharing use the index alike. If a file should not be shared and therefore kept private, the index is stored on a central server. This ensures that access to the index is restricted to the user in possession of the file. On the other hand, in order to enable public file sharing, the index needs to be accessible to everyone. It is stored in the P2P network and thus available to each PiCsMu user.

3.3 System Architecture

PiCsMu is designed with the goal of achieving a modular architecture. Aspects of a modular design are: (1) offering the possibility of subdividing a system into multiple components, (2) making each component independent and exchangeable, and (3) having well-defined interfaces between interacting components [22]. For this reason, the proposed system architecture is composed of four main components:

• PiCsMu Application

• Central Server

• Peer-to-Peer Network

• Cloud Services

Each component is characterized by specific properties and responsibilities explained in the following Sections 3.3.1 - 3.3.4. With this layout, future adaptations to the design should be possible without the need to change the complete architecture. It is possible to exchange a single component (e.g., use a different P2P implementation) as long as it complies with the existing interfaces.

An illustration of the system architecture components and their most important aspects is given in Figure 3.3. A view of the PiCsMu user and his or her role is shown as well. However, the user is not seen as part of the system architecture, because the user simply interacts with the application. Users provide data to the system by uploading and sharing files, and consume data by searching and downloading. In order to do this, the application needs to assign a unique identifier (i.e., the user name) as well as a public/private encryption key pair to the user. The only requirement the user must meet is to possess cloud service accounts (credentials). The role of these accounts is described in Section 3.3.4.

3.3.1 Application

Software consists of a set of functions and algorithms used to operate and instruct the underlying hardware. PiCsMu, as a software system, is mainly built from two parts: (1) application and (2) P2P, where the former is the core of the system architecture. Each system component is connected and instructed through the application and its given control flow. This interconnection is symbolized with a star topology (i.e., every node is connected only to a central node), illustrated in Figure 3.3.

PiCsMu is designed to run as a local program on the user's computer. In order to control the application, a Graphical User Interface (GUI) is used. Its goal is to abstract the underlying complex mechanisms and provide easily understandable and usable functions to the user. Hence, the GUI of PiCsMu is designed to organize a user's files in a simple and clear way. This includes an overview of his or her private files as well as the ones the user is sharing. Using the GUI, a user can access, upload, share, search, and download files.

Figure 3.3: PiCsMu system architecture components and characteristics. [The figure shows the PiCsMu user (data provider and consumer, cloud service account holder, owner of a public/private encryption key pair) accessing the system through the local PiCsMu application (local GUI, encoding/decoding of data, fragmentation/join of data, encryption/decryption of data, connections to the cloud service APIs, local database), which in turn connects to the cloud services (data storage, weak data validation, OAuth protocol), the PiCsMu central server (identity provider, application configuration, index storage for non-shared private files, P2P bootstrap server, friend list), and the Peer-to-Peer network (Distributed Hash Table, index storage for shared public files, share notifications, search domain).]

As the central point of the application flow, different functions are in place, each with its own area of responsibility. The following paragraphs explain the most important application algorithms in use:

Fragmentation/Join: The fragmentation is divided into two parts: (1) fragmentation planning and (2) the fragmentation process. In (1), PiCsMu builds a plan based on the file to upload and the available cloud service credentials. First, the system inspects the cloud service credentials and checks which file types can be uploaded. With a list of uploadable file types for each available credential, PiCsMu chooses an encoder according to the correct file type and estimates the total file size resulting from the encoding process. If the estimated encoded file size fits within the maximum file size accepted by the chosen cloud service, the corresponding file part is planned with that estimated size.
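The planning step can be sketched as a simple capacity-fitting loop. The model below is hypothetical: credentials are reduced to an allowed file type and a size limit, and each encoder is summarized by an overhead factor that estimates the encoded size from the raw part size. The real planner considers more constraints.

```python
def plan_fragmentation(file_size, credentials, encoders):
    """Fragmentation-planning sketch: assign to each cloud-service
    credential a part whose estimated encoded size fits that service's
    limit. `credentials` is a list of dicts with 'allowed_type' and
    'max_size'; `encoders` maps a file type to (encoder_name,
    overhead_factor), estimating encoded size as factor * raw size."""
    plan, remaining = [], file_size
    for cred in credentials:
        if remaining <= 0:
            break
        name, factor = encoders[cred["allowed_type"]]
        max_raw = int(cred["max_size"] / factor)  # largest raw part that still fits
        part = min(remaining, max_raw)
        plan.append({"encoder": name, "type": cred["allowed_type"], "raw_size": part})
        remaining -= part
    if remaining > 0:
        raise ValueError("not enough cloud capacity for this file")
    return plan
```

The resulting plan (part sizes, encoders, target services) is exactly the information that later ends up in the index.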

Encoding/Decoding: These algorithms are used in PiCsMu in order to make it more difficult for third parties (cloud services) to identify and validate the original data. Due to their weak data validation procedures, the original data is almost impossible to discover. Therefore, arbitrary data can be stored on any cloud service.

Data can be encoded into, e.g., image, audio, or text files. PiCsMu uses multiple encoding algorithms because certain cloud services only allow the upload of a specific data type. Based on Machado et al. [42], three types of encoders are used: (1) steganography-related, (2) FileFormatHeader-related, and (3) appending-related. Encoders in the scope of (1) make use of steganography, where data is hidden in a way intended to make it imperceptible to anyone apart from the sender and intended recipient. FileFormatHeader-related encoders use the file format header (which contains data about the file) to inject a high amount of data. The last type of encoder (3) simply appends data to the end of the file. Each encoder type has multiple implementations used for different file types. Decoding is the reverse process, used to read encoded data and restore its original form. An encoded file can only be read by using the exactly matching decoder (i.e., same type and implementation).
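The appending-related encoder type (3) is the simplest to illustrate. The sketch below concatenates the payload after an arbitrary carrier file and decodes by slicing at the recorded carrier length; the actual PiCsMu encoders are more elaborate, but the principle is the same: the result still begins with a valid carrier, and the offset needed for decoding is stored in the index.

```python
def append_encode(carrier: bytes, payload: bytes) -> bytes:
    """Appending-related encoder sketch: place the payload after the
    carrier file's regular content. Many parsers ignore trailing bytes,
    so the result still validates as, e.g., a PNG or GIF."""
    return carrier + payload

def append_decode(encoded: bytes, carrier_length: int) -> bytes:
    """Matching decoder: the carrier length (stored in the index) marks
    where the appended payload starts."""
    return encoded[carrier_length:]
```

Decoding with the wrong offset, or with a decoder of a different type, would yield garbage, which is why encoder type and implementation are recorded per part.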

Encryption/Decryption: PiCsMu uses different cryptographic techniques to share files with selected users and therefore encrypts index data stored on the central server and in the P2P network; this kind of sharing is called private file sharing. Everyone can access the index data stored in the P2P network's DHT, but only users able to properly decrypt it can read it. The applied method is explained in more detail in Section 3.5.1. Data on the central server is encrypted using Password-Based Encryption (PBE), where the user defines a secret password to access his or her files.
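A common building block of PBE is deriving the actual encryption key from the user's password via a key-derivation function. The standard-library sketch below uses PBKDF2-HMAC-SHA256 as one typical realization; it illustrates the idea and is not necessarily the exact scheme PiCsMu applies.

```python
import hashlib

def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 32-byte key from a user password with PBKDF2-HMAC-SHA256.
    The salt is stored alongside the ciphertext; only a user who knows
    the password can re-derive the same key and decrypt the index data.
    The iteration count slows down brute-force guessing."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
```

The derived key would then feed a symmetric cipher encrypting the index; the salt and iteration count can be stored in the clear next to the ciphertext without weakening the scheme.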

Connection handling: In order to exchange data between the application and the other components of the system, PiCsMu needs to be able to establish different network connections. The upload and download of data has to be handled separately for the central server, the P2P network, the different cloud services, and the user's local computer.

3.3.2 Central Server

Having one server in the system architecture might lead to the assumption that PiCsMu is an unstructured, centralized P2P system. Such system designs are characterized by the indispensable need for a central entity that organizes the network. However, PiCsMu is a structured, DHT-based P2P system that does not depend on a central entity for network functionality and data sharing. There is a central server in the system design, but it is not a specific part of the P2P network. From the view of the P2P system, the server is only used as a reliable bootstrap server to initially connect to the network.

The main responsibility of the PiCsMu central server is to store the index for private files in its database. PiCsMu is not only for sharing but also for simply storing and administrating personal files. If the user wants to store private files on PiCsMu that should not be shared with anybody, storing the index on the central server is preferable to storing it in the DHT. This design decision is based on reliability and security. Typical problems of a P2P system are churn and malicious behavior of peers. Churn is an effect that describes the independent arrival and departure of peers, which may lead to loss of data [57]. The content of a peer can only be accessed by other peers if that peer is online. When a peer departs from the network, it takes some time for other peers to be notified of this topology change. During this brief time, content queries may go unanswered [24]. There are mechanisms in the design of a P2P system that reduce this effect, but it is a problem to keep in mind. Leibnitz et al. [39] showed that servers are in general more reliable, especially because manipulated data is easier to inject into P2P networks. A drawback of servers is their performance under heavy network load and their susceptibility to directed attacks, e.g., Distributed-Denial-of-Service (DDoS) attacks. Based on these facts, the design decision is debatable and has to be evaluated in the future.

In addition to storing the index, the server is used as an identity and configuration provider. Identity in fully decentralized P2P systems has unresolved issues and is a big area of research [30, 43]. Malicious users can easily create multiple counterfeit identities to gain control of the P2P network (i.e., Sybil attacks). As a consequence, interactions between legitimate peers may be mediated by one of the counterfeit identities, thus being manipulated for the benefit of the attacker [30]. Therefore, a trustworthy centralized service (i.e., the central server) is used to provide identities. As an identity provider, the server holds all registered PiCsMu users and their account information. The identities are used in the authentication process to access private files as well as a method of identification to share files with other PiCsMu users. In order to build trustworthy relationships between users, each user can have a list of friends that is stored on the server. The list is obtained and updated every time the application starts. In private file sharing, users need to obtain the receivers' public keys from the server. To avoid requesting a public key manually, a user can easily share files with a selected number of friends by using the account names on the friend list.

Finally, the role as a configuration provider or update server is simply to update all local PiCsMu applications with the latest system configuration.

3.3.3 Peer-to-Peer Network

Allowing shared and distributed access to the resources of the network is a characteristic that makes a P2P network preferable for the exchange of data. PiCsMu makes use of this aspect to enable its file sharing.

One of the essential constructs in structured P2P systems is the DHT. A lookup service is provided by storing (key, value) pairs. Due to the distributed architecture, every peer can retrieve the value associated with a given key. A key uniquely identifies an entry in the data structure. DHTs offer a variety of features and properties that make them popular in P2P systems: (1) use of routing information for efficient content search, (2) scalability, which allows the system to work properly with a huge number of nodes, (3) fault tolerance against peer departures, and (4) having no bottlenecks as well as being self-organizing. DHTs can be implemented in P2P systems in several ways. Well-known examples include, e.g., TomP2P [18], Kademlia [44], Pastry [50], Chord [56], and Tapestry [61]. PiCsMu makes use of TomP2P to design and build its P2P network. The conception and design of the proposed DHT is explained in Section 3.4.

In order to gain access to the P2P network, each PiCsMu application has to define a client (i.e., peer) that can bootstrap to the central server. By design, one application stands for one peer. All peers have to be uniquely identifiable in the network, because otherwise, routing and content discovery would not be possible. Table 3.1 shows the design of a peer. Its unique identifier, or Peer Identifier (PeerID), is a randomly generated number that is assigned by the application. However, before joining the network, the application does not know if the PeerID is already in use by another peer. This is a common problem in distributed systems. Therefore, the globally unique combination of IP address, TCP port, and UDP port also identifies a peer in the network. Each personal computer that connects to the Internet has an IP address that enables the sending and receiving of data. Ports are application-specific identifiers that enable different applications, running on a single computer, to be addressed. As an example, if PiCsMu uses the IP address 205.140.64.1 and listens on TCP/UDP port 8001, all data sent from the P2P network to the address 205.140.64.1 on port 8001 is received by PiCsMu directly. Moreover, the use of different ports on the same IP address allows running multiple independent instances of PiCsMu.

Peer
  Attributes        Methods
  Peer Identifier   Bootstrap(IP Address, TCP Port, UDP Port)
  IP Address        Put(Key, Value)
  TCP Port          Get(Key)
  UDP Port          Remove(Key)

Table 3.1: Design of a peer

Apart from the attributes presented, peers have defined methods to interact with the P2P network and its DHT (cf. Table 3.1). Initial connections between peers are established by bootstrapping to an IP address and its available TCP/UDP ports. By default, all peers bootstrap to a predefined location (i.e., the IP address of the central server). However, if other peers are known, it is possible to connect to them directly. The exchange of data is basically handled by the PUT and GET operations on the DHT. Peers are additionally allowed to remove data from the network, as long as they originally provided it. By signing the PUT operation with a public key, only the peer holding the corresponding private key can call REMOVE on the same DHT key.
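The protection of REMOVE described above can be sketched with a small in-memory model. This is our own toy illustration, not the TomP2P API: a plain owner string stands in for the signing key pair, so only the peer that issued the PUT may remove the entry.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the peer operations in Table 3.1. The "owner" token models
// TomP2P-style entry protection: only the creator of an entry may remove it.
class ToyDht {
    private static class Entry {
        final Object value;
        final String owner; // stands in for the signing public key
        Entry(Object value, String owner) { this.value = value; this.owner = owner; }
    }

    private final Map<String, Entry> store = new HashMap<>();

    public void put(String key, Object value, String owner) {
        store.putIfAbsent(key, new Entry(value, owner)); // protected entries are not overwritten
    }

    public Object get(String key) {
        Entry e = store.get(key);
        return e == null ? null : e.value;
    }

    public boolean remove(String key, String requester) {
        Entry e = store.get(key);
        if (e == null || !e.owner.equals(requester)) return false; // wrong "private key"
        store.remove(key);
        return true;
    }
}
```

In the real system, this check is cryptographic: the DHT verifies a signature instead of comparing owner strings.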

There are four use cases where the P2P part of PiCsMu is in use. A detailed explanation is given in Section 3.6:

Share a file with everybody: The user likes to contribute to the information library of PiCsMu and therefore shares a file with the public community.

Share a file with selected users: There is a need to privately share a file with a limited set of PiCsMu users.

Search a file: The user searches a file that has been uploaded to the system by another PiCsMu user.

Receive a share notification: A file can also be downloaded upon receiving a share notification (a friend has shared a file specifically with the user).

3.3.4 Cloud Services

This component of the system is the only operating part not in direct control of PiCsMu. Cloud services are operated by their corresponding cloud providers and provide interfaces, called Application Programming Interfaces (APIs), to establish connections on a software basis. An API is a predefined protocol that allows programmatic use of the functions a service exposes. The architecture relies on the existence of third-party cloud providers and their storage space, because PiCsMu itself is an overlay that does not provide infrastructure (except its server). Hence, PiCsMu implements the APIs of different cloud services to be able to easily upload and download file parts. Further functionality may include, e.g., querying the service status or account information.

It is assumed that each PiCsMu user holds personal accounts on one or more cloud services used on his or her behalf. An arbitrary amount of those accounts can be associated with PiCsMu, as long as it is supported (i.e., having its API implemented). Currently supported public services include:

- Google Picasa
- Google Docs
- Google GMail
- Google+ Status Messages
- ImageShack
- Facebook Photos
- Facebook Status Messages
- SoundCloud
- TwitPic
- Imgur

One common aspect of these services that enables the idea behind PiCsMu is their weak data validation. Arbitrary files can be uploaded using fragmentation and encoding techniques. In order to enable a share function, it must be possible to temporarily access the cloud service accounts of other people to download file parts. Therefore, for all services used during the upload process, the user has to provide credentials that are stored with the index. While it would be possible to simply authorize using the account name and matching password, users tend to keep their accounts more private. OAuth [13] is used to ensure a secure authorization. OAuth is an open standard authorization protocol that gives client applications limited access to private server resources on behalf of the resource owner [36]. If a user enables OAuth on his or her cloud service accounts, PiCsMu can access those accounts without compromising their security.

3.4 Distributed Hash Table

In order to store index data in a distributed manner, PiCsMu uses the DHT provided by TomP2P [18] as the underlying mechanism for network functionality. The design aspects presented in this section are based on how to make optimal use of the functions available in TomP2P. Consequently, the goal is to present a DHT storage scheme that demands the fewest calls and therefore performs better.

3.4.1 Storage Concept using a Distributed Hash Table

Calls to the DHT are designed to follow a scheme that tries to create entries optimally. Table 3.3 represents this scheme and shows all possible generic entries. Key collisions and unintended redundancy should be avoided. When the PUT operation is called with an already existent key, the available entry is overwritten (if it is not protected), or the call is ignored. Therefore, TomP2P allows the creation of additional identifiers (domains). A domain is used together with a DHT key and builds a unique combination. As a consequence, an equal DHT key can be used multiple times in different domains, extending the key space (i.e., the set of available keys) by a large factor. Additionally, a more efficient lookup is possible. The proposed approach introduces three domains: (1) index domain, (2) search domain, and (3) notification domain. In the scope of (1), all values related to the index are stored. Domain (2) is used to store values that can be searched for (i.e., files publicly shared). As further explained in Sections 3.4.2 and 4.6, multiple keywords are stored for each file, demanding a large key space and justifying a separate domain. In addition, performance can be increased by limiting the search space to a single domain. The notification domain (3) stores user names and their associated share notifications. Each time a user privately shares a file with another user, a notification is placed in this domain under the receiver's user name. Files available for download can be seen by checking one's personal notification domain. Based on the same principle as (2), a separate domain is used to reduce the key space and increase lookup performance.

Use of Content Keys
  Put(location key, content key, value)   Result
  Put(0x123, 1, object1)                  -
  Put(0x123, 2, object2)                  -
  Get(location key, content key)          Result
  Get(0x123, 1)                           object1
  Get(0x123)                              object1, object2

Table 3.2: Example of Put/Get operations using location and content keys

Regular DHTs typically operate with one type of key (i.e., location keys) to store and retrieve values. One key is associated with one value. Yet there is often a need to store multiple values under the same key. Domains are able to solve this issue, but if it is desired to retrieve multiple values using only one key, domains are at their limits. Therefore, the specification of TomP2P allows the differentiation between location keys and content keys to design a DHT. By storing values under the same location key but with different content keys, it is possible to retrieve all values at once using one GET call. An example is given in Table 3.2. Two objects (object1, object2) are stored using the same location key (0x123) but with different content keys (1 and 2). With GET, either one specific value or all stored values can be retrieved.
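The two concepts, domains and content keys, can be modeled as a minimal sketch in Java: a nested map addressed by the triple (domain, location key, content key). Class and method names here are our own illustration, not the TomP2P API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a DHT key space with domains and content keys (cf. Table 3.2).
// GET with a content key returns one value; GET with only a location key
// returns all values stored under it.
class KeySpace {
    private final Map<String, Map<String, Map<Integer, Object>>> domains = new HashMap<>();

    public void put(String domain, String locationKey, int contentKey, Object value) {
        domains.computeIfAbsent(domain, d -> new HashMap<>())
               .computeIfAbsent(locationKey, k -> new TreeMap<>())
               .put(contentKey, value);
    }

    // Get(location key, content key): one specific value
    public Object get(String domain, String locationKey, int contentKey) {
        return values(domain, locationKey).get(contentKey);
    }

    // Get(location key): all values stored under the location key
    public Collection<Object> getAll(String domain, String locationKey) {
        return values(domain, locationKey).values();
    }

    private Map<Integer, Object> values(String domain, String locationKey) {
        return domains.getOrDefault(domain, Map.of()).getOrDefault(locationKey, Map.of());
    }
}
```

Because the domain participates in the lookup, the same location key 0x123 may hold different data in the index, search, and notification domains without collision.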

Based on the two newly introduced concepts, domains and content keys, the DHT storage scheme in Table 3.3 is proposed. Location and content keys are hashes of certain values. Hash functions (e.g., SHA-1 [46], SHA-2 [46], MD5 [49]) create fixed-length keys (i.e., hashes) from variable-length data. A hash function will always create the same key for equal input. Such hashes are important for the DHT's underlying routing algorithm. The abbreviation "enc." is short for encryption. In private sharing, all objects put to the DHT are encrypted using the public key of the receiver. Table 3.3 also shows the benefits of content keys. Since file parts are stored under the same file UUID, all parts can be retrieved using one GET(fileUUID), instead of having one DHT call for each file part. Reducing the number of calls on the DHT increases performance and is less susceptible to failures.

Distributed Hash Table
  Location Key                 Content Key               Value             Domain
  hash(FileUUID)               hash(File)                File              Index
  hash(FileUUID+UserID)        hash(enc.(File))          enc.(File)        Index
  hash(FileUUID)               hash(FilePartID)          File Part         Index
  hash(FileUUID+UserID)        hash(FilePartID)          enc.(FilePart)    Index
  hash(CredentialID)           hash(Credential)          Credential        Index
  hash(CredentialID+UserID)    hash(enc.(Credential))    enc.(Credential)  Index
  hash(UserID)                 hash(CredentialID)        enc.(Credential)  Index
  hash(UserName)               hash(FileUUID)            Notification      Notification
  hash(Keyword)                hash(FileUUID)            FileUUID          Search

Table 3.3: Storage scheme for the DHT in PiCsMu
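The key derivation used by the scheme, e.g., hash(FileUUID+UserID), can be illustrated with a small SHA-1 helper. This is a sketch; the helper names are ours, not PiCsMu's.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrates how fixed-length DHT keys are derived from variable-length
// identifiers, as in Table 3.3.
class DhtKeys {
    static String sha1Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString(); // 160 bits -> 40 hex characters, regardless of input length
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Location key of a privately shared file: hash(FileUUID + UserID)
    static String privateFileLocationKey(String fileUuid, String userId) {
        return sha1Hex(fileUuid + userId);
    }
}
```

The determinism of the hash is what makes routing work: every peer computes the same key for the same identifier.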

3.4.2 Enabling a Content-Based Search

One fundamental feature of file sharing systems is the possibility to search for available files. DHT-based systems lack support for a user-friendly content-based search and can only do exact-match search by default. Content-based search allows the lookup of files without knowing an exact keyword. Results similar to the specified search term are presented. The reason why DHTs are intrinsically unable to perform such a search is the use of hash functions. Calculating a hash on similar keys results in completely different hash values. As an example, if a file is stored using its file name as key, it can only be found by a DHT search if the exact same name is used again. If only one letter in the name is different, the hash function will create a totally different key and the DHT cannot retrieve the file. This is unwanted behavior when trying to search for files without knowing their exact keywords. Therefore, some P2P systems (e.g., BitTorrent, Freenet) do not support a direct key search but use other structures, like a web service.

PiCsMu is designed to support a content-based search by using a similarity algorithm. Bocek et al. [23] proposed P2P Fast Similarity Search (P2PFastSS), a similarity search algorithm based on the edit distance metric. The edit distance, generally referred to as Levenshtein distance, is a string metric to calculate the minimum number of operations required to transform one word into another, using the operations deletion, insertion, and replacement [40]. For example, the edit distance between house and mouse is 1, because the only operation needed to transform house into mouse is to replace h with m [23]. Based on such transformations, the algorithm defines a deletion neighborhood to search for similar words. The deletion neighborhood of a word is the set of all strings (i.e., neighbors) created by deleting each character of the word once (edit distance=1). Table 3.4 shows an example by generating the deletion neighborhood of the word fast.

Deletion neighborhood of fast
  Deleted Character   Neighbor
  f                   ast
  a                   fst
  s                   fat
  t                   fas

Table 3.4: Deletion neighborhood of the word fast with edit distance=1
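The neighborhood generation can be sketched in a few lines of Java. This is a minimal illustration of the P2PFastSS idea; the class and method names are ours, not code from the algorithm's authors.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Deletion neighborhood with edit distance 1, as in Table 3.4: every string
// obtained by deleting exactly one character of the word.
class DeletionNeighborhood {
    static Set<String> neighbors(String word) {
        Set<String> result = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }
}
```

For a word of length n, this produces at most n neighbors, which bounds the number of index entries per keyword.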

Enabling a content-based file search requires three steps for each uploaded file: (1) generation of keywords, (2) generation of the deletion neighborhood for each keyword, and (3) indexing of all created neighbors. Step (1) defines all keywords related to a file that a user is able to search for. When storing a file using PiCsMu, a user can define certain attributes which the content-based search will use to locate the file. The composition of a file is given in Table 3.5. Each attribute, except the file description, is included in the set of keywords as it is. Therefore, a user will be able to search for a file name or tag directly and receive a list of matching files. Since the file description is a text in natural language, it contains words that are unsuited to be keywords in a search, e.g., prepositions, conjunctions, and pronouns. Such words, called stop words, are commonly used and cannot identify a single file. Hence, the description needs to be filtered to identify a number of keywords that stand for the complete text. Therefore, a simple tag cloud algorithm is used. The algorithm removes the stop words from the description, counts the frequency of occurrence of the remaining words, ranks them in descending order, and adds the top results to the set of keywords.

Concept of a file
  Attribute           Description
  File name           Simple file name
  File description    A short text about the file
  Tags                List of words that categorize the file
  User name           PiCsMu user name of the uploader

Table 3.5: User-defined attributes of a file
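The tag cloud step can be sketched as follows: drop stop words, count the remaining words, and keep the most frequent ones as search keywords. The stop word list and the cut-off are illustrative choices of this sketch, not values taken from PiCsMu.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of a simple tag cloud algorithm for keyword extraction from a
// natural-language file description.
class TagCloud {
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "in", "on", "with", "is", "to");

    static List<String> topKeywords(String description, int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : description.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                counts.merge(word, 1, Integer::sum); // frequency of occurrence
            }
        }
        // rank in descending order of frequency and keep the top results
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```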

Upon generating all keywords, step (2) creates the deletion neighborhood for each keyword, as shown in Table 3.4. Step (3) indexes all search keywords and their generated neighbors by putting them into the DHT. In the following explanation, only the term keywords is used to describe the combination of search keywords and their neighbors. As explained in Section 3.4.1 and illustrated in Table 3.3, keywords are assigned to an individual domain in the DHT. They are stored using the keyword itself as location key and the UUID of the file they belong to as content key. Therefore, different files with similar keywords have their UUIDs stored under the same location keys.

After completion of all three steps, the file can be found by the content-based search. Search results are based on the joint possession of neighbors. Table 3.6 shows a neighbor comparison of the words east and west. By having at least one neighbor in common (here: est), the two words are considered to be similar. If a user searches for a word, step (2) of the process described above is applied again. For each generated neighbor, a DHT GET operation is performed, using the neighbor as a key. Thus, the user gets the UUIDs of all files whose keywords have at least one neighbor in common with the searched word. With a list of file UUIDs (i.e., the search result), selected files can be downloaded if desired.

Neighbor Comparison
  Operation                 east   west
  Deletion of character 1   ast    est
  Deletion of character 2   est    wst
  Deletion of character 3   eat    wet
  Deletion of character 4   eas    wes

Table 3.6: Comparison of the deletion neighborhood of two words with edit distance=1
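Indexing and querying can be combined into one sketch: every neighbor of a keyword becomes a location key holding file UUIDs (cf. the search domain in Table 3.3), and a query generates its own neighborhood and collects all UUIDs stored under any shared neighbor. Class and method names are our own illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the content-based similarity search over the search domain.
class SimilaritySearch {
    // models the search domain of the DHT: neighbor -> set of file UUIDs
    private final Map<String, Set<String>> searchDomain = new HashMap<>();

    static Set<String> neighborhood(String word) {
        Set<String> result = new LinkedHashSet<>();
        result.add(word); // the keyword itself is indexed too (exact match)
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    // step (3): put the file UUID under the keyword and each of its neighbors
    void index(String keyword, String fileUuid) {
        for (String n : neighborhood(keyword)) {
            searchDomain.computeIfAbsent(n, k -> new HashSet<>()).add(fileUuid);
        }
    }

    // query: one GET per generated neighbor, union of all results
    Set<String> search(String term) {
        Set<String> uuids = new HashSet<>();
        for (String n : neighborhood(term)) {
            uuids.addAll(searchDomain.getOrDefault(n, Set.of()));
        }
        return uuids;
    }
}
```

With east and west, the shared neighbor est makes a file indexed under east retrievable by a search for west, exactly the situation of Table 3.6.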

3.5 Privacy and Secure Data Exchange

The rapid growth of the Internet and the evolving possibility to use web services with simple registration (i.e., providing an e-mail address and password) and to store data apart from personal devices has led to a heightened awareness of the need to protect one's privacy from disclosure. A secure design is essential, since PiCsMu involves the use of personal cloud service accounts to store files and further offers a share functionality that allows others access to them. Computer security is concerned with protecting an automated information system in order to attain the applicable objectives of preserving the confidentiality, integrity, and availability of information system resources, e.g., data, software, and hardware [35]. This definition introduces three key requirements that embody the fundamental concepts in computer security. They are often referred to as the CIA triad [35, 53]:

Confidentiality: A requirement intended to assure that private or confidential information is not disclosed to unauthorized individuals. This includes protection of data as well as personal privacy.

Integrity: Integrity is often distinguished between data integrity and system integrity. The former assures that data is changed only in a specified and authorized manner, e.g., data that is exchanged between two persons over an open network cannot unknowingly be manipulated by a malicious third party. System integrity assures that a system performs its intended functions in an unimpaired manner.

Availability: Assures that systems can be reached reliably as intended and service is not denied to authorized users.

PiCsMu is designed to comply with the CIA requirements. Well-known concepts in computer security that have proven themselves are therefore integrated within the system architecture.

3.5.1 Cryptography

Cryptography is the practice of enabling secure communication over insecure communication media and therefore supports the construction of secure information systems that are robust against a variety of attacks. PiCsMu makes use of two cryptographic principles to assure confidentiality and integrity in the scope of private file sharing: (1) symmetric encryption and (2) public key cryptography.

Symmetric encryption, also referred to as secret-key encryption, is a method that encrypts data for secure transmission using a secret key. The secret key is shared by the sender and recipient and known only to them. However, the problem with this method is sharing the secret key itself. If the sender and recipient are not able to physically exchange the secret key, they have to share it over an open channel (e.g., the Internet), which is typically insecure.

Knowing the drawbacks of symmetric encryption, public key cryptography (or asymmetric encryption) was introduced. It allows making an encryption key publicly available, so there is no problem in exchanging it over an insecure communication medium. The method uses two related encryption keys, called a key pair. One key is kept secret (i.e., the private key), while the other (the public key) is shared with communication partners. A sender can then encrypt data using the receiver's public key. As a consequence, it is only possible for the holder of the corresponding private key (i.e., the receiver) to decipher the data.

Public key cryptography has yet another use besides providing confidentiality to trans- mitted data; that is: message authentication. Suppose that a recipient wants to assure that a received message does come, in fact, from the expected sender. In this case, the sender encrypts a digest of the message (e.g., a hash of it) with his/her own private key. Such an encrypted message digest is called a digital signature. Afterwards, the recipient can try to decipher the digital signature using the sender’s public key. If it is possible, the origin, content, and sequencing of the message is verified.

Data stored in the distributed environment of a DHT is accessible by all participants of the P2P network. In order to provide a private share functionality that complies with the CIA triad requirements, this storage needs to be made secure. PiCsMu allows sharing files privately with a number of other users. Before storing the index of a shared file in the DHT, it first runs through multiple encryption steps. Figure 3.4 illustrates the proposed security mechanism.

In order to create a digital signature, a secure hash function (e.g., SHA-1) is applied to the index to create a digest. Since the signature is stored with the index, its size should be as small as possible to avoid uploading a considerably large amount of data. The digital signature is then generated by encrypting the digest with the private key of the file provider. Alongside these steps, the index is encrypted with a randomly generated secret key (symmetric encryption). Advantages of secret keys are their fixed length and very small data size. Encryption speed depends on the length of the key and the size of the data to encrypt. Based on these facts, instead of encrypting a possibly large index multiple times, the small secret key is individually encrypted using the receivers' public keys. Decreasing the time needed to encrypt increases the overall performance of the system in equal measure.

Figure 3.4: Applied encryption scheme for private sharing

The result of this encryption step is appended to the encrypted index, as is the digital signature. This construct is generated for each receiver and then stored in the DHT. Summarizing, four steps are necessary to enable an index for private sharing:

1. Create a digital signature by encrypting an index digest with the sender’s private key.

2. Encrypt the index once, using a random-generated secret key.

3. Encrypt the previously generated secret key for each receiver, using their public keys.

4. Concatenate the results of steps 1-3 for each receiver and store these objects in the DHT.

Receiving the index works by applying the reverse process. First, the digital signature can be verified by trying to decipher it with the file provider's public key. The secret key is deciphered with the receiver's private key and then used to decipher the index. Finally, the shared file can be retrieved by using the index information. Encryption and the use of digital signatures provide confidentiality and integrity to PiCsMu, complying with the CIA triad requirements. Availability, as the last requirement, is also met by combining public key cryptography with TomP2P. All entries in the DHT are signed by the user who created them, using his or her public key. Therefore, an entry can only be removed by the user holding the corresponding private key, and the data is thus available as long as intended by the provider. As a result, the design of data exchange using the DHT meets all CIA triad requirements and is therefore considered secure.
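The four sharing steps and their reversal can be sketched with the standard Java cryptography APIs. This is a minimal illustration under assumed algorithm choices (AES for the secret key, RSA for the key pairs, SHA256withRSA for hash-then-sign); the parameters actually used by PiCsMu may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the hybrid encryption scheme for private sharing (Figure 3.4).
class PrivateShare {
    static Object[] share(byte[] index, KeyPair sender, PublicKey receiver) throws Exception {
        // Step 1: digital signature over the index digest, sender's private key
        Signature sig = Signature.getInstance("SHA256withRSA");
        sig.initSign(sender.getPrivate());
        sig.update(index);
        byte[] signature = sig.sign();

        // Step 2: encrypt the index once, with a randomly generated secret key
        SecretKey secret = KeyGenerator.getInstance("AES").generateKey();
        Cipher aes = Cipher.getInstance("AES");
        aes.init(Cipher.ENCRYPT_MODE, secret);
        byte[] encIndex = aes.doFinal(index);

        // Step 3: encrypt the small secret key with the receiver's public key
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, receiver);
        byte[] encSecret = rsa.doFinal(secret.getEncoded());

        // Step 4: the concatenated construct stored in the DHT (modeled as an array)
        return new Object[] { encIndex, encSecret, signature };
    }

    static byte[] receive(Object[] construct, PrivateKey receiver, PublicKey sender) throws Exception {
        // Reverse: decipher the secret key, then the index, then verify the signature
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.DECRYPT_MODE, receiver);
        SecretKey secret = new SecretKeySpec(rsa.doFinal((byte[]) construct[1]), "AES");

        Cipher aes = Cipher.getInstance("AES");
        aes.init(Cipher.DECRYPT_MODE, secret);
        byte[] index = aes.doFinal((byte[]) construct[0]);

        Signature sig = Signature.getInstance("SHA256withRSA");
        sig.initVerify(sender);
        sig.update(index);
        if (!sig.verify((byte[]) construct[2])) throw new SecurityException("bad signature");
        return index;
    }
}
```

Note that only the 16-byte secret key is wrapped with RSA; the index, however large, is encrypted exactly once, which is the performance argument made above.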

3.6 Share Functionality

One of the major goals of this master thesis is to provide the current PiCsMu system with a decentralized share functionality based on a P2P network. The proposed approach defines two modes of operation: (1) public sharing and (2) private sharing. Both are explained with a use case that sums up all proposed design decisions and again shows their primary purpose and application.

3.6.1 Public Sharing

Of the two modes of operation, public sharing is the simpler one, because it requires less complex functions (e.g., no cryptography) to store a file. Figure 3.5 illustrates a use case in which Alice shares a file with the PiCsMu community.

In the first step (1), Alice selects the file she would like to share with other PiCsMu users. During step (2), her local PiCsMu application splits the data of the file into multiple file parts according to the internal fragmentation plan. Based on the cloud service accounts Alice has registered, all file parts are individually encoded to a file type accepted by the data validation process of the cloud services. Afterwards, step (3) uploads the encoded file parts to the services and thus creates credentials that allow the used accounts to be temporarily accessed by other PiCsMu applications.

Step (4) shows how the fragmentation plan, applied encoders, and available credentials build the index of the file and how it is stored in the DHT. Storing the index follows a predefined scheme, trying to minimize the required number of DHT calls and therefore increasing performance. Since the file needs to be found by other PiCsMu users, the content-based search algorithm creates a set of keywords that categorize the file. These are stored in the DHT’s assigned search domain and become available as results of the file search.

Users search for files by providing PiCsMu with search terms. Hence, the application presents a list of files that contain similar keywords. Search results include a file UUID used to start a sequence of DHT calls necessary to retrieve the index. Step (5) shows how the index is obtained. Afterwards, in step (6), PiCsMu uses index information to locate, access, and download all file parts. In the final step (7), all file parts are decoded and reassembled according to the fragmentation and encoding plan in order to recreate the shared file.


Figure 3.5: Use case: Public sharing with all users of PiCsMu

3.6.2 Share Notification

PiCsMu features an automated function that notifies a user when someone else shares a file with him or her. As explained in Section 3.4.1, the DHT scheme reserves a domain for share notifications. After making a file and its index available in the PiCsMu system, notifications for all intended receivers are put into the DHT, according to its storage design. Table 3.7 shows the concept of a notification and its attributes.

Concept of a notification
  Attribute    Description
  File UUID    A unique identifier needed to start a file download
  User name    The PiCsMu user name of the person who shares this file

Table 3.7: Conception of a personal share notification used in private sharing

On each application start, PiCsMu checks the personal share notification domain for updates and downloads them. By using the file UUID contained in each notification, the system can start a file download. After completion, the sender's user name can be used to get the corresponding public key from the central server and use it to verify the digital signature of the file and therefore its integrity.

3.6.3 Private Sharing

Figure 3.6 shows a use case that illustrates the eight necessary steps to privately share a file between two PiCsMu users, Alice and Bob.

Step (1) defines the prerequisite condition that must be met for users to receive files. On account creation, PiCsMu generates a unique public cryptography key pair. The possession of such a key pair is the first requirement to enable private file sharing. Users need access to others' public keys, because they are needed to specifically encrypt files. Therefore, each user uploads the public key associated with his or her PiCsMu Identifier (PM-ID) to the central server.


Figure 3.6: Use case: Private sharing between two PiCsMu users

Sharing a file requires it to be uploaded to cloud services, using the fragmentation and encoding techniques applied by PiCsMu during step (2). The details of this process are identical to step (2) of the public sharing use case and can be looked up in Figure 3.5.

Step (3) shows how the encoded file parts are uploaded to their corresponding cloud services and, in return, define the credentials to access them. Encoding, fragmentation, and credentials build the index needed to retrieve and reconstruct the file. During step (4), Bob's public key is obtained by querying the server with his PM-ID, which is listed in the friend list of Alice. In step (5), the index is encrypted with a randomly generated secret key, which is afterwards encrypted with Bob's public key, making him the only receiver able to decipher and read the index.

In the scope of step (6), the encrypted index is put into the DHT using the storage scheme designed for it. In addition, Alice creates a share notification that includes her user name and the shared file’s UUID. This notification is stored as well in the DHT under Bob’s share notification domain. With this step, Alice concludes the private file sharing from her side. As long as she does not remove the index, PiCsMu assumes that the file is available.

When Bob starts his local PiCsMu application, it connects to the P2P network and queries the share notification domain in the DHT for his PM-ID. PiCsMu recognizes a new notification and informs Bob about the available file Alice shared with him. If he chooses to download the file, PiCsMu uses the file UUID contained with the notification to start a sequence of DHT calls necessary to retrieve the complete index. Step (7) illustrates this process.

Having completed the download of the index, PiCsMu first verifies the digital signature against the public key provided by the central server. Since Alice originated the file, the signature is approved and PiCsMu is able to decipher the index with Bob's private key, as shown in step (8). Finally, step (9) uses the index to locate all file parts, access the cloud services they are stored on, and download them. Like the last public-sharing step, each file part is first decoded and then assembled with the others, recreating the file Alice shared with Bob (cf. Figure 3.5).

Chapter 4

Implementation

This chapter describes how design decisions described in Chapter 3 are implemented and integrated into PiCsMu in order to enable the share functionality. A deeper view into the implemented protocols and mechanisms is given, with the support of selected figures and code fragments.

4.1 An Overview of the P2P Code Architecture

A simplified view of the code architecture is presented for a better understanding of the system and explanations in this chapter. Figure 4.1 shows a Unified Modeling Language (UML) class diagram reflecting the implemented classes. The proposed code can be divided into four parts (i.e., packages), each responsible for a certain functionality.

Classes of the API package provide interfaces to connect the P2P component with PiCsMu. They are the only part visible to the outside (e.g., to the application) and therefore encapsulate the other underlying mechanisms. This modular design approach makes it possible to maintain or exchange P2P functionality without changes to connected components. An explanation of how components communicate with the API and its classes is given in Section 4.4.2.

Network and P2P functions are part of the network package. IndexClientP2P represents a peer and provides various PUT and GET methods to interact with the DHT. These methods are implemented on top of TomP2P. Bootstrap is a condensed form of IndexClientP2P used to run a single peer that only listens for incoming connections. This is needed if the P2P system is solely intended as an access point for bootstrapping.

The index package serves as a bridge between the API and network packages. Requests coming from the application through the API are processed and forwarded to the IndexClientP2P. P2PIndex implements the DHT storage scheme by formulating DHT calls with the specific location key, content key, and domain. In addition, it applies encryption to data. Finally, the search package provides algorithms and data structures to enable a content-based search for DHTs. Section 4.6 explains the file search in detail.


[UML class diagram: DistributedIndexAPI, P2PIndexAPI, and the Callback hierarchy (index.distributed.api and index.distributed.api.callback); P2PIndex, DistributedIndex, and DistributedCryptography (index.distributed); IndexClientP2P (network.client) and Bootstrap (network.bootstrap); FastSimilaritySearch, DHTSearch, FrequencyMap, and Counter (index.distributed.search)]

Figure 4.1: Simplified UML class diagram of the P2P system’s code architecture

4.2 Data Exchange using Beans

Beans are reusable software components that conform to a particular convention and are a frequently used concept in the Java programming language. Their main purpose is to encapsulate many objects into a single one (i.e., the Bean) that can be transmitted as a sole entity. The original PiCsMu already made use of Beans to exchange data with the central server, e.g., FileBean, FilePartBean, and CredentialBean, among others. This concept becomes useful when integrating the P2P system into PiCsMu. The clear advantage is to put single objects (each encapsulating a lot of information) to the DHT, which is preferable in terms of performance compared to putting multiple objects one at a time. The same applies to retrieving objects from the DHT. Based on this concept, the P2P system introduces two new Beans: (1) DistributedCryptographyBean and (2) ShareNotificationBean.

The first is used to wrap encrypted data (e.g., an encrypted FileBean) and its digital signature in one object. This Bean is used in the case of private sharing and is further explained in Section 4.5. In the same context, the ShareNotificationBean is the object put into the DHT to inform another user that a file has been shared. It wraps the file UUID as well as the user name of the sender.
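To make the Bean convention concrete, the following is a minimal sketch of what such a notification Bean could look like. The class name, fields, and accessors follow the description above, but they are illustrative assumptions; the real ShareNotificationBean in PiCsMu may differ.

```java
import java.io.Serializable;

// Sketch of a Bean following the convention: private fields, a public
// no-argument constructor, and getters/setters. Serializable so it can
// be transmitted as a single entity. Names are illustrative only.
class ShareNotificationBeanSketch implements Serializable {
    private String fileUUID;   // UUID of the shared file
    private String senderName; // user name of the sender

    public ShareNotificationBeanSketch() { } // Bean convention: no-arg constructor

    public String getFileUUID() { return fileUUID; }
    public void setFileUUID(String fileUUID) { this.fileUUID = fileUUID; }

    public String getSenderName() { return senderName; }
    public void setSenderName(String senderName) { this.senderName = senderName; }
}
```

A sender would populate such a Bean and put it into the DHT as a single object, so the receiver obtains file UUID and sender name with one GET.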

4.3 File-Sharing Protocols

Figure 4.2 pictures the implemented sequence of DHT calls necessary to upload and download a file in public file sharing. The focus lies on the communication between application and DHT, neglecting the influence of cloud services and the central server. A public file upload requires four basic PUT operations to store the FileBean, FilePartBean, CredentialBean, and search keywords. A loop block indicates that the enclosed operations are executed multiple times until the loop condition is met (e.g., for each keyword of the file). Unlike GET, PUT operations do not expect an object to be returned from the DHT. However, the application needs to be informed on completion of the process to start subsequent method calls. The implemented concept that addresses this issue is explained in Section 4.4.1.

Upload (Application to DHT):
    PUT(Hash(FileBeanUUID), INDEX, FileBean)
    loop (for each FilePart of the list):
        PUT(Hash(FilePartBeanID), INDEX, FilePartBean)
    PUT(Hash(CredentialBeanID), INDEX, CredentialBean)
    loop (for each keyword of the file):
        PUT(Hash(keyword), SEARCH, FileBeanID)

Download (Application to DHT):
    GET(Hash(keyword), SEARCH) returns a list of FileBean UUIDs
    GET(Hash(FileBeanUUID), INDEX) returns the FileBean
    loop (for each FilePartID of the list):
        GET(Hash(FilePartBeanID), INDEX) returns a FilePartBean
    GET(Hash(CredentialBeanID), INDEX) returns the CredentialBean

Figure 4.2: Public file exchange protocol sequence

In order to download a publicly available file, a user needs to obtain a FileBean UUID. Querying the search domain returns a list of FileBean UUIDs matching the specified keyword. A FileBean is always the starting point for downloading since it contains information about the number of required FilePartBeans. In addition, each FilePartBean contains the identifier of the CredentialBean used to access the cloud service this FilePartBean is stored on. After obtaining all required Beans (i.e., the index), the encoded file parts of the actual file can be downloaded from the cloud services.

Private file sharing requires the same basic operations as public sharing but has different method signatures. Figure 4.3 shows the protocol and method calls. Contrary to public sharing, almost all methods use a generic DistributedCryptographyBean as an argument instead of a specific Bean. Since private sharing encrypts Beans to ensure data security and privacy, it requires another wrapper object to contain the encrypted data and its digital signature. Another difference between these file exchange protocols is the composition of DHT location keys. Both use the same location keys, but private sharing adds the receiver's identifier to each key to avoid ambiguity in the DHT. Identifiers of Beans are used to build location keys because they are unique in the system and therefore never create the same DHT entry (i.e., same location and content key) twice. If an entry already existed, a PUT call with the same location and content key would overwrite it or be ignored (depending on the methods used).

Upload (Application to DHT):
    PUT(Hash(FileBeanUUID+UserID), INDEX, DistributedCryptographyBean)
    loop (for each FilePart of the list):
        PUT(Hash(FilePartBeanID+UserID), INDEX, DistributedCryptographyBean)
    PUT(Hash(CredentialBeanID+UserID), INDEX, DistributedCryptographyBean)
    PUT(Hash(user name), SHARE_NOTIFICATION, FileBeanUUID)

Download (Application to DHT):
    GET(Hash(user name), SHARE_NOTIFICATION) returns a list of FileBean UUIDs
    GET(Hash(FileBeanUUID+UserID), INDEX) returns DistributedCryptographyBean(FileBean)
    loop (for each FilePartID of the list):
        GET(Hash(FilePartBeanID+UserID), INDEX) returns DistributedCryptographyBean(FilePartBean)
    GET(Hash(CredentialBeanID+UserID), INDEX) returns DistributedCryptographyBean(CredentialBean)

Figure 4.3: Private file exchange protocol sequence
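The composition of private-sharing location keys (hashing a Bean identifier concatenated with the receiver's identifier) can be sketched with a small helper. SHA-1 and the hexadecimal encoding below are assumptions chosen for illustration because SHA-1 yields the 160 bits common to Kademlia-style DHT keys; PiCsMu's actual hashing and TomP2P's key type may differ.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: derive a DHT location key from a Bean ID plus the receiver's ID.
// Appending the user ID makes the key unique per receiver, so two private
// shares of the same Bean to different users never collide in the DHT.
class LocationKeySketch {
    static String locationKey(String beanId, String userId) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest((beanId + userId).getBytes(StandardCharsets.UTF_8));
        // 160-bit digest rendered as 40 hex characters
        return String.format("%040x", new BigInteger(1, digest));
    }
}
```

The same Bean shared with two receivers thus produces two distinct location keys, which is exactly the ambiguity avoidance described above.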

Instead of searching, share notifications are used to obtain file UUIDs. A list of personal ShareNotificationBeans is retrieved from the DHT by querying it with the own user name as an argument. Only the rightful user is able to download and decrypt all associated Beans of a share notification's file UUID. As with the public-sharing download, all required Beans are obtained before starting the actual file download.

4.4 Asynchronous and Non-Blocking Communication

TomP2P is based on the concept of asynchronous, non-blocking communication, which allows an operation to start without waiting for the result or completion of a preceding operation. Since calls to the DHT need some time to be internally processed, an application without non-blocking communication would be required to stop processing and wait for the call to complete. Drawbacks of this method are (1) low performance (processing takes more time), (2) the possibility of deadlocks (competing operations each waiting for the other to finish, so that neither does), and (3) no parallel processing. Therefore, PiCsMu implements two concepts for handling asynchronous, non-blocking communication with its P2P component, explained in Sections 4.4.1 and 4.4.2.

4.4.1 Interacting with the DHT using Futures

Non-blocking communication in TomP2P is based on the notion of future objects. Futures keep track of DHT calls and their results; thus, a PUT or GET returns immediately, and the future object can later be processed to obtain the results of those operations. The implementation handles such future events with the use of Listeners, interfaces that await the reception of events in order to execute a method.

An example of this concept is given in Listing 4.1. The status of DHT.get(0x123) is temporarily stored in the future object in order to keep track of the operation. Afterwards, a Listener interface (i.e., BaseFutureAdapter) is added to the future and implemented as an anonymous inner class that defines an operationComplete(FutureDHT future) method. Since all asynchronous operations are internally handled by TomP2P, the application just needs to be notified when the operation has been completed. Exactly this notification event is received by the Listener and invokes the operationComplete(FutureDHT future) method. If the operation was successful, data from the DHT becomes available on the future object and can be retrieved.

// get the object with the specified location key
FutureDHT future = DHT.get(0x123);
future.addListener(new BaseFutureAdapter<FutureDHT>() {
    @Override
    public void operationComplete(FutureDHT future) throws Exception {
        if (future.isSuccess()) {
            // retrieve data stored with the future object
            System.out.println(future.getData());
        } else {
            System.out.println("failure");
        }
    }
});

Listing 4.1: Non-blocking communication using futures
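For readers without TomP2P at hand, the same future-plus-listener idea can be reproduced with the JDK's own CompletableFuture. The simulated lookup below is purely illustrative and not part of PiCsMu; it only mirrors the shape of Listing 4.1.

```java
import java.util.concurrent.CompletableFuture;

// Sketch: a non-blocking GET with a completion callback, analogous to
// TomP2P's FutureDHT plus BaseFutureAdapter. The "DHT" here is simulated
// by an asynchronous supplier returning a dummy value.
class FutureSketch {
    // simulate an asynchronous DHT lookup for a location key
    static CompletableFuture<String> get(int locationKey) {
        return CompletableFuture.supplyAsync(() -> "value@" + Integer.toHexString(locationKey));
    }

    public static void main(String[] args) {
        FutureSketch.get(0x123).whenComplete((data, error) -> { // listener, runs on completion
            if (error == null) {
                System.out.println(data);
            } else {
                System.out.println("failure");
            }
        }).join(); // join only so this short demo does not exit before completion
    }
}
```

As with TomP2P's futures, the call to get() returns immediately and the callback fires later on a worker thread.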

4.4.2 Asynchronous Callbacks and Event Handling

Since the application is the core of the system architecture and responsible for handling control flow, it demands a simple and lightweight interface to access the functionality of the P2P component. The terms "simple" and "lightweight" are used here in the sense of providing an interface that demands no knowledge of the underlying mechanisms and, at the same time, offers a manageable set of distinct methods.

DistributedIndexAPI is the implemented interface which provides a variety of PUT and GET methods to interact with the DHT. All these methods are asynchronous and would return future objects. However, since the application should be completely separated from the P2P part and is not responsible for handling futures, an additional structure, called Callbacks, is implemented.

Callbacks are logical Listener devices that listen for the completion of asynchronous operations on the DHT and make use of the observer design pattern to notify the application about their state or a raised event. Observer belongs to the family of behavioral design patterns and defines a one-to-many relationship between objects [31]. One object (i.e., the observable) maintains a list of dependent objects (i.e., observers) that are notified of changes in the observable. This design pattern is mainly used to implement distributed event handling.

In order to process the results of all different DHT operations, a set of Callback classes is provided, each having its unique purpose. Figure 4.4 pictures the implemented class hierarchy as a UML diagram. Callback is an abstract base class that serves as a template for its concrete subclasses. Since TomP2P uses different types of futures, a generic type parameter is defined which can represent any future object. In addition, Callback extends Observable and implements the BaseFutureListener interface, where the latter enables handling of futures in the same way as shown in the code sample in Listing 4.1. Observable allows all Callback classes to attach one or more observers to themselves and further provides methods to inform observers about changes in the state of the Callbacks. This concept is needed to inform the observer when a DHT call and its assigned Callback are completed, as well as when the processed result of the call is available on the Callback. The observer can be implemented in a GUI, which may block certain user interactions as long as an operation (e.g., a file download) is running and notify the user when it is completed.
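The interplay between an observable Callback and its observers can be sketched, with hypothetical names, as follows. Instead of java.util.Observable (deprecated in recent JDKs), a hand-rolled observer list is used here; the mechanics match the description above but the code is a simplified stand-in, not PiCsMu's Callback class.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the observer pattern used by the Callback classes: the
// callback (observable) stores the operation result and notifies all
// registered observers once the asynchronous operation completes.
class CallbackSketch {
    interface Observer { void update(CallbackSketch source); }

    private final List<Observer> observers = new ArrayList<>();
    private Object result;
    private boolean complete;

    void addObserver(Observer o) { observers.add(o); }

    // would be invoked by the future handler when the DHT operation finishes
    void operationComplete(Object data) {
        this.result = data;
        this.complete = true;
        for (Observer o : observers) {
            o.update(this); // pass the callback itself, as in notifyObservers(callback)
        }
    }

    boolean isOperationComplete() { return complete; }
    Object getResult() { return result; }
}
```

A GUI observer would implement update(...) to unblock the interface and present the result once it is notified.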

An example of this end-to-end callback process, and its involved classes with methods, is illustrated in Listing 4.2 and Figure 4.5.

public void getFileBean(GetBeanCallback callback, String fileUUID) {
    FutureDHT future = DHT.getFileBean(fileUUID);
    future.addListener(callback);
}

Listing 4.2: Non-blocking API method to request a file


Figure 4.4: Callback UML class diagram

Considering that the GUI is the observer, the following example illustrates the process. Alice selects a file she would like to download by interacting with the GUI. The issued command is forwarded to the application, and the GUI blocks any further interactions related to the same file in order to prevent flooding the system with the same request. Since the application holds an object of DistributedIndexAPI, it can interface with the P2P system by calling its methods. Listing 4.2 shows the method to get a file from the P2P system, providing as arguments the GetBeanCallback object and the requested file's UUID. GetBeanCallback is the type of Callback used when the future object from the DHT is expected to contain a Bean as data. Inside the method, a call to the DHT is made and the GetBeanCallback is added as a Listener to the resulting future. As soon as the GetBeanCallback is informed by the internal future handler of TomP2P, its operationComplete() method is called, which further invokes notifyObservers(Callback callback). This method notifies all attached observers about the completion of the GetBeanCallback by passing the object itself to their update(Object obj) method. The result contained in the GetBeanCallback object can then be processed by the application to start the actual file download. Afterwards, the file is presented to the user. Finally, the GUI releases the lock on the file and allows further interactions with it.

[Sequence shown in Figure 4.5: Alice clicks "Download" in the GUI; the application calls getFileBean(callback, UUID) on DistributedIndexAPI; P2PIndex issues getFileBean(UUID) to the DHT and a future is returned; on completion, operationComplete() fires on the GetBeanCallback, which calls notifyObservers(callback); the application's update(callback) retrieves the FileBean via getResult(), processes it, downloads the file, and presents it to the user via the GUI]

Figure 4.5: Asynchronous control flow of requesting a file

4.5 Implementing Security Measures

DistributedCryptography is the class responsible for providing security mechanisms in a distributed environment (i.e., the P2P system) to comply with the CIA triad requirements. It holds a unique public-key cryptography key pair as well as a key generator to create random symmetric secret keys. Since the class is used by different components of the system and relies on always having the exact same key pair, DistributedCryptography is implemented as a Singleton. Singleton is a design pattern that belongs to the family of creational patterns and is used when there must be exactly one instance of a class, accessible to clients from a well-known access point [31]. Advantages are (1) controlled access to the sole instance, (2) a reduced name space (avoiding the need for global variables), and (3) the possibility of refinement by subclassing.
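The Singleton requirement just described can be sketched in Java with a private constructor and a static accessor. The initialization-on-demand holder idiom shown here is one common thread-safe variant; the thesis does not specify which variant PiCsMu uses, and the real class's fields (key pair, key generator) are omitted.

```java
// Sketch: thread-safe Singleton via the initialization-on-demand holder
// idiom. The JVM guarantees that Holder is initialized exactly once, on
// first access, so no explicit synchronization is needed.
class CryptoSingletonSketch {
    private CryptoSingletonSketch() { } // prevents outside instantiation

    private static class Holder { // loaded lazily on first getInstance() call
        static final CryptoSingletonSketch INSTANCE = new CryptoSingletonSketch();
    }

    static CryptoSingletonSketch getInstance() {
        return Holder.INSTANCE;
    }
}
```

Every component calling getInstance() thus operates on the same instance and, in PiCsMu's case, on the same key pair.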

One of the most important methods of DistributedCryptography is shown in Listing 4.3. It includes all necessary steps to encrypt and sign an object for the designated receiver. The method relies on two different encryption keys: (1) a random 128/256-bit secret key and (2) a 1024/2048-bit public cryptography key. Key sizes depend on the Java Runtime Environment (JRE) installed on the personal computer. The minimal values are the default in all JRE installations and are considered to be secure, while the maximal values are optional.

public DistributedCryptographyBean encrypt(Data data, Key encryptionKey) {
    // wrapper to store in the DHT
    DistributedCryptographyBean bean;

    // new random AES secret key
    SecretKey secretKey = generateSecretKey();

    byte[] encryptedData = encryptAES(data.getData(), secretKey);
    byte[] encryptedKey = encryptRSA(secretKey.getEncoded(), encryptionKey);
    bean = new DistributedCryptographyBean(encryptedData, encryptedKey);
    sign(bean);

    return bean;
}

Listing 4.3: Public cryptography encryption

Key (1) (i.e., secretKey) is generated to be used by the Advanced Encryption Standard (AES) algorithm. AES is a symmetric encryption algorithm proposed in 2001 by the National Institute of Standards and Technology (NIST) as a replacement for the then-used Data Encryption Standard (DES) [53]. It provides equal or better security and is significantly more efficient. Key (2) (i.e., encryptionKey) is provided as an argument to the method and is used by the internal Rivest-Shamir-Adleman (RSA) encryption algorithm, named after its creators' last names. Contrary to AES, RSA is an asymmetric public-key cryptography algorithm based on the presumed difficulty of factoring large integers. It is considered secure because its key generation is based on a mathematical problem which is presumed to be impossible to solve in feasible time. However, should this mathematical barrier be overcome, it would become possible to reconstruct a private key from its corresponding public key. Still, this threat does not invalidate RSA and can be countered by increasing the key sizes.

Having these prerequisites, the data provided as an argument to the method is first encrypted with AES. Afterwards, the secretKey (AES) is encrypted with RSA using the public key of the intended receiver. The results of both steps are wrapped in a DistributedCryptographyBean object, which is later stored in the DHT. In addition, the created Bean is signed with a digital signature. Wrapping is a necessary step in order to have only one object that needs to be put to the DHT, saving multiple DHT calls.
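This hybrid scheme (data encrypted with a fresh AES key, the AES key wrapped with the receiver's RSA public key) can be reproduced end-to-end with the standard javax.crypto API. The sketch below performs a self-contained round trip; the cipher transformations and the 2048/128-bit key sizes are assumptions consistent with the text, not PiCsMu's exact configuration, and the digital signature step is omitted.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Sketch: hybrid AES+RSA round trip mirroring Listings 4.3 and 4.4.
class HybridCryptoSketch {
    static byte[] roundTrip(byte[] plaintext) throws Exception {
        KeyPairGenerator rsaGen = KeyPairGenerator.getInstance("RSA");
        rsaGen.initialize(2048);
        KeyPair receiver = rsaGen.generateKeyPair(); // receiver's key pair

        KeyGenerator aesGen = KeyGenerator.getInstance("AES");
        aesGen.init(128); // 128-bit AES works on every JRE without policy files
        SecretKey secretKey = aesGen.generateKey();

        // sender side: encrypt data with AES, wrap the AES key with RSA
        Cipher aes = Cipher.getInstance("AES");
        aes.init(Cipher.ENCRYPT_MODE, secretKey);
        byte[] encryptedData = aes.doFinal(plaintext);

        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, receiver.getPublic());
        byte[] encryptedKey = rsa.doFinal(secretKey.getEncoded());

        // receiver side: unwrap the AES key with RSA, then decrypt the data
        rsa.init(Cipher.DECRYPT_MODE, receiver.getPrivate());
        SecretKey recovered = new SecretKeySpec(rsa.doFinal(encryptedKey), "AES");
        aes.init(Cipher.DECRYPT_MODE, recovered);
        return aes.doFinal(encryptedData);
    }
}
```

Only the small AES key passes through the slow RSA operation, while the bulk data is handled by the fast symmetric cipher, which is the usual motivation for such hybrid schemes.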

Decryption applies the reverse process and is displayed in Listing 4.4. The provided DistributedCryptographyBean is deciphered using the private key stored with the application. With the decryptedKey object, it is possible to decipher and finally return the actual data. Digital signatures are verified at the moment of reception, before the DistributedCryptographyBean is provided as an argument to the decryption method.

public Data decrypt(DistributedCryptographyBean bean) {
    byte[] decryptedKey = decryptRSA(bean.getSecret(), privateKey);
    SecretKey key = new SecretKeySpec(decryptedKey, "AES");
    byte[] decryptedData = decryptAES(bean.getData(), key);
    return new Data(decryptedData);
}

Listing 4.4: Public cryptography decryption

4.6 File Search Functionality

Enabling a file for the content-based search requires three steps: (1) generation of keywords, (2) generation of the deletion neighborhood for each keyword, and (3) storage of all created neighbors in the DHT. Listing 4.5 illustrates these steps in pseudo code. Step (1) is further explained in Section 4.6.1 and step (2) in Section 4.6.2, each with actual code examples.

indexFile(File file) {
    // generate keywords and their neighbors
    Set keywords = generateKeywordsFromFile(file);
    Set neighbors = generateDeletionNeighborhood(keywords);

    Set searchKeys = new HashSet();
    searchKeys.addAll(keywords);
    searchKeys.addAll(neighbors);

    // store in the Distributed Hash Table
    for (String key : searchKeys) {
        DHT.put(key, file.getUUID(), SEARCH_DOMAIN);
    }
}

Listing 4.5: Content-based search pseudo code for indexing a file

After generating keywords and their neighbors, both are joined together and stored in the DHT using the PUT operation with the signature PUT(location key, value, domain). The content key is automatically set to a hash of the file UUID. Hence, multiple file UUIDs can be stored under the same location key, allowing one to retrieve multiple files with one search request.
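The resulting data layout, where one location key holds many content-key/value pairs, can be mimicked locally with a nested map. This stand-in is only meant to clarify the layout; it ignores the domain parameter and everything distributed about a real DHT.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Sketch: DHT-style storage where one location key (e.g., Hash(keyword))
// maps to many content-key/value pairs (e.g., Hash(fileUUID) -> fileUUID).
class DhtLayoutSketch {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    void put(String locationKey, String contentKey, String value) {
        store.computeIfAbsent(locationKey, k -> new HashMap<>()).put(contentKey, value);
    }

    // a GET on a location key returns all values stored under it
    Collection<String> get(String locationKey) {
        return store.getOrDefault(locationKey, new HashMap<>()).values();
    }
}
```

Two files indexed under the same keyword end up under the same location key with different content keys, so a single search GET returns both UUIDs.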

Searching for files requires similar steps. Listing 4.6 shows in pseudo code how a lookup works. First, the deletion neighborhood for the search word is generated in the same way as for indexing. Afterwards, a GET(location key, domain) call is executed, directly using the word as location key. This direct search is faster than the content-based search and might already provide results. Finally, the same operation is applied for each neighbor. The operation returns a collection of file UUIDs that are available for download.

Set search(String word) {
    // file UUIDs are Strings (Java UUID)
    Set result = new HashSet();
    Set neighbors = generateDeletionNeighborhood(word);

    // direct search
    result.add(DHT.get(word, SEARCH_DOMAIN));

    // content-based search
    for (String neighbor : neighbors) {
        result.add(DHT.get(neighbor, SEARCH_DOMAIN));
    }
    return result;
}

Listing 4.6: Content-based search pseudo code for searching a file

4.6.1 Keyword Generation

Files need to have distinct keywords in order to categorize and filter them during a file search. The implemented keyword generation algorithm collects all necessary attributes of a file to provide them as arguments to the P2PFastSS algorithm for neighbor generation. File description texts need additional processing to overcome the complexity of natural language. Listing 4.7 shows the method that generates keywords from a text in natural language provided as an argument.

private Collection<String> generateKeywords(String description) {
    FrequencyMap frequencyMap = new FrequencyMap();

    // remove punctuation
    String cleanText = cleanText(description);
    String[] words = cleanText.split("\\s+");

    for (String word : words) {
        // stop words are undesired words without deep meaning
        if (!isStopWord(word)) {
            frequencyMap.add(word.trim());
        }
    }
    return frequencyMap.getMostFrequentWords(WORD_FREQUENCY);
}

Listing 4.7: Keyword generation from file description

After the description is cleaned of punctuation, it is split into single words which are stored in an array. In the for-loop, each word is checked against a list of unwanted stop words and rejected if it matches one, so that only usable words are added to the frequency map. Stop words are provided by a file stored with the application, which can be updated by the central server.

FrequencyMap is a data structure designed for the sole purpose of keeping track of word occurrences. It is a hash-map-based implementation that provides a mapping between a word (key) and the frequency with which it occurs in the text (value). It is assumed that the more frequently a word occurs in the text, the more important it is. By calling the getMostFrequentWords(WORD_FREQUENCY) method, FrequencyMap determines all words with a frequency higher than, or equal to, the specified WORD_FREQUENCY. If no words with such a frequency are available, the method recursively decreases the frequency until a minimum number of words is found or the frequency reaches 1, which would return all words. The current version uses WORD_FREQUENCY=4 and expects a minimum of three words having this frequency. These values produced acceptable results in various test cases.
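A minimal FrequencyMap with the described recursive fallback might look as follows. The threshold handling is an interpretation of the text above, not the original class; in particular, passing the minimum word count as a parameter is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of FrequencyMap: counts word occurrences and returns all words
// whose frequency is >= the given threshold, lowering the threshold
// until at least minWords qualify or the threshold reaches 1.
class FrequencyMapSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    void add(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    List<String> getMostFrequentWords(int threshold, int minWords) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        if (result.size() < minWords && threshold > 1) {
            return getMostFrequentWords(threshold - 1, minWords); // recursive fallback
        }
        return result;
    }
}
```

With a threshold of 1 the recursion bottoms out and every counted word is returned, matching the behavior described above.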

4.6.2 Fast Similarity Search

The heart of the content-based search is the P2PFastSS algorithm, providing neighbor generation and comparison to identify similar words. Each keyword generates m^k neighbors, where m is the length of a word and k is the edit distance. Therefore, each keyword generates m^k + 1 DHT calls in order to store all neighbors plus the word itself in the search domain, making a total of n(m^k + 1) DHT calls, where n is the number of generated keywords per file. Considering performance, the generation of the deletion neighborhood is implemented with an edit distance of k=1 as the only selectable constant. Since the neighbor generation is an exponential function, and DHT calls are expensive in terms of time and performance, k is kept as small as possible. With the chosen k=1, the function becomes linear and therefore performs better. In addition, variables n and m may get arbitrarily large. Hence, they are limited to m<16 and n<10. These values are not proven to be optimal and should be evaluated further.
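Under these limits, the number of search-domain PUT calls per file can be bounded numerically. The helper below evaluates n(m^k + 1) for assumed values within the stated limits; for the worst case (n=9 keywords of length m=15, k=1) this yields 9 * 16 = 144 calls.

```java
// Sketch: DHT call count for the FastSS indexing scheme. Each of the
// n keywords of length m produces m^k deletion neighbors plus the word
// itself, i.e., n * (m^k + 1) PUT calls; for k = 1 this is n * (m + 1).
class CallCountSketch {
    static int dhtPuts(int n, int m, int k) {
        return n * ((int) Math.pow(m, k) + 1);
    }
}
```

This makes concrete why k=1 is the only supported edit distance: already k=2 would push the worst case from 144 to 9 * 226 = 2034 calls per file.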

Listing 4.8 shows the code fragment that generates neighbors for a specified word. The limits on the number of words and their length are applied before they are provided as arguments to the generateDeletionNeighborhood(String keyword) method. First, all words are converted to lowercase in order to avoid ambiguity. As illustrated, the algorithm loops through each character of the word, removes it, and stores the remainder in the neighbors set. Afterwards, the previously removed character is reinserted to correctly build the other neighbors.

public Set<String> generateDeletionNeighborhood(String keyword) {
    StringBuilder sb = new StringBuilder(keyword.toLowerCase());
    int length = keyword.length();
    Set<String> neighbors = new HashSet<String>(length);

    for (int i = 0; i < length; i++) {
        char c = sb.charAt(i);
        sb.deleteCharAt(i);
        neighbors.add(sb.toString());
        sb.insert(i, c);
    }
    return neighbors;
}

Listing 4.8: Method to generate the deletion neighborhood with an edit distance k=1

Chapter 5

Evaluation

This chapter describes test scenarios covering the most important functions of the proposed P2P-based approach. A qualitative and a quantitative evaluation are conducted to investigate whether the system works as intended and whether it performs in a scalable manner. The obtained results are presented and discussed.

5.1 Testbed and Evaluation Setup

The proposed approach is evaluated in a controlled environment on the CSG testbed infrastructure [2]. Table 5.1 summarizes the evaluation setup. Three locally connected physical machines are used to run the software and simulate a network of peers interacting with each other. One machine (b04) acts as central server, serving as (1) identity provider and (2) P2P bootstrap server. In addition, the b04 node runs a file upload service in the local network intended to simulate cloud services (e.g., ImageShack or SoundCloud). The file upload service receives a file through HTTP as input, returning a Uniform Resource Locator (URL) with the location of the resource. Users can then download the file, as the URL of the resource has a unique identifier. The other nodes (n13 and n15) each run multiple instances of PiCsMu, equally distributing the computational load. Running multiple instances of PiCsMu requires a certain amount of Random Access Memory (RAM) for handling the Java Virtual Machines (JVMs). Therefore, a maximum of 100 active peers is defined to simultaneously run on the network, equally divided between n13 and n15. This value is chosen with regard to the individual machines' computational power, allowing a stable evaluation.

Machine   Specification                                          Role
b04       4 cores, 2.8 GigaHertz (GHz), 32 GigaBytes (GB) RAM    Central server, service
n13       2x12 cores, 2.5 GHz, 64 GB RAM                         Runs 50 peers
n15       2x12 cores, 2.5 GHz, 64 GB RAM                         Runs 50 peers

Table 5.1: Evaluation setup


5.1.1 Test Cases

Evaluation focuses on four high-level test cases, closely reflecting the system's design: (1) files being publicly shared, (2) files being privately shared with at least one other PiCsMu user, (3) public files being searched and downloaded, and (4) private files being downloaded after receiving a share notification. These test cases simulate user behavior and cover all designed and implemented functions. A detailed list of the test cases employed to gather data is shown in Table 5.2. Each test case is subject to the following assumptions:

- Peers bootstrap to machine b04.
- Peers are locally connected and listen on different ports.
- 100 peers are initially online, but may leave afterwards.
- All peers use the same service with the same credentials.
- Random peers are chosen to execute operations.
- The same files are used in each test.
- Test cases are repeated at least three times under the same circumstances.

5.2 Qualitative Evaluation

The qualitative evaluation has the goal to assess the system's functionality by reasoning about its correctness. The evaluation process comprises two parts: (1) unit testing and (2) system test cases. In the scope of (1), Section 5.2.1 describes a method for measuring the coverage of unit tests. Section 5.2.2 presents selected test cases of typical system scenarios. By showing their correctness, a general statement about the system's overall functionality can be made.

5.2.1 Code Coverage using JUnit Tests

Software testing is an important step in evaluating the quality of a product. Its goals are to (1) identify errors (i.e., software bugs), (2) check if the software meets all defined requirements, and (3) verify that it operates as expected. This part of the evaluation focuses on unit testing, a method in which small parts of the source code are tested with associated control data. Thereby, it is possible to check if these units behave in the expected manner. Ideally, unit tests are independent of each other and individually isolate a part of the program to test its correctness. Testing should not only be done at the end of a project but as a continuous process, since it drastically facilitates integration and helps in detecting problems early. In the scope of this thesis, the JUnit framework [10] is used to conduct software testing.

Test Case   Description

(1.A)   One user publicly shares an 8 Megabytes (Mbytes) file.
(1.B)   One user publicly shares a 50 Mbytes file.
(1.C)   One user publicly shares a 180 Mbytes file.
(1.D)   One user publicly shares a 750 Mbytes file.
(1.E)   One user publicly shares a 1 Gigabytes (Gbytes) file.

(2.A)   One user privately shares an 8 Mbytes file with one user.
(2.B)   One user privately shares a 50 Mbytes file with one user.
(2.C)   One user privately shares a 180 Mbytes file with one user.
(2.D)   One user privately shares a 750 Mbytes file with one user.
(2.E)   One user privately shares a 1 Gbytes file with one user.

(3.A)   One user privately shares a 50 Mbytes file with one user.
(3.B)   One user privately shares a 50 Mbytes file with 10 users.
(3.C)   One user privately shares a 50 Mbytes file with 80 users.

(4.A)   One user downloads a publicly shared 8 Mbytes file.
(4.B)   One user downloads a publicly shared 50 Mbytes file.
(4.C)   One user downloads a publicly shared 180 Mbytes file.
(4.D)   One user downloads a publicly shared 750 Mbytes file.
(4.E)   One user downloads a publicly shared 1 Gbytes file.

(5.A)   One user downloads a privately shared 8 Mbytes file.
(5.B)   One user downloads a privately shared 50 Mbytes file.
(5.C)   One user downloads a privately shared 180 Mbytes file.
(5.D)   One user downloads a privately shared 750 Mbytes file.
(5.E)   One user downloads a privately shared 1 Gbytes file.

(6.A)   Multiple users simultaneously share a public 50 Mbytes file.
(6.B)   Multiple users simultaneously share a public 180 Mbytes file.
(6.C)   Multiple users simultaneously share a private 50 Mbytes file.
(6.D)   Multiple users simultaneously share a private 180 Mbytes file.

(7.A)   Multiple users simultaneously download a publicly shared 50 Mbytes file.
(7.B)   Multiple users simultaneously download a publicly shared 180 Mbytes file.
(7.C)   Multiple users simultaneously download a privately shared 50 Mbytes file.
(7.D)   Multiple users simultaneously download a privately shared 180 Mbytes file.

(8.A)   One user searches a keyword with 50 results.
(8.B)   One user searches a keyword with 500 results.
(8.C)   One user searches a keyword with 1000 results.

(9.A)   One user retrieves 50 share notifications.
(9.B)   One user retrieves 500 share notifications.
(9.C)   One user retrieves 1000 share notifications.

Table 5.2: Test cases simulating user behavior in a file-sharing system

Code coverage is a technique to measure the degree to which source code has been tested. It records which lines of code were executed during test cases and therefore indicates which parts of the code remain untested and may still contain software bugs. This type of code coverage is often referred to as statement coverage. However, even if a code coverage of 100% is achieved, which is untypical in large software projects, this measure does not guarantee that the software works perfectly. The coverage degree only indicates how much of the code is exercised by test cases but gives no evidence about their correctness. An example of visual code coverage is given in Listing 5.1. Green colored lines are fully covered by at least one test case, yellow lines are only partially covered, and red lines have not been executed by any test. Partially covered means that some instructions or branches are missed, which typically happens in if-else statements. In this example, only the case in which the statement evaluates to true is handled.

public String encodePublicKeyBase64() {
    if (publicKey != null) {
        byte[] key = this.publicKey.getEncoded();
        return Base64.encodeBase64String(key);
    }
    return null;
}

Listing 5.1: Method to illustrate code coverage
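For illustration, a minimal test exercising both branches of the method in Listing 5.1 could look as follows. This is a sketch, not part of the PiCsMu code base: the class KeyHolder is a hypothetical stand-in for the class holding the key, and the JDK's java.util.Base64 replaces the Apache Commons Codec class used in the listing.

```java
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.util.Base64;

// Hypothetical stand-in for the class containing the method of Listing 5.1.
class KeyHolder {
    private final PublicKey publicKey;

    KeyHolder(PublicKey publicKey) { this.publicKey = publicKey; }

    // Same logic as Listing 5.1, using the JDK Base64 encoder.
    public String encodePublicKeyBase64() {
        if (publicKey != null) {
            byte[] key = this.publicKey.getEncoded();
            return Base64.getEncoder().encodeToString(key);
        }
        return null;
    }
}

public class EncodePublicKeyTest {
    public static void main(String[] args) throws Exception {
        // Branch 1: a real key yields a non-empty Base64 string.
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        PublicKey key = gen.generateKeyPair().getPublic();
        String encoded = new KeyHolder(key).encodePublicKeyBase64();
        if (encoded == null || encoded.isEmpty())
            throw new AssertionError("expected Base64 output");

        // Branch 2: a null key yields null -- the branch left uncovered (red) above.
        if (new KeyHolder(null).encodePublicKeyBase64() != null)
            throw new AssertionError("expected null for missing key");

        System.out.println("both branches covered");
    }
}
```

Adding the null-key case turns the yellow (partially covered) line green, since both outcomes of the if-statement are then executed.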

In the scope of this qualitative evaluation, JUnit code coverage is measured and discussed. Its goal is to assess the quality of the system by evaluating how thoroughly it is tested, assuming that a tested method operates as designed. The freely available software EclEmma [4] in combination with Eclipse [5] was used to measure code coverage of this thesis' Java project. Code coverage is measured only for packages containing operational code. Test and helper classes are ignored, because they are never executed in a real scenario and would therefore distort the result. Results are presented in Figure 5.1.

An overall coverage rate of 95.1% is achieved, having tested 9,667 out of 10,166 statements. In addition, all 19 Java classes have at least one test case, and 18 out of 194 methods remain untested. Classes of the callback package have the lowest coverage. An investigation of the source code shows that error handling (i.e., exceptions) is not covered by tests. Callbacks fail if their future object is unsuccessful (i.e., an operation on the DHT has failed) or an exception is thrown in the underlying BaseFutureListener. Neither case should occur in a simple JUnit test due to the conditions of its controlled environment, e.g., a small number of peers, no parallel access to the DHT, or use of a local network. However, these errors become possible in the real application and therefore need to be provoked explicitly in JUnit tests. Many untested methods are convenience methods (e.g., getters, setters, or generic PUTs) added for the sake of future adaptations of the source code. They are not used in the current implementation but become helpful when extending the project. Nevertheless, these reasons do not justify skipping any test cases, since programming errors are inevitable and should be caught beforehand.

Figure 5.1: Code coverage result (covered and missed statements in % for the packages beans, index, cryptography, client, api, callback, and search, plus the total)

In summary, it can be said that the achieved code coverage of 95.1% indicates a well-tested and functional implementation. However, this result cannot assure the quality and correctness of individual JUnit tests. An analysis of the missing 4.9% shows two untested areas: (1) error handling and (2) convenience methods. Methods in the scope of (1) especially warrant testing, leaving such a task as an important target for future work.

5.2.2 Functional Testing

Functional testing is another method in the area of software testing for validating parts of the system. It belongs to the class of black-box testing, which examines an application's visible functionality (e.g., sharing a file) without looking into its inner implementation details. Contrary to unit testing, functional testing makes use of the system's complete functionality and does not try to isolate parts of it. In the scope of this evaluation section, the test cases defined in Table 5.2 are executed and assessed for successful completion. The results are presented in Table 5.3. The symbol "✓" shows that the test case was successful, having no errors in all test runs; the symbol "✗" means that a problem occurred in at least one test run.

Test Cases   1   2   3   4   5   6   7   8   9
A            ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
B            ✓   ✓   ✓   ✓   ✓   ✗   ✓   ✓   ✓
C            ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓
D            ✓   ✓   –   ✓   ✓   ✓   ✗   –   –
E            ✗   ✓   –   ✓   ✓   –   –   –   –

("–" marks combinations not defined in Table 5.2.)

Table 5.3: Functional evaluation results

In test case (1.E), node b04 ran out of unallocated RAM during the first test run, because the Java application in control of the file upload service buffered too much data in RAM. The test completed successfully after restarting the test setup and increasing the JVM virtual memory and heap size of the file service application. During test case (6.B), 3 peers lost connection to the network and timed out, failing to finish the share action. Peers may time out if they are inactive over a defined period, but are able to reconnect to the network at any time. Discovering why these peers failed is difficult and would require more sophisticated testing and diagnosis methods. However, this error only occurred in 1 out of 3 test runs. In (7.D), the download failed because the peer was not able to retrieve a CredentialBean. An investigation of the source code revealed an error in the handling of Callbacks: at the moment of running the evaluation, the application contained a bug that did not wait for successful completion of the Callback. Therefore, the CredentialBean was not yet available on the future object, causing the CredentialBean not to be found.

5.3 Quantitative Evaluation

Quantitative evaluation is an assessment process to measure the system's performance in different scenarios. Test cases are executed, and the times of individual processes are measured to identify possible bottlenecks. In order to assess the proposed design and implementation, the times of the following actions are measured:

(1) Total time to share a file: This measurement includes the following actions: (a) preparation of a file (i.e., fragmentation and encoding), (b) encryption of index data (only in private sharing), (c) storage of index data in the DHT, and (d) upload of file parts to the service.

(1.b) Encryption of index data: Measures the time of encrypting index data for re- ceivers in order to assess the proposed encryption scheme.

(1.c) Storage of index data in the DHT: To investigate the proposed DHT storage scheme, this action observes the time of performing PUT calls to the DHT to store index data.

(2) Total time to receive a shared file: This measurement includes the following ac- tions: (a) download of index data from the DHT, (b) decryption of index data (only in private sharing), (c) download of all file parts from the service, and (d) assembly of a file (i.e., decoding and join).

(2.a) Download of index data from the DHT: Measures the time of performing GET calls to retrieve index data, and therefore investigates performance of the DHT stor- age scheme.

(2.b) Decryption of index data: Measures the time of decrypting index data.

(3) Search and share notification: The time to search a file or retrieve share notifica- tions is not included in measurement (2). It is separately investigated because it is another user action. However, the time of performing necessary DHT calls to store keywords and share notifications is part of measurement (1) since it belongs to the same process.
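The per-action timings listed above can be collected with a simple wrapper around each sub-action. The following sketch is illustrative only and not PiCsMu's actual instrumentation; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

public class ActionTimer {
    private final Map<String, Long> timingsNs = new LinkedHashMap<>();

    // Run an action, record its wall-clock duration, and return its result.
    <T> T time(String action, Supplier<T> body) {
        long start = System.nanoTime();
        T result = body.get();
        timingsNs.put(action, System.nanoTime() - start);
        return result;
    }

    long millis(String action) {
        return timingsNs.get(action) / 1_000_000;
    }

    public static void main(String[] args) {
        ActionTimer timer = new ActionTimer();
        // Stand-ins for the measured sub-actions of a share operation.
        timer.time("encrypt index", () -> "encrypted");
        timer.time("store index in DHT", () -> "stored");
        System.out.println("measured " + timer.timingsNs.size() + " actions");
    }
}
```

Wrapping each sub-action (fragmentation, encryption, DHT storage, upload) this way yields the per-action breakdown used in the measurements below.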

DHT calls and cryptographic methods are designed to be fast and should not add complexity to the file upload and download, which are the expected bottlenecks. This setup makes it possible to assess this claim by comparing the individual times of the different system actions.

First, file uploads with different file sizes are investigated. Figure 5.2 shows the total upload times, which include fragmentation, encoding, encryption (in case of private upload), DHT storage, and upload to the service. This test case is based on privately sharing with only one user in order to compare the effective process times of public and private sharing, showing the direct impact of encryption. If a file is privately shared with multiple users, the time added by encryption is negligible, since the file upload needs to be done for each user, significantly increasing the total time. Private upload is expected to take longer because of the additional encryption steps. As illustrated, this assumption holds except for the 8 Mbytes and 1 Gbytes files. A single encryption step takes 6 milliseconds on average and is thus very fast. In the case of the 8 Mbytes file, only 5 Beans (i.e., 1 x FileBean, 2 x FilePartBeans, and 2 x CredentialBeans) were encrypted and uploaded, adding 30 milliseconds of encryption time to the process. This value is insignificant in comparison to the upload time, which can vary for small files due to slightly different upload speeds. As the upload speed becomes more constant with larger files and more Beans need to be encrypted, private upload shows the expected higher total times. An investigation of the 1 Gbytes file's total time revealed heavy parallel access to the file upload service during public sharing, causing a bottleneck. File upload at full speed was therefore not possible, resulting in a longer total time for public sharing.

File size:       8 Mbytes   50 Mbytes   180 Mbytes   750 Mbytes   1 Gbytes
Public share:    6.57       15.00       45.40        180.28       260.14
Private share:   5.28       17.69       51.43        184.16       243.54

Figure 5.2: Total time of public and private file-sharing file upload actions (in seconds)

In comparison, Figure 5.3 illustrates the total DHT and encryption times of sharing files. Public and private sharing generally require equal DHT times to store index data, as long as the file is shared with one person. Additional receivers linearly increase the total DHT time. In the scope of public sharing, the number and length of the generated search keywords influence the DHT time as well. Each keyword generates m+1 additional calls, m being the length of the keyword. In comparison, private sharing only requires one additional DHT call for putting the share notification. For the sake of clarity, Figure 5.3 only considers the DHT time for storing the index. As illustrated, the time needed to store the index directly relates to the number of file parts created by fragmenting the file. Fragmentation is based on the file types allowed by cloud services and is chosen randomly from the available encoders, which varies the number of created file parts even for identical files. In general, the larger a file is, the more file parts the fragmentation process tends to create. For example, the 180 Mbytes file is split into 50 parts. Each FilePartBean plus its CredentialBean has to be stored in the DHT, generating 100 DHT calls. In addition, one more DHT call is required to store the FileBean, making a total of 101 DHT calls for storing the index of a 180 Mbytes file. PUT operations require 30 milliseconds on average, resulting in a total DHT time of 3.03 seconds, closely reflecting the measured and displayed average value. In private sharing, the time added by encryption increases equally with the number of file parts.

File size:       8 Mbytes   50 Mbytes   180 Mbytes   750 Mbytes   1 Gbytes
Public share:    0.11       0.56        2.80         16.80        24.30
Private share:   0.14       0.68        3.40         20.40        29.51

Figure 5.3: Time of public and private file-sharing DHT and encryption upload actions (in seconds)

In summary, the time of the designed DHT storage scheme for PUT operations strongly correlates with the number of file parts, requiring 2 calls per part. Considering that the DHT shows linear performance even with a large number of file parts, the complexity added by the designed storage scheme is minimal. Further optimization can therefore only be achieved by reducing the number of file parts.
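The PUT accounting above (2 calls per file part plus one for the FileBean, and m+1 calls per keyword of length m in public sharing) can be sketched as a back-of-the-envelope estimate. The method names are illustrative only; the 30 ms average PUT time is the value measured in this evaluation.

```java
import java.util.List;

public class DhtPutEstimate {
    static final double AVG_PUT_MS = 30.0; // measured average PUT time

    // Index storage: 1 FileBean + per part (1 FilePartBean + 1 CredentialBean).
    static int indexPutCalls(int fileParts) {
        return 1 + 2 * fileParts;
    }

    // Public sharing additionally stores each keyword plus its substrings:
    // m + 1 calls for a keyword of length m.
    static int keywordPutCalls(List<String> keywords) {
        return keywords.stream().mapToInt(k -> k.length() + 1).sum();
    }

    public static void main(String[] args) {
        int calls = indexPutCalls(50); // the 180 Mbytes file: 50 parts
        double seconds = calls * AVG_PUT_MS / 1000.0;
        System.out.println(calls + " PUT calls, ~" + seconds + " s");
    }
}
```

For the 180 Mbytes example this reproduces the 101 PUT calls and roughly 3 seconds of DHT time discussed above.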

The evaluation of the file download action is similar to the file upload. Downloading is generally faster because PiCsMu does not need to calculate a fragmentation plan in advance; it just decodes and joins file parts based on the index information. The total download times are presented in Figure 5.4. In contrast to the upload measurements, the file service was not flooded with incoming requests in these tests, resulting in the expected time differences between public and private downloads. Again, the time spent for searching files or retrieving share notifications is not included in this measurement, because it is investigated separately as measurement (3).

File size:       8 Mbytes   50 Mbytes   180 Mbytes   750 Mbytes   1 Gbytes
Public share:    4.11       15.65       39.12        158.86       222.14
Private share:   4.30       17.82       39.79        164.34       229.78

Figure 5.4: Total time of public and private file-sharing file download actions (in seconds)

Interesting results are presented in Figure 5.5. The time spent on retrieving the index from the DHT is significantly lower than storing it, e.g., 4.21 seconds compared to 16.8 seconds for a publicly shared 750 Mbytes file with an average of 300 file parts. Two reasons explain this observation: (1) DHT operation time and (2) the DHT storage scheme. In the scope of (1), measurements show that DHT GET operations take 14 milliseconds on average, which is about half the time of a PUT. However, this is no evidence for the proposed design, because it only reflects TomP2P's DHT performance. Still, even when doubling the time of GET calls (i.e., ignoring DHT performance differences), downloading remains twice as fast. The proposed design (2) allows storing multiple FilePartBeans under their common file UUID, reducing DHT calls by a factor of two. By calling the method API.getFilePartBeans(fileUUID), all FilePartBeans are retrieved using one single DHT GET operation. The combination of observations (1) and (2) makes the total DHT time of public downloads 4 times faster than that of uploads. Decryption additionally influences the download times of privately shared files. Measurements show an average decryption time of 3 milliseconds compared to 6 milliseconds for encryption. Even if file parts can be retrieved in one call, each needs to be decrypted separately. As an example, downloading the index of a 750 Mbytes file with 300 file parts requires 302 DHT calls (i.e., 1 x getFileBean, 1 x getFilePartBeans, and 300 x getCredentialBean) and therefore takes 4.28 seconds, assuming average GET time. In addition, all 601 Beans (i.e., 1 x FileBean, 300 x FilePartBeans, and 300 x CredentialBeans) have to be decrypted, taking another 1.8 seconds, making a total of 6.08 seconds in this example. Given these results, further optimization could be possible by storing CredentialBeans under a common identifier similar to FilePartBeans. Under this assumption, retrieving the index would only require 3 DHT calls and perform extremely well. However, this is hard to achieve, since CredentialBeans are associated with FilePartBeans and therefore lack a common identifier. A possible approach is to identify which file parts, if any, use the same credential and group them accordingly, already decreasing DHT calls by a certain amount.

File size:       8 Mbytes   50 Mbytes   180 Mbytes   750 Mbytes   1 Gbytes
Public share:    0.04       0.15        0.71         4.21         6.09
Private share:   0.05       0.21        1.01         6.01         8.69

Figure 5.5: Time of public and private file-sharing DHT and encryption download actions (in seconds)

Finally, an investigation of search and share notification performance is done by measuring the time it takes to retrieve certain search and notification results. A fast operating procedure is essential for usability, since these two features (i.e., integrated search and private sharing) are meant to distinguish PiCsMu from other P2P file-sharing applications.

Users interacting with popular P2P clients, such as BitTorrent, are accustomed to fast file searches, because they can use external Web services. Figure 5.6 shows the evaluation results. The use of separate DHT domains for search keywords and notifications, together with the possibility of storing multiple values under the same location key, has proven to be a good design choice. Even retrieving a large number of results from the DHT performs considerably fast, being on par with C/S Web searches. However, the recorded data set is not large enough to draw reliable conclusions about the system's behavior in real-case scenarios with hundreds or thousands of concurrent users.

5.4 Additional Issues and Problems

In the scope of a more sophisticated evaluation, it would be interesting to analyze the system in a distributed environment, measuring how DHT times compare to file upload and download times of real cloud services exposed to latency. In addition, stress tests could be done to see whether the performance of a long-running system changes. Thereby, more accurate results could be gathered.

Results:             50      500     1000
Search:              0.039   0.083   0.157
Share notification:  0.042   0.106   0.186

Figure 5.6: DHT times for retrieving search results and share notifications (in seconds)

Considering PiCsMu's current implementation, the system runs stably in a controlled environment with a manageable number of peers. However, in order to make the software publicly available, its stability in real networks demands further investigation.

Removing data from the P2P network is not handled ideally. While it is technically possible for peers to remove their data from the network, there might be unforeseen consequences. In addition, data might not be replicated sufficiently and can get lost. Therefore, search results may return invalid file UUIDs pointing to non-existing files.

In the same context, the impact of expiring OAuth credentials has not been investigated. Depending on how cloud providers use the OAuth protocol, the provided credentials expire after a defined period. Then, PiCsMu needs to either update the existing DHT entries or upload the complete file again.

Related to security, if a user forgets his or her password to authenticate against the central server, he or she will not be able to recover personal files anymore. However, this is a common issue when using external services. Furthermore, there are possibilities to misuse the DHT's storage scheme, e.g., flooding a user's share notification domain.

Chapter 6

Summary and Conclusions

Cloud computing has given rise to a variety of cloud storage services to store, exchange, and manage files between personal computers. Using such services requires compliance with their data restrictions, limiting uploads to certain file types. Machado et al. [42] investigated how to bypass the data validation processes of cloud providers in order to store arbitrary data. By applying encoding, data can be transformed into any format representation accepted by a cloud provider's data validation process, rendering it unrecognizable. Based on this idea, PiCsMu was created and emerged as a prototypical C/S implementation. PiCsMu offers an overlay on top of other cloud services to store and share files. By utilizing index data, the system can keep track of where and how it distributes data. As with other cloud storage services, PiCsMu offers a centralized way of sharing files. However, centralized systems bear drawbacks for file sharing, e.g., bottlenecks at the server, a single point of failure, and the ease of shutting them down through governmental restrictions. Therefore, the objective of this thesis is to design, implement, and evaluate a P2P-based system to enable a distributed share functionality in PiCsMu.

The proposed design is modular and adaptive, allowing future changes to the project with minimal effort. It defines four main components: (1) application, (2) central server, (3) cloud services, and (4) P2P network. Each component is an isolated entity that defines clear interfaces to access its functionality. As long as all components comply with the provided interfaces, exchanging them or making internal changes is possible without significantly influencing the others. Components (1)-(3) reflect PiCsMu's original implementation, while the P2P system and its interface are the main contribution of this thesis.

In order to provide a system with good performance, a DHT storage scheme is proposed with the goal of minimizing the required DHT calls, since they are time consuming. The concept is based on functionality offered by TomP2P, a freely available DHT implementation. Through a careful combination of location keys, content keys, and domains, DHT calls are reduced to a minimum, making especially GET operations perform well.
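To illustrate the idea behind the storage scheme, a minimal in-memory model of a DHT addressed by (location key, domain key, content key) might look like this. It mimics the behavior of API.getFilePartBeans(fileUUID) — all file parts share the file UUID as location key and differ only in their content keys — but it is a simplification for illustration and not TomP2P's actual API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class StorageSchemeSketch {
    // (locationKey, domainKey) -> (contentKey -> value)
    private final Map<String, Map<String, Object>> store = new HashMap<>();

    private static String slot(String locationKey, String domainKey) {
        return locationKey + "/" + domainKey;
    }

    void put(String locationKey, String domainKey, String contentKey, Object value) {
        store.computeIfAbsent(slot(locationKey, domainKey), s -> new HashMap<>())
             .put(contentKey, value);
    }

    // One GET returns every value stored under the same location key and domain.
    Collection<Object> getAll(String locationKey, String domainKey) {
        return store.getOrDefault(slot(locationKey, domainKey), Map.of()).values();
    }

    public static void main(String[] args) {
        StorageSchemeSketch dht = new StorageSchemeSketch();
        String fileUUID = "file-1234";
        // All FilePartBeans share the file UUID as location key ...
        dht.put(fileUUID, "fileparts", "part-1", "FilePartBean 1");
        dht.put(fileUUID, "fileparts", "part-2", "FilePartBean 2");
        dht.put(fileUUID, "fileparts", "part-3", "FilePartBean 3");
        // ... so a single GET retrieves them all.
        System.out.println(dht.getAll(fileUUID, "fileparts").size() + " parts in one GET");
    }
}
```

Separate domains (e.g., for search keywords and share notifications) keep independent data sets under the same location key from colliding.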

Another design challenge in distributed systems is to provide security. Private file sharing makes use of public key encryption to securely exchange data between users. Since a lot of data needs to be encrypted in the case of private file sharing with multiple users, data is first symmetrically encrypted with a random key, which is afterwards encrypted with the receivers' public keys. This method allows fast encryption for considerably high numbers of receivers. In addition, PiCsMu makes use of digital signatures in combination with the central server as identity provider to verify the origin of contents. Therefore, a user can always be assured that share notifications originate from the declared senders.
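The hybrid scheme described above can be sketched with the JDK's crypto classes: the data is encrypted once with a random AES key, and only that small key is wrapped with each receiver's RSA public key. The algorithm parameters (AES/GCM, RSA-2048, PKCS#1 padding) are illustrative choices, not necessarily those used in PiCsMu.

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class HybridEncryptionSketch {
    public static void main(String[] args) throws Exception {
        byte[] indexData = "index data for one file part".getBytes(StandardCharsets.UTF_8);

        // 1. Encrypt the payload once with a random symmetric key.
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey aesKey = kg.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher aes = Cipher.getInstance("AES/GCM/NoPadding");
        aes.init(Cipher.ENCRYPT_MODE, aesKey, new GCMParameterSpec(128, iv));
        byte[] ciphertext = aes.doFinal(indexData);

        // 2. Wrap only the small AES key per receiver -- cheap even for many receivers.
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair receiver = kpg.generateKeyPair();
        Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        rsa.init(Cipher.WRAP_MODE, receiver.getPublic());
        byte[] wrappedKey = rsa.wrap(aesKey);

        // 3. The receiver unwraps the AES key and decrypts the payload.
        rsa.init(Cipher.UNWRAP_MODE, receiver.getPrivate());
        SecretKey unwrapped = (SecretKey) rsa.unwrap(wrappedKey, "AES", Cipher.SECRET_KEY);
        aes.init(Cipher.DECRYPT_MODE, unwrapped, new GCMParameterSpec(128, iv));
        byte[] plaintext = aes.doFinal(ciphertext);

        System.out.println(Arrays.equals(indexData, plaintext) ? "round trip ok" : "mismatch");
    }
}
```

Because only the key-wrapping step is repeated per receiver, the cost of adding receivers grows with the number of receivers, not with the size of the data.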

Besides its ability to privately share files, PiCsMu offers an integrated content-based file search to distinguish itself from other P2P file-sharing systems. By utilizing a similarity algorithm (i.e., P2PFastSS), it overcomes the limitation of DHTs that normally allow only exact-match search. The algorithm is adapted to seamlessly interact with the system's design.

The implementation and designed concepts are evaluated in a controlled environment by running multiple instances of PiCsMu and building an interacting P2P network. The evaluation goal is an assessment of the system’s feasibility. Qualitative evaluation shows that the system performs as designed in selected test cases. In addition, results of unit tests and functional tests provide strong evidence of the system’s correctness. However, even a well-tested system is susceptible to errors. Functional testing identified unresolved problems in sharing large files. Furthermore, there are connectivity and network problems if the system is employed in real networks having firewalls and closed ports.

In the scope of the quantitative evaluation, performance measurements of the DHT storage scheme and encryption concept are assessed in relation to total file upload and download times. An assessment of the achieved times is difficult, since there is no comparable data. However, considering the total time of an operation (i.e., upload or download of a file), the complexity added by the DHT is minimal. Uploading a 180 Mbytes file in the test environment takes 45.4 seconds, requiring 2.8 seconds of DHT time. Download times are even more negligible, due to the improved use of the DHT storage scheme. Retrieving information from the DHT takes 0.71 seconds out of a total download time of 39.12 seconds. Measurements show an average PUT of 30 milliseconds and an average GET of 14 milliseconds. Encryption takes 6 milliseconds on average, whereas decryption is twice as fast, requiring 3 milliseconds. The results confirm a strong correlation of the system's performance with the number of file parts, especially in the scope of file upload. Therefore, reducing the number of file parts is the best way to increase performance.

Based on the results achieved, this thesis concludes that the proposed design is valid and feasible, and a possible starting point for an open source project.

Future work should address the stability of PiCsMu in real networks, dealing with firewalls, closed ports, and unstable connections. An investigation of TomP2P's ability to use Universal Plug and Play (UPnP) for Network Address Translation (NAT) traversal and port forwarding detection is of great interest. Related to security, share notifications can be improved to avoid flooding one's domain with unwanted notifications. In order to tackle possible unwanted share notifications, a social capability could be introduced: PiCsMu users could have a mutually acknowledged friendship relation in order to build a trust relationship. However, a central server is recommended to support such a task, since friendship relations could be manipulated by malicious users in a distributed P2P network. Considering the DHT, each friendship relation could then have a key pair -- a shared private and public key, stored in the central server -- and, therefore, its own share notification domain. Domains can be protected with public keys, allowing only holders of a corresponding private key to read and write to them. Therefore, only acknowledged friends are able to share notifications with each other. As far as the implementation is concerned, error handling needs to be improved and tested more thoroughly before releasing a first version of the software.

Bibliography

[1] Amazon Web Services Website: “Products and Services”. Available at: http://aws.amazon.com. Last visited on: 05.01.2013.

[2] CSG Website: “Testbed Infrastructure for Research Activities”. Available at: http://www.csg.uzh.ch/services/testbed.html. Last visited on: 26.01.2013.

[3] Dropbox Website: “Simplify your life”. Available at: https://www.dropbox.com. Last visited on: 06.12.2012.

[4] EclEmma Website: “Java Code Coverage for Eclipse”. Available at: http://www.eclemma.org. Last visited on: 20.01.2013.

[5] Eclipse Website: “The Eclipse Foundation open source community website”. Available at: http://www.eclipse.org. Last visited on: 25.01.2013.

[6] Freenet Website: “The free network”. Available at: https://freenetproject.org. Last visited on: 29.11.2012.

[7] FrostWire Website: “Share Big Files”. Available at: http://www.frostwire.com. Last visited on: 26.12.2012.

[8] Google Drive Website: “Keep everything. Share anything”. Available at: https://drive.google.com/start. Last visited on: 06.12.2012.

[9] Google Picasa Website: “Organize, edit, and share your photos”. Available at: http://picasa.google.com. Last visited on: 05.01.2013.

[10] JUnit Website: “Java Testing Framework”. Available at: http://junit.sourceforge.net. Last visited on: 20.01.2013.

[11] Microsoft Office 365 Website: “Get big-business tools at a small-business price”. Available at: http://www.microsoft.com/en-us/office365/small-business-home.aspx. Last visited on: 05.01.2013.

[12] µTorrent Website: “A (very) tiny BitTorrent client”. Available at: http://www.utorrent.com. Last visited on: 29.11.2012.

[13] OAuth Website: “An open protocol to allow secure authorization in a simple and standard method from web, mobile and desktop applications”. Available at: http://oauth.net. Last visited on: 05.01.2013.


[14] Otixo Website: “One App for all Your Clouds”. Available at: http://otixo.com. Last visited on: 06.12.2012.

[15] PiCsMu Website: “Platform-independent Cross Storage System for Multi-Usage”. Available at: http://www.pics.mu. Last visited on: 29.01.2013.

[16] SkyDrive Website: “Free your files”. Available at: http://windows.microsoft.com/en-US/skydrive/download. Last visited on: 06.12.2012.

[17] SpiderOak Website: “Making data backup, synchronization and sharing 100% pri- vate, flexible and secure”. Available at: https://spideroak.com. Last visited on: 06.12.2012.

[18] TomP2P Website: “A P2P-based high performance key-value pair storage library”. Available at: http://tomp2p.net. Last visited on: 11.01.2013.

[19] Vuze Website: “The Only Way to Download Torrents”. Available at: http://www.vuze.com. Last visited on: 08.01.2013.

[20] Wuala Website: “Secure Cloud Storage”. Available at: http://www.wuala.com. Last visited on: 06.12.2012.

[21] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, and Matei Zaharia. Above the Clouds: A Berkeley View of Cloud Computing. Technical report, University of California at Berkeley, Berkeley, CA, USA, February 2009.

[22] Carliss Y. Baldwin and Kim B. Clark. Design Rules, Vol. 1: The Power of Modularity. The MIT Press, first edition, March 2000.

[23] Thomas Bocek, Ela Hunt, David Hausheer, and Burkhard Stiller. Fast Similarity Search in Peer-to-Peer Networks. In 11th IEEE/IFIP Network Operations and Management Symposium (NOMS 2008), pages 240–247, Los Alamitos, April 2008. IEEE.

[24] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking and Applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.

[25] Rajkumar Buyya, James Broberg, and Andrzej M. Goscinski. Cloud Computing Principles and Paradigms. Wiley Publishing, 2011.

[26] Simone Cirani and Luca Veltri. A Multicast-Based Bootstrap Mechanism for Self- Organizing P2P Networks. In Global Telecommunications Conference, 2009. GLOBE- COM 2009. IEEE, pages 1 –6, December 2009.

[27] Ian Clarke, Scott G. Miller, Theodore W. Hong, Oskar Sandberg, and Brandon Wiley. Protecting Free Expression Online with Freenet. Internet Computing, IEEE, 6(1):40–49, February 2002.

[28] Ian Clarke, Oskar Sandberg, Matthew Toseland, and Vilhelm Verendel. Private Communication Through a Network of Trusted Connections: The Dark Freenet, 2010. Available at: https://freenetproject.org/papers.html. Last visited on: 04.12.2012.

[29] Bram Cohen. Incentives Build Robustness in BitTorrent. In Proceedings of the 1st Workshop on Economics of Peer-to-Peer Systems, 2003.

[30] Weverton Luis da Costa Cordeiro, Flávio Roberto Santos, Gustavo Huff Mauch, Marinho Pilla Barcelos, and Luciano Paschoal Gaspary. Securing P2P Systems from Sybil Attacks through Adaptive Identity Management. In Network and Service Management (CNSM), 2011 7th International Conference on, pages 1–6, October 2011.

[31] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.

[32] Ilaria Giannetti, Giuseppe Anastasi, and Marco Conti. Energy-Efficient P2P File Sharing for Residential BitTorrent Users. In Computers and Communications (ISCC), 2012 IEEE Symposium on, pages 000524 –000529, July 2012.

[33] Matthew Green. Napster Opens Pandora’s Box: Examining How File-Sharing Ser- vices Threaten the Enforcement of Copyright on the Internet. Ohio State Law Jour- nal, 63(799), 2002.

[34] Lei Guo, Songqing Chen, Zhen Xiao, Enhua Tan, Xiaoning Ding, and Xiaodong Zhang. A Performance Study of BitTorrent-like Peer-to-Peer Systems. IEEE Journal on Selected Areas in Communications, 25(1):155–169, 2007.

[35] Barbara Guttman and Edward A. Roback. SP 800-12. An Introduction to Computer Security: the NIST Handbook. Technical report, National Institute of Standards & Technology, Gaithersburg, MD, United States, 1995.

[36] Dick Hardt. The OAuth 2.0 Authorization Framework. RFC 6749 (Proposed Standard), October 2012.

[37] Sjoerd Langkemper. Bootstrapping Gnutella Using UDP Host Caches, May 2005.

[38] Nikolaos Laoutaris, Damiano Carra, and Pietro Michiardi. Uplink Allocation Beyond Choke/Unchoke: or How to Divide and Conquer Best. In Proceedings of the 2008 ACM CoNEXT Conference, CoNEXT ’08, pages 18:1–18:12, New York, NY, USA, 2008. ACM.

[39] Kenji Leibnitz, Tobias Hoßfeld, Naoki Wakamiya, and Masayuki Murata. Peer-to-Peer vs. Client/Server: Reliability and Efficiency of a Content Distribution Service. In Proceedings of the 20th International Teletraffic Conference on Managing Traffic Performance in Converged Networks, ITC20’07, pages 1161–1172, Berlin, Heidelberg, 2007. Springer-Verlag.

[40] Vladimir Iosifovich Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10(8):707–710, February 1966.

[41] Eng Keong Lua, Jon Crowcroft, Marcelo Pias, Ravi Sharma, and Steven Lim. A Survey and Comparison of Peer-to-Peer Overlay Network Schemes. Communications Surveys & Tutorials, IEEE, 7(2):72–93, 2005.

[42] Guilherme Sperb Machado, Fabio Hecht, Martin Waldburger, and Burkhard Stiller. Bypassing Cloud Providers Data Validation To Store Arbitrary Data. IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), Ghent, Belgium, May 2013. (to appear).

[43] Sergio Marti and Hector Garcia-Molina. Identity Crisis: Anonymity vs. Reputation in P2P Systems. In Peer-to-Peer Computing, 2003. (P2P 2003). Proceedings. Third International Conference on, pages 134–141, September 2003.

[44] Petar Maymounkov and David Mazières. Kademlia: A Peer-to-Peer Information System Based on the XOR Metric. In Peter Druschel, Frans Kaashoek, and Antony Rowstron, editors, Peer-to-Peer Systems, volume 2429 of Lecture Notes in Computer Science, pages 53–65. Springer Berlin Heidelberg, 2002.

[45] Oded Goldreich. Foundations of Cryptography: Volume 2, Basic Applications. Cambridge University Press, New York, NY, USA, 1st edition, 2009.

[46] National Institute of Standards and Technology (U.S.). Secure Hash Standard (SHS). U.S. Dept. of Commerce, National Institute of Standards and Technology, Gaithersburg, MD, 2008.

[47] Jiayin Qi, Hongli Zhang, Zhenzhou Ji, and Liu Yun. Analyzing BitTorrent Traffic Across Large Network. In 2008 International Conference on Cyberworlds, pages 759–764, 2008.

[48] Matei Ripeanu. Peer-to-Peer Architecture Case Study: Gnutella Network. In Peer-to-Peer Computing, 2001. Proceedings. First International Conference on, pages 99–100, August 2001.

[49] Ronald Linn Rivest. The MD5 Message-Digest Algorithm. RFC 1321 (Informational), April 1992.

[50] Antony Rowstron and Peter Druschel. Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems. In Rachid Guerraoui, editor, Middleware 2001, volume 2218 of Lecture Notes in Computer Science, pages 329–350. Springer Berlin Heidelberg, 2001.

[51] Stefan Saroiu, Krishna P. Gummadi, and Steven D. Gribble. Measuring and Analyzing the Characteristics of Napster and Gnutella Hosts. Multimedia Syst., 9(2):170–184, August 2003.

[52] Hans-Emil Skogh, Jonas Haeggström, Ali Ghodsi, and Rassul Ayani. Fast Freenet: Improving Freenet Performance by Preferential Partition Routing and File Mesh Propagation. In Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on, volume 2, 8 pp., May 2006.

[53] William Stallings. Network Security Essentials: Applications and Standards. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1st edition, 1999.

[54] Ralf Steinmetz and Klaus Wehrle. Peer-to-Peer Systems and Applications (Lecture Notes in Computer Science). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.

[55] Richard H. Stern. Napster: A Walking Copyright Infringement? IEEE Micro, 20(6):4–5, December 2000.

[56] Ion Stoica, Robert Morris, David Liben-Nowell, David R. Karger, M. Frans Kaashoek, Frank Dabek, and Hari Balakrishnan. Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications. Networking, IEEE/ACM Transactions on, 11(1):17–32, February 2003.

[57] Daniel Stutzbach and Reza Rejaie. Understanding Churn in Peer-to-Peer Networks. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, IMC ’06, pages 189–202, New York, NY, USA, 2006. ACM.

[58] Chiara Taddia and Gianluca Mazzini. A Multicast-Anycast Based Protocol for Trackerless BitTorrent. In Software, Telecommunications and Computer Networks, 2008. SoftCOM 2008. 16th International Conference on, pages 264–268, September 2008.

[59] Toby Velte, Anthony Velte, and Robert Elsenpeter. Cloud Computing, A Practical Approach. McGraw-Hill, Inc., New York, NY, USA, 1st edition, 2010.

[60] David Issac Wolinsky, Pierre St. Juste, P. Oscar Boykin, and Renato Figueiredo. Addressing the P2P Bootstrap Problem for Small Overlay Networks. In Peer-to-Peer Computing (P2P), 2010 IEEE Tenth International Conference on, pages 1–10, August 2010.

[61] Ben Y. Zhao, Ling Huang, Jeremy Stribling, Sean C. Rhea, Anthony D. Joseph, and John D. Kubiatowicz. Tapestry: A Resilient Global-scale Overlay for Service Deployment. Selected Areas in Communications, IEEE Journal on, 22(1):41–53, January 2004.

Abbreviations

AES Advanced Encryption Standard
API Application Programming Interface
C/S Client-Server
CIA Confidentiality, Integrity, and Availability
CSG Communication Systems Group
DDoS Distributed-Denial-of-Service
DES Data Encryption Standard
DHT Distributed Hash Table
GHz Gigahertz
GIF Graphics Interchange Format
GBytes Gigabytes
GUI Graphical User Interface
GUID Globally Unique Identifier
HTTP Hypertext Transfer Protocol
IaaS Infrastructure-as-a-Service
ID Identifier
IP Internet Protocol
IT Information Technology
JPEG Joint Photographic Experts Group
JRE Java Runtime Environment
JVM Java Virtual Machine
MBytes Megabytes
MD Message Digest
NAT Network Address Translation
NIST National Institute of Standards and Technology
P2P Peer-to-Peer
P2PFastSS P2P Fast Similarity Search
PaaS Platform-as-a-Service
PBE Password-based Encryption
PeerID Peer Identifier
PiCsMu Platform-independent Cross Storage System for Multi-Usage
PM-ID PiCsMu Identifier
PNG Portable Network Graphics
RAM Random Access Memory
RSA Rivest, Shamir and Adleman
SaaS Software-as-a-Service


SHA Secure Hash Algorithm
TCP Transmission Control Protocol
TTL Time-to-live
UDP User Datagram Protocol
UML Unified Modeling Language
UPNP Universal Plug and Play
URL Uniform Resource Locator
UUID Universally Unique Identifier

Glossary

Beans Beans are reusable software components that conform to a particular convention and are a frequently used concept in the Java programming language. Their main purpose is to encapsulate many objects into a single one (i.e., the Bean) that can be transmitted as a single entity.
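The convention can be sketched as a minimal Java bean; the class and field names below are hypothetical and not taken from the PiCsMu code base.

```java
import java.io.Serializable;

// Minimal Java bean: private fields, a public no-argument constructor, and
// getter/setter pairs. Implementing Serializable lets the bean be transmitted
// as a single entity. FileIndexBean is a hypothetical example name.
public class FileIndexBean implements Serializable {
    private String fileName;
    private long size;

    public FileIndexBean() { }  // no-arg constructor required by the convention

    public String getFileName() { return fileName; }
    public void setFileName(String fileName) { this.fileName = fileName; }

    public long getSize() { return size; }
    public void setSize(long size) { this.size = size; }
}
```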

Bootstrapping Establishing an initial connection to a P2P network. Bootstrapping can be done via an always reachable central entity or directly with active peers.

Cloud Computing Cloud computing refers to both the applications delivered as services over the Internet and the hardware and system software in the data centers providing those services.

Cloud Overlay An overlay is a virtual or logical network on top of another network. A cloud overlay combines multiple cloud services into a single system view and allows them to be managed easily.

Cloud Provider Companies that offer cloud computing as a service.

Cloud Service Applications delivered as services by cloud providers. Cloud services can be classified into three types: SaaS, IaaS, and PaaS.

Cloud Storage Service SaaS applications that provide networked online storage to host files of the service users.

Communication Systems Group Research group at the computer science department of the University of Zurich. Its key mission is to establish excellent research in communications, addressing communication protocols for distributed systems, charging, accounting, mobility, security, network management, and Peer-to-Peer systems.

Content-based Search Search technique that provides a variety of results without using exact search keywords. For example, Google provides a content-based search engine.

Content Key Identifier used in the DHT implementation of TomP2P. Content keys allow storing and retrieving multiple values under the same DHT location key.

Cryptography The practice of enabling secure communication over insecure communication media. It thereby supports the construction of secure information systems that are robust against a variety of attacks.


Data Validation The process of checking data against predefined validation rules to ensure that a program operates on clean, correct and useful data. Data validation processes are used by cloud providers to restrict data pushed to their services.

Decoding The reverse process of encoding to restore encoded data to their original form.

Decryption A cryptographic method to revert encrypted data to its original form.

Digital Signature A digital signature is a mathematical scheme for demonstrating the authenticity of a digital message or document. Digital signatures are an important concept in distributed systems to verify message origins.
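Signing and verification can be sketched with the JDK's java.security API; this is a minimal illustration using RSA with SHA-256, not the scheme used in the PiCsMu implementation.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

public class SignatureDemo {
    // Sign a message with an RSA private key.
    public static byte[] sign(byte[] message, PrivateKey key) throws GeneralSecurityException {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initSign(key);
        s.update(message);
        return s.sign();
    }

    // Verify the signature with the corresponding public key.
    public static boolean verify(byte[] message, byte[] sig, PublicKey key) throws GeneralSecurityException {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(key);
        s.update(message);
        return s.verify(sig);
    }

    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair pair = kpg.generateKeyPair();
        byte[] msg = "message origin check".getBytes();
        byte[] sig = sign(msg, pair.getPrivate());
        System.out.println(verify(msg, sig, pair.getPublic())); // prints: true
    }
}
```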

Distributed Hash Table A DHT is used in distributed systems to provide a lookup service similar to a hash table. By storing key/value pairs, any participant of the network can retrieve the value associated with a given key.

Encoding Encoding is the process in which data is converted into a specific form.
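As a concrete instance, Base64 encoding (commonly used to carry binary data over text-only channels) can be sketched with the JDK's java.util.Base64 class:

```java
import java.util.Base64;

// Encoding converts arbitrary bytes into printable ASCII; decoding restores
// the original bytes. A minimal sketch, not tied to any PiCsMu codec.
public class Base64Demo {
    public static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    public static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        System.out.println(encode("hello".getBytes())); // prints: aGVsbG8=
    }
}
```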

Encryption Encryption applies encoding to transform data into a format that cannot be easily understood by unauthorized parties. It is a frequently applied method in cryptography.
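A symmetric round trip can be sketched with the JDK's javax.crypto API; note that `Cipher.getInstance("AES")` resolves to AES/ECB/PKCS5Padding on standard JREs, which is fine for a demonstration but real deployments should prefer an authenticated mode such as AES/GCM. This is an illustration only, not the PiCsMu encryption scheme.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class AesDemo {
    // Encrypt or decrypt with AES depending on the mode constant.
    public static byte[] crypt(int mode, SecretKey key, byte[] input) throws Exception {
        Cipher c = Cipher.getInstance("AES");  // default mode/padding; demo only
        c.init(mode, key);
        return c.doFinal(input);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();

        byte[] ciphertext = crypt(Cipher.ENCRYPT_MODE, key, "secret data".getBytes());
        byte[] plaintext = crypt(Cipher.DECRYPT_MODE, key, ciphertext);
        System.out.println(new String(plaintext)); // prints: secret data
    }
}
```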

File Sharing Systems Centralized or distributed applications that provide access to digitally stored data, e.g., documents, audio, and video.

Fragmentation Describes the process of splitting a file into multiple parts. PiCsMu uses fragmentation to distribute files to cloud storage services.

Futures Logical devices to keep track of asynchronous DHT operations in TomP2P.
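TomP2P supplies its own future classes; the general pattern of a non-blocking operation that completes later can be sketched with the JDK's CompletableFuture (the simulated lookup below is hypothetical):

```java
import java.util.concurrent.CompletableFuture;

public class FutureDemo {
    // Simulated asynchronous DHT lookup: the call returns immediately with a
    // future; the result arrives later on a background thread.
    public static CompletableFuture<String> asyncGet(String key) {
        return CompletableFuture.supplyAsync(() -> "value-for-" + key);
    }

    public static void main(String[] args) {
        CompletableFuture<String> f = asyncGet("fileId")
                .thenApply(v -> "completed: " + v);  // non-blocking callback chain
        System.out.println(f.join());                // block only for the demo
    }
}
```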

Location Key Identifier used in the DHT implementation of TomP2P. Location keys reflect typical DHT keys to store values. They can be used in combination with content keys.
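The interplay of location and content keys can be modeled conceptually as a map of maps: each location key addresses a table of content keys, so several values can live under one DHT location. The sketch below illustrates the idea only; it is not the actual TomP2P API, and the string keys stand in for TomP2P's 160-bit keys.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class TwoLevelStore {
    // location key -> (content key -> value)
    private final Map<String, Map<String, byte[]>> store = new HashMap<>();

    // Store a value under a location key, distinguished by its content key.
    public void put(String locationKey, String contentKey, byte[] value) {
        store.computeIfAbsent(locationKey, k -> new HashMap<>()).put(contentKey, value);
    }

    // Retrieve a single value via location key plus content key.
    public byte[] get(String locationKey, String contentKey) {
        Map<String, byte[]> table = store.get(locationKey);
        return table == null ? null : table.get(contentKey);
    }

    // Retrieve all values stored under one location key.
    public Map<String, byte[]> getAll(String locationKey) {
        return store.getOrDefault(locationKey, Collections.emptyMap());
    }
}
```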

Index data Organizational data in PiCsMu that describes information about the upload process of files. Index data is stored and shared using a central server or a distributed P2P network.

Internet Protocol The primary network protocol used on the Internet. It is used to label packets of data sent across the Internet, identifying both the sending and receiving computers.

Join The reverse process of fragmentation. Parts of a file are reassembled to their original form.
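Fragmentation and its reverse, the join, can be sketched as splitting a byte array into fixed-size chunks and concatenating them again; a minimal illustration, not the PiCsMu fragmentation logic.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Fragmenter {
    // Fragmentation: split a byte array into chunks of at most chunkSize bytes.
    public static List<byte[]> fragment(byte[] data, int chunkSize) {
        List<byte[]> parts = new ArrayList<>();
        for (int off = 0; off < data.length; off += chunkSize) {
            parts.add(Arrays.copyOfRange(data, off, Math.min(off + chunkSize, data.length)));
        }
        return parts;
    }

    // Join: reassemble the chunks in their original order.
    public static byte[] join(List<byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) {
            out.write(p, 0, p.length);  // this overload does not throw
        }
        return out.toByteArray();
    }
}
```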

Peer-to-Peer System A P2P system is a self-organizing system of equal, autonomous entities (peers) which aims for the shared usage of distributed resources in a net- worked environment avoiding central services.

PiCsMu PiCsMu is a cloud storage service offering file-type-independent cloud storage using file-type-dependent cloud services.

TomP2P TomP2P is an advanced DHT implementation which stores multiple values for a key (location key). Each peer has a table (either disk-based or memory-based) to store its values. A single value can be queried or updated with a secondary key (content key). The underlying communication framework uses Java NIO to handle many concurrent connections.

List of Figures

3.1 Simplified PiCsMu protocol sequence diagram: File upload ...... 18

3.2 Simplified PiCsMu protocol sequence diagram: File download ...... 19

3.3 PiCsMu system architecture components and characteristics...... 21

3.4 Applied encryption scheme for private sharing ...... 31

3.5 Use case: Public sharing with all users of PiCsMu ...... 33

3.6 Use case: Private sharing between two PiCsMu users ...... 34

4.1 Simplified UML class diagram of the P2P system’s code architecture . . . . 38

4.2 Public file exchange protocol sequence ...... 39

4.3 Private file exchange protocol sequence ...... 40

4.4 Callback UML class diagram ...... 43

4.5 Asynchronous control flow of requesting a file ...... 44

5.1 Code coverage result ...... 53

5.2 Total time of public and private file-sharing file upload actions ...... 55

5.3 Time of public and private file-sharing DHT and encryption upload actions 56

5.4 Total time of public and private file-sharing file download actions . . . . . 57

5.5 Time of public and private file-sharing DHT and encryption download actions 58

5.6 DHT times for retrieving search results and share notifications ...... 59

List of Tables

2.1 A feature comparison of cloud storage services ...... 7

2.2 A comparison of P2P file-sharing systems ...... 14

3.1 Design of a peer ...... 24

3.2 Example of Put/Get operations using location and content keys ...... 26

3.3 Storage scheme for the DHT in PiCsMu ...... 27

3.4 Deletion neighborhood of the word fast with edit distance=1 ...... 28

3.5 User-defined attributes of a file ...... 28

3.6 Comparison of the deletion neighborhood of two words with edit distance=1 29

3.7 Conception of a personal share notification used in private sharing . . . . . 33

5.1 Evaluation setup ...... 49

5.2 Test cases simulating user behavior in a file-sharing system ...... 51

5.3 Functional evaluation results ...... 53

List of Listings

4.1 Non-blocking communication using futures ...... 41

4.2 Non-blocking API method to request a file ...... 42

4.3 Public cryptography encryption ...... 44

4.4 Public cryptography decryption ...... 45

4.5 Content-based search pseudo code for indexing a file ...... 46

4.6 Content-based search pseudo code for searching a file ...... 46

4.7 Keyword generation from file description ...... 47

4.8 Method to generate the deletion neighborhood with an edit distance k=1 . 48

5.1 Method to illustrate code coverage ...... 52

Appendix A

Installation Guidelines

The latest project configuration is available as a Maven repository. In addition, the source code can be accessed through Subversion (SVN) upon request to the author.

Repositories:
Maven: http://www.pics.mu/maven/
SVN: www.pics.mu/svn/picsmu-share-p2p

System Requirements: Java version 1.6 or later (JDK or JRE)

Appendix B

Contents of the CD

PiCsMu Folder including source code and documents related to this thesis.

Masterthesis.pdf This thesis as PDF.

Abstract.txt Abstract in English.

Zusfsg.txt Abstract in German.
