Design and Implementation of a Peer-To-Peer Based System to Enable the Share Functionality in a Platform-Independent Cloud Storage Overlay
Master Thesis

Michael Ammann
Jona, Switzerland
Student ID: 06-920-805

Supervisors: Guilherme Sperb Machado, Prof. Dr. Burkhard Stiller
Date of Submission: January 31, 2013

Communication Systems Group (CSG), Prof. Dr. Burkhard Stiller
Department of Informatics (IFI), University of Zurich
Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
URL: http://www.csg.uzh.ch/

Abstract

Cloud storage services have attracted great interest in modern information technology, providing a simple way of storing and sharing files. However, they are mostly built on centralized architectures and hence subject to the drawbacks of such architectures. Peer-to-Peer (P2P) systems are a good alternative for dealing with the limitations of centralized systems in the scope of file sharing, due to the shared usage of distributed resources and higher fault tolerance. Therefore, this thesis designs, implements, and evaluates a P2P-based file sharing system to be integrated into PiCsMu, a novel platform-independent cloud storage application. The proposed system features a modular design that relies on well-defined interfaces, high security, and a P2P storage concept with good performance. Additional features that distinguish this approach from other P2P file sharing systems include an integrated file search and the possibility to share files privately. The results of a qualitative and quantitative evaluation show that the proposed system performs well and as designed. Given that downloading 750-megabyte files takes 154.65 seconds on average, the time added by the proposed system (4.21 seconds) is minimal. Yet the system still leaves aspects to be optimized.
Zusammenfassung (Summary)

Cloud storage services have attracted great interest in modern information technology because they offer a simple way to store and share data. However, since such systems are mostly built on centralized architectures, they are also subject to the drawbacks of those architectures. Peer-to-Peer (P2P) systems are an excellent alternative for shared access to data, as they address the limitations of centralized systems. The goal of this thesis is to design, implement, and evaluate a P2P system for shared access to data, which is integrated into PiCsMu, a novel cloud storage application. The proposed system has a modular design based on clearly defined interfaces, high security, and a high-performance P2P storage concept. An integrated search function and the possibility of private data exchange are additional features intended to set the system apart from other P2P systems. The results of the qualitative and quantitative evaluation show that the proposed system works and performs well. Downloading 750-megabyte files takes 154.65 seconds on average, and the additional time caused by the designed system is only 4.21 seconds. This increase is minimal relative to the total time. There is, however, still potential for optimization.

Acknowledgments

I would like to express my appreciation to Prof. Dr. Burkhard Stiller and the Communication Systems Group of the University of Zurich for giving me the opportunity to work on such an interesting topic and to be part of their research. I owe my deepest gratitude to my supervisor Guilherme Sperb Machado, whose research, guidance, enthusiasm, and persistent help allowed me to complete this thesis. His work opened up new perspectives for me and reawakened my interest in computer science. In addition, special thanks go to Dr.
Thomas Bocek for providing TomP2P and for always being ready to provide support in case of any problems. Lastly, I am indebted to my family, Karl-Heinz Saxer, Lyn Shepard, and everyone who supported me during this thesis.

Contents

Abstract
Zusammenfassung
Acknowledgments
1 Introduction
  1.1 Description of Work and Thesis Goals
  1.2 Thesis Outline
2 Terminology and Related Work
  2.1 Terminology
  2.2 Cloud Storage Services
  2.3 A Brief History of Peer-to-Peer File Sharing
    2.3.1 Napster
    2.3.2 Gnutella
    2.3.3 BitTorrent
    2.3.4 Anonymous P2P and Freenet
    2.3.5 Comparison
3 Design
  3.1 File Upload and Download Protocol
  3.2 Index
  3.3 System Architecture
    3.3.1 Application
    3.3.2 Central Server
    3.3.3 Peer-to-Peer Network
    3.3.4 Cloud Services
  3.4 Distributed Hash Table
    3.4.1 Storage Concept using a Distributed Hash Table
    3.4.2 Enabling a Content-Based Search
  3.5 Privacy and Secure Data Exchange
    3.5.1 Cryptography
  3.6 Share Functionality
    3.6.1 Public Sharing
    3.6.2 Share Notification
    3.6.3 Private Sharing
4 Implementation
  4.1 An Overview of the P2P Code Architecture
  4.2 Data Exchange using Beans
  4.3 File-Sharing Protocols
  4.4 Asynchronous and Non-Blocking Communication
    4.4.1 Interacting with the DHT using Futures
    4.4.2 Asynchronous Callbacks and Event Handling
  4.5 Implementing Security Measures
  4.6 File Search Functionality
    4.6.1 Keyword Generation
    4.6.2 Fast Similarity Search
5 Evaluation
  5.1 Testbed and Evaluation Setup
    5.1.1 Test Cases
  5.2 Qualitative Evaluation
    5.2.1 Code Coverage using JUnit Tests
    5.2.2 Functional Testing
  5.3 Quantitative Evaluation
  5.4 Additional Issues and Problems
6 Summary and Conclusions
Bibliography
Abbreviations
Glossary
List of Figures
List of Tables
List of Listings
A Installation Guidelines
B Contents of the CD

Chapter 1

Introduction

Modern Information Technology (IT) strives to provide computing as a utility. Utility computing is a concept based on delivering services and resources (e.g., processing, storage, software) to end-users while hiding their internal mechanisms. These offerings should be obtainable at minimal cost and be purchasable by anyone from private users to large IT companies. For example, developers can easily launch new Internet services without first having to invest a large amount of money in infrastructure. Organizations are thus able to deploy a service first and check afterwards whether demand (e.g., the number of users accessing the service simultaneously) meets predictions [21]. Multiple distributed technologies such as cluster, grid, and cloud computing have emerged from this paradigm. They allow access to large amounts of computing power in a fully virtualized manner by aggregating resources and offering a single system view [25].

Cloud computing (or simply cloud) has been one of the most used buzzwords in IT in recent years. The name emerged as a metaphor for the Internet, which has typically been represented by a cloud symbol in network diagrams [59]. The symbol is an abstraction of the complex underlying mechanisms. A variety of cloud providers (i.e., companies offering cloud computing) have emerged over the last couple of years, including big names such as Amazon, Google, and Microsoft. They offer cloud services that can be classified into three types: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) [59].
Well-known examples of cloud services include Amazon Web Services [1], Google Picasa [9], and Microsoft Office 365 [11]. Section 2.2 introduces the cloud services relevant to this thesis.

In the scope of SaaS, cloud providers offer applications with a specific purpose and responsibility. Consequently, data uploaded to these services is restricted by the corresponding cloud provider's data validation process. For example, Google Picasa only allows hosting images of a certain file type, resolution, and size. Machado et al. [42] investigated how to bypass the data validation process of cloud providers in order to store arbitrary data without restrictions. By applying different encoders (i.e., Steganography-related, FileFormatHeader-related, Appending-related), data is transformed into a form accepted by the cloud provider. A weak data validation process does not recognize this change in the data. Therefore, it is possible to upload any kind of data to a cloud service without the cloud provider noticing that the service-specific restrictions are violated, which has implications for security, accounting, and charging.

Building on this idea, Machado et al. created the Platform-independent Cross Storage System for Multi-Usage (PiCsMu) [15]. PiCsMu is a prototypical cloud-based overlay (i.e., a network on top of another network) offering file type independent cloud storage using file type dependent cloud services. By applying encoding, the application can store arbitrary files on any cloud service. The way PiCsMu stores files is divided into three major steps: (1) fragmentation, (2) encoding, and (3) upload. Step (1) splits the file to be uploaded into multiple file parts. In step (2), each file part is encoded specifically for the service to which it will be uploaded. Finally, step (3) uploads all encoded file parts to their corresponding cloud services.
Advantages of this method are discussed in Section 2.2, and the individual steps are explained in more detail in Section 3.1. To retrieve the file, the process is applied in reverse. PiCsMu itself does not store the file data, but only index data (i.e., data about data), which contains information about the upload process, such as (1) the number and order of the fragmented file parts, (2) the applied encoders, and (3) the cloud services used.
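
The three storage steps and the role of the index data can be illustrated with a minimal sketch. It is not the PiCsMu implementation: cloud services are modelled as in-memory dictionaries, the encoder is a toy that only prepends a marker (a stand-in for the Steganography-, FileFormatHeader-, or Appending-related encoders), and all names (`IndexEntry`, `fragment`, `encode`, `decode`, `upload`, `download`, `CARRIER_HEADER`) as well as the round-robin placement of parts are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical marker standing in for a real encoder's carrier format.
CARRIER_HEADER = b"TOY-CARRIER:"


@dataclass
class IndexEntry:
    """One index record per file part (field names are assumptions)."""
    order: int    # position of the part within the original file
    encoder: str  # name of the encoder that was applied
    service: str  # cloud service holding the encoded part
    key: str      # identifier of the encoded part on that service


def fragment(data: bytes, part_size: int) -> List[bytes]:
    """Step (1): split the file into parts of at most part_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


def encode(part: bytes) -> bytes:
    """Step (2): toy encoder that prepends a carrier header to the part."""
    return CARRIER_HEADER + part


def decode(blob: bytes) -> bytes:
    """Reverse of encode: strip the carrier header again."""
    if not blob.startswith(CARRIER_HEADER):
        raise ValueError("blob was not produced by the toy encoder")
    return blob[len(CARRIER_HEADER):]


def upload(data: bytes, services: Dict[str, Dict[str, bytes]],
           part_size: int) -> List[IndexEntry]:
    """Fragment, encode, and store a file; return the index data."""
    names = sorted(services)
    if not names:
        raise ValueError("at least one service is required")
    index = []
    for order, part in enumerate(fragment(data, part_size)):
        service = names[order % len(names)]  # assumed round-robin placement
        key = hashlib.sha256(str(order).encode() + b":" + part).hexdigest()
        services[service][key] = encode(part)  # step (3): upload
        index.append(IndexEntry(order, "toy-header", service, key))
    return index


def download(index: List[IndexEntry],
             services: Dict[str, Dict[str, bytes]]) -> bytes:
    """Reverse process: fetch each part, decode it, and reassemble in order."""
    ordered = sorted(index, key=lambda entry: entry.order)
    return b"".join(decode(services[e.service][e.key]) for e in ordered)


if __name__ == "__main__":
    stores = {"photo-service": {}, "document-service": {}}
    original = b"arbitrary file content " * 20
    file_index = upload(original, stores, part_size=64)
    print(download(file_index, stores) == original)
```

In this sketch the services only ever see encoded parts, while the index alone records how to find, decode, and order them, which is why retrieval is impossible without it.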