Faculty of Science and Technology
Department of Computer Science

Seadrive
Remote file synchronization for offshore fleets
Peter Haro
INF-3981 Master’s Thesis in Computer Science - May 2016

“Sometimes a scream is better than a thesis” – Ralph Waldo Emerson

Abstract

File synchronization and hosting services are not only an integrated part of everyday life, but also a powerful tool to support business and organizational activities. To provide users with a transparent experience, these systems rely on sophisticated mechanisms that create a seamless integration. The problem with these systems is that they are designed for stable network connections with low variability in latency, throughput and loss rate. Systems that are optimized for low-bandwidth networks are built to work on a small set of small text-based files, and assume no prior knowledge of the contents on the receiver.

Offshore vessels outside the range of cellular networks employ a variety of satellite-based communication suites and the accompanying physical hardware. These networks are notorious for poor upload and download speeds, a high loss rate, and high latency with high variability, and they are subject to frequent dropped connections. Furthermore, the monetary cost of using these connections is high, as the highest-performing networks charge per kilobit transferred. These connections are unsuitable for modern file hosting services and file synchronization frameworks, because synchronization never completes, often due to the assignment of new IP addresses. Providing the naval fleet with a file synchronization protocol that is reliable and has a small transmission overhead is therefore of the utmost importance.

To meet the need for file hosting services, we created a file synchronization framework which allows for different deduplication, file synchronization and file transportation schemes.
The idea was to support a computationally inexpensive method emphasizing speed over reliability on Local Area Networks, and a robust but slower methodology for Wide Area Networks.

This thesis presents Seadrive, a new file synchronization framework that targets offshore-based fleets and their land-based counterparts. By utilizing a file synchronization methodology inspired by binary patch distributions, and creating a novel reliable application-level transport protocol, we are able to successfully synchronize large files through simulated satellite-based network topologies.

To assess the capabilities of our framework, we performed various experiments on the artifacts in the form of micro- and macro-benchmarks, comparing them to both Rsync- and Rdiff-based protocols. Our results show that Seadrive produces smaller patches than both Rsync- and Rdiff-based protocols and requires fewer TCP and application-layer requests, saving up to 10 hours on the slowest network connection, and that it reliably transfers data through unreliable network topologies.

Acknowledgements

First and foremost, I would like to express my gratitude to my advisors, Professor Otto Anshus, Svein Bertheussen, Vidar Berg and Asbjørn Pettersen, for your guidance, support and valuable insights. I would also like to thank Dualog for creating this project and providing an office for a measly student. On a more personal level, I would like to thank my colleagues at SINTEF Nord, especially Bård Hanssen for providing good humor and being my personal scapegoat, and of course the coffee machine for providing me with necessary life support. Finally, I would like to thank my fiancée Maria Brattfjell and my family for showing loving support throughout my madness.
Contents

Abstract
Acknowledgements
List of Figures
List of Tables
List of Listings
List of Code Snippets
List of Abbreviations

1 Introduction
  1.1 Problem definition
  1.2 Targeted Applications
  1.3 Contributions
  1.4 Methods and materials
    1.4.1 Methodology applied for this thesis
    1.4.2 Procedures
  1.5 Context
  1.6 Assumptions and Limitations
  1.7 Structure of the Thesis
2 Review of related literature
  2.1 Data Deduplication
    2.1.1 Taxonomy
    2.1.2 Methodologies
    2.1.3 Deduplication methodologies
    2.1.4 Fixed block hashing/Fixed-size Chunking
    2.1.5 Variable Block hashing/Variable-size chunking
  2.2 Data differencing
    2.2.1 Mathematical fundament
  2.3 Conflict resolution in file synchronizers
3 Review of related Technologies
  3.1 File synchronization protocols
    3.1.1 Widely used remote file synchronization algorithms
    3.1.2 Rsync
    3.1.3 Unison
    3.1.4 Dropbox
  3.2 Distributed file systems
    3.2.1 Sun Network Filesystem
    3.2.2 Andrew File System
4 Architecture
5 Design
  5.1 The Data Abstraction Layer – I/O management
  5.2 Business Logic Layer – Core functionality
  5.3 Application Layer – Seadrive
  5.4 Data Deduplication
    5.4.1 Delta difference data deduplication
  5.5 Filesystem monitor, change detection and the application facade
  5.6 File synchronization and Transport Protocol
  5.7 Local synchronization protocol
  5.8 Remote Synchronization protocol
6 Implementation
  6.1 Data Abstraction Layer
  6.2 Business Logic Layer
  6.3 Application layer - Seadrive
  6.4 Deduplication
  6.5 Seadrive artifacts implementation
  6.6 Clients
  6.7 Remote Transport protocol
  6.8 Remote file synchronizer
  6.9 Local Server – Local Synchronization point
    6.9.1 Sending and receiving data
    6.9.2 Local server deduplication for variable-sized chunking synchronization
    6.9.3 Local server deduplication for binary difference synchronization
  6.10 Primary Server
7 Experimental design and setup
  7.1 Datasets
  7.2 Experimental design
    7.2.1 Micro-Benchmarks
    7.2.2 Macro-Benchmarks
    7.2.3 Experimental setups
8 Evaluation and results
  8.1 Micro-Benchmarks
  8.2 Macro-benchmarks
    8.2.1 Full application usage – Window size 256 bytes
    8.2.2 Full application 1024 byte window size
    8.2.3 Simulated delay sessions
  8.3 Analysis
  8.4 Discussion
    8.4.1 Lessons learned
9 Concluding remarks
  9.1 Future work
  9.2 Conclusion
Bibliography
A Appendix 1
B Appendix 2
C SQL scripts

List of Figures

2.1 The generic deduplication process according to [4]
2.2 The sliding window algorithm from [13]
3.1 NFS architecture as outlined in [6]
3.2 AFS process distribution as outlined in [6]
4.1 Bird's-eye architecture of Seadrive. Clients are reciprocally synchronized within the LSP, and continuously synchronize with the RSP whenever possible. Red rings indicate the LSP.
5.1 Shows a simplified model of the entire application stack
5.2 Shows the generic Seadrive data deduplication process
5.3 State diagram of the sender in the remote transport protocol
8.1 Shows the IO graph for the network communication between the local server and primary server. The green ring indicates where we killed the connection
8.2 Shows the IO graph for the network communication between the local server and primary server with 1 second RTT. The green ring indicates where we killed the connection
8.3 Shows the IO graph for the network communication between the local server and primary server with 3 second RTT, no retransmissions

List of Tables

8.1 Displays the compression rate on the test-set in bytes
8.2 Shows the average run time to create delta-differences, in milliseconds and seconds
8.3 Shows the time to transfer the delta-files over various data-plans in hours

List of Listings

6.1 Private variables of the chunk class
6.2 Shows the usage of the transport flags

List of Code Snippets

7.1 Python script to generate size random bytes
8.1 C# code to measure time

List of Abbreviations

ACM     Association for Computing Machinery
AFS     Andrew File System
API     application programming interface
CPU     Central Processing Unit
CSP     Communicating Sequential Processes
D3      Data-Driven Documents
DAL     Data Abstraction Layer
DLL     Dynamic Link Library
GUID    Globally Unique Identifier
HTML5   version 5 of the HyperText Markup Language standard
I/O     Input/Output