A Model to Manage Shared Mutable Data in a Distributed Environment

THE SEA OF STUFF: A MODEL TO MANAGE SHARED MUTABLE DATA IN A DISTRIBUTED ENVIRONMENT

Simone Ivan Conte

A Thesis Submitted for the Degree of PhD at the University of St Andrews, 2018

Full metadata for this thesis is available in St Andrews Research Repository at: http://research-repository.st-andrews.ac.uk/

Please use this identifier to cite or link to this thesis: http://hdl.handle.net/10023/16827

This item is protected by original copyright.

This item is licensed under a Creative Commons Licence: https://creativecommons.org/licenses/by-nc-nd/4.0/

This thesis is submitted in partial fulfilment for the degree of Doctor of Philosophy (PhD) at the University of St Andrews, August 2018.

Abstract

Managing data is one of the main challenges in distributed systems and computer science in general. Data is created, shared, and managed across heterogeneous distributed systems of users, services, applications, and devices without a clear and comprehensive data model. This technological fragmentation and lack of a common data model result in a poor understanding of what data is, how it evolves over time, how it should be managed in a distributed system, and how it should be protected and shared. From a user perspective, for example, backing up data over multiple devices is a hard and error-prone process, or synchronising data with a cloud storage service can result in conflicts and unpredictable behaviours.

This thesis identifies three challenges in data management: (1) how to extend the current data abstractions so that content, for example, is accessible irrespective of its location, versionable, and easy to distribute; (2) how to enable transparent data storage relative to locations, users, applications, and services; and (3) how to allow data owners to protect data against malicious users and automatically control content over a distributed system.
These challenges are studied in detail in relation to the current state of the art and addressed throughout the rest of the thesis. The artefact of this work is the Sea of Stuff (SOS), a generic data model of immutable self-describing location-independent entities that allow the construction of a distributed system where data is accessible and organised irrespective of its location, easy to protect, and can be automatically managed according to a set of user-defined rules. The evaluation of this thesis demonstrates the viability of the SOS model for managing data in a distributed system and using user-defined rules to automatically manage data across multiple nodes.

The code for this work can be found online at the following URL: https://github.com/sea-of-stuff (GNU GPL v3).

Declaration

Candidate's Declaration

I, Simone Ivan Conte, do hereby certify that this thesis, submitted for the degree of PhD, which is approximately 66,400 words in length, has been written by me, and that it is the record of work carried out by me, or principally by myself in collaboration with others as acknowledged, and that it has not been submitted in any previous application for any degree. I was admitted as a research student at the University of St Andrews in September 2014.

I received funding from an organisation or institution and have acknowledged the funder(s) in the full text of my thesis.

Date    Signature of candidate

Supervisor's Declaration

I hereby certify that the candidate has fulfilled the conditions of the Resolution and Regulations appropriate for the degree of PhD in the University of St Andrews and that the candidate is qualified to submit this thesis in application for that degree.
Date    Signature of supervisor

Permission for Electronic Publication

In submitting this thesis to the University of St Andrews we understand that we are giving permission for it to be made available for use in accordance with the regulations of the University Library for the time being in force, subject to any copyright vested in the work not being affected thereby. We also understand, unless exempt by an award of an embargo as requested below, that the title and the abstract will be published, and that a copy of the work may be made and supplied to any bona fide library or research worker, that this thesis will be electronically accessible for personal or research use and that the library has the right to migrate this thesis into new electronic forms as required to ensure continued access to the thesis.

I, Simone Ivan Conte, confirm that my thesis does not contain any third-party material that requires copyright clearance.

The following is an agreed request by candidate and supervisor regarding the publication of this thesis:

• No embargo on print copy.
• No embargo on electronic copy.

Date    Signature of candidate

Date    Signature of supervisor

Underpinning Research Data or Digital Outputs

Candidate's declaration

I, Simone Ivan Conte, hereby certify that no requirements to deposit original research data or digital outputs apply to this thesis and that, where appropriate, secondary data used have been referenced in the full text of my thesis.

Date    Signature of candidate

Acknowledgments

I would like to thank my supervisors, Alan Dearle and Graham Kirby, for their guidance and endless patience throughout my doctoral studies. Thanks go to Adrian O'Lenskie and Ian Paterson, from Adobe Systems Inc., for their precious advice and support.
I would like to acknowledge the School of Computer Science, which has made my stay in St Andrews, over the last eight years, enjoyable academically and has given me the opportunity to meet people who have been a true inspiration to me. Thanks go to Stuart Norcross and the Fixit team, who have helped me with the provision and management of the testbed used for the experiments. Thanks to my office mates Masih, Tom, Ward, and Ryo, who have provided me with challenging, interesting, and fun discussions on a daily basis.

I would like to thank my aunt Giusy for inspiring me, when I was ten, to pursue a career in science. I am forever grateful to my parents and my brother for their immense and invaluable love and support, which helped me arrive where I am today. Finally, the biggest thanks go to Giulia, who makes me smile every day.

Funding

This work was supported by Adobe Systems, Inc. and EPSRC [grant number EP/M506631/1].

Contents

Abstract
Declaration
Permission for Electronic Publication
Underpinning Research Data or Digital Outputs
Acknowledgments
1 Introduction
1.1 Introduction
1.2 The Three Challenges
1.2.1 Limitation of the Current Data Storage Abstractions
1.2.2 Transparent Data Storage
1.2.3 Data Ownership, Protection and Control
1.3 Hypothesis
1.4 Thesis Contributions
1.5 Thesis Structure
2 Background
2.1 Data Storage Concepts
2.1.1 Data
2.1.2 Location and Naming
2.1.3 Metadata
2.1.4 Caching
2.1.5 CAP Theorem
2.1.6 Replication, Erasure Coding, and Resiliency
2.1.7 Scalability
2.1.8 Security
2.2 Data Management Systems
2.2.1 File Systems
2.2.2 Database Systems
2.2.3 Versioning in Storage Systems
2.2.4 Networked File Systems
2.2.5 Cloud Storage
2.2.6 Object Storage
3 Literature Review
3.1 File Systems
3.1.1 Extended Attributes Support
3.1.2 Tagged Files
3.2 Networked File Systems
3.2.1 The Hadoop File System and Google File System
3.2.2 GlusterFS
3.3 Versioning in Storage Systems
3.3.1 Manual Data Versioning
3.3.2 Versioning in Backup Applications
3.3.3 Version Control Systems
3.4 Cloud Storage
3.4.1 Infrastructure as a Service Storage
3.4.2 Software as a Service Storage
3.4.3 Multi-Cloud Storage
3.5 P2P
3.5.1 Overlay Networks
3.5.2 P2P Storage Systems
3.6 Context-Aware Storage
3.6.1 The Semantic File System
3.6.2 The quFile
3.7 Conclusions
4 Design Requirements
4.1 End-User Requirements
4.2 Model Requirements
4.3 Architecture Requirements
5 The Sea of Stuff
5.1 The Sea of Stuff Model
