Decentralized Social Data Sharing
Total Page:16
File Type:pdf, Size:1020Kb
Decentralized Social Data Sharing by Alan Davoust A Thesis submitted to the Faculty of Graduate Studies and Post-Doctoral Affairs in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Electrical and Computer Engineering Ottawa-Carleton Institute for Electrical and Computer Engineering (OCIECE) Department of Systems and Computer Engineering Carleton University Ottawa, Ontario, Canada 2015 Copyright c 2015 - Alan Davoust Abstract Many data sharing systems are open to arbitrary users on the Internet, who are independent and self-interested agents. Therefore, in addition to traditional design goals such as technical performance, data sharing systems should be designed to best support the strategic interactions of these agents. Our research hypothesis is that designs that maximize the participants’ autonomy can produce useful data sharing systems. We apply this design principle to both the system architecture and the functional design of a data sharing system, and study the resulting class of systems, which we call Decentralized Social Data Sharing ((DS)2) systems. We formally define this class of systems and provide a reference implementation and an example application: a distributed wiki system called P2Pedia. P2Pedia implements a decentralized collaboration model, where the users are not required to reach a consensus, and instead benefit from being exposed to multiple viewpoints. We demonstrate the value of this collaboration model through an extensive user study. Allowing the users to autonomously control their data prevents the system archi- tecture from being optimized for efficient query processing. We show that Regular Path Queries, a useful class of graph queries, can still be processed on the shared data: although in the worst case such queries are intractable, we propose a cost estimation technique to identify tractable queries from partial knowledge of the data. Through simulation, we also show that the users’ control over network connections ii allows them to self-organize and interact with other users with whom their interests are best aligned. This may result in less data being available, and we study cases where this is in fact demonstrably beneficial to the users, as the available data to each user is the most relevant to them. This suggests that querying this reduced collection of shared data may lead to more tractable query processing without necessarily reducing the users’ utility. iii Acknowledgments First of all, I would like to acknowledge the valuable guidance of my thesis supervisor Dr. Babak Esfandiari, who helped me follow proper research methods while allowing me a significant amount of what I should call academic freedom. This freedom is one of the reasons why this thesis took so long, but I’m not sure I could have worked this hard on something I hadn’t largely chosen myself. Dr. Esfandiari was also instrumental in connecting me with useful elements such as funding and valuable aspects of my academic training, including peer reviewing activities, a teaching contract, and an industrial internship. I am indebted to Pr. Vincent Kazmierski of the Department of Law at Carleton University, who made possible the user study with P2Pedia. Pr. Kazmierski trusted us and fearlessly accepted to use our tool P2Pedia for academic writing tutorials in one of his courses, for two consecutive semesters, with close to 200 students each time, then participated in the data analysis. My work also benefited from my collaboration with Drs. Hala Skaf, Pascal Molli and Khaled Aslan from the University of Nantes, France. While the results of that work only occupy a small place in this dissertation in terms of actual text, they greatly contributed to shaping my understanding of collaborative systems. A number of undergraduate students also participated more or less directly to this research with their development work, including (in approximate chronological order) Alexander Craig, Denis Dionne, Matthew Smith, Cameron Blanchard, Jeremy iv Dunsmore, and Ethan Aubuchon. Finally, I could not have achieved this without the loving support of my wife Sylviana. v Table of Contents Abstract iii Acknowledgments v Table of Contents vii List of Tables xvi List of Figures xvii List of Acronyms xx 1 Introduction 1 1.1 Motivation................................. 1 1.2 ResearchProblem............................. 5 1.2.1 Definition ............................. 6 1.2.2 Efficiency: Data Availability and Query Processing . .. 7 1.2.3 Self-Interested and Cooperative Behaviour . ... 9 1.3 ContributionsandConclusions. 12 1.3.1 Contributions ........................... 12 1.3.2 Conclusions ............................ 13 1.4 OrganizationoftheThesis . 14 vi 2 State of the Art 16 2.1 Introduction................................ 16 2.2 TheWebofData ............................. 18 2.2.1 SystemModel........................... 18 2.2.2 Functionality ........................... 19 2.2.3 Queries .............................. 20 2.3 OnlineSocialNetworks. 21 2.3.1 SystemModel........................... 21 2.3.2 Functionality ........................... 22 2.3.3 Queries .............................. 24 2.4 CollaborativeRepositories . 24 2.4.1 SystemModel........................... 24 2.4.2 Functionality ........................... 25 2.5 Peer-to-peerDataSharingSystems . 26 2.5.1 SystemModel........................... 27 2.5.2 Functionality ........................... 29 2.5.3 Queries .............................. 29 2.6 Comparison ................................ 30 2.6.1 TechnicalCosts .......................... 30 2.6.2 ConflictsoverSharedData. 31 2.7 Conclusion................................. 36 2.7.1 Summary ............................. 36 2.7.2 Decentralized Social Data Sharing (DS)2 ............ 38 3 Decentralized Social Data Sharing Applications: Formal Definition 40 3.1 Introduction................................ 40 3.2 ConceptualModelofSocialDataSharing. .. 41 vii 3.2.1 SocialNetwork .......................... 41 3.2.2 DataResourcesandGraphofData . 41 3.2.3 DataSharing ........................... 43 3.2.4 SystemModel........................... 43 3.3 SystemEvolution ............................. 44 3.3.1 ModifyingOperations . 44 3.3.2 Deletionvs.Modification. 46 3.3.3 Queries .............................. 46 3.4 ModelRefinementforaP2PDeployment . 47 3.4.1 Churn ............................... 48 3.4.2 TheProblemofQueryProcessing . 49 3.4.3 QuerySemantics ......................... 49 3.4.4 StrategicBehaviour. 50 3.5 Conclusion................................. 50 4 P2Pedia, a (DS)2 Wiki 52 4.1 Motivation................................. 52 4.2 RelatedWork ............................... 54 4.2.1 DecentralizedContentManagement . 54 4.2.2 DecentralizedWikis. 55 4.2.3 WikipediaLanguageEditions . 56 4.3 FunctionalityofP2Pedia . 56 4.3.1 Overview ............................. 56 4.3.2 ExampleScenario......................... 58 4.3.3 UseCases ............................. 59 4.3.4 RegularPathQueries. 62 4.4 TrustIndicatorsinP2Pedia . 63 viii 4.4.1 MotivationandFundamentalAssumptions . 63 4.4.2 SocialNetworkingTrustIndicators . 64 4.4.3 Similarity ............................. 64 4.4.4 PageIndicators .......................... 65 4.5 Conclusion................................. 65 5 A P2P Platform for social data-sharing applications 67 5.1 Introduction................................ 67 5.2 Related Work: Frameworks for Data Sharing Applications . ..... 69 5.3 Background:REST............................ 71 5.4 LayeredArchitecture ........................... 73 5.5 BaseP2PLayer.............................. 74 5.5.1 LayerImplementation . 74 5.5.2 Resources ............................. 75 5.6 SocialNetworkingLayer . 76 5.6.1 Implementation .......................... 76 5.6.2 Resources ............................. 76 5.7 BaseStorageLayer ............................ 77 5.7.1 Implementation .......................... 77 5.7.2 Resources ............................. 78 5.8 Data-SharingLayer............................ 78 5.8.1 Implementation .......................... 79 5.8.2 RESTresources.......................... 79 5.9 StructuredDataManagementLayer. 81 5.9.1 DataModel ............................ 81 5.9.2 ModifyingtheGraphofData . 81 5.9.3 Queries .............................. 82 ix 5.9.4 RESTResources ......................... 82 5.10Conclusion................................. 82 6 Processing Regular Path Queries in a (DS)2 84 6.1 Introduction................................ 84 6.2 DefinitionsandAlgorithmsforRPQs . 88 6.2.1 Notations ............................. 88 6.2.2 Definitions............................. 88 6.2.3 RegularPathQueryWithInverse . 89 6.2.4 Example.............................. 90 6.2.5 BasicRPQAlgorithm ...................... 91 6.2.6 ApplicationtoRPQI ....................... 92 6.2.7 Complexity ............................ 92 6.2.8 Optimizations........................... 93 6.3 RPQProcessingonDistributedData . 94 6.3.1 Overview ............................. 94 6.3.2 ExampleDatabaseandQuery . 97 6.3.3 Query Processing Strategies for the (DS)2 setting ....... 98 6.3.4 ComplexityComparison . 103 6.3.5 Conclusion............................. 106 6.4 Asynchronous Iterative Centralized Query Processing . ....... 107 6.4.1 SequentialAlgorithm . 108 6.4.2 Example.............................. 110 6.4.3 Complexity ............................ 111 6.4.4 ParallelAlgorithm . 112 6.5 CostComparisononReal-worldQueries . 118 6.5.1 Datasetandqueries. 118 x 6.5.2 BroadcastandUnicastcosts . 120 6.5.3 Results..............................