Design and Initial Implementation of a Decentralized Social Networking Application Using Apache Cassandra and Peer-to-Peer Data Transfers
Campbell Boswell and Rowen Felt, Advised by Peter Johnson
Middlebury College, Department of Computer Science
Summer 2018

Summary

Our objective for this summer's work was to create the infrastructure to support a decentralized social networking application. A decentralized application has several benefits over a centralized system, the most important being data privacy, security, and platform independence. Centralized applications, such as Facebook and Instagram, profit by mining their users' data for marketing information. In a decentralized system, however, users own their data and share the operating costs of the network by managing their own data stores and networking overhead, which eliminates the need for corporate ownership.

We began by reading a report from our advisor's previous summer research session (2017), which was produced by student researchers. This report presented the initial conceptual research for the project and included an analysis of user operations performed in a variety of social networking contexts and applications. After discussing the report and exploring several academic papers related to the goals of the project, we decided that the next step would be to design a distributed application oriented around an underlying distributed hash table. Our advisor left the specific design and behavior of the system for us to define and implement ourselves.

A distributed hash table (or DHT) is "a decentralized, distributed system that provides a lookup service very similar to a hash table."1 Most DHTs describe a system of nodes connected in a ring-like graph such that each node is responsible for a given range of hashed keys. Each node is aware of a certain number of other nodes throughout the ring and maintains some notion of the overall state of the network and distribution of keys. When key-value pairs are inserted into the DHT, a logarithmic-time search function similar to binary search allows the inserting node to locate the node responsible for the hashed key and pass along the value to be inserted. Looking up values by their corresponding keys works in the same manner, except that values are retrieved rather than inserted. Most DHTs include support for mirroring data on other nodes in case of system error and provide functionality for nodes to join and leave the network with minimal computational overhead. DHTs thereby allow consistent, reliable access to amounts of data that would be unmanageable on a single server.

There are several notable benefits of DHTs, including scalability, fault tolerance, and flexibility. Because each node maintains knowledge of only a constant or logarithmic number of other nodes, and because each node is responsible for only a portion of the overall data, a DHT can easily scale to thousands or millions of nodes and billions or trillions of data points.

1 https://en.wikipedia.org/wiki/Distributed_hash_table

DHTs are also fault tolerant in that most systems allow nodes to join and leave the network without a substantial penalty in locating relevant data points. This fault tolerance is invaluable both in the context of a datacenter, where heavy traffic consistently leads to node failure, and in the context of nodes distributed across a wide user base, where nodes often disconnect due to network connectivity and maintenance. Many DHTs are also highly usable regardless of network topology or physical proximity. This flexibility has proven invaluable in previous decentralized social networking applications that also used DHTs.

After two weeks of reading research publications concerning various DHTs and distributed system implementations, we decided to recreate the functionality of an existing social networking platform through our own design. We chose Instagram because its primary operations, such as making posts, writing comments, tagging users, and sending direct messages, seemed fairly straightforward. We began by defining Instagram's user actions and decomposing them into computational operations on user profile objects and user content. At this point we decided that the differences between posts, comments, shares, tags, and messages are trivial enough that all of these data points can be abstracted into a single object class that we chose to call a dispatch. The dispatch object is composed of all the fields needed to describe any of the above content, including the user id, image data, text, tags, user tags, audience, and a globally unique dispatch id. The dispatch object also contains parent type and parent id fields to identify the dispatch as either a post or a comment. In the case of a post, the parent id is the user id of the poster. In the case of a comment, the parent id is the dispatch id of the original post being commented upon. The audience field can also be used to specify the type of dispatch as a public post, direct message, or group message, in which case the audience field is populated with the user ids of the relevant parties. Using this abstraction, we were able to break down all user actions into operations on dispatch objects and user objects.

When the time came to implement our design, we had to decide which pre-existing DHT we would use to store global user identifiers and what kind of local database we would use to store user data. Fortunately, the operations and measured efficiency of most DHTs are essentially equivalent, so we could choose an implementation based on the language we wanted to use and the support provided. We chose to write all of the server software for this application in C because this project presented a good opportunity to gain familiarity with the language and work with third-party C libraries. Due to this decision, we chose Apache Cassandra as our DHT for its extensive API and C driver support. We similarly chose MongoDB for our local database because of its notable speed, document-based flexibility, and substantial support for C drivers.

We spent the last four weeks of the summer coding our implementation. We started by writing C libraries for storing, retrieving, and updating user identification information in the Apache Cassandra database. We then wrote C libraries for insertion, deletion, and search methods on user and dispatch objects in the MongoDB database.
The majority of this code was contained in methods that converted user and dispatch structs to BSON and JSON formats and vice versa. We then decided to write our own application-layer networking protocol to facilitate the peer-to-peer communication that would comprise the bulk of the network overhead. We chose to make this protocol text-based for ease of testing and because the data being transferred between instances of MongoDB was conveniently stored in the text-based JSON format. We used these networking protocols to describe another layer of abstraction that more closely resembled real user actions. These protocols implemented behavior such as pushing a user object to a node or pulling all dispatch objects with a given field. On top of this layer we were able to build a server that responds to incoming requests and a client that reads protocol commands from a file. Our last project was writing Python wrapper functions that describe individual user actions, such as making a post, sending a message, or viewing a profile. These functions format the appropriate information as network protocol commands to be executed by the client process. We then wrote testing infrastructure that randomly generates an arbitrary number of user actions with a given probability distribution and executes them sequentially.

While we were able to test a large number of inputs without system failure, we were unable to truly evaluate the system for a variety of reasons. We were unable to acquire a dataset of user actions with the data we required, and we had no dataset representing performance standards for centralized social networks to which we could compare the performance of our system. Additionally, all of the nodes to which we had access were operating on a local area network, which is not an accurate representation of how the system would be deployed in the wild, and we had neither the time nor the resources to simulate realistic network topology on the systems available to us. Ultimately, our aim was to leave this project in a clean state with concise documentation so that others may continue the work in the future with relative ease. With that in mind, the system model, setup, API, testing infrastructure, and proposals for future work are outlined below.

Past Work

We began our project by researching implementations of DHTs, peer-to-peer technologies, and past attempts at decentralized social networks (SOUP and ReClaim). Listed below are the publications we reviewed.

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications, Ion Stoica et al., 2001. Link: https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf

Democratizing Content Publication with Coral, Michael J. Freedman et al., 2004. Link: http://www.coralcdn.org/docs/coral-nsdi04.pdf

SOUP: An Online Social Network by the People, for the People, David Koll et al., 2014. Link: https://dl.acm.org/citation.cfm?id=2663324

HyperDex: A Distributed, Searchable Key-Value Store, Robert Escriva et al., 2012. Link: http://conferences.sigcomm.org/sigcomm/2012/paper/sigcomm/p25.pdf

BubbleStorm: Resilient, Probabilistic, and Exhaustive Peer-to-Peer Search, Wesley W. Terpstra et al., 2007. Link: http://www.sigcomm.org/node/2624

Making Gnutella-like P2P Systems Scalable, Yatin Chawathe et al., 2003. Link: http://conferences.sigcomm.org/sigcomm/2003/papers/p407-chawathe.pdf

Comet: An Active Distributed Key-Value Store, Roxana Geambasu et al., 2010. Link: https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Geambasu.pdf

Fully Dynamic Betweenness Centrality Maintenance on Massive Networks, Takanori Hayashi et al., 2015. Link: http://www.vldb.org/pvldb/vol9/p48-hayashi.pdf

ReClaim: A Privacy-Preserving Decentralized Social Network, Niels Zeilemaker et al., 2014. Link: https://www.usenix.org/system/files/conference/foci14/foci14-zeilemaker.pdf

Scalable Consistency in Scatter, Lisa Glendenning et al., 2011. Link: https://www.sigops.org/sosp/sosp11/current/2011-Cascais/02-glendenning-online.pdf

Storage Management and Caching in PAST, a Large-Scale, Persistent Peer-to-Peer Storage Utility, Antony Rowstron et al., 2001. Link: http://sosp.org/2001/papers/rowstron.pdf

Ceph: A Scalable, High-Performance Distributed File System, Sage A. Weil et al., 2006. Link: https://www.usenix.org/legacy/events/osdi06/tech/full_papers/weil/weil.pdf

Viceroy: A Scalable and Dynamic Emulation of the Butterfly, Dahlia Malkhi et al., 2001. Link: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7629&rep=rep1&type=pdf

Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems, Antony Rowstron et al., 2001. Link: http://rowstron.azurewebsites.net/PAST/pastry.pdf

System Model

General structure

Apache Cassandra is the DHT we chose to form the backbone of our application; every node is connected to the Cassandra cluster. We borrowed SOUP's design philosophy that the DHT can be used as an identification tool for facilitating peer-to-peer communication rather than as a datastore for user and dispatch information. With this in mind, we designed our Cassandra database to store the IP addresses of users with a compound primary key of user_id and username, in which user_id is a globally unique identifier and username is not necessarily unique. Each user has a home node containing a server that controls access to an instance of MongoDB. This database houses all of a user's profile and dispatch data, as well as related dispatches such as user tags and direct messages. The entirety of a user's data is also stored on a constant number of mirror nodes owned and maintained by other users to ensure constant access (mirroring functionality has not yet been implemented).

All user actions involve a four-step process. First, the user action is formatted into a number of network protocol commands to be executed by the client. The users to be contacted are then pulled by user_id or username from Apache Cassandra. The client then establishes a TCP connection to each required server, including a localhost connection to the server running on the home node if needed. Finally, the data is either pushed from the client to the contacted server or pulled from the contacted server and pushed by the client to the local server. A sketch of this flow appears below.
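The following is a minimal sketch of this flow for a single contacted server, using the get_user_ip_by_id lookup documented later in this report; connect_to_peer and send_protocol_command are hypothetical stand-ins for the client's TCP and protocol layers, not functions from our codebase.

#include <stdint.h>
#include <stdlib.h>

char *get_user_ip_by_id(char *keyspace, char *table, uint64_t user_id);
int connect_to_peer(const char *ip);                /* hypothetical TCP layer */
int send_protocol_command(int fd, const char *cmd); /* hypothetical protocol layer */

int push_dispatch_to_user(uint64_t user_id, const char *dispatch_json)
{
    /* Step 2: resolve the target user's IP address through Cassandra. */
    char *ip = get_user_ip_by_id("insta", "user", user_id);
    if (ip == NULL)
        return -1;

    /* Step 3: open a TCP connection to that user's home server. */
    int fd = connect_to_peer(ip);
    free(ip); /* get_user_ip_by_id returns a malloc'd buffer */
    if (fd < 0)
        return -1;

    /* Step 4: push the formatted "push dispatch" command and payload. */
    return send_protocol_command(fd, dispatch_json);
}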

User Stories

As a user, I want to create posts with attached photos, captions, tags, and user tags. Tagged users should be notified.

This action will necessitate creating a new dispatch object. Creating the dispatch object will require the following information: the file path for the photo to be added or "no image", the text of the caption, the user_id of the posting user, the audience (0 for a public post), a list of tags, and a list of user tags. This information will be used to create the dispatch object in JSON, which involves opening the file path for the image, converting the image into binary, and filling out the other fields of the dispatch object from the given information. In this case, the parent_type of the dispatch would be 0, indicating a post, and the parent_id would be the user_id of the posting user. A "push dispatch" network protocol will be formatted for the posting user's home server and for each tagged user's server. The client will then execute network operations with each server to push the dispatch JSON object to instances of the MongoDB dispatch collection on each node. Ideally, the users' home nodes will then update the relevant mirror nodes with the new information, but this step has not yet been implemented.
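As an illustration, a post dispatch might be populated as in the following sketch. The field names are paraphrased from the dispatch object documented later in this report, and random_u64 is a hypothetical unique-id generator rather than a function from our codebase.

#include <stdint.h>
#include "dispatch_definitions.h"

uint64_t random_u64(void); /* hypothetical unique-id generator */

/* Sketch: build a dispatch representing a public post. */
struct dispatch make_post(uint64_t poster_id, uint8_t *image_bytes,
                          int image_len, char *caption)
{
    struct dispatch d = {0};
    d.body.media_size = image_len;    /* binary photo data read from the file path */
    d.body.media      = image_bytes;
    d.body.text       = caption;
    d.user_id         = poster_id;
    d.audience_size   = 0;            /* 0 marks a public post */
    d.parent.type     = 0;            /* 0 indicates a post... */
    d.parent.id       = poster_id;    /* ...whose parent is the posting user */
    d.dispatch_id     = random_u64(); /* globally unique dispatch id */
    return d;
}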

As a user, I want to comment on posts, reply to comments, and tag users in comments, notifying the tagged users.

The action will require creating a new dispatch object with the same fields as above, but in this case the parent_type would be 1, indicating a comment or tag. If the dispatch represents a comment, the parent_id would be the dispatch_id of the original post. If the dispatch represents a reply to a comment, then the parent_id would be the dispatch_id of the comment being replied to. A "push child***" network protocol will be formatted for the commenting user's home server and the original poster's home server. A "push user_tag" network protocol will be formatted for each tagged user's server. The client will then execute network operations with each server to push the dispatch object to instances of the MongoDB dispatch collection on each node.

As a user, I want to send photos and messages directly to another user or to a group of users.

The action will require creating a new dispatch object with the same fields as above, but in this case the parent_type would be 0, the parent_id would be the sender's user_id, and the audience field would be a list of the user_ids to which the message should be sent. A "push message*" network protocol will be formatted for the sender's home server and for the server of each user in the audience field. The client will then execute network operations with each server to push the dispatch object to instances of the MongoDB dispatch collection on each node.

As a user, I want to update my feed with posts from other users.

This action requires pulling a user's following list from their user object; a "pull all*****" network protocol will be formulated for each followed user. The client will then execute the network operations with the server of each user in the following list to pull all dispatches from the instances of the MongoDB database on each node that have the contacted node's home user's user_id as parent_id. All returned dispatch JSON objects will then be inserted into the home node's MongoDB dispatch collection. Ideally, this method would pull the most recent dispatches first and pull older dispatches in response to some user action, but this behavior has not been implemented. Because updating feeds comprises a large portion of user actions, prefetching this data would be invaluable in improving the user experience, but this behavior has also not been implemented because this application is currently entirely experimental and does not support a user interface.

As a user, I want to search for users by username.

This action requires querying Apache Cassandra for all users with a given username. For each returned user, a "pull user****" network protocol must be formulated. The client will then execute network operations with each returned user's server. The returned user JSON objects will be inserted into the home node's MongoDB user collection.

As a user, I want to search the posts of users I follow for instances of a tag.

This action requires pulling all user_ids from a user's following list. For each user_id, a "pull tags****" network protocol must be formulated with the given search term. The client will then execute network operations with each followed user's server. The returned dispatch JSON objects will be inserted into the home node's MongoDB dispatch collection.

As a user, I want to visit another user's page and view their user information, their posts, and posts in which they have been tagged.

This action requires formulating "pull child***," "pull user_tag," and "pull user****" network protocols for a given user_id. The client will then execute network operations for each protocol defined with the given user's server. All returned dispatch JSON objects will be inserted into the home node's MongoDB dispatch collection via the "push dispatch" protocol with the home node's server, and all returned user JSON objects will similarly be inserted into the home node's MongoDB user collection via the "push user****" protocol with the home node's server.

Hypothesis on viability

Based on the proven efficiency of Apache Cassandra and the general use cases of a social network like Instagram, we can make some hypotheses about the viability of our proposed system.

The first consideration is the amount of data being stored. If one were to download their entire data profile from a site like Facebook or Instagram, they would see that the majority of the stored data is related to user behavior, such as search history, location data, and marketing data, while the amount of intentionally uploaded user data is relatively small. Because our system does not datamine users, the amount of data that needs to be considered in the context of search and insertion operations within the Mongo database is small and will likely not impact performance.

The second consideration is the efficiency of the DHT for lookup operations on identification and connection data. In this case, the efficiency of Apache Cassandra has already been proven; no lookup should take more than a logarithmic number of hops between nodes in the worst case.

The third consideration is server load. We base our assumption that the peer-to-peer server design is scalable on our use cases and demonstrated behavior, in which a relatively constant number of users will request resources from any given user server, as defined by the social relationships of "followers" and "following." Therefore, even as the overall number of users increases, the average number of "follower" and "following" relationships per user, and thus the number of user servers contacted per user, will remain relatively constant. The exception to this behavior is the proposed existence of a superuser, such as a social media celebrity, who might have millions of followers. To combat the strain this would place on that user's server, we propose that each superuser be fragmented into a number of smaller user objects with a constant number of followers each. The fragmented user's data would be identical and stored either on the same machine running multiple servers in parallel or on a number of mirroring nodes. However, this behavior has not yet been implemented.

Based on these considerations, we believe that our system would be highly scalable and efficient in terms of network and database usage.

Dependencies

Cassandra

Cassandra is a distributed, wide-column-store NoSQL database created and supported by the Apache Software Foundation. It is referred to as a "wide column store" because columns can be dynamically added to rows in a data table, a capability Cassandra uses as part of its organizational philosophy. This means that Cassandra is able to replicate the search efficiency of a more traditional distributed hash table through its use of primary keys, which define the storage location of rows, while offering additional flexibility through its ability to add multiple values or columns to a given row or primary key.

While we did not directly take advantage of Cassandra's flexibility, it might be useful to consider how to utilize Cassandra more effectively as a platform to prefetch relevant and high-demand data so that it might be accessed directly instead of connecting to additional user nodes and querying individual Mongo databases. Under our current design, Cassandra functions simply as a key-value store that ties user_id values to user IP addresses, while also storing a non-unique username that allows for more general querying of users about whom less specific information (i.e., their user_id) is known.

Installation and Configuration:

When building our software, we worked with Cassandra version 3.11.2, which we downloaded from cassandra.apache.org. This site has relatively helpful documentation, though the documentation is community sourced and, when we last looked at it, still very much a work in progress in some sections. Luckily, there are plenty of other helpful resources which guided us through the download and configuration process. Links to such resources can be found throughout this paper.

Download the compressed Cassandra file from cassandra.apache.org/download/, then uncompress the package and extract its contents to the desired location. We extracted the file contents to a directory named apache-cassandra-3.11.2, which was within the "instaclone" directory. The instaclone directory contained all the dependencies and source code of our project. This process is described in a bit more detail in a paper from IBM that acts as a rather effective introductory guide to installing and interfacing with Cassandra.2 Cassandra requires both Java 8 and Python 2 as dependencies, so verify that both of these packages are installed before attempting to build Cassandra.

Once Cassandra is downloaded and built, there are several important points regarding its organization and limitations which must be discussed. First, it is necessary to set the following parameters in the "cassandra.yaml" file that is generated when the software is built (initially, "cassandra.yaml" is located in the "conf" folder of the primary cassandra directory):
● cluster_name -- A name which identifies the cluster you are creating. This name must be common across all of the nodes you plan to configure.
● seeds -- The IP addresses of the "bootstrap" nodes in your cluster. Include the IP of the first node you are setting up, as all other nodes you set up will rely on it to join the cluster.
● listen_address -- The IP address with which other nodes will contact this node (its own IP).
● native_transport_port -- The TCP port number for client software to connect to this node. Ensure that the port is not blocked by a firewall.

There are many other configuration options in the cassandra.yaml file, and some are bound to be useful if further development and tweaking of Cassandra is done, but the stated parameters should be the only changes necessary to get started.

Second, it is important to note that by default only one Cassandra node can run on a single machine without the use of third-party software such as Docker to "containerize" or isolate each instance of Cassandra from the others. Additionally, by default, only a single node may run in each user's filespace (because a user shares the same files across the entire Middlebury computer science network, there is an initial limitation of a single node per user account). This limitation is a result of Cassandra storing all its runtime data and configuration specifications in common directories called "data" and "conf" respectively.

2 https://www.ibm.com/developerworks/library/ba-set-up-apache-cassandra-architecture/index.html

To get around this, we created a "nodes" directory within the apache-cassandra-3.11.2 directory. Within this nodes directory, we created a subdirectory for the hostname of each machine that we planned to use for testing. We then added copies of the conf and data folders from the primary cassandra directory to these node subdirectories. This allows each node to have a unique configuration (cassandra.yaml) file and a unique directory to write data to. We then wrote a script called cassandra_setup.sh which initializes nodes based on their hostnames, setting the CASSANDRA_CONF environment variable to point to the configuration subdirectory for a given node name before running an instance of cassandra on that local machine. Cassandra also requires the following TCP ports to be unblocked in order to run successfully: 7000, 7001, 7199, 9042, and 9160.

Finally, it is worth mentioning some of the tools which allow you to interface with Cassandra. In the bin directory (where the cassandra executable is also stored), there is a wide array of other executables that might be of use -- though two stood out as particularly helpful for our circumstances. The first is the "nodetool" program and its status command. Running "nodetool status" will output the current status of all nodes in the cluster to which the current machine is connected. This can be particularly useful if you are unsure whether your host is currently connected to a cluster, or if you believe that a node in the group has suffered some type of error. If a node has failed and needs to be removed from the cluster, the nodetool command "removenode" can be run from a live node in the cluster to remove the downed node.

The next useful tool is the Python shell that allows you to directly interface with the cassandra cluster that your node is connected to. The shell is called cqlsh (Cassandra Query Language shell), and it allows you to write commands in the Cassandra Query Language (CQL), which in turn allows for manually adding to, removing from, and querying the cluster you are connected to. For a brief tutorial on using the Cassandra Query Language, we will again recommend the IBM guide to getting started with Cassandra. It describes using CQLSH in parallel with Docker, but still provides a useful orientation to the CQL syntax.

DataStax C Drivers

In order to interface with Cassandra from a programming language, we had to track down a library of driver software which would allow us to programmatically interact with Cassandra. Apache does not offer any first-party supported drivers to link Cassandra to a programming language; instead, drivers are provided by third-party developers. We used the DataStax C drivers, because DataStax offered the best documentation and most frequently updated software.3 Additionally, most of the other drivers we evaluated were produced by students, academics, or hobbyists, while DataStax offered an entire suite of Cassandra drivers in a variety of languages as part of its business model. It is also worth noting that DataStax is responsible for a large number of code commits to Apache Cassandra (which is an open-source project), making DataStax's products the closest available approximations to first-party drivers.

3 https://docs.datastax.com/en/developer/cpp-driver/2.9/api/

The DataStax API provided clear example code outlining how to connect to a running instance of Cassandra and to add, query, and remove any desired data. These three operations -- add, query, and remove -- along with connecting to and tearing down the connection to Cassandra, were the only functionality that we required in our work. For more information about the driver API, visit the DataStax C/C++ Driver API Documentation Index.4

When installing the drivers, we elected to download and build them within our own directories instead of attempting to have the driver installed across all accounts on all Middlebury CS lab machines. This process was very similar to the process of downloading and building Cassandra. First, however, we had to ensure that the following dependencies for the C driver were installed: CMake v2.6.4+, libuv v1.x, and OpenSSL v1.1.x or v1.0.x. Once we had gathered and linked the necessary dependencies, we built the C drivers using CMake in a process that is outlined rather clearly on DataStax's website.5 Finally, we had to dynamically link our code against the C drivers at compile time within our makefile. To achieve this, we added a variable to the makefile which included the cassandra.h header file from the C driver 'include' directory and linked the necessary files from the C driver 'build' directory. We also added a line to our bashrc which set the LD_LIBRARY_PATH environment variable to the location of the Cassandra C driver's build directory. We found it convenient to tie this assignment to a bashrc alias named "insta", which also changed our current directory to our primary development directory.
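As a simple connectivity check, the following sketch follows the general pattern of the DataStax example code: it connects to a local node, queries the server version from the built-in system.local table, and frees every driver object it allocates.

#include <cassandra.h>
#include <stdio.h>

int main(void)
{
    CassCluster *cluster = cass_cluster_new();
    CassSession *session = cass_session_new();
    cass_cluster_set_contact_points(cluster, "127.0.0.1");

    CassFuture *connect_future = cass_session_connect(session, cluster);
    if (cass_future_error_code(connect_future) == CASS_OK) {
        /* A simple CQL statement with no bound parameters. */
        CassStatement *statement =
            cass_statement_new("SELECT release_version FROM system.local", 0);
        CassFuture *result_future = cass_session_execute(session, statement);

        if (cass_future_error_code(result_future) == CASS_OK) {
            const CassResult *result = cass_future_get_result(result_future);
            const CassRow *row = cass_result_first_row(result);
            const char *version;
            size_t length;
            cass_value_get_string(cass_row_get_column(row, 0), &version, &length);
            printf("Cassandra version: %.*s\n", (int)length, version);
            cass_result_free(result);
        }
        cass_statement_free(statement);
        cass_future_free(result_future);
    }
    cass_future_free(connect_future);
    cass_session_free(session);
    cass_cluster_free(cluster);
    return 0;
}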

Cassandra User Database API

All functions and structs related to the use of Cassandra can be found in cass_user.c.

Data Structure

The user objects in the Cassandra cluster are stored in the user table of the insta keyspace, accessed by the dot-notation table name insta.user. The user object is created with the following fields and associated types: bigint user_id, text username, inet ip_addr. Here, bigint refers to a 64-bit integer, text refers to a string, and inet is a CassInet type, a big-endian binary representation of an IPv4 address. The table is created with a compound primary key composed of (user_id, username) with the notion that while user_ids are always unique, usernames are not necessarily. With this structure, our queries can search by user_id as well as by username.
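For reference, the CQL corresponding to this layout looks roughly like the following; these statements could be issued through cqlsh or passed as strings to cass_statement_new. The replication options match the keyspace_table_init function described below.

/* CQL for the insta.user layout, shown as C string constants. */
const char *create_keyspace =
    "CREATE KEYSPACE IF NOT EXISTS insta WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 2};";

const char *create_table =
    "CREATE TABLE IF NOT EXISTS insta.user ("
    "  user_id  bigint,"
    "  username text,"
    "  ip_addr  inet,"
    "  PRIMARY KEY (user_id, username)"
    ");";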

Functions

Session Connection

4 https://docs.datastax.com/en/developer/cpp-driver/2.9/api/
5 https://docs.datastax.com/en/developer/cpp-driver/2.9/topics/building/

static int session_connection(struct cass_connect *connection);

This function establishes a session connection to the Cassandra cluster running on localhost. It establishes the basic infrastructure and error checking necessary for any function to execute queries on the Cassandra cluster. It takes as an argument a pointer to an instance of struct cass_connect, outlined below:

struct cass_connect {
    CassError err_code;
    CassCluster* cluster;
    CassSession* session;
    CassFuture* connect_future;
};

Returns 0 on success, or a CassError error code on failure. At the end of any querying function, tear_down_connection must also be called with a pointer to the cass_connect struct in order to free the instances of CassCluster, CassSession, and CassFuture. Failure to do so will almost surely cause a segmentation fault.

static int tear_down_connection(struct cass_connect *connection);

Keyspace Table Init

int keyspace_table_init(char *keyspace, char *table);

The keyspace_table_init function defines and initializes the insta.user table in the Cassandra cluster. First, after calling session_connection, the function defines and initializes the keyspace insta with the options replication_factor: 2 and class: SimpleStrategy. This means that any values stored in the keyspace will be replicated in two locations throughout the cluster, assuming enough space exists, and that the replication strategy will be simple, assuming no responsibilities beyond load balancing across nodes. The function then defines and creates the insta.user table with the fields and keys outlined above. The function closes by calling tear_down_connection. Returns 0 on success, or an error code on failure.

Add User

int add_user(uint64_t user_id, char* username, char* ip);

The add_user function connects to the Cassandra cluster using the session_connection function, and calls keyspace_table_init to initialize the keyspace if it doesn't already exist. The function then builds an UPDATE CQL query using the cass_statement_new function and binds the user_id, username, and ip_addr parameters to it using members of the cass_statement_bind_x function family. The UPDATE query will update any user entry with a matching user_id, or create one if none is found. The query is executed, and the CassStatement and CassFuture variables are freed. The function then closes by calling tear_down_connection. Returns 0 on success, or an error code on failure.
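A sketch of the statement-building step described above might look as follows, assuming a session already opened via session_connection; error checks are elided for brevity.

#include <cassandra.h>
#include <stdint.h>

static void upsert_user(CassSession *session, uint64_t user_id,
                        const char *username, const char *ip)
{
    /* In CQL, an UPDATE on a full primary key is an upsert: it modifies
     * the matching row or creates one if none exists. */
    CassStatement *statement = cass_statement_new(
        "UPDATE insta.user SET ip_addr = ? "
        "WHERE user_id = ? AND username = ?;", 3);

    CassInet inet;
    cass_inet_from_string(ip, &inet); /* dotted-quad string -> binary inet */
    cass_statement_bind_inet(statement, 0, inet);
    cass_statement_bind_int64(statement, 1, (cass_int64_t)user_id);
    cass_statement_bind_string(statement, 2, username);

    CassFuture *future = cass_session_execute(session, statement);
    cass_future_wait(future);
    cass_future_free(future);
    cass_statement_free(statement);
}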

Get User ID by Username

uint64_t *get_user_id_by_username(char *keyspace, char *table, char *username, int *result);

This function takes as arguments a keyspace, table, username, and a pointer to result (an integer to be set to the number of query matches). It queries the Cassandra database for all user entries with matching username fields. Similar to add_user, this function calls session_connection, builds a query with cass_statement_new, executes the statement with cass_session_execute, and frees resources with tear_down_connection. This function returns a pointer to an array of user ids, or NULL if no users with the specified username are found. The result integer will be set to the number of results found. The buffer containing the list of user_ids is malloc'd and must be freed by the caller.
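For example, a caller might use the function as follows ("rfelt" is a hypothetical username used purely for illustration):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t *get_user_id_by_username(char *keyspace, char *table,
                                  char *username, int *result);

int main(void)
{
    int result = 0;
    uint64_t *ids = get_user_id_by_username("insta", "user", "rfelt", &result);
    if (ids == NULL)
        return 1; /* no users with that username */
    for (int i = 0; i < result; i++)
        printf("match %d: user_id %llu\n", i, (unsigned long long)ids[i]);
    free(ids); /* the buffer is malloc'd; the caller must free it */
    return 0;
}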

Get User IP by ID

char * get_user_ip_by_id(char *keyspace, char *table, uint64_t user_id);

This function takes as arguments a keyspace, table, and user_id. It then searches the Cassandra database for the user with the unique user_id using a method similar to get_user_id_by_username. The function returns a char * pointer to an array containing the returned inet value as a string, or NULL on failure. The returned pointer is malloc'd and must be freed at a later time.

MongoDB

MongoDB is a document-based NoSQL database that supports the storage of JSON-like documents in a binary format (BSON documents -- short for binary JSON). While Mongo has the capability for distributed storage through its sharding feature, we explored and utilized its non-distributed functionality, using it as the primary database that stores the entirety of the dispatch and user data generated in our system.

We decided to integrate the Mongo platform into our project because we were interested in the possibility of gaining experience with yet another database platform. Beyond the desire to build new skills, we were also interested in utilizing Mongo's convenient query format, which allowed us to search the database by any combination of fields in the documents we planned on storing. This allows for complex but effective querying of the user and dispatch objects -- Mongo was well suited to the goals of our project because its query format conformed to our initial vision of our primary data structures. Mongo was also an attractive choice for our primary database platform because of its integration with the text-based JSON format and the wide variety of programming languages for which there is first-party driver support. While we initially conjectured that Mongo would simplify our development process because it was so closely tied to the JSON format, we could not have predicted just how valuable this would be when we began to write and debug our network infrastructure. We went on to produce entirely text-based network protocols, largely thanks to Mongo's ease of conversion between the readable JSON format and the storable BSON format.

Installation and Configuration:

Luckily, Mongo was already installed on the Middlebury computer science systems, so we had no need to install the platform ourselves. We did, however, run into an issue configuring Mongo on our systems. Without sudo access, we were unable to create the generic /data/db directory that stores data from a running instance of Mongo. To circumvent this issue we added a data/db directory to our local development directory. This change meant that we had to manually specify the location of the data directory whenever we launched an instance of Mongo, which we achieved by starting the Mongo daemon with "mongod --dbpath" followed by the path to our local data directory.

Unlike Cassandra's, Mongo's drivers are produced by the same company that produces the Mongo platform. Again we selected the C drivers for development and went about installing them. The process for installing the C drivers is outlined in a helpful guide provided by the MongoDB C driver documentation (which can be sparse at times but in this case is rather well defined and specific).6 Again we found that OpenSSL was a prerequisite for the Mongo C drivers, though we also found that it was needed only for authentication and required no special steps on our part to implement or configure.

In order to work with the Mongo C drivers we had to include lines in our makefile that specified the C driver package our code depended upon at compile time and when linking. These lines were:
● 'pkg-config --cflags libmongoc-1.0' (which we referred to with the variable MONGOCOMP)
● 'pkg-config --libs libmongoc-1.0' (which we referred to with the variable MONGOLINK)

We included the MONGOCOMP variable in all makefile recipes that compiled code but were not part of a greater shared library. We included both the MONGOCOMP and MONGOLINK variables in all the makefile recipes which had to be both compiled and linked as part of a shared library. We also included header files for both the Mongo C library and the BSON library -- mongoc.h and bson.h respectively. While working with Mongo we often found it necessary to refer back to

6 http://mongoc.org/libmongoc/current/installing.html

the API references for both libmongoc7 (the library for the Mongo C drivers) and libbson8 (the library for the BSON document type).

Mongo Connect API

In order to efficiently interface with Mongo, we developed a series of functions in the mongo_connect class. We also defined a struct called mongo_connection which keeps track of all the data relevant to an application's active connection to Mongo.

Mongo Connection Object:

The mongo_connection object contains a series of pointers that specify the information which defines an active connection to a Mongo database.

uri_string: a string identifier in the URI format which defines the connection between our application and the Mongo database.9

uri: an identifier which defines the connection between our application and the Mongo database. The uri variable is of type mongoc_uri_t, a type which is standardized by MongoDB.10

client: a pointer to a mongoc_client_t struct that manages sockets and routing to any potential nodes in the MongoDB system.11

database: a pointer to a mongoc_database_t struct that provides access to and acts as a handle for the Mongo database.12

collection: a pointer to a mongoc_collection_t struct that provides access to and acts as a handle for the Mongo collection.13

error: a bson_error_t struct that will be set with the appropriate error code should any aspect of the mongo_connect process or the mongo_teardown process fail.14
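Put together, a declaration consistent with these descriptions would look roughly like the following sketch (the authoritative declaration lives in the mongo_connect source):

#include <mongoc.h>

struct mongo_connection {
    char *uri_string;                /* connection string in URI format */
    mongoc_uri_t *uri;               /* parsed form of uri_string */
    mongoc_client_t *client;         /* manages sockets and routing */
    mongoc_database_t *database;     /* handle for the Mongo database */
    mongoc_collection_t *collection; /* handle for the Mongo collection */
    bson_error_t error;              /* set if connect or teardown fails */
};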

Mongo Connection Functions:

Mongo Connection

7 http://mongoc.org/libmongoc/current/index.html
8 http://mongoc.org/libbson/current/index.html
9 https://docs.mongodb.com/manual/reference/connection-string/
10 http://mongoc.org/libmongoc/current/mongoc_uri_t.html
11 http://mongoc.org/libmongoc/current/mongoc_client_t.html
12 http://mongoc.org/libmongoc/current/mongoc_database_t.html
13 http://mongoc.org/libmongoc/current/mongoc_collection_t.html
14 http://mongoc.org/libbson/current/bson_error_t.html

int mongo_connect(struct mongo_connection *cn, char *db_name, char *coll_name);

This function populates an instance of a mongo_connection struct (cn) and creates a connection to a Mongo database and collection, both of which are specified by name as strings (db_name and coll_name respectively). Returns 0 upon the establishment of a successful connection, -1 otherwise.

Mongo Teardown

int mongo_teardown(struct mongo_connection *cn);

This function closes the connection between an application and the Mongo database and collection that the application is connected to, as specified by the contents of the mongo_connection struct passed by pointer (cn). Returns 0 upon success, -1 otherwise.
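A typical round trip through these two functions looks like the following sketch; mongo_connect.h is an assumed header name for the declarations above.

#include <stdio.h>
#include "mongo_connect.h" /* assumed header for the declarations above */

int example_session(void)
{
    struct mongo_connection cn;

    /* Connect to the user collection of the insta database. */
    if (mongo_connect(&cn, "insta", "user") != 0) {
        fprintf(stderr, "failed to connect to Mongo\n");
        return -1;
    }

    /* ... insert, delete, or query documents here ... */

    return mongo_teardown(&cn);
}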

Build JSON

char* build_json(mongoc_cursor_t *cursor, int req_num, int *result, int *length);

This function creates a dynamically allocated string of JSON documents that reflect the search results generated by a query of a Mongo collection. It takes a pointer to a mongoc_cursor_t object that points to the first result in the series of results generated by querying a Mongo collection (cursor). The function also takes a maximum number of results to append to the JSON string (req_num), a pointer to an integer that will be filled with the actual number of resulting JSONs that were appended (result), and a pointer to an integer that will be filled with the length of the resulting JSON string (length). Returns a pointer to the dynamically allocated JSON string that has been generated, or NULL if the function fails or no results are found to be associated with cursor.
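At the heart of build_json is a cursor walk of this general shape, in which each result document is converted to its JSON text with libbson's bson_as_json; this is a simplified sketch rather than the function's actual body.

#include <mongoc.h>
#include <stdio.h>

static void print_results(mongoc_cursor_t *cursor)
{
    const bson_t *doc;

    while (mongoc_cursor_next(cursor, &doc)) {
        char *json = bson_as_json(doc, NULL); /* malloc'd JSON string */
        printf("%s\n", json);
        bson_free(json);
    }
}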

Insert JSON from fd

int insert_json_from_fd(int fd, char *collection_name);

This function takes a file descriptor (fd) associated with a buffer of JSON string data. It reads from the file descriptor and converts the JSON data into BSON documents. These documents, which describe user or dispatch objects, are then inserted into the appropriate Mongo collection, specified by name (collection_name). If a duplicate user or dispatch object exists in the collection specified by collection_name, then that BSON document is deleted and the more recently created BSON document is inserted in its place. Returns the number of BSON documents inserted successfully, or -1 upon failure.

User Class

The user object contains all of the necessary fields to describe a given user and to perform social networking operations.

User Object:

The user object is defined with the following fields in user_definitions.h:

user_id: a globally unique 64-bit unsigned integer. Ideally, the user_id field would be randomly generated when a new user object is created, but in the context of our networking infrastructure, it benefited us to always have predefined user_ids.

username: a pointer to a character string containing the username of the user.

image_length: a 32-bit unsigned integer that describes the length of the binary data in the image field.

image: a pointer to an array of 8-bit unsigned integers containing a binary representation of the user's profile image.

bio: a substructure of the user object; this field is an instance of struct personal_data and is composed of the following fields:
name: a pointer to a character string containing the name of the user.
date_created: a timestamp given as the time since the Unix epoch that describes when the user object was first created. This field is of type time_t and is currently unused.
date_modified: a timestamp defined the same as date_created that ideally describes when the user object was last modified. The intention behind this timestamp was to have "pull user****" networking protocols return only user objects that have been modified since the last time they were pulled, in order to limit network usage. This field is of type time_t and is currently unused.

fragmentation: an integer representing whether a user's user object has been fragmented into multiple user objects. The intention behind this field is that a superuser with a large number of followers would generate a substantial amount of network traffic and strain on their server. Our solution would be to fragment such a user into multiple smaller users with identical data whose increased number of mirrors would be spread out over a wider area. We never implemented this behavior, and the fragmentation field is currently unused.

followers: a substructure of the user object; this field is an instance of struct insta_relations and is composed of the following fields:

direction: an integer representing the direction of the relationship. 0 for followers.
count: an integer representing the number of integers in the user_ids field.
user_ids: a pointer to an array of 64-bit unsigned integers representing the user_ids of the users following a given user.

following: a substructure of the user object, also an instance of struct insta_relations:
direction: an integer representing the direction of the relationship. 1 for following.
count: an integer representing the number of integers in the user_ids field.
user_ids: a pointer to an array of 64-bit unsigned integers representing the user_ids of the users followed by a given user.

Note: several fields are almost always stored as heap variables, so it is important to call user_heap_cleanup() on user structs returned or populated by functions.
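A declaration consistent with the field descriptions above would look roughly like this sketch (the authoritative version lives in user_definitions.h):

#include <stdint.h>
#include <time.h>

struct personal_data {
    char *name;
    time_t date_created;   /* currently unused */
    time_t date_modified;  /* currently unused */
};

struct insta_relations {
    int direction;         /* 0 for followers, 1 for following */
    int count;             /* number of entries in user_ids */
    uint64_t *user_ids;
};

struct user {
    uint64_t user_id;      /* globally unique identifier */
    char *username;
    uint32_t image_length; /* length of the binary profile image */
    uint8_t *image;
    struct personal_data *bio;
    int fragmentation;     /* currently unused */
    struct insta_relations *followers;
    struct insta_relations *following;
};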

User Functions:

The user functions can be divided into three groups: primitives for manually inserting, deleting, and printing entries and structs; user search functions; and higher-level functions used by the server.

Group 1: Primitives for Manually Modifying and Viewing Entries

Insert User

int insert_user(struct user *new_user);

This function takes a user struct, new_user, as an argument and inserts it into a user's user collection, which is contained in the Mongo database insta. In order to be inserted, the user struct is transposed to a BSON document (Mongo's binary JSON format). Returns 0 upon successful insertion, -1 on failure or if new_user is already contained in the user collection.

Delete User

int delete_user(uint64_t user_id);

This function takes a user_id as an argument and deletes the corresponding user object from a user's user collection, contained in the Mongo database insta. Returns 0 on success, or -1 on failure.

User Heap Cleanup

void user_heap_cleanup(struct user *user);

This function takes a pointer to a user struct as an argument and frees all malloc'd memory within, including bio->name, bio, followers->user_ids, followers, following->user_ids, and following.

Print User Struct

void print_user_struct(struct user *user);

This function takes a pointer to a user struct as an argument and prints all fields in a formatted list.

Group 2: User Search Functions

Search User by Name Mongo

char * search_user_by_name_mongo(char *username, int req_num, int *result, int *length);

This function takes a pointer to a string that will be used to search the collection (username), an upper limit on the number of results to query for (req_num), a pointer to an integer which is updated to reflect the number of user objects found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of user objects (in JSON format) generated and returned by this query (length). This function queries the user collection in the insta Mongo database for any user objects such that the username or bio.name field matches the queried username. If req_num is equal to -1, then the limit on the number of results to query for is set to INT_MAX, allowing for an exhaustive search of the user collection. This function returns a pointer to a JSON string that contains all of the resulting BSON documents that were found by the query, or NULL on failure. This pointer must be free'd by the calling function.
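The $or filter described above can be expressed with libbson's BCON macros roughly as follows; the actual implementation may construct the query differently, and build_json would then walk the returned cursor.

#include <mongoc.h>

/* Sketch: match user documents whose username or bio.name equals the
 * queried name. */
static mongoc_cursor_t *find_users_by_name(mongoc_collection_t *coll,
                                           const char *username)
{
    bson_t *filter = BCON_NEW("$or", "[",
                              "{", "username", BCON_UTF8(username), "}",
                              "{", "bio.name", BCON_UTF8(username), "}",
                              "]");
    mongoc_cursor_t *cursor =
        mongoc_collection_find_with_opts(coll, filter, NULL, NULL);

    bson_destroy(filter);
    return cursor;
}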

Search User by Id Mongo

char * search_user_by_id_mongo(uint64_t user_id, int req_num, int *result, int *length);

This function takes a uint64_t that will be used to search the user collection (user_id), an upper limit on the number of results to query for (req_num), a pointer to an integer which is updated to reflect the number of user objects found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of user objects (in JSON format) generated and returned by this query (length). This function queries the user collection in the insta Mongo database for any user objects such that the user_id field matches the queried user_id. If req_num is equal to -1, then the limit on the number of results to query for is set to INT_MAX, allowing for an exhaustive search of the user collection. This function returns a pointer to a JSON string that contains all of the resulting BSON documents that were found by the query, or NULL on failure. This pointer must be free'd by the calling function.

Group 3: Higher Level Definitions Used by the Server

Handle User Bson

int handle_user_bson(bson_t *doc);

This function takes a pointer to a BSON document that contains the data describing a user (doc). The function parses this data into an instance of a user struct, then uses the user_id field and the search_user_by_id_mongo function to identify whether the referenced user is a duplicate of any other user BSON document currently in the user collection. If a duplicate user is found, it is deleted, and the newly populated user struct (which was generated according to the specifications defined by doc) is inserted in its place. This function returns 0 upon success, -1 otherwise.

Parse User Bson

int parse_user_bson(struct user *user, const bson_t *doc);

This function takes a pointer to a user struct (user), which will be populated with the information stored in the BSON document pointed to by doc. This function populates user and returns 0 if the struct has been successfully populated; otherwise -1 is returned. It is important to note that several fields in the user struct always exist as heap variables: user->bio->name, user->bio, user->followers->user_ids, user->followers, user->following->user_ids, and user->following. It is up to the caller to free these variables using user_heap_cleanup().

Dispatch Class

The dispatch object is our abstraction for any post, comment, or message in a conversation thread. After reviewing the work completed during the 2017 summer research session, we felt that it was key that we abide by its philosophy that most internet sites can be reduced to a common set of actions and data structures. In applying this rationale, we focused on creating a single data structure that would handle all user-generated content. We hypothesized that while a post containing a photo and a comment on said photo might appear very different from a user's point of view, the underlying data that organizes and defines the two pieces of media is very similar. The definition for the dispatch struct and the definitions for all dispatch methods can be found in dispatch_definitions.h and dispatch_definitions.c respectively.

Dispatch Object:

With the initial goal of creating a common data structure for all user-generated content, or dispatches, we built a dispatch object definition which includes the following fields in dispatch_definitions.h:

dispatch_body: A substructure of the dispatch object, the dispatch body includes all fields which define the primary media of a dispatch object:
media_size: A simple integer value expressing the size of the media binary that is pointed to by the media field. There is no explicit maximum value for the media size, but our client and server programs cap the amount of data they read or write in a single operation at 16000 bytes, which implies that the maximum media size under the current configuration is less than 16000 bytes (as some of the bytes transmitted in any given network communication must be used by the network protocols and the other fields of the dispatch object).
media: A pointer to binary data for the primary media of a dispatch. In the case of a post, it would likely be a pointer to the binary data for a photograph; in the case of a comment or message, which lacks primary media, the media field points to a binary hash of the string "no image".
text: A pointer to a character string that acts as a caption for the primary media in the case where the dispatch describes a post. When the dispatch describes a comment or a message, the text field is ostensibly the primary media.

user_id: A 64-bit random integer. Each user has a globally unique 64-bit user_id which is the primary identifier for their user object. When used in reference to a dispatch, the user_id identifies the user who created the given dispatch.

timestamp: A timestamp given as the time since the Unix epoch that marks when a dispatch object was initially created. There are currently no functions which take advantage of the timestamp field, but it is well suited to allow for optimization of the dispatch-based functions and queries outlined later in this section. This field is of type time_t.

audience_size: An integer which broadly specifies the intended audience of a dispatch, or indicates that there is a select subset of users for whom the dispatch is meant. In the case where the audience size is equal to 0, the dispatch is meant to be viewed by all followers, and the subsequent audience field will consist of an empty array. If audience size is greater than zero, then the audience field is expected to consist of an array of user_ids such that the number of user_ids is equal to the audience size. Audience size is capped at a functional maximum value of 32, meaning that there should never be more than 32 user_ids stored in the audience field's array. When we initially formulated the functionality of the audience and audience size fields for the dispatch object, we planned to use the fields as a means to define the privacy of a dispatch. In the end, we did not implement formal privacy procedures or protections, but we had planned on designing our system such that an audience size of 0 would indicate that a dispatch is viewable only by followers, an audience size greater than 0 and less than or equal to 32 would indicate that a dispatch is viewable only by the users specified by user_id in the audience field's array, as a group message (mirroring Instagram's group messaging functionality), and an audience size greater than 32 would indicate that a dispatch is viewable by the "public" -- with no privacy restrictions placed on who could access the dispatch.
audience: An array of user_ids which specifies which users are meant to view a dispatch. This field is intended to define a list of group members who would be able to view a dispatch functioning as a group message. This array is capped at 32 users as per the Instagram spec, and out of a desire for the future privacy specifications outlined in the definition of the audience size field. The number of users in this list is specified by audience_size.
num_tags: An integer indicating the number of tags in the array of tags that is defined in the subsequent tags field. The maximum number of tags, and therefore the maximum value of num_tags, is 30. This value reflects the cap that Instagram places on the number of tags per post, and was selected for the sake of both convenience and efficiency.
tags: An array of keyword strings, or tags, that can be used for querying. There is a maximum of 30 tags which can be assigned to a single dispatch, and each tag is capped at a maximum of 50 characters.
num_user_tags: An integer indicating the number of users tagged via id in the array of user_ids that is defined in the subsequent user_tags field. The maximum number of tagged users, and therefore the maximum value of num_user_tags, is 30. This is the same as the maximum number of traditional tags which can be associated with a dispatch.

user_tags: An array of user_ids of tagged users. Tagged users will have the dispatch object that they are tagged in pushed to the machine and database that their user_id is associated with. A maximum of 30 user_tags can be assigned to a dispatch object.

parent: A substructure of the dispatch object; the parent describes the context of the dispatch:

type: An integer which contextualizes the subsequent id field of the parent substructure, and by extension provides context for the dispatch. If type is set to 0, the id field is interpreted as a user_id, and any dispatch whose parent substructure contains a user_id should be interpreted as a traditional post: the user who “posted” is the parent of the dispatch, and the dispatch object is directly associated with that user. If type is set to 1, the id field is interpreted as a dispatch_id, and any dispatch with a dispatch_id as its parent id should be interpreted as a comment: the dispatch referenced by the parent id is the head of the comment thread that this dispatch is a part of.

id: A 64-bit integer that is either a user_id or a dispatch_id, as specified by type.

fragmentation: An integer which indicates across how many users’ systems a given dispatch is replicated. This field is included for the purpose of future optimization but is currently unused, making its value arbitrary.

dispatch_id: A 64-bit integer that is assigned randomly to each dispatch and assumed to be globally unique within the scope of our current platform (though strictly, the dispatch_id must only be unique when referenced in combination with the user_id field of a dispatch; in other words, a user should never create multiple dispatches with the same dispatch_id).
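Taken together, these fields suggest the following C layout. This is a hedged sketch reconstructed from the descriptions above, not the literal contents of dispatch_definitions.c; in particular, representing body and parent as heap-allocated pointers is our reading of the dispatch_heap_cleanup description below.

#include <stdint.h>
#include <time.h>

#define MAX_AUDIENCE  32   /* cap on the audience array (Instagram group size) */
#define MAX_TAGS      30   /* cap on keyword tags */
#define MAX_USER_TAGS 30   /* cap on tagged users */

struct dispatch_body {
    int      media_size;   /* size of the media binary; < 16000 bytes under current configs */
    uint8_t *media;        /* primary media binary, or a hash of "no image" */
    char    *text;         /* caption, or the primary content of a comment/message */
};

struct dispatch_parent {
    int      type;         /* 0: id is a user_id (post); 1: id is a dispatch_id (comment) */
    uint64_t id;
};

struct dispatch {
    struct dispatch_body   *body;
    uint64_t                user_id;        /* creator of this dispatch */
    time_t                  timestamp;      /* creation time, seconds since the Unix epoch */
    int                     audience_size;  /* 0 => all followers */
    uint64_t                audience[MAX_AUDIENCE];
    int                     num_tags;
    char                   *tags[MAX_TAGS]; /* each tag at most 50 characters */
    int                     num_user_tags;
    uint64_t                user_tags[MAX_USER_TAGS];
    struct dispatch_parent *parent;
    int                     fragmentation;  /* reserved; currently unused */
    uint64_t                dispatch_id;    /* unique per (user_id, dispatch_id) pair */
};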

Dispatch Functions: The dispatch functions can be sorted into three primary groups: those which change the state of the Mongo dispatch collection (inserting and deleting dispatches), those which search the Mongo dispatch collection (querying for dispatches by specific fields), and those which interface with higher levels of testing code and other elements of our system (such as parsing between BSON format and dispatch structs). All of the following functions are found in dispatch_definitions.c.

Group 1: Changing the State of the Mongo Dispatch Collection

Insert Dispatch

int insert_dispatch(struct dispatch *dis);

This function takes a dispatch struct, dis, as an argument and inserts it into a user’s dispatch collection, which is contained in the Mongo database insta. In order to be inserted, the dispatch struct is converted to a BSON document (Mongo’s binary JSON format). Returns 0 upon successful insertion, -1 otherwise.
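A minimal usage sketch, assuming the struct layout sketched above; the ids here are illustrative values, and insert_dispatch is declared in our dispatch library.

#include <stdint.h>
#include <time.h>

/* Sketch: build a bare-bones post and insert it. insert_dispatch converts the
 * struct to BSON internally; fields left out of the initializers are zeroed,
 * so audience_size == 0 (all followers) and num_tags == 0. */
int make_example_post(void) {
    struct dispatch_body body = {
        .media_size = 8,
        .media      = (uint8_t *)"no image",  /* placeholder media */
        .text       = "first post",
    };
    struct dispatch_parent parent = { .type = 0, .id = 12345 };  /* type 0: a post */
    struct dispatch dis = {
        .body        = &body,
        .parent      = &parent,
        .user_id     = 12345,       /* hypothetical user_id */
        .dispatch_id = 67890,       /* hypothetical dispatch_id */
        .timestamp   = time(NULL),
    };
    return insert_dispatch(&dis);   /* 0 on success, -1 otherwise */
}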

Delete Dispatch

int delete_dispatch(uint64_t dispatch_id);

This function queries the dispatch collection of the insta database for a BSON document that contains the given dispatch_id. Returns 0 if such a dispatch is found and deleted, otherwise returns -1.

Search Dispatch by ID

char *search_dispatch_by_id(uint64_t dispatch_id, int req_num, int *result, int *length);

This function queries the dispatch collection of the insta database for a BSON document that contains the dispatch id specified by dispatch_id. If such a dispatch exists, it is converted from a BSON document to a JSON string. A pointer to this JSON string is returned, and the number of dispatches found (and contained in the JSON string) is reflected in the integer pointed to by result. The function returns a NULL pointer and sets result to 0 if no dispatches are found. The integer argument req_num is a cap on the number of results to search for (i.e. a query terminates after finding a single result if req_num == 1). If req_num is set to -1, the cap on search results is set to INT_MAX, making the limiting factor on the number of possible results the number of matching dispatches in the collection being queried. The length argument is a pointer to an integer that is populated with the final size of the resulting buffer of dispatches in JSON format. If no dispatches are found, length is set to 0.
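A usage sketch, continuing the hypothetical ids from the insertion example above:

#include <stdio.h>
#include <stdlib.h>

/* Sketch: fetch a single dispatch by id. req_num == 1 stops the query after
 * the first match; the returned JSON buffer is owned by the caller. */
void find_example_dispatch(void) {
    int result = 0, length = 0;
    char *json = search_dispatch_by_id(67890, 1, &result, &length);
    if (json == NULL)
        return;  /* result was set to 0: no matching dispatch */
    printf("found %d dispatch(es), %d bytes of JSON\n", result, length);
    free(json);
}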

Group 2: Searching the Mongo Dispatch Collection

Search Dispatch by User Audience

char *search_dispatch_by_user_audience(uint64_t user_id, uint64_t *audience, int audience_size, int req_num, int *result, int* length);

This function takes a user_id, a pointer to a list of user_ids who define the audience of a dispatch (audience), the size of the audience list (audience_size), an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches that matched the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the results found during the query. This pointer must be freed by the function which calls search_dispatch_by_user_audience. If no dispatch objects are found, NULL is returned and length is set to 0. This return condition is consistent across all dispatch search functions.

Search Dispatch by Parent ID

char *search_dispatch_by_parent_id(uint64_t dispatch_id, int req_num, int *result, int *length);

This function takes a dispatch_id, an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the BSON documents found during the query, or NULL on failure. This pointer must be freed by the function which calls search_dispatch_by_parent_id.

Search Dispatch by Tags

char *search_dispatch_by_tags(const char *query, int req_num, int *result, int *length);

This function takes a pointer to a string that will be used to search the collection (query), an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the resulting BSON documents found by the query, or NULL on failure. This pointer must be freed by the function which calls search_dispatch_by_tags.

Search Dispatch by User Tags

char *search_dispatch_by_user_tags(uint64_t query, int req_num, int *result, int *length);

This function takes a user_id that will be used to search the collection for all instances where the id appears in the user_tags field of a dispatch (query), an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the resulting BSON documents found by the query, or NULL on failure. This pointer must be freed by the function which calls search_dispatch_by_user_tags.

Group 3: Transitional Functions

Parse Dispatch BSON

int parse_dispatch_bson(struct dispatch *dis, const bson_t *bson_dispatch);

This function takes a pointer to a dispatch struct (dis), which will be populated with the information stored in the BSON document pointed to by bson_dispatch. It returns 0 if the struct has been successfully filled; otherwise -1 is returned. Note that because the audience, user_tags, and tags arrays are of a set size, the parse_dispatch_bson function only initializes a number of entries equal to audience_size, num_user_tags, or num_tags, respectively. Any other data found in these arrays is invalid, so any code iterating through these arrays should only iterate up to the counts these fields state (a usage sketch follows the Dispatch Heap Cleanup entry below).

Print Dispatch Struct

int print_dispatch_struct(struct dispatch *dis);

This function takes a pointer to a dispatch struct (dis) and prints its contents as a formatted list.

Dispatch Heap Cleanup

void dispatch_heap_cleanup(struct dispatch *dis);

This function frees the memory associated with a dispatch passed by pointer (dis), as well as its substructs (the body and parent structs) and its variable-length string fields. It should be called after calling the parse_dispatch_bson function, once the user is finished referencing the memory associated with the dispatch struct that was populated.
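The intended call pattern for these functions, sketched under the same struct assumptions as above; bson_dispatch stands in for a document obtained from a prior Mongo query.

#include <bson/bson.h>
#include <stdio.h>

/* Sketch: parse a BSON document into a dispatch struct, read its tags
 * safely, then release the parsed memory. */
void consume_dispatch(const bson_t *bson_dispatch) {
    struct dispatch dis;
    if (parse_dispatch_bson(&dis, bson_dispatch) != 0)
        return;                     /* -1: the struct could not be filled */
    /* Only the first num_tags entries of the tags array are valid. */
    for (int i = 0; i < dis.num_tags; i++)
        printf("tag %d: %s\n", i, dis.tags[i]);
    print_dispatch_struct(&dis);
    dispatch_heap_cleanup(&dis);    /* frees substructs and variable-length strings */
}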

Handle Dispatch BSON

int handle_dispatch_bson(bson_t *doc);

This function takes a pointer to a BSON document that contains the data describing a dispatch (doc). The function parses this data into an instance of a dispatch struct, then uses the dispatch_id field and the search_dispatch_by_id function to determine whether the referenced dispatch is a duplicate of any dispatch BSON document currently in the dispatch collection. If a duplicate dispatch is found, it is deleted, and the newly populated dispatch struct (generated to the specifications defined by doc) is inserted in its place. This function returns 0 upon success, -1 otherwise.

Network Setup and Protocols

Network Setup

Once Apache Cassandra is configured, the network setup for our project is relatively straightforward. Almost all layers of abstraction described in the different libraries culminate in the server, a simple single-threaded application that accepts incoming TCP connections and services requests to the instances of the Mongo databases running on nodes. The client that connects to the server is also fairly straightforward. It looks up the IP address of a node in the Cassandra cluster from a given user_id, initiates a TCP connection to that user’s node, reads network protocol commands from a file, and writes the commands into the connection file descriptor. In the case of a push protocol, it simply disconnects after all data has been written. In the case of a pull command, it waits for a response, aggregates all data as a single heap variable, formulates a “push dispatch” or “push user****” protocol, opens a connection to the instance of the server on the local machine, executes the push protocol, and disconnects. In this fashion the server mediates all access to any instance of a Mongo database, even instances on the same machine as the client. We chose to design the system this way to avoid premature optimization via multithreading, and because the distribution of C drivers we use to access Mongo makes no promises about the thread safety of various critical functions.
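To make the client’s push path concrete, here is a minimal sketch. It assumes a hypothetical helper, get_user_ip, standing in for the user_id-to-IP lookup performed through our Cassandra library (the real helper may have a different name and signature), and the port 3999 used by the Python wrappers described later.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#define INSTA_PORT 3999  /* the port hardcoded in the Python wrappers */

/* Hypothetical stand-in for the Cassandra lookup of a node's IP address. */
int get_user_ip(uint64_t user_id, char *ip, size_t len);

int send_command(uint64_t user_id, const char *command, size_t len) {
    char ip[INET_ADDRSTRLEN];
    if (get_user_ip(user_id, ip, sizeof(ip)) != 0)
        return -1;                                  /* user not in the cluster */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(INSTA_PORT) };
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    ssize_t n = write(fd, command, len);  /* a push writes and disconnects */
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}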

Network Protocols

There are two classes of networking protocols in this system: pushes and pulls. Pulls are used to update a user’s database with information from another database, and pushes are used to force an update to an item in another user’s database. All protocol fields are 14 characters long (including the space separating the protocol field from the content of the message). All network protocol functions are defined in network_protocols.c. Each network protocol function takes two file descriptors as arguments, an in file descriptor from which to read data and an out file descriptor to which to write data; we chose to do this for testing purposes. In the server, both file descriptors point to the TCP connection. We chose to keep this behavior in case the server someday becomes a more complicated apparatus that requires reading and writing to multiple locations.
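For reference, each protocol field documented below is a 13-character string padded with asterisks, followed by the separating space. A sketch of how the fields might be defined as C constants (the macro names are ours for illustration; the strings are the documented protocol fields):

/* Protocol fields: 13 characters padded with '*', plus one separating space. */
#define PROTOCOL_FIELD_LEN 14

#define PULL_CHILD    "pull child*** "
#define PULL_ALL      "pull all***** "
#define PULL_DISPATCH "pull dispatch "
#define PULL_USER     "pull user**** "
#define PULL_USER_TAG "pull user_tag "
#define PULL_TAGS     "pull tags**** "
#define PUSH_CHILD    "push child*** "
#define PUSH_USER_TAG "push user_tag "
#define PUSH_MESSAGE  "push message* "
#define PUSH_DISPATCH "push dispatch "
#define PUSH_USER     "push user**** "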

Pull Protocols

Pull Child

pull child***

This request pulls all children of a given dispatch object, i.e. comments, replies to comments, and shared posts.

int pull_child(int in, int out);

This function reads a parent dispatch_id from the in file descriptor, queries the database ​ ​ for all dispatches with this dispatch_id as parent, and writes the results to file descriptor out as JSON objects. ​ This function returns the number of results on success or -1 on failure.
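A sketch of pull_child under the conventions above. We assume the dispatch_id arrives as ASCII text on the in descriptor; the actual wire format in network_protocols.c may differ.

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

int pull_child(int in, int out) {
    char buf[32] = {0};
    if (read(in, buf, sizeof(buf) - 1) <= 0)
        return -1;
    uint64_t parent_id = strtoull(buf, NULL, 10);

    int result = 0, length = 0;
    /* req_num == -1: no cap on the number of children returned. */
    char *json = search_dispatch_by_parent_id(parent_id, -1, &result, &length);
    if (json == NULL)
        return -1;
    write(out, json, length);  /* children are written back as JSON objects */
    free(json);
    return result;             /* number of results on success */
}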

Pull All

pull all*****

This request pulls all dispatches whose parent_id field matches the supplied user_id, i.e. all public posts by a user.

int pull_all(int in, int out);

This function reads a user_id from the in file descriptor, queries for all dispatches by that ​ ​ ​ ​ user, and writes them to file descriptor out as JSON objects. ​ ​ This function returns the number of results on success or -1 on failure.

Pull Dispatch

pull dispatch

This request pulls a dispatch specified by dispatch_id. ​ ​

int pull_dispatch(int in, int out);

This function reads a dispatch_id from the in file descriptor, queries the database for that ​ ​ ​ ​ dispatch, and writes it to the out file descriptor as a JSON object. ​ ​ This function returns the number of results on success or -1 on failure.

Pull User

pull user****

This request pulls a user object specified by user_id. ​ ​

int pull_user(int in, int out);

This function reads a user_id from the in file descriptor, queries the database for that ​ ​ ​ ​ user, and writes the user object to the out file descriptor as a JSON object. ​ ​ This function returns the number of results on success or -1 on failure.

Pull User Tag

pull user_tag

This request pulls all dispatches from a given node with a user specified by user_id ​ included in the user_tags field of the dispatch object. ​ ​

int pull_user_tags(int in, int out);

This function reads a user_id from the in file descriptor, queries the database for all ​ ​ ​ ​ dispatches with user_id included in the user_tags field, and writes the results to the out ​ ​ ​ ​ ​ file descriptor as a JSON object. This function returns the number of results on success or -1 on failure.

Pull Tags

pull tags****

This request pulls all dispatches from a given node with the string tag included in the ​ ​ tags field of the dispatch object. ​

int pull_tags(int in, int out);

This function reads a string query from the in file descriptor (delimited by a space), ​ ​ queries the database for all dispatches tagged with that string, and writes the results to the out file descriptor as a JSON object. ​ ​

This function returns the number of results on success or -1 on failure.

Push Protocols

Note: all of the push protocols currently function in exactly the same way. We decided to create a separate function for each protocol because the behavior of the server will need to change in the future to update the client process depending on what type of data it receives.

Push Child

push child*** {json dispatch}

This request sends a child dispatch, i.e. a comment or a reply to a comment, as a JSON object to another user.

int push_child(int in, int out);

This function allows the server to receive a child dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.
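Since the push handlers all share this shape, one sketch suffices. It reads the JSON body from in, converts it to BSON with libbson’s bson_new_from_json, and lets handle_dispatch_bson perform the duplicate check and insertion; the single fixed-size read is a simplification of the real 16000-byte chunked I/O.

#include <bson/bson.h>
#include <unistd.h>

int push_child(int in, int out) {
    (void)out;                             /* pushes write no response */
    char buf[16000];
    ssize_t n = read(in, buf, sizeof(buf));
    if (n <= 0)
        return -1;

    bson_error_t error;
    bson_t *doc = bson_new_from_json((const uint8_t *)buf, n, &error);
    if (doc == NULL)
        return -1;                         /* malformed JSON */

    int ret = handle_dispatch_bson(doc);   /* delete any duplicate, then insert */
    bson_destroy(doc);
    return ret;                            /* 0 on success, -1 on failure */
}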

Push User Tag

push user_tag {json dispatch}

This request sends a dispatch as a JSON object to a user whose user_id is included in ​ ​ the user_tags field of the dispatch object. ​ ​

int push_user_tag(int in, int out);

This function allows the server to receive a user_tag dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.

Push Message

push message* {json dispatch}

This request sends a dispatch as a JSON object to a user whose user_id is included in ​ ​ the audience field of the dispatch object. ​ ​

int push_message(int in, int out);

This function allows the server to receive a direct message dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.

Push Dispatch

push dispatch {json dispatch}

This request sends a dispatch as a JSON object to a specified user’s database. The intention is that a user would use this protocol to push a dispatch to the instance of the Mongo insta.dispatch database on their local machine.

int push_dispatch(int in, int out);

This function allows the server to receive a dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.

Push User

push user**** {json user}

This request sends a user as a JSON object to a specified user’s database. The intention of this protocol is that a user would use it to update their user object.

int push_user(int in, int out);

This function allows the server to receive a user as a JSON object and insert it into the insta.users Mongo database. ​ This function returns 0 on success or -1 on failure.

Parse Server Command

int parse_server_command(int in, int out);

This function is a wrapper for all of the network protocol functions and is the core function of the server. It reads the first 14 bytes (the protocol field length) from the in file descriptor, compares them to each defined protocol, and calls the corresponding protocol function. This function returns -1 on failure, or the result of the corresponding network protocol function on success.
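A sketch of the dispatcher, reusing the protocol strings sketched earlier; only three of the eleven comparisons are shown, since the rest follow the same pattern.

#include <string.h>
#include <unistd.h>

int parse_server_command(int in, int out) {
    char field[PROTOCOL_FIELD_LEN + 1] = {0};
    if (read(in, field, PROTOCOL_FIELD_LEN) != PROTOCOL_FIELD_LEN)
        return -1;

    if (strncmp(field, PULL_CHILD, PROTOCOL_FIELD_LEN) == 0)
        return pull_child(in, out);
    if (strncmp(field, PULL_ALL, PROTOCOL_FIELD_LEN) == 0)
        return pull_all(in, out);
    if (strncmp(field, PUSH_DISPATCH, PROTOCOL_FIELD_LEN) == 0)
        return push_dispatch(in, out);
    /* ...the remaining pull and push protocols follow the same pattern... */
    return -1;  /* unrecognized protocol field */
}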

Testing Design and Utilities

In order to test our platform, we needed to be able to produce a series of user actions and user and dispatch data with which we could simulate user interaction. We pivoted from C to Python, which allowed us to efficiently write simple wrapper functions that directly correlate to the user stories we originally outlined while still utilizing the mid-level C code we had previously developed. We then wrote programs which allow us to call a series of these user action functions based on semi-randomly generated csv files containing references to the functions and the necessary arguments to run them. We also wrote Python programs to build user and dispatch objects as JSON files, which we could programmatically populate with semi-random data. Through this process we were able to programmatically generate posts, messages, comments, user updates, and other user-generated content while mimicking the behavior of user actions in the running process.

Build User JSON

def build_user(user):

This Python function takes a dictionary (user) which is based on the original definition of a user object and contains the following data fields:

user_id: a 64-bit integer representing a user object’s globally unique ID.

username: a string representation of a user’s username as defined by their username entry in the Cassandra database.

image_path: a string that describes the path to a user’s icon/image file.

fragmentation: an integer describing the fragmentation attribute of the user object being constructed (currently unused, and set to 0 by default in other testing programs).

followers: an array of user_ids describing the followers associated with the user object being constructed.

following: an array of user_ids describing the users followed by the user who “owns” the user object being constructed.

From the provided user data, this function builds a JSON document formatted in the extended style that is canonical to MongoDB.15 This JSON string is returned if built successfully; otherwise None is returned upon an error.

Build Dispatch JSON

def build_dispatch(dis):

This Python function takes a dictionary (dis) which is based on the original definition of a dispatch object and contains the following data fields:

15 https://docs.mongodb.com/manual/reference/mongodb-extended-json/

media_path: a string that describes the path to the primary media of a dispatch. If the dispatch object is functioning as a comment or message and no primary media path is provided, this field should be set to the string ‘no image’.

body_text: a string that acts as a caption for the dispatch’s primary media.

user_id: a 64-bit integer representing a user’s globally unique ID.

audience: an array of user_ids. This array is limited to a maximum of 32 entries as defined in the dispatch object definition.

tags: an array of strings that act as queryable tags. This array is limited to a maximum of 30 entries as defined in the dispatch object definition.

user_tags: an array of user_ids of tagged users. This array is limited to a maximum of 30 entries as defined in the dispatch object definition.

parent_type: the type of the id referenced by the parent_id field of this dictionary. Refer to the parent object definition in the dispatch object definition section of this document for further information.

parent_id: either a user_id or a dispatch_id, depending on the value of the parent_type field.

fragmentation: an integer describing the fragmentation attribute of the dispatch object being constructed (currently unused, and set to 0 by default in other testing programs).

dispatch_id: a 64-bit integer that can be used to explicitly reference a single dispatch object. A dispatch_id should be unique to a user, meaning that there are no duplicate (dispatch_id, user_id) pairs.

From the provided dispatch data, this function builds a JSON document formatted in the extended style that is canonical to MongoDB. This JSON string is returned if built successfully; otherwise None is returned upon an error.

Python User Action Functions (Petepics.py)

Petepics.py contains a library of Python functions which act as wrappers that directly describe the user actions we outlined when first laying the groundwork for this project. The library also contains several helper functions for formatting user and dispatch data. These functions take advantage of the underlying C functions we wrote to interact with Mongo and perform network communication. The following functions are defined in petepics.py:

Helper Functions

Search User

def search_user(user_id)

This function searches the Mongo user collection for a user document with an id matching the one specified by user_id. It returns a JSON document generated from the matching user object, if found.

Decode Following

def decode_following(user)

This function takes a user object in the form of a dictionary (as defined in the documentation for the build_user function), and returns an array of all the user_ids contained in the set of users listed in the following field of the user object.

Search Dispatch

def search_dispatch(dispatch_id)

This function searches the Mongo dispatch collection for a dispatch document with a matching dispatch_id. It returns a JSON document generated by the dispatch object specified by dispatch_id if found.

Create Dispatch Command

def create_dispatch_command(dispatch_type, dispatch_json)

This function creates a network protocol string based on the string dispatch_type. It returns a correctly formulated network protocol string composed of the specified network protocol and an appended JSON document (dispatch_json) describing a dispatch object.

User Action Functions

Most user action functions use the client.c program to execute formulated network protocol commands. The path “./client” and port “3999” are hardcoded in every instance.

Lookup User

def lookup_user(username)

This function executes the search_user.c program, which searches the Cassandra hash table for users with a given name (username), then creates a “pull user****” command for each user_id returned as a result of the search. This function acts solely as a wrapper for the search_user.c program, and has no added functionality.

Write Dispatch

def write_dispatch(media_path, body_text, user_id, audience, tags, user_tags, parent_type, parent_id, fragmentation, dispatch_id, dispatch_type)

This function takes values for all of the fields of a dispatch dictionary as defined in the documentation for the build_dispatch function. It calls that function, then sends the resulting JSON to every user who should have the dispatch pushed to them (which includes the user posting the dispatch, any tagged user, and any user specified by the dispatch’s audience field). This function calls the client.c program to execute its push action. It returns the returncode attribute from a Python call function.

Write User

def write_user(user_id, username, image_path, name, fragmentation, followers, following, ip)

This function takes values for all of the fields of a user dictionary as defined in the documentation for the build_user function. It calls that function, then sends the resulting JSON to the Mongo collection of the user that was just generated. This function calls the client.c program to execute its push action. It returns the returncode attribute from a Python call function.

Update Feed

def update_feed(user_id)

This function updates the feed of the user specified by user_id by first pulling the corresponding user object from the Mongo collection, then calling the “pull all*****” network command for each user_id in the following array. This function calls the client.c program to execute each network protocol command.

Search Tags

def search_hashtags(user_id, tag)

This function allows a user to search the dispatches of the individuals they follow for instances of the specified tag. The function pulls the user object specified by user_id from the Mongo insta.user collection, then formulates a “pull tags****” network protocol command, with the tag specified by tag, for each user in the user’s following field. This function calls the client.c program to execute each network protocol command.

View Profile

def view_profile(user_id)

This function pulls all dispatches for which the user specified by user_id is the parent, all dispatches in which the given user is tagged, and the user’s user object. This is accomplished by sequentially formulating “pull child***,” “pull user_tag,” and “pull user****” network protocol commands. Each command is executed by the client.c program.

Add User Cass

def add_user_cass(user_id, username, ip)

This function adds a new user or updates a user entry in the Cassandra insta.user table by calling the add_user.c program. The add_user.c program simply calls keyspace_table_init to initialize the keyspace and table insta.user if they do not already exist, and calls add_user to insert or update the given user entry.

Generating User Actions

All of the Python functions used to programmatically generate user actions can be found in create_user_action_csv.py, which consists of one function, create_csv, and several helper functions.

Create CSV

def create_csv(user, lines, post_ids, usernames)

This function takes a dictionary user representing a user object and populates a csv with random user actions, beginning with inserting a user object. The second argument, lines, refers to the number of user actions to be generated. The third argument, post_ids, is a list of all dispatch_ids corresponding to posts created across all users, maintained as a global variable; the idea is that the csv files for all users are generated sequentially in main. The last argument, usernames, is a list of the usernames of all users whose actions are being generated. Possible user actions are specified in the options variable, hardcoded as [‘message’, ‘feed’, ‘profile’, ‘search_tags’, ‘lookup’]. These options are chosen at random with the respective probabilities specified in the probability variable as [.1, .7, .1, .05, .05]; these probabilities can be changed in experiments to represent different distributions of user actions. For each line to be generated, a random action is selected and the corresponding helper function is called. All dispatches and user updates use the same image path, as the data size matters more than the data content of the image fields. All arrays in the csv are space-delimited strings.

Write Message Row

def write_message_row(csvfile, user, overlap, words, post_ids)

This function takes as arguments a handle csvfile for writing to a csv, a dictionary user containing a user object, an array overlap of user_ids representing the intersection of the followers and following sets for the given user object, an array words of common words to be used as test data for the tags field of dispatches, and the post_ids list defined above. This function performs a weighted random choice of options similar to create_csv, but here the options are the types of dispatches as specified in the write_dispatch method of petepics.py.

● If the type chosen is “post,” the function writes a line of the csv indicating a post with a random sampling of between 0 and 4 tags and a random sampling of between 0 and 4 user tags chosen from the following field of the user object. The dispatch_id of the post is appended to post_ids.

● If the type chosen is “comment,” the function writes a line of the csv indicating a comment on a post chosen randomly from post_ids.

● If the type chosen is “message,” the function first chooses between a group message and a direct message. In the case of a group message, between 0 and 10 users are chosen from the overlap array; in the case of a direct message, a single random user_id is chosen from the overlap array. A line is then written to the csv with the appropriate user_ids specified in the audience field.

● If the type chosen is “user_tag,” a random user_id is chosen from the user’s following field, and a random dispatch_id is chosen from post_ids. A line is then written to the csv with the appropriate user_tag and the selected dispatch_id as parent_id.

Split Tags

def split_tags(tags)

This is a helper function that reformats arrays of user_tags and tags as the space-delimited strings used in the build_user and build_dispatch functions.

Write Feed Row

def write_feed_row(csvfile, user)

This function writes a row to the csv representing a feed update for a user specified by user. ​

Write Profile Row

def write_profile_row(csvfile, user)

This function selects a random user_id from the union of the given user’s following and ​ ​ ​ ​ followers arrays and writes a row to the csv representing viewing the profile of the ​ selected user.

Write Search Tags Row

def write_search_tags_row(csvfile, user, words)

This function chooses a random tag from the words array and writes a row to the csv ​ ​ representing a search for the selected tag by the given user.

Write Lookup Row

def write_lookup_row(csvfile, user, usernames)

This function chooses a random username from the usernames array to search for. The chosen user may be one whom the given user neither follows nor is followed by.

Main

Main should contain an initially empty array of all post_ids, a list of all usernames to be used, and a dictionary for each user. For each user object, simply call create_csv with the specified number of lines to write.

Executing User Actions from a CSV

To utilize the programmatically generated testing data, we built a Python application called csv_reader.py. This program takes the path of a csv file generated by the create_csv function. Once given a path, csv_reader opens the csv file and reads it line by line, executing user actions based on the initial command string located at the beginning of each line. The following command strings are accepted as valid, and lead to the execution of their corresponding Python wrapper functions, which mirror user actions:

‘message’ - a line beginning with the action string ‘message’ results in a call to the petepics.py function write_dispatch.

‘user’ - a line beginning with the action string ‘user’ results in a call to the petepics.py function write_user.

‘feed’ - a line beginning with the action string ‘feed’ results in a call to the petepics.py function update_feed.

‘profile’ - a line beginning with the action string ‘profile’ results in a call to the petepics.py function view_profile.

‘search_tags’ - a line beginning with the action string ‘search_tags’ results in a call to the petepics.py function search_hashtags.

‘lookup’ - a line beginning with the action string ‘lookup’ results in a call to the petepics.py function lookup_user.

If any line of the csv has a valid action string followed by an incorrect number of arguments for the corresponding petepics function, csv_reader prints an error statement and returns. This premature return is justified because it would indicate a bug in the csv creation software that might be pervasive beyond the single line on which csv_reader failed.

Future Work

Encryption: When working on a project that holds user privacy as one of its ultimate long-term goals, it follows that the network communication and data storage aspects of the project should implement some form of encryption. During the course of our work, we were only able to implement relatively low-level, concise network protocols that send data over the network as unencrypted text. While this was suitable for our immediate needs (building a system that functions as a proof of concept and basic testing framework), it must be addressed during the development of a user-facing application. Encryption and user privacy must go hand in hand in a system that holds data protection as one of its core principles.

Privacy and Permissions: One of the key long-term goals of our project was to work towards a platform that gives users privacy for their personal data. The first facet of privacy we had in mind was privacy from corporations who profit off of the user data they host; creating a decentralized platform was the principal step towards achieving this notion of privacy. During the course of our work, however, we did not have the opportunity to explore and implement another key element of digital privacy -- privacy from other users. Contemporary social networks place a large emphasis on outward privacy. The ability to control who sees personal data and posted content has become a deeply ingrained user expectation of social media platforms like Instagram and Facebook. To satisfy this expectation and our own goals of user privacy, we initially designed our platform to allow for the implementation of user-defined privacy. The audience and audience_size fields of the dispatch object were our first attempt at defining user privacy on our platform. Our intention was to use those two fields to specify whether a dispatch was to be viewable by all followers, a specific subset of users, or any user. Within our querying functions, though, we decided not to perform the checks necessary to verify that the querying user has “permission” to view the dispatches they are requesting. We elected to forego this aspect of privacy because we had not yet created meaningful functionality to enforce social relationships. Under our current design there is no functionality for requesting to follow another user, or for denying such a request, which implies that there are no protocols in place to prevent one user from seeing another user’s information. This is an area that requires further thought and development, which may or may not take advantage of the groundwork we laid in the initial designs of our data structures and base functionality. Beyond the lack of user privacy resulting from the limited functionality surrounding user-defined relationships, we have also not yet implemented a method for updating the lists of followers and following. On contemporary social network platforms, there typically exists some form of mutual agreement that one user will enter a relationship with another. For example, on the Instagram platform a user who has a private account can be found by other users, but other users must request to follow that user before they can access any of their personal data (aside from a name and thumbnail picture) or posts. Under our current implementation we have not yet devised a similar mutual procedure. Our system currently requires that all interacting users have their platform running simultaneously to interact with one another. This, coupled with our limited ability to actively test use cases that rely on the cooperative action of multiple nodes, led us to postpone the implementation of this functionality.

Mirroring, Bloom Filters, and Fragmentation:

“Premature optimization is the root of all evil”
-Campbell Boswell
-Rowen Felt
-Peter Johnson
-Donald Knuth

Another primary area of future work stems from the potential for optimization in our system. To satisfy the expectations of users, the system must be optimized for fault tolerance and lookup efficiency. Under our current implementation, a user’s data is stored only on the node indicated by their associated IP address in the Cassandra database. Under these conditions, if a node goes offline or changes IP addresses without updating Cassandra, there is no way for any other user to find its data. This is an extreme limitation that could be resolved by mirroring a given node’s data on another node: storing a user’s data on multiple other host nodes such that at least one complete copy of the data is always available. Mirroring a node could be achieved using a variety of strategies, ranging from storing a complete duplicate of a user’s Mongo collection on another user’s node to fragmenting sets of dispatches and user data across a range of other nodes. It is also worth considering how much control a user should have over where their data is mirrored -- whether on the nodes of their friends, their followers, or a random set of users. Testing these different strategies and gauging their viability and efficiency is one of the most critical and necessary fields of future work on this project.

We also recognize that the strategies and actions of our primary functions can be improved by reducing the amount of redundant data they communicate. To understand the scope of this issue, consider the current duplication handling in our user operations. When a user requests to view another user’s profile, the view_profile function requests all of the dispatches for which the given user is the parent, as well as all the dispatches in which the given user is tagged. Under these conditions, the requesting user will almost necessarily receive a large amount of data already contained in their database, and all duplicate entries are reinserted upon arrival, resulting in an enormous amount of redundant network and database traffic. One solution we considered is to use a bloom filter to store a concise record of which dispatches a user has already seen. A bloom filter is a compact, probabilistic data structure that tests whether a given item is already a member of a set: several different hashes of each item, taken modulo the filter size, select the indexes to set in a bit field. Using this method, a bit array of only a few kilobytes can test whether a given item is a member of a set with tens of thousands of entries, where each entry may be of arbitrary size (a minimal sketch of such a filter follows at the end of this section). Using this data structure, each user could send a bloom filter as an argument of profile pull requests to limit network traffic.

The last optimization we considered is user fragmentation. The vast majority of users on social media platforms have a relatively modest number of followers and create an insubstantial amount of content. However, the presence of superusers, such as social celebrities and politicians, would place an inoperable level of strain on our server design. We therefore propose that the data of superusers be fragmented identically across a number of users who connect with a much smaller number of followers. These fragmented users would then operate their own servers, either in parallel on the superuser’s home node or on other mirror nodes. Content created by the superuser would be pushed to all of the superuser’s fragmented servers, where it could be further dispersed to the superuser’s large number of followers. While we included the fragmentation field in the user and dispatch object definitions, the field is currently ignored by all functions, and there is no functionality in place for fragmenting a user object over multiple servers or nodes.
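To make the bloom filter idea concrete, here is a minimal, self-contained sketch in C. The filter size, the number of hash functions, and the splitmix64-style mixer are illustrative choices, not tuned parameters from our system.

#include <stdint.h>

#define FILTER_BITS (8 * 1024 * 8)  /* an 8 KB bit field */
#define NUM_HASHES  4               /* hash functions per item */

static uint8_t filter[FILTER_BITS / 8];

/* A simple 64-bit mixer (splitmix64-style); varying the seed yields a
 * family of hash functions over dispatch_ids. */
static uint64_t mix(uint64_t x, uint64_t seed) {
    x += seed + 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Record a dispatch_id as seen: set one bit per hash function. */
void bloom_add(uint64_t dispatch_id) {
    for (uint64_t i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = mix(dispatch_id, i) % FILTER_BITS;
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 1 if dispatch_id is possibly in the set (false positives are
 * possible), or 0 if it is definitely not. */
int bloom_maybe_contains(uint64_t dispatch_id) {
    for (uint64_t i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = mix(dispatch_id, i) % FILTER_BITS;
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}

In a future iteration, a serving node could skip any dispatch whose dispatch_id tests positive against the filter sent with the pull request, trading a small false-positive rate for a large reduction in redundant traffic.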