Design and Initial Implementation of a Decentralized Social Networking Application Using Apache Cassandra and Peer-to-Peer Data Transfers
Campbell Boswell and Rowen Felt, Advised by Peter Johnson
Middlebury College, Department of Computer Science
Summer 2018

Summary

Our objective for this summer's work was to create the infrastructure to support a decentralized social networking application. A decentralized application has several benefits over a centralized system, the most important being data privacy, security, and platform independence. Centralized applications, such as Facebook and Instagram, profit by mining their users' data for marketing information. In a decentralized system, however, users own their data and share the operating costs of the network by managing their own data stores and networking overhead, which eliminates the need for corporate ownership.

We began by reading a report from our advisor's previous summer research session (2017), which was produced by student researchers. This report presented the initial conceptual research for the project and included an analysis of user operations performed in a variety of social networking contexts and applications. After discussing the report and exploring several academic papers related to the goals of the project, we decided that the next step would be to design a distributed application oriented around an underlying distributed hash table. Our advisor left the specific design and behavior of the system for us to define and implement ourselves.

A distributed hash table (or DHT) is "a decentralized, distributed system that provides a lookup service very similar to a hash table."1 Most DHTs describe a system of nodes connected in a ring-like graph such that each node is responsible for a given range of hashed keys. Each node is aware of a certain number of other nodes throughout the ring and maintains some notion of the overall state of the network and distribution of keys. When key-value pairs are inserted into the DHT, a logarithmic-time search function similar to binary search allows the inserting node to locate the node responsible for the hashed key and pass along the value to be inserted. Looking up values by their corresponding keys works in the same manner, except that values are retrieved rather than inserted. Most DHTs include support for mirroring data on other nodes in case of system error and provide functionality for nodes to join and leave the network with minimal computational overhead. DHTs thereby allow consistent, reliable access to amounts of data that would be unmanageable on a single server.

There are several notable benefits of DHTs, including scalability, fault tolerance, and flexibility. Because each node maintains knowledge of only a constant or logarithmic number of other nodes, and because each node is responsible for only a portion of the overall data, a DHT can easily scale to thousands or millions of nodes and billions or trillions of data points.

1 https://en.wikipedia.org/wiki/Distributed_hash_table

DHTs are also fault tolerant in that most systems allow nodes to join and leave the network without a substantial penalty in locating relevant data points. This fault tolerance is invaluable both in the context of a datacenter, where heavy traffic consistently leads to node failure, and in the context of nodes distributed across a wide user base, where nodes often disconnect due to network connectivity and maintenance. Many DHTs are also highly usable regardless of network topology or physical proximity. This flexibility has proven invaluable in previous decentralized social networking applications that also used DHTs.

After two weeks of reading research publications concerning various DHTs and distributed system implementations, we decided to recreate the functionality of an existing social networking platform through our own design. We chose Instagram because its primary operations, such as making posts, writing comments, tagging users, and sending direct messages, seemed fairly straightforward. We began by defining Instagram's user actions and decomposing them into computational operations on user profile objects and user content. At this point we decided that the differences between posts, comments, shares, tags, and messages are trivial enough that all of these data points can be abstracted into a single object class that we chose to call a dispatch. The dispatch object is composed of all the fields needed to describe any of the above content, including the user id, image data, text, tags, user tags, audience, and a globally unique dispatch id. The dispatch object also contains parent type and parent id fields to identify the dispatch as either a post or a comment. In the case of a post, the parent id is the user id of the poster. In the case of a comment, the parent id is the dispatch id of the original post being commented upon. The audience field can also be used to specify the type of dispatch as a public post, direct message, or group message, in which case the audience field is populated with the user ids of the relevant parties. Using this abstraction, we were able to break down all user actions into operations on dispatch objects and user objects.

When the time came to implement our design, we had to decide which pre-existing DHT we would use to store global user identifiers and what kind of local database we would use to store user data. Fortunately, the operations and measured efficiency of most DHTs are essentially equivalent, so we could choose an implementation based on the language we wanted to use and the support provided. We chose to write all of the server software for this application in C because this project presented a good opportunity to gain familiarity with the language and work with third-party C libraries. Due to this decision, we chose Apache Cassandra as our DHT for its extensive API and C driver support. We similarly chose MongoDB for our local database because of its notable speed, document-based flexibility, and substantial support for C drivers.

We spent the last four weeks of the summer coding our implementation. We started by writing C libraries for storing, retrieving, and updating user identification information in the Apache Cassandra database. We then wrote C libraries for insertion, deletion, and search methods on user and dispatch objects in the MongoDB database.
The majority of this code was contained in methods that converted user and dispatch structs to BSON and JSON formats and vice versa. We then decided to write our own application-layer networking protocol to facilitate the peer-to-peer communication that would comprise the bulk of the network overhead. We chose to make this protocol text-based for ease of testing and because the data being transferred between instances of MongoDB was conveniently stored in the text-based JSON format. We used these networking protocols to describe another layer of abstraction that more closely resembled real user actions. These protocols implemented behavior such as pushing a user object to a node or pulling all dispatch objects with a given field. On top of this layer we were able to build a server that responds to incoming requests and a client that reads protocol commands from a file. Our last project was writing Python wrapper functions that describe individual user actions, such as making a post, sending a message, or viewing a profile. These functions format the appropriate information as network protocol commands to be executed by the client process. We then wrote testing infrastructure that randomly generates an arbitrary number of user actions with a given probability distribution and executes them sequentially.

While we were able to test a large number of inputs without system failure, we were unable to truly evaluate the system for a variety of reasons. We were unable to acquire a dataset of user actions with the data we required, and we had no dataset representing performance standards for centralized social networks to which we could compare the performance of our system. Additionally, all of the nodes to which we had access were operating on a local area network, which is not an accurate representation of how the system would be deployed in the wild, and we had neither the time nor the resources to simulate realistic network topology on the systems available to us. Ultimately, our aim was to leave this project in a clean state with concise documentation so that others may continue the work in the future with relative ease. With that in mind, the system model, setup, API, testing infrastructure, and proposals for future work are outlined below.

Past Work

We began our project by researching implementations of DHTs, peer-to-peer technologies, and past attempts at decentralized social networks (SOUP and ReClaim). Listed below are the publications we reviewed.

Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications, Ion Stoica et al., 2001. Link: https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf

Democratizing Content Publication with Coral, Michael J. Freedman et al., 2004. Link: http://www.coralcdn.org/docs/coral-nsdi04.pdf

SOUP: An Online Social Network by the People, for the People, David Koll et al., 2014. Link: https://dl.acm.org/citation.cfm?id=2663324

HyperDex: A Distributed, Searchable Key-Value Store, Robert Escriva et al., 2012. Link: http://conferences.sigcomm.org/sigcomm/2012/paper/sigcomm/p25.pdf

BubbleStorm: Resilient, Probabilistic, and Exhaustive Peer-to-Peer Search, Wesley W. Terpstra et al., 2007. Link: http://www.sigcomm.org/node/2624

Making Gnutella-like P2P Systems Scalable, Yatin Chawathe et al., 2003. Link: http://conferences.sigcomm.org/sigcomm/2003/papers/p407-chawathe.pdf

Comet: An Active Distributed Key-Value Store, Roxana Geambasu et al., 2010. Link: https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Geambasu.pdf

Fully Dynamic Betweenness Centrality Maintenance on Massive Networks, Takanori Hayashi et al., 2015. Link: http://www.vldb.org/pvldb/vol9/p48-hayashi.pdf

ReClaim: A Privacy-Preserving Decentralized Social Network, Niels Zeilemaker et al., 2014. Link: https://www.usenix.org/system/files/conference/foci14/foci14-zeilemaker.pdf

Scalable Consistency in Scatter, Lisa Glendenning et al., 2011. Link: https://www.sigops.org/sosp/sosp11/current/2011-Cascais/02-glendenning-online.pdf

Storage Management and Caching in PAST, a Large-Scale, Persistent Peer-to-Peer Storage Utility, Antony Rowstron et al., 2001. Link: http://sosp.org/2001/papers/rowstron.pdf

Ceph: A Scalable, High-Performance Distributed File System, Sage A. Weil et al., 2006. Link: https://www.usenix.org/legacy/events/osdi06/tech/full_papers/weil/weil.pdf

Viceroy: A Scalable and Dynamic Emulation of the Butterfly, Dahlia Malkhi et al., 2001. Link: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.7629&rep=rep1&type=pdf

Pastry: Scalable, Decentralized Object Location and Routing for Large-Scale Peer-to-Peer Systems, Antony Rowstron et al., 2001. Link: http://rowstron.azurewebsites.net/PAST/pastry.pdf

System Model

General structure

Apache Cassandra is the DHT we chose to form the backbone of our application; every node is connected to the Cassandra cluster. We borrowed SOUP's design philosophy that the DHT can be used as an identification tool for facilitating peer-to-peer communication rather than as a datastore for user and dispatch information. With this in mind, we designed our Cassandra database to store the IP addresses of users with a compound primary key of user_id and username, in which user_id is a globally unique identifier and username is not necessarily unique. Each user has a home node containing a server that controls access to an instance of MongoDB. This database houses all of a user's profile and dispatch data, as well as related dispatches such as user tags and direct messages. The entirety of a user's data is also stored on a constant number of mirror nodes owned and maintained by other users to ensure constant access (mirroring functionality has not yet been implemented).

All user actions involve a four-step process. First, the user action is formatted into a number of network protocol commands to be executed by the client. The users to be contacted are then pulled by user_id or username from Apache Cassandra. The client then establishes a TCP connection to each required server, including a localhost connection to the server running on the home node if needed. Finally, the data is either pushed from the client to the contacted server or pulled from the contacted server and pushed by the client to the local server. A sketch of this flow appears below.
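The following is a minimal sketch of this flow for a single contacted server, using the get_user_ip_by_id lookup documented later in this report; connect_to_peer and send_protocol_command are hypothetical stand-ins for the client's TCP and protocol layers, not functions from our codebase.

#include <stdint.h>
#include <stdlib.h>

char *get_user_ip_by_id(char *keyspace, char *table, uint64_t user_id);
int connect_to_peer(const char *ip);                /* hypothetical TCP layer */
int send_protocol_command(int fd, const char *cmd); /* hypothetical protocol layer */

int push_dispatch_to_user(uint64_t user_id, const char *dispatch_json)
{
    /* Step 2: resolve the target user's IP address through Cassandra. */
    char *ip = get_user_ip_by_id("insta", "user", user_id);
    if (ip == NULL)
        return -1;

    /* Step 3: open a TCP connection to that user's home server. */
    int fd = connect_to_peer(ip);
    free(ip); /* get_user_ip_by_id returns a malloc'd buffer */
    if (fd < 0)
        return -1;

    /* Step 4: push the formatted "push dispatch" command and payload. */
    return send_protocol_command(fd, dispatch_json);
}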

User Stories

As a user, I want to create posts with attached photos, captions, tags, and user tags. Tagged users should be notified.

This action will necessitate creating a new dispatch object. Creating the dispatch object will require the following information: the file path for the photo to be added or "no image", the text of the caption, the user_id of the posting user, the audience (0 for a public post), a list of tags, and a list of user tags. This information will be used to create the dispatch object in JSON, which involves opening the file path for the image, converting the image into binary, and filling out the other fields of the dispatch object from the given information. In this case, the parent_type of the dispatch would be 0, indicating a post, and the parent_id would be the user_id of the posting user. A "push dispatch" network protocol will be formatted for the posting user's home server and for each tagged user's server. The client will then execute network operations with each server to push the dispatch JSON object to instances of the MongoDB dispatch collection on each node. Ideally, the users' home nodes will then update the relevant mirror nodes with the new information, but this step has not yet been implemented.
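As an illustration, a post dispatch might be populated as in the following sketch. The field names are paraphrased from the dispatch object documented later in this report, and random_u64 is a hypothetical unique-id generator rather than a function from our codebase.

#include <stdint.h>
#include "dispatch_definitions.h"

uint64_t random_u64(void); /* hypothetical unique-id generator */

/* Sketch: build a dispatch representing a public post. */
struct dispatch make_post(uint64_t poster_id, uint8_t *image_bytes,
                          int image_len, char *caption)
{
    struct dispatch d = {0};
    d.body.media_size = image_len;    /* binary photo data read from the file path */
    d.body.media      = image_bytes;
    d.body.text       = caption;
    d.user_id         = poster_id;
    d.audience_size   = 0;            /* 0 marks a public post */
    d.parent.type     = 0;            /* 0 indicates a post... */
    d.parent.id       = poster_id;    /* ...whose parent is the posting user */
    d.dispatch_id     = random_u64(); /* globally unique dispatch id */
    return d;
}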

As a user, I want to comment on posts, reply to comments, and tag users in comments, notifying the tagged users.

The action will require creating a new dispatch object with the same fields as above, but in this case the parent_type would be 1, indicating a comment or tag. If the dispatch represents a comment, the parent_id would be the dispatch_id of the original post. If the dispatch represents a reply to a comment, then the parent_id would be the dispatch_id of the comment being replied to. A "push child***" network protocol will be formatted for the commenting user's home server and the original poster's home server. A "push user_tag" network protocol will be formatted for each tagged user's server. The client will then execute network operations with each server to push the dispatch object to instances of the MongoDB dispatch collection on each node.

As a user, I want to send photos and messages directly to another user or to a group of users.

The action will require creating a new dispatch object with the same fields as above, but in this case the parent_type would be 0, the parent_id would be the sender's user_id, and the audience field would be a list of the user_ids to which the message should be sent. A "push message*" network protocol will be formatted for the sender's home server and for the server of each user in the audience field. The client will then execute network operations with each server to push the dispatch object to instances of the MongoDB dispatch collection on each node.

As a user, I want to update my feed with posts from other users.

This action requires pulling a user's following list from their user object; a "pull all*****" network protocol will be formulated for each followed user. The client will then execute the network operations with the server of each user in the following list to pull all dispatches from the instances of the MongoDB database on each node that have the contacted node's home user's user_id as parent_id. All returned dispatch JSON objects will then be inserted into the home node's MongoDB dispatch collection. Ideally, this method would pull the most recent dispatches first and pull older dispatches in response to some user action, but this behavior has not been implemented. Because updating feeds comprises a large portion of user actions, prefetching this data would be invaluable in improving the user experience, but this behavior has also not been implemented because this application is currently entirely experimental and does not support a user interface.

As a user, I want to search for users by username.

This action requires querying Apache Cassandra for all users with a given username. For each returned user, a "pull user****" network protocol must be formulated. The client will then execute network operations with each returned user's server. The returned user JSON objects will be inserted into the home node's MongoDB user collection.

As a user, I want to search the posts of users I follow for instances of a tag.

This action requires pulling all user_ids from a user's following list. For each user_id, a "pull tags****" network protocol must be formulated with the given search term. The client will then execute network operations with each followed user's server. The returned dispatch JSON objects will be inserted into the home node's MongoDB dispatch collection.

As a user, I want to visit another user's page and view their user information, their posts, and posts in which they have been tagged.

This action requires formulating "pull child***," "pull user_tag," and "pull user****" network protocols for a given user_id. The client will then execute network operations for each protocol defined with the given user's server. All returned dispatch JSON objects will be inserted into the home node's MongoDB dispatch collection via the "push dispatch" protocol with the home node's server, and all returned user JSON objects will similarly be inserted into the home node's MongoDB user collection via the "push user****" protocol with the home node's server.

Hypothesis on viability

Based on the proven efficiency of Apache Cassandra and the general use cases of a social network like Instagram, we can make some hypotheses about the viability of our proposed system.

The first consideration is the amount of data being stored. If one were to download their entire data profile from a site like Facebook or Instagram, they would see that the majority of the stored data is related to user behavior, such as search history, location data, and marketing data, while the amount of intentionally uploaded user data is relatively small. Because our system does not datamine users, the amount of data that needs to be considered in the context of search and insertion operations within the Mongo database is small and will likely not impact performance.

The second consideration is the efficiency of the DHT for lookup operations on identification and connection data. In this case, the efficiency of Apache Cassandra has already been proven; no lookup should take more than a logarithmic number of hops between nodes in the worst case.

The third consideration is server load. We base our assumption that the peer-to-peer server design is scalable on our use cases and demonstrated behavior, in which a relatively constant number of users will request resources from any given user server, as defined by the social relationships of "followers" and "following." Therefore, even as the overall number of users increases, the average number of "follower" and "following" relationships per user, and thus the number of user servers contacted per user, will remain relatively constant. The exception to this behavior is the proposed existence of a superuser, such as a social media celebrity, who might have millions of followers. To combat the strain this would place on that user's server, we propose that each superuser be fragmented into a number of smaller user objects with a constant number of followers each. The fragmented user's data would be identical and stored either on the same machine running multiple servers in parallel or on a number of mirroring nodes. However, this behavior has not yet been implemented.

Based on these considerations, we believe that our system would be highly scalable and efficient in terms of network and database usage.

Dependencies

Cassandra

Cassandra is a distributed, wide-column-store NoSQL database created and supported by the Apache Software Foundation. It is referred to as a "wide column store" because columns can be dynamically added to rows in a data table, a capability Cassandra uses as part of its organizational philosophy. This means that Cassandra is able to replicate the search efficiency of a more traditional distributed hash table through its use of primary keys, which define the storage location of rows, while offering additional flexibility through its ability to add multiple values or columns to a given row or primary key.

While we did not directly take advantage of Cassandra's flexibility, it might be useful to consider how to utilize Cassandra more effectively as a platform to prefetch relevant and high-demand data so that it might be accessed directly instead of connecting to additional user nodes and querying individual Mongo databases. Under our current design, Cassandra functions simply as a key-value store that ties user_id values to user IP addresses, while also storing a non-unique username that allows for more general querying of users about whom less specific information (i.e., their user_id) is known.

Installation and Configuration:

When building our software, we worked with Cassandra version 3.11.2, which we downloaded from cassandra.apache.org. This site has relatively helpful documentation, though the documentation is community sourced and, when we last looked at it, still very much a work in progress in some sections. Luckily, there are plenty of other helpful resources which guided us through the download and configuration process. Links to such resources can be found throughout this paper.

Download the compressed Cassandra file from cassandra.apache.org/download/, then uncompress the package and extract its contents to the desired location. We extracted the file contents to a directory named apache-cassandra-3.11.2, which was within the "instaclone" directory. The instaclone directory contained all the dependencies and source code of our project. This process is described in a bit more detail in a paper from IBM that acts as a rather effective introductory guide to installing and interfacing with Cassandra.2 Cassandra requires both Java 8 and Python 2 as dependencies, so verify that both of these packages are installed before attempting to build Cassandra.

Once Cassandra is downloaded and built, there are several important points regarding its organization and limitations which must be discussed. First, it is necessary to set the following parameters in the "cassandra.yaml" file that is generated when the software is built (initially, "cassandra.yaml" is located in the "conf" folder of the primary cassandra directory):
● cluster_name -- A name which identifies the cluster you are creating. This name must be common across all of the nodes you plan to configure.
● seeds -- The IP addresses of the "bootstrap" nodes in your cluster. Include the IP of the first node you are setting up, as all other nodes you set up will rely on it to join the cluster.
● listen_address -- The IP address with which other nodes will contact this node (its own IP).
● native_transport_port -- The TCP port number for client software to connect to this node. Ensure that the port is not blocked by a firewall.

There are many other configuration options in the cassandra.yaml file, and some are bound to be useful if further development and tweaking of Cassandra is done, but the stated parameters should be the only changes necessary to get started.

Second, it is important to note that by default only one Cassandra node can run on a single machine without the use of third-party software such as Docker to "containerize" or isolate each instance of Cassandra from the others. Additionally, by default, only a single node may run in each user's filespace (because a user shares the same files across the entire Middlebury computer science network, there is an initial limitation of a single node per user account). This limitation is a result of Cassandra storing all its runtime data and configuration specifications in common directories called "data" and "conf" respectively.

2 https://www.ibm.com/developerworks/library/ba-set-up-apache-cassandra-architecture/index.html

To get around this, we created a "nodes" directory within the apache-cassandra-3.11.2 directory. Within this nodes directory, we created a subdirectory for the hostname of each machine that we planned to use for testing. We then added copies of the conf and data folders from the primary cassandra directory to these node subdirectories. This allows each node to have a unique configuration (cassandra.yaml) file and a unique directory to write data to. We then wrote a script called cassandra_setup.sh which initializes nodes based on their hostnames, setting the CASSANDRA_CONF environment variable to point to the configuration subdirectory for a given node name before running an instance of cassandra on that local machine. Cassandra also requires the following TCP ports to be unblocked in order to run successfully: 7000, 7001, 7199, 9042, and 9160.

Finally, it is worth mentioning some of the tools which allow you to interface with Cassandra. In the bin directory (where the cassandra executable is also stored), there is a wide array of other executables that might be of use -- though two stood out as particularly helpful for our circumstances. The first is the "nodetool" program and its status command. Running "nodetool status" will output the current status of all nodes in the cluster to which the current machine is connected. This can be particularly useful if you are unsure whether your host is currently connected to a cluster, or if you believe that a node in the group has suffered some type of error. If a node has failed and needs to be removed from the cluster, the nodetool command "removenode" can be run from a live node in the cluster to remove the downed node.

The next useful tool is the Python shell that allows you to directly interface with the cassandra cluster that your node is connected to. The shell is called cqlsh (Cassandra Query Language shell), and it allows you to write commands in the Cassandra Query Language (CQL), which in turn allows for manually adding to, removing from, and querying the cluster you are connected to. For a brief tutorial on using the Cassandra Query Language, we will again recommend the IBM guide to getting started with Cassandra. It describes using CQLSH in parallel with Docker, but still provides a useful orientation to the CQL syntax.

DataStax C Drivers

In order to interface with Cassandra from a programming language, we had to track down a library of driver software which would allow us to programmatically interact with Cassandra. Apache does not offer any first-party supported drivers to link Cassandra to a programming language; instead, drivers are provided by third-party developers. We used the DataStax C drivers, because DataStax offered the best documentation and most frequently updated software.3 Additionally, most of the other drivers we evaluated were produced by students, academics, or hobbyists, while DataStax offered an entire suite of Cassandra drivers in a variety of languages as part of its business model. It is also worth noting that DataStax is responsible for a large number of code commits to Apache Cassandra (which is an open-source project), making DataStax's products the closest available approximations to first-party drivers.

3 https://docs.datastax.com/en/developer/cpp-driver/2.9/api/

The DataStax API provided clear example code outlining how to connect to a running instance of Cassandra and to add, query, and remove any desired data. These three operations -- add, query, and remove -- along with connecting to and tearing down the connection to Cassandra, were the only functionality that we required in our work. For more information about the driver API, visit the DataStax C/C++ Driver API Documentation Index.4

When installing the drivers, we elected to download and build them within our own directories instead of attempting to have the driver installed across all accounts on all Middlebury CS lab machines. This process was very similar to the process of downloading and building Cassandra. First, however, we had to ensure that the following dependencies for the C driver were installed: CMake v2.6.4+, libuv v1.x, and OpenSSL v1.1.x or v1.0.x. Once we had gathered and linked the necessary dependencies, we built the C drivers using CMake in a process that is outlined rather clearly on DataStax's website.5 Finally, we had to dynamically link our code against the C drivers at compile time within our makefile. To achieve this, we added a variable to the makefile which included the cassandra.h header file from the C driver 'include' directory and linked the necessary files from the C driver 'build' directory. We also added a line to our bashrc which set the LD_LIBRARY_PATH environment variable to the location of the Cassandra C driver's build directory. We found it convenient to tie this assignment to a bashrc alias named "insta", which also changed our current directory to our primary development directory.
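As a simple connectivity check, the following sketch follows the general pattern of the DataStax example code: it connects to a local node, queries the server version from the built-in system.local table, and frees every driver object it allocates.

#include <cassandra.h>
#include <stdio.h>

int main(void)
{
    CassCluster *cluster = cass_cluster_new();
    CassSession *session = cass_session_new();
    cass_cluster_set_contact_points(cluster, "127.0.0.1");

    CassFuture *connect_future = cass_session_connect(session, cluster);
    if (cass_future_error_code(connect_future) == CASS_OK) {
        /* A simple CQL statement with no bound parameters. */
        CassStatement *statement =
            cass_statement_new("SELECT release_version FROM system.local", 0);
        CassFuture *result_future = cass_session_execute(session, statement);

        if (cass_future_error_code(result_future) == CASS_OK) {
            const CassResult *result = cass_future_get_result(result_future);
            const CassRow *row = cass_result_first_row(result);
            const char *version;
            size_t length;
            cass_value_get_string(cass_row_get_column(row, 0), &version, &length);
            printf("Cassandra version: %.*s\n", (int)length, version);
            cass_result_free(result);
        }
        cass_statement_free(statement);
        cass_future_free(result_future);
    }
    cass_future_free(connect_future);
    cass_session_free(session);
    cass_cluster_free(cluster);
    return 0;
}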

Cassandra User Database API

All functions and structs related to the use of Cassandra can be found in cass_user.c.

Data Structure

The user objects in the Cassandra cluster are stored in the user table of the insta keyspace, accessed by the dot-notation table name insta.user. The user object is created with the following fields and associated types: bigint user_id, text username, inet ip_addr. Here, bigint refers to a 64-bit integer, text refers to a string, and inet is a CassInet type, a big-endian binary representation of an IPv4 address. The table is created with a compound primary key composed of (user_id, username) with the notion that while user_ids are always unique, usernames are not necessarily. With this structure, our queries can search by user_id as well as by username.
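For reference, the CQL corresponding to this layout looks roughly like the following; these statements could be issued through cqlsh or passed as strings to cass_statement_new. The replication options match the keyspace_table_init function described below.

/* CQL for the insta.user layout, shown as C string constants. */
const char *create_keyspace =
    "CREATE KEYSPACE IF NOT EXISTS insta WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 2};";

const char *create_table =
    "CREATE TABLE IF NOT EXISTS insta.user ("
    "  user_id  bigint,"
    "  username text,"
    "  ip_addr  inet,"
    "  PRIMARY KEY (user_id, username)"
    ");";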

Functions

Session Connection

4 https://docs.datastax.com/en/developer/cpp-driver/2.9/api/
5 https://docs.datastax.com/en/developer/cpp-driver/2.9/topics/building/

static int session_connection(struct cass_connect *connection);

This function establishes a session connection to the Cassandra cluster running on localhost. It establishes the basic infrastructure and error checking necessary for any function to execute queries on the Cassandra cluster. It takes as an argument a pointer to an instance of struct cass_connect, outlined below:

struct cass_connect {
    CassError err_code;
    CassCluster* cluster;
    CassSession* session;
    CassFuture* connect_future;
};

Returns 0 on success, or a CassError error code on failure. At the end of any querying function, tear_down_connection must also be called with a pointer to the cass_connect struct in order to free the instances of CassCluster, CassSession, and CassFuture. Failure to do so will almost surely cause a segmentation fault.

static int tear_down_connection(struct cass_connect *connection);

Keyspace Table Init

int keyspace_table_init(char *keyspace, char *table);

The keyspace_table_init function defines and initializes the insta.user table in the Cassandra cluster. First, after calling session_connection, the function defines and initializes the keyspace insta with the options replication_factor: 2 and class: SimpleStrategy. This means that any values stored in the keyspace will be replicated in two locations throughout the cluster, assuming enough space exists, and that the replication strategy will be simple, assuming no responsibilities beyond load balancing across nodes. The function then defines and creates the insta.user table with the fields and keys outlined above. The function closes by calling tear_down_connection. Returns 0 on success, or an error code on failure.

Add User

int add_user(uint64_t user_id, char* username, char* ip);

The add_user function connects to the Cassandra cluster using the session_connection function, and calls keyspace_table_init to initialize the keyspace if it doesn't already exist. The function then builds an UPDATE CQL query using the cass_statement_new function and binds the user_id, username, and ip_addr parameters to it using members of the cass_statement_bind_x function family. The UPDATE query will update any user entry with a matching user_id, or create one if none is found. The query is executed, and the CassStatement and CassFuture variables are freed. The function then closes by calling tear_down_connection. Returns 0 on success, or an error code on failure.
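A sketch of the statement-building step described above might look as follows, assuming a session already opened via session_connection; error checks are elided for brevity.

#include <cassandra.h>
#include <stdint.h>

static void upsert_user(CassSession *session, uint64_t user_id,
                        const char *username, const char *ip)
{
    /* In CQL, an UPDATE on a full primary key is an upsert: it modifies
     * the matching row or creates one if none exists. */
    CassStatement *statement = cass_statement_new(
        "UPDATE insta.user SET ip_addr = ? "
        "WHERE user_id = ? AND username = ?;", 3);

    CassInet inet;
    cass_inet_from_string(ip, &inet); /* dotted-quad string -> binary inet */
    cass_statement_bind_inet(statement, 0, inet);
    cass_statement_bind_int64(statement, 1, (cass_int64_t)user_id);
    cass_statement_bind_string(statement, 2, username);

    CassFuture *future = cass_session_execute(session, statement);
    cass_future_wait(future);
    cass_future_free(future);
    cass_statement_free(statement);
}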

Get User ID by Username

uint64_t *get_user_id_by_username(char *keyspace, char *table, char *username, int *result);

This function takes as arguments a keyspace, table, username, and a pointer to result (an integer to be set to the number of query matches). It queries the Cassandra database for all user entries with matching username fields. Similar to add_user, this function calls session_connection, builds a query with cass_statement_new, executes the statement with cass_session_execute, and frees resources with tear_down_connection. This function returns a pointer to an array of user ids, or NULL if no users with the specified username are found. The result integer will be set to the number of results found. The buffer containing the list of user_ids is malloc'd and must be freed by the caller.
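For example, a caller might use the function as follows ("rfelt" is a hypothetical username used purely for illustration):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t *get_user_id_by_username(char *keyspace, char *table,
                                  char *username, int *result);

int main(void)
{
    int result = 0;
    uint64_t *ids = get_user_id_by_username("insta", "user", "rfelt", &result);
    if (ids == NULL)
        return 1; /* no users with that username */
    for (int i = 0; i < result; i++)
        printf("match %d: user_id %llu\n", i, (unsigned long long)ids[i]);
    free(ids); /* the buffer is malloc'd; the caller must free it */
    return 0;
}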

Get User IP by ID

char * get_user_ip_by_id(char *keyspace, char *table, uint64_t user_id);

This function takes as arguments a keyspace, table, and user_id. It then searches the Cassandra database for the user with the unique user_id using a method similar to get_user_id_by_username. The function returns a char * pointer to an array containing the returned inet value as a string, or NULL on failure. The returned pointer is malloc'd and must be freed at a later time.

MongoDB

MongoDB is a document-based NoSQL database that supports the storage of JSON-like documents in a binary format (BSON documents -- short for binary JSON). While Mongo has the capability for distributed storage through its sharding feature, we explored and utilized its non-distributed functionality, using it as the primary database that stores the entirety of the dispatch and user data generated in our system.

We decided to integrate the Mongo platform into our project because we were interested in the possibility of gaining experience with yet another database platform. Beyond the desire to build new skills, we were also interested in utilizing Mongo's convenient query format, which allowed us to search the database by any combination of fields in the documents we planned on storing. This allows for complex but effective querying of the user and dispatch objects -- Mongo was well suited to the goals of our project because its query format conformed to our initial vision of our primary data structures. Mongo was also an attractive choice for our primary database platform because of its integration with the text-based JSON format and the wide variety of programming languages for which there is first-party driver support. While we initially conjectured that Mongo would simplify our development process because it was so closely tied to the JSON format, we could not have predicted just how valuable this would be when we began to write and debug our network infrastructure. We went on to produce entirely text-based network protocols, largely thanks to Mongo's ease of conversion between the readable JSON format and the storable BSON format.

Installation and Configuration:

Luckily, Mongo was already installed on the Middlebury computer science systems, so we had no need to install the platform ourselves. We did, however, run into an issue configuring Mongo on our systems. Without sudo access, we were unable to create the generic /data/db directory that stores data from a running instance of Mongo. To circumvent this issue we added a data/db directory to our local development directory. This change meant that we had to manually specify the location of the data directory whenever we launched an instance of Mongo, which we achieved by starting the Mongo daemon with "mongod --dbpath" followed by the path to our local data directory.

Unlike Cassandra's, Mongo's drivers are produced by the same company that produces the Mongo platform. Again we selected the C drivers for development and went about installing them. The process for installing the C drivers is outlined in a helpful guide provided by the MongoDB C driver documentation (which can be sparse at times but in this case is rather well defined and specific).6 Again we found that OpenSSL was a prerequisite for the Mongo C drivers, though we also found that it was needed only for authentication and required no special steps on our part to implement or configure.

In order to work with the Mongo C drivers we had to include lines in our makefile that specified the C driver package our code depended upon at compile time and when linking. These lines were:
● 'pkg-config --cflags libmongoc-1.0' (which we referred to with the variable MONGOCOMP)
● 'pkg-config --libs libmongoc-1.0' (which we referred to with the variable MONGOLINK)

We included the MONGOCOMP variable in all makefile recipes that compiled code but were not part of a greater shared library. We included both the MONGOCOMP and MONGOLINK variables in all the makefile recipes which had to be both compiled and linked as part of a shared library. We also included header files for both the Mongo C library and the BSON library -- mongoc.h and bson.h respectively. While working with Mongo we often found it necessary to refer back to

6 http://mongoc.org/libmongoc/current/installing.html

the API references for both libmongoc7 (the library for the Mongo C drivers) and libbson8 (the library for the BSON document type).

Mongo Connect API

In order to efficiently interface with Mongo, we developed a series of functions in the mongo_connect class. We also defined a struct called mongo_connection which keeps track of all the data relevant to an application's active connection to Mongo.

Mongo Connection Object:

The mongo_connection object contains a series of pointers that specify the information which defines an active connection to a Mongo database.

uri_string: a string identifier in the URI format which defines the connection between our application and the Mongo database.9

uri: an identifier which defines the connection between our application and the Mongo database. The uri variable is of type mongoc_uri_t, a type which is standardized by MongoDB.10

client: a pointer to a mongoc_client_t struct that manages sockets and routing to any potential nodes in the MongoDB system.11

database: a pointer to a mongoc_database_t struct that provides access to and acts as a handle for the Mongo database.12

collection: a pointer to a mongoc_collection_t struct that provides access to and acts as a handle for the Mongo collection.13

error: a bson_error_t struct that will be set with the appropriate error code should any aspect of the mongo_connect process or the mongo_teardown process fail.14
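Put together, a declaration consistent with these descriptions would look roughly like the following sketch (the authoritative declaration lives in the mongo_connect source):

#include <mongoc.h>

struct mongo_connection {
    char *uri_string;                /* connection string in URI format */
    mongoc_uri_t *uri;               /* parsed form of uri_string */
    mongoc_client_t *client;         /* manages sockets and routing */
    mongoc_database_t *database;     /* handle for the Mongo database */
    mongoc_collection_t *collection; /* handle for the Mongo collection */
    bson_error_t error;              /* set if connect or teardown fails */
};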

Mongo Connection Functions:

Mongo Connection

7 http://mongoc.org/libmongoc/current/index.html
8 http://mongoc.org/libbson/current/index.html
9 https://docs.mongodb.com/manual/reference/connection-string/
10 http://mongoc.org/libmongoc/current/mongoc_uri_t.html
11 http://mongoc.org/libmongoc/current/mongoc_client_t.html
12 http://mongoc.org/libmongoc/current/mongoc_database_t.html
13 http://mongoc.org/libmongoc/current/mongoc_collection_t.html
14 http://mongoc.org/libbson/current/bson_error_t.html

int mongo_connect(struct mongo_connection *cn, char *db_name, char *coll_name);

This function populates an instance of a mongo_connection struct (cn) and creates a connection to a Mongo database and collection, both of which are specified by name as strings (db_name and coll_name respectively). Returns 0 upon the establishment of a successful connection, -1 otherwise.

Mongo Teardown

int mongo_teardown(struct mongo_connection *cn);

This function closes the connection between an application and the Mongo database and collection that the application is connected to, as specified by the contents of the mongo_connection struct passed by pointer (cn). Returns 0 upon success, -1 otherwise.
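A typical round trip through these two functions looks like the following sketch; mongo_connect.h is an assumed header name for the declarations above.

#include <stdio.h>
#include "mongo_connect.h" /* assumed header for the declarations above */

int example_session(void)
{
    struct mongo_connection cn;

    /* Connect to the user collection of the insta database. */
    if (mongo_connect(&cn, "insta", "user") != 0) {
        fprintf(stderr, "failed to connect to Mongo\n");
        return -1;
    }

    /* ... insert, delete, or query documents here ... */

    return mongo_teardown(&cn);
}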

Build JSON

char* build_json(mongoc_cursor_t *cursor, int req_num, int *result, int *length);

This function creates a dynamically allocated string of JSON documents that reflect the search results generated by a query of a Mongo collection. It takes a pointer to a mongoc_cursor_t object that points to the first result in the series of results generated by querying a Mongo collection (cursor). The function also takes a maximum number of results to append to the JSON string (req_num), a pointer to an integer that will be filled with the actual number of resulting JSONs that were appended (result), and a pointer to an integer that will be filled with the length of the resulting JSON string (length). Returns a pointer to the dynamically allocated JSON string that has been generated, or NULL if the function fails or no results are found to be associated with cursor.
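At the heart of build_json is a cursor walk of this general shape, in which each result document is converted to its JSON text with libbson's bson_as_json; this is a simplified sketch rather than the function's actual body.

#include <mongoc.h>
#include <stdio.h>

static void print_results(mongoc_cursor_t *cursor)
{
    const bson_t *doc;

    while (mongoc_cursor_next(cursor, &doc)) {
        char *json = bson_as_json(doc, NULL); /* malloc'd JSON string */
        printf("%s\n", json);
        bson_free(json);
    }
}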

Insert JSON from fd

int insert_json_from_fd(int fd, char *collection_name);

This function takes a file descriptor (fd) associated with a buffer of JSON string data. It reads from the file descriptor and converts the JSON data into BSON documents. These documents, which describe user or dispatch objects, are then inserted into the appropriate Mongo collection, specified by name (collection_name). If a duplicate user or dispatch object exists in the collection specified by collection_name, then that BSON document is deleted and the more recently created BSON document is inserted in its place. Returns the number of BSON documents inserted successfully, or -1 upon failure.

User Class

The user object contains all of the necessary fields to describe a given user and to perform social networking operations.

User Object:

The user object is defined with the following fields in user_definitions.h:

user_id: a globally unique 64-bit unsigned integer. Ideally, the user_id field would be randomly generated when a new user object is created, but in the context of our networking infrastructure, it benefited us to always have predefined user_ids.

username: a pointer to a character string containing the username of the user.

image_length: a 32-bit unsigned integer that describes the length of the binary data in the image field.

image: a pointer to an array of 8-bit unsigned integers containing a binary representation of the user's profile image.

bio: a substructure of the user object; this field is an instance of struct personal_data and is composed of the following fields:
name: a pointer to a character string containing the name of the user.
date_created: a timestamp given as the time since the Unix epoch that describes when the user object was first created. This field is of type time_t and is currently unused.
date_modified: a timestamp defined the same as date_created that ideally describes when the user object was last modified. The intention behind this timestamp was to have "pull user****" networking protocols return only user objects that have been modified since the last time they were pulled, in order to limit network usage. This field is of type time_t and is currently unused.

fragmentation: an integer representing whether a user's user object has been fragmented into multiple user objects. The intention behind this field is that a superuser with a large number of followers would generate a substantial amount of network traffic and strain on their server. Our solution would be to fragment such a user into multiple smaller users with identical data whose increased number of mirrors would be spread out over a wider area. We never implemented this behavior, and the fragmentation field is currently unused.

followers: a substructure of the user object; this field is an instance of struct insta_relations and is composed of the following fields:

direction: an integer representing the direction of the relationship. 0 for followers.
count: an integer representing the number of integers in the user_ids field.
user_ids: a pointer to an array of 64-bit unsigned integers representing the user_ids of the users following a given user.

following: a substructure of the user object, also an instance of struct insta_relations:
direction: an integer representing the direction of the relationship. 1 for following.
count: an integer representing the number of integers in the user_ids field.
user_ids: a pointer to an array of 64-bit unsigned integers representing the user_ids of the users followed by a given user.

Note: several fields are almost always stored as heap variables, so it is important to call user_heap_cleanup() on user structs returned or populated by functions.
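A declaration consistent with the field descriptions above would look roughly like this sketch (the authoritative version lives in user_definitions.h):

#include <stdint.h>
#include <time.h>

struct personal_data {
    char *name;
    time_t date_created;   /* currently unused */
    time_t date_modified;  /* currently unused */
};

struct insta_relations {
    int direction;         /* 0 for followers, 1 for following */
    int count;             /* number of entries in user_ids */
    uint64_t *user_ids;
};

struct user {
    uint64_t user_id;      /* globally unique identifier */
    char *username;
    uint32_t image_length; /* length of the binary profile image */
    uint8_t *image;
    struct personal_data *bio;
    int fragmentation;     /* currently unused */
    struct insta_relations *followers;
    struct insta_relations *following;
};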

User Functions:

The user functions can be divided into three groups: primitives for manually inserting, deleting, and printing entries and structs; user search functions; and higher-level functions used by the server.

Group 1: Primitives for Manually Modifying and Viewing Entries

Insert User

int insert_user(struct user *new_user);

This function takes a user struct, new_user, as an argument and inserts it into a user's user collection, which is contained in the Mongo database insta. In order to be inserted, the user struct is transposed to a BSON document (Mongo's binary JSON format). Returns 0 upon successful insertion, -1 on failure or if new_user is already contained in the user collection.

Delete User

int delete_user(uint64_t user_id);

This function takes a user_id as an argument and deletes the corresponding user object from a user's user collection, contained in the Mongo database insta. Returns 0 on success, or -1 on failure.

User Heap Cleanup

void user_heap_cleanup(struct user *user);

This function takes a pointer to a user struct as an argument and frees all malloc'd memory within, including bio->name, bio, followers->user_ids, followers, following->user_ids, and following.

Print User Struct

void print_user_struct(struct user *user);

This function takes a pointer to a user struct as an argument and prints all fields in a formatted list.

Group 2: User Search Functions

Search User by Name Mongo

char * search_user_by_name_mongo(char *username, int req_num, int *result, int *length);

This function takes a pointer to a string that will be used to search the collection (username), an upper limit on the number of results to query for (req_num), a pointer to an integer which is updated to reflect the number of user objects found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of user objects (in JSON format) generated and returned by this query (length). This function queries the user collection in the insta Mongo database for any user objects such that the username or bio.name field matches the queried username. If req_num is equal to -1, then the limit on the number of results to query for is set to INT_MAX, allowing for an exhaustive search of the user collection. This function returns a pointer to a JSON string that contains all of the resulting BSON documents that were found by the query, or NULL on failure. This pointer must be free'd by the calling function.
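The $or filter described above can be expressed with libbson's BCON macros roughly as follows; the actual implementation may construct the query differently, and build_json would then walk the returned cursor.

#include <mongoc.h>

/* Sketch: match user documents whose username or bio.name equals the
 * queried name. */
static mongoc_cursor_t *find_users_by_name(mongoc_collection_t *coll,
                                           const char *username)
{
    bson_t *filter = BCON_NEW("$or", "[",
                              "{", "username", BCON_UTF8(username), "}",
                              "{", "bio.name", BCON_UTF8(username), "}",
                              "]");
    mongoc_cursor_t *cursor =
        mongoc_collection_find_with_opts(coll, filter, NULL, NULL);

    bson_destroy(filter);
    return cursor;
}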

Search User by Id Mongo

char * search_user_by_id_mongo(uint64_t user_id, int req_num, int *result, int *length);

This function takes a uint64_t that will be used to search the user collection (user_id), an upper limit on the number of results to query for (req_num), a pointer to an integer which is updated to reflect the number of user objects found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of user objects (in JSON format) generated and returned by this query (length). This function queries the user collection in the insta Mongo database for any user objects such that the user_id field matches the queried user_id. If req_num is equal to -1, then the limit on the number of results to query for is set to INT_MAX, allowing for an exhaustive search of the user collection. This function returns a pointer to a JSON string that contains all of the resulting BSON documents that were found by the query, or NULL on failure. This pointer must be free'd by the calling function.

Group 3: Higher Level Definitions Used by the Server

Handle User Bson

int handle_user_bson(bson_t *doc);

This function takes a pointer to a BSON document that contains the data describing a user (doc). The function parses this data into an instance of a user struct, then uses the user_id field and the search_user_by_id_mongo function to identify whether the referenced user is a duplicate of any other user BSON document currently in the user collection. If a duplicate user is found, it is deleted, and the newly populated user struct (which was generated according to the specifications defined by doc) is inserted in its place. This function returns 0 upon success, -1 otherwise.

Parse User Bson

int parse_user_bson(struct user *user, const bson_t *doc);

This function takes a pointer to a user struct (user), which will be populated with the information stored in the BSON document pointed to by doc. This function populates user and returns 0 if the struct has been successfully populated; otherwise -1 is returned. It is important to note that several fields in the user struct always exist as heap variables: user->bio->name, user->bio, user->followers->user_ids, user->followers, user->following->user_ids, and user->following. It is up to the caller to free these variables using user_heap_cleanup().

Dispatch Class

The dispatch object is our abstraction for any post, comment, or message in a conversation thread. After reviewing the work completed during the 2017 summer research session, we felt that it was key that we abide by its philosophy that most internet sites can be reduced to a common set of actions and data structures. In applying this rationale, we focused on creating a single data structure that would handle all user-generated content. We hypothesized that while a post containing a photo and a comment on said photo might appear very different from a user's point of view, the underlying data that organizes and defines the two pieces of media is very similar. The definition for the dispatch struct and the definitions for all dispatch methods can be found in dispatch_definitions.h and dispatch_definitions.c respectively.

Dispatch Object:

With the initial goal of creating a common data structure for all user-generated content, or dispatches, we built a dispatch object definition which includes the following fields in dispatch_definitions.h:

dispatch_body: A substructure of the dispatch object, the dispatch body includes all fields which define the primary media of a dispatch object:
media_size: A simple integer value expressing the size of the media binary that is pointed to by the media field. There is no explicit maximum value for the media size, but our client and server programs cap the amount of data they read or write in a single operation at 16000 bytes, which implies that the maximum media size under the current configuration is less than 16000 bytes (as some of the bytes transmitted in any given network communication must be used by the network protocols and the other fields of the dispatch object).
media: A pointer to binary data for the primary media of a dispatch. In the case of a post, it would likely be a pointer to the binary data for a photograph; in the case of a comment or message, which lacks primary media, the media field points to a binary hash of the string "no image".
text: A pointer to a character string that acts as a caption for the primary media in the case where the dispatch describes a post. When the dispatch describes a comment or a message, the text field is ostensibly the primary media.

user_id: A 64-bit random integer. Each user has a globally unique 64-bit user_id which is the primary identifier for their user object. When used in reference to a dispatch, the user_id identifies the user who created the given dispatch.

timestamp: A timestamp given as the time since the Unix epoch that marks when a dispatch object was initially created. There are currently no functions which take advantage of the timestamp field, but it is well suited to allow for optimization of the dispatch-based functions and queries outlined later in this section. This field is of type time_t.

audience_size: An integer which broadly specifies the intended audience of a dispatch, or indicates that there is a select subset of users for whom the dispatch is meant. In the case where the audience size is equal to 0, the dispatch is meant to be viewed by all followers, and the subsequent audience field will consist of an empty array. If audience size is greater than zero, then the audience field is expected to consist of an array of user_ids such that the number of user_ids is equal to the audience size. Audience size is capped at a functional maximum value of 32, meaning that there should never be more than 32 user_ids stored in the audience field's array. When we initially formulated the functionality of the audience and audience size fields for the dispatch object, we planned to use the fields as a means to define the privacy of a dispatch. In the end, we did not implement formal privacy procedures or protections, but we had planned on designing our system such that an audience size of 0 would indicate that a dispatch is viewable only by followers, an audience size greater than 0 and less than or equal to 32 would indicate that a dispatch is viewable only by the users specified by user_id in the audience field's array, as a group message (mirroring Instagram's group messaging functionality), and an audience size greater than 32 would indicate that a dispatch is viewable by the "public" -- with no privacy restrictions placed on who could access the dispatch.
audience: An array of user_ids which specifies which users are meant to view a dispatch. This field is intended to define a list of group members who would be able to view a dispatch functioning as a group message. This array is capped at 32 users as per the Instagram spec, and out of a desire for the future privacy specifications outlined in the definition of the audience size field. The number of users in this list is specified by audience_size.
num_tags: An integer indicating the number of tags in the array of tags that is defined in the subsequent tags field. The maximum number of tags, and therefore the maximum value of num_tags, is 30. This value reflects the cap that Instagram places on the number of tags per post, and was selected for the sake of both convenience and efficiency.
tags: An array of keyword strings, or tags, that can be used for querying. There is a maximum of 30 tags which can be assigned to a single dispatch, and each tag is capped at a maximum of 50 characters.
num_user_tags: An integer indicating the number of users tagged via id in the array of user_ids that is defined in the subsequent user_tags field. The maximum number of tagged users, and therefore the maximum value of num_user_tags, is 30. This is the same as the maximum number of traditional tags which can be associated with a dispatch.

user_tags: An array of user_ids of tagged users. Tagged users will have the dispatch object that they are tagged in pushed to the machine and database that their user_id is associated with. A maximum of 30 user_tags can be assigned to a dispatch object.

parent: A substructure of the dispatch object; the parent describes the context of the dispatch:

type: An integer which contextualizes the subsequent id field of the parent substructure, and by extension provides context for the dispatch. If type is set to 0, the id field is interpreted as a user_id, and any dispatch whose parent substructure contains a user_id should be interpreted as a traditional post: the user who “posted” is the parent of the dispatch, and the dispatch object is directly associated with that user. If type is set to 1, the id field is interpreted as a dispatch_id, and any dispatch with a dispatch_id as its parent id should be interpreted as a comment: the dispatch referenced by the parent id is the head of the comment thread that this dispatch is a part of.

id: A 64-bit integer that is either a user_id or a dispatch_id, as specified by type.

fragmentation: An integer which indicates across how many users’ systems a given dispatch is replicated. This field is included for the purpose of future optimization but is currently unused, making its value arbitrary.

dispatch_id: A 64-bit integer that is assigned randomly to each dispatch and assumed to be globally unique within the scope of our current platform (though strictly, the dispatch_id must only be unique when referenced in combination with the user_id field of a dispatch; in other words, a user should never create multiple dispatches with the same dispatch_id).
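Taken together, these fields suggest the following C layout. This is a hedged sketch reconstructed from the descriptions above, not the literal contents of dispatch_definitions.c; in particular, representing body and parent as heap-allocated pointers is our reading of the dispatch_heap_cleanup description below.

#include <stdint.h>
#include <time.h>

#define MAX_AUDIENCE  32   /* cap on the audience array (Instagram group size) */
#define MAX_TAGS      30   /* cap on keyword tags */
#define MAX_USER_TAGS 30   /* cap on tagged users */

struct dispatch_body {
    int      media_size;   /* size of the media binary; < 16000 bytes under current configs */
    uint8_t *media;        /* primary media binary, or a hash of "no image" */
    char    *text;         /* caption, or the primary content of a comment/message */
};

struct dispatch_parent {
    int      type;         /* 0: id is a user_id (post); 1: id is a dispatch_id (comment) */
    uint64_t id;
};

struct dispatch {
    struct dispatch_body   *body;
    uint64_t                user_id;        /* creator of this dispatch */
    time_t                  timestamp;      /* creation time, seconds since the Unix epoch */
    int                     audience_size;  /* 0 => all followers */
    uint64_t                audience[MAX_AUDIENCE];
    int                     num_tags;
    char                   *tags[MAX_TAGS]; /* each tag at most 50 characters */
    int                     num_user_tags;
    uint64_t                user_tags[MAX_USER_TAGS];
    struct dispatch_parent *parent;
    int                     fragmentation;  /* reserved; currently unused */
    uint64_t                dispatch_id;    /* unique per (user_id, dispatch_id) pair */
};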

Dispatch Functions: The dispatch functions can be sorted into three primary groups: those which change the state of the Mongo dispatch collection (inserting and deleting dispatches), those which search the Mongo dispatch collection (querying for dispatches by specific fields), and those which interface with higher levels of testing code and other elements of our system (such as parsing between BSON format and dispatch structs). All of the following functions are found in dispatch_definitions.c.

Group 1: Changing the State of the Mongo Dispatch Collection

Insert Dispatch

int insert_dispatch(struct dispatch *dis);

This function takes a dispatch struct, dis, as an argument and inserts it into a user’s dispatch collection, which is contained in the Mongo database insta. In order to be inserted, the dispatch struct is converted to a BSON document (Mongo’s binary JSON format). Returns 0 upon successful insertion, -1 otherwise.
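A minimal usage sketch, assuming the struct layout sketched above; the ids here are illustrative values, and insert_dispatch is declared in our dispatch library.

#include <stdint.h>
#include <time.h>

/* Sketch: build a bare-bones post and insert it. insert_dispatch converts the
 * struct to BSON internally; fields left out of the initializers are zeroed,
 * so audience_size == 0 (all followers) and num_tags == 0. */
int make_example_post(void) {
    struct dispatch_body body = {
        .media_size = 8,
        .media      = (uint8_t *)"no image",  /* placeholder media */
        .text       = "first post",
    };
    struct dispatch_parent parent = { .type = 0, .id = 12345 };  /* type 0: a post */
    struct dispatch dis = {
        .body        = &body,
        .parent      = &parent,
        .user_id     = 12345,       /* hypothetical user_id */
        .dispatch_id = 67890,       /* hypothetical dispatch_id */
        .timestamp   = time(NULL),
    };
    return insert_dispatch(&dis);   /* 0 on success, -1 otherwise */
}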

Delete Dispatch

int delete_dispatch(uint64_t dispatch_id);

This function queries the dispatch collection of the insta database for a BSON document that contains the given dispatch_id. Returns 0 if such a dispatch is found and deleted, otherwise returns -1.

Search Dispatch by ID

char *search_dispatch_by_id(uint64_t dispatch_id, int req_num, int *result, int *length);

This function queries the dispatch collection of the insta database for a BSON document that contains the dispatch id specified by dispatch_id. If such a dispatch exists, it is converted from a BSON document to a JSON string. A pointer to this JSON string is returned, and the number of dispatches found (and contained in the JSON string) is reflected in the integer pointed to by result. The function returns a NULL pointer and sets result to 0 if no dispatches are found. The integer argument req_num is a cap on the number of results to search for (i.e. a query terminates after finding a single result if req_num == 1). If req_num is set to -1, the cap on search results is set to INT_MAX, making the limiting factor on the number of possible results the number of matching dispatches in the collection being queried. The length argument is a pointer to an integer that is populated with the final size of the resulting buffer of dispatches in JSON format. If no dispatches are found, length is set to 0.
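A usage sketch, continuing the hypothetical ids from the insertion example above:

#include <stdio.h>
#include <stdlib.h>

/* Sketch: fetch a single dispatch by id. req_num == 1 stops the query after
 * the first match; the returned JSON buffer is owned by the caller. */
void find_example_dispatch(void) {
    int result = 0, length = 0;
    char *json = search_dispatch_by_id(67890, 1, &result, &length);
    if (json == NULL)
        return;  /* result was set to 0: no matching dispatch */
    printf("found %d dispatch(es), %d bytes of JSON\n", result, length);
    free(json);
}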

Group 2: Searching the Mongo Dispatch Collection

Search Dispatch by User Audience

char *search_dispatch_by_user_audience(uint64_t user_id, uint64_t *audience, int audience_size, int req_num, int *result, int* length);

This function takes a user_id, a pointer to a list of user_ids who define the audience of a dispatch (audience), the size of the audience list (audience_size), an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches that matched the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the results found during the query. This pointer must be freed by the function which calls search_dispatch_by_user_audience. If no dispatch objects are found, NULL is returned and length is set to 0. This return condition is consistent across all dispatch search functions.

Search Dispatch by Parent ID

char *search_dispatch_by_parent_id(uint64_t dispatch_id, int req_num, int *result, int *length);

This function takes a dispatch_id, an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the BSON documents found during the query, or NULL on failure. This pointer must be freed by the function which calls search_dispatch_by_parent_id.

Search Dispatch by Tags

char *search_dispatch_by_tags(const char *query, int req_num, int *result, int *length);

This function takes a pointer to a string that will be used to search the collection (query), an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the resulting BSON documents found by the query, or NULL on failure. This pointer must be freed by the function which calls search_dispatch_by_tags.

Search Dispatch by User Tags

char *search_dispatch_by_user_tags(uint64_t query, int req_num, int *result, int *length);

This function takes a user_id that will be used to search the collection for all instances where the id appears in the user_tags field of a dispatch (query), an upper limit on the number of dispatches to query for (req_num), a pointer to an integer which is updated to reflect the number of dispatches found as a result of the query (result), and a pointer to an integer which is updated with the final size of the buffer of dispatches (in JSON format) generated and returned as a result of the query (length). If req_num is equal to -1, the limit on the number of dispatches to query for is set to INT_MAX. This function returns a pointer to a JSON string that contains all of the resulting BSON documents found by the query, or NULL on failure. This pointer must be freed by the function which calls search_dispatch_by_user_tags.

Group 3: Transitional Functions

Parse Dispatch BSON

int parse_dispatch_bson(struct dispatch *dis, const bson_t *bson_dispatch);

This function takes a pointer to a dispatch struct (dis), which will be populated with the information stored in the BSON document pointed to by bson_dispatch. It returns 0 if the struct has been successfully filled; otherwise -1 is returned. Note that because the audience, user_tags, and tags arrays are of a set size, the parse_dispatch_bson function only initializes a number of entries equal to audience_size, num_user_tags, or num_tags, respectively. Any other data found in these arrays is invalid, so any code iterating through these arrays should only iterate up to the counts these fields state (a usage sketch follows the Dispatch Heap Cleanup entry below).

Print Dispatch Struct

int print_dispatch_struct(struct dispatch *dis);

This function takes a pointer to a dispatch struct (dis) and prints its contents as a formatted list.

Dispatch Heap Cleanup

void dispatch_heap_cleanup(struct dispatch *dis);

This function frees the memory associated with a dispatch passed by pointer (dis), as well as its substructs (the body and parent structs) and its variable-length string fields. It should be called after calling the parse_dispatch_bson function, once the user is finished referencing the memory associated with the dispatch struct that was populated.
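The intended call pattern for these functions, sketched under the same struct assumptions as above; bson_dispatch stands in for a document obtained from a prior Mongo query.

#include <bson/bson.h>
#include <stdio.h>

/* Sketch: parse a BSON document into a dispatch struct, read its tags
 * safely, then release the parsed memory. */
void consume_dispatch(const bson_t *bson_dispatch) {
    struct dispatch dis;
    if (parse_dispatch_bson(&dis, bson_dispatch) != 0)
        return;                     /* -1: the struct could not be filled */
    /* Only the first num_tags entries of the tags array are valid. */
    for (int i = 0; i < dis.num_tags; i++)
        printf("tag %d: %s\n", i, dis.tags[i]);
    print_dispatch_struct(&dis);
    dispatch_heap_cleanup(&dis);    /* frees substructs and variable-length strings */
}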

Handle Dispatch BSON

int handle_dispatch_bson(bson_t *doc);

This function takes a pointer to a BSON document that contains the data describing a dispatch (doc). The function parses this data into an instance of a dispatch struct, then uses the dispatch_id field and the search_dispatch_by_id function to determine whether the referenced dispatch is a duplicate of any dispatch BSON document currently in the dispatch collection. If a duplicate dispatch is found, it is deleted, and the newly populated dispatch struct (generated to the specifications defined by doc) is inserted in its place. This function returns 0 upon success, -1 otherwise.

Network Setup and Protocols

Network Setup

Once Apache Cassandra is configured, the network setup for our project is relatively straightforward. Almost all layers of abstraction described in the different libraries culminate in the server, a simple single-threaded application that accepts incoming TCP connections and services requests to the instances of the Mongo databases running on nodes. The client that connects to the server is also fairly straightforward. It looks up the IP address of a node in the Cassandra cluster from a given user_id, initiates a TCP connection to that user’s node, reads network protocol commands from a file, and writes the commands into the connection file descriptor. In the case of a push protocol, it simply disconnects after all data has been written. In the case of a pull command, it waits for a response, aggregates all data as a single heap variable, formulates a “push dispatch” or “push user****” protocol, opens a connection to the instance of the server on the local machine, executes the push protocol, and disconnects. In this fashion the server mediates all access to any instance of a Mongo database, even instances on the same machine as the client. We chose to design the system this way to avoid premature optimization via multithreading, and because the distribution of C drivers we use to access Mongo makes no promises about the thread safety of various critical functions.
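To make the client’s push path concrete, here is a minimal sketch. It assumes a hypothetical helper, get_user_ip, standing in for the user_id-to-IP lookup performed through our Cassandra library (the real helper may have a different name and signature), and the port 3999 used by the Python wrappers described later.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#define INSTA_PORT 3999  /* the port hardcoded in the Python wrappers */

/* Hypothetical stand-in for the Cassandra lookup of a node's IP address. */
int get_user_ip(uint64_t user_id, char *ip, size_t len);

int send_command(uint64_t user_id, const char *command, size_t len) {
    char ip[INET_ADDRSTRLEN];
    if (get_user_ip(user_id, ip, sizeof(ip)) != 0)
        return -1;                                  /* user not in the cluster */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(INSTA_PORT) };
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    ssize_t n = write(fd, command, len);  /* a push writes and disconnects */
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}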

Network Protocols

There are two classes of networking protocols in this system: pushes and pulls. Pulls are used to update a user’s database with information from another database, and pushes are used to force an update to an item in another user’s database. All protocol fields are 14 characters long (including the space separating the protocol field from the content of the message). All network protocol functions are defined in network_protocols.c. Each network protocol function takes two file descriptors as arguments, an in file descriptor from which to read data and an out file descriptor to which to write data; we chose to do this for testing purposes. In the server, both file descriptors point to the TCP connection. We chose to keep this behavior in case the server someday becomes a more complicated apparatus that requires reading and writing to multiple locations.
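For reference, each protocol field documented below is a 13-character string padded with asterisks, followed by the separating space. A sketch of how the fields might be defined as C constants (the macro names are ours for illustration; the strings are the documented protocol fields):

/* Protocol fields: 13 characters padded with '*', plus one separating space. */
#define PROTOCOL_FIELD_LEN 14

#define PULL_CHILD    "pull child*** "
#define PULL_ALL      "pull all***** "
#define PULL_DISPATCH "pull dispatch "
#define PULL_USER     "pull user**** "
#define PULL_USER_TAG "pull user_tag "
#define PULL_TAGS     "pull tags**** "
#define PUSH_CHILD    "push child*** "
#define PUSH_USER_TAG "push user_tag "
#define PUSH_MESSAGE  "push message* "
#define PUSH_DISPATCH "push dispatch "
#define PUSH_USER     "push user**** "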

Pull Protocols

Pull Child

pull child***

This request pulls all children of a given dispatch object, i.e. comments, replies to comments, and shared posts.

int pull_child(int in, int out);

This function reads a parent dispatch_id from the in file descriptor, queries the database ​ ​ for all dispatches with this dispatch_id as parent, and writes the results to file descriptor out as JSON objects. ​ This function returns the number of results on success or -1 on failure.
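A sketch of pull_child under the conventions above. We assume the dispatch_id arrives as ASCII text on the in descriptor; the actual wire format in network_protocols.c may differ.

#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

int pull_child(int in, int out) {
    char buf[32] = {0};
    if (read(in, buf, sizeof(buf) - 1) <= 0)
        return -1;
    uint64_t parent_id = strtoull(buf, NULL, 10);

    int result = 0, length = 0;
    /* req_num == -1: no cap on the number of children returned. */
    char *json = search_dispatch_by_parent_id(parent_id, -1, &result, &length);
    if (json == NULL)
        return -1;
    write(out, json, length);  /* children are written back as JSON objects */
    free(json);
    return result;             /* number of results on success */
}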

Pull All

pull all*****

This request pulls all dispatches whose parent_id field matches the supplied user_id, i.e. all public posts by a user.

int pull_all(int in, int out);

This function reads a user_id from the in file descriptor, queries for all dispatches by that ​ ​ ​ ​ user, and writes them to file descriptor out as JSON objects. ​ ​ This function returns the number of results on success or -1 on failure.

Pull Dispatch

pull dispatch

This request pulls a dispatch specified by dispatch_id. ​ ​

int pull_dispatch(int in, int out);

This function reads a dispatch_id from the in file descriptor, queries the database for that ​ ​ ​ ​ dispatch, and writes it to the out file descriptor as a JSON object. ​ ​ This function returns the number of results on success or -1 on failure.

Pull User

pull user****

This request pulls a user object specified by user_id. ​ ​

int pull_user(int in, int out);

This function reads a user_id from the in file descriptor, queries the database for that ​ ​ ​ ​ user, and writes the user object to the out file descriptor as a JSON object. ​ ​ This function returns the number of results on success or -1 on failure.

Pull User Tag

pull user_tag

This request pulls all dispatches from a given node with a user specified by user_id ​ included in the user_tags field of the dispatch object. ​ ​

int pull_user_tags(int in, int out);

This function reads a user_id from the in file descriptor, queries the database for all ​ ​ ​ ​ dispatches with user_id included in the user_tags field, and writes the results to the out ​ ​ ​ ​ ​ file descriptor as a JSON object. This function returns the number of results on success or -1 on failure.

Pull Tags

pull tags****

This request pulls all dispatches from a given node with the string tag included in the ​ ​ tags field of the dispatch object. ​

int pull_tags(int in, int out);

This function reads a string query from the in file descriptor (delimited by a space), ​ ​ queries the database for all dispatches tagged with that string, and writes the results to the out file descriptor as a JSON object. ​ ​

This function returns the number of results on success or -1 on failure.

Push Protocols

Note: all of the push protocols currently function in exactly the same way. We decided to create a separate function for each protocol because the behavior of the server will need to change in the future to update the client process depending on what type of data it receives.

Push Child

push child*** {json dispatch}

This request sends a child dispatch, i.e. a comment or a reply to a comment, as a JSON object to another user.

int push_child(int in, int out);

This function allows the server to receive a child dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.
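Since the push handlers all share this shape, one sketch suffices. It reads the JSON body from in, converts it to BSON with libbson’s bson_new_from_json, and lets handle_dispatch_bson perform the duplicate check and insertion; the single fixed-size read is a simplification of the real 16000-byte chunked I/O.

#include <bson/bson.h>
#include <unistd.h>

int push_child(int in, int out) {
    (void)out;                             /* pushes write no response */
    char buf[16000];
    ssize_t n = read(in, buf, sizeof(buf));
    if (n <= 0)
        return -1;

    bson_error_t error;
    bson_t *doc = bson_new_from_json((const uint8_t *)buf, n, &error);
    if (doc == NULL)
        return -1;                         /* malformed JSON */

    int ret = handle_dispatch_bson(doc);   /* delete any duplicate, then insert */
    bson_destroy(doc);
    return ret;                            /* 0 on success, -1 on failure */
}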

Push User Tag

push user_tag {json dispatch}

This request sends a dispatch as a JSON object to a user whose user_id is included in ​ ​ the user_tags field of the dispatch object. ​ ​

int push_user_tag(int in, int out);

This function allows the server to receive a user_tag dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.

Push Message

push message* {json dispatch}

This request sends a dispatch as a JSON object to a user whose user_id is included in ​ ​ the audience field of the dispatch object. ​ ​

int push_message(int in, int out);

This function allows the server to receive a direct message dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.

Push Dispatch

push dispatch {json dispatch}

This request sends a dispatch as a JSON object to a specified user’s database. The intention is that a user would use this protocol to push a dispatch to the instance of the Mongo insta.dispatch database on their local machine.

int push_dispatch(int in, int out);

This function allows the server to receive a dispatch as a JSON object and insert it into the insta.dispatch Mongo database. ​ ​ This function returns 0 on success or -1 on failure.

Push User

push user**** {json user}

This request sends a user as a JSON object to a specified user’s database. The intention of this protocol is that a user would use it to update their user object.

int push_user(int in, int out);

This function allows the server to receive a user as a JSON object and insert it into the insta.users Mongo database. ​ This function returns 0 on success or -1 on failure.

Parse Server Command

int parse_server_command(int in, int out);

This function is a wrapper for all of the network protocol functions and is the core function of the server. It reads the first 14 bytes (the protocol field length) from the in file descriptor, compares them to each defined protocol, and calls the corresponding protocol function. This function returns -1 on failure, or the result of the corresponding network protocol function on success.
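A sketch of the dispatcher, reusing the protocol strings sketched earlier; only three of the eleven comparisons are shown, since the rest follow the same pattern.

#include <string.h>
#include <unistd.h>

int parse_server_command(int in, int out) {
    char field[PROTOCOL_FIELD_LEN + 1] = {0};
    if (read(in, field, PROTOCOL_FIELD_LEN) != PROTOCOL_FIELD_LEN)
        return -1;

    if (strncmp(field, PULL_CHILD, PROTOCOL_FIELD_LEN) == 0)
        return pull_child(in, out);
    if (strncmp(field, PULL_ALL, PROTOCOL_FIELD_LEN) == 0)
        return pull_all(in, out);
    if (strncmp(field, PUSH_DISPATCH, PROTOCOL_FIELD_LEN) == 0)
        return push_dispatch(in, out);
    /* ...the remaining pull and push protocols follow the same pattern... */
    return -1;  /* unrecognized protocol field */
}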

Testing Design and Utilities

In order to test our platform, we needed to be able to produce a series of user actions and user and dispatch data with which we could simulate user interaction. We pivoted from C to Python, which allowed us to efficiently write simple wrapper functions that directly correlate to the user stories we originally outlined while still utilizing the mid-level C code we had previously developed. We then wrote programs which allow us to call a series of these user action functions based on semi-randomly generated csv files containing references to the functions and the necessary arguments to run them. We also wrote Python programs to build user and dispatch objects as JSON files, which we could programmatically populate with semi-random data. Through this process we were able to programmatically generate posts, messages, comments, user updates, and other user-generated content while mimicking the behavior of user actions in the running process.

Build User JSON

def build_user(user):

This Python function takes a dictionary (user) which is based on the original definition of a user object and contains the following data fields:

user_id: a 64-bit integer representing a user object’s globally unique ID.

username: a string representation of a user’s username as defined by their username entry in the Cassandra database.

image_path: a string that describes the path to a user’s icon/image file.

fragmentation: an integer describing the fragmentation attribute of the user object being constructed (currently unused, and set to 0 by default in other testing programs).

followers: an array of user_ids describing the followers associated with the user object being constructed.

following: an array of user_ids describing the users followed by the user who “owns” the user object being constructed.

From the provided user data, this function builds a JSON document formatted in the extended style that is canonical to MongoDB.15 This JSON string is returned if built successfully; otherwise None is returned upon an error.

Build Dispatch JSON

def build_dispatch(dis):

This Python function takes a dictionary (dis) which is based on the original definition of a dispatch object and contains the following data fields:

15 https://docs.mongodb.com/manual/reference/mongodb-extended-json/

media_path: a string that describes the path to the primary media of a dispatch. If the dispatch object is functioning as a comment or message and no primary media path is provided, this field should be set to the string ‘no image’.

body_text: a string that acts as a caption for the dispatch’s primary media.

user_id: a 64-bit integer representing a user’s globally unique ID.

audience: an array of user_ids. This array is limited to a maximum of 32 entries as defined in the dispatch object definition.

tags: an array of strings that act as queryable tags. This array is limited to a maximum of 30 entries as defined in the dispatch object definition.

user_tags: an array of user_ids of tagged users. This array is limited to a maximum of 30 entries as defined in the dispatch object definition.

parent_type: the type of the id referenced by the parent_id field of this dictionary. Refer to the parent object definition in the dispatch object definition section of this document for further information.

parent_id: either a user_id or a dispatch_id, depending on the value of the parent_type field.

fragmentation: an integer describing the fragmentation attribute of the dispatch object being constructed (currently unused, and set to 0 by default in other testing programs).

dispatch_id: a 64-bit integer that can be used to explicitly reference a single dispatch object. A dispatch_id should be unique to a user, meaning that there are no duplicate (dispatch_id, user_id) pairs.

From the provided dispatch data, this function builds a JSON document formatted in the extended style that is canonical to MongoDB. This JSON string is returned if built successfully; otherwise None is returned upon an error.

Python User Action Functions (Petepics.py)

Petepics.py contains a library of Python functions which act as wrappers that directly describe the user actions we outlined when first laying the groundwork for this project. The library also contains several helper functions for formatting user and dispatch data. These functions take advantage of the underlying C functions we wrote to interact with Mongo and perform network communication. The following functions are defined in petepics.py:

Helper Functions

Search User

def search_user(user_id)

This function searches the Mongo user collection for a user document with an id matching the one specified by user_id. It returns a JSON document generated from the matching user object, if found.

Decode Following

def decode_following(user)

This function takes a user object in the form of a dictionary (as defined in the documentation for the build_user function), and returns an array of all the user_ids contained in the set of users listed in the following field of the user object.

Search Dispatch

def search_dispatch(dispatch_id)

This function searches the Mongo dispatch collection for a dispatch document with a matching dispatch_id. It returns a JSON document generated by the dispatch object specified by dispatch_id if found.

Create Dispatch Command

def create_dispatch_command(dispatch_type, dispatch_json)

This function creates a network protocol string based on the string dispatch_type. It returns a correctly formulated network protocol string composed of the specified network protocol and an appended JSON document (dispatch_json) describing a dispatch object.

User Action Functions

Most user action functions use the client.c program to execute formulated network protocol commands. The path “./client” and port “3999” are hardcoded in every instance.

Lookup User

def lookup_user(username)

This function executes the search_user.c program, which searches the Cassandra hash table for users with a given name (username), then creates a “pull user****” command for each user_id returned as a result of the search. This function acts solely as a wrapper for the search_user.c program, and has no added functionality.

Write Dispatch

def write_dispatch(media_path, body_text, user_id, audience, tags, user_tags, parent_type, parent_id, fragmentation, dispatch_id, dispatch_type)

This function takes values for all of the fields of a dispatch dictionary as defined in the documentation for the build_dispatch function. It calls that function, then sends the resulting JSON to every user who should have the dispatch pushed to them (which includes the user posting the dispatch, any tagged user, and any user specified by the dispatch’s audience field). This function calls the client.c program to execute its push action. It returns the returncode attribute from a Python call function.

Write User

def write_user(user_id, username, image_path, name, fragmentation, followers, following, ip)

This function takes values for all of the fields of a user dictionary as defined in the documentation for the build_user function. It calls that function, then sends the resulting JSON to the Mongo collection of the user that was just generated. This function calls the client.c program to execute its push action. It returns the returncode attribute from a Python call function.

Update Feed

def update_feed(user_id)

This function updates the feed of the user specified by user_id by first pulling the corresponding user object from the Mongo collection, then calling the “pull all*****” network command for each user_id in the following array. This function calls the client.c program to execute each network protocol command.

Search Tags

def search_hashtags(user_id, tag)

This function allows a user to search the dispatches of the individuals they follow for instances of the specified tag. The function pulls the user object specified by user_id from the Mongo insta.user collection, then formulates a “pull tags****” network protocol command, with the tag specified by tag, for each user in the user’s following field. This function calls the client.c program to execute each network protocol command.

View Profile

def view_profile(user_id)

This function pulls all dispatches for which the user specified by user_id is the parent, all dispatches in which the given user is tagged, and the user’s user object. This is accomplished by sequentially formulating “pull child***,” “pull user_tag,” and “pull user****” network protocol commands. Each command is executed by the client.c program.

Add User Cass

def add_user_cass(user_id, username, ip)

This function adds a new user or updates a user entry in the Cassandra insta.user table by calling the add_user.c program. The add_user.c program simply calls keyspace_table_init to initialize the keyspace and table insta.user if they do not already exist, and calls add_user to insert or update the given user entry.

Generating User Actions

All of the Python functions used to programmatically generate user actions can be found in create_user_action_csv.py, which consists of one function, create_csv, and several helper functions.

Create CSV

def create_csv(user, lines, post_ids, usernames)

This function takes a dictionary user representing a user object and populates a csv with random user actions, beginning with inserting a user object. The second argument, lines, refers to the number of user actions to be generated. The third argument, post_ids, is a list of all dispatch_ids corresponding to posts created across all users, maintained as a global variable; the idea is that the csv files for all users are generated sequentially in main. The last argument, usernames, is a list of the usernames of all users whose actions are being generated. Possible user actions are specified in the options variable, hardcoded as [‘message’, ‘feed’, ‘profile’, ‘search_tags’, ‘lookup’]. These options are chosen at random with the respective probabilities specified in the probability variable as [.1, .7, .1, .05, .05]; these probabilities can be changed in experiments to represent different distributions of user actions. For each line to be generated, a random action is selected and the corresponding helper function is called. All dispatches and user updates use the same image path, as the data size matters more than the data content of the image fields. All arrays in the csv are space-delimited strings.

Write Message Row

def write_message_row(csvfile, user, overlap, words, post_ids)

This function takes as arguments a handle csvfile for writing to a csv, a dictionary user containing a user object, an array overlap of user_ids representing the intersection of the followers and following sets for the given user object, an array words of common words to be used as test data for the tags field of dispatches, and the post_ids list defined above. This function performs a weighted random choice of options similar to create_csv, but here the options are the types of dispatches as specified in the write_dispatch method of petepics.py.

● If the type chosen is “post,” the function writes a line of the csv indicating a post with a random sampling of between 0 and 4 tags and a random sampling of between 0 and 4 user tags chosen from the following field of the user object. The dispatch_id of the post is appended to post_ids.

● If the type chosen is “comment,” the function writes a line of the csv indicating a comment on a post chosen randomly from post_ids.

● If the type chosen is “message,” the function first chooses between a group message and a direct message. In the case of a group message, between 0 and 10 users are chosen from the overlap array; in the case of a direct message, a single random user_id is chosen from the overlap array. A line is then written to the csv with the appropriate user_ids specified in the audience field.

● If the type chosen is “user_tag,” a random user_id is chosen from the user’s following field, and a random dispatch_id is chosen from post_ids. A line is then written to the csv with the appropriate user_tag and the selected dispatch_id as parent_id.

Split Tags

def split_tags(tags)

This is a helper function that reformats arrays of user_tags and tags as the space-delimited strings used in the build_user and build_dispatch functions.

Write Feed Row

def write_feed_row(csvfile, user)

This function writes a row to the csv representing a feed update for a user specified by user. ​

Write Profile Row

def write_profile_row(csvfile, user)

This function selects a random user_id from the union of the given user’s following and ​ ​ ​ ​ followers arrays and writes a row to the csv representing viewing the profile of the ​ selected user.

Write Search Tags Row

def write_search_tags_row(csvfile, user, words)

This function chooses a random tag from the words array and writes a row to the csv ​ ​ representing a search for the selected tag by the given user.

Write Lookup Row

def write_lookup_row(csvfile, user, usernames)

This function chooses a random username from the usernames array to search for. The chosen user may be one whom the given user neither follows nor is followed by.

Main

Main should contain an initially empty array of all post_ids, a list of all usernames to be used, and a dictionary for each user. For each user object, simply call create_csv with the specified number of lines to write.

Executing User Actions from a CSV

To utilize the programmatically generated testing data, we built a Python application called csv_reader.py. This program takes the path of a csv file generated by the create_csv function. Once given a path, csv_reader opens the csv file and reads it line by line, executing user actions based on the initial command string located at the beginning of each line. The following command strings are accepted as valid, and lead to the execution of their corresponding Python wrapper functions, which mirror user actions:

‘message’ - a line beginning with the action string ‘message’ results in a call to the petepics.py function write_dispatch.

‘user’ - a line beginning with the action string ‘user’ results in a call to the petepics.py function write_user.

‘feed’ - a line beginning with the action string ‘feed’ results in a call to the petepics.py function update_feed.

‘profile’ - a line beginning with the action string ‘profile’ results in a call to the petepics.py function view_profile.

‘search_tags’ - a line beginning with the action string ‘search_tags’ results in a call to the petepics.py function search_hashtags.

‘lookup’ - a line beginning with the action string ‘lookup’ results in a call to the petepics.py function lookup_user.

If any line of the csv has a valid action string followed by an incorrect number of arguments for the corresponding petepics function, csv_reader prints an error statement and returns. This premature return is justified because it would indicate a bug in the csv creation software that might be pervasive beyond the single line on which csv_reader failed.

Future Work

Encryption: When working on a project that holds user privacy as one of its ultimate long-term goals, it follows that the network communication and data storage aspects of the project should implement some form of encryption. During the course of our work, we were only able to implement relatively low-level, concise network protocols that send data over the network as unencrypted text. While this was suitable for our immediate needs (building a system that functions as a proof of concept and basic testing framework), it must be addressed during the development of a user-facing application. Encryption and user privacy must go hand in hand in a system that holds data protection as one of its core principles.

Privacy and Permissions: One of the key long-term goals of our project was to work towards a platform that gives users privacy for their personal data. The first facet of privacy we had in mind was privacy from corporations who profit off of the user data they host; creating a decentralized platform was the principal step towards achieving this notion of privacy. During the course of our work, however, we did not have the opportunity to explore and implement another key element of digital privacy -- privacy from other users. Contemporary social networks place a large emphasis on outward privacy. The ability to control who sees personal data and posted content has become a deeply ingrained user expectation of social media platforms like Instagram and Facebook. To satisfy this expectation and our own goals of user privacy, we initially designed our platform to allow for the implementation of user-defined privacy. The audience and audience_size fields of the dispatch object were our first attempt at defining user privacy on our platform. Our intention was to use those two fields to specify whether a dispatch was to be viewable by all followers, a specific subset of users, or any user. Within our querying functions, though, we decided not to perform the checks necessary to verify that the querying user has “permission” to view the dispatches they are requesting. We elected to forego this aspect of privacy because we had not yet created meaningful functionality to enforce social relationships. Under our current design there is no functionality for requesting to follow another user, or for denying such a request, which implies that there are no protocols in place to prevent one user from seeing another user’s information. This is an area that requires further thought and development, which may or may not take advantage of the groundwork we laid in the initial designs of our data structures and base functionality. Beyond the lack of user privacy resulting from the limited functionality surrounding user-defined relationships, we have also not yet implemented a method for updating the lists of followers and following. On contemporary social network platforms, there typically exists some form of mutual agreement that one user will enter a relationship with another. For example, on the Instagram platform a user who has a private account can be found by other users, but other users must request to follow that user before they can access any of their personal data (aside from a name and thumbnail picture) or posts. Under our current implementation we have not yet devised a similar mutual procedure. Our system currently requires that all interacting users have their platform running simultaneously to interact with one another. This, coupled with our limited ability to actively test use cases that rely on the cooperative action of multiple nodes, led us to postpone the implementation of this functionality.

Mirroring, Bloom Filters, and Fragmentation:

“Premature optimization is the root of all evil”
-Campbell Boswell
-Rowen Felt
-Peter Johnson
-Donald Knuth

Another primary area of future work stems from the potential for optimization in our system. To satisfy the expectations of users, the system must be optimized for fault tolerance and lookup efficiency. Under our current implementation, a user’s data is stored only on the node indicated by their associated IP address in the Cassandra database. Under these conditions, if a node goes offline or changes IP addresses without updating Cassandra, there is no way for any other user to find its data. This is an extreme limitation that could be resolved by mirroring a given node’s data on another node: storing a user’s data on multiple other host nodes such that at least one complete copy of the data is always available. Mirroring a node could be achieved using a variety of strategies, ranging from storing a complete duplicate of a user’s Mongo collection on another user’s node to fragmenting sets of dispatches and user data across a range of other nodes. It is also worth considering how much control a user should have over where their data is mirrored -- whether on the nodes of their friends, their followers, or a random set of users. Testing these different strategies and gauging their viability and efficiency is one of the most critical and necessary fields of future work on this project.

We also recognize that the strategies and actions of our primary functions can be improved by reducing the amount of redundant data they communicate. To understand the scope of this issue, consider the current duplication handling in our user operations. When a user requests to view another user’s profile, the view_profile function requests all of the dispatches for which the given user is the parent, as well as all the dispatches in which the given user is tagged. Under these conditions, the requesting user will almost necessarily receive a large amount of data already contained in their database, and all duplicate entries are reinserted upon arrival, resulting in an enormous amount of redundant network and database traffic. One solution we considered is to use a bloom filter to store a concise record of which dispatches a user has already seen. A bloom filter is a compact, probabilistic data structure that tests whether a given item is already a member of a set: several different hashes of each item, taken modulo the filter size, select the indexes to set in a bit field. Using this method, a bit array of only a few kilobytes can test whether a given item is a member of a set with tens of thousands of entries, where each entry may be of arbitrary size (a minimal sketch of such a filter follows at the end of this section). Using this data structure, each user could send a bloom filter as an argument of profile pull requests to limit network traffic.

The last optimization we considered is user fragmentation. The vast majority of users on social media platforms have a relatively modest number of followers and create an insubstantial amount of content. However, the presence of superusers, such as social celebrities and politicians, would place an inoperable level of strain on our server design. We therefore propose that the data of superusers be fragmented identically across a number of users who connect with a much smaller number of followers. These fragmented users would then operate their own servers, either in parallel on the superuser’s home node or on other mirror nodes. Content created by the superuser would be pushed to all of the superuser’s fragmented servers, where it could be further dispersed to the superuser’s large number of followers. While we included the fragmentation field in the user and dispatch object definitions, the field is currently ignored by all functions, and there is no functionality in place for fragmenting a user object over multiple servers or nodes.
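To make the bloom filter idea concrete, here is a minimal, self-contained sketch in C. The filter size, the number of hash functions, and the splitmix64-style mixer are illustrative choices, not tuned parameters from our system.

#include <stdint.h>

#define FILTER_BITS (8 * 1024 * 8)  /* an 8 KB bit field */
#define NUM_HASHES  4               /* hash functions per item */

static uint8_t filter[FILTER_BITS / 8];

/* A simple 64-bit mixer (splitmix64-style); varying the seed yields a
 * family of hash functions over dispatch_ids. */
static uint64_t mix(uint64_t x, uint64_t seed) {
    x += seed + 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Record a dispatch_id as seen: set one bit per hash function. */
void bloom_add(uint64_t dispatch_id) {
    for (uint64_t i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = mix(dispatch_id, i) % FILTER_BITS;
        filter[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 1 if dispatch_id is possibly in the set (false positives are
 * possible), or 0 if it is definitely not. */
int bloom_maybe_contains(uint64_t dispatch_id) {
    for (uint64_t i = 0; i < NUM_HASHES; i++) {
        uint64_t bit = mix(dispatch_id, i) % FILTER_BITS;
        if (!(filter[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}

In a future iteration, a serving node could skip any dispatch whose dispatch_id tests positive against the filter sent with the pull request, trading a small false-positive rate for a large reduction in redundant traffic.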