
Université catholique de Louvain

Louvain School of Engineering

Computing Science Engineering Department

Designing an elastic and scalable application

Promoter: Pr. Peter Van Roy
Readers: Pr. Marc Lobelle, Boris Mejías

Master thesis presented for the obtention of the grade of master in computer engineering, option networking and security, by Xavier De Coster and Matthieu Ghilain.

Louvain-la-Neuve

Academic year 2010 - 2011

Acknowledgments

The Bwitter team would like to thank Pr. Peter Van Roy for his help and insightful comments. We also want to thank Boris Mejías for his guidance and availability during the whole project. We thank Florian Schintke, member of the Scalaris development team, for his help during our analysis of Scalaris and the numerous answers he provided to our questions. We also thank Quentin Hunin for his support and constructive feedback during the last few weeks of writing. Finally, we would also like to thank our families, our friends and our girlfriends, Inès and Lorraine, for their unconditional support and encouragement.

Abstract

The amount of traffic on web-based social networks is very difficult to predict. In order to avoid wasting resources during low-traffic periods or being overloaded during peak periods, it would be interesting to adapt the amount of resources dedicated to the service. In this work we detail the design and implementation of our own social network application, called Bwitter. Our first goal is to make Bwitter's performance scale with the number of machines we dedicate to it. Our second goal is linked to the first: we want to make Bwitter elastic, so that it can react to flash crowds by adding resources to handle the load, without suspending its services. To achieve the desired scalability and elasticity, Bwitter is implemented on a scalable key/value datastore with transactional capabilities running on the Cloud. In our tests we study the behaviour of Bwitter using the Scalaris datastore, with both running on Amazon's Elastic Compute Cloud. We show that the performance of Bwitter increases almost linearly with the number of resources we allocate to it. Bwitter is also able to improve its performance significantly in a matter of minutes.

Contents

Part I: The Project

1 Introduction
  1.1 Social networks
  1.2 Scalable Data Stores
  1.3 The Cloud
  1.4 The Bwitter project
    1.4.1 Twitter
    1.4.2 Bwitter
    1.4.3 Contributions
  1.5 Roadmap

2 State-of-the-art
  2.1 Scalable datastores
    2.1.1 Key/value Stores
    2.1.2 Document Stores
    2.1.3 Extensible Record Stores
    2.1.4 Relational Databases
  2.2 Peer-to-peer systems
  2.3 DHT
  2.4 Study of scalable key/value stores properties
    2.4.1 Network topology
    2.4.2 Storage abstraction
    2.4.3 Replication strategy and consistency model
    2.4.4 Transactions
    2.4.5 Churn
    2.4.6 Security
  2.5 The Cloud
  2.6 Conclusion

3 The Architecture
  3.1 The requirements
    3.1.1 Non-Functional requirements
    3.1.2 Functional requirements
    3.1.3 Conclusion
  3.2 Architecture
    3.2.1 Open peer-to-peer architecture
    3.2.2 Cloud Based architecture
    3.2.3 The popular value problem
    3.2.4 Conclusion

4 The Datastore
  4.1 The datastore choice
    4.1.1 Identifying what we need
    4.1.2 Our two choices
  4.2 General Design
  4.3 Design of the datastore
    4.3.1 Key uniqueness
    4.3.2 Push approach design details
    4.3.3 The Pull Variation
    4.3.4 Conclusion
  4.4 Running multiple services using the same datastore
    4.4.1 The unprotected data problem
    4.4.2 Key already used problem
    4.4.3 Conclusion

5 Algorithms and Implementation
  5.1 Implementation of the cloud based architecture
    5.1.1 Open peer-to-peer implementation
    5.1.2 First cloud based implementation
    5.1.3 Final cloud based implementation
  5.2 Nodes Manager
  5.3 Scalaris Connections Manager
    5.3.1 Failure handling
  5.4 Bwitter Request Handler
    5.4.1 The push approach
    5.4.2 The pull approach
    5.4.3 Theoretical comparison of Pull and Push approach
  5.5 Conclusion

6 Experiments
  6.1 Working with Amazon
    6.1.1 Choosing the right instance type
    6.1.2 Choosing an AMI
    6.1.3 Instance security group
    6.1.4 Constructing Scalaris AMI
  6.2 Working with Scalaris
    6.2.1 Launching a Scalaris ring
    6.2.2 Scalaris performance analysis
  6.3 Bwitter tests
    6.3.1 Experiment measures discussion
    6.3.2 Push design tests
    6.3.3 Pull scalability test
    6.3.4 Conclusion: Pull versus Push
  6.4 Conclusion

7 Conclusion
  7.1 Further work

Part II: The Annexes

8 Beernet Secret API
  8.1 Without replication
    8.1.1 Put
    8.1.2 Delete
  8.2 With replication
    8.2.1 Write
    8.2.2 CreateSet
    8.2.3 Add
    8.2.4 Remove
    8.2.5 DestroySet

9 Bwitter API
  9.1 User management
    9.1.1 createUser
    9.1.2 deleteAccount
  9.2 Tweets
    9.2.1 postTweet
    9.2.2 reTweet
    9.2.3 reply
    9.2.4 deleteTweet
  9.3 Lines
    9.3.1 addUser
    9.3.2 removeUser
    9.3.3 allUsersFromLine
    9.3.4 allTweet
    9.3.5 getTweetsFromLine
    9.3.6 createLine
    9.3.7 deleteLine
    9.3.8 getLineNames
  9.4 Lists
    9.4.1 addTweetToList
    9.4.2 removeTweetFromList
    9.4.3 getTweetsFromList
    9.4.4 createList
    9.4.5 deleteList
    9.4.6 getListNames

10 The paper

Part I

The Project


Chapter 1

Introduction

The web 2.0 offers many new services to the users of the Internet. They can now share, generate and upload content online faster and more easily than ever before. All those services require computing, bandwidth and storage resources. Predicting the required amount of those resources can be tricky, especially if a service wants to avoid wasting them while at the same time being able to face high usage peaks. We are going to take a closer look at the scalability and elasticity of perhaps the most famous of those web 2.0 services, namely social networks.

1.1 Social networks

Social networks such as Facebook and Twitter are an increasingly popular way for people to interact and express themselves. Facebook, for instance, has 600 million active users [6]. People can now create content and easily share it with other people. Social networks have become a means of communication in their own right, used by politicians, artists and brands to easily reach large communities and promote themselves or their products. They also allow people to quickly organise social events like barbecues, or even nationwide revolutions like what happened in Tunisia [21, 40] or Egypt [16]. Social networks are also a powerful tool to communicate during natural disasters. Twitter and Facebook were very useful to find updates from relatives and friends when the mobile phone networks and some of the telephone landlines collapsed in the hours following the magnitude 8.9 earthquake in Japan. The US State Department even used Twitter to publish emergency numbers [35]. Other examples are the Haiti [18] and Chile [9] earthquakes, which were covered in real time thanks to social networks, with photos sent out to the rest of the world directly via Twitter. It is thus critical that social networks do not crash when their users need them the most. However, the servers of those social networks can only handle a given number of simultaneous requests; if there are too many requests, the servers become overloaded. A typical result of overloading is Twitter suspending its services and displaying the "Fail Whale" shown in Figure 1.1.

Figure 1.1: "Lifting a Dreamer", aka the Fail Whale, illustration by Yiying Lu, displayed when Twitter is overloaded.

Avoiding overload efficiently is a tricky problem, as the load is related to many social factors, some of which are impossible to predict. For instance, we want to be able to handle the large number of people sending Christmas or New Year wishes, but also those reacting to natural disasters. This is why we turn towards scalable and elastic solutions, allowing the system to add and remove resources on the fly in order to fit the required load. Social networks are also platforms where users share personal information, destined to be seen only by some specific peers. Other personal information, such as contact details, is sometimes stored in the system too. More and more users are beginning to worry about who ultimately has access to this information and what can be done with it. It is thus important to have a system that is secure and enforces the privacy of the end user.

1.2 Scalable Data Stores

The web 2.0 called for a different kind of database than the previous Relational Database Management System (RDBMS) solutions. It needed data stores able to host huge amounts of data and handle many parallel requests at the same time. There are now numerous scalable and elastic storage solutions answering this demand. These scalable data stores make it possible to store increasingly more data and to handle more requests as we allocate more resources to them, because they have been built to share their load across the different machines allocated to them. Those data stores also have elastic properties, allowing them to add or remove resources to gracefully upscale or downscale without having to be rebooted. This elasticity is crucial in order to upscale to face sudden increases in traffic, but also to downscale when the hype is over in order to avoid wasting resources.

As our work revolves around the scalability and elasticity of social network applications, we are bound to work with those scalable data stores. Many different kinds of scalable data stores exist, and we present them in our state-of-the-art in Chapter 2.

1.3 The Cloud

The cloud is a phenomenon that is hard to ignore these days, as most web applications tend to rely on it in order to provide their services. The cloud refers to on-demand resources such as storage, bandwidth and processing power, but also to on-demand services such as mail or word processing [2]. Computation can thus be transferred from the users' machines, as was the case in the past, to the machines forming the cloud. This allows users to have machines with very little computational resources or storage but still be able to execute heavy calculations or store huge amounts of data. A typical analogy for the usage of cloud resources is the usage of public utilities such as water or electricity. Specialised companies provide those services at a fraction of the price it would cost us if we needed to deploy and maintain all the required infrastructure ourselves. The cloud is thus the ideal platform to use if we do not want to invest in costly hardware and maintenance. This is especially true if we do not know beforehand whether our service will be successful. We can start small and only pay for a small amount of resources. If our service is popular we can easily grow by requesting more resources and thus paying a higher price. But if our service does not manage to attract many people, we have not wasted our money investing in powerful servers. Furthermore, the resources the cloud offers are elastic, meaning you can increase or decrease them on the fly and only pay for the amount you really need to keep your service going. We are going to use the scalability and elasticity properties of the cloud during our work, which is why we detail it further in our state-of-the-art in Chapter 2.

1.4 The Bwitter project

Bwitter is a lighter version of Twitter, the famous social network. Some readers might be unfamiliar with Twitter, so we will introduce it briefly before going further with the description of Bwitter.

1.4.1 Twitter

Twitter is a micro-blogging system that allows users to post small text messages of 140 characters called tweets. An enormous number of tweets is posted each day: according to Twitter themselves [33], 177 million tweets were posted in March 2011, and the record is 6,939 tweets per second, reached 4 seconds after midnight in Japan on New Year's Day.

Users can choose to display the messages of other users they find interesting in their line by following them. In Figure 1.2, you can see the home screen of Twitter with user Zulag (aka Xavier De Coster, co-author of this Master thesis) logged in; the messages of the users he follows are displayed as a stream in his "Timeline".

Figure 1.2: Home screen of Twitter’s web interface.

Twitter offers additional functionalities beyond this message posting: for instance, a user can reply to or retweet (share) any message he wants. He can also address a message directly to another user by starting his message with "@destinationUser". The main difference of Twitter, and now also Google+1, in comparison to other social networks such as Facebook is the asymmetry of the social connections. This means that the connection does not go in both directions: a user A can follow a user B without user B having to follow user A. This is unlike the Facebook system, where two users become "Friends" and automatically see each other's updates. This behaviour encourages Twitter to be used as a place where fans can follow their favourite stars, to such a point that 10% of the users account for 90% of the traffic [17]. Tweets can also be tagged by users using hashtags. The tweets containing a hashtag are automatically added to a group of tweets associated with this hashtag.

1 https://plus.google.com/, last accessed 13/08/2011

1.4.2 Bwitter

We decided to develop Bwitter as an elastic and scalable social network application and to study how it behaves when faced with flash crowds and heavy traffic. Bwitter is an open source version of Twitter based on a scalable data store and developed to run on a highly elastic cloud architecture. Bwitter thus presents functionalities similar to Twitter's. We chose Twitter because it is one of the more basic social networks and because it is now incredibly famous. The data store used by Bwitter is a key/value store, which we present in detail in Chapter 4. Bwitter was designed so that other services could run on this data store without interfering with each other's data. Bwitter is developed in multiple loosely coupled layers, allowing for maximal modularity. We added an optional cache layer on top of the key/value data store in order to maximise performance. Bwitter manages the cloud machines on which it runs as well as the data store nodes, restarting them when needed. During the implementation we took advantage of existing and proven technologies, leading to an efficient and robust implementation.

1.4.3 Contributions

The main contributions of this work are:

• Design of a scalable social network for microblogging.
• Improvement of Beernet's API.
• Helping to improve the bootstrapping of Scalaris and studying its behaviour on the Amazon Elastic Compute Cloud.

During the development of Bwitter we identified some potential improvements to one of the datastores we were using, namely Beernet [19, 23], and designed a new API allowing users to protect and manage the rights to the stored data. This new API supporting secrets is now implemented and supported in Beernet version 0.9. In order to further understand the behaviour of Bwitter, we did performance tests with the Scalaris [29] data store on Amazon's Elastic Compute Cloud (EC2), testing its scalability and elasticity. We also studied the impact of the machine resources, the number of parallel requests and conflicting operations on Scalaris' performance. During our discussions with the developers of Scalaris, we helped them locate an instability in the booting of their system. Ultimately, we implemented two different designs for Bwitter and tested both on Amazon's EC2, showing very good scalability and elasticity properties. During the course of the development we presented a demo of our project at the Beernet stand of the "Foire du Libre", held on the 6th of April at Louvain-la-Neuve2.

2“Foire du Libre” is a fair celebrating open source software and organised by the Louvain-li-nux: http://www.louvainlinux.be/foire-du-libre/, last accessed 05/08/2011

We have also co-written an article, along with Peter Van Roy and Boris Mejías, entitled "Designing an Elastic and Scalable Social Network Application", in which we detail some of the observations and design decisions developed in this master thesis. This article, which can be found in Chapter 10 of our annexes, has been accepted for The Second International Conference on Cloud Computing, GRIDs, and Virtualization3, organised by IARIA and held from the 25th to the 30th of September 2011 in Rome, Italy.

1.5 Roadmap

We start with our state-of-the-art in Chapter 2, where we discuss the different technologies we used and explored during the development of Bwitter, such as scalable data stores and cloud services.

We then identify the main requirements of our project and discuss the general architecture of Bwitter in Chapter 3. We explain why we chose to base it on the cloud instead of letting it run in the wild on an open peer-to-peer system. In this chapter we also explain how a cache could solve potential problems due to values being too popular.

The next step is to take an in-depth look at the data store we are going to use, in Chapter 4. We detail our main objectives in terms of data representation and explain how we decided to store the different data abstractions we use in our data store. We also take a look at how we can avoid conflicts between two different applications using the same data store.

We detail the different modules composing the Bwitter system in Chapter 5, highlighting their purpose and the main algorithms developed to implement them. We also compare in more depth the two different approaches, push and pull, for our application's most crucial functions: posting and reading tweets. We end this chapter with a global overview of the implemented architecture and detail how the different modules fit together.

We carry on with a series of experiments in Chapter 6. We start by testing Scalaris and measuring the impact of a few chosen parameters on its performance, scalability and elasticity. We then continue by measuring the performance, scalability and elasticity of Bwitter and compare the results for the push and pull approaches.

We finish this master thesis with a conclusion in Chapter 7, where we reflect on the achieved work, the lessons learned and the possible further improvements that could be made to our application. In the annexes you will find the new API we designed for Beernet, the API of Bwitter and a section for our mathematical demonstrations.

3CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011

Chapter 2

State-of-the-art

In this section we take a look at the relevant technologies that could be useful to the Bwitter project. We start with the different existing scalable datastores, in order to decide which kind is most appropriate for our application. From there, we take a closer look at peer-to-peer systems and their look-up performance, and further study the properties of Distributed Hash Tables (DHT). Finally, we give an overview of the different services the cloud has to offer.

2.1 Scalable datastores

We start our state-of-the-art with a section about scalable datastores. As our application is going to rely heavily on a datastore, it is important to understand the different kinds that are available today, as well as their pros and cons [7]. There are several kinds of scalable datastores available, each with their own specificities, but four main classes can be put forward: Key/value Stores, Document Stores, Extensible Record Stores and Relational Databases. We are going to compare the functionalities they provide and the way they achieve scalability. Most of those datastores do not provide ACID properties, but BASE properties. ACID stands for Atomicity, Consistency, Isolation, Durability, and BASE stands for Basically Available, Soft state, Eventually consistent. This eventual consistency is often said to be a consequence of Eric Brewer's CAP theorem [29], which states that a system can have only two out of the three following properties: consistency, availability, and partition-tolerance. Most of the scalable datastores decide to give up consistency, but some of them opt for more complex trade-offs.

2.1.1 Key/value Stores

These are the simplest kind of datastore: they store values at user-defined indexes called keys and behave as hash tables. They are very useful if you need to look up objects based on only one attribute; otherwise you might want to use a more complex datastore. Some key/value stores provide key/set abstractions, allowing multiple values to be stored at a single key. Key/value stores all support insert, delete, and lookup operations, but they also generally provide a persistence mechanism and additional functionalities such as versioning, locking and transactions. Replication can be synchronous or asynchronous; the second option allows faster operations, but some updates may be lost on a crash and consistency cannot be guaranteed. Their scalability is ensured through key distribution over the nodes, and some present ACID properties. In conclusion, this solution, by its simplicity, allows the system to scale easily. But every rose has its thorn: this simplicity comes at the cost of poor data structure abstractions. Notable examples are Scalaris, Riak, Voldemort, Redis and Beernet.
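The following sketch (plain Python, with names of our own choosing; it is not the API of Scalaris, Beernet or any of the stores just cited) models this abstraction as a small in-memory class, including the optional key/set variant offered by some stores.

```python
class KeyValueStore:
    """Minimal in-memory model of the key/value abstraction (illustrative only)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Storing under an existing key overwrites the previous value.
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

    # Some stores (Beernet, Redis, OpenDHT) additionally expose key/set pairs:
    def add_to_set(self, key, value):
        self._data.setdefault(key, set()).add(value)


store = KeyValueStore()
store.put("user:zulag:name", "Xavier De Coster")
store.add_to_set("user:zulag:followers", "matthieu")
print(store.get("user:zulag:name"))
```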

Figure 2.1: Data organisation in a key/value datastore.

2.1.2 Document Stores

These systems store documents and index them. A document can be seen as an object with attribute names that are dynamically defined for each document at runtime. Those attributes are not necessarily predefined in a global schema, unlike, for instance, in SQL, which imposes defining the schema beforehand. Moreover, those attributes can be complex, meaning that nested and composite values are allowed. It is possible to explicitly define indexes to speed up searches. Replication is asynchronous in order to increase the speed of the operations. Often, scalability is ensured by reading only one replica, thus sacrificing strong consistency, but some document stores, like MongoDB, can obtain scalability without that compromise. MongoDB allows splitting parts of a collection across several nodes in order to increase scalability instead of relying on replication. This technique is called sharding.

Figure 2.2: Data organisation in a document store.

A popular abstraction, called domain, database, collection or bucket depending on the document store, is often provided to allow the user to group documents together. Users can query collections based on multiple attribute-value constraints. Document stores are useful to store different kinds of objects and to make queries on attributes those objects share. Other notable examples are CouchDB, SimpleDB and TerraStore.
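To make the schema-free model concrete, here is a toy sketch, in plain Python, of a document collection and of a query on shared attributes; the documents and the find helper are purely illustrative and not tied to MongoDB, CouchDB or any other product.

```python
# Toy document collection: each document is a dict with its own attributes,
# no global schema is imposed, and values may be nested.
employees = [
    {"_id": 1, "name": "Alice", "address": {"city": "Brussels", "country": "BE"}},
    {"_id": 2, "name": "Bob", "email": "bob@example.org"},
]

def find(collection, **constraints):
    """Return the documents whose top-level attributes match all constraints."""
    return [doc for doc in collection
            if all(doc.get(attr) == value for attr, value in constraints.items())]

print(find(employees, name="Alice"))
```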

2.1.3 Extensible Record Stores

These systems, also known as wide column stores, and probably motivated by Google's success with BigTable, store extensible records. Extensible records are hybrids between tuples, which are simple rows of relational tables with predefined attribute names, and documents, which have attribute names defined on a per-record basis. Indeed, extensible record stores have families of attributes defined in a global schema, but inside these families new attributes can be defined at run-time. The extensible record store data model relies on rows and columns that can be partitioned vertically and horizontally across nodes to ensure scalability. Rows are split across nodes based on the primary key; usually they are grouped by key range and not randomly. Columns of a table are distributed across nodes based on user-defined "column groups", which regroup attributes that are usually best stored together on the same node. For instance, all the attributes of an employee concerning his address (address, city, country) will be placed in one column group, and all the attributes concerning the means of contacting him (email, phone number, fax number) will be stored in another column group. Like document stores, extensible record stores are useful to store different kinds of objects and to make queries on shared attributes. Moreover, they can provide higher throughput, at the cost of a bit more complexity for the programmer when defining the column groups. Notable examples are HBase, Cassandra and HyperTable.
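A small sketch of how the extensible-record model can be pictured: the column families are fixed by a global schema, while the columns inside a family are created per row at run time. The layout and helper below are our own illustration, not the API of HBase, Cassandra or HyperTable.

```python
# Column families are fixed by the global schema; columns inside a family are per-row.
COLUMN_FAMILIES = ("address", "contact")

table = {}  # row key -> {column family -> {column -> value}}

def put(row_key, family, column, value):
    if family not in COLUMN_FAMILIES:
        raise ValueError(f"unknown column family: {family}")
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

put("employee:42", "address", "city", "Louvain-la-Neuve")
put("employee:42", "contact", "email", "someone@example.org")
# All 'address' columns of a row belong to the same family and are stored together.
print(table["employee:42"]["address"])
```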

Figure 2.3: Data organisation in an extensible record store.

2.1.4 Relational Databases

These systems store, index, and query tuples via the well-known SQL interface. They offer less flexibility than document stores and extensible record stores, because tuples are fixed by a global schema defined during the design of the database. Moreover, the classical relational database model is not well suited for scalability [29]. There are several proposed solutions to scale the database [13], but they all suffer from disadvantages. A classical solution is to use a master/slave approach to distribute the work: the slaves handle the reads and the master server is responsible for the writes. The first drawback is eventual consistency: each slave has its own copy of the data, and even if we normally have near real-time replication, we do not have the strong consistency which is sometimes needed. The second immediate drawback is that the master server quickly becomes a bottleneck when the amount of writes increases. Cluster computing solutions improve on this by using the same data for several nodes, but with only one node responsible for writing. They thus provide strong consistency, but the bottleneck problem remains. Finally, the shared-nothing architecture, introduced by Google [10], should scale to an infinite number of nodes, because each node shares nothing at all with the other nodes. In this approach, each node is responsible for a different part of the database and has its own memory, disk and CPU. In order to divide the database, which is sometimes called sharding the database, we split the tables into several non-overlapping tables and dispatch these tables to different shards, which thus share nothing, so that the load is divided between them. Usually the cutting of the tables is done horizontally. This means that different rows are assigned to different shards given a partition criterion [39] based on the value of a primary key. The partition criterion can be range partitioning (the shard is responsible for a range of keys), list partitioning (the shard is responsible for a given list of keys) or hash partitioning (the hash of the key determines the shard responsible for the key). To achieve redundancy each shard is replicated; in MySQL Cluster [20], for example, each shard is replicated twice. But to implement this solution correctly, several challenges have to be solved. In particular, how do we partition the data into multiple non-overlapping shards with the load fairly divided between them? The answer to this question is closely related to the application area. The splitting is natural if, for example, the table to split contains data for American and European people, but in most cases it can be quite tricky.
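The three partition criteria mentioned above can be summarised in a few lines; the shard count, key ranges and country list below are arbitrary illustrative choices, not values taken from MySQL Cluster or any other system.

```python
import hashlib

NUM_SHARDS = 4

def shard_by_hash(primary_key: str) -> int:
    """Hash partitioning: the hash of the key determines the responsible shard."""
    digest = hashlib.sha1(primary_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def shard_by_range(primary_key: int) -> int:
    """Range partitioning: each shard owns a contiguous range of keys (here 1000 keys)."""
    return (primary_key // 1000) % NUM_SHARDS

EUROPEAN = {"BE", "FR", "DE"}

def shard_by_list(country_code: str) -> int:
    """List partitioning: the shard is chosen from an explicit list of key values."""
    return 0 if country_code in EUROPEAN else 1

print(shard_by_hash("user:zulag"), shard_by_range(2421), shard_by_list("BE"))
```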

Figure 2.4: In a relational database data can be subdivided and accessed via fixed fields.

Those relational databases have presented improved horizontal scalability, provided that the operations do not span many nodes. While they are not as scalable as the previously mentioned datastores, they might be in the near future. The appeal of relational databases is obvious: they have a well-established user base and support from their community, which means there are already multiple existing tools ready to be used with them. Furthermore, they have ACID properties, which generally makes life easier for the programmer. Notable examples are MySQL Cluster, VoltDB, Clustrix, ScaleDB, ScaleBase and NimbusDB.

2.2 Peer-to-peer systems

We also decided to take a close look at peer-to-peer (P2P) systems. They are an interesting alternative to classical client/server systems because they allow a more efficient use of resources like bandwidth, CPU and memory. This is because every peer is equivalent in the application and has a dual client/server role, and can therefore serve content like a classical server, sharing the load between the members of the network. Moreover, because of this dual role, the availability of the content increases with the network size, encouraging scalability of the system, which is a property we are very much interested in. P2P systems also have the crucial property that they have neither a central point of failure nor a central point of coordination, which often becomes a bottleneck when the system needs to grow. These properties are extremely important in distributed computing because they increase the robustness of the system as well as its scalability. There are three main categories of P2P systems [31], which vary according to their topologies and their look-up performance. The first and oldest relies on a central index maintaining a mapping between the file references and the peers holding the files. This index is managed by central servers that provide the look-up service. This contradicts what we just said about peer equivalence and implies that this generation is not a true peer-to-peer system. A peer wanting to access some file must first connect to this server to find the peers responsible for this data, and then it can connect to the peer holding the data. This is shown in Figure 2.5. This is the solution developed by Napster, the famous file-sharing system.

Figure 2.5: P2P system relying on a central index to look up files: A) Searching Node (0) asks the Central Server (CS) where it can find a given file. B) The CS gives the address of node 3 to node 0. C) Node 0 retrieves the file directly from node 3.

The second category does not rely on any server to perform queries and has an unstructured topology. The connections between the peers in the network are therefore established arbitrarily. In this category of P2P systems, there is no relation between a node and the data for which it is responsible. It follows that the look-up mechanism must be a flooding-like mechanism. In Gnutella, the flooding algorithm has a limited scope in order to limit the number of messages exchanged. Therefore, it can happen that a value present in the network is not found: a query may not reach the peer holding the value because the flooding diameter was too small. This is illustrated in Figure 2.6.

Figure 2.6: P2P system using flooding to look up files: A) Searching Node (0) floods the network with a request for a file B) A query reaches node 2 which hosts a corresponding file and responds directly to 0. Note that if a query has a time to live of 2 and if nodes 1, 2 and 3 host a corresponding file, only nodes 1 and 2 will respond to 0 as 3 is too far away from 0.

In order to provide look-up consistency, the flooding diameter must be N, with N being the number of peers in the network; however, this would not scale in large systems. In order to resolve this problem, the third generation of P2P systems changed from an unstructured to a structured topology, drastically improving look-up performance. Distributed hash tables (DHT) are the most frequent abstraction used by P2P systems with a structured topology. We take a closer look at them in the next section.

2.3 DHT

DHTs were designed in order to solve the look-up problem present in many P2P systems [3]. They provide the same operations to store and retrieve key/value pairs as a classical hash table. A key is what identifies a value; the value is the data you want to associate with this key. As an example, consider a movie named "Why DHTs are fun.avi" and the actual file containing the movie: the key would logically be the title of the film and the value is the file. Each peer in a DHT system can handle key look-ups and key/value pair storing requests, avoiding the bottleneck of central servers. Another problem addressed by those systems is the division of the responsibility for the key/value pairs between the peers. Each key/value pair and each peer has an identifier. The identifier domain can be anything; taking the example of a Chord-like DHT, the identifier is an integer between 0 and N, where N is a chosen parameter. Those identifiers are used to determine which peers are responsible for which key/value pairs. Each peer has an interval it is responsible for; this interval is computed based on its identifier and the other peers in the network. Taking the example of Chord again, a peer is responsible for all the identifiers between its own identifier and the identifier of the next peer in the network, the latter not included. A peer stores all the key/value pairs with an identifier in its interval. The identifiers are most often computed using a consistent hash function. Assuming each peer has an associated IP address, its identifier is computed by applying this function to its IP address. Some systems allow a peer to choose its identifier. The identifier of a key/value pair is computed by taking the hash of its key. The use of a consistent hash function to compute identifiers allows a roughly fair division of the key space between peers, which is a crucial point for scalability. Moreover, this kind of hash function has the advantage that adding a peer to the system does not cause a lot of identifiers to be remapped to other peers, which improves the elasticity of the system. DHTs, as said in the peer-to-peer section, are the third generation of peer-to-peer systems. Compared to the previous generation, they mainly solve the scalability problems of the look-up mechanism. Indeed, we now have a relation between the key of a value and a peer, which permits better look-up performance by routing the look-up request to the responsible peer instead of flooding the network, which was not scalable.
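To fix ideas, the sketch below shows one way this mapping of keys to peers can be computed, assuming a small identifier space, SHA-1 as the consistent hash function, and the responsibility rule described above (a peer owns the identifiers from its own up to, but not including, its successor's). It illustrates the principle only and is not the code of any particular DHT.

```python
import hashlib

ID_SPACE = 2 ** 16  # size of the identifier space (toy value)

def identifier(name: str) -> int:
    """Consistent hash of a peer address or of a key into the identifier space."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % ID_SPACE

# Peer identifiers obtained by hashing their IP addresses.
peers = sorted(identifier(addr) for addr in ["10.0.0.1", "10.0.0.2", "10.0.0.3"])

def responsible_peer(key: str) -> int:
    """A peer owns the identifiers from its own id up to, but not including,
    the id of the next peer on the ring (with wrap-around)."""
    kid = identifier(key)
    owners = [p for p in peers if p <= kid]
    return owners[-1] if owners else peers[-1]

print(responsible_peer("Why DHTs are fun.avi"))
```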

2.4 Study of scalable key/value stores properties

Bwitter is built on top of a key/value datastore. Key/value datastores are systems that implement a DHT and offer other services on top of it. We now compare possible design choices when implementing systems offering a DHT abstraction. The comparison is based on the following criteria: consistency model, replication strategy, storage abstraction, network topology, churn, transactional support and finally security.

2.4.1 Network topology

The network topology refers to how the peers are organised in the network, and there may be important differences between the different DHT implementations. It is also a crucial design point, because it deeply influences the performance of the look-up mechanism as well as the fault tolerance of the network. We will take a look at some important network topologies. In Chord-like topologies [30], nodes are organised in a ring (see Figure 2.7) and keep a list of successors and predecessors as well as a routing table, which is filled with fingers chosen according to various policies. We call a finger a reference to another peer in the system; usually it is the IP address of this peer. The size of the routing table varies among systems: Chord keeps log2(N) fingers, where N is the number of nodes in the system, while DKS, which is a generalization of Chord, keeps logk(N) fingers, where k is a predefined constant. This is a trade-off between better look-up performance and bigger routing tables. We summarize the most common choices in Table 2.1. Each Chord node also keeps log2(N) successors in its successor list, in order to recover from node failures. This topology is widespread because it allows efficient routing as well as easy self-organisation upon joins, leaves and failures.

Beernet's topology is similar to Chord's but differs in one crucial point. In Chord, nodes must be connected with their direct predecessor; in Beernet they only need to know the key of their predecessor, creating a branch when a node cannot join its direct predecessor. This property is the reason why the topology of Beernet is called the relaxed ring (see Figure 2.7). Indeed, when a node does not have the link towards its predecessor the ring is not perfect. This topology is more resistant because it makes fewer assumptions while preserving consistent look-up. You can find more information about Beernet's topology in [19].

Scalaris currently relies on a Chord topology too. The Scalaris team is currently working on adopting another Chord-like topology called Chord#, which is very much like classic Chord except that it stores keys in lexicographical order. Furthermore, the routing is not done in the key space but rather in the node space. This allows range queries and allows the application to choose where to place the data in the ring [32].

Number of fingers        Look-up performance
O(1)                     O(n)
O(log(n))                O(log(n)/log(log(n)))
O(log(n))                O(log(n)) (more common)
O(√N)                    O(1)

Table 2.1: Number of fingers versus Look-up performances for N nodes in the network.
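As a complement to the table, the classical Chord rule for choosing fingers can be written down explicitly. The sketch below only computes the identifiers a node aims its fingers at, assuming an m-bit identifier space; finding the peers actually responsible for those identifiers is the look-up mechanism itself.

```python
M = 16  # bits of the identifier space, so identifiers live in [0, 2**M)

def finger_targets(node_id: int) -> list:
    """Chord aims finger i at identifier (n + 2**i) mod 2**M, for i = 0 .. M-1."""
    return [(node_id + 2 ** i) % (2 ** M) for i in range(M)]

# Each successive finger doubles the distance covered on the ring, which is what
# yields O(log N) routing hops with O(log N) routing-table entries.
print(finger_targets(1000)[:5])
```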

Chord, as well as Beernet, does not take advantage of the underlying physical topology. Pastry [28], Tapestry and Kademlia [15] also assume a circular key space, but try to tackle this problem by keeping a list of nodes which they can reach with low latency. They choose their fingers giving preference to nodes in that list.

We finally detail the topology of CAN [24], because it differs significantly from the topology of the other DHTs. Nodes are organised so that they divide a virtual d-dimensional Cartesian coordinate space. Each node is responsible for a part of this space. In order to join the network, a node, which we call A, chooses a random point in the space. It then contacts the node responsible for this point, called B. Finally, B splits its zone in two, giving the half of the zone it was responsible for to A. Nodes only maintain routes towards their immediate neighbours. In CAN, two nodes are neighbours if their zones touch along d − 1 dimensions. To picture this, imagine a square (2 dimensions) divided into rectangular chunks, which correspond to zones; two nodes would be neighbours if the rectangles they are responsible for have an edge in common. According to the results in [24], for a d-dimensional space partitioned into n zones, the average routing path length is (d/4) × n^(1/d) hops, and the average path length is thus O(n^(1/d)). You can observe that the average path length decreases as the number of dimensions increases, but this comes at the cost of higher space complexity for maintaining the routing tables. Moreover, each join and leave becomes more costly as the number of dimensions increases: the number of neighbours of a node increases, and thus the complexity of maintaining routing table consistency grows. However, the topology is not linked with the physical topology of the nodes. You can see an example of how nodes are organised in CAN for d = 2 in Figure 2.7, where each rectangle represents a zone controlled by a node.

Figure 2.7: From left to right, the ring overlay (Chord), the relaxed ring overlay (Beernet) and a 2-dimensional CAN overlay.

2.4.2 Storage abstraction

As mentioned before, key/value stores allow all the operations provided by classical hash tables on key/value pairs, namely look-up, store and delete operations. To be clear, a key is uniquely associated with a value: storing another value with the same key will erase any previously stored value. Beernet, Redis [25] and OpenDHT [26] additionally allow working with key/set pairs, where each key can be associated with a set that can contain multiple values; a look-up on a key associated with a set thus returns all the values in the set. OpenDHT only works with key/set pairs, which leads to more complex algorithms for the applications using it.

15 2.4.3 Replication strategy and consistency model

In order to provide redundancy, these systems often provide replication services. Those vary according to the guarantees they offer: improved reliability of the system and/or availability. Replication is done by storing a value at k different nodes instead of only one; k is called the replication factor. Beernet and Scalaris offer symmetric replication with strong consistency, using a transactional layer built on top of their DHT implementation. Strong consistency means that read operations always return the latest correctly written value; this is achieved by always writing to and reading from a majority of the replica set. In symmetric replication [12], each node identifier is associated with a set of (k − 1) other node identifiers, which we call the replica set. When using replication, a key/value pair is stored at the node responsible for the identifier of "key" and at all the nodes which are responsible for an identifier inside the replica set. Nodes maintain routes towards the nodes with "symmetric identifiers" so that they can directly contact any of the replicas of the key/value pairs they are responsible for. Strong consistency between replicas does not come for free: each time a value is accessed, a majority of the replicas must be contacted. Thus, in such a scheme, it is not possible to increase the availability of the content through replication. We address this problem in section 3.2.3. Beernet currently does not handle the restoration of the replication factor when a node fails abruptly.

CAN does not have consistency problems because it works with immutable content, meaning that values cannot be updated. This is a clear limitation when implementing a social network where updates are frequent. CAN proposes replication through what they call realities. A node, when joining the network, joins r coordinate spaces and is in charge of a different zone in each space; each coordinate space is called a reality. When a key/value pair is added, it is added in all the realities. Therefore, because the nodes are in charge of different zones in different realities, different nodes are in charge of the newly added pair. To create these realities, a different hash function is applied to map the node to different coordinates in each reality. This strategy, like every strategy relying on different hash functions, has two major drawbacks compared to symmetric replication [12]. First, the inverse of the hash function is not computable. Therefore, it is not possible to recover the original key before hashing, while it is needed to fetch the value from the remaining replicas. Moreover, because of the distribution properties of hash functions, and even if it were possible to find the inverse, the other replicas would be spread all over the remaining nodes. This would force the node in charge of restoring the replication factor to contact a multitude of nodes. In conclusion, because we cannot find the inverse of the hash function, the replication degree of pairs decreases at each node failure.

Pastry uses a different approach based on leaf sets, which is close to the successor set approach. As for CAN, Pastry assumes that values are immutable, so there is no problem of consistency between the replicas, but this comes at the cost of no updates of values. Pastry stores the replicas at the nodes that have the closest ids with respect to the value's key. So if the replication factor is k, you have k/2 replicas before and after the key. In

the successor set approach, all the replicas are stored at the k successors of the key. This strategy allows the replication factor to be maintained, because it is possible to find the other replicas, contrary to the CAN strategy. But the algorithms to maintain the replication factor are expensive compared to the cheap symmetric replication strategy [12].
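Symmetric replication as used here boils down to one formula: with an identifier space of size M and a replication factor k, the replica set of identifier i is {(i + j·M/k) mod M : j = 0 … k−1}. A small sketch with toy numbers:

```python
ID_SPACE = 2 ** 16   # size of the identifier space (toy value)
K = 4                # replication factor

def replica_set(identifier: int) -> list:
    """Symmetric replication: k identifiers spaced evenly around the ring."""
    step = ID_SPACE // K
    return [(identifier + j * step) % ID_SPACE for j in range(K)]

# Any member of the set can recompute the whole set from its own identifier,
# which is what allows a surviving replica to restore the replication factor.
print(replica_set(12345))
```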

2.4.4 Transactions

While not many key/value datastores offer the possibility to do transactions, it is a crucial feature. "A transaction is a group of operations that have the following properties: atomic, consistent, isolated, and durable (ACID)"1. A transaction can have two outcomes: abort or commit. When a transaction commits, we can be sure that all the operations inside the transaction have been done successfully. On the other hand, if a transaction aborts, we know none of the operations have been done. We know of only two key/value datastores that implement transactions: Beernet and Scalaris. Transactions are usually achieved using a two-phase commit (2PC) algorithm. The two phases are the validation phase and the write phase. Both phases are supervised by a Transaction Manager (TM), while all the nodes responsible for the involved items become Transaction Participants (TP). During the validation phase, the TM tries to lock the involved resources on every TP. If the TM receives an abort message, the operation is aborted. Otherwise, the TM sends a commit message to all the TPs, making the update permanent and releasing the locks.

Figure 2.8: Two-Phase Commit protocol (left) reaching termination and (right) not reaching termination, image taken from [19].

A serious problem could arise if the TM fails during this operation: the locks would not be released, as you can see in Figure 2.8. This is why some systems, such as Beernet and Scalaris, decided to add replicated Transaction Managers (RTMs) that can take over in case the TM fails. This transaction algorithm is based on the Paxos consensus algorithm. Beernet adds a phase to the 2PC algorithm before registering the locks. In the first phase, the client, who is the original TM, does read and write operations without taking any locks. In a second phase, and before committing the transaction, it registers

1 MSDN, what is a transaction? http://msdn.microsoft.com/en-us/library/aa366402(VS.85).aspx, last accessed 13/08/2011.

with a group of replicated transaction managers that can, as said before, take over the transaction if the main TM fails. It then does the prepare phase of the 2PC and sends a message to all the TPs in order to take the locks on the required items. The TPs send the result of the transaction to each of the RTMs, which then send their results to the main TM. The TM can then decide to commit or abort the transaction if a majority of the RTMs have voted the same way. When the TM has taken its decision, it sends a final message to the TPs so that they can release the locks. This algorithm is said to be eager because modifications are done optimistically, before requesting any lock, in the read phase. The algorithm assumes that the majority of the TPs and the TM survive during the transaction. You can find more details about this algorithm in Jim Gray and Leslie Lamport's article "Consensus on transaction commit" [14].
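A much simplified sketch of the plain 2PC flow described above, with a single transaction manager, no replicated transaction managers and no Paxos, only to make the prepare/commit message exchange explicit:

```python
class Participant:
    """A transaction participant (TP): a node holding one of the involved items."""

    def __init__(self, name):
        self.name = name
        self.locked = False    # lock on the local item
        self.prepared = False  # did this TP vote commit for the current transaction?

    def prepare(self) -> bool:
        # Validation phase: try to lock the local item and vote commit or abort.
        if self.locked:
            return False       # item already locked -> vote abort
        self.locked = True
        self.prepared = True
        return True

    def commit(self):
        # Write phase: make the update permanent, then release the lock.
        self.locked = False
        self.prepared = False

    def abort(self):
        # Nothing was written yet; just release the lock if this TP took it.
        if self.prepared:
            self.locked = False
            self.prepared = False


def two_phase_commit(participants) -> str:
    """The transaction manager (TM) side of plain 2PC."""
    votes = [tp.prepare() for tp in participants]
    if all(votes):
        for tp in participants:
            tp.commit()
        return "commit"
    for tp in participants:
        tp.abort()
    return "abort"


print(two_phase_commit([Participant("tp1"), Participant("tp2")]))  # -> commit
```

As Figure 2.8 illustrates, the weakness of this basic form is precisely that a crash of the single TM between the two phases leaves the locks held, which is what the replicated transaction managers address.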

2.4.5 Churn

We define churn as John Buford, Heather Yu and Eng K. do in "P2P Networking and Applications" [5]: churn is "the arrival and departure of peers to and from the overlay, which changes the peer population of the overlay". This is important for DHTs that want to have good elastic properties in order to handle high rates of churn.

Let us take a look at how a classical Chord-like DHT handles the joining of nodes. As in any peer-to-peer network, a joining node needs to know how to contact a node already in the network. It first contacts this node, which routes it towards the node responsible for inserting it into the ring. This last node is the successor in the ring of the joining node, thus the node whose identifier follows the identifier of the joining node. There are now two steps to perform to enter the ring: contact the successor to warn it that its predecessor has changed, and contact the predecessor to warn it that the joining node is its new successor. This is not robust, as a failure of the nodes, or a networking problem that prevents the new node from reaching its predecessor, can create a broken ring. Beernet solved this problem with its relaxed ring, as explained when discussing the network topology in section 2.4.1. It adds a phase to this protocol during which the joining node signals to the successor node that it has correctly contacted the predecessor. The successor can then remove its pointer to the old predecessor, as you can see in Figure 2.9. Therefore, this algorithm maintains look-up consistency and tolerates network failures. After joining the ring, the new node has to retrieve the key/value pairs it is responsible for. It can do so by contacting its successor, which was in charge of those values before.

When a node wants to leave the ring, the opposite operations are done. If it is a gentle leave, the node sends the data to the nodes now responsible for the values it hosted, and it tells its neighbours to update their pointers. However, if it is an abrupt leave, the other nodes have to detect the absence of the node and have to execute more complex algorithms to find the remaining nodes responsible for the data the missing node hosted. This operation varies a lot according to the replication strategy, as explained in the replication strategy point. In any case, it is a heavy and complex operation that should be avoided if possible by leaving the network gently.

Figure 2.9: The join algorithm: A) Q contacts the successor R. B) R accepts the insertion and replies with P's address; R now considers Q as its predecessor but keeps P in its predecessor list; Q contacts the predecessor P. C) Q tells P that it is its new successor and P accepts it. D) Q tells R the insertion was successful and R drops P from its predecessor list. Image taken from [19].

It is thus obvious that, while those mechanisms ensure the survivability of the system in an environment where nodes can fail or disconnect abruptly, the performance is going to be better if the nodes use gentle leaves.

2.4.6 Security

There are numerous known attacks against DHT-based systems [34]. Many DHTs are able to work under the assumption that the number of malicious nodes stays lower than a certain fraction f of the total number of nodes. In the case of Sybil attacks, a malicious user inserts many malicious nodes into the system in order to go over that limit. Once the attacker has enough malicious nodes in the system, it can easily interfere with the routing and replication algorithms. In the case of Eclipse attacks, a malicious node can "eclipse" a correct node by manipulating all the neighbours responsible for pointing to that node so that they skip it, meaning no one can access it anymore. Those attacks can lead to routing and storage disruption if malicious nodes work together to deny requests or to return different values than the ones expected. Assuring security in such systems when they are running in open environments such as the Internet is thus a tough challenge. Note that those attacks are possible only if the DHT accepts nodes from untrusted users. While most of the DHT-based systems we know do not currently provide such a security level, we have good reasons to believe those issues are being worked on. Still, we need to keep those issues in mind when designing our architecture. If a malicious user has access to the datastore, he can also try to delete, edit, or forge data, causing damage to the application using that data. Those attacks are generally

avoided by using capability-based security. The idea is that if the attacker does not know where to look, he will not be able to find the data, as it is stored at unguessable keys. OpenDHT goes even further and offers a secret mechanism, allowing users to associate a secret with a given value. If anyone wants to delete that value, he has to provide that secret. Note that in OpenDHT you cannot replace a value by another, as multiple values can be stored at a given key. Doing a put on a key will thus only add the value to the set.
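The two ideas of this paragraph, unguessable keys and OpenDHT-style delete secrets, can be sketched as follows; the function names and the storage layout are ours and purely illustrative.

```python
import hashlib
import os

store = {}  # key -> (value, hash of the delete secret)

def put_with_secret(value):
    """Store a value at an unguessable key; return (key, secret) as capabilities."""
    key = os.urandom(16).hex()      # unguessable key: knowing it grants access
    secret = os.urandom(16).hex()
    store[key] = (value, hashlib.sha256(secret.encode()).hexdigest())
    return key, secret

def delete(key, secret):
    """Deletion succeeds only if the presented secret matches the stored hash."""
    entry = store.get(key)
    if entry is None:
        return False
    if hashlib.sha256(secret.encode()).hexdigest() != entry[1]:
        return False
    del store[key]
    return True

key, secret = put_with_secret("private draft")
print(delete(key, "wrong secret"), delete(key, secret))  # -> False True
```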

2.5 The Cloud

Bwitter is destined to run on the cloud in order to take advantage of its scalable and elastic nature. Everyone has heard about the cloud, but ultimately many different definitions exist, so we make explicit here the definition of the cloud that we are going to use throughout this work. We use the National Institute of Standards and Technology (NIST) definition of cloud computing [22]:

"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models."

The five essential characteristics mentioned are on-demand self-service, broad network access, resource pooling, rapid elasticity and measured service. On-demand self-service means that the users can adjust the amount of resources whenever they need to, without having to go through a service provider's employee. Broad network access means that the resources can be accessed from a broad range of mechanisms or devices. Resource pooling means that the resources of the provider can be assigned and reassigned to different clients dynamically in order to meet the clients' requirements in the most effective way. Rapid elasticity means that resources can be allocated and removed in a transparent way in order to meet the amount of resources needed by the client. Measured service means that the resources provided are monitored and can be recorded transparently.

The three service models are Cloud Software, Cloud Platform and Cloud Infrastructure as a Service (SaaS, PaaS, IaaS). In the SaaS case, the client has access to software running on the cloud, but has no access to the underlying cloud infrastructure; the application can usually be accessed via a web browser. In the PaaS case, the client is able to deploy applications and manage them on the cloud infrastructure, but does not manage the infrastructure itself. Finally, in the IaaS case, the client can manage the basic resources such as network, processing power and storage, and can furthermore deploy applications and manage them. These models are compared in Figure 2.10. The four deployment models are private, community, public and hybrid cloud. A

Figure 2.10: The three service models compared to a classic model, image taken from [8].

private cloud is owned by an organisation and used only by it, unlike a community cloud, which is shared between a few selected organisations. Those solutions may provide better privacy than a public cloud maintained by an organisation that sells its services to end users or other organisations. A hybrid cloud is a combination of at least two clouds that remain different entities but are bound together in order to allow data and application portability.

2.6 Conclusion

In this chapter we have explored the different types of scalable datastores. We studied the DHTs more deeply, and more particularly the ones offering transactions. The technological advancements in those fields make it possible to build an efficient and robust implementation of a Twitter-like system on top of a peer-to-peer system, taking advantage of their assumed scalability and elasticity properties. In the next chapter we describe two possible architectures for Bwitter.

Chapter 3

The Architecture

In this section we are going to present the architecture of our application. The platform on which an application is based can have an important impact on its architecture. We thus explore the repercussions of having the application run either on a peer-to-peer network based on the users' machines, or on a stable cloud-based platform. The two solutions lead to two radically different architectures in terms of performance and accessibility of the different layers, as well as in terms of security concerns. But before developing the architecture, we take a look at the different requirements of our application, both functional and non-functional.

3.1 The requirements

Bwitter is designed to be a secure social network based on Twitter, and while it looks relatively simple at first sight, it hides some complex functionalities. We included almost all of those functionalities in Bwitter and decided to add some others. We describe here the relevant functionalities that will help us analyse the design of the system, highlight the differences between a centralised and a decentralised architecture, study the feasibility of overcoming the problems described above, and test the system's behaviour when faced with heavy traffic and flash crowds.

3.1.1 Non-Functional requirements

Product requirements

• Scalability: We are facing a system that is continuously growing in terms of users [4] and traffic [33]. It is thus crucial that our system’s performance increases almost linearly with the number of machines we allocate to it; this is known as horizontal scalability. We are not interested here in vertical scalability, i.e. adding or removing resources (CPU, RAM, disk) from an individual machine, as it is harder to achieve dynamically and usually more costly than horizontal scaling.

• Elasticity: As we explained, the capacity of a social network application must be able to vary in real time to follow its load. Such applications must sometimes face high peaks of demand for short periods, but do not need the corresponding amount of resources the rest of the time. A fixed number of nodes is therefore inefficient: to handle peaks of load, one has to over-provision the data center. This is why our system needs to be able to scale up when demand is high and to scale down easily when the peak is over, to avoid wasting resources.

• Fault tolerance, availability and integrity: The system has to be fault tolerant, meaning that even if some machines fail the system as a whole is still able to function. The integrity of the data and the availability of the service also have to be ensured, as they are major requirements of every social network.

• Security: Bwitter must ensure authenticity, integrity and confidentiality of the data posted by users over the whole system. No malicious user should be able to forge, edit or delete data in the system. Finally, Bwitter must forbid access to confidential data such as passwords. These requirements must hold even with Bwitter’s code released as open source.

• Lightness of the application: The end user should only need a fast and light interface performing little computation. The goal is to be as portable as possible, so that smartphones and other devices with less computing power can also use our application. This implies that the heavy computations should be done on the server side.

• Performance: We need good performance for a large number of small reads and writes. Indeed, small values are frequently read, written and updated in social network applications.

Organizational requirements

• Modularity: Our project should be built from different modules, and it should be possible to easily replace one layer by another based on clearly defined interfaces. For instance, the graphical user interface (GUI) module could be desktop based or web based and the main application should not see any difference.

• Open source: We want our project to be released in the wild with its source code available to anyone wanting to experiment with it. This also means that the libraries we use in the development of our system should be open source.

• Use existing technologies: We do not want to re-invent everything on our own, so we decided to rely on existing open source tools during our development.

3.1.2 Functional requirements

Nomenclature

There are only a few core concepts on which our application is based:

• A tweet is basically a short message with additional meta information. It contains a message of up to 140 characters, the author’s username and a timestamp of when it was posted. If the tweet is part of a discussion, it keeps a reference to the tweet it answers and also keeps references to the tweets that are replies to it (a minimal class sketch of these objects is given after this list).

• A user is anybody who has registered in the system. A few pieces of information about the user are kept in the datastore, such as his complete name and the MD5 hash of his password, used for authentication.

• A line is a collection of tweets and users. The owner of the line can define which users he wants to associate with the line. The tweets posted by those users are from then on displayed in this line. This allows a user to have several lines, each with its own theme and associated users.
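
To make this nomenclature concrete, the following sketch shows how the three concepts could be represented as plain Java objects before being serialised (see Section 4.2). Class and field names are illustrative assumptions, not the exact schema of our implementation.

import java.util.ArrayList;
import java.util.List;

// Illustrative data holders only; field names are assumptions, not the thesis' exact schema.
class Tweet {
    String message;          // at most 140 characters
    String author;           // username of the poster
    long   postedAt;         // timestamp (GMT, second precision)
    String inReplyTo;        // key of the parent tweet, null if this is not a reply
    List<String> replies = new ArrayList<>(); // keys of the tweets answering this one
}

class User {
    String username;         // unique identifier in the system
    String realName;
    String passwordMd5;      // MD5 hash of the password, used for authentication
}

class Line {
    String name;             // chosen by the owner, e.g. "coolpeople"
    String owner;            // username of the line's owner
    List<String> members = new ArrayList<>();   // usernames whose tweets feed the line
    List<String> tweetRefs = new ArrayList<>(); // references to tweets displayed in the line
}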

Basic operations

There are many different social networks today, and while each has its own particularities, a few core operations to share, publish or discuss content are almost always present. Based on our own use of social networks and on Twitter’s functionalities, we identified a restricted set of operations our social network has to support.

• Post a tweet: A user can publish a message by posting a tweet. The application posts the tweet in the lines to which the user is associated. This way all the users following him have the tweet displayed in their lines.

• Retweet a tweet: When a user likes a tweet from another user, he can decide to share it by retweeting it. This has the effect of “sending” the retweet to all the lines to which the user is associated. The retweet is displayed in the lines as if the original author posted it, but with the retweeter’s name indicated.

• Reply to a tweet: A user can decide to reply to a tweet. Replying adds a reference to the reply inside the initial tweet. Additionally, a reply keeps a reference to the tweet to which it responds. This makes it possible to rebuild the whole conversation tree.

• Create a line / a list: A user can create additional lines / lists with custom names to regroup specific users / tweets.

• Add and remove users from a line: A user can associate a new user to a line; from then on, all the tweets this newly added user posts will be included in the line. A user can also remove a user from a line: he will then no longer see the tweets of this user in his line and will not receive his new tweets either. Note that if a user re-adds a previously removed user, the tweets posted while that user was still associated to the line will re-appear.

• Add and remove a tweet from a list: A user can store a tweet in a list to be able to retrieve it easily later. The user can also decide later to remove this tweet from the list.

• Read tweets: A user can read the tweets of a line in packs. The size of those packs is a parameter; for example, we can decide to retrieve tweets in packs of 20. He can also refresh a line or a list to retrieve the tweets that have been posted since his last refresh.

3.1.3 Conclusion

We have just presented the requirements and functionalities of our application. The most important requirements are scalability, elasticity, availability and security. The next section details the two possible architectures we elaborated based on these requirements.

3.2 Architecture

As previously mentioned, we now present two different scalable architectures for our application. In both architectures, our application is decomposed into three loosely coupled layers, as shown in Figure 3.1: from top to bottom, the Graphical User Interface (GUI), Bwitter, which handles the operations described in section 3.1.2, and the scalable datastore. The datastore is distributed amongst multiple nodes that we call datastore nodes. In the next chapter, we present Beernet and Scalaris, the two datastores that we have used.

Figure 3.1: Comparison of the architectures. Left) Cloud based architecture. Right) Open peer-to-peer architecture.

This architecture is very modular: each layer can be changed as long as it respects the API of the layer above. We now have to decide where the datastore will run. We have two options: either let the datastore nodes run on the users’ machines, or run them on the cloud. These lead to two radically different architectures: the open peer-to-peer architecture and the cloud-based architecture. In both architectures we try to achieve a secure solution, as building an insecure application would not be realistic. Indeed, if a malicious user could reveal personal information or steal someone’s identity, our application would be both pointless and dangerous. We finally compare the two architectures based on the requirements elaborated in the previous section.

3.2.1 Open peer-to-peer architecture

In a fully decentralised architecture, the user runs a datastore node and the Bwitter application on his machine. The Bwitter application makes requests directly to this local datastore node. Ideally this local datastore node should not be restricted to the Bwitter application, but should also be accessible to other applications. The problem with this approach is that the user can bypass protection mechanisms enforced at a higher level by accessing the datastore’s low level functions. Usually this is not a problem, as untrusted users would not know at which key the data is stored and so cannot compromise it. But in our case, the data has to be at known keys so that the application can dynamically retrieve it. This means that any user understanding how our application works would be able to delete, edit or forge lines, users, tweets and references. This would be a security nightmare.

We tried to tackle this problem with the secret mechanism we designed to enrich Beernet’s interface, which is presented later. But while it prevents users from editing or deleting data they did not create themselves, we could not prevent them from forging elements. To avoid this we need a way to authenticate every piece of data posted by a user. This could be done by enforcing authentication at the datastore level, but this is a feature that is not always provided. We could also do this at the application layer. Indeed, assuming that each user has a public and private key pair, we could authenticate all the data posted using asymmetric cryptography. However, this would require a cryptographic operation for each read and write. It would also force users to store their private and public keys either on the datastore, on their local machine, or a mix of both. A possible solution would be to have users store their public key in the datastore at a public location, so anyone needing the public key can retrieve it easily. The private key of a user would be stored at a private location that only he can find back, for example using a key that is the hash of his password concatenated with his username. Additionally, a sealed local cache could be maintained on the user’s machine containing his private key and the public keys of all the users with whom he has contacts. This cache avoids the constant reloading of all the needed keys each time the user wants to use the application. Furthermore, public keys are values that seldom change. If a cryptographic problem is encountered while using a key from the cache, the key is reloaded from the datastore in order to avoid problems due to cache corruption or a public key changed by its owner.

Even with those mechanisms in place, we have to enforce security at the datastore level. Beernet uses encryption to communicate between different nodes to avoid leaking confidential information. But anyone could add modified Beernet nodes behaving maliciously. Aside from the usual attacks presented in our state-of-the-art, a corrupted node could be modified to reveal all the secrets inside the requests going through it. Scalaris faces the same problem, as its code is widely available too. We thus have to make sure that the code running the datastore node is not modified, so we need a mechanism that enforces remote attestation as described in [38]. This can be done by using a Trusted Platform Module (TPM) [37], which provides cryptographic code signatures in hardware, on the users’ machines in order to be able to prove to other datastore nodes that the client’s node is trustworthy. Until a datastore node has a way to tell for sure that it can trust another datastore node, we are in a dead end. This is especially true for Beernet’s new secret mechanism described in section 4.4.1, as anyone stealing the secret of another user can erase any data posted by that user.

Assuming that a Twitter session is short, there could be a problem if our application is the only one running on top of our datastore. Indeed, it would result in nodes frequently joining and leaving the network with a short connection time. Each of those changes in the topology of our datastore modifies the keys for which the nodes are responsible and triggers key/value pair reallocations, leading to an important and undesirable churn. This would not be an ideal environment for a DHT.
Furthermore, as we saw in the state-of-the-art, DHT based datastores such as Beernet and Scalaris are still exposed to attacks such as Sybil and Eclipse attacks if they accept malicious nodes.

In our requirements we stated that the system has to be fault tolerant and that the integrity of the data must be preserved. The integrity of the data is guaranteed thanks to the replication at the datastore level. Because this environment is not stable, we need a higher replication factor than usual. The impact is twofold. First, peers are responsible for more keys, worsening the already important churn. Secondly, each transaction involves more peers, which degrades the overall performance of the system. In conclusion, this solution has the advantage of providing free computing power that automatically grows with the number of users. But scalability, elasticity and security are compromised due to the lack of control over the machines and to the difficulty of controlling direct access to the datastore by users. We now take a look at the alternative architecture based on the cloud.

3.2.2 Cloud Based architecture

With this architecture the Bwitter and datastore nodes run on a cloud platform. A Bwitter node is a machine running Bwitter and generally also a datastore node. This solution offers good elastic properties assuming we have an efficient cloud service, meaning that we can quickly obtain machines ready for use. We can thus add or remove Bwitter and datastore nodes to meet the demand, optimizing our use of the machines. This solution also allows us to keep a stable DHT, as nodes are not subject to high churn as was the case in the first architecture we presented. Hence, a lower replication factor is acceptable, which should boost the performance. Moreover, communications between nodes should be much quicker in a cloud infrastructure than between nodes spread over the world, which in turn increases performance. Finally, all the nodes are managed by us, so no Eclipse or Sybil attacks are possible in this case.

Using this solution we do not have all the security issues we had with the open peer-to-peer architecture. Indeed, the users no longer have direct access to the datastore nodes, but have to go through a Bwitter node, which limits their possible actions to the operations defined in section 3.1.2. Furthermore, the communication channel between the GUI and the Bwitter nodes can guarantee the authenticity of the server and the encryption of the data being transmitted, for instance using HTTPS. Bwitter requires users to be authenticated to modify their data, which provides data integrity and authenticity. For instance, Bwitter does not permit a user to delete a tweet that he did not post, or to post a tweet using the username of someone else. The malicious revelation of user secrets due to a corrupted node is not relevant anymore as the datastore is fully under our control.

The cloud based architecture is more secure and more stable, and offers obvious advantages for scalability and elasticity. This is why we have finally chosen to implement this solution; we now take a closer look at how the layer stack is built.

The lowest layer, the datastore, runs on the cloud and is hidden from the outside, which means no user can access it directly; all the attacks targeting the datastore are thus avoided. Indeed, all the accesses to the datastore are done via Bwitter. This layer is monitored in order to detect overload and, taking advantage of the cloud, datastore nodes are added and removed on the fly to meet the demand.

The intermediate layer, Bwitter, also runs on the cloud and communicates with the datastore nodes and the GUIs. A Bwitter node is connected to several datastore nodes. It has an internal load balancer that dispatches work fairly over the datastore nodes. The load balancer is the Scalaris Connection Manager (SCM) that we present in section 5.3 of the implementation chapter. In practice, the Bwitter nodes are not accessible directly; they are accessed through a fast and transparent reverse proxy that splits the load between Bwitter nodes. We also designed a module that runs in parallel with the SCM and that we call the Node Manager (NM). It is responsible for the bootstrapping of the ring as well as adding nodes if needed. However, we do not have any module responsible for deciding when a new node should be launched. The Bwitter nodes offer a REST-like (Representational State Transfer [27]) API to the higher layer. This means, among other things, that they are completely stateless, which is important because it improves the clarity of the code and makes it easier to produce bug-free code. Being stateless means that the application does not have to keep information for each client. It can thus scale more easily with the number of clients and also allows requests from the same client to be dispatched to different nodes, removing the burden of managing sessions. Some values can be frequently accessed in a social network, so a caching system is crucial to achieve decent performance. We thus decided to add a cache at this level in order to reduce the load on the datastore. We go into more detail about the cache in the next section. A similar cache mechanism in the decentralized architecture would not be useful. Indeed, the advantage of the cache is that it contains values that are likely to be accessed by several users; if there is only one user accessing it, the gain will probably be very small.

The top layer is the GUI; it connects to a Bwitter node using a secure connection channel that guarantees the authenticity of the Bwitter node and encrypts all the communications between them. Multiple GUI modules can, of course, connect to the same Bwitter node. The GUI layer is the only one running on the client machine.

3.2.3 The popular value problem

Describing the problem

Given the properties of our datastores, both based on DHTs, a key/value pair is mapped to f nodes, where f is the replication factor, chosen according to the desired redundancy level. This implies that if a key is frequently requested, the nodes responsible for it can be overloaded while the rest of the network is mostly idle. Therefore, adding additional machines is not going to improve the situation. It is not uncommon on Twitter to have wildly popular tweets that are retweeted by thousands of users. In the worst cases, retweets can be seen as an exponential phenomenon, as all the users following the retweeter are susceptible to retweet it too.

The solution: use an application cache

Adding nodes does not solve the problem because the number of nodes responsible for a key/value pair does not change. In order to reduce the number of requests reaching those nodes, we have decided to add a cache with a Least Recently Used (LRU) replacement strategy at the application level. This cache keeps the last values read. We keep, associated with each key/value pair in the cache, a timestamp indicating the last time the value was read. When we face a cache miss, we evict from the cache the pair that has the oldest timestamp. This solves the retweet problem because the application has the tweet in its cache from the first request to read the popular tweet onwards. This tweet stays in the cache because users frequently request it. This way we reduce the load on the nodes responsible for the tweet and automatically increase the availability of popular values.

We have to take into account that values are not immutable: they can be deleted and modified. It is thus necessary to have a mechanism to “refresh” the values inside the cache. A naive solution would be to do active polling on the datastore to detect changes to the key/value pairs stored in the cache. This would be quite inefficient, as there are several values, like tweets, that almost never change. In order to avoid polling, we need a mechanism that warns us when a change is made to a key/value pair stored in the cache. The datastore must thus allow an application to register to a key/value pair and to receive a notification when this value is updated. Our application cache then registers to each key/value pair that it currently holds, and when it receives a notification from the datastore indicating that a pair has been updated it updates its corresponding replica. This mechanism has the big advantage of removing unnecessary polling requests. Notifications are asynchronous, so the replicas in the cache can have different values at a given moment, leading to an eventually consistent model for the reads. It is still possible to bypass the cache if strong consistency is needed, but this is application dependent. On the other hand, writes do not go through the cache but directly to the datastore, which keeps strong consistency for the writes inside the datastore. This is an acceptable trade-off, as we do not need strong consistency for most of the reads in Bwitter. For example, it is not a problem to see a deleted tweet in the line of a user for a small period of time.

Beernet, as described in [19], offers such a notification mechanism, making it possible to design an efficient eventually consistent cache. Scalaris however does not provide such a feature, so we needed another solution to avoid active polling. We decided to use a time to live of one minute for the values in the cache, meaning that a value is removed from the cache one minute after it was first read. This way any value read from the cache is at most one minute out of date, which is not a problem.
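
As an illustration, a minimal version of such an application cache could look as follows in Java, combining LRU eviction with the one-minute time to live used with Scalaris. The class and its loader interface are sketches under our own naming, not the actual Bwitter implementation.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal LRU cache with a time to live (illustrative sketch only).
class AppCache<K, V> {
    private static final long TTL_MS = 60_000;    // one minute, as chosen for Scalaris
    private final int capacity;
    private final Function<K, V> loader;           // fetches the value from the datastore on a miss
    private final Map<K, Entry<V>> entries;

    private static final class Entry<V> {
        final V value;
        final long loadedAt;
        Entry(V value, long loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    AppCache(int capacity, Function<K, V> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // An access-ordered LinkedHashMap gives us LRU eviction of the least recently read pair.
        this.entries = new LinkedHashMap<K, Entry<V>>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > AppCache.this.capacity;
            }
        };
    }

    synchronized V get(K key) {
        Entry<V> e = entries.get(key);
        if (e == null || System.currentTimeMillis() - e.loadedAt > TTL_MS) {
            // Miss or expired entry: reload from the datastore and (re)insert it.
            V fresh = loader.apply(key);
            entries.put(key, new Entry<>(fresh, System.currentTimeMillis()));
            return fresh;
        }
        return e.value;                             // reads may be up to one minute out of date
    }
}

A Bwitter node would consult such a cache for reads, while writes go directly to the datastore as explained above.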

3.2.4 Conclusion

We have presented two different possible architectures: the open peer-to-peer and the cloud based architecture. We summarize in Table 3.1 the differences between the two solutions.

Security. Open peer-to-peer architecture: no control on the DHT, leading to numerous security flaws. Cloud based architecture: full control on the DHT, which is hidden from users; the attack surface is much smaller.

DHT control and stability. Open peer-to-peer architecture: high, uncontrollable and undesirable churn; connections between nodes can be really bad. Cloud based architecture: much stabler environment and possible control on the number of nodes in order to scale up and down.

Costs. Open peer-to-peer architecture: costs are supported by the users (maintenance of a DHT node). Cloud based architecture: high costs, but directly proportional to the resources needed.

Performance. Open peer-to-peer architecture: the number of nodes is normally proportional to the number of users, but the “quality” of the nodes is uncertain. Cloud based architecture: nodes are well connected and the cloud guarantees their performance; control allows optimization.

Cache. Open peer-to-peer architecture: no possible improvement of the performance using a cache. Cloud based architecture: high potential for performance increases using a cache.

Table 3.1: Comparison between the open peer-to-peer architecture and the cloud based architecture.

We have opted for the cloud based architecture as it has numerous advantages compared to the open peer-to-peer one. From a performance point of view, it has better network properties, less churn, a smaller replication factor, and finally a cache can be added to boost the performance. Moreover, the security requirements are hard to achieve in the open peer-to-peer architecture, while most of the security problems are solved simply by moving to the cloud architecture. The only obvious advantage of the peer-to-peer solution is that it is free. In the next chapter we take a look at the datastore we are using and how we represent our data in it.

Chapter 4

The Datastore

In this chapter we take a closer look at the datastores we are going to use: Beernet and Scalaris. From there we identify the design guidelines we followed to build the datastore schema, and then detail that schema. We end the chapter by discussing the problem of running several services on the same datastore, which brings us to the secret API we have designed for Beernet.

4.1 The datastore choice

4.1.1 Identifying what we need

As we saw in the state of the art, there are several types of datastores: key/value stores, document stores, extensible record stores and relational databases. We have only a few types of objects to store in our datastore, namely lines, lists, users and tweets. Furthermore, we do not need complex operations like the joins and queries available in RDBMSs. We want to use a simpler data model to avoid the unnecessary burden of maintaining complex structures. Moreover, we want the most scalable and elastic solution possible, and RDBMS-like systems were shown not to be efficient in those fields. For all those reasons we opted for key/value stores, and more precisely key/value stores with transactional capabilities. Transactions allow us to pack several operations together and execute them atomically: a transaction either executes all those operations successfully, or none of them if it aborts. This allows us to generate unique keys and maintain the integrity of our data structures. Suppose we want to store a value Bar at a key Key; nothing guarantees that something else is not already stored at Key. We thus do two operations: operation A, a look-up on Key; operation B, if the response of operation A was “not found”, we store Bar at Key. But this is not correct on its own, because nothing guarantees that no other operation C on Key happens between operations A and B. We must thus run operations A and B in a single transaction so that no operation C can come in between them. During our discussion of the datastore design in section 4.3, we use the transactional support

to generate unique IDs using counters that we read and increment atomically.

Persistence is a key requirement that we do not address in Bwitter. Unfortunately, the key/value datastores that fulfil our other requirements do not provide persistence. Scalaris is planning to add this feature, but it is still in the development phase. We could use a parallel datastore to do backups as Twitter does [36], but we do not address this problem.

The datastore must be robust in the sense that it must be capable of handling a lot of churn without failing. This is crucial in the case of our fully distributed architecture. Indeed, machines would not be under our control and a large number of machines would constantly join and abruptly leave the system. Our datastore should be able to manage those abrupt leaves, which behave similarly to machine failures, to ensure no data is lost. As we decided to go with the cloud based architecture, we work in an environment where the machines provided are not expected to fail abruptly. Robustness is thus still critical, but the datastore can use more complex algorithms to recover from failures as those are less likely to happen. Although most of the machine leaves and joins are under control, those operations must be efficient in order to have an elastic application. Handling churn correctly means that the datastore must maintain correct routing between the peers as well as the replication factor.
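
Coming back to the look-up-then-store example above (operations A and B), the following sketch shows how it could look in Java against a minimal transactional key/value interface. The TxStore interface and its method names are hypothetical placeholders for illustration, not the actual Beernet or Scalaris API.

import java.util.function.Function;

// Hypothetical minimal transactional key/value interface (illustration only).
interface TxStore {
    interface Tx {
        String read(String key);                 // returns null if the key is unused
        void write(String key, String value);
    }
    <T> T inTransaction(Function<Tx, T> body);   // aborts and retries are handled by the store
}

class StoreIfAbsent {
    // Returns true if "bar" was stored at "key", false if the key was already taken.
    static boolean storeIfAbsent(TxStore store, String key, String bar) {
        return store.inTransaction(tx -> {
            if (tx.read(key) != null) {          // operation A: look up the key
                return false;                    // somebody already used it
            }
            tx.write(key, bar);                  // operation B: store the value
            return true;                         // A and B commit atomically, so no C can interleave
        });
    }
}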

4.1.2 Our two choices

There are several key/value datastores available, but only two offer transactional capabilities: Beernet and Scalaris. Both fulfil our datastore requirements but differ on some points. We now introduce these two datastores.

Beernet

Beernet [19, 23] is a transactional, scalable and elastic peer-to-peer key/value datastore built on top of a DHT. Peers in Beernet are organized in a relaxed Chord-like ring [30] and keep O(log(N)) fingers for routing, where N is the number of peers in the network. This relaxed ring is more fault tolerant than a traditional ring, and its robust join and leave algorithms for handling churn make Beernet a good candidate to build an elastic system. Any peer can perform lookup and store operations for any key in O(log(N)). The key distribution is done using a consistent hash function, roughly distributing the load among the peers. These two properties are strong advantages for system scalability compared to solutions like the client/server model. Beernet provides transactional storage with strong consistency, using different data abstractions. Fault-tolerance is achieved through symmetric replication, which has several advantages, not detailed here, compared to leaf-set and successor-list replication strategies [11]. In every transaction, a dynamically chosen transaction manager (TM) guarantees that if the transaction is committed, at least a majority of the replicas of an item store the latest value of the item. A set of replicated TMs guarantees that the transaction does not rely on the survival of the TM leader. Transactions can involve several items. If the transaction is committed, all items are modified. Updates

are performed using optimistic locking. With respect to data abstractions, Beernet provides not only key/value pairs as in Chord-like networks, but also key/value sets with non-blocking add operations, as in OpenDHT-like networks [26]. The combination of these two abstractions provides more possibilities for designing and building the datastore, as we explain in Section 4.3. Moreover, key/value sets are lock-free in Beernet, providing better performance for set operations.

Elasticity in Beernet

We previously explained that to prevent overloading, the system needs to scale up, allocating more resources to be able to answer an increase in user requests. Once the load of the system gets back to normal, the system needs to scale down to release unused resources. We briefly explain how Beernet handles elasticity in terms of data management.

Scale up: When a node j joins the ring between peers i and k, it takes over part of the responsibility of its successor, more specifically all keys from i to j. Therefore, data migration is needed from peer k to peer j. The migration involves not only the data associated with keys in the range ]i, j], but also the replicated items symmetrically matching that range. Other NoSQL datastores such as HBase [1] do not trigger any data migration when new nodes are added to the system, showing better performance when scaling up.

Scale down: There are two ways of removing nodes from the system: by gently leaving and by failing. It is very reasonable to consider gentle leaves in cloud environments, because the system explicitly decides to reduce its size. In such a case, it is assumed that the leaving peer j has enough time to migrate all its data to its successor, which becomes the new responsible node for the key range ]i, j], i being j’s predecessor. Scaling down due to the failure of peers is much more complicated, because the new responsible node for the missing key range needs to recover the data from the remaining replicas. The difficulty comes from the fact that the value of the application keys is unknown, since the hash function is not bijective. Therefore, the peer needs to perform a range query, as in Scalaris [29], but based on the hash keys. Another complication is that replica sets are not based on key ranges, but on each single key.
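
A small sketch of the responsibility range involved when scaling up, under the assumption that identifiers are plain integers on the hash circle; it only covers the primary key range ]i, j] taken over by the joining peer, not the symmetrically replicated items mentioned above.

// Which keys move when a peer j joins between its predecessor i and its successor k:
// j takes over the range ]i, j], previously held by k. The hash space is a circle,
// so the range test must handle wrap-around. Identifiers are illustrative longs here.
class RingRange {
    // true if key lies in the half-open interval ]from, to] on the identifier circle
    static boolean inRange(long key, long from, long to) {
        if (from < to) return key > from && key <= to;
        if (from > to) return key > from || key <= to;    // interval wraps around zero
        return true;                                      // from == to: a single peer owns everything
    }

    // Keys the joining peer j must receive from its successor (i is j's predecessor).
    static boolean migratesToJoiningPeer(long key, long i, long j) {
        return inRange(key, i, j);
    }
}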

Scalaris

Much like Beernet, Scalaris offers a transactional, scalable and elastic peer-to-peer key/value datastore and is also built on top of a DHT [29]. Scalaris is currently based on a traditional Chord ring, with a possible upgrade to Chord#. While not as fault tolerant as Beernet, Scalaris is a good candidate for building elastic systems too. Lookup and store operations have the same complexity, O(log(N)), where N is the number of peers in the network. Currently the key distribution is done using a hash function, but keys could be ordered lexicographically after the upgrade to Chord#.

As in Beernet, Scalaris provides transactional storage with strong consistency, and fault-tolerance is achieved through symmetric replication. Transactions are taken care of by a local Transaction Manager associated with the node to which the user is connected. Transactions are executed optimistically: a transaction is first executed completely on the associated node, and its result is then stored at the responsible nodes if it succeeded. Besides the classical key/value pairs, Scalaris also supports key/value lists as a data abstraction. Lists, as opposed to Beernet sets, are not lock-free and there is no add operation on lists. In order to add an element to a list atomically, we must, in a single transaction, read the list, add the element to it, and write it back to Scalaris. Lists are thus a convenient abstraction that spares the programmer from developing his own parsing system, but they do not offer any performance improvement.
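
The read-modify-write pattern just described could look as follows. The TxListStore interface is a simplified stand-in used for illustration; it is not the actual Scalaris Java API.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Atomic append to a key/value list via read-modify-write inside one transaction.
interface TxListStore {
    interface Tx {
        List<String> readList(String key);         // empty list if the key does not exist yet
        void writeList(String key, List<String> value);
    }
    <T> T inTransaction(Function<Tx, T> body);     // the store aborts and retries on conflicts
}

class ListAppend {
    static void append(TxListStore store, String key, String element) {
        store.inTransaction(tx -> {
            List<String> current = new ArrayList<>(tx.readList(key)); // read the whole list
            current.add(element);                                      // modify it locally
            tx.writeList(key, current);                                // write it back atomically
            return null;
        });
    }
}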

Conclusion

Beernet and Scalaris both fit our needs with their elasticity and scalability properties and their native data abstractions. Unfortunately, due to some unexpected problems with Beernet we were forced to continue with Scalaris alone. This was disappointing, as we had been working closely with Boris Mejías, the developer of Beernet, to further improve his system with the richer API presented in section 4.4.1.

4.2 General Design

The design of the datastore is closely linked to our application requirements. Hence before going straight into the design of the datastore, we take some time to explain the guidelines we elicited from the requirements to build the datastore’s schema. Some choices might be unclear now but they will be clarified when we present the algorithms in Chapter 5.

Make reads cheap

While designing the lines we had to decide whether we should favour the reads or the writes. If we privilege the reads, we push the information to the line and put the burden on the write: in this case the “post tweet” operation adds a reference to the tweet in the lines of each follower; we call this the push approach. On the other hand, we could privilege the writes: in this case, we pull the information and build the lines each time a user wants to read them, by fetching all the tweets posted by the users he follows and reordering them; we call this the pull approach. As people do more reads than posts on social networks, and based on the assumption that each posted tweet is read at least once, we opted to make reads cheaper than writes and thus privileged the push approach. However, we also study the pull approach and compare it with the push approach when we present our algorithms in Chapter 5 and our experiments in Chapter 6.

Do not store tweets in the lines but references

There is no need to replicate the whole tweet inside each line, as a tweet could potentially contain a lot of information and should be easy to delete. Therefore, we prefer to store references to tweets. To delete a tweet, the application only has to edit the stored tweet and does not need to go through every line that could contain it. When loading the tweet, the application can see whether it has been deleted or not.

Minimise the changes to an object

We want the objects to be as immutable as possible to enable caching. This is why we avoid storing potentially dynamic information inside the objects, and rather keep a pointer to it. For instance, tweets are only modified when we delete them; this is why a reply to a tweet should not modify the tweet object itself.

Do not make users load unnecessary things

Loading the whole line each time we want to see the new tweets would result in an unnecessarily high number of exchanged messages and would be highly bandwidth consuming. This is why we decided to cut lines, which are in fact just big sorted sets, into subsets of x tweets that can be organised in a linked-list fashion, where x is a tunable parameter. Set fragmentation is done differently depending on the chosen design of the datastore; this is explained later in the algorithms section.

Retrieving tweets in order

Users want to retrieve the most recently posted tweets first; tweets are thus dated to allow ordering. Tweets must be stored so that getting the most recent ones is easy and efficient. We have built an algorithm that guarantees the correct ordering of the tweets inside our lines, even in the presence of network reordering and failures.

Filtering the references

When a user is dissociated from a line, we do not want our application to keep displaying the tweets he posted previously. We decided not to scan the whole line to remove all the references added by this user. Instead, we remove the user from the list of users associated with the line, and filter the references based on this list before fetching the corresponding tweets.
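
A minimal sketch of this filtering step, assuming a TweetRef holder with a poster field as described later for line references; class and field names are ours for illustration.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Reference to a tweet as stored in a line (illustrative fields only).
class TweetRef {
    String poster;      // username used for filtering
    String tweetKey;    // key of the referenced tweet in the datastore
}

class LineFilter {
    // Keep only references whose poster is still associated with the line,
    // so the tweets of removed users are hidden without rewriting the line.
    static List<TweetRef> visibleRefs(List<TweetRef> lineChunk, Set<String> associatedUsers) {
        return lineChunk.stream()
                        .filter(ref -> associatedUsers.contains(ref.poster))
                        .collect(Collectors.toList());
    }
}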

Only encrypt sensitive data

Most of the data in Twitter is not private so there would be no point in encrypting it. Only the sensitive data such as the passwords of the users should be protected by encryption when stored in the datastore.

Simple data structures

We believe that having complex data structures is not a good idea in a key/value store. Indeed, in order to maintain them we need to use transactions, and those are more likely to fail if updating a data structure requires accessing a lot of different keys at the same time.

4.3 Design of the datastore

The design of the datastore is an important part of the project. In our case it is more complicated than for a classical database because we do not have high level data structures like database tables. As a reminder, Beernet and Scalaris both provide two different data structures, key/value pairs and key/set (or key/list) pairs, the latter allowing multiple values to be stored at the same key.

As we wanted an easy way to store and retrieve Java objects from the datastore, we decided to serialize them. When serialized, Java objects are transformed into Strings conforming to the XML format. After serialization they are stored as values in our datastore. This has the advantage that we can easily recover Java objects if needed later, or directly respond to Bwitter requests in XML without even deserializing those objects. Moreover, XML has the advantage of being a widely used format, and thus a lot of existing libraries handle it. The process to add something to our datastore is the following: create a Java object, serialize it, choose a unique key and finally store the key/value pair in the datastore. We deliberately avoid talking about robustness and the shared key space for now, as we dedicate two sections to those problems after the details of our design.

Our first attempt to design a social network on a key/value pair datastore was based on references. We say that it was based on references because everything, except the user object, was stored at random and meaningless keys. The user profiles contained references to the other objects belonging to the user. For example, the lines of a user were kept in a user set whose reference was kept in the user object. After some thought we decided to drop the random keys and references, and replaced them with a design based on human-understandable and computable keys. The key space layout now looks like a file directory. We do not need to follow a chain of references to access an object anymore; objects can be addressed directly. This also removes the burden of managing the references, which in turn reduces the number of operations needed and improves performance. Moreover, the old design had a bigger space complexity because it had to store references to every object, from the user profile down to the object itself. Thanks to this simple addressing, it is also easier to write clear code and avoid bugs. Note that throughout this section, when we talk about keys, the variable parts of the key are written in bold characters while the static parts are not.

We have two different datastore designs: one for the push approach and one for the pull approach.

http://www.w3.org/XML/, last accessed 14/08/2011

The push approach pushes the information posted to the readers, while the pull approach retrieves it from the poster. We focus on the push approach because we believe it is the best adapted to our application, and we only briefly describe the pull design.

4.3.1 Key uniqueness

For now we assume that only Bwitter is running on the datastore. We must still ensure key uniqueness to avoid unwanted overwriting of data. In order to do so, information must be kept in the datastore for each key already used, and this information must be stored at a known location. We separate the datastore into several groups of objects, for example the tweets of a user, a line of a user, sets of tweets, etc. For each of those groups we keep track of the number of objects in the group, so that we can forge a new key for each new object. Each group must have a unique base key from which we can create new unique keys for the members of the group. As an example, we show how we add a new tweet to the tweets already posted by a user. We assume that the tweets of a user are stored under the base key “/user/username/tweets/” (where username is the username of the user), called tweetBase, and that the number of his tweets is stored at “tweetBase/size”. The following pseudo code adds a new tweet to the tweets already posted by that user.

addNewTweet {
    begin transaction
        x = Read("tweetBase/size")
        x++
        Write(x, "tweetBase/size")
        Store the new tweet at the key "tweetBase/x"
    end transaction
}

This ensures that we always use unique keys when adding a new object to a group. The drawback is that all object additions go through the same key, where the number of objects is stored. Any two parallel transactions that add an object to the same group thus conflict. Therefore, it is important to keep this limitation in mind while designing our data structures. We still have a problem: we just stated that we need the base keys to be unique. We require that the username of a user be unique, which allows us to create unique base keys for each user. This uniqueness can easily be checked when each user registers.
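
The same allocation written in Java, against the same kind of simplified transactional interface as in the earlier sketch of Section 4.1.1 (repeated here to keep the example self-contained). The interface and its method names are hypothetical; only the key layout follows the schema above.

import java.util.function.Function;

// Hypothetical minimal transactional key/value interface (illustration only).
interface TxStore {
    interface Tx {
        String read(String key);
        void write(String key, String value);
    }
    <T> T inTransaction(Function<Tx, T> body);
}

class TweetAllocator {
    // tweetBase is "/user/<username>/tweets/" in the schema used above.
    static String addNewTweet(TxStore store, String tweetBase, String serializedTweet) {
        return store.inTransaction(tx -> {
            // Assumption of this sketch: the counter was initialised when the user registered.
            int x = Integer.parseInt(tx.read(tweetBase + "size"));
            x++;
            tx.write(tweetBase + "size", Integer.toString(x));   // bump the counter
            String tweetKey = tweetBase + x;
            tx.write(tweetKey, serializedTweet);                  // store the tweet at the fresh key
            return tweetKey;                                      // both writes commit atomically
        });
    }
}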

4.3.2 Push approach design details

Users

The user object, represented in Figure 4.1, contains the real name of the user and his registration date. Any other personal information could be added later to this object. We store the user object at “user/username”.

Figure 4.1: User profile object of user “Paul”.

We store the hashed password of the user at the key “user/username/password”. We use it to authenticate every operation that involves writing. We store this value on its own as it is requested more often than the user’s other personal information.

We propose to add a special structure, shown in Figure 4.2, that allows searching for users. Indeed, searches are not well supported in a key/value store because application keys are not organized in lexical order in the ring, but according to a hash function. We thus group in the same key/set (key/list) pair the real names that share some prefix; we make no difference between upper and lower case. The user search tree we propose is a binary search tree; we made this choice because it is an efficient structure for insertion and retrieval. Leaf nodes contain mappings between real names and usernames, allowing the username of a user to be found from his real name. Indeed, people do not necessarily know someone’s username, and we identify users by their username. Therefore, this structure is crucial for users to easily find people they know in our system. All the leaf nodes together cover the whole alphabet. Parent nodes do not contain any search information; they only keep references to their children. Leaf nodes have an approximate maximum size: when the size of a leaf node reaches this limit, we add two children to it and split its responsibility interval between the two children. We did not develop a formal algorithm for this search tree due to lack of time; it is thus not present in our implementation.

Figure 4.2: Username search tree.

Lines and lists

Lines and lists are really similar; we thus only detail lines, because lists are simply lines without any associated users. A line has a set of tweets and a set of associated users. In practice, and as stated in the main guidelines, tweets are not stored in lines; instead we store references to them. Those references contain a date, the username of the poster used for filtering, the username of the original poster if it is a retweet, and the key of the referenced tweet, as can be seen in Figure 4.3.

Figure 4.3: Reference to tweet object to be stored in a line or list.

Sets of usernames are not split like tweet sets, because they are always read in their entirety when used. We also keep a set containing all the line and list names so that we can easily retrieve them (see Figure 4.4).

Figure 4.4: Left) Lines set of user “Paul”. Right) User set of the “coolpeople” line of user “Paul”.

The set of tweets associated with a line or list can become very big. Taking into account our main design guidelines, we do not store it in one set but as a list of chunks organized in chronological order from most recent to oldest, as can be seen in Figure 4.5. The head is at a fixed location (/user/username/line/linename/head), which allows us to quickly add an element to this set and read the latest tweets. The other chunks are located at a fixed base key (/user/username/line/linename) to which we concatenate a number called chunkNbr. The chunk with chunkNbr equal to 0 is the oldest. The newest chunk has a chunkNbr equal to the value contained at the key “/user/username/line/linename/size” minus 1. It is thus easy to access any chunk of the line. This may not be obvious at the moment, but the number of tweets in each chunk is of great importance, as it influences the complexity of the algorithms we present in the next section.

Figure 4.5: Top) Number of chunks in the “coolpeople” line of user “Paul”. Bottom) The head chunk and two chunks of the “coolpeople” line of user “Paul”.
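
A small sketch of the corresponding key construction, following the layout above and Table 4.1; the reading order from the head down to chunk 0 is our reading of the text, not verbatim code from our implementation.

import java.util.ArrayList;
import java.util.List;

// Key construction for a chunked line: head at .../head, numbered chunks from 0 (oldest)
// to size-1 (newest), and the chunk count at .../size.
class LineKeys {
    static String base(String username, String lineName) {
        return "/user/" + username + "/line/" + lineName;
    }
    static String headKey(String username, String lineName)  { return base(username, lineName) + "/head"; }
    static String sizeKey(String username, String lineName)  { return base(username, lineName) + "/size"; }
    static String chunkKey(String username, String lineName, int chunkNbr) {
        return base(username, lineName) + "/" + chunkNbr;
    }

    // Order in which chunks would be read to show the most recent tweets first:
    // the head, then chunk size-1, size-2, ... down to chunk 0.
    static List<String> newestFirst(String username, String lineName, int size) {
        List<String> keys = new ArrayList<>();
        keys.add(headKey(username, lineName));
        for (int chunkNbr = size - 1; chunkNbr >= 0; chunkNbr--) {
            keys.add(chunkKey(username, lineName, chunkNbr));
        }
        return keys;
    }
}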

Topost set

The Topost set, represented in Figure 4.6, contains references to lines (keys of the lines in the datastore) in which the user must post references to his tweets. We do not store the whole reference to a line because some parts of the reference are constant. Instead, we store what is needed to find the line back: the name of the line and the username of the line’s owner. As was the case for the lines, the Topost set is fragmented using the same technique. Each of its chunks contains at most nbrOfFollowersPerChunk references; this is a parameter that has to be tuned and is further discussed in section 6.3.2 of our experiment chapter. Moreover, each chunk also has a counter, used to implement the post tweet algorithm robustly. This counter has a value between -1 (included) and the number of tweets that the user has posted (excluded). In Figure 4.6, you can notice that the tweets of the owner of the Topost set were not correctly posted for all the chunks: the counter values differ between the chunks, indicating that some tweets remain to be posted. We add another counter that is used to remember the tweet number of the last tweet that was correctly posted; this counter is also initialized to -1. In this example, assuming Paul has already posted 12 tweets, we can see that one tweet still needs to be posted for chunk 0 and two for chunk 1.

Tweet

The messages the users post are called tweets. As mentioned before, a tweet is a small message of at most 140 characters. The tweet object contains a message field as well as a poster field. Moreover, some tweets can be retweeted; to handle this situation we added an original-author field that contains the name of the original author of the tweet. This field is null if the tweet is not a retweet. Tweets are also dated with second precision; the time used when storing in the datastore is Greenwich Mean Time (GMT) for the whole system, and it is up to the GUI layer to adapt the time to the local time zone when

Figure 4.6: Different parts of the Topost set of user “Paul”. Top left) Number of chunks in the Topost set. Top right) Global counter of correctly posted tweets. Center) Chunk counters of correctly posted tweets. Bottom) Chunks of the Topost set.

displaying the tweet. A field indicates whether this tweet was deleted by its owner. Finally, users can reply to tweets. We want to be able to recover the complete conversation given one tweet; therefore we keep a reference to a potential parent and to a set of children. An example is shown in Figure 4.7: Tweet2 is a response to Tweet1, Tweet3 is a response to Tweet2, Tweet6 is a response to Tweet4, and so on. Tweets are stored only once in the datastore; we made this choice in order to make their deletion easier and to minimize the data stored in the datastore.

Figure 4.7: Conversation Tree.

The key of a new tweet is the concatenation of the prefix “/user/username/tweet/” with the number of tweets already posted by the user. The schema of tweet number 42 posted by the user “Paul” is shown below in Figure 4.8.

Figure 4.8: Left) Tweet number 42 object of user “Paul”. Right) Number of tweets of user “Paul”.

4.3.3 The Pull Variation

As explained in the introduction, we have also decided to experiment with a variation of the push based design, and to observe how the system would behave if we decided to pull the information instead of pushing it. As this was not our primary goal, we decided to focus on the design of the datastore with only the push approach in mind, making it as efficient as possible. Afterwards, we tried to fit the pull variation in. This went very well, as the pull approach borrows a great majority of the building blocks and even mechanisms of the push approach. We now store the references only at the owner side; we explain how those tweets are retrieved in the algorithms chapter. Furthermore, those references are kept grouped by timestamp, meaning that the tweets posted during the same time frame, for instance the same hour, are grouped together. The timestamp is of the form: 05/06/11 15 h 26 min 03 s GMT, with some fields set to zero according to the chosen time granularity. For instance, if we want the references to be grouped by hour we would have a timestamp of this form: 05/06/11 15 h 00 min 00 s GMT. The full key looks like this: /user/username/tweet/timestamp. We also have to store the subscription date of the user in order to compute the equivalent of the chunk numbers. This date will be stored at the key: user/username/starttime.
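
A sketch of how a posting time could be truncated to the chosen granularity before being used in the key, assuming hourly granularity and a day/month/year reading of the timestamp quoted above; the formatter pattern is an assumption for illustration, not the thesis’ actual code.

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Grouping pull-design references by time frame: the posting time is truncated to the
// chosen granularity (here one hour) and formatted before being appended to the key.
class PullKeys {
    private static final DateTimeFormatter BUCKET_FORMAT =
            DateTimeFormatter.ofPattern("dd/MM/yy HH 'h' mm 'min' ss 's' 'GMT'");

    static String referenceKey(String username, ZonedDateTime postedAt) {
        ZonedDateTime bucket = postedAt.withZoneSameInstant(ZoneOffset.UTC)
                                       .truncatedTo(ChronoUnit.HOURS);   // hourly granularity
        return "/user/" + username + "/tweet/" + BUCKET_FORMAT.format(bucket);
    }
}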

Object | Type | Key | Description
User profile | Value | /user/username | User profile with user information
Password | Value | user/username/password | Hashed password of the user
Topost set chunk | Set | /user/username/topost/chunkNbr | Set of lines where the user has to post the references to his tweets
Topost chunk counter | Value | /user/username/topost/chunkNbr/counter | Counter associated to each chunk of the Topost set
Topost set size | Value | /user/username/topost/size | Number of chunks in the Topost set of a user
Last tweet correctly posted | Value | /user/username/topost/lasttweetposted | Tweet number of the last tweet correctly posted
Tweet | Value | /user/username/tweet/tweetNbr | Tweet object containing the message
Replies to tweet | Set | /user/username/tweet/tweetNbr/children | Replies to the tweet
Tweet counter | Value | /user/username/tweet/size | Number of tweets posted by a user
Lines set | Set | /user/username/linenames | Names of the lines of the user
Line chunk | Set | /user/username/line/linename/chunkNbr | Chunk of a line containing tweet references
Line chunk counter | Value | /user/username/line/linename/size | Number of chunks in the line (head not counted)
Line users | Set | /user/username/line/linename/users | Users associated to a line
Lists set | Set | /user/username/listnames | Names of the lists of the user
List chunk | Set | /user/username/list/listname/chunkNbr | Chunk of a list containing tweet references
List chunk counter | Value | /user/username/list/listname/size | Number of chunks in the list

Table 4.1: Keys used in the datastore for the push design
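
As an illustration of how the schema of Table 4.1 translates into code, a few of the key builders could look as follows; the exact helpers in our implementation may differ, and the line and list keys follow the same pattern as the sketch in the previous section.

// A few key builders derived from Table 4.1 (illustrative helpers only).
class PushKeys {
    static String userKey(String username)                  { return "/user/" + username; }
    static String tweetKey(String username, int tweetNbr)   { return "/user/" + username + "/tweet/" + tweetNbr; }
    static String tweetCounterKey(String username)          { return "/user/" + username + "/tweet/size"; }
    static String repliesKey(String username, int tweetNbr) { return "/user/" + username + "/tweet/" + tweetNbr + "/children"; }
    static String topostChunkKey(String username, int chunkNbr)        { return "/user/" + username + "/topost/" + chunkNbr; }
    static String topostChunkCounterKey(String username, int chunkNbr) { return "/user/" + username + "/topost/" + chunkNbr + "/counter"; }
    static String linesSetKey(String username)              { return "/user/" + username + "/linenames"; }
}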

4.3.4 Conclusion

You can find in Table 4.1 a summary of the kinds of keys we use in our datastore for the push design. Every key used is of course unique. Remember that the variable parts of the keys are written in bold while the rest is static. Our datastore design was rebuilt several times in order to meet the important criteria we have fixed: simplicity, scalability and clarity. We have built a structure for the lines that allows the latest tweets to be retrieved easily in chronological order. We cut the lines and the Topost set into chunks because those two can be very big (billions of tweets and millions of followers). We have also designed a structure to efficiently search for users in the system. Concerning which of the pull and push approaches is the best, the intuition is that push is better for reads while pull is better for writes. We compare the two approaches theoretically when discussing the algorithms in Chapter 5, and test them in the experiments of Chapter 6.

4.4 Running multiple services using the same datastore

There are numerous situations where multiple applications may want to share the same datastore. For instance, we could easily imagine a globally distributed datastore deployed in a peer-to-peer environment being used for multiple applications, exactly as we suggested in our first architecture. This would encourage users to let the datastore node run longer and would mitigate the heavy churn problem we would face if those users only used the datastore for our Bwitter application: they would launch it to consult their latest tweets or to post a tweet, and then close it right away. Although we do not face this churn problem in the cloud or any other stable environment, the remark is also valid there. Indeed, an application’s plugins may want to store additional data that does not interfere with the data of the main program, while still being able to access it. So while we could restrict access to the datastore to Bwitter alone, this would be a clear limitation. We are thus going to take a closer look at the problem of sharing the datastore, and particularly the keyspace. After some thought, we reduced the problem of sharing the keyspace to two smaller problems: keys already used, and accidental or malicious data erasure. We explored different ways to solve those problems at the datastore level. Even though we did not use those solutions, it is still relevant to present our work and conclusions here. Note that this work was only done on Beernet and not on Scalaris, due to our privileged collaboration with Boris Mejías, the developer of Beernet, since the beginning of our project.

4.4.1 The unprotected data problem

Early in the process, we elicited a crucial requirement: the integrity of the data posted by the users on Bwitter must be preserved. A classical mechanism, though not without flaws, is to use a capability based approach: data is stored at randomly generated keys, so that other applications and users cannot erase a value because they simply do not know at which key it is stored. However, in applications where content has to be publicly available, we cannot protect all our values by simply using unguessable keys. For example, Bwitter allows any unknown user to add his name to the Topost set of another user in order to subscribe to his tweets. This list must not only be readable by any user but also writable by any user. In practice, we would use the set abstraction provided by Beernet to implement this list. Any user needs the possibility to add an element to the set, but it should be impossible for anyone but the creator of the set and the user who added the value to remove that value. The problem is that Beernet does not allow any form of authentication, so key/value pairs are unprotected. Hence, anybody who is able to send requests to Beernet can modify and delete any data previously stored. We detail here several solutions that we have imagined to solve this problem.

Safe environment assumption

At first, we assume Beernet is running on the cloud and that the nodes are managed by an entity other than the applications running on top of it. This means that nobody can add nodes except this entity and that the communications between the different nodes cannot be spied on. Indeed, Beernet inter-node communications are done on a LAN inaccessible from outside the cloud. Moreover, we assume that the communications between Beernet nodes and applications are encrypted, so nobody is able to spy on them.

Cooperation between applications

The most naive solution is to assume that all the applications running on Beernet are written without bugs and are respectful of each other. This means that the applications check, each time they want to write a Key1/Value1 pair, that no other Key1/Value2 pair with the same key was already written by another application. Additionally, this operation has to be run in a transaction to avoid race conditions. This should normally not induce too much performance overhead, because applications will usually run transactions each time they store a value using the transactional replicated storage of Beernet. In order to be able to perform this check, each time a value is stored, a piece of information identifying the application that posted the value must be manually added to the value by the posting application. This solution makes a strong assumption, and even if this assumption holds it adds complexity to the code of each application running on Beernet. Indeed, applications need to parse each value they read and add information to each value posted.

Data protected by secrets

We now lift the assumption made in the previously presented solution. We assume that several applications are running on top of Beernet, are not respectful of each other and thus do not cooperate. We would like to enable an application A to protect the values it posted from being overwritten by an application B. This is not possible without the help of Beernet because the two applications can access Beernet freely and are not cooperating. We have thus designed a solution that enhances the API of Beernet: we enable an application to protect a key/value pair it posted using a secret chosen by the application itself. This secret is needed if another operation tries to modify or delete the value associated with the newly protected key. Because Beernet is running in a secure environment, secrets will not leak from Beernet; a malicious user can still try to guess the application secret, but it is the application's responsibility to use secrets that are hard to discover. A similar secret mechanism was developed for OpenDHT [26], which made it possible to attach a removal secret so that the secret is requested when a delete operation is performed. In a very similar fashion, Beernet's secret mechanism allows applications to share values with other applications while keeping them protected. Application A can now write a value and protect it against editing and deleting using a secret. Without this secret anyone can still read the value but cannot edit or delete it.

The secret mechanism developed for Beernet goes further: the sets are now protected by secrets too and offer much more flexibility. Three different secrets can be used to protect the different parts of a set. First, the Set Secret, which is one of the two secrets associated with the set itself when it is created. It can be seen as a master key allowing its owner to do all the desired operations on the set. The creator of the set can thus destroy the set along with all its contents, insert items into the set, but also delete each item contained in the set separately. Secondly, the Write Secret, which is the other secret associated with the set itself when it is created. This secret is required to add an item to the set; this way the creator of the set can decide to whom he gives the right to add items to his set. Finally, the Value Secret, associated with a given item in the set. This secret protects a single item in the set against editing and deleting, which means that only the user who added the item and the owner of the set can delete it. This secret is set by the user who adds the value to the set.

This new way to protect sets makes it easy to implement numerous applications based on user-posted content. Comments on blogs, for instance, become extremely easy: the author of the blog can give other users the permission to add comments to an entry. All the users can then see the comments posted by their peers but can only edit the comments they posted themselves. The author can manage the comments as he also has the right to delete and edit them. This is only a short and simplistic example, but we are convinced this new secret mechanism will make the development of more complex applications much easier.

New semantics using secrets We need three new kinds of fields, one for each secret, in addition to the existing Key and Val fields. Those new fields are automatically set to NO SECRET when applications use the functions of the old API that do not take any secrets. NO SECRET is a reserved value of Beernet indicating that there is no secret. As an example, we show the difference for the put function. It used to be:

put(K:Key V:Val)

Stores the value Val associated with the key Key at the peer responsible for the hash of Key. This operation can have two results, “commit” or “abort”. The operation returns “commit” if:

• there is nothing stored associated with the key Key, or there is a value stored previously by a put operation;

• the value has successfully been stored.

Otherwise the operation returns “abort” and nothing is changed.

The new version is now:

put(S:Secret K:Key V:Val)

Stores the triplet (Hash(Secret) Key Val) at the peer responsible for the hash of Key. This operation can have two results, “commit” or “abort”. The operation returns “commit” if:

• there is nothing stored associated with the key Key, or there is a triplet stored previously by a put operation;

• there is no triplet (Secret1 Key Val1) stored at the peer responsible for the hash of Key such that Hash(Secret) != Hash(Secret1);

• the value has successfully been stored.

Otherwise the operation returns “abort” and nothing is changed. If no value is specified for Secret, Beernet assumes the call is equivalent to put(S:NO SECRET K:Key V:Val).
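To illustrate these semantics, the sketch below shows, in Java and with purely hypothetical names, the kind of check a storage node could perform before overwriting a protected triplet; Beernet's actual implementation is of course different and this is only a conceptual sketch.

// Illustrative sketch only (hypothetical Java types): the check a storage node
// could perform for the secret-protected put. NO_SECRET plays the role of the
// reserved value described above.
class SecretProtectedStore {
    static final String NO_SECRET = "NO_SECRET";

    static class StoredTriplet {
        final String secretHash;
        final Object value;
        StoredTriplet(String secretHash, Object value) {
            this.secretHash = secretHash;
            this.value = value;
        }
    }

    private final java.util.Map<String, StoredTriplet> store = new java.util.HashMap<String, StoredTriplet>();

    // Mimics the semantics of put(S:Secret K:Key V:Val): returns "commit" or "abort".
    synchronized String put(String secret, String key, Object value) {
        String newHash = hash(secret);
        StoredTriplet existing = store.get(key);
        // Abort if a triplet protected by a different secret is already stored at Key.
        if (existing != null && !existing.secretHash.equals(newHash)) {
            return "abort";
        }
        store.put(key, new StoredTriplet(newHash, value));
        return "commit";
    }

    private String hash(String secret) {
        // Placeholder for a real one-way hash function.
        return Integer.toHexString(secret.hashCode());
    }
}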

The whole new API of Beernet now contains a secure version of put, write, add, remove and delete, and also allows explicit set creation. The full semantics of the new API can be found in the annexes, in Chapter 8.

4.4.2 Key already used problem

At the moment, in Beernet and in Scalaris, as in all key/value stores we know of, there is only one key space. This means that multiple services have to share it, and if a service uses a key another service cannot use it anymore. For some applications, not being able to use a given key can be very annoying, as keys may have a defined meaning and the application expects to find a certain type of information at a certain kind of key. This can be solved by designing more complex algorithms at the application level, but this adds complexity not directly linked to the application, which is, in our opinion, a bad idea. Sharing a key space can thus create problems if multiple services want to use the exact same keys. For instance, if another service decides to store the usernames of its users at the keys “user/username”, we have a conflict with our Bwitter application: the two applications cannot both have a user with the same username. This problem cannot be solved with the secrets mechanism we proposed, but it can be solved using a capability-based approach. This was not the case for the unprotected data problem we just presented; indeed, the goal here is not to protect the data but to avoid key conflicts between applications. The simplest way to avoid using the same keys is to prepend a differentiation number to every key. When an application wants to start using Beernet, it generates a random root key, for instance 93981452. From then on the application only uses keys starting with 93981452. If we can be confident enough that no other application will use this root key, we can assume that we are working with our own key space. We can thus design the application accordingly, removing the burden of complex algorithms to recover from a key already being used. In RFC 4122² the authors describe how to generate globally unique identifiers; we could use those identifiers as root keys, since the chance that such a key would be used twice on the same datastore is infinitesimal. This approach is also valid if you want to hide data from some applications or users: guessing the root key is conceivable but in practice not possible.

2Can be found at http://www.ietf.org/rfc/rfc4122.txt
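As an illustration of this capability-based approach, the following sketch (Java, with hypothetical class and method names) generates a random root key with java.util.UUID and prefixes every application key with it.

import java.util.UUID;

// Illustrative sketch (hypothetical names): every key the application sends to the
// datastore is prefixed with a random root key, so that two applications sharing
// the same key space are extremely unlikely to collide.
public class KeyspacePrefix {
    private final String rootKey;

    public KeyspacePrefix() {
        // Random (version 4) UUID as defined in RFC 4122, used as the root key.
        this.rootKey = UUID.randomUUID().toString();
    }

    // Builds the key actually stored, e.g. "<rootKey>/user/alice" instead of "user/alice".
    public String key(String applicationKey) {
        return rootKey + "/" + applicationKey;
    }

    public static void main(String[] args) {
        KeyspacePrefix bwitterKeys = new KeyspacePrefix();
        System.out.println(bwitterKeys.key("user/alice"));
    }
}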

4.4.3 Conclusion

In this section we have addressed two problems that arise when multiple applications share the same key space, namely the unprotected data problem and the key already used problem. The first problem was solved with the secret mechanism that we designed for Beernet, which is now implemented in Beernet version 0.9. Key/value pairs and key/value sets can now be protected by a secret needed to modify or delete those values. We even proposed a finer granularity at the set level: it is possible to create a set controlled by one person but that can be read and written by several. This can be done while preventing users other than the managers from modifying or deleting values posted in the set. The second problem is solved thanks to the capability-based approach. Thanks to those two mechanisms, we can run multiple applications in parallel on the same Beernet instance without any interference between them.

Chapter 5

Algorithms and Implementation

This chapter contains four sections. We first show the implementation of the cloud based architecture we detailed in section 3.2.2. We then take a closer look at our three main modules: the Nodes Manager, the Scalaris Connection Manager and the Bwitter Request Handler. The Nodes Manager is responsible for launching the machines needed, as well as performing remote operations on those machines. The task of the Scalaris Connection Manager is to control the access of Bwitter to Scalaris. We finish this chapter by presenting all the algorithms we have designed for the Bwitter Request Handler. Those algorithms were designed to work with a key/value datastore supporting transactions. We also make a theoretical estimation of the number of reads and writes performed by Bwitter for a given social network.

5.1 Implementation of the cloud based architecture

We did not arrive at the current implementation of Bwitter directly; we first went through two other implementations that share several similarities with the current one. In this section we briefly describe those first two implementations, as they are an integral part of our project, and finish by detailing the third and final version.

5.1.1 Open peer-to-peer implementation

The first version implemented the open peer-to-peer architecture we presented in section 3.2.1. In this solution it was necessary to protect data from malicious or unintentional modification at the datastore level. This is why we developed the secrets mechanism for Beernet described in section 4.4.1. The secrets were used by Bwitter to protect user data. This version was stateful, meaning that the client had to establish a session by logging in before being able to use the functions offered by the Beernet API. This was not really practical because the Beernet nodes had to remember all the clients that were connected. Moreover, the load balancer had to be configured to always direct the same client to the same Bwitter node. This first version was never fully implemented and only reached the draft state.

5.1.2 First cloud based implementation

Along the way we realized, as explained in section 3.2.4, that the cloud architecture was much better suited to our project. We thus made heavy changes to our implementation and came up with the second version of our application. Due to unexpected maturity problems of Beernet, we were not able to test our implementation with it, so it ran on an emulated DHT. This implementation was fully operational and even had a functional GUI. It was presented at the “Foire du Libre” held on the 6th of April in Louvain-la-Neuve1, where visitors could try it at the Beernet stand. As time went by, it became apparent that we would not be able to use Beernet for our implementation, so we decided to switch to Scalaris. Furthermore, after some preliminary tests on our second implementation, we identified some heavy changes to be made to our Bwitter API. They were caused by the decision to get rid of the sessions we were maintaining for our users and to have an API closer to the Representational State Transfer (REST) principles [27]. This change in the interface, combined with the need to switch to a new scalable database, made us decide to start a fresh third implementation.

Figure 5.1: View of our global architecture, highlighting the three main layers: the GUI, Bwitter and Scalaris.

1“Foire du Libre” is a fair celebrating open source software and organised by the Louvain-li-nux: http://www.louvainlinux.be/foire-du-libre/, last accessed 05/08/2011

5.1.3 Final cloud based implementation

We will now present the final version of our application implementing the cloud based architecture we detailed in section 3.2.2. You can see in Figure 5.1 a full representation of our implementation.

The GUI

We currently do not have a fully functional GUI but a minimal one demonstrating the important features of our application. Indeed, we focused on the design of other aspects of our implementation; we thus leave the complete implementation of the GUI as future work. We could not adapt the previous version of the GUI as it was designed for an old version of our application using a significantly different version of the API. The GUI was implemented using the Flex technology from Adobe2. This technology allows the creation of Rich Internet Applications (RIA). We decided to create a GUI that could be accessed through a web browser so that it could be used directly with any operating system and even with smartphones. A screenshot of this basic GUI can be seen in Figure 5.2.

Figure 5.2: The GUI of our second implementation.

2http://www.adobe.com/products/flex/, last accessed 05/08/2011

Bwitter layer

This is our main layer; it contains a Nginx3 load balancer, a Tomcat4 server, the Bwitter Request Handler (BRH), the Nodes Manager (NM), the Scalaris Connections Manager (SCM) and a cache system. The Nginx load balancer is not really part of our implementation: we did not modify it, and the only thing needed in order to use it is to configure it with the IP addresses of the Bwitter nodes. As those are stateless, no other special configuration is needed. The Tomcat 7.0 application server uses Java servlets from Java EE to offer a web-based API and relays the requests to the BRH. The Tomcat servers are accessed through a reverse proxy server, in this case the Nginx load balancer, which is said to support 10,000 concurrent connections. This Nginx load balancer can be configured to serve static content, for example the GUI application, as well as to do load balancing for the Bwitter nodes. The connection of the GUI to the web-based API is performed over HTTPS in order to guarantee a secure channel. We currently have the BRH, NM and SCM running on Amazon; they are detailed in sections 5.2, 5.3 and 5.4.

The cache The SCM uses Ehcache v2.4.05 as its cache system in order to increase performance and mitigate the popular value problem we discussed in section 3.2.3. Note that we have one cache per Bwitter node and that the caches are not synchronised. The values in the caches have a time to live of one minute so that they are refreshed periodically. Values are added to the cache during read operations, not during write operations. The cache only keeps three different kinds of values in memory: tweets, passwords and references to tweets; all the other elements are accessed directly through Scalaris. As previously explained, the tweets were designed to be as immutable as possible so that they can be cached. The references to tweets are static too and are used in the posting recovery mechanism. The passwords are used very often: for each post we must fetch the hash of the password stored in the system in order to verify that the password provided is the correct one. The three elements cited above are only accessed through the cache if they are accessed via a transaction in which only they are involved; this is done in order to keep the strong consistency properties in the other cases. For example, in the first pseudo code below, the two elements would be accessed through Scalaris.

{
  begin transaction
    tweet t = read(someTweetKey)
    write(Some key, Some value)
  end transaction
}

3http://nginx.net/, last accessed 08/08/2011 4http://tomcat.apache.org/, last accessed 08/08/2011 5http://ehcache.org/, last accessed 08/08/2011

In this example the tweet can be accessed through the cache because it is the only element involved in the transaction.

{
  begin transaction
    tweet t = read(someTweetKey)
  end transaction

  begin transaction
    write(Some key, Some value)
  end transaction
}
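As an illustration of how such a per-node cache could be configured, the sketch below uses the Ehcache 2.x API with a one-minute time to live; the cache name and sizes are assumptions, not our actual configuration.

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

// Illustrative sketch: a local, unsynchronised cache with a one-minute time to live,
// as described above. The cache name and sizes are assumptions.
public class TweetCacheSketch {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create();
        // name, max elements in memory, overflow to disk, eternal, TTL (s), TTI (s)
        Cache tweetCache = new Cache("tweets", 10000, false, false, 60, 60);
        manager.addCache(tweetCache);

        // Values are only added during reads: cache the tweet just fetched from Scalaris.
        tweetCache.put(new Element("/user/alice/tweet/42", "tweet object fetched from Scalaris"));

        // Later reads involving only this element can hit the cache instead of Scalaris.
        Element cached = tweetCache.get("/user/alice/tweet/42");
        if (cached != null) {
            System.out.println(cached.getObjectValue());
        }
        manager.shutdown();
    }
}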

Scalaris layer

The lowest layer is the Scalaris layer, which is accessed and managed via the SCM and the NM. We started the development of our system with Scalaris version 0.2.3 and switched to version 0.3.0 when it was released on the 15th of July, as it gave better performance and corrected some bugs.

5.2 Nodes Manager

The Nodes Manager (NM) was designed to facilitate our tests and to allow us to easily control nodes. The NM can start Bwitter nodes as well as Scalaris nodes. We mainly use it to start Scalaris nodes to form the initial ring for our tests and to start additional Scalaris nodes during our elasticity tests. As we will further explain during our experiments in chapter 6, we are working with the Amazon cloud infrastructure. We made heavy use of the Java API6 Amazon offers in order to control the nodes, as it is closely linked to the tasks the NM performs. Indeed, this API allows starting new machines on the cloud and checking the state of the machines associated with an account. We list below the main tasks the NM performs and briefly describe how we realized them. As just said, the NM can be used to start new Bwitter nodes, but we did not design any mechanism to detect when nodes should be added or removed. There are different kinds of observable behaviours preceding flash crowds in social networks, and it should be possible to study them in order to predict flash crowds, but we did not do so; we decided instead to focus on other aspects of our system.

Start new machines The NM can send commands to Amazon in order to start new machines of a given type (Scalaris nodes or Bwitter nodes). We must fix the security group (which indicates which ports must be open on a machine), the location of the machine (east America, Europe, ...), the type of instance (s1.small, c1.medium, ...) and finally the security keys used to access the machines remotely.

6http://docs.amazonwebservices.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/ ec2/AmazonEC2.html, last accessed 07/08/2011

It is also possible to add tags to machines (in a key/value pair fashion) to identify them more easily. We use these tags to make a clear distinction between the Bwitter nodes and the Scalaris nodes.

Wait for machines to be started Once the command to start the machines has been sent to Amazon, it is necessary to wait for the machines to be running. We do this by regularly requesting the states of all the instances of the Amazon account and waiting until all the machines are in the running state. It is important to understand that the objects returned by Amazon API calls are not updated dynamically. This means that an object representing an instance may not accurately represent this instance and must be refreshed regularly to avoid working with stale information. Machines can be in five states: running (the machine is started), shutting down (the machine is stopping), pending (the machine is being started), stopped (the machine is stopped but can be restarted) and terminated (the machine is stopped and cannot be started anymore).
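The following sketch illustrates this kind of polling with the Amazon Java API; the structure, credentials and polling period are only indicative of the approach, not our actual code, and error handling is omitted.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.DescribeInstancesResult;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.Reservation;

// Illustrative sketch: poll the account until every instance reports the "running"
// state, refreshing the information at every iteration because the objects returned
// by the API are snapshots, not live views.
public class WaitForRunningInstances {

    public static void waitUntilAllRunning(AmazonEC2Client ec2) throws InterruptedException {
        while (true) {
            boolean allRunning = true;
            DescribeInstancesResult result = ec2.describeInstances();
            for (Reservation reservation : result.getReservations()) {
                for (Instance instance : reservation.getInstances()) {
                    if (!"running".equals(instance.getState().getName())) {
                        allRunning = false;
                    }
                }
            }
            if (allRunning) {
                return;
            }
            Thread.sleep(5000); // arbitrary polling period
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Credentials are placeholders.
        AmazonEC2Client ec2 = new AmazonEC2Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        waitUntilAllRunning(ec2);
    }
}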

Check machine reachability Some machines can be unreachable even if they are in the running state. We do not know the reasons why machines are sometimes unreachable, but we noticed that machines in Amazon sometimes do not respond to ping requests even from inside the private LAN of their own security group. In addition, they sometimes respond to ping requests but not to SSH because they did not correctly initialize their security keys at boot time. This problem gave us a lot of trouble during the tests: when it happens it is necessary to reboot the machine and sometimes to restart the test.

Launch a fresh Scalaris ring Once the machines are launched we still need to start Scalaris on them. This requires first creating the configuration file, whose main use is to indicate which nodes are already in the ring. By default we then stop any remaining instances of Scalaris running on those machines before restarting it. The first node is launched and the configuration file for the other nodes is built. We then launch the remaining nodes sequentially, one every 2 seconds. Finally we wait a small period of time, proportional to the size of the ring, to let it stabilise, after which the fresh ring is ready.

Add nodes to an existing ring This is similar to starting a new ring, except that several nodes are already in the ring; the configuration file therefore contains several nodes and not only the first one.

Reboot a node Particularly with version 0.2.3 of Scalaris, which we used at the beginning of our work, we frequently ended up in situations where a node was correctly inserted in the ring but it was impossible to perform any write on it. We thus created a function to restart such a Scalaris node and insert it again in the ring so that the test could continue normally. This usually happens right after the insertion of the node, so we perform a series of dummy writes to test whether the node is correctly bootstrapped, and if it is not we restart it. In version 0.3.0 of Scalaris this bug is nearly nonexistent.

Most of those functions require performing remote actions. In order to send files and run commands on remote machines we use the Java Runtime class combined with SSH and SCP. It was necessary to use the two options “-o UserKnownHostsFile=/dev/null” and “-o StrictHostKeyChecking=no” with SSH and SCP in order to skip the check of whether the host had already been seen before; otherwise the execution stalls because SSH waits for an answer that never comes. For example, to stop Scalaris on a machine, we run the following command using the exec method of the Runtime object.

ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i BwitterXM.pem [email protected] "sudo killall beam"
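The sketch below illustrates how such a remote command can be issued from Java with Runtime.exec; the helper class, the user name and the host are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative sketch: run a remote command over SSH with the two options mentioned
// above, using Runtime.exec. The key file, user and host are placeholders.
public class RemoteCommand {

    public static int run(String host, String command) throws Exception {
        String[] cmd = {
            "ssh",
            "-o", "UserKnownHostsFile=/dev/null",
            "-o", "StrictHostKeyChecking=no",
            "-i", "BwitterXM.pem",
            "ubuntu@" + host,
            command
        };
        Process process = Runtime.getRuntime().exec(cmd);
        BufferedReader output = new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = output.readLine()) != null) {
            System.out.println(line); // echo the remote output locally
        }
        return process.waitFor();     // exit status of the remote command
    }

    public static void main(String[] args) throws Exception {
        // Stop any running Scalaris (Erlang) virtual machine on the remote host.
        run("ec2-host.example.com", "sudo killall beam");
    }
}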

We use a thread pool in order to run several commands and send files in parallel to improve throughput, which is a must because the time to launch a complete ring can otherwise be very long. In conclusion, the NM allows us to automatically and efficiently launch Scalaris rings. It was a valuable tool during our tests, as we needed to start a new ring for each test we did. Once a ring is launched we still have to connect to it; we explain how we do this in the next section, which is dedicated to the Scalaris Connections Manager.

5.3 Scalaris Connections Manager

The Scalaris Connections Manager (SCM) is implemented in a producer/consumer fashion. The producers are the Bwitter functions: they produce small pieces of work which use Scalaris functions, which we call Scalaris runnables (SRs). SRs typically contain one Scalaris transaction, but they can contain several provided that the failure of one of them does not introduce an inconsistent state. The SCM stores them in a blocking FIFO queue, and the consumers, which we call Scalaris workers (SWs) and which are managed by the SCM, access this queue to execute the SRs. Bwitter functions can efficiently wait until the result of an SR is computed. Accesses to the SRs are synchronised, and the SWs notify any function that was waiting for the result of an SR as soon as it is computed or as soon as the execution of the SR is aborted. This design allows the Bwitter layer running on top of Scalaris to easily make parallel requests to different Scalaris nodes without having to manage any connections or threads: taking a big task and splitting it into several SRs pushed onto the queue of the SCM does the job. We show in the next chapter, in section 6.2.2, that controlling the number of connections to Scalaris nodes is important to get the best performance. Opening too many connections increases the degree of conflicts and does not improve performance; conversely, having too few connections lowers performance. We thus want to control the number of connections to Scalaris and avoid opening and closing them, as this needlessly consumes resources. Moreover, a connection cannot be used by several threads: it can only handle one transaction at a time, otherwise unknown errors start appearing. It is thus crucial to correctly control the access to a connection so that only one SW accesses it at a time. We solve this problem by associating a dedicated connection to each SW. We could

have solved it differently: the SCM could have managed a pool of connections instead of a pool of SWs and dispatched arriving work to a new thread with a free connection. We believe our solution is better because we do not need to create a new thread for each SR. A thread is created only once, when a SW is created, which tremendously limits the time spent managing the life cycle of threads. We show below, in Figure 5.3, a drawing of the architecture of the whole SCM connected to Scalaris nodes.

Figure 5.3: Scalaris Connections manager connected to Scalaris nodes.

It is possible to ask the SCM to add a new SW to the existing ones or to remove a SW on the fly; this does not need to be statically configured. The SCM always connects the new SW to the Scalaris node that has the lowest number of connections. It does so by associating a Scalaris node with the SW; the SW is then responsible for opening the connection to that Scalaris node. The SWs are responsible for managing the connection they have opened: they automatically reconnect if the connection is lost and they also restart a Scalaris node if it has crashed. This must be done carefully because several SWs can be using the same Scalaris node; we thus synchronized them so that only one SW is responsible for restarting a dead Scalaris node. The state machine of a SW can be seen in Figure 5.4; it highlights the different states and the events leading from one state to another. The SW starts by trying to connect to its Scalaris node; if too many connection attempts fail, the SW restarts the node and retries. Once the SW is connected it waits for SR jobs to be run on the Scalaris node, runs them, retrieves the results and waits for another SR. If the connection with the Scalaris node is lost the SW tries to reconnect.

Figure 5.4: State machine of a Scalaris Worker.
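To make the producer/consumer structure more concrete, the sketch below shows a heavily simplified version of it in Java; the class names are hypothetical and the real SCM additionally handles reconnection, node restarts and the SR retry logic.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch of the SCM structure: Bwitter functions push Scalaris runnables
// (SRs) onto a blocking FIFO queue, and each Scalaris worker (SW) owns one dedicated
// connection and executes the SRs it takes from the queue.
public class ConnectionsManagerSketch {

    private final BlockingQueue<FutureTask<Object>> queue = new LinkedBlockingQueue<FutureTask<Object>>();

    // Producer side: wrap an SR so that the calling Bwitter function can block on its result.
    public FutureTask<Object> submit(Callable<Object> scalarisRunnable) {
        FutureTask<Object> task = new FutureTask<Object>(scalarisRunnable);
        queue.add(task);
        return task;
    }

    // Consumer side: one worker thread per dedicated Scalaris connection.
    public void startWorker(final String scalarisNode) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                // A real SW would open its connection to scalarisNode here and
                // reconnect or restart the node on failure.
                while (true) {
                    try {
                        FutureTask<Object> task = queue.take(); // blocks until an SR arrives
                        task.run();                             // executes the SR and stores its result
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public static void main(String[] args) throws Exception {
        ConnectionsManagerSketch scm = new ConnectionsManagerSketch();
        scm.startWorker("scalaris-node-1");
        FutureTask<Object> result = scm.submit(new Callable<Object>() {
            public Object call() {
                return "result of one Scalaris transaction";
            }
        });
        System.out.println(result.get()); // blocks until the SR has been executed
    }
}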

An important thing to notice with this design is that an SR can never create another SR, add it to the blocking queue of the SCM and wait for its result. This situation can indeed create deadlocks. Take the simple case where we only have one SW: it takes an SR, called sr1, and executes it. If sr1 creates another SR, called sr2, adds it to the blocking queue and waits for its result, we have a deadlock. Indeed, no SW will ever execute sr2 as our only SW is already busy with sr1.

5.3.1 Failure handling

An SR can fail but, as you can see in Figure 5.4, the SW will run the SR several times before aborting it and taking a new one. This implies that SRs must be designed in such a way that when they fail they do not introduce partial state in the database and can thus be restarted without any risk. Being able to restart jobs at this low layer is important because it simplifies the algorithms that are running on top of it: they are not forced to develop their own strategy to recover from failures of Scalaris operations. This is needed if we want to avoid aborting high-level tasks too often; those tasks can be quite complex and contain several SRs, which increases the probability that at least one SR fails. We only throw an exception at the higher level when the SR has failed several times. Algorithms running on top of the SCM can then decide whether they want to completely abort the task they were running or to resend the SR to the SCM.
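The retry behaviour of a SW can be summarised by a loop of the following form (illustrative sketch only; the exception type and the maximum number of trials k are assumptions).

import java.util.concurrent.Callable;

// Illustrative sketch: run an SR up to k times before reporting the failure, as a SW
// does. SRs must leave no partial state behind when an attempt fails, so retrying is safe.
public class RetryingExecution {

    public static <T> T runWithRetries(Callable<T> scalarisRunnable, int k) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= k; attempt++) {
            try {
                return scalarisRunnable.call();
            } catch (Exception e) {
                lastFailure = e; // aborted transaction or connection problem: try again
            }
        }
        // Only after k failed attempts is the failure propagated to the layer above,
        // which can then abort its task or resubmit the SR.
        throw lastFailure;
    }
}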

5.4 Bwitter Request Handler

In this section we detail the most important algorithms we used in the Bwitter Request Handler: posting a tweet, reading tweets and deleting tweets. We have developed two different approaches for posting and reading tweets: the push and the pull. As represented in Figure 5.5, the push approach (on the left) is when the user who posts a tweet is responsible for inserting the references inside the lines of his followers. The pull approach (on the right) is when the user fetches all the references himself from the lines of the people he follows.

Figure 5.5: Representation of the read (dotted arrows) and write (full arrows) processes. The tweets are posted in lines (rectangles) and read from them. Left) Push design with one line per reader. Right) Pull design with one line per writer.

Note that in the following algorithms we never post the whole tweet object into the lines of the followers but rather a reference to it. So when we say we post a tweet in someone's line we mean the tweet reference. As explained in the previous chapter, a reference contains the posting date, the username of the author or retweeter, the username of the original poster if it is a retweet, and the key of the referenced tweet. This limits the amount of redundant data stored and makes it possible to easily edit or delete a tweet, as explained in section 5.4.1 where we detail the operations to delete a tweet. All the algorithms we designed use the Scalaris Connection Manager we just presented. We do not develop any recovery mechanism to handle the failure of the execution of one of the SRs involved in an algorithm: we assume that the recovery mechanism of the Scalaris Connection Manager is sufficient, so that failures will be scarce. A failure of one SR can thus make a higher level function abort in some cases, forcing the user to restart the operation manually.

The algorithms are developed to work on a transactional key/value datastore with a list abstraction. However, this abstraction is not strictly necessary; we could use a classical key/value datastore and develop our own parsing to simulate lists. An important fact to keep in mind is that concurrent reads on the same key do not conflict in Scalaris: two parallel reads in two different transactions will not lead to an abort. On the other hand, two parallel transactions with one writing a key and the other writing or reading the same key will conflict, and this can lead to an abort. In this section, we explicitly detail the pseudo code for posting and reading tweets for the push and pull approaches, and we compare the number of Scalaris operations they need. We do not take into account the fact that some operations can fail; the number of operations in the worst case can easily be computed from the number of operations in the normal case by multiplying it by a factor of k, where k is the maximum number of trials for one SR. We use the notation SR(some piece of code) to make explicit that a piece of code is executed by the SCM inside an SR; such a call is non-blocking. To get the result of a piece of code executed inside an SR we write result = SR(some piece of code), which blocks until the result is computed or an exception is thrown.

5.4.1 The push approach

Post a Tweet

Posting a tweet is a core function of our application; it is thus important to have an efficient as well as robust way to post tweets. Our tweet posting algorithm must be able to handle the failure of a datastore node as well as the failure of an application node during the posting of the tweet. This algorithm must also scale with the number of followers: it is necessary to take into account that some users can have millions of followers. Below you can see the skeleton of the algorithm; it is composed of three main parts: posting the tweet object, posting the references to the followers' lines and updating the value of the last tweet correctly posted. Those parts are detailed in the next subsections. The algorithm can be adapted for retweets and replies to tweets, but we do not detail this here as it is very similar.

postNewTweet(posterName, msg){
  // First step: post the tweet object. If this step succeeds the tweet
  // will eventually be posted everywhere.
  tweetNbr = SR(posttweet(posterName, msg))

  // Second step: produce the chunkProcessors. Each of them is an SR responsible
  // for posting the remaining tweets to the followers' lines for a given chunk
  // of the Topost set until it reaches tweetNbr.
  SRlist = SR(produceChunkProcessors(posterName, tweetNbr))

  // Add all the chunkProcessors produced to the SCM.
  foreach sr in SRlist {
    add sr to the SCM
  }

  // Wait for the termination of the chunkProcessors and check that none failed
  // (all the tweets up to tweetNbr correctly posted for all the followers).
  failed = false
  foreach sr in SRlist {
    try {
      // Block until the result is computed; no real result is returned,
      // we just check that nothing went wrong.
      result = sr.get
    } catch (exception) {
      failed = true
    }
  }

  // Third step: if none of the previously launched chunkProcessors has failed,
  // mark the tweets as posted up to tweetNbr.
  if (!failed) {
    SR(markTheTweetsAsPosted(posterName, tweetNbr))
  }
}

The algorithm starts with the posting of the tweet; if this first step finishes without errors we can guarantee that the tweet will eventually be posted on the lines of all the followers. Otherwise we abort the posting of the tweet and the user must manually restart it. The second step is responsible for pushing the information to the lines of the followers. This is the heavy part of the job, so we decided to cut it into several independent SRs that run on several Scalaris nodes in parallel. We added a repair mechanism which logs the operations successfully performed in order to recover from failures during this part. Finally, the last step marks the tweet as correctly posted on all the lines. As just mentioned, only the first step is needed to have the tweet eventually posted to all the followers: subsequent executions of the algorithm will automatically repair previously started work that failed. This repair is done during the “Post the references” phase detailed later in this section.

Post the tweet object This first step is executed as one SR to guarantee atomicity. As mentioned, if it succeeds the tweet will eventually be posted on all the lines. A tweetNbr uniquely identifies one tweet for a given user. As you can see below, it is attributed at the creation of the tweet.

posttweet(posterName, msg){
  begin transaction
    tweetNbr = read(/user/posterName/tweet/size)
    postingDate = currentDate()
    tweetReference = buildTweetRef(posterName, tweetNbr, postingDate)
    tweet = buildTweet(posterName, tweetNbr, postingDate)
    write(/user/posterName/tweet/tweetNbr/reference, tweetReference)
    write(/user/posterName/tweet/size, tweetNbr+1)
    write(/user/posterName/tweet/tweetNbr, tweet)
  end transaction
  return tweetNbr
}

You can notice that we save the tweet reference in order to easily recover from failure later.

Post the references We now explain the next step of the posting algorithm which is posting the references on the lines of the followers. This step repairs any previously started tweet posting which has failed after the first step. It can also be run for this purpose only. This repair mechanism is needed as this part of the algorithm is highly subject to failures. Indeed, it writes to the line of every follower. Therefore, it can potentially conflict with followers reading their lines and other posters posting their tweets. This step is cut in two substeps: the first substep is to create the chunkPro- cessors and the second one is to execute them. Some stars have millions of followers, it would thus not be scalable to do the whole work in one big transaction. Therefore, we split the work into several SRs run on dif- ferent Scalaris nodes. Now remember that the Topost set is cut in several chunks. We associate one SR to each chunk of the Topost set. It is responsible of posting all the remaining tweets (which usually limit to the new tweet posted if there were no failures before) to all the followers in its attributed chunk. We call an SR with this precise task a chunkProcessor.A chunkProcessor stops when it reaches the tweetNbr with which he was initialized. tweetNbr corresponds to the tweet number of the last tweet that the chunkProcessor must post to the lines. If a chunkProcessor finishes with- out error, we are sure that all the tweets up to tweetNbr are correctly posted for this Topost chunk. The pseudo code below details the creation of the chunkProcessors.

//tweetNbr is the last tweet to post on the lines. produceChunkProcessors(posterName, tweetNbr){

// First we do a check in order to verify if the job is not already done, this step can be skipped as it is just an optimisation. This test is equivalent to test if each counter associated with a chunk of the Topost set has a value at least equivalent to tweetNbr but is quicker as only one key must be accessed. begin transaction lastTweetNbrCorrectlyProcessed = read(/user/posterName/tweet/processed) end transaction if(lastTweetNbrCorrectlyProcessed >= tweetNbr) return new emptyList

// Read the number of chunk in the Topost set. begin transaction nbrOfToPostSetChunks = read(/user/posterName/topost/size) / topostSetChunkSize +1 end transaction

// Create the different chunkProcessors. chunkIndex = 0 SRlist = new emptyList while(chunkIndex < nbrOfToPostSetChunks){ SRlist.add(new chunkProcessor(posterName, chunkIndex, tweetNbr)) chunkIndex++

}

return SRlist }

You can notice that we only create the chunkProcessors in this part of the algorithm and do not execute them. They are executed by the main posting algorithm detailed at the beginning of this section. Indeed, chunkProcessors are SRs and, as explained at the end of section 5.3 where we present the SCM, an SR can never wait for the result of another SR it launched, because this can create deadlocks. We detail below the algorithm of a chunkProcessor, which makes explicit how we post the references to all the lines of the followers contained in a chunk. chunkProcessor(chunkIndex, tweetNbr, posterName){ while(true){ begin transaction //Compare the value of the current chunkCounter with tweetNbr; if chunkCounter is bigger or equal the job is done. chunkCounter = read(/user/posterName/topost/chunkIndex/counter) if(chunkCounter >= tweetNbr) return

//Get the reference corresponding to the next tweet that is not yet posted. The tweet number of this tweet is equal to chunkCounter+1, as chunkCounter is the number of the last correctly posted tweet for this Topost set chunk. tweetReference = read(/user/posterName/tweet/(chunkCounter+1)/reference)

//Read the Topost set chunk containing the references to the followers' lines. lineKeys = read(/user/posterName/topost/chunkIndex)

//Finally we add the reference to all the lines referenced in the chunk; the function called must execute in the same transaction as the current one. foreach lineKey in lineKeys { addReferenceToLine(tweetReference, lineKey) } //Record that tweet chunkCounter+1 is now posted for this chunk. write(/user/posterName/topost/chunkIndex/counter, chunkCounter+1) //If chunkCounter has reached tweetNbr exit the loop. if(chunkCounter+1 >= tweetNbr) return end transaction } }

We must now detail the addReferenceToLine function, which is responsible for posting a particular reference to the line of a follower. Remember that the lines containing the references are divided into chunks too. For this part we thus have two choices: either we cut the line while posting, or we put the burden of cutting at another moment (for example when a user reads his new tweets).

Triggered cutting In the “triggered cutting” solution we do not cut the line during the posting. Indeed, we prefer to do it at another moment in order to lighten the post tweet function, which is already quite subject to failures. The only necessary operation is thus to add the tweet reference to the head.

addReferenceToLine(tweetReference, lineKey){ add(lineKey/head, tweetReference) }

The line must thus be cut at another moment. We chose to do it when a user reads his tweets. Indeed, the only operation needed to check whether the head must be cut is to read the head, which is almost always what the read tweet operation does anyway, as the head chunk contains the latest tweets. Hence the algorithm presented below must be run each time a user reads his tweets. Most of the time the algorithm does not impose any overhead on readings, as the head must only be cut when it is full. By taking advantage of the read tweet operation, we can avoid reading the head again during the cutting mechanism. However, we present below a version of the cut mechanism where we read the head, in order to show the complete algorithm. In a real implementation the head would be given as an argument. splitHead(lineKey){ begin transaction headChunk = read(lineKey/head)

// While the head is too big we transfer nbrTweetsPerChunk to a new chunk. headChanged = false nbrOfChunkCreated = 0 if(headChunk.size <= nbrTweetsPerChunk) return ;

// Number of chunks in the line excluding the head. nbrOfChunkInLine = read(lineKey/nbrchunks)

while(headChunk.size > nbrTweetsPerChunk){ headChanged = true // Remove the nbrTweetsPerChunk oldest tweets from the headChunk; this does not modify the datastore, just our local copy. newChunk = removeOldest(headChunk, nbrTweetsPerChunk)

// Write the new chunk in the line write(lineKey/(nbrOfChunkInLine+nbrOfChunkCreated), newChunk) nbrOfChunkCreated++ }

// If the head has changed write the new head and update the number of chunks . if(headChanged){ write(lineKey/head, headChunk) write(lineKey/nbrchunks, nbrOfChunkInLine+nbrOfChunkCreated) } end transaction }

In conclusion, we can observe that posting on the line is really easy because you only have to add an element to a set, but you have to pay the price later when splitting the line.

Cutting the line while posting We now present the addReferenceToLine version where we cut the line while posting. We add the tweet in the head of the line, and, if

the head is full, we flush the head and create a new chunk. The overhead for cutting is thus paid while posting but not at each post. addReferenceToLine(tweetReference, lineKey){ // Read the head. headList = read(lineKey/head)

// Check if the head is full and we need to create a new chunk. if(headList.size >= nbrTweetsPerChunk){ // Replace the head by a fresh one containing only the new tweet. newList = new emptyList newList.add(tweetReference) write(lineKey/head, newList) chunkNumber = read(lineKey/nbrchunks)

// Write the new number of chunk in the line and the old head to the new chunk . write(lineKey/chunkNumber, headList) write(lineKey/nbrchunks, chunkNumber + 1) } else { headList.add(tweetReference) write(lineKey/head, headList) } }

Observe that we usually do not do more operations than in the triggered cutting as we only need to make a new chunk when the head is full. Thus adding a reference to a line takes 1 read and 1 write usually and occasionally 2 reads and 3 writes.

Chronological ordering We have shown how to post tweets on lines in a reliable and efficient way; however, some tweets might be misplaced due to application failures during the posts or latency in the network. We propose an improvement to maintain strong chronological ordering between the tweets in all situations. The idea is to have a date associated with each chunk of a line. This date would be equal to the posting date of the newest tweet of the previous chunk. This way, when we add a tweet to a chunk we check that its posting date is newer than the date associated with the chunk. If it is not the case we walk back through the line, find the first chunk for which it is true and add the tweet to this chunk. This means that we can have more than nbrTweetsPerChunk tweets in a chunk, but this has no repercussions on the other algorithms. We can adapt the two algorithms described above to impose chronological ordering as just explained, but we do not detail it here. It complicates the posting algorithm, which should be as light as possible in order to achieve the best scalability. We believe that it is not absolutely crucial to have perfect ordering between the tweets and that we thus should not make the post algorithm even more complex.

Mark tweet as correctly posted This is the final step of the algorithm. If everything succeeded before, we can be sure that the tweets of the user up to tweetNbr are correctly posted on the lines of the followers present in his Topost set at the time of

the posting. We can thus update the lastTweetNbrCorrectlyProcessed variable to tweetNbr. As already mentioned, this step is not mandatory and could be skipped; it only permits testing more efficiently, in the produceChunkProcessors part of the postNewTweet algorithm, that the tweets were correctly posted. We must take into account that several runs of the postNewTweet algorithm can be running concurrently. This can happen if a user posts two tweets quickly or if the recovery part of the algorithm was called in response to some event. It is thus crucial to test the value of lastTweetNbrCorrectlyProcessed before erasing it with tweetNbr; indeed, another run posting a newer tweet (and thus a tweet with a higher tweetNbr than the one we are working on) may have just written a newer value for lastTweetNbrCorrectlyProcessed.

markTheTweetsAsPosted(posterName, tweetNbr){
  begin transaction
    lastTweetNbrCorrectlyProcessed = read(/user/posterName/tweet/processed)
    if(lastTweetNbrCorrectlyProcessed < tweetNbr)
      write(/user/posterName/tweet/processed, tweetNbr)
  end transaction
}

Theoretical performance analysis This algorithm is heavy, which is normal: in the push approach we favor the reads and put the burden on the writes. Let us try to get an idea of how many operations these algorithms need. When we talk about operations we mean reads and writes on Scalaris; we have observed while testing that writes and reads on Scalaris take approximately the same time. An operation adding a value to a list using Scalaris requires a read and a write, because there are no built-in operations on sets. We now analyse the three steps of the algorithm. The first step is the posting of the tweet in the datastore; it requires 4 operations (1 read and 3 writes) in one transaction to post the tweet object, post the tweet reference and update the tweetNbr. The second step is the posting of the references in all the lines. This is the heaviest step; the number of operations depends on the number of followers (nbrFollowers) that a given user has. We first check the lastTweetNbrProcessed (one read): the job is done if the check indicates that everything is correctly posted, which can happen during recovery and concurrent posting. Assuming that we are in the normal situation where everything goes correctly, we have one tweet to post and all the previous tweets were correctly posted on all the lines. We read the size of the Topost set (one read); then we can dispatch the work for each chunk of the Topost set. So we need two reads to create the chunkProcessors. Each chunk of the Topost set requires one transaction; the size of the transaction (the number of keys it works on) depends on the number of followers per chunk of the Topost set (nbrOfFollowersPerChunk). The size of the transaction for each chunk is proportional to nbrOfFollowersPerChunk and the number of transactions is inversely proportional to nbrOfFollowersPerChunk. We assume we had no failures previously and that there is only one tweet to post for all the chunks. Each chunkProcessor thus

reads and writes its counter one time. This requires 2 × nbrOfTopostSetChunks operations or, equivalently, 2 × nbrFollowers/nbrOfFollowersPerChunk operations. We must then post the reference on the line of each follower in the Topost set. The complexity of this operation depends on whether we cut the line while writing or not. If we do not cut it, we only need 2 operations to update the head with the new reference. If we cut while posting, we must occasionally also create a new chunk and flush the head, which requires 3 additional operations. On average we must cut the line every nbrTweetsPerChunk tweets; we thus assume that the amortised cutting cost for one posting is 3/nbrTweetsPerChunk operations. Those operations must be done for every follower, so we finally get 2 × nbrFollowers operations for the posting without cutting and nbrFollowers × (2 + 3/nbrTweetsPerChunk) operations for the posting with cutting. Although we do not cut the lines while posting in the first option, we would like to compute the overhead of cutting those lines at another moment, for example while reading the new tweets, as this allows us to avoid the extra read of the head. For each new chunk created while cutting the head we need to do a write (thus in total nbrNewChunk writes). We must also flush the head (one write) and update the number of chunks in the line (one read and one write). We thus have 3 + nbrNewChunk operations. If we consider that a reader reads his tweets regularly, nbrNewChunk is generally equal to 1. The last step only requires 2 operations in one transaction: one to read the lastTweetProcessed and one to update it. To summarize, we have a different number of operations (nbOp) to perform depending on whether we decide to cut while reading or while writing:

Cutting while reading:

nbOp = 8 + 2 × nbrFollowers + 2 × nbrFollowers/nbrOfFollowersPerChunk
     = 8 + nbrFollowers × (2 + 2/nbrOfFollowersPerChunk)     (5.1)

Cutting while writing:

nbOp = 8 + nbrFollowers × (2 + 3/nbrTweetsPerChunk) + 2 × nbrFollowers/nbrOfFollowersPerChunk
     = 8 + nbrFollowers × (2 + 2/nbrOfFollowersPerChunk + 3/nbrTweetsPerChunk)     (5.2)

Difference between the two techniques:

Diff = nbrFollowers × 3/nbrTweetsPerChunk     (5.3)
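As a purely illustrative example with assumed values nbrFollowers = 10000, nbrOfFollowersPerChunk = 100 and nbrTweetsPerChunk = 50, cutting while reading costs nbOp = 8 + 10000 × (2 + 2/100) = 20208 operations per posted tweet, cutting while writing costs 8 + 10000 × (2 + 2/100 + 3/50) = 20808 operations, and the difference is 10000 × 3/50 = 600 operations, i.e. about 3% of the total.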

The difference between the two is small, but we believe that the overhead introduced for cutting while posting can have side effects. The number of operations involved in the posting of a tweet is mainly influenced by two parameters that we can control: nbrOfFollowersPerChunk and nbrTweetsPerChunk. Increasing nbrOfFollowersPerChunk reduces the burden induced by the management of the counter associated with each chunk of the Topost set. However, it also makes transactions more complex because each chunkProcessor works on more elements. On the other hand, increasing nbrTweetsPerChunk means that we cut the line less often, but we waste resources each time we update a chunk because we are forced to load a bigger chunk. We made some tests, in the experiment chapter at section 6.3.2, to observe the impact of the nbrOfFollowersPerChunk parameter. As pointed out previously, if we want to guarantee strong ordering of the tweets between the chunks of a line we have to perform more operations. With the cutting-while-posting solution we would need to read one additional structure at each tweet insertion. In the normal case the tweet will be inserted in the head chunk because it is newer than all the ones previously posted. However, sometimes we have to insert the tweet in an older chunk. This means that we have to walk back and find the adequate chunk, involving in the worst case as many operations as there are chunks in the line; this case is however very unlikely to happen. With the triggered cutting solution we would not need any additional operations during the insertion because we always insert the tweet in the first chunk; the burden related to the walk back would be transferred to the splitHead function.

Delete a tweet

If the real tweet were posted on all the lines, it would be necessary for the delete operation to find all the lines where the tweet was posted and to remove the tweet from all of them. This would be really impractical for several reasons. First, you would need to find where a particular tweet was posted: it is not enough to know the lines where a tweet was posted, you must also find the chunk of the line in which the tweet was posted. You must thus either maintain this information for each tweet or walk through all the chunks of the line in order to find and delete the tweet. This is why we post references to the tweets in the lines. To delete a tweet we only need to access the tweet object that is located at a given key and mark it as deleted. The BRH checks this mark when fetching a tweet and discards the tweet if it has been marked as deleted.

Reading tweets

We now explain how we fetch tweets from the lines. Users on social networks usually want to retrieve the latest news and less frequently walk back to find older posts. We thus assume that users want to retrieve the tweets from the newest to the oldest, so we do not load the whole line; instead we load only some tweets from it. Because lines are already cut into chunks, it is natural to fetch one chunk of the line at a time, starting with the first chunk of the line, called the head, which contains the newest tweets. However, it is possible to access one given chunk of the line directly if needed. The first chunk of the line can be accessed directly because the head is at a fixed location. We suppose that the line is already cut when we read it. If we want to access the chunk that follows the head, we have to retrieve the number of chunks in the line, compute the key of the penultimate chunk and request it. The next step is to filter the references in order to discard the tweets posted by users we do not follow anymore. Indeed, we never remove the tweets posted by a user from a line. This means that all the tweets that were posted while we were following a user we no longer follow will stay forever on the line. It also implies that if we decide to follow a user again, his tweets will reappear on the line. Chunks only contain references to tweets, so we still have to fetch the tweets using the references remaining after the filtering. Once we have retrieved the tweets we filter out the deleted tweets. You can notice that we are forced to load the tweets before we can filter out the dead tweets, as the references do not indicate whether the tweet is deleted or not. Once the filtering has been done we can return the pack of remaining tweets. We present below the pseudo code we have implemented to read nbrTweets tweets from a line. This code is run in an SR. To avoid complicating the code we do not show the recovery mechanism inside it. In the implementation, while we are fetching the tweets we do not abort the operation if we cannot fetch a tweet; instead we just skip it. The SR fails only if other data needed to fetch the tweets is not accessible; one missing tweet, on the other hand, does not compromise the rest of the operation. We could also split the SR in two parts if we wanted to add the cutting mechanism: the first part of the SR would read the head and split it if needed, then give the current head as an argument to the second part of the algorithm, removing the need to read it again.

getTweetsFromLine(nbrTweets, linename, username){
  refList = read(/user/username/line/linename/head)

chunkIndex = read(/user/username/line/linename/nbrchunks) - 1 while(refList.size < nbrTweets && chunkIndex>-1 ){ //Read the current chunk refList.add(read(/user/username/line/linename/chunkIndex)) chunkIndex-- }

users = read(/user/username/line/linename/users) filter(refList, users)

tweets = new tweetList

for each tweetRef in refList{ tweet = read(/user/tweetRef.posterName/tweet/tweetRef.tweetNbr) if(! tweet.isDeleted) tweets.add(tweet) } orderTweetsFromNewestToOldest(tweets) return tweets }

Having the pseudo code, we can, as we did for the posting algorithm, compute the number of operations needed on Scalaris. The number of chunks we read depends on nbrTweets and nbrTweetsPerChunk. We read the number of chunks in the line (1 read). We then read nbrTweets/nbrTweetsPerChunk chunks to get nbrTweets tweet references. Then, to filter the users, we must retrieve the user list associated with the line (one read). We must then do nbrTweets reads (minus the number of tweets associated with users that are no longer on the line) to get the real tweets. Considering that all the tweets we fetched were posted by users still associated with the line, the result is:

nbOp = 2 + nbrTweets + nbrTweets/nbrTweetsPerChunk     (5.4)

The heavy part is thus the fetching of the tweets. Had we posted the tweets themselves instead of references, we would obtain 2 + nbrTweets/nbrTweetsPerChunk operations, tremendously reducing the number of operations to do (but not the amount of data to fetch); however, the delete tweet operation would have been much more complex, as we explained before.
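As a purely illustrative example with assumed values nbrTweets = 100 and nbrTweetsPerChunk = 50, reading a line costs nbOp = 2 + 100 + 100/50 = 104 operations, of which the 100 reads of the tweet objects clearly dominate; storing full tweets in the lines would bring this down to 2 + 100/50 = 4 operations, at the cost of a much more complex delete operation.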

Add a User to a line

We explain here how we add a newly followed user (newfollowed) to an existing line (linename) of a user (username). We first check whether newfollowed is not already in the set of users associated with linename (one read). If it is not already present we add it (one write). Once this is done we add a reference to linename in the Topost set of newfollowed (one read and one write). We also create an object containing a reference towards the chunk of the Topost set in which we added the reference to linename, so that we can easily remove it later. Note that those are not the same chunks as the ones we use to divide lines. In total we thus have a cost of 3 writes and 2 reads. Sometimes we must also create a new chunk; in this case we must update the number of chunks and thus add 2 writes and one read. addUserToLine(username, linename, newfollowed){ SR( begin transaction users = read(/user/username/line/linename/users) if(newfollowed belongs to users) return

users.add(newfollowed)

73 write(/user/username/line/linename/users,users) lasttopostchunk = read(/user/newfollowed/toposet/nbrchunks))-1

reflist = read(/user/newfollowed/topostset/lasttopostchunk)

// We must create a new chunk if(reflist.size >= nbrOfFollowersPerChunk){ lasttopostchunk++ reflist = newList write(/user/newfollowed/topostset/nbrchunk, lasttopostchunk+1) //Create the counter associated with the chunk lastTweetNbr = read(/user/newfollowed/tweets/size)-1 write(/user/newfollowed/topostset/lasttopostchunk/counter, lastTweetNbr) } reflist.add(new ref(username, linename)) write((/user/newfollowed/topostset/topostchunk/, reflist)

// Write the chunk of the topost we posted in for easy removal. write(/user/username/newfollowed/linename/, lasttopostchunk) end transaction ) }

Remove a user from a line

We now want to remove a user (followingUsername) from a line (linename). We first remove followingUsername from the set of users associated with linename (one read and one write). We then read the object (see "Add a user to a line") containing the number of the chunk of the Topost set in which we added the reference to linename and delete it (one read and one write). We can then locate the chunk and remove the reference from it (one read and one write). In total this makes 3 reads and 3 writes. Note that we do not modify the number of chunks in the Topost set even if a chunk becomes empty; we do not want to remap the keys of already existing chunks, which depend on this number of chunks.

    removeUserFromLine(username, linename, followingUsername){
        SR(
            begin transaction
            users = read(/user/username/line/linename/users)
            if(! followingUsername belongs to users) return
            users.remove(followingUsername)
            write(/user/username/line/linename/users, users)

            topostchunk = read(/user/username/followingUsername/linename/)
            delete(/user/username/followingUsername/linename/)

            reflist = read(/user/followingUsername/topostset/topostchunk)
            reflist.remove(new ref(username, linename))
            write(/user/followingUsername/topostset/topostchunk, reflist)
            end transaction
        )
    }

Create a user

The first thing to do when creating a user is to check whether another user with the desired username is already registered in the system. To do so we check whether there is already a value at the key "/user/username". If there is, we conclude that a user with this username is already registered and the user creation is aborted. Otherwise a user object containing all the information of the user is created and stored at this key.
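The check-then-create logic can be illustrated with a minimal sketch. The snippet below is not the Bwitter implementation; it only mimics the test-and-set semantics with an in-memory map (in the real system the check and the store are executed inside a single Scalaris transaction so that two concurrent creations of the same username cannot both succeed). The class and key format are ours, for illustration only.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class CreateUserSketch {
        // In-memory stand-in for the datastore; keys follow the "/user/<username>" scheme.
        private final ConcurrentMap<String, Object> store = new ConcurrentHashMap<>();

        /** Returns true if the user was created, false if the username is already taken. */
        public boolean createUser(String username, Object userObject) {
            // putIfAbsent only stores the value when no value exists at the key,
            // which is the behaviour we need from the transactional check.
            return store.putIfAbsent("/user/" + username, userObject) == null;
        }

        public static void main(String[] args) {
            CreateUserSketch sketch = new CreateUserSketch();
            System.out.println(sketch.createUser("alice", "alice's profile")); // true
            System.out.println(sketch.createUser("alice", "impostor"));        // false, name taken
        }
    }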

5.4.2 The pull approach

As in section 4.3.3 of the datastore design chapter, we tried to re-use as many of the building blocks and mechanisms of the push approach as possible while still having efficient alternative algorithms. With the pull approach only the mechanisms needed to post and retrieve tweets are heavily modified. Other basic mechanisms, such as adding a user, are simplified because there is no Topost set to keep up to date anymore and some fields no longer need to be initialised. Finally, some simple mechanisms such as deleting tweets are exactly the same.

Post a tweet

The tweet itself is posted in much the same way as with the push approach explained in section 5.4.1. The difference is that the user now posts the tweet reference at a single location, which varies with time. This means that all the tweets posted during a given time frame are grouped together and accessed via the same rounded timestamp. We call the set containing all the tweet references corresponding to a time frame a postTimeFrame. The timestamp is rounded to the desired time granularity by setting some of its fields to 0, as explained in section 4.3.3 of the datastore chapter.

    posttweet(posterName, msg){
        SR(
            begin transaction
            tweetNbr = read(/user/posterName/tweet/size)
            tweet = buildTweet(posterName, msg, tweetNbr)
            write(/user/posterName/tweet/tweetNbr, tweet)
            write(/user/posterName/tweet/size, tweetNbr+1)
            postingDate = currentDate()
            tweetReference = buildTweetRef(posterName, tweetNbr, postingDate)
            references = read(/user/posterName/tweet/timestamp)

            references.add(tweetReference)
            // Write the reference to the given postTimeFrame
            write(/user/posterName/tweet/timestamp, references)
            end transaction
            return tweetNbr
        )
    }

As expected, the post tweet operation is much lighter in this case, with only 5 operations in total (2 reads and 3 writes). One could wonder why we still post references instead of full tweets. The reason comes from the algorithm used to read tweets: as we explain in the next section, a time frame must be read for each followed user, so we wanted to keep the size of a time frame small.

Reading the tweets

This operation is now heavier as it has to retrieve the references from each author. We have kept the chunk number format for the sake of simplicity and compatibility with the existing API. Chunk 0 is the very first chunk associated with the user; the timestamp of this chunk is the rounded registration time of the user. For example, if a user registered at 05/06/11 15 h 00 min 00 s GMT and the time granularity is one hour, when he requests chunk 2 he fetches all the tweets posted between 05/06/11 17 h 00 min 00 s GMT and 05/06/11 17 h 59 min 59 s GMT by all the users he follows. If a negative chunk number is requested, the latest chunk is returned along with its real chunk number. In the same fashion as the post tweet in the push approach, we create a series of smaller tasks: one SR per followed user, responsible for fetching the tweets of that user. The tweets fetched are those in the chunk corresponding to the chunk number cNbr given as argument to the function getTweetsFromLine, described below.

    getTweetsFromLine(username, linename, cNbr){
        // First step
        // Produce a list of SRs to add to the SCM, each SR takes care of one user
        SRlist = SR(produceLineProcessors(username, linename, cNbr))

        // Second step
        // Add all the SRs produced to the SCM and get their tweets
        for each sr in SRlist
            add sr to the SCM
        for each sr in SRlist {
            try {
                // Block until the result is computed
                result.add(sr.getTweets)
            } catch (exception) {
                // The tweets of a user could not be retrieved, in this case we abort the reading.
                return null
            }
        }

        chronologicalSort(result)

        return result
    }

produceLineProcessors This part creates one lineProcessor per followed user. A lineProcessor takes as argument the key of the chunk that it must fetch. To compute the key of the chunk we must convert cNbr to a date because, as already explained, lines are fragmented according to time, so each chunk in a line corresponds to a specific date (a sketch of this conversion is given after the code below).

    produceLineProcessors(username, linename, cNbr){
        SR(
            startTime = read(/user/username/starttime)
            dateKey = chunkToDate(startTime, cNbr)
            users = read(/user/username/line/linename/users)

            SRlist = new emptyList
            for each user in users{
                SRlist.add(new lineProcessor(dateKey, user))
            }
            return SRlist
        )
    }
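To make the chunk-to-date mapping concrete, here is a minimal sketch of what a chunkToDate helper could look like. It is not taken from the Bwitter code; it assumes that timestamps are epoch milliseconds and that the time granularity is one hour, and it simply rounds the registration time down to the granularity and adds cNbr time frames.

    import java.util.concurrent.TimeUnit;

    public class ChunkToDateSketch {
        // Assumed granularity: one hour per postTimeFrame.
        private static final long GRANULARITY_MS = TimeUnit.HOURS.toMillis(1);

        /** Maps a chunk number to the (rounded) timestamp used as part of the key. */
        static long chunkToDate(long registrationTimeMs, int cNbr) {
            // Chunk 0 corresponds to the registration time rounded down to the granularity.
            long chunkZero = (registrationTimeMs / GRANULARITY_MS) * GRANULARITY_MS;
            // Chunk cNbr starts cNbr time frames later.
            return chunkZero + cNbr * GRANULARITY_MS;
        }

        public static void main(String[] args) {
            long registration = System.currentTimeMillis();
            // With hour granularity, chunk 2 starts two hours after chunk 0.
            System.out.println(chunkToDate(registration, 2) - chunkToDate(registration, 0));
        }
    }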

lineProcessor This part fetches the tweets posted by a given user during the dateKey time frame. Note that no ordering is done at this stage, as all the tweets are ordered at the end of getTweetsFromLine. As for the previous reading tweets operation, we do not abort an SR if one of the tweets is not accessible, but rather ignore the error, as this does not compromise the rest of the operations. This case should happen very rarely: tweet objects, once stored, are only modified when the author wants to delete them; otherwise they are only read, and reads do not conflict and should thus not abort.

    lineProcessor(dateKey, username){
        SR(
            refList = read(/user/username/tweet/dateKey)

            tweets = new tweetList
            for each tweetRef in refList{
                tweet = SR(read(/user/tweetRef.posterName/tweet/tweetRef.tweetNbr))
                if(! tweet.isDeleted) tweets.add(tweet)
            }
            return tweets
        )
    }

Theoretical performance analysis The whole getTweetsFromLine operation is estimated to perform 2 + nbrFollowing + nbrRetrievedTweets basic Scalaris operations, where nbrFollowing is the number of users followed and nbrRetrievedTweets the total number of tweets retrieved. Indeed we need one read to determine dateKey and one read to determine the users we follow. Then, to fetch the references, we must read for each followed user the chunk corresponding to dateKey, hence nbrFollowing operations. Finally we must do nbrRetrievedTweets operations to read the tweets corresponding to the references we just read.
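As a concrete illustration (the numbers are arbitrary), a user following 30 people and retrieving 20 tweets needs 2 + 30 + 20 = 52 Scalaris operations for one read of his line with the pull approach.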

Why not store the built line chunks An alternative approach would be to keep the work done and store the built chunk once it has been read. This would avoid building the same chunk of a line several times. The application could simply check whether a given chunk has already been built and, if so, retrieve the references from the previously stored chunk. This might sound like an interesting optimisation, but we have to keep in mind the way our application is going to be used. Users almost never re-read tweets they have already read; they usually want to see the latest posted tweets. This means that they are going to load the latest chunk to see if there are new references in it. This latest chunk has to be rebuilt from scratch anyway, so storing the previous tweet references does not speed up this operation. Furthermore, trying to keep previously built chunks of the line increases the number of checks and operations to perform when the references are not already on the follower side, which is precisely the case when reading previously unread tweets. Finally, the most obvious advantage of not storing line chunks is that it decreases the space complexity. We thus decided against this solution because it complicates the implementation, increases the amount of data to keep in the system and slows down the most used operations in order to speed up rarely used ones.

5.4.3 Theoretical comparison of Pull and Push approach

We are first going to compare the two approaches based on the complexities we computed in the previous section. We then try to give an intuition of the impact of those complexities on the behaviour of Bwitter when used by simulated users.

Summary of the complexities

We now compare the complexity of the push and pull approaches for the two main operations, postTweet and getTweetsFromLine. Below we present a summary of those operations for both the push and pull approaches.

• Push - postTweet:
  nbOp = 8 + nbrFollowers × (2 + 3/nbrTweetsPerChunk + 2/nbrOfFollowersPerChunk)    (5.5)

• Pull - postTweet:
  nbOp = 5    (5.6)

• Push - getTweetsFromLine:
  nbOp = 2 + nbrTweets + nbrTweets/nbrTweetsPerChunk    (5.7)

• Pull - getTweetsFromLine:
  nbOp = 2 + nbrFollowings + nbrRetrievedTweets    (5.8)

Before we start, here is a reminder of the different terms involved:

• nbrFollowers: number of users the user is followed by (pull/push).
• nbrFollowing: number of users the user follows (pull/push).
• nbrTweets: minimum number of tweets we want to retrieve (push).
• nbrTweetsPerChunk: number of tweets in one chunk (push).
• nbrRetrievedTweets: number of tweets retrieved in one get (pull).
• nbrOfFollowersPerChunk: number of followers in a chunk of the Topost set (push).
• time granularity: the time frame corresponding to a post chunk (pull).

As announced, the post operation in the push approach is obviously much heavier than in the pull: the time to post a tweet using the pull design is constant, while in the push approach it depends on the number of users that follow you. Conversely, the read is lighter in the push approach: it depends on the number of tweets retrieved, as in the pull, but its complexity does not grow with nbrFollowings as it does in the pull. The two designs thus each have their heavy operation, the post for the push and the read for the pull. However, we believe the former is more resistant to failures because it does not need to succeed immediately and can be recovered later, whereas the latter must read from all the followed users successfully in order to produce a result. If we further consider that a user does not like to wait, the push is more reasonable: after the first step of the post we can already report that the operation was a success, while for the read in the pull we must wait until the end of the whole operation before responding to a request. On the other hand, push operations involve many more conflicts, as they do far more writes than pull operations. We have not yet determined whether, from a complexity perspective, it is better to use the push or the pull. To this end we compare them according to the number of followers and followings and the read and write rates.

Theoretical Bwitter simulation

We now simulate the two designs we have presented. The simulation is aimed at estimating the global number of operations performed by the system and determining which design is the best as a function of the (unknown) number of followers and read rate. This simulation does not take into account failures during the algorithms, the size of the data transferred, or the complexity of the transactions (number of keys involved in a transaction).

Description of the problem As they stand, the operations are not directly comparable. In the push approach we fetch at least a specified number of tweets, whereas in the pull approach we retrieve an arbitrary number of tweets, depending on how many tweets were posted during a given time frame. The two operations are thus semantically different, and their complexities naturally depend on different parameters. We would like to compare them in terms of the total number of operations performed on Scalaris. The main problem is that the parameters of the system are unknown. Each user of Bwitter is different; we define a user in terms of his behaviour, using four parameters:

• postingRate: the rate at which a user posts new tweets.
• readRate: the rate at which a user reads his tweets.
• nbrFollowers: the number of followers a user has.
• nbrFollowings: the number of followings a user has.

Moreover, in order to estimate the number of operations performed we must fix all the design parameters involved in the complexities, namely nbrTweets, nbrTweetsPerChunk and nbrFollowersPerChunk for the push, and time granularity and nbrRetrievedTweets for the pull.

Assumptions The parameters we just described vary a lot between users and are unknown; we did not find any precise statistics about the usage of Twitter. Because we would still like to give an idea of the performance of our two designs as a function of those parameters, we fixed them to values chosen according to the following assumptions:

(1) Users read their tweets more often than they post new ones.

(2) Most of the users on Twitter have more followings than followers; we call them fans. Other users have a lot of followers compared to the number of users they follow; we call them stars. This means that nbrFollowers is smaller for fans than for stars.

(3) Users are only interested by new tweets and when a user reads his tweets he reads all the new tweets.

(4) readRate is the same for all the users and is the average of the read rates of each user in the real network. Because we cannot compute it as we do not have the figures, we take it as a parameter of the simulation.

(5) postingRate is the same for all the users and is the average of the posting rates of each user in the real network. Because we cannot compute it as we do not have the figures, we take it as a parameter of the simulation.

(6) nbrFollowings is the same for all the users.

The first three assumptions come from observing how Twitter is used. Generally, users that connect to a Twitter application read their messages more often than they post new ones. Also, 1% of the users of Twitter are responsible for 50% of its content; this observation motivated our distinction between the star and fan behaviours. The last three assumptions were made in order to simplify the following development.

Properties of the simulated system We define in this section two properties derived from the assumptions made above. Those properties fix some of the simulation parameters defined before.

First property: The number of new tweets a user reads when he reads his tweets is constant and equal to nbrOfNewTweets.

nbrOfNewTweets = (postingRate × nbrFollowings) / readRate

First, notice that nbrOfNewTweets is constant because postingRate, nbrFollowings and readRate are fixed. We derived this property from (3), (4), (5) and (6). It allows us to fix nbrTweets (push) and nbrRetrievedTweets (pull) to nbrOfNewTweets. We also decided to read only one chunk (push) and one postTimeFrame per following (pull). This choice was made in order to simplify the simulation, which is already rather complex. This constraint helps us fix the time granularity and nbrTweetsPerChunk. Concerning the push approach, it is easy to fix nbrTweets as it is a parameter of the function call. In order for the tweets to be packed in the same number of chunks at each read call, it is sufficient to choose nbrTweetsPerChunk to be a multiple of nbrOfNewTweets, or put differently:

nbrTweetsPerChunk % nbrOfNewTweets = 0

We decided to fix nbrTweetsPerChunk to nbrOfNewTweets. Therefore, we need to read exactly one chunk in order to have the new tweets at each read operation.
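As a concrete illustration with arbitrary numbers: if each user posts at postingRate = 0.25 (one tweet every four time units), reads at readRate = 1 (once per time unit) and follows nbrFollowings = 40 users, then nbrOfNewTweets = 0.25 × 40 / 1 = 10, so nbrTweets, nbrRetrievedTweets and nbrTweetsPerChunk are all fixed to 10 tweets per read.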

Concerning the pull approach, we cannot directly influence how many tweets are read when performing a read tweet operation. However we can fix the time granularity so that:

time granularity % (1/readRate) = 0

This ensures that all the new tweets are always in the last postTimeFrame. We chose to fix the time granularity to 1/readRate. This gives the smallest chunk possible (no unused references loaded) while fulfilling the property just stated. Please note that the choice of the time granularity does not have any direct influence on the simulation; we merely wanted to show that our design can be tuned to meet the simulation constraints.

Second property: Each user has the same number of followers (nbrFollowers is fixed). We now argue that this second property is not restrictive. Notice that it is aimed at simplifying assumption (2); in other words, it claims that there is no distinction between stars and fans, or any other way of distinguishing users based on their nbrFollowers. This distinction is in fact not needed. Our simulation estimates the global number of operations performed by the system according to some user profile, and thanks to the two properties below we can affirm that having some users with more followers than others has no influence on the total number of operations.

(7) The postingRate and readRate are the same for all users (which is exactly what we assumed in (4) and (5)).

(8) The complexities of the operations in the two designs are linear with respect to the number of followers and followings (this can be observed by remembering that nbrTweetsPerChunk and nbrFollowersPerChunk are constant parameters).

Property (8) states that one additional follower only increases the load a user puts on the system by a constant amount (the same for all users) for each operation he performs. Consequently, moving a follower from one user to another does not change the total load put on the system, provided all users perform the same number of operations; this last condition is exactly what property (7) states. If (7) did not hold, we could have a system with one user having many followers but a posting rate equal to 0, and another user with few followers and a non-zero postingRate. The first user would not generate any posting load as he never posts, but transferring one of his followers to the second user would change the total load put on the system. To summarize, thanks to (7) and (8) we can always move followers from users having more followers to users having fewer followers without changing the total number of operations performed on the network. We thus proved that it is not necessary to distinguish between stars and fans.

In conclusion, the two properties we just defined fix the following relations between the simulation parameters.

• nbrFollowers = nbrFollowings
• (postingRate × nbrFollowings) / readRate = nbrOfNewTweets = nbrTweets = nbrRetrievedTweets = nbrTweetsPerChunk
• time granularity = 1/readRate

The simulations We now explain the final details of the simulation. Below are the formulas we use to simulate Bwitter; the first formula is for the push design and the second for the pull design.

Push:

nbOp = postingRate × (8 + nbrFollowers × (2 + 3/nbrNewTweets + 2/nbrOfFollowersPerChunk)) + readRate × (3 + nbrNewTweets)    (5.9)

Pull:

nbOp = postingRate × 5 + readRate × (nbrFollowings + 2 + nbrNewTweets)    (5.10)

Those formulas compute the number of operations performed with respect to the readRate, the postingRate and the nbrFollowers. Recall that an operation is a transactional read or write. Because we do not simulate operations other than reading and posting tweets, there is a direct relation between the two rates: if we normalize them, readRate + postingRate = 1. We thus make readRate vary from 0 to 1, with postingRate varying accordingly. We define nbrUsers as the number of users in the system. We chose nbrFollowers which, as already stated, represents the mean number of followers each user has, and thus also his number of followings, since nbrFollowings = nbrFollowers. Because we had no idea of the value of this number we chose some arbitrary values; the higher it is, the more socially connected the users in our system are. Finally we must fix the last unknown parameter, nbrOfFollowersPerChunk. This parameter only appears in the push design; the number of operations needed for a post decreases as it increases. The problem is that it is difficult to fix a value for it: we cannot neglect its influence, but we cannot decently set it very high either, because the number of keys involved in the posting transactions grows linearly with it. We thus made a compromise and set it to 20. We summarise below the values of the parameters.

• nbrUsers = 100
• nbrFollowers = 10, 30, 70
• nbrOfFollowersPerChunk = 20
• (postingRate × nbrFollowings) / readRate = nbrOfNewTweets = nbrTweets = nbrRetrievedTweets = nbrTweetsPerChunk
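A small program makes it easy to evaluate these formulas. The sketch below is not the code used for the thesis; it simply evaluates formulas 5.9 and 5.10 for a range of read rates with the parameter values listed above (nbrOfFollowersPerChunk = 20, nbrFollowers = nbrFollowings), assuming normalized rates.

    public class PushPullSimulation {
        // Formula 5.9: operations generated per user action in the push design.
        static double pushOps(double readRate, double nbrFollowers, double followersPerChunk) {
            double postingRate = 1.0 - readRate;                          // rates are normalized
            double nbrNewTweets = postingRate * nbrFollowers / readRate;  // first property
            return postingRate * (8 + nbrFollowers * (2 + 3 / nbrNewTweets + 2 / followersPerChunk))
                    + readRate * (3 + nbrNewTweets);
        }

        // Formula 5.10: operations generated per user action in the pull design.
        static double pullOps(double readRate, double nbrFollowings) {
            double postingRate = 1.0 - readRate;
            double nbrNewTweets = postingRate * nbrFollowings / readRate;
            return postingRate * 5 + readRate * (nbrFollowings + 2 + nbrNewTweets);
        }

        public static void main(String[] args) {
            int[] followers = {10, 30, 70};
            for (int f : followers) {
                for (double r = 0.05; r < 1.0; r += 0.05) {
                    System.out.printf("followers=%d readRate=%.2f push=%.1f pull=%.1f%n",
                            f, r, pushOps(r, f, 20), pullOps(r, f));
                }
            }
        }
    }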

We have plotted the results of our simulation in Figure 5.6. Lines go in pairs (one push and one pull); lines with the same weight correspond to the same number of followers. We have indicated the relevant intersections with big red dots.

Figure 5.6: Number of Scalaris operations with respect to the read rate: comparison between the pull and the push approach for nbrFollowers = 10, 30 and 70.

First, you can observe that all the lines of the pull approach are parallel; this means that nbrFollowers increases the number of operations by a constant amount whatever the readRate. We can also see that the number of operations in the pull approach does not vary much with the readRate, which may be surprising at first. Even more surprisingly, it decreases slowly as the readRate increases. Secondly, we can see that, as expected, as the readRate increases the push approach becomes more and more interesting. The smaller nbrFollowers is, the higher the readRate must be before the push approach becomes more interesting than the pull. If you observe the red dots you can see that some kind of asymptote seems to appear, indicating that below some readRate the push approach is never the good choice. We thus plotted the curve defined by the intersections of the pull and push lines in Figure 5.7 to confirm this intuition. We kept on the plot the lines already shown before to better visualize what the curve represents. The curve shows the intersections for nbrFollowers between 4 and 300; values of nbrFollowers smaller than 4 give intersections at a readRate bigger than 1, which does not make sense.

Figure 5.7: Intersection of the push/pull lines for nbrFollowers between 4 and 300.

This curve can be used to determine which design is theoretically the best according to nbrFollowers and readRate. We can observe an asymptote around readRate = 0,7. We did the math for nbrFollowers = 30000 and obtained readRate = 0,672. We can also note that once nbrFollowers is higher than 70 the black curve becomes nearly vertical. This means that for a readRate bigger than 0,672 and nbrFollowers bigger than 70 the push approach is theoretically always the best in terms of number of Scalaris operations performed.

Conclusion

In conclusion, we have compared the push and the pull theoretically according to an unknown mean nbrFollowers and an unknown readRate. We have seen that we could find a value of the readRate under which the pull approach is always the best. However, above this value and with nbrFollowers bigger than 70, the push approach is the best. It seems safe to assume that social networks like Twitter are in the second case. Moreover, the read algorithm in the pull is heavier and we must wait for its termination in order to respond to a given call, which is not the case for the posting in the push. Based on those observations, we believe the push approach is better suited to a social network like Twitter. We will see in the next chapter whether the practical tests confirm this conclusion.

5.5 Conclusion

In this chapter we detailed the main modules of our implementation. The NM is a powerful tool that allows us to manage the different machines we need to run Scalaris nodes, and the SCM allows us to easily dispatch work on those nodes. The BRH is the module on which we spent the most time and attention in order to design the simplest and fastest algorithms, and we believe we have minimised the complexity of our most used algorithms. Finally, our theoretical comparison between the push and pull approaches strengthens our belief that the push approach is probably better suited to our application. In the next chapter we run tests on Scalaris and on Bwitter's pull and push variations.

Chapter 6

Experiments

This chapter details the experiments we performed on Scalaris and Bwitter. The first part describes the Amazon Elastic Compute Cloud, the platform on which we ran all our tests. We then detail the tests we perform on Scalaris and Bwitter; for both we run scalability and elasticity tests. We start in the second section with Scalaris, as the results of the Bwitter tests are heavily influenced by those of Scalaris. Bwitter is tested in the third part: we study the influence of a cache and of the nbrOfFollowersPerChunk parameter for the push, then test the scalability and elasticity of our Bwitter push solution, and finally study the scalability of the pull approach before concluding.

6.1 Working with Amazon

We do not want to simulate the cloud platform ourselves, as we feel it would not reflect the way our application would ultimately be used. We thus decided to work with the Amazon Elastic Compute Cloud (Amazon EC2) because it is a professional and realistic work environment.

6.1.1 Choosing the right instance type

An instance is a virtual machine running on a physical machine. It is characterized by four attributes: CPU, network capabilities (we sometimes say IO capacity), RAM and storage capacity. The last attribute is the least interesting to us as none of our tests use persistent storage. While working on the Amazon cloud infrastructure, we used four kinds of instances: the standard micro, the standard small, the standard large and the high CPU medium instance. The micro instance is the smallest possible Amazon instance. It provides minimal CPU and IO capacity. The micro instance can consume up to 2 EC2 Compute Units for short periods of burst, which is not enough to run Scalaris correctly. According to Amazon, an EC2 Compute Unit is equivalent to the CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor. You can find more information about Amazon instances and EC2 Compute Units on the Amazon website1. We show the description of Amazon's micro instance in Table 6.1.

                  Standard Micro        Standard Small        Standard Large        High CPU Medium
                  instance              instance              instance              instance

Memory            613 MB                1.7 GB                7.5 GB                1.7 GB

Compute           Up to 2 EC2 Compute   1 EC2 Compute Unit    4 EC2 Compute Units   5 EC2 Compute Units
                  Units (for short      (1 virtual core with  (2 virtual cores      (2 virtual cores
                  periodic bursts)      1 EC2 Compute Unit)   with 2 EC2 Compute    with 2.5 EC2 Compute
                                                              Units each)           Units each)

Storage           EBS storage only      160 GB instance       850 GB instance       350 GB instance
                                        storage               storage               storage

Platform          32-bit or 64-bit      32-bit                64-bit                32-bit

I/O Perf          Low                   Moderate              High                  Moderate

API name          t1.micro              m1.small              m1.large              c1.medium

Table 6.1: Characteristics of the different Amazon instance types we use during the tests.

The small instance is just above the micro; it provides moderate IO performance and fixed CPU. Small instances were well suited to run up to 18 Scalaris nodes, but showed some CPU and IO limitations when we used a high number of connections and/or nodes. As for the micro, the characteristics of the small instance can be found in Table 6.1. Most of the tests use small instances to run the Scalaris nodes, as they are rather cheap and efficient, but we could have benefited from instances with higher CPU and network capabilities, as shown later.

1 Amazon EC2 FAQs, http://aws.amazon.com/ec2/faqs/, last accessed 27/07/2011

We also use the large instance, which has better network performance than the two others, and the high CPU medium instance, which has the same network performance but much higher CPU performance. Those two instances are used for special tests, when we suspect that some behaviours can be explained by the lack of performance of the other instances. At first we tried to work with the micro machines, but they turned out not to be powerful enough to support Scalaris and the operations we wanted to perform. Those preliminary measures are thus not relevant and we only detail our experiments and results with the other instances we presented.

6.1.2 Choosing an AMI

Instances need to have an associated Amazon Machine Image (AMI). AMIs can use two kinds of storage: instance (AMI) storage and the Elastic Block Store (EBS). The first does not allow the user to stop and restart the machine; once the machine is stopped all the modifications made are lost. The second works like a normal personal computer: you can restart the machine and the changes made before are still present. We use the EBS solution because it allows us to easily create custom images from existing AMIs and store them, which is not possible with classical AMI storage.

6.1.3 Instance security group

Amazon instances all belong to a security group, which defines several firewall settings for the instances. For the sake of simplicity, we have allowed all TCP connections as well as all ICMP messages between the nodes.

6.1.4 Constructing Scalaris AMI

We started from the AMI with ID ami-06ad526f, a 32-bit image of Ubuntu 11.04 (Natty Narwhal)2. The first step is to install all the packages needed to build Scalaris: the Java JDK, Erlang, make, Subversion and Ant. We ran the following commands to install the required packages.

sudo apt-get install erlang
sudo apt-get install make
sudo apt-get install openjdk-6-jdk
sudo apt-get install ant
sudo apt-get install subversion

We then installed the latest version (0.3.0) of Scalaris, downloaded from the SVN.

svn checkout http://scalaris.googlecode.com/svn/trunk/
cd /home/ubuntu/trunk/
sudo ./configure
sudo make install
sudo make install java

2 Can be found at http://uec-images.ubuntu.com/releases/11.04/release/, last accessed 27/07/2011

We also modified the starting scripts of Scalaris a little and added some scripts to restart Scalaris easily on a machine. Once all those steps were performed, the new AMI was ready to run Scalaris.

6.2 Working with Scalaris

We now detail the procedure to launch Scalaris and the different tests we did on it before testing our Bwitter application.

6.2.1 Launching a Scalaris ring

The first thing to do is to modify the “scalaris.local.cfg” file, which is located in the bin folder of Scalaris. The two important lines shown below must be modified.

{mgmt_server, {{127,0,0,1}, 14194, mgmt_server}}.
{known_hosts, [{{127,0,0,1}, 14195, service_per_vm}]}.

The mgmt_server, known_hosts and service_per_vm parts must not be modified, otherwise Scalaris will not work correctly; nodes do not connect correctly when those values are changed. You must replace the IP address of the first line with the IP address of the node running the management server (mgmt_server). 14194 is the port on which the management server runs; note that you can change it. The second line contains the known hosts, i.e. the other DHT nodes already inserted in the ring. Each known host is identified by an IP address and the port on which it listens. Below is an example of configuration.

{mgmt_server, {{192,168,1,1}, 14194, mgmt_server}}.
{known_hosts, [{{192,168,1,1}, 14195, service_per_vm},
               {{192,168,1,2}, 14195, service_per_vm},
               {{192,168,1,3}, 14195, service_per_vm},
               {{192,168,1,1}, 14200, service_per_vm}]}.

In this configuration, one machine (192.168.1.1) runs the management server and two DHT nodes (on ports 14195 and 14200). Launching the nodes is quite simple: the three following commands are used respectively to run the management server, the first node and another DHT node. The "scalarisctl" binary is located in the bin folder of Scalaris.

./scalarisctl -n mgmt_server@hostname -p 14195 -y 8000 -m start
./scalarisctl -n FirstNodeName@hostname -p 14195 -y 8000 -s -f start
./scalarisctl -n AnotherNodeName@hostname -p 14195 -y 8000 -s start

Note that each node has a name, which is needed to communicate with Scalaris nodes. The mapping between the node name and its location (IP address and port) is done by the epmd server, which is launched automatically with Scalaris. It is possible to launch several Scalaris nodes on the same machine; they only need to have different node names. The node name is set with the "-n" parameter. In fact only the part before the @ is the true name, but fixing the hostname is important if you want to avoid

communication problems when using the Java API for Scalaris. Indeed, Java does not resolve hostnames the same way Erlang does, and Scalaris is written in Erlang. Fixing the hostname prevents Erlang from choosing it itself, and using the same hostname in Java avoids the problem. The parameter "-p" fixes the port on which the DHT nodes communicate, which is important in order to configure the firewall settings. The parameter "-y" fixes the port on which the webserver runs; this webserver is not mandatory but makes debugging easier, as you can do put/get operations directly from its webpage. You can also get a visual representation of the complete ring from the webpage of the management server. Finally, the parameters "-m", "-f" and "-s" are used respectively to start the management server, the first node and a normal DHT node.

6.2.2 Scalaris performance analysis

Before doing any test directly related to Bwitter, we need some important information about Scalaris itself in order to understand our future results. Our first analysis focuses on the connection strategy used to communicate with Scalaris nodes. We then perform scalability and elasticity tests based on those results. Scalaris is configured with a replication factor of 4. Scalaris does not allow choosing the consistency level between replicas and thus always guarantees strong consistency. This means that read and write operations are always done in a transaction, and that read and write operations conflict if they work on the same keys. However, concurrent reads of the same value do not conflict, which is important to keep in mind during the tests. One important precision is that we only run one instance of Scalaris per machine. We decided to do so because the Scalaris developers told us that having more than one instance per node was less stable and only slightly increases the overall performance of the system; moreover, small Amazon instances might not be powerful enough to handle more than one instance of Scalaris. During our tests with Scalaris we take two measures: the time, in milliseconds, taken to perform 20000 operations and the number of operations that failed during the test. We do not apply any restart strategy: if an operation fails we report it and execute the next operation. We then compute the throughput and the failure percentage, defined respectively by equations 6.1 and 6.2. We have chosen to show the throughput as it is easier to analyze and closer to what we want to measure than the raw time; moreover, time can be difficult to interpret on its own and cannot be compared with other tests' results unless exactly the same number of operations is done. The failure percentage has the advantage of being easily comparable for others performing similar tests.

Throughput = number of Scalaris operations successfully performed / total measured time    (6.1)

Failure percentage = (number of failed operations / number of operations performed) × 100    (6.2)
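As a concrete illustration with made-up numbers: if 19 500 out of 20 000 operations succeed and the run takes 50 seconds, equation 6.1 gives a throughput of 19500/50 = 390 operations per second, and equation 6.2 gives a failure percentage of 500/20000 × 100 = 2,5%.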

Before presenting our tests we want to point out that the Amazon instances do not provide a constant level of performance. This means that the performance of the Scalaris nodes varies from one run to the other: we do not use the same physical machines all the time, but virtual machines whose performance can vary over time. We had many tests to run, so it was not possible to repeat each test several times. However, because we did many tests, we could detect when a result deviated too much from what we had already observed; in that case we restarted the test. Note that we do not detail all the tests we did on Scalaris, as some of them were only meant to familiarize ourselves with the system. We present only the most relevant ones, those that give the reader the broadest view of Scalaris' behaviour.

Connection strategy test

This test aims at evaluating the impact on performance of the number of parallel connections the dispatcher maintains towards a single Scalaris node. A connection is a TCP connection towards a Scalaris node which can be used to make sequential requests. The word sequential is important, as concurrent requests over the same connection trigger errors: Scalaris does not distinguish between the different requests, which thus get mixed if sent concurrently. The dispatcher is the node that sends operations to Scalaris nodes. We decided to run the dispatcher on a different machine than the Scalaris nodes because later we run our Bwitter nodes on dedicated machines; we believe that the overhead of Bwitter could perturb the execution of Scalaris, which is already quite heavy. Our guess is that the conflict level (conflictLevel) plays an important role in the optimal number of connections. We define the conflictLevel of a set of operations as the probability that a random pair of operations in the set conflict if they occur at the same time. Having more connections increases the probability that two conflicting operations occur at the same time, leading to their failure. We designed a benchmark with a fixed number of nodes (we chose 18, the maximum number of nodes we could launch in the test environment we were provided) and some predefined conflict levels, and made the number of connections vary for each value of conflictLevel. This benchmark consists of 20000 random operations, with as many reads as writes, operating on a random key inside a given pool of keys. The value written is always the constant String "test". The conflictLevel is inversely proportional to the number of keys on which we work: the smaller the number of different keys, the higher the chances that two parallel operations work on the same key and thus conflict. We believe that 20000 operations are enough for small variations not to influence the overall results. We decided that the best way to connect the dispatcher is to have a symmetric connection strategy with respect to each node, which makes sense as each node is supposed to be equivalent to the others.

Mathematically speaking this means that:

|number of connections to n1 − number of connections to n2| ≤ 1    ∀ n1, n2 ∈ set of nodes

We apply this symmetric connection strategy during all our tests. Please also note that, in order to avoid side effects, we shut down the whole ring between runs and start with a fresh ring each time. This test uses small Amazon instances for the dispatcher and the Scalaris nodes. The results are summarized in Figure 6.1 and Figure 6.2.
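A minimal sketch of such a symmetric assignment, assuming a plain round-robin over the node addresses (the class and method names are ours, not Bwitter's): distributing the connections in round-robin order guarantees that the connection counts of any two nodes differ by at most one.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SymmetricConnections {
        /** Assigns nbrConnections connections to nodes so that counts differ by at most 1. */
        static Map<String, Integer> assign(List<String> nodes, int nbrConnections) {
            Map<String, Integer> connectionsPerNode = new HashMap<>();
            for (String node : nodes) {
                connectionsPerNode.put(node, 0);
            }
            for (int i = 0; i < nbrConnections; i++) {
                // Round-robin: the i-th connection goes to node i modulo the number of nodes.
                String node = nodes.get(i % nodes.size());
                connectionsPerNode.put(node, connectionsPerNode.get(node) + 1);
            }
            return connectionsPerNode;
        }

        public static void main(String[] args) {
            // 7 connections over 3 nodes: one node gets 3 connections, the other two get 2.
            System.out.println(assign(List.of("n1", "n2", "n3"), 7));
        }
    }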

Figure 6.1: Read and write throughput with respect to the number of connections.

As we can see in Figure 6.2, the conflictLevel has a clear impact on the performance. The number of failed operations increases with the conflictLevel, leading to a lower throughput, as we can see in Figure 6.1. The number of failed operations also increases with the number of connections: it becomes obvious that having fewer connections lowers the number of failed operations in an environment where operations can conflict. We can distinguish two parts in Figure 6.1: the part before we reach as many connections as nodes, and the part after. We call this rupture the break point. In the first part (except for a conflictLevel equal to 0.1), the number of operations per second increases almost linearly with the number of connections. We thus deduce the following property: in normal conditions, where the conflictLevel is not tremendously high, it is necessary to use as many connections as nodes in order to fully take advantage of those nodes' power.

Figure 6.2: Failure percentage with respect to the number of connections.

In the second part, the throughput varies with the conflictLevel. When the conflict level is low, the throughput increases with the number of connections per node up to a certain point and then eventually decreases again, below the value measured at the break point. We believe the increase is due to the higher load put on the Scalaris nodes, while the decreasing part can be explained by the growing number of failures observed. We also believe that having only one dispatcher may not be enough: the network capacity of Amazon small instances is only moderate and the traffic towards Scalaris nodes increases with the throughput, so it is possible that we reached the maximum throughput for one dispatcher. Finally, the throughput does not increase with the number of connections per node in cases of very high conflict levels. For example, with a conflictLevel equal to 0,02 the throughput drops directly after the break point. For the line with a conflictLevel equal to 0,1, the throughput increase stops even before the break point and the throughput then keeps decreasing. This indicates that, if the conflictLevel is really high, the optimal number of connections is below the number of nodes, despite the fact that Scalaris could handle more parallel requests. We thus conclude that up to a given level of conflict between operations we must use at least as many connections as there are nodes. Using more connections also increases the throughput, but not as drastically, and it depends on the environment in which we are working.

Connection strategy conclusion: In the light of the tests we did, we have shown the crucial influence of the number of connections as well as the conflictLevel on the throughput and failure percentage. In a highly conflicting environment, it might be a good idea to reduce the number of connections a little. However, when operations almost never conflict, a higher number of connections can significantly increase the performance because it allows putting a higher load on Scalaris. Choosing the right number of connections is really difficult, as it requires estimating the conflict level, which is an application dependent parameter; moreover, results could also have been different for another number of nodes. We finally conclude that we must use at least as many connections as there are nodes: in most practical situations the conflictLevel is not high enough to justify going below this number.

Scalability test

Scalaris is claimed to be a scalable system. Although we could have simply accepted this claim, we wanted to verify it in our own environment, as it is really important for understanding the next tests.

First scalability test with one dispatcher and small instances: We performed 20000 writes on random keys and then read each of the keys we had just written. The conflictLevel should be close to 0 as keys are chosen randomly using Java's Math.random() function. We measure the time taken for all the writes and reads with respect to the number of Scalaris nodes, which we vary from 4 to 18, maintaining only one connection per node. As for the connection test, we use small Amazon instances for all the nodes (dispatcher and Scalaris nodes). The results can be found in Figure 6.3. We can clearly observe that the throughput increases with the number of nodes. It seems to increase more slowly when the number of nodes becomes higher: of the 70% throughput increase we observe between 4 and 18 nodes, 45% is already obtained between 4 and 8 nodes.

Second scalability test with one dispatcher and medium instances: We were surprised by the slowdown at the end of the previous test. Our assumption is that the small instances are not powerful enough to handle a ring of that size. We thus repeated the test with medium instances for the Scalaris nodes, the other parameters remaining the same. The results of this test can be found in Figure 6.4. We can see a general improvement in performance with more powerful machines, but again a decrease in scalability with a higher number of nodes. However, this decrease is not as marked and happens a few nodes later than in the previous case, around 10 nodes instead of 8. The performance of the machines certainly plays a role but is probably not the cause of this decrease; our guess is that some networking delays appear because we only use one dispatcher.

Figure 6.3: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for small instances and conflict level of 0.

Figure 6.4: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for small and medium instances and conflict level of 0.

Third scalability test with 2 dispatchers and small instances: Looking at our logs we noticed that the time the nodes spend waiting for a new job after finishing the previous one had an impact on this scalability. This time increases with the number of nodes in the ring, as the dispatcher must keep more nodes busy. Networking delays are thus probably the source of this problem. We now want to measure the magnitude of this impact. Our idea was to add another dispatcher in order to increase the load on the Scalaris nodes, so we performed a series of tests to measure the impact of having two dispatchers instead of one. In the first series of runs we have one dispatcher maintaining two connections with each Scalaris node, while in the second series of runs we have two dispatchers each maintaining one connection per Scalaris node. Note that we use two connections in the first case because we want to have the same number of parallel requests in the two tests. In order to widen our view of Scalaris' behaviour we opted for a conflictLevel equal to 0,004. We thus do 20000 Scalaris operations in total (20000 for the single dispatcher, and 10000 per dispatcher when we use two), with as many reads as writes, and make them overlap. Our results can be found in Figures 6.5 and 6.6.

Figure 6.5: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one and two dispatchers on small instances and conflict level of 0,004.

As we can see in Figure 6.5, the throughput does not seem to be much affected by the addition of a second dispatcher, even though we can notice a clear difference once the ring has more than 8 nodes. The difference, however, seems too small to conclude that the scalability issues are due to the increasing waiting time of the nodes. Surprisingly, we see in Figure 6.6 that the failure percentage is always higher with the single dispatcher.

Figure 6.6: Failure percentage for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one and two dispatchers on small instances and conflict level of 0,004.

Final scalability test with 4 dispatchers and small instances: Finally, we want to see whether a single dispatcher with higher network capacity is a better choice than several small dispatchers with moderate network capacity. We invite you to consult Table 6.1 to recall the specifications of the small and large instances: the large instance offers far better performance than the small instance in every respect. We decided to make a final test with a really small conflictLevel, equal to 0,00007, and made the number of nodes vary from 8 to 16. Once more we chose another conflict level in order to widen our view. We again do 20000 operations in total: in the case of a single dispatcher it performs 20000 operations, and in the case of 4 dispatchers each performs 5000 operations. The single dispatcher has 4 connections per node and the 4 dispatchers use one connection per node each. Every dispatcher is connected to every Scalaris node. The results are shown in Figure 6.7. We do not show the failure percentages because their value is nearly 0 and their variation is not relevant. Our first observation is that the performance of all the tests increases linearly, meaning that all the configurations scale correctly. Then we can observe that using a small or a large dispatcher has no effect on the performance. This means that a small instance should be powerful enough to manage at least 16 × 4 connections to Scalaris, and that there are special conditions in the Amazon cloud that limit the networking performance with one dispatcher. We believe the 4 small dispatchers outperform the two other configurations because they can send new jobs to the Scalaris nodes more quickly than a single dispatcher. This confirms the results we obtained in the previous test.

Figure 6.7: Throughput for 20000 Scalaris operations with respect to the number of Scalaris nodes, results for one small, one large and four small dispatchers and conflict level of 0,00007.

We thus reach the conclusion that, while increasing the number of connections to Scalaris can increase the performance, it is sometimes necessary to have several dispatchers to put enough load on Scalaris. We can finally observe that with 4 dispatchers the throughput approximately doubles as the number of nodes doubles, indicating really good scalability.

Comparison with the Scalaris developers' scalability tests: We discussed with Florian Schintke, a member of the Scalaris development team, about their scalability tests. They use a different approach than ours and do not perform any conflicting operations. For instance, they make the number of nodes vary and have 10 clients per node; each client begins by initializing a random key and then does 1000 increments on this key. The probability of conflict between operations is thus infinitesimally small. They also used more powerful machines than ours and were not working on the Amazon cloud. Figure 6.8 shows one of the results Florian Schintke sent us. We can clearly see that Scalaris scales correctly. However, their tests are rather different from ours for several reasons. First, they use completely different infrastructures. Secondly, most of our tests work with a non-zero conflictLevel, which is important for us as we know that Bwitter will obviously work with conflicting values. Finally, we do not run our dispatcher on the same machines as the Scalaris nodes. We believe it is not realistic for us to have the Bwitter nodes (the equivalent of the dispatcher in our tests) directly on the Scalaris nodes, as this would perturb Scalaris nodes that can potentially already be under high load. Furthermore, we would reduce the benefit gained from the cache by having more Bwitter nodes.

Figure 6.8: Increment Benchmark test of the Scalaris developers.

Final words on scalability and the connection strategy: We have concluded that Scalaris is scalable, as the performance clearly improves with the number of nodes. We explain the performance slowdown at higher numbers of nodes by the fact that the load we put on the Scalaris nodes is not high enough. To increase the load we have three possibilities: increase the number of connections, use several dispatchers, or improve the networking performance of the environment. Using several dispatchers gives slightly better results than having only one. Therefore, we believe that beyond a certain number of connections managed by a dispatcher it is a good idea to add another dispatcher to get better scalability. We were limited in the number of machines at our disposal to do all the tests we wanted; we believe that the results would have been more explicit if we could have reached a higher maximum number of nodes. Scalability is also limited by the conflictLevel: the higher the conflictLevel, the fewer connections and parallel requests we can use without the number of failures exploding, as shown by the connection test.

Elasticity test

Test description Until now we worked with a constant number of nodes during each test. In order to react to flash crowds, we need Scalaris to be elastic enough so that the throughput can be increased quickly. The detection of the flash crowd is not part of the test; we consider that the flash crowd starts at the beginning of the test. We then have to decide what the best strategy is to handle this flash crowd. To determine it we observe the throughput as well as the failure percentage during the whole test. The final throughput reached is also important to us, as well as the total number of operations performed during the whole test, in order to determine which behaviour is the best during the churn period.

Parameters We have observed that Scalaris scales well from 6 to 18 nodes, and we are going to test different ways to get from 6 to 18 nodes under high load. We use one dispatcher to send a constant number of parallel requests to Scalaris. This dispatcher is also responsible for adding the new nodes to the ring. Note that it takes between 45 and 200 seconds to start a new node using the Amazon API. The dispatcher periodically samples the number of operations correctly performed as well as the number of failures, which allows us to plot the evolution of the throughput and failure percentage over time. We now present the different strategies we try. Each strategy is defined by a number of nodes to add at each adding point and a constant time between adding points. For each strategy we wait one minute before adding the first node so that we can observe what happens before and after.

(1) We do nothing in order to have a standard measure to compare with the other results.

(2) One node added after one minute and then no more.

(3) One node added every minute until we reach eighteen nodes.

(4) Two nodes added every minute until we reach eighteen nodes.

(5) Two nodes added every two minutes until we reach eighteen nodes.

(6) Six nodes added every five minutes until we reach eighteen nodes.

(7) Twelve nodes added after one minute.

We believe that with those strategies we have covered almost all reasonable behaviours: doing nothing, adding nodes regularly, and adding many nodes at the same time but waiting longer before the next addition. We must point out that those strategies are targets; it may not be possible to add nodes as quickly as planned, so we will most probably observe jitter in the nodes' starting times. Each strategy can be seen as a simple scheduling loop, sketched below.
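A minimal sketch of such a scheduling loop as it could be run by the dispatcher; the NodesManager interface and its startScalarisNodes method are hypothetical names for illustration only, not our actual Nodes Manager.

import java.util.concurrent.TimeUnit;

// Hypothetical node-management interface; the real Nodes Manager differs.
interface NodesManager {
    void startScalarisNodes(int count); // launch EC2 instances and start Scalaris on them
}

public class AddingStrategy {
    // A strategy is "add nodesPerStep nodes every stepSeconds seconds until
    // targetNodes is reached", starting one minute into the test.
    static void run(NodesManager manager, int initialNodes, int targetNodes,
                    int nodesPerStep, long stepSeconds) throws InterruptedException {
        TimeUnit.SECONDS.sleep(60);                    // observe the initial plateau first
        int current = initialNodes;
        while (current < targetNodes) {
            int toAdd = Math.min(nodesPerStep, targetNodes - current);
            manager.startScalarisNodes(toAdd);         // actual start times may jitter
            current += toAdd;
            TimeUnit.SECONDS.sleep(stepSeconds);
        }
    }
}

Strategy (4), for instance, would correspond to run(manager, 6, 18, 2, 60), while strategy (7) adds all twelve nodes in a single step.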

We summarize below the parameters of the test:

• 1 connection per node
• nbrInitialData = 2000
• 15 minutes of test
• conflictLevel = 1/250 (so all the operations work on a pool of 250 keys)
• 6 nodes running initially
• 1 minute before adding the first node(s)
• Large instance dispatcher
• Small instance Scalaris nodes
• Successful and failed operations sampled every 20 seconds

According to the Scalaris developers, at the time of writing, nodes buffer the requests arriving while they are inserted in the ring and start responding to them as soon as they are correctly inserted. The parameter nbrInitialData is a special parameter aimed at simulating previous content on the Scalaris nodes. Indeed, in order to maintain the replication factor, new Scalaris nodes must retrieve the values they become responsible for when they are added to the ring. This adds an overhead during each insertion of nodes in the ring. We wanted to take this overhead into account and be able to tune it with the parameter nbrInitialData. Before the test starts we add nbrInitialData key/value pairs to the ring. The keys used are random and the value is always the same: a constant String of 360448 random characters. We have chosen nbrInitialData equal to 2000, which means that quite a lot of data must be transferred to the Scalaris nodes before starting the test. We have observed that this initialization phase takes approximately 5 minutes. Several tasks run on the dispatcher: one responsible for checking that operations are correctly done, one that sends time statistics, and the management of the Scalaris Connection Manager and of the Nodes Manager, which are both heavy tasks. This is why we have chosen to use a large dispatcher.
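A minimal sketch of this seeding step is shown below. The StringStore interface and its write method are assumptions made for illustration; they do not correspond to the actual Scalaris API used in our implementation.

import java.util.Random;
import java.util.UUID;

// Hypothetical minimal key/value client; the real Scalaris Java API differs.
interface StringStore {
    void write(String key, String value);
}

public class InitialDataSeeder {
    // Writes nbrInitialData random keys, all mapping to the same constant string of
    // valueLength random characters, so that nodes joining the ring later must
    // transfer a significant amount of data to maintain the replication factor.
    static void seed(StringStore store, int nbrInitialData, int valueLength) {
        Random random = new Random(42);                  // fixed seed for repeatability
        StringBuilder sb = new StringBuilder(valueLength);
        for (int i = 0; i < valueLength; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        String constantValue = sb.toString();            // e.g. 360448 characters
        for (int i = 0; i < nbrInitialData; i++) {
            store.write("init-" + UUID.randomUUID(), constantValue);
        }
    }
}

With the values used in this test, the call would be seed(store, 2000, 360448).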

Remarks on the test environment: Before getting to the results, we make two general remarks. First, the Amazon cluster is unstable: sometimes some machines are not reachable (ping not working). Secondly, sometimes the Ubuntu AMI we are using does not correctly initialize the SSH keys and SSH is thus not working; we spotted this problem very late and could not correct it, as it would have been necessary to modify the AMI we used and it was too late to redo all the tests. When faced with one of those two problems we are forced to reboot the machine on the fly, which is quicker than launching a new one but still takes some time and CPU. We consider that this overhead is part of the test. This was not a problem with the previous tests, as the launching of the ring was done during the initialisation phase; this is indeed the first test where we need to launch a new machine at run time.

Scalaris elasticity test results: Figure 6.9 shows the evolution of the throughput for the different strategies; the numbers in the legend of the graph correspond to the numbering of the strategies presented above. The throughput is computed from the data collected every 20 seconds: the throughput at time x is thus equal to the average throughput between x-20 seconds and x. Blue points on the graph mark the moments where we begin to start new instances; note that the number of instances started depends on the strategy. Red points mark the moments where Scalaris is started on the nodes, that is, when the command is launched, not when the node is effectively inserted in the ring. We first comment on each strategy separately.

(1) During this strategy we do not add any node and thus keep the ring size at the initial value of 6 nodes. As you can see the throughput stays stable during the whole test.

(2) In this strategy we add only one node. We start the adding procedure 60 seconds after the beginning of the test. We can see that this procedure has an impact on the performance: the graph shows that the throughput decreases during the insertion, but this is not due to Scalaris churn, as Scalaris is not started on the node before 120 seconds. At that time, the throughput increases by a small amount and stays stable until the end of the test. The node was thus quickly operational and we did not notice a drop in performance after Scalaris starts.

(3) We tried to add one node every minute but we could only start 6 nodes out of the 12 planned. Indeed, launching one node correctly takes a certain amount of time, which varies from approximately 45 seconds to 3 minutes. The throughput is more chaotic as we regularly add nodes and, as we saw in the previous strategy, between the moment we start inserting a new node and the moment Scalaris is effectively started on it, the throughput drops. Again, as observed in strategy (2), the throughput increases directly after Scalaris is started on the node. Finally, we remark that it was not possible to observe the stabilization because nodes are added too regularly and not all the nodes were added.

(4) Here we add two nodes every minute. This time we could add 10 nodes out of 12. The throughput once again increases regularly with the addition of nodes while being perturbed by it. It finishes at a higher value than (3) simply because more nodes could be reached by the end. The throughput reaches a fairly high value but is not stable at the end of the test.

(5) We could only add 6 nodes here, which is nearly the same as with strategy (3). However, here we added two nodes at a time (where we added one in (3)) and waited twice as long between additions (120s instead of 60s). We can observe some periodicity in the additions and see that this strategy regularly reaches the same throughput as the third one. This is confirmed at the end of the test, where both eventually reach the same throughput with the same number of nodes.

(6) We increased the number of nodes per addition to 6. We first add 6 nodes at 60s; they were ready at 160s and we directly see a large increase in the throughput and a quick stabilization. We observe the same behavior for the second addition of 6 nodes and finally reach a stable throughput around 560 ops/s. Something odd is that it should reach the same throughput as (7). Indeed, the last node addition was done at 560s and, as we have seen, the throughput stays stable from that point, showing no indication that it will ever increase. Our guess is that the physical placement of the machines creates some special conditions limiting the number of messages that can be exchanged between nodes and lowering the throughput. This is indeed possible as each test is run with different nodes.

Figure 6.9: Throughput with respect to time for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.

Figure 6.10: Failure percentage with respect to time for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.

(7) In this last strategy we add 12 nodes directly at 60s and Scalaris is started on those nodes at 120s. Between 60s and 120s we see a decrease in the throughput which seems proportional to the number of nodes added; this is normal as the amount of work involved in starting 12 nodes grows with the number of nodes to start. We can observe that this decrease is of about 25% of the throughput. However, as soon as the nodes' startup is finished and Scalaris is booted, the throughput explodes and quickly reaches a stable value at 630 ops/s. We can confirm that this value corresponds to the stable throughput for 18 nodes: it is close to what we obtained in the connection strategy test of section 6.2.2, for which we measured an average throughput of 650 ops/s.

We now summarize the results obtained by observing the throughput evolution for each strategy. First, we notice that during the adding period (during which we perform the following tasks: launching the nodes on Amazon, periodically calling the Amazon API to check the instances' state, sending the necessary files, and retrieving from the nodes the information necessary to launch Scalaris) the performance is lowered by a factor proportional to the number of nodes. However, launching several nodes at the same time is less time consuming, as Amazon starts all the nodes in parallel and the time waited per node is thus divided by the number of nodes. Secondly, after Scalaris is started on the nodes, and despite the fairly large amount of initial data, nodes are almost instantly ready to operate: the throughput in all the strategies increases directly after Scalaris is started on the nodes. We believe this is because the amount of initial data is too small to observe any performance drop. Moreover, this throughput is quite stable. We must also note that several strategies could neither reach 18 nodes nor stabilize because the test length was too short. This is not a problem, as other strategies had already shown better results than those and reached the best stable state possible (7); the conclusions would thus not have been different. Finally, we decided that the last strategy was the best according to the throughput evolution, as it allows us to quickly reach a very high and stable throughput with only limited disturbance.

We now look at the average throughput of each strategy during the test in Figure 6.11. This criterion is important in order to know which strategy maintains the best average service during the 15 minutes we have to react to the flash crowd. It is obvious that the last strategy outperforms the others, which is not surprising given the evolution of the throughput we just observed. We still have to observe the failure percentage evolution, which may give some indication of Scalaris's instabilities. We can see in Figure 6.10 that the failure percentage grows with the number of nodes. As for the throughput, we thus observe after each node addition an increase of the failure percentage which is proportional to the number of nodes added. This is what we observed in all our tests: increasing the number of connections increases the number of failures. There is thus no reason to penalize the solutions with higher failure percentages.

Figure 6.11: Mean throughput results for the seven strategies presented, with a large dispatcher and small Scalaris nodes for a conflict level of 0,004.

Conclusion: We conclude that the best strategy is to add all the nodes at the same time: it is the quickest way to increase the throughput, it gives the best average throughput over 15 minutes, and it does not present a failure percentage higher than usual for this number of connections. The results are thus very encouraging, as it was indeed possible to go from 6 nodes to 18 nodes in only two minutes with a loss of only approximately 25% while the nodes were being started. Moreover, as soon as Scalaris is started on the nodes, the throughput reaches a value close to what we obtained before in section 6.2.2. It would have been interesting to test with higher values of nbrInitialData to try to observe a loss of performance during the insertion of Scalaris nodes in the ring, but we lacked the time to perform those tests.

6.3 Bwitter tests

Now that we have looked at the performance of Scalaris, we can study Bwitter keeping those results in mind. As explained previously, we have implemented two different approaches: the pull and the push. We are going to test and comment on both in this section. However, we will focus on the push approach as it is the one we finally selected as the best approach; a later section will be dedicated to the pull approach. Therefore, unless we explicitly specify otherwise, we are talking about the push approach. We will start by showing the impact of the application cache we use in order to solve the popular value problem. We then make a test to show the influence of nbrOfFollowersPerChunk, the number of followers per chunk of the Topost set. Then we test the scalability and elasticity of the system we have implemented.

6.3.1 Experiment measures discussion

In this section we explain which data we measured during our tests in order to clarify it for the rest of the experiment section.

Measures taken

The following tests are aimed at determining the best design and parameter choices. We thus want to measure the performance of the different configurations we propose, but we are also interested in determining how successfully the operations were performed. We do two types of operations: reading and posting tweets. Those operations have different success conditions and restart strategies, which we detail below.

First, we discuss the tweet posting operation. This operation is assumed to fail only when the first step of the algorithm fails. Indeed, performing this step correctly ensures that the tweet will eventually be posted to all the lines, assuming the recovery mechanism is triggered or another tweet is posted by the same user. If the first step fails, we restart the operation at the test level and do not count it as another operation; if any of the remaining steps fails we do not trigger the recovery mechanism. This means that all the tweets posted during the tests are always stored in the system but might not be posted to all the lines. However, we have noticed that only a negligible number of SRs aborted, which indicates that tweets are successfully posted on the lines most of the time. Secondly, concerning the reading of tweets, we do not abort the whole operation if one tweet is not available. This should almost never happen because, as shown in the previous tests, concurrent reads are not conflictual. Moreover, tweets are frequently read from the cache, lowering the probability of failure even more. We restart the operation only if an error occurs when accessing the line containing the tweet references.

We now describe the most relevant measures we took during our tests. We took more measures in order to help us understand some results and to verify that everything was working correctly; however, those usually do not help to understand the results and would only clutter the text.

• Time: We measure the total time in milliseconds needed to perform the requested number of operations.

• SRs run: This is the number of SRs that were performed during the whole test. Indeed, the tweet posting operations are split into several SRs. We take this measure in order to compare it with the number of restarted SRs and the number of aborted SRs. A restarted SR is not counted in the SRs run.

• SRs restarted: This is the number of SRs that were restarted by the Scalaris Workers; remember that they restart an SR a given number of times, which we have fixed to 10, before aborting it. We use this value in conjunction with the SRs aborted and the SRs run in order to compute the failure percentage.

• SRs aborted: This is the number of SRs that were aborted by the Scalaris Workers. When an SR is aborted, the Bwitter operation that created the SR gets an exception. If the number of aborted SRs is low we can be sure that the Bwitter operations were successfully performed. In fact, the number of aborted SRs is extremely low: we got approximately two aborted operations in total during the tests presented here. This is mainly due to our aggressive restart strategy; as just stated, we retry a failed SR 10 times before aborting it. We thus do not present this measure in our results.

• Cache hits: This indicates the number of times a read was successfully performed from the cache. Each cache hit avoids a transactional read on Scalaris.

• Cache misses: This indicates the number of times the cache was accessed and no entry was found. This number is usually quite low compared to the cache hits, as we frequently access the same data because the simulated network is small.

You could wonder why we did not measure the failures at the Bwitter level. In fact, we did not get any failure of any Bwitter operation during the tests we did. We thus decided to measure the failures at the layer below, the Scalaris Connection Manager layer. This measure is precise enough to compare the degree of failures between the different tests we did. As was the case with Scalaris, we rather present our results in terms of throughput and failure percentage.

• Throughput: Our Bwitter tests generally consist of a given number of operations. When we talk about an operation we mean one of the two described in the previous point: posting a tweet or reading tweets. Depending on the test settings, those operations can be more or less heavy. The throughput measure is the number of operations per second achieved by the tested configuration. We believe this is the best way to determine which configuration is the best for a given test, as we feel it fairly measures the global throughput of the whole system.

\[
\text{Throughput} = \frac{\text{number of Bwitter operations successfully performed}}{\text{measured total time}}
\]

• Failure percentage: The failure percentage is the number of restarted SRs divided by the total number of SRs attempted (run plus restarted). We only take into account the restarted SRs because, as said above, the number of aborted SRs is negligible.

\[
\text{Failure percentage} = \frac{\text{SRs restarted}}{\text{SRs run} + \text{SRs restarted}} \times 100
\]
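As a small illustration of how these two measures can be derived from the counters collected during a test, here is a sketch; the field names are our own and do not correspond to the actual test harness.

public class TestMeasures {
    long operationsSucceeded;   // Bwitter operations successfully performed
    long totalTimeMillis;       // measured total test time
    long srsRun;                // SRs performed (restarted SRs not counted here)
    long srsRestarted;          // SRs restarted by the Scalaris Workers

    // Operations per second achieved by the tested configuration.
    double throughput() {
        return operationsSucceeded / (totalTimeMillis / 1000.0);
    }

    // Restarted SRs as a percentage of all SR executions (run + restarted).
    double failurePercentage() {
        return 100.0 * srsRestarted / (srsRun + srsRestarted);
    }
}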

Measuring the time

The time is always measured with the System.currentTimeMillis function from Java. We use the following pseudo code to measure the time taken by a piece of code.

long timeAtStart = System.currentTimeMillis();
codeWeWantToProfile();
long executionTime = System.currentTimeMillis() - timeAtStart;

This method does not take into account that we are working in a concurrent environment. Imagine we want to measure the time taken by an operation. When the test consists only of operations of the same type this is not a problem: we can measure the total time of the test and divide it by the number of operations performed. However, if we mix operations of different types (for example posting tweets and reading tweets) we cannot use this method. Indeed, a tweet-posting thread can be preempted by a tweet-reading thread, and some of the time spent in the reading thread will then be accounted to the posting thread that was preempted. We did not solve this problem and so we did not measure the time taken by a single operation. Ultimately, we are more interested in the time taken to perform a given number of operations than in the mean time of one type of operation.

6.3.2 Push design tests

The parameters

All the tests are based on a simulation of Bwitter's use. Between each test we restart Bwitter and Scalaris in order to avoid side effects from previously run tests. This is time consuming because Scalaris is not persistent and we need to initialize Bwitter with some data so that the tests are as realistic as possible. We have two phases: the initialization phase and the main phase. In the first phase, we create the users and one line for each of them, and we add the owner of the line to it. We also add a number of followers to each line in order to simulate social connections; we use a hash function in order to choose which users a given user should follow. Finally, each user posts some tweets to create data on the lines. This phase is never taken into account in the results we present, and in order to have comparable results it is exactly the same for all the tests. In the second phase, we perform the two kinds of operations described previously: post a tweet and read tweets. We decided to only read the tweets contained in the head chunk, as this is what users usually want to access. The second phase is finished after a predefined number of operations have been successfully performed. We fixed this number of operations at 20000 because, as was the case for Scalaris, we feel that 20000 operations are significant enough so that small variations do not influence the overall results. The throughput is computed based on this phase. In contrast to the first phase, this second phase is not static: the operations are performed in a different order each time the test is run, and the number of operations of each type varies a little. We made this choice because we wanted to avoid creating an artificial pattern by fixing the order of the operations, and because we believe it is the best way to simulate the real use of Bwitter. Below we detail the parameters we use for the social network simulation, Scalaris and Bwitter. Some values are fixed and others are variable; we will not repeat the fixed parameters in each test, so if you need more information about a particular parameter please refer to this section. In the tests we only detail parameters that are not fixed or that differ from the values given here. We could not find any precise numbers about Twitter's use. We thus decided to create two different social networks that, according to us, should be close to reality. The parameters associated with those two configurations are listed in Table 6.2.

                                   Heavy network   Light network
Number of users                    2000            4000
Lines per user                     1               1
Users followed                     50              25
Tweets per user at beginning       1               1
Users followed / Number of users   0,025           0,00625

Table 6.2: Social network parameters, part 1.

It is not possible to simulate a network as big as Twitter; we were thus forced to simulate a smaller network. However, the initialization phase for those two networks is already quite long. The names we have chosen for those two networks are significant: the heavy network is denser than the light one. The heavy network overestimates the real complexity of a network like Twitter in order to avoid presenting better results than a real world network would give. Indeed, we have chosen nbrUsers and nbrFollowers in order to have a dense network, which complicates the task of Bwitter. You can notice that the ratio (users followed / number of users) is quite high. This ratio of 0,025 means that each user follows 2,5% of all the users in the network, which implies a quite high level of conflict between concurrent operations. This ratio is the equivalent of the conflict level in the Bwitter tests. We believe the light network is closer to reality, because it is absurd to imagine that each user follows 2,5% of the users in the network. We thus designed this other network, which has a smaller ratio (users followed / number of users) equal to 0,00625, to see how our application reacts to different levels of conflict. We now detail the parameters related to Scalaris, grouped in Table 6.3.

Scalaris node type            Small instance
Number of Scalaris nodes      Varies from 4 to 18
Connections per node          Usually one, can vary during the tests
Number of trials per SR       10
Number of parallel requests   Usually 20, varies with the total number of connections to Scalaris nodes

Table 6.3: Scalaris parameters.

We can use a maximum of 20 nodes during the experiments, taking into account both Scalaris nodes and Bwitter nodes; however, we use at most 19 for historical reasons. In order to maintain a high load during all our tests, we constantly make 20 operations in parallel. If we use a higher number of connections per node, we increase this value so that it is always higher than the number of connections to Scalaris nodes. Finally, we have configured the Scalaris Connection Manager so that each SR is restarted 10 times before being aborted; a minimal sketch of this retry policy is given below.
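The following sketch illustrates such a restart policy; the SR interface, its execute method, and the exception handling are illustrative assumptions, not the actual Scalaris Connection Manager code.

// Hypothetical interface for an SR executed against Scalaris; illustration only.
interface SR {
    void execute() throws Exception;   // throws if the transaction fails or conflicts
}

public class SRRunner {
    static final int MAX_TRIALS = 10;  // number of trials per SR used in our tests

    // Runs an SR, restarting it up to MAX_TRIALS times before giving up.
    static void runWithRetries(SR sr) throws Exception {
        Exception last = null;
        for (int trial = 0; trial < MAX_TRIALS; trial++) {
            try {
                sr.execute();          // counted as "SR run" on success
                return;
            } catch (Exception e) {
                last = e;              // counted as "SR restarted"
            }
        }
        throw last;                    // counted as "SR aborted"
    }
}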

We now present the Bwitter parameters, grouped in Table 6.4.

Dispatcher / Bwitter node type   Small or Large instance
tweetchunksize                   30
nbrOfFollowersPerChunk           20

Table 6.4: Bwitter application parameters.

We have two Bwitter application parameters to fix, namely tweetchunksize and nbrOfFollowersPerChunk. They are likely to have an impact on the results of our tests, as they influence the number of tweets read and the number of operations involved in a write. We have chosen them so that the first tweet chunk contains a decent number of tweets in order to have relevant tests. With a tweetchunksize of 30, we estimate the number of tweets in the head chunk at the start of the test to be around 20; indeed, each user should have received around 50 tweets on his line during the initialisation phase.

Real system with stars and fans

In order to stick as closely as possible to reality, we have decided to populate our system with two kinds of users: stars and fans. Indeed, on Twitter, some users have far more followers than users they follow, while the others follow more people than they have followers. We fixed the proportion of stars in the system at 10%, the rest of the users being fans. For each user in the system, 75% of the users he follows are stars. An example of a simulated network can be seen in Figure 6.12.

Figure 6.12: Simulated social network with social connections between users, each user follows 3 users. Left) Random following pattern. Right) Nodes 2 and 4 are stars and each user has a 2/3 probability per connection to follow a star.

Furthermore, users tend to do more reads than posts when visiting social networks. We took this behaviour into account as well by allowing the ratio of tweet-reading operations to the total number of operations to be fixed. We use the parameters listed in Table 6.5 for all the tests; a sketch of the corresponding user and operation selection follows the table.

Stars percentage                              10% of users are stars
Percentage of stars among the users followed  75% of the users followed
Read percentage                               80% of the operations are reads

Table 6.5: Social network parameters, part 2.
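A minimal sketch of how this mix could be generated during the main phase; the class and method names are illustrative assumptions and this is not our actual simulator, which additionally uses a hash function to make the following pattern deterministic.

import java.util.Random;

public class WorkloadGenerator {
    static final double STAR_FRACTION = 0.10;             // 10% of users are stars
    static final double STAR_FOLLOW_PROBABILITY = 0.75;   // 75% of followed users are stars
    static final double READ_FRACTION = 0.80;             // 80% of operations are reads

    final Random random = new Random();
    final int nbrUsers;
    final int nbrStars;

    WorkloadGenerator(int nbrUsers) {
        this.nbrUsers = nbrUsers;
        this.nbrStars = (int) (nbrUsers * STAR_FRACTION); // users [0, nbrStars) are stars
    }

    // Picks a user to follow: a star with probability 0.75, otherwise a fan.
    int pickUserToFollow() {
        if (random.nextDouble() < STAR_FOLLOW_PROBABILITY) {
            return random.nextInt(nbrStars);
        }
        return nbrStars + random.nextInt(nbrUsers - nbrStars);
    }

    // Picks the next operation type: read tweets with probability 0.8, else post a tweet.
    boolean nextOperationIsRead() {
        return random.nextDouble() < READ_FRACTION;
    }
}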

Cache influence

With this test we show that a cache mechanism is not optional but in fact crucial for the performance of the system. We made two runs of our Bwitter simulation, one with the cache and one without. The parameters used for the two runs are the ones we just fixed, with the exception of those listed in Table 6.6.

Type of social network      Heavy network
Dispatcher / Bwitter node   Large instance
Number of Scalaris nodes    18
Connections per node        1

Table 6.6: Parameters changed for the cache test.

We have put the test results in Table 6.7. Remember that we have set a time to live of 1 minute for the elements in the cache. Those elements thus stay at most 1 minute in the cache before being ejected, meaning that a deleted tweet can remain visible for a maximum of 1 minute. The cache is big enough to keep all the elements of the test that can be cached. This would probably not be the case in a real situation; if we must remove an element from the cache because it is full, we use a least recently used strategy, as explained in section 3.2.3.

                                        Without cache   With cache
Time taken for all the operations (s)   863s            492s
Throughput (ops/s)                      23,15 ops/s     40,59 ops/s
Failure percentage                      1,32%           3,18%
Cache hits                              /               250704
Cache misses                            /               4431

Table 6.7: Performance comparison with and without application cache.

Obviously the cache is the quicker option; it allows us to nearly double the number of operations performed per second. This noticeable performance improvement is explained entirely by the frequent accesses to the cache. The cache is mainly used to access tweets and passwords. Assuming the tweets are in the cache, we avoid X transactions to Scalaris, where X is the number of tweets read in one read operation. We saw during the previous tests that reading a value from Scalaris takes approximately 1,5 ms with 18 nodes, while the cache statistics indicate that the mean access time to the cache is 0,006 ms. It is thus theoretically 250 times faster! Given that we have 250704 hits, we gain (1,5 − 0,006) × 250704 ms = 374551 ms ≈ 375 s over the whole test. The difference between the two test times is 371 s; the cache is thus indeed the main factor improving the performance.

A side effect of using the cache is a higher failure percentage: it goes from 1,32% to 3,18%, which is still a very good result, meaning that almost all the Scalaris operations were correctly performed. This is probably due to the higher number of concurrent postings caused by the cache usage: the tweet reading operations are a lot quicker and thus we have more concurrent tweet posting operations than without the cache, which implies more conflicts. Indeed, we did a quick test and observed that when only reading tweets we end up with a failure percentage of 0, whereas when only posting tweets we had a failure percentage of 30%. Our assumption that more concurrent tweet postings are responsible for this increase in failure percentage is thus reasonable. In conclusion, the cache improves the global performance. The tweet reading algorithm mainly benefits from the cache, making reads even faster, which was our goal. We could probably still optimize the cache usage but decided not to focus on this part. The following tests will thus all use the cache described here.
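As an illustration of the cache behaviour described above (a 1-minute time to live combined with least-recently-used eviction), here is a minimal sketch built on java.util.LinkedHashMap; it is not our actual cache implementation.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal TTL + LRU cache sketch; not the actual Bwitter cache implementation.
public class TtlLruCache<K, V> {
    private static class Entry<V> {
        final V value;
        final long insertedAt;
        Entry(V value, long insertedAt) { this.value = value; this.insertedAt = insertedAt; }
    }

    private final long ttlMillis;
    private final LinkedHashMap<K, Entry<V>> map;

    public TtlLruCache(final int maxEntries, long ttlMillis) {
        this.ttlMillis = ttlMillis;
        // accessOrder = true gives LRU ordering; the eldest entry is evicted when full.
        this.map = new LinkedHashMap<K, Entry<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Entry<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized void put(K key, V value) {
        map.put(key, new Entry<>(value, System.currentTimeMillis()));
    }

    // Returns null on a miss or when the entry is older than the TTL (1 minute in our tests).
    public synchronized V get(K key) {
        Entry<V> entry = map.get(key);
        if (entry == null) return null;                           // cache miss
        if (System.currentTimeMillis() - entry.insertedAt > ttlMillis) {
            map.remove(key);                                      // expired: eject and report a miss
            return null;
        }
        return entry.value;                                       // cache hit
    }
}

With a maximum size large enough to hold all cacheable elements and a TTL of one minute, this reproduces the two properties used in the test above.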

Number of followers in a chunk of the topost set.

Before starting the scalability test we were curious to know the influence of nbrOfFollowersPerChunk on the performance of our system in practice. We first list some theoretical elements that should help us understand the results; then we run a simulation to see whether they are verified in a real test. Recall that the higher the nbrOfFollowersPerChunk, the higher the number of keys involved in a write transaction, but the lower the number of necessary transactions. Moreover, transactions involving more keys are in general more likely to fail. Making use of our theoretical analysis, we compute, using Equation 5.2, that we need respectively 174, 110, 102 and 98 Scalaris operations to do a single write for values of nbrOfFollowersPerChunk of 1, 5, 10 and 20. Those results are displayed in Figure 6.13.

\begin{align*}
nbOp &= 8 + nbrFollowers \times \left(2 + \frac{3}{nbrTweetsPerChunk} + \frac{2}{nbrOfFollowersPerChunk}\right) \\
     &= 8 + 40 \times \left(2 + \frac{3}{20} + \frac{2}{nbrOfFollowersPerChunk}\right) \\
     &= 8 + 80 + 6 + \frac{80}{nbrOfFollowersPerChunk} \\
     &= 94 + \frac{80}{nbrOfFollowersPerChunk} \qquad (6.3)
\end{align*}

Figure 6.13: Number of Scalaris operations needed to perform a Bwitter “post tweet” operation with respect to the number of followers per chunk.
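Assuming the formula above, the values quoted in the text can be reproduced with a few lines of code; this is only a quick check, not part of the test harness.

public class NbOpCheck {
    // Scalaris operations needed for a single "post tweet", per Equation 6.3,
    // with nbrFollowers = 40 and nbrTweetsPerChunk = 20.
    static int nbOp(int nbrOfFollowersPerChunk) {
        return 94 + 80 / nbrOfFollowersPerChunk;
    }

    public static void main(String[] args) {
        for (int chunkSize : new int[] {1, 5, 10, 20}) {
            System.out.println(chunkSize + " -> " + nbOp(chunkSize)); // 174, 110, 102, 98
        }
    }
}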

We now want to evaluate the impact of this parameter in practice. We simulate a social network with a higher level of conflict than the two already presented, in order to have a clearer view of this impact: the conflict levels of the heavy and light networks are 0,025 and 0,00625 respectively, while in this test it is equal to 0,06. We measure the time needed to perform 10000 operations for different values of nbrOfFollowersPerChunk. We summarize the simulation parameters in Table 6.8 and present our results in Figures 6.14 and 6.15.

Bwitter node / Dispatcher          Small
Number of Scalaris nodes           10
Number of Bwitter operations       10000
Number of users                    700
Users followed                     40
Users followed / Number of users   0,06

Table 6.8: Parameters changed for the Topost set influence test.

Figure 6.14: Time measured to perform 10000 Bwitter operations with respect to the number of followers per chunk, results for small instances and conflict level of 0,06.

We can see that the time drops a lot between one and five followers per chunk. This is not surprising, as the number of operations to do per tweet posted decreases a lot between one and five, as shown in Figure 6.13. The time difference is entirely explained by the lower number of operations done for one tweet posting; indeed, the cost of the read operation stays the same whatever nbrOfFollowersPerChunk. However, if we follow this reasoning the time should continue to decrease between 5 and 20, while it actually seems to stagnate and even increase slightly at 20. We can explain this by looking at Figure 6.15, which plots the failure percentage and shows a big increase of the failure percentage between 5 and 20. Indeed, as we mentioned in the introduction of this section, the bigger nbrOfFollowersPerChunk, the bigger the number of keys involved per transaction during a tweet posting, and larger transactions induce more conflicts and thus more failures. The advantage of having fewer structures to manage at higher values of nbrOfFollowersPerChunk thus seems to be compensated by the number of failures, which also increases with nbrOfFollowersPerChunk; this is why we observe this stagnation at the end of the graph. In conclusion, we should not use a value too small for nbrOfFollowersPerChunk, as in this case the number of operations increases a lot and the time thus explodes. On the other hand, we should not use a value too high either, as it quickly increases the number of failures and thus the time. This is why we decided to use a value of 20 for nbrOfFollowersPerChunk in all our following tests: it seems a good compromise. It is maybe not the best choice, but at least it seems to be a wise one.

Figure 6.15: Failure percentage for 10000 Bwitter operations with respect to the number of followers per chunk, results for small instances and conflict level of 0,06.

Scalability tests

With this test we evaluate the scalability of our application. We run our simulation with the parameters described at the beginning of this section for different numbers of nodes, using the heavy network presented above. We do not know for sure what the best connection strategy would be, as the degree of conflict of our simulation is hard to evaluate. We thus test with a small dispatcher with one connection per node (1) and with a small dispatcher with two connections per node (2). We also test with a large dispatcher and one connection per node (3), as a small dispatcher may not be powerful enough to handle the Bwitter tasks and a lot of Scalaris connections. Our results are grouped in Figures 6.16 and 6.17: the first shows the throughput with respect to the number of nodes and the second plots the failure percentage.

Figure 6.16: Throughput for 20000 Bwitter operations on a heavy network with respect to the number of Scalaris nodes, results for one small dispatcher with one connection per node, one small dispatcher with two connections per node and one large dispatcher with one connection per node.

Figure 6.17: Failure percentage for 20000 Bwitter operations on a heavy network with respect to the number of Scalaris nodes, results for one small dispatcher with one connection per node, one small dispatcher with two connections per node and one large dispatcher with one connection per node.

From Figure 6.16 we can see that (2) does not scale well: the throughput first increases until 12 nodes and then decreases to a level below the throughput reached at 4 nodes. Its failure percentage is more than twice the one of (1) and (3), which seems to indicate that there are too many connections toward Scalaris. A simulation with a smaller conflict level could have benefited from a higher number of connections, but we did not test it. The throughput of (1) grows at a regular pace until 14 nodes and then seems to slow down. We observed the same behavior during the Scalaris scalability tests, but it is more obvious here. Its failure percentage grows linearly with the number of nodes, which is expected. Configuration (3) gives far better results in terms of throughput than (1) and (2): it grows very well until 16 nodes and suddenly falls at 18 nodes. However, the gap between 14 and 16 nodes seems higher than usual, so we believe this situation was created by exceptional conditions.

We deduce from the observation of the throughput of (1), (2) and (3) that a small dispatcher cannot handle both the Bwitter tasks and the Scalaris related tasks. We have indeed observed with Amazon's basic monitoring tools that the CPU as well as the network were used a lot more during these Bwitter tests than during the Scalaris tests. This is not surprising, as the values and keys used are bigger than during the Scalaris tests and Bwitter performs various additional tasks. It therefore indicates that it is necessary to use more powerful machines than Amazon's small instances for the Bwitter nodes. The failure percentage grows slowly and is nearly the same as for (1) until we reach 12 nodes. From 12 to 18 nodes, (3) sees its failure percentage growing faster. This is probably because it has more CPU and network capacity and can thus run more transactions in parallel, which creates more conflicts. However, this seems to indicate that the gain of adding one node will decrease slowly as the number of nodes grows. This is not surprising and does not indicate a scalability problem. Indeed, during this test we increased the number of parallel operations while keeping the number of users stable. Normally, the number of machines grows with the size of the social network and thus with the number of users, but a user should not follow more users simply because there are more users in the network. In conclusion, Bwitter is scalable, but we need the Bwitter nodes to be powerful enough to handle the necessary number of connections toward Scalaris while performing the Bwitter tasks. We now make a final scalability test with a simulated social network with a smaller conflict level which, we believe, is closer to reality. We only run the tests with one large dispatcher and one connection per node. The parameters changed for this test are given in Table 6.9.

Bwitter node / Dispatcher   Large
Number of Scalaris nodes    Varies from 4 to 18
Connections per node        1
Network type                Heavy and Light network

Table 6.9: Parameters changed for the push scalability test.

This means that we now have a conflict level of 25/4000 = 0,00625. We show in Figures 6.18 and 6.19 the results of the test as well as the results for the more dense social network so that we can more easily compare the two.

Figure 6.18: Throughput for 20000 Bwitter operations with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.

We observe, as expected, better performance with a smaller conflict level. The failure percentage increases much more slowly than before, which explains the tremendous gain in performance. Looking at the two Bwitter scalability tests, we can see a pretty clear correlation between the failure percentage and the conflict level. With 18 nodes and this conflict level we finally reach 66 ops/s, which means around 13 tweets posted/s and 53 reads/s. If we make a small computation and assume a user posts 3 tweets a day and reads his tweets 12 times a day, we estimate that we can handle 380162 users with only 19 machines (roughly 13,2 posts/s × 86400 s per day ÷ 3 posts per user per day; the read rate leads to the same bound). This is obviously overestimated and not precise, but even a quarter of this number would be a good result. During those tests we observed good scalability properties for the large dispatchers; the small dispatchers were too short on resources. As for the Scalaris scalability test, we saw that a high conflict level reduces the throughput and lowers the gain obtained from adding a machine. We now test Bwitter's elasticity.

Figure 6.19: Failure percentage for 20000 Bwitter operations with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.

Elasticity tests

The scalability tests of Bwitter have shown good scalability results from 4 to 18 nodes for both the heavy and the light network. However, we have decided to use the light one: the throughput increases faster with the number of nodes with this network, so we believe it is easier to observe elasticity with it. Concerning nbrInitialData, defined during the elasticity tests on Scalaris, we have decided to increase its value up to 20000. Indeed, the Scalaris elasticity test did not seem to show any instability after node additions, so we decided to try to increase the impact of the churn. The initialisation phase is very long: it takes approximately 45 minutes to post the initial data and an additional 40 minutes to initialize Bwitter related data such as followers, tweets and so on. It was thus not possible to push the amount of initial data much higher, though we would have liked to. We keep the seven strategies we defined during the elasticity tests on Scalaris and start with 6 initial nodes. However, the results should be quite different, because we used much more initial data and Bwitter adds an important CPU and network overhead compared to the Scalaris operations we did before. We present the results in Figures 6.20 and 6.21. As for the last elasticity test, we present the evolution of the throughput as well as the failure percentage; we also indicate with blue dots the moment we start the machines on Amazon and with red dots the moment at which Scalaris is started on the nodes.

Figure 6.20: Throughput with respect to time, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and a light network.

Figure 6.21: Failure percentage with respect to time, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and a light network.

We also indicate the final number of nodes reached by each strategy in Table 6.10.

Strategy      1   2   3   4   5   6   7
Nodes added   0   1   5   8   8   12  12

Table 6.10: Number of nodes inserted in the ring at the end of the test.

First, we can observe that the throughput is much more unstable than during the Scalaris elasticity test. The first reason is that the measure we take is much more volatile. Secondly, we have put a lot more initial data in the system; this may slow down Scalaris at times, delaying some read or post tweet operations that then only complete at the next sample, which creates large gaps between two measures. Thirdly, Bwitter operations are much heavier than the operations we did during the Scalaris scalability test, which may also have an impact on the results.

We will not discuss each strategy in detail as we did for Scalaris; instead we make some general comments. We can observe that the first strategy's throughput varies a lot (between 20 and 30) all along the test, which means that even without adding any node the throughput is quite variable. We can also see that, as for the Scalaris elasticity test, between the moment we start instances on Amazon and the moment Scalaris is started on the nodes, the throughput slows down. The addition of nodes is once again directly effective, and the throughput in general increases. We also observe that most of the strategies had not stabilized by the end of the test and that their throughputs still vary a lot. But, as expected, the strategies that added the most nodes during the test reached the highest throughput. Since the throughput varies a lot, it is not representative to choose a strategy according to the final throughput; we thus turn toward the average throughput, represented in Figure 6.22, which is much easier to analyze. As we can see, strategies 6 and 7, which reach the highest number of nodes at the end, also have the highest average throughput. Strategies 4 and 5 have a similar average throughput, but 4 has a higher one because it adds its nodes before 5 and can thus benefit sooner from the new nodes. As we can see in Figure 6.21, the failure percentage also varies a lot, and when it reaches a peak the throughput naturally drops. We see that when the number of nodes grows the failure percentage also varies much more. We suppose the peaks are an effect of the Scalaris stabilisation algorithm run periodically.

So, once again, our conclusion is that the quicker you add nodes, the quicker you increase the throughput and the higher the average throughput obtained during the test. However, we can observe that strategy (7) was less stable than during the Scalaris elasticity test. So perhaps, if we could have performed elasticity tests with more nodes, we would have observed that adding all the nodes at the same time was not a good idea. To conclude, we can say that, with our current resources, adding all the nodes at the same time seems to be the best strategy.

Figure 6.22: Average throughput, Bwitter results for the seven presented strategies on Scalaris small instances with a large dispatcher and a light network.

6.3.3 Pull scalability test

In this final section we test the scalability of the pull approach. We use exactly the same parameters as those described at the beginning of this section. As for the other scalability tests, we vary the number of nodes from 4 to 18, use one connection per node and perform 20000 Bwitter operations. We simulate the heavy and the light networks. Those parameters are summarized in Table 6.11.

Bwitter node / Dispatcher      Large
Number of Scalaris nodes       Varies from 4 to 18
Connections per node           1
Number of Bwitter operations   20000
Network type                   Heavy and Light network
Users followed                 40

Table 6.11: Parameters changed for the pull scalability test.

The heavy network should give much worse results than the other one. Indeed, from the theoretical analysis, we know that the complexity of the read operation grows linearly with the number of followed users for the pull approach. We read one chunk (here one time frame) as we did for the push. We have set the time frame to one day, which is a reasonable choice for a real application, so all the tweets posted end up in the same chunk. It may thus seem unfair compared to the push approach, which flushes the head when it is full, while in the pull we are forced to read all the tweets that were posted during the day. However, because we use a cache, this side effect is strongly mitigated: most of the tweets are in the cache and its read access is really quick. The pull and push simulations are thus comparable. As the large dispatcher gave better results for the push approach, we decided to run this test with a large dispatcher as well. Concerning the Scalaris nodes we use, as usual, the small instances. We show the throughput and the failure percentage for the heavy and the light network in Figures 6.23 and 6.24.

Figure 6.23: Throughput for 20000 Bwitter operations with the pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.

The pull approach presents excellent scalability for the two networks: the throughput increases perfectly linearly with the number of nodes. This good behavior is due to the failure percentage, which grows extremely slowly with the number of nodes. This seems to indicate that it can handle a really high number of nodes. The low failure percentage is the consequence of the low number of writes involved in the pull version of the tweet posting. Remember that the pull only writes the tweet reference in one place and that when followers read their tweets they do not perform any write; operations in the pull are thus nearly not conflictual at all. The failure percentage for the light network seems to increase a lot at 16 nodes, but it is only a visual effect: it only increases by approximately 0,05%. For the same parameters, namely those described at the beginning of this section, we go from around 18000 to 250000 Scalaris operations. This is due to the high number of reads and low number of writes in our test. As predicted, the reads require many more operations when using the pull design.

Figure 6.24: Failure percentage for 20000 Bwitter operations with the pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.

6.3.4 Conclusion: Pull versus Push

We want to say some final words about the pull and the push approaches. We can only compare the scalability tests, because we did not perform elasticity tests for the pull. The results can be directly compared because we used the same parameters for the two approaches. We show, as usual, the throughput and the failure percentage of the push and the pull for the two networks we tested; they are shown in Figures 6.25 and 6.26. We can observe that the push approach outperforms the pull in terms of throughput for both network types. The throughput also increases faster with the number of nodes in the push approach. However, we can see that the throughput increase for the push, as already observed, seems to slow down when we reach a high number of nodes. This is not the case for the pull approach, which grows more steadily. The push approach will thus probably reach a scalability limit sooner than the pull.

Figure 6.25: Throughput for 20000 Bwitter operations with the push and pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.

Concerning the failure percentage, it is much higher in the push approach and increases much more quickly with the number of nodes. This explains why the increase of the throughput in the push approach slows down with the number of nodes. The pull approach does not present this problem, as its failure percentage grows very slowly and is extremely low. We thus conclude that the two approaches have their pros and cons. The push approach presents a much better scalability, but at the cost of a higher failure percentage. Both approaches scale well, but the pull does not seem to slow down, which seems to indicate that the pull would be the most appropriate for a very high number of nodes. However, this last conclusion is purely hypothetical; we would need much larger scale tests in order to confirm this intuition.

Figure 6.26: Failure percentage for 20000 Bwitter operations with the push and pull approach with respect to the number of Scalaris nodes for the heavy and the light network, for one large dispatcher with one connection per node.

6.4 Conclusion

In this chapter we have shown how to configure Amazon and Scalaris. We performed a series of tests concluding that Scalaris running on Amazon was indeed scalable and elastic. We then performed a series of tests on Bwitter for both the push and the pull approaches, and demonstrated that both scale very well, the push approach presenting a quicker increase of performance but with a failure percentage growing much faster. Finally, we showed that our system, based on the push approach, was able to significantly improve its performance within 15 minutes while facing a high load.

Chapter 7

Conclusion

Our goal was to design and develop a scalable and elastic implementation of a social network application on top of a key/value datastore. Looking at the results detailed in the previous chapter, we are confident we have reached our goal. Indeed, we developed an implementation of our pull and push designs, and both showed good scalability results. The elasticity was only tested for the push approach, and we showed it was possible to quickly improve performance while assuring a good level of service. All those tests were achieved under real world assumptions using Amazon's Elastic Compute Cloud infrastructure. The implementation was realized with the goal of being as close as possible to a real social network application; we thus took care to protect user data and to avoid security flaws. During our work with Beernet and its main developer Boris Mejías, we identified the basic requirements to allow different services to run on the same DHT without interfering with each other. Those led to the discovery of some potential improvements for Beernet's API, which are now implemented in version 0.9. This new API allows users to protect and grant limited rights to their data by using a system of secrets. Before testing Bwitter, we also heavily tested Scalaris in order to understand the future Bwitter test results. We first showed the importance of choosing the right number of connections. Afterwards, we studied its scalability in depth and tried different strategies in order to evaluate the elasticity of Scalaris on Amazon's EC2. It was shown to be highly scalable and elastic. Besides this work, we have also co-written an article, along with Peter Van Roy and Boris Mejías, entitled "Designing an Elastic and Scalable Social Network Application". In this article we detail some of the observations and design decisions developed in this master thesis. This article, which can be found in Chapter 10 of our annexes, has been accepted for The Second International Conference on Cloud Computing, GRIDs, and Virtualization1, organized by IARIA and held from the 25th to the 30th of September 2011 in Rome, Italy.

1CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011

7.1 Further work

On multiple occasions during the tests, we concluded that it would have been interesting to perform the tests with more nodes in order to get a better idea of the scalability and the elasticity. Indeed, during this work our tests were limited to 20 machines, and while Bwitter displayed good performance in this environment, it would have been interesting to increase the number of machines in order to approach a more realistic scale. We also believe the flash crowd detection mechanism is an interesting subject to study. Indeed, during our research, we noticed that there are sometimes telltale behaviours in the network before a high peak of activity. It would thus be interesting to try to design a mechanism based on those social behaviours in order to predict heavy loads and allocate machines before the peak. We did not study downscaling elasticity in our work because, according to the Scalaris developers, their system does not yet handle graceful shutdowns in version 0.3.0. It would thus be interesting to observe and test Bwitter on Scalaris once this feature is implemented, in order to study its behaviour. We did not address load balancing between Bwitter nodes, but it could be interesting to develop an algorithm to decide which requests should be forwarded to which Bwitter node in order to share the load between them. Following the same idea, some requests, like tweets posted by stars, are quite heavy; it might also be a good idea to split such work between the Bwitter nodes and not only between the Scalaris nodes. Finally, the load balancer of the Scalaris Connection Manager could be improved in order to decide which SR should be executed so as to decrease the conflict between SRs executed concurrently.

132 Bibliography

[1] Apache. Apache hbase, frontpage. http://hbase.apache.org, 2011. [Online; accessed 28-June-2011].

[2] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Sto- ica, and Matei Zaharia. Above the clouds: A berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, EECS Department, University of Califor- nia, Berkeley, Feb 2009. URL http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/ EECS-2009-28.html.

[3] Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Looking up data in p2p systems. Commun. ACM, 46:43–48, February 2003. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/606272.606299. URL http://doi.acm.org/10.1145/606272.606299.

[4] Shea Bennett. Twitter passes 300 million users, seeing 9.2 new registrations per sec- ond. (allegedly.). http://www.mediabistro.com/alltwitter/twitter-300-million-users b9026, 2011. [Online; accessed 28-June-2011].

[5] John Buford, Heather Yu, and Eng Keong Lua. P2P Networking and Applica- tions. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008. ISBN 0123742145, 9780123742148.

[6] Nicholas Carlson. Facebook has more than 600 million users, goldman tells clients. http://www.businessinsider.com/ facebook-has-more-than-600-million-users-goldman-tells-clients-2011-1, 2011. [Online; accessed 28-June-2011].

[7] Rick Cattell. Scalable sql and nosql data stores. ACM SIGMOD Record, 39(4), dec 2010.

[8] Chris Clayton. Standard cloud taxonomies and windows azure. http://blogs.msdn.com/b/cclayton/archive/2011/06/07/standard-cloud-taxonomies-and-windows-azure.aspx, 2011. [Online; accessed 26-July-2011].

[9] Technology Expert. Twitter proves itself again, in chilean earthquake. http:// www.tech-ex.net/2010/02/twitter-proves-itself-again-in-chilean.html, 2010. [Online; accessed 28-June-2011].

[10] Code Futures. Database sharding. http://www.codefutures.com/database-sharding/, 2011. [Online; accessed 28-June-2011].

[11] Ali Ghodsi. Distributed k-ary System: Algorithms for Distributed Hash Tables. PhD thesis, KTH – Royal Institute of Technology, Stockholm, Sweden, dec 2006.

[12] Ali Ghodsi, Luc Alima, and Seif Haridi. Symmetric replication for structured peer-to-peer systems. In Gianluca Moro, Sonia Bergamaschi, Sam Joseph, Jean-Henry Morin, and Aris Ouksel, editors, Databases, Information Systems, and Peer-to-Peer Computing, volume 4125 of Lecture Notes in Computer Science, pages 74–85. Springer Berlin / Heidelberg, 2007. URL http://dx.doi.org/10.1007/978-3-540-71661-7 7.

[13] Ali Ghodsi, Luc Onana Alima, and Seif Haridi. Symmetric replication for structured peer-to-peer systems. In Proceedings of the 2005/2006 interna- tional conference on Databases, information systems, and peer-to-peer computing, DBISP2P’05/06, pages 74–85, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 978-3-540-71660-0. URL http://portal.acm.org/citation.cfm?id=1783738.1783748.

[14] Jim Gray and Leslie Lamport. Consensus on transaction commit. ACM Trans. Database Syst., 31:133–160, March 2006. ISSN 0362-5915. doi: http://doi.acm. org/10.1145/1132863.1132867. URL http://doi.acm.org/10.1145/1132863.1132867.

[15] Sameh El-Ansary and Seif Haridi. An overview of structured overlay networks. Handbook on Theoretical and Algorithmic Aspects of Sensor, Ad Hoc Wireless, and Peer-to-Peer Networks, 2005.

[16] Abigail Hauslohner. Is egypt about to have a facebook revolution? http://www. time.com/time/world/article/0,8599,2044142,00.html, 2011. [Online; accessed 28- June-2011].

[17] Bill Heil and Mikolaj Piskorski. New twitter research: Men follow men and nobody tweets. http://blogs.hbr.org/cs/2009/06/new twitter research men follo.html, 2009. [Online; accessed 28-June-2011].

[18] Rachelle Matherne. Social media coverage of the haiti earthquake. http://sixestate. com/social-media-coverage-of-the-haiti-earthquake/, 2010. [Online; accessed 28- June-2011].

[19] Boris Mejías and Peter Van Roy. Beernet: Building self-managing decentralized systems with replicated transactional storage. IJARAS: International Journal of Adaptive, Resilient, and Autonomic Systems, 1(3):1–24, July-Sept 2010. ISSN 1947-9220. doi: 10.4018/jaras.2010070101.

[20] MySQL. Mysql cluster. http://www.mysql.com/products/cluster/, 2011. [Online; accessed 28-June-2011].

[21] John Naughton. Yet another facebook revolution: why are we so surprised? http:// www.guardian.co.uk/technology/2011/jan/23/social-networking-rules-ok, 2011. [On- line; accessed 28-June-2011].

[22] Peter Mell and Timothy Grance. The nist definition of cloud computing (draft). Recommendations of the National Institute of Standards and Technology, 2011.

[23] Programming Languages and Distributed Computing Research Group, UCLou- vain. Beernet: pbeer-to-pbeer network. http://beernet.info.ucl.ac.be, 2009. URL http://beernet.info.ucl.ac.be.

[24] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A scalable content-addressable network. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer commu- nications, SIGCOMM ’01, pages 161–172, New York, NY, USA, 2001. ACM. ISBN 1-58113-411-8. doi: http://doi.acm.org/10.1145/383059.383072. URL http: //doi.acm.org/10.1145/383059.383072.

[25] Redis. Redis. http://redis.io/, 2011. [Online; accessed 28-June-2011].

[26] Sean Rhea, Brighten Godfrey, Brad Karp, John Kubiatowicz, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, and Harlan Yu. Opendht: a public dht service and its uses. SIGCOMM Comput. Commun. Rev., 35:73–84, August 2005. ISSN 0146- 4833. doi: http://doi.acm.org/10.1145/1090191.1080102. URL http://doi.acm.org/ 10.1145/1090191.1080102.

[27] Alex Rodriguez. Restful web services: The basics. https://www.ibm.com/ developerworks/webservices/library/ws-restful/, 2008. [Online; accessed 13-August- 2011].

[28] Antony Rowstron and Peter Druschel. Storage management and caching in past, a large-scale, persistent peer-to-peer storage utility. SIGOPS Oper. Syst. Rev., 35: 188–201, October 2001. ISSN 0163-5980. doi: http://doi.acm.org/10.1145/502059. 502053. URL http://doi.acm.org/10.1145/502059.502053.

[29] Thorsten Schütt, Florian Schintke, and Alexander Reinefeld. Scalaris: reliable transactional p2p key/value store. In Proceedings of the 7th ACM SIGPLAN workshop on ERLANG, ERLANG '08, pages 41–48, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-065-4. doi: http://doi.acm.org/10.1145/1411273.1411280. URL http://doi.acm.org/10.1145/1411273.1411280.

[30] Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Bal- akrishnan. Chord: A scalable peer-to-peer lookup service for internet applica- tions. SIGCOMM Comput. Commun. Rev., 31:149–160, August 2001. ISSN 0146- 4833. doi: http://doi.acm.org/10.1145/964723.383071. URL http://doi.acm.org/ 10.1145/964723.383071.

[31] Chunqiang Tang, Zhichen Xu, and Mallik Mahalingam. psearch: information retrieval in structured overlays. SIGCOMM Comput. Commun. Rev., 33:89–94, January 2003. ISSN 0146-4833. doi: http://doi.acm.org/10.1145/774763.774777. URL http://doi.acm.org/10.1145/774763.774777.

[32] G. Tselentis, J. Domingue, A. Galis, A. Gavras, and D. Hausheer. Towards the Future Internet: A European Research Perspective. IOS Press, Amsterdam, The Netherlands, 2009. ISBN 1607500078, 9781607500070.

[33] Twitter. #numbers. http://blog.twitter.com/2011/03/numbers.html, 2011. [Online; accessed 28-June-2011].

[34] Guido Urdaneta, Guillaume Pierre, and Maarten Van Steen. A survey of dht security techniques. ACM Comput. Surv., 43:8:1–8:49, February 2011. ISSN 0360- 0300. doi: http://doi.acm.org/10.1145/1883612.1883615. URL http://doi.acm.org/ 10.1145/1883612.1883615.

[35] Harry Wallop. Japan earthquake: how twitter and facebook helped. http://www.telegraph.co.uk/technology/twitter/8379101/ Japan-earthquake-how-Twitter-and-Facebook-helped.html, 2011. [Online; ac- cessed 28-June-2011].

[36] Evan Weaver. Improving running components. http://www.slideshare.net/Eweaver/ improving-running-components-at-twitter, 2009. [Online; accessed 28-June-2011].

[37] Wikipedia. Trusted platform module. http://en.wikipedia.org/wiki/Trusted Platform Module, 2011. [Online; accessed 28-June-2011].

[38] Wikipedia. Trusted computing. http://en.wikipedia.org/wiki/Trusted computing# Remote attestation, 2011. [Online; accessed 28-June-2011].

[39] Wikipedia. Partition (database). http://en.wikipedia.org/wiki/Partition (database), 2011. [Online; accessed 28-June-2011].

[40] Ethan Zuckerman. The first twitter revolution? http://www.foreignpolicy.com/ articles/2011/01/14/the first twitter revolution, 2011. [Online; accessed 28-June- 2011].

Part II

The Annexes


Chapter 8

Beernet Secret API

8.1 Without replication

8.1.1 Put

put(S:Secret K:Key V:Val)

Stores the triplet (Hash(Secret) Key Val) at the node responsible for the hash of Key. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is nothing stored associated with the key Key, or there is a triplet stored previously by a put operation;
• there is no triplet (Secret1 Key Val1) stored at the node responsible for the hash of Key such that Hash(Secret) ≠ Hash(Secret1);
• the value has successfully been stored.

Otherwise the operation returns "abort" and nothing is changed. If no value is specified for Secret, Beernet will assume the call is equivalent to put(S:NO SECRET K:Key V:Val).
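
To make the call pattern concrete, here is a minimal sketch in Java. Beernet is actually driven through its Oz socket API, so the BeernetKV interface and its method names are assumptions made purely for illustration; only the commit/abort rules follow the description above.

// Hypothetical Java-side view of the put operation; in reality Beernet is
// accessed through an Oz socket protocol, so this interface is illustrative only.
interface BeernetKV {
    /** Returns "commit" or "abort", following the put semantics described above. */
    String put(String secret, String key, String value);
}

class PutSketch {
    static void demo(BeernetKV beernet) {
        // Nothing is stored under this key yet, so the put commits.
        String r1 = beernet.put("aliceSecret", "user/alice", "profile-v1");
        // Updating with the same secret commits as well.
        String r2 = beernet.put("aliceSecret", "user/alice", "profile-v2");
        // A different secret does not hash to the stored Hash(Secret1),
        // so the put aborts and the stored value is left untouched.
        String r3 = beernet.put("otherSecret", "user/alice", "hijacked");
        // Omitting the secret is treated as the reserved NO SECRET value.
        String r4 = beernet.put(null, "public/key", "unprotected value");
        System.out.println(r1 + " " + r2 + " " + r3 + " " + r4); // commit commit abort commit
    }
}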

8.1.2 Delete

delete(S:Secret K:Key)

Deletes the triplet (Hash(Secret1) Key Val) stored at the node responsible for the hash of Key. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is a triplet (Hash(Secret1) Key Val) stored with a put operation at the node responsible for the hash of Key;
• Hash(Secret) = Hash(Secret1);
• the triplet has successfully been deleted.

Otherwise the operation returns "abort" and nothing is changed. If no value is specified for Secret, Beernet will assume the call is equivalent to delete(S:NO SECRET K:Key).

8.2 With replication

8.2.1 Write

write(S:Secret K:Key V:Val)

Stores the triplet (Hash(Secret) Key Val) at the majority of the replicas; updating the value gives a new version number to the triplet. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is nothing stored associated with the key Key, or there is a triplet stored previously by a write operation at the majority of the replicas;
• there is no triplet (Secret1 Key Val) where Hash(Secret) ≠ Hash(Secret1) stored in the majority of the replicas;
• the triplet has been correctly stored in the majority of the replicas.

Otherwise the operation returns "abort" and nothing is changed. If no value is specified for Secret, Beernet will assume the call is equivalent to write(S:NO SECRET K:Key V:Val).

8.2.2 CreateSet

createSet(SS:SSecret K:Key S:Secret)

Stores the triplet (Hash(SSecret) Key Hash(Secret)) at the majority of the replicas. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is nothing stored associated with the key Key in the majority of the replicas;
• there is no triplet (Hash(SSecret1) Key Hash(Secret1)) stored in the majority of the replicas yet;
• the triplet has been correctly stored in the majority of the replicas.

Otherwise the operation returns "abort" and nothing is changed. If no value is specified for SSecret or Secret, Beernet will set those values to NO SECRET.

8.2.3 Add

add(S:Secret K:Key SV:SValue V:Val)

Adds the quadruplet (Hash(Secret) Key Hash(SValue) Val) to the set referenced by the key Key in the majority of the replicas. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is no triplet (Hash(SSecret1) Key Hash(Secret1)) stored at the majority of the replicas with Hash(Secret1) ≠ Hash(Secret);
• there is no quadruplet (Hash(Secret2) Key Hash(SValue2) Val) with Hash(SValue2) ≠ Hash(SValue) stored in the majority of the replicas;
• the quadruplet has successfully been stored in the majority of the replicas.

Otherwise the operation returns "abort" and nothing is changed. Note that if no triplet (Hash(SSecret1) Key Hash(Secret1)) was previously stored at this key by createSet, Beernet will assume the call is equivalent to createSet(SS:NO SECRET K:Key S:Secret) followed by add(S:Secret K:Key SV:SValue V:Val), where NO SECRET is a reserved value of Beernet. If no value is specified for Secret or SValue, Beernet will set those values to NO SECRET.
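
As an illustration of how a set could be built with these two operations, here is a small Java sketch. The BeernetSets interface, its method names, and the mapping of a Bwitter line onto a set are assumptions; only the secret-checking rules follow the descriptions above.

// Hypothetical Java-side view of the replicated set operations; illustrative only.
interface BeernetSets {
    String createSet(String setSecret, String key, String addSecret);
    String add(String addSecret, String key, String valueSecret, String value);
}

class SetSketch {
    static void demo(BeernetSets beernet) {
        // Create the set: "ownerSecret" will later be needed to destroy it,
        // "addSecret" to add elements to it.
        beernet.createSet("ownerSecret", "line/alice/main", "addSecret");
        // Each element carries its own value secret, needed to remove it again.
        beernet.add("addSecret", "line/alice/main", "ref42Secret", "tweetRef:42");
        // Adding with a secret that does not match the one registered by
        // createSet aborts and leaves the set unchanged.
        String r = beernet.add("wrongSecret", "line/alice/main", "s", "tweetRef:43");
        System.out.println(r); // "abort"
    }
}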

8.2.4 Remove

remove(S:Secret K:Key SV:SValue V:Val)

If no value is provided for Val, this means we are dealing with a key/value pair and not a key/value set, and so SValue is not evaluated. It deletes the triplet (Hash(Secret1) Key Val1) stored at the majority of the replicas. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is a triplet (Hash(Secret1) Key Val1) stored with a write operation at the majority of the replicas;
• Hash(Secret) = Hash(Secret1);
• the triplet has successfully been deleted from the majority of the replicas.

Otherwise the operation returns "abort" and nothing is changed.

If a value is provided for Val, this means we are dealing with a value in a set and SValue will be checked. It deletes the quadruplet (Hash(Secret1) Key Hash(SValue1) Val1) stored at the majority of the replicas. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is a quadruplet (Hash(Secret1) Key Hash(SValue1) Val1) stored with an add operation and there is a triplet (Hash(SSecret1) Key Hash(Secret1)) stored with a createSet operation at the majority of the replicas;
• Val = Val1;
• Hash(Secret) = Hash(Secret1);
• Hash(SValue) = Hash(SValue1) or Hash(SValue) = Hash(SSecret1);
• the quadruplet has successfully been deleted from the majority of the replicas.

Otherwise the operation returns "abort" and nothing is changed. If no value is specified for Secret or SValue, Beernet will set those values to NO SECRET.
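
The two behaviours of remove can be summarised with the following sketch; again the Java interface is an assumption, and only the matching rules come from the description above.

// Hypothetical Java-side view of remove; illustrative only.
interface BeernetRemove {
    String remove(String secret, String key, String setValueSecret, String value);
}

class RemoveSketch {
    static void demo(BeernetRemove beernet) {
        // No value given: the key is treated as a plain key/value pair, and the
        // call commits only if the secret matches the one used by write.
        beernet.remove("aliceSecret", "user/alice", null, null);
        // Value given: the key is treated as a set; the exact value must match,
        // together with its value secret (or the set's SSecret).
        beernet.remove("addSecret", "line/alice/main", "ref42Secret", "tweetRef:42");
    }
}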

8.2.5 DestroySet

destroySet(SS:SSecret K:Key)

Deletes the triplet (Hash(SSecret1) Key Hash(Secret1)) and all the quadruplets (Hash(Secret1) Key Hash(SValue1) Val) at the majority of the replicas. This operation can have two results, "commit" or "abort". The operation returns "commit" if:

• there is a triplet (Hash(SSecret1) Key Hash(Secret1)) stored at the majority of the replicas;
• Hash(SSecret) = Hash(SSecret1);
• the triplet and quadruplets have successfully been deleted at the majority of the replicas.

Otherwise the operation returns "abort" and nothing is changed. If no value is specified for SSecret, Beernet will assume the call is equivalent to destroySet(SS:NO SECRET K:Key).

Chapter 9

Bwitter API

9.1 User management

9.1.1 createUser

public void createUser(String userName, String password, String realName)

Creates a user with his personal information.

Parameters:

• userName - the userName of the user; may not contain spaces.
• password - the password of the user; has to be at least 8 characters long and must contain at least one number and one special character (not one of the 26 letters of the alphabet).
• realName - the full name of the user; must contain a first and last name.

Throws:

• UserAlreadyUsed - if there already exists a user with this userName.
• PassWordTooWeak - if the password does not meet the requirements.
• UserNameInvalid - if either the userName or realName does not meet the requirements.
• ActionNotDoneException - if there was another problem during the operation.
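
A minimal usage sketch follows. How a Bwitter instance is obtained is an assumption; the method and exception names are the ones documented above.

class CreateUserSketch {
    // Illustrative only: reacts to each documented failure case.
    static void register(Bwitter bwitter) {
        try {
            // The password contains a digit and a special character, as required.
            bwitter.createUser("alice", "s3cur3!pwd", "Alice Liddell");
        } catch (UserAlreadyUsed e) {
            // pick another userName and retry
        } catch (PassWordTooWeak | UserNameInvalid e) {
            // report the invalid input back to the user
        } catch (ActionNotDoneException e) {
            // some other problem occurred in the datastore; retry later
        }
    }
}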

9.1.2 deleteAccount

public boolean deleteAccount(String userName, String password)

Deletes the account of the user along with his lists and lines. Also deletes all the tweets this user posted.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ActionNotDoneException - if there was another problem during the operation.

9.2 Tweets

9.2.1 postTweet

public void postTweet(String userName, String password, String msg)

Posts the message so that it is displayed in all the lines following the user.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• msg - a String containing the message.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.
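
The same credential-passing pattern applies to every write operation; a short sketch for postTweet (the Bwitter instance is again assumed):

class PostTweetSketch {
    // Illustrative only: credentials travel with every call, as required above.
    static void tweet(Bwitter bwitter) {
        try {
            bwitter.postTweet("alice", "s3cur3!pwd", "Hello from Bwitter!");
        } catch (BadCredentials e) {
            // wrong userName/password combination
        } catch (ValueNotFound e) {
            // a value needed by the operation could not be read from the datastore
        } catch (ActionNotDoneException e) {
            // any other failure; the tweet was not posted
        }
    }
}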

9.2.2 reTweet

public void reTweet(String userName, String password, String tweetID)

Posts the referenced tweet as a retweet so that it is displayed in all the lines following the user.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• tweetID - reference of the tweet to retweet.

Throws:

• ActionAlreadyPerformed - if this action has already been performed previously.
• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.2.3 reply

public void reply(String userName, String password, String msg, String tweetID)

Posts a new tweet with msg as message so that it is displayed in all the lines following the user. The new tweet contains a reference to its parent tweet, referenced by tweetID, and is added to its children.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• msg - a String containing the message.
• tweetID - reference of the tweet to which to reply.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.2.4 deleteTweet

public void deleteTweet(String userName, String password, int tweetnbr)

Deletes the tweet of the user with the specified number.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• tweetnbr - number of the tweet to delete.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.3 Lines

9.3.1 addUser

public void addUser(String userName, String password, String lineName, String newFollowingUserName)

Adds the specified user to the specified line. From now on, all the tweets posted by the specified user will be displayed in the specified line.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line to which the user should be added.
• newFollowingUserName - name of the user that should be added.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.3.2 removeUser

public void removeUser(String userName, String password, String lineName, String followingUserName)

Removes the specified user from the specified line. From now on, the tweets posted by the specified user will no longer be displayed in the specified line.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line from which the user should be removed.
• followingUserName - name of the user that should be removed.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.3.3 allUsersFromLine

public Collection allUsersFromLine(String userName, String lineName)

Retrieves all the users followed in the specified line owned by the specified user.

Parameters:

• lineName - name of the line.
• userName - name of the user owning the line.

Returns:

A Collection of Strings containing all the userNames of the users followed in the specified line.

Throws:

• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.3.4 allTweet

public Collection allTweet(String userName)

Retrieves all the tweets from the specified user. Should only be used for testing the application.

Parameters:

• userName - name of the user.

Returns:

A LinkedList of all the Tweets of the user, ordered chronologically.

Throws:

• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.3.5 getTweetsFromLine

public TweetChunk getTweetsFromLine(String userName, String lineName, int cNbr, String date)

Retrieves the tweets from the chunk with the number equal to cNbr from the line lineName of the user userName that were posted after date. If date is null, all the tweets from the chunk are returned. If cNbr is negative, the last chunk from the line is returned.

Parameters:

• lineName - name of the line.
• userName - name of the user owning the line.
• cNbr - number of the chunk of the line you want to read. The chunks are ordered from oldest to most recent, with the most recent chunk having the highest number.
• date - String representing the limit date, with the format "05/06/11 15 h 26 min 03 s GMT".

Returns:

A TweetChunk containing a LinkedList of Tweets ordered chronologically and the number of the chunk in which they are stored.

Throws:

• ParseException - if the date does not have the correct format and could not be parsed.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.
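
To show how the chunk and date arguments are meant to be combined, here is a small paging sketch. The TweetChunk accessors (getChunkNumber, getTweets) and the assumption that chunk numbers start at 0 are hypothetical; the negative-cNbr and date conventions are the ones documented above.

import java.util.Collection;

class ReadLineSketch {
    // Illustrative only: refresh the newest tweets, then walk back through
    // older chunks of the line.
    static void read(Bwitter bwitter, String lastSeenDate) throws Exception {
        // A negative chunk number returns the most recent chunk; passing the
        // last seen date filters out tweets already displayed.
        TweetChunk latest = bwitter.getTweetsFromLine("alice", "timeline", -1, lastSeenDate);
        display(latest.getTweets());

        // Older tweets live in chunks with smaller numbers (assumed to start at 0).
        for (int cNbr = latest.getChunkNumber() - 1; cNbr >= 0; cNbr--) {
            TweetChunk older = bwitter.getTweetsFromLine("alice", "timeline", cNbr, null);
            display(older.getTweets());
        }
    }

    static void display(Collection tweets) { /* render the tweets */ }
}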

9.3.6 createLine

public void createLine(String userName, String password, String lineName)

Creates a new line with the specified name, with the specified user as owner.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the new line to create.

Throws:

• LineAlreadyExists - if the user already has a line with the same name.
• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.
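
Putting the line operations together, a typical "follow a user" flow could look like this; the sketch only chains calls documented in this chapter and assumes a Bwitter instance is available.

import java.util.Collection;

class FollowSketch {
    // Illustrative only: create a topical line, follow a user with it, and
    // check the result.
    static void followBob(Bwitter bwitter) throws Exception {
        bwitter.createLine("alice", "s3cur3!pwd", "distributed-systems");
        // From now on, every tweet bob posts is displayed in this line.
        bwitter.addUser("alice", "s3cur3!pwd", "distributed-systems", "bob");

        // bob now appears among the users followed in the new line.
        Collection followed = bwitter.allUsersFromLine("alice", "distributed-systems");
        System.out.println(followed);
    }
}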

9.3.7 deleteLine

public void deleteLine(String userName, String password, String lineName)

Deletes the specified line owned by the user.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• lineName - name of the line to be deleted; note that the userline and timeline cannot be deleted.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.3.8 getLineNames

public Collection getLineNames(String userName)

Retrieves the names of all the lines of the user.

Parameters:

• userName - the userName of the owner of the lines.

Returns:

A LinkedList of Strings containing the names of all the lines.

Throws:

• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.4 Lists

9.4.1 addTweetToList

public void addTweetToList(String userName, String password, String listName, String tweetID)

Adds the referenced tweet to the specified list.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the user's list.
• tweetID - reference to the tweet to add to the list.

Throws:

• ActionAlreadyPerformed - if the tweet has already been added to the list previously.
• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.4.2 removeTweetFromList

public void removeTweetFromList(String userName, String password, String listName, String tweetID)

Removes the referenced tweet from the specified list.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the user's list.
• tweetID - reference of the tweet to remove from the list.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.4.3 getTweetsFromList

public TweetChunk getTweetsFromList(String userName, String listName, int cNbr, String date)

Retrieves the tweets from the chunk with the number equal to cNbr from the list listName of the user userName that were posted after date. If date is null, all the tweets from the chunk are returned. If cNbr is negative, the last chunk from the list is returned.

Parameters:

• listName - name of the list.
• userName - name of the user owning the list.
• cNbr - number of the chunk of the list you want to read. The chunks are ordered from oldest to most recent, with the most recent chunk having the highest number.
• date - String representing the limit date, with the format "05/06/11 15 h 26 min 03 s GMT".

Returns:

A TweetChunk containing a LinkedList of Tweets ordered chronologically and the number of the chunk in which they are stored.

Throws:

• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.4.4 createList

public void createList(String userName, String password, String listName)

Creates a new list with the specified name, with the user as owner.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the new list to create.

Throws:

• ListAlreadyExists - if the user already has a list with the same name.
• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.4.5 deleteList

public void deleteList(String userName, String password, String listName)

Deletes the specified list.

Parameters:

• userName - the userName of the user performing the operation.
• password - the password of the user performing the operation.
• listName - name of the list to be deleted; note that the favoritelist cannot be deleted.

Throws:

• BadCredentials - if the provided userName does not exist or if the password does not match the userName.
• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

9.4.6 getListNames

public Collection getListNames(String userName)

Retrieves the names of all the lists of the user.

Parameters:

• userName - name of the user owning the lists.

Returns:

A LinkedList of Strings containing the names of all the lists.

Throws:

• ValueNotFound - if a critical value needed to perform the operation could not be retrieved.
• ActionNotDoneException - if there was another problem during the operation.

Chapter 10

The paper

During the course of our project we have co-written an article, along with Peter Van Roy and Boris Mejías, entitled "Designing an Elastic and Scalable Social Network Application". The contents of this paper are based on our second implementation of Bwitter, which we detail in Section 5.1.2. It is thus not fully representative of our final implementations and design choices. This article has been accepted for The Second International Conference on Cloud Computing, GRIDs, and Virtualization1, organized by IARIA and held from the 25th to the 30th of September 2011 in Rome, Italy. The submitted version of this paper can be found on the next page.

1CLOUD COMPUTING 2011, http://www.iaria.org/conferences2011/CLOUDCOMPUTING11.html, last accessed 13/08/2011

Designing an Elastic and Scalable Social Network Application

Xavier De Coster, Matthieu Ghilain, Boris Mejías, Peter Van Roy
ICTEAM institute, Université catholique de Louvain
Louvain-la-Neuve, Belgium
{decoster.xavier, ghilainm}@gmail.com, {boris.mejias, peter.vanroy}@uclouvain.be

Abstract—Central server-based social networks can suffer architectures on which this social network can run, one fully from overloading caused by social trends and make the service distributed based on peer-to-peer and one centralised based momentarily unavailable preventing users to access it when on the cloud. We will then finish with the implementation they most want it. Central server-based social networks are not adapted to face rapid growth of data or flash crowds. of our prototype in Section VII and a small conclusion at In this work we present a way to design a scalable, elastic Section VIII. and secure Twitter-like social network application build on the II.A QUICK OVERVIEW OF REQUIRED OPERATIONS top of Beernet, a transactional key/value datastore. By being scalable and elastic the application avoids both overloading and Bwitter is designed to be a secure social network based wasting resources by scalung up and down quickly. on Twitter. Twitter is a microblogging system, and while it Keywords-Scalable; elastic; social network; design. looks relatively simple at first sight it hides some complex functionalities. We included almost all of those in Bwitter I.INTRODUCTION and added some others. We will only depict the relevant Social networks are an increasing popular way for people functionalities here that will help us to analyse the design to interact and express themselves. People can now create of the system and the differences between a centralised and content and easily share it with other people. The servers of decentralised architecture. those services can only handle a given number of requests A. Nomenclature at the same time, so if there are too many requests the There are only a few core concepts on which our appli- server can become overloaded. Social networks thus have to cation is based. A tweet is basically a short message with predict the amount of load they will have to face in order to additional meta information. It contains a message up to 140 have enough resources at their disposal. Statically allocating characters, the author’s username and a timestamp of when resources based on the mean utilisation of the service would it was posted. If the tweet is part of a discussion, it keeps lead to a waste during slack periods and overloading during a reference to the tweet it is an answer to and also keeps peak periods. Twitter (http://www.twitter.com) shows the the references towards tweets that are replies to it. A user “Fail Whale” graphic whenever overloading occurs. This is a is anybody who has registered in the system. A few pieces tricky situation as this load is related to many social factors, of information about the user are kept in memory by the some of which are impossible to predict. For instance we application, such as her complete name and her password, want to be able to handle the high amount of people sending used for authentication. A line is a collection of tweets and Christmas or New Year wishes but also reacting to natural users. The owner of the line can define which users he wants disasters. This is why we want to turn towards scalable and to associate with the line. The tweets posted by those users elastic solutions, allowing the system to add and remove will be displayed in this line. This allows a user to have resources on the fly in order to fit the required load. In several lines with different topics and users associated. this work we are going to focus on the design of a social network with elastic and scalable infrastructure: Bwitter, a B. 
Basic operations secure Twitter-like social network built on Beernet [1], a 1) Post a tweet: A user can publish a message by posting scalable key/value store. In the next section we will overview a tweet. The application will post the tweet in the lines to the basic required operations for a social network. We will which the user is associated. This way all the users following then explain why we chose Beernet for this project in her have the tweet displayed in their line. Section III and how to run multiple services on top of it in 2) Retweet a tweet: When a user likes a tweet from an Section IV, in this section we will also discuss some possible other user she can decide to share it by retweeting it. This improvements for DHTs in order to increase their security will have the effect of “sending” the retweet to all the lines and offer a richer application programming interface. We to which the user is associated. The retweet will be displayed then take a closer look at the design of our application in in the lines as if the original author posted it but with the Section V. In Section VI we will compare two types of retweeter’s name indicated. 3) Reply to a tweet: A user can decide to reply to a tweet. scalable and elastic key/value store providing transactional This will include a reference to the reply tweet inside the storage with strong consistency providing those data abstrac- initial tweet. Additionally a reply keeps a reference to the tions could be used too. tweet to which it responds. This allows to build the whole conversation tree. IV. RUNNING MULTIPLE SERVICES ON BEERNET 4) Create a line: A user can create additional lines with Multiple services running on the same DHT can conflict custom names to regroup specific users. with each other. We will now discuss two mechanisms 5) Add and remove users from a line: A user can asso- designed to avoid those conflicts. ciate a new user to a line, from then on all the tweets this newly added user posts will be included in the line. A user A. Protecting data with Secrets can also remove a user from a line, she will then not see the Early in the process, we elicited a crucial requirement. tweets of this user in her line anymore and will not receive The integrity of the data posted by the users on Bwitter her new tweets either. must be preserved. A classical mechanism, but not without 6) Read tweets: A user can read the tweets from a line flaws, is to use a capability-based approach. Data is stored at by packs of 20 tweets. She can also refresh the tweets of a random generated keys so that other applications and users line to retrieve the tweets that have been posted since her using Beernet cannot erase others values because they do not last refresh. know at which keys these values are stored. But in Bwitter, some information must be available for everybody and thus III.WHY BEERNET? keys must be known by all users, meaning that we cannot use Beernet [2] is a transactional, scalable and elastic peer-to- random keys. For example, any user must be able to retrieve peer key/value data store build on the top of a DHT. Peers the user profile of another user, it must thus know the key in Beernet are organized in a relaxed Chord-like ring [3] at which it is stored. The problem is that Beernet does not and keep O(log(N)) fingers for routing. 
This relaxed ring is allow any form of authentication so key/value pairs are left more fault tolerant than a traditional ring and its robust join unprotected, meaning that anybody able to make requests to and leave algorithm to handle churn make Beernet a good Beernet can modify or delete any previously stored data. candidate to build an elastic system. Any peer can perform We make a first and naive assumption that services lookup and store operations for any key in O(log(N)), where running on Beernet are bug free and respectful of each other. N is the number of peers in the network. The key distribution They thus check at each write operation that nothing else is is done using a consistent hash function, roughly distributing stored at a given key otherwise they cancel the operation. the load among the peers. These two properties are a strong Thanks to the transactional support of Beernet the check and advantage for scalability of the system compared to solutions the write can be done atomically. This way we can avoid like client/server. race conditions where process A reads, the process B reads, Beernet provides transactional storage with strong con- both concluding that there is nothing at a given key and both sistency, using different data abstractions. Fault-tolerance is writing a value leading to the lost of one of the two writes. achieved through symmetric replication, which has several This assumption is not realistic and adds complexity to the advantages that we will not detail here compared to leaf- code of each application running on Beernet. We thus relax set and successor list replication strategy [4]. In every it and assume that Beernet is running in a safe environment transaction, a dynamically chosen transaction manager (TM) like the cloud, which implies that no malicious node can guarantees that if the transaction is committed, at least the be added to Beernet. We allow any application to make majority of the replicas of an item stores the latest value requests directly to any Beernet node from the Internet. We of the item. A set of replicated TMs guarantees that the designed a mechanism called “secrets” to protect key/value transaction does not rely on the survival of the TM leader. pairs and key/value sets stored on Beernet enriching the Transactions can involve several items. If the transaction is existing Beernet API. committed, all items are modified. Updates are performed Applications can now associate secrets to key/value pairs using optimistic locking. and key/value sets they store. This secret is not mandatory, With respect to data abstractions, Beernet provides not if no secret is provided a “public” secret is automatically only key/value-pairs as in Chord-alike networks, but also added. This secret is needed to modify or delete what is key/value sets, as in OpenDHT-alike networks [5]. The com- stored at the key protected. For instance we could have the bination of these two abstractions provides more possibilities following situation. A first request stores at the key bar the in order to design and build the database, as we will explain value foo using the secret ASecret, then another request tries in Section V. Moreover, key/value sets are lock-free in to store at key bar another value using a secret different Beernet, providing better performance. from ASecret. 
Because secrets are different Beernet rejects We opted for Beernet because of those native data ab- the last request, which will thus have no effect on the data stractions and its elastic and scalability properties. But any store. A similar mechanism has been implemented for sets, allowing to dissociate the protection of the set as a whole the application only has to edit the stored tweet and does and the values it contains. not need go through every line that could contain the tweet. Secrets are implemented in Beernet and have been tested When loading the tweet the application can see if it has been through our Bwitter application. A similar but weaker mech- deleted or not. anism is proposed by OpenDHT [5]. Complete information 3) Minimise the changes to an object: We want the concerning the new secret API can be found at Bwitter’s objects to be as static as possible to enable cache systems. web site (http://bwitter.dyndns.org/). This is why we do not store potentially dynamic information B. Dictionaries inside the objects but rather have a pointer in them, pointing to a place where we could find the information. For instance, At the moment in Beernet, as in all key/value stores we Tweets are only modified when we delete them, if there is a know, there is only one key space. This can cause problems reply to them, the ID of the new child is stored in a separated if multiple services use the same key. For instance two set. services might design their database storing the user profiles at a key equal to the username of a user. This means they can 4) Do not make users load unnecessary things: Loading not both have a user with the same username. This problem the whole line each time we want to see the new tweets cannot be solved with the secrets mechanism we proposed. would result in an unnecessarily high number of messages We thus propose to enhance the current Beernet API with exchanged and would be highly bandwidth consuming. This multiple dictionaries. A dictionary has a unique name and is why we decided to cut lines, which in fact are just big refers to a key-space in Beernet. A new application can sorted set, into subsets, which are sets of x tweets, that can create a dictionary as it starts using Beernet. It can later be organised in a linked list fashion, where x is a tunable create new dictionaries at run-time as needed, which allows parameter. This way the user can load tweets in chunks of the developpers to build more efficient and robust imple- x tweets. The first subset contains all the references to the mentation. Dictionaries can be efficiently created on the fly tweets posted since the last time the user retrieved the line, in O(log(N)) where N is the number of peers in the Beernet it can thus be much larger than x tweets, it is not a problem network. Moreover dictionaries do not degrade storing and as users generally want to check all the new tweets when reading performance of Beernet. If two applications need to they consult a line. The cutting is then done as follows: the share data they just have to use the same dictionary. This application removes the x oldest references from the first has not yet been implemented, but API and algorithms are set, posts them in an new subset and repeats the operation currently being designed. An open problem is how to avoid until the loaded first set is smaller than x. malicious applications to access the dictionary of another 5) Retrieving Tweets in order: Due to the cutting mech- application. 
anism and delays in the network we can not be sure that V. DESIGN PROCESS each reference contained in a subset is strictly newer than the references stored in the next subset. So we also retrieve We will now present our design choices and explain how the tweet references from this one and only select the first we relieve machines hosting popular values. 20 newest references before fetching the tweets. A. Main directions 6) Filtering the references: When a user is dissociated We will start by discussing the main design choices we from a line we do not want our application to still display made for our implementation. the tweets he posted previously. We decided not to scan 1) Make reads cheap: While designing the construction the whole line to remove all the references added by this mechanism of the lines we were faced with the following user, but rather remove the user from the list of the users choice: Either push the information and put the burden on associated with the line and filter the references-based on the write, making the “post tweet” operation add a reference this list before fetching the corresponding tweets. to the tweet in the lines of each follower. Or pulling the 7) Only encrypt sensitive data: Most of the data in Twit- information and build the lines when a user wants to read ter is not private so there would be no point in encrypting them, by fetching all the tweets posted by the users he it. Only the sensitive data such as the password of the users follows and reordering them. As people do more reads than should be protected by encryption when stored. writes on social networks, based on the assumption that each posted tweet is at least read one time, we opted to make 8) Modularity: Even if our whole design and architecture reads cheaper than writes. relies on the features and API offered by Beernet it is always 2) Do not store full tweets in the lines but references: better to be modular and to define clear interfaces so we can There is no need to replicate the whole tweet inside each replace a whole layer by an other easily. For instance any line, as a tweet could be potentially contain a lot of in- other DHT could easily be used, provided it supports the formation and should be easy to delete. To delete a tweet same data abstractions or they can be simulated. B. Improving overall performance adding a cache but it could be replaced by any key/value store with similar 1) The popular value problem: Given the properties of properties. As a remainder the data store must provide read- the DHT, a key/value pair is mapped to a node or f /write operations on values and sets as well as implementing nodes, where f is the replication factor, depending of the the secrets we described before. This architecture is very redundancy level desired. This implies that if a key is modular, each layer can be changed assuming it respects frequently requested, the nodes responsible for it can be the API of the layer above. We now have to decide where overloaded while the rest of the network is mostly idle Beernet will run. We have two options, either let the Beernet and adding additional machines is not going to improve the nodes run on the users’ machines or run them on the situation. It is not uncommon on Twitter to have wildly cloud, leading to two radically different architectures: the popular tweets that are retweeted by thousands of users. completely decentralised architecture and the cloud-based In the worst case the retweets can be seen as exponential architecture. 
phenomenon as all the users following the retweeter are A. Completely decentralised architecture susceptible to retweet it too. 2) Use an application cache as solution: Adding nodes In a fully decentralised architecture the user runs a Beer- will not solve the problem, because the number of nodes net node and the Bwitter application on her machine. The responsible for a key/value pair will not change. In order to Bwitter application will do requests directly to this local reduce this number of requests we have decided to add a Beernet node. Ideally this local Beernet node should not be cache with a LRU replacement strategy at the application restricted to the Bwitter application but should also be acces- level. This solves the retweet problem because now the sible for other applications. The problem with this approach application, which is in charge of several users, will have is that the user can bypass protection mechanism enforced in its cache the tweet as soon as one of its user reads the at higher level by accessing DHT low level functions of popular tweet. This tweet will stay in the cache because the Beernet. Usually this is not a problem as untrusted users users frequently make requests to read it. This way we will would not know at which key the data is stored and thus reduce the load put on the nodes responsible for the tweet. can not compromise it. But in our case the data has to We now have to take into account that values are not be at known keys so that the application can dynamically immutable, they can be deleted and modified. A naive retrieve them. This means that any user understanding how solution would be to do active pulling to Beernet to detect our application works would be able to delete, edit or forge changes to the key/value pair stored in the cache. This would lines, users, tweets and references. This would be a security be quite inefficient as there are several values, like tweets, nightmare. that almost never change. In order to avoid pulling we need We tried to tackle this problem with the secret mecha- a mechanism that warns us when a change is done to a nism we designed to enrich Beernet’s interface. While this key/value pair stored in the cache. Beernet, as described prevented the users to edit or delete data they did not create in [1], allows an application to register to a key/value pair themselves we could not prevent them to forge elements. To and to receive a notification when this value is updated. Our avoid this we needed a way to authenticate every data posted application cache will thus register to each key/value pair by a user. There are cryptographic mechanisms to enforce that it actually holds and when it receives a notification from this and ways to efficiently manage the keys but they are Beernet indicating that a pair has been updated it will update outside the scope of this paper. its corresponding replicas. This mechanism has the big Even with those mechanisms in place we have to en- advantage of removing unnecessary requests. Notifications force security at the DHT level. Beernet uses encryption to are asynchronous, so the replicas in the cache can have communicate between different nodes to avoid confidential different values at a given moment, leading to an eventual information leak. But anyone could add modified Beernet consistency model for the reads. On the other hand writes do nodes behaving maliciously. 
Aside usual attacks [6], a not go through the cache but directly to Beernet, this allows corrupted node could be modified to reveal all the secrets to keep strong consistency for the writes inside Beernet. inside the requests going through it. We thus have to make This is an acceptable trade off as we do not need strong sure that the code running the Beernet node is not modified, consistency for reads inside a social network. so we need a mechanism that enforce remote attestation as described in [7]. This can be done by using a TPM, which VI.ARCHITECTURE provides cryptographic code signature in hardware, on the We will present two different scalable architectures for users’ machine in order to be able to prove to other Beernet our application. In both architectures our application is nodes that the client’s node is a trustworthy node. Until a decomposed in three loosely coupled layers. From top to Beernet node has a way to tell for sure it can trust another bottom, the Graphic User Interface (GUI), Bwitter used Beernet node we are in a dead end. Indeed anyone stealing to handle the operations described in Section II and the the secret of another user can erase any data posted by the key/value data store. For this last layer we use Beernet, user. Assuming that a Twitter session time is short, this can the Bwitter nodes will not be accessible directly, they will be a problem if our application is the only one running on be accessed through a fast and transparent reverse proxy that the top of Beernet. Indeed it will result in nodes frequently will be in charge of doing load balancing between Bwitter joining and leaving the network with a short connection nodes. At the moment Bwitter nodes use sessions to identify time. Each of those changes in the topology of Beernet the users, so the reverse proxy is forced to keep track of the will modify the keys for which the nodes are responsible sessions in order to be able to map the same client to the triggering key/value pairs reallocation itself leading to an same Bwitter node. We plan to change this behavior to offer important and undesirable churn. This would not be an ideal a completely REST Bwitter API. environment for a DHT. The top layer is the GUI, it connects to a Bwitter node using a secure connection channel that guarantee B. Cloud-based architecture the authenticity of the Bwitter node and encrypts all the With this architecture the Bwitter and the Beernet nodes communications between the GUI and the Bwitter node. will run on the cloud, which is an adequate environment Multiple GUI modules can connect to the same Bwitter for scalable and elastic applications. We can thus easily add node. The GUI layer is the only one running on the client or remove Bwitter and Beernet nodes to meet the demand, machine. increasing the efficiency of the network. A Bwitter node is a machine running Bwitter but generally also a Beernet node. This solution also allows us to keep a stable DHT as nodes C. Elasticity are not subject to high churn as it was the case in the first We previously explained that to prevent the Fail Whale architecture we presented. error, the system needs to scale up to allocate more resource Using this solution we do not have all the security issues to be able to answer an increase of user requests. Once the we had with the fully decentralised architecture. This is load of the system gets back to normal, the system needs to because the users do not have direct access to the Beernet scale down to release unused resources. 
We briefly explain nodes anymore but have to go through a Bwitter node how a ring-based key/value store needs to handle elasticity and can only perform operations defined in Section II. in terms of data management. We are currently working on Furthermore, the communication channel between the GUI making the elastic behaviour more efficient in Beernet. and the Bwitter node can guarantee authenticity of the server and encryption of data being transmitted, for instance using 1) Scale up: When a node j joins the ring in between https. Bwitter requires users to be authenticated to access or peers i and k, it takes over part of the responsibility modify their data. Doing so we provide data integrity and of its successor, more specifically all keys from i to j. authenticity because, for instance, Bwitter does not allow Therefore, data migration is needed from peer k to peer j. a user to delete a tweet that he did not post or to post The migration involves not only the data associated to keys a tweet using the username of someone else. The security in the range ]i, j], but also the replicated items symmetrically problem concerning possible revelations of user secrets due matching the range. Other noSQL databases such as HBase to a malicious node is not relevant anymore as our DHT is (http://hbase.apache.org) do not trigger any data migration fully under our control. upon adding new nodes to the system, showing better The cloud-based architecture is thus more secure and performance scaling up. stable, this is why we have finally chosen to implement this 2) Scale down: There are two ways of removing nodes solution, we now take a closer look at how the layer stack from the system: by gently leaving and by failing. It is very is build. Note that in spite of our researches we did not find reasonable to consider gently leaves in cloud environments, any information about current architecture so we because the system explicitly decides to reduce the size of are not able to compare both architectures. the system. In such case, it is assumed that the leaving peer As said before the Beernet layer runs on the cloud, this j has time enough to migrate all its data to its successor layer is monitored in order to detect flash crowds and who becomes the new responsible for the key range ]i, j], Beernet nodes will be added and removed on the fly to meet being i the predecessor. the demand. Scaling down due to the failure of peers is much more The intermediate layer, also running on the cloud, is complicated because the new responsible of the missing Bwitter, it communicates with Beernet and the GUIs. This key range needs to recover the data from the remaining layer can be put on the same machine as a Beernet node or replicas. The difficulty comes from the fact that the value on another machine. Normally there should be less Bwitter of application keys is unknown, since the hash function is nodes than Beernet nodes. One Bwitter node is associated to not bijective. Therefore, the peer needs to perform a range a Beernet node but can be re-linked to another Beernet node query, as in Scalaris [8], but based on the hash keys. Another if it goes down. Each Bwitter node should be connected to a complication is that there are no replica sets based on key different Beernet node in order to share the load. In practice ranges, but on each single key. VII.IMPLEMENTATION technology (http://www.adobe.com/products/flex). This GUI uses the web API we developed to access Bwitter. 
We have implemented a prototype based on our cloud- based architecture. Sources are freely available at [bwit- VIII.CONCLUSION ter.dyndns.org]. We will now detail how we actually imple- Our goal was to build a new system able to withstand flash mented it. You can see a full schema of our implementation crowd by relying on an elastic and scalable architecture. This in Figure 1. allows us to add resources to face heavier traffic and avoid As explained, our architecture has three main layers. The waste of resources. DHT layer is implemented using Beernet, build in Oz v1.3.2 While the prototype is not yet totally finished our whole (http://www.mozart-oz.org/) enhanced with the secret mech- design is totally scalable, meaning we do not have single anism. Beernet is accessible through a socket API , we used absurdly huge operations due to the high number of users it to communicate with the Bwitter layer. An alternative test one might follow or be followed by. We avoid overloading version of the data store layer used for testing the application specific machines because we do not rely on any global is also made available at http://bwitter.dyndns.org. keys and use our cache mechanism to prevent the retweet problem. Some preliminary scalability tests have been done on Amazon and are encouraging. During the implementation we also came across two potentially important improvements for key/value stores, namely duplicating the key space using multiple dictionaries and the protection of data via secrets, with the last one now implemented in Beernets latest release.

REFERENCES [1] B. Mej´ıas and P. Van Roy, “Beernet: Building self-managing decentralized systems with replicated transactional storage,” IJARAS: International Journal of Adaptive, Resilient, and Autonomic Systems, vol. 1, no. 3, pp. 1–24, July-Sept 2010.

[2] Programming Languages and Distributed Computing Research Group, UCLouvain, “Beernet: pbeer-to-pbeer network,” http: //beernet.info.ucl.ac.be, 2009. [Online]. Available: http:// Figure 1. Implementation structure scheme beernet.info.ucl.ac.be

At the top of the Bwitter layer is a Tomcat 7.0 application [3] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Bal- akrishnan, “Chord: A scalable peer-to-peer lookup service server (http://tomcat.apache.org) using java servlets from for internet applications,” SIGCOMM Comput. Commun. Rev., java EE. Bwitter is accessible from the internet through an vol. 31, pp. 149–160, August 2001. API that The Bwitter layer is connected to the bottom layer using sockets to communicate with an Oz agent controlling [4] A. Ghodsi, L. O. Alima, and S. Haridi, “Symmetric replication Beernet. The Bwitter nodes are accessible remotely via for structured peer-to-peer systems,” in Proceedings of the 2005/2006 international conference on Databases, information an http API, finally we would like to make it completely systems, and peer-to-peer computing, ser. DBISP2P’05/06. conform to REST API. The Tomcat servers are not directly Berlin, Heidelberg: Springer-Verlag, 2007, pp. 74–85. accessed, they are accessed through a reverse proxy server, in this case nginx (http://wiki.nginx.org), which is told to [5] S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I. Stoica, and H. Yu, “Opendht: a public dht service support 10k concurrent connections. This nginx server is and its uses,” SIGCOMM Comput. Commun. Rev., vol. 35, pp. in charge of serving static content as well as doing load 73–84, August 2005. balancing for the Tomcat servers. This load balancing is performed so that messages of a same session are always [6] G. Urdaneta, G. Pierre, and M. v. Steen, “A survey of dht mapped to the same Tomcat server, this is necessary as security techniques,” ACM Computing Surveys, vol. 43, no. 2, jan 2011. authentication is needed to perform some of the Bwitter operations and we did not want to share the state of the [7] Wikipedia, “Trusted computing,” http://en.wikipedia.org/wiki/ Trusted computing #Remote attestation, 2011, [Online; ac- users sessions between the Bwitter nodes for performance \ \ \ reasons. The connection to the web-based API is performed cessed 28-June-2011]. using https to meet the secure channel requirement of our [8] T. Schutt,¨ F. Schintke, and A. Reinefeld, “Scalaris: reliable architecture. transactional p2p key/value store,” in Proceedings of the 7th The last layer is the GUI, we decided to implement it ACM SIGPLAN workshop on ERLANG, ser. ERLANG ’08. as a Rich Internet Application (RIA), using the Adobe Flex New York, NY, USA: ACM, 2008, pp. 41–48.