Aalto University School of Science Degree Programme of Computer Science and Engineering

Tuure Laurinolli

High-Availability Database Systems: Evaluation of Existing Open Source Solutions

Master’s Thesis Espoo, November 19, 2012

Supervisor: Professor Heikki Saikkonen
Instructor: Timo Lättilä, M.Sc. (Tech.)

Aalto University School of Science
Degree Programme of Computer Science and Engineering

ABSTRACT OF MASTER'S THESIS

Author: Tuure Laurinolli
Title: High-Availability Database Systems: Evaluation of Existing Open Source Solutions
Date: November 19, 2012
Pages: 90
Professorship: Software Systems
Code: T-106
Supervisor: Professor Heikki Saikkonen
Instructor: Timo Lättilä, M.Sc. (Tech.)

In recent years the number of open-source database systems offering high-availability functionality has exploded. The functionality offered ranges from simple one-to-one asynchronous replication to self-managing clustering that both partitions and replicates data automatically.

In this thesis I evaluated database systems for use as the basis for high availability of a command and control system that should remain available to operators even upon loss of a whole datacenter. In the first phase of evaluation I eliminated systems that appeared to be unsuitable based on documentation. In the second phase I tested both throughput and fault tolerance characteristics of the remaining systems in a simulated WAN environment.

In the first phase I reviewed 24 database systems, of which I selected six, split into two categories based on consistency characteristics, for further evaluation. Experimental evaluation showed that two of these six did not actually fulfill my requirements. Of the remaining four systems, MongoDB proved troublesome in my fault tolerance tests, although the issues seemed resolvable, and Galera's slight issues were due to its configuration mechanism. This left one system in each category: Zookeeper and Cassandra did not exhibit any problems in my tests.

Keywords: database, distributed system, consistency, latency, causality
Language: English

Aalto-yliopisto
Perustieteiden korkeakoulu
Tietotekniikan tutkinto-ohjelma

DIPLOMITYÖN TIIVISTELMÄ

Tekijä: Tuure Laurinolli
Työn nimi: Korkean saavutettavuuden tietokantajärjestelmät: Olemassa olevien avoimen lähdekoodin ratkaisuiden arviointi
Päiväys: 19. marraskuuta 2012
Sivumäärä: 90
Professuuri: Ohjelmistotekniikka
Koodi: T-106
Valvoja: Professori Heikki Saikkonen
Ohjaaja: Diplomi-insinööri Timo Lättilä

Viime vuosina korkean saavutettavuuden mahdollistavat avoimen lähdekoodin tietokantajärjestelmät ovat yleistyneet. Korkean saavutettavuuden ratkaisut vaihtelevat yksinkertaisesta asynkronisesta yksi yhteen -toisintamisesta dataa itsenäisesti hajauttavaan ja toisintavaan ryvästykseen.

Tässä diplomityössä arvioin tietokantajärjestelmien soveltuvuutta pohjaksi korkean saavutettavuuden toiminnoille komentokeskusjärjestelmässä, jonka tulee pysyä saavutettavana myös kokonaisen konesalin vikaantuessa. Arvioinnin ensimmäisessä vaiheessa eliminoin dokumentaation perusteella selvästi soveltumattomat järjestelmät. Toisessa vaiheessa testasin sekä järjestelmien viansietoisuutta että läpäisykykyä simuloidussa korkean latenssin verkossa.

Ensimmäisessä vaiheessa tutustuin 24 tietokantajärjestelmään, joista valitsin kuusi tarkempaan arviointiin. Jaoin tarkemmin arvioidut järjestelmät kahteen kategoriaan konsistenssiominaisuuksien perusteella. Kokeissa havaitsin, että kaksi näistä kuudesta ei täyttänyt asettamiani vaatimuksia. Jäljellejääneistä neljästä järjestelmästä MongoDB aiheutti ongelmia viansietoisuustesteissäni, joskin ongelmat vaikuttivat olevan korjattavissa, ja Galeran vähäiset ongelmat johtuivat sen asetusjärjestelmästä. Jäljelle jäivät ensimmäisestä kategoriasta Zookeeper ja toisesta Cassandra, joiden kummankaan viansietoisuudesta en testeissäni löytänyt ongelmia.

Asiasanat: tietokanta, hajautettu järjestelmä, ristiriidattomuus, konsistenssi, viive, latenssi, kausaalisuus
Kieli: Englanti

Acknowledgements

I would like to thank Portalify Ltd for offering me an interesting thesis project and ample time to work on it. At Portalify I'd especially like to thank M.Sc. Timo Lättilä, my instructor, for putting me on the right track from the start. Outside Portalify, I would like to thank Professor Heikki Saikkonen for taking the time to supervise my thesis.

I also want to thank my friends and family for providing me with support and, perhaps even more importantly, welcome distractions. Aalto on Waves was downright disruptive, and learning to fly at Polyteknikkojen Ilmailukerho took its time too. However, constant support from old friends was the most important. Thank you, Juha and #kumikanaultimate!

Helsinki, November 19, 2012

Tuure Laurinolli

Abbreviations and Acronyms

2PC     Two-phase Commit
ACID    Atomicity, Consistency, Isolation, Durability
API     Application Programming Interface
ARP     Address Resolution Protocol
CAS     Compare And Set
FMEA    Failure Modes and Effects Analysis
FMECA   Failure Modes, Effects and Criticality Analysis
FTA     Fault Tree Analysis
HAPS    High Availability Power System
HTTP    Hypertext Transfer Protocol
JSON    JavaScript Object Notation
LAN     Local Area Network
MII     Media Independent Interface
NAT     Network Address Translation
PRA     Probabilistic Risk Assessment
REST    Representational State Transfer
RPC     Remote Procedure Call
RTT     Round-Trip Time
SDS     Short Data Service
SLA     Service Level Agreement
SSD     Solid State Drive
SQL     Structured Query Language
TAP     Linux network tap
TCP     Transmission Control Protocol
TETRA   Terrestrial Trunked Radio
VM      Virtual Machine
WAN     Wide Area Network
XA      X/Open Extended Architecture

Contents

Abbreviations and Acronyms

1 Introduction
  1.1 High-Availability Command and Control System
  1.2 Open-Source Database Systems
  1.3 Evaluation of Selected Databases
  1.4 Structure of the Thesis

2 High Availability and Fault Tolerance
  2.1 Terminology
  2.2 Overcoming Faults
  2.3 Analysis Techniques

3 System Architecture
  3.1 Background
  3.2 Network Communications Architecture
  3.3 Software Architecture
  3.4 FMEA Analysis of System
  3.5 FTA Analysis of System
  3.6 Software Reliability Considerations
  3.7 Conclusions on Analyses

4 Evaluated Database Systems
  4.1 Database Requirements
  4.2 Rejected Databases
  4.3 Databases Selected for Limited Evaluation
  4.4 Databases Selected for Full-Scale Evaluation

5 Experiment Methodology
  5.1 Test System
  5.2 Test Programs
  5.3 Fault Simulation
  5.4 Test Runs

6 Experiment Results
  6.1 Throughput Results
  6.2 Fault Simulation Results

7 Comparison of Evaluated Systems
  7.1 Full-Scale Evaluation
  7.2 Limited Evaluation

8 Conclusions

A Remaining throughput results

B Remaining fault test results

Chapter 1

Introduction

In this thesis I present my research related to the adoption of an existing open-source database system as the basis for high availability in a command and control system being developed by Portalify Ltd.

1.1 High-Availability Command and Control System

The command and control system is designed to support operations of rescue personnel by automatically tracking the status and location of field units, so that dispatching operators always have a correct and up-to-date view of available units. It tracks locations of TETRA handsets and vehicle radios, and handles status messages sent by field personnel in response to events such as receiving dispatch orders. The system also allows operators to dispatch a unit on a mission, and automatically sends the necessary information to the unit.

The system should scale to installations that span large geographical areas, with dispatching operators located in multiple, geographically diverse control rooms, and thousands of controlled units spread over the geographical area. Typically operators in one control room would be responsible for controlling units in a specific area, but it should be possible for another control room to take over the area in case the original control room cannot handle its tasks, because it has for example lost electrical power.

In this thesis I concentrate on hardware fault tolerance of the command and control system and also the database system, since studying software faults of large, existing software systems appears to be an unsolved problem. However, I touch on higher-level approaches that could be used to enhance software fault tolerance of a complex system in practice in Chapter 3.

I introduce terminology and analysis methods related to availability and fault tolerance in Chapter 2. In Chapter 3 I present more elaborate requirements for the system, a system architecture based on those requirements, and a fault-tolerance analysis of the architecture model based on the analysis methods introduced in Chapter 2.

1.2 Open-Source Database Systems

The system described above must be able to share data between operators working on different workstations, located in different control rooms, distributed across a country. A database system for storing the data and controlling access to it is required. Because of the fault tolerance requirements presented in Chapter 3, the database system must be geographically distributed.

The main functional requirement for the database is that it must provide an atomic update primitive, preferably with causal consistency and read committed visibility semantics. The main non-functional requirements are quick, automatic handling of software, network and hardware faults and adequate throughput when clustered over a high-latency network. Even fairly low throughput is acceptable.

I limit the evaluation to open-source database systems both because of the apparently high cost of commercial high-availability database systems, such as Oracle, and because it is not possible to inspect how commercial, closed-source systems actually work. The transparency of open-source systems is not beneficial only for research purposes; it is also an operational benefit in that it is actually possible to find and fix problems in the system without having to rely on the database vendor for support. Already during the writing of this thesis I reported issues to several projects and fixed problems in database interfaces to be able to run my tests.

I present the requirements placed on the database system and introduce a wide variety of open-source high-availability database systems in Chapter 4.

1.3 Evaluation of Selected Databases

Since the main objective for the system in question is to find a distributed database system that is fault tolerant, I use a virtualized test environment that is capable of injecting faults and latency into a distributed system. The test environment uses VirtualBox (https://www.virtualbox.org/), Netem [20] and virtualized Debian (http://www.debian.org/) systems capable of running all the tested database systems.

I test the fault tolerance characteristics of the selected database systems in this environment by injecting process, network and hardware faults and measuring the effects on clients connected to different nodes of the database cluster. In addition to fault tolerance, I test the update throughput of the database systems in various high-latency configurations with varying numbers of clients.

I elaborate on the test environment in Chapter 5 and present the test results in Chapter 6, as well as a comparison of the evaluated systems based on the test results and features in Chapter 7.

1.4 Structure of the Thesis

In Chapter 1 I introduce the product from which the criteria for evaluating the databases are derived. In Chapter 2 I introduce high-availability and fault tolerance terminology and fault tolerance analysis procedures used in other fields. In Chapter 3 I describe the system architecture, how fault tolerance can be achieved with it and the requirements it places on the database. In Chapter 4 I describe the various databases that I considered when selecting systems for evaluation, and explain how the evaluated databases were selected. In Chapter 5 I present the test methodology used in obtaining data for evaluation of the databases. In Chapter 6 I present test results for several databases using the test methods from Chapter 5. In Chapter 7 I compare the evaluated databases based on the results presented in Chapter 6. In Chapter 8 I present conclusions about the suitability of different databases for use as a basis for sharing state in a high-availability command and control system.

Chapter 2

High Availability and Fault Tolerance

A high-availability database system is a database system with the characteristic that an operator can achieve high availability using it. The meaning of availability and how it relates to fault tolerance is discussed below.

2.1 Terminology

The terms related to availability, and the meaning of availability itself, need to be carefully defined in order to be useful. In systems that operate continuously, availability is often defined as the probability that the system is operating correctly at any given time [27]. This definition is problematic when applied to query-oriented computer systems such as databases, for which continuous availability is not easily defined, since availability of the system is only measurable when a query is performed.

2.1.1 Availability

It is somewhat difficult to measure availability even when queries are performed. What is the availability of the system if query A executes successfully, but during its execution, query B begins executing and fails because of a spurious network error? Whether or not this is possible depends on the design of the network protocol that the database cluster uses in its internal communications, but it is certainly imaginable that some of the evaluated systems could allow this kind of behavior.

Even without going as far as proposing simultaneity of failure and success, it is usually not enough for a query to eventually complete for it to be considered successful. Instead, there is usually an external requirement limiting the execution time of database queries when the database is part of a larger system. In addition, due to the concurrency control paradigms employed in certain databases, some queries are actually expected to fail. This happens when multiple clients attempt to concurrently update an entity in a system with optimistic concurrency control. Instead of one of two clients remaining blocked on a lock waiting for the other to complete its update, at least one of the updates must fail at or before commit time.

The interface from the rest of the command and control system to the database system is designed so that failure of an individual query is not disastrous. Nor is unavailability of the database system for a few seconds, upon for example failure of the underlying hardware, a problem for the rest of the system. The database system should thus only be considered unavailable when queries take a disproportionately long time to execute or when they fail because of an error in the database system instead of a transient error resulting from the concurrent access protocol.

2.1.2 Reliability

While availability is usually defined in terms of the probability that the system is operating correctly at a point in time during continuous operation, reliability is defined as the probability that the system keeps operating correctly without failures for a defined period of time [27]. For the envisaged system, reliability is not a good metric, since the system does not have a well-defined lifetime over which reliability would be meaningful to measure. For example, if the system had a second of downtime every 10 minutes, its availability would be 0.998 but its reliability over any 10-minute period would be 0. For the expected use case with short queries this might be entirely acceptable. However, for a batch system performing video processing tasks that each take 30 minutes, the reliability figure above would be absolutely disastrous, since no task could ever finish.
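As a worked check of the figures above (assuming exactly one second of downtime in every ten-minute window of continuous operation):

    % Availability: fraction of time the system operates correctly.
    % Reliability over a period: probability of surviving that period without any failure.
    \[
    A = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}
      = \frac{599\,\mathrm{s}}{600\,\mathrm{s}} \approx 0.998,
    \qquad
    R(10\,\text{min}) = \Pr[\text{no failure during the window}] = 0 .
    \]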

2.1.3 Faults and Errors

According to Storey, "A fault is a defect within the system" and can take many forms such as hardware failures and software design mistakes. An error, on the other hand, is "a deviation from the required operation of the system or subsystem" [27]. Storey further classifies faults into random and systematic faults. Random faults include hardware faults due to wear and tear, cosmic rays and other random events. Systematic faults are faults due to design mistakes.

A fault may cause an error, but the operation of the system may also mask the fault. For example, a software design mistake will not result in an error if the part of the software that contains the mistake is never executed. Hardware errors may similarly be masked. If a switch is never used, a fault in its installation cannot produce an error, and a fault in a computer hard disk may stay dormant for the lifetime of the system if the faulty sector is never accessed. In fact, a modern PC CPU contains hundreds of design faults [15], yet millions of the devices are in use every day without any apparent errors arising from these faults.

Storey also defines data integrity as "the ability of the system to prevent damage to its own database and to detect, and possibly correct, errors that do occur" [27]. Database terminology for data integrity is usually more nuanced, using terms such as atomicity, consistency, isolation and durability to describe characteristics of transactions in the database. According to Wikipedia [30], the terms originate from Haerder and Reuter [19]. A database system that ensures data integrity as defined by Haerder and Reuter is also fail-safe as defined by Storey, in the sense that no committed transactions are lost upon error; errors thus do not endanger system state, they just prevent accessing it or changing it.

2.1.4 Maintainability

Another concept of interest defined by Storey is maintainability. He defines maintainability as "the ability of a system to be maintained" and maintenance as "the action taken to retain a system in, or return a system to, its designed operating condition" [27]. In computer systems, common maintenance tasks often include ensuring that sufficient resources are available in the system in the form of for example disk space, possibly backing up the system state to external media, and applying configuration changes and software updates.

With databases, two intermingled properties often come up with backups. Most preferably the backup should be atomic, that is, it should reflect the state of the database at a single point in time. The backup should also have no effect on the normal operations of the database system, that is, answering queries and performing updates. Some database systems achieve the first property but fall short of the second, specifically because achieving the first requires blocking all write operations so that the backup can complete without interference from updates. Others choose to achieve the second property but fail on the first one, yet it is usually possible to achieve both if filesystem-level snapshots are available or if the database uses a multiversion concurrency control scheme. In the first case, restoring a backup made by copying the filesystem snapshot is equivalent to restarting the database after a power failure. In the latter case, the database explicitly keeps track of the lifetime of items, so that during a backup old, deleted versions of items are simply kept around until the backup completes, and new items are not included in the backup.

Another maintainability issue with a complex system is management of system configuration over time. In addition to issues that are possible in management of configuration of a centralized software system, distributed systems have additional complexity related to ensuring that the whole system has a compatible configuration. For example, some distributed systems require that all nodes have mostly identical but subtly different configurations, because their node configurations must specify the addresses of all other nodes but not the node itself.

On a centralized system, a configuration change is performed once, on one computer. If the change requires a restart of software, some downtime is unavoidable. In contrast, a distributed system may be able to tolerate configuration changes that require restart of individual nodes without downtime. On-the-fly upgrades like this are often the preferred method in the world of distributed database systems, where the feature is often called 'rolling restart' [25]. In practice the difficulty of having a dissimilar configuration for each node may not be great, since the configuration of each node must in any case be managed individually if changes are performed in a staggered fashion.
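A minimal sketch of the multiversion approach to non-blocking, point-in-time backups described above; the record layout, timestamps and class name are illustrative and not taken from any of the evaluated systems:

    import time

    class MVCCStore:
        """Toy multiversion store: each version records when it was created and deleted."""

        def __init__(self):
            self.versions = []  # list of (key, value, created_at, deleted_at or None)

        def put(self, key, value):
            now = time.time()
            # Mark the previous live version of the key as deleted instead of overwriting it.
            for i, (k, v, created, deleted) in enumerate(self.versions):
                if k == key and deleted is None:
                    self.versions[i] = (k, v, created, now)
            self.versions.append((key, value, now, None))

        def snapshot(self, at):
            """Return the versions visible at timestamp 'at' (a consistent, point-in-time view).

            Old versions deleted after 'at' are still returned; versions created after 'at'
            are excluded, so a backup reading this snapshot is unaffected by concurrent writes.
            """
            return {
                k: v
                for (k, v, created, deleted) in self.versions
                if created <= at and (deleted is None or deleted > at)
            }

    # Usage: fix a snapshot timestamp when the backup starts, then keep writing normally.
    store = MVCCStore()
    store.put("unit-1", {"status": "free", "version": 1})
    backup_ts = time.time()
    store.put("unit-1", {"status": "dispatched", "version": 2})  # concurrent update
    print(store.snapshot(backup_ts))  # still sees version 1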

2.2 Overcoming Faults

Storey [27] divides techniques for overcoming the effects of faults into four categories: fault avoidance, fault removal, fault detection and fault tolerance. Fault avoidance covers techniques applied at the design stage, fault removal covers techniques applied during testing, and fault detection and fault tolerance cover the detection of faults and mitigation of their effects when the system is operational. An example of fault avoidance would be use of formal methods during software development to prove that software matches its specification. Fault detection and fault tolerance are related in that fault tolerance in active systems typically requires some form of fault detection so that faulty parts of the system can be isolated or spare components activated, and the fault reported so that it can be repaired.

Several techniques for creating fault-tolerant software are described in the literature. The Wikipedia article on Software Fault Tolerance [31] lists Recovery Blocks, N-version Software and Self-Checking Software. In addition, Storey [27] mentions Formal Methods. Of these, Recovery Blocks are these days a mainstream feature in object-oriented programming languages such as C++, Java, Python and Ruby in the form of try-catch structures. Storey also mentions other language features common in today's languages, such as pointer safety and goto-less program structure, as enhancing reliability of software [27].

N-version (multiversion) software aims to achieve redundancy by creating multiple versions of the same software function. The idea behind multiversion software is that the different versions, called variants, will have different faults, and thus correct operation can be ensured by comparing their results and selecting the result that is most popular. Both Storey [27] and Lyu [22] mention that common-cause faults have been found to be surprisingly common when multiversion programming has been applied. To avoid common-cause faults, the variants should be developed with as much diversity as possible. For example, separate hardware platforms, programming languages and development tools increase the likelihood of the different program versions actually having different faults.

Multiversion software also usually has a single point of failure, namely the component that selects the final result based on the variant results. However, it should be a simple component, maybe so simple as to allow exhaustive testing. As techniques for combining variant results, Lyu [23] mentions majority voting and median voting among others.

Majority voting simply picks the majority value, if any. It cannot produce a result in all cases, namely in situations where no majority exists. For example, if three variants each produce a different result, no majority exists, and some other solution is required. Some possibilities in this case are switching control to a non-computerized backup system, or shutting down the whole system into a safe state. Median voting is an interesting alternative in that for some special cases it allows the variants to be implemented so that their results do not have to match exactly in order for the combined result to be useful. For example, if diverse algorithms on diverse hardware are used to compute the deflection of a control surface of an aircraft, combining their outputs with a median filter would allow the algorithms to produce slightly different results for common cases, yet choose a common value in case one algorithm produces obviously wrong results.
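A minimal sketch of the two result-combination schemes mentioned above; the variant outputs are hypothetical stand-ins for independently developed implementations:

    from collections import Counter
    from statistics import median

    def majority_vote(results):
        """Return the value produced by a strict majority of variants, or None if no majority exists."""
        value, count = Counter(results).most_common(1)[0]
        return value if count > len(results) / 2 else None

    def median_vote(results):
        """Return the median of numeric variant outputs; tolerates one wildly wrong variant."""
        return median(results)

    # Three hypothetical variants computing a control-surface deflection.
    variant_outputs = [12.4, 12.6, 250.0]   # the third variant has failed badly
    print(majority_vote(["A", "A", "B"]))   # -> 'A'
    print(majority_vote([1, 2, 3]))         # -> None: no majority, fall back to a safe state
    print(median_vote(variant_outputs))     # -> 12.6, the outlier is ignored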
The article on Self-Checking Software in Wikipedia [31] is actually about N-version Self-Checking Programming as described in Lyu [22, chapter 3], wherein the N-version aspect is the source of redundancy necessary to tolerate faults and the self-checking part distinguishes it from regular N-version programming as described by Storey [27]. What distinguishes it from regular N-version programming is that in regular N-version programming there is an external component that compares the results of the N diverse programs and determines the correct output, whereas in N-version self-checking programming each self-checking component must determine if its result is correct and signal the other components in case it detects a fault in its output.

Correct use of formal methods ensures that software matches its specification. For them to be applicable, a formal specification must first be created. Some software development standards, such as UK Defence Standard 00-55, require the use of formal methods for safety-related software [12]. Techniques borrowed from formal methods are also used in less rigorous settings to find bugs in existing software [13].

In distributed systems, additional techniques are required that allow the system as a whole to proceed even if components fail. The problem of agreement in distributed systems is called the consensus problem. In theory, it is impossible to implement a deterministic algorithm solving the distributed consensus problem in an asynchronous network, that is, a network that does not guarantee delivery of messages in bounded time. In practice this is overcome by employing fault detectors based on timeouts. In addition to distributed consensus, distributed transactions feature widely in the literature. Transactions are a special case of distributed consensus, and a plethora of specialized algorithms exists for handling them; lately, however, the trend has perhaps been towards building databases on more generic consensus primitives. For example, Google's BigTable database is essentially based on the generic Paxos algorithm for solving distributed consensus [16].
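A minimal sketch of a timeout-based fault detector of the kind mentioned above; the class, the timeout value and the node names are illustrative assumptions:

    import time

    class TimeoutFailureDetector:
        """Toy failure detector: a node is suspected if no heartbeat has been seen within
        the timeout. The timeout length trades detection speed against false suspicions
        on a slow (asynchronous) network."""

        def __init__(self, timeout_seconds):
            self.timeout = timeout_seconds
            self.last_heartbeat = {}

        def heartbeat(self, node):
            self.last_heartbeat[node] = time.monotonic()

        def suspected(self, node):
            last = self.last_heartbeat.get(node)
            return last is None or time.monotonic() - last > self.timeout

    detector = TimeoutFailureDetector(timeout_seconds=2.0)
    detector.heartbeat("DC2")
    print(detector.suspected("DC2"))  # False: heartbeat just seen
    print(detector.suspected("DC3"))  # True: never heard from, so suspected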

2.3 Analysis Techniques

2.3.1 Failure Modes and Effects Analysis

Failure Modes and Effects Analysis (FMEA) was originally developed in the United States for military applications and codified in MIL-P-1629 in 1949. Later revisions were standardized in MIL-STD-1629 and MIL-STD-1629A. Early adopters of FMEA in civil applications include the aerospace industry and the automotive industry. According to Haapanen and Helminen [18], the academic record of application of FMEA to software development originates from the late 1970s. Haapanen and Helminen mention a paper by Reifer published in 1979 titled Software Failure Modes and Effects Analysis, and in this paper Reifer mentions some earlier work on software reliability, but nothing dating back further than 1974. [26] [18]

Failure Modes, Effects and Criticality Analysis (FMECA) is a development of FMEA that includes assessment of the criticality of failures. Criticality means, according to Haapanen and Helminen [18], "a relative measure of the consequences of a failure mode and its frequency of occurrences". FMECA was part of MIL-STD-1629A, which was published in 1980. In this thesis I will perform qualitative criticality analysis of identified failures in Chapter 3.

The FMECA procedure itself is very simple. The procedure described here is based on the description in Storey [27]. For each system component:

1. Determine failure modes of the component

2. Determine consequences of failure of the component in each failure mode

3. Determine the criticality of the failure based on the consequences and the likelihood of failure

The result of FMECA is a table that contains a description of the consequences and criticality of all single-component failures.

The limitations of FMECA are in its simplicity. It prescribes analysis of all system components, which soon becomes burdensome on larger systems. Appropriate modularization helps with this issue. If module interfaces are sufficiently well-defined, internal failures of a module can be treated at a higher level as failures of the larger module, reducing the complexity of analysis at the higher level.

A larger, more difficult problem is that, as prescribed, FMECA limits analysis to single component failures. Consequences of simultaneous failures of multiple components are not covered by the analysis. For example, FMECA analysis of a dual ring network topology would show that no single-link failure partitions the network, but would not cover the two-link failure case which does partition the network.

It is difficult to envision how FMECA could practically be extended to multi-component failures, since already the obvious next step of applying the procedure to component pairs is often infeasible because the number of component pairs in a system grows quadratically with the number of components. As already mentioned, proper modularization of the system could help somewhat, but even for moderate component counts, the number of component pairs is prohibitively large. However, in certain cases reduction of the analysis based on symmetries might make analysis of dual component failures feasible. For example, in a dual ring network with N identical nodes (Figure 2.1a), a single-link failure has 2N identical cases (Figure 2.1b) and the 2N(2N − 1) dual-link failures can be reduced to only three cases with different behavior (Figure 2.2): links in the same direction, both links between one pair of nodes, and links in different directions between different pairs of nodes. The first two have no effect on communications and the third splits the network in two. It is difficult to see how this could be generalized, though.
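The symmetry argument above can also be checked mechanically. The following sketch (my own illustration, not part of the original analysis) enumerates all dual-link failures of a five-node dual ring, modelled as two directed rings, and verifies that only the third case breaks strong connectivity:

    from itertools import combinations

    N = 5
    nodes = list(range(N))
    # A dual ring: one clockwise and one counter-clockwise directed link per adjacent pair.
    links = [(i, (i + 1) % N) for i in nodes] + [((i + 1) % N, i) for i in nodes]

    def strongly_connected(edges):
        """True if every node can reach every other node over the remaining directed links."""
        adjacency = {n: [b for a, b in edges if a == n] for n in nodes}
        for start in nodes:
            seen, stack = {start}, [start]
            while stack:
                for nxt in adjacency[stack.pop()]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            if seen != set(nodes):
                return False
        return True

    def classify(l1, l2):
        same_direction = (l1[0] - l1[1]) % N == (l2[0] - l2[1]) % N
        same_pair = {l1[0], l1[1]} == {l2[0], l2[1]}
        if same_direction:
            return "same direction"
        return "same pair, opposite directions" if same_pair else "different pairs, opposite directions"

    for l1, l2 in combinations(links, 2):
        remaining = [l for l in links if l not in (l1, l2)]
        if not strongly_connected(remaining):
            assert classify(l1, l2) == "different pairs, opposite directions"
    # Only the third case ("different pairs, opposite directions") ever partitions the network.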

Figure 2.1: Dual ring network single failure example ((a) healthy network, (b) single failure; nodes A, B, C, D and E connected in a dual ring).

2.3.2 Fault Tree Analysis

According to the NASA Office of Safety and Mission Assurance [24], the history of Fault Tree Analysis (FTA) dates back to the US aerospace and missile programs, where FTA was popular in the 1960s. Towhidnejad et al. [29] mention that FTA evolved in the aerospace industry in the early 1960s. Nowadays FTA and other Probabilistic Risk Assessment (PRA) techniques are used for example in the nuclear and aerospace industries. [24]

Storey [27] does not specifically mention probabilities in the context of FTA, and the NASA Office of Safety and Mission Assurance [24] specifically mentions that the Fault Tree (FT) that is the result of FTA is a "qualitative model". According to Towhidnejad et al. [29], however, FTA is associated with a probabilistic approach to system analysis, and in NASA Office of Safety and Mission Assurance [24] probabilistic aspects are also introduced later. In this thesis I will only perform qualitative FTA-type analysis in Chapter 3.

The FTA procedure is in some ways the opposite of FMECA. In FTA the starting point is a top event, the causes of which are to be determined. The process is repeated recursively until the level of "basic events" is reached. The question in FTA is thus "What would have to happen for event X to happen?" rather than "What would happen were event X to happen?" as in FMEA. FTA is also advertised as a graphical method, with a well-defined graphical notation for the tree structure produced through the recursion mentioned above [27].

Figure 2.2: Dual ring network multiple failure example ((a) same direction, (b) opposite directions, same nodes, (c) opposite directions, different nodes; in case (c) the network splits into two partitions).
An example of the graphical representation is in Figure 2.3. Note that individual fault events are atomic and are combined with Boolean operators when multiple lower-level faults are required to cause a higher-level fault.

FTA applied to the ring network example of the previous section, with the top-level event "Network is partitioned", is presented in Figure 2.4. The reasoning already described in the previous section is visualized in the FTA model. However, if the symmetry arguments from the previous example were not applied to the FTA, the tree would quickly grow prohibitively large (Figure 2.5). Also, there is nothing inherent in the construction of the Fault Tree that would ensure that faults caused by multiple failures are noticed. However, the focus in FTA is on determining causes for a specific event, which helps concentrate analysis on relevant aspects of the system.

Figure 2.3: Fault Tree Analysis notation example (top event "Loss of cooling", caused by loss of cooling fluid, through a coolant leak or circulation blocked by valve A or valve B being closed, or by loss of fluid circulation through pump failure due to power loss or mechanical failure).
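The gate structure of a fault tree such as Figure 2.3 can be written down directly as data and evaluated. In the following sketch the gate types are assumed for illustration, since the extracted text does not preserve the gate symbols of the figure:

    # A fault tree node is either a basic event (string) or a gate: ("AND" | "OR", [children]).
    # Gate types below are assumed for illustration only.
    COOLING_TREE = (
        "OR", [
            ("OR", [                                             # loss of cooling fluid
                "coolant leak",
                ("AND", ["valve A closed", "valve B closed"]),   # circulation blocked (assumed AND)
            ]),
            ("OR", ["power loss", "mechanical failure"]),        # pump failure, i.e. loss of circulation
        ],
    )

    def occurs(node, basic_events):
        """Evaluate whether the top event occurs, given the set of basic events that have happened."""
        if isinstance(node, str):
            return node in basic_events
        gate, children = node
        results = [occurs(child, basic_events) for child in children]
        return all(results) if gate == "AND" else any(results)

    print(occurs(COOLING_TREE, {"valve A closed"}))                     # False: one closed valve is not enough
    print(occurs(COOLING_TREE, {"valve A closed", "valve B closed"}))   # True
    print(occurs(COOLING_TREE, {"power loss"}))                         # True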

In the literature FTA is mostly mentioned in the context of safety-critical systems. However, it is also useful in more mundane software development and system design tasks. The output of FTA can be directly used as a guide for finding possible causes of problems in running software or operative systems. Automated construction of Fault Trees from programs has been researched by Friedman [17], although another name for the end result might be more suitable, since the top event is not necessarily a fault but rather any state of the program.

Also note how the selection of the top-level event affects FTA analysis. If the top-level event "Single-failure tolerance lost" is selected, the resulting FTA is quite different, as can be seen in Figure 2.6. The process for selecting appropriate top-level events is not part of the FTA procedure and requires expertise beyond simply applying a prescribed method to a system. In software systems, both selecting appropriate top-level events and determining an appropriate bottom level for the analysis are challenging because of system complexity. If a bottom level is not set, then eventually all analyses of software programs end up at causes like "arbitrary memory corruption", which can cause any kind of behavior within the limits set by the laws of physics.

Figure 2.4: FTA of dual-ring network with top-level event "Network partition" (network partition resulting from a second link failure: loss of a link in the clockwise direction and loss of a link in the counter-clockwise direction between different nodes).

2.3.3 Hazard and Operability Studies

Hazard and Operability Studies (HAZOP) is a technique developed in the 1960s for analyzing hazards in chemical processes. According to Storey [27], it has since become popular in other industries as well.

The roots in the chemical industry are apparent from the description by Storey [27], where the process is described as starting with a group of engineers studying the operation of a process in steady state, and the effects of deviations from that steady state. The procedure undoubtedly fits a continuous chemical process well, but requires adjustments to be applicable in other industries. The HAZOP procedure is also similar to FMEA in that one is supposed to pick a deviation, find out what could cause such a deviation, and what the deviation could in turn cause. This is better reflected in the German acronym PAAG (Prognose von Störungen, Auffinden von Ursachen, Abschätzen der Auswirkungen, Gegenmaßnahmen; in English, prediction of deviations, finding of causes, estimation of effects, countermeasures) [11].

Figure 2.5: Naive FTA of dual-ring network with top-level event "Network partition" (partition into B and CDEA through loss of the incoming and outgoing links of B, or partition into A and BCDE through loss of the incoming and outgoing links of A, each resulting from a second link failure).

Figure 2.6: FTA of dual-ring network with top-level event "Single failure tolerance lost" (loss of single fault tolerance caused by loss of a link between any adjacent node pair: A and B, B and C, C and D, D and E, or E and A).
In HAZOP, guide words are used to ease the discovery of potential failure types. Examples of guide words are "no", "more", "less" and "reverse", which are easily applicable to for example material flows in a continuous chemical process, but perhaps less easily to computer systems. In computer systems, there is for example no reservoir pool from which traffic can flow into the network if the current traffic flow is "less" than expected. It is still possible to apply the same guide words to a limited extent, though, in the sense that for example "more queries" could lead to analysis of the effect of an overload of well-formed-as-such queries on a query-oriented computer system. Also, overflow and underflow conditions in input values, missing fields in protocol objects and such should of course be determined. However, the latter level of analysis is usually done in unit tests, and is not part of whole-system analysis. Conceivably the guide words of HAZOP would indeed be well-suited for unit test construction.

Chapter 3

System Architecture

In this chapter I describe the system architecture of the high-availability command and control system and how it enables tolerance of all single-component hardware failures, some multiple-component hardware failures and certain software failures.

3.1 Background

The background for this thesis is a command and control system with high availability. The system was described at a very high level in Chapter 1. In this section I elaborate on the requirements of the command and control system from which the system architecture in the rest of this chapter is derived.

3.1.1 Availability and Reliability

It is clear that the command and control system should remain operational all the time. It is equally clear that it does not have to be as reliable as, for example, the cooling systems of nuclear power plants. Exact reliability and availability requirements are, however, rather unclear, since no generally applied reliability standards exist for command and control systems, unlike for nuclear facilities [28].

As noted in Chapter 1, it should be possible for operators normally responsible for one area to take control of units in another area. To achieve this, operators in all the control rooms must have access to all the information necessary for taking over control of units in another area. The information must be up-to-date, and most importantly it must not be possible for an operator to perform operations based on outdated data, such as dispatching a unit that has already been dispatched by another operator, but is still shown as free on his screen because of network delays.

3.1.2 Failure Model

Failures of the command and control system can be divided into multiple categories with varying severity. One category is failures that result in system unavailability for all operators. Paradoxically, this is perhaps the easiest situation from the perspective of operating procedures. All operators must simply switch to a manual backup procedure. Similarly, failures that result in unavailability of the system for a single operator, or for operators located in a single control room, can be dealt with by switching control of the affected area to another operator in the same control room, or to another control room entirely if the whole control room is unavailable.

Besides failures that cause total unavailability of the system for some subset of the operators, the system might also experience a partial failure that affects all operators, for a myriad of reasons. For example, if the TETRA terminal in a vehicle loses power, communications with the vehicle are disrupted, and the system loses the ability to locate the vehicle and communicate with it. These kinds of failures are expected, and usually detected with timeouts and acknowledgements in communications protocols. If the system is correctly designed and implemented, they will be detected and either mitigated or reported to the user.

For example, the system always shows the location of a vehicle with a timestamp indicating when the location report was received, so that the operator can detect if some vehicle is not sending new location reports. Similarly, if the user attempts to dispatch a vehicle on a mission, and there are communication problems with the vehicle, the system will first attempt to mitigate the failure by resending the message and, after a certain number of failed retries, notify the operator that an acknowledgement for the dispatch message was not received so that he can take appropriate action.

In this thesis I will concentrate on the use of a distributed database to mitigate the effects of hardware and network failures in internal components of the system. In particular, I will not spend effort in an attempt to prove the system is free of software bugs or able to tolerate malicious behavior from internal components. In fact, it is easy to imagine simple software problems that would result in difficult-to-detect problems on a running system. A bug in the text encoding routines for outgoing TETRA SDS messages could cause a dispatch order to be illegible, or worse, legible but wrong, at the receiving terminal.

Since the command and control system is used for disaster response, it should be resistant to plausible disasters, such as a fire in a datacenter where the system is running, preferably without human intervention. If the system is resistant to loss of a whole datacenter, it can obviously also be resistant to failure of any component inside the datacenter if the failure is handled the same way as loss of the whole datacenter. However, this may not be desirable for reasons of efficiency, so I will also look at handling of failures at a lower level.

3.2 Network Communications Architecture

Conceptually the system operates as described in Chapter 1 and elaborated above. Dispatchers connect to the system using client software running on their workstations. The client software connects to a backend system that runs in multiple data centers. Multiple data centers are exposed to the dispatcher so that the dispatcher may choose which datacenter to connect to. The primary procedure in case of problems with one datacenter is for the client software to automatically switch to a different datacenter. The switch should not lose the current state of the client, but may cause an interruption of a few seconds to client operations. See Figure 3.1 for an overview.

To enable switching of datacenter at will, the backend system must maintain consensus spanning multiple data centers. The minimum number of nodes for a system that maintains availability and consistency upon a single crash-type fault is three, according to Lamport [21]. It is obvious that one node is not enough (it is unavailable upon crash), and for two nodes it is impossible to distinguish failure of the interconnection between the two nodes from one node crashing. Thus both nodes must stop upon communication failure in order to maintain consistency; otherwise it could be that the failure was in the interconnection, and both nodes could proceed, causing divergence in system states. Three nodes are sufficient to distinguish failure of a network link between two nodes from the failure of one of the nodes using a simple majority vote.

It is not even necessary for all the nodes to store the database. One node may instead act as a witness for the other nodes, allowing them to decide whether the other data node is down or the interconnection between the data nodes has failed. However, the system architecture assumes that all nodes also store the data. This has implications for data durability upon multiple component failures.

Within a minimum three-node backend, each of the nodes maintains connectivity with both of the other nodes. The logical network is thus a ring network as presented in Chapter 2. See Figure 3.2 for an illustration.

Figure 3.1: System communications architecture overview (clients 1 and 2 connect to a backend running in datacenters DC1, DC2 and DC3).

In reality, it is likely that the network topology also resembles a star, since the connections from e.g. DC1 to DC2 and DC3 are not actually independent, but at least inside DC1 likely pass through common wiring and switching equipment (see Figures 3.1 and 3.9). This is not an issue, since it is expected that network connectivity within data centers has redundant physical links with quick enough failover to prevent triggering failure detectors in the actual backend software. Even if failure detectors are triggered, the problem is small, since the system is designed to tolerate the failure of a datacenter.

Figure 3.2: Cluster communications architecture overview (DC1, DC2 and DC3 connected to each other in a ring).

The network configuration described above is later assumed when describing the software architecture and database requirements. The test system, described in detail in Chapter 5, is also designed to simulate this configuration.
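A minimal sketch of the majority-quorum rule that motivates the three-node minimum discussed above; the cluster membership and function name are illustrative:

    def has_quorum(reachable_nodes, cluster_size):
        """A group of nodes may continue serving only if it can see a strict majority of the cluster."""
        return len(reachable_nodes) > cluster_size // 2

    # Three datacenters: DC1, DC2, DC3.
    cluster = {"DC1", "DC2", "DC3"}

    # DC3 is lost (or cut off): the remaining pair still forms a majority and continues.
    print(has_quorum({"DC1", "DC2"}, len(cluster)))   # True

    # With only two nodes, a broken interconnection leaves each side alone: neither has a
    # majority, so both must stop to avoid diverging, which is why two nodes cannot remain
    # both consistent and available.
    print(has_quorum({"DC1"}, 2))                     # False
    print(has_quorum({"DC2"}, 2))                     # False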

3.3 Software Architecture

At a high level the application software of the command and control system uses a messaging system to communicate changes to other application nodes in real time and a database to persistently store the current state. Figure 3.3 illustrates this. Dashed lines in the figure are connections to other datacenters. Among the information stored is the current state of each unit. Updates to unit state may be initiated by the client software or an external system connected to any of the application nodes. I will not describe the connectivity with external systems in detail here, since from the application's perspective it can be handled the same way as updates initiated by client software.

Figure 3.3: Software architecture overview (a client connects to an application node, which uses a message queue (MQ) and a database (DB); dashed connections lead to other datacenters).

It is imperative that state updates are committed to the database before being broadcast over the messaging system, since upon restart an application node will first start listening to updates from the messaging system and then refresh its internal state from the database. If an update were first broadcast using the messaging system and only then became visible through the database, the application node might start listening for updates from the messaging system after the update had been broadcast there and still receive an old version of the object from the database.

The application nodes also partially keep the system state in memory, so that when a client application fetches a particular object, it is primarily returned from memory by the application node, and if not present in memory, retrieved from the database. Application nodes also forward relevant updates to clients connected to them.

The database should also be causally consistent, that is, if client A performs a write, then communicates with client B, and client B does a read, client B should not be able to see a version of the item written that is older than what A wrote. This is not an absolute requirement, since with the described system architecture, lack of causal consistency causes unnecessary conflicts but does not cause malfunction.

Application software in the command and control system is designed so that consistency can be maintained as long as the underlying database provides an atomic update primitive. The atomic update primitive must be able to provide a guarantee similar to the CAS memory operation commonly found in modern processor instruction sets. As a memory operation, CAS replaces the value at address X with value B if the current value is A; otherwise it does nothing and somehow signals this. In a database setting, some sort of row or object identifier replaces the address, but otherwise the operation remains the same. The ABA problem is avoided by using version counters. Importantly, the software is designed so that it does not require transactions that span multiple rows or objects.

The messaging protocol is designed so that messages are idempotent. For example, a state update for unit X contains the complete unit state including the version number, rather than just the updated fields. Including version numbers in messages also allows nodes to ignore obsolete information. For example, it is possible that nodes A and B could update the state of unit X in quick succession so that the update messages are delivered out of order to node C. Using the version information, C can then ignore the obsolete update from A.
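A minimal sketch of the atomic update primitive described above, expressed as a compare-and-set keyed by object identifier and version counter; the in-memory store and its method names are illustrative and not the API of any evaluated database:

    import threading

    class VersionedStore:
        """Toy keyed store offering the CAS-like conditional update the application relies on."""

        def __init__(self):
            self._lock = threading.Lock()
            self._data = {}  # key -> (version, value)

        def get(self, key):
            with self._lock:
                return self._data.get(key)

        def compare_and_set(self, key, expected_version, new_value):
            """Install new_value as version expected_version + 1 only if the stored version matches.

            Using a monotonically increasing version counter (rather than comparing values)
            also avoids the ABA problem mentioned above.
            """
            with self._lock:
                current_version = self._data.get(key, (0, None))[0]
                if current_version != expected_version:
                    return False, current_version
                self._data[key] = (expected_version + 1, new_value)
                return True, expected_version + 1

    store = VersionedStore()
    ok, version = store.compare_and_set("unit-X", 0, {"status": "free"})        # creates version 1
    ok, version = store.compare_and_set("unit-X", 1, {"status": "dispatched"})  # ok, version 2
    ok, version = store.compare_and_set("unit-X", 1, {"status": "free"})        # stale: fails, version stays 2
    print(ok, version)  # False 2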

3.3.1 Update Operation

Figure 3.4: Successful update operation (the client sends Update(X,1,2) to the application server, which performs Update(X,1,2) on the database, receives success, broadcasts Updated(X,2) over the messaging system and returns success to the client).

In the nominal case, a state update for unit X initiated by a client is performed as shown in Figure 3.4. First the client application requests the application server to update unit X from version 1 to version 2. The application server requests the database server to perform the same update. In the nominal case the update succeeds and the messaging system is used to communicate the update to the other application server nodes. Finally the application server informs the client that the update was successful.

In case the client does not receive a success response within a timeout, it displays a failure message to the user. The software then switches to another application server, on which the update procedure succeeds. Figure 3.5 illustrates the update procedure in case a failure occurs on Application Server 1 before it updates the database.

Figure 3.5: Application server crashes before performing database update (the client re-sends Update(X,1,2) to application server 2, which updates the database, broadcasts Updated(X,2) and returns success).

If the database update had already been performed, the recovery procedure is different. When application server 2 attempts to perform the update for the client, the database operation fails because the current version (version 2, as updated by application server 1) does not match the version provided (version 1, provided by the client). Application server 2 then fetches the current version from the database and compares it with the new version provided by the client. Since they are the same, the database update had already been completed before, and the server application proceeds to broadcast the update via the messaging system. Since messages are idempotent, it does not matter whether the crash of the original application server happened before the message was broadcast, as in Figure 3.6, or afterwards, as in Figure 3.7.

Figure 3.6: Application server crashes after performing database update but before broadcasting the update (application server 2's Update(X,1,2) fails with a version conflict; it fetches the current version, compares the client's (X,2) with the fetched (X,2), broadcasts Updated(X,2) and returns success).

Figure 3.7: Application server crashes after performing database update and broadcasting it (the retry on application server 2 again fails with a version conflict, the fetched (X,2) matches the client's (X,2), the update is re-broadcast and success is returned).

The version comparison detailed above is also used to detect actual conflicts. In Figure 3.8 two clients race to update unit X and client 2 wins the race. The application node serving client 1 receives a failure indicating a version conflict, as in Figure 3.6 or 3.7. However, after fetching the current version from the database, the application node finds that it does not match the version that client 1 was offering as version 2. The only possibility upon a conflict like this is to return an error to the client, since the system does not know how to resolve the conflict.
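A minimal sketch of the recovery logic shown in Figures 3.4 to 3.8: on a version conflict the application server fetches the current version and either treats the update as already applied (and re-broadcasts it) or reports a genuine conflict to the client. The function and the toy store below are illustrative, not the actual application code:

    def handle_update(fetch, conditional_update, broadcast, key, expected_version, new_value):
        """Apply a client's update, recovering from a previous application server's crash.

        fetch(key) returns (version, value) or None; conditional_update(key, expected_version,
        new_value) returns (ok, current_version); broadcast publishes idempotent Updated messages.
        """
        ok, current_version = conditional_update(key, expected_version, new_value)
        if ok:
            broadcast(key, current_version, new_value)            # nominal case (Figure 3.4)
            return "success"

        current = fetch(key)
        if current == (expected_version + 1, new_value):
            # Another application server already committed exactly this update before crashing
            # (Figures 3.6 and 3.7); messages are idempotent, so broadcasting again is safe.
            broadcast(key, current[0], current[1])
            return "success"

        # A genuinely conflicting update won the race (Figure 3.8); the system cannot merge
        # the two versions, so an error is returned to the client.
        return f"conflict, current version is {current[0] if current else 'unknown'}"


    # Usage against a toy in-memory store (a stand-in for the real database client).
    data = {"unit-X": (1, {"status": "free"})}

    def fetch(key):
        return data.get(key)

    def conditional_update(key, expected_version, new_value):
        version = data.get(key, (0, None))[0]
        if version != expected_version:
            return False, version
        data[key] = (expected_version + 1, new_value)
        return True, expected_version + 1

    def broadcast(key, version, value):
        pass  # the messaging system is out of scope for this sketch

    print(handle_update(fetch, conditional_update, broadcast, "unit-X", 1, {"status": "dispatched"}))  # success
    print(handle_update(fetch, conditional_update, broadcast, "unit-X", 1, {"status": "dispatched"}))  # success
    print(handle_update(fetch, conditional_update, broadcast, "unit-X", 1, {"status": "free"}))        # conflict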

Figure 3.8: Update conflict (clients 1 and 2 both attempt to update X from version 1; client 2's update to version 2' succeeds first, so client 1's Update(X,1,2) fails with a version conflict, the fetched (X,2') does not match client 1's (X,2), and the error "current version is (X,2')" is returned to client 1).

3.4 FMEA Analysis of System

In this section I present an FMEA analysis of the system. I concentrate on the hardware components of a concrete derivative of the abstract network configuration shown in Figure 3.2. In this concrete system, each datacenter has one physical computer (CPUn) connected to a switch (SWn) with two cables, using interface bonding for redundancy. The switch has a single external connection to an external routed network, the topology of which is such that routes to other data centers are symmetric and common up to a point (Rn) but split after that. The components are shown in Figure 3.9.

In this analysis, the physical computers are treated as a single component. In a real installation, the computers will have redundant subsystems such as multiple disks and power supplies, but also single points of failure such as the motherboard chipset, and some analysis of the effects of subsystem failures should be performed. However, this FMEA analysis is not complete at the sub-computer level because the configurations of individual computers are not so standardized as to facilitate analysis of anything except actual installations of the system, and no such installations are available for analysis at present time. Similarly, as mentioned in Chapter 1, software errors are not part of this analysis.

Figure 3.9: Network configuration under FMEA analysis (a client and routers R1, R2 and R3 in an external routed network; each datacenter DCn contains a switch SWn connected with two links to a computer CPUn).

As expected, the FMEA shows that no single non-byzantine failure will cause the system to stop operating. Since the system is not designed to tolerate byzantine behavior, the result is good. However, it should be noted that increased latency due to for example misconfiguration of the network will not be detected in the system, unless the latency is so high as to trigger timeouts in network protocols. Even lower latency levels will, however, result in lower system performance. In an actual installation, SLAs (service level agreements) should be used to ensure that network links have sufficiently low latency, and in addition latencies should be monitored with standard tools.

Table 3.1: FMEA analysis of the system

- Client-Rn connectivity; failure mode: connectivity lost. Immediate effects: the client stops receiving updates from the datacenter it is connected to. Detection: the client detects it has lost connectivity using timeouts. Automatic recovery: the client connects to another datacenter. After recovery: service continues using another datacenter.

- CPUn; failure mode: stop failure. Immediate effects: CPUn stops processing transactions. Detection: database software in other data centers detects the error through the absence of communications. Automatic recovery: software in other data centers forms a new cluster that resumes service. After recovery: after the new cluster has been formed, service continues.

- CPUn; failure mode: byzantine failure, due to for example malicious software or data corruption. Immediate effects: CPUn behaves arbitrarily. Detection: detection may be impossible. Automatic recovery: no automatic recovery. After recovery: no automatic recovery.

- SWn; failure mode: stop failure. Immediate effects: SWn stops transferring network traffic. Detection: database software in other data centers detects the error through the absence of communications. Automatic recovery: software in other data centers forms a new cluster that resumes service. After recovery: after the new cluster has been formed, service continues.

- SWn; failure mode: data corruption. Immediate effects: SWn corrupts network traffic that passes through it. Detection: other switches or database software detect the corruption using checksums. Automatic recovery: software in other data centers forms a new cluster that resumes service. After recovery: after the new cluster has been formed, service continues.

- CPUn-SWn primary network cable; failure mode: cable is cut. Immediate effects: the CPUn-SWn network connection stops delivering packets. Detection: software on CPUn detects the failure through ARP failures or through MII sniffing. Automatic recovery: CPUn switches to the alternate CPUn-SWn link. After recovery: CPUn-SWn connectivity resumes after a small (up to hundreds of milliseconds) delay; this delays database queries, but not enough to cause failures.

- CPUn-SWn alternate network cable; failure mode: cable is cut. Immediate effects: no effect. Detection: software on CPUn detects the failure through ARP failures or through MII sniffing. Automatic recovery: no automatic recovery. After recovery: no automatic recovery.

- SWn-Rn connectivity; failure mode: connectivity lost. Immediate effects: transactions do not reach CPUn. Detection: software on other CPUs detects the error through the absence of communications. Automatic recovery: software on other CPUs forms a new cluster that resumes service. After recovery: after the new cluster has been formed, service continues.

- SWn-Rn connectivity; failure mode: increased latency. Immediate effects: decreased system throughput. Detection: can be detected using standard system monitoring tools. Automatic recovery: no automatic recovery. After recovery: no automatic recovery.

- Ra-Rb connectivity; failure mode: connectivity lost. Immediate effects: communications between CPUa and CPUb are interrupted. Detection: software on CPUa and CPUb detects the communication failure. Automatic recovery: software on other CPUs forms a new cluster that resumes service. After recovery: after the new cluster has been formed, service continues.

- Ra-Rb connectivity; failure mode: increased latency. Immediate effects: decreased system throughput. Detection: can be detected using standard system monitoring tools. Automatic recovery: no automatic recovery. After recovery: no automatic recovery.
3.5 FTA Analysis of System

In this section I present FTA analysis of the system network configuration based on the following root event: "service inaccessible to client". See Figure 3.9 for the network configuration on which this analysis is based. Three analysis trees are generated: one for the entire system (Figure 3.10), one for the case where inaccessibility is caused by problems in the cluster (Figure 3.11) and one for failure of a data center and its possible causes (Figure 3.12). All data centers are identical, and thus the single-datacenter analysis is applicable to all three data centers in the system. In the FTA analysis, each datacenter is also considered to include its external network link, since it is assumed that only a single link exists, and this simplifies higher-level analysis.

Figure 3.10: System-level FTA. The top event "Service inaccessible to client" has two branches: "Client network failed", with the sub-events Client-R1, Client-R2 and Client-R3 connection failures, and "Cluster problem".

The failure modes discovered through FTA are not surprising. The only difference from the analysis of the abstract dual-ring model done in Chapter 2 is the assumption that connectivity can break in such a way that node 1 can communicate with node 2 and node 2 with node 3, but node 1 may still be unable to communicate with node 3. This produces a new failure mode for the cluster, named "Datacenter and intra-datacenter link failed" in Figure 3.11.

Figure 3.11: Backend system FTA. The cluster fails if two data centers fail (DC1 and DC2, DC1 and DC3, or DC2 and DC3), if all of the links R1-R2, R1-R3 and R2-R3 fail, or if a datacenter fails together with the link between the two remaining data centers (for example DC1 together with the R2-R3 link).

Figure 3.12: Datacenter FTA. A datacenter fails if its external network link fails, its switch fails, its computer fails, or both of its internal network links (internal link 1 and internal link 2) fail.

3.6 Software Reliability Considerations

In software reliability literature such as Storey [27], the problem of common-cause failures is often mentioned as an issue particularly prevalent in software systems. The reason why common-cause failures are particularly problematic for software systems is that software does not wear out, and thus most of the failure categories that apply to mechanical and electrical systems are not applicable. Naturally software systems still require some underlying hardware to function, but failures in that hardware can largely be tolerated through hardware-level redundancy such as parallel power supplies or ECC RAM, mixed hardware and firmware means such as multiple disks in a RAID configuration, or multiple computer units and higher-level clustering software.

Even though multiple-computer configurations help tolerate hardware failures, the software itself remains a single point of failure. Design mistakes are automatically replicated to all copies of the software. If all the computers run identical software, a fault in the software may easily cause an identical error on all computers in a cluster, possibly stopping operation of the whole computer system. The solution is to ensure that the software running on different computers is different, or diverse. However, all instances of the software must still have identical external behavior, which in practice leads to some common-cause failures even in different programs written against the same specification. It is also notable that software systems tend to be complex, and even system specifications often contain mistakes, which will propagate to all correct implementations of the specification.

With some of the database systems evaluated, some amount of software diversity can be achieved by running different versions of the database software in one cluster. This is possible to some extent, since for most of the evaluated systems the preferred software update method is a so-called rolling update, in which nodes are taken down and updated one at a time. There is no apparent reason why multiple versions could not be left operational as well. However, for many of the databases it is not specified whether more than two versions can be operated simultaneously. If not, two of the three cluster nodes would still run the same software, possibly suffering from common-cause failures. Two of three cluster nodes failing would cause system downtime, and thus this is not a very attractive configuration, especially since new versions of the database software typically fix issues, and running some nodes without these fixes would expose the system to known failures.

Another method of achieving software diversity in a complex software system that uses many standard components, such as the C library or the Java runtime, would be to use versions of the standard components that are as different as possible on different systems. For example, the C library could be GNU C Library, BSD libc or musl. For Java runtimes there are fewer high-performance alternatives, but at least Oracle and IBM offer such. It would even be possible to use diverse operating systems, such as FreeBSD, NetBSD and Linux, in the same cluster.

3.7 Conclusions on Analyses

FMEA and FTA analyses of the architectural model of the system provide further evidence of the suitability of the architecture for high-availability operation. It appears that the design goal of single-hardware-fault tolerance is achieved at the architectural level. However, since the architecture is so simple, this was evident from the start. The main use for methodical analysis in this case is not as a design tool but as documentation. As documentation, FMEA and FTA benefit from having a standard structure, which makes them quicker to interpret than, for example, free-form text.

It should also be noted that the architectural model presented here is a simplification; in reality, for example, the network topologies between deployment data centers should be analyzed to discover whether they comply with the architecture or not. If a noncompliant network topology is discovered, its effects on fault tolerance should be analyzed separately. This is an example of componentization, which allows simplification of the higher-level model to a level at which it can be analyzed within economical bounds. In general, componentization also allows generalization of the analysis so that it is not limited to a specific instance of the system.

Chapter 4

Evaluated Database Systems

I considered many databases for evaluation. In the following sections I first present requirements for the database based on the system architecture described in Chapter 3, and then list both the databases that I selected for evaluation based on the requirements and the ones that I rejected, together with the reasons for rejection.

4.1 Database Requirements

As presented in Chapter 3, the command and control system is designed to depend on the database for highly available shared state. The ability to perform CAS-like atomic updates is the primary functional requirement for the database. In SQL, the necessary construct is SELECT FOR UPDATE, which ensures that the row is not updated by others before the current transaction ends. In non-SQL databases the APIs vary, but typically the procedure is optimistic in that the SELECT-like read operation and the UPDATE-like write operation are not connected. Instead, the write operation takes both the old and the new version as parameters, and only succeeds if the old version is current. This is exactly equivalent to CAS.

Additional requirements for the database stem from the requirement that the system remain operational despite datacenter-scale failures. To achieve the necessary fault tolerance, the database system must support multi-datacenter installations with some form of synchronous replication so that atomic updates are available cluster-wide. Successful commits must be replicated to at least two sites so that failure of any single site does not cause data loss. In addition, the database system must have adequate throughput both before and after hardware failures. The database system should also allow backups to be made of a live system without disruption to service.
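To make the CAS requirement concrete, the following sketch shows the optimistic read-modify-write loop in Python. The client API used here (a get returning a value and a version, and a conditional put_if_version) is hypothetical; it stands in for, for example, a conditional set in a key-value store or an SQL SELECT FOR UPDATE transaction.

    # A minimal sketch of the optimistic CAS-style update pattern; the client
    # object and its get/put_if_version methods are hypothetical placeholders.
    class ConflictError(Exception):
        """Raised by the hypothetical client when the version check fails."""

    def atomic_increment(client, key):
        while True:
            value, version = client.get(key)            # read current value and its version
            try:
                # The write succeeds only if the stored version is still the one read above.
                client.put_if_version(key, value + 1, version)
                return value + 1
            except ConflictError:
                continue                                 # another writer got there first; retry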


Many modern so-called NoSQL databases do provide excellent throughput, but with limited consistency guarantees. The reasons why they do not provide cluster-wide atomic updates vary, but generally it is a design choice that allows writes to complete even when a quorum is not available. Even when a database offers quorum writes as an option, the quorum often only ensures that the write is durable, not that a conflicting write cannot succeed concurrently. In essence, the database assumes that writes are independent in the sense that they must not rely on previous values for the same key. Typically both conflicting writes succeed, and depending on the database the application developer must either reconcile the conflicting updates at read time, or some sort of timestamp comparison is used to resolve the conflict automatically.
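The read-time reconciliation mentioned above can be illustrated with a short sketch. The client API that returns all conflicting versions ("siblings") of a key is hypothetical, and the merge policy shown (keep the version with the newest timestamp) is only one possible application-level choice.

    # A minimal sketch of read-time reconciliation of conflicting writes;
    # get_all_versions and put are hypothetical client calls.
    def read_reconciled(client, key):
        siblings = client.get_all_versions(key)
        if len(siblings) == 1:
            return siblings[0].value
        # Application-specific merge: here last-write-wins by timestamp; a real
        # application might instead merge the values, e.g. by set union.
        winner = max(siblings, key=lambda s: s.timestamp)
        client.put(key, winner.value)      # write the resolved value back
        return winner.value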

4.2 Rejected Databases

The following sections describe the database systems, as well as some approaches to the problem that are not based on a single database system, that I rejected based on a cursory survey.

4.2.1 XA-based Two-phase Commit

The update problem could possibly be solved using any database that supports the standard XA two-phase commit (2PC) protocol and an external transaction manager such as Atomikos or JBoss Transactions. However, 2PC is not suited for a high-availability system that should recover quickly from failures, since the behavior of 2PC upon member failure at certain points requires waiting for that member to come back up before processing of other transactions can proceed on the remaining members. [16, chapter 14.3] The cause for this design decision is clear. The XA protocol was designed to allow atomic commits to happen across disparate systems (such as a message queue and different database engines), where the main concern is that the transaction either happens on all the systems or on none of them. The problem here, however, is to allow atomic commits to happen in a distributed environment so that the system can proceed even when some nodes are not present.

In addition to the potentially hazardous failure mode, the application itself would have to manage replication by writing to each replica within a single transaction. The application would thus have to know about all the replicas, and for example bringing down one replica and replacing it with another would require configuration changes to the application itself. On the whole, it appears that a solution based on standard XA 2PC is likely to be a source of much trouble and not worth investigating further.

4.2.2 Neo4j

Neo4j1 is a graph database package developed by Neo Technology. It is distributed under the GPLv3 [7] and AGPLv3 [6] licenses, and commercial licensing is also possible. The database can also be used as a regular object store, and it offers transactions and clustering for high availability. However, transactions are only available through the native Java API. The only API offered to other languages is an HTTP REST API that does not provide transactions or even simpler conditional updates. I decided not to proceed further in my tests with Neo4j because of these interface limitations.

4.2.3 MySQL

MySQL2 is a well-known database system nowadays developed by Oracle Corporation. Oracle distributes MySQL under the GPLv2 [2] license, and commercial licensing is also possible. MySQL APIs exist for most popular programming languages, and MySQL offers transactions and replication for high availability.

In MySQL version 5.5, only asynchronous replication was available, so I decided not to evaluate MySQL further. During the writing of this thesis, Oracle released MySQL 5.6 with semisynchronous replication. A cursory look at the semisynchronous replication feature suggests that with high synchronization timeout values it might have been suitable for further testing.

Several forks of MySQL also exist. After a cursory review of the biggest three (Drizzle, MariaDB and Percona Server) I found that none of them provide synchronous replication.

MySQL MMM

MySQL MMM3 is a multi-master replication system built on top of standard MySQL. It does not offer any consistency guarantees for concurrent conflicting writes: a write made to one master is simply replicated to the other asynchronously. Because MySQL MMM does not provide cluster-wide atomic commits, I decided not to evaluate it further.

1http://neo4j.org/ 2http://www.mysql.com/ 3http://mysql-mmm.org/

4.2.4 PostgreSQL

PostgreSQL4 is an open source database system developed outside the control of any single company. PostgreSQL is distributed under the PostgreSQL license [8], which is similar to the standard MIT [1] license. APIs for PostgreSQL exist for most popular programming languages.

PostgreSQL 9.1 is the first version that offers synchronous replication. However, PostgreSQL 9.1 limits synchronous replication to a single target [10]. It would be possible to build a system fulfilling all the database requirements on top of PostgreSQL, but I decided not to start such an ambitious project in the scope of this thesis, thus abandoning further evaluation of PostgreSQL.

4.2.5 HBase

HBase5 is an open source distributed database built on the Apache Hadoop distributed software package. HBase is distributed under the Apache Software License version 2.0 [5]. A native API for HBase is only available in Java; however, an API based on the Thrift RPC system can be used from many popular languages such as Python and C++. HBase allows atomic updates of single rows through the checkAndPut API call.

HBase depends on the Hadoop Distributed Filesystem (HDFS) and Zookeeper for clustering. HDFS is used to achieve shared storage through which HBase nodes access data. In Hadoop versions before 2.0.0 the HDFS architecture has a single point of failure (SPOF) in the form of the NameNode. During the writing of this thesis, the initial release of Hadoop 2.0.0 partially remedied this issue through the HDFS HA feature, although it still only provides master-standby operation and failover remains manual. I originally abandoned further evaluation of HBase because of the HDFS single point of failure issue.

4.2.6 Redis

Redis6 is an open source key-value store. Redis is distributed under a BSD 3-clause [4] license. APIs for Redis are available for most popular languages. Redis supports compare-and-set type operations through its WATCH, MULTI and EXEC commands.

Redis implements master-slave asynchronous replication. I abandoned further evaluation of Redis because it does not support synchronous replication, and thus transactions may be lost after the master node has confirmed them to the client but before the master node has replicated them to any slaves. There is also a specification for Redis Cluster that would support synchronous replication, but it has not been implemented yet.
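As an illustration of the WATCH/MULTI/EXEC pattern mentioned above, the following sketch shows a CAS-style counter increment using the redis-py client. It is not one of the thesis test programs; the host and key names are placeholders.

    # Optimistic increment with Redis WATCH/MULTI/EXEC: the transaction is
    # aborted (WatchError) if the watched key changes before EXEC.
    import redis
    from redis.exceptions import WatchError

    r = redis.StrictRedis(host='10.0.0.1', port=6379)

    def increment_counter(key):
        while True:
            with r.pipeline() as pipe:
                try:
                    pipe.watch(key)                      # watch the key for concurrent changes
                    current = int(pipe.get(key) or 0)    # immediate-mode read
                    pipe.multi()                         # start queuing the transaction
                    pipe.set(key, current + 1)
                    pipe.execute()                       # raises WatchError on conflict
                    return current + 1
                except WatchError:
                    continue                             # another client won the race; retry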

4http://www.postgresql.org/ 5http://hbase.apache.org/ 6http://redis.io/

4.2.7 Hypertable

Hypertable7 is an open source distributed database built on the Apache Hadoop distributed software package. Hypertable is developed by Hypertable Inc. and available under the GPLv3 [7] license as well as under a commercial license to be negotiated separately with Hypertable Inc. APIs for Hypertable are based on the Thrift RPC system, and bindings are available for several popular programming languages. Hypertable does not support atomic operations beyond counters. Hypertable's high availability features are restricted by its use of Hadoop HDFS, which introduces a single point of failure. I abandoned further evaluation of Hypertable because of the lack of support for atomic updates of even single rows and because of the HDFS single point of failure issue.

4.2.8 H-Store

H-Store8 is an experimental memory-based database developed in collaboration between MIT, Brown University, Yale University and HP Labs. H-Store is available under the GPLv3 [7] license. The H-Store website states that it is experimental and unsuitable for production environments, which is why I abandoned further evaluation of it.

4.2.9 Infinispan

Infinispan9 is an in-memory key/value store developed by RedHat, originally designed to be used as a cache. It is available under the LGPLv2.1 [3] license. Infinispan supports atomic updates, on-disk storage and clustering with synchronous replication. However, it has the same failing as MySQL Cluster: it cannot be configured to prevent writes from succeeding in a minority partition of the cluster.

7http://hypertable.org/ 8http://hstore.cs.brown.edu/ 9http://www.jboss.org/infinispan/

4.2.10 Project Voldemort

Project Voldemort10 is a distributed key-value store developed at LinkedIn. Project Voldemort is available under the Apache Software License 2.0 [5]. Voldemort uses vector clocks and read repair to provide eventual consistency, but it is unclear whether it can be used to implement fault-tolerant atomic updates. Rather than attempt to do so, I abandoned further evaluation of Project Voldemort.

4.2.11 Membase / Couchbase

Couchbase11 is a distributed database developed by Couchbase. Couchbase is available under the Apache Software License 2.0 [5]. APIs for Couchbase are available for various popular programming languages, including Java, C and Python. Couchbase supports replication, but I could not find a complete description of the replication semantics when I originally selected databases for further evaluation in summer 2011. At that time it appeared that Couchbase 2.0 might include some improvements to replication, but only developer preview versions were available. As of July 2012, Couchbase 2.0 is still available only as developer preview versions, and the features offered by the replication subsystem remain unclear.

4.2.12 Terrastore

Terrastore12 is an open-source NoSQL database licensed under the Apache Software License 2.0 [5]. It appears to be an independent project hosted on Google Code. Terrastore can be accessed using a Java API or an HTTP API. Both interfaces offer conditional updates.

The replication features of Terrastore are based on the Terracotta in-memory clustering technology from Terracotta Inc. However, I could not find information on the durability or persistence features of Terrastore in the Terrastore wiki, so I abandoned further evaluation.

4.2.13 Hibari

Hibari13 is a strongly consistent key-value store licensed under the Apache Software License 2.0 [5]. Hibari was originally developed by Gemini Mobile Technologies Inc.

10http://project-voldemort.com/ 11http://www.couchbase.com/ 12http://code.google.com/p/terrastore/ 13http://hibari.github.com/hibari-doc/

Hibari has a native Erlang API and a cross-platform API based on the Thrift RPC system. When I originally read the Hibari documentation it appeared that it did not support atomic conditional updates, which is why I did not select it for further evaluation. However, upon later reading it appears that I was mistaken and should have evaluated it more carefully.

4.2.14 Scalaris

Scalaris14 is a transactional distributed key-value store. The development of Scalaris has been funded by the Zuse Institute Berlin, onScale solutions GmbH and several EU projects. Scalaris is available under the Apache Software License 2.0 [5]. Scalaris does not appear to be production-ready, which is why I originally decided not to evaluate it further. At the time of writing of this chapter, in July 2012, the links to the Users and Developers Guide on the Scalaris homepage do not lead anywhere, so it would appear that the original decision was correct.

4.2.15 GT.M

GT.M15 is a key-value database engine originally developed by Greystone Technology Corp. Nowadays it is maintained by Fidelity Information Services. GT.M is available under the GPLv2 [2] license. APIs for GT.M are available for some popular languages, including Python. GT.M offers ACID transactions.

GT.M offers Business Continuity replication, which on closer inspection appears to be asynchronous replication for disaster recovery purposes. I abandoned further evaluation of GT.M because it does not have synchronous replication.

4.2.16 OrientDB

OrientDB16 is an open-source graph-document database system. OrientDB is distributed under the Apache Software License 2.0 [5]. The native language for OrientDB is Java. In addition to Java, language bindings are available for at least Python, using the HTTP interface of OrientDB. The HTTP REST API of OrientDB is limited in that it does not offer conditional updates. Conditional updates are, however, supported using the native Java API.

14http://code.google.com/p/scalaris/ 15http://www.fisglobal.com/products-technologyplatforms-gtm 16http://www.orientdb.org/

OrientDB supports both synchronous and asynchronous replication. It is not clear what visibility guarantees the replication offers. I abandoned further evaluation of OrientDB because of the limitations of its cross-language support.

4.2.17 Kyoto Tycoon

Kyoto Tycoon17 is a key-value database engine developed and maintained by FAL Labs. Kyoto Tycoon is distributed under the GPLv3 [7] license. APIs for Kyoto Tycoon exist at least for C/C++ and Python. Kyoto Tycoon supports atomic updates. The high-availability features of Kyoto Tycoon are limited to hot backup and asynchronous replication. I abandoned further evaluation of Kyoto Tycoon because of its lack of synchronous replication.

4.2.18 CouchDB

CouchDB18 is an open-source distributed database. CouchDB is distributed under the Apache Software License 2.0 [5]. CouchDB is accessed using an HTTP API, and bindings are available for many popular programming languages including Java and Python. CouchDB supports atomic updates on a single server but not cluster-wide.

CouchDB supports peer-to-peer replication with automatic conflict resolution that ensures all nodes resolve conflicts the same way. The replication is not visible to users in the sense that a user cannot, for example, select how many replicas must receive a write before it is considered successful. Manual conflict resolution that differs from the automated procedure is also possible, since the copies that lose in the automatic resolution process can still be accessed. I did not consider CouchDB for further evaluation because of the limited atomic update support.
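The single-server atomic updates mentioned above rely on CouchDB's revision mechanism: a document can only be overwritten by supplying its current _rev, and the server answers 409 Conflict otherwise. The following sketch illustrates this over the HTTP API using the requests library; the database URL and document layout are placeholders.

    # Conditional (revision-checked) update against a single CouchDB server.
    import requests

    BASE = 'http://127.0.0.1:5984/testdb'

    def increment_counter(counter_id):
        while True:
            doc = requests.get('%s/%s' % (BASE, counter_id)).json()   # includes _id and _rev
            doc['val'] += 1
            resp = requests.put('%s/%s' % (BASE, counter_id), json=doc)
            if resp.ok:
                return doc['val']
            if resp.status_code != 409:
                resp.raise_for_status()
            # 409 Conflict: another writer bumped _rev first; re-read and retry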

4.3 Databases Selected for Limited Evaluation

Databases selected for limited evaluation were evaluated for throughput in a non-conflicting update test without failures and for fault tolerance with a suite of fault-inducing tests. The databases selected for limited evaluation are popular among application developers but are not suitable for use as the main database for the command and control system, because they do not support cluster-wide atomic updates. All of them can be set up in a cluster so that the cluster does not have a single point of failure, and their APIs allow the client to specify that a write must be replicated to a certain number of hosts before the write is considered successful. These features make them suitable for write-mostly applications such as logging.

17http://fallabs.com/kyototycoon/ 18http://couchdb.apache.org/

4.3.1 Cassandra

Cassandra19 is an open-source distributed database system. Cassandra is distributed under the Apache Software License 2.0 [5]. The Cassandra API is based on the Thrift RPC system, and bindings exist for many popular languages such as Java, Python and Ruby. I used version 1.1.2 of Cassandra in my tests.

Consistency requirements in Cassandra are specified for each write and read operation separately. If readers read from a quorum of nodes and writers write to a quorum of nodes, then causal consistency exists between writes and reads, so that after a write has completed, all readers will see that write. However, it is not possible to detect conflicting writes, which makes atomic updates impossible to implement, and thus I only consider Cassandra suitable for limited evaluation.

Cassandra has limited support for backups. Each node can be backed up separately, but there is no way to get a backup of the whole cluster at a single point in time. Considering that atomic updates are not possible, and thus Cassandra is not suited for storing data that must be updated in a consistent fashion, this is probably acceptable.

Based on the Cassandra documentation20, the nodetool repair command should be run on each node periodically, to ensure that data that is not written to all nodes at commit time and is rarely read gets replicated appropriately. No other periodic maintenance tasks are mentioned in the documentation. In addition to repairing data, nodetool also allows other maintenance tasks to be performed, such as removing nodes from the cluster and rebalancing the hash ring Cassandra uses to locate data in the cluster. A new node is added by starting it with a configuration that includes some hosts of the existing cluster in a so-called "seed list".

19http://cassandra.apache.org/ 20http://wiki.apache.org/cassandra/
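The per-operation consistency levels described above can be set with the pycassa client used in my tests, as sketched below. The keyspace, column family and server addresses are placeholders, and the column value is written as a string to avoid assumptions about column validators.

    # Quorum reads and writes with pycassa.
    import pycassa
    from pycassa import ConsistencyLevel

    pool = pycassa.ConnectionPool('TestKeyspace',
                                  server_list=['10.0.0.1:9160', '10.0.0.2:9160', '10.0.0.3:9160'])
    counters = pycassa.ColumnFamily(pool, 'counters',
                                    read_consistency_level=ConsistencyLevel.QUORUM,
                                    write_consistency_level=ConsistencyLevel.QUORUM)

    counters.insert('counter-7', {'value': '42'})   # acknowledged once a quorum has the write
    row = counters.get('counter-7')                 # read that consults a quorum of replicas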

4.3.2 Riak

Riak21 is a distributed database system developed by Basho Technologies Inc. Riak is distributed under the Apache Software License 2.0 [5]. Riak is accessed via a REST-style HTTP API, and bindings exist for several popular programming languages including Java, Python and Ruby. I used version 1.1.4 of Riak in my tests.

Similar to Cassandra, consistency requirements in Riak are specified for each write and read operation separately, and causal consistency is achievable as with Cassandra. As with Cassandra, atomic updates are not possible because conflicts are resolved at read time: when conflicting writes occur, the reader must select which one is retained. Because of this impossibility of atomic updates I only consider Riak suitable for limited evaluation.

Based on my dual-fault tests, Riak cannot ensure that writes only succeed when a quorum of nodes is available. I stopped the evaluation of Riak once I noticed this behavior, so no details on its backup mechanism or administrative tools are presented here.

4.4 Databases Selected for Full-Scale Evaluation

Databases selected for full-scale evaluation were evaluated for throughput in non-conflicting update tests and in update tests with various conflict rates, both without failures, and for fault tolerance with a suite of fault-inducing tests.

4.4.1 Galera

Galera22 is a multi-master clustering solution for MySQL developed by Codership Oy. Galera is distributed under the GPLv2 [2]. Galera is an add-on to MySQL, so standard MySQL clients can be used to access it. I used version 2.2rc1 in my tests.

Galera offers transparent multi-master clustering on top of standard MySQL. Galera uses synchronous replication that ensures that only one of concurrent conflicting transactions can commit. Galera also ensures that only a partition of the cluster with the majority of nodes in it can process transactions, thus ensuring that successful commits are always replicated to at least two sites in the command and control system architecture.

21http://basho.com/products/riak-overview/ 22http://codership.com/

Galera maintains a copy of all data on each cluster node, so it is possible to take a block device-level snapshot of the data on one node and use that as a backup. Galera also supports standard MySQL/InnoDB backup tools such as Percona XtraBackup.

Based on the documentation on the Galera website, a Galera cluster does not require any periodic maintenance. Connecting nodes to and removing nodes from the cluster is done by manipulating the values of configuration variables, either through the configuration file or through the MySQL command line interface. Cluster state can also be monitored through the regular MySQL interface by inspecting configuration and state variables. Each Galera node has different configuration settings, since each node must be configured with the addresses of the other nodes in the cluster.
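As an example of the state monitoring mentioned above, the following sketch queries two standard Galera status variables, wsrep_cluster_size and wsrep_cluster_status, through the regular MySQL interface using the oursql driver. The connection parameters are placeholders.

    # Checking Galera cluster state through ordinary MySQL status variables.
    import oursql

    conn = oursql.connect(host='10.0.0.1', user='monitor', passwd='secret', db='test')
    curs = conn.cursor()
    curs.execute("SHOW STATUS LIKE 'wsrep_cluster_size'")
    print(curs.fetchall())      # e.g. [('wsrep_cluster_size', '3')]
    curs.execute("SHOW STATUS LIKE 'wsrep_cluster_status'")
    print(curs.fetchall())      # 'Primary' when this node is in the primary component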

4.4.2 MongoDB

MongoDB23 is a document database system developed by 10gen Inc. The MongoDB server itself is distributed under the AGPLv3 [6] license, but the client APIs developed by 10gen are distributed under the Apache Software License 2.0 [5]. Client APIs are available for many popular programming languages including C++, Java and Python. MongoDB allows atomic updates of single documents. I used version 2.0.6 of MongoDB in my tests.

MongoDB clustering works in two dimensions that are managed separately. To increase throughput, data is distributed to multiple shards. Each shard is backed by either a single node or a cluster that facilitates replication. For this thesis, I ignore the sharding dimension, since the main focus is on high availability. In MongoDB terminology a replicating cluster is called a replica set. A replica set has one master and a number of slaves. By default all reads and writes go to the master, facilitating causal consistency and atomic updates. The slaves asynchronously pull updates from the master, but write operations support an option to specify that the write is not considered complete until a specific set of slaves has replicated the update.

MongoDB only offers READ UNCOMMITTED semantics cluster-wide for transactions. Writes done to the primary are visible to reads on the primary before they have been replicated to secondaries. In the extreme case this means that a client may read an object from the current primary, but the object may disappear if the primary immediately fails and the newly elected primary had not yet received it. There are also theoretical limitations on data size that stem from how MongoDB handles access to on-disk resources. The on-disk data representation is mapped into memory, so there must be enough address space to map all data.

This is not a practical limitation on systems with a 64-bit address bus, but 32-bit systems are typically limited to about two gigabytes of data.

MongoDB can be backed up by copying the data directory from a filesystem snapshot as long as MongoDB has journaling enabled. Only a single node of a replica set needs to be backed up. Additionally, MongoDB allows point-in-time backups even without filesystem snapshots with the mongodump utility.

MongoDB does not have a separate utility for cluster administration. Administrative tasks, such as adding and removing replicas from replica sets, are performed using special commands with the regular command line application. The configuration files on all nodes in a MongoDB replica set are identical.

23http://www.mongodb.org/

4.4.3 MySQL Cluster

MySQL Cluster24 is a high-availability database system that replaces the standard storage engines in MySQL with a cluster system that allows synchronous replication. Nowadays Oracle packages it separately from the standard MySQL software, so I also present it separately here. Oracle offers MySQL Cluster under the GPLv2 [2] license and various commercial licensing schemes. MySQL Cluster can be accessed using standard MySQL APIs, so its programming language support is good. MySQL Cluster also has a separate API for directly accessing the replicated storage system. I tested version 7.1.15a of MySQL Cluster.

MySQL Cluster offers synchronous replication and maintains a configurable number of replicas of the data. However, as I later noted in my failure handling tests, MySQL Cluster only ensures that all replicas of the data that are present stay synchronized, not that a commit cannot succeed without more than one replica having received the data. As this is against the requirement that a successful commit must be durable against the failure of a datacenter, I stopped evaluating MySQL Cluster after discovering this failure mode.

4.4.4 Zookeeper

Zookeeper25 is an open-source distributed coordination system. Zookeeper is distributed under the Apache Software License 2.0 [5]. Zookeeper has native APIs for Java and C, and wrappers for the C API exist for various languages such as Python and Ruby.

24http://www.mysql.com/products/cluster/ 25http://zookeeper.apache.org/

Zookeeper provides cluster-wide atomic updates, but by default they are not causally consistent, i.e. client A can perform an update and tell client B about it, and client B may not see the update without performing a sync() operation before the read. I used version 3.3.3 of Zookeeper in my tests.

A Zookeeper cluster has a leader that coordinates writes. The cluster remains operational as long as the leader can communicate with a majority of nodes. If the leader loses communication with a majority of nodes, due to for example a network partition, and some other partition has a majority of nodes but no leader, the partition with the majority elects a new leader and continues operation as a new cluster. When communications are later restored, the other members will join this new cluster. In the command and control system architecture this quorum requirement guarantees that after a successful commit, data is always stored on at least two nodes. Zookeeper only supports data objects up to 1 MiB in size. There is no hard limit for the size of a database, but it should fit within memory.

Backups of Zookeeper cluster data can be made by taking a block device-level snapshot on a single node and creating a copy of its data directory. By default, Zookeeper does not remove old transaction logs and data snapshots. It can be configured to do so, or an external process can be used instead. Zookeeper also includes a command line utility that can be used to check the current configuration, cluster state and access statistics. Nodes can only be added and removed by changing the configuration files. All nodes must have the same configuration file, so adding or removing a node requires a restart of the whole cluster. [9]

Chapter 5

Experiment Methodology

In this chapter I describe the test environment, the test workloads used and the measurements made. First I describe the test environment used for simulating a three-datacenter deployment of the tested databases on a single physical computer using standard virtualization technologies. Then I continue with a description of the test workloads and related measurements, and the database-specific changes made to them. Finally I describe the fault simulation methods and related measurements.

5.1 Test System

The physical test system is a standard desktop PC with a 4-core CPU and 8 GiB of physical memory. The system has a single SSD device for the operating system and the operative virtual machines. Virtual machine images that are not in use, and other data, are stored on a separate disk. The operating system on the physical test system is Debian 6.0.2 with kernel version 2.6.32-5-amd64. For virtualization, Virtualbox installed from the Debian package virtualbox-ose version 3.2.10-dfsg-1 is used.

The virtual machines used in the tests use the same operating system distribution as the host, including the same kernel version. The test systems are based on a shared virtual machine image, which is cloned and customized in an automated fashion using scripts with settings specific to the database system being tested and the role of the node in the test. The goal was to have the test systems be as identical to each other as possible, since in reality administering systems with identical configurations would be easier than administering systems with diverse configurations. Diversity of configurations complicates, for example, crash recovery procedures and normal configuration updates.

Each of the virtual machines has three network interfaces: eth0, eth1 and eth2.

Eth0 is connected to a virtual NAT adapter to allow Internet access from the virtualized operating system. Eth1 and eth2 are connected to TAP devices on the host operating system. The host system has a total of six TAP devices and two bridge devices that simulate switches. The TAP interfaces corresponding to the eth1 interfaces on the test systems are bridged together using bridge br0, and the TAP interfaces corresponding to eth2 using bridge br1 on the host system, as shown in Figure 5.1.

Figure 5.1: Test network setup. The figure shows the mapping between virtual machine interfaces and host TAP interfaces: test1 eth0 to tap1 (192.168.1.1) and eth1 to tap4 (192.168.2.1); test2 eth0 to tap2 (192.168.1.2) and eth1 to tap5 (192.168.2.2); test3 eth0 to tap3 (192.168.1.3) and eth1 to tap6 (192.168.2.3). On the host, bridge br0 has the address 192.168.1.100 and bridge br1 the address 192.168.2.100.

The test systems have two different internal network connections because both the external and the internal communications of the data centers must be simulated. The load-generating and measuring client programs are for the most part executed on the host operating system, and they must have low-latency simulated LAN access to one node and high-latency simulated WAN access to the other nodes. The server software running in the test systems only communicates via the high-latency simulated WAN links. On the test systems, eth0 is the high-latency simulated WAN link and eth1 the low-latency simulated LAN link, as shown in Figure 5.1.

I use netem [20] to generate latency on the simulated WAN links. Netem delays outgoing packets on a network interface for some time, with the possibility of varying the delay according to a configurable probability distribution. Netem can also be used to simulate packet loss, corruption and reordering. In my tests I have applied the Netem queuing rules on the host side to the TAP interfaces corresponding to the eth0 interfaces, so that the rules are applied every time a packet is sent from the host side of the TAP interface to the VM. This causes the latency set via Netem to be applied twice when sending a request and receiving a reply from VM to VM, so that the RTT latency between VMs is twice the Netem setting.

With Netem and bridge interfaces, it is also possible to simulate a network where the latencies between all nodes are not equal. However, the topologies that can be simulated are limited to a star in which each branch can have a specific latency. With three nodes this is not a limitation. Nevertheless, I did not perform tests with unequal latencies due to time constraints.

5.2 Test Programs

The test programs should perform database operations similar to those that the actual system would perform. The tested operations should also be such that they can be applied to all the databases tested. There appears to be one existing multi-database benchmarking tool, called the Yahoo Cloud Serving Benchmark [14]. It does not support all the databases in my evaluation, and its documentation does not indicate whether it could be parametrized, for example, to connect to specific hosts. I decided to write the test programs myself.

The test programs described in the following sections are designed to stress the database under two kinds of update load: read-modify-write and write-only. Pure read operations are not included in the throughput test, since the read load in the command and control application is expected to be sporadic: read-only queries to the database are only necessary when starting up an application node. The read-modify-write update test is applicable to the databases selected for full evaluation in Chapter 4. The write-only update test is applicable to those, and also to the databases selected for limited evaluation. However, only the databases selected for limited evaluation were tested with write-only update loads, due to time constraints. Before each test run, the databases are reset to a clean state with database-specific scripts.

5.2.1 Read-Modify-Write Update Test

The read-modify-write test increments a number of counters. The test runs for a certain period of time, during which each client increments the counters assigned to it in a random order. In the end, the client counts how many times it successfully managed to increment each counter, and writes this value for each counter. In addition, the client produces timestamped log messages upon every retried transaction and upon each successful commit. A separate verification program later verifies that the counters in the database match the increments recorded by the clients.

The test programs perform updates so that confirmation is received for each commit before the next update is started. The confirmation must imply that the write is visible to the whole cluster and stable in case of a crash immediately after completion of the write. These requirements imply that the read and write sets used in the test transactions must intersect, and that each write is necessarily delayed by at least one RTT. Thus it seems probable that latency will dominate write throughput.

The set of counters for each client is selected based on two parameters: offset and rows. Each client updates rows counters, starting at offset. Figure 5.2 shows the counter distribution with varying offset and rows values and a varying number of clients per database node. A few special cases are interesting. If rows = 1 and offset = 0, all clients will race to update the same counter repeatedly, producing many update conflicts. This produces the maximum number of conflicts. On the other hand, if rows = 1 and offset = 1, each client will have its own counter to update, and all the updates may proceed without conflicts.

With some additional assumptions, it is possible to simulate the conflict rate that a specific configuration should produce. The assumption in my projections was that clients proceed in lock-step, and clients that pick the same counter are in conflict. It is easy to determine, for example, configurations that result in 50 % of clients participating in conflicts and 25 % of commits failing due to conflict (assuming that one of the conflicting transactions succeeds and the others fail). See Figures 5.2c and 5.2d for graphic illustrations and Figure 5.3 for an illustration of the lock-step simulation.
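The sketch below shows one way to read the offset/rows parametrization together with a Monte Carlo version of the lock-step conflict simulation described above. The exact assignment formula used by the thesis test programs is not spelled out, so counters_for_client is only an interpretation that reproduces the special cases listed (rows = 1, offset = 0 shares one counter between all clients; rows = 1, offset = 1 gives every client its own counter).

    # Counter assignment and a Monte Carlo version of the lock-step conflict
    # simulation; the assignment formula is an assumption, not the thesis code.
    import random
    from collections import Counter

    def counters_for_client(client_index, rows, offset):
        start = client_index * offset
        return list(range(start, start + rows))

    def simulate_failed_commit_rate(n_clients, rows, offset, rounds=10000):
        """Fraction of commits failing due to conflicts, assuming clients proceed
        in lock step and exactly one client in each conflicting group succeeds."""
        failed = 0
        for _ in range(rounds):
            picks = [random.choice(counters_for_client(i, rows, offset))
                     for i in range(n_clients)]
            counts = Counter(picks)
            failed += sum(c - 1 for c in counts.values() if c > 1)
        return failed / float(rounds * n_clients)

    # Under this model, rows = 2 and offset = 1 gives roughly 25 % failed commits,
    # i.e. roughly 50 % of clients participating in conflicts.
    print(simulate_failed_commit_rate(18, rows=2, offset=1))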

Figure 5.2: Workload distribution examples. The four panels show the counters assigned to each node: (a) counters for each node in the maximum conflict rate configuration, (b) counters for each node in the zero conflict rate configuration, (c) counters for each node in the 50 % conflict rate configuration, and (d) counters for each node in the 25 % conflict rate configuration.

With the parametrization system used, it is not possible to achieve an arbitrarily high conflict rate with varying numbers of clients. If the client count is kept constant, it is possible to achieve an arbitrary conflict rate by setting offset = 0 and simply adjusting rows to achieve the desired conflict rate. Note that this assumes a sufficiently high client count so that the quantization effects of the low row counts required for low conflict rates do not dominate the result. An easier way to achieve an arbitrary conflict rate would be to assign a specific counter distribution to each client. This could be achieved for example by specifying a list of counters instead of a number of rows and an offset, so that each counter could be present multiple times in the list. The resulting discrete distribution could then simulate any desired distribution, and thus any conflict rate could be achieved for an arbitrary number of clients.

Four conflict settings were evaluated. The selected settings were as many conflicts as possible (rows = 1, offset = 0), no conflicts at all (rows = 1, offset = 1), 50 % modeled conflict rate (rows = 2, offset = 1) and 25 % modeled conflict rate (rows = 4, offset = 2). As noted above, a more comprehensive selection of parameters would have required a more flexible parameter assignment model. Results for rates such as 90 % and 75 % would have been especially interesting to see. Notes on the effect of the conflict rate on results are in Chapter 6.

In addition to the program parameters above, I also varied the number of clients connecting to the database and the delay between database nodes in the automated update throughput tests. Several client counts between 1 client per database node and 30 clients per database node were tested. In total, the number of parameter combinations threatened to grow so large that, combined with each test run lasting over a minute to ensure a large number of samples, the final test run on all four databases lasted over 26 hours. RTT delays from 0 milliseconds to 100 milliseconds were tested at 50 millisecond intervals. The client counts tested were 1, 2, 3, 5, 10, 15, 20, 25 and 30 clients per database node. In total I tested 396 parameter sets for each database.

Figure 5.3: Conflict simulation examples. Panel (a) shows the lock-step simulation with a 50 % conflict rate (rows = 2, offset = 1) and panel (b) the simulation with a 25 % conflict rate, listing for each node the counters available in each simulation round and the resulting number of conflicts.

5.2.2 Write-Only Update Test

The write-only update test simply updates data in the database as fast as possible from each client. All clients update different entities, so no conflicts occur. No read is done before an update. Each client counts its successful commits similarly to the read-modify-write update test. The test runs for a certain period of time, after which the update counts and other statistics are written to log files for post-processing. A separate verification program can later verify that as many updates were performed as the clients claim. The test program is not parametrized, but, as in the read-modify-write update test, the latency and client count vary. The latencies and client counts tested are the same as for the read-modify-write update test.

The test programs perform writes so that confirmation is received for each commit. The confirmation must imply that the write is visible to the whole cluster and stable in case of a crash immediately after completion of the write. Depending on the database system used, this may require waiting for the replication of each commit to a quorum of the cluster nodes before starting the next write; however, some databases also offer asynchronous interfaces where further writes can be issued even before previous writes have committed. When possible, such asynchronous interfaces are used. However, they must preserve the order of writes from each client to be useful in the logging use case mentioned in Chapter 4.

5.2.3 Test Programs in Fault Tests

The test programs used in the failure handling tests are derivations of the update test programs described before. However, the programs were modified so that they only apply a limited load on the database, to ease interpretation of the results and to reduce the volume of data produced. The limitation was implemented by adding a delay of 100 ± 50 ms after each database operation. For databases with update test programs, read-only test programs were also implemented, since the read and write accessibility characteristics of the databases differ during failures. The read test programs simply read data and throw it away. No correctness checks are done on the read results.

5.2.4 Database-Specific Adaptations of Test Programs

All test programs accept the same arguments and perform the same essential functions, but they differ internally in how the test specification is mapped to database functionality.

Galera

Galera tests use oursql version 0.9.3.1 to access the database over TCP/IP sockets. The counter data is represented in Galera/MySQL as a single table with two columns, id and value, created with CREATE TABLE test (id int primary key, val int) ENGINE innodb. Oursql does not offer the ability to connect to any available node in the cluster; I used haproxy to achieve this in the fault simulation tests.

The update test performs a BEGIN, SELECT FOR UPDATE, UPDATE, COMMIT sequence for the selected counter. The counter access order is randomized by shuffling the list of counters to be used with random.shuffle from the Python standard library and then repeatedly looping over this list. The Galera update test interprets MySQL result codes 1205 and 1213 as conflicts that warrant a retry, and all other failure-indicating result codes as failures that cause the test to be aborted in the throughput test and a reconnection attempt in the failure test. The read test for Galera simply performs BEGIN, SELECT, COMMIT sequences repeatedly over a sequence of counters. The test checks that the result is of the correct form, but not that the value is sensible.
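A sketch of the transaction sequence described above is shown below. It is not the actual test program; the table layout follows the CREATE TABLE statement above, but the exception handling is only indicative, since how oursql exposes the MySQL error code (1205 for a lock wait timeout, 1213 for a deadlock) is an assumption here.

    # Read-modify-write increment with SELECT FOR UPDATE, retried on conflict.
    import oursql

    def increment_counter(conn, counter_id):
        while True:
            curs = conn.cursor()
            try:
                curs.execute('BEGIN')
                curs.execute('SELECT val FROM test WHERE id = ? FOR UPDATE', (counter_id,))
                (value,) = curs.fetchone()
                curs.execute('UPDATE test SET val = ? WHERE id = ?', (value + 1, counter_id))
                curs.execute('COMMIT')
                return value + 1
            except oursql.Error as exc:
                if getattr(exc, 'errno', None) in (1205, 1213):   # lock wait timeout / deadlock
                    curs.execute('ROLLBACK')
                    continue                                      # conflict: retry the transaction
                raise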

Zookeeper

Zookeeper tests use the zkpython Python module version 0.4.2 to access the database. The data is represented as Zookeeper nodes so that each node is simply named with the counter identifier and its data is a string representation of the value. For example, counter 7 would reside at /7, and its value 123 would be represented as the string 123.

The Zookeeper read-modify-write update test randomizes the order of counter values the same way as the Galera tests. The operation sequence in the update test is similar in principle to that of Galera: first a zookeeper.get to read the current value of the counter, and then a conditional zookeeper.set to update the value. The read-only Zookeeper test program only fetches the current value and does not perform any checks on it. The variants of the update test used in the throughput and failure clients differ, as with the other databases, in that the failure test induces a delay after each set operation. In addition, the error handling is different. The throughput test variant is only designed to cope with zookeeper.BadVersionException, which indicates a conflict upon update and is handled by retrying the update. The failure test client, on the other hand, copes with any error resulting from any database operation by closing and reopening the database connection and retrying with the new connection.
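The zookeeper.get / zookeeper.set sequence described above can be sketched as follows. This is not the thesis test program; the connection string is a placeholder, and the node naming follows the convention described above.

    # Read-modify-write increment via Zookeeper's conditional set; retried when
    # the node version changed between the get and the set.
    import zookeeper

    handle = zookeeper.init('10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181')

    def increment_counter(counter_id):
        path = '/%d' % counter_id
        while True:
            data, stat = zookeeper.get(handle, path)
            new_value = str(int(data) + 1)
            try:
                zookeeper.set(handle, path, new_value, stat['version'])
                return int(new_value)
            except zookeeper.BadVersionException:
                continue   # another client updated the counter first; retry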

MongoDB

MongoDB tests use the pymongo Python module version 2.2 to access MongoDB. The counter data is represented in MongoDB as BSON documents with fields id and val; the field id is also used by MongoDB as its internal object identifier. Both fields are integers.

The read-modify-write update test in MongoDB randomizes the order of counter values the same way as the Zookeeper and Galera tests. The actual update operation first performs r_collection.find_one and then w_collection.update with the old document as the update key and with w=2 to ensure that the write is durable upon a crash of the current master. The read-only test program only performs the read operation and does not check the value returned.

The throughput-oriented update test handles failures that cause exceptions by aborting the test. Update conflicts are detected by inspecting the updatedExisting field of the update result: if updatedExisting is false, the test retries the update. The failure test is different in that it handles exceptions from the database module by logging a failure and retrying the operation. One notable difference from the other test programs is that the MongoDB tests maintain two connections to the server, one for reading and another for writing. This is because MongoDB only supports writes to the master node of the cluster, and because only values read from slaves are guaranteed to be stable in case the master node crashes.
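The find_one / update sequence described above can be sketched with the pymongo 2.x API roughly as follows. This is not the actual test program; the connection parameters and database name are placeholders, and w=2 makes the driver wait until the write has reached two members of the replica set.

    # Read-modify-write increment in MongoDB using the old document as the
    # update selector, so that a concurrent update makes the write a no-op.
    import pymongo

    connection = pymongo.Connection('10.0.0.1', 27017)
    collection = connection.test.counters

    def increment_counter(counter_id):
        while True:
            old = collection.find_one({'_id': counter_id})
            new = {'_id': counter_id, 'val': old['val'] + 1}
            result = collection.update(old, new, w=2)      # returns the getLastError document
            if result and result.get('updatedExisting'):
                return new['val']
            # updatedExisting is False when another client changed the document first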

Cassandra

Cassandra tests use the pycassa Python module version 1.7.0 to access Cassandra. The test data is represented as a single column family with the columns name and value; the first is a string and the second a long integer. The name column is not actually used in the tests; rather, the tests use the name of the counter as the row key and only use the value column to hold the counter value.

The write-only test simply performs columnfamily.insert operations for the given keys, incrementing the to-be-inserted value after each successful insert. The throughput test repeats these operations as quickly as possible and retries the insert indefinitely upon all errors. The failure test client is otherwise equivalent but adds a delay after each database operation. The Cassandra throughput test code does not expect any failures; it simply aborts the test upon failure. The failure test code simply retries upon failure.

Riak

Riak tests use commit c4affa40f2828324fe5af2d69757c32a6abf0854 of a custom fork of the official Riak Python module that can be found at https://github.com/tazle/riak-python-client. The fork fixes some timeout issues. The data is stored in Riak using the counter identifier as the key for a JSON object that contains the counter value in the field val. The Riak library does not offer the ability to connect to any node of the cluster; I used haproxy to enable connections to any node that is up.

The write-only test repeatedly increments the values of the given counters. The client identifier is included in the counter name, so no conflicts occur. The Riak-level operations performed are obj = bucket.new, obj.set_value and obj.store. The failure test client is otherwise equivalent but adds a delay after each store operation. Note that no read of the existing value is performed before the store. The Riak throughput test code handles failures by aborting the test. The failure test handles all failures by retrying the operation.

5.3 Fault Simulation

I did three kinds of fault simulations in the tests. The simulations correspond to the faults analyzed in Chapter 3. In the fault simulation runs, the test programs simulating clients are started first, and the commands that introduce and remove faults are then run manually through a framework that logs the commands together with timestamps, so that the commands run can be shown in the result visualizations.

5.3.1 Software Fault

I simulated a fault of the database software by abruptly killing the database server processes on one cluster node. For databases that have multiple processes in a hierarchy, I also tested killing just the child process to see whether the database has watchdog functionality that would restore the child process automatically. To restore a database node brought down by killing the server process or processes, I used the same commands that are used to start the server process normally.

5.3.2 Network Fault

I simulated a network fault by causing 100% packet loss on the simulated WAN interface of a single cluster node. The observable effects of this failure mode differ from the software fault in that in the software fault case the operating system terminates the TCP connections of the database server processes, whereas in the network fault case the connections are left open. Thus the remote ends of the connections must use timeouts to detect the fault. To restore a database node from the network fault, packets are again permitted to flow through the simulated WAN interface with no packet loss. Packet loss is controlled using netem with the loss parameter on the host-side tap interface corresponding to the WAN interface of the virtual machine.
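The fault insertion itself is a small tc/netem invocation; a hedged sketch of how it could be toggled from Python is shown below. The tap interface name is an assumption, and the sketch assumes no other qdisc is installed on that interface (in the actual setup the latency emulation also uses netem, so the real commands may differ).

import subprocess

TAP_IF = "tap-tc1-wan"   # illustrative name for the host-side tap interface of one node

def kill_net():
    """Introduce the network fault: drop all packets on the simulated WAN link."""
    subprocess.check_call(
        ["sudo", "tc", "qdisc", "add", "dev", TAP_IF, "root", "netem", "loss", "100%"])

def restore_net():
    """Remove the fault: delete the netem qdisc so packets flow normally again."""
    subprocess.check_call(["sudo", "tc", "qdisc", "del", "dev", TAP_IF, "root"])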

5.3.3 Hardware and System Software Fault

I simulated hardware faults by abruptly terminating one or more of the virtual machines running cluster nodes. The immediate observable effect of this failure mode may be the same as with the network fault, but it may also enable quicker fault detection through transmit failures at the link level. I did not investigate the exact observable effect in the test setup. A bigger difference from the network fault case is that, as in the process termination case, any in-memory state of the terminated database node is lost. To restore a database node from a simulated hardware fault, the virtual machine is simply restarted. Some of the database installations used in the tests additionally require that the database service is started separately. I used ping to monitor the virtual machine startup process. If a separate database startup command was necessary, I issued it a few seconds after the virtual machine had started replying to ping messages.
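The restore step can be scripted in the same spirit; the following sketch, which is illustrative and not the actual framework, waits for the virtual machine to answer ping before a separate database start command would be issued.

import subprocess
import time

def wait_for_ping(host, settle_seconds=5):
    """Poll with single pings until `host` answers, then wait a few seconds more
    before any separate database startup command is issued."""
    while subprocess.call(["ping", "-c", "1", "-W", "1", host]) != 0:
        time.sleep(1)
    time.sleep(settle_seconds)

# wait_for_ping("tc1")   # afterwards, run the database start command for tc1 if needed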

5.4 Test Runs

For throughput tests, the test runner script first runs an initialization script that ensures the database is in a known state before the test run. Then it launches as many client programs as the test requires. One third of the clients connect to each server, to simulate even load. The client programs terminate themselves after the test period is finished. Once they have all terminated, the test runner runs a verification script that ensures the state of the database matches the commits the clients recorded in their log files. Fault simulation tests are somewhat more complex. The test runner script runs a set of clients, one for each database node and one that connects to any database node that is available. For read-modify-write databases there are two sets of clients, one for read-only access and another for read-modify-write access. In addition to the clients run by the test runner script, there is a fault insertion script that is used manually to insert and remove faults by running specific commands.

Chapter 6

Experiment Results

In this chapter I present the results of the experiments that I performed while evaluating the databases listed in Chapter 4. As described in Chapter 5, the tests fall into two broad categories: throughput tests and fault tolerance tests. Within these categories I present the results for the databases selected for limited evaluation separately from the results for the databases selected for full-scale evaluation, because the tests performed were substantially different.

6.1 Throughput Results

I present the throughput test results mainly in the form of figures that show the effect of latency and client count on throughput, measured as commits per second, when the rest of the test configuration is held constant. A separate scale is used for each latency setting, but all results for a single latency share the same scale, so that comparing the effect of client count is easy. The original data consists of commit timestamps recorded when a database write operation is acknowledged to the client. The data is aggregated over time by binning the counts by second; only full seconds are included in the data set. In addition, three different aggregations over system components are shown: per client, per database cluster node, and total. The visualization includes the mean and median bin commit counts and an estimate of the distribution of bin commit counts. The mean value is presented using a solid line and a dot, the median using a dashed line, and the distribution estimate using a colored solid area.
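As a small illustration of the aggregation step, the per-second binning could be done along these lines; dropping the first and last, possibly partial, bins is my interpretation of "only full seconds are included", and the function name is illustrative.

from collections import Counter

def commits_per_second(commit_timestamps):
    """Bin commit timestamps (seconds since test start) into per-second counts,
    dropping the first and last bins, which may cover partial seconds."""
    bins = Counter(int(ts) for ts in commit_timestamps)
    if len(bins) <= 2:
        return []
    full_seconds = range(min(bins) + 1, max(bins))
    return [bins.get(s, 0) for s in full_seconds]

counts = commits_per_second([0.4, 1.1, 1.7, 2.3, 2.9, 3.5, 4.2, 5.0])
mean = sum(counts) / float(len(counts))   # mean bin commit count, as shown in the figures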


6.1.1 Read-Modify-Write Results

This section contains the results of the read-modify-write tests described in Chapter 5. The tests were run with various latency, client count, and conflict rate settings. The results for each conflict rate setting are shown in separate figures.

100% Conflict Rate

The test at 100% conflict rate shows how the database system behaves under unrealistically harsh load conditions. Any system that must apply non-commutative updates to a shared value over high-latency communication links is bound in throughput by the latency. It is unlikely that any real-world system intended for interactive use would have all clients constantly competing for access to a single piece of data. However, because the actual characteristics of the real-world workload are not known at this time, I decided to also include the most extreme possible workload in the tests.

[Plot: mean and distribution of commits per second per client at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure 6.1: Commits/second/client at 100% conflict rate

Per-client results at 100% conflict rate in Figure 6.1 show a multimodal commit rate distribution, indicating that either the commit rate varies over time or that different clients achieve significantly different commit rates. The multimodality is most pronounced in the Galera and MongoDB results but present to a lesser extent in the Zookeeper results as well.

[Plot: mean and distribution of total commits per second at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure 6.2: Commits/second total at 100% conflict rate

Results aggregating all commits in Figure 6.2 do not show the multimodality exhibited in Figure 6.1, indicating that the multimodality is not caused by variance of the commit rate over time. However, the figure does show that commit rates for MongoDB tend to vary a great deal more than those for Galera or Zookeeper at certain client count and latency combinations. I did not research the cause of this further. Per-node results in Figure 6.3 do show multimodality similar to the per-client results. It appears that, especially for MongoDB, only a single server can actually process writes at any one time. The plot in Figure 6.4 shows this well: most of the time, clients connected to nodes 1 and 2 are completely starved. As for general throughput, the total numbers in Figure 6.2 show that cluster-wide throughput mostly depends on latency, except for MongoDB. It would appear that the throughput of a MongoDB cluster as a whole is not bound by latency even if the throughput of a single client in such a cluster is, as can be seen in Figure 6.1.

[Plot: mean and distribution of commits per second per node at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure 6.3: Commits/second/node total at 100% conflict rate


Figure 6.4: MongoDB commit timeline at 40 ms and 30 clients

Cluster-wide throughput appears to be stable for the other database systems, as exhibited by the narrow commit rate distributions in Figure 6.2. The throughput behavior of MongoDB is caused by its architecture: only the master node of the cluster can process writes. The node shown in the results is not the node to which a write was made but rather the node from which data was read; all writes go to the master node. In the test network, one third of the clients is connected to the master node over a latency-free link, whereas the other two thirds connect via links that have latency. Further, the read-modify-write loop has a race condition in that all clients race to perform the same update, and since the clients with zero-latency access to the master node tend to win more races, those clients also perform more commits, which explains the actual results. Another factor that makes it easier for remote clients to starve in MongoDB is that it only supports READ UNCOMMITTED semantics for writes when the whole cluster is considered: a write made to the master is visible to reads made on the master before it has been replicated.

0% Conflict Rate

As with the 100% conflict rate, a 0% conflict rate is an unrealistic scenario, but it is certainly the best case. Without conflicts, all clients should be able to proceed independently of one another. Per-client throughput should be limited by latency, but throughput across the whole cluster could scale linearly with the number of clients.

[Plot: mean and distribution of total commits per second at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure 6.5: Commits/second total at 0% conflict rate

As can be seen in Figure 6.5, Zookeeper shows exactly the linear scaling that is expected. MongoDB also appears to scale roughly linearly with client count until it peaks at a certain client count, after which throughput drops. At lower client counts Galera appears to scale well, but at higher client counts, especially at higher latencies, its throughput distribution is fairly wide, even though its mean throughput is roughly equal to that of Zookeeper.

[Plot: mean and distribution of commits per second per node at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure 6.6: Commits/second/node at 0% conflict rate

Figure 6.6 shows per-node throughput numbers. At higher latencies both MongoDB and Zookeeper show multimodal throughput distributions. Since the total throughput distributions are not multimodal, this indicates differences in the throughput of individual nodes. Despite the throughput differences apparent from the multimodality, no client appears to be starved, as indicated by the per-client throughput numbers in Figure 6.7. The per-client data in Figure 6.7 shows the multimodality better than the per-node data in Figure 6.6: large differences in per-node throughput at different client counts hide the multimodality, but since throughput scales with the number of clients, the per-client data shows it well. Looking at the per-client data, Galera throughput drops considerably as the client count increases, while both MongoDB and Zookeeper retain fairly constant per-client throughput.

[Plot: mean and distribution of commits per second per client at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure 6.7: Commits/second/client at 0% conflict rate

25% and 50% Conflict Rate

I ran tests with settings that, with the model described in Chapter 5, produce 25% and 50% theoretical conflict rates. I did not measure the actual conflict rates with these settings. The throughput means and distributions at both 25% and 50% conflict rates are similar enough to the results at 0% that I omit them here. The figures are available in Appendix A.

Conclusions

Based on the results presented here, and considering that the number of transactions in the envisioned system is limited by real-world events, all of the evaluated databases appear to have good enough throughput to serve as a basis for an implementation of the system.

6.1.2 Write-Only Results

Write-only tests were only run for the databases selected for limited evaluation. The tests were executed as described in Chapter 5. The results are presented in Figures 6.8, 6.10, and 6.11.

The expected results are the same as for the read-modify-write test at 0% conflict rate: optimally, per-client throughput would depend only on latency and total throughput would scale linearly with the number of clients. Figure 6.8 shows that at the highest latencies per-client throughput indeed stays fairly constant. At lower latencies, with the tested parameter combinations, total system throughput improves up to a certain client count and then stays fairly constant, as can be seen in Figure 6.11.

[Plot: mean and distribution of commits per second per client at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Cassandra (C) and Riak (R)]

Figure 6.8: Commits/second/client in write-only test

Figure 6.8 also shows odd scaling at the 5 ms, 20 ms and 40 ms latencies. The scaling is caused by odd results in the Riak tests: one of the clients achieves results that should not be possible, as shown in Figure 6.9. It is clear that the client is not waiting for its own writes to replicate throughout the cluster before starting the next write, since that requires at least one RTT, or 40 ms. The maximum throughput should thus be on the order of 25 commits per second, but the problematic client achieves over 200 at times. I did not have time to figure out how exactly the Riak client in question was misbehaving, but this issue decreases my confidence in Riak as a whole. Looking at the total results in Figure 6.11, it appears that Riak allows higher throughput than Cassandra at low concurrency levels, but with more clients Cassandra eventually passes it.


Figure 6.9: Riak commits/second estimate for each client at 20 ms and 30 clients

It also appears that at the lower latencies, at which peak throughput is reached, the peak is essentially the same for each database regardless of latency. Per-node results in Figure 6.10 reveal that most of the distributions have only a single mode, showing that throughput does not depend on the server to which a client connects. At 20 ms and 30 clients the results show multiple modes for Riak, but this is caused by the problem described above. On the whole, per-node throughput behaves like total throughput. One interesting thing in the results is that, as seen in Figure 6.8, Cassandra per-client throughput increases when the client count increases from 3 to 9. I did not have time to research the cause of this either.

[Plot: mean and distribution of commits per second per node at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Cassandra (C) and Riak (R)]

Figure 6.10: Commits/second/node in write-only test

6.2 Fault Simulation Results

I present the fault simulation results by describing the test runs textually and with a figure representing a successful run. I also present all test runs with unexpected results and diagnoses of those results to the extent diagnosis was possible within the time limitations. Figures of test runs not presented here are available in Appendix B. The figures show the database operations of each client on timelines, plus the commands run through the fault insertion script above the timelines. A green dot on a timeline indicates a successful operation and a red dot a failure. The timeline labels indicate the type of operation the client performs, R for read and W for write, and which host the client connects to.

6.2.1 Process Fault

In the process fault test runs I killed and restarted the database server process as described in Chapter 5. The expected result is that clients connected directly to the affected database node fail and do not resume operations until the server process is restarted.

[Plot: mean and distribution of total commits per second at latencies 0, 5, 10, 20 and 40 ms and client counts 3, 9, 30, 60 and 90 for Cassandra (C) and Riak (R)]

Figure 6.11: Commits/second total in write-only test

Clients connecting to any node should either be unaffected or should quickly resume operations by connecting to another node. Additionally, the entire cluster may be unavailable for a short period of time after the process is killed and after it is restarted. All tested databases handle the process fault case as expected. As an example of a successful test run, Figure 6.12 shows how Galera handles the death of mysqld on one node. As can be seen in the figure, operations on all hosts stop for about five seconds when mysqld is killed on host tc1. Similarly, after it is restarted, there is a short service interruption when it rejoins the cluster and another interruption on host tc2, which was selected to copy the cluster state to tc1. Results for MongoDB are somewhat different from those of the other database systems because the write tests cannot be configured to use a specific database node. Instead, all the writing tests for MongoDB connect to whichever node is currently primary in the cluster. The effect is visible, for example, in Figure 6.13. Riak process fault tests also differ somewhat from the generic pattern, since every Riak database node has multiple processes: beam and run_erl.


Figure 6.12: Galera process fault simulation (Galera test 1)


Figure 6.13: MongoDB process fault simulation (MongoDB test 1)

In the first test, depicted in Figure 6.14, only beam is killed. I was originally under the impression that run_erl would act as a watchdog and automatically restart it, but apparently this is not the case. In the second test both run_erl and beam are killed and the service is restarted normally, that is, the test is the same as for the other databases except that multiple processes are terminated.


Figure 6.14: Riak process fault simulation (Riak test 1)

6.2.2 Network Fault

In the network fault test runs I disconnect one of the nodes from the others by preventing its network communications, as described in more detail in Chapter 5. The expected result is that clients connected directly to the disconnected node fail and do not resume successful operations until the affected node is reconnected to the network. Clients connecting to any database node should either be unaffected or should quickly reconnect to another node and resume operation even while the affected node is disconnected. The cluster as a whole may be unavailable for a short while after the node is disconnected or reconnected. All tested databases except MongoDB behaved as expected. As an example of expected operation, Figure 6.15 shows how Galera performs when the network connectivity of host test1 is first removed and then restored.


Figure 6.15: Galera network fault simulation (Galera test 3)

I ran two separate tests on MongoDB because of its distinction between primary and secondary nodes. In the first test, depicted in Figure 6.16, node 3 was the primary. As is evident in the figure, the cluster did not reconfigure to allow writes during the outage, which lasted over 10 seconds. I did not investigate why this is. In the second test, depicted in Figure 6.17, node 2 was a secondary. In this case the cluster behaved as expected, only causing reads from node 2 to fail until it was restored. This may be related to MongoDB bug SERVER-6300, which is discussed further in the next section.

6.2.3 Hardware and System Software Fault

In the hardware and system software fault test runs I terminate one of the virtual machines participating in the database cluster, as described in more detail in Chapter 5. The expected result is that clients connected directly to the disconnected node fail and do not resume successful operations until the affected node is reconnected to the network. Clients connecting to any database node should either be unaffected or should quickly reconnect to another node and resume operation even while the affected node is disconnected.


Figure 6.16: Mongo network fault simulation, primary failure (MongoDB test 3)


Figure 6.17: Mongo network fault simulation, secondary failure (MongoDB test 4)

The cluster as a whole may be unavailable for a short while after the node is disconnected or reconnected. Again, all database systems except MongoDB behave as expected. As an example of expected operation, Figure 6.18 shows Zookeeper operations while node 1 is terminated, restarted, and finally Zookeeper is started on it. As expected, clients connected directly to node 1 only resume operations some time after Zookeeper is brought up on node 1. The read test client connecting to any node initially fails but resumes operations on a different node before node 1 is back up. MongoDB again fails to form a working cluster. However, this time the problem affects both the case in which the primary node is killed and the case in which a secondary node is killed. Figures 6.19 and 6.20 depict the test runs. I suspect these are instances of MongoDB bug SERVER-6300. The problem is that the timeouts MongoDB uses for the internal connections used for replication are very high considering my use case. This causes the replication subsystem not to catch up with cluster configuration changes, causing replication to fail for a long period of time after cluster configuration changes, if nothing terminates the connections.


Figure 6.18: Zookeeper hardware fault simulation (Zookeeper test 5)


Figure 6.19: MongoDB hardware fault simulation, primary failure (Mon- goDB test 5)


Figure 6.20: MongoDB hardware fault simulation, secondary failure (Mon- goDB test 6)

6.2.4 Dual Fault

The purpose of the dual fault tests is to ensure that the database systems really stop accepting writes upon a dual failure and recover from the dual fault situation without administrator assistance. In the test I cause two hardware faults one after another. The expected result is that after the second fault no write operations are performed until one of the failed nodes is brought back up, after which operations should resume on clients that are not connected directly to the node that is still down. Of the tested systems, only Cassandra and Zookeeper perform as expected. Riak continues accepting writes even though there is no quorum of physical nodes, and Galera and MongoDB fail to restore operations without administrative intervention.


Figure 6.21: Cassandra dual hardware fault simulation (Cassandra test 6)


Figure 6.22: Zookeeper dual hardware fault simulation (Zookeeper test 7)

Figures 6.21 and 6.22 show the expected results from Cassandra and Zookeeper. Both stop handling writes as expected and resume operations once a quorum of nodes is back up again. Figure 6.23 shows how Riak continues accepting writes even when quorum is lost. It is not clear whether this is working as designed or whether it is a bug. Figure 6.24 shows that MongoDB fails to recover from the faults. This is probably at least partially caused by SERVER-6300, as described in the previous section. Figure 6.25 shows how the Galera cluster fails to resume operation after a dual fault. This is by design: the Galera cluster does not know its members at startup.


Figure 6.23: Riak dual hardware fault simulation (Riak test 7)


Figure 6.24: MongoDB dual hardware fault simulation test (MongoDB test 7)


Figure 6.25: Galera dual hardware fault simulation test (Galera test 7)

Instead, nodes always connect to a running cluster, which must have a quorum of its known nodes online to accept new nodes. Existing nodes that were already part of the cluster can reconnect to clusters that do not have quorum. In Figure 6.25, at around 80 seconds and afterwards, nodes 1 and 2 are attempting to join the cluster that only has node 3 online. However, because the processes were restarted, they have lost their identity in the cluster: from the perspective of node 3, they are not the same nodes 1 and 2 that it is waiting to come back online. The correct way to solve the problem with Galera would be to restart one of the nodes (probably node 3) in bootstrap mode, so that it forms a new cluster with a single node and thus quorum, and to connect nodes 1 and 2 to it. Figure 6.26 also shows that if the dual fault is caused by a network fault instead of a hardware fault, the cluster is able to reform after the faults are resolved.


Figure 6.26: Galera dual network fault simulation (Galera test 8)

Chapter 7

Comparison of Evaluated Systems

In this chapter I compare the evaluated systems based on their test results, as described in Chapter 6, and their other features, as described in Chapter 4. In short, all the systems I tested offer adequate throughput for the command and control application, but performance in the fault tolerance tests brings out differences between the systems.

7.1 Full-Scale Evaluation

As is, only Zookeeper performs as expected in the dual fault simulation tests in Chapter 6. The problems Galera has in the dual fault test are design issues that appear unlikely to be solved in the near future. The issues of MongoDB are more likely to be solved, since only a timeout value needs to be changed, but based on the discussion in the comments of the SERVER-6300 issue (https://jira.mongodb.org/browse/SERVER-6300) in the MongoDB issue tracker, it is not clear whether it will be possible to configure the timeouts suitably for my use case. In addition to the problems in the fault simulation tests, MongoDB also provides only READ UNCOMMITTED semantics for transactions by default. This can be worked around by always performing reads from secondary nodes, but in that case causal consistency is lost. As noted in earlier chapters, both READ UNCOMMITTED semantics and loss of causal consistency lead to unnecessary conflicts. Because of these interface issues and its handling of faults, MongoDB compares unfavorably to Zookeeper. Besides the design problem with dual fault tolerance, Galera also suffers from poor throughput. The cause of the poor throughput is unclear, and it may be a bug in the Galera version used in the final tests, since previous versions I tested produced much better throughput numbers. However, the throughput would be adequate for the command and control system, since the number of users will be limited. Also, the previous versions tested suffered from additional problems in the fault simulation tests.


Both MongoDB and Galera allow storing more data than Zookeeper, which limits practical database size to RAM capacity and the size of individual data items to 1 MiB. They both also offer better query capabilities than Zookeeper, which only allows access by element path. Both MongoDB and Galera allow complex queries against a single table or document collection, and Galera additionally allows all the same SQL constructs that MySQL allows. All of the databases can be backed up using filesystem-level snapshots. MongoDB also comes with a built-in utility that can perform consistent backups without stopping operations and without filesystem-level snapshots. Of the tested systems, only Zookeeper requires stopping the system in order to add or remove nodes, although a failed node can be replaced online. With both Zookeeper and MongoDB, all nodes should have the same configuration. In contrast, Galera requires the initial node to have a different configuration. In addition, if the configuration of the initial node is not changed after it has been started, it cannot rejoin the cluster after being restarted. Instead, it forms a new cluster alone and even accepts queries. This complicates deployment and automatic management of Galera clusters. Based on these tradeoffs, I would choose Zookeeper as the database system for the command and control system presented in Chapter 3. If I expected to store larger quantities of data, I might choose MongoDB instead, despite its interface issues. Galera would be an easy choice when adding always-on high-availability features to an existing system that already interfaces with an SQL database.

7.2 Limited Evaluation

It appears that Riak does not really fulfill the requirement of not accepting writes without quorum, despite claims to the contrary in its documentation. Riak also requires rather obscure configuration options before it can be run with any data safety at all in a three-node configuration. In my tests Cassandra offers good throughput and handles faults without a hitch. The downside for Cassandra is that its data model is somewhat complex, requiring some care in the design of the database representation for domain objects. Its configuration options are also complex, but for a simple system the defaults are mostly suitable. If atomic updates are not a requirement, it would be an easy choice.

Chapter 8

Conclusions

The purpose of my thesis was to evaluate existing open-source database systems as a basis for building a highly available distributed command and control system. The evaluation as a whole was successful in that a suitable database system was found, and necessary in that there appear to be remarkable differences in the fault tolerance characteristics of the evaluated database systems. First I presented existing work on the subjects of high availability and fault tolerance and a summary of the system architecture from which the requirements for the database system were derived. Then I selected six open-source database systems for evaluation, noting also why the systems not selected were disqualified. Finally, I tested the database systems selected for evaluation using both throughput metrics and fault scenarios derived from a fault tolerance analysis of the system architecture. I used the results of these tests as the basis for a comparison of the databases. It appears that there are open-source database systems that fulfill the requirements set forth in the analyses of Chapter 3. All the databases I tested had adequate throughput even at high latencies, and a few also did well in my fault tolerance tests. Especially Zookeeper and Cassandra proved very resilient, as presented in Chapters 6 and 7. Some of the others could, with a little work, be enhanced to fare better in the fault tolerance scenarios I tested. I reported the problems that came up in my tests to the respective projects, and some of the problems will be fixed in future versions of the systems in question. Based on the problems uncovered in my tests, it seems likely that other distributed fault-tolerant systems have similar hidden problems. Integrating failure tests similar to those I performed into the regular test suite of such a project would help uncover them.


Appendix A

Remaining throughput results

[Plots: mean and distribution of commits per second per client, per node, and in total at latencies 0 to 40 ms and client counts 3 to 90 for Galera (G), MongoDB (M) and Zookeeper (Z)]

Figure A.1: Commits/second/client at 25% conflict rate
Figure A.2: Commits/second/node at 25% conflict rate
Figure A.3: Commits/second total at 25% conflict rate
Figure A.4: Commits/second/client at 50% conflict rate
Figure A.5: Commits/second/node at 50% conflict rate
Figure A.6: Commits/second total at 50% conflict rate

Appendix B

Remaining fault test results

[Timeline plots of client operations during the remaining fault simulation runs; captions listed below]

Figure B.1: Galera process fault simulation test (Galera test 2)
Figure B.2: Galera network fault simulation test (Galera test 4)
Figure B.3: Galera hardware fault simulation (Galera test 5)
Figure B.4: Galera hardware fault simulation (Galera test 6)
Figure B.5: MongoDB process fault simulation (MongoDB test 2)
Figure B.6: Zookeeper process fault simulation (Zookeeper test 1)
Figure B.7: Zookeeper process fault simulation (Zookeeper test 2)
Figure B.8: Zookeeper network fault simulation (Zookeeper test 3)
Figure B.9: Zookeeper network fault simulation (Zookeeper test 4)
Figure B.10: Zookeeper hardware fault simulation (Zookeeper test 6)
Figure B.11: Riak process fault simulation (Riak test 2)
Figure B.12: Riak network fault simulation (Riak test 3)
Figure B.13: Riak network fault simulation (Riak test 4)
Figure B.14: Riak hardware fault simulation (Riak test 5)
Figure B.15: Riak hardware fault simulation (Riak test 6)
Figure B.16: Cassandra process fault simulation (Cassandra test 1)
Figure B.17: Cassandra network fault simulation (Cassandra test 2)
Figure B.18: Cassandra network fault simulation (Cassandra test 3)
Figure B.19: Cassandra hardware fault simulation (Cassandra test 4)
Figure B.20: Cassandra hardware fault simulation (Cassandra test 5)