Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Computer Engineering (Datateknik)

2018 | LIU-IDA/LITH-EX-A--18/008--SE

Object Migration in a Distributed, Heterogeneous SQL Network

Datamigrering i ett heterogent nätverk av SQL-databaser

Joakim Ericsson

Supervisor : Tomas Szabo Examiner : Olaf Hartig


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Joakim Ericsson

Abstract

There are many different database management systems (DBMSs) on the market today. They all have different strengths and weaknesses. What if all of these different DBMSs could be used together in a heterogeneous network? The purpose of this thesis is to explore ways of connecting the many different DBMSs together. This thesis will explore suitable architectures, features, and performance of such a network. This is all done in the context of Ericsson's wireless communication network. This has not been done in this context before, and a big part of the thesis is exploring whether it is even possible. The results of this thesis show that it is not possible to find a solution that can fulfill the requirements of such a network in this context.

Acknowledgments

Thanks to my family, who have encouraged and supported me through all my years of education. It has not always been easy but, looking back, it was entirely worth it.

This thesis marks a milestone in my formal education, but it is only the start of lifelong learning.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Listings
List of Acronyms

1 Introduction
   1.1 Motivation
   1.2 Old System
   1.3 Aim
   1.4 Initial System Description
       1.4.1 Requirements
           1.4.1.1 Heterogeneous
           1.4.1.2 SQL Database
           1.4.1.3 Data Migration
           1.4.1.4 Performance Requirements
   1.5 Research Questions
   1.6 Delimitations

2 Method
   2.1 Pre-study
   2.2 System Design and Implementation
   2.3 Evaluation

3 Theory
   3.1 Relational Database Management Systems and SQL
       3.1.1 SQL
       3.1.2 ACID
   3.2 OLAP versus OLTP
   3.3 Distributed Database Management Systems
   3.4 ODBC, JDBC, and OLE DB
   3.5 Microsoft Linked Servers
   3.6 Distributed Query Engines
       3.6.1 PrestoDB
       3.6.2 Apache Drill
   3.7 Multistore and Polystore Systems
       3.7.1 CloudMdsQL and CoherentPaaS
       3.7.2 BigDAWG
   3.8 Data Warehousing

4 Distributed Database
   4.1 Database Schema
   4.2 Distribution Schema

5 System Architecture
   5.1 Centralized Approach
   5.2 Distributed Approach
   5.3 Possible Architectures
       5.3.1 CellApp Architectures
           5.3.1.1 Architecture A
           5.3.1.2 Architecture B
           5.3.1.3 Architecture C
       5.3.2 Machine Learning Application
           5.3.2.1 Architecture D
           5.3.2.2 Architecture E

6 Results
   6.1 Performance Measurements
       6.1.1 Test Setup
           6.1.1.1 Mock Application
           6.1.1.2 Database Management Systems (DBMSs) Tested
           6.1.1.3 Test Parameters
       6.1.2 Result using ODBC
           6.1.2.1 ODBC Using a High-performance Machine
       6.1.3 Using JDBC
       6.1.4 Introducing a Network Delay
       6.1.5 Introducing a Distributed Query Engine
       6.1.6 Comparing Relational Database Management Systems (RDBMSs) to a non SQL (NoSQL) Alternative
   6.2 The Machine Learning Application
   6.3 Data Migration
   6.4 Final System

7 Discussion
   7.1 Literature Study
   7.2 Performance Measurements
   7.3 Uniform Structured Query Language (SQL) Syntax
   7.4 Data Migration
   7.5 Method
   7.6 Final System
   7.7 Future Work
   7.8 Societal and Ethical Considerations

8 Conclusion

Bibliography

A Protobuf Model

List of Figures

1.1 Data format
1.2 Initial sketch of the proposed system architecture
4.1 SQL database schema
4.2 States table distribution schema
4.3 Neighbors table distribution schema
5.1 Architecture A - Simple direct connection
5.2 Architecture B - Simple middleware
5.3 Architecture C - Simple
5.4 Architecture D - Middleware
5.5 Architecture E - Database access component
6.1 Test setup 1
6.2 MySQL Memory engine read and write operations
6.3 SQLite read and write operations
6.4 SQLite spike read and write operations
6.5 SQLite main memory read and write operations
6.6 Test setup 2
6.7 Test setup 3
6.8 Redis read and write operations
6.9 Proposed solution

List of Tables

6.1 Test parameters
6.2 Test machine
6.3 ODBC - Initial measurements of query read operations
6.4 Test machine 2
6.5 ODBC high-performance machine - Measurements of query read operations
6.6 JDBC - Initial measurements of query read operations
6.7 ODBC - Over network
6.8 Distributed query engines - Over network
6.9 Performance measurements non SQL (NoSQL)

List of Listings

3.1 PrestoDB distributed query example
3.2 Drill distributed query example
6.1 Mock CellApp pseudo code
6.2 Read request
6.3 Write request
6.4 GET operation
6.5 SET operation
6.6 SQL data migration
6.7 Delete functionality
A.1 Protobuf model

List of Acronyms

5G      5th generation mobile network.
ACID    atomicity, consistency, isolation, and durability.
ANSI    American National Standards Institute.
API     application programming interface.
DBMS    database management system.
DDBMS   distributed database management system.
ISTC    Intel Science and Technology Center for Big Data.
JDBC    Java database connectivity.
NoSQL   non SQL.
ODBC    open database connectivity.
OLAP    online analytical processing.
OLE DB  object linking and embedding, database.
OLTP    online transaction processing.
RDBMS   relational database management system.
RTT     round-trip time.
SQL     structured query language.

1 Introduction

The field of distributed systems is growing rapidly. An increasing number of systems are distributed in some way to enhance performance and/or stability, but at the same time the distribution often increases the complexity of the system.

1.1 Motivation

The 5th generation mobile network (5G) architecture is a highly distributed system with a lot of different requirements such as the need to route huge amounts of data in real-time. To keep this system up and running there is a need to store and share a lot of configuration data between the applications in the system. This configuration data is used to calculate handovers between cells. It is therefore important to maintain a common configuration throughout the entire network.

Distributed databases are also a growing field, with new techniques developing at a rapid pace. Having a large number of databases distributed over large geographical areas leads to interesting problems that need to be solved. There are two distinctly different architectural approaches: homogeneous and heterogeneous.

In a heterogeneous distributed database system, each site may run different database software. Even the operating system or hardware may differ. This has the advantage of making the entire system easier to expand, since it is compatible with a variety of configurations. This flexibility does, however, come with some disadvantages: communication between the different types of databases is not as straightforward as it would have been if the system were homogeneous. Data must be translated when it is exchanged between databases of different types. This increases the complexity of the system and can prove to be both a technical and an economical challenge. [1]

A homogeneous distributed database system consists of a network of similar machines running the same hardware and software. This is often simpler and less costly to implement, but at the same time, it puts more requirements on the system. [2]


The customer, in this case Ericsson, has a set of requirements that the system needs to conform to in order to be a useful replacement for their current system. To determine whether the system meets these requirements, some metrics must be used. These metrics represent an important part of the requirements. More specific information about the requirements and the metrics that will be used to measure the system can be found below in the method and theory chapters.

1.2 Old System

Ericsson has thousands of base stations that currently hold a lot of configuration data using in-memory storage. This data is used to keep track of neighboring cell towers, optimize handovers, and store application state. For simplicity, these functions will be combined into a fictitious application called "CellApp" in this thesis. The data that should be stored for each CellApp uses the structure illustrated in Figure 1.1.

Figure 1.1: Data format

The attributes in Figure 1.1 are key-value pairs where the keys are strings and the values are of simple primitive types like integers, booleans, and short strings. Each neighbor entry in the list consists of a small number (usually ten to fifty) of key-value attributes representing the state of one of the neighboring base stations. This data about neighbors is used for handover calculations.

1.3 Aim

This master's thesis will implement and evaluate ways of communicating with distributed heterogeneous structured query language (SQL) database systems. The solutions will be compared, using some selected metrics, to similar implementations in a heterogeneously designed network. The metrics used to compare the systems will be selected by examining the existing solution and architecture of the system. It may not be feasible to implement this system using a heterogeneous architecture. In that case, different heterogeneous solutions will be evaluated against each other.

1.4 Initial System Description

The problem with the current way of storing the data, using in-memory storage, is that it is hard to change if you want to use another database or move the data to a remote database. This would require rewrites of the code, with specially written code for each of the different ways of storing data. This thesis aims to find solutions to this problem by moving towards a more abstracted approach to data storage for this application. This will enable data to more easily be stored using different database systems, both locally and remotely, which is not possible at the moment, all using a uniform syntax. Ericsson's thought is that SQL could be a suitable abstraction to solve this. Therefore, SQL DBMSs will be an important focus. An initial sketch of the architecture wanted for this system can be seen in Figure 1.2.

Figure 1.2: Initial sketch of the proposed system architecture

Another reason why this abstraction is wanted is the possibility for some machine learning application to read all the data in the entire network of databases and analyze it (see Figure 1.2). This application will then perform calculations to optimize handovers between CellApps and, based on these optimizations, make changes to configuration data in the heterogeneous database network. Such a change could, for example, be adding or removing neighbors for a specific CellApp. This machine learning program has significantly less strict performance requirements than the CellApp, because the machine learning algorithm will only be run during some maintenance period, typically once a day. The CellApp, however, needs to run continuously in real time.


It is possible that the data of a CellApp needs to be migrated to another database. One case for this could be that too many CellApps are running on the same piece of hardware and some of them need to be moved to another machine for performance reasons. In this case, the data will also need to be moved into a remote or cloud database. This migration should preferably not interfere with the availability of the data. Data also needs to be consistent between databases after a migration, so that data is never corrupted, only partially migrated, or, in the worst case, removed completely. The decision to start a migration of the data is made by human intervention.

Another aspect of the research is to check whether it is possible to move this data to a remote or cloud database, or whether this will add such a performance penalty that it becomes unusable.

It will often be the case that multiple CellApps use the same DBMS.

1.4.1 Requirements

In this section, the requirements of the system will be listed and explained. The requirements come from the customer, Ericsson, and are what this thesis will focus on. It is not certain that all the requirements below are feasible to implement together, but the goal of the project is to find and evaluate such a system. This section will also try to explain why these requirements exist and why they are important in the context of this project and the system in question.

The source of the information in this chapter is the customer, Ericsson, if nothing else is specified. The information was extracted during meetings with the technical supervisor of the thesis project and from that the requirements were formed.

The following requirements on the system exist. They are tightly coupled with the research questions listed in Section 1.5.

1.4.1.1 Heterogeneous Databases

The network is heterogeneous and distributed because the system will consist of many different machines that handle different loads, are located at different locations and, most importantly, run different DBMSs. This becomes even more complex when taking into account that not all of the DBMSs will have the same features or even run on the same system architecture. Some of the DBMSs could be running in the cloud with access to thousands of gigabytes of data storage and a lot of processing power and RAM. These computers may even be able to scale up and down depending on the demand for the service. Other parts of the system may not have such resources and instead offer the processing power and storage of an average desktop PC.

All of this increases the complexity of connecting to and using the network of databases through a uniform syntax.

1.4.1.2 SQL Database

One of the requirements is that the system should be an SQL database system. SQL will be used because SQL is considered, by Ericsson, to be a good abstraction for communication in a heterogeneous database system. From this requirement it follows that relational databases are a good choice, because they almost always use SQL. The data that is being stored in the system is relatively simple and will not use all the features that a relational SQL database provides. The data, as it looks today, is almost entirely key-value storage without any complex relations. It is also believed to be worthwhile to use SQL for future development, in case the stored data needs a more advanced structure in the future.


1.4.1.3 Data Migration

Data in this system should be able to be moved from one DBMS to another; this will be called a migration of data. Whenever such a migration takes place, the data should at the same time remain available to the applications. Hence, data migration should impose as little loss of availability as possible. The data must also remain consistent between the migrations, so that data is never corrupted or only partially migrated.
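As a rough conceptual sketch of what such a migration could look like (this is not the mechanism evaluated later in the thesis), assume a federation layer that exposes both DBMSs as SQL catalogs; the catalog names, schema, and CellApp id below are hypothetical:

    -- Copy the parent rows first so that foreign key constraints hold in the target.
    INSERT INTO target_dbms.celldb.states
        SELECT * FROM source_dbms.celldb.states WHERE CellAppId = 17;
    INSERT INTO target_dbms.celldb.neighbors
        SELECT * FROM source_dbms.celldb.neighbors WHERE CellAppId = 17;

    -- Delete from the source only after the copy has been verified (children first).
    -- Reads can keep being served from the source up to this point, which limits
    -- the loss of availability during the migration.
    DELETE FROM source_dbms.celldb.neighbors WHERE CellAppId = 17;
    DELETE FROM source_dbms.celldb.states WHERE CellAppId = 17;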

1.4.1.4 Performance Requirements

These are the performance requirements for the system. There are two types of applications that are part of the system: the CellApps and a machine learning application. These two applications have different data access patterns and different performance requirements.

• CellApp typical usage: queries data from 100 different neighbors per second, and issues only a few write operations every minute. Each of the neighbors being queried contains 20 attributes. One read query is issued every 10 ms and should not overrun that time.

• Machine learning application: Reads all the data periodically (daily) and makes updates to selected attributes. The time frame for this is hours.

To benchmark this system, mock applications with the same access patterns as the real system will be developed. Based on these mock applications, I will measure the time it takes for typical queries in the system to run. More specifically, the time is measured from when the query is executed until the queried data is returned to the application.
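As an illustration (not taken from the mock application itself), a typical CellApp read against the schema developed in Chapter 4 could look as follows; the real query reads all 20 attribute fields, only four are shown here:

    -- One read of a single neighbor's attributes. A query of this shape is
    -- issued every 10 ms and must return within that budget.
    SELECT Field1, Field2, Field3, Field4
    FROM neighbors
    WHERE CellAppId = 17 AND NeighborId = 42;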

1.5 Research Questions

The following questions will be considered in this master's thesis:

1. What are the important properties for a distributed network of heterogeneous SQL DBMSs in the given setting?

2. What frameworks and techniques can be used to communicate with a heterogeneous network of DBMSs using a uniform SQL syntax?

a) How do these frameworks and techniques compare to each other?

b) Is it feasible to build such a system, using the frameworks and techniques, that fulfills the functional and performance requirements of this project?

3. Will the system built, using the frameworks and techniques, enable data to migrate from one DBMS to another while keeping consistency and availability of the data?

1.6 Delimitations

The focus of the project will be to find already existing solutions, frameworks, and techniques to fulfill the requirements, not to design and implement such a system from scratch. Ericsson acknowledges that a system that achieves all the requirements could be built, but it would take a lot of time and resources to implement from scratch. They also believe that there are similar, already existing, solutions and techniques that could be used to solve the problem. Finding these existing solutions and techniques will be the focus of this master's thesis.


For the machine learning application, only solutions, frameworks, and techniques that can query data in place, at the different DBMSs, will be considered. This is opposed to copying data from the different DBMSs to a single location and then running queries on the collection of data. This delimitation was made to limit the scope of the thesis.

One delimitation is that the frameworks and techniques must be able to run on Linux, since this is the intended operating system in production.

Another delimitation made in this project is that only open source and other free-to-use projects will be considered. This is because it is, in most cases, a hassle to get trial subscriptions for paid products, while the open source projects can just be downloaded and run instantly. While these paid alternatives will not be tested in this thesis, some of them will be evaluated in the theory chapter and used to compare, conceptually, with the open source alternatives found.

2 Method

This chapter describes the method used to build a system consisting of a distributed database, which stores the data in the system, and a middleware that handles communication with the many DBMSs that the distributed database consists of.

To develop this system and find answers to the research questions, the first step was to extract the requirements for the system from the customer. The requirements are by design coupled with the research questions. After the requirements were collected, a preliminary design of the system architecture could be established. The requirements were then used to perform a pre-study of techniques believed to solve the problem. These techniques were then tested and benchmarked against the requirements in order to find the most suitable solution for this project. To benchmark the different techniques, they were first integrated to fit into the system. Following this, tests were made with synthetic data to measure the different metrics. Synthetic data was used because the system in question is not yet in production and no real data is available at this moment. The synthetic test data is based on data from a legacy system. Therefore, results and measurements made in this thesis can be expected to be similar to the results in production with real data.

The method chapter is divided into three parts. The first part is the pre-study, where a literature study was performed to find suitable techniques for the problem. The second part is the system design and implementation phase, where the techniques found in the pre-study were developed into possible architectures for the system and a test environment was set up to test these techniques with the different architectures. In the last part, an evaluation of the different techniques and architectures was done with the measurement data collected.

2.1 Pre-study

The pre-study assessed different alternative techniques and architectures to build the system during a literature study. The focus was on finding systems that satisfy the requirements stated in Section 1.4.1. To do this evaluation, techniques brought up in the theory chapter of this master's thesis were studied with the theory as a foundation. The techniques that, according to the theory, are believed to be able to fulfill all of the requirements were researched further and implemented in the design phase of the project. In the case that no system is found that can satisfy all the requirements, a system that accomplishes most of the requirements will be used.

The result of the pre-study is a theory chapter with techniques that could be used to build the distributed database and the system around it. Based on these techniques alternative architectures of the system could be developed. These architectures and techniques will be implemented and tested in the next phase of the method.

2.2 System Design and Implementation

In the design and implementation phase, the different techniques and architectures were implemented in a realistic environment. Because the final system will be massively distributed, a less complex version of the environment was used to run the tests and benchmarks in this project. The test environment tries to capture the most important features of the real environment, but it is possible that results from the test environment could differ from the real environment.

In this step the database schema for the databases was developed. All the databases in the system combine into a distributed database. A distribution schema for the entire distributed database was also developed.

2.3 Evaluation

Here, the data collected in the performance measurements and the data collected in the literature study were evaluated. The performance measurements were collected by creating a test bench for the CellApps to emulate real-world conditions. This test was then run in a realistic environment. The measurements from the test were collected and evaluated against each other.

The candidate for the final system architecture was chosen based on the measured performance and how well the architecture fit in accordance with the other requirements.

3 Theory

3.1 Relational Database Management Systems and SQL

A database is a collection of data. In a relational database management system (RDBMS), the data is stored as rows in tables. Data in an RDBMS can be connected through relations and constraints. It is up to the RDBMS to ensure that all the data adheres to the constraints. One example of such a constraint could be that all rows in a table must have a unique primary key. [3]
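A minimal sketch of such a constraint (the table and column names are illustrative, not the schema used in this thesis):

    CREATE TABLE states (
        CellAppId INTEGER PRIMARY KEY,   -- every row must have a unique key
        Field1    VARCHAR(100)
    );

    INSERT INTO states (CellAppId, Field1) VALUES (1, 'a');
    -- The RDBMS rejects this second insert because it violates the primary key:
    INSERT INTO states (CellAppId, Field1) VALUES (1, 'b');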

A traditional RDBMS needs to provide a lot of different features, many of which are supported in almost all DBMSs, both non-relational and relational. This chapter will highlight some techniques that are used in RDBMSs.

3.1.1 SQL

Structured Query Language, SQL for short, is a language used in RDBMSs. The language can describe the structure of the database and the data stored. SQL can also be used to query the database for data. A query is a request to read from the database, while an SQL statement is a request to either read from or write to the database. [4]

SQL is a standard published by the American National Standards Institute (ANSI). They progressively publish new additions to the SQL standard every couple of years. [5]

Even though many RDBMSs use SQL, they do not always follow the standard completely. Different RDBMSs often support different data types and other features. This means that a query that is supported in one RDBMS does not necessarily work in another RDBMS, although both should support roughly the same functionality. [4]
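One concrete illustration of this (a standard example, not from the referenced sources) is how different dialects limit the number of returned rows:

    -- MySQL, PostgreSQL, and SQLite:
    SELECT * FROM neighbors LIMIT 10;

    -- Microsoft SQL Server:
    SELECT TOP 10 * FROM neighbors;

    -- The SQL standard form, supported by e.g. PostgreSQL:
    SELECT * FROM neighbors ORDER BY NeighborId FETCH FIRST 10 ROWS ONLY;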

3.1.2 ACID

ACID is an acronym that stands for atomicity, consistency, isolation, and durability. These are properties of a transaction in an RDBMS. A transaction in SQL is one or multiple SQL statements that are always run in full: either all statements are executed completely, or none of the statements have any effect on the database. An RDBMS ensures that every transaction is run in accordance with the ACID properties. [6]

Most of the ACID concepts do not only concern database theory but are general concepts used in computer science. [7]

• Atomicity: This property guarantees that a transaction always is executed in full. It can never happen that a transaction partially updates the database. This includes unexpected situations where for example the DBMS crashes due to loss of power or crashes due to other failures or bugs.

• Consistency: This property makes sure that every transaction that is run in the database will result in a valid state of the database. To be in a valid state means that all data that is written follows the rules and constraints in the database.

• Isolation: Isolation is used to ensure database integrity when running transactions concurrently in the database. What this means is that the result of concurrent transactions should be able to be reproduced by the same transactions running in some sequential order.

• Durability: Durability means that a committed transaction will continue to be committed indefinitely, even if the program or the computer running the database crashes. To implement this, the DBMS must make sure to always save committed transactions to permanent storage. Durability is therefore not achieved if the data is stored only in regular RAM.
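As a small sketch of a transaction that these properties apply to (the exact syntax for starting a transaction varies slightly between DBMSs; the table and values are illustrative):

    BEGIN TRANSACTION;
    -- Both updates take effect together or not at all (atomicity), the result is
    -- as if concurrent transactions ran in some sequential order (isolation), and
    -- once committed the changes survive a crash (durability).
    UPDATE states    SET Field1 = 'ready' WHERE CellAppId = 17;
    UPDATE neighbors SET Field1 = 'ready' WHERE CellAppId = 17 AND NeighborId = 3;
    COMMIT;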

3.2 OLAP versus OLTP

There are two main classes of use cases when it comes to database systems: online analytical processing (OLAP) and online transaction processing (OLTP).

OLTP: An OLTP system is characterized as a system that performs small and fast transactions against a database. This means that queries need to be fast. Often, in an OLTP system, the data read or modified is directly facing a user or application. [8]

A typical example of an OLTP system is a database storing user login credentials for a website. The queries will, in this case, be exceedingly small and need to be fast. If a user wants to log in, a simple query to read the data for that user will be executed. If the user wants to change their password, a simple update statement will be executed. These queries are simple and need to be fast.
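A sketch of what those two statements could look like (the table and column names are hypothetical; the password would be hashed in practice):

    -- Log in: read the stored credentials for one user.
    SELECT password_hash FROM users WHERE username = 'alice';

    -- Change password: a single small update statement.
    UPDATE users SET password_hash = 'new-hash' WHERE username = 'alice';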

OLAP: An OLAP system, on the other hand, is, as the name suggests, more suitable for analytics. Queries in an OLAP system can be complex and require data from many sources, with different database schemas, to resolve. These queries can take a long time to execute compared to queries in an OLTP system. OLAP systems often query a lot of historical data, whereas OLTP systems often only query current data. Because queries here are used for analytics, these systems are often read-only by design. They also often do not face an end user the way OLTP systems do. [8]

An example of this kind of OLAP system could be a big collection of databases. Data from these databases will be aggregated and queried using complex and long-running functions. Based on this analysis some changes or decisions will then be made to optimize the system.


Most systems today support either OLTP or OLAP, not both. However, there has been some research into combining them into a system that can handle both, though this increases the complexity of the system. [9]

3.3 Distributed Database Management Systems

A distributed database management system (DDBMS) is a management system for a collection of distributed databases. Similar to how a DBMS manages a local database, the DDBMS manages a distributed collection of databases and hides the complexity of the distributed system from the user. [2]

Although the system developed in this thesis will not strictly be a DDBMS, it will be similar in many ways; therefore, some conceptual background on these systems is required.

Before a decision is made to implement a distributed database system for storage of data, there are several factors to consider. A distributed database system has both advantages and disadvantages compared to a regular centralized system. A distributed system will probably be more expensive in terms of hardware required, but runtime costs may benefit from the possibility to fine-tune each machine separately from the rest to achieve maximum performance. When the data is distributed between several locations it has both security and integrity related benefits. The protection of data is improved when the data is not located at the same site. [10]

A distributed system may be heterogeneous in a few different ways. The different components of a heterogeneous distributed system may differ in hardware, software, or communication protocols. Two different systems can have different data models which in turn have differences in data structures, constraints, and query language. The difference in structure is easier to deal with if the two representations have the same data content. If not, the difference in content may require significant work to make the systems compatible with each other. [11]

Data is fetched from a database by executing so-called queries. In a conventional, non-distributed database system, all data asked for in the query can be provided by the single database. This may not be the case in a distributed system. To retrieve all data a user is interested in, several queries may have to be executed, by the user, on different databases. The results of the queries then have to be combined, by the user, into a resulting dataset. To make the distributed system convenient to use for the end-user, a distributed query manager can be used. The query manager makes the distributed system behave like its non-distributed counterpart by taking a single query provided by the user and combining data from its respective sources to form a single resulting set of data. Creating an efficient distributed query manager might be a more or less difficult task, depending on the differences between the databases in the system. [12]
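To make the difference concrete (a sketch with hypothetical catalog names), without a query manager the user runs one query per site and merges the results by hand, whereas a distributed query manager accepts a single logical query:

    -- Without a distributed query manager: one query per database, combined manually.
    SELECT * FROM site1.celldb.states WHERE Field1 = 'x';
    SELECT * FROM site2.celldb.states WHERE Field1 = 'x';

    -- With a distributed query manager: one logical query; routing and combining
    -- of the partial results happen inside the query manager.
    SELECT * FROM states WHERE Field1 = 'x';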

When a transaction is executed in a distributed environment it could be the case that it writes data located in multiple databases at once. In this case, synchronization problems arise, because in a distributed system there are no native synchronization or atomic operations. This could lead to unwanted situations where data in the database becomes corrupt because of race conditions, or deadlocks could occur because of some partially updated data. There are some cases where this kind of synchronization is not needed, but in the cases in which it is, there are two properties that need to be taken into consideration. The first is local synchronization, which makes sure that concurrently running queries on the same database are synchronized and run in order. The other is global synchronization, which makes sure that the entire network of databases keeps a consistent state. The latter is harder to achieve and can add additional overhead. [12]

For any database system, there has to be a way of performing various administrative tasks. These include authorization of users and management of semantic integrity rules. In a heterogeneous and distributed database system, the method chosen to perform these tasks depends on the degree of centralization. When authorizing users, there can be an advantage in providing permissions from a centralized system. Thomas et al. [12] bring up the example of giving a user access to the average salary for employees at a company while denying access to the salary of individual employees, where the salaries and employees are stored in different databases in a distributed system. If the authorization of users is centralized, it is trivial to create a database view that captures the necessary queries and operations to obtain the average salary. The user can then be granted access to this view by the centralized authorization system. If one chooses to use a decentralized authorization system, no method exists to grant the user access to the average salary without them also being able to access information about the salary of individual employees. If the database management system of the distributed database system supports having semantic integrity rules for the stored data, this can be handled either centrally, through a global schema, or locally at every database. The global approach does, in this case, have a significant advantage due to the possibility to add constraints that depend on data stored in different databases. [12]
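The salary example could be expressed roughly as follows in SQL (a sketch; the schema and role names are hypothetical and not from Thomas et al.):

    -- A view that exposes only the aggregate, computed across two databases.
    CREATE VIEW average_salary AS
        SELECT AVG(s.amount) AS avg_salary
        FROM emp.employees e
        JOIN pay.salaries s ON s.employee_id = e.id;

    -- The user is granted access to the view but not to pay.salaries itself,
    -- so individual salaries remain hidden.
    GRANT SELECT ON average_salary TO analyst;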

3.4 ODBC, JDBC, and OLE DB

There are many different existing DBMSs, and they often use different interfaces. Open database connectivity (ODBC) aims to provide a uniform interface to the different DBMSs using SQL, acting as a kind of middleware.

There are three ways that ODBC tries to standardize this:

• Provide a uniform communication middleware: This means that it is possible to handle connections to different DBMSs in the same way.

• Data type standard: Provide a standard for which data types can be used and how they are mapped to the target DBMS.

• Provide an application programming interface (API): Used to execute queries with the same SQL syntax regardless of the underlying target DBMS.

To be able to use ODBC with a specific DBMS, a driver for that DBMS is required. This driver is often provided by the vendor of the DBMS. [13]

For the Java environment, there is Java database connectivity (JDBC), developed by Oracle. It aims to solve the same problem as ODBC and includes the same features. There are bridges that connect ODBC to JDBC and vice versa. This means that one can almost always be used instead of the other. [13]

Object linking and embedding, database (OLE DB) is, like ODBC, developed by Microsoft. The main difference is that while ODBC works with most DBMSs, OLE DB has support for connecting to many other kinds of sources, including files on disk and other non-DBMSs. OLE DB also contains functionality to connect to regular DBMSs by using ODBC as a middleware. [14]


3.5 Microsoft Linked Servers

Linked servers is a technology developed by Microsoft and shipped with Microsoft's SQL Server. Linked servers make it possible to connect to and access data from other data sources. Linked servers use ODBC and OLE DB as middleware for this connection, and therefore they can access data from a variety of sources. [15]

Using linked servers, it is possible to perform distributed queries, meaning that data from many heterogeneous sources can be queried in one query. As mentioned in Section 3.3, there are multiple problems when dealing with a distributed query compared to a local one. Microsoft solves this with what they call distributed transactions, which try to emulate the transparency of a query on a local database. These distributed transactions enforce the ACID properties, implemented using the two-phase commit principle, which adds some overhead. [16]
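A sketch of how this looks in Transact-SQL (the server, database, and DSN names are illustrative):

    -- Register a linked server backed by an ODBC data source through the
    -- OLE DB provider for ODBC (MSDASQL).
    EXEC sp_addlinkedserver
        @server     = N'REMOTE_CELLDB',
        @srvproduct = N'',
        @provider   = N'MSDASQL',
        @datasrc    = N'RemoteCellDbDsn';

    -- Distributed query using four-part naming: server.database.schema.table.
    SELECT r.CellAppId, r.Field1
    FROM REMOTE_CELLDB.celldb.dbo.neighbors AS r
    JOIN neighbors AS n ON n.CellAppId = r.CellAppId;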

This technology seems to be a good fit for the work in this thesis, but unfortunately it is Windows-exclusive software and therefore cannot be used in a Linux environment, which is a requirement for this thesis. [15]

3.6 Distributed Query Engines

A query engine is a component that can query data from a DBMS or some other data store. A query engine may also support making distributed queries and joining data over multiple heterogeneous sources. The query engine can itself be distributed (a distributed query engine), meaning that it can scale and run on multiple machines at the same time to enhance performance. [17]

3.6.1 PrestoDB

PrestoDB brands itself as a distributed query engine [18]. It is mainly designed for big data analytics and is able to query multiple heterogeneous databases using SQL. It does this by providing connectors to many of the most common DBMSs. [19] Among the supported connectors, those relevant to this project include:

• MySQL

• PostgreSQL

• Redis

and many more. [18] These connectors provide the PrestoDB core with the specific information needed for each DBMS.

PrestoDB is built in a distributed manner to enhance performance when analyzing large quantities of data. It is designed with one coordinator server that is the central component of PrestoDB. When a query is executed in PrestoDB, it first reaches the coordinator. The coordinator takes the query, divides it into tasks, and schedules the tasks onto the workers. The workers then handle the reading or writing of data to the databases using the connectors. [18]

PrestoDB was originally developed by Facebook to query the company's internal data from many different sources at once. Today PrestoDB is an open source project licensed under the Apache license. [18] PrestoDB is backed and used by a lot of big companies. An example is Airbnb, which develops a web user interface for PrestoDB named Airpal. [20]

PrestoDB is designed for analytics and can be considered an OLAP system. This means that when it comes to modifying data, it has some limitations on which SQL statements are supported. The supported SQL statements depend on the connector used but are mainly the same for all RDBMSs. For the MySQL connector, the INSERT statement is supported but neither the DELETE nor the UPDATE statement is. This limitation could cause problems if the system should be used to update any data. [18]

Illustrated in Listing 3.1 is a simple distributed analytics query. In this example, there are two databases that are queried using the same query. These databases are in this example called mysql2 and postgresql. As the names suggest, mysql2 is a MySQL DBMS while postgresql is a PostgreSQL DBMS. This means that this is not only an example of a distributed query, but also an example of querying different, heterogeneous, DBMSs. The test and public identifiers reference databases, and neighbors references a table. The query in Listing 3.1 will compare the two databases and find rows where field4 does not match between them.

Listing 3.1: PrestoDB distributed query example

    SELECT mysql.cellappid, post.cellappid,
           mysql.field4, post.field4
    FROM postgresql.public.neighbors AS post,
         mysql2.test.neighbors AS mysql
    WHERE mysql.field4 != post.field4
      AND mysql.cellappid = post.cellappid
      AND mysql.neighborid = post.neighborid;

3.6.2 Apache Drill

Apache Drill is, like PrestoDB, a distributed query engine. Drill has its roots in the Google project Dremel from 2010. The techniques from Dremel then became an Apache project under the name Drill. [17]

Drill supports a variety of data sources, including both RDBMSs and NoSQL. It supports full ANSI SQL. [17] It is built using storage plugins that handle connections to specific data stores. Drill is written in Java and uses JDBC to connect to RDBMSs. By using JDBC, Drill should be able to connect to most sources that support the interface. Actively supported and tested data sources relevant to this project include:

• MySQL

• PostgreSQL

Like PrestoDB, Drill is mainly designed for data analytics and, therefore, does not contain all the functionality for modifying and updating data that is available in ANSI SQL. Operations like INSERT, UPDATE or even DELETE are not yet supported. [21]

Illustrated in Listing 3.2 is a distributed query using Drill. The query does the same kind of analytics as the PrestoDB example in Listing 3.1. The queries for PrestoDB and Drill look similar, but there are some small differences. One example is that in Drill the JOIN statement must be stated explicitly, while in the PrestoDB example this was not needed since the join is implicit. Drill also did not support the != operator that PrestoDB did. Other than this, both queries gave the same result.

Listing 3.2: Drill distributed query example

    SELECT mysql.cellappid, post.cellappid,
           mysql.field4, post.field4
    FROM mysql2.test.neighbors AS mysql
    JOIN postgresql.public.neighbors AS post
      ON mysql.cellappid = post.cellappid
     AND mysql.neighborid = post.neighborid
     AND NOT (mysql.field4 = post.field4);

3.7 Multistore and Polystore Systems

According to the book Data Management and Analytics for Medicine and Healthcare by E. Begoli, F. Wang, and G. Luo [22], both multistore and polystore systems are defined as systems that combine heterogeneous DBMSs by using a single uniform interface or language, such as SQL.

Multistore and Polystore Systems are, similar to the distributed query engines, a way to query data from multiple heterogeneous data sources in a single query. [23]

3.7.1 CloudMdsQL and CoherentPaaS

CloudMdsQL is a data query language similar to SQL. The syntax of CloudMdsQL is comparable to SQL but contains some additional features. It is developed within CoherentPaaS, an EU-funded project that aims to solve the problem of querying heterogeneous data sources. [24] It achieves this by taking a different approach than the systems previously described. Instead of providing a uniform syntax for all data stores, it enables the user to write native queries for each different database. It also provides a way of combining these native queries into bigger distributed queries, using SQL-like syntax, stretching over many data sources. [23]
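To give a feeling for the approach, the sketch below is loosely based on published CloudMdsQL examples (the exact syntax and the data sources here are illustrative): each named table expression wraps one sub-query, either in SQL or as a native query for the target store, and the final SELECT combines them:

    T1(id int, name string)@rdb  = ( SELECT id, name FROM users )
    T2(id int, score int)@mongo  = {* db.scores.find() *}
    SELECT T1.name, T2.score
    FROM T1 JOIN T2 ON T1.id = T2.id;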

By using this technique, CloudMdsQL can query a wide range of data sources, including relational database systems, NoSQL systems, and even graph-based databases. Out of the box, CloudMdsQL supports one of each to prove the concept: Derby as a relational database, MongoDB as a NoSQL database, and Sparksee as a graph database. With some work, other databases can be added and used with the system. [23]

CloudMdsQL should at this time be seen as an interesting research project on the subject, but it is not yet a system that should be used in production, according to V. Giannakouris et al. [19]. Even if it is well documented how the language works, it is not well documented how to run the project, or which parts need to be extended in order to make it work with other databases than those provided with the project.

3.7.2 BigDAWG Much like CloudMdsQL, BigDAWG is mainly a research proof of concept and not production ready yet. [19] BigDAWG stands for Big Data Analytics Working Group, and is originally developed by Intel Science and Technology Center for Big Data (ISTC). BigDAWG is developed around an idea expressed by M. Storebraker and U. Cetintemel [25] as:


"No one size fits all"

This idea claims that there is no single DBMS that can be used in all circumstances. Instead, DBMSs need to be chosen based on the domain of the data.

BigDAWG is able to query data from multiple sources and different types of DBMSs. Like CloudMdsQL, BigDAWG comes with support for a number of different DBMSs to prove the concept. It also contains a sample dataset of public health records that can be used to try all the different functionalities. [26]

3.8 Data Warehousing

Data warehousing is a somewhat different approach than the ones mentioned previously. Data warehousing means that all the data from different sources is collected and unified into a single data store. This is mainly used in big corporations with big collections of data. The data may be spread out over different locations and even stored in heterogeneous databases. The idea of a data warehouse is to collect all this information in order to run analytics on the data. [27]

According to "The data warehouse toolkit: The complete guide to dimensional modeling" a book written by R. Kimball and M. Ross [27], there are several components relevant to a data warehouse system:

• Source systems: These are the original data sources. This is the part of the system that faces the users or applications. These systems are often of OLTP type and need to be able to resolve simple queries fast. They often contain only the current state of the system and little to no historical data.

• Staging area: This is the stage where data extracted from the source systems is transformed to fit into the database schema of the data warehouse.

• Presentation Area: In this step, the data is structured in a suitable way and stored in the data warehouse system. From here it can be analyzed using different queries and other tools.

Data warehousing does not fit directly into the work done in this thesis, since it requires data to first be copied into the data warehouse before being queried. This project is looking for solutions where the data can be queried in place, without being copied first.

4 Distributed Database

In this chapter, the developed schema for each database is illustrated. Also illustrated is the distribution schema for the entire distributed database. The distributed database consists of all of the databases in the system.

4.1 Database Schema

Illustrated in Figure 4.1 is the database schema used in all the databases in the system. The schema is created from the initial data class structure illustrated in Figure 1.1. The states table contains data about the application state of each CellApp. The neighbors table contains all the neighbors of the CellApps. A neighbor contains configuration and relation data about a neighboring CellApp. One CellApp can have many neighbors. In Figure 4.1, the bold fields with a key symbol next to them are primary keys. The line between the tables stands for a foreign key constraint: it shows that the CellAppId in the neighbors table is a foreign key connected to a single CellAppId in the states table, and that many rows in the neighbors table can be connected to a single CellAppId.

Figure 4.1: SQL database schema

One thing to consider when designing this schema is that not all of the DBMSs in this heterogeneous system support the same data types. An example of this is that MySQL with the Memory engine does not support the STRING data type. This data type was used in an initial design of the schema but had to be changed when this was discovered. Another similar example is that SQLite does not fully support the BOOL data type: it will accept a BOOL data type in the schema but will just convert it to a NUMERIC. This means that when inserting data into the SQLite DBMS, it does not recognize the symbols TRUE and FALSE; instead, 0 and 1 are used. This is also something that must be taken into account when designing an SQL statement that should work on many different DBMSs.
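As a minimal sketch (the field names and sizes are illustrative, and the real tables have 20 value fields each), the schema in Figure 4.1 could be written in portable SQL that respects the observations above:

    CREATE TABLE states (
        CellAppId INTEGER NOT NULL,
        Field1    VARCHAR(100),   -- VARCHAR instead of STRING (MySQL Memory engine)
        Field2    INTEGER,        -- 0/1 instead of BOOL (SQLite)
        PRIMARY KEY (CellAppId)
    );

    CREATE TABLE neighbors (
        CellAppId  INTEGER NOT NULL,
        NeighborId INTEGER NOT NULL,
        Field1     VARCHAR(100),
        PRIMARY KEY (CellAppId, NeighborId),
        FOREIGN KEY (CellAppId) REFERENCES states (CellAppId)
    );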

4.2 Distribution Schema

The many databases in the system together make up a distributed database. The structure of this distributed database is illustrated in Figure 4.2 and Figure 4.3. The data in the two tables in Figure 4.2 and Figure 4.3 is horizontally partitioned. These illustrations show all the data of the entire system. The illustrations also show how the data is connected to the different base stations and CellApps.

Figure 4.2: States table distribution schema

Figure 4.3: Neighbors table distribution schema

5 System Architecture

The problem studied in this chapter is finding an architecture for the network of heterogeneous databases that makes up the distributed database. First, different approaches for designing the architecture will be discussed. After that, more concrete architectures for this specific problem will be presented. The architectures presented are connected to the techniques found in the theory chapter, because the different techniques often promote the use of a certain architecture.

There are two distinct implementation approaches on which to build this architecture. One is a centralized approach, where the middleware consists of a single component that is connected to every database in the system. The other approach is that every CellApp in the system is connected directly to all the databases it needs to communicate with. Each of these approaches has different pros and cons.

The problem consists of two specific parts: the CellApps and the single machine learning application. These two applications have different access patterns and will be used for different purposes. One approach could be to use a single middleware for communication with both applications. This requires a more complex middleware, since it demands something that is good both for real-time queries (OLTP) and for complex, long-running analytics queries (OLAP). Because of the different characteristics of these two applications, it could be simpler to use two middlewares, one for each. These two approaches will be discussed more in depth in the following two sections, after which more concrete architectures will be illustrated.

5.1 Centralized Approach

A centralized approach means that there is a single component in the system that can connect to and execute queries against all databases in the system. In reality, this single component could itself consist of a cluster of components to improve performance. This is the case with both PrestoDB and Drill, mentioned in Chapter 3.


Having this centralized component makes it possible to execute distributed queries over multiple data sources at once. This is possible with both PrestoDB and Drill, mentioned in Chapter 3. The central component has a lot of knowledge about the system of databases and can often optimize queries to get good performance, especially for the complex analytics queries that these systems are designed for.

A positive thing about the centralized approach, related to the configuration needed to connect to the DBMSs, is that only one component needs to be configured to connect to all databases in the system. This way it is simple to add new databases to an already large system; all that is needed is to change the configuration of one single centralized component.

The downside of using a central component is that it will add extra overhead for each query compared to a direct connection to a DBMS. This overhead is because the query needs to go through a central component instead of directly to the source. This overhead could, for instance, come from added network data transfer or the additional query processing inside this central component. It could be the case that this overhead will add too much of a performance penalty to work with the real-time part of the problem (the CellApp).

Another downside, created by using a central component, is that all queries need to go through a single component. Hence, a single point of failure is introduced. If this single component for some reason does not work, the entire system will not be able to access data.

Another, more concrete, negative aspect of this approach is that most of the central systems found in Section 3.6 do not fully support update and write operations on the data. The systems are mainly designed for analytics and, while they offer some crude ways of doing bulk updates on data, they are not ideal for frequent small updates.

5.2 Distributed Approach

In the distributed approach there is no centralized component that constitutes the middleware. Here the middleware is a part of each of the CellApps in the system. Each CellApp will have its own functionality to connect to the DBMSs that it needs to communicate with.

What this means in practice is that each CellApp will need to be configured to connect to one or multiple DBMSs. If two different CellApps need to connect to the same DBMS, they both require configuration data on how to connect to that specific source. This may not be a big problem in a small system containing only a few DBMSs and a few CellApps, but this problem will grow as the size of the system increases. If, for example, there exists a high-performance DBMS that thousands of CellApps connect to, all the CellApps will need to be re-configured in order to connect them to another DBMS. Doing the same thing in the centralized approach requires only one component to change its configuration, and the change applies to all CellApps using that central component.

For small and quick transactions this distributed approach might be a good fit, since each application has a direct connection to the DBMS in question. Having a direct connection might be good for fast transactions that only need to read or update a few values. Also, since there is a direct connection to the DBMS, it should be possible to both read and write data quickly, without the limitations seen in the centralized approach. Both ODBC and JDBC could be used for this direct connection to the databases; they both support almost all the actions that can be performed by an RDBMS, as mentioned in Chapter 3.
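As a minimal, hedged sketch of what such a direct connection could look like in Python over ODBC (the DSN name 'celldb' and the example ids are assumptions for illustration; the table follows the test schema used later in Chapter 6):

import pyodbc  # Python bindings for ODBC

# Direct connection to a single DBMS; 'celldb' is an assumed DSN name
conn = pyodbc.connect('DSN=celldb', autocommit=True)
cur = conn.cursor()

# Fast point read: primary-key lookup of a single neighbor row
cur.execute(
    "SELECT * FROM neighbors WHERE CellAppId = ? AND NeighborId = ?",
    (17, 42))
row = cur.fetchone()

# Small in-place update, the write pattern used by the CellApp
cur.execute(
    "UPDATE neighbors SET Field4 = ? WHERE CellAppId = ? AND NeighborId = ?",
    ('test', 17, 42))

Because the connection goes straight to the DBMS, each call above is a single round trip, which is what makes this approach attractive for small, fast transactions.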

While this direct approach might be good for transactional performance, it does not really give a way to do distributed queries since the connection is only to one database at a time.

22 5.3. Possible Architectures

So, in the case of this system with the machine learning algorithm, there is no way to query data from the entire distributed database, consisting of all the individual databases, with a single query. This means that each database must be queried separately. One approach here could be to separately copy data from each database into a single database inside some data warehouse solution and then run the analytics queries there. In this case, after the analytics and updating of the data inside the data warehouse, the data needs to be distributed again to update all the source databases. However, this approach also breaks the delimitation of querying data in-place. A sketch of the copy step is shown below.
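A hedged sketch of that copy step, assuming one ODBC DSN per source database and one for the warehouse (all names below are illustrative, not part of the actual system):

import pyodbc

SOURCE_DSNS = ['cell_db_1', 'cell_db_2', 'cell_db_3']  # assumed DSN names

# Collect all rows from every source database into one warehouse table
wh = pyodbc.connect('DSN=warehouse')
wh_cur = wh.cursor()
for dsn in SOURCE_DSNS:
    src = pyodbc.connect('DSN=' + dsn)
    for row in src.cursor().execute('SELECT * FROM neighbors'):
        placeholders = ','.join('?' * len(row))
        wh_cur.execute(
            'INSERT INTO neighbors VALUES (' + placeholders + ')',
            tuple(row))
    src.close()
wh.commit()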

5.3 Possible Architectures

As illustrated in Figure 1.2, there are basically two parts to the problem of connecting this system of heterogeneous databases: one part is to find a suitable middleware for the CellApps, and the other is finding a middleware for the machine learning application. However, it could be the case that both the CellApp and the machine learning application could use the same middleware. Therefore, the list of different architectures is divided into one section for the CellApp and one section for the machine learning application. This split is done for simplicity, but as mentioned previously, the same middleware could be used in both cases.

5.3.1 CellApp Architectures

In this section, different architectures for the CellApp will be presented.

5.3.1.1 Architecture A


Figure 5.1: Architecture A - Simple direct connection

23 5.3. Possible Architectures

Illustrated in Figure 5.1 is Architecture A, one of the simplest architectures. In this architecture, each CellApp contains a database access component that handles the communication between the DBMS and the CellApp. This architecture could be implemented using some library that handles communication with the DBMS.

A portability problem with this architectural approach can be seen in Figure 5.1: all CellApps need to contain configuration on how to connect to the DBMSs. The problem reveals itself when many CellApps connect to the same database and that database is moved to another address or replaced completely. In this case, all the CellApps need to be re-configured to be able to connect to the new location or database. A step towards a solution for this problem can be seen in Architecture B (Section 5.3.1.2).

5.3.1.2 Architecture B


Figure 5.2: Architecture B - Simple middleware

Illustrated in Figure 5.2 is Architecture B. This architecture would still be classified as a distributed approach, since every base station still contains the middleware. In this case, however, the configuration of the database connection is shared between all CellApps on the same machine that connect to the same database. This means that the middleware is still local to each machine but can be used by all the CellApps on that machine. This architecture could be realized using ODBC or JDBC, as sketched below.
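As a hedged illustration of how this machine-local, shared configuration could look with ODBC (the DSN name, driver, and address are assumptions, not values from the actual system): each base station carries one odbc.ini-style entry, and every CellApp on that machine connects through the same DSN.

# /etc/odbc.ini on each base station (unixODBC), shared by all local CellApps:
#
#   [celldb]
#   Driver   = MySQL        # assumed driver name
#   Server   = 10.0.0.5     # assumed database address
#   Port     = 3306
#   Database = cells
#
# Each CellApp then only needs the DSN name, not the database location:
import pyodbc

conn = pyodbc.connect('DSN=celldb')

# If the database moves, only /etc/odbc.ini on the machine is edited;
# no CellApp code or per-CellApp configuration has to change.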

As seen in Figure 5.2, the portability problem described in Architecture A has been mitigated to an extent. If the database is moved to a new location, each machine still has to be re-configured, but not every CellApp. Having to re-configure every machine could still be a problem in a big system, but it is still better than having to change the configuration of each CellApp, as was required in Architecture A.

5.3.1.3 Architecture C


Figure 5.3: Architecture C - Simple

Illustrated in Figure 5.3 is Architecture C, a more centralized approach. Here the middleware is abstracted outside the machine that the CellApps are running on. Looking at this setup in Figure 5.3, it is clear that it is better from the configuration perspective: there is only one component that is required to know the location of the databases, so in the case of a database move, only that one component would need to be re-configured. The problem with this architecture is the overhead it adds, as discussed in Section 5.1. This overhead is especially clear in Figure 5.3, where the query is sent to the middleware outside the machine and then back inside the machine to the database, adding network overhead. Also, compared to the previous two architectures, this architecture introduces a single point of failure: if the middleware component crashes, all CellApps that rely on that component will stop working.

This kind of architecture could be built using PrestoDB or Drill.

5.3.2 Machine Learning Application

For the machine learning application, there are two different architectures that could be implemented. One option is to integrate a database access component inside the application, giving it one connection to every database; this is shown in Figure 5.5. The other approach is to abstract the middleware into its own component, as seen in Figure 5.4. Since there is only one machine learning application, these two solutions are similar to each other.

25 5.3. Possible Architectures

5.3.2.1 Architecture D

Illustrated in Figure 5.4 is connecting the machine learning application to a middleware. Because there only exists one machine learning application, it does not really matter whether this middleware component runs on the same machine as the machine learning application or on a different one. This kind of architecture could be implemented using either ODBC or a distributed query engine as middleware.


Figure 5.4: Architecture D - Middleware

5.3.2.2 Architecture E

Illustrated in Figure 5.5 is using a database access component to connect directly to the DBMSs.

26 5.3. Possible Architectures


Figure 5.5: Architecture E - Database access component

6 Results

6.1 Performance Measurements

This test was devised to measure the performance of ODBC and JDBC with different DBMSs, in order to assess whether the performance requirements in Section 1.4.1.4 can be fulfilled. The test shows which DBMSs can be used and whether ODBC or JDBC should be used. The CellApp was the focus of this test since it has by far the toughest performance requirements, see Section 1.4.1.4.

Since the CellApp is a real-time application that needs data quickly to make calculations, it is important to note that the average query time is not the only important metric. Instead, a metric as important as the average query time is the worst-case (max) time for a query. This is because the CellApp must be able to guarantee execution time.

Because the read queries have much stricter performance requirements than the write operations, the read queries are the focus of the tests. It is still important to perform write operations as well, since write operations also affect the query time of read operations when executed concurrently.

6.1.1 Test Setup

Illustrated in Figure 6.1 is the setup used for this test. The architecture of the test setup is similar to that of Architecture B described in Section 5.3.1.2. The test was performed using ODBC and JDBC as middleware between the CellApp and the DBMS; the CellApp has a direct connection to the DBMS through ODBC or JDBC. To minimize network lag as a factor, both the DBMS and the CellApps were run on the same machine.



Figure 6.1: Test setup 1

6.1.1.1 Mock Application

To run the test, I created a mock application of the CellApp. This mock application does not do any "real work" but emulates the data access pattern of the CellApp. The access patterns, listed below, can be derived from the performance requirements in Section 1.4.1.4.

• The CellApp should be able to query data from 100 neighbors per second, where each query may take a maximum of 10 milliseconds. This also means a throughput of 6000 queries per minute for every CellApp. Every neighbor query returns 20 attributes.

• The CellApp will update an attribute two times per minute. This means that a write statement has at most 30 seconds to execute, which is a much more relaxed condition than for the read queries. The write statements are executed in parallel with the read queries.

The pseudo code for the mock CellApp is seen in Listing 6.1. The function write_request executes the SQL statement in Listing 6.3, and the read_request function executes the SQL query in Listing 6.2. The row to read from or write to the database is chosen by introducing a random component, so that the measured time is closer to that of a random-access query and not just a cached response. The variable currentId is the id of the CellApp, and randomId is an integer drawn uniformly at random from an interval between zero and the maximum number of neighbors, which in this case equals 300 neighbors.

Listing 6.1: Mock CellApp pseudo code

import time

# Run the write request in parallel every 30 seconds
set_interval(write_request, 30)

i = 0
# Number of read requests to execute in the test
TIME = 3 * 60 * 100
while i < TIME:
    i += 1
    start = time.time()
    read_request()
    end = time.time()
    # Sleep for the rest of the 10 ms slot so that at most
    # one request is executed per 10 ms
    res = (10 - (end - start) * 1000.0) / 1000.0
    if res > 0:
        time.sleep(res)

Listing 6.2: Read request

SELECT * FROM neighbors WHERE CellAppId=currentId AND NeighborId=randomId;

Listing 6.3: Write request

UPDATE neighbors SET Field4='test' WHERE CellAppId=currentId AND NeighborId=randomId;
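A minimal sketch of how read_request and write_request could be implemented over ODBC (a hypothetical illustration, not the exact testbench code; it assumes a pyodbc cursor cur and the CellApp's own id current_id):

import random

MAX_NEIGHBORS = 300  # each CellApp has 300 neighbors

def read_request():
    # Listing 6.2 with a randomly drawn neighbor id
    random_id = random.randint(0, MAX_NEIGHBORS)
    cur.execute(
        "SELECT * FROM neighbors WHERE CellAppId = ? AND NeighborId = ?",
        (current_id, random_id))
    return cur.fetchone()

def write_request():
    # Listing 6.3 with a randomly drawn neighbor id
    random_id = random.randint(0, MAX_NEIGHBORS)
    cur.execute(
        "UPDATE neighbors SET Field4 = ? WHERE CellAppId = ? AND NeighborId = ?",
        ('test', current_id, random_id))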

6.1.1.2 Database Management Systems (DBMSs) Tested

The following DBMSs were benchmarked in this test:

• MySQL InnoDB - Version 5.7.22

• MySQL Memory - Version 5.7.22

• MySQL MyISAM - Version 5.7.22

• PostgreSQL - Version 9.5.12

• SQLite - Version 3.11.0

• SQLite on RAM disk - Version 3.11.0

• H2 - Version 1.4.197

All of them were tested with ODBC to compare the performance of the different DBMSs. Some of the DBMSs were also tested with JDBC to compare the performance of ODBC and JDBC.


6.1.1.3 Test Parameters

The following test parameters were used during the test; they were chosen to emulate real-life conditions. Each test was run until 18000 read queries had been resolved for each CellApp. The number 18000 corresponds to 3 minutes if the read queries take, on average, 10 ms to resolve. To emulate a real system, 18 CellApps were run simultaneously, which means that a total of 324000 read queries are executed during the course of the test. Table 6.1 summarizes the parameters used in this test. Each CellApp has 300 neighbors, to emulate real-world conditions.

Indexes for the different DBMSs were created implicitly for the primary keys. Since the primary keys are the only thing used to query the data, no other indexes should be needed.
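As a hedged sketch of what the neighbors table could look like (the column names follow the Protobuf model in Appendix A, but the exact types of Figure 4.1 are assumptions here), the composite primary key is what creates the implicit index that all test queries use:

# Assumed DDL for the neighbors table; VARCHAR is used rather than a
# STRING/TEXT type since not all engines support those (cf. Section 7.3)
cur.execute("""
    CREATE TABLE neighbors (
        CellAppId  INT NOT NULL,
        NeighborId INT NOT NULL,
        Field1     VARCHAR(255),
        -- Field2 to Field20 elided for brevity
        PRIMARY KEY (CellAppId, NeighborId)
    )""")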

Table 6.1: Test parameters

Number of CellApps running concurrently: 18
Total number of read queries executed: 324000
Database table "states" number of rows: 18
Database table "neighbors" number of rows: 5400

Each of the DBMSs was initialized with mock data for 18 CellApps before the test started. As seen in Table 6.1, the states table consists of 18 rows of data and the neighbors table of 5400 rows; the number of rows in the neighbors table comes from each of the 18 CellApps having 300 neighbors. In this test, the neighbors table was used to run the queries and update statements on. The schema for the tables is illustrated in Figure 4.1.

The test was done on a virtual machine running on a server provided by Ericsson. Specifications for this machine can be seen in Table 6.2. This machine is about as powerful as an average home PC, which is realistic for this application. Both the processes of the DBMSs and the CellApps were scheduled automatically by Ubuntu.

Table 6.2: Test machine

CPU: Intel Xeon E3-12xx
Cores: 8
RAM: 16 GB
Operating system: Ubuntu 16.04.1 LTS

6.1.2 Result using ODBC

The result of the tests using ODBC can be seen in Table 6.3. The first third of the read queries executed was used as a warm-up period for the DBMS; hence, the first 6000 queries were not used when calculating the result.

Since the CellApp mock application only issues one query every 10 ms, the real throughput for each CellApp is constant at 6000 queries per minute as long as no query exceeds the maximum allowed query time of 10 ms. The throughput shown in the tables below is therefore calculated from the average query time; it approximates the throughput if the CellApps were to issue read queries as fast as possible. The calculation used for the throughput can be seen in Equation 6.1, where t is the throughput and a is the average query time in milliseconds.

t = 18 × 60 × 1000/a [queries/min] (6.1)
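As a quick sanity check of Equation 6.1 against the measured data: MySQL InnoDB's average query time of a = 0.3713 ms gives t = 18 × 60 × 1000/0.3713 ≈ 2.909 × 10^6 queries/min, which matches the first row of Table 6.3.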


Table 6.3: ODBC - Initial measurements of query read operations

Database | Throughput (queries/min) | Average Time (ms) | Max Time (ms) | Min Time (ms) | Standard deviation
MySQL InnoDB | 0.2909E+07 | 0.3713 | 1.8089 | 0.2103 | 0.1735
MySQL MyISAM | 0.2748E+07 | 0.3931 | 2.3310 | 0.1929 | 0.2002
MySQL Memory | 0.3716E+07 | 0.2907 | 1.7009 | 0.1838 | 0.1109
PostgreSQL | 0.1639E+07 | 0.6589 | 3.1700 | 0.3593 | 0.3053
SQLite3 | 0.2651E+07 | 0.4075 | 292.6514 | 0.1817 | 4.0458
SQLite3 on RAM disk | 0.2927E+07 | 0.3690 | 41.1594 | 0.1822 | 0.7315
H2 | 0.9000E+04 | 120.0010 | 127.6989 | 117.0919 | 0.3681

When using MySQL there are multiple storage engines available by default. In this test, three of them were tested: MyISAM, Memory, and the default InnoDB. As seen in Table 6.3, all of these storage engines performed well, at least compared to the PostgreSQL system. The highest worst-case time among them was seen with the MyISAM engine.

One thing to note here is that using a main memory engine like the MySQL Memory engine will not give you ACID compliance since the data that is being written is not written to disk directly. In the case of a system crash, data could be lost. [28]

PostgreSQL is slower than all MySQL engines in this setting: both the max and average query times are significantly higher than for all the MySQL engines tested.

SQLite was tested with varying results. While the average query time using SQLite was good, there are some big spikes in query execution time. These spikes did not occur very often and seem to be related to read queries being executed at the same time as write queries. An overview of the spikes can be seen in Figure 6.3 and a detailed view of a single spike in Figure 6.4. A cause for the spikes could be that the ODBC driver used is classified as experimental and not recommended for production use. The spikes could be reduced by moving the SQLite database to main memory; they are then reduced but not eliminated completely. This mitigation can be seen in Figure 6.5: the spikes still follow the same pattern and occur during write queries, but they are somewhat reduced in terms of query time. They are still far too large for this database to be usable in production. Also, by moving the database to main memory, ACID compliance is lost, since the database is stored in main memory and not on disk.

The last DBMS tested is a lesser-known one written purely in Java, called H2. The creators of H2 claim that it is designed for real-time applications, and it also offers main-memory operation. It uses the same ODBC driver as PostgreSQL and can therefore be connected to in the same way as PostgreSQL. Although promising in theory, H2 does not perform well in the initial tests using ODBC. The execution time for a simple read query is steadily over 100 milliseconds; because of this, both the average and max query times are far above those of the other DBMSs tested.

In this test, both MySQL and PostgreSQL passed the requirement of a query time below 10 ms. For these two DBMSs, both the average and max query times are well under the requirement. Best in this test was MySQL with the Memory engine, seen in Figure 6.2.

SQLite does have an average query time under 10 ms, but the max time is still far over the threshold, even when using a RAM disk. It is therefore not stable enough to be used in this context.

H2 does not perform well when looking at the average or max query time.


Figure 6.2: MySQL Memory engine read and write operations

Figure 6.3: SQLite read and write operations


Figure 6.4: SQLite spike read and write operations

Figure 6.5: SQLite main memory read and write operations

6.1.2.1 ODBC Using a High-performance Machine

This test was done in order to establish whether the results in Table 6.3 were limited by the DBMSs themselves or whether the machine they run on was the bottleneck. To establish this, the exact same tests as in Section 6.1.2 were performed, but this time on a virtual machine with more resources. The resources of this virtual machine can be seen in Table 6.4.

Table 6.4: Test machine 2

CPU: Intel Xeon E3-12xx
Cores: 20
RAM: 32 GB
Operating system: Ubuntu 16.04.1 LTS

The results, seen in Table 6.5, are not as expected: almost all of the DBMSs, except H2, performed worse in this test. The throughput is significantly lower for all DBMSs except H2, which managed to stay the same. A possible explanation for this unexpected result could be that CPU resources are not the limiting factor in these tests, but rather something else, such as main memory speed. By adding cores and logical CPU units, it could also be the case that more synchronization is needed, which in turn gives a worse result. Another explanation could be that this is some unknown artifact of the virtual machine. Whatever the reason for this strange result, the conclusion is that, at least in this test environment, scaling up the hardware does not necessarily give a better result.

Table 6.5: ODBC high-performance machine - Measurements of query read operations

Database | Throughput (queries/min) | Average Time (ms) | Max Time (ms) | Min Time (ms) | Standard deviation
MySQL InnoDB | 0.1893E+07 | 0.5706 | 2.1863 | 0.2394 | 0.1552
MySQL MyISAM | 0.1756E+07 | 0.6150 | 4.9121 | 0.2875 | 0.1322
MySQL Memory | 0.1992E+07 | 0.5423 | 4.0960 | 0.2327 | 0.1315
PostgreSQL | 0.9789E+06 | 1.1033 | 3.2105 | 0.4413 | 0.3164
SQLite3 | 0.1737E+07 | 0.6217 | 496.7797 | 0.2041 | 1.1612
SQLite3 on RAM disk | 0.2078E+07 | 0.5197 | 101.9571 | 0.2100 | 1.2441
H2 | 0.8998E+04 | 120.0222 | 139.9386 | 117.1389 | 0.7091

6.1.3 Using JDBC

The implementation using JDBC is similar to that of ODBC. But when testing JDBC with the same Python testbench used for ODBC, all DBMSs tested were much slower than with ODBC. This could be because of the overhead of having to pass through the Java virtual machine. Therefore, I built a similar testbench in Java for these tests. As suspected, this brought the performance up to a level comparable to the ODBC results. The results, seen in Table 6.6, suggest that JDBC has a faster minimum query execution time than ODBC. But the average and maximum query times, which are the main focus of this test, are significantly higher with JDBC than with ODBC.

Another interesting observation here is that H2 performs almost as well as the other databases in this test with JDBC. This suggests that the H2 database is, in fact, quite fast, but that its ODBC driver is the limiting factor.

Table 6.6: JDBC - Initial measurements of query read operations

Database | Throughput (queries/min) | Average Time (ms) | Max Time (ms) | Min Time (ms) | Standard deviation
MySQL MyISAM | 0.2631E+07 | 0.4105 | 34.6827 | 0.1230 | 0.3389
PostgreSQL | 0.2513E+07 | 0.4298 | 5.5006 | 0.1498 | 0.3431
H2 | 0.2314E+07 | 0.4668 | 21.9480 | 0.1378 | 0.2336

6.1.4 Introducing a Network Delay

One part of the research is to check whether it is possible to move the DBMS to a remote location without losing too much performance. So, in this part of the test, I separated the DBMS and the CellApps; the new test setup is illustrated in Figure 6.6. In the previous tests, both the DBMS and the CellApps ran on the same machine, while in this test they run on two different machines. These two machines run on the same local network and are physically near each other, which should give a fairly low network delay. The network overhead should thus be much lower than if the DBMS ran in the cloud, physically far away.

Virtual Machine 2 seen in Figure 6.6 has the resources seen in Table 6.4. Virtual Machine 1 has the resources seen in Table 6.2.


Figure 6.6: Test setup 2

When performing a ping between the two machines to estimate the network delay, the round-trip time (RTT) was around 0.8 ms. This should enable almost all of the previously tested DBMSs to still deliver a query time well below the 10 ms threshold.

This test was done with MySQL Memory and PostgreSQL using ODBC. ODBC was chosen because it had better query times than JDBC in general, MySQL Memory because it was fast in the previous test, and PostgreSQL as a control measurement. MySQL and PostgreSQL are also easier to set up in a network environment than SQLite, which was also fast in the previous test. SQLite is hard to set up in a network environment because it is accessed by referencing a file on disk; doing this from another machine would require additional software to share that file over the network.

As seen in Table 6.7, the results are, as expected, overall worse than in the previous tests in Section 6.1.2. Both the average and max query times increased by quite a bit more than the measured RTT above. However, both MySQL and PostgreSQL are still well below the 10 ms threshold.

Table 6.7: ODBC - Over network

Database | Throughput (queries/min) | Average Time (ms) | Max Time (ms) | Min Time (ms) | Standard deviation
MySQL Memory | 0.9651E+06 | 1.1190 | 4.7083 | 0.4880 | 0.2020
PostgreSQL | 0.4314E+06 | 2.5033 | 6.0143 | 1.2474 | 0.4242


6.1.5 Introducing a Distributed Query Engine

This test evaluates the most centralized architectural approach, where the middleware is located on another machine than the CellApp. The same two machines as in the previous network test were used again, so the network overhead should be the same as in the last tests. In this test setup, the machine that contains the middleware also contains the DBMS. The communication between the middleware and the CellApps is managed by a library that provides an API to the different middlewares: for PrestoDB the official Python client1 was used, and for Drill a third-party Python client2. A sketch of how a query is issued through the PrestoDB client is shown below.
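As a minimal sketch of how a query is issued through the PrestoDB Python client (the host, catalog, and schema names below are assumptions for illustration; the client exposes a DBAPI-style interface):

import prestodb  # the official presto-python-client

# Connect to the PrestoDB coordinator; names are assumed for illustration
conn = prestodb.dbapi.connect(
    host='vm2.example',   # the machine running the middleware
    port=8080,
    user='cellapp',
    catalog='mysql',      # connector configured for the MySQL instance
    schema='cells')
cur = conn.cursor()
cur.execute("SELECT * FROM neighbors WHERE CellAppId = 17 AND NeighborId = 42")
rows = cur.fetchall()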

The following distributed query engines were tested:

• PrestoDB - Version 0.200

• Drill - Version 1.13.0

Virtual Machine 1 and 2 are the same as in the previous test seen in Section 6.1.4.


Figure 6.7: Test setup 3

The distributed query engines tested are Drill and PrestoDB. Each of them was tested with MySQL, using the Memory engine, and with PostgreSQL as a control measurement. The results can be seen in Table 6.8. As the table shows, these systems are not intended to be used this way: neither of the two query engines could run the full test without crashing. Drill worked when handling just one of the CellApps, but when 18 CellApps were started concurrently it crashed almost immediately. PrestoDB did better than Drill in this test; it at least ran long enough to get some statistics on how it performs. However, PrestoDB also crashed after some minutes of running, due to running out of all of the 16 GB of memory. This is probably because it builds up a huge queue of unprocessed queries.

1 https://github.com/prestodb/presto-python-client - Version 0.4.2
2 https://github.com/PythonicNinja/pydrill - Version 0.3.3

Table 6.8: Distributed query engines - Over network

Database | Throughput (queries/min) | Average Time (ms) | Max Time (ms) | Min Time (ms) | Standard deviation
PrestoDB MySQL Memory | 0.6005E+05 | 199.1672 | 810.0905 | 47.0428 | 97.5587
PrestoDB PostgreSQL | 0.4281E+04 | 252.2610 | 806.2243 | 115.8898 | 85.1282
Drill MySQL Memory | - | - | - | - | -
Drill PostgreSQL | - | - | - | - | -

6.1.6 Comparing Relational Database Management Systems (RDBMSs) to a Non-SQL (NoSQL) Alternative

NoSQL systems typically do not store data in tables with relations and constraints, as an RDBMS does. NoSQL is somewhat outside the scope of this project since, as the name suggests, such systems typically cannot be queried with SQL. It will nevertheless be tested whether a NoSQL system achieves better performance than the RDBMSs tested. There are some ODBC connectors that could possibly work with some NoSQL systems, but no free or open-source ones were found.

In this test, the popular NoSQL system Redis was used. Redis was chosen because it is lightweight, open source and often considered to be fast. Version 4.0.9 of Redis was used in this test. The python client redis-py3 was used to connect to Redis.

Redis is an in-memory key-value NoSQL system: it stores the database in memory but saves the data to persistent disk storage periodically. [29] Because Redis stores data in memory, it cannot fully enforce the ACID properties. The simplest way of storing string information in Redis is running the SET command with a key and a value; the value can then be queried by running the GET command with the key. Redis also supports storing more complex data structures than strings, like sets and lists. [30] String storage is what is used in this test, as sketched below.
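A minimal sketch of this string storage through redis-py (host, port, and key are illustrative assumptions; the listings further down assume a client object r created in this way):

import redis  # the redis-py client

# Connect to a local Redis server; host and port are assumptions
r = redis.StrictRedis(host='localhost', port=6379)

r.set('greeting', 'hello')   # SET command: store a string value
value = r.get('greeting')    # GET command: returns b'hello'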

Given the key-value data model supported by Redis, a key-value database schema had to be developed for this test, since the relational schema described in Chapter 4 cannot be used in Redis' key-value model. A very simple model was developed using Google Protobuf4. Protobuf lets you write specifications that are compiled into native Python classes; these classes can then be serialized into binary form and parsed back into class objects. The model used in this test can be seen in Appendix A. The key in the NoSQL system is the table name combined with CellAppId and NeighborId; the value is the neighbor object in binary form, chosen since binary form is compact and easy to generate using Protobuf. To update something in Redis, a SET command with the serialized class was executed; the update operation for this test can be seen in Listing 6.5. The read queries were done with the GET command, followed by parsing the binary string into a Python class instance using Protobuf; the read operation can be seen in Listing 6.4.

When comparing the query features of Redis to those of an RDBMS, Redis is much more limited: there is no easy way of doing the more advanced queries that are possible using SQL. As an example, getting all rows with a CellAppId lower than 10 would be easy to express in a single SQL query, while doing the same thing in Redis requires more involved code, as sketched below.
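A hedged sketch of that difference (the key format follows Listing 6.4; the scan loop is one possible way to emulate the SQL predicate, not a pattern taken from the actual test code):

# In SQL, a single query suffices:
#   SELECT * FROM neighbors WHERE CellAppId < 10;
#
# In Redis, the client must enumerate the candidate keys itself:
results = []
for cell_app_id in range(10):                     # CellAppId < 10
    for neighbor_id in range(MAX_NEIGHBORS + 1):  # all possible neighbor ids
        key = 'neighbor({},{})'.format(cell_app_id, neighbor_id)
        raw = r.get(key)
        if raw is not None:
            neighbor = cellapp_pb2.neighbor()  # Protobuf class, see Appendix A
            neighbor.ParseFromString(raw)
            results.append(neighbor)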

Listing 6.4: GET operation

# Create the Redis key from a CellAppId and NeighborId
key = 'neighbor({},{})'.format(cellAppId, neighborId)
res = r.get(key)                     # Get data from Redis
neighbor = cellapp_pb2.neighbor()    # Create protobuf object
neighbor.ParseFromString(res)        # Parse binary to object instance

Listing 6.5: SET operation

# Create the Redis key from a CellAppId and NeighborId
key = 'neighbor({},{})'.format(cellAppId, neighborId)
res = r.get(key)                     # Get data from Redis
neighbor = cellapp_pb2.neighbor()    # Create protobuf object
neighbor.ParseFromString(res)        # Parse binary to object instance
neighbor.Field4 = "new_data"         # Write the new data
line = neighbor.SerializeToString()  # Serialize the object to binary string
r.set(key, line)                     # Write to Redis

3 https://github.com/andymccurdy/redis-py - Version 2.10.6
4 https://developers.google.com/protocol-buffers/ - Version 3.5.1

The same test setup as in Section 6.1.2 was used. The results are shown in Table 6.9 and Figure 6.8. As suspected, the NoSQL system performs better in this setting than the RDBMSs in Table 6.3: the throughput is higher and the average query time lower than for any of the RDBMSs tested.

Table 6.9: Performance measurements NoSQL

Database | Throughput (queries/min) | Average Time (ms) | Max Time (ms) | Min Time (ms) | Standard deviation
Redis | 0.4528E+07 | 0.2385 | 1.5557 | 0.0885 | 0.1167

Figure 6.8: Redis read and write operations

6.2 The Machine Learning Application

The machine learning application is significantly harder to benchmark in a realistic way, since in reality it would query thousands of databases, and it is not possible to build such a test environment within the scope of this project. Performance tests are, however, not as important for this application, since it has hours to query the data, compared to the CellApp which only has some milliseconds. The PrestoDB website states that PrestoDB is used to query 1 petabyte of data at Facebook each day. [18] If we assume that this information is correct, it should be more than enough for this system, which should add up to well below 1 TB of data.

ODBC/JDBC: Using ODBC or JDBC for the machine learning application would not be ideal. ODBC/JDBC do not natively support distributed queries, so queries on the complete distributed dataset would require all data to first be collected and copied into some single database. This solution would therefore be very similar to using a data warehouse.

Distributed query engine: Using a distributed query engine would enable the machine learning application to query data "in-place" using distributed queries. The problem with this approach comes when write operations should be performed, because the distributed query engines found in this thesis can only update entire tables at a time. This is not done in a transaction but in several statements; hence, the data will be unavailable to the CellApp during the update period. This is explained further in Section 6.3.

Conclusion: Neither of these two approaches for the machine learning application fulfills all the requirements perfectly.

6.3 Data Migration

In theory, a good way of doing a data migration in this heterogeneous distributed database system would be to define the migration purely in SQL. This would require a central component that is connected to all databases. Ideally, it would look something like Listing 6.6: a single transaction that moves data from one DBMS, called mysql1, to another DBMS, mysql2, where the two DBMSs could be located at different locations.

Listing 6.6: SQL data migration

START TRANSACTION;
INSERT INTO mysql2.new_table SELECT * FROM mysql1.old_table WHERE CellAppId=moveId;
DELETE FROM mysql1.old_table WHERE CellAppId=moveId;
COMMIT;

However, none of the distributed query engines can achieve the functionality seen in Listing 6.6 out of the box. The problems encountered when trying to implement the functionality of Listing 6.6 in PrestoDB are listed below:

• Transactions not supported. PrestoDB has a TRANSACTION statement, but it is not available for use with any of the RDBMSs tested in this thesis.

• Deletions not supported. The DELETE command is not supported for the RDBMSs tested. A very inefficient delete function can be implemented using the CREATE TABLE AS command; this workaround can be seen in Listing 6.7. It is very inefficient but works. Note that it also requires that all statements can be done in a single transaction, otherwise all data in the table will be unavailable during the deletion.


Listing 6.7: Delete functionality

START TRANSACTION;
CREATE TABLE table_tmp AS SELECT * FROM old_table WHERE CellAppId<>moveId;
DROP TABLE old_table;
CREATE TABLE old_table AS SELECT * FROM table_tmp;
DROP TABLE table_tmp;
COMMIT;

The other approach would be to use ODBC/JDBC, but as stated previously, these technologies do not support distributed transactions either.

6.4 Final System

None of the frameworks, databases, and techniques found can be used to create a system that fulfills all the requirements of this system.

A proposed architecture and solution can be seen in Figure 6.9. This solution does not fulfill all the requirements, but based on the research and results it fulfills most of them. As seen in Figure 6.9, the machine learning part of the system uses Architecture D with a distributed query engine; this is the only way found that enables the use of distributed queries. The CellApp part of the solution in Figure 6.9 uses Architecture B with ODBC. ODBC is used because it gave the best performance while still enabling querying of heterogeneous DBMSs using a uniform SQL syntax. According to the results, all of the MySQL engines, as well as PostgreSQL, can be used while still maintaining a big margin to the 10 ms threshold.

This solution will not enable data migration without losing availability of the data during the migration.

This proposed system makes some assumptions that may not be valid in the real-life environment:

1. CellApps are not running while the machine learning application writes data. As discussed previously, the machine learning application cannot write data without making it unavailable for a period of time.

2. There is no migration of data.

3. The CellApp and DBMS are suitably located, near each other, to satisfy the performance requirements.

4. Maximum of 18 CellApps for each DBMS. Connecting more than 18 CellApps to a single DBMS has not been performance benchmarked in this thesis.



Figure 6.9: Proposed solution

7 Discussion

In this chapter, the most interesting parts of the result will be discussed. Some of the content of the thesis has already been discussed in previous parts: for discussions about architectural choices, see Chapter 5; for discussions about the database and distribution schemas, see Chapter 4. The method used will also be examined, and after that the final system will be discussed. The future work section lists some topics that did not directly fit into this thesis but would still be interesting to investigate further. The section about societal and ethical considerations places the work done in this thesis in a larger societal context.

This chapter contains the following sub-headings and discusses some of the most interesting findings of this thesis that have not already been covered in previous chapters.

• Literature Study

• Performance Measurements

• Method

• Final System

• Future Work

• Societal and Ethical Considerations

7.1 Literature Study

A big part of the work done in this thesis was researching and finding solutions that could fit the requirements from the customer. This was done by searching for other projects with similar requirements and investigating what techniques they used. Some promising techniques were investigated more deeply, added to the theory study, and later tested and benchmarked.

Many of the techniques found that sounded promising were not open source. Many big companies like IBM, Microsoft, and Oracle offer solutions to the kind of system investigated in this thesis. Some of them were added to the theory section, but not all, because of the delimitation that only free and open-source projects were tested in this thesis.

7.2 Performance Measurements

The CellApp is a real-time application; therefore, the ideal result in the performance measurements is that both the max and average query times are below the 10 ms threshold. The results show that both MySQL and PostgreSQL could be used in this context, with a big margin. SQLite is promising, but due to the spikes during write operations it should not be used in this project. It is possible that SQLite would perform better with another ODBC driver; as mentioned previously, there are multiple paid ODBC driver alternatives on the market that were not tested within the scope of this project due to the delimitations.

Network overhead had a big impact in the tests. In these tests the CellApp and DBMS were located near each other, but if they were to be moved further apart in the future, extensive tests would have to be done before the move. The network overhead increased the query times by considerably more than the theoretical RTT; therefore, real tests have to be done in order to get a realistic view of how the query time is affected by the network.

It should also be possible to get better performance by running the tests on a faster machine but, as seen in Table 6.5, this was not apparent in practice. It is also possible that performance could be enhanced by adding some kind of cache mechanism inside the CellApp. However, by doing this, the storage would no longer use SQL entirely, which goes against the requirements of this thesis.

Another conclusion is that the NoSQL system Redis is faster than all the RDBMSs tested: it has both a lower max query time and a lower average query time. One thing to note when using Redis in this way is that the data must first be read before it can be updated, as seen in Listing 6.5: the data is read from Redis, parsed into an object, modified, serialized, and written back. This can be compared to the SQL update query in Listing 6.3, where the column can be updated directly without reading the data first. But even with this extra read, Redis is still faster than all the RDBMSs by quite a big margin.

7.3 Uniform Structured Query Language (SQL) Syntax

A big part of this thesis was to find ways of communicating with DBMSs in a common way using a uniform SQL syntax. Although many of the technologies found could achieve this, none of them accomplish it completely seamlessly. Using ODBC, it was found that although the different RDBMSs mainly use the same syntax and features, there are small differences that must be considered. These small differences add up fast when designing a large network of heterogeneous DBMSs. If a uniform syntax and database schema are wanted for the entire network, only those features that are available in all the DBMSs used can be employed. As mentioned in Section 4.1, one such problem was discovered during this thesis: the STRING data type could not be used, because it is not available in MySQL with the Memory engine.

Another problem detected when using PrestoDB is that it does not support capital letters in table names. This must also be considered when designing the database schema; otherwise, PrestoDB may not be able to query the data in a certain table. PrestoDB also does not fully support the BOOL data type; instead, a TINYINT data type was used when inserting data containing BOOL values into MySQL.

44 7.4. Data Migration

Because of all these small differences, my conclusion is that it is important, in a heterogeneous network of DBMSs, to carefully plan the schema and architecture before implementing the system. These small differences also mean that it is not easy to add a new DBMS to an already existing network of DBMSs.

7.4 Data Migration

Why data migration is not supported by any of the frameworks and techniques is described in Section 6.3. As mentioned in that section, distributed transactions are needed for data migration to work safely. For a data migration, you would basically want something like a DDBMS: something that looks like a single relational database to the end user and enforces the ACID properties for every transaction in the distributed network of databases.

7.5 Method

The architecture and performance of the CellApp could, without much trouble, be emulated in this test environment in a fairly realistic way. For the machine learning application, on the other hand, there is no easy way of setting up a realistic test environment: it would require massive amounts of data scattered across a distributed network of databases, which was not possible within the scope of this thesis project since those resources were not available. It might be possible to emulate on a smaller scale, but that would at the same time not give a realistic result. Even with the extra resources it would have been hard work to set up an environment for the machine learning application, but it would have been possible.

Running the tests in a virtual machine proved to be both a great help and a challenge. The good thing about running the tests in a virtual environment is that it is easy to set up: the performance of the machine can be chosen, and a virtual local network of test machines can be set up quickly. This setup was done by installing all required software on one machine and then cloning that machine; the machines could then be reset to their initial state before every test run. However, a validity problem with using a virtual machine became apparent when running the tests: there is no real way of knowing what goes on behind the scenes, so it is harder to know whether a certain result is an artifact of the virtual machine or, in fact, the result of what is being measured. This was especially apparent in the tests on the high-performance machine in Section 6.1.2.1, where the number of CPU cores was scaled up and the results got worse for almost all RDBMSs. This goes against the intuition that more CPU resources should give a better test result and leads you to believe that it could just be an artifact of the virtual machine, but it is impossible to know for sure.

The replicability and reliability of the tests in a virtual environment are good, since the exact same results can be obtained by just cloning the environment. At the same time, for someone outside Ericsson without access to the virtual environment, replication would be hard, since the hardware components behind the virtual machine are not known.

When looking for sources to use as references in the theory section of this thesis, it was found that many of the newer technologies used here are not covered in many good, peer-reviewed scientific papers. For these technologies, the source of information is therefore often the official documentation. Claims from the official documentation are assumed to be correct, but one must keep in mind that it is not always written in an impartial way: the official documentation is often written by someone with an investment or other interest in the product, and such claims can be clouded by that investment. Claims like "this DBMS is the fastest one ever" should not be taken as valid from this kind of source. Documentation on how the technology works has, however, been used and referenced in this thesis.

7.6 Final System

The proposed final system is not a solution to the problem and requirements stated in the beginning of this thesis; rather, it is the best solution that was found. One big shortcoming is that data migration is not supported. While it is not supported, nothing in the solution directly prohibits extending it with support for data migration; this could be done by implementing something like two-phase commits on top of ODBC, as sketched below. It has also not been tested how a migration affects the performance of the system. Many assumptions, listed in Section 6.4 and not all realistic, were also made in order to construct this final system.
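A minimal sketch of what such a migration coordinator on top of ODBC could look like (the DSN names are assumptions; this only approximates two-phase commit by deferring both commits to the end, so a crash between the two commit() calls can still break atomicity, which is precisely the guarantee a real two-phase commit protocol would add):

import pyodbc

def migrate(move_id):
    # Open transactions against both DBMSs (pyodbc autocommit is off by default)
    src = pyodbc.connect('DSN=mysql1')
    dst = pyodbc.connect('DSN=mysql2')
    try:
        # Phase 1: do all the work, but commit nothing yet
        for row in src.cursor().execute(
                "SELECT * FROM old_table WHERE CellAppId = ?", move_id):
            placeholders = ','.join('?' * len(row))
            dst.cursor().execute(
                "INSERT INTO new_table VALUES (" + placeholders + ")",
                tuple(row))
        src.cursor().execute(
            "DELETE FROM old_table WHERE CellAppId = ?", move_id)
        # Phase 2: commit both sides
        dst.commit()
        src.commit()  # a crash right before this line loses atomicity
    except Exception:
        src.rollback()
        dst.rollback()
        raise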

Also, although not discussed in this thesis, there is a need for some central application to distribute the data. As seen in Figure 4.2, the distribution schema requires that all CellApps have a unique integer id as the primary key. In practice, this would require an application that hands out CellAppIds and checks that there are no duplicate ids. This check would have to span the entire network of DBMSs, so that the same CellAppId never exists in two different DBMSs; otherwise, conflicts and collisions could occur when migrating data. Of course, the same is required for the neighbors table in Figure 4.3, since it also relies on the CellAppId as part of its primary key.

7.7 Future Work

One of the delimitations of this project was to query data in-place at the DBMSs. When looking at similar projects, where many distributed heterogeneous databases are queried, most others solve this using some kind of data warehouse solution. This means that the operational data needed to run the application in question is stored in multiple heterogeneous databases, but that data is also periodically imported into a data warehouse, so there are two copies of the data. The data warehouse holds one copy that is used to run queries on and analyze, while the heterogeneous databases are used for the operation of the application (in this case the CellApp). By doing this, the OLAP and OLTP use cases are separated into two systems, which lowers the complexity of the combined system since each part only needs to be optimized for a specific use case. If this delimitation is removed in the future, it could be interesting to research whether a data warehouse gives a better solution than the one found in this thesis. Data warehouses seem to be the de facto approach for systems with requirements similar to those in this thesis.

Another interesting direction could be to compare more NoSQL systems with RDBMSs. This was done briefly in this thesis, but it could be extended to compare the performance of more NoSQL databases in this setting. This is feasible because relations and constraints are not really required by the database schema, so it should be easy to model the data in a NoSQL database as well.

More research into implementing distributed transactions for a heterogeneous network of databases could also be done in future work. Some research into the required parts, like two-phase commits, has been done in this thesis, but they were not implemented and tested.

7.8 Societal and Ethical Considerations

There are societal and ethical considerations connected to the work in this thesis. The most obvious is data integrity, which is important because of the environment this system runs in. Wireless communication is integrated deeply into our society today. The data stored in this system is not directly secret and does not contain sensitive user information, but it is critical for keeping up operations and availability of service. If this data were corrupted or unavailable, it would hinder the availability of the service provided. This could affect both the single user and society as a whole, as we rely on fast and efficient wireless communication today. For the single user, a disruption of the wireless communication network could mean anything from not being able to call a friend or browse the Internet, to something more critical, such as reporting an accident or requesting medical attention. For society, this loss of communication could have an economic impact or, in some cases, even endanger the stability and security of our modern society.

8 Conclusion

This thesis ends by returning to the initial research questions.

What are the important properties for a distributed network of heterogeneous SQL DBMSs in the given setting?

There are many important properties for a distributed heterogeneous network of SQL DBMSs. The properties discussed in this thesis are summarized here:

1. SQL syntax differs: Even though all RDBMSs tested follow the SQL standard, the syntax differs between them. Things like which data types and SQL statements are available differ. This becomes difficult to keep track of in a large system of many different RDBMSs.

2. No global synchronization: There is no native global synchronization in a distributed, heterogeneous network of SQL DBMSs. This means that if transactions over many RDBMSs are wanted, some kind of global synchronization must be implemented.

3. No distributed queries: Natively, there is no way of doing queries that gather data from multiple RDBMSs in a single query.

4. Different performance: As found in this thesis, different RDBMSs perform very differently.

5. Different machines behave differently: As found in this thesis, the machine that the RDBMSs run on is a big factor in the performance of the RDBMSs.

6. Network delay is a big factor: This thesis shows that a big part of the query time for the RDBMSs is added network overhead.

7. Hard to combine OLAP and OLTP: It is hard to find techniques that support both the OLAP and OLTP use cases.

8. Different architectures: Depending on which technologies are used, the system will have a different architecture.

What frameworks and techniques can be used to communicate with a heterogeneous network of DBMSs using a uniform SQL syntax?

The answer to this question can be found in the Theory chapter (Chapter 3) and the Architecture chapter (Chapter 5). There are many different tools that can be used for this; some were tested in this thesis, and more exist that were not considered. The following technologies for communication using SQL in a heterogeneous database network were tested in this thesis:

• PrestoDB

• Apache Drill

• ODBC

• JDBC

How do these frameworks and techniques compare to each other?

A detailed comparison in terms of performance and features is presented in the Results chapter (Chapter 6). To summarize the result: ODBC and JDBC are better for the OLTP use case, while technologies like PrestoDB and Drill are better for the OLAP use case.

Is it feasible to build such a system, using the frameworks and techniques, that fulfills the functional and performance requirements of this project?

Short answer: no. It was not possible, within the scope of this thesis, to build a system using the frameworks and techniques found that fulfills all the requirements; not all the functional requirements could be satisfied. One big reason why such a system is hard to build is, as previously mentioned, the combination of the OLAP and OLTP use cases. The performance requirements, however, could be fulfilled.

Will the system built, using the frameworks and techniques, enable data to migrate from one DBMS to another while keeping consistency and availability of the data?

Short answer: no. None of the frameworks and techniques found in this thesis could migrate data in a good way without losing availability of the data. That said, it could be possible to develop something on top of one of these frameworks and techniques to enable data migration without loss of availability.

Concluding thoughts: I would not recommend building a system of heterogeneous SQL databases in this context. It is both overly complex and not really required for this project. If it is still wanted, however, I would look at data warehouse solutions, which I believe are a more common approach.

Bibliography

[1] J. M. Smith, P. A. Bernstein, U. Dayal, N. Goodman, T. Landers, K. W. Lin, and E. Wong, “Multibase: Integrating heterogeneous distributed database systems”, in Proceedings of the May 4-7, 1981, National Computer Conference, ACM, 1981, pp. 487–499.
[2] M. T. Özsu and P. Valduriez, Principles of Distributed Database Systems. Springer Science & Business Media, 2011.
[3] Oracle. (Apr. 2018). A relational database overview, [Online]. Available: https://docs.oracle.com/javase/tutorial/jdbc/overview/database.html.
[4] D. C. Jamison, “Structured query language (SQL) fundamentals”, Current Protocols in Bioinformatics, pp. 9–2, 2003.
[5] F. Zemke, “What’s new in SQL:2011”, ACM SIGMOD Record, vol. 41, no. 1, pp. 67–73, 2012.
[6] Microsoft. (Apr. 2018). Transactions, [Online]. Available: https://docs.microsoft.com/en-us/sql/t-sql/language-elements/transactions-transact-sql?view=sql-server-2017.
[7] Oracle. (Feb. 2018). Berkeley DB programmer’s reference guide, [Online]. Available: https://docs.oracle.com/cd/E17276_01/html/programmer_reference/.
[8] G. S. Reddy, R. Srinivasu, M. P. C. Rao, and S. R. Rikkula, “Data warehousing, data mining, OLAP and OLTP technologies are essential elements to support decision-making process in industries”, International Journal on Computer Science and Engineering, vol. 2, no. 9, pp. 2865–2873, 2010.
[9] Y. Cao, C. Chen, F. Guo, D. Jiang, Y. Lin, B. C. Ooi, H. T. Vo, S. Wu, and Q. Xu, “ES2: A cloud data storage system for supporting both OLTP and OLAP”, in Data Engineering (ICDE), 2011 IEEE 27th International Conference on, IEEE, 2011, pp. 291–302.
[10] W. T. Hardgrave, “Distributed database technology: An assessment”, Information & Management, vol. 1, no. 3, pp. 157–167, 1978.
[11] A. P. Sheth and J. A. Larson, “Federated database systems for managing distributed, heterogeneous, and autonomous databases”, ACM Computing Surveys (CSUR), vol. 22, no. 3, pp. 183–236, 1990.


[12] G. Thomas, G. R. Thompson, C.-W. Chung, E. Barkmeyer, F. Carter, M. Templeton, S. Fox, and B. Hartman, “Heterogeneous distributed database systems for production use”, ACM Computing Surveys (CSUR), vol. 22, no. 3, pp. 237–266, 1990.
[13] T. M. Ghanem and W. G. Aref, “Databases deepen the web”, Computer, vol. 37, no. 1, pp. 116–117, 2004.
[14] J. A. Blakeley, “Data access for the masses through OLE DB”, in ACM SIGMOD Record, ACM, vol. 25, 1996, pp. 161–172.
[15] C. G. Rick Byham. (Mar. 2017). Linked servers (database engine), [Online]. Available: https://docs.microsoft.com/en-us/sql/relational-databases/linked-servers/linked-servers-database-engine.
[16] MSDN. (Jul. 2016). Distributed transactions overview, [Online]. Available: https://msdn.microsoft.com/en-us/library/windows/desktop/ms681205(v=vs.85).aspx.
[17] M. Hausenblas and J. Nadeau, “Apache Drill: Interactive ad-hoc analysis at scale”, Big Data, vol. 1, no. 2, pp. 100–104, 2013.
[18] Facebook. (Feb. 2018). PrestoDB, [Online]. Available: https://prestodb.io/docs/.
[19] V. Giannakouris, N. Papailiou, D. Tsoumakos, and N. Koziris, “MuSQLE: Distributed SQL query execution over multiple engine environments”, in Big Data (Big Data), 2016 IEEE International Conference on, IEEE, 2016, pp. 452–461.
[20] Airbnb. (Feb. 2018). Airpal, [Online]. Available: http://airbnb.io/airpal/.
[21] Apache. (Feb. 2018). Apache Drill, [Online]. Available: https://drill.apache.org/.
[22] F. Wang, L. Yao, and G. Luo, Data Management and Analytics for Medicine and Healthcare. Springer, 2017.
[23] B. Kolev, P. Valduriez, C. Bondiombouy, R. Jiménez-Peris, R. Pau, and J. Pereira, “CloudMdsQL: Querying heterogeneous cloud data stores with a common language”, Distributed and Parallel Databases, vol. 34, no. 4, pp. 463–503, 2016.
[24] C. Bondiombouy, B. Kolev, P. Valduriez, and O. Levchenko, “Extending CloudMdsQL with MFR for big data integration”, in BDA: Bases de Données Avancées, 2016.
[25] M. Stonebraker and U. Cetintemel, “‘One size fits all’: An idea whose time has come and gone”, in Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on, IEEE, 2005, pp. 2–11.
[26] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. Zdonik, “The BigDAWG polystore system”, ACM SIGMOD Record, vol. 44, no. 2, pp. 11–16, 2015.
[27] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, 2011.
[28] Oracle, Chapter 15: Alternative storage engines. [Online]. Available: https://dev.mysql.com/doc/refman/5.7/en/storage-engines.html.
[29] J. Han, E. Haihong, G. Le, and J. Du, “Survey on NoSQL database”, in Pervasive Computing and Applications (ICPCA), 2011 6th International Conference on, IEEE, 2011, pp. 363–366.
[30] Redis. (May 2018). Redis commands, [Online]. Available: https://redis.io/commands.

A Protobuf Model

Listing A.1: Protobuf model

syntax = "proto2";
package cellapp;

message state {
    required int32 CellAppId = 1;

    required string Field1 = 2;
    required string Field2 = 3;
    required string Field3 = 4;
    required string Field4 = 5;
    required string Field5 = 6;
    required string Field6 = 7;

    required int32 Field7 = 8;
    required int32 Field8 = 9;
    required int32 Field9 = 10;
    required int32 Field10 = 11;
    required int32 Field11 = 12;
    required int32 Field12 = 13;
    required int32 Field13 = 14;
    required int32 Field14 = 15;
    required int32 Field15 = 16;
    required int32 Field16 = 17;

    required bool Field17 = 18;
    required bool Field18 = 19;
    required bool Field19 = 20;
    required bool Field20 = 21;
}

message neighbor {
    required int32 CellAppId = 1;
    required int32 NeighborId = 2;

    required string Field1 = 3;
    required string Field2 = 4;
    required string Field3 = 5;
    required string Field4 = 6;
    required string Field5 = 7;
    required string Field6 = 8;

    required int32 Field7 = 9;
    required int32 Field8 = 10;
    required int32 Field9 = 11;
    required int32 Field10 = 12;
    required int32 Field11 = 13;
    required int32 Field12 = 14;
    required int32 Field13 = 15;
    required int32 Field14 = 16;
    required int32 Field15 = 17;
    required int32 Field16 = 18;

    required bool Field17 = 19;
    required bool Field18 = 20;
    required bool Field19 = 21;
    required bool Field20 = 22;
}
