
MSc thesis
Master’s Programme in Computer Science

Integration of SQL and NoSQL systems

Olli-Pekka Lindström

December 24, 2020

Faculty of Science
University of Helsinki

Supervisor(s)

Prof. Jiaheng Lu

Examiner(s)

Contact information

P. O. Box 68 (Pietari Kalmin katu 5)
00014 University of Helsinki, Finland

Email address: [email protected].fi
URL: http://www.cs.helsinki.fi/

HELSINGIN YLIOPISTO – HELSINGFORS UNIVERSITET – UNIVERSITY OF HELSINKI

Tiedekunta — Fakultet — Faculty: Faculty of Science
Koulutusohjelma — Utbildningsprogram — Study programme: Master’s Programme in Computer Science

Tekijä — Författare — Author: Olli-Pekka Lindström

Työn nimi — Arbetets titel — Title: Integration of SQL and NoSQL database systems

Ohjaajat — Handledare — Supervisors: Prof. Jiaheng Lu

Työn laji — Arbetets art — Level: MSc thesis
Aika — Datum — Month and year: December 24, 2020
Sivumäärä — Sidoantal — Number of pages: 40 pages, 41 appendix pages

Tiivistelmä — Referat — Abstract

Until recently, database management systems focused on the relational model, in which data are organized into tables with columns and rows. Relational databases are known for the widely standardized Structured Query Language (SQL), ACID transactions, and strict data schemas.

However, with the introduction of Big Data, relational databases became too heavy for some use cases. In response, NoSQL databases were developed. The four best-known categories of NoSQL databases are key-value, document, column family, and graph databases. NoSQL databases impose fewer data consistency control measures to make processing more efficient.

NoSQL databases have not replaced SQL databases in the industry. Many legacy applications still use SQL databases, and newer applications also often require the stricter and more secure data processing of SQL databases. This is where the idea of SQL and NoSQL integration comes in. There are two mainstream approaches to combining the benefits of SQL and NoSQL databases: multi-model databases and polyglot persistence. Multi-model databases are database management systems that store and process data in multiple different data models under the same engine. Polyglot persistence refers to the principle of building a system architecture that uses different kinds of database engines to store data. Systems implementing the polyglot persistence principle are called polystores.

This thesis introduces SQL and NoSQL databases and their two main integration strategies: multi-model databases and polyglot persistence. Some representative multi-model databases and polystores are presented. Finally, challenges and future research directions for multi-model databases and polyglot persistence are introduced and discussed.

ACM Computing Classification System (CCS):
General and reference → Document types → Surveys and overviews
Applied computing → Document management and text processing → Document management → Text editing

Avainsanat — Nyckelord — Keywords: SQL, NoSQL, multi-model database, polyglot persistence, polystore

Säilytyspaikka — Förvaringsställe — Where deposited: Helsinki University Library

Muita tietoja — Övriga uppgifter — Additional information: Software Systems study track

Contents

1 Introduction
  1.1 Context
  1.2 Thesis overview

2 Single-model database systems
  2.1 Relational databases
    2.1.1 Data organization and schema
    2.1.2 SQL
    2.1.3 Transactions (ACID)
    2.1.4 Scaling
    2.1.5 Considerations
  2.2 NoSQL databases
    2.2.1 Key-value stores
    2.2.2 Document stores
    2.2.3 Column family stores
    2.2.4 Graph stores
    2.2.5 Considerations

3 Multi-model databases
  3.1 Primary multi-model databases
    3.1.1 OrientDB
    3.1.2 ArangoDB
    3.1.3 Cosmos DB
  3.2 Secondary multi-model databases
    3.2.1 PostgreSQL
    3.2.2 Microsoft SQL Server

4 Polyglot persistence
  4.1 Representative systems
    4.1.1 BigDAWG
    4.1.2 PolyBase
  4.2 Considerations

5 Challenges and open problems of SQL and NoSQL integration
  5.1 Query languages and data processing
  5.2 Data modeling and schema design
  5.3 Evolution
  5.4 Consistency Control
  5.5 Extensibility
  5.6 Indexing
  5.7 Partitioning
  5.8 Standardization
  5.9 Performance

6 Conclusions

Bibliography

1 Introduction

Relational database management systems have been a major part of software development for decades. They support complex data queries using the standardized query language SQL and ensure high data integrity by enforcing atomicity, consistency, isolation, and durability (ACID). However, these systems struggle with increasing volumes of data with a flexible structure. NoSQL (not only SQL) database systems discard schemas and ACID compliance for better performance and scalability. They can be used to read and write data more quickly and store larger amounts of data without having to define the data schema case by case.

Given increasingly complex business and user needs, it is not always possible to build a solution with only SQL or only NoSQL database systems. Therefore, the need for systems that utilize both kinds of databases or data models has been increasing lately, from legacy systems that cannot keep up with increasing amounts of data to data stores that would benefit from formalizing some of their parts. Having a single interface or system architecture for structured data and unstructured NoSQL database data makes development, maintenance, and data management easier.

1.1 Context

When it comes to Big Data, one of the biggest challenges is the Variety of data. With small data sets, it is usually enough to store data in one format. However, when data sets grow larger, certain operations become unmanageable with certain data formats. Therefore it is useful to have the possibility to store data in different formats, such as hierarchical, network, or relational formats, and to handle them effectively at the same time.

In the last few years, organizations have started to realize the importance of unstructured non-relational data. While relational data has its place in systems where regulating data content is critical, the increasing amount of data of all sizes and shapes calls for NoSQL systems’ help.

A famous slogan in the database community is ”One size does not fit all.” Single-model databases are optimized for the data model that they are built for. It is possible to simulate NoSQL features in relational databases and vice versa, but performance can become a bottleneck with larger data sets when the features are not supported natively. Hence it would be desirable, in theory, that data processing mostly happens in a system that is best suited for the data model.

Organizations often spread data across different data storage engines in their solutions, sometimes for the aforementioned reason and sometimes for others. It is, of course, always possible to connect systems manually. However, managing those connections can become very time-consuming as the number of systems increases.

1.2 Thesis overview

This thesis covers the basics of SQL and NoSQL database systems, why it is sometimes necessary or helpful to combine features of such systems, and options for doing so. The focus is on integrating SQL and NoSQL features into one logical system or engine, with special emphasis on bringing together the varying data models used to store data in such systems. The two mainstream approaches for handling multiple data models simultaneously are multi-model databases and polyglot persistence. Multi-model databases are singular engines that have built-in support for multiple models of data, allowing data to be stored in multiple different formats. Polyglot persistence refers to utilizing multiple different kinds of database systems in one software architecture. Systems implementing the polyglot persistence principle are called polystores: system architectures that build an integrated interface over existing database systems.

Chapter 2 looks at single-model SQL and NoSQL systems separately. Afterward, the two main strategies to integrate or combine the two kinds of database systems are introduced: multi-model databases and polyglot persistence. Chapters 3 and 4 go in-depth with multi-model databases and polyglot persistence, respectively, introducing the concepts and some representative systems. Chapter 5 explains some challenges of polyglot persistence and multi-model database systems and explores current and future research directions. Chapter 6 concludes the thesis by summarizing the main takeaways.

2 Single-model database systems

The most popular database management systems for handling structured data are based on the relational model. On the other hand, NoSQL systems are the more popular solution for dealing with unstructured data. Both are designed to handle the respective types of data as efficiently as possible. Having a single model for an entire application’s data also makes development, management, and maintenance more straightforward. This chapter covers the basics of database systems that focus on one data model. A distinction is made between SQL and NoSQL databases. SQL databases, also known as relational databases, are explained first, emphasizing a few key concepts that are later used to compare SQL and NoSQL databases. The second section introduces NoSQL databases and their four main categories.

2.1 Relational databases

Relational database management systems are the traditional database systems that manage data using the relational model [3]. They usually use the standardized Structured Query Language (SQL) for queries. For this reason, they are often also called SQL databases. The relational model organizes data into tables, which describe types of entities, such as users or products.

Relational databases often enforce strict rules for data management. First, the schema prevents erroneous data from entering the database. This covers data type conflicts as well as additional constraints defined specifically for an application to function properly. Secondly, the use of ACID (Atomicity, Consistency, Isolation, Durability) transactions ensures data validity across data manipulations in case of computational errors and system failures.

Some popular relational database management systems are PostgreSQL, Microsoft SQL Server, Oracle, and MySQL [4]. Oracle and SQL Server are commercial database systems, and PostgreSQL and MySQL are open-source solutions.

2.1.1 Data organization and schema

The data in relational databases are organized into tables with rows and columns. Tables describe the types of entities. Rows represent the instances of the entities, and columns specify their attributes. Columns are defined with a data type such as text, number, or date. These organizational rules, along with any other integrity or structural constraints, form the database schema.

A variety of constraints are often used to define the data structure in tables. Primary keys, which identify each row uniquely, can be defined as singular values or as a combination of columns. Foreign keys specify relationships with other tables. The column values can also be controlled with manual constraints such as an upper limit for a number field.

The schema definition in relational databases is strict. The database rejects any data that does not fit the schema. For example, all rows of a table have to conform to the data types of each column.
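To make this concrete, here is a minimal sketch of such a schema in standard SQL (the table and column names are illustrative, not taken from any particular application):

CREATE TABLE users (
    id        INTEGER PRIMARY KEY,          -- primary key identifies each row uniquely
    firstname VARCHAR(100) NOT NULL,
    age       INTEGER CHECK (age >= 0)      -- manual constraint on a number field
);

CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users (id),  -- foreign key: relationship to the users table
    total   NUMERIC(10, 2)
);

An INSERT whose values violate any of these definitions, for example a negative age, is rejected by the database.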

2.1.2 SQL

Most relational database management systems use Structured Query Language (SQL) as the means of managing data. It is used to find, add, update, and remove data with the SELECT, INSERT, UPDATE, and DELETE commands. It also allows defining the data schema by adding or removing tables or altering table columns or constraints. SQL is also efficient in querying relationships across multiple tables with the JOIN operation. The variety of options and the expressiveness of the language facilitate complex queries. It is the industry standard for relational database languages, and most database developers are familiar with it. Figure 2.1 describes an SQL query that returns all rows in the USERS table where the column ”firstname” has the value ”Jane.”

Figure 2.1: Example SQL query

SELECT * FROM USERS WHERE firstname = 'Jane'
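As an illustration of querying relationships with JOIN, the following sketch reuses the illustrative users and orders tables from section 2.1.1 (again, the names are made up):

SELECT u.firstname, o.total
FROM users u
JOIN orders o ON o.user_id = u.id   -- follow the foreign key from orders to users
WHERE u.firstname = 'Jane';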

2.1.3 Transactions (ACID)

Relational databases are often used to enforce strict rules for data management and modifications. The concept of transactions facilitates these rules and helps to minimize anomalies. A transaction is a set of data manipulation operations that form a single unit of work [8]. A famous example of a transaction is the bank account transfer. When funds are transferred from one bank account to another, the amount is subtracted from one account and added to the other. If the transfer is run through a transaction, it is guaranteed that the transfer will not result in an illegal state where the amount of funds is only changed for one account.

Transactions in relational databases are often described with the acronym ACID. It refers to Atomicity, Consistency, Isolation, and Durability.

Transactions consist of multiple operations that are meant to be executed as a single logical unit. Either all operations are executed successfully, or none of them are executed at all. This is called atomicity. If one of the operations fails, all previously succeeded operations are restored to the data state before the transaction. This is known as a rollback.

Transactions always preserve the consistency of the data. Transactions can only commit to a consistent state or roll back into a previous consistent state.

The changes of a transaction are not reflected in the database until the transaction is complete. It means that the intermediate data state cannot be seen by users or used by other transactions until then. In other words, concurrent transactions run in isolation.

The database management system must guarantee that the changes of committed transactions are actually reflected in the database and survive any malfunctions such as system failures. Successful transactions must be written to persistent storage, which ensures durability.
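The bank transfer example might look like the following in SQL (a sketch; the accounts table and the amounts are made up):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- subtract from one account
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- add to the other
COMMIT;

If either UPDATE fails, or ROLLBACK is issued instead of COMMIT, neither change is applied.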

2.1.4 Scaling

SQL databases are usually run on a single server. When applications grow, the usual scaling solution is to switch to better hardware. This is called vertical scaling. However, there is always an upper limit for a single machine’s performance, whereas no such limit may exist for the amount of data. Another option for scaling is adding read replicas. This means copying your database to other servers to distribute the workload of data reads across multiple machines. However, this adds overhead in keeping the database copies synchronized.

2.1.5 Considerations

Using SQL databases is a good option if you need to keep your data consistent. This is ensured by the schema definition and the fact that the database rejects faulty data. Another good reason is if you have well-defined relationships for your data and need complex queries across them. SQL is a well-known and flexible language that allows complex queries. Relational databases also handle relationships between data naturally with primary and foreign keys and JOIN queries.

SQL databases may not be optimal if you need to process large amounts of data. The transactional data processing and enforcing the schema introduce much processing overhead. Scaling the application is also difficult and limited. The strict schema can also be a problem if your data has a varying or uncertain structure. It may require much manual work to resolve the issues between incoming data and the schema.

2.2 NoSQL databases

NoSQL is an umbrella term used to describe open-source, distributed, non-relational database systems where SQL-style querying is not the main focus. There is no official meaning for the NoSQL acronym. However, a popular one nowadays seems to be ”Not only SQL,” perhaps because the variety of systems considered under the term has increased since its inception. Even though there is no strict set of features for NoSQL systems, some common characteristics can be identified.

NoSQL database systems were developed in response to the increasing need for data structure flexibility and processing larger amounts of data. One of the common features of NoSQL databases is using simple and flexible data models. They often rely on unstructured or semistructured data, which means that the data can be completely schemaless, or it may have constraints or a fixed structure in certain parts and allow flexibility in others.

NoSQL databases are sometimes referred to as BASE (Basically Available, Soft-state, Eventually consistent). Basically Available refers to the system being available even though some parts are not. Soft-state indicates that the system tolerates inconsistent states and can be used at such times. After a while, the system will become consistent.

NoSQL databases often distribute data processing and storage across multiple servers. This strategy is called partitioning or sharding, and it allows efficient scalability in data throughput and volume while maintaining low latency. Adding servers to a system is called horizontal scaling [6].

NoSQL databases are usually divided into four categories based on their main data model: key-value, document, column family, and graph databases [9]. Most of them utilize the key-value principle in some form, but the main data model and storage practices, for example, vary somewhat.

2.2.1 Key-value stores

The key-value model is probably the simplest data model used in NoSQL databases. Each record is stored as a key-value pair where the key uniquely identifies the value, and the value can be anything. The values are stored as plain bytes with no reference to the data structure or contents, making them independent from one another and the keys the only way to retrieve them. Key-value stores have no way to define a data schema, but records can be arranged into collections. Key-value stores often rely on in-memory processing [9]. Popular key-value stores include Redis and Memcached [4].

Key-value stores often utilize hash-sharding to partition data [6]. It means creating a hash function for the keys and distributing the records into multiple servers based on the hash function values. This makes horizontal scaling easy. It only requires adding values to the hashing function and redistributing the data. Some key-value stores do this automatically.

Key-value stores are effective when an application focuses on a large number of small read and write operations on singular values or continuous streaming [10]. Reading and writing are fast because the values are saved in a raw format, and it is the application’s responsibility to interpret them. In-memory processing also helps. For example, key-value stores are efficient with large multimedia objects such as images or audio [2]. Key-value stores may not be the best choice if complex queries are required because queries beyond single value lookups are not supported [6]. The data model can also be too simple for some applications.
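As a small illustration of the key-value interface, the following uses Redis command syntax (the key and value are made up):

SET user:42 '{"firstname": "Jane"}'
GET user:42

The value comes back exactly as it was stored; any structure inside it is invisible to the database and must be interpreted by the application.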

2.2.2 Document stores

The document data model considers documents as the basic record unit. Documents can be groupings of structured or semistructured data [10]. Document stores can be considered a subcategory of key-value stores since they also store values for keys. However, they also read the data’s internal structure to allow more flexible processing and querying and further optimization. Popular document databases include MongoDB, CouchDB, Google Cloud Firestore, and Firebase Realtime Database [4]. A popular way to store documents is using the JSON∗ (JavaScript Object Notation) format. A JSON document, or object, consists of a set of key-value pairs where the keys are attribute names, and the values can be atomic values, lists, sets, tuples, or objects, for example. Each document can belong to a collection that describes the type of documents it contains. Figure 2.2 describes a MongoDB† document in JSON format. It contains a nested object in the ”address” field and an array in the ”hobbies” field. Arrays can also be composed of object values. Other well-known document formats include XML, YAML and BSON (Binary JSON) [10].

Figure 2.2: Example MongoDB document in JSON format

{
  "_id": "5cf0029caff5056591b0ce7d",
  "firstname": "Jane",
  "lastname": "Wu",
  "address": {
    "street": "1 Circle Rd",
    "city": "Los Angeles",
    "state": "CA",
    "zip": "90404"
  },
  "hobbies": ["surfing", "coding"]
}

In contrast to key-value stores, the semistructured nature of document data generally allows querying with search conditions on the fields and creating indexes on them. It simultaneously allows flexibility in the field definitions of each object.

∗https://www.json.org/
†https://www.mongodb.com/

The number of fields and the field names can vary from object to object, and queries are not affected by this. Queries with fields that are not a part of certain objects can leave those objects out of the results [10]. Figure 2.3 describes a MongoDB query that retrieves records from the users collection and filters them by the ”firstname” field.

Figure 2.3: Example MongoDB query

db.users.find({ "firstname" : "Jane" })

Document stores are a good solution for applications that use structured data with some variance in data types or in the number of columns [10]. They provide a good middle ground by structuring the data where necessary and allowing flexibility elsewhere. They are also a popular solution with web-based applications where the JSON data type is a natural fit.

2.2.3 Column family stores

Column family stores, also known as column-oriented stores, extensible record stores, and wide-column stores [9, 6], store data in tables similar to relational databases. However, instead of fixed column definitions, column family database tables consist of rows with an arbitrary amount of key-value pairs representing columns. This model can be seen as a two-dimensional key-value store. Popular wide-column databases include Cassandra, HBase, Google Cloud Bigtable, and Accumulo [4].

Column family stores partition columns into column families. Entire data rows are not located together on disk. Instead, only values of the same column family are located together, and the rest of the row can reside elsewhere [6]. While the number of columns is flexible, the column families often have to be strictly defined [9]. Column family stores also model empty values effectively in terms of storage space because a missing column key-value pair represents an empty value, removing the need to reserve space for it.

Cassandra is a popular open-source column family datastore [1]. Like relational databases, it stores data in tables with columns and rows. However, relational modeling is not supported in Cassandra. For example, you cannot use foreign keys or table joins in queries. Instead, the table structure design is query-driven. The tables are built in a way that every query from an application can be answered by retrieving data from one table. This leads to data duplication across multiple tables and data denormalization. Cassandra uses its own Cassandra Query Language (CQL) for queries. Its syntax is similar to SQL, but it only contains a subset of clauses, such as SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT. It also has limitations on filtering and sorting.

Another popular column family database management system is Accumulo∗, also known as a text database. It is used to store data in a key-value manner where the keys have multiple attributes. Figure 2.4 describes the data structure in Accumulo. Each data object has an identifier and a timestamp, as well as three separate column attributes. Column Family describes a primary grouping level for the object. It can be used to partition data. Column Qualifier is a more specific attribute. Column Visibility is a label to control accessibility to the object.

Figure 2.4: Data structure in Accumulo

Column family stores are effective with applications that often retrieve certain parts of objects at a time [6]. The partitioning based on column families makes this efficient. The model is also suitable when the number of attributes in the data varies greatly [9].
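To illustrate Cassandra’s query-driven table design mentioned above, here is a minimal CQL sketch (the table and the query it serves are made up; in a real application the same data would typically be duplicated into other tables serving other queries):

CREATE TABLE users_by_city (
    city    text,
    user_id uuid,
    name    text,
    PRIMARY KEY (city, user_id)   -- city is the partition key
);

SELECT name FROM users_by_city WHERE city = 'Helsinki';

The table is designed so that this one query can be answered from it alone.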

2.2.4 Graph stores

The graph data model is based on the mathematical concept of graphs. Graphs consist of nodes that are connected to one another by edges. Graph stores store data in node and edge objects. Nodes represent the main data objects, and edges describe relationships between them. Both can have a varying number of attributes to provide additional information [10].

Neo4j† is a popular graph database. It uses the Cypher query language to query graphs. Figure 2.5 illustrates a Cypher graph query. This query finds the Company that a Person called Jennifer works for. First, it finds Jennifer’s Person node. Then it follows the WORKS_FOR relationship to find the company. Finally, it returns the company node.

∗https://accumulo.apache.org/
†https://neo4j.com

Figure 2.5: Example Cypher query in Neo4j

MATCH (:Person {name: 'Jennifer'})-[:WORKS_FOR]->(company:Company)
RETURN company

Other popular database management systems utilizing graph structures include OrientDB, ArangoDB, and Microsoft Azure Cosmos DB [4].

Graph stores are optimal when your application relies on relationships [9, 10]. While relational databases also handle relationships efficiently and flexibly, joining data becomes a performance bottleneck when queries need to address recursive relationships, for example, friends of friends. Graph stores can perform these deep traversals in constant time, which makes them an effective solution for social network or recommendation applications, for example. However, graph stores are not well suited as an all-encompassing solution for large-scale data processing applications [10]. Instead, they are often used as a separate component specifically for processing relationships.

Other NoSQL datastores

In addition to the categories Key-value, Document, Column family, and Graph, NoSQL databases have also been further divided into other types, such as Time series, Search engine, and Multivalue [4]. They can often be seen as subtypes of the main categories, being optimized towards a certain use case, similarly to how document databases are a special case of key-value databases. For example, Multivalue database management systems store data in tables similarly to relational databases but allow assigning multiple values to a single attribute of a record. SciDB∗ is a database, also known as an array database, that is specialized for scientific data management. Its data model is built around vectors and multidimensional arrays.

2.2.5 Considerations

NoSQL databases usually rely on horizontal scaling. Their infrastructure is usually built in a way that makes adding and managing partitions simple. Good examples are hash-sharding in key-value databases and column family partitioning in column family databases. Some document and column family databases also support range-sharding, where data is partitioned into value ranges [6].

∗https://www.paradigm4.com/technology/scidb-platform-overview/

Using a NoSQL database is a good option if your application relies on simple reads and writes, but there are a lot of them. The data models are centered around the key-value principle, and complex queries are not supported. The less strict consistency requirements also reduce performance overhead. A NoSQL database can also be a natural option if the data model fits the application well. For example, processing complex relationships in a graph datastore is easy.

However, if the scale of an application is small or the scope is unknown, it may be better to use an SQL database. The flexibility of the relational model and SQL can be used to better answer changing business and application needs. The consistency guarantees are also good when the data is small.

3 Multi-model databases

Connecting existing SQL and NoSQL systems can be considered the natural direction for the integration of these systems. However, there is another, perhaps simpler option. Some systems can handle both SQL and NoSQL data models and features natively. These systems are called multi-model databases [12]. Managing both SQL and NoSQL data in a single data storage engine effectively reduces integration, migration, development, maintenance, and operational issues.

The majority of the most popular database engines are, in fact, multi-model databases, even though they are often associated with one model or the other. For example, the relational databases PostgreSQL and MySQL can be used to store document data, and Oracle and Microsoft SQL Server databases additionally support document and graph formats. Many NoSQL databases also support multiple data formats, such as OrientDB, ArangoDB, Couchbase, and Cosmos DB [4].

Many relational database systems started out as purely relational databases and implemented support for NoSQL data types later. NoSQL databases are not usually considered to have complete relational data support as a secondary data format, but many emulate relational database features by supporting SQL or a similar query language, as well as ACID transactions.

Database engines have been grouped based on whether the system has one primary data model and other secondary models or whether multiple data models are considered primary in that system [4]. This separates systems specifically designed to offer multi-model data storage from systems that focus mainly on one data model and provide support for others as additional features. Let us call the former systems primary multi-model databases and the latter secondary multi-model databases. The next sections introduce a few notable representatives of each category. Primary multi-model databases are introduced at an overview level, and, in the case of secondary multi-model databases, the functionality of the secondary data models specifically is explained.

3.1 Primary multi-model databases

3.1.1 OrientDB

OrientDB∗ stores document, graph, and key-value data. Documents consist of key-value pairs and are stored in classes or clusters. Classes and clusters can be vertices of a graph. OrientDB uses SQL to query and update data.

OrientDB’s basic data unit is a record that can represent various data types such as byte (BLOB), document, vertex (node), or edge. Documents contain the mandatory attributes database key, version number, and class. A class defines the schema type for the record, which can be ”schemaless,” ”schema-full,” or ”mixed.” Records are grouped into clusters. There is one cluster per class by default. Documents can have weak or strong relationships, such as embedded documents. Nodes also have the same form as documents, but edges can be stored in a weaker form to improve performance at the cost of additional attributes.

OrientDB supports indexing and transactions. There are four types of indexes: SB-Tree (based on the B-tree with some optimizations added), hash, auto-sharding (for distributed systems), and Lucene (full-text and spatial). Transactions have full ACID properties and two isolation levels: read committed and repeatable read [17].
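For illustration, a few commands in OrientDB’s SQL dialect (a sketch; the class and property names are made up):

CREATE CLASS Person EXTENDS V
CREATE VERTEX Person SET name = 'Jane'
SELECT FROM Person WHERE name = 'Jane'

The first command defines a document class that is also a graph vertex type, showing how the document and graph models share the same records.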

3.1.2 ArangoDB

ArangoDB† stores document, graph, and key-value data. Documents are stored in collections that can be arranged by their structure into shapes. ArangoDB uses a query language called AQL.

The basic data unit in ArangoDB is a document. Documents can consist of any number of attributes with simple or complex value types. Documents are organized into collections. A collection can be a node or an edge in a graph. Each document contains the database key, collection key, and version attributes. Documents of an edge collection also contain ”from” and ”to” attributes to represent the graph connection between documents.

ArangoDB supports indexing and transactions. Collections and graph edges use hash indexes based on the document’s key field. Other fields can also have hash, skiplist, geo, full-text, or sparse indexes [17].

∗https://www.orientdb.org/
†https://www.arangodb.com/
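For illustration, a small AQL query (a sketch; the collection and attribute names are made up):

FOR u IN users
  FILTER u.firstname == 'Jane'
  RETURN { name: u.firstname, city: u.address.city }

The same language also covers graph traversals over edge collections.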

3.1.3 Microsoft Azure Cosmos DB

Microsoft Azure Cosmos DB∗ is a fully managed NoSQL database service, which focuses on global data distribution, high availability, automatic scaling, and low latency. It hosts a variety of partly separate database APIs that each focus on one data model. All data in Cosmos DB is stored in the JSON format, but the APIs determine how the data is interacted with. Each Cosmos DB instance runs on one of these APIs, and multi-model queries between instances, for example, are not supported natively [20]. However, each API takes advantage of Cosmos DB’s core features, such as distribution and scaling. One of the main benefits of the Cosmos DB APIs is that applications written for a specific database system can also communicate with the respective Cosmos DB API. For example, an application that uses a Cassandra datastore can switch to the Cosmos DB Cassandra API store by only changing connection properties.

The primary database API in Cosmos DB is the CORE/SQL API. It is used to query the JSON data with a subset of SQL. The standard clauses SELECT, FROM, WHERE, GROUP BY, and ORDER BY are supported. Fields and nested objects are accessed with the dot notation. The Cassandra API allows querying data with the Cassandra Query Language (CQL). The MongoDB API provides standard MongoDB functionality and queries. The Gremlin API allows graph queries with the Gremlin† language, and the Table API provides OData and LINQ interfaces.
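For illustration, a CORE/SQL API query over JSON documents like the one in figure 2.2 might look as follows (a sketch; c is the conventional alias for the queried container):

SELECT c.firstname, c.address.city
FROM c
WHERE c.address.city = 'Los Angeles'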

3.2 Secondary multi-model databases

3.2.1 PostgreSQL

PostgreSQL‡ is a popular open-source relational database engine that has also implemented support for JSON data. JSON can be stored in a plain text format or in a decomposed binary format. The former is called ”json” and the latter ”jsonb.”

∗https://azure.microsoft.com/en-gb/services/cosmos-db/
†http://tinkerpop.apache.org/
‡https://www.postgresql.org

Jsonb removes insignificant white spaces and duplicate keys from the objects, for example. It also supports indexing, which plain json does not [18].

JSON documents are stored in tables as field values, like any other data type. They can be retrieved normally in SELECT queries. Additionally, they can be decomposed further with the -> and ->> operators. The operator -> returns a JSON object field by key, and ->> returns a JSON object field as text. They can be chained to return values of nested objects, and they can be used in filtering operations as well. For example, given the info field in the orders table in figure 3.1, the query in figure 3.2 returns the values Beer, Diaper, Toy Car, and Toy Train as text [19].

Figure 3.1: Example JSON field contents in PostgreSQL
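The figure itself is not reproduced here; info column contents consistent with the query in figure 3.2 might look like the following (the customer names and quantities are made up):

{"customer": "John Doe",     "items": {"product": "Beer",      "qty": 6}}
{"customer": "Lily Bush",    "items": {"product": "Diaper",    "qty": 24}}
{"customer": "Josh William", "items": {"product": "Toy Car",   "qty": 1}}
{"customer": "Mary Clark",   "items": {"product": "Toy Train", "qty": 2}}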

Figure 3.2: Example JSON query in PostgreSQL

SELECT info -> 'items' ->> 'product' AS product
FROM orders
ORDER BY product;

3.2.2 Microsoft SQL Server

Microsoft SQL Server∗ is another popular relational database engine. In addition to the relational model, it also supports storing document data in XML and JSON formats and defining graph structures in tables.

SQL Server stores JSON as plain text in NVARCHAR fields. It also provides a variety of functions for handling JSON data. The function ISJSON tests whether a given text is in a valid JSON format. The JSON_VALUE function extracts a singular value, and the function JSON_QUERY extracts an object or an array from JSON text. The function JSON_MODIFY can be used to update the values of properties in JSON text.

SQL Server also provides mechanisms to transform JSON text into relational tables and vice versa. The clause FOR JSON converts table query results into JSON. FOR JSON AUTO can be used to format the resulting JSON automatically, and FOR JSON PATH allows controlling the structure of the resulting JSON. The clause is placed at the end of SELECT queries. The function OPENJSON is used to convert JSON into a relational table format. By default, the function returns one row per property, where the keys are placed into the ”key” column and the values into the ”value” column. An OPENJSON query also allows defining the result schema, which can be used to return tables where the columns correspond to the keys, and the values are placed into them [15].

∗https://www.microsoft.com/en-us/sql-server
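A small sketch of the scalar functions in Transact-SQL (the JSON contents are made up):

DECLARE @j NVARCHAR(MAX) = N'{"firstname": "Jane", "address": {"city": "Los Angeles"}}';

SELECT ISJSON(@j)                    AS is_valid,   -- 1
       JSON_VALUE(@j, '$.firstname') AS firstname,  -- Jane
       JSON_QUERY(@j, '$.address')   AS address;    -- {"city": "Los Angeles"}

SELECT JSON_MODIFY(@j, '$.firstname', 'Janet');     -- returns the updated JSON text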

Figure 3.3: OpenJson query without schema definition

DECLARE @json NVARCHAR(2048) = N'{
  "String_value": "John",
  "DoublePrecisionFloatingPoint_value": 45,
  "DoublePrecisionFloatingPoint_value": 2.3456,
  "BooleanTrue_value": true,
  "BooleanFalse_value": false,
  "Null_value": null,
  "Array_value": ["a","r","r","a","y"],
  "Object_value": {"obj":"ect"}
}';

SELECT * FROM OpenJson(@json);

SQL Server can also utilize graph processing [7]. It allows creating tables that are modeled as nodes or edges. Node tables represent entities, which edge tables connect. Edge tables can be strengthened with edge constraints, guaranteeing that an edge table row connects two node table rows in the database. The constraints can also be created with DELETE CASCADE triggers that delete the connecting edges when the corresponding nodes are deleted. The graph structure can be queried with the MATCH clause. It is used for traversing the graph by giving an edge table, the two connected node tables, and a direction as inputs. Additionally, the SHORTEST_PATH function can be used inside the MATCH clause to find the shortest path between two given nodes.

Figure 3.4: Converting nested JSON into a relational table with OpenJson and schema definition

DECLARE @json NVARCHAR(MAX) = N'[
  {
    "Order": { "Number": "SO43659", "Date": "2011-05-31T00:00:00" },
    "AccountNumber": "AW29825",
    "Item": { "Price": 2024.9940, "Quantity": 1 }
  },
  {
    "Order": { "Number": "SO43661", "Date": "2011-06-01T00:00:00" },
    "AccountNumber": "AW73565",
    "Item": { "Price": 2024.9940, "Quantity": 3 }
  }
]'

SELECT * FROM OpenJson(@json)
WITH (
  Number   VARCHAR(200)  '$.Order.Number',
  Date     DATETIME      '$.Order.Date',
  Customer VARCHAR(200)  '$.AccountNumber',
  Quantity INT           '$.Item.Quantity',
  [Order]  NVARCHAR(MAX) AS JSON
)

Figure 3.5: Example graph table structure in SQL Server

Figure 3.6: Create a graph node and edge with constraint in SQL Server

CREATE TABLE Person (
  ID INTEGER PRIMARY KEY,
  Name VARCHAR(100),
  Age INT
) AS NODE;

CREATE TABLE Friends (StartDate DATE) AS EDGE;

ALTER TABLE Friends ADD CONSTRAINT EC_FRIEND1
  CONNECTION (Person TO Person);

Figure 3.7: Example graph query with the MATCH clause in SQL Server

-- Find friends of John
SELECT Person2.Name
FROM Person Person1, Friends, Person Person2
WHERE MATCH(Person1-(Friends)->Person2)
AND Person1.Name = 'John';

4 Polyglot persistence

The different kinds of application needs have led to the development of database management systems that effectively solve particular problems. Relational databases handle structural data with strict rules and transaction support, while NoSQL databases handle more data-intensive applications better. Complex applications often find themselves having data spread across different kinds of data storage engines for this reason. Then the problem of integrating the different storages effectively presents itself. One answer is polyglot persistence, and more specifically, polystores.

Polyglot persistence means storing data using multiple languages. In addition, storing data in multiple different models is also considered one of the main features of polyglot persistence. A system implementing polyglot persistence is often called a polystore. These kinds of systems have been further classified as polyglot, multistore, and polystore systems as well, but this thesis considers all such systems polystores for simplicity.

A polystore is an integrated distributed database system architecture that consists of multiple different kinds of database engines specializing in different data models. The idea is to combine the capabilities of various engines to form a system that can answer many kinds of requirements. The main implementation strategy is to build a middleware system on top of the database systems. Users can then use the middleware to access all of the systems and combine their data in an integrated manner [12].

Polystores consist of multiple data stores. The key feature is that the stores are distinct and accessed through their own interfaces. Database replication, for example, does not count. The data storages of a polystore are heterogeneous. This means that the different data storages each have a specific purpose concerning the data features and use cases. This separates polystores from distributed database management systems that are designed specifically to scale applications. Polystores are meant to take multiple use cases into account and combine them effectively.

Finally, polystores aim to integrate different systems so that they are managed and function effectively as a unit. Usually, an integration layer, or middleware, is introduced. It processes queries by distributing them to the appropriate databases and recombining the partial results. It sometimes also takes care of synchronization between the databases by propagating schema and data changes to ensure application-wide consistency and integrity. A key goal of polystore systems is that data processing happens in a system that is best suited for the type of data that is being processed at each time.

When designing a polystore system, it is important to select the correct types of NoSQL databases for the integrated architecture. For example, document stores are effective if you have lots of data with a similar structure. Graph stores should be used when there are lots of relations between the data items. Key-value stores are optimal if the data structure is managed elsewhere, and complex queries are not needed. Common motivations for polystore systems include:

• Analytical operations can be slow in SQL systems. A polystore architecture can select relational data and perform analytics in another type of database.

• Polystores offer data transparency. A working polystore system hides the details of which parts of the data reside in which physical locations and in which types of databases.

• Polystores can be an effective solution for offering system connectivity and additional features to legacy systems. There is often a need to keep legacy systems intact while developing new or more effective functionality.

4.1 Representative systems

4.1.1 BigDAWG

BigDAWG∗ is an open-source polystore implementation. It provides a common interface for three separate database engines: the relational database PostgreSQL, the array database SciDB, and the text database Accumulo [5].

The BigDAWG architecture consists of a top-level Common API, the database engines themselves, Islands, Shims, and Casts. Applications communicate with BigDAWG through the Common API. Islands are components that provide interfaces that target a specific data model on a general level. They can be queried with a specific query language or set of operations. Shims connect Islands to the database engines by transforming the queries from the Island query definition into the native queries in each database engine. Casts are used to move data between database engines, for example, when multi-model queries are used.

∗https://bigdawg.mit.edu/

Figure 4.1: BigDAWG architecture

BigDAWG currently supports three Islands. The Relational Island uses SQL to query PostgreSQL databases. The Array Island is used to query SciDB databases with SciDB’s query language AFL. The Text Island queries Accumulo using SQL or a language called D4M.

The BigDAWG Common API provides an interface over the Islands. It consists of four separate components: Planner, Executor, Monitor, and Migrator. The API also communicates with a separate component called the Catalog, which keeps track of all databases and object schema information.

The Catalog is a component that contains two Postgres databases: the main catalog database and a schema database. The catalog database contains engine, database, object, shim, and cast information, and the schema database stores object schema information. The engines table contains engine connection properties such as hostnames and ports. The database table contains the databases running in each engine and their credentials. The objects table contains the objects in each database and their field names. The schema database stores object schemas globally so that casting from one database type to another is made possible and does not break constraints.

The Planner coordinates query execution. It parses, plans, and optimizes queries. Performance information from the Monitor is utilized when optimizing queries. The Planner sends execution plans to the Executor. It also has a separate Training mode that allows collecting execution plan statistics before going to production. While in Training mode, the Planner analyzes all possible execution plans that produce the same result and sends them to the Monitor to gather all possible metric results. The fastest plan is then chosen amongst them. The proper production mode only analyzes the query’s features and directly asks the Monitor for the best plan.

The Executor executes queries from the Planner and Monitor. It traverses execution plans, issuing sub-queries to the different islands. It calls the Migrator when transferring data between islands is needed. It also sends performance data to the Monitor in Training mode.

The Migrator handles moving data between different databases. It queries the Catalog for the schema information needed when casting data from one island to another.

The Monitor tracks the execution times of query plans. It runs training executions to learn the execution time characteristics of query plans and to infer execution times for future queries.

Figure 4.2: Components of BigDAWG Common API

The BigDAWG query language uses a functional syntax. It defines five function tokens that are used to determine how the subqueries inside them are interpreted. The five function tokens are:

• bdrel – the query targets the relational island and uses PostgreSQL.

• bdarray – the query targets the array island and uses SciDB’s AFL query language.

• bdtext – the query targets the text island and uses either SQL or D4M.

• bdcatalog – the query targets the BigDAWG catalog using SQL.

• bdcast – the query is a cast operation for data migration between islands.

The bdrel function is used to query the Relational Island. It supports a subset of the SQL used in PostgreSQL. The operations supported are filtering, aggregating, sorting, and limiting. The Relational Island supports the integer, varchar, timestamp, double, and float data types.

bdrel(select * from mimic2v26.d_patients limit 4)

The bdarray function queries the Array Island and the array database SciDB. The queries are performed with a subset of SciDB’s Array Functional Language (AFL). It allows for projection, aggregation, cross-join, filter, schema reform, and sorting. It supports the string, int64, datetime, double, and float data types.

bdarray(filter(myarray,dim1 >150))

The bdtext function queries the text database Accumulo. The data are organized into tables in a key-value fashion, and the queries can be full table scans or range scans. The full query uses a JSON syntax to signify the table and range used.

bdtext({ 'op' : 'scan', 'table' : 'mimic_logs', 'range' : { 'start' : ['r_0001','',''], 'end' : ['r_0015','',''] } })

The bdcatalog function allows querying the Catalog database. You can query a certain Catalog table with a simpler table name syntax or the whole Catalog database freely with any applicable SQL query.

bdcatalog( catalog table name [ column name ] [, ...] )
or
bdcatalog( full query applied to the catalog database )

The bdcast function is used to migrate data from one Island to another. It takes a single island query and the destination island and schema as arguments.

bdrel(select * from bdcast( bdarray(filter(myarray,dim1 > 150)), tab6, '(i bigint, dim1 real, dim2 real)', relational))

This query first moves array data from SciDB to PostgreSQL and then retrieves it from there. The bdcast function tells the middleware to put the resultant array data into a table called ”tab6” with the schema ”i bigint, dim1 real, dim2 real”.

4.1.2 PolyBase

PolyBase is a system built to extend Microsoft SQL Server applications to allow processing SQL queries that read data from, and process data in, external data sources [14]. Initially, the supported external data sources were Hadoop∗ and Azure Blob Storage. Version 2019 extended the support to Oracle, MongoDB, and Teradata databases, as well as potentially any data source implementing the ODBC standard. PolyBase also allows pushing query computation down to the external systems and creating SQL Server clusters for parallel query processing.

PolyBase implements the polyglot persistence principle by allowing queries to target both an SQL Server instance and an external data source at the same time. Targeting external data sources works with external tables or external data sources, without the need to configure the target system. PolyBase supports a variety of data formats. In addition to relational database tables, PolyBase supports semi-structured document data, as well as unstructured non-relational tables such as delimited text data and Hadoop HDFS files.

PolyBase solves the problem of joining data from different sources effectively. Before, you had to move your data into one location or integrate and join your data in a client application. PolyBase has implemented full query support in SQL Server’s standard query language, Transact-SQL. With it, you can query external data as if querying relational database tables in SQL Server.

PolyBase implements computation pushdown in Hadoop. It means that the query optimizer can decide to perform query computations either in Hadoop or in the main SQL Server system. The decision is made by estimating the performance cost. Pushing down computation can utilize Hadoop’s optimized distributed computing with MapReduce jobs, for example. This can also be forced on or off in queries. PolyBase also allows parallel computing with multiple SQL Server instances.

The next section covers one of the main ways PolyBase integrates SQL and NoSQL systems. In version 2019, it started to provide unified and simultaneous access to SQL Server and MongoDB, i.e., SQL and NoSQL databases.

∗https://hadoop.apache.org/

MongoDB support

PolyBase implemented support for querying data from external MongoDB databases in version 2019. PolyBase uses external tables to query data from external data sources. First, you have to configure access to the external data source. It is done with a CREATE EXTERNAL DATA SOURCE command with location and credential information. The location consists of the external database type, the hostname of the external server, and the port that the external database instance is listening to. The database type determines the type of the external database, for which some other options are Oracle, Teradata, ODBC, and HDFS. Computation pushdown can also be enabled or disabled here. It is enabled by default. Figure 4.3 describes configuring MongoDB access in PolyBase.

Figure 4.3: Configure MongoDB access
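The original figure is not reproduced here; a configuration along these lines might look as follows in Transact-SQL (the hostname and credentials are made up):

CREATE DATABASE SCOPED CREDENTIAL MongoCredential
WITH IDENTITY = 'mongo_user', SECRET = 'mongo_password';

CREATE EXTERNAL DATA SOURCE MongoDbSource
WITH (
    LOCATION   = 'mongodb://mongoserver.example.com:27017',  -- database type, host, and port
    CREDENTIAL = MongoCredential,
    PUSHDOWN   = ON                                          -- computation pushdown (the default)
);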

After configuring the external data source, access to MongoDB collections is configured by creating external tables. In an external table, you first specify the data source created earlier. Then you specify the location, which is the collection name in the MongoDB database. The location can be just the collection name, or it can also include the database name and schema name. The full format is ”database name.schema name.collection name”. Finally, you specify all fields and their data types. It is also recommended to create statistics on external tables to enhance the performance of queries, especially ones containing joining, filtering, and aggregating. Figure 4.4 describes configuring a MongoDB external table in PolyBase.

Figure 4.4: Configure MongoDB external table
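The figure is not reproduced here; a sketch of such an external table (the field names follow the document in figure 2.2; the statistics step is optional but recommended):

CREATE EXTERNAL TABLE dbo.MongoUsers (
    [_id]     NVARCHAR(24) NOT NULL,
    firstname NVARCHAR(128),
    lastname  NVARCHAR(128)
)
WITH (
    LOCATION    = 'mydatabase.users',   -- database name and collection name
    DATA_SOURCE = MongoDbSource
);

CREATE STATISTICS stat_firstname ON dbo.MongoUsers (firstname) WITH FULLSCAN;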

MongoDB’s JSON data structure allows nesting objects and arrays inside objects. PolyBase handles nesting by flattening nested objects and arrays into additional columns. These columns have to be specified in the external table explicitly. The flattening rules are as follows (an illustration follows the list):

• Objects: For each key in a nested object, a new column is created. The column name is objectname_key.


• Arrays: First, a column is created to represent the array index. Its name is arrayname_index. Then another column is created for the values. If the array values are objects, new columns are created for all keys as before. For each value in an array, all result rows are duplicated.
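As promised above, here is an illustration with a hypothetical document and its flattened relational form (naming the value column after the array is an assumption for this sketch):

{ "name": "Jane", "hobbies": ["surfing", "coding"] }

name | hobbies_index | hobbies
Jane | 0             | surfing
Jane | 1             | coding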

PolyBase requires a strict schema for external tables, whereas MongoDB does not for its objects. When an object’s structure does not match the external table structure, it is rejected in queries. PolyBase allows specifying the number or percentage of rejected rows before queries fail. If an object has many nested arrays, the number of resulting rows in queries can get very high.

PolyBase also allows connecting to Cosmos DB instances. Cosmos DB provides a MongoDB interface, which can be used to configure access following the same steps as before.

4.2 Considerations

Polyglot persistence introduces additional complexity to applications in many ways. Integration and management of multiple different kinds of database engines result in additional operational costs. There is no unified query interface or language, which forces system components and developers to accommodate multiple ones. Data consistency must be ensured at the middleware or application level manually, while it is mostly automatic with single-model databases, especially relational ones. The middleware also has to consider that the separate database systems may be developed independently and adapt to changes in parts of the application. Access control between the different components and databases, or security in general, gets more complicated. Availability can also be an issue. Answering global queries can be problematic if there is downtime in one or more subsystems.

5 Challenges and open problems of SQL and NoSQL integration

Developing database systems that support relational and NoSQL data is difficult. Even though many promising approaches to SQL and NoSQL integration have been developed, all of them are far from optimal. There are many aspects that single-model database systems have optimized much further. It is worthwhile to study whether these optimizations would be applicable or feasible in multi-model databases and polystores as well. The need to consider multiple data models at the same time provides an additional challenge in adapting them.

In this chapter, various challenges and future research directions for multi-model databases and polystores are introduced. The concepts of intra-model and inter-model are distinguished. Intra-model refers to concepts and operations inside a single data model, and inter-model means pertaining to multiple models.

One interesting theoretical framework proposal is UDBMS (Unified DataBase Management System) [13]. Many multi-model databases and polystores focus on a limited set of data models, features, and use cases. UDBMS instead attempts to provide an all-encompassing solution for multi-model data management with several model-agnostic components and features, including data storage, query processing, indexing, in-memory processing, and transaction handling. Figure 5.1 depicts the UDBMS architecture.

5.1 Query languages and data processing

Even though many relational databases have started to support various NoSQL data models, they do not usually offer unified query languages to process data in all of the models. In contrast, various multi-model databases and polystores have been built to provide unified interfaces and engines over specific sets of data models. There are various languages capable of handling multiple models of data simultaneously, such as ArangoDB’s AQL for document and graph data and the BigDAWG query language for relational, array, and text data. We also previously discussed that the relational databases PostgreSQL and Microsoft SQL Server can query documents stored in relational tables using SQL.

Figure 5.1: UDBMS architecture

However, most multi-model query languages still fail to reach the same level of expressiveness and supported features as many single-model query languages. Moreover, the scope of the supported set of data models is also limited in these languages. It is an open challenge to develop a query language that supports all of the most common data models as completely as possible.

Another challenge is developing efficient query processing, planning, and optimization for multi-model queries. Relational databases have developed query planners that use various algorithms and statistical techniques to determine the best execution strategies for queries. However, they are optimized with the relational data schema in mind. Adding NoSQL data models into the mix presents new challenges with semi-structured or schemaless data, especially when queried simultaneously with structured relational data.

Another challenge in polystore architectures is to make the system hide the details about which database types are targeted by each query while at the same time supporting all features, or as many as possible, that each different type of database offers. It would be beneficial to have a query language that does not require specifying the different data models. For example, in the BigDAWG query language, you have to use the model-specifying function tokens at each turn.

5.2 Data modeling and schema design

Modeling multi-model data is difficult, especially when both relational and NoSQL data are present. For example, SQL databases often rely on data normalization, where data are divided into tables so that redundancy is minimal. NoSQL databases, on the other hand, often replicate and nest data for optimized querying (a concrete sketch of this contrast follows at the end of this section). An effective general data model for multi-model data has to balance these factors and consider the possibility of inter-model data redundancy. Another thing to consider is inter-model links between data items.

Good schema design is important in relational databases: for example, it allows more consistent query performance and application extensibility. NoSQL databases, in contrast, are usually schemaless and often denormalize data for performance. However, most of the time the applications using the databases expect a certain structure for the data even when no database-level schema is defined. Therefore, it would be beneficial to develop a schema model that balances these contradictory schema requirements effectively.

Another schema-related problem is schema inference. There are various approaches for inferring a schema within a single data model. However, in a multi-model database or a polystore, it would also be useful to be able to infer relationships and references between models [12].

UDBMS proposes a model-agnostic data storage model that stores objects as byte arrays that are not interpreted by the engine. They can represent data in any format, be it a relational data row with schema information or a document. It provides a generic interface that allows adding, updating, retrieving, and deleting objects. UDBMS also addresses schema issues with its Flexible Schema Manager component, which tracks data reads and writes, generates schema information based on them, and automatically updates it when necessary [13].
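As a minimal sketch of the normalization contrast, assuming PostgreSQL and purely illustrative table and column names, the same customer-order data can be stored normalized into two tables or denormalized into one document per order:

    -- Relational, normalized: each fact is stored once and
    -- orders reference customers by key.
    CREATE TABLE customers (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          SERIAL PRIMARY KEY,
        customer_id INT REFERENCES customers (id),
        total       NUMERIC
    );

    -- Document-style, denormalized: the customer is nested and
    -- duplicated inside every order document for fast reads.
    CREATE TABLE order_docs (
        id  SERIAL PRIMARY KEY,
        doc JSONB  -- e.g. {"total": 25.0, "customer": {"name": "Ada"}}
    );

A multi-model schema model would have to decide which of these representations is authoritative and how the redundant copy is kept consistent with it.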

5.3 Evolution

In the lifetime of a software system, the user requirements are bound to change at some point, and often the database's data structures and constraints need to change with them. Multi-model databases and polystores need ways to propagate the needed changes to the application upon completing evolution operations. Especially interesting here are inter-model evolution operations, such as changing the properties linking objects of different models or modifying data that is stored redundantly in several models at the same time. Other changes that may need to be propagated across multiple models include changes to queries, indexes, or even storage strategies. Inter-model evolution operations introduce at least the following challenges:

• Schema changes in one model sometimes need to be addressed or replicated in another model.

• The operations have to retain data consistency, which requires going through all inter-model links and redundant data.

• The operations have to be translated into the commands of each respective modeling language.

A possible solution is managing a global schema model. However, it would need to present all of the underlying local schema models as a coherent whole, and all executed local evolution operations would have to be propagated back to the global model.

Another challenge is the migration of the changes to the data itself. With large enough data sets, it may be inefficient to propagate changes immediately on every update. Instead, the data can be migrated only when it is needed in a query or update. For multi-model databases and polystores to function with this lazy migration strategy, schema versions need to be managed between models. This could be done by adding version numbers or timestamps to the records and rewriting queries so that data in each specific version can be queried [13].
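A minimal sketch of the record-versioning idea in plain SQL, with hypothetical table and column names: each row carries the schema version it was written under, and queries are rewritten so that every version still present in the table can be read without eagerly migrating old rows.

    -- Tag every row with the schema version it was written under.
    ALTER TABLE customer ADD COLUMN schema_version INT NOT NULL DEFAULT 2;

    -- Rewritten query that reads both versions lazily: version 1
    -- stored the name in one column, version 2 splits it in two.
    SELECT id,
           CASE WHEN schema_version = 1 THEN full_name
                ELSE first_name || ' ' || last_name
           END AS display_name
    FROM customer;

In a multi-model setting, the same version bookkeeping would additionally have to span inter-model links and redundant copies in other models.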

5.4 Consistency control

Some multi-model databases, such as ArangoDB and OrientDB, support transactions. However, they still lack comprehensive transaction models for multi-model data [12]. A potential research direction is to investigate how the SQL data consistency principle ACID and the corresponding NoSQL principle BASE could work together in a multi-model database or polystore. UDBMS proposes a model-agnostic transaction scheme that allows assigning ACID or BASE properties to object collections and queries.
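No implemented system offers this yet, but the UDBMS idea could look roughly like the following purely hypothetical statements, in which the consistency guarantee is chosen per collection rather than per engine:

    -- Hypothetical syntax only; not supported by any current system.
    -- Eventually consistent (BASE) collection for high write throughput.
    CREATE COLLECTION session_events WITH CONSISTENCY = 'BASE';

    -- Fully transactional (ACID) collection for critical data.
    CREATE COLLECTION payments WITH CONSISTENCY = 'ACID';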

5.5 Extensibility

The limited nature of current multi-model databases raises the question of extensibility. Most multi-model databases and polystores support only a subset of the features of each data model, as well as only a narrow set of inter-model techniques. Most systems have also been built specifically on top of certain data models or certain existing representative systems. It would be useful to investigate the possibilities of extending intra-model and inter-model features and operations, as well as adding completely new data models, including interoperability with the currently supported models, to existing multi-model databases or polystores [11].

5.6 Indexing

Relational database management systems have developed sophisticated indexing structures for relational data, such as B-tree and B+-tree indexes. Similarly, NoSQL systems have developed indexes optimized for their own data structures. Multi-model databases and polystores can, of course, utilize these indexes when applicable, but a universal index structure spanning multiple data models could be more efficient [13].
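As an intra-system illustration (not the universal inter-model index the paragraph calls for), PostgreSQL already lets a single engine maintain model-specific indexes side by side; the sketch reuses the hypothetical orders and order_docs tables from Section 5.2:

    -- Default B-tree index on a scalar relational column.
    CREATE INDEX idx_orders_total ON orders (total);

    -- GIN index on a JSONB document column; speeds up containment
    -- queries such as: doc @> '{"customer": {"name": "Ada"}}'
    CREATE INDEX idx_order_docs_doc ON order_docs USING GIN (doc);

An open question is whether one index structure could serve both kinds of lookups at once instead of maintaining a separate index per model.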

5.7 Partitioning

Multi-model partitioning, or sharding, is an interesting problem. We discussed earlier that NoSQL databases often utilize horizontal scaling, where the data are distributed among multiple servers; this is rather simple with a single data model. Partitioning in multi-model environments, however, can be much more complicated. Instead of partitioning the data within each data model according to its own established principles, it may be worthwhile to investigate other multi-model partitioning strategies. For example, it could be beneficial to favor partitioning certain data models together.
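As a baseline for comparison, the sketch below shows simple intra-model hash partitioning using PostgreSQL's declarative partitioning (table and column names are hypothetical); the open research question is what an analogous declaration should look like when related data items live in several models:

    -- Split one table into four hash partitions by key.
    CREATE TABLE events (
        id      BIGINT NOT NULL,
        payload JSONB
    ) PARTITION BY HASH (id);

    CREATE TABLE events_p0 PARTITION OF events
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE events_p1 PARTITION OF events
        FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    -- events_p2 and events_p3 follow the same pattern.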

5.8 Standardization

One of the greatest strengths of relational databases has been the standardization of their data structures and the SQL query language. Multi-model databases and polystores, on the other hand, are still at a rather early and experimental stage where the representative systems and solution proposals differ from one another greatly. It would be beneficial to investigate possibilities to standardize some of the multi-model features, constructs, or components.

5.9 Performance

It is reasonable to expect that implementing the multi-model or polyglot persistence principle can hinder performance or require more resources: when data are processed and stored in multiple different formats, a decision about the processing method must be made somewhere along the way, whether inside a single multi-model database or in a polyglot persistence architecture that splits the data across multiple systems.

While performance has not seemed to be a primary focus of the research, some studies have been conducted. For example, one study compared the performance of ArangoDB, OrientDB, another multi-model database called Couchbase∗, and MongoDB [17]. The study compared the efficiency of data creation, retrieval, update, and delete operations, as well as aggregation. Figure 5.2 shows the results, measured as the number of operations performed per second.

Based on this test, MongoDB, the only single-model database in the comparison, focusing on documents, is consistently more efficient than the others. The only exception is data deletion in Couchbase, which, according to the study, is probably explained by native optimization of Couchbase's physical data structures called buckets. Otherwise, the study gave the results one would expect: it could be that the multi-model nature introduces a small amount of overhead all around.

∗https://www.couchbase.com/

Figure 5.2: Multi-model database performance comparison - number of operations per second

Another, more recent performance test compared ArangoDB with OrientDB and several single-model databases: MongoDB for documents, Neo4j for graphs, and PostgreSQL for relational data [16]. Figure 5.3 shows the results of this test.

Figure 5.3: NoSQL Performance 2018

This performance test has some interesting results. This time, MongoDB and the other single-model databases performed significantly worse than ArangoDB. However, ArangoDB is shown to use much more memory than, for example, MongoDB and PostgreSQL. Perhaps ArangoDB has optimized its in-memory processing to gain performance over the single-model counterparts. ArangoDB also performed better than the single-model graph database Neo4j in the graph operations tested, shortest path and neighbors. This cannot be explained by memory usage, because Neo4j also requires more memory than ArangoDB. This performance test indicates that, with proper optimization measures, multi-model databases can compete with single-model databases.

6 Conclusions

The lack of flexibility in data schema definition and the need to process increasing amounts of unstructured data led to the development of NoSQL databases. However, NoSQL databases are not meant to replace SQL databases: transaction consistency, strict schemas, and strong data integrity still have their place in the industry. The greatest benefit can be achieved by supplementing one with the other in complex systems.

Many kinds of multi-model databases have been developed to address this issue. Some were built with it in mind from the beginning, while others have been extended to support additional data models since their initial release. This gives developers of new applications many options when choosing the best-fitting database engine. Developers of existing applications may find that the database engine they are using already implements new ways of data management, or may do so in the future.

On the other hand, polyglot persistence is still at a relatively early stage in its application. Both BigDAWG and PolyBase, perhaps the most well-known representative implementations, can be seen as simple and limited: BigDAWG only supports three specific database engines, and PolyBase requires a strict schema for all or most of its data. However, they serve as a good starting point for further research and development of polyglot persistence.

The research on general-purpose multi-model data management is still at an early stage as well. The UDBMS framework attempts to provide ideas and direction, but practical implementations are still almost nonexistent. However, the fact that many database systems have started to embrace multi-model data management, together with the growing amount of research, shows that this is an important and promising research topic.

Bibliography

[1] Apache Cassandra Documentation. url: https://cassandra.apache.org/doc/latest/ (visited on 11/03/2020).

[2] S. Bjeladinovic. "A fresh approach for hybrid SQL/NoSQL database design based on data structuredness". In: Enterprise Information Systems 12.8-9 (2018), pp. 1202–1220. issn: 17517583. doi: 10.1080/17517575.2018.1446102. url: https://doi.org/10.1080/17517575.2018.1446102.

[3] E. F. Codd. The Relational Model for Database Management: Version 2. 1990. isbn: 0201141922.

[4] DB-Engines Ranking. url: https://db-engines.com/en/ranking (visited on 11/03/2020).

[5] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker. "The BigDAWG Polystore System and Architecture". In: IEEE High Performance Extreme Computing Conference (2016).

[6] F. Gessert, W. Wingerath, S. Friedrich, and N. Ritter. "NoSQL database systems: a survey and decision guidance". In: Computer Science - Research and Development 32.3-4 (2017), pp. 353–365. issn: 18652042. doi: 10.1007/s00450-016-0334-3.

[7] Graph processing with SQL Server and Azure SQL Database. url: https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview (visited on 11/03/2020).

[8] T. Haerder and A. Reuter. "Principles of transaction-oriented database recovery". In: ACM Computing Surveys (CSUR) 15.4 (1983), pp. 287–317. issn: 15577341. doi: 10.1145/289.291.

[9] R. Hecht and S. Jablonski. "NoSQL evaluation: A use case oriented survey". In: Proceedings - 2011 International Conference on Cloud and Service Computing, CSC 2011 (2011), pp. 336–341. doi: 10.1109/CSC.2011.6138544.

[10] B. Lakhe. In: Practical Hadoop Migration. 2016. Chap. Re-Architecting for NoSQL: Design Principles, Models and Best Practices, pp. 117–148. isbn: 9781484212875. doi: 10.1007/978-1-4842-1287-5.

[11] J. Lu, I. Holubová, and B. Cautis. "Multi-model databases and tightly integrated polystores: Current practices, comparisons, and open challenges". In: International Conference on Information and Knowledge Management, Proceedings (2018), pp. 2301–2302. doi: 10.1145/3269206.3274269.

[12] J. Lu and I. Holubová. "Multi-model Databases: A new journey to handle the variety of data". In: ACM Computing Surveys 52.3 (2019). issn: 15577341. doi: 10.1145/3323214.

[13] J. Lu, Z. H. Liu, P. Xu, and C. Zhang. "UDBMS: Road to Unification for Multi-model Data Management". In: Woo C., Lu J., Li Z., Ling T., Li G., Lee M. (eds) Advances in Conceptual Modeling. ER 2018. Lecture Notes in Computer Science. Vol. 11158. Springer, Cham, 2018. url: https://doi.org/10.1007/978-3-030-01391-2_33.

[14] Microsoft PolyBase Documentation. url: https://docs.microsoft.com/en-us/sql/relational-databases/polybase (visited on 11/03/2020).

[15] Microsoft Transact-SQL documentation. url: https://docs.microsoft.com/en-us/sql/t-sql/ (visited on 11/03/2020).

[16] NoSQL Performance Benchmark 2018 – MongoDB, PostgreSQL, OrientDB, Neo4j and ArangoDB. url: https://www.arangodb.com/2018/02/nosql-performance-benchmark-2018----neo4j-arangodb/ (visited on 12/03/2020).

[17] E. Płuciennik and K. Zgorzałek. "The Multi-model Databases – A Review". In: Kozielski S., Mrozek D., Kasprowski P., Małysiak-Mrozek B., Kostrzewa D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science. Vol. 716. Springer, Cham, 2017, pp. 141–152. doi: 10.1007/978-3-319-58274-0. url: http://link.springer.com/10.1007/978-3-319-58274-0.

[18] PostgreSQL documentation. url: https://www.postgresql.org/docs/ (visited on 11/03/2020).

[19] PostgreSQL JSON tutorial. url: https://www.postgresqltutorial.com/postgresql-json/ (visited on 11/03/2020).

[20] C. Zhang and J. Lu. "Holistic evaluation in multi-model databases benchmarking". In: Distributed and Parallel Databases (2019). doi: 10.1007/s10619-019-07279-6. url: https://doi.org/10.1007/s10619-019-07279-6.