Masaryk University Faculty of Informatics

Comparison of a multi-model database with its single-model variants

Bachelor’s Thesis

Michal Merjavý

Brno, Fall 2019


This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Michal Merjavý

Advisor: Mgr. Martin Macák


Acknowledgements

I would like to thank my advisor, Mgr. Martin Macák for giving me advice which greatly improved the quality of this work. Access to and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme "Projects of Large Research, Devel- opment, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Abstract

The goal of this thesis was to compile an overview of publicly available datasets and to compare a multi-model database to its single-model counterparts, specifically OrientDB to Neo4J and MongoDB. We begin with an overview of NoSQL databases and their categories, including implementations of document, graph and multi-model databases. We briefly describe the publicly available datasets we found. Following that are the document and graph queries that were tested and the results of said queries.

Keywords

NoSQL, multi-model database, document database, time series database, OrientDB, Neo4J, MongoDB, benchmarking, datasets, public datasets


Contents

Introduction

1 Database overview
  1.1 NoSQL databases
      1.1.1 Key-Value stores
      1.1.2 Document databases
      1.1.3 Wide-Column stores
      1.1.4 Graph databases
      1.1.5 Multi-model databases

2 Publicly available datasets
  2.1 Transportation
  2.2 Environment
  2.3 Economics
  2.4 Health
  2.5 Education
  2.6 Other

3 Designed queries
  3.1 Setup
      3.1.1 Chosen datasets
  3.2 Graph queries
  3.3 Document queries

4 Results of queries
  4.1 Graph queries measurements
  4.2 Document queries measurements
  4.3 Summary of results

5 Conclusion
  5.1 Future Work

Bibliography


List of Tables

1.1 Document databases
1.2 Graph databases
1.3 Multi-model databases
2.1 Found datasets
4.1 Graph queries
4.2 Document queries
4.3 Average times of all queries


Introduction

Nowadays, almost every company needs to store and access data to work correctly; that is why, most of the time, they decide to use databases. Working with data more efficiently saves time, so it is crucial to select a useful tool. One of the most challenging problems is how to deal with the variety of data that needs to be stored. If there are two or three different types of data that a company needs to store and analyse, it would require multiple single-model databases. That includes learning to set them up properly and studying the query languages they use. However, with a multi-model database, it is possible to work with only one database for all these purposes.

Multi-model databases are getting more popular every year because of their ability to support multiple data models while using one backend for all of them. The focus of this thesis is to find out whether a multi-model database can compare to its single-model counterparts without losing too much performance in exchange for its practicality. This thesis compares graph and document data.

The thesis begins with a description of NoSQL databases in general and implementations of different multi-model, graph and document databases. It then goes on to an overview of freely available datasets that we found, grouped by their respective fields of study. The next chapter is focused on the queries and datasets we used for benchmarking. The fourth chapter includes measurements for the queries that were used and an interpretation of the results.


1 Database overview

In this chapter, we provide information on NoSQL databases and multi-model databases, and introduce some implementations of graph, document and multi-model databases.

1.1 NoSQL databases

NoSQL stands for Not Only SQL and refers to a group of non-relational data management systems whose databases do not primarily use tables and, most of the time, do not use SQL for data manipulation [1]. NoSQL databases also provide better scaling and processing of large-scale datasets. In classical database systems, transactions guarantee the integrity of data. However, scaling transaction-based systems proved to be a difficult problem. That is why NoSQL databases lowered the requirements on consistency and achieved better availability and partitioning. As a result, most NoSQL databases use systems known as BASE (Basically Available, Soft-state, Eventually consistent) instead of transactions [2]. BASE systems do not require consistency after every transaction, only that the system eventually reaches a consistent state [3]. NoSQL databases can be classified into four basic categories: Key-Value stores, Document databases, Wide-Column stores and Graph databases.
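The BASE behaviour described above can be sketched as a toy in-memory store (entirely illustrative, not based on any real system): a write is accepted by one replica and only propagated later, so reads from other replicas may be stale until a synchronisation step makes all replicas agree.

```python
# Toy illustration of eventual consistency: writes go to one replica
# and propagate later, so reads may be stale until synchronisation
# makes all replicas agree.
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n=3):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []          # writes not yet propagated

    def write(self, key, value):
        # Accept the write on a single replica only ("basically available").
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, replica_index, key):
        # A read may observe stale data ("soft state").
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Propagation step: after it runs, all replicas agree
        # ("eventually consistent").
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", 1)
print(store.read(2, "x"))   # None - replica 2 has not seen the write yet
store.sync()
print(store.read(2, "x"))   # 1 - replicas converged
```

A real BASE system, of course, propagates writes continuously and must also resolve conflicting concurrent writes, which this sketch omits.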

1.1.1 Key-Value stores

These data management systems use hash tables with unique keys and pointers to items of data, creating key-value pairs [2]. Hash tables are suitable for looking up values in extremely large datasets. Key-value databases do not provide traditional database capabilities. To ensure atomicity of transactions and consistency of parallel transactions, users have to rely on the application itself. In key-value databases, it is only possible to query the key part of the pair. This makes it impossible to extract records that contain a particular set of values. Key-Value stores are suitable for applications where the speed of retrieving the data is essential, because of their simplicity. Nevertheless, if there is a necessity to query data by value, key-value databases are not suitable.
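The key-only access path described above can be made concrete with a minimal sketch (illustrative only, not a real product): a lookup by key is a single hash-table access, while finding entries by value forces a full scan, which is exactly the limitation the text points out.

```python
# Minimal key-value store sketch: lookups by key are a single
# hash-table access, but finding entries by *value* requires scanning
# every pair - the main limitation of this model.
class KeyValueStore:
    def __init__(self):
        self._table = {}             # Python dict = hash table

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)  # O(1) expected time

    def keys_with_value(self, value):
        # No secondary access path: a full scan is the only option.
        return [k for k, v in self._table.items() if v == value]

kv = KeyValueStore()
kv.put("user:1", "Michal")
kv.put("user:2", "Martin")
print(kv.get("user:1"))              # Michal
print(kv.keys_with_value("Michal"))  # ['user:1']
```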


1.1.2 Document databases

Document databases were created to improve the storage and management of documents. Documents in these databases are more flexible than records in relational databases because they are schema-less [2]. Documents are mostly stored in one of these formats: XML, JSON or BSON. Documents can contain, in most cases, multiple key-value pairs, key-array pairs, or nested documents. Document databases hold semi-structured data in attribute/value pairs. The difference between document databases and key-value stores is the ability to search not only by key but also by values. Document databases are used for storing semi-structured data where it is not certain whether a value will be entered or not, which could cause problems in relational databases [4]. It is not optimal to use document databases when the data has a lot of relations. Below is a description of CouchDB and MongoDB as implementations of document databases. MongoDB is used for comparison later on. Following that is a summary of the differences between these databases in Table 1.1.
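The schema-less, value-searchable document model described above can be sketched with plain nested dictionaries (field names are invented for illustration):

```python
# Sketch of the document model as nested Python dicts: unlike a
# key-value store, we can match on field values, including optional
# (schema-less) and nested fields.
documents = [
    {"_id": 1, "firstname": "Michal", "address": {"city": "Brno"}},
    {"_id": 2, "firstname": "Martin", "tags": ["advisor"]},
    {"_id": 3, "firstname": "Michal"},          # no address - allowed
]

def find(docs, **conditions):
    # Return documents whose top-level fields match all conditions;
    # a missing field simply fails the match instead of being an error.
    return [d for d in docs
            if all(d.get(k) == v for k, v in conditions.items())]

print([d["_id"] for d in find(documents, firstname="Michal")])  # [1, 3]
```

Note that document 3 lacks the `address` field entirely, which a relational schema would forbid but the document model tolerates.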

CouchDB

CouchDB1 is a document-oriented NoSQL database. For storing data, it uses JSON, and JavaScript is used as its query language to work with the stored data. A database in CouchDB is a collection of independent documents that store their own data and self-contained schema. Documents also contain metadata that makes it possible to merge differences that may have occurred while the databases were disconnected. CouchDB does not lock the database while writing data, and any conflicts that may occur are left to the application to resolve. It is written in Erlang, JavaScript, C and C++ and was released in 2005. Below is an example of a JSON selector that finds people with the first name Michal.

1. https://couchdb.apache.org/


{
  "selector": {
    "firstname": "Michal"
  }
}

MongoDB

MongoDB2 is a database based on a document model. Its initial release was in 2009, and it is written in C++, Go, JavaScript and Python. MongoDB provides two options for storage engines: the WiredTiger storage engine and the In-Memory storage engine [5]. Starting in MongoDB 3.2, the default storage engine is WiredTiger. It is recommended for new deployments and is suitable for most workloads. WiredTiger uses optimistic concurrency control for most reads and writes. In optimistic concurrency control, only intent locks are used at the global, database and collection levels. If the storage engine detects a conflict between two operations on the same document, one of them incurs a write conflict and is retried. WiredTiger also uses Multi-Version Concurrency Control, MVCC for short. This allows MongoDB to recover from the last valid checkpoint when it encounters an error while writing a new checkpoint. MongoDB creates checkpoints at intervals of sixty seconds.

In MongoDB, data are stored in flexible JSON-like documents; this means that fields can vary from document to document, and it is easy to change data structures over time. MongoDB provides high availability, horizontal scaling, and geographic distribution. MongoDB also makes analysing data easier with ad hoc queries, indexing, and real-time aggregation [6]. MongoDB uses JavaScript as its main query language. Below is a query that returns all documents where the property "firstname" is equal to Michal.

db.catalog.find({firstname: {$eq: "Michal"}})

2. https://www.mongodb.com/


                   CouchDB                MongoDB
Year of release    2005                   2009
Query language     JavaScript             JavaScript
Implemented in     Erlang, JavaScript, C  C++, Go, JavaScript, Python

Table 1.1: Document databases

1.1.3 Wide-Column stores

Wide-column databases are an extension of the key-value architecture with added columns. Wide-column databases combine declarative characteristics, as in relational databases, with the schema flexibility of various key-value stores [2]. Wide-column databases are primarily used for distributed data storage, especially for versioned data because of their time-stamping functions; for exploratory and predictive analytics; and for large-scale data processing such as sorting, parsing, conversion, algorithmic crunching and more. Most Wide-Column stores are patterned after Google's Bigtable, which is used for Google's search index, Google Earth and Google Finance [3].
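The versioned-cell structure that makes this model suited to time-stamped data can be sketched as a map from row keys to column families with multiple timestamped versions per cell (all names here are invented; this is modelled only loosely on the Bigtable idea):

```python
# Rough sketch of the wide-column idea: each row key maps to
# (family, qualifier) cells, and each cell keeps multiple timestamped
# versions - which is what makes the model suited to versioned data.
import bisect

class WideColumnTable:
    def __init__(self):
        # row_key -> {(family, qualifier): [(timestamp, value), ...]}
        self.rows = {}

    def put(self, row_key, family, qualifier, timestamp, value):
        cell = self.rows.setdefault(row_key, {}).setdefault(
            (family, qualifier), [])
        bisect.insort(cell, (timestamp, value))  # versions sorted by time

    def get_latest(self, row_key, family, qualifier):
        cell = self.rows.get(row_key, {}).get((family, qualifier))
        return cell[-1][1] if cell else None

t = WideColumnTable()
t.put("page:brno", "content", "html", 100, "<old>")
t.put("page:brno", "content", "html", 200, "<new>")
print(t.get_latest("page:brno", "content", "html"))   # <new>
```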

1.1.4 Graph databases

Graph databases are suitable for storing information about objects and all relationships that exist among them. Graph databases use a schema-free model that allows them to model and represent connected data easily [7]. This model includes vertices and edges. Vertices represent and store data about the respective objects, and edges can also store data related to the relationship. Graph databases use a technique called index-free adjacency [2]. This technique lets each node store direct pointers to its adjacent nodes. Graph databases are mostly used in cases where the main focus of the study is not the data but the relationships between them. They allow better analysis of the strength and the nature of relationships between nodes. Graph databases are used for a variety of applications such


as social networking applications, bioinformatics, and network and cloud management [4]. Below is a description of Neo4J and Dgraph. We chose these two graph databases because they are among the most popular ones. Neo4J is later used for comparison in this thesis. A summary of their characteristics is in Table 1.2.
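The index-free adjacency technique mentioned earlier can be sketched as nodes holding direct references to their neighbours, so a traversal step follows a pointer instead of consulting an index (this is a toy illustration, not how any particular database lays out its storage):

```python
# Sketch of index-free adjacency: each node stores direct references
# to adjacent nodes, and edges can carry their own property data.
class Node:
    def __init__(self, name):
        self.name = name
        self.out_edges = []        # direct pointers to neighbours

    def connect(self, other, properties=None):
        # Edges can store data about the relationship, as described above.
        self.out_edges.append((other, properties or {}))

alice, bob, carol = Node("alice"), Node("bob"), Node("carol")
alice.connect(bob, {"type": "friend"})
bob.connect(carol, {"type": "friend"})

# Two-hop traversal: friends of friends of alice, via pointer chasing.
two_hops = [n2.name
            for n1, _ in alice.out_edges
            for n2, _ in n1.out_edges]
print(two_hops)   # ['carol']
```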

Neo4J

Neo4J3 is an open-source native graph database that is also ACID compliant, highly available and scalable. Native graph database means its storage was designed specifically for the management and storage of graphs [8]. In this case, each file contains data specific to only one part of the graph, such as nodes, relationships, labels and properties. This division of the different parts of graphs improves the performance of graph traversals. Data are stored in a graph format with nodes and relationships. Nodes can be connected by defined relationships. Both use properties to store data related to them. It was first released in February 2010 and is written entirely in Java.

Data in Neo4J can be accessed through two different query languages: Cypher, which is similar to SQL, and Gremlin. In Neo4J, it is also possible to work with data through a REST interface and a Java API. Neo4J also includes Neo4J Bloom, which can be used for visual exploration [9]. Below is an example query in the Cypher query language that returns nodes with first name Michal that are connected to the account with id five through a relationship named friend.

MATCH (a:Account {id: '5'})-[:friend]->(f)
WHERE f.firstname = "Michal"
RETURN f;

Dgraph

Dgraph4 is an open-source native graph database that was first released in 2015. Dgraph provides ACID transactions, consistent replication and linearizable reads [10]. It is written in Go. For persistent storage, Dgraph has used Badger since version 0.8. Badger is also written in Go and is meant to be an alternative to non-Go-based key-value stores like RocksDB, which Dgraph used previously [11].

Dgraph uses GraphQL+-, which is based on Facebook's GraphQL. GraphQL+- was modified to provide more features for graph databases because GraphQL was not developed for them. However, GraphQL has graph-like query syntax, schema validation and subgraph-shaped responses, which made it an excellent base. Dgraph responds to queries in JSON and Protocol Buffers over gRPC and HTTP. Below is an example query that returns friends named Michal of the account with uid five.

{
  findFriendsWithNameMichal(func: uid(5)) {
    friend {
      account @filter(eq(name@en, "Michal")) {
        name@en
        date_of_birth
      }
    }
  }
}

3. https://neo4j.com/
4. https://dgraph.io/

                   Neo4J             Dgraph
Year of release    2010              2015
Query language     Cypher, Gremlin   GraphQL+-
Implemented in     Java              Go

Table 1.2: Graph databases

1.1.5 Multi-model databases

In traditional database management systems, a single data model defines how to manipulate the data. However, a multi-model database is able to support multiple data models, which enables it to organise the data in multiple ways, and the user can choose the model that best fits the data being used. Multi-model databases offer the data-modelling advantages of polyglot persistence without its disadvantages [12]. Currently, there are two possible approaches to managing multi-model


data. The first one is a single integrated multi-model database system, like OrientDB and ArangoDB, and the second one is using middleware over multiple single-model data stores, for example, Cosmos DB. In multi-model databases, all data models are supported by a single integrated backend, which helps reduce development, maintenance and operational issues [13]. Below is a description of ArangoDB and OrientDB. Both of them belong among the most popular multi-model databases. We are using OrientDB as our multi-model representative. A recapitulation of their main characteristics is in Table 1.3.

ArangoDB

ArangoDB5 is an open-source multi-model NoSQL database that supports the document, graph and key-value data models. It uses an SQL-like query language called AQL and also supports JavaScript and Ruby extensions. AQL is a declarative language that focuses on the results, not on how they should be produced.

The storage engines available in ArangoDB are memory-mapped files and RocksDB from Facebook [14]. It is not possible to mix engines on one server for different databases. The default storage engine from version 3.4 onwards is RocksDB. The memory-mapped files engine is best used when all the data fits into main memory, which makes concurrent reads very fast. Indexes are rebuilt on startup, resulting in a longer startup time but better performance. RocksDB is based on a log-structured merge-tree and is optimized for large datasets. The main advantages of RocksDB are document-level locks, which allow concurrent writes; reads and writes that do not block each other; and faster startup compared to the memory-mapped files engine, because indexes do not have to be rebuilt on startup.

For storage and retrieval of data, it uses collections. Each collection consists of documents and is uniquely identified by a collection identifier and by the name of the collection [2]. Collections have a type that is specified when they are created: either document or edge. The default type is document. It also allows REST as an option for

5. https://www.arangodb.com/

querying documents. ArangoDB was initially released in 2011, and it is written in C++ and JavaScript. Below are examples of queries in the AQL language. The first query returns people with the first name Michal. The second query returns all friends named Michal of the person with id equal to five.

FOR p IN People
  FILTER p.firstname == "Michal"
  RETURN p

FOR p IN People
  FILTER p.id == 5
  FOR v IN 1..1 OUTBOUND p friendWith
    FILTER v.firstname == "Michal"
    RETURN v

OrientDB

OrientDB6 is a multi-model open-source NoSQL database management system that supports the document, graph, key-value and object data models. It was initially released in 2010, is implemented in Java and is developed by OrientDB Ltd. It supports a distributed architecture with replication and is transactional.

OrientDB uses Paginated Local Storage for storing data. It is disk-based storage that uses a page model to work with data and consists of several components that access disk data through a disk cache. Paginated local storage is a two-level disk cache that works together with a write-ahead log. Files are split into pages, and this allows operations to be atomic at page level. The two-level disk cache permits OrientDB to cache often-accessed pages, separate pages that are not accessed frequently, and minimize the number of disk head seeks during data writes. It also mitigates the pauses needed to write data to the disk by flushing all changed or newly added pages to the disk in a background thread. The disk cache consists of two parts: a read cache and a write cache. The read cache is based on the 2Q cache algorithm, and the write cache is based on the WOW cache algorithm [15].
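The core idea behind a two-queue read cache can be sketched as follows (a greatly simplified toy in the spirit of the 2Q algorithm; the real OrientDB implementation differs in many details): pages seen once sit in a FIFO probation queue, and only pages referenced again are promoted to the LRU hot queue, which protects frequently accessed pages from being evicted by one-off scans.

```python
# Greatly simplified two-queue read cache sketch, loosely inspired by
# the 2Q algorithm: re-referenced pages are promoted to a hot LRU
# queue; pages touched only once stay in a probation FIFO queue.
from collections import OrderedDict

class TwoQueueCache:
    def __init__(self, fifo_size=2, lru_size=2):
        self.fifo = OrderedDict()   # pages seen once (probation)
        self.lru = OrderedDict()    # pages seen repeatedly (hot)
        self.fifo_size, self.lru_size = fifo_size, lru_size

    def access(self, page):
        if page in self.lru:                    # hot hit: refresh recency
            self.lru.move_to_end(page)
        elif page in self.fifo:                 # second access: promote
            del self.fifo[page]
            self.lru[page] = True
            if len(self.lru) > self.lru_size:
                self.lru.popitem(last=False)    # evict least recently used
        else:                                   # first access: probation
            self.fifo[page] = True
            if len(self.fifo) > self.fifo_size:
                self.fifo.popitem(last=False)   # evict oldest once-seen page

cache = TwoQueueCache()
for page in ["p1", "p2", "p1", "p3"]:
    cache.access(page)
print(list(cache.lru))    # ['p1'] - only the re-referenced page is hot
print(list(cache.fifo))   # ['p2', 'p3']
```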

6. https://orientdb.com/


For manipulating database data, it is possible to use Java, SQL with an extension for graphs, and Gremlin. OrientDB supports schema-less, schema-full or schema-mixed data. OrientDB uses the Hazelcast Open Source project for automatic discovery of nodes, storing cluster configuration and synchronization between nodes. The distributed architecture can be used in different ways to achieve better performance, scalability and robustness. OrientDB also provides a web interface that can be used for viewing graphs and for data manipulation. Below are examples of queries in OrientDB. The first query returns people with the first name Michal. The second query returns all friends named Michal of the person with id equal to five.

SELECT * FROM People WHERE firstname = "Michal"

SELECT * FROM (
  SELECT EXPAND(OUT('Friend')) FROM Person WHERE id = 5
) WHERE firstname = "Michal"

                   OrientDB                ArangoDB
Year of release    2010                    2011
Supported data     document, graph,        document, graph,
models             key-value, object       key-value
Query language     Java, SQL with          AQL, JavaScript,
                   extension for graphs,   Ruby extensions
                   Gremlin
Implemented in     Java                    C++, JavaScript

Table 1.3: Multi-model databases


2 Publicly available datasets

We have made this overview because we were looking for datasets that could be used in this thesis and did not find any such categorisation of datasets. The datasets mentioned below are categorised into groups according to the fields of study they roughly belong to. Each dataset has a brief description of what it contains. The main criterion for these datasets was their size, because there was a requirement that a dataset have several GBs of data. At the end of the chapter is a summary table of the mentioned datasets with their respective sizes.

2.1 Transportation

Chicago parking ticket data contains tickets issued from January 1, 2007, to May 14, 2018, and information about them such as violation location, fine, license plate and so forth [16]. The New York parking tickets dataset is similar to the previous one, but for the city of New York [17]. Both the New York taxi rides and Chicago taxi rides datasets include data about, for instance, tips, pick-up and drop-off locations, cost and time [18, 19]. New York transport statistics has data regarding buses in ten-minute intervals, such as where they are, whether they are ahead of schedule and more [20].

2.2 Environment

The Global Surface Summary of the Day dataset has data from over 9000 stations; the included elements cover various data about temperature, wind, sea conditions and occurrences of fog, rain, snow, thunder and more [21]. The Safecast Radiation Measurements dataset is made by volunteers. It contains information about radiation levels around the world [22].

2.3 Economics

World Development Indicators is a dataset about how countries have developed themselves [23].

2.4 Health

The NIH Chest X-ray dataset records approximately 100,000 images and metadata for them describing 14 common thorax diseases [24]. British prescription data covers all medicines prescribed each month. The data include drug name, form, strength, and where it was prescribed [25]. The Deaths in the United States dataset includes every death in the country from 2005 to 2015. It includes causes of death and the demographic background of the deceased [26]. EEG data from a basic sensory task in schizophrenia contains data about humans with schizophrenia and how it affects their performance in sensory tasks [27].

2.5 Education

The U.S. College Scorecard data from 1996 to 2015 dataset covers information such as cost of tuition, undergraduate enrollment size and the rate of graduation [28].

2.6 Other

The Microsoft Academic Graph is focused on scientific fields. It contains scientific publication records, their authors, citation relationships between those publications, institutions, journals, conferences, and fields of study. It features hundreds of millions of rows with medium inter-connectivity [29]. The Seattle Public Library checkouts dataset began in April 2005; it has information on every physical or electronic item that was checked out, such as title, creator and number of checkouts [30]. The Seattle Library Collection Inventory dataset contains data about every item that can be found in the library [31].


The Reddit comments dataset holds data about comments from December 2012 to March 2017. Information includes author, score, subreddit and text [32]. The Wikipedia Offline Edition contains everything text-related that is on Wikipedia [33]. The Pwned Passwords dataset is a collection of more than 500,000,000 real passwords that were exposed. Each password is represented as an SHA-1 hash to protect the original value [34].

Dataset   Domain           Size
[16]      transportation   24 GB
[17]      transportation   8 GB
[18]      transportation   300 GB
[19]      transportation   2 GB
[20]      transportation   5 GB
[21]      environment      3 GB
[22]      environment      10 GB
[23]      economics        2 GB
[24]      health           45 GB
[25]      health           4 GB
[26]      health           4 GB
[27]      health           19 GB
[28]      education        2 GB
[29]      other            80 GB
[30]      other            7 GB
[31]      other            11 GB
[32]      other            304 GB (zip)
[33]      other            11 GB
[34]      other            22 GB

Table 2.1: Found datasets


3 Designed queries

This chapter contains information about the datasets we used for testing and all the queries we tested in this thesis. Each query has a brief description of the results it produces. The queries were designed according to what we expected to be the most common use cases for the respective data models. For graph queries, we focused mostly on traversal between nodes, because we expect relationships to be more important than the information stored in individual nodes. In document queries, we concentrated mainly on filtering the data with and without indexes.

3.1 Setup

The databases we chose are OrientDB 3.0.23 for multi-model, Neo4J 3.5.12 for graph and MongoDB 4.0.12 for document. Each database is configured in a cluster of three nodes. The nodes are located on the OpenStack cloud platform. All three nodes run Ubuntu 18.04.2 LTS. One node has 4GB of RAM and two 2GHz cores. The remaining two have 8GB of RAM and four 2GHz cores.

3.1.1 Chosen datasets

Graph dataset

For comparing the graph databases, we are using a dataset of Twitter followers1. Files in this dataset were processed into a format suitable for importing into the databases. Both Neo4J and OrientDB support the CSV format for importing, so we decided to use it. In Neo4J, we used the Neo4J-admin import tool with a file that contained the names of all files we wanted to import and the database name. In OrientDB, we utilised the ETL tool, which requires a JSON file to define an Extractor, a Transformer and a Loader. The Extractor is responsible for extracting data from a source file and defines other options for extraction such as the separator, columns and date format. The Transformer defines to which class the data is imported and the edges related to this class. In

1. http://networkrepository.com/soc-twitter.php

the Loader are the name of the database, the type of the database and the indexes for the classes.

Document dataset

For the document databases, we have records of taxi rides in New York City from 20132. This dataset was already in CSV format; hence, it did not need any alteration. We used this dataset in MongoDB and OrientDB. Into MongoDB, it was imported by the mongoimport tool, which required the path to the file we wanted to import and the database name. In OrientDB, we used the same ETL tool as in the previous case.

3.2 Graph queries

In these queries, we mostly use relationships for traversal between nodes. The queries are labelled g1 to g8. Their descriptions are as follows:

∙ g1 counts connected nodes that have less than 1000 followers until depth two,

∙ g2 identical to g1 (4GB of RAM on all nodes),

∙ g3 identical to g1 depth three,

∙ g4 identical to g1 depth four,

∙ g5 identical to g1 depth five,

∙ g6 finds the shortest path between two nodes where the desired path is three edges long,

∙ g7 identical to g6 (4GB of RAM on all nodes)

∙ g8 finds the shortest path where the path between nodes does not exist.
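To make the semantics of queries like g1 concrete, the following hypothetical sketch shows what such a query computes: starting from one node, count reachable nodes up to a depth limit whose follower count is below a threshold. The tiny graph here is an invented stand-in for the Twitter-follower dataset.

```python
# Depth-limited BFS mirroring queries g1-g5: count connected nodes
# with fewer than `limit` followers, up to `max_depth` hops away.
from collections import deque

def count_within_depth(adj, followers, start, max_depth, limit=1000):
    seen = {start}
    queue = deque([(start, 0)])
    count = 0
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue                       # do not expand past the limit
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                if followers.get(nxt, 0) < limit:
                    count += 1
                queue.append((nxt, depth + 1))
    return count

adj = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"]}
followers = {"b": 10, "c": 5000, "d": 20, "e": 30}
print(count_within_depth(adj, followers, "a", 2))   # b and d -> 2
```

Deepening `max_depth` from two to five is exactly what distinguishes g1 from g3, g4 and g5, which is why the measured times grow so quickly with depth.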

2. https://chriswhong.com/open-data/foil_nyc_taxi/

3.3 Document queries

In the document databases, we focus mostly on queries that filter the data for results satisfying their conditions and on grouping data by some common denominators. The queries are labelled d1 to d10. Their descriptions are as follows:

∙ d1 counts how many documents fulfil one condition,

∙ d2 identical to d1 (4GB of RAM on all nodes),

∙ d3 counts how many documents fulfil two conditions,

∙ d4 counts how many documents fulfil three conditions,

∙ d5 counts how many documents fulfil four conditions,

∙ d6 sum of total tip amount on different types of payments,

∙ d7 counts how many documents fulfil one condition on an indexed property where the number of these documents is more than ten million,

∙ d8 counts how many documents fulfil one condition on an indexed property where the number of these documents is more than ten million (4GB of RAM on all nodes),

∙ d9 counts how many documents fulfil one condition on an indexed property where the number of these documents is less than one hundred thousand,

∙ d10 uses a combined index for two properties.
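As an illustrative counterpart to queries d1–d6, the sketch below counts documents that satisfy a growing number of filter conditions and aggregates tips per payment type. Field names and data are invented; the real runs used the NYC taxi-ride dataset.

```python
# Filtering and grouping over documents, mirroring d1-d6 in miniature.
rides = [
    {"payment": "cash", "passengers": 1, "tip": 0.0},
    {"payment": "card", "passengers": 2, "tip": 2.5},
    {"payment": "card", "passengers": 1, "tip": 1.0},
]

def count_where(docs, **conds):
    # Count documents matching every given condition (d1-d5 style).
    return sum(all(d.get(k) == v for k, v in conds.items()) for d in docs)

print(count_where(rides, payment="card"))                 # one condition: 2
print(count_where(rides, payment="card", passengers=1))   # two conditions: 1

# d6-style aggregation: total tip amount per payment type.
totals = {}
for r in rides:
    totals[r["payment"]] = totals.get(r["payment"], 0.0) + r["tip"]
print(totals)   # {'cash': 0.0, 'card': 3.5}
```

An index changes only how fast the matching documents are located, not what is counted, which is why d7–d10 reuse the same kind of predicate over indexed properties.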


4 Results of queries

This chapter presents all our query measurements and an interpretation of the results for each comparison between OrientDB and its single-model counterparts, namely Neo4J and MongoDB. For the comparison between Neo4J and OrientDB, we use their built-in timers. For MongoDB and OrientDB, the Unix command time is used. Each query was run five times. The final time is calculated as the average over all runs of the query.
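The measurement procedure just described can be sketched as follows; `run_query` is a placeholder for whichever database call is being benchmarked (the actual thesis measurements used the databases' built-in timers and the Unix `time` command rather than this code).

```python
# Run a query several times and report the mean wall-clock time,
# mirroring the averaging procedure described above.
import time

def benchmark(run_query, runs=5):
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

avg = benchmark(lambda: sum(range(100_000)))
print(f"average over 5 runs: {avg:.6f}s")
```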

4.1 Graph queries measurements

Query   Neo4J     OrientDB
g1      24.586s   1m16s
g2      30.638s   4m41s
g3      3m43s     10m52s
g4      30m38s    15m37s
g5      251m19s   28m41s
g6      2m7s      20.436s
g7      3m32s     1m22s
g8      1m14s     8.247s

Table 4.1: Graph queries

As seen in Table 4.1, Neo4J outperformed OrientDB in the first three queries. However, in queries g4 and g5, where the depth was four and five respectively, OrientDB came out on top with a significant lead in both cases. In our last three queries, where the objective was to find the shortest path between selected nodes, OrientDB was also significantly faster than Neo4J. In queries g2 and g7, we can see that lowering the amount of RAM negatively affects both OrientDB and Neo4J.

4.2 Document queries measurements

Query   MongoDB   OrientDB
d1      9m27s     38m2s
d2      17m6s     51m32s
d3      10m15s    32m13s
d4      12m47s    32m3s
d5      11m4s     34m42s
d6      18m42s    42m57s
d7      2.571s    50.324s
d8      2.652s    77.893s
d9      0.673s    1.078s
d10     0.562s    1.459s

Table 4.2: Document queries

The queries we tested are focused on filtering on different numbers of properties and on grouping data. In MongoDB, filtering on a different number of properties did add some additional time to query execution. As we can see from Table 4.2, in queries d3 and d4, where we were filtering on two and three properties respectively, OrientDB took less time than in d1. We suspect the lower times in queries d3 and d4 were due to the lower number of results. In d5, the execution time for OrientDB increased; we assume this is because the last added property was a string.

In the last four queries, indexes were created for the properties we used to filter the data. In OrientDB it was the SB-Tree index, and for MongoDB it was a Single Field index; both indexes are based on the B-tree algorithm. Both indexes were chosen based on recommendations found in the respective documentation. In query d7, we can see that MongoDB outperformed OrientDB, but in query d9, which uses the same index with a significantly smaller number of results, the real-time difference between the databases was negligible. In the last query, we used a combined index spanning more than one property, and here too the difference between OrientDB and MongoDB was small.

22 4. Results of queries

It is not always beneficial to create an index on all the data, because an index requires additional disk space and, in most cases, slows inserts; it is therefore essential to know how we want to use the database. For non-indexed data, MongoDB is superior to OrientDB; for indexed data, OrientDB does not perform very differently from its counterpart.

In query d8, where we used only 4GB of RAM instead of 8GB, we also found that more RAM significantly improved the performance of OrientDB on indexed data, while for MongoDB it did not matter in this case. The amount of RAM also affected performance on non-indexed data in both databases, as we can see in d1 and d2. This improvement was very similar across all queries on non-indexed data.

4.3 Summary of results

As seen in Table 4.3, OrientDB is comparable to or better than Neo4J and MongoDB in some cases in both data models. For graph queries, this was when the objective was to find the shortest path between nodes, and when the depth of the query was more than three. However, when the objective on graph data is traversing nodes up to depth three, Neo4J is a much better choice.

In document queries, OrientDB was not very competitive on non-indexed data; in this case, it was slower than MongoDB in all our queries. When the queried data was indexed, we could see an improvement in both databases. For indexed data, changing the amount of RAM did not, in our case, affect the performance of MongoDB, but it significantly improved the performance of OrientDB. For OrientDB, the number of results a query produces is also very important, as we could see in queries d7 and d9: both queries used the same data, and in d9 the difference between the execution times of MongoDB and OrientDB was remarkably smaller.

Overall, OrientDB is beneficial to use when more than one data model is needed. However, if we need only a document model and are using a large dataset, MongoDB is a more suitable choice, because its performance was significantly better on both indexed and non-indexed data.


Query   Neo4J     MongoDB   OrientDB
g1      24.586s   -         1m16s
g2      30.638s   -         4m41s
g3      3m43s     -         10m52s
g4      30m38s    -         15m37s
g5      25m19s    -         28m41s
g6      2m7s      -         20.436s
g7      3m32s     -         1m22s
g8      1m14s     -         8.247s
d1      -         9m27s     38m2s
d2      -         17m6s     51m32s
d3      -         10m15s    32m13s
d4      -         12m47s    32m3s
d5      -         11m4s     34m42s
d6      -         18m42s    42m57s
d7      -         2.571s    50.324s
d8      -         2.652s    77.893s
d9      -         0.673s    1.078s
d10     -         0.562s    1.459s

Table 4.3: Average times of all queries
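The two kinds of graph workload contrasted above, depth-limited traversal and shortest-path search, both reduce to breadth-first search. The following Python sketch is purely illustrative; the toy graph and node names are invented and do not come from our datasets.

```python
from collections import deque

# Toy adjacency-list graph; node names are hypothetical.
graph = {
    "A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
    "D": ["F"], "E": ["F"], "F": [],
}

def neighbors_up_to_depth(graph, start, max_depth):
    """Nodes reachable within max_depth hops, as in the
    depth-limited traversal queries."""
    seen, frontier = {start}, {start}
    for _ in range(max_depth):
        frontier = {n for node in frontier for n in graph[node]} - seen
        seen |= frontier
    return seen - {start}

def shortest_path_length(graph, start, goal):
    """BFS shortest path in hops, as in the shortest-path queries.
    Returns None when no path exists."""
    queue, dist = deque([start]), {start: 0}
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for n in graph[node]:
            if n not in dist:
                dist[n] = dist[node] + 1
                queue.append(n)
    return None
```

Both Neo4J and OrientDB implement such traversals natively (e.g. Cypher's `shortestPath` and OrientDB's `TRAVERSE`/`SHORTESTPATH`), but with different storage layouts, which is one likely source of the performance differences in Table 4.3.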

5 Conclusion

In this thesis, we compared the multi-model database OrientDB with the graph database Neo4J and the document database MongoDB. We focused on comparing the performance of these databases on several different queries. We also described NoSQL databases and their various categories and compiled a list of publicly available datasets. The queries were split into two categories: the first focused on graph data, the second on document data. In each group, we concentrated on queries that are likely to be used in practice; for graph data, we focused on traversal between nodes, and for document data on filtering. Our results showed that the multi-model database OrientDB was in some cases faster than its single-model variants Neo4J and MongoDB. For graph data, this was the case when finding the shortest path between nodes, both when a path existed and when it did not. For document data, it was faster when the data was indexed and the number of results satisfying the condition was not very large. Moreover, it is possible to reduce the difference between OrientDB and MongoDB by adding more RAM. We can therefore recommend the multi-model database OrientDB in cases where it is necessary to work with more than one data model. The difference between OrientDB and its single-model counterparts Neo4J and MongoDB in our test cases is not very large, and in some cases OrientDB even outperforms them by a large margin. The most significant difference is on non-indexed data, where OrientDB lags far behind MongoDB, so in cases where indexing the data is not beneficial, MongoDB is a better choice.

5.1 Future Work

In the future, we would recommend using differently sized datasets, because we think dataset size was one of the decisive factors in our testing. Adding more nodes to the cluster and improving the specifications of each node, especially increasing its RAM, may also prove beneficial, because we saw that more RAM greatly increased the performance of the databases we used for testing.


We identified that indexes are very important for the multi-model database to match its single-model counterparts. Therefore, indexing all data in the multi-model database and comparing it to the single-model variants may also bring interesting results. We also recommend testing various other multi-model and single-model databases, due to their different implementations.

Bibliography

1. SADALAGE, Pramod J; FOWLER, Martin. NoSQL distilled: a brief guide to the emerging world of polyglot persistence. Pearson Education, 2013.
2. OUSSOUS, Ahmed; BENJELLOUN, Fatima-Zahra; LAHCEN, Ayoub Ait; BELFKIH, Samir. Comparison and classification of databases for big data. In: Proceedings of International Conference on Big Data, Cloud and Applications. 2015, vol. 2.
3. MONIRUZZAMAN, A. B. M.; HOSSAIN, Syed Akhter. NoSQL Database: New Era of Databases for Big data Analytics - Classification, Characteristics and Comparison. CoRR. 2013, vol. abs/1307.0191. Available from arXiv: 1307.0191.
4. NAYAK, Ameya; PORIYA, Anil; POOJARY, Dikshay. Type of NOSQL databases and its comparison with relational databases. International Journal of Applied Information Systems. 2013, vol. 5, no. 4, pp. 16–19.
5. MongoDB storage. Available also from: https://docs.mongodb.com/manual/core/storage-engines/.
6. MongoDB [online] [visited on 2019-04-30]. Available from: https://www.mongodb.com/.
7. JOUILI, Salim; VANSTEENBERGHE, Valentin. An empirical comparison of graph databases. In: 2013 International Conference on Social Computing. 2013, pp. 708–715.
8. HOLZSCHUHER, Florian; PEINL, René. Performance of graph query languages: comparison of cypher, gremlin and native access in Neo4j. In: Proceedings of the Joint EDBT/ICDT 2013 Workshops. 2013, pp. 195–204.
9. Neo4j [online] [visited on 2019-04-30]. Available from: https://neo4j.com/top-ten-reasons/.
10. Dgraph. Available also from: https://github.com/dgraph-io/dgraph.
11. Dgraph documentation. Available also from: https://docs.dgraph.io/.


12. LU, Jiaheng; HOLUBOVÁ, Irena; CAUTIS, Bogdan. Multi-model databases and tightly integrated polystores: Current practices, comparisons, and open challenges. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018, pp. 2301–2302.
13. LU, Jiaheng. Towards Benchmarking Multi-Model Databases. In: CIDR. 2017.
14. ArangoDB storage. Available also from: https://www.arangodb.com/docs/stable/architecture-storage-engines.html#performance.
15. OrientDB storage. Available also from: http://orientdb.com/docs/3.0.x/internals/Paginated-Local-Storage.html.
16. City of Chicago Parking Ticket Data [online] [visited on 2019-04-30]. Available from: https://www.propublica.org/datastore/dataset/chicago-parking-ticket-data.
17. NYC Parking Tickets [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/new-york-city/nyc-parking-tickets.
18. New York Taxi Data 2009-2016 in Parquet Format [online] [visited on 2019-04-30]. Available from: http://academictorrents.com/details/4f465810b86c6b793d1c7556fe3936441081992e.
19. Chicago Taxi Rides 2016 [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/chicago/chicago-taxi-rides-2016.
20. New York City Bus Data [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/stoney71/new-york-city-transport-statistics.
21. NOAA Global Surface Summary of the Day [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/noaa/noaa-global-surface-summary-of-the-day.
22. Safecast [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/safecast/safecast.
23. World Development Indicators (WDI) Data [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/bigquery/worldbank-wdi.


24. NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories [online] [visited on 2019-04-30]. Available from: http://academictorrents.com/details/557481faacd824c83fbf57dcf7b6da9383b3235a.
25. General Practice Prescribing Data [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/nhs/general-practice-prescribing-data#chem.csv.
26. Death in the United States [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/cdc/mortality#2014_codes.
27. EEG data from basic sensory task in Schizophrenia [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/broach/button-tone-sz.
28. U.S. College Scorecard Data 1996-2015 [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/noriuk/us-college-scorecard-data-19962015#Preview_MERGED2014_15_PP.csv.
29. SINHA, Arnab; SHEN, Zhihong; SONG, Yang; MA, Hao; EIDE, Darrin; HSU, Bo-June (Paul); WANG, Kuansan. Microsoft Academic Graph - 2016/02/05. Available also from: https://academicgraph.blob.core.windows.net/graph-2016-02-05/index.html.
30. Seattle Checkouts by Title [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/city-of-seattle/seattle-checkouts-by-title.
31. Seattle Library Collection Inventory [online] [visited on 2019-04-30]. Available from: https://www.kaggle.com/city-of-seattle/seattle-library-collection-inventory.
32. reddit_data [online] [visited on 2019-04-30]. Available from: http://academictorrents.com/details/85a5bd50e4c365f8df70240ffd4ecc7dec59912b.
33. Wikipedia English Official Offline Edition 2014-07-07 [online] [visited on 2019-04-30]. Available from: http://academictorrents.com/details/e18b8cce7d9cb2726f5f40dcb857111ec573cad4.
34. Pwned Passwords Dataset [online] [visited on 2019-04-30]. Available from: http://academictorrents.com/details/53555c69e3799d876159d7290ea60e56b35e36a9.
