Comparison of a Multi-Model Database with Its Single-Model Variants
Masaryk University
Faculty of Informatics

Comparison of a multi-model database with its single-model variants

Bachelor’s Thesis

Michal Merjavý

Brno, Fall 2019

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Michal Merjavý

Advisor: Mgr. Martin Macák

Acknowledgements

I would like to thank my advisor, Mgr. Martin Macák, for giving me advice which greatly improved the quality of this work.

Access to computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures" (CESNET LM2015042), is greatly appreciated.

Abstract

The goal of this thesis was to compile an overview of publicly available datasets and to compare a multi-model database with its single-model counterparts, specifically OrientDB with Neo4J and MongoDB. We begin with an overview of NoSQL databases and their categories, including implementations of document, graph and multi-model databases. We then briefly describe the publicly available datasets we found. Following that are the document and graph queries that were tested and the results of those queries.

Keywords

NoSQL, multi-model database, graph database, document database, time series database, OrientDB, Neo4J, MongoDB, benchmarking, datasets, public datasets

Contents

Introduction
1 Database overview
  1.1 NoSQL databases
    1.1.1 Key-Value stores
    1.1.2 Document databases
    1.1.3 Wide-Column stores
    1.1.4 Graph databases
    1.1.5 Multi-model databases
2 Public available datasets
  2.1 Transportation
  2.2 Environment
  2.3 Economics
  2.4 Health
  2.5 Education
  2.6 Other
3 Designed queries
  3.1 Setup
    3.1.1 Chosen datasets
  3.2 Graph queries
  3.3 Document queries
4 Results of queries
  4.1 Graph queries measurements
  4.2 Document queries measurements
  4.3 Summary of results
5 Conclusion
  5.1 Future Work
Bibliography

List of Tables

1.1 Multi-model databases
1.2 Multi-model databases
1.3 Multi-model databases
2.1 Found datasets
4.1 Graphs queries
4.2 Document queries
4.3 Average times of all queries

Introduction

Nowadays, almost every company needs to store and access data to work correctly, which is why most of them decide to use databases. Working with data more efficiently saves time, so it is crucial to select a useful tool. One of the most challenging problems is how to deal with the variety of data a company needs to store. If a company needs to store and analyze two or three different types of data, it would require multiple single-model databases. That includes learning to set them up properly and studying the query languages they use. With a multi-model database, however, it is possible to work with only one database for all of these purposes.

Multi-model databases are getting more popular every year because of their ability to support multiple data models while using one backend for all of them.
The focus of the thesis is to find out whether a multi-model database can compete with its single-model counterparts without losing too much performance in exchange for its practicality. This thesis compares graph and document data.

This thesis begins with a description of NoSQL databases in general and of implementations of different multi-model, graph and document databases. It then goes on to an overview of the freely available datasets that we found. Datasets are grouped by their respective fields of study. The next chapter focuses on the queries and datasets we used for benchmarking. The fourth chapter includes measurements for the queries that were used and an interpretation of the results.

1 Database overview

In this chapter, we provide information on NoSQL databases and multi-model databases, and we introduce some implementations of graph, document and multi-model databases.

1.1 NoSQL databases

NoSQL stands for Not Only SQL and refers to a group of nonrelational data management systems in which databases do not primarily use tables and, most of the time, do not use SQL for data manipulation [1]. NoSQL databases also provide better scalability and processing of large-scale datasets. In classical database systems, transactions guarantee the integrity of data. However, scaling transaction-based systems proved to be a difficult problem. That is why NoSQL databases lowered the requirements on consistency and achieved better availability and partitioning. As a result, NoSQL databases mostly use systems known as BASE (Basically Available, Soft-state, Eventually consistent) instead of transactions [2]. BASE systems do not require consistency after every transaction, only that the system eventually reaches a consistent state [3].

NoSQL databases can be classified into four basic categories: Key-Value stores, Document databases, Wide-Column stores, and Graph databases.
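The BASE behaviour described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the classes `Replica` and `EventuallyConsistentStore` and the key `user:1` are hypothetical names invented for this illustration, replication is simulated in memory, and no real NoSQL system is implemented this simply.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data (hypothetical, in-memory)."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class EventuallyConsistentStore:
    """Toy store: writes hit one replica first and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = deque()  # writes not yet applied to every replica

    def write(self, key, value):
        # Acknowledge as soon as the first replica has the value
        # (basically available), without waiting for the others.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def sync(self):
        # Apply queued writes to the remaining replicas; afterwards all
        # replicas agree again (eventually consistent).
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica.data[key] = value


store = EventuallyConsistentStore()
store.write("user:1", "Michal")
stale = store.replicas[1].read("user:1")  # replica 1 has not seen the write yet
store.sync()
fresh = store.replicas[1].read("user:1")  # replica 1 has caught up
```

Between `write` and `sync` the replicas disagree, which is the "soft state" that a transactional system would not allow. After `sync` every replica returns the same value.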
1.1.1 Key-Value stores

These data management systems use hash tables in which unique keys point to items of data, creating key-value pairs [2]. Hash tables are suitable for looking up values in extremely large datasets. Key-value databases do not provide traditional database capabilities. To ensure atomicity of transactions and consistency of parallel transactions, users have to rely on the application itself. In key-value databases, it is only possible to query the key part of the pair. This makes it impossible to extract records that contain a particular set of values. Because of their simplicity, key-value stores are suitable for applications where the speed of retrieving the data is essential. Nevertheless, if there is a need to query data by value, key-value databases are not suitable.

1.1.2 Document databases

Document databases were created to improve the storage and management of documents. Documents in these databases are more flexible than records in relational databases because they are schema-less [2]. Documents are mostly stored in one of these formats: XML, JSON or BSON. In most cases, documents can contain multiple key-value pairs, key-array pairs, or nested documents. Document databases hold semi-structured data in attribute/value pairs. The difference between document databases and key-value stores is the ability to search not only by key but also by value. Document databases are used for storing semi-structured data where it is not certain whether a value will be entered or not, which could cause problems in relational databases [4]. It is not optimal to use document databases when the data has a lot of relations.

Below is a description of CouchDB and MongoDB as examples of document database implementations. MongoDB is used for comparison later on. Following that is a summary of the differences between these databases in Table 1.1.
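To make the distinction between the two categories concrete, the following Python sketch contrasts a lookup by key with a search by value. It is a minimal illustration that uses only in-memory Python structures, not a real database. The sample records and the helper names `kv_get` and `find` are assumptions introduced for this example.

```python
# Key-value store modelled as a dict: the store treats each value as an
# opaque blob and can only look it up by its key.
kv_store = {
    "person:1": '{"firstname": "Michal", "city": "Brno"}',
    "person:2": '{"firstname": "Martin"}',
}


def kv_get(key):
    """Return the value stored under key, or None if the key is unknown."""
    return kv_store.get(key)


# Document collection modelled as a list of dicts: fields are visible to the
# store and may differ from document to document (schema-less).
documents = [
    {"_id": 1, "firstname": "Michal", "city": "Brno"},
    {"_id": 2, "firstname": "Martin"},  # no "city" field entered
]


def find(collection, selector):
    """Return documents whose fields equal every pair in selector."""
    return [
        doc
        for doc in collection
        if all(field in doc and doc[field] == value
               for field, value in selector.items())
    ]


by_key = kv_get("person:1")                       # lookup by key works
by_value = kv_get("Michal")                       # a value is not a key
matches = find(documents, {"firstname": "Michal"})  # search by value
```

The key-value lookup succeeds only when the exact key is known, while `find` filters on the contents of the documents and tolerates documents that lack a field.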
CouchDB

CouchDB¹ is a document-oriented NoSQL database. It stores data as JSON and uses JavaScript as its query language for working with the stored data. A database in CouchDB is a collection of independent documents, each of which stores its own data and a self-contained schema. Documents also contain metadata that makes it possible to merge differences that may have arisen while the databases were disconnected. CouchDB does not lock the database while writing data, and any conflicts that may have occurred are left to the application to resolve. It is written in Erlang, JavaScript, C and C++ and was released in 2005. Below is an example of a JSON query that finds people with the first name Michal.

    {
      "selector": {
        "firstname": "Michal"
      }
    }

1. https://couchdb.apache.org/

MongoDB

MongoDB² is a database based on a document model. It was initially released in 2009 and is written in C++, Go, JavaScript and Python. MongoDB provides two storage engine options: the WiredTiger storage engine and the In-Memory storage engine [5]. Starting in MongoDB 3.2, the default storage engine is WiredTiger. It is recommended for new deployments and is suitable for most workloads. WiredTiger uses optimistic concurrency control for most reads and writes. With optimistic concurrency control, only intent locks are used at the global, database and collection levels. If the storage engine detects a conflict between two operations on the same document, one of them incurs a write conflict and is retried. WiredTiger also uses Multi-Version Concurrency Control (MVCC). This allows MongoDB to recover from the last valid checkpoint when it encounters an error while writing a new checkpoint. MongoDB creates checkpoints at intervals of sixty seconds. In MongoDB, data are stored in flexible JSON-like documents; this means that fields can vary from document to document, and it is