
MSc thesis
Master’s Programme in Computer Science

Integration of SQL and NoSQL systems

Olli-Pekka Lindström

December 24, 2020

Faculty of Science
University of Helsinki

Supervisor(s)

Prof. Jiaheng Lu

Examiner(s)

Contact information

P. O. Box 68 (Pietari Kalmin katu 5)
00014 University of Helsinki, Finland

Email address: [email protected].fi
URL: http://www.cs.helsinki.fi/

HELSINGIN YLIOPISTO – HELSINGFORS UNIVERSITET – UNIVERSITY OF HELSINKI

Tiedekunta — Fakultet — Faculty: Faculty of Science
Koulutusohjelma — Utbildningsprogram — Study programme: Master’s Programme in Computer Science

Tekijä — Författare — Author: Olli-Pekka Lindström

Työn nimi — Arbetets titel — Title: Integration of SQL and NoSQL database systems

Ohjaajat — Handledare — Supervisors: Prof. Jiaheng Lu

Työn laji — Arbetets art — Level: MSc thesis
Aika — Datum — Month and year: December 24, 2020
Sivumäärä — Sidoantal — Number of pages: 40 pages, 41 appendix pages

Tiivistelmä — Referat — Abstract

Until recently, database management systems focused on the relational model, in which data are organized into tables with columns and rows. Relational databases are known for the widely standardized Structured Query Language (SQL), ACID transactions, and strict data schemas.

However, with the introduction of Big Data, relational databases became too heavy for some use cases. In response, NoSQL databases were developed. The four best-known categories of NoSQL databases are key-value, document, column family, and graph databases. NoSQL databases impose fewer data consistency control measures to make processing more efficient.

NoSQL databases have not replaced SQL databases in the industry. Many legacy applications still use SQL databases, and newer applications also often require the stricter and more secure data processing of SQL databases. This is where the idea of SQL and NoSQL integration comes in. There are two mainstream approaches to combining the benefits of SQL and NoSQL databases: multi-model databases and polyglot persistence. Multi-model databases are database management systems that store and process data in multiple different data models under the same engine. Polyglot persistence refers to the principle of building a system architecture that uses different kinds of database engines to store data. Systems implementing the polyglot persistence principle are called polystores.

This thesis introduces SQL and NoSQL databases and their two main integration strategies: multi-model databases and polyglot persistence. Some representative multi-model databases and polystores are presented. Finally, challenges and future research directions for multi-model databases and polyglot persistence are introduced and discussed.

ACM Computing Classification System (CCS):
General and reference → Document types → Surveys and overviews
Applied computing → Document management and text processing → Document management → Text editing

Avainsanat — Nyckelord — Keywords: SQL, NoSQL, multi-model database, polyglot persistence, polystore

Säilytyspaikka — Förvaringsställe — Where deposited: Helsinki University Library

Muita tietoja — Övriga uppgifter — Additional information: Software Systems study track

Contents

1 Introduction
  1.1 Context
  1.2 Thesis overview

2 Single-model database systems
  2.1 Relational databases
    2.1.1 Data organization and schema
    2.1.2 SQL
    2.1.3 Transactions (ACID)
    2.1.4 Scaling
    2.1.5 Considerations
  2.2 NoSQL databases
    2.2.1 Key-value stores
    2.2.2 Document stores
    2.2.3 Column family stores
    2.2.4 Graph stores
    2.2.5 Considerations

3 Multi-model databases
  3.1 Primary multi-model databases
    3.1.1 OrientDB
    3.1.2 ArangoDB
    3.1.3 Cosmos DB
  3.2 Secondary multi-model databases
    3.2.1 PostgreSQL
    3.2.2 Microsoft SQL Server

4 Polyglot persistence
  4.1 Representative systems
    4.1.1 BigDAWG
    4.1.2 PolyBase
  4.2 Considerations

5 Challenges and open problems of SQL and NoSQL integration
  5.1 Query languages and data processing
  5.2 Data modeling and schema design
  5.3 Evolution
  5.4 Consistency Control
  5.5 Extensibility
  5.6 Indexing
  5.7 Partitioning
  5.8 Standardization
  5.9 Performance

6 Conclusions

Bibliography

1 Introduction

Relational database management systems have been a major part of software development for decades. They support complex data queries using the standardized query language SQL and ensure high data integrity by enforcing atomicity, consistency, isolation, and durability (ACID). However, these systems struggle with increasing volumes of data with a flexible structure. NoSQL (not only SQL) database systems discard schemas and ACID compliance for better performance and scalability. They can be used to read and write data more quickly and store larger amounts of data without having to define the data schema case by case.

Given increasingly complex business and user needs, it is not always possible to build a solution with only SQL or only NoSQL database systems. Therefore, the need for systems that utilize both kinds of databases or data models has been increasing lately, from legacy systems that cannot keep up with increasing amounts of data to data stores that would benefit from formalizing some of their parts. Having a single interface or system architecture for structured data and unstructured NoSQL database data makes development, maintenance, and data management easier.

1.1 Context

When it comes to Big Data, one of the biggest challenges is the Variety of data. With small data sets, it is usually enough to store data in one format. However, when data sets grow larger, certain operations become unmanageable with certain data formats. Therefore it is useful to have the possibility to store data in different formats, such as hierarchical, network, or relational formats, and to handle them effectively at the same time.

In the last few years, organizations have started to realize the importance of unstructured non-relational data. While relational data has its place in systems where regulating data content is critical, the increasing amount of data of all sizes and shapes calls for NoSQL systems’ help.

A famous slogan in the database community is ”One size does not fit all.” Single-model databases are optimized for the data model that they are built for. It is possible to simulate NoSQL features in relational databases and vice versa, but performance can become a bottleneck with larger data sets when the features are not supported natively. Hence it would be desirable, in theory, that data processing mostly happens in a system that is best suited for the data model.

Organizations often spread data across different data storage engines in their solutions, sometimes for the aforementioned reason and sometimes for others. It is, of course, always possible to connect systems manually. However, managing those connections can become very time-consuming as the number of systems increases.

1.2 Thesis overview

This thesis covers the basics of SQL and NoSQL database systems, why it is sometimes necessary or helpful to combine features of such systems, and options for doing so. The focus is on integrating SQL and NoSQL features into one logical system or engine, with special emphasis on bringing together the varying data models used to store data in such systems. The two mainstream approaches for handling multiple data models simultaneously are multi-model databases and polyglot persistence. Multi-model databases are singular engines that have built-in support for multiple models of data, allowing data to be stored in multiple different formats. Polyglot persistence refers to utilizing multiple different kinds of database systems in one software architecture. Systems implementing the polyglot persistence principle are called polystores: system architectures that build an integrated interface over existing database systems.

Chapter 2 looks at single-model SQL and NoSQL systems separately. Afterward, the two main strategies to integrate or combine the two kinds of database systems are introduced: multi-model databases and polyglot persistence. Chapters 3 and 4 go in-depth with multi-model databases and polyglot persistence, respectively, introducing the concepts and some representative systems. Chapter 5 explains some challenges of polyglot persistence and multi-model database systems and explores current and future research directions. Chapter 6 concludes the thesis by summarizing the main takeaways.

2 Single-model database systems

The most popular database management systems for handling structured data are based on the relational model. On the other hand, NoSQL systems are the more popular solution for dealing with unstructured data. Both are designed to handle the respective types of data as efficiently as possible. Having a single model for an entire application’s data also makes development, management, and maintenance more straightforward. This chapter covers the basics of database systems that focus on one data model. A distinction is made between SQL and NoSQL databases. SQL databases, also known as relational databases, are explained first, emphasizing a few key concepts that are later used to compare SQL and NoSQL databases. The second section introduces NoSQL databases and their four main categories.

2.1 Relational databases

Relational database management systems are the traditional database systems that manage data using the relational model [3]. They usually use the standardized Structured Query Language (SQL) for queries. For this reason, they are often also called SQL databases. The relational model organizes data into tables, which describe types of entities, such as users or products.

Relational databases often enforce strict rules for data management. First, the schema prevents erroneous data from entering the database. This covers data type conflicts as well as additional constraints defined specifically for an application to function properly. Secondly, the use of ACID (Atomicity, Consistency, Isolation, Durability) transactions ensures data validity across data manipulations in case of computational errors and system failures.

Some popular relational database management systems are PostgreSQL, Microsoft SQL Server, Oracle, and MySQL [4]. Oracle and SQL Server are commercial database systems, and PostgreSQL and MySQL are open-source solutions.

2.1.1 Data organization and schema

The data in relational databases are organized into tables with rows and columns. Tables describe the types of entities. Rows represent the instances of the entities, and columns specify their attributes. Columns are defined with a data type such as text, number, or date. These organizational rules, along with any other integrity or structural constraints, form the database schema.

A variety of constraints are often used to define the data structure in tables. Primary keys, which identify each row uniquely, can be defined as singular values or as a combination of columns. Foreign keys specify relationships with other tables. The column values can also be controlled with manual constraints such as an upper limit for a number field.

The schema definition in relational databases is strict. The database rejects any data that does not fit the schema. For example, all rows of a table have to conform to the data types of each column.
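To make this concrete, here is a minimal sketch of such a schema in standard SQL (the table and column names are illustrative, not taken from any particular application):

CREATE TABLE users (
    id        INTEGER PRIMARY KEY,          -- primary key identifies each row uniquely
    firstname VARCHAR(100) NOT NULL,
    age       INTEGER CHECK (age >= 0)      -- manual constraint on a number field
);

CREATE TABLE orders (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users (id),  -- foreign key: relationship to the users table
    total   NUMERIC(10, 2)
);

An INSERT whose values violate any of these definitions, for example a negative age, is rejected by the database.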

2.1.2 SQL

Most relational database management systems use Structured Query Language (SQL) as the means of managing data. It is used to find, add, update, and remove data with the SELECT, INSERT, UPDATE, and DELETE commands. It also allows defining the data schema by adding or removing tables or altering table columns or constraints. SQL is also efficient in querying relationships across multiple tables with the JOIN operation. The variety of options and the expressiveness of the language facilitate complex queries. It is the industry standard for relational database languages, and most database developers are familiar with it. Figure 2.1 describes an SQL query that returns all rows in the USERS table where the column ”firstname” has the value ”Jane.”

Figure 2.1: Example SQL query

SELECT * FROM USERS WHERE firstname = 'Jane'
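As an illustration of querying relationships with JOIN, the following sketch reuses the illustrative users and orders tables from section 2.1.1 (again, the names are made up):

SELECT u.firstname, o.total
FROM users u
JOIN orders o ON o.user_id = u.id   -- follow the foreign key from orders to users
WHERE u.firstname = 'Jane';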

2.1.3 Transactions (ACID)

Relational databases are often used to enforce strict rules for data management and modifications. The concept of transactions facilitates these rules and helps to minimize anomalies. A transaction is a set of data manipulation operations that form a single unit of work [8]. A famous example of a transaction is the bank account transfer. When funds are transferred from one bank account to another, the amount is subtracted from one account and added to the other. If the transfer is run through a transaction, it is guaranteed that the transfer will not result in an illegal state where the amount of funds is only changed for one account.

Transactions in relational databases are often described with the acronym ACID. It refers to Atomicity, Consistency, Isolation, and Durability.

Transactions consist of multiple operations that are meant to be executed as a single logical unit. Either all operations are executed successfully, or none of them are executed at all. This is called atomicity. If one of the operations fails, all previously succeeded operations are restored to the data state before the transaction. This is known as a rollback.

Transactions always preserve the consistency of the data. Transactions can only commit to a consistent state or roll back into a previous consistent state.

The changes of a transaction are not reflected in the database until the transaction is complete. It means that the intermediate data state cannot be seen by users or used by other transactions until then. In other words, concurrent transactions run in isolation.

The database management system must guarantee that the changes of committed transactions are actually reflected in the database and survive any malfunctions such as system failures. Successful transactions must be written to persistent storage, which ensures durability.
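The bank transfer example might look like the following in SQL (a sketch; the accounts table and the amounts are made up):

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- subtract from one account
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- add to the other
COMMIT;

If either UPDATE fails, or ROLLBACK is issued instead of COMMIT, neither change is applied.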

2.1.4 Scaling

SQL databases are usually run on a single server. When applications grow, the usual scaling solution is to switch to better hardware. This is called vertical scaling. However, there is always an upper limit for a single machine’s performance, whereas no such limit may exist for the amount of data. Another option for scaling is adding read replicas. This means copying your database to other servers to distribute the workload of data reads across multiple machines. However, this adds overhead in keeping the database copies synchronized.

2.1.5 Considerations

Using SQL databases is a good option if you need to keep your data consistent. This is ensured by the schema definition and the fact that the database rejects faulty data. Another good reason is if you have well-defined relationships for your data and need complex queries across them. SQL is a well-known and flexible language that allows complex queries. Relational databases also handle relationships between data naturally with primary and foreign keys and JOIN queries.

SQL databases may not be optimal if you need to process large amounts of data. The transactional data processing and enforcing the schema introduce much processing overhead. Scaling the application is also difficult and limited. The strict schema can also be a problem if your data has a varying or uncertain structure. It may require much manual work to resolve the issues between incoming data and the schema.

2.2 NoSQL databases

NoSQL is an umbrella term used to describe open-source, distributed, non-relational database systems where SQL-style querying is not the main focus. There is no official meaning for the NoSQL acronym. However, a popular one nowadays seems to be ”Not only SQL,” perhaps because the variety of systems considered under the term has increased since its inception. Even though there is no strict set of features for NoSQL systems, some common characteristics can be identified.

NoSQL database systems were developed in response to the increasing need for data structure flexibility and processing larger amounts of data. One of the common features of NoSQL databases is using simple and flexible data models. They often rely on unstructured or semistructured data, which means that the data can be completely schemaless, or it may have constraints or a fixed structure in certain parts and allow flexibility in others.

NoSQL databases are sometimes referred to as BASE (Basically Available, Soft-state, Eventually consistent). Basically Available refers to the system being available even though some parts are not. Soft-state indicates that the system tolerates inconsistent states and can be used at such times. After a while, the system will become consistent.

NoSQL databases often distribute data processing and storage across multiple servers. This strategy is called partitioning or sharding, and it allows efficient scalability in data throughput and volume while maintaining low latency. Adding servers to a system is called horizontal scaling [6].

NoSQL databases are usually divided into four categories based on their main data model: key-value, document, column family, and graph databases [9]. Most of them utilize the key-value principle in some form, but the main data model and storage practices, for example, vary somewhat.

2.2.1 Key-value stores

The key-value model is probably the simplest data model used in NoSQL databases. Each record is stored as a key-value pair where the key uniquely identifies the value, and the value can be anything. The values are stored as plain bytes with no reference to the data structure or contents, making them independent from one another and the keys the only way to retrieve them. Key-value stores have no way to define a data schema, but records can be arranged into collections. Key-value stores often rely on in-memory processing [9]. Popular key-value stores include Redis and Memcached [4].

Key-value stores often utilize hash-sharding to partition data [6]. It means creating a hash function for the keys and distributing the records into multiple servers based on the hash function values. This makes horizontal scaling easy. It only requires adding values to the hashing function and redistributing the data. Some key-value stores do this automatically.

Key-value stores are effective when an application focuses on a large number of small read and write operations on singular values or continuous streaming [10]. Reading and writing are fast because the values are saved in a raw format, and it is the application’s responsibility to interpret them. In-memory processing also helps. For example, key-value stores are efficient with large multimedia objects such as images or audio [2]. Key-value stores may not be the best choice if complex queries are required because queries beyond single value lookups are not supported [6]. The data model can also be too simple for some applications.
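As a small illustration of the key-value interface, the following uses Redis command syntax (the key and value are made up):

SET user:42 '{"firstname": "Jane"}'
GET user:42

The value comes back exactly as it was stored; any structure inside it is invisible to the database and must be interpreted by the application.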

2.2.2 Document stores

The document data model considers documents as the basic record unit. Documents can be groupings of structured or semistructured data [10]. Document stores can be considered a subcategory of key-value stores since they also store values for keys. However, they also read the data’s internal structure to allow more flexible processing and querying and further optimization. Popular document databases include MongoDB, CouchDB, Google Cloud Firestore, and Firebase Realtime Database [4]. A popular way to store documents is using the JSON∗ (JavaScript Object Notation) format. A JSON document, or object, consists of a set of key-value pairs where the keys are attribute names, and the values can be atomic values, lists, sets, tuples, or objects, for example. Each document can belong to a collection that describes the type of documents it contains. Figure 2.2 describes a MongoDB† document in JSON format. It contains a nested object in the ”address” field and an array in the ”hobbies” field. Arrays can also be composed of object values. Other well-known document formats include XML, YAML and BSON (Binary JSON) [10].

Figure 2.2: Example MongoDB document in JSON format

{
  "_id": "5cf0029caff5056591b0ce7d",
  "firstname": "Jane",
  "lastname": "Wu",
  "address": {
    "street": "1 Circle Rd",
    "city": "Los Angeles",
    "state": "CA",
    "zip": "90404"
  },
  "hobbies": ["surfing", "coding"]
}

In contrast to key-value stores, the semistructured nature of document data generally allows querying with search conditions on the fields and creating indexes on them. It simultaneously allows flexibility in the field definitions of each object.

∗https://www.json.org/
†https://www.mongodb.com/

The number of fields and the field names can vary from object to object, and queries are not affected by this. Queries with fields that are not a part of certain objects can leave those objects out of the results [10]. Figure 2.3 describes a MongoDB query that retrieves records from the users collection and filters them by the ”firstname” field.

Figure 2.3: Example MongoDB query

db.users.find({ "firstname" : "Jane" })

Document stores are a good solution for applications that use structured data with some variance in data types or in the number of columns [10]. They provide a good middle ground by structuring the data where necessary and allowing flexibility elsewhere. They are also a popular solution with web-based applications where the JSON data type is a natural fit.

2.2.3 Column family stores

Column family stores, also known as column-oriented stores, extensible record stores, and wide-column stores [9, 6], store data in tables similar to relational databases. However, instead of fixed column definitions, column family database tables consist of rows with an arbitrary amount of key-value pairs representing columns. This model can be seen as a two-dimensional key-value store. Popular wide-column databases include Cassandra, HBase, Google Cloud Bigtable, and Accumulo [4].

Column family stores partition columns into column families. Entire data rows are not located together on disk. Instead, only values of the same column family are located together, and the rest of the row can reside elsewhere [6]. While the number of columns is flexible, the column families often have to be strictly defined [9]. Column family stores also model empty values effectively in terms of storage space because a missing column key-value pair represents an empty value, removing the need to reserve space for it.

Cassandra is a popular open-source column family datastore [1]. Like relational databases, it stores data in tables with columns and rows. However, relational modeling is not supported in Cassandra. For example, you cannot use foreign keys or table joins in queries. Instead, the table structure design is query-driven. The tables are built in a way that every query from an application can be answered by retrieving data from one table. This leads to data duplication across multiple tables and data denormalization. Cassandra uses its own Cassandra Query Language (CQL) for queries. Its syntax is similar to SQL, but it only contains a subset of clauses, such as SELECT, FROM, WHERE, GROUP BY, ORDER BY, and LIMIT. It also has limitations on filtering and sorting.

Another popular column family database management system is Accumulo∗, also known as a text database. It is used to store data in a key-value manner where the keys have multiple attributes. Figure 2.4 describes the data structure in Accumulo. Each data object has an identifier and a timestamp, as well as three separate column attributes. Column Family describes a primary grouping level for the object. It can be used to partition data. Column Qualifier is a more specific attribute. Column Visibility is a label to control accessibility to the object.

Figure 2.4: Data structure in Accumulo

Column family stores are effective with applications that often retrieve certain parts of objects at a time [6]. The partitioning based on column families makes this efficient. The model is also suitable when the number of attributes in the data varies greatly [9].
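To illustrate Cassandra’s query-driven table design mentioned above, here is a minimal CQL sketch (the table and the query it serves are made up; in a real application the same data would typically be duplicated into other tables serving other queries):

CREATE TABLE users_by_city (
    city    text,
    user_id uuid,
    name    text,
    PRIMARY KEY (city, user_id)   -- city is the partition key
);

SELECT name FROM users_by_city WHERE city = 'Helsinki';

The table is designed so that this one query can be answered from it alone.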

2.2.4 Graph stores

The graph data model is based on the mathematical concept of graphs. Graphs consist of nodes that are connected to one another by edges. Graph stores store data in node and edge objects. Nodes represent the main data objects, and edges describe relationships between them. Both can have a varying number of attributes to provide additional information [10].

Neo4j† is a popular graph database. It uses the Cypher query language to query graphs. Figure 2.5 illustrates a Cypher graph query. This query finds the Company that a Person called Jennifer works for. First, it finds Jennifer’s Person node. Then it follows the WORKS_FOR relationship to find the company. Finally, it returns the company node.

∗https://accumulo.apache.org/
†https://neo4j.com

Figure 2.5: Example Cypher query in Neo4j

MATCH (:Person {name: 'Jennifer'})-[:WORKS_FOR]->(company:Company)
RETURN company

Other popular database management systems utilizing graph structures include OrientDB, ArangoDB, and Microsoft Azure Cosmos DB [4].

Graph stores are optimal when your application relies on relationships [9, 10]. While relational databases also handle relationships efficiently and flexibly, joining data becomes a performance bottleneck when queries need to address recursive relationships, for example, friends of friends. Graph stores can perform these deep traversals in constant time, which makes them an effective solution for social network or recommendation applications, for example. However, graph stores are not well suited as an all-encompassing solution for large-scale data processing applications [10]. Instead, they are often used as a separate component specifically for processing relationships.

Other NoSQL datastores

In addition to the categories Key-value, Document, Column family, and Graph, NoSQL databases have also been further divided into other types, such as Time series, Search engine, and Multivalue [4]. They can often be seen as subtypes of the main categories, being optimized towards a certain use case, similarly to how document databases are a special case of key-value databases. For example, Multivalue database management systems store data in tables similarly to relational databases but allow assigning multiple values to a single attribute of a record. SciDB∗ is a database, also known as an array database, that is specialized for scientific data management. Its data model is built around vectors and multidimensional arrays.

2.2.5 Considerations

NoSQL databases usually rely on horizontal scaling. Their infrastructure is usually built in a way that makes adding and managing partitions simple. Good examples are hash-sharding in key-value databases and column family partitioning in column family databases. Some document and column family databases also support range-sharding, where data is partitioned into value ranges [6].

∗https://www.paradigm4.com/technology/scidb-platform-overview/

Using a NoSQL database is a good option if your application relies on simple reads and writes, but there are a lot of them. The data models are centered around the key-value principle, and complex queries are not supported. The less strict consistency requirements also reduce performance overhead. A NoSQL database can also be a natural option if the data model fits the application well. For example, processing complex relationships in a graph datastore is easy.

However, if the scale of an application is small or the scope is unknown, it may be better to use an SQL database. The flexibility of the relational model and SQL can be used to better answer changing business and application needs. The consistency guarantees are also good when the data is small.

3 Multi-model databases

Connecting existing SQL and NoSQL systems can be considered the natural direction for the integration of these systems. However, there is another, perhaps simpler option. Some systems can handle both SQL and NoSQL data models and features natively. These systems are called multi-model databases [12]. Managing both SQL and NoSQL data in a single data storage engine effectively reduces integration, migration, development, maintenance, and operational issues.

The majority of the most popular database engines are, in fact, multi-model databases, even though they are often associated with one model or the other. For example, the relational databases PostgreSQL and MySQL can be used to store document data, and Oracle and Microsoft SQL Server databases additionally support document and graph formats. Many NoSQL databases also support multiple data formats, such as OrientDB, ArangoDB, Couchbase, and Cosmos DB [4].

Many relational database systems started out as purely relational databases and implemented support for NoSQL data types later. NoSQL databases are not usually considered to have complete relational data support as a secondary data format, but many emulate relational database features by supporting SQL or a similar query language, as well as ACID transactions.

Database engines have been grouped based on whether the system has one primary data model and other secondary models or whether multiple data models are considered primary in that system [4]. This separates systems specifically designed to offer multi-model data storage from systems that focus mainly on one data model and provide support for others as additional features. Let us call the former systems primary multi-model databases and the latter secondary multi-model databases. The next sections introduce a few notable representatives of each category. Primary multi-model databases are introduced at an overview level, and, in the case of secondary multi-model databases, the functionality of the secondary data models specifically is explained.

3.1 Primary multi-model databases

3.1.1 OrientDB

OrientDB∗ stores document, graph, and key-value data. Documents consist of key-value pairs and are stored in classes or clusters. Classes and clusters can be vertices of a graph. OrientDB uses SQL to query and update data.

OrientDB’s basic data unit is a record that can represent various data types such as byte (BLOB), document, vertex (node), or edge. Documents contain the mandatory attributes database key, version number, and class. A class defines the schema type for the record, which can be ”schemaless,” ”schema-full,” or ”mixed.” Records are grouped into clusters. There is one cluster per class by default. Documents can have weak or strong relationships, such as embedded documents. Nodes also have the same form as documents, but edges can be stored in a weaker form to improve performance at the cost of additional attributes.

OrientDB supports indexing and transactions. There are four types of indexes: SB-Tree (based on the B-tree with some optimizations added), hash, auto-sharding (for distributed systems), and Lucene (full-text and spatial). Transactions have full ACID properties and two isolation levels: read committed and repeatable read [17].
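For illustration, a few commands in OrientDB’s SQL dialect (a sketch; the class and property names are made up):

CREATE CLASS Person EXTENDS V
CREATE VERTEX Person SET name = 'Jane'
SELECT FROM Person WHERE name = 'Jane'

The first command defines a document class that is also a graph vertex type, showing how the document and graph models share the same records.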

3.1.2 ArangoDB

ArangoDB† stores document, graph, and key-value data. Documents are stored in collections that can be arranged by their structure into shapes. ArangoDB uses a query language called AQL.

The basic data unit in ArangoDB is a document. Documents can consist of any number of attributes with simple or complex value types. Documents are organized into collections. A collection can be a node or an edge in a graph. Each document contains the database key, collection key, and version attributes. Documents of an edge collection also contain ”from” and ”to” attributes to represent the graph connection between documents.

ArangoDB supports indexing and transactions. Collections and graph edges use hash indexes based on the document’s key field. Other fields can also have hash, skiplist, geo, full-text, or sparse indexes [17].

∗https://www.orientdb.org/
†https://www.arangodb.com/
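For illustration, a small AQL query (a sketch; the collection and attribute names are made up):

FOR u IN users
  FILTER u.firstname == 'Jane'
  RETURN { name: u.firstname, city: u.address.city }

The same language also covers graph traversals over edge collections.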

3.1.3 Microsoft Azure Cosmos DB

Microsoft Azure Cosmos DB∗ is a fully managed NoSQL database service, which focuses on global data distribution, high availability, automatic scaling, and low latency. It hosts a variety of partly separate database APIs that each focus on one data model. All data in Cosmos DB is stored in the JSON format, but the APIs determine how the data is interacted with. Each Cosmos DB instance runs on one of these APIs, and multi-model queries between instances, for example, are not supported natively [20]. However, each API takes advantage of Cosmos DB’s core features, such as distribution and scaling. One of the main benefits of the Cosmos DB APIs is that applications written for a specific database system can also communicate with the respective Cosmos DB API. For example, an application that uses a Cassandra datastore can switch to the Cosmos DB Cassandra API store by only changing connection properties.

The primary database API in Cosmos DB is the CORE/SQL API. It is used to query the JSON data with a subset of SQL. The standard clauses SELECT, FROM, WHERE, GROUP BY, and ORDER BY are supported. Fields and nested objects are accessed with the dot notation. The Cassandra API allows querying data with the Cassandra Query Language (CQL). The MongoDB API provides standard MongoDB functionality and queries. The Gremlin API allows graph queries with the Gremlin† language, and the Table API provides OData and LINQ interfaces.
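For illustration, a CORE/SQL API query over JSON documents like the one in figure 2.2 might look as follows (a sketch; c is the conventional alias for the queried container):

SELECT c.firstname, c.address.city
FROM c
WHERE c.address.city = 'Los Angeles'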

3.2 Secondary multi-model databases

3.2.1 PostgreSQL

PostgreSQL‡ is a popular open-source relational database engine that has also implemented support for JSON data. JSON can be stored in a plain text format or in a decomposed binary format. The former is called ”json” and the latter ”jsonb.”

∗https://azure.microsoft.com/en-gb/services/cosmos-db/
†http://tinkerpop.apache.org/
‡https://www.postgresql.org

Jsonb removes insignificant white spaces and duplicate keys from the objects, for example. It also supports indexing, which plain json does not [18].

JSON documents are stored in tables as field values, like any other data type. They can be retrieved normally in SELECT queries. Additionally, they can be decomposed further with the -> and ->> operators. The operator -> returns a JSON object field by key, and ->> returns a JSON object field as text. They can be chained to return values of nested objects, and they can be used in filtering operations as well. For example, given the info field in the orders table in figure 3.1, the query in figure 3.2 returns the values Beer, Diaper, Toy Car, and Toy Train as text [19].

Figure 3.1: Example JSON field contents in PostgreSQL
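The figure itself is not reproduced here; info column contents consistent with the query in figure 3.2 might look like the following (the customer names and quantities are made up):

{"customer": "John Doe",     "items": {"product": "Beer",      "qty": 6}}
{"customer": "Lily Bush",    "items": {"product": "Diaper",    "qty": 24}}
{"customer": "Josh William", "items": {"product": "Toy Car",   "qty": 1}}
{"customer": "Mary Clark",   "items": {"product": "Toy Train", "qty": 2}}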

Figure 3.2: Example JSON query in PostgreSQL

SELECT info -> 'items' ->> 'product' AS product
FROM orders
ORDER BY product;

3.2.2 Microsoft SQL Server

Microsoft SQL Server∗ is another popular relational database engine. In addition to the relational model, it also supports storing document data in XML and JSON formats and defining graph structures in tables.

SQL Server stores JSON as plain text in NVARCHAR fields. It also provides a variety of functions for handling JSON data. The function ISJSON tests whether a given text is in a valid JSON format. The JSON_VALUE function extracts a singular value, and the function JSON_QUERY extracts an object or an array from JSON text. The function JSON_MODIFY can be used to update the values of properties in JSON text.

SQL Server also provides mechanisms to transform JSON text into relational tables and vice versa. The clause FOR JSON converts table query results into JSON. FOR JSON AUTO can be used to format the resulting JSON automatically, and FOR JSON PATH allows controlling the structure of the resulting JSON. The clause is placed at the end of SELECT queries. The function OPENJSON is used to convert JSON into a relational table format. By default, the function returns one row per property, where the keys are placed into the ”key” column and the values into the ”value” column. An OPENJSON query also allows defining the result schema, which can be used to return tables where the columns correspond to the keys, and the values are placed into them [15].

∗https://www.microsoft.com/en-us/sql-server
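A small sketch of the scalar functions in Transact-SQL (the JSON contents are made up):

DECLARE @j NVARCHAR(MAX) = N'{"firstname": "Jane", "address": {"city": "Los Angeles"}}';

SELECT ISJSON(@j)                    AS is_valid,   -- 1
       JSON_VALUE(@j, '$.firstname') AS firstname,  -- Jane
       JSON_QUERY(@j, '$.address')   AS address;    -- {"city": "Los Angeles"}

SELECT JSON_MODIFY(@j, '$.firstname', 'Janet');     -- returns the updated JSON text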

Figure 3.3: OpenJson query without schema definition

DECLARE @json NVARCHAR(2048) = N'{
  "String_value": "John",
  "DoublePrecisionFloatingPoint_value": 45,
  "DoublePrecisionFloatingPoint_value": 2.3456,
  "BooleanTrue_value": true,
  "BooleanFalse_value": false,
  "Null_value": null,
  "Array_value": ["a","r","r","a","y"],
  "Object_value": {"obj":"ect"}
}';

SELECT * FROM OpenJson(@json);

SQL Server can also utilize graph processing [7]. It allows creating tables that are modeled as nodes or edges. Node tables represent entities, which edge tables connect. Edge tables can be strengthened with edge constraints, guaranteeing that an edge table row connects two node table rows in the database. The constraints can also be created with DELETE CASCADE triggers that delete the connecting edges when the corresponding nodes are deleted. The graph structure can be queried with the MATCH clause. It is used for traversing the graph by giving an edge table, the two connected node tables, and a direction as inputs. Additionally, the SHORTEST_PATH function can be used inside the MATCH clause to find the shortest path between two given nodes.

Figure 3.4: Converting nested JSON into a relational table with OpenJson and schema definition

DECLARE @json NVARCHAR(MAX) = N'[
  {
    "Order": { "Number": "SO43659", "Date": "2011-05-31T00:00:00" },
    "AccountNumber": "AW29825",
    "Item": { "Price": 2024.9940, "Quantity": 1 }
  },
  {
    "Order": { "Number": "SO43661", "Date": "2011-06-01T00:00:00" },
    "AccountNumber": "AW73565",
    "Item": { "Price": 2024.9940, "Quantity": 3 }
  }
]'

SELECT * FROM OpenJson(@json)
WITH (
  Number   VARCHAR(200)  '$.Order.Number',
  Date     DATETIME      '$.Order.Date',
  Customer VARCHAR(200)  '$.AccountNumber',
  Quantity INT           '$.Item.Quantity',
  [Order]  NVARCHAR(MAX) AS JSON
)

Figure 3.5: Example graph table structure in SQL Server

Figure 3.6: Create a graph node and edge with constraint in SQL Server

CREATE TABLE Person (
  ID INTEGER PRIMARY KEY,
  Name VARCHAR(100),
  Age INT
) AS NODE;

CREATE TABLE Friends (StartDate DATE) AS EDGE;

ALTER TABLE Friends ADD CONSTRAINT EC_FRIEND1
  CONNECTION (Person TO Person);

Figure 3.7: Example graph query with the MATCH clause in SQL Server

-- Find friends of John
SELECT Person2.Name
FROM Person Person1, Friends, Person Person2
WHERE MATCH(Person1-(Friends)->Person2)
AND Person1.Name = 'John';

4 Polyglot persistence

The different kinds of application needs have led to the development of database management systems that effectively solve particular problems. Relational databases handle structural data with strict rules and transaction support, while NoSQL databases handle more data-intensive applications better. Complex applications often find themselves having data spread across different kinds of data storage engines for this reason. Then the problem of integrating the different storages effectively presents itself. One answer is polyglot persistence, and more specifically, polystores.

Polyglot persistence means storing data using multiple languages. In addition, storing data in multiple different models is also considered one of the main features of polyglot persistence. A system implementing polyglot persistence is often called a polystore. These kinds of systems have been further classified as polyglot, multistore, and polystore systems as well, but this thesis considers all such systems polystores for simplicity.

A polystore is an integrated distributed database system architecture that consists of multiple different kinds of database engines specializing in different data models. The idea is to combine the capabilities of various engines to form a system that can answer many kinds of requirements. The main implementation strategy is to build a middleware system on top of the database systems. Users can then use the middleware to access all of the systems and combine their data in an integrated manner [12].

Polystores consist of multiple data stores. The key feature is that the stores are distinct and accessed through their own interfaces. Database replication, for example, does not count. The data storages of a polystore are heterogeneous. This means that the different data storages each have a specific purpose concerning the data features and use cases. This separates polystores from distributed database management systems that are designed specifically to scale applications. Polystores are meant to take multiple use cases into account and combine them effectively.

Finally, polystores aim to integrate different systems so that they are managed and function effectively as a unit. Usually, an integration layer, or middleware, is introduced. It processes queries by distributing them to the appropriate databases and recombining the partial results. It sometimes also takes care of synchronization between the databases by propagating schema and data changes to ensure application-wide consistency and integrity. A key goal of polystore systems is that data processing happens in a system that is best suited for the type of data that is being processed at each time.

When designing a polystore system, it is important to select the correct types of NoSQL databases for the integrated architecture. For example, document stores are effective if you have lots of data with a similar structure. Graph stores should be used when there are lots of relations between the data items. Key-value stores are optimal if the data structure is managed elsewhere, and complex queries are not needed. Common motivations for polystore systems include:

• Analytical operations can be slow in SQL systems. A polystore architecture can select relational data and perform analytics in another type of database.

• Polystores offer data transparency. A working polystore system hides the details of which parts of the data reside in which physical locations and in which types of databases.

• Polystores can be an effective solution for offering system connectivity and additional features to legacy systems. There is often a need to keep legacy systems intact while developing new or more effective functionality.

4.1 Representative systems

4.1.1 BigDAWG

BigDAWG∗ is an open-source polystore implementation. It provides a common interface for three separate database engines: the relational database PostgreSQL, the array database SciDB, and the text database Accumulo [5].

The BigDAWG architecture consists of a top-level Common API, the database engines themselves, Islands, Shims, and Casts. Applications communicate with BigDAWG through the Common API. Islands are components that provide interfaces that target a specific data model on a general level. They can be queried with a specific query language or set of operations. Shims connect Islands to the database engines by transforming the queries from the Island query definition into the native queries in each database engine. Casts are used to move data between database engines, for example, when multi-model queries are used.

∗https://bigdawg.mit.edu/

Figure 4.1: BigDAWG architecture

BigDAWG currently supports three Islands. The Relational Island uses SQL to query PostgreSQL databases. The Array Island is used to query SciDB databases with SciDB’s query language AFL. The Text Island queries Accumulo using SQL or a language called D4M.

The BigDAWG Common API provides an interface over the Islands. It consists of four separate components: Planner, Executor, Monitor, and Migrator. The API also communicates with a separate component called the Catalog, which keeps track of all databases and object schema information.

The Catalog is a component that contains two Postgres databases: the main catalog database and a schema database. The catalog database contains engine, database, object, shim, and cast information, and the schema database stores object schema information. The engines table contains engine connection properties such as hostnames and ports. The database table contains the databases running in each engine and their credentials. The objects table contains the objects in each database and their field names. The schema database stores object schemas globally so that casting from one database type to another is made possible and does not break constraints.

The Planner coordinates query execution. It parses, plans, and optimizes queries. Performance information from the Monitor is utilized when optimizing queries. The Planner sends execution plans to the Executor. It also has a separate Training mode that allows collecting execution plan statistics before going to production. While in Training mode, the Planner analyzes all possible execution plans that produce the same result and sends them to the Monitor to gather all possible metric results. The fastest plan is then chosen amongst them. The proper production mode only analyzes the query’s features and directly asks the Monitor for the best plan.

The Executor executes queries from the Planner and Monitor. It traverses execution plans, issuing sub-queries to the different islands. It calls the Migrator when transferring data between islands is needed. It also sends performance data to the Monitor in Training mode.

The Migrator handles moving data between different databases. It queries the Catalog for the schema information needed when casting data from one island to another.

The Monitor tracks the execution times of query plans. It runs training executions to learn the execution time characteristics of query plans and to infer execution times for future queries.

Figure 4.2: Components of BigDAWG Common API

The BigDAWG query language uses a functional syntax. It defines five function tokens that are used to determine how the subqueries inside them are interpreted. The five function tokens are:

• bdrel – the query targets the relational island and uses PostgreSQL.

• bdarray – the query targets the array island and uses SciDB’s AFL query language.

• bdtext – the query targets the text island and uses either SQL or D4M.

• bdcatalog – the query targets the BigDAWG catalog using SQL.

• bdcast – the query is a cast operation for data migration between islands.

The bdrel function is used to query the Relational Island. It supports a subset of the SQL used in PostgreSQL. The operations supported are filtering, aggregating, sorting, and limiting. The Relational Island supports the integer, varchar, timestamp, double, and float data types.

bdrel(select * from mimic2v26.d_patients limit 4)

The bdarray function queries the Array Island and the array database SciDB. The queries are performed with a subset of SciDB’s Array Functional Language (AFL). It allows for projection, aggregation, cross-join, filter, schema reform, and sorting. It supports the string, int64, datetime, double, and float data types.

bdarray(filter(myarray,dim1 >150))

The bdtext function queries the text database Accumulo. The data are organized into tables in a key-value fashion, and the queries can be full table scans or range scans. The full query uses a JSON syntax to signify the table and range used.

bdtext({ 'op' : 'scan', 'table' : 'mimic_logs', 'range' : { 'start' : ['r_0001','',''], 'end' : ['r_0015','',''] } })

The bdcatalog function allows querying the Catalog database. You can query a certain Catalog table with a simpler table name syntax or the whole Catalog database freely with any applicable SQL query.

bdcatalog( catalog table name [ column name ] [, ...] )
or
bdcatalog( full query applied to the catalog database )

The bdcast function is used to migrate data from one Island to another. It takes a single island query and the destination island and schema as arguments.

bdrel(select * from bdcast( bdarray(filter(myarray,dim1 > 150)), tab6, '(i bigint, dim1 real, dim2 real)', relational))

This query first moves array data from SciDB to PostgreSQL and then retrieves it from there. The bdcast function tells the middleware to put the resultant array data into a table called ”tab6” with the schema ”i bigint, dim1 real, dim2 real”.

4.1.2 PolyBase

PolyBase is a system built to extend Microsoft SQL Server applications to allow processing SQL queries that read data from, and process data in, external data sources [14]. Initially, the supported external data sources were Hadoop∗ and Azure Blob Storage. Version 2019 extended the support to Oracle, MongoDB, and Teradata databases, as well as potentially any data source implementing the ODBC standard. PolyBase also allows pushing query computation down to the external systems and creating SQL Server clusters for parallel query processing.

PolyBase implements the polyglot persistence principle by allowing queries to target both an SQL Server instance and an external data source at the same time. Targeting external data sources works with external tables or external data sources, without the need to configure the target system. PolyBase supports a variety of data formats. In addition to relational database tables, PolyBase supports semi-structured document data, as well as unstructured non-relational tables such as delimited text data and Hadoop HDFS files.

PolyBase solves the problem of joining data from different sources effectively. Before, you had to move your data into one location or integrate and join your data in a client application. PolyBase has implemented full query support in SQL Server’s standard query language, Transact-SQL. With it, you can query external data as if querying relational database tables in SQL Server.

PolyBase implements computation pushdown in Hadoop. It means that the query optimizer can decide to perform query computations either in Hadoop or in the main SQL Server system. The decision is made by estimating the performance cost. Pushing down computation can utilize Hadoop’s optimized distributed computing with MapReduce jobs, for example. This can also be forced on or off in queries. PolyBase also allows parallel computing with multiple SQL Server instances.

The next section covers one of the main ways PolyBase integrates SQL and NoSQL systems. In version 2019, it started to provide unified and simultaneous access to SQL Server and MongoDB, i.e., SQL and NoSQL databases.

∗https://hadoop.apache.org/

MongoDB support

PolyBase implemented support for querying data from external MongoDB databases in version 2019. PolyBase uses external tables to query data from external data sources. First, you have to configure access to the external data source. It is done with a CREATE EXTERNAL DATA SOURCE command with location and credential information. The location consists of the external database type, the hostname of the external server, and the port that the external database instance is listening to. The database type determines the type of the external database, for which some other options are Oracle, Teradata, ODBC, and HDFS. Computation pushdown can also be enabled or disabled here. It is enabled by default. Figure 4.3 describes configuring MongoDB access in PolyBase.

Figure 4.3: Configure MongoDB access
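The original figure is not reproduced here; a configuration along these lines might look as follows in Transact-SQL (the hostname and credentials are made up):

CREATE DATABASE SCOPED CREDENTIAL MongoCredential
WITH IDENTITY = 'mongo_user', SECRET = 'mongo_password';

CREATE EXTERNAL DATA SOURCE MongoDbSource
WITH (
    LOCATION   = 'mongodb://mongoserver.example.com:27017',  -- database type, host, and port
    CREDENTIAL = MongoCredential,
    PUSHDOWN   = ON                                          -- computation pushdown (the default)
);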

After configuring the external data source, access to MongoDB collections is configured by creating external tables. In an external table, you first specify the data source created earlier. Then you specify the location, which is the collection name in the MongoDB database. The location can be just the collection name, or it can also include the database name and schema name. The full format is ”database name.schema name.collection name”. Finally, you specify all fields and their data types. It is also recommended to create statistics on external tables to enhance the performance of queries, especially ones containing joining, filtering, and aggregating. Figure 4.4 describes configuring a MongoDB external table in PolyBase.

Figure 4.4: Configure MongoDB external table
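The figure is not reproduced here; a sketch of such an external table (the field names follow the document in figure 2.2; the statistics step is optional but recommended):

CREATE EXTERNAL TABLE dbo.MongoUsers (
    [_id]     NVARCHAR(24) NOT NULL,
    firstname NVARCHAR(128),
    lastname  NVARCHAR(128)
)
WITH (
    LOCATION    = 'mydatabase.users',   -- database name and collection name
    DATA_SOURCE = MongoDbSource
);

CREATE STATISTICS stat_firstname ON dbo.MongoUsers (firstname) WITH FULLSCAN;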

MongoDB’s JSON data structure allows nesting objects and arrays inside objects. PolyBase handles nesting by flattening nested objects and arrays into additional columns. These columns have to be specified in the external table explicitly. The flattening rules are as follows (an illustration follows the list):

• Objects: For each key in a nested object, a new column is created. The column name is objectname_key.


• Arrays: First, a column is created to represent the array index. Its name is arrayname_index. Then another column is created for the values. If the array values are objects, new columns are created for all keys as before. For each value in an array, all result rows are duplicated.
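As promised above, here is an illustration with a hypothetical document and its flattened relational form (naming the value column after the array is an assumption for this sketch):

{ "name": "Jane", "hobbies": ["surfing", "coding"] }

name | hobbies_index | hobbies
Jane | 0             | surfing
Jane | 1             | coding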

PolyBase requires a strict schema for external tables, whereas MongoDB does not for its objects. When an object’s structure does not match the external table structure, it is rejected in queries. PolyBase allows specifying the number or percentage of rejected rows before queries fail. If an object has many nested arrays, the number of resulting rows in queries can get very high.

PolyBase also allows connecting to Cosmos DB instances. Cosmos DB provides a MongoDB interface, which can be used to configure access following the same steps as before.

4.2 Considerations

Polyglot persistence introduces additional complexity to applications in many ways. Integration and management of multiple different kinds of database engines result in additional operational costs. There is no unified query interface or language, which forces system components and developers to accommodate multiple ones. Data consistency must be ensured at the middleware or application level manually, while it is mostly automatic with single-model databases, especially relational ones. The middleware also has to consider that the separate database systems may be developed independently and adapt to changes in parts of the application. Access control between the different components and databases, or security in general, gets more complicated. Availability can also be an issue. Answering global queries can be problematic if there is downtime in one or more subsystems.

5 Challenges and open problems of SQL and NoSQL integration

Developing database systems that support relational and NoSQL data is difficult. Even though many promising approaches to SQL and NoSQL integration have been developed, all of them are far from optimal. There are many aspects that single-model database systems have optimized much further. It is worthwhile to study whether these optimizations would be applicable or feasible in multi-model databases and polystores as well. The need to consider multiple data models at the same time provides an additional challenge in adapting them.

In this chapter, various challenges and future research directions for multi-model databases and polystores are introduced. The concepts of intra-model and inter-model are distinguished. Intra-model refers to concepts and operations inside a single data model, and inter-model means pertaining to multiple models.

One interesting theoretical framework proposal is UDBMS (Unified DataBase Management System) [13]. Many multi-model databases and polystores focus on a limited set of data models, features, and use cases. UDBMS instead attempts to provide an all-encompassing solution for multi-model data management with several model-agnostic components and features, including data storage, query processing, indexing, in-memory processing, and transaction handling. Figure 5.1 depicts the UDBMS architecture.

5.1 Query languages and data processing

Even though many relational databases have started to support various NoSQL data models, they do not usually offer unified query languages to process data in all of the models. In contrast, various multi-model databases and polystores have been built to provide unified interfaces and engines over specific sets of data models. There are various languages capable of handling multiple models of data simultaneously, such as ArangoDB’s AQL for document and graph data and the BigDAWG query language for relational, array, and text data. We also previously discussed that the relational databases PostgreSQL and Microsoft SQL Server can query documents stored in relational tables using SQL.

Figure 5.1: UDBMS architecture

However, most multi-model query languages still fail to reach the same level of expressiveness and supported features as many single-model query languages. Moreover, the scope of the supported set of data models is also limited in these languages. It is an open challenge to develop a query language that supports all of the most common data models as completely as possible.

Another challenge is developing efficient query processing, planning, and optimization for multi-model queries. Relational databases have developed query planners that use various algorithms and statistical techniques to determine the best execution strategies for queries. However, they are optimized with the relational data schema in mind. Adding NoSQL data models into the mix presents new challenges with semi-structured or schemaless data, especially when queried simultaneously with structured relational data.

Another challenge in polystore architectures is to make the system hide the details about which database types are targeted by each query while at the same time supporting all features, or as many as possible, that each different type of database offers. It would be beneficial to have a query language that does not require specifying the different data models. For example, in the BigDAWG query language, you have to use the model-specifying function tokens at each turn.

5.2 Data modeling and schema design

Modeling multi-model data is difficult, especially when both relational and NoSQL data are present. For example, SQL databases often rely on data normalization, where data are divided into tables so that redundancy is minimal. NoSQL databases, on the other hand, often replicate and nest data for optimized querying (a concrete sketch of this contrast follows at the end of this section). An effective general data model for multi-model data has to balance these factors and consider the possibility of inter-model data redundancy. Another thing to consider is inter-model links between data items.

Good schema design is important in relational databases: for example, it allows more consistent query performance and application extensibility. NoSQL databases, in contrast, are usually schemaless and often denormalize data for performance. However, most of the time the applications using the databases expect a certain structure for the data even when no database-level schema is defined. Therefore, it would be beneficial to develop a schema model that balances these contradictory schema requirements effectively.

Another schema-related problem is schema inference. There are various approaches for inferring a schema within a single data model. However, in a multi-model database or a polystore, it would also be useful to be able to infer relationships and references between models [12].

UDBMS proposes a model-agnostic data storage model that stores objects as byte arrays that are not interpreted by the engine. They can represent data in any format, be it a relational data row with schema information or a document. It provides a generic interface that allows adding, updating, retrieving, and deleting objects. UDBMS also addresses schema issues with its Flexible Schema Manager component, which tracks data reads and writes, generates schema information based on them, and automatically updates it when necessary [13].
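As a minimal sketch of the normalization contrast, assuming PostgreSQL and purely illustrative table and column names, the same customer-order data can be stored normalized into two tables or denormalized into one document per order:

    -- Relational, normalized: each fact is stored once and
    -- orders reference customers by key.
    CREATE TABLE customers (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          SERIAL PRIMARY KEY,
        customer_id INT REFERENCES customers (id),
        total       NUMERIC
    );

    -- Document-style, denormalized: the customer is nested and
    -- duplicated inside every order document for fast reads.
    CREATE TABLE order_docs (
        id  SERIAL PRIMARY KEY,
        doc JSONB  -- e.g. {"total": 25.0, "customer": {"name": "Ada"}}
    );

A multi-model schema model would have to decide which of these representations is authoritative and how the redundant copy is kept consistent with it.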

5.3 Evolution

In the lifetime of a software system, the user requirements are bound to change at some point, and often the database's data structures and constraints need to change with them. Multi-model databases and polystores need ways to propagate the needed changes to the application upon completing evolution operations. Especially interesting here are inter-model evolution operations, such as changing the properties linking objects of different models or modifying data that is stored redundantly in several models at the same time. Other changes that may need to be propagated across multiple models include changes to queries, indexes, or even storage strategies. Inter-model evolution operations introduce at least the following challenges:

• Schema changes in one model sometimes need to be addressed or replicated in another model.

• The operations have to retain data consistency, which requires going through all inter-model links and redundant data.

• The operations have to be translated into the commands of each respective modeling language.

A possible solution is managing a global schema model. However, it would need to present all of the underlying local schema models as a coherent whole, and all executed local evolution operations would have to be propagated back to the global model.

Another challenge is the migration of the changes to the data itself. With large enough data sets, it may be inefficient to propagate changes immediately on every update. Instead, the data can be migrated only when it is needed in a query or update. For multi-model databases and polystores to function with this lazy migration strategy, schema versions need to be managed between models. This could be done by adding version numbers or timestamps to the records and rewriting queries so that data in each specific version can be queried [13].
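A minimal sketch of the record-versioning idea in plain SQL, with hypothetical table and column names: each row carries the schema version it was written under, and queries are rewritten so that every version still present in the table can be read without eagerly migrating old rows.

    -- Tag every row with the schema version it was written under.
    ALTER TABLE customer ADD COLUMN schema_version INT NOT NULL DEFAULT 2;

    -- Rewritten query that reads both versions lazily: version 1
    -- stored the name in one column, version 2 splits it in two.
    SELECT id,
           CASE WHEN schema_version = 1 THEN full_name
                ELSE first_name || ' ' || last_name
           END AS display_name
    FROM customer;

In a multi-model setting, the same version bookkeeping would additionally have to span inter-model links and redundant copies in other models.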

5.4 Consistency control

Some multi-model databases, such as ArangoDB and OrientDB, support transactions. However, they still lack comprehensive transaction models for multi-model data [12]. A potential research direction is to investigate how the SQL data consistency principle ACID and the corresponding NoSQL principle BASE could work together in a multi-model database or polystore. UDBMS proposes a model-agnostic transaction scheme that allows assigning ACID or BASE properties to object collections and queries.
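No implemented system offers this yet, but the UDBMS idea could look roughly like the following purely hypothetical statements, in which the consistency guarantee is chosen per collection rather than per engine:

    -- Hypothetical syntax only; not supported by any current system.
    -- Eventually consistent (BASE) collection for high write throughput.
    CREATE COLLECTION session_events WITH CONSISTENCY = 'BASE';

    -- Fully transactional (ACID) collection for critical data.
    CREATE COLLECTION payments WITH CONSISTENCY = 'ACID';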

5.5 Extensibility

The limited nature of current multi-model databases raises the question of extensibility. Most multi-model databases and polystores support only a subset of the features of each data model, as well as only a narrow set of inter-model techniques. Most systems have also been built specifically on top of certain data models or certain existing representative systems. It would be useful to investigate the possibilities of extending intra-model and inter-model features and operations, as well as adding completely new data models, including interoperability with the currently supported models, to existing multi-model databases or polystores [11].

5.6 Indexing

Relational database management systems have developed sophisticated indexing structures for relational data, such as B-tree and B+-tree indexes. Similarly, NoSQL systems have developed indexes optimized for their own data structures. Multi-model databases and polystores can, of course, utilize these indexes when applicable, but a universal index structure spanning multiple data models could be more efficient [13].
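As an intra-system illustration (not the universal inter-model index the paragraph calls for), PostgreSQL already lets a single engine maintain model-specific indexes side by side; the sketch reuses the hypothetical orders and order_docs tables from Section 5.2:

    -- Default B-tree index on a scalar relational column.
    CREATE INDEX idx_orders_total ON orders (total);

    -- GIN index on a JSONB document column; speeds up containment
    -- queries such as: doc @> '{"customer": {"name": "Ada"}}'
    CREATE INDEX idx_order_docs_doc ON order_docs USING GIN (doc);

An open question is whether one index structure could serve both kinds of lookups at once instead of maintaining a separate index per model.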

5.7 Partitioning

Multi-model partitioning, or sharding, is an interesting problem. We discussed earlier that NoSQL databases often utilize horizontal scaling, where the data are distributed among multiple servers; this is rather simple with a single data model. Partitioning in multi-model environments, however, can be much more complicated. Instead of partitioning the data within each data model according to its own established principles, it may be worthwhile to investigate other multi-model partitioning strategies. For example, it could be beneficial to favor partitioning certain data models together.
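As a baseline for comparison, the sketch below shows simple intra-model hash partitioning using PostgreSQL's declarative partitioning (table and column names are hypothetical); the open research question is what an analogous declaration should look like when related data items live in several models:

    -- Split one table into four hash partitions by key.
    CREATE TABLE events (
        id      BIGINT NOT NULL,
        payload JSONB
    ) PARTITION BY HASH (id);

    CREATE TABLE events_p0 PARTITION OF events
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    CREATE TABLE events_p1 PARTITION OF events
        FOR VALUES WITH (MODULUS 4, REMAINDER 1);
    -- events_p2 and events_p3 follow the same pattern.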

5.8 Standardization

One of the greatest strengths of relational databases has been the standardization of their data structures and the SQL query language. Multi-model databases and polystores, on the other hand, are still at a rather early and experimental stage where the representative systems and solution proposals differ from one another greatly. It would be beneficial to investigate possibilities to standardize some of the multi-model features, constructs, or components.

5.9 Performance

It is reasonable to expect that implementing the multi-model or polyglot persistence principle can hinder performance or require more resources: when data are processed and stored in multiple different formats, a decision about the processing method must be made somewhere along the way, whether inside a single multi-model database or in a polyglot persistence architecture that splits the data across multiple systems.

While performance has not seemed to be a primary focus of the research, some studies have been conducted. For example, one study compared the performance of ArangoDB, OrientDB, another multi-model database called Couchbase∗, and MongoDB [17]. The study compared the efficiency of data creation, retrieval, update, and delete operations, as well as aggregation. Figure 5.2 shows the results, measured as the number of operations performed per second.

Based on this test, MongoDB, the only single-model database in the comparison, focusing on documents, is consistently more efficient than the others. The only exception is data deletion in Couchbase, which, according to the study, is probably explained by native optimization of Couchbase's physical data structures called buckets. Otherwise, the study gave the results one would expect: it could be that the multi-model nature introduces a small amount of overhead all around.

∗https://www.couchbase.com/

Figure 5.2: Multi-model database performance comparison - number of operations per second

Another, more recent performance test compared ArangoDB with OrientDB and several single-model databases: MongoDB for documents, Neo4j for graphs, and PostgreSQL for relational data [16]. Figure 5.3 shows the results of this test.

Figure 5.3: NoSQL Performance 2018

This performance test has some interesting results. This time, MongoDB and the other single-model databases performed significantly worse than ArangoDB. However, ArangoDB is shown to use much more memory than, for example, MongoDB and PostgreSQL. Perhaps ArangoDB has optimized its in-memory processing to gain performance over the single-model counterparts. ArangoDB also performed better than the single-model graph database Neo4j in the graph operations tested, shortest path and neighbors. This cannot be explained by memory usage, because Neo4j also requires more memory than ArangoDB. This performance test indicates that, with proper optimization measures, multi-model databases can compete with single-model databases.

6 Conclusions

The lack of flexibility in data schema definition and the need to process increasing amounts of unstructured data led to the development of NoSQL databases. However, NoSQL databases are not meant to replace SQL databases: transaction consistency, strict schemas, and strong data integrity still have their place in the industry. The greatest benefit can be achieved by supplementing one with the other in complex systems.

Many kinds of multi-model databases have been developed to address this issue. Some were built with it in mind from the beginning, while others have been extended to support additional data models since their initial release. This gives developers of new applications many options when choosing the best-fitting database engine. Developers of existing applications may find that the database engine they are using already implements new ways of data management, or may do so in the future.

On the other hand, polyglot persistence is still at a relatively early stage in its application. Both BigDAWG and PolyBase, perhaps the most well-known representative implementations, can be seen as simple and limited: BigDAWG only supports three specific database engines, and PolyBase requires a strict schema for all or most of its data. However, they serve as a good starting point for further research and development of polyglot persistence.

The research on general-purpose multi-model data management is still at an early stage as well. The UDBMS framework attempts to provide ideas and direction, but practical implementations are still almost nonexistent. However, the fact that many database systems have started to embrace multi-model data management, together with the growing amount of research, shows that this is an important and promising research topic.

Bibliography

[1] Apache Cassandra Documentation. url: https://cassandra.apache.org/doc/latest/ (visited on 11/03/2020).

[2] S. Bjeladinovic. "A fresh approach for hybrid SQL/NoSQL database design based on data structuredness". In: Enterprise Information Systems 12.8-9 (2018), pp. 1202–1220. issn: 17517583. doi: 10.1080/17517575.2018.1446102. url: https://doi.org/10.1080/17517575.2018.1446102.

[3] E. F. Codd. The Relational Model for Database Management: Version 2. 1990. isbn: 0201141922.

[4] DB-Engines Ranking. url: https://db-engines.com/en/ranking (visited on 11/03/2020).

[5] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker. "The BigDAWG Polystore System and Architecture". In: IEEE High Performance Extreme Computing Conference (2016).

[6] F. Gessert, W. Wingerath, S. Friedrich, and N. Ritter. "NoSQL database systems: a survey and decision guidance". In: Computer Science - Research and Development 32.3-4 (2017), pp. 353–365. issn: 18652042. doi: 10.1007/s00450-016-0334-3.

[7] Graph processing with SQL Server and Azure SQL Database. url: https://docs.microsoft.com/en-us/sql/relational-databases/graphs/sql-graph-overview (visited on 11/03/2020).

[8] T. Haerder and A. Reuter. "Principles of transaction-oriented database recovery". In: ACM Computing Surveys (CSUR) 15.4 (1983), pp. 287–317. issn: 15577341. doi: 10.1145/289.291.

[9] R. Hecht and S. Jablonski. "NoSQL evaluation: A use case oriented survey". In: Proceedings - 2011 International Conference on Cloud and Service Computing, CSC 2011 (2011), pp. 336–341. doi: 10.1109/CSC.2011.6138544.

[10] B. Lakhe. In: Practical Hadoop Migration. 2016. Chap. Re-Architecting for NoSQL: Design Principles, Models and Best Practices, pp. 117–148. isbn: 9781484212875. doi: 10.1007/978-1-4842-1287-5.

[11] J. Lu, I. Holubová, and B. Cautis. "Multi-model databases and tightly integrated polystores: Current practices, comparisons, and open challenges". In: International Conference on Information and Knowledge Management, Proceedings (2018), pp. 2301–2302. doi: 10.1145/3269206.3274269.

[12] J. Lu and I. Holubová. "Multi-model Databases: A new journey to handle the variety of data". In: ACM Computing Surveys 52.3 (2019). issn: 15577341. doi: 10.1145/3323214.

[13] J. Lu, Z. H. Liu, P. Xu, and C. Zhang. "UDBMS: Road to Unification for Multi-model Data Management". In: Woo C., Lu J., Li Z., Ling T., Li G., Lee M. (eds) Advances in Conceptual Modeling. ER 2018. Lecture Notes in Computer Science. Vol. 11158. Springer, Cham, 2018. url: https://doi.org/10.1007/978-3-030-01391-2_33.

[14] Microsoft PolyBase Documentation. url: https://docs.microsoft.com/en-us/sql/relational-databases/polybase (visited on 11/03/2020).

[15] Microsoft Transact-SQL documentation. url: https://docs.microsoft.com/en-us/sql/t-sql/ (visited on 11/03/2020).

[16] NoSQL Performance Benchmark 2018 – MongoDB, PostgreSQL, OrientDB, Neo4j and ArangoDB. url: https://www.arangodb.com/2018/02/nosql-performance-benchmark-2018----neo4j-arangodb/ (visited on 12/03/2020).

[17] E. Płuciennik and K. Zgorzałek. "The Multi-model Databases – A Review". In: Kozielski S., Mrozek D., Kasprowski P., Małysiak-Mrozek B., Kostrzewa D. (eds) Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation. BDAS 2017. Communications in Computer and Information Science. Vol. 716. Springer, Cham, 2017, pp. 141–152. doi: 10.1007/978-3-319-58274-0. url: http://link.springer.com/10.1007/978-3-319-58274-0.

[18] PostgreSQL documentation. url: https://www.postgresql.org/docs/ (visited on 11/03/2020).

[19] PostgreSQL JSON tutorial. url: https://www.postgresqltutorial.com/postgresql-json/ (visited on 11/03/2020).

[20] C. Zhang and J. Lu. "Holistic evaluation in multi-model databases benchmarking". In: Distributed and Parallel Databases (2019). doi: 10.1007/s10619-019-07279-6. url: https://doi.org/10.1007/s10619-019-07279-6.