Bridging SQL and Nosql

Eindhoven University of Technology MASTER Bridging SQL and NoSQL Roijackers, J.A.M. Award date: 2012 Link to publication Disclaimer This document contains a student thesis (bachelor's or master's), as authored by a student at Eindhoven University of Technology. Student theses are made available in the TU/e repository upon obtaining the required degree. The grade received is not published on the document as presented in the repository. The required complexity or quality of research of student theses may vary by program, and the required minimum study period may vary in duration. General rights Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. • Users may download and print one copy of any publication from the public portal for the purpose of private study or research. • You may not further distribute the material or use it for any profit-making activity or commercial gain Eindhoven University of Technology Department of Mathematics and Computer Science Master’s thesis Bridging SQL and NoSQL May 11, 2012 Author: John Roijackers [email protected] Supervisor and tutor: Dr. G.H.L. Fletcher [email protected] Assessment committee: Dr. T.G.K. Calders Dr. G.H.L. Fletcher Dr. A. Serebrenik Abstract A recent trend towards the use of non-relational NoSQL databases raises the question where to store application data when part of it is perfectly relational. Dividing data over separate SQL and NoSQL databases implies manual work to manage multiple data sources. We bridge this gap between SQL and NoSQL via an abstraction layer over both databases that behaves as a single database. We transform the NoSQL data to a triple format and incorporate these triples in the SQL database as a virtual relation. Via a series of self joins the original NoSQL data can be reconstructed from this triple relation. To avoid the tedious task of writing the self joins manually for each query, we develop an extension to SQL that includes a NoSQL query pattern in the query. This pattern describes conditions on the NoSQL data and lets the rest of the SQL query refer to the corresponding NoSQL data via variable bindings. The user query can automatically be translated to an equivalent pure SQL query that uses multiple copies of the triple relation and automatically adds the corresponding join conditions. We describe a naive translation, where for each key-value pair from the NoSQL query pattern a new triple relation copy is used, and present several optimization strategies that focus on reducing the number of triples that must be shipped to the SQL database. We implement a prototype of the developed theoretical framework using PostgreSQL and MongoDB for an extensive empirical analysis in which we investigate the impact of the translation optimizations and identify bottlenecks. The empirical analysis shows that a practical prototype based on the constructed, non-trivial, and generically applicable theoretical framework is possible and that we have developed a hybrid system to access SQL and NoSQL data via an intermediate triple representation of the data, though there still is enough space for future improvement. iii Acknowledgments This thesis is the result of my master project, which completes my master Computer Science and Engineering at the Eindhoven University of Technology. The project was done within the Databases and Hypermedia group of the department of Mathematics and Computer Science, and carried out externally at KlapperCompany B.V., the company that introduced me to the problem and offered resources to perform the project. I would like to take this opportunity to thank a few people for their support during my project. First of all, I would like to thank my supervisor, George Fletcher, for his support and guidance during the project and for offering the possibility to work on my suggested research topic. Furthermore, I thank all involved employees at KlapperCompany B.V. for offering access to their resources required to carry out my project in the form of a dataset, hardware, a pleasant working environment, and general expertise regarding the project context. Finally, thanks go out to my family, girlfriend, and friends, for their support and understanding during my project. Particularly in the periods of social isolation to meet an impending deadline. John Roijackers v Contents 1 Introduction 1 1.1 Motivation...........................................1 1.1.1 Problem origin....................................1 1.1.2 Gap between SQL and NoSQL...........................2 1.1.3 Real life situation...................................3 1.2 Problem statement......................................3 1.3 Summary and thesis overview................................4 2 Problem context 7 2.1 General direction.......................................7 2.1.1 Hybrid database...................................7 2.1.2 Approach.......................................8 2.1.3 NoSQL representation................................9 2.1.4 Data reconstruction................................. 11 2.2 Literature review....................................... 12 2.3 Summary........................................... 13 3 Theoretical framework for bridging SQL and NoSQL 15 3.1 Data incorporation...................................... 15 3.1.1 Availability of NoSQL data in SQL......................... 15 3.1.2 Transformation of NoSQL data to triples...................... 16 3.2 Query language........................................ 18 3.2.1 Main idea....................................... 18 3.2.2 Syntax description.................................. 19 3.2.3 Collaboration with SQL............................... 21 3.3 Architectural overview.................................... 21 3.4 Summary........................................... 22 4 Query processing strategies 25 4.1 Base implementation..................................... 25 4.1.1 Notation........................................ 26 4.1.2 Translation...................................... 27 4.1.3 Selection pushdown.................................. 29 vii 4.1.4 Combined selection pushdown............................ 31 4.1.5 Condition derivation................................. 32 4.1.6 Transitive condition derivation........................... 33 4.2 Projection pushdown..................................... 33 4.3 Data retrieval reduction................................... 34 4.3.1 General idea..................................... 35 4.3.2 Triple reduction.................................... 36 4.4 Summary........................................... 36 5 Experimental framework 39 5.1 Software............................................ 39 5.1.1 SQL.......................................... 40 5.1.2 NoSQL........................................ 40 5.2 Hardware........................................... 41 5.3 Data.............................................. 41 5.3.1 Products....................................... 41 5.3.2 Tweets......................................... 43 5.4 Queries............................................ 44 5.4.1 Flow classes...................................... 44 5.4.2 Query types...................................... 45 5.4.3 Query construction.................................. 46 5.5 Triple relation implementation............................... 46 5.5.1 Table functions.................................... 46 5.5.2 Foreign data wrappers................................ 47 5.5.3 Multicorn....................................... 49 5.6 Setup............................................. 50 5.6.1 Translation implementation............................. 50 5.6.2 Details......................................... 51 5.7 Summary........................................... 52 6 Empirical analysis 55 6.1 Result overview........................................ 55 6.2 Base implementation..................................... 57 6.3 Projection pushdown..................................... 59 6.4 Data retrieval reduction................................... 59 6.5 Increasing the NoSQL data limit.............................. 62 6.6 Summary........................................... 64 7 Conclusions 65 7.1 Summary of theoretical results............................... 65 7.2 Summary of practical results................................ 66 7.3 Future work.......................................... 67 7.3.1 Nested join reduction................................. 67 7.3.2 Tuple reconstruction................................. 68 7.3.3 Triple relation reuse................................. 69 7.3.4 Dynamic run-time query evaluation strategies................... 70 7.3.5 NoSQL query language standardization...................... 70 A Query templates 73 A.1 Product queries........................................ 73 viii A.2 Tweet queries......................................... 79 B Foreign data wrapper 87 Bibliography 89 ix 1 Introduction In this thesis we bring traditional relational databases and non-relational data storage closer together by reducing the developer’s effort required to combine data from both types of databases. This is motivated by the recent trend to use non-relational data storage, which leads to separate databases for a single application and the consequential extra work involved with managing multiple data sources. In this chapter we further introduce and motivate the problem covered in this

Bridging SQL and Nosql

CHAPTER 3 - Relational Database Modeling

Relational Query Languages

The Relational Data Model and Relational Database Constraints

SQL Vs Nosql: a Performance Comparison

Relational Algebra and SQL Relational Query Languages

Plantuml Language Reference Guide

A Developer's Guide to Data Modeling for SQL Server

Fast Foreign-Key Detection in Microsoft SQL

Session 5 – Main Theme

Mssql Database Schema Diagram

The Relational Model

Relational Database Manipulation Basic Relational Database Query