Experimental Comparison of Relational and NoSQL Systems: the Case of Decision Support

VLDB2020Tokyo

Experimental Comparison of Relational and NoSQL Systems: the Case of Decision Support

Tomás F. Llano-Ríos, University of Louisville, Louisville, KY 40208, USA, tfl[email protected]
Mohamed Khalefa, SUNY College Old Westbury, Old Westbury, NY 11568, USA, [email protected]
Antonio Badia, University of Louisville, Louisville, KY 40208, USA, [email protected]

Contribution

Experimental comparison of relational and document-oriented NoSQL systems (PostgreSQL, MongoDB and Couchbase)
  focused on Decision Support (we use TPC-H)
  limited to a single-node setting
  limited to hierarchical data (Customer, Orders and Lineitem tables)
Analysis
  Query language's influence on query optimization
  Data model's influence on query optimization
  Possibilities for improvement
Work is ongoing: we are still running experiments.

Experiment Methodology

TPC-H schema translated to document stores in 3 different schemas:

S1: Embedding
  Customer
    key1 : val1, key2 : val2, ..., keyn : valn
    orders:
      key1 : val1, key2 : val2, ..., keym : valm
      lineitems:
        key1 : val1, key2 : val2, ..., keyl : vall

S2: Linking
  Customer
    key1 : val1, key2 : val2, ..., keyn : valn
  Orders
    key1 : val1, key2 : val2, ..., keyn : valn
  Lineitem
    key1 : val1, key2 : val2, ..., keyn : valn

S3: Hybrid
  Order
    key1 : val1, key2 : val2, ..., keyn : valn
    lineitems:
      key1 : val1, key2 : val2, ..., keym : valm
  Customer
    key1 : val1, key2 : val2, ..., keyl : vall

Experiment Setup

Databases
  Relational: PostgreSQL v10.6
  NoSQL: MongoDB v4.0, Couchbase CE v6.5
Queries
  Taken from the TPC-H benchmark
  Only involving Customer, Orders, and Lineitem
  Total query versions: 38
  Each version was run 5 times and the average taken
  Time limit of 24 hours per query
  Couchbase queries were run at scale factor (SF) 1 only, due to poor performance

Results

Query 1: Scan over Lineitem only; S1 and S3 are at a disadvantage. S2 is superior because it needs neither joins nor unnesting.
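To make the Query 1 remark concrete, here is a minimal sketch (not the authors' code) that builds the three schemas from a few hypothetical rows and runs a Query 1 style aggregate over each. The field names follow TPC-H column naming, the nested field names `orders` and `lineitems` follow the schema slide, and the data values are invented for illustration. Plain Python lists stand in for collections, so this shows only how much unnesting each schema needs, not actual running times.

```python
from collections import defaultdict

# Hypothetical toy rows; the values are invented for illustration only.
lineitems = [
    {"l_orderkey": 1, "l_returnflag": "A", "l_quantity": 10},
    {"l_orderkey": 1, "l_returnflag": "N", "l_quantity": 5},
    {"l_orderkey": 2, "l_returnflag": "A", "l_quantity": 7},
]
orders = [
    {"o_orderkey": 1, "o_custkey": 100},
    {"o_orderkey": 2, "o_custkey": 100},
]
customers = [{"c_custkey": 100, "c_name": "Customer#100"}]

# S2 (linking): three flat collections, used as they are.
s2_lineitem = lineitems

# S3 (hybrid): each order embeds its lineitems; customers stay separate.
s3_orders = [
    {**o, "lineitems": [l for l in lineitems if l["l_orderkey"] == o["o_orderkey"]]}
    for o in orders
]

# S1 (embedding): each customer embeds its orders, which embed their lineitems.
s1_customers = [
    {**c, "orders": [o for o in s3_orders if o["o_custkey"] == c["c_custkey"]]}
    for c in customers
]


def sum_quantity(rows):
    """Total l_quantity per l_returnflag, a simplified stand-in for Query 1."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["l_returnflag"]] += row["l_quantity"]
    return dict(totals)


def scan_s1(customer_docs):
    # Two levels of unnesting: customer -> orders -> lineitems.
    return sum_quantity(
        l for c in customer_docs for o in c["orders"] for l in o["lineitems"]
    )


def scan_s2(lineitem_docs):
    # No unnesting and no join: the collection already holds lineitems.
    return sum_quantity(lineitem_docs)


def scan_s3(order_docs):
    # One level of unnesting: order -> lineitems.
    return sum_quantity(l for o in order_docs for l in o["lineitems"])
```

All three functions return the same totals; they differ only in how many nesting levels must be flattened before the aggregate can run, which is the disadvantage the slide attributes to S1 and S3.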
Figure: Running times of query 1 on MongoDB and PostgreSQL. Panels: (a); (b) at 100G.

Results

Query 3: Join order matters greatly; S2 is superior if orders are filtered first.

Figure: Running times of query 3 on MongoDB and PostgreSQL. Panels: (a); (b) at 100G.

Results

Query 4: No lookups in S1 or S3, but S1 has an extra unnest. Filtering the array and then unnesting is faster than unnesting and then filtering.

Figure: Running times of query 4 on MongoDB and PostgreSQL. Panels: (a); (b) at 100G.

Results

Query 12: Orders ⋈ Lineitem makes S2 the slowest. No lookups in S1 or S3, but S1 has an extra unnest.

Figure: Running times of query 12 on MongoDB and PostgreSQL. Panels: (a); (b) at 100G.

Results

Query 13: Lookup is slow with sub-queries (issue [SERVER-41171]); cannot effectively convert σ(C ⟕ O) into C ⟕ σ(O), where C = Customer and O = Orders, in S2 or S3. No lookups in S1.

Figure: Running times of query 13 on MongoDB and PostgreSQL. Panels: (a); (b) at 100G.

Experiment Results

Query 22: The sub-query translates to a self-lookup on S1; the ordering of operators depends on the schema; sub-queries in lookups are slow.

Figure: Running times of query 22 on MongoDB and PostgreSQL. Panels: (a); (b) at 100G.

Analysis

High cost
  Scanning a single collection with large documents
  Unnesting large documents
Main document store limitations
  Join reordering
    Must be done manually in MongoDB
    Couchbase CE cannot express correlation as a join
  Optimizer
    No cost-based optimizer
    Selectivity of conditions not considered

Conclusions

Typical document store design (one or a few collections with complex documents that use embedding) is not always a good fit for DSS environments.
Schema-less does not imply schema-free. Schema design matters in document stores for DSS environments.
Navigational languages should be supported by an optimizer that is able to rewrite and reorder operations in a query.
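The Query 4 observation (filtering an embedded array before unnesting beats unnesting first) is one example of the reordering that the last conclusion asks an optimizer to perform. The sketch below illustrates it on hypothetical S1-style documents. It is not the authors' code: the data and the date range are invented, and it counts intermediate rows rather than measuring time. In MongoDB terms this would presumably correspond to applying `$filter` to the embedded array before `$unwind`, instead of `$unwind` followed by `$match`.

```python
# Hypothetical S1-style documents: customers with embedded orders.
customers = [
    {"c_custkey": 1, "orders": [
        {"o_orderkey": 10, "o_orderdate": "1993-07-15"},
        {"o_orderkey": 11, "o_orderdate": "1995-01-02"},
        {"o_orderkey": 12, "o_orderdate": "1996-03-10"},
    ]},
    {"c_custkey": 2, "orders": [
        {"o_orderkey": 20, "o_orderdate": "1993-09-30"},
        {"o_orderkey": 21, "o_orderdate": "1992-12-24"},
    ]},
]


def in_range(order):
    # ISO-formatted dates compare correctly as strings; the range is illustrative.
    return "1993-07-01" <= order["o_orderdate"] < "1993-10-01"


def unnest_then_filter(docs, pred):
    """Unnest every embedded order, then filter the flattened rows."""
    unnested = 0
    result = []
    for doc in docs:
        for order in doc["orders"]:
            unnested += 1  # every order becomes an intermediate row
            if pred(order):
                result.append(order["o_orderkey"])
    return result, unnested


def filter_then_unnest(docs, pred):
    """Filter the embedded array inside each document, then unnest the survivors."""
    unnested = 0
    result = []
    for doc in docs:
        kept = [order for order in doc["orders"] if pred(order)]
        for order in kept:
            unnested += 1  # only matching orders become intermediate rows
            result.append(order["o_orderkey"])
    return result, unnested


slow_result, slow_rows = unnest_then_filter(customers, in_range)
fast_result, fast_rows = filter_then_unnest(customers, in_range)
```

Both orderings return the same order keys, but the second one materializes fewer intermediate rows. An optimizer that considered the selectivity of the condition could choose this ordering automatically, whereas the slides note that in MongoDB such reordering must be done by hand.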
Future work

Extending the comparison to column-oriented DBs.
Exploring document storage as multi-dimensional arrays.
Expanding to further schemas and query sets (all TPC-H queries).
Exploring a distributed setup.

Thank you!

Data, Queries and Code: www.github.com/tllano11/dss-sql-vs-nosql-experiments
Questions? Please contact us: tfl[email protected], [email protected], [email protected]
