SQL for Nosql and How Apache Calcite Can Help (Slides)

SQL for NoSQL and how Apache Calcite can help FOSDEM 2017 Christian Tzolov Engineer at Pivotal BigData, Hadoop, Spring Cloud Dataflow Apache Committer, PMC member Apache {Crunch, Geode, HAWQ, ...} blog.tzolov.net twitter.com/christzolov nl.linkedin.com/in/tzolov Disclaimer This talk expresses my personal opinions. It is not read or approved by Pivotal and does not necessarily reflect the views and opinions of Pivotal nor does it constitute any official communication of Pivotal. Pivotal does not support any of the code shared here. 2 “It will be interesting to see what happens if an established NoSQL database decides to implement a reasonably standard SQL; The only predictable outcome for such an eventuality is plenty of argument.” 2012, Martin Fowler, P.J.Sadalage, NoSQL Distilled 3 Data Big Bang Why? 4 NoSQL Driving Forces • Rise of Internet Web, Mobile, IoT – Data Volume, Velocity, Variety challenges ACID & 2PC clash with Distributed architectures. CAP, PAXOS instead.. • Row-based Relational Model. Object-Relational Impedance Mismatch More convenient data models: Datastores, Key/Value, Graph, Columnar, Full-text Search, Schema-on-Load… • Infrastructure Automation and Elasticity (Cloud Computing) Eliminate operational complexity and cost. Shift from Integration to application databases … 5 Data Big Bang Implications • Over 150 commercial NoSQL and BigData systems. • Organizations will have to mix data storage technologies! • How to integrate such multitude of data systems? 6 “Standard” Data Process/Query Language? • Functional - Unified Programming pcollection.apply(Read.from(”in.txt")) Model .apply(FlatMapElements.via((String word) -> asList(word.split("[^a-zA-Z']+"))) • Apache {Beam, Spark, Flink, .apply(Filter.by((String word)->!word.isEmpty())) Apex, Crunch}, Cascading .apply(Count.<String>perElement()) • Converging around Apache Batch & Streaming, OLTP Beam • Declarative - SQL • Adopted by many NoSQL SELECT b."totalPrice", c."firstName” Vendors FROM "BookOrder" as b INNER JOIN "Customer" as c • Most Hadoop tasks: Hive and ON b."customerNumber" = c."customerNumber” SQL-on-Hadoop WHERE b."totalPrice" > 0; • Spark SQL - most used production component for 2016 OLAP, EDW, Exploration • Google F1 7 SQL for NoSQL? • Extended Relational Algebra - already present in most NoSql data system • Relational Expression Optimization – Desirable but hard to implement 8 Organization Data - Integrated View Single Federated DB (M:1:N) Direct (M:N) Organization Data Tools Organization Data Tools SQL/JDBC HAWQ FDBS SQL/JDBC SQL/JDBC SQL/JDBC Apache HAWQ Optimizer, Columnar /browse/HAWQ-1235 (HDFS) Calcite Calcite Calcite jira / SQLAdapter 1 SQLAdapter 2 … SQLAdapter n NoSQL 1 NoSQL 2 NoSQL n PXF 1 PXF 2 PXF n Native Native Native API 1 API 2 API n NoSQL 1 NoSQL 1 … NoSQL n issues.apache.org https:// 9 Single Federated Database Federated External Tables with Apache HAWQ - MPP, Shared-Noting, SQL- on-Hadoop CREATE EXTERNAL TABLE MyNoSQL ( customer_id TEXT, first_name TEXT, last_name TEXT, gender TEXT ) LOCATION ('pxf://MyNoSQL-URL>? FRAGMENTER=MyFragmenter& ACCESSOR=MyAccessor& RESOLVER=MyResolver&') FORMAT 'custom'(formatter='pxfwritable_import'); 10 Apache Calcite? Java framework that allows SQL interface and advanced query optimization, for virtually any data system • Query Parser, Validator and Optimizer(s) • JDBC drivers - local and remote • Agnostic to data storage and processing Calcite Application • Apache Apex • Apache Drill • Apache Flink • Apache Hive • Apache Kylin • Apache Phoenix • Apache Samza • Apache Storm • Cascading • Qubole Quark • SQL-Gremlin … • Apache Geode 12 SQL Adapter Design Choices SQL completeness vs. NoSql design integrity Catalog – namespaces accessed in queries Schema - collection of schemas and tables • Data Type Conversion Table - single data set, collection of rows RelDataType – SQL fields types in a Table (simple) Predicate Pushdown: Scan, Filter, • Move Computation to Data Projection (complex) Custom Relational Rules and Operations: Sort, Join, GroupBy ... 13 Geode to Calcite Data Types Mapping Geode Cache is mapped into Calcite Schema Create Column Types Geode Cache Calcite Schema (RelDataType) from Geode Value class (JavaTypeFactoryImpl) Region 1 Table 1 Col1 Col2 ColN Key Val Row1 V(1,1) V(1,2) V(1,N) k1 v1 Row2 V(2,1) V(2,2) V(2,N) k2 v2 RowM V(M,1) V(M,2) V(M,N) … Geode Key/Value is mapped … into Table Row Region K Table K Regions are mapped into Tables 14 Geode Adapter - Overview SQL/JDBC/ODBC Parse SQL, converts into relational expression and optimizes Apache Calcite Push down the relational expressions supported by Geode OQL and falls back to the Calcite Spring Data API for Enumerable Enumerable Adapter for the rest interacting with Geode Adapter Convert SQL relational expressions into OQL queries Spring Data Geode Adapter Geode (Geode Client) Geode API and OQL Geode Server Geode Server Geode Server Data Data Data Simple SQ L Adapter Initialize <<SchemaFactory>> <<Schema>> MySchemaFactory MySchema !connect jdbc:calcite:model=path-to-model.json +create(operands):Schema +getTableMap():Map<String, Table>) <<create>> defaultSchema: 'MyNoSQL', schemas: [{ <<create>> name: ’MyNoSQLAdapter, factory: MySchemaFactory’, operand: { myNoSqlUrl: …, } ScannableTable, Uses reflection to builds <<ScannableTable>> }] RelDataType from your FilterableTable, MyTable value’s class type ProjectableFilterableTable +getRowType(RelDataTypeFactor) Returns an Enumeration +scan(ctx):Ennumerator<Object[]> over the entire target data store <<on scan() create>> SELECT b."totalPrice” Defined in the Linq4j Query FROM "BookOrder" as b <<Enummerator>> WHERE b."totalPrice" > 0; MyEnummerator sub-project Converts MyNoSQL +moveNext() value response into +convert(Object):E Calcite row data <<Get all Data>> My NoSQL 16 Non-Relational Tables (Simple) Scanned without intermediate relational expression. • ScannableTable - can be scanned Enumerable<Object[]> scan(DataContext root); • FilterableTable - can be scanned, applying supplied filter expressions Enumerable<Object[]> scan(DataContext root, List<RexNode> filters); • ProjectableFilterableTable - can be scanned, applying supplied filter expressions and projecting a given list of columns Enumerable<Object[]> scan(DataContext root, List<RexNode> filters, int[] projects); 17 Calcite Ecosystem Several “semi-independent” projects. Local and Remote Port of LINQ (Language-Integrated Query) JDBC driver to Java. Linq4j JDBC and Avatica Method for translating executable code into data Expression Converts SQL (LINQ/MSN port) queries Into AST SQL Parser & AST Tree (SqlNode …) Complies Java code generated from linq4j Relational “Expressions”. Part of the Relational Algebra, • Relational Expressions Interpreter physical plan implementer expression, • Row Expression optimizations … • Optimization Rules • Planner … Enumerable Default (In-memory) Data Store Adapter Adapter implementation. Leverages Linq4j 3rd party Adapters 18 Calcite SQL Query Execution Flow 1. On new SQL query JDBC delegates to Prepare to prepare the query execution JDBC 2. Parse SQL, convert to rel. 1 Calcite Framework expressions. Validate and Optimize 2 them Prepare Geode Adapter 3. Start building a physical plan from the relation expressions SQL, 3 5 Interpreter 4. Implement the Geode relations and Relational, Enumerable encode them as Expression tree Planner 5. Pass the Expression tree to the 4 6 7 Interpreter to generate Java code 2 Geode 7 6. Generate and Compile a Binder Binder instance that on ‘bind()’ call runs Adapter Geodes’ query method 7 7. JDBC uses the newly compiled Binder to perform the query on the Geode Geode Cluster Cluster 19 Calcite Relational Expressions TableScan Input Column Project Ref Row-level Relational Literal expressions expression Filter Struct field * RexlNode RelNode Aggregate access Project, Sort fields * Join Function call Filter, Join conditions RelTrait Window Intersect expressions Physical attribute of a relation Sort 20 Calcite Relational Expressions Rule transforms an expression into another. It has a list of Operands, which determine whether the rule can be applied to RelOptRuleOperand a particular section of the tree. * <<fire criteria>> Query optimizer: Transforms a Converts from one calling relational expression according to RelOptRule ConverterRule convention to another a given set of rules and a cost model. * + onMatch(call) + RelNode convert(RelNode) <<rules>> RelOptPlanner Indicate that it converts a <<create>> physical attribute only! <<register>> Convertor +findBestExp():RelNode <<root>> * Calling convention used to represent a single RelNode RelTrait Convention data source. Inputs to a relational + register(RelOptPlander) expression must be in RelOptCluster + List<RelNode> getInputs(); the same convention RexNode * * <<inputs>> MyDBConvention NONE EnumberableConvention 21 Calcite Adapter Implementation Patterns TranslatableTable AbstractQueryableTable Convention.Impl(“MyAdapter”) RelNode <<instance of>> 1 2 MyAdapterProject 3 MyAdapterTable MyAdapterConvention MyAdapterRel MyAdapterFilter 9 + toRel(optTable) + implement(implContext) + asQueryable(provider,…) <<create>> TableScan ImplContext MyAdapterXXX Can convert queries in 5 + implParm1 Expression AbstractTableQueryable <<create>> + implParm2 … MyAdapterScan <<create on match >> 4 + register(planer) { MyAdapterQueryable Common interface for all MyAdapter Registers all MyAdapter Rules Relation Expressions. Provides } implementation callback method called + myQuery(params) : as part of physical plan implementation Enumetator MyAdapterProjectRu

SQL for Nosql and How Apache Calcite Can Help (Slides)

Java Linksammlung

Working with Storm Topologies Date of Publish: 2018-08-13

Apache Flink™: Stream and Batch Processing in a Single Engine

Declarative Languages for Big Streaming Data a Database Perspective

Administration and Configuration Guide

Apache Apex: Next Gen Big Data Analytics

E6895 Advanced Big Data Analytics Lecture 4: Data Store

The Cloud‐Based Demand‐Driven Supply Chain

A Comprehensive Study of Bloated Dependencies in the Maven Ecosystem

Synthesis and Development of a Big Data Architecture for the Management of Radar Measurement Data

CS145 Lecture Notes #15 Introduction to OQL

Unravel Data Systems Version 4.5