Apache Calcite for Enabling SQL Access to Nosql Data Systems Such As Apache Geode Christian Tzolov Whoami Christian Tzolov

Apache Calcite for Enabling SQL Access to NoSQL Data Systems such as Apache Geode Christian Tzolov Whoami Christian Tzolov Engineer at Pivotal, Big-Data, Hadoop, Spring Cloud Dataflow, Apache Geode, Apache HAWQ, Apache Committer, Apache Crunch PMC member [email protected] blog.tzolov.net twitter: @christzolov https://nl.linkedin.com/in/tzolov Disclaimer This talk expresses my personal opinions. It is not read or approved by Pivotal and does not necessarily reflect the views and opinions of Pivotal nor does it constitute any official communication of Pivotal. Pivotal does not support any of the code shared here. 2 Big Data Landscape 2016 • Volume • Velocity • Varity • Scalability • Latency • Consistency vs. Availability (CAP) 3 Data Access • {Old | New} SQL • Custom APIs – Key / Value – Fluent APIs – REST APIs • {My} Query Language Unified Data Access? At What Cost? 4 SQL? • Apache Apex • SQL-Gremlin • Apache Drill … • Apache Flink • Apache Geode • Apache Hive • Apache Kylin • Apache Phoenix • Apache Samza • Apache Storm • Cascading • Qubole Quark 5 Geode Adapter - Overview SQL/JDBC/ODBC Parse SQL, converts into relational expression and Apache Calcite optimizes Push down the relational expressions supported by Geode Spring Data API for Enumerable OQL and falls back to the Calcite interacting with Geode Adapter Enumerable Adapter for the rest Convert SQL relational Spring Data Geode Adapter expressions into OQL queries Geode (Geode Client) Geode API and OQL Geode Server Geode Server Geode Server Data Data Data SQL Relational Expressions SELECT b."totalPrice", c."firstName” FROM "BookOrder" as b INNER JOIN "Customer" as c ON b."customerNumber" = c."customerNumber” WHERE b."totalPrice" > 0; Project (c.firstName, b.totalPrice) (c.firstName, b.totalPrice) Project Join (on customerNumber) (b.totalPrice > 0) Filter optimize (totalPrice, Project customerNumber) (on customerNumber) Join (firstName, Project Filter (totalPrice > 0) customerNumber) Scan Scan Scan Scan Customer [c] BookOrder [b] Customer [c] BookOrder [b] 7 Geode Push Down Candidates Relational Operator Geode Support LIMIT YES (without OFFSET) PROJECT YES FILTER YES JOIN For collocated Regions only AGGREGATE YES for GROUP BY, DISTINCT, MAX, MIN, SUM, AVG, COUNT http://bit.ly/2eKApd0 SORT YES 8 Apache Geode? “… in-memory, distributed database with strong consistency built to support low latency transactional applications at extreme scale” Why Apache Geode? China Railway 5,700 train stations 7,000 stations 4.5 million tickets per day 72,000 miles of track 20 million daily users 23 million passengers daily 1.4 billion page views per day 120,000 concurrent users 40,000 visits per second 10,000 transactions per minute https://pivotal.io/big-data/case-study/distributed-in-memory-data-management-solution https://pivotal.io/big-data/case-study/scaling-online-sales-for-the-largest-railway-in-the-world-china-railway-corporation 10 Apache Geode Features • In-Memory Data Storage • Streaming and Event Processing – Over 100TB Memory – Listeners – JVM Heap + Off Heap – Distributed Functions • Any Data Format – Continuous OQL Queries – Key-Value/Object Store • Multi-site / Inter-cluster • ACID and JTA Compliant • Full Text Search (Lucene indexes) Transactions • Embedded and Standalone • HA and Linear Scalability • Top Level Apache Project • Strong Consistency 11 Apache Geode Concepts Locator – tracks system Client –read and modify the members and provides content of the distributed membership information system Locator (member) Client (member) Listener – event handler. CacheServer – process Registers for one or connected to the distributed Cache Server (member) more events and notified system with created Cache when they occur Cache Listeners Ke Val Region 1 y k1 v1 Region - consistent, distributed Functions k2 v2 Map (key-value), Partitioned or Replicated Functions… – distributed, … concurrent data Cache - In-memory collection processing Region N of Regions Geode Topology Cache Server Cache Server Cache Server Cache Data Cache Data Cache Data Peer-to-Peer Client Local Cache pool Client-Server Cache Server Cache Server Cache Server Cache Data Cache Data Cache Data … Cache Server Cache Server Cache Server Cache Server Cache Data Cache Data Cache Data Cache Data Gateway Receiver Gateway Sender WAN Multi-site Boundary Multi-Site … Gateway Sender Gateway Receiver Cache Data Cache Data Cache Data Cache Data Cache Server Cache Server Cache Server Cache Server Geode Client API • Client Cache • Key / Value - Region GET, PUT, REMOVE • OQL – QueryService Geode Data Types & Serialization • Key-Value with complex value formats • Portable Data eXchange (PDX) Serialization – Delta propagation, schema evolution, polyglot support … • Object Query Language (OQL) { SELECT p.name id: 1, FROM /Person p name: “Fred”, WHERE p.pet.type = “dino” age: 42, pet: { nested fields name: “Barney”, type: “dino” single field deserialization } } Geode Demo (GFSH and OQL) • Connect to Geode cluster, • List available Regions • Run OQL query Apache Calcite? Java framework that allows SQL interface and advanced query optimization, for virtually any data system • Query Parser, Validator and Optimizer(s) • JDBC drivers - local and remote • SQL Streaming • Agnostic to data storage and processing • SQL completes vs. NoSQL integrity Calcite Data Types JDBC • Catalog – namespaces accessed in queries Calcite • Schema - collection of schemas and tables SQL Engine • Table - single data set, collection of rows Schema Table Table … Table • RelDataType – SQL fields types in a Table Data Type Mapping SELECT title, author FROM test.BookMaster Data System Data Types Data Type Fields Schema Table Your Data System Calcite Data Types: RelDataType Type of a scalar expression or row • RelDataTypeFactory – RelDataType factory • JavaTypeFactory - registers Java classes as record types • JavaTypeFactoryImpl - Java Reflection to build RelDataTypes • SqlTypeFactoryImpl - Default implementation with all SQL types 19 Geode to Calcite Data Types Mapping Geode Cache is mapped into Calcite Schema Create Column Types Geode Cache (RelDataType) from Geode Calcite Schema Value class Region 1 (JavaTypeFactoryImpl) Table 1 Col1 Col2 ColN Key Val Row1 V(1,1) V(1,2) V(1,N) k1 v1 Row2 V(2,1) V(2,2) V(2,N) k2 v2 RowM V(M,1) V(M,2) V(M,N) … Geode Key/Value is mapped … into Table Row Region K Table K Regions are mapped into Tables 20 Calcite Bootstrap Flow Typical calcite initialization flow Configures Calcite Model (JSON) Creates SchemaFactory Creates Schema Creates Tables 21 Calcite Model Model The path to <my-model>.json is passed as JDBC connection argument: SchemaFactory !connect jdbc:calcite:model=target/test-classes/<my-model-path>.json︎ Schema Tables { version: '1.0', defaultSchema: 'TEST', schemas: [ { Reference to your adapter name: 'TEST', Schema Name type: 'custom', schema factory implementation class factory: 'org.apache.calcite.adapter.geode.simple.GeodeSchemaFactory', operand: { locatorHost: 'localhost', locatorPort: '10334', Parameters to be passed to regions: 'BookMaster', your adapter schema factory pdxSerializablePackagePath: 'net.tzolov.geode.bookstore.domain.*' } implementation }] } Geode Calcite Schema and Schema Factory Model public class GeodeSchemaFactory implements SchemaFactory { SchemaFactory public Schema create(SchemaPlus parentSchema, String schemaName, Map<String, Object> operand) { Schema String locatorHost = (String) operand.get(“locatorHost”); Tables int locatorPort = … Retrieves the parameters set in String[] regionNames = … the model.json String pdxPackagePath = … Create an Adapter Schema return new GeodeSchema(locatorHost, locatorPort, regionNames, pdxPackagePath); instance with the provided } parameters. } public class GeodeSchema extends AbstractSchema { private String regionName = .. protected Map<String, Table> getTableMap() { final ImmutableMap.Builder<String, Table> builder = ImmutableMap.builder(); Create GeodeScannableTable Region region = … Get Geode Region by region name … instance for each Geode Region Class valueClass= … Find region’s value type … builder.put(regionName, new GeodeScannableTable(regionName, valueClass, clientCache)); return tableMap; } Geode Scannable Table public class GeodeScannableTable extends AbstractTable implements ScannableTable { Model SchemaFactory public RelDataType getRowType(RelDataTypeFactory typeFactory) { Uses reflection (or pdx-instance) Schema return new JavaTypeFactoryImpl().createStructType(valueClass); to builds RelDataType from value’s } class type Tables public Enumerable<Object[]> scan(DataContext root) { Returns an Enumeration over the entire target data store return new AbstractEnumerable<Object[]>() { public Enumerator<Object[]> enumerator() { return new GeodeEnumerator<Object[]>(clientCache, regionName); } } public class GeodeEnumerator<E> implements Enumerator<E> { private E current; private SelectResults geodeIterator; Defined in the Linq4j sub-project public GeodeEnumerator(ClientCache clientCache, String regionName) { geodeterator = clientCache.getQueryService().newQuery("select * from /" + regionName).execute().iterator(); } public boolean moveNext() { current = convert(geodeIterator.next()); return true;} Retrieves the entire Region!! public E current() {return current;} Converts Geode value response into Calcite row data public abstract E convert(Object geodeValue) { Convert PDX value into RelDataType row } Geode Demo (Scannable Tables) $ ./sqlline ︎ sqlline> !connect jdbc:calcite:model=target/test-classes/model2.json admin admin︎ ︎ jdbc:calcite> !tables ︎ jdbc:calcite>

Apache Calcite for Enabling SQL Access to Nosql Data Systems Such As Apache Geode Christian Tzolov Whoami Christian Tzolov

Java Linksammlung

Working with Storm Topologies Date of Publish: 2018-08-13

Apache Flink™: Stream and Batch Processing in a Single Engine

Declarative Languages for Big Streaming Data a Database Perspective

Administration and Configuration Guide

Apache Apex: Next Gen Big Data Analytics

E6895 Advanced Big Data Analytics Lecture 4: Data Store

The Cloud‐Based Demand‐Driven Supply Chain

A Comprehensive Study of Bloated Dependencies in the Maven Ecosystem

Synthesis and Development of a Big Data Architecture for the Management of Radar Measurement Data

CS145 Lecture Notes #15 Introduction to OQL

Evaluation of SPARQL Queries on Apache Flink