SQL++ Query Language

9/9/2015

SQL++ Query Language and its use in the FORWARD framework

with the support of National Science Foundation, Google, Informatica, Couchbase

Semistructured Databases have Arrived

• Categories: SQL-on-Hadoop, NoSQL, Others
• Languages: CQL, PIG, N1QL, Jaql, AQL

in the form of JSON…

    {
      location: 'Alpine',
      readings: [
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      ]
    }

Semistructured Query Languages come along…

• SQL: Expressive Power and Adoption
• JSON: Semistructured Data

but in a Tower of Babel…

SQL

    SELECT AVG(temp) AS tavg
    FROM readings
    GROUP BY sid

MongoDB

    db.readings.aggregate(
      {$group: {_id: "$sid",
                tavg: {$avg: "$temp"}}})

Jaql

    readings -> group by sid = $.sid
      into { tavg: avg($.temp) };

Pig

    a = LOAD 'readings' AS (sid:int, temp:float);
    b = GROUP a BY sid;
    c = FOREACH b GENERATE AVG(temp);
    DUMP c;

JSONiq

    for $r in collection("readings")
    group by $r.sid
    return { tavg: avg($r.temp) }

with limited query abilities…

• Growing out of key-value databases or SQL with special type UDFs
• Far from the full-fledged power of prior non-relational query languages (OQL, XQuery)

*academic projects being the exception (e.g., ASTERIX, FORWARD)

in lieu of formal semantics…

Query languages were not meant for Ninjas

Outline

• What is SQL++?
• How do we use SQL++ in UCSD's FORWARD?
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

SQL++: Data model + query language for querying both structured data (SQL) and semistructured data (native JSON)

• Formal Schemaless Semantics: adjusting tuple calculus to the richness of JSON
• Full fledged, declarative, aggressively backwards-compatible with SQL
• Configurable SQL++: Formal configuration options towards choosing semantics, understanding repercussions
• Unifying: 11 popular databases/QLs captured by appropriate settings of Configurable SQL++

How is it used in the Big Data industry

• ASTERIXDB
• CouchDB heading towards SQL++

How do we use it in UCSD

• Automatic Incremental View Maintenance: based on a schemaless SQL++ algebra
• Middleware Query Optimization & Execution: FORWARD virtual database query plans based on SQL++ algebra

SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON

• Extend SQL with arrays + nesting + heterogeneity + permissive structures
    • Array nested inside a tuple
    • Heterogeneous tuples in collections
    • Arbitrary compositions of array, bag, tuple

    {
      location: 'Alpine',
      readings: [
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      ]
    }

    [
      [5, 10, 3],
      [21, 2, 15, 6],
      [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6]
    ]

• Extend JSON with bags and enriched types
    • Bags {{ … }}
    • Enriched types

    {
      location: 'Alpine',
      readings: {{
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      }}
    }

Exercise: Why is the SQL++ data model a superset of the SQL data model?
• A SQL table corresponds to a JSON bag that has homogeneous tuples

SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON

• Extend SQL with arrays + nesting + heterogeneity + permissive structures + permissive schema
• Extend JSON with bags and enriched types
• With schema
• Or, schemaless
• Or, partial schema

Data:

    {
      location: 'Alpine',
      readings: {{
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      }}
    }

Schema:

    {
      location: string,
      readings: {{
        { time: timestamp, ozone: any, * }
      }}
    }

From SQL to SQL++

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

Backwards Compatibility with SQL

Find sensors that recorded a temperature below 50

    SELECT DISTINCT r.sid
    FROM readings AS r
    WHERE r.temp < 50

Input:

    readings : {{
      { sid: 2, temp: 70.1 },
      { sid: 2, temp: 49.2 },
      { sid: 1, temp: null }
    }}

Result:

    {{ { sid : 2 } }}

The same input and result as SQL tables:

    sid  temp        sid
    2    70.1        2
    2    49.2
    1    null

From SQL to SQL++

• Adjust SQL syntax/semantics for heterogeneity, complex input/output values
• Achieved with few extensions and mostly by SQL limitation removals

From Tuple Variables to Element Variables

Find the highest two sensor readings that are below 1.0

    SELECT r AS co
    FROM readings AS r
    WHERE r < 1.0
    ORDER BY r DESC
    LIMIT 2

Input:

    readings : [ 1.3, 0.7, 0.3, 0.8 ]

Result:

    [ { co: 0.8 }, { co: 0.7 } ]

Element Variables

Each clause consumes the bindings produced by the previous clause:

    FROM readings AS r
      B^out_FROM = B^in_WHERE = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
    WHERE r < 1.0
      B^out_WHERE = B^in_ORDERBY = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
    ORDER BY r DESC
      B^out_ORDERBY = B^in_LIMIT = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]
    LIMIT 2
      B^out_LIMIT = B^in_SELECT = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩ ]
    SELECT r AS co

Correlated Subqueries Everywhere

Find the highest two sensor readings that are below 1.0

    SELECT r AS co
    FROM sensors AS s,
         s AS r
    WHERE r < 1.0
    ORDER BY r DESC
    LIMIT 2

Input:

    sensors : [
      [1.3, 2],
      [0.7, 0.7, 0.9],
      [0.3, 0.8, 1.1],
      [0.7, 1.4]
    ]

Result:

    [ { co: 0.9 }, { co: 0.8 } ]

Clause by clause:

    FROM sensors AS s, s AS r
      B^out_FROM = B^in_WHERE = {{
        ⟨ s : [1.3, 2],         r : 1.3 ⟩,
        ⟨ s : [1.3, 2],         r : 2 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.7 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.7 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.9 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.3 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.8 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 1.1 ⟩,
        ⟨ s : [0.7, 1.4],       r : 0.7 ⟩,
        ⟨ s : [0.7, 1.4],       r : 1.4 ⟩
      }}
    WHERE r < 1.0
      B^out_WHERE = B^in_ORDERBY = {{
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.7 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.7 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.9 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.3 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.8 ⟩,
        ⟨ s : [0.7, 1.4],       r : 0.7 ⟩
      }}
    ORDER BY r DESC
      B^out_ORDERBY = B^in_LIMIT = [
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.9 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.8 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.7 ⟩,
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.7 ⟩,
        ⟨ s : [0.7, 1.4],       r : 0.7 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.3 ⟩
      ]
    LIMIT 2
      B^out_LIMIT = B^in_SELECT = [
        ⟨ s : [0.7, 0.7, 0.9],  r : 0.9 ⟩,
        ⟨ s : [0.3, 0.8, 1.1],  r : 0.8 ⟩
      ]
    SELECT r AS co

Utilization of Input Order

Find the highest two sensor readings that are below 1.0

    SELECT r AS co,
           rp AS x,
           sp AS y
    FROM sensors AS s AT sp,
         s AS r AT rp
    WHERE r < 1.0
    ORDER BY r DESC
    LIMIT 2

Input:

    sensors : [
      [1.3, 2],
      [0.7, 0.7, 0.9],
      [0.3, 0.8, 1.1],
      [0.7, 1.4]
    ]

Result:

    [
      { co: 0.9, x: 3, y: 2 },
      { co: 0.8, x: 2, y: 3 }
    ]
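The clause-by-clause traces above can be mimicked in a few lines of Python. This is only a sketch of the semantics as the slides describe them, not how any SQL++ engine is implemented: binding tuples are modeled as plain dicts, the helper names `from_clause` and `run_query` are made up for illustration, and the 1-based `AT` positions are inferred from the slide's result (`x: 3, y: 2` for the reading 0.9).

```python
# Sketch of the SQL++ clause pipeline over binding tuples (modeled as dicts).
# Helper names are hypothetical; data and expected result come from the slides.

sensors = [[1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4]]


def from_clause(sensors):
    """FROM sensors AS s AT sp, s AS r AT rp

    The second FROM item ranges over s, i.e. it is correlated with the
    first item. Positions are assumed to be 1-based, as in the slide result.
    """
    return [
        {"s": s, "sp": sp, "r": r, "rp": rp}
        for sp, s in enumerate(sensors, start=1)
        for rp, r in enumerate(s, start=1)
    ]


def run_query(sensors):
    bindings = from_clause(sensors)                            # FROM
    bindings = [b for b in bindings if b["r"] < 1.0]           # WHERE r < 1.0
    bindings = sorted(bindings, key=lambda b: b["r"],
                      reverse=True)                            # ORDER BY r DESC
    bindings = bindings[:2]                                    # LIMIT 2
    # SELECT r AS co, rp AS x, sp AS y
    return [{"co": b["r"], "x": b["rp"], "y": b["sp"]} for b in bindings]


result = run_query(sensors)
print(result)
```

Note that the FROM step yields ten bindings, one per (sensor, reading) pair, which matches the first binding bag in the trace; the later clauses only filter, order, and truncate that collection before SELECT constructs the output tuples.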
Constructing Nested Results: Subqueries Again

Environment:

    ℾ0 = ⟨
      sensors : {{
        {sensor: 1},
        {sensor: 2}
      }},
      logs : {{
        {sensor: 1, co: 0.4},
        {sensor: 1, co: 0.2},
        {sensor: 2, co: 0.3}
      }}
    ⟩

Query, clause by clause:

    FROM sensors AS s
      B^out_FROM = B^in_SELECT = {{
        ⟨ s : {sensor: 1} ⟩,
        ⟨ s : {sensor: 2} ⟩
      }}
    SELECT s.sensor AS sensor,
           ( SELECT l.co AS co
             FROM logs AS l
             WHERE l.sensor = s.sensor
           ) AS readings

Result:

    {{
      { sensor : 1, readings : {{ {co: 0.4}, {co: 0.2} }} },
      { sensor : 2, readings : {{ {co: 0.3} }} }
    }}

Constructing Nested Results

The subquery is evaluated once per binding of s. For the first binding, the environment is:

    ℾ1 = ⟨
      s : {sensor : 1},
      sensors : {{
        {sensor: 1},
        {sensor: 2}
      }},
      logs : {{
        {sensor: 1, co: 0.4},
        {sensor: 1, co: 0.2},
        {sensor: 2, co: 0.3}
      }}
    ⟩

Subquery, clause by clause:

    FROM logs AS l
      B^out_FROM = B^in_WHERE = {{
        ⟨ l : {sensor: 1, co: 0.4} ⟩,
        ⟨ l : {sensor: 1, co: 0.2} ⟩,
        ⟨ l : {sensor: 2, co: 0.3} ⟩
      }}
    WHERE l.sensor = s.sensor
      B^out_WHERE = B^in_SELECT = {{
        ⟨ l : {sensor: 1, co: 0.4} ⟩,
        ⟨ l : {sensor: 1, co: 0.2} ⟩
      }}
    SELECT l.co AS co

Subquery result:

    {{ {co: 0.4}, {co: 0.2} }}

Iterating over Attribute-Value pairs

INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

Environment (shared by the three variants below):

    ℾ = ⟨
      readings : {
        co : [],
        no2 : [0.7],
        so2 : [0.5, 2]
      }
    ⟩

Variant 1: comma in the FROM clause

    FROM readings AS {g:v},
         v AS n
      {{
        ⟨ g : "no2", v : [0.7],    n : 0.7 ⟩,
        ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
        ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
      }}
    SELECT ELEMENT { gas: g, num: n }

Result:

    [
      { gas: "no2", num: 0.7 },
      { gas: "so2", num: 0.5 },
      { gas: "so2", num: 2 }
    ]

Variant 2: INNER CORRELATE (same bindings and result as the comma)

    FROM readings AS {g:v}
         INNER CORRELATE
         v AS n
      {{
        ⟨ g : "no2", v : [0.7],    n : 0.7 ⟩,
        ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
        ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
      }}
    SELECT ELEMENT { gas: g, num: n }

Result:

    [
      { gas: "no2", num: 0.7 },
      { gas: "so2", num: 0.5 },
      { gas: "so2", num: 2 }
    ]

Variant 3: OUTER CORRELATE (the empty array for co still produces a binding, with n missing)

    FROM readings AS {g:v}
         OUTER CORRELATE
         v AS n
      {{
        ⟨ g : "co",  v : [],       n : missing ⟩,
        ⟨ g : "no2", v : [0.7],    n : 0.7 ⟩,
        ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
        ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
      }}
    SELECT ELEMENT { gas: g, num: n }

Result:

    [
      { gas: "co",  num: null },
      { gas: "no2", num: 0.7 },
      { gas: "so2", num: 0.5 },
      { gas: "so2", num: 2 }
    ]

The SQL++ Extensions to SQL

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity, complex values
• Achieved with few extensions and SQL limitation removals
    • Element Variables
    • Correlated Subqueries (in FROM and SELECT clauses)
    • Optional utilization of Input Order
    • Grouping independent of Aggregation Functions
    • ..
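The difference between INNER and OUTER correlation on the readings example above can be sketched in Python. This is an illustration under stated assumptions, not engine code: the function names `correlate` and `select_element` and the `MISSING` sentinel are made up for this sketch, the projection is assumed to be `gas` and `num` (the attributes that appear in the slide's result), and `missing` is rendered as `None` in the output because the slide's result shows `num: null`.

```python
# Sketch of FROM readings AS {g:v} INNER|OUTER CORRELATE v AS n.
# All names here are hypothetical; the data comes from the slides.

MISSING = object()  # stand-in for the SQL++ `missing` value

readings = {"co": [], "no2": [0.7], "so2": [0.5, 2]}


def correlate(readings, outer):
    """Range {g:v} over attribute-value pairs, then n over the array v.

    With outer=False, a pair whose array is empty yields no binding.
    With outer=True, it yields one binding in which n is MISSING.
    """
    bindings = []
    for g, v in readings.items():
        if v:
            bindings.extend({"g": g, "v": v, "n": n} for n in v)
        elif outer:
            bindings.append({"g": g, "v": v, "n": MISSING})
    return bindings


def select_element(bindings):
    """SELECT ELEMENT {gas: g, num: n}, rendering MISSING as None (null)."""
    return [
        {"gas": b["g"], "num": None if b["n"] is MISSING else b["n"]}
        for b in bindings
    ]


inner = select_element(correlate(readings, outer=False))
outer = select_element(correlate(readings, outer=True))
print(inner)
print(outer)
```

The inner variant drops the `co` gas entirely because its array is empty, while the outer variant keeps it with a null `num`, which is the outerjoin-like behavior the slide title refers to.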
