9/9/2015
SQL++ Query Language
and its use in the FORWARD framework
with the support of National Science Foundation, Google, Informatica, Couchbase
Semistructured Databases have Arrived
SQL-on-Hadoop NoSQL
CQL PIG
N1QL SQL-on-Hadoop Jaql Others
AQL
in the form of JSON…
{ location: 'Alpine', readings: [ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 }, { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } ] }
1 9/9/2015
Semistructured Query Languages come along…
SQL: JSON: Expressive Semistructured Power and Data Adoption
but in a Tower of Babel…
SELECT AVG(temp) AS tavg db.readings.aggregate( FROM readings {$group: {_id: "$sid", GROUP BY sid tavg: {$avg:"$temp"}}}) SQL MongoDB readings -> group by sid = $.sid a = LOAD 'readings' AS into { tavg: avg($.temp) }; (sid:int, temp:float); Jaql b = GROUP a BY sid; for $r in collection("readings") c = FOREACH b GENERATE group by $r.sid AVG(temp); return { tavg: avg($r.temp) } DUMP c; JSONiq Pig
with limited query abilities…
Growing out of key-value databases or SQL with special type UDFs
Far from the full-fledged power of prior non-relational query languages – OQL, Xquery *academic projects being the exception (eg, ASTERIX, FORWARD)
2 9/9/2015
in lieu of formal semantics…
Query languages were not meant for Ninjas
Outline
• What is SQL++ ?
• How we use SQL++ in UCSD’s FORWARD ?
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop
SQL++ : Data model + query language for querying both structured data (SQL) and semistructured data (native JSON)
Formal Schemaless Full fledged, Semantics: adjusting tuple declarative , calculus to the aggressively backwards- richness of JSON compatible with SQL
Configurable SQL++: Unifying: Formal configuration 11 popular databases/QLs options towards choosing captured by appropriate semantics, understanding settings of Configurable repercussions SQL++
3 9/9/2015
How is it used in the Big Data industry • ASTERIXDB • CouchDB heading towards SQL++
How do we use it in UCSD
Automatic Incremental View Maintenance
Middleware Query Optimization & Execution: FORWARD virtual database query plans based on SQL++ algebra based on a schemaless SQL++ algebra
SQL++ Data Model (also JSON++ data model) A superset of SQL and JSON • Extend SQL with arrays + nesting + heterogeneity
{ location: 'Alpine', readings: [ • Array nested { inside a tuple time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, • Heterogeneous no2: 0.0050 tuples in }, collections { time: timestamp('2014-03-12T22:00:00'), • Arbitrary ozone: 'm', compositions of co: 0.4 array, bag, tuple } ] }
4 9/9/2015
SQL++ Data Model (also JSON++ data model) A superset of SQL and JSON • Extend SQL with arrays + nesting + heterogeneity + permissive structures
[ [5, 10, 3], [21, 2, 15, 6], [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6] ] • Arbitrary compositions of array, bag, tuple
SQL++ Data Model (also JSON++ data model) A superset of SQL and JSON • Extend SQL with arrays + nesting + heterogeneity + permissive structures • Extend JSON with bags and enriched types { location: 'Alpine', readings: {{ { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, • Bags {{ … }} no2: 0.0050
}, • Enriched types { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 } }} }
Exercise: Why the SQL++ data model is a superset of the SQL data model? • a SQL table corresponds to a JSON bag that has homogeneous tuples
5 9/9/2015
SQL++ Data Model (also JSON++ data model) A superset of SQL and JSON • Extend SQL with arrays + nesting + heterogeneity + permissive structures + permissive schema • Extend JSON with bags and enriched types { { location: 'Alpine', location: string, readings: {{ readings: {{ { { time: timestamp('2014-03-12T20:00:00'), time: timestamp, • With schema ozone: 0.035, ozone: any,
no2: 0.0050 * • Or, schemaless }, }
{ }} • Or, partial schema time: timestamp('2014} -03-12T22:00:00'), ozone: 'm', co: 0.4 } }} }
From SQL to SQL++
• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
Backwards Compatibility with SQL
readings : {{ { sid: 2, temp: 70.1 }, {{ { sid: 2, temp: 49.2 }, { sid : 2 } { sid: 1, temp: null } SELECT DISTINCT r.sid }} }} FROM readings AS r WHERE r.temp < 50
Find sensors that recorded a temperature below 50 sid sid temp 2 2 70.1 2 49.2 1 null
6 9/9/2015
From SQL to SQL++
• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity, complex input/output values
• Achieved with few extensions and mostly by SQL limitation removals
From Tuple Variables to Element Variables
SELECT r AS co readings : FROM readings AS r [ [ 1.3, WHERE r < 1.0 { co: 0.8 }, 0.7, ORDER BY r DESC { co: 0.7 } 0.3, LIMIT 2 ] 0.8 ] Find the highest two sensor readings that are below 1.0
Element Variables FROM readings AS r
out in B FROM = B WHERE = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}
WHERE r < 1.0 : readings Bout = Bin = {{ ⟨ r : 0.7 ⟩, [ WHERE ORDERBY ⟨ r : 0.3 ⟩, [ 1.3, ⟨ r : 0.8 ⟩ }} { co: 0.8 }, 0.7, { co: 0.7 } 0.3, ORDER BY r DESC ] 0.8 Bout = Bin = [ ⟨ r : 0.8 ⟩, ] ORDERBY LIMIT ⟨ : ⟩, r 0.7 ⟨ r : 0.3 ⟩ ] LIMIT 2
out in B LIMIT = B SELECT = [ ⟨ r : 0.8 ⟩ ⟨ r : 0.7 ⟩ ] SELECT r AS co
7 9/9/2015
Correlated Subqueries Everywhere
SELECT r AS co
sensors : FROM sensors AS s, [ s AS r [ [1.3, 2] WHERE r < 1.0 { co: 0.9 }, [0.7, 0.7, 0.9] ORDER BY r DESC { co: 0.8 } [0.3, 0.8, 1.1] LIMIT 2 ] [0.7, 1.4] ] Find the highest two sensor readings
that are below 1.0
Correlated Subqueries Everywhere FROM sensors AS s, s AS r
out in B FROM = B WHERE = {{ ⟨ s : [1.3, 2] , r : 1.3 ⟩, ⟨ s : [1.3, 2] , r : 2 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, sensors : [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, [1.3, 2] ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩, [0.7, 0.7, 0.9] ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, [0.3, 0.8, 1.1] ⟨ s : [0.3, 0.8, 1.1] , r : 1.1 ⟩, [0.7, 1.4] ⟨ s : [0.7, 1.4] , r : 0.7 ⟩, ] ⟨ s : [0.7, 1.4] , r : 1.4 ⟩ }}
WHERE r < 1.0 ORDER BY r DESC LIMIT 2 SELECT r AS co
Correlated Subqueries Everywhere FROM sensors AS s, s AS r WHERE r < 1.0
out in B WHERE = B ORDERBY = {{ ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩,
⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, : sensors [ ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩, [1.3, 2] ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, [0.7, 0.7, 0.9] ⟨ s : [0.7, 1.4] , r : 0.7 ⟩ [0.3, 0.8, 1.1] }} [0.7, 1.4] ] ORDER BY r DESC
LIMIT 2 SELECT r AS co
8 9/9/2015
Correlated Subqueries Everywhere FROM sensors AS s, s AS r WHERE r < 1.0 ORDER BY r DESC
out in B ORDERBY = B LIMIT = [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, sensors : [ ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩, [1.3, 2] ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, [ [0.7, 0.7, 0.9] ⟨ s : [0.7, 0.7, 0.9] , r : 0.7 ⟩, { co: 0.9 }, [0.3, 0.8, 1.1] ⟨ s : [0.7, 1.4] , r : 0.7 ⟩, { co: 0.8 } [0.7, 1.4] ⟨ s : [0.3, 0.8, 1.1] , r : 0.3 ⟩ ] ] ]
LIMIT 2
out in B LIMIT = B SELECT = [ ⟨ s : [0.7, 0.7, 0.9] , r : 0.9 ⟩, ⟨ s : [0.3, 0.8, 1.1] , r : 0.8 ⟩ ] SELECT r AS co
Utilization of Input Order
x y
SELECT r AS co, [ rp AS x, { co: 0.9, sensors : sp AS y x: 3, [ FROM sensors AS s AT sp, y: 2 [1.3, 2] s AS r AT rp }, [0.7, 0.7, 0.9] WHERE r < 1.0 [0.3, 0.8, 1.1] ORDER BY r DESC { co: 0.8, [0.7, 1.4] LIMIT 2 x: 2, ] y: 3 } Find the highest two sensor readings that are ] below 1.0
Constructing Nested Results – Subqueries Again
{{
ℾ = ⟨ 0 { sensors : {{ sensor : 1, {sensor: 1}, FROM sensors AS s readings: {{ {sensor: 2} Bout = Bin = {{ {co:0.4}, }} , FROM SELECT ⟨ s : {sensor: 1} ⟩, {co:0.2}
⟨ s : {sensor: 2} ⟩ }} logs: {{ }} }, {sensor: 1,
co: 0.4}, SELECT { {sensor: 1, s.sensor AS sensor, sensor : 2, co: 0.2}, ( readings: {{ {sensor: 2, SELECT l.co AS co {co:0.3} co: 0.3}, FROM logs AS l }} }} WHERE l.sensor = s.sensor } ⟩ ) AS readings }}
9 9/9/2015
Constructing Nested Results
FROM logs AS l ℾ1 = ⟨ s : {sensor : 1}, out in B FROM = B SELECT = {{ sensors : {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, {sensor: 1}, ⟨ l : {sensor: 1, co: 0.2} ⟩, {sensor: 2} ⟨ l : {sensor: 2, co: 0.3} ⟩ }} , }} {{
{co:0.4}, logs: {{ WHERE l.sensor = s.sensor {co:0.2} {sensor: 1, }} co: 0.4}, Bout = Bin = {{ {sensor: 1, FROM SELECT ⟨ l : {sensor: 1, co: 0.4} ⟩, co: 0.2}, ⟨ l : {sensor: 1, co: 0.2} ⟩ {sensor: 2, }} co: 0.3}, }} ⟩ SELECT l.co AS co
Iterating over Attribute-Value pairs
INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
FROM [ readings AS {g:v}, { v AS n gas: "co", num: null ℾ = ⟨ }, { readings : { gas: "no2", co : [], num: 0.7 {{ no2: [0.7], }, { ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, so2: [0.5, 2] gas: "so2", ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, } num: 0.5, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }} ⟩ }, { gas: "so2", SELECT ELEMENT num: 2 { gas: g, } val: v } ]
10 9/9/2015
INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
FROM [ readings AS {g:v} { INNER CORRELATE gas: "co", v AS n num: null ℾ = ⟨ }, { readings : { gas: "no2", co : [], num: 0.7 {{ no2: [0.7], }, { ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, so2: [0.5, 2] gas: "so2", ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, } num: 0.5, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }} ⟩ }, { gas: "so2", SELECT ELEMENT num: 2 { gas: g, } val: v } ]
INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN
FROM [ readings AS {g:v} { OUTER CORRELATE gas: "co", v AS n num: null ℾ = ⟨ }, { readings : { gas: "no2", co : [], num: 0.7 {{ ⟨ g : "co", v : [], n : missing ⟩, no2: [0.7], }, { ⟨ g : "no2", v : [0.7], n : 0.7 ⟩, so2: [0.5, 2] gas: "so2", ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩, } num: 0.5, ⟨ g : "co", v : [0.5, 2], n : 2 ⟩ }} ⟩ }, { gas: "so2", SELECT ELEMENT num: 2 { gas: g, } val: v } ]
The SQL++ Extensions to SQL
• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity, complex values
• Achieved with few extensions and SQL limitation removals • Element Variables • Correlated Subqueries (in FROM and SELECT clauses) • Optional utilization of Input Order • Grouping indendent of Aggregation Functions • ... and a few more details
• Configurability: making SQL++ a unifying and configurable language
11 9/9/2015
Algebras do not need schemaful inputs
FROM logs AS l ℾ1 = ⟨ s : {sensor : 1}, out in B FROM = B SELECT = {{ sensors : {{ ⟨ l : {sensor: 1, co: 0.4} ⟩, {sensor: 1}, ⟨ l : {sensor: 1, co: 0.2} ⟩, {sensor:sensor 2} co ⟨ l : {sensor: 2, co: 0.3} ⟩ }} , }} {{ 1 0.4 {co:0.4}, logs: {{ 1 0.2 WHERE l.sensor = s.sensor {co:0.2} {sensor: 1, }} co: 2 0.4}, 0.3 Bout = Bin = {{ {sensor: 1, FROM SELECT ⟨ l : {sensor: 1, co: 0.4} ⟩, co: 0.2}, ⟨ l : {sensor: 1, co: 0.2} ⟩ {sensor: 2, }} co: 0.3}, }} ⟩ SELECT l.co AS co
Outline
• What is SQL++ ?
• How we use SQL++ in UCSD’s FORWARD ? • SQL++ uses in integration
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop
SQL++ based Virtual DatabaseClient Federated SQL++ Queries SQL++ Results FORWARD Virtual Database
SQL++ Query Processor
SQL++ Virtual Views of Sources
SQL++ SQL++ SQL++ SQL++ SQL++ Virtual Virtual Virtual Virtual Virtual View View View View View SQL-on- SQL NewSQL NoSQL Java Objects Hadoop Wrapper Wrapper Wrapper Wrapper Wrapper Native Native queries results
SQL-on- Java SQL NewSQL NoSQL Hadoop In-Memory Database Database Database Database Objects
12 9/9/2015
SQL++ Virtual and/or MaterializedClient Views SQL++ Queries SQL++ Results FORWARD Integration Database
SQL++ Query Processor SQL++ View SQL++ Integrated Views Definitions
SQL++ SQL++ SQL++ Integrated Integrated Added Value View View View
SQL++ SQL++ SQL++ SQL++ SQL++ Virtual Virtual Virtual Virtual Virtual View View View View View
SQL-on- Java SQL NewSQL NoSQL Hadoop In-Memory Database Database Database Database Objects
Use Case 1: Hide the ++ source features behind SQL views
BI ToolSELECT / Report Writerm.sid , t AS temp FROM measurements AS m, SQL query SQL result m.temperatures AS t FORWARD Virtual Database SQL View or its SQL++ equivalent
measurements: sid temp measurements: {{
2 70.1 {sid: 2, temp: 70.1}, {sid: 2, temp: 70.2}, 2 70.2 {sid: 1, temp: 71.0} 1 71.0 }}
Couchbase query Couchbase results
measurements: {{ { sid: 2, temperatures: [70.1, 70.2] }, { sid: 1, temperatures: [71.0] } Couchbase }}
Use Case 2: Semantics-Aware Pushdown
"Sensors that recorded a temperature below 50"
SQL++ query SQL++ result SELECT DISTINCT m.sid FROM measurements AS m MongoDB Virtual Database WHERE m.temp < 50 Virtual View SQL++ semantics for (Identical) m.temp < 50 MongoDB Wrapper
MongoDB query MongoDB results ... { $match: {temp: {$lt: 50}}}, MongoDB { $match: {temp: {$not: null}}} measurements: {{ ... { sid: 2, temp: 70.1 }, MongoDB semantics for { sid: 2, temp: 49.2 }, temp $lt 50 { sid: 1, temp: null } }}
13 9/9/2015
Push-Down Challenges: Limited capabilities & semantic variations
• How to simulate incompatible semantics/missing features? • Sources like SQL and MongoDB cannot execute the entirety of SQL++ • How to efficiently push down computation?
Issues automatically handled by the query processor. Plenty of query rewriting problems, including novel ones on semi-structured operators
SQL++ captures Semantics Variations
• Semantics of "less-than" comparison are different across sources:
• Config Parameters to capture these variations @lt { complex : deep, @lt { MongoDB } type_mismatch: false, ( x < y ) null_lt_null : false, null_lt_value: false, ... } ( x < y ) Semantics Variations In NoSQL, NewSQL, SQL-on-Hadoop variation of semantics for: • Paths • Equality • Comparisons And all the operators that use them: • Selections • Grouping • Ordering • Set operations Each of these features has a set of config parameters in SQL++ 14 9/9/2015 Don’t give a standard Teach how to reach a standard • Fast evolving space • Standard would be premature • Configuration options itemize and explain the space options • Rationalize discussion Advise on consistent settings of the configuration options • Not every configuration option setting is sensible • Configuration options have to subscribe to common sense iff x = y then [x] = y if x = y and y = z then x = z • which enables rewritings • Have to be consistent with each other Eg, if true AND missing is missing, then false OR missing is missing FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer • The Incremental View Maintenance functionality: IVM • SQL++ (Materialized) View Definition Stream Eg, Couchbase has a JSON web log showing View View {{ {user, list of displayed products } }} Before After and FORWARD produces materialized view {{ {product category, count, products: [{product, count}]} }} Stream DB DB • Stream of inserts, deletes, updates on base data Before After • The Incremental View Maintenance module • Automatically and efficiently updates the materialized view to reflect the stream of changes • SQL++ can also enable automatic Incremental View Maintenance! • With attention to replication of data in views • Opportunities by keys 15 9/9/2015 FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer Custom dashboards, interactive pages & apps • The data models of visualization components (e.g. Google Maps) can be nicely captured with JSON models • The pages are SQL++ (JSON) views! • Mashups of the components views • SQL++ feeds and incrementally updates the page views Use case • From data to visualization with just SQL++ & markup • Ajax/Javascript visuals with no Ajax/Javascript mess • How to easily connect to today’s JS libraries • Custom Ajax visualizations & interfaces for IT personnel FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer (part of) the Google Map model <% unit google.maps.Maps %> { markers: [ { position: { latitude : number, longitude: number } ... } ] } <% end unit %> Outline • What is SQL++ ? • How we use SQL++ in UCSD’s FORWARD ? • SQL++ uses in integration and (live) analytics applications • Introduction to the formal survey of NoSQL, NewSQL and SQL- on-Hadoop 16 9/9/2015 AQLJaqlSQLPigHiveMongoJDBCJSONiqN1QLCQLMongoDBBigQuery (IBM) ((Cassandra)AsterixDB (Couchbase (Google) ) ) Surveyed Databases • FLWORAlgebraicSQLBasedCreatedJSONPreviously- standardlike-like on syntax syntaxat syntaxXQuery "FacebookDremel " • ResearchHadoopAlsoMongoDB28msecNoSQLMost production- likecomprises used cluster(s)clusterssyntax database via NoSQL JDBC releaseNewSQL database (UCI)driver SQL-on-Hadoop NoSQL CQL PIG N1QL SQL-on-Hadoop Jaql Others SQL & NewSQL AQL MongoDB driver SQL++ Removes Superficial Differences • SQL++ covers SQL, N1QL and QL research prototypes (e.g., UCI’s ASTERIX) • Removing the current “Tower of Babel” effect • Providing formal syntax and semantics SELECT AVG(temp) AS tavg db.readings.aggregate( FROM readings {$group: {_id: "$sid", GROUP BY sid tavg: {$avg:"$temp"}}}) SQL MongoDB readings -> group by sid = $.sid a = LOAD 'readings' AS into { tavg: avg($.temp) }; (sid:int, temp:float); Jaql b = GROUP a BY sid; for $r in collection("readings") c = FOREACH b GENERATE group by $r.sid AVG(temp); return { tavg: avg($r.temp) } DUMP c; JSONiq Pig Surveyed features 15 feature matrices (1-11 dimensions each) classifying: • Data values • Schemas • Access and construct nested data • Missing information • Equality semantics • Ordering semantics • Aggregation • Joins • Set operators • Extensibility 17 9/9/2015 Outline • What is SQL++ ? • How we use SQL++ in UCSD’s FORWARD ? • Introduction to the formal survey of NoSQL, NewSQL and SQL- on-Hadoop • Methodology • Example 1: data model (data values) • Example 2: query language (SELECT clause) • Example 3: semantics (path) • Example 4: semantics (equality function) Methodology For each feature: 1. A formal definition of the feature in SQL++ 2. A SQL++ example 3. A feature matrix that classifies each dimension of the feature 4. A discussion of the results, partial support and unexpected behaviors All the results are empirically validated Example: Data values 1. SQL++ example: 1 { 2 location: 'Alpine', 3 readings: [ 4 { 5 time: timestamp('2014-03-12T20:00:00'), 6 ozone: 0.035, 7 no2: 0.0050 8 }, 9 { 10 time: timestamp('2014-03-12T22:00:00'), 11 ozone: 'm', 12 co: 0.4 13 } ] 14 } 18 9/9/2015 Example: Data values 2. SQL++ BNF for values: Example: Data values 3. Feature matrix: Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives Hive Bag of tuples No Yes No No Partial Yes Yes Jaql Any Value Yes Yes No No No Yes Yes Pig Bag of tuples Partial No Partial No Partial Yes Yes CQL Bag of tuples No Partial No Partial Partial No Yes JSONiq Any Value Yes Yes No No No Yes Yes MongoDB Bag of tuples Yes Yes No No No Yes Yes N1QL Bag of tuples Yes Yes No No No Yes Yes SQL Bag of tuples No No No No No No Yes AQL Any Value Yes Yes Yes No No Yes Yes BigQuery Bag of tuples No No No No No Yes Yes MongoJDBC Bag of tuples Yes Yes No No No Yes Yes SQL++ Any Value Yes Yes Yes Partial Yes Yes Yes Example: Data values • Column-by-column comparison 4. Discussion of the results: • Partial support (65k scalar elements) • Identify clusters (who supports JSON?) Composability (top-level values) Heterogeneity Arrays Bags Sets Maps Tuples Primitives Hive Bag of tuples No Yes No No Partial Yes Yes Jaql Any Value Yes Yes No No No Yes Yes Pig Bag of tuples Partial No Partial No Partial Yes Yes CQL Bag of tuples No Partial No Partial Partial No Yes JSONiq Any Value Yes Yes No No No Yes Yes MongoDB Bag of tuples Yes Yes No No No Yes Yes N1QL Bag of tuples Yes Yes No No No Yes Yes SQL Bag of tuples No No No No No No Yes AQL Any Value Yes Yes Yes No No Yes Yes BigQuery Bag of tuples No No No No No Yes Yes MongoJDBC Bag of tuples Yes Yes No No No Yes Yes SQL++ Any Value Yes Yes Yes Partial Yes Yes Yes 19 9/9/2015 Example: SELECT clause 1. SQL++ example: • Projecting nested collections: SELECT TUPLE s.lat, s.long, (SELECT r.ozone "Position and (nested) FROM readings AS r ozone readings of WHERE r.location = s.location) each sensor?" FROM sensors AS s • Projecting non-tuples: "Bag of all the (scalar) SELECT ELEMENT ozone FROM readings ozone readings?" Example: SELECT clause 3. Feature matrix: Projecting tuples containing nested collections Projecting non-tuples Hive Partial No Jaql Yes Yes Pig Partial No CQL No No JSONiq Yes Yes MongoDB Partial Partial N1QL Partial Partial SQL No No AQL Yes Yes BigQuery No No MongoJDBC No No SQL++ Yes Yes Example: SELECT clause • Not well supported features 4. Discussion of the results: • 3 languages support them entirely (same cluster as for data values) Projecting tuples containing nested collections Projecting non-tuples Hive Partial No Jaql Yes Yes Pig Partial No CQL No No JSONiq Yes Yes MongoDB Partial Partial N1QL Partial Partial SQL No No AQL Yes Yes BigQuery No No MongoJDBC No No SQL++ Yes Yes 20 9/9/2015 Config Parameters We use config parameters to encompass and compare various semantics of a feature: • Minimal number of independent dimensions • 1 dimension = 1 config parameter • SQL++ formalism parametrized by the config parameters • Feature matrix classifies the values of each config parameter Example: Paths • Config parameters for tuple navigation: @tuple_nav { absent : missing, type_mismatch: error, } ( x.y ) Example: Paths • The feature matrix classifies, for each language, the value of each config parameter: Missing Type mismatch Hive Error Error Jaql Null Error# Pig Error Error CQL Error Error JSONiq Missing# Missing# MongoDB Missing Missing N1QL Missing Missing SQL Error Error AQL Null Error BigQuery Error Error MongoJDBC Missing# Missing# SQL++ @path @path 21 9/9/2015 Example: Equality Function • Config parameters for equality: @eq { complex : error, type_mismatch : false, null_eq_null : null, null_eq_value : null, missing_eq_missing: missing, missing_eq_value : missing, missing_eq_null : missing } ( x = y ) Example: Equality Function • The feature matrix classifies, for each language, the value of each config parameter: • No real cluster • Some languages have multiple (incompatible) equality functions • Some edge cases cannot happen due to other limitations (SQL has no complex values) Type Missing = Missing = Missing = Complex Null = Null Null = Value mismatch Missing Value Null Hive (=, <=>) Err, Err Err, Err Null, True Null, False N/A N/A N/A Jaql Boolean Null Null Null N/A N/A N/A Pig Boolean partial Err Null Null N/A N/A N/A CQL Err Err Err False/Null N/A N/A N/A JSONiq (=, deep-equal()) Err, Boolean Err, False True, True False, False Missing, True Missing, False Missing, False MongoDB Boolean False True False True False False N1QL Boolean False Null Null Missing Missing Missing SQL N/A Err Null Null N/A N/A N/A AQL Err Err Null Null N/A N/A N/A BigQuery Err Err Null Null N/A N/A N/A MongoJDBC Boolean False True False N/A False False SQL++ @equal @equal @equal @equal @equal @equal @equal How to use this survey? • As a database user: • Understand the semantics of a (often underspecified) query language / be aware of the limitation of a database • As a designer/architect of a database • Produce formal specification of your query language • Align semantics with SQL's • As a database researcher • The results might change, but the survey methodology stays • As a designer/architect of database middleware • Understand what capability variations need to be encapsulated and simulated 22 9/9/2015 The survey shows: • The marketing clusters do not correspond to real capabilities • Limited capabilities: matrices are sparse and fragmented (more pressure on source-specific rewriters and distributor) The Future is Semi-Structured and Declarative • Scalability • Flexibility of Semistructured Data • Power of Declarative: Automatic Optimization, View Maintenance • The primary operational and the secondary analytics application out of semistructured, declarative platforms 23