9/9/2015

SQL++

and its use in the FORWARD framework

with the support of National Science Foundation, Google, Informatica, Couchbase

Semistructured Databases have Arrived

[Figure: logos of semistructured databases and query languages (CQL, Pig, N1QL, AQL, and others), grouped as NoSQL, SQL-on-Hadoop, and others.]

in the form of JSON…

{
  location: 'Alpine',
  readings: [
    { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
    { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
  ]
}


Semistructured Query Languages come along…

SQL: Expressive Power and Adoption
JSON: Semistructured Data

but in a Tower of Babel…

SQL:
  SELECT AVG(temp) AS tavg
  FROM readings
  GROUP BY sid

MongoDB:
  db.readings.aggregate(
    {$group: {_id: "$sid",
              tavg: {$avg:"$temp"}}})

Jaql:
  readings -> group by sid = $.sid
    into { tavg: avg($.temp) };

Pig:
  a = LOAD 'readings' AS (sid:int, temp:float);
  b = GROUP a BY sid;
  c = FOREACH b GENERATE AVG(temp);
  DUMP c;

JSONiq:
  for $r in collection("readings")
  group by $r.sid
  return { tavg: avg($r.temp) }

with limited query abilities…

Growing out of key-value databases or SQL with special type UDFs

Far from the full-fledged power of prior non-relational query languages (OQL, XQuery)
* Academic projects being the exception (e.g., ASTERIX, FORWARD)


in lieu of formal semantics…

Query languages were not meant for Ninjas

Outline

• What is SQL++ ?

• How do we use SQL++ in UCSD’s FORWARD?

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

SQL++: Data model + query language for querying both structured data (SQL) and semistructured data (native JSON)

• Formal Semantics: adjusting tuple calculus to the richness of JSON
• Schemaless
• Full-fledged, declarative, aggressively backwards-compatible with SQL
• Configurable SQL++: formal configuration options towards choosing appropriate semantics, understanding repercussions
• Unifying: 11 popular databases/QLs captured by appropriate settings of Configurable SQL++


How is it used in the Big Data industry
• ASTERIXDB
• CouchDB heading towards SQL++

How do we use it in UCSD

Automatic Incremental View Maintenance

Middleware: FORWARD virtual database

Query Optimization & Execution: query plans based on a schemaless SQL++ algebra

SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity
  • Array nested inside a tuple
  • Heterogeneous tuples in collections
  • Arbitrary compositions of array, bag, tuple

{
  location: 'Alpine',
  readings: [
    { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
    { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
  ]
}


SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity + permissive structures
  • Arbitrary compositions of array, bag, tuple

[
  [5, 10, 3],
  [21, 2, 15, 6],
  [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6]
]

SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity + permissive structures
• Extend JSON with bags and enriched types
  • Bags {{ … }}
  • Enriched types

{
  location: 'Alpine',
  readings: {{
    { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
    { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
  }}
}

Exercise: Why is the SQL++ data model a superset of the SQL data model?
• A SQL table corresponds to a JSON bag that has homogeneous tuples


SQL++ Data Model (also JSON++ data model)

A superset of SQL and JSON
• Extend SQL with arrays + nesting + heterogeneity + permissive structures + permissive schema
• Extend JSON with bags and enriched types
  • With schema
  • Or, schemaless
  • Or, partial schema

Data:
{
  location: 'Alpine',
  readings: {{
    { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
    { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
  }}
}

Partial schema:
{
  location: string,
  readings: {{
    {
      time: timestamp,
      ozone: any,
      *
    }
  }}
}

From SQL to SQL++

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

Backwards Compatibility with SQL

Find sensors that recorded a temperature below 50

readings:
  {{
    { sid: 2, temp: 70.1 },
    { sid: 2, temp: 49.2 },
    { sid: 1, temp: null }
  }}

Query:
  SELECT DISTINCT r.sid
  FROM readings AS r
  WHERE r.temp < 50

Result:
  {{ { sid : 2 } }}

The same input and result viewed as SQL tables:

  readings           result
  sid   temp         sid
  2     70.1         2
  2     49.2
  1     null
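The backwards-compatible behaviour of this query can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: Python's None stands in for SQL null, and the helper names lt and select_distinct_sid are hypothetical, not part of SQL++ or FORWARD.

```python
# Sketch of: SELECT DISTINCT r.sid FROM readings AS r WHERE r.temp < 50
# Assumption: None stands in for SQL null.
readings = [
    {"sid": 2, "temp": 70.1},
    {"sid": 2, "temp": 49.2},
    {"sid": 1, "temp": None},
]

def lt(x, y):
    """SQL-style '<': a comparison involving null yields null (None)."""
    if x is None or y is None:
        return None
    return x < y

def select_distinct_sid(bag, threshold):
    seen, out = set(), []
    for r in bag:
        # WHERE keeps a binding only when the predicate is exactly True,
        # so a null result filters the tuple out, as in SQL.
        if lt(r["temp"], threshold) is True and r["sid"] not in seen:
            seen.add(r["sid"])
            out.append({"sid": r["sid"]})
    return out

result = select_distinct_sid(readings, 50)
```

Sensor 1 is dropped because its comparison evaluates to null rather than true, which matches the single-tuple result on the slide.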


From SQL to SQL++

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

• Adjust SQL syntax/semantics for heterogeneity, complex input/output values

• Achieved with few extensions and mostly by SQL limitation removals

From Tuple Variables to Element Variables

Find the highest two sensor readings that are below 1.0

readings:
  [ 1.3, 0.7, 0.3, 0.8 ]

Query:
  SELECT r AS co
  FROM readings AS r
  WHERE r < 1.0
  ORDER BY r DESC
  LIMIT 2

Result:
  [ { co: 0.8 }, { co: 0.7 } ]

Element Variables

readings:
  [ 1.3, 0.7, 0.3, 0.8 ]

FROM readings AS r
  B^out_FROM = B^in_WHERE = {{ ⟨ r : 1.3 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

WHERE r < 1.0
  B^out_WHERE = B^in_ORDERBY = {{ ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩, ⟨ r : 0.8 ⟩ }}

ORDER BY r DESC
  B^out_ORDERBY = B^in_LIMIT = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩, ⟨ r : 0.3 ⟩ ]

LIMIT 2
  B^out_LIMIT = B^in_SELECT = [ ⟨ r : 0.8 ⟩, ⟨ r : 0.7 ⟩ ]

SELECT r AS co
  Result: [ { co: 0.8 }, { co: 0.7 } ]
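The clause-by-clause flow of binding tuples can be traced with a small Python sketch. Each binding tuple is modelled as a dict and each clause as a plain list transformation; the variable names (b_from, b_where, and so on) are illustrative choices, not SQL++ terminology.

```python
# Sketch: each clause consumes the binding tuples produced by the previous one.
readings = [1.3, 0.7, 0.3, 0.8]

# FROM readings AS r: one binding tuple per element (r is an element variable)
b_from = [{"r": v} for v in readings]

# WHERE r < 1.0
b_where = [b for b in b_from if b["r"] < 1.0]

# ORDER BY r DESC: turns the bag of bindings into an ordered list
b_orderby = sorted(b_where, key=lambda b: b["r"], reverse=True)

# LIMIT 2
b_limit = b_orderby[:2]

# SELECT r AS co
result = [{"co": b["r"]} for b in b_limit]
```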


Correlated Subqueries Everywhere

Find the highest two sensor readings that are below 1.0

sensors:
  [
    [1.3, 2],
    [0.7, 0.7, 0.9],
    [0.3, 0.8, 1.1],
    [0.7, 1.4]
  ]

Query:
  SELECT r AS co
  FROM sensors AS s,
       s AS r
  WHERE r < 1.0
  ORDER BY r DESC
  LIMIT 2

Result:
  [ { co: 0.9 }, { co: 0.8 } ]

Correlated Subqueries Everywhere

sensors:
  [ [1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4] ]

FROM sensors AS s, s AS r
  B^out_FROM = B^in_WHERE = {{
    ⟨ s : [1.3, 2], r : 1.3 ⟩,
    ⟨ s : [1.3, 2], r : 2 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 1.1 ⟩,
    ⟨ s : [0.7, 1.4], r : 0.7 ⟩,
    ⟨ s : [0.7, 1.4], r : 1.4 ⟩
  }}

WHERE r < 1.0
ORDER BY r DESC
LIMIT 2
SELECT r AS co

Correlated Subqueries Everywhere

sensors:
  [ [1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4] ]

FROM sensors AS s, s AS r
WHERE r < 1.0
  B^out_WHERE = B^in_ORDERBY = {{
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
    ⟨ s : [0.7, 1.4], r : 0.7 ⟩
  }}

ORDER BY r DESC
LIMIT 2
SELECT r AS co


Correlated Subqueries Everywhere

sensors:
  [ [1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4] ]

FROM sensors AS s, s AS r
WHERE r < 1.0
ORDER BY r DESC
  B^out_ORDERBY = B^in_LIMIT = [
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 0.7, 0.9], r : 0.7 ⟩,
    ⟨ s : [0.7, 1.4], r : 0.7 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.3 ⟩
  ]

LIMIT 2
  B^out_LIMIT = B^in_SELECT = [
    ⟨ s : [0.7, 0.7, 0.9], r : 0.9 ⟩,
    ⟨ s : [0.3, 0.8, 1.1], r : 0.8 ⟩
  ]

SELECT r AS co
  Result: [ { co: 0.9 }, { co: 0.8 } ]
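A left-correlated FROM clause, where the second item ranges over the value bound by the first, behaves like a nested loop. The following Python sketch reproduces the binding counts of the walkthrough; modelling bindings as dicts is an assumption of the sketch.

```python
# Sketch: FROM sensors AS s, s AS r  (r ranges over the array bound to s)
sensors = [[1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4]]

# The second FROM item is correlated with the first: a nested loop.
b_from = [{"s": s, "r": r} for s in sensors for r in s]

b_where = [b for b in b_from if b["r"] < 1.0]                    # WHERE r < 1.0
b_orderby = sorted(b_where, key=lambda b: b["r"], reverse=True)  # ORDER BY r DESC
b_limit = b_orderby[:2]                                          # LIMIT 2
result = [{"co": b["r"]} for b in b_limit]                       # SELECT r AS co
```

Ten bindings leave FROM, six survive WHERE, and the two largest values remain after LIMIT.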

Utilization of Input Order

Find the highest two sensor readings that are below 1.0

sensors:
  [
    [1.3, 2],
    [0.7, 0.7, 0.9],
    [0.3, 0.8, 1.1],
    [0.7, 1.4]
  ]

Query:
  SELECT r AS co,
         rp AS x,
         sp AS y
  FROM sensors AS s AT sp,
       s AS r AT rp
  WHERE r < 1.0
  ORDER BY r DESC
  LIMIT 2

Result:
  [
    { co: 0.9, x: 3, y: 2 },
    { co: 0.8, x: 2, y: 3 }
  ]
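The AT position variables can be imitated with Python's enumerate. The sketch assumes 1-based positions, which is what the x and y values in the slide's result imply.

```python
# Sketch: FROM sensors AS s AT sp, s AS r AT rp
# Assumption: positions are 1-based, as the slide's result suggests.
sensors = [[1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4]]

b_from = [
    {"s": s, "sp": sp, "r": r, "rp": rp}
    for sp, s in enumerate(sensors, start=1)
    for rp, r in enumerate(s, start=1)
]
b_where = [b for b in b_from if b["r"] < 1.0]
b_limit = sorted(b_where, key=lambda b: b["r"], reverse=True)[:2]

# SELECT r AS co, rp AS x, sp AS y
result = [{"co": b["r"], "x": b["rp"], "y": b["sp"]} for b in b_limit]
```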

Constructing Nested Results – Subqueries Again

Γ0 = ⟨
  sensors : {{ {sensor: 1}, {sensor: 2} }},
  logs : {{
    {sensor: 1, co: 0.4},
    {sensor: 1, co: 0.2},
    {sensor: 2, co: 0.3}
  }}
⟩

FROM sensors AS s
  B^out_FROM = B^in_SELECT = {{ ⟨ s : {sensor: 1} ⟩, ⟨ s : {sensor: 2} ⟩ }}

SELECT s.sensor AS sensor,
       ( SELECT l.co AS co
         FROM logs AS l
         WHERE l.sensor = s.sensor
       ) AS readings

Result:
  {{
    { sensor : 1, readings: {{ {co: 0.4}, {co: 0.2} }} },
    { sensor : 2, readings: {{ {co: 0.3} }} }
  }}


Constructing Nested Results

Evaluation of the nested subquery for the first outer binding:

Γ1 = ⟨
  s : {sensor : 1},
  sensors : {{ {sensor: 1}, {sensor: 2} }},
  logs : {{
    {sensor: 1, co: 0.4},
    {sensor: 1, co: 0.2},
    {sensor: 2, co: 0.3}
  }}
⟩

FROM logs AS l
  B^out_FROM = B^in_WHERE = {{
    ⟨ l : {sensor: 1, co: 0.4} ⟩,
    ⟨ l : {sensor: 1, co: 0.2} ⟩,
    ⟨ l : {sensor: 2, co: 0.3} ⟩
  }}

WHERE l.sensor = s.sensor
  B^out_WHERE = B^in_SELECT = {{
    ⟨ l : {sensor: 1, co: 0.4} ⟩,
    ⟨ l : {sensor: 1, co: 0.2} ⟩
  }}

SELECT l.co AS co
  Result: {{ {co: 0.4}, {co: 0.2} }}
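The nested result can be reproduced by evaluating the inner query once per outer binding, with the outer variable s visible in the inner environment. In this Python sketch bags are modelled as lists, and readings_of is a hypothetical helper name.

```python
# Sketch: a correlated subquery in the SELECT clause builds a nested result.
sensors = [{"sensor": 1}, {"sensor": 2}]
logs = [
    {"sensor": 1, "co": 0.4},
    {"sensor": 1, "co": 0.2},
    {"sensor": 2, "co": 0.3},
]

def readings_of(s):
    """Inner query: SELECT l.co AS co FROM logs AS l WHERE l.sensor = s.sensor"""
    return [{"co": l["co"]} for l in logs if l["sensor"] == s["sensor"]]

# Outer query: the subquery is re-evaluated in an environment extended with s
result = [{"sensor": s["sensor"], "readings": readings_of(s)} for s in sensors]
```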

Iterating over Attribute-Value pairs

INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

Γ = ⟨
  readings : {
    co : [],
    no2: [0.7],
    so2: [0.5, 2]
  }
⟩

FROM readings AS {g:v},
     v AS n
  {{
    ⟨ g : "no2", v : [0.7], n : 0.7 ⟩,
    ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
    ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
  }}

SELECT ELEMENT { gas: g, num: n }

Target output (the "co" element is produced only under OUTER correlation):
  [
    { gas: "co", num: null },
    { gas: "no2", num: 0.7 },
    { gas: "so2", num: 0.5 },
    { gas: "so2", num: 2 }
  ]


INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

Γ = ⟨
  readings : {
    co : [],
    no2: [0.7],
    so2: [0.5, 2]
  }
⟩

FROM readings AS {g:v}
     INNER CORRELATE
     v AS n
  {{
    ⟨ g : "no2", v : [0.7], n : 0.7 ⟩,
    ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
    ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
  }}

SELECT ELEMENT { gas: g, num: n }

Target output (the "co" element is produced only under OUTER CORRELATE):
  [
    { gas: "co", num: null },
    { gas: "no2", num: 0.7 },
    { gas: "so2", num: 0.5 },
    { gas: "so2", num: 2 }
  ]

INNER Vs OUTER Correlation: Capturing Outerjoins, OUTER FLATTEN

Γ = ⟨
  readings : {
    co : [],
    no2: [0.7],
    so2: [0.5, 2]
  }
⟩

FROM readings AS {g:v}
     OUTER CORRELATE
     v AS n
  {{
    ⟨ g : "co", v : [], n : missing ⟩,
    ⟨ g : "no2", v : [0.7], n : 0.7 ⟩,
    ⟨ g : "so2", v : [0.5, 2], n : 0.5 ⟩,
    ⟨ g : "so2", v : [0.5, 2], n : 2 ⟩
  }}

SELECT ELEMENT { gas: g, num: n }

Output:
  [
    { gas: "co", num: null },
    { gas: "no2", num: 0.7 },
    { gas: "so2", num: 0.5 },
    { gas: "so2", num: 2 }
  ]
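The difference between the two correlation modes can be sketched in Python. Several things here are assumptions of the sketch rather than SQL++ definitions: the MISSING sentinel, the correlate and project helper names, and the choice to render a missing n as null (None) in the output, which follows the output shown on the slide.

```python
# Sketch: INNER vs OUTER correlation over attribute-value pairs.
MISSING = object()  # hypothetical sentinel for the SQL++ 'missing' value

readings = {"co": [], "no2": [0.7], "so2": [0.5, 2]}

def correlate(tup, outer):
    """FROM readings AS {g:v} [INNER|OUTER] CORRELATE v AS n"""
    bindings = []
    for g, v in tup.items():          # iterate over attribute-value pairs
        if v:
            bindings.extend({"g": g, "v": v, "n": n} for n in v)
        elif outer:
            # OUTER keeps the left binding even when v has no elements
            bindings.append({"g": g, "v": v, "n": MISSING})
    return bindings

def project(bindings):
    """SELECT ELEMENT {gas: g, num: n}; missing is rendered as null here."""
    return [
        {"gas": b["g"], "num": None if b["n"] is MISSING else b["n"]}
        for b in bindings
    ]

inner = project(correlate(readings, outer=False))
outer = project(correlate(readings, outer=True))
```

Only the outer variant emits an element for "co", which is the outerjoin-like behaviour the slide title refers to.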

The SQL++ Extensions to SQL

• Goal: Queries that input/output any JSON type, yet backwards compatible with SQL

• Adjust SQL syntax/semantics for heterogeneity, complex values

• Achieved with few extensions and SQL limitation removals
  • Element Variables
  • Correlated Subqueries (in FROM and SELECT clauses)
  • Optional utilization of Input Order
  • Grouping independent of Aggregation Functions
  • ... and a few more details

• Configurability: making SQL++ a unifying and configurable language


Algebras do not need schemaful inputs

Γ1 = ⟨
  s : {sensor : 1},
  sensors : {{ {sensor: 1}, {sensor: 2} }},
  logs : {{
    {sensor: 1, co: 0.4},
    {sensor: 1, co: 0.2},
    {sensor: 2, co: 0.3}
  }}
⟩

logs viewed as a table:
  sensor  co
  1       0.4
  1       0.2
  2       0.3

FROM logs AS l
  B^out_FROM = B^in_WHERE = {{
    ⟨ l : {sensor: 1, co: 0.4} ⟩,
    ⟨ l : {sensor: 1, co: 0.2} ⟩,
    ⟨ l : {sensor: 2, co: 0.3} ⟩
  }}

WHERE l.sensor = s.sensor
  B^out_WHERE = B^in_SELECT = {{
    ⟨ l : {sensor: 1, co: 0.4} ⟩,
    ⟨ l : {sensor: 1, co: 0.2} ⟩
  }}

SELECT l.co AS co
  Result: {{ {co: 0.4}, {co: 0.2} }}

Outline

• What is SQL++ ?

• How do we use SQL++ in UCSD’s FORWARD?
  • SQL++ uses in integration

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

SQL++ based Virtual Database

[Architecture diagram, top to bottom]
• Client: sends federated SQL++ queries and receives SQL++ results
• FORWARD Virtual Database: a SQL++ Query Processor over SQL++ Virtual Views of Sources
• One SQL++ Virtual View and one Wrapper per source; each wrapper sends native queries and receives native results
• Sources: SQL Database, NewSQL Database, NoSQL Database, SQL-on-Hadoop Database, Java In-Memory Objects


SQL++ Virtual and/or Materialized Views

[Architecture diagram, top to bottom]
• Client: sends SQL++ queries and receives SQL++ results
• FORWARD Integration Database: a SQL++ Query Processor over SQL++ Integrated Views, driven by SQL++ View Definitions
• SQL++ Integrated Views and SQL++ Added Value Views, defined over the SQL++ Virtual Views
• One SQL++ Virtual View per source
• Sources: SQL Database, NewSQL Database, NoSQL Database, SQL-on-Hadoop Database, Java In-Memory Objects

Use Case 1: Hide the ++ source features behind SQL views

BI Tool / Report Writer
  sends a SQL query to, and receives a SQL result from, the FORWARD Virtual Database

FORWARD Virtual Database: SQL View or its SQL++ equivalent
  SELECT m.sid, t AS temp
  FROM measurements AS m,
       m.temperatures AS t

View contents, as a SQL table and as a SQL++ bag:

  measurements:          measurements: {{
  sid  temp                {sid: 2, temp: 70.1},
  2    70.1                {sid: 2, temp: 70.2},
  2    70.2                {sid: 1, temp: 71.0}
  1    71.0              }}

The virtual database sends a Couchbase query and receives Couchbase results.

Couchbase:
  measurements: {{
    { sid: 2, temperatures: [70.1, 70.2] },
    { sid: 1, temperatures: [71.0] }
  }}
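What the view computes is an unnesting of the temperatures array into flat tuples. A minimal Python sketch over the same sample data follows; it illustrates only the view's result, not how FORWARD or Couchbase execute it.

```python
# Sketch: SELECT m.sid, t AS temp FROM measurements AS m, m.temperatures AS t
measurements = [
    {"sid": 2, "temperatures": [70.1, 70.2]},
    {"sid": 1, "temperatures": [71.0]},
]

# One flat tuple per (measurement, temperature) pair: a shape SQL tools can read
view = [
    {"sid": m["sid"], "temp": t}
    for m in measurements
    for t in m["temperatures"]
]
```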

Use Case 2: Semantics-Aware Pushdown

"Sensors that recorded a temperature below 50"

SQL++ query (the MongoDB Virtual Database returns the SQL++ result):
  SELECT DISTINCT m.sid
  FROM measurements AS m
  WHERE m.temp < 50
  (SQL++ semantics for m.temp < 50)

Virtual View: (Identical)

MongoDB Wrapper: sends the MongoDB query and receives the MongoDB results
  ...
  { $match: {temp: {$lt: 50}}},
  { $match: {temp: {$ne: null}}}
  ...
  (MongoDB semantics for temp $lt 50)

MongoDB:
  measurements: {{
    { sid: 2, temp: 70.1 },
    { sid: 2, temp: 49.2 },
    { sid: 1, temp: null }
  }}


Push-Down Challenges: Limited capabilities & semantic variations

• How to simulate incompatible semantics/missing features?
• Sources like SQL and MongoDB cannot execute the entirety of SQL++
• How to efficiently push down computation?

Issues automatically handled by the query processor. Plenty of query rewriting problems, including novel ones on semi-structured operators

SQL++ captures Semantics Variations

• Semantics of "less-than" comparison are different across sources:

• Config Parameters to capture these variations

@lt { MongoDB }
( x < y )

@lt {
  complex       : deep,
  type_mismatch : false,
  null_lt_null  : false,
  null_lt_value : false,
  ...
}
( x < y )
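One way to picture such a parametrized operator is a comparison function that takes its edge-case results from a configuration object. The sketch below is illustrative only: LtConfig, slide_settings and null_propagating are hypothetical names, the `complex` option is omitted, and applying null_lt_value to both argument orders is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LtConfig:
    # Result returned in each edge case; None stands in for null.
    type_mismatch: object = False
    null_lt_null: object = False
    null_lt_value: object = False

def _is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def lt(x, y, cfg):
    """Configurable 'x < y' (the `complex` option is left out of this sketch)."""
    if x is None and y is None:
        return cfg.null_lt_null
    if x is None or y is None:
        # Assumption: the same setting covers null < value and value < null
        return cfg.null_lt_value
    if _is_number(x) and _is_number(y):
        return x < y
    if isinstance(x, str) and isinstance(y, str):
        return x < y
    return cfg.type_mismatch

# The explicit settings listed on the slide
slide_settings = LtConfig(type_mismatch=False, null_lt_null=False, null_lt_value=False)
# A hypothetical alternative in which null propagates, as in SQL comparisons
null_propagating = LtConfig(type_mismatch=False, null_lt_null=None, null_lt_value=None)
```

The same call, lt(None, 50, cfg), then yields False under one configuration and null under the other, which is the kind of variation a wrapper has to compensate for during pushdown.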

Semantics Variations

In NoSQL, NewSQL, SQL-on-Hadoop, variation of semantics for:
• Paths
• Equality
• Comparisons

And all the operators that use them:
• Selections
• Grouping
• Ordering
• Set operations

Each of these features has a set of config parameters in SQL++


Don’t give a standard; teach how to reach a standard

• Fast evolving space
• Standard would be premature
• Configuration options itemize and explain the space options
• Rationalize discussion

Advise on consistent settings of the configuration options

• Not every configuration option setting is sensible
• Configuration options have to subscribe to common sense, which enables rewritings
  • if x = y then [x] = [y]
  • if x = y and y = z then x = z
• Have to be consistent with each other
  • E.g., if true AND missing is missing, then false OR missing is missing

FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer

The Incremental View Maintenance (IVM) functionality:
• SQL++ (Materialized) View Definition
  • E.g., Couchbase has a JSON web log showing {{ {user, list of displayed products} }} and FORWARD produces the materialized view {{ {product category, count, products: [{product, count}]} }}
• Stream of inserts, deletes, updates on base data
• The Incremental View Maintenance module
  • Automatically and efficiently updates the materialized view to reflect the stream of changes
• SQL++ can also enable automatic Incremental View Maintenance!
  • With attention to replication of data in views
  • Opportunities by keys

[Figure: a stream of changes takes the DB from "Before" to "After"; a corresponding stream takes the View from "Before" to "After".]


FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer

Custom dashboards, interactive pages & apps
• The data models of visualization components (e.g. Google Maps) can be nicely captured with JSON models
• The pages are SQL++ (JSON) views!
• Mashups of the components' views
• SQL++ feeds and incrementally updates the page views

Use case
• From data to visualization with just SQL++ & markup
• Ajax/JavaScript visuals with no Ajax/JavaScript mess
• How to easily connect to today’s JS libraries
• Custom Ajax visualizations & interfaces for IT personnel

FORWARD: SQL++ Incremental View Maintenance and Application Visualization layer

(part of) the Google Map model

<% unit google.maps.Maps %>
{
  markers: [
    {
      position: {
        latitude : number,
        longitude: number
      }
      ...
    }
  ]
}
<% end unit %>

Outline

• What is SQL++ ?

• How do we use SQL++ in UCSD’s FORWARD?
  • SQL++ uses in integration and (live) analytics applications

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop


Surveyed Databases

• Hive
• Jaql (IBM)
• Pig
• CQL (Cassandra)
• JSONiq
• MongoDB
• N1QL (Couchbase)
• SQL
• AQL (AsterixDB)
• BigQuery (Google)
• MongoJDBC (MongoDB driver)

[Figure: logos of the surveyed systems grouped as NoSQL, SQL-on-Hadoop, SQL & NewSQL, and others, each with a short annotation.]

SQL++ Removes Superficial Differences

• SQL++ covers SQL, N1QL and QL research prototypes (e.g., UCI’s ASTERIX)
• Removing the current “Tower of Babel” effect
• Providing formal syntax and semantics

SQL:
  SELECT AVG(temp) AS tavg
  FROM readings
  GROUP BY sid

MongoDB:
  db.readings.aggregate(
    {$group: {_id: "$sid",
              tavg: {$avg:"$temp"}}})

Jaql:
  readings -> group by sid = $.sid
    into { tavg: avg($.temp) };

Pig:
  a = LOAD 'readings' AS (sid:int, temp:float);
  b = GROUP a BY sid;
  c = FOREACH b GENERATE AVG(temp);
  DUMP c;

JSONiq:
  for $r in collection("readings")
  group by $r.sid
  return { tavg: avg($r.temp) }

Surveyed features

15 feature matrices (1-11 dimensions each) classifying:
• Data values
• Schemas
• Access and construct nested data
• Missing information
• Equality semantics
• Ordering semantics
• Aggregation
• Joins
• Set operators
• Extensibility


Outline

• What is SQL++ ?

• How do we use SQL++ in UCSD’s FORWARD?

• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop
  • Methodology
  • Example 1: data model (data values)
  • Example 2: query language (SELECT clause)
  • Example 3: semantics (path)
  • Example 4: semantics (equality function)

Methodology

For each feature:
1. A formal definition of the feature in SQL++
2. A SQL++ example
3. A feature matrix that classifies each dimension of the feature
4. A discussion of the results, partial support and unexpected behaviors

All the results are empirically validated

Example: Data values

1. SQL++ example:

{
  location: 'Alpine',
  readings: [
    {
      time: timestamp('2014-03-12T20:00:00'),
      ozone: 0.035,
      no2: 0.0050
    },
    {
      time: timestamp('2014-03-12T22:00:00'),
      ozone: 'm',
      co: 0.4
    } ]
}


Example: Data values

2. SQL++ BNF for values:

Example: Data values

3. Feature matrix:

            Composability
            (top-level values)  Heterogeneity  Arrays   Bags     Sets     Maps     Tuples  Primitives
Hive        Bag of tuples       No             Yes      No       No       Partial  Yes     Yes
Jaql        Any Value           Yes            Yes      No       No       No       Yes     Yes
Pig         Bag of tuples       Partial        No       Partial  No       Partial  Yes     Yes
CQL         Bag of tuples       No             Partial  No       Partial  Partial  No      Yes
JSONiq      Any Value           Yes            Yes      No       No       No       Yes     Yes
MongoDB     Bag of tuples       Yes            Yes      No       No       No       Yes     Yes
N1QL        Bag of tuples       Yes            Yes      No       No       No       Yes     Yes
SQL         Bag of tuples       No             No       No       No       No       No      Yes
AQL         Any Value           Yes            Yes      Yes      No       No       Yes     Yes
BigQuery    Bag of tuples       No             No       No       No       No       Yes     Yes
MongoJDBC   Bag of tuples       Yes            Yes      No       No       No       Yes     Yes
SQL++       Any Value           Yes            Yes      Yes      Partial  Yes      Yes     Yes

Example: Data values

4. Discussion of the results:
• Column-by-column comparison
• Partial support (65k scalar elements)
• Identify clusters (who supports JSON?)



Example: SELECT clause

1. SQL++ example:

• Projecting nested collections: "Position and (nested) ozone readings of each sensor?"

  SELECT TUPLE s.lat, s.long,
         (SELECT r.ozone
          FROM readings AS r
          WHERE r.location = s.location)
  FROM sensors AS s

• Projecting non-tuples: "Bag of all the (scalar) ozone readings?"

  SELECT ELEMENT ozone
  FROM readings

Example: SELECT clause

3. Feature matrix:

            Projecting tuples containing   Projecting
            nested collections             non-tuples
Hive        Partial                        No
Jaql        Yes                            Yes
Pig         Partial                        No
CQL         No                             No
JSONiq      Yes                            Yes
MongoDB     Partial                        Partial
N1QL        Partial                        Partial
SQL         No                             No
AQL         Yes                            Yes
BigQuery    No                             No
MongoJDBC   No                             No
SQL++       Yes                            Yes

Example: SELECT clause

4. Discussion of the results:
• Not well supported features
• 3 languages support them entirely (same cluster as for data values)



Config Parameters

We use config parameters to encompass and compare various semantics of a feature:

• Minimal number of independent dimensions
• 1 dimension = 1 config parameter
• SQL++ formalism parametrized by the config parameters
• Feature matrix classifies the values of each config parameter

Example: Paths

• Config parameters for tuple navigation:

@tuple_nav {
  absent        : missing,
  type_mismatch : error,
}
( x.y )
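The effect of these two parameters on a path step x.y can be sketched as a Python function. The names tuple_nav, NavError and MISSING are hypothetical, and the accepted option values ("missing", "null", "error") are the ones that appear in this deck's feature matrix.

```python
# Sketch: tuple navigation x.y driven by two config parameters.
MISSING = object()  # hypothetical sentinel for 'missing'; None stands in for null

class NavError(Exception):
    """Raised when a parameter is configured as 'error'."""

def _outcome(setting, reason):
    if setting == "missing":
        return MISSING
    if setting == "null":
        return None
    raise NavError(reason)  # setting == "error"

def tuple_nav(x, attr, absent="missing", type_mismatch="error"):
    """Evaluate x.attr; the defaults mirror the settings shown on the slide."""
    if not isinstance(x, dict):
        return _outcome(type_mismatch, "navigation into a non-tuple value")
    if attr not in x:
        return _outcome(absent, f"attribute {attr!r} is absent")
    return x[attr]
```

Switching absent to "null", or both parameters to "error", changes the outcome for the same input without touching the rest of the evaluator.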

Example: Paths

• The feature matrix classifies, for each language, the value of each config parameter:

            Missing    Type mismatch
Hive        Error      Error
Jaql        Null       Error#
Pig         Error      Error
CQL         Error      Error
JSONiq      Missing#   Missing#
MongoDB     Missing    Missing
N1QL        Missing    Missing
SQL         Error      Error
AQL         Null       Error
BigQuery    Error      Error
MongoJDBC   Missing#   Missing#
SQL++       @path      @path


Example: Equality Function

• Config parameters for equality:

@eq {
  complex            : error,
  type_mismatch      : false,
  null_eq_null       : null,
  null_eq_value      : null,
  missing_eq_missing : missing,
  missing_eq_value   : missing,
  missing_eq_null    : missing
}
( x = y )
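A configurable equality function can be sketched the same way, with one dictionary entry per parameter. The names eq, SLIDE_EQ and MONGODB_ROW are hypothetical; SLIDE_EQ copies the settings above, MONGODB_ROW copies the MongoDB row of the survey's equality matrix, and reading "Boolean" for complex values as a deep comparison is an assumption of the sketch.

```python
# Sketch: equality parametrized by the seven @eq config parameters.
MISSING = object()  # hypothetical sentinel for 'missing'; None stands in for null

SLIDE_EQ = {
    "complex": "error", "type_mismatch": False,
    "null_eq_null": None, "null_eq_value": None,
    "missing_eq_missing": MISSING, "missing_eq_value": MISSING,
    "missing_eq_null": MISSING,
}
MONGODB_ROW = {
    "complex": "boolean", "type_mismatch": False,
    "null_eq_null": True, "null_eq_value": False,
    "missing_eq_missing": True, "missing_eq_value": False,
    "missing_eq_null": False,
}

def _is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def eq(x, y, cfg=SLIDE_EQ):
    if x is MISSING or y is MISSING:
        if x is MISSING and y is MISSING:
            return cfg["missing_eq_missing"]
        other = y if x is MISSING else x
        return cfg["missing_eq_null"] if other is None else cfg["missing_eq_value"]
    if x is None or y is None:
        both_null = x is None and y is None
        return cfg["null_eq_null"] if both_null else cfg["null_eq_value"]
    if isinstance(x, (list, dict)) or isinstance(y, (list, dict)):
        if cfg["complex"] == "error":
            raise TypeError("equality on complex values is configured as an error")
        return x == y  # assumption: 'boolean' means a deep comparison
    if type(x) is type(y) or (_is_number(x) and _is_number(y)):
        return x == y
    return cfg["type_mismatch"]
```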

Example: Equality Function

• The feature matrix classifies, for each language, the value of each config parameter:
  • No real cluster
  • Some languages have multiple (incompatible) equality functions
  • Some edge cases cannot happen due to other limitations (SQL has no complex values)

                          Complex        Type mismatch  Null = Null  Null = Value  Missing = Missing  Missing = Value  Missing = Null
Hive (=, <=>)             Err, Err       Err, Err       Null, True   Null, False   N/A                N/A              N/A
Jaql                      Boolean        Null           Null         Null          N/A                N/A              N/A
Pig                       Boolean partial  Err          Null         Null          N/A                N/A              N/A
CQL                       Err            Err            Err          False/Null    N/A                N/A              N/A
JSONiq (=, deep-equal())  Err, Boolean   Err, False     True, True   False, False  Missing, True      Missing, False   Missing, False
MongoDB                   Boolean        False          True         False         True               False            False
N1QL                      Boolean        False          Null         Null          Missing            Missing          Missing
SQL                       N/A            Err            Null         Null          N/A                N/A              N/A
AQL                       Err            Err            Null         Null          N/A                N/A              N/A
BigQuery                  Err            Err            Null         Null          N/A                N/A              N/A
MongoJDBC                 Boolean        False          True         False         N/A                False            False
SQL++                     @equal         @equal         @equal       @equal        @equal             @equal           @equal

How to use this survey?

• As a database user:
  • Understand the semantics of an (often underspecified) query language / be aware of the limitations of a database

• As a designer/architect of a database:
  • Produce a formal specification of your query language
  • Align semantics with SQL's

• As a database researcher:
  • The results might change, but the survey methodology stays

• As a designer/architect of database middleware:
  • Understand what capability variations need to be encapsulated and simulated


The survey shows:

• The marketing clusters do not correspond to real capabilities

• Limited capabilities: matrices are sparse and fragmented (more pressure on source-specific rewriters and distributor)

The Future is Semi-Structured and Declarative

• Scalability

• Flexibility of Semistructured Data

• Power of Declarative: Automatic Optimization, View Maintenance

• The primary operational and the secondary analytics application out of semistructured, declarative platforms
