SQL++ Query Language


SQL++ Query Language and its use in the FORWARD framework
9/9/2015
With the support of the National Science Foundation, Google, Informatica, Couchbase

Semistructured Databases have Arrived

SQL-on-Hadoop, NoSQL, CQL, PIG, N1QL, Jaql, AQL and others, in the form of JSON…

    { location: 'Alpine',
      readings: [
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      ] }

Semistructured Query Languages come along…

SQL brings expressive power and adoption; JSON brings semistructured data. But they meet in a Tower of Babel…

SQL:
    SELECT AVG(temp) AS tavg
    FROM readings
    GROUP BY sid

MongoDB:
    db.readings.aggregate(
      {$group: {_id: "$sid", tavg: {$avg: "$temp"}}})

Jaql:
    readings -> group by sid = $.sid
      into { tavg: avg($.temp) };

Pig:
    a = LOAD 'readings' AS (sid:int, temp:float);
    b = GROUP a BY sid;
    c = FOREACH b GENERATE AVG(temp);
    DUMP c;

JSONiq:
    for $r in collection("readings")
    group by $r.sid
    return { tavg: avg($r.temp) }

…with limited query abilities…

• Growing out of key-value databases, or of SQL with special-type UDFs
• Far from the full-fledged power of prior non-relational query languages (OQL, XQuery), academic projects being the exception (e.g., ASTERIX, FORWARD)

…and in lieu of formal semantics… query languages were not meant for ninjas.

Outline

• What is SQL++?
• How do we use SQL++ in UCSD's FORWARD?
• Introduction to the formal survey of NoSQL, NewSQL and SQL-on-Hadoop

What is SQL++?

SQL++: a data model + query language for querying both structured data (SQL) and semistructured data (native JSON).

• Formal semantics: adjusting tuple calculus to the richness of JSON
• Schemaless
• Full-fledged, declarative, aggressively backwards-compatible with SQL
• Configurable SQL++: formal configuration options towards choosing appropriate semantics and understanding their repercussions
• Unifying: 11 popular databases/query languages captured by appropriate settings of Configurable SQL++

How is it used in the Big Data industry?

• AsterixDB
• CouchDB heading towards SQL++

How do we use it in UCSD?

• Automatic incremental view maintenance, based on a schemaless SQL++ algebra
• Middleware query optimization & execution: FORWARD virtual database, query plans based on SQL++ algebra

SQL++ Data Model (also JSON++ Data Model)

A superset of SQL and JSON. Extend SQL with arrays + nesting + heterogeneity:

    { location: 'Alpine',
      readings: [
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      ] }

• Array nested inside a tuple
• Heterogeneous tuples in collections
• Arbitrary compositions of array, bag, tuple

…+ permissive structures:

    [ [5, 10, 3],
      [21, 2, 15, 6],
      [4, {lo: 3, exp: 4, hi: 7}, 2, 13, 6] ]

Extend JSON with bags and enriched types:

    { location: 'Alpine',
      readings: {{
        { time: timestamp('2014-03-12T20:00:00'), ozone: 0.035, no2: 0.0050 },
        { time: timestamp('2014-03-12T22:00:00'), ozone: 'm', co: 0.4 }
      }} }

• Bags: {{ … }}
• Enriched types, e.g. timestamp

Exercise: why is the SQL++ data model a superset of the SQL data model?
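To make the bag/array/tuple composition concrete, here is a minimal Python sketch (illustrative only, not part of SQL++ or FORWARD): tuples as dicts, arrays as ordered lists, and bags as a small `Bag` wrapper that allows duplicates but ignores order when comparing. The class name and representation are assumptions for illustration.

```python
class Bag:
    """Unordered collection with duplicates: the {{ ... }} of SQL++."""
    def __init__(self, *elements):
        self.elements = list(elements)

    def __eq__(self, other):
        if not isinstance(other, Bag) or len(self.elements) != len(other.elements):
            return False
        rest = list(other.elements)
        for e in self.elements:       # multiset comparison, order-insensitive
            if e in rest:
                rest.remove(e)
            else:
                return False
        return True

# Heterogeneous tuples in a bag: the two readings carry different attributes.
reading_bag = Bag(
    {"time": "2014-03-12T20:00:00", "ozone": 0.035, "no2": 0.0050},
    {"time": "2014-03-12T22:00:00", "ozone": "m", "co": 0.4},
)
value = {"location": "Alpine", "readings": reading_bag}

# Bags compare regardless of element order; arrays (lists) do not.
assert Bag(1, 2, 2) == Bag(2, 1, 2)
assert [1, 2] != [2, 1]
```

This also suggests the answer to the exercise: a SQL table is just a bag whose tuples all happen to be homogeneous.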
• A SQL table corresponds to a JSON bag of homogeneous tuples.

…+ permissive schema. A SQL++ value may come with a schema, be schemaless, or have a partial schema. The Alpine value above can be described by the partial schema:

    { location: string,
      readings: {{
        { time: timestamp,
          ozone: any,
          * }
      }} }

• With schema
• Or, schemaless
• Or, partial schema

From SQL to SQL++

• Goal: queries that input/output any JSON type, yet are backwards compatible with SQL

Backwards compatibility with SQL. Find sensors that recorded a temperature below 50:

    readings : {{
      { sid: 2, temp: 70.1 },
      { sid: 2, temp: 49.2 },
      { sid: 1, temp: null }
    }}

    SELECT DISTINCT r.sid
    FROM readings AS r
    WHERE r.temp < 50

    Result: {{ { sid: 2 } }}

• Adjust SQL syntax/semantics for heterogeneity and complex input/output values
• Achieved with few extensions, and mostly by removing SQL limitations

From Tuple Variables to Element Variables

Find the highest two sensor readings that are below 1.0:

    readings : [ 1.3, 0.7, 0.3, 0.8 ]

    SELECT r AS co
    FROM readings AS r
    WHERE r < 1.0
    ORDER BY r DESC
    LIMIT 2

    Result: [ { co: 0.8 }, { co: 0.7 } ]

Element variables: each clause consumes and produces a collection of binding tuples.

    FROM readings AS r:
      B_FROM_out = B_WHERE_in = {{ ⟨r: 1.3⟩, ⟨r: 0.7⟩, ⟨r: 0.3⟩, ⟨r: 0.8⟩ }}
    WHERE r < 1.0:
      B_WHERE_out = B_ORDERBY_in = {{ ⟨r: 0.7⟩, ⟨r: 0.3⟩, ⟨r: 0.8⟩ }}
    ORDER BY r DESC:
      B_ORDERBY_out = B_LIMIT_in = [ ⟨r: 0.8⟩, ⟨r: 0.7⟩, ⟨r: 0.3⟩ ]
    LIMIT 2:
      B_LIMIT_out = B_SELECT_in = [ ⟨r: 0.8⟩, ⟨r: 0.7⟩ ]
    SELECT r AS co:
      [ { co: 0.8 }, { co: 0.7 } ]

Correlated Subqueries Everywhere

Find the highest two sensor readings that are below 1.0:

    sensors : [
      [1.3, 2],
      [0.7, 0.7, 0.9],
      [0.3, 0.8, 1.1],
      [0.7, 1.4]
    ]

    SELECT r AS co
    FROM sensors AS s, s AS r
    WHERE r < 1.0
    ORDER BY r DESC
    LIMIT 2

    Result: [ { co: 0.9 }, { co: 0.8 } ]

Step by step:

    FROM sensors AS s, s AS r:
      B_FROM_out = B_WHERE_in = {{
        ⟨s: [1.3, 2], r: 1.3⟩, ⟨s: [1.3, 2], r: 2⟩,
        ⟨s: [0.7, 0.7, 0.9], r: 0.7⟩, ⟨s: [0.7, 0.7, 0.9], r: 0.7⟩, ⟨s: [0.7, 0.7, 0.9], r: 0.9⟩,
        ⟨s: [0.3, 0.8, 1.1], r: 0.3⟩, ⟨s: [0.3, 0.8, 1.1], r: 0.8⟩, ⟨s: [0.3, 0.8, 1.1], r: 1.1⟩,
        ⟨s: [0.7, 1.4], r: 0.7⟩, ⟨s: [0.7, 1.4], r: 1.4⟩ }}
    WHERE r < 1.0:
      B_WHERE_out = B_ORDERBY_in = {{
        ⟨s: [0.7, 0.7, 0.9], r: 0.7⟩, ⟨s: [0.7, 0.7, 0.9], r: 0.7⟩, ⟨s: [0.7, 0.7, 0.9], r: 0.9⟩,
        ⟨s: [0.3, 0.8, 1.1], r: 0.3⟩, ⟨s: [0.3, 0.8, 1.1], r: 0.8⟩, ⟨s: [0.7, 1.4], r: 0.7⟩ }}
    ORDER BY r DESC:
      B_ORDERBY_out = B_LIMIT_in = [
        ⟨s: [0.7, 0.7, 0.9], r: 0.9⟩, ⟨s: [0.3, 0.8, 1.1], r: 0.8⟩,
        ⟨s: [0.7, 0.7, 0.9], r: 0.7⟩, ⟨s: [0.7, 0.7, 0.9], r: 0.7⟩,
        ⟨s: [0.7, 1.4], r: 0.7⟩, ⟨s: [0.3, 0.8, 1.1], r: 0.3⟩ ]
    LIMIT 2:
      B_LIMIT_out = B_SELECT_in = [ ⟨s: [0.7, 0.7, 0.9], r: 0.9⟩, ⟨s: [0.3, 0.8, 1.1], r: 0.8⟩ ]
    SELECT r AS co:
      [ { co: 0.9 }, { co: 0.8 } ]

Utilization of Input Order

AT variables optionally bind the position of each element. Find the highest two sensor readings below 1.0, together with their positions:

    SELECT r AS co, rp AS x, sp AS y
    FROM sensors AS s AT sp, s AS r AT rp
    WHERE r < 1.0
    ORDER BY r DESC
    LIMIT 2

    Result: [ { co: 0.9, x: 3, y: 2 },
              { co: 0.8, x: 2, y: 3 } ]
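The binding-tuple pipeline above can be simulated in plain Python. This is a sketch of the semantics only (not FORWARD's implementation): the FROM clause produces one environment per element, with 1-based positions for the AT variables, and the remaining clauses transform the collection of environments.

```python
sensors = [[1.3, 2], [0.7, 0.7, 0.9], [0.3, 0.8, 1.1], [0.7, 1.4]]

# FROM sensors AS s AT sp, s AS r AT rp: correlated iteration over s
bindings = [
    {"s": s, "sp": sp, "r": r, "rp": rp}
    for sp, s in enumerate(sensors, start=1)
    for rp, r in enumerate(s, start=1)
]

bindings = [b for b in bindings if b["r"] < 1.0]       # WHERE r < 1.0
bindings.sort(key=lambda b: b["r"], reverse=True)      # ORDER BY r DESC
bindings = bindings[:2]                                # LIMIT 2

# SELECT r AS co, rp AS x, sp AS y
result = [{"co": b["r"], "x": b["rp"], "y": b["sp"]} for b in bindings]
print(result)  # [{'co': 0.9, 'x': 3, 'y': 2}, {'co': 0.8, 'x': 2, 'y': 3}]
```

Note that after WHERE the collection is still a bag; ORDER BY turns it into an array, which LIMIT and SELECT preserve, matching the {{ … }} to [ … ] transition on the slides.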
Constructing Nested Results: Subqueries Again

    sensors : {{ {sensor: 1}, {sensor: 2} }}
    logs : {{ {sensor: 1, co: 0.4},
              {sensor: 1, co: 0.2},
              {sensor: 2, co: 0.3} }}

    FROM sensors AS s
    SELECT s.sensor AS sensor,
           ( SELECT l.co AS co
             FROM logs AS l
             WHERE l.sensor = s.sensor ) AS readings

    Result: {{
      { sensor: 1, readings: {{ {co: 0.4}, {co: 0.2} }} },
      { sensor: 2, readings: {{ {co: 0.3} }} }
    }}

The outer FROM produces:

    B_FROM_out = B_SELECT_in = {{ ⟨s: {sensor: 1}⟩, ⟨s: {sensor: 2}⟩ }}

The correlated subquery is evaluated once per binding tuple. For ℾ1 = ⟨s: {sensor: 1}⟩:

    FROM logs AS l:
      B_FROM_out = B_WHERE_in = {{ ⟨l: {sensor: 1, co: 0.4}⟩,
                                   ⟨l: {sensor: 1, co: 0.2}⟩,
                                   ⟨l: {sensor: 2, co: 0.3}⟩ }}
    WHERE l.sensor = s.sensor:
      B_WHERE_out = B_SELECT_in = {{ ⟨l: {sensor: 1, co: 0.4}⟩,
                                     ⟨l: {sensor: 1, co: 0.2}⟩ }}
    SELECT l.co AS co:
      {{ {co: 0.4}, {co: 0.2} }}

Iterating over Attribute-Value Pairs; INNER vs OUTER Correlation (Capturing Outerjoins, OUTER FLATTEN)

    ℾ = ⟨ readings : { co: [],
                       no2: [0.7],
                       so2: [0.5, 2] } ⟩

    FROM readings AS {g:v}, v AS n
    SELECT ELEMENT { gas: g, num: n }

FROM readings AS {g:v} iterates over the attribute-value pairs of the tuple; v AS n is a correlated iteration over each array. With INNER CORRELATE (the default, as above), an attribute whose array is empty produces no binding:

    FROM readings AS {g:v} INNER CORRELATE v AS n

    {{ ⟨g: "no2", v: [0.7], n: 0.7⟩,
       ⟨g: "so2", v: [0.5, 2], n: 0.5⟩,
       ⟨g: "so2", v: [0.5, 2], n: 2⟩ }}

With OUTER CORRELATE, the empty "co" array survives, with n bound to missing:

    FROM readings AS {g:v} OUTER CORRELATE v AS n

    {{ ⟨g: "co", v: [], n: missing⟩,
       ⟨g: "no2", v: [0.7], n: 0.7⟩,
       ⟨g: "so2", v: [0.5, 2], n: 0.5⟩,
       ⟨g: "so2", v: [0.5, 2], n: 2⟩ }}

    Result: [ { gas: "co", num: null },
              { gas: "no2", num: 0.7 },
              { gas: "so2", num: 0.5 },
              { gas: "so2", num: 2 } ]

The SQL++ Extensions to SQL

• Goal: queries that input/output any JSON type, yet are backwards compatible with SQL
• Adjust SQL syntax/semantics for heterogeneity and complex values
• Achieved with few extensions and SQL limitation removals:
  • Element variables
  • Correlated subqueries (in FROM and SELECT clauses)
  • Optional utilization of input order
  • Grouping independent of aggregation functions
  • …
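The INNER vs OUTER distinction above can be sketched in Python. This is an illustrative simulation of the semantics, not SQL++ syntax; missing is rendered as `None`, which serializes to null in JSON, matching the result on the slides.

```python
readings = {"co": [], "no2": [0.7], "so2": [0.5, 2]}

def correlate(readings, outer):
    out = []
    for g, v in readings.items():        # FROM readings AS {g:v}: one binding per attribute
        if not v and outer:              # OUTER CORRELATE: keep the empty array, n is missing
            out.append({"gas": g, "num": None})
        for n in v:                      # correlated inner iteration: v AS n
            out.append({"gas": g, "num": n})
    return out

inner = correlate(readings, outer=False)
outer = correlate(readings, outer=True)
# inner drops the empty "co" attribute entirely;
# outer adds {'gas': 'co', 'num': None} for it, like a left outerjoin.
```

The same asymmetry is exactly why SQL needs OUTER JOIN: without the outer variant, elements with no matching inner rows silently disappear from the result.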