Copyright Big Data Pr Serge Miranda, MBDS, Univ de Nice Sophia Antipolis (UCA)
HBASE storage model : HFILE
➢HFile ➔ on-disk storage unit for a column family (comparable to an SSTable in Cassandra ; the in-memory MemStore plays the role of Cassandra's Memtable)
➢« TABLES » and « ROWS » with « columns »
➢Rows in HBASE ➔ ROW KEYS
➢« Column family » (various columns) ➔ HFILE (Key-value pairs)
➢DB update with WAL protocol (Write-Ahead log protocol)
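The write-ahead idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log is an in-memory list standing in for a durable append-only file, and the names TinyStore, put, crash and recover are hypothetical, not HBase APIs.

```python
class TinyStore:
    """Toy key-value store that logs every update before applying it."""

    def __init__(self):
        self.wal = []        # stands in for the durable, append-only log file
        self.memstore = {}   # in-memory state, lost on a crash

    def put(self, row_key, column, value):
        # 1. Append the intended change to the log first (write-ahead).
        self.wal.append((row_key, column, value))
        # 2. Only then apply it to the in-memory structure.
        self.memstore[(row_key, column)] = value

    def crash(self):
        # Simulate losing everything that was only in memory.
        self.memstore = {}

    def recover(self):
        # Replay the log in order to rebuild the in-memory state.
        for row_key, column, value in self.wal:
            self.memstore[(row_key, column)] = value


store = TinyStore()
store.put("AF100", "flight:DC", "Nice")
store.put("AF100", "flight:AC", "Toulouse")
store.crash()
store.recover()
```

Because the log entry is written before the in-memory update, replaying the log after a crash is enough to restore every acknowledged write.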
CASSANDRA
➢COLUMN-ORIENTED
➢BIG TABLE from Google/Facebook + DYNAMO from Amazon ➔ CASSANDRA
   History : Facebook, then in 2008 DYNAMO developers (from Amazon) and Microsoft, then APACHE in 2009
   DYNAMO : {(KEY,VALUE)} distributed with no centralized control
➢Customers : APPLE, NETFLIX, eBAY, INSTAGRAM… CALL of DUTY (> 100 M gamers)
➢NOSQL DB with JAVA API
➢CQL (SQL-like query language)
➢No predefined schema (until 2014 and CQL3)
➢Each row can have a different set of columns
➢« Column families » (later « tables ») are the storage units for keys (MEMTABLE file)
➢Each column is a triple : NAME (UID possible), VALUE and TIMESTAMP
➢DATA update with WAL protocol (commit log)
CQL3 (since 2014) : Create TABLE…
➢CREATE TABLE < replacing column family > ; no foreign keys
➢CREATE / ALTER / DROP / USE KEYSPACE (database)
➢CREATE TRIGGER, CREATE INDEX ; Ex : CREATE INDEX ON flight (DC)
➢SELECT .. FROM .. WHERE .. ORDER BY with LIMIT / ALLOW FILTERING clauses ; no GROUP BY
Example :
   CREATE TABLE PILOTS (PIL# bigint PRIMARY KEY, PILNAME text, ADDRESS text)
   CREATE TABLE FLIGHT (F# bigint, Date-comment timestamp, author varchar, Content text, PRIMARY KEY (F#, Date-comment))
CQL3 (Sub queries)
➢No query optimization
➢IN : « Semi Join »
➢NOT IN : « anti join »
Generic example : SELECT PILNAME From PILOT JOIN EACH FLIGHT ON Pilot.pil#=flight.pil# and DC =‘Nice’ ;
… and Google Big Query (https://bigquery.cloud.google.com) ?
3 major Google contributions to the Big Data ecosystem :
1. GFS (GOOGLE FILE SYSTEM) ➔ HDFS
2. BIG TABLE (column oriented) ➔ HBASE
3. MAP REDUCE ➔ HADOOP
DREMEL (Google) 2006
➢BIG TABLE (on GFS) ➔ DREMEL (on COLOSSUS)
➢Distributed Query engine (Scale OUT)
➢Column-oriented DB
➢BIGQUERY SQL interactive interface (proprietary) with NEST/UNNEST
➢SCHEMA
➢CLOUD-based approach
➢Note : GMAIL is built on top of DREMEL
➢Parallel execution on thousands of machines < SCALE OUT >
   ➢ > n (100 000 disks)
   ➢ > p (10 000 processors)
➢50 gigabytes/sec with response time < 5 sec
Wikipedia Benchmark with Big Query in 2012
➢2011 : Wikipedia bigquery-sample : wikipedia_benchmark
➢[Examples Bigquery SQL from TIGANI2014]
SELECT language, SUM(views) AS views
FROM [bigquery_samples:wikipedia_benchmark]
WHERE REGEXP_MATCH(title, '.*G.*O.*O.*G.*')
GROUP BY language
ORDER BY views DESC
Counting the word « KING » in SHAKESPEARE's plays with Big Query
(publicdata:samples.shakespeare) < 1.6 GB
SELECT LOWER(word) AS word, word_count AS frequency, corpus
FROM [publicdata:samples.shakespeare]
WHERE corpus CONTAINS 'king' AND LENGTH(word) > 5
ORDER BY frequency DESC
LIMIT 10
Some SQL examples on KEY-VALUE NO SQL DBMS with N1QL (Couchbase), CQL (Cassandra), …
Typical Example :
N1QL :
   SELECT PIL#, PILNAME FROM PILOT WHERE ANY F IN FLIGHT SATISFIES F.DC = 'Nice' END AND ADDR = 'Nice' ;
CQL3 :
   SELECT PIL#, PILNAME FROM PILOT JOIN EACH FLIGHT ON PILOT.PIL# = FLIGHT.PIL# AND DC = 'Nice' AND ADDR = 'Nice' ;
Graph-oriented NO SQL : NEO4J (2000)
➢J : Java (NEO4J was developed in Java)
➢Implementation and management of GRAPHs (Node, Relationship, Property)
➢DATA storage in a directory with REST interface
➢2 specific languages : CYPHER (SQL flavour) and GREMLIN (script language based on Groovy )
➢Example : Node creation with CYPHER
   CREATE n = {pilname : 'serge', ADDR : 'Nice'}
➢Applications : Walmart, Cisco, Twitter
NEO4J
➢START Serge=node(1)
   CREATE flight100 = {f# : 'IT100', DC : 'Nice', AC : 'Toulouse'}, Serge-[r:insure]->flight100
   RETURN flight100
➢START Serge=node(1)
   MATCH Serge-[r:insure*]->node(2)
   RETURN r
*indefinite number of hierarchical paths
NEO4J (MATCH clause)
➢Match clause :
   START flight=node:flight(DC='Nice')
   MATCH passenger-[:LIKE]->flight
   RETURN passenger
➢Start (query) :
   START pilot=node:pilot(pilname='serge')
   RETURN pilot
➢Match with an index hint :
   MATCH (p:pilot)
   USING INDEX p:pilot(ADDR)
   WHERE p.pilname = 'serge'
   RETURN p
Towards GQL (Graph Query Language)*
➢ for graph-based NO SQL systems
➢There are planned extensions to SQL for graph queries
➢Neo4j with Oracle, Microsoft, IBM and SAP with Cypher
➢The property graph data model is a superset of the tabular SQL model ➔ a graph query language, GQL, that complements SQL
➢*Standard proposed to the SQL Committee, Alastair Green, 30 May 2018 https://db-engines.com/en/blog_post/78
Benefits of Fusing Three Languages into One Standard GQL (https://db-engines.com/en/blog_post/78)
➢There are three existing pure graph languages which use a shared « graph pattern » for inserting, updating or extracting data from a property graph (comparison in gql.today)
➢PGQL comes from Oracle PGX (first appearing in 2016)
➢openCypher started out in Neo4j's graph database in 2011 and is now used in other commercial products
➢G-CORE is a research language, described in a SIGMOD 2018 paper (LDBC Query Language task force)
Towards GQL initiating an industry standard graph query language
GQL Manifesto (gql.today) :
SELECT FROM GRAPH MATCH < Graph Pattern : sub graph> WHERE
Example with GQL
Query : Names of the pilots who insure a flight from Nice with a Boeing 747 ?
SELECT pl.name AS pilot_name
FROM GRAPH pilotflightsplanes
MATCH // graph pattern
(pl:Pilot)-[:INSURES]->(f:Flight)<-[:IS_USED]-(p:Plane)
WHERE p.name = 'B747' AND f.DC = 'Nice' ;
➢The pattern means that all data in the graph that matches the sequence of nodes and edges (each of which has a particular « label » or element type) will be identified.
➢This operation lifts a « sub-graph » or a « projected graph » of flights for a particular pilot into the application.
➢Properties on all instances of :Pilot nodes or :INSURES edges that match can now be read by the application.
NEW SQL
« Replacing real SQL ACID with either no ACID or « ACID lite » just pushes consistency problems into the applications where they are far harder to solve. Second, the absence of SQL makes queries a lot of work » M.Stonebraker (VoltDB, 2011)
➢BRIDGES between SQL and NO SQL
   ➢New DBMS (main-memory DBMS, etc.) ex : VOLTDB
   ➢Integration of NO SQL access into SQL DBMS, ex : Oracle, IBM, Microsoft, Teradata…
      ➢« EXTERNAL TABLES » (for external NO SQL data stores) in FROM clause
      ➢HIVE driver (for OSS approach)
« From NO SQL to NEW SQL »
« NEW SQL » (on top of SQL) :
➢VoltDB by M. Stonebraker, REDIS, MYSQL, Scale DB, Clustrix, AKIBAN, NUODB (NimbusDB)
➢ and TERADATA BIG DATA, Oracle BIG DATA, BIG QUERY SQL (Google), Microsoft BIG DATA, IBM Big Data…
« Future is polyglot persistence and POLYSTORES »
Architecture of VOLTDB (2009, M. Stonebraker)
NO BUFFER POOL, NO LOCKING (timestamping)
NO WAL (no DATA log : stored procedure log)
NO threading overhead : single threaded
➢No shared data
➢Main memory divided per core
➢Open Source, shared-nothing architecture (cf SMP & LAN)
➢1 terabyte of data ➔ one cluster of 30 nodes with 30 GB per node
➢Throughput (tps) 4 orders of magnitude higher than 25 years ago ! (1000 tps in 1990)
« NEW SQL »* with Oracle (on Oracle 12C ; only with EXADATA)
➢EXTERNAL TABLES
➢HIVE driver
➢JSON documents accessed via SQL
* « Unified query for BIG DATA management » Oracle White Paper, January 2015
« SQL is the lingua franca/esperanto for data management » Oracle 2015
« An object relational Mapper (ORM) can access SQL and NO SQL/Hadoop simply by adding object relations to its existing data stores »
Source:http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf
ORACLE (Oracle NoSQL Database)
Source : https://docs.oracle.com/cd/E26161_02/html/AdminGuide/introduction.html
ORACLE (Big Data SQL for Oracle NoSQL Database) with HIVE driver and external tables
Source : https://blogs.oracle.com/NoSQL/entry/bigdata_sql_with_oracle_nosql
« Language federation » approach and « EXTERNAL TABLES »
Query franchising and smart SCAN with external tables
Teradata
« Now you can benefit from the power of MAP REDUCE with the ease of use of SQL. Before, with Hadoop, users were the administrators »
Stephen Brobst, Teradata CTO, (Oct 2012)
Teradata
➢UDA : Unified Data Architecture with Hadoop integration
   ➢HDFS (Hadoop Distributed File System), with SQL
   ➢HCatalog, an open source metadata framework developed by Hortonworks
   ➢« SQL-H » makes it possible to analyze data stored in HDFS using SQL
➢ASTER, from Teradata :
   ➢SQL-Map Reduce, which integrates MAP REDUCE functions within SQL
   ➢Teradata-Aster Big Analytics
SQL-H /Map-Reduce (SQL/MR)*
➢Set of built-in UDF/UDT (User-defined TABLES)
➢2 main functions to enable parallelism
   ➢Row Function (as a mapper) : performs row-level transformations and processing
   ➢Partition Function (sharding) (as a reducer) : performs treatment on each group of rows defined by the same PARTITION KEY clause
➢To invoke SQL Map Reduce functions using SQL through ASTER DB
➢GROUP BY with complex functions
➢Proprietary ML packages (time series analysis,..)
Syntax :
   SELECT ...
   FROM function_name (
      ON table-or-query
      [PARTITION BY expression, ...]
      [ORDER BY expression]
      [clausename (arg, ...), ...]
   )
*[Eric Friedman et al, 2009]
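The two kinds of SQL/MR functions can be mimicked in plain Python to show the division of labour. This is a sketch under stated assumptions, not Aster code: row_function and partition_function are hypothetical helper names, and the flight hours are made-up sample data.

```python
from collections import defaultdict


def row_function(rows, transform):
    """Mapper-like step: transform each row independently of the others."""
    return [transform(row) for row in rows]


def partition_function(rows, partition_key, order_key, reducer):
    """Reducer-like step: group rows by key, sort each group, reduce it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    return {
        key: reducer(sorted(group, key=lambda r: r[order_key]))
        for key, group in groups.items()
    }


# Made-up sample rows (the "hours" values are illustrative only).
flights = [
    {"pilot": "Serge", "flight": "AF102", "hours": 2.0},
    {"pilot": "Peter", "flight": "AF110", "hours": 1.5},
    {"pilot": "Serge", "flight": "AF100", "hours": 1.0},
]

# Row function: a per-row transformation (hours to minutes).
in_minutes = row_function(
    flights, lambda r: {**r, "minutes": int(r["hours"] * 60)}
)

# Partition function: PARTITION BY pilot, ORDER BY flight, then aggregate.
per_pilot = partition_function(
    in_minutes,
    "pilot",
    "flight",
    lambda group: {
        "flights": [r["flight"] for r in group],
        "minutes": sum(r["minutes"] for r in group),
    },
)
```

Each row can be transformed on any node, while each partition must be gathered in one place before it is reduced, which is why the partition key doubles as the sharding key.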
Big Data Polybase for Microsoft
➢SQL SERVER encompasses Hadoop
➢Excel interface with Hadoop
➢Sqoop
➢Mahout (data mining for Hadoop)
➢POLYBASE with Hadoop and Azure
BIG DATA with Blu Acceleration for IBM
Source : http://www.redbooks.ibm.com/redbooks/pdfs/sg248212.pdf
BIG DATA properties : « WHAT ! »
➢Web DATA : « semi-structured data », « Open Data », XML, « Linked DATA » / « Semantic web » (RDF paradigm, SPARQL, OWL), Triple Data Store
➢Hadoop/Hive : unstructured data on open source platforms (Hbase)/map-reduce (and KEY-VALUE paradigm)
➢Analytics orientation (ML, DL)
➢Real Time data
Big Data Systems ! (Aslett, 2013)
BIG DATA Management SYSTEMS and data paradigms
[Figure : data models and paradigms of Big Data management systems. VALUE paradigm (Codd's relational data model ; SQL2, NEW SQL) : TIPS. POINTER-VALUE paradigm (SQL3) and OBJECT-VALUE paradigm (ODMG) (object data model) : RICE. PREDICATE-VALUE paradigm (RDF, Semantic web ; SPARQL, OWL) and KEY-VALUE paradigm (Map Reduce ; N.O. SQL) : WHAT.]
Towards BIG SQL and interactive real-time analytics (ML & DL) !
Some key SQL3 extensions
➢« External table » for NO SQL data store
➢« PARTITION clause » for map reduce & SHARDING
➢« Match clause » for sub graphs (Cypher and GQL)
➢NEST/UNNEST (N1QL)
➢« LIMIT/OFFSET » from < UnQL, 2011 > ; Couchbase and SQL
   https://www.couchbase.com/press-releases/unql-query-language
   http://unql.sqlite.org/index.html/wiki?name=UnQL+Syntax+Notes
UnQL query syntax :
   SELECT optional-expression (JSON object)
   WHERE expression
   GROUP BY exp list
   HAVING expression
   ORDER Exp List
   LIMIT expression
   OFFSET expression
Towards « BIG SQL »
SELECT
FROM {T/Tables}, {V/Views} < SQL2 ; Create Table.., Create View,.. >
     {SQL query} < SQL3 >
     {EXTERNAL TABLES} < N.O.SQL DB > < Oracle, IBM, Microsoft, Informix, Sybase/SAP, MySQL,.. >
     {GRAPH} < GQL >
WHERE like NEST/UNNEST, UDF/UDT, MAP/REDUCE
PARTITION BY like LIMIT/OFFSET, PIVOT/UNPIVOT
MATCH
GROUP BY / HAVING GROUPING SETS with CUBE, ML & DL operators
Unifying THEORY for BIG SQL (and polystore or MULTI-MODEL)
« One of the main arguments… was the industry needs a common query language and data model to feed the ecosystem for key-value stores... We are looking forward to working with other industry leaders in the NoSQL space on taking the design to the next level. » Erik Meijer, Microsoft Research, CO-SQL in CACM 2011
« An effective mathematical model that encompasses the concepts of SQL, NoSQL and NewSQL would enable their interoperability » Jeremy Kepner (MIT, 2016)
Cf Following Research SEMINAR on that hot topic whose framework is summarized here
SQL, RDF, NoSQL and NewSQL on an example
➢SQL : a set of rows within a table < STRUCTURED DATA >
FLIGHT2
F#    | PilotName | PlaneN
AF100 | Serge     | AirbusA320
AF110 | Peter     | B747
AF102 | Serge     | B747
SELECT * From FLIGHT2 WHERE Pilotname=‘Serge’;
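The FLIGHT2 query can be tried with Python's built-in sqlite3 module. This is a minimal sketch: the column F# is renamed F here (an assumption made because # is not valid in an unquoted SQL identifier), and an ORDER BY is added only to make the result order deterministic.

```python
import sqlite3

# In-memory database holding the FLIGHT2 rows from the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE FLIGHT2 (F TEXT PRIMARY KEY, PilotName TEXT, PlaneN TEXT)"
)
conn.executemany(
    "INSERT INTO FLIGHT2 VALUES (?, ?, ?)",
    [
        ("AF100", "Serge", "AirbusA320"),
        ("AF110", "Peter", "B747"),
        ("AF102", "Serge", "B747"),
    ],
)

# Same selection as on the slide, with a bound parameter for the pilot name.
rows = conn.execute(
    "SELECT * FROM FLIGHT2 WHERE PilotName = ? ORDER BY F", ("Serge",)
).fetchall()
conn.close()
```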
SEMI-STRUCTURED DATA in RDF
TRIPLEs (OBJECT-PREDICATE-VALUE)
➢To describe WEB resources
   (:Serge :insureflight :AF100)
   (:Peter :insureflight :AF110)
   (:AIRBUSA320 :isusedinflight :AF100)
   (:Paul :ispasengeroflight :AF100)
   …
➢Note : One RDF triple is a fact in 1st-order predicate logic : P(S,O) with P : Predicate, S : Subject and O : Object
   Example : Insureflight (Serge, AF100)
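Reading a triple as a fact P(S,O) can be illustrated with a tiny in-memory triple store. This is a sketch only: holds and match are hypothetical helper names, and a real RDF store would use IRIs and a query language such as SPARQL.

```python
# The four triples from the slide, stored as (subject, predicate, object).
triples = {
    ("Serge", "insureflight", "AF100"),
    ("Peter", "insureflight", "AF110"),
    ("AIRBUSA320", "isusedinflight", "AF100"),
    ("Paul", "ispasengeroflight", "AF100"),
}


def holds(predicate, subject, obj):
    """True when the fact predicate(subject, obj) is in the store."""
    return (subject, predicate, obj) in triples


def match(subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


# Everything known about flight AF100, whatever the subject or predicate.
about_af100 = match(obj="AF100")
```

Here holds("insureflight", "Serge", "AF100") plays the role of the fact Insureflight (Serge, AF100), and match with wildcards behaves like a one-pattern graph query.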
RDFS graph (Example)
[Figure : RDFS graph linking the resources :Serge, Paul, :AF100, AF102 and AIRBUS A320 through the predicates :insureflight, :isusedinflight, :ispasengeroflight and :drivesplane]
NoSQL graph for unstructured data
[Figure : graph with nodes Serge, Peter, AF100, AF102, AF110, A320 and B747]
NewSQL matrix
[Figure : matrix relating flights AF100, AF110, AF102 to pilots Serge, Peter ; labels MT, V and MTV]
Need for a simple theory to unify the major data types of a big data system
➢4 types of DATA with 3 theoretical frameworks
   ➢STRUCTURED (SQL) and SET theory (of VALUES/DATA)
   ➢SEMI STRUCTURED (SPARQL, OWL) : GRAPH THEORY (inferences)
   ➢UNSTRUCTURED (NOSQL) : GRAPH THEORY
   ➢NEW SQL and MATRICES (linear algebra)
   ➢NOTE : DATA SCIENCE (ML and DL) and MATRICES management
➢Some formal unifying proposals
   ➢Associative array from MIT 2016 (KEPNER 2016)
      (KEPNER 2016) Jeremy Kepner et al. : « Associative array model of SQL, NoSQL and NewSQL Data bases » < MIT CS and AI laboratory, 2016 >
   ➢Another attractive theory both in terms of simplicity and implementation : CATEGORY theory (MEIJ2011)
      (MEIJ2011) « A co-Relational Model of Data for Large Shared Data Banks » Erik Meijer and Gavin Bierman, Microsoft Research, CACM 2011
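The associative-array view can be sketched with a sparse dictionary keyed by (row, column). This is an illustration only and not Kepner's library: transpose and matvec are hypothetical helpers, and the flight/pilot incidence data reuses the FLIGHT2 example.

```python
def transpose(a):
    """Swap row and column keys of a sparse associative array."""
    return {(c, r): v for (r, c), v in a.items()}


def matvec(a, v):
    """Sparse product: result[r] = sum over c of a[r, c] * v[c]."""
    out = {}
    for (r, c), val in a.items():
        if c in v:
            out[r] = out.get(r, 0) + val * v[c]
    return out


# Incidence array M: rows are flights, columns are pilots.
m = {("AF100", "Serge"): 1, ("AF110", "Peter"): 1, ("AF102", "Serge"): 1}

# Vector v selecting two flights.
v = {"AF100": 1, "AF102": 1}

# M^T v: for each pilot, how many of the selected flights they insure.
pilots = matvec(transpose(m), v)
```

The same data therefore answers a table-style selection, a graph-style neighbourhood query and a matrix product, which is the unification argument in a nutshell.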
Mathematics underlying big data management
DATA type | Paradigm | Properties | Data model | Math theory | Data structures | Data ops | REF
Structured data (TRANSACTION oriented) | VALUE | TIPS | Codd's RM | SET | Relation/TABLE | Relational algebra | (Codd70)
Structured data (TRANSACTION oriented) | POINTER/VALUE | RICE | Object RM (DATE's 3rd Manifesto) | GRAPH | NF2, CLASS | | (DATE95)
Semi-structured DATA (SEARCH oriented) | PREDICATE/VALUE | WHAT | RDF data model | GRAPH | CLASS | | (RDF 98)
Unstructured DATA (analytics & Graph oriented) | KEY/VALUE | WHAT | Key/blob, Key/doc, Key/column | GRAPH | CLASS & DOCUMENT | NOMAD ALGEBRA (NA), Associative Array (AA) algebra | (Chang08)
NEWSQL (analytics oriented) | VALUE & Graph | | RM | (sparse) MATRICES | TABLES | NA & AA | (Cattel 10)
Polystore | / | / | D4 model | ARRAYS | arrays | AA | (Duggan15)
Big Questions ?
A new data world is in our hands/feet !
Short Bibliography on Big Data
➢Rudi Bruchez, « Les bases de données NO SQL et le Big Data », Eyrolles, 2015
➢P. Lemberger et al., « BIG DATA et Machine learning », Dunod, 2016
➢JL Hainaut, « Bases de Données », Dunod, 2015
< In English >
➢Dan McCreary, Ann Kelly, « Making sense of NO SQL », Manning, 2014
➢Jordan Tigani, Siddartha Naidu, « Google BigQuery Analytics », Wiley, 2014 (510 pages)
➢Ian Davis, « 30 Minute Guide to RDF and Linked Data », 2009, SlideShare
➢Mike Stonebraker, « New SQL : An Alternative to NoSQL and Old SQL for New OLTP Apps », ACM, June 2011
➢S. Miranda, « Systèmes d'information Mobiquitaires », Revue RTSI, Sept 2011 and « THE ART AND SCIENCE OF BIG DATA » (2019)
➢W. Chu (Editor), « Data mining and knowledge Discovery for big data », Springer, 2014
➢F. Provost, T. Fawcett, « DATA SCIENCE for Business », O'Reilly, 2013
Research seminar : towards a unifying theory for BIG DATA management (and data lakes)
« Everything in Big Data management is a CATEGORY »
➢ Erik Meijer and Gavin Bierman « A co-Relational Model of Data for Large Shared Data Banks », Microsoft Research, CACM 2011 ➢ M. Fokkinga, « SQL versus coSQL — a compendium to Erik Meijer’s paper » 2012.
➢ Jeremy Kepner et al. : « Associative array model of SQL, NoSQL and NewSQL Data bases »
➢ J. Kepner, J. Chaidez, V. Gadepally, and H. Jansen, « Associative arrays : Unified mathematics for spreadsheets, databases, matrices, and graphs », arXiv preprint arXiv:1501.05709, 2015.
➢ Gaetan Lescouflair, PH-D 2019, University of Nice Sophia Antipolis (MBDS) ➢ J.Lu et al “Multi model databases and highly integrated polystores” Tutorial CIKM 2018
Extra slides
Data integration (Virtual DATA LAKE)
Solution | Query language | Data model | Target schema | Bridge query | Data sources
BigIntegrator [Zhu & Rish 2011] | SQL-like | Relational | LAV | Datalog | RDBMS and Bigtable (GQL)
SQL/MFR [Bondiombouy & al 2015] | SQL-like | Relational | GAV | MFR | RDBMS and Distributed Processing Framework (MapReduce and Spark)
NoSQL/Access path mapping [Curé & al 2011] | SQL | Relational | GAV | BQL | RDBMS, NoSQL
FORWARD Middleware/SQL++ [Ong & al 2014] | SQL++ | JSON based | GAV | - | RDBMS, NoSQL, NewSQL, SQL-on-Hadoop
CloudMdsQL [Kolev & al 2016] | SQL-like | JSON based | Schema-less | - | RDBMS, NoSQL, HDFS
Integrating R in Data Systems
Solution | Execution model | External structure | Parallel process technique | Bridge | Execution platform
SQL Server + R [Berral & Poggi 2016] | In-DB | - | - | Stored procedure | Microsoft SQL Server
Spark R [Venkataraman & al 2016] | Cluster | Distributed Data frame | partitioned execution | Socket server based on Netty | Spark Workers
Big R [Yejas & al 2014] | Cluster | proxy to an in-memory Data frame, Vector and List | MPP, partitioned execution | JaQL | IBM InfoSphere BigInsight (Hadoop version)
Rhipe [Oancea & Dragoescu 2014] | Cluster | - | MapReduce | Protocol Buffer | Hadoop
RHadoop [Oancea & Dragoescu 2014] | Cluster | - | MapReduce | R Wrapper to Hadoop Streaming | Hadoop Framework
Pre-processing for ML (DATA LAKE) or temporary data storage (Velocity) : relational data model
➢RELATIONAL ALGEBRA
➢MATRIX ALGEBRA (ex : interface with R language)
➢WORKFLOW with different technologies/models
➢Programming patterns (like MAP REDUCE)
➢DATA distribution on parallel processors (instead of code distribution)
➢MAP REDUCE programs are application dependent
BIG DATA in the CLOUD
➢BIG QUERY from Google
➢AZURE SQL from Microsoft
➢TERADATA
➢IMPALA (Cloudera) with OLAP cube and Big Data compatibility
➢REDSHIFT (Amazon), based on Postgres
➢PRESTO (Facebook)
➢DRILL (open source version of... Dremel)
➢SNOWFLAKE…
Mathematical entities of a category (wikipedia)
A category C consists of the following mathematical entities :
➢A class ob(C), whose elements are called objects
➢A class hom(C), whose elements are called morphisms or maps or arrows.
   ➢Each morphism f has a source object a and target object b.
   ➢The expression f : a → b would be verbally stated as "f is a morphism from a to b"
   ➢The expression hom(a, b), alternatively expressed as hom_C(a, b), mor(a, b), or C(a, b), denotes the hom-class of all morphisms from a to b.
➢A binary operation ∘, called COMPOSITION of morphisms, such that for any three objects a, b, and c, we have hom(b, c) × hom(a, b) → hom(a, c). The composition of f : a → b and g : b → c is written as g ∘ f or gf, governed by two axioms :
   ➢ASSOCIATIVITY : if f : a → b, g : b → c and h : c → d then h ∘ (g ∘ f) = (h ∘ g) ∘ f
   ➢IDENTITY : for every object x, there exists a morphism 1_x : x → x called the identity morphism such that for every morphism f : a → b, we have 1_b ∘ f = f = f ∘ 1_a.
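The two axioms can be spot-checked in Python by treating types as objects and functions as morphisms. This is a sketch on a handful of sample inputs, not a proof; compose and identity are helper names introduced here for illustration.

```python
def compose(g, f):
    """Return g ∘ f, i.e. apply f first, then g."""
    return lambda x: g(f(x))


def identity(x):
    """Identity morphism: returns its argument unchanged."""
    return x


f = lambda n: n + 1      # f : int -> int
g = lambda n: str(n)     # g : int -> str
h = lambda s: s + "!"    # h : str -> str

samples = range(-3, 4)

# ASSOCIATIVITY: h ∘ (g ∘ f) and (h ∘ g) ∘ f agree on every sample.
assoc_ok = all(
    compose(h, compose(g, f))(x) == compose(compose(h, g), f)(x)
    for x in samples
)

# IDENTITY: 1 ∘ f = f = f ∘ 1 on every sample.
ident_ok = all(
    compose(identity, f)(x) == f(x) == compose(f, identity)(x)
    for x in samples
)
```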
MORPHISMS (arrows in Category)
➢Relations among morphisms (such as fg = h) are often depicted using commutative diagrams, with "points" (corners) representing objects and "arrows" representing morphisms.
➢Morphisms can have any of the following properties. A morphism f : a→b is a :
➢monomorphism (or monic) if f ∘ g1 = f ∘ g2 implies g1 = g2 for all morphisms g1, g2 : x → a.
➢epimorphism (or epic) if g1 ∘ f = g2 ∘ f implies g1 = g2 for all morphisms g1, g2 : b → x.
➢bimorphism if f is both epic and monic.
➢isomorphism if there exists a morphism g : b → a such that f ∘ g = 1_b and g ∘ f = 1_a.
➢endomorphism if a = b. end(a) denotes the class of endomorphisms of a.
➢automorphism if f is both an endomorphism and an isomorphism. aut(a) denotes the class of automorphisms of a.
➢retraction if a right inverse of f exists, i.e. if there exists a morphism g : b → a with f ∘ g = 1_b.
➢section if a left inverse of f exists, i.e. if there exists a morphism g : b → a with g ∘ f = 1_a.
FUNCTORS (Category)
➢FUNCTORS are structure-preserving morphisms between categories.
➢A (covariant) functor F from a category C to a category D, written F : C → D, consists of :
   ➢for each object x in C, an object F(x) in D ; and
   ➢for each morphism f : x → y in C, a morphism F(f) : F(x) → F(y),
➢such that the following two properties hold :
   ➢For every object x in C, F(1_x) = 1_F(x) ;
   ➢For all morphisms f : x → y and g : y → z, F(g ∘ f) = F(g) ∘ F(f).
➢A contravariant functor F : C → D is like a covariant functor, except that it "turns morphisms around" ("reverses all the arrows").
➢More specifically, every morphism f : x → y in C must be assigned to a morphism F(f) : F(y) → F(x) in D. In other words, a contravariant functor acts as a covariant functor from the opposite category C^op to D.
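A familiar covariant functor is "list of" : it sends a type x to lists of x and a function f to "map f over the list". The sketch below spot-checks the two functor properties on one sample list; fmap, compose and identity are illustrative helper names, not library functions.

```python
def fmap(f):
    """Lift f : x -> y to F(f) : list[x] -> list[y]."""
    return lambda xs: [f(x) for x in xs]


def compose(g, f):
    """Return g ∘ f, i.e. apply f first, then g."""
    return lambda x: g(f(x))


def identity(x):
    return x


f = lambda n: n * 2
g = lambda n: n + 1
xs = [1, 2, 3]

# F(1_x) = 1_F(x): mapping the identity leaves the list unchanged.
preserves_identity = fmap(identity)(xs) == identity(xs)

# F(g ∘ f) = F(g) ∘ F(f): mapping a composite equals composing the maps.
preserves_composition = (
    fmap(compose(g, f))(xs) == compose(fmap(g), fmap(f))(xs)
)
```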
« NATURAL TRANSFORMATION » (Category)
Note : Historical rationale for CATEGORY !➔ Functors
A natural transformation is a relation between two functors. Functors often describe "natural constructions" and natural transformations then describe "natural homomorphisms" between two such constructions. Sometimes two quite different constructions yield "the same" result; this is expressed by a natural isomorphism between the two functors.
If F and G are (covariant) functors between the categories C and D, then a natural transformation η from F to G associates to every object X in C a morphism η_X : F(X) → G(X) in D such that for every morphism f : X → Y in C, we have η_Y ∘ F(f) = G(f) ∘ η_X; this means that the following diagram is commutative:
[Figure : commutative square with F(f) : F(X) → F(Y) on top, G(f) : G(X) → G(Y) at the bottom, and η_X, η_Y as the vertical arrows]
The two functors F and G are called naturally isomorphic if there exists a natural transformation from F to G such that η_X is an isomorphism for every object X in C.
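A small example of a natural transformation is "first element, if any", going from the list functor F to a "maybe" functor G (a value or None). The sketch below spot-checks the naturality condition on a few lists; list_map, maybe_map and safe_head are illustrative names chosen here, and None is assumed never to appear as a list element.

```python
def list_map(f):
    """F(f): apply f to every element of a list."""
    return lambda xs: [f(x) for x in xs]


def maybe_map(f):
    """G(f): None stays None, any other value is transformed by f."""
    return lambda m: None if m is None else f(m)


def safe_head(xs):
    """Component η_X : list[X] -> maybe[X]."""
    return xs[0] if xs else None


f = lambda n: n * 10
cases = [[], [1], [4, 5, 6]]

# Naturality: η_Y ∘ F(f) = G(f) ∘ η_X on every sample list.
natural = all(
    safe_head(list_map(f)(xs)) == maybe_map(f)(safe_head(xs))
    for xs in cases
)
```

Taking the head and then transforming gives the same result as transforming every element and then taking the head, which is exactly what the commutative square states.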