Hadoop (Map Reduce) ?


Hadoop (Map Reduce) ?
« From data bases to big data » (7 lectures)
MBDS graduate course
Professor Serge Miranda, Dept of Computer Science, University of Nice Sophia Antipolis (member of Université Côte d'Azur, UCA)
Director of MBDS Master degree (www.mbds-fr.org)

BIG DATA : N.O. SQL and NEW SQL (Lecture 7)

BIG DATA management systems
➢ TOP DOWN approach for structured and semi-structured DATA
➢ SQL2, SQL3, ODMG
➢ Semantic Web (SPARQL, OWL)
➢ BOTTOM UP approach for UNSTRUCTURED DATA
➢ N.O. SQL (NOT ONLY SQL)
➢ NEW SQL

Copyright Big Data, Pr Serge Miranda, MBDS, Univ de Nice Sophia Antipolis (UCA)

Bottom up approach for unstructured data (no schema, no metadata)
➢ « N.O. SQL » (Not Only SQL) <meaning NO Relational>
➢ « KEY/VALUE paradigm »
➢ GRAPH paradigm
➢ « NEW SQL »
➢ « SQL paradigm »

« COMPLEX » data : SQL3, N.O. SQL and NEW SQL? (MIRA2013)

PROCESSING   Structured data                   Unstructured data
             (Top Down, schema)                (Bottom Up, no schema, no metadata)
SQL          OR-DBMS (SQL3), OO-DBMS (ODMG)    NEW SQL
no SQL                                         N.O. SQL

« BASE » properties
➢ BASE:
➢ Basically
➢ Available
➢ Scalable (OUT)
➢ Eventually consistent (final consistency)
➢ Replica consistency; cross-node consistency
➢ CAP theorem (Eric Brewer, Prof. Berkeley, 2000 & 2012; formalized by Gilbert and Lynch, MIT, 2002)
➢ Consistency and Availability (the SQL side)
➢ Partition tolerance (the NO SQL side)

CAP Theorem : « Pick 2 ! » (Brewer 2000 ; 2012)

N.O. SQL (Not Only SQL) (1998) — 4 « no »:
1. no SCHEMA (schema-less; variability) and NO METADATA
2. no RELATIONAL / NO JOIN (extract data without joins)
3. no DATA FORMAT (graph, document, row, column)
4. no (ACID) transactions (CAP theorem; BASE)
+ VOLUME, VELOCITY, VARIETY (and VALUE)…

2 complementary approaches for big data management

                   SQL                              N.O. SQL
Data               Structured (schema)              Unstructured (no schema)
Volume & Variety   Tera/Peta bytes                  Exa/Zetta++ bytes
Velocity           No                               Yes
Transactions       Yes (ACID and Gray's theorem)    No (BASE and CAP theorem)
Scalability        UP (scale up)                    OUT (scale out)
User interface     Ad hoc queries, JOIN,            Predefined queries, NO JOIN,
                   transaction oriented             decision oriented
Standards          SQL3/ODMG                        Not yet (BIG SQL)
Typical approach   Top down (predefined schema)     Bottom up (no schema)
Administrator      Yes                              No
Vendor support     Yes                              No (open source)

N.O. SQL and Web actors
➢ Google: Map Reduce, BigTable & BIG QUERY SQL (Data/Analytics as a Service)
➢ Yahoo!: Hadoop, S4
➢ Amazon: Dynamo, S3
➢ Facebook: Cassandra, Hive
➢ Twitter: Storm, FlockDB
➢ LinkedIn: Kafka, SenseiDB, Voldemort, etc.

Taxonomy of BIG DATA Systems

« N.O. SQL » DBMS — 4 data paradigms: 3 KEY-VALUE oriented and one GRAPH oriented
➢ KEY-VALUE with BLOBs (Binary Large Objects), ex: Hadoop, Cassandra, Riak, Redis, DynamoDB, BerkeleyDB, etc.
➢ → hashing arrays (no query engine)
➢ KEY-VALUE with JSON/XML documents, ex: MongoDB, CouchDB, etc.
➢ JSON simpler than XML, with a JavaScript interface
➢ <KEY, VALUE> model with VALUE in JSON (BSON, XML) for documents
➢ KEY-VALUE with COLUMNS, ex: HBase, Cassandra, BigTable (Google), …
➢ <KEY, (SET of columns, VALUE, TIMESTAMP)>
➢ GRAPH oriented, ex: Neo4j, OrientDB, …: towards GQL (Graph Query Language)

KEY-VALUE NOSQL and SQL convergence
➢ (KEY, VALUE) pairs
➢ (primary key, relational TUPLE)
➢ like (OID, VALUE) for OBJECTS
➢ Hashing tables for access
In SQL: CREATE TABLE PAIR (KEY VARCHAR PRIMARY KEY, VALUE BLOB)
➢ 4 basic operators:
➢ INSERT / DELETE / UPDATE a pair
➢ FIND the value for a key
➢ Ex: Cassandra, Redis, Voldemort, Memcached, Riak, Dynamo (Amazon), CACHE (InterSystems), CouchDB, BIG TABLE, Berkeley DB, …

REST (REpresentational State Transfer)
➢ 2 communication modes for client-server:
➢ RPC (Remote Procedure Call), which is connection oriented (TCP)
➢ REST <Representational State Transfer>, which is service oriented
➢ REST is based upon HTTP
➢ DATA access facility
➢ 6 REST methods: GET, HEAD, PUT, POST, DELETE, OPTIONS
➢ RESTful NO SQL systems: CouchDB, HBase, Neo4j, Riak, …

SCALE OUT & SHARDING (fragmentation)
➢ Distributed DATA partitioning for parallel processing
➢ SHARD KEY: key for data partitioning

HADOOP ecosystem for the 1st « V » of BIG DATA (VOLUME)
HADOOP ecosystem around HDFS and MAP REDUCE: PIG LATIN* (script) developed by Yahoo, HIVE (data warehouse) by Facebook (HiveQL), …
* PIG Latin offers SQL operators (Join, Group By, Union) but a procedural approach for batch; as in HIVE, UDFs (user-defined functions) are possible

Hadoop (Map Reduce) ?
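The key-value convergence slide above lists a hashing table and four basic operators (INSERT / DELETE / UPDATE a pair, FIND the value for a key). A minimal in-memory sketch in Python — the class and method names are illustrative, not taken from any of the systems cited:

```python
# Minimal key-value store: a hashing table (Python dict) plus the
# four basic operators from the slide. Illustrative sketch only.

class PairStore:
    def __init__(self):
        self._pairs = {}                 # hashing table: key -> value (blob)

    def insert(self, key, value):
        if key in self._pairs:           # KEY behaves like a primary key
            raise KeyError(f"duplicate key: {key}")
        self._pairs[key] = value

    def update(self, key, value):
        if key not in self._pairs:
            raise KeyError(f"unknown key: {key}")
        self._pairs[key] = value

    def delete(self, key):
        del self._pairs[key]

    def find(self, key):
        return self._pairs.get(key)      # None if the key is absent

store = PairStore()
store.insert("pilot:1", b"Serge")
store.insert("pilot:2", b"Leo")
store.update("pilot:2", b"Leonard")
store.delete("pilot:1")
print(store.find("pilot:2"))   # b'Leonard'
print(store.find("pilot:1"))   # None
```

Like the SQL PAIR table on the slide, there is no query engine here: the only access path is by key.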
➢ OPEN SOURCE (Apache Foundation), written in JAVA
➢ HADOOP (Map-Reduce implementation): created by Doug Cutting at Yahoo (then open source)
➢ HDFS
➢ HBASE: column-oriented key-value data store
➢ From GOOGLE:
➢ Google Map Reduce (2004) → HADOOP
➢ Google File System (GFS, 2003) → HDFS
➢ Google BIG TABLE (distributed hashing table over GFS) → HBASE
➢ HADOOP distributions (Linux initially):
➢ Cloudera (Impala)
➢ Hortonworks (Windows version)
➢ MapR (HDFS-centric)
➢ and for the cloud: EMR, Elastic Map Reduce (Amazon, 2009)

Hadoop Map-Reduce (MR)
➢ The Map-Reduce architecture consists of:
➢ one JobTracker
➢ several TaskTrackers, in charge of executing map-reduce tasks on each machine
➢ With YARN, an evolution of the MR architecture, the JobTracker is split into:
➢ a RESOURCE MANAGER
➢ APPLICATION MASTERs (AM) <not only Map Reduce>
➢ a SCHEDULER

MAP REDUCE processing steps
DATA distribution approach instead of PROGRAM distribution (parallelization); creation of (KEY, VALUE) pairs…
4 steps:
➢ SHARDING/SPLITTING input data for parallel processing
➢ MAPPING blocks to create values associated with keys (key, value)
➢ SHUFFLING (sorting) by keys
➢ REDUCING groups with an aggregate value for each key

Typical example for MAP REDUCE: word counting in a given text

Example* : MAP REDUCE for JOINing 2 tables ?
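The four steps above (split, map, shuffle by key, reduce) can be replayed on the word-counting example in a few lines of plain Python. This is a single-process sketch of the dataflow, not Hadoop code:

```python
from itertools import groupby

text = ["big data big volume", "big velocity"]   # one "split" per line

# MAP: emit (key, value) pairs — here (word, 1)
mapped = [(word, 1) for line in text for word in line.split()]

# SHUFFLE: sort by key so that equal keys are adjacent, then group
mapped.sort(key=lambda kv: kv[0])
groups = groupby(mapped, key=lambda kv: kv[0])

# REDUCE: aggregate the values of each group
counts = {word: sum(v for _, v in pairs) for word, pairs in groups}
print(counts)   # {'big': 3, 'data': 1, 'velocity': 1, 'volume': 1}
```

In Hadoop the splits would live on different machines and the shuffle would move pairs across the network, but the map and reduce functions have exactly this shape.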
☺ Input tables:

Pilot
PL#  PLNAME
1    Serge
2    Leo

FLIGHT1
F#     PL#  DC        AC
AF100  1    Nice      Paris
AF101  1    Paris     Toulouse
AF104  2    Toulouse  Lyon

* Example inspired from the book « Big Data et Machine Learning », P. Lemberger et al., Dunod, 2018 <in French>

MAP (option with coalescence of DATA INPUT + splitting)
Sharding at the tuple level:
(Pilot, 1, Serge), (Pilot, 2, Leo), (FLIGHT1, AF100, 1, Nice, Paris), (FLIGHT1, AF101, 1, Paris, Toulouse), (FLIGHT1, AF104, 2, Toulouse, Lyon)

MAP & shuffling with KEY = PL# (sorting); values are (KEY, tuple) pairs:
KEY = 1: (Pilot, 1, Serge), (FLIGHT1, AF100, 1, Nice, Paris), (FLIGHT1, AF101, 1, Paris, Toulouse)
KEY = 2: (Pilot, 2, Leo), (FLIGHT1, AF104, 2, Toulouse, Lyon)

REDUCE (aggregation; tuple fusion) on each partition, and the final result… the JOIN ☺
1  Serge  AF100  Nice      Paris
1  Serge  AF101  Paris     Toulouse
2  Leo    AF104  Toulouse  Lyon

Map reduce issues
➢ Split/shard size? 64 MB (HDFS block size)
➢ KEY selection?
➢ Processing with only these two functions, MAP & REDUCE?
➢ Hadoop framework complexity?
➢ Batch oriented (days to validate a map-reduce job)
➢ Map and reduce coding complexity!
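The join walkthrough above (shard at the tuple level, map and shuffle on PL#, then fuse tuples in the reducer) can also be replayed in plain Python. A single-process sketch over the slide's data, with each tuple tagged by its source table:

```python
from collections import defaultdict

pilots  = [(1, "Serge"), (2, "Leo")]
flights = [("AF100", 1, "Nice", "Paris"),
           ("AF101", 1, "Paris", "Toulouse"),
           ("AF104", 2, "Toulouse", "Lyon")]

# MAP: emit (PL#, tagged tuple) — PL# is the shard key
mapped  = [(pl, ("Pilot", name)) for pl, name in pilots]
mapped += [(pl, ("FLIGHT1", (f, dc, ac))) for f, pl, dc, ac in flights]

# SHUFFLE: group every tagged tuple by PL#
partitions = defaultdict(list)
for key, tagged in mapped:
    partitions[key].append(tagged)

# REDUCE: on each partition, fuse the pilot tuple with each flight tuple
joined = []
for pl, tuples in sorted(partitions.items()):
    names       = [v for tag, v in tuples if tag == "Pilot"]
    pl_flights  = [v for tag, v in tuples if tag == "FLIGHT1"]
    for name in names:
        for f, dc, ac in pl_flights:
            joined.append((pl, name, f, dc, ac))

print(joined)
# [(1, 'Serge', 'AF100', 'Nice', 'Paris'),
#  (1, 'Serge', 'AF101', 'Paris', 'Toulouse'),
#  (2, 'Leo', 'AF104', 'Toulouse', 'Lyon')]
```

This is the classic reduce-side join: because the shuffle routes every tuple with the same PL# to the same partition, the reducer sees all the pieces it needs locally.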
➢ Scarce implementations of Map & Reduce, hence:
➢ use of a SCRIPT language (PIG)
➢ use of an SQL-like interface (HIVE, SPARK SQL)

HADOOP Ecosystem
➢ HDFS, or an API for other distributed file systems like S3 (Amazon)
➢ MAHOUT for machine learning in Java
➢ SPARK (2014) with MLlib
➢ ZooKeeper (coordination), Oozie (job plans), Flume (data flows), RHadoop (for R developers), Sqoop (data transfer with relational DBMS) and… Apache STORM (real-time big data)
➢ Next: interactivity (vs batch) + SQL (vs scripts):
➢ Impala: MPP SQL engine
➢ DRILL (with ZooKeeper), like BIG QUERY (Google, with Dremel)

HIVE (http://hive.apache.org/)
➢ 2009, Facebook
➢ HiveQL 1.0 (Feb. 2015)
➢ Tabular model on HADOOP with an SQL-like interface; a HiveQL query is transformed into MAP REDUCE jobs
➢ CREATE TABLE PILOT (PIL# INT, PLNAME STRING, ADDR STRUCT<Street: STRING, City: STRING, Zip: INT>)
➢ Only EQUI-JOINS (inner join, left outer join, right outer join)

DOCUMENT*-oriented NO SQL DBMS (KEY/VALUE
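The HIVE slide above notes that HiveQL supports only equi-joins; for the user, the result is the same as an ordinary SQL inner join, even though Hive compiles it into map-reduce jobs behind the scenes. A sketch of the equivalent query over the PILOT/FLIGHT1 data using Python's sqlite3 (table and column names are simplified here, e.g. `pl` instead of `PL#`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pilot (pl INTEGER PRIMARY KEY, plname TEXT)")
con.execute("CREATE TABLE flight1 (f TEXT, pl INTEGER, dc TEXT, ac TEXT)")
con.executemany("INSERT INTO pilot VALUES (?, ?)",
                [(1, "Serge"), (2, "Leo")])
con.executemany("INSERT INTO flight1 VALUES (?, ?, ?, ?)",
                [("AF100", 1, "Nice", "Paris"),
                 ("AF101", 1, "Paris", "Toulouse"),
                 ("AF104", 2, "Toulouse", "Lyon")])

# Equi-join on the pilot number — the only kind of join HiveQL accepts
rows = con.execute("""
    SELECT p.pl, p.plname, f.f, f.dc, f.ac
    FROM pilot p JOIN flight1 f ON p.pl = f.pl
    ORDER BY f.f
""").fetchall()
print(rows)
# [(1, 'Serge', 'AF100', 'Nice', 'Paris'),
#  (1, 'Serge', 'AF101', 'Paris', 'Toulouse'),
#  (2, 'Leo', 'AF104', 'Toulouse', 'Lyon')]
```

The output matches the REDUCE result of the map-reduce join walkthrough earlier, which is exactly the point: Hive lets users write the declarative query and generates the map/shuffle/reduce plumbing for them.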