Hadoop (Map Reduce) ?
« From data bases to big data » (7 lectures)
MBDS graduate course
Professor Serge Miranda
Dept of Computer Science, University of Nice Sophia Antipolis (member of Université Côte d’Azur, UCA)
Director of MBDS Master degree (www.mbds-fr.org)

BIG DATA: N.O. SQL and NEW SQL (Lecture 7)

BIG DATA management systems
➢ TOP DOWN approach for structured and semi-structured DATA
   ➢ SQL2, SQL3, ODMG
   ➢ Semantic Web (SPARQL, OWL)
➢ BOTTOM UP approach for UNSTRUCTURED DATA
   ➢ N.O. SQL (NOT ONLY SQL)
   ➢ NEWSQL

Copyright Big Data Pr Serge Miranda, MBDS, Univ de Nice Sophia Antipolis (UCA)

Bottom up approach for unstructured data (no schema, no metadata)
➢ « N.O. SQL » (Not Only SQL) <meaning: not relational>
   ➢ « KEY/VALUE paradigm »
   ➢ GRAPH paradigm
➢ « NEW SQL »
   ➢ « SQL paradigm »

« COMPLEX » data: SQL3, N.O. SQL and NEWSQL? (MIRA2013)
[Figure: PROCESSING (SQL / no SQL) versus DATA STRUCTURE]
PROCESSING   Complex structured data     Complex unstructured data
             (top down; schema)          (bottom up; no schema, no metadata)
SQL          OR-DBMS (SQL3)              NEW SQL
no SQL       OO-DBMS (ODMG)              N.O. SQL

« BASE » properties
➢ BASE:
   ➢ Basically Available
   ➢ Soft state; Scalable (scale OUT)
   ➢ Eventually consistent (final consistency)
➢ Replica consistency; cross-node consistency
➢ CAP theorem (Eric Brewer, Prof. at Berkeley, 2000 & 2012; formally proved by Gilbert and Lynch, MIT, 2002)
   ➢ Consistency (SQL)
   ➢ Availability
   ➢ Partition tolerance (N.O. SQL)

CAP Theorem: « Pick 2! » (Brewer 2000; 2012)

N.O. SQL (Not Only SQL) (1998)
4 « no »:
1. no SCHEMA (schema-less; variability) & no METADATA
2. no RELATIONAL / no JOIN (extract data without joins)
3. no DATA FORMAT (graph, document, row, column)
4. no (ACID) transactions (CAP theorem; BASE)
VOLUME + VELOCITY + VARIETY + (VALUE)…

2 complementary approaches for big data management
                    SQL                               N.O. SQL
DATA                Structured (schema)               Unstructured (no schema)
VOLUME & VARIETY    TERA/PETA bytes                   EXA/ZETTA++ bytes
VELOCITY            No                                Yes
TRANSACTIONS        Yes (ACID and Gray’s theorem)     No (BASE & CAP theorem)
SCALABILITY         UP (scale up)                     OUT (scale out)
USER INTERFACE      Ad hoc queries;                   Predefined queries;
                    JOIN & transaction oriented       NO JOIN & decision oriented
STANDARDS           SQL3/ODMG                         Not yet (BIG SQL)
TYPICAL APPROACH    Top down (predefined schema)      Bottom up (no schema)
ADMINISTRATOR       Yes                               No
VENDOR SUPPORT      Yes                               No (Open Source)

N.O. SQL and Web actors
➢ Google: Map Reduce, BigTable & BIG QUERY SQL (Data/ANALYTICS as a Service)
➢ Yahoo!: Hadoop, S4
➢ Amazon: Dynamo, S3
➢ Facebook: Cassandra, Hive
➢ Twitter: Storm, FlockDB
➢ LinkedIn: Kafka, SenseiDB, Voldemort, etc.

Taxonomy of BIG DATA Systems

« N.O. SQL » DBMS
4 data paradigms: 3 KEY-VALUE oriented and one GRAPH oriented
➢ KEY-VALUE with BLOBS (Binary Large Objects), ex: Hadoop, Cassandra, Riak, Redis, DynamoDB, BerkeleyDB, etc.
   ➔ HASHING arrays (no query engine)
➢ KEY-VALUE with JSON/XML documents, ex: MongoDB, CouchDB, etc.
   ➢ JSON is simpler than XML, with a JavaScript interface
   ➢ <KEY, VALUE> model with VALUE in JSON (BSON, XML) for documents
➢ KEY-VALUE with COLUMNS, ex: HBASE, Cassandra, BigTable (Google), …
   ➢ <KEY, (SET of columns, VALUE, TIMESTAMP)>
➢ GRAPH oriented, ex: Neo4j, OrientDB, …: towards GQL (Graph Query Language)

KEY-VALUE NOSQL and SQL convergence
➢ (KEY, VALUE) pairs
   ➢ (primary key, relational TUPLE)
   ➢ Like (OID, VALUE) for OBJECTS
   ➢ Hashing tables for access
In SQL: CREATE TABLE PAIR (KEY VARCHAR PRIMARY KEY, VALUE BLOB)
➢ 4 basic operators
   ➢ INSERT/DELETE/UPDATE pair
   ➢ FIND value for a key
➢ Ex: Cassandra, Redis, Voldemort, Memcached, Riak, Dynamo (Amazon), Caché (InterSystems), CouchDB, BIG TABLE, Berkeley DB, …

REST (REpresentational State Transfer)
➢ 2 communication modes for client-server
   ➢ RPC (Remote Procedure Call), which is connection oriented (TCP)
   ➢ REST <Representational State Transfer>, which is service oriented
➢ REST is based upon HTTP
➢ DATA access facility
➢ 6 HTTP methods used by REST: GET, HEAD, PUT, POST, DELETE, OPTIONS
➢ RESTful NO SQL systems: COUCHDB, HBASE, NEO4J, RIAK, ...

SCALE OUT & SHARDING (fragmentation)
➢ Distributed DATA partitioning for parallel processing
➢ SHARD KEY: key for data partitioning

HADOOP ecosystem for the 1st « V » of BIG DATA (VOLUME)
HADOOP ecosystem around HDFS and MAP REDUCE: PIG LATIN* (script) developed by Yahoo, HIVE (data warehouse) by Facebook (HIVEQL), …
* PIG Latin offers SQL operators (Join, Group By, Union) but with a procedural approach for batch; as in HIVE, UDFs (User Defined Functions) are possible

Hadoop (Map Reduce) ?
➢ OPEN SOURCE (Apache Foundation), written in JAVA
➢ HADOOP (Map-Reduce implementation): created by Doug Cutting from Yahoo (then Open Source)
   ➢ HDFS
   ➢ HBASE: column-oriented key-value data store
➢ From GOOGLE:
   ➢ Google Map Reduce (2004) ➔ HADOOP
   ➢ Google File System (GFS), 2003 ➔ HDFS
   ➢ Google BIG TABLE (distributed sorted map over GFS) ➔ HBASE
➢ HADOOP distributions (Linux initially)
   ➢ Cloudera (Impala)
   ➢ Hortonworks (Windows version)
   ➢ MAPR (HDFS-centric)
   ➢ and for the cloud: EMR, Elastic Map Reduce (Amazon, 2009)

Hadoop Map-Reduce (MR)
➢ The Map-Reduce architecture consists of
   ➢ one JobTracker
   ➢ several TaskTrackers, each in charge of executing map-reduce tasks on its machine
➢ With YARN, evolution of the MR architecture: the single JOB TRACKER is replaced by
   ➢ a RESOURCE MANAGER, which includes the SCHEDULER
   ➢ one APPLICATION MASTER (AM) per application <not only Map Reduce>

MAP REDUCE processing steps
DATA distribution approach instead of PROGRAM distribution (parallelization)
Creation of (KEY, VALUE) pairs…
4 steps:
➢ SHARDING/SPLITTING the input data for parallel processing
➢ MAPPING blocks to create (key, value) pairs
➢ SHUFFLING (sorting) by key
➢ REDUCING each group to an aggregate value for its key

Typical example for MAP REDUCE (word counting in a given text)

Example*: MAP REDUCE for JOINing 2 tables?
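A minimal sketch of this join in plain Python (no Hadoop involved), using the Pilot and FLIGHT1 rows of the example that follows. The function names map_phase, shuffle_phase, reduce_phase and mapreduce_join are hypothetical and chosen only for illustration. Everything runs in one process here, whereas on a cluster the framework performs the shuffle and each key group can be reduced on a different node.

```python
from collections import defaultdict

# Rows copied from the Pilot and FLIGHT1 tables of the example.
PILOT = [(1, "Serge"), (2, "Leo")]
FLIGHT1 = [
    ("AF100", 1, "Nice", "Paris"),
    ("AF101", 1, "Paris", "Toulouse"),
    ("AF104", 2, "Toulouse", "Lyon"),
]


def map_phase(pilots, flights):
    """Emit (key, tagged tuple) pairs, using the join attribute PL# as key."""
    for pl, name in pilots:
        yield pl, ("Pilot", name)
    for f, pl, dc, ac in flights:
        yield pl, ("FLIGHT1", f, dc, ac)


def shuffle_phase(pairs):
    """Group the mapped values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Fuse every Pilot tuple with every FLIGHT1 tuple that shares the key."""
    names = [v[1] for v in values if v[0] == "Pilot"]
    flights = [v[1:] for v in values if v[0] == "FLIGHT1"]
    return [(key, name, f, dc, ac) for name in names for (f, dc, ac) in flights]


def mapreduce_join(pilots, flights):
    """Chain the three phases and collect the joined tuples."""
    result = []
    for key, values in shuffle_phase(map_phase(pilots, flights)):
        result.extend(reduce_phase(key, values))
    return result


if __name__ == "__main__":
    for row in mapreduce_join(PILOT, FLIGHT1):
        print(row)
```

Tagging each value with its table of origin is what lets the reducer tell Pilot tuples from FLIGHT1 tuples inside one key group. A key with tuples from only one table produces no output, which gives inner-join behaviour.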
Pilot
PL#   PLNAME
1     Serge
2     Leo

FLIGHT1
F#      PL#   DC         AC
AF100   1     Nice       Paris
AF101   1     Paris      Toulouse
AF104   2     Toulouse   Lyon

* Example inspired by the book « Big Data et Machine Learning », P. Lemberger et al., Dunod, 2018 <in French>

MAP (option with coalescence of DATA INPUT + splitting)
Sharding at the tuple level:
Pilot     1       Serge
Pilot     2       Leo
FLIGHT1   AF100   1   Nice       Paris
FLIGHT1   AF101   1   Paris      Toulouse
FLIGHT1   AF104   2   Toulouse   Lyon

MAP & shuffling (sorting) with KEY = PL#; VALUEs are (KEY, tuple) pairs
KEY=1:
   Pilot     1       Serge
   FLIGHT1   AF100   1   Nice    Paris
   FLIGHT1   AF101   1   Paris   Toulouse
KEY=2:
   Pilot     2       Leo
   FLIGHT1   AF104   2   Toulouse   Lyon

REDUCE (aggregation; tuple fusion) on each partition, and final result: the JOIN
PL#   PLNAME   F#      DC         AC
1     Serge    AF100   Nice       Paris
1     Serge    AF101   Paris      Toulouse
2     Leo      AF104   Toulouse   Lyon

Map reduce issues
➢ Split/shard size? 64 MB (HDFS block size)
➢ KEY selection?
➢ Processing with these 2 functions, MAP & REDUCE?
➢ Hadoop framework complexity?
➢ Batch-oriented (days to validate a map reduce job)
➢ Map and reduce coding complexity!
   ➢ Scarce direct implementation of Map & Reduce
   ➢ Use of a SCRIPT language (PIG)
   ➢ Use of an SQL-like interface (HIVE, SPARK SQL)

HADOOP Ecosystem
➢ HDFS, or API for other distributed FS like S3 (Amazon)
➢ MAHOUT for machine learning in Java
➢ SPARK (2014) with MLlib
➢ ZooKeeper (coordination), Oozie (job workflow), Flume (data flow), RHadoop (for R developers), Sqoop (data transfer with RDBMS) and… Apache STORM (real-time big data…)
➢ Next: interactivity (>> batch) + SQL (>> scripts)
   ➢ Impala: MPP SQL engine
   ➢ DRILL (with ZooKeeper), like Big Query (Google, with Dremel)

HIVE http://hive.apache.org/
➢ 2009, Facebook
➢ HiveQL 1.0 (Feb. 2015)
Tabular model on HADOOP with SQL-like interface
Transformation of a HiveQL query into MAP REDUCE jobs
➢ CREATE TABLE PILOT (PIL# INTEGER, PLNAME STRING, ADDR STRUCT<Street: STRING, City: STRING, Zip: INT>)
➢ Only EQUI JOINS (inner join, left outer join, right outer join)

DOCUMENT*-oriented NO SQL DBMS (KEY/VALUE