Design of the Integrated Big and Fast Data Eco-‐System

D4.1: Design of the integrated big and fast data eco-system Sandro Fiore (CMCC), Donatello Elia (CMCC), Walter dos Author(s) Santos Filho (UFMG), Carlos Eduardo Pires (UFCG) Status DraFt/Review/Approval/Final Version v1.0 Date 01/07/2016 Dissemination Level X PU: Public PP: Restricted to other programme participants (including the Commission) RE: Restricted to a group speciFied by the consortium (including the Commission) CO: Confidential, only For members of the consortium (including the Commission) EUBra-BIGSEA is Funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 690116. Este projeto é resultante da 3a Chamada Coordenada BR-UE em Tecnologias da InFormação e Comunicação (TIC), anunciada pelo Ministério de Ciência, Tecnologia e Inovação (MCTI) Abstract: Europe - Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme, and the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT. The purpose of this report is the design of the integrated big and fast data eco-system. The deliverable aims at identifying and describing in detail all the key architectural building blocks needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the EUBra-BIGSEA project. www.eubra-bigsea.eu | [email protected] |@bigsea_eubr 1 Document identifier: EUBRA BIGSEA -WP4-D4.1 Deliverable lead CMCC Related work package WP4 Author(s) Sandro Fiore (CMCC), Donatello Elia (CMCC), Walter dos Santos Filho (UFMG), Carlos Eduardo Pires (UFCG) Contributor(s) Ignacio Blanquer (UPV), Gustavo Avelar (UFMG), Wagner Meira (UFMG), Dorgival Guedes (UFMG), Luiz Fernando Carvalho (UFMG), Monica Vitali (POLIMI), Demetrio Mestre (UFCG), Tiago Brasileiro (UFCG), Nádia P. Kozievitch (UTFPR), Daniele Lezzi (BSC), Igor Oliveira (IBM) Due date 30/06/2016 Actual submission date 01/07/2016 Reviewed by Nádia P. Kozievitch (UTFPR), Cinzia Cappiello (POLIMI) Approved by PMB Start date of Project 01/01/2016 Duration 24 months Keywords Big data eco-system, architecture design, analytics, machine learning Versioning and contribution history Version Date Authors Notes 0.1 02/05/2016 Sandro Fiore (CMCC) Table oF Contents 0.2 17/05/2016 Walter dos Santos Filho (UFMG) Formatting 0.3 30/05/2016 Sandro Fiore, Donatello Elia (CMCC) Requirements, general architecture sections and tools analysis deFinition 0.4 10/06/2016 Sandro Fiore, Donatello Elia (CMCC) Updated ToC, introduction, executive summary, architecture 0.5 15/06/2016 Walter dos Santos Filho (UFMG) Architecture sequence diagrams and management API 0.6 16/06/2016 Monica Vitali (POLIMI), Igor Oliveira Data sources update, data quality as (IBM), Donatello Elia (CMCC), Sandro a service Fiore (CMCC), Luiz Fernando Carvalho (UFMG), Walter dos Santos Filho (UFMG) 0.7 17/06/2016 Carlos Eduardo Pires (UFCG) Entity-Matching, general review 0.8 20/06/2016 All contributors Review of the tools analysis section, tools assessment 0.9 24/06/2016 Sandro Fiore (CMCC) General review oF the document, conclusions www.eubra-bigsea.eu | [email protected] |@bigsea_eubr 2 Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy oF this license, visit https://creativecommons.org/licenses/by/4.0. Disclaimer: The content oF the document herein is the sole responsibility oF the publishers and it does not necessarily represent the views expressed by the European Commission or its services. While the inFormation contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-BIGSEA Consortium make no warranty oF any kind with regard to this material including, but not limited to the implied warranties oF merchantability and Fitness For a particular purpose. Neither the EUBra-BIGSEA Consortium nor any oF its members, their oFFicers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect oF any inaccuracy or omission herein. Without derogating From the generality oF the Foregoing neither the EUBra-BIGSEA Consortium nor any oF its members, their oFFicers, employees or agents shall be liable For any direct or indirect or consequential loss or www.eubra-bigsea.eu | [email protected] |@bigsea_eubr 3 TABLE OF CONTENT EXECUTIVE SUMMARY .............................................................................................................................. 7 1. Introduction ...................................................................................................................................... 8 1.1. Scope of the Document ............................................................................................................... 8 1.2. Target Audience .......................................................................................................................... 8 1.3. Structure ..................................................................................................................................... 8 2. EUBra-BIGSEA Architectural Overview ............................................................................................... 9 3. Big and Fast Data Eco-system Requirements ................................................................................... 10 3.1. Use Case Requirements ............................................................................................................. 10 3.2. Technical Requirements ............................................................................................................ 11 3.3. Classes of Users ......................................................................................................................... 12 4. Data Sources ................................................................................................................................... 13 4.1. External Data ............................................................................................................................ 14 4.1.1. Stationary Data ......................................................................................................................... 15 4.1.2. Dynamic Spatial Data ................................................................................................................ 15 4.1.3. Environmental Data .................................................................................................................. 16 4.1.4. Social Data ................................................................................................................................ 17 4.2. Derived Data ............................................................................................................................. 18 4.3. Platform-level Data ................................................................................................................... 18 4.3.1. QoS Monitoring Data ................................................................................................................ 18 4.3.2. Data Quality as a Service Data ................................................................................................. 19 5. Big and Fast Data Eco-system General Architecture ......................................................................... 20 6. Big and Fast Data Eco-system Design ............................................................................................... 22 6.1. Architectural Diagram ............................................................................................................... 22 6.1.1. Data Storage ............................................................................................................................. 23 6.1.2. Big Data Technologies ............................................................................................................... 25 6.1.3. Entity Matching Service ............................................................................................................ 26 6.1.4. Data Quality as a Service .......................................................................................................... 27 6.1.5. Extraction, Transformation and Load ....................................................................................... 27 6.2. Sequence Diagrams ................................................................................................................... 27 6.2.1. User Stories for UC1: Data Acquisition ..................................................................................... 27 6.2.2. User Stories for UC2: Descriptive Models ................................................................................ 29 6.2.3. User Stories for UC3: Predictive Models .................................................................................. 30 6.2.4. Other interactions between WP4 components ....................................................................... 31 6.3. Exposed QoS metrics ................................................................................................................. 32 6.3.1. Java Virtual Machine metrics ................................................................................................... 32 6.3.2. Data Storage Metrics ...............................................................................................................

Design of the Integrated Big and Fast Data Eco-‐System

Introduction to Hbase Schema Design

Apache Hbase, the Scaling Machine Jean-Daniel Cryans Software Engineer at Cloudera @Jdcryans

Final Report CS 5604: Information Storage and Retrieval

Apache Hbase: the Hadoop Database Yuanru Qian, Andrew Sharp, Jiuling Wang

Mapreduce Service

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Apache Hadoop Goes Realtime at Facebook

Hbase: the Definitive Guide

Desarrollo De Una Solución Business Intelligence Mediante Un Paradigma De Data Lake

TR-4744: Secure Hadoop Using Apache Ranger with Netapp In

Beyond Macrobenchmarks: Microbenchmark-Based Graph Database Evaluation

Pentaho EMR46 SHIM 7.1.0.0 Open Source Software Packages