D4.1: Design of the integrated big and fast data eco-system

Author(s): Sandro Fiore (CMCC), Donatello Elia (CMCC), Walter dos Santos Filho (UFMG), Carlos Eduardo Pires (UFCG)
Status: Draft/Review/Approval/Final
Version: v1.0
Date: 01/07/2016

Dissemination Level: PU (Public)
(PU: Public; PP: Restricted to other programme participants (including the Commission); RE: Restricted to a group specified by the consortium (including the Commission); CO: Confidential, only for members of the consortium (including the Commission))

EUBra-BIGSEA is funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 690116. This project results from the 3rd BR-EU Coordinated Call on Information and Communication Technologies (ICT), announced by the Brazilian Ministry of Science, Technology and Innovation (MCTI).

Abstract: Europe-Brazil Collaboration of BIG Data Scientific Research through Cloud-Centric Applications (EUBra-BIGSEA) is a medium-scale research project funded by the European Commission under the Cooperation Programme and by the Ministry of Science and Technology (MCT) of Brazil in the frame of the third European-Brazilian coordinated call. The document has been produced with the co-funding of the European Commission and the MCT. The purpose of this report is to present the design of the integrated big and fast data eco-system. The deliverable aims at identifying and describing in detail all the key architectural building blocks needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the EUBra-BIGSEA project.


Document identifier: EUBRA BIGSEA-WP4-D4.1
Deliverable lead: CMCC
Related work package: WP4
Author(s): Sandro Fiore (CMCC), Donatello Elia (CMCC), Walter dos Santos Filho (UFMG), Carlos Eduardo Pires (UFCG)
Contributor(s): Ignacio Blanquer (UPV), Gustavo Avelar (UFMG), Wagner Meira (UFMG), Dorgival Guedes (UFMG), Luiz Fernando Carvalho (UFMG), Monica Vitali (POLIMI), Demetrio Mestre (UFCG), Tiago Brasileiro (UFCG), Nádia P. Kozievitch (UTFPR), Daniele Lezzi (BSC), Igor Oliveira (IBM)
Due date: 30/06/2016
Actual submission date: 01/07/2016
Reviewed by: Nádia P. Kozievitch (UTFPR), Cinzia Cappiello (POLIMI)
Approved by: PMB
Start date of Project: 01/01/2016
Duration: 24 months
Keywords: Big data eco-system, architecture design, analytics, machine learning

Versioning and contribution history

Version | Date | Authors | Notes
0.1 | 02/05/2016 | Sandro Fiore (CMCC) | Table of Contents
0.2 | 17/05/2016 | Walter dos Santos Filho (UFMG) | Formatting
0.3 | 30/05/2016 | Sandro Fiore, Donatello Elia (CMCC) | Requirements, general architecture sections and tools analysis definition
0.4 | 10/06/2016 | Sandro Fiore, Donatello Elia (CMCC) | Updated ToC, introduction, executive summary, architecture
0.5 | 15/06/2016 | Walter dos Santos Filho (UFMG) | Architecture sequence diagrams and management API
0.6 | 16/06/2016 | Monica Vitali (POLIMI), Igor Oliveira (IBM), Donatello Elia (CMCC), Sandro Fiore (CMCC), Luiz Fernando Carvalho (UFMG), Walter dos Santos Filho (UFMG) | Data sources update, data quality as a service
0.7 | 17/06/2016 | Carlos Eduardo Pires (UFCG) | Entity-Matching, general review
0.8 | 20/06/2016 | All contributors | Review of the tools analysis section, tools assessment
0.9 | 24/06/2016 | Sandro Fiore (CMCC) | General review of the document, conclusions


Copyright notice: This work is licensed under the Creative Commons CC-BY 4.0 license. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0. Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does not necessarily represent the views expressed by the European Commission or its services. While the information contained in the document is believed to be accurate, the author(s) or any other participant in the EUBra-BIGSEA Consortium make no warranty of any kind with regard to this material including, but not limited to the implied warranties of merchantability and fitness for a particular purpose. Neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein.

Without derogating from the generality of the foregoing, neither the EUBra-BIGSEA Consortium nor any of its members, their officers, employees or agents shall be liable for any direct or indirect or consequential loss or damage.


TABLE OF CONTENTS

EXECUTIVE SUMMARY ...... 7 1. Introduction ...... 8 1.1. Scope of the Document ...... 8 1.2. Target Audience ...... 8 1.3. Structure ...... 8 2. EUBra-BIGSEA Architectural Overview ...... 9

3. Big and Fast Data Eco-system Requirements ...... 10 3.1. Use Case Requirements ...... 10 3.2. Technical Requirements ...... 11 3.3. Classes of Users ...... 12 4. Data Sources ...... 13 4.1. External Data ...... 14 4.1.1. Stationary Data ...... 15 4.1.2. Dynamic Spatial Data ...... 15 4.1.3. Environmental Data ...... 16 4.1.4. Social Data ...... 17 4.2. Derived Data ...... 18 4.3. Platform-level Data ...... 18 4.3.1. QoS Monitoring Data ...... 18 4.3.2. Data Quality as a Service Data ...... 19 5. Big and Fast Data Eco-system General Architecture ...... 20 6. Big and Fast Data Eco-system Design ...... 22 6.1. Architectural Diagram ...... 22 6.1.1. Data Storage ...... 23 6.1.2. Big Data Technologies ...... 25 6.1.3. Entity Matching Service ...... 26 6.1.4. Data Quality as a Service ...... 27 6.1.5. Extraction, Transformation and Load ...... 27 6.2. Sequence Diagrams ...... 27 6.2.1. User Stories for UC1: Data Acquisition ...... 27 6.2.2. User Stories for UC2: Descriptive Models ...... 29 6.2.3. User Stories for UC3: Predictive Models ...... 30 6.2.4. Other interactions between WP4 components ...... 31 6.3. Exposed QoS metrics ...... 32 6.3.1. Java Virtual Machine metrics ...... 32 6.3.2. Data Storage Metrics ...... 33 6.3.3. Data Access Metrics ...... 33 6.3.4. Data Ingestion and Streaming Processing Metrics ...... 33


6.3.5. Data Analytics and Data Mining Metrics ...... 34 6.3.6. Data Mining and Analytical Toolbox ...... 34 6.4. Data Management API ...... 34 6.5. Security Aspects ...... 36 7. Tools evaluation ...... 38 7.1. Procedure to describe components ...... 38 7.2. Data Storage ...... 39 7.2.1. HDFS ...... 39 7.3. Data Access ...... 40 7.3.1. PostGIS ...... 40 7.3.2. MongoDB ...... 42 7.3.3. Apache HBase ...... 43 7.4. Data Ingestion and Streaming Processing ...... 45 7.4.1. ...... 45 7.4.2. ...... 46 7.4.3. ...... 48 7.4.4. Streaming ...... 51 7.5. Data Analytics and Mining ...... 52 7.5.1. Ophidia ...... 52 7.5.2. ...... 54 7.5.3. ...... 55 7.5.4. Druid ...... 57 7.5.5. Spark ...... 59 7.5.6. Hadoop MapReduce ...... 60 7.6. Data Mining and Analytics Toolbox ...... 62 7.6.1. Apache Spark MLlib ...... 62 7.6.2. Ophidia Operators ...... 64 7.6.3. Ophidia Primitives ...... 65 7.7. Final Assessment ...... 66 8. Preliminary Architectural Mapping ...... 69 9. Conclusions ...... 70 10. References ...... 71 GLOSSARY ...... 72

LIST OF TABLES Table 1. List of Use Case requirements related to WP4 ...... 11 Table 2. List of technological requirements related to WP4 ...... 11 Table 3. Summary of external data sources ...... 15 Table 4. Some examples of JVM metrics available ...... 33 Table 5. Distributed file system metrics ...... 33 Table 6. Data access metrics ...... 33


Table 7. Data ingestion and streaming processing metrics ...... 34 Table 8. Data analytics & mining metrics ...... 34 Table 9. Template used to describe the potential components to be used in the Big Data eco-system...... 39

LIST OF FIGURES Figure 1. High-level view of the EUBra-BIGSEA architecture ...... 9 Figure 2. WRF outer (map) and inner (blue rectangle) domains ...... 17 Figure 3. WP4 general architecture ...... 20 Figure 4. Big and Fast Data eco-system detailed architecture ...... 22 Figure 5. Data sources levels ...... 24 Figure 6. Sequence diagram for scenario 1.1 ...... 28 Figure 7. Sequence diagram for scenario 1.2 ...... 28 Figure 8. Sequence diagram for scenario 2.1 ...... 29 Figure 9. Sequence diagram for scenario 2.2 ...... 30 Figure 10. Sequence diagram for scenario 3.1 ...... 30 Figure 11. Sequence diagram for scenario 3.2 ...... 31 Figure 12. Sequence diagram for scenario 4.1 ...... 32 Figure 13. Preliminary architectural mapping ...... 69


EXECUTIVE SUMMARY The EUBra-BIGSEA project aims at developing a set of cloud services empowering Big Data analytics to ease the development of massive data processing applications. EUBra-BIGSEA will develop models, predictive and reactive cloud infrastructure QoS techniques, efficient and scalable Big Data operators, and a privacy and quality analysis framework, exposed to several programming environments. EUBra-BIGSEA aims at covering the general requirements of multiple application areas, although it will showcase its results in the treatment of massive connected-society information, and particularly in traffic recommendation.

The integrated fast and Big Data eco-system represents the central component devoted to the data management aspects (i.e., access, analytics/mining, and quality) of the EUBra-BIGSEA platform. Its architecture has been defined in this document starting from the requirements gathered from the main project use cases and highlighted in D7.1. To fulfil these requirements, the proposed architecture integrates multiple classes of big data systems to address fast data analysis over continuous streams from external data sources, general-purpose data mining and machine learning tools, as well as OLAP-based systems for multidimensional data analysis. A storage and access layer has also been defined to provide low-level key functionalities. Aspects related to the API exposed by WP4 have also been reported in this document, as they are strongly connected to the programmability aspects of the data management part.

Also relevant to this document are (i) a comprehensive evaluation and assessment of the big data tools available in the general landscape from the data storage, access, analytics and mining standpoints, and (ii) an in-depth analysis of the data sources in terms of data model, formats, volume, metadata, and functional needs. In the former case, a comprehensive description of the data tools has been provided, whereas in the latter a complete description of the data sources (from raw - level-0 - to derived - level-2 - data) and of their links with the functional components of the architecture has been given. The analysis of the big data landscape has also included the technical evaluation of the tools as well as their final assessment based on evaluation criteria linked to the use case requirements.

As highlighted in the document, key features of the designed data management architecture are the integration of different classes of big/fast data tools to address multifaceted use case requirements, and the dynamicity and elasticity of the environment (which are linked to QoS metrics/policies), jointly with a secure-by-design eco-system (e.g., with respect to privacy aspects). The proposed architecture joins all these elements in a cloud environment, aiming at providing, to some extent, a general approach to deal with high-social-impact use cases and scenarios like the ones proposed in the project. A preliminary mapping of the main architectural blocks onto infrastructural components is also proposed at the end of the document.


1. INTRODUCTION

1.1. Scope of the Document This document provides a complete overview of the design of the integrated big and fast data eco-system. It aims at identifying and describing in detail all the key architectural components needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the project. The document also includes the full list of the data-related requirements from D7.1, jointly with a comprehensive description of the data sources and of the big and fast data tools. In addition, UML diagrams are proposed to clarify architectural aspects and interactions among components. The links to the other WPs from the security, quality of service, user requirements and programming framework standpoints are also highlighted in the text.

1.2. Target Audience The document is mainly intended for internal use, although it is publicly released. The main target of this document is the global team of technical experts of EUBra-BIGSEA, including WP3, WP4, WP5 and WP6. The document also goes beyond the pure data management aspects, in order to place the WP4 architecture within the global architecture of the project.

1.3. Structure The rest of the document is structured into 8 main parts. Section 2 provides a general introduction to the EUBra-BIGSEA architecture. Section 3 summarizes the requirements in terms of use cases, technical requirements and classes of users. Section 4 presents and discusses a complete description of the data sources according to the three identified classes (raw, derived and platform-level). Section 5 presents the general architecture of the big and fast data eco-system, highlighting the main building blocks of the system, the links among the different components and the relationships with the other work packages; it provides a general conceptual view of the proposed data management eco-system. Section 6 provides a detailed view of the architecture, with information about the internal components (storage, ETL, big data technologies, Entity-Matching and Data Quality services), sequence diagrams, the QoS metrics to be exposed at the WP4 level, the data management APIs and data-related security aspects. Section 7 provides a comprehensive tools evaluation based on the following characterization: storage, access, analytics/mining and related toolbox, ingestion and streaming processing components; a final assessment based on key dimensions coming from the D7.1 requirements is also presented at the end of the section. Section 8 provides a preliminary mapping of the components onto the architectural view to give some initial insights about the infrastructural implementation of the data eco-system. Finally, Section 9 draws the main conclusions of the deliverable.


2. EUBRA-BIGSEA ARCHITECTURAL OVERVIEW The EUBra-BIGSEA general architecture, as described in deliverable D7.1, comprises four main blocks: ● QoS Cloud Infrastructure services, which integrate the modelling of the workload, the monitoring of the resources, the implementation of vertical and horizontal elasticity, and the contextualization. ● Big Data Analytics services, which provide operators to process huge datasets and can be integrated in the programming models. Analytics services are characterized in the QoS cloud infrastructure models of the underlying layer, which will adjust resources, either automatically or explicitly driven by the analytics services, to the expected workload and its specificities. This document mainly focuses on this big data eco-system block. ● Programming Models, which provide a higher-level programmatic framework and are also characterized by the models of the infrastructure. The programming models will ease the parallelization of the applications developed on top of them. ● Privacy and Security framework, which provides the means to annotate data and processing and ensures the proper protection of privacy and security. On top of these four blocks, applications are developed using the programming models and the data analytics extensions. Application developers are expected to use the programming models and may use other features of the underlying layers, such as the user-level QoS metrics. Figure 1 shows the high-level view of the EUBra-BIGSEA architecture, depicting the interactions among the main blocks.

Figure 1. High-level view of the EUBra-BIGSEA architecture


3. BIG AND FAST DATA ECO-SYSTEM REQUIREMENTS The requirements analysis is an essential preliminary step for the design of a software environment. It defines the features, objectives and constraints that the system is expected to guarantee and comply with. This section provides a summary of the requirements, both functional and non-functional, of the big and fast data eco-system developed within WP4 of the project. End-user requirements represent the initial set of requirements to be addressed; however, since the big data framework will also interact with other entities, such as programming frameworks, external data sources and infrastructure management systems, additional requirements, not directly connected to the end users, should also be targeted. In particular, since data have a key role in the whole eco-system, special attention has been devoted to the examination of the various data sources necessary for the end-user analysis, in order to identify and characterize requirements and constraints. A complete description of the data sources is provided in the next section, whereas this section mainly describes the end-user requirements of the big and fast data eco-system. End-user requirements are essential for the design and implementation of the big data eco-system, since user applications will act as the main validator of the features provided by the whole EUBra-BIGSEA platform. Deliverable D7.1, “End-User Requirements Elicitation”, provides a complete description of the requirements elicitation phase and specifies a set of requirements, from the end-user point of view, that must or should be addressed by the various work packages. Those related to the data part are highlighted in this document.

3.1. Use Case Requirements The project's general user scenario has been split into different use cases to ease the elicitation phase. Three use cases have been identified starting from the type of operations required on the data. Briefly, as reported in more detail in D7.1, the use cases are: ● Use Case 1 (UC1), devoted to the acquisition of data into the system from different data sources; ● Use Case 2 (UC2), related to the processing of the data to extract historical knowledge from it; ● Use Case 3 (UC3), devoted to the creation of new knowledge by projecting existing models into the future or under different conditions. Table 1 lists the requirements for the three use cases that must or should be taken into account during the design of the big data eco-system (i.e., those related to WP4); they cover both functional and non-functional aspects. The full description of the requirements and the use cases is available in D7.1.

Req # | UC # | Description | Level | WP
R1.1 | UC1 | To integrate GIS data sources | MUST | WP4
R1.2 | UC1 | To integrate meteorological/climate data sources | MUST | WP4
R1.3 | UC1 | Metadata must be included into the application | MUST | WP4
R1.5 | UC1 | Availability of an API | MUST | WP4/5
R2.5 | UC2 | Selection of data sources for time-series analysis | MUST | WP7/4
R2.6 | UC2 | Selection of the area of interest | MUST | WP7/4
R2.8 | UC2 | Reuse aggregated results | SHOULD | WP7/4
R3.5 | UC3 | Download of aggregated results | MUST | WP7/4
R3.6 | UC3 | Selection of data sources | MUST | WP7/4

Table 1. List of Use Case requirements related to WP4

3.2. Technical Requirements From the analysis of the Use Case requirements, general functional requirements regarding the project's infrastructure have been identified. These requirements, referred to as “Technological Requirements”, concern data access, execution, and security and logging. A complete reference to the full description of the requirements is provided by D7.1. Table 2 lists the technological requirements that must or should be tackled by the big data eco-system (i.e., those related to WP4); these fall within the context of data access and security aspects. It is worth mentioning that, although execution requirements are mainly addressed by WP3, the big data eco-system provides the components necessary to perform big data processing on the underlying infrastructure as well as the adaptations required by the QoS cloud infrastructure models. Hence, these aspects must also be considered during the WP4 design phase.

Req # | Description | Level | WP
RD.1 | Integrate external existing data sources | MUST | WP4
RD.2 | Automatic synchronization with original data sources | MUST | WP4
RD.3 | Storage of processing products | MUST | WP4/6
RD.4 | Authentication and Authorization | MUST | WP4/6
RD.5 | Data Access | MUST | WP4
RD.6 | Deal with poor-Internet connection limitations | SHOULD | WP4
RA.2 | Data and applications ACL | MUST | WP6/4/5
RA.4 | Data privacy protection | MUST | WP6/4

Table 2. List of technological requirements related to WP4


As shown in Tables 1 and 2, several requirements are shared among different work packages; it is therefore critical to identify the links and dependencies of the big data eco-system with respect to security, QoS cloud infrastructure services, programming frameworks and end-user algorithms/applications, in order to correctly model these aspects in the eco-system design. These aspects have been the main topic of several cross-WP telcos held during the first period.

3.3. Classes of Users

Different types of users could potentially exploit the functionalities provided by the big data eco-system. The following classes of users have been identified, according to the type of functionality and privileges required: ● Administrators, who have control over the big data eco-system at the infrastructural level. Different roles could be defined at different levels and granularities; ● Developers, who use the big data eco-system API to develop end-user applications, test the system and perform data inspection activities; ● Programming models, which exploit the eco-system technologies and the integrated data for the execution of data mining and analytics processing.


4. DATA SOURCES The urban mobility scenario targeted by the EUBra-BIGSEA project requires the integration of very heterogeneous data sources to provide information regarding urban traffic, environmental conditions and people's sentiments/opinions (from social network sources). Even though these data mainly focus on the pilot case (the city of Curitiba), the EUBra-BIGSEA framework should be independent of a specific geographical area, or at least easily reusable in other scenarios with minor adaptation activities. WP4 copes with the management of several external data sources, addressing the challenges associated with these data to provide a fast and scalable environment for big data analytics and mining. The data source analysis provides important input for the WP4 architecture. Deliverable D7.1 gives an overview of some of the main external data source types preliminarily identified for the project scenario; additional data sources identified in subsequent analysis have been included in this report. The external data source types are classified as: ● Stationary data, describing the static elements that compose the infrastructure and mobility in the city. It includes urban geographic information (e.g., legal limits, land cover, land use, hydrography), transportation infrastructure (e.g., street map, topology of the traffic network, bus stops), points of interest (e.g., schools, hospitals, squares, stadiums) and other information relevant to understand the location of the components present within the scenario; ● Dynamic spatial data, containing information valid for a specific point in time. It includes traffic geo-referenced information (e.g., vehicle GPS, routes of public transportation users), traffic status and news (e.g., accidents, traffic jams), the existence of events (e.g., high concentrations of people, concerts, protests), and all types of temporal information useful to measure the mobility conditions; ● Environmental data, presenting information about the environmental conditions and the weather forecasts that are relevant for understanding citizens' mobility; ● Social network data, providing streams of data useful to extract information about sentiments and unpredictable events. External data could require some preliminary steps to be integrated into the big data eco-system. After these pre-processing steps, data can be accessed and processed by exploiting the big data technologies available in the eco-system. Derived data can be produced by running the analytics and machine learning algorithms defined for UC2 and UC3. These data can then be stored into the system to be used for subsequent processing and analysis.

Additionally, the big data eco-system could be exploited to integrate “platform-level” data sources that are mainly required for internal use. These are: (i) the monitoring data produced to evaluate QoS at the infrastructure level and (ii) the information produced by the Data Quality as a Service (DQaS) to annotate data. These data can, in fact, be stored in the eco-system and managed by the same big data technologies used for the other available data sources. The management and analysis of these (big) data raises several challenges that should be properly handled, such as: ● Data velocity: data can be stationary and valid for a long period, generated from periodic runs of weather forecast models, or produced by continuous streams; ● Data variety: data sources are very heterogeneous and come in different types, such as tabular, structured, unstructured, multi-dimensional, spatial, or a mix of the aforementioned;


● Data volume: dynamic and environmental sources, in particular, are expected to continuously produce data, resulting in a big volume of data to be managed; ● Data veracity: data sources may need to be pre-processed, filtered and aligned before being integrated into the eco-system, in order to avoid affecting the overall quality and accuracy of the stored data. The following sections provide a brief description of the main aspects of the (i) external data (Section 4.1), (ii) derived data (Section 4.2), and (iii) platform-level data (Section 4.3) from a WP4 perspective.

4.1. External Data Table 3 summarizes the external data that will potentially be integrated in the data eco-system during the project lifetime, providing a set of characteristics for each of them. The following subsections briefly describe the data sources considered for each of the classes defined above.

Data Source Name | Source Type | Domain | Expected Data Volume (size/freq.) | Data Format | Availability | Data Policy | Storage and services
Brazilian boundaries | Stationary | Geographic | 189 files, 397,639 records, 2 GB | Shape files | Public | Not applicable | PostGIS
Curitiba Infrastructure | Stationary | Geographic | 75,882 records, 100 MB | Shape files | Public | Not applicable | PostGIS
Events from globo.com | Dynamic | Events | 1 record / day | JSON | Public | Not applicable | NoSQL
Events from Google Search | Dynamic | Events | 1 record / day | JSON | Public | Not applicable | NoSQL
Events from Facebook | Social Data | Events | 26 records / day | JSON | Public, API access (requires authN) | Non distributable. Restricted to the project. | NoSQL
Twitter | Social Data | Traffic / Events | 150,000 records / day, 630 MB / day | JSON | Public, API access (requires authN) | Non distributable | NoSQL
Curitiba traffic news | Dynamic | Traffic | 20 records / day | JSON | Public | Not applicable | NoSQL
Traffic news | Dynamic | Traffic | 15 records / day | JSON | Public | Not applicable | NoSQL
Traffic status - "MapLink" | Dynamic | Traffic | 3,670 records / day, 6.5 MB / day | JSON | Public | Non distributable. Restricted to the project. | NoSQL
Points of interest - URBS | Stationary | POI | 2,537 records | JSON | Public | Restricted to the project. | NoSQL
Points of interest - Tripadvisor | Stationary | POI | 309 records | JSON | Public | Restricted to the project. | NoSQL
Curitiba Bus Cards | Dynamic | Mobility | Sample of 6 days, 3,970,059 records | CSV | By request (sensitive information) | Restricted to the project. | PostGIS
Curitiba bus service - URBS | Dynamic | Mobility | 20,000 records / hour, 2.6 MB / hour | JSON | By request | Non distributable. Restricted to the project. | PostGIS
WRF Climate Data | Environmental Data | Weather | 2 GB / day | NetCDF | Public | Not applicable | Filesystem

Table 3. Summary of external data sources

4.1.1. Stationary Data Stationary data provides long-living information describing the topology of the traffic network of the city, the street map, relevant city spots and other geographic information useful to identify the location of the components present in the urban mobility scenario. Possible external data sources belonging to this category are: ● Curitiba Infrastructure: composed of multiple tables with information about land cover and land use in Curitiba, such as street layouts, rivers and squares, along with polygon boundaries. Most of the information is geo-located in the PostGIS format; ● Brazilian boundaries: contains the name, ID and polygon points of the boundaries of all states, meso regions, micro regions, cities, districts, sub-districts and sectors of Brazil. It is an official database provided by IBGE (Brazilian Institute of Geography and Statistics), available on the Web. The data are static and rarely change; ● Points of interest from URBS: information about the points of interest of Curitiba. It includes the name, coordinates and type of each POI (e.g., schools, hospitals, hotels). The data are provided by URBS (Curitiba public transportation company) through API requests and updated once a week; ● Points of interest from Tripadvisor: this database contains information about all the places in Curitiba listed on the Tripadvisor website. In addition to the name and location, it also provides popularity and evaluation metrics. The data acquisition is performed with a web crawler and updated once a week.

4.1.2. Dynamic Spatial Data Dynamic spatial data provides geo-referenced information about vehicles and users at a specific point in time. Possible dynamic spatial data sources include:


● Curitiba bus cards: consists of information about the use of bus cards in Curitiba. Each record includes the bus line, bus vehicle, date, time, card ID and coordinates at which the card has been used. A sample of 6 days is available, provided by URBS (Curitiba public transportation company); ● URBS bus service: contains real-time information about the geographic position of the bus vehicles. Each record describes the position of a bus vehicle at a specific date and time. The data are provided by URBS (Curitiba public transportation company); ● Events from globo.com: information about the most important concerts scheduled in Curitiba from the Globo website. The information includes the event name and place. The data are crawled from the Globo website once a day; ● Events from google.com: information about the most important concerts scheduled in Curitiba from the top box shown during a Google search. The information includes the event name and place. The data are crawled from the Google search once a day; ● Traffic status - "MapLink": this database contains the traffic status of the main avenues and streets of Curitiba. The information is crawled from the MapLink website and updated every 15 minutes. It includes the street name, geo-location, current average speed and traffic status; ● Curitiba traffic news: composed of news about the traffic in Curitiba from an official source. Each record consists of a text with traffic information and a date/time. It is collected once a day through RSS requests from the traffic news feed on the Curitiba City Hall website; ● Traffic news: composed of news about the traffic in Curitiba from news websites. The data are crawled in real time from the main news websites through RSS requests. The information includes the news text, URL, keywords, date/time and source.

4.1.3. Environmental Data Environmental data provides information about the weather conditions and forecasts that are also relevant for understanding citizens' mobility. This data will be produced using the Weather Research and Forecasting (WRF) model. WRF is a state-of-the-art regional numerical weather prediction system developed and maintained by several institutions and available as open source code to the whole community [R01]. In order to produce forecasts for the city of Curitiba, two nested domains centered on the city were configured, with horizontal resolutions of 12 km and 4 km (Figure 2). The WRF model uses input data from the Global Forecasting System (GFS). GFS is a global model developed by the National Center for Environmental Prediction and freely available for community use [R02]. The estimated data volume is 2 GB/day, of which 1.5 GB is input data from GFS (used only by the WRF model) and 0.5 GB is output data from WRF, used by the whole system. WRF output data consists of files in NetCDF format [R03] containing geo-referenced values of meteorological variables like temperature, precipitation, wind, etc.


Figure 2. WRF outer (map) and inner (blue rectangle) domains
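To give a concrete, non-normative idea of how a WP4 component could consume this output, the following minimal Python sketch reads one WRF NetCDF file with the netCDF4 library and extracts the 2 m temperature forecast for a point in Curitiba. The file name is hypothetical, and the variable names (T2, XLAT, XLONG) follow standard WRF conventions rather than project-defined interfaces.

```python
# Minimal sketch: reading a WRF NetCDF forecast file (hypothetical file name;
# variable names follow WRF conventions and may need adapting).
from netCDF4 import Dataset
import numpy as np

with Dataset("wrfout_d02_2016-07-01_00:00:00.nc") as nc:
    t2 = nc.variables["T2"][:]       # 2 m temperature [K], dims: (time, y, x)
    lat = nc.variables["XLAT"][0]    # latitude grid (2D)
    lon = nc.variables["XLONG"][0]   # longitude grid (2D)

    # Nearest grid point to central Curitiba (approx. -25.43, -49.27)
    dist = (lat - (-25.43)) ** 2 + (lon - (-49.27)) ** 2
    j, i = np.unravel_index(dist.argmin(), dist.shape)

    # Hourly 2 m temperature forecast for that point, in Celsius
    print(t2[:, j, i] - 273.15)
```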

4.1.4. Social Data Social data refer to data produced by the interaction between users in an online social network. Such data are very dynamic and can be obtained shortly after being published, making them an excellent source of near real-time information. Extracted information may include facts about situations or events happening right now, for instance traffic jams or floods, and can be used to estimate public attendance or to collect users' demands, sentiments or opinions. To extract information from social data it is necessary to filter, transform and enrich the raw data. In general, one of the most important attributes available in social data is the text of the message. For example, Twitter allows users to send texts of up to 140 characters. There are some other well-structured fields, such as the tweet date or user login, but in general the message text contains important information mixed with poorly written text, misspellings, ambiguities and other incomprehensible expressions. We foresee social data as an important source of real-time information and an excellent input to processing algorithms in areas such as NLP (Natural Language Processing, for example to extract entities referred to in the text), machine learning (for example, correlating social data with other data sources) and data mining (finding frequent patterns, finding sequences). Data sources from social networks include (a minimal ingestion sketch is provided after this list): ● Twitter: composed of real-time messages from Twitter. The data are extracted through API [R04] requests, which fetch three types of tweets: i. tweets crawled from accounts related to traffic information. These accounts were manually selected and there is no guarantee that the tweets are actually about traffic status. Only a small fraction of such messages is geo-located; ii. tweets with keywords related to traffic. The keywords were manually selected and there is no guarantee that the tweets are actually about traffic status. Only a small fraction of such messages is geo-located;


iii. all geo-located tweets from Brazil. This includes tweets related to all topics, therefore only a small fraction of the messages is related to traffic and mobility. ● Events from Facebook: contains information about the events on Facebook occurring within a 1000-meter radius of each sub-district centroid in Curitiba. The dataset does not cover all events of the city. It includes information about the event location and popularity (e.g., number of invited people, interested people and people attending the event).
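As an illustration of the kind of filtering applied to raw tweets before they become level-1 data, the sketch below keeps only geo-located messages whose text matches a set of traffic-related keywords. The keyword list and the reduced record layout are illustrative assumptions; the field names (text, coordinates, id_str, created_at) follow the public Twitter API JSON format.

```python
# Minimal sketch: filtering raw tweet JSON for traffic-related, geo-located
# messages before level-1 storage. Keyword list and output layout are
# illustrative assumptions, not the project's actual ETL code.
import json

TRAFFIC_KEYWORDS = {"transito", "trânsito", "congestionamento", "acidente", "onibus", "ônibus"}

def is_traffic_related(tweet: dict) -> bool:
    text = tweet.get("text", "").lower()
    return any(k in text for k in TRAFFIC_KEYWORDS)

def extract_level1_record(raw_line: str):
    """Return a reduced record (or None) from one raw tweet JSON line."""
    tweet = json.loads(raw_line)
    if tweet.get("coordinates") is None or not is_traffic_related(tweet):
        return None
    lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: lon, lat
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "text": tweet["text"],
        "lon": lon,
        "lat": lat,
    }
```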

4.2. Derived Data Derived data are produced by the execution of the algorithms that implement UC2 and UC3, using the pre-processed integrated data or even other derived data. The main aim of UC2 - Descriptive Models - is to extract and characterize trajectories and to obtain correlations among them and other associated metadata. Such models will characterize trajectories, deriving aggregated and non-aggregated statistics about routes, and will identify characteristic probability distributions for the various statistics, as a strategy to extrapolate the findings and the underlying phenomena. To build the Descriptive Model, data from different sources will be consumed, including stationary, spatial, environmental and social data. The resulting Descriptive Model will be stored in such a way as to favor fast queries (including queries with geo-spatial operators) and incremental updates. Any storage technology used in analytics solutions is a good candidate for the Descriptive Model storage, but a more “traditional” record- or document-oriented database will also be used to store metadata. Model updates may be performed by a batch or stream execution service. Some data mining techniques, such as frequent pattern mining, can lead to an explosion in the amount of derived data, and this should be a concern if such a technique is used to realize a scenario in UC2. The main goal of UC3 - Predictive Models - is to build applications for the recommendation of routes that will help citizens to check the mobility conditions and give hints on how to better reach destinations with regard to multiple criteria, such as time, predicted traffic (and hence stress), pleasantness to walk, sights, and interestingness (taken from social media). The project will implement two tasks: classification and regression. Predictive models will be constructed using all the previous kinds of data, using a batch-model execution service. Predictive Models can have their own file format and store data in a distributed file system, or use another WP4 storage service such as a NoSQL database [R05].

4.3. Platform-level Data

4.3.1. QoS Monitoring Data The QoS Monitoring System, defined in WP3, collects, processes, stores and displays information about metrics, alarms and logs related to applications or infrastructure components. A specific component, the metrics and alarms database, organizes data in a structure optimized to store time series at different time scales. The QoS Monitoring System contains other components that may exploit WP4 in order to scale. Such components include the analytics engine, the anomaly and prediction engine, and the transform and aggregation engine. The analytics engine consumes alarm state transitions and metrics from the message queue and performs anomaly detection and alarm clustering/correlation. The anomaly and prediction engine evaluates predictions and anomalies and generates predicted metrics as well as anomaly likelihoods and anomaly scores. The transform


and aggregation engine transforms metric names and values, for example through delta or time-based derivative calculations, and creates new metrics that are published to the message queue (optionally). WP4 components also generate metrics and logs at the application level. A detailed description of the monitoring system is available in deliverable D3.1.
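As a simple illustration of the transformations performed by the transform and aggregation engine, the following sketch derives a rate metric (a time-based derivative) from a monotonically increasing counter. The sample format and the example values are purely illustrative.

```python
# Minimal sketch: deriving a rate metric (time-based derivative) from a
# monotonically increasing counter, as a transform/aggregation engine would.
def counter_to_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), ordered by time.
    Returns a list of (timestamp_seconds, rate_per_second)."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 > t0:
            rates.append((t1, (v1 - v0) / (t1 - t0)))
    return rates

# Example: bytes written by a storage service, sampled every 30 s
samples = [(0, 0), (30, 6_000_000), (60, 15_000_000)]
print(counter_to_rate(samples))  # [(30, 200000.0), (60, 300000.0)]
```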

4.3.2. Data Quality as a Service Data The Data Quality service provides additional information about the data sources managed by the platform. In general, data quality aims to evaluate the suitability of data for the processes and applications in which they are involved. Such “suitability” is assessed by means of a set of quality dimensions, whose selection and definition are partially dependent on the context of use. In detail, the data quality service can be considered as composed of a module that aims (i) to provide general information about the data, such as value ranges, the uniqueness degree of each attribute and the number of represented objects, and (ii) to evaluate specific data quality dimensions, e.g., accuracy, completeness and timeliness. The output is a set of metadata that can be used to: ● Trigger data cleaning activities: the definition of quality levels and related acceptability thresholds can help in activating activities that aim to improve data; ● Let the users be aware of the quality level of the accessed data: data quality metadata can be shown together with the application results to let the users understand the trustworthiness of the information they are looking at; ● Drive the data integration: data quality levels can be used as a driver in the selection of equivalent or similar sources.


5. BIG AND FAST DATA ECO-SYSTEM GENERAL ARCHITECTURE The general architecture of the Big and Fast Data eco-system is shown in Figure 3. This diagram highlights the main building blocks of the system, the links among the eco-system's internal components and the relationships with the other work packages. It provides a general conceptual description of the big data eco-system, outlining the logical components that will be involved to address the requirements specified in the previous sections. The architectural view, along with the end-user requirements and the data sources description, has been used to drive the big data tools analysis phase, providing a basis for tool identification.

Figure 3. WP4 general architecture

The WP4 block contains the main components required by the big data eco-system: ● External data sources: the raw data described in Section 4.1 that are integrated into the system. QoS monitoring data are also included at this level, as a platform-level data source, since they can undergo the same type of pre-processing required for the other sources; ● Data Storage: it includes several types of databases and storage systems necessary for efficiently storing and handling: (i) pre-processed data coming from the external data sources, (ii) derived data, (iii) data concerning monitoring metrics related to infrastructure/application QoS and information from DQaS describing the quality of the data being integrated. These systems include relational databases, OLAP data warehouses, geo-spatial databases, NoSQL databases, and distributed storage; ● Fast and Big Data technologies: a macro-block including: ○ Data Ingestion and Streaming processing: it comprises the tools to (i) ingest and synchronize the data stored in the system with the external data sources and (ii)


continuously process streams of data. This block mainly provides the tools to run pre-processing and ETL steps before loading the data into the storage systems; ○ Data Access: it consists of the technologies that allow selection, filtering and querying of the data stored in the system. It also provides functionalities to save and access derived data produced by the descriptive and predictive model services. These features are exploited both by external users and by the programming models; ○ Data Analytics and Mining systems: they provide the engines for data processing in order to perform data analytics and mining tasks. In particular, they include a toolbox with a set of analytics routines, mining functions and machine learning algorithms required for the execution of the processing tasks. The programming models, as well as the DQaS and Entity Matching services, will use the features of the technologies adopted for this block. The architecture view also displays the relationships with the other work packages: ● Programming models (WP5), developers, system administrators, as well as end-user applications (WP7), use the tools provided by the big data eco-system to access the data and run data analytics and machine learning models; ● Big data applications interact with the QoS cloud infrastructure services (WP3) to provide information (e.g., metrics, alarms, logs) to the cloud infrastructure, useful to elastically adjust the resources on which the applications are being run in order to meet the expected QoS levels based on the current workload; ● Security (WP6) is orthogonal to the whole architecture and defines the measures and technologies required in several blocks of the big data eco-system, mainly referring to privacy and protection of the data managed by the eco-system, and authentication and authorization across the various big data tools involved.


6. BIG AND FAST DATA ECO-SYSTEM DESIGN This section focuses on the design of the big and fast data eco-system. The main components of the architecture and their relationships, including those with other WPs, are thoroughly described (Section 6.1). Moreover, UML [R06] sequence diagrams derived from a set of user stories are provided to describe how the components interact with each other (Section 6.2). Finally, the QoS metrics that could be exposed (Section 6.3), the data management API (Section 6.4) and some security aspects to be addressed (Section 6.5) are also illustrated.

6.1. Architectural Diagram The detailed architectural diagram of the big and fast data eco-system is displayed in Figure 4. This view provides an insight into the main blocks defined in the WP4 general architecture (Figure 3) and has been modeled taking into account the end-user requirements and the data source needs for the implementation of the use cases. It highlights the relationships among the big data eco-system components and the other work packages of the EUBra-BIGSEA platform.

Figure 4. Big and Fast Data eco-system detailed architecture

In particular, the programming models (defined in the context of WP5), along with developers, administrators and applications (defined in the context of WP7), represent the main users of the eco-system. They will exploit the big data technologies to access, query and process the external data sources integrated into the system.


Security solutions, identified in the context of WP6, should be exploited to efficiently handle AAA on the different blocks defined in the picture and to properly address the privacy and protection of sensitive data. These solutions could be required at different levels in the eco-system. QoS cloud infrastructure services, designed in the context of WP3, are also orthogonal to the architecture, since most big data components will interact with these services, providing feedback to proactively adjust the resources on which the applications are being run in order to meet the expected QoS levels.

The various data sources analyzed in Section 4.1 compose the External Data Sources macro-block, whereas the pre-processed, derived, QoS and DQaS data are integrated and physically stored within the eco-system storage layer. The Big Data Technologies macro-block provides the features to handle the data life-cycle, including ingestion, streaming processing, pre-processing, access, selection through queries, calculation of statistics, metadata management, and computation and storage of data derived from machine learning and analytics operations. The storage layer, along with the big data block, represents the actual set of components that define the big and fast data eco-system. The following subsections provide a detailed description of the main blocks and aspects defined in the architecture. It is worth mentioning that additional internal modules addressing specific operations could be developed during the project lifetime.

6.1.1. Data Storage As described in Section 4.1, the set of available external data sources is very heterogeneous in terms of data format, volume and frequency. The big data eco-system is going to integrate these data source types, exploiting several storage technologies with different data models, capable of dealing with the data variety and volume and of allowing fast access to the data. Moreover, the storage layer will also handle the derived data and platform-level data that can be produced by the eco-system components. Various levels of data have been defined according to the number of processing tasks applied to the related data sources. Figure 5 shows the flow of data transformation and the levels of the data: ● Level-0 data: comprises the raw data from the external data sources described in Section 4.1. These are grouped in the External Data Sources block, which also includes the monitoring data gathered by the QoS cloud infrastructure service (see Figure 4); ● Level-1 data: includes the integrated data that is stored into the system after the execution of pre-processing steps (e.g., ETL). The Entity Matching service can be exploited during this phase to match entities in different external sources and produce additional data. Level-1 data are then used for analysis and mining; ● Level-2 data: consists of the data stored into the system derived from the integrated level-1 data. These data are produced as a result of the execution of the descriptive and predictive models on level-1 data. Additionally, the models could also use level-2 data as their input; ● Platform-level data: includes sources required for internal use, i.e., data quality and QoS monitoring. DQaS can be executed during the pre-processing phase to identify the quality of data and annotate it. This metadata information is stored in a specific level-1 database and can be accessed to check the quality of the data stored in the system. Metrics and data for QoS can require an ETL phase, as in the case of the raw data, to integrate the information in a level-1 data warehouse that can then be used for subsequent analysis and mining operations, producing level-2 data.


Figure 5. Data sources levels

Storage technologies should be capable of dealing with the various types of level-1, level-2 and platform- level data. Technologies exploited for this data may include: ● Distributed file systems (e.g., HDFS [R07, R08]); ● NoSQL databases (e.g., MongoDB); ● Data stores for multidimensional scientific data (e.g., Ophidia [R09, R10]); ● Databases to handle geo-spatial data (e.g., PostGIS). The data storage component and the technologies employed have to address the following requirements (defined in D7.1): R1.1. The application must integrate the GIS data sources (dynamic and spatial). The integration of data should be done in a standardized way to facilitate the future integration of any other data source. The information should be updated accordingly. The data integration procedure should be clearly described, as well as the storage architecture required. R1.2. The application must integrate meteorological/climate data sources. This information should be attached to historic records and new (forecast) information should be accessible. R1.3. Metadata must be included into the application to describe the area covered, the year of acquisition of the data, the type/format of the data and any other technical specification that is necessary. R2.8. It should be possible to download aggregated results and products to be used in subsequent analysis.


R3.5. Download of aggregated results and products must also be supported.

RD.3. The infrastructure must store the data processing products, taking the necessary steps to ensure data persistence and data protection, when necessary.

6.1.2. Big Data Technologies This macro-block comprises a set of big data systems required to load, handle and process data. These can be grouped in classes of components according to the main features provided.

Data Ingestion and Streaming Processing modules These modules mainly take care of the loading and synchronization of the data stored in the eco-system, targeting in particular streams of data. They will be used in the first phase of the data management process to extract level-1 data from the external data sources, especially in the case of streaming data from social networks. Moreover, this block will also provide the features to synchronize the data stored in the storage layer with the external sources.

Several technologies, for example Apache Kafka, Apache Storm, Spark Streaming or Apache Flink, can be used for the ingestion phase and the processing of real-time streaming data. These systems can also be used to continuously run ETL pipelines (a minimal ingestion sketch is provided below). The general requirements related to this component are (defined in D7.1):
RD.1. The infrastructure must support the integration of external data from existing data sources. This integration must be complemented with methods for referencing the data in their original locations, and to pre-process and annotate the data with additional information. Metadata standards must be used when available to annotate the data.
RD.2. Automatic synchronization with original data sources must be addressed (updating the infrastructure with the latest releases of the data), considering the individual needs of each case, which range from simply discovering and downloading new data when it becomes available, to running complex data pre-processing before storing the data in the infrastructure.
RD.6. User Internet connection is a potential bottleneck for performance, especially low bandwidths as expected in field conditions. Therefore, the infrastructure should facilitate the access to the data even with poor Internet connections.
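The following minimal sketch illustrates the kind of continuous ingestion pipeline described above, using Spark Streaming with the Kafka integration available in the Spark releases contemporary to this document (spark-streaming-kafka). The topic name, broker address, output path and filtering logic are illustrative assumptions, not project-defined interfaces.

```python
# Minimal sketch: continuous ingestion of raw tweets from a (hypothetical)
# Kafka topic with Spark Streaming, keeping only geo-located records and
# persisting them to HDFS as level-1 data.
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="tweet-ingestion-sketch")
ssc = StreamingContext(sc, 60)  # one micro-batch per minute

stream = KafkaUtils.createDirectStream(
    ssc, ["raw-tweets"], {"metadata.broker.list": "kafka-broker:9092"})

def keep_geolocated(raw_json):
    tweet = json.loads(raw_json)
    return tweet.get("coordinates") is not None

level1 = (stream.map(lambda kv: kv[1])   # Kafka value = raw JSON string
                .filter(keep_geolocated))

# Each micro-batch is written under a timestamped HDFS directory
level1.saveAsTextFiles("hdfs:///bigsea/level1/tweets/batch")

ssc.start()
ssc.awaitTermination()
```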

Data Access and Query modules Data access and query modules provide the features to access the data available in the system, search and filter the information, perform basic aggregations and store the results of the pre-processing phase and of the analytics/mining computations. These systems will allow the execution of specific types of queries for the various integrated data sources. Access to the metadata related to the data will also be available. These features will be directly exploited by the application developers and administrators to get access to the information available in the eco-system. Some examples of technologies that could potentially be included in this block are Ophidia, Apache HBase, PostGIS and MongoDB. The requirements addressed by this component are listed below (provided in D7.1), followed by a minimal query sketch:


R1.5. An API must be exposed to deal with the storage resources to authenticate, populate data, retrieve and filter data, update data. Same operations for metadata. Data access should have a short latency (near real-time access).

R2.5. The service must facilitate the end-user to select the data sources, temporal and spatial scales and output format for historical time-series analysis. R2.6. The service must facilitate the end-user to select an area of interest (e.g., the Batel District in Curitiba, Brazil); for that area the available data (e.g., GIS stationary data) must be retrieved from the application and the end-user must have the possibility, through the service, to select derived information for that area. Trajectory analysis algorithms may be implemented in an incremental way, therefore processing just the recently available data, as the whole dataset will not fit in memory. R3.6. The user interface must facilitate the end-user to select the data sources, temporal and spatial scales and output format for historical time-series analysis. RD.5. The infrastructure must facilitate end-user access to the data, providing the most appropriate protocols and data formats to enable developers with the necessary means to build usable user interfaces. Data must be queryable at variable granularity.
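To make the area-of-interest selection of R2.6 concrete, the sketch below issues a spatial query against a PostGIS store from Python (psycopg2), returning recent bus positions falling inside a named district polygon. The table and column names, the district table and the connection parameters are illustrative assumptions.

```python
# Minimal sketch: area-of-interest selection (cf. R2.6) against a PostGIS
# store. Table names, columns and credentials are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="postgis-host", dbname="bigsea",
                        user="reader", password="secret")
cur = conn.cursor()

cur.execute("""
    SELECT b.vehicle_id, b.recorded_at, ST_AsText(b.geom)
    FROM   bus_positions AS b
    JOIN   districts     AS d ON ST_Contains(d.geom, b.geom)
    WHERE  d.name = %s
      AND  b.recorded_at >= now() - interval '1 hour'
""", ("Batel",))

for vehicle_id, recorded_at, wkt in cur.fetchall():
    print(vehicle_id, recorded_at, wkt)

cur.close()
conn.close()
```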

Data Analytics and Mining toolbox The toolbox is the repository of the machine learning algorithms, analytics operators, array-based primitives and scientific libraries available in the eco-system for user analysis. It will also feature a marketplace for user communities. Data mining algorithms will include, for example, regression, classification, clustering and correlation, whereas data analytics operators will include subsetting, reduction, aggregation and intercomparison. Scientific libraries will allow complex mathematical and statistical computations. Spark MLlib, Ophidia primitives and Ophidia operators are some examples that fall within this component.
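As an example of how a toolbox algorithm could be invoked, the following sketch uses k-means clustering from Spark MLlib (RDD-based API) to group geo-located level-1 records into candidate hot spots. The input path and record layout are illustrative assumptions.

```python
# Minimal sketch: k-means from Spark MLlib applied to geo-located records,
# e.g. to find hot spots of bus card usage. Input path and record layout
# (card_id,timestamp,lat,lon) are illustrative assumptions.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="hotspot-clustering-sketch")

points = (sc.textFile("hdfs:///bigsea/level1/bus_cards/*.csv")
            .map(lambda line: line.split(","))
            .map(lambda f: [float(f[2]), float(f[3])]))  # (lat, lon) features

model = KMeans.train(points, k=20, maxIterations=20)
for center in model.clusterCenters:
    print(center)   # candidate hot-spot coordinates (level-2 data)
```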

Data Analytics and Mining modules The modules include the processing engines and computing frameworks to run data mining and analytics tasks on big volumes of data. The block includes several submodules, like Entity Matching (EM), Data Quality as a Service (DQaS), the predictive and descriptive model services, and On-Line Analytical Processing (OLAP). These submodules will execute their computations through the engines and frameworks, exploiting a set of libraries and functionalities available in the Data Analytics and Mining toolbox. Several technologies could be used to implement the analytics and mining modules, such as Apache Spark [R11], Hadoop MapReduce [R12, R13], Ophidia, Apache Hive [R14] and Druid.

6.1.3. Entity Matching Service
The Entity Matching (EM) task, i.e., the problem of identifying records that refer to the same real-world entity, is known to be challenging due to its pair-wise comparison nature, especially when the datasets involved in the matching process have a high volume (big data). Since the EM task is of critical importance for data cleaning and integration, e.g., to find duplicate points of interest in different databases, studying how EM can benefit from modern parallel computing programming models, such as Apache Spark (Spark), has become an important demand. For this reason, the EM service, to be provided by the main API of the WP4 architecture, consists of a bag of tools and functions that can process the EM task (e.g., geo-matching) in parallel by using Apache Spark and Hadoop's MapReduce (MR). The EM service will attend to the requests from applications/systems interested in submitting EM tasks to the cluster environment. To this end, the service will establish a connection to the Hadoop eco-system (WP3) to perform the necessary operations, such as submitting artifacts (e.g. datasets) to HDFS or starting the execution of MR and Spark jobs.
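To make the pair-wise nature of the task concrete, the sketch below compares point-of-interest names from two hypothetical datasets with a simple token-based similarity on Spark; the similarity measure, threshold and file paths are illustrative assumptions and not the actual EM service implementation.

```python
from pyspark import SparkContext

sc = SparkContext(appName="em-sketch")

def tokens(name):
    # Normalize a point-of-interest name into a set of lowercase tokens
    return set(name.lower().split())

def jaccard(a, b):
    # Jaccard similarity between two token sets
    return float(len(a & b)) / len(a | b) if a | b else 0.0

# Hypothetical inputs: "id;name" records from two sources
pois_a = sc.textFile("hdfs:///bigsea/source_a/pois.csv").map(lambda l: l.split(";"))
pois_b = sc.textFile("hdfs:///bigsea/source_b/pois.csv").map(lambda l: l.split(";"))

# Naive pair-wise comparison (a blocking step would be used at scale)
matches = (pois_a.cartesian(pois_b)
                 .map(lambda pair: (pair[0][0], pair[1][0],
                                    jaccard(tokens(pair[0][1]), tokens(pair[1][1]))))
                 .filter(lambda t: t[2] >= 0.8))  # illustrative threshold

matches.saveAsTextFile("hdfs:///bigsea/level2/poi_matches")
sc.stop()
```

In practice, a blocking or indexing strategy would replace the full cartesian product to keep the number of comparisons manageable.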

6.1.4. Data Quality as a Service
The Data Quality (DQ) task assesses the quality level of the sources, addressing the veracity issues of the big data scenario. The DQ service annotates the sources with metadata that provide knowledge about the reliability and usefulness of the data values involved in the various platform applications. The data quality values will be calculated periodically, or the service can be triggered by applications/systems interested in updated quality information. In order to meet the big data requirements related to velocity, the algorithms will be implemented using Apache Spark, which supports parallel computing. DQ metadata are stored in the DQ repository included in the Platform-level data.
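As a hedged illustration of the kind of quality metadata the service could compute, the sketch below derives a simple completeness score (fraction of non-empty fields) for a hypothetical delimited dataset with Spark; the metric, schema and paths are assumptions for illustration only.

```python
from pyspark import SparkContext

sc = SparkContext(appName="dq-completeness-sketch")

NUM_FIELDS = 6  # assumed number of columns in the source

def non_empty_fields(line):
    # Count non-empty fields in a semicolon-separated record
    return sum(1 for f in line.split(";") if f.strip())

records = sc.textFile("hdfs:///bigsea/level1/bus_positions.csv")

filled = records.map(non_empty_fields).sum()
total = records.count() * NUM_FIELDS

completeness = filled / float(total) if total else 0.0
# The resulting score would be stored as DQ metadata in the DQ repository
print("completeness=%.3f" % completeness)

sc.stop()
```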

6.1.5. Extraction, Transformation and Load
As depicted in the data source section (Section 4), the eco-system can potentially include several heterogeneous data sources. These data could require some pre-processing steps in order to transform or normalize them before loading them into the (level-1) storage: streaming data could require tools for continuous pre-processing; some data sources could also require anonymization techniques to protect personal information and comply with the data owner policies; data sources providing information from the same domain could require various types of transformation to conform the data to a common view. Extraction, Transformation and Load (ETL) procedures will mainly exploit the data ingestion and streaming processing components and, to a lesser extent, the analytics and mining functionalities (optionally with some security support regarding privacy). For example, the Entity Matching service could be exploited during the ETL phase to identify potential matches in records from different sources, whereas streaming processing could be used to filter and transform streams of social network data. Spatial data could require location filtering, format changes and projection to a standard coordinate system. Data quality information from DQaS will provide a solid basis for tagging the different data sources in terms of quality. DQaS could run on the data during the ETL phase to annotate them with quality metadata.
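The record-level transformations mentioned above could look like the following sketch, which filters invalid GPS fixes, maps them to a common level-1 schema and pseudonymizes the user identifier; the field names, the bounding box and the hashing choice are illustrative assumptions.

```python
import hashlib
import json

# Illustrative bounding box for an area of interest (Curitiba region, approximate)
LAT_MIN, LAT_MAX = -25.7, -25.2
LON_MIN, LON_MAX = -49.5, -49.1

def is_valid(record):
    # Keep only fixes that fall inside the area of interest
    return LAT_MIN <= record["lat"] <= LAT_MAX and LON_MIN <= record["lon"] <= LON_MAX

def to_level1(record):
    # Map a raw (level-0) record to the common level-1 view and pseudonymize the user id
    return {
        "user": hashlib.sha256(record["user_id"].encode("utf-8")).hexdigest(),
        "ts": record["timestamp"],
        "lat": round(record["lat"], 5),
        "lon": round(record["lon"], 5),
    }

def etl(raw_lines):
    # raw_lines: iterable of JSON strings coming from the ingestion layer
    for line in raw_lines:
        record = json.loads(line)
        if is_valid(record):
            yield to_level1(record)
```

The same functions could be plugged into a Spark job or a Storm bolt, keeping the transformation logic independent of the chosen engine.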

6.2. Sequence Diagrams
Sequence diagrams are used to represent the interactions between objects, and their order, along the lifeline. In this section, the objects are the high-level architectural components defined in Section 6.1. The use cases and all their user stories are defined in document D7.1. Starting from these user stories, for each of the use cases defined in D7.1 some scenarios and the corresponding UML sequence diagrams have been provided, in order to highlight the interactions among the WP4 components.

6.2.1. User Stories for UC1: Data Acquisition
The user stories in the data acquisition and integration use case focus on retrieving the data, the periodicity and mechanisms for retrieving the data, the basic filtering of the data, and raw and filtered data visualization.


This use case is mainly intended for data curators and data scientists who have to prepare and understand the main features of the data sources.

Scenario 1.1: Data ingestion
This scenario has been derived from US1.1 (see D7.1). The data load process retrieves the data from the original sources (level-0 data), including metadata, transforms them into a common format (level-1 data) and stores them into a database system. In the case of streaming data, a stream processing technology consumes, transforms and stores the data into a proper database.

Figure 6. Sequence diagram for scenario 1.1

Scenario 1.2: Data selection and filtering This scenario has been derived from user story US1.2. A filtering process, invoked by a user/developer, selects data from storage according to a particular filter. For example, in the case of dynamic spatial data, filters can include transportation lines, geographic zones, specific users or data periods.

Figure 7. Sequence diagram for scenario 1.2


6.2.2. User Stories for UC2: Descriptive Models
The user stories in the descriptive models use case focus on the analysis of trajectories and the associated variables that could affect their distribution (weather, date and time, social network information, etc.). The Descriptive Models are built as a service (DM service), targeting data scientists working on traffic management to discover correlations and build higher-level services.

Scenario 2.1: Trajectory extraction and analysis
This scenario has been derived from user story US2.1. The DM service queries the storage to retrieve (level-1) data for trajectory and statistical analysis, and stores the result of the analysis (level-2 data) in a database so that it can be reused for further statistical analysis. Queries may specify a period of time or a geographical region to filter the trajectories.

Figure 8. Sequence diagram for scenario 2.1

Scenario 2.2: Trajectory clustering This scenario has been derived from user story US2.3. The DM service requests to perform a clustering on trajectory data (level-1 data) available in the storage. The service can specify the type of clustering algorithm and the parameters to be used. The next diagram (Figure 9) is a specialization of the previous one for trajectory clustering.


Figure 9. Sequence diagram for scenario 2.2

6.2.3. User Stories for UC3: Predictive Models
The user stories in the Predictive Models use case deal with the training of the predictive models on the descriptive data obtained in the previous use case. These user stories are apparently more computing-bound than data-bound. Prediction will also include the projection of models, which is not computing intensive but should work in interactive time.

Scenario 3.1: Prediction model training
This scenario has been derived from user story US3.1. The PM service requests the training of a predictive model with the most recent data available in storage (level-2). This request can include the model type (e.g. random forest, recurrent ANN), the training procedure (e.g. 10-fold cross-validation) and the training data (e.g. the last month) to be considered. After the training phase, the model is stored in the system (as level-2 data) to be accessed later to run predictions.

Figure 10. Sequence diagram for scenario 3.1


Scenario 3.2: Execute prediction

This scenario has been derived from user story US3.2. The PM service requests a prediction to be made using a previously trained model (level-2 data). The feature data and the model are loaded from the storage.

Figure 11. Sequence diagram for scenario 3.2

6.2.4. Other interactions between WP4 components
During the requirement analysis, new interactions between WP4 components were identified. Such interactions are related to background processing aimed at supporting the realization of the previously identified use cases.

Scenario 4.1: Data acquisition and online streaming processing for data mining
Figure 12 shows a data acquisition and online streaming processing scenario for data mining, where a data producer ingests data (level-0) into the processing pipeline. In this interaction, data are continuously processed as soon as they arrive at the data ingestion component. The process is started by time (scheduled); the data ingestion component then reads data (zero to many records) from the producers and stores them in a message queueing system. Stream processing components consume messages from the queue, each containing one processing item (one record). Some operations can require the loading of different predictive models (level-2), for instance to classify a processing item. Other available operations include filter (removes an item if it is considered invalid or irrelevant to the next stages in the pipeline), map (transformation of a processing item, for example by applying a predictive model) and reduce (aggregation of data). Finally, the derived data (level-1) are updated and, in some cases, the descriptive models (level-2) are also updated (if an aggregation was performed). This interaction is similar to the user story US1.1 defined in D7.1. However, the user story US1.1 in D7.1 only mentions stationary data and a process started by a user, whereas Figure 12 also includes other types of data (i.e. social data) and represents a synchronous automatic batch process.


Figure 12. Sequence diagram for scenario 4.1

6.3. Exposed QoS metrics
The QoS IaaS (WP3) focuses on the deployment and configuration of the infrastructure where the data analytics jobs run, and on the execution of elasticity rules to adapt the resources. QoS profiles for the applications are defined in advance by measuring the applications' performance. The performance of the data analytics and mining applications will be followed by the monitoring system to request additional resources when needed. WP4 applications will provide metrics, alarms and logs to the WP3 monitoring system so that it can estimate the performance level of the current execution. In this way the service can provide proactive elasticity in order to adjust the system and guarantee the QoS of the applications. WP4 components provide different application-level QoS metrics related to their function in the system. It is difficult to enumerate all QoS metrics at this early stage of the project, because not all technologies have been selected, nor are all possible uses of the metrics completely clear. Generically, we can define a set of common metrics that might be relevant for each type of WP4 component. Future documentation will describe the specific metrics used in the project and their purpose.

6.3.1. Java Virtual Machine metrics Many WP4 technologies are implemented in programming languages targeting the Java Virtual Machine (JVM). These technologies share a common set of metrics related to the JVM itself. Table 4 contains some examples of metrics available for the JVM. For a more complete list of possible JVM metrics, see [R15]. For each metric, a set of tags or naming convention is needed to identify the JVM process generating it.


Metric Description

MemHeapUsedM Current heap memory used in MB

MemHeapMaxM Max heap memory size in MB

GcCount Total GC (Garbage Collector) count

GcTimeMillis Total GC time in msec

Table 4. Some examples of JVM metrics available

6.3.2. Data Storage Metrics
Data storage metrics are related to the distributed file system. Metric records contain tags such as HAState (high availability state) and Hostname (e.g. data node or name node hostname) as additional information along with the metrics. The metrics provide information about the cluster, block and file state, as well as the used, available and total space in the distributed file system. Some potential metrics are shown in the table below.

Metric Description

CapacityTotal Current raw capacity of data nodes in bytes

CapacityUsed Current used capacity across all data nodes in bytes

CapacityRemaining Current remaining capacity across all data nodes in bytes

CorruptBlocks Current number of blocks with corrupt replicas

Table 5. Distributed file system metrics

6.3.3. Data Access Metrics
Data access metrics are used to monitor database size growth and concurrent user sessions in the data storage technologies used in the project. For a QoS monitoring system, they can be useful to identify situations where a data storage system needs to be scaled, for instance through data replication and sharding.

Metric Description

DatabaseSize Database size, in bytes

ConcurrentUsers Number of concurrent users connected to server

Table 6. Data access metrics

6.3.4. Data Ingestion and Streaming Processing Metrics
Data ingestion is the process of obtaining and importing data for immediate use or for storage in a database. In the context of the WP4 components, ingested data will be consumed by stream processing components and temporarily stored in processing queues. Some WP4 components may require different processing times. For example, it takes less time to execute an algorithm that simply classifies a record using a precomputed model than to execute a route optimization. Message queues help these tasks to operate efficiently by offering a buffer layer. This buffer controls and optimizes the speed of the data flow through the system. Metrics in this category should have tags or use naming conventions to identify the queue associated with each metric. Some potential metrics are listed below:

Metric Description

QueueLength Quantity of items waiting in the queue for processing

ProducerRequestRate How many items per second are being stored in the queue

MessagesPerSec How many items per second are leaving the queue (processed)

WaitingAck Quantity of items waiting for consumer acknowledgement (committed)

Table 7. Data ingestion and streaming processing metrics

6.3.5. Data Analytics and Data Mining Metrics
For the QoS monitoring system, data analytics and data mining algorithms can be seen as processes or jobs running in the WP3 infrastructure. This abstraction allows the separation of metrics related to QoS from those related to the application. QoS metrics are used by other components of WP3 to model static and pro-active policies for horizontal and vertical elasticity (T3.2 and T3.4). The required metrics are listed below. Complementary information should be provided as tags, or prepended/appended to the metric name, to segment jobs into different types, categories and domains.

Metric Description

JobExecutionTime Job execution time in seconds

JobTotalOfTasks Total number of tasks of a job

JobStartTime Job start time (timestamp)

Table 8. Data analytics & mining metrics

6.3.6. Data Mining and Analytical Toolbox
We do not foresee any relevant metric for the tools and libraries in this toolbox with regard to QoS monitoring. All metrics are very specific, algorithm and technology dependent, and, from the QoS monitoring perspective, could be replaced by one or more system metrics such as CPU or memory usage. Very specific metrics should be stored in another type of repository, not in the QoS monitoring system.

6.4. Data Management API
The data management API allows administrators, developers and programming models to manage data sources in the project infrastructure. Users can upload, download, create, update, delete and query both data and their associated metadata. This API is mainly intended to address requirement R1.5 from D7.1. The API is based on REST principles, so programs interacting with it can be developed in different programming languages using the HTTP/HTTPS protocols. A command line tool that interacts with the API may be developed.


Requests must be authenticated using a client token. The first version of the API deals only with simple token authentication; the token can be obtained by an anonymous request containing a system user and her/his password as parameters. Future versions may use more sophisticated authentication processes, such as OAuth. In this document, the data management API is described at a high level. We foresee that the API will evolve during the project and that a dedicated tool, such as Swagger [R16], will be used to document it.

6.4.1. Resources URI
In addition to utilizing the HTTP verbs appropriately, resource naming is the most important concept to grasp when creating an understandable, easily leveraged service API. In a RESTful API, a resource is identified by a URI (Uniform Resource Identifier). The final version of the API may change, but the general URI format is protocol://server:port/v1.0/resource, where protocol is http or https, and server and port are the API HTTP server address and listen port, respectively. Notice that v1.0 identifies the current API version.

6.4.2. Authentication Calls
Requests to the API are authenticated using a token. To obtain this token, a call to the authentication resource is required. This call must provide a valid user and password as parameters in order to receive a valid token. The authentication parameters are checked against the services provided by the WP6 infrastructure. Authentication calls follow the definition below:

Resource path: /v1/authenticate
HTTP method: POST
Parameters: user, password
Return: valid authentication token or error status
Description: Authenticates the user and ensures he/she can manage data.
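A hedged sketch of how a client might obtain a token from this call is shown below, using Python and the requests library; the server address and the exact field names of the response are assumptions, since the API is only defined at a high level here.

```python
import requests

# Hypothetical API endpoint of the data management service
API_BASE = "https://bigsea-api.example.org/v1"

# Obtain an authentication token with user and password parameters
resp = requests.post(API_BASE + "/authenticate",
                     data={"user": "data-admin", "password": "secret"})
resp.raise_for_status()

# The token field name is an assumption; the real payload may differ
token = resp.json().get("token")
print("Obtained token:", token)
```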

6.4.3. Database Management Calls
These calls allow users to create, update, delete, edit, upload and download data and metadata stored in the level-1 and level-2 WP4 infrastructure. For all calls to succeed, the caller (user) must have the required permissions.

Resource path: /v1/databases
HTTP method: POST
Parameters: auth. token, database id, metadata, other metadata
Return: status of the operation (success, failure)
Description: Creates a new empty database using metadata information.

Resource path: /v1/databases
HTTP method: GET
Parameters: auth. token, filter parameters
Return: status of the operation; if success, returns the list of databases
Description: Lists all databases visible to the authenticated user according to the filter parameters.

Resource path: /v1/databases/id
HTTP method: PATCH
Parameters: auth. token, database id, metadata, other metadata
Return: status of the operation (success, failure)
Description: Updates an existing database information or metadata using the provided parameters.

Resource path: /v1/databases/id
HTTP method: DELETE
Parameters: auth. token, database id
Return: status of the operation (success, failure)
Description: Drops an existing database.

Resource path: /v1/databases/grant/id
HTTP method: POST
Parameters: auth. token, database id, permission, user
Return: status of the operation (success, failure)
Description: Grants a permission associated to the database to a user.

Resource path: /v1/databases/revoke/id
HTTP method: POST
Parameters: auth. token, database id, permission, user
Return: status of the operation (success, failure)
Description: Revokes a permission associated to the database from a user.

Resource path: /v1/databases/upload/id
HTTP method: POST
Parameters: auth. token, database id, database content, action (replace, append)
Return: status of the operation (success, failure)
Description: Uploads a compressed file to the API server and imports it into the database, replacing or appending to the existing records.

Resource path: /v1/databases/export/id
HTTP method: POST
Parameters: auth. token, database id, target
Return: status of the operation (success, failure)
Description: Exports a database to a target (existing database or web download).
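The sketch below strings two of these calls together (create a database, then upload a compressed file into it); it assumes the token from the previous example and illustrative parameter names, since the exact request encoding is not fixed at this stage.

```python
import requests

API_BASE = "https://bigsea-api.example.org/v1"   # hypothetical server
token = "..."                                     # obtained from /v1/authenticate

# Create a new, empty database described by some metadata
create = requests.post(API_BASE + "/databases",
                       data={"token": token,
                             "id": "bus_gps",
                             "metadata": "GPS positions of Curitiba buses"})
create.raise_for_status()

# Upload a compressed file and append its records to the new database
with open("bus_gps_2016-06.csv.gz", "rb") as payload:
    upload = requests.post(API_BASE + "/databases/upload/bus_gps",
                           data={"token": token, "action": "append"},
                           files={"database_content": payload})
upload.raise_for_status()
print(create.json(), upload.json())
```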

6.5. Security Aspects
The main security concerns of the big and fast data eco-system are related to user authentication/authorization for the various big data technologies, as well as to data privacy. In particular:
I. each eco-system component and, in turn, each technology/software implementing the component, may require user authentication/authorization in order to identify the user, his/her role and privileges, and grant access to the data stored in the system. Some tools, on the other hand, may not provide any features for user authentication. Hence, since the eco-system is going to integrate several different big data and storage technologies, user authentication and authorization should transparently deal with the different modes provided by the various technologies in the eco-system. To this end, a system that transparently authenticates and authorizes a user across the different components of the eco-system would represent a preferable solution;
II. data coming from external sources, or derived from them through the execution of data mining and analytics, could contain sensitive information that must be handled properly to avoid data breaches and unintentional data disclosure. Additionally, even though raw data may not contain sensitive information, data derived from them could expose it. For example, in the case of descriptive models, derived data describing trajectories could be used to infer a user's habits. Data privacy and protection measures (e.g. anonymization, obfuscation, encryption, etc.) are therefore required to avoid these types of threats. Privacy measures should also try to preserve, as much as possible, the original information content, reducing the loss of potentially useful information.

As depicted in Figure 4, security features could be required in two different parts of the WP4 eco-system: when a data source is ingested and when it is accessed, queried or processed. Data policy constraints and sensitive information in the external data sources could require changes to be applied to the data before storing them into the eco-system. These policies, defined by the source provider, could also define the conditions to access the data. Moreover, sensitive information could be inferred when accessing the data or when applying particular types of mining and analytical operations. This should also be taken into account when storing derived data. Security threats and possible solutions are described in more detail in deliverable D6.1. The requirements of the big data eco-system related to security are (as defined in D7.1):
RD.3. The infrastructure must store the data processing products, taking the necessary steps to ensure data persistence and data protection, when necessary.
RD.4. The infrastructure must provide access to authorized applications to access and process the data, supporting the application data processing model.
RA.2. The infrastructure must support end-user authorization for accessing the data and the applications deployed with the infrastructure.


7. TOOLS EVALUATION
The big data world comprises a wide range of software and technologies that provide, among others, OLAP, streaming processing, parallel computing frameworks, NoSQL databases, distributed storage, machine learning libraries, data analytics, job scheduling, system management and visualization. Hence, it is very important to identify, in this crowded environment, the right set of tools that could potentially fit the project needs. To this end, Section 7 is organized as follows: first, a procedure to describe the different components is presented in sub-section 7.1, then a complete list of tools is provided in sub-sections 7.2 (Data Storage), 7.3 (Data Access), 7.4 (Data Ingestion and Streaming Processing), 7.5 (Data Analytics and Mining) and 7.6 (Data Mining and Analytics Toolbox). A final assessment is then provided in Section 7.7.

7.1. Procedure to describe components
Table 9 defines a template to describe the potential components to be used in the EUBra-BIGSEA big data eco-system. A comprehensive set of candidate technologies has been evaluated (for each block depicted in Figure 3) in order to identify those more suitable for the EUBra-BIGSEA project purposes. The following subsections use this template for the description and evaluation of some state-of-the-art big data technologies for the different components identified.

Identification Name and layer where the component will be applied (a component may be applied to different layers)

Type Database, processing engine, machine learning library, storage system, module, subprogram, control procedure, framework, service, etc.

License License model.

Current Version Current version and release date (at the time of the analysis).

Website URL to official website, documentation, references and to source code repositories.

Purpose Brief description of the key features that could be relevant to the project, what the component does, its main purpose, the transformation process, the specific processed inputs, the used algorithms, the produced outputs, where the data items are stored, and which data items are modified.

High Level Architecture The internal structure of the component and its inner interactions that are relevant for the project requirements.

Dependencies Other components required by the component and how this component is used by other components. Interaction details such as timing, interaction conditions (such as order of execution and data sharing), and responsibility for creation, duplication, use, storage, and elimination of components.

Interfaces and languages supported Detailed descriptions of all external and internal interfaces, as well as of any mechanisms for communicating through messages, parameters, or common data areas. This includes the programming languages supported by the system.


Security support Security mechanisms provided by the component (if any) relevant for the project, such as AAA, data encryption, ACL, etc.

Data source Type of data, within the BIGSEA project, handled or processed by this software.

Potential usage within BIGSEA How the tool can be exploited in the context of WP4 of the project.

Table 9. Template used to describe the potential components to be used in the Big Data eco-system.

7.2. Data Storage

7.2.1. HDFS

Identification Hadoop Distributed File System (HDFS)

Type Distributed file system - Distributed storage

License Apache License V2.0

Current Version Stable: V2.7.2 - 25 January 2016. Note: HDFS is integrated with Apache Hadoop.

Website Website/Documentation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html Download/Source code: http://hadoop.apache.org/version_control.html

Purpose HDFS (Hadoop Distributed File System) provides high-throughput access to application data; it runs on commodity hardware and is suitable for Big Data. Moreover, it can store files up to terabytes or petabytes in size. Its design was derived from the Google File System (GFS [R17]). Its main features include fault tolerance through data replication.

High Level Architecture HDFS uses the WORM (write-once-read-many) model, which enables data to be written to disk a single time. Scalability, robustness and accessibility make it suitable for use as a file system. The data stored on HDFS are replicated in the cluster to ensure fault tolerance. HDFS ensures data integrity and can detect loss of connectivity when a node is down. The main concepts:
● Datanode: nodes that own the data;
● Namenode: node that manages the file access operations.


[Image source: http://www.ibm.com/developerworks/br/library/wa-introhdfs/fig1.gif] There is only one Namenode in a cluster and many Datanodes. The Namenode stores information about the number of blocks, on which Datanode the data are stored, the number of data replicas, and other aspects. The Datanodes store the actual data.

Dependencies SSH connections between the nodes of the cluster, which are used to establish communication and data transfer.

Interfaces and languages supported HDFS was designed in Java for the Hadoop framework, therefore any machine that supports Java is able to run it. The HDFS Java API, the WebHDFS REST API and the libhdfs C API, as well as a Web interface and CLI shells, are available.

Security support Security is based on file authentication (user identity). In addition, HDFS supports network protocols like Kerberos (for users) and encryption (for data). The complete permission configuration is described in the documentation.

Data source HDFS is the storage source of many processing systems like Hadoop and Spark, so different data types can be stored for processing; data can be stored, for example, as "SequenceFile" files.

Potential usage within BIGSEA Increase the capability of partner tools to handle big data, considering the scalable environment with fault tolerance and data replication.
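To give a feeling for the WebHDFS REST API mentioned above, the following hedged sketch lists a directory and reads a file over HTTP; the namenode address, port and paths are illustrative assumptions.

```python
import requests

# Hypothetical WebHDFS endpoint (default HTTP port of the namenode)
WEBHDFS = "http://namenode.example.org:50070/webhdfs/v1"

# List the contents of a directory
listing = requests.get(WEBHDFS + "/bigsea/level1", params={"op": "LISTSTATUS"})
listing.raise_for_status()
for entry in listing.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["length"])

# Read a file (WebHDFS redirects the client to the datanode serving the data)
data = requests.get(WEBHDFS + "/bigsea/level1/gps_points.csv",
                    params={"op": "OPEN"}, allow_redirects=True)
print(data.text[:200])
```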

7.3. Data Access

7.3.1. PostGIS

Identification PostGIS


Type extender for PostgreSQL

License GNU GPL v2.0

Current Version v2.2.2 - Mar/2016

Website Website: http://postgis.net Documentation: http://postgis.net/documentation/ Download/Source code: http://postgis.net/install/

Purpose PostGIS is an open source spatial database extension for the PostgreSQL ORDBMS, adding support for geographic objects and allowing location queries to be run in SQL. It follows the Open Geospatial Consortium's "Simple Features for SQL" specification [R18] and provides several features, such as processing of vector and raster data, spatial reprojection, import/export of ESRI shapefiles and 3D object support. In particular, PostGIS adds extra types (geometry, geography, raster and others) to the PostgreSQL database, together with functions, operators and index enhancements that apply to these spatial types.

High Level Architecture PostGIS extensions run within the PostgreSQL DBMS environment.

Dependencies PostgreSQL and some spatial tools or libraries (Proj4, GEOS, GDAL).

Interfaces and languages supported SQL, with the additional geo-spatial functions and types, is the language used to query the data. The interfaces and clients are those provided by PostgreSQL. Psql is the PostgreSQL interactive CLI terminal. A RESTful API as well as a C client library (libpq) are also provided. ECPG allows SQL to be embedded in C code.

Security support Database access permissions in PostgreSQL are role-based. Different authentication methods are available to authenticate a client to the server; these include: password-based, GSSAPI, SSPI, Kerberos, ident-based, LDAP, RADIUS, certificate-based and PAM. SSL can be used to secure client-server connections, also exploiting client-side certificates. SSH tunnels can be used to encrypt the communication when SSL cannot be used. Encryption is available for password fields (by default), for table columns and when data are transferred over the network.

Data source Provides features to work with data in GIS formats (e.g. shapefiles).

Potential usage within BIGSEA It could provide useful features for the management of geo-spatial data, such as stationary and dynamic spatial data sources, and for the integration of existing tools to show geo-spatial data as layers, including layers with boundaries and city infrastructure.
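A hedged sketch of how a developer could query stationary GIS data through PostGIS is shown below, using psycopg2; the connection settings, table and column names are illustrative assumptions.

```python
import psycopg2

# Hypothetical connection to the level-1 spatial database
conn = psycopg2.connect(host="postgis.example.org", dbname="bigsea",
                        user="bigsea", password="secret")
cur = conn.cursor()

# Find bus stops within 500 m of a point of interest (WGS84 coordinates)
cur.execute("""
    SELECT name, ST_AsText(geom)
    FROM bus_stops
    WHERE ST_DWithin(geom::geography,
                     ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                     500)
""", (-49.27, -25.43))  # lon, lat of an illustrative location in Curitiba

for name, wkt in cur.fetchall():
    print(name, wkt)

cur.close()
conn.close()
```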


7.3.2. MongoDB

Identification MongoDB

Type NoSQL Database

License GNU AGPL (server); Apache License v2.0 (drivers); a commercial license is also available

Current Version 3.2.7 - June/2016

Website Website: https://www.mongodb.com/ Documentation: https://docs.mongodb.com/ Download/Source code: https://www.mongodb.com/download-center#community

Purpose MongoDB is a multi-platform document-oriented database that provides high performance, high availability and easy scalability. MongoDB works on the concept of document collections.

High Level Architecture

[Image source: https://www.mongodb.com/assets/images/products/application-architecture.png] With MongoDB's flexible storage architecture, the database automatically manages the movement of data between storage engine technologies using native replication. This approach significantly reduces developer and operational complexity when compared to running multiple distinct database technologies. Users can leverage the same MongoDB query language, data model, scaling, security, and operational tooling across different parts of their application, with each powered by the optimal storage engine. Through the use of a flexible storage architecture, MongoDB can be extended with new capabilities, and configured for optimal use of specific hardware architectures. MongoDB uniquely allows users to mix and match multiple storage engines within a single deployment. This flexibility provides a more simple and reliable approach to meeting diverse application needs for data. Traditionally, multiple database technologies would need to be managed to meet these needs, with complex, custom integration code to move data between the technologies, and to ensure consistent, secure access.

Dependencies None

Interfaces and languages supported The official MongoDB drivers (distributed under the Apache License, Version 2.0) allow developers to connect from various programming languages.

Security support MongoDB offers a basic set of security features:
1. control of read and write access to data;
2. protection of the integrity and confidentiality of the data stored;
3. control of modifications to the database system configuration;
4. privilege levels for different user types, administrators, applications, etc.;
5. auditing of sensitive operations;
6. stable and secure operation in a potentially hostile environment.
These security requirements can be achieved in different ways. A database is often placed unprotected on a "secured", internal network. This is an idealized scenario, since no network is entirely secure, architecture changes over time, and a considerable number of successful breaches come from internal sources. A defense-in-depth approach is therefore recommended when implementing an application's infrastructure. While MongoDB's security features help to improve the overall security posture, security is only as strong as the weakest link in the chain.

Data source MongoDB will store weather data and information from social networks, as well as other raw data needing intermediate and temporary storage before transformation.

Potential usage within BIGSEA MongoDB should be used to store and manipulate data in JSON format, such as social network data.
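The sketch below shows, in a hedged way, how social network records in JSON could be stored and queried with the official Python driver (pymongo); the host, database and collection names and the document structure are illustrative assumptions.

```python
from pymongo import MongoClient

# Hypothetical MongoDB instance holding level-1 social network data
client = MongoClient("mongodb://mongo.example.org:27017")
collection = client["bigsea"]["tweets"]

# Store a (simplified) geolocated tweet
collection.insert_one({
    "user": "anon_1234",
    "text": "Traffic jam near Batel",
    "created_at": "2016-06-20T08:15:00Z",
    "location": {"type": "Point", "coordinates": [-49.28, -25.44]},
})

# Retrieve recent messages mentioning traffic
for doc in collection.find({"text": {"$regex": "traffic", "$options": "i"}}).limit(10):
    print(doc["user"], doc["text"])

client.close()
```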

7.3.3. Apache HBase

Identification Apache HBase

Type NoSQL

License Apache License V2.0

Current Version V1.1.5 - May/2016

Website Website: https://hbase.apache.org Documentation: https://hbase.apache.org/book.html Download: http://www.apache.org/dyn/closer.cgi/hbase/


Purpose Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. It can be used when random, realtime read/write access to Big Data is required. The project's goal is the hosting of very large tables, composed of billions of rows by millions of columns, on top of clusters of commodity hardware. HBase is an open-source, distributed, versioned, non-relational database modeled after Google's "Bigtable: A Distributed Storage System for Structured Data" by Chang et al. [R19]. Similarly to Bigtable, which leverages the distributed data storage provided by GFS, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

High Level Architecture In a distributed configuration, an HBase cluster contains multiple nodes, each of which runs one or more HBase daemons. The following types of nodes are available:
● Master Server (HMaster): the server responsible for monitoring all RegionServer instances in the cluster, and the interface for all metadata changes. A multi-Master environment includes a primary and a backup Master instance;
● Region Server (HRegionServer): multiple servers responsible for serving and managing regions. Regions are the basic element of availability and distribution for tables, consisting of subsets of the table's data. Data are stored on HDFS (in the Hadoop DataNodes);
● ZooKeeper nodes: a distributed Apache HBase installation depends on a running ZooKeeper cluster to coordinate the whole cluster.

Dependencies Apache Zookeeper for the coordination of the cluster. Fully-distributed mode also requires an HDFS cluster, since in this mode HBase can only run on HDFS. MapReduce and Spark can be integrated with HBase.

Interfaces and languages supported The Apache HBase Shell is (J)Ruby's Interactive Ruby Shell (IRB) with some HBase commands added. The HBase native API is in Java; however, access through non-JVM languages and through custom protocols is possible. It includes a REST and a Thrift API as well as C/C++, Scala and Jython client drivers. It also provides a web interface.

Security support HBase provides mechanisms to secure various components and aspects of HBase and how it relates to the rest of the Hadoop infrastructure, as well as clients and resources outside Hadoop:
● secure HTTP (HTTPS) connections to the Web UI;
● optional SASL authentication of clients;
● secure HBase requires secure ZooKeeper and HDFS so that users cannot access and/or modify the metadata and data from under HBase;
● several strategies for securing data are available: Role-based Access Control (RBAC), which controls which users or groups can read and write to a given HBase resource or execute a coprocessor endpoint; Visibility Labels, which allow cells to be labelled and access to labelled cells to be controlled; transparent encryption of data at rest on the underlying file system.

Data source HBase can read data from the HDFS file system and, only in stand-alone mode, also from the local filesystem. It can be used for data sources that can be better handled through a NoSQL database. Some stored data will be replicated from other primary storage systems in order to be used in batch and online processing.

Potential usage within BIGSEA HBase can be exploited at the data access layer to access and store dynamic, spatial and social network data (or other semi-structured/unstructured data). It can also be exploited in conjunction with Hive to run HiveQL queries on the data.
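As a hedged illustration of the data access layer, the sketch below writes and reads a row through HBase's Thrift interface using the happybase Python library; the Thrift server address, table name and column family are illustrative assumptions.

```python
import happybase

# Connect to a hypothetical HBase Thrift server
connection = happybase.Connection("hbase-thrift.example.org", port=9090)
table = connection.table("bus_positions")  # assumed table with column family 'd'

# Store one GPS fix; the row key combines bus id and timestamp for time-ordered scans
table.put(b"bus042#20160620T081500", {
    b"d:lat": b"-25.4411",
    b"d:lon": b"-49.2765",
    b"d:speed": b"23.5",
})

# Read the row back
row = table.row(b"bus042#20160620T081500")
print(row[b"d:lat"], row[b"d:lon"])

connection.close()
```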

7.4. Data Ingestion and Streaming Processing

7.4.1. Apache Kafka

Identification Apache Kafka

Type Publish-subscribe messaging system

License Apache License V2.0

Current Version v0.10.0.0 - May/2016

Website Website: http://kafka.apache.org Documentation: http://kafka.apache.org/documentation.html Download/Source code: http://kafka.apache.org/code.html

Purpose Kafka is a distributed, partitioned and fault-tolerant commit log service. It can handle hundreds of megabytes of reads/writes per second with low latencies. Its scalable design allows streams to be partitioned over a cluster of multiple nodes. Messages are persisted on disk and replicated in the cluster to ensure durability and recoverability. Kafka maintains published messages in categories called topics. Producers are the processes that publish these messages to Kafka topics, whereas Consumers can subscribe to topics and process each message. So, at a high level, producers send messages over the network to the Kafka cluster which, in turn, serves them up to consumers as displayed in the picture.

[Image source: http://kafka.apache.org/images/producer_consumer.png]

High Level Architecture A Kafka cluster is composed of one or more "broker" nodes and, additionally, of Zookeeper nodes that coordinate the cluster.

Dependencies Apache Zookeeper for the coordination of the cluster. Apache Storm can be additionally used as a consumer of Kafka data.

Interfaces and languages supported Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. Kafka provides Java clients; however, clients in other languages are available (e.g. C/C++, PHP, Python, Ruby, Clojure, etc.): https://cwiki.apache.org/confluence/display/KAFKA/Clients

Security support The security measures supported are (for additional information on their configuration see http://kafka.apache.org/documentation.html#security):
1. Authentication of connections to brokers from clients (producers and consumers), other brokers and tools, using either SSL or SASL (Kerberos). SASL/PLAIN can also be used from release 0.10.0.0 onwards;
2. Authentication of connections from brokers to ZooKeeper;
3. Encryption of data transferred between brokers and clients, between brokers, or between brokers and tools using SSL;
4. Authorization of read/write operations by clients;
5. Authorization is pluggable and integration with external authorization services is supported.

Data source Kafka can be used with streams of data (e.g. from social networks).

Potential usage within BIGSEA Kafka could be used in the ingestion process to track particular terms/keys relevant for the urban mobility scenario from social networks, in order to manage streams of messages that can be consumed by the streaming processing modules.
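A hedged producer/consumer sketch using the kafka-python client is shown below; the broker address, topic name and message content are illustrative assumptions.

```python
from kafka import KafkaProducer, KafkaConsumer

BROKERS = "kafka.example.org:9092"   # hypothetical broker
TOPIC = "social-mobility"            # assumed topic for mobility-related posts

# Producer side: the ingestion component publishes raw messages
producer = KafkaProducer(bootstrap_servers=BROKERS)
producer.send(TOPIC, b'{"text": "Heavy traffic on Av. Sete de Setembro"}')
producer.flush()
producer.close()

# Consumer side: a stream processing module reads and handles each message
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BROKERS,
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:
    print(message.offset, message.value)
consumer.close()
```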

7.4.2. Apache Storm

Identification Apache Storm

Type Real-time computation system

License Apache License V2.0

Current Version V1.0.1 - Apr/2016

Website Website: http://storm.apache.org Documentation: http://storm.apache.org/releases/current/index.html Download/Source code: http://storm.apache.org/downloads.html

Purpose Apache Storm is a distributed realtime computation system. It allows unbounded streams of data to be processed reliably. Storm can be used with any programming language and provides a simple and easy to use API, with a few types of abstractions:
● Spouts: sources of streams in a computation (e.g., from Kafka);
● Bolts: process any number of input streams and produce any number of new output streams;
● Topologies: networks of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. The picture displays two spouts and multiple bolts.

[Image source: http://storm.apache.org/images/storm-flow.png] Storm provides inherent parallelism to process high throughputs of messages with very low latency, thus allowing applications to scale over the resources available. A Storm topology consumes streams of data and processes those streams in arbitrarily complex ways, repartitioning the streams between each stage of the computation however needed.

High Level Architecture A Storm cluster consists of:
● a Master node (running the Nimbus daemon), a single node that is responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures;
● Worker nodes (running the Supervisor daemon), which listen for work assigned to their machine and start/stop worker processes as necessary, based on what Nimbus has assigned to them. Each worker process executes a subset of a topology;
● Zookeeper nodes, which coordinate the whole cluster.

[Image source: http://storm.apache.org/releases/current/images/storm-cluster.png]

Dependencies Apache Zookeeper for the coordination of the cluster. Additionally, it can be integrated with Apache Kafka in order to consume data coming from it.

Interfaces and languages supported Storm has a simple and easy to use API. Its components (bolts, spouts, topologies) can be defined with any programming language. Non-JVM languages communicate with Storm over a JSON-based protocol. Adapters that implement the communication protocol are available for Ruby, Python, Javascript and Perl. Storm provides a CLI client to interact with and manage a remote cluster, while the Storm UI daemon provides a REST API to interact with a Storm cluster. Additionally, it provides a high-level abstraction (Trident) to perform real-time computing on top of Storm, allowing both stream analytics operations and transactional queries.

Security support Several options are available; by default all authentication/authorization is disabled [http://storm.apache.org/releases/1.0.1/SECURITY.html]:
1. Services allow the user to configure SSL (also 2-way) for the connection;
2. Pluggable authentication support through Thrift and SASL (e.g. Kerberos);
3. Different authorization plugins are available for the various components/services;
4. Isolation in multi-tenancy;
5. User/group roles management.

Data source Storm spouts read from different sources to produce streams of data. Spouts typically read from queueing brokers (e.g. Kafka, Kestrel, RabbitMQ) but can also generate their own streams or read from a streaming API.

Potential usage within BIGSEA Storm can be used to execute a set of operations on streaming data. In particular, it could be used to apply pre-processing and ETL operations before storing data into the system. It could be coupled with Apache Kafka to ingest and process streams of data (e.g., produced by social networks).
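Since Storm bolts can be written in Python through the multi-lang protocol, the sketch below uses the streamparse library (one of the Python adapters) to define a simple filtering bolt; the library choice, field layout and filtering rule are illustrative assumptions, not the project's actual topology.

```python
from streamparse import Bolt


class TrafficFilterBolt(Bolt):
    """Hypothetical bolt that keeps only messages mentioning traffic keywords."""

    KEYWORDS = ("traffic", "jam", "accident")

    def process(self, tup):
        # Assumes the incoming tuple carries the raw message text as its first value
        text = tup.values[0]
        if any(keyword in text.lower() for keyword in self.KEYWORDS):
            # Emit the relevant message downstream (e.g., towards the ETL/storage bolt)
            self.emit([text])
```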

7.4.3. Apache Flink

Identification Apache Flink

Type Execution Framework

License Apache License V2.0

Current Version Stable: V1.0.3; Latest: V1.1

Website Website: https://flink.apache.org/ Documentation: https://ci.apache.org/projects/flink/flink-docs-master/ Download/Source code: https://flink.apache.org/downloads.html https://flink.apache.org/community.html#source-code

Purpose Apache Flink is an open source platform for distributed stream and batch data processing. Its core is an engine for processing data streams that provides communication, fault tolerance and data distribution. Moreover, "it has native support for iterations, incremental iterations and programs consisting of large DAGs of operations". In Flink, batch processing runs as a special case of stream processing. A Flink program is built from "Streams" and "Transformations": a Stream is a (potentially unbounded) flow of data records, and a Transformation is an operation over one or more input streams (e.g., Map, FlatMap, Filter, and so on). Both are naturally parallel and distributed, and both are necessary to build a Flink program and compute results over streams.

[Image source: https://ci.apache.org/projects/flink/flink-docs-master/concepts/fig/program_dataflow.svg] Many other operations can be applied at this stage to process streaming data (defining time and windows for each process, describing state and fault tolerance, and defining checkpoints). Data streams are unbounded, so they cannot simply be counted; it is necessary to define a window to aggregate and process portions of the stream. Windows can be delimited by time (for example, every 10 seconds) or by data (for example, every 1000 elements). During processing, the state of each operation is stored in a key/value store, which keeps the computation stateful and consistent across the cluster. Checkpoints are used to save a consistent point that can be used to restore the state.

High Level Architecture A Flink cluster consists of two types of runtime processes, plus a client:
● Master (so-called JobManager): organizes the resources of the distributed execution (the whole Flink system). Its main assignments are scheduling tasks, performing checkpoints, and recovering from failures;
● Worker (so-called TaskManager): executes tasks or subtasks of the parallel program. There has to be at least one worker process;
● Client: the component responsible for planning (turning the program into the parallel dataflow form) and sending dataflows to the Master (JobManager).

[Image source: https://ci.apache.org/projects/flink/flink-docs-release-0.8/img/ClientJmTm.svg]

Dependencies HDFS and YARN are required by Flink; both dependencies come from the Apache Hadoop project. By default, Flink uses Hadoop 2.x dependencies. For high availability it is necessary to use Zookeeper and YARN.

Interfaces and languages supported Flink has 3 APIs to create an application: the DataStream API (stream processing) to handle streams, which can be used from Java or Scala; the DataSet API for static data (batch processing), embedded in Java, Scala and Python; and the Table API to interpret SQL-like expressions, embedded in Java and Scala. In addition, Flink has many libraries for specific domains, like Complex Event Processing (CEP), Flink Machine Learning (ML) and graph processing (called Gelly).

Security support Flink supports Kerberos authentication of Hadoop services such as HDFS, YARN, or HBase.

Data source Apache Flink accesses different data sources, for example the Hadoop Distributed File System (HDFS), Amazon S3, the MapR file system, and Tachyon. Flink can also access MongoDB, although this support is not very mature (https://github.com/okkam-it/flink-mongodb-test). Also, data (streams) can be consumed from Apache Kafka and connected to several data storage systems.

Potential usage within BIGSEA Flink can be used to process real-time data streams and integrate these data with historical data to get insights, generate knowledge, and produce predictions in the context of smart cities.


7.4.4. Apache Spark Streaming

Identification Apache Spark Streaming

Type Execution Framework

License Apache License V2.0

Current Version Stable: Apache Spark V1.6.1; Latest: Preview V2.0.0

Website Website: http://spark.apache.org/streaming/ Documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html Download/Source code: http://spark.apache.org/downloads.html

Purpose Spark Streaming allows stream processing with high scalability, throughput, and fault tolerance. Data streams are ingested from many sources and then processed by Spark Streaming. The processed data can then be pushed out to file systems and databases, or presented via live dashboards.

[Image source: http://spark.apache.org/docs/latest/img/streaming-arch.png]

High Level Architecture When Spark Streaming receives a data stream, the stream is split into batches to be processed by the Spark engine.

[Image source: http://spark.apache.org/docs/latest/img/streaming-flow.png] Spark Streaming represents data as a DStream (Discretized Stream) [R20], which is internally represented as a sequence of RDDs in Apache Spark; high-level operations can be performed on a DStream. Each RDD contains the chunk of the data stream received during a short time interval.

[Image source: http://spark.apache.org/docs/latest/img/streaming-dstream.png] If necessary, each RDD can be transformed/processed, leading to a new RDD. The framework hides most of these DStream/RDD transformation details.

Dependencies To ingest data from external sources, it is necessary to add the corresponding artifacts for each data source (http://spark.apache.org/docs/latest/streaming-programming-guide.html#linking). Please refer to the full list of supported sources and artifacts on the Maven Repository.

Interfaces and languages supported Spark Streaming programs can be created in Scala, Java and Python.

Security support Security options are configured in Apache Spark. In addition, it is necessary to configure the TCP ports (standalone/cluster) for network security. [See Section 7.4.5 Apache Spark]

Data source Spark streaming allows data to be ingested from 2 types of sources: (1) Basic: File systems (HDFS), socket connections, and Akka actors; (2) Advanced: Kafka, Flume, Kinesis, Twitter, ZeroMQ, and MQTT.

Potential usage within BIGSEA Spark Streaming can be integrated with Kafka to get streaming data from Twitter and process them. This helps in the recognition of events such as traffic jams and other situations/events in big cities. There are many possible usages for this tool, such as:
● Extraction, Transformation, Load (ETL) over streaming data;
● detection of traffic problems;
● data enrichment - comparison of real-time data with historical data (weather and traffic).
The choice depends on the data we have access to or can obtain through the Internet.
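A hedged Spark Streaming sketch in Python is shown below, counting traffic-related keywords over a Kafka-fed stream in 10-second batches; the broker, topic and keyword list are illustrative assumptions (the Kafka connector also requires the spark-streaming-kafka artifact mentioned above).

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Consume the hypothetical "social-mobility" topic from Kafka
stream = KafkaUtils.createDirectStream(
    ssc, ["social-mobility"], {"metadata.broker.list": "kafka.example.org:9092"})

KEYWORDS = ["traffic", "jam", "accident"]

# Count keyword occurrences in each batch (messages arrive as (key, value) pairs)
counts = (stream.map(lambda kv: kv[1].lower())
                .flatMap(lambda text: [w for w in KEYWORDS if w in text])
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))

counts.pprint()

ssc.start()
ssc.awaitTermination()
```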

7.5. Data Analytics and Mining

7.5.1. Ophidia

Identification Ophidia

Type Big data framework for scientific data analysis


License GNU GPL v3.0

Current Version v0.9 - Feb/2016

Website Website: http://ophidia.cmcc.it Documentation: http://ophidia.cmcc.it/documentation/ Download/Source code: http://ophidia.cmcc.it/download/

Purpose Ophidia provides a complete environment for the execution of data-intensive analysis exploiting advanced parallel computing techniques and smart data distribution methods. It exploits an array-based storage model and a hierarchical storage organization to partition and distribute multidimensional scientific datasets over multiple nodes.

High Level Architecture An Ophidia cluster is composed of the following components:
● Ophidia Server: the cluster front-end. It provides multiple interfaces for client-server interactions and manages job scheduling, submission and monitoring;
● Compute nodes: one or more nodes running the Ophidia parallel operators;
● I/O nodes: multiple nodes running one or more I/O servers responsible for the execution of array-based analytical primitives;
● Storage layer: comprises the resources to physically store the data.

Dependencies MySQL server for metadata storage and management, as well as for datacube storage. SLURM scheduler for the management and execution of analytics jobs. MPI environment for the execution of parallel jobs.

Interfaces and languages supported The Ophidia server provides a multi-interface server-side front-end. The available interfaces are: (i) a web service interface compliant with WS-I Basic Profile v1.2, (ii) a GSI interface with support for Virtual Organizations (VOMS). The Ophidia Terminal can be used to run interactive and batch analysis sessions using the interfaces provided by the server. Additionally, a Python binding library is available.

Security support The system provides mechanisms to authenticate users and log the commands executed. It also defines different roles (administrator/user) and allows users to share a working session by granting privileges to other users. Encryption is provided to secure client-server communications.

Data source Scientific multi-dimensional data like environmental (climate/weather) data.

Potential usage within BIGSEA Ophidia can be used to store, manage, access and analyse environmental data, providing array-based primitives for time-series analysis. It natively supports the NetCDF format and features climate/weather domain operators for analytics.
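A hedged sketch of the Python bindings mentioned above is shown below, importing a NetCDF weather file and reducing it along the time dimension with PyOphidia; the server address, credentials, file path and variable name are illustrative assumptions, and the operator parameters may differ in the deployed version.

```python
from PyOphidia import cube

# Connect to a hypothetical Ophidia server
cube.Cube.setclient(username="bigsea", password="secret",
                    server="ophidia.example.org", port="11732")

# Import a NetCDF file as a datacube, with 'time' as the implicit (array) dimension
temperature = cube.Cube.importnc(src_path="/data/curitiba_tasmax_2016.nc",
                                 measure="tasmax", imp_dim="time", ncores=4)

# Reduce the time series to its average value (level-2 derived data)
avg = temperature.reduce(operation="avg", ncores=4)
print("Result datacube PID:", avg.pid)
```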

7.5.2. Apache Kylin

Identification Apache Kylin

Type Distributed Analytics Engine

License Apache License V2.0

Current Version v1.5.2.1 - Jun/2016

Website Website: http://kylin.apache.org Documentation: http://kylin.apache.org/docs15/ Download/Source code: http://kylin.apache.org/download/

Purpose Apache Kylin™ is an open source Distributed Analytics Engine designed to provide a SQL interface and multidimensional analysis (OLAP) on Hadoop, supporting large datasets. Apache Kylin™ allows users to query big Hive tables at sub-second latency in 3 steps:
1. Identify a set of Hive tables in a star schema;
2. Build a cube from the Hive tables in an offline batch process;
3. Query the Hive tables using SQL and get results in sub-seconds, via the REST API, ODBC, or JDBC.

High Level Architecture Kylin is commonly installed on a Hadoop client machine to allow interaction with the Hadoop cluster through the command line. The picture shows this scenario. The application (e.g. Kylin Web) contains a web interface for cube building, querying and management. Kylin Web launches a query engine for querying the data and a cube build engine for building data cubes starting from a star schema. The two engines interact with the Hadoop components, like Hive for cube building and HBase for cube storage.

[Image source: http://kylin.apache.org/images/install/on_cli_install_scene.png]

Dependencies Hadoop cluster, Apache Hive to read data from it and Apache HBase to store data cubes into it.

Interfaces and languages supported Kylin provides ODBC and JDBC drivers as well as a RESTful API. The Kylin Web Interface allows users to run queries (exploiting SQL) and visualize the results (also in charts).

Security support Kylin supports LDAP authentication for enterprise or production deployments. Additionally, from v1.5, Kylin provides SSO with SAML; the implementation is based on the Spring Security SAML Extension.

Data source Kylin can read data stored in HDFS from Hive and store data cubes in HBase.

Potential usage within BIGSEA It can be used for scenarios where OLAP analysis is required. It also allows OLAP on streaming cubes (still a prototype).
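The RESTful API mentioned above could be exercised as in the hedged sketch below, which authenticates and submits a SQL query against a Kylin project; the host, credentials, project and table names are illustrative assumptions, and the endpoint layout follows the public Kylin documentation.

```python
import requests
from requests.auth import HTTPBasicAuth

KYLIN = "http://kylin.example.org:7070/kylin/api"   # hypothetical Kylin instance
auth = HTTPBasicAuth("ADMIN", "KYLIN")              # illustrative credentials

# Authenticate (Kylin keeps the session via cookies)
session = requests.Session()
session.post(KYLIN + "/user/authentication", auth=auth).raise_for_status()

# Run an OLAP-style aggregation query against a hypothetical mobility cube
query = {
    "sql": "SELECT line_id, COUNT(*) AS trips FROM bus_trips GROUP BY line_id",
    "project": "bigsea_mobility",
}
resp = session.post(KYLIN + "/query", json=query, auth=auth)
resp.raise_for_status()

for row in resp.json().get("results", []):
    print(row)
```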

7.5.3. Apache Hive

Identification Apache Hive

Type Data warehouse software

License Apache License V2.0

Current Version v2.0.1 - May/2016

Website Website: https://hive.apache.org Documentation: https://cwiki.apache.org/confluence/display/Hive/Home Download/Source code: https://hive.apache.org/downloads.html

Purpose Apache Hive™ is a data warehouse software facilitating reading, writing, and managing large datasets residing in distributed storage using SQL. It is built on top of Apache Hadoop™ and provides: ● Tools to enable easy access to data via SQL, allowing data warehousing tasks such as ETL, reporting, and data analysis; ● A mechanism to impose structure on a variety of data formats; ● Access to files stored directly in Apache HDFS™ or in other data storage systems like Apache HBase™; ● Query execution via various frameworks (i.e. Apache Tez™, Apache Spark™ or MapReduce). Hive provides standard SQL functionality, including many of the later SQL:2003 and SQL:2011 features for analytics.

High Level Architecture The figure displays Hive's main components and their interactions with Hadoop. The components are: ● UI: the user interface to submit queries and other operations to the system; ● Driver: the component which receives the queries; it implements the notion of session handles and provides execute and fetch APIs; ● Compiler: parses the query, performs semantic analysis on the query blocks and expressions and, eventually, generates an execution plan based on the metadata available in the metastore; ● Metastore: stores all the information related to the structure of the various tables and partitions in the warehouse, including column and column type information, the serializers and deserializers necessary to write and read data, and the corresponding HDFS files where the data are stored (serializers/deserializers provide the logic to read/write custom formats); ● Execution Engine: the component that executes the execution plan (a DAG of stages) created by the compiler.

[Image source: https://cwiki.apache.org/confluence/download/attachments/27362072/system_architecture.png?version=1&modificationDate=1414560669000&api=v2]

Dependencies Hadoop cluster

Interfaces and languages supported Hive defines a SQL-like language called HiveQL (HQL) to perform queries on data. It can be extended with user code through user-defined functions (UDFs), user-defined aggregates (UDAFs), and user-defined table functions (UDTFs). In terms of clients, Hive provides a shell CLI and a GUI (Hive Web Interface), as well as several client libraries (JDBC, ODBC, Python, PHP, Thrift). Additionally, it provides HiveServer2 (HS2), a server interface that enables remote clients to execute queries against Hive and retrieve the results.

Security support Hive provides three authorization modes: ● Storage Based Authorization in the Metastore Server: the metastore controls access to the different metadata objects (databases, tables, partitions) by checking the file system permissions of the corresponding folders; ● SQL Standards Based Authorization in HiveServer2: allows fine-grained control and is based on the SQL standard for authorization, using common grant/revoke statements; ● Default Hive Authorization (Legacy Mode): the authorization mode available in earlier versions of Hive, which does not have a complete access control model. Strong authentication for tools like the Hive command line is provided through Kerberos, whereas HiveServer2 provides additional authentication options (cookie-based, SASL, PAM, LDAP and Kerberos).

Data source Data stored in file systems (e.g. HDFS, Amazon S3) or in Apache HBase database.

Potential usage within BIGSEA Apache Hive can be used on top of Hadoop (or HBase) to access, filter and process data stored in HDFS. In particular, it could be used both at the Data Access level, to select the pre-processed data from social networks (or other unstructured data), and at the Data Analytics level, to process the data.
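As a concrete illustration of the client interaction described above (not part of the original assessment), the following minimal sketch submits a HiveQL query to HiveServer2 from Python; it assumes the third-party PyHive client library, and the host, port and the "tweets" table are hypothetical placeholders.

```python
# Minimal sketch: running a HiveQL query against HiveServer2 from Python.
# Assumes the third-party PyHive library; host, port, database and the
# "tweets" table are hypothetical placeholders.
from pyhive import hive

conn = hive.connect(host="hiveserver2-host", port=10000, database="default")
cursor = conn.cursor()

# The aggregation below is pushed down to the configured execution engine
# (MapReduce, Tez or Spark).
cursor.execute(
    "SELECT lang, COUNT(*) AS n_tweets "
    "FROM tweets GROUP BY lang ORDER BY n_tweets DESC LIMIT 10"
)
for lang, n_tweets in cursor.fetchall():
    print(lang, n_tweets)

cursor.close()
conn.close()
```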

7.5.4. Druid

Identification Druid

Type Big data solution for data analytics

License Apache License, Version 2.0.

Current Version 0.9.0 - Apr/2016

Website Website: http://druid.io/ Documentation: http://druid.io/docs/0.9.0/design/index.html Download/Source code: http://druid.io/downloads.html

Purpose Druid is an open source data store designed for OLAP queries on event data. Data are organized as dimension and metric columns. A separate column type, timestamp, is treated specially because all queries center on the time axis. Druid provides a language with operations to load, index, query and group (roll-up) data. Data are partitioned in segments, which are key to providing high availability in a cluster.

High Level Architecture

[Image source: https://upload.wikimedia.org/wikipedia/commons/0/0f/Druid_Open-Source_Data_Store%2C_architecture%2C_DruidArchitecture3.svg] In a fault-tolerant architecture, a Druid cluster allows data to be partitioned and replicated. A cluster coordinator (ZooKeeper) is needed to keep cluster information synchronized, allowing identification of new or removed nodes and leader election. MySQL or PostgreSQL is used to store metadata; for deep storage (historical data), a distributed file system such as HDFS or S3, or even the Cassandra NoSQL database, is required.

Druid provides data ingestion tools to load data directly into its real-time nodes, or in batch into historical nodes. Real-time nodes accept JSON-formatted data from a streaming data source, like RabbitMQ or other message queueing system. Batch-loaded data formats can be JSON, CSV, or TSV.

Data are partitioned in segments, designed to be easily moved out to deep storage. The location of segments is stored in the relational database (MySQL) and all transfers are coordinated by ZooKeeper.

Broker nodes are responsible for receiving client queries and forwarding them to the appropriate data nodes (historical or real-time). Brokers interact with the metadata database and ZooKeeper in order to know on which nodes the segments reside. After each data node has processed the query, broker nodes merge the partial results before returning the aggregated result.

Dependencies MySQL (or Postgres) as a metadata storage; HDFS, Amazon S3 or any sharable and mountable file system as deep storage; Apache ZooKeeper, "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services."


Interfaces and languages supported REST API, with clients implemented for Ruby, Python, R, JavaScript (Node.js) and others. Data ingestion can be done by stream pull or push.

Security support No support for authentication or authorization.

Data source Data ingestion can be done by stream pull or push, directly to or from an application.

Potential usage within BIGSEA Analytics scenarios where an OLAP tool would be enough; fast operation on TOP-N queries.
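To make the TOP-N use case above concrete, here is a minimal sketch (not part of the original assessment) of a TopN query sent to a Druid broker through its native JSON query API; the broker address, datasource, dimension and metric names are hypothetical placeholders.

```python
# Minimal sketch: a TopN query against Druid's native JSON query API.
# Broker host/port, datasource, dimension and metric names are placeholders.
import requests

BROKER_URL = "http://druid-broker:8082/druid/v2/"  # assumed default broker port

query = {
    "queryType": "topN",
    "dataSource": "bus_events",        # hypothetical datasource
    "dimension": "bus_line",           # hypothetical dimension
    "metric": "num_events",
    "threshold": 10,
    "granularity": "all",
    "aggregations": [
        {"type": "longSum", "name": "num_events", "fieldName": "count"}
    ],
    "intervals": ["2016-01-01/2016-07-01"],
}

resp = requests.post(BROKER_URL, json=query)
resp.raise_for_status()

# Each entry holds the top-N dimension values for one time bucket.
print(resp.json())
```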

7.5.5. Spark

Identification Apache Spark

Type Framework for processing big data

License Apache License, Version 2.0.

Current Version 1.6.1 - Mar/2016

Website Website: http://spark.apache.org Documentation: http://spark.apache.org/docs/latest/ Download/Source code: http://spark.apache.org/downloads.html

Purpose In-memory engine for large-scale data processing. Apache Spark is a fast and general-purpose cluster computing system and an optimized engine that supports general execution graphs.

High Level Architecture

[Image source: http://spark.apache.org/docs/latest/img/cluster-overview.png] In a standalone cluster deployment, the cluster manager is a Spark master instance.


When using Mesos, the Mesos master replaces the Spark master as the cluster manager. Spark can be used for batch jobs through spark-submit, which can use local, YARN or Mesos resources, among others. Spark-submit can also be used to execute applications remotely. Spark-shell is a Scala interactive console that can use a Mesos cluster as back-end; this way, one can write data analytics operations and execute them interactively on a remote system.

Dependencies A Hadoop, YARN or Mesos cluster.

Interfaces and languages supported It provides high-level APIs in Java, Scala, Python and R.

Security support Spark currently supports authentication via a shared secret. Spark supports SSL for the Akka and HTTP (broadcast and file server) protocols. SASL encryption is supported for the block transfer service. Encryption is not yet supported for the WebUI, nor for data stored by Spark in temporary local storage, such as shuffle files, cached data, and other application files. If encrypting this data is desired, a workaround is to configure the cluster manager to store application data on encrypted disks.

Data source Data stored in file systems (e.g. ext4, HDFS). There are many other connectors that allow Spark to read/write data from/to other data sources and storage systems.

Potential usage within BIGSEA Spark is one of the supported programming models in BIGSEA.
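Since Spark is one of the programming models foreseen in the project, the following minimal sketch (not part of the original assessment) shows a batch job written with the PySpark RDD API that could be launched through spark-submit, as described above; the HDFS input path is a hypothetical placeholder.

```python
# Minimal sketch of a Spark batch job (PySpark, RDD API as in Spark 1.6).
# The HDFS input path is a placeholder; submit it e.g. with:
#   spark-submit --master yarn word_count.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("word-count-sketch")
sc = SparkContext(conf=conf)

# Classic word count over a (hypothetical) text dataset stored in HDFS.
counts = (
    sc.textFile("hdfs:///data/social/tweets.txt")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

# Print the ten most frequent words on the driver.
for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

sc.stop()
```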

7.5.6. Hadoop MapReduce

Identification Hadoop MapReduce

Type Framework for processing big data

License Apache License 2.0

Current Version Release 2.7.2 - Jan/2016

Website Website: http://hadoop.apache.org/ Documentation: http://hadoop.apache.org/docs/r2.7.2/ Download/Source code: http://hadoop.apache.org/#Download+Hadoop

Purpose MapReduce is a programming model that allows the processing of massive data with a parallel and distributed algorithm, usually on computer clusters. Hadoop is a collection of sub-projects, hosted by the Apache Software Foundation, related to distributed computing. Although the best-known Hadoop subprojects are MapReduce and its distributed file system (HDFS), other subprojects (e.g., Avro, Pig, HBase, Hive and ZooKeeper) offer complementary services or add higher-level abstractions.

High Level Architecture

Hadoop Common: The common utilities that support the other Hadoop modules. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data. Hadoop YARN: A framework for job scheduling and cluster resource management. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Dependencies Java installed on all nodes (master and slaves). If a high security level is required, Kerberos is a possible solution.

Interfaces and languages supported Hadoop MapReduce supports high-level APIs or algorithms in Java.

Security support The security features of Hadoop consist of authentication, service level authorization, authentication for Web consoles and data confidentiality. Authentication: when service level authentication is turned on, end users using Hadoop in secure mode need to be authenticated by Kerberos. Service level authorization: the initial authorization mechanism to ensure that clients connecting to a particular Hadoop service have the necessary, pre-configured permissions and are authorized to access the given service. Authentication for Web consoles: by default, Hadoop HTTP web consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow access without any form of authentication; however, they can be configured to require Kerberos authentication using the HTTP SPNEGO protocol (supported by browsers like Firefox and Internet Explorer). Data confidentiality: the data transferred between Hadoop services and clients can be encrypted; furthermore, the data transfer between web consoles and clients can be protected using SSL (HTTPS).

Data source The data can be stored in file systems (e.g., HDFS). Using YARN, different types of data (e.g., text, jar, JSON) can be read from and written to the file systems. Furthermore, Hadoop can be integrated with other data sources (e.g., HBase, MySQL and MongoDB).

Potential usage within BIGSEA Hadoop MapReduce is one of the supported programming models (to process big data in batch mode) in BIGSEA.
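As a concrete illustration of the MapReduce programming model (not part of the original assessment), the following minimal word-count sketch targets Hadoop Streaming, which lets mappers and reducers be written as scripts reading standard input and writing standard output; the input/output paths and the streaming jar location in the comment are hypothetical placeholders.

```python
# Minimal Hadoop Streaming sketch: mapper and reducer as stdin/stdout filters.
# A typical submission (paths and jar location are placeholders) looks like:
#   hadoop jar hadoop-streaming.jar \
#     -input /data/tweets -output /data/wordcount \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -file wordcount.py
import sys

def run_mapper():
    # Emit one "word<TAB>1" line per token.
    for line in sys.stdin:
        for word in line.split():
            sys.stdout.write("%s\t1\n" % word)

def run_reducer():
    # Hadoop sorts map output by key, so counts for a word arrive contiguously.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                sys.stdout.write("%s\t%d\n" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        sys.stdout.write("%s\t%d\n" % (current_word, current_count))

if __name__ == "__main__":
    run_mapper() if "map" in sys.argv[1:] else run_reducer()
```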

7.6. Data Mining and Analytics Toolbox

7.6.1. Apache Spark MLlib

Identification Apache Spark MLlib

Type Apache Spark’s scalable machine learning (ML) library.

License Apache License v2.0

Current Version 1.6.1 - March/2016

Website Website: https://spark.apache.org/mllib/ Documentation: https://spark.apache.org/docs/latest/mllib-guide.html Download/Source code: https://github.com/apache/spark/tree/master/mllib

Purpose Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.


High Level Architecture

Spark MLlib is a module (a library / an extension) of Apache Spark to provide distributed machine learning algorithms on top of Spark’s RDD abstraction. Its goal is to simplify the development and usage of large scale machine learning. The following types of machine learning algorithms are available in MLlib: ● Classification ● Regression ● Frequent itemsets (via FP-growth Algorithm) ● Recommendation ● Feature extraction and selection ● Clustering ● Statistics ● Linear Algebra The following can also be done using MLlib: ● Model import and export ● Pipelines

Dependencies - Apache Spark (MLlib is included as a module). - MLlib uses the linear algebra package Breeze, which depends on netlib-java for optimized numerical processing. - To use MLlib in Python, NumPy version 1.4 or newer is required.

Interfaces and languages supported Usable in Java, Scala, Python, and SparkR.

Security support Spark currently supports authentication via a shared secret. Spark supports SSL for the Akka and HTTP (broadcast and file server) protocols. SASL encryption is supported for the block transfer service. Encryption is not yet supported for data stored by Spark in temporary local storage, such as shuffle files, cached data, and other application files. If encrypting this data is desired, a workaround is to configure the cluster manager to store application data on encrypted disks.


Data source All data sources that can be connected to Spark (there are many connectors that allow Spark to read/write data from/to other data sources and storage systems).

Potential usage within BIGSEA The key benefit of MLlib to BIGSEA is that it allows data scientists to focus on their data problems and models instead of solving the complexities surrounding distributed data (such as infrastructure, configuration, and so on). Just as important, Spark MLlib is a general-purpose library, providing algorithms for most ML use cases while at the same time allowing data scientists to build upon and extend it for specialized ML use cases.
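As an example of the kind of algorithm listed above, the following minimal sketch (not part of the original assessment) runs KMeans clustering with the RDD-based MLlib API of Spark 1.6; the input path and the meaning of the features are hypothetical placeholders.

```python
# Minimal sketch: KMeans clustering with Spark MLlib (RDD-based API, Spark 1.6).
# The input path and the meaning of the features are hypothetical placeholders.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-kmeans-sketch")

# Hypothetical input: one record per line with comma-separated numeric features
# (e.g. average speed and trip duration derived from bus GPS traces).
data = (
    sc.textFile("hdfs:///data/derived/trip_features.csv")
      .map(lambda line: [float(x) for x in line.split(",")])
)

# Train a 4-cluster model; the number of clusters is arbitrary here.
model = KMeans.train(data, k=4, maxIterations=20)
print("Cluster centres:", model.clusterCenters)

sc.stop()
```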

7.6.2. Ophidia Operators

Identification Ophidia analytics operator

Type Datacube-oriented analytics operators

License GNU GPL v3.0

Current Version v0.9 - Feb/2016

Website Website: http://ophidia.cmcc.it Documentation: http://ophidia.cmcc.it/documentation/ Download/Source code: http://ophidia.cmcc.it/download/

Purpose It provides around 50 parallel (MPI-based) operators that allow datacube-oriented analytics and metadata management, supporting natively the NetCDF format. These include: ● Data import/export ● Subsetting ● Reduction/aggregation ● Data exploration ● Cube intercomparison ● Metadata handling ● Script execution ● Run Ophidia primitives

High Level Architecture Ophidia operators run on the compute nodes of an Ophidia cluster (see the Ophidia table in Section 7.5.1).

Dependencies Ophidia framework and MPI environment

Interfaces and languages supported Can be executed in the Ophidia environment.


Security support See the Ophidia table in Section 7.5.1.

Data source Scientific multi-dimensional data (NetCDF format).

Potential usage within BIGSEA Ophidia operators allow the execution of a wide range of OLAP-oriented tasks on scientific multi-dimensional data. They can be used for running data analytics experiments on climate/weather data. Additionally, some operators provide support for metadata management. Data operators are based on MPI for parallel processing.
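To give a flavour of how such operators could be driven programmatically, the following minimal sketch submits a short operator chain through the Python bindings mentioned in Section 7.5.1; it assumes the PyOphidia client module, and the server address, credentials, file path and operator arguments are illustrative placeholders that may differ in a real deployment.

```python
# Minimal sketch: submitting Ophidia operators through the Python bindings.
# Assumes the PyOphidia client module; server address, credentials, the NetCDF
# file path and the operator arguments are illustrative placeholders.
from PyOphidia import client

oph = client.Client(username="oph-user", password="oph-pass",
                    server="ophidia-server", port="11732")

# Import a (hypothetical) NetCDF file as a datacube, reduce the time series to
# its average, and explore the result. In a real session the datacube PID
# returned by the import would be passed explicitly to the following operators.
oph.submit("oph_importnc src_path=/data/climate/tasmax_2015.nc measure=tasmax")
oph.submit("oph_reduce operation=avg")
oph.submit("oph_explorecube")
```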

7.6.3. Ophidia Primitives

Identification Ophidia primitives

Type Array-based analytical primitives

License GNU GPL v3.0

Current Version v0.9 - Feb/2016

Website Website: http://ophidia.cmcc.it Documentation: http://ophidia.cmcc.it/documentation/ Download/Source code: http://ophidia.cmcc.it/download/

Purpose Ophidia primitives provide around 100 array-based primitives, based on well-known scientific libraries (e.g. GSL, matheval, etc.), that allow analytics, statistical and mathematical operations. Primitives are implemented as User Defined Functions (UDF) and can be executed directly from the Ophidia I/O server in SQL queries. Additionally, multiple primitives can be nested into a single query. Among the functions provided by the primitives there are: ● Array subsetting and extraction ● Arithmetic/mathematical (e.g. multiplication, addition, absolute value) ● Array aggregation ● Statistical (e.g. max, min, average, quantiles, std. deviation, boxplot, histogram) ● Array manipulation (e.g. shift, permutation, concatenation) ● Data conversion and cast ● Mathematical computations (e.g. linear regression, interpolation, predicate evaluation, etc.)

High Level Architecture Ophidia primitives run in the I/O servers of an Ophidia cluster (see the Ophidia table in Section 7.5.1).

Dependencies Ophidia framework

Interfaces and languages supported Can be executed in the Ophidia environment and also standalone in SQL queries in MySQL.

Security support See the Ophidia table in Section 7.5.1.

Data source Scientific multi-dimensional data (NetCDF format).

Potential usage within BIGSEA Ophidia primitives provide a set of low-level functions to perform analytics on data stored in arrays (e.g. time series). They are especially suited for scientific array-based data. In the context of the project these features can be used in conjunction with the Ophidia operators to perform statistical computation and analytics through the Ophidia framework on climate/weather data.

7.7. Final Assessment

The following tables provide a summary of the various technologies analyzed for each block of the WP4 architecture. In particular, for each of them, a set of features necessary for the big data eco-system has been highlighted. In the following tables, Yes (or Partially supported) means that the component provides full (or partial) support for that feature, so it is technically sound with regard to the project requirements. An empty cell is associated with components that would require too many adaptation/extension activities to address that feature, or that are not able to support it at all. The features have been identified from the end-user, data source and analytical requirements.

Data Storage Components

Hadoop HDFS | PostGIS* (PostgreSQL) | MongoDB* storage | Ophidia* storage

Store files from data sources Yes

Store GIS (stationary/dynamic) data sources Yes Yes

Store environmental data sources Yes

Store social network data sources Yes Yes

Store derived and platform-level data Yes Yes

Store metadata related to level-1 data Yes Yes Yes

* Ophidia, PostGIS and MongoDB are mainly classified in Section 7.5 as data analytics and access tools. However, they also provide storage capabilities in their stack, which explains their role in this table.


Data Access Components

Apache HBase PostGIS MongoDB Ophidia*

Import/export NetCDF data Yes

Import/export JSON data Yes Yes

Import/export Shapefile data Yes

Select/filter data Yes Yes Yes Yes

Temporal/spatial queries Partially Yes Yes Yes

Aggregation queries Yes Yes Yes Yes

* Ophidia is classified in Section 7.5 as a data analytics tool. However, it also provides data access capabilities in its stack, which explains its role in this table.

Data Ingestion and Streaming Processing Components

Apache Kafka Apache Storm Apache Flink Spark Streaming

Continuous data ingestion Yes

Processing of data stream Yes Yes Yes

Streaming analytics Requires custom code Requires custom code Requires custom code

Data Analytics and Mining Components

Ophidia | Apache Hive | Apache Kylin | Spark | Hadoop MapReduce

OLAP Analysis Yes Yes Yes Yes Requires custom code Requires custom code

Batch processing Yes Yes Yes

Data Analytics and Mining Toolbox

Spark MLlib Ophidia Primitives Ophidia Operators

Data Mining Algorithms (Clustering, Classification, Regression) Yes Only clustering

Statistical/mathematical analytics Yes Yes Yes, exploiting Ophidia primitives


Time series analysis Partially supported Yes Yes

Spatial analysis Partially supported Partially supported


8. PRELIMINARY ARCHITECTURAL MAPPING The following diagram (Figure 13) shows a preliminary mapping of some technologies onto the fast and big data eco-system architecture. These tools and systems have been selected based on the analysis and the assessment provided in the previous section. The mapping clearly highlights the different big data technologies that could be exploited to address the use case requirements.

Figure 13. Preliminary architectural mapping


9. CONCLUSIONS This document has provided a complete overview of the design of the integrated big and fast data eco-system. In particular, it has presented the general conceptual view of the proposed data management eco-system, describing in detail the key architectural components needed to address the multifaceted data management aspects (data storage, access, analytics and mining) of the project. Additionally, a detailed view of the architecture in terms of internal components (storage, ETL, big data technologies, Entity Matching and Data Quality services), sequence diagrams (UML notation), QoS metrics to be exposed at the WP4 level, data management APIs and data-related security aspects has also been presented. The document has also included the full list of the data-related requirements from D7.1, together with a comprehensive description of the main data sources (classified as raw, derived and platform-level) and of the user classes. Worth mentioning is the presentation of the main big and fast data tools currently available in the data landscape, from the (i) storage, (ii) access, (iii) analytics/mining and related toolbox, and (iv) ingestion and streaming processing points of view, that could fit into the WP4 software architecture. The links to the other WPs from the security, quality of service, user requirements and programming framework standpoints have also been highlighted in the text to make clear how the WP4 architecture is linked to the overall project picture. This deliverable provides a comprehensive view of the main architectural aspects of the fast and big data management eco-system and a solid basis for moving forward with the implementation of the software stack.




GLOSSARY

Acronym Explanation Usage Scope

AAA Authentication, Authorization and Accounting Security

ACL Access Control List Security

Amazon S3 Amazon Simple Storage Service Storage technology

API Application Programming Interface Interfaces

CLI Command Line Interface Interfaces

CSV Comma Separated Value Data type

DAG Directed Acyclic Graph Execution plan

DBMS Database Management System Storage technology

DQaS Data quality as a service Service

EM Entity Matching Service

ESRI Environmental Systems Research Institute Data type

ETL Extraction, Transformation and Load Data integration

GDAL Geospatial Data Abstraction Library Software library

GEOS Geometry Engine Open Source Software library

GFS The Google File System Storage technology

GIS Geographic Information System Data Type

GNU GPL GNU General Public License License Type

GSL GNU Scientific Library Software library

GSSAPI Generic Security Service Application Program Interface Security

GUI Graphical User Interface Interfaces

IaaS Infrastructure as a Service WP3

LDAP Lightweight Directory Access Protocol Security

JDBC Java DataBase Connectivity Database API

JSON JavaScript Object Notation Data Type

MPI Message Passing Interface Parallel Computing

NetCDF Network Common Data Form Data Type

NoSQL Not Only SQL Database Paradigm

ODBC Open DataBase Connectivity Database API


OLAP On-line Analytical Processing Type of processing

PAM Pluggable Authentication Module Security

QoS Quality of Service WP3

RADIUS Remote Authentication Dial-In User Service Security

RDD Resilient Distributed Dataset Data structure

REST REpresentational State Transfer Interfaces

RSS Rich Site Summary Data type

SAML Security Assertion Markup Language Security

SASL Simple Authentication and Security Layer Security

SSL Secure Sockets Layer Security

SSH Secure Shell Security

SSO Single sign-on Security

SSPI Security Support Provider Interface Security

TSV Tab-separated values Data Type

UI User Interface Interfaces

UML Unified Modeling Language Modelling language

WP Work package Project management

WRF Weather Research and Forecasting Weather Forecast Model
