Design and Evaluation of a Cloud Native Data Analysis Pipeline for Cyber Physical Production Systems

Producción Académica Ferrer Daub, Facundo Javier Design and evaluation of a cloud native data analysis pipeline for cyber physical production systems Tesis para la obtención del título de posgrado de Magister en Dirección de Empresas Director: Srur, Leandro Documento disponible para su consulta y descarga en Biblioteca Digital - Producción Académica, repositorio institucional de la Universidad Católica de Córdoba, gestionado por el Sistema de Bibliotecas de la UCC. Esta obra está bajo licencia 2.5 de Creative Commons Argentina. CC |®@©I Atribución-No comercial-Sin obras derivadas 2.5 ÉICDA U'IVnSIDAD CATOLICA .'i c Okcxjüa UNIVERSIDAD CATOLICA DE CORDOBA INSTITUO DE CIENCIAS DE LA ADMINISTRACIÓN TRABAJO FINAL DE MAESTRÍA EN DIRECCIÓN DE EMPRESAS DESIGN AND EVALUATION OF A CLOUD NATIVE DATA ANALYSIS PIPELINE FOR CYBER PHYSICAL PRODUCTION SYSTEMS AUTOR: FACUNDO JAVIER FERRER DAUB EMAIL: [email protected] DIRECTOR: Leandro Srur CÓRDOBA 2017 ABSTRACT Since 1991 with the birth of the World Wide Web the rate of data growth has been growing with a record level in the last couple of years. Big companies tackled down this data growth with expensive and enormous data centres to process and get value of this data. From social media, Internet of Things (IoT), new business process, monitoring and multimedia, the capacities of those data centres started to be a problem and required continuos and expensive expansion. Thus, Big Data was something that only a few were able to access. This changed fast when Amazon launched Amazon Web Services (AWS) around 15 years ago and gave the origins to the public cloud. At that time, the capabilities were still very new and reduced but 10 years later the cloud was a whole new business that changed for ever the Big Data business. This not only commoditised computer power but it was accompanied by a price model that let medium and small players the possibility to access it. In consequence, new problems arised regarding the nature of these distributed systems and the software architectures required for proper data processing. The present job analyse the type of typical Big Data workloads and propose an architecture for a cloud native data analysis pipeline. Lastly, it provides a chapter for tools and services that can be used in the architecture taking advantage of their open source nature and the cloud price models. UNIVERSIDAD CATOLICA DE CORDOBA INSTITUTO DE CIENCIAS DE LA ADMINISTRACION TRABAJO FINAL DE MAESTRÍA EN DIRECCIÓN DE EMPRESAS DESIGN AND EVALUATION OF A CLOUD NATIVE DATA ANALYSIS PIPELINE FOR CYBER PHYSICAL PRODUCTION SYSTEMS AUTOR: FACUNDO JAVIER FERRER DAUB DIRECTOR: LEANDRO SRUR CORDOBA 2017 CONTENTS I Introduction 4 1.1 Organization ..................................................................................... 4 1.2 Problem statem ent............................................................................. 4 1.3 State of the art and open problems.................................................. 6 II Data Types and Sources 8 2.1 Data types ........................................................................................ 8 2.1.1 Nondependency-oriented data ............................................ 9 2.1.2 Dependency-oriented d a t a .................................................. 11 2.2 Data properties ................................................................................... 12 2.3 Data sources ..................................................................................... 14 2.3.1 Information Technology (IT) .................................................. 14 2.3.2 Physical W o rld ....................................................................... 16 2.4 Metadata ........................................................................................... 17 2.4.1 What means metadata........................................................... 17 2.4.2 Importance of m etadata........................................................ 18 III Workloads 19 3.1 Streams and Batch ............................................................................. 19 IV Architecture Building Blocks 24 4.1 Introduction ........................................................................................ 24 4.2 Building Blocks ................................................................................... 25 4.2.1 Acquisition and Recording..................................................... 25 4.2.2 Extraction, Cleaning and Annotation................................... 27 4.2.3 Integration, Aggregation and Representation..................... 27 4.2.4 Modeling and Analysis........................................................... 28 4.2.5 Interpretation.......................................................................... 29 V Architecture Allocation to Tools and Services 30 5.1 Tools.................................................................................................... 30 5.1.1 Apache F lu m e ....................................................................... 30 5.1.2 Apache Sqoop ....................................................................... 31 5.1.3 Apache Kafka ....................................................................... 32 5.1.4 Apache Storm ....................................................................... 33 5.1.5 Apache H a d o o p .................................................................... 33 5.1.6 Apache Spark ....................................................................... 35 1 5.1.7 Apache F lin k ..............................................................................36 5.1.8 Apache Hive ..............................................................................37 5.1.9 Apache Imapala ........................................................................37 5.1.10 Apache P ig .................................................................................38 5.2 Cloud Providers and Services...............................................................39 5.2.1 A W S S 3 .......................................................................................41 5.2.2 AWS Kinesis F a m ily ..................................................................41 5.2.3 AWS Lambda..............................................................................41 5.2.4 AWS Elastic M apR educe.........................................................42 5.2.5 AWS Q u icksig h t........................................................................42 5.2.6 DataTorrent.................................................................................43 VI Summary and Outlook 44 6.1 Conclusions ........................................................................................ ....44 6.2 Limitations ........................................................................................ ....45 6.3 Outlook .............................................................................................. ....45 2 3 I. INTRODUCTION 1.1 Organization This dissertation shows how a cloud native architecture, can help system designers to implement analytic data pipelines. This thesis is organized as fol- lows. Chapter 1 explains the statement of problem, basic principles of data analytics are outlined in order to better define specifics aspects of the data pipeline modeling addressed in this work. In addition, the current state of the art and the open problems are explained here too. Chapter 2, introduces the different data types that are described in the the sis. Among them, the related data sources reviewed are explaining here. Finally, a mention to the importance of the metadata, the data about the data, is included. Handling of the different data types can be done in two main ways, stream and batch processing. The characterization of these workloads, with its related data pedigree is discussed in Chapter 3. Furthermore, the privacy of data col- lected and processed is an important area explained as well in this chapter. In Chapter 4 describes, the architecture building blocks, not only from a technical point of view but in regards of the business value of the presented ar chitecture; and the importance of a cost-efficient pipeline by a proper selection of the building blocks. Chapter 5 and 6, try to give a short overview of different tools and services that can be used in each of the different phases of the data pipeline. Besides, an experimental setup of the data pipeline is done with the analysis of real-world data-sets and its corresponding results. Chapter 7 discusses the results of the whole work as well as presents directions for future work. Last but not the least, other topics like, data backup, replication and normalization are contained within the Appendices. 1.2 Problem statement In 1991 computer scientist Tim Berners-Lee announced the birth of what would become the World Wide Web as we know it today, interconnected web of data, accessible to anyone from anywhere. Since then, an explosion of information took place in the digital world (Berners-Lee, 1991, p. 3). Almost ten years later, Doug Laney, was defining data growth challenges and opportunities as the three of what will come to be the commonly-accepted characteristics of Big Data, volume, velocity and variety (Laney, 2001, p. 70). 4 And later, some scholars added veracity and value as a 5 Vs model (Assuncao, Calheiros, Bianchi, Netto, & Buyya, 2015). IDC started a research in 2005 on how big is the digital world and in the partial results published in the 2007 report, the size by 2006 was 161 exabytes and forecasted

Design and Evaluation of a Cloud Native Data Analysis Pipeline for Cyber Physical Production Systems

Working with Storm Topologies Date of Publish: 2018-08-13

Apache Flink™: Stream and Batch Processing in a Single Engine

DSP Frameworks DSP Frameworks We Consider

Apache Apex: Next Gen Big Data Analytics

The Cloud‐Based Demand‐Driven Supply Chain

Apache Storm Tutorial

Pohorilyi Magistr.Pdf

HDP 3.1.4 Release Notes Date of Publish: 2019-08-26

Apache Calcite: a Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Hdf® Stream Developer 3 Days

ADMI Cloud Computing Presentation

Installing and Configuring Apache Storm Date of Publish: 2018-08-30