Runtime Monitoring of Security SLAs for Big Data Pipelines


City Research Online
City, University of London Institutional Repository

Citation: Mantzoukas, K. (2020). Runtime monitoring of security SLAs for big data pipelines: design, implementation and evaluation of a framework for monitoring security SLAs in big data pipelines with the assistance of run-time code instrumentation. (Unpublished Doctoral thesis, City, University of London)

This is the accepted version of the paper. This version of the publication may differ from the final published version.

Permanent repository link: https://openaccess.city.ac.uk/id/eprint/25619/

Link to published version:

Copyright: City Research Online aims to make research outputs of City, University of London available to a wider audience. Copyright and Moral Rights remain with the author(s) and/or copyright holders. URLs from City Research Online may be freely distributed and linked to.

Reuse: Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge, provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page, and the content is not changed in any way.

City Research Online: http://openaccess.city.ac.uk/ [email protected]

Runtime Monitoring of Security SLAs for Big Data Pipelines
Design, implementation and evaluation of a framework for monitoring security SLAs in Big Data pipelines with the assistance of run-time code instrumentation

Konstantinos Mantzoukas
Supervisors: Prof. George Spanoudakis and Dr. Christos Kloukinas
Department of Computer Science
City, University of London

This dissertation is submitted for the degree of Doctor of Philosophy.
November 2020

I would like to dedicate this thesis to my loving wife Anna and my beautiful son Orestis.

Declaration

I hereby declare that, except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other, university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements.

Konstantinos Mantzoukas
November 2020

Acknowledgements

I would like to express my gratitude to my supervisor, Professor George Spanoudakis, for his unabating support and invaluable guidance throughout the research that led up to the authoring of this PhD thesis. I also wish to sincerely thank my second supervisor, Dr. Christos Kloukinas, for all the assistance he offered me during the conception, design and implementation of this dissertation. Finally, I would like to express my heartfelt appreciation to my family, friends and colleagues, who never stopped believing in me and constantly encouraged me to keep going, even in the darkest of hours.

Abstract

The Big Data processing ecosystem has been growing constantly in recent years. This growth has been significantly reinforced by the advent of cloud computing platforms, where Big Data analytics can be offered on an as-a-service basis. The ease with which users can leverage the capabilities of Big Data processing frameworks in the cloud has made them a popular solution with low up-front expenditure and a flexible deployment model.
In spite of their cost benefits and flexibility of use, Big Data services in cloud platforms present us with an array of new challenges compared to traditional web services, especially in the domain of data security and privacy. Their distributed nature makes them more dynamic with regard to deployment and execution, but at the same time it exacerbates challenges related to data and operation security, since both data and operations are shared across multiple nodes. Inevitably, distributing data and operations over multiple nodes increases the attack surface. Given the need for systems that react fast and produce results as quickly as possible, more emphasis has been placed on performance and less on security. That said, as the use of cloud computing becomes more widespread, concerns about non-functional properties such as data security are becoming more pronounced among users. Runtime security monitoring is a mechanism that can alleviate some of the issues that arise in monitoring the security of Big Data analytics services that are outsourced to the cloud. In this thesis we make the case for a monitoring framework in which monitoring events are collected and evaluated against a set of monitoring rules that describe monitorable security properties of the system. The framework that we put forward can be used to assess the level of security of Big Data analytics pipelines at runtime. For our proof of concept we examine three security properties, namely the service response time, the location of execution of service operations, and the integrity of the intermediate data produced during the service execution.

Table of contents

List of figures
List of tables
1 Introduction
  1.1 Overview
  1.2 Motivation and Research Challenges
  1.3 Summary of Research Aims and Objectives
    1.3.1 Review the literature
    1.3.2 Identify the monitoring framework's components
    1.3.3 Identify monitorable security properties
    1.3.4 Automate the translation of SLAs into monitoring rules
    1.3.5 Automate the deployment of the event captors
    1.3.6 Create an integrated SLA manager platform
  1.4 Research Assumptions
  1.5 Research Contributions
  1.6 Publications
  1.7 Thesis Outline
2 Literature Review
  2.1 Overview
  2.2 Security and Privacy Properties for Big Data
    2.2.1 Data Availability
    2.2.2 Data Privacy
    2.2.3 Data Integrity
    2.2.4 Data Confidentiality
  2.3 Monitoring Service Level Agreements
  2.4 Metrics for Service Level Agreements
  2.5 Monitoring Frameworks for the Cloud
    2.5.1 Commercial monitoring frameworks
    2.5.2 Open source monitoring frameworks
  2.6 Big Data Processing Frameworks
  2.7 Big Data Workflow Definition Tools and Frameworks
  2.8 Gap Analysis
  2.9 Summary
3 Monitoring Framework for Big Data Security SLAs
  3.1 Introduction
  3.2 Framework Architecture
    3.2.1 Composite Service Definition
    3.2.2 Security Requirements Specification
    3.2.3 Translation of Security Requirements into Monitoring Artefacts
    3.2.4 Installation of Monitoring Rules on the Monitor
    3.2.5 Definition and Installation of Event Captors on Apache Spark
  3.3 Monitoring Rules
    3.3.1 Monitoring Rules for Response Time
    3.3.2 Monitoring Rules for Location of Execution
    3.3.3 Monitoring Rules for Data Integrity During Service Execution
  3.4 Summary
4 SLA Management Web Dashboard
  4.1 Application Architecture Overview
  4.2 Application Repository
  4.3 Application REST API
  4.4 Energy producer use-case
  4.5 Screenshots for the energy provider use-case
  4.6 Summary
5 Framework Evaluation
  5.1 Experimental setup
  5.2 Quantitative Evaluation
    5.2.1 Event captor deployment overhead
    5.2.2 Event captor execution overhead
  5.3 Evaluation Summary and Discussion
  5.4 Summary
6 Conclusions and Future Work
  6.1 Overview
  6.2 Summary of Research Work
  6.3 Contributions
  6.4 Limitations
  6.5 Future Work
References
Appendix A Composed Task Runner for Spark Submit Command
  A.1 Spring Cloud Data Flow
    A.1.1 Overview
    A.1.2 Application Types
    A.1.3 Workflow Specification Language
    A.1.4 Application for the Execution of Apache Spark Jobs
  A.2 Apache Spark
    A.2.1 Overview
    A.2.2 Framework Architecture
    A.2.3 Execution Model
    A.2.4 Deployment Model
  A.3 EVEREST
    A.3.1 Event Calculus
    A.3.2 Framework Architecture
  A.4 Apache Velocity
    A.4.1 Overview
    A.4.2 Velocity Template Language
    A.4.3 Velocity Template Engine
  A.5 Byte Buddy
    A.5.1 Overview
    A.5.2 Java's Instrumentation API
    A.5.3 Runtime code instrumentation and Code Generation in Byte Buddy

List of figures

2.1 Lifecycle stages of data in the Cloud
2.2 QoSMONaaS system architecture
2.3 Apache Hadoop architecture overview
2.4 MapReduce algorithm overview
2.5 An example of an Apache Storm topology
2.6 Worker processes for the topology presented in figure 2.5
2.7 Overview of streams in Apache Samza
2.8 An example of a Samza dataflow graph
2.9 Overview of the task state persistence mechanism in Apache Flink
2.10 Architecture of the Pinball workflow manager
2.11 State diagram for job statuses in Pinball
3.1 Big Data Pipeline Monitoring Framework Architecture
3.2 Use Case UML diagram of the Big Data monitoring framework
3.3 Sequence diagram of the Big Data monitoring framework
3.4 Spring Cloud Data Flow pipelines
3.5 UML class diagram of the factory pattern for the implementation of the different emitter types supported by the event captors
3.6 Visual representation of events for monitoring response time
3.7 List of actions supported by the event captor for response time
3.8 Example of events that occur over time during the monitoring activity of the location of execution of computations
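To make the approach sketched in the abstract concrete: the thesis (Chapter 3 and Appendix A) pairs event captors, injected into Apache Spark via Java's instrumentation API and Byte Buddy, with an Event Calculus monitor (EVEREST) that evaluates the captured events against the SLA's monitoring rules. The sketch below is a rough, hypothetical illustration of the captor side only: a Java agent that times method executions and prints a response-time event. The agent class name, the matched type names, the event format, and the use of a pre-1.12 Byte Buddy four-argument Transformer lambda are assumptions for this sketch, not the thesis's actual code.

```java
import java.lang.instrument.Instrumentation;

import net.bytebuddy.agent.builder.AgentBuilder;
import net.bytebuddy.asm.Advice;
import net.bytebuddy.matcher.ElementMatchers;

// Hypothetical response-time event captor: a Java agent that instruments
// matching classes at load time and reports how long each method call took.
public class ResponseTimeCaptorAgent {

    public static void premain(String args, Instrumentation inst) {
        new AgentBuilder.Default()
                // Illustrative matcher; the thesis's captors target the
                // classes of the Spark jobs that make up the pipeline.
                .type(ElementMatchers.nameEndsWith("Service"))
                // Four-argument Transformer lambda (Byte Buddy ~1.10, the
                // version era of this thesis; later versions add a fifth
                // ProtectionDomain parameter).
                .transform((builder, type, classLoader, module) ->
                        builder.visit(Advice.to(TimingAdvice.class)
                                .on(ElementMatchers.isMethod())))
                .installOn(inst);
    }

    public static class TimingAdvice {

        @Advice.OnMethodEnter
        public static long enter() {
            // Capture the start timestamp of the monitored operation.
            return System.nanoTime();
        }

        @Advice.OnMethodExit
        public static void exit(@Advice.Enter long start,
                                @Advice.Origin String method) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // In the thesis's framework the event would be handed to an
            // emitter and sent to the EVEREST monitor for evaluation against
            // the SLA's monitoring rules; printing stands in for that here.
            System.out.printf("event captor: %s responseTime=%dms%n",
                    method, elapsedMs);
        }
    }
}
```

Packaged with a Premain-Class manifest entry and attached with -javaagent, such a captor produces the timestamped start and end events that a response-time monitoring rule, conceptually of the shape "if operation o starts at t1 and completes at t2, then t2 - t1 must not exceed the agreed bound r", would consume.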