Runtime Monitoring of Security SLAs for Big Data Pipelines


City Research Online
City, University of London Institutional Repository

Citation: Mantzoukas, K. (2020). Runtime monitoring of security SLAs for big data pipelines: design, implementation and evaluation of a framework for monitoring security SLAs in big data pipelines with the assistance of run-time code instrumentation. (Unpublished Doctoral thesis, City, University of London)

This is the accepted version of the paper. This version of the publication may differ from the final published version.

Permanent repository link: https://openaccess.city.ac.uk/id/eprint/25619/

Link to published version:

Copyright: City Research Online aims to make research outputs of City, University of London available to a wider audience. Copyright and Moral Rights remain with the author(s) and/or copyright holders. URLs from City Research Online may be freely distributed and linked to.

Reuse: Copies of full items can be used for personal research or study, educational, or not-for-profit purposes without prior permission or charge, provided that the authors, title and full bibliographic details are credited, a hyperlink and/or URL is given for the original metadata page, and the content is not changed in any way.

City Research Online: http://openaccess.city.ac.uk/ [email protected]

Runtime Monitoring of Security SLAs for Big Data Pipelines
Design, implementation and evaluation of a framework for monitoring security SLAs in Big Data pipelines with the assistance of run-time code instrumentation

Konstantinos Mantzoukas
Supervisors: Prof. George Spanoudakis and Dr. Christos Kloukinas
Department of Computer Science
City, University of London

This dissertation is submitted for the degree of Doctor of Philosophy.
November 2020

I would like to dedicate this thesis to my loving wife Anna and my beautiful son Orestis.

Declaration

I hereby declare that, except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other, university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements.

Konstantinos Mantzoukas
November 2020

Acknowledgements

I would like to express my gratitude to my supervisor, Professor George Spanoudakis, for his unabating support and invaluable guidance throughout the research that led up to the authoring of this PhD thesis. I also wish to sincerely thank my second supervisor, Dr. Christos Kloukinas, for all the assistance he offered me during the conception, design and implementation of this dissertation. Finally, I would like to express my heartfelt appreciation to my family, friends and colleagues, who never stopped believing in me and constantly encouraged me to keep going, even in the darkest of hours.

Abstract

The Big Data processing ecosystem has been growing constantly in recent years. This growth has been significantly reinforced by the advent of cloud computing platforms, where Big Data analytics can be offered on an as-a-service basis. The ease with which users can leverage the capabilities of Big Data processing frameworks in the cloud has made them a popular solution with low up-front expenditure and a flexible deployment model.
In spite of their cost benefits and flexibility of use, Big Data services in cloud platforms present us with an array of new challenges compared to traditional web services, especially in the domain of data security and privacy. Their distributed nature makes them more dynamic with regard to deployment and execution, but at the same time it exacerbates challenges related to data and operation security, since both data and operations are shared across multiple nodes. Inevitably, distributing data and operations over multiple nodes increases the attack surface. Given the need for systems that react fast and produce results as quickly as possible, more emphasis has been placed on performance and less on security. That said, as the use of cloud computing becomes more widespread, concerns about non-functional properties such as data security are becoming more pronounced among users. Runtime security monitoring is a mechanism that can alleviate some of the issues that arise in monitoring the security of Big Data analytics services that are outsourced to the cloud. In this thesis we make the case for a monitoring framework in which monitoring events are collected and evaluated against a set of monitoring rules that describe monitorable security properties of the system. The framework that we put forward can be used to assess the level of security of Big Data analytics pipelines at runtime. For our proof of concept we examine three security properties, namely the service response time, the location of execution of service operations, and the integrity of the intermediate data produced during the service execution.

Table of contents

List of figures
List of tables
1 Introduction
  1.1 Overview
  1.2 Motivation and Research Challenges
  1.3 Summary of Research Aims and Objectives
    1.3.1 Review the literature
    1.3.2 Identify the monitoring framework's components
    1.3.3 Identify monitorable security properties
    1.3.4 Automate the translation of SLAs into monitoring rules
    1.3.5 Automate the deployment of the event captors
    1.3.6 Create an integrated SLA manager platform
  1.4 Research Assumptions
  1.5 Research Contributions
  1.6 Publications
  1.7 Thesis Outline
2 Literature Review
  2.1 Overview
  2.2 Security and Privacy Properties for Big Data
    2.2.1 Data Availability
    2.2.2 Data Privacy
    2.2.3 Data Integrity
    2.2.4 Data Confidentiality
  2.3 Monitoring Service Level Agreements
  2.4 Metrics for Service Level Agreements
  2.5 Monitoring Frameworks for the Cloud
    2.5.1 Commercial monitoring frameworks
    2.5.2 Open source monitoring frameworks
  2.6 Big Data Processing Frameworks
  2.7 Big Data Workflow Definition Tools and Frameworks
  2.8 Gap Analysis
  2.9 Summary
3 Monitoring Framework for Big Data Security SLAs
  3.1 Introduction
  3.2 Framework Architecture
    3.2.1 Composite Service Definition
    3.2.2 Security Requirements Specification
    3.2.3 Translation of Security Requirements into Monitoring Artefacts
    3.2.4 Installation of Monitoring Rules on the Monitor
    3.2.5 Definition and Installation of Event Captors on Apache Spark
  3.3 Monitoring Rules
    3.3.1 Monitoring Rules for Response Time
    3.3.2 Monitoring Rules for Location of Execution
    3.3.3 Monitoring Rules for Data Integrity During Service Execution
  3.4 Summary
4 SLA Management Web Dashboard
  4.1 Application Architecture Overview
  4.2 Application Repository
  4.3 Application REST API
  4.4 Energy producer use-case
  4.5 Screenshots for the energy provider use-case
  4.6 Summary
5 Framework Evaluation
  5.1 Experimental setup
  5.2 Quantitative Evaluation
    5.2.1 Event captor deployment overhead
    5.2.2 Event captor execution overhead
  5.3 Evaluation Summary and Discussion
  5.4 Summary
6 Conclusions and Future Work
  6.1 Overview
  6.2 Summary of Research Work
  6.3 Contributions
  6.4 Limitations
  6.5 Future Work
References
Appendix A Composed Task Runner for Spark Submit Command
  A.1 Spring Cloud Data Flow
    A.1.1 Overview
    A.1.2 Application Types
    A.1.3 Workflow Specification Language
    A.1.4 Application for the Execution of Apache Spark Jobs
  A.2 Apache Spark
    A.2.1 Overview
    A.2.2 Framework Architecture
    A.2.3 Execution Model
    A.2.4 Deployment Model
  A.3 EVEREST
    A.3.1 Event Calculus
    A.3.2 Framework Architecture
  A.4 Apache Velocity
    A.4.1 Overview
    A.4.2 Velocity Template Language
    A.4.3 Velocity Template Engine
  A.5 Byte Buddy
    A.5.1 Overview
    A.5.2 Java's Instrumentation API
    A.5.3 Runtime code instrumentation and Code Generation in Byte Buddy

List of figures

2.1 Lifecycle stages of data in the Cloud
2.2 QoSMONaaS system architecture
2.3 Apache Hadoop architecture overview
2.4 MapReduce algorithm overview
2.5 An example of an Apache Storm topology
2.6 Worker processes for the topology presented in figure 2.5
2.7 Overview of streams in Apache Samza
2.8 An example of a Samza dataflow graph
2.9 Overview of the task state persistence mechanism in Apache Flink
2.10 Architecture of the Pinball workflow manager
2.11 State diagram for job statuses in Pinball
3.1 Big Data Pipeline Monitoring Framework Architecture
3.2 Use Case UML diagram of the Big Data monitoring framework
3.3 Sequence diagram of the Big Data monitoring framework
3.4 Spring Cloud Data Flow pipelines
3.5 UML class diagram of the factory pattern for the implementation of the different emitter types supported by the event captors
3.6 Visual representation of events for monitoring response time
3.7 List of actions supported by the event captor for response time
3.8 Example of events that occur over time during the monitoring activity of the location of execution of computations
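To make the approach sketched in the abstract concrete: the thesis (Chapter 3 and Appendix A) pairs event captors, injected into Apache Spark via Java's instrumentation API and Byte Buddy, with an Event Calculus monitor (EVEREST) that evaluates the captured events against the SLA's monitoring rules. The sketch below is a rough, hypothetical illustration of the captor side only: a Java agent that times method executions and prints a response-time event. The agent class name, the matched type names, the event format, and the use of a pre-1.12 Byte Buddy four-argument Transformer lambda are assumptions for this sketch, not the thesis's actual code.

```java
import java.lang.instrument.Instrumentation;

import net.bytebuddy.agent.builder.AgentBuilder;
import net.bytebuddy.asm.Advice;
import net.bytebuddy.matcher.ElementMatchers;

// Hypothetical response-time event captor: a Java agent that instruments
// matching classes at load time and reports how long each method call took.
public class ResponseTimeCaptorAgent {

    public static void premain(String args, Instrumentation inst) {
        new AgentBuilder.Default()
                // Illustrative matcher; the thesis's captors target the
                // classes of the Spark jobs that make up the pipeline.
                .type(ElementMatchers.nameEndsWith("Service"))
                // Four-argument Transformer lambda (Byte Buddy ~1.10, the
                // version era of this thesis; later versions add a fifth
                // ProtectionDomain parameter).
                .transform((builder, type, classLoader, module) ->
                        builder.visit(Advice.to(TimingAdvice.class)
                                .on(ElementMatchers.isMethod())))
                .installOn(inst);
    }

    public static class TimingAdvice {

        @Advice.OnMethodEnter
        public static long enter() {
            // Capture the start timestamp of the monitored operation.
            return System.nanoTime();
        }

        @Advice.OnMethodExit
        public static void exit(@Advice.Enter long start,
                                @Advice.Origin String method) {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            // In the thesis's framework the event would be handed to an
            // emitter and sent to the EVEREST monitor for evaluation against
            // the SLA's monitoring rules; printing stands in for that here.
            System.out.printf("event captor: %s responseTime=%dms%n",
                    method, elapsedMs);
        }
    }
}
```

Packaged with a Premain-Class manifest entry and attached with -javaagent, such a captor produces the timestamped start and end events that a response-time monitoring rule, conceptually of the shape "if operation o starts at t1 and completes at t2, then t2 - t1 must not exceed the agreed bound r", would consume.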