A Grey-box Approach to Benchmarking and Performance Modelling of Data-Intensive Applications

Sheriffo Ceesay

University of St Andrews

This thesis is submitted in partial fulfilment for the degree of

Doctor of Philosophy (PhD)

at the University of St Andrews

June 2020

Abstract

The advent of big data about a decade ago, coupled with its processing and storage challenges, gave rise to the development of a multitude of data-intensive frameworks. These distributed parallel processing frameworks can be used to process petabytes of data stored in a cluster of computing nodes. Companies and organisations can now process massive amounts of data to drive innovation and gain a competitive advantage. However, these new paradigms have resulted in several research challenges due to their inherent difference from the more mature traditional data processing and storage systems. Firstly, they are comparatively more modern, supporting the execution of a wide variety of new data-intensive workloads with varying performance requirements. Therefore, there is a clear need to study and standardise ways to benchmark and compare them to identify and improve performance bottlenecks. Secondly, they are highly configurable, giving users the freedom to tune the execution environment based on the application's performance requirements. However, this freedom and the sheer number of configuration parameters present an additional challenge by shifting the tuning and optimisation responsibility for these numerous configuration parameters to the users. To address the above broad challenges, in this research we enabled a grey-box benchmarking and performance modelling framework focusing on two of the most common communication patterns for data-intensive applications. The use of communication patterns allowed us to classify and study varying but related data-intensive workloads using the same sets of requirements. Furthermore, we enabled a multi-objective performance prediction framework that can be used to answer various performance-related questions, such as the time it takes to execute an application, the best configuration parameters to satisfy constraints such as deadlines, and the recommendation of optimal cloud instances to minimise monetary cost. To gauge the generality of this work, we validated the results on two internal clusters, and the results are consistent across both setups. We have also provided a REST API and web implementation for validation. The primary takeaway is that this research showcases a comprehensive approach that can be used to benchmark and model the performance of data-intensive applications.

Acknowledgements

Alhamdulillah! The beginning of this journey did not start in 2016; it started the first day I set foot in school. Mentioning all those who contributed to this success would be another thesis; however, I acknowledge and appreciate the efforts of everyone who has helped me in one way or the other. Below are a few of these appreciations.

• First, I sincerely want to thank my supervisor, Prof. Adam Barker, for rigorously guiding me throughout the entire journey of this research. I cannot forget how he helped me narrow down my initial research idea to a specific topic. Furthermore, when my funding ended in October 2019, he generously approved a six-month stipend, without which it would have been impossible for me to cope financially. I am greatly indebted to him forever!

• I want to thank Dr Yuhui Lin for the numerous and brilliant constructive suggestions he made to improve my research. He also took time from his busy schedule to provide much-needed and useful feedback on my thesis and the papers I wrote. I can't thank him enough!

• I want to thank my parents and my entire extended family for their continuous support and prayers. I want to thank my wife and kids for their endless love and understanding, especially during the tough moments of my studies. Special thanks go to my brother, Balla Ceesay, for the financial and emotional support he accorded me in difficult times. I also want to thank Balla, Nyambi, Fatou and other visitors for their visits to Scotland and the quality time they spent with us. I can't forget Mr Douglas Ricket, a US Peace Corps volunteer and my high school Physics teacher, who has always been supportive of my educational goals.

• I would not do justice without showing appreciation to the School of Computer Science, especially to Dr Stuart Norcross and the entire FixIt team, for providing a conducive and well-equipped research environment. Finally, I want to thank my lab mates, friends and colleagues at the University of St Andrews and the Gambians in Dundee for making this long journey exciting and less lonely for me.

• This work was supported by the Islamic Development Bank and the University of St Andrews (School of Computer Science).

Declaration

Candidate’s Declarations I, Sheriffo Ceesay, do hereby certify that this thesis, submitted for the degree of PhD, which is approximately 39,409 words in length, has been written by me, and that it is the record of work carried out by me, or principally by myself in collaboration with others as acknowledged, and that it has not been submitted in any previous application for any degree. I was admitted as a research student at the University of St Andrews in October 2016. I received funding from an organisation or institution and have acknowledged the funder(s) in the full text of my thesis.

Date Signature of candidate:

Supervisor’s Declaration I hereby certify that the candidate has fulfilled the conditions of the Resolution and Regulations appropriate for the degree of PhD in the University of St Andrews and that the candidate is qualified to submit this thesis in application for that degree.

Date Signature of supervisor:


Permission for Electronic Publication

In submitting this thesis to the University of St Andrews we understand that we are giving permission for it to be made available for use in accordance with the regulations of the University Library for the time being in force, subject to any copyright vested in the work not being affected thereby. We also understand, unless exempt by an award of an embargo as requested below, that the title and the abstract will be published, and that a copy of the work may be made and supplied to any bona fide library or research worker, that this thesis will be electronically accessible for personal or research use and that the library has the right to migrate this thesis into new electronic forms as required to ensure continued access to the thesis.

I, Sheriffo Ceesay, confirm that my thesis does not contain any third-party material that requires copyright clearance.

The following is an agreed request by candidate and supervisor regarding the publication of this thesis:

Printed copy No embargo on print copy.

Electronic copy No embargo on electronic copy.

Date Signature of candidate:

Date Signature of supervisor:

Underpinning Research Data or Digital Outputs

Candidate's declaration I, Sheriffo Ceesay, hereby certify that no requirements to deposit original research data or digital outputs apply to this thesis and that, where appropriate, secondary data used have been referenced in the full text of my thesis.

Date Signature of candidate

This thesis is dedicated to my lovely parents! I love you all.

CONTENTS

Contents i

List of Figures vi

List of Tables viii

Acronyms xi

Glossary xiii

1 Introduction 1 1.1 Rise of Big Data ...... 1 1.2 Big Data Frameworks ...... 2 1.3 The Problem ...... 3 1.4 Research Questions ...... 5 1.5 Main Contributions ...... 8 1.6 Publications ...... 10 1.7 Thesis Structure ...... 11 1.8 Summary ...... 12

2 Background 13 2.1 Introduction ...... 13 2.2 Big Data ...... 13 2.2.1 Data Intensive Applications ...... 14 2.2.2 Core Properties of Big Data ...... 15 2.2.3 Cloud Computing ...... 16 2.3 Dataflow Communication Patterns ...... 17 2.3.1 MapReduce Communication Pattern ...... 18 2.3.2 Dataflow With Cycle Pattern (DFWC) ...... 19 2.3.3 Data Flow With Barriers, Chained Jobs ...... 20 2.3.4 Data Flow Without Explicit Barriers ...... 21 2.4 Parallel Programming ...... 21 2.4.1 Amdahl's Law ...... 22 2.5 Parallel Patterns ...... 23 2.5.1 Common Algorithmic Skeleton Implementations ...... 24 2.5.2 Cost Models For Parallel Patterns ...... 24


2.6 Big Data Frameworks ...... 25 2.7 Hadoop ...... 25 2.7.1 Hadoop Distributed File System ...... 25 2.7.2 Hadoop MapReduce Programming Model ...... 26 2.7.3 YARN (Yet Another Resource Negotiator) ...... 27 2.8 Apache Spark ...... 27 2.9 Benchmarking ...... 29 2.9.1 Approaches to Benchmark Design ...... 29 2.9.2 Benchmarking Suites ...... 30 2.10 Performance Modelling ...... 31 2.10.1 Approaches to Performance Modelling ...... 31 2.10.2 What to Model ...... 32 2.11 Machine Learning ...... 32 2.11.1 Supervised Learning ...... 32 2.11.2 Unsupervised Learning ...... 34 2.12 Summary ...... 34

3 Literature Review 37 3.1 Introduction ...... 37 3.2 Survey Methodology ...... 37 3.3 Literature Review of MapReduce Pattern ...... 40 3.4 Literature Review of Dataflow With Cycle Pattern ...... 43 3.5 Taxonomy and Classification ...... 47 3.5.1 General Approach ...... 48 3.5.2 Benchmarking and Profiling Approach ...... 49 3.5.3 Modelling Approach ...... 51 3.5.4 Computing Framework ...... 52 3.5.5 Experimental Design ...... 52 3.5.6 Objective ...... 53 3.6 Research Gap Analysis ...... 54 3.6.1 General Approach ...... 54 3.6.2 Use of Representative Workloads to Effectively Profile a Computing Cluster Based on Communication Patterns ...... 56 3.6.3 Curse of Configuration Parameters ...... 57 3.6.4 Generic Benchmarking ...... 58 3.7 Discussion on Current Trends ...... 59 3.8 Summary ...... 59

4 General Methodology and Practical Use Cases 61 4.1 Introduction ...... 61 4.2 General Methodology ...... 61 4.2.1 Identifying Candidate Performance Drivers ...... 63 4.2.2 Benchmarking ...... 64 4.2.3 Model Building ...... 64 4.2.4 Prediction and Decision Making ...... 64

4.3 Use Cases ...... 64 4.3.1 A recap of the Problem ...... 65 4.3.2 Use Case 1: Predicting Execution Time MR Pattern ...... 65 4.3.3 Use Case 2: Predicting Execution Time DFWC Pattern ...... 66 4.3.4 Use Case 3: Inferring Best Configuration Parameters ...... 69 4.4 The generalisation of our approach ...... 70 4.5 Summary ...... 70

5 Benchmarking and Modelling MapReduce Pattern 73 5.1 Introduction ...... 73 5.2 Problem Background ...... 74 5.3 Theoretical Models of The MapReduce Pattern ...... 75 5.3.1 Map Phase ...... 76 5.3.2 Read ...... 77 5.3.3 Custom Map ...... 77 5.3.4 Collect ...... 78 5.3.5 Spill ...... 78 5.3.6 Merge ...... 78 5.3.7 Reduce Phase ...... 79 5.3.8 Shuffle ...... 79 5.3.9 Custom Reduce ...... 79 5.3.10 Write ...... 80 5.3.11 Combining it All Together ...... 80 5.3.12 Cost Model For The Entire Process ...... 80 5.4 Profiling and Modelling Methodology ...... 81 5.4.1 Adding Custom Framework Hadoop MapReduce Counters ...... 82 5.5 Phase Profiling Implementation ...... 83 5.5.1 Generic Mapper Implementation ...... 84 5.5.2 Map Selectivity Implementation ...... 84 5.5.3 Generic Reducer Implementation ...... 85 5.6 YARN Log Parser ...... 86 5.7 Model Building ...... 86 5.7.1 Read Model ...... 87 5.7.2 Collect Model ...... 88 5.7.3 Spill Model ...... 89 5.7.4 Merge Model ...... 90 5.7.5 Shuffle Model ...... 90 5.7.6 Write Model ...... 90 5.7.7 Custom Map and Reduce ...... 90 5.7.8 Generating the Input Parameters ...... 91 5.8 Experiment and Evaluation ...... 92 5.8.1 Setup ...... 92 5.8.2 Representative Workloads ...... 93 5.8.3 Summarisation Pattern ...... 93 5.8.4 Filtering Pattern ...... 94

5.8.5 Data Organisation Pattern ...... 95 5.8.6 Join Pattern ...... 95 5.8.7 Evaluation of Results ...... 97 5.8.8 Summarisation Design Pattern ...... 97 5.8.9 Filtering Pattern ...... 98 5.8.10 Data Organisation Pattern ...... 98 5.8.11 Join Pattern ...... 100 5.9 Related Work ...... 101 5.9.1 Discussion ...... 102 5.10 Summary ...... 103

6 Benchmarking and Modelling Dataflow With Cycles Pattern (DFWC) 105 6.1 Background ...... 106 6.2 Methodology ...... 106 6.2.1 Key Config Selection ...... 107 6.2.2 Benchmarking ...... 107 6.2.3 Modelling & Decision Making ...... 108 6.3 Selection of Key Framework Level Configuration Parameters ...... 108 6.4 Selection of Key ML Application Configuration Parameters ...... 110 6.4.1 Linear Regression ...... 112 6.4.2 Logistic Regression ...... 112 6.4.3 SVM ...... 113 6.4.4 K-Means ...... 113 6.4.5 Decision Tree ...... 113 6.4.6 Random Forest ...... 114 6.4.7 Naive Bayes ...... 114 6.5 Benchmarking Design ...... 114 6.5.1 Customising HiBench ...... 115 6.5.2 Dynamic Configuration Files ...... 116 6.5.3 Benchmark Runner Implementation ...... 116 6.5.4 Discussion and Observations ...... 117 6.5.5 spark.executor.instance ...... 118 6.5.6 spark.executor.memory ...... 119 6.5.7 spark.executor.cores ...... 119 6.5.8 spark.default.parallelism ...... 121 6.6 Model Building ...... 121 6.6.1 Training and Tuning The Models ...... 121 6.6.2 Model Selection ...... 124 6.7 Experiment and Evaluation ...... 125 6.7.1 Setup ...... 125 6.7.2 Evaluation of Results ...... 126 6.7.3 Support Vector Machines and Naive Bayes ...... 126 6.7.4 Linear and Logistic Regression ...... 127 6.7.5 Random Forest and K-means ...... 128 6.7.6 Evaluation of Model Performance ...... 129

6.8 Decision Making ...... 131 6.9 Related Work ...... 135 6.10 Discussion ...... 137 6.11 Summary ...... 138

7 Conclusion 139 7.1 Thesis Summary and Revisiting Research Contributions ...... 140 7.2 Lessons Learned ...... 141 7.3 Future Work ...... 143

Appendix A Implementation Codes and Details 145 A.1 MapReduce Benchmark Runner ...... 145 A.2 MapReduce Generic Benchmarking Implementation ...... 146 A.3 MapReduce Log Parser Implementation ...... 148 A.4 Benchmark Runner DFWC ...... 151

Appendix B Customising HiBench 157 B.1 Dynamic Configuration ...... 158

References 159

LIST OF FIGURES

1.1 Configuration Parameters vs Performance Goals ...... 4

2.1 The MapReduce Communication Pattern ...... 18 2.2 Dataflow with Cycles ...... 19 2.3 Dataflow with Cycles: DAG of k-means Algorithm...... 20 2.4 Data Flow With Barriers: From top to bottom, each task read from a distributed file system like HDFS, the data is then processed by mappers, shuffled, reduced by optional reducers and final results are written back to HDFS. The dependent task after the barriers must wait for all tasks before them to finish execution...... 21 2.5 Data Flow Without Explicit Barriers: From top to bottom, each task read from a distributed file system like HDFS, the data is then processed by mappers, shuffled, reduced by optional reducers and final results are written back to HDFS. Dependent task after the execution can continue as soon as some of the parent tasks complete execution...... 21 2.6 A high-level Architecture of the MapReduce Program (Letter Count) ...... 26 2.7 Apache Spark Architecture ...... 28 2.8 Benchmarking pipeline for big data systems ...... 29

3.1 Grouping questions by category ...... 38 3.2 Taxonomy of Benchmarking and Performance Modelling of DIA ...... 50 3.3 Classification of General Benchmarking Performance Modelling Approaches . . . 56 3.4 Classification of Workload Groupings ...... 57 3.5 Classification of Benchmarking Approaches ...... 58

4.1 General Methodology comprising of four major steps ...... 62 4.2 Execution Time vs Scalability ...... 65 4.3 A web based interface to predict the execution time of MR applications...... 66

5.1 Execution time for MapReduce TeraSort [43] and Simple Sort ...... 74 5.2 Execution time for MapReduce WordCount Program ...... 75 5.3 A MapReduce Workflow For Each Task [136] ...... 76 5.4 A Phase Profiling and Modelling Methodology ...... 82 5.5 Sample Data Generated by YARN Log Parser ...... 86 5.6 Results of 10 Fold Cross-Validation and Prediction on the 8-Node Cluster ...... 87 5.7 Best fit plot for each of the generic phases ...... 89 5.8 Applications in Summarisation Design Pattern ...... 97


5.9 Filtering Pattern ...... 99 5.10 Data Organisation Pattern ...... 99 5.11 Join Pattern ...... 100

6.1 A Four Component Methodology ...... 107 6.2 The frequency distribution plot of compression types ...... 109 6.3 Plots of Scaling Executor Memory ...... 110 6.4 Frequency Distribution Plot: Scaling Number of Executors for K-means algorithm ...... 118 6.5 Box plot showing the effect of scaling Number of Executors ...... 119 6.6 Scaling Executors Memory for SVM algorithm ...... 120 6.7 Scaling Executors Cores for Random Forest ...... 120 6.8 Modelling Flowchart ...... 122 6.9 Model Evaluation Plot: RMSE ...... 124 6.10 Model Evaluation Plot: R2 ...... 125 6.11 Prediction Results for SVM workload ...... 126 6.12 Prediction Results for Naive Bayes workload ...... 127 6.13 Prediction Results for Linear Regression workload ...... 127 6.14 Prediction Results for Logistic Regression workload ...... 128 6.15 Prediction Results for K-means workload ...... 128 6.16 Prediction Results for Random Forest workload ...... 129 6.17 Model Performance Evaluation: Scaling Data Size ...... 130 6.18 Model Performance Evaluation: Scaling Executor Memory ...... 131

LIST OF TABLES

3.1 Taxonomy and Classification of Benchmarking and Performance Modelling . . . . 55

5.1 Custom MapReduce Framework Counters ...... 82 5.2 RMSE and R-Squared values for Shuffle Phase ...... 88 5.3 RMSE and R-Squared values for Write Phase ...... 88 5.4 Metrics Extracted From Logs ...... 91 5.5 Metrics with values ...... 92 5.6 Workloads Used ...... 96 5.7 Results for Summarisation on 8-Node Cluster ...... 98 5.8 Results for Filtering Pattern on 8-Node Cluster ...... 98 5.9 Results for Data Organisation Pattern ...... 100 5.10 Experiment Data for Join Pattern ...... 101 5.11 Comparison of Related Work: The average prediction error used in the literature focuses on the absolute difference between the predicted and observed values. . . . 102

6.1 Candidate Configuration Parameters ...... 111 6.2 Tunable Parameter: Caret Package in R ...... 123 6.3 Full description and parameters used in the model generation process...... 124 6.4 Comparison of Related Work ...... 135

A.1 List of GitHub URLs. These URLs contain both the raw data and the implementation to generate the models ...... 155

LISTINGS

4.1 Running K-Means prediction using REST API. Example 1, where the size of the data is 5GB ...... 68 4.2 Running K-Means prediction using REST API. Example 2, where the size of the data is scaled to 15GB ...... 68 4.3 Get the best configuration settings for running Linear Regression on 12 GB data ...... 69 4.4 Get the best settings for running Logistic Regression on 12 GB data ...... 70 5.1 Adding Custom Counters To The MapReduce Framework ...... 83 5.2 Generic Mapper Implementation ...... 84 5.3 Map Selectivity Implementation ...... 85 6.1 Mapping From Properties to Environment Variable Names ...... 116 6.2 Benchmark Runner: A fragment of the implementation of the Benchmark Runner ...... 117 6.3 Default Model without any tuning ...... 122 6.4 Training and Tuning using Random Search ...... 123 6.5 Training and Tuning Model Parameters using the Grid Method ...... 123 6.6 Using curl calls to interact with the REST API to get the execution time of running an application using specific configuration values. We are interested in the accuracy of our results. For each application, we only scale the data size and keep all other application settings based on HiBench's recommended settings ...... 132 6.7 The best and the worst configuration settings for running k-means on 12 GB data. We are interested in the accuracy of our results. For each application, we only scale the data size and keep all other application settings based on HiBench's recommended settings. For example, for K-means we keep k constant and scale the dataset from 1GB to 12GB ...... 132 6.8 Prerequisite to Getting The Best Configuration Parameters ...... 133 6.9 Dynamic Imputation of Configuration Values ...... 134 6.10 Two best cost-effective EC2 instances for running the k-means application ...... 134 A.1 The Benchmark Runner executes the generic benchmarking code programmed in Java. For each data size, we scale the map selectivity from 10% to 100% ...... 145 A.2 A generic benchmarking implementation executed to profile the different stages of the MapReduce framework ...... 146 A.3 Log Parser Implementation. MR counters are written to the log; this program extracts the relevant MR counters from the logs ...... 148 A.4 Generic Reducer Implementation ...... 151 A.5 A general method of modelling and performance prediction ...... 154 B.1 Add New Configuration Parameters for Reporting ...... 157


B.2 Dynamic Imputation of Configuration Values: This is just a fragment of the full implementation showing how we impute various values of Number of Executors and Executor Memory for each job execution ...... 158

ACRONYMS

API Application Programming Interface

DAG Directed Acyclic Graph

DFWC Dataflow With Cycle

DIA Data-Intensive Applications

EC2 Amazon Elastic Compute Cloud

HDFS Hadoop Distributed File System

HPC High Performance Computing

IaaS Infrastructure As A Service

JVM Java Virtual Machine

LHS Latin Hypercube Sampling

ML Machine Learning

MPI Message Passing Interface

MR MapReduce

NLP Natural Language Processing

PaaS Platform As A Service

RDBMS Relational Database Management System

RDD Resilient Distributed Dataset

RMSE Root Mean Squared Error

SaaS Software As A Service

SQL Structured Query Language


SVM Support Vector Machine

SVR Support Vector Regression

TF-IDF Term Frequency Inverse Document Frequency

YARN Yet Another Resource Negotiator

YCSB Yahoo! Cloud Serving Benchmark

GLOSSARY

Best Configuration The best configuration is a set of parameter value settings that can be used to optimally execute a job to meet certain objectives. These objectives can be deadlines or other time constraints.

Error Rate The error rate in this thesis is a measure of the difference between the values of our prediction models and the measured values, that is, the actual time it takes to run a particular application on a framework.

Representative ML workloads These are machine learning workloads that are widely used in the literature.

Semi-representative Workloads These are a group of workloads that include representative and non-representative workloads. We came up with this classification to group papers based on the types of workload they include in their experiments.

Worst Configuration The worst configuration is a set of parameter value settings that, when used, will yield very poor performance. It is important to assess this to understand the extreme cases of parameter settings.


CHAPTER ONE

INTRODUCTION

1.1 Rise of Big Data

Since the advent of big data about a decade ago, data-intensive applications have become one of the driving forces for innovation in modern tech companies such as Google, Amazon, Facebook, Microsoft and LinkedIn. These companies generate terabytes of data daily and perform various analytic tasks to gain a competitive advantage. The analytics may include running Machine Learning (ML) applications to cluster customers based on certain criteria such as buying behaviour, performing product recommendation [15], business intelligence analytics [141] and social network connection analysis [17]. To cope with this massive explosion of data and the demand to handle computation at scale, these tech enterprises, in collaboration with the open-source community, developed various data-intensive computing frameworks to aid seamless big data analytics. Hitherto, parallel programming models such as Message Passing Interface (MPI) and OpenMP were used to process massive datasets. These programming models run on powerful machines with a huge amount of resources. The main disadvantage of this approach is that such computers are costly to scale and acquire. Now massive data processing is performed using clusters of commodity hardware; this approach is cost-effective and scales to tens of thousands of machines.

The generation and use of big data to gain competitive advantage or provide a better quality of service is not limited to its pioneers or tech giants like Google, Amazon and Facebook. Now, even conventional and non-technological organisations like governments, health sectors, agricultural sectors, financial sectors and education are also generating massive datasets. Over the years, these sectors have also started leveraging the benefits of big data to deliver better services and improve their overall performance. In the following paragraphs, we present some inspirational examples. In the domain of meteorology, big data is used in the study of global warming, understanding the patterns of natural disasters and the availability of usable water around the world [11]. For example, IBM's Deep Thunder [51] provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo to improve weather forecasting for natural disasters and predicting the probability of damaged power lines.

In the transportation industry, big data can be used for route planning, traffic congestion management, traffic control and identifying accident hot-spot areas. For example, companies such as Uber improve the quality of service by collecting a massive amount of data about drivers, their vehicles, locations and every trip from every vehicle. The goal is to develop prediction models to predict supply, demand, the location of drivers, and the fares that will be set for every trip [33].

In the entertainment industry, users use different devices to access media content, thereby generating a lot of data. Examples of the benefits of big data in the media and entertainment industry include predicting the interests of audiences for customised content delivery, optimised or on-demand scheduling of media streams in digital media distribution platforms, getting insights from customer reviews and effective targeting of customers for advertisements. A good example is Spotify's on-demand music platform, which collects data from all its users around the globe and then uses the analysed data to give informed music recommendations and suggestions to every individual user [83].

Finally, big data has revolutionised the healthcare sector. It can be used to provide evidence-based medicine by researching past medical records from different sources and providing a valid prediction of chronic disease outbreaks in disease-frequent communities [30]. An excellent example in this domain is the wearable devices and sensors that have been introduced in the healthcare industry to provide a real-time feed to the electronic health record of a patient [68, 106].

1.2 Big Data Frameworks

Big data frameworks are an integral part of the big data ecosystem. These frameworks provide high-level programming models to manage the processing of big data. However, they are highly configurable, which creates both an additional opportunity and a challenge for optimisation. As of writing this thesis, both Apache Spark and Apache Hadoop have over 200 configuration parameters that can be used to optimise the performance of an application. Using the default settings or incorrectly tuning these configuration parameters could lead to performance issues. These tunable parameters include configuration settings for memory optimisation or management, garbage collection, application resource usage, shuffling behaviour, compression and serialisation, execution behaviour, networking and scheduling. Manually tuning these configuration parameters for best performance would be very ineffective because of their exponential combination. Therefore, a clear opportunity is to use machine learning approaches to investigate the effect of, and the relations between, the configuration parameters and the overall performance of an application.

The first modern big data processing frameworks use the MapReduce pattern [41] and its open-source implementation called Hadoop MapReduce [63]. The MapReduce paradigm exposes a simple but powerful programming framework for processing massive data sets at scale on distributed computing clusters. The programming model is expressed using two simple functions called map and reduce. The MapReduce execution engine automatically handles the parallelisation of an application on a distributed computing cluster. One of the drawbacks of MapReduce is that it is not well suited to iterative computation, such as machine learning applications, due to its simple two-stage programming model. This drawback, amongst others such as high latency, led to the development of distributed in-memory computing frameworks such as Apache Spark [138], which is based on the dataflow with cycle pattern [31]. Unlike MapReduce, Apache Spark is general purpose and not limited to only two functions. The framework offers a rich set of data manipulation functionalities known as transformations and actions. These transformations and actions allow complex analytics by organising tasks into a Directed Acyclic Graph (DAG) and executing them on the framework. Spark avoids the performance problems of MapReduce by keeping data mostly in memory.

There is quite a lot of active development and research from both academia and open-source communities to add new features and improve the performance of data-intensive frameworks. This presents new research challenges and opportunities. One of the challenges is to identify and improve performance bottlenecks through benchmarking, typically done by running representative workloads to gather enough data for analysis. These benchmarks can be grouped into end-to-end benchmarks or microbenchmarks. End-to-end benchmarks, such as the HiBench benchmarking suite [74], offer functionalities to benchmark an entire big data framework. Microbenchmarks, on the other hand, focus on a specific aspect of the system, such as the performance of Hadoop Distributed File System (HDFS) [63] reads and writes.
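To make the map and reduce abstraction concrete, the following minimal word-count sketch is written in Java against the standard org.apache.hadoop.mapreduce API. It is the textbook illustration of the programming model rather than code taken from this thesis; the framework handles input splitting, shuffling and parallel execution around these two user-supplied functions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountExample {

    // Map: emit an intermediate (word, 1) pair for every token in an input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle phase groups the pairs by word, sum the counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}

A driver class would configure a job with these two classes and submit it to the cluster; Hadoop then runs many mapper and reducer tasks in parallel over the input splits.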

1.3 The Problem

In this section, we will present a concise example of the problem addressed in this thesis. The main performance bottleneck for data-intensive applications is the amount of data to process. The algorithms are normally efficient, simple and mostly written once without much need for regular optimisation. The only way to optimise the performance of these applications is to optimally tune the application code or the configuration parameters, or to horizontally scale the hardware infrastructure for more parallelism. However, due to the vast space of configuration parameters, this is not a trivial task to achieve. For example, let us imagine that a user wants to execute a data-intensive application such as WordCount or RecordCount (a very common analytic performed to get a heartbeat of a data flow rate over a particular interval, e.g. weekly, daily, hourly, monthly or yearly [98]) using a framework such as Hadoop or Spark. As shown in Figure 1.1, some of the significant performance challenges would be to understand: How long will it take to execute these applications? What effect would scaling input and configuration parameters such as data size, memory and CPU have on performance? What are the best combinations of configuration settings to achieve optimal performance or meet a specific deadline? These challenges are essential to answer because they affect the monetary and operational cost of running applications. If the cluster is deployed in the cloud, then minimising the financial cost of operation is a crucial objective.

Figure 1.1: Configuration Parameters vs Performance Goals: Can we understand the performance of various big data workloads as we scale input and resource configuration parameters such as data size, memory and CPU?

Based on the above high-level problem statement, the main goal of this thesis is to study and understand the performance characteristics of various data-intensive applications in relation to configuration parameters. To achieve this, we seek to enable a generic benchmarking and performance prediction framework for data-intensive applications. This framework must satisfy and adequately address the following open challenges.

• Generic Benchmarking: To aid reproducibility of the performance models, the benchmarking approach should use a generic method to benchmark the distributed computing framework. By doing so, it should target the core functionality and dataflow pipeline of that particular distributed computing framework.

• Representative Workloads: The goal here is to identify and use workloads that would effectively stress-test the core functionalities of the framework. These can also be workloads that are heavily used in industry, e.g. machine learning applications. The use of representative workloads aids accurate prediction when different workloads are used.

• General Performance Prediction Models: The goal is to enable multi-objective performance prediction models. The performance models should not be limited to predicting the performance of one or a few specific applications only. They should also not be limited to predicting only one response variable; in our case, this can be execution time, a deadline, or the best CPU and memory settings to optimise application runtime.

• Ubiquity of Configuration Parameters: There are more than 200 configuration parameters for both Apache Spark and Apache Hadoop. These configuration parameters are used to tune the performance of distributed computing frameworks. We have discussed that we cannot study the effect of all these configuration parameters on performance; however, we need to identify key configuration parameters and review their overall impact on performance (a minimal illustration of setting a few such parameters follows this list).

• Accuracy: The end goal is to enable a multi-objective performance prediction framework for data-intensive applications. The framework should be accurate enough in the prediction of each of the goals shown in Figure 1.1.
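As a concrete illustration of the kind of parameters involved, the fragment below sets, in Java, a few of the Spark configuration keys examined in Chapter 6. It is only a sketch: the values are arbitrary placeholders rather than recommended settings, and the thesis itself drives these parameters through HiBench configuration files rather than application code.

import org.apache.spark.sql.SparkSession;

public class ConfiguredJob {
    public static void main(String[] args) {
        // Each key below is one dimension of a very large configuration space;
        // the values are placeholders used purely for illustration.
        SparkSession spark = SparkSession.builder()
                .appName("dfwc-benchmark-example")
                .config("spark.executor.instances", "4")    // number of executors
                .config("spark.executor.memory", "4g")      // memory per executor
                .config("spark.executor.cores", "2")        // cores per executor
                .config("spark.default.parallelism", "16")  // default number of partitions
                .getOrCreate();

        // ... the workload (e.g. a K-means run) would be submitted here ...

        spark.stop();
    }
}

The same four keys reappear in the benchmarking design of Chapter 6 (sections 6.5.5 to 6.5.8), where their values are varied systematically rather than fixed as above.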

1.4 Research Questions

In this section, we will discuss in detail the three research questions we have formulated to address the challenges covered in section 1.3.

RQ1 Considering the ubiquity of data-intensive workloads and the vast space of tunable framework configuration parameters, can we enable a generic benchmarking framework for data-intensive applications?

This question seeks to enable a generic benchmarking framework for varying big data workloads. There are lots of big data workloads; therefore, the best approach would be to identify common characteristics of applications and frameworks and use that knowledge in the process. For example, the MapReduce paradigm and its corresponding programming model encapsulates the low-level details away from the users of the framework through high-level map() and reduce() interfaces. This simplicity hides various complexities that would be crucial in providing a generic benchmarking process. The low-level data processing pipeline of the framework has six phases, and a fundamental understanding of how these phases work and interact with one another would be vital in the creation of a generic benchmark. While there has been some research work, such as [127, 129, 139, 140], it focused on earlier versions of Hadoop and relied only on high-level functionalities provided by the framework without delving into the low-level details.

To address this question, we first find a way of grouping data-intensive workloads into a few common groupings based on robust criteria. After exhaustive background research, we categorise them based on dataflow communication patterns [31]. We mainly focused on MapReduce (MR) and Dataflow With Cycle (DFWC) because, besides being the most popular patterns with easy-to-use frameworks, they provide a clear distinction in their approach to big data processing. For each of these patterns, we then enable a benchmarking component [21, 22]. For the MapReduce pattern, we studied the source code of Apache Hadoop to identify the six phases of the data processing pipeline. For each of these phases, we added the necessary implementation to report execution times. Since MapReduce uses counters to present task and job-related statistics to the user, we added six new counters to represent the six generic phases of MR (Read, Collect, Spill, Merge, Shuffle and Write). We further implemented a generic MapReduce workload and executed it with different configuration settings. More details about this process and the entire implementation are presented in Chapter 5, Appendix A.2 and [22].

For the DFWC pattern, we used the Apache Spark framework. Unlike Hadoop, Spark does not have a specific set of phases that we can use to design the benchmark component. Instead, the data processing pipeline for jobs could have a varying number of transformations and iterations. For example, the batch version of the k-means algorithm could have n iterations, which may translate to n transformations or actions. So we took a different approach by using representative workloads to design the benchmarking component. These workloads are mainly ML algorithms because they fit the dataflow with cycle pattern. To address the issue of identifying key configuration parameters, we devised a systematic approach by first studying academic literature focusing on performance modelling of DFWC applications. This process enabled us to generate a pool of candidate configuration parameters. In the second step, we employed an empirical approach by running small-scale experiments to select the final list of configuration parameters. More details about this process are covered in Chapter 6.
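The counter mechanism referred to above can be sketched briefly. The thesis adds its phase counters inside the Hadoop framework source itself (Listing 5.1 and Appendix A.2); the fragment below is only an illustration of the underlying counter API, using a hypothetical enum of phase names and timing the hand-off of a record at the user level.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PhaseTimingMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Hypothetical counter group naming the generic MR phases.
    public enum PhaseCounter { READ_MS, COLLECT_MS, SPILL_MS, MERGE_MS, SHUFFLE_MS, WRITE_MS }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... user-defined map logic would transform the record here ...

        long start = System.currentTimeMillis();
        context.write(new Text("key"), value);  // hand the record to the output collector

        // Accumulate the elapsed time under a user-defined counter; Hadoop
        // aggregates counters across all tasks and reports them with the job
        // statistics in the YARN logs, from which a log parser can extract them.
        context.getCounter(PhaseCounter.COLLECT_MS)
               .increment(System.currentTimeMillis() - start);
    }
}

In the thesis implementation the timing is done inside the framework's own read, collect, spill, merge, shuffle and write code paths, so the reported counters reflect the generic phases rather than user code.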

RQ2 Using our findings in RQ1, can we devise a general approach to model various performance characteristics of data-intensive applications?

The main challenge here is to enable a multi-objective performance prediction framework that can be used to understand the relationship between performance and configuration settings during job execution. Big data frameworks are highly configurable, which makes it hard to manually tune and study the effects of all configuration parameters. Therefore, the use of performance models would be the best way to explore the impact of configuration parameters on performance. The models can be used to answer the following questions: How long will it take to run a data-intensive application as configuration parameters such as the size of the data, the number of executors, executor memory and executor cores are scaled? What is the best combination of configuration parameters that can minimise application running time or meet a specific deadline? An example of such an application can be a summary statistics application [4, 47] that processes user activity data on a website every ten minutes. What is the best combination of cloud instances to optimise financial cost? While there has been some work in this area, such as [27, 59, 102, 127, 128, 129, 139], it is not multi-objective, and some of it focuses only on specific aspects of performance such as data compression and data serialisation.

The prerequisite for addressing this question is to have a generic benchmarking component to profile a cluster and generate data about various workloads. The data collected is later used to build models using the grey-box approach. This approach entails using conventional machine learning tools to build models from historical data while taking into account an understanding of how the underlying framework works. The models are validated and used to answer the questions highlighted in the problem definition.
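As a toy illustration of this model-building step, the sketch below fits a one-feature least-squares line, execution time against input size, to a handful of made-up benchmark observations and uses it to predict an unseen point. The actual models in Chapters 5 and 6 use several configuration parameters as features and conventional machine learning tooling rather than this hand-rolled fit.

public class SimpleRuntimeModel {
    public static void main(String[] args) {
        // Made-up benchmark observations: input size in GB -> execution time in seconds.
        double[] sizeGb  = {1, 2, 4, 8, 12};
        double[] timeSec = {35, 61, 118, 236, 350};

        // Ordinary least squares for time = a + b * size.
        int n = sizeGb.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX  += sizeGb[i];
            sumY  += timeSec[i];
            sumXY += sizeGb[i] * timeSec[i];
            sumXX += sizeGb[i] * sizeGb[i];
        }
        double b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double a = (sumY - b * sumX) / n;

        // Predict the execution time for an unseen 20 GB input.
        double predicted = a + b * 20;
        System.out.printf("time(size) = %.2f + %.2f * size; predicted(20 GB) = %.1f s%n",
                a, b, predicted);
    }
}

The grey-box models described in Chapters 5 and 6 refine this idea with phase-level knowledge, additional configuration features and more expressive learners.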

RQ3 How can we verify the generality and the accuracy of our approach on an arbitrary cluster size?

Models are not perfect, but they can be used to understand the behaviours of the subject being studied. They are also limited by the amount and the quality of data used in the model building process. One of the main questions is to test and verify the scalability of the modelling framework to understand its strengths and limitations. For example, will our models work on a cluster which has physical resources five times the size of our cluster? We addressed this question from two angles. The first is to scale the computing infrastructure. The second is to scale the values of the key configuration parameters during experiments. In scaling the computing infrastructure, we started with a single-node cluster and expanded to an eight-node cluster. The results are consistent on both setups. In the benchmarking process, we scale the values of the configuration parameters to reach the maximum allowed values. We then performed a validation experiment by scaling the values of the configuration parameters to higher values. We have observed that the results of the models are consistent up to a certain point, after which they converge and do not perform any better. This behaviour is normal in any modelling task; more data and resources would be needed to do better than that. These results are discussed in Chapters 5 and 6.

1.5 Main Contributions

In this thesis, the main goal is to enable a grey-box benchmarking and performance modelling framework for data-intensive applications. There are two significant parts to this goal. The first part is to investigate a cost-effective and generic approach to benchmark data-intensive applications based on communication patterns such as the MR and DFWC patterns. The second part is to build performance models using the data from benchmarking to answer performance-related questions. To achieve these goals, we employed a grey-box methodology by incorporating framework and application internals into the entire process. We then deployed the framework on an internal private cluster and ran various experiments for validation. Below we discuss a more detailed breakdown of the main contributions of this thesis.

• Contribution 1: Performance Modelling Framework for Data-Intensive Applications

A multi-objective performance prediction framework that can be used to understand the relationship between performance and configuration settings of data-intensive applications during job execution.

• Contribution 2: A Generic Benchmarking Framework For MR and DFWC pattern

We have enabled a generic benchmarking framework for MR and DFWC patterns that can be used to profile applications executed on a big data framework.

• Contribution 3: Theoretical models for each phase of MapReduce

Black-box performance models are simple to build, but they suffer from poor accuracy [27]. Therefore, a fundamental understanding of the low-level details of applications and the frameworks they are executed on would be useful in building more accurate performance models. The goal here is to study the low-level details of the MapReduce framework and provide mathematical models to represent each of the six generic phases. To achieve this, we performed an in-depth exploration of the Apache Hadoop source code to understand the data processing pipeline. There are six generic phases of the pipeline (read, collect, spill, merge, shuffle, write). For each of these phases, we devise a theoretical model and combine them to form a final cost model (a sketch of the additive form such a model can take is given after this list). The models were used to support and validate the models generated by the machine learning approach. We achieve higher prediction accuracy compared to the related work we surveyed, such as [129, 139].

• Contribution 4: A Comprehensive Implementation of the Models as a Software Frame- work

What we have realised from related and highly cited works such as [127, 128, 129, 139, 140] is that, to the best of our knowledge, there is little or no evidence available to test their performance prediction frameworks or verify what they claim. We believe it is fair to allow users, readers and future researchers the opportunity to see part or all of the research work in action through accessible simulations or web-facing interfaces. For the MapReduce pattern, we built a web-facing interface that can answer performance-related questions. Using this interface, the user can input metrics such as the size of the data, the map selectivity (the ratio of map input to map output) and the number of mappers, and the framework returns the execution time for all the generic phases and the total execution time for a specific job. As for the DFWC pattern, we built a comprehensive REST Application Programming Interface (API) that can be used to query the multi-objective framework. Using this REST API, users can get answers to several objectives by specifying values for configuration parameters such as memory, CPU, the number of executors and the size of the data to process. We have provided the source code of these implementations and GitHub links for reproducibility in Appendix A.

• Contribution 5: A Detailed Taxonomy of Benchmarking and Performance Modelling of Data-Intensive Applications (DIA)

As of writing this thesis and to the best of our knowledge, we found no detailed surveys focusing on benchmarking and performance modelling of data-intensive applications. Although the field is relatively new, we believe that a lot of interesting work has been done and therefore a comprehensive taxonomy would help in highlighting the trends, challenges and opportunities in this area. We have conducted a literature review focusing on benchmarking and performance modelling of data-intensive applications. This is covered in Chapter 3 of this thesis. We identified key areas of trends, challenges and opportunities in this domain. To make the presentation of the taxonomy easy to understand, we grouped the surveyed papers by their general approach; this could be a black-box, grey-box or white-box approach. Then we focus on five more groupings (Benchmarking Approach, Modelling Approach, Communication Pattern, Experimental Design and Overall Objective). These groupings also have some sub-groupings. The taxonomy allows us to highlight the most common approaches used and the trends and opportunities in this domain.
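To fix ideas for Contribution 3, the expression below is only a sketch of the additive shape a phase-level cost model can take, under the simplifying assumption that the phases of a task do not overlap; the actual per-phase models, their parameters and the way they are combined are derived in Chapter 5.

\[
T_{job} \;\approx\; \underbrace{t_{read} + t_{map} + t_{collect} + t_{spill} + t_{merge}}_{\text{map-side phases}} \;+\; \underbrace{t_{shuffle} + t_{reduce} + t_{write}}_{\text{reduce-side phases}}
\]

Here t_{map} and t_{reduce} denote the user-supplied (custom) map and reduce functions, and each remaining t_{phase} is modelled as a function of the volume of data flowing through that phase and of the cluster configuration.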

1.6 Publications

The work presented in this thesis led to the publication of several peer-reviewed papers at reputable conferences. For each of these publications, I formulated the entire idea and wrote the first and final drafts. The second author and my supervisor provided the necessary feedback and proofreading to improve the content and structure of the papers. The details of each of the papers are discussed below:

• Sheriffo Ceesay, Blesson Varghese and Adam Barker. Plug and play bench: Simplifying big data benchmarking using containers, 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, 2017, pp. 2821-2828

This paper serves as the precursor of this PhD work. Since big data benchmarking is an integral part of this work, the main goal of this paper was to research and identify major big data benchmarking frameworks and tools and highlight their strengths and weaknesses. It serves as a mini-survey to understand the trends in big data benchmarking. As a result, we were able to identify HiBench [74] as the best big data benchmarking suite and address its drawbacks to complete the rest of the paper. Finally, this paper motivated us to work on a framework that includes both benchmarking and performance modelling.

• Sheriffo Ceesay, Adam Barker and Yuhui Lin Benchmarking and Performance Modelling of MapReduce Communication Pattern 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Sydney, Australia, 2019, pp. 127-134.

This paper focused on benchmarking and modelling the MapReduce communication pattern. In this paper, instead of focusing on a wide range of configurable parameters, we studied the low-level internals of the MapReduce communication pattern. We used a minimal set of performance drivers to develop a set of phase-level parametric models for approximating the execution time of a given application on a given cluster. The models can be used to infer the performance of unseen applications and approximate their performance when an arbitrary dataset is used as input. Our approach is validated by running empirical experiments in two setups. On average, the error rate in both configurations is ±10% of the measured values. The error rate in this document is a measure of the difference between the values of our prediction models and the actual time it takes to run an application on a framework (a simple formalisation is sketched after this list).

• Sheriffo Ceesay, Yuhui Lin and Adam Barker A survey of benchmarking and performance modelling of big data applications, 7th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, The University of Leicester, Leicester, UK, December 7-10, 2020.

As of writing this thesis and to the best of our knowledge, there are no comprehensive surveys that thoroughly examine the gaps, trends and trajectories of this area. To fill this void, we therefore present a review of the state-of-the-art benchmarking and performance modelling efforts in data-intensive applications. We start by introducing the two most common dataflow patterns used; for each of these patterns, we review the approaches to benchmarking, modelling, validation and experimental environments. Furthermore, we construct a taxonomy and classification to provide a deep understanding of the focus areas of this domain and identify the opportunities for further research. We conclude by analysing each research gap and highlighting future trends.

• To Be Submitted: Sheriffo Ceesay, Yuhui Lin and Adam Barker Benchmarking and Performance Modelling of Dataflow with Cycle Pattern 2020 The 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)

This paper's primary goal is to benchmark and model the performance of the dataflow with cycles pattern. In this paper, we present a multi-objective performance prediction framework for applications in the dataflow with cycles pattern (DFWC) based on machine learning approaches. Out of the 200+ configurable parameters, we first dealt with the problem of identifying the key metrics to use in the model building process. Using these parameters, we build models that can predict the execution time of a given application with high accuracy. We can also infer optimal values for each configurable parameter to meet a specific deadline or achieve the best performance. Based on these optimal configuration values, the framework can also recommend the best Amazon Elastic Compute Cloud (EC2) instances in terms of cost. The average error rate of the prediction results is ±14% of the measured values.
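Concretely, the error rates quoted above can be read as relative errors. One plausible formalisation, consistent with how the error rate is described in these papers and in the glossary (the exact evaluation metrics are given in Chapters 5 and 6), is

\[
\text{error rate} \;=\; \frac{\lvert T_{predicted} - T_{measured} \rvert}{T_{measured}} \times 100\%,
\]

so that an average error rate of ±10% means the predicted execution time lies, on average, within 10% of the measured execution time.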

1.7 Thesis Structure

This thesis is arranged into seven main chapters; three of the chapters are works that have been peer-reviewed and published at conferences. Below, we summarise each of these chapters.

Chapter 1 gives an introduction to the thesis by highlighting the motivation, the challenges and how these challenges were addressed. We have also discussed and highlighted the main contributions of this thesis.

Chapter 2 presents the background of benchmarking and performance modelling of big data applications. In this chapter, we introduce the core concepts relevant to this work. We also cover the various technologies that are central to the implementation of this work.

Chapter 3 focuses on the literature review. We reviewed well-cited existing works in the benchmarking and performance modelling domain. Based on the reviewed literature, we built a taxonomy that highlights the challenges, trends and opportunities in benchmarking and performance modelling of big data applications.

Chapter 4 focuses on the general methodology we have used in this work. We also present interesting case studies to showcase the results of the framework. We believe this will give the reader a high-level view of how to use the application.

Chapter 5 focuses on modelling the first communication pattern, i.e. the MapReduce pattern. In this chapter, we highlight the processes we have used to benchmark and model the performance of the MapReduce pattern. We present theoretical models of the six generic phases and use a black-box machine learning approach to model performance. This work has been peer-reviewed and published in a conference [22].

Chapter 6 focuses on modelling the second communication pattern, i.e. the dataflow with cycle pattern. In this chapter, we highlight the processes we have used to benchmark and model the performance of the dataflow pattern. We first identify vital configuration parameters out of the 200+ configuration parameters. We then use these key configuration parameters to benchmark and model the performance of typical applications. For Chapter 5 and Chapter 6, we discuss the experiments within the chapters to keep the content organised. We are submitting this work to the CCGrid conference.

Chapter 7 concludes the thesis by summarising the content, revisiting the contributions and suggesting future work.

1.8 Summary

In this chapter, we first defined the problem statement in detail using an intuitive and concrete example. From the problem statement, we generated three research questions, discussing them in detail and highlighting the challenges and how we addressed them. In section 1.5, we discussed the five main contributions of this work. At the end of the chapter, we discussed our publications and concluded with the overall structure of this thesis. In the next chapter, we will discuss the relevant background work of this thesis. We will cover some historical context related to this work and then delve deeper into all the technologies we have used.

CHAPTER TWO

BACKGROUND

2.1 Introduction

This chapter reviews the background of this work, focusing on the key concepts and technologies we have used. In section 2.2, we will highlight key concepts of data-intensive applications and cloud computing. In section 2.3, we cover various dataflow communication patterns, focusing more on the MapReduce and Dataflow With Cycle (DFWC) patterns. In section 2.4, we give key historical concepts of parallel processing, discuss parallel programming models and highlight some of the key factors that led to the transition to distributed programming models such as Hadoop and Apache Spark. In section 2.9, we cover benchmarking approaches, and in section 2.10, we cover various approaches to performance modelling. We discuss machine learning approaches in section 2.11 and summarise the chapter in section 2.12.

2.2 Big Data

Since the advent of big data more than a decade ago, there has been no universally agreed definition owing to its shared origin between academia, industry and media [134]. However, one of the most cited definitions, proposed by Doug Laney of META (now Gartner) [86], defines big data as data having a high volume, high velocity and high variety that cannot be stored and processed using a Relational Database Management System (RDBMS) or traditional and conventional methods of data processing. Oracle defines big data by categorising it by the source of the data: traditional enterprise data, machine or sensor generated data and social data [42]. The paper further discusses what it refers to as the four characteristics or 4Vs of big data; the 4Vs of big data are presented in subsection 2.2.2. Amongst all the various definitions of big data in [134], [29], [111], [94], the fundamental properties they have in common are the large volume, heterogeneous and autonomous sources (variety), high velocity and the challenge of processing and storing such amounts of data.

To put these properties and challenges into context, let us consider the word "coronavirus", a global health pandemic and the most significant health crisis as of writing this thesis. Searching the term "coronavirus" on Google returns about 11 billion results. These results are a combination of media articles, tweets, images, blog and opinion posts. As these results keep streaming, can an individual or a company provide real-time analysis and a summary of the trajectory of this pandemic? This task would entail gathering a massive amount of rapidly generated raw information in different forms from various sources and processing it on a large scale. This scenario is the typical big data problem that many companies are now battling to address.

2.2.1 Data Intensive Applications

Hitherto, applications were mostly classified as compute-intensive (e.g. the n-body problem [137], dense matrix computation [48]) or memory-intensive (e.g. in-memory analytics [90]). This classification means that the main challenge for those applications, in terms of performance, is the amount and power of the available CPU or memory. Although not strictly accurate, adding more or faster CPUs and memory would fundamentally improve the performance of these applications. A more thorough approach would be to optimise applications to utilise faster memory, such as the different caching levels. Data-intensive applications, on the other hand, are applications whose primary challenge is data. These challenges could be the amount of data to process, the complexity of the data, the storage of the data and the speed at which it is changing.

In the last decade, we have experienced a lot of exciting development in distributed systems, databases, and the way applications are built on top of them. There are various driving factors responsible for this [82]. First, Internet companies such as Google, Microsoft, Facebook, Amazon, LinkedIn, eBay, Netflix and Twitter rapidly generate massive datasets and traffic, forcing them to research and look into new tools to enable them to effectively and efficiently manage and store data at such scale. Most of the typical big data tools, data stores and frameworks such as MapReduce, HBase and AmazonDB were all inspired by these internet companies. Secondly, manufacturers have shifted away from manufacturing faster CPUs in favour of multi-core processors [82]. This design shift means a single CPU can run multiple tasks at the same time. Networks are also becoming faster, which has given rise to distributed and parallel computing. Now, big data problems are mainly handled using a distributed and parallel approach. Moreover, deploying distributed applications is now easier due to the advent of cloud computing. Thanks to infrastructure as a service, many companies, including startups, can now build applications that are distributed across many machines and even multiple geographic locations. This helps to cement the idea of high availability because extended outages or maintenance are now becoming unacceptable. Finally, free and open-source software systems like Hadoop MapReduce, Apache Spark, MongoDB and HBase have become very successful and are mostly preferred over expensive commercial or bespoke in-house custom versions.

2.2.2 Core Properties of Big Data

In the previous section, we discussed that there is no universally agreed definition of big data. However, there are key characteristics that all the definitions implicitly or explicitly include. These characteristics were originally referred to as the 3Vs of big data [86]. We will also cover two more Vs (Value and Veracity) which have been added more recently.

• Volume: The volume property deals with the humongous size of the data being generated in recent times. Let us look at some numbers to understand the magnitude of the data being generated. In a study conducted by IBM, the authors noted that 90% of all data had been created in the last two years [38]. In another study [53] conducted in 2012, the authors noted that, by 2020, there would be around 40 trillion gigabytes of data. All these studies strongly signify that massive amounts of data are being generated at an enormous rate.

• Velocity: Velocity is the rate at which raw data to be processed is generated. This is sometimes referred to as streaming data. Taking Twitter and YouTube as examples, the authors of [78] noted that Twitter users send nearly half a million tweets every minute and that YouTube users watch approximately 4.5 million videos per minute.

• Variety: Variety deals with the different types of data being produced. Traditional data are mostly text organised in a well-structured form. Big data is mostly unstructured; it includes semi-structured text such as tweets, posts and comments, and non-text files such as images and audio. An example of mainstream applications generating a variety of data would be social media applications like Facebook and Snapchat. With these applications, users can add a post to their walls and, in the same post, include videos and photos with their friends.

• Value and Veracity: The value property deals with the inherent truth-value in data, while veracity deals with the uncertainty of data. For example, IBM, in their research [38], noted that one in three business leaders do not trust the information they use to make decisions. It further states that poor quality data costs the United States economy around $3.1 trillion a year.

2.2.3 Cloud Computing

Cloud computing is the process of offering software applications, computing platforms and hardware infrastructure as services over the Internet [5]. Big data has been one of the critical drivers for the rapid development of cloud computing. Setting up and maintaining big data frameworks and infrastructure can be both an expensive and a daunting task. With cloud computing, low-budget companies such as startups can cheaply spin up hundreds of nodes with just a few clicks. A big data application can be deployed on in-house infrastructure or on a public cloud. One of the main advantages of cloud computing in the context of big data is elasticity, which is the ability to seamlessly scale the infrastructure when required. Cloud computing is a massive field which is beyond the scope of this work. Therefore, in this subsection, we will briefly discuss the cloud computing concepts that are relevant to this work.

Software As A Service (SaaS)

Software As A Service is a service model that enables cloud clients to use applications hosted by cloud service providers. The applications can be accessed using various client devices through either a thin client interface, such as a web browser, or a program interface [95]. The user of these services does not manage or control the underlying cloud infrastructure, which includes both the software and the hardware stack. One of the main advantages of SaaS is that software updates and bug fixes are instantly available to users without them needing to install a new version of the software. Examples of SaaS are Gmail, provided by Google, and Microsoft Teams for video calls and collaboration.

Platform As A Service (PaaS)

Platform As A Service is a service model that enables clients to develop and publish applications on a computing platform without the need to manage the underlying hardware and software layers [26]. Generally, PaaS includes provisioning a software platform or framework and provides the facilities required to support the complete life cycle of building applications. In the context of big data, this would include deploying big data applications on frameworks such as the Hadoop and Spark ecosystems maintained by public cloud providers. An example is Google Dataproc [58], which enables clients to flexibly spin up a Hadoop cluster with an arbitrary number of nodes. PaaS providers handle platform maintenance and system upgrades, resulting in a more efficient and cost-effective solution for enterprise software development.

Infrastructure as a Service (IaaS)

Infrastructure As A Service is the lowest level of abstraction, and it allows clients to remotely use IT hardware and resources on a "pay-as-you-go" basis [5]. It entails the provision of bare metal computer infrastructure such as CPU, disk, memory and network as a service. The client is free to install any software and manage them as they wish. Major IaaS players include companies like Google, Microsoft and Amazon. IaaS employs virtualisation, a method of creating and maintaining infrastructure resources in the "cloud". IaaS provides small startup firms with a significant advantage since it allows them to gradually expand their IT infrastructure without the need for substantial capital investments in hardware and peripheral systems.

The Pay-As-You-Go Model

The pay-as-you-go model charges customers based on their usage of rented services. Pay-as-you-go platforms, such as Amazon EC2, Google Compute Engine and Microsoft Azure, provide services by allowing users to access compute resources and be charged for what is used. Users can configure the CPU, memory, storage, operating system, security, networking capacity and access controls, and any additional software needed to run their environment. In this work, one of the objectives of the performance modelling framework is to infer the best cloud instances for running a particular data-intensive application given specific configuration settings.
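To make this pay-as-you-go trade-off concrete, the short sketch below picks the cheapest instance type that still meets a deadline. It is illustrative only: the instance names and hourly prices are hypothetical, and predict_runtime_hours stands in for a trained performance model such as the one developed later in this thesis.

```python
# Minimal sketch: choose the cheapest cloud instance that meets a deadline.
# Instance names, prices and predicted runtimes are hypothetical placeholders.

def predict_runtime_hours(instance_type):
    # Stand-in for a trained performance model (see Section 2.10).
    hypothetical_runtimes = {"small": 6.0, "medium": 3.5, "large": 2.0}
    return hypothetical_runtimes[instance_type]

HOURLY_PRICE = {"small": 0.10, "medium": 0.20, "large": 0.40}  # USD/hour (made up)

def cheapest_instance(deadline_hours):
    candidates = []
    for instance, price in HOURLY_PRICE.items():
        runtime = predict_runtime_hours(instance)
        if runtime <= deadline_hours:                  # instance meets the deadline
            candidates.append((runtime * price, instance))
    return min(candidates)[1] if candidates else None  # lowest total monetary cost

print(cheapest_instance(deadline_hours=4.0))  # -> 'medium' under these assumptions
```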

2.3 Dataflow Communication Patterns

We have witnessed a significant transformation of how systems software and the corresponding applications are built over the past decade. Due to growing data volumes, we have seen the rise of more data-parallel frameworks such as MapReduce and Apache Spark. On the other hand, there have been significant efforts put into building high-capacity data centres to support the efficient execution of distributed data-intensive applications [31]. Due to these developments, traditional point-to-point and client-server communication models are now replaced by higher-level communication patterns that involve many machines communicating with one another. The majority of distributed data-intensive applications are executed using frameworks such as MapReduce and Apache Spark. These frameworks take user-defined jobs which follow specific workflows or sets of rules enabled by the corresponding programming model. It is, therefore, essential to understand the communication characteristics of these applications to identify and improve performance issues within the patterns. In this section, we discuss common communication patterns of modern scale-out or data-parallel applications. However, this work focuses on only the first two because they are the most prevalent.

2.3.1 MapReduce Communication Pattern

The MapReduce pattern is one of the first distributed data-parallel patterns. As shown in Figure 2.1, the MapReduce pattern has two main steps of computation and one step of communication. The computation steps are handled in the map and reduce implementations, while the communication involves a data shuffling operation from the mappers to the reducers. Therefore, given M mappers and R reducers, a MapReduce job will create a total of M x R flows during the communication or shuffling stage. The primary characteristic of communication in the MapReduce model is that a job will not finish until its last reducer has finished. Consequently, there is an explicit barrier at the end of the job [31]. Using the MapReduce programming model, computation is expressed using two functions: the map and the reduce functions, and both functions work on key-value pairs.

Figure 2.1: The MapReduce Communication Pattern: Data is read from the disk and passed to a custom map function. The function is applied to each input split, and the data is then shuffled by key to the reducers. The custom reduce function is applied, and the data is then written to disk.

The map function takes an input pair and applies the implementation of the function to the input data. This produces an intermediate key-value pair for an optional reducer to process. The values of the intermediate output are then grouped by the same key and fed to the reduce function.

map(k1, v1) → list(k2, v2)

reduce(k2, list(v1..vn)) → list(k2, v3)

The reduce function would apply the implementation of the function to (k2, list(v1..vn)) and return the final output.
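As a minimal in-memory illustration of these signatures (this is not the Hadoop API; the letter-count map and reduce functions below are just an example), the Python sketch simulates the model: the map function emits (k2, v2) pairs, the pairs are grouped by key, and the reduce function collapses each group into the final output.

```python
from collections import defaultdict

# map(k1, v1) -> list((k2, v2)): emit (letter, 1) for every letter in the line.
def map_fn(key, line):
    return [(ch, 1) for ch in line if ch.isalpha()]

# reduce(k2, list(v1..vn)) -> (k2, v3): sum the counts for one letter.
def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in records:                 # map phase
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)          # shuffle: group values by key
    return [reduce_fn(k2, vs) for k2, vs in groups.items()]  # reduce phase

print(run_mapreduce([(0, "big data"), (1, "bad data")], map_fn, reduce_fn))
```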

2.3.2 Dataflow With Cycle Pattern (DFWC)

Traditional dataflow pipelines unroll loops to support iterative computation requirements. Similarly, the MapReduce pattern is not suitable for iterative computation because multiple jobs have to be chained together for each step of the computation, and each of these jobs has to write its output for the next job to consume. Frameworks that use the dataflow with cycles pattern, such as Apache Spark [138] and Apache Flink [19], obviate loop unrolling or job chaining by keeping in-memory state across iterations. As opposed to the MapReduce pattern, these frameworks provide operators that handle an entire workflow as a single job, rather than breaking it up into independent subjobs [82]. These frameworks also explicitly support communication primitives such as broadcast, whereas MapReduce-based frameworks rely on replicating data to HDFS and reading from the replicas to perform broadcast jobs [31]. Figure 2.2 represents the Dataflow With Cycle pattern. The top circle or root node represents the initial state of the data to be processed, and the intermediate and leaf nodes represent the states of the various transformations of the data. The arrows from the leaf nodes show that this process could be repeated several times, hence an iterative or circular pattern. Most machine learning algorithms fit within this pattern.

Figure 2.2: Dataflow with Cycles: The root node represents the master task, e.g. a task to read a file from HDFS. The internal nodes T represent transformation tasks. Many transformations can be applied to the same data, followed by an action. Spark uses so-called lazy evaluation, which means that execution will not start until an action is triggered.

As an example of an advanced analytic job, we consider the typical k-means [69] clustering algorithm. The DAG of this algorithm is shown in Figure 2.3. The goal of the k-means algorithm is to categorise similar data points or observations into the same group. This is done by calculating the distance (e.g. Euclidean) between data points and is repeated for several iterations for the algorithm to converge. From the figure, we can see that such pipelines can have multiple stages, and some of the stages can be repeated for several iterations.

Figure 2.3: Dataflow with Cycles: DAG of the k-means Algorithm. Data is read from the disk, and the framework starts processing it by applying various transformations. For k-means, the calculation of distances and the updating of the centroids could be repeated several times for the algorithm to converge.

In this case, for the k-means algorithm to converge, it has to repeatedly compute the sum of the squared distances between data points and all centroids, assign each data point to the closest cluster centroid and compute the new centroids of the clusters by taking the average of all data points that belong to each cluster.
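A minimal PySpark sketch of this iterative loop is shown below; it is illustrative only, with an arbitrary input path, number of clusters and convergence threshold. The cached RDD of points is reused across iterations, which is exactly the in-memory data sharing that distinguishes the dataflow-with-cycles pattern from chained MapReduce jobs.

```python
# Minimal PySpark k-means loop (illustrative; paths and parameters are arbitrary).
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="kmeans-sketch")

# Each input line is assumed to hold space-separated numeric features.
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: np.array([float(x) for x in line.split()]))
            .cache())                          # kept in memory across iterations

K, EPSILON = 3, 1e-4
centroids = points.takeSample(False, K, seed=42)

def closest(p, centres):
    # Index of the centroid with the smallest squared Euclidean distance to p.
    return min(range(len(centres)), key=lambda i: np.sum((p - centres[i]) ** 2))

while True:
    # Assign every point to its nearest centroid and average per cluster.
    sums = (points.map(lambda p: (closest(p, centroids), (p, 1)))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .collectAsMap())
    new_centroids = [sums[i][0] / sums[i][1] if i in sums else centroids[i]
                     for i in range(K)]
    shift = sum(np.sum((centroids[i] - new_centroids[i]) ** 2) for i in range(K))
    centroids = new_centroids
    if shift < EPSILON:                        # converged
        break
```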

2.3.3 Data Flow With Barriers, Chained Jobs

As shown in Figure 2.4, these patterns use MapReduce as their building block. They expose a much more robust interface than MapReduce to allow users to process the underlying data. The interface can be a Structured Query Language (SQL)-like interface or a higher-level dataflow language, e.g. Hive and Pig. In terms of communication, these patterns are the same as MapReduce.

Figure 2.4: Data Flow With Barriers: From top to bottom, each task reads from a distributed file system like HDFS, the data is then processed by mappers, shuffled, reduced by optional reducers and the final results are written back to HDFS. The dependent tasks after the barriers must wait for all tasks before them to finish execution.

2.3.4 Data Flow Without Explicit Barriers

Unlike MapReduce, these are data flow pipelines without explicit barriers. As shown in Figure 2.5, a stage can start as soon as some input is available. Examples of such patterns are SCOPE [23], FLUME [25] and MapReduce Online [36].

Figure 2.5: Data Flow Without Explicit Barriers: From top to bottom, each task reads from a distributed file system like HDFS, the data is then processed by mappers, shuffled, reduced by optional reducers and the final results are written back to HDFS. Dependent tasks can continue execution as soon as some of the parent tasks complete.

2.4 Parallel Programming

Parallel computing was the main approach to processing massive datasets before the advent of modern distributed big data frameworks. Parallel computing is the process of solving computational problems using multiple compute resources simultaneously [8]. A complex problem is broken down into discrete components and instructions and executed simultaneously on different processors on the same computer or on computers connected to the same network. These problems are solved using parallel programming models such as MPI and OpenMP [60]. These models exist as a high-level abstraction above hardware and memory architectures. Broadly, there are several types of these models; the most popular are shared memory and distributed memory. In a shared memory model, parallel processes or tasks share a common address space which they read and write to asynchronously, and data is passed between processes through this shared space [8]. An example implementation of the shared memory architecture is OpenMP. A distributed memory model, also known as the message passing model, entails a set of tasks that use their local memory during computation. Messages can be shared within the same physical machine or across an arbitrary number of machines connected via a network [8]. An example implementation of a distributed memory model is MPI. One of the main advantages of this approach to parallel processing is low latency. However, parallel computing infrastructures are costly, they do not scale well, and they are not fault-tolerant. These disadvantages have led to the development of shared-nothing distributed computing architectures which use less expensive commodity hardware to build scalable and fault-tolerant clusters. In the next section, we will discuss these approaches in detail.
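As a small illustration of the message-passing model, the sketch below uses the mpi4py binding; it assumes an MPI installation and would be launched with something like mpiexec -n 4 python script.py. Each process works only on its local memory and explicitly sends its partial result to a designated root process.

```python
# Minimal message-passing sketch with mpi4py: each rank sums a local slice
# of work and rank 0 gathers the partial results.  Assumes MPI is installed.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process works on its own chunk of the problem using local memory only.
local_sum = sum(range(rank * 1000, (rank + 1) * 1000))

if rank == 0:
    total = local_sum
    for source in range(1, size):
        total += comm.recv(source=source, tag=0)   # explicit message passing
    print("total =", total)
else:
    comm.send(local_sum, dest=0, tag=0)
```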

2.4.1 Amdahl’s Law

One of the cornerstones of parallel computing is Amdahl’s Law. The law can be used to calculate the theoretical speedup when using multiple processors to run applications [62]. Formally, Amdahl’s Law states that, in parallelisation, if P is the proportion of a system or program that can be made parallel and 1-P is the proportion that remains serial, then the maximum speedup S(N) (meaning how much faster the new algorithm or program is compared to the old version) that can be achieved using N processors is given by Equation 2.1:

S(N) = 1 / ((1 - P) + P/N)    (2.1)

We can also use Amdahl’s Law to calculate the execution time of a program or algorithm after optimisation or parallelisation [77]. Let us consider the example below, which is inspired by [77]. Imagine a program that periodically scales large images to thumbnail sizes. These images are stored in a central repository, and a part of our program scans and zips these images by date of upload. After that, each of the zip files is processed in parallel. The scanning and zipping part cannot be parallelised, but processing the images can be parallelised. Assume that the total time to execute the entire program, including both the sequential and non-sequential parts, on one processor is T and the serial part is B. The part that can be parallelised can therefore be represented as T - B. The execution time of the application can be optimised by executing the T - B part on N threads or CPUs. Using Amdahl’s Law and the information above, the total execution time of running the program with a parallelisation factor of N can now be represented using Equation 2.2:

T(N) = B + (T(1) - B) / N    (2.2)

From Equation 2.2, the key information to take away in terms of optimisation is that the performance of such an application is bounded by the serial part B. The parallelisable part can be executed faster by throwing more hardware or CPUs at it, but the non-parallelisable part can only be executed faster by optimising the code. We will refer to Amdahl’s Law when we evaluate the effect of scaling resources on performance in Chapters 5 and 6.
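The two equations can be turned into a small helper, shown below as a sketch with arbitrary example numbers, to reason about how far adding CPUs can take the thumbnail job before the serial scanning-and-zipping part B dominates.

```python
# Amdahl's Law helpers (Equations 2.1 and 2.2); example numbers are arbitrary.

def speedup(p, n):
    """Maximum speedup with parallel fraction p on n processors (Eq. 2.1)."""
    return 1.0 / ((1.0 - p) + p / n)

def execution_time(t1, b, n):
    """Run time on n processors when the serial part takes b seconds (Eq. 2.2)."""
    return b + (t1 - b) / n

T1, B = 600.0, 120.0          # total time and serial part on one CPU (seconds)
parallel_fraction = (T1 - B) / T1
for n in (1, 2, 4, 8, 64):
    print(n,
          round(execution_time(T1, B, n), 1),
          round(speedup(parallel_fraction, n), 2))
# As n grows, the execution time approaches the serial bound B = 120 s.
```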

2.5 Parallel Patterns

To give a more historical and general background and provide more context for this work, we discuss a body of parallel patterns, also known as skeletons, and highlight a few common implementations [16]. Parallelism patterns or algorithmic skeletons are high-level parallel programming models used in parallel and distributed computing. They are abstract computational entities that model some common pattern of parallelism, such as the parallel execution of a sequence of computations over a set of inputs where the output of one computation is the input to the next, also known as a pipeline [16, 93]. They take advantage of common programming patterns to hide the complexity of parallel and distributed applications. Starting from a basic set of patterns (skeletons), more complex patterns can be built by combining the basic ones. One of the main differences between algorithmic skeletons and other high-level parallel programming models is that the orchestration and synchronisation of the parallel activities are implicitly defined by the skeleton patterns; programmers do not have to specify the synchronisation between the application’s sequential parts [64]. This approach has two implications. First, the algorithmic skeleton programming model reduces the number of errors when compared to traditional lower-level parallel programming models such as MPI and PThreads. Secondly, as the communication or data access patterns are known in advance, cost models can be applied to schedule skeleton programs [64]. Some of the most useful classical parallel skeletons are: a seq, a simple wrapper which models the sequential evaluation of a function [16]; a map, a skeleton that partitions an input into several sub-inputs and processes them in parallel [39]; a farm, which, unlike map, models the application of a single function to a sequence of independent inputs; and a pipe, similar to the standard Linux pipe, which models a parallel pipeline where the input of the second function is the output of the first function [39].
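To make these skeletons concrete, the sketch below expresses a seq, a map/farm and a pipe over Python's multiprocessing pool. It is only an illustration of the idea, with arbitrary worker functions; the programmer supplies sequential functions while the skeleton hides the parallel orchestration (a real skeleton library would also provide composition operators instead of the hand-written pipe).

```python
# Minimal algorithmic-skeleton sketch: seq, map (farm) and pipe built on a
# process pool.  The worker functions are arbitrary examples.
from multiprocessing import Pool

def square(x):            # seq skeleton: a plain sequential function
    return x * x

def add_one(x):           # another sequential stage
    return x + 1

def square_then_add(x):   # pipe skeleton: output of one stage feeds the next
    return add_one(square(x))

def par_map(f, inputs, workers=4):
    """map/farm skeleton: apply an independent function to each input in parallel."""
    with Pool(workers) as pool:
        return pool.map(f, inputs)

if __name__ == "__main__":       # guard required by multiprocessing
    print(par_map(square_then_add, range(8)))   # [1, 2, 5, 10, 17, 26, 37, 50]
```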

2.5.1 Common Algorithmic Skeleton Implementations

We will discuss some common algorithmic skeleton implementations in the form of frameworks and libraries. The Edinburgh Skeleton Library, known as eSkel, is provided in the C programming language and runs on top of the message passing library MPI [12]. FastFlow is a skeletal parallel programming framework mainly used in the development of streaming and data-intensive applications. It was initially targeted at multi-core computers but has now been extended to target heterogeneous shared memory computing clusters [1, 2, 3]. JaSkel is a Java-based framework which supports skeletons such as farm, pipe and heartbeat. JaSkel uses the power of Java inheritance, enabling programmers to implement abstract methods for each skeleton to provide their application-specific code. Skeletons in JaSkel support sequential, concurrent and dynamic versions [49]. Finally, SkePU is a C++ skeleton programming framework that supports multicore CPUs and multi-GPU systems. The library supports six data-parallel and one task-parallel skeleton, two container types, and execution on multi-GPU systems with both CUDA and OpenCL. It is also being extended for GPU clusters [46].

2.5.2 Cost Models For Parallel Patterns

A cost model explains the execution time, memory and power consumption of an algorithm as a function of relevant parameters. Understanding the cost implications of any form of software abstraction is an important step towards developing an efficient framework. In this section, we present common cost models for parallel patterns that we have reviewed in the literature. In [14], the authors develop a new cost-optimal implementation of the double-scan (DS) skeleton for the divide-and-conquer paradigm. The authors demonstrate the use of the DS skeleton for parallelising a tridiagonal system solver and report experimental results for its MPI implementation on a Cray T3E and a Linux cluster. In [52], the authors focused on providing cost models for SQL query optimisers for parallel execution. They achieved this by providing a theoretical foundation underlying the design of efficient cost models that accurately approximate response time. Based on this theoretical foundation, they formulate two heuristic cost functions that generate plans whose performance is within 20-60% of the best-scheduled plan for 90-95% of the queries. In [117], the authors present a novel strategy for building cost models for skeleton-based parallel functional programming languages with a focus on the Bird-Meertens theory of lists [116]. Finally, in [6], the authors proposed a novel architectural cost model that takes into account the size of the cache on heterogeneous architectures and enables a skeleton-based programming model that simplifies programming heterogeneous architectures. They explored ways of improving load balancing and performance portability on heterogeneous architectures and conclude that taking cache size into account leads to greatly improved balance and performance for skeletons on heterogeneous as well as homogeneous shared and distributed memory architectures.

2.6 Big Data Frameworks

The advent of big data brought along a lot of research challenges and opportunities. It is not trivial to measure the total volume of data being generated globally. However, Zwolenski and Weatherill [144] forecasted a tenfold growth of digital content from 4.4 zettabytes in 2014 to 44 zettabytes in 2020. Traditional data storage and processing systems such as Relational Database Management Systems (RDBMS) were not designed to handle the storage and processing of such kinds and volumes of data. New tools and frameworks had to be developed to handle these new challenges. In this section, we will discuss some of the major big data frameworks that can be used to store and process such massive datasets.

2.7 Hadoop

Hadoop is an open-source big data processing framework comprising several open-source sub-projects. The main components are the file system known as HDFS [63], the programming model known as MapReduce [41] and the resource manager and allocator known as Yet Another Resource Negotiator (YARN). Successive publications from Google primarily inspired the birth of Apache Hadoop: the HDFS implementation was inspired by the Google File System paper [55], while the MapReduce paper [41] inspired Hadoop’s implementation of MapReduce. One of the main advantages of Hadoop is that it was designed to run on cheap commodity hardware and does not require an expensive, highly reliable upfront setup. It is intended to be highly available through the replication of data within the cluster because, in large clusters of commodity hardware, the chances of node failures are high.

2.7.1 Hadoop Distributed File System

In most big data problems, the dataset does not fit in a single physical machine. Therefore it becomes necessary to partition data across multiple computers. Distributed filesystems enable the storage of data on multiple nodes connected by a network. Hadoop by default comes with HDFS [63], which is designed to store huge files, supports streaming data access patterns and is optimised for delivering a high throughput of data processing. These large files are chunked and distributed on commodity hardware for storage, and their sizes could range from hundreds of megabytes to terabytes. Hadoop is also compatible with the local file system and other filesystems such as Amazon S3. As mentioned earlier, HDFS supports streaming data access because it is built around a write-once, read-many-times pattern. This means data is mostly generated and copied from the source to the HDFS file system, and the various analyses are then performed on the data.

2.7.2 Hadoop MapReduce Programming Model

Hadoop MapReduce is a large-scale parallel and distributed programming model for big data processing based on the MapReduce pattern. Google first introduced the MapReduce model [41], and the open-source community then created an open-source implementation known as Hadoop MapReduce. The MapReduce execution pipeline divides the execution of a job amongst map and reduce tasks, and the framework automatically executes the tasks on multiple nodes in a distributed cluster. The execution starts by reading data from HDFS [63], which is then fed to map and reduce tasks for processing. A master node manages the scheduling and resource allocation within the cluster.

Figure 2.6: A high-level architecture of a MapReduce program (Letter Count). Data is read from a source such as HDFS and passed to the map function, which emits a count of 1 for each letter. The shuffle phase sorts and groups the map output by key and sends all records with the same key to the same reducer. The reducer applies the reduce function and writes the output to disk.

Figure 2.6 shows a high-level representation of the Hadoop MapReduce framework. Data is stored in HDFS in chunks to form input splits. The size of an input split is based on the block size setting; typically, this would be 128MB. During the map stage, each map task reads a split from HDFS and applies a user-defined map function. In our case, the function splits a string by character and maps each character to the value 1. These intermediate results are shuffled over the network so that records with the same key end up on the same node. These keys are then processed by the user-defined reduce function. In our example, the reduce function counts the occurrences of each letter, and the final output is written back to a distributed file system such as HDFS.
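The same letter-count job can also be expressed outside the Java API via Hadoop Streaming, which pipes HDFS input through user-supplied scripts; the sketch below is illustrative, and the input/output paths and the location of the streaming jar depend on the installation.

```python
# mapper.py -- Hadoop Streaming mapper: emit "letter<TAB>1" for every letter.
import sys

for line in sys.stdin:
    for ch in line.strip():
        if ch.isalpha():
            print(f"{ch}\t1")
```

```python
# reducer.py -- Hadoop Streaming reducer: input arrives sorted by key, so the
# counts for one letter are contiguous and can be summed with a running total.
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

Such a job would be submitted with the hadoop-streaming jar shipped with the distribution, passing the two scripts via the -mapper and -reducer options; the shuffle that happens between them is exactly what Figure 2.6 depicts.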

2.7.3 YARN (Yet Another Resource Negotiator)

The resources in a cluster are typically shared between users and are not limitless. Therefore there needs to be a mechanism to manage the jobs and resources submitted to the cluster to ensure fair usage. In Hadoop, the master node of a cluster is responsible for the management of jobs and resources. In terms of this management, there is a stark architectural difference between the first version (Hadoop 1 or MapReduce 1) and the version used in this work (Hadoop 2 or MapReduce 2). This architectural difference was geared towards improving the overall performance of Hadoop clusters. One of the main performance problems with Hadoop 1 is that each worker node in the cluster is configured with a fixed number of map and reduce slots. The disadvantage of this approach is that a reduce task cannot use an idle map slot even if there is no map task to run, and vice versa. On scalability, Hadoop 1 hits bottlenecks in the region of 4,000 nodes and 40,000 tasks, while Hadoop 2 is designed to scale up to 10,000 nodes and 100,000 tasks. Hadoop 2 uses YARN [126] (Yet Another Resource Negotiator) for cluster resource management. YARN provides its core services through two types of daemons: the resource manager and the node manager daemons. The resource manager daemon manages requests for resources across the entire cluster, and it is installed on the master node. The node manager daemon runs on all the nodes in the cluster; it is responsible for launching containers and sends periodic reports of task progress to the application master, which monitors and manages the entire application's progress. In the context of YARN, a container is a logical unit of computation assigned CPU and memory in which the individual MapReduce tasks of an application are executed.

2.8 Apache Spark

Apache Spark is a unified general-purpose computing engine with a rich set of libraries for parallel data processing on a cluster of computers. As of writing this thesis, Apache Spark is the most used platform for big data processing. The project started at UC Berkeley in 2009. At that time, Hadoop MapReduce was the primary choice for big data processing. However, the drawbacks of MapReduce led to the development of other frameworks like Apache Spark. One of the main performance bottlenecks of MapReduce is when an application needs iterative or repeated access to data stored in HDFS [110]. For example, most machine learning algorithms would require several (10 to 25) passes or iterations over the data in order to converge; in MapReduce, each of these iterations had to be written as a new MapReduce job, launched separately and made to load its data from the cluster from scratch. To address this problem, the developers of Spark designed a general-purpose in-memory big data processing engine that could perform efficient, distributed in-memory data sharing across computation steps. Spark uses the concept of the Resilient Distributed Dataset (RDD), which is a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. During computation, Spark uses a Directed Acyclic Graph (DAG) to build a job execution plan from transformations such as map, flatMap and distinct.

Figure 2.7: Apache Spark Architecture: The Spark driver manages the entire Spark application and runs on the master node, which connects to the cluster manager. YARN is one of the most common cluster managers. The function of the cluster manager is to manage the entire cluster and grant requests from the application driver. The worker nodes are responsible for task execution.

Figure 2.7 shows the basic architecture of Spark. The four main components are the driver, the cluster manager, the workers and the executors. The Spark driver manages the Spark application and runs on the master node, which connects to the cluster manager. The cluster manager manages the entire cluster and grants requests from the application driver. Spark can use different cluster managers such as YARN and Mesos. Workers host Spark executors, which are assigned logical resources for the execution of tasks in parallel. During application execution, Spark bundles the application into stages. A stage corresponds to a collection of tasks which all execute the same code in parallel, each on a different subset of the data. Each stage contains a sequence of transformations that can be completed without shuffling the full data. Shuffling involves moving data from one node to another, and a new stage is started after each shuffle.
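The following PySpark fragment (illustrative only; the input path is a placeholder) shows how a chain of transformations is merely planned lazily, and how the reduceByKey shuffle marks a stage boundary once the final action runs.

```python
# Lazy evaluation and stage boundaries in Spark (illustrative sketch).
from pyspark import SparkContext

sc = SparkContext(appName="stages-sketch")

lines = sc.textFile("hdfs:///data/text.txt")          # nothing is read yet
words = lines.flatMap(lambda line: line.split())      # transformation (stage 1)
pairs = words.map(lambda w: (w, 1))                   # transformation (stage 1)
counts = pairs.reduceByKey(lambda a, b: a + b)        # shuffle -> new stage

# Only this action triggers execution; Spark builds the DAG, splits it at the
# shuffle into two stages and schedules the tasks on the executors.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
print(top10)
```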

2.9 Benchmarking

Benchmarking is the process of stress testing a system to gain correctness and performance insights. Big data benchmarks are developed to evaluate and compare the performance of big data systems and architectures. A benchmark aims to generate application-specific workloads and tests capable of processing big data sets to produce meaningful evaluation results [122]. Generally, the process of big data benchmarking follows the key steps shown in Figure 2.8.

Figure 2.8: Benchmarking pipeline for big data systems: The planning stage defines what to benchmark, and relevant data, which can be synthetic, is produced in the data generation stage. Workloads to benchmark the system are then generated and executed. The final step is to analyse the resulting data and make informed conclusions.

In the planning step, the infrastructure, the application and the metrics to extract are determined. The data generation stage involves identifying real data or generating synthetic data for the benchmarking process to use, while workload generation consists of identifying representative workloads to use in the benchmarking process. These workloads must cover a variety of domains to test the system or framework adequately. Execution entails running the workloads on the system and collecting values for all the metrics identified in the first step. The data collected about the metrics is then observed and analysed to draw conclusions about performance. The goal of benchmarking is to provide realistic and accurate measurement of big data systems, thereby addressing two objectives. The first objective is to promote the development of big data technology, that is, to develop new architectures, algorithms, techniques and software stacks to manage big data and extract their value and hidden knowledge [66]. Secondly, it can assist system owners in making decisions for planning system features, tuning system configurations and validating deployment strategies to improve systems [66]. Benchmarks can also be used to identify performance bottlenecks, thus optimising system configuration and resource allocation strategies.
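A stripped-down version of the execution and metric-collection steps could look like the sketch below; the workload commands, script names and repetition count are placeholders. Each workload is run several times and the measured execution times are written to a CSV file for the later modelling stage.

```python
# Minimal benchmarking harness sketch: run each workload repeatedly, record
# wall-clock execution time and append the observations to a CSV file.
# The spark-submit arguments below are placeholders for real workload scripts.
import csv
import subprocess
import time

WORKLOADS = {
    "wordcount": ["spark-submit", "wordcount.py", "hdfs:///data/text.txt"],
    "sort":      ["spark-submit", "sort.py", "hdfs:///data/records.txt"],
}
REPETITIONS = 3

with open("benchmark_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["workload", "run", "execution_time_s"])
    for name, command in WORKLOADS.items():
        for run in range(REPETITIONS):
            start = time.time()
            subprocess.run(command, check=True)      # execute the workload
            writer.writerow([name, run, round(time.time() - start, 2)])
```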

2.9.1 Approaches to Benchmark Design

Big data benchmarks can be grouped based on their scope. The following categories are identified.

• Micro Benchmarks:- This group of benchmarks is used to measure and evaluate the performance of a specific part of a system or point of performance, for example, measuring the cost of serialising data or the cost of compressing data [66]. The measured system components can include both hardware, such as the CPU or networking devices, and software, such as HDFS. For example, Hadoop is shipped with several micro-benchmarks such as NNThroughputBenchmark and TestDFSIO, which are used to test the performance of the Hadoop name node and the network components of HDFS, respectively. Additionally, grep and sort are examples of micro-benchmark workloads that target specific system behaviours in Hadoop and Spark.

• End to End Benchmarks:- These are benchmarks that are designed to measure and evaluate the entire system using representative and typical application scenarios [67]. Each of these scenarios would correspond to a collection of related workloads in several domains. An example of an end-to-end benchmark is the Transaction Processing Performance Council Decision Support benchmark (TPC-DS). TPC-DS is a decision support benchmark that models and measures the performance of several applicable cases of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general-purpose decision support system. Additionally, the Yahoo! Cloud Serving Benchmark (YCSB) [37] is an end-to-end benchmark for NoSQL databases.

• Benchmark Suites:- This category of benchmarks follows a hybrid approach by including both end-to-end and micro-benchmarks. They aim to provide a comprehensive benchmarking solution for different big data frameworks. For example, benchmark suites such as HiBench [74] and BigDataBench [133] provide benchmark solutions for various big data systems like Hadoop, Spark and streaming frameworks like Apache Flink.

2.9.2 Benchmarking Suites

• HiBench [74], developed by Intel, is a representative, comprehensive and open-source big data benchmark suite for Hadoop, Spark and streaming workloads. It consists of workloads for different big data systems and a data generator that generates synthetic data of configurable sizes for those workloads. The workloads are comparably well abstracted and easily configurable to generate and run benchmarks. HiBench, out of the box, is compatible with all major distributions of Hadoop, Spark and streaming frameworks, e.g. the Apache, Cloudera and Hortonworks distributions.

• BigDataBench [133] is the outcome of a joint research effort between the University of Chinese Academy of Sciences, Dropbox, Yahoo!, Tencent, Huawei and Baidu. It consists of a data generator and a benchmark suite. Its main goal is to provide benchmarking capability for a diverse set of big data application and dataset types. In the more recent release of the benchmark suite, the authors used the concept of a "big data dwarf" [54] to mitigate the challenge of representing all possible big data computing workloads. Their idea is to construct a benchmark suite using a minimum set of units of computation to represent the diversity of big data analytics workloads.

• BigBench [133] is an end-to-end big data benchmark proposal whose underlying design follows the business model of a product retailer. BigBench is not an open-source project and is therefore not available for deployment and testing by others. Its workloads were executed on the Teradata Aster DBMS, which is based on nCluster [40], a shared-nothing parallel database optimised for data warehouse and analytic workloads.

2.10 Performance Modelling

In the context of this work, performance modelling is the process of building models to make informed decisions about unseen cases. Big data systems are complex, with hundreds of configuration parameters; it would be an uphill task to manually tune these parameters for performance optimisation. To overcome this problem, researchers have resorted to building performance models to answer various questions. The modelling process uses the data generated from the benchmarking stage to build performance models, which can be trained, tested and validated using machine learning algorithms. For example, if the objective is to predict the execution time of an application, then a regression algorithm can be utilised. If the objective is to classify whether an application will meet a deadline based on configuration parameters, then a classification algorithm could be used. Broadly, there are two main approaches to performance modelling, and they are discussed below.

2.10.1 Approaches to Performance Modelling

• Analytic:- This category is used in cost-based performance prediction, and it uses a white-box approach to performance modelling. This means a deep understanding of the underlying system must be considered when building the models. Cost-based models are normally highly accurate precisely because this deep understanding of the system is required. The approach is mainly applied to I/O cost, memory cost, CPU cost and network cost modelling [131]. Since it is challenging to understand the interaction between various system internals such as the software stack, the operating system and the hardware stack, it becomes practically challenging to model more complex big data systems with hundreds of configuration parameters.

• Machine Learning:- This category uses machine learning to build performance models from historical data collected by executing workloads on the computing cluster. The models are black-box because a deep understanding of the underlying system is not required.

Machine learning approaches have two main advantages. First, these kinds of models are based on observations of the actual system performance under a particular workload and cluster [66]. Second, they are usually simpler to build than white-box models as there is no need for detailed information about the system’s internals. It is important to mention that some approaches use a mix of analytic and machine learning approaches. In this work, we have used a grey-box approach, which combines details of the underlying frameworks and applications with black-box machine learning to build better performance models.

2.10.2 What to Model

Performance models can have different objectives. In the context of modelling big data applications, models can focus on predicting the execution time of an application or inferring whether a specific deadline would be met given certain resource configurations. Cost models can also be used to predict how much hardware (CPU, memory and VMs) would be needed to optimise performance. The base models could also be used to recommend the best values for sets of configuration parameters to optimise performance and cost.

2.11 Machine Learning

The grey-box performance modelling approach we have used in this work combines machine learning with details of the underlying computing framework to model the performance of applications. The goal of machine learning is to enable computers to use historical data to solve an unseen problem [81]. Machine learning algorithms are mainly black-box algorithms that users do not have much control over, although some algorithms, such as random forests and support vector machines, do provide tuning parameters to improve performance. Machine learning approaches are grouped into different categories; below, we present a brief explanation of the fundamental algorithms and the categories relevant to this work.

2.11.1 Supervised Learning

Supervised learning is where you have a predefined input variable x and an output variable y and use an algorithm to learn the mapping function from the input to the output [20]. The goal is to approximate the mapping function to predict the value of y given unseen data x. This is called supervised learning because the algorithm learns from a labelled dataset. Below we will discuss a few supervised learning algorithms. Supervised learning can also be grouped into classification or regression problems. A classification problem is where the output variable is one thing or the other (such as blue, white or red). In regression, the output is a continuous or real value such as time or distance. In this thesis, we focus on regression problems only. Semi-supervised learning is a form of supervised learning where you have a large set of input data X and only some of that data is labelled Y [24]. In the real world, many machine learning problems fall under this category because of the challenges of labelling data. Another type of supervised learning is called active learning; this is also referred to as optimal experimental design. In this learning method, the learning algorithm interacts with the user to label unknown or new data points [112]. In some cases, unlabelled data is abundant, and manually labelling this data is expensive.

Linear Regression

Linear regression is a linear approach to modelling the relationship between a variable of interest, known as the response variable, and one or more explanatory or independent variables [99]. For example, to model the execution time of an application, explanatory variables such as the amount of data to process, the amount of memory, the CPU, the number of mappers and the number of executors can be used to inform the prediction.
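As an illustration of such a model (a sketch on synthetic data; the feature names follow the examples in the paragraph above and the coefficients are invented), a linear regression can be fitted on benchmark observations and then queried for an unseen configuration.

```python
# Linear regression sketch: predict execution time from configuration features.
# The training data here is synthetic and only illustrates the workflow.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.uniform(1, 100, n),    # input data size (GB)
    rng.uniform(2, 64, n),     # executor memory (GB)
    rng.integers(1, 16, n),    # number of CPU cores
    rng.integers(1, 20, n),    # number of executors
])
# Invented ground truth: time grows with data size, shrinks with more resources.
y = 100 + 3.0 * X[:, 0] - 0.5 * X[:, 1] - 1.0 * X[:, 2] - 1.5 * X[:, 3] \
    + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out runs:", round(model.score(X_test, y_test), 3))

# Predict the execution time of an unseen configuration:
# 50 GB of input, 16 GB memory, 8 cores, 10 executors.
print(model.predict([[50, 16, 8, 10]]))
```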

Decision Tree

Decision Tree is a supervised machine learning algorithm that is used to predict a target variable of interest by learning decision rules from independent variables. As the name suggests, this model breaks down data by making decisions based on asking a series of questions. For example, when modelling execution time based on allocated resources, the decision rules can be attached to variables such as memory and CPU. Decision trees are used for both classification and regression problems.

Random Forest

Random Forest is a popular supervised machine learning algorithm that uses an ensemble method to build models [88]. An ensemble method is a machine learning technique that combines the predictions from multiple algorithms to make a more accurate prediction than any individual model [24]. It operates by constructing a multitude of decision trees at training time and, in the case of regression, outputting the mean prediction of the individual trees. The random forest algorithm can be used for both regression and classification problems.

Support Vector Machine/Regression

The term Support Vector Machine (SVM) is mostly associated with classification problems [120]. However, SVMs can also be used as a regression method, maintaining all the main features that characterise the algorithm. In simple regression, the goal is to minimise the error rate between the predicted and actual values, while in Support Vector Regression (SVR), the goal is to fit the error within a certain threshold (epsilon) [45]. That is, the goal is to find a decision boundary at distance epsilon from the original hyperplane such that the data points closest to the hyperplane, the support vectors, are within that boundary.
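A brief sketch of this idea using scikit-learn's SVR is shown below; the data is synthetic and the epsilon value is arbitrary. Predictions that fall within the epsilon tube around the true values incur no penalty during fitting.

```python
# Support Vector Regression sketch: fit within an epsilon-insensitive tube.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, (100, 1))                 # single synthetic feature
y = 2.5 * X.ravel() + rng.normal(0, 0.5, 100)    # noisy linear target

model = SVR(kernel="rbf", C=10.0, epsilon=0.5)   # errors below 0.5 are ignored
model.fit(X, y)
print(model.predict([[4.0]]))                    # predicted value near 10
```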

2.11.2 Unsupervised Learning

Unsupervised learning is when there is only an input x without any predefined output variable y. The goal of unsupervised learning is to discover patterns or structures in the data [70] without any form of supervision. These methods are called unsupervised because there are no predefined answers and no supervisor; instead, the algorithms devise a way to discover and present interesting patterns from the data. In this thesis, we have mainly used supervised learning algorithms to model the performance of applications; however, in Chapter 6, we model the performance of six machine learning algorithms with further discussion of each of them. An example of an unsupervised machine learning algorithm is k-means [69], a clustering algorithm that partitions data into k groups given a set of observations and predictors.

2.12 Summary

In this chapter, we first discussed the general background of data-intensive applications. We then discussed the background of the various frameworks and technologies we have used in this work and the approaches used to benchmark and model the performance of big data applications. We can generally deduce three high-level takeaways:

• Despite the diversity of big data applications, in this work, we have identified and modelled the performance of two of the most common communication patterns. These patterns are MapReduce and Dataflow with Cycles.

• Benchmarking and performance modelling can be used to identify performance bottlenecks in big data applications. We can use models to predict the performance of an application and recommend the best configuration parameters.

• Big Data Frameworks are complex and highly configurable; therefore, to model their performance, proper techniques must be used to identify key performance drivers.

In the next chapter, we will cover a detailed literature review to highlight the most important works in this domain, build a taxonomy and present various gaps.

CHAPTER THREE LITERATURE REVIEW

3.1 Introduction

In this chapter, we will review the existing state of the art in benchmarking and performance modelling of data-intensive applications. The chapter starts by presenting the survey methodology in section 3.2, followed by a thorough discussion of existing novel works. In section 3.2, we also construct a detailed taxonomy of benchmarking and performance modelling of data-intensive applications to identify trends and research gaps. We discuss the research challenges and state the objectives of this research in section 3.7, and we conclude and summarise in section 3.8.

3.2 Survey Methodology

In this section, we present the survey methodology used in this work and discuss the state-of-art in benchmarking and performance modelling approaches of data-intensive applications. We have used IEEE Xplore, ACM, Google Scholar, ScienceDirect and recommendations from Mendeley to search, identify and organise outstanding and well-cited publications in this domain. We selected the papers from the following conferences: IEEE International Conference on Cloud Computing (CloudCom), USENIX Symposium on Networked Systems Design and Implementation (NSDI), ACM/SPEC International Conference on Performance Engineering (ICPE), IEEE International Conference on Big Data (BIGDATA), IEEE International Conference on High-Performance Computing and Communications (HPCC), International Conference on Very Large Data Bases (VLDB), IEEE Transactions on Parallel and Distributed Systems (TPDS), IEEE International Conference on Cloud Computing (CLOUD), IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) and ACM International Conference on Autonomic Computing (ICAC).


In addition to including only papers published in relevant and reputable conferences, we added the following broad criteria to make sure the publications are closely related to this research.

• The benchmarking approach must focus on common data-intensive applications; we discard approaches focusing on benchmarking simple applications.

• The performance modelling approaches may use a black, grey or white box approach. Information about the framework and application-level features included in the model building process must be apparent from the paper.

• The framework on which the experiments were conducted should be a parallel computing environment such as Apache Hadoop or Spark. This requirement helps us to classify works by communication patterns. We did not include High Performance Computing (HPC) environments.

After a thorough read-through of forty-five (45) candidate papers, based on the criteria listed above, we pruned the papers down to twenty-three (23). To further strengthen the relevance of each of the included related works, we make sure that, for each paper, we can evaluate and answer the questions discussed below. These questions are broadly grouped into three main categories: (1) benchmarking approach, (2) modelling approach and (3) validation and experimental setup. The grouping and classification of each question is shown in Figure 3.1.

Figure 3.1: Grouping the six questions by category. Some of the questions belong to multiple categories. Q1 is about the main purpose of the study, Q2 is about the communication pattern the study focused on, Q3 is about the benchmarking approach, Q4 is about the objective of the performance models, Q5 is about scalability and Q6 is about the use of representative workloads.


• Q1 – What is the main purpose of the work in the paper? This question seeks to identify whether the work is about benchmarking, performance modelling or both. We also seek to understand what benchmarking tools were used to gather the data needed for the modelling step. This question falls into the benchmarking and modelling approach categories.

• Q2 – Is the focus of the paper on MR or DFWC patterns: This question seeks to understand the focus of the paper in terms of the dataflow communication pattern being addressed. MR and DFWC are inherently different in many ways; therefore, their benchmarking and modelling approaches must also reflect that. This question addresses all three categories, i.e. the benchmarking, modelling and experiments must focus on MR or DFWC patterns.

• Q3 – What approach is used for benchmarking or framework profiling: The goal of this question is to understand the different methods used by the selected papers to perform cluster and application profiling. This question falls into the benchmarking category.

• Q4 – What is the main objective of the modelling: This question seeks to address the main objective of the performance model. What are the authors trying to predict? For example, this could be execution time of an application, meeting deadlines, cost in terms of CPU usage, memory consumption, and the effect of serialisation & compression on performance. This question falls into the modelling category.

• Q5 – How scalable is the solution: This question seeks to evaluate the scalability of the solution proposed in the papers. This question has a lot to do with the experimental setup: conclusions made using 2 - 3 nodes will carry less weight compared to those carried out using many nodes. This question falls into the validation and experimental setup category.

• Q6 – Are the workloads representative? This question helps us to understand what kinds of workloads are used in the benchmarking and experimental setups. Workloads are crucial in the entire process. For MR, we expect batch or non-iterative workloads to be used and for DFWC, we expect iterative workloads such as machine learning applications. This question falls into the validation and experimental setup category.

In the following sections, we will present works grouped by communication pattern, i.e. the MR and DFWC patterns. For each work, we will discuss and highlight the answers to the above questions, grouping them into the three main categories. We will also discuss the various benchmarking or framework profiling approaches used in the selected research papers. These benchmarking approaches are broadly used for two things: first, they are used to study and understand the behaviours of data-intensive applications in the MR pattern; secondly, they can be used to gather data for performance modelling. In section 3.3, we present the literature review of the MapReduce pattern, and we discuss the literature review of DFWC in section 3.4.

3.3 Literature Review of MapReduce Pattern

The MapReduce pattern uses a simple two-stage programming model for data processing. The main bottleneck in this pattern is in the shuffle stage that exists between the map and reduce stages. The shuffle stage is network-bound because it transfers data over the network from the mapper nodes to the reducer nodes. In this subsection, we will discuss original works on the benchmarking and performance modelling of the MR pattern, covering the benchmarking approaches, the modelling approaches, and the validation and experimental environments used.

Benchmarking Approaches

Zhang et al. [139, 140] in both works developed a novel benchmarking approach for the MapReduce pattern. They achieved this by generating and performing a set of parameterisable micro-benchmarks to profile the completion time of the six generic phases of the Hadoop MapReduce version 1 pipeline. The execution time of each of the generic phases is usually a function of the size of the dataset it processes. The approach they used accurately predicts the completion time of selected MapReduce applications with an average error rate ranging from 15% to 17%. Ceesay et al. [21] designed a containerised benchmarking approach that can be used to automatically obtain low-level performance metrics from underlying frameworks like Hadoop. The primary motivation of this work is to simplify the entire big data benchmarking process, inspired by popular benchmarking tools like HiBench [74]. Song et al. [118] developed a dynamic, lightweight job analyser which extracts the following information from jobs submitted by the user. First, it measures and records the input size of the data being processed and the number of records. Secondly, it estimates the complexity of the user-defined map and reduce functions. Finally, it captures the data conversion rate, that is, the ratio of mapper input and output data, known as the map selectivity. Verma et al. [128] built a comprehensive MapReduce job profiling engine that compactly summarises critical performance characteristics and properties of the underlying application during the execution of the map and reduce stages. The main goal of this profiling engine is to create a generic job profile that is independent of the amount of resources assigned to a job and covers all the phases of the MapReduce pipeline, i.e. the map, shuffle, sort and reduce phases. The results of the job profiles are later used in the model building process to estimate the completion time of a job. Herodotou and Babu [72] developed an analytic end-to-end MapReduce job profiler responsible for collecting job statistics. In their implementation, the job profiler gathers information about the dataflow and cost estimates for each MapReduce job. The dataflow estimates represent information about the data being processed, e.g. the number of bytes read or written, while the cost estimates represent resource usage and execution time. The dataflow information is used to understand the behaviours of the job being executed. This gathered information is later used in the model building process to predict the execution time of a job.

Modelling Approaches

In this section, we cover the performance modelling approaches used in the selected papers. We also include the goals of the models, for example, whether the models predict execution time or resource usage.

Zhang et al. [139] divided the modelling process into two steps. The first step is to model each of the six phases of the MapReduce execution pipeline; the second step models the execution time of the entire job. In the first step, they used non-negative least squares regression to model the execution time of each phase of the MapReduce pipeline. Their basic idea is to express these models as a function of the size of the data being processed. The modelling approach for the shuffle stage is slightly different from the rest: they used a combination of two linear models over two data ranges (less than 3.2 GB and larger than 3.2 GB). In the second step, i.e. modelling the completion time of a given MapReduce job, they used the analytic model designed and validated in ARIA [128]. The main goal of their work is to predict the execution time of a given MapReduce job.

Verma et al. [128] proposed ARIA, which uses an analytic approach instead of machine learning for performance modelling. Their model utilises the average and maximum map and reduce task completion times as a function of the allocated map and reduce slots. The model is based on the information collected by the job profiler and the performance bounds on the completion time of the different phases. The main goal of their work is to estimate the amount of resources (map and reduce slots) required for a job with a known profile to meet a specific deadline.

Herodotou and Babu [72] developed what they coined a What-If Engine, which contains a set of analytic white-box models similar to [71] for estimating the Dataflow and Cost fields in a virtual or hypothetical job profile. Dataflow fields are associated with MapReduce counters such as map input records and reduce input bytes, while cost fields are associated with phase information such as the read phase, collect phase, and write phase. The approach uses various sets of formulas to estimate the time it takes to run an entire job. These dataflow and cost statistics are used as input to the analytic models in the model building process. The goals of their work are twofold. The first is the What-If Engine, which is used to estimate the execution time of a given MapReduce job using analytic models. The second is the provision of a job profiling framework to collect detailed statistical information about a given job.

Khan et al. [80] used a hybrid approach to model the performance of MapReduce applications. In their work, they first presented a series of analytic derivations to show how the execution time of each stage in the MapReduce pipeline can be calculated. These initial derivations only help the reader understand how such models are obtained; they were not used in the performance prediction steps. The other part of the hybrid approach uses locally weighted linear regression to estimate the execution time of a job. Furthermore, they also used Lagrange multiplier techniques for optimal resource provisioning so that deadlines can be satisfied. The main goal of their work is to meet job deadlines and to predict the execution time of a given MapReduce job.
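The phase-level regression idea used by Zhang et al. [139] can be made concrete with a short sketch. The Python snippet below is a minimal illustration, not their implementation: it fits a non-negative least-squares model expressing a single phase's duration as a function of the input data size, using hypothetical benchmark observations.

import numpy as np
from scipy.optimize import nnls

# Hypothetical benchmark observations: (input size in GB, phase duration in seconds)
sizes = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
durations = np.array([6.1, 11.8, 23.5, 46.2, 93.0])

# Design matrix with an intercept term: duration ~ b0 + b1 * size,
# with both coefficients constrained to be non-negative
X = np.column_stack([np.ones_like(sizes), sizes])
coefficients, _ = nnls(X, durations)

def predict_phase_time(size_gb):
    """Predict the duration of this phase for a given input size in GB."""
    return coefficients[0] + coefficients[1] * size_gb

print(predict_phase_time(10.0))

In [139], one such model would be fitted per pipeline phase and the per-phase predictions combined to estimate the job completion time.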

Validation and Experimental Environments MR

This section addresses the last two questions listed in Section 3.2. The first question, i.e. Q5, deals with the scalability of the experimental results, while the second, i.e. Q6, deals with the types of workloads used in the performance prediction process. Answering these questions helps to understand the relevance of a particular research work and how it fits in current data-intensive research. In data-intensive processing, it is usually preferable to have larger cluster setups in order to effectively parallelise jobs. In terms of workloads, a representative workload covering various domains of data-intensive applications is also preferred over workloads focusing on only one or a few domains. In this work, we refer to a mix of representative and non-representative workloads as Semi-representative Workloads. This grouping helps to properly group works based on the types of workloads used in their experiments. We also consider the version of the cluster computing software used in the experiments, because in some cases there is a massive design difference between versions. For example, there is a considerable design difference between Hadoop version 1 (MR1) and Hadoop version 2 (MR2). In simple terms, MR1 uses fixed-size slots to run tasks. The disadvantage of this is that a map slot cannot be used to run a reduce task and a reduce slot cannot be used to run a map task even if the slots are available. In MR2, this problem is mitigated by having a cluster-wide resource manager, YARN, to utilise the cluster resources fully.

Khan et al. [80] used two setups. The first setup is an eight-node in-house Hadoop cluster; all nodes have Hadoop version 1 installed, with 4 CPU cores, 8GB RAM and 150GB of disk space each. Their second experimental setup uses 20 m1.large Amazon EC2 instances. The workloads used in their work are WordCount and Sort applications. We can, therefore, argue that their computing environment is acceptable; however, the two workloads they used are too few to support any general conclusions.

Herodotou and Babu's [72] experimental setup uses 16 Amazon EC2 nodes of the c1.medium instance type. The instances have Hadoop version 1 installed, and each node can run a maximum of 30 map and reduce tasks concurrently. They selected representative MapReduce workloads from different domains such as text analytics (WordCount), Natural Language Processing (Word Co-occurrence), the web (Inverted Index) and data processing (Join and Sort). In this case, we can conclude that they used modest cluster settings and representative workloads.

Verma et al. [128] performed their experiment on a 66-node cluster installed with Hadoop version 1. Each of these machines has four AMD 2.39GHz cores, 8 GB RAM and two 160GB hard disks. Each of the workers is configured with four map and four reduce slots. Their workloads consist of a set of semi-representative workloads. The following applications were included: WordCount, Sort, Bayesian Classification, Term Frequency Inverse Document Frequency (TF-IDF), WikiTrends and a Twitter application. TF-IDF is used in information retrieval tasks, WikiTrends counts the number of times each Wikipedia article has been visited according to the given input dataset, and the Twitter application evaluates the number of asymmetric links in a dataset.

Similar to [128], Zhang et al. [139] performed their experiment on a 66-node cluster installed with Hadoop version 1.
Each of these machines has four AMD 2.39GHz cores, 8 GB RAM and two 160GB hard disks. Each of the workers is configured with two map and two reduce slots. They used representative workloads consisting of basic statistics applications such as WordCount and Sequence Count, web applications such as Inverted Index and Ranked Inverted Index, and database and machine learning related applications such as Joins and KMeans. To wrap up this subsection, we have presented the most relevant and most cited works in both benchmarking and performance modelling of the MapReduce pattern. In the discussion, we have evaluated the modelling and benchmarking approaches used and also discussed the validation and experimental environments.

3.4 Literature Review of Data Flow With Cycle Pattern

Unlike the simple two-stage model of the MapReduce pattern, the Data Flow With Cycle pattern can consist of several stages, each of which may be repeated for several iterations. In this pipeline, there could also be several shuffle operations. Apache Spark, a general-purpose in-memory data processing engine, is one of the most popular frameworks that implement the DFWC pattern. In this subsection, we review leading works on benchmarking and performance modelling of the DFWC pattern, focusing on the benchmarking and modelling approaches and the validation and experimental environments used.

Benchmarking Approaches DFWC

Venkataraman et al. [127] build models with no historical information and try to minimise the amount of training data required. The main idea in their benchmarking process is to run a set of instances of the entire job on samples of the input dataset and use the data gathered from these training runs to create a performance model. The goal of this approach is to avoid a high overhead in the benchmarking process; in general, it takes much less time and fewer resources to run the training jobs than to run the entire job itself.

Wang and Khan [132] developed a framework that first executes a program on a cluster using a limited amount of sample input data and collects performance metrics such as run time, I/O cost and memory cost during the simulated run. Next, the performance metrics extracted from these simulated runs are used to predict the performance metrics for the actual execution of the application. The use of sample input data helps to reduce the overhead of collecting training data. In more detail, the data collection step leverages the Spark event logs generated by the Apache Spark platform; these record execution profiles and performance metrics obtained directly from the Spark event listeners and saved in JSON format. The log files are then analysed to calculate the values used by the various analytic models.

Chao et al. [27] devise a way of collecting historical information for each application before modelling. Reflecting a common requirement of machine learning algorithms, they carefully selected a balanced distribution of parameters for the training data. The historical data was collected by running applications with different configuration parameters. Once sufficient historical information is gathered, models are built and used to forecast performance given the system configuration and the size of the input data. This historical information, such as runtime configurations and the DAG of stages, is extracted from the log files of Apache Spark's history server.

Wang et al. [131] used random sampling in the data collection process as a prerequisite to training the models. They generated a parameter list of 500 records for each workload and collected data from real executions on the Spark cluster for each workload. The authors employed a sparse sampling technique to address the problem of a large parameter space. They explore two ways of achieving this: the first approach uses an exhaustive search over different subsets of the available parameter space; the second performs a random search over the parameters, selecting values uniformly at random from a diverse range during data collection. Combining these two approaches provides more insight into the real performance characteristics of the applications. The training data is collected based on these parameter values, and the workloads are executed three times for each parameter configuration to capture performance variability.

Nguyen et al. [102] identified eleven numerical and two Boolean configuration parameters as the critical tunable parameters out of the 150 configuration settings of Apache Spark. Multiple experiments are executed to collect the necessary data for these 13 configuration settings for a given application.
To address the exponential growth involved in including all possible values of the selected configuration parameters, they used the Latin Hypercube Sampling (LHS) [119] technique to generate combinations of values. However, the original LHS implementation has a limitation whereby a specific value of a configuration parameter cannot appear more than once, which affects the range of the sampled values. To address this problem, they modified the original implementation to be more flexible.
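As a rough illustration of how Latin Hypercube Sampling can cover a large configuration space with relatively few benchmark runs, the Python sketch below draws a handful of samples over three illustrative numeric parameters. The parameter names, ranges and sample count are assumptions made for the example; they are not the parameters selected in [102], nor is this their modified LHS implementation.

from scipy.stats import qmc

# Illustrative (lower, upper) bounds for three hypothetical numeric parameters
bounds = {
    "executor_memory_gb": (2, 16),
    "executor_cores": (1, 8),
    "shuffle_partitions": (50, 400),
}

sampler = qmc.LatinHypercube(d=len(bounds), seed=42)
unit_samples = sampler.random(n=10)            # 10 configurations in [0, 1)^d

lower = [low for low, _ in bounds.values()]
upper = [high for _, high in bounds.values()]
configurations = qmc.scale(unit_samples, lower, upper)  # rescale to the ranges

for row in configurations:
    config = {name: int(round(value)) for name, value in zip(bounds, row)}
    print(config)  # each sampled configuration would then be benchmarked once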

Modelling Approaches DFWC Pattern

In this section, we cover the performance modelling approaches used in the papers that model the DFWC pattern. We also include the goals of the models, for example, whether the models predict execution time or resource usage.

Venkataraman et al. [127] used a machine learning approach to model the performance of basic and advanced analytic jobs. The main goal highlighted by the authors is to model and predict the running time on a specified hardware configuration, given a job and its input. To build the models, they added terms related to the computation and communication stages of the data processing pipelines. To be more specific, the following terms were added: first, a fixed cost term which represents the amount of time spent in serial computation; secondly, an interaction between the scale and the inverse of the number of machines, which captures the parallel computation time for algorithms whose computation scales linearly with data; thirdly, a log(number of nodes) term to model communication patterns like aggregation trees; and finally, a linear term in the number of nodes, which captures the all-to-one communication pattern and fixed overheads like scheduling and serialisation. The authors used a non-negative least squares (NNLS) solver to find the model that best fits the training data.

Wang and Khan [132] used an analytic approach instead of machine learning to predict the performance of Apache Spark jobs. The main goal of their work is to model and predict the execution time and resource requirements of a given Spark job. To achieve this, they built three main analytic models, each representing a vital stage of the execution pipeline. The first model estimates the entire execution time of a given Apache Spark job; this includes models to estimate the time it takes to run a stage and the time it takes to run a task. They also included models to estimate memory and I/O cost.

Chao et al. [27] used machine learning (regression approaches) to train a grey-box model using historical information. The main goal of their work is to predict the execution time of a given Spark job. The model divides the prediction of performance into subproblems according to the stages generated by Apache Spark. During the training of the models, the following regression algorithms were used: Linear Regression, Support Vector Regression, Decision Tree Regression, Extremely Randomised Trees Regression, Random Forest Regression and Gradient Boosting Regression Trees. The six regression algorithms were evaluated in advance using typical model selection methods like leave-one-out or K-fold cross-validation, and the best algorithm for each subproblem was chosen by evaluating the mean squared error (MSE) and mean absolute error (MAE).

Wang et al. [131] employ a novel method of tuning the configuration parameters of Apache Spark applications based on machine learning algorithms. The main goal of this work is to evaluate whether an application's execution time has improved as a result of tuning the configuration parameters. This is a binary classification problem, which makes their approach deviate from the majority of the papers analysed in this work. Their approach first builds a binary classification model and predicts the execution time of a job based on a set of configuration parameter values. If this execution time is shorter than the execution time under the default parameters, then the label of the category is set to 1 (improved), else 0 (not improved).
They further expand the configurations classified as improved into bins of 5%, 10%, 15%, 20%, and 25% to build a multi-class classification model with five categories. In the model selection process, the following classification algorithms were used: Decision Tree (C5.0) [50], Logistic Regression, Support Vector Machines and Artificial Neural Networks (ANN).

Nguyen et al. [102] proposed an analytic approach to model the execution time of a given Apache Spark application. Using these models and correlation analysis, they can evaluate the effect of each selected configuration parameter on the execution time of a given job; for example, they can assess the impact of scaling executor cores on performance. They developed several analytic formulas to model the performance of an entire job, including the time it takes to run a given job, the time it takes to execute a given stage and the number of tasks in a given job.
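To make the structure of the model used by Venkataraman et al. [127], described at the start of this subsection, more concrete, the Python sketch below fits a model of the form time = t0 + t1*(scale/machines) + t2*log(machines) + t3*machines with a non-negative least-squares solver. The training observations are hypothetical, and the sketch only illustrates the term structure; it is not their implementation.

import numpy as np
from scipy.optimize import nnls

def features(scale, machines):
    # fixed cost, parallel term, aggregation-tree term, all-to-one/overhead term
    return np.array([1.0, scale / machines, np.log(machines), float(machines)])

# Hypothetical training runs: (fraction of the input data, number of machines, runtime in s)
runs = [(0.125, 2, 34.0), (0.25, 4, 38.0), (0.25, 8, 36.0),
        (0.5, 8, 45.0), (0.5, 16, 43.0), (1.0, 16, 58.0)]

X = np.array([features(scale, machines) for scale, machines, _ in runs])
y = np.array([runtime for _, _, runtime in runs])
theta, _ = nnls(X, y)

# Predict the runtime of the full dataset (scale = 1.0) on 32 machines
print(features(1.0, 32) @ theta)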

Validation and Experimental Approach DFWC

It is essential to understand the size of the computing cluster used in the experiment. To measure how general the entire approach is, it is crucial to understand what kinds of workloads are used during the testing and evaluation of the methods proposed in the surveyed works.

Nguyen et al. [102] performed their experiments on a six-node cluster running Apache Hadoop version 2. Each of the nodes is equipped with 12 CPU cores, 32 GB of RAM and a 1.8 TB hard drive; in aggregate, the cluster has 60 cores, 192 GB of RAM and 10.8 TB of disk. Compared to the other experimental setups evaluated in this work, we consider this an average setup in terms of capacity. The workloads include WordCount, TeraSort, KMeans, Matrix Factorisation, PageRank, and Triangle Count.

Wang et al. [131] performed their experiments on a four-node cluster installed with Apache Spark version 2. Each of the nodes is equipped with 16 to 24 cores, 16GB to 24GB of memory and 1TB of disk space. We consider this a modest setup in terms of the number of nodes. They included only four workloads: Grep, Sort, WordCount, and Naive Bayes. We argue that these workloads, except Naive Bayes, are not representative of the DFWC pattern. In summary, the testbed has a medium configuration, but the workloads used are only partially representative.

Chao et al. [27] comparatively used a high-end setup consisting of 32 worker nodes, with each node equipped with 32GB of memory and two physical CPUs. The cluster is installed with Apache Spark version 2.2.0. They included semi-representative workloads comprising TeraSort, KMeans, PageRank and WordCount.

Venkataraman et al. [127] used Amazon EC2 to test their implementation. The number of nodes used in their setup ranges from 16 to 64 machines based on the task at hand; for example, in training they use between 1 and 16 nodes, while in situations where the algorithm uses the entire dataset, such as the prediction task, they use 45 and 64 nodes. They used nine machine learning algorithms from the Apache Spark Machine Learning library (MLlib). We, therefore, categorise the testbed they used as large and the workloads used as representative.

Wang and Khan [132] evaluate their model on a 13-node cluster; the nodes are heterogeneous, with CPUs ranging from 8 to 12 cores and memory ranging from 6GB to 16GB. The cluster runs the Apache Spark standalone version with data stored in HDFS. The applications used to verify the performance of the prediction framework include one non-iterative text processing algorithm (WordCount), two iterative machine learning algorithms (Logistic Regression and KMeans clustering) and one graph algorithm (PageRank). In this case, we can, therefore, categorise the testbed they used as medium and the workloads used as representative.

3.5 Taxonomy and Classification

In this section, we carefully analyse the contents of the surveyed papers based on descriptive words and phrases to derive a meaningful taxonomy and classification. The section further identifies common relationships, themes, requirements and characteristics amongst the selected publications. The goal of this section is to provide a deep understanding of the focus areas of this domain and identify opportunities for further research. Figure 3.2 shows in detail the taxonomies we have identified from the surveyed literature. We will elaborate on each of the taxonomies as we progress through the rest of the section.

3.5.1 General Approach

In this section, we evaluate the general benchmarking and performance modelling techniques used in the surveyed literature.

Black-Box

In the context of benchmarking and performance modelling, a black-box approach is one in which the process of benchmarking and modelling is performed without taking into account the way the underlying framework and applications work. To be more specific, performance prediction models for applications are generated with standard machine learning tools from historical data, without considering any intrinsic performance characteristics of the framework (Apache Hadoop or Apache Spark) processing stack. From our analysis of the surveyed literature, this approach is the most popular option due to its simplicity, and it works well with less effort compared to grey-box or white-box approaches. Works such as [127], [132], [129], [123], [102], [131], [80], [10], [85], [121], [118], [107] all used a black-box approach. The main disadvantages of this approach are as follows. First, if the size of the input dataset is small, then it becomes hard to train the models accurately for the prediction task. Secondly, since this approach does not consider low-level details of the system, it does not provide the best results in terms of accuracy [123]. Finally, any major change to the underlying framework or application would also require retraining of the black-box model.

White-Box

A white-box approach in the context of benchmarking and performance modelling is one in which the performance characteristics of the underlying framework and application are holistically taken into consideration using analytic methods. In this idealised approach, every performance characteristic is included in the modelling process. In the surveyed literature, such approaches would not only model the execution time of an application, but would also add cost-based models such as I/O cost, memory cost, CPU cost and network cost [71, 72, 73]. White-box models have better prediction accuracy than black-box models because of their holistic approach. However, since such a method requires in-depth knowledge of system internals such as the operating system, the Java Virtual Machine (JVM), the computing frameworks, the workloads and the hardware stack, it becomes challenging to capture all these requirements in the cost-based model. White-box models are therefore much more challenging to build [131] than black-box and grey-box approaches.

Grey-Box

Due to the complexity of modern big data processing frameworks like Apache Spark or Apache Hadoop, it is hard to build an accurate prediction model using either a black-box or a white-box approach alone [27]. Therefore, a hybrid (grey-box) approach comprising the best aspects of the two methods is preferable. This involves using standard machine learning tools to build performance models from historical data while taking into account a moderate understanding of both framework- and application-level parameters. This trend is becoming common and is used in works such as [22], [27], [139], [56], [76].

3.5.2 Benchmarking and Profiling Approach

From the surveyed literature, we have identified two main ways of benchmarking or collecting data from the underlying framework and hardware infrastructure. Below we discuss these two approaches.

Holistic

A holistic approach to benchmarking is the process of collecting or gathering a large amount of historical data for use in the performance prediction stage. From the surveyed literature, this is by far the most common approach. It involves gathering information about configuration settings from job execution logs. This approach is used in [139], [140], [72], [128] to build models from historical data gathered under different configuration parameters. The main advantage of this approach is that the prediction models tend to perform well because the data used in the training process is large enough to avoid under-fitting. However, this approach can be costly when the configuration space is large, since it means manually running jobs with different configuration settings and collecting information about those jobs.

Figure 3.2: Taxonomy of Benchmarking and Performance Modelling of Data-Intensive Applications. The second level groups the publications by the general approach, which can be a black-box, white-box or grey-box approach, and by the communication pattern of focus. For each approach we evaluate the benchmarking, modelling and experimental approaches used.

Heuristic

A heuristic technique is any approach to problem-solving, learning, or discovery that uses a practical method which is not guaranteed to be optimal or perfect but is sufficient for the immediate goals. The holistic approach discussed above can be a daunting process when the configuration space is large. To avoid this problem, some works employ a heuristic-based approach to benchmarking and collecting data. The methods used in this approach mainly rely on statistical sampling. For example, [127] build models with no historical information and try to minimise the amount of training data required; they achieve this by running an instance of the entire job on a sample of the input dataset. Works such as [131], [132] and Chao et al. [27] all used a heuristic approach. Wang and Khan [132] reduced the overhead of collecting training data by executing a program on a cluster using a limited amount of sample input data instead of the entire dataset. In [131], the authors used random sampling in the data collection process and a sparse sampling approach to address the problem of collecting data for a large parameter space. Out of the vast space of configuration parameters, Chao et al. [27] carefully selected a balanced distribution of parameters for the training data.

3.5.3 Modelling Approach

In the following section, we will address two main approaches to performance modelling. The first one is the analytic approach, and the second one is the machine learning approach.

Analytic Approach

In the analytic approach, researchers use mathematical modelling to model various components of a job execution pipeline. This approach is widely used in both the DFWC and MR patterns. The method can be used to model the execution time of an application, and it can also be used to estimate the resource usage and requirements of a given application. For the MR pattern, Verma et al. [129], [128] used an analytic approach to model the amount of resources (map and reduce slots) required for a given job. In [72] and [71], the authors developed a set of analytic white-box models to estimate dataflow fields (e.g. map input records, total spilled records) and cost fields (e.g. read phase, map phase, collect phase) of a given job. In [130], the authors proposed a hierarchical analytic model that combines a precedence graph model and a queuing network model to capture the intra-job synchronisation constraints. These models are then used to predict the mean job response time, throughput and resource utilisation. As for the DFWC pattern, [132] used various analytic models to model the performance and the memory and CPU cost of a given job. In [102], the authors used an analytic approach in conjunction with statistical techniques like correlation analysis to model the execution time of a given Apache Spark application.
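A concrete example of this style of analytic model is the classical makespan bound that underpins ARIA-like models [128, 129]. Paraphrasing the general form (the exact per-phase formulation is given in [128]): if a stage consists of n independent tasks executed greedily on k slots, with average task duration mu_avg and maximum task duration mu_max, then the stage completion time T satisfies

\[ \frac{n \cdot \mu_{avg}}{k} \;\leq\; T \;\leq\; \frac{(n-1) \cdot \mu_{avg}}{k} + \mu_{max} \]

Applying such bounds to the map and reduce stages of a profiled job, and combining them, yields lower and upper estimates of the job completion time as a function of the allocated slots.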

Machine Learning Approach

Recent trends in performance modelling focus more on machine learning than on analytic methods. This is mostly due to the ubiquity of simple-to-use ML platforms provided by computing frameworks such as Apache Spark MLlib [96]. Works such as [139], [127], [27], [131], [22] all used machine learning approaches to predict the performance of data-intensive applications. However, there are a few exceptions, such as [80], that used a hybrid approach to merge the benefits of the analytic and ML approaches. The main advantage of ML approaches is their simplicity, and they provide more accurate and measurable results when enough data is collected. Their disadvantage is the overhead of collecting historical data to enable the learning algorithms to train the models.

3.5.4 Computing Framework

In this section, we discuss the computing frameworks on which the applications are executed. This step is important as it exposes framework-specific settings, and it can help to evaluate the interest and trajectory of the research community. Apache Hadoop [63] and Apache Spark [138] are the most commonly used data processing frameworks. It is also important to understand the version of the computing framework used, especially for Apache Hadoop. As discussed in the validation section of 3.3, there is a fundamental architectural difference between Hadoop version 1 and Hadoop version 2, and therefore models developed using one version cannot be used to predict the performance of the other. In the surveyed literature, [80], [71], [72], [129], [128], [139], [140], [10], [85] all performed their experiments using Hadoop version 1, while performance modelling works such as [7], [21], [22], [109], [121], [123], [56] use Hadoop version 2 or greater. For DFWC, all the surveyed literature used Apache Spark. Since there is not much design difference between the versions of Apache Spark, we do not worry about version differences; these version details are not even reported in the surveyed literature.

3.5.5 Experimental Design

Here we categorise the experimental design used in the surveyed works by the types of workloads used and the size of the experimental testbed. The workloads and the experimental design are critical in judging how general the conclusions of a particular piece of research in performance modelling of data-intensive applications are. The main idea is that workloads should be representative and cover the related domains as much as possible. The testbed should not be a pseudo-distributed (single-node) setup, as such a setup cannot capture communication details. We consider setups of at least five nodes, as these capture the communication characteristics of the computing frameworks more accurately.

Workload

Throughout the surveyed literature, we have seen varying workloads, which can broadly be divided into batch and iterative workloads. The surveyed works on the MapReduce pattern mostly use batch workloads. For example, [139], [140], [128], [85], [109], [10], [123], [80], [129] used simple batch workloads such as WordCount, AverageCount, Grep, TeraSort, SimpleSort, AdjacentList, InvertedIndex and Join operations. There are a few works, such as [139], [128], that include iterative workloads like KMeans and Bayesian Classification. For DFWC, we have observed a mix of workloads, since Apache Spark is a general-purpose computing engine. Surveyed works such as [127] use nine machine learning algorithms from Apache Spark's MLlib, while [132], [27], [131], [102] included iterative workloads such as PageRank, Naive Bayes, Matrix Factorisation and KMeans.

Testbed

From our analysis of the surveyed literature, we have seen testbeds ranging from a few nodes to thousands of nodes. The majority of the testbeds have between 4 and 20 nodes [131], [102], [101], [56], [121], [92], [132], [107], [27], [72]. A few have between 20 and 100 nodes [127], [139], [140], [128], [129], and one work used more than 1000 nodes [135]. From this observation, we have seen that spinning up a production-size cluster of more than 50 or 100 nodes in academic research is very rare because of the cost involved. There are a few exceptions, such as [139], [140], [128], with sizeable production-scale clusters, where the authors collaborated with industry partners such as HP Labs.

3.5.6 Objective

In this section, we present the main objectives of the surveyed literature. This helps to understand the trajectory of interest in this domain. On a basic level, we have identified three main objectives in the literature. Amongst the three objectives, the majority of the papers have an element of modelling execution time in one form or another.

Predicting Execution Time

In this context, predicting execution time means estimating the time it takes to run a data-intensive application on a computing framework like Apache Hadoop or Apache Spark. Estimating and predicting the completion time of a given application is one of the main interests we have discovered in the literature. For example, in the MR pattern, the primary goal of works such as [139], [130], [128], [129], [118] is to predict the execution time of an application given a specific configuration. In [109, 140], the authors predict the execution time of an application based on different types of data compression. For the DFWC pattern, [132] used a simulation-driven approach to predict the execution time. Similarly, [107] proposed an experimental methodology to predict the runtime of iterative algorithms.

Meeting Deadlines

Meeting deadlines is normally a critical requirement for many modern applications. Sometimes these requirements are part of Service Level Agreements that must be met by service providers. The main goal here is to make sure a data processing task is completed within a specific timeline. In the surveyed literature, there are a few notable works, such as [129], [76], [127], [92], that model application performance to meet deadlines based on robust resource allocation strategies.

Resource Utilisation and Best Configuration

The main goal of works in this domain is to find the optimal configuration settings for running a job or to estimate the resource utilisation of a given application. The amount of resources used by a particular application is crucial to its performance. In the surveyed literature, works such as [130], [128], [129], [121], [71] predict the resource requirements or the best configuration of an application to meet a specific deadline. For example, in [128], the authors predict the amount of resources (map and reduce slots) that would be required to meet a given deadline. In [131], the authors used machine learning methods to auto-tune the configuration parameters of an Apache Spark application to obtain the best performance.

3.6 Research Gap Analysis

In this section, we discuss the research gaps we have identified in the surveyed literature with respect to benchmarking and performance modelling of data-intensive applications. Comparatively, much more work has been done on the MapReduce pattern than on the DFWC pattern.

3.6.1 General Approach

As discussed in section 3.5.1, the black-box approach is the most commonly used approach to benchmarking and performance modelling of data-intensive applications. This is due to its simplicity and comparatively high success rate. However, this approach could be improved if it were combined with a mild white-box approach.

Table 3.1: Taxonomy and Classification of Benchmarking and Performance Modelling of Data-Intensive Applications. Each of the main headers has sub-headers: General Approach (Black-box, Grey-box, White-box), Benchmarking (Heuristic, Holistic, Generic), Modelling (Analytic, ML), Pattern (MR1, MR2, DFWC), Experiment (WL — R: Representative, B: Batch, I: Iterative; TB — S: Small, M: Medium, L: Large) and Objective (Time, Deadline, Conf). The papers classified in the table are [139], [140], [22], [72], [71], [128], [127], [132], [27], [129], [130], [102], [131], [80], [10], [85], [121], [56], [76], [118], [107], [114] and [79]. Using this table we were able to identify relevant research gaps.

Figure 3.3: Classification of General Benchmarking and Performance Modelling Approaches

Figure 3.3 shows a Venn diagram positioning works such as Song et al. [118], Verma et al. [128], Wang et al. [131], Herodotou [71], Verma et al. [129] and Vianna et al. [130], together with this work, within the black-box, white-box and grey-box approaches.

This combined technique is what we refer to as a grey-box approach. As shown in Table 3.1, 80% of the surveyed works used a black-box or white-box approach to performance modelling. Secondly, the grey-box approaches mentioned in the literature were mostly applied to Hadoop version 1. Therefore, there is a clear research gap in applying the grey-box approach to both Hadoop version 2 and the DFWC pattern. In this work, and as shown in the Venn diagram in Figure 3.3, we devised a novel grey-box methodology to address this gap by selecting and including key configuration parameters for both the data processing frameworks and the applications.

3.6.2 Using Representative Workloads to Effectively Profile a Computing Cluster Based on Communication Patterns

Any benchmarking and performance modelling task for data-intensive applications will not be accurate enough without representative workloads that reflect the underlying framework. As shown in Table 3.1, more than 70% of the surveyed literature, such as [56, 71, 76, 79, 107, 118, 128], focused on MR1 and did not use representative workloads based on the communication patterns of the framework they used.

Figure 3.4: Classification of Workload Groupings

Figure 3.4 is a Venn diagram positioning works such as Khan et al. [80], Chao et al. [27], Herodotou [71], Verma et al. [129], Islam et al. [76] and Popescu et al. [107], together with this work, within the Representative, Batch and Iterative workload groupings.

As shown in the Venn diagram in Figure 3.4, we argue that for a piece of work in this domain to be general, representative workloads that reflect the communication patterns must be used. For example, for the MR pattern, varying batch workloads covering different domains should be included. Similarly, for the DFWC pattern, representative workloads should include iterative applications such as ML applications. To address this research gap, we included applications based on design patterns for the MR pattern. For DFWC, we included common iterative applications.

3.6.3 Curse of Configuration Parameters

As of writing this thesis, there are more than 200 configuration parameters across the Apache Hadoop and Apache Spark platforms [22, 27]. Understanding the effects of these configuration parameters on the overall performance of data-intensive applications is crucial. Incorrectly setting key configuration parameters or leaving them at their defaults can drastically affect the performance of an application [128, 129]. From Table 3.1, we can see that there is some research on auto-tuning or finding the best configuration parameters for MR1; however, this is not the case for MR2 and DFWC, which account for less than 5% of the surveyed literature. We, therefore, identified this as a research gap. In this work, we study the effects of these configuration parameters on the performance of both the MR and DFWC patterns and provide the best configuration settings based on our novel grey-box approach.

3.6.4 Generic Benchmarking

A generic benchmarking approach assumes that the process of benchmarking or gathering data for modelling is not tied to a specific application. The main advantage of this approach is that it eliminates bias in the training data: if the profiling of a data-intensive framework is based on only a few applications, then any modelling task based on that data would most likely not work for other applications.

Figure 3.5: Classification of Benchmarking Approaches

Figure 3.5 is a Venn diagram positioning works such as Verma et al. [128], Zhang et al. [139, 140], Khan et al. [80], Chao et al. [27] and Ceesay et al. [22], together with this work, within the Holistic and Generic benchmarking approaches.

From Table 3.1, we can see that only three works [22, 139, 140] in the surveyed literature used a generic benchmarking approach. Sometimes it is hard to perform comprehensive generic benchmarking; in that case, several representative workloads and a holistic approach should be used instead. As shown in the Venn diagram in Figure 3.5, we have used a novel generic benchmarking approach for the MR pattern and a representative, holistic workload approach for the DFWC pattern.

3.7 Discussion on Current Trends

In this section, we discuss the current trends in benchmarking and performance modelling of data-intensive applications. We have grouped these trends based on our observations of the surveyed literature summarised in Table 3.1; the summaries are as follows:

• General Approach: From the surveyed literature, the black-box approach to benchmarking and modelling is much more common due to its simplicity. However, there is a growing trend of using a hybrid approach which merges the benefits of the black-box and white-box approaches, thereby achieving better performance.

• Benchmarking Approach: Currently, heuristic and holistic approaches are the most common. Heuristic approaches mostly use statistical sampling methods to collect data about a particular framework, while holistic approaches mainly depend on historical data gathered from executing previous jobs on a framework. However, both of these approaches are generally tied to a specific application. The generic approach can use either a heuristic or a holistic method, but the process is not based on any single application. This approach is much more promising because it eliminates bias in the data.

• Modelling Approach: There are two major approaches to modelling: analytic and machine learning. Machine learning approaches are much more popular than analytic approaches, due to recent advances in ML tools and their inherent simplicity in generating accurate models.

• Objectives: By far, the majority of the surveyed works include predicting the runtime or execution time of a data-intensive application. This is consistent across both patterns. Finding or tuning the best configuration parameters for optimal performance is second on the list. Meeting deadlines features in some of the works as well.

3.8 Summary

In this chapter, we have done an extensive review of existing works in benchmarking and performance modelling of data-intensive applications. We have also developed a comprehensive classification of the reviewed literature to help provide a high-level view of trends and gaps in this domain. The main contribution of this literature review is to identify research gaps in this domain; these identified gaps are addressed in the rest of this thesis. We have identified that the black-box approach is the most studied, accounting for about 80% of the surveyed works, while the grey-box approach is less explored. We have observed that the majority of surveyed works did not use representative workloads in their profiling and experimental steps. The third research gap highlights the effects of configuration parameters on performance. The final gap shows that the majority of the works surveyed use either a holistic or a heuristic approach to benchmarking but place less focus on making those approaches generic. In the next chapter, we will present the general methodology used in this thesis, cementing it with practical use cases.

CHAPTER FOUR: GENERAL METHODOLOGY AND PRACTICAL USE CASES

4.1 Introduction

In this chapter, we introduce the general methodology we have used to enable a benchmarking and performance modelling framework for data-intensive applications. Some of the major high-level questions that can be answered using this framework are: the execution time of an application, the best configuration parameters to minimise a budget or meet a particular deadline, and the best cloud instances for running an application. The main goal of the chapter is to help the reader understand the entire process and to aid reproducibility for interested researchers. Furthermore, from a user perspective, we also present a detailed case study of practical use cases showing how the framework can be used for decision making. Note that the use cases in this chapter cover a few sample applications; we expand them to showcase the full functionality in Chapter 5 and Chapter 6.

4.2 General Methodology

Due to the internal complexity of modern big data processing frameworks like Apache Spark or Apache Hadoop, it is hard to build an accurate prediction model using either a black-box or white-box approach alone [27]. Therefore, in this work, we propose a hybrid (grey-box) methodology comprising the best aspects of both approaches. To achieve this, we use standard machine learning tools to build performance models from historical data while taking into account the internal complexities of both the computing framework and the application-level parameters.

In this thesis, we grouped the benchmarking and performance modelling methodology based on the two common communication patterns: the MapReduce pattern and the Data Flow With Cycle pattern. Generally, the process used in the benchmarking and modelling of applications in these patterns is similar at a high level but different at a low level. We discuss the general approach used to model applications in these patterns here and delve deeper into the individual approaches in their respective chapters: the MapReduce pattern is covered in Chapter 5, while the Data Flow With Cycle pattern is covered in Chapter 6.

The general methodology used in both patterns is divided into four main steps, as shown in Figure 4.1. The first step entails studying the low-level details of big data frameworks, such as Spark and Hadoop, and of selected data-intensive applications, to identify candidate performance drivers. The second step of the methodology is to benchmark these big data frameworks based on the identified candidate parameters using a generic approach to avoid experimental bias. For the MapReduce pattern, we enabled a generic benchmarking framework based on its dataflow pipeline, while for the DFWC pattern we used representative ML workloads, i.e. workloads that are common and widely used in the literature. The main goal of this step is to gather benchmarking data about the selected candidate performance drivers for further analysis. The third major step of the methodology is to analyse the data generated from the previous step to identify the key performance drivers among the candidate configuration parameters. These key performance drivers are then used in the model building and decision-making process. The model building uses machine learning approaches to find the best predictors of the performance of an application in a particular pattern. The details of these steps are thoroughly covered in Chapters 5 and 6. The general description of each step of the methodology is given below.

Figure 4.1: General Methodology comprising four Major Steps. Details of these steps are discussed in the following sections; specific implementations for the MR and DFWC patterns are discussed in Chapter 5 and Chapter 6.

4.2.1 Identifying Candidate Performance Drivers

This is the first and one of the most crucial steps of the methodology used in this work. Big data frameworks such as Spark and Hadoop have many configuration parameters that can affect the performance of applications, and users of these frameworks must adjust them to achieve optimal results. To gather key candidate performance drivers, we use a combination of literature review, framework source code analysis and application complexity analysis to build a corpus of candidate performance drivers. This approach enables us to avoid selection bias by considering both application- and framework-level settings. We have divided these performance drivers into two major groups, called framework performance drivers and application performance drivers, which are discussed in the following two subsections.

Framework Performance Drivers

These are configuration parameters related to the computing framework, such as Hadoop MapReduce or Apache Spark. To get the key performance drivers of these frameworks, a deeper understanding of their internals is necessary. In the case of MapReduce, we studied and analysed the source code to understand the performance characteristics of the different phases of the two-stage MR pipeline. We then modified the source code, compiled it and deployed it on our Hadoop cluster to report metrics about each of these phases for further analysis. For example, we collect information such as the number of mappers, the number of reducers, the map selectivity and the time each phase takes to process its input data.

The MR pattern implemented using Apache Hadoop is suitable for applications that can be decomposed into its two-stage computation approach. However, the DFWC pattern implemented using Apache Spark is a general-purpose computing engine, and therefore a different approach was used to identify key configuration parameters. These configuration parameters are specified by users when an application is submitted to the cluster for processing, for example, the amount of CPU or memory a job can utilise during its execution. To gather this information, we first studied academic works aimed at Spark performance tuning [27, 59, 105, 131]. We then identified representative DFWC applications and ran them using different configuration settings. We collected metrics for each of the candidate parameters and used empirical methods to determine whether they make any significant performance difference.
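As a small illustration of the kind of framework-level metric collected in this step, the Python sketch below derives the map selectivity of a finished job from its counters. The counter names and values are hypothetical placeholders, not the exact counters reported by our instrumented Hadoop build.

def map_selectivity(counters):
    """Ratio of map output bytes to map input bytes for a finished job."""
    return counters["MAP_OUTPUT_BYTES"] / counters["MAP_INPUT_BYTES"]

# Hypothetical metrics reported for one MapReduce job
job_counters = {
    "MAP_INPUT_BYTES": 8_589_934_592,   # 8 GB read by the mappers
    "MAP_OUTPUT_BYTES": 2_147_483_648,  # 2 GB emitted by the mappers
    "NUM_MAPPERS": 64,
    "NUM_REDUCERS": 8,
}

print(f"map selectivity = {map_selectivity(job_counters):.2f}")  # 0.25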

Application Performance Drivers

These are the application-level configuration parameters. They include properties such as the size of the data being processed, the number of observations, the number of features and the number of iterations. To derive these parameters, we analyse the implementation of each application and its computational complexity.

4.2.2 Benchmarking

Benchmarking is the second step of our methodology; it is the process of stress testing a system to gain performance insights. This process can be costly and time consuming if not properly designed. Efficiently collecting sufficient data about all the possible combinations of configuration parameters of big data frameworks is a huge challenge. To overcome this challenge, we developed a generic and reproducible benchmarking framework for both the MR and DFWC patterns. In both patterns, we extended the big data benchmarking suite HiBench [74] to automate the collection of all the custom metrics we are interested in exploring. In MR, we used a generic benchmarking approach to avoid focusing on specific applications only. In DFWC, we used the same approach but concentrated mainly on machine learning applications because their iterative nature fits the DFWC pattern. Details of these benchmarking procedures are covered in the respective chapters.

4.2.3 Model Building

At this point, we have generated sufficient data about the key configuration parameters from the benchmarking step. This data is used to build the models employed in the performance modelling step. The model building step uses several standard machine learning approaches to train, test and validate the produced models. Model building, validation and testing for both MR and DFWC are discussed in detail in Sections 5.7 and 6.6, respectively.
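The Python sketch below illustrates, with hypothetical column names and a hypothetical benchmark results file, the kind of train-and-validate loop this step performs: several candidate regressors are compared by cross-validation and the most accurate one is kept. The actual features, models and validation procedure used in this thesis are described in Sections 5.7 and 6.6.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Benchmark results gathered in the previous step (hypothetical CSV layout)
data = pd.read_csv("benchmark_runs.csv")
X = data[["data_size_gb", "num_executors", "executor_cores", "executor_memory_gb"]]
y = data["execution_time_s"]

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Compare candidates by mean absolute error under 5-fold cross-validation
scores = {
    name: -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    for name, model in candidates.items()
}
best_name = min(scores, key=scores.get)
print(scores, "-> selected model:", best_name)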

4.2.4 Prediction and Decision Making

Prediction and decision making is the last step of our methodology for benchmarking and modelling data-intensive applications. At this stage, we have produced parametric models that can be used to answer performance-related questions about an application, such as its execution time, or to infer the best configuration parameters for processing X GB of data for a given application.

4.3 Use Cases

In this section, we demonstrate a step-by-step procedure of how the performance modelling framework can be used to address performance-related questions about an application. In this use case demonstration, we use a few applications to show how users can answer various questions; the procedure is the same for any data-intensive application. More specific and in-depth use cases are discussed in Chapter 5 and Chapter 6.

4.3.1 A Recap of the Problem

First of all, let us imagine that a user wants to run a data-intensive application on a big data cluster deployed on-premises or in the cloud. This cluster may be running Apache Spark or Apache Hadoop. As shown in Figure 4.2, major performance questions would be: how long will it take to run an application? What effect would increasing input parameters, such as the data size, have on performance?

Figure 4.2: Execution Time vs Scalability: What would be the effect on execution time as input data size is increased

What is the best combination of configuration settings to achieve optimal performance? These questions are essential to answer because they would affect the monetary and operational cost. If the cluster is deployed on the cloud, then minimising the financial cost of operation is a crucial objective.

4.3.2 Use Case 1: Predicting Execution Time MR Pattern

For the MapReduce pattern, we enabled a web-based interface that can be used to predict the execution time of an application. The web interface can also be used to predict the execution time of each of the six stages of the MapReduce pipeline. A user can select the application to execute on a cluster and provide the amount of data expected to be processed using the interface. Parameter values such as the input data size, the map selectivity and the number of reducers can be modified to generate a new prediction. The number of mappers is inferred from the input data, and therefore there is no need to provide a value for it. The values for the shuffled data, map time and reduce time parameters are also calculated from previous runs. Figure 4.3 shows an implementation of the decision-making module for the MR pattern. The web-based interface covers fifteen (15) applications, including Min Max Count, Inverted Index, Average Count, Median & Standard Deviation, Distributed Grep, Top N, Distinct, Total Order Sorting and the major types of Join. A short description of each application is also displayed in the interface. Furthermore, the modelling procedure and details about all the applications are discussed in Chapter 5.

Figure 4.3: A web based interface to predict the execution time of MR applications. This interface is available at https://sc306.host.cs.st-andrews.ac.uk/cbm/

4.3.3 Use Case 2: Predicting Execution Time DFWC Pattern

Resource allocation is an essential aspect of executing an Apache Spark job. If not configured correctly, a Spark job can create bottlenecks by consuming the entire cluster's resources and starving other applications. An essential task for users running any Spark application is therefore to understand how to configure the number of executors, the memory settings of each executor and the number of cores for a Spark job to obtain optimal performance. In this use case, we are interested in showing how the performance modelling framework can be used to predict the execution time of selected applications given the resources allocated at the time of job submission. The prediction step is the last step of the methodology shown in Figure 4.1. For simplicity and reproducibility, we provide a REST API that can be used to answer these questions. The following key configuration parameters of the Apache Spark framework should be supplied as model parameters through the REST API.

• Data Size: This is the amount of data, in GB, to process. The size of the data is proportional to the number of features and observations; the larger the data size, the more observations and features the dataset contains.

• Executor Memory: This parameter specifies how much memory should be assigned to each executor. Since Apache Spark is an in-memory computing framework, in most cases the more memory, the better the performance. However, there are exceptions where more memory does not result in better performance: running executors with too much unnecessary memory can cause excessive garbage collection delays, thereby affecting performance negatively.

• Number of Executors: This parameter specifies the total number of executor JVM processes that the application master can launch in the cluster for an application. Executors are analogous to containers because they represent a logical computation unit assigned resources such as memory and cores. An executor runs tasks and keeps data in memory or on disk across them. Each application has its own executors; a single node can run multiple executors, and the executors for an application can span multiple worker nodes. An executor stays up for the duration of the Spark application and runs its tasks in multiple threads.

• Executor Cores: Generally, a core is the basic computation unit of a CPU, and a CPU may have one or more cores to perform tasks at a given time; the more cores we have, the more work we can do. In Apache Spark, this parameter controls the number of parallel tasks an executor can run. There is a catch, however: although the number of cores determines the number of concurrent tasks an executor can run, more concurrent tasks per executor do not always give better performance. During our experiments, we observed that performance suffers when an executor runs more than five concurrent tasks.

• App: This parameter specifies which application is sent to the REST API for prediction. The options available are as follows: RF represents Random Forest, KMEANS represents the k-means algorithm, SVM represents the support vector machine, LR represents Logistic Regression, BAYES represents Naive Bayes, and LINEAR represents linear regression.

• The API: There are several API endpoints hosted on the same server. The API endpoint for the best configuration settings is located at https://sc306.host.cs.st-andrews.ac.uk/best/, and the API endpoint for execution time prediction is located at https://sc306.host.cs.st-andrews.ac.uk/dfwc/. Clients can use any REST client, such as curl or Postman, to query the backend. A clear example of how to use the API is illustrated in Listing 4.1. Each of these key configuration parameters must be assigned a value; otherwise, the default value is used to predict the performance of an application. Listing 4.1 and Listing 4.2 show how to predict the execution time of k-means using the REST API.

curl \
  -X POST \
  -d '{"DataSizeGB":5, "NumEx":16, "ExCore":4, "ExMem":8, "LevelPar":16, "App":"KMEANS"}' \
  -H 'Content-Type: application/json' \
  https://sc306.host.cs.st-andrews.ac.uk/dfwc/
# Execution Time : 79.2131 seconds

Listing 4.1: Running a K-Means prediction using the REST API. Example 1, where the size of the data is 5GB.

curl \
  -X POST \
  -d '{"DataSizeGB":15, "NumEx":16, "ExCore":4, "ExMem":8, "LevelPar":16, "App":"KMEANS"}' \
  -H 'Content-Type: application/json' \
  https://sc306.host.cs.st-andrews.ac.uk/dfwc/
# Execution Time : 154.9325 seconds

Listing 4.2: Running a K-Means prediction using the REST API. Example 2, where the size of the data is scaled to 15GB.

Comparing Listing 4.1 and Listing 4.2, we can see that the data size parameter ("DataSizeGB") is scaled from 5GB to 15GB while all the other variables are kept constant. As expected, the predicted execution time for 15GB is greater than that for 5GB. In this section, the goal is to demonstrate how to use the REST API from the user's perspective. More details about the analysis of the results are covered in Chapter 5 and Chapter 6.
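The DFWC model parameters used above correspond to standard Apache Spark submission properties. The sketch below is purely illustrative and is not part of the prediction framework itself; in particular, mapping LevelPar to spark.default.parallelism is an assumption made here for illustration. The values mirror those sent in Listing 4.1.

import org.apache.spark.SparkConf;

public class KMeansSubmissionSketch {
    public static void main(String[] args) {
        // Illustrative mapping of the REST API model parameters to Spark properties.
        SparkConf conf = new SparkConf()
                .setAppName("KMEANS")
                .set("spark.executor.instances", "16")    // NumEx
                .set("spark.executor.cores", "4")         // ExCore
                .set("spark.executor.memory", "8g")       // ExMem
                .set("spark.default.parallelism", "16");  // LevelPar (assumed mapping)
        // The 5GB input (DataSizeGB) would be supplied as the dataset path at submission time.
        System.out.println(conf.toDebugString());
    }
}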

4.3.4 Use Case 3: Inferring Best Configuration Parameters

The best configuration is a set of parameter values that can be used to execute a job optimally while meeting certain objectives, such as deadline or time constraints. In this section, we demonstrate how to use the framework to get the best configuration parameters for an application. The API endpoint is at https://sc306.host.cs.st-andrews.ac.uk/best/, and the DataSizeGB and App parameters must be supplied. Listing 4.3 and Listing 4.4 show the best configuration settings for running Linear Regression and Logistic Regression on 12GB of input data.

curl \
  -X POST \
  -d '{"DataSizeGB":12, "App":"LINEAR"}' \
  -H 'Content-Type: application/json' \
  https://sc306.host.cs.st-andrews.ac.uk/best/
# This will output the results below
# {
#   "Best Config": [{"DataSize":12,
#                    "Numex":16,
#                    "ExMem":2,
#                    "ExCore":2,
#                    "LevelPar":8,
#                    "Predictions":48.69
#                  }]
# }

Listing 4.3: Get the best configuration settings for running Linear Regression on 12GB of data

curl \
  -X POST \
  -d '{"DataSizeGB":12, "App":"LR"}' \
  -H 'Content-Type: application/json' \
  https://sc306.host.cs.st-andrews.ac.uk/best/
# This will output the results below
# {
#   "Best Config": [{"DataSize":12,
#                    "Numex":8,
#                    "ExMem":6,
#                    "ExCore":4,
#                    "LevelPar":8,
#                    "Predictions":661.56}]
# }

Listing 4.4: Get the best settings for running Logistic Regression on 12 GB data

4.4 The Generalisation of Our Approach

In assessing the generality of our approach, we use the following steps to achieve this objective.

1. For the MapReduce pattern, to avoid benchmarking bias, we use configurable generic workloads that emulate processing different amounts of data in each execution step. This eliminates bias towards a particular workload or set of workloads.

2. For the DFWC pattern, we used a different approach due to its internal complexity. We used representative workloads, such as ML applications, to benchmark the computing framework.

3. After successful generic benchmarking, we use that data to build performance models that are general enough to predict the performance of data-intensive applications. This is an important step towards simplifying the process of building models that can give users accurate performance estimates of their applications. See Section 4.3 for the models in action.

4.5 Summary

In summary, we have presented the general methodology of how we benchmark and model the performance of data-intensive applications. The methodology is divided into four significant steps. The first step deals with the identification of key configuration parameters for both the computing framework and the applications. The second step deals with the generic benchmarking approach. The third step uses information from the previous step to build performance models. The final step is the decision-making step, which we have demonstrated in the form of use cases. Going further, we will cover these four steps in detail for the MR and DFWC patterns. In the next two chapters, we will expand the general methodology discussed in this chapter and apply the specifics to the two most common dataflow communication patterns. We will start with the MR pattern, discussed in Chapter 5, and then move to the DFWC pattern, presented in Chapter 6. These two chapters are self-contained; we present their respective methodologies for benchmarking, performance modelling, experiments and validation of results.

CHAPTER FIVE

BENCHMARKING AND MODELLING MAPREDUCE PATTERN

5.1 Introduction

In this chapter, we discuss how we address benchmarking and performance modelling of the MapReduce (MR) communication pattern. The first goal is to understand the complexity of the low-level internals of Apache Hadoop. To address these challenges, we studied the internals of the MapReduce communication pattern to design a generic benchmarking component. The second goal is to use this generic benchmarking component, together with key performance configuration parameters, to develop a generalised prediction framework. The third goal is to validate our approach by running empirical experiments on two internal cluster setups. On average, we have seen that the error rate of our prediction models in both configurations is within ±10% of the measured values. Note that in this thesis, the prediction error is a measure of the absolute difference between the predicted and observed values; this should not be confused with RMSE-based errors, which are techniques used in assessing model accuracy. The rest of this chapter is organised as follows: we give a brief background in Section 5.2; in Section 5.3 we discuss the details of our theoretical parametric performance model for the different phases of the MapReduce execution pipeline; in Section 5.5, we highlight the details of our approach to profiling a Hadoop MapReduce cluster by executing generic workloads; in Section 5.7, we show how to generate the parameters that complete the model by performing linear regression and cross-validation; and the experiment and evaluation of results are presented in Section 5.8.


5.2 Problem Background

The recent boom of data, coined big data, has led to several challenges. Traditional data storage and processing systems such as Relational Database Management Systems (RDBMS) are, by design, inefficient and too rigid to store and handle such amounts of rapidly changing data. Over the years, researchers have developed big data processing frameworks and storage systems to handle these challenges. These new systems run on clusters of machines in a distributed and coordinated manner. Like any new system, it is crucial to model and understand their various performance characteristics. This can be a tedious and challenging problem because, under the hood, big data applications run on frameworks comprising a complex pipeline of phases with many tunable configuration parameters. For example, applications deployed using the MapReduce [41] and Apache Spark [138] frameworks go through several computation and communication stages while being processed. Considering this challenge, we argue that a phase-by-phase modelling approach can give a better understanding of the performance drivers of such frameworks. Some of the previous works in this area focus only on high-level performance drivers like data size [56, 71], and their approach may work well for some applications.

Figure 5.1: Execution time for varying datasets for MapReduce Sort programs. The two sorting plots have a more linear relationship between input data and execution time because input data is equal to output data.

For example, from the sort operation plots shown in Figure 5.1, we can see a clear linear relationship between the data size and the time taken to complete the execution, where the input data size is equal to the output data size.

Figure 5.2: Execution time for varying datasets for MapReduce WordCount Programs. The two WordCount workloads have a more complex relationship between input data and execution time. We would need more details of the execution framework to understand this behaviour.

However, as evident in Figure 5.2, the relationship between processing time and input data size is non-linear, and therefore further investigation is needed before drawing conclusions. Here, we dig further into the low-level details and configuration parameters of the framework to gain an understanding of the performance dynamics. This work can serve as a foundation for performance modelling of big data frameworks like Apache Spark, a general-purpose computation engine that does a super-set of what MapReduce does. The findings could be useful in size-based schedulers to run small jobs even when the cluster is loaded with long-running and expensive jobs.

5.3 Theoretical Models of The MapReduce Pattern

We will start by first exploring the formulae for estimating parameters; these equations should not be confused with the regression functions for learning model parameters discussed in Section 5.7. Our grey-box approach to benchmarking and performance modelling uses a hybrid method, combining a black-box with a mild white-box approach. Our goal in the white-box part is to include performance metrics of the underlying computing framework. In this section, we present the lessons learnt from an in-depth exploration of the MapReduce source code and its internals. We present a detailed explanation for each of the six generic phases of the pattern and propose hypothesised cost models. For consistency and ease of presentation of the proposed models, we will use β_{0i} as the intercept for all equations (where i represents one of the phases of the MR pattern) and β_n, where n indexes each of the consecutive performance variables.

Figure 5.3: A MapReduce Workflow For Each Task [136]. The blue blocks are the different phases, while the brown blocks are the data in transition from one stage to another.

For a given MapReduce job j executed on a given cluster c, the following operations make up the total execution time: UD, the time taken to process the user-defined phases such as the map and reduce phases; FD, the time taken to process the framework-defined phases such as spill, collect, merge and shuffle; and finally any other operational overheads such as scheduling, s. Therefore, for task t in job j running on cluster c, we can use Equation (5.1) to approximate the running time.

T_{t,j,c} \triangleq \sum_{i=1}^{FD} T_{FD_i} + \sum_{i=1}^{UD} T_{UD_i} + s \qquad (5.1)

The relationship in Equation (5.1) represents the cost model of an individual task. However, in most cases, MapReduce spawns multiple tasks which run in parallel, with each task having a map and an optional reduce phase. These phases contain user-defined and framework-defined implementations. Equation (5.1) can be modified and rearranged to fit these two main phases of the framework. The detailed process of each of these two phases and their corresponding cost models are explained in the following sections.

5.3.1 Map Phase

As shown in Figure 5.3, the Map Phase consists of the following sub-phases: Read, Map, Collect, Spill and Merge. In the following sections we present theoretical models for each of these phases, where the dependent variable is named T_{phaseName} and the explanatory variables are preceded by their corresponding β coefficients.

5.3.2 Read

This phase reads a configurable input split or data block from the Hadoop Distributed File System (HDFS). Each record in the input split is then sent to a customised map() function for processing. The cost of the read phase depends on the size of the input split. The default input split size at the time of writing is 128MB. Given a specific cluster, we can define the cost model of the read phase as a linear function of the size of the input split:

T_{read} \triangleq f(d) \triangleq \beta_{0a} + \beta_1 d + \epsilon \qquad (5.2)

where d represents the size of the data and \beta_{0a} and \beta_1 are the unknown parameters (intercept and slope) of the linear function. For a given MapReduce job, the time taken to read each 128MB block of data is usually the same. The dependent variable we are trying to model is the time for each read operation, represented by T_{read}.

5.3.3 Custom Map

This is the second sub-phase of the Map Phase. The map function contains user-defined code that specifies how to process the records consumed from the input split. For each record in the input split read from HDFS, the framework invokes the map function to process it. As a key-value centric system, the framework treats the byte offset of the line as the key and the line content as the value. Therefore, the cost of the custom map phase is the time it takes to apply the custom function to the consumed data. Again, our hypothesis is that the time taken to process the entire phase is a function of the input dataset. This phase also has an additional overhead for setting up and cleaning up the task. Since this phase is user-defined, we devised a way to approximate its time, as shown below.

T_{map} \triangleq \frac{\sum_{i=1}^{n} T_{rec}}{M} \Big/ N_c \qquad (5.3)

Where n is the number of records in the input split fed to the map function, Trec is the time taken to process each record, M is the number of mappers and Nc is the number of containers. Moving forward, the amount of data that passes through each stage is crucial to the accuracy of the model, therefore the data emitted by the map phase should also be calculated as shown in Equation (5.4).

M_{sel} is the ratio of the map output to the map input for each task.

M_d \triangleq d \cdot M_{sel} \qquad (5.4)

MapReduce is a parallel processing framework; this means that a job j is broken down into several tasks t during execution. Depending on the size of the data, the framework may spawn several waves of tasks during a job's execution life cycle. Expanding on Equation (5.4), for each task t, M_d is the size of the map output data that will progress to the rest of the phases, d is the total input data for task t, and M_{sel} is the map selectivity ratio.
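As a purely illustrative example of Equation (5.4) (the numbers below are assumptions, not measurements from our experiments), consider a task whose input split is d = 128MB with a map selectivity of M_{sel} = 0.5:

M_d = d \cdot M_{sel} = 128\,\text{MB} \times 0.5 = 64\,\text{MB}

so only half of the consumed data is passed on to the collect, spill and merge phases.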

5.3.4 Collect

This is the next stage after the custom map phase. The output of each map task is not written directly to disk and consumed by the reducers; instead, it is buffered and presorted in memory. This phase also depends on the amount of data emitted by the mappers. The cost model of the collect phase is given below:

T_{collect} \triangleq f(M_d) \triangleq \beta_{0b} + \beta_3 \, d \cdot M_{sel} + \epsilon \qquad (5.5)

In Equation (5.5), the dependent variable is T_{collect} and the explanatory variable is M_d, which comprises d and M_{sel}.

5.3.5 Spill

In this phase, map output data is partitioned, sorted in memory and written to local disks. Writing data to local disk is the main bottleneck. The more data to spill, the more time it will take. We can, therefore, represent the cost model using the linear relationship shown below.

T_{spill} \triangleq f(M_d) \triangleq \beta_{0c} + \beta_4 \, d \cdot M_{sel} + \epsilon \qquad (5.6)

In Equation (5.6), we model T_{spill} as a function of the amount of data that the phase spills to disk for the next phase.

5.3.6 Merge

This is the last sub-phase of the Map Phase; here we model the dependent variable T_{merge} as a function of the amount of data it merges for the reduce phase. Each time the collect buffer reaches its configurable threshold, a new spill file is created. By default, 10 spill files are merged at once, and the individual spill files are deleted after the merge operations. Similar to the spill phase, the cost depends on the amount of data to merge-sort and write back to disk. From the source code of MapReduce, this phase uses the merge-sort algorithm, which has an average complexity of n log n. Hence, the relationship to model this phase is:

T_{merge} \triangleq f(M_d \log M_d) \triangleq \beta_{0d} + \beta_5 M_d \log(M_d) + \epsilon \qquad (5.7)

Note: for simplicity, we write M_d instead of its base parameters (d and M_{sel}) in Equation (5.7).

5.3.7 Reduce Phase

As shown in Figure 5.3, Reduce Phase of each task consists of the following sub-phases: Shuffle, Reduce and Write. The cost analysis of each of these three phases will be discussed as follows.

5.3.8 Shuffle

This phase is one of the most involved and expensive phases in MapReduce. It comprises several tightly intertwined steps; therefore, we decided to model it as a whole. After the merge phase, records with the same keys are copied over the network to the reduce nodes for further processing. When all map outputs have been copied, the data are merged into larger files, while maintaining their sort order, to be consumed by the next stage. To model this phase, we calculated the data that are shuffled through to each reducer. The total data processed by each reducer can be represented as:

S_d \triangleq \frac{d \cdot M_{sel} \cdot M_t}{R_t} \qquad (5.8)

Where Mt is the total number of mappers, and Rt > 0 is the total number of reducers. Using Equation (5.8), the cost model of the shuffle phase can be formulated as:

T_{shuffle} \triangleq \beta_{0e} + \beta_6 \left( \frac{d \cdot M_{sel} \cdot M_t}{R_t} \right) + \beta_7 M_t + \epsilon \qquad (5.9)
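To make Equation (5.8) concrete, suppose (purely for illustration; these are not measured values) that each map task reads d = 128MB with M_{sel} = 0.5, and the job runs M_t = 100 map tasks and R_t = 10 reducers. Each reducer then processes:

S_d = \frac{128\,\text{MB} \times 0.5 \times 100}{10} = 640\,\text{MB}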

5.3.9 Custom Reduce

Similar to the user-defined map function, the reduce phase has custom user-defined code that processes the shuffled data obtained from the map phase. For each key in a data partition, the reduce function is executed. We use the formula below to approximate the time it takes for the reduce phase to complete.

T_{reduce} \triangleq \frac{\sum_{i=1}^{n} T_{key}}{R_t} \Big/ N_c \qquad (5.10)

Where Nc is the total number of containers.

5.3.10 Write

This is the last phase of the MapReduce pipeline. The output of the custom reduce function is collected and written to HDFS. To model this phase, we first modelled the total data that a reduce task emits, i.e.,

R_d \triangleq S_d \cdot R_{sel} \qquad (5.11)

where S_d (Equation (5.8)) is the total shuffle data fed to a reduce task and R_{sel} is the ratio of reduce output to reduce input. Using the relation in Equation (5.11), we can therefore define the cost model of the write phase as a linear function of the reduce output data size, which is:

T_{write} \triangleq f(R_d) \triangleq \beta_{0f} + \beta_8 R_d + \epsilon \qquad (5.12)

5.3.11 Combining it All Together

Now that we have proposed cost models for the individual phases of the MapReduce pipeline, we can put them together for the two main phases:

T_{mt} \triangleq T_{read} + T_{map} + T_{collect} + T_{spill} + T_{merge} \qquad (5.13)

Substituting the phases with their corresponding linear cost models, and collecting the constants into \beta_{0m} for the map phase and \beta_{0r} for the reduce phase, we have:

T_{mt} \triangleq \beta_{0m} + \beta_1 d + \beta_2 d + \beta_3 M_d + \beta_4 M_d + \beta_5 M_d \log M_d + T_{map} \qquad (5.14)

Similar to the Map Phase, we can combine all the initial formulae for the sub-phases of the reduce phase, i.e.,

T_{rt} \triangleq T_{shuffle} + T_{reduce} + T_{write} \qquad (5.15)

Putting the cost models together, we have:

T_{rt} \triangleq \beta_{0r} + \beta_6 S_d + \beta_7 S_d + \beta_8 R_d + T_{reduce} \qquad (5.16)

5.3.12 Cost Model For The Entire Process

The models presented in the previous section represent a single task. MapReduce runs several tasks in parallel. Depending on the system resources, all tasks may run in one round or, in most cases, there will be several rounds of tasks before the entire job finishes. Before YARN, map and reduce slots were used to determine the number of tasks that could run concurrently. This approach does not fully utilise the cluster resources, and previous performance models using this approach cannot be applied to YARN-based MapReduce applications. YARN uses containers for task execution, and the number of containers in a cluster determines the number of tasks that can run concurrently. Assuming that a cluster has N_c = 20 containers and job j has a total of M_t = 100 map tasks, there will be at least five rounds of the map phase, with each round running 20 tasks concurrently. Similarly, with 20 containers and R_t = 20 reduce tasks, all the reduce tasks may run simultaneously in one round. Using this logic, we can modify both the map and reduce formulae as follows:

T_{mt} = \frac{M_t}{N_c} \left[ T_{read} + T_{map} + T_{collect} + T_{spill} + T_{merge} \right]
       = \frac{M_t}{N_c} \left[ \beta_{0m} + \beta_1 d + \beta_2 d + \beta_3 M_d + \beta_4 M_d + \beta_5 M_d \log M_d + T_{map} \right] \qquad (5.17)

Similarly, the same can be done for the reduce phase which is:

T_{rt} = \frac{R_t}{N_c} \left[ T_{shuffle} + T_{reduce} + T_{write} \right]
       = \frac{R_t}{N_c} \left[ \beta_{0r} + \beta_6 S_d + \beta_7 S_d + \beta_8 R_d + T_{reduce} \right] \qquad (5.18)
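Using the illustrative numbers above (20 containers, 100 map tasks and 20 reduce tasks), the wave multipliers in Equations (5.17) and (5.18) evaluate to:

\frac{M_t}{N_c} = \frac{100}{20} = 5 \quad \text{(five waves of map tasks)}, \qquad \frac{R_t}{N_c} = \frac{20}{20} = 1 \quad \text{(one wave of reduce tasks)}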

The final cost model of the entire job can be obtained by merging Equation (5.17), Equation (5.18) and the custom phases (Equation (5.3) and Equation (5.10)), and then simplifying further by grouping like terms, which produces:

T_{job} = \left[ \frac{M_t}{N_c} [T_{mt}] + \frac{R_t}{N_c} [T_{rt}] \right] + \epsilon
        = \frac{M_t}{N_c} \left[ \beta_{0m} + \beta_1 d + \beta_2 d + \beta_3 M_d + \beta_4 M_d + \beta_5 M_d \log M_d \right] + \frac{R_t}{N_c} \left[ \beta_{0r} + \beta_6 S_d + \beta_7 S_d + \beta_8 R_d \right] + \epsilon \qquad (5.19)
        = \frac{M_t}{N_c} \left[ \beta_{0m} + \beta_x d + \beta_y M_d + \beta_5 M_d \log M_d + T_{map} \right] + \frac{R_t}{N_c} \left[ \beta_{0r} + \beta_z S_d + \beta_8 R_d + T_{reduce} \right] + \epsilon

5.4 Profiling and Modelling Methodology

As shown in Figure 5.4, there are three main steps involved in the generic benchmarking and modelling process. First, we modified the MapReduce 3.0 source code (https://github.com/apache/hadoop/tree/trunk/hadoop-mapreduce-project) and added the necessary code to obtain the running times of each stage. Since MapReduce uses counters to present task- and job-related statistics to the user, we added six new counters to collect metrics for the six generic phases. The details of how we added these counters are discussed in Subsection 5.4.1.

Figure 5.4: A Phase Profiling and Modelling Methodology

5.4.1 Adding Custom Framework Hadoop MapReduce Counters

Hadoop MapReduce uses counters to report both framework-level and job- or application-level metrics. However, by default, the framework does not output execution time information for all the phases we are interested in modelling; instead, it provides the total time it takes to run a particular job. Since these counters are not application-level counters, we had to modify the Hadoop MapReduce source code and compile it to enable the framework to report the new counters. These counters are summarised in Table 5.1.

Table 5.1: Custom MapReduce Framework Counters: these are not the same as MR application counters.

Counter              Description
READ_MILLIS          Returns the time it takes to read a block of data from HDFS.
COLLECT_MILLIS       Returns the time it takes to run the collect phase of the MapReduce pipeline for each task.
MERGE_MILLIS         Returns the time it takes to run the merge phase of the MapReduce pipeline for each task.
SPILL_MILLIS         Returns the time it takes to run the spill phase of the MapReduce pipeline for each task.
SHUFFLE_MILLIS       Returns the time it takes to run the shuffle phase of the MapReduce pipeline for each task.
WRITE_MILLIS         Returns the time it takes to run the write phase of the MapReduce pipeline for each task.
MAP_TIME_MILLIS      Returns the time it takes to run the map function for each task.
REDUCE_TIME_MILLIS   Returns the time it takes to run the reduce function for each task.

To have these custom counters in place, we modified the MapperTask and ReducerTask classes of the Hadoop MapReduce project. A segment of the modification we made is shown in Listing 5.1; a similar approach is used to implement all the other custom counters.

public static class MapOutputBuffer
    implements MapOutputCollector, IndexedSortable {

  private Counters.Counter spillPhaseMillis;
  private Counters.Counter mergePhaseMillis;
  private Counters.Counter collectPhaseMillis;

  // Counter initialisation (performed in the buffer's init method):
  spillPhaseMillis = reporter.getCounter(TaskCounter.SPILL_MILLIS);

  private void spillSingleRecord(...) {
    long startTime = System.currentTimeMillis();
    // Rest of the original implementation
    long endTime = System.currentTimeMillis();
    long elapsedTime = endTime - startTime;
    if (elapsedTime >= 1) {
      spillPhaseMillis.increment(elapsedTime);
    }
  }
}

Listing 5.1: Adding Custom Counters To The MapReduce Framework
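Listing 5.1 references TaskCounter.SPILL_MILLIS, which implies that the new counter names were also registered with the framework's task counters. The enum below is a hedged sketch of one plausible way to do this (the exact change made to the Hadoop source is not reproduced in this thesis); only the constants from Table 5.1 are taken from our work, and the existing built-in constants are truncated.

// Sketch only: registering the custom counters alongside the built-in task counters.
public enum TaskCounter {
    MAP_INPUT_RECORDS,
    MAP_OUTPUT_RECORDS,
    // ... remaining built-in counters elided ...
    READ_MILLIS,          // time to read a block from HDFS
    COLLECT_MILLIS,       // time spent in the collect phase
    SPILL_MILLIS,         // time spent in the spill phase
    MERGE_MILLIS,         // time spent in the merge phase
    SHUFFLE_MILLIS,       // time spent in the shuffle phase
    WRITE_MILLIS,         // time spent in the write phase
    MAP_TIME_MILLIS,      // time spent in the user map() function
    REDUCE_TIME_MILLIS    // time spent in the user reduce() function
}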

5.5 Phase Profiling Implementation

In this section, we introduce the detailed steps we used to profile the Hadoop MapReduce framework. The outcome of this methodology enables us to train and build models to predict the performance of applications. The profiling process is inspired by [139], but differs in terms of the underlying system, the workloads used and the experimental approach. The core idea of our approach is to profile each phase of the MapReduce pipeline by executing generic MapReduce workloads. A generic MapReduce workload is a workload that does not have a specific implementation in the map() or reduce() functions; the goal is to avoid bias towards specific workloads or domains. It also helps to understand how the different phases behave with different sizes of input data, and the results can be used to draw conclusions about the behaviour of each phase. The design of the benchmarking process is parameterised as follows:

BM_i = (D_i, M_{sel_i}, B_{size_i})

Each generic benchmark has the following parameters: Di is the size of the input dataset,

M_{sel} represents the map selectivity and B_{size} represents the block size. For each profiling step, we vary D_i to read data sizes ranging from 500MB to 5GB with an interval of at most 500MB.

This parameter mainly profiles the read phase. M_{sel} represents the map selectivity, which is the ratio of map output data to map input data; this parameter affects the collect, spill, merge and shuffle phases. We also vary B_{size} using 64MB and 128MB. Therefore, for a given input data size of 5GB, we execute the benchmark 20 times (10 × 2, where 10 is the number of values of M_{sel} (10%–100%) and 2 is the number of options for B_{size} (64MB and 128MB)). We will now discuss the implementation of phase profiling in detail.

5.5.1 Generic Mapper Implementation

Based on the map selectivity M_{sel}, the generic mapper implementation simply writes out whatever data is read from the Hadoop Distributed File System. So if M_{sel} is set to 90%, the generic mapper will output 90% of the input data. By default, there is no configuration setting to modify the map selectivity behaviour of the Hadoop MapReduce framework. However, a programmer can override the run() method of the Mapper class to apply a custom map selectivity setting; this is the approach we have taken to configure each benchmark with a different map selectivity. Listing 5.2 shows the Mapper class implementation and a placeholder for the run() method that implements the map selectivity. On line 3, the map function emits the value read from the input data.

1 public static class GenericMapper extends Mapper<Object, Text, Text, Text> {
2     public void map(Object key, Text value, Context context) throws Exception {
3         context.write(value, new Text(""));
4     }
5     @Override
6     public void run(Context context) throws IOException, InterruptedException {
7         // Implementation is provided in Listing 5.3
8     }
9 }

Listing 5.2: Generic Mapper Implementation

5.5.2 Map Selectivity Implementation

For each input data size, we vary M_{sel} from 10% to 100% using a 10% interval. In Listing 5.3, we initialise a custom configurable parameter mapSelectivity on line 8. We use this value to calculate the number of rows to read from the data generated by Teragen. From lines 14 to 19, we call the map() function as long as the number of records processed is less than or equal to the map selectivity threshold.

1  @Override
2  public void run(Context context) throws IOException, InterruptedException {
3      long startTime = System.currentTimeMillis();
4      setup(context);
5      Configuration conf = context.getConfiguration();
6
7      int splitSize = Integer.parseInt(conf.get("splitsize"));
8      double mapSelectivity = Double.parseDouble(conf.get("selectivity"));
9      long totalRows = Utility.getNumberOfRows(splitSize);
10     long rows = 0;
11     // LOG.info("Total Records in input file " + totalRows);
12     int threshold = (int) (mapSelectivity * totalRows);
13     LOG.info("Total Records to process " + threshold);
14     while (context.nextKeyValue()) {
15         // LOG.info("Rows processed so far " + rows);
16         if (rows++ <= threshold) {
17             map(context.getCurrentKey(), context.getCurrentValue(), context);
18         }
19     }
20     cleanup(context);
21     long endTime = System.currentTimeMillis();
22     long totalTime = endTime - startTime;
23     context.getCounter(MAP_RED_CUSTOM.MAP_TIME_MILLIS).increment(totalTime);
24 }

Listing 5.3: Map Selectivity Implementation

A prerequisite step in this process is to use the Teragen [103] program provided by the MapReduce framework to generate the dataset. Each row generated by Teragen has a size of 100 bytes. Therefore, to process 10% of a 500MB input, we invoked Teragen to generate [(500MB converted to bytes) / 100] rows, and we overrode the run() function of the Mapper interface to stop after the map had processed the 10% threshold.
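The Utility.getNumberOfRows() helper used on line 9 of Listing 5.3 is not reproduced in this thesis; the following is a minimal sketch of what it could look like, assuming the split size is supplied in megabytes and that Teragen emits fixed 100-byte rows as described above.

// Hypothetical sketch of the row-count helper referenced in Listing 5.3.
public final class Utility {
    private static final long TERAGEN_ROW_SIZE_BYTES = 100L; // fixed Teragen row size

    // Assumes the split size is supplied in megabytes.
    public static long getNumberOfRows(int splitSizeMB) {
        long splitSizeBytes = (long) splitSizeMB * 1024L * 1024L;
        return splitSizeBytes / TERAGEN_ROW_SIZE_BYTES;
    }
}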

5.5.3 Generic Reducer Implementation

The implementation of the generic reducer is different from that of the generic mapper because we did not do any complex overriding of the run() function; we are only interested in outputting whatever data is shuffled from the mappers. The implementation is shown in Appendix A.4.
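Appendix A.4 is not reproduced here. The following is a hedged sketch of what such an identity-style reducer could look like, assuming Text keys and values; it simply forwards the shuffled records unchanged.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GenericReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Emit the shuffled data unchanged; no aggregation is performed.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}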

5.6 YARN Log Parser

The function of the YARN log parser in the framework is to extract the metrics for the parameters measured in the profiling step. For each job and each task, the YARN logs are collated, aggregated and parsed to extract the relevant counters and their respective values for model building. There is no complex implementation in this process; we use standard Java and Linux matching tools such as grep to match and extract the relevant information. Figure 5.5 shows a sample of the data generated from the logs after the extraction process.

Figure 5.5: Sample Data Generated by the YARN Log Parser

The first column represents the MR phase; this can be read, collect, spill, merge, shuffle or write. The second column represents the time it takes for that phase to finish. The third column represents the map selectivity.
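The exact layout of the aggregated logs is deployment-specific, but the extraction step essentially amounts to matching the counter names of Table 5.1 and capturing their values. The method below is a purely illustrative sketch; the assumed line format (COUNTER_NAME=value) and the class name are not taken from our parser.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CounterLineMatcher {
    // Assumed counter line format, e.g. "SPILL_MILLIS=1234".
    private static final Pattern COUNTER_LINE = Pattern.compile(
        "(READ|COLLECT|SPILL|MERGE|SHUFFLE|WRITE|MAP_TIME|REDUCE_TIME)_MILLIS=(\\d+)");

    // Returns the elapsed milliseconds if the line is a counter line, or -1 otherwise.
    public static long extractMillis(String logLine) {
        Matcher m = COUNTER_LINE.matcher(logLine.trim());
        return m.matches() ? Long.parseLong(m.group(2)) : -1L;
    }
}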

5.7 Model Building

Figure 5.6 gives the results of the models on the processed YARN log data, using 10-fold cross-validation on each dataset. The observations for each are discussed in the following subsections. The plot shows the predicted and actual values. To select the right parameters and evaluate the accuracy of the model of each phase, we consider best practices for model selection. For each model, we study the effect of the p-value, the Root Mean Squared Error (RMSE) [9], and R² [97]. First, to select the most important parameters of the model, we performed backwards elimination and accepted any variable whose p-value <= 0.05. The p-values tell us how statistically significant a particular variable included in the regression is. RMSE measures the standard deviation of the residuals or prediction errors; residuals measure how far the data points are from the regression line. In determining model accuracy, we evaluate how small the RMSE is relative to the range of the dependent variable. In our case, the dependent variable is execution time, represented on the Y-axis of Figure 5.6. From the basic analysis of RMSE, the smaller the RMSE value with respect to the Y-axis (execution time in our case), the better the model. Our general assumption is that, in most cases, the performance of these phases depends on the amount of data processed. This assumption is confirmed in Figure 5.7.

Figure 5.6: Results of 10-Fold Cross-Validation and Prediction on the 8-node cluster. The index on the x-axis is the row number of the data item being validated.

We can see that, in most cases, as we increase the size of the dataset, the processing time generally increases, suggesting a possible linear relationship. The last two phases in Figure 5.7, i.e. shuffle and write, have a less linear relationship than the first four phases. To select the best learning method, we modelled the data using different ML approaches. As shown in Tables 5.2 and 5.3, the best ML approach for both the shuffle and write phases is the support vector machine algorithm, but not by a wide margin compared to linear regression. The algorithm has the highest R-squared value and the best RMSE in our case.

5.7.1 Read Model

By default, each mapper is assigned a data block of at most 128MB to process. Each record in this block of data is processed by a map task. We measured the time taken by each map task to read the next key-value pair and aggregated the values to get the total time taken to read the entire dataset by all mappers. The Read Phase panel of Figure 5.7 supports our initial assumption in Equation (5.2) that there is a linear relationship between the size of the input data and the processing time. The model obtained using linear regression is presented in Equation (5.20).

Table 5.2: RMSE and R-Squared values for Shuffle Phase. From the basic analysis of RMSE, the smaller the RMSE value with respect to the Y-axis (execution time in our case) the better the model. R2 values closer to 1 are better.

Algorithm           RMSE      R-Squared   Settings
SVR                 4655.42   0.96        kernels="svmLinear,svmRadial", method="cv", number=10, verboseIter=TRUE, classProbs=FALSE
Random Forest       5027.23   0.95        NA
Decision Tree       8370.37   0.78        NA
Linear Regression   4774.75   0.95        NA

Table 5.3: RMSE and R-Squared values Write Phase. From the basic analysis of RMSE, the smaller the RMSE value with respect to the Y-axis (execution time in our case) the better the model. R2 values closer to 1 are better.

Algorithm           RMSE      R-Squared   Settings
SVR                 2747.380  0.94        kernels="svmLinear,svmRadial", method="cv", number=10, verboseIter=TRUE, classProbs=FALSE
Random Forest       2940.09   0.93        NA
Decision Tree       9366.400  0.68        NA
Linear Regression   6256.47   0.88        NA

The read plot in Figure 5.6 and the RMSE value of 4.08 make a strong case for the accuracy of the model.

T_{read} = 0.01 \times D + 1.33 + \epsilon \qquad (5.20)

5.7.2 Collect Model

Map output data is collected as soon as map tasks complete; once the circular buffer of each map task reaches the predefined threshold of 80%, the data is written to disk. This behaviour can be observed in Figure 5.7 as a spike in the time taken at 80% M_{sel}. The model for the collect phase in our test environment is given below:

T_{collect} = 0.01 \times M_{sel} + 0.97 + \epsilon \qquad (5.21)

Figure 5.7: Best-fit plot for each of the generic phases. The main purpose of this plot is to show how performance (ms) changes with respect to data size. The blue line is the line of best fit, while the black dots are the corresponding time spent by each job to process the various datasets.

Again, the RMSE value of 3.036 and the Collect plot in Figure 5.6 support our model.

5.7.3 Spill Model

As shown in Figure 5.7, our linearity assumption between data size and processing time is confirmed. The model obtained after cross-validation is presented in Equation (5.22). The model has an RMSE of 2.522, which would be considered good in this case.

T_{spill} = 0.02 \times M_{sel} + 0.98 + \epsilon \qquad (5.22)

5.7.4 Merge Model

Spill files are merged together to form bigger files. From the plot, we can see a close relationship between the actual and the predicted values, with an RMSE of 9.36. The parameters of the model are shown in Equation (5.23).

T_{merge} = 0.002 \times M_{sel} \log M_{sel} + 4.80 \qquad (5.23)

5.7.5 Shuffle Model

In this phase, data with the same keys are copied from all map nodes to the reduce nodes and merged for final processing. This phase is the most expensive of all because it involves inter-node communication. The main model parameters are the amount of data being shuffled and the number of map tasks. The RMSE value obtained is 4655.

T_{shuffle} = 10.45 \times S_d + 579.48 \times M_t + 6144.6 + \epsilon \qquad (5.24)

Although a linear equation is shown, we used the SVR model in the prediction; the same applies to the write model in Section 5.7.6.

5.7.6 Write Model

This phase involves writing data to disk. The model predicts the time taken to write a given amount of data to HDFS. From Figure 5.7 (Write Phase), we can observe a larger spread of the data around the fitted line. We have confirmed this from the results of our diagnostic plots of the model. The cross-validation plot also shows a strong relationship when the test values are used in the predictive model. The model has an acceptable RMSE of 2427.

T_{write} = 6.94 \times R_d + 2139.98 + \epsilon \qquad (5.25)

5.7.7 Custom Map and Reduce

Since these phases have a custom implementation that depends on the problem being solved, we used Equation (5.3) and Equation (5.10) to approximate their respective completion times.

Table 5.4: Metrics Extracted From Logs

Name                   Variable     Description
Num Of Bytes Read      D            Total size of input data
Map Output Bytes       M_d          Data output by the mappers
Map Selectivity        M_sel        (M_d / d) × 100
Bytes Shuffled         S_d          Data shuffled; refer to Eq. (5.8)
Bytes Written          R_d          Data output by the reducers
Total Mappers          M_t          D / BlockSize
Total Reducers         R_t          Optimisable
Number of Containers   N_c          Inferred from cluster configuration
Map Time               T_map        Total time taken by the map() function
Reduce Time            T_reduce     Total time taken by the reduce() function

5.7.8 Generating the Input Parameters

With the parameters from profiling, the model Tjob in (5.19) becomes:

T_{job} = \frac{M_t}{N_c} \Big[ (0.01 \times D + 1.33) + T_{map} + (0.01 \times M_{sel} + 0.97) + (0.02 \times M_{sel} + 0.98) + (0.002 \times M_{sel} \log M_{sel} + 4.80) \Big]
        + \frac{R_t}{N_c} \Big[ (10.45 \times S_d + 579.48 \times M_t + 6144.6) + T_{reduce} + (6.94 \times R_d + 2139.98) \Big] + \epsilon \qquad (5.26)

In this equation, \epsilon represents the residual term, i.e. the deviations of the observed values from their means, which are normally distributed with mean 0 and variance \sigma. Using Equation (5.26), actual values for the input parameters listed in Table 5.4 can be substituted to generate a solution. To get the values of these parameters, the first step is to run an application on the profiled cluster using a representative input dataset for the application. The application logs are then passed to the YARN log parser to collect the values of the parameters for substitution into the corresponding models. This provides the input parameters needed to generate the final prediction. To illustrate, we take the Reduce Side Inner Join algorithm as a target application. The generated values for each of the input parameters are listed in Table 5.5. To facilitate the generation of these values from YARN logs, we provide Java and R scripts available on our GitHub page at https://github.com/sneceesay77/mr-performance-modelling.

Table 5.5: Metrics With Values: the value of each parameter for the Reduce Side Inner Join algorithm, extracted from YARN logs.

Variable               Value
D                      19584
M_t, R_t               153, 11
T_map, T_reduce (ms)   33069, 286257
M_d                    128
M_sel                  100%
S_d                    19584
R_d                    19584
N_c                    8

5.8 Experiment and Evaluation

5.8.1 Setup

To gauge the applicability of our approach to an arbitrary setup, we conducted two sets of experiments on two different hardware setups. In the first setup, we used a single-node YARN cluster with 32GB of memory and 8 CPUs. YARN was allocated 24GB and a minimum of 3GB per container. During the execution of the experiments, we observed that, on average, the highest number of containers executing simultaneously was 4; this is consistent with our configuration, since the minimum amount of memory YARN can assign to a container is 3GB. In the second setup, we used an 8-node in-house YARN cluster to mimic a real-world deployment scenario. Each of these nodes has 8GB of RAM, eight vCPUs and 500GB of storage space. YARN is allocated 48GB of memory, and a maximum of 8 containers can run at a time. From the results of the experiments on the two setups, we have a strong case to conclude that the approach can be used to profile an arbitrary cluster size. To test how the models perform on an unseen or new application, we follow these steps. First, identify the parameters of the various models. Second, evaluate how these parameters can be extracted from the job execution logs. Third, run the application with a representative sample dataset and collect the values of those parameters. Finally, substitute these values into the corresponding models to get the execution time. As shown in Table 5.4, for each workload, a set of log-related parameters is extracted or calculated and used as input for the generated models. To measure the variability of the results and increase certainty, each experiment was executed three times to avoid data skew and bias, and the average of those executions was calculated and used.

5.8.2 Representative Workloads

One of the research challenges listed in Section 3.6 of Chapter 3 deals with the problem of using representative workloads in the benchmarking and performance modelling of data-intensive applications. To address this problem, we grouped our workloads based on MapReduce design patterns. In all the experiments, we adapted the algorithms discussed in the design patterns book [98] and used the same StackOverflow data source, but a more recent version. Table 5.6 summarises all the workloads and the data sizes used in this work. To obtain representative cost models for the various MapReduce phases and for the entire process, we studied the most common design pattern algorithms of MapReduce presented in [98]. The motivation for using the MapReduce design pattern approach in the experimental setup is to avoid bias towards specific applications. We also include workloads that are common in real-life applications, e.g. inverted index and join operations, which are common in search engines and databases, respectively. As shown in Table 5.6, we have only included large datasets in the experiment to narrow our focus on big data and data-intensive applications. We identified four common design patterns, and for each of these patterns we tested at least three algorithms to show how algorithms using the same pattern relate to one another. We found that algorithms within the same pattern have similar performance characteristics.

5.8.3 Summarisation Pattern:

This pattern focuses on algorithms that produce a top-level, summarised view of the data, so that insights can be gleaned that are not available from looking at a localised set of records alone. The popular MapReduce WordCount program is a good example. The summary may involve grouping the dataset by a particular key, counting the number of occurrences of each word, or finding averages. These are standard operations in MapReduce because grouping is a central part of the framework. Numerical summarisation, a subpattern of summarisation, is used to calculate aggregate statistics over a dataset: for example, count, minimum, maximum, average, variance and standard deviation. Inverted index summarisation is another subpattern that generates an index from a dataset to allow for faster searches or data enrichment capabilities. Algorithm 1 shows the reducer implementation of the Min-Max Count algorithm. This algorithm can be used to find the first time, the last time and the number of times a user accesses a website or a service.

Algorithm 1: Min Max Count MapReduce Implementation
1  function minMaxCount(keyData);
   Input: <UserID, ...>
2  foreach Tuple T : keyData do
3      if !T.minDate then
4          T.minDate = minDate
5      end
6      if !T.maxDate then
7          T.maxDate = maxDate
8      end
9      Count += Count;
10     emit(UserID, Tuple<minDate, maxDate, Count>)
11 end
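For readers more familiar with the Hadoop API than with pseudocode, the following is a hedged Java sketch in the spirit of Algorithm 1; the value encoding (the mapper emitting access timestamps as epoch milliseconds) is an assumption made for illustration, not the exact implementation used in our experiments.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MinMaxCountReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text userId, Iterable<Text> accessTimes, Context context)
            throws IOException, InterruptedException {
        long minDate = Long.MAX_VALUE;
        long maxDate = Long.MIN_VALUE;
        long count = 0;
        for (Text t : accessTimes) {
            long ts = Long.parseLong(t.toString()); // assumed: epoch milliseconds
            minDate = Math.min(minDate, ts);
            maxDate = Math.max(maxDate, ts);
            count++;
        }
        // Emit <UserID, Tuple<minDate, maxDate, Count>> as a tab-separated value.
        context.write(userId, new Text(minDate + "\t" + maxDate + "\t" + count));
    }
}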

5.8.4 Filtering Pattern:

This pattern filters and returns a subset of the original dataset. In most cases, the output dataset is smaller than the input dataset. Examples of algorithms in this pattern are Distributed Grep, Distinct, Sampling and Top K. Filtering is about understanding a smaller fragment of your data, such as all records generated by a particular user, or the top five most used verbs in a corpus of text. It can also be considered a form of search: if you are interested in finding all records that involve a particular piece of distinguishing information, you can filter out records that do not match the search criteria. Algorithms 2 and 3 show the mapper and reducer implementation of distributed grep; the algorithm finds and counts matching text in a corpus of raw data.

Algorithm 2: Mapper: Distributed Grep MapReduce Implementation
1  function distributedGrep();
   Input: inputData
   Output: <match, 1>
2  Pattern p = "reg_ex";
3  foreach Text T : inputData do
4      if T.matches(p) then
5          emit(match, 1)
6      end
7  end

Algorithm 3: Reducer: Distributed Grep MapReduce Implementation
1  function distributedGrep();
   Input: <match, List<1,1,..,1>>
   Output: <match, totalCount>
2  foreach Int v : List do
3      totalCount += v;
4      emit(<match, totalCount>)
5  end

5.8.5 Data Organisation Pattern:

This pattern deals with the reorganisation of data from one structure to another. In many organisations, Hadoop is usually one component in a more extensive data analysis platform; therefore, there is always a need for data to be transformed from one structure to another. The main subpatterns in this pattern are Structure to Hierarchical, which transforms data into a very different structure (an example may be coalescing n table-based data structures into a hierarchical JSON or XML format), and the partitioning and binning pattern, which splits the data into partitions, shards or bins; this can be achieved using a custom partitioner or the MultipleOutputs class of Hadoop.

5.8.6 Join Pattern:

In data systems, it is infrequent to organise all the data in one large file. For example, presume you have business information stored in a SQL database; this information would generally be spread over multiple tables for convenience. Meanwhile, in the context of big data, weblogs may arrive in a constant stream and be dumped directly into HDFS; daily analytics that make sense of these logs are stored somewhere in HDFS, and financial records are stored in an encrypted repository. Data is all over the place, and while it is precious on its own, we can discover interesting relationships when we start analysing these sets together. This is where join patterns come into play. Hadoop MapReduce is not natively designed to handle data from well-organised and normalised data sources, which leads to the absence of SQL-like joins for data manipulation. However, programmers can implement these functionalities themselves. The reduce-side implementations of the inner and left-outer joins are listed in Algorithm 4 and Algorithm 5, respectively. The most popular implementation of join in MapReduce is the reduce-side join; it works in all cases but can be slow as the size of the data increases. There are other types of join patterns, such as the Replicated Join and the Composite Join. In both algorithms, we assume that the mapper tasks, which output and group the input data by key, have already been executed.

Algorithm 4: Reduce Side Implementation of Inner Join in MapReduce
1  function innerJoin(listA, listB);
   Input: Two lists of KV pairs, listA and listB
   Output: Inner join of listA and listB
2  foreach Text A : listA do
3      foreach Text B : listB do
4          emit(A, B)
5      end
6  end

Algorithm 5: Reduce Side Implementation of Left Outer Join in MapReduce
1  function leftOuterJoin(listA, listB);
   Input: Two lists of KV pairs, listA and listB
   Output: Left outer join of listA and listB
2  foreach Text A : listA do
3      if !listB.empty() then
4          foreach Text B : listB do
5              emit(A, B)
6          end
7      else
8          emit(A, null)
9      end
10 end
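The following is a hedged Java sketch of a reduce-side inner join in the spirit of Algorithm 4; the source tags ('A'/'B') assumed to be prepended by the mappers are an illustrative convention, not the exact implementation used in our experiments.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InnerJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> listA = new ArrayList<>();
        List<String> listB = new ArrayList<>();
        // Separate the records by the source tag assumed to be added by the mappers.
        for (Text v : values) {
            String record = v.toString();
            if (record.startsWith("A\t")) {
                listA.add(record.substring(2));
            } else if (record.startsWith("B\t")) {
                listB.add(record.substring(2));
            }
        }
        // Emit the cross product of the two lists for this key (inner join).
        for (String a : listA) {
            for (String b : listB) {
                context.write(joinKey, new Text(a + "\t" + b));
            }
        }
    }
}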

Table 5.6: Workloads Used in the 8 node cluster. For the single node cluster we have used less than 1GB of data in almost all the cases.

Algorithm                Design Pattern      Data Size
MinMaxCount              Summarisation       16GB
Average Count            Summarisation       16GB
Median and Std.          Summarisation       16GB
Inverted Index           Summarisation       12GB
Grep                     Filtering           12GB
Top X                    Filtering           2.7GB
Distinct                 Filtering           16GB
Structure to Hierarchy   Data Organisation   12GB, 16GB
Total Order Sorting      Data Organisation   2.6GB
Shuffling                Data Organisation   2.7GB
RSJ Inner                Join                16GB, 2.7GB
RSJ Left Outer           Join                16GB, 2.7GB
RSJ Right Outer          Join                16GB, 2.7GB
RSJ Full Outer           Join                16GB, 2.7GB

5.8.7 Evaluation of Results

For each of the two setups, the same experiments were repeated, and the results are plotted and discussed side by side. As expected, the time taken by the single-node cluster for each of the algorithms is mostly greater than that of the eight-node cluster. Comparatively, the observed performance of the algorithms in each of the four patterns is mainly affected by the amount of shuffled data and, sometimes, by the complexity of the reduce function; the map function in most cases performs a standard consume-and-emit operation. Below, we discuss the rest of the results in detail. Note: in all the tables, the error rate is measured as the absolute difference between the predicted and observed values. The lower the error rate, the better the accuracy of our prediction.

5.8.8 Summarisation Design Pattern

The algorithms and the results obtained are listed in Table 5.7, with the corresponding plot in Figure 5.8. The percentage prediction error for both setups is less than 16%. Also note that the algorithms in this pattern have similar completion time bounds.

Figure 5.8: Applications in the Summarisation design pattern have similar performance characteristics. The x-axis shows the applications and the y-axis the time in seconds. The average prediction error is less than 16% from the measured value.


Table 5.7: Results for Summarisation on 8-Node Cluster

Algorithm             Predicted (sec)   Actual (sec)   |Error%|
MinMaxCount           196               201            2
Inverted Index        118               117            1
Average Count         157               176            11
Median and Std. Dev   158               169            7

From the experimental data, the shuffle operation for MinMaxCount transferred approximately 371MB to the reduce nodes, while the MapReduce program that generates an inverted index from Wikipedia articles shuffles 56MB of data from the mappers to the reducers. The more data there is to shuffle, the lower the performance in most cases. This observation is reflected in the performance plot shown in Figure 5.8, and it is consistent across the algorithms.

5.8.9 Filtering Pattern

In filtering, we included the Grep, Distinct and Top X algorithms. We have observed that our predictions are quite close to the observed values. Table 5.8 shows the results of the experiment and the respective error of each algorithm. We have also observed that our prediction error is less than 14% from the observed values for both setups. Analysing the shuffle data in this pattern, we have seen that the Distinct algorithm shuffles 106MB and Grep shuffles 11MB, but Grep has a more complicated map function than the rest of the algorithms in this pattern because a regex operation is applied to each entry in the file being read. The Top X (in our case Top 50) algorithm has the lowest amount of shuffled data because we are only shuffling 50 data items.

Table 5.8: Although Grep shuffles only 11MB, it has a more complicated map function than the rest of the algorithms in this pattern; the mappers spend the majority of the time applying a regex operation to each input split. The average prediction error rate is less than 8% from the measured value.

Algorithm   Predicted (sec)   Actual (sec)   |Error%|
Grep        470               430            9
Top X       45                40             12
Distinct    128               130            2

5.8.10 Data Organisation Pattern

In data organisation, we have observed that the amount of input data is mostly the same as the amount of output data; here data is just reorganised, and there is no data pruning component. The Question and Answer Hierarchy algorithm merges data from two big datasets: for each question posted on Stack Overflow, the corresponding answers are collated from the Posts file. The results of the three executed experiments are illustrated in Figure 5.10 and Table 5.9. In this pattern, we have observed a different trend because the entire input dataset is shuffled.

Figure 5.9: Filtering Pattern

Figure 5.10: Data Organisation Pattern

For example, the Question and Answer algorithm's input data comes from two data sources of approximately 16GB and 12GB, and the shuffled data is approximately 28GB. The same observation is made for the Shuffling and Total Order Sorting algorithms.

Table 5.9: Results for Data Organisation Pattern: Both the 8 node and single node exhibit the same performance characteristics. The Q&A is the most expensive here because it performs a join like operation between two input files. The average prediction error rate is 10% from the measured value

Algorithm Predicted (sec) Actual (sec) |Error%| Q&A Hierarchy 1653 2024 18 Total Order 162 160 1 Anonymise & Shuffling 131 125 8 observation is made for Shuffling and Total Order Sorting algorithms.

5.8.11 Join Pattern

Joins are among the most expensive operations in MapReduce; this is evident in our experimental results. We executed inner, left-outer, right-outer and full-outer joins using the Reduce Side Join algorithm. We used the Users and Comments datasets collected from Stack Overflow; the Users dataset is 2.7GB, and the Comments dataset has a size of 16GB. The results for the different join operations are quite close. The prediction error for both setups is at most 10% of the observed value. The results are shown in Table 5.10 and Figure 5.11.

Figure 5.11: Join Pattern

Just like the Data Organisation pattern, the Join pattern also shuffles the entire input dataset to the reducers, which makes its performance a function of the input data. This is consistent across all four algorithms we have tested.

Table 5.10: Experiment Data for the Join Pattern: compared to all other patterns, joins are the most expensive to run on the framework. The prediction error for both setups is at most 10% of the observed value.

Algorithm                 Predicted (sec)   Actual (sec)   |Error%|
Reduce Side Inner         975               1080           10
Reduce Side L-Outer       1410              1285           10
Reduce Side Right Outer   1854              1920           3
Reduce Side Full Outer    900               960            6

However, delving into their implementations, we have seen that the right outer join is more expensive because our right table (Comments, 16GB) is much larger than the left table (Users, 2.7GB) and therefore requires more iterations over the keys.

5.9 Related Work

For a summary of the related work, please refer to Table 5.11. Benchmarking and performance modelling have been an integral part of system design and analysis. They help to compare systems and identify performance bottlenecks that can be improved. Standard benchmarks such as TPC-DS [100] have been used in the research community to evaluate the performance of decision support systems. With the advent of highly configurable software systems, benchmarking and performance modelling are now more challenging. In [115], for a given configurable system, they combined machine learning and sampling heuristics to derive a performance-influence model describing all relevant influences of configuration options and their interactions. In [18], they automate the process of modelling and significantly extend it to allow insightful modelling of any combination of application execution parameters. Furthermore, in [73], they focused on modelling parallel scientific applications beginning from the design process throughout the whole software development cycle. However, all these works concentrate on supercomputers and parallel computing frameworks like MPI [61]. Therefore, their findings and conclusions cannot be applied to data-intensive applications. Benchmarking and performance modelling of data-intensive applications has recently received attention from researchers such as [44, 56, 71, 74, 79, 129], but there is still much room for improvement. Huang et al. [74] and Wang et al. [133] developed benchmark suites for Hadoop, Spark and streaming frameworks. These suites consist of a set of workloads organised into related groups. The workload groups span basic statistics, machine learning, graph processing and SQL. However, this approach makes it inefficient and cumbersome to test new workloads. Also, as stated in Ceesay et al. [21], deploying these tools can be cumbersome for non-technical individuals; it therefore becomes a challenge for broader adoption.

Table 5.11: Comparison of Related Work: The average prediction error used in the literature focuses on the absolute difference between the predicted and observed values.

Comparison Metric          [79]   [129]   [128]   [140]   [130]   [56]   [139]    MR
Multiple Objectives        ✗      ✗       ✗       ✗       ✗       ✗      ✗        ✓
Generic Benchmarking       ✗      ✗       ✗       ✓       ✗       ✗      ✗        ✓
Representative Workloads   ✗      ✗       ✓       ✗       ✗       ✗      ✓        ✓
Average Prediction Error   12%    10%     15%     16%     15%     15%    10-17%   10%

Zhang et al. [139] present a MapReduce performance model that measures generic phases of the framework. However, their work focused on an older version of Hadoop (0.20.0), which uses mapper and reducer slots for job processing. The current MapReduce framework uses YARN, which efficiently manages cluster resources. Given this fact, it is clear that the approach and the model they used cannot be representative of the current MapReduce paradigm. Secondly, our experimental approach also differs: while they pick arbitrary big data applications, we grouped ours by algorithm design patterns, the results of which show some interesting correlations in terms of performance. We have observed that applications in the same pattern mostly have similar performance characteristics in terms of their execution time. Finally, Verma et al. [128] proposed ARIA, a job and resource scheduler for the MapReduce framework that aims to allocate the right amount of resources to meet required service level objectives (SLOs). In their work, they extracted information from MapReduce logs as a basis for their framework. Like [139], they used Hadoop 0.20.0, and therefore their approach is not applicable to Hadoop 2.0 or later versions. Venkataraman et al. [127] built performance models based on a small sample of data and predicted on larger datasets and cluster sizes. However, they focus on a small subset of machine learning algorithms.

5.9.1 Discussion

Here we revisit the research gaps and challenges discussed in Section 3.6 of Chapter 3 and evaluate them with respect to the MapReduce communication pattern.

• General Approach: We used a grey-box approach by combining machine learning and the low-level details of the MapReduce framework and applications. The works compared in Table 5.11 all used black-box or analytical approaches except [139] and [140]. A thorough discussion of this is covered in the literature review chapter.

• Use of Representative Workloads based on Communication Patterns: Using representative workloads to test and validate the accuracy of the models is crucial in assessing the generality of the models. Using less representative workloads to verify the accuracy of a model would bias it towards a particular domain [66, 67]. In this work, we have used MapReduce design patterns to group algorithms. The main design patterns included are filtering, summarisation, join and data organisation. This covers a wide variety of standard MapReduce algorithms. We have also scaled our experimental environment from a single-node setup to an eight-node cluster, and the results are consistent in both environments.

• Benchmarking Approach: For the benchmarking, we adopted a generic approach by running dummy MapReduce workloads on the cluster. The main idea behind this is to avoid benchmarking bias when only a particular set of workloads is used.

• Curse of Configuration Parameters: Big data frameworks have many configuration parameters, which makes it challenging to come up with any comprehensive performance model [132]. In this chapter, we prune the parameters to a maximum of eight key configurations, such as the number of mappers, number of reducers, number of containers and the amount of data that passes through each stage, as the key performance drivers. The pruning of these parameters was done with the help of ML: the significance of each parameter was assessed to determine whether it can be considered a key parameter, using Linear and Support Vector regressions in this process; a sketch of this significance test is given after this list. The final model building process involves feeding these key performance drivers to ML algorithms.
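As an illustration of this pruning step, the following is a minimal Python sketch, assuming the benchmark runs are stored in a CSV file with one column per candidate parameter and a measured runtime column (the file name and column names are hypothetical). An ordinary least-squares fit is used here for the significance test; the same filtering can be repeated with a support vector regressor.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("mapreduce_benchmark_runs.csv")   # hypothetical benchmark output
candidates = ["num_mappers", "num_reducers", "num_containers",
              "map_input_mb", "shuffle_mb", "reduce_output_mb"]

# Ordinary least-squares fit of runtime against the candidate parameters
X = sm.add_constant(df[candidates])
ols = sm.OLS(df["runtime_sec"], X).fit()

# Keep parameters whose coefficients are significant at the 5% level
key_params = [name for name, pval in ols.pvalues.items()
              if name != "const" and pval < 0.05]
print(key_params)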

5.10 Summary

In this chapter, we have shown our approach to benchmarking and modelling the MapReduce communication pattern. We used a grey-box approach which achieves a higher prediction accuracy compared to related works, as demonstrated in Table 5.11. To avoid any bias in the benchmarking process, we have used a novel generic benchmarking approach to profile the Hadoop cluster. In this approach, we study the low-level internals of the framework to segment the MapReduce pipeline into phases. The modelling and benchmarking are then based on the combined results of these phases. In the experiments, we validate our work by using representative workloads grouped by MapReduce design pattern. On average, the error rate in our experimental setups is ±10% from the measured values. In the next chapter, we will discuss benchmarking and performance modelling of the Dataflow with Cycles pattern. We will highlight the benchmarking and modelling approach. We will also present experimental results and compare our results to related works.

CHAPTER SIX: BENCHMARKING AND MODELLING DATAFLOW WITH CYCLES PATTERN (DFWC)

In the previous chapter, we presented benchmarking and performance modelling of the MapReduce pattern. In this chapter, we discuss how we address benchmarking and performance modelling of the Dataflow With Cycles (DFWC) pattern. The first goal is to understand the complexity of this pattern and the ubiquity of application and workload configuration parameters in order to come up with a comprehensive performance modelling framework. To address this challenge, we identified representative workloads and studied the complexity of each of them to design a generic benchmarking framework. The second goal is to use the generic benchmarking framework, with key performance drivers, to develop a generalised prediction framework. The third goal is to validate the prediction framework by running empirical experiments in two setups. On average, we have seen that the error rate of our prediction models in both configurations is ±14% from the measured values. The rest of this chapter is organised as follows: in section 6.1 we give a brief background, and in section 6.2 we discuss in detail the methodology used in benchmarking and modelling DFWC applications. In section 6.3, we discuss the identification and selection process of key framework-level configuration parameters. In section 6.4, we cover the computational complexity of each ML algorithm to enable us to identify key application-level configuration parameters. Section 6.5 covers the generic benchmarking design approach used to gather data for the performance modelling stage. We discuss the model building and selection process in section 6.6.


In section 6.7, we cover the experimental design. In section 6.8, we explain the informed decisions that can be taken as a result of the modelling outcome. We wrap up the chapter with general discussions on our observations in section 6.10 and conclude in section 6.11.

6.1 Background

The MapReduce communication pattern discussed in the previous chapter was one of the first players in the big data processing domain. MR is best for a simple two-stage style of computation, and it is simple and easy to use. However, it is not expressive and suffers when solving multi-stage and iterative computation problems. This limitation has led to the development and rise of the DFWC pattern, which is an excellent fit for multi-stage and iterative applications such as machine learning applications. These applications are ubiquitous in all domains of modern-day technology and business environments. Therefore, understanding and predicting the performance of such applications running in the cloud or on-premises could help minimise the overall cost of operations and provide opportunities for identifying performance bottlenecks. However, the complexity of the low-level internals and the ubiquity of configuration parameters of big data frameworks like Apache Spark [138] make it extremely challenging and expensive to come up with comprehensive and effective performance modelling solutions. For example, as of writing this thesis, Apache Spark has over 200 configuration parameters. The underlying hardware and application-specific configurations further compound this challenge. An essential part of gaining performance improvement is to tune these configuration parameters to find the right set of settings to run an application efficiently on a computing cluster. Non-optimal configuration parameters can have a significant performance impact over an application's life cycle, and manually tuning them is often a costly process of trial and error.

6.2 Methodology

First, let us define the main objective of this work before diving into the methodology. Approximating or knowing beforehand how long a distributed big data application will take to run on a big data framework can be very valuable. It can be useful in workload execution planning in the cloud, which can reduce the overall cost of operations. However, the solution is not a trivial one because big data frameworks and applications have many configuration parameters that can affect the performance of applications. For example, consider running a simple WordCount problem using Spark: some of the configuration parameters to consider are the size of the input dataset, the amount of memory to assign to executors, the number of Spark executors and the number of executor cores, as illustrated in the sketch below. It will, therefore, be interesting to know how the performance scales as these parameters change. To address these challenges, we present a four-component research methodology in this section, as shown in Figure 6.1. Each of the four components of the methodology is discussed in the following subsections.
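As a concrete illustration of these parameters, the following is a minimal PySpark sketch of a WordCount job with the executor-related settings passed explicitly. The input and output paths and the specific values are placeholders rather than the tuned settings used in this thesis, which are supplied through HiBench configuration files instead of application code.

from pyspark.sql import SparkSession

# Illustrative values only; these are the kinds of parameters studied in this chapter
spark = (SparkSession.builder
         .appName("WordCount")
         .config("spark.executor.instances", "8")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "4")
         .config("spark.default.parallelism", "32")
         .config("spark.io.compression.codec", "lz4")
         .getOrCreate())

counts = (spark.sparkContext.textFile("hdfs:///data/input.txt")   # hypothetical path
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/wordcount_output")            # hypothetical path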

Figure 6.1: A four-component methodology used in the benchmarking and performance modelling framework. First select key configuration parameters out of the 200+ configurations. Perform benchmarking to collect information about those parameters. Use benchmarking results for modelling, then use the models for decision making.

6.2.1 Key Config Selection

The first component of the framework is to devise a way of selecting key performance drivers out of the vast space of configuration parameters. We call them key configuration parameters because we found that they are the most significant parameters with respect to performance. These parameters are divided into two groups: framework-level and application-level parameters. The goal is to use these parameters in the benchmarking stage to generate the data for the prediction stage. Computing frameworks such as Apache Spark provide a large number of configuration parameters that can be used to improve the performance of big data applications; however, incorrectly setting these configuration parameters can lead to performance problems. Application configuration parameters, on the other hand, are parameters that are relevant to an application. For example, the number of input rows and columns in a dataset is crucial in rule- or ensemble-based machine learning algorithms. We therefore propose that a holistic approach including both framework- and application-level configuration parameters yields the best results in terms of performance modelling. Details of how key configuration parameters are selected for each of the two groups are discussed in Sections 6.3 and 6.4.

6.2.2 Benchmarking

The benchmarking component entails the process of stress testing the Apache Spark framework with different applications using different parameter settings to collect performance metrics for the selected key configuration parameters. To avoid a costly, exhaustive benchmarking process, we designed a generic benchmarking approach focusing on applications that generally fit the

DFWC pattern. For each application and each of the key configuration parameters, we vary parameter values by a factor before each execution cycle. This process generates a lot of training data which is later used in the next stage of the methodology.

6.2.3 Modelling & Decision Making

Since we are using a grey-box approach, we use the information gathered from the previous two components and apply machine learning to model the performance of applications. One of our main interests is to predict the execution time of an application given certain configuration settings. This is a regression problem, and we therefore investigate the best regression approach to use for optimal results.

6.3 Selection of Key Framework Level Configuration Parameters

The main goal of this section is to address finding key framework parameters for best performance. In our case, these parameters are used when executing Spark applications on the cluster. Spark groups these configuration parameters into sixteen (16) categories based on their functionality1. We are interested in the parameters that mainly affect performance during execution. These parameters are found in the following categories: Application, Execution Behaviour, Shuffle Behaviour, and Compression and Serialisation. Examples are the CPU and memory allocated to Spark executors. As discussed in Section 2.6, big data frameworks like Apache Spark and Apache Hadoop are highly configurable; therefore, it is crucial to understand the effect that a configuration parameter might have on the overall performance of a running application. This is, however, a very challenging problem because of the vast space of these configuration parameters.

1https://spark.apache.org/docs/latest/configuration.html

Figure 6.2: The frequency distribution plot of compression types (LZF: Lempel-Ziv-Free [28], ZSTD: ZStandard [35], LZ4 [34] and Snappy [57]). Frequency shows the number of observations (execution times) in each bin; on the x-axis, lower values are better. We can see that there is no significant performance improvement from using one compression method over another because the average execution time is approximately 49s for each of them. Therefore, using any of the default compression methods does not present any major performance bottleneck.

To address the problems caused by the ubiquity of these configurations, we followed a systematic approach to identify key configuration parameters. We first conducted extensive research to select candidate configuration parameters by studying academic works aimed at Spark performance tuning [27, 59, 105, 131]. We have also looked at the official Apache Spark documentation. Secondly, we used empirical methods to determine if a configuration parameter is a key performance driver. In this approach, we executed mini benchmarks to collect data about each candidate configuration parameter using sample applications such as WordCount and SVM. We observed the effect of each of these parameters by analysing the density plots of the observations. For example, in Figure 6.2, we consider the four main compression algorithms used by the Apache Spark framework to compress data during data processing. We can see that using one form of compression algorithm over another does not result in any significant performance improvement because they have almost the same average execution time of 49s. LZ4 [34] and LZF [28] are similar types of compression algorithms that focus on lossless data compression and decompression speed; LZF aims for low memory usage during compression. ZSTD [35], developed by Facebook, is a fast compression algorithm providing high compression ratios; it also offers a special mode for small data, called dictionary compression. Snappy, developed by Google, is a fast compression and decompression library based on ideas of the LZ family. The algorithm aims for very high speed and does not aim for maximum compression or compatibility with other algorithms.

Figure 6.3: Scaling Executor Memory: The frequency distribution plot shows how scaling executor memory affects performance. Frequency shows the number of observations (execution times) in each bin; on the x-axis, lower values are better. We can see that there is a significant performance improvement as we increase the size of memory because the average performance improves from 53s in the 2GB plot to 46s in the 8GB plot.

However, in Figure 6.3, comparing the bin sizes shows an apparent variation in performance as we increase the executor memory. For example, in the 2GB and 4GB plots, the majority of the timings are distributed between the first and second bins (0s to 100s), while the majority of the execution times for the 6GB and 8GB plots are in the first bin only (0s - 50s). Numerically, the average execution time improved from 53s to 46s, a 7s improvement in performance. We therefore include such parameters as key configuration parameters. However, it is not accurate to conclude that more memory would always achieve the best performance in all cases, because further experiments in Section 6.7 have shown that increasing memory does not always result in a performance improvement. Table 6.1 shows the list of candidate configuration parameters we have included in this study.

6.4 Selection of Key ML Application Configuration Parameters

Application configuration parameters include properties such as the size of the data being processed, the number of observations, the number of features and the number of iterations. To derive these configuration parameters, we analyse the computational complexity of each of the machine learning algorithms of interest.

Table 6.1: Candidate Configuration Parameters: We have used these parameters in the benchmarking process. The range column describes the different configuration values we have used during the benchmarking. After some analysis, some of these configuration parameters are dropped because of their insignificance towards performance.

• spark.driver.memory: the amount of memory to use for the driver process (Default: 1G; Range: 1G - 8G; Group: Application)
• spark.executor.memory: total memory assigned to each executor (Default: 1G; Range: 1G - 8G; Group: Application)
• spark.executor.cores: total number of cores to use on each executor (Default: 1; Range: 1 - 8; Group: Execution Behaviour)
• spark.executor.instance: the total number of executors to run (Default: none; Range: 1 - 32; Group: Execution Behaviour)
• spark.default.parallelism: number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize (Default: in YARN, 2 or the number of cores on all nodes; Range: 1 - 32; Group: Execution Behaviour)
• spark.sql.shuffle.partitions: configures the number of partitions that are used when shuffling data for joins or aggregations (Default: 1; Range: 1 - 32; Group: Execution Behaviour)
• spark.io.compression.codec: the codec to use to compress internal data such as event logs, broadcast variables and RDD partitions (Default: lz4; Range: lz4, lzf, snappy, zstd; Group: Compression and Serialisation)
• spark.broadcast.compress: controls whether to compress broadcast variables before distributing them (Default: true; Range: true/false; Group: Compression and Serialisation)
• spark.shuffle.compress: controls whether to compress map output data (Default: true; Range: true/false; Group: Shuffle Behaviour)
• spark.shuffle.spill.compress: controls whether to compress the spilled data during shuffles (Default: true; Range: true/false; Group: Shuffle Behaviour)
• Datasize/scale profile: size of data to process (Default: tiny; Range: tiny, small, large, huge, gigantic, bigdata; Group: ML Application)
• n: number of observations in the dataset to process (Default: NA; Range: depends on datasize; Group: ML Application)
• m: number of features in the dataset to process (Default: NA; Range: depends on datasize; Group: ML Application)

6.4.1 Linear Regression

Given a set of data points {y_i, x_{i1}, ..., x_{ip}}, where i indexes the observations and p represents the number of features, the goal of least-squares linear regression is to find the best regression line of the form y_i = b_1 x_i + b_0 + e_i which minimises the sum of the squared errors, that is, the errors obtained when the function f(x) = b_1 x + b_0 is used to estimate the true values of y, where b_1 is the parameter representing the slope of the line and b_0 is the constant parameter. If there is a large set of predictors, then under the hood linear regression transforms the data set into a system of matrix equations of the form

Y = XA + E    (6.1)

In equation 6.1, Y is the matrix of all the true values y_i, X is the matrix of all the values x_i, A is the unknown matrix of b estimates and E contains the error terms. The least-squares estimate of b is obtained by solving the normal equation A = (X^T X)^{-1} X^T Y [143] with time complexity O(mn^2 + n^3) [32]. Therefore, given a matrix with n training examples and m predictors, asymptotically, the computational complexity of solving the normal equation for A is O(mn^2 + n^3) [32]. We will, therefore, collect information about these two performance drivers in the benchmarking process.
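To make the cost structure concrete, the following is a minimal NumPy sketch of the normal-equation solve on synthetic data (the sizes are illustrative); the dominant costs are forming the Gram matrix X^T X and solving the resulting linear system, which is what the complexity expression above captures.

import numpy as np

n, m = 10_000, 50                       # illustrative numbers of observations and predictors
X = np.random.rand(n, m)
true_b = np.random.rand(m)
y = X @ true_b + 0.1 * np.random.randn(n)

# Normal equation: A = (X^T X)^{-1} X^T y, solved without forming the explicit inverse
gram = X.T @ X                          # cost of forming the Gram matrix
A = np.linalg.solve(gram, X.T @ y)      # cost of solving the resulting square system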

6.4.2 Logistic Regression

Logistic regression is a classification algorithm that assigns an observation to one of a discrete set of classes. The goal of logistic regression is to solve an optimisation problem [108]. Apache Spark's machine learning library (MLlib) supports several optimisation algorithms, such as Stochastic Gradient Descent (SGD) and L-BFGS [96]; however, in MLlib, SGD is more mature. Apache Spark uses the sigmoid activation function [96] defined below.

S(z) = 1 / (1 + e^{-z})    (6.2)

Here z is the most important parameter; it represents a prediction function of the same form as in section 6.4.1, solved in O(mn^2 + n^3). Therefore, we can conclude that the complexity of Logistic Regression is also O(mn^2 + n^3).

6.4.3 SVM

Support Vector Machines [120, 142] are used for both classification and regression problems. Several kernel functions, such as svmLinear and svmRadial, can be passed to an SVM algorithm. The goal of SVM is to find an optimal hyperplane which maximises the margin of separation of the training data. The complexity of SVM is O(n^2 p + n^3).

6.4.4 K-Means

K-means [69] is a clustering algorithm that partitions data into k groups, given a set of n observations and m predictors. The algorithm performs t iterations of the following fundamental steps:

1. Identify k predefined clusters.

2. For each data point, measure the distance from the initial centres and assign to the closest cluster.

3. Update the cluster centres by taking the mean of the assigned points. Update t = t + 1

4. Repeat steps 2 and 3 until there are no changes in the assignment of data points to clusters.

Mathematically, the goal is to minimise the intra-cluster distance J using Equation 6.3.

J = \sum_{j=1}^{k} \sum_{i=1}^{n} || x_i^{(j)} - c_j ||^2    (6.3)

Considering the relation in the above equation, we can conclude that standard k-means has a complexity of O(m · n · k · t).

6.4.5 Decision Tree

In general, tree-based algorithms use a divide-and-conquer approach to narrow the decision boundary. Given n training observations and m features, the decision tree algorithm [50] calculates a quality function based on each split of the data for each feature in each non-leaf node. Therefore, the time complexity of a decision tree is O(n · log(n) · m).

6.4.6 Random Forest

In Random Forest [88, 89], the algorithm uses an ensemble approach to make a final decision from multiple trees. Thus, the overall computational complexity of the entire process with ntree trees and mtry candidate variables is O(ntree · mtry · n log(n)). However, if the depth d of the trees is provided, then the complexity is O(ntree · mtry · d). Given a dataset D with n observations and m variables, the Random Forest algorithm uses the following key steps to generate ntree trees:

1. A sample of n observations is taken at random with replacement from the main dataset D.

2. A random subset of variables, referred to as mtry, is selected from the total number of variables m. This value is kept constant.

3. Each tree is grown to the largest extent, the complexity of which is O(mtry · n log(n)).

4. Prediction is performed by aggregating results from the trees: a vote is used in classification, while an average is used in regression.

6.4.7 Naive Bayes

Naive Bayes is a simple yet powerful conditional probabilistic machine learning algorithm used for classification tasks. Given n observations, m features and c classes, the Naive Bayes algorithm performs the following major steps:

1. For each class c of each observation n_i,

2. Compute the frequency of each feature value m_j.

This leads to a computational complexity of O(n · m).

The common denominator in all these algorithms is that the performance is mainly dominated by the number of observations n and the number of features m. We therefore include these two parameters in our benchmarking process.

6.5 Benchmarking Design

Benchmarking is the process of stress testing a system to gain performance insights. This process can be costly and time-consuming if not properly designed. To overcome this challenge, we carefully designed our benchmarking approach by focusing on the parameters that matter most in our investigation. We used HiBench [74], which enables the execution of various machine learning workloads using Apache Spark's machine learning library. However, by default, HiBench has some limitations because it does not report metrics about the configuration parameters used when running jobs. We had to add a custom implementation to enable this functionality. The details of these implementations are discussed in section 6.5.1 and section 6.5.2. Furthermore, to avoid workload bias in the benchmarking process, we include a combination of classification, regression and clustering algorithms such as linear regression, logistic regression, k-means, SVM, naive Bayes and random forest applications. Workload parameters such as spark.executor.memory, spark.executor.cores, etc. are configured during the execution of the benchmarks. The design of our benchmark is parameterised as follows.

BM_J^A = [D, P_{1,..,n}]_J^A

For each benchmark denoted BM, D is the data to process, A can be any application such as k-means, and J is an instance run of algorithm A with specific configuration settings. P denotes the list of application and framework parameters, such as spark.executor.memory = 8GB and spark.executor.cores = 4, for job instance J of application A. For each of these benchmarks, we assign values according to the ranges provided in Table 6.1. For example, we use 2GB, 4GB, 6GB and 8GB as values for both the driver and executor memory parameters; a sketch of how such benchmark instances can be enumerated is shown below. HiBench is mainly implemented using Bash, and some aspects are implemented using Python. The Bash scripts are used to automate and interact with the underlying big data framework, such as Apache Spark. The customisation we have made is mainly in Bash.
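For illustration, the following Python sketch enumerates benchmark instances BM_J^A over a subset of the parameter ranges in Table 6.1. The application names and ranges shown are examples of the kind of sweep performed; the actual sweep in this thesis is driven by the Bash runner in Listing 6.2.

from itertools import product

# Hypothetical value ranges taken from the candidate parameters in Table 6.1
scale_profiles = ["tiny", "small", "large", "huge", "gigantic", "bigdata"]
executor_instances = [4, 8, 16, 24, 32]
executor_memory_gb = [2, 4, 6, 8]
executor_cores = [2, 4, 6, 8]

jobs = []
for app in ["kmeans", "svm", "linear", "logistic", "bayes", "rf"]:
    for scale, n_exec, mem, cores in product(scale_profiles, executor_instances,
                                             executor_memory_gb, executor_cores):
        # One benchmark instance BM_J^A = [D, P_1..n]_J^A
        jobs.append({"app": app, "scale_profile": scale,
                     "spark.executor.instances": n_exec,
                     "spark.executor.memory": f"{mem}g",
                     "spark.executor.cores": cores})

print(len(jobs))  # number of instance runs in this particular sweep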

6.5.1 Customising HiBench

Recall that the default HiBench implementation reports only execution time and throughput metrics and does not report any of the key configuration parameters we identified. To solve this, we had to add our own implementation to include all the configuration parameters listed in Table 6.1. To achieve this, we first define the new parameters in a Python mapping dictionary, as shown in Listing 6.1. These mappings are then loaded into a configuration file and assigned dynamic values at runtime.

HiBenchEnvPropMapping = dict(
    SPARK_HOME="hibench.spark.home",
    SPARK_MASTER="hibench.spark.master",
    SPARK_EXAMPLES_JAR="hibench.spark.examples.jar",
    # ......
    # New parameters we have added to HiBench for reporting
    MAP_PARALLELISM="spark.default.parallelism",
    REDUCE_PARALLELISM="spark.sql.shuffle.partitions",
    SSC="spark.shuffle.compress",
    SSSC="spark.shuffle.spill.compress",
    SICC="spark.io.compression.codec",
    SBC="spark.broadcast.compress",
)

Listing 6.1: Mapping from properties to environment variable names.

The second part of customising HiBench is to add the new configuration parameters by extending the report generation function. This implementation is available in the Appendix in Listing B.1.

6.5.2 Dynamic Configuration Files

One of the main challenges we faced in the design and implementation of a generic benchmark is providing values for configuration parameters at runtime. For each job, the modification should automatically impute new parameter values into the HiBench property files and persist metrics to an output file. The vanilla HiBench implementation does not provide any mechanism to do this, and doing it manually for the 2000+ executions of each application is simply not feasible. To address this problem, for each job, we dynamically impute configuration values at runtime.

6.5.3 Benchmark Runner Implementation

After modifying HiBench to report the new configuration parameters, the next step is to run the selected machine learning applications on a cluster to profile their performance. Due to the vast range of options for each configuration parameter, the applications are executed more than 2000 times. We used the benchmark runner in Listing 6.2 to automate this process.

1 #!/bin/bash
2 scaleProfile=(tiny small large huge gigantic bigdata)
3 for i in "${scaleProfile[@]}"
4 do
5     #Set initial configuration settings for each scaling profile.
6     sed -i.bak s/"hb.scale.profile.*"/"hb.scale.profile $i"/g hb.conf
7
8     #Generate the dataset, this can be tiny to bigdata
9     bin/workloads/ml/kmeans/prepare/prepare.sh
10    numExecutor=(4 8 16 24 32)
11    for j in "${numExecutor[@]}"
12    do
13        sed -i.bak s/"hb.yarn.exe.num.*"/"hb.yarn.exe.num $j"/g spark.conf
14        executorMem=(2 4 6 8)
15        for k in "${executorMem[@]}"
16        do
17            sed -i.bak s/"spark.exe.me.*"/"spark.exe.memory ${k}g"/g spark.conf
18            #Do the same for the rest of the configuration parameters
19            ...
20            #Run the workload
21            runworkload.sh
22            ...
23        done
24    done
25 done

Listing 6.2: Benchmark Runner: A fragment of the implementation of the benchmark runner.

On line 2, we define the scale profile, which represents the amount of data to generate for each profiling step; this ranges from a tiny data set of a few megabytes to a massive dataset of hundreds of gigabytes. On line 9, for each ML algorithm, we generate the synthetic dataset for processing. We then scale by a factor and impute each of the candidate parameters listed in Table 6.1 into the respective configuration files. For example, we scale the executor memory from 2GB to 8GB and the number of executors from 4 to 32. We follow the same procedure for the rest of the parameters. For each application, we collect 2000+ observations for each parameter. The goal is to collect enough data for the performance modelling process.

6.5.4 Discussion and Observations

In this section, we discuss what we have observed for each of the key configuration parameters. These observations are supported by various distribution and box plots. In all the plots, the vertical dotted red line represents the mean value of the distribution. The frequency shows the number of observations (execution times) in each bin. On the x-axis, lower values represent better performance.

6.5.5 spark.executor.instance

This parameter controls how many Spark executor processes to start. For the SVM algorithm, the size of the synthetic data set we generate is between 200MB and 12GB. From the distribution plots in Figure 6.4, we can see that on average we get the best performance when the number of executors is set to 4 or 8. However, as shown in Figure 6.5, we have observed that for small datasets such as 191MB, setting the number of executors to a large value does not translate into a performance improvement. This is because the number of executors and their memory settings play a major role in a Spark job; assigning too many executors with too much memory often results in excessive garbage collection delays [104].

Figure 6.4: Frequency distribution plot showing the effect of scaling the number of executors for the K-means algorithm. Frequency shows the number of observations (execution times) in each bin. The evaluation shows that adding more resources such as the number of executors, executor CPU or executor memory may not always translate to an increase in performance. This is also evident in the 191MB vs 12GB plots in Figure 6.5.

Figure 6.5: Box plot showing the effect of scaling the number of executors. In these plots, the lower the time on the y-axis, the better the performance. We can see that unnecessarily increasing the number of executors does not translate into a performance improvement. For example, processing 191MB of data using 32 executors adds additional overhead for the framework to start, manage and stop the launched executors, and hence a performance overhead.

6.5.6 spark.executor.memory

This parameter controls how much memory is assigned to each executor process. Figure 6.6 shows the various distributions and the mean value of the performance as we scale memory from 2GB to 8GB for the SVM algorithm. In each case, there is a trend of performance improvement. For example, in the 2GB plot, the average execution time is around 120s, while for the 8GB plot the average execution time is 90s. This makes sense since Spark is primarily an in-memory computational framework.

6.5.7 spark.executor.cores

This parameter controls the number of CPU cores assigned to an executor, which affects the number of concurrent tasks that an executor can run. From Figure 6.7, we observed a performance improvement from 2 to 6 cores but almost no improvement beyond that, because setting the number of executor cores too high leads to poor HDFS I/O throughput. Furthermore, this observation is in line with Amdahl's law: most applications are not entirely parallel, so the serial parts take the same time no matter how many CPUs are thrown at them. Finally, it has been widely reported that the optimal number of cores for each executor should be set to five [104] when running Spark applications.

Figure 6.6: Scaling executor memory for the SVM algorithm. We can observe an increase in performance as we increase the memory. This makes sense because SVMs are memory intensive.

Figure 6.7: Scaling executor cores for Random Forest: performance improved from 2 to 6 cores. However, at 8 cores, the performance starts to degrade due to poor HDFS I/O throughput.

6.5.8 spark.default.parallelism

This parameter controls the number of partitions of an RDD, which in turn controls the number of tasks that a stage can have. Although Apache Spark is a parallel processing engine, achieving optimal parallelism is not automatic. Since the number of partitions is proportional to the number of tasks in a stage, an RDD with few partitions will spawn few tasks, which may result in cluster resources being underutilised. If the partitions are too large, this may create a memory bottleneck; in such cases, data will be persisted to disk, which will generate an I/O bottleneck. We have observed that setting the parallelism to at least the number of available cores yields the best performance, as sketched below. Setting the value too high creates unnecessary overhead for the framework, thereby decreasing performance.
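As a small illustration of the relationship between partitions and tasks, the following PySpark sketch repartitions an input RDD to the configured default parallelism (the input path is hypothetical); this mirrors the rule of thumb above of using at least as many partitions as available cores.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallelism-example").getOrCreate()
sc = spark.sparkContext

# Each partition becomes one task in a stage, so match partitions to the
# parallelism configured for the cluster (falling back to 8 if unset).
target_partitions = int(sc.getConf().get("spark.default.parallelism", "8"))

rdd = sc.textFile("hdfs:///data/sample_input.txt")    # hypothetical input
rdd = rdd.repartition(target_partitions)
print(rdd.getNumPartitions())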

6.6 Model Building

At this point, we have identified the key configuration parameters for all six ML applications we are modelling. We have also generated datasets that we can use to build models. We now explain the various steps we have taken to build accurate performance models for these applications. Since predicting execution time is a regression problem, we use linear regression, random forest, decision trees and support vector machines to model the applications. We compare their performance and select the best performing model using the caret package in R [84]. We have used the caret package because of its simplicity in streamlining the model building and evaluation process. For example, we can specify how we intend to train a model by specifying parameters such as the type of resampling, e.g. k-fold cross-validation (repeated or once), the number of iterations and the tuning methods. A general approach to model building using Support Vector Regression is shown in Appendix A.5. We follow the same procedure for the other regression algorithms; more details are discussed in subsection 6.6.2. The full reproducible implementation is also available at [113].

6.6.1 Training and Tuning The Models

In this section, we discuss in detail the process we followed to train and tune the model parameters. For each of the DFWC (machine learning) applications discussed in section 6.4, we divide the data collected in the benchmarking phase into training and testing sets at a ratio of 80:20. We then follow the flowchart in Figure 6.8 to accomplish the entire process. First, we pick a regression algorithm. In each training setup, we use 10-fold cross-validation with three repeats. The first training uses the default values without tuning any of the tunable parameters, as shown in Listing 6.3.

Figure 6.8: Modelling flowchart: details the steps we have taken to train and select the best model.

control <- trainControl(method="repeatedcv", number=10, repeats=3)
rfm <- train(Time ~ ., data=trn[c(-6)], method="rf", trControl=control)
svr <- train(Time ~ ., data=trn[c(-6)], method="svmRadial", trControl=control)
lmm <- train(Time ~ ., data=trn[c(-6)], method="lm", trControl=control)
rpm <- train(Time ~ ., data=trn[c(-6)], method="rpart2", trControl=control)

Listing 6.3: Default models without any tuning.

This default setting is crucial as it allows us to evaluate the default behaviour of the models. Models are not tuned by default, so identifying the tunable parameters that give the best accuracy is always crucial. The full list of tunable parameters available in the caret package for each regression algorithm used in this work is listed in Table 6.2. In the next step, we perform hyperparameter optimisation; however, not all the machine learning algorithms in the caret package are tunable. We therefore rely on the guidelines defined in [84] for tuning parameters, which recommend using random and grid [13] search for auto-tuning the parameters. Random search is used to avoid biases in cases where we are unsure of the best range of tunable parameters. Listing 6.4 shows how we used random search to tune the parameters for the four regression models.

Table 6.2: Tunable parameters in the caret package in R. These parameters were used in the model tuning process.

Algorithm       Parameter   Description
Random Forest   mtry        Number of predictors randomly selected as candidates at each split
SVM             sigma       Controls the accuracy of the model; a high value is better but may cause over-fitting
SVM             cost        Defines how much influence a single training example has
Decision Tree   cp          Complexity parameter; controls the size of the tree to select the optimal tree size

ctrl <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
#mtry <- sqrt(ncol(training_set))  # Automatically handled by random search
rfmr <- train(Time ~ ., data=trn[c(-6)], method="rf", trControl=ctrl)
svmr <- train(Time ~ ., data=trn[c(-6)], method="svmRadial", trControl=ctrl)
lrmr <- train(Time ~ ., data=trn[c(-6)], method="lm", trControl=ctrl)
rpmr <- train(Time ~ ., data=trn[c(-6)], method="rpart2", trControl=ctrl)

Listing 6.4: Training and tuning using random search.

In Listing 6.5, we use a grid search method by manually specifying the values of the parameters to try in our tuning problem. The process tries all combinations and locates the one combination that gives the best results.

ctrl <- trainControl(method="repeatedcv", number=10, repeats=3, search="grid")
grid_radial <- expand.grid(sigma = c(0, 0.0001, 0.001, 0.01, 0.02, 0.025),
                           C = c(1, 100, 500, 1000, 1500, 2000, 3000, 4000, 5000))
rfmr <- train(Time ~ ., data=trn[c(-6)], method="rf", trControl=ctrl)
# The sigma/C grid applies to the svmRadial model
svmr <- train(Time ~ ., data=trn[c(-6)], method="svmRadial", trControl=ctrl, tuneGrid=grid_radial)
lrmr <- train(Time ~ ., data=trn[c(-6)], method="lm", trControl=ctrl)
rpmr <- train(Time ~ ., data=trn[c(-6)], method="rpart2", trControl=ctrl)

Listing 6.5: Training and tuning model parameters using the grid method.

• *.GRID (SVM, Random Forest, Linear Model, Trees): here we used the GRID search method to tune the parameters. For Support Vector Regression, a grid space is provided for the search step to search through, supplied via the sigma and cost parameters. Setup parameters: method="repeatedcv", number=10, repeats=3, search="grid", sigma = c(0, 0.0001, 0.001, 0.01, 0.02, 0.025), cost = c(1, 100, 500, 1000, 1500, 2000, 3000, 4000, 5000).
• *.RANDOM (SVM, Random Forest, Linear Model, Trees): here we used the RANDOM search method to tune the parameters. Setup parameters: method="repeatedcv", number=10, repeats=3, search="random".

Table 6.3: Full description and parameters used in the model generation process.

6.6.2 Model Selection

In this section, we cover the process of comparing and selecting the best regression approach for predicting the performance of the ML applications.

Figure 6.9: Model evaluation plot for RMSE. RMSE is a measure of the standard deviation of the prediction errors; the lower the value with respect to the measured variable, the better the performance. From this plot, we can see that the algorithm with the best value is SVM.

To select the final set of configuration parameters from the candidate list in Table 6.1, we analyse the p-values of the candidate parameters. We have found that in most cases, executor memory, executor cores, executor instances, driver memory, parallelism, the number of observations and the number of features are the parameters that affect performance the most.

We refer to these parameters as the key configuration parameters. As for the model selection, we study the effect of the p-value, the Root Mean Squared Error (RMSE) [9] and R squared (R2) [97]. RMSE measures the standard deviation of the residuals or prediction errors, while R2 measures how well the regression predictions approximate the real data points; an R2 of 1 indicates that the regression predictions perfectly fit the data. In Figure 6.9, we can see that SVM has the best performance in terms of RMSE value, while in Figure 6.10 the Random Forest algorithm has the best R2 value.
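For reference, the two selection metrics have their standard definitions, where y_i are the measured execution times, \hat{y}_i the predicted times and \bar{y} the mean of the measured values:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}.
\]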

Figure 6.10: Model evaluation plot using R2. R2 measures how well the regression predictions approximate the real data points; the closer the value is to 1, the better the performance. From this plot, we can see that the algorithm with the best R2 value is Random Forest.

6.7 Experiment and Evaluation

The main goal of this experiment is to test the accuracy of our methodology on the applications discussed in section 6.4 using only the key configuration parameters. We argue that, using these key configuration parameters, we can accurately model and predict the execution time of applications in the dataflow-with-cycles pattern.

6.7.1 Setup

Cluster Configuration: Our setup includes an 8-node in-house cluster running Apache Spark version 2.4.0 and Hadoop version 3. Each of these nodes has 20GB of RAM, eight vCPUs and 500GB of storage space. YARN is allocated a total of 19GB of memory on each node, and a maximum of 6 containers can run at a time on each worker node. We have reserved the rest of the resources for the framework daemons. On all eight nodes we have installed Ubuntu 18.04.1 LTS, and HDFS replication is set to three.

Applications and Workloads: We have used six popular machine learning applications for the prediction task: SVM, Linear Regression, Logistic Regression, K-means, Random Forest and Naive Bayes. We used the reserved test data to test the accuracy of the models on unseen data. We customised HiBench's default implementation to run the selected ML workloads, which are based on Apache Spark MLlib. For each of the ML algorithms discussed, we collect data relating to the key parameters and follow the model building flowchart in Figure 6.8. For each profiling step, we scale the data size from 300MB to 12GB, which directly affects the n and m parameters listed in Table 6.1.

6.7.2 Evaluation of Results

Here we analyse and discuss the accuracy of predicting each of the ML applications. For each application, we perform multiple executions of the workload, varying each of the final configuration parameters and noting the execution time. We then use the developed model to predict and compare the result with the measured value. In all the plots, both the x-axis and y-axis are measured in seconds; the goal is to get the points clustered close to the blue line. The y-axis represents the time it takes for that observation to run on our cluster. On average, the prediction error of each of the algorithms is at most ±14% from the observed value. Below we discuss the results of each of the six algorithms covered.
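A minimal sketch of the error metric, assuming it is the absolute difference between the predicted and measured execution times relative to the measured value, which is consistent with the |Error%| columns in the earlier result tables:

def error_pct(predicted_sec: float, actual_sec: float) -> float:
    """Absolute prediction error as a percentage of the measured execution time."""
    return abs(predicted_sec - actual_sec) / actual_sec * 100.0

# Illustrative values: a predicted run of 975s against a measured run of 1080s
print(round(error_pct(975, 1080)))   # -> 10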

6.7.3 Support Vector Machines and Naive Bayes

As shown in Figure 6.11 and Figure 6.12, the model performs well in predicting the SVM and Naive Bayes applications included in HiBench [74]. Analysis of the results shows that, on average, the overall prediction error rate is ±16% for SVM and ±12% for Naive Bayes.

Figure 6.11: The figure shows the prediction results of the HiBench [74] SVM workload with an average prediction error rate of ±16%.

Figure 6.12: Prediction results for the Naive Bayes workload developed by HiBench with an average prediction error rate of ±12%.

However, as a result of the range of values we have covered, the prediction accuracy degrades when predicting extreme values. This can be addressed by having more training data close to these extreme values.

6.7.4 Linear and Logistic Regression

Figures 6.13 and 6.14 shows the results of Linear and Logistic Regression applications using the HiBench benchmarking suite.

Figure 6.13: Prediction results of a sample Linear Regression workload using the HiBench benchmarking suite. The average prediction error rate of our model is 12% ± From the plots, we observe that the performance of our prediction models is good in most cases with an average prediction error of 12% for the linear regression workload and 14% ± ± for the logistic regression workload. Both algorithms have similar performance, which may be due to their internal implementation. 128 CHAPTER 6. BENCHMARKING AND MODELLING DATAFLOW WITH CYCLES PATTERN (DFWC)

Figure 6.14: Prediction results of a sample Logistic Regression workload using the HiBench benchmarking suite. The average prediction error rate of our model is ±14%.

6.7.5 Random Forest and K-means

As shown in Figures 6.15 and 6.16, both algorithms have similar performance behaviour with high execution times. The prediction error for both algorithms is ±16%.

Figure 6.15: Prediction results of a sample k-means workload using the HiBench benchmarking suite. The average prediction error rate of our model is ±16%.

The k-means workload tests K-means clustering using the spark.mllib implementation. The input data set is synthetic data generated by GenKMeansDataset based on uniform and Gaussian distributions [75]. For K-means, we set K to 10 following the official HiBench guide and scale from 5 to 10. The random forest workload is also implemented in spark.mllib, and its input data set is generated by RandomForestDataGenerator [75]. The features are scaled from 100 to 10K while the observations are scaled from 10K to 100K.

Figure 6.16: Prediction results of a sample Random Forest workload using the HiBench benchmarking suite. The average prediction error rate of our model is ±16%.

6.7.6 Evaluation of Model Performance

Models are not perfect because their accuracy depends on different factors, such as the amount of training data; generally, the more training data, the better the results. In this section, we evaluate the accuracy of each of the models based on key model parameters such as the amount of data and executor memory. Figures 6.17 and 6.18 show the performance of scaling both memory and data size from 1GB to 100GB while keeping all other key configuration parameters at their optimal values. As we can see from Figure 6.17, all the models except Linear Regression converge at some point; that is, the model's prediction does not change beyond that point.

Figure 6.17: Scaling the application's input data size from 1GB to 100GB. In all plots, we observe a flat trend from a specific point; this is where the performance of our models starts to degrade. This could be improved by providing more training data. The goal here is to observe the limitations of our models given our setup.

The model predicts the same value, no matter how large the size of the data is. For example, for SVM, K-means and Logistic Regression, the models converge around 75GB of data. More training data would solve this problem. Given that the maximum value for the data size feature is 12GB, we can infer that the general performance of the models is acceptable. From Figure 6.18, in all the plots, we observe no performance improvement from 35GB of memory onwards. Normally, it would be expected that the more memory, the better the performance. However, we can see that performance improves up to 25GB but decreases around 50GB and then converges to the same value. This makes sense because running executors with too much memory often results in excessive garbage collection delays. Similar observations were made for the rest of the key configuration parameters.

Figure 6.18: Scaling executor memory from 1GB to 100GB. In all the plots, the flat line shows where the models start to perform badly. To explain the fluctuation in each plot: from the onset, performance improves as we increase the amount of memory up to a certain point, then degrades because assigning unnecessary additional memory to the executors results in excessive garbage collection. Our goal is to find the optimal memory setting for each setup.

6.8 Decision Making

The final component of the methodology discussed in section 6.2 is decision making. The models we built and validated in the previous sections can therefore be used to make informed decisions. We can answer two broad questions:

• How long it will take to run a particular application

• What the best configuration is for a particular application

These two questions address both research questions. To answer them, we developed a REST API to act as an interface to the models. For the first question, a call to the REST API using curl or any other REST querying utility returns the time it takes to run a particular application with specific configurations. An example of doing this using curl on a Linux terminal is shown in Listing 6.6. The curl command sends a configuration specified by the -d switch to the REST endpoint at https://sc306.host.cs.st-andrews.ac.uk/dfwc/.

1. Case Study 1: Predicting Execution Time. How long it takes to run a particular application given certain configuration settings. This is demonstrated in Listing 6.6. The API call returns the predicted execution time in seconds.

curl \
  -X POST \
  -d '{"DataSizeGB":5, "NumEx":16, "ExCore":4, "ExMem":8, "LevelPar":4, "App":"SVM"}' \
  -H 'Content-Type: application/json' \
  https://sc306.host.cs.st-andrews.ac.uk/dfwc/
#Returns the time it takes to run
{"rf":[162.8945],"svm":[412.2975]}

Listing 6.6: Using curl to interact with the REST API to get the execution time of running an application with specific configuration values.

We are interested in the accuracy of our results. For each application, we only scale the data size and keep all other application settings based on HiBench's recommended settings. In plain English, we are querying the endpoint to return how long an "SVM" application will take if we have 5GB of data to process, with Apache Spark using 16 executors, each with 8GB of memory and four cores. We also set the level of parallelism to 4.

2. Case Study 2: Best Configuration Parameters. The objective here is to get the best configuration in terms of execution time, or to meet a particular deadline for an application. Listing 6.7 shows both the API call and the results for the best and worst configuration parameters for running the k-means application processing 12GB of data.

curl \
  -X POST \
  -d '{"DataSizeGB":12, "App":"KMEANS"}' \
  -H 'Content-Type: application/json' \
  https://sc306.host.cs.st-andrews.ac.uk/best/
#"Best Config":            |"Worst Config":
#"DataSize": 12,           |"DataSize": 12,
#"Numex": 16,              |"Numex": 4,
#"ExMem": 6,               |"ExMem": 2,
#"ExCore": 6,              |"ExCore": 2,
#"LevelPar": 8,            |"LevelPar": 8,
#"Predictions": 88.14      |"Predictions": 912.99

Listing 6.7: The best and the worst configuration settings for running k-means on 12GB of data. We are interested in the accuracy of our results. For each application, we only scale the data size and keep all other application settings based on HiBench's recommended settings. For example, for K-means we keep k constant and scale the dataset from 1GB to 12GB.

Using the API, we found that the best performance for the k-means application processing 12GB of data takes 88.14s with the following key configuration values (Executors=16, Executor Memory=6GB, Executor Cores=6, Level of Parallelism=8).

On the other hand, the worst performance for the same data size takes 912.99s when the following key configuration settings are used (Executors=4, Executor Memory=2GB, Executor Core=2, Level of Parallelism=2).

Listing 6.8 shows how we address the second problem of recommending the best configuration settings for running a particular application. For each application of interest, we enumerate all possible combinations of the configuration parameters and call the REST API to return the execution time. These results are gathered and stored in persistent storage for querying.

#!/bin/bash
app=(RF KMEANS SVM BAYES LR LINEAR)
for h in "${app[@]}"
do
  echo -e "DataSize\tNumex\tExMem\tExCore\tLevelPar\tPredictions" >> $h.txt
  data=(1 2 4 6 8 12)
  for i in "${data[@]}"
  do
    nex=(4 8 16 24 32)
    for j in "${nex[@]}"
    do
      exm=(2 4 6 8)
      for k in "${exm[@]}"
      do
        nCores=(2 4 6 8)
        for l in "${nCores[@]}"
        do
          lp=(8 16 32 64)
          for m in "${lp[@]}"
          do
            result=$(curl -X POST -d '{"DataSizeGB":'"$i"', "NumEx":'"$j"', "ExCore":'"$l"', "ExMem":'"$k"', "LevelPar":'"$m"', "App":'\"$h\"'}' -H 'Content-Type: application/json' https://sc306.host.cs.st-andrews.ac.uk/dfwc/)
            #echo $result
            fresult=$(echo $result | jq --compact-output '.svm' | grep -Eoh '[0-9]*\.[0-9]*')
            echo -e "$i\t$j\t$k\t$l\t$m\t$fresult" >> $h.txt
          done
        done
      done
    done
  done
done

Listing 6.8: Prerequisite to getting the best configuration parameters.

Listing 6.9 shows how we query the persistent data to return the best and the worst configuration settings for running an application. From lines 2 - 3, we read the data and filter it by the size of the data we want to process. Lines 5 and 6 get the minimum execution time and the corresponding best configuration, respectively. To get the worst configuration settings, we take the maximum prediction and return that observation from the persistent dataset.

1 getBestConfig <- function(datafile, title, ds){
2   f <- read.table(file=datafile, header=TRUE)
3   f <- filter(f, f$DataSize==ds)
4
5   min_val <- min(f$Predictions)
6   min_ob <- filter(f, f$Predictions==min_val) %>% head(1)
7
8   max_val <- max(f$Predictions)
9   max_ob <- filter(f, f$Predictions==max_val) %>% head(1)
10
11   return(list("minob"=min_ob, "maxob"=max_ob))
12 }
13 #Now execute and get the best and worst configurations
14 p <- getBestConfig("KMEANS.txt", "Title", 12)

Listing 6.9: Querying the persistent data for the best and the worst configuration settings.

3. Case Study 3: Optimal Cloud Instance. After getting the best parameter values for an application, the next logical step is to run that application in the cloud or on a local cluster. One of the main goals of cloud users is to minimise the cost of operations by renting the cheapest instances that still deliver the required performance. Listing 6.10 shows how the API is used to recommend two optimal Amazon EC2 instances based on key configuration parameters such as data size, executor memory, executor cores and number of executors.

1 curl 2 -X POST 3 -d ’{"DataSize":12, "ExMem":10, "ExCore":5, "NumEx":10}’ 4 -H ’Content-Type: application/json’ 5 https://sc306.host.cs.st-andrews.ac.uk/bestec2/ 6 #"BestEC2_1": | #"BestEC2_2": 7 #"PricePerUnit": USD 0.12, | "PricePerUnit": USD 0.06, 8 #"vCPU": 32, | "vCPU": 16, 9 #"Memory": 60, | "Memory": 30, 10 #"TotalNodes": 2 | "TotalNodes": 4 Listing 6.10: Two best cost effective EC2 instances for running kmeans application. 6.9. RELATED WORK 135

6.9 Related Work

First, a summary of key comparison metrics is presented in Table 6.4. In this section, we discuss related work on both benchmarking and performance modelling and highlight our contributions.

In the benchmarking space, Huang et al. [74] developed one of the most comprehensive benchmarking suites for both the Hadoop and Apache Spark frameworks. Li et al. [87] developed an entire benchmarking suite for the Apache Spark framework. Ceesay et al. [21] and Varghese et al. [124] developed a pluggable, container-based benchmarking framework to address the technical difficulties associated with deploying benchmarking suites. Varghese et al. [125] built a benchmarking suite to evaluate the performance of scientific applications in the cloud. Our main contributions in the benchmarking space are a systematic method of selecting key configurations to benchmark and the extension of HiBench to include additional key configuration parameters not included in the default implementation.

Table 6.4: Comparison of related work. Our framework achieves one of the highest accuracies compared to the most relevant related works, and it is the only one that provides best-configuration and cloud-instance recommendation. As specified at the beginning of this thesis, the average prediction error is the average difference between the actual and predicted values.

Comparison Metric           [59]    [27]    [127]   [101]   [132]   [76]    DFWC
Key Config Studies          ✓       ✗       ✗       ✓       ✗       ✗       ✓
Multi Objective Framework   ✗       ✗       ✓       ✗       ✗       ✓       ✓
Benchmarking                ✓       ✗       ✓       ✗       ✗       ✓       ✓
Performance Modelling       ✗       ✓       ✓       ✓       ✗       ✓       ✓
Execution Time              ✓       ✓       ✓       ✗       ✓       ✓       ✓
Best Configuration          ✗       ✗       ✗       ✗       ✗       ✗       ✓
Cloud Instance Recomm.      ✗       ✗       ✗       ✗       ✗       ✗       ✓
Deadline                    ✗       ✗       ✓       ✗       ✗       ✗       ✓
Average Prediction Error    20%     17%     15-20%  15%     14-20%  7%      14%

We now present the state-of-the-art in the performance modelling space and highlight our contributions, making reference to the works listed in Table 6.4.

Gounaris and Torres [59] proposed a trial-and-error methodology for performance tuning and later improved it into a systematic approach for identifying candidate configurations for better performance. The main weaknesses of their work are that, firstly, it has little to do with performance prediction and, secondly, it only investigates the impact of tunable Spark parameters relating to shuffling, compression and serialisation. This approach is less general because it excludes application and execution-behaviour parameters such as executor memory, number of executors and executor cores. In our work, we investigate the effects of all key framework parameters to gain a much better understanding of the entire framework.

Chao et al. [27] developed a performance prediction model with a focus on the stage execution of each application. They focus only on the prediction of execution time, without addressing other essential aspects such as best configurations; our work is multi-objective because it includes the prediction of execution time as well as of the best configurations to meet specific objectives such as a deadline. Secondly, in their benchmarking approach, they simply depend on previous studies to identify their key configuration parameters, without providing empirical evidence to validate their choices.

Venkataraman et al. [127] and Islam et al. [76] developed multi-objective performance prediction frameworks, but the drawback of their work is the use of minimal, high-level configuration settings. For example, in [127], the input to the prediction models depends only on the hardware configuration, such as the number or type of machines, while [76] models the application completion time with respect to the number of executors only.

Nguyen et al. [101] developed an execution-model-driven framework that studies the effects of different configuration parameters. However, their work only investigates the impact of key configuration parameters and does not include any performance prediction or forecasting based on those parameters; they are only interested in answering questions such as "How does configuration setting X affect the execution behaviour of Spark?"

Wang and Khan [132] presented a simulation-driven prediction model that can predict job performance for Apache Spark; however, their work is more focused on job prediction than on evaluating the effects of configuration parameters. The main difference between their work and the rest of the works presented here is the use of workload traces, such as IO, CPU and memory, instead of configuration parameter settings, and they used an analytic rather than an ML approach to model the performance of applications. Our work, in contrast, focuses on the challenge of dealing with the ubiquity of configuration parameters and then uses those parameters to predict performance.

From all these works, it is clear that understanding and predicting the performance of highly configurable frameworks is of great interest to the research community. Our work in this area is unique in two ways. First, to avoid any bias, we have included configuration settings from the data processing framework, the computing cluster and the applications.
Secondly, as shown in Figure 6.1, we have also developed a multi-objective framework that can be used to predict execution time and to recommend the best configuration and the optimal cloud instance to meet specific objectives such as deadlines.

6.10 Discussion

There are many algorithms that fit under the dataflow with cycles pattern; we argue that the approach used in this work could be used to model the performance of any DFWC application accurately. Our first research question is validated in the experiment section.

• General Approach: Just like the MapReduce pattern, we used a grey-box approach, combining both application and framework performance drivers to model the performance of DFWC applications. At the time of writing this thesis, most works, such as [127], [132], [102], [131], used either a black-box or a white-box approach. We have shown that using a hybrid approach gives a better understanding of the entire execution framework and, therefore, better performance prediction results. As shown in Table 6.4, we achieve one of the highest prediction accuracies, with a mean prediction error of approximately 14%.

• Use of Representative Workloads based on Communication Patterns: Using representative workloads to test and validate the accuracy of the models is crucial in assessing their generality. To address this challenge, we focused on machine learning algorithms because their iterative nature fits the DFWC pattern, and we used a variety of ML algorithms from the regression, clustering and classification domains. Works such as [76], [118], [131] have good results in performance modelling; however, they used non-iterative workloads for the profiling and model-building process. We therefore argue that these works could be improved if representative iterative workloads were used.

• Benchmarking Approach: The main goal of our benchmarking approach is to avoid bias towards a specific application or a few configuration parameters. To achieve this goal, we designed a generic benchmarking framework to execute workloads with different parameter settings for both the application and the computing framework. As discussed in the literature review, the holistic approach of gathering historical data is expensive, but it yields the best performance results. Works such as [127], [102], [131], [118] all used a heuristic approach to benchmarking to avoid the challenge of exhaustive historical data collection.

• Curse of Configuration Parameters: Apache Spark has more than 200 configuration parameters which can affect the performance of applications, and it is unrealistic to model the behaviour of each of them. To address this, we first performed a thorough literature study to identify candidate configuration parameters, and then carried out further empirical analysis to identify the key configuration parameters. We use these key configuration parameters to build the performance prediction models; a sketch of how quickly even a reduced parameter grid grows is given below.
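To give a sense of scale, consider only the reduced grid actually explored in Listing 6.8 (six data sizes, five executor counts, four executor-memory sizes, four core counts and four parallelism levels); the following minimal sketch counts its combinations using those values.

# Combinations in the reduced parameter grid of Listing 6.8 (values taken from that listing)
data=6; numex=5; exmem=4; excore=4; levelpar=4
echo $(( data * numex * exmem * excore * levelpar ))   # prints 1920 combinations per application

Even this heavily pruned grid requires 1,920 predictions per application, and adding a single further four-valued parameter would quadruple that number, which is why identifying the key configuration parameters first is essential.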

6.11 Summary

In this chapter, we presented our performance prediction framework for data-intensive applications, with emphasis on the DFWC communication pattern. We used a grey-box, hybrid approach to benchmarking and performance modelling. To avoid any bias in the benchmarking process, we used a novel generic benchmarking approach to profile the Apache Spark cluster. In the experiments, we validated our work using representative workloads, and our average error rate is within ±14% of the measured values. Finally, we provided a comprehensive REST API to test our implementation.

One of the main challenges in this work is verifying it on a much larger cluster or in a cloud computing environment. Secondly, we would also be interested to see how we can model SQL-based DFWC applications. Finally, we would like to upgrade our framework to automatically update configuration settings for applications to achieve optimal performance. In the next chapter, we draw lessons learned, highlight future work and conclude.

CHAPTER SEVEN

CONCLUSION

Benchmarking and performance modelling are integral to understanding the performance characteristics of new systems and frameworks. Benchmarking can be used to compare new and emerging systems and to identify their strengths, weaknesses and performance bottlenecks. Performance modelling can be used to answer performance-related questions such as the running time of an application for a given input, or to find optimal settings that satisfy goals such as deadlines or budget.

In this thesis, we focused on benchmarking and performance modelling of data-intensive applications. With the advent of big data, these applications and frameworks are now prevalent in mainstream computing and business processes. There is a vast number of big data workloads, and our initial task was to identify how best to group these workloads and organise them into related patterns. As a result, we identified the two most common communication patterns of big data applications: MapReduce and dataflow with cycles. We then proceeded to enable a multi-objective framework to benchmark and model the performance of applications in these patterns.

Generally, we followed a four-step methodology in our research process. The first step deals with the problem of identifying key configuration parameters out of the 200+ configuration parameters of big data frameworks such as Apache Hadoop and Apache Spark; this helps to design a more general benchmarking and performance prediction framework. The second step enabled a generic benchmarking framework for each of these communication patterns based on the selected computing framework. The main goal of this step is to profile a cluster by executing generic or representative workloads to gather historical data for the modelling phase. In the third step, we used a grey-box approach to model the performance of applications in the MR and DFWC communication patterns; the goal of this step is to enable a performance prediction component that answers various performance-related questions. The final step enabled a decision-making component to help users make informed decisions regarding application runtime, deadlines, and the best configuration parameters to meet these goals. We make this available through a web interface and REST API.

7.1 Thesis Summary and Revisiting Research Contributions

The need to research, identify and improve big data frameworks is crucial in dealing with the massive datasets being generated in recent times. Data-intensive applications are increasingly being used in almost all domains, such as education, healthcare, agriculture, business and finance, and government. Below we summarise each chapter of this thesis and highlight its contributions.

• In Chapter 1, we introduce the content of this thesis, the motivation and the contributions of this work. For each contribution, we highlight the problem and our solution to that problem.

• In Chapter 2, we cover the background of this research by discussing various critical concepts of data-intensive applications (DIA). We also cover multiple concepts and definitions of big data frameworks. One of the significant aspects we introduce in this chapter is the dataflow communication patterns, namely the MapReduce and dataflow with cycles patterns.

• In Chapter 3, we surveyed existing work on benchmarking and performance modelling of data-intensive applications, focusing on the MR and DFWC patterns. As a result of this survey, we developed a taxonomy highlighting research gaps in this domain, which we then address in various sections of this research. We consider this in itself one of the significant contributions of this research: at the time of writing this thesis, we found no comprehensive literature review of benchmarking and performance modelling for data-intensive applications.

• In Chapter 4, we introduced the general methodology of this research, highlighting practical use cases. For each of the computing frameworks, we first discussed the methodology used to identify key configuration parameters out of the 200+ configuration parameters. We then discussed the benchmarking and performance modelling used in this work. Using the decision-making component, we presented three broad use cases and highlighted how we address them with the framework.

• In Chapter 5, we present benchmarking and performance modelling of the MapReduce pattern. In section 5.3, we explore the internals of the Apache Hadoop framework by presenting theoretical models of each phase. We benchmark and model each of these phases and combine them to generate the final cost model. In section 5.5, we present in detail how we profile each of these phases to create historical data for performance prediction.

From the experiments, we were able to conclude that applications using the same MR design pattern do have similar performance characteristics. This chapter encompasses two of our main contributions. The first contribution highlights the generation of theoretical models representing each of the six generic phases of the MR pattern. The second contribution is the enabling of a generic benchmarking and performance modelling framework for the MR pattern.

• In Chapter 6, we present benchmarking and performance modelling of the dataflow with cycles pattern (DFWC). In section 6.2, we discussed the overall methodology we used. Sections 6.3 and 6.4 highlight the systematic approach used to select key configuration parameters for the framework and the applications. We discuss benchmarking in section 6.5, where we describe how we customised the HiBench framework to add custom performance metrics. We train and generate models in section 6.6; using the final model, we can answer various performance-related questions with high accuracy. Like Chapter 5, this chapter has two main contributions: a performance prediction framework for the DFWC pattern and a representative benchmarking framework.

7.2 Lessons Learned

In this section, we summarise the key lessons we have learned in this research.

• The generalisation of our approach: Making the models general was one of our core objectives. We have learned that, to achieve this, the benchmarking or framework-profiling process has to be generic and not biased towards a particular set of workloads. For the MapReduce pattern, we used configurable generic workloads that emulate processing different amounts of data in each execution step. For the DFWC pattern, we used a different approach due to its internal complexity: we used representative workloads, such as ML applications, to benchmark the computing framework. This is an essential step towards simplifying the process of building models that can accurately give users performance estimates of their applications. To see these models in action, see the case studies discussed in Chapter 4. At the end of this chapter, we discuss some of the future work that could make these models even more general.

• Communication Patterns: Big data workloads are ubiquitous; however, we learned that they can be grouped into a few categories based on their communication patterns. This is important in our research because, using this knowledge, we were able to group workloads by pattern. For example, we grouped and modelled iterative applications, such as ML applications, using the dataflow with cycles pattern.

• Complexity of Framework Internals: Understanding the performance characteristics of data-intensive applications is not a trivial task. Several factors contribute to this. First, these applications run on frameworks that are simple to use at a high level but highly complex under the hood. Secondly, the applications themselves can be quite complex. Therefore, understanding the internals of both the framework and the application was crucial to the success of this work.

• Many Configuration Parameters: Computing frameworks such as Apache Spark or Apache Hadoop are highly configurable, with some of the configurable parameters having many options. Optimal performance of an application can be achieved by using the correct value for each of these parameters; however, this is not a trivial task, as it requires complete domain knowledge or experience in running a specific application. In this work, we have seen that, using modelling, the best configuration parameters can be inferred from previous runs of an application.

• Key Configuration Parameters: We have also learned that some framework and application-level configuration parameters are more significant than others; we consider these the key performance drivers. We cannot study the effect of every configuration parameter due to the exponential number of their combinations. Therefore, identifying the key configuration parameters for each communication pattern was a crucial step in the success of our research.

• General Approach: We have learned that there are generally three approaches to performance modelling. The first is the black-box approach, which applies high-level machine learning to historical data to predict performance. The second is the white-box approach, which requires a deeper understanding of the frameworks and applications and uses purely mathematical or analytical procedures. The final one is the grey-box, a hybrid approach. In this research, we used the hybrid approach, combining a moderate understanding of the internals of the frameworks and applications with machine learning. This approach works well and provides highly accurate results.

• Generic Benchmarking: Big data applications, like any other applications, have different internal implementations; what is common is that they are typically executed on a platform which follows the same execution pipeline or dataflow for each application. We have learned that a general and generic benchmarking approach saves much time. For example, for the MR pattern, we designed a generic benchmarking framework that works for any application, while for the DFWC pattern, we used representative workloads in the design of the benchmarking framework.

• Performance Modelling: We have learned that models are not perfect, and their accuracy depends on several factors. First, the quality and quantity of training data play a crucial role in model accuracy. This is not an easy problem to solve; generating training data is both time-consuming and cumbersome. The generalisation of models is also hard to achieve, because a model built on a cluster of a few nodes may not transfer to a cluster of hundreds or thousands of more powerful nodes.

In conclusion, this thesis has successfully addressed the research problems discussed in section 1.3 of Chapter 1.

7.3 Future Work

Benchmarking and performance modelling of data-intensive applications is indeed a challenging problem, and in this research we focused on a fragment of the domain. There are many big data frameworks and workloads that use different communication patterns. Below we discuss some of the critical areas that can be explored as future work.

Scalability: Public cloud providers are the de facto platforms for running big data applications because of the low start-up cost and the pay-as-you-go model. We executed our experiments on a single-node and an eight-node in-house cluster to evaluate how general our approach is, and the results were quite promising. However, it would have been much better to run these experiments on public cloud platforms such as Amazon, Microsoft Azure or Google Cloud on a much larger cluster. This would further gauge the generality of both our approach and the results.

More Dataflow Patterns: The second possible future work is to include more big data workloads. We focused on only two communication patterns: MapReduce and dataflow with cycles. Several other dataflow communication patterns are better suited to other big data workloads. Examples are (1) dataflow with barriers, where several MR jobs are chained to solve an iterative-style problem; (2) the Bulk Synchronous Parallel pattern used by graph processing applications, with Pregel [91] and Giraph [65] as example frameworks; and (3) the partition-aggregate pattern, where data is aggregated from several worker nodes using a tree-like structure, as in user-facing online services such as Google, Bing and Facebook.

Better Generic Benchmarking: Although we used generic benchmarking for MR, due to the complexity of DFWC we took a more general approach of using representative applications. Interesting future work in this area would be to explore the possibility of designing a generic benchmarking framework for the DFWC pattern.

More Key Configuration Parameters: Is there a possibility of improving our work by adding more parameters as key configuration parameters? Or could there be a different set of key configuration parameters for some applications? For example, memory-intensive applications may require different configuration settings from CPU-intensive applications for optimal performance.

APPENDIX A

IMPLEMENTATION CODES AND DETAILS

In this appendix, we present the code used in the benchmarking and performance modelling of both the MR and DFWC patterns. We also provide a snapshot of the data, with links to the full source code.

A.1 MapReduce Benchmark Runner

The benchmark runner executes the generic benchmarking code programmed in Java. For each data size, we vary the map selectivity from 10% to 100%. This helps to profile the MapReduce pipeline with different settings.

1 #!/bin/bash 2 echo "Generating Data Using Teragen" 3 dataSize=(500 750 1024 1536 2048 2560 3072 4096 4608 5120) 4 oneMegaByte=1048576 5 for i in "${dataSize[@]}" 6 do 7 numBytes=$(( $oneMegaByte*$i )) 8 numRows=$(( $numBytes/100 )) 9 echo Generatig $numRows Rows Using Teragen 10 hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce- examples-3.0.2.jar teragen -D dfs.blocksize=134217728 $numRows input$i 11 mapSelectivity=(0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1) 12 for j in "${mapSelectivity[@]}" 13 do 14 echo $j

145 146 APPENDIX A. IMPLEMENTATION CODES AND DETAILS

15 hadoop jar ClusterBenchmarking-0.0.1-SNAPSHOT.jar\ 16 bdl.standrews.ac.uk.PlatformDefinedPhaseProfiler\ 17 -D mapreduce.job.reduces=1\ 18 -D dfs.blocksize=134217728\ 19 input$i\ 20 output$i 128 $j 21 hadoop fs -rm -r -skipTrash output$i 22 done 23 hadoop fs -rm -R -skipTrash input$i 24 done

Listing A.1: The benchmark runner executes the generic benchmarking code programmed in Java. For each data size, we vary the map selectivity from 10% to 100%.

A.2 MapReduce Generic Benchmarking Implementation

A generic benchmarking implementation executed to profile the different stages of the MapReduce framework. For each generic phase, we added relevant COUNTERS to track its progress by modifying the MapReduce source. We reference those counters in this implementation.

package bdl.standrews.ac.uk;
/** A generic program executed to profile the different stages of the MapReduce framework.
    The generic phase COUNTERS are in the modified MapReduce source and this file also contains a few counters. */
//Perform all necessary importation
public class PlatformDefinedPhaseProfiler {
    private static final Logger LOG = LoggerFactory.getLogger(MapTask.class.getName());
    public static enum MAP_RED_CUSTOM {
        MAP_TIME_MILLIS, REDUCE_TIME_MILLIS, TOTAL_RUNNING_MILLIS
    };

    public static class PlatformDefinedMapper extends Mapper {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            context.write(value, new Text(""));
        }

        @Override
        public void run(Context context) throws IOException, InterruptedException {
            long startTime = System.currentTimeMillis();
            setup(context);
            Configuration conf = context.getConfiguration();
            int splitSize = Integer.parseInt(conf.get("splitsize"));
            double selectivity = Double.parseDouble(conf.get("selectivity"));
            long totalRows = Utility.getNumberOfRows(splitSize);
            long rows = 0;
            int threshold = (int)(selectivity*totalRows);
            while (context.nextKeyValue()) {
                if (rows++ <= threshold) {
                    map(context.getCurrentKey(), context.getCurrentValue(), context);
                }
            }
            cleanup(context);
            long endTime = System.currentTimeMillis();
            long totalTime = endTime - startTime;
            context.getCounter(MAP_RED_CUSTOM.MAP_TIME_MILLIS).increment(totalTime);
        }
    }

    public static class PlatformDefinedReducer extends Reducer {
        public void reduce(Text key, Text values, Context context) throws IOException, InterruptedException {
            context.write(key, values);
        }

        public void run(Context context) throws IOException, InterruptedException {
            long startTime = System.currentTimeMillis();
            setup(context);
            while (context.nextKey()) {
                reduce(context.getCurrentKey(), context.getValues(), context);
            }
            cleanup(context);
            long endTime = System.currentTimeMillis();
            long totalTime = endTime - startTime;
            context.getCounter(MAP_RED_CUSTOM.REDUCE_TIME_MILLIS).increment(totalTime);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        conf.set("splitsize", otherArgs[2]);
        conf.set("selectivity", otherArgs[3]);

        if (args.length < 2) {
            System.err.println("Usage: PlatformDefinedPhaseProfiler splitsize selectivity");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "PlatformDefinedPhaseProfiler");
        job.setJarByClass(PlatformDefinedPhaseProfiler.class);
        job.setMapperClass(PlatformDefinedMapper.class);
        // job.setCombinerClass(PlatformDefinedReducer.class);
        job.setReducerClass(PlatformDefinedReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TeraInputFormat.class);
        job.setOutputFormatClass(TeraOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        long startTime = System.currentTimeMillis();
        int status = job.waitForCompletion(true) ? 0 : 1;
        long endTime = System.currentTimeMillis();
        long totalTime = endTime - startTime;
        Counters counters = job.getCounters();
        counters.findCounter(MAP_RED_CUSTOM.TOTAL_RUNNING_MILLIS).increment(totalTime);
        System.exit(status);
    }
}

Listing A.2: A generic benchmarking implementation executed to profile the different stages of the MapReduce framework.

A.3 MapReduce Log Parser Implementation

The generic implementation writes data to the MR log files; the log parser processes those log files to extract the relevant parameters used in the modelling stage.

package bdl.standrews.ac.uk;
public class ExtractDataFromLogs {

    public static final int SPLIT_SIZE = 128;

    public void readData(String fileName, PrintWriter out) throws FileNotFoundException {
        Scanner in = new Scanner(new FileInputStream(fileName));

        in.nextLine(); // Skip first line
        int mapSelectivity = 0;
        System.out.println("Operation,Duration,MapSelectivity,MapInputRec,MapOutputRec,Mappers,DataSize,BlockSize");
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();

            if (line.trim().length() > 10) {
                String parts[] = line.split(" ");
                String operation = parts[3];
                String duration = parts[4].split("=")[1];
                int inputDataSize = getTotalInputData(fileName);
                long mapInputRecords = getMapInputRecord(inputDataSize);
                long mapOutputRecords = getMapOutputRecord(inputDataSize, mapSelectivity);
                int numberOfMappers = getNumberOfMappers(fileName, SPLIT_SIZE);

                System.out.println(operation + "," + duration + "," + mapSelectivity + "," + mapInputRecords + ","
                        + mapOutputRecords + "," + numberOfMappers + "," + inputDataSize + "," + SPLIT_SIZE);
                out.println(operation + "," + duration + "," + mapSelectivity + "," + mapInputRecords + ","
                        + mapOutputRecords + "," + numberOfMappers + "," + inputDataSize + "," + SPLIT_SIZE);
            } else if (mapSelectivity < 100) {
                mapSelectivity += 10;
            }

        }
        in.close();
    }

    public int getTotalInputData(String fileName) throws NumberFormatException {
        return Integer.parseInt(fileName.split("\\.")[0].substring(4, fileName.split("\\.")[0].length()));
    }

    public int getTotalDataProcessed(long fileSize, double mapSelectivity) {
        return (int)(fileSize*mapSelectivity);
    }

    public long getMapInputRecord(long fileSize) {
        return (fileSize * 1048576L) / 100;
    }

    public long getMapOutputRecord(long fileSize, double mapSelectivity) {
        return (long)(getMapInputRecord(fileSize)*(mapSelectivity/100));
    }

    public int getNumberOfMappers(String fileName, int splitSize) {
        return getTotalInputData(fileName) / splitSize;
    }

    public int getNumberOfReducers() {
        return 2;
    }

    public static void main(String args[]) throws FileNotFoundException {
        PrintWriter out = new PrintWriter(new File("allOut8Node.csv"));
        out.println("Operation,Duration,MapSelectivity,MapInputRec,MapOutputRec,Mappers,DataSize");
        ExtractDataFromLogs dataFromLogs = new ExtractDataFromLogs();
        for (int i=0;i

Listing A.3: Log parser implementation. MR counters are written to the logs; this program extracts the relevant MR COUNTERS from the logs.

public static class GenericReducer extends Reducer {
    public void reduce(Text key, Text values, Context context) throws Exception {
        context.write(key, values);
    }
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        long startTime = System.currentTimeMillis();
        setup(context);
        while (context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
        long endTime = System.currentTimeMillis();
        long totalTime = endTime - startTime;
        context.getCounter(MAP_RED_CUSTOM.REDUCE_TIME_MILLIS).increment(totalTime);
    }
}

Listing A.4: Generic Reducer Implementation

A.4 Benchmark Runner DFWC

#!/bin/bash
scaleProfile=(tiny small large huge gigantic bigdata)
for i in "${scaleProfile[@]}"
do
  sed -i.bak s/"hibench.scale.profile.*"/"hibench.scale.profile $i"/g conf/hibench.conf

  sed -i.bak s/"hibench.yarn.executor.num.*"/"hibench.yarn.executor.num 32"/g conf/spark.conf
  sed -i.bak s/"spark.executor.memory.*"/"spark.executor.memory 8g"/g conf/spark.conf
  sed -i.bak s/"hibench.yarn.executor.cores.*"/"hibench.yarn.executor.cores 8"/g conf/spark.conf
  sed -i.bak s/"spark.driver.memory.*"/"spark.driver.memory 4g"/g conf/spark.conf

  echo Generating Data
  bin/workloads/ml/rf/prepare/prepare.sh
  echo Data Generation Finished
  numExecutor=(32)
  for j in "${numExecutor[@]}"
  do
    sed -i.bak s/"hibench.yarn.executor.num.*"/"hibench.yarn.executor.num $j"/g conf/spark.conf
    executorMem=(8)
    for k in "${executorMem[@]}"
    do
      sed -i.bak s/"spark.executor.memory.*"/"spark.executor.memory $k"g/g conf/spark.conf
      numCores=(5)
      for l in "${numCores[@]}"
      do
        sed -i.bak s/"hibench.yarn.executor.cores.*"/"hibench.yarn.executor.cores $l"/g conf/spark.conf
        levelOfPar=(2 4 6 8 16 24 32 64)
        for m in "${levelOfPar[@]}"
        do
          echo Running benchmark with $i datasize numExe=$j exeMem=$k numCore=$l levelOfPar=$m Configuration
          #sed -i.bak s/"spark.driver.memory.*"/"spark.driver.memory $m"g/g conf/spark.conf
          sed -i.bak s/"^spark.default.parallelism.*"/"spark.default.parallelism $m"/g conf/spark.conf
          sed -i.bak s/"^spark.sql.shuffle.partitions.*"/"spark.sql.shuffle.partitions $m"/g conf/spark.conf
          #codec=("lz4" "lzf" "snappy" "zstd")
          codec=("lz4")
          for n in "${codec[@]}"
          do
            sed -i.bak s/"spark.io.compression.codec.*"/"spark.io.compression.codec $n"/g conf/spark.conf
            #ssc=("true" "false")
            ssc=("true")
            for o in "${ssc[@]}"
            do
              sed -i.bak s/"spark.shuffle.compress.*"/"spark.shuffle.compress $o"/g conf/spark.conf
              #sssc=("true" "false")
              sssc=("true")
              for p in "${sssc[@]}"
              do
                sed -i.bak s/"spark.shuffle.spill.compress.*"/"spark.shuffle.spill.compress $p"/g conf/spark.conf
                #sbc=("true" "false")
                sbc=("true")
                for q in "${sbc[@]}"
                do
                  sed -i.bak s/"spark.broadcast.compress.*"/"spark.broadcast.compress $q"/g conf/spark.conf
                  bin/workloads/ml/rf/spark/run.sh
                done
              done
            done
          done
        done
      done
    done
  done
done

1 #Load necessary R libraries. The list below is not exhaustive.
2 library(ggplot2)
3 library(dplyr)
4 library(caret)
5 #Training is expensive, so we use a PARALLEL PROCESSING LIBRARY
6 library(doMC)
7 registerDoMC(cores = 8)
8
9 genModelPlot <- function(datafile, title, xlabel, ylabel){
10   data <- read.table(file=datafile, header=TRUE)
11
12   #Perform some data preprocessing steps.
13   data$DataSizeMB <- round((data$Input_data_size/1048576))
14
15   if(datafile == "linear.report"){
16     data <- filter(data, Time <= 100)
17   }
18
19   #Split the data 80% for training and 20% for testing
20   split = sample.split(data$Time, SplitRatio = 0.8)
21   training_set = subset(data, split == TRUE)
22   test_set = subset(data, split == FALSE)
23
24   #Configure how the training should be conducted.
25   ctrl <- trainControl(method="repeatedcv", number=10, repeats=3, search="random")
26
27   #Perform the modelling task
28   model <- train(Time ~ DataSizeGB + NumEx + ExCore + ExMem, data=training_set[c(-6)], method="svmRadial", trControl=ctrl)
29
30   #Perform Prediction
31   test_set$predict <- predict(model, test_set)
32
33   #Plot the prediction vs actual values
34   plot <- ggplot(test_set, aes(Time, predict)) + geom_point() +
35     geom_smooth(method = "lm", se = FALSE)
36
37   return(list("plot"=plot, "data"=test_set))
38 }

Listing A.5: A general method of modelling and performance prediction.

In the first few lines, we load the necessary libraries such as caret and doMC. Caret is the main R package used for seamless model building; on a basic level, the doMC package provides the parallel backend functionality for training the caret models. From lines 10 to 17, we read the input data and perform some necessary data preprocessing, such as removing outliers. From lines 20 to 22, we divide the data into training and test sets. Line 25 specifies the training method of repeated cross-validation: the parameter number is set to 10, representing the number of folds (re-sampling iterations), and we repeat this process three times by setting repeats to 3. Finally, we set the hyper-parameter optimisation method to either the random or the grid method [13]. On line 28, we kick off the actual model training process by passing the model control settings and specifying the independent and dependent variables. We then predict and plot on the test data to finalise the modelling process.

Table A.1: List of GitHub URLs. These URLs contain both the raw data and the implementations used to generate the models.

SN  Implementation                             URL
0   Plug and Play Bench                        https://github.com/sneceesay77/papb
1   MR Benchmarking                            https://github.com/sneceesay77/mr-performance-modelling
2   MR Modelling                               https://github.com/sneceesay77/mr-modelling
3   HiBench Customised and DFWC Benchmarking   https://github.com/sneceesay77/HiBench-Customised
4   DFWC Modelling                             https://github.com/sneceesay77/Thesis-RModelling
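The raw data and the implementations described above can be obtained directly from the repositories in Table A.1; the following is a minimal sketch of fetching the DFWC modelling code, assuming git is available locally.

# Clone the DFWC modelling repository listed in Table A.1
git clone https://github.com/sneceesay77/Thesis-RModelling
cd Thesis-RModelling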

APPENDIX B CUSTOMISING HIBENCH

In this function, the most important section is line 14. This is where we add all the calculated and configuration values to the REPORT_LINE variable. This variable is then appended to the report file.

1 function gen_report() {
2     local start=$1
3     local end=$2
4     local size=$3
5
6     local duration=$(echo "scale=3;($end-$start)/1000"|bc)
7     local tput=`echo "$size/$duration"|bc`
8
9     local nodes=`echo ${SLAVES} | wc -w`
10    nodes=${nodes:-1}
11
12    local tput_node=`echo "$tput/$nodes"|bc`
13
14    REPORT_LINE=$(printf "${REPORT_COLUMN_FORMATS}" ${HIBENCH_WORKLOAD_NAME} $(date +%F) $(date +%T) $size $duration $tput $tput_node ${YARN_NUM_EXECUTORS} ${YARN_EXECUTOR_CORES} ${SPARK_YARN_EXECU_MEMORY} ${MAP_PARALLELISM} ${SICC} ${SSC} ${SSSC} ${SBC})
15
16    echo "${REPORT_LINE}" >> ${HIBENCH_REPORT}/${HIBENCH_REPORT_NAME}
17 }

Listing B.1: Adding new configuration parameters for reporting.


B.1 Dynamic Configuration

Listing B.2 shows a sample imputation implementation where we use the standard Linux utility sed to find and replace configuration settings with new values. A detailed implementation is shown in Listing 6.2.

#!/bin/bash
scaleProfile=(tiny small large huge gigantic bigdata)
for i in "${scaleProfile[@]}"
do
  sed -i.bak s/"hb.sc.profile.*"/"hb.scale.profile $i"/g hb.conf
  sed -i.bak s/"hb.yarn.exe.num.*"/"hb.yarn.executor.num 32"/g spark.conf
  sed -i.bak s/"spark.exe.mem.*"/"spark.executor.memory 8g"/g spark.conf
  sed -i.bak s/"hb.yarn.exe.cores.*"/"hb.yarn.exe.cores 8"/g spark.conf
  sed -i.bak s/"spark.driver.mem.*"/"spark.driver.memory 4g"/g spark.conf
done

Listing B.2: Dynamic imputation of configuration values. This is just a fragment of the full implementation, showing how we impute various values of the number of executors and executor memory for each job execution.

REFERENCES

[1] Marco Aldinucci, Massimo Torquati, and Massimiliano Meneghin. FastFlow: Efficient Parallel Streaming Applications on Multi-core. arXiv e-prints, art. arXiv:0909.1187, September 2009. URL https://ui.adsabs.harvard.edu/abs/2009arXiv0909. 1187A. Provided by the SAO/NASA Astrophysics Data System.

[2] Marco Aldinucci, Marco Danelutto, Peter Kilpatrick, Massimiliano Meneghin, and Massimo Torquati. Accelerating code on multi-cores with fastflow. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par’11, pages 170–181, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 9783642233968. doi: 10.5555/2033408.2033428.

[3] Marco Aldinucci, Sonia Campa, Marco Danelutto, Peter Kilpatrick, and Massimo Torquati. Targeting distributed systems in fastflow. In Proceedings of the 18th International Confer- ence on Parallel Processing Workshops, Euro-Par’12, pages 47–56, Berlin, Heidelberg, 2012. Springer-Verlag. ISBN 9783642369483. doi: 10.1007/978-3-642-36949-0_7. URL https://doi.org/10.1007/978-3-642-36949-0_7.

[4] V. Arabnejad, K. Bubendorfer, and B. Ng. Budget and deadline aware e-science workflow scheduling in clouds. IEEE Transactions on Parallel and Distributed Systems, 30(1): 29–44, 2019. doi: 10.1109/TPDS.2018.2849396.

[5] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. A view of cloud computing. Commun. ACM, 53(4):50–58, April 2010. ISSN 0001-0782. doi: 10.1145/1721654.1721672. URL https://doi.org/10.1145/1721654.1721672.

[6] Khari Armih, Greg Michaelson, and Phil Trinder. Cache size in a cost model for heterogeneous skeletons. In Proceedings of the Fifth International Workshop on High-Level Parallel Programming and Applications, HLPP '11, pages 3–10, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450308625. doi: 10.1145/2034751.2034755. URL https://doi.org/10.1145/2034751.2034755.


[7] M. Bakratsas, P. Basaras, D. Katsaros, and L. Tassiulas. Hadoop mapreduce per- formance on ssds for analyzing social networks. Big Data Research, 11:1 – 10, 2018. ISSN 2214-5796. doi: https://doi.org/10.1016/j.bdr.2017.06.001. URL http: //www.sciencedirect.com/science/article/pii/S221457961730014X. Selected papers from the 2nd INNS Conference on Big Data: Big Data Neural Networks.

[8] Blaise Barney et al. Introduction to parallel computing. Lawrence Livermore National Laboratory, 6(13):10, 2010.

[9] Anthony G. Barnston. Correspondence among the Correlation, RMSE, and Heidke Forecast Verification Measures; Refinement of the Heidke Score. Weather and Forecasting, 7(4):699–709, 12 1992. ISSN 0882-8156. doi: 10.1175/1520-0434(1992)007<0699: CATCRA>2.0.CO;2. URL https://doi.org/10.1175/1520-0434(1992)007<0699: CATCRA>2.0.CO;2.

[10] Z. Bei, Z. Yu, H. Zhang, W. Xiong, C. Xu, L. Eeckhout, and S. Feng. Rfhoc: A random- forest approach to auto-tuning hadoop’s configuration. IEEE Transactions on Parallel and Distributed Systems, 27(5):1470–1483, 2016. doi: 10.1109/TPDS.2015.2449299.

[11] M. R. Bendre, R. C. Thool, and V. R. Thool. Big data in precision agriculture: Weather forecasting for future farming. In 2015 1st International Conference on Next Generation Computing Technologies (NGCT), pages 744–750, 2015. doi: 10.1109/NGCT.2015. 7375220.

[12] Anne Benoit, Murray Cole, Stephen Gilmore, and Jane Hillston. Flexible skeletal programming with eskel. Euro-Par’05, pages 761–770, Berlin, Heidelberg, 2005. Springer-Verlag. ISBN 3540287000. doi: 10.1007/11549468_83. URL https: //doi.org/10.1007/11549468_83.

[13] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. 13:281–305, February 2012. ISSN 1532-4435. doi: 10.5555/2188385.2188395.

[14] Holger Bischof, Sergei Gorlatch, and Emanuel Kitzelmann. Cost optimality and pre- dictability of parallel programming with skeletons. In Harald Kosch, László Böszörményi, and Hermann Hellwagner, editors, Euro-Par 2003 Parallel Processing, pages 682–693, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-45209-6. doi: 10.1007/978-3-540-45209-6_97.

[15] Dhruba Borthakur, Jonathan Gray, Joydeep Sen Sarma, Kannan Muthukkaruppan, Nicolas Spiegelberg, Hairong Kuang, Karthik Ranganathan, Dmytro Molkov, Aravind Menon, Samuel Rash, Rodrigo Schmidt, and Amitanand Aiyer. Apache hadoop goes realtime at facebook. SIGMOD '11, pages 1071–1080, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306614. doi: 10.1145/1989323.1989438. URL https://doi.org/10.1145/1989323.1989438.

[16] Christopher Brown, Marco Danelutto, Kevin Hammond, Peter Kilpatrick, and Archibald Elliott. Cost-directed refactoring for parallel erlang programs. International Journal of Parallel Programming, 42(4):564–582, 2014. doi: 10.1007/s10766-013-0266-5.

[17] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. Haloop: Efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 3(1-2): 285–296, September 2010. ISSN 2150-8097. doi: 10.14778/1920841.1920881. URL https://doi.org/10.14778/1920841.1920881.

[18] Alexandru Calotoiu, David Beckinsale, Christopher W Earl, Torsten Hoefler, Ian Karlin, Martin Schulz, and Felix Wolf. Fast multi-parameter performance modeling. In 2016 IEEE International Conference on Cluster Computing (CLUSTER), pages 172–181, 2016. doi: 10.1109/CLUSTER.2016.57.

[19] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.

[20] Rich Caruana and Alexandru Niculescu-Mizil. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 161–168, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933832. doi: 10.1145/1143844.1143865. URL https://doi.org/10.1145/1143844.1143865.

[21] S. Ceesay, A. Barker, and B. Varghese. Plug and play bench: Simplifying big data benchmarking using containers. In 2017 IEEE International Conference on Big Data (Big Data), pages 2821–2828, Dec 2017. doi: 10.1109/BigData.2017.8258249.

[22] S. Ceesay, A. Barker, and Y. Lin. Benchmarking and performance modelling of mapreduce communication pattern. In 2019 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), pages 127–134, 2019. doi: 10.1109/CloudCom. 2019.00029.

[23] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. Scope: Easy and efficient parallel processing of massive data sets. 1(2):1265–1276, August 2008. ISSN 2150-8097. doi: 10.14778/1454159.1454166. URL https://doi.org/10.14778/1454159.1454166.

[24] Afroz Chakure. Random forest regression. https://towardsdatascience.com/ random-forest-and-its-implementation-71824ced454f, 2019. [Online; accessed 25-March-2020].

[25] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. Flumejava: Easy, efficient data-parallel pipelines. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '10, pages 363–375, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450300193. doi: 10.1145/1806596.1806638. URL https://doi.org/10.1145/1806596.1806638.

[26] William Y Chang, Hosame Abu-Amara, and Jessica Feng Sanford. Transforming enterprise cloud services. Springer Science & Business Media, 2010.

[27] Zemin Chao, Shengfei Shi, Hong Gao, Jizhou Luo, and Hongzhi Wang. A gray-box performance model for apache spark. Future Generation Computer Systems, 89:58 – 67, 2018. ISSN 0167-739X. doi: https://doi.org/10.1016/j.future.2018.06.032. URL http://www.sciencedirect.com/science/article/pii/S0167739X17323233.

[28] Chaser. Lzf: Lempel-ziv-free. https://github.com/Chaser324/LZF, 2016. [Online; accessed 26-10-2020].

[29] Min Chen, Shiwen Mao, and Yunhao Liu. Big data: A survey. Mobile Network Applica- tions, 19(2):171–209, April 2014. ISSN 1383-469X. doi: 10.1007/s11036-013-0489-0. URL https://doi.org/10.1007/s11036-013-0489-0.

[30] Min Chen, Yixue Hao, Kai Hwang, Lu Wang, and Lin Wang. Disease prediction by machine learning over big data from healthcare communities. IEEE Access, 5:8869–8879, 2017. doi: 10.1109/ACCESS.2017.2694446.

[31] Mosharaf Chowdhury and Ion Stoica. Coflow: A networking abstraction for cluster applications. HotNets-XI, pages 31–36, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450317764. doi: 10.1145/2390231.2390237. URL https://doi.org/10.1145/2390231.2390237.

[32] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 281–288, Cambridge, MA, USA, 2006. MIT Press. doi: 10.5555/2976456.2976492.

[33] Peter Cohen, Robert Hahn, Jonathan Hall, Steven Levitt, and Robert Metcalfe. Using big data to estimate consumer surplus: The case of uber. Technical report, National Bureau of Economic Research, 2016.

[34] Yann Collet. Lz4. https://lz4.github.io/lz4/, 2011. [Online; accessed 26-10- 2020].

[35] Yann Collet and M Kucherawy. Zstandard compression and the application/zstd media type. RFC 8478, 2018.

[36] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. Mapreduce online. NSDI’10, page 21, USA, 2010. USENIX Association. doi: 10.5555/1855711.1855732.

[37] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC ’10, pages 143–154, 2010. ISBN 9781450300346. doi: 10.1145/1807128.1807152.

[38] Wardman Dan. Bring big data to the enterprise. http://biblioteca.uoc.edu/ en/resources/resource/bringing-big-data-enterprise-, 2012. Online; ac- cessed 06-09-2018.

[39] John Darlington, Yi-ke Guo, Hing Wing To, and Jin Yang. Parallel skeletons for structured composition. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP ’95, pages 19–28, New York, NY, USA, 1995. Association for Computing Machinery. ISBN 0897917006. doi: 10.1145/209936.209940. URL https://doi.org/10.1145/209936.209940.

[40] Aster Data. Aster data ncluster: "always on" availability. Aster Data Systems, 2009.

[41] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. 51(1):107–113, January 2008. ISSN 0001-0782. doi: 10.1145/1327452.1327492. URL https://doi.org/10.1145/1327452.1327492.

[42] Jean-Pierre Dijcks. Oracle: Big data for the enterprise. Oracle white paper, page 16, 2012.

[43] MongoDB Doc. Benchmarking and stress testing an hadoop cluster with terasort, testdfsio and co. http://www.michael-noll.com/blog/2011/04/09/benchmarking-and- stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench- mrbench/, 2016. Online, Accessed: 2017-05-19.

[44] Jack J Dongarra, Cleve B Moler, James R Bunch, and Gilbert W Stewart. LINPACK users’ guide. SIAM, 1979.

[45] Harris Drucker, Chris J. C. Burges, Linda Kaufman, Alex Smola, and Vladimir Vapnik. Support vector regression machines. In Proceedings of the 9th International Conference on Neural Information Processing Systems, NIPS'96, pages 155–161, Cambridge, MA, USA, 1996. MIT Press.

[46] Johan Enmyren and Christoph W. Kessler. Skepu: A multi-backend skeleton programming library for multi-gpu systems. In Proceedings of the Fourth International Workshop on High-Level Parallel Programming and Applications, HLPP '10, pages 5–14, New York, NY, USA, 2010. Association for Computing Machinery. ISBN 9781450302548. doi: 10.1145/1863482.1863487. URL https://doi.org/10.1145/1863482.1863487.

[47] Tabatha A Farney. Click analytics: Visualizing website use data. Information Technology and Libraries, 30(3), 2011.

[48] K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of gpu algorithms for matrix-matrix multiplication. In Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, HWWS '04, pages 133–137, New York, NY, USA, 2004. Association for Computing Machinery. ISBN 3905673150. doi: 10.1145/1058129.1058148. URL https://doi.org/10.1145/1058129.1058148.

[49] Joao Fernando Ferreira, João Luís Sobral, and Alberto José Proença. Jaskel: a java skeleton-based framework for structured cluster and grid computing. In Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID’06), volume 1, pages 4 pp.–304, 2006. doi: 10.1109/CCGRID.2006.65.

[50] Yoav Freund and Llew Mason. The alternating decision tree learning algorithm. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 124–133, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1558606122. doi: 10.5555/645528.657623.

[51] Sean Gallagher. How ibm's deep thunder delivers hyper-local forecasts 3-1/2 days out. Ars Technica (March 14, 2012). url = "http://arstechnica.com/business/2012/03/how--deep-thunderdelivers-hyper-local-forecasts-3-12-days-out/ (retrieved January 18, 2015)", 2012.

[52] Sumit Ganguly, Akshay Goel, and Avi Silberschatz. Efficient and accurate cost models for parallel query optimization (extended abstract). In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’96, pages 172–181, New York, NY, USA, 1996. Association for Computing Machinery. ISBN 0897917812. doi: 10.1145/237661.237707. URL https://doi.org/10.1145/237661. 237707.

[53] John Gantz and David Reinsel. The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the future, 2007 (2012):1–16, 2012.

[54] Wanling Gao, Chunjie Luo, Jianfeng Zhan, Hainan Ye, Xiwen He, Lei Wang, Yuqing Zhu, and Xinhui Tian. Identifying dwarfs workloads in big data analytics. arXiv preprint arXiv:1505.06872, 2015.

[55] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 29–43, New York, NY, USA, 2003. Association for Computing Machinery. ISBN 1581137575. doi: 10.1145/945445.945450. URL https://doi.org/10.1145/ 945445.945450.

[56] Daria Glushkova, Petar Jovanovic, and Alberto Abelló. Mapreduce performance model for hadoop 2.x. Information Systems, 79:32–43, 2019. ISSN 0306-4379. doi: https://doi.org/10.1016/j.is.2017.11.006. URL http://www.sciencedirect.com/science/article/pii/S0306437917304659. Special issue on DOLAP 2017: Design, Optimization, Languages and Analytical Processing of Big Data.

[57] Google. Google snappy compression. http://google.github.io/snappy/, 2011. [Online; accessed 26-10-2020].

[58] Google. Google Dataproc. https://cloud.google.com/dataproc, 2020. [Online; accessed 19-January-2020].

[59] Anastasios Gounaris and Jordi Torres. A methodology for spark parameter tuning. Big Data Research, 11:22–32, 2018. ISSN 2214-5796. doi: https://doi.org/10.1016/j.bdr.2017.05.001. URL http://www.sciencedirect.com/science/article/pii/S2214579617300114. Selected papers from the 2nd INNS Conference on Big Data: Big Data Neural Networks.

[60] Ananth Grama, Vipin Kumar, Anshul Gupta, and George Karypis. Introduction to parallel computing. Pearson Education, 2003.

[61] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the mpi message passing interface standard. Parallel Comput., 22(6):789–828, September 1996. ISSN 0167-8191. doi: 10.1016/0167-8191(96)00024-5. URL https://doi.org/10.1016/0167-8191(96)00024-5.

[62] John L. Gustafson. Reevaluating amdahl's law. Commun. ACM, 31(5):532–533, May 1988. ISSN 0001-0782. doi: 10.1145/42411.42415. URL https://doi.org/10.1145/42411.42415.

[63] Apache Hadoop. Hadoop. http://hadoop.apache.org, 2009. [Online; accessed 14-04-2019].

[64] Kevin Hammond and Greg Michaelson. Research directions in parallel functional programming. Springer Science & Business Media, 2012.

[65] Minyang Han and Khuzaima Daudjee. Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. Proc. VLDB Endow., 8(9):950–961, May 2015. ISSN 2150-8097. doi: 10.14778/2777598.2777604. URL https://doi.org/10.14778/2777598.2777604.

[66] Rui Han, Lu Xiaoyi, and Xu Jiangtao. On big data benchmarking. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8807:3–18, 2014. ISSN 16113349. doi: 10.1007/978-3-319-13021-7_1.

[67] Rui Han, Lizy Kurian John, and Jianfeng Zhan. Benchmarking Big Data Systems: A Review. IEEE Transactions on Services Computing, 11(3):580–597, 2018. ISSN 19391374. doi: 10.1109/TSC.2017.2730882.

[68] Katrin Hänsel, Natalie Wilde, Hamed Haddadi, and Akram Alomainy. Challenges with current wearable technology in monitoring health data and providing positive behavioural support. In Proceedings of the 5th EAI International Conference on Wireless Mobile Communication and Healthcare, MOBIHEALTH'15, pages 158–161, Brussels, BEL, 2015. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering). ISBN 9781631900884. doi: 10.4108/eai.14-10-2015.2261601. URL https://doi.org/10.4108/eai.14-10-2015.2261601.

[69] John A Hartigan and Manchek A Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979. ISSN 00359254, 14679876. URL http://www.jstor.org/stable/2346830.

[70] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised Learning, pages 485–585. Springer New York, New York, NY, 2009. ISBN 978-0-387-84858-7. doi: 10.1007/978-0-387-84858-7_14. URL https://doi.org/10.1007/978-0-387-84858-7_14.

[71] Herodotos Herodotou. Hadoop performance models. arXiv preprint arXiv:1106.0940, 2011.

[72] Herodotos Herodotou and Shivnath Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proc. VLDB Endow., 4(11):1111–1122, August 2011. ISSN 2150-8097. doi: 10.14778/3402707.3402746. URL https://doi.org/10.14778/3402707.3402746.

[73] Torsten Hoefler, William Gropp, William Kramer, and Marc Snir. Performance modeling for systematic performance tuning. In State of the Practice Reports, SC ’11, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450311397. doi: 10.1145/2063348.2063356. URL https://doi.org/10.1145/2063348.2063356.

[74] Shengsheng Huang, Jie Huang, Jinquan Dai, Tao Xie, and Bo Huang. The hibench benchmark suite: Characterization of the mapreduce-based data analysis. In Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on, pages 41–51. IEEE, 2010.

[75] Intel. Hibench suite. https://github.com/Intel-bigdata/HiBench, 2012. [Online; accessed 26-10-2018].

[76] Muhammed Tawfiqul Islam, Shanika Karunasekera, and Rajkumar Buyya. dspark: Deadline-based resource allocation for big data applications in apache spark. In 2017 IEEE 13th International Conference on e-Science (e-Science), pages 89–98, 2017. doi: 10.1109/eScience.2017.21.

[77] Jakob Jenkov. Amdahl's law. http://tutorials.jenkov.com/java-concurrency/amdahls-law.html, 2015. [Online; accessed 04-09-2020].

[78] Josh James. Data never sleeps 3.0. Retrieved September 28, 2016. [Online; accessed 06-05-2018].

[79] Selvi Kadirvel and José AB Fortes. Grey-box approach for performance prediction in map-reduce based platforms. In 2012 21st International Conference on Computer Communications and Networks (ICCCN), pages 1–9, 2012. doi: 10.1109/ICCCN.2012.6289311.

[80] Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang, and Changjun Jiang. Hadoop performance modeling for job estimation and resource provisioning. IEEE Transactions on Parallel and Distributed Systems, 27(2):441–454, 2016. doi: 10.1109/TPDS.2015.2405552.

[81] Shahzad Khan. Ethem Alpaydin, Introduction to Machine Learning (Adaptive Computation and Machine Learning Series), The MIT Press, 2004. ISBN 0-262-01211-1, 415 pages. Natural Language Engineering, 14(1):133–137, 2008.

[82] Martin Kleppmann. Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O'Reilly Media, Inc., 2017.

[83] Gunnar Kreitz and Fredrik Niemela. Spotify – large scale, low latency, p2p music-on-demand streaming. In 2010 IEEE Tenth International Conference on Peer-to-Peer Computing (P2P), pages 1–10, 2010. doi: 10.1109/P2P.2010.5569963.

[84] Max Kuhn. Building predictive models in r using the caret package. Journal of Statistical Software, Articles, 28(5):1–26, 2008. ISSN 1548-7660. doi: 10.18637/jss.v028.i05. URL https://www.jstatsoft.org/v028/i05.

[85] Palden Lama and Xiaobo Zhou. Aroma: Automated resource allocation and configuration of mapreduce environment in the cloud. In Proceedings of the 9th International Conference on Autonomic Computing, ICAC '12, pages 63–72, New York, NY, USA, 2012. Association for Computing Machinery. ISBN 9781450315203. doi: 10.1145/2371536.2371547. URL https://doi.org/10.1145/2371536.2371547.

[86] Douglas Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, February 2001. URL http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.

[87] Min Li, Jian Tan, Yandong Wang, Li Zhang, and Valentina Salapura. Sparkbench: A comprehensive benchmarking suite for in memory data analytic platform spark. In Proceedings of the 12th ACM International Conference on Computing Frontiers, CF '15, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450333580. doi: 10.1145/2742854.2747283. URL https://doi.org/10.1145/2742854.2747283.

[88] Andy Liaw, Matthew Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.

[89] Gilles Louppe. Understanding Random Forests: From Theory to Practice. arXiv e-prints, art. arXiv:1407.7502, July 2014. URL https://ui.adsabs.harvard.edu/abs/2014arXiv1407.7502L. Provided by the SAO/NASA Astrophysics Data System.

[90] Hosein Mohammadi Makrani, Hossein Sayadi, Sai Manoj Pudukotai Dinakarra, Setareh Rafatirad, and Houman Homayoun. A comprehensive memory analysis of data intensive workloads on server class architecture. In Proceedings of the International Symposium on Memory Systems, MEMSYS '18, pages 19–30, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450364751. doi: 10.1145/3240302.3240320. URL https://doi.org/10.1145/3240302.3240320.

[91] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146. ACM, 2010.

[92] Stathis Maroulis, Nikos Zacheilas, Thanasis Theocharis, and Vana Kalogeraki. Fast, efficient performance predictions for big data applications. In 2019 IEEE 22nd International Symposium on Real-Time Distributed Computing (ISORC), pages 126–133, 2019. doi: 10.1109/ISORC.2019.00034.

[93] Kiminori Matsuzaki, Zhenjiang Hu, and Masato Takeichi. Parallel skeletons for manipulating general trees. Parallel Computing, 32(7):590–603, 2006. ISSN 0167-8191. doi: https://doi.org/10.1016/j.parco.2006.06.002. URL http://www.sciencedirect.com/science/article/pii/S0167819106000287. Algorithmic Skeletons.

[94] Andrew McAfee, Erik Brynjolfsson, Thomas H Davenport, DJ Patil, and Dominic Barton. Big data: the management revolution. Harvard business review, 90(10):60–68, 2012.

[95] P M Mell and T Grance. The NIST definition of cloud computing. Technical report, 2011. URL https://doi.org/10.6028%2Fnist.sp.800-145.

[96] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. Mllib: Machine learning in apache spark. Journal of Machine Learning Research, 17(1):1235–1241, January 2016. ISSN 1532-4435.

[97] Jeremy Miles. R-Squared, Adjusted R-Squared. American Cancer Society, 2005. ISBN 9780470013199. doi: https://doi.org/10.1002/0470013192.bsa526. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/0470013192.bsa526.

[98] Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly Media, Inc., 1st edition, 2012. ISBN 1449327176, 9781449327170.

[99] Douglas C Montgomery, Elizabeth A Peck, and G Geoffrey Vining. Introduction to linear regression analysis, volume 821. John Wiley & Sons, 2012.

[100] Raghunath Othayoth Nambiar and Meikel Poess. The making of tpc-ds. In Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB '06, pages 1049–1058. VLDB Endowment, 2006. doi: 10.5555/1182635.1164217.

[101] Nhan Nguyen, Mohammad Maifi Hasan Khan, Yusuf Albayram, and Kewen Wang. Understanding the influence of configuration settings: An execution model-driven framework for apache spark platform. In 2017 IEEE 10th International Conference on Cloud Computing (CLOUD), pages 802–807, 2017. doi: 10.1109/CLOUD.2017.119.

[102] Nhan Nguyen, Mohammad Maifi Hasan Khan, and Kewen Wang. Towards automatic tuning of apache spark configuration. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 417–425, 2018. doi: 10.1109/CLOUD.2018.00059.

[103] Owen O'Malley. Terabyte sort on apache hadoop. Yahoo, available online at: http://sortbenchmark.org/Yahoo-Hadoop.pdf, May 2008, pages 1–3.

[104] Ramprasad Pedapatnam. Understanding resource allocation configurations for a spark application. http://site.clairvoyantsoft.com/understanding-resource-allocation-configurations-spark-application/, 2016. [Online; accessed 26-10-2020].

[105] Panagiotis Petridis, Anastasios Gounaris, and Jordi Torres. Spark parameter tuning via trial-and-error. In Plamen Angelov, Yannis Manolopoulos, Lazaros Iliadis, Asim Roy, and Marley Vellasco, editors, Advances in Big Data, pages 226–237, Cham, 2017. Springer International Publishing. ISBN 978-3-319-47898-2. doi: 10.1007/978-3-319-47898-2_24.

[106] Lukasz Piwek, David A. Ellis, Sally Andrews, and Adam Joinson. The rise of consumer health wearables: Promises and barriers. PLOS Medicine, 13(2):1–9, 02 2016. doi: 10.1371/journal.pmed.1001953. URL https://doi.org/10.1371/journal.pmed.1001953.

[107] Adrian Daniel Popescu, Andrey Balmin, Vuk Ercegovac, and Anastasia Ailamaki. Predict: Towards predicting the runtime of large scale iterative analytics. Proc. VLDB Endow., 6(14):1678–1689, September 2013. ISSN 2150-8097. doi: 10.14778/2556549.2556553. URL https://doi.org/10.14778/2556549.2556553.

[108] Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981. ISSN 00905364. doi: 10.2307/2240841. URL http://www.jstor.org/stable/2240841.

[109] Kritwara Rattanaopas and Sureerat Kaewkeeree. Improving hadoop mapreduce performance with data compression: A study using wordcount job. In 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pages 564–567, 2017. doi: 10.1109/ECTICon.2017.8096300.

[110] Joshua Rosen, Neoklis Polyzotis, Vinayak Borkar, Yingyi Bu, Michael J. Carey, Markus Weimer, Tyson Condie, and Raghu Ramakrishnan. Iterative MapReduce for Large Scale Machine Learning. arXiv e-prints, art. arXiv:1303.3517, March 2013. URL https://ui.adsabs.harvard.edu/abs/2013arXiv1303.3517R. Provided by the SAO/NASA Astrophysics Data System.

[111] Jeff Schultz. How much data is created on the internet each day. Micro Focus Blog, 10, 2017. URL https://blog.microfocus.com/how-much-data-is-created-on-the-internet-each-day/. [Online; accessed 14-01-2019].

[112] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.

[113] Ceesay Sheriffo. Benchmarking and Performance Modelling of DFWC implementation. https://github.com/sneceesay77/Thesis-RModelling, 2019. [Online, Accessed: 2019-09-14].

[114] Juwei Shi, Jia Zou, Jiaheng Lu, Zhao Cao, Shiqiang Li, and Chen Wang. Mrtuner: A toolkit to enable holistic optimization for mapreduce jobs. Proc. VLDB Endow., 7(13): 1319–1330, August 2014. ISSN 2150-8097. doi: 10.14778/2733004.2733005. URL https://doi.org/10.14778/2733004.2733005.

[115] Norbert Siegmund, Alexander Grebhahn, Sven Apel, and Christian Kästner. Performance-influence models for highly configurable systems. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, pages 284–294, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450336758. doi: 10.1145/2786805.2786845. URL https://doi.org/10.1145/2786805.2786845.

[116] D. B. Skillicorn. The bird-meertens formalism as a parallel model. In Janusz S. Kowalik and Lucio Grandinetti, editors, Software for Parallel Computation, pages 120–133, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg. ISBN 978-3-642-58049-9. doi: 10.1007/978-3-642-58049-9_9.

[117] D.B. Skillicorn and W. Cai. A cost calculus for parallel functional programming. Journal of Parallel and Distributed Computing, 28(1):65–83, 1995. ISSN 0743-7315. doi: https://doi.org/10.1006/jpdc.1995.1089. URL http://www.sciencedirect.com/science/article/pii/S0743731585710891.

[118] Ge Song, Zide Meng, Fabrice Huet, Frederic Magoules, Lei Yu, and Xuelian Lin. A hadoop mapreduce performance prediction method. In 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pages 820–825, 2013. doi: 10.1109/HPCC.and.EUC.2013.118.

[119] Michael Stein. Large sample properties of simulations using latin hypercube sampling. Technometrics, 29(2):143–151, 1987. ISSN 00401706. doi: 10.2307/1269769. URL http://www.jstor.org/stable/1269769.

[120] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Process. Lett., 9(3):293–300, June 1999. ISSN 1370-4621. doi: 10.1023/A:1018628609742. URL https://doi.org/10.1023/A:1018628609742.

[121] Hassan Tariq, Harith Al-Sahaf, and Ian Welch. Modelling and prediction of resource utilization of hadoop clusters: A machine learning approach. In Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, UCC'19, pages 93–100, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450368940. doi: 10.1145/3344341.3368821. URL https://doi.org/10.1145/3344341.3368821.

[122] Y. C. Tay. Data generation for application-specific benchmarking. Proc. VLDB Endow., 4(12):1470–1473, August 2011. ISSN 2150-8097. doi: 10.14778/3402755.3402798. URL https://doi.org/10.14778/3402755.3402798.

[123] Md. Wasi ur Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, Dipti Shankar, and Dhabaleswar K. (DK) Panda. Mr-advisor: A comprehensive tuning, profiling, and prediction tool for mapreduce execution frameworks on hpc clusters. Journal of Parallel and Distributed Computing, 120:237–250, 2018. ISSN 0743-7315. doi: https://doi.org/10.1016/j.jpdc.2017.11.004. URL http://www.sciencedirect.com/science/article/pii/S0743731517303027.

[124] Blesson Varghese, Lawan Thamsuhang Subba, Long Thai, and Adam Barker. Container-based cloud virtual machine benchmarking. In 2016 IEEE International Conference on Cloud Engineering (IC2E), pages 192–201, 2016. doi: 10.1109/IC2E.2016.28.

[125] Blesson Varghese, Ozgur Akgun, Ian Miguel, Long Thai, and Adam Barker. Cloud benchmarking for maximising performance of scientific applications. IEEE Transactions on Cloud Computing, 7(1):170–182, 2019. doi: 10.1109/TCC.2016.2603476.

[126] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC ’13, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450324281. doi: 10.1145/2523616.2523633. URL https://doi.org/10.1145/2523616.2523633.

[127] Shivaram Venkataraman, Zongheng Yang, Michael Franklin, Benjamin Recht, and Ion Stoica. Ernest: Efficient performance prediction for large-scale advanced analytics. NSDI'16, pages 363–378, USA, 2016. USENIX Association. ISBN 9781931971294. doi: 10.5555/2930611.2930635.

[128] Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. Aria: Automatic resource inference and allocation for mapreduce environments. ICAC '11, pages 235–244, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450306072. doi: 10.1145/1998582.1998637. URL https://doi.org/10.1145/1998582.1998637.

[129] Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. Resource provisioning framework for mapreduce jobs with performance goals. In Proceedings of the 12th International Middleware Conference, Middleware '11, pages 160–179, Laxenburg, AUT, 2011. International Federation for Information Processing. ISBN 9783642258206. doi: 10.5555/2414338.2414351.

[130] Emanuel Vianna, Giovanni Comarela, Tatiana Pontes, Jussara Almeida, Virgilio Almeida, Kevin Wilkinson, Harumi Kuno, and Umeshwar Dayal. Analytical performance models for mapreduce workloads. International Journal of Parallel Programming, 41(4):495–525, 2013. doi: 10.1007/s10766-012-0227-4. URL https://doi.org/10.1007/s10766-012-0227-4.

[131] Guolu Wang, Jungang Xu, and Ben He. A novel method for tuning configuration parameters of spark based on machine learning. In 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pages 586–593, 2016. doi: 10.1109/HPCC-SmartCity-DSS.2016.0088.

[132] Kewen Wang and Mohammad Maifi Hasan Khan. Performance prediction for apache spark platform. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, pages 166–173, 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.246.

[133] Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, et al. Bigdatabench: A big data benchmark suite from internet services. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 488–499, 2014. doi: 10.1109/HPCA.2014.6835958.

[134] Jonathan Stuart Ward and Adam Barker. Undefined By Data: A Survey of Big Data Definitions. arXiv e-prints, art. arXiv:1309.5821, September 2013. URL https://ui.adsabs.harvard.edu/abs/2013arXiv1309.5821W. Provided by the SAO/NASA Astrophysics Data System.

[135] Jinliang Wei, Jin Kyu Kim, and Garth A Gibson. Benchmarking apache spark with machine learning applications. Parallel Data Lab., Carnegie Mellon Univ., Pittsburgh, PA, USA, 2016.

[136] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 4th edition, 2015. ISBN 1491901632.

[137] Jack Wisdom and Matthew Holman. Symplectic maps for the n-body problem. The Astronomical Journal, 102:1528–1538, 1991.

[138] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, page 2, USA, 2012. USENIX Association. doi: 10.5555/2228298.2228301.

[139] Zhuoyao Zhang, Ludmila Cherkasova, and Boon Thau Loo. Benchmarking approach for designing a mapreduce performance model. In Proceedings of the 4th ACM/SPEC International Conference on Performance Engineering, ICPE '13, pages 253–258, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450316361. doi: 10.1145/2479871.2479906. URL https://doi.org/10.1145/2479871.2479906.

[140] Zhuoyao Zhang, Ludmila Cherkasova, and Boon Thau Loo. Getting more for less in optimized mapreduce workflows. In 2013 IFIP/IEEE International Symposium on Integrated Network Management (IM 2013), pages 93–100, 2013.

[141] Weizhong Zhao, Huifang Ma, and Qing He. Parallel k-means clustering based on mapreduce. In Proceedings of the 1st International Conference on Cloud Computing, CloudCom '09, pages 674–679, Berlin, Heidelberg, 2009. Springer-Verlag. ISBN 9783642106644. doi: 10.1007/978-3-642-10665-1_71. URL https://doi.org/10.1007/978-3-642-10665-1_71.

[142] Zhi-Qiang Zeng, Hong-Bin Yu, Hua-Rong Xu, Yan-Qi Xie, and Ji Gao. Fast training support vector machines using parallel sequential minimal optimization. In 2008 3rd International Conference on Intelligent System and Knowledge Engineering, volume 1, pages 997–1001, 2008. doi: 10.1109/ISKE.2008.4731075.

[143] Eric R. Ziegel and Raymond Myers. Classical and modern regression with applications. Technometrics, 33(2):248, May 1991. doi: 10.2307/1269070. URL https://doi.org/10.2307%2F1269070.

[144] Matt Zwolenski and Lee Weatherill. The Digital Universe Rich Data and the Increasing Value of the Internet of Things. Australian Journal of Telecommunications and the Digital Economy, 2(3), 2014. ISSN 2203-1693. doi: 10.7790/ajtde.v2n3.47. URL https://doi.org/10.7790/ajtde.v2n3.47.