Big Data Processing on Arbitrarily Distributed Dataset

by

Dongyao Wu

A THESIS SUBMITTED IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING, FACULTY OF ENGINEERING

Thursday, 1st June, 2017

All rights reserved. This work may not be

reproduced in whole or in part, by photocopy

or other means, without the permission of the author.

© 2017 by Dongyao Wu

THE UNIVERSITY OF NEW SOUTH WALES Thesis/Dissertation Sheet

Surname or Family name: Wu

First name: Dongyao Other name/s:

Abbreviation for degree as given in the University calendar: PhD

School: School of Computer Science and Engineering Faculty: Faculty of Engineering

Title: Big Data Processing on Arbitrarily Distributed Dataset

Abstract (350 words maximum):

Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. These frameworks significantly reduce the complexity of developing big data programs and applications. However, in reality, many real-world scenarios require pipelining and integration of multiple big data jobs. As the big data pipelines and applications become more and more complicated, it is almost impossible to manually optimize the performance for each component, not to mention the whole pipeline/application. At the same time, there are also increasing requirements to facilitate interaction, composition and integration for big data analytics applications in continuously evolving, integrating and delivering scenarios. In addition, with the emergence and development of cloud computing, mobile computing and the Internet of Things, data are increasingly collected and stored in highly distributed infrastructures (e.g. across data centres, clusters, racks and nodes).

To deal with the challenges above and fill the gap in existing big data processing frameworks, we present the Hierarchically Distributed Data Matrix (HDM) along with the system implementation to support the writing and execution of composable and integrable big data applications. HDM is a light-weight, functional and strongly-typed meta-data abstraction which contains complete information (such as data format, locations, dependencies and functions between input and output) to support parallel execution of data-driven applications. Exploiting the functional nature of HDM enables deployed applications of HDM to be natively integrable and reusable by other programs and applications. In addition, by analysing the execution graph and functional semantics of HDMs, multiple automated optimizations are provided to improve the execution performance of HDM data flows. Moreover, by extending the kernel of HDM, we propose a multi-cluster solution which enables HDM to support large scale data analytics among multi-cluster scenarios. Drawing on the comprehensive information maintained by HDM graphs, the runtime execution engine of HDM is also able to provide provenance and history management for submitted applications. We conduct comprehensive experiments to evaluate our solution compared with the current state-of-the-art big data processing framework, Apache Spark.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.

I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Signed ……………………………………………...... Date ……………………………………………......

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY Date of completion of requirements for Award:

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ……………………………………………......

Date ……………………………………………......

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………......

Date ……………………………………………......

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Signed ...... Date ......

To my family, friends and supervisors.

Abstract

Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. These frameworks significantly reduce the complexity of developing big data programs and applications.

However, in reality, many real-world scenarios require pipelining and integration of multiple big data jobs. As the big data pipelines and applications become more and more complicated, it is almost impossible to manually optimize the performance for each component, not to mention the whole pipeline/application. At the same time, there are also increasing requirements to facilitate interaction, composition and integration for big data analytics applications in continuously evolving, integrating and delivering scenarios. In addition, with the emergence and development of cloud computing, mobile computing and the Internet of Things, data are increasingly collected and stored in highly distributed infrastructures (e.g. across data centers, clusters, racks and nodes).

To deal with the challenges above and fill the gap in existing big data processing frameworks, we present the Hierarchically Distributed Data Matrix (HDM) along with the system implementation to support the writing and execution of composable and integrable big data applications. HDM is a light-weight, functional and strongly-typed meta-data abstraction which contains complete information (such as data format, locations, dependencies and functions between input and output) to support parallel execution of data-driven applications. Exploiting the functional nature of HDM enables deployed applications of HDM to be natively integrable and reusable by other programs and applications. In addition, by analyzing the execution graph and functional semantics of HDMs, multiple automated optimizations are provided to improve the execution performance of HDM data flows. Moreover, by extending the kernel of HDM, we propose a multi-cluster solution which enables HDM to support large scale data analytics among multi-cluster scenarios. Drawing on the comprehensive information maintained by HDM graphs, the runtime execution engine of HDM is also able to provide provenance and history management for submitted applications. We conduct comprehensive experiments to evaluate our solution compared with the current state-of-the-art big data processing framework — Apache Spark.

Acknowledgments

Thanks to everyone who has helped and/or accompanied me during my PhD.

Contents

Abstract i

Acknowledgments iii

List of Figures x

List of Tables xi

1 Introduction 1

1.1 Background ...... 1

1.1.1 Big Data ...... 1

1.1.2 Big Data Enabling Technologies ...... 3

1.2 Motivation ...... 13

1.3 Publications ...... 19

2 Literature Review and Related Work 22

2.1 Big Data Processing Frameworks: State-of-the-Art ...... 23

2.1.1 MapReduce ...... 23

2.1.2 Spark ...... 26

2.1.3 Flink ...... 28

2.1.4 Other Big Data Processing Frameworks ...... 30

2.1.5 Discussion ...... 33

2.2 Optimizations on Big Data Processing Engines ...... 34

2.2.1 Optimization frameworks ...... 34

2.2.2 Categorization of Optimizations for Big Data Processing . . . . 41

2.3 Big Data Processing on Heterogeneous Environment ...... 45

2.4 Pipelining and Integration in Big Data Processing ...... 49

2.4.1 Pipeline Frameworks ...... 49

2.4.2 Discussion ...... 53

2.5 Conclusion and Summary ...... 53

3 A Functional Meta-data Abstraction for Big Data Processing - HDM 56

3.1 Attributes of HDM ...... 56

3.2 Categorization of HDM ...... 58

3.3 Data Dependencies of HDM ...... 60

3.4 Programming on HDM ...... 62

3.4.1 HDM Functions ...... 62

3.4.2 HDM Composition ...... 64

3.4.3 Interaction with HDM ...... 64

3.5 System Implementation ...... 65

3.5.1 Architecture Overview ...... 65

3.5.2 Runtime Engine ...... 67

3.6 Sample Examples ...... 72

3.6.1 Top K ...... 72

3.6.2 Linear Regression ...... 73

3.6.3 KMeans Clustering ...... 73

4 Functional Dataflow Optimization based on HDM 75

4.1 Local Aggregation ...... 77

4.2 Function Fusion ...... 78

4.3 Re-ordering/Re-construction Operations ...... 79

4.4 HDM Cache ...... 81

4.5 Comparison with Optimizations in Other Frameworks ...... 82

4.6 Performance Evaluation ...... 83

4.6.1 Experimental setup ...... 83

4.6.2 Experimental Benchmarks ...... 84

4.6.3 Experiment Results ...... 86

4.6.4 Comparison and Discussion ...... 91

5 Towards a Multi-Cluster Architecture 94

5.1 Core Execution Engine - HDM ...... 96

5.2 Coordination of Multi-clusters ...... 96

5.2.1 Hierarchical Architecture ...... 97

5.2.2 Decentralized Architecture ...... 99

5.2.3 Dynamic Architecture Switching ...... 101

5.3 Job Planning on Multi-clusters ...... 102

5.3.1 Categorizations of Jobs ...... 102

5.3.2 Job Explanation ...... 103

5.4 Scheduling on Multi-clusters ...... 106

5.4.1 Multi-layer Scheduler Design ...... 106

5.4.2 Scheduling Strategies ...... 107

5.5 Experimental Evaluation of Multi-cluster ...... 109

5.5.1 Experimental Setup ...... 110

5.5.2 Benchmark and Test Cases ...... 110

5.5.3 Experimental Results ...... 111

6 Dependency and Execution History Management on HDM 117

6.1 History Traces Management ...... 117

6.2 Dependency Trace Synchronization in HDM Cluster ...... 122

6.3 Reproduce of HDM Applications ...... 123

6.4 Composition of HDM applications ...... 124

6.5 Fault Tolerance in HDM ...... 126

6.6 Case Study ...... 126

7 Data Pipeline on Multiple Execution Platforms 128

7.1 Motivating Scenarios ...... 129

7.2 Pipeline on Heterogeneous Execution Contexts ...... 131

7.2.1 Pipeline Model ...... 132

7.2.2 Execution Engine ...... 134

7.2.3 Data Service ...... 135

7.2.4 Dependency and Version Manager ...... 136

7.3 Case Study ...... 138

7.4 Comparison and Discussion ...... 140

8 Conclusion and Future Work 143

8.1 Conclusion ...... 143

8.2 Future work ...... 145

Bibliography 147

List of Figures

1.1 5 Vs of Big Data...... 2

1.2 Illustration of Cloud Computing...... 3

1.3 Categories of Machine learning Algorithms...... 6

1.4 Mobile Computing...... 8

1.5 Four Layers of IoT...... 10

1.6 Illustration of Social Media...... 12

1.7 A simple image classification pipeline...... 16

2.1 Flow of the Literature Review...... 22

2.2 Taxonomy of the Literature Review...... 23

2.3 Data Flow of MapReduce...... 24

2.4 An Example of RDD DAG in Spark...... 27

2.5 Architecture of Flink Execution Engine...... 29

2.6 PACT Component in Flink...... 36

2.7 Four Main Stages for Big Data Optimizations...... 42

2.8 A Basic Component of Flume...... 52

3.1 Data Model of HDM...... 59

3.2 Data Dependencies of HDM...... 61

3.3 Example of writing a word-count program in HDM ...... 64

3.4 Print output of HDM on the client side ...... 65

3.5 System Architecture of HDM Runtime System...... 66

3.6 Process of executing HDM jobs...... 67

3.7 Physical execution graph of HDM (parallelism = 4)...... 69

4.1 Traversing on the Data flow of HDM during Optimization...... 75

4.2 Logical flow of word-count...... 78

4.3 Data flow optimized by local aggregation...... 78

4.4 Data flow of word-count after function fusion...... 80

4.5 Data flow Reconstruction in HDM ...... 80

4.6 Cache Detecting by Reference Counting...... 81

4.7 Comparison of Job Completion Times for HDM and Spark ...... 88

4.8 Comparison of Job Completion Times of ML algorithms for HDM and Spark ...... 90

5.1 A Better Solution for a Multi-party Computation Architecture...... 95

5.2 Hierarchical Multi-cluster Architecture...... 97

5.3 Message Coordination between Hierarchical Clusters...... 97

5.4 P2P Master Architecture...... 100

5.5 Message Coordination between P2P Clusters...... 101

5.6 Job Explanation Example...... 104

5.7 Multi-layer Job Scheduling Example...... 107

5.8 Multi-cluster Infrastructure for Experiments...... 110

5.9 Comparison the scheduling cost of single and multi-cluster architecture. 111

5.10 Comparison of Job Completion Times on a two-cluster Infrastructure . . 114

5.11 Comparison of Data Transfer on a two-cluster Infrastructure ...... 115

6.1 Dependency and Execution Traces of HDM...... 121

6.2 Reproducing an existing word-count application ...... 124

6.3 Applying a new operation to an existing program ...... 125

6.4 Replacing the input of an existing program ...... 125

6.5 Creating an image classification pipeline in HDM ...... 127

7.1 The data process pipeline of a real-world suspicion detection system. . . 130

7.2 Architecture overview of Pipeline61...... 131

7.3 Historical and dependency information maintained in Pipeline61. . . . . 137

List of Tables

1.1 The logical mapping between the chapters and the involved papers . . . 20

2.1 Comparison of related works on big data processing in multi-cluster/highly distributed environment ...... 47

2.2 Basic Relational Operators in Pig Latin...... 49

2.3 Common Data Processing Patterns in Crunch...... 51

2.4 Comparison of Existing Data Pipeline Frameworks ...... 53

3.1 Attributes of HDM ...... 57

3.2 Semantics of basic functions ...... 63

3.3 Actions and responses for integration ...... 65

4.1 Rewriting patterns in HDM ...... 79

4.2 Comparison of major big data frameworks ...... 91

6.1 Information Maintained in Dependency Trace ...... 119

6.2 Attributes maintained for each Execution Instance ...... 119

6.3 Attributes maintained for each Executed Task ...... 120

Chapter 1

Introduction

1.1 Background

1.1.1 Big Data

Data is a key resource in the modern world. The amount of digital data generated and consumed has been explosively increasing for the past decade. According to a recent report from IDC, by 2020 the total digital data size will be 300 times bigger than it was in 2005 and the amount of data is predicted to almost double every two years in the future (Gantz and Reinsel, 2012). In particular, the rate of data creation is accelerating, driven by many technologies and applications including mobile phones, social media, video surveillance, medical imaging, sensors, smart grids and the Internet of Things.

For example, mobile devices provide geo-spatial location data of users, phone calls and text messages, in addition to data which is generated from the numerous applications available on smart phones. Social network applications provide an easy and accessible platform for billions of users to upload huge numbers of photos and video footage to the World Wide Web every second. These technologies and others are just examples of instruments which constantly create huge amounts of new data that must be stored and processed to achieve various data analytics purposes.

Figure 1.1: 5 Vs of Big Data.

Big Data has become a popular term to describe the exponential growth and availability of data. The phenomenon of Big Data is commonly described using the “five Vs” features (Fig. 1.1):

• Volume. It refers to the massive amount of data generated every day.

• Velocity. It refers to the speed at which new data is generated and the speed at which data is moved and transformed around.

• Variety. It refers to the rich diversity of formats of the generated data in the current digital world.

• Veracity. It refers to the quality and trustworthiness of data that has been generated

and collected.

• Value. It refers to the value and significance that big data can bring to real-world applications and products.

Since the emergence of the concept of big data and the development of its related technologies and frameworks, big data has significantly motivated and affected the development of data-driven technologies and applications in various areas such as cloud computing, large scale machine learning, mobile computing, the Internet of Things and social networking services. In the next section, we will present a brief overview of the main big data enabling technologies.

1.1.2 Big Data Enabling Technologies

Cloud Computing

Figure 1.2: Illustration of Cloud Computing.

Cloud computing is “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (Mell et al., 2011). The key characteristics of cloud computing include:

• Service Oriented Accessibility. Cloud computing infrastructures normally provide service-oriented and well-defined interfaces for users to store and manage data and computing resources.

• Virtualization. Virtualization provides the ability of resource sharing and isolation for the underlying hardware to increase resource utilization, efficiency and scalability. Virtualization is one of the most important technologies that facilitate the realization of cloud computing. In principle, all the resources, components and services in cloud computing are virtualized in order to be better managed by administrators as well as shared among different tenants.

• Scalability. The data storage and computing resources in cloud computing are virtualized and can be easily scaled to massive size, which releases users from the need to worry about and deal with the tedious and error-prone management of large scale clusters and applications.

• Elasticity. The provision of storage and computing resources in the Cloud can be

automatically expanded and shrunk based on the demand of user requirements or

the scale of the applications.

In addition, based on the level of abstraction and virtualization, the service models of cloud computing are typically classified as SaaS, PaaS and IaaS as shown in Fig. 1.2:

• SaaS (Software as a Service): In the SaaS model, software applications are wrapped as web-accessible services via browsers. The web delivery model of SaaS eliminates the need for installing software on individual computers. At the same time, it also makes it easier for software providers to deploy, maintain and update their applications, as everything runs transparently to end users. Commonly used SaaS applications include WordPress, Google Docs and Gmail.

• PaaS (Platform as a Service): PaaS provides cloud-based components for developing, customizing and executing software applications. PaaS works as a middleware layer while inheriting cloud characteristics such as scalability, high availability and elasticity. Common PaaS platforms include Google App Engine, Microsoft Azure and Salesforce.com.

• IaaS (Infrastructure as a Service): IaaS provides services for accessing, monitoring and managing infrastructure resources (such as compute, storage and networking) in remote data centers. IaaS is a highly virtualized resource management service layer which relieves system administrators from the tedious work of constructing, monitoring and scaling their execution infrastructures. Examples of IaaS platforms include Amazon AWS, Microsoft Azure, Google Compute Engine and Apache CloudStack.

Big data is closely coupled with cloud computing. On one hand, big data technologies provide the ability to process and manage large scale data sets using commodity computing resources, and cloud computing provides highly scalable and elastic infrastructures for constructing and deploying big data processing frameworks such as Hadoop,

Spark and Flink. On the other hand, while cloud computing provides crucial features and infrastructures to store, access and manage large scale data sets, services and resources in cloud computing also generate tremendous amounts of data (from cloud users, system logs and cloud storage services, etc.), which brings both opportunities and challenges for big data processing and analytics technologies.

Large Scale Machine Learning

Nowadays, machine learning has become one of the most promising technologies which drives the development of our computer programs and devices to become smarter and smarter. Traditionally, machine learning is known as one branch of artificial intelligence

(AI) and was born from pattern recognition and computational learning theory. It gives computers the ability to learn without being explicitly programmed (Samuel, 1959). Machine learning is about learning from experience with respect to certain performance measures in some class of tasks (Mitchell, 1997).

Figure 1.3: Categories of Machine learning Algorithms.

Depending on whether learning feedbacks (labels) are available in a learning system, classic machine learning tasks are typically categorized into two classes: supervised learning and unsupervised learning. Based on the expected output of a learning system, machine learning tasks can also be classified into four main groups: classification, regression, clustering and dimension reduction. An illustration of machine learning categories is shown in Fig. 1.3.

Although machine learning algorithms have been in existence for a long time, the applications of machine learning were limited due to the insufficient amount of data or the inherent complexity of learning a complex machine learning model. Data is the core resource that is used to power up machine learning tasks. The performance of machine learning tasks is significantly affected by the quality and the amount of training data. The emergence and development of big data processing technologies bridges the gap of applying complex algorithms and calculations on massive data sets within a reasonable time period. By drawing on distributed and parallel computing technologies, big data frameworks are able to automatically apply complex mathematical computations on large scale training data sets over a large number of iterations with ever faster response cycles. Big data boosts the realization and application of machine learning while machine learning pushes the further development and progress of modern big data analytics frameworks (e.g. Mahout (http://mahout.apache.org/), MLlib (http://spark.apache.org/mllib/), Flink-ML, H2O (http://www.h2o.ai/) and SAMOA (https://samoa.incubator.apache.org/)).

Mobile Computing

With the development of wireless technology and the proliferation of smart phones and devices, people and mobile devices are significantly more connected than ever before.

The use of mobile computing technologies has grown at an exponential rate and the number of mobile users has surpassed that of traditional desktop users (Murphy and Meeker, 2011). Data communication has entered the era of mobile computing. Mobile computing enables people to access information anywhere and at any time using lightweight computing devices such as smart phones, smart watches and tablets. The characteristics of mobile computing consist of:

• Portability: Devices and nodes in mobile computing should have sufficient pro-

cessing capability and physical portability to facilitate the operations in a movable

environment.

• Connectivity: Devices and nodes in mobile computing should have the ability

to stay connected to the network and maintain a minimal amount of lag/downtime without being affected by the mobility of devices/nodes.

• Interactivity: Devices and nodes in mobile computing should be able to commu-

nicate and interact with other active devices and nodes by exchanging data within

the same environment.

• Individuality: A mobile device or node often denotes an individual in a mobile computing environment. Mobile computing systems should be able to meet individual requirements and maintain contextual information for the devices.

Figure 1.4: Mobile Computing.

With the phenomenal growth of mobile users, the amount of data generated from mobile devices continues to grow unabated. Dealing with and processing such large scale data brings both new challenges and opportunities to the development of big data processing platforms.

First of all, mobile applications bring a large number of daily active users, who generate massive amounts of data every day, every hour and even every minute. Data generated and collected in the mobile computing environment is normally real-time, location-based and widely distributed. The need to handle and process such data has inspired and pushed the advancement of modern real-time, stream computing frameworks (e.g. Storm (http://storm.apache.org/), Spark Streaming (http://spark.apache.org/streaming/) and Flink (https://flink.apache.org/)) and distributed messaging frameworks (e.g. Kafka (https://kafka.apache.org/), Kinesis (https://aws.amazon.com/kinesis) and ZeroMQ (http://zeromq.org/)).

Meanwhile, mobile computing also provides additional application scenarios to apply and evaluate existing big data analytics technologies. Nowadays, mobile devices are becoming more and more powerful to the extent that many mobile devices can already be utilized as potential computing resources for distributed data processing and analytics (Shiraz et al., 2013).

Internet of Things

Figure 1.5: Four Layers of IoT.

The Internet of Things (IoT) is the network of physical objects (things) embedded with electronics, software, sensors and network connectivity. The network enables those objects and devices to communicate and exchange data among each other. The communication and information sharing in the IoT are based on stipulated protocols to achieve smart controlling, monitoring, recognition and administration of smart devices. The IoT architecture consists of four typical layers: device/sensor layer, gateway and network layer, management service layer and application layer (Vermesan and Friess, 2014), as illustrated in Fig. 1.5:

• Device/sensor layer. The first layer is composed of smart devices with integrated

sensors. The sensors are connected to the IoT network to enable information col-

lection from the physical world. The sensor layer acts as the interconnection between the physical world and the digital world.

• Gateway and network layer: The second layer consists of networks with a set of

gateways. The massive number of sensors generates large volumes of data in a short period. Therefore, the networks and gateways in this layer provide a high-performance

and robust infrastructure as the medium of data transportation.

• Management service layer: The service layer processes, wraps and renders the

raw data collected from the underlying layer to provide meaningful information

and interfaces for the application layer. Data provided in this layer are normally streamed as massive events and analytics tools are used to extract relevant information and feed them to related applications. A set of management services such

as data access control, privacy and security management are also provided in this

layer.

• Application layer: Utilizing the useful and meaningful information served from

the underlying layers, applications are deployed in this layer to benefit people's daily lives. Typical IoT applications include transportation, agriculture, supply

chain, construction, energy and environment-focused domains.

Since its emergence, IoT has become one major data source for big data analytics and applications. According to a report (Bradley et al., 2013), by the year 2020, the

Internet of Things will generate a stunning 14.4 trillion GB of data in the world. With the growth of IoT, more demands are placed on big data processing capabilities for gathering, analyzing, sharing, and transmitting highly distributed data in real time.

One major challenge brought by IoT concerns real-time communication and computation at large scale. Data collection by the sensors of IoT is normally performed in real time or near real time, and those tiny sensors can produce tremendous amounts of data in every small period of time. Addressing the problem of how to effectively collect and process these collected data within an acceptable duration has led to the development of advanced stream and real-time computing frameworks.

Another challenge is learning valuable insights from the large volume of raw data collected in the IoT network. Finding useful patterns and models from such large scale, noisy data is difficult. Moreover, applications in IoT normally involve timeline-based analytics in which the software programs are required to detect values, trends and exceptions over a continuous timeline.

Figure 1.6: Illustration of Social Media.

Social Media and Social Network

Social media is the product of the combination of traditional media and the development of Internet technologies. Social media consists of technologies that provide platforms and ecosystems for users to share information like ideas, interests, insights and other forms of expression via virtual communications and networks.

A similar term is social networking service (SNS), which represents the online platforms that enable people to build their social connections and relations through the virtual internet and mobile networks. Despite their partially different definitions, there are some common features shared by the two types of applications (Obar and Wildman,

2015):

• Web 2.0 internet-based applications: The development of Web 2.0 functionalities

provides a rich user experience on the Internet and enables users to participate and

contribute towards the creation of more content for web pages.

• User generated contents (UGC): User generated contents such as comments, dig-

ital photos and videos are the crucial data sources for social media and social

networking services. Without such information being continuously offered by the

end users, the online social media platforms and sites cannot survive.

• Service specific profiles for users: Normally, users are required to provide specific

profiles to facilitate the maintenance and development of services in social media

platforms.

• Facilitating social connections and interactions among users: With those user pro-

files and data, the services can perform analytics and help users build social connections and interactions with the individuals/groups that are of most interest and relevance to them.

Social media is considered part of big data. According to a recent industry report (Gantz and Reinsel, 2012), 90 percent of the world's data was created within the past two years. Of these data, 80% are unstructured data which mainly consist of photos, videos or social media posts. That is, social media has become one of Big Data's most significant sources. At the same time, companies and markets are attempting to learn user behaviours and user patterns for business value from the tremendous amount of data that has been generated and collected in social media platforms. The development and advancement of big data platforms offers the ability to store, process and analyze social media data at massive scale in an effective and affordable manner.

1.2 Motivation

The growing demand for large scale data analytics applications in various areas has spurred the development of novel solutions to tackle the challenges (Sakr and Gaber,

2014) for processing vast amounts of data. For about a decade, the MapReduce framework has represented the de facto standard of big data technologies and has been widely utilized as a popular mechanism to harness the power of large clusters of computers. In general, the fundamental principle of the MapReduce framework is to move analysis to the data, rather than moving the data to a system that can analyze it. It allows programmers to think in a data-centric fashion where they can focus on applying transformations to sets of data records while the details of distributed execution and fault tolerance are transparently managed by the framework. However, in recent years, with the increasing requirements of applications in the data analytics domain, various limitations of the MapReduce framework have been recognized and thus we have witnessed an unprecedented interest in tackling these challenges with new solutions which constituted a new wave of mostly domain-specific, optimized big data processing platforms (Sakr et al.,

In recent years, several frameworks (e.g. Spark (Zaharia, Chowdhury, Franklin,

Shenker and Stoica, 2010), Flink, Pregel (Malewicz et al., 2010), Storm11) with dif- ferent focus and approach have been presented to tackle the ever larger data sets using distributed clusters of commodity machines. These frameworks significantly reduce the complexity of developing big data programs and applications. However, in reality, many real-world scenarios require pipelining and integration of multiple big data jobs. As the big data pipelines and applications have become more and more complicated, it is almost impossible to manually optimize the performance for each component not to mention the whole pipeline/application. To address the auto-optimization problem, Tez (Saha et al.,

2015) and FlumeJava (Chambers et al., 2010) were introduced to optimize the DAG

(Directed Acyclic Graph) of MapReduce-based jobs while Spark relies on Catalyst to optimize the execution plan of SparkSQL (Armbrust, Xin, Lian, Huai, Liu, Bradley,

Meng, Kaftan, Franklin, Ghodsi and Zaharia, 2015).

With the emergence and fast development of cloud computing, mobile computing and the Internet of Things, data are increasingly being collected and stored in highly distributed infrastructures (e.g. across data centers, clusters, racks and nodes). Although,

11http://storm.apache.org/ Chapter 1 15 many frameworks have been presented and developed to deal with ever larger data sets on ever larger distributed clusters during the past years, the majority of these big-data- processing frameworks such as Hadoop12 and Spark (Zaharia, Chowdhury, Franklin,

Shenker and Stoica, 2010) are designed and implemented based on the single-cluster design in which, there are two basic assumptions:

• Centralized cluster management assumption: which assumes the whole cluster

is managed by a centralized master (sometimes with a standby backup) which is

responsible for the resource management, job explanation and scheduling.

• Homogeneous assumption: which generally assumes that all nodes in the cluster

are symmetrically connected and data are generally distributed among them.

However, these two assumptions are not a good fit for other scenarios in which data are highly distributed in a heterogeneous environment and there could be either logical or physical boundaries for different groups of computational nodes.

• On one hand, data are increasingly distributed across different organizations that

may hold different features for the same entities due to the diversified products

they provide. As a result, there are increasing requirements to apply data ana-

lytics across organizations in order to discover more comprehensive patterns and

knowledge for data scientists and end users. However, in this case, current big data

processing frameworks such as MapReduce and Spark are not natively designed

to support coordination and computation across multi-clusters.

• On the other hand, the data center infrastructure itself is becoming bigger and more

complicated so that the connectivity differences between different sub-sets of nodes are not negligible. In practice, for widely distributed data sets, the connectivity, including bandwidth and latency, between different blocks/partitions of data can


significantly differ. For example, transportation between nodes under the same

rack can be very fast; transportation across different racks can be considerably

slower; transportation across different geo-located data centers can be much slower

and even more limited. Ignoring the heterogeneity of the underlying infrastructure

can result in significant performance degradation as shown in our experiments.

Figure 1.7: A simple image classification pipeline.

Furthermore, in reality, there are more challenges when applying big data technology to complicated applications in continuously evolving, integrating and delivering scenarios. For example, consider a typical online machine learning pipeline as shown in

Fig. 1.7. The pipeline consists of three main parts: the data parser/cleaner, feature extractor and classification trainer. In the pipeline, components like the feature extractor and classification trainer are normally commonly-used algorithms for many machine learning applications. However, in current big data platforms such as MapReduce and Spark, there is no proper way to share and expose a deployed and well-tuned online component to other developers. Therefore, there is massive and often unseen redundant development in big data applications. In addition, as the pipeline evolves, each of the online components might be updated and re-developed, and new components can also be added to the pipeline. As a result, it is very hard to track and check the effects when the process is continuously evolving. Google's recent report (Sculley et al., 2014) shows the challenges and problems that they have encountered in managing and evolving large scale data analytic applications.

To fill the gap in existing big data processing frameworks discussed above, my research tackles the issues in applying big data applications on complicated computing infrastructures. The main challenges we are trying to address in the research include:

• Automatic optimizations for executing complicated and pipelined big data appli-

cations. Many real-world applications require a chain of operations or even a

pipeline of data processing programs. Optimizing a complicated job is difficult

and optimizing pipelined ones is even harder. Additionally, manual optimiza-

tions are time-consuming and error-prone and it is almost impossible to manually

optimize every program.

• Supporting multi-cluster architectures. A multi-cluster architecture is required

for performing computation on highly distributed infrastructures (e.g. multi-party

computations, geo-distributed data centers).

• Maintenance and management of evolving big data applications are complex and

tedious. In a realistic data analytic process, many practical data analytics and ma-

chine learning algorithms require the combination of multiple processing components

each of which is responsible for a certain analytical functionality. Data scientists

need to explore the datasets and tune the algorithms in each component over many

iterations to find overall optimal solutions. Support of integration, composition

and interaction with big data programs/jobs is necessary to facilitate the contin-

uous processes of development and exploration. More importantly, mechanisms

such as history tracking and reproducibility of old-version programs are of great

significance to help data scientists to not be lost during their task of exploring and

evolving their data analytic programs.

In order to tackle the above challenges, we believe that these problems could be addressed to a great extent at the big data execution engine level by improving the basic meta-data abstraction along with end-to-end optimizations. In particular, we present the

Hierarchically Distributed Data Matrix (HDM) (Wu et al., 2015) along with the system implementation to support the writing and execution of composable and integrable big data applications. HDM is a light-weight, functional and strongly-typed meta-data abstraction which contains complete information (such as data format, locations, dependencies and functions between input and output) to support parallel execution of data-driven applications. Exploiting the functional nature of HDM enables deployed applications of HDM to be natively integrable and reusable by other programs and applications. In addition, by analyzing the execution graph and functional semantics of

HDMs, multiple optimizations are provided to automatically improve the execution performance of HDM data flows. Moreover, by extending the kernel of HDM, we propose a multi-cluster solution which enables the capability of performing large scale data analytics among multi-cluster scenarios. Drawing on the comprehensive information maintained by HDM graphs, the runtime execution engine of HDM is also able to provide provenance and history management for submitted applications. In particular, the main contributions of this thesis can be summarized as follows:

• HDM, a lightweight, functional, strongly-typed data abstraction along with its execution engine for developing, describing and executing data-parallel applications.

• Based on the functional data dependency graph, optimizations including function fusion, local aggregation, operation reordering and caching are introduced to improve the performance of HDM jobs.

• Two multi-cluster architecture extensions which enable applying data-oriented ap-

plications on multi-cluster infrastructures with minimum trade-off in scheduling

and multi-cluster coordination.

• Framework-level mechanisms providing support for the composition, integra-

tion, interaction, history and dependency management of HDM jobs to facilitate

the requirements for continuously evolving and integrating applications.

• Comprehensive experiments to evaluate our framework. The benchmarks that

are used in our experiments include: basic primitives, pipelined operations, SQL

queries and iterative jobs (ML algorithms). All the test cases are executed to com-

pare the performance of HDM with the current state-of-the-art big data processing

framework - Apache Spark.

The remainder of this thesis is organized as follows. Chapter 2 presents the literature review and related work. Chapter 3 introduces the representation, attributes, programming model and the system realization of HDM. Chapter 4 presents the general data flow optimizations applied to HDM applications. Chapter 5 presents the architecture and realization to extend HDM towards multi-cluster infrastructures. Chapter 6 describes the dependency and history management of HDM applications. Chapter 7 presents a pipeline framework which supports the execution of data applications across heterogeneous execution contexts. In Chapter 8, we discuss the conclusion and future work of our research.

1.3 Publications

This thesis is based on a series of refereed research papers. The logical mapping between the chapters and the involved papers is summarized in Table 1.1.

The detailed information of each involved paper is listed as follows:

1. Dongyao Wu, Sherif Sakr, Liming Zhu and Huijun Wu. Towards Big Data Analyt-

ics across Multiple Clusters. 17th IEEE/ACM International Symposium on Cluster,

Cloud and Grid Computing, CCGrid’17, 2017. ACM/IEEE Computer Society.

2. Dongyao Wu, Sherif Sakr and Liming Zhu. HDM: Optimized Big Data Process-

ing with Data Provenance. 20th International Conference on Extending Database

Technology, EDBT’17, Venice, Italy, 2017. ACM/IEEE Computer Society. Chapter 1 20

3. Dongyao Wu, Sherif Sakr, Liming Zhu and Qinghua Lu. HDM: A Composable

Framework for Big Data Processing. IEEE Transaction on Big Data, 2016. IEEE

Computer Society.

4. Dongyao Wu, Liming Zhu, Xiwei Xu, Sherif Sakr, Daniel Sun, Qinghua Lu.

Building Pipelines for Heterogeneous Execution Environments for Big Data Pro-

cessing. In IEEE Software, pages 60–67, 2016. IEEE Computer Society. Available

at: https://doi.org/10.1109/MS.2016.35.

5. Dongyao Wu, Sherif Sakr, Liming Zhu, Qinghua Lu. Composable and effi-

cient functional big data processing framework, 2015 IEEE International Con-

ference on Big Data, Big Data’15, pages 279–286. Santa Clara, 2015. IEEE

Computer Society. Available at: https://doi.org/10.1109/BigData.

2015.7363765.

6. Qinghua Lu, Liming Zhu, He Zhang, Dongyao Wu, Zheng Li, Xiwei Xu. MapRe-

duce Job Optimization: A Mapping Study. In Proceedings of the 2015 IEEE

International Conference on Cloud Computing and Big Data, CCBD’15, pages

81–88, Taipei, Taiwan, 2015. IEEE Computer Society. Available at: https://doi.org/10.1109/CCBD.2015.33.

Table 1.1: The logical mapping between the chapters and the involved papers

Chapters    Involved papers
Chapter 1   [1], [3], [4], [5]
Chapter 2   [1], [3], [4], [5], [6], [8]
Chapter 3   [2], [3], [5]
Chapter 4   [2], [3], [5]
Chapter 5   [1]
Chapter 6   [2], [3], [4]
Chapter 7   [4]

7. Donna Xu, Dongyao Wu, Xiwei Xu, Liming Zhu, Len Bass. Making Real

Time Data Analytics Available as a Service. In Quality of Software Architec-

ture, QoSA’15, Montreal, QC, Canada, 2015. ACM. Available at: https://doi.org/10.1145/2737182.2737186.

Moreover, during the Ph.D. study, there are two book chapters published out of the above papers. They are:

8. Dongyao Wu, Sherif Sakr and Liming Zhu. Big Data Programming Mod-

els. Handbook of Big Data Technologies, pages 3–29, 2017. Springer Inter-

national Publishing. ISBN 978-3-319-49340-4. Available at: http://www.springer.com/gp/book/9783319493398.

9. Dongyao Wu, Sherif Sakr and Liming Zhu. Big Data Storage Models. Handbook

of Big Data Technologies, pages 31–63, 2017. Springer International Publishing.

ISBN 978-3-319-49340-4. Available at: http://www.springer.com/gp/book/9783319493398.

Finally, the work of this thesis has also been released as an open source project, available at https://github.com/dwu-csiro/HDM. A demonstration screencast about the framework is also available at https://youtu.be/Gsz7z5bQ1zI.

Chapter 2

Literature Review and Related Work

Figure 2.1: Flow of the Literature Review.

As mentioned in Chapter 1, the goal of this research is to fill the gaps in existing big data processing frameworks from three aspects:

• Provide automatic optimizations for executing big data jobs especially for compli-

cated and pipelined big data applications.

• Enable the capabilities of performing large scale data analytics in highly dis-

tributed environments.

• Offer additional support such as integration, composition and interaction with big data programs/jobs to facilitate the management and maintenance of continuously evolving big data applications.

Figure 2.2: Taxonomy of the Literature Review.

Bearing in mind the research targets of this work, in this chapter, we present a comprehensive literature review for related domains based on the flow shown in Fig. 2.1:

Firstly, we investigate the state-of-the-art big data processing frameworks. Secondly, we present an overall review of current optimizations for big data applications. Thirdly, we present the existing work on supporting big data processing in highly distributed/heterogeneous environments. Lastly, we give an overview of frameworks which support pipelining and integration of complicated big data applications. The taxonomy of frameworks and research works involved in this chapter is shown in Fig. 2.2.

2.1 Big Data Processing Frameworks: State-of-the-Art

2.1.1 MapReduce

In the past decade, several frameworks have been developed for providing distributed big data processing platforms (Sakr et al., 2013). MapReduce (Dean and Ghemawat,

2008) is a commonly used big data processing paradigm which pioneered this domain.

Figure 2.3: Data Flow of MapReduce.

Hadoop (http://hadoop.apache.org/) is the open-source implementation of the MapReduce paradigm. MapReduce uses key-value pairs as the basic data format during processing. Map and Reduce are two primitives which are inherited from functional programming. The semantics of these two primitives are listed as follows:

Map: list<value> → list<key2, value2>

Reduce: list<key2, list<value2>> → list<value3>

A typical MapReduce data flow is shown in Figure 2.3. There is an implied step between Map and Reduce called Shuffle. In the shuffle step, the output of Map is re-partitioned, then copied and merged as the input of Reduce. As we can see in Figure 2.3, the Shuffle step requires N-to-N communication between Mappers and Reducers, so it is a heavily network-intensive operation.
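To make these semantics concrete, below is a minimal word-count sketch written with plain Scala collections (this is not the Hadoop API); the map and reduce functions only mirror the two primitives above, and the shuffle is emulated with a simple groupBy, whereas Hadoop would perform it as a distributed, network-intensive step.

// Minimal illustration of the Map/Reduce semantics using plain Scala collections.
// This is not the Hadoop API; it only mirrors the two primitives described above.
object WordCountSemantics {

  // Map: list<value> -> list<key2, value2>
  def map(line: String): List[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toList

  // Reduce: (key2, list<value2>) -> list<value3>
  def reduce(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))

  def main(args: Array[String]): Unit = {
    val input = List("big data processing", "big data pipelines")

    // The shuffle step (grouping map outputs by key) is emulated with groupBy;
    // in Hadoop this would require N-to-N communication between Mappers and Reducers.
    val shuffled = input.flatMap(map).groupBy(_._1).mapValues(_.map(_._2))

    val result = shuffled.toList.flatMap { case (w, cs) => reduce(w, cs) }
    result.sortBy(_._1).foreach(println)  // prints (big,2), (data,2), ...
  }
}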

In terms of performance, Hadoop/MapReduce jobs are not guaranteed to be fast. All the intermediate data during execution are written into a distributed storage to enable it to recover from failures. This is a trade-off which sacrifices the efficiency of using memory and local storage to gain fault tolerance.


The advantages of MapReduce are:

• High Scalability: Both Map and Reduce functions are designed to facilitate parallelization, so MapReduce applications are generally linearly scalable to thousands of nodes. Although the Shuffle step involves N-to-N communication, this does not prevent MapReduce from scaling to very large clusters.

• Fault Tolerance: Hadoop uses the distributed file system (HDFS) (Shvachko et al.,

2010) for all data (input, output and intermediate) persistence. All data blocks

in HDFS are replicated with 3 copies by default. Thus, if data is lost for some partitions, the final result can still be computed from other copies.

• Simple paradigm: In MapReduce programming, users only need to write the logic

of the Mapper and Reducer while the logic of shuffling, partitioning and sorting is

automatically done by the execution engine. Complex applications and algorithms

can be implemented by connecting a sequence of MapReduce jobs. Due to this

simple programming paradigm, it is much more convenient to write data-driven

parallel applications, because users only need to consider the logic of processing

data in each Mapper and Reducer without worrying about how to parallelize and

coordinate the jobs.

However, MapReduce also has some key limitations:

• Poor Performance: Hadoop/MapReduce jobs are not guaranteed to be fast. All the

intermediate data during execution are written into a distributed storage to enable

crash recovery, which sacrifices the efficiency of using memory and local storage.

For jobs that complete quickly and where the data can fit into memory, using a MapReduce framework is usually not effective.

• Restricted Programming Model: Although the two-primitive-based programming model lowers the learning threshold for users, it also limits the flexibility of

programming. In practice, users have to fit all kinds of queries or jobs into the

MapReduce paradigm, which is difficult and awkward to use. Besides, the MapRe-

duce paradigm requires an acyclic dataflow, which means it cannot support scenarios that require repeated querying and iterative algo-

rithms such as machine learning.

• Little Optimization for Heterogeneous Networks: As MapReduce is designed to be

executed on a single cluster environment, the data flow is constructed with the

assumption that all nodes are symmetrically connected. Thus, when facing more

complicated network situations, the shuffle step of MapReduce will become the

bottleneck in many cases. Although MapReduce has data localization mecha-

nisms, they can only address the problem of where to place the Mappers and Reducers; they cannot solve the problem of the Shuffle step.

2.1.2 Spark

Spark (Zaharia, Chowdhury, Franklin, Shenker and Stoica, 2010) is a big data processing framework which was initially introduced by researchers at UC Berkeley. Spark was initially developed to provide efficient performance for interactive queries and iterative algorithms in response to the limitations of the MapReduce framework, in which those two types of applications are not well supported.

Another major limitation of MapReduce is that it requires all the intermediate output data to be persisted into HDFS. This enables the MapReduce framework to provide strong fault tolerance through checkpointing and data replication in HDFS. However, it also significantly slows down the execution process and harms the overall performance. In comparison, Spark utilizes memory as the major data carrier during execution and takes advantage of lineage (Li et al., 2014), a well known technique in the storage domain, to provide fault tolerance. As a result of effectively using memory during execution and a more compact execution flow, Spark achieves much better performance than MapReduce.

Figure 2.4: An Example of RDD DAG in Spark.

In terms of its programming model, unlike MapReduce which forces distributed programs to be written in a linear and coarsely-defined dataflow as a chain of connected Mapper and Reducer tasks, in Spark, programmers are facilitated by a rich set of high-level function primitives, actions and transformations to implement complicated algorithms in a much easier and more compact way. Essentially, Spark programs are represented as Resilient Distributed Datasets (RDDs) (Zaharia et al., 2012). During job planning and execution, a Spark job is represented as a DAG of RDDs (as shown in Fig. 2.4) and the data dependencies are also maintained by the edges in the graph. Each functional operation or transformation is represented as a parallel step in the DAG while the whole job graph is divided into several execution stages based on the data dependencies between operations. By drawing on the dependencies maintained in the DAG of RDDs, Spark supports computing/re-computing data from the predecessors in the data flow.
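As a concrete illustration of this model, the following is a small word-count job written against Spark's Scala RDD API (the input and output paths are placeholders); each transformation (flatMap, map, reduceByKey) adds a node to the RDD DAG, and the shuffle introduced by reduceByKey marks a stage boundary.

import org.apache.spark.{SparkConf, SparkContext}

// A small word-count job on Spark's RDD API. The chain of transformations below
// forms a DAG of RDDs; reduceByKey introduces a shuffle and thus a stage boundary.
object SparkWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///path/to/input")    // placeholder input path
      .flatMap(line => line.split("\\s+"))               // one record per word
      .map(word => (word, 1))                            // key-value pairs
      .reduceByKey(_ + _)                                // shuffle + aggregation

    // Transformations are lazy; this action triggers the actual execution of the DAG.
    counts.saveAsTextFile("hdfs:///path/to/output")      // placeholder output path

    sc.stop()
  }
}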

To sum up, the advantages of Spark include:

• Fast data processing: By effectively using memory and better scheduling of jobs,

Spark is much faster (20x - 80x) than Hadoop.

• Rich primitives: Spark provides dozens of functional operators which are better

suited for data processing; compared with using Hadoop, users can write their

jobs and tasks in more flexible ways.

The limitations of Spark include:

• Spark is very memory-consuming and less stable (more prone to losing data) than MapReduce as it mainly executes in memory when performing data processing.

• Spark provides rich and flexible interfaces which also induce more complexity in

understanding, tuning and optimizing the programs.

• Like Hadoop, Spark is designed to be executed in single-cluster environments, so it provides very limited support for heterogeneous networks and infrastructures.

2.1.3 Flink

Apache Flink2 is another memory-based distributed data processing framework that can be used as an alternative to MapReduce. Flink originated from the Stratosphere project (Alexandrov et al., 2014) which is a software stack for parallel data analysis and was initially introduced at the Technical University of Berlin in 2009. Flink provides built-in primitives for both batch and stream processing and it leverages a directed graph

approach and also utilizes in-memory storage to improve the performance of job execution.

2https://flink.apache.org/

Figure 2.5: Architecture of Flink Execution Engine. (Alexandrov et al., 2014)

Fig 2.5 shows the architecture of Flink. Basically, user-written scripts are parsed by the Sopremo (Heise et al., 2012) component and interpreted as Parallelization Contract (PACT) (Alexandrov et al., 2010; Battré et al., 2010) programs. The PACT compiler explains and optimizes the PACT programs and then generates the job graph for execution. In Flink, the execution engine is called Nephele, which takes the job graphs generated from PACT and then schedules and executes them in parallel.

In contrast to MapReduce programs, which are chains of Mapper and Reducer objects, Flink considers functions as first-class citizens. Jobs in Flink are represented as directed acyclic graphs (DAG) during execution. Flink relies on its built-in query optimizer to automatically parallelize and optimize the submitted jobs prior to execution.

In particular, Flink leverages the Parallelization Contracts (PACTs) programming model, in which a PACT consists of exactly one second-order function called an Input Contract and an optional Output Contract. Due to the declarative nature of the PACT programming model, the PACT compiler can apply different optimizations to select one from a set of execution plans based on the estimated costs of the PACT programs.

In terms of programming model, Flink is an emerging competitor to Spark as it also provides functional programming interfaces that are quite similar to those of Spark. Flink programs are regular programs which apply a rich set of transformation operations (such as mapping, filtering, grouping, aggregating and joining) to the input data sets. Data sets in Flink are based on a table-based model, so programmers are able to use index numbers to specify a certain field of a data set. Flink shares many functional primitives and transformations with Spark for batch processing. In addition to normal transformations, Flink also natively supports stream processing in its kernel engine. Flink provides a set of window-based operations to apply functions and transformations to different groups of elements in the stream according to their time of arrival.
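The sketch below illustrates this style on Flink's batch API in Scala, including the index-based field addressing mentioned above; the input path is an illustrative assumption.

import org.apache.flink.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val counts = env.readTextFile("hdfs:///input/docs")
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))   // DataSet[(String, Int)]
      .groupBy(0)               // group by field index 0 (the word)
      .sum(1)                   // aggregate field index 1 (the count)
    counts.print()
  }
}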

2.1.4 Other Big Data Processing Frameworks

Apart from MapReduce, Spark and Flink, many other frameworks have been introduced in the past years in both industry and academia to close the gaps and enhance the capability for big data processing, each with a different focus and approach. They are described in more detail in the following sections.

Dryad

Dryad (Isard et al., 2007) was introduced by Microsoft in 2007 as a general purpose execution engine for data-parallel applications. A Dryad application consists of a set of vertices and channels that form a dataflow graph. Each vertex in the data flow is a user-developed program, normally written as sequential procedures. The whole dataflow graph of the application is executed by the Dryad execution engine, which generally schedules and distributes the tasks on the vertices to a set of distributed workers. Dryad achieves concurrency by scheduling and executing multiple vertices simultaneously on a set of computation nodes.

To facilitate the development of data-oriented applications, DryadLINQ (Yu et al., 2008) was introduced to support writing SQL-like data-intensive programs which are executed on top of Dryad. DryadLINQ involves optimizations which are quite similar to those implemented by FlumeJava (Chambers et al., 2010).

SCOPE/Cosmos

SCOPE (Chaiken et al., 2008) is a declarative query processing language originally introduced by Microsoft. SCOPE borrows many features from classic SQL queries while data are modeled as row-based fields with schemas. SCOPE also allows programmers to write their own user-defined operators and supports nested expressions.

At runtime, SCOPE scripts are explained and executed on the Cosmos platform, which is designed for massive data analytics applications without explicit parallelism while being able to execute efficiently in parallel on large clusters. The Cosmos platform is a distributed computation platform which provides high availability, reliability and scalability for executing data-parallel applications. Similar to Dryad, jobs in Cosmos are broken down into smaller units and organized as a DAG with vertices representing processes and edges representing data flow dependencies. The runtime component of the execution engine is called the Job Manager, which is responsible for coordinating all processing units within an application. The Job Manager schedules a DAG vertex onto a set of processing nodes when all the inputs are ready. In addition, the Job Manager also monitors the progress of the processes. If there are failures during execution, the Job Manager can re-schedule and re-execute part of the DAG graph.

Hyracks/ASTERIX

Hyracks (Borkar et al., 2011) is a parallel data flow execution platform which supports effectively dividing computations on large-scale data collections across shared-nothing clusters. Hyracks provides programming primitives such as Mapper, Sorter, Joiner and Aggregator. In addition, it also provides support for expressing data-type-specific operations such as comparisons and hash functions.

ASTERIX (Behm et al., 2011) is a data query and storage engine built on top of Hyracks. ASTERIX is designed based on its semi-structured model called the ASTERIX Data Model (ADM), of which each individual ADM data instance is typed and self-describing. In ASTERIX, data are accessed and manipulated through the use of the ASTERIX Query Language (AQL), which borrows its declarative syntax from XQuery and Jaql.

RHIPE

RHIPE (Guha et al., 2012) is an R-based framework built on top of Hadoop to support statistical processing on big data. RHIPE borrows the D&R (Divide and Recombine) pattern from statistical analytics for the execution and optimization of MapReduce applications. Based on the D&R paradigm, RHIPE also provides two primitives, Divide and Recombine, for writing analytics programs.

During execution, input data are divided into subsets at the Divide stage. Then, analytics operations and functions are applied to each of the subsets, and the outputs of each operation are recombined to form the result for the entire input data. As the direct application of statistical analytics on the entire large input data is very expensive and almost infeasible, RHIPE enables comprehensive analysis while minimizing the risk of losing important information during job parallelization. In addition, the functions in RHIPE are also optimized with statistical approximation to provide better parallelism and performance.

DataMPI

DataMPI (Lu et al., 2014) is an open source framework that leverages the Message Passing Interface (MPI) from parallel computing to support the execution of MapReduce-like applications. DataMPI bridges the gap between high performance computing and big data computing by extending the MPI primitives to support MapReduce-like jobs.

Basically, DataMPI abstracts the key-value pair pattern of MapReduce jobs into a bipartite communication model which captures the essential communication characteristics of MapReduce-like big data applications. DataMPI provides a set of extension functions based on MPI to support operators for MapReduce-like jobs. DataMPI is built on top of the JavaMPI library and uses JNI to connect Java-based routines to native MPI libraries. By drawing on the high performance of MPI primitives, DataMPI significantly improves the efficiency of data transfers for data-intensive jobs during execution.

Apart from DataMPI, there are also other works (Hoefler et al., 2009; Plimpton and Devine, 2011) that try to use other parallel computing frameworks as candidates to improve the performance of data-intensive applications.

2.1.5 Discussion

The majority of current big data processing engines such as Spark and MapReduce do not provide built-in data-flow optimizations for pipelined and complicated jobs. In addition, existing frameworks are mostly designed for single-cluster architectures that assume a homogeneous underlying infrastructure, which does not fit the highly distributed and heterogeneous environments of current, ever larger data centers.

In the following sections of this chapter, we are going to review the related works that are grouped into three main aspects: general optimizations in processing engines, support for highly distributed environments and pipeline frameworks for big data applications.

2.2 Optimizations on Big Data Processing Engines

In this section, we present the existing industry frameworks and related research works that address optimizations for big data applications. Firstly, we present the key industrial frameworks which provide the ability to automatically optimize the execution of big data programs. Secondly, we present a categorized review of research works on optimizations for big data applications.

2.2.1 Optimization frameworks

As the main state-of-the-art big data processing frameworks provide very limited built-in support for automated optimizations of big data jobs, there are several extensions and supporting frameworks that attempt to address the issue outside the execution engine.

Catalyst Optimizer

In Spark, the SparkSQL component uses the Catalyst Optimizer (Armbrust, Xin, Lian, Huai, Liu, Bradley, Meng, Kaftan, Franklin, Ghodsi et al., 2015) to optimize the execution plans of SQL queries by representing SparkSQL queries as trees and applying traditional SQL plan optimizations on the tree structures. Catalyst leverages features of advanced programming languages such as pattern matching and quasiquotes (a Scala notation library that lets developers manipulate Scala syntax trees). Catalyst provides both rule-based and cost-based models for selecting the optimal execution plans.

At its core, Catalyst represents SparkSQL queries as trees and applies rules to manipulate them during the optimization process. Rules in Catalyst are functions that transform one tree into another. A rule can run arbitrary code on the input tree and then construct the output tree. The most common approach to applying rules is to use pattern matching on the tree structure to replace sub-trees with optimized ones at each tree node. In Catalyst, a rule contains a transform method which is applied recursively over all the nodes in the input tree. In practice, a rule can take multiple executions to fully complete the transformation of the tree. During execution, Catalyst groups rules into batches and executes each batch until it reaches a fixed point, at which point the tree stops changing after applying its rules.
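To make the rule mechanism concrete, the following self-contained sketch applies a constant-folding rewrite to a toy expression tree by pattern matching and repeats it until a fixed point is reached; the tree classes and the rule are simplified illustrations, not Spark's actual internal classes.

sealed trait Expr
case class Literal(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Rewrite the tree bottom-up; in Catalyst this corresponds to a rule's transform.
  def apply(e: Expr): Expr = e match {
    case Add(l, r) =>
      (apply(l), apply(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b) // fold two constant children
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }

  // Run the rule repeatedly until the tree stops changing (a fixed point).
  def toFixedPoint(e: Expr): Expr = {
    val next = apply(e)
    if (next == e) e else toFixedPoint(next)
  }
}

// Example: Add(Literal(1), Add(Literal(2), Literal(3))) folds to Literal(6).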

For flexibility, Catalyst is designed to be as extensible as possible, both for adding new optimization techniques and for being easily extended by developers.

PACT

Flink relies on its PACT (Alexandrov et al., 2010; Battré et al., 2010) component (inherited from the Parallelization Contract programming model) to optimize the execution flow of Flink jobs. PACT is a generalization of the MapReduce paradigm which utilizes second-order functions to perform the parallel computation. Flink is an open source platform that has implemented PACTs in its compiler. Basically, in Flink, PACT draws on standard query optimization mechanisms along with specific optimizations for iterative and streaming jobs to improve the performance of Flink applications, which are represented as PACT programs during planning and execution as shown in Fig 2.6.

In principle, a PACT contains a second-order function called an Input Contract and an optional Output Contract. The Input Contract can take task-specific first-order functions

Figure 2.6: PACT Component in Flink. (Alexandrov et al., 2014)

(user defined) and a set of data sets as input parameters. The Input Contracts of PACT include:

• Map: The Map contract has the same semantics as the map function in MapRe-

duce. It has only one input and each record in the input is processed independently

to obtain the output.

• Reduce: The Reduce contract has the same semantics as the reduce function in

MapReduce. It groups the records that have the identical key and each group of

the records is processed independently by the user defined function.

• Cross: The Cross contract takes two inputs and builds the Cartesian product of the records of both inputs. Each record in the Cartesian product is handled separately by the user defined function.

• Match: The Match contract takes two inputs and joins the records that have identical keys from the two inputs as pairs. Each joined pair is processed by the user defined function.

• CoGroup: The CoGroup contract also takes two inputs. CoGroup works in a similar manner to the Reduce contract but it takes records from two inputs. In contrast to Match, the user function is also called if a group only contains records from one of the inputs.

Basically, Input Contracts split the input data into multiple sub-sets which are independently handled by user-defined first-order functions in PACT. The user-defined functions are invoked multiple times with the sub-sets of input data. As the user-defined first-order functions have no side effects, they can be executed in parallel.
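The sketch below expresses the Input Contracts as second-order functions over in-memory sequences to illustrate how each contract splits its inputs into independent calls of a user-defined first-order function; these are simplified signatures for illustration only, not Flink's actual API.

object PactContracts {
  // Map: each record is handled independently by the user-defined function.
  def map[T, R](input: Seq[T])(udf: T => R): Seq[R] =
    input.map(udf)

  // Reduce: records with the same key form a group handled by one call of the udf.
  def reduce[K, T, R](input: Seq[(K, T)])(udf: (K, Seq[T]) => R): Seq[R] =
    input.groupBy(_._1).toSeq.map { case (k, vs) => udf(k, vs.map(_._2)) }

  // Cross: every pair in the Cartesian product of the two inputs is handled separately.
  def cross[A, B, R](left: Seq[A], right: Seq[B])(udf: (A, B) => R): Seq[R] =
    for (a <- left; b <- right) yield udf(a, b)

  // Match: only pairs with identical keys from the two inputs are passed to the udf.
  def matchJoin[K, A, B, R](left: Seq[(K, A)], right: Seq[(K, B)])(udf: (A, B) => R): Seq[R] =
    for ((k, a) <- left; (k2, b) <- right if k == k2) yield udf(a, b)

  // CoGroup: the udf is called once per key, even if only one input has records for it.
  def coGroup[K, A, B, R](left: Seq[(K, A)], right: Seq[(K, B)])
                         (udf: (K, Seq[A], Seq[B]) => R): Seq[R] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map(k => udf(k, left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
  }
}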

In addition, an Output Contract in PACT is an optional component and it allows users to provide guarantees about the data that are generated by the user defined function.

Some commonly used Output Contracts of PACT include:

• Same-Key: Each record pair in the output is generated from the function processing the same key of the input data. This means the output preserves the partitioning and ordering properties of the input during execution.

• Super-Key: Each record pair in the output is generated from the function processing a super-key of the input data. The output preserves the partitioning property and the partial ordering of the input during execution.

• Unique-Key: Every record in the output has a unique key across all the parallel instances and partitions. All data are therefore grouped and partitioned by key.

• Partitioned-by-Key: This contract is similar to the Super-Key contract except that the partitioning indexes are given.

The PACT compiler applies multiple SQL optimization techniques (Selinger et al., 1979) which exploit the information provided by the Output Contracts and apply cost-based estimations and optimizations. In particular, the optimizer generates a set of candidate execution plans in a bottom-up manner (from data sources) and then the more expensive plans are pruned based on the cost estimation.

In terms of data model, PACT uses a generic model of records which are tuple-based data with free schemas. The formats of the fields in the records are up to user defined functions. As a special case, key-value pairs are records with only two fields (key and value). More specifically, the MapReduce paradigm is just an example of PACT as it can be considered as Map and Reduce contracts with key-value pairs as input.

FlumeJava

FlumeJava (Chambers et al., 2010) is a Java library that was introduced by Google in 2010. It is built on MapReduce and provides a higher-level wrapping and a set of optimizations for better execution plans. The core of FlumeJava is a set of classes which represent parallel collections, called PCollection. The parallel collections support a set of specifically defined high-level operations for writing data-oriented parallel applications.

The primitives provided in FlumeJava include:

• ParallelDo: It is the most important primitive in FlumeJava. It provides element-

wise computation on the input collection and generates output collection based on

the parameter of user-defined function DoFn.

• GroupByKey: It groups key-value collections based on the identical keys of the key-value pairs. GroupByKey handles the essence of the Shuffle in MapReduce.

• CombineValues: It takes a grouped collection <Key, Collection<Value>> and a combining function, then returns a collection which preserves the keys but with the aggregated and combined values.

• Flatten: It takes a set of collections and returns a single merged collection that

contains all the elements from the input collections.

FlumeJava defers the evaluation and constructs an execution-plan data flow before the final results are needed. FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives such as MapReduce. The optimizations involved in FlumeJava include:

• Fuse ParallelDos: If there are two connected ParallelDo operations, combine those two operations into one single ParallelDo and apply the two DoFn functions sequentially in the fused ParallelDo operation (see the sketch after this list).

• Sink Flattens: If a Flatten operation is connected with a chain of ParallelDo operations, the Flatten operation can be pushed down by duplicating the ParallelDo before each input to the Flatten. This optimization can create more opportunities for the Fuse ParallelDos optimization.

• Lift CombineValues: If a CombineValues operation is directly connected with a

GroupByKey operation, it will be treated as a ParallelDo operation and subjected

to Fuse ParallelDos optimization.

• Insert fusion blocks: If two GroupByKey operations are connected by a chain of

ParallelDo operations, the optimizer will choose and fuse the ParallelDo opera-

tions up or down into the output/input channels of the GroupByKey operations.

• Fuse MSCRs: An MSCR (short for MapShuffleCombineReduce) operation is an intermediate operation in FlumeJava's optimizer. An MSCR operation takes M inputs and generates N outputs. Basically, MSCR is a generalization of MapReduce that allows multiple Reducers and Combiners. It also allows each Reducer to generate multiple outputs.
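As a concrete illustration of the ParallelDo fusion idea referenced above, the sketch below collapses a chain of element-wise operations into one by composing their functions; the ParallelDo class and names here are illustrative stand-ins, not FlumeJava's actual API.

// Illustrative stand-in for an element-wise operation with its user-defined DoFn.
case class ParallelDo(name: String, doFn: Any => Any)

object Fusion {
  // Fuse adjacent ParallelDos in a linear pipeline into a single operation whose
  // DoFn applies the original functions sequentially, saving intermediate passes.
  def fuse(pipeline: List[ParallelDo]): List[ParallelDo] = pipeline match {
    case a :: b :: rest =>
      fuse(ParallelDo(a.name + "+" + b.name, a.doFn andThen b.doFn) :: rest)
    case short => short
  }
}

// Example: fusing a "split" step followed by a "pair" step yields one "split+pair" ParallelDo.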

The performance benefits from FlumeJava are quite close to those of manually optimized MapReduce jobs, but FlumeJava provides high-level programming abstractions that free programmers from the redundant and tedious optimization process.

Tez

Tez (Saha et al., 2015) is a client-side framework which allows users to build dataflow-oriented applications by specifying a complex directed acyclic graph of tasks. Tez supports both batch and iterative applications and it contains a graph-based optimizer which can significantly optimize MapReduce jobs written in Pig (Olston et al., 2008) and Hive (Huai et al., 2014). Basically, Tez simplifies MapReduce pipelines by combining multiple redundant Mappers and Reducers. In addition, it supports directly streaming the outputs of previous jobs to subsequent ones, which reduces the cost of writing intermediate data into HDFS. The data movement between the vertices can happen in memory, streamed over the network, or written to disk for the sake of checkpointing.

The execution of a DAG of MapReduce jobs on Tez can be more efficient than its execution by Hadoop because of Tez's application of dynamic performance optimizations that use real information about the data and the resources required to process them. The Tez scheduler considers several factors on task assignment, including task-locality requirements, total available resources on the cluster, compatibility of containers, automatic parallelization, priority of pending task requests, and freeing up resources that the application cannot use anymore. It also maintains a connection pool of pre-warmed JVMs with shared registry objects. Therefore, it can improve the performance of executing pipelined jobs when compared with the native MapReduce framework.

Discussion

Although the frameworks above do provide certain optimizations for big data applications written on the related platforms, they are mostly external components or frameworks that work outside the core execution engines. Therefore, there are still opportunities for achieving better performance by leveraging more comprehensive information of which only the execution engine is aware.

• Catalyst only provides optimizations for SQL queries, while the majority of the kernel functional programming interfaces do not receive sufficient optimization.

• PACT has a more declarative programming model and it also provides optimiza-

tions based on standard SQL techniques (Selinger et al., 1979) which are not suf-

ficient to cover all the functional operators that Flink has provided to end users.

• FlumeJava completely re-defines the programming paradigms of MapReduce in

order to get sufficient semantics information during optimization. Thus, develop-

ers need to re-write their big data applications with brand new APIs to obtain the

benefits provided by FlumeJava.

• Tez only works on graph-structure-level optimizations and it does not have sufficiently detailed information about the execution semantics.

We believe that providing built-in optimizers with more complete logical and physical information in the execution kernel can solve the issues that existing external optimization frameworks can hardly solve.

2.2.2 Categorization of Optimizations for Big Data Processing

In the past decade, there have been a lot of research works that provide various optimizations for big data jobs based on MapReduce-like frameworks.

Figure 2.7: Four Main Stages for Big Data Optimizations.

According to our investigation, related research works about the optimizations of big data applications can be mainly classified into the following groups of approaches:

• Performance modeling: Use mathematical methods to model the performance of big data jobs as optimization problems and then find the optimal/near-optimal solutions.

• Data placement: Locate or relocate the input data and its distribution to get better

performance during execution.

• Data movement: Optimize how data is transferred between different nodes in the

execution flow to obtain a better performance.

• Configuration optimization: Tune the configuration parameters to obtain a better

performance.

• Programming model optimization: Provide specific primitives and operations

which achieve better performance for certain scenarios.

• Task scheduling: Optimize scheduling strategies to get more optimal execution

plans for tasks.

• Partitioning: Optimize partition algorithms to get better performance via better distribution of the data.

• Infrastructure service: Utilize underlying infrastructure service such as SDN and

IaaS to tune the performance based on runtime information of jobs.

Additionally, from the perspective of when to apply optimizations, those approaches can be mapped onto four stages of big data jobs: the job preparation stage, job explanation stage, job scheduling stage and job execution stage, as shown in Fig. 2.7.

• Stage I-Job Preparation: Before a specific job is submitted to a big data processing

system such as Hadoop, the basic infrastructure is provided and input data is pre-

pared. At this stage, as there is no specific information about jobs, optimizations

mainly focus on micro-level aspects for clusters, including: performance modeling

(Gu et al., 2014; Han et al., 2013; Heintz et al., 2012; Herodotou, 2011; Lim et al.,

2012; Wang et al., 2009, b), data/worker placement (Gu et al., 2014; Maheshwari

et al., 2012; Wang et al., 2009; Xie et al., 2010), configuration optimization (Babu,

2010; Jiang et al., 2010) and data movement (Maheshwari et al., 2012).

• Stage II -Job Explanation (logical planning): After a job is submitted, it is ex-

plained or compiled as a logical execution model so that the processing platform

knows how to take it into execution. The majority of big data frameworks repre-

sent the execution of jobs using a DAG-based graph. A change of job explanation

sometimes means a change of programming model or operational semantics. Ap-

proaches applied on this stage includes: data movement (Cardosa et al., 2011;

Chowdhury et al., 2011; Kailasam et al., 2014) (Condie et al., 2010; Mandal et al.,

2011) and programming model optimization (Jayalath et al., 2014a; Luo et al.,

2011; Luo and Plale, 2012; Yang et al., 2007).

• Stage III-Job Scheduling (physical planning): With a constructed logical execu-

tion flow, a platform needs to assign resources for execution and coordinate the

communication among task segments to complete the whole job. At this stage,

scheduling policy involves trade-offs among resource utilization, fairness, and

job completion time. It is very hard to achieve all of them at the same time,

so each optimization basically focuses on one or two of them. Approaches in-

volved in the scheduling stage include: task scheduling (Hammoud et al., 2012;

Kailasam et al., 2014; Wang et al., 2013; Zaharia, Borthakur, Sen Sarma, Elmele-

egy, Shenker and Stoica, 2010) (Ahmad et al., 2013; Chen et al., 2012; Hammoud

and Sakr, 2011; He et al., 2012; Luo and Plale, 2012; Palanisamy et al., 2011;

Riteau et al., 2011; Sharma et al., 2013; Su et al., 2011; Zaharia et al., 2008),

worker placement (Chowdhury et al., 2011; Heintz et al., 2012; Jayalath et al.,

2014a,b; Kondikoppa et al., 2012; Luo et al., 2011), data placement (Heintz et al.,

2012; Palanisamy et al., 2011), data movement (Jayalath et al., 2014a; Tomasiˇ c´

et al., 2013), performance modeling (Kondikoppa et al., 2012), infrastructure ser-

vice (Narayan et al., 2012; Palanisamy et al., 2011).

• Stage IV-Job Executing: After a job is scheduled into execution, during this stage,

optimizations can rely on the underlying infrastructure to improve runtime per-

formance: for example SDN and IaaS support. Note that, as a whole a big data

processing job is usually cut into task segments, therefore, job scheduling and job

execution are usually overlapped (some tasks are under scheduling while others are

under execution). Therefore, the optimization approaches in this stage can also be

applied on the scheduling stage. At this stage, the work is mainly focused on data

movement (Hammoud et al., 2012; Jayalath et al., 2014a; Lim et al., 2012), data

placement (Lim et al., 2012), partitioning (Ibrahim et al., 2010) and infrastructure

service (Mattess et al., 2013; Wang, Ng and Shaikh, 2012).

As we can see from the lists above, there have been a lot of works that provide meaningful optimizations for big data applications. The majority of them work on parts of the life cycle of big data jobs as shown in Fig. 2.7. In addition, most of the existing optimizations are based on the MapReduce framework. Therefore, most of the optimization approaches are limited to the MapReduce paradigm. Meanwhile, very few research works can be found that address the optimization of more recent functional big data processing frameworks such as Spark and Flink.

2.3 Big Data Processing on Heterogeneous Environment

The MapReduce framework was originally designed to operate in single-cluster environments. Therefore, it is not well developed to support execution on highly distributed infrastructures and widely-networked clusters. To address this issue, many research works have attempted to extend the MapReduce framework to support highly distributed environments.

G-Hadoop (Wang et al., 2013) enables Hadoop to support scheduling MapReduce jobs across multiple data centers/clusters. As a result, it can provide a larger pool for job execution and data storage. G-Hadoop involves a two-layer architecture where each layer is responsible for resource management and scheduling: the top layer is responsible for inter-cluster management of slave masters; the bottom layer is responsible for intra-cluster management of the actual computing nodes. However, G-Hadoop is designed for High End Computing (HEC) clusters which require specialized high-performance networks such as InfiniBand. Therefore, it does not provide specific optimization for communication across data centers.

Jayalath, et al. presented G-MR (Jayalath et al., 2014a), which supports the execution of MapReduce jobs on geo-distributed datasets (data distributed among multiple data centers). Besides, G-MR optimizes data movement by using the Data Transformation Graph (DTG). By finding the shortest weighted path in the DTG, it can reduce the communication cost of most MapReduce jobs by around 35%. However, the complexity of constructing and analyzing the DTG is significantly high, which makes it less scalable to very large data sets. Additionally, the optimization of G-MR is highly associated with the MapReduce paradigm.

Luo, et al. presented a Hierarchical MapReduce framework (Luo and Plale, 2012) which introduces a global Reduce operation and locality-aware scheduling. They also presented another hierarchical framework (Luo et al., 2011) which can coordinate multiple clusters to run MapReduce jobs among them.

As SQL-based queries are one of the most commonly used interfaces for data processing and analytics, there is a group of works that support geo-distributed queries for highly distributed environments. CLARINET (Viswanathan et al., 2016) optimizes query execution plans with WAN-awareness to improve the performance of SQL queries in geo-distributed analytics. Iridium (Pu et al., 2015) achieves low query response time by optimizing the placement of both tasks and data. Iridium uses online heuristics to redistribute data sets among sites prior to query evaluation. Vulimiri, et al. implemented a prototype, Geode (Vulimiri et al., 2015), which schedules and optimizes query plans and data replications to improve SQL analytics over geographically distributed data sets. While those works focus on SQL query planning, general data analytics programs (such as machine learning and graph-based algorithms) can hardly benefit from their optimizations.

There are also many other works that try to extend the MapReduce framework for highly distributed environments such as Grids (He et al., 2012) (Su et al., 2011), multi-clusters/clouds (Kailasam et al., 2014) (Wang, Tao, Marten, Streit, Khan, Kolodziej and Chen, 2012) (Riteau et al., 2011) (Heintz et al., 2012) (Jayalath et al., 2014a) (Kondikoppa et al., 2012) or heterogeneous clusters (Sharma et al., 2013) (Zaharia et al., 2008). A comparison of key related works for multi-cluster and highly distributed environments is listed in Table 2.1.

Table 2.1: Comparison of related works on big data processing in multi-cluster/highly distributed environments. [The table compares Hog (He et al., 2012), Ussop (Su et al., 2011), Iridium (Pu et al., 2015), Geode (Vulimiri et al., 2015), SWAG, G-MR (Jayalath et al., 2014a), G-Hadoop (Wang et al., 2013) and CLARINET (Viswanathan et al., 2016) in terms of execution model (MapReduce or SQL queries via Hive), optimization approach (e.g. scheduling, query planning/optimization, data and task placement, location-aware scheduling, data flow analysis, extending MapReduce to the Grid), cluster architecture (all centralized), network awareness and multi-cluster support.]

Additionally, as the current MapReduce implementations are mainly based on the homogeneous-cluster assumption, there have been a lot of works that try to optimize the framework for heterogeneous clusters. Zaharia, et al. presented the LATE (Longest Approximate Time to End) scheduling algorithm (Zaharia et al., 2008) to improve the performance of executing MapReduce jobs on heterogeneous clusters. HybridMR (Sharma et al., 2013) introduces a two-layer scheduler for hybrid clusters. Adaptive schedulers have been developed to improve the performance of MapReduce for heterogeneous workloads (Tian et al., 2009) (Polo et al., 2010), heterogeneous hardware (Polo et al., 2010) and varying environments (Chen et al., 2010). Data locality awareness (Zhang et al., 2011) and data placement (Xie et al., 2010) are also considered as meaningful factors during optimizations for heterogeneous environments. Furthermore, frameworks such as Tarazu (Ahmad et al., 2012) and MARLA (Fadika et al., 2012) have been developed to provide more comprehensive optimization throughout the MapReduce scheduling and execution stages.

Through investigating this group of works, we found that a number of works have tried to provide support for highly distributed infrastructures such as multi-clusters and clouds. However, most of them merely provide the feasibility of performing data-driven jobs on those highly distributed infrastructures, and they therefore lack the detailed systematic optimizations and solutions needed to support multi-cluster/cloud architectures.

In addition, there are numerous works that provide optimizations or extensions to improve the performance of executing MapReduce-like jobs on clusters with heterogeneous computation power. Although these works cannot be simply applied to multi-cluster infrastructures, they provide useful insights and inspiration for our multi-cluster solution.

2.4 Pipelining and Integration in Big Data Processing

Many real-world scenarios require pipelining and integration of multiple data processing and analytics jobs. To support the integration and pipelining of big data jobs, many higher-level pipeline frameworks have been proposed. In this section, we present an investigation and comparison of the major existing pipeline frameworks in big data ecosystems.

2.4.1 Pipeline Frameworks

Apache Pig

Table 2.2: Basic Relational Operators in Pig Latin.
Operators   Description
LOAD        Load data from underlying file systems.
FILTER      Select matched tuples from a data set based on some conditions.
FOREACH     Generate new data transformations based on each column of a data set.
GROUP       Group a data set based on some relations.
JOIN        Join two or more data sets based on expressions of the values of their column fields.
ORDERBY     Sort the data set based on one or more columns.
DISTINCT    Remove duplicated elements from a given data set.
MAPREDUCE   Execute native MapReduce jobs inside the Pig scripts.
LIMIT       Limit the number of elements in the output.

Apache Pig (Olston et al., 2008) is a high-level platform for creating data-centric programs on top of Hadoop. The programming interface of Pig is called Pig Latin, which is an ETL-like query language. Table 2.2 shows the basic relational operators provided in Pig Latin. In comparison to SQL, Pig Latin uses extract, transform, load (ETL) as its basic primitives. During the execution of Pig programs, it is able to store data at any point in a data pipeline. At the same time, Pig supports the ability to declare execution plans and pipeline splits. Thus, it allows workflows to proceed along DAGs instead of strictly sequential pipelines. Lastly, Pig Latin scripts are automatically compiled to generate equivalent MapReduce jobs for execution.

In addition, Pig also provides a degree of reusability by supporting the registration and loading of User Defined Functions in Pig scripts. Pig offers a set of operators to support transformation and manipulation of input data sets.

Apache MRQL

Apache MRQL3 is a framework that has been introduced as a query processing and optimization framework for distributed and large-scale data analysis, built on top of Apache Hadoop, Spark, Hama and Flink. In particular, it provides an SQL-like query language that can be evaluated in four independent modes: MapReduce mode using Apache Hadoop, Spark mode using Apache Spark, BSP mode using Apache Hama, and Flink mode using Apache Flink.

The query language of MRQL is sufficiently expressive to express common data analytics over several forms of raw data such as XML, JSON, CSV and binary files. Therefore, it is considered more powerful than other high-level MapReduce languages such as Hive and Pig Latin, as it can operate more complex jobs on more diversified sets of data. In addition, MRQL also allows users to write complex data analysis tasks such as PageRank, KMeans and Matrix Factorization using SQL-like queries.

Apache Crunch

Apache Crunch4 is a high-level library that supports writing, testing and running data-driven pipelines on top of Hadoop and Spark. The programming interface of Crunch is partially inspired by Google's FlumeJava (Chambers et al., 2010). Crunch wraps the native MapReduce interface into high-level declarative primitives such as parallelDo, groupByKey, combineValues and union to make it easy for programmers to write and read their applications. Crunch provides a number of high-level processing patterns (as shown in Table 2.3) to facilitate developers in writing data-centered applications.

3https://mrql.incubator.apache.org/
4https://crunch.apache.org/

Table 2.3: Common Data Processing Patterns in Crunch.
Pattern         Description
groupByKey      Group and shuffle a data set based on the key of the tuples.
combineValues   Aggregate elements in a grouped data set based on the combination function.
aggregations    Common aggregation patterns are provided as methods on the PCollection data type, including count, max, min, and length.
join            Join two keyed data sets by grouping the elements with the same key.
sorting         Sort a data set based on the value of a selected column.

In Crunch, each job is considered as a Pipeline and data are considered as Collections. Programmers write their processing logic within DoFn interfaces and use basic primitives to apply transformation, filtering, aggregation and sorting to the input data sets to implement the expected applications.

Cascading

Cascading5 is another software abstraction layer for the Hadoop framework. Cascading supports the creation and execution of data processing workflows on a Hadoop cluster using any JVM-based language (Java, JRuby, etc.), hiding the underlying complexity of the Hadoop framework. Cascading follows the Source-Pipe-Sink paradigm in programming. A Cascading job is defined as a flow, which can contain multiple pipes. Each pipe is actually a function block which is responsible for a certain data processing step such as GroupBy, Filtering, Joining and Sorting. Pipes are connected to construct the final Flow for execution. In addition, pipes and flows can also be reused and reordered to support different integration and composition needs.

One primary feature of Cascading is its portability. Cascading abstracts the standard data processing operations from the underlying execution engines while keeping the scalability of distributed execution. Users are encouraged to separate programming logic from integration logic when writing Cascading programs. Therefore, Cascading scripts can be ported to be executed on multiple platforms (e.g. Hadoop, Flink and Tez) with changes of only a few lines of code.

5http://www.cascading.org/

Flume

Figure 2.8: A Basic Component of Flume.

Flume6 is a distributed, fault-tolerant framework that was originally designed for log-based pipelines. Flume offers convenient services for collecting, aggregating and moving large-scale log data. It also allows the creation of a pipeline using configuration files and parameters, and uses a simple extensible data model to facilitate online analytic applications. In the data flow of Flume, events are used as the input data and the events flow from Source to Channel and then to Sink, as shown in Fig 2.8. A Sink can be appended with another Source to construct a data pipeline.

Flume is also designed to be highly reliable during execution and provides end-to-end reliability for events. In the flow of Flume, Sources and Sinks encapsulate the retrieval and storage operations into Transactions. A Sink only removes an Event after it has been stored into the Channel of the next agent or the terminal repository. Based on the transactional mechanisms provided in Flume, the reliable delivery of Events is guaranteed.

6https://flume.apache.org/

2.4.2 Discussion

Table 2.4 provides an overview and comparison of the main pipelining frameworks for big data applications.

Table 2.4: Comparison of Existing Data Pipeline Frameworks
                        Crunch            Pig                   Flume                Cascading
Execution context       MR & Spark        MR                    Flume                MR (Spark and Flink soon)
Pipe definition         Operators         Queries               Configurations       Operators
Pipeline connection     Source, Target    Query composition     Source, Sink         Branches, Joins, mapTo
Data model              PCollection       Schema                No abstraction       Pipe
Programming languages   Java              Pig Latin             Configuration files  Scala / Java
Interaction             non-interactive   interactive in Shell  non-interactive      interactive asynchronously

As we can see from the table above, most of the pipeline frameworks define their own DSL (Domain Specific Language) to support the integration and composition of multiple data processing components (i.e. pipes in a pipeline). However, none of the pipeline frameworks that we have investigated provides sufficient support for data provenance and other facilities (such as version control, traceability and reproducibility) for the continuous integration of evolving applications.

2.5 Conclusion and Summary

In this chapter, we have systematically reviewed related works on big data processing engines, general optimizations for these processing engines, approaches for highly distributed or heterogeneous environments and pipeline frameworks in big data stacks. After the literature review, we discovered that there are still gaps in the following directions to which we can provide improvements:

• The kernels of the majority of current big data processing engines such as Spark and MapReduce do not provide built-in support for data-flow optimizations for pipelined and complicated jobs.

• Existing big data processing frameworks are mostly designed for single-cluster architectures with the assumption of a homogeneous underlying infrastructure. This assumption does not fit the highly distributed and heterogeneous environments in which current data centers are becoming larger and larger.

• Although there are several frameworks which provide some optimizations to big

data applications written in MapReduce-like platforms, these frameworks are ei-

ther working as external components (Tez, FlumeJava) or only provide SQL-based

optimizations (Catalyst and PACT). Therefore, there is still a lack of comprehen-

sive optimizations which cover the kernel primitives for big data processing en-

gines.

• There have been a lot of research works that provide meaningful optimizations

for big data applications. However, the majority of these works apply the opti-

mizations only on parts of the life cycle of big data jobs. In addition, most of the

existing optimizations are based on the MapReduce framework and a very limited

number of research works are found for optimizations in functional data process-

ing frameworks such as Spark and Flink.

• There have been some works that attempted to provide support for highly dis-

tributed infrastructures such as multi-clusters and clouds. However, most of them

merely provide the feasibility for performing data-driven jobs on highly distributed

infrastructures and there is a lack of systematic support and optimizations for

multi-cluster/clouds architectures and solutions.

• There is clearly a lack of sufficient support for data provenance related facilities

such as version controlling, traceability and reproducibility. However, those fea-

tures are of great significance for maintaining and supporting continuous integra-

tion of evolving applications.

To fill the gaps of existing big data processing frameworks discussed above, we propose our solution with the following three main targets:

• Core big data processing engines should provide the ability to automatically optimize big data applications/jobs in order to provide overall optimized performance.

• Design/develop an architecture with native support for single- and multi-cluster environments, and optimizations which make it feasible to run big data applications across highly distributed environments such as multi-clusters.

• Provide additional support functions such as composition, version controlling,

traceability and reproducibility which are very significant for continuous integra-

tion and the maintenance of evolving applications in practice.

In the following chapters of the thesis, we present our solution, the Hierarchically Distributed Data Matrix (HDM), which aims to achieve the three targets above step by step.

Chapter 3

A Functional Meta-data Abstraction for Big Data Processing - HDM

Programming abstraction is a core component of big data processing frameworks. In this chapter, we introduce our solution, the Hierarchically Distributed Data Matrix (HDM), which is a functional, strongly-typed meta-data abstraction for writing data-parallel programs.

3.1 Attributes of HDM

Basically, a HDM is represented as HDM[T, R], in which T and R are the data types of its input and output, respectively. The HDM itself represents the function that transforms data from input to output. Apart from these core attributes, a HDM also contains information such as data dependencies, location and distribution to support optimization and execution. The attributes of a HDM are listed in Table 3.1. These attributes are chosen based on the design requirements of HDM. 'inType' and 'outType' are used to guarantee type correctness during optimization and composition. 'category' is used to differentiate a Distributed Functional Matrix (DFM) from a Distributed Data Matrix (DDM) during


Table 3.1: Attributes of HDM
Attribute      Description
ID             The identifier of a HDM. It must be unique within each HDM context.
inType         The input data type for computing this HDM.
outType        The output data type of computing this HDM.
category       The node type of this HDM. It refers to either DFM or DDM.
children       The source HDMs of a HDM. It describes from where this HDM can be computed.
distribution   The distribution relation of the children blocks, either horizontal or vertical.
dependency     The data dependency for computing this HDM from its children. There are four types of data dependencies: 1:1, 1:N, N:1 and N:N.
function       The function applied on the input to calculate the output. This function can be a composed one and must have the same input and output types as this HDM.
blocks         The data blocks of this HDM. For a DFM it can be an array of IDs of children DDMs; this field is only available after all children of this HDM are computed.
location       The URL address of this HDM on local or remote nodes. For a DDM, the actual data are loaded according to the protocol in the URL, such as hdfs, file, mysql and hdm.
state          The current state of this HDM. A HDM can exist in different phases such as Declared, Computed and Removed.

job planning, optimization and execution. 'children' and 'dependency' are used to reconstruct the HDM DAG during job planning and optimization. The attribute 'function' is the core function of the HDM; it describes how to compute the output of this HDM. The 'blocks' attribute is used to specify the location of the output of this HDM and it can be used as the input of a subsequent HDM computation. 'location' represents in which context or where the HDM is declared and managed. 'state' is used to manage and check the runtime status of HDMs. Based on the attributes above (a structural sketch is given after the feature list below), HDM supports the following basic features:

• Functional: A HDM is essentially a structured representation of a function that

computes the output from some input. The computation of a HDM is focused on

the evaluation of the contained function on the input data set (as children in HDM).

During the computation of a HDM, no side effects are involved.

• Strongly-typed: HDM contains at least two explicit data types, the input type and

output type, which are derived from the formats of the input and output based on

the enclosed function. Note that strongly-typed in HDM means that type informa-

tion is explicitly included in HDM job interpretation, optimization and composi-

tion to guarantee the compatibility of data types.

• Portable: A HDM is an independent object that contains complete information for

a computation task. Therefore, a HDM task is portable and can be moved to any

nodes within the HDM context for execution.

• Location-aware: HDMs contain the location information (represented as formatted URLs) of their inputs and outputs. Although some location information is only available at runtime, it facilitates applying optimizations for data localization during the planning phases.
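As a structural illustration of the attributes in Table 3.1 and the features above, the following sketch models a HDM as a Scala case class; the field types and helper traits are illustrative assumptions derived from the descriptions, not the actual implementation.

sealed trait Distribution
case object Horizontal extends Distribution
case object Vertical extends Distribution

sealed trait Dependency
case object OneToOne extends Dependency
case object OneToN extends Dependency
case object NToOne extends Dependency
case object NToN extends Dependency

sealed trait HdmState
case object Declared extends HdmState
case object Computed extends HdmState
case object Removed extends HdmState

// A HDM[T, R] transforms input records of type T into output records of type R.
case class HDM[T, R](
  id: String,                   // unique within an HDM context
  category: String,             // "DFM" or "DDM"
  children: Seq[HDM[_, T]],     // source HDMs from which this HDM is computed
  distribution: Distribution,   // distribution relation of the children blocks
  dependency: Dependency,       // 1:1, 1:N, N:1 or N:N
  function: Seq[T] => Seq[R],   // how input records are transformed into output records
  blocks: Seq[String] = Nil,    // IDs/URLs of the computed data blocks
  location: String = "",        // URL of this HDM (e.g. hdfs, file, mysql, hdm protocols)
  state: HdmState = Declared    // Declared, Computed or Removed
)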

3.2 Categorization of HDM

As shown in Fig. 3.1, HDM is a tree-based structure which consists of the following two types of nodes:

• Distributed Data Matrix (DDM): The leaf-nodes in a HDM hierarchy hold the

references of actual data and are responsible for performing atomic operations on

data blocks. A DDM maintains the actual information such as: ID, size, location

and status of one data block.

• Distributed Functional Matrix (DFM): The non-leaf nodes hold both the operational and distribution relations for their children HDMs; a DFM holds the specific function describing how to compute its output from its children HDMs (which can be either DFMs or DDMs). During execution, it is also responsible for collecting and aggre-

gating the results from children nodes when necessary.

Figure 3.1: Data Model of HDM.

From the functional perspective, a DDM can be considered as a function which maps a path to an actual data set. Essentially, a DDM can be represented as HDM [Path, T].

During execution, data parsers are wrapped to load data from the data path according to their protocols and then the input is transformed to the expected outgoing formats of the

DDM. A DFM is considered as a higher-level representation which focuses on the functional dependencies of HDMs to serve the planning phases. Before execution, DFMs are further explained as DDM dependencies according to data locations and the expected parallelism.

The separation of DFM and DDM provides different levels of views to support different levels of planning and optimization. In addition, the hierarchy of DFM and DDM also ensures that the local computation on a data node is not concerned with data movement and coordination between siblings, thus leaving the parent nodes free to apply the aggregation steps.

3.3 Data Dependencies of HDM

In principle, data dependencies between HDMs affect when and how to compute HDMs from their children or predecessors. In particular, by performing operations on HDM, data dependencies are implicitly added between pre and post HDM nodes in the data

flow. Basically, there are four types of dependencies in HDM (as shown in Fig. 3.2):

• One-To-One (1:1): One partition of input is only used to compute one partition of

the output; Therefore, different partitions of the HDM can be executed in parallel

without any intercommunication. Operations such as Map, Filter, Find would

introduce a One-To-One dependency in the dataflow.

• One-To-N (1:N): One partition of input is used to compute multiple partitions of

the output while one output partition only requires the input from one partition;

Figure 3.2: Data Dependencies of HDM.

Depending on the partition function, a Partition/Repartition operation would introduce a One-To-N dependency in the dataflow.

• N-To-One (N:1): One partition of input is only used to compute one partition of

the output while one output partition requires multiple input partitions; Operations

such as Join, Reduce, ReduceByKey would introduce a N-To-One dependency in

the dataflow.

• N-To-N (N:N): Any other dependencies are considered as N-To-N dependencies, where one partition of input is used to compute multiple output partitions while one output partition also requires multiple input partitions. GroupBy, CoGroup and some specific Partition operations introduce N-to-N dependencies to the dataflow.

In practice, data dependency information represents a crucial aspect during both execution and optimization in order to decide how a HDM is computed, and which optimizations can be applied on the data flow, if at all.

3.4 Programming on HDM

One major target of contemporary big data processing frameworks is to ease the complexity of developing data-parallel programs and applications. In HDM, functions and operations are defined separately to balance between performance and programming flexibility.

3.4.1 HDM Functions

In HDM, a function specifies how input data are transformed into the output. Functions in HDM have different semantics targeting different execution contexts. Basically, one HDM function may have three possible semantics, indicated as Fp, Fa, Fc:

Fp : List[T] → List[R]   (3.1)

Fa : (List[T], List[R]) → List[R]   (3.2)

Fc : (List[R], List[R]) → List[R]   (3.3)

Fp is the basic semantics of a function, which specifies how to process one data block. The basic semantics of a HDM function assume that the input data is organized as a sequence of records of type T. Similarly, the output of all the functions is also considered as a sequence of records. Based on type compatibility, multiple functions can be directly pipelined.

Fa is the aggregation semantics of a function, which specifies how to incrementally aggregate a new input partition into the existing results of this function. Functions are required to be performed on multiple data partitions when the input is too large to fit into one task. The aggregation semantics are very useful in situations in which accumulative processing can achieve better performance. Aggregation semantics exist for a function only when it can be represented and calculated in an accumulative manner.

Table 3.2: Semantics of basic functions
Function                   Semantics
Null                       Do nothing but return the input.
Map (f : T → R)            Fp: List[T].map(f)
                           Fa: List1[R] += List2[T].map(f)
                           Fc: List1[R] + List2[R]
GroupBy (f : T → K)        Fp: List[T].groupBy(f)
                           Fa: List1[K,T] ∪ List2[T].groupBy(f)
                           Fc: List1[K,T] ∪ List2[K,T]
Reduce (f : (T,T) → T)     Fp: List[T].reduce(f)
                           Fa: List2[T].foldBy(zero = T1)(f)
                           Fc: f(T1, T2)
Filter (f : T → Bool)      Fp: List[T].filterBy(f)
                           Fa: List1[T] += List2[T].filterBy(f)
                           Fc: List1[T] + List2[T]

Fc is the combination semantics for merging multiple intermediate results from a series of sub-functions to obtain the final global output. It is also a complement to the aggregation semantics when a function is decomposable using the divide-combine pattern.
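As a small illustration of how the three semantics relate, the sketch below expresses Fp, Fa and Fc for a Reduce function f: (T, T) => T along the lines of Table 3.2; the class and method names are illustrative only.

// Reduce with f: (T, T) => T, expressed through its three semantics.
case class ReduceFunc[T](f: (T, T) => T) {
  // Fp: process one block of records into a single-element result list.
  def fp(block: List[T]): List[T] = List(block.reduce(f))
  // Fa: incrementally fold a new partition into the existing (assumed non-empty) result.
  def fa(acc: List[T], block: List[T]): List[T] = List(block.foldLeft(acc.head)(f))
  // Fc: combine two intermediate results produced by parallel sub-functions.
  def fc(left: List[T], right: List[T]): List[T] = List(f(left.head, right.head))
}

// Example: for summing Int partitions, fp computes a partial sum per block, fa folds new
// blocks into it, and fc merges the partial sums produced by parallel sub-tasks.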

During the explanation of HDM jobs, different semantics are automatically chosen by the planners to hide functional-level optimizations from users. An illustration of the semantics of some basic HDM functions is given in Table 3.2. During programming, operations in HDM are exposed as functional interfaces for users to use. Due to the declarative and more powerful abstractions offered by functional interfaces, users are able to write a WordCount program in HDM as shown in Fig. 3.3.

wordcount = HDM.string("path1", "path2").map(_.split(","))
    .flatMap(w ⇒ (w, 1))
    .groupBy(t ⇒ t._1).reduceByKey(_ + _)

Figure 3.3: Example of writing a word-count program in HDM

3.4.2 HDM Composition

In functional composition, one function f : X → Y can be composed with another function g : Y → Z to produce a higher-order function h : X → Z which maps X to g(f(X)) in Z. HDM inherits the idea of functional composition to support two basic types of composition:

HDM[T,R] compose HDM[I,T ] ⇒ HDM[I,R] (3.4)

HDM[T,R] andT hen HDM[R,U] ⇒ HDM[T,U] (3.5)

• compose: A HDM with input type T and output type R can accept a HDM with

input type I and output type T as an input HDM to produce a HDM with input I

and output R.

• andThen: A HDM with input type T and output type R can be followed by a HDM

with input type any R and output type U as the post-operation to produce a new

HDM with input T and output U.

These two patterns are commonly used in functional programming and can be applied recursively to HDM sequences to achieve complicated composition requirements. In our system, composition operations are implemented as the basic primitives for HDM compositions and data flow optimizations.
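As a minimal illustration of the two patterns in equations (3.4) and (3.5), the sketch below uses plain Scala functions rather than HDM objects, so the names are only for exposition:

val parse: String => Array[String] = _.split(",")   // plays the role of HDM[String, Array[String]]
val count: Array[String] => Int = _.length          // plays the role of HDM[Array[String], Int]

val pipeline1: String => Int = parse andThen count  // andThen: parse followed by count
val pipeline2: String => Int = count compose parse  // compose: count accepts parse as its input

assert(pipeline1("a,b,c") == 3 && pipeline2("a,b,c") == 3)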

3.4.3 Interaction with HDM

HDM applications are designed to be interactive at runtime in an asynchronous manner. In particular, HDM programs can be written and embedded into other programs as normal code segments. Then, by triggering the action interfaces (listed in Table 3.3), jobs are dynamically submitted to the related execution context, which can be either a multi-core threading pool or a cluster of workers. Fig. 3.4 shows how to interact with the WordCount job and print out the output on the client side.

Table 3.3: Actions and responses for integration

compute  – References of computed HDMs.
sample   – Iterator of a sampled subset of the computed records.
count    – Length of computed results.
traverse – Iterator of all computed records.
trace    – Iterator of task information for the last execution.

wordcount.traverse(context = "10.10.0.100:8999") onComplete {
    case Success(resp) => resp.foreach(println)
    case Failure(exception) => println(exception)
}

Figure 3.4: Print output of HDM on the client side

3.5 System Implementation

The kernel of the HDM runtime system is designed to support the execution, coordination and management of HDM programs. In the current version, only memory-based execution is supported in order to achieve better performance.

3.5.1 Architecture Overview

Fig. 3.5 illustrates the system architecture of the HDM runtime environment, which is composed of three major components:

Figure 3.5: System Architecture of HDM Runtime System.

• Runtime Engine is responsible for the management of HDM jobs, such as explaining, optimization, scheduling and execution. Within the runtime engine, the App Manager manages the information of all deployed jobs; it maintains the job descriptions, logical plans and data types of HDM jobs to support the composition and monitoring of applications. The Task Manager maintains the activated tasks for runtime scheduling in the Schedulers. The Planners and Optimizers interpret and optimize the execution plans of HDMs in the explanation phases. The HDM Manager maintains the HDM information and states on each node of the cluster, and together these are coordinated as an in-memory cache of HDM blocks. The Executor Context is an abstraction component that supports the execution of scheduled tasks on either local or remote nodes.

• Coordination Service is composed of three types of coordination: cluster coordination, HDM block coordination and executor coordination. These are responsible for the coordination and management of node resources, distributed HDM blocks and distributed executions within the cluster context, respectively.

• IO Interface is a wrapped interface layer for data transfer, communication and persistence. IO interfaces are categorized as transportation interfaces and storage interfaces in the implementation. The former are responsible for communication and data transportation between distributed nodes, while the latter are mainly responsible for reading and writing data on storage systems.

In the following parts of this section, additional details about the major components are presented.

3.5.2 Runtime Engine

Figure 3.6: Process of executing HDM jobs.

The main responsibility of the components in the runtime engine is the coordination and cooperation of tasks so that the jobs specified as HDMs can be completed successfully. Fig. 3.6 shows the main process of executing HDM jobs in the runtime system.

As shown, the main phases of executing HDM jobs are logical planning, optimization, physical planning, scheduling and execution. Before execution, HDMs need to be explained as executable tasks for executors. The explanation process is divided into two sub-steps: logical planning and physical planning.

Logical Planning

In the logical planning step, a HDM program is represented as a data flow in which every node is a HDM object that keeps the information about data dependencies, transformation functions and input/output formats. The logical planning algorithm is presented in Algorithm 1. Basically, the planner traverses the HDM tree from the root node in a depth-first manner and extracts all the nodes into the resulting HDM list, which contains all the nodes of the logical data flow.

Algorithm 1: LogicalPlan
Data: a HDM h for computation
Result: a list of HDMs List_h sorted by dependency
begin
    if children of h is not empty then
        for each c in children of h do
            List_h += LogicalPlan(c);
        end
        List_h += h;
    else
        return h;
    end
    return List_h;
end

After the construction of the data flow, all the necessary HDMs will be declared and registered into the HDM Block Manager. In the next step, optimizations are performed on the logical data flow based on the rules discussed in Chapter 4. At this point, the logical data flow is still an intermediate format for execution. In order to make the job fully understandable and executable for the executors, further explanation is required in the physical planning phase.
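For reference, the following is a minimal Scala sketch of the depth-first traversal in Algorithm 1, assuming a simplified node type; the real HDM classes carry far more information:

// Children are emitted before their parent, so the list is sorted by dependency.
case class Node(name: String, children: List[Node] = Nil)

def logicalPlan(h: Node): List[Node] =
  if (h.children.isEmpty) List(h)
  else h.children.flatMap(logicalPlan) :+ h

// e.g. logicalPlan(Node("reduce", List(Node("map", List(Node("src"))))))
//      returns the nodes in the order src, map, reduce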

Physical Planning

In the physical planning phase, given the logical data flow, the planner explains it further into a DAG task graph according to the parallelism with which the job is expected to be executed. The pseudo code of physical planning is presented in Algorithm 2. Essentially, physical planning is a low-level explanation which splits the global HDM operations into parallel ones according to the parallelism and the data locations.

Figure 3.7: Physical execution graph of HDM (parallelism = 4).

As presented in Fig. 3.7, the physical planning phase contains three main steps:

• First, the data sources are grouped/clustered based on location distance and data size.

• Second, the subsequent operations are divided into multiple phases based on the boundaries of the shuffle dependencies.

• Last, the parallel operations in each phase are split into multiple branches according to the expected execution parallelism (p = 4). Then every branch in each phase of the graph is assigned as a task for follow-up scheduling and execution.

Algorithm 2: PhysicalPlan
Data: the root HDM h and the parallelism p expected for execution
Result: a list of HDMs List_h sorted by dependency
begin
    if children of h is empty then
        ls ← block locations of h;
        gs ← cluster ls into p groups according to distance;
        for each g in gs do
            d ← create a new DFM with input g and the function of h;
            List_h += d;
        end
    else
        for each c in children of h do
            List_h += PhysicalPlan(c);
        end
        ds ← split h into p branches of DFM;
        List_h += ds;
    end
    return List_h;
end

Job Scheduling

In the scheduling phase, the current HDM scheduler provides three basic scheduling policies:

• FIFO: Tasks are scheduled according to the time order of their arrival.

• Min-min: Tasks are scheduled based on the minimum expected completion time. 1) First, the minimum expected completion time is computed for every task over all resources; 2) second, the task with the minimum value within this set is selected for execution.

• Max-min: This has the same first step as Min-min scheduling, but in the second step the task with the maximum expected completion time is selected for execution.

The last two policies are inherited from classic Grid Computing scheduling. In practice, Min-min scheduling is more suitable for heavier tasks whereas Max-min scheduling is more suitable for lighter tasks. For both of these scheduling policies, an estimation function is required to calculate the expected completion time of each task on each candidate worker. In the current HDM scheduler, the estimation function is defined as Equation 3.6, in which the total time is calculated as the sum of CPU time (Tc), disk IO time (Td) and network IO time (Tn).

T = Tc + Td + Tn = d × fc + d × fd + d × fn (3.6)

Each part of the time above is calculated by multiplying the input data size d with the related resource factor. By default, we consider the CPU factor fc = 1.0, disk IO factor fd = 5.0 and network IO factor fn = 10.0 during estimation. These factors can be configured to fit different execution environments in practice.
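For illustration, a minimal Scala sketch of this estimate and of the Min-min selection step with the default factors; the task/worker data-size map below is a hypothetical input, not an HDM API:

// Expected completion time of a task with input size d (Equation 3.6).
def estimate(d: Double, fc: Double = 1.0, fd: Double = 5.0, fn: Double = 10.0): Double =
  d * fc + d * fd + d * fn

// Min-min: among all (task, worker) pairs, pick the pair with the smallest estimate.
def minMin(inputSize: Map[(String, String), Double]): (String, String) =
  inputSize.map { case (pair, d) => pair -> estimate(d) }.minBy(_._2)._1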

Job Execution

HDM provides a virtualized abstraction of the execution context, which can either be a multi-threading pool or a cluster of distributed nodes. The execution context can be specified when a user invokes the computational interfaces (as listed in Table 3.3) of HDM programs. If it is the address of the master of a cluster, the job will be submitted and executed in the cluster; if the address is not specified or is specified as "localhost", a backend threading pool will be started by default to execute the HDM jobs locally.

The abstraction of the execution context decouples the job execution plan from the actual execution environment; therefore, we can provide different implementations that are specific to different underlying infrastructures. In addition, it also provides a convenient way for developers to test, execute and migrate their HDM applications among different environments.

3.6 Sample Examples

We have already presented an example of writing a WordCount program in HDM (Fig. 3.3) in a previous section of this chapter. In this section, we present several additional examples of how other common data analytics algorithms can be written in HDM.

3.6.1 Top K

The code snippet presented in Listing 3.1 shows an example of computing the Top K elements in HDM. The code first loads the data and converts each line into an array, with the first element as the key of the array. Then, the top operation of HDM automatically applies parallel computation over the distributed cluster to find the top 100 elements of the input data set.

Listing 3.1: Top K example of HDM (K = 100)

val text = HDM("hdfs://127.0.0.1:9001/user/text")
val topk = text.map{ w => w.split(",") }
    .map{ arr => (arr(0).toFloat, arr) }
    .top(100)
topk.compute(parallelism = 4)

3.6.2 Linear Regression

The code snippet in Listing 3.2 shows an example of writing Linear Regression (with Stochastic Gradient Descent) in HDM. Basically, the program loads the data and transforms it into DataPoints first. Then it just applies the classic Linear Regression formulas to the loaded data and updates the learned weights in every iteration accordingly. The code is quite straightforward and similar to the example code written in Spark, as both HDM and Spark provide functional and data-oriented programming interfaces.

Listing 3.2: Example Code of Linear Regression Written in HDM

val input = HDM("hdfs://127.0.0.1:9001/user/data")
val training = input.map(line => line.split("\\s+"))
    .map{ arr =>
      val vec = Vector(arr.drop(1))
      DataPoint(vec, arr(0))
    }
val weights = DenseVector.fill(10){ 0.1 * rand.nextDouble() }
for (i <- 1 to iterations) {
  val w = weights
  val gradient = training.map{ p =>
    p.x * (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y
  }.reduce(_ + _).collect().next()
  weights -= gradient
}

3.6.3 KMeans Clustering

The code shown in Listing 3.3 is an example of K-Means clustering written in HDM. The code components are similar to the code snippet of Linear Regression. As K-Means is an unsupervised learning algorithm, the code loads the input data into Vectors which do not contain labels. Then it applies the classic K-Means algorithm to the input data and updates the learned K points in each iteration.

Listing 3.3: K-Means Clustering Written in HDM

val input = HDM("hdfs://127.0.0.1:9001/user/data")
val training = input.map(line => line.split("\\s+"))
    .map{ arr => Vector(arr) }
val kPoints = data.training(K, 5000000).toArray
for (i <- 1 to iterations) {
  val kp = kPoints
  val closest = training.map{ p =>
    (closestPoint(p, kp), (p, 1))
  }
  val pointStats = closest.reduceByKey {
    case ((x1, y1), (x2, y2)) => (x1 + x2, y1 + y2)
  }
  val newPoints = pointStats.map { pair =>
    (pair._1, pair._2._1 * (1.0 / pair._2._2))
  }.collect()
  val oldPoints = kp.clone()
  for (newP <- newPoints) {
    kPoints(newP._1) = newP._2
  }
  var allDist = 0.0
  for (j <- 0 until K) {
    allDist += squareDistance(oldPoints(j), kPoints(j))
  }
}

Chapter 4

Functional Dataflow Optimization based on HDM

Figure 4.1: Traversing on the Data flow of HDM during Optimization.

During runtime, HDM jobs are represented as functional DAG graphs, on which multiple optimizations can be applied to obtain better performance. Before execution, the HDM optimizer traverses the DAG of a HDM in a depth-first manner, as shown in Fig. 4.1. During the traversal, the optimizer checks whether each node scope (a scope contains the information of the current node, its parent, its children, and their data dependencies and functions) matches any of the optimization rules, and then reconstructs the scope based on the matched rule. The algorithm for applying optimization rules on HDM is presented in Algorithm 3.

Algorithm 3: HDM Optimizer
Data: the logical plan tree h of a HDM job, a set of optimization rules R, the maximum number of iterations n_max for optimization
Result: the optimized logical plan h_o
begin
    if R is not empty AND h is not empty then
        h_o := null; n := 0;
        curNode := h.root;
        while h_o ≠ h AND n < n_max do
            for each rule r in R do
                h_o := r.optimize(curNode);
                curNode := h_o.root;
            end
            n += 1;
        end
    else
        h_o := h;
    end
    return h_o;
end

In the current implementation of the HDM optimizer, there are four basic optimization rules: Local Aggregation, Reordering, Function Fusion and HDM Caching. In the following sections, we take the WordCount program (Fig. 3.3) as an example to explain each optimization rule and how it is applied to a HDM job.

4.1 Local Aggregation

Local aggregation is a very useful approach to reduce the communication cost of shuffle operations with aggregation semantics. For this kind of operation (such as ReduceBy and FoldBy), an aggregation operation can be applied before the shuffle phase so that the amount of data to be shuffled is significantly reduced. Then, in the following step, a global aggregation is performed to compute the final results. The local aggregation rule in HDM can be specified as:

Local Aggregation Rule: Given a HDM[T,R] with function f : T → R, if the HDM has a N-to-One or N-to-N dependency and f has the semantics of aggregation, then the HDM can be split into multiple parallel HDMp[T,R] (with function f and one-to-one dependency) followed by a HDMg[R,R] with the aggregation semantics Fa and the original shuffle dependency.

For the example WordCount program (Fig. 3.3), the data flow of the job is initially explained as shown in Fig. 4.2. By detecting the aggregation operation ReduceByKey after GroupBy, parallel branches are added to aggregate the data before the shuffling step. During the shuffle reading phase, the aggregation semantics of the ReduceByKey function is applied to obtain the correct results from the sub-aggregating functions. The optimized data flow is shown in Fig. 4.3.

Figure 4.2: Logical flow of word-count.

Figure 4.3: Data flow optimized by local aggregation.
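As a minimal illustration of the effect of this rule, the sketch below applies the Fa and Fc semantics of a word-count aggregation on plain Scala collections; the partitions are hypothetical and HDM performs the same computation on distributed blocks:

def aggregate(pairs: Seq[(String, Int)]): Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

val partitions = Seq(
  Seq(("a", 1), ("b", 1), ("a", 1)),
  Seq(("b", 1), ("c", 1)))

// Fa: aggregate locally within each partition before shuffling
val local = partitions.map(aggregate)
// Fc: merge the partial results after the shuffle; only the small maps are transferred
val global = aggregate(local.flatMap(_.toSeq))
// global == Map("a" -> 2, "b" -> 2, "c" -> 1)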

4.2 Function Fusion

In HDM, we define fusible HDMs as a sequence of HDMs that start with a One-To-One or N-To-One data dependency and end with a One-To-One or One-To-N data dependency. Such a sequence of HDMs can be combined into one HDM rather than being computed as separate ones. During the fusion process, the functions associated with the fusible HDMs, such as Map, Find, Filter and local reduce/group, are directly appended to the parent nodes until they reach the root or encounter an N-to-N or One-To-N dependency. The rule of function fusion in HDM can be specified as:

Function Fusion Rule: Given two connected HDMs, HDM1[T,R] with function f : T → R followed by HDM2[R,U] with function g : R → U, if the dependency between them is one-to-one then they can be combined as HDMc[T,U] with function g(f) : T → U.

This rule can be applied recursively on a sequence of fusible operations to obtain the final combined HDM. For example, after performing function fusion, the data flow of WordCount can be simplified as shown in Fig. 4.4. Fusible operations such as Map and FlatMap are both fused into the top DDMs and inherit the data dependency as One-To-N. During execution, the combined function is directly executed on every input block of data within one task.
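A minimal sketch of the rule on plain Scala functions: the two functions below stand in for the Map and FlatMap steps of WordCount, while the real fusion happens on HDM nodes:

val split: String => Array[String] = _.split(",")                       // f : T -> R
val toPairs: Array[String] => Array[(String, Int)] = _.map(w => (w, 1)) // g : R -> U

// The fused function g(f) is applied once per input record within a single task.
val fused: String => Array[(String, Int)] = split andThen toPairs

// fused("hello,world").toList == List(("hello", 1), ("world", 1))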

4.3 Re-ordering/Re-construction Operations

Table 4.1: Rewriting patterns in HDM

Pattern → Rewritten form
map(m : T → R).filter(f : R → Bool) → filter(f(m) : T → Bool).map(m)
union().filter(f : T → Bool) → filter(f : T → Bool).union()
intersection().filter(f : T → Bool) → filter(f : T → Bool).intersection()
distinct().filter(f : T → Bool) → filter(f : T → Bool).distinct()
sort().filter(f : T → Bool) → filter(f : T → Bool).sort()
groupBy(g : T → K).filterByKey(f : K → Bool) → filter(f(g) : T → Bool).groupBy(g)
reduceByKey().filterByKey(f : K → Bool) → filterByKey(f : K → Bool).reduceByKey()

Apart from aggregation operations, there is another set of operations (like Filter or FindBy) that can reduce the total communication cost by extracting only a subset of the data from the previous input. These operations are considered pruning operations during execution. The basic principle is that the optimizer attempts to lift these pruning operations to reduce the data size in advance, while sinking the operations which involve global aggregations (such as a global Reducer or Grouper) to delay intensive communication.

Figure 4.4: Data flow of word-count after function fusion.

Figure 4.5: Data flow Reconstruction in HDM.

During the optimization phase, operation re-ordering and re-construction are achieved through the re-writing of operation patterns. The optimizer keeps checking the operation flow, detects possible patterns and re-writes them in optimized forms. The basic patterns and their re-written forms used by the optimizer are listed in Table 4.1. The operation re-writing process can be performed recursively for multiple iterations to obtain an optimal operation flow. As an example, consider an extended WordCount program in which the developer wants to find out the number of words that start with the letter 'a'. The extended program has two additional functions, FindByKey and ReduceByValue, as shown in Fig. 4.5. During optimization, as indicated in Fig. 4.5, the FindByKey function is categorized as one of the pruning functions that can reduce the total data amount. Thus, it is lifted before the aggregation ReduceByKey. When it meets the shuffle operation GroupBy, applying the same rule, it continues to be lifted until it reaches the parallel operations FlatMap and Map.
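A minimal sketch of the groupBy/filterByKey rewrite from Table 4.1, checked on plain Scala collections; the word list and the predicates are hypothetical:

val words = List("apple", "avocado", "banana", "cherry")
val g: String => Char = _.head           // grouping key
val f: Char => Boolean = _ == 'a'        // key predicate

val original  = words.groupBy(g).filter { case (k, _) => f(k) }  // groupBy then filterByKey
val rewritten = words.filter(w => f(g(w))).groupBy(g)            // filter lifted before groupBy

assert(original == rewritten)  // same result, but the rewritten flow shuffles far less data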

4.4 HDM Cache

Figure 4.6: Cache Detecting by Reference Counting.

For many complicated applications (such as iterative algorithms), input data might be used repeatedly across iterations. Therefore, it is necessary to cache this part of the data to avoid redundant computation. In HDM, data caching can be triggered in two ways:

• Developers can explicitly specify which HDMs need to be cached for repetitive computation.

• Cache operations can be added automatically by reference counting in the logical planning phase, in which HDMs that are directly referenced by multiple subsequent operations are labeled. The output of these HDMs will be cached for subsequent execution.

For example, as shown in Fig. 4.6, the sort operation contains four functions: map, sample, partition and sort. The sample function and the partition function share the same input, which is calculated from the map's output. In the logical planning phase, the planner can detect that the output of map has multiple references (a reference count of 2) and thus add a cache tag to it. Then, during the execution phase, all output with this tag will be cached in the memory of the workers to avoid redundant computation.

Once a HDM is labeled and cached by a previous job, in the next planning phase the cache optimizer will check and replace all references to this HDM in the data flow with the cached DDM addresses. During execution, there are two types of caching policies: eager caching and lazy caching. The former actively computes the cached data before it is required by subsequent operations; the latter does not start computing the cached data until the first reading operation is invoked. By default, lazy caching is used during HDM computation.
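A minimal sketch of cache detection by reference counting in the logical plan; the Op class below is only a stand-in for the real HDM node type:

case class Op(id: String, inputs: List[Op] = Nil)

// An operation whose output is read by more than one downstream operation gets a cache tag.
def cacheCandidates(nodes: List[Op]): Set[String] =
  nodes.flatMap(_.inputs.map(_.id))
    .groupBy(identity)
    .collect { case (id, refs) if refs.size > 1 => id }
    .toSet

val map       = Op("map")
val sample    = Op("sample", List(map))
val partition = Op("partition", List(map))
// map has a reference count of 2, so it is tagged for caching
assert(cacheCandidates(List(map, sample, partition)) == Set("map"))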

4.5 Comparison with Optimizations in Other Frameworks

The optimization mechanisms introduced in this section are not newly invented by us; they have been derived from optimizations in other frameworks, combined with functional wrapping.

• In MapReduce, users can manually define Combiners for each Mapper to apply map-side merging in a MapReduce job. In Spark, users can use the Aggregator API rather than the normal groupBy and reduce operations to define the aggregation operation before shuffling data across the cluster. These are all derivations of the local aggregation optimization.

• Tez combines chained map-only jobs into one Mapper to reduce the complexity and cost of executing multiple ones. Spark's scheduler schedules a sequence of narrow-dependency operations into one task to reduce the redundant cost of scheduling and of generating intermediate data. These optimizations achieve similar effects to our function fusion optimization, yet from a task-oriented perspective.

• Support for operation re-ordering and re-construction is not natively provided in current frameworks such as MapReduce and Spark. In practice, job developers are required to have a certain level of expertise to optimize the sequence of operations in their programs.

• Both Spark and Flink provide operations for developers to explicitly cache datasets in memory to improve the performance of iterative jobs. However, for many complicated programs it is hard to manually decide when and where to cache a dataset that is used repeatedly by subsequent operations in an implicit manner.

In contrast to the mostly manual application of these optimizations in existing frameworks, HDM utilizes the information and understanding of functional semantics to automatically perform these optimization mechanisms on the data flow of HDM jobs. More importantly, the automation of the optimizations also makes it easier and more feasible to apply them to complicated jobs and data pipelines.

4.6 Performance Evaluation

In order to evaluate the performance of HDM's execution engine and the optimizations we have applied based on HDM, we conduct comprehensive experiments to compare the performance of HDM with the current state-of-the-art big data processing framework, Apache Spark (version 1.4.1).

4.6.1 Experimental setup

Our experiments have been deployed and executed on Amazon EC2 (https://aws.amazon.com/ec2/) with one M3.large instance as the master and 20 M3.2xlarge instances (each with 8 vCPUs of Intel Xeon E5-2670 v2 processors, 30 GB memory and 1Gib network) as workers. Every Spark and HDM worker runs in the memory-only mode on a JVM with 24 GB memory, giving 480 GB of aggregated JVM memory for the entire cluster, so that all the data can be fed into main memory during the execution of our test cases. In addition, data compression options in Spark are all disabled to avoid performance interference.

For the test data sets, the user-visits data from AmpLab (https://amplab.cs.berkeley.edu/benchmark/) is used for the primitives and SQL benchmarks. The original user-visits data set is 119.34 GB and we have replicated it to 238.68 GB for larger-scale tests. The Daily Global Weather Measurements (DGWM) data from AWS' public repository (http://aws.amazon.com/datasets/) is used for the machine learning benchmark. The DGWM data is also replicated to 61.30 GB and 122.60 GB to match the larger-scale testing scenarios in our experiments. Before running the experiments, all test data sets are downloaded and persisted in HDFS (Hadoop 2.4.0), which is hosted on the same cluster that executes the tested jobs.

4.6.2 Experimental Benchmarks

During the experiments, we try to comprehensively compare the performance of HDM and Spark for different types of applications. The test cases are categorized into four groups: basic primitives, pipelined operations, SQL queries and machine learning algorithms, as listed below (TC-i denotes the i-th test case).

Basic Primitives: Simple parallel, shuffle and aggregation operations are tested to show the basic performance of the primitives for both Spark and HDM:

• TC-1, Simple parallel operation: A Map operation which transforms the text input into string tuples of page URL and page-ranking value;

• TC-2, Simple shuffle operation: A GroupBy operation which groups the page rankings according to the prefix of the page URL;

• TC-3, Shuffle with map-side aggregation: A ReduceByKey operation which groups the page rankings according to the page prefix and sums the total value of every group;

• TC-4, Sort: The implementation of the Sort operation contains two phases: in the first phase, the input data are sampled to obtain the distribution boundaries for range partitioning in the next phase; in the second phase, the input data are partitioned using the range partitioner and then each partitioned block is sorted in parallel.

Pipelined Operations: The performance of complicated applications can be derived from the basic performance of meta-pipelined ones. In the second part of our experiments, typical cases of meta-pipelined operations are tested by each of the following test cases:

• TC-5, Parallel sequence: Five sequentially connected Map operations, each of which transforms the key of the input into a new format;

• TC-6, Shuffle operation followed by aggregation: A groupBy operation which groups the page rankings according to the page prefix, followed by a Reduce operation to sum the total values of every group;

• TC-7, Shuffle operation followed by filter: The same groupBy operation as in TC-6, followed by a filter operation to find the rankings that start with a certain prefix;

• TC-8, Shuffle operation followed by transformation: A groupBy operation which groups the page rankings according to the prefix, followed by a map operation that transforms the grouped results into new formats.

SQL queries: SQL-like operations are crucial for data management systems. Therefore, most data processing frameworks provide SQL interfaces to facilitate data scientists in their task of applying ETL (Extraction, Transformation, Loading) operations. To evaluate the performance of SQL operations, we test HDM's POJO-based ETL operations against SparkSQL.

• TC-9, Select: Selects a subset of columns from the tabular data; a Select operation is actually achieved by one parallel map operation;

• TC-10, Where: Scans the data to find the records that meet the condition expression;

• TC-11, OrderBy: Orders the input data set by a given column;

• TC-12, Aggregation with groupBy: Groups the records and aggregates the value of a numeric column in each group.

Iterative Algorithms: Many data analytics applications, such as machine learning algorithms, are iterative jobs. To evaluate the performance of iterative jobs, we test two commonly used machine learning algorithms (Linear Regression and K-Means) for Spark and HDM:

• TC-13, Linear Regression: Applies linear regression based on stochastic gradient descent;

• TC-14, K-Means: Applies K-means clustering on the input data; in the experiment, we choose K = 128.

For both the Linear Regression and K-Means test cases, we use the example implementations in the official Spark examples component and provide equivalent implementations using HDM primitives.

4.6.3 Experiment Results

We compared the Job Completion Time (JCT) of the above test cases for HDM and Spark. The results of each test group are discussed as follows.

• Comparison of Basic Primitives

Simple parallel operation: As shown for TC-1 in Fig. 4.7, for parallel operations like Map, HDM has similar performance to Spark. However, HDM is slightly faster, as it provides a global view in the planning phases so that workers are assigned parallel branches in the execution graph rather than single steps. In such a case, less communication is required to complete the same job. Besides, the Spark scheduler uses delay scheduling by default to achieve better data localization. For HDM, data localization and fairness are pre-considered in the physical planning phase, so no delay is required when scheduling tasks, which also decreases the latency of assigning tasks to some extent.

Simple shuffle operation: For the GroupBy operation in TC-2, HDM shows 15% shorter JCT compared to Spark. Basically, HDM benefits from the parallel aggregation and de-coupled implementation of IO and computation during the shuffle stage.

Shuffle operation with map-side aggregation: Spark provides several optimized primitives to improve the performance of some shuffle-based aggregations by using map-side aggregators, e.g. ReduceByKey. In HDM, this type of optimization is achieved by automatically applying local aggregation when necessary. Eventually, for optimized shuffle operations, HDM still achieves slightly better performance (TC-3) than Spark, which is mainly due to better efficiency in the scheduling phases.

Sorting: For the general sort operation, both Spark and HDM contain the same steps (sampling the data, range partitioning, then parallel sorting). However, HDM shows more than 30% improvement (in TC-4) by detecting and caching the input data after the sampling step. As the whole input data is already loaded into memory during the sampling step, HDM caches the data for the subsequent partitioning step. For Spark, the partitioning step needs to reload the input data from HDFS again.

Figure 4.7: Comparison of Job Completion Times for HDM and Spark: (a) primitives for user-visits; (b) pipelines for user-visits; (c) SQL for user-visits. Note: Spark-1D, HDM-1D and Spark-2D, HDM-2D denote the results for the initial dataset and the double-sized dataset, respectively.

• Comparison of Pipelined Operations

Multiple parallel operations: For multiple parallel operations (TC-5), Spark wraps sequentially connected operations into one task, within which the pipelined operations are executed one by one iteratively. For HDM, the optimization is applied by parallel function fusion, in which multiple operations are merged into one higher-order function and applied only once for every record of the input. As a result, the JCT of both Spark and HDM does not increase much when compared to the single parallel primitive (TC-1). However, HDM shows slightly better performance than Spark due to lower scheduling latency.

Shuffle operation followed by data transformation: HDM and Spark both show relatively longer JCT for shuffle operations followed by general transformations (TC-6). Without sufficient semantics about the user-defined transformation function, no optimization is applied by either HDM or Spark. Thus, the results for this test case are quite similar to the basic shuffle primitives. HDM yields a shorter JCT due to more efficient parallel aggregation in shuffling.

Shuffle operation followed by aggregation: For operations composed of a shuffle and an aggregation (TC-7), HDM shows much better performance (35-40%) than Spark. This is because there is no data flow analysis in the core engine of Spark, whereas HDM can recognize aggregations behind shuffle operations and automatically apply re-ordering and local aggregation to reduce the size of the data required for the subsequent shuffle.

Shuffle operation followed by pruning operation: For operation sequences which contain pruning operations (TC-8), HDM also achieves considerable improvement by being able to perform re-ordering and re-writing to push pruning operations forward as much as possible. Therefore, the data size of the following data flow is significantly reduced.

• Comparison of SQL Queries

For basic SQL queries, SparkSQL uses Catalyst to optimize its execution plans while HDM relies on the built-in HDM optimizations. As a result, the performance is quite close for Select (TC-9), Where (TC-10) and Aggregation (TC-12), as shown in Fig. 4.7(c). HDM shows around 10% shorter JCT for those test cases. However, for the OrderBy query, SparkSQL does not apply any optimization for sorting whereas HDM adds caching after loading the input data. Therefore, HDM shows around 30% improvement in TC-11.

• Comparison of Iterative Jobs

Figure 4.8: Comparison of Job Completion Times of ML algorithms for HDM and Spark: (a) Linear Regression for the weather data set; (b) KMeans for the weather data set. Note: Spark-1D, HDM-1D and Spark-2D, HDM-2D denote the results for the initial dataset and the double-sized dataset, respectively.

Linear Regression: For the linear regression test, the example code implementation uses SGD (Stochastic Gradient Descent) to train on the data. In SGD, every partition of the data only needs to compute and send its local coefficient vector for the final aggregation step. Apart from caching the input data, there is very little room for data flow optimization. For this test case, both Spark and HDM cache the input data for the iterative learning process. As a result (Fig. 4.8(a)), the performance of Spark and HDM is very close: after caching the data in the first iteration, both become much faster (around 20x) in the subsequent iterations.

KMeans: K-Means clustering involves more intensive computation and data transfer steps than linear regression. In each iteration, the job needs to compute the squared distance between every point and each candidate centre, and the candidate centres are then updated with the new candidates. In the implementation, a map-reduceByKey-map pipeline is involved in each iteration. Similarly, both Spark and HDM cache the input data for iterative learning. The results (Fig. 4.8(b)) show that subsequent iterations of both Spark and HDM gain about 30% JCT reduction after caching the data in the first iteration. However, HDM shows around 10% shorter JCT for the first iteration and 20% shorter JCT for subsequent iterations compared to Spark. Basically, HDM benefits from its better performance in pipelined execution, as discussed in the previous test cases.

4.6.4 Comparison and Discussion

Table 4.2: Comparison of major big data frameworks

                           MR                      Spark                   Flink                        HDM
Data Model                 key-value               RDD                     flat record                  HDM
Programming Model          Map-Reduce              functional              functional                   functional
Job Planning               task-based              DAG-based               DAG-based                    DAG-based
Data Flow Optimization     N/A                     manual                  auto (inherited from DBMS)   auto (DAG & functional)
Composition and Pipeline   high-level frameworks   high-level frameworks   high-level frameworks        native support

Several frameworks have been developed to provide distributed big data processing platforms (Tsai et al., 2015). MapReduce (Dean and Ghemawat, 2008) is the commonly used big data processing paradigm which pioneered this domain. It uses key-value pairs as the basic data format during processing. Map and Reduce are two primitives which are inherited from functional programming. In terms of performance, Hadoop/MapReduce jobs are not guaranteed to be fast. All the intermediate data produced during execution are written to distributed storage to enable crash recovery. This is a trade-off which sacrifices the efficiency of using memory and local storage. The MapReduce framework is usually not effective for fast and smaller jobs where the data can fit into memory (Sakr et al., 2013).

Spark (Zaharia, Chowdhury, Franklin, Shenker and Stoica, 2010) utilizes memory as the major data storage during execution. Therefore, it can provide much better performance compared with jobs running on MapReduce. The fundamental programming abstraction of Spark is called Resilient Distributed Datasets (RDD) (Zaharia et al., 2012), which represent logical collections of data partitioned across machines. Besides, applications on Spark are decomposed into stage-based DAGs that are separated by shuffle dependencies. In job explanation, Spark also combines parallel operations into one task, and in this sense it achieves a similar optimization to the function fusion in HDM. However, further data flow optimizations such as operation re-ordering and rewriting are not provided in the Spark processing engine.

Apache Flink (https://flink.apache.org/) originated from the Stratosphere project (Alexandrov et al., 2014), which is a software stack for parallel data analysis. Flink shares many similar functionalities with Spark, and it provides optimizations that have been inspired by relational databases but adapted to schemaless user-defined functions (UDFs). Compared with Spark, Flink has a pipelined execution model which is considered better suited for iterative and incremental processing.

The main difference between HDM and the other functional-like frameworks (Spark and Flink) is the programming abstraction. RDD in Spark is a data-oriented abstraction: each RDD represents a distributed data set with a certain type. Flink shares a similar concept with Spark; it uses the DataSet object, which is also an abstraction of distributed datasets. In comparison, HDM is a function-oriented abstraction: each HDM is a wrapper of a function which takes a set of input data sets and computes the output data set. Both Spark and Flink have function objects during execution, but those functions are supplementary to their data abstractions. In HDM, functions are the first-class citizens while data sets are supplementary to functions (they can be replaced just as the input of a function can be changed). Due to its function-oriented nature, HDM is able to natively utilize functional optimization and composition mechanisms to improve execution performance and provide some extent of reusability. Table 4.2 compares the main features of

HDM and the major open-source big data frameworks.

Chapter 5

Towards a Multi-Cluster Architecture

Data are increasingly being collected and stored in highly distributed infrastructures (e.g. across data centres, clusters, racks and nodes). The majority of big data processing frameworks, such as Hadoop (http://hadoop.apache.org/) and Spark (Zaharia, Chowdhury, Franklin, Shenker and Stoica, 2010), are designed and implemented based on a single-cluster design, which is not a good fit for scenarios in which data are highly distributed in a heterogeneous environment and there may be either logical or physical boundaries between different groups of computational nodes.

In this chapter, we present HDM-MC, our multi-cluster solution that natively supports the ability to perform computation and data analytics across multiple clusters with both physical (structured regarding the network topology) and logical (structured regarding multi-party/organization relations) boundaries. Figure 5.1 illustrates an overview of our framework, which targets the following three basic requirements:

• Multi-cluster coordination and management: The system is designed to support the collaborative execution of data analytics jobs across multiple clusters, each of which manages its own computation and data resources without directly exposing them to other clusters.

• Data transparency with localization: Data are automatically routed and located within the multi-cluster infrastructure. External systems and users should not need to consider or know about the data being exchanged within the cluster. On the other hand, the multi-cluster servers are able to provide optimized task planning and scheduling based on awareness of the underlying network topology.

• Simple and unified programming interface: Users should be able to write one single driver program and algorithm which is sufficient to coordinate with multiple clusters, without the need to consider the additional issues of data localization and exchange.

Figure 5.1: A Better Solution for a Multi-party Computation Architecture.

Bearing the above-mentioned targets in mind, we propose our multi-cluster architecture in the rest of this chapter, in which we extend the kernel of the Hierarchically Distributed Data Matrix (HDM) (Wu et al., 2015) to support cross-cluster computation.

5.1 Core Execution Engine - HDM

HDM-MC is designed as an extension of our previous framework, the Hierarchically Distributed Data Matrix (HDM) (Wu et al., 2015). In principle, HDM is a light-weight, functional and strongly-typed data representation which contains complete information (such as data format, locations, dependencies and functions between input and output) to support the parallel execution of data-driven applications. In order to enable HDM to support multi-cluster jobs, our extensions mainly include the following three components:

• Coordination of multi-clusters with two architectures: hierarchical supervisors and master chains, which provide mechanisms to support dynamic switching between single- and multi-cluster architectures.

• Multi-cluster planning: stage-based planning that checks the context of the dependent data sources and differentiates between different kinds of jobs: local jobs, remote jobs and collaborative jobs.

• Optimized task scheduling for multi-clusters: scheduling strategies with awareness of the underlying network topology (distances between nodes).

5.2 Coordination of Multi-clusters

The first step in supporting multi-cluster applications is to provide support for the coordination between the different masters of each cluster. In our solution, we provide two types of coordination architecture considering different collaboration scenarios for applications: a hierarchical architecture and a P2P architecture. The former is suitable for applications which allow an external coordinator over the masters of each cluster. The latter is suitable for clusters which only trust peer masters that they have recognized and agreed with. In the remaining sections, we illustrate the design of each architecture in detail.

5.2.1 Hierarchical Architecture

Figure 5.2: Hierarchical Multi-cluster Architecture.

Figure 5.3: Message Coordination between Hierarchical Clusters.

In the hierarchical architecture, there are one or more super-masters which are able to perform the coordination of the children clusters. Figure 5.2 illustrates an overview of the hierarchical multi-cluster architecture. From the super-master's point of view, it treats the children masters as normal workers, each of which represents the resources obtained by summarizing the total resources (e.g. CPU cores and memory) of the actual workers in its cluster. From the perspective of the mediation masters, they are responsible for managing and monitoring the resources of the underlying workers and for continuously reporting changes in the cluster (including themselves) to the direct superior master. In principle, this architecture could be extended to a multi-layer hierarchy (with multiple layers of mediation masters), as the superior master actually treats the children masters just as normal but more "powerful" workers.

Figure 5.3 shows the states of the life cycle of a master and its message coordination with its superior master. A children master has six basic states (a minimal sketch of these states and transitions is given after the list):

• Init: When a master process is started, it is in the Init state.

• Joining: When a children master requests supervision from a superior master, it sends a JoinMsg (with the information about the resources in the current cluster) to the superior master and then changes its state to Joining. While a children master is in the Joining state, it postpones any offers that might cause resource changes, to make sure that the resource information stays consistent with the previous JoinMsg.

• Active: When a children master in the Joining state receives a successful JoinResp from the superior master, it becomes Active and starts to serve regular requests from both workers and its superior master. In the Active state, any operation that results in resource changes (including workers joining/leaving and resources being assigned/recycled) triggers the children master to send a ResSync (Resource Synchronization) message to synchronize the changes with the superior master.

• Inactive: When a children master is in the Active state, any communication failure (heartbeat or other messages) with its superior master causes it to become Inactive. In the Inactive state, the children master stops serving regular requests but keeps retrying the failed communication until it succeeds or times out.

• Leaving: In the Active state, a children master can receive a LeaveMsg from either a client or the superior master to actively leave the cluster. After successfully receiving the confirmation (LeaveResp) from the superior master, it changes to the Dead state.

• Dead: If a children master has been Inactive for too long or has successfully left the cluster, it is labeled as Dead. A dead children master will be permanently removed from the superior master.
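A minimal sketch of these states and a few of the transitions described above; the event strings follow the message names in the text, and the encoding is illustrative only:

sealed trait MasterState
case object Init extends MasterState
case object Joining extends MasterState
case object Active extends MasterState
case object Inactive extends MasterState
case object Leaving extends MasterState
case object Dead extends MasterState

def transition(state: MasterState, event: String): MasterState = (state, event) match {
  case (Init, "JoinMsg")        => Joining   // request supervision from the superior master
  case (Joining, "JoinResp")    => Active    // join confirmed, start serving regular requests
  case (Active, "CommFailure")  => Inactive  // heartbeat or message failure
  case (Inactive, "RetryOk")    => Active    // a retried communication succeeded
  case (Inactive, "Timeout")    => Dead      // inactive for too long
  case (Active, "LeaveMsg")     => Leaving
  case (Leaving, "LeaveResp")   => Dead
  case (s, _)                   => s
}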

5.2.2 Decentralized Architecture

In our framework, we also support a decentralized architecture which has no super-masters; instead, each master can have a few (e.g. two) sibling masters with which it can collaborate and share some computation and data resources. Figure 5.4 shows an overview of the chained multi-cluster architecture.

Figure 5.4: P2P Master Architecture.

In this architecture, each master is responsible for managing its own worker children and updating its resource information to its peer masters. During the execution of applications, if some jobs are identified as remote jobs or require collaborative execution, the master can submit those jobs to the related siblings. To reduce the overhead of synchronizing resource information to sibling masters, the number of peer masters is limited for each individual master (i.e. two or three). Theoretically, this architecture is able to scale to a very large number of nodes with multiple peer-to-peer masters.

Figure 5.5 shows the state diagram and the message coordination between a master and its siblings. There are a few states that differ from those used in the coordination of the hierarchical architecture:

Figure 5.5: Message Coordination between P2P Clusters.

• Collaborating: Only masters which are in the Active state can try to connect with other peer masters to start collaborating. Once a ConnectReq message is sent to a sibling master, the current master changes to the Collaborating state. Unlike the Joining state in the hierarchical architecture, masters in the Collaborating state can still serve regular requests from their local cluster but postpone all requests from other peer masters.

• Collaborative: Once a Collaborating master receives a ConnectResp message from the requested sibling, they become collaborative with each other. Once they become collaborative siblings, they start to synchronize their resource information whenever it changes. At the same time, they are able to send and receive remote jobs and tasks, as well as completion notifications, from each other.

• Unreachable: Any communication failure (heartbeat or other messages) between collaborative sibling masters leads them to become Unreachable to each other. An Unreachable master is still able to serve requests from the local cluster but stops taking requests from its siblings unless it is reconnected.

• Left: If a sibling is unreachable beyond a time limit or has actively left the collaborating masters, it is labeled as Left. Left masters can still work separately within their own clusters.

5.2.3 Dynamic Architecture Switching

Both single- and multi-cluster architectures have their own pros and cons. A single cluster is easier to construct and maintain, while a multi-cluster architecture is better suited to more complicated infrastructures. It is therefore important to provide a dynamic switching approach so that infrastructure managers are able to change their cluster architecture dynamically when the infrastructure evolves or to support specific scenarios. In our work, we provide mechanisms for a HDM cluster to switch its architecture in both directions between single- and multi-cluster.

• Single-cluster to multi-cluster. To divide a single cluster into multiple clusters, the system admin can create a new master and migrate workers from the old cluster to the new one, after which a collaborating message is sent between them. To migrate from one master to another, a worker just sends a LeaveMsg message to its current master. After the leaving message is confirmed, it sends a JoinMsg to the new master to complete the "migration" process.

• Multi-cluster to single-cluster. To combine a multi-cluster into a single cluster, the system admin can migrate all the workers to a selected master, after which leaving messages are sent to the other collaborating masters.

5.3 Job Planning on Multi-clusters

With the coordination service described in the last section, we are able to share the real-time resource information (including nodes, CPUs and memory) of clusters between multiple masters. However, in order to support data analytics jobs across a multi-cluster architecture, the masters of the clusters need to recognize and explain the data flow of the jobs and to schedule the computation process on the cross-cluster resources. In this section, we present the job planning and scheduling processes of multi-cluster applications.

5.3.1 Categorizations of Jobs

First, we categorize jobs into three basic categories according to the locations of their data sets and the cluster at which job planning is performed.

• Local jobs: If the input data sets of a job are all in the context of the current cluster, which performs the job planning, it is considered a local job. This type of job is scheduled as a normal local job on the children workers of the current cluster.

• Remote jobs: If the input data sets of a job are all in the context of another remote cluster, it is considered a remote job. A remote job is directly re-submitted to the related cluster for execution; meanwhile, a promise entry is registered in the current scheduling cluster to wait for the call-back response for the remote job after it has succeeded or failed.

• Collaborative jobs: If the input data sets of a job come from multiple different clusters (siblings), it is considered a collaborative job. A collaborative job is parallelized and scheduled on the overall resources of both the current cluster and its siblings, with regard to the data locations and the parallelism of the job.

Job categorization enables the planner and scheduler to differentiate jobs with different data sources from different clusters and locations. In addition, it also provides a clue for dataflow construction in the job explaining phase.
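A minimal sketch of this categorization, assuming each input is tagged with the id of the cluster (context) that owns it; the cluster ids below are hypothetical:

sealed trait JobType
case object LocalJob extends JobType
case object RemoteJob extends JobType
case object CollaborativeJob extends JobType

def categorize(currentCluster: String, inputClusters: Set[String]): JobType =
  if (inputClusters.forall(_ == currentCluster)) LocalJob
  else if (inputClusters.size == 1) RemoteJob   // all inputs live in a single other cluster
  else CollaborativeJob

// categorize("clusterA", Set("clusterA"))             -> LocalJob
// categorize("clusterA", Set("clusterB"))             -> RemoteJob
// categorize("clusterA", Set("clusterA", "clusterB")) -> CollaborativeJob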

5.3.2 Job Explanation

After an analytics application is submitted to any of the masters, it is first explained as an understandable task/job flow for the scheduler to schedule the tasks/jobs. Based on the granularity and abstraction levels of jobs, the job explaining phase contains two steps: Stage Planning and Task Planning.

• Stage Planning step: In this step, the application is divided into several job stages, each of which belongs to one of the job categories.

• Task Planning step: Each job identified in the stage planning is scheduled on one of the masters for execution. Before the actual execution, each master that receives a job stage further explains the job into actual task flows for scheduling.

Stage Planning Step

Algorithm 4 shows the process of identifying the different types of jobs in the stage planning step. Basically, the stage planner recursively checks whether the children of a function belong to the current cluster (context); if not, the planner sets the current job stage as Collaborative and then creates new stages for each child based on its context.

Figure 5.6: Job Explanation Example.

Figure 5.6 shows an example of the stage planning results for a cross-cluster program. In the program, a new stage is discovered at the Join node, whose children come from different cluster contexts. The two children of Join are then categorized as a Remote Job and a Local Job, respectively, while the Join node itself is categorized as a Collaborative Job.

Task Planning Step

After the job is divided into stages according to the context and data dependencies, each divided stage is actually a self-executable job. Each master then just needs to use the local job planner of HDM (Wu et al., 2015) to explain and schedule its local jobs. A collaborative job waits for all its dependent stages to finish, and is then scheduled on the resources of either the current cluster or the current cluster plus its siblings to improve parallelism.

Algorithm 4: StagePlanning
Data: the current explaining context ctx, the current node cN, the current stage cS, the parallelism p
Result: a list lists of identified job stages
begin
    if children of cN is not empty then
        if cN.children are not all in ctx then
            cS.setType("collaborative");
            for each c in cN.children do
                if c.context == ctx then
                    nS := newStage(c, "local");
                    lists += nS; cS.parents += nS;
                    lists += StagePlanning(ctx, c, nS, p);
                else
                    nS := newStage(c, "remote");
                    lists += nS; cS.parents += nS;
                end
            end
        else
            for each c in cN.children do
                lists += StagePlanning(ctx, c, cS, p);
            end
        end
        return lists;
    else
        return lists;
    end
end

5.4 Scheduling on Multi-clusters

5.4.1 Multi-layer Scheduler Design

After the job is explained, it is necessary to schedule the different types of jobs/tasks and to coordinate with remote clusters to complete the tasks of the overall application. Similar to the two-layer planning, our scheduling process is also designed as a two-layer scheduler.

• The first layer is responsible for monitoring and scheduling the stages of each application. Once all the parent jobs of a stage are completed, the stage is marked as active and is submitted to a remote or local task scheduler for execution.

• The second layer is responsible for receiving, monitoring and scheduling the tasks of each active stage. In principle, any classic single-cluster scheduler can be re-used in this layer.

Figure 5.7 illustrates the process of multi-cluster scheduling for the previous example (Figure 5.6) from the perspective of one master. Basically, after a cross-cluster program/application is submitted to the server, it is first divided into a stage flow based on Algorithm 4. During the scheduling of the stage flow, if a stage has no parents or all its parent stages have been completed, it is triggered and submitted to a local task scheduler or a remote sibling for execution. Based on the job type, a remote job is submitted to the related sibling master and a local job is submitted to the local task scheduler. A collaborative job is explained in the current task scheduler and scheduled across the current cluster and its siblings (some tasks are executed locally while others are submitted to a sibling master, based on data localization). Once the task scheduler receives a local or remote stage job, the job is explained as executable task flows and then scheduled and executed as a local cluster job.

Figure 5.7: Multi-layer Job Scheduling Example.

5.4.2 Scheduling Strategies

In the second layer of scheduler, explained tasks are scheduled in each local cluster.

Currently, in our implementation, there are three types of scheduling strategies that can be selected:

• Delay Scheduling. This is a simple and commonly used scheduling algorithm that has been used in both MapReduce and Spark. The algorithm allows arriving tasks to wait for a short duration of time in order to achieve better data locality. There are four main data locality levels in delay scheduling: Process Local, Node Local, Rack Local and Any.

• Minmin/Maxmin Scheduling. Tasks are scheduled based on the estimated minimum completion time. Firstly, the scheduler finds the estimated minimum completion time for every task among all the resources; secondly, the task with the minimum value within the candidate set is selected for execution.

• Hungarian Algorithm. This is originally a combinatorial graph algorithm that finds a minimum-cost assignment between two sets of nodes. In the scheduling scenario, the cost matrix is calculated from the distance between the input data and the available candidate workers.

Delay scheduling has the least complexity among the three algorithms and works well for non-shuffle operations. However, for shuffle operations, delay scheduling does not provide any optimizations and just randomly assigns tasks to any available worker. Therefore, it does not suit the multi-cluster scenario well, since shuffle operations are very expensive and network connectivity is heterogeneous.

Both the Minmin/Maxmin and Hungarian algorithms are able to find more optimal scheduling plans for both shuffle and non-shuffle dependencies. However, they have higher complexities: O(M × N) (where M is the number of workers and N is the number of tasks) and O(N^3), respectively. In our experiments we choose Minmin Scheduling as the default scheduling strategy as it has higher scalability. In order to be aware of the heterogeneous network in a multi-cluster infrastructure, the estimated execution time (T), which also acts as the distance between a worker and the input data, is calculated as the sum of the CPU time (Tc), disk IO time (Td) and network IO time (Tn).

T = Tc + Td + Tn = d × fc + d × fd + d × fn (5.1)

Each time component is calculated by multiplying the input data size d by the related resource factor. Each factor reflects the speed of the resource during cost estimation: fc is the CPU factor, fd is the disk IO factor and fn is the network factor.

During cost estimation, we consider the CPU factor fc = 1.0, which is the unit for the calculation. In addition, the disk IO factor fd is defined as follows:

    fd = Cd (with Cd > fc)   when the data are not cached in memory
    fd = 0                   when the data are cached in memory

To make the scheduler aware of the heterogeneous connectivity of the multi-cluster infrastructure, the network IO factor fn is defined as follows:

    fn = Distance(w, d)   when data d is not on node w
    fn = 0                when data d is node-local to w

These factors can be configured to fit different execution environments in practice.
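To make the cost model concrete, the following Scala sketch computes the estimate of Eq. (5.1) for every task/worker pair and performs a Minmin selection. The factor functions fd and fn are passed in as parameters; the types and names are illustrative assumptions rather than the actual HDM scheduler code.

case class TaskInfo(id: String, inputSize: Double)   // input data size d
case class WorkerInfo(id: String)

// Estimated completion time: T = d * fc + d * fd + d * fn  (Eq. 5.1)
def estimate(t: TaskInfo, w: WorkerInfo, fc: Double,
             fd: (TaskInfo, WorkerInfo) => Double,
             fn: (TaskInfo, WorkerInfo) => Double): Double =
  t.inputSize * fc + t.inputSize * fd(t, w) + t.inputSize * fn(t, w)

// Minmin: for each task take its cheapest worker, then schedule the task
// whose cheapest estimate is the smallest overall.
def minMinPick(tasks: Seq[TaskInfo], workers: Seq[WorkerInfo], fc: Double,
               fd: (TaskInfo, WorkerInfo) => Double,
               fn: (TaskInfo, WorkerInfo) => Double): (TaskInfo, WorkerInfo) = {
  val perTaskBest = tasks.map { t =>
    val (w, cost) = workers.map(w => (w, estimate(t, w, fc, fd, fn))).minBy(_._2)
    (t, w, cost)
  }
  val (t, w, _) = perTaskBest.minBy(_._3)
  (t, w)
}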

5.5 Experimental Evaluation of Multi-cluster

To evaluate our multi-cluster architecture, we conducted comprehensive experiments to assess different aspects of our framework using two groups of experiments:

• We compare the performance of the multi-cluster architecture with the native single-cluster architecture on one local cluster infrastructure to find out the overhead of having a multi-cluster. For this group of experiments, we run our benchmark tests for native Spark (single-cluster), single-cluster HDM and multi-cluster HDM on an EC2 cluster with 20 nodes.

• We compare the performance of the multi-cluster architecture with the native single-cluster architecture on a cluster infrastructure with limited inter-connectivity to find out the effects of using a multi-cluster in a heterogeneous environment. For this group of experiments, we run our benchmark tests for native Spark (single-cluster), single-cluster HDM and multi-cluster HDM on an infrastructure that contains two EC2 VPC clusters (each with 10 nodes). The physical inter-connectivity between the two VPCs is bounded by a maximum of 1 Gbps.

5.5.1 Experimental Setup

Our experiments are built on Amazon EC2 (https://aws.amazon.com/ec2/) with up to 20 M3.2xlarge (8 vCPUs, 30 GB memory, 1 Gbit network) instances as workers and a variable number of M3.large instances as masters. The experimental infrastructure is shown in Fig 5.8. The workers are grouped into two clusters, each of which is managed by its own master. The two clusters are connected with a virtual router to control the bandwidth between them. In the experiments of this section, we compare HDM-MC with the current state-of-the-art framework, Apache Spark (v1.6.2), and the single-cluster HDM. Every Spark and HDM worker runs in the memory-only model on the JVM with 24 GB memory (480 GB aggregated JVM memory for the entire cluster).

Figure 5.8: Multi-cluster Infrastructure for Experiments.

5.5.2 Benchmark and Test Cases

To better understand the performance of our multi-cluster realization for different types of applications, we run four groups of test cases for each of the cluster settings listed in Section 5.5.1. The four groups of test cases include: basic primitives, pipelined operations, SQL queries and machine learning algorithms, as listed in Section 4.6.2 (TC-i denotes the i-th test case).

For the testing data sets, the user-visits data set from AmpLab (https://amplab.cs.berkeley.edu/benchmark/) is used for the primitives, pipelined operations and SQL benchmarks. The original user-visits data set is 119.34 GB; we replicate it to 238.68 GB for larger-scale tests. Before running the experiments, all tested data sets are downloaded and persisted in HDFS (Hadoop 2.4.0), which is hosted on the same cluster that executes the tested jobs.

5.5.3 Experimental Results

During the experiments, we compared the Job Completion Time (JCT) for every test case under each cluster setting. The results of each tested group are presented in the following subsections.

Scheduling Overhead of Multi-cluster Architecture

Figure 5.9: Comparison of the scheduling cost of the single- and multi-cluster architectures.

In the first group of experiments, we compare the scheduling cost (time spent in the planner and scheduler) of the multi-cluster architecture with the native single-cluster architectures (Spark and HDM) on a local cluster infrastructure to find out the overhead of introducing our multi-cluster architecture. Figure 5.9 shows the scheduling time (in ms) of different types of test cases on the single-cluster HDM and the multi-cluster HDM. As we can see from the results, the overhead cost of the more complicated scheduler in multi-cluster HDM is generally less than 500 ms (avg. 355 ms). The average overhead across all the test cases for using multi-cluster HDM is about an extra 11.41% compared with the single-cluster scheduler, which is an acceptable time cost considering that most of the jobs only take a few minutes to complete.

The main cost of the multi-cluster scheduler comes from the two-layered scheduling in HDM-MC. As we mentioned in Section 5.3.2 and Section 5.4.1, HDM-MC contains additional stage planning and stage scheduling steps at runtime to be able to explain and coordinate jobs that execute across multiple domains and clusters. Consequently, those extra steps incur additional cost during the job planning and scheduling phases compared to the single-cluster scheduler, which only needs to schedule local tasks.

Comparison of Performance on Heterogeneous Infrastructure

In the second group of experiments, we compare the job completion time and data transfer of the multi-cluster architecture with the single-cluster architectures (both Spark and HDM) on a two-cluster (each cluster has 10 nodes) infrastructure with limited inter-connectivity to find out the impact of our multi-cluster solution in a heterogeneous environment.

For parallel operations such as Map (TC-1), MultiMap (TC-5), Select (TC-9) and Where (TC-10), Spark, HDM-single and HDM-multi all show very similar JCTs. The schedulers in Spark (Delay Scheduling) and HDM (Minmin Scheduling for both the single- and multi-cluster architectures) are able to achieve data locality for the majority of the input for those parallel test cases. Therefore, almost no cross-cluster communication is needed during processing, as shown in Fig. 5.11.

For shuffle-intensive test cases including GroupBy (TC-2), Sort (TC-4), GroupBy-with-Transformation (TC-6) and OrderBy (TC-11), there are no optimizations for Spark, while HDM (both single- and multi-cluster) is able to achieve better data locality by estimating the computation cost in the scheduler based on the input data distribution in the shuffle stage. Therefore, HDM shows shorter job completion times due to less data being transferred (as shown in Fig. 5.11) during shuffling.

For aggregation-based shuffling, the performance is close for all the tested platforms. Much less data transfer across clusters is required for this type of job, as shown in Fig. 5.11. Spark applies map-side merge in primitives such as ReduceByKey (TC-2) and uses Catalyst to optimize Aggregations (TC-12) in SQL. In HDM, the optimization is achieved by Local Aggregation (Wu et al., 2015).

For pipelined operations with aggregation (TC-7) and filter (TC-8), HDM is able to apply operation reconstruction and reordering, which significantly reduces the data transferred during shuffling (as shown in Fig. 5.11). However, Spark does not provide any optimizations for these types of jobs. Consequently, HDM shows much better performance for this group of test cases, especially under a multi-cluster infrastructure.

In addition, as shown in the results, HDM multi-cluster achieves multi-cluster feasibility with the trade-off of only slightly longer job completion times compared with the HDM single-cluster architecture. The main cost comes from the longer scheduling time in the two-stage planning and scheduling of the multi-cluster architecture, plus a less optimal overall execution plan, as each master only holds half of the information about the workers compared to the single-cluster solution, which holds all the information in a global view.

(a) Primitives for user-visits

(b) Pipelines for user-visits

(c) SQL for user-visits

Figure 5.10: Comparison of Job Completion Times on a two-cluster Infrastructure. Note: the postfixes -1c and -2c denote the results on the single-cluster and two-cluster infrastructures respectively, e.g. Spark-1c and Spark-2c denote the results for running the test cases for Spark on the single-cluster and two-cluster infrastructures, respectively.

(a) Primitives for user-visits

(b) Pipelines for user-visits

(c) SQL for user-visits

Figure 5.11: Comparison of Data Transfer on a two-cluster Infrastructure

Discussion and Conclusion

In our evaluation and experiments, we show that our solution is able to achieve the feasibility of constructing a multi-cluster architecture with minimal scheduling cost and achieves very close performance when compared to the optimized single-cluster infrastructure of HDM. In addition, with our multi-cluster solution, users are able to apply data processing and analytics in both the multi-party scenario and the multi-cluster infrastructure within a single organization. We also provide the flexibility of switching between single- and multiple-cluster architectures (as presented in Section 5.2.3) to avoid unnecessary performance loss when it is not needed.

Chapter 6

Dependency and Execution History Management on HDM

During the continuous development and deployment of big data analytics applications, maintaining and managing applications that are constantly evolving is tedious and complicated work. In HDM, by drawing on the comprehensive information maintained by HDM models, the runtime engine is able to provide sophisticated dependency and history management for submitted jobs.

6.1 History Traces Management

In HDM, once an application is submitted to the server, it is assigned an application ID (if one is not specified) and a version number. Then, any future submissions and updates for the same application (identified by the application ID) obtain a new, unique and auto-incremented version number. In addition, execution dependencies (e.g. the binary jars of the libraries) are decoupled from the execution programs of HDM. Before executing an HDM application, all the dependent libraries associated with the specific application ID and version number must be submitted to the server. Subsequently, the application can be executed at any time and at any location within the cluster without re-submitting the dependencies. This facilitates dependency management for applications and also makes applications reproducible and portable across developers and locations. Moreover, during the execution of HDM jobs, all the meta-data of the applications, such as the logical plan, optimized plan, physical plan and timestamps of tasks, are automatically recorded by the server. This information is very helpful for users to profile, debug and analyze their deployed applications and jobs over their life cycles. The HDM server maintains two basic types of meta-data for each HDM application: the Dependency Trace and the Execution Trace.

• Dependency Trace: For each version of a submitted application, the HDM server records the dependent libraries that are required for execution. In addition, the server also maintains the history of dependency updates for each application in a tree-based structure. Client programs can query over the history tree by sending query messages. The information maintained in the Dependency Trace is listed in Table 6.1.

• Execution Trace: During execution, the HDM server also maintains information related to the runtime execution status of each HDM application and task. For each execution of an application, the server stores the meta-data of the explained HDM job as an ExecutionInstance, as listed in Table 6.2. In addition, for each task of the execution instance, the server records its execution information as a TaskTrace, which contains information such as the HDM ID/TaskID, created time, scheduled time, completion time, execution status, input/output DDMs and input/output types, as listed in Table 6.3. All the execution traces of each execution instance of an application are maintained as a DAG which represents the dependencies of the overall data flow and tasks.

Table 6.1: Information Maintained in the Dependency Trace

Attribute: Description
AppName: The name of the HDM application. AppName must be unique within each namespace.
Namespace: The namespace of the application.
Version: The version of this application. An application can have unlimited versions; the version number is auto-generated and maintained on the server side.
Author: The user who created this version of the application.
Timestamp: The time at which this version of the application was created.
Dependencies: A set of URL addresses for the libraries required for the execution of this version of the application.

Table 6.2: Attributes Maintained for each Execution Instance

Attribute: Description
AppName: The name of the executed HDM application.
Namespace: The namespace of the executed application.
Version: The version of the executed application.
ExeId: The ID of this execution instance. Each version of an application can be executed multiple times. ExeId is unique and auto-generated by the HDM execution engine.
Timestamp: The time at which this execution instance was created.
LogicalPlan: The logical operation graph which is generated by the planner.
LogicalPlanOpt: The logical operation graph after being optimized by the logical optimizers.
PhysicalPlan: The task DAG after being explained and parallelized for actual scheduling and execution.

All of this meta-data collected for HDM jobs is maintained as a tree-based history in the HDM server. Based on the comprehensive trace information stored in the server, users can issue queries to explore and analyze the historical information in which they are interested. Basically, users can query history information by sending two types of messages:

• AppHistoryMsg, which specifies the application name and a version range. The response contains the matched history information from the dependency trace trees.

• ExecutionHistoryMsg, which specifies the execution instance ID. The response contains the execution DAG with all the task information related to the execution instance. (A minimal sketch of sending these query messages is given after Table 6.3.)

Table 6.3: Attributes Maintained for each Executed Task

Attribute: Description
TaskID: The ID of each task within the execution instance. The TaskID is unique within each execution instance.
ExeId: The ID of the execution instance to which the task belongs.
Function: The basic function that is executed in this task.
Input: The input paths for the execution of this task.
Output: The output paths that store the results of this task.
Location: The execution location (executor address) of this task.
StartTime: The starting time of the execution of this task.
EndTime: The ending time of the completion or failure of this task.
Status: The status of this task. The status of a task includes: Created, Scheduled, Running, Completed and Failed.
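As a rough sketch of how a client program might issue these two queries, the snippet below defines the two message shapes and sends them through a generic ask call. The field names and the HistoryClient trait are assumptions made for illustration; they are not the exact HDM message or client definitions.

case class AppHistoryMsg(appName: String, fromVersion: String, toVersion: String)
case class ExecutionHistoryMsg(exeId: String)

trait HistoryClient {
  def ask(msg: Any): Any // placeholder for the actual messaging call to the HDM server
}

def queryHistory(client: HistoryClient): Unit = {
  // dependency history of an application between two versions
  val depHistory = client.ask(AppHistoryMsg("hdm://wordcount", "0.0.1", "0.0.5"))
  // full task DAG of one concrete execution instance
  val exeHistory = client.ask(ExecutionHistoryMsg("wordcount-exec-1"))
  println(depHistory)
  println(exeHistory)
}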

A concrete example of the historical trees maintained in the HDM manager is shown in Fig. 6.1. Basically, after an application (e.g. WordCount) is submitted to the HDM server, the dependency information (such as the application ID, version, dependent libs and the original logical plan) is recorded in the dependency tree. In addition, every new version of the same application creates a new branch under the application root and is differentiated by the version number.

For each version of the application, the job can be executed multiple times. Correspondingly, each execution of the job generates an execution instance in the execution trace tree. In each execution trace, the physical execution DAG of the job is recorded. Furthermore, for each node within the DAG, the runtime information (including the task ID, execution location, input/output paths and execution states) is also automatically maintained. Additionally, the references of execution instances are also maintained in the dependency trees accordingly (under each version branch of the applications) to facilitate future queries.

Figure 6.1: Dependency and Execution Traces of HDM.

6.2 Dependency Trace Synchronization in HDM Cluster

The dependency information of HDM applications is initially received and maintained on the master node. During execution, to be able to execute the tasks of the applications, every involved worker also needs to access the dependent libraries to accomplish its tasks and the overall jobs. Therefore, mechanisms are provided in the HDM cluster to persist, coordinate and synchronize the dependency information for execution.

After a dependent library is submitted to the master of HDM, the binary files of the library are stored at the master node and the meta-data of the library is appended to the server repository logs. When a master is initiated or restarted, it automatically loads the repository log information into the Dependency Manager so that it can coordinate and synchronize with the entire cluster.

In terms of dependency synchronization with workers in the cluster, there are two basic strategies in the current HDM master: Eager Synchronization and Lazy Synchro- nization.

• Eager Synchronization: The master node actively sends all newly received dependency information and files to all the nodes within the cluster. During execution, workers do not need any extra communication to obtain dependencies.

• Lazy Synchronization: The master only stores the dependency information on the master node when it receives the dependency message. Before executing HDM tasks, the involved workers check whether they have already synchronized the required dependencies for the related tasks. If not, they request the related dependency information and files from the master node or from other workers when available.

Eager Synchronization is better for small clusters, where broadcasting to all the nodes is inexpensive; for larger clusters containing thousands of workers, synchronizing the execution dependencies all at once is very expensive and might block the master from doing more important tasks. Lazy Synchronization scales well for large clusters, but checking and synchronizing before the execution of each job slightly slows down the overall execution process. In practice, cluster managers can choose the trade-off between these two strategies based on their application scenarios.
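A minimal sketch of the lazy check, assuming a simple local cache keyed by dependency URL and a placeholder fetch callback (not the actual HDM worker code), could look as follows.

class DependencyCache(fetchFromMaster: String => Array[Byte]) {
  private val local = scala.collection.mutable.Map[String, Array[Byte]]()

  // Called by a worker before executing the tasks of an application version:
  // any dependency that is not yet present locally is pulled on demand.
  def ensure(dependencies: Seq[String]): Unit =
    dependencies.foreach { dep =>
      if (!local.contains(dep)) local(dep) = fetchFromMaster(dep)
    }
}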

6.3 Reproduction of HDM Applications

Drawing on the dependency and execution information maintained in the HDM manager, developers and data scientists are able to reproduce historical HDM jobs and applications when requested (assuming that the input data is still available). This is a very significant feature for learning-based applications and algorithms, because there may be multiple back-and-forth steps while exploring a data set and finding the most suitable algorithm and parameters. In order to reproduce a historical HDM application, users need to specify the application name, version and reproduce level for the re-execution of the application. Based on the comprehensive meta-data maintained for each HDM application, the runtime engine of HDM is able to provide three different levels of re-execution:

• Full: The HDM server completely re-runs the application, starting from explaining the source program of the application. In this case, the re-execution only shares the same source code; the logical plan and the final task DAG might differ from the original execution instance.

• Logical: The HDM server re-runs the application based on the (optimized) logical plan stored in the execution trace. In this case, the re-execution is guaranteed to have the same logical plan as the original execution instance.

• Physical: The HDM server re-runs the application based on the task DAG (physical plan) stored in the execution trace. In this case, the re-execution is guaranteed to have the same data flow and parallelism as the original execution instance.

wordcount = HDM[(String, Int)]("hdm://wordcount", "0.0.1")
wordcount.compute(context = "10.10.0.100:8999", reExelevel = "Full") onComplete {
  case Success(resp) => resp.foreach(println)
  case Failure(exception) => println(exception)
}

Figure 6.2: Reproducing an existing word-count application

Fig. 6.2 shows example code for reproducing a WordCount application in HDM. Basically, the user specifies the name and version of the WordCount program and then sets the reproduce level to Full to re-execute the application starting from explaining the source program.

6.4 Composition of HDM applications

In addition to reproducing a historical application, developers can also develop new applications by composing existing applications in HDM. Consider the WordCount program (Fig. 3.3) as an example. Assuming the program is already deployed on the HDM server, a subsequent developer wants to write an extended program which counts the sum of all words that start with the letter 'a'. In HDM, as we already have a WordCount program deployed, the subsequent developer just needs a few lines of code to write a follow-up program that filters the WordCount results starting with the letter 'a' (as shown in Fig. 6.3). By specifying the data format, URL and expected version of an existing HDM job/application, developers can directly use it and generate extended jobs and applications.

Compared to HDM, the common approach in MapReduce or Spark is to re-write a brand new job and re-deploy it to the cluster. However, this is not the most elegant way to achieve composability. In HDM, as the logical plans of all the deployed applications are automatically maintained in the server, the application manager is able to apply composition based on the functional DAGs of applications rather than requiring developers to rewrite the entire program manually.

wordcount = HDM[(String, Int)]("hdm://wordcount", "0.0.1")
newCount = wordcount.filterByKey(k => k.startsWith("a"))

Figure 6.3: Applying a new operation to an existing program

An additional example is that developers sometimes want to reuse the same program to process data sets from different sources with different formats. In HDM, developers can re-use an existing job with a new HDM as input. Consider the WordCount example: we want to process a different data source in which words are separated by semicolons rather than commas. As the previous HDM program (in Fig. 3.3) is an HDM with the input type String, a substitute input can be constructed from the new data source and passed through to replace the old input HDM. To achieve this, developers only need to write a few lines of code, as shown in Fig. 6.4.

newText = HDM.string("hdfs://10.10.0.1/newPath").flatMap(_.split(";"))
newCount = HDM[(String, Int)]("hdm://wordcount", "0.0.1").compose(newText)

Figure 6.4: Replacing the input of an existing program

The two examples in this section also correspond to the two basic composition patterns of HDM, Compose and AndThen, respectively (described in Section 3.4.2). The composability of HDMs significantly improves development efficiency and is also very meaningful for integration and continuous development, because one team of developers can easily share their mature and fully tested data-driven applications. During the execution process, composed HDMs are new job instances and are executed separately, with no interference with the original ones. However, as HDM is strongly typed, one assumption for the composition of HDM applications is that subsequent developers should know the data types and URL in order to reuse an existing HDM job.

6.5 Fault Tolerance in HDM

By drawing on the dependency graph and the reproducibility of HDM jobs, the current HDM execution engine provides fault tolerance for data processing by using the pushing-lineage technique, in which lost or failed data partitions are re-computed from their parents or ancestors in the data dependency graph. Pushing lineage is a well-known technique that is also used in modern data-intensive frameworks such as Spark, Tachyon, Nectar and BAD-FS. In the future, we are planning to add more fault tolerance mechanisms, such as replication and snapshotting, to support the requirements of different types of applications.
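The lineage idea can be sketched as follows: when a partition is lost, it is recomputed by applying the recorded function to its parent partitions, recursing whenever a parent is also missing. The Partition type and the in-memory registry are illustrative stand-ins, not the actual HDM classes.

case class Partition(id: String,
                     parents: Seq[Partition],
                     compute: Seq[Seq[Any]] => Seq[Any])

object LineageRecovery {
  // partitions whose data is still available (or has already been recovered)
  private val available = scala.collection.mutable.Map[String, Seq[Any]]()

  def recover(p: Partition): Seq[Any] =
    available.getOrElseUpdate(p.id, {
      val parentData = p.parents.map(recover) // push the lineage upwards if needed
      p.compute(parentData)
    })
}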

6.6 Case Study

To evaluate the work in this chapter, a case study is presented to show how users can develop a pipelined machine learning application by integrating multiple HDM jobs. We use an image classification pipeline as the case study, in which the training pipeline consists of three main parts: the image data parser, the feature extractor and the classification trainer.

In the image-classification pipeline, components like the feature extractor and classification trainer are commonly used algorithms for many machine learning applications. This means that they could be developed by other developers and published as shared applications in HDM. Subsequent developers can then find those existing HDM descriptions, which they can directly re-use and integrate to create a simple image classification application by writing just a few lines of code, as shown in Fig. 6.5.

/* define model */
model = new AtomicObject[Vector]
/* specify data */
data = HDM[_, Byte]("hdfs://10.10.0.1:9001/images/*").map(arr => Vector(arr))
/* specify existing components */
extractor = HDM[Vector, Vector]("hdm://feature-extractor", "1.1")
learner = HDM[Vector, Vector]("hdm://linear-classifier", "1.0.0")
/* connect components as a pipelined job */
imageApp = extractor.compose(data).andThen(learner)
/* run the job and update the model */
imageApp.traverse(context = "10.10.0.1:8999") onComplete {
  case Success(resp) => model.update(resp.next)
  case Failure(exception) => println(exception)
}

Figure 6.5: Creating an image classification pipeline in HDM

In principle, the kernel engines of Hadoop and Spark do not support direct integration and composition of deployed jobs for subsequent development. Therefore, programmers may need to manually combine programs written by different developers or re-write every component by themselves. High-level frameworks such as Flume (http://incubator.apache.org/flume), Pig (Olston et al., 2008) and Oozie (Islam et al., 2012) support writing data pipelines in a pre-defined programming manner and automatically generate MapReduce jobs. However, they sacrifice some flexibility for integration and interaction with the general programming context of developers. HDM programs, being essentially a Scala programming library, can be directly embedded within other Scala or Java programs. Basically, the HDM server acts as both an execution engine and an application repository, which enables developers to easily check and integrate with published HDM applications.

Chapter 7

Data Pipeline on Multiple Execution Platforms

Over the past years, big data processing frameworks, such as MapReduce and Spark, have been presented to tackle ever larger data sets distributed on large-scale clusters. These frameworks significantly reduce the complexity of developing big data programs and applications. In practice, many real-world scenarios require the pipelining and integration of multiple data processing and analytics jobs. For example, an image analysis application requires many pre-processing steps such as image parsing and feature extraction, while the core machine learning algorithm is only one component within the whole analytics flow. However, the developed jobs are not easy to integrate or pipeline to support more complex data analytics scenarios. To integrate data jobs executed in heterogeneous execution environments, a large amount of glue code has to be written to get data into and out of the deployed jobs being integrated. According to a Google report, a mature system in the real world might contain only 5% machine learning code and (at least) 95% glue code (Sculley et al., 2014).

To support the integration and pipelining of big data jobs, many higher-level pipeline frameworks have been proposed, such as Crunch (https://crunch.apache.org/), Pig (Olston et al., 2008) and Cascading (http://www.cascading.org/). Most of these existing data pipeline frameworks are built on top of a single data processing execution environment and require the pipelines to be written in their specifically defined interfaces and programming paradigms. In addition, pipeline applications keep evolving to address new changes and requirements. These pipeline applications could also contain various legacy components that need to be executed on different execution environments. Therefore, the maintenance and management of such pipelines become very complicated and time-consuming.

In this chapter, we present the framework Pipeline61, which aims to reduce the effort of maintaining and managing data pipelines across heterogeneous execution contexts without major rewriting of the original jobs. In particular, Pipeline61: 1) integrates data processing components that are executed in various environments, including MapReduce, Spark and scripts; 2) re-uses the existing programs of data processing components as much as possible so that developers do not need to learn a new programming paradigm; and 3) provides automated version control and dependency management for both the data and the components in each pipeline instance during its lifecycle.

7.1 Motivating Scenarios

The work of this chapter is motivated by a realistic suspicion detection system, the data processing pipeline of which is shown in Fig 7.1. In this scenario, multiple input data sets are collected from different departments and organizations, such as vehicle registration records provided by government road services, personal income reports provided by the government tax office, or travel histories provided by airline companies. For the different data sources, the collected data may have different formats (e.g. CSV, text and JSON) with different schemas.

[Figure 7.1 depicts CSV, Text and JSON data sources stored in HDFS flowing through Data Clean, Data Pre-processing and Data Analysis stages implemented with MapReduce, Python, Bash and Spark components.]

Figure 7.1: The data process pipeline of a real-world suspicion detection system.

On one hand, due to different technical preferences at different stages of the pipeline, data processing components might be developed by different data scientists or engineers using different techniques and frameworks, such as IPython, MapReduce, R and Spark. Some legacy components might also be implemented with Bash scripts or third-party software. As a result, it is complicated and tedious to manage and maintain those pipelines which involve heterogeneous execution environments and keep them all updated throughout their lifecycles. The cost of using a new pipeline framework to replace the old one is also expensive and could even be unaffordable. In the worst-case scenario, developers might need to re-implement all the data processing components from scratch.

On the other hand, pipeline applications keep evolving and being updated to deal with continuous changes and requirements to the system. For example, new data sources could be added as new inputs; existing data sources might introduce changes to their formats and schemas; and the analytic components can be upgraded to improve efficiency and accuracy. All of these can cause continuous changes and updates to the components in the pipeline. One challenge is to provide both traceability and reproducibility during the evolution of the pipelines. Pipeline developers may want to check the history of a pipeline to compare the effects before and after updates. In addition, each of the data processing components should be able to roll back to a previous version when required.

7.2 Pipeline on Heterogeneous Execution Contexts

Figure 7.2: Architecture overview of Pipeline61.

To bridge the gap, we present our framework Pipeline61. The architecture of the framework is illustrated in Fig 7.2. There are three main components within the framework: the Execution Engine, the Dependency and Version Manager and the Data Service. The current implementation of Pipeline61 supports the execution of data processing components based on MapReduce, Spark and Bash scripts. The Execution Engine is responsible for triggering, monitoring and managing the execution of pipelines. The Data Service provides a uniformly managed data I/O layer that handles the tedious work of data exchange and conversion between various data sources and execution environments. The Dependency and Version Manager provides a number of mechanisms to automate version control and dependency management for both the data and the components within the pipelines. The pipeline framework is controlled through management APIs, which allow developers to test, deploy and monitor pipelines by sending and receiving messages.

7.2.1 Pipeline Model

In Pipeline61, every component in a pipeline is represented as a Pipe, which has the following attributes:

• Name: The name of the pipe, which must be unique and is associated with all the management information of the pipe. The pipe's name may contain namespace information.

• Version: The version number of the pipe, which automatically increases to repre-

sent different versions of the same pipe. Users can also specify the version number

to execute a specific version of the pipe.

• Pipeline Server: The address of the associated pipeline server, which is responsi-

ble for managing and maintaining the pipe. The pipe needs the address information

for sending back the notification message to the pipeline server during execution.

• Input Path / Output Path: The URL of the path contains both the protocol and

address of the input/output data of the pipes. The protocol represents the persistent

system types such as HDFS, JDBC, S3, File and other data storage systems.

• Input Format and Output Format: This specifies the exact reading/writing format of the input/output data.

• Execution Context: The context specifies the execution environment and any other

information required by the underlying execution framework.

The ExecutionContext attribute is associated with different data processing frameworks. There are three major ExecutionContexts in the current framework:

• Spark ExecutionContext contains a SparkProc attribute which provides the trans-

formation function from input RDDs to output RDDs or from input DataFrame to

output DataFrame for SparkSQL;

• MapReduce ExecutionContext contains a few structured parameters to specify the

Mapper, Reducer, Combiner and Partitioner for a MapReduce job. Other parame-

ters could be added as key-value parameters.

• Shell ExecutionContext contains a script file or in-line commands for execution.

Python and R scripts are considered as sub-classes of shell pipes with more inputs

and outputs controlled by the data service. One limitation of the shell pipe is that

it relies on the developers to manually deal with the data conversion for the input

and output.

The code snippet below shows how to write a simple SparkPipe. Basically, develop- ers just need to wrap the Spark RDD functions with the SparkProc interface, then use the SparkProc to initiate a SparkPipe object.

Listing 7.1: Example Code of Specifying a SparkPipe

class Wordcount extends SparkProc[String, (String, Int)] {
  def process(in: RDD[String], sc: SparkContext): RDD[(String, Int)] = {
    in.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
  }
}

wcPipe = new SparkPipe(name = "wordcount", version = "0.0.1", exec = new Wordcount)

In Pipeline61, different types of pipes can be seamlessly integrated at the logical level. Method notations are provided to connect pipes into pipelines. After being connected, the outputs of the previous pipes are considered as the inputs of the following ones. A more concrete example is shown in Section 7.3.

7.2.2 Execution Engine

The execution engine contains three components: the Pipeline Server Backend, the DAG Scheduler and a set of Task Launchers. The Pipeline Server Backend contains a number of message handlers which receive and process the messages sent by both users and running tasks. Users can send messages to submit, deploy and manage their pipeline jobs and dependencies. Tasks of the running pipelines can also send messages to notify their runtime status. Runtime messages can also trigger events for the scheduling and recovery processes during execution. The DAG Scheduler traverses the task graph of a pipeline in a backward manner and submits the tasks to the corresponding environments for execution. A task is scheduled for execution when all of its parent tasks have been successfully computed. Task Launchers are used to launch execution processes for pipes. Different types of pipes are launched by the corresponding launchers for their execution contexts:

• Spark Launcher initiates a sub-process as the driver process to execute the Spark

job and captures the notifications of runtime status then sends back to the pipeline

server for monitoring and debugging purposes.

• MR Launcher initiates a sub-process to submit the MapReduce job specified by the

pipe. The sub-process waits until the job has succeeded or failed before sending

the execution states back to the pipeline server.

• Shell Launcher creates a sequence of channeled processes to handle the shell scripts or commands specified by the shell pipes. Once the sequence of processes succeeds, or if any of them fails, the related state messages are sent to the pipeline server.

New task launchers can be implemented to support new execution contexts in two ways: a) through the APIs provided by the execution frameworks (such as Hadoop and Spark); or b) by initiating a sub-process and executing the program logic in the launched process (as for shell scripts, Python and R). Theoretically, any task that can be started by executing a shell script can be supported through a process launcher.
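For the second extension path, a minimal sketch of a process-based launcher might look as follows. The notify callback stands in for the real messaging layer back to the pipeline server; the class and method names are assumptions for illustration only.

import scala.sys.process._

class ProcessLauncher(notify: String => Unit) {
  // Runs an arbitrary command in a sub-process and reports its exit status.
  def launch(pipeName: String, command: Seq[String]): Int = {
    val exitCode = Process(command).! // blocks until the sub-process terminates
    if (exitCode == 0) notify(s"$pipeName succeeded")
    else notify(s"$pipeName failed with exit code $exitCode")
    exitCode
  }
}

// e.g. new ProcessLauncher(println).launch("analysisPy", Seq("python", "analysis.py"))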

7.2.3 Data Service

Every pipe is executed independently at runtime. A pipe reads and processes the input data according to the input paths and formats, then writes the output into the expected storage system. Managing the data IO for various protocols and formats is often tedious and error-prone. Thus, we take most of the data IO work away from developers by providing a Data Service in Pipeline61. The Data Service provides a collection of data parsers, each of which is responsible for reading and writing data in a specific execution environment according to the given format and protocol. For example, for a Spark pipe, the data service uses the native Spark API to load text files as RDD objects or uses the SparkSQL API to load data from JDBC or JSON files as a Spark DataFrame. For a Python pipe, CSV files in HDFS are loaded through the Python Hadoop API and transferred as a Python DataFrame. Basically, the Data Service provides the mapping from data protocols and formats to concrete data parsers in a specific execution context. For flexibility, the data service can be extended by implementing and registering new types of data parsers. Data parsing toolkits such as [12] can be used as a complementary implementation in the Data Service.
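The mapping from protocols and formats to parsers can be sketched roughly as below; the DataParser trait and the registry are illustrative placeholders rather than the actual Pipeline61 interfaces.

trait DataParser {
  def read(path: String): Iterator[String]
  def write(path: String, data: Iterator[String]): Unit
}

object DataService {
  private val parsers = scala.collection.mutable.Map[(String, String), DataParser]()

  def register(protocol: String, format: String, parser: DataParser): Unit =
    parsers((protocol, format)) = parser

  // Resolve a parser for a concrete URL (e.g. "hdfs://...") and a format (e.g. "csv").
  def parserFor(url: String, format: String): DataParser = {
    val protocol = url.takeWhile(_ != ':')
    parsers.getOrElse((protocol, format),
      throw new IllegalArgumentException(s"no parser registered for $protocol/$format"))
  }
}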

7.2.4 Dependency and Version Manager

Realistic pipeline applications are continuously evolving and being updated to deal with new changes and requirements. It is always a crucial but complicated piece of work for pipeline administrators to manage and maintain the pipelines throughout their life cycles. To relieve the pain of managing evolving pipelines, Pipeline61 provides a dependency and version manager to help users maintain, track and analyze the historical information of both the data and the components involved in the pipelines. The dependency and version manager maintains three major types of information for every pipeline: the Data Snapshot, the Pipeline Execution Trace and the Pipe Dependency Trace, as shown in Fig 7.3.

• Data Snapshot contains the input, output locations and sampled data for every

execution instance of each component in a pipeline.

• Pipeline Execution Trace maintains the data flow graph for every execution instance of a pipeline application. Every node in the graph also contains the meta-data of the component in that instance, such as the start time, end time and execution status.

• Pipe Dependency Trace maintains the historical meta-data for the different versions of each component in a pipeline. The dependency information is stored as a tree structure for each component. The meta-data stored in the tree includes the name, version, author and timestamp of the latest update, as well as the dependent libraries for execution.

With the above historical information maintained by the Dependency and Version Manager, users of Pipeline61 are able to apply analysis techniques to the pipeline history and reproduce historical results by re-running older versions of the pipelines.

Figure 7.3: Historical and dependency information maintained in Pipeline61.

7.3 Case Study

In this section, we present a case study that shows the effectiveness and advantages of Pipeline61. In the scenario of the case study, there are three data sources collected from different organizations in three different formats: CSV, text and JSON. There are two groups of data scientists who perform data analysis on the overall data sets with several manually maintained and connected programs written in MapReduce, Spark and Python scripts. There are two problems in this scenario. First, manually triggering the execution of the entire pipeline is tedious and repetitive; therefore, there is a requirement for a framework that supports the automated execution of data pipelines on heterogeneous execution contexts (including MapReduce, Spark, Python, etc.). Second, while the data scientists keep exploring the analytics pipelines, they may need to keep modifying their algorithms and move back and forth between different versions of the programs to find the optimal solution. As a result, they need a tool that helps them effectively manage the complexities of version control and dependency management for their programs in the data pipeline. To address these issues, we introduce our pipeline framework to automate the execution and facilitate the management of the pipeline jobs. The code snippet below shows how the pipeline is specified in Pipeline61.

Listing 7.2: Example Code of an Analytics Pipeline

csvMapper = new MRPipe(name = "csvMapper", version = "0.0.1",
  mapper = new CSVMapper,
  inputPath = "hdfs://127.0.0.1:9001/user/org1/data/csv/")
jsonMapper = new MRPipe(name = "jsonMapper", version = "0.0.1",
  mapper = new JSONMapper,
  inputPath = "hdfs://127.0.0.1:9001/user/org2/data/json/")
textMapper = new MRPipe(name = "textMapper", version = "0.0.1",
  mapper = new TextMapper,
  inputPath = "hdfs://127.0.0.1:9001/user/org3/data/text/")

dataJoiner = new SparkPipe(name = "dataJoiner", version = "0.0.1", exec = new DataJoinerFunc)
extractorPy = new SparkPipe(name = "extractorPy", version = "0.0.1", exec = new ExtractorPy)
extractorSpark = new SparkPipe(name = "extractorSpark", version = "0.0.1", exec = new ExtractorSpark)
analysisPy = new ShellPipe(name = "analysisPy", version = "0.0.1", script = "/user/dev/analysis.py")
analysisSpark = new SparkPipe(name = "analysisSpark", version = "0.0.1", exec = new SparkAnalysis)

pipeline = ((csvMapper, jsonMapper, textMapper) ->: dataJoiner) :->
  (extractorPy :-> analysisPy, extractorSpark :-> analysisSpark)

PipelineContext.exec(pipeline)

Firstly, three data transformers (csvMapper, jsonMapper and textMapper) are defined to process the input data in the different formats. These three MapReduce pipes are specified by passing the existing Mapper classes as data parsers. For the second phase of the pipeline, a Spark pipe named dataJoiner is specified with the RDD function DataJoinerFunc to join the three outputs of the previous mappers into a whole. In the last step, two branches of analysis pipes are specified to consume the output of the dataJoiner. Because each of the analysis branches is interested in different features of the input, a feature extractor is added before each of the actual analysis components. The last two analysis components are implemented as Python and Spark pipes, respectively. Eventually, the overall data flow is defined by connecting all the specified pipes together using the connecting notations.

In this scenario, using existing pipeline frameworks such as Crunch and Cascading would require developers to re-implement everything from scratch following their specific programming paradigms. This not only forgoes the re-use of existing programs written in MapReduce, Python or shell scripts, but also restricts users to certain data analysis frameworks like IPython and R.

In contrast, Pipeline61 does not re-invent a new programming paradigm. Instead, it focuses on pipelining and managing the heterogeneous components in the pipelines. Thus, it can significantly reduce the effort of integrating existing data processing components with legacy code, for which re-implementing everything would be risky and time-consuming.

Future development and update processes of the pipeline would also benefit from the version and dependency management of Pipeline61. For example, if developers want to update one of the components to a new version, they can sample the latest input and output of the component from the data snapshot history. Then, developers can implement and test the new program based on the sampled data to make sure that the new version does not break the pipeline. Before submitting the updated component to the production environment, the developer can specify a new pipeline instance with the updated component and compare its output with the online version to double-check the correctness. Moreover, if a recently updated component shows any errors after deployment, the pipeline manager can easily roll back to a previous version, as all the historical data and dependencies of every component are automatically maintained by the pipeline server.

This DevOps support is very meaningful for the actual maintenance and management of pipeline applications but is rarely offered by existing pipeline frameworks.

7.4 Comparison and Discussion

Most of the current frameworks for building pipelined big data jobs are built on top of a data processing engine (e.g. Hadoop) and use an external persistence service (e.g. HDFS) for exchanging data. Crunch is a pipeline framework that defines its own data model and programming paradigm to support writing pipelines and executing pipeline jobs on top of both MapReduce and Spark. Pig uses a data-flow based programming paradigm for writing ETL scripts, which are translated into MapReduce jobs during execution. Cascading provides operator-based programming interfaces for pipelines and supports executing Cascading applications on MapReduce. Flume (https://flume.apache.org/) is originally designed for log-based pipelines; it allows creating a pipeline using configuration files and parameters.

MRQL (https://mrql.incubator.apache.org/) is a general system for query processing and optimization on top of various execution environments such as Hadoop, Spark and Flink. Tez (Saha et al., 2015) is a DAG-based optimization framework which can optimize MapReduce pipelines written in Pig and Hive.

Compared to the frameworks above, Pipeline61 provides three main capabilities:

• Support pipelining and integration of heterogeneous data processing jobs (MapRe-

duce, Spark and scripts).

• Re-use the existing programming paradigms rather than requiring developers to

learn new ones for writing analytics algorithms.

• Provide automated version control and dependency management to facilitate his-

torical traceability and reproducibility, which are very important for a continuously

developing pipelines.

Apache OODT (http://oodt.apache.org/) is a data grid framework which provides capabilities for capturing, locating and accessing data among heterogeneous environments. It shares both similarities and differences with Pipeline61. Firstly, Apache OODT provides more general task-driven workflow execution, in which developers need to write their own programs to invoke different execution tasks, whereas Pipeline61 focuses on deep integration with contemporary big data processing frameworks including Spark, MapReduce and IPython. Secondly, OODT uses an XML-based specification for its pipelines, while Pipeline61 provides programmable interfaces in different programming languages. Lastly, OODT maintains general information and meta-data for the shared data sets, but Pipeline61 provides explicitly defined provenance information for both the input/output data and the transformations of each task within the pipelines. Consequently, Pipeline61 natively supports reproducing and re-executing all historical pipelines, or even parts of them.

Chapter 8

Conclusion and Future Work

8.1 Conclusion

In this thesis, we have presented HDM as a functional and strongly-typed meta-data abstraction, along with a runtime system implementation to support the execution, opti- mization and management of HDM applications.

A Functional Meta-data Abstraction for Big Data Processing - HDM

In Chapter 3, we presented HDM (Hierarchically Distributed Matrix) which is a func- tional meta-data abstraction for writing and representing data-oriented programs. HDM is essentially a structured representation of functions in a distributed computing context with the awareness of the data types, partition dependencies as well as location infor- mation of inputs. With the abstraction of HDM, users can easily write functional data analytics programs while the programming interfaces explicitly and automatically record and maintain important information as the data dependency graphs for future explana- tion, optimization and execution.


Functional Dataflow Optimization based on HDM

Based on the functional dependency graph of HDM, several data flow optimization techniques, including Function Fusion, Local Aggregation, Operation Reordering and Cache Detecting, are applied to HDM programs in the job planning stage (as presented in Chapter 4). In particular, the data flows of HDM jobs are automatically transformed and reconstructed based on the optimization rules before they are executed in the distributed environment. Therefore, programming in HDM releases developers from the tedious task of manually optimizing data-driven programs so that they can focus on the program logic and the data analysis algorithms. We conducted comprehensive experiments to evaluate the performance of HDM jobs. The experimental results show the competitive performance of HDM in comparison with Spark, especially for pipelined operations that contain aggregations and filters.

Towards a Multi-Cluster Architecture

In addition, we extended the kernel of HDM towards a multi-cluster solution, HDM-MC (Chapter 5), which fills the gap where a multi-cluster architecture is needed for complicated and highly distributed infrastructures. Our solution enables multi-cluster feasibility with minimal scheduling cost and, with the scheduling optimizations, the multi-cluster architecture shows reasonably good performance in our experiments when compared with the current state-of-the-art single-cluster platforms. In addition, we also provide mechanisms to support dynamic switching between single- and multi-cluster architectures, which provides both convenience and flexibility for system administrators to dynamically evolve and manage their clusters based on different requirements.

Dependency and Execution History Management on HDM

Furthermore, due to the functional nature of the HDM abstraction, applications written in HDM are natively composable and can be integrated with existing applications. Moreover, by drawing on the comprehensive dependency and trace information maintained in the HDM manager, the HDM framework is able to provide more sophisticated support for HDM jobs, helping users effectively maintain continuously evolving big data analytics applications. In Chapter 6, we presented several management capabilities that are naturally supported in the HDM framework, such as dependency management, version control and execution history tracing, as well as the composition and reproduction of HDM jobs. Finally, we also presented a case study to show the composability and reproducibility of HDM programs.

Data Pipeline on Multiple Execution Platforms

In Chapter 7, we propose a pipeline framework, Pipeline61, which supports the execution of data pipelines on heterogeneous execution contexts and reduces the effort of maintaining and managing data pipelines with multiple versions and legacy components. A case study has also been conducted to show the effectiveness and advantages of Pipeline61.

8.2 Future work

We would like to note that the kernel of HDM is still in an early stage of development, and some limitations are left to be solved in our future work: 1) disk-based processing needs to be supported in case the overall cluster memory is insufficient for very large jobs; 2) fault tolerance needs to be considered as a crucial requirement for practical usage; and 3) one long-term challenge we are planning to solve is the optimization of processing heterogeneously distributed data sets, which normally results in heavy outliers and can seriously slow down the overall job completion time and degrade global resource utilization. We are already in the process of addressing all of the above-mentioned limitations.

Our work is also an initial but important step towards providing a thorough solution that supports data processing on complicated and heterogeneous infrastructures. There is much further work that can be applied and extended in our future plans. Firstly, it is challenging and promising to investigate data replication strategies to reduce data transfers across network boundaries. Secondly, many data flow optimizations in single-cluster infrastructures can be further explored under multi-cluster scenarios. Last but not least, security and trust models are crucial and necessary to make the solution applicable to real-world applications and products.

In addition, we also plan to enhance our pipeline framework (Pipeline61) in the future. Firstly, Pipeline61 does not check the compatibility of data schemas across multiple data processing frameworks. So far, it relies on developers to manually test the input and output of every pipe during the development phase to make sure that the output of a pipe can be fed into the following pipe. In the future, we plan to utilize existing schema matching techniques to solve this problem (see the illustrative sketch below). Secondly, most of the intermediate results produced during execution are required to be written to an underlying physical data store (such as HDFS) in order to connect pipes with different execution contexts and to ensure reliability for every component in the pipeline. Thus, the execution of pipelines in Pipeline61 is generally slower than that of frameworks which run independently within a single execution environment without integrating with external systems.
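As a hint of what such a compatibility check could look like, the naive sketch below (purely hypothetical; Pipeline61 does not currently include it) verifies that every field required by a downstream pipe is present, with a matching declared type, in the upstream pipe's output schema.

    // Naive schema-compatibility sketch (hypothetical; not part of Pipeline61).
    case class Field(name: String, tpe: String)
    case class Schema(fields: Seq[Field]) {
      private val byName = fields.map(f => f.name -> f.tpe).toMap
      // Returns a list of problems; an empty list means the schemas are compatible.
      def problemsWhenFeeding(required: Schema): Seq[String] =
        required.fields.flatMap { f =>
          byName.get(f.name) match {
            case None                  => Some(s"missing field '${f.name}'")
            case Some(t) if t != f.tpe => Some(s"field '${f.name}': expected ${f.tpe}, found $t")
            case _                     => None
          }
        }
    }

    object SchemaCheckDemo {
      def main(args: Array[String]): Unit = {
        val upstreamOutput  = Schema(Seq(Field("id", "Long"), Field("score", "Double")))
        val downstreamInput = Schema(Seq(Field("id", "Long"), Field("label", "String")))
        val problems = upstreamOutput.problemsWhenFeeding(downstreamInput)
        if (problems.isEmpty) println("schemas compatible")
        else problems.foreach(p => println(s"incompatible: $p"))
      }
    }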
