Big Data Processing on Arbitrarily Distributed Dataset
Big Data Processing on Arbitrarily Distributed Dataset

by Dongyao Wu

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING
FACULTY OF ENGINEERING

Thursday, 1st June, 2017

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author. © 2017 by Dongyao Wu

THE UNIVERSITY OF NEW SOUTH WALES
Thesis/Dissertation Sheet

Surname or Family name: Wu
First name: Dongyao
Other name/s:
Abbreviation for degree as given in the University calendar: PhD
School: School of Computer Science and Engineering
Faculty: Faculty of Engineering
Title: Big Data Processing on Arbitrarily Distributed Dataset

Abstract (350 words maximum):

Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. These frameworks significantly reduce the complexity of developing big data programs and applications. However, in reality, many real-world scenarios require pipelining and integration of multiple big data jobs. As big data pipelines and applications become more and more complicated, it is almost impossible to manually optimize the performance of each component, not to mention the whole pipeline/application. At the same time, there are also increasing requirements to facilitate interaction, composition and integration for big data analytics applications in continuously evolving, integrating and delivering scenarios. In addition, with the emergence and development of cloud computing, mobile computing and the Internet of Things, data are increasingly collected and stored in highly distributed infrastructures (e.g. across data centres, clusters, racks and nodes).
To deal with the challenges above and fill the gap in existing big data processing frameworks, we present the Hierarchically Distributed Data Matrix (HDM) along with a system implementation to support the writing and execution of composable and integrable big data applications. HDM is a light-weight, functional and strongly-typed meta-data abstraction which contains complete information (such as data format, locations, dependencies and functions between input and output) to support parallel execution of data-driven applications. Exploiting the functional nature of HDM enables deployed HDM applications to be natively integrable and reusable by other programs and applications. In addition, by analysing the execution graph and functional semantics of HDMs, multiple automated optimizations are provided to improve the execution performance of HDM data flows. Moreover, by extending the kernel of HDM, we propose a multi-cluster solution which enables HDM to support large-scale data analytics in multi-cluster scenarios. Drawing on the comprehensive information maintained by HDM graphs, the runtime execution engine of HDM is also able to provide provenance and history management for submitted applications. We conduct comprehensive experiments to evaluate our solution against the current state-of-the-art big data processing framework, Apache Spark.

Declaration relating to disposition of project thesis/dissertation

I hereby grant to the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or in part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all property rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation.
I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only).

Date

The University recognises that there may be exceptional circumstances requiring restrictions on copying or conditions on use. Requests for restriction for a period of up to 2 years must be made in writing. Requests for a longer period of restriction may be considered in exceptional circumstances and require the approval of the Dean of Graduate Research.

FOR OFFICE USE ONLY
Date of completion of requirements for Award:

COPYRIGHT STATEMENT

'I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

Signed ........................
Date ........................

AUTHENTICITY STATEMENT

'I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.'

Signed ........................
Date ........................

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

Signed ........................
Date ........................

To my family, friends and supervisors.

Abstract

Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. These frameworks significantly reduce the complexity of developing big data programs and applications. However, in reality, many real-world scenarios require pipelining and integration of multiple big data jobs. As big data pipelines and applications become more and more complicated, it is almost impossible to manually optimize the performance of each component, not to mention the whole pipeline/application. At the same time, there are also increasing requirements to facilitate interaction, composition and integration for big data analytics applications in continuously evolving, integrating and delivering scenarios. In addition, with the emergence and development of cloud computing, mobile computing and the Internet of Things, data are increasingly collected and stored in highly distributed infrastructures (e.g. across data centers, clusters, racks and nodes).
To deal with the challenges above and fill the gap in existing big data processing frameworks, we present the Hierarchically Distributed Data Matrix (HDM) along with the system implementation to support the writing and execution of composable and integrable big data applications. HDM is a light-weight, functional and strongly-typed meta-data abstraction which contains complete information (such as data format, locations, dependencies and functions between input and output) to support parallel execution of data-driven applications. Exploiting the functional nature of HDM enables deployed HDM applications to be natively integrable and reusable by other programs and applications. In addition, by analyzing the execution graph and functional semantics of HDMs, multiple automated optimizations are provided to improve the execution performance of HDM data flows. Moreover, by extending the kernel of HDM, we propose a multi-cluster solution which enables HDM to support large-scale data analytics in multi-cluster scenarios. Drawing on the comprehensive information maintained by HDM graphs, the runtime execution engine of HDM is also able to provide provenance and history management for submitted applications. We conduct comprehensive experiments to evaluate our solution against the current state-of-the-art big data processing framework, Apache Spark.

Acknowledgments

Thanks to everyone who has helped and/or accompanied me during my PhD.

Contents

Abstract
Acknowledgments
List of Figures
List of Tables
1 Introduction
  1.1 Background
    1.1.1 Big Data
    1.1.2 Big Data Enabling Technologies
  1.2 Motivation
  1.3 Publications
2 Literature Review and Related Work
  2.1 Big Data Processing Frameworks: State-of-the-Art
    2.1.1 MapReduce
    2.1.2 Spark
    2.1.3 Flink
    2.1.4 Other Big Data Processing Frameworks
    2.1.5 Discussion
  2.2 Optimizations