HIGH PERFORMANCE INTEGRATION OF DATA PARALLEL FILE SYSTEMS AND COMPUTING:

OPTIMIZING MAPREDUCE

Zhenhua Guo

Submitted to the faculty of the University Graduate School

in partial fulfillment of the requirements

for the degree

Doctor of Philosophy

in the Department of Computer Science

Indiana University

August 2012

Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements of the degree of Doctor of Philosophy.

Doctoral Committee:

Geoffrey Fox, Ph.D. (Principal Advisor)

Judy Qiu, Ph.D.

Minaxi Gupta, Ph.D.

David Leake, Ph.D.

Copyright © 2012

Zhenhua Guo

ALL RIGHTS RESERVED

I dedicate this dissertation to my parents and my wife Mo.

Acknowledgements

First and foremost, I owe my sincerest gratitude to my advisor Prof. Geoffrey Fox. Throughout my Ph.D. research, he guided me into the research field of distributed systems; his insightful advice inspired me to identify the challenging research problems I am interested in; and his generous intellectual support was critical for me to tackle difficult research issues one after another. During the course of working with him, I learned how to become a professional researcher.

I would like to thank my entire research committee: Dr. Judy Qiu, Prof. Minaxi Gupta, and Prof. David Leake. I am greatly indebted to them for their professional guidance, generous support, and valuable suggestions that were given throughout this research work.

I am grateful to Dr. Judy Qiu for offering me the opportunities to participate in closely related projects.

As a result, I obtained a deeper understanding of related systems including Hadoop and Twister, and could better position my research in the big picture of the whole research area.

I would like to thank Dr. Marlon Pierce for his guidance and support of my initial science gateway research. Although the topic of this dissertation is not science gateways, he helped me strengthen my research capability and lay the foundation for my subsequent research on distributed systems. In addition, he provided excellent administrative support for the use of our laboratory clusters. Whenever I had problems with user accounts or environment setup, he was always willing to help.

I was fortunate to work with brilliant colleagues on publications and projects. My discussions with them broadened my research horizon and inspired new research ideas. I would like to thank Xiaoming Gao, Yuan Luo, Yang Ruan, Yiming Sun, Hui Li, and Tak-Lon Wu.

Without the continuous support and encouragement from my family, I could not have come this far. Whenever I encountered any hurdle or frustration, they were always there with me. The love carried by the soothing phone calls from my parents was the strength and motivation for me to pursue my dream. Words cannot fully express my heartfelt gratitude and appreciation to my lovely wife Mo. She accompanied me closely through hardship and tough times in this six-year-long journey. She made every effort to help with my life as well as my research. They made the long Ph.D. journey a joyful and rewarding experience.

Finally, I would like to thank the administrative staff of Pervasive Technology Institute and the School of Informatics and Computing for providing continual help during my Ph.D. study. With their help, I could concentrate on my research.

Abstract

The ongoing data deluge brings parallel and distributed computing into a new data-intensive computing era, where many assumptions made by prior research on grid and High Performance Computing need to be reviewed to check their validity and to explore their performance implications. Data parallel systems, which differ from the traditional HPC architecture in that compute nodes and storage nodes are not separated, have been proposed and widely deployed in both industry and academia. Many research issues, which did not exist before or were not under serious consideration, arise in this new architecture and have a drastic influence on performance and scalability. MapReduce was introduced by the information retrieval community, and has quickly demonstrated its usefulness, scalability and applicability. Its adoption of a data-centered approach yields higher throughput for data-intensive applications.

In this thesis, we present our investigation and improvement of MapReduce. We identify the inefficiencies of various aspects of MapReduce, such as data locality, task granularity, resource utilization, and fault tolerance, and propose algorithms to mitigate the performance issues. Extensive evaluation is presented to demonstrate the effectiveness of our proposed algorithms and approaches. In addition, I, along with Yuan Luo and Yiming Sun, observed the inability of MapReduce to utilize cross-domain grid resources, and we propose a MapReduce extension called Hierarchical MapReduce (HMR). Furthermore, to speed up the execution of our bioinformatics data visualization pipelines, which contain both single-pass and iterative MapReduce jobs, a workflow management system called Hybrid MapReduce (HyMR), built by Yang Ruan and me upon Hadoop and Twister, is presented. The thesis also includes a detailed performance evaluation of Hadoop and some storage systems, and provides useful insights to both framework and application developers.

Contents

Acknowledgements v

Abstract vii

1 Introduction and Background 3
  1.1 Introduction 3
  1.2 Data Parallel Systems 5
    1.2.1 Google File System (GFS) 5
    1.2.2 MapReduce 6
  1.3 Motivation 8
  1.4 Problem Definition 10
  1.5 Contributions 11
  1.6 Dissertation Outline 12

2 Parallel Programming Models and Distributed Computing Runtimes 14
  2.1 Programming Models 15
    2.1.1 MultiThreading 15
    2.1.2 Open Multi-Processing (OpenMP) 17
    2.1.3 Message Passing Interface (MPI) 19
    2.1.4 Partitioning Global Address Space (PGAS) 22
    2.1.5 MapReduce 23
    2.1.6 Iterative MapReduce 23
  2.2 Batch Queuing Systems 23
  2.3 Data Parallel Runtimes 24
    2.3.1 Hadoop 25
    2.3.2 Iterative MapReduce Runtimes 25
    2.3.3 Cosmos/Dryad 25
    2.3.4 Sector and Sphere 26
  2.4 Cycle Scavenging and Volunteer Computing 28
    2.4.1 Condor 28
    2.4.2 Berkeley Open Infrastructure for Network Computing (BOINC) 29
  2.5 Parallel Programming Languages 29
    2.5.1 Sawzall 29
    2.5.2 Hive 30
    2.5.3 Pig Latin 31
    2.5.4 X10 32
  2.6 Workflow 32
    2.6.1 Grid workflow 32
    2.6.2 MapReduce workflow 33

3 Performance Evaluation of Data Parallel Systems 34
  3.1 Swift 34
  3.2 Testbeds 36
  3.3 Evaluation of Hadoop 36
    3.3.1 Job run time w.r.t. the num. of nodes 39
    3.3.2 Job run time w.r.t. the number of map slots per node 41
    3.3.3 Run time of map tasks 43
  3.4 Evaluation of Storage Systems 44
    3.4.1 Local IO subsystem 45
    3.4.2 Network File System (NFS) 46
    3.4.3 Hadoop Distributed File System (HDFS) 47
    3.4.4 OpenStack Swift 47
    3.4.5 Small file tests 49
  3.5 Summary 50

4 Data Locality Aware Scheduling 51
  4.1 Traditional Approaches to Build Runtimes 52
  4.2 Data Locality Aware Approach 54
  4.3 Analysis of Data Locality in MapReduce 56
    4.3.1 Data Locality in MapReduce 56
    4.3.2 Goodness of Data Locality 57
  4.4 A Scheduling Algorithm with Optimal Data Locality 60
    4.4.1 Non-optimality of dl-sched 60
    4.4.2 lsap-sched: An Optimal Scheduler for Homogeneous Network 61
    4.4.3 lsap-sched for Heterogeneous Network 65
  4.5 Experiments 67
    4.5.1 Impact of Data Locality in Single-Cluster Environments 67
    4.5.2 Impact of Data Locality in Cross-Cluster Environments 68
    4.5.3 Impact of Various Factors on Data Locality 69
    4.5.4 Overhead of LSAP Solver 73
    4.5.5 Improvement of Data Locality by lsap-sched 74
    4.5.6 Reduction of Data Locality Cost 75
  4.6 Integration of Fairness 79
    4.6.1 Fairness and Data Locality Aware Scheduler: lsap-fair-sched 79
    4.6.2 Evaluation of lsap-fair-sched 84
  4.7 Summary 85

5 Automatic Adjustment of Task Granularity 86
  5.1 Analysis of Task Granularity in MapReduce 86
  5.2 Dynamic Granularity Adjustment 88
    5.2.1 Split Tasks Waiting in Queue 90
    5.2.2 Split Running Tasks 91
    5.2.3 Summary 92
  5.3 Single-Job Task Scheduling 92
    5.3.1 Task Splitting without Prior Knowledge 93
    5.3.2 Task Splitting with Prior Knowledge 95
    5.3.3 Fault Tolerance 99
  5.4 Multi-Job Task Scheduling 100
    5.4.1 Optimality of Greedy Task Splitting 100
    5.4.2 Multi-Job Scheduling 102
  5.5 Experiments 103
    5.5.1 Single-job tests 104
    5.5.2 Multi-job tests 106
  5.6 Summary 108

6 Resource Utilization and Speculative Execution 109
  6.1 Resource Stealing (RS) 110
    6.1.1 Allocation Policies of Residual Resources 112
  6.2 The BASE Scheduler 114
    6.2.1 Implementation 117
  6.3 Experiments 118
    6.3.1 Results for Map-Only Compute-Intensive Workload 119
    6.3.2 Results for Compute-Intensive Workload with Stragglers 122
    6.3.3 Results for Reduce-Mostly Jobs 123
    6.3.4 Results for Other Workload 125
  6.4 Summary 127

7 Hierarchical MapReduce and Hybrid MapReduce 128
  7.1 Hierarchical MapReduce (HMR) 128
    7.1.1 Programming Model 129
    7.1.2 System Design 129
    7.1.3 Data Partition and Task Scheduling 132
    7.1.4 AutoDock 133
    7.1.5 Experiments 134
    7.1.6 Summary 140
  7.2 Hybrid MapReduce (HyMR) 140
    7.2.1 System Design 141
    7.2.2 Workflows 143
    7.2.3 Summary 144

8 Related Work 145
  8.1 File Systems and Data Staging 145
  8.2 Scheduling 146

9 Conclusions and Future Work 152
  9.1 Summary of Work 152
  9.2 Conclusions 152
    9.2.1 Performance Evaluation 153
    9.2.2 Data Locality 154
    9.2.3 Task Granularity 155
    9.2.4 Resource Utilization 156
    9.2.5 Enhancements to MapReduce Model 157
  9.3 Contributions 158
  9.4 Future Work 160
  9.5 A Better Hadoop 162

Bibliography 164

List of Tables

1.1 map and reduce operations 7
3.1 FutureGrid software stack 36
3.2 FutureGrid clusters 37
3.3 Specification of Bravo 37
3.4 Job run time w/ S and N varied (fixed 400GB input) 39
3.5 Job turnaround time w/ S and D varied 40
3.6 IO performance of local disks 46
3.7 IO performance of NFS 46
3.8 IO performance of HDFS 47
3.9 IO performance of Swift (single cluster) 48
3.10 IO performance of Swift (cross-cluster) 48
3.11 IO performance for small files 49
4.1 Comparison of HDFS and Lustre 55
4.2 Symbol definition 58
4.3 Expand cost matrix to make it square 63
4.4 Expand cost matrix to make it square 66
4.5 System Configuration 70
4.6 Examples of How Tradeoffs are Made 80
5.1 Trade-off of task granularity 86
5.2 Configuration of test environment 103
6.1 Allocation policies of residual resources 112
6.2 Probability of achieving data locality 114
6.3 Energy consumption of regular and speculative tasks (Data Accessing) 114
7.1 Symbols used in data partition formulation 133
7.2 The fields and description of AutoDock MapReduce input 134
7.3 The specification of cluster nodes 135
7.4 MapReduce execution time on different clusters with varied input size 137

List of Figures

1.1 Architecture of traditional High Performance Computing (HPC) systems 4
1.2 Architecture of data parallel systems 6
1.3 HDFS Architecture 7
1.4 The breakdown of the execution of MapReduce application wordcount 9
2.1 Taxonomy 16
2.2 OpenMP stack 18
2.3 OpenMP C program example 19
2.4 Dryad ecosystem 26
2.5 Sector system architecture 27
2.6 Sawzall example 30
2.7 HIVE system architecture 31
3.1 Job run time with the number of nodes varied 40
3.2 Job run time for different configurations 42
3.3 Average run time of map tasks 44
3.4 Average run time slowdown of map tasks 45
4.1 Typical network topology 52
4.2 The architecture of Lustre 54
4.3 An example of MapReduce scheduling 56
4.4 The deduction sketch for the relationship between system factors and data locality 59
4.5 An example showing Hadoop scheduling is not optimal 61
4.6 Algorithm skeleton of lsap-sched 64
4.7 Algorithm skeleton of heterogeneity aware lsap-sched 67
4.8 Single-cluster performance 68
4.9 Cross-cluster performance 69
4.10 Impact of various factors on the goodness of data locality and Comparison of real trace and simulation result 73
4.11 Comparison of data locality (varied the num. of nodes) 75
4.12 Comparison of data locality (varied rep. factor and the num. of tasks) 75
4.13 Comparison of data locality cost (with equal net. bw.) 76
4.14 Comparison of DLC with 50% idle slots 77
4.15 Comparison of DLC with 20% idle slots 78
4.16 Comparison of DLC w/ rep. factor varied 78
4.17 Tradeoffs between fairness and data locality 85
5.1 Examples of task splitting and task consolidation. Arrows are scheduling decisions. Each node has one map slot and block Bi is the input of task Ti 91
5.2 Algorithm skeleton of Aggressive Split 94
5.3 Different ways to split tasks (Processing time is the same). Dashed boxes represent newly spawned tasks 96
5.4 Algorithm skeleton of Aggressive Split with Prior Knowledge 97
5.5 Different ways to arrange the execution of a job 102
5.6 Multiple scheduled jobs (Each uses all resources for execution) 102
5.7 Single-Job test results (Gaussian distribution is used) 105
5.8 Single-Job test results (Real profiled distribution is used) 106
5.9 Multi-Job test results (task execution time is the same for a job) 107
5.10 Multi-Job test results (job execution time is the same) 107
6.1 Comparison of regular and speculative tasks 116
6.2 An example of native Hadoop scheduling and resource stealing 118
6.3 Results for ts-map (‡NBST: Non-Beneficial Speculative Tasks) 119
6.4 Results for PSA-SWG 122
6.5 Results for ts-map with straggler nodes. There are two straggler nodes for (a) and (b), and four straggler nodes for (c) and (d) 123
6.6 Job run time for reduce-mostly applications 125
6.7 Job run time for network-intensive application mr-wc 126
6.8 Job run time for IO-intensive workload 127
7.1 Hierarchical MapReduce (HMR) architecture 130
7.2 Hierarchical MapReduce (HMR) architecture with multiple global controllers 132
7.3 Number of running map tasks for an Autodock MapReduce instance 136
7.4 Number of map tasks executed on each node 136
7.5 Execution time of tasks in different clusters 137
7.6 Execution time of tasks in different clusters (even data distribution) 139
7.7 The number of tasks and execution time in different clusters 140
7.8 Hybrid MapReduce (HyMR) architecture 142
7.9 HyMR workflows 143

List of Acronyms

OpenMP Open Multi-Processing
MPI Message Passing Interface
PVM Parallel Virtual Machine
PGAS Partitioning Global Address Space
HPC High Performance Computing
HTC High Throughput Computing
MTC Many Task Computing
HyMR Hybrid MapReduce
HMR Hierarchical MapReduce
DAG Directed Acyclic Graph
DAGMan Directed Acyclic Graph Manager
BOINC Berkeley Open Infrastructure for Network Computing
GFS Google File System
HDFS Hadoop Distributed File System
HiveQL Hive Query Language
SQL Structured Query Language
UDF User Defined Function
NUCC Non Uniform Cluster Computing
GALS Globally Asynchronous Locally Synchronous
UDT UDP-based Data Transfer
SPE Sphere Processing Engine
SJF Shortest Job First
BASE Benefit Aware Speculative Execution
LSAP Linear Sum Assignment Problem
BoT Bag-of-Tasks
BoDT Bag-of-Divisible-Tasks
NFS Network File System
UPC Unified Parallel C
SPMD Single Program Multiple Data
GPFS General Parallel File System
DPSS Distributed Parallel Storage Server
SRB Storage Resource Broker
FTP File Transfer Protocol
PVFS Parallel Virtual File System
PBS Portable Batch System
VM Virtual Machine
APU Atomic Processing Unit
LATE Longest Approximate Time to End
POSIX Portable Operating System Interface
HTTP Hypertext Transfer Protocol
IU Indiana University
XML Extensible Markup Language
MDS Multidimensional Scaling
MI-MDS Majorizing Interpolation Multidimensional Scaling
SWG Smith Waterman Gotoh
PSA Pairwise Sequence Alignment
EM Expectation Maximization

1 Introduction and Background

1.1 Introduction

After the eras of experimental, theoretical and computational science, we now enter the new era of data-intensive science. Tony Hey et al.'s book The Fourth Paradigm depicts the characteristics of real data-intensive applications from various fields including ocean science, ecological science and biology. The key characteristic is that scientists are overwhelmed with large data sets collected from many sources such as instruments, simulation, and sensor networks. Modern instruments such as the Large Hadron Collider (LHC), next-generation gene sequencers and survey telescopes are collecting data at an unprecedented rate. For example, High Energy Physics (HEP) experiments produce tens of Petabytes of data annually, and the Large Synoptic Survey Telescope produces data at a rate of 20 Terabytes per day.

The goal of e-Science is to transform raw data into knowledge, and advanced tools are necessary to accelerate the time from data to insight. Processing the ever-growing data requires large amounts of computing power and storage space, which far exceed the processing capability of individual computers. Supercomputers, clusters, clouds, and data centers have been built to facilitate scientific discovery. These systems adopt different architectures aligned with their specific design goals.

Supercomputers are mainly used by scientists to run a small number of critical science applications efficiently. For example, the Earth Simulator, a highly parallel vector supercomputer developed by Japan, runs global climate models to evaluate the effects of global warming. Supercomputers have low-latency interconnects to accelerate the communication among computers. In grid computing, clusters from multiple domains are federated to enlarge the available resource set. These systems take a compute-centered approach where compute nodes and storage systems are separated. As a result, parallel applications need to stage in data from storage nodes to compute nodes and store final results back to storage nodes (shown in Figure 1.1). In other words, the scheduling mechanism brings data to compute. Data grids were proposed to address the issue of how to manage large volumes of data across multiple sites. They adopt a hierarchical architecture in which participant sites are usually under the control of different domains and each site manages a portion of the whole data set. Still, data and compute are not closely integrated.

Figure 1.1: Architecture of traditional HPC systems

Data-intensive applications operate on large volumes of data. The progress of network and storage technologies cannot keep pace with the rate of data growth, which is already a critical problem. The compute-centered approach is not efficient because the cost to bring data to compute becomes significant and data transfer time may dominate the total execution time. For example, in grids the interconnection links between storage nodes and compute nodes are oversubscribed at a high ratio. That implies that the aggregate bandwidth of all nodes is far higher than the capacity of the interconnection links connecting storage and compute nodes, so only a few compute nodes/cores can concurrently fetch data from storage nodes at full speed. As a result, the interconnection links become the bottleneck for data-intensive applications. This fact renders it inefficient to move input data around.

For some data-intensive applications, the ratio of CPU instructions to I/O instructions is reduced, so data I/O plays a more important role than before. To support scalable data-intensive computing, data parallel systems have been developed, among which MapReduce [54] and Dryad [73] are the most prominent technologies. These systems take a data-centered approach where data affinity is explicitly supported (e.g. local data accesses are favored over cross-rack data fetches). These systems can run some information retrieval and web data processing applications on thousands of commodity machines, and achieve high throughput. Dryad was developed by Microsoft, but in late 2011 Microsoft dropped Dryad [9] and moved on to MapReduce/Hadoop.

1.2 Data Parallel Systems

Data parallel systems, which are natively designed for data-intensive applications, have been widely adopted in industry. The architecture is shown in Figure 1.2. Typical systems are GFS [62]/MapReduce [54], Cosmos/Dryad [73], and Sector/Sphere [65]. In these systems, the same set of nodes is used for both compute and storage, so each node is responsible for both computation and storage. This brings more scheduling flexibility to exploit data locality compared with the traditional architecture. For instance, the scheduler can bring compute to data, bring compute close to data, or bring data to compute. Ideally, most data-staging network traffic should be confined to the same rack/chassis to minimize cross-rack traffic. Traditionally, parallel file systems make data access transparent to user applications by hiding the underlying details of data storage and movement. For data parallel systems, this is insufficient, as the runtime needs to know more information to make data-location-aware scheduling decisions. Thus the location information of data is exposed to distributed runtimes.

1.2.1 Google File System (GFS)

GFS is a distributed file system designed to run on commodity hardware. It is optimized for write-once-read-many accesses of large files. The Hadoop Distributed File System (HDFS) [114] is an open source implementation of GFS. In GFS/HDFS, data availability is improved by maintaining multiple replicas. Files are split into equally sized blocks distributed across nodes. GFS and HDFS use different terms for the partitioned data: the term is “chunk” in GFS and “block” in HDFS. To eliminate possible confusion, I will use “block” consistently in the following text.

Figure 1.2: Architecture of data parallel systems

Fig. 1.3 shows the architecture of HDFS. A central namenode maintains the metadata of the file system.

Its main responsibilities include block management and replication management. Real data are stored on datanodes managed by the namenode. HDFS allows administrators to specify the network topology and thus is rack-aware. Metadata requests from HDFS clients are processed by the namenode, after which clients directly communicate with datanodes to read and write data. The namenode is a single point of failure because all metadata are lost if it fails permanently. To mitigate the problem, a secondary namenode runs simultaneously, which performs periodic checkpoints of the image and edit log of the namenode.
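This client/namenode/datanode interaction can also be seen from a program's perspective through libhdfs, the C library bundled with Hadoop. The following is a minimal read sketch, not a complete application: the file path is hypothetical, and the exact signatures may vary slightly across Hadoop releases.

#include <fcntl.h>
#include <stdio.h>
#include "hdfs.h"   /* C header shipped with Hadoop's libhdfs */

int main(void) {
    /* Connect to the namenode configured for this client ("default" fs). */
    hdfsFS fs = hdfsConnect("default", 0);
    if (fs == NULL) {
        fprintf(stderr, "cannot connect to HDFS\n");
        return 1;
    }

    /* Open a (hypothetical) file for reading; flags mirror POSIX open(). */
    hdfsFile in = hdfsOpenFile(fs, "/user/alice/input.txt", O_RDONLY, 0, 0, 0);
    if (in == NULL) {
        fprintf(stderr, "cannot open file\n");
        hdfsDisconnect(fs);
        return 1;
    }

    char buf[4096];
    tSize n;
    /* The metadata lookup goes to the namenode; the block contents are
     * then fetched directly from the datanodes holding the replicas. */
    while ((n = hdfsRead(fs, in, buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, (size_t)n, stdout);
    }

    hdfsCloseFile(fs, in);
    hdfsDisconnect(fs);
    return 0;
}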

Figure 1.3: HDFS Architecture
* Copied from http://hadoop.apache.org/common/docs/current/hdfs_design.html

1.2.2 MapReduce

MapReduce is a programming model and an associated implementation for processing large data sets.

1.2.2.1 MapReduce model

In MapReduce, input data are organized as key-value pairs. MapReduce supports two primitive operations: map and reduce (shown in Table 1.1), which were inspired by Lisp and other functional languages. Each map operation is applied to a single key-value pair and generates a set of intermediate key-value pairs. Each reduce operation processes the intermediate key-value pairs sharing the same key and generates the final output. Because the intermediate data produced by map operations may be too large to fit in memory, they are supplied to reduce operations via an iterator. It is the users’ responsibility to implement the map and reduce operations. Although this model looks simple, it turns out that many applications can be expressed easily, such as distributed grep, reversing the web-link graph, distributed sorting and inverted indexing.

map operation (k1,v1) → list(k2,v2)

reduce operation (k2,list(v2)) → list(v3)

Table 1.1: map and reduce operations
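To make the signatures in Table 1.1 concrete, the C sketch below expresses the wordcount logic as plain map and reduce functions. The emit_intermediate and emit collectors are hypothetical stand-ins: in a real runtime such as Hadoop, the framework collects the map output, groups it by key and feeds each group to reduce.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in collectors: a real runtime would partition and shuffle these pairs. */
static void emit_intermediate(const char *key, const char *value) {
    printf("map output: (%s, %s)\n", key, value);
}
static void emit(const char *key, const char *value) {
    printf("reduce output: (%s, %s)\n", key, value);
}

/* map: (k1, v1) -> list(k2, v2). k1 (e.g. a byte offset) is ignored; v1 is a
 * chunk of text, and one ("word", "1") pair is emitted per word. */
static void map(const char *k1, const char *v1) {
    (void)k1;
    char copy[256];
    strncpy(copy, v1, sizeof(copy) - 1);
    copy[sizeof(copy) - 1] = '\0';
    for (char *w = strtok(copy, " \t\n"); w != NULL; w = strtok(NULL, " \t\n")) {
        emit_intermediate(w, "1");
    }
}

/* reduce: (k2, list(v2)) -> list(v3). All counts for one word are summed. */
static void reduce(const char *k2, const char **values, int num_values) {
    long sum = 0;
    char out[32];
    for (int i = 0; i < num_values; i++) {
        sum += atol(values[i]);
    }
    snprintf(out, sizeof(out), "%ld", sum);
    emit(k2, out);
}

int main(void) {
    const char *ones[] = {"1", "1", "1"};
    map("0", "the quick brown fox jumps over the lazy dog the the");
    reduce("the", ones, 3);   /* the shuffle phase would group these values */
    return 0;
}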

1.2.2.2 MapReduce runtime

The MapReduce runtime allows programs written in the MapReduce style to be automatically parallelized and executed on clusters. The runtime is designed to run on large clusters of commodity hardware. Apache Hadoop is an open source implementation of MapReduce.

Primitive map and reduce operations are organized into schedulable map and reduce tasks. Each MapReduce job comprises some number of map and reduce tasks. By default, each map task processes the data of one block. Each slave node has a configurable number of map and reduce slots, which limit the maximum number of map and reduce tasks that can concurrently run on the node. When a task starts to execute, it occupies one slot; when it completes, the slot is released so that other tasks can use it. Conceptually, each slot can have at most one task assigned to it at any time. There is a master node where the Job Tracker runs. The Job Tracker manages all slave nodes and runs a scheduler that assigns tasks to idle slots. When a slave node sends a heartbeat message reporting that it has available map slots, the master node first tries to find a map task whose input data are stored on that slave node. This is made possible because data location information is exposed by the underlying storage system GFS/HDFS. If such a task can be found, it is scheduled to the node and node-level data locality is achieved. Otherwise, Hadoop tries to find a task for rack-level data locality, where the input data and the execution are on the same rack. If that also fails, Hadoop randomly picks a task to dispatch to the node. The default Hadoop scheduling policy is thus optimized for data locality.

To sum up, MapReduce integrates data affinity to facilitate data locality aware scheduling. It considers different levels of data locality, such as node level, rack level and data center level. Node level data locality means a task and its input data are co-located on the same node. Rack level data locality means a task and its input data are on different nodes that are located on the same rack.
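As a concrete illustration of this heuristic, the C sketch below paraphrases the decision made for each heartbeat that reports a free map slot. The task_t and node_t structures, the fixed three-replica array, and the helper function are simplifications invented for exposition; they are not Hadoop's actual (Java) data structures.

#include <stddef.h>
#include <string.h>

/* Simplified view of a pending map task and of the node that sent a heartbeat. */
typedef struct {
    const char *input_hosts[3];  /* nodes holding replicas of the input block */
    const char *input_rack;      /* rack where those replicas live */
    int scheduled;               /* already assigned to some slot? */
} task_t;

typedef struct {
    const char *host;
    const char *rack;
} node_t;

/* Returns the index of the chosen task, or -1 if nothing is pending.
 * Preference order mirrors the default policy described above:
 * node-local first, then rack-local, then any remaining task. */
int choose_map_task(const task_t *tasks, int num_tasks, const node_t *node) {
    int rack_local = -1, any = -1;
    for (int i = 0; i < num_tasks; i++) {
        if (tasks[i].scheduled) continue;
        for (int r = 0; r < 3; r++) {
            if (tasks[i].input_hosts[r] &&
                strcmp(tasks[i].input_hosts[r], node->host) == 0) {
                return i;                      /* node-level data locality */
            }
        }
        if (rack_local < 0 && tasks[i].input_rack &&
            strcmp(tasks[i].input_rack, node->rack) == 0) {
            rack_local = i;                    /* remember a rack-local candidate */
        }
        if (any < 0) any = i;                  /* remember any runnable task */
    }
    return rack_local >= 0 ? rack_local : any; /* fall back: rack-local, then any */
}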

Fig. 1.4 shows the execution breakdown of the MapReduce application wordcount, which counts the number of occurrences of each word. The input text is split into three blocks. The implementations of the map and reduce operations are shown near the bottom left corner. Each block is fed into the map operation, which tokenizes the text and emits the value 1 for each encountered word. After all map tasks complete, the intermediate data are grouped by key and shuffled to the corresponding reduce tasks. A partitioner is used to compute the mapping from intermediate keys to reduce tasks. After a reduce task collects all its input data, the reduce operation is applied, which adds up the intermediate values and produces the final word count.

Figure 1.4: The breakdown of the execution of MapReduce application wordcount

1.3 Motivation

Among the proposed data parallel systems [54][65][73][57], MapReduce has attracted a great deal of attention because of its ease of use, scalability, fault tolerance, and so on. Since the initial publication of the MapReduce paper back in 2004, it has been cited by thousands of papers. MapReduce has been used to run large-scale data processing applications [102, 49, 54]. However, compared with traditional systems (e.g. MPI), these systems are still relatively new and have not undergone substantial research. Both theoretical and practical investigation is needed to fully reveal their efficiency and limitations. For example, the default Hadoop scheduler favors data locality, but it is not clear yet which factors impact data locality and to what extent, or whether good data locality can be sustained if a system runs many jobs with diverse workloads from multiple users.

Tuning runtime parameters to maximize performance is difficult and requires many trials before satisfactory settings are found. For example, Hadoop has around 200 parameters that can be configured by users, and it is non-trivial to figure out optimal settings. Much of the prior research on grid computing does not help here because it makes different assumptions about the system architecture. This thesis aims to investigate in depth various important aspects of MapReduce that have not been carefully examined, including data locality, task granularity, resource utilization and speculative execution, and to propose enhancements to the MapReduce model to improve its applicability and efficiency.

1.4 Problem Definition

Data parallel systems have been used to tackle large-scale data processing and have shown promising results for data-intensive applications. However, we observed the following issues in the MapReduce model and runtimes, which we strive to solve in this thesis.

• The performance of underlying storage systems has a direct impact on the execution of upper-level parallel applications. A detailed performance evaluation of Hadoop and contemporary storage systems has not been conducted. We hope to reveal their performance advantages/disadvantages and show how efficient they are.

• Hadoop assumes that the work done by each task in a parallel job is similar, which simplifies the implementation. However, this assumption does not always hold, and severe load unbalancing can occur if tasks are intrinsically heterogeneous. We intend to figure out ways to adjust tasks so that they are well balanced and job turnaround time is reduced.

• From our preliminary research, we have identified that in Hadoop resource utilization is constrained by the relationship between the number of task slots and the number of tasks. By incorporating prior research from the HPC community, we wish to address the challenge of maximizing the efficiency of resource usage without interfering with native Hadoop scheduling.

• As we have shown, in data-intensive computing it is not feasible to move data around, and data locality becomes critical. The importance of data locality has drawn attention not only in the MapReduce community but also in traditional grid computing communities. Current implementations use data-locality-favoring heuristics to schedule tasks. However, data locality itself has not been carefully analyzed theoretically. We intend to quantify the importance of data locality and figure out the (sub-)optimality of state-of-the-art MapReduce scheduling algorithms.

• During our use of MapReduce runtimes, we found that they could not be deployed on multiple cross-domain clusters and that users need to deploy and manage separate MapReduce deployments. The coordination of multiple MapReduce systems to run a single parallel job is cumbersome, time-consuming and error-prone. We want to address the challenge of simplifying cross-domain job execution. Besides, for a complex problem which consists of both MapReduce algorithms and iterative MapReduce algorithms, the user needs to manually interact with different runtimes (e.g. Hadoop, Twister) to run individual jobs. A framework that seamlessly integrates and manages runtimes of different types is desired.

1.5 Contributions

We summarize the contributions of our research:

• A detailed performance analysis of widely used runtimes and storage systems is presented to reveal both the benefits and the overhead they bring. Surprisingly, the performance of some well-known storage systems degrades significantly compared with native local I/O subsystems.

• For MapReduce, a mathematical model is formally built with reasonable data placement assumptions. Under this formulation, we deduce the relationship between influential system factors and data locality, so that users can predict the expected data locality.

• The sub-optimality of the default Hadoop scheduler is revealed, and an optimal algorithm based on the Linear Sum Assignment Problem (LSAP) is proposed.

• Based on the existing Bag-of-Tasks (BoT) model, a new task model, Bag-of-Divisible-Tasks (BoDT), is proposed. Upon BoDT, new mechanisms are proposed that improve load balancing by adjusting task granularity dynamically and adaptively. Given the BoDT model, we demonstrate that the Shortest Job First (SJF) strategy achieves optimal average job turnaround time under the assumption that work is arbitrarily divisible.

• We propose Resource Stealing to maximize resource utilization, and Benefit Aware Speculative Execution (BASE) to eliminate the launches of non-beneficial speculative tasks and thus improve the efficiency of resource usage.

• To enable cross-domain MapReduce execution, Hierarchical MapReduce (HMR) is presented, which circumvents the administrative boundaries of separate grid clusters. To use both MapReduce runtimes and iterative MapReduce runtimes in a single pipeline of jobs, Hybrid MapReduce (HyMR) is proposed, which combines the best of both worlds.

1.6 Dissertation Outline

We present the state-of-the-art parallel programming models, languages and runtimes in chapter 2. What we have surveyed ranges from traditional HPC frameworks to recent data parallel frameworks, from low-level primitive programming support to high-level sophisticated abstraction, from shared-memory architecture to distributed-memory architecture. We focus on the level of abstraction, the target parallel architecture, large-data computation, and fault tolerance. By comparison, we demonstrate the advantage of data parallel systems over HPC systems for data-intensive applications, for which “move compute to data” is more appropriate than “move data to compute”.

In chapter 3, we evaluate the performance of Hadoop and some storage systems by conducting extensive experiments. We show how important system configurations such as data size, cluster size and per-node concurrency impact the performance and parallel efficiency. This performance evaluation can give us some insights about the current state of data parallel systems. In addition, the performance of the local file system, NFS, HDFS and Swift is evaluated and compared.

In the following chapters, we use MapReduce/Hadoop as the target research platform. The integration of data locality into task scheduling is a significant advantage of MapReduce. In chapter 4, we investigate the impact of various system factors on data locality. Besides, the non-optimality of the default Hadoop scheduling is illustrated and an optimal scheduling algorithm is proposed. We conclude that our algorithm can improve performance significantly. In addition, we evaluate the importance of data locality for different cluster environments (single-cluster, cross-cluster, and an HPC-style setup with heterogeneous network) by measuring how data locality impacts job execution time.

Hadoop is not efficient at running jobs with heterogeneous tasks. In chapter 5, the drawbacks of fixed task granularity in Hadoop are analyzed. Task consolidation and splitting are proposed, which dynamically balance different tasks for the scenarios where prior knowledge is either available or not. In addition, for multi-job scenarios, we prove that the Shortest Job First (SJF) strategy yields optimal average job turnaround time.

The hard partition of physical processing capability into virtual map and reduce slots may limit the utilization of resources when not all slots are occupied. Besides, the mechanism to launch speculative tasks in Hadoop is not efficient, which may result in the waste of resources. We mitigate these problems in chapter 6, where resource stealing and Benefit Aware Speculative Execution (BASE) are proposed.

In chapter 7, Hierarchical MapReduce (HMR) and Hybrid MapReduce (HyMR) are proposed, which expand the environments where MapReduce can be used and enable the simultaneous use of regular and iterative MapReduce runtimes, respectively.

Related work is presented in chapter 8, following which conclusions are drawn and future work is outlined in chapter 9.

2 Parallel Programming Models and Distributed Computing Runtimes

To fully utilize the parallel processing power of a cluster of machines, two critical steps are required. Firstly, domain-specific problems need to be parallelized and converted to the programming model programmers choose. The chosen programming model should match the structure of the original problem naturally. Secondly, a distributed computing runtime is critical to manage highly distributed resources and run parallel programs efficiently.

Parallel computing models and languages relieve programmers from the details of synchronization, multithreading, fault tolerance, etc., so programmers can focus on how to express their domain-specific problems in the chosen language or model. Each language or model has different tradeoffs among expressiveness, usability and efficiency. Many models have been proposed for shared-memory (e.g. OpenMP, multithreading), distributed shared-memory (e.g. PGAS) and distributed-memory platforms (e.g. MapReduce). Which model or language to choose depends on application characteristics (e.g. compute-bound vs. IO-bound, MapReduce vs. Directed Acyclic Graph (DAG)), system architecture (e.g. commodity clusters vs. supercomputers with low-latency networks), and performance goals (e.g. latency vs. throughput).

The design of distributed computing runtimes is aligned to the characteristics of the target hardware and applications. High Performance Computing (HPC), High Throughput Computing (HTC), and Many Task Computing (MTC) are three well-researched categories. HPC emphasizes the use of large amounts of computing power for short periods of time (e.g. hours and days) to complete a complex computation job. Tasks in HPC jobs are tightly coupled. HPC is mainly designed for supercomputers and clusters with low-latency interconnects. Top500 [17] uses LINPACK as the benchmark to measure the performance of the fastest machines in the world, expressed in floating-point operations per second. HTC emphasizes executing as many tasks/jobs as possible over much longer time spans (e.g. months), and thus it prefers overall system throughput to the peak performance of an individual task. So HTC is measured by the number of tasks/jobs processed per month or per year. HTC can span multiple clusters across different administrative domains by using various grid computing technologies. MTC lies between HPC and HTC; it emphasizes utilizing as many resources as possible to accomplish a large number of computational tasks over short periods of time. Usually tasks in MTC are less coupled than those in HPC. In addition, MTC takes into consideration data-intensive computing, where the volume of processed data may be extremely large.

Below we survey the main programming languages, models and execution runtimes, the taxonomy of which is shown in Fig. 2.1.

2.1 Programming Models

For most non-trivial applications, the sequential computation model quickly overwhelms the processing capability of a single thread/process when the size of the input data scales up. Parallelism of different types (e.g. data parallelism, task parallelism) can be exploited to accelerate task execution significantly. Several programming models have been proposed to unify the programming constructs or hide the complexity of managing underlying distributed resources. Overall, they ease the development of parallel programs which run on multiple processor cores or multiple machines.

Figure 2.1: Taxonomy

2.1.1 MultiThreading

A process is a runnable unit which owns resources allocated by the operating system. By default, processes do not share address spaces or file handles. A thread is a lightweight process. Usually a process can run multiple threads sharing the same address space. Each thread owns a section of the process address space where thread-local stacks and variables reside. Inter-thread communication within the same process incurs much lower overhead than inter-process communication. The creation and termination of threads is faster than that of processes. Depending on implementation, threads are categorized into kernel threads and user threads. Kernel threads are managed by the kernel scheduler and thus can be scheduled independently. User threads are managed and scheduled in user space and the kernel is not aware of them. It is faster to create, manage and swap user threads than kernel threads. But multithreading is limited for user threads. When a user thread is blocked, the whole process is blocked as well and thus all other threads within the process get blocked even if some of them are ready to run.

For parallel programming, programmers need to explicitly control the synchronization among participating threads to protect the critical memory regions accessed concurrently. The commonly used synchronization primitives include locks, mutexes, semaphores, condition variables, etc.
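As a small illustration of these primitives, the POSIX threads program below lets two threads increment a shared counter inside a mutex-protected critical region; without the lock, concurrent updates could be lost. It is a minimal sketch rather than a pattern recommended for production code.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter; the mutex serializes access
 * to the critical region so that no increment is lost. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;                  /* critical region */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);         /* wait for both workers to finish */
    pthread_join(t2, NULL);
    printf("final counter = %ld\n", counter);  /* 2000000 with the mutex */
    return 0;
}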

Despite the conceptual simplicity, threading has several drawbacks. The isolation of thread execution is less rigid. The misbehavior of a thread can crash the whole process and thus result in the termination of the execution of all other threads belonging to the same process. In contrast, process-level isolation is more secure. In addition, multithreading only provides primitive building blocks which theoretically can be used to construct complicated applications. In practice, the complexity of writing and verifying thread-based parallel programs overwhelms the programmers even for modest-sized problems. Following is an excerpt from Lee’s paper.

“Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism. Threads, as a model of computation, are wildly nondeterministic, and the job of the programmer becomes one of pruning that nondeterminism.” – The Problem with Threads, Edward A. Lee, UC Berkeley, 2006 [84]

2.1.2 Open Multi-Processing (OpenMP)

For shared memory systems, OpenMP provides an API that supports shared-memory parallel programming on multiple architectures. OpenMP is built upon multithreading, and its goal is to simplify writing multi-threaded programs. It adopts the fork-join paradigm for parallelization. The master thread controls the execution logic and spawns multiple slave threads to collaboratively complete the work when needed. After the work is done, the slave threads join back into the master thread, which continues to run until the end of the program. Threads communicate by sharing variables. Sometimes explicit synchronization is still needed to protect against data conflicts and race conditions.

OpenMP supports C, C++ and Fortran. For each language, special compiler directives are defined to allow users to explicitly mark the sections of code that can be run in parallel. Additional routines are standardized from which each thread can get context information such as its thread id.

Fig. 2.2 shows the solution stack of OpenMP. At the bottom is the hardware, which has multiple processors and maintains a shared address space. On top of it is the operating system with support for shared memory and multithreading. The OpenMP runtime library utilizes the functionality supported by the OS and provides runtime support for upper-level applications. Multiprocessing directives, the OpenMP library and environment variables, which are the main building blocks used to write parallel programs, are exposed to user applications.

Figure 2.2: OpenMP stack

* Copied from [5]

Fig. 2.3 shows a simple OpenMP C example. Two arrays are declared and initialized in lines 5 and 6. The number of threads is set to 4 in line 8, where omp_set_num_threads is a routine provided by the OpenMP library. The directive in line 10 tells the compiler to parallelize the subsequent for loop, which adds up arrays a and b in an element-wise manner. The master thread spawns 4 slave threads, each of which adds one pair of elements. After all slave threads complete, they join into the master thread. Then we calculate the sum of the elements in array a. We use the “reduction” directive to “reduce” array a to a single value.

1  #include <omp.h>
2  #include <stdio.h>
3
4  int main() {
5      double a[4] = {0, 1, 2, 3};
6      double b[4] = {10, 11, 12, 13};
7      int i;
8      omp_set_num_threads(4); // set the number of threads to 4
9
10     #pragma omp parallel for // parallelize for loop
11     for(i = 0; i < 4; ++i) { // add two arrays
12         a[i] += b[i];
13     }
14
15     double sum = 0.0;
16     #pragma omp parallel for reduction (+:sum) // parallelize and reduce
17     for(i = 0; i < 4; ++i) {
18         sum += a[i]; // the array elements are "reduced" to sum
19     }
20
21     printf("%f", sum);
22 }

Figure 2.3: OpenMP C program example

2.1.3 Message Passing Interface (MPI)

Multithreading and OpenMP are designed mainly for shared memory platforms. MPI is a portable, efficient, and flexible standard specifying the interfaces that can be used by message-passing programs on distributed memory platforms. MPI itself is not an implementation, but a vendor-independent specification of what functionality a standard-compliant library and runtime should provide. MPI interfaces have been defined for the languages C/C++ and Fortran. There are a variety of implementations available in the public domain (e.g. OpenMPI [13], MPICH [10, 11]). Usually a shared file system (e.g. the General Parallel File System (GPFS)) is mounted on all compute nodes to facilitate data sharing.

Programmers identify the parallelism and use MPI constructs to write parallel programs. At runtime, each MPI program may concurrently run multiple processes. Each communication may involve all processes or a subset of processes. MPI defines communicators and groups, which define the communication context and are used to specify which processes may communicate with each other. The processes within a process group are ordered, and each process is identified by its rank in the group, assigned automatically by the MPI system during initialization. These identifiers are used by application developers to specify the source and destination of messages. The pre-defined default communicator is MPI_COMM_WORLD, which includes all processes. MPI-1 supports both point-to-point and collective communication. MPI guarantees that messages do not overtake each other. Fairness of communication handling is not guaranteed in MPI, so it is the users’ responsibility to prevent starvation.

Point-to-Point communication routines: The basic supported operations are send and receive. Different types of routines are provided, including synchronous send, blocking send/receive, non-blocking send/receive, buffered send, combined send/receive and “ready” send. Blocking send calls do not return until the message data have been safely stored so that the sender is free to modify the send buffer. However, the message may not have been sent out yet. The usual way to implement this is to copy the data to a temporary system buffer, which incurs the additional overhead of a memory-to-memory copy. Alternatively, MPI implementations may choose not to buffer messages for performance reasons. In this case, a send call does not return until the data have been moved to the matching receiver. In other words, the sender and receiver may or may not be loosely coupled, depending on the implementation. Synchronous send calls do not return until a matching receiver is found and starts to receive the message. Blocking receive calls do not complete until the message is received. A communication buffer should not be accessed or modified until the corresponding communication completes. To maximize performance, MPI provides nonblocking communication routines that can be used to overlap communication and computation as much as possible. A nonblocking send call initiates the operation and returns before the message is copied out of the send buffer. The program can continue to run while the message is copied out of the send buffer in the background by the MPI runtime. A nonblocking receive call initiates the operation and returns before a message is received and stored into the receive buffer.
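The following minimal program illustrates blocking point-to-point communication: rank 0 sends one integer to rank 1 with MPI_Send, and rank 1 receives it with MPI_Recv. It is only a sketch of the pattern discussed above and assumes the job is launched with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once the send buffer can be safely reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: does not complete until the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}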

Collective communication routines: Only two processes can be involved in point-to-point communications. Collective communication mechanisms allow more than two processes to communicate. The collective communication mechanisms supported by MPI include barrier synchronization, broadcast, gather, scatter, gather-to-all, reduction, reduce-scatter, scan, etc. When reaching a barrier synchronization point, each process blocks until all processes in the group reach the same point. Broadcast sends a message from a “root” process to all other processes in the same group. Scatter distributes data from a single source process to each process in the group, and each process receives a portion of the data (i.e. the message is split into n segments and the i-th segment is sent to the i-th process). The gather operation allows a destination process to receive messages from all other processes in the group and store them in rank order. Gather-to-all distributes the concatenation of the data across processes to all processes. Reduce applies a reduction operation across all members of a group. In other words, it operates on a list of data elements stored in different processes and produces a single output stored in the specified process. One example is a sum calculated over data distributed across processes. Reduce-scatter applies an element-wise reduction on a vector and distributes the result across the processes.
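The reduction and broadcast operations can be illustrated in a few lines: each rank contributes a local value, MPI_Reduce sums the contributions at the root, and MPI_Bcast then distributes the total to every process. This is a minimal sketch of the collective pattern, not an application kernel.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;   /* each process contributes one value */
    int total = 0;

    /* Reduce: element-wise sum of all contributions, stored at rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* Broadcast: the "root" (rank 0) sends the result to every process. */
    MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d sees total = %d\n", rank, total);  /* size*(size+1)/2 */
    MPI_Finalize();
    return 0;
}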

Process Topologies: MPI allows programmers to create a virtual topology and map MPI processes to positions in the topology. Two types of topologies are supported: Cartesian (grid) and graph. MPI does not define how to map virtual topologies to the physical structure of the underlying parallel system. Process topologies are usually used for convenience or efficiency. Domain-specific communication patterns can be expressed easily with process topologies, which eases application development. For most parallel systems, the communication cost is not constant for all pairs of nodes (e.g. some nodes are “closer” than others). Process topologies can help the MPI runtime optimize the mapping of processes to physical processors/cores based on the physical characteristics and structure of the system.
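A minimal sketch of the Cartesian topology routines follows: the processes are arranged into a two-dimensional grid, and each process queries its coordinates and its neighbors along one dimension. Boundary processes receive MPI_PROC_NULL as a neighbor because the grid is declared non-periodic.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int dims[2] = {0, 0}, periods[2] = {0, 0}, coords[2];
    MPI_Dims_create(size, 2, dims);            /* factor size into a 2-D grid */

    MPI_Comm grid;
    /* reorder = 1 lets the runtime renumber ranks to match the physical layout. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    MPI_Comm_rank(grid, &rank);                /* rank may change if reordered */
    MPI_Cart_coords(grid, rank, 2, coords);    /* my position in the grid */

    int left, right;
    MPI_Cart_shift(grid, 1, 1, &left, &right); /* neighbors along dimension 1 */

    printf("rank %d at (%d,%d): left=%d right=%d\n",
           rank, coords[0], coords[1], left, right);

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}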

MPI I/O: MPI I/O adds parallel I/O support to MPI. It provides a high-level interface to describe data partitioning and data transfers. It lets users read and write files in synchronous and asynchronous modes. Accesses to a file can be independent or collective. Collective accesses allow for read and write optimization on various levels. MPI data types are used to express the data layout in files and the data partitioning among processes. There has been substantial research on how to improve the performance of parallel IO [119, 83, 120, 47, 127].
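The sketch below illustrates the collective access mode: every process writes one integer to a rank-determined offset of a shared file, so the writes are disjoint. The file name is an arbitrary choice for illustration.

#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File fh;
    /* Collectively open (and create) one shared file. */
    MPI_File_open(MPI_COMM_WORLD, "ranks.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each process writes one int at an offset determined by its rank,
     * so the writes never overlap; _all makes the call collective. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(int);
    MPI_File_write_at_all(fh, offset, &rank, 1, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}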

Summary: Overall, MPI provides powerful communication primitives that can be used by application developers to coordinate the different tasks of a single parallel program. In MPI 1.0 and 2.0, fault tolerance is not supported, and it is the users’ responsibility to recover their programs from faults (e.g. hardware failure, process hang, network partition). There have been some MPI extensions that support process checkpointing and recovery [69, 70], but they have not been standardized. In addition, data affinity/locality is not incorporated, which makes it inappropriate to run massively parallel applications on commodity clusters.

2.1.4 Partitioning Global Address Space (PGAS)

In the shared memory model, data sharing is easy because the same memory region can be made accessible to multiple processes. In contrast, distributed memory platforms involve more hassle when intermediate data need to be shared among multiple processes. Usually, the programmer needs to write code to explicitly move data around for sharing. PGAS adopts a distributed shared memory model [129]. The whole memory address space is partitioned into two portions: a shared area and a private area. The shared area is a global memory address space which is directly accessible by any process and can be used to store globally shared data. On the implementation side, the shared area is partitioned and physically resides on multiple machines. It is the PGAS runtime that creates the “illusion” of shared memory on top of the distributed memory architecture. Therefore, the performance of data accesses may vary depending on where the accessed data are stored physically. Each process has affinity with a portion of the shared area, and PGAS can exploit this reference locality. Shared data objects are placed in memory based on affinity. The private area is local to the corresponding process and not accessible by other processes. So any process can access the global memory, while only the local process may reference its private area.

2.1.4.1 Unified Parallel C (UPC)

UPC is a parallel extension of ANSI C based on the PGAS model [124]. It is the successor of early research projects such as Split-C, AC, and PCP. Currently, it is being maintained by Berkeley Lab. UPC adopts the Single Program Multiple Data (SPMD) model, where the same task is run concurrently with different input to speed up the execution. Static and dynamic memory allocation are supported for both private and shared memory. UPC provides standard routines to support data movement from/to shared memory. UPC does not enforce implicit synchronization among processes. Instead, it provides synchronization mechanisms such as barriers, locks, and fences.
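The short UPC program below (UPC being a dialect of C) sketches this model: a shared array is distributed across threads, upc_forall assigns each iteration to the thread with affinity to the corresponding element, and upc_barrier synchronizes the threads. The array size and the printed message are arbitrary choices for illustration, and compiler/launcher details differ between UPC implementations.

#include <upc_relaxed.h>
#include <stdio.h>

#define N 100

shared int a[N];   /* shared array, distributed across threads by affinity */

int main(void) {
    int i;

    /* Each iteration runs on the thread that has affinity with a[i]. */
    upc_forall(i = 0; i < N; i++; &a[i]) {
        a[i] = i * i;
    }

    upc_barrier;   /* make sure all threads finished writing */

    if (MYTHREAD == 0) {
        printf("a[N-1] = %d computed by %d threads\n", a[N - 1], THREADS);
    }
    return 0;
}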

2.1.5 MapReduce

MapReduce [54] was initially proposed by Google and has quickly gained popularity in both industry and academia. MapReduce and its storage system GFS have been discussed in detail in sections 1.2.2 and 1.2.1. Upon the MapReduce model, some enhancements such as Map-Reduce-Merge and MapReduce Online have been proposed with more features that make MapReduce applicable to more types of applications.

2.1.6 Iterative MapReduce

A large collection of science applications is iterative: each iteration processes the data produced in the last iteration and generates intermediate data that will be used as input by the next iteration, until the result converges. Two typical examples are K-means and Expectation Maximization (EM). Iterative MapReduce can be implemented by pipelining multiple separate MapReduce jobs. However, this approach has significant drawbacks. Firstly, for a logically individual problem, it involves multiple jobs, the number of which is determined by the convergence condition, so the additional overhead of starting, scheduling, managing and terminating jobs is inevitably incurred. Secondly, the intermediate data are serialized to disk and deserialized back into memory across iterations, which results in the substantial overhead of disk accesses that are much slower than memory accesses. Iterative MapReduce runtimes have been developed to mitigate these issues.

2.2 Batch Queuing Systems

Batch queuing systems are usually used to manage resources in supercomputers and dedicated clusters. They manage only compute resources, and need to efficiently handle the resource requests of both small sequential programs running on a single node and massively parallel programs running on thousands of nodes.

Batch queuing systems need to guarantee fairness while maximizing resource utilization. Fairness means that the real resource allocation should match the configured ratios as closely as possible. From the perspective of system owners, keeping the system as busy as possible (but not overloaded) makes full use of the resources and thus is preferred. In contrast to the data-centered approach, storage nodes and compute nodes are separated, and batch queuing systems adopt the “bring data to compute” paradigm. For data-intensive applications, this approach incurs substantial network traffic and thus is inefficient.

In batch queuing systems, jobs are submitted by users to job queues, each of which has an associated priority. Each job is specified in a job description language/script by which users specify how many compute nodes are needed, for how long they are needed, and the location of output/error files. Based on the running jobs and the jobs in the queue, batch queuing systems calculate the availability of resources and reserve the specified number of nodes for a period of time. Mostly, compute nodes share a mounted global file system.

Input data can be stored in the shared file system, or programmers need to write scripts to explicitly stage in data. Usually, the order of job execution is determined based on submission time, the priority of the submitter’s account, and resource use history. Strictly following these rules may result in significant resource fragmentation. For example, in a system comprising 5 nodes, task A is running and using 3 nodes while tasks B and C are in the queue, requiring 5 nodes and 2 nodes respectively. Task A will run for 1 more hour before completion, and tasks B and C will run for 2 hours and 30 minutes respectively. Because task B is placed in front of task C, C will not run until B starts to run. However, task C could be scheduled to run immediately without impacting the execution of task B at all (task B can start only after task A completes anyway). Backfilling [92] allows small jobs to leapfrog large waiting jobs at the front of the queue when there are sufficient idle resources to run those small jobs without incurring significant delay on other jobs, as sketched below.
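The sketch assumes a simplified EASY-style backfilling rule: a waiting job may jump ahead only if it fits in the currently idle nodes and will finish before the queue head's earliest possible start time. Real schedulers add many refinements; this is only an illustration of the decision.

    # Simplified EASY-style backfilling check (an assumed simplification, not the exact
    # policy of any particular batch system).
    def backfill(idle_nodes, head_start_time, now, waiting_jobs):
        """waiting_jobs: (name, nodes_needed, estimated_runtime) tuples queued behind the head job."""
        started = []
        for name, nodes, runtime in waiting_jobs:
            fits_now = nodes <= idle_nodes
            ends_before_reservation = now + runtime <= head_start_time
            if fits_now and ends_before_reservation:
                idle_nodes -= nodes          # start the small job on the idle nodes right away
                started.append(name)
        return started

    # The example from the text: 2 idle nodes, head job B can start earliest in 1 hour
    # (when A finishes), and C needs 2 nodes for 0.5 hour, so C is backfilled.
    print(backfill(idle_nodes=2, head_start_time=1.0, now=0.0, waiting_jobs=[("C", 2, 0.5)]))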

2.3 Data Parallel Runtimes

Data parallel runtimes explicitly exploit the data parallelism of the computation and provide simple programming models where users can plug in their sequential implementations and obtain parallelism automatically at run time. Once domain-specific problems are cast into the model, the runtime is able to automatically parallelize the execution and schedule tasks. As a result, developers do not need to deal with resource allocation, threading, concurrency control, synchronization and fault tolerance, which are known to be difficult to program.

2.3.1 Hadoop

Hadoop is an open source implementation of MapReduce developed under the umbrella of the Apache Software Foundation. The Hadoop community is developing MapReduce 2.0, which dramatically changes the architecture to i) separate resource management from task scheduling/management, and ii) mitigate the performance bottleneck of a single master node.

Some higher level projects have been built on top of Hadoop and add additional functionality. For example, Apache Mahout implements many widely used data mining and machine learning algorithms in a parallel manner so that larger-scale data can be processed efficiently. Hive is a data warehouse system that supports querying and managing large data sets; queries are converted to MapReduce jobs which are run on Hadoop.

2.3.2 Iterative MapReduce Runtimes

Some frameworks and enhancements to MapReduce have been proposed for iterative MapReduce applications. HaLoop [34] modifies Hadoop to provide various caching options and reuses the same set of tasks to process data across iterations (i.e. tasks are loop-aware). Twister [57] and Spark [134] reuse the same set of “persistent” tasks to process intermediate data across iterations. Significant performance improvement over MapReduce has been shown for these frameworks.

2.3.3 Cosmos/Dryad

The MapReduce model is limited in expressiveness. Although many applications can be artificially transformed to MapReduce form (e.g. by splitting the application into multiple MapReduce jobs), the transformation i) may be unnatural and complicate the writing of programs, and ii) may have significant performance implications. Iterative MapReduce is one example. Dryad is an execution engine for data-parallel applications. It provides a DAG model which can encode both computation and communication and thus is more powerful and expressive than MapReduce. Each vertex in the graph corresponds to a task that can be executed on available nodes after its prerequisite tasks have finished and the input data have been staged in. The Dryad scheduler maps vertices to compute nodes with the goal of maximizing concurrency, so independent vertices can run simultaneously on multiple nodes or on multiple cores of a single node. Different types of communication channels are supported, such as files, TCP pipes and shared-memory FIFOs. Dryad automatically monitors the whole system and recovers from computer or network failures. If a vertex fails, Dryad reruns the task on a different node; a version number is associated with each vertex execution to avoid conflicts. If the read of input data fails, Dryad reruns the corresponding upstream vertex to re-generate the data. In the initial version, greedy scheduling was adopted by the job manager under the assumption that it is the only job running in the cluster. Dryad applies run-time optimizations that dynamically refine the graph structure according to network topology and application workload. Dryad runs on top of Cosmos, a distributed file system that facilitates sharing and managing distributed data sets across the whole cluster. Although Dryad provides more advanced features compared to MapReduce, its use in both industry and academia is quite limited and thus its scalability and performance for running diverse parallel applications have not been demonstrated.
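The core of the DAG execution model described above can be captured in a few lines: a vertex becomes runnable once all of its prerequisite vertices have finished, and independent runnable vertices may execute concurrently. The sketch below only illustrates that idea; it is not Dryad's actual scheduler.

    # Toy DAG executor: run each vertex after its prerequisites have finished.
    def schedule_dag(vertices, edges, run_vertex):
        """vertices: iterable of vertex ids; edges: (upstream, downstream) pairs;
        run_vertex(v) executes vertex v, e.g. on an available node."""
        remaining = {v: 0 for v in vertices}           # unfinished prerequisites per vertex
        downstream = {v: [] for v in vertices}
        for u, v in edges:
            remaining[v] += 1
            downstream[u].append(v)
        ready = [v for v, n in remaining.items() if n == 0]
        order = []
        while ready:
            batch, ready = ready, []                   # everything in 'batch' could run in parallel
            for v in batch:
                run_vertex(v)
                order.append(v)
                for w in downstream[v]:                # a finished vertex unblocks its successors
                    remaining[w] -= 1
                    if remaining[w] == 0:
                        ready.append(w)
        return order

    # Example: two independent "map" vertices feeding one aggregation vertex.
    print(schedule_dag(["m1", "m2", "agg"], [("m1", "agg"), ("m2", "agg")], lambda v: None))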

Fig. 2.4 shows the Dryad ecosystem. Languages and language extensions such as Nebula and DryadLINQ have been built to expose the processing capability of the underlying Dryad system, so that existing sequential programs can be modified to become parallel with relatively little effort.

Figure 2.4: Dryad ecosystem

* Copied from http://research.microsoft.com/en-us/projects/dryad/

2.3.4 Sector and Sphere

Sector [65] is a user-space distributed file system. Sector files, which are stored in the local file systems of one or more slave nodes, are not split into blocks, so if files are too large, users need to manually split them into multiple smaller files. Sector can manage data distributed across geographically scattered data centers. One assumption made by Sector is that nodes are interconnected with high-speed network links.

Sector is network topology aware, which means the network structure is considered when data are managed.

Data in Sector are replicated, and the per-file replication factor can be specified by users. Sector allows users to specify where replicas are placed (e.g. on the local rack, on a remote rack). Permanent file system metadata is not required; if the file system metadata is corrupted, it can be rebuilt from the real data. Data transfer is done with a specific transport protocol called UDP-based Data Transfer (UDT), which provides reliability control and congestion control. UDT has been shown to be fast and firewall friendly, and is used by both commercial companies and research projects. Fig. 2.5 shows the architecture of Sector. Sector adopts a master-slave architecture. The Security Server maintains user accounts, file access permission information and the list of authorized slave nodes. The Master Server maintains file system metadata, monitors slave nodes and responds to users’ requests. Real data are stored on slave nodes.

Figure 2.5: Sector system architecture

* Copied from [65]

Sphere [65] is a distributed runtime built upon Sector. Data are processed by Sphere Processing Engines (SPEs), each of which processes a segment of data records. Multiple input streams can be processed simultaneously. In-situ processing is achieved with best effort by processing data near where it resides. User-specified User Defined Functions (UDFs) can be plugged in and applied to data processing. In this sense, it is more generic than the MapReduce model.

2.4 Cycle Scavenging and Volunteer Computing

Besides the massive computing power provided by dedicated supercomputers and clusters, personal computers can also provide a large amount of aggregate processing capability and storage space. The utilization of most PCs is low because they are mainly used for simple, non compute-intensive daily work (e.g. web surfing, sending and receiving email). In other words, combined they offer a large amount of spare resources. Efficiently harvesting those idle CPU cycles and spare storage capacity can provision resources to complex science applications without incurring significant additional cost. Such systems are categorized as Volunteer Computing, where computing resources are donated by their owners. Because the contributed resources are highly distributed, the network connections among them are drastically heterogeneous and each machine may join or leave at any time. As a result, applications suitable for volunteer computing i) should not be IO-intensive, ii) should have little or no communication among tasks, and iii) should be flexible about job completion time.

2.4.1 Condor

Condor [87], an HTC runtime developed at the University of Wisconsin, runs on large collections of distributively owned resources. It can integrate dedicated clusters and personal computers seamlessly into a single environment. Condor provides a match-making mechanism to match the resource requirements of jobs with the physical resource capacity, and it allows users to run existing applications with little or no modification.

Condor supports both sequential and parallel jobs. Applications can be re-linked with Condor’s I/O library to facilitate data I/O and job checkpointing. For parallel jobs, three mechanisms are supported: MPI, Parallel Virtual Machine (PVM) and Condor’s own Master-Worker library. Condor supports fault tolerance with checkpointing and migration, and failed tasks are automatically retried a number of times. Condor can integrate resources that are not under its control through Condor-G. Currently, Condor can talk to grid systems (e.g. Globus) and cloud resources (e.g. Amazon EC2) in addition to traditional batch queuing systems (e.g. Portable Batch System (PBS)). Condor daemons can be run on grid resources, which then join a temporary Condor pool, so that Condor jobs can “glide in” and run on grid resources as if they were native Condor resources. Directed Acyclic Graph Manager (DAGMan) is a meta-scheduler for Condor that manages dependencies among jobs through a DAG and schedules and monitors jobs submitted to the underlying Condor system. Overall, Condor is a powerful execution engine for parallel jobs, but it does not pay much attention to large-data applications.

2.4.2 Berkeley Open Infrastructure for Network Computing (BOINC)

BOINC [2, 27], developed at the University of California, Berkeley, is an open source volunteer computing middleware that harvests unused CPU and GPU cycles. Several BOINC-based projects have been created, including SETI@home, Predictor@home, and Folding@home. The BOINC resource pool is shared by many projects, so there is always work to be done and the whole system is kept busy and fully utilized. Hundreds of thousands of computers are contributed, which provides nearly three petaflops of processing power. BOINC provides support for redundant computing to identify and reject erroneous results (e.g. those caused by malfunctioning computers or malicious users).
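A very small sketch of the redundant-computing idea, assuming a plain majority vote (real BOINC projects can supply their own, more elaborate validators): the same work unit is sent to several hosts and a result is accepted only when enough replicas agree.

    # Accept a work-unit result only if a quorum of replicated computations agree.
    from collections import Counter

    def validate(replica_results, quorum=2):
        """replica_results: results returned by different volunteer hosts for one work unit."""
        winner, votes = Counter(replica_results).most_common(1)[0]
        return winner if votes >= quorum else None   # None -> issue more replicas

    print(validate(["3.14159", "3.14159", "9.99999"]))  # faulty/malicious host is outvoted
    print(validate(["3.14159", "2.71828"]))             # no quorum yet -> None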

2.5 Parallel Programming Languages

High level programming languages with native support of parallel data processing can lower the entry barrier to parallel programming. Programmers can directly utilize the constructs provided by these languages to achieve various types of parallelism.

2.5.1 Sawzall

Sawzall [100] is a procedural programming language that supports query and aggregation of data distributed over hundreds or thousands of computers. Sawzall exploits the combined computing capability of the multiple machines where the data reside. Input data are organized as records. Sawzall splits the computation into two phases: filtering and aggregation. In the filtering phase, an operation is applied to each record and intermediate values are emitted. In the aggregation phase, all intermediate values are collected and a “reduction” operation is applied to produce the final results. The natively supported aggregators include collection, sampling, summation, maximum, quantile, top and unique. Developers can define new aggregators and plug them into the Sawzall runtime. Aggregators are shared among all tasks in the same job. In addition, an aggregator can be indexed, in which case the Sawzall runtime automatically creates individual aggregators for each index. For commutative and associative operations, the order of record processing and aggregation is not important, based on which Sawzall applies additional optimizations (e.g. coalescing intermediate data). Sawzall is built upon existing Google infrastructure such as the Google File System and MapReduce, and has been used by Google to process their log files.

1 count: table sum of int;
2 sum: table sum of float;
3 ele: float = input;
4 emit count <- 1;
5 emit sum <- ele;
6 emit mean <- sum / count;

Figure 2.6: Sawzall example

Fig. 2.6 shows a simple Sawzall example that calculates the mean of all the floating-point numbers in the input files. Lines 1-2 declare two aggregators, which are marked explicitly by the keyword table; both have aggregator type sum, which adds up the values emitted to it. Each record in the input is converted to type float and stored in the variable ele declared in line 3. The emit statement in line 4 sends the value 1 to the aggregator count whenever a record is encountered. The emit statement in line 5 sends each data value to the aggregator sum. So after all records are processed, count stores the number of records and sum stores the sum of all records. The mean is calculated in line 6.

2.5.2 Hive

Hive [122, 121] is an open source data warehousing platform built on top of Hadoop. It defines Hive Query Language (HiveQL), a Structured Query Language (SQL)-like declarative language. Hive adopts well-understood database concepts such as tables, rows, columns and partitions. In addition, HiveQL allows users to plug their own MapReduce programs into HiveQL queries so that complex logic can be expressed directly with the MapReduce paradigm. The system architecture is shown in Fig. 2.7. HiveQL programs are compiled into MapReduce jobs that run on Hadoop. Hive also provides a system catalog, the Metastore, where data schemas and statistics are stored. The information stored in the Metastore can be used for data exploration, query optimization and query compilation. Hive has been used intensively at Facebook for reporting and ad-hoc data analysis.

Figure 2.7: Hive system architecture

* Copied from [121]

2.5.3 Pig Latin

Pig Latin [97] is a data processing language proposed by Yahoo!. It was designed based on the observations that i) SQL is declarative and thus unnatural to procedural programmers, and ii) the MapReduce model is too low-level and the application code is hard to maintain and reuse (e.g. multiple jobs may need to be pipelined for a single query). Pig Latin tries to find the sweet spot between declarative SQL and procedural MapReduce. Each Pig Latin program comprises a sequence of steps, each of which is expressed in an SQL-like syntax and carries out a single high-level data transformation. A Pig Latin program acts like an execution plan as a whole, while the runtime can parallelize the execution of each data transformation. The execution of different steps can be reordered to maximize performance. The supported data transformations include per-tuple processing, filtering, grouping, join, aggregation and so on. Pig Latin programs are converted to MapReduce jobs which run on Hadoop in parallel. With the load operation, input data are deserialized into Pig’s data model; output data are serialized with the store operation. Pig Latin supports a nested data model and allows developers to write UDFs based on their concrete computation requirements.

2.5.4 X10

X10 [40], a language developed by IBM, is designed specifically for parallel programming based on the PGAS model. X10 is built on the foundation of Java, but overcomes Java’s lack of lightweight and simple parallel programming support. It is intended to increase programming productivity for Non-Uniform Cluster Computing (NUCC) without sacrificing performance. Its main design goals are safety, analyzability, scalability and flexibility. X10 is a type-safe object-oriented language with specific support for high-performance computation over dense and sparse distributed multi-dimensional arrays. X10 introduces dynamic, asynchronous activities as fundamental concurrency constructs, which can be created locally or remotely. Globally Asynchronous Locally Synchronous (GALS) semantics is supported for reading and writing mutable data.

2.6 Workflow

2.6.1 Grid workflow

A workflow comprises a sequence of steps concatenated through data or control flow. A workflow management system defines, manages and executes workflows on distributed resources. It allows users to build dynamic applications that orchestrate distributed resources which may span multiple administrative domains. Workflow management systems with different features have been proposed and developed, such as Pegasus [56], Taverna [96], and Kepler [26]. A taxonomy of workflows is presented in [130], which categorizes workflow management systems based on five elements: a) workflow design, b) information retrieval, c) workflow scheduling, d) fault tolerance, and e) data movement. Workflow in grid systems is discussed in [61]. Workflow management systems should quickly bind workflow tasks to the appropriate compute resources. Their efficiency also depends on the data movement mechanisms between tasks: data are moved either via a data transfer service or a script, or through data channels directly between the involved tasks. Most workflow management systems assume that workflow tasks are long running and that data movement cost is negligible. However, for data-intensive applications that process extreme volumes of data, moving data around is prohibitively expensive. When composing workflows, users therefore need to pay special attention to data locality.

2.6.2 MapReduce workflow

For complex problems, a single two-phase MapReduce job is insufficient, and multiple jobs need to be constructed and connected. Oozie [12] is an open source workflow service that manages jobs for Hadoop. It supports Streaming, MapReduce, Pig and HDFS jobs. Users express jobs and their dependency relationships as a DAG. Supported flow control operations include chaining, fork, join and decision. Because cycles are not supported, for-loop-like iteration cannot be expressed if the number of iterations cannot be determined statically. Oozie is transactional in that it provides fault tolerance support (e.g. automatic and manual retry).

HyMR is a workflow management system proposed by us that integrates the MapReduce framework Hadoop and the iterative MapReduce framework Twister. The key observation is that Hadoop is good at fault tolerance and data sharing, while Twister achieves better performance for iterative applications but lacks a data sharing mechanism and fault tolerance. HyMR combines the best of both worlds: HDFS is used to share data across different jobs in a workflow, Twister is used to run iterative applications, and Hadoop is used to run regular MapReduce applications. See section 7.2 for more details.

3 Performance Evaluation of Data Parallel Systems

In this chapter, we present the extensive experiments conducted to measure the efficiency of Hadoop and various storage systems including local IO subsystem, NFS, HDFS and Swift. The results obtained in this chapter will provide valuable insights for optimizing data parallel systems for data-intensive applications.

3.1 Swift

Swift is a new project under the umbrella of OpenStack. I would like to discuss its design specifically because it is more recent than the other well-known systems to be evaluated. Swift is a highly available, distributed, eventually consistent object store. Swift is not a file system, and typical Portable Operating System Interface (POSIX) semantics is not used. Swift clients interact with the server using the HTTP protocol; basic HTTP verbs such as PUT, GET and DELETE are used to manipulate data objects. The main components of the Swift architecture are described below.

• Proxy Server: The Proxy Server is the gateway of the whole system. It responds to users’ requests by looking up the location of the account, object or container and dispatching the requests accordingly. The Proxy Server can also handle failures: if an object server becomes unavailable, it routes requests to a handoff server determined by the ring. All object reads/writes go through the proxy server, which in turn interacts with the internal object servers where data are physically stored.

• Object Server: The Object Server is an object/blob storage server that supports storing, retrieving and deleting objects. It requires the file system’s extended attributes (xattrs) to store metadata along with the binary data. Each object store operation is timestamped and the last write wins.

• Container Server: The Container Server stores listings of objects, called containers. The Container Server does not store the location of objects; it only manages which objects are in each container. A good analogy is directories in local file systems.

• Account Server: The Account Server manages listings of containers.

• Ring: A ring represents a mapping from the logical namespace to physical locations. Given the name of an entity, its location can be found by looking up the corresponding ring. Separate rings are maintained for accounts, containers, and objects.

Swift supports large objects. A single uploaded object is limited to 5GB by default. Larger objects can be split into segments, uploaded along with a special manifest file that records all the segments of a file. When downloaded, the segments are concatenated and sent as a single object. In the Swift tests below, the data size is much larger than 5GB, so large object support is used.
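The segmentation idea can be illustrated with a small sketch (illustrative only, not Swift's actual implementation or client API); the 64MB segment size matches the setting used in the tests below.

    # Split a large object into fixed-size segments plus a manifest listing them.
    SEGMENT_SIZE = 64 * 1024 * 1024                     # 64MB, as in the tests below

    def split_into_segments(object_name, total_size, segment_size=SEGMENT_SIZE):
        """Return a manifest: an ordered list of (segment_name, offset, length)."""
        manifest, offset, index = [], 0, 0
        while offset < total_size:
            length = min(segment_size, total_size - offset)
            manifest.append(("%s/%08d" % (object_name, index), offset, length))
            offset += length
            index += 1
        return manifest                                  # download = concatenate in order

    print(len(split_into_segments("bigfile", 400 * 1024 ** 3)))   # 400GB -> 6400 segments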

Swift is designed for scenarios where writes are heavy and repeated reads are rare. Since a Swift object server holds so many files, the probability of buffer cache hits is marginal, and buffering disk data in memory therefore does not help much. So in the implementation, the in-memory data cache is purged aggressively, as elaborated below.

• For write operations, cache is purged every time 512MB data are accumulated in cache.

• For read operations, cache is purged every time 1MB data are accumulated in cache

So for the same amount of data, read operations incur many more cache purge calls than write operations (512x), and the additional function calls result in higher overhead. One consequence of cache purging is that any read fetches data from the physical disk with high probability.

• For write operations, data are first written to memory, accumulated, and then flushed to disk later, so optimizations for bulk copying from memory to disk can be used.

• For read operations, each request likely touches the physical disks.
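The asymmetric purge policy above can be approximated with the Linux-only os.posix_fadvise call. This is a rough sketch of the behavior, not Swift's actual code; the thresholds are the ones quoted above.

    # Drop the buffer cache for a file descriptor once a threshold of bytes has accumulated.
    import os

    WRITE_PURGE_THRESHOLD = 512 * 1024 * 1024   # purge after every 512MB written
    READ_PURGE_THRESHOLD = 1 * 1024 * 1024      # purge after every 1MB read

    def drop_cache_if_needed(fd, bytes_since_purge, threshold):
        if bytes_since_purge >= threshold:
            os.fsync(fd)                                         # make cached pages clean first
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)   # then evict them from the page cache
            return 0                                             # counter reset
        return bytes_since_purge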

3.2 Testbeds

FutureGrid is an international testbed supporting new technologies at all levels of the software stack. The supported environments include cloud, grid and parallel computing (HPC). Table 3.1 summarizes the currently supported tools.

PaaS    Hadoop, Twister, ...
IaaS    Nimbus, Eucalyptus, ViNE, OpenStack, ...
Grid    Genesis II, Unicore, SAGA, ...
HPC     MPI, OpenMP, ScaleMP, PAPI, ...

Table 3.1: FutureGrid software stack

Our performance evaluation was carried out on FutureGrid clusters. Table 3.2 lists the basic specifications of all FutureGrid clusters. Most of the experiments below were conducted on Bravo, whose detailed specification is shown in Table 3.3. Unless stated otherwise, Gigabit Ethernet is used.

3.3 Evaluation of Hadoop

Firstly, we evaluate the performance of Hadoop. Hadoop has many parameters that can be tuned. In our evaluation, three critical parameters are considered:

• the size of input (denoted by D)

• the number of nodes (denoted by N)

• the number of map slots per node (denoted by S). It restricts the maximum number of concurrently running tasks on a node.

Name      System Type      # Nodes   # CPUs   # Cores   TFlops   RAM (GB)   Storage (TB)   Site
india     IBM iDataPlex    128       256      1024      11       3072       335            IU
sierra    IBM iDataPlex    84        168      672       7        2688       72             SDSC
hotel     IBM iDataPlex    84        168      672       7        2016       120            UC
foxtrot   IBM iDataPlex    32        64       256       3        768        0              UF
alamo     Dell PowerEdge   96        192      768       8        1152       30             TACC
xray      XT5m             1         168      672       6        1344       335            IU
bravo     HP Proliant      16        32       128       1.7      3072       60             IU

Table 3.2: FutureGrid clusters

(copied from https://portal.futuregrid.org/manual/hardware)

Machine Type: Cluster
System Type: HP Proliant
CPU type: Intel Xeon E5620
CPU Speed: 2.40GHz
Number of CPUs: 128
Number of nodes: 16
RAM: 192 GB DDR3 1333MHz
Total RAM (GB): 3072
Number of cores: 128
Operating System:
Tflops: 1.7
Hard Drives: 6x2TB Internal 7200 RPM SATA Drive
Primary storage, shared by all nodes: NFS
Connection configuration: Mellanox 4x DDR InfiniBand adapters

Table 3.3: Specification of Bravo

(copied from https://portal.futuregrid.org/hardware/bravo)

The first metric we choose is job turnaround time T, which is the wall-clock time between the time when a job is submitted and the time when it completes. However, absolute job turnaround time cannot reveal the efficiency of scaling out. When the number of nodes or map slots per node is increased, performance is expected to improve. Given a fixed amount of work, ideally the speedup of T should be S · N compared with the baseline configuration where S and N are both 1. But in practice the efficiency of scaling out is much lower than the theoretical value; for example, it is possible that job turnaround time is reduced by 2 times while the number of nodes is increased by 10 times. So we normalize absolute job turnaround time to calculate parallel efficiency.

Let I1 denote the normalized job turnaround time with respect to the number of nodes (i.e. I1 is the time taken to process one unit of data using one node). I1 can be calculated with (3.1).

T = I1 · D / N  ⇒  I1 = T · N / D      (3.1)

We know that I1 = T · N/D given a fixed S. Now we take S into consideration. Let I2 denote the normalized job run time with respect to both the number of nodes and the number of slots per node. So I2 is the time taken to process one unit of data on one node by one map task. We assume all map slots are utilized during processing. I2 can be calculated using (3.2).

I1 = I2 / S  ⇒  I2 = I1 · S = T · N · S / D      (3.2)
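For convenience, the two normalizations can be wrapped in a small helper (a sketch of equations (3.1) and (3.2), with D given in whatever unit the normalized values should be reported in):

    # Normalize a measured job turnaround time by data size, node count and slots per node.
    def parallel_efficiency(T, D, N, S):
        I1 = T * N / D          # time per unit of data on one node              (3.1)
        I2 = T * N * S / D      # time per unit of data per node per map slot    (3.2)
        return I1, I2

    # Baseline run from Table 3.4: S = 1, N = 10, D = 400 (GB), T = 5711 (seconds).
    print(parallel_efficiency(T=5711, D=400, N=10, S=1))   # (142.775, 142.775) seconds per GB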

In our tests, we varied the number of slave nodes among 10, 20, 30 and 40; varied the number of map slots per node among 1, 4, 8, 10, 16, 32 and 64; and varied the size of the input data among 100GB, 200GB, 300GB and 400GB.

Firstly, the size of data was fixed at 400GB. Table 3.4 shows both job turnaround time and speedup with respect to N and S. For the speedup calculation, the test where S was 1 and N was 10 was chosen as the baseline reference configuration. Then we fixed N to 20 and varied S and D; Table 3.5 shows the results. These raw values are hard to interpret, so we visualize and analyze the impact of the different factors below in detail.

Time in seconds (left) and speedup (right), with the number of nodes varied.

Scheduling: default
S     N=40   N=30   N=20   N=10   |  N=40   N=30   N=20   N=10
1     1431†  1904   2852   5711   |  4.0    3.0    2.0    1.0∗
4     608    787    1180   2554   |  9.4    7.2    4.8    2.2
8     422    463    756    1868   |  13.5   12.3   7.6    3.0
10    331    434    699    1594   |  17.2   13.1   8.2    3.6
16    261    354    572    1187   |  21.9   16.1   10.0   4.8
32    222    259    440    988    |  25.7   22.0   13.0   5.8
64    296    294    443    955    |  19.3   19.4   12.9   6.0

Scheduling: 0.5 random
S     N=40   N=30   N=20   N=10   |  N=40   N=30   N=20   N=10
1     1460   1910   2856   5739   |  3.9    3.0    2.0    1.0
4     667    880    1298   2686   |  8.6    6.5    4.4    2.1
8     481    612    931    1822   |  11.9   9.3    6.1    3.1
10    399    530    787    1643   |  14.3   10.8   7.2    3.5
16    305    382    548    1193   |  18.7   14.9   10.4   4.8
32    226    254    394    1018   |  25.3   22.5   14.5   5.6
64    281    288    408    1040   |  20.3   19.8   14.0   5.5

Scheduling: random
S     N=40   N=30   N=20   N=10   |  N=40   N=30   N=20   N=10
1     1459   1905   2853   5754   |  3.9    3.0    2.0    1.0
4     706    957    1370   2761   |  8.1    6.0    4.2    2.1
8     506    645    1026   1932   |  11.3   8.9    5.6    3.0
10    440    557    836    1869   |  13.0   10.2   6.8    3.0
16    328    418    634    1231   |  17.4   13.7   9.0    4.6
32    245    271    461    1107   |  23.3   21.1   12.4   5.2
64    299    278    534    1050   |  19.1   20.5   10.7   5.4

Table 3.4: Job run time w/ S and N varied (fixed 400GB input)

∗: baseline system configuration: S = 1, N = 10
†: time is in seconds

3.3.1 Job run time w.r.t the num. of nodes

We increased the number of nodes and measured the job run time of different scheduling algorithms with S set to 8 and 32. Fig. 3.1 shows the results. All job run times are normalized against the reference test where S is 8, N is 10 and default scheduling is used.

      default                      0.5 random                   random
S     400G†   100G‡   Slowdown∗ |  400G    100G   Slowdown∗  |  400G    100G   Slowdown∗
1     5711    1329    4.29      |  5739    1327   4.32       |  5754    1321   4.35
4     2554    538     4.74      |  2686    582    4.61       |  2761    625    4.41
8     1868    315     5.93      |  1822    404    4.50       |  1932    438    4.41
10    1594    277     5.75      |  1643    360    4.56       |  1869    370    5.05
16    1187    231     5.13      |  1193    248    4.81       |  1231    262    4.69
32    988     151     6.54      |  1018    207    4.91       |  1107    185    5.98
64    955     138     6.92      |  1040    111    9.36       |  1050    168    6.25

Table 3.5: Job turnaround time w/ S and D varied

†: time taken to process 400GB of data (in seconds)
‡: time taken to process 100GB of data (in seconds)
∗: given a fixed S, slowdown = (time to process 400GB of data) / (time to process 100GB of data)

(Figure 3.1 plots normalized job run time against the number of nodes, 10 to 40, for default, 0.5 random and random scheduling with 8 and 32 mappers per node, normalized against the 10 nodes / 400GB / 8 mappers / default test.)

Figure 3.1: Job run time with the number of nodes varied

• As we expected, the job run time decreases as we increase the number of nodes, but the relationship is not linear. The performance gain becomes less and less significant as more nodes are added; eventually the reduction of job run time becomes marginal and using more resources becomes cost prohibitive.

• Default scheduling has the highest speedup, random scheduling has the lowest, and 0.5 random falls in between.

• Running 32 mappers per node yields much better performance than 8 mappers per node.

3.3.2 Job run time w.r.t the number of map slots per node

Theoretically, increasing S is a double-edged sword. On the one hand, it increases concurrency by some factor P; on the other hand, it increases overhead by some factor Q. P and Q need to be evaluated through experiments, and the final outcome depends on their relative size: if P > Q, increasing S decreases job run time; if P < Q, it increases job run time. The speedup is P/Q. Fig. 3.2 shows job run time with S varied among 1, 4, 8, 10, 16, 32, and 64. We summarize our observations:

• The relationship is NOT linear. Ideally, job run time should be inversely proportional to S, but the plots show this is not the case.

• When S is small, increasing S brings significant benefit and job run time decreases drastically. This is because increasing S improves concurrency, so the overlap between computation and IO, and the processing power of the multi-core processors, can be better exploited. The benefit of higher concurrency exceeds the overhead.

• When S becomes big (more than 32), job run time does not change much. For some tests, increasing S from 32 to 64 even increases job run time. This means the contention for resources becomes comparable to the benefit of higher concurrency.

• There is a specific value of S that maximizes the speedup. The concrete value depends on both the hardware and the application workload. In an environment where the workload of submitted jobs is quite diverse, finding the optimal S is non-trivial, and it may need to be adjusted dynamically.

• Random scheduling performs worse than default scheduling.

(Figure 3.2 plots the percentage slowdown of job run time for 0.5 random and 1.0 random scheduling, relative to default scheduling, against the maximum number of map tasks per node, for the configurations (a) 11 nodes/100GB, (b) 11 nodes/400GB, (c) 21 nodes/200GB, (d) 21 nodes/400GB, (e) 31 nodes/300GB, (f) 31 nodes/400GB and (g) 41 nodes/400GB.)

Figure 3.2: Job run time for different configurations

Besides absolute job turnaround time, we also calculate parallel efficiency. We have the following observations regarding the normalized job run time I1 (which eliminates the factor of N):

• When S is small (e.g. between 1 and 4), the size of the input data does not have a significant impact on I1.

• When S becomes big, the more data is processed, the bigger I1 becomes.

• When S is small, I1 is quite close regardless of how many nodes the system has and how much data is processed. As S is increased, the difference between the I1 values for different input data sizes increases.

In addition, we have made the following observations for the normalized job run time I2 (which eliminates the factors of N and S):

• I2 increases monotonically as S is increased. This means that increasing concurrency by increasing S always results in efficiency deterioration.

• In the worst case, I2 is 10x slower than the ideal speedup. When S is small, I2 is quite close regardless of how many nodes the system has and how much data is processed. As S gets larger (beyond 16), the different system factors, including the number of nodes and the amount of input data, result in more diverse I2.

3.3.3 Run time of map tasks

Given a fixed number of nodes, the larger the input data is, the longer each map task executes. This is obvious because each task processes more data. Fig. 3.3 shows the average run time of map tasks for default and 1.0 random scheduling. As S is increased, both mean and standard deviation (not shown in the plots) of the run time of map tasks increase monotonically. So one consequence of increasing S is the increase of task run time, which is caused by the increasing resource contention. From the plots, we can see the run time is approximately linear with S when S becomes larger than 8. However, the slope is different for varied numbers of nodes.

(Figure 3.3 plots the average map task run time in seconds against the maximum number of map tasks per node for 11, 21, 31 and 41 nodes with 400GB input: (a) default scheduling, (b) 1.0 random scheduling.)

Figure 3.3: Average run time of map tasks

Fig. 3.4 shows the slowdown of the average map task run time caused by random scheduling, with Hadoop default scheduling as the reference. Overall, random scheduling results in longer run times than default scheduling. We observe that when S is increased from a small value, random scheduling slows down the execution of map tasks more and more substantially. The slowdown mostly reaches its maximum when S is 8; beyond that point, the slowdown gradually shrinks. 1.0 random performs significantly worse than 0.5 random, and for some tests the slowdown trend of 1.0 random changes drastically and thus is harder to characterize. In addition, when S is no less than 32, the more nodes there are in the system, the less significant the performance difference becomes. When S is increased, the task scheduler can schedule more tasks at each scheduling point because there are more available map slots. As a result, the number of scheduling “waves” is reduced, and so is the overhead.

3.4 Evaluation of Storage Systems

Storage systems are critical to the overall performance of data parallel systems because they determine the throughput of reading and writing data. We evaluate how state-of-the-art storage systems perform, including local disks, HDFS [114], shared NFS [109], and OpenStack Swift [14].

(Figure 3.4 plots the percentage slowdown of the average map task run time under 0.5 random and 1.0 random scheduling, relative to default scheduling, against the maximum number of map tasks per node for (a) 11, (b) 21, (c) 31 and (d) 41 nodes with 400GB input.)

Figure 3.4: Average run time slowdown of map tasks

3.4.1 Local IO subsystem

In this test, we measure how local data accesses perform for sequential read and write operations. Both direct IO (i.e. bypassing the OS buffer/page cache) and regular IO (i.e. using OS caching) were tested. For the direct IO tests, 1GB of data was written and read sequentially, and the buffer size was set to 512 bytes, which is aligned to the disk block size; this setting was mandatory for the tool we used. For the regular IO tests, 400GB of data was written to and read from a local disk sequentially, and the buffer size was set to 1MB. Each Bravo node has roughly 200GB of memory, so the data size (400GB) was twice the amount of memory and the caching effect became negligible. Results are shown in Table 3.6.

            Direct IO (buffer size 512B)        Regular IO with OS caching
operation   size (GB)   time        io-rate     size (GB)   time       io-rate
seq-read    1           77.7 sec    13.5 MB/s   400         1059 sec   386.8 MB/s
seq-write   1           103.2 sec   10.2 MB/s   400         1303 sec   314 MB/s

Table 3.6: IO performance of local disks

Observations: Regular IO achieved significantly higher throughput than direct IO; IO throughput was degraded by roughly 30 times for direct IO. One influential factor is buffer size: temporarily caching data allows bulk sync-up. In addition, for regular IO, writing data to memory (done by applications) and syncing dirty cached pages to disk (managed by the OS) can overlap to some extent. Another observation is that write throughput is lower than read throughput, by 24% and 19% for direct IO and regular IO respectively.
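The regular-IO methodology can be illustrated with a short sketch. The numbers above were collected with a standard IO benchmarking tool; this is only an illustration of sequentially writing a file larger than RAM with a 1MB application buffer.

    # Measure sequential write throughput with a 1MB application buffer (regular, cached IO).
    import os, time

    def sequential_write_throughput(path, total_bytes, buf_size=1024 * 1024):
        buf = b"\0" * buf_size
        written = 0
        start = time.time()
        with open(path, "wb") as f:
            while written < total_bytes:
                f.write(buf)
                written += buf_size
            f.flush()
            os.fsync(f.fileno())             # make sure the data actually reach the disk
        return written / (time.time() - start) / (1024 * 1024)   # MB/s

    # e.g. sequential_write_throughput("/tmp/testfile", 400 * 1024 ** 3) on a Bravo node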

3.4.2 NFS

The user home directory on Bravo resides on a mounted NFS share. The exact same tests as in section 3.4.1 were run; results are shown in Table 3.7.

            Direct IO (buffer size 512B)       Regular IO with OS caching
operation   size (GB)   time       io-rate     size (GB)   time       io-rate
seq-read    1           366 sec    2.8 MB/s    400         3556 sec   115.2 MB/s
seq-write   1           2688 sec   390 KB/s    400         3856 sec   106.2 MB/s

Table 3.7: IO performance of NFS

Observations: As expected, without the benefit of in-memory caching, direct IO performs much worse than regular IO: read and write throughput were degraded by 97% and 99% respectively. Communication between NFS clients and the server is necessary to synchronize metadata and transfer the real data, so throughput is limited by the capability of the network devices and structure. In our tests, Gigabit Ethernet was used, so the maximum throughput was 125 MB/s; the throughput of regular IO is close to this limit. For direct IO, not only network throughput but also latency matters.

Compared with the local IO subsystem, the throughput of NFS is much lower. For direct IO, read and write throughput were degraded by 79% and 96% respectively, while for regular IO read and write throughput were degraded by 70% and 66%.

3.4.3 HDFS

In this test, we intend to reveal the additional overhead added by the HDFS implementation. We ran a two-node HDFS cluster on Bravo: one namenode and one datanode. Local disks were used by the HDFS datanode to store raw data, and the replication factor was set to 1. 400GB of data was written to and read from HDFS. For write operations, the in-memory data source /dev/zero was used, which did not incur disk accesses for reading the data; for read operations, the data sink was /dev/null. The HDFS client was run on the only datanode so that all data accesses (except metadata requests) were local and went through HDFS. Table 3.8 shows the results.

operation   size (GB)   time       io-rate
seq-read    400         3228 sec   126.9 MB/s
seq-write   400         3456 sec   118.6 MB/s

Table 3.8: IO performance of HDFS

Observations: Firstly, read and write throughput are comparable. Secondly, comparing the results with the local IO tests (Table 3.6), we can see that HDFS achieves only roughly 1/3 of the throughput of direct access to the local disk with regular IO. Because all data accesses were local in both tests, we can conclude that the additional overhead was introduced by the HDFS software stack. This implies the version of Hadoop we used is not well optimized or tuned; we would expect Hadoop to perform comparably to local IO. One possible “culprit” is the excessive memory copying among different Hadoop modules (e.g. the encryption module, checksum module, and stream module). Another reason is that data accesses in HDFS go through the TCP/IP stack no matter whether the data are located locally; this eases the implementation but incurs a performance penalty. We believe this performance degradation will catch attention in the community and be improved over time in future releases.

3.4.4 OpenStack Swift

We deployed an 8-node Swift system on Bravo. It comprised 1 proxy node and 7 storage nodes.

47 3.4.4.1 Single cluster

We ran the Swift client on Bravo to write a 400GB object into Swift and read it back. The segment size was set to 64MB (the same as the HDFS block size). For write operations, the data source was /dev/zero; for read operations, the data sink was /dev/null. Both the client and the service were run on Bravo. Results are shown in Table 3.9.

operation   size (GB)   time        io-rate
seq-read    400         10723 sec   38.2 MB/s
seq-write   400         11454 sec   35.8 MB/s

Table 3.9: IO performance of Swift (single cluster)

Observations: Both read and write throughput are far below the theoretical maximum (i.e. 125 MB/s). Because all traffic goes through the proxy server, the slowdown was possibly caused by the inefficiency of the proxy server. If the data transfer between the client and the proxy server and that between the proxy server and the object server were well pipelined, the throughput should be close to at least 100 MB/s, given that the number of transferred segments is large in our tests (400GB / 64MB = 6400). We conclude that Swift is still in an early phase and more optimization of the data transfer path is necessary to achieve satisfactory performance.

3.4.4.2 Cross clusters

We ran the Swift client in the FutureGrid cluster Foxtrot, deployed at the University of Florida, so all data accesses were cross-cluster. We compared the cases where Virtual Machines (VMs) and physical nodes are used. The same tests as above were run and the results are shown in Table 3.10.

                        Foxtrot Nimbus VM (InfiniBand)    HPC in Foxtrot (InfiniBand)
data size   operation   time        io-rate               time        io-rate
400GB       seq-read    31128 sec   13.16 MB/s            14647 sec   30.0 MB/s
400GB       seq-write   8025 sec    51.04 MB/s            4483 sec    91.4 MB/s

Table 3.10: IO performance of Swift (cross-cluster)

Observations:

1. InfiniBand can improve write throughput by 2x to 4x while read throughput stays about the same. So for write operations, the network bandwidth of the proxy server (there was only one in our tests) is the bottleneck.

2. Using VMs degraded performance significantly, by 57% and 44% for read and write operations respectively, compared with the case where physical nodes were used. Using VMs yielded higher write throughput but lower read throughput compared with single-cluster accesses over Gigabit Ethernet.

3. One interesting observation is that write is faster than read in all these tests. One possible cause is the different cache purge policies for read and write operations, as discussed in section 3.1.

3.4.5 Small file tests

In this test, we created many files, wrote a small amount of data (1MB) into each file and read the data back. Table 3.11 shows the results.

                 System       Num. of files   File size (MB)   Time
Create & Write   Shared NFS   1000            1                37.8 sec
                 Local Disk   1000            1                5.5 sec
                 HDFS         1000            1                18.0 sec
Read             Shared NFS   1000            1                7.9 sec
                 Local Disk   1000            1                4.5 sec
                 HDFS         1000            1                13.4 sec

Table 3.11: IO performance for small files

Observations: The local IO subsystem is the most efficient. The creation of new files in NFS incurs the most overhead, about 6 times that of local IO and 2 times that of HDFS, so users should avoid creating lots of small files in NFS. For reading small files, HDFS performs the worst, which matches the results of the HDFS tests above; further optimization of the data access path in HDFS is needed to make it perform efficiently. Overall, the distributed storage systems incur at least a 2x slowdown compared to local storage.

3.5 Summary

We understand that achieving data locality is beneficial as it reduces data movement. However, our evaluation results indicate that the performance gain is not as high as we expected, which is caused by implementation inefficiencies of the distributed file systems. For instance, local data access through HDFS can only achieve 1/3 of the throughput of direct local access. In large-scale clusters with tens of thousands of machines, every bit of improvement in IO efficiency can bring significant benefit (e.g. cost reduction, performance gain). In our research below, we do not try to optimize the file systems themselves.

We have experimentally evaluated how Hadoop performs for IO-intensive applications when some of its parameters are varied. We found that increasing parallelism by adding more nodes reduces job run time, and the practical speedup is approximately linear. Increasing the number of slots per node can also increase parallelism by accommodating more tasks; however, users need to trade off concurrency against resource contention. Our experiments showed that the best performance was reached when the value was 32. In chapter 4, we present a more thorough analysis of the relationship between configuration parameters and data locality.

4 Data Locality Aware Scheduling

Data locality measures how close compute and its input data are. The “proximity” is related to both the network infrastructure and the scheduling strategy. In data-intensive computing, the amount of processed data is huge, and moving data around overwhelms the limited network resources and thus is quite inefficient.

Traditional HPC frameworks focus on compute scheduling and do not consider data affinity, while data parallel systems natively integrate data locality into task scheduling.

For the typical hierarchical network configuration used in clusters, nodes and network devices are organized in a tree-like structure, as depicted in Figure 4.1. Nodes are placed in racks, and a top-of-rack (edge) switch usually comes with each rack. Those switches are connected to aggregation switches, which are in turn connected to core switches. The number of switches decreases drastically when walking up the tree from bottom to top, and the switches at higher layers are much more powerful (and therefore more expensive) than those at lower layers. Most data centers introduce oversubscription to lower the cost, which means the aggregate throughput of the top layers is lower than that of the bottom layers. This design assumes that the worst case, in which all nodes use the network at full speed simultaneously, is rare. This assumption may break for modern clusters. Firstly, data-intensive applications process huge amounts of data in parallel, and concurrent data staging imposes significant load on the network infrastructure. Secondly, the current trend is that individual compute nodes have more and more cores and processors, which enables more tasks to run concurrently on each node, so network traffic increases on average. As a result, in a busy cluster, network oversubscription may result in significant fluctuation of task execution time depending on where the input data are stored.

Figure 4.1: Typical network topology

* Copied from [58]

Besides the topology described above, other topologies such as Hypercube, Mesh, Torus [94] and fat tree [85] have been proposed for different purposes. However, in modern clusters, the hierarchical structure is the most prevalent way to interconnect compute nodes.

4.1 Traditional Approaches to Build Runtimes

To simplify data access, distributed file systems, which allow access to files from multiple hosts via the network, have undergone intensive research. Usually a unified namespace is provided to ease data management. Traditionally, the transparency of data accesses is favored, and developers should not need to care about whether data are stored locally or on a remote node. NFS [109], Parallel Virtual File System (PVFS) [35], Lustre [112] and GPFS [110] are representative examples. Usually these file systems run on dedicated storage hardware (e.g. Network Attached Storage) and are mounted to the compute sites.

Next, we specifically analyze Lustre, a widely used distributed file system. Fig. 4.2 shows the architecture of Lustre. Lustre comprises several components, including the Metadata Server (MDS), Metadata Target (MDT), Object Storage Server (OSS), Object Storage Target (OST), and clients. The MDS maintains namespace metadata such as directory structure, access permissions, and file layout, which are stored on the MDT. OSSes store the real data on one or more OSTs, with each OST typically managing a single local file system. Clients access the stored data. In Lustre, files are striped across OSTs, which balances load on disks and enables parallel IO that fully exploits the throughput of the underlying disk subsystems. Stripe size is the amount of data written to one OST before moving to the next OST, and its default setting is 1 MB. Stripe count is the number of OSTs to use for a single file, and Lustre can stripe files over up to 160 OSTs. Lustre is compatible with POSIX APIs and, in terms of functionality, looks like a regular local file system to end users. Although this makes the lives of developers easier because the details of the underlying data placement are hidden, the approach has several drawbacks. Firstly, local access and remote access are different in terms of latency and throughput, so the same API call may exhibit drastically different responsiveness depending upon where the accessed data are located. The variation may be unexpected to developers and therefore break their assumptions about the runtime environment; performance tends to deteriorate, and inconsistent results may even be generated.

Secondly, distributed computing runtimes cannot integrate storage affinity into scheduling because low-level data location information is not exposed, so data mostly have to be staged in, i.e. moved to the compute nodes. For clusters with high-profile hardware that run non-data-intensive applications, this data-staging approach may work well. However, as data sizes keep increasing, moving data around is quite inefficient and degrades overall performance significantly because of severe contention for the network. Some simulation systems see this issue at the largest scales, when the output data overwhelm the connection to shared storage [37][95][41]. Table 4.1 shows a comparison of HDFS and Lustre.
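As a small aside, the striping scheme described above amounts to one line of arithmetic: under round-robin striping, a byte offset maps to an OST index as sketched below (a simplification that ignores Lustre's actual layout metadata; the 4-OST stripe count is only an example).

    # Map a file offset to an OST under round-robin striping.
    def ost_for_offset(offset, stripe_size=1 * 1024 * 1024, stripe_count=4):
        stripe_index = offset // stripe_size    # which stripe the byte falls into
        return stripe_index % stripe_count      # stripes are laid out round-robin over the OSTs

    # With the default 1MB stripe size and 4 OSTs: bytes 0-1MB go to OST 0, 1-2MB to OST 1,
    # and so on; an offset of 5MB wraps around to OST 1.
    print(ost_for_offset(5 * 1024 * 1024))      # -> 1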

Spider was the largest-scale Lustre deployment [16]. Its website states that the number of clients is 52000 and the demonstrated bandwidth is 120 GB/s. If all 52000 clients access data concurrently, each can only achieve 2.36 MB/s (120 GB/s / 52000 ≈ 2.36 MB/s). In contrast, if the local throughput of each client can reach 300 MB/s (e.g. SATA 2.0), the aggregate local IO throughput of all those clients is 14.9 TB/s (52000 × 300 MB/s ≈ 14.9 TB/s). Data staging from a shared Lustre system is thus much less efficient than aggregate local access, which makes the data locality aware approach discussed below more suitable for large-scale data processing.

53 Figure 4.2: The architecture of Lustre

Copied from http://explow.com/Lustre_(file_system)

4.2 Data Locality Aware Approach

Data parallel systems, which are natively designed for data-intensive applications, adopt a different approach that sacrifices transparency for performance. In data parallel systems, the location of data is exposed by the storage system to the upper-level runtime frameworks so that data-locality aware scheduling can be achieved. Data locality is one significant advantage of data parallel systems over traditional HPC systems. It brings more flexibility to scheduling algorithms: the scheduler can bring data to compute, bring data close to compute, or bring compute to data, and it makes in-situ processing possible. For GFS/HDFS, large files are split into equally-sized blocks. MapReduce exploits embarrassing parallelism by constructing one task for each block in the map phase and running the tasks independently; map tasks do not communicate with each other. The MapReduce scheduler is data-locality aware and co-locates data and compute with best effort.

                            HDFS                                Lustre
architecture                master-slave (one master)           master-slave (multiple metadata servers)
low-level mechanism         block based                         object based
block/stripe size           64MB                                1MB
max. block/stripe targets   unlimited                           160
replication                 managed by HDFS                     relies on hardware RAID
hardware requirement        commodity hardware                  high-profile data servers
POSIX-compatible            no (can use FUSE)                   yes
coupled runtime             MapReduce                           none
design optimization         write-once-read-many, large files   efficient concurrent parallel IO
deployment privilege        non-root users can deploy           requires kernel patching
task scheduling strategy    move compute to data                move data to compute

Table 4.1: Comparison of HDFS and Lustre

Figure 4.3 shows an example of how MapReduce schedules tasks. Figure 4.3a shows the state of a system at a specific time instant. There are three map tasks, T1, T2 and T3, and three nodes (A, B and C) with idle map slots. Each data block has multiple replicas and each node has three map slots, among which those marked in red are not idle. If a data block B is marked with the same color as a task T, it means B is the input data of T. Only the nodes that have idle slots are shown. The input data of task T1 are stored on nodes A, B, and C; the input data of task T2 are stored on nodes A and B; and the input data of task T3 are stored on node A. Figure 4.3b shows an example of scheduling. Node A has one idle map slot and hosts the input data of task T1, so T1 is scheduled to A. Node B has one idle map slot and hosts the input data of task T2, so T2 is scheduled to B. The only remaining node with an idle map slot is C, where task T3 has to be scheduled; T3 needs to fetch its input data from node A. As a result, tasks T1 and T2 access their input data locally and only task T3 accesses data remotely. This assignment of map tasks saves 2/3 of the network traffic compared with the strategy where all data are staged in. Data parallel systems with good data locality can reduce the network usage of tasks and therefore host more applications running concurrently without drastic performance degradation.

(a) Initial state (b) Default scheduling

Figure 4.3: An example of MapReduce scheduling

The fact that data parallel systems enable data-locality aware scheduling does not guarantee satisfactory data locality. If the scheduling algorithms in data parallel systems cannot achieve good data locality, they may degenerate to traditional scheduling where data are always moved around. To further study data locality, one needs to resolve the following issues:

1. What is the relationship between system factors (e.g. the number of nodes, the number of tasks) and data locality?

2. How does data locality impact the execution time of jobs in single-cluster and cross-cluster environments?

3. Does the default Hadoop scheduling strategy yield optimal data locality? If it is not optimal, what is the optimal scheduling algorithm and how much does it outperform the default Hadoop scheduling algorithm?

4.3 Analysis of Data Locality In MapReduce

4.3.1 Data Locality in MapReduce

In GFS/HDFS, files are split into equally-sized blocks which are placed across nodes. In Hadoop, each node has a configurable number of map and reduce slots, which limit the maximum number of map and reduce tasks that can concurrently run on the node. When a task starts to execute, it occupies one slot; when it completes, the slot is released so that other tasks can take it. Each slot can have at most one task assigned at any time. There is a single central master node where the Job Tracker runs. The job tracker manages all slave/worker nodes and embraces a scheduler that assigns tasks to idle slots.

Data locality is defined as how close compute and input data are, and has different levels – node-level, rack-level, etc. In our work, we only focus on the node-level data locality which means compute and data are co-located on the same node. Data locality is one of the most important factors considered by schedulers in data parallel systems. Please note that here data locality means the data locality of input data. Map tasks may generate intermediate data, but they are stored locally (not uploaded to HDFS) so that data locality is naturally gained. We define goodness of data locality as the percent of map tasks that gain node-level data locality.
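The metric is simple to compute once a schedule is known; the following minimal Python sketch (with a hypothetical co_located(task, slot) helper that checks whether the slot's node stores the task's input data) illustrates the definition:

    def goodness_of_locality(assignments, co_located):
        # Fraction of scheduled map tasks whose slot sits on a node holding their input
        # data (node-level locality only, matching the definition used in this thesis).
        local = sum(1 for task, slot in assignments if co_located(task, slot))
        return local / len(assignments)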

In this thesis, the default scheduling algorithm in Hadoop is denoted by dl-sched. In Hadoop, when a slave node sends a heartbeat message and says it has available map slots, the master node first tries to find a map task whose input data are stored on that slave node. If such a task can be found, it is scheduled to the node and node-level data locality is gained. Otherwise, Hadoop tries to find a task that can achieve rack-level data locality – input data and task execution are on the same rack. If it still fails, a task is randomly picked and dispatched. Apparently dl-sched favors data locality.
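The per-heartbeat selection order of dl-sched can be sketched as follows. This is our own simplified rendering, not Hadoop's actual code; the input_nodes and rack_of helpers and the slot.node attribute are hypothetical:

    import random

    def dl_sched_pick(pending_tasks, slot, input_nodes, rack_of):
        """Pick a task for one idle slot: prefer node-local, then rack-local, then random."""
        node = slot.node
        for task in pending_tasks:                       # node-level data locality
            if node in input_nodes(task):
                return task
        for task in pending_tasks:                       # rack-level data locality
            if rack_of(node) in {rack_of(n) for n in input_nodes(task)}:
                return task
        return random.choice(pending_tasks)              # fall back to an arbitrary task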

4.3.2 Goodness of Data Locality

Firstly, we develop a set of mathematical symbols to characterize HDFS/MapReduce which are shown in Table 4.2. Data replicas are randomly placed across all nodes, which approximately matches the block placement strategy used in Hadoop. And idle slots are randomly chosen from all slots. This assumption is reasonable for modestly utilized clusters that run lots of jobs with diverse workload from multiple users. In such a complicated system, it is difficult, if not impossible, to know which slots will be released and when.

In addition, we assume that I is constant within a specific time frame and may vary across time frames.

This assumption implies that new tasks come into the system at the same rate that running tasks complete.

So the system is in a dynamic equilibrium state for small time frames. Time is divided into time frames each of which is associated with a corresponding I, which can be done through the clustering of time series subsequences.

Our goal is to study the relationship between the goodness of data locality and important system factors.

Symbol                         Description
N                              the number of nodes
S                              the number of map slots on each node
I                              the ratio of idle slots
T                              the number of tasks to be executed
C                              replication factor
IS                             the number of idle map slots (N · S · I)
p(k, T)                        the probability that k out of T tasks can gain data locality
goodness of data locality      the percent of map tasks that gain data locality

Table 4.2: Symbol definition

Obviously, the relationship depends upon scheduling algorithms. The default Hadoop scheduling algorithm dl-sched is the target of our analysis here. To simplify the mathematical deduction of closed-form formulas, we assume replication factor C and the number of slots per node S are both 1. Firstly we need to calculate p(k, T) (the probability that k out of T total tasks can achieve data locality). Each task can be scheduled to any of the N nodes, so the total number of cases is N^T. Because both the data placement and idle slot distribution are random, we can fix the distribution of idle slots without affecting the correctness of the analysis.

We simply assume that the first IS slots among all slots are idle. To guarantee that k tasks have data locality, we first choose k idle slots from total IS idle slots to which unscheduled tasks will be assigned, which gives

C_k^{IS} cases (shown as step 1 in Fig. 4.4). Then we divide all unscheduled tasks into two groups: g1 and g2.

For group g1, the input data of all its tasks are located on the nodes that have idle slots. For group g2, the input data of all its tasks are located on the nodes that have no idle slots. The input data of the tasks in group g1 need to be stored on k idle nodes so that exact k tasks can achieve data locality (note if the input data of multiple tasks are stored on the same node with only one idle slot, only one task can be scheduled to the node and other tasks will not achieve data locality). Assume group g1 has i tasks, the number of ways to choose

these tasks from the total T tasks is C_i^T, and the number of ways to distribute their input data onto the k nodes is S(i, k) (Stirling numbers of the second kind) (shown as step 2 in Fig. 4.4). The number of tasks in group

g2 is T − i, and each of them can choose among the N − IS busy nodes to store its input data, which yields (N − IS)^{T−i} cases (shown as step 3 in Fig. 4.4). Combining all of the above steps, we deduce (4.2) to calculate p(k, T). Then the expectation E can be calculated using (4.3) and the goodness of data locality R can be calculated using (4.4).

Figure 4.4: The deduction sketch for the relationship between system factors and data locality

0 \le k \le T \le IS \qquad (4.1)

p(k, T) = C_k^{IS} \cdot \frac{\sum_{i=k}^{T} C_i^{T} \cdot S(i, k) \cdot k! \cdot (N - IS)^{T - i}}{N^{T}} \qquad (4.2)

E = \sum_{k=0}^{T} k \cdot p(k, T) \qquad (4.3)

R = E / T \qquad (4.4)
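For concreteness, the closed form can be evaluated directly. The sketch below is our own helper code (the names goodness_of_data_locality and stirling2 are not from the thesis) and inherits the stated assumptions C = S = 1 and T ≤ IS from (4.1):

    from functools import lru_cache
    from math import comb, factorial

    @lru_cache(maxsize=None)
    def stirling2(n, k):
        # Stirling numbers of the second kind via the standard recurrence.
        if n == k:
            return 1
        if n == 0 or k == 0:
            return 0
        return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

    def goodness_of_data_locality(N, IS, T):
        """Evaluate (4.2)-(4.4) for C = S = 1: expected fraction of data-local map tasks."""
        def p(k):                                     # equation (4.2)
            s = sum(comb(T, i) * stirling2(i, k) * factorial(k) * (N - IS) ** (T - i)
                    for i in range(k, T + 1))
            return comb(IS, k) * s / N ** T
        E = sum(k * p(k) for k in range(T + 1))       # equation (4.3)
        return E / T                                  # equation (4.4)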

So the goodness of data locality can be accurately calculated. For cases where C and S are larger than one, the mathematical deduction is much more complicated and we are working on it. In our experiments

below we take the approach of simulation instead of accurate numerical calculation of (4.4) for two reasons: a) calculating (4.4) involves factorial and exponential operations requiring enormous computation if the operands are large; b) we have not deduced closed-form formulas for the cases where C and S are not 1. Because simulation does not rely on the closed-form formulas, the experimental results below also cover the cases where C and S are not 1.
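As a rough illustration of what such a simulation looks like when C and S are not restricted to 1, the sketch below places replicas and idle slots uniformly at random and mimics dl-sched's per-heartbeat choice for a single scheduling wave (T no larger than the number of idle slots). The function name is ours and tie-breaking details differ from Hadoop:

    import random

    def simulate_goodness(N, S, C, idle_ratio, T, trials=200):
        """Monte Carlo estimate of the goodness of data locality under dl-sched-style scheduling."""
        hits = 0
        for _ in range(trials):
            replicas = [set(random.sample(range(N), C)) for _ in range(T)]    # data placement
            slots = random.sample(range(N * S), int(N * S * idle_ratio))      # idle slots
            random.shuffle(slots)
            pending = set(range(T))
            for s in slots:                          # each idle slot "heartbeats" once
                if not pending:
                    break
                node = s // S
                pick = next((t for t in pending if node in replicas[t]), None)
                if pick is None:
                    pick = next(iter(pending))       # no local task: schedule an arbitrary one
                else:
                    hits += 1
                pending.discard(pick)
        return hits / (trials * T)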

4.4 A Scheduling Algorithm with Optimal Data Locality

Here the term optimality means the maximization of the goodness of data locality. Given a set of tasks to schedule and a set of idle slots, if a scheduling algorithm achieves the best data locality, we call it optimal. We will show that scheduling multiple tasks all at once outperforms the task-by-task approach taken by dl-sched.

4.4.1 Non-optimality of dl-sched

Fig. 4.5 demonstrates dl-sched is not optimal. Fig. 4.5(a) shows an instantaneous state of a system. There are three tasks (T1, T2 and T3) to schedule, and three nodes (A, B and C) that have idle map slots. Each data block has multiple replicas and each node has three map slots among which those marked as black are not idle. If a data block B is marked with the same color and pattern as a task T , B is the input data of T . Only the nodes that have idle slots are shown. From the graph, we can see the input data of task T1 are stored on nodes A, B, and C; the input data of task T2 are stored on nodes A and B; and the input data of task T3 are stored on node A. Fig. 4.5(b) shows an example of dl-sched scheduling. Node A has one idle map slot and it hosts the input data of task T1, so T1 is scheduled to A. Node B has one idle map slot and hosts the input data of task T2, so T2 is scheduled to B. Now the only node that has idle map slots is C and task T3 must be scheduled there. However, node C does not host the input data of task T3. To summarize, tasks T1 and

T2 gain data locality while task T3 loses data locality. But, we can easily find another way to schedule the three tasks to make all of them achieve data locality, which is shown in Fig. 4.5(c). The reason that dl-sched and its variants (e.g. fair scheduling, delay scheduling) are not optimal is that tasks are scheduled one by one and each task is scheduled without considering its impact on other tasks. To achieve a global optimum, all

unscheduled tasks and idle slots at hand must be considered at once to make global scheduling decisions.

Figure 4.5: An example showing Hadoop scheduling is not optimal
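The example in Fig. 4.5 can be checked directly with an off-the-shelf LSAP solver. Below we use SciPy's Hungarian-method implementation as a stand-in for the LSAP algorithms discussed later; the cost matrix follows the figure, with rows for tasks T1-T3 and columns for the idle slots on nodes A-C:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # 0 = the node stores the task's input data, 1 = it does not.
    cost = np.array([
        [0, 0, 0],    # T1: replicas on A, B and C
        [0, 0, 1],    # T2: replicas on A and B
        [0, 1, 1],    # T3: replica on A only
    ])

    # A task-by-task greedy choice in the style of dl-sched can end with T1->A, T2->B,
    # T3->C (total cost 1), whereas solving the LSAP over all tasks at once finds a
    # zero-cost assignment such as T1->C, T2->B, T3->A, i.e. every task is data local.
    rows, cols = linear_sum_assignment(cost)
    print(list(zip(rows, cols)), cost[rows, cols].sum())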

4.4.2 lsap-sched: An Optimal Scheduler for Homogeneous Network

We reformulate the problem into a formal definition using the symbols defined in section 4.3.2. The assignment of map tasks to idle slots is defined as a function ϕ. Given task i, ϕ(i) is the slot to which it is assigned.

Function ϕ needs to be injective to guarantee that multiple tasks are not assigned to the same idle slot. We associate an assignment cost to each task-to-slot assignment. Low assignment cost means good data locality and high assignment cost means bad data locality. Cij represents the assignment cost to assign task i to slot j, and is defined in (4.5). If a task is scheduled to a node which stores its input data, its assignment cost is 0. Otherwise, the cost is 1 (in next section, we will calculate the cost for non-local tasks based on the location of compute and input data, rather than use fixed value 1). Basically the cost matrix C measures the data locality of assigned tasks. So it is good for IO intensive jobs and needs to be enhanced for other types of jobs. Given ϕ, the total assignment cost is the summation of the assignment cost of all scheduled tasks, which is formulated in (4.6). The goal function is shown in (4.7), which tries to find a ϕ that minimizes the total assignment cost. As we showed, the function ϕ given by dl-sched is not optimal, and thus not a solution

to (4.7).

C_{ij} = \begin{cases} 0 & \text{if the input data of task } i \text{ and slot } j \text{ are co-located} \\ 1 & \text{otherwise} \end{cases} \qquad (4.5)

C_{sum}(\varphi) = \sum_{i=1}^{T} C_{i \varphi(i)} \qquad (4.6)

g = \arg\min_{\varphi} \{ C_{sum}(\varphi) \} \qquad (4.7)

We found that this problem can be converted to the well-known Linear Sum Assignment Problem (LSAP)

[7] shown below. The difference is that LSAP requires that the cost matrix be square. In our case, if T and IS are equal, matrix C is square and we can directly apply LSAP. Otherwise, LSAP cannot be directly applied and we figure out how to convert the problem to LSAP by manipulating matrix C.

Linear Sum Assignment Problem: Given n items and n workers, the assignment of an item to a worker

incurs a known cost. Each item is assigned to one worker and each worker only has one item assigned.

Find the assignment that minimizes the sum of cost.

If T is less than IS, we make up IS − T extra dummy tasks whose assignment cost is 1 no matter which slots they are scheduled to. Table 4.3a shows an example in which ti and sj represent tasks and idle slots respectively. The first T rows are from the original cost matrix. The last IS − T rows are for the dummy tasks we make up and are filled with the constant 1. Now we get an IS x IS square cost matrix and can apply LSAP algorithms to find an optimal solution. LSAP algorithms will give us an optimal assignment for all IS tasks.

Among them we just pick those that are not dummy tasks, and we get a specific ϕ (termed ϕ-lsap) for the original problem. Now let us prove that ϕ-lsap is a solution to (4.7) by using contradiction.

Proof : The assignment cost given by ϕ-lsap is Csum(ϕ-lsap) (see (4.6)). As a result, the total assignment cost given by LSAP algorithms for the expanded square matrix is Csum(ϕ-lsap) + (IS − T ). The key point is that the total assignment cost of dummy tasks is IS − T no matter where they are assigned. Assume that ϕ-lsap is not a solution to (4.7), and another function ϕ-opt gives smaller assignment cost. It implies

Csum(ϕ-opt) is less than Csum(ϕ-lsap). We use the same mechanism to create dummy tasks and extend the cost matrix. We extend function ϕ-opt to include those dummy tasks and arbitrarily map them to the

remaining IS − T idle slots. So the total assignment cost for the expanded square matrix is Csum(ϕ-opt) +

(IS − T). Because Csum(ϕ-opt) is less than Csum(ϕ-lsap), we can deduce that Csum(ϕ-opt) + (IS − T) is less than Csum(ϕ-lsap) + (IS − T). That means the solution given by the LSAP algorithm is not optimal. This contradicts the fact that LSAP algorithms give optimal solutions.

Constant 1 has been used as the assignment cost for dummy tasks. It turns out that we can choose any constant without violating optimality. The reason is that the total assignment cost of all dummy tasks is itself a constant, so all task assignments perform equally well for the dummy tasks. So what matters is the assignment of the T real tasks. This can be proved formally with the same method as above.

Table 4.3: Expanding the cost matrix to make it square. (a) T < IS: rows t1..tT come from the original cost matrix and the last IS − T rows are dummy tasks, all filled with the constant 1. (b) T > IS: columns s1..sIS come from the original cost matrix and the last T − IS columns are dummy slots, all filled with the constant 1.

For the case where T is larger than IS, we can use a similar technique to add T − IS extra columns representing dummy slots and fill them with 1, and therefore convert the original cost matrix to a square matrix. Table 4.3b shows an example. Then we can apply LSAP algorithms. After that, because dummy slots do not exist in reality, we remove the tasks that are assigned to dummy slots from the task assignment given by the LSAP algorithm and obtain the final task assignment. Its optimality can be proved in the same way. Again, any constant can be used to fill the columns of dummy slots.

We integrate LSAP into our proposed optimal scheduling algorithm lsap-sched shown in Fig. 4.6. It

naturally follows our prior discussion. Function co-locate(T,S) checks whether slot S and the input data

of task T are co-located on the same node. Function expandToSquare(C, value) expands matrix C to the

closest square matrix by adding extra rows or columns filled with value. Function lsap(C) uses an existing

LSAP algorithm to calculate the optimal assignment for cost matrix C. Function filterDummy(R) removes

assignments for dummy tasks or dummy slots and returns the valid optimal task assignment.

Input: instant system state
Output: assignment of tasks to idle map slots
Algorithm:
    TS  ← the set of unscheduled tasks
    ISS ← the set of idle map slots
    C   ← empty |TS| x |ISS| matrix
    for i = 0; i < |TS|; ++i
        for j = 0; j < |ISS|; ++j
            if co-locate(TS[i], ISS[j])
                C[i][j] = 0
            else
                C[i][j] = 1
    if C is not square: expandToSquare(C, 1)
    R = lsap(C)
    R = filterDummy(R)
    return R

Figure 4.6: Algorithm skeleton of lsap-sched
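A compact executable rendering of this skeleton is sketched below, using SciPy's linear_sum_assignment as the LSAP solver; the function and argument names are ours, and the padding constant plays the role of expandToSquare's fill value:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def lsap_sched(tasks, slots, co_located, dummy_cost=1.0):
        """Assign tasks to idle slots with optimal (node-level) data locality."""
        T, IS = len(tasks), len(slots)
        n = max(T, IS)
        # Build the cost matrix and pad it to a square one with a constant; as argued in
        # the text, any constant for the dummy rows/columns preserves optimality.
        C = np.full((n, n), dummy_cost)
        for i, t in enumerate(tasks):
            for j, s in enumerate(slots):
                C[i, j] = 0.0 if co_located(t, s) else 1.0
        rows, cols = linear_sum_assignment(C)
        # filterDummy: keep only assignments of real tasks to real slots.
        return [(tasks[i], slots[j]) for i, j in zip(rows, cols) if i < T and j < IS]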

Now we analyze when lsap-sched can be applied. Generally, the more idle slots and tasks there are, the more lsap-sched outperforms dl-sched. For the extreme case where there is only one idle slot, dl-sched and lsap-sched perform equally well. For lightly used Hadoop clusters, a large portion of slots are idle. At the start of a new job, the scheduler has multiple tasks to schedule and multiple idle slots at its disposal, so that lsap-sched performs much better. For heavily used clusters, only a small number of slots are idle and the superiority of lsap-sched is not fully demonstrated if new tasks are scheduled immediately. Instead, scheduling can be delayed by a short period to accumulate a sufficient number of idle slots before lsap-sched is applied. Tradeoffs between data locality and scheduling latency need to be made. The benefit of delayed scheduling, which results in the increase of the ratio of idle slots, can be quantified using (4.4). A simple strategy is to wait until the goodness of data locality exceeds a pre-set threshold. Real cluster traces show that maximum utilization is seldom reached: CPU utilization was only 10% in Yahoo's M45 cluster [80] and mostly below 50% in a Google cluster [29]. So on average a significant portion of slots is available when new tasks are submitted, and lsap-sched is expected to perform substantially better than dl-sched. As our experiments below illustrate, the ratio of idle slots does not need to be high for lsap-sched to yield significant performance improvement.
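One way to implement this "wait for enough idle slots" policy is to reuse the model of section 4.3.2. The sketch below is a hypothetical helper that builds on goodness_of_data_locality from the earlier sketch and therefore inherits its C = S = 1 assumption:

    def ready_for_lsap_sched(num_nodes, num_idle_slots, num_pending_tasks, threshold=0.8):
        # Delay scheduling until the goodness of data locality predicted by (4.4) for the
        # currently accumulated idle slots reaches the pre-set threshold.
        T = min(num_pending_tasks, num_idle_slots)     # (4.1) requires T <= IS
        if T == 0:
            return False
        return goodness_of_data_locality(num_nodes, num_idle_slots, T) >= threshold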

4.4.3 lsap-sched for Heterogeneous Network

In above discussion, the assignment costs of non data local tasks are set to 1 uniformly. It assumes that the data movement of different tasks incurs an identical cost, which is not reasonable for typical hierarchical network topology where switches are increasingly oversubscribed when walking up the hierarchy. The cost of data fetching depends upon where the source node and destination node are located, and should not be assim- ilated. In our work, the assignment costs of non-data local tasks are computed based on network information.

N(ISj) is the node where slot ISj resides. The input data of a non data local task may be stored on multiple nodes redundantly. If a task is assigned to ISj for execution, the storage node with the best connectivity to

N(ISj) is chosen as data source. The calculation of C(i,j) is summarized in (4.8) where DS(Ti) is the size of the input data of task Ti, Ri is the replication factor of the input data of task Ti, ND(Ti,c) is the node where c-th replica of task Ti is stored, and BW(N1,N2) is the available bandwidth between nodes N1 and N2. We can reuse (4.6) to calculate the sum of assignment costs. With a constructed cost matrix C(i,j), we want to

find the task assignment that yields the smallest sum of costs. Mathematically, the goal function is the same as (4.7).

C(i, j) = \begin{cases} 0 & \text{if the input data of task } i \text{ and slot } j \text{ are co-located} \\ \dfrac{DS(T_i)}{\max_{1 \le c \le R_i} BW(ND(T_i, c), N(IS_j))} & \text{otherwise} \end{cases} \qquad (4.8)
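In code, (4.8) amounts to picking, for each non-local candidate assignment, the replica with the best connectivity to the slot's node. The sketch below assumes hypothetical task and slot objects carrying input_size, replica_nodes and node attributes, and a bandwidth(src, dst) estimator:

    def assignment_cost(task, slot, bandwidth, co_located):
        """Cost of running `task` in `slot` under a heterogeneous network, per (4.8)."""
        if co_located(task, slot):
            return 0.0
        # Choose the replica whose node has the highest available bandwidth to the slot's node.
        best_bw = max(bandwidth(replica, slot.node) for replica in task.replica_nodes)
        return task.input_size / best_bw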

From (4.8), we can see the accuracy of pairwise bandwidth information impacts the calculation of as- signment costs. Ideally real-time network throughput information should be used. Network Weather Service

[126] can be utilized to monitor and predict network usage without injecting an overwhelming number of probing packets. One thing worth mentioning is per-node bandwidth may not equal per-task bandwidth. If a single node has multiple available idle slots, the available network bandwidth per task depends on how many non data local tasks are assigned to the node and therefore is dynamic. We can set available bandwidth conservatively to total bandwidth divided by the number of idle slots (no matter whether all of them will be occupied). Alternatively, per-task (instead of per-node) network bandwidth can be collected and used to calculate assignment costs.

Table 4.4 shows examples of cost matrix. When the numbers of tasks and idle slots are not equal, the same transformation in section 4.4.2 can be reused but 0 is used to fill dummy cells instead of 1.

Table 4.4: Expanding the cost matrix, with bandwidth-based assignment costs, to make it square. (a) T < IS: the last IS − T rows are dummy tasks we make up, all filled with 0. (b) T > IS: the last T − IS columns are dummy slots we make up, all filled with 0.

Based on above discussion, we propose a network heterogeneity aware version of lsap-sched whose algorithm skeleton is shown in Fig. 4.7. The critical difference is network bandwidth information is used to compute assignment costs while constant values were used above.

Input: instant system state
Output: assignment of tasks to idle map slots
Algorithm:
    TS  ← the set of unscheduled tasks
    ISS ← the set of idle map slots
    C   ← empty |TS| x |ISS| matrix
    for i = 0; i < |TS|; ++i
        for j = 0; j < |ISS|; ++j
            set C[i][j] according to (4.8)
    if C is not square: expandToSquare(C, 0)    # expand to a square matrix
    R = lsap(C)                                 # solve it using LSAP
    R = filterDummy(R)                          # filter out dummy tasks
    return R

Figure 4.7: Algorithm skeleton of heterogeneity aware lsap-sched

4.5 Experiments

We have conducted simulation experiments to evaluate the effectiveness of our proposed algorithms. As we do not have access to real MapReduce production clusters, for some of the experiments below we need to mimic multi-user mode by setting some factors artificially (such as slot utilization, workload, and bandwidth usage).

4.5.1 Impact of Data Locality in Single-Cluster Environments

In this test, we evaluate how important data locality is in single-cluster Hadoop systems. In other words, we want to know to what extent performance will degrade due to the deterioration in data locality. We wrote a random scheduler rand-sched which by default randomly assigns tasks to idle slots so that data locality greatly worsens. In addition, users can specify a parameter called randomness which tells rand-sched how random the scheduling should be. If its value is 100%, the scheduling will be thoroughly random; if its value is 0%, rand-sched degenerates to dl-sched. Other values yield a mixture of random and default scheduling.
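A sketch of the rand-sched policy is given below. The function and argument names are ours; dl_sched_pick(tasks, slot) stands in for the default locality-aware selection, and randomness is the knob described above:

    import random

    def rand_sched(pending_tasks, idle_slots, randomness, dl_sched_pick):
        """With probability `randomness`, ignore locality and pick a random task for a slot."""
        assignments = []
        for slot in idle_slots:
            if not pending_tasks:
                break
            if random.random() < randomness:
                task = random.choice(pending_tasks)          # locality-oblivious choice
            else:
                task = dl_sched_pick(pending_tasks, slot)    # default locality-aware choice
            pending_tasks.remove(task)
            assignments.append((task, slot))
        return assignments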

We compared the cases where dl-sched and rand-sched with randomness 0.3, 0.5, 0.7 and 1 were applied. An IO intensive application, input-sink, which mainly reads data from HDFS, was written and used in the tests.

We calculated the slowdown of rand-sched relative to dl-sched and show it in Fig. 4.8. The horizontal line y=0 is the baseline, which implies performance as good as that of dl-sched. We observe that rand-sched with randomness 1 gives the worst performance, and the slowdown is positively related to randomness.

Figure 4.8: Single-cluster performance

4.5.2 Impact of Data Locality in Cross-Cluster Environments

4.5.2.1 With high-speed interconnection among clusters

In this test, we evaluate how Hadoop performs in cross-cluster environments. We categorize deployments into three classes: single-cluster, cross-cluster and HPC-style. For cross-cluster deployments, HDFS and

MapReduce share the same set of nodes that are scattered across multiple physical clusters. For HPC-style deployments, HDFS uses nodes in one cluster and MapReduce uses nodes in another cluster so that storage and compute are totally separated. We used clusters in FutureGrid that are equipped with high-speed inter- cluster network. Each node has 8 cores and gigabit Ethernet. Single-cluster deployments used 10 nodes in cluster India; cross-cluster deployments used 5 nodes in India and 5 nodes in Hotel; HPC-style deployments used 10 nodes in India for MapReduce and 10 nodes in Hotel for HDFS. Again, input-sink was used as test application, and results are shown in Fig. 4.9a. The plot matches our intuitive expectation. HPC-style deployment thoroughly loses data locality and performs the worst. dl-sched in single-cluster deployments performs the best. However, dl-sched in cross-cluster deployments performs better than rand-sched in single- cluster deployments. The reason is those clusters only see light use and the interconnection between clusters is fast enough to match the speed of local network to fulfill read/write requests.

4.5.2.2 With drastically heterogeneous network

We set up a unified Hadoop cluster across multiple physical clusters by building a virtual network overlay with ViNe [123]. To examine to what extent performance is impacted by the throughput of inter-cluster links, the ViNe overlay provided low throughput: only 1-10 Mbps. We compared rand-sched with dl-sched and show results in

Fig. 4.9b. The loss of data locality slows down the execution by thousands of times. The results also apply to the case where the inter-cluster network is fast but heavily oversubscribed so that on average each application can only get a fairly small share. This demonstrates that Hadoop is not optimized for fairly heterogeneous networks (e.g. Wide-Area Network) so that Hadoop deployments over geographically distributed clusters with oversubscribed interconnection should be carefully investigated.

Figure 4.9: Cross-cluster performance. (a) With high-speed cross-cluster network; (b) with drastically heterogeneous network.

4.5.3 Impact of Various Factors on Data Locality

In this set of tests, we evaluate how different factors impact the goodness of data locality for dl-sched.

As we mentioned in section 4.3.2, simulation is conducted here instead of direct numeric calculation. This allows us to circumvent the constraints imposed in our prior mathematical deduction (i.e., C and S are 1).

The investigated factors include the number of tasks, the number of map slots per node, replication factor, the number of nodes and the ratio of idle map slots. The configuration is shown in table 4.5. For each test, we varied one factor while fixing all others. All results are shown in Fig. 4.10.

Fig. 4.10a shows how the goodness of data locality changes with the number of tasks. We observe that the goodness of data locality decreases as the number of tasks is increased initially. When the number of tasks becomes 2^7 (128), data locality is the worst. As the number of tasks is increased further beyond 2^7, the goodness of data locality increases sharply. The degree of increase diminishes as there are more and more tasks.

Parameter               Default value    Value range when tested       Env. in Delay Sched. Paper
num. of nodes           1000             [300, 5000]; step 100         1500
slots per node          2                [1, 32]; step 1               2
num. of tasks           300              2^0, 2^1, ..., 2^13           2^4, ..., 2^13
ratio of idle slots     0.1              [0.01, 1]; step 0.02          0.01
replication factor      3                [1, 20]; step 1               3

Table 4.5: System Configuration

Increasing the number of slots per node yields more idle slots (the ratio of idle slots is fixed). Its impact on data locality is shown in Fig. 4.10b. Data locality improves drastically as the number of slots per node increases initially: 5, 8 and 10 slots per node yield goodness of data locality of 50%, 80% and 90% respectively. Considering that modern server nodes have multiple cores and multiple processors, running 5-10 tasks concurrently on each node is reasonable. So specifying more slots per node improves not only resource utilization but also data locality. Prior research investigated the impact of concurrently running tasks on resource usage, but has not explored its impact on data locality. Our result quantifies the relationship and serves as guidance for users to tune the system.

The impact of replication factor is shown in Fig. 4.10c. As we expect, replication factor has positive impact: the increase of replication factor yields better data locality. However, the relationship is not linear.

The degree of improvement decreases with increasing replication factor. More storage space is required as replication factor is increased and that relationship is linear. Fig. 4.10c can help system administrators choose the best replication factor that balances storage usage and data locality, because it tells how much data locality

is lost/gained when replication factor is decreased/increased. As replication factor gets larger and larger, the benefit becomes more and more marginal. Based on how scarce storage space is, the sweet spot of replication factor can be carefully chosen according to Fig. 4.10c to achieve the best possible data locality, compared with the case where replication factor is arbitrarily chosen.

Fig. 4.10d shows the impact of varying the ratio of idle slots. When the ratio of idle slots is around 40% and therefore the utilization ratio is 60%, the goodness of data locality is over 90%. This means the utilization ratio of all slots need not be very low to get reasonably good data locality, which is a little counter-intuitive.

Even if most of the slots are busy, the goodness of data locality can still reach around 30% given that many tasks are to be scheduled, because the scheduler can choose the tasks that can achieve the best data locality among all unscheduled tasks at hand.

A general intuition is that as more nodes are added to a system, the performance usually should be improved. However, the degree of performance improvement is not necessarily linear with the number of nodes. In this test, we increased the number of nodes from 300 to 5000 and the results are shown in Fig.

4.10e. Surprisingly, the goodness of data locality drops as we add more nodes initially. When there are around 1500 nodes, the goodness of data locality becomes the lowest. Beyond 1500, the goodness of data locality is positively related to the number of nodes. To figure out why 1500 is the stationary point, we calculated the ratio of the number of idle slots to the number of tasks, which is used as the x-axis in Fig.

4.10f. Data locality is the worst when the numbers of idle slots and tasks are equal. We also redraw Fig. 4.10a using the same transformation and present the result in Fig. 4.10f. The two curves in Fig. 4.10f have similar shapes. From these two plots, we can see that data locality deteriorates sharply when there are fewer idle slots than tasks and the ratio between them increases. Under these circumstances, tasks need to be scheduled in multiple waves for all of them to run. For each wave, the scheduler can cherry-pick from the remaining tasks those that can achieve the best data locality. As the number of idle slots gets close to the number of tasks, the freedom of cherry-picking is decreased because in each wave more tasks need to be scheduled. The freedom is totally lost when the numbers of tasks and idle slots are equal, because all tasks need to be scheduled in one wave. When there are more idle slots than tasks, the scheduler can cherry-pick the slots that will yield the best data locality. To summarize, when there are fewer/more idle slots than tasks, the scheduler can cherry-pick tasks/idle slots among all possible assignments to obtain the best locality. The degree of cherry-picking freedom increases as the difference between the numbers of idle slots and tasks gets

larger. Another observation is that the curve is not symmetric with respect to the vertical line ratio=1. The loss of data locality when the ratio grows towards 1 is much faster than the regaining of data locality when the ratio grows beyond 1. When the number of tasks is 40 times that of idle slots, the goodness of data locality is above 90%; while it is only around 50% when the number of idle slots is 200 times that of tasks.

Tests in Fig. 4.10b, 4.10d and 4.10e all result in the change of idle slots with the number of tasks fixed, but they have different curves. The critical difference is that in Fig. 4.10e the number of nodes is changed while in Fig. 4.10b and 4.10d the number of nodes is constant. The difference of those curves originates from the fact that tasks are scheduled to slots while input data are distributed to nodes. For Fig. 4.10b and 4.10d, the distribution of data is constant and task scheduling varies according to the change of the number of idle slots. Having more nodes means the input data of a set of tasks are more spread out, which has a negative impact on data locality. For Fig. 4.10e, both data distribution and idle slot distribution vary. Increasing the number of slots per node is not equivalent to adding more nodes in terms of data locality.

To verify how close our simulation is to the real system in terms of accuracy, we compared a real trace and our simulation result. In [131], the authors analyzed the trace data collected from Facebook production

Hadoop clusters. They found that the system is more than 95% full 21% of the time and 27.1 slots are released per second. The total number of slots is 3100, so the ratio of idle slots is 27.1/3100 ≈ 1%. We could not find the number of slots per node in the paper. So we assumed the Facebook cluster uses the default

Hadoop setting: 2. Then we can deduce that there are 3100/2 ≈ 1500 nodes in the system. The replication factor is not explicitly mentioned in the paper for the trace, so we assumed the default Hadoop setting of 3. The cluster configuration is summarized in the last column of Table 4.5. The authors measured the relationship between the percent of local map tasks and job size (number of map tasks). We duplicate their plot in Fig. 4.10g, in which both node locality and rack locality are shown. We ran simulation tests with the same configuration and show results in Fig. 4.10h. By comparing the two plots, we observe that our simulation gives reasonably good results. Firstly, the curves are similar and both have an S shape. Secondly, the concrete y values are also close. So the assumptions we made are reasonable for real clusters. Accurate comparison is impossible unless we have the raw data of Fig. 4.10g. As we said, although the ratio of idle slots in real systems is not constant across time, we can divide the whole time span into shorter periods for each of which the ratio of idle slots is approximately constant and our simulation can be conducted.

Figure 4.10: Impact of various factors on the goodness of data locality, and comparison of the real trace and the simulation result. (a) Vary the number of tasks; (b) vary the number of slots per node; (c) vary replication factor; (d) vary the ratio of idle slots; (e) vary the number of nodes; (f) redraw (a) and (e) using different x-axes; (g) real trace copied from [131]; (h) simulation result with similar configuration.

4.5.4 Overhead of LSAP Solver

We measured the time taken by lsap-sched to compute the optimal task assignment in order to understand the overhead of solving LSAP. We varied the number of tasks from 100 to 3000 with step size 400, and the number of idle slots was equal to the number of tasks for each test. These settings represent small-sized to moderate-sized clusters. The corresponding cost matrices were constructed and fed into the LSAP solver. It took 7ms, 130ms, 450ms, and 1s for the LSAP solver to find optimal solutions given matrices of sizes 100x100, 500x500, 1700x1700 and 2900x2900. So the overhead is acceptable in small-sized to medium-sized clusters.

Note that in our tests the values in the matrices were randomly generated, which eliminates the possibility of mining and exploiting useful patterns in the cost distribution. In reality, the cost of data movement exhibits locality for typical hierarchical network topologies, which can potentially be used to speed up the computation.

4.5.5 Improvement of Data Locality by lsap-sched

We have shown that dl-sched is not optimal. However, it is not yet clear how far from optimal it is.

In this experiment, we ran simulations to measure how close dl-sched is to the optimum. The reasons why we run simulations rather than use closed-form formulas for dl-sched have been explained in section 4.3.2.

We consider the case where the number of tasks to schedule is no greater than the number of idle slots.

In this test, we evaluate how lsap-sched impacts the percentage of data local tasks. In the simulated system, the number of nodes varied from 100 to 500 with step size 50, and each node had 4 slots. The replication factor was 3. The ratio of idle slots was fixed to 0.5 and enough tasks were generated to utilize all idle slots. The results are shown in Fig. 4.11. Fig. 4.11a shows the goodness of data locality for dl-sched and lsap-sched.

Obviously, the goodness of data locality is fairly stable for both algorithms: dl-sched achieves 83% while lsap-sched achieves 97%. Their differences are shown in Fig. 4.11b, which implies lsap-sched increases the goodness of data locality by 12% - 14%. This indicates that lsap-sched consistently and significantly outperforms dl-sched when the system is scaled out. In addition, we observe that the improvement oscillates in Fig. 4.11b. Our conjecture of the cause is that the number of all possible data and slot distributions is gigantic and only a portion of them was covered in our tests.

Then we varied replication factor from 1 to 19 and fixed the number of nodes to 100. The goodness of data locality was measured and is shown in Fig. 4.12a. The increase of replication factor yields substantial data locality improvement for both lsap-sched and dl-sched; and lsap-sched can more efficiently explore the increasing data redundancy and thus achieve better data locality. Common replication factors 3 and 5 yield surprisingly high data locality 72% and 88% respectively for lsap-sched.

Finally, we set the total number of idle slots to 100 and increased the number of tasks from 5 to 100 so that they used more and more idle slots. Results are shown in Fig. 4.12b. Data locality degrades slowly as more tasks are injected into the system, and the degradation of lsap-sched is much less severe than that of dl-sched. When the number of tasks is much smaller than that of idle slots, the scheduler has great freedom

of picking the best slots to assign tasks. As their numbers become close, more tasks need to be scheduled in one "wave" and the cherry-picking freedom is gradually attenuated. This explains why the increase of the number of tasks has a negative impact on data locality.

Figure 4.11: Comparison of data locality with the number of nodes varied. (a) Data locality of dl-sched and lsap-sched; (b) percent of improvement.

Figure 4.12: Comparison of data locality with the replication factor and the number of tasks varied. (a) Varied replication factor; (b) varied the number of tasks.

4.5.6 Reduction of Data Locality Cost

In above tests, we measured the percentage of data local tasks. In reality, performance depends upon not only the goodness of data locality, but also the incurred data movement penalty of non data local tasks. In this test, we measure the overall data locality cost (DLC) to quantify the performance degradation brought by non data local tasks. The DLC of any non data local task was set to 1 regardless of the proximity between the

location of compute and input data, so we call it lsap-uniform-sched. The same test environment as above is used. Fig. 4.13a shows the absolute DLC. As the number of nodes is increased, the DLC of dl-sched increases much faster than that of lsap-uniform-sched. So lsap-uniform-sched is more resilient to system scale-out than dl-sched. We also computed the DLC reduction of lsap-uniform-sched against dl-sched, and show results in

Fig. 4.13b. We observe that lsap-uniform-sched eliminates 70% - 90% of the DLC of dl-sched.

Figure 4.13: Comparison of data locality cost (with equal network bandwidth). (a) Overall DLC; (b) reduction of DLC (in percent).

Previously, constant value 1 was used as the DLC of non data local tasks, which does not reflect the fact that pairwise network bandwidths are not uniform (e.g. intra-rack throughput is usually higher than cross-rack throughput). In this test, we complicate the tests by setting non-constant costs. We simulated a cluster where each rack has 20 nodes. Again the DLC of data local tasks is 0. The DLC of rack local tasks

(i.e. computation and input data are co-located on the same rack) follows a Gaussian distribution with mean and standard deviation being 1.0 and 0.5 respectively. The DLC of remote tasks follows another Gaussian distribution with mean and standard deviation being 4.0 and 2.0 respectively. This setting matches the reality that cross-rack data fetching incurs higher cost than intra-rack data fetching. For lsap-uniform-sched, the average of intra-rack and cross-rack DLC is used (which is 2.5). We varied the total number of nodes from

100 to 500 and compared dl-sched, lsap-uniform-sched and lsap-sched. Firstly the ratio of idle slots was set to 50% and DLC is shown in Fig. 4.14. Lsap-sched outperforms dl-sched significantly by up to 95%, and outperforms lsap-uniform-sched by up to 65%. Comparing Fig. 4.13 and Fig. 4.14, we observe that rack topology does not result in performance degradation. Dl-sched is rack aware in the sense that rack local tasks

are preferred over remote tasks if there are no node local tasks. So it avoids, with best effort, assigning tasks to nodes where they would need to fetch input data from other racks. Lsap-sched is naturally rack aware because the high cross-rack data movement cost prohibits non-optimal task assignments. So both dl-sched and lsap-sched can effectively utilize the network topology information. Then we decreased the ratio of idle slots from 50% to 20%, which implies a smaller number of idle slots. Results are shown in Fig. 4.15.

Compared with Fig. 4.14, the DLC of dl-sched is decreased and the DLC of lsap-sched is increased, so that the performance superiority of lsap-sched over dl-sched becomes less significant which is between 60% and

70%. Lsap-sched outperforms lsap-uniform-sched by 40% - 50%. When there are only a small number of idle slots, the room of improvement brought by lsap-sched is minor. For the extreme case where there is only one idle slot, dl-sched and lsap-sched become equivalent approximately. The more available resources and tasks there are, the more lsap-sched reduces DLC. The utilization of typical production clusters rarely reaches up to 80% [22, 23]. So we believe lsap-sched can offer substantial benefit in moderately-loaded clusters.

Figure 4.14: Comparison of DLC with 50% idle slots. (a) Overall DLC; (b) reduction of DLC (in percent).

Then we fixed the number of nodes to 100 and increased replication factor from 1 to 13 with step size 2.

The other settings were identical to the test above except each node has 1 idle slot. As replication factor is increased, we expect positive impact on DLC because the possibility of achieving better data locality increases theoretically. Test results are shown in Fig. 4.16. Firstly, DLC drastically decreases as replication factor is increased initially from small values, and the decrement of DLC becomes less significant as replication factor gets larger and larger. Secondly, when replication factor is 1, all algorithms perform comparably. Lsap- sched and lsap-uniform-sched have much faster DLC decrease than dl-sched as replication factor grows,


and it almost thoroughly eliminates DLC when the replication factor is larger than 7. Fig. 4.16b shows that a low replication factor (e.g., 3) is sufficient for lsap-sched to outperform dl-sched by over 50%. Note that the number of slots per node was set to 1 in this test; increasing it can bring larger improvement.

Figure 4.15: Comparison of DLC with 20% idle slots. (a) Overall DLC; (b) reduction of DLC (in percent).

Figure 4.16: Comparison of DLC with the replication factor varied. (a) Overall DLC; (b) reduction of DLC (in percent).

In summary, our algorithms can improve data locality and reduce DLC significantly. Network bandwidth-based cost specification yields much better performance than constant costs.

4.6 Integration of Fairness

4.6.1 Fairness and Data Locality Aware Scheduler: lsap-fair-sched

We next investigate the integration of fairness into lsap-sched. Both the capacity scheduler [4] and the fair scheduler [132] take the same approach: jobs are organized into different groups by appropriate criteria (e.g. user-based, domain-based, or pool-based). We adopt this approach as well.

We do not enforce strict fairness, which would require that no group ever use more than its ration, because doing so results in the waste of resources. We loosen the constraint: if there are enough idle slots to run all tasks, we schedule them immediately to make full use of all resources, even if some of the groups have used up their rations. If idle slots are insufficient, we selectively run tasks with the aim of complying with the ration specifications.

We enhance lsap-sched to support fairness by carefully tuning the cost matrix C. The assignment cost of a task can be positively related to the resource usage of the group the task belongs to. In other words, for groups that use up or overuse the allocated capacity, their tasks have high assignment costs so that the scheduler does not favor them. Oppositely, the assignment costs of tasks from groups which underuse their allocated capacity are low so that they get higher priority.

Let G represent the set of groups that a system administrator configures for a cluster, and i-th group is Gi. Each group contains some number of tasks and each task can only belong to exactly one group. Given a task

T, function group(T ) returns the group which T belongs to. Each group is assigned a weight/ration w which is the portion of map slots allocated to it. The sum of the weights of all groups is 1.0, which is formulated in

(4.9). At time t, rti(t) is the number of running tasks belonging to group Gi. Formula (4.10) calculates the ratio of map slots used by group Gi among all occupied map slots, which measures the real resource usage ratio of each group. For group Gi,the desired case is that si and wi are equal, which implies real resource usage exactly matches the configured share. If si is less than wi, group Gi can have more tasks scheduled immediately. If group Gi has used its entire ration, to schedule more tasks, it needs to wait until some of its tasks complete or there are sufficient idle slots to run all tasks. A Group Fairness Cost GFC is associated with each group to measure its “priority” of scheduling and calculated via (4.11). Groups with low GFC have high priority so that their tasks are considered before the tasks from groups with high GFC.


(a) Fairness-favored:       FC range [0, 100], DLC range [0, 20]
(b) Data locality-favored:  FC range [0, 100], DLC range [0, 200]
(c) Both-favored:           FC range [0, 100], DLC range [0, 100]

Table 4.6: Examples of how tradeoffs are made (FC: fairness cost; DLC: data locality cost)

Data locality sometimes conflicts with fairness. For example, it is possible that the unscheduled tasks that can achieve data locality are mostly from groups that have already used up their rations, and thus we get into a dilemma where tradeoffs between fairness and data locality must be made. To integrate data locality and fairness, we divide the assignment cost into two parts: Fairness Cost (FC) and Data Locality Cost (DLC) (shown in (4.12)). From the aspect of fairness constraints, FC implies the order in which tasks should be scheduled: tasks with low FC should be scheduled before tasks with high FC. The range of FC is denoted by [0, FCmax]. DLC reflects the overhead of data movement and has the same meaning as the cost definition described above. The weights of FC and DLC can be implicitly adjusted by carefully choosing the value ranges.

Table 4.6 gives examples of how fairness-favored scheduling, data locality-favored scheduling and both-favored scheduling can be achieved. The range of FC is [0, 100] for all the examples, while that of DLC varies. DLC with range [0, 20] makes the scheduler favor fairness because FC has a larger impact on the total assignment cost. DLC with range [0, 200] makes the scheduler favor data locality because the loss of data locality bumps up the total assignment cost significantly. DLC with range [0, 100] makes the scheduler favor both fairness and data locality, because the loss of data locality and the loss of fairness impact the overall assignment costs to the same extent.

The above example shows how data locality and fairness can be balanced. We need to quantitatively determine the FC and DLC of tasks dynamically. Formula (4.13) shows how to calculate DLC, in which α is a configuration parameter fed by system administrators and implicitly controls the relative weight of DLC. If

α is small, FC is dominant and the scheduler favors fairness. If α is large, DLC stands out and the scheduler favors data locality. If α is medium, FC and DLC become equally important. The calculation of FC is trickier and more subtle. As we mentioned, a GFC is associated with each group. One simple and intuitive strategy

is, for each group, to set the FC of all its unscheduled tasks to its GFC. This implies all unscheduled tasks of a group have identical FC, and thus the scheduler is inclined to schedule all or none of them. Consider the scenario where FC dominates. Initially a group Gi underuses its ration just a little and has many unscheduled tasks. If group Gi has the lowest GFC, all its tasks naturally have the lowest FC and are scheduled to run before other tasks, so that group Gi uses significantly more resources. After scheduling, the resource usage of group Gi changes from slight underuse to heavy overuse. The reason why resource usage oscillates between underuse and overuse is that the tasks in each group, no matter how many there are, are assigned the same FC.

Instead, for each group we calculate how many of its unscheduled tasks should be scheduled based on the number of idle slots and its current resource consumption, and set only their FC to GFC. In (4.14) AS is the total number of all slots and AS·wi is the number of map slots that should be used by group Gi. Group Gi already has rti tasks running so we have AS·wi-rti slots at disposal (termed sto – Slots To Occupy). Because

Gi can use only stoi more slots, accordingly the FC of at most stoi tasks is set to GFCi and that of other tasks is set to a larger value (1-wi)·β (β is the scaling factor of FC). So the tasks of each group do not always have the same FC. The number of unscheduled tasks for group Gi is denoted by uti. If uti is greater than stoi, we need to decide how to select stoi tasks out of uti tasks. Then data locality comes into play, and the tasks that can potentially achieve data locality are chosen. Details are given in the proposed algorithm below. Note parameters α and β, which are used to balance data locality and fairness, are specified by system owners based on their performance requirement and desired degrees of fairness.

\sum_{i=1}^{|G|} w_i = 1 \quad (0 < w_i \le 1) \qquad (4.9)

s_i(t) = \frac{rt_i(t)}{\sum_{j=1}^{|G|} rt_j(t)} \quad (1 \le i \le |G|) \qquad (4.10)

GFC_i = \frac{s_i}{w_i} \cdot 100 \qquad (4.11)

C_{ij} = FC(i, j) + DLC(i, j) \qquad (4.12)

DLC(i, j) = \begin{cases} 0 & \text{if data locality is achieved} \\ \alpha \cdot \dfrac{DS(T_i)}{\max_{1 \le c \le R_i} BW(ND(T_i, c), N(S_j))} & \text{otherwise} \end{cases} \qquad (4.13)

sto_i = \max(0, AS \cdot w_i - rt_i) \qquad (4.14)
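The bookkeeping in (4.10), (4.11) and (4.14) is straightforward to compute per scheduling round; a minimal sketch (our own function and argument names) is:

    def group_fairness_state(weights, running_tasks, all_slots):
        """Per-group share s_i (4.10), fairness cost GFC_i (4.11) and slots-to-occupy sto_i (4.14)."""
        total_running = sum(running_tasks)
        shares = [rt / total_running if total_running else 0.0 for rt in running_tasks]
        gfc = [s / w * 100 for s, w in zip(shares, weights)]
        sto = [max(0, all_slots * w - rt) for w, rt in zip(weights, running_tasks)]
        return shares, gfc, sto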

Based on the above discussion, the scheduling algorithm lsap-fair-sched is proposed and shown below. The main difference from lsap-sched is how assignment costs are calculated. Lines 7-8 find the set of nodes with idle slots. Lines 10-12 find the set of tasks whose input data are stored on nodes with idle slots; these tasks have the potential to achieve data locality, while all other tasks will definitely lose data locality in the current scheduling round. Lines 13-16 calculate sto for all groups. Lines 18-27 calculate task FC. Lines 29-33 calculate DLC. Line 35 adds together matrices FC and DLC to form the final cost matrix, which is expanded to a square matrix in line 36. After that, an LSAP algorithm is used to find the optimal assignment, which is subsequently filtered and returned.

Algorithm skeleton of lsap-fair-sched
Input:
    α : DLC scaling factor for non data local tasks
    β : FC scaling factor for tasks that are beyond their group allocation
Output: assignment of tasks to idle map slots
Functions:
    rt(g): returns the set of running tasks that belong to group g
    node(s): returns the node where slot s resides
    reside(T): returns the set of nodes that host the input data of task T
Algorithm:
 1  TS  ← the set of unscheduled tasks
 2  ISS ← the set of idle map slots
 3  w   ← rations/weights of all groups
 4  ut  ← the number of unscheduled tasks for all groups
 5  gfc ← GFC of all groups calculated via (4.11)
 6  INS ← ∅                          # the set of nodes with idle slots
 7  for slot in ISS:
 8      INS ← INS ∪ node(slot)
 9  DLT[1:|G|] = ∅                   # tasks that can potentially gain data locality
10  for T in TS:
11      if reside(T) ∩ INS ≠ ∅:
12          DLT[group(T)] = DLT[group(T)] ∪ T
13  for i in 1:|G|
14      diff = w[i] · AS - rt[i]
15      if diff > 0: sto[i] = min(diff, ut[i])
16      else: sto[i] = 0
17
18  fc[1:|TS|][1:|ISS|] = 0          # fill with default value
19  for i in 1:|G|
20      tasks = G[i]                 # a list of tasks in group i
21      NDLT = tasks - DLT[i]        # non-local tasks
22      fc[tasks] = β·(1-w[i])       # default value
23      if |DLT[i]| ≥ sto[i]:
24          tasks = DLT[i][1:sto[i]]                      # choose a subset
25      else if ut[i] > sto[i]:
26          tasks = DLT[i] ∪ NDLT[1:(sto[i]-|DLT[i]|)]
27      fc[tasks] = gfc[i]           # assign GFC to some tasks
28
29  dlc[1:|TS|][1:|ISS|] = 1
30  for T in ∪DLT[i]:
31      for j in 1:|ISS|
32          if co-locate(T, ISS[j]): dlc[T][j] = 0
33          else: dlc[T][j] = α·DS(T)/BW(T, ISS[j])
34
35  C = fc + dlc
36  if C is not square: expandToSquare(C, 0)
37  R = lsap(C)
38  R = filterDummy(R)
39  return R

In the above strategy, some tasks may get starved, although the possibility is remote (e.g. tasks with bad data locality may be queued indefinitely). FC and DLC can be reduced for tasks that have been waiting in the queue for a long time, so that they tend to be scheduled at subsequent scheduling points.

The Dryad scheduler Quincy also integrates fairness and data locality [74]. In that paper, fair and unfair sharing, with and without preemption, are discussed and evaluated. Our approach is much simpler, and below we evaluate continuous tradeoffs between data locality and fairness instead of the two extreme cases of fair and unfair sharing. In addition, we believe Quincy suffers from the performance oscillation issue discussed above (i.e. a job which has many unscheduled tasks and underuses its allocation slightly is likely to overuse its allocation significantly in subsequent scheduling when many slots become available).

4.6.2 Evaluation of lsap-fair-sched

We have shown that theoretically fairness and data locality can be integrated together by carefully setting

Fairness Cost and Data Locality Cost. In this experiment, we conducted a series of simulations to evaluate our proposed algorithm lsap-fair-sched. For each group, formula (4.15) calculates the fairness distance between the actual resource allocation si and the desired allocation specified via the weight wi. A value of 0 indicates that the group uses exactly its ration. If its value is greater than 0, the resource usage of the group either exceeds or falls short of its ration. So its value indicates the compliance with administrator-provided allocation policies, and smaller is usually better. Formula (4.16) calculates the mean fairness distance of all groups and serves as a metric to measure the fairness of resource allocation. At the initial time instant 0, the fairness distance is denoted by d(0). Then submitted tasks are scheduled, and at time t the fairness distance becomes d(t). If the scheduler is fairness-aware, usually d(t) should be smaller than d(0), which implies fairness is improved.

Given an initial state, we use d(0) − d(t) to measure to what extent lsap-fair-sched improves fairness, and larger is better.

d_i(t) = |s_i(t) - w_i| / w_i \qquad (4.15)

d(t) = \frac{\sum_{i=1}^{|G|} d_i(t)}{|G|} \qquad (4.16)
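Equations (4.15) and (4.16) translate directly into a small helper (the function name is ours):

    def mean_fairness_distance(shares, weights):
        # (4.15)-(4.16): mean normalized distance between actual shares s_i(t) and rations w_i;
        # 0 means every group uses exactly its ration, and smaller is generally better.
        return sum(abs(s - w) / w for s, w in zip(shares, weights)) / len(weights)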

In our tests, there were 60 nodes; each node had 1 slot; half of all slots were idle; replication factor was

1; and there were 30 running tasks and 90 tasks to schedule. In addition, there were 5 groups to which tasks belong. Weights for groups were {2^0, 2^1, 2^2, 2^3, 2^4} and were normalized to {2^0/31, 2^1/31, 2^2/31, 2^3/31, 2^4/31} so that they add up to 1.0. The groups to which running tasks belong were randomly assigned. The DLC of non-local tasks was varied and results are shown in Fig. 4.17. Initially, DLC is small compared with FC so that FC dominates the total assignment cost and lsap-fair-sched improves fairness most. Gradually, as the

DLC of non-local tasks increases, data locality gains larger weight so that data locality improves and fairness deteriorates. After the DLC of non-local tasks gets sufficiently large, data locality becomes the dominant factor so that scheduling favors data locality mainly. Another observation is that improvement/ deterioration of data locality/fairness is not smooth, and the curves are staircase shaped. During the continuous increase of DLC, not every small increment makes DLC become dominant. There are some critical steps that cause

“phase transition” and make the assignment costs of some tasks become larger than that of other tasks that

had larger cost before, so that data locality becomes dominant in scheduling. Conversely, a non-critical increase is not sufficient to influence scheduling decisions.

Figure 4.17: Tradeoffs between fairness and data locality

4.7 Summary

The overall goal of the work discussed in this chapter is to investigate data locality in depth for data parallel systems, among which GFS/MapReduce is representative and thus our main research target. We have mathematically modeled the system and deduced the relationship between system factors and data locality.

Simulations were conducted to quantify the relationship, and some insightful conclusions have been drawn which can help to tune Hadoop effectively. We will extend the model to integrate more factors that are critical to real clusters (e.g. non-assimilation of system state). In addition, the non-optimality of default Hadoop scheduling has been discussed, and an optimal scheduling algorithm based on LSAP has been proposed to give the best data locality. We ran intensive experiments to measure how our proposed algorithm outperforms default scheduling and to demonstrate its performance superiority. The above research uses data locality as a performance metric and the target of optimization. Besides that, we investigated how data locality impacts the user-perceived metric of system performance: job execution time. Three scenarios, namely single-cluster, cross-cluster and HPC-style setups, have been discussed, and real Hadoop experiments were conducted. They show that data locality is important for single-cluster deployments. They also show the inability of Hadoop to tackle significant network heterogeneity and the importance of the inter-cluster connection to performance.

5

Automatic Adjustment of Task Granularity

In parallel computing, one critical issue is deciding the optimal task granularity for applications. There is no single one-size-fits-all solution. The optimal setting depends upon many factors, such as hardware, application workload and scheduling algorithms. With small granularity, more tasks are generated for the same amount of work and the run time of each task is shorter than with large granularity, which inevitably increases the ratio of task management overhead to execution time. Meanwhile, many small independent tasks bring better parallelism than a few large tasks. Task granularity influences load balancing too: small granularity gives more flexibility for load balancing. These tradeoffs are summarized in Table 5.1.

                     Management Overhead   Concurrency   Load Balancing
Small granularity    High                  High          Easy
Large granularity    Low                   Low           Hard

Table 5.1: Trade-off of task granularity

5.1 Analysis of Task Granularity in MapReduce

Tasks are schedulable entities and map operations must be organized as tasks for execution. In each map task, the same map operation is applied to all key-value pairs within the input data. Each key-value pair is called a record. The number of records processed by a map task defines the task granularity. The MapReduce model itself does not impose any constraint on how map operations are grouped into tasks. Theoretically, the map operations of a job can be grouped arbitrarily without affecting correctness. However, the grouping affects the efficiency of execution. The key issue is how to partition the input data into blocks, each of which is processed by one task, to achieve good load balancing, high concurrency and low management overhead. To maximize performance, load imbalance should be avoided and the tradeoff between concurrency and overhead must be considered.

In Hadoop, each map task processes one data block stored in HDFS. By default, the block size is 64MB, and system administrators can tune this parameter based on their requirements. For data of the same size, a small block size results in more blocks, which means more metadata operations are required to access the data. In other words, a small block size imposes heavier overhead on the namenode, and thus the maximum number of applications it can serve concurrently is drastically reduced. So HDFS is not optimized for managing small blocks, and the recommended block size setting is 64MB, 128MB or larger. In addition, if the data size of a task is too small, the overhead of preparing for, starting up and cleaning up the task may dominate the overall run time. On the contrary, if blocks are too big, task granularity is coarse and load balancing and concurrency deteriorate. No matter how big each block is, it is fixed once given and task granularity is consequently fixed. The sizes of records may vary, so the numbers of records stored in different blocks may differ even if the block sizes are the same. This simple and intuitive implementation strategy has several drawbacks.

Firstly, it limits the parallelism that can be achieved. The number of map tasks is fixed given an input data size, an input format and a block size. This imposes a limitation on the concurrency of a parallel job, because even if the number of available nodes is larger than the number of map tasks, not all available nodes can be utilized. Traditional load balancing techniques for distributed systems [113][101] do not help here because they assume that tasks are given and cannot be modified.

Secondly, MapReduce assumes that all map tasks of a job undertake the same amount of work. This assumption may not hold, for several reasons. The nature of a map operation may result in skewed execution time even if map tasks process the same amount of data. In addition, each task may process data of different sizes if a user-defined input format is used. Moreover, map tasks may slow down because of hanging processes, software bugs, and system fluctuation. In clusters, the underlying hardware may be heterogeneous, and the time taken to run a task may differ drastically depending on the capability of the node the task is dispatched to. To sum up, the work required by different tasks cannot be assumed to be equal even if they belong to the same parallel job.

Cluster resource usage varies depending on workload characteristics. Usually servers are neither completely idle nor fully loaded. A study [29] done by Google shows that server utilization is between 10% and 50% most of the time, based on profiling results of 5000 servers during a six-month period. As a result, scheduling algorithms should fully exploit parallelism to utilize available resources. Also, skewed task execution time has been observed in real studies. In a study of parallel BLAST, one task took about 18 hours to complete while other tasks took 30 minutes on average [88].

The above two drawbacks prevent Hadoop from making full use of available resources even when they are idle. We mitigate this by dynamically re-organizing map tasks according to resource availability. Our goal is to minimize the average job turnaround time, which is defined as the time between job submission and job completion. This metric directly reflects how users perceive the performance of a system, in contrast to throughput, which measures performance from the perspective of system owners. Analysis of data collected from real Hadoop clusters shows that most Hadoop jobs are map-only [80]. So in our study, we only consider map-only jobs. We propose the Bag-of-Divisible-Tasks (BoDT) model and two new mechanisms, task consolidation and task splitting, which dynamically adjust the granularity of tasks. Detailed task splitting algorithms are proposed for single-job scenarios where prior knowledge is known and where it is unknown. After that, multi-job scheduling is investigated and the Shortest Job First (SJF) strategy and task splitting are integrated. Finally, extensive simulation experiments are presented with performance comparisons.

5.2 Dynamic Granularity Adjustment

Resource Model In Hadoop, each slave/worker node hosts a fixed number of map slots, which determines the maximum number of map tasks a node can run simultaneously. If the number of map slots is too small, resources cannot be fully utilized. If it is too big, severe resource contention may occur and overhead increases. In either case, performance is not optimal. We assume the number of map slots per node is perfectly tuned; how to tune it is beyond the scope of this work.

Task Model We propose Bag-of-Divisible-Tasks (BoDT), derived from Bag-of-Tasks (BoT) [20][125], as our task model. We use an Atomic Processing Unit (APU) to represent a segment of processing that cannot be parallelized. We call a nonempty set of APUs a divisible task, meaning it can be divided into sub-divisible-task(s) (or sub-tasks for short). Each job is modeled as a bag of independent divisible tasks. From now on, we use divisible task and task interchangeably when no confusion arises. APUs may be heterogeneous in that data size and processing time vary. Given a set of independent APUs derived from a problem domain, how to organize them into tasks has a significant impact on performance. The optimal solution depends on both the characteristics of the APUs and the real-time system load. If tasks are too coarse-grained and therefore too large, load imbalance is likely to happen because of the large variation of task execution time. If tasks are too fine-grained and therefore too small, overhead and actual processing time become comparable and latency becomes significant.

In MapReduce, each map operation is considered an APU. The limitation of the default Hadoop implementation results from the fixed granularity of map tasks driven by data blocks. Job turnaround time is affected not only by data size but also by other factors, such as system fluctuation and hardware heterogeneity. We propose task splitting and task consolidation to mitigate load imbalance and fully utilize available resources. Task splitting is a process in which a task is split to spawn new tasks. Meanwhile, the input data is also split accordingly so that each newly spawned task processes a portion of it. After a task T is split, m new tasks {T1, T2, ..., Tm} are spawned and T itself becomes task T0 with smaller input. Equations (5.1) and (5.2) hold, where UI(T) is the unprocessed input data of a task T. The processing that has been done by a task is not re-done after it is split.

UI(T ) = UI(T0) ∪ UI(T1) ∪ UI(T2) ∪ · · · ∪ UI(Tm) (5.1)

∀i, j 0 ≤ i < j ≤ m UI(Ti) ∩ UI(Tj) = ∅ (5.2)

Task consolidation is the inverse process, in which multiple tasks are merged into one task. Formally, if a set of tasks {T1,T2,...,Tn} are merged into a single task T , equation (5.3) holds.

UI(T1) ∪ UI(T2) ∪ · · · ∪ UI(Tn) = UI(T ) (5.3)
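To make the definitions concrete, the following minimal Python sketch (not the dissertation's implementation) represents a task by the set of its unprocessed APU identifiers and checks the invariants stated in (5.1)-(5.3).

    # A task is modeled here as the set of its unprocessed APU (record) ids.
    def split_task(unprocessed, m):
        """Split one task into the shrunk original plus m new tasks. The union
        of the pieces equals the original unprocessed input (5.1) and the
        pieces are pairwise disjoint (5.2)."""
        apus = sorted(unprocessed)
        chunk = max(1, len(apus) // (m + 1))
        pieces = [set(apus[i * chunk:(i + 1) * chunk]) for i in range(m)]
        pieces.append(set(apus[m * chunk:]))      # the remainder stays with T0
        return pieces

    def consolidate_tasks(tasks):
        """Merge several tasks into one; the merged task processes exactly the
        union of their unprocessed inputs (5.3)."""
        merged = set()
        for t in tasks:
            merged |= t
        return merged

    original = set(range(100))
    pieces = split_task(original, 3)
    assert set.union(*pieces) == original                  # (5.1)
    assert sum(len(p) for p in pieces) == len(original)    # disjointness (5.2)
    assert consolidate_tasks(pieces) == original           # (5.3)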

Task consolidation and splitting can be used to adjust task organization to adapt to system environment changes. They make scheduling more flexible and robust. If tasks are split too aggressively, the overhead of splitting and task management may outweigh the benefit of higher concurrency. So being splittable does not mean task splitting is beneficial. Based on the fact that tasks usually run much longer than APUs, we make the simplifying assumption that APUs are arbitrarily small.

5.2.1 Split Tasks Waiting in Queue

In this section, we give examples of how to split and consolidate tasks that are waiting in the queue. Running tasks are considered in the next section. Our task splitting process considers all map tasks in the queue, which may belong to different jobs.

If there are no available map slots, no map task in the queue is split or consolidated.

If the number of available map slots is smaller than the number of map tasks in the queue, one possible strategy is to consolidate map tasks so that all of them can be dispatched immediately. The data to be processed is the same whether or not map tasks are consolidated. The overall overhead of map task start-up and teardown is reduced because there are fewer tasks after consolidation. One potential drawback of consolidation is the loss of data locality. The more map tasks are consolidated, the smaller the possibility that the input blocks of all consolidated tasks are located on the same node. As a result, the amount of data transferred from remote nodes increases. So the optimal decision relies on the tradeoff between task overhead and data transfer cost. Plot (b) in Fig. 5.1 shows an example. Three map tasks T1, T2, and T3 are waiting and two nodes are available, so we can schedule at most two map tasks immediately. If we consolidate two map tasks, all map tasks can be scheduled to run immediately. In the plot, map tasks T2 and T3 are consolidated into map task T2-3, which is dispatched to the node where block B2 is stored. Block B3 is remotely accessed by task T2-3.

If the number of available map slots is larger than the number of map tasks in the queue, the map tasks can be split to spawn new tasks to fill the idle map slots. The resulting benefits include higher parallelism and better load balancing. As the number of map tasks increases, the overall task start-up and teardown overhead increases as well. Another disadvantage is that data locality may become worse. If a map task can be dispatched to a node where its input block is stored, one of its spawned map tasks is guaranteed to be dispatchable to that node, while the others may or may not be dispatched to it depending on map slot availability. Otherwise, none of its spawned map tasks can be dispatched to that node if they are run immediately. Plot (c) in Fig. 5.1 shows an example of task splitting. Initially there are four available nodes and three map tasks T1, T2, and T3. Only three nodes can be utilized if the default MapReduce scheduling algorithm is applied. Instead, task T3 is split into tasks T3.1 and T3.2 and all tasks are scheduled. Tasks T3.1 and T3.2 share the same input block B3 but process different portions. Compared with the situation where splitting is not applied, task T3.2 needs to access B3 remotely, but all nodes are utilized. One way to mitigate the data locality problem is data replication. When there are multiple copies of a block, the probability is higher that data-local scheduling is achievable after task splitting. In the extreme case where each block is replicated on all nodes, data locality becomes insignificant.

Figure 5.1: Examples of task splitting and task consolidation. Arrows are scheduling decisions. Each node has one map slot and block Bi is the input of task Ti

5.2.2 Split Running Tasks

Besides tasks waiting in the queue, running tasks can also be split dynamically to improve performance. When tasks are scheduled and running, computation time skew may slow down the progress of the whole job. Task splitting can be applied dynamically during task execution to offload some processing to other available map slots. Plot (d) in Fig. 5.1 shows an example. At time t1, four tasks are running. At time t2, task T4 completes and the slot originally occupied by task T4 becomes available while the other three tasks are still running. Task T3 is chosen to spawn a new task T3.2, which is scheduled to the slot released by completed task T4. Again, all nodes are utilized, but task T3.2 accesses its input data (a portion of block B3) remotely.

5.2.3 Summary

The previous two mechanisms can be combined to adjust all unfinished tasks (waiting tasks and running tasks), which achieves continuous optimization during the whole lifetime of jobs.

Task consolidation reduces the number of tasks to manage and schedule, which is highly beneficial if task management overhead is high and task start-up and teardown overhead is comparable to the actual execution time. We assume task execution time is significantly longer than task start-up and teardown time. If this does not hold, blocks can be enlarged to increase task granularity.

Task splitting is beneficial when the loss of data locality does not cause critical performance degradation.

When data are replicated on every node, the data access time is similar no matter where a task is dispatched if data access contention (e.g. multiple tasks access different data stored on the same node) is not severe.

If data access contention is severe, the number of map slots on each node can be tuned appropriately to achieve the optimal tradeoff between concurrency and resource use contention, so that data access does not affect scheduling much. This conclusion also holds when jobs are CPU-intensive and the data access cost is negligible. In other words, if the ratio of computation to data access is large, the computation factor is critical and other factors, such as disk I/O and network I/O, can be ignored. We focus on CPU-intensive jobs in the following discussions.

5.3 Single-Job Task Scheduling

First, we consider the task scheduling problem when at most one job is running at any time. In the next section, multi-job cases are discussed. The following algorithm shows how task splitting is hooked into the task scheduling process.

Algorithm skeleton:

    while isRunning = true:
        split_tasks()
        schedule_tasks()

At the beginning of each scheduling iteration, task splitting is applied if needed. This step makes tradeoffs between concurrency and overhead. Then an existing task scheduling strategy (e.g. Hadoop's data-locality-based scheduling) is used to schedule tasks. So task splitting can be seamlessly integrated with existing schedulers. We focus on the task splitting process and present our proposed solutions for the cases where prior knowledge about the workload is known and where it is unknown. We summarize below the issues that need to be solved to address the problem.

1. When to trigger task splitting

2. Which tasks should be split and how many new tasks to spawn; and

3. How to split

5.3.1 Task Splitting without Prior Knowledge

When no prior knowledge about execution time is available, we propose a strategy we term Aggressive Splitting (AS).

5.3.1.1 When to trigger task splitting:

The goal of task splitting is to shorten the average job turnaround time by utilizing as many nodes as possible. Assume the scheduler is invoked at time t; a task splitting decision is made if inequality (5.4) is satisfied, where Nmiq(t), Nrun(t) and Nams are the number of map tasks in the queue at time t, the number of running map tasks at time t and the total number of map slots, respectively. Inequality (5.4) means there will be idle map slots even if all tasks in the queue are scheduled to run immediately. In this case, the default scheduling strategy cannot use all idle slots, so the task splitting process should be initiated. Otherwise, it does not make sense to split tasks because there are no idle slots where newly spawned tasks can run. This will not make long-running tasks become "stragglers", because our task splitting process is invoked continuously and long-running tasks become candidate splitting targets whenever there are idle slots.

Nmiq(t) + Nrun(t) < Nams (5.4)

5.3.1.2 Which tasks should be split and how many new tasks to spawn:

We evenly allocate available map slots to unfinished tasks. Without prior knowledge, we divide the number of idle map slots by the number of unfinished tasks to calculate how many new tasks to spawn for each task on average. Then tasks are split one by one until no map slots are idle. The algorithm skeleton is shown in Fig. 5.2.

UTS: set ← unfinished tasks
IMS: int ← number of idle map slots
MST: int ← ⌈IMS / |UTS|⌉    (average number of new tasks to spawn per task)
for each task T in UTS:
    if IMS ≤ 0: break
    if IMS < MST:
        NS ← split(T, IMS)
    else:
        NS ← split(T, MST)
    IMS ← IMS − NS

Figure 5.2: Algorithm skeleton of Aggressive Split
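For concreteness, the following Python sketch expresses the Aggressive Splitting loop as we read Fig. 5.2, together with the trigger condition (5.4); the data structures and the split helper are illustrative assumptions rather than our simulator code.

    import math

    def should_trigger(n_queued, n_running, n_slots):
        # Inequality (5.4): idle slots remain even if every queued task starts now.
        return n_queued + n_running < n_slots

    def aggressive_split(unfinished_tasks, idle_slots, split):
        """unfinished_tasks: list of task objects; split(T, n) spawns at most n
        new tasks from T and returns how many were actually spawned."""
        if not unfinished_tasks or idle_slots <= 0:
            return
        per_task = math.ceil(idle_slots / len(unfinished_tasks))
        for task in unfinished_tasks:
            if idle_slots <= 0:
                break
            spawned = split(task, min(idle_slots, per_task))
            idle_slots -= spawned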

Function split(T, N) splits task T to spawn at most N new tasks. Depending on map slot availability, the splitting policy and overhead, the actual number of spawned tasks may be smaller than N. The actual number is returned from the function call so that the calling code can update the number of available map slots accordingly. The implementation of function split is described in the next section.

5.3.1.3 How to split:

Given a task and the maximum number of new tasks it may spawn, this section solves the problem of how to split. Firstly, the number of new tasks is adjusted so that it does not exceed the number of available map slots. Each data block is logically split into equally sized sub-blocks. We require that a task processing a single sub-block is not splittable; in other words, the sub-block specifies the smallest granularity of spawned tasks. For a task T, the total number of sub-blocks, the number of sub-blocks that have been processed or are being processed, and the number of new tasks to spawn are denoted by TS(T), PS(T), and NT(T) respectively.

Since we don’t have prior knowledge about the map execution time, we blindly spawn new tasks so that each one processes the same amount of data.

sub_blocks_per_task = (TS(T) − PS(T)) / (NT(T) + 1)    (5.5)

The remaining work is evenly divided among the task being split and the newly spawned tasks. In theory, this makes them all complete simultaneously if the execution of map operations is homogeneous. To avoid the inefficiency caused by spawning small tasks, a threshold is set to prevent small tasks from being split. The optimal threshold depends on workload and map operation characteristics. Making the threshold automatically tuned is left as future work.
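The sketch below shows one way to implement the split function described above, using formula (5.5) and a minimum-size threshold; the field names, the spawn_child helper and the threshold value are hypothetical and only for illustration.

    MIN_SUB_BLOCKS = 4   # hypothetical smallest granularity worth splitting off

    def split(task, n_new):
        """Spawn at most n_new tasks from `task`; each new task (and the shrunk
        original) gets roughly (TS - PS) / (n_new + 1) sub-blocks, per (5.5).
        The cap on available map slots is assumed to be applied by the caller.
        Returns the number of tasks actually spawned."""
        remaining = task.total_sub_blocks - task.processed_sub_blocks
        per_task = remaining // (n_new + 1)      # integer approximation of (5.5)
        if per_task < MIN_SUB_BLOCKS:            # too small to be worth the overhead
            return 0
        for _ in range(n_new):
            task.spawn_child(per_task)           # hypothetical helper: carve off sub-blocks
        return n_new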

5.3.1.4 Complexity:

The whole task list is scanned at most once, so time complexity is O(n) with regard to the number of tasks.

5.3.2 Task Splitting with Prior Knowledge

Now we assume that prior knowledge about task execution time is available. By prior knowledge, we mean that the Estimated Remaining Execution Time (ERET) is known or predictable. ERET indicates approximately how long a task will run before completion. In [135], a simple algorithm is proposed to estimate the finish time of MapReduce tasks. How to calculate ERET is outside our scope. We propose Aggressive Split with Prior Knowledge (ASPK) to optimize job turnaround time.

5.3.2.1 When to trigger task splitting:

The same trigger algorithm from the last section can be reused here.

5.3.2.2 Which tasks should be split and how many new tasks to spawn:

The ways to split tasks are not unique. The number of task splittings done during the whole lifetime of a job should be as small as possible without degrading performance. Fig. 5.3 demonstrates different ways to split tasks that achieve the same turnaround time. Graph (a) shows a scenario where there are two running tasks, T1 and T2, one idle slot and no waiting tasks. The ERET of T1 and T2 is 2t and t respectively. If overhead and data locality loss are negligible, we should definitely split tasks to fill the idle slot. We can split task T1 to spawn a new task, and both will run for a period t before completion, which is demonstrated in (b); at time t all tasks complete. Another way, shown in (c), is to split task T2 to spawn a new task; both will run for a period t/2. At time t/2, two slots become idle and task T1 is split to spawn two new tasks, each of which runs for (2t − t/2)/3 = t/2. In both cases, the final job turnaround time is t. However, the number of spawned tasks is different: in (b), one task is spawned, while in (c) three tasks are spawned. More task splittings incur a higher probability of degrading performance and destabilizing the system. In this example, (b) is preferred to (c).

Figure 5.3: Different ways to split tasks (Processing time is the same). Dashed boxes represent newly spawned tasks.

Tasks that complete last determine when a job finishes. For jobs whose tasks have highly varied execution time, the scenario should be avoided in which a few long tasks keep running long after the other short tasks complete. When long-running tasks exist, splitting tasks with small ERET merely generates smaller tasks and does not improve job turnaround time. So our heuristic is that tasks with large ERET should be split first so that they do not become "stragglers".

Firstly, the tasks with small ERET are filtered out, because splitting a task that will complete very soon does not yield much benefit. In addition, task filtering is an optimization step that reduces the number of map tasks considered by the following steps, for faster processing. Secondly, the remaining tasks are sorted by ERET in descending order. After that, tasks are clustered into {C1, C2, ..., Cm} according to ERET so that tasks with similar ERET belong to the same cluster. Each cluster C carries several pieces of information, including its task list (C.TS), the number of tasks (C.Count), the sum of ERET (C.ERET) and the average ERET (C.AE). We go through the task clusters one by one to evaluate whether task splitting is beneficial. Initially, we only consider the tasks in cluster C1. The tasks in C1 are split to fill all idle slots, and the estimated task execution time T1 after splitting is calculated. If T1 is larger than C2.AE, it is not beneficial to split the tasks contained in the following clusters, and the estimated execution time of the newly spawned tasks is set to C2.AE. If T1 is significantly smaller than C2.AE, the spawned tasks are too small compared with the tasks in C2, so we consider the tasks from both C1 and C2 for splitting. Time T2 is calculated and compared with C3.AE. If T2 is much smaller, we consider C1, C2, and C3. This process is repeated until Ti is larger than or comparable to Ci+1.AE, or all clusters have been included. The algorithm skeleton is shown in Fig. 5.4.

IMS ← number of idle map slots
UTS ← unfinished tasks
FTS ← filterTasks(UTS)
STS ← sortByERET(FTS)
{C1, C2, ..., Cm} ← clusterTasks(STS)
sumERET ← 0, count ← IMS
for cluster Ci, 1 ≤ i ≤ m:
    sumERET ← sumERET + Ci.ERET
    count ← count + Ci.Count
    avgERET ← sumERET / count
    if i = m: break
    if avgERET << Ci+1.AE:
        continue
    else:
        break

Figure 5.4: Algorithm skeleton of Aggressive Split with Prior Knowledge
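As a concrete reading of Fig. 5.4, the sketch below selects which clusters participate in splitting; the cluster tuple layout and the ratio used to express "much smaller" are our assumptions, not values fixed by the algorithm.

    MUCH_SMALLER = 0.5   # avgERET counts as "much smaller" below this ratio of the next average

    def select_clusters_to_split(clusters, idle_slots):
        """clusters: list of (task_list, sum_eret, avg_eret) sorted by descending
        ERET. Returns the indices of the clusters whose tasks will be split."""
        selected, sum_eret, count = [], 0.0, idle_slots
        for i, (tasks, cluster_eret, _) in enumerate(clusters):
            sum_eret += cluster_eret
            count += len(tasks)
            selected.append(i)
            if i == len(clusters) - 1:
                break
            avg_eret = sum_eret / count
            if avg_eret < MUCH_SMALLER * clusters[i + 1][2]:
                continue   # spawned tasks would be too small; include the next cluster
            break
        return selected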

Filtering Ideally, how tasks are filtered should depend on the ERET of the unfinished tasks. A pre-set threshold is not flexible enough to capture task characteristics. Instead, we calculate the optimal remaining job execution time (ORJET) by assuming that all unfinished tasks are split to use all available slots. The total ERET is obtained by adding up the ERET of all unfinished tasks, and it is divided by the total number of map slots (including both occupied and idle slots) to get ORJET. ORJET measures how long a job would run before completion in the optimal case. Then the ERET of each task is compared with ORJET. If a task's ERET is significantly smaller than ORJET, the task is filtered out. Towards the end of job execution, ORJET becomes increasingly small because running tasks are close to completion and more slots are released. In this situation, task splitting is not beneficial because the overhead of task splitting outweighs the potential gain of higher concurrency. So we filter out tasks that are close to completion without affecting overall performance. Thus the filtering process adapts to workloads of different types.

Clustering The task clustering algorithm is designed to group tasks with similar ERET and separate tasks with significantly different ERET. Existing clustering algorithms from the machine learning community, such as K-means, Expectation-Maximization and agglomerative hierarchical clustering, can be used without modification. Considering that the scheduling routine is called frequently and its performance is critical to the whole system, we favor simple linear algorithms. The tasks being clustered have already been ordered by ERET, which guarantees that tasks belonging to the same cluster are consecutive in the task list. Our current algorithm scans the task list once by moving a "cursor" from beginning to end. A running list is maintained to contain the tasks that are before the "cursor" and belong to the current cluster. If the ERET of the task pointed to by the cursor is much smaller than the average ERET of the current cluster, the current cluster is added to the global cluster set and a new cluster is created which initially contains only the task pointed to by the cursor. This guarantees that the maximal ERET of tasks within a cluster is significantly smaller than the average ERET of tasks within the previous cluster.
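The filterTasks and clusterTasks helpers used in Fig. 5.4 can be sketched as follows; the two ratio thresholds are illustrative assumptions, not values from our experiments.

    FILTER_RATIO = 0.2    # "significantly smaller than ORJET"
    CLUSTER_RATIO = 0.5   # "much smaller than the current cluster's average ERET"

    def filter_tasks(erets, total_slots):
        """Drop tasks whose ERET is far below the optimal remaining job
        execution time (ORJET = total ERET spread over all map slots)."""
        orjet = sum(erets) / total_slots
        return [e for e in erets if e >= FILTER_RATIO * orjet]

    def cluster_tasks(erets_desc):
        """Single-pass clustering of ERET values already sorted in descending order."""
        clusters, current = [], []
        for e in erets_desc:
            if current and e < CLUSTER_RATIO * (sum(current) / len(current)):
                clusters.append(current)
                current = []
            current.append(e)
        if current:
            clusters.append(current)
        return clusters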

5.3.2.3 How to split:

The way to split tasks can be further optimized if we also have prior knowledge about mean task execution time, network throughput, disk I/O throughput, etc. For a task T, the disk I/O cost, network I/O cost, and computation cost are denoted by DIO(T), NIO(T), and Com(T) respectively. So the total time is DIO(T) + NIO(T) + Com(T), if these operations do not overlap. The task being split is denoted by Tcur, and the newly spawned tasks are Tcur^1, Tcur^2, ..., Tcur^N. Ideally, equation (5.6) should be satisfied to make the tasks complete simultaneously after splitting.

DIO(Tcur^1) + NIO(Tcur^1) + Com(Tcur^1)
    = · · ·
    = DIO(Tcur^N) + NIO(Tcur^N) + Com(Tcur^N)
    = DIO(Tcur) + NIO(Tcur) + Com(Tcur)    (5.6)

Because we assume DIO(T ) and NIO(T ) are negligible compared to Com(T ), the above equation is converted to (5.7).

Com(Tcur^1) = Com(Tcur^2) = · · · = Com(Tcur^N) = Com(Tcur)    (5.7)

So the unfinished work of task T is evenly distributed to itself and the newly spawned tasks after splitting.

5.3.2.4 Complexity:

In ASPK, the complexity of sorting is O(n log n) and that of the other operations is not greater than O(n). So the overall complexity is O(n log n). However, sorting can be further optimized considering that in each iteration except the first, the tasks are mostly ordered.

5.3.3 Fault Tolerance

Our proposed algorithms do not handle fault tolerance directly. Task splitting is not enough to cope with situations where some tasks stall or fail due to hardware failure, severe system fluctuation or hanging processes. We integrate speculative execution to solve this problem. The speculative execution component monitors the system using heartbeat messages and automatically starts duplicate tasks to replace failed tasks. Now we have a complete solution which can speed up single-job execution by splitting relatively long tasks and speculatively re-executing failed tasks.

5.4 Multi-Job Task Scheduling

We put multi-job scheduling into the context of classic queuing theory. We adopt the M/G/s model [81]. Jobs arrive according to a homogeneous Poisson process. Job execution times are independent and may follow generic distributions. Also, there is more than one server in the system. One difference from the classic model is that a job may use multiple servers during its execution, and the execution time depends on the number of nodes used. We propose Greedy Task Splitting (GTS), which minimizes the run time of each job by splitting tasks to occupy all map slots and making the tasks of the last round complete simultaneously. Because each job uses all available nodes, subsequent jobs cannot execute until the currently running job completes. In other words, the queue time of some jobs is increased compared with non-GTS scheduling. As a result, the change in job turnaround time depends on both the decrease in job execution time and the possible increase in job queuing time. We will show that GTS gives optimal job turnaround time.

5.4.1 Optimality of Greedy Task Splitting

Fig. 5.5 shows two examples of the execution arrangement of a job J. In (a), job J starts at S(J) and completes at F(J); it uses all resources during its execution. In (b), the processing is grouped into four segments, 1, 2, 3 and 4. Now we formulate the scheduling model. C denotes the capacity of a certain type of resource in the system. n denotes the number of jobs to run. Si (1 ≤ i ≤ n) denotes the total resource requirement of job i. The resource usage function r(t, i) represents the amount of resource consumed by job i at time t. The constraints are shown in (5.8), (5.9), and (5.10), and the objective function is shown in (5.11).

t ≥ 0    (5.8)

∀t, Σ_{i=1}^{n} r(t, i) ≤ C    (5.9)

∀ 1 ≤ i ≤ n, Σ_{t=0}^{+∞} r(t, i) ≥ Si  (or ∫_{0}^{+∞} r(t, i) dt ≥ Si)    (5.10)

min Σ_{i=1}^{n} max{t : r(t, i) ≠ 0}    (5.11)

Inequality (5.9) means that at any moment, the resources consumed by all jobs must not exceed the capacity. Inequality (5.10) means that the sum of resource consumption by any job across time is not less than the requirement of the job. The ideal case is that the actual resource consumption is equal to the resource requirement, which means no overhead is incurred. In the objective function, max{t : r(t, i) ≠ 0} is the turnaround time of job i. So our goal is to minimize the overall job turnaround time.

Firstly, we will show that once a job starts running, it should complete as soon as possible by using all available resources. Secondly, we will convert this problem to the n/1 (n jobs / 1 machine) scheduling problem solved in [52].

Given a job J, its start time S(J) and its completion time F(J), Fig. 5.5 shows possible strategies of execution arrangement. The execution arrangement of J affects the completion time of other jobs. One fact is that the start time of job J does not matter when F(J) is fixed. Intuitively, all parts of the execution of job J should be placed as close to F(J) as possible. In plot 5.5b the execution of job J is interspersed along the time axis. The execution arrangement demonstrated in plot 5.5b can be converted to that demonstrated in plot 5.5a by interchanging the interspersed execution segments of job J (e.g. those marked by 1, 2 and 3 in the plot) and the execution segments of other jobs falling into the continuous area S. After the interchange, the completion time of the affected jobs either does not change or becomes earlier, because their changed execution segments start earlier. This interchange process can be iterated until each job utilizes all resources during its execution (see Fig. 5.5 for an example). In each iteration, only one job is considered. The whole process makes the overall turnaround time decrease monotonically regardless of the order of jobs picked during the iterations.

However, different job execution orders may result in different overall turnaround times. Fig. 5.6 shows one example. The next question is how to determine the job execution order so that the objective function is minimized. Because at any moment only one job consumes all resources, we can view the whole system as a single big virtual node. This problem then becomes the n/1 problem (n jobs / 1 machine) solved in [52]. The shortest-job-first strategy gives the optimal overall turnaround time, so jobs should be executed in ascending order of execution time.
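A small numerical illustration of this conclusion (our own example, not an experiment from this chapter) is given below: on a single virtual machine, ordering jobs by ascending execution time minimizes the sum of turnaround times.

    from itertools import permutations

    def total_turnaround(order):
        finish, total = 0, 0
        for length in order:
            finish += length
            total += finish          # each job's turnaround equals its finish time
        return total

    jobs = [6, 2, 4]
    best = min(permutations(jobs), key=total_turnaround)
    print(best, total_turnaround(best))    # (2, 4, 6) -> 2 + 6 + 12 = 20
    print(total_turnaround((6, 4, 2)))     # worst order -> 6 + 10 + 12 = 28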

(a) Job J uses all available resources during execution  (b) Execution of job J is segmented and interspersed

Figure 5.5: Different ways to arrange the execution of a job

Figure 5.6: Multiple scheduled jobs (Each uses all resources for execution)

5.4.2 Multi-Job Scheduling

Given a number of jobs to run, the algorithm skeleton of Shortest Job First Scheduling (SJFS) is shown below. Serial Execution Time (SET) denotes how long a job runs serially.

Algorithm skeleton of SJFS:

    order jobs by SET in ascending order
    schedule jobs in turn

If we know the SET of all jobs to be run, we can apply SJFS directly. However, in real systems, it is hard, if not impossible, to know all jobs to run ahead of time. To cope with this uncertainty, we use Non-Overlapped Periodic Shortest Job First Scheduling (NOPSJFS), in which SJFS is run periodically. Let I be the interval at which SJFS is called. Scheduling decisions are made at times 0, I, 2I, .... Let Jt be the set of jobs that are submitted at or earlier than time t. At time n·I, SJFS is applied to the job set Jn·I − J(n−1)·I. So the jobs that are scheduled at time n·I only include those submitted between time (n−1)·I and n·I. The jobs submitted prior to time (n−1)·I are not considered at all, even if some of them are still waiting in the queue. This strategy makes each job be scheduled just once, and jobs scheduled during different periods do not overlap. But unexpected system fluctuation exists and prior knowledge of SET may be inaccurate, so the assumptions made when a job is scheduled may be rendered useless by the time it is dispatched to run. Overlapped Shortest Job First Scheduling (OSJFS) is therefore proposed, in which all jobs that have been submitted but not yet completed are considered. To avoid starvation of long jobs, an aging factor is associated with each job, which measures how long the job has been waiting in the queue. Priority is positively correlated with the aging factor, so the longer a job has waited, the higher its priority becomes.
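A minimal sketch of the OSJFS priority rule follows; the linear aging term and its weight are illustrative assumptions, since the exact aging formula is not fixed here.

    AGING_WEIGHT = 0.1   # seconds of estimated SET "forgiven" per second of waiting

    def osjfs_order(jobs, now):
        """jobs: list of (job_id, estimated_set, submit_time) for every job that
        has been submitted but not yet completed. Returns job ids in dispatch
        order; long waits raise priority to avoid starvation of long jobs."""
        def priority(job):
            job_id, est_set, submitted = job
            waiting = now - submitted
            return est_set - AGING_WEIGHT * waiting
        return [j[0] for j in sorted(jobs, key=priority)]

    # Example: job "a" is long but has waited much longer than "b" and "c".
    jobs = [("a", 600, 0.0), ("b", 120, 900.0), ("c", 60, 950.0)]
    print(osjfs_order(jobs, now=1000.0))   # ['c', 'b', 'a'] with these numbers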

5.5 Experiments

We conduct experiments using the MapReduce simulator mrsim [67], which is built on top of an event-driven framework. Table 5.2 shows the configuration of our simulated system. Data are placed randomly on nodes. Each node hosts only 1 map slot. We assess the effectiveness of our approaches. The hardware configuration affects absolute job turnaround time, but it does not affect the comparison between our strategies and the default strategy.

Number of nodes        64        Disk I/O - read    40MB/s
Processor frequency    500MHz    Disk I/O - write   20MB/s
Map slots per node     1         Network            1Gbps

Table 5.2: Configuration of test environment

Two distributions are used to model the execution time of map operations: a Gaussian distribution and step functions abstracted from a real workload trace. Firstly, we set up tests to show that our approach improves performance in single-job environments.

5.5.1 Single-job tests

In this set of tests, we investigate the effect of variation of map task execution time on performance. We design a micro-benchmark to measure the performance improvement of task splitting. Based on the total number of map slots and that of map tasks, two cases are considered.

When the number of map tasks is smaller than the number of available map slots, the default strategy cannot utilize all resources. In the first test, we compose a job whose input data has 32 blocks, each of which is 64MB. The cluster has 64 nodes. We assume that task execution time follows a Gaussian distribution with negative values cut off. The mean is fixed and the variance is varied, which serves as an indicator of the variation of execution time of map tasks. The baseline distribution is a constant (uniform) distribution with mean µ, whose coefficient of variation (CV) is zero by definition. We let the Gaussian distributions have the same mean and change the variance to (k · µ)^2 (1 ≤ k ≤ 10), so CV is between 1 and 10. Job turnaround time is shown in Fig. 5.7a. One observation is that job turnaround time increases as CV increases. That results from the cut-off of negative values sampled from the tested distributions, so the mean of the sampled values is no longer µ and increases slightly with CV. Both AS and ASPK improve performance significantly, and the performance gain increases with CV. AS incurs larger variation compared with ASPK. When CV is small, the difference between AS and ASPK is not significant. As CV becomes large, ASPK performs significantly better than AS. When CV is 10, ASPK performs better than AS by 50%.
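The cut-off effect mentioned above can be reproduced with a few lines of Python; the parameter values here are arbitrary examples, not the simulator's settings.

    import random

    def truncated_gaussian_mean(mu, k, n=100_000, seed=1):
        """Sample a Gaussian with mean mu and standard deviation k*mu, clamp
        negative values to zero, and return the observed mean."""
        rng = random.Random(seed)
        sigma = k * mu                      # variance (k*mu)^2, so CV = k
        samples = [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]
        return sum(samples) / n

    for k in (1, 5, 10):
        print(k, round(truncated_gaussian_mean(mu=100.0, k=k), 1))
    # The observed mean drifts above 100 as k (and thus CV) increases.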

Now, we increase the number of map tasks of the job to 200 to make it significantly larger than the number of map slots. The test environment is the same as in the previous test, and the distributions of task execution time are also the same. Fig. 5.7b shows the results. The default scheduling has embedded support for load balancing: whenever a map slot becomes available, it dispatches a waiting task to it. Because the execution time of map tasks is sampled from the same distribution, the sum of task execution time on different nodes follows the same distribution as well. In other words, the mixture of long and short tasks dispatched to nodes naturally keeps the load balanced during the early lifetime of the job. In the early phase of job execution, all map slots are occupied, so task splitting does not help. Towards the end of the execution, all tasks are either running or completed; any released map slot cannot be utilized because there is no waiting task. Then task splitting improves performance by rebalancing the load. Considering that task splitting is mostly applied near job completion, it may not provide a large benefit. The test results show that even in this situation, AS and ASPK improve performance by up to 50%. The larger CV is, the more efficient ASPK is compared with AS.

(a) job turnaround time (32 tasks) (b) job turnaround time (200 tasks)

Figure 5.7: Single-Job test results (Gaussian distribution is used)

Besides the synthesized workload, workload data collected from real clusters is also used. Concretely, we use the cluster data published by Google [19]. It is analyzed in [42] to extract the characteristics of jobs and tasks. One observation made there is that task execution time for three types of jobs is bimodal. Around 75% of map tasks are short, running for approximately 5 minutes. Around 20% of map tasks are long, running for approximately 360 minutes. The execution time of the remaining 5% of the map tasks is between 5 minutes and 360 minutes. This distribution is used to model task execution time in this test. We use the term slot completion time to describe when the last task run in a map slot completes. We measured both job turnaround time and the variation of slot completion time over all slots. Fig. 5.8 shows the results. AS and ASPK shorten job turnaround time by 20% - 30%. ASPK performs slightly better than AS, reducing job turnaround time by an additional 5% - 10%. The standard deviation of slot completion time is shown in plot 5.8b. For the default scheduling, the value is 8521 seconds, which indicates that the last round of map task execution results in severe load imbalance. ASPK achieves the smallest standard deviation, around 9 seconds, so its histogram is almost invisible in the plot. This result is surprisingly good considering that the job runs for tens of thousands of seconds. For AS, the standard deviation is around 570 seconds. To figure out whether the best performance of ASPK is achieved by splitting many more tasks than AS, the number of spawned tasks is measured. Plot 5.8c shows that ASPK actually spawns fewer tasks than AS. So ASPK achieves the shortest job turnaround time and the smallest variation of slot completion time while spawning fewer tasks. This means that when prior knowledge is available, the additional optimization done in ASPK is effective.

(a) Job turnaround time  (b) Standard deviation of slot completion time  (c) Number of spawned tasks

Figure 5.8: Single-Job test results (Real profiled distribution is used)

The above tests demonstrate that the task splitting strategy improves performance significantly and that the degree of improvement is related to the characteristics of the map tasks.

5.5.2 Multi-job tests

As the M/G/s model is adopted for the multi-job scenario, the inter-arrival time of jobs follows an exponential distribution. We generate a workload of 100 jobs, each of which is synthesized according to the Google cluster data. We measure the average job turnaround time with and without the SJF policy applied. If the interarrival time is longer than the job execution time, at most one job is running at any time on average, and single-job scheduling can be used directly. So we set the mean interarrival time to be much shorter than the average job execution time.

In this test, all jobs have the same number of map tasks, which is equal to the total number of map slots, so that each job can occupy all map slots. The execution time of the tasks belonging to a job is the same. 75% of jobs are short, 20% of jobs are long and 5% of jobs are medium. 100 jobs are generated. Task splitting does not help much in this test because all map tasks of a job complete almost simultaneously and load imbalance occurs rarely. The results are shown in Fig. 5.9. Non-SJF scheduling and SJF scheduling have comparable makespan. SJF decreases the average job turnaround time by 63%.

Then we tested the case where different jobs have the same serial execution time. Obviously the SJF strategy does not make sense here because all jobs are equally long, so we ignore SJF and evaluate the task splitting strategies. The task execution times of each job follow the same distribution extracted from the Google cluster data. 100 jobs are generated and all slots are used at all times except near completion. Fig. 5.10 shows that both job turnaround time and makespan are shortened by 5% - 10%. One well-known fact is that if a system is fully loaded, it is harder to optimize than when the system is partially loaded. Our test results show that even if the system is fully loaded and SJF is useless, task splitting still brings benefit. Considering that the Google study shows the CPU utilization ratio is between 20% and 50% for their production clusters, task splitting will give more improvement in real clusters than in this test.

(a) Average job turnaround time (b) Makespan

Figure 5.9: Multi-Job test results (task execution time is the same for a job)

(a) Average job turnaround time (b) Makespan

Figure 5.10: Multi-Job test results (job execution time is the same)

5.6 Summary

In this chapter, we examined strategies for optimizing job turnaround time in MapReduce. Firstly, we analyzed the MapReduce model and its Hadoop implementation, and found that the way map operations are organized into tasks in Hadoop has several drawbacks. Then we proposed task splitting, a process that splits unfinished tasks to fill idle map slots, to tackle those problems. For single-job scheduling, Aggressive Splitting (AS) and Aggressive Split with Prior Knowledge (ASPK) were proposed for the cases where prior knowledge is unknown and known, respectively. For multi-job scheduling, we proved that the combination of the Shortest-Job-First strategy and the task splitting mechanism gives the optimal average job turnaround time if tasks are arbitrarily splittable. Overlapped Shortest-Job-First Scheduling (OSJFS) was proposed, which invokes the basic shortest-job-first scheduling algorithm periodically and schedules all waiting jobs in each cycle. We also conducted extensive experiments to show that our proposed algorithms improve performance significantly compared with the default strategy. One thing we may explore in the future is how task splitting and consolidation can benefit IO-intensive applications. Tradeoffs between data access concurrency and data locality need to be investigated to achieve optimal performance. In addition, it will be helpful to implement our algorithms in Hadoop and run some real applications to show the usefulness of our algorithms.

6

Resource Utilization and Speculative Execution

Traditional batch queuing systems adopt reservation-based resource allocation mechanisms. Although a cluster is shared by different users, the use of individual nodes is mostly exclusive among jobs.

MapReduce takes a different approach in that all nodes are shared among jobs submitted by users, and multiple tasks from different jobs can run concurrently on individual nodes to exploit the processing capability of modern multi-core processors. Users cannot explicitly reserve nodes for a specific period of time; it is the runtime's responsibility to decide where each task will be scheduled based on run-time resource availability. In this sense, MapReduce adopts dynamic scheduling while batch queuing systems adopt static scheduling. As we discussed above, each node has a number of map and reduce slots where tasks can run.

One can think of it this way: each slot is reserved for the prospective tasks that will be assigned to it. A slot gets occupied when a task is assigned to it, and gets released when the task completes. Usually, the number of slots is tuned to balance parallelism and resource usage contention. When all slots are occupied by tasks, severe resource contention should be avoided and an optimal resource utilization ratio should be achieved.

Each slot corresponds to a portion of the resources consumed by the task running in it. Roughly, the number of occupied slots is proportional to the resource utilization ratio of each node. The hard partition of physical processing capability into virtual map and reduce slots has drawbacks. Firstly, it is not trivial to determine the optimal numbers of map and reduce slots. Secondly, it results in resource underutilization when there are not sufficient tasks to occupy all slots. This is mitigated by our proposed resource stealing.

Running tasks steal resources corresponding to idle slots in order to shorten execution time. Whenever an idle slot is assigned a task, the stolen resources are returned so that the normal execution of the newly assigned task is not influenced. In addition, we propose and evaluate several strategies to reasonably allocate unused resources to running tasks.

In distributed systems, failure is the norm rather than the exception. Speculative execution is adopted by MapReduce to support fault tolerance. The master node keeps track of the progress of all scheduled tasks. When it finds a task that runs unusually slowly compared with other tasks of the same job, a speculative task is launched to process the same input data, with the hope that it will complete earlier than the original task. To maximize efficiency, the number of speculative tasks that complete later than the original tasks, and thus do not speed up the overall job execution, should be minimized. In Hadoop, the default mechanism to trigger speculative execution only considers the progress rates of running tasks, and incurs the execution of many non-beneficial speculative tasks. We propose Benefit Aware Speculative Execution to solve this problem.

Hadoop 0.21 integrates the algorithm Longest Approximate Time to End (LATE) proposed by Zaharia et al. [135]. Progress rate is calculated by dividing task progress by execution time. The tasks whose progress rates are one standard deviation slower than the mean of all tasks of the same job are identified as slow tasks and subject to being speculated. Among slow tasks, the one whose completion time is estimated to be farthest is speculated. The heuristic is that tasks that hurt the overall job response time should be speculated. A major drawback is that LATE does not evaluate whether it is beneficial to launch a speculative task. The goal of speculative execution is to improve performance and therefore shorten job execution time. After a speculative task is launched, resources are possibly wasted if the corresponding speculated task actually completes earlier than it. If a task is progressing slow but close to completion, to launch a speculative task may not be beneficial because it needs to re-execute the processing that has been done by the slow task and may complete later than the slow task.

We propose BASE algorithm for Benefit Aware Speculative Execution which evaluates the benefit brought by speculative tasks before they are launched.

6.1 Resource Stealing (RS)

Motivation: How to tune Hadoop parameters automatically has been studied in [79][68]. In our research, we assume the number of task slots is set optimally so that ideal resource utilization is achieved when all slots are occupied. Resource utilization is approximately proportional to the number of occupied slots. Usually, the utilization of real data centers is low. For example, according to real traces of production clusters, CPU utilization was 5%–10% in Yahoo's M45 cluster [80] and mostly below 50% in a multi-thousand node Google cluster [29]. The low utilization may be caused by several factors. There may not be sufficient workload to keep the whole cluster busy, or most of the submitted jobs may be disk- or IO-bound so that CPU is not the bottleneck. In addition, resource utilization varies across time periods. This implies that idle slots exist in most large systems and the capability of the resources can be further exploited to minimize job run time. The portion of the resources that sit idle on a slave node is termed residual resources, which can be utilized without incurring severe usage contention or degrading overall performance. We can consider that residual resources are reserved for prospective tasks that will be assigned to currently idle slots. One advantage of resource reservation is that resource availability is guaranteed whenever a new task is assigned. However, a drawback is that residual resources are left unused until new tasks are assigned.

Resource stealing: We propose resource stealing to improve resource utilization. The resource usage of the running tasks (if any) on each node is dynamically expanded or shrunk according to the availability of task slots. When there are idle slots on a slave node, its running tasks temporarily steal the resources reserved for prospective tasks so that residual resources are fully utilized. If a node is perfectly loaded after resource stealing is applied, assigning a new task will obviously overload it and degrade the performance of running tasks. Our solution is to shrink the resource usage of running tasks by making them relinquish stolen resources proportionally to new tasks. In this way, resource stealing does not violate the assumption made by the Hadoop scheduler that resources are guaranteed for new tasks, which is critical to efficient Hadoop scheduling. To summarize, the overall philosophy is to steal residual resources if the corresponding map/reduce slots are idle, and hand them back whenever new tasks are launched to occupy those slots. Resource stealing is applied on individual slave nodes. From the perspective of the central Hadoop scheduler running on the master node, idle slots on slave nodes are still available and new tasks can be assigned to them, so resource stealing is transparent to the central scheduler and can be used directly in combination with any existing Hadoop scheduler, such as the fair scheduler [132] and the capacity scheduler [4]. Resource stealing runs periodically with up-to-date information about task execution and system status, so it is adaptive in the sense that it reacts to real-time changes of the system state.

Policy                     Description
Even                       Evenly allocate residual resources to tasks.
First-Come-Most            The task that starts earliest is given residual resources.
Shortest-Time-Left-Most    The task that will complete soonest is given residual resources (on a per-node basis).
Longest-Time-Left-Most     The task that will complete latest is given residual resources (on a per-node basis).
Speculative-Tasks-Most     Speculative tasks are given residual resources.
Laggard-Tasks-Most         Straggler tasks are given residual resources.

Table 6.1: Allocation policies of residual resources

6.1.1 Allocation Policies of Residual Resources

Given residual resources and the number of running tasks on a slave node, the next issue is how to distribute residual resources among running tasks, i.e. which tasks should get how much. The policies can range from simple to complex in their use of system state information. Complex policies have the potential to take full advantage of the processing capability of each slave node. The disadvantages include high overhead cost and the risk that a well tuned policy may behave unpredictably when inaccurate state information is collected. We come up with several policies summarized in Table 6.1.

Even: This policy equally divides residual resources among running tasks. It is inherently stable because it does not rely on the collection or prediction of system state (and thus is not affected by inaccurate information).

First-Come-Most (FCM): This policy orders running tasks by start time. The task with the earliest start time is given the residual resources. The heuristic is to make tasks complete in the order of job submission, on a best-effort basis.

Shortest-Time-Left-Most (STLM): Firstly, the remaining execution time of tasks is estimated, where different mechanisms can be plugged in. Here we adopt the same mechanism used in [135] which assumes each task progresses at a constant rate across time and predicts the time left based on prior progress rate and current progress. The task with the shortest time left is given residual resources. The heuristic is to make close-to-completion tasks complete as soon as possible to make way for long-running tasks. This policy is applied locally on individual slave nodes.

Longest-Time-Left-Most (LTLM): This policy is the same as STLM except that the task with the longest time left is given residual resources.

Speculative-Task-Most (STM): Speculative execution aims to mitigate the impact of slow tasks by duplicating their processing on multiple nodes. The basic idea of the STM policy is that speculative tasks are given more resources than regular tasks, with the hope that they will not hurt job run time. Because speculative tasks obtain more resources, they run faster and, with high probability, will no longer be "stragglers". If there are no speculative tasks on a node, it falls back to the regular case and any other policy can be applied.

If there are multiple speculative tasks running on a node, residual resources are allocated to them evenly.

Laggard-Task-Most (LTM): This policy does not distinguish between regular and speculative tasks. Instead, for each job we use the estimated remaining execution time of all its scheduled tasks (both regular and speculative) to calculate the fastness of running tasks using (6.1), where T is a task, t is a time point, Nl(t, T) is the number of tasks that are expected to complete later than T, and Nr(t) is the number of running tasks. Fastness reflects the expected order of task completion for each job; a task with small fastness will complete after a task with large fastness.

fastness(t, T) = Nl(t, T) / Nr(t)    (6.1)

The fastness of a task cannot be computed locally by a slave node because it requires the information of all other tasks belonging to the same job. The master node maintains the statuses of all tasks and thus is the ideal component to compute fastness. Each slave node reports the statuses (e.g. progress, failure) of its running tasks to the master node in heartbeat messages. After collecting the information of all the tasks of a job, the master node calculates the fastness of each running task and returns it to the corresponding slave node. Upon receiving fastness information, slave nodes order tasks by fastness. The tasks whose fastness is smaller than threshold SlowTaskThreshold (a user configurable parameter) are called laggards and given residual resources. If there are multiple laggards on a node, residual resources are evenly allocated to them.
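A minimal sketch of how the master could compute fastness (6.1) and how a slave node would then pick laggards is given below; the data layout and the threshold value are illustrative assumptions.

    SLOW_TASK_THRESHOLD = 0.3   # example value of the user-configurable parameter

    def compute_fastness(remaining_times):
        """remaining_times: {task_id: estimated remaining time} for one job's
        running tasks. Returns {task_id: fastness}, the fraction of running
        tasks expected to finish later than this one (formula 6.1)."""
        n_running = len(remaining_times)
        fastness = {}
        for tid, rem in remaining_times.items():
            later = sum(1 for other in remaining_times.values() if other > rem)
            fastness[tid] = later / n_running
        return fastness

    def pick_laggards(fastness, local_task_ids):
        """On a slave node, tasks below the threshold receive the residual resources."""
        return [tid for tid in local_task_ids
                if fastness.get(tid, 1.0) < SLOW_TASK_THRESHOLD]

    job = {"t1": 30.0, "t2": 300.0, "t3": 45.0, "t4": 280.0}
    print(pick_laggards(compute_fastness(job), ["t2", "t3"]))   # ['t2']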

As we discussed, the motivation of speculative execution is to improve performance by running duplicate processing. There are several drawbacks. Firstly, if speculative execution is triggered, the completion of any task renders the work done by the other duplicate tasks useless. Secondly, if the slowness of tasks is caused by intermittent and temporary resource contention, it is highly likely that they do not lag much behind and still

113 RF* regular task speculative task 1 p q · p 2 1 − q2 (q2 + p) · (1 − q2)

Table 6.2: Probability of achieving data locality

RF* regular task speculative task

1 p · EL + q · ENL q · p · EL + (1 − q · p) · ENL 2 2 2 2 2 2 2 (1 − q ) · EL + q · ENL (q + p)(1 − q ) · EL + (1 − (q + p)(1 − q )) · ENL

Table 6.3: Energy consumption of regular and speculative tasks (Data Accessing)

* RF: Replication factor complete earlier than their speculative tasks, which subverts the motivation of speculative execution. Thirdly, sometimes speculative execution deteriorates performance rather than improve it [135]. LTM reduces the invocations of speculative execution by proactively allocating more resources to laggards whenever possible and thus accelerating their execution. Fourthly, the tasks of a job may be heterogeneous intrinsically in that their run time varies greatly depending upon both data size and the content of the data. For example, easy and difficult Sudoku puzzles have similar input sizes (9 x 9 grids) but require substantially different amounts of computation to solve. Speculative execution is not helpful because the efficiency variation is not mainly caused by extrinsic factors (e.g. faulty nodes) and the run time will not be reduced significantly no matter how many speculative tasks are run. In that case, the tasks demanding the most computation progress slower than other tasks and thus are the laggards with small fastness. LTM speeds up their execution by assigning more resources. By balancing the workload within each job, LTM reduces both job run time and the number of speculative tasks. The assignments of new tasks decrease the amount of residual resources while the completion of running tasks increases the amount of residual resources. They both trigger the re-allocation of residual resources.

6.2 The BASE Scheduler

Speculative execution is not a simple matter of running redundant tasks for sufficiently slow tasks. To make it effective, two issues need to be addressed: i) detecting slow tasks; ii) choosing the tasks to speculate.

Hadoop identifies the tasks whose progress rates are one standard deviation lower than the mean of all tasks as slow tasks. Then, for the slow task with the longest remaining execution time, a speculative task is started on a different node. Hadoop does not take into consideration whether speculative tasks will complete before the original tasks. Assume a job has two tasks A and B; task A is 90% done but progresses slowly with rate 1, while task B progresses fast with rate 5. Because task A progresses much slower than the average rate ((5 + 1)/2 = 3), the master node decides to start a speculative task A′ which progresses with rate 5. By doing a little math, we can easily figure out that task A′ will complete later than task A although A′ progresses much faster: task A is already close to completion when A′ is launched. This inefficiency was observed in our tests, where a large portion of speculative tasks were killed before their completion because the original tasks completed earlier. Those speculative tasks were not beneficial and their execution wasted resources, so we term them Non-Beneficial Speculative Tasks (NBST).
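The arithmetic behind this example follows the constant-progress-rate assumption of [135]: time left is (1 − progress)/(progress rate), and a speculative copy starts from zero progress. The short Java sketch below is purely illustrative and reproduces the task A / task A′ numbers; it is not Hadoop code.

```java
// Constant-rate remaining-time estimate (assumption borrowed from [135]).
class RemainingTimeEstimate {
    /** Estimated time left, given completed progress in [0,1] and the progress rate. */
    static double timeLeft(double progress, double progressRate) {
        return (1.0 - progress) / progressRate;
    }

    public static void main(String[] args) {
        double taskA = timeLeft(0.9, 1.0);       // task A: 90% done at rate 1 -> 0.1 left
        double taskAPrime = timeLeft(0.0, 5.0);  // A': starts from scratch at rate 5 -> 0.2 left
        // A' finishes after A, so launching it would be a non-beneficial speculation.
        System.out.printf("A: %.2f, A': %.2f%n", taskA, taskAPrime);
    }
}
```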

One argument is that, given idle resources in most clusters, speculative execution should be applied aggressively to eliminate the impact of straggler tasks and thus accelerate job execution. From the perspective of pure performance, aggressive application of speculative execution can potentially reduce overall job run time.

However, this argument has several pitfalls. For regular and speculative tasks, the probabilities of achieving data locality are different. Hadoop randomly places data blocks among nodes and thus eliminates hot spots for most workloads in multi-user environments. Given a task, let p denote the average probability that it can achieve data locality with one data replica. According to the trace shown in [131], we vary p between 0.5 and 0.9 and use q to denote 1 − p. Table 6.2 shows how likely a regular task and its speculative task are to achieve data locality for different replication factors, which is visualized in Fig. 6.1(a). Obviously, speculative tasks have a lower probability of achieving data locality than regular tasks, by up to 90% and 25% for replication factors 1 and 2 respectively. The degradation of data locality results in more data movement and network traffic. Besides, energy consumption is also increased: some studies show that inter-node data movement requires 10 to 20 times more energy than local memory access [31][1]. Table 6.3 shows the expected energy consumption, where E_L and E_NL are the amounts of energy consumed by accessing local and non-local data respectively. We set the ratio of E_NL to E_L to 20 and plot the results in Fig. 6.1(b), from which we can see that speculative tasks are much less efficient. Adopting a large replication factor can potentially mitigate the problem but requires more storage space and incurs higher data management overhead. Thus, it is cost prohibitive to blindly create an excessive number of speculative tasks without evaluating their potential benefit. What we want to achieve is to minimize speculative execution as much as possible without sacrificing performance.

Figure 6.1: Comparison of regular and speculative tasks: (a) probability of achieving data locality; (b) normalized energy consumption
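The entries of Tables 6.2 and 6.3 are closed-form expressions in p and q, so the curves in Fig. 6.1 can be reproduced directly. The Java sketch below evaluates them for p between 0.5 and 0.9 under the assumed ratio E_NL/E_L = 20; it is illustrative only.

```java
// Evaluate the locality probabilities (Table 6.2) and expected energy (Table 6.3)
// for replication factors 1 and 2, assuming E_NL / E_L = 20.
class LocalityEnergyModel {
    public static void main(String[] args) {
        double el = 1.0, enl = 20.0;                       // normalized energy costs
        for (double p = 0.5; p <= 0.9 + 1e-9; p += 0.1) {
            double q = 1.0 - p;
            double reg1 = p,          spec1 = q * p;                       // RF = 1
            double reg2 = 1 - q * q,  spec2 = (q * q + p) * (1 - q * q);   // RF = 2
            // Expected energy = P(local) * E_L + (1 - P(local)) * E_NL.
            System.out.printf(
                "p=%.1f locality RF1 %.2f/%.2f RF2 %.2f/%.2f energy RF1 %.1f/%.1f RF2 %.1f/%.1f%n",
                p, reg1, spec1, reg2, spec2,
                reg1 * el + (1 - reg1) * enl, spec1 * el + (1 - spec1) * enl,
                reg2 * el + (1 - reg2) * enl, spec2 * el + (1 - spec2) * enl);
        }
    }
}
```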

To overcome the issue, we propose Benefit Aware Speculative Execution (BASE), in which speculative tasks are launched only when they are expected to complete earlier than the original tasks. The estimation of the remaining execution time of a running task has been discussed above. We propose a mechanism to estimate the execution time of prospective speculative tasks. It depends upon two factors: i) the progress rates of other tasks of the same job; and ii) the node where the speculative task will run. The key is to estimate the progress rate, which can be directly used to calculate run time. Slow tasks can be identified using the mechanism described in [135]. Given a slow task T of job J and a slave node N_i, the following algorithm decides whether a speculative task T′ should be launched for T on N_i.

1. If some tasks belonging to job J are running or have run on node N_i, the mean of their progress rates is calculated and used as the progress rate of T′.

2. Otherwise, the progress rates of all scheduled tasks of job J are gathered and normalized against a reference baseline. The normalization, computed based on hardware processing power, is needed to eliminate the effect of hardware heterogeneity. Then the mean of the normalized progress rates is calculated. Because the mean is against the reference baseline, we de-normalize it against the specification of node N_i to compute the expected progress rate of T′ on N_i. We assume the scheduling order of tasks is approximately stochastic, so the mean of the progress rates of scheduled tasks reflects the mathematical expectation of the real progress rate, which is reasonable given Hadoop's scheduling strategy. For the less typical cases where the disparity of progress rates is mainly caused by node-specific failures or software stacks, a more complex heuristic is designed: for node N_i, a set of "similar" nodes is selected for the estimation of the progress rate of T′, where node similarity is calculated based on normalized task progress rates. The basic idea is that the execution time of a task is predicted based on the information of peer tasks running only on similar nodes.

3. No matter which of 1) and 2) is applied, the estimated progress rate of T′ is now available, and the estimated execution time is 1/(progress rate). If the difference between the estimated execution times of T and T′ is larger than a preset threshold, T′ is launched on N_i; otherwise, T′ is not run on N_i.

Predicting the run time of T′ via the mean of progress rates is equivalent to taking the harmonic mean of the run times of the scheduled tasks.
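The sketch below condenses this decision rule into code. It is a simplified illustration under the assumptions stated above (the speculative task starts from zero progress and runs at the mean peer rate, already normalized for node N_i); the class and method names are hypothetical, not the actual BASE implementation.

```java
import java.util.List;

class BaseSpeculationDecision {
    /** Steps 1-2: expected progress rate of T' = mean of peer progress rates
     *  (already normalized/de-normalized for node N_i as described above). */
    static double expectedRate(List<Double> peerProgressRates) {
        double sum = 0;
        for (double r : peerProgressRates) sum += r;
        return sum / peerProgressRates.size();
    }

    /** Step 3: launch T' only if it is expected to finish sufficiently earlier than T. */
    static boolean shouldSpeculate(double expectedRateOfTPrime,
                                   double estimatedTimeLeftOfT,
                                   double threshold) {
        double runTimeOfTPrime = 1.0 / expectedRateOfTPrime;   // T' starts from scratch
        return estimatedTimeLeftOfT - runTimeOfTPrime > threshold;
    }
}
```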

6.2.1 Implementation

To utilize residual resources, the parallelism of running tasks needs to be increased. We adopt multithreading to fully exploit the capability of modern servers. In Hadoop, each task is run in a separate process to isolate its execution environment. Fig. 6.2 shows an example. There are two slave nodes, each of which has 5 cores. One core is dedicated to running various Hadoop daemons. Fig. 6.2(a) shows an instantaneous state of a MapReduce system. Each node has 4 slots, among which 2 slots are idle. For node A, slots A1 and A2 are occupied while slots A3 and A4 are idle. In Hadoop, each task process only runs one thread to process data even if there are lightly utilized cores, IO subsystems or network resources, so the resources corresponding to slots A3 and A4 are left idle. Within a task process, resource stealing starts multiple threads that process input data in parallel. In Fig. 6.2(b), one extra thread (marked in black) is created for both A1 and A2, so there are four threads in total after applying resource stealing on node A. Each thread can be scheduled to an individual core. The task manager periodically adjusts the number of threads based on the latest system state.
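The Even allocation policy described earlier maps naturally onto this threading scheme: the idle slots of a node are translated into extra worker threads spread across the running task processes. The following is a minimal sketch under that assumption; the names are illustrative and not taken from our actual task manager.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ResidualThreadAllocator {
    /**
     * Even policy: divide the node's idle slots evenly among its running tasks.
     * Returns how many extra threads each task process should run (each task
     * always keeps its original single worker thread).
     */
    static Map<String, Integer> evenPolicy(List<String> runningTaskIds, int idleSlots) {
        Map<String, Integer> extraThreads = new LinkedHashMap<>();
        int n = runningTaskIds.size();
        if (n == 0) return extraThreads;
        int base = idleSlots / n, remainder = idleSlots % n;
        for (int i = 0; i < n; i++) {
            // The first `remainder` tasks absorb the leftover slots.
            extraThreads.put(runningTaskIds.get(i), base + (i < remainder ? 1 : 0));
        }
        return extraThreads;
    }
}
```

For node A in Fig. 6.2 (2 running tasks, 2 idle slots), this yields one extra thread for each of A1 and A2, matching the four worker threads shown in the figure.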

Resource stealing and BASE are transparent to end users. Regular MapReduce applications can be run directly without any modification. Additional configuration parameters are added to allow administrators to tune various aspects of our improvements. For example, administrators can enable/disable resource stealing and/or BASE, and change the allocation policy of residual resources.

Although our implementation is based on Hadoop, our algorithms are not specific to Hadoop and can be applied as well to other systems that adopt resource "partitioning" and speculative execution.

Figure 6.2: An example of native Hadoop scheduling and resource stealing: (a) native Hadoop; (b) resource stealing

6.3 Experiments

Instead of directly measuring resource utilization (e.g. CPU usage), we measure user-perceivable job execution time, which indirectly reflects the improvement or deterioration of resource utilization. We adopt the trace-based workload used by other MapReduce papers [131][6][99] in addition to some real applications. Compute-, IO- and network-intensive workloads are used in our tests below. The results are applicable not only to the tested applications per se but also to other applications of the same types.

Figure 6.3: Results for ts-map (NBST: Non-Beneficial Speculative Tasks): (a) job run time without BASE; (b) job run time with BASE; (c) reduction of NBST with BASE (%)

We ran experiments on the FutureGrid Hotel cluster, which is homogeneous in both hardware and software. A 33-node Hadoop system was deployed, among which 1 node was a dedicated master node and the other 32 nodes were slave nodes. Each node has 8 cores, 23 GB memory, and 1 local hard drive. The network is Gigabit Ethernet. According to the best practice that the number of slots should be between 1x and 2x the number of cores, each node was configured to host 7 map slots and 7 reduce slots, so there were 224 map slots and 224 reduce slots in total. The HDFS block size was set to 256 MB because this improved performance.

6.3.1 Results for Map-Only Compute-Intensive Workload

According to the study in [80], a large portion of MapReduce jobs (over 71%) are map-only. In this section, we ran two map-mostly applications: ts-map and PSA-SWG.

6.3.1.1 ts-map

Firstly, we adopted the text extraction application included in the Hive benchmark [6], which was also used by Zaharia et al.'s delay scheduling work [131]. The benchmark itself is based on Pavlo et al.'s benchmark that was proposed to compare MapReduce and parallel SQL database management systems [99]. The text extraction job is IO-intensive and termed ts. We modified it by applying a compute-intensive User Defined Function (UDF) to each input record in addition to the original text extraction operation, which made it run much longer. This strategy was also used by delay scheduling [131]. This modified version of ts is compute-intensive and termed ts-map by us. It does not have a reduce phase, so we can eliminate the impact of shuffling and merging and exactly measure the effectiveness of our algorithms for map-only jobs. In our tests, only 0.01% of input records are extracted. Each map task processed 256 MB of text data and was tuned to run for approximately 5 minutes to simulate the interactive job types in Google's MapReduce [54], which was also adopted by Zaharia et al. to test their scheduling algorithm LATE [135]. Multiple ts-map jobs were run sequentially with the size of input data varied among 14 GB, 28 GB, 42 GB, and 50.5 GB. The corresponding numbers of map tasks were 56, 112, 168, and 202 respectively, which yielded system utilizations of 25%, 50%, 75% and 90%.

ts-map without BASE: We ran ts-map without BASE and show job run time in Fig. 6.3(a). Run time is not significantly influenced by workload for native Hadoop, which means that processing 14 GB, 28 GB, 42 GB, and 50.5 GB of data takes similar amounts of time. The reason is that resource usage is proportional to the number of tasks and residual resources are not utilized at all. Resource stealing shortens run time by 58%, 29%, 17%, and 5% respectively for policy Even. The lower the workload is, the more resource stealing outperforms native Hadoop, so the performance benefit of resource stealing is negatively related to system workload, which matches our expectation well. We also calculated the average processing time per GB of data by dividing job run time by data size. Increasing the workload drastically improves this efficiency for native Hadoop, while it remains approximately invariant for resource stealing. Different allocation policies exhibit different performance. Overall, STLM and LTLM perform the worst and LTM performs well in all tests, which implies that having global knowledge of job execution is beneficial. When the workload gets relatively high (e.g. 90%), the performance differences become smaller.

In our setup, data blocks in HDFS were randomly placed on nodes with the default block placement strategy. Data locality aware scheduling in Hadoop co-locates compute and data with best effort. As a result, map tasks were distributed approximately evenly among all slave nodes, so each node ran a similar number of tasks. That is beneficial to resource stealing because its gain is not substantial if the resources of a node are already heavily loaded (i.e. there are no resources to steal).

ts-map with BASE: We ran the same tests as above except that BASE was enabled, and present the results in Fig. 6.3(b). The plot has similar characteristics to Fig. 6.3(a) in that native Hadoop performs the worst and the performance superiority of resource stealing decreases with increasing system workload. By comparing Fig. 6.3(b) and Fig. 6.3(a), we observe that BASE slightly shortens run time. LTM again yields the best performance.

For the cases where BASE is disabled and enabled, we calculated the percentage of non-beneficial speculative tasks eliminated by BASE and show the result in Fig. 6.3(c). BASE drastically reduces the launches of non-beneficial speculative tasks. For workloads of 75% and 90%, almost all of them are removed. The LTM policy performs well consistently.

6.3.1.2 Pairwise Sequence Alignment (PSA) with Smith-Waterman-Gotoh Algorithm (SWG)

SWG is a well-known algorithm for performing local sequence alignment [64]. We ran our PSA-SWG implementation, which aligns all pairs of input sequences, in Hadoop with 16S rRNA sequences from the NCBI database. The number of processed sequences was set to 4676, 6608, 8064, and 8888. Input sequences were partitioned in a way that balances load and makes each job run for between 20 and 25 minutes with native Hadoop scheduling. Job run time is shown in Fig. 6.4(a), from which similar observations as in the ts-map tests above can be made. Resource stealing speeds up job execution by 2.45x, 1.53x, 1.15x, and 1.03x respectively. The impact of BASE on job run time is negligible. Fig. 6.4(b) shows the percentage of non-beneficial speculative tasks eliminated by BASE compared to native Hadoop without BASE. Applying BASE alone yields approximately 20% improvement, while applying BASE and resource stealing simultaneously eliminates all non-beneficial speculative tasks. With dynamic parallelism adjustment, resource stealing can smooth out the impact of the modest intrinsic heterogeneity and intermittent fluctuation of sequence processing and thus enable BASE to predict run time more accurately.

We conclude that BASE significantly reduces the number of non-beneficial speculative tasks without sacrificing run time. Because fewer speculative tasks are launched, the saved resources can be allocated to regular tasks for better efficiency. It also implies that the estimation of run time is approximately accurate, so BASE rarely removes the runs of beneficial speculative tasks.

Figure 6.4: Results for PSA-SWG: (a) job run time; (b) reduction of NBST

6.3.2 Results for Compute-Intensive Workload with Stragglers

In this experiment, background load is generated to slow down some nodes and simulate straggler nodes. We developed a load generator that can produce user-specified amounts of computation, network, and disk IO load. We ran multiple CPU-hogging threads yielding high core utilization, and IO-intensive threads continuously reading/writing data from/to local disks. The background load significantly slowed down the nodes without rendering them thoroughly unresponsive. We ran ts-map jobs that utilized 75% of all map slots, so 42 GB of data was processed in total in each run.

Firstly, two slave nodes were slowed down; job run time is shown in Fig. 6.5(a). Again, resource stealing improves performance over native Hadoop significantly no matter which resource allocation policy is used. LTM performs stably well for the cases with and without BASE. Fig. 6.5(b) shows that BASE can save nearly all unnecessary speculative task runs, which implies that the estimation of run time is accurate when only a small number of slave nodes are stragglers.

Secondly, four slave nodes were slowed down (i.e. 4/32 = 12.5% of all nodes). Fig. 6.5(c) shows job run time. The jobs ran longer compared with the previous test because more map tasks were impacted. Resource stealing is still effective in speeding up job execution. Policies LTM, Even, and LTLM yield the best performance. Fig. 6.5(d) shows that BASE can eliminate 40%–80% of non-beneficial speculative tasks. Compared with the previous case, BASE becomes less effective: as more straggler nodes incur larger variation in task execution, the collected information becomes more chaotic. In reality, it is rare that 1/8 of the nodes become stragglers simultaneously.

Figure 6.5: Results for ts-map with straggler nodes. There are two straggler nodes for (a) and (b), and four straggler nodes for (c) and (d): (a) job run time; (b) reduction of NBST with BASE (%); (c) job run time; (d) reduction of NBST with BASE (%)

6.3.3 Results for Reduce-Mostly Jobs

The trace analyzed in [80] shows that 9% of jobs are reduce-mostly. In this test, we experimented with two reduce-mostly applications: ts-reduce and matrix multiplication.

ts-reduce: Firstly, we modified the original text extraction application by plugging a compute-intensive UDF into the reduce operation. The new application is called ts-reduce. The numbers of map and reduce tasks were set according to the fact that most MapReduce jobs tend to have significantly fewer reduce tasks than map tasks. For resource stealing, only policy LTM was evaluated because it performs among the best based on the results above. Each job had 101 map tasks and the number of reduce tasks was varied between 32 and 64. Each job ran for approximately 5-10 minutes. Job run time is shown in Fig. 6.6(a). Resource stealing substantially shortens job run time, by 71% and 44% respectively. When each job contained 32 reduce tasks, they were well spread out so that each node ran at most one reduce task on average (note there were 32 slave nodes). For each reduce task, resource stealing created 6 extra reduce threads (note the number of reduce slots is 7 on each node) to run in parallel, which should yield an optimal speedup of 7x. In reality, we only got approximately 4x–5x speedup because of additional overhead. Reduce threads contend for the same input stream and only one thread can read at any time. To alleviate the issue, in our implementation each thread locks the input stream, copies the next key/values tuple to its local buffer, unlocks the input stream, and processes the data in the local buffer without interfering with other threads. But this approach incurs extra memory copying. In addition, reduce threads belonging to the same task contend for the same output stream as well. Investigating advanced techniques to further mitigate contention is left as future work. We also measured the number of speculative tasks; the results are not shown due to space limits. BASE shortens job execution marginally, but reduces the number of non-beneficial speculative tasks by up to 90% compared to native Hadoop.
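The lock-copy-unlock pattern mentioned above can be sketched as follows. The Tuple class and the shared iterator are hypothetical stand-ins for Hadoop's reduce input; the per-tuple copy inside the critical section is exactly the extra memory copying cost noted in the text.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class SharedReduceInput<K, V> {
    static class Tuple<K, V> {
        final K key;
        final List<V> values;
        Tuple(K key, List<V> values) { this.key = key; this.values = values; }
    }

    private final Iterator<Tuple<K, V>> input;    // shared, single-reader stream

    SharedReduceInput(Iterator<Tuple<K, V>> input) { this.input = input; }

    /** Called by reduce worker threads; returns null when the stream is exhausted. */
    Tuple<K, V> nextTuple() {
        synchronized (input) {                    // lock the shared input stream
            if (!input.hasNext()) return null;
            Tuple<K, V> t = input.next();
            // Copy the values into a thread-local buffer before releasing the lock,
            // so processing happens outside the critical section.
            return new Tuple<>(t.key, new ArrayList<>(t.values));
        }
    }
}
```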

Matrix multiplication: Li et al. used matrix multiplication to evaluate Dryad in [86]. We implemented it in Hadoop using the simple three-loop approach and call it mm. Input matrices were stored in HDFS. Map tasks only split the input matrices into smaller blocks and thus do not run real computation. Each reduce task calculates a number of final output blocks. Most of the work is done by reduce tasks, which makes mm reduce-mostly. The distribution of the real matrix multiplication operations among reduce tasks is controlled by a partitioner. Our initial tests showed that the default hash partitioner in Hadoop yielded substantial workload skew and resulted in severe load imbalance, so we implemented a custom partitioner to distribute the total work more evenly. We ran two tests in which the size of each block was 750 x 750. In the first test, the number of reduce tasks was set to 10 and the size of each input matrix was 11.25k x 11.25k, so each matrix was split into 225 blocks ((11.25k/750)² = 225) and each mm run read 20 GB of data in total from HDFS. In the second test, each job ran 112 reduce tasks and the size of each input matrix was 24.75k x 24.75k, so each matrix comprised 1089 blocks ((24.75k/750)² = 1089) and each mm run read 210 GB of data in total. The total number of reduce slots in the system was 224, so those two tests achieved approximately 5% and 50% utilization respectively. Fig. 6.6(b) shows the average job run time. Resource stealing yielded 5.4x and 2x speedup respectively, which results from the opportunistic exploitation of the processor cores "reserved" for non-occupied reduce slots.

Figure 6.6: Job run time for reduce-mostly applications: (a) modified text extraction ts-reduce; (b) matrix multiplication mm

6.3.4 Results for Other Workload

Besides compute-intensive applications, we also ran jobs of other types to comprehensively evaluate our approaches.

Network-Intensive Workload: We wrote a distributed web crawler, mr-wc, whose input is a set of URLs of the webpages to download. Mr-wc does not have a reduce phase; its map tasks download web pages and save them into HDFS. The network is the most critical resource for mr-wc. The Lemur project published a data set containing approximately 500 million unique URLs [3], and we randomly selected a portion of it as the input of mr-wc. If most pages downloaded by each task are hosted by the same server, the variation of server response time can result in severe skew among crawling tasks. To mitigate the issue, we shuffled the input URLs so that each task fetched pages from different domains and the workload was better balanced. The same testbed as above was used. In our tests, each map task downloaded 2000 web pages and the total number of downloaded pages was 112k, 224k, 336k and 404k, which means the number of map tasks was 56, 112, 168 and 202 respectively, so the system workload was 25%, 50%, 75% and 90% respectively. Fig. 6.7 shows job run time. For native Hadoop, the run time of mr-wc is not significantly impacted by the workload, which implies that spare resources cannot be utilized. In contrast, resource stealing expands the resources used by running tasks by creating more threads to download webpages in parallel. Run time is shortened drastically, by 66%, 37%, 24% and 15% respectively.

For the above tests, speculative execution was disabled because our additional tests showed that it mostly deteriorates performance. The efficiency of webpage crawling depends heavily on the response time of the servers where webpages are hosted, which ranges from milliseconds to seconds. Under this circumstance, running speculative tasks is not helpful because the efficiency variation of tasks is not caused by the system itself.

Figure 6.7: Job run time for network-intensive application mr-wc

IO-Intensive Workload: In this test, the applications wordcount and text extraction ts are used. Application ts has been described above. Wordcount counts the number of word occurrences in the input text. Again, each map task processed 256 MB of text and the number of map tasks was varied. Fig. 6.8 shows the results. For both applications, as the number of map tasks increases, job run time increases as well and parallel efficiency deteriorates, which is caused by increasing resource contention. However, with more data processed, the processing throughput (the amount of processed data per unit of time) improves significantly, which results from higher task-level parallelism. Resource stealing performs only slightly better than native Hadoop scheduling, although it achieves higher parallelism within each task. This is caused by a limitation of the Hadoop implementation. In map tasks, each map operation processes one line of text and is invoked repeatedly. Although resource stealing enables Hadoop to run more map operations in parallel, these threads share the same underlying input reader and output writer (to comply with Hadoop design). This incurs substantial contention among threads because the computation time of each map operation is short and synchronization becomes the performance barrier. As a result, the overhead is comparable to the benefit of the higher concurrency brought by resource stealing. This inefficiency is not intrinsic to our approach and mainly pertains to Hadoop design. Theoretically, a moderate increase of I/O parallelism can exploit the interleaving of computation and data I/O and improve overall throughput.

Figure 6.8: Job run time for IO-intensive workload: (a) text extraction application ts; (b) wordcount

6.4 Summary

The goal of our work is to improve resource utilization in MapReduce. We present resource stealing, which dynamically re-allocates residual resources to running tasks with the promise that they will be handed back whenever required by newly assigned tasks. It can be applied smoothly in conjunction with existing job schedulers because of its transparency to central Hadoop scheduling. In addition, we have analyzed the mechanism adopted by Hadoop to trigger speculative execution, discussed its inefficiency, and proposed Benefit Aware Speculative Execution, which "smartly" starts speculative tasks based on the estimated benefit. According to our tests, resource stealing yields substantial performance improvement for compute-intensive and network-intensive applications, and BASE effectively eliminates a large portion of unnecessary runs of speculative tasks. For IO-intensive applications, the performance improvement is not substantial, which is caused by access contention in the Hadoop framework. In the future, we will investigate lock-free data structures and algorithms to mitigate the issue. Currently we assume all jobs have the same priority and thus treat all tasks equally. We plan to integrate job prioritization to make our algorithms comply with fair sharing policies.

7 Hierarchical MapReduce and Hybrid MapReduce

Since its proposal, MapReduce has drawn the attention of many communities and has been widely adopted to run data parallel applications. However, MapReduce is not a panacea for all parallel applications. In this chapter, we propose two enhancements to the MapReduce model: Hierarchical MapReduce (HMR) and Hybrid MapReduce (HyMR). HMR is collaborative work with Yuan Luo and Yiming Sun. HMR is an extended MapReduce programming model and a tiered framework enabling MapReduce jobs to run in cross-domain environments. What we present in section 7.1 is based on the initial joint work, excluding the subsequent work done mainly by Yuan. HyMR is a workflow management system specifically designed to integrate regular MapReduce and iterative MapReduce in a way that balances efficiency and fault tolerance. HyMR is joint work with Yang Ruan.

7.1 Hierarchical MapReduce (HMR)

In academic institutions, traditional HPC clusters have been deployed, and it is more cost effective to reuse those resources for Hadoop deployment than to rebuild from scratch. Sometimes a user has access to multiple clusters under different administrative domains. For example, we can access FutureGrid clusters, TeraGrid clusters and some Indiana University (IU) campus clusters. Ideally, we want to utilize all the accessible resources to run complicated jobs. Each cluster has a couple of login nodes (or head nodes) that are publicly accessible. After a user logs in, he/she can perform various job-related tasks, including job submission, job status query and job cancellation. The internal compute nodes, however, usually cannot be directly accessed from outside for security purposes; all requests must be initiated from login nodes. This constraint prevents us from building a single MapReduce cluster across these physical clusters because the master node needs to be able to communicate with all slave nodes directly.

To resolve the issue, we propose Hierarchical MapReduce (HMR), which adopts a multi-tier approach. At each tier, the results collected from the tiers immediately under it are processed and the results are sent to the tier above it. Although theoretically the number of tiers can be arbitrary, in this thesis we only consider a two-tier architecture, which is sufficient for most scenarios. Having more tiers would increase the implementation complexity and incur higher overhead. In this sense, we prefer "fat" trees with smaller depth to "tall" trees with a few clusters at each level. Our current design is optimized for CPU-intensive applications where computation time is dominant and data movement cost is negligible.

7.1.1 Programming Model

Firstly, we extended the vanilla MapReduce model to a Map-Reduce-GlobalReduce model where each job is expressed with three operations: map, reduce and global reduce. The term "global reduce" is used intentionally to distinguish it from the regular reduce; however, the implementations of the local reduce and global reduce can be identical. Like MapReduce, each map operation takes a key/value pair as input and produces a list of key/value pairs, and each reduce operation takes as input a list of intermediate key/value pairs having the same key and produces key/value pairs. The global reduce operation collects the partial results from the local reduce operations, processes them and generates the final results. Existing MapReduce applications such as wordcount and sorting can be trivially converted to our model if their reduce operations are associative.
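A minimal Java rendering of the model is given below. The interfaces are illustrative, not the actual HMR API; they only make explicit that the local reduce and the global reduce have the same shape, which is why an associative reduce can be reused as the global reduce.

```java
import java.util.List;
import java.util.Map;

// Map: one key/value pair in, a list of intermediate key/value pairs out.
interface MapFn<K1, V1, K2, V2> {
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

// Local reduce: runs inside each bottom-tier cluster.
interface ReduceFn<K2, V2, K3, V3> {
    List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
}

// Global reduce: runs at the top tier over the partial results of all local reduces.
interface GlobalReduceFn<K3, V3, K4, V4> {
    List<Map.Entry<K4, V4>> globalReduce(K3 key, List<V3> partialResults);
}
```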

7.1.2 System Design

Fig. 7.1 shows the overall architecture of our proposed HMR framework. The top tier is the global controller, which has four main components: Task Scheduler, Data Transferer, Workload Collector, and Global Reducer. The bottom tier comprises a set of local MapReduce clusters, each of which runs a Workload Reporter and a Job Manager.

Figure 7.1: Hierarchical MapReduce (HMR) architecture

The functionalities of the main components of the global controller are elaborated below:

• Task Scheduler accepts jobs submitted by users, splits them, and dispatches them to local clusters. It can utilize the information maintained by other components, such as Workload Collector, to optimize scheduling.

• Data Transferer handles data transfers to and from local clusters on demand.

• Workload Collector collects the workload information of all bottom-tier clusters.

• Global Reducer applies the user-provided global reduce implementation to the data aggregated from local clusters and generates the final output.

The functionalities of the main components of the local clusters are elaborated below:

• Workload Reporter reports the workload information of a cluster to the global controller periodically. Currently, the collected workload information includes the number of running tasks and the ratio of idle slots.

• Job Manager accepts the jobs assigned by the global controller and runs them in the local MapReduce cluster.

After a new job is submitted to the global controller by a user, HMR automatically parallelizes the execution. The data flow is:

1. After a new job is received, the task scheduler splits the input into a number of partitions based on the number of local clusters. If the data has been uploaded beforehand, this step can be skipped.

2. Each data partition, along with the user-provided job jars and configuration files, is sent to the corresponding local cluster through the data transferer.

3. When the data transfer completes, the task scheduler notifies the local job manager.

4. The local job manager starts a new MapReduce job to process the data received in step 2.

5. After a local MapReduce job completes on a cluster, its output is sent back to the global controller if required by the application.

6. The global controller invokes the global reducer to apply the final reduce operation after all partial results have been received from the local clusters.

Since there is only a single global controller, it may become a performance bottleneck. Data transfer is expensive, and thus we recommend that users use the global controller to stage data only when the amount is small. We want to avoid scenarios where the data transfer cost is significant and dominates the overall execution. If the size of the input data is large, we recommend that users upload it to the clusters beforehand, manually or by using a script. One potential solution we have not fully explored is to use multiple global controllers, as demonstrated in Fig. 7.2. A gateway node is added upfront which runs a load balancer dispatching received user requests to global controllers according to their load. As a result, each global controller only handles a portion of all the jobs. One drawback is that the state of all local clusters needs to be collected and maintained by all global controllers.

Figure 7.2: Hierarchical MapReduce (HMR) architecture with multiple global controllers

7.1.3 Data Partition and Task Scheduling

One challenge of HMR is how to partition the workload among local clusters so that their load is well balanced. Whether input data are staged through the global controller or pre-uploaded beforehand, the task scheduler takes data locality into consideration when scheduling jobs. Here we mainly discuss the case where data are distributed by the global controller because it is easier and less error-prone. After data are received, the global controller counts the number of records using the user-implemented InputFormat and RecordReader, and splits them into different partitions that are subsequently staged to the local clusters.

As we discussed, we are mainly focused on compute-intensive applications and assume all map tasks of a parallel job undertake similar amounts of work (i.e. their workload is approximately identical). This task homogeneity assumption is also made by Hadoop, and our use of AutoDock below to evaluate HMR exhibits this behavior. Table 7.1 summarizes the symbols we will use to express data partition policies. For a job, the portion of data processed by cluster i is denoted by W_i, which should match the idle computation power of cluster C_i. W_i can be calculated with (7.1), from which we can see that the absolute magnitude of θ_i does not matter while its relative magnitude determines W_i. Given the number of records to process, the portions assigned to clusters are calculated with (7.2).

Symbol       Description
N            the number of clusters
C_i          cluster i (1 ≤ i ≤ N)
NC_i         the total number of processor cores in cluster C_i
M_max^i      the total number of map slots in cluster C_i
M_avail^i    the number of available map slots in cluster C_i
M_run^i      the number of occupied map slots in cluster C_i (i.e. M_max^i − M_avail^i)
θ_i          the available compute power of cluster C_i (proportional to M_avail^i)
J_i          a job submitted by a user
JM_i         the total number of map operations to run (i.e. the number of records in the input data)
JM_i^j       the work assigned to cluster C_j for job J_i

Table 7.1: Symbols used in the data partition formulation

    W_i = θ_i / Σ_{k=1}^{N} θ_k                (7.1)

    JM_i^j = W_j · JM_i                        (7.2)
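Equations (7.1) and (7.2) translate directly into code. The sketch below is illustrative only: the example θ values and the handling of rounding leftovers (given to the last cluster) are assumptions, not part of the HMR design.

```java
class DataPartitioner {
    /** W_i = θ_i / Σ_k θ_k  (7.1) */
    static double[] weights(double[] theta) {
        double sum = 0;
        for (double t : theta) sum += t;
        double[] w = new double[theta.length];
        for (int i = 0; i < theta.length; i++) w[i] = theta[i] / sum;
        return w;
    }

    /** JM_i^j = W_j · JM_i  (7.2), with the rounding remainder given to the last cluster. */
    static long[] partition(long totalMapOps, double[] weights) {
        long[] share = new long[weights.length];
        long assigned = 0;
        for (int j = 0; j < weights.length; j++) {
            share[j] = (long) Math.floor(weights[j] * totalMapOps);
            assigned += share[j];
        }
        share[weights.length - 1] += totalMapOps - assigned;
        return share;
    }

    public static void main(String[] args) {
        double[] theta = {2.93, 2.67, 2.0};   // e.g. proportional to available compute power
        System.out.println(java.util.Arrays.toString(partition(6000, weights(theta))));
    }
}
```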

7.1.4 AutoDock

We apply the MapReduce paradigm to running multiple AutoDock instances using the HMR framework to prove the feasibility of our approach. We take the outputs of AutoGrid (one tool in the AutoDock suite) as the input to AutoDock. The keys and values of the Map task input are ligand names and the locations of ligand files, respectively. We designed a simple input file format for AutoDock MapReduce jobs. Each input record, which contains the 7 fields shown in Table 7.2, corresponds to a map task. For our AutoDock MapReduce, the Map, Reduce, and Global Reduce functions are implemented as follows:

Map: Each map task takes a ligand, runs the AutoDock binary executable against a shared receptor, and then runs a Python script (summarize result4.py) to generate the lowest-energy result. All intermediate output shares a constant key.

Reduce: The reduce task takes all the values generated by the map tasks, sorts them by energy in ascending order, and writes the sorted results to a file using a local intermediate key.

Global Reduce: The global reduce finally takes all the values of the local reducer intermediate key, sorts them, and combines them into a single file ordered by energy from low to high.

Field                   Description
ligand name             Name of the ligand
autodock exe            Path to the AutoDock executable
input files             Input files of AutoDock
output dir              Output directory of AutoDock
autodock parameters     AutoDock parameters
summarize exe           Path to the summarize script
summarize parameters    Summarize script parameters

Table 7.2: The fields and descriptions of the AutoDock MapReduce input

7.1.5 Experiments

We built a prototype system for our proposed HMR on top of Hadoop. The system is written in Java and shell scripts. We use ssh and scp scripts for data stage-in and stage-out. On local clusters, the workload reporter is a component that exposes the cluster's load information. The load information can be used by the global scheduler to make task scheduling more efficient. Our original design was to make it a separate program without touching Hadoop source code. Unfortunately, Hadoop does not expose the load information we need to external applications, and thus we had to modify Hadoop code to add an additional daemon that collects load data by calling Hadoop internal Java APIs.

In our evaluation, we use several clusters including the IU Quarry cluster and two clusters in FutureGrid.

IU Quarry is a classic HPC cluster which has several login nodes that are publicly accessible from outside. Several distributed file systems (e.g. Lustre, GPFS) are mounted to each compute node for data sharing. FutureGrid partitions the physical resources into several parts, each of which is dedicated to a specific testbed environment such as Eucalyptus, Nimbus, and HPC.

To deploy Hadoop on traditional HPC clusters, we first use the built-in batch scheduler to allocate nodes. To balance maintainability and performance, we install the Hadoop program in a shared directory while storing data in a local directory, because the Hadoop program (Java jar files, etc.) is loaded only once by the Hadoop daemons whereas HDFS data are accessed multiple times.

We use three clusters for evaluation: IU Quarry, FutureGrid Hotel and FutureGrid Alamo. We applied for 21 nodes in each cluster, among which one is a dedicated master node and the others are slave nodes. They all run Linux 2.6.18 SMP. Each node in these clusters has an 8-core CPU. The specifications of these cluster nodes are listed in Table 7.3.

Cluster   CPU                    L2 cache size   Memory
Hotel     Intel Xeon 2.93 GHz    8192 KB         24 GB
Alamo     Intel Xeon 2.67 GHz    8192 KB         12 GB
Quarry    Intel Xeon 2.33 GHz    6144 KB         16 GB

Table 7.3: The specifications of the cluster nodes

In our first experiment, we ran AutoDock to process 2000 ligands and 1 receptor in one cluster. One of the most important configuration parameters is ga_num_evals, the number of docking evaluations. The larger its value, the higher the probability that better results are obtained. Based on prior experience, ga_num_evals is typically set between 2,500,000 and 5,000,000; we configured it to 2,500,000 in our experiments.

Fig. 7.3 plots the number of running map tasks during job execution. The cluster had 20 slave nodes, so the maximum number of running map tasks at any moment is 20 * 8 = 160. From the plot, we can see that the number of running map tasks quickly grew to 160 in the beginning and stayed approximately constant for a long time. Towards the end of job execution, it quickly dropped to a small value (roughly 0-5). Notice there is a tail near the end, indicating that the node usage ratio was low and the tasks were not perfectly balanced. At this moment, if new MapReduce tasks were submitted, the available mappers would be utilized by those new tasks. Fig. 7.4 shows the number of map tasks executed on each slave node. We observe that the load balancing across nodes was good and each node ran about 100 map tasks.

Figure 7.3: Number of running map tasks for an AutoDock MapReduce instance

Figure 7.4: Number of map tasks executed on each node

7.1.5.1 Micro-benchmark of each cluster

In this test, the global controller is not involved; we intend to figure out how each local Hadoop cluster performs with different sizes of input data. We ran AutoDock in Hadoop to process 100, 500, 1000, 1500 and 2000 ligand/receptor pairs in each cluster. See Table 7.4 and Fig. 7.5 for the results. Firstly, the execution time is approximately linear in the number of processed ligand/receptor pairs. Secondly, the total execution time of the jobs running on the Quarry cluster is approximately 30%-50% longer than on Alamo and Hotel. The main reason is that the nodes in Quarry have less powerful CPUs than those in Alamo and Hotel.

Number of map tasks per cluster   Hotel   Alamo   Quarry
100                               1004    821     1179
500                               1763    1771    2529
1000                              2986    2962    4370
1500                              4304    4251    6344
2000                              5942    5849    8778

Table 7.4: MapReduce execution time (in seconds) on the three clusters with varied input size

Figure 7.5: Execution time of tasks in different clusters

7.1.5.2 Even data distribution

In this test, we use all three clusters to run AutoDock to process 6000 ligand/receptor pairs. The input data were evenly distributed across the clusters; thus, the weight of the map task distribution on each cluster is W_i = 1/3. We equally partition the dataset (apart from the shared dataset) into 3 parts and send the data, together with the jar executable and job configuration file, to the local clusters for parallel execution. After the local MapReduce execution, the output files are staged back to the global controller for the final global reduce. All resources are used by a single HMR job.

A receptor is shared by all ligands and described as a set of approximately 20 gridmap files totaling 35 MB in size, and the 6000 ligands are stored in 6000 separate directories, each of which is approximately 5-6 KB in size. In addition, the executable jar and job configuration file together total approximately 300 KB. For each cluster, the global controller creates a 14 MB tarball containing 1 receptor file set, 2000 ligand directories, the executable jar, and the job configuration files, all compressed, and transfers it to the destination cluster, where the tarball is decompressed. We call this global-to-local procedure "data stage-in". Similarly, when the local MapReduce jobs finish, the output files together with control files (typically 300-500 KB in size) are compressed into a tarball and transferred back to the global controller. We call this local-to-global procedure "data stage-out". The data stage-in procedure takes 13.88 to 17.3 seconds to finish, while the data stage-out procedure takes 2.28 to 2.52 seconds. The Alamo cluster takes a little longer to transfer the data, but the difference is insignificant compared to the relatively long duration of the local MapReduce executions.

The time it takes to run 2000 map tasks on each of the local MapReduce clusters varies due to the different specifications of the clusters. The local MapReduce execution makespan, including data movement costs (both data stage-in and stage-out), is shown in Figure 7.6. Similar amounts of time were taken to run the jobs in Hotel and Alamo, while the job execution in the Quarry cluster took approximately 3000 more seconds (about 50% more than that of Hotel and Alamo). The Global Reduce task, invoked only after all the local results are transferred to the global controller, took only 16 seconds to finish. Thus, the relatively poor performance of Quarry becomes the bottleneck under the current job scheduling.

138 10000 9000 8000 7000 6000 5000 4000 3000 2000

Execution time seconds) time (inExecution 1000 0 Hotel Alamo Quarry

Figure 7.6: Execution time of tasks in different clusters (even data distribution)

7.1.5.3 Resource capability aware data distribution

With even data distribution, we observed substantial execution time skew among the jobs in different clusters although all clusters have the same number of compute nodes and process the same amount of data. In this test, we partition the data in a resource capability aware manner to achieve better load balancing. Among the three clusters, Quarry is less powerful than Alamo and Hotel. The specifications of the cores on Quarry, Alamo and Hotel are shown in Table 7.3. The processing time is approximately inversely proportional to CPU frequency, so we hypothesize that the difference in processing time is mainly due to the different core frequencies. Therefore it is not enough to merely factor in the number of cores for load balancing; the computation capability of each core is also important. We refine our scheduling policy to add CPU frequency as a factor when setting θ_i. Here we set θ_1 = 2.93 for Hotel, θ_2 = 2.67 for Alamo, and θ_3 = 2 for Quarry. Thus, the weights are W_1 = 0.3860, W_2 = 0.3505, and W_3 = 0.2635 for Hotel, Alamo, and Quarry respectively, according to which the dataset is partitioned.

With this resource capability aware data partition, we expect the load to be better balanced among clusters. The local MapReduce execution time (including data movement time) and the number of map tasks are shown in Fig. 7.7. The execution time of the local jobs was similar for all three clusters, while they ran different numbers of map tasks. We observe that our refined data partitioning improves performance by balancing the workload among clusters. In the final stage, the global reducer combines the partial results from local clusters and sorts them. The average time taken to merge local results is 16 seconds.

Figure 7.7: The number of tasks and execution time in different clusters

7.1.6 Summary

HMR enhances the MapReduce model by adding a new global reduce operation, which allows the system to be built in a hierarchical manner. HMR circumvents the access constraints of grid resources and unifies multiple physical clusters into a single system. Users only need to submit an HMR job, and the runtime automatically parallelizes the processing and makes full use of all cross-domain resources to maximize performance. Our evaluation with AutoDock, a compute-intensive biology application, demonstrates that our prototype HMR implementation is effective and efficient at shortening job execution time. Currently, computation capability is the only factor considered for workload partitioning, which may be inefficient for other types of applications.

7.2 Hybrid MapReduce (HyMR)

Hadoop is a widely used implementation of MapReduce. Hadoop supports fault tolerance in both the MapReduce execution engine and HDFS. For each data block, HDFS maintains multiple replicas to improve data availability. Hadoop MapReduce uses speculative execution to minimize the impact of faulty, failed and overloaded nodes.

Although many domain-specific problems can be artificially converted and expressed in the MapReduce paradigm, the performance may deteriorate. In section 2.3.2, we discussed why Hadoop does not perform well for iterative applications. Twister is one of the earliest frameworks designed specifically for iterative applications [57]. Twister adopts publish/subscribe messaging for communication and data transfer. Intermediate data across iterations are cached in memory and processed by long-running map tasks. Despite its performance advantage, Twister has some drawbacks. Firstly, Twister lacks support for distributed file systems; usually input data are stored in a shared file system (e.g. NFS), and the user needs to manually partition the file and distribute the data to the local storage of compute nodes. Secondly, Twister has limited support for fault tolerance. In addition, Twister adopts static scheduling and the number of map tasks (equal to the number of partitions) must be the same as the number of processor cores.

For complicated applications, expressing all processing in a single job is not feasible, and usually they are logically split into a set of interdependent jobs. It is likely that an application comprises both regular and iterative MapReduce jobs. We propose Hybrid MapReduce (HyMR), a workflow management system combining the best of both regular MapReduce and iterative MapReduce.

7.2.1 System Design

Fig. 7.8 shows the architecture of our proposed HyMR built upon Hadoop and Twister. Workflows are expressed in Extensible Markup Language (XML) or Java properties format. The user can also provide runtime configuration files which specify how Hadoop and Twister should be configured; this benefits workflows that require a configuration different from the default to maximize performance. The progress of workflow execution is recorded in an XML file and is accessible to users. HyMR has three main components: the workflow instance controller, the job controller and the runtime controller.

Workflow instance controller manages workflow instances, which are running copies of workflow definitions. Based on the user-provided workflow description file, a DAG is generated which encodes the jobs and their dependency relationships. The workflow instance controller invokes the runtime controller to start the runtimes, and controls the execution of jobs. If a job fails, the workflow instance controller re-runs it. If the maximal number of retries is reached, the workflow instance controller gives up and reports the failure. If a runtime fails (e.g. a Hadoop daemon crashes), the workflow instance controller restarts the runtime and re-runs all failed jobs.

Job controller manages the execution of a single job. A job is automatically started when all its prerequisite jobs have completed. A job can be sequential or parallel. The job controller monitors job execution and can recover failed tasks. For Hadoop, fault tolerance is natively built in. For Twister, a fault detector keeps track of the state of the Twister daemons, and the current job is killed if any daemon failure is detected. Once the execution of a job completes or fails, the job controller notifies the workflow instance controller.

Runtime controller manages Hadoop and Twister. After physical nodes are allocated through TORQUE [18], the runtime controller deploys and starts Hadoop and Twister on them. They are shared by all jobs of a workflow and are termed persistent runtimes. After a workflow instance completes, the runtime controller stops Hadoop and Twister. Compared with the approach in which runtimes are started and stopped for each parallel job, persistent runtimes incur lower overhead.
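The retry behaviour of the workflow instance controller can be summarized by the sketch below. Job, RuntimeFailureException and the runtime-restart hook are hypothetical stand-ins, not the actual HyMR classes; the sketch only illustrates the re-run and give-up logic described above.

```java
class WorkflowInstanceController {
    /** One job attempt: returns true on success, false on job failure. */
    interface Job { boolean run() throws RuntimeFailureException; }
    /** Signals that the underlying runtime (a Hadoop/Twister daemon) failed. */
    static class RuntimeFailureException extends Exception {}

    private final int maxRetries;
    WorkflowInstanceController(int maxRetries) { this.maxRetries = maxRetries; }

    /** Returns true if the job eventually succeeds within maxRetries re-runs. */
    boolean runWithRetries(Job job, Runnable restartRuntime) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (job.run()) {
                    return true;              // success
                }
                // plain job failure: fall through and re-run
            } catch (RuntimeFailureException e) {
                restartRuntime.run();         // restart Hadoop/Twister, then re-run the job
            }
        }
        return false;                         // give up and report the failure
    }
}
```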

Figure 7.8: Hybrid MapReduce (HyMR) architecture

7.2.2 Workflows

We have implemented two HyMR workflows for our bioinformatics data visualization application, termed the Twister-pipeline and the Hybrid-pipeline, shown in Fig. 7.9. After the input data are received, they are split into two subsets: a sample set and an out-sample set. The sample set is first processed by a Pairwise Sequence Alignment (PSA) algorithm whose output is the input of a Multidimensional Scaling (MDS) algorithm. MDS interpolation computes the mapping result of the out-sample set by interpolating the mapping of the sample set produced by MDS. We have implemented parallel algorithms for both Hadoop and Twister. For the Twister-pipeline, all parallel jobs run on Twister, and data partitioning and staging need to be explicitly handled. For the Hybrid-pipeline, iterative algorithms such as MDS are implemented in Twister while the other parallel jobs are implemented in Hadoop, and data sharing is handled implicitly by HDFS. The detailed discussion of each application is presented below.

Figure 7.9: HyMR workflows. (a) Twister-pipeline: Input File Splitter → Twister-PSA → Twister-MDS → Twister-MI-MDS → Result Merger, with explicit data distribution between stages. (b) Hybrid-pipeline: Input File Splitter → Hadoop-PSA → Twister-MDS → Hadoop-MI-MDS → Result Merger, with data sharing via HDFS.

7.2.2.1 Pairwise sequence alignment

Sequence alignment identifies similar regions among sequences of DNA, RNA, and protein that may be a consequence of functional, structural, or evolutionary relationships. Pairwise sequence alignment does all-pair alignment over a set of sequences, and its result is naturally expressed as an all-pair dissimilarity matrix. Among the proposed PSA algorithms, the Smith-Waterman-Gotoh (SWG) algorithm is used in our workflows, and we have implemented it on both Hadoop and Twister.

7.2.2.2 Multidimensional Scaling (MDS)

MDS is often used in information visualization for exploring similarities in data. Given a pairwise dissimilarity matrix, an MDS algorithm assigns a location to each item in N-dimensional space in a way that the corresponding goal function is minimized. SMACOF is an optimization strategy used in MDS in which m-dimensional data items are “mapped” to n-dimensional (n ≪ m) space with the stress function minimized via an iterative EM technique. Parallel SMACOF is used in our workflows. The output of PSA is the input of MDS.

Because SMACOF is an iterative algorithm, we only implemented it on Twister.
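For reference, the quantity SMACOF minimizes is the standard weighted stress (written here in conventional MDS notation, which may differ from the notation used in our implementation):

\[ \sigma(X) \;=\; \sum_{i<j} w_{ij}\,\bigl(d_{ij}(X)-\delta_{ij}\bigr)^{2}, \]

where the sum runs over all pairs of items, $\delta_{ij}$ is the input dissimilarity between items $i$ and $j$, $d_{ij}(X)$ is their Euclidean distance in the target configuration $X$, and $w_{ij}$ is an optional weight. Each iteration majorizes $\sigma$ and updates $X$, which is why the algorithm maps naturally onto an iterative runtime such as Twister.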

7.2.2.3 MDS interpolation

The memory usage of SMACOF is O(N²), which tightly limits the practically achievable scale of parallel MDS. Majorizing Interpolation Multidimensional Scaling (MI-MDS) has been proposed to significantly reduce this complexity [28]. The whole data set is split into two parts: the sample set and the out-sample set. The sample set is fed into PSA and MDS, and the sample mapping result is generated. Based on the sample mapping result, MI-MDS calculates the mapping result of the out-sample set. MI-MDS can be easily parallelized because the interpolation of the data points is independent. MI-MDS has been implemented on both Hadoop and Twister, called Hadoop-MI-MDS and Twister-MI-MDS respectively. Hadoop-MI-MDS uses HDFS for data storage and sharing, while Twister-MI-MDS partitions the data and stages it to compute nodes.
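Because each out-sample point is interpolated independently, the Hadoop side reduces to a map-only job. The sketch below is a hypothetical mapper, not the dissertation's Hadoop-MI-MDS code: the input format, the tab-separated record layout, and the interpolate() placeholder are assumptions made for illustration.

```java
// Hypothetical map-only job for MI-MDS-style interpolation; each record is one out-sample point.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InterpolationMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: "pointId<TAB>dissimilarities to the sample set".
        String[] fields = value.toString().split("\t", 2);
        String pointId = fields[0];
        // A real mapper would load the sample mapping produced by MDS in setup();
        // interpolate() here is only a placeholder for the actual MI-MDS computation.
        String coordinates = interpolate(fields[1]);
        context.write(new Text(pointId), new Text(coordinates));
    }

    private String interpolate(String dissimilarities) {
        return "0.0,0.0,0.0"; // placeholder target-space coordinates
    }
}
```

Twister-MI-MDS follows the same per-point logic but reads its partition from the data staged to each compute node.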

7.2.3 Summary

HyMR automates the management of runtimes and jobs and facilitates the execution of workflows that consist of both regular MapReduce applications and iterative MapReduce applications. Currently, Hadoop and Twister are supported, combining the best of both worlds. Our visualization pipeline illustrates how HyMR can be used to run complicated workflows.

8

Related Work

There has been substantial research on various issues of distributed computing. We survey below the related work on parallel and distributed file systems, data staging, and task scheduling.

8.1 File Systems and Data Staging

In the early days, disk space was quite limited and large data sets were stored on tapes. The Distributed Parallel Storage Server (DPSS) is a dynamically configurable distributed disk-based cache built on top of a collection of disk servers over a Wide Area Network to isolate applications from tertiary archive storage systems [78]. Data are copied from the tertiary storage system to DPSS on demand when applications request data that are not present in the DPSS cache. TCP parameters are tuned, and parallelism at different levels (disk, controller, processor, server, and network) is exploited to provide high performance.

The Storage Resource Broker (SRB) provides applications with uniform APIs to access heterogeneous distributed storage resources [30]. In addition, SRB employs a metadata catalog service (MCAT) and provides a collection view of a set of data objects. In Data Grid, different storage sites are federated to work together to store large data sets and maintain multiple replicas when needed [45]. In grid computing, different sites are federated to provide more accessible resources. Data transfer tools are needed to move data among participating sites and to allow users to upload data to grids. GridFTP extends the File Transfer Protocol (FTP) to add features that are critical to high-performance data movement in grids [25][24]. The newly added features include third-party data transfer, security, striped data transfer, parallel data transfer, automatic negotiation of TCP buffer sizes, partial file transfer, and support for reliable and restartable data transfer. GridFTP improves interoperability over DPSS because it standardizes a message-level protocol while DPSS provides only APIs.

Reliable File Transfer (RFT) is built on top of GridFTP to provide a Web-Service interface and to persist data transfer state to a database, so that users can query the progress and failed transfers can be resumed from the last checkpoint [71]. The Globus Replica Location Service (RLS) manages data replicas and provides a mechanism for registering new replicas and discovering them [44][46]. The Globus Data Replication Service

(DRS), which is built on top of RFT and RLS, automatically creates new replicas and registers them with RLS when needed [43]. Stork treats data placement activities as jobs and interacts with high-level schedulers such as DAGMan, which manage the dependencies among computation and storage jobs [82]. Pre-defined types of data placement jobs (e.g. reserve, transfer, release, locate) are provided to facilitate the specification of common jobs. In addition, Stork natively supports commonly used storage systems (e.g. SRB, UniTree, NeST), data transport protocols (e.g. FTP, GridFTP, HTTP), and middleware (e.g. SRM).

High-performance parallel file systems such as Lustre [112], GPFS [110], and PVFS [35] have been developed and used in large clusters to manage data. They are mounted on compute clusters and look like regular local file systems from the perspective of users. This eliminates the need to explicitly stage in data, although data are still fetched from remote sites. Besides file systems, other types of storage abstraction have been proposed and used. Amazon Dynamo uses eventual consistency to balance performance and consistency [55]. Many key-value stores have been developed, including Redis, Riak, and Tokyo Cabinet. CouchDB and MongoDB are document-oriented databases designed for storing, retrieving, and managing document-oriented data. Amazon S3 and OpenStack Swift implement object stores. BigTable/HBase [38], Cassandra, and Hypertable take the column-family and column-store approach to organize data.

8.2 Scheduling

Traditional task scheduling algorithms utilize task graphs which capture data flow and dependency among tasks to make scheduling decisions [115]. Task graphs themselves are not adjusted to improve performance.

In [72], some heuristics including MinMin, MaxMin and Sufferage are proposed for scheduling independent computational tasks to compute resources, and analyzed to understand their characteristics. Bag-of-Tasks

[20][125] simplifies task graphs by assuming that the tasks of each application are independent, which is

motivated by prior efforts such as SETI@home and parameter sweep applications [36]. Infrastructures (e.g.

Condor [87] and BOINC) have been developed and used widely. The traditional task scheduling research takes the strategy that once tasks start running, they are not modified dynamically.

Batch queuing systems [59] such as PBS [117], LoadLeveler [53], Sun Grid Engine [90], and the Simple Linux Utility for Resource Management (SLURM) [76] maintain job queues and allocate resources to jobs based on their priorities, resource requirements, and resource availability. When a job is scheduled, the requested number of nodes is reserved for a specific period of time even though the resource usage may vary across the phases of the job. Usually the input data are stored in separate storage systems (e.g. data grids, archives) and must be staged in before jobs start to execute. Most batch schedulers only map jobs to processing elements (e.g. processors, cores) without taking data affinity into account when they make scheduling decisions; they are mainly designed for compute-intensive applications for which data staging is not a critical issue. More sophisticated and intelligent data staging mechanisms could improve performance by caching and reusing data across job runs and by interleaving data staging and computation more efficiently. Backfilling moves small jobs ahead to leapfrog big jobs in front of them to alleviate fragmentation and improve resource utilization [92]. Depending on its aggressiveness, backfilling does not delay either the first job or any job waiting in the queue. Resources are shared among jobs in MapReduce, while grid systems adopt a reservation-based resource allocation mechanism. In our resource stealing, jobs are not re-ordered or moved in the queue and stealing is done at task level without impacting job scheduling at all, so it is a finer-grained and lower-level optimization of resource usage. Gang scheduling synchronizes all threads/processes of a parallel job by coordinating context switching across nodes so that they are scheduled and de-scheduled simultaneously [60][75]. Better performance is achieved by gang scheduling if the processes of a job need to communicate with each other. However, the global synchronization of processes incurs significant overhead.

Communication-driven coscheduling algorithms such as Dynamic Coscheduling [116], Spin Block [93], Periodic [93], and Coordinated Coscheduling [21] have been proposed to overcome this drawback of gang scheduling. In communication-driven coscheduling, each node runs its own scheduler, and the schedulers work in a coordinated manner by passing messages. HYBRID coscheduling combines the merits of gang scheduling and communication-driven coscheduling to improve performance [48]. It splits job execution into computation and collective communication phases. When a process enters the communication phase, its priority is boosted in the hope that it will not block the execution of other processes within the same job. The research above does not explicitly consider heterogeneity and is mainly used in homogeneous clusters.

In Condor matchmaking and gangmatching, each agent expresses its capabilities and requirements by specifying classified advertisements, which can address the issues of resource heterogeneity and policy heterogeneity [105]. The matchmaker is the component that finds bilateral or multilateral matches among class-ads.

There has been substantial research on load balancing, which tries to balance resource usage in clusters [136]. Pre-emptive process migration supports dynamic migration of processes from overloaded nodes to lightly loaded nodes. It is possible that the whole system is well balanced while some nodes are idle (e.g. when the number of task processes is less than the number of nodes). In that case, traditional algorithms cannot utilize idle nodes, while our solution can split running tasks and dispatch the spawned tasks to idle nodes. Work stealing enables idle processors to steal computational tasks from other processors and is more communication-efficient than its work-sharing counterparts [33]. Our proposed resource stealing shares similar motivations, but the execution model of MapReduce is logically independent of the underlying hardware while work stealing is closely coupled with processors. Cycle stealing enables busy nodes to take control of idle nodes, supply them with work, and receive results [32]. The motivation is to harness the otherwise wasted resources of idle nodes. Our proposed resource stealing is applied at a lower level, to the resources located on a single node.

Iterative MapReduce optimizes the performance of iterative applications by aggressively caching and reusing data across iterations [57].

The importance of data replication and locality has drawn some attention in the grid computing community. To support fast data access in data grids, Hierarchical Cluster Scheduling and Hierarchical Replication Strategy are proposed, which generate redundant copies of existing data across multiple sites and reduce the amount of transferred data [39]. Different dynamic replication strategies, which increase the likelihood of local data access, are proposed and evaluated to show that the best strategies can significantly reduce network consumption and access latency if access patterns exhibit a small degree of geographical locality [107]. Computation scheduling and data replication in data grids are investigated in [106], which shows it is beneficial to incorporate data location into job scheduling and to automatically create new replicas for popular data sets across sites. Their proposed mechanisms outperform traditional HPC approaches for data-intensive computing. The minimization of the loss of data locality is studied in [50] under the assumption that the number of splits of an item is inversely proportional to its data locality. The authors found it is NP-hard to find optimal solutions, and a polynomial-time approach is proposed that gives near-optimal solutions. But their runtime model differs from MapReduce in that the whole data set needs to be staged in before a job can run. A Close-to-Files strategy for processor and data co-allocation is proposed and evaluated for multi-cluster grid environments in [91] with the assumption that a single file has to be transferred to all job components prior to execution; a reservation-like scheduling mechanism is adopted. These assumptions do not hold in the systems we investigate. In [118], execution sites and storage sites are organized into I/O communities that reflect physical reality or administrative domains so that schedulers can make informed decisions based on data locality, etc.

For instance, a job can be scheduled to a community where its input data are stored. The experiment results show that localized I/O yields better performance, which demonstrates the importance of data locality.

For MapReduce, several enhancements have been proposed to improve data locality. Delay scheduling has been proposed to improve data locality in MapReduce [133][131]. For a system in which most jobs are short, if a task cannot be scheduled to a node where its input data reside, delaying its scheduling by a small amount of time can greatly improve data locality. As mentioned in the paper, if long tasks are common, delay scheduling is not effective because it will not improve data locality substantially. Our approach improves data locality without incurring additional delay for any workload; in fact, our algorithms can be used in combination with delay scheduling. In Purlieus, MapReduce clusters in clouds are provisioned in a locality-aware manner so that data transfer overhead among tasks is minimized [98]. MapReduce jobs are categorized into three classes: map-input heavy, map-and-reduce-input heavy, and reduce-input heavy, for each of which different data and VM placement techniques are proposed to minimize the cost of data shuffling between map tasks and reduce tasks. In [89], scattered grid clusters controlled by different domains are unified into a single MapReduce cluster by using a Hierarchical MapReduce framework, but the authors assume data are fed in dynamically and staged to local MapReduce clusters on demand. To improve speculative execution in Hadoop in heterogeneous environments, LATE is proposed, which uses the remaining execution time of tasks as the guideline to prioritize the tasks to speculate on and to choose fast nodes to run speculative tasks [135]. It has been incorporated into Hadoop 0.21.0 [15], which is used in our tests. Our work shows that LATE is not sufficient to cope with drastic network heterogeneity. Shared scans of large popular files among multiple jobs have been demonstrated to improve the performance of Hadoop significantly [22], but this relies on accurate prediction of future job arrival rates.

The term speculative execution has been used in different contexts. For example, at the instruction level, branch predictors guess which branch a conditional jump will take and speculatively execute the corresponding instructions [66]. For distributed systems where communication overhead is substantial, task duplication redundantly executes some tasks on which other tasks critically depend [23]; it mitigates the penalty of data communication by running the same task on multiple nodes. Speculative execution in MapReduce employs a similar strategy but is mainly used for fault tolerance. It is implemented in Hadoop to cope with situations where some tasks in a job become laggards compared with others. The assumption is that the execution time of map tasks does not differ much, which makes it possible for Hadoop to predict task execution time without any prior knowledge. When Hadoop detects that a task runs longer than expected, it starts a duplicate task to process the same data. Whenever any task completes, its remaining duplicates are killed. This can improve fault tolerance and mitigate performance degradation. However, the performance gain is obtained at the cost of duplicate processing and higher resource usage. In addition, tasks that run long because of the nature of their map operations do not benefit at all, because duplicate tasks cannot shorten their run time either. Our work is complementary to task speculation in that task splitting and task duplication can be combined to deal with long-running tasks resulting from either the nature of map operations or system failure. Moreover, there has been some research on heterogeneity in MapReduce; a MapReduce implementation for the .NET platform was presented in [77].

Some enhancements to the vanilla MapReduce model have been proposed. HaLoop [34], Spark [134], and Twister [57] add special support for iterative applications by caching inter-iteration data and reusing them. Spark also supports interactive queries [134]. Map-Reduce-Merge enables processing multiple heterogeneous datasets, which facilitates the expression of relational operations [128]. MapReduce Online allows intermediate data to be pipelined between operators and supports online aggregation and continuous queries while preserving the fault tolerance and programming model of MapReduce [51]. In our Hierarchical MapReduce (HMR), a Map-LocalReduce-GlobalReduce model is proposed to build a unified MapReduce cluster on top of multiple grid clusters under the control of different domains. Each raw job submitted by users is partitioned into sub-jobs which are dispatched to local MapReduce clusters. The global controller gathers results from the local MapReduce clusters and applies the global reduce operation. In addition to MapReduce, some other runtimes have been proposed such as Sector/Sphere [65] and Dryad [73]. Dryad allows users to express their jobs as

DAGs and automatically manages task execution, data staging and task dependency.

Some schedulers have been developed for MapReduce that support fair resource sharing. Facebook's fairness scheduler aims to provide fast response time for small jobs and guaranteed service levels for production jobs by maintaining job “pools”, each of which is assigned a guaranteed minimum share, and dividing excess capacity among all jobs or pools [132]. Yahoo's capacity scheduler supports multi-tenancy by assigning capacity to job queues [4]. However, the tradeoff between data locality and fairness is not considered. Dominant Resource Fairness addresses the fairness issue of multiple resources by determining a user's allocation based on his/her dominant share [63]. Quincy is a Dryad scheduler that tackles the conflict between data locality and fairness by converting the scheduling problem to a graph that encodes both network structure and waiting tasks and solving it with a min-cost flow solver [74]. In our work a different approach is taken, which has lower complexity. In [111], a load unbalancing policy is proposed to balance fairness and performance and to minimize mean response time and mean slowdown when scheduling parallel jobs.

9

Conclusions and Future Work

9.1 Summary of Work

In this thesis, we first surveyed state-of-the-art parallel programming models and distributed systems. Then we evaluated in detail MapReduce/Hadoop and recently proposed storage systems, including HDFS, NFS, and Swift, and identified their inefficiencies. After that, we chose the widely used MapReduce model as the target of our research. We reviewed some important assumptions and design principles made by MapReduce and its open-source implementation Hadoop, and revealed their drawbacks. We investigated the relationship between system factors and data locality, and proposed an algorithm that gives optimal data locality. We discussed the tradeoffs between small and large task granularity, and proposed task consolidation and task splitting to better balance workload. We observed low resource utilization and presented resource stealing and Benefit Aware Speculative Execution (BASE) to make better use of resources. Finally, HMR and HyMR were proposed to enable broader use of MapReduce in more diverse scenarios.

9.2 Conclusions

The ongoing data deluge imposes substantial challenges on distributed data processing and analysis.

The state-of-the-art research on grid computing and High Performance Computing (HPC) focuses on the scheduling of compute resources and thus is compute centered. Those systems mostly take the bring-data-to-compute approach, which incurs significant network traffic for data-intensive applications. Their typical architecture is shown in Fig. 1.1. Representative systems include Condor, Globus, and MPI. The goal of Condor is to harness the idle capacity of personal computers and workstations, and thus its scheduler is mainly optimized for compute resources. Its disregard of data affinity incurs inefficiency when the amount of processed data becomes extremely large. MPI is mostly used by science applications that comprise closely synchronized tasks and demand high performance. MPI itself does not specify how data are placed, and the dominant use of shared file systems for data storage is problematic. Globus is grid middleware that can federate multiple clusters and provide unified interfaces through which users interact with the backend systems.

When jobs are scheduled, some scheduling algorithms consider data locality at site level (i.e. whether the input data of a job are stored in a specific site). This level of data locality is too coarse-grained to maximize performance.

In contrast, recently proposed data parallel systems, which do not distinguish between compute and storage nodes, adopt a data centered approach and natively take fine-grained data locality information (e.g. at node level and rack level) into consideration when scheduling tasks. Fig. 1.2 shows the overall architecture. Thus, data parallel systems support bringing compute to data and in-situ data processing.

MapReduce, inspired by functional languages, is a parallel programming model that supports map-reduce semantics in distributed computing. It has quickly gained popularity because of its appealing features such as ease of use, fault tolerance, support for commodity hardware, and scalability. Therefore, MapReduce and its open-source implementation Hadoop are the main systems we have been investigating and optimizing.

We have frequently run various MapReduce applications for different purposes. During our use of

Hadoop, we observed some drawbacks that degrade the performance of job execution. The goal of our research is to conduct an in-depth study of critical performance-related aspects and to propose algorithms that achieve better performance. We summarize below the main contributions elaborated in each chapter of this thesis.

9.2.1 Performance Evaluation

Our evaluation of storage systems shows their inefficiency. Even if the accessed data are located locally,

HDFS achieves only 1/3 of the throughput of direct local access, and OpenStack Swift performs much worse. The throughput of NFS is satisfactory in that it approximates the network bandwidth. That indicates the recently proposed systems require substantial optimization to perform comparably to traditional well-tuned parallel file systems. Although HDFS, Swift, and similar systems may provide better scalability, their inefficiency may inhibit broader adoption. For Hadoop, we evaluated the impact of three factors (i.e. input size, the number of nodes, and the number of slots per node) on job run time and parallel efficiency. Per-node parallelism, controlled by the number of slots, needs to be carefully configured to fully exploit but not overload resources.

9.2.2 Data Locality

The incorporation of data affinity makes MapReduce suitable for large-scale data processing. As mentioned in chapter 4, we introduced a new metric, the goodness of data locality, defined as the percentage of tasks that achieve node-level data locality. It depends on the scheduling strategy, system configuration, resource availability, etc. We extracted five important system factors in MapReduce – the number of nodes, the number of map slots per slave node, the ratio of idle slots, the number of tasks to execute, and the replication factor – and theoretically deduced their impact on the goodness of data locality. The relationship is formulated in (4.1), (4.2), (4.3), and (4.4), which allow users to calculate the expected goodness of data locality from system factors without running actual workloads. Fig. 4.10 visualizes the relationship. Increasing the number of slots per node, the replication factor, and the ratio of idle slots all have a positive impact on the goodness of data locality, while the influence of the number of tasks and the number of nodes is more subtle. Our simulation shows the goodness of data locality is worst when the number of tasks to execute equals the number of nodes.

When scheduling tasks, Hadoop takes a slot-by-slot approach. Given an available task slot reported by a slave node, the master node finds and schedules the task that yields the best data locality. This strategy can trap Hadoop in a local optimum, as illustrated in Fig. 4.5. We reformulated task scheduling with a cost matrix capturing the data locality information of all potential task assignments; the values in the matrix are set in a way that reflects data movement cost. We proposed an algorithm, lsap-sched, shown in Fig. 4.6, that minimizes the overall data movement cost of all scheduled tasks based on solutions to the Linear Sum Assignment Problem. Figs. 4.11 and 4.12 demonstrate that lsap-sched improves the goodness of data locality by up to 15%. Besides, we also measured the reduction of data locality cost, which is more meaningful from the perspective of resource utilization. The results presented in section 4.5.6 show lsap-sched reduces data locality cost by up to 90%. This indicates lsap-sched eliminates most of the network traffic incurred by data staging, which is especially appealing to data-intensive applications.

Besides data locality, fairness of resource usage is desired in MapReduce environments shared by multiple users or organizations. To support fairness-aware scheduling, we changed the cost matrix to include not only data locality information but also fairness information. Correspondingly, the cost of each task assignment comprises a data locality cost and a fairness cost. The fairness cost is set according to whether the amount of consumed resources exceeds or falls below the configured ration. The relative dominance of data locality and fairness can be easily tuned by users through their weights. Fig. 4.17 shows how tradeoffs can be made.

So our algorithm allows system administrators to conveniently balance performance and fairness based on their specific requirements.
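As an illustration only (the actual cost definitions are those of chapter 4; the weights, cost values, and method names below are assumptions), the combined cost matrix handed to the LSAP solver can be assembled as follows:

```java
// Illustrative construction of the task-to-slot assignment cost matrix described above.
public class CostMatrixBuilder {
    /** dataLocalityCost[t][s]: e.g. 0 if task t's input block is on slot s's node,
     *  1 if it is rack-local, 2 if it is off-rack (illustrative values only). */
    static double[][] build(double[][] dataLocalityCost, double[] fairnessCost,
                            double weightLocality, double weightFairness) {
        int tasks = dataLocalityCost.length;
        int slots = dataLocalityCost[0].length;
        double[][] cost = new double[tasks][slots];
        for (int t = 0; t < tasks; t++) {
            for (int s = 0; s < slots; s++) {
                // fairnessCost is per task here (positive if the owning pool exceeds its ration);
                // the combined cost is what the LSAP solver minimizes over all assignments.
                cost[t][s] = weightLocality * dataLocalityCost[t][s]
                           + weightFairness * fairnessCost[t];
            }
        }
        return cost; // feed to a Linear Sum Assignment Problem solver (e.g. the Hungarian algorithm)
    }
}
```

Setting weightFairness to zero recovers the pure data-locality formulation of lsap-sched, while increasing it shifts the tradeoff toward fairness, mirroring the tradeoff curves in Fig. 4.17.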

9.2.3 Task Granularity

In chapter 5, we proposed a new task model – Bag-of-Divisible-Tasks (BoDT) – where each job consists of a set of tasks, each of which comprises a set of Atomic Processing Units (APUs) and can be divided into smaller sub-tasks. In MapReduce, a map operation, which processes a single key/value pair, is not divisible and is thus considered an APU; a map task is divisible if it processes more than one key/value pair.

In Hadoop, each map task processes the data of one block by default, so the number of map tasks equals the number of input data blocks. Consequently, this limits the maximum achievable parallelism of each job. In addition, it assumes all map/reduce tasks of an individual job carry out similar amounts of work.

This assumption may not hold under certain circumstances (e.g. intrinsically heterogeneous tasks, straggler nodes).

To mitigate the load imbalance caused by fixed task granularity, we proposed two mechanisms – task consolidation and task splitting – which are formulated in (5.3) and (5.1) respectively. When task consolidation is applied, multiple tasks are “merged” into a single larger task. When task splitting is applied, an individual task is split to spawn a certain number of new smaller tasks. Fig. 5.1 shows some examples. These two operations do not change the total amount of processed data, as no duplicate processing is incurred. Task consolidation increases task granularity while task splitting decreases it. We observed that task consolidation does not bring much benefit, so we focused on task splitting. It is applied dynamically according to up-to-date information about the whole system. Both running tasks and tasks in the queue can be split. For the scenarios where prior knowledge of task execution is unavailable and available, Aggressive

Scheduling and Aggressive Scheduling with Prior Knowledge were proposed respectively in section 5.3.

Our experiments demonstrated that task splitting can substantially improve load balancing and reduce job execution time. Prior knowledge has been shown to be beneficial for achieving better performance.

Besides, for multi-job scenarios, we theoretically proved in section 5.4 that the Shortest Job First (SJF) strategy yields optimal average job turnaround time under the assumption that the work done by each task is arbitrarily divisible. When the granularity of a map task is much larger than that of its individual constituent map operations, the task can be considered approximately arbitrarily divisible. Our experiments in section 5.5.2 show SJF greatly reduces job turnaround time without impacting makespan.
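The intuition behind this result can be recalled with the standard single-machine argument (this is not the exact proof of section 5.4, which works with divisible tasks): if $n$ jobs with processing times $t_{(1)} \le t_{(2)} \le \dots \le t_{(n)}$ are executed back to back, the average turnaround time is

\[ \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{i} t_{(j)} \;=\; \frac{1}{n}\sum_{j=1}^{n} (n-j+1)\,t_{(j)}, \]

which is minimized by pairing the largest coefficients $(n-j+1)$ with the smallest processing times, i.e. by running the shortest jobs first. Under the arbitrary-divisibility assumption each job can occupy the whole cluster, so the multi-node setting reduces intuitively to this sequential picture.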

9.2.4 Resource Utilization

From the perspective of system owners, fully utilizing resources is desirable because it usually maximizes cost effectiveness. In reality, resource utilization in data centers is far from the theoretical peak; the amount of resources available to jobs is much larger than the amount actually used, which indicates that the current framework cannot efficiently utilize all resources. In MapReduce, the number of concurrently running tasks on a slave node is limited by the number of task slots, and real traces show a significant portion of task slots are not used. Fig. 6.2a shows an example illustrating that default Hadoop scheduling is inefficient when there are not sufficient tasks to utilize all task slots.

To exploit the unutilized resources, we proposed resource stealing in section 6.1, which enables the running tasks on a slave node to dynamically “steal” the idle resources “reserved” for prospective tasks that will be scheduled to the node. Different resource allocation policies were proposed, ranging from simple to complex. “Stolen” resources are returned proportionally whenever new tasks are scheduled. Our experiments showed that resource stealing substantially reduces job run time by exploiting idle resources.
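A rough per-node sketch of the idea follows; the even-split allocation policy and the thread-pool mechanics are illustrative assumptions and are not claimed to match the specific policies evaluated in chapter 6.

```java
// Per-node bookkeeping sketch for resource stealing (illustrative; not the dissertation's implementation).
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ResourceStealer {
    private final int configuredSlots;   // map slots configured for this slave node
    private final ExecutorService extraWorkers = Executors.newCachedThreadPool();

    ResourceStealer(int configuredSlots) { this.configuredSlots = configuredSlots; }

    /** One simple policy: divide the idle slots evenly among the currently running tasks. */
    int extraThreadsPerTask(int runningTasks) {
        if (runningTasks == 0) return 0;
        int idleSlots = Math.max(0, configuredSlots - runningTasks);
        return idleSlots / runningTasks;
    }

    /** Launch the stolen workers for one running task; 'work' is a slice of that task's remaining input. */
    void steal(int extraThreads, Runnable work) {
        for (int i = 0; i < extraThreads; i++) {
            extraWorkers.submit(work);
        }
    }
}
```

Whichever policy is used, the extra workers must be withdrawn proportionally as soon as the master schedules new tasks to the node, so that the stolen slots are returned to their intended owners.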

Speculative execution is a useful technique for fault tolerance. However, excessive runs of speculative tasks that complete later than the original tasks result in inefficiency, which should be minimized. Our proposed Benefit Aware Speculative Execution (BASE) “smartly” determines whether to launch speculative tasks according to the estimated benefit. Experiment results show BASE can eliminate most non-beneficial speculative tasks without negatively impacting job run time.
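The core decision BASE makes can be sketched as a simple benefit test. The constant-progress-rate estimator below is an assumption made for illustration; the dissertation's actual estimation is more involved.

```java
// Sketch of a BASE-style benefit test: speculate only if the copy is expected to finish first.
public final class SpeculationGate {
    /** @param progress                  fraction of the task completed so far, in (0, 1)
     *  @param elapsedSeconds            wall-clock time the original task has been running
     *  @param estSpeculativeRunSeconds  predicted run time of a fresh copy on a candidate node */
    static boolean worthSpeculating(double progress, double elapsedSeconds,
                                    double estSpeculativeRunSeconds) {
        // Assume a constant progress rate to estimate how long the original task still needs.
        double progressRate = progress / elapsedSeconds;
        double estRemainingOriginal = (1.0 - progress) / progressRate;
        // Launch a speculative copy only if it is expected to complete before the original does.
        return estSpeculativeRunSeconds < estRemainingOriginal;
    }
}
```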

9.2.5 Enhancements to MapReduce Model

MapReduce is mainly designed for homogeneous single-cluster environments. It is difficult, if not impossible, for common users to obtain access to large resource pools. To get a sufficient number of resources to run complex applications, the user sometimes needs to combine multiple clusters administered by different domains. One convention in grid clusters is that internal compute nodes are not publicly accessible from outside for security purposes. Instead, the user must first log in to head nodes and then access internal nodes from there. This constraint makes it impossible to deploy a single MapReduce system across multiple grid clusters because the master node needs to communicate with every slave node directly.

To solve the issue, we proposed Hierarchical MapReduce (HMR) in section 7.1 which is both an extended

MapReduce programming model and a runtime. Firstly, the original MapReduce model is extended to Map-

LocalReduce-GlobalReduce. The output of the local reduce tasks is the input of the global reduce task, which produces the final results. Secondly, we developed a runtime that supports cross-domain execution of HMR jobs on top of multiple traditional grid clusters. An existing MapReduce job can run without modification, provided its reduce operation is associative. So HMR greatly expands the environments in which MapReduce jobs can run without placing any burden on developers. The architecture of the HMR runtime is shown in Fig. 7.1. The global controller manages and coordinates the local MapReduce clusters. Given a submitted job, the global controller decides how to distribute the workload among the local clusters. We proposed a compute-capacity-aware scheduling algorithm that is optimized for compute-intensive applications. Our experiment with the biological application AutoDock shows that HMR can speed up job execution significantly and that our scheduling algorithm balances workload well.
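The associativity requirement is what makes the two-level reduce correct: applying the global reduce to the per-cluster partial results gives the same answer as a single reduce over all values. The toy example below (plain Java, with integer sums standing in for an arbitrary associative reduce operation) illustrates the point; it is not HMR code.

```java
// Toy demonstration that local reduce followed by global reduce matches a single reduce
// when the reduce operation (here, summation) is associative.
import java.util.List;
import java.util.stream.Collectors;

public class AssociativeReduceDemo {
    public static void main(String[] args) {
        List<List<Integer>> perClusterValues = List.of(
                List.of(3, 5, 2),   // values for one key produced on cluster A
                List.of(7, 1),      // ... on cluster B
                List.of(4));        // ... on cluster C
        List<Integer> localReduced = perClusterValues.stream()
                .map(v -> v.stream().mapToInt(Integer::intValue).sum())   // local reduce per cluster
                .collect(Collectors.toList());
        int globalReduced = localReduced.stream().mapToInt(Integer::intValue).sum(); // global reduce
        int singleReduce = perClusterValues.stream().flatMap(List::stream)
                .mapToInt(Integer::intValue).sum();                       // single-cluster reduce
        System.out.println(globalReduced == singleReduce);                // prints true
    }
}
```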

MapReduce and iterative MapReduce are suitable for regular and iterative MapReduce jobs respectively.

There exist some applications that comprise both regular and iterative algorithms. To run them efficiently, users need to manually manage the dependency and submit constituent jobs to the corresponding system

based on their types. To automate the process, we proposed a workflow management system, Hybrid MapReduce (HyMR), in section 7.2, which currently supports both Hadoop and Twister. A simple workflow description language was developed in which users can describe jobs and their dependency relationships. HyMR supports data sharing through a distributed file system, fault tolerance, and dependency management. We ran our bioinformatics visualization pipeline, which consists of PSA, MDS, and MI-MDS, on HyMR; the results are reported elsewhere.

In short, HMR and HyMR are built on top of MapReduce and add features that facilitate the cross-domain and hybrid execution of parallel jobs.

9.3 Contributions

In section 1.5, we briefly introduced the contributions of this dissertation. We summarize how we achieve them below.

• A detailed performance analysis of widely used runtimes and storage systems is presented to

reveal both the speedup and overhead they bring. Surprisingly, the performance of some well-

known storage systems degrades significantly compared with native local I/O subsystems.

By doing a thorough evaluation, we have made insightful observations regarding the performance of

Hadoop and contemporary storage systems. Our findings show that substantial tuning and optimization

of those systems are needed to fully exploit the power of underlying hardware systems.

• For MapReduce, a mathematical model is formally built with reasonable data placement assump-

tions. Under the formulation, we deduce the relationship between influential system factors and

data locality, so that users can predict the expected data locality.

Our mathematical deduction of the relationship between system factors and data locality solves the

mystery of data locality in MapReduce. It enables users to know how their system configurations

will impact data locality approximately without deploying real systems, and analyze cost-effectiveness

more conveniently.

• The sub-optimality of the default Hadoop scheduler is revealed and an optimal algorithm based on

LSAP is proposed.

We convert the MapReduce scheduling problem to a mathematical problem by constructing a matrix capturing data locality information. An LSAP-based solution is proposed to find the task assignment that achieves optimal data locality. The superiority of our algorithm is evaluated against the native Hadoop scheduling algorithm.

• Based on the existing Bag-of-Tasks (BoT) model, a new task model, Bag-of-Divisible-Tasks (BoDT), is proposed. Upon BoDT, new mechanisms are proposed that improve load balancing by adjusting task granularity dynamically and adaptively. Given the BoDT model, we demonstrate that the Shortest Job First (SJF) strategy achieves optimal average job turnaround time under the assumption that work is arbitrarily divisible.

To tackle the load imbalance incurred by fixed task granularity in MapReduce, we propose task split-

ting, which splits long running “straggler” tasks into smaller tasks that yield better load balancing and

higher resource utilization. Prior knowledge, if available, is integrated to achieve better performance.

Our finding that SJF yields optimal job turnaround time without impacting makespan can help system

administrators to tune their systems accordingly if the optimization goal is job turnaround time.

• We propose Resource Stealing to maximize resource utilization, and Benefit Aware Speculative

Execution (BASE) to eliminate the launches of non-beneficial speculative tasks and thus improve

the efficiency of resource usage.

Given the fact that most data centers are underutilized, our proposed resource stealing allows running

tasks to opportunistically utilize more resources than what they are supposed to use without incurring

performance degradation. Different allocation policies of idle resources have been investigated. BASE

predicts the execution time of speculative tasks and launches them only under the condition that they are

expected to shorten job execution. These two mechanisms greatly improve the efficiency of resource

usage.

• To enable cross-domain MapReduce execution, Hierarchical MapReduce (HMR) is presented

which circumvents the administrative boundary of separate grid clusters. To use both MapRe-

duce runtimes and iterative MapReduce runtimes in a single pipeline of jobs, Hybrid MapRe-

duce (HyMR) is proposed which combines the best of both worlds.

In HMR, we extended MapReduce to Map-LocalReduce-GlobalReduce and built a runtime to facilitate

the cross-domain execution of MapReduce jobs. In addition, we built a hybrid workflow management system that supports the Hadoop and Twister runtimes in grid clusters. HMR and HyMR

enable more sophisticated use of MapReduce in complicated environments.

9.4 Future Work

When we deduced closed-form formulas to depict the relationship between system factors and the goodness of data locality, we assumed the replication factor and the number of task slots per slave node are both 1. That assumption limits the applicability of our results. We need to theoretically deduce formulas for more general system configurations, which will help MapReduce users understand how their system factors and data locality interact with each other. System administrators can then configure their systems accordingly based on performance requirements and hardware capacity (e.g. storage space).

Our proposed task splitting has been shown to be beneficial for compute-intensive jobs. As we mentioned, task splitting impacts data locality, which was not considered in our research. Investigating how the dynamic adjustment of task granularity impacts data locality is interesting, because it will reveal whether task splitting benefits other types of applications.

In task splitting, tasks of fixed granularity are first formed and then adjusted based on real-time system state. Another approach to tackle the load imbalance issue is to figure out the optimal granularity and form the “correct” tasks in the first place. In MapReduce, one potential solution is to distribute workload in a “streaming” manner. Initially, an appropriate number of tasks are generated whose granularities are tuned to avoid severe load imbalance. Unlike traditional MapReduce, those initial tasks together may only process a portion of all input data. During execution, the remaining work “flows” to resources whenever they become available. In this way, the work done by each task is no longer fixed, but depends on resource availability, the progress of other tasks, etc. As a result, the approach is adaptive in that workload is distributed in reaction to real-time information about the whole system. However, it may also result in higher management overhead. The granularity of a task should be sufficiently large that the overhead of task scheduling, starting, and destroying is insignificant compared to the actual task execution. Automatically determining the optimal amount of workload assigned to each task is non-trivial and requires further study.

We predict the execution time of tasks according to the progress rate, with the assumption that each task progresses at a constant rate. A more robust estimation is desired if the progress rate is not uniform. There are several techniques that can potentially be applied. Firstly, weights can be associated with progress rates based on when the processing was carried out, so that the progress rate of recent processing has a higher weight than that of “old” processing; the idea is that recent information is more valuable than old information because it reflects the latest state of the node (e.g. health status, workload). Secondly, sophisticated moving average techniques can be used to smooth out intermittent fluctuations. Thirdly, the progress rates of all running tasks on a slave node can be collected and analyzed in a collaborative manner. For example, if all tasks on a node slow down suddenly, we can conjecture with high confidence that the cause is hardware failure or degradation. Combining the techniques mentioned above will enable us to estimate task execution time more accurately.
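The first two techniques can be combined in a simple estimator. The sketch below uses an exponentially weighted moving average of the observed progress rate, with the smoothing factor alpha as an assumed tuning parameter.

```java
// Weighted progress-rate estimation: recent intervals dominate older ones (EWMA sketch).
public class ProgressRateEstimator {
    private final double alpha;       // weight of the newest observation, e.g. 0.3 (assumed)
    private double smoothedRate = -1; // negative means "no observation yet"

    ProgressRateEstimator(double alpha) { this.alpha = alpha; }

    /** Feed the progress rate measured over the latest interval (progress delta per second). */
    void observe(double intervalRate) {
        smoothedRate = (smoothedRate < 0)
                ? intervalRate
                : alpha * intervalRate + (1 - alpha) * smoothedRate;
    }

    /** Estimated seconds until the task finishes, given its current overall progress in [0, 1). */
    double estimateRemaining(double progress) {
        return (1.0 - progress) / smoothedRate;
    }
}
```

The collaborative, per-node analysis could then be layered on top, for example by comparing the smoothed rates of all tasks on a node to detect a sudden common slowdown.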

Resource stealing in Hadoop does not provide much benefit for IO-intensive applications because of contention on the underlying input reader and output writer shared by multiple threads within individual tasks. How to mitigate this synchronization overhead is another interesting research area. It requires modification to the Hadoop core IO code, which has been designed for single-threaded iterator access only. One possible approach is to use lock-free data structures and algorithms. Alternatively, multiple iterators can be exposed to MapReduce so that different portions of an individual block can be accessed in parallel; the task tracker daemon then needs to track which key/value pairs have been processed to avoid duplicate processing.

As we discussed, the hard partitioning of hardware resources into task slots is problematic because: i) the optimal setting is hard to find; and ii) low resource utilization has been observed when there are not sufficient tasks. Our proposed resource stealing is one solution. Another solution yet to be explored is to abandon the concept of task slots entirely. Instead, given a slave node, we decide whether a new task should be assigned based on resource usage (e.g. CPU, disk, and network bandwidth utilization) and the workload of the task (e.g. compute- or IO-intensive). Task workload can be obtained from historical data or user-provided hints, while resource usage can be collected by calling system metric APIs. This approach would alleviate the burden on developers and has the potential to improve efficiency.
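A minimal sketch of such an admission test is shown below; the thresholds, the workload categories, and the source of the utilization metrics are all assumptions.

```java
// Slot-free admission sketch: admit a task based on live node utilization and the task's workload type.
public class SlotFreeAdmission {
    enum Workload { CPU_INTENSIVE, IO_INTENSIVE }

    /** Decide whether a slave node should accept one more task (utilizations are fractions in [0, 1]). */
    static boolean admit(Workload taskType, double cpuUtil, double diskUtil, double netUtil) {
        switch (taskType) {
            case CPU_INTENSIVE:
                return cpuUtil < 0.80;                      // leave some CPU headroom
            case IO_INTENSIVE:
                return diskUtil < 0.70 && netUtil < 0.70;   // avoid disk and network contention
            default:
                return false;
        }
    }
}
```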

Currently, HMR is specifically optimized for compute-intensive applications. For IO-intensive applica- tions, data staging becomes more dominant within overall job execution. How to make HMR suitable for other types of applications as well needs to be studied. If data are pre-uploaded before job submission, data affinity needs to play a more important role in task scheduling to minimize the penalty of data staging.

Our current implementation of HyMR is a prototype that lacks support for monitoring, debugging, and diagnosis tools, which are needed to facilitate everyday use by common users.

9.5 A Better Hadoop

MapReduce/Hadoop is not a one-size-fits-all solution for all parallel computing problems. Besides the issues we have been studying, there are some other areas that can be improved based on the results of some related prior research.

As we mentioned, Hadoop is designed mainly for homogeneous clusters. Although some research has been conducted on optimizing Hadoop in heterogeneous environments, heterogeneity support is still a second-class citizen. The ClassAds and MatchMaking mechanisms can potentially be integrated for that purpose [103, 104]. ClassAds provides a data model that allows users to express both resource offers and resource requests. For example, a machine owner can express the hardware and software specifications (e.g. architecture, operating system) and the amount of resources he would like to contribute. A job scheduler acts as a “consumer” that expresses the resource requirements of a job. Periodic negotiation is run by a central server that invokes a matchmaking algorithm to match requirements and offers. This mechanism has been shown to work well in a heterogeneous environment with dissimilar resources. If incorporated, Hadoop would be able to run on a collection of heterogeneous machines.

Another area that is worth exploring deeply is fault tolerance. Although speculative execution can mitigate the impact of node failure, it incurs duplicate processing of some input data. A more advanced failure recovery mechanism could checkpoint task execution and accelerate the recovery process. RAFTing MapReduce is an initial effort to support fast recovery [108]. The problem becomes trickier if a map operation is stateful and not side-effect free beyond producing intermediate key/value pairs. Process checkpointing techniques, which have been applied in traditional systems such as MPI and Condor, could potentially be integrated into MapReduce/Hadoop.

High availability is usually required in commercial production clusters. The master node of Hadoop is a single point of failure: when the master node is down, the whole system does not function at all (e.g. running jobs cannot complete). Support for automatic failover is desired, which would alleviate the administration burden and reduce downtime. In addition, a single master node may become a performance bottleneck when cluster size scales up. MapReduce 2.0 is a major re-design of the Hadoop architecture which splits resource management and job scheduling/monitoring into two separate daemons [8], making the system more modular and scalable.

In HDFS, the replication factor is configurable, but it is fixed once set by system administrators; by default, it is 3. This approach to determining the replication factor is far from optimal. If a file is accessed repeatedly within a short time period, it becomes a hot spot and a larger replication factor is preferred. In contrast, for archival data that are accessed infrequently, a smaller replication factor is sufficient and saves storage space. So a mechanism that dynamically adjusts the replication factor in real time based on recent access patterns is needed to mitigate the issue. The adjustment can be applied at the block or file level. Block-level adjustment is finer-grained but incurs more metadata and management overhead.
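A sketch of such a policy at the file level is shown below. FileSystem.setReplication is the real HDFS client call for changing a file's replication factor, but the thresholds and the source of the access counts are assumptions.

```java
// Access-driven replication sketch: hot files get more replicas, cold files fewer.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdaptiveReplication {
    static final short COLD = 2, DEFAULT = 3, HOT = 6;   // assumed target replication factors

    static void adjust(FileSystem fs, Path file, long accessesInWindow) throws java.io.IOException {
        short target = DEFAULT;
        if (accessesInWindow > 100) {        // assumed "hot spot" threshold
            target = HOT;
        } else if (accessesInWindow == 0) {  // archival data: reclaim storage space
            target = COLD;
        }
        fs.setReplication(file, target);     // HDFS re-replicates or trims replicas asynchronously
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        adjust(fs, new Path("/data/input/part-00000"), 250);
    }
}
```

Block-level adjustment would require deeper changes, since this client API operates on whole files.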

Bibliography

[1] Advanced Memory Systems for HPC. http://exascaleresearch.labworks.org/ascrOct2011/uploads/dataforms/PRES SNL AdvMem4HPC 111011.pdf.

[2] Berkeley Open Infrastructure for Network Computing. http://boinc.berkeley.edu/.

[3] Clueweb09. http://lemurproject.org/clueweb09.php/.

[4] Hadoop’s Capacity Scheduler. http://hadoop.apache.org/core/docs/current/capacity scheduler.html.

[5] Hands-On Introduction to OpenMP. http://openmp.org/mp-documents/omp-hands-on-SC08.pdf.

[6] Hive performance benchmarks. https://issues.apache.org/jira/browse/HIVE-396.

[7] LSAP Introduction. http://www.assignmentproblems.com/doc/LSAPIntroduction.pdf.

[8] MapReduce 2.0. https://issues.apache.org/jira/browse/MAPREDUCE-279.

[9] Microsoft drops Dryad; puts its big-data bets on Hadoop.

http://www.zdnet.com/blog/microsoft/microsoft-drops-dryad-puts-its-big-data-bets-on-

hadoop/11226.

[10] MPICH. http://www.mcs.anl.gov/research/projects/mpi/mpich1-old/.

[11] MPICH2. http://www.mcs.anl.gov/research/projects/mpich2/.

[12] Oozie: Yahoo!’s workflow engine for Hadoop. http://rvs.github.com/oozie/index.html.

[13] Open MPI. http://www.open-mpi.org/.

[14] OpenStack Swift. http://swift.openstack.org/.

[15] Speculative execution start up condition based on completion time.

https://issues.apache.org/jira/browse/HADOOP-2141.

[16] Spider. http://www.olcf.ornl.gov/kb articles/spider/.

[17] Top500. http://www.top500.org/.

[18] TORQUE. http://www.adaptivecomputing.com/products/open-source/torque/.

[19] Traces of Google workloads. http://code.google.com/p/googleclusterdata/.

[20] Micah Adler, Ying Gong, and Arnold L. Rosenberg. Optimal sharing of bags of tasks in heteroge-

neous clusters. In Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and

architectures, SPAA ’03, pages 1–10, New York, NY, USA, 2003. ACM.

[21] Saurabh Agarwal, Gyu S. Choi, Chita R. Das, Andy B. Yoo, and Shailabh Nagar. Co-Ordinated

Coscheduling in Time-Sharing Clusters through a Generic Framework. Cluster Computing, IEEE

International Conference on, 0:84, 2003.

[22] Parag Agrawal, Daniel Kifer, and Christopher Olston. Scheduling shared scans of large data files.

Proc. VLDB Endow., 1(1):958–969, August 2008.

[23] Ishfaq Ahmad and Yu K. Kwok. A New Approach to Scheduling Parallel Programs Using Task Du-

plication. In Proceedings of the 1994 International Conference on Parallel Processing - Volume 02,

ICPP ’94, pages 47–51, Washington, DC, USA, 1994. IEEE Computer Society.

[24] Bill Allcock, Joe Bester, John Bresnahan, Ann L. Chervenak, Ian Foster, Carl Kesselman, Sam Meder,

Veronika Nefedova, Darcy Quesnel, and Steven Tuecke. Data management and transfer in high-

performance computational grid environments. Parallel Comput., 28(5):749–771, May 2002.

[25] William Allcock, John Bresnahan, Rajkumar Kettimuthu, Michael Link, Catalin Dumitrescu, Ioan

Raicu, and Ian Foster. The Globus Striped GridFTP Framework and Server. In Proceedings of the

2005 ACM/IEEE conference on Supercomputing, volume 0 of SC ’05, Washington, DC, USA, 2005.

IEEE Computer Society.

[26] Ilkay Altintas, Adam Birnbaum, Kim K. Baldridge, Wibke Sudholt, Mark Miller, Celine Amoreira,

Yohann Potier, and Bertram Ludaescher. A Framework for the Design and Reuse of Grid WorkFlows.

In International Workshop on Scientific Aspects of Grid Computing, pages 120–133. Springer-Verlag,

2005.

[27] David P. Anderson. BOINC: A System for Public-Resource Computing and Storage. In Proceedings of

the 5th IEEE/ACM International Workshop on Grid Computing, GRID ’04, pages 4–10, Washington,

DC, USA, 2004. IEEE Computer Society.

[28] Seung H. Bae, Jong Y. Choi, Judy Qiu, and Geoffrey C. Fox. Dimension reduction and visualization of

large high-dimensional data via interpolation. In Proceedings of the 19th ACM International Sympo-

sium on High Performance Distributed Computing, HPDC ’10, pages 203–214, New York, NY, USA,

2010. ACM.

[29] Luiz A. Barroso and Urs Hölzle. The Case for Energy-Proportional Computing. Computer, 40:33–37,

December 2007.

[30] Chaitanya Baru, Reagan Moore, Arcot Rajasekar, and Michael Wan. The SDSC storage resource

broker. In Proceedings of the 1998 conference of the Centre for Advanced Studies on Collaborative

research, CASCON ’98. IBM Press, 1998.

[31] Keren Bergman, Gilbert Hendry, Paul Hargrove, John Shalf, Bruce Jacob, K. Scott Hemmert, Arun

Rodrigues, and David Resnick. Let there be light!: the future of memory systems is photonics and 3D

stacking. In Proceedings of the 2011 ACM SIGPLAN Workshop on Memory Systems Performance and

Correctness, MSPC ’11, pages 43–48, New York, NY, USA, 2011. ACM.

[32] Sandeep N. Bhatt, Fan R. K. Chung, F. Thomson Leighton, and Arnold L. Rosenberg. On Optimal

Strategies for Cycle-Stealing in Networks of Workstations. IEEE Trans. Comput., 46:545–557, May

1997.

[33] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In

Proceedings of the 35th Annual Symposium on Foundations of Computer Science, pages 356–368,

Washington, DC, USA, 1994. IEEE Computer Society.

[34] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. HaLoop: efficient iterative data

processing on large clusters. Proc. VLDB Endow., 3:285–296, September 2010.

[35] Philip H. Carns, Walter, Robert B. Ross, and Rajeev Thakur. Pvfs: A parallel file system for linux

clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Atlanta,

GA, 2000. USENIX Association.

[36] Henri Casanova and Fran Berman. Parameter Sweeps on the Grid with APST. pages 773–787, March

2003.

[37] C. S. Chang. Needs for Extreme Scale DM, Analysis and Visualization in Fusion Particle Code XGC.

http://www.olcf.ornl.gov/wp-content/uploads/2011/01/CSChang Fusion final.pdf, February 2011.

[38] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows,

Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: a distributed storage system for

structured data. In Proceedings of the 7th symposium on Operating systems design and implementation,

OSDI ’06, pages 205–218, Berkeley, CA, USA, 2006. USENIX Association.

[39] Ruay S. Chang, Jih S. Chang, and Shin Y. Lin. Job scheduling and data replication on data grids.

Future Gener. Comput. Syst., 23(7):846–860, August 2007.

[40] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal

Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform

cluster computing. SIGPLAN Not., 40(10):519–538, October 2005.

[41] Jacqueline Chen and John Bell. Combustion Exascale Co-Design Center.

http://www.exascale.org/mediawiki/images/3/38/Talk29-Chen.pdf, April 2011.

[42] Yanpei Chen, Archana S. Ganapathi, Rean Griffith, and Randy H. Katz. Analysis and Lessons from a

Publicly Available Google Cluster Trace. Technical Report UCB/EECS-2010-95, EECS Department,

University of California, Berkeley, June 2010.

[43] A. Chervenak, R. Schuler, C. Kesselman, S. Koranda, and B. Moe. Wide area data replication for

scientific collaborations. In Grid Computing, 2005. The 6th IEEE/ACM International Workshop on,

November 2005.

[44] Ann Chervenak, Ewa Deelman, Ian Foster, Leanne Guy, Wolfgang Hoschek, Adriana Iamnitchi, Carl

Kesselman, Peter Kunszt, Matei Ripeanu, Bob Schwartzkopf, Heinz Stockinger, Kurt Stockinger, and

Brian Tierney. Giggle: a framework for constructing scalable replica location services. In Proceed-

ings of the 2002 ACM/IEEE conference on Supercomputing, Supercomputing ’02, pages 1–17, Los

Alamitos, CA, USA, 2002. IEEE Computer Society Press.

[45] Ann Chervenak, Ian Foster, Carl Kesselman, Charles Salisbury, and Steven Tuecke. The data grid: To-

wards an architecture for the distributed management and analysis of large scientific datasets. Journal

of Network and Computer Applications, 23(3):187–200, 2000.

[46] Ann L. Chervenak, Naveen Palavalli, Shishir Bharathi, Carl Kesselman, and Robert Schwartzkopf.

Performance and Scalability of a Replica Location Service. In Proceedings of the 13th IEEE Inter-

national Symposium on High Performance Distributed Computing, pages 182–191, Washington, DC,

USA, 2004. IEEE Computer Society.

[47] Avery Ching, Alok Choudhary, Kenin Coloma, Wei K. Liao, Robert Ross, and William Gropp. Non-

contiguous I/O Accesses Through MPI-IO. In Proceedings of the 3rd International Symposium on

Cluster Computing and the Grid, CCGRID ’03, Washington, DC, USA, 2003. IEEE Computer Soci-

ety.

[48] Gyu S. Choi, Jin H. Kim, Deniz Ersoz, Andy B. Yoo, and Chita R. Das. Coscheduling in Clusters: Is

It a Viable Alternative? In Proceedings of the 2004 ACM/IEEE conference on Supercomputing, SC

’04, Washington, DC, USA, 2004. IEEE Computer Society.

[49] Cheng T. Chu, Sang K. Kim, Yi A. Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle

Olukotun. Map-Reduce for Machine Learning on Multicore. In Bernhard Schölkopf, John C. Platt,

and Thomas Hoffman, editors, NIPS, pages 281–288. MIT Press, 2006.

[50] Fan Chung, Ronald Graham, Ranjita Bhagwan, Stefan Savage, and Geoffrey M. Voelker. Maximizing

data locality in distributed systems. J. Comput. Syst. Sci., 72:1309–1316, December 2006.

[51] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell

Sears. MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems

design and implementation, NSDI’10, page 21, Berkeley, CA, USA, 2010. USENIX Association.

[52] R. W. Conway, W. L. Maxwell, and L. W. Miller. Theory of scheduling. Addison-Wesley Pub. Co.,

1967.

[53] International Business Machines Corporation. IBM LoadLeveler: User's Guide. Technical report,

International Business Machines Corporation, Kingston, NY, USA, 1993.

[54] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In

Symposium on Operating System Design and Implementation (OSDI), pages 137–150, 2004.

[55] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman,

Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: amazon’s

highly available key-value store. SIGOPS Oper. Syst. Rev., 41(6):205–220, October 2007.

[56] Ewa Deelman, Gurmeet Singh, Mei H. Su, James Blythe, Yolanda Gil, Carl Kesselman, Gaurang

Mehta, Karan Vahi, G. Bruce Berriman, John Good, Anastasia Laity, Joseph C. Jacob, and Daniel S.

Katz. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Sci.

Program., 13(3):219–237, July 2005.

[57] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung H. Bae, Judy Qiu, and Geoffrey

Fox. Twister: a runtime for iterative MapReduce. In Proceedings of the 19th ACM International

Symposium on High Performance Distributed Computing, HPDC ’10, pages 810–818, New York, NY,

USA, 2010. ACM.

[58] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center

network architecture. SIGCOMM Comput. Commun. Rev., 38(4):63–74, August 2008.

[59] Dror G. Feitelson. Job Scheduling in Multiprogrammed Parallel Systems. Research Report RC 19790

(87657), IBM, October 1994.

[60] Dror G. Feitelson and Larry Rudolph. Gang scheduling performance benefits for fine-grain synchro-

nization. Journal of Parallel and Distributed Computing, 16(4):306–318, 1992.

[61] Geoffrey C. Fox and Dennis Gannon. Special Issue: Workflow in Grid Systems: Editorials. Concurr.

Comput. : Pract. Exper., 18(10):1009–1019, August 2006.

[62] Sanjay Ghemawat, Howard Gobioff, and Shun T. Leung. The Google file system. SIGOPS Oper. Syst.

Rev., 37(5):29–43, October 2003.

[63] Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica.

Dominant resource fairness: fair allocation of multiple resource types. In Proceedings of the 8th

USENIX conference on Networked systems design and implementation, NSDI’11, page 24, Berkeley,

CA, USA, 2011. USENIX Association.

[64] O. Gotoh. An improved algorithm for matching biological sequences. Journal of molecular biology,

162(3):705–708, December 1982.

[65] Yunhong Gu and Robert L. Grossman. Sector and Sphere: the design and implementation of a high-

performance data cloud. Philosophical Transactions of the Royal Society - Series A: Mathematical,

Physical and Engineering Sciences, 367(1897):2429–2445, 2009.

[66] L. Gwennap. New algorithm improves branch prediction. Microprocessor Report, 9(4):17–21, 1995.

[67] S. Hammoud, Maozhen Li, Yang Liu, N. K. Alham, and Zelong Liu. MRSim: A discrete event based

MapReduce simulator. 6:2993–2997, August 2010.

[68] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A self-tuning

system for big data analytics. In Proceedings of CIDR’11, Asilomar, California, USA, 2011.

[69] Joshua Hursey, Timothy I. Mattox, and Andrew Lumsdaine. Interconnect agnostic checkpoint/restart

in Open MPI. In HPDC ’09: Proceedings of the 18th ACM international symposium on High Perfor-

mance Distributed Computing, pages 49–58, New York, NY, USA, 2009. ACM.

[70] Joshua Hursey, Thomas Naughton, Geoffroy Vallée, and Richard L. Graham. A log-scaling fault

tolerant agreement algorithm for a fault tolerant MPI. In EuroMPI 2011: Proceedings of the 18th

EuroMPI Conference, Santorini, Greece, September 2011.

[71] William E. Allcock, Ian Foster, and Ravi Madduri. Reliable Data Transport: A Critical Service for the Grid. In Building Service Based Grids Workshop, Global Grid Forum 11, 2004.

[72] Oscar H. Ibarra and Chul E. Kim. Heuristic Algorithms for Scheduling Independent Tasks on Non-

identical Processors. J. ACM, 24(2):280–289, April 1977.

[73] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-

parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41(3):59–72, March

2007.

[74] Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg.

Quincy: fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd

symposium on Operating systems principles, SOSP ’09, pages 261–276, New York, NY, USA, 2009.

ACM.

[75] Morris A. Jette. Performance characteristics of gang scheduling in multiprogrammed environments. In

Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’97,

pages 1–12, New York, NY, USA, 1997. ACM.

[76] Morris A. Jette, Andy B. Yoo, and Mark Grondona. SLURM: Simple Linux Utility for Resource

Management. In Lecture Notes in Computer Science: Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP) 2003, pages 44–60. Springer-Verlag, 2003.

[77] Chao Jin and Rajkumar Buyya. MapReduce Programming Model for .NET-Based Cloud Computing.

In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, volume 5704 of

Euro-Par ’09, pages 417–428, Berlin, Heidelberg, 2009. Springer-Verlag.

[78] William E. Johnston, William Greiman, Gary Hoo, Jason Lee, Brian Tierney, Craig Tull, and Douglas

Olson. High-speed distributed data handling for on-line instrumentation systems. In Proceedings of the

1997 ACM/IEEE conference on Supercomputing (CDROM), Supercomputing ’97, pages 1–19, New

York, NY, USA, 1997. ACM.

[79] Karthik Kambatla, Abhinav Pathak, and Himabindu Pucha. Towards optimizing hadoop provisioning

in the cloud. In Proceedings of the 2009 conference on Hot topics in cloud computing, HotCloud’09,

Berkeley, CA, USA, 2009. USENIX Association.

[80] Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. An Analysis of Traces from a Pro-

duction MapReduce Cluster. In Proceedings of the 2010 10th IEEE/ACM International Conference on

Cluster, Cloud and Grid Computing, CCGRID ’10, pages 94–103, Washington, DC, USA, May 2010.

IEEE Computer Society.

[81] David G. Kendall. Stochastic Processes Occurring in the Theory of Queues and their Analysis by the

Method of the Imbedded Markov Chain. The Annals of Mathematical Statistics, 24(3):338–354, 1953.

[82] Tevfik Kosar and Miron Livny. Stork: Making Data Placement a First Class Citizen in the Grid.

In Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS’04),

ICDCS ’04, pages 342–349, Washington, DC, USA, 2004. IEEE Computer Society.

[83] Rob Latham, Rob Ross, and Rajeev Thakur. The Impact of File Systems on MPI-IO Scalability. pages

87–96. 2004.

[84] Edward A. Lee. The Problem with Threads. Computer, 39(5):33–42, May 2006.

[85] Charles E. Leiserson. Fat-trees: universal networks for hardware-efficient supercomputing. IEEE

Trans. Comput., 34(10):892–901, October 1985.

[86] Hui Li, Yang Ruan, Yuduo Zhou, Judy Qiu, and Geoffrey Fox. Design patterns for scientific appli-

cations in DryadLINQ CTP. In Proceedings of the second international workshop on Data intensive

computing in the clouds, DataCloud-SC ’11, pages 61–70, New York, NY, USA, 2011. ACM.

[87] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor - a hunter of idle workstations. pages 104–111,

1988.

[88] Wei Lu, Jared Jackson, Jaliya Ekanayake, Roger S. Barga, and Nelson Araujo. Performing Large

Science Experiments on Azure: Pitfalls and Solutions. In Proceedings of the 2010 IEEE Second

International Conference on Cloud Computing Technology and Science, CLOUDCOM ’10, pages

209–217, Washington, DC, USA, 2010. IEEE Computer Society.

[89] Yuan Luo, Zhenhua Guo, Yiming Sun, Beth Plale, Judy Qiu, and Wilfred W. Li. A hierarchical frame-

work for cross-domain MapReduce execution. In Proceedings of the second international workshop

on Emerging computational methods for the life sciences, ECMLS ’11, pages 15–22, New York, NY,

USA, 2011. ACM.

[90] Wolfgang Gentzsch (Sun Microsystems). Sun Grid Engine: Towards Creating a Compute Power Grid. In Proceedings of the 1st International Symposium on Cluster Computing and the Grid, CCGRID ’01, Washington, DC, USA, 2001. IEEE Computer Society.

[91] H. H. Mohamed and D. H. J. Epema. An evaluation of the close-to-files processor and data co-

allocation policy in multiclusters. pages 287–298, 2004.

[92] Ahuva W. Mu’alem and Dror G. Feitelson. Utilization, Predictability, Workloads, and User Runtime

Estimates in Scheduling the IBM SP2 with Backfilling. IEEE Trans. Parallel Distrib. Syst., 12(6):529–

543, June 2001.

[93] Shailabh Nagar, Ajit Banerjee, Anand Sivasubramaniam, and Chita R. Das. Alternatives to coschedul-

ing a network of workstations. J. Parallel Distrib. Comput., 59:302–327, November 1999.

[94] Lionel M. Ni and Philip K. McKinley. A Survey of Wormhole Routing Techniques in Direct Networks.

Computer, 26(2):62–76, February 1993.

[95] U. S. Department of Energy. Scientific Grand Challenges: Fusion Energy Sciences and the

Role of Computing at the Extreme Scale. http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Fusion_report.pdf, March 2009.

[96] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim

Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat, and Peter Li. Taverna: a tool for the com-

position and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, November

2004.

[97] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin:

a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international

conference on Management of data, SIGMOD ’08, pages 1099–1110, New York, NY, USA, 2008.

ACM.

[98] Balaji Palanisamy, Aameek Singh, Ling Liu, and Bhushan Jain. Purlieus: locality-aware resource

allocation for MapReduce in a cloud. In Proceedings of 2011 International Conference for High

Performance Computing, Networking, Storage and Analysis, SC ’11, New York, NY, USA, 2011.

ACM.

[99] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden,

and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of

the 35th SIGMOD international conference on Management of data, SIGMOD ’09, pages 165–178,

New York, NY, USA, 2009. ACM.

[100] Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan. Interpreting the data: Parallel analysis

with Sawzall. Sci. Program., 13(4):277–298, October 2005.

[101] Xiao Qin, Hong Jiang, Adam Manzanares, Xiaojun Ruan, and Shu Yin. Dynamic load balancing for

I/O-intensive applications on clusters. Trans. Storage, 5(3), November 2009.

[102] Xiaohong Qiu, Jaliya Ekanayake, Scott Beason, Thilina Gunarathne, Geoffrey Fox, Roger Barga,

and Dennis Gannon. Cloud technologies for bioinformatics applications. In Proceedings of the 2nd

Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS ’09, New York, NY, USA,

2009. ACM.

[103] Rajesh Raman, Miron Livny, and Marvin Solomon. Matchmaking: Distributed Resource Management

for High Throughput Computing. In Proceedings of the Seventh IEEE International Symposium on

High Performance Distributed Computing (HPDC7), Chicago, IL, July 1998.

[104] Rajesh Raman, Miron Livny, and Marvin Solomon. Resource Management through Multilateral

Matchmaking. In Proceedings of the Ninth IEEE Symposium on High Performance Distributed Com-

puting (HPDC9), pages 290–291, Pittsburgh, PA, August 2000.

[105] Rajesh Raman, Miron Livny, and Marvin Solomon. Policy Driven Heterogeneous Resource Co-

Allocation with Gangmatching. In Proceedings of the 12th IEEE International Symposium on High

Performance Distributed Computing, HPDC ’03, Washington, DC, USA, 2003. IEEE Computer Soci-

ety.

[106] Kavitha Ranganathan and Ian Foster. Decoupling Computation and Data Scheduling in Distributed

Data-Intensive Applications. In Proceedings of the 11th IEEE International Symposium on High Per-

formance Distributed Computing, HPDC ’02, Washington, DC, USA, 2002. IEEE Computer Society.

[107] Kavitha Ranganathan and Ian T. Foster. Identifying Dynamic Replication Strategies for a High-

Performance Data Grid. In Proceedings of the Second International Workshop on Grid Computing,

GRID ’01, pages 75–86, London, UK, 2001. Springer-Verlag.

[108] Jorge-Arnulfo Quiané-Ruiz, Christoph Pinkel, Jörg Schad, and Jens Dittrich. RAFTing MapReduce:

Fast recovery on the RAFT. In Proceedings of the 2011 IEEE 27th International Conference on Data

Engineering, ICDE ’11, pages 589–600, Washington, DC, USA, April 2011. IEEE Computer Society.

[109] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and implemen-

tation of the Sun Network Filesystem. In Proc. Summer 1985 USENIX Conf., pages 119–130, Portland

OR (USA), 1985.

[110] Frank Schmuck and Roger Haskin. GPFS: A Shared-Disk File System for Large Computing Clusters.

In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST ’02, Berkeley,

CA, USA, 2002. USENIX Association.

[111] B. Schroeder and M. Harchol-Balter. Evaluation of task assignment policies for supercomputing

servers: the case for load unbalancing and fairness. In High-Performance Distributed Computing,

2000. Proceedings. The Ninth International Symposium on, pages 211–220, 2000.

[112] Philip Schwan. Lustre: Building a File System for 1,000-node Clusters. In Proceedings of the 2003

Linux Symposium, July 2003.

[113] Ruchir Shah, Bhardwaj Veeravalli, and Manoj Misra. On the Design of Adaptive and Decentralized

Load Balancing Algorithms with Load Estimation for Computational Grid Environments. IEEE Trans.

Parallel Distrib. Syst., 18:1675–1686, December 2007.

[114] K. Shvachko, Hairong Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In

Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10, May

2010.

[115] Oliver Sinnen. Task Scheduling for Parallel Systems (Wiley Series on Parallel and Distributed Com-

puting). Wiley-Interscience, 2007.

[116] Patrick Sobalvarro, Scott Pakin, William E. Weihl, and Andrew A. Chien. Dynamic Coscheduling

on Workstation Clusters. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel

Processing, pages 231–256, London, UK, 1998. Springer-Verlag.

[117] Veridian Systems. OpenPBS v2.3: The portable batch system software. Technical report, Veridian Systems, Mountain View, CA, USA.

[118] Douglas Thain, John Bent, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Miron Livny. Gathering at the

well: Creating communities for grid I/O. In Proceedings of Supercomputing 2001, Denver, Colorado,

November 2001. IEEE CS Press, Los Alamitos, CA, USA.

[119] Rajeev Thakur, William Gropp, and Ewing Lusk. A case for using MPI’s derived datatypes to improve

I/O performance. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing (CDROM),

Supercomputing ’98, pages 1–10, Washington, DC, USA, 1998. IEEE Computer Society.

[120] Rajeev Thakur, William Gropp, and Ewing Lusk. On implementing MPI-IO portably and with high

performance. In Proceedings of the sixth workshop on I/O in parallel and distributed systems, IOPADS

’99, pages 23–32, New York, NY, USA, 1999. ACM.

[121] A. Thusoo, J. S. Sarma, N. Jain, Zheng Shao, P. Chakka, Ning Zhang, S. Antony, Hao Liu, and

R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In Data Engineering (ICDE), 2010

IEEE 26th International Conference on, pages 996–1005, March 2010.

[122] Ashish Thusoo, Zheng Shao, Suresh Anthony, Dhruba Borthakur, Namit Jain, Joydeep S. Sarma,

Raghotham Murthy, and Hao Liu. Data warehousing and analytics infrastructure at facebook. In

Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, pages 1013–

1020, New York, NY, USA, 2010. ACM.

[123] Maurício Tsugawa and José A. B. Fortes. A virtual network (ViNe) architecture for grid computing.

In Proceedings of the 20th international conference on Parallel and distributed processing, IPDPS’06,

page 148, Washington, DC, USA, 2006. IEEE Computer Society.

[124] UPC Consortium. UPC Language Specifications, v1.2. Tech Report LBNL-59208, Lawrence Berkeley

National Lab, 2005.

[125] Chuliang Weng and Xinda Lu. Heuristic scheduling for bag-of-tasks applications in combination with

QoS in the computational grid. Future Gener. Comput. Syst., 21:271–280, February 2005.

[126] Rich Wolski, Neil T. Spring, and Jim Hayes. The network weather service: a distributed resource

performance forecasting service for metacomputing. Future Gener. Comput. Syst., 15(5-6):757–768,

October 1999.

[127] Joachim Worringen, Jesper L. Träff, and Hubert Ritzdorf. Fast Parallel Non-Contiguous File Access.

In Proceedings of the 2003 ACM/IEEE conference on Supercomputing, SC ’03, New York, NY, USA,

2003. ACM.

[128] Hung C. Yang, Ali Dasdan, Ruey L. Hsiao, and D. Stott Parker. Map-reduce-merge: simplified re-

lational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international

conference on Management of data, SIGMOD ’07, pages 1029–1040, New York, NY, USA, 2007.

ACM.

[129] Katherine Yelick, Dan Bonachea, Wei Y. Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L.

Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala,

Jimmy Su, Michael Welcome, and Tong Wen. Productivity and performance using partitioned global

address space languages. In Proceedings of the 2007 international workshop on Parallel symbolic

computation, PASCO ’07, pages 24–32, New York, NY, USA, 2007. ACM.

[130] Jia Yu and Rajkumar Buyya. A Taxonomy of Workflow Management Systems for Grid Computing.

Journal of Grid Computing, 2005.

[131] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: A

simple technique for achieving locality and fairness in cluster scheduling. In Proc. of EuroSys, pages

265–278. ACM, 2010.

[132] Matei Zaharia. The Hadoop Fair Scheduler. http://developer.yahoo.net/blogs/hadoop/FairSharePres.ppt.

[133] Matei Zaharia, Dhruba Borthakur, Joydeep S. Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Sto-

ica. Job Scheduling for Multi-User MapReduce Clusters. Technical report, EECS Department, University of California, Berkeley, 2009.

[134] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark:

cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in

cloud computing, HotCloud’10, page 10, Berkeley, CA, USA, 2010. USENIX Association.

[135] Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. Improving MapRe-

duce performance in heterogeneous environments. In Proceedings of the 8th USENIX conference on

Operating systems design and implementation, OSDI’08, pages 29–42, Berkeley, CA, USA, 2008.

USENIX Association.

[136] Xiaodong Zhang, Li Xiao, and Yanxia Qu. Improving Distributed Workload Performance by Sharing

Both CPU and Memory Resources. In Proceedings of the 20th International Conference on

Distributed Computing Systems ( ICDCS 2000), ICDCS ’00, Washington, DC, USA, 2000. IEEE

Computer Society.

Curriculum Vitae

Name of Author: Zhenhua Guo

Date of Birth: September 4, 1983

Place of Birth: Shuozhou, Shanxi, China

Education:

March, 2009 Master of Science, Computer Science

Indiana University, Bloomington, IN, USA

June, 2006 Bachelor of Science, Computer Science and Technology

Peking University, Beijing, China

Experience:

September, 2006 - present Research Assistant

Community Grids Laboratory, Indiana University

Bloomington, Indiana, USA

June, 2010 - August, 2010 Summer Intern, Amazon

Amazon.com LLC, Seattle, Washington, USA

June, 2011 - August, 2011 Summer Intern, Amazon

Amazon.com LLC, Seattle, Washington, USA

Honors/Affiliations:

Graduate Student Scholarship - Indiana University, Bloomington (2006 - 2012)

Lenovo Student Scholarship - Peking University, China (2004)

Committer for Apache Rave