VIRTUAL CLUSTER MANAGEMENT FOR ANALYSIS OF GEOGRAPHICALLY DISTRIBUTED AND IMMOVABLE DATA

Yuan Luo

Submitted to the faculty of the University Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy in the School of Informatics and Computing, Indiana University, August 2015.

Accepted by the Graduate Faculty, Indiana University, in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Doctoral Committee:
Beth Plale, Ph.D. (Chairperson)
Geoffrey Fox, Ph.D.
Judy Qiu, Ph.D.
Yuqing Wu, Ph.D.
Philip Papadopoulos, Ph.D.

August 7th, 2015

Copyright © 2015 Yuan Luo

In loving memory of my grandfather. To my parents, my cousin, and all family members.

Acknowledgements

It has been a great pleasure working with the faculty, staff, and students at Indiana University during my seven-year PhD study. I would like to express my sincere gratitude to my advisor, Professor Beth Plale. This dissertation would never have been possible without her guidance, encouragement, challenges, and support throughout my research at Indiana University. I am grateful for the many opportunities that Beth has provided for me to research, collaborate, publish, teach, and travel independently.

I thank my other research committee members, Professor Geoffrey Fox, Professor Judy Qiu, Professor Yuqing Wu, and Dr. Philip Papadopoulos, for their invaluable discussion, ideas, and feedback. I also would like to thank Professor Xiaohui Wei, Dr. Peter Arzberger, and Dr. Wilfred Li for their mentorship before I joined Indiana University. I would never have started my PhD without their influence.

I thank all current and past members of the Data to Insight Center at Indiana University for making the lab an enjoyable and stimulating place to work. I would like to name these individuals in particular: Peng Chen, Yiming Sun, Dr. You-Wei Cheah, Zong Peng, Jiaan Zeng, Guangchen Ruan, Quan Zhou, Dr. Scott Jensen, Dr. Devarshi Ghoshal, Felix Terkhorn, Jenny Olmes-Stevens, Jodi Stern, and Robert Ping. I also thank Tak-Lon Wu and Zhenhua Guo from the Digital Science Center for multiple discussions.

I want to thank all mentors, colleagues, and friends in the PRAGMA community for research discussions, project collaborations, and hospitality: Shava Smallen, Dr. Kevin Kejun Dong, Professor Jose Fortes, Nadya Williams, Cindy Zheng, Dr. Yoshio Tanaka, Professor Osamu Tatebe, Teri Simas, and many more.

I thank all faculty and staff at the School of Informatics and Computing for their excellent teaching and service. I am grateful to the school for providing me an academic fellowship at the time of enrollment. I would like to thank Jenny Olmes-Stevens, Marisha Zimmerman, and Yi Qiao for all their help in thesis proofreading.

Finally, I thank my family for all of the love, support, encouragement, and prayers throughout my journey.

Abstract

Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large, and thus unable to move. A large dataset in Malaysia could be ecologically sensitive, for instance, and unable to be moved outside the country's boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: with setup and control, with managing the behavior of the virtual cluster, and with interoperability issues across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs.
Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, though once the data has been processed, the aggregated or statistical results can be shared across those boundaries.

This dissertation addresses the management and control of a widely distributed virtual cluster holding sensitive or otherwise immovable datasets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms, and in recognition of sensitive data, it can establish a single network overlay over widely distributed clusters. We define a novel class of data, notably immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where it is needed. We draw from our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which is pinned data. The applications implemented in HMR use an extended MapReduce model in which computations are expressed as three functions: Map, Reduce, and GlobalReduce. Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimal, and that HMR outperforms the traditional MapReduce model when processing a particular class of applications. The evaluations also show that information sharing between resources and applications through the VCC shortens the hierarchical data processing time, as well as satisfying the constraints on the pinned data.

Contents

1 Introduction
   1.1 Motivation
   1.2 Challenges and Contributions
      1.2.1 Multi-Cloud Controllability and Interoperability
      1.2.2 Manage, Access, and Process of Pinned Data
   1.3 Definition of Terminologies
   1.4 Dissertation Outline
2 Related Work
   2.1 Multi-Cloud Controllability and Interoperability
   2.2 Manage, Access, and Use of Pinned Data
   2.3 Background Related Work
3 Background Work
   3.1 Hierarchical MapReduce Framework
      3.1.1 Architecture
      3.1.2 Programming Model
   3.2 HMR Job Scheduling
      3.2.1 Compute Capacity Aware Scheduling
      3.2.2 Data Location Aware Scheduling
      3.2.3 Topology and Data Location Aware Scheduling
   3.3 Hierarchical MapReduce Applications
      3.3.1 Compute-Intensive Applications
      3.3.2 Data-Intensive Applications
   3.4 HMR Framework Evaluation
      3.4.1 Evaluation of AutoDock HMR
      3.4.2 Gfarm Over Hadoop Clusters
4 Virtual Cluster Controller
   4.1 Architecture
      4.1.1 VCC Design
      4.1.2 VCC Implementation
      4.1.3 PRAGMA Cloud
   4.2 Specifications of Cloud Resources, Applications and Data
      4.2.1 Deployment vs Execution
      4.2.2 Resource, Application and Data Specifications
      4.2.3 Runtime Information for Provisioned Resource
   4.3 VCC Resource Allocation: A Matchmaking Process
      4.3.1 Application Aware Resource Allocation
      4.3.2 VCC Matchmaker
   4.4 VCC Framework Evaluation
      4.4.1 Testbed
      4.4.2 VCC Overhead Evaluation
      4.4.3 Network Overhead Evaluation
   4.5 Information Sharing Evaluation
   4.6 Summary
5 Manage, Access, and Use of Pinned Data
   5.1 Introduction
   5.2 Pinned Data Model
      5.2.1 Pinned Data (PND)
      5.2.2 Pinned Data Suitcase (PDS)
      5.2.3 Access Pinned Data
   5.3 Pinned Data Processing with HMR+
      5.3.1 HMR+ Architecture
      5.3.2 Pinned-Data Applications
   5.4 Pinned Data Evaluation
      5.4.1 Experimental Setup
      5.4.2 PDS Creation Evaluation
      5.4.3 PDS Server Evaluation
      5.4.4 Pinned Data on HMR+ Evaluation
   5.5 Summary
6 Conclusion and Future Work
   6.1 Summary
   6.2 Conclusion
   6.3 Future Work
Bibliography
Curriculum Vitae

List of Figures

3.1 Hierarchical MapReduce Architecture
3.2 HMR Data Flow
3.3 HMR Execution Steps
3.4 Local cluster MapReduce execution time based on different numbers of Map tasks
3.5 (a) Two-way data movement cost of γ-weighted partitioned datasets: local MapReduce inputs and outputs; (b) local MapReduce turnaround time of γ-weighted datasets, including data movement cost
3.6 (a) Two-way data movement cost of γθ-weighted partitioned datasets: local MapReduce inputs and outputs; (b) local MapReduce turnaround time of γθ-weighted datasets, including data movement cost
4.1 Resource management system for hybrid cloud infrastructures
4.2 Virtual Cluster Controller Architecture
4.3 Architecture of VCC-HTCondor Pool
4.4 VCC Enabled PRAGMA Cloud
4.5 Runtime topology information of a virtual cluster. VCk is a subset of a virtual cluster on the physical cluster k; Xij represents network bandwidth (Bij) and latency (Lij) measured from VCi to VCj
4.6 Resource Allocation Model
4.7 Matchmaker Architecture
4.8 Matchmaker Engine flowchart
4.9 VCC overhead with ad-hoc IPOP installation
4.10 VCC overhead with IPOP pre-installed
4.11 TCP/UDP Bandwidth and RTT Tests
4.12 TCP/UDP Bandwidth and RTT Tests over IPOP
4.13 Data-intensive HMR application execution time using 4 algorithm combinations under different data distributions. The data transfer speed between IU and SDSC is 3 MB/s
4.14 Data-intensive HMR application execution time using 4 algorithm combinations under different data distributions. The data transfer speed between IU and SDSC is 5 MB/s
4.15 Data-intensive HMR application execution time using 4 algorithm combinations under different data distributions. The data transfer speed between IU and SDSC is 7 MB/s
4.16 Speedup of the AARA+TDLAS algorithm combination over other algorithm combinations. The x-axis indicates the data transfer speed ratio of LAN over WAN
5.1 Pinned Data Suitcase with Local Data