Data Management in Dynamic Distributed Computing Environments
Total Page:16
File Type:pdf, Size:1020Kb
DATA MANAGEMENT IN DYNAMIC DISTRIBUTED COMPUTING ENVIRONMENTS BY IAN ROBERT KELLEY A dissertation submitted to the faculty of Cardiff University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science Cardiff School of Computer Science & Informatics Cardiff University June 2012 Declaration This work has not been submitted in substance for any other degree or award at this or any other university or place of learning, nor is being submitted concur- rently in candidature for any degree or other award. Signed: Date: 30 June 2012 STATEMENT 1 This thesis is being submitted in partial fulfillment of the requirements for the degree of Ph.D. Signed: Date: 30 June 2012 STATEMENT 2 This thesis is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by explicit references. The views expressed are my own. Signed: Date: 30 June 2012 STATEMENT 3 I hereby give consent for my thesis, if accepted, to be available for photocopying and for inter-library loan, and for the title and summary to be made available to outside organizations. Signed: Date: 30 June 2012 i Acknowledgements The journey leading to this dissertation’s completion has been long, with many challenges and unexpected twists and turns along the way. It all began nearly nine years ago, when the idea to pursue a computer science doctorate was conceived during my time as a research programmer in Berlin. This notion evolved and became a goal of mine, which I pursued during my subsequent move to Louisiana. Two years later, I was accepted into a Ph.D. program at Cardiff University, where I began my research in the field of distributed systems. Throughout this experience, I have made many new friends and colleagues, and I thoroughly enjoyed life at Cardiff University, first as a student and later as a researcher. I would like to thank many of the people I encountered on the road leading here, not only for their help in my research, but most of all for their friendship. First, I would like to recognize Edward Seidel and Gabrielle Allen, who were there in the very beginning and started me on this path. Next, and of special note, are the members of “the group” at Cardiff: Matthew Shields, Andrew Harrison, and my supervisor (and friend) Ian Taylor. Each of you, in your own way, made my time in the U.K. more enjoyable and fruitful. I miss our lively after-work debates and late-night curries. I am grateful to the Cardiff School of Computer Science & Informatics, both for the support its members have shown me and also for making this all possible. I would like to acknowledge my colleagues from the Enabling Desktop Grids for e-Science and European Desktop Grid Initiative projects, who contributed many ideas and helped drive my research. ii Lastly, I appreciate the support given to me over the years by my family and friends. You have constantly reminded me to complete my dissertation, which finally resulted in pen being put to paper (or, rather, fingers to keys) at long last. I would especially like to thank my aunt Victoria Scott and my grandmother, Carolyn Scott, for their exceptional editorial assistance. My grandmother was not only steadfast in her encouragement, but also came out of retirement at age 90 to proofread my dissertation! I dedicate this work to my son Nils and my partner Erica Hammer. At the time of this writing Nils is 11, yet was much younger when it all began. I hope one day he will appreciate the sacrifices that were made so this research could see the light of day. iii Abstract Data management in parallel computing systems is a broad and increasingly im- portant research topic. As network speeds have surged, so too has the movement to transition storage and computation loads to wide-area network resources. The Grid, the Cloud, and Desktop Grids all represent different aspects of this move- ment towards highly-scalable, distributed, and utility computing. This dissertation contends that a peer-to-peer (P2P) networking paradigm is a natural match for data sharing within and between these heterogeneous network architectures. Peer-to-peer methods such as dynamic discovery, fault-tolerance, scalability, and ad-hoc security infrastructures provide excellent mappings for many of the requirements in today’s distributed computing environment. In recent years, volunteer Desktop Grids have seen a growth in data through- put as application areas expand and new problem sets emerge. These increasing data needs require storage networks that can scale to meet future demand while also facilitating expansion into new data-intensive research areas. Current prac- tices are to mirror data from centralized locations, a technique that is not practical for growing data sets, dynamic projects, or data-intensive applications. The fusion of Desktop and Service Grids provides an ideal use-case to re- search peer-to-peer data distribution strategies in a hybrid environment. Desktop Grids have a data management gap, while integration with Service Grids raises new challenges with regard to cross-platform design. The work undertaken here is two-fold: first it explores how P2P techniques can be leveraged to meet the iv data management needs of Desktop Grids, and second, it shows how the same distribution paradigm can provide migration paths for Service Grid data. The result of this research is a Peer-to-Peer Architecture for Data-Intensive Cycle Sharing (ADICS) that is capable not only of distributing volunteer computing data, but also of providing a transitional platform and storage space for migrating Service Grid jobs to Desktop Grid environments. v Contents 1 Introduction1 1.1 Thesis Goals..............................8 1.2 Thesis Overview............................ 12 1.3 Contributions.............................. 14 2 Background, Motivation, and Related Work 16 2.1 Grid Computing............................. 17 2.1.1 Conflicting Views........................ 20 2.1.2 Different Priorities....................... 21 2.1.3 My View of the Grid...................... 23 2.1.4 Service Grids and Desktop Grids Compared........ 24 2.2 Cloud Computing............................ 27 2.2.1 Flexibility and Scalability Benefits............... 28 2.2.2 Cloud Computing as a Business Model........... 29 2.2.3 High Performance versus Highly Scalable Computing... 30 2.3 Volunteer Computing.......................... 31 2.3.1 Berkeley Open Infrastructure for Network Computing... 34 2.3.2 XtremWeb........................... 36 2.3.3 Volunteer Computing Application Profiles.......... 37 2.4 Service and Desktop Grid Interoperability.............. 42 2.4.1 Enabling Grids for E-sciencE (EGEE)............ 43 2.4.2 Enabling Desktop Grids for e-Science (EDGeS)....... 46 2.4.3 Data Management Obstacles................. 49 2.5 Peer-to-Peer Networking........................ 52 vi CONTENTS 2.5.1 The Explosion of P2P File-Sharing.............. 53 2.5.2 Peer-to-Peer File Distribution Taxonomy........... 54 2.5.3 Flat Networks.......................... 55 2.5.4 Centralized/Decentralized................... 57 2.5.5 Distributed Hash Tables.................... 59 2.5.6 Decentralized Super-Peer Topologies............ 61 2.5.7 Tightly Coupled LAN-Oriented File-Sharing Systems.... 63 2.6 Bridging the Data Management Gap................. 65 3 Analysis and Design 70 3.1 Methodology.............................. 71 3.2 Problem Scope and Analysis..................... 72 3.3 Application Environment........................ 74 3.4 Design Considerations......................... 77 3.4.1 Scalability and Network Topology............... 78 3.4.2 Data Integrity and Security.................. 80 3.4.3 User Security and Client-Side Configuration......... 83 3.4.4 Legacy Software Integration.................. 85 3.5 Analysis of P2P Network Architectures................ 87 3.5.1 BitTorrent............................ 88 3.5.2 Super-Peer Topologies..................... 89 3.6 Conclusions............................... 91 4 A Peer-to-Peer Architecture for Data-Intensive Cycle Sharing 94 4.1 Architectural Model........................... 95 4.2 Roles.................................. 97 4.3 Data Sharing Protocol......................... 102 4.3.1 Publishing and Replication.................. 102 4.3.2 Downloading.......................... 104 4.4 Evaluation Against Requirements................... 106 4.4.1 Scalability and Network Topology............... 106 4.4.2 Data Integrity and Security.................. 108 4.4.3 User Security and Client-Side Configuration......... 109 4.4.4 Legacy Software Integration.................. 111 4.5 Validation of Design through Experimentation............ 112 4.6 Summary................................ 124 5 Implementation and Integration with Service and Desktop Grids 127 vii CONTENTS 5.1 Attic................................... 129 5.1.1 Introduction........................... 130 5.1.2 Message Types......................... 134 5.1.3 Component Implementation.................. 136 5.1.4 Role Implementations..................... 140 5.1.5 Downloading.......................... 144 5.1.6 Attic URLs........................... 149 5.1.7 Security............................. 150 5.1.8 Configuration.......................... 151 5.1.9 Summary............................ 154 5.2 Data Migration Using Attic....................... 156 5.2.1 Data Transition from gLite ! DG............... 156 5.2.2 Data Transition from ARC ! DG............... 164 5.2.3 Data Transition from UNICORE ! DG............ 168 5.3 BOINC Integration..........................