AGH University of Science and Technology
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science

Doctoral Thesis

Transparent Data Access in Federated Computational Infrastructures

Author: mgr inż. Michał Wrzeszcz
Supervisor: Prof. dr hab. inż. Jacek Kitowski

Co-supervisor: dr hab. inż. Renata Słota

Kraków, Poland, April 2019

Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie
Wydział Informatyki, Elektroniki i Telekomunikacji
Katedra Informatyki

Rozprawa doktorska

Transparentny Dostęp do Danych w Sfederalizowanych Infrastrukturach Obliczeniowych

Autor: mgr inż. Michał Wrzeszcz
Promotor: Prof. dr hab. inż. Jacek Kitowski

Promotor pomocniczy: dr hab. inż. Renata Słota

Kraków, Kwiecień 2019

I would like to thank Professor Jacek Kitowski, my PhD supervisor, for his guidance, support and valuable advice that helped shape this thesis. My sincere thanks also go to Dr. hab. Renata Słota for suggestions and fruitful discussions during the writing of this dissertation. I am also grateful to ACC Cyfronet AGH for the provision of resources required to verify the thesis. Among my colleagues from ACC Cyfronet AGH and the Department of Computer Science, there is one person I want to especially thank and express my gratitude to: the leader of the Onedata team, Dr. Łukasz Dutka, who supported me with his knowledge and passion. I also owe thanks to my colleagues from the Onedata team for the great time working together, and to Piotr Nowakowski for consulting the text of the thesis. Last but not least, I would like to thank my family: my wife and my daughter, for their patience and forbearance.

Abstract

Current scientific problems require strong support from data access and management tools, especially in terms of data processing performance and ease of access. However, when analysing elements that influence user operations, it is impossible to choose a single set of mutually non-exclusive features that satisfy all the requirements of data access stakeholders. Thus, the author has decided to study how a large-scale data access system should operate in order to meet the needs of a multiorganizational community. The author has identified context, represented by metadata, as a key aspect of the solution. On this basis the author postulates that context awareness enables data to be provisioned to users in a transparent manner while maintaining quality of access. However, along with the growth of the environment in terms of round-trip times, metadata management becomes challenging due to access and/or management overheads, often resulting in bottlenecks. Thus, the author has identified and classified contextual metadata, taking into account consistency and synchronization models, utilizing BASE (Basic Availability, Soft-state, Eventual consistency) rather than ACID (Atomic, Consistent, Isolated, Durable) whenever possible.

This dissertation describes steps undertaken in order to validate the author's thesis, starting with analysis of the requirements of federated computational infrastructure stakeholders and shortcomings of existing data access tools. The core element of the thesis is a description of the Model of Transparent Data Access with Context Awareness (MACAS), designed to accommodate dynamic changes of factors which affect data access in order to provide the desired access characteristics to specific groups. To solve this complex task, the model introduces layers and cross-cutting concerns which cover different aspects of data access, such as interactions with diverse storage resources, users' interactions with the data access system, coordination of execution of multiple operations to utilize more than one storage system, efficient utilization of network resources, cooperation of resource providers and distribution of the environment.

The author also presents an implementation of the proposed model that focuses on the ability to process large amounts of metadata, along with notifications which enable broad provisioning of up-to-date context information. The dissertation is concluded by a description of tests carried out in a federated environment, without any assumptions regarding the providers' mutual relationships. These tests validate the model's quality as well as its capability for adaptation to nonfederated environments.

Streszczenie

Aktualne problemy naukowe wymagają odpowiednich narzędzi zapewniających nie tylko wydajny dostęp do danych, ale i łatwe zarządzanie danymi. Analizując elementy, które mają wpływ na operacje wykonywane przez użytkownika, nie jest jednak możliwe wybranie jednego zestawu funkcjonalności, który satysfakcjonuje wszystkich zainteresowanych dostępem do danych. W związku z tym autor zaproponował i poddał badaniom system dostępu do danych spełniający wymagania społeczności użytkowników zrzeszonych w wielu niezależnych organizacjach. Autor zidentyfikował kontekst, reprezentowany jako metadane, jako kluczowy element rozwiązania, formułując tezę, że znajomość kontekstu umożliwia transparentne dostarczanie danych użytkownikom, utrzymując przy tym jakość dostępu. Jednak wraz z rozrastaniem się infrastruktury, zarządzanie metadanymi staje się coraz bardziej wymagające z powodu narzutów na synchronizację i/lub opóźnień w dostępie, które mogą doprowadzić do powstania wąskiego gardła w systemie dostępu do danych. W związku z tym autor zidentyfikował metadane opisujące kontekst i sklasyfikował je na podstawie modeli spójności i synchronizacji, w celu zapewnienia dostępności i efektywności kosztem transakcyjnego, atomicznego przetwarzania, tam gdzie jest to możliwe.

Rozprawa zawiera opis etapów realizowanych w celu weryfikacji sformułowanej w rozprawie tezy, zaczynając od analizy wymagań oraz niedoskonałości istniejących narzędzi zapewniających dostęp do danych. Głównym osiągnięciem pracy jest model transparentnego dostępu do danych z wykorzystaniem kontekstu (ang. Model of Transparent Data Access with Context Awareness - MACAS), który umożliwia dynamiczne zmiany parametrów wpływających na dostęp do danych, aby zapewnić pożądaną charakterystykę dostępu do danych przez poszczególnych użytkowników. Aby rozwiązać tak złożone zadanie, model składa się z warstw obejmujących różne aspekty dostępu do danych, takie jak interakcja z różnymi systemami składowania danych, interakcja użytkowników z systemem dostępu do danych, koordynacja wykonania wielu operacji w celu wykorzystania więcej niż jednego systemu składowania danych, wydajne wykorzystanie zasobów sieciowych, współpraca organizacji dostarczających zasoby dyskowe i obliczeniowe oraz rozproszenie środowiska.

Autor przedstawia także implementację proponowanego modelu, która koncentruje się na możliwości przetwarzania dużej ilości metadanych oraz powiadomień, które umożliwiają dostarczanie szerokich i aktualnych informacji kontekstowych. Na zakończenie prezentowane są testy w środowisku sfederowanym, które udowadniają jakość systemu utworzonego na bazie modelu, a także zdolność dostosowania modelu do niesfederowanych środowisk.

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Statement and Research Objective
  1.3 Thesis Contribution
  1.4 Note on Participation in Research Projects
  1.5 Thesis Structure
  1.6 Definitions of Terms
2 Background Survey
  2.1 Computational Environments
    2.1.1 Grid Computing
    2.1.2 Cloud Computing
  2.2 Typical Grid and Cloud Data Access Tools
  2.3 Tools for Anytime/Anyplace Data Access
  2.4 Tools for Distributed Data Processing
  2.5 Tools for Unified View of Multiorganizational Data
  2.6 Summary
3 MACAS - Model of Transparent Data Access with Context Awareness
  3.1 Data Access Stakeholders
  3.2 Basis of MACAS
  3.3 Context Modelling in MACAS
    3.3.1 Types of Metadata
    3.3.2 Description of Metadata
    3.3.3 Classification of Metadata
    3.3.4 Metadata Consistency and Synchronization Models
  3.4 Model Description
    3.4.1 Description of MACAS Layers and Concerns
    3.4.2 MACAS Algorithm
  3.5 Summary
4 Architecture and Selected Aspects of Implementation
  4.1 Overall Architecture of the System
    4.1.1 Metadata Distribution
    4.1.2 Handling Metadata Updates
    4.1.3 Propagation Delay for Metadata Changes and its Consequences

  4.2 Architecture of Data Management Component
    4.2.1 DMC Core
    4.2.2 DMC Modules
    4.2.3 Request Handling and Load Balancing
  4.3 Summary
5 Experimental Evaluation
  5.1 DMC Core Tests
    5.1.1 Evaluation of Request Routing and Processing
    5.1.2 Metadata Access Evaluation
    5.1.3 Reliability Evaluation
  5.2 Performance Evaluation of Integrated System
    5.2.1 Overhead Evaluation
    5.2.2 Evaluation of Scalability and System Limits
    5.2.3 Evaluation of Overhead in a Multi-DMC Environment
  5.3 Datachunk Management Evaluation
  5.4 Context Awareness Evaluation
  5.5 Contribution of Context Awareness to Experiments
  5.6 Evaluation Summary
6 Conclusions and Future Work
  6.1 Summary
  6.2 Research Contribution
  6.3 Range of Applications
  6.4 Future Work
Bibliography
Author's Bibliometric Data
Author's Publications

List of Figures

1.1 View of collaboration at scale
1.2 Influence of metadata on data access
1.3 Proposed evolution of data access model
1.4 Contribution of the author (green) and collaborative tasks in which the author was involved (black)

2.1 Drawbacks of existing tools for federated computational environments (red - drawbacks, green - advantages offset by other drawbacks)

3.1 Scheme of the data access system
3.2 Stakeholders' relation to data access in federated computational environments
3.3 Basic metadata used during data access
3.4 Example of datachunk replication
3.5 Main metadata dependencies
3.6 Model of Transparent Data Access with Context Awareness
3.7 Algorithmic representation of MACAS

4.1 Overall architecture of the system
4.2 FUSE client (FClient) concept
4.3 Sample FClient pseudocode
4.4 Basic pseudocode for handling FClient requests
4.5 Pseudocode for FClient datachunk synchronization request handling
4.6 Pseudocodes for updates of metadata describing datachunks
4.7 Pseudocodes for metadata updates following merger and invalidation of datachunks
4.8 Metadata stores and caches
4.9 Sample flow between metadata stores and metadata caches
4.10 Conflict resolution pseudocode
4.11 Deployment of DMC Core
4.12 DMC modules
4.13 Request flow – different modes with different features

5.1 Normalized throughput with similar load on all DMC cluster nodes
5.2 Normalized throughput with DMC cluster nodes divided into two groups with different load
5.3 Fragments of logs from reliability tests
5.4 Total aggregated throughput and CPU usage
5.5 Data access throughput for local and shared datasets

5.6 Changing distribution of file fragments among DMCs
5.7 Test environment for comparing management policies
5.8 Context awareness test environment
5.9 Results of selected steps of the context awareness test

6.1 Relation of issues connected to transparent data access
6.2 Author's individual achievements (green), collaborative achievements (black) and tasks in which the author was not involved (orange)

List of Tables

1.1 Data storage, access and management levels

2.1 Features of data access solutions
2.2 Existing solution characteristics

3.1 Features of the data access system expected by stakeholders
3.2 Abbreviations for types of metadata used in equations
3.3 Classes of metadata with abbreviations used in equations
3.4 Metadata classes

4.1 Implementation assumptions for MACAS
4.2 Aggregation of events and changes
4.3 Implementation of MACAS layers and concerns by DMC modules

5.1 Request handling time and characteristics of request processing modes
5.2 Test configurations for metadata access
5.3 Average metadata access times for different configurations and computing environments
5.4 Number of memory slots occupied at the end of the test for different configurations and computing environments
5.5 Results of reliability tests
5.6 Description of overhead tests
5.7 Throughput of a system implementing the MACAS model in comparison with direct access
5.8 Total aggregated throughput and number of operations per second
5.9 Comparison of management policies
5.10 Types of context awareness

1 Introduction

With the fast growth of the digital universe, data access and processing at a global scale are at the centre of scientific and commercial interests. This follows from the ever increasing scale of research problems which call for wide-ranging collaboration between groups of researchers making use of geographically distributed, heterogeneous data sources (cf. Figure 1.1). The problems – such as Data Science [35; 57], Big Data [26; 53], the 4th Paradigm [18] and Science 2.0 [47], each represented by a set of activities performed worldwide – require strong support from data access and management tools, which must evolve to meet demands for data processing performance supported by the available storage resources, as well as ease of use.

Extracting knowledge or generating insight from data provides an understanding of contemporary scientific, commercial and social challenges. A recognizable feature of modern data access and management systems is the tendency to cross organizational boundaries, resulting in work at the forefront of science and engineering, such as processing of astronomy data [44], or sharing and analyzing the results of large-scale experiments and simulations (e.g., the Human Brain Project [24] or the Worldwide LHC Computing Grid [12]). New data storage and management paradigms are foreseen, e.g., the concept of data lakes [49] for maintaining and sharing different types of data. Modern science introduces many problems which require collaborative work and call for substantial resources.

Figure 1.1: View of collaboration at scale

Likewise, the business world acknowledges the increasing role of efficient data management for analyzing constantly growing volumes of data in order to remain competitive on the market.

The need for e-infrastructures which deal with open data, facilitating easy access and sharing in organizationally distributed environments, has already been acknowledged, giving rise to various projects and initiatives, e.g. [81; 72]. Nevertheless, development of such systems clearly lags behind expectations where ease of use, effectiveness, transparency of data access and heterogeneity of resources are concerned. Collaboration between distributed groups of workers calls for sophisticated systems, which, in turn, requires a set of specific problems to be solved separately to meet collaboration requirements. Due to the complexity of such systems, their development is usually performed by teams of architects and developers with well-defined roles and activities, inspired by specific use cases and end users.

In formal terms, data access and management can be analyzed at several levels, from personal data to globally distributed shared data. This leads to various cooperation problems which need to be overcome by data access and management tools. The simplest case involves access to local data, i.e., to data stored on direct attached storage (DAS). The user can control all activities; the only problem is to provide device drivers and solutions that use the hardware in an optimal fashion, such as a striping algorithm for an SSD storage array [39], tuning the storage system for a particular type of usage (e.g., archiving [23]) or balancing between several levels of a hierarchical storage system. The next level involves provisioning data access to a group of users working for a single organization, e.g., network attached storage (NAS). Problems encountered in this case pertain to possible network failures, higher latency [37], or simultaneous work by multiple users, impacting quality of service parameters [29; 30; 36]. Storage systems must be able to maintain the required QoS and cost effectiveness on the organizational level, assisted by request scheduling and resource allocation algorithms [43; 32].

The resources offered by a single organization may prove insufficient for users who require large amounts of storage and computing power to process data streams produced continuously by experiments, or make use of large amounts of information gathered in existing datasets. For such users, resource providers create federated organizations (FO) – or federations for short – i.e., groups which agree to collaborate in order to simplify access to resources which belong to multiple organizations, often defining a storage area network (SAN) and detailed common rules of cooperation and resource sharing [42]. Nevertheless, many problems related to federated data remain unresolved. Computing Grids [2] and Virtual Organizations (VOs) [1; 14] address issues related to decentralized management by organizations that use different policies and make autonomous decisions according to local requirements. Nevertheless, further work is required to improve efficiency and convenience of data access, as well as to ensure cost-effective data management. The most complex case is the use of data provided by several nonfederated organizations (NFOs), i.e., organizations that do not have any cooperation agreement in place. In this case, challenges related to trust and lack of standards appear.
As there is no bond between NFOs, the exchange of information concerning users and their data is difficult, since each NFO may apply its own authentication mechanism. Table 1.1 summarizes the presented levels of data storage, access and management. At each level, a cumulative increase of difficulty is observed.

Figure 1.2: Influence of metadata on data access

Table 1.1: Data storage, access and management levels

Environment description | Level | Sample problems
Direct attached storage | Local | Provision of device drivers to maximize hardware performance
Network attached storage for a group of users | Organization | Provision of device drivers for a distributed environment to minimize the negative impact of simultaneous use
Resources offered by closely collaborating organizations | Federated Organization (FO) | Distributed management and low-level data access policies
Resources offered by unrelated organizations | Nonfederated Organization (NFO) | No trust between organizations, lack of standards, local user accounts

One of the important aspects of a solution which addresses such problems pertains to metadata (see Figure 1.2) that can be managed automatically or manually by the user. Such metadata includes user-specific information (e.g., access control) along with storage-specific information (e.g., location of data replicas) [25; 9; 5; 6]. Metadata can also be used to describe the context of data access, e.g., system component load. The more contextual information is taken into account by management algorithms, the higher the quality provided to the users. However, management of large amounts of metadata may prove difficult for many users if it is not handled automatically by the data access system. Moreover, along with the growth of the environment in terms of round-trip times, metadata management becomes more challenging because of possible access and/or management overheads. Research indicates that operations on metadata are very likely to cause a bottleneck [13; 11; 15]. Thus, the quality of data access in a multiorganizational distributed environment is strictly related to metadata management.
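To make the distinction between these kinds of metadata concrete, the following minimal sketch (in Python) groups user-specific, storage-specific and contextual metadata for a single file and uses the context to choose a replica. The field names and the selection rule are illustrative assumptions of this example only, not the metadata schema defined later in the thesis.

    # Illustrative sketch only: field names and grouping are assumptions of this
    # example, not the metadata model introduced in Chapter 3.
    from dataclasses import dataclass, field
    from typing import Dict, List


    @dataclass
    class FileMetadata:
        # User-specific metadata (e.g., access control), often managed manually.
        owner: str
        acl: Dict[str, str] = field(default_factory=dict)       # user -> "read"/"write"
        # Storage-specific metadata (e.g., location of data replicas), managed
        # automatically by the data access system.
        replicas: List[str] = field(default_factory=list)        # storage site identifiers
        # Contextual metadata describing the circumstances of access
        # (e.g., load of the components holding the replicas).
        replica_load: Dict[str, float] = field(default_factory=dict)  # site -> load in [0, 1]


    def pick_replica(meta: FileMetadata) -> str:
        """A trivial context-aware decision: read from the least loaded replica."""
        return min(meta.replicas, key=lambda site: meta.replica_load.get(site, 0.0))


    meta = FileMetadata(owner="alice",
                        acl={"bob": "read"},
                        replicas=["site-a", "site-b"],
                        replica_load={"site-a": 0.9, "site-b": 0.2})
    print(pick_replica(meta))   # -> "site-b"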

1.1 Motivation

When a user accesses data, several aspects are typically considered important – these include the type of storage system, location of data and the state of the environment (e.g. availability of storage space, its load and type, number of users accessing storage, etc.). Minimizing the negative impact of data distribution, ensuring high availability of data, providing data replication and limiting delays in accessing replicas comprise another group of important issues. Dealing with such aspects is usually inconvenient for typical users. Hence, the topic of this dissertation is a study on provisioning distributed data in a transparent and effective manner.

As a result of the author's involvement in various national and international projects (see Chapter 1.4), strong demand for a system for accessing heterogeneous distributed data which would be user-friendly, efficient and scalable has become evident. This observation has resulted in attempts to capture customer needs in terms of a general use case and the required functionality, which, according to system engineering principles, expresses the most important factors at the early stage of development. Since – as previously mentioned – the development of such a system is a complex task, we formulate the following overall research question:

At large scale, how should the system be built to fulfill the requirements of the multiorganizational community and to offer user-friendly, efficient and scalable access to heterogeneous, distributed data resources? What are its unique features?

In addition to the above, we also state a more specific question:

What are the main elements of the system?

In order to address these research questions we specify the overall concept, architecture and core elements of the proposed system.

1.2 Thesis Statement and Research Objective

When analyzing data access in an organizationally distributed environment, we should mention elements which influence user operations, such as a consistent view of the distributed data, efficient data reading and/or writing capabilities, as well as avoiding or demanding data redundancy. Since many of these features contradict each other, according to the CAP theorem [3] it is impossible to select a single set of mutually non-exclusive features that satisfy all stakeholders. On the basis of the shortcomings of existing solutions (discussed in the State of the Art section) along with our initial experimental environments and tests [138; 137], we have identified context awareness, represented by metadata management, as the main aspect of this study. Consequently, the thesis of the dissertation may be defined as follows:

In distributed storage infrastructures context awareness enables data to be provisioned to users in a transparent manner while maintaining quality of access.

The above thesis includes three important terms:

• context awareness – the ability of entities to sense and react to the state of their environment [17]; in this work context is represented by metadata which expresses knowledge of the circumstances of data access, e.g., the environment and the user's expectations,

• data provisioning transparency – a concept which stipulates that problems associated with differing data formats, storage systems and locations must be concealed from users; instead, users access data via logical paths while the infrastructure handles the underlying technical aspects,

• quality of access (QoA) – Quality of Service (QoS) considerations related to data access [29].

The overall research objective is to develop an approach for transparent and easy provisioning of distributed data while maintaining quality of access. Transparent provisioning of distributed data not only results in access simplicity but also in significant management possibilities. Data can be automatically migrated and/or replicated in a transparent manner to decrease access latency or increase throughput. However, the concept of data access transparency does not assume any goals regarding the data access system. Instead, transparent data management should implement a specific policy (follow particular guidelines – see Chapter 1.6), e.g., improving access throughput or providing additional security mechanisms. Moreover, different policies can be applied to different datasets or groups of users to address the above-mentioned problem of mutually exclusive features. QoS requirements can be defined in the form of a Service Level Agreement (SLA) [16] to formally describe aspects of quality. The agreement may also specify best-effort mechanisms along with provisioning of characteristics that are difficult to measure (e.g., simplicity). The term QoA is used in this thesis to describe any attempt to deliver the desired access characteristics. Maintaining quality of access means that provisioning of new/upgraded features/characteristics does not result in the loss of any other desired characteristics. In particular, implementing a layer of abstraction that conceals data location from users (simplifying access) should not result in decreased access performance. While quality trade-offs are allowed, they should always relate to a specific policy (e.g., loss of performance due to use of safer but slower storage) rather than to implementation aspects of the data transparency mechanism itself. To achieve the stated objective, the following research steps are performed:

• elaboration of a model which reflects the overall idea of the system,

• design of a system architecture that represents the model,

• implementation of core system elements,

• practical verification of the approach in real and simulated computing environments.

To validate the thesis and complete the main objective the Model of Transparent Data Access with Context Awareness (MACAS) is created. In addition to modeling easy and transparent data access, it also enables implementation of different policies, e.g., focusing on data access efficiency and scalability or security (see Figure 1.3). To achieve these goals, appropriate knowledge about several aspects of data access (data access context) is modelled as a set of metadata, hence metadata definitions constitute an important element of MACAS. The MACAS model introduces various abstractions which enable implementation of different levels of data storage, access and management (see Table 1.1). Practical verification focuses on federated multiorganizational environments as a core functionality of the model, disregarding issues of trust which typically arise in NFOs.

1.3 Thesis Contribution

The work performed in this dissertation is based on collaborative research projects. In Figure 1.4 the author’s contribution is highlighted in green, while black text represents collaborative work. The presented work is aligned with the trend of Software Defined Storage (SDS) [99]. The author’s involvement is as follows:

1. Contribution to developing assumptions for the MACAS-compliant system – identifying data access stakeholders, limitations of organizational distribution as well as functional and non-functional requirements to be included in the model.

Figure 1.3: Proposed evolution of data access model

Figure 1.4: Contribution of the author (green) and collaborative tasks in which the author was involved (black)

2. Design of MACAS – a Model of Transparent Data Access with Context Awareness including transparent data management assisted by data access context. It models policy-based provisioning with hardware-independent data management and provides sufficient elasticity to account for dynamically adjusted features depending on the requirements of a particular user application.

3. Co-design of the data access system by mapping elements of MACAS to elements of the system architecture. The design takes into account efficiency, distributed management and diversity of solutions and policies.

4. Implementation of the system core. Participation in the implementation of components representing elements of the system architecture.

5. Design of experiments and tests that verify the model based on popular benchmarks and tools. Participation in the implementation of tests.

The most important novelty of the author's contribution is context awareness, represented by metadata, included in the MACAS model to accommodate dynamic changes in various factors that affect data access. Elaboration of metadata classes with different consistency and synchronization models to avoid context information processing bottlenecks should also be acknowledged. The implementation focuses on the ability to process large volumes of metadata as well as notifications, ultimately enabling the system to provision broad and up-to-date context information.

1.4 Note on Participation in Research Projects

The author has actively participated in several research projects related to distributed systems. The main background is provided by collaboration with the Academic Computer Centre Cyfronet AGH [62] and by postgraduate research at the Department of Computer Science of the Faculty of Computer Science, Electronics and Telecommunications of the AGH University of Science and Technology.

Participation in the PL-Grid family of projects [92] has provided insight into data management in federated computing infrastructures. In the PL-Grid PLUS [91] and PL-Grid CORE [90] projects the author was responsible for a development team working on implementation of tools simplifying access and management of data stored within the PL-Grid infrastructure. The author was involved in the QStorMan project [28], developing a tool for optimization of use of storage resources in accordance with user requirements.

Work for the INDIGO [81] and EGI ENGAGE [72] projects involved exploration of user requirements for organizationally distributed environments in the context of collaboration and data sharing. Basic experience was gained from the European Defence Agency EUSAS [139; 123; 122; 117] and national Rehab [96; 118; 145] projects, focusing on data farming methodology applications as well as holistic rehabilitation of stroke patients with the help of computer games.

1.5 Thesis Structure

The remainder of the thesis is organized as follows. Chapter 2 contains an overview of data access tools and environments important for the thesis. In Chapter 3 data access stakeholders and their requirements are identified. On the basis of this analysis, the Model of Transparent Data Access with Context Awareness is introduced. Selected aspects of the model implementation are presented in Chapter 4, while Chapter 5 discusses its experimental evaluation. The final chapter outlines conclusions and plans for future work.

1.6 Definitions of Terms

The following terms are used in this thesis:

• Basic terms:

– Dataset: a collection of data that may be perceived as a coherent whole.
– Metadata: data [information] that provides information about other data and/or access context.
– Data access: storing, retrieving and manipulating data.
– Data access context: the environment / conditions in which data is accessed.
– Data access context awareness: a property of data access systems that allows the system to understand the environment / conditions in which the system operates and react to changes in the environment / conditions. Data access context is represented by metadata that describes knowledge concerning the circumstances of data access.
– Data access and management policy: a set of guidelines that should be followed during data access and management to fulfill requirements.

• Elements of the environment:

– Site: a set of closely linked computing and/or storage resources.
– Client: a software entity used by the user to operate on data.
– FUSE: a software interface that allows non-privileged users to create their own filesystems without editing kernel code.
– FUSE client: a client that is based on FUSE (see the sketch following this list of terms).

• Actors involved in data access:

– Organization: an entity that has a collective goal and is linked to the external environment.
– Provider: an organization that owns/operates computing and/or storage resources and provides them to the user. The provider's resources may form one or several sites. Data is stored and manipulated using software and hardware that belong to a particular provider.
– User: a person or organization that possesses some data and/or performs computations.
– Developer: a producer of applications/services that offer particular features to users.

• Terms used to describe relations between elements of the environment:

– Federation: a group of computing or network providers who agree upon standards of operation in a collective fashion.
– Nonfederated organizations (NFOs): organizations that do not have any cooperation agreement in force.

– Support of user by provider: a term that describes the relation between the user and the provider when the provider makes its resources available for the user to store/access/process data. In such a case, the provider stores and manages metadata to provide data access that fulfills the user's requirements.
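To illustrate the FUSE concept referred to above, the following minimal sketch exposes a single read-only file through a mount point. It is purely illustrative and assumes the third-party fusepy Python bindings; it is not the FUSE client (FClient) implementation discussed in Chapter 4.

    # Minimal read-only FUSE filesystem sketch using the fusepy bindings
    # (an assumption of this example; not the FClient described later).
    import errno
    import stat
    import sys
    from fuse import FUSE, FuseOSError, Operations


    class HelloFS(Operations):
        """Exposes a single file, /hello.txt, with static contents."""
        DATA = b"transparent data access\n"

        def getattr(self, path, fh=None):
            if path == "/":
                return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
            if path == "/hello.txt":
                return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                        "st_size": len(self.DATA)}
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return [".", "..", "hello.txt"]

        def read(self, path, size, offset, fh):
            return self.DATA[offset:offset + size]


    if __name__ == "__main__":
        # Mount with: python hellofs.py /mnt/hello (unmount with fusermount -u).
        FUSE(HelloFS(), sys.argv[1], foreground=True)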

2 Background Survey

According to the CAP theorem [3] it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees: consistency, availability and partition tolerance. When supporting Open Science, Big Data processing and cooperation among users who work with organizationally distributed resources, it is particularly important to provide availability and partition tolerance. This statement can be justified by a use case where several scientists process different parts of a read-only dataset, e.g., they analyze different scans of the human brain. Analysis of each part of the dataset takes a lot of time and requires significant computational resources, hence it is conducted at several computing centers. While processing a particular portion of the dataset does not require access to its full contents, outcomes should be made available to everyone to facilitate further comparison of results obtained for different parts of the dataset. As processing of each part can take days or weeks, it is critical to allow processing of selected parts even when other parts are temporarily unavailable or the connection to other computing centers that take part in processing has been lost. For such a use case it is more important to ultimately provide access to results produced using multiple parts than to ensure a consistent view of temporary data during processing. Thus, the author focuses on tools that allow efficient data access at any given time regardless of consistency. Tools such as GlobalFS [48], which aim to ensure strongly consistent filesystem operations in the event of node failures but come at the cost of reduced availability, are not considered.

This chapter begins with an overview of computational environments followed by a discussion of typical data access tools. Subsequently, other solutions for data access are investigated. Finally, problems related to data access and management in organizationally distributed environments are summarized.

2.1 Computational Environments

Since scientific experiments often require substantial computing power, the author focuses on the two most popular large-scale approaches: grid and cloud computing (also referred to as the grid and the cloud, respectively). The aim of both is resource sharing, but the way in which the resources are offered differs.

2.1.1 Grid Computing

The computing grid is "a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities" [2]. The essence of the grid is (1) coordination of resources that are not subject to centralized control, (2) use of standard, open, general-purpose protocols and interfaces, (3) providing nontrivial quality of service [4]. Other definitions emphasize the use of virtualization to present a unified system image [108]. Two important aspects for this thesis appear in the above-mentioned definitions: linking resources contributed by several organizations and achieving a unified system image.

From the user's point of view, the grid can be perceived as a coherent whole due to its dedicated middleware and single sign-on features. However, sites have local, independently managed storage systems and may differ with regard to data access policies. The existence of multiple storage systems inside the grid, each managed by a different organization, makes data access and management difficult for users as well as for providers. Moreover, most storage solutions are suited only for a limited number of use cases and the user is often forced to combine several tools to achieve the desired effect (see Chapter 2.2). Thus, there is room for solutions simplifying data access and management, but grid solutions are currently losing their importance.

2.1.2 Cloud Computing

According to the NIST¹ definition, "cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [104]. Cloud computing is based on virtualization of resources, which can be monitored, controlled and subjected to accounting. There are three basic cloud service models:

• Software as a Service (SaaS),

• Platform as a Service (PaaS),

• Infrastructure as a Service (IaaS).

SaaS offers end-user applications running in the cloud that are accessible from various devices. Examples include Gmail and Google Docs. When using PaaS, in contrast to SaaS, the user has control over the deployed applications and may configure the application hosting environment. This provides greater elasticity than SaaS. Examples include Google App Engine and Microsoft Azure. IaaS (e.g., Amazon Cloud) allows users to run arbitrary software with limited control of networking components (e.g., host firewalls) while in SaaS and PaaS users have no control over the network. IaaS offers the greatest elasticity but also requires substantial knowledge, including familiarity with system administration tasks. The two basic cloud deployment models are private and public clouds. However, two addi- tional models can be found in literature: community and hybrid clouds [104]. A private cloud is an infrastructure dedicated to exclusive use by a single organization, whereas a public cloud is provisioned for open use by the general public. It is often owned, managed, and operated by a business and users are billed for its use. Both private and public clouds can be managed by a third party – the type of deployment depends on the user, not the managing entity. If a cloud infrastructure is provisioned for exclusive use by a specific community of consumers (more than a single organization, but without open access), it is called a community cloud.

¹ National Institute of Standards and Technology, U.S. Department of Commerce

Community clouds are often owned and managed by organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). A hybrid cloud is a composition of two or more of the above-mentioned infrastructures (private, community, or public) that remain unique entities but are bound together by standardized or proprietary data and application technology.

Although many cloud environments provide data access transparency, this topic is also addressed by the thesis. The main assumption and motivation for the work is the diversity of clouds and their users. Different scientists use different clouds and it is often impossible to process data of all cooperating scientists using a single cloud due to limited resources, dataset migration overhead or formal reasons (e.g., funding obtained for usage of a particular environment). Thus, aspects of cloud users' collaboration remain an interesting research topic and are addressed by the presented work.

2.2 Typical Grid and Cloud Data Access Tools

Grid and cloud providers offer storage systems for various purposes. The grid usually supports such solutions as (1) scratch storage for intermediate job results and data processing; (2) long-term data storage for final job results, often accessible through a dedicated API and appropriate for sharing data between different sites. Providers can also offer (3) object storage, which manages data as objects (e.g., Amazon S3). Access to objects is fast and scalable, but there is no hierarchical structure or block access such as in traditional filesystems. Object storage is designed to deliver multiple online storage services, whereas traditional storage systems are primarily designed for high-performance computing and transaction processing. Selected examples of tools used in the grid and cloud environments are outlined below.

Lustre [83] is a parallel distributed filesystem for computational clusters. The Lustre filesystem is often used as a high-performance scratch system in the grid. In such cases, there are usually different Lustre instances on different sites, which means that data stored in this filesystem can only be shared within the local cluster. Although the efficiency of the Lustre system is high, it may nevertheless be improved through dedicated tools such as QStorMan [28; 32; 120]. QStorMan aims at delivering storage QoS and resource usage optimization for applications that use the Lustre filesystem. QStorMan continuously monitors Lustre nodes and dynamically forwards data access requests to storage resources according to predefined storage QoS requirements. The usage of QStorMan has been shown to improve the data access efficiency of PL-Grid data-intensive applications [120]. Another tool often used as a high-performance scratch system is GPFS [80; 46]: a technology provided by IBM that offers similar usage characteristics to Lustre.

In order to make data available outside the site, it should be copied to permanent storage outside the local cluster; e.g., LFC (LCG File Catalog), which is a storage mechanism for metadata management that provides common filesystem functionality for distributed storage resources [8; 22]. It supports file replication for better data protection and availability. It is commonly used with the command-line utilities described in [109]. Direct access from the application source code is possible using the GFAL API [77]. Since many users consider the usage of dedicated command-line utilities or APIs a drawback, they may use a FUSE-based [75; 54] implementation of the filesystem called GFAL-FS [77], which provides access to data in the same manner as in a regular Unix-like filesystem. However, data access via GFAL-FS is slower (compared to command-line utilities or the GFAL API) and only available in read-only mode.

OpenStack Object Store (known as Swift [88]) is an example of object storage often used in the cloud. It is able to provide common file names within the grid and cloud infrastructures, and can therefore be applied in similar use cases to LFC. However, the Swift file sharing mechanism (which makes use of API access key sharing or session tokens) seems to be more troublesome for most users compared to LFC file sharing mechanisms based on Unix permissions. There are several reasons behind the high heterogeneity of storage systems:

• user requirements for storage resources with different characteristics depending on the application,

• local resource policies,

• use of spare storage resources which already exist at the given location.

Users often emphasize the importance of specific use cases, such as archiving and efficient access to temporary files [132]. Currently, such use cases are served by different storage systems and tools on different sites. As a result, standard grid and cloud environments do not offer data access transparency when deployed in a multi-organization environment due to the heterogeneity of storage systems. The lack of easy transparent data access results in management problems from the provider’s point of view. Less technically advanced users often work only with scratch storage and manually manage data transfers using SSH-based protocols for both file sharing and staging prior to job execution. This results in suboptimal use of storage and computing resources. Thus, both users and providers require better data access methods.

2.3 Tools for Anytime/Anyplace Data Access

Although anytime/anyplace data access is a term used mainly in marketing, in the author's opinion it accurately reflects the focus of tools described in this chapter. This section provides a description of solutions whose primary objective is to provision data to the user regardless of the access point. When the user works with resources on multiple sites, all data must be available for each user process on each site.

Existing tools for anytime/anyplace data access focus on ease of access. The most popular ones are Dropbox, OneDrive and Google Drive [71]. Client applications are provided for the most popular operating systems, enabling a virtual filesystem to be mounted in order to transparently handle synchronization with cloud storage. If any operation performed without connection to the Internet conflicts with server-side changes performed by other clients, users can resolve the conflict on their own. Other significant features include file sharing mechanisms which allow users to easily publish their data. These tools impose rigorous limits on storage size and transfer speed, which become an obstacle when the research is conducted in a geographically distributed manner and data requires online synchronization across sites. The user has to carefully plan data processing depending on where the data has been generated and what transfer/synchronization operations are foreseen.

Another similar sync-and-share tool is ownCloud [34; 41]. It enables users to maintain full control over data location and transfer while hiding the underlying storage infrastructure, abstracting file storage available through directory structures or WebDAV. It also provides file synchronization between various operating systems, sharing of files using public URLs, and support for external cloud storage services. Although ownCloud is more flexible than the previously mentioned tools, its performance is also not sufficient for data-intensive applications.

2.4 Tools for Distributed Data Processing

One of the most prominent tools for remote data access is Globus Connect [45]. It is built on the GridFTP protocol to provide fast data transfer and sharing capabilities inside an organization. Globus Connect focuses on data transfer and does not abstract access to existing data resources. Thus, it does not provide any data access transparency.

Another option for distributed environments is to provision storage resources through a high-performance parallel filesystem. Solutions of this type intend to provide access to storage resources optimized for performance. They are usually built on top of dedicated storage resources and expose POSIX-compliant interfaces. Examples include BeeGFS (formerly FhGFS) [65], GlusterFS [76], Coda [103], and PanFS [89]. As there are significant differences between these systems in terms of data access, their most important features are presented below.

BeeGFS [65] is an excellent example of a high-performance parallel filesystem because it uses many typical mechanisms for this type of tool. It combines multiple storage servers to provide a shared network storage resource with striped file contents. Built on scalable multithreaded core components with native Infiniband support, BeeGFS has no architectural bottlenecks. It distributes file contents across multiple storage servers and likewise distributes filesystem metadata across multiple metadata servers. This results in high availability and low metadata access latency. Even given the multitude of metadata servers, it is guaranteed that changes to a file or directory by one client are immediately visible to other clients. BeeGFS has no support for integration of resources managed by several independent organizations. It can be used within a local storage area, but cannot easily provide transparent data access for organizationally distributed environments.

GlusterFS constitutes an interesting alternative to metadata servers. It uses an elastic hashing algorithm that allows each node to access data without use of metadata or location servers. Storage system nodes have the intelligence to locate any piece of data without looking it up in an index or querying another server. This parallelizes data access and ensures good performance scaling. GlusterFS can scale up to petabytes of storage, available to the user under a single mount point. The developers of GlusterFS [76] point to the use of an elastic hashing algorithm as the heart of GlusterFS's fundamental advantages, which results in good performance, availability and stability, and a reduction in the risk of data loss, corruption or the data becoming unavailable. The use of hashing algorithms minimizes traffic flow; however, usage of metadata servers results in better elasticity and easier reconfiguration. On the other hand, hashing algorithms require more work when a group of data servers is reconfigured, so both solutions have pros and cons; the choice of solution should depend on the use case. Similarly to BeeGFS, GlusterFS is dedicated to local storage infrastructures with no strong support for an organizationally distributed environment.
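The contrast between hash-based placement and metadata-server lookup can be illustrated with a simple consistent-hashing sketch: every client computes the responsible server from the file path alone, without consulting a location service. This is a generic illustration in Python; GlusterFS's actual elastic hashing algorithm differs in detail.

    # Generic consistent-hashing sketch of hash-based data placement;
    # not GlusterFS's actual elastic hashing algorithm.
    import bisect
    import hashlib


    class HashRing:
        """Maps file paths to storage servers without a metadata/location server."""

        def __init__(self, servers, vnodes=64):
            # Place several virtual nodes per server on a hash ring to balance load.
            self._ring = sorted(
                (self._hash(f"{server}#{i}"), server)
                for server in servers for i in range(vnodes)
            )
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value: str) -> int:
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def locate(self, path: str) -> str:
            """Any node can compute the responsible server from the path alone."""
            idx = bisect.bisect(self._keys, self._hash(path)) % len(self._ring)
            return self._ring[idx][1]


    ring = HashRing(["server-a", "server-b", "server-c"])
    print(ring.locate("/experiments/brain/scan-042.dat"))

The trade-off noted above is visible in this sketch: adding or removing a server changes the ring and forces some data to move, whereas a metadata-server design could simply update its catalogue.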
Coda [103] is an example of a system with strict support for disconnected mode operations. It offers high availability of files by replicating a file volume across many servers and caching files on the client machine. The server and client communicate through Remote Procedure Calls (RPC). When a server is notified of file updates, it instructs clients which cache copies of the affected files to invalidate these copies. Due to client-side replication of data, the user is able to continue working in case of a network failure. However, aggressive caching may lead to conflicts. Automatic conflict resolution may involve data loss (the user may be unaware of the problem), while manual conflict resolution is inconvenient for users. The main drawback of Coda is its lack of support for organizational distribution of an environment (its cache coherency algorithm would result in high utilization of the network between sites [74]).
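The callback mechanism described above can be sketched as follows. This is a conceptual illustration only, not Coda's actual protocol or RPC interface: the server remembers which clients cache a file and tells them to invalidate their copies when the file changes.

    # Conceptual sketch of callback-based cache invalidation in the spirit of Coda;
    # not Coda's actual protocol or RPC interface.
    from typing import Dict, Set


    class Server:
        def __init__(self) -> None:
            self.files: Dict[str, bytes] = {}
            self.callbacks: Dict[str, Set["Client"]] = {}    # path -> clients caching it

        def fetch(self, client: "Client", path: str) -> bytes:
            # Register a callback promise: the client will be told when path changes.
            self.callbacks.setdefault(path, set()).add(client)
            return self.files[path]

        def store(self, path: str, data: bytes) -> None:
            self.files[path] = data
            # Break callbacks: tell every caching client to invalidate its copy.
            for client in self.callbacks.pop(path, set()):
                client.invalidate(path)


    class Client:
        def __init__(self, server: Server) -> None:
            self.server = server
            self.cache: Dict[str, bytes] = {}

        def read(self, path: str) -> bytes:
            if path not in self.cache:                        # cache miss -> fetch + callback
                self.cache[path] = self.server.fetch(self, path)
            return self.cache[path]

        def invalidate(self, path: str) -> None:
            self.cache.pop(path, None)


    server = Server()
    server.files["/vol/paper.tex"] = b"v1"
    a, b = Client(server), Client(server)
    a.read("/vol/paper.tex"); b.read("/vol/paper.tex")
    server.store("/vol/paper.tex", b"v2")                     # both cached copies invalidated
    print(a.read("/vol/paper.tex"))                           # re-fetched: b'v2'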
The presented systems use different approaches to offline work and caching. Moreover, some of them require dedicated hardware, increasing efficiency at the cost of additional investments. Despite their variety, the presented tools are not well suited for organizationally distributed environments due to their limited support for provider cooperation. This is caused by their centrally managed storage systems, able to provide transparent data access only inside a single organization. Moreover, most of the above-mentioned tools are also difficult to use because of limited support for deployment on resources where some data is already stored. 2.5. TOOLS FOR UNIFIED VIEW OF MULTIORGANIZATIONAL DATA 16

2.5 Tools for Unified View of Multiorganizational Data

Another type of solution for exposing storage resources comprises systems that provide a layer of abstraction on top of storage resources across multiple organizations. They can provide a consistent view of user data stored in different systems. They expose a single namespace for data storage and often facilitate data management by enabling providers to define data management rules.

An exemplary tool is iRODS [19; 33], developed for grid environments. Under iRODS data can be stored in designated folders on any number of data servers. To integrate various external data management systems, such as GridFTP-enabled systems, SRM-compatible systems or Amazon's S3 service, a plug-in mechanism can be used. Data integration in the iRODS system is based on a metadata catalogue – iCAT – that is involved in the processing of the majority of data access requests. Metadata describing the actual data stored in the system includes information such as file name, size and location, along with user-defined parameters. The user can search for data that has been tagged, while the administrator of the system can query the metadata catalogue directly by using an SQL-like language to provide aggregated information about the system. Although iCAT provides a rich set of features for both users and administrators, it also represents a weakness of the iRODS system due to its reliance on a relational database – potentially a systemic bottleneck and a single point of failure.

Due to iRODS's ability to adjust data management to provider/user needs, it is often referred to as adaptive middleware. To allow for dynamic adaptation, the iRODS system uses rules. A rule is a chain of activities provided by low-level modules (built in or supported externally) to facilitate the required functionality. User actions are monitored by the rule engine, which can activate specific rules. Typical user interfaces are available – utilization of a POSIX interface on any FUSE-compatible Unix system [75] is enabled by a FUSE-based filesystem provided by iRODS.
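The rule mechanism can be sketched conceptually as an event-driven chain of activities triggered by user actions. The sketch below is a generic illustration in Python; iRODS expresses such rules in its own rule language, and the action and activity names used here are hypothetical.

    # Conceptual sketch of an event-driven rule engine in the spirit of iRODS rules;
    # iRODS uses its own rule language, and the activities below are hypothetical.
    from typing import Callable, Dict, List

    Activity = Callable[[dict], None]
    rules: Dict[str, List[Activity]] = {}       # user action -> chain of activities


    def on(action: str, *activities: Activity) -> None:
        rules.setdefault(action, []).extend(activities)


    def fire(action: str, context: dict) -> None:
        """The rule engine monitors user actions and activates matching rules."""
        for activity in rules.get(action, []):
            activity(context)


    # Hypothetical activities chained into a rule triggered after a file upload.
    def add_checksum(ctx: dict) -> None:
        print(f"computing checksum for {ctx['path']}")

    def replicate_to_archive(ctx: dict) -> None:
        print(f"replicating {ctx['path']} to an archive resource")

    on("put", add_checksum, replicate_to_archive)
    fire("put", {"path": "/zoneA/home/alice/results.csv"})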
Syndicate Drive is a virtual cloud storage system that combines the advantages of local storage, cloud storage, commodity CDNs, and network caches. Storj is a peer-to-peer cloud-storage network implementing end-to-end encryption to allow users to transfer and share data without support from a third-party data provider. Although these solutions contain algorithms that speed up data access, they are rarely used in common computing infrastructures. The requirement to download data prior to its use is a drawback, since such preparation should be performed in the background.

Another system worth mentioning is FAX (Federating ATLAS storage systems using XrootD) [40]. FAX brings Tier 1, Tier 2, and Tier 3 storage resources together into a common namespace that is accessible anywhere. FAX client software tools (e.g., ROOT, xrdcp) are able to reach data regardless of its location. However, the N2N component that is responsible for mapping global names to physical file names may be a performance bottleneck because of its reliance on LFC.

To enable discovery and identification of datasets, open access services such as DataCite [68; 50] or OpenAIRE [31; 51] can be used. These services rely on standards such as OAI-PMH [7] for integration with existing platforms for publication metadata harvesting, and identify datasets through globally unique handles such as DOI [27] or PID. However, these services do not directly address the issue of accessing the underlying data by end users.

In this context, we should also mention National Data Storage (NDS) [93; 20] – a national initiative and pilot implementation of a distributed data storage system intended to provide high quality backup, archiving and data access services. It introduces several useful features such as advanced monitoring and prediction, as well as replication techniques to increase availability and performance of data access. However, it lacks ease of deployment and scalability.

The tools described in this chapter are all oriented towards high-level data management. The standard POSIX filesystem interface is arguably preferable for most applications. For this reason, the effort undertaken to abstract any custom interface with a POSIX overlay is appreciated by users. However, the main goal of these systems is to enable data access from anywhere in a uniform way, rather than to achieve high performance. Hence, despite the comfort of data access and management that they offer, their applicability to HPC application execution is limited.

Table 2.1: Features of data access solutions
Feature | Tool example
Anytime/anyplace data access with location transparency | Dropbox
Increased storage system efficiency for chosen application on demand | QStorMan
Increased efficiency of data access due to use of dedicated hardware | PanFS
Efficient work with many clients owing to replication of components | BeeGFS
Fast and reliable data transfer between sites thanks to efficient protocols and transfer supervision | Globus Connect
Geographic replication that results in disaster protection and reduces risk of data loss | WOS
Stable and efficient operation when the network is slow or unreliable through client-side caching and strict support for disconnected mode | Coda
High flexibility in storage configuration due to built-in tiering | Scality Ring
Dynamic adaptation and adjusting data management behaviour to provider/user needs due to rules subsystem | iRODS
Support for distributed data management in federations through Zones mechanism | iRODS
Creating a unified view over independent data sources by providing the ability to attach remote data management systems | Parrot
Integration with various storage systems through plug-ins | iRODS
Discovery and identification of datasets | DataCite

2.6 Summary

This chapter summarizes existing data access solutions and identifies several interesting features, as listed in Table 2.1. The features offered by existing tools can be harnessed to meet the requirements of various user groups. However, all of these solutions have drawbacks (see Table 2.2 and Figure 2.1).

Figure 2.1: Drawbacks of existing tools for federated computational environments (red - drawbacks, green - advantages offset by other drawbacks)

In particular, none of the analyzed solutions support all of the following features:

• transparent anytime/anyplace data access,

• high efficiency and scalability,

• distributed (decentralized) data management.

To the best of the author's knowledge, none of the existing services or tools combine all three of the listed elements. Limitations of existing tools have led to various extensions, e.g. IBM AFM (Active File Management) [63] for GPFS or xCache [82] for xRootD. Both AFM and xCache provide additional caching that improves the quality of work in a federation. However, such improvement is only possible when all providers agree to use a common storage solution (GPFS or xRootD) across all sites – and this is often impractical. Furthermore, existing initiatives also introduce certain drawbacks. For example, NDS [93] lacks ease of deployment and scalability, while the DataNet Federation Consortium (DFC) [69] is based on iRODS and therefore exhibits the iRODS drawbacks described earlier in this chapter.

Table 2.2: Existing solution characteristics

Common grid/cloud data access tools:
  Lustre, GPFS – high-performance cluster solutions
  LFC, Swift – provide common file names within a grid or cloud
  QStorMan – improves data access performance on demand
  Drawback: several tools have to be used together – no data access transparency

Tools for anytime/anyplace data access:
  Dropbox, OneDrive, Google Drive, ownCloud – easy to use
  Drawback: limits on storage size and transfer speed

Tools for fast data movement:
  Globus Connect – provides fast data movement and data sharing capabilities based on GridFTP
  Drawback: does not abstract access to data resources

High-performance parallel filesystems:
  BeeGFS – high availability and performance due to scalable multithreaded core components with native Infiniband support
  GlusterFS – scalability and performance due to elastic file hashing algorithm
  Coda – high availability due to strict support for disconnected mode operations
  PanFS – high performance due to combination of functionality of parallel filesystem, volume manager and RAID engine into one holistic platform
  Drawback: designed to be used by a single organization – no support for organizationally distributed environments

Solutions based on object storage:
  CephFS – POSIX-compliant distributed filesystem
  DDN's WOS – true object storage
  Scality Ring – software-only storage solution with built-in tiering that provides high flexibility in storage configuration
  Drawback: designed to be used by a single organization – no support for organizationally distributed environments

Tools for map-reduce:
  HDFS – designed to stream large datasets at high bandwidth to user processes
  Tachyon – provides high performance for map-reduce applications using memory aggressively
  Drawback: no support for organizationally distributed environments

Tools for unified view of multiorganizational data:
  iRODS – integrates various external data management systems using a metadata catalogue (iCAT)
  Parrot – allows attaching existing programs to remote data management systems through a filesystem interface using the ptrace debugging interface
  Syndicate Drive, Storj – based on data download before usage
  FAX – brings Tier 1, Tier 2 and Tier 3 storage resources together into a common namespace
  DataCite, OpenAIRE – enable discovery and identification of datasets
  NDS – provides high quality backup, archiving and data access services
  Drawback: performance and/or scalability is not sufficiently high for data-intensive applications

3 MACAS - Model of Transparent Data Access with Context Awareness

This chapter defines the Model of Transparent Data Access with Context Awareness (MACAS) which enables transparent, easy, efficient and scalable data access supported by knowledge of the distributed environment (i.e. by the context). MACAS assumes that data is stored on sites that are managed by various providers (see Figure 3.1) and accessed by users through client software.

The model introduces a set of layers, each of which provides certain features to fulfill the requirements of a data access stakeholder, and cross-cutting concerns (later referred to as concerns) that describe aspects that affect many layers at once. Each layer makes use of different metadata which describes the context needed to fulfill its tasks. The model defines not only metadata which is required at a particular layer, but also describes metadata consistency and synchronization in a distributed environment. Finally, MACAS defines an algorithm that shows how the data access system should use layers and concerns to provide functionality to the user. Thus, it can be concluded that MACAS is defined by the following elements: metadata, layers and concerns, and the algorithm that makes use of the above-mentioned elements.

Metadata consistency and synchronization models determine the guarantees of consistency, availability and partition tolerance (see CAP theorem [3]) that can be provided by the system based on the MACAS model. Provider independence is a strong argument to consider availability and partition tolerance as more important than consistency – this is because data availability should not be limited by the state of resources that are not managed by the provider that hosts the data. However, lack of consistency results in different data views on different sites. Data access cannot be considered transparent if it depends on the place of access. Thus, MACAS divides metadata into groups that are treated differently to provide a balance between these mutually exclusive guarantees.

To complete the model, three steps must be performed:

1. identification of data access stakeholders and analysis of their requirements,

2. identification of metadata required to fulfill stakeholders' requirements and definition of metadata classes with different consistency and synchronization models to avoid bottlenecks,

Figure 3.1: Scheme of the data access system

Figure 3.2: Stakeholders’ relation to data access in federated computational environments

3. design of MACAS layers and concerns, including the context and functionality of each MACAS layer and concern along with its algorithmic representation.

The author performed the main tasks associated with steps 1-2, along with full personal involvement in step 3, i.e., in MACAS model development.

3.1 Data Access Stakeholders

Three main classes of data access stakeholders can be identified: users, providers and developers. The user expects a set of specific features while the provider tries to satisfy the user's requirements with the limited resources at their disposal (see Figure 3.2). The developer creates services that provide functionality for the user. These services represent IT platforms or tools which support use cases typical for a given scientific discipline. Thus, users and providers are the most important stakeholders. The developer can be perceived as an advanced user that requires additional functionality to allow integration of newly created services with the data access system. The main issues from the users' perspective are enumerated below. They are encountered while dealing with Big Data characterized by volume, velocity, variety and value:

1. easy anytime/anywhere data access,

2. easy data sharing,

3. efficient access to large volumes of data,

4. archiving or publishing data,

5. advanced control over data storage,

6. data dynamics,

7. data security.

These issues will now be described in further detail.

(1) One of the problems often encountered by the users is the lack of a uniform and easy method for anytime/anywhere access [132] to data located on distributed sites and storage systems. The users expect to access and manage their distributed data wherever they work. (2) If they cooperate with other users, they also expect easy sharing and processing of data even when the collaborators work at different sites. (3) The users also process large volumes of data and hence expect efficient data access with low latency and high throughput. (4) However, some users also require long-term storage for archiving and/or publishing data. Thus, optimization of data access efficiency cannot lead to changes in access paths.

(5) While many users expect simplicity in the above-mentioned use cases, some of the most advanced users may require better control over data storage and processing. Therefore, users should be able to influence data management policies. For example, some confidential and important data may be kept only on specific secure storage resources. Advanced customization may be required to influence system management by tagging data with the appropriate metadata. (6) Some users work on large datasets subjected to continuous modifications by external services. Such users expect dedicated features to ensure consistency of externally modified data. (7) Provisioning of any new functionality should not affect data security; hence appropriate permissions have to be set for each storage system.

Issue (6) is closely related to developer expectations. Developers create services that process data; therefore, they expect easy integration of their services with the data access system. In other words, they require an API that enables their services to access data on behalf of the user or ensure that the data access system can detect changes performed directly upon storage resources.

The providers' point of view also takes into account user expectations with regard to efficient, easy and safe access to data. Providers manage data and resources in such a way as to maximize end-user benefits. Thus, they expect:

1. the ability to tune the management algorithms to their resources,

2. monitoring and management,

3. accounting,

4. data security.

These features are explained in more detail below.

(1) Since the storage systems used by each provider differ in terms of speed, capacity and cost of purchase and maintenance, providers require the ability to influence automatic data management in order to tune the management algorithms to their resources. Therefore, providers must be able to choose a data management policy. Moreover, providers require the ability to differentiate access policies according to several factors, e.g., data format, storage system characteristics, as well as user requirements described by metadata. (2) Providers also expect monitoring and management of resource usage in order to ensure fair allocation and prevent deterioration of service quality due to unjustified usage or high demand. (3) Another issue connected with fair resource sharing is accounting. Providers are limited by costs incurred not only when purchasing resources, but also during their operation (e.g., power expenditures). Thus, to control costs, providers must be able to measure resource usage associated with specific user operations, and block activity which exceeds user quota. (4) Cooperation of providers results in additional security and operational aspects. Although providers agree to certain cooperation rules when forming a federation, they usually require full autonomy of internal resource management and access control. Thus, data access requests originating from sites which belong to other providers must be accompanied by information needed to validate access permissions.

All of the above issues are related to features of the data access system, as outlined in Table 3.1. MACAS was designed to allow development of data access systems that fulfill all the stated requirements. Focusing on particular features enables better identification of the required context (metadata) as well as better decomposition of the model into layers and concerns. These features (referenced by numbers) are then directly addressed by the algorithmic representation of MACAS (see Chapter 3.4.2).

Table 3.1: Features of the data access system expected by stakeholders
Stakeholder | Feature of data access system | Related issue(s)
Provider | 1. Management capable of following different policies | Provider 1
Provider | 2. Control of storage systems for fair resource sharing | Provider 2
Provider | 3. Accounting | Provider 3
User and Provider | 4. Data storage and access security | Provider 4, User 7
User and Provider | 5. Safe cooperation with other providers | Provider 4, User 7
User | 6. Access to distributed multiprovider environment from one place | User 1
User | 7. Advanced cooperation with other providers' users | User 2
User | 8. Efficient processing of large amounts of data | User 3
User | 9. Long-term reliable data storage | User 4, 5
User | 10. Easy realization of typical use cases with advanced customization for advanced users | User 1, 5
Developer or User | 11. Integration with external domain-specific services | User 6
Developer or User | 12. Support for integration with legacy datasets subject to frequent modifications | User 6

3.2 Basis of MACAS

MACAS aims to provide features required by the stakeholders (see Chapter 3.1). The author has decided to take advantage of automatically managed metadata. To maintain ease of access, MACAS assumes that data and metadata management are transparent to the user. Although the model does not exclude user involvement in metadata management, it assumes that such involvement is optional, reserved only for the most advanced users. Typically, the system should generate the required metadata by monitoring the environment along with user actions (i.e. it should be context-aware). Thus, key features of MACAS are data access transparency and context awareness. Relying on automatically managed metadata not only hides the system's complexity from the user but also provides flexibility, e.g., it allows easy import of large datasets without the need to copy large quantities of data (only metadata is created) or to modify legacy systems that use such datasets (a MACAS-compliant system may use a legacy dataset directly as long as it is able to import its contents).

Figure 3.3: Basic metadata used during data access

MACAS assumes that data is managed using logical files and datasets. Basic metadata is connected with logical files while datasets group logical files to simplify management. The most important part of a logical file's metadata is the description of datachunks (see Figure 3.3). Each logical file consists of one or more datachunks. The role of a datachunk is similar to that of a block in a standard filesystem; however, it may refer to multiple blocks in a block storage system or to other entities in different systems (e.g., objects). Although a datachunk can be stored using one or several storage systems, the use of many storage systems to store a single datachunk is limited to replication of the entire datachunk. If a datachunk needs to be split into several parts in such a way that each part is stored in a different storage system, it is divided into smaller datachunks. Thus, datachunks are of variable size (see Chapter 4.1).

An important assumption behind MACAS is that transparent management must not constitute a bottleneck of a MACAS-compliant data access system. To provide data access efficiency, the data access system should be able to use the full potential of high-performance storage systems where datachunks are stored. Moreover, replication of data to multiple storage systems may increase data access throughput. The use of metadata allows such optimization to be hidden from users and can maintain access paths for a long time. However, it is impossible to preserve the quality of access if metadata processing introduces large overheads (as is often the case [13; 11; 15]). Such overheads can be related to metadata access over the network. Thus, metadata is replicated across sites and clients. Minimization of metadata access overhead via replication may produce inconsistencies or additional overhead associated with metadata updates. As the environment grows (in terms of round-trip times), determination of appropriate metadata consistency and synchronization models becomes necessary. Thus, to provide transparent, easy, efficient and scalable data access, MACAS defines metadata together with consistency and synchronization models to process metadata locally and/or asynchronously whenever possible. This approach constitutes the basis upon which MACAS layers, concerns and algorithm are defined.
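To make the relationship between logical files and datachunks described above more concrete, the following sketch (in Python, purely illustrative and not part of the thesis implementation; all names are hypothetical) models a logical file as a list of variable-size datachunks, each replicated only as a whole, with a split operation used when parts of a chunk must be placed on different storage systems.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Datachunk:
    """A contiguous byte range of a logical file, replicated only as a whole."""
    offset: int                      # position within the logical file
    size: int                        # variable size (not a fixed block size)
    replicas: Set[str] = field(default_factory=set)  # storage system identifiers


@dataclass
class LogicalFile:
    """Location metadata of a logical file: an ordered list of datachunks."""
    name: str
    chunks: List[Datachunk] = field(default_factory=list)

    def split_chunk(self, index: int, split_at: int) -> None:
        """Split one datachunk into two smaller ones, e.g. when the two parts
        are to be stored on different storage systems."""
        chunk = self.chunks[index]
        assert 0 < split_at < chunk.size
        left = Datachunk(chunk.offset, split_at, set(chunk.replicas))
        right = Datachunk(chunk.offset + split_at, chunk.size - split_at,
                          set(chunk.replicas))
        self.chunks[index:index + 1] = [left, right]


# Example: a 10 MiB file stored as a single chunk on storage "ceph-1",
# later split so that its second half can be placed on "s3-archive".
f = LogicalFile("dataset/results.bin", [Datachunk(0, 10 * 2**20, {"ceph-1"})])
f.split_chunk(0, 4 * 2**20)
f.chunks[1].replicas = {"s3-archive"}
```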

The above-mentioned metadata represents only a small part of metadata useful during data access and management. All kinds of metadata can be used by the data access system:

• descriptive metadata that describes resources for purposes such as discovery and identification, e.g., metadata that describes storage systems [106],

• structural metadata that describes data containers and indicates how compound objects are put together [106], e.g., metadata that describes the structure of a filesystem,

• administrative metadata that provides information to help manage a resource, such as access times or access permissions [106],

• statistical metadata (also called process data) that describes processes that collect, process, or produce statistical data [105], e.g., information about elements of the accounting subsystem,

• reference metadata that describes the contents and quality of statistical data.

Data is accessed through client software which can be used not only by the end user, but also by external services that access data on behalf of users – instead of inconveniently uploading data to a service before processing and downloading results thereafter. The key assumption connected with client design is that data access requests produced by a particular client are handled with the resources of a single site. Binding a client to a single site has its pros and cons. When the client requests data not available on the site which handles that client’s requests, the data access system has to replicate data to the appropriate storage system before the client can access it. However, such an approach also simplifies metadata management because the client might otherwise obtain conflicting information from several external sites (e.g., due to temporary metadata inconsistencies). The author has opted for this approach because processing of some metadata by the client can greatly reduce access delays, and moreover the presented approach avoids conflicts in client-side metadata caches. In addition, appropriate use of dataset metadata to choose the site that handles client requests limits the amount of data that needs to be replicated. Finally, creation of such replicas also reduces network traffic because further data access operations (including those performed by other clients) can be handled using the replica that is closest to the client. Although MACAS does not define an algorithm for datachunk (data) replication, it recom- mends that synchronous modifications (occurring during data access operations) are only per- formed on replicas stored on a single site. This mitigates access delays and overheads connected with synchronous reconciliation of replicas over networks with high round-trip times. The sim- plest algorithm that fulfills this recommendation is one that assumes creation of a replica when reading data, and invalidation of all replicas (except the one being currently modified) when writing data (see Figure 3.4 and Chapter 4.1 for more information about default datachunk management policy).

3.3 Context Modelling in MACAS

In MACAS, context is introduced to support the stakeholders' operations, particularly with regard to storage and access to large volumes of data distributed across heterogeneous resources. The context is modelled using metadata. This results in parallel processing of metadata on multiple sites, and can therefore introduce significant overhead, potentially bottlenecking the entire data access system.

Figure 3.4: Example of datachunk replication

Since the use of metadata allows for effective access to data, a key design assumption involves replication of metadata for low metadata access delays. As a result, the system must provide efficient consistency and synchronization models for metadata, eliminating bottlenecks and providing fault tolerance. The use of weaker consistency models limits access delays while overhead can be limited by decreasing the number of entities involved in metadata synchronization. On the other hand, since some features require synchronization upon accessing metadata, stronger consistency must also be provided as an option.

3.3.1 Types of Metadata

The following types of metadata are identified:

• user metadata – provides high-level information about the users, often obtained from third-party authentication services or during registration,

• site metadata – contains information about the site name, supported authorization methods, endpoints and the provider which owns the site,

• storage metadata – composed of information about storage types and interfaces, as well as restrictions and capabilities such as availability, latency, and throughput,

• dataset metadata – provides information about data collections that are managed together, containing the name of the collection, access control list and list of sites storing elements of the collection,

• namespace metadata – holds identifiers of entities (e.g., files, documents) stored within a dataset,

• administrative metadata – describes entities stored within a dataset (e.g., creator, owner, access times, permissions, size),

• custom metadata – defined by the users; it may have an arbitrary structure and/or indicate special handling of particular data,

• location metadata – provides a mapping between logical data entities created by the users and the distributed locations of the actual data (datachunks),

• activity metadata – used to track current activity (e.g., information about user sessions, open resource handles, usage statistics, measurements from monitoring of hardware resources).

Figure 3.5 shows the main dependencies between metadata types. Namespace metadata is connected with dataset metadata – the identifiers (namespace metadata) are created within data collections described by dataset metadata. Administrative, custom and location metadata is created for entities indicated by these identifiers. A dataset can be stored by several sites described by site metadata and accessed by several users described by user metadata. Each site can have several storage systems described by storage metadata. Activity metadata can be connected to other metadata types depending on the activity it tracks (e.g., user activity or storage usage).

Figure 3.5: Main metadata dependencies

3.3.2 Description of Metadata

In the scope of context modelling, metadata relates to clients and sites (see Figure 3.1). It is assumed that each client is associated with a single site that handles its requests (see Chapter 3.2). Although each site is owned by a provider, this does not influence metadata management and providers are not used to describe metadata. The following definitions are adopted:

• basic sets:

– M = {m : m is metadata} - set of metadata,
– S = {s : s is site} - set of sites,
– C = {c : c is client} - set of clients,
– E = {e : e is entity} = S ∪ C - set of entities (sites and clients),
– T = {t : t is type of metadata} = {u, si, d, n, a, cu, l, st, act} - set of metadata types (see Table 3.2 for mapping of symbols to types defined in Chapter 3.3.1),

• relations:

– RT : m RT t - metadata m ∈ M is related to type t ∈ T if the type of metadata m is t,

Table 3.2: Abbreviations for types of metadata used in equations
Symbol | Type of metadata
u | user metadata
si | site metadata
d | dataset metadata
n | namespace metadata
a | administrative metadata
cu | custom metadata
l | location metadata
st | storage metadata
act | activity metadata

– RC : c RC s - client c ∈ C is related to site s ∈ S if site s processes the requests of client c,
– RE : m RE e - metadata m ∈ M is related to entity e ∈ E if metadata m is processed by entity e.

Based on these definitions, the distribution of metadata can be described. Client software processes selected types of metadata, while each site is capable of processing all types of metadata:

\[ \bigwedge_{c \in C} \{ m : m \; R_E \; c \} = \bigcup_{t \in \{n, a, cu, l, st, act\}} \{ m : m \; R_E \; c \land m \; R_T \; t \} \]   (3.1)

\[ \bigwedge_{s \in S} \{ m : m \; R_E \; s \} = \bigcup_{t \in T} \{ m : m \; R_E \; s \land m \; R_T \; t \} \]   (3.2)

Although not all client software must process namespace, administrative, custom, location, storage and activity metadata, the use of a client that processes metadata to reduce the negative impact of network communication is assumed (see Chapters 3.4 and 4.1). All sites operate on the same user, site and dataset metadata:

\[ \bigwedge_{t \in \{u, si, d\}} \; \bigwedge_{s_1, s_2 \in S} \{ m : m \; R_E \; s_1 \land m \; R_T \; t \} = \{ m : m \; R_E \; s_2 \land m \; R_T \; t \} \]   (3.3)

Namespace, administrative, custom and location metadata processed by all clients connected to a particular site forms a subset of that site's metadata:

\[ \bigwedge_{t \in \{n, a, cu, l\}} \; \bigwedge_{s \in S} \; \bigcup_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_T \; t \} \subseteq \{ m : m \; R_E \; s \land m \; R_T \; t \} \]   (3.4)

Activity and storage metadata is local for each site and its clients:

\[ \bigwedge_{t \in \{st, act\}} \; \bigwedge_{s_1, s_2 \in S} (s_1 \neq s_2) \Rightarrow (\{ m : m \; R_E \; s_1 \land m \; R_T \; t \} \cap \{ m : m \; R_E \; s_2 \land m \; R_T \; t \} = \emptyset) \]   (3.5)
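The distribution rules formalised in equations (3.1)–(3.5) can be read as invariants over (metadata, entity) assignments. The following illustrative check (Python; the snapshot and all identifiers are hypothetical and not part of the thesis) verifies two of them: clients hold only the permitted metadata types, and storage/activity metadata is never shared between sites.

```python
# Metadata types, following the abbreviations used in the equations.
CLIENT_TYPES = {"n", "a", "cu", "l", "st", "act"}   # eq. (3.1)
SITE_LOCAL_TYPES = {"st", "act"}                    # eq. (3.5)

# Hypothetical snapshot: entity -> set of (metadata id, metadata type) pairs.
snapshot = {
    ("client", "c1"): {("file-42-loc", "l"), ("c1-session", "act")},
    ("site", "s1"):   {("file-42-loc", "l"), ("s1-disk-stats", "st")},
    ("site", "s2"):   {("file-42-loc", "l"), ("s2-disk-stats", "st")},
}

def check_client_types(snapshot):
    """Eq. (3.1): a client processes only namespace, administrative, custom,
    location, storage and activity metadata."""
    for (kind, _), items in snapshot.items():
        if kind == "client":
            assert all(t in CLIENT_TYPES for _, t in items)

def check_site_locality(snapshot):
    """Eq. (3.5): storage and activity metadata of different sites is disjoint."""
    sites = [(e, items) for e, items in snapshot.items() if e[0] == "site"]
    for i, (_, items1) in enumerate(sites):
        for _, items2 in sites[i + 1:]:
            local1 = {m for m, t in items1 if t in SITE_LOCAL_TYPES}
            local2 = {m for m, t in items2 if t in SITE_LOCAL_TYPES}
            assert not (local1 & local2)

check_client_types(snapshot)
check_site_locality(snapshot)
```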

Subsets of other types of metadata processed by different sites can differ, depending on the datasets hosted by a particular site.

Table 3.3: Classes of metadata with abbreviations used in equations
Class of metadata | Symbol | Class name
class 1 | cl1 | cooperation metadata
class 2 | cl2 | logical file metadata
class 3 | cl3 | runtime metadata
class 3a | cl3a | private runtime metadata
class 3b | cl3b | shared runtime metadata
class 3c | cl3c | public runtime metadata

3.3.3 Classification of Metadata

Given the distribution of metadata, metadata classes are introduced in Table 3.3 and further defined in this chapter. Therefore, a set of metadata classes can be distinguished:

• CL = {cl : cl is class of metadata} = {cl1, cl2, cl3, cl3a, cl3b, cl3c}.

Additionally, the following relation is defined:

• RCL : m RCL cl - metadata m ∈ M is related to class cl ∈ CL if the class of metadata m is cl.

The sets of metadata that belong to the main classes are defined below, while the properties of class 3 and its subclasses are described later in this chapter. Class 1 - cooperation metadata consists of user, site and dataset metadata:

\[ \{ m : m \; R_{CL} \; cl1 \} = \bigcup_{t \in \{u, si, d\}} \{ m : m \; R_T \; t \} \]   (3.6)

Class 2 - logical file metadata consists of namespace, administrative, custom and location metadata:

\[ \{ m : m \; R_{CL} \; cl2 \} = \bigcup_{t \in \{n, a, cu, l\}} \{ m : m \; R_T \; t \} \]   (3.7)

Class 3 - runtime metadata consists of storage and activity metadata:

\[ \{ m : m \; R_{CL} \; cl3 \} = \bigcup_{t \in \{st, act\}} \{ m : m \; R_T \; t \} \]   (3.8)

All sites operate on the same cooperation metadata (cl1):

\[ \bigwedge_{s_1, s_2 \in S} \{ m : m \; R_E \; s_1 \land m \; R_{CL} \; cl1 \} = \{ m : m \; R_E \; s_2 \land m \; R_{CL} \; cl1 \} \]   (3.9)

Cooperation metadata (cl1) is not used by any client:

\[ \bigwedge_{c \in C} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl1 \} = \emptyset \]   (3.10)

Cooperation metadata is used for authentication, authorization and routing of data access requests. User metadata and site metadata are used to authenticate the user. The user can use different accounts to benefit from resources on multiple sites, belonging to different providers; hence, a reference to a local account is required. Such metadata also contains a description of user relations (e.g., groups the user belongs to) which are often needed when setting permissions. Site metadata and dataset metadata are required to choose the site that should handle a data access request. This type of metadata rarely changes but is frequently read. It is critical to respect each change in cooperation metadata as otherwise a security breach might occur. However, delayed propagation of changes in this metadata is allowed in many use cases, e.g., many supercomputing centres require that their data access system disallows access not later than 24 hours after a user account is banned or has expired.

MACAS assumes that data is presented to the user as 'logical files'. Logical file metadata (cl2) is shared between sites such that each site is interested in a subset of this metadata. Subsets required by different sites may share common parts. Each client can access the part of class 2 metadata that is processed by the site handling its requests:

\[ \bigwedge_{s \in S} \; \bigcup_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl2 \} \subseteq \{ m : m \; R_E \; s \land m \; R_{CL} \; cl2 \} \]   (3.11)

Namespace metadata describes the structure of data. It may specify a flat entity structure, such as in a key-value store, or a hierarchical one, such as in a filesystem. Each logical file is associated with administrative metadata and location metadata. Logical files can be translated to different entities depending on the dataset type, e.g. files for a filesystem or documents for a NoSQL database. Thus, the structure of administrative metadata and location metadata can vary. The metadata itself is usually accessed and updated frequently. Logical file metadata can also include custom metadata defined by the user. The usage characteristics of custom metadata depend mainly on the user and may result in frequent read operations as well as frequent updates. Activity metadata describes the state of resources and data usage. Together with storage metadata, it constitutes runtime metadata (cl3). Each site manages its runtime metadata independently:

\[ \bigwedge_{s_1, s_2 \in S} (s_1 \neq s_2) \Rightarrow (\{ m : m \; R_E \; s_1 \land m \; R_{CL} \; cl3 \} \cap \{ m : m \; R_E \; s_2 \land m \; R_{CL} \; cl3 \} = \emptyset) \]   (3.12)

Classes cl3a, cl3b and cl3c are subsets of runtime metadata stored and processed in similar ways. Private runtime metadata (cl3a) describes the information needed for continuous operation (information shared between requests), e.g., file handles. Private runtime metadata is not shared with anyone. Corresponding metadata structures may be created by a client and at a specific site, e.g., when the user opens a file, the client can create a handle and the corresponding information about the opened file can also be saved on the site. However, these are different metadata structures, because the client requires information about storage system-specific file/object handles, while on the site it is enough to store an identifier of the client that opened the file. Private runtime metadata stored by sites also contains the part of storage metadata which clients are not interested in:

\[ \bigwedge_{c_1, c_2 \in C} (c_1 \neq c_2) \Rightarrow (\{ m : m \; R_E \; c_1 \land m \; R_{CL} \; cl3a \} \cap \{ m : m \; R_E \; c_2 \land m \; R_{CL} \; cl3a \} = \emptyset) \]   (3.13)

\[ \bigwedge_{s \in S} \; \bigwedge_{c \in C} \{ m : m \; R_E \; s \land m \; R_{CL} \; cl3a \} \cap \{ m : m \; R_E \; c \land m \; R_{CL} \; cl3a \} = \emptyset \]   (3.14)

A certain portion of runtime metadata (cl3b) is shared between the client and the site. This mainly concerns metadata that describes the environment. While it is useful to gather it on the site in order to provide load control and fair resource sharing, clients are typically not interested in exchanging such data. Thus, class 3b metadata is shared only between a specific client and the site:

\[ \bigwedge_{c_1, c_2 \in C} (c_1 \neq c_2) \Rightarrow (\{ m : m \; R_E \; c_1 \land m \; R_{CL} \; cl3b \} \cap \{ m : m \; R_E \; c_2 \land m \; R_{CL} \; cl3b \} = \emptyset) \]   (3.15)

\[ \bigwedge_{s \in S} \{ m : m \; R_E \; s \land m \; R_{CL} \; cl3b \} = \bigcup_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl3b \} \]   (3.16)

The next component of runtime metadata is public runtime metadata (cl3c), shared between the site and its clients. It is used to control the site configuration, e.g., parameters of storage systems. It can also be used to set limits that are not client-specific, e.g., throttle¹ all clients (throttling only one selected client requires the use of class 3a). This type of metadata also contains a portion of storage metadata that is needed by all clients, e.g., description of interfaces.

\[ \bigwedge_{s \in S} \; \bigwedge_{c_1, c_2 \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c_1 \land m \; R_{CL} \; cl3c \} = \{ m : m \; R_E \; c_2 \land m \; R_{CL} \; cl3c \} \]   (3.17)

\[ \bigwedge_{s \in S} \; \bigwedge_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl3c \} = \{ m : m \; R_E \; s \land m \; R_{CL} \; cl3c \} \]   (3.18)
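As an illustration of the single-writer pattern of public runtime metadata (class 3c) described above, the sketch below (hypothetical Python, not the thesis implementation) shows a site lazily broadcasting a throttling setting to all of its clients, which only read it.

```python
class Site:
    """Single writer of public runtime metadata (class 3c)."""

    def __init__(self):
        self.public_runtime = {"max_requests_per_s": 1000}  # cl3c metadata
        self.clients = []

    def set_throttle(self, value: int) -> None:
        # Only the site modifies class 3c metadata; clients never write it.
        self.public_runtime["max_requests_per_s"] = value
        self.broadcast()  # lazy broadcast may also be deferred or batched

    def broadcast(self) -> None:
        for client in self.clients:
            client.public_runtime_cache = dict(self.public_runtime)


class Client:
    """Reads class 3c metadata from its local (eventually consistent) cache."""

    def __init__(self, site: Site):
        self.public_runtime_cache = dict(site.public_runtime)
        site.clients.append(self)

    def allowed_rate(self) -> int:
        return self.public_runtime_cache["max_requests_per_s"]


site = Site()
c1, c2 = Client(site), Client(site)
site.set_throttle(100)                       # throttle all clients at once
print(c1.allowed_rate(), c2.allowed_rate())  # 100 100
```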

Metadata classes are defined in relation to distribution of metadata. Thus, for distributed metadata, consistency and synchronization models have to be specified.

¹To throttle – to regulate speed (usually implies slowing down).

3.3.4 Metadata Consistency and Synchronization Models

Keeping metadata synchronized at all times in a large-scale, distributed data access system is too expensive and wastes processing resources, since CPUs will often have to wait for synchronization of metadata. Moreover, it reduces the system's availability in case of network failure. Although temporary metadata inconsistencies result in potential problems, such as data loss caused by conflicting metadata changes (see Chapter 4.1.3), these can usually be alleviated by modifying data processing scripts in order to adapt them to a distributed environment. Thus, the author has decided to analyze the possibility of using weaker consistency models and delayed synchronization.

As stated before, class 1 metadata is read intensively and rarely changes. As it is critical to acknowledge each change, weak consistency models (e.g., eventual consistency) are not applicable in this context. However, delayed propagation of changes is often permitted. The use of lazy (instead of eager) replication may accelerate operations; hence MACAS does not require the use of a strong consistency model for class 1 metadata. Instead, sequential or causal consistency is permitted. The causal consistency model is weaker than the sequential consistency model because it stipulates that only causally related writes must be observed in a set order by all processes. Thus, the author has decided to apply causal consistency for better scalability.

Class 2 metadata typically has a much larger volume than class 1 metadata, and is also subject to more frequent updates. Applying causal consistency to class 2 metadata would represent a bottleneck. As the user expects to be able to access all data everywhere regardless of its origin, the author has decided to use eventual consistency for class 2 metadata. This metadata is changed locally and synchronized asynchronously. If a conflict occurs, it is resolved automatically. To decrease synchronization overhead, the synchronization model also assumes that metadata is synchronized only within the sites (and their clients) that are interested in a particular dataset. Information about sites interested in the metadata of a particular dataset is stored in dataset metadata that is synchronized across all providers. The proposed consistency and synchronization model enables efficient parallel processing of metadata on multiple sites. This efficiency comes at the cost of possible conflicts and temporary inconsistencies when metadata changes. Since MACAS assumes support for HPC, its metadata management cannot represent a bottleneck; thus, support for conflict-free data access is realized with the help of dataset metadata. By using dataset metadata the user can indicate that class 2 metadata of a particular dataset should be managed by a single site, at the cost of limiting the performance of access to data stored in such a dataset.

Class 3a metadata is always managed by a single entity (client or site) and therefore not subject to synchronization. Class 3b metadata is processed by a single site and client, while class 3c is processed by a single site and many clients. In both cases the metadata is used mainly for monitoring and behavioral tuning, so it is possible to propagate changes with some delay. Thus, both metadata classes (3b and 3c) apply the eventual consistency model, differing in terms of their synchronization scope.
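MACAS only requires that class 2 metadata converges under eventual consistency with automatic conflict resolution; the concrete mechanism belongs to the implementation (see Chapter 4.1.3). For illustration only, the sketch below shows one common approach based on version vectors with a deterministic tie-break; all names and the metadata layout are hypothetical.

```python
def merge_class2(local, remote):
    """Merge two replicas of a class 2 metadata document.

    Each replica carries a version vector {site_id: counter}. If one vector
    dominates the other, its value wins; otherwise the update is concurrent
    and a deterministic rule (here: higher last_writer id) resolves the conflict.
    """
    lv, rv = local["version"], remote["version"]
    sites = set(lv) | set(rv)
    local_dominates = all(lv.get(s, 0) >= rv.get(s, 0) for s in sites)
    remote_dominates = all(rv.get(s, 0) >= lv.get(s, 0) for s in sites)

    if local_dominates and not remote_dominates:
        winner = local
    elif remote_dominates and not local_dominates:
        winner = remote
    else:
        # Concurrent modification: resolve automatically and deterministically,
        # so that every site converges to the same value.
        winner = max(local, remote, key=lambda d: d["last_writer"])

    merged_version = {s: max(lv.get(s, 0), rv.get(s, 0)) for s in sites}
    return {**winner, "version": merged_version}


a = {"size": 100, "version": {"s1": 2, "s2": 1}, "last_writer": "s1"}
b = {"size": 120, "version": {"s1": 1, "s2": 2}, "last_writer": "s2"}
print(merge_class2(a, b))  # concurrent -> s2 wins, version {'s1': 2, 's2': 2}
```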
It should also be mentioned that there are no conflicts in class 3 metadata because, even when many entities are interested in particular metadata, it is always modified by a single entity, e.g., the client modifies metadata describing the machine upon which it operates, while the site only analyzes this data. Similarly, the site modifies metadata that contains throttling settings while clients are limited to reading it.

Consistency and synchronization models of metadata classes are summarized in Table 3.4. Combining several models is necessary to ensure data access transparency and quality in terms of performance. While using weaker consistency models and limiting the scope of synchronization is necessary to process metadata efficiently, such models are not applicable to metadata crucial for security and integration of different users' local accounts, which is necessary to provide data access transparency. Usage of weaker consistency models also results in temporary differences in data views. Thus, appropriate parameterization of synchronization methods is needed to maintain data access transparency in terms of place of access (see Chapter 4.1.3). Further work with metadata within this thesis comprises two main activities: designing a metadata processing mechanism and implementing it in a distributed environment while ensuring efficiency, as presented in Chapter 4.

Table 3.4: Metadata classes
Class | Description | Consistency model | Synchronization model
Class 1 (cl1) | cooperation metadata | causal consistency | all sites interested in metadata, modifications are executed in order to avoid conflicts, lazy changes broadcast to all sites
Class 2 (cl2) | logical file metadata | eventual consistency | subset of sites and their clients interested in particular metadata, parallel modifications with automatic conflict resolution
Class 3a (cl3a) | private runtime metadata | – | –
Class 3b (cl3b) | shared runtime metadata | eventual consistency | single client and site interested in particular metadata, single writer, lazy change replication
Class 3c (cl3c) | public runtime metadata | eventual consistency | single site and many clients interested in particular metadata, single writer, lazy change broadcast

3.4 Model Description

The Model of Transparent Data Access with Context Awareness consists of layers that provide different features to fulfill stakeholders' requirements (see Chapter 3.1). This is achieved by using contextual information described by metadata. The following layers are defined:

• Access Layer – enabling interaction with diverse storage resources. Allows handling of different data formats. Uses class 3 metadata to provide quality connected with performance, security and reliability.

• Executive Layer – coordinates execution of multiple operations for using more than one storage system. Manages data according to the selected policy. Uses class 2 and 3 metadata.

• Routing and Performance Layer – covers technical aspects connected with distribution of the environment. Uses class 3 metadata to balance utilization of resources and react to brief load fluctuations, and class 1 metadata for redirections between sites.

• Communication Layer – provides interfaces that enable client interaction with the data access system. Uses class 3 metadata to maximize connection capabilities.

• Cooperation Layer – provides the functionality required for cross-site cooperation (including between sites that belong to different providers) using class 1 metadata.

• Client Layer – enables user interaction with the data access system. Uses class 2 and 3 metadata to reduce network overhead.

The model also includes two cross-cutting concerns:

• Monitoring and Management Concern – produces class 3 metadata used by the layers.

Figure 3.6: Model of Transparent Data Access with Context Awareness

• Security Concern – ensures data safety. Uses class 1 metadata to provide different types of access credentials for different layers, and uses class 1 and class 2 metadata to verify access permissions for a particular file.

Appropriate combinations of features provided by each layer and concern facilitate support for various use cases (see Chapter 3.4.2).

3.4.1 Description of MACAS Layers and Concerns

The MACAS layers and concerns are depicted in Figure 3.6. Metadata classes used in the figure refer to classes defined in Chapters 3.3.3 and 3.3.4 (see Table 3.4).

The Access Layer provides access to data stored by users regardless of the point of access and location of data. It unifies data access methods through virtualization, providing the illusion that all data is stored locally. The Access Layer unifies access methods for storage systems that differ with respect to interfaces and basic concepts (e.g., keeping data in classic filesystems vs. database tables vs. flat object structures). It possesses context awareness in terms of storage system configuration and can provide quality in terms of low-level data access operations. Although the performance of access operations is strictly connected with storage system capabilities, the Access Layer enables better utilization of the storage system, e.g., buffering writes for block size optimization (see Chapter 5.2.1). It also provides data storage security and reliability, e.g., encrypting or replicating data across multiple storage nodes.

The Executive Layer is introduced to meet the needs of user groups which require several distinct storage resources. It handles management of storage resources virtualized by the Access Layer to create a unified view of data stored on several virtualized resources. The Executive Layer is characterized by high context awareness. It enforces selected policies using class 3 metadata to detect the state of the environment and class 2 metadata to acknowledge user requirements. The functionality of the Executive Layer plays a vital role in ensuring quality by choosing/combining different data access and storage methods for specific activities (e.g., by replicating selected data between storage systems), regardless of differences in quality offered by various storage systems on the Access Layer.

The Client Layer and the Communication Layer both enable the user to interact with the data access system. While many users expect a simple, intuitive interface with basic options, some may require advanced functionality. However, extending the basic interface with advanced options may confuse less advanced users. Thus, the Client Layer provides several data access methods including a POSIX-compliant interface and a WebGUI client suitable for most users. Individual clients may require different methods of communication with other layers, so the Communication Layer hides technical aspects to enable various ways of communicating with clients. This layer handles communication with popular generic clients (e.g., curl [67]) and fulfills advanced requests (e.g., CDMI [66]). The Client Layer and Communication Layer are responsible for quality in terms of ease of use. They are also context-aware with regard to access location and quality of available communication media, and attempt to reduce network overhead (e.g., caching) while maximizing connection throughput.

The Cooperation Layer and Security Concern handle secure cooperation of sites that belong to different providers sharing common goals (according to some agreed-upon policy). Typically, the basic goal is to facilitate the use of a multiprovider environment. This functionality is realized by provisioning coordinated access to class 1 metadata on all sites. The Cooperation Layer provides a unique identification of the user. It facilitates multisite operations despite different authentication and authorization systems used by each provider.
The Security Concern maps the unique user identifier to formats required by layers, e.g., to credentials for accessing a particular storage system. The Security Concern also processes class 2 metadata to check access permissions for a particular file. Additionally, by using class 1 metadata, the Security Concern is responsible for configuration of safe protocols used by the Communication Layer, setting up secure communication channels between sites and enabling authentication of each site. Class 1 metadata managed by the Cooperation Layer can also cover other aspects of cooperation, e.g., indicate that a particular type of request is handled only by a chosen set of sites with adequate resources.

The Routing and Performance Layer forwards requests to system elements that can either handle them directly or reroute them to other sites, as needed. This layer, together with the Executive and Cooperation Layers, is a key element for providing quality in terms of performance, reliability and availability. While the Executive and Cooperation Layers make decisions that have long-term impact on the system, e.g., where the data is stored and/or replicated, the Routing and Performance Layer takes into account the load and current capabilities of resources to balance utilization and react to short-term fluctuations, e.g., temporary overload of a network interface. The Routing and Performance Layer is also able to overcome problems connected with failures. It provides a very important feature – high availability – routing requests only to elements that are currently active (assuming redundancy of system components).

The Monitoring and Management Concern provides mechanisms for monitoring and adapting to the changing environment. It gathers information by executing monitoring tasks, e.g. checking memory utilization or gathering statistics which describe the communication flow. This system state data is subsequently processed to produce class 3 metadata that is used by layers. The author refers to such metadata as advice. The process of creating advice may vary, e.g., it may involve execution of commands or be based on internal counters. Different layers also use advice differently. It can be used to affect the entire system or selected users (e.g., throttling data access for users who have exceeded their quotas). An important responsibility of the Monitoring and Management Concern is accounting, i.e. control over soft and hard access quotas and undertaking appropriate actions when such quotas are exceeded. The policy may call for denying access, blocking a particular operation or activating performance limiters for selected clients or users.

The layers and concerns cover different aspects of quality, while users expect different types of quality, e.g., high performance vs. high security. Such functionality is often provided by different storage systems. Thus, the Executive Layer, which decides how data is distributed across resources with differing characteristics, plays a key role in implementing different policies. By leveraging class 2 metadata which describes user and provider expectations, the Executive Layer is able to act differently in seemingly similar situations.

3.4.2 MACAS Algorithm

Figure 3.7 shows an algorithmic representation of MACAS for a typical use case of accessing user data via client software in order to work with data stored on resources that belong to one or many sites. The figure connects each step of the algorithm with the layer or concern responsible for its realization.

To begin with, the user requests access to data and the client reads input parameters and configuration values. The user action is then processed by the Cooperation Layer, which determines the site and resources that will be used to handle it. Next, the Security Concern provides the appropriate credentials to enable authentication. Once the connection between the client and the site has been established, the user action is further processed, generating one or more low-level client requests. For each request, the client checks the availability of metadata required to handle that request locally. When no suitable metadata can be found, the Communication Layer and Routing and Performance Layer are used to initiate metadata synchronization. The Security Concern verifies each request and maps the user account to appropriate storage system accounts. The Executive Layer is responsible for processing requests connected to logical file metadata management, e.g., choosing the storage system that should be used to handle the next low-level data access request. The Cooperation Layer is used if the request modifies any class 1 metadata, e.g., when it adds a user to a group. If the request involves data access, it is processed by the Access Layer, which is responsible for interaction with the selected storage system. If needed, the Access Layer can delegate some work to other layers, e.g., when data is stored at a remote location. MACAS assumes that, regardless of user activities, the environment is constantly monitored by the Monitoring and Management Concern, which provides advice for layers (advice is provided in the form of metadata – see Chapter 3.4.1).

The described generic algorithm enables provisioning of the features listed in Table 3.1 in Chapter 3.1. It supports a wide range of use cases owing to appropriate implementation of individual steps. Thus, feature numbers used in the following section refer to the aforementioned table, while step numbers are taken from Figure 3.7.

Easy implementation of typical use cases with customization for advanced users (Feature 10) is realized by providing different interfaces – a simple one that exposes only basic operations and a complex one, extended with more advanced operations (algorithm Step 5). The client also provides integration with external domain-specific services (Feature 11), e.g., using tokens to operate on behalf of the user (Steps 3 and 11).
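The request flow described above can be summarized in the following simplified sketch (Python; it is not a literal transcription of Figure 3.7, and all component objects and method names are hypothetical stand-ins for layers and concerns).

```python
def handle_user_action(action, client, cooperation, security,
                       executive, access, routing):
    """Simplified sketch of the MACAS request flow described above
    (illustrative only; the objects stand in for layers and concerns)."""
    # Cooperation Layer: choose the site that handles the action (class 1 metadata).
    site = cooperation.select_site(action.dataset)
    # Security Concern: obtain credentials and authenticate against that site.
    credentials = security.authenticate(client.user, site)

    results = []
    for request in client.translate(action):          # one or more low-level requests
        if not client.has_local_metadata(request):    # Client Layer cache miss
            # Communication + Routing and Performance Layers: synchronize metadata.
            routing.synchronize_metadata(client, site, request)
        # Security Concern: verify the request and map the user to storage accounts.
        security.authorize(credentials, request)
        if request.touches_class1_metadata:
            cooperation.update(request)                # e.g. add a user to a group
        # Executive Layer: pick storage system / access method (class 2 + 3 metadata).
        storage = executive.select_storage(request)
        # Access Layer: interact with the selected storage system.
        results.append(access.execute(storage, request))
    return results
```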

Figure 3.7: Algorithmic representation of MACAS

Support for integration with legacy datasets which are continuously modified by external tools (Feature 12) is possible due to background activities of the Executive Layer. If requested by the user, the Executive Layer initiates the process of creating class 2 metadata for an existing dataset (Step 19). This process then involves monitoring the dataset for external changes.

Long-term reliable data storage (Feature 9) is realized by the Executive Layer. It processes class 2 metadata (especially custom metadata) and chooses the appropriate storage resources (Step 13). Similarly, the Executive Layer uses class 2 metadata to select a safe storage system when data storage and access security is requested (Feature 4). A vital role in providing security is fulfilled by the Security Concern which relies on class 1 metadata to manage access credentials (Step 11).

Interaction with a distributed multiprovider environment from a single access point (Feature 6) is enabled by the Cooperation Layer (Step 2) which uses class 1 metadata to choose the appropriate site (providing efficient access to class 2 metadata) and by the Routing and Performance Layer (Step 10) which proxies requests when needed (e.g., when a single client works with several datasets connected to different sites). When the sites communicate with each other, secure cooperation (Feature 5) is provided by the Security Concern (Step 11) which uses class 1 metadata to verify the permissions of the user on whose behalf one site accesses another. In addition, the Cooperation Layer provides support for advanced cooperation between users supported by different providers (Feature 7), managing class 1 metadata that describes user relations (Step 14).

Management which implements different policies (Feature 1) is provided by the Executive Layer, using metadata (including user-defined metadata) to select the best resources for particular data (Step 13). Control over storage systems to ensure fair resource sharing (Feature 2) and accounting (Feature 3) is enabled by the Monitoring and Management Concern, with class 3 metadata that describes resource usage by specific clients/users (Steps 16 and 17). The Executive Layer uses this metadata to control clients (Step 18).

Class 3 metadata provided by the Monitoring and Management Concern is also important for efficient processing of large amounts of data (Feature 8). It is used by the Access Layer to optimize work with a single storage system (Step 15), the Routing and Performance Layer to react to load fluctuations (Step 10) and the Communication Layer to optimize network usage (Step 4).

The most important role in the entire algorithm falls to the Executive Layer which uses both class 2 and class 3 metadata to choose the optimal storage system and access method for particular data (Step 13). It also creates and manages class 2 metadata that describes access permissions (Step 13) for individual files to allow authorization by the Security Concern (Step 12). However, the Client Layer and the Cooperation Layer are also important. The Client Layer reduces network communication overhead (Step 8) while the Cooperation Layer uses class 1 metadata to indicate sites that can process particular data/metadata efficiently, using resources contributed by multiple sites (Step 2).

3.5 Summary

Stakeholder requirements have been used to identify the desired features of the data access system. The author subsequently defined a data access model – MACAS – using the following elements: metadata, layers and concerns, and the algorithm that makes use of the above-mentioned elements. The goal of MACAS is to ensure data access transparency, implying that data access management should not involve the user unless specifically requested.

The author has decided to base data access on logical files that are composed of datachunks. Datachunks are portions of data stored on one or several storage systems. They are described by automatically managed metadata. To prevent metadata from bottlenecking the data access system even in a highly distributed environment, several classes of metadata with different consistency and synchronization models have been defined. However, weakening of consistency results in possible temporary differences in data views on different sites; hence, appropriate implementation of synchronization methods is necessary to maintain data access transparency in terms of place of access.

The functionality of MACAS is modeled using layers and concerns that process the previously defined metadata. Additionally, an algorithm has been introduced to describe the provisioning of data access using MACAS layers and concerns.

4 Architecture and Selected Aspects of Implementation

In this chapter the architecture and implementation of two main MACAS components – the Data Management Component (DMC) and the FUSE client (FClient), together with the auxiliary Handler component – are discussed. As described in Chapter 1.3 the architecture is obtained by mapping MACAS elements to elements of the architecture itself, and thereafter to components of its implementation. In the implementation phase the Erlang and C programming environments are used. The author has contributed to the overall architecture of the system by co-developing it with Lukasz Dutka, PhD. The author’s work focused on DMC: in particular, the author developed the architecture and implementation of the Data Management Component kernel called DMC Core.

4.1 Overall Architecture of the System

Since data is maintained on sites and accessed via clients, the two main elements of the MACAS model are the Data Management Component (DMC) and the FUSE client (FClient). Both use Handlers to access different types of storage systems. DMC is responsible for management of data and resources on a single site, while FClient provides a user interface (see Figure 4.1). The system exposes a single logical namespace to users while interfacing many storage systems. FClients are launched on resources which facilitate data access, in order to reduce the impact of network communication between user processes and DMC. Each FClient caches metadata required to translate logical file names to storage system locations, and accesses data via its own Handler instances, avoiding contact with DMC whenever possible (see Figure 4.2). Storage Systems represent resources managed by the system. Effective data management is enabled by direct communication between storage systems and DMC instances which manage them. Computing Elements represent nodes which execute user processes, accessing data via FClients.

DMC implements the functionality of the Executive and Communication Layers to provide interfaces to resources and manage the way in which these resources are used. It performs its job by processing metadata. Due to the diversity of resources managed by DMC, data access is delegated to the appropriate Handler responsible for a particular resource. The Handler performs both unification of access methods and access optimization, e.g., by aggregating/splitting write requests depending on the optimal block size.

Figure 4.1: Overall architecture of the system

Figure 4.2: FUSE client (FClient) concept

According to the stated assumptions, DMC can expose multiple interfaces for various client types; however, the most important DMC interface is the one which supports POSIX operations for FClients. Each FClient accesses data on behalf of the given user. It provides a set of callbacks executed for successive user actions (see Figure 4.3a). The FClient is designed to provide efficient data access for high-performance applications by combining two main features: performing operations at the actual data location (using Handlers) and handling metadata calls connected with these operations. The FClient works locally whenever possible (based on cached metadata), asynchronously informing DMC about its activities (see Figures 4.3c and 4.3d). The FClient communicates synchronously with DMC only to ensure consistency of its metadata cache (e.g., obtaining information about the Handler, storage and ID of a file in the storage system when that file is being opened – see Figure 4.3b) or to request data synchronization (see Figure 4.3d). DMC updates FClient caches when a particular metadata element changes, while FClients generate events to inform DMC about their actions (see Chapter 4.1.2).

The basic pseudocode for handling FClient requests by DMC is depicted in Figure 4.4. The request can be handled locally or rerouted to another DMC when it concerns a dataset that is not supported by the local DMC. When the request is handled locally, user permissions are verified and the code of the appropriate module (see Chapter 4.2.2) is executed on one of DMC's nodes (see Chapters 4.2.1 and 4.2.3). It should be emphasized that DMC handles multiple requests in parallel. Moreover, handling user actions or client requests is only part of FClient and DMC responsibility. Both elements also execute some background tasks (e.g., monitoring – see next paragraph) irrespective of user actions.

Some functionality is implemented by both DMC and FClient, e.g., DMC may be aware of the state of resources but not always able to monitor the FClient's environment. Information about this environment is crucial in order to select the best access method and storage system for a particular FClient request. Thus, both components contain monitoring and routing modules to ensure that this functionality can be provided throughout the system.
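Figure 4.4 itself is not reproduced in this text. As a rough illustration of the flow just described, the following Erlang sketch shows how a DMC could route, authorize and dispatch a single FClient request; the module name, function names, map keys and stub implementations are assumptions made for illustration only, not the actual code.

    -module(dmc_request_sketch).
    -export([handle_fclient_request/2]).

    %% Sketch of the flow described for Figure 4.4: requests for datasets supported
    %% locally are authorized and dispatched to a module; other requests are
    %% rerouted to the DMC of another site.
    handle_fclient_request(#{dataset := Dataset, user := User} = Req, LocalDatasets) ->
        case lists:member(Dataset, LocalDatasets) of
            false ->
                {reroute, remote_dmc_for(Dataset), Req};
            true ->
                case has_permission(User, Dataset) of
                    true  -> {ok, run_module(Req)};
                    false -> {error, eacces}
                end
        end.

    %% Stubs standing in for the real permission check, routing table and modules.
    has_permission(_User, _Dataset) -> true.
    remote_dmc_for(_Dataset) -> remote_site.
    run_module(#{op := Op}) -> {handled, Op}.

In the actual system, dispatching additionally selects one of the DMC cluster nodes and many such requests are handled in parallel (see Chapters 4.2.1 and 4.2.3).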

Figure 4.3: Sample FClient pseudocode

Figure 4.4: Basic pseudocode for handling FClient requests 4.1. OVERALL ARCHITECTURE OF THE SYSTEM 43

Table 4.1: Implementation assumptions for MACAS

Functionality: Management of data stored on particular resources; Provision of access to particular resources
Where functionality is provided: Site that owns resources
Layer or Concern: Executive Layer; Communication Layer
Comments: DMC manages data in a single site. Resources of a site are linked and should be managed together.

Functionality: Management of metadata describing relationships between users and datasets
Where functionality is provided: Any place, but with global coordination
Layer or Concern: Cooperation Layer
Comments: DMCs choose one site that manages cooperation metadata. Requests depend on each other.

Functionality: Synchronization of metadata that describes data distributed among several sites
Where functionality is provided: Sites that process metadata
Layer or Concern: Cooperation and Executive Layers
Comments: DMCs synchronize metadata. Cooperation metadata is used to choose subsets of DMCs that take part in synchronization. Conflicts resolution is needed.

Functionality: Provisioning of user interface
Where functionality is provided: Client
Layer or Concern: Client Layer
Comments: FClient handles users' requests. It processes metadata to reduce network impact.

Functionality: Accessing data on different storage systems
Where functionality is provided: Client and site
Layer or Concern: Access Layer
Comments: Handler provides access to a particular storage system. There are Handlers for different storage systems. Data may be accessed anywhere but different storage systems require different access methods.

Functionality: Routing of requests; Environment monitoring
Where functionality is provided: DMC and FClients
Layer or Concern: Routing and Performance Layer; Monitoring and Management Concern
Comments: DMC and FClients work in different environments. DMC and FClients have modules for routing and monitoring.

Functionality: Preventing unauthorized access; Adaptation to provider and user needs
Where functionality is provided: -
Layer or Concern: Security Concern; Executive Layer, Monitoring and Management, Security Concerns
Comments: DMC manages FClients and Handlers to provide security and desired characteristics.

Table 4.1 presents the overall implementation assumptions. Each DMC supervises a set of tightly linked resources and manages client requests. Data management is based on logical files (accessed through paths or UUIDs) and datachunks (see Chapter 3.2). Data is not synchronized automatically. Data replicas described by each datachunk are created/deleted when the data is read/written, unless another policy has been specified (see Chapter 3.2 and Chapter 5.3 for a description of the corresponding tests).

The pseudocode for FClient datachunk synchronization request handling is outlined in Figure 4.5. When such a request appears, DMC generates a list of datachunks that have to be requested from other DMCs, splitting datachunks on the fly if needed. Subsequently, datachunk synchronization requests are submitted to the appropriate module that manages data transfer between DMCs. When data is synchronized, datachunk descriptions are updated and information about updates is sent directly to the FClient that issued the given request, as well as – through events – to other interested FClients (see Chapter 4.1.2). Other DMCs receive information about datachunk updates through the metadata synchronization mechanism (also described in Chapter 4.1.2).

Figure 4.5: Pseudocode for FClient datachunk synchronization request handling
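Since Figure 4.5 is not reproduced here, the following Erlang fragment sketches its first step: clipping (splitting on the fly) the datachunks known to be held by other DMCs to the byte range requested by an FClient. The {Offset, Size, Dmc} tuple representation is an assumption made for illustration only.

    -module(chunk_request_sketch).
    -export([chunks_to_request/3]).

    %% Return the parts of remote datachunks that overlap the requested byte range
    %% [From, To); chunks are clipped (split on the fly) to that range.
    chunks_to_request(From, To, RemoteChunks) ->
        [{max(Offset, From), min(Offset + Size, To) - max(Offset, From), Dmc}
         || {Offset, Size, Dmc} <- RemoteChunks,
            Offset < To, Offset + Size > From].

For example, chunks_to_request(100, 300, [{0, 150, dmc_a}, {150, 400, dmc_b}]) yields [{100, 50, dmc_a}, {150, 150, dmc_b}], i.e., two synchronization requests addressed to different DMCs.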

The pseudocode for updating metadata which describes datachunks is depicted in Figure 4.6. DMC processes information not only about datachunks stored on its own resources, but also about datachunks stored on resources which belong to other DMCs to determine the recipients of synchronization requests (see Figure 4.5). Thus, the metadata describing a datachunk can be updated when an event which involves data modification appears, when information about changes to a datachunk introduced by an external DMC is received, or when a synchronization request is handled. When an update of metadata describing a datachunk results from an event or a synchronization request, newly appearing datachunks are merged with existing ones if possible. Receipt of information concerning a change to a datachunk introduced by another DMC results in updating metadata which describes the datachunks of that DMC. Afterwards, this information is used to invalidate metadata describing datachunks stored on the DMC's own resources, if these datachunks overlap with datachunks updated on the resources managed by another DMC.

Figure 4.6: Pseudocodes for updates of metadata describing datachunks

To decrease synchronization overhead, all datachunks that describe data stored on the DMC's own resources are marked as either public or private. Datachunks marked as public are created as a result of data modifications (other DMCs need to know where to locate the most recent data). Datachunks marked as private are created as a result of replication, so there is no need to broadcast information about them quickly (i.e. there is at least one replica of that data on resources managed by another DMC). As random reads can result in creation of thousands of datachunks per second, synchronizing all metadata describing these datachunks would introduce high overhead. Thus, private datachunks are treated in a special manner and metadata describing them is broadcast only after they are merged into larger datachunks (see Figure 4.6a).

The presented approach provides high elasticity and reduces network load, but calls for effective processing of metadata describing datachunks in a size-dependent manner. The smaller the datachunk, the more metadata has to be updated. However, if the datachunk is stored on several storage systems, changing even a single byte causes that datachunk to be invalidated across all systems except the one where the change took place. This can lead to unnecessary data transfers. Given that both large and small datachunks have their pros and cons, DMC supports variable datachunk size and permits datachunks to be aggregated and split on the fly. When information about a datachunk change is received by the system the datachunk may either be merged with existing datachunks or invalidation of existing datachunks may occur (see Figures 4.6b and 4.6c respectively), which can also result in splitting datachunks – as depicted in Figure 4.7. The list of datachunks to be merged or invalidated is created from a sorted tree of existing datachunks, selecting those that overlap with the changed datachunk. During invalidation and merging, metadata describing the beginnings and endings of existing datachunks is changed to properly describe data distribution following modifications. Some metadata describing datachunks can also be deleted (datachunks may be merged with each other while others might be fully contained in the datachunk which initiated the invalidation process). Since such operations on metadata require additional effort, the system limits the number of metadata operations by introducing an appropriate update mechanism (see Chapter 4.1.3).

DMC instances share class 2 metadata (see Chapters 3.3.3 and 3.3.4) when data is distributed between resources supervised by them. Metadata used for authentication and authorization is managed by a DMC module called the cooperation manager (see Chapter 4.2.2). When many DMCs are deployed, one is chosen to store and manage cooperation metadata while others delegate cooperation metadata modification requests to the selected DMC.
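The merging of datachunks into larger ones, mentioned above for private datachunks, can be illustrated with the following sketch; the {Offset, Size} pair representation and the use of a sorted list instead of the actual sorted tree are simplifying assumptions.

    -module(chunk_merge_sketch).
    -export([merge_chunks/1]).

    %% Merge overlapping or adjacent datachunks, given as {Offset, Size} pairs,
    %% into the smallest equivalent set of larger chunks.
    merge_chunks(Chunks) ->
        lists:reverse(lists:foldl(fun merge_one/2, [], lists:keysort(1, Chunks))).

    merge_one(Chunk, []) ->
        [Chunk];
    merge_one({Offset, Size}, [{PrevOffset, PrevSize} | Rest])
      when Offset =< PrevOffset + PrevSize ->
        %% Overlapping or adjacent: extend the previously accumulated chunk.
        [{PrevOffset, max(PrevOffset + PrevSize, Offset + Size) - PrevOffset} | Rest];
    merge_one(Chunk, Acc) ->
        [Chunk | Acc].

For example, merge_chunks([{0, 10}, {5, 10}, {30, 5}]) returns [{0, 15}, {30, 5}].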

4.1.1 Metadata Distribution

Since the design assumptions call for local metadata processing, metadata stores and caches are introduced (see Figure 4.8). The classes referred to in this chapter are defined in Chapters 3.3.3 and 3.3.4. The terms “metadata store” and “metadata cache” describe entities that maintain metadata either in memory or in fixed storage. However, they are not interchangeable. The metadata store is responsible for storing metadata from the time of its creation until it is no longer used and can be completely forgotten. The metadata cache is used by DMC or FClient to keep copies of metadata which is maintained in a remote metadata store (remote is interpreted as not located in the memory of a particular FClient or DMC instance, e.g., from the perspective of an FClient the DMC metadata store can be classified as remote). Thus, selected metadata is copied to the metadata cache when required by a particular FClient or DMC, and deleted from the cache when deemed irrelevant by that FClient or DMC. Removal of metadata from the metadata cache does not equate with complete deletion of metadata (however, this equality holds for the metadata store). The value of a cached metadata element does not have to be consistent with the value of the same element in the metadata store.

Figure 4.7: Pseudocodes for metadata updates following merger and invalidation of datachunks

Figure 4.8: Metadata stores and caches

An updated value in the cache can be asynchronously synchronized with the store, while updates carried out in the store can be asynchronously broadcast to caches interested in the given metadata (see Chapter 4.1.2). The consequences of such behaviour are discussed in Chapter 4.1.3. Each metadata class (see Chapters 3.3.3 and 3.3.4) is stored and managed separately.

Class 1 metadata is more frequently read than updated and it is critical to preserve each change. Updates are performed in a transactional way on the level of the whole system. The cooperation manager module of DMC redirects class 1 metadata modification requests to the DMC which manages the class 1 metadata store. Additionally, all DMCs use class 1 metadata caches where copies of such metadata are stored. All modifications are performed directly within the class 1 metadata store and dispatched asynchronously to all caches.

Each DMC stores class 2 metadata that describes logical files along with descriptions of their datachunks. Processing such metadata may incur significant load due to the operations on data which call for metadata updates. Thus, processing involves local modifications of metadata, with asynchronous multicasting of aggregated changes (only to interested DMCs) and automatic conflict resolution (see Chapter 4.1.2). The subset of DMCs which take part in synchronization of particular class 2 metadata is determined using class 1 metadata. Each DMC uses a persistent class 2 metadata store with a memory cache for better performance (see Chapter 4.2.3).

Class 3 metadata is used by the DMC to supervise the state of the environment and reconfigure system elements if needed, e.g., change connection parameters to optimize network usage. Class 3 metadata includes FClient session state, so it is intensively used when many FClients are connected. To decrease system load, it is neither shared between DMCs nor protected from hardware failures.

FClients heavily rely on class 2 and 3 metadata, e.g., the attributes of files presented to the user or the handles to opened logical files. For performance reasons, all required metadata is cached in FClient memory. For class 2 metadata, operations are performed in the cache and then asynchronously propagated to the DMC metadata store. If DMC changes the metadata or receives a change from an FClient, it dispatches asynchronous notifications to all FClients that cache the affected metadata.

For class 3a (private runtime metadata – see Chapters 3.3.3 and 3.3.4) no synchronization is needed. This class includes information required operationally by each entity (DMC and FClient); hence each entity creates its own class 3a metadata. However, the corresponding metadata structures are often created as a result of synchronous calls, e.g., local representations of sessions. Class 3b (shared runtime metadata – see Chapters 3.3.3 and 3.3.4) updates are gathered and preprocessed by FClients. They describe the activities of each FClient. Gathered changes are sent to DMC. By using class 3b metadata gathered from many FClients, along with its own knowledge/configuration, DMC is capable of controlling the environment. Such control is performed through updates of class 3c metadata (public runtime metadata – see Chapters 3.3.3 and 3.3.4) that describe various parameters relevant for FClients, e.g. storage system parameters.
Thus, changes in class 3c metadata are broadcast asynchronously to FClients to update their configurations. Figure 4.9 shows a sample flow of metadata between stores and caches triggered by handling user actions in the FClient (e.g., accessing logical file metadata) as well as by automatic, periodic actions (e.g., checking environment state). Again, the key assumption involves local processing of frequently used metadata to avoid overheads connected with remote metadata updates. The metadata is processed as soon as possible, especially on the FClient machine. Changes are aggregated and applied in batches. Any conflicts are resolved automatically, using an algorithm which ensures eventual global consistency.
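To make the store/cache distinction more tangible, the sketch below shows a write-back metadata cache in which updates are applied locally and pushed to a (stubbed) remote store asynchronously. The module, table and function names are assumptions; in the actual system the asynchronous propagation is additionally aggregated and batched, as described in the next section.

    -module(metadata_cache_sketch).
    -export([new/0, update/3, lookup/2]).

    %% Create an in-memory cache for metadata copies.
    new() ->
        ets:new(metadata_cache, [set, public]).

    %% Apply the update locally and propagate it to the metadata store asynchronously.
    update(Cache, Key, Value) ->
        ets:insert(Cache, {Key, Value}),
        spawn(fun() -> persist_in_store(Key, Value) end),
        ok.

    %% Serve reads from the cache; a miss would be resolved by the remote store.
    lookup(Cache, Key) ->
        case ets:lookup(Cache, Key) of
            [{Key, Value}] -> {ok, Value};
            []             -> miss
        end.

    persist_in_store(_Key, _Value) ->
        ok.  % stand-in for sending the update to the (remote) metadata store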

4.1.2 Handling Metadata Updates

Updates of class 2 and 3b metadata are crucial for the system. Due to the fact that strong consistency of this metadata is not required, updates may be processed asynchronously with some delay. Asynchronous updates are performed using events. Two important aspects of event processing are aggregation of events and triggering of event dispatches. Since the FClient is able to produce many similar events in a short period of time, events are buffered and then sent (by the FClient) or processed (by DMC) in groups. Due to the delay in sending and processing events, multiple similar events can be merged into a single event (Chapter 4.1.3 discusses the consequences of such delays). A typical example of aggregation involves events that describe access time updates for a logical file. They can be aggregated into one event which only includes the latest timestamp. Another example is aggregation of many events that contain descriptions of datachunk changes, producing a single event which contains a full list of the changes. Since aggregation may result in excessively large events or cause unacceptable caching delays, sets of sending and processing triggers are introduced (see Chapter 4.1.3). Metadata updates are handled as follows:

• for class 2 and 3b metadata:

1. update of metadata by the FClient in its cache generates an event,
2. the event is preprocessed on the FClient side and aggregated with a similar event, if present in the buffer,
3. sending triggers are evaluated by the FClient,
4. preprocessed events are sent to DMC if a trigger fires,
5. DMC aggregates the event with a similar event, if one has been received from another FClient,
6. processing triggers are evaluated by DMC,
7. if a processing trigger fires, the aggregated events are processed, resulting in metadata changes,
8. if some FClients are interested in the changed metadata, information describing the relevant changes is sent to interested FClients (and only to them),

• for class 2 metadata:

9. DMC chooses other DMCs that should be notified of changes in the given metadata,
10. information about the changes is aggregated over a period of time – if there is more than one update to a single metadata item, only the most recent version is used,
11. the changes are propagated to the selected DMCs,
12. other DMCs apply the changes using an automatic conflict resolution algorithm if needed (see Figure 4.10),
13. if some FClients of DMCs which received the changes are interested in the changed metadata, information about the relevant changes (including the outcome of conflict resolution) is sent to interested FClients (and only to them).
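Aggregation of similar events (step 10 above, and the access-time and datachunk-change examples given earlier) could look as follows; the event tuple shapes are assumptions made for illustration only.

    -module(event_aggregation_sketch).
    -export([aggregate/2]).

    %% Aggregate two similar events into one. Only events of the same type that
    %% concern the same logical file are aggregated; other events stay buffered
    %% separately.
    aggregate({access_time, File, T1}, {access_time, File, T2}) ->
        %% Access-time updates: keep only the latest timestamp.
        {access_time, File, max(T1, T2)};
    aggregate({chunk_changes, File, Changes1}, {chunk_changes, File, Changes2}) ->
        %% Datachunk change descriptions: produce one event with the full list.
        {chunk_changes, File, Changes1 ++ Changes2}.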

Figure 4.9: Sample flow between metadata stores and metadata caches

Figure 4.10: Conflict resolution pseudocode

Table 4.2: Aggregation of events and changes

Aggregation | Aggregation goal
FClient messages by FClient | Reduce DMC load connected with client message handling
FClient messages by DMC | Limit number of actions triggered by events
Synchronization messages by DMC | Optimize network utilization

As described, the update algorithm involves three instances of aggregation (see Table 4.2). Before processing by DMC, events are aggregated on the FClient side to reduce DMC load. As processing of an event may call for reads/writes/updates of metadata, events from various FClients are aggregated to decrease the number of metadata operations. Additional aggregation is performed during propagation of changes to other DMCs in order to optimize network usage.

The conflict resolution mechanism (see Fig. 4.10) is based on comparison of revision numbers and hashes. Each change to a metadata document results in incrementation of its revision number and generation of a new hash. The algorithm chooses documents with greater revision numbers, or hashes if the numbers are equal. Custom conflict resolvers may be used for some metadata documents, e.g., for metadata documents that describe access times the latest timestamps are always chosen regardless of revision numbers and hashes.

Although delayed processing of events reduces DMC load, it also causes delayed propagation of metadata updates and temporary inconsistency of system state (see Chapter 4.1.3). In the adopted approach this is assumed to be acceptable since otherwise system-wide transactional metadata updates would be very slow due to network delays between DMCs. Moreover, preprocessing and aggregation settings may be tuned to strike a balance between the duration of an inconsistent state and performance considerations, while locks enable permanent consistency (at the cost of reduced performance).
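A minimal sketch of the described resolution rule (the greater revision number wins, hashes break ties) is given below; the map layout with rev and hash keys is an assumption, and custom resolvers such as the access-time one are omitted.

    -module(conflict_resolution_sketch).
    -export([resolve/2]).

    %% Choose the winning version of a metadata document.
    resolve(#{rev := R1} = Doc1, #{rev := R2}) when R1 > R2 -> Doc1;
    resolve(#{rev := R1}, #{rev := R2} = Doc2) when R2 > R1 -> Doc2;
    resolve(#{hash := H1} = Doc1, #{hash := H2} = Doc2) ->
        %% Equal revisions: pick the document with the greater hash deterministically.
        case H1 >= H2 of
            true  -> Doc1;
            false -> Doc2
        end.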

4.1.3 Propagation Delay for Metadata Changes and its Consequences

Transmission and processing of metadata can be triggered using various criteria, depending on the nature of the given event. Some events may require immediate processing while others permit delays. Two typical triggers for events that can be delayed are the time and size triggers. While the size trigger is used to prevent buffer overflows, the time trigger influences the duration of temporary system state inconsistencies.
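The interplay of the two triggers can be sketched as follows; the threshold values and the buffer representation are assumptions chosen only to illustrate the mechanism.

    -module(trigger_sketch).
    -export([maybe_flush/3]).

    -define(MAX_EVENTS, 100).     % size trigger: prevents buffer overflow
    -define(MAX_DELAY_MS, 3000).  % time trigger: bounds the inconsistency window

    %% Decide whether the buffered events should be dispatched now. Buffer is a
    %% list of events (newest first); timestamps are given in milliseconds.
    maybe_flush(Buffer, LastFlushMs, NowMs)
      when length(Buffer) >= ?MAX_EVENTS; NowMs - LastFlushMs >= ?MAX_DELAY_MS ->
        {flush, lists:reverse(Buffer)};
    maybe_flush(Buffer, LastFlushMs, _NowMs) ->
        {keep, Buffer, LastFlushMs}.

Increasing the time threshold trades a longer window of temporary inconsistency for fewer, larger dispatches, which is exactly the trade-off discussed below.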

Higher event processing delays translate into lower DMC load connected with event processing due to more aggressive caching and preprocessing. However, higher delays also result in longer system state inconsistencies and therefore a greater risk of metadata change conflicts. While some conflicts can be resolved without any negative consequences (e.g., for metadata that describes access times), others may cause problems for the user. One example is renaming a file through different FClients connected to different DMCs. If two scripts work with the same file and decide to rename it in parallel, only one metadata (name) change can be acknowledged during the conflict resolving process. As a result, one of the scripts may crash because it will not find the file under its expected new name. Another example involves the use of different replicas of the same data by such FClients. When one FClient modifies the data, the unmodified replica must be invalidated. Invalidation is triggered when appropriate metadata which describes a modification appears. If the second FClient performs a read operation before its replica is invalidated, it will receive outdated data. Thus, the length of metadata change propagation delays influences the likelihood of operating on outdated data.

The above-mentioned problem is common for most tools that support distributed data processing. Conflicts can be overcome using locking or other synchronization mechanisms. However, such mechanisms may increase overhead and result in suspension of some threads/processes. Thus, the author has opted not to use any synchronization mechanism, except metadata keys used for the system's security, in order to ensure better utilization of resources. Still, aggregation timeouts have been configured in such a way that consistency is provided within a few seconds. This is enough to noticeably reduce DMC load while showing changes to all users connected to the system within an acceptable time. If the user requires stronger synchronization mechanisms, they can be implemented at a higher level than the data access system, e.g., in a service which uses the data access system or directly in the user's script.

While a greater delay in processing events enables better aggregation and reduces overhead, it also entails a longer period of system state inconsistency. The optimal balance between overhead and duration of inconsistency (directly associated with event processing delays) depends on the use case. Thus, the system allows parameterization of metadata sending and processing triggers using default values that reduce DMC load, delaying metadata synchronization only for a few seconds. One can argue whether such temporary inconsistencies affecting data views on different sites prevent data access from being called transparent in terms of place of access. However, complete elimination of inconsistencies results not only in performance problems but also reduces availability (if the network between sites fails, a choice must be made between consistency and availability – see CAP [3]). Thus, the author postulates that data access transparency should allow for temporary differences in data views. Determining the period of allowed inconsistency is up to the user.

4.2 Architecture of Data Management Component

As previously mentioned, this thesis focuses on the functionality, architecture and implementation of the Data Management Component, regarded as a crucial element of the proposed approach to ensuring transparent data access in federated computational infrastructures. In this chapter two important elements of the DMC architecture are presented along with their respective deployment details:

• DMC Core,

• a set of DMC modules activated by DMC Core.

Figure 4.11: Deployment of DMC Core

The overall architecture of DMC consists of cooperating modules (see Chapter 4.2.2) deployed within a highly scalable and available cluster solution represented by DMC Core (see Chapter 4.2.1). The main focus of DMC Core is to be able to process large numbers of requests, including large-scale metadata changes.

4.2.1 DMC Core

The DMC Core architecture follows the master-slave paradigm for parallel processing of user requests (see Figure 4.11). Each slave supervises basic constituent processes of DMC, including the Node Manager, the Request Dispatcher (described later on in this chapter) and optional Module Host processes. The latter are used by DMC modules to provide end-user functionality (described further below; see also Chapter 4.2.2), while high scalability and availability is achieved by deploying DMC Core as a cluster solution (the DMC cluster).

The Erlang language (OTP 20.0 release) was selected for implementation of high-performance request processing. Its dedicated execution environment – the Erlang Virtual Machine – provides very lightweight processes compared to standard operating systems. Erlang supports two types of applications referred to as Erlang Applications (EA) and Erlang Distributed Applications (EDA). The former is executed inside a single Erlang Virtual Machine, while the latter spans several Erlang Virtual Machines that may be run on several physical hosts.

In order to ensure persistence of metadata, the author decided to use the Couchbase NoSQL database. This decision was prompted by the database's scalability and features useful in the context of metadata synchronization between DMCs. The db sync module (see Chapter 4.2.2) uses Couchbase views to track metadata changes. Persistence Nodes communicate with processes which copy metadata from the Erlang Virtual Machine memory to Couchbase. DMC operates on a dedicated set of DMC cluster nodes which host Couchbase along with two types of applications:

1. Erlang Worker Application (EWA), which hosts DMC modules (see Chapter 4.2.2). It is a standard Erlang Application responsible for provisioning DMC functionality. It may be replicated to many nodes.

2. Erlang Management Distributed Application (EMDA), which manages EWA instances. It is an Erlang Distributed Application (EDA) responsible for implementing non-functional requirements, i.e. high availability.

Although a single standard DMC cluster node is sufficiently powerful to run DMC Core, using several nodes ensures high availability and supports handling of a greater number of concurrent requests. In this case, EWA instances are deployed on each cluster node while EMDA is deployed on a chosen subset of nodes. EMDA is initialized on all selected nodes; however, only one node initiates execution of application code. If that node fails, one of the remaining live nodes automatically takes over execution. EMDA persists key information in a database, so it is able to retrieve its state in the event of a node failure. This mechanism effectively prevents introduction of a single point of failure. EWA instances work as independent applications supervised by EMDA. EWA and EMDA do not share an Erlang Virtual Machine – if both are hosted on the same cluster node, two separate Erlang Virtual Machines must be launched.

Scaling and High Availability (HA) of DMC is achieved thanks to the features of DMC Core. Although the cluster nodes which host EWA and EMDA can potentially be implemented as virtual machines or Docker containers [70], the use of physical machines makes HA attainable by providing hardware fault tolerance. In addition, physical machines facilitate scalability by increasing the number of EWA machines that process requests (see Chapter 4.2.3 for further information about load balancing in a multi-node environment). Four groups of DMC cluster nodes are identified:

• A Master Node which hosts both EWA and EMDA. This is the node on which EMDA operates.

• Reserve Master Nodes which host both EWA and EMDA. On each of these nodes EMDA remains on standby, able to react to Master Node failures.

• Slave Nodes which host only EWA,

• Persistence Nodes which host only Couchbase.

When the Master Node fails, one of the Reserve Master Nodes becomes the new Master Node. Each Reserve Master Node maintains a prioritized list of all Reserve Master Nodes, enabling selection of a new node to replace the failed Master Node. To provide the required functionality and to meet the associated non-functional requirements, the following EWA and EMDA elements have been developed (see Figure 4.11):

• Central Manager – coordinates all nodes which comprise DMC.

• Node Manager – monitors the state of the node where the application is deployed and provides this information to the Central Manager.

• Supervisors – monitor execution of code and restart processes following failures. They can also restart entire sets of connected processes if a given process fails. Supervisors are linked with one another in a treelike structure. The Application Supervisor is the root of this tree.

• Request Dispatcher – responsible for forwarding requests to the appropriate Module Host.

• Module Host – executes the code of the selected module (see Chapter 4.2.2).

EWA Module Hosts implement the features of the respective modules. Each Module Host processes requests concurrently, optionally creating a permanent master process to improve supervision, if needed (see Chapter 4.2.3). It can parallelize execution of certain tasks by starting multiple slave processes, each of which can also be supervised on demand. Additional Module Hosts are hosted by each EWA to manage the metadata store and handle DNS requests (see Chapter 4.2.3). Module Hosts that provide metadata storage make use of EWA memory as well as the database deployed on Persistence Nodes (see Chapter 4.2.3).

The nodes and EWA instances are independent. They cooperate to balance load. All information required for such cooperation is provided by the Central Manager, which – owing to EDA properties – does not constitute a single point of failure. The set of nodes may change dynamically through the use of Node Managers. Each Node Manager periodically sends a heartbeat to the Central Manager so the Central Manager is able to discover new EWA instances. The Central Manager also monitors each EWA Erlang Virtual Machine to detect potential failures, e.g. due to the failure of a node. Node Managers constantly monitor the load of their nodes (processor and memory usage, I/O load) and forward this information to the Central Manager, which, in turn, prepares advice for the load balancing algorithm implemented by the Request Dispatcher and the DNS module. The main assumptions of the load balancing algorithm are that DNS is responsible for balancing the load generated by multiple clients while the Request Dispatcher only reacts to short load fluctuations (see [154]). Thus, the Node Manager implements the functionality of the Monitoring and Management Concern of the MACAS model, while the Request Dispatcher, along with the DNS module and the Central Manager, take part in implementing the Routing and Performance Layer.

The Supervisor is an Erlang element that monitors the Erlang processes or other supervisors and restarts them in case of failures. The Application Supervisor is the top-level supervisor (it is not monitored by any other supervisor). While the Central Manager supervises all Erlang nodes, the tree structure of Supervisors is used to control the processes executed on each node. Each process that requires monitoring (failures of some processes can be ignored) is supervised by the Application Supervisor, either directly or indirectly through other Supervisors that reside at lower levels of the supervision tree. Since a Supervisor can restart processes in case of failure, the Central Manager is not involved in the recovery process.

Figure 4.12: DMC modules

However, if a process does not resume normal operation after several restarts, the Supervisor ceases its attempts and instead propagates an error message up the supervision tree. In this situation, a higher-level Supervisor can restart the child Supervisor. This may result in a restart of the entire supervised sub-tree. Cooperating processes are usually linked to a common Supervisor, so restarting the sub-tree may clear whatever problem is encountered by the failed processes. If such restarts prove ineffective, the failure is propagated to the root of the supervision tree (to the Application Supervisor) and the whole application fails. The Central Manager then recognizes a node failure and prevents clients and other nodes from sending requests to the failed node. After the node is repaired, it can be rediscovered due to the heartbeat mechanism.

The proposed design of DMC makes it scalable and resistant to failures. The Central Manager is able to capture any changes and respond to them. All nodes are able to work independently in case of network failures since they have their own Application Supervisors. Once the network is repaired, they reconnect to each other and the Central Manager is able to reconfigure the node again, if required.
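The supervision structure described above corresponds to standard Erlang/OTP supervisors. The following sketch shows one node-local branch of such a tree; the child modules, restart strategy and intensity values are assumptions, not the actual configuration.

    -module(module_host_sup_sketch).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        SupFlags = #{strategy => one_for_one,  % restart only the failed child
                     intensity => 5,           % give up after 5 restarts ...
                     period => 10},            % ... within 10 seconds
        Children = [#{id => request_dispatcher,
                      start => {request_dispatcher, start_link, []}},
                    #{id => node_manager,
                      start => {node_manager, start_link, []}}],
        {ok, {SupFlags, Children}}.

When the restart intensity is exceeded, the supervisor itself terminates and the failure propagates up the tree, eventually reaching the Application Supervisor as described above.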

4.2.2 DMC Modules

DMC consists of a set of cooperating modules that implement the features of MACAS layers and concerns. The DMC executes modules to provide end-user functionality via Module Hosts activated by Slave Nodes (cf. Figure 4.11). The modules themselves are depicted in Figure 4.12. The connections depicted in the figure indicate modules that cooperate closely. Two additional Module Hosts are started to handle DNS requests (see Chapter 4.2.1) and metadata storage (datastore – see Chapter 4.2.3). They are not described here since they effectively constitute elements of the core system rather than typical modules.

The most important module is fslogic. When the user accesses a file, the fslogic module manages the metadata that describes the distribution of datachunks. If the data access results in metadata changes that require synchronization between sites, the db sync module broadcasts the changes and resolves conflicts, ensuring eventual consistency of metadata, and therefore eventual consistency of data views across all sites. Metadata synchronization does not result in automatic replication of data. When data stored on resources supervised by another DMC instance is needed, a multi-channel data connection is opened. This feature is provided by the rtransfer and gateway modules.

Table 4.3: Implementation of MACAS layers and concerns by DMC modules

Module | Layer(s) and/or Concern(s) implemented by module
fslogic | Executive Layer, Cooperation Layer, Routing and Quality Layer
remote | Communication Layer
session | Communication Layer, Security Concern
rtransfer and gateway | Communication Layer
event manager | Routing and Quality Layer; different actions can be connected with different layers
sequencer | Communication Layer
http worker | Communication Layer, Client Layer
db sync | Cooperation Layer
cooperation manager | Cooperation Layer

The fslogic module selects the datachunks to be involved in the transfer. If datachunks which precisely match the required data range do not yet exist, fslogic creates the required datachunks on the fly by splitting existing datachunks. Subsequently, fslogic initiates rtransfer to replicate only the chosen datachunks instead of the entire file. In addition to simple transfer, rtransfer also supports parallel transfer of datachunks, depending on user needs. rtransfer is also able to dynamically change request priorities, e.g., when a datachunk is frequently requested or a new higher-priority request for the same data appears. The fslogic module assigns higher priority to transfers requested by FClients and lower priority to other transfers, e.g., involving prefetched data.

Users can access data in various ways, using FClient, a Web-based GUI or by connecting with any tool capable of generating HTTPS requests, compatible with REST or CDMI APIs. The session module is responsible for handling incoming connections using ssl. Thereafter, synchronous requests are forwarded to fslogic, remote file manager and http worker. While fslogic handles requests concerning logical files, http worker handles requests from Web-based GUIs and remote file manager mediates between FClient and the storage system during I/O operations when FClient is not directly connected to the required storage system.

Asynchronous requests are handled by the event manager module, which executes actions triggered by incoming messages. Such actions may involve simple metadata updates or perform complex operations such as data replication between storage systems supervised by different DMCs. The task of the event manager module is to trigger actions – carrying out these actions is often delegated to other modules, e.g., fslogic. The event manager module also enables event trigger management, as well as aggregation of events, whenever necessary (see Chapter 4.1.2). When some incoming requests have to be processed in order, the sequencer module is used to ensure that requests are not intermixed. The cooperation manager is responsible for access to class 1 metadata. It redirects class 1 metadata modification requests to the site chosen for managing class 1 metadata.

The above-mentioned modules cover the functionality of MACAS layers and concerns (outlined in Figure 3.6) according to Table 4.3. The fslogic module provides Executive Layer functionality while db sync handles synchronization. The fslogic and event manager modules, together with DMC Core (see Chapter 4.2.1), take part in routing requests between sites (fslogic) and routing for aggregation purposes (event manager). The fslogic module also takes part in ensuring data security. Together with the session module which provides user identification, it controls access permissions.

4.2.3 Request Handling and Load Balancing

Requests are divided into five groups depending on their resource usage:

• light synchronous, e.g., FClient metadata requests,

• light asynchronous, e.g., metadata updates through events by FClient,

• small transfers, e.g., read/write operations generated by FClient when DMC is proxying requests to the storage systems,

• large transfers, e.g., large data downloads via Web GUI or prefetching large datasets from another site for computations,

• computationally intensive, e.g., filtering datasets on the basis of user-defined metadata.

The usage of computational resources and network bandwidth by light synchronous and light asynchronous requests is low. Light synchronous requests are processed with no delays while light asynchronous requests allow delayed processing. However, as these requests may cause metadata updates, the associated delay should be relatively short. Small and large transfers compete for network resources – this calls for proper management to avoid blocking of small transfers for FClients as a result of prefetching large datasets. Finally, computationally demanding requests consume significant processing power and/or memory. The system provides the following three features to handle different types of requests efficiently:

1. load balancing depending on load type [154],

2. four different request handling modes,

3. two different metadata access modes.

(1) Load balancing must be invoked before a request reaches the I/O interface of the DMC node. Although it is possible to implement load balancing on the FClient side, for clients other than FClient, e.g., generic tools for issuing HTTP requests, the only means of control involves appropriate preparation of DNS server replies (implementation of client-side load balancing would require modifying such tools). However, the DNS reply caching time is significantly longer than the interval at which resource utilization changes. As a result, if a single node is flooded by requests, it may take a lot of time to rebalance load within DNS (clients change their behaviour only after the DNS reply cache is invalidated). Thus, load balancing follows a two-level approach – the DNS level (executed as a Module Host – see Chapter 4.2.1) routes requests to the appropriate node, while the Request Dispatcher (see Chapter 4.2.1) is used for internal rerouting of requests in response to rapid load fluctuations [154].

(2) Different types of requests also call for different processing modes (see Figure 4.13). Thus, the Module Host that handles requests may operate in the following modes:

1. direct processing,

2. processing by a dedicated process,

3. supervision by a single master,

4. use of process pool.
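The modes are described in detail below. As a rough illustration, the following sketch corresponds to the second mode: the connection process spawns a monitored slave process, can react to its failure and can abort stalled processing. The timeout value and message shapes are assumptions made for illustration only.

    -module(slave_mode_sketch).
    -export([handle_in_slave/2]).

    %% Run HandlerFun(Request) in a monitored slave process so that the connection
    %% process can detect crashes and abort stalled processing.
    handle_in_slave(HandlerFun, Request) ->
        Parent = self(),
        {Pid, MonitorRef} =
            spawn_monitor(fun() -> Parent ! {self(), HandlerFun(Request)} end),
        receive
            {Pid, Result} ->
                erlang:demonitor(MonitorRef, [flush]),  % success: stop watching the slave
                {ok, Result};
            {'DOWN', MonitorRef, process, Pid, Reason} ->
                {error, Reason}                         % slave crashed before replying
        after 5000 ->
            exit(Pid, kill),                            % processing stalled: kill the slave
            {error, timeout}
        end.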

Requests appear in DMC by way of sockets. Each socket is connected to a process, which receives requests from the socket and is able to send responses through the socket. We will refer to this process as the connection process. The use of direct processing (mode 1) results in client requests being handled by the connection process. This method provides very low overhead but also limits the supervision capability. If processing stalls, no information can be returned to the client. Moreover, this mode blocks all other requests coming through the given connection. Thus, it is used to process connection control requests, e.g., handshake requests.

Better control of request processing is obtained by using mode 2. In this case, a slave process is started and the main process can supervise request processing and react to any failures by killing the slave process and returning an error. Moreover, this method enables parallelization of request processing. On the other hand, spawning additional processes generates overhead. Mode 2 is used to handle most FClient requests, especially large numbers of asynchronous messages, i.e., events associated with metadata updates.

Mode 3 assumes starting new processes through a single master for a particular request type. These processes are bound to the master process. The master process is supervised by the Supervisor to ensure reliability. It generates additional overhead, especially when many similar requests appear, but also offers great management possibilities for the master process, including killing/restarting all related processes if any of them fail. The master process can be considered a bottleneck in this processing mode. Mode 3 is useful for processing requests that concern data transfers between DMCs. Request processing may fail due to network errors or other types of DMC failures.

The final mode (mode 4) assumes the use of a process pool for load balancing. It enables optimal load control and is especially useful when request processing imposes a high load on any resource. However, similarly to modes 2 and 3, it is expected to generate overhead – in this case, resulting from management of the process pool. The pool is usually dedicated to processing an entire dataset (e.g., moving data from one storage system to another), accelerating request processing while preventing system overloads.

Another feature of the system is support for different metadata access modes. Despite the fact that some existing databases provide in-memory caching, database access still generates access delays. To minimize this effect, a Module Host (see Chapter 4.2.1) – datastore – is responsible for creation of metadata stores in the memory of the Erlang Virtual Machine, enabling instant metadata access. For metadata which requires persistence, the datastore triggers asynchronous write processes in the database. If memory usage increases, a portion of the metadata is cleared from memory and can be reloaded if needed. Storing metadata in the memory of the Erlang Virtual Machine becomes more complicated when DMC is deployed on many nodes. In such scenarios it is crucial to process requests on the nodes that store the metadata required for a particular request. However, replicating metadata between nodes results in higher memory usage and overhead. Thus, the datastore handles access requests in two ways:
1. The first option is the transactional mode. It is access-time oriented. Metadata is replicated to the memory of each node. Modifications are handled using Erlang Mnesia database transactions [84]. Following each modification, an Erlang process is started to persist that modification in the database, if needed. In spite of reduced data access delay, this approach can result in the database being overloaded by asynchronous processes that save changes. Additionally, metadata is replicated to the memory of all nodes and therefore transactions that modify metadata consume resources on all nodes. As a result, increasing the number of nodes only improves read throughput, while write throughput may deteriorate. To increase metadata storage space, it is necessary to increase the memory capacity of each node.

2. The second option is the key manager mode. It requires assigning a master node to each metadata key. Metadata is then kept in the memory of the assigned node, so read operations may be slower than in the former algorithm due to delegation of read requests to the appropriate nodes. However, more metadata may be cached when many nodes are used. All updates of metadata connected to a single key are managed by a single process that also manages asynchronous propagation of metadata changes to the persistent database. The process is able to aggregate requests and process them in batches. It also propagates aggregated changes to the database to reduce database load. It executes modification operations without any transactional context (resulting in lower resource consumption) as it only modifies the metadata connected with a particular key. Another advantage of the presented mechanism is that it simplifies load control.

Figure 4.13: Request flow – different modes with different features
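A simplified sketch of the key manager mode (option 2 above) is given below: a single process per metadata key serializes updates and flushes them to the persistent database in batches. The batch size, the timeout, the message shapes and the stubbed persist/2 function are illustrative assumptions.

    -module(key_manager_sketch).
    -export([start/1, update/2]).

    %% One process per metadata key applies updates in order and batches the
    %% asynchronous writes to the persistent database.
    start(Key) ->
        spawn(fun() -> loop(Key, undefined, 0) end).

    update(KeyManagerPid, Value) ->
        KeyManagerPid ! {update, Value},
        ok.

    loop(Key, Value, Pending) when Pending >= 10 ->
        persist(Key, Value),                 % size-based flush of the latest value
        loop(Key, Value, 0);
    loop(Key, Value, Pending) ->
        receive
            {update, NewValue} ->
                loop(Key, NewValue, Pending + 1)
        after 1000 ->
            case Pending of
                0 -> loop(Key, Value, 0);
                _ -> persist(Key, Value),    % time-based flush
                     loop(Key, Value, 0)
            end
        end.

    persist(_Key, _Value) ->
        ok.  % stand-in for the asynchronous database write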

The implemented system uses the latter metadata caching algorithm for most metadata, caching all keys connected with a given logical file at the same node to enable effective handling of requests concerning that file. The former mechanism is used for metadata that describes users and their permissions with regard to whole datasets, because this metadata is read frequently and rarely changes. The quantity of this metadata is also rather low compared to other metadata types. The former metadata access mechanism ensures low delays while the latter offers better throughput and scalability.

Each method of request processing has its pros and cons. However, reliable and efficient operation of the system hinges upon appropriate cooperation of all presented mechanisms. Appropriate routing of client requests to nodes that cache the required metadata minimizes the metadata access time because no delegation of metadata access requests is needed.

4.3 Summary

This chapter presented the architecture and implementation of two main components: the Data Management Component (DMC) and the FUSE client (FClient). The auxiliary Handler component was also discussed. Since one of the design assumptions is local metadata processing, metadata stores and caches were introduced, along with an efficient metadata update mechanism that is crucial for the system. The designed solution utilizes different mechanisms used by multiprocessor computers. Depending on metadata type, caches use write-through and write-back policies to prevent inconsistency or provide higher efficiency. Push and pull methods are also combined to put/update metadata in selected caches in order to decrease metadata access delay and overhead. Although the aforementioned methods and policies are not, in themselves, novel, it has not been a trivial task to adopt and combine them into a single system that provides the expected efficiency, reliability and data access transparency. The description of the DMC module focused on two aspects:

• DMC Core – a highly scalable and available cluster solution implemented in Erlang,

• a set of DMC modules activated by DMC Core.

The presented elements and mechanisms enable processing of large numbers of requests, including modification of metadata at a massive scale. Thus, they are crucial to the deployment of MACAS in any environment.

5 Experimental Evaluation

Evaluation of large systems is a complex process. It involves validating whether the specification captures stakeholders' needs, as well as ensuring that the software meets the specification. In the case of this thesis, feedback from stakeholders was gathered in the course of participating in several research projects (see Chapter 1.4). To verify the created software, functional and non-functional tests were developed. Testing included creation of unit, integration, acceptance, performance, scalability and stress tests. Code coverage was analyzed to check whether all system elements were exercised.

This chapter only describes those tests that relate to verification of the thesis or evaluation of key elements implemented by the author. More specifically, the first section describes evaluation of DMC Core – a key element of the thesis. As the main goals of DMC Core are efficient and reliable metadata access and storage, performance and integration tests of DMC Core are also presented. The described DMC performance tests have verified non-functional requirements concerning the overhead of the load-balancing subsystem, request handling and metadata access times. The DMC integration tests are white-box tests with mock-ups set up to verify its reliability. Afterwards, in order to fully verify the model, DMC Core is linked with other system elements in order to make it operational for testing purposes. The chapter contains a description of performance, scalability and acceptance black-box tests. Performance and scalability tests verify the non-functional requirement stating that the system implementing MACAS must maintain quality of access (see Chapter 1.2) while acceptance tests verify whether the system provides functionality required by stakeholders as a result of context awareness and usage of datachunks.

Tests were executed using several storage systems. They utilized physical machines, virtual machines and Docker containers depending on the features of the particular test environment. While physical machines are recommended for production deployments, the use of virtualization was necessary for certain tests due to the author's limited permissions and/or the need to execute tests using more nodes than could be provided by a particular environment. When dealing with sufficiently powerful physical or virtual machines, Docker containers were applied to simulate multiple nodes (as a lightweight alternative to standard VMs). However, if the given environment supported only small physical or virtual machines, Docker containers were not used – instead, each machine hosted a single component of the system. Virtual machines used during the tests were deployed on dedicated servers separated from other load for reproducibility of results.

Figure 5.1: Normalized throughput with similar load on all DMC cluster nodes

5.1 DMC Core Tests

DMC Core was tested to verify its three basic features: routing (including load balancing for DMC cluster nodes), different modes of handling and processing requests and different modes of accessing metadata, as outlined in Figure 4.13. The reliability of DMC Core was also verified. The types of nodes used (e.g., DMC cluster Slave Node) are defined in Chapter 4.2.1 and depicted in Figure 4.11. Although Erlang provides functions that can measure time in nanoseconds, it does not guarantee accurate timing in the nanosecond range. Hence, the microsecond resolution has been used.

5.1.1 Evaluation of Request Routing and Processing

Each request can either be handled by the DMC cluster Slave Node that received it, or routed to another node to balance the processing load (see Chapter 4.2.3). The first performance test was originally presented in [154]. The author of the thesis designed the test and analyzed the results. The test measured the system’s throughput in two scenarios:

• similar load on all DMC cluster nodes,

• DMC cluster nodes divided into two groups with different load.

Results were compared with the throughput of a system which handles each request immediately upon receipt, without any analysis or rerouting (referred to as a system with its load-balancing subsystem deactivated). This reference case was selected because it does not introduce any overhead. However, it is also unable to balance load fluctuations. Throughput was compared in an environment which included a DMC cluster Master Node, a DMC cluster Persistence Node and 1-12 DMC cluster Slave Nodes responsible for handling requests. Each node was hosted by a virtual machine with enabled network emulation. As a result, increasing the number of nodes also increased the available computing power.

In the first case, regardless of the number of nodes, throughput remained similar with and without the load-balancing subsystem activated (see Figure 5.1). In the second case, throughput with the load-balancing subsystem activated was over 70% higher than with the load-balancing subsystem deactivated (see Figure 5.2).

Figure 5.2: Normalized throughput with DMC cluster nodes divided into two groups with different load

This demonstrates that the load-balancing subsystem does not introduce any significant overhead while improving throughput in case of sharp load fluctuations.

In order to test request processing, the methods outlined in Chapter 4.2.3 were used. The performance of these methods was tested in an environment consisting of a DMC cluster Master Node, a DMC cluster Persistence Node and two DMC cluster Slave Nodes running inside dedicated Docker containers, sharing 6 cores and 8 GB of RAM. The test included measurements of three values with 100 runs each:

• request handling time at the node which received the request (local handling),

• request handling time at a node other than the one which received the request (remote handling),

• request handling time under load (50000 requests).

The measured request handling time was defined as the time of execution of a testing handle function that verifies the request content and responds with either a confirmation or an error message. Depending on the processing method, the test handle function may also include delegation of request processing to another process (slave process, master or pool process); the time of delegation is then included in the execution time. In order to measure the average request handling time for a load of 50000 requests, 50000 Erlang processes were started on DMC cluster Slave Nodes. Each process generated a request, executed the testing handle function, measured its execution time and finally terminated. Results are summarized in Table 5.1.

Table 5.1 reveals that the direct processing mode (1), where each request is processed by the process which handles the corresponding client's connection, introduces very low overhead. Processing by a separate (i.e. slave) process (mode (2)) generates greater overhead. As long as the system is not loaded, the overhead involved in starting local processes remains low. When the request is processed by a remote node, additional time is needed for communication between nodes. Such delegation is performed only when the local node is highly loaded. Thus, process initialization overhead increases when multiple requests are processed in parallel, but remains acceptable. Mode (3) involves launching new processes through a single master. This also generates overhead, especially when many similar requests arrive in short order. In the final mode, (4), which assumes the use of a process pool for load balancing, overhead is also significant – specifically, lower than the overhead of mode (3) when the system is highly loaded (12740 µs vs. 72930 µs), but higher in other cases.

Table 5.1: Request handling time and characteristics of request processing modes

No. | Processing mode | Local handling time [µs] | Remote handling time [µs] | Handling time when processing 50000 requests [µs] | Characteristics
1 | direct processing | <1 | − | <1 | Low overhead
2 | processing in slave process | 6.8 ± 1.0 | 206 ± 8 | 475 ± 19 | Single request control, best parallelization of processing
3 | supervision by single master | 21.1 ± 3.1 | 303 ± 12 | 72930 ± 1640 | Supervision over class of requests
4 | process pool | 56.3 ± 5.8 | 512 ± 20 | 12740 ± 494 | Load control

The tests reveal that the overhead of all modes remains acceptable when these modes are applied in accordance with their intended purpose (see Chapter 4.2.3). The actual overhead introduced by each mode depends on the additional functionality available in that mode. While the overhead is higher when handling requests remotely, such delegation decreases the load of the node delegating the request and may yield benefits when load is not balanced across nodes. The overhead of processing modes (3) and (4) increases substantially when many concurrent requests utilize a single master or a single pool. However, processing modes (3) and (4) are designed to handle only a limited number of heavyweight requests at any given time.
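The four modes are implemented with Erlang processes inside DMC Core; the following Python sketch is only a loose analogue (threads instead of Erlang processes) meant to illustrate why handling a lightweight request directly is cheaper per request than funnelling every request through a single master or a shared pool. None of the names below correspond to the actual implementation.

import time
from concurrent.futures import ThreadPoolExecutor

def handle(request):
    # Trivial stand-in for request verification.
    return "ok"

def per_request_us(run, n=2000):
    start = time.perf_counter()
    run(n)
    return (time.perf_counter() - start) / n * 1e6

def direct(n):
    # Analogue of mode (1): the receiving context handles the request itself.
    for i in range(n):
        handle(i)

def single_master(n):
    # Analogue of mode (3): one worker serializes all requests.
    with ThreadPoolExecutor(max_workers=1) as master:
        list(master.map(handle, range(n)))

def pool(n, workers=4):
    # Analogue of mode (4): requests go through a fixed worker pool.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(handle, range(n)))

if __name__ == "__main__":
    print(f"direct:        {per_request_us(direct):8.2f} us/request")
    print(f"single master: {per_request_us(single_master):8.2f} us/request")
    print(f"pool:          {per_request_us(pool):8.2f} us/request")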

5.1.2 Metadata Access Evaluation

The performance of metadata access was tested in environments consisting of a single DMC cluster Master Node, several DMC cluster Slave Nodes and several DMC cluster Persistence Nodes. The first environment included one DMC cluster Slave Node for request processing and one DMC cluster Persistence Node for metadata storage. The second environment included three DMC cluster Slave Nodes for request processing and three DMC cluster Persistence Nodes for database hosting. The third and fourth environments included 10 and 14 nodes respectively – half of them for request processing and the rest for database hosting. All nodes were encapsulated by Docker containers sharing 6 cores and 8 GB of RAM. DMC cluster Slave Nodes and DMC cluster Persistence Nodes were able to utilize multiple cores, hence DMCs had access to similar computing power in all environments. Tests were performed for four configurations representing various combinations of metadata storage/update and persistence modes (cf. Table 5.2). In line with Chapter 4.2.3 two metadata storage/update options were used:

1. key manager mode – storing/caching metadata in the memory of the selected node, execution of metadata writes through a dedicated process for each metadata key,

2. transactional mode – storing/caching metadata copies in the memory of all nodes, execution of metadata writes in multi-node transactions.

Table 5.2: Test configurations for metadata access

Configuration | Metadata storage/update mode | Metadata persistence mode
1 | Key manager mode | Memory only
2 | Key manager mode | Persistent database with memory cache
3 | Transactional mode | Memory only
4 | Transactional mode | Persistent database with memory cache

Table 5.3: Average metadata access times for different configurations and computing environments

Configuration | 1 Slave Node: Put / Get [µs] | 3 Slave Nodes: Put / Get [µs] | 5 Slave Nodes: Put / Get [µs] | 7 Slave Nodes: Put / Get [µs]
1 | 70.2 ± 4.2 / 6.1 ± 0.9 | 286 ± 8 / 183 ± 7 | 353 ± 10 / 250 ± 8 | 400 ± 12 / 310 ± 9
2 | 99.1 ± 4.9 / 7.8 ± 1.1 | 323 ± 9 / 215 ± 7 | 387 ± 10 / 255 ± 8 | 476 ± 14 / 312 ± 9
3 | 88.4 ± 5.1 / 3.3 ± 0.5 | 966 ± 21 / 4.0 ± 0.5 | 1527 ± 29 / 4.2 ± 0.6 | 2172 ± 39 / 5.5 ± 0.8
4 | 133 ± 6 / 5.9 ± 0.9 | 1393 ± 30 / 7.2 ± 1.0 | 2188 ± 38 / 9.1 ± 1.3 | 3123 ± 61 / 16.4 ± 1.7

This was coupled with two metadata persistence modes, i.e., in-memory storage and database storage with a memory cache. For each configuration, tests were executed 1000 times in each of the available computing environments. Each test run included metadata put and get operations. Each operation accessed a metadata document with the following contents: record name, two binary values, two integer values, a boolean value, two short lists of binaries and a nested 4-field record. Table 5.3 shows the average metadata operation time.

In the single DMC cluster Slave Node environment there is no significant difference between the two modes – the former offers slightly faster put operations, while in the latter get operations are slightly faster. Both modes introduce overhead when using a database. This is due to additional operations needed to asynchronously save the document in the database, or to execute a fallback query if the required metadata is not found in memory. The test reveals overhead associated with increases in the number of nodes, given that all nodes share physical resources. While adding database nodes (DMC cluster Persistence Nodes) consumes resources and negatively affects access time, it can be shown that the tested modes offer varying characteristics. The key manager mode introduces overhead which is associated with delegating requests to dedicated processes (depending on the metadata key). Since the test did not involve any usage of the routing algorithm for optimal selection of a Slave Node, overhead increases slowly along with the number of nodes, due to the increasing probability that an access operation will be handled by a remote node.

The transactional mode (i.e. configurations 3 and 4) introduces high overhead for put operations but supports rapid get operations. The latter feature is available because each node hosts a copy of the metadata. However, transactional updates of such copies result in substantial increases in put times corresponding to the number of nodes in the system.

The test reveals important differences in the characteristics of each mode. In addition to different access times, the greater memory consumption of the transactional mode should be emphasized. Table 5.4 shows the number of memory slots (each slot is used to store a single metadata document in the memory of a single node) occupied at the end of the test.

Table 5.4: Number of memory slots occupied at the end of the test for different configurations and computing environments

Configuration | 1 Slave Node | 3 Slave Nodes | 5 Slave Nodes | 7 Slave Nodes
1 | 1000 | 1000 | 1000 | 1000
2 | 1000 | 1000 | 1000 | 1000
3 | 1000 | 3000 | 5000 | 7000
4 | 1000 | 3000 | 5000 | 7000

While memory consumption associated with the key manager mode (i.e. configurations 1 and 2) depends on the number of stored documents, in the transactional mode (i.e. configurations 3 and 4) it further depends on the total number of nodes. Thus, while the key manager mode supports adding nodes to increase the pool of memory available for metadata storage and caching, the transactional mode instead calls for the memory of each node to be extended if more metadata is to be stored or cached. Although the key manager mode is more scalable, the transactional mode can be useful for optimization of access to small amounts of frequently read metadata. The test also shows that the metadata access time can be kept low when the appropriate (according to needs, see Chapter 4.2.3) storage/update mode is used. In this case the system can scale (in terms of available memory slots) not only by expanding the memory of each node but also by increasing the number of nodes. Thus, similarly to the request processing tests, the metadata access tests prove that the tested modes provide a sound basis for the whole system. Moreover, to store metadata efficiently, an appropriate mechanism for routing requests is required (see Chapter 4.2.3).
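As an illustration of the put/get operations that were timed, the sketch below builds a document of the same shape as the one used in the test and measures accesses against a plain in-memory dictionary; the field names and the dictionary store are illustrative assumptions, not the DMC Core data model.

import time
from dataclasses import dataclass
from typing import List

@dataclass
class Nested:
    # Hypothetical 4-field nested record.
    a: int = 0
    b: int = 0
    c: str = ""
    d: bool = False

@dataclass
class MetadataDoc:
    # Same shape as the document used in the test: a record name, two binary
    # values, two integers, a boolean, two short lists of binaries and a
    # nested 4-field record (field names are illustrative).
    record_name: str
    bin1: bytes
    bin2: bytes
    int1: int
    int2: int
    flag: bool
    list1: List[bytes]
    list2: List[bytes]
    nested: Nested

store = {}  # stand-in for an in-memory metadata cache

def put(key, doc):
    store[key] = doc

def get(key):
    return store[key]

if __name__ == "__main__":
    doc = MetadataDoc("file_meta", b"uuid-1", b"owner-1", 1, 2, True,
                      [b"x", b"y"], [b"z"], Nested(1, 2, "loc", True))
    start = time.perf_counter_ns()
    for i in range(1000):
        put(f"key_{i}", doc)
        get(f"key_{i}")
    print(f"avg put+get: {(time.perf_counter_ns() - start) / 1000 / 1000:.2f} us")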

5.1.3 Reliability Evaluation

For evaluation of DMC reliability, a set of possible failure cases was defined and simulated by mocking appropriate modules and functions. This evaluation was based on monitoring system behavior. The test environment consisted of 13 nodes – a DMC cluster Master Node, a DMC cluster Reserve Master Node, 5 DMC cluster Slave Nodes for request processing, 5 DMC cluster Persistence Nodes for database hosting and 1 node hosting the FClient. All nodes were hosted in Docker containers sharing 6 cores and 8 GB of RAM. Results of this test are shown in Table 5.5.

Table 5.5: Results of reliability tests

No. | Simulated failure case | Observed actions performed by the system
1 | Crash of process that coordinates provisioning of particular functionality | Module is restarted by supervisor
2 | Crash of Slave Node | FClient connects to another node, load balancing algorithm is requested to exclude crashed node from Central Manager replies
3 | Crash of Master Node | Manager is started at Reserve Master Node
4 | Failure of network between Slave Node and Master Node | All sessions are terminated by Node Manager of Slave Node to prevent metadata inconsistency, other elements of the system act as if Slave Node has crashed, load balancing algorithm is requested to exclude problematic node from Central Manager replies

The test shows that when one of the specified cases occurs, the system is able to continue working without human intervention. Figure 5.3 shows log fragments that contain descriptions of the actions triggered. For case 1 the log shows that the supervisor noticed the crash of the Erlang process and was forced to react. In case 2 the log shows that the crash of the Slave Node was noticed by the manager of DMC and FClient was subsequently forced to connect to another Slave Node. The case 3 log shows that the DMC master was restarted on the Reserve Master Node following failure. Logs for case 4 show the reaction of the DMC manager, which is similar to case 2, along with a confirmation of network problems resulting in termination of all active sessions. The test also reveals two basic mechanisms used in case of failure – detecting process crashes and detecting node/network crashes. The following elements are involved in monitoring activities during tests:

• Supervisor – for discovering crashed processes and restarting them (case 1).

• Node Manager of Master Node – for discovering crashes of all nodes (case 2) or network failures on particular nodes (case 4), and sending appropriate recommendations to Request Dispatchers and the DNS module.

• Node Manager of Slave Node – for discovering network failures (case 4) and halting activities on the affected node.

• Erlang node monitor (i.e. Erlang mechanism) – for discovering crashes of the entire appli- cation (case 3) and starting reserve instances.

When a failure is discovered, two other elements may also become involved:

• Request Dispatcher – for ensuring that internal requests are routed to healthy nodes only,

• DNS module – for issuing recommendations for clients, suggesting that they route their requests to healthy nodes only.

As follows from the presented results, DMC Core provides the functionality needed to build a reliable system.
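In the actual system these mechanisms are provided by Erlang supervisors and the Erlang node monitor; the Python sketch below only illustrates the general supervision pattern observed in the tests: restart a crashed worker and, if it cannot be kept alive, exclude its node from the set recommended to clients. All names and timings are illustrative assumptions.

import multiprocessing as mp
import time

def worker(name):
    # Stand-in for a module that coordinates provisioning of some
    # functionality; it exits abnormally after a moment to trigger
    # the supervision logic below.
    time.sleep(0.2)
    raise SystemExit(1)

def supervise(name, healthy_nodes, checks=3):
    proc = mp.Process(target=worker, args=(name,))
    proc.start()
    for _ in range(checks):
        time.sleep(0.5)
        if not proc.is_alive():
            print(f"{name}: worker crashed, restarting it")
            proc = mp.Process(target=worker, args=(name,))
            proc.start()
    time.sleep(0.5)
    if not proc.is_alive():
        # A node that cannot keep its services running would be excluded
        # from routing recommendations (Request Dispatcher / DNS module).
        healthy_nodes.discard(name)
    else:
        proc.terminate()
    proc.join()

if __name__ == "__main__":
    nodes = {"slave1", "slave2"}
    supervise("slave1", nodes)
    print("nodes recommended to clients:", sorted(nodes))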

5.2 Performance Evaluation of Integrated System

Performance evaluation tests measure the overhead and scalability in a single-DMC environment. Specifically, measurements concern the throughput of data access by a single process and the aggregated throughput of data access performed by multiple processes. The overhead in a multi-DMC environment is also measured. The tests assume that the system is a black box where many factors (including elements not implemented by the author) may influence performance. However, good performance of the algorithms designed and implemented by the author is a necessary condition for achieving good results in the tests described in this chapter. Thus, the described tests are an important aspect of the verification of this thesis.

Figure 5.3: Fragments of logs from reliability tests

5.2.1 Overhead Evaluation

The aim of these tests was to compare the observed system overhead to other typical systems and settings. The test measured the throughput of a single process accessing data through a system which implements the MACAS model (using FClient) and the throughput of direct access to the storage system. To better simulate possible use cases, the tests utilized different storage systems along with various scripts/tools, and were deployed in different hardware configurations (see Table 5.6). The storage systems included cluster storage and object storage. The test environment varied between a single virtual machine and a large Grid storage system (GPFS). The hardware included machines with SSD storage as well as extremely limited virtual machines with 1 virtual CPU and 0.5 GB RAM. Tools used in the test included a popular open-source benchmark (SysBench [78]), a programming API and dd – a command-line utility commonly used by administrators to verify filesystem performance. All test environments included FClients with direct access to the storage system. Additionally, tests involving the EVS storage system and the Python file access API (see Table 5.6) included measurements of throughput when the FClient is forced to proxy all storage system operations through DMC. Each test was executed 20 times.

Table 5.6: Description of overhead tests

Storage system: S3
  Storage system scale: Small (local SSD storage)
  Hardware and environment: Single 8 vCPU 8 GB RAM virtual machine that hosts 3 Docker containers with DMC and its DB, FClient and S3 Proxy [97] that utilize local SSD storage
  Used tool: cp, AWS CLI [64]
  Test settings: Single thread, copy/access whole file/object, 1 GB read/written
  Test aim: Verify ability to adapt the system to particular storage type

Storage system: NFS
  Storage system scale: Small (single virtual machine for data storage)
  Hardware and environment: Four virtual machines, 1 vCPU 0.5 GB RAM each, for DMC, its DB, FClient and NFS Server that utilize local low-performance storage
  Used tool: SysBench
  Test settings: 1-16 active threads that perform read/write operations, 10 GB read/written
  Test aim: Validate the system in a small resource-constrained environment

Storage system: EVS [73]
  Storage system scale: Medium (storage system used by single cluster)
  Hardware and environment: Two virtual machines, 4 vCPU 8 GB RAM each (one for DMC with its DB and one for FClient) with directly accessible EVS storage
  Used tool: Python file access API
  Test settings: 5 threads, operations on 10 MB block, 1 GB read/written
  Test aim: Validate the system in a cluster environment accessed by a programming API

Storage system: GPFS
  Storage system scale: Large (system used by nodes of the Zeus [110] supercomputer)
  Hardware and environment: Grid worker node with one core reserved for FClient; 4 CPU 16 GB RAM machine for DMC with its DB; worker node directly connected to GPFS, shared with other worker nodes that execute jobs
  Used tool: dd with different block sizes
  Test settings: Work with 4-640 kB blocks, 150 GB read/written
  Test aim: Validate the system in a typical Grid environment

Table 5.7: Throughput of a system implementing the MACAS model in comparison with direct access

Test description | Measured value | Direct access [MB/s] | MACAS-compliant system [MB/s]
S3 / cp, AWS CLI | Write | 123 ± 5 | 266 ± 10
S3 / cp, AWS CLI | Read | 284 ± 9 | 610 ± 15
NFS / SysBench | Write | 14.1 ± 0.1 | 14.0 ± 0.1
NFS / SysBench | Read | 11.3 ± 0.1 | 11.2 ± 0.1
EVS / Python, direct access | Write | 413 ± 10 | 415 ± 9
EVS / Python, direct access | Read | 368 ± 8 | 398 ± 8
EVS / Python, proxy via DMC | Write | − | 100 ± 6
EVS / Python, proxy via DMC | Read | − | 71 ± 5
GPFS / dd | Write | 100 ± 1 | 101 ± 2
GPFS / dd | Read | 96.1 ± 2.2 | 97.4 ± 2.8

Table 5.7 presents the average throughput. No additional overhead of the system could be observed. Although the system consumes resources when processing metadata, data access throughput is similar to or higher than in the direct-access scenario, even in environments with low computing power and memory. The throughput is much higher when using S3 storage due to splitting and parallelization of access requests as well as prefetching.

The throughput of the network connecting the virtual machines which host FClient and DMC (measured using the scp command) in the EVS storage environment (see Table 5.6) was measured at 115 MB/s. The throughput of data access when FClient was forced to proxy operations on the storage system via DMC was 71 MB/s for read and 100 MB/s for write requests. Clearly, the need to access data via DMC limits the throughput (115 MB/s was not reached), and the test shows that read operations are more significantly affected. Regardless of the environment type, the system is able to maintain access quality in terms of performance. Achieving high throughput calls for processing large amounts of metadata. Consequently, the following section is dedicated to system scalability tests.
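As an illustration of how such a comparison can be scripted, the sketch below measures sequential write and read throughput for a file placed on a directly accessible storage path and on a path exposed through an FClient mountpoint. Both paths, the block size and the data volume are assumptions; the actual tests used the tools listed in Table 5.6, and caches should be dropped (or the volume made large enough) for a fair read comparison.

import os
import time

BLOCK = 4 * 1024 * 1024            # 4 MB write/read block
TOTAL = 256 * 1024 * 1024          # 256 MB per run (1 GB was used in the tests)

def write_read_mbs(path):
    # Sequential write followed by a sequential read; returns (write, read) MB/s.
    data = os.urandom(BLOCK)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(TOTAL // BLOCK):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
    write_mbs = TOTAL / (1024 * 1024) / (time.perf_counter() - start)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass
    read_mbs = TOTAL / (1024 * 1024) / (time.perf_counter() - start)
    os.remove(path)
    return write_mbs, read_mbs

if __name__ == "__main__":
    # Hypothetical locations: direct storage access vs. access via FClient.
    targets = [("direct", "/mnt/storage/overhead_test.bin"),
               ("via FClient", "/mnt/fclient/space1/overhead_test.bin")]
    for label, path in targets:
        w, r = write_read_mbs(path)
        print(f"{label:12s} write {w:7.1f} MB/s   read {r:7.1f} MB/s")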

5.2.2 Evaluation of Scalability and System Limits

A dedicated server was used to simulate access to a very fast storage system. Its quad-channel memory configuration (4x8 GB DDR4 3200 MHz CL16 RAM) enabled very fast reads from the storage cache. The CPU was an Intel Core i7-5960X @ 4.5 GHz (overclocked, 8 cores, 16 threads, 20 MB L3 cache). During the test, system elements were deployed in separate Docker containers. In accordance with the test scenario, multiple tasks performed single sequential reads of 10 GB of data using the dd command-line utility. The generated load produced monitoring events that gave detailed insight into the FClients' activity. The total aggregated throughput was measured using two methods (see Figure 5.4): (1) by summing the throughput reported by the dd command-line utility, and (2) by dividing the total size of data by the overall read time, measured from the moment of initiating the first read operation with the dd command-line utility until all operations had completed. The first method overestimates throughput (since not all processes are started at exactly the same time), while the second method underestimates it (since it factors in startup and post-test cleanup). Additionally, the htop tool [79] was used to check which system components consume computing power. The test was performed in two variants, using the storage memory cache and a null handler – a special storage access handler which generates data on the fly to minimize access time.

The overall throughput depended on the number of FClients, and the system achieved its highest throughput for 16 FClients. According to the htop tool, over 70% of the computing power used by the system was consumed by FClients. Each FClient performs most of its work in a single thread; thus running a higher number of FClient instances results in greater aggregated computing power for event and metadata processing. If the number of FClients is greater than the number of CPU threads, the overall throughput declines slowly due to CPU access contention. The test confirmed the scalability of the system (limited by the number of independent threads in the experiment) and, indirectly, the good quality of DMC Core. The resources available on an 8-core machine allow supervision of data access with a total throughput of around 20 GB/s. Due to the use of DMC Core, which can run on a cluster, it is easy to increase the computing power available to the management component. Moreover, the test revealed that most computing power is consumed by FClients, and in typical production deployments these FClients would be running on dedicated machines. Thus, deploying DMC on a single machine would typically result in supervision of data access with a higher total throughput than during the reported test.

The test also exposed limitations of the system's implementation. The FClient performs most of its activities in a single thread, which limits data access throughput. Thus, the throughput of a single FClient can be perceived as a bottleneck in an environment with very fast storage.
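The two aggregation methods can be summarized with a small calculation; the sketch below uses illustrative timing tuples rather than measured data and shows why method (1) gives an upper bound and method (2) a lower bound.

def aggregate_throughput(runs):
    # runs: list of (start_s, end_s, bytes_read) tuples, one per dd-like reader.
    gib = 1024 ** 3
    # Method 1: sum of per-process throughputs; overestimates, because the
    # processes do not all run over the same wall-clock interval.
    summed = sum(size / (end - start) for start, end, size in runs) / gib
    # Method 2: total data divided by wall-clock time from the first start to
    # the last finish; underestimates, since startup and cleanup are included.
    wall = max(end for _, end, _ in runs) - min(start for start, _, _ in runs)
    overall = sum(size for _, _, size in runs) / wall / gib
    return summed, overall

if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3   # each reader reads 10 GB, as in the test
    sample = [(0.0, 5.0, ten_gib), (0.3, 5.6, ten_gib), (0.1, 5.2, ten_gib)]
    upper, lower = aggregate_throughput(sample)
    print(f"sum of per-process rates: {upper:.2f} GB/s (upper bound)")
    print(f"total data / wall time:   {lower:.2f} GB/s (lower bound)")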

Figure 5.4: Total aggregated throughput and CPU usage

Table 5.8: Total aggregated throughput and number of operations per second

Number of FClients | Throughput [GB/s] | Operations per second [10^6/s]
1 | 1.3 | 0.33
3 | 4.1 | 1.1
5 | 6.9 | 1.8

The limits of a single FClient were also tested using a virtual machine hosted by a node of the Prometheus supercomputer [94]. The data access throughput reported by the dd command-line utility on this virtual machine was approximately 7 GB/s. Subsequently, a large volume of data was accessed through the system using 1, 3 and 5 FClients respectively. A reduction in throughput was observed when using 1 or 3 FClients. However, as the number of FClients increases, the system can easily process greater numbers of operations per second (measured internally), with throughput close to 7 GB/s (see Table 5.8). Thus, while a bottleneck is present in the system, it can be easily overcome via simultaneous processing of data using multiple FClients. Moreover, this limitation does not follow from the assumptions of the MACAS model, but is instead related to the particular implementation of the FClient and can likely be ameliorated with an optimized multithreaded implementation.

The test revealed that the system is able to process metadata related to data access via several FClients. Although it exposed certain limitations in the presented implementation, these limitations can easily be overcome through appropriate deployment. Thus, the test proved that the system is capable of scaling up.

5.2.3 Evaluation of Overhead in a Multi-DMC Environment

The tests shown in Chapter 5.2.1 proved that the system which uses a single Data Management Component is able to maintain access quality in terms of performance. The test presented in this chapter checks whether increasing the number of DMCs in the environment introduces any overhead. This test was originally presented in [148]. The author of the thesis designed the test and analyzed its results.

To evaluate the overhead, a test environment was set up with 9 virtual machines possessing public IP addresses. Each virtual machine had a 4x2.6 GHz CPU and 8 GB of RAM. The Data Management Component and FClients were run in Docker containers. The test procedure was as follows. A file was created in a dataset shared across all DMCs. Data was written by an FClient connected to the first DMC and subsequently both read and written by FClients connected to the other DMCs. Additionally, each DMC managed a local dataset (not shared with any other DMC) upon which similar writes and reads were carried out. The file size was 1 GB. Average results of 20 test runs are shown in Figure 5.5 [148]. The first read for DMCs 2-9 is presented separately since it requires data transfers between DMCs, which are limited by the network.

Figure 5.5: Data access throughput for local and shared datasets

The figure shows the throughput for accessing data stored in: (1) a local dataset, and (2) a dataset shared across multiple DMCs (referred to as the shared dataset), for which metadata synchronization between DMCs occurs. The throughput for local and shared datasets remains similar, indicating that metadata synchronization does not introduce significant overhead. The average throughput for the first reads is 50 MB/s, while the average throughput for inter-VM communication, measured using scp, is equal to 75 MB/s. This results from the behavior of FUSE, which reads data blocks sequentially, waiting for data to arrive before sending the next request. The FClient partially compensates for this by prefetching, which, by default, is tuned for accessing fragments of large files. When more aggressive prefetching is selected (which requests all file blocks at the same time), the average throughput for the first read increases to 81 MB/s.

Given that the observed throughput for local and shared data reads/writes remains similar, the test proves that the system maintains quality of access in terms of performance in a multi-site environment (single-site tests are described in Chapter 5.2.1). However, the read throughput when data is transferred between sites strongly depends on prefetching settings.
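The distinction between the first (remote) read and subsequent (local) reads can be illustrated with a simple timing loop over the FClient mountpoints; the mountpoint paths and file name below are assumptions made for the sake of the example.

import time

BLOCK = 4 * 1024 * 1024

def read_mbs(path):
    # Single sequential pass over the file; returns throughput in MB/s.
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        chunk = f.read(BLOCK)
        while chunk:
            total += len(chunk)
            chunk = f.read(BLOCK)
    return total / (1024 * 1024) / (time.perf_counter() - start)

if __name__ == "__main__":
    # The 1 GB file was written through the FClient attached to DMC 1 into the
    # shared dataset; FClients of DMCs 2-9 read it afterwards.
    for i in range(2, 10):
        path = f"/mnt/fclient{i}/shared/file_1g"
        first = read_mbs(path)    # triggers an inter-DMC data transfer
        second = read_mbs(path)   # served from the local replica
        print(f"DMC {i}: first read {first:6.1f} MB/s, repeated read {second:6.1f} MB/s")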

Figure 5.6: Changing distribution of file fragments among DMCs

Figure 5.7: Test environment for comparing management policies

5.3 Datachunk Management Evaluation

The main features of the system – elasticity and efficiency – are strictly related to data description with datachunks. While the key to achieving high data access performance is to work with local datachunks, ease of access requires that management of datachunks be automatic and not involve the user.

This chapter begins with a test of default datachunk management, involving several operations on a file. These operations are performed by 4 FClients connected to 4 DMCs (FClient A to DMC A, FClient B to DMC B, etc.). Following each operation, datachunk locations are investigated (see Figure 5.6). At the beginning of the test FClient A creates a file (1). Other DMCs synchronize the metadata of the file, but do not store its contents. Subsequently, FClient B overwrites some parts of the file (2), which results in invalidation of the fragments stored in DMC A. At this point the new contents are stored in DMC B. In (3) FClient D reads some parts of the file, which results in replication of these parts to that DMC's storage. Afterwards the corresponding FClient appends some data to the file (4) and other DMCs are informed about the new file size, but do not replicate the new fragment until necessary. Next, a read operation on the file is performed in DMC C, resulting in complete replication to this DMC's storage (5). However, a subsequent overwrite of the whole file performed in DMC B invalidates all replicas held by other DMCs (6). File fragmentation is handled automatically by the system and remains transparent to the user – all read and write operations are performed regardless of the location of the FClient and the distribution of file contents.

The next test examines whether the system is able to provide different characteristics by relying on datachunks, differences in configurations of subsystems (e.g., different event aggregation times) and different sets of actions automatically triggered when DMC receives a message (e.g., an event describing the modification of a datachunk). This set of actions, along with the corresponding configuration, will be further referred to as a policy. The test environment contains processes that produce and consume data through FClients connected to different DMCs (see Figure 5.7). The test scenario is as follows:

1. The data producer writes data to an empty file.

2. The data consumer monitors the file to check whether its size has changed.

3. The data consumer reads the beginning of the file.

4. The data consumer reads the beginning of the file a second time.

5. The data consumer reads the remaining data from the file.

The test scenario was executed four times – each time with a different policy. For a better description of the policies, assume the following abbreviations: DMC1 – the Data Management Component supervising the FClient that reads data, DMC2 – the Data Management Component supervising the FClient that writes data (see Figure 5.7). By default, no data is copied when information about datachunk modification is propagated (see Chapters 3.2 and 4.1). When DMC1 provides to an FClient a datachunk that is stored on a storage resource managed by DMC2, a copy of this datachunk is saved on a storage resource managed by DMC1. The test policies specify the following system behavior (a configuration sketch is given after the list):

1. Fast read policy – when DMC1 is informed that a datachunk has been created/updated on a storage resource managed by DMC2 (metadata describing this datachunk has been synchronized between DMCs), this datachunk is immediately requested by DMC1 and copied to a storage resource managed by DMC1. Additionally, aggregation of events is reduced to ensure faster propagation of changes. Prefetching is disabled.

2. Storage space usage reduction policy – when DMC1 provides to an FClient a datachunk that is stored on a storage resource managed by DMC2, the copy created on a storage resource managed by DMC1 is deleted immediately following use by the interested FClients. Prefetching is disabled.

3. Balanced policy – when DMC1 provides to an FClient a datachunk that is stored on a storage resource managed by DMC2, the copy saved on a storage resource managed by DMC1 is deleted after a fixed time since its last usage (when the datachunk is used several times, its deletion is postponed). Prefetching is disabled.

4. Balanced policy with aggressive prefetching – this policy is similar to policy 3. Additionally, when an FClient connected to DMC1 lists the contents of a directory, DMC1 creates temporary copies of the beginnings of all files in that directory which are not already present on a storage resource managed by DMC1. This behavior is disabled for big directories (containing more than 200 files). When the FClient reads a datachunk from a file, the following datachunks are downloaded to DMC1 storage.
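A policy of this kind boils down to a handful of knobs; the sketch below expresses the four test policies as configuration records. The field names, TTL values and aggregation times are illustrative assumptions and do not reflect the system's actual configuration format.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DatachunkPolicy:
    replicate_on_remote_change: bool  # copy a chunk as soon as its metadata arrives
    local_copy_ttl: Optional[float]   # seconds to keep a copy after last use (None = keep)
    event_aggregation_s: float        # how long FClient events are aggregated
    prefetch: str                     # "off", "sequential" or "aggressive"

POLICIES = {
    1: DatachunkPolicy(True, None, 0.1, "off"),          # fast read
    2: DatachunkPolicy(False, 0.0, 1.0, "off"),          # storage space usage reduction
    3: DatachunkPolicy(False, 300.0, 1.0, "off"),        # balanced
    4: DatachunkPolicy(False, 300.0, 1.0, "aggressive"), # balanced + aggressive prefetching
}

def on_remote_chunk_modified(policy, chunk_id):
    # Action taken by DMC1 when metadata synchronization reports that a
    # datachunk held by DMC2 has been created or updated.
    if policy.replicate_on_remote_change:
        return f"schedule immediate transfer of {chunk_id} to local storage"
    return f"record the new version of {chunk_id}; transfer it on demand"

if __name__ == "__main__":
    for number, policy in POLICIES.items():
        print(f"policy {number}: {on_remote_chunk_modified(policy, 'chunk-42')}")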

Table 5.9 summarizes the differences in system behavior observed while executing the test scenario with different policies. The aim of the test was to verify whether the system behavior may be adapted to different needs (e.g., frequent data updates vs. read-only access to existing data). Thus, instead of measuring specific values, we compare data flows.

Table 5.9: Comparison of management policies

Policy 1 (fast read):
  Processor usage: High
  Storage usage: High
  Network interface usage: High with data changes, low with multiple reads of the same data
  First data access delay: Low
  First read throughput: Limited by local storage or network throughput if data replication is not finished
  Further reads: From local copy

Policy 2 (storage space usage reduction):
  Processor usage: Low
  Storage usage: Low
  Network interface usage: Low with data changes, high with multiple reads of the same data
  First data access delay: High
  First read throughput: Limited by network throughput
  Further reads: From remote copy

Policy 3 (balanced):
  Processor usage: Low
  Storage usage: Medium – additional space temporarily consumed by accessed data
  Network interface usage: Low – only with first read of data
  First data access delay: High
  First read throughput: Limited by network throughput
  Further reads: From local copy

Policy 4 (balanced with aggressive prefetching):
  Processor usage: Low
  Storage usage: Medium – additional space temporarily consumed by accessed and prefetched data
  Network interface usage: Low for sequential reads; medium for random reads – transfers of potentially unnecessary data (policy does not involve detection of random reads to disable prefetching)
  First data access delay: Low for sequential reads; high for random reads if requested data is not prefetched
  First read throughput: Limited by local storage or network throughput if reading outpaces prefetching; limited by network throughput for random reads
  Further reads: From local copy

One can see that the policies utilize resources in different ways. Policy 1 uses more CPU power due to the reduced aggregation of events. However, the main difference is the balance between storage and network utilization. Policy 1 consumes a lot of storage and network bandwidth when the data is written. However, read operations are local and do not use significant network resources. In contrast, policy 2 has low storage requirements. It also does not intensively utilize network resources while writing data. The tradeoff is high network usage and delays when reading data produced at another site. Although the system only transfers data once when many FClients (connected to a particular DMC) read it in parallel, data must be transferred multiple times if repeatedly demanded at different points in time. Policies 3 and 4 represent a compromise between 1 and 2. They allow temporary utilization of additional storage space to make data access more local – thereby reducing network usage during multiple reads of the same data. Considering policy 3, only the first read action following modification of remote data is, in itself, remote. Policy 4 solves the problem for sequential reads through prefetching. However, this results in higher network usage associated with small random reads of different parts of a large file.

The tests described in this chapter prove that the system manages data storage in a transparent manner and is able to adapt its actions to the selected policy.

5.4 Context Awareness Evaluation

To show context awareness, an environment with two DMCs and a legacy dataset was created (see Figure 5.8). The environment features two FClients connected to DMCs (one per DMC). Additionally, a legacy dataset may be modified directly within the storage system (without involvement of the data access system). The system is configured to create replicas of read blocks. However, if a replica created in DMC1 is not used for a predefined period of time (set to 5 minutes in the experiments), it is removed. The test scenario is as follows:

1. Create two new files, f1 and f2, directly in the storage system (inside a legacy dataset). List files directly via storage system.

2. List files via FClient 1 and FClient 2.

3. Check data distribution via FClient 1 and FClient 2.

4. Read files f1 and f2 via FClient 1.

5. Check data distribution via FClient 1 and FClient 2.

6. Append 5 bytes to f1 via FClient 1.

7. Overwrite f2 directly in the storage system.

8. List files and check data distribution via FClient 1 and FClient 2.

9. Read f1 via FClient 2.

10. Read f2 via FClient 2.

11. Wait (for 10 minutes in the experiment).

12. Check data distribution via FClient 1 and FClient 2.
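The scenario can be driven by a simple script; the sketch below covers a few of the steps, using hypothetical mountpoints for the legacy storage and the two FClients and an arbitrary waiting time for metadata synchronization.

import os
import time

STORAGE = "/mnt/legacy_storage/dataset"   # direct path to the legacy dataset
FCLIENT1 = "/mnt/fclient1/legacy"         # mountpoint of FClient 1 (DMC 1)
FCLIENT2 = "/mnt/fclient2/legacy"         # mountpoint of FClient 2 (DMC 2)

if __name__ == "__main__":
    # Step 1: create f1 and f2 directly in the storage system, bypassing the
    # data access system, and list them via the storage path.
    for name in ("f1", "f2"):
        with open(os.path.join(STORAGE, name), "wb") as f:
            f.write(os.urandom(1024))
    print("storage:", sorted(os.listdir(STORAGE)))

    # Step 2: after the external change is detected and metadata is
    # synchronized, both FClients should list both files.
    time.sleep(5)
    print("FClient 1:", sorted(os.listdir(FCLIENT1)))
    print("FClient 2:", sorted(os.listdir(FCLIENT2)))

    # Step 4: read both files via FClient 1 (creates replicas on DMC 1 storage).
    for name in ("f1", "f2"):
        with open(os.path.join(FCLIENT1, name), "rb") as f:
            f.read()

    # Step 6: append 5 bytes to f1 via FClient 1.
    with open(os.path.join(FCLIENT1, "f1"), "ab") as f:
        f.write(b"12345")

    # Step 7: overwrite f2 directly in the storage system.
    with open(os.path.join(STORAGE, "f2"), "wb") as f:
        f.write(os.urandom(1024))

    # Steps 9-10: reads via FClient 2 should return the modified contents.
    for name in ("f1", "f2"):
        with open(os.path.join(FCLIENT2, name), "rb") as f:
            print(name, "size seen by FClient 2:", len(f.read()))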

Figure 5.8: Context awareness test environment

Results of selected steps are shown in Figure 5.9 and discussed below, together with information on how context awareness affects these results.

• In step 2 of the scenario both files (f1 and f2) are listed. This shows that the system is aware that the dataset might be changed by an external tool, and monitors storage for such changes. It also shows that DMC 2 is aware of the existence of DMC 1, and the DMCs synchronize the metadata that describes both files (FClient 1, connected to DMC 1, lists both files).

• Executing step 3 of the scenario reveals that all data is stored at the storage system managed by DMC 2. Since it has not been read by FClient 1, there is no need to replicate data between storage systems. This proves that the system is aware of the selected management policy (synchronization of metadata only).

• In step 4 of the scenario a read operation occurs. The transfer log shows that the system is able to detect that the data is being read sequentially and prefetches data to reduce access delays. Thus, the system is aware of the user’s activity.

• Step 5 of the scenario shows that all data is stored in two storage systems (managed by DMC 1 and DMC 2).

• In step 6 of the scenario FClient 1 writes data directly to storage without DMC involvement. This can be observed by listing extended attributes of the file mentioned in step 5 (the access type discovered during the read operation was ’direct’). Thus, FClient is aware of the environment (it has automatically detected that it has direct access to the appropriate storage) and is able to use this knowledge to reduce overhead.

• The system is not involved in executing step 7 of the scenario.

• Step 8 of the scenario shows that DMC 2 is aware of changes performed by the FClient connected to DMC 1, as well as changes made directly in the storage system. During step 6, additional bytes are added to file f1. As a result, the replication progress of f1 shown by FClient 2 is 50%, because the newly added bytes have not been replicated to storage managed by DMC 2. During step 7, file f2 is modified without changing its size (by overwriting bytes). This modification is also discovered (FClients show a change in access times).

• In steps 9 and 10 of the scenario, the modified data is returned by FClient 2, proving again that DMC 2 is aware of changes made by the FClient connected to DMC 1 as well as changes made directly in the storage system.

• Executing step 12 of the scenario reveals that the system is aware of user activities (lack of activity for over 5 minutes – see step 11), its own state (existing data replicas) and the management policy, and is able to manage data automatically based on these elements (data is removed from the storage system managed by DMC 1 – the replication progress shown by FClient 1 is 0% for both files).

These results prove that the system uses context information.

5.5 Contribution of Context Awareness to Experiments

Since context awareness was proven by the above-mentioned tests, we may again refer to results concerning system performance (Chapter 5.2), elasticity (Chapter 5.3) and reliability (Chapter 5.1.3) in connection with context awareness. The system utilizes context information concerning:

• the characteristics of the DMC environment and storage systems – to choose the best method of interacting with a particular storage system. This has proven especially profitable for the S3 storage system, where adjusting the block size used by FClient (accessing data in the storage system) and opening multiple connections enables faster data access than with the standard API (see Chapter 5.2.1).

• the state of the environment and resources. This can be observed during reliability tests when the system reacts to the failures (see Chapter 5.1.3).

• the characteristics of the FClient environment. Exemplary usage of such information may be observed during performance tests – FClient accesses data directly in the storage system as long as its host machine has direct access to this system (see Chapter 5.2.1).

• users’ and providers’ expectations regarding the data replication policy. The replication policy is selected on the basis of metadata that describes expectations (see Chapter 5.3).

• data locations and data flow. This information is used during the tests to find data which is not directly available on the resources managed by DMC. Additionally, it enables usage of a local copy of data immediately after its download by a particular DMC (see Chapter 5.2.3).

• users’ actions to verify the results of performance tests (see Chapter 5.2.2). The system gathers information about the volume of data read/written by the user. Subsequently, this information is used to validate results obtained through external tools.

Table 5.10 summarizes information about the types of context awareness revealed in tests. It should be noted that the use of context information is not limited to the presented examples (e.g., information about user actions can be used for accounting or throttling purposes). Regardless, the tests prove that the system is able to make good use of context.

5.6 Evaluation Summary

The tests have shown that DMC Core provides various modes of request processing and metadata access, differing with regard to their characteristics. They also prove that the modes provided by DMC Core can form the basis for implementations – their overhead is acceptable in the scenarios for which they have been designed. In addition, the tests have shown that the throughput of data access when using the developed system is similar to or better than the throughput of direct access to data via the storage system API. The system has been shown to operate in various environments, including a resource-constrained one. Thus, the tests prove that an appropriate combination of the processing and metadata access modes provided by DMC Core has been utilized, enabling the system to maintain the desired quality of access in terms of performance.

Although the use of virtualization during the tests may affect the measured access and handling times and throughput, it does not affect the presented conclusions. The goal of the tests was not to measure specific numerical quantities, but to evaluate overhead and identify differences in characteristics between different work modes/configurations/policies. Numerical values compared within each test were affected by virtualization in a similar manner; therefore the results of their analysis remain trustworthy.

The system's fault tolerance has also been confirmed. An identified limitation, related to the throughput of a single FClient, may be overcome by using several FClients at the same time. The tests have also confirmed the system's context awareness and ability to carry out data management in a user-transparent manner. Thus, as a whole, they have validated all key elements of this thesis: context awareness, transparency and quality.

Figure 5.9: Results of selected steps of the context awareness test

Table 5.10: Types of context awareness

Type of context awareness | Description
Characteristics of the DMC environment and storage systems | Knowledge of the environment is used to optimize work with different storage systems.
Current state of the environment | The state of the resources is monitored to react to failures and adjust system operations to changes in resource state/load.
Characteristics of the FClient environment | Data in the storage system is accessed directly by FClient whenever possible.
Expectations of the users and providers (policy) | The data management policy is changed according to metadata provided by users and providers.
Data locations and data flow | The system is aware where the data is stored. DMCs are aware of one another and able to synchronize metadata.
User actions | The system processes information about user activities for accounting and optimization purposes.
Dataset properties | The system treats datasets differently depending on their metadata, to provide appropriate storage systems and features such as monitoring of external changes.

6 Conclusions and Future Work

In this chapter the most important elements and achievements of the thesis are outlined. The research contribution is summed up and potential areas of application are listed. Possible future work is also described.

6.1 Summary

Existing data access solutions provide several interesting features. However, to the best of the author's knowledge, none of the existing services or tools is well suited to providing transparent and efficient anytime/anyplace data access in an organizationally distributed environment. Thus, in this dissertation the author investigated the thesis defined in Chapter 1.2 on page 4.

To validate the thesis, the author identified data access stakeholders and their requirements when working in federated computational infrastructures. On that basis the author designed the MACAS model, which uses context awareness to provide data in a transparent manner while maintaining quality of access. The mapping of MACAS to system components was also described. The design of MACAS connects the stakeholders' requirements with metadata as well as layers and concerns that provide specific functionality. The layers and concerns themselves cover several aspects of data access, such as interactions with diverse storage resources, users' interactions with the data access system, coordination of execution of multiple operations to utilize more than one storage system, efficient utilization of network resources, providers' cooperation and distribution of the environment.

One of the most important features of the MACAS model involves introducing metadata to describe the context needed by each layer and concern. Although the appropriate metadata enables convenient access to data and supports various useful features, storing and processing large amounts of metadata can introduce significant overhead and bottleneck the entire data access system [13; 11; 15]. Thus, an important aspect of MACAS metadata is its distribution, along with consistency and synchronization models that reduce metadata processing overhead and delays. While stronger consistency models enable implementation of various features at the cost of increased overhead, most MACAS metadata is based on weaker consistency models, utilizing BASE (Basic Availability, Soft-state, Eventual consistency) rather than ACID (Atomic, Consistent, Isolated, Durable) properties. MACAS classifies metadata according to its required consistency and access scope, providing causal consistency only for metadata crucial for security and providers' cooperation. This classification supports implementation of appropriate metadata stores, caches and synchronization protocols and, in effect, provisioning of appropriate functionality, such as an eventually consistent view on data, while maintaining scalability and low overheads.

Figure 6.1: Relation of issues connected to transparent data access

Based on the MACAS model, a system that provides transparent data access was developed in the framework of collaborative research projects. The system is based on a core element, DMC Core, implemented by the author. Tests show that the model's basic assumptions are fulfilled, since the core element manages to provide appropriate features for processing massive numbers of metadata access requests.

The main achievement of this work is the creation of the MACAS model, including the definition of metadata types, distribution, consistency and synchronization, as well as methods for processing of crucial metadata. Developing the architecture of a system that provides several metadata caches and stores to process large amounts of metadata, including public and private datachunks of variable size, is also important. The efficient implementation of the system core that delivers request processing and metadata storage methods should also be acknowledged. Figure 6.1 presents relations between the issues described in the thesis, showing how high-level concepts translate to the system implementation.

It is worth noting that it is not possible to make data access fully independent of access origin and data storage site. The capabilities of different resources, especially network connections, may result in different quality of access depending on origin. Thus, the author associates data access transparency with maintaining quality of access, which means that concealment of problems related to differing data formats, storage systems and locations should not affect the quality of access supported by the available resources. While such problems are concealed from users through the use of context represented by metadata, efficient distributed metadata processing becomes crucial. Thus, appropriate metadata distribution modelling and implementation is required.

Disaggregation of the Cooperation Layer enables MACAS to be considered a versatile model. MACAS has been tested in a federated environment without any assumptions regarding the providers' relationships. Thus, it can be used in various environments (including a global scenario with several nonfederated organizations), with a suitably adjusted Cooperation Layer. Given that the model was successfully validated, the author believes that it will improve data access operations for organizationally distributed environments and, as a consequence, allow users to focus on better intra-community data sharing and collaboration instead of overcoming difficulties related to remote data access. A system that applies the proposed ideas is continually being improved and its evolution can be observed by following its website [86] or its open source repository [87].

6.2 Research Contribution

The work described within this thesis contributes to computer science on several levels:

• Analytical – the author has analyzed existing federated computing environments and identified data access stakeholders and their requirements. The author has also taken part in formulation of assumptions regarding the data access system fulfilling functional and non-functional requirements of the stakeholders.

• Design – the author has designed the MACAS model and the Data Management Component Core, and co-designed key mechanisms that implement their respective features.

• Implementation – the author has implemented the DMC Core and has taken part in implementation of several other elements.

• Experimental – the author has taken part in design and implementation of tests and analysis of results. The author has designed, implemented and analyzed experiments that verify the system’s flexibility as well as experiments which focus on testing DMC Core.

As the system that implements MACAS is the result of collaborative research projects, Figure 6.2 details the author’s involvement in its creation (compare to Figure 1.4).

Figure 6.2: Author’s individual achievements (green), collaborative achievements (black) and tasks in which the author was not involved (orange)

6.3 Range of Applications

MACAS is a universal model for distributed data access. Scientific computing infrastructures are an especially important area of application, since more users will be able to utilize the provided resources thanks to streamlined data access and management. This can be instrumental in scientific discovery driven by, e.g., the fourth paradigm [18] or Big Data [26] problems.

The application of MACAS should not be limited to federated environments but rather extended further, towards globalized data stored by nonfederated organizations (NFOs). The author is personally involved in the implementation of two products for the scientific community which utilize the results of this dissertation – VeilFS [138] and Onedata [86; 143]. While VeilFS is targeted at simplification of data access in federated infrastructures, Onedata simplifies data access in globalized environments consisting of multiple NFOs [149]. The VeilFS and Onedata systems have been deployed in three production environments.

The first example is the VeilFS [138] deployment on the Zeus supercomputer [110], based on the GPFS storage solution. The Data Management Component uses a single node connected to GPFS. The worker nodes executing Grid jobs are also connected to GPFS, and the system has been integrated with the Grid scheduler to mount FClient before a Grid job starts. The Onedata system [86] has been deployed twice on the Prometheus supercomputer [94]. The former deployment utilizes a single node with dedicated storage and manages over 2 TB of data from the HBP project [24], supporting ten dedicated services which access data through FClient. Requests are divided between service instances by nginx [85]. The latter deployment also utilizes a single Prometheus node to host DMC. The storage infrastructure connected to this node is controlled by the Lustre system [83]. The data is processed by a set of computing nodes using FClient for access.

Verification of the proposed ideas not only with artificial tests (see Chapter 5) but also in production systems proves their value for the scientific community. However, MACAS can also be used by providers of commercial storage and computing power to offer a more complete product, e.g., if one provider operates a fast storage mechanism for computations while another has high-capacity replicated storage resources for archiving, they can jointly provide a computational infrastructure with automatic backup to safe storage. Similarly, small and medium-sized companies can use MACAS to offer services to individual users or business partners (i.e., B2B services) even when they do not possess sufficient resources to operate such services by themselves.

6.4 Future Work

Future work will focus on evolution of layers and concerns:

• extension of Cooperation Layer for Open Data support,

• extensions of Executive Layer and Monitoring and Management Concern for better adaptation of the system behavior to users' actions.

The author participated in two initiatives [72; 81] which aimed to create e-infrastructures that deal with open data, easy sharing and access in an organizationally distributed environment. The work within these initiatives considered large-scale environments consisting of large numbers of sites that belong to various providers. Thus, it became important to limit the number of sites involved in metadata synchronization. To determine the sites that should synchronize specific metadata, a data organization was proposed in [142] to enable sites to be rapidly arranged into temporary groups for joint support for particular research. Such sites share metadata only within the given group and, as a result, proliferation of sites does not automatically result in wider metadata synchronization. Future work will utilize this organizational paradigm for crossing federation boundaries [60; 61].

MACAS provides monitoring capabilities to ensure quality of service. However, further research on adaptation to users' actions is required. The first step involves investigation of the overhead introduced by advanced monitoring and processing of the gathered information. Client-side metadata preprocessing consumes resources, and it is therefore important to determine the influence of such preprocessing on various classes of applications. The second step is to improve the algorithms that adjust system behavior. One potential direction of development involves decentralization of management. Instead of using DMC to analyze the state of the whole environment and then reconfigure FClients, the FClients themselves can act as independent agents that build their own view of the environment [150; 128] and tune their behavior in accordance with this view. Other possible improvements include the use of deep learning [38], resulting in a synergistic approach to high-performance computing, distributed/cloud computing and artificial intelligence, all of which represent crucial concepts in modern supercomputing research (see e.g., [56; 59]).

Bibliography

[1] Katzy, B. Design and Implementation of Virtual Organisations. In Proc. of Thirty-First Hawaii International Conference on System Sciences (1998), pp. 44–48.

[2] Foster, I., and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1998.

[3] Gilbert, S., and Lynch, N. Brewer’s Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services. ACM SIGACT News 33, 2 (2002), 51–59.

[4] Foster, I. What is the Grid? A Three Point Checklist. GRIDtoday 1, 6 (2002).

[5] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. of The 19th ACM Symposium on Operating Systems Principles (2003), pp. 29–43.

[6] Azagury, A., Dreizin, V., Factor, M., Henis, E., Naor, D., Rinetzky, N., Rodeh, O., Satran, J., Tavory, A., and Yerushalmi, L. Towards an object store. In Proc. of 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (2003), pp. 165–176.

[7] Sompel, H. V. D., Nelson, M., Lagoze, C., and Warner, S. Resource Harvesting within the OAI-PMH Framework. D-Lib Magazine 10, 12 (2004).

[8] Baud, J.-P. B., Casey, J., Lemaitre, S., Nicholson, C., Smith, D., and Stewart, G. LCG Data Management: From EDG to EGEE. In Proc. of UK e-Science All Hands Meeting (2005).

[9] Mesnier, M., Ganger, G., and Riedel, E. Object-based storage: pushing more functionality into storage. IEEE Potentials 24, 2 (2005), 31–34.

[10] Thain, D., and Livny, M. Parrot: An Application Environment for Data-Intensive Computing. Scalable Computing: Practice and Experience 6, 3 (2005), 9–18.

[11] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A Scalable, High-Performance Distributed File System. In Proc. of The 7th Symposium on Operating Systems Design and Implementation (2006), pp. 307–320.

[12] Shiers, J. The Worldwide LHC Computing Grid (worldwide LCG). Computer Physics Communications 177, 1-2 (2007), 219–223.

[13] Leong, D. A new revolution in enterprise storage architecture. IEEE Potentials 28, 6 (2009), 32–33.

[14] Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Dynamic VO Establishment in Distributed Heterogeneous Business Environments. In Proc. of International Conference on Computational Science (2009), vol. 5545 of Lecture Notes in Computer Science, pp. 709–718.

[15] Dong, D., and Herbert, J. FSaaS: File System as a Service. In Proc. of IEEE 38th Annual Computer Software and Applications Conference (2009).

[16] Stantchev, V., and Schröpfer, C. Negotiating and Enforcing QoS and SLAs in Grid and Cloud Computing. In Proc. of International Conference on Grid and Pervasive Computing (2009), vol. 5529 of Lecture Notes in Computer Science, pp. 25–35.

[17] Ching-Hsien, H., Tai-Lung, C., and Kun-Ho, L. QoS based parallel file transfer for grid economics. In Proc. of International Conference on Multimedia Information Networking and Security (2009), pp. 653–657.

[18] Hey, A., Tansley, S., and Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

[19] Hünich, D., and Müller-Pfefferkorn, R. Managing large datasets with iRODS - a performance analyses. In Proc. of International Multiconference on Computer Science and Information Technology (2010), pp. 647–654.

[20] Słota, R., Nikolow, D., Polak, S., Kuta, M., Kapanowski, M., Skałkowski, K., Pogoda, M., and Kitowski, J. Prediction and Load Balancing System for Distributed Storage. Scalable Computing: Practice and Experience 11, 2 (2010).

[21] Shafer, J., Rixner, S., and Cox, A. The Hadoop distributed filesystem: Balancing portability and performance. In Proc. of IEEE International Symposium on Performance Analysis of Systems Software (2010), pp. 122–133.

[22] Muñoz, V. M., Vicente, G. A., and Kaci, M. A Decentralized Deployment Strategy and Performance Evaluation of LCG File Catalog Service. Journal of Grid Computing 9, 3 (2011), 345–354.

[23] Kawano, H. Hierarchical Storage Systems and File Formats for Web Archiving. In Proc. of 21st International Conference on Systems Engineering (2011), pp. 217–220.

[24] Markram, H., Meier, K., Lippert, T., Grillner, S., Frackowiak, R. S., Dehaene, S., Knoll, A., Sompolinsky, H., Verstreken, K., DeFelipe, J., Grant, S., Changeux, J.-P., and Saria, A. Introducing the Human Brain Project. In Proc. of The European Future Technologies Conference and Exhibition (2011), vol. 7 of Procedia Computer Science, pp. 39–42.

[25] Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., Haridas, J., Uddaraju, C., Khatri, H., Edwards, A., Bedekar, V., Mainali, S., Abbasi, R., Agarwal, A., Fahim ul Haq, M., Ikram ul Haq, M., Bhardwaj, D., Dayanand, S., Adusumilli, A., McNett, M., Sankaran, S., Manivannan, K., and Rigas, L. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proc. of Twenty-Third ACM Symposium on Operating Systems Principles (2011), pp. 143–157.

[26] Mills, S., Lucas, S., Irakliotis, L., Rappa, M., Carlson, T., and Perlowitz, B. Demystifying Big Data: A Practical Guide to Transforming the Business of Government. Technical report, 2012. http://www.ibm.com/software/data/demystifying-big-data/.

[27] International DOI Foundation, Ed. DOI Handbook. 2012.

[28] Słota, R., Nikolow, D., Kitowski, J., Król, D., and Kryza, B. FiVO/QStorMan Semantic Toolkit for Supporting Data-Intensive Applications in Distributed Environments. Computing and Informatics 31, 5 (2012), 1003–1024.

[29] Słota, R., Nikolow, D., Skałkowski, K., and Kitowski, J. Management of Data Access with Quality of Service in PL-Grid Environment. Computing and Informatics 31, 2 (2012), 463–479.

[30] Polak, S., and Słota, R. Organization of quality-oriented data access in modern distributed environments based on semantic interoperability of services and systems. Semantic Interoperability: Issues, Solutions, Challenges (2012), 131–152.

[31] Rettberg, N., and Principe, P. Paving the way to Open Access scientific scholarly information: OpenAIRE and OpenAIREplus. In Proc. of International Conference on Electronic Publishing (2012).

[32] Słota, R. Storage QoS provisioning for execution programming of data-intensive applications. Scientific Programming 20, 1 (2012), 69–80.

[33] Röblitz, T. Towards Implementing Virtual Data Infrastructures - a Case Study with iRODS. Computer Science 13, 4 (2012), 21–34.

[34] Martini, B., and Choo, R. Cloud storage forensics: ownCloud as a case study. Digital Investigation 10, 4 (2013), 287–299.

[35] Dhar, V. Data Science and Prediction. Communications of the ACM 56, 12 (2013), 64–73.

[36] Nikolow, D., Słota, R., Polak, S., Mitera, D., Pogoda, M., Winiarczyk, P., and Kitowski, J. Model of QoS Management in a Distributed Data Sharing and Archiving System. In Proc. of International Conference on Computational Science (2013), vol. 18 of Procedia Computer Science, pp. 100–109.

[37] Han, L., Huang, H., and Xie, C. Performance Analysis of NAND Flash Based Cache for Network Storage System. In Proc. of IEEE Eighth International Conference on Networking, Architecture and Storage (2013), pp. 68–75.

[38] Chen, X.-W., and Lin, X. Big Data Deep Learning: Challenges and Perspectives. IEEE Access 2 (2014), 514–525.

[39] Du, Y., Zhang, Y., Xiao, N., and Liu, F. CD-RAIS: Constrained dynamic striping in redundant array of independent SSDs. In Proc. of IEEE International Conference on Cluster Computing (2014), pp. 212–220.

[40] Gardner, R., Campana, S., Duckeck, G., Elmsheuser, J., Hanushevsky, A., Hönig, F. G., Iven, J., Legger, F., Vukotic, I., and Yang, W. Data federation strategies for ATLAS using XRootD. Journal of Physics: Conference Series 513 (2014).

[41] Hildmann, T., and Kao, O. Deploying and Extending On-Premise Cloud Storage Based on ownCloud. In Proc. of IEEE 34th International Conference on Distributed Computing Systems Workshops (2014), pp. 76–81.

[42] Memon, A. S., Jensen, J., Cernivec, A., Benedyczak, K., and Riedel, M. Federated Authentication and Credential Translation in the EUDAT Collaborative Data Infrastructure. In Proc. of IEEE/ACM 7th International Conference on Utility and Cloud Computing (2014), pp. 726–731.

[43] Chen, S., Wang, Y., and Pedram, M. A Joint Optimization Framework for Request Scheduling and Energy Storage Management in a Data Center. In Proc. of 8th International Conference on Cloud Computing (2015), pp. 163–170.

[44] Lamanna, G., Antonelli, L. A., Contreras, J. L., Knödlseder, J., Kosack, K., Neyroud, N., Aboudan, A., Arrabito, L., Barbier, C., Bastieri, D., Boisson, C., Brau-Nogue, S., Bregeon, J., Bulgarelli, A., Carosi, A., Costa, A., De Cesare, G., de los Reyes, R., Fioretti, V., Gallozzi, S., Jacquemier, J., Khelifi, B., Kocot, J., Lombardi, S., Lucarelli, F., Lyard, E., Maier, G., Massimino, P., Osborne, J. P., Perri, M., Rico, J., Sanchez, D. A., Satalecka, K., Siejkowski, H., Stolarczyk, T., Szepieniec, T., Testa, V., Walter, R., Ward, J. E., and Zoli, A. Cherenkov Telescope Array Data Management. In Proc. of The 34th International Cosmic Ray Conference (2015).

[45] Ananthakrishnan, R., Chard, K., Foster, I., and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency and Computation: Practice and Experience 27, 2 (2015), 290–305.

[46] Hildebrand, D., and Schmuck, F. B. On Making GPFS Truly General. ;login: 40, 3 (2015).

[47] Tochtermann, K., Scholz, W., and Atif, L. Science 2.0 - Mapping European Perspectives. Report on the General Stance of Organizations on European Commission's Public Consultation on Science 2.0, 2015. http://www.leibnizopen.de/suche/handle/document/115109.

[48] Pacheco, L., Halalai, R., Schiavoni, V., Pedone, F., Riviere, E., and Felber, P. GlobalFS: A Strongly Consistent Multi-site File System. IEEE Computer Society (2016), 147–156.

[49] Madera, C., and Laurent, A. The next information architecture evolution: the data lake wave. In Proc. of The 8th International ACM Conference on Management of Digital EcoSystems (2016), pp. 174–180.

[50] Robinson-Garcia, N., Mongeon, P., Jeng, W., and Costas, R. DataCite as a novel bibliometric source: Coverage, strengths and limitations. Journal of Informetrics 11, 3 (2017).

[51] Koukounidou, V. V. OpenAIRE: Supporting the H2020 Open Access Mandate. In Proc. of The 21st International Conference on Electronic Publishing (2017), pp. 56–61.

[52] Shankar, V., and Lin, R. Performance Study of Ceph Storage with Intel Cache Acceleration Software: Decoupling Hadoop MapReduce and HDFS over Ceph Storage. In Proc. of IEEE 4th International Conference on Cyber Security and Cloud Computing (2017), pp. 10–13.

[53] Dautov, R., and Distefano, S. Quantifying volume, velocity, and variety to support (Big) data-intensive application development. In Proc. of IEEE International Conference on Big Data (2017).

[54] Vangoor, B. K. R., Tarasov, V., and Zadok, E. To FUSE or Not to FUSE: Performance of User-Space File Systems. In Proc. of The 15th USENIX Conference on File and Storage Technologies (2017), pp. 59–72.

[55] Fan, Y., Wang, Y., and Ye, M. An Improved Small File Storage Strategy in Ceph File System. In Proc. of 14th International Conference on Computational Intelligence and Security (2018), pp. 488–491.

[56] Miyazaki, T., Sato, I., and Shimizu, N. Bayesian Optimization of HPC Systems for Energy Efficiency. In Proc. of International Conference on High Performance Computing (2018), vol. 10876 of Lecture Notes in Computer Science, pp. 44–62.

[57] Kelleher, J. D., and Tierney, B. Data Science. MIT Press, 2018.

[58] Bollig, E. F., Allan, G. T., Lynch, B. J., Huerta, Y. A., Mix, M., Munsell, E. A., Benson, R. M., and Swartz, B. Leveraging OpenStack and Ceph for a Controlled-Access Data Cloud. In Proc. of Practice and Experience in Advanced Research Computing Conference (2018).

[59] Sridharan, S., Vaidyanathan, K., Kalamkar, D. D., Das, D., Smorkalov, M. E., Shiryaev, M., Mudigere, D., Mellempudi, N., Avancha, S., Kaul, B., and Dubey, P. On Scale-out Deep Learning Training for Cloud and HPC. In Proc. of SysML conference (2018).

[60] The European Open Science Cloud for Research Pilot Project, last access April 2019. https://eoscpilot.eu/.

[61] The Horizon 2020 eXtreme DataCloud - XDC project, last access April 2019. http://www.extreme-datacloud.eu.

[62] ACC Cyfronet AGH, last access November 2018. http://www.cyfronet.krakow.pl/en/4421,main.html.

[63] Active File Management (AFM), last access November 2018. https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Active%20File%20Management%20(AFM).

[64] AWS CLI, last access November 2018. https://aws.amazon.com/cli/.

[65] BeeGFS, last access November 2018. http://www.beegfs.com/content/.

[66] Cloud Data Management Interface, last access November 2018. https://www.snia.org/cdmi.

[67] Command line tool and library for transferring data with URLs, last access November 2018. https://curl.haxx.se/.

[68] DataCite: helping you to find, access, and reuse research data, last access November 2018. http://datacite.org.

[69] DataNet Federation Consortium, last access November 2018. http://datafed.org/.

[70] Docker, last access November 2018. https://www.docker.com/why-docker.

[71] McCallion, J. Dropbox vs OneDrive vs Google Drive: what's the best cloud storage service of 2014?, last access November 2018. http://www.pcpro.co.uk/features/389929/dropbox-vs-onedrive-vs-google-drive-whats-the-best-cloud-storage-service-of-2014.

[72] EGI-Engage, last access November 2018. https://www.egi.eu/about/egi-engage/.

[73] EVS, last access November 2018. http://www.hwclouds.com/en-us/product/evs.html.

[74] File Systems, last access November 2018. http://www.cse.msu.edu/~cse812/fall03/Slides/OldSlides/filesys.pdf.

[75] FUSE: Filesystem in Userspace, last access November 2018. http://fuse.sourceforge.net/.

[76] GlusterFS community website, last access November 2018. http://www.gluster.org/about/.

[77] Grid File Access Library 2.0 official page, last access November 2018. https://svnweb.cern.ch/trac/lcgutil/wiki/gfal2.

[78] How to Benchmark Your System with Sysbench, last access November 2018. https://www.howtoforge.com/how-to-benchmark-your-system-cpu-file-io-mysql-with-sysbench.

[79] htop - an interactive process viewer for Unix, last access November 2018. https://hisham.hm/htop/.

[80] IBM General Parallel File System (GPFS), last access November 2018. https://www.ibm.com/support/knowledgecenter/en/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_gpfs_overview.html.

[81] Indigo DataCloud, last access November 2018. https://www.indigo-datacloud.eu/.

[82] Introduction to Xrootd N2N for Disk Caching Proxy (Xcache) utilizing RUCIO metalink, last access November 2018. https://github.com/wyang007/rucioN2N-for-Xcache/wiki/Introduction-to-Xrootd-N2N-for-Disk-Caching-Proxy-(Xcache)-utilizing-RUCIO-metalink.

[83] Lustre, last access November 2018. http://www.whamcloud.com/lustre/.

[84] Mnesia, last access November 2018. http://erlang.org/doc/man/mnesia.html.

[85] Nginx, last access November 2018. https://nginx.org/en/.

[86] Onedata - Global Data Access Solution for Science, last access November 2018. https://onedata.org/#/home.

[87] Onedata source repository, last access November 2018. https://github.com/onedata.

[88] OpenStack Object Storage ("Swift"), last access November 2018. https://wiki.openstack.org/wiki/Swift.

[89] PanFS Storage Operating System, last access November 2018. http://www.panasas.com/products/panfs.

[90] PLGrid CORE project, last access November 2018. http://dice.cyfronet.pl/projects/details/PLGrid_Core.

[91] PLGrid Plus project, last access November 2018. http://www.plgrid.pl/en#section-1t.

[92] PLGrid project, last access November 2018. http://projekt.plgrid.pl/en.

[93] Polish National Data Storage, last access November 2018. https://www.elettra.eu/Conferences/2014/BDOD/uploads/Main/Polish%20National%20Data%20Storage.pdf.

[94] Prometheus High Performance Computer, last access November 2018. http://www.cyfronet.krakow.pl/computers/15226,artykul,prometheus.html.

[95] Weil, S. A., Leung, A. W., Brandt, S. A., and Maltzahn, C. RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters, last access November 2018. http://ceph.com/papers/weil-rados-pdsw07.pdf.

[96] Rehab project, last access November 2018. http://www.icsr.agh.edu.pl/rehab/.

[97] S3 Proxy, last access November 2018. https://github.com/andrewgaul/s3proxy.

[98] Scality, last access November 2018. http://www.scality.com/products/what-is-ring/.

[99] Software Defined Storage, last access November 2018. http://www.snia.org/sites/default/files/SNIA_Software_Defined_Storage_%20White_Paper_v1.pdf.

[100] Storj, last access November 2018. http://storj.io/.

[101] Syndicate drive, last access November 2018. http://syndicatedrive.com/.

[102] Tachyon Project, last access November 2018. http://tachyon-project.org/.

[103] Braam, P. J. The Coda Distributed File System, last access November 2018. http://www.coda.cs.cmu.edu/ljpaper/lj.html.

[104] Mell, P., and Grance, T. The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, last access November 2018. http://www-07.ibm.com/servers/eserver/includes/download/what_is_grid_faq.pdf.

[105] Dippo, C., and Sundgren, B. The Role of Metadata in Statistics, last access November 2018. https://www.bls.gov/osmr/pdf/st000040.pdf.

[106] National Information Standards Organization. Understanding Metadata, last access November 2018. https://groups.niso.org/apps/group_public/download.php/17446/Understanding%20Metadata.pdf.

[107] Web Object Scaler, last access November 2018. http://www.ddn.com/products/object-storage-web-object-scaler-wos/#aboutwos.

[108] What is Grid Computing for zSeries?, last access November 2018. http://www-07.ibm.com/servers/eserver/includes/download/what_is_grid_faq.pdf.

[109] Worldwide LHC Computing Grid, last access November 2018. http://wlcg.web.cern.ch/.

[110] Zeus High Performance Computer, last access November 2018. http://www.cyfronet.krakow.pl/computers/13725,artykul,zeus.html.

Author's Bibliometric Data

Web of Science (accessed 17.03.2019): documents: 17; h-index: 4; number of citations: 36; number of citations without self-citations: 20

SCOPUS (accessed 17.03.2019): documents: 17; h-index: 5; total citations: 49; total citations without self-citations: 30 (with h = 3)

Google Scholar (accessed 01.04.2019): documents: 21; h-index: 6; number of citations: 94; i10-index: 3

DBLP (accessed 17.03.2019): documents: 22

Author's Publications

[111] Kuta, M., Wrzeszcz, M., Chrzaszcz, P., and Kitowski, J. Accuracy of Baseline and Complex Methods Applied to Morphosyntactic Tagging of Polish. In Proc. of International Conference on Computational Science (2008), vol. 5101 of Lecture Notes in Computer Science, pp. 903–912.

[112] Paszyński, M., Gurgul, P., Sieniek, M., Wrzeszcz, M., and Schaefer, R. Object oriented hp adaptive finite element method system for multiscale problems. In Proc. of 8th World Congress on Computational Mechanics; 5th European Congress on Computational Methods in Applied Sciences and Engineering (2008), p. 231.

[113] Kuta, M., Wójcik, W., Wrzeszcz, M., and Kitowski, J. Application of Stacked Methods to Part-of-Speech Tagging of Polish. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2009), vol. 6067 of Lecture Notes in Computer Science, pp. 340–349.

[114] Kuta, M., Kitowski, J., Wójcik, W., and Wrzeszcz, M. Application of Weighted Voting Taggers to Languages Described with Large Tagsets. Computing and Informatics 29, 2 (2010), 203–225.

[115] Kitowski, J., Wcisło, R., Wrzeszcz, M., Słota, R., Otfinowski, J., Probosz, K., Pisula, M., Sobczyk, A., and Regula, K. Stroke patients rehabilitation supported by remote computer system – methods and results. In Proc. of Cracow Grid Workshop (2010), pp. 226–233.

[116] Kryza, B., Wrzeszcz, M., Dutka, Ł., Kitowski, J., Contat, M., Biran, H., Bracker, H., Schneider, B., Lundin, M., Svenmarck, P., Kvassay, M., and Hluchý, L. Supporting Urban Military Operations Analysis and Planning with Modern HPC Infrastructures. In Proc. of Cracow Grid Workshop (2010).

[117] Laclavik, M., Dlugolinský, S., Seleng, M., Kvassay, M., Schneider, B., Bracker, H., Wrzeszcz, M., Kitowski, J., and Hluchý, L. Agent-Based Simulation Platform Evaluation in the Context of Human Behavior Modeling. In Proc. of International Conference on Autonomous Agents and Multiagent Systems (2011), vol. 7068 of Lecture Notes in Computer Science, pp. 396–410.

[118] Wcisło, R., Kitowski, J., and Wrzeszcz, M. Remote Rehabilitation of Stroke Patients. In Proc. of International Conference on Health Informatics (2011), pp. 500–503.

[119] Dlugolinský, S., Kvassay, M., Hluchý, L., Wrzeszcz, M., Król, D., and Kitowski, J. Using parallelization for simulation of human behaviour. In Proc. of 7th International Workshop on Grid Computing for Complex Problems (2011), pp. 258–265.

[120] Słota, R., Król, D., Skałkowski, K., Orzechowski, M., Nikolow, D., Kryza, B., Wrzeszcz, M., and Kitowski, J. A Toolkit for Storage QoS Provisioning for Data-Intensive Applications. Computer Science 13, 1 (2012), 63–73.

[121] Kvassay, M., Hluchý, L., Dlugolinský, S., Laclavik, M., Schneider, B., Bracker, H., Tavcar, A., Gams, M., Król, D., Wrzeszcz, M., and Kitowski, J. An integrated approach to mission analysis and mission rehearsal. In Proc. of Winter Simulation Conference (2012), pp. 362:1–362:2.

[122] Król, D., Kryza, B., Wrzeszcz, M., Dutka, Ł., and Kitowski, J. Elastic Infrastructure for Interactive Data Farming Experiments. In Proc. of International Conference on Computational Science (2012), pp. 206–215.

[123] Kryza, B., Król, D., Wrzeszcz, M., Dutka, Ł., and Kitowski, J. Interactive Cloud Data Farming Environment for Military Mission Planning Support. Computer Science 13, 3 (2012), 89–100.

[124] Wrzeszcz, M., and Kitowski, J. Mobile Social Networks for Live Meetings. Computer Science 13, 4 (2012), 87–100.

[125] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., and Kitowski, J. Scalarm: Massively Self-scalable Platform for Data Farming. In Proc. of Cracow Grid Workshop (2012), pp. 53–54.

[126] Wrzeszcz, M., and Kitowski, J. Social Multi-agent Simulation Framework. In Proc. of The Third International Workshop on Infrastructures And Tools For Multiagent Systems (2012), pp. 163–176.

[127] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., and Kitowski, J. Using Grid Storage in Virtualized Computational Environments. In Proc. of Cracow Grid Workshop (2012), pp. 55–56.

[128] Wrzeszcz, M., and Kitowski, J. Creation of Agent’s Vision of Social Network Through Episodic Memory. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2013), vol. 8385 of Lecture Notes in Computer Science, pp. 741–750.

[129] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., and Kitowski, J. Massively Scalable Platform for Data Farming Supporting Heterogeneous Infrastructure. In Proc. of The Fourth International Conference on Cloud Computing, GRIDs, and Virtualization (2013), pp. 144–149.

[130] Dutka, Ł., Słota, R., Wrzeszcz, M., Król, D., Opioła, Ł., and Kitowski, J. On Super Easy Access to your Data in PL-Grid Infrastructure. In Proc. of Cracow Grid Workshop (2013), pp. 47–48.

[131] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Scalarm: Scalable Platform for Data Farming. In Proc. of Sixth ACC Cyfronet AGH Users' Conference KUKDM (2013), pp. 47–48.

[132] Słota, R., Dutka, Ł., Wrzeszcz, M., Kryza, B., Nikolow, D., Król, D., and Kitowski, J. Storage Management Systems for Organizationally Distributed Environments PLGrid PLUS Case Study. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2013), vol. 8384 of Lecture Notes in Computer Science, pp. 724–733.

[133] Wrzeszcz, M. Sztuczna inteligencja i systemy agentowe w grach komputerowych: czyli jak stworzyć inteligentną grę w domu (Artificial intelligence and agent systems in computer games: how to create an intelligent game at home). Wydawnictwo Bezkresy Wiedzy, 2013.

[134] Wrzeszcz, M., Trzepla, K., Mazur, M., Opioła, Ł., Dutka, Ł., Słota, R., and Kitowski, J. Accounting and monitoring in distributed storage services. In Proc. of Cracow Grid Workshop (2014), pp. 29–30.

[135] Dutka, Ł., Lichoń, T., Słota, R., Zemek, K., Wrzeszcz, M., Słota, R., and Kitowski, J. Globalization of data access for computing infrastructures. In Proc. of Cracow Grid Workshop (2014), pp. 69–70.

[136] Dutka, Ł., Słota, R., Wrzeszcz, M., Król, D., Opioła, Ł., Słota, R., and Kitowski, J. Harnessing Organizationally Distributed Data with VeilFS. In Proc. of Seventh ACC Cyfronet AGH Users' Conference KUKDM (2014), pp. 79–80.

[137] Dutka, Ł., Słota, R., Wrzeszcz, M., Król, D., and Kitowski, J. Uniform and Efficient Access to Data in Organizationally Distributed Environments. In PL-Grid, vol. 8500 of Lecture Notes in Computer Science. Springer, 2014, pp. 178–194.

[138] Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. VeilFS – A New Face of Storage as a Service. In Proc. of eChallenges Conference (2014).

[139] Kvassay, M., Hluchý, L., Dlugolinský, S., Schneider, B., Bracker, H., Tavcar, A., Gams, M., Contat, M., Dutka, Ł., Król, D., Wrzeszcz, M., and Kitowski, J. A Novel Way of Using Simulations to Support Urban Security Operations. Computing and Informatics 34, 6 (2015), 1201–1233.

[140] Zemek, K., Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Delegation of authority in a distributed data access system. In Proc. of CGW Workshop (2015), pp. 97–98.

[141] Słota, R., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Efficient storing of metadata for distributed data management. In Proc. of CGW Workshop (2015), pp. 111–112.

[142] Wrzeszcz, M., Trzepla, K., Słota, R., Zemek, K., Lichoń, T., Opioła, Ł., Nikolow, D., Dutka, Ł., Słota, R., and Kitowski, J. Metadata Organization and Management for Globalization of Data Access with Onedata. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2015), vol. 9573 of Lecture Notes in Computer Science, pp. 312–321.

[143] Dutka, Ł., Wrzeszcz, M., Lichoń, T., Słota, R., Zemek, K., Trzepla, K., Opioła, Ł., Słota, R., and Kitowski, J. Onedata - a Step Forward towards Globalization of Data Access for Computing Infrastructures. In Proc. of International Conference on Computational Science (2015), vol. 51 of Procedia Computer Science, pp. 2843–2847.

[144] Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Two-level load balancing for onedata. In Proc. of Eighth ACC Cyfronet AGH Users' Conference KUKDM (2015), pp. 107–108.

[145] Wrzeszcz, M., Otfinowski, J., Słota, R., and Kitowski, J. Computer Aided Distributed Post-Stroke Rehabilitation Environment. Computer Science 17, 1 (2016), 3–22.

[146] Żmuda, M., Kryza, B., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Implementation of open data at global scale in low-trust environment. In Proc. of Ninth ACC Cyfronet AGH HPC Users' Conference KUKDM (2016), pp. 53–54.

[147] Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Concept of decentralized access control for open network of autonomous data providers. In Proc. of CGW Workshop (2017), pp. 43–44.

[148] Wrzeszcz, M., Nikolow, D., Lichoń, T., Słota, R., Dutka, Ł., Słota, R., and Kitowski, J. Consistency Models for Global Scalable Data Access Services. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2017), vol. 10777 of Lecture Notes in Computer Science, pp. 471–480.

[149] Wrzeszcz, M., Opioła, Ł., Zemek, K., Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Effective and Scalable Data Access Control in Onedata Large Scale Distributed Virtual File System. In Proc. of International Conference on Computational Science (2017), vol. 108 of Procedia Computer Science, pp. 445–454.

[150] Wrzeszcz, M., Koźlak, J., and Kitowski, J. Modelling Agents Cooperation Through Internal Visions of Social Network and Episodic Memory. Computing and Informatics 36, 1 (2017), 86–112.

[151] Kudzia, J., Słota, R., Lichoń, T., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Using Onedata for global sharing and processing of legacy large data sets. In Proc. of Tenth ACC Cyfronet AGH Users' Conference KUKDM (2017), pp. 49–50.

[152] Dutka, Ł., Kryza, B., Orzechowski, M., Opioła, Ł., and Wrzeszcz, M. Onedata virtual filesystem for hybrid clouds. In Proc. of Workshop on Cloud Storage Synchronization and Sharing Services (2018), pp. 10–11.

[153] Wrzeszcz, M., Słota, R., and Kitowski, J. Towards transparent data access with context awareness. Computer Science 19, 2 (2018), 201–221.

[154] Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Two-Layer Load Balancing for Onedata System. Computing and Informatics 37, 1 (2018), 1–22.

[155] Wrzeszcz, M., Opioła, Ł., Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Harmonizing Sequential and Random Access to Datasets in Organizationally Distributed Environments. In Proc. of International Conference on Computational Science (2019), accepted.