AGH University of Science and Technology
Faculty of Computer Science, Electronics and Telecommunications
Department of Computer Science

Doctoral Thesis

Transparent Data Access in Federated Computational Infrastructures

Author: mgr inż. Michał Wrzeszcz
Supervisor: Prof. dr hab. inż. Jacek Kitowski

Co-supervisor: dr hab. inż. Renata Słota

Kraków, Poland, April 2019

Akademia Górniczo-Hutnicza im. Stanisława Staszica w Krakowie
Wydział Informatyki, Elektroniki i Telekomunikacji
Katedra Informatyki

Rozprawa doktorska

Transparentny Dostęp do Danych w Sfederalizowanych Infrastrukturach Obliczeniowych

Autor: mgr inż. Michał Wrzeszcz
Promotor: Prof. dr hab. inż. Jacek Kitowski

Promotor pomocniczy: dr hab. inż. Renata Słota

Kraków, Kwiecień 2019

I would like to thank Professor Jacek Kitowski, my PhD supervisor, for his guidance, support and valuable advice that helped shape this thesis. My sincere thanks also go to Dr. hab. Renata Słota for suggestions and fruitful discussions during the writing of this dissertation. I am also grateful to ACC Cyfronet AGH for the provision of resources required to verify the thesis. Among my colleagues from ACC Cyfronet AGH and the Department of Computer Science, there is one person I want to especially thank and express my gratitude to: the leader of the Onedata team, Dr. Łukasz Dutka, who supported me with his knowledge and passion. I also owe thanks to my colleagues from the Onedata team for the great time working together, and to Piotr Nowakowski for consulting the text of the thesis. Last but not least, I would like to thank my family: my wife and my daughter, for their patience and forbearance.

Abstract

Current scientific problems require strong support from data access and management tools, especially in terms of data processing performance and ease of access. However, when analysing elements that influence user operations, it is impossible to choose a single set of mutually non-exclusive features that satisfy all the requirements of data access stakeholders. Thus, the author has decided to study how a large-scale data access system should operate in order to meet the needs of a multiorganizational community. The author has identified context, represented by metadata, as a key aspect of the solution. On this basis the author postulates that context awareness enables data to be provisioned to users in a transparent manner while maintaining quality of access. However, along with the growth of the environment in terms of round-trip times, metadata management becomes challenging due to access and/or management overheads, often resulting in bottlenecks. Thus, the author has identified and classified contextual metadata, taking into account consistency and synchronization models, utilizing BASE (Basic Availability, Soft-state, Eventual consistency) rather than ACID (Atomic, Consistent, Isolated, Durable) whenever possible.

This dissertation describes steps undertaken in order to validate the author's thesis, starting with analysis of the requirements of federated computational infrastructure stakeholders and shortcomings of existing data access tools. The core element of the thesis is a description of the Model of Transparent Data Access with Context Awareness (MACAS), designed to accommodate dynamic changes of factors which affect data access in order to provide the desired access characteristics to specific groups. To solve this complex task, the model introduces layers and cross-cutting concerns which cover different aspects of data access, such as interactions with diverse storage resources, users' interactions with the data access system, coordination of execution of multiple operations to utilize more than one storage system, efficient utilization of network resources, cooperation of resource providers and distribution of the environment.

The author also presents an implementation of the proposed model that focuses on the ability to process large amounts of metadata, along with notifications which enable broad provisioning of up-to-date context information. The dissertation is concluded by a description of tests carried out in a federated environment, without any assumptions regarding the providers' mutual relationships. These tests validate the model's quality as well as its capability for adaptation to nonfederated environments.

Streszczenie

Aktualne problemy naukowe wymagają odpowiednich narzędzi zapewniających nie tylko wydajny dostęp do danych, ale i łatwe zarządzanie danymi. Analizując elementy, które mają wpływ na operacje wykonywane przez użytkownika, nie jest jednak możliwe wybranie jednego zestawu funkcjonalności, który satysfakcjonuje wszystkich zainteresowanych dostępem do danych. W związku z tym autor zaproponował i poddał badaniom system dostępu do danych spełniający wymagania społeczności użytkowników zrzeszonych w wielu niezależnych organizacjach. Autor zidentyfikował kontekst, reprezentowany jako metadane, jako kluczowy element rozwiązania, formułując tezę, że znajomość kontekstu umożliwia transparentne dostarczanie danych użytkownikom, utrzymując przy tym jakość dostępu. Jednak wraz z rozrastaniem się infrastruktury, zarządzanie metadanymi staje się coraz bardziej wymagające z powodu narzutów na synchronizację i/lub opóźnień w dostępie, które mogą doprowadzić do powstania wąskiego gardła w systemie dostępu do danych. W związku z tym autor zidentyfikował metadane opisujące kontekst i sklasyfikował je na podstawie modeli spójności i synchronizacji, w celu zapewnienia dostępności i efektywności kosztem transakcyjnego, atomicznego przetwarzania, tam gdzie jest to możliwe.

Rozprawa zawiera opis etapów realizowanych w celu weryfikacji sformułowanej w rozprawie tezy, zaczynając od analizy wymagań oraz niedoskonałości istniejących narzędzi zapewniających dostęp do danych. Głównym osiągnięciem pracy jest model transparentnego dostępu do danych z wykorzystaniem kontekstu (ang. Model of Transparent Data Access with Context Awareness - MACAS), który umożliwia dynamiczne zmiany parametrów wpływających na dostęp do danych, aby zapewnić pożądaną charakterystykę dostępu do danych przez poszczególnych użytkowników. Aby rozwiązać tak złożone zadanie, model składa się z warstw obejmujących różne aspekty dostępu do danych, takie jak interakcja z różnymi systemami składowania danych, interakcja użytkowników z systemem dostępu do danych, koordynacja wykonania wielu operacji w celu wykorzystania więcej niż jednego systemu składowania danych, wydajne wykorzystanie zasobów sieciowych, współpraca organizacji dostarczających zasoby dyskowe i obliczeniowe oraz rozproszenie środowiska.

Autor przedstawia także implementację proponowanego modelu, która koncentruje się na możliwości przetwarzania dużej ilości metadanych oraz powiadomień, które umożliwiają dostarczanie szerokich i aktualnych informacji kontekstowych. Na zakończenie prezentowane są testy w środowisku sfederowanym, które udowadniają jakość systemu utworzonego na bazie modelu, a także zdolność dostosowania modelu do niesfederowanych środowisk.

Contents

1 Introduction
  1.1 Motivation
  1.2 Thesis Statement and Research Objective
  1.3 Thesis Contribution
  1.4 Note on Participation in Research Projects
  1.5 Thesis Structure
  1.6 Definitions of Terms
2 Background Survey
  2.1 Computational Environments
    2.1.1 Grid Computing
    2.1.2 Cloud Computing
  2.2 Typical Grid and Cloud Data Access Tools
  2.3 Tools for Anytime/Anyplace Data Access
  2.4 Tools for Distributed Data Processing
  2.5 Tools for Unified View of Multiorganizational Data
  2.6 Summary
3 MACAS - Model of Transparent Data Access with Context Awareness
  3.1 Data Access Stakeholders
  3.2 Basis of MACAS
  3.3 Context Modelling in MACAS
    3.3.1 Types of Metadata
    3.3.2 Description of Metadata
    3.3.3 Classification of Metadata
    3.3.4 Metadata Consistency and Synchronization Models
  3.4 Model Description
    3.4.1 Description of MACAS Layers and Concerns
    3.4.2 MACAS Algorithm
  3.5 Summary
4 Architecture and Selected Aspects of Implementation
  4.1 Overall Architecture of the System
    4.1.1 Metadata Distribution
    4.1.2 Handling Metadata Updates
    4.1.3 Propagation Delay for Metadata Changes and its Consequences

  4.2 Architecture of Data Management Component
    4.2.1 DMC Core
    4.2.2 DMC Modules
    4.2.3 Request Handling and Load Balancing
  4.3 Summary
5 Experimental Evaluation
  5.1 DMC Core Tests
    5.1.1 Evaluation of Request Routing and Processing
    5.1.2 Metadata Access Evaluation
    5.1.3 Reliability Evaluation
  5.2 Performance Evaluation of Integrated System
    5.2.1 Overhead Evaluation
    5.2.2 Evaluation of Scalability and System Limits
    5.2.3 Evaluation of Overhead in a Multi-DMC Environment
  5.3 Datachunk Management Evaluation
  5.4 Context Awareness Evaluation
  5.5 Contribution of Context Awareness to Experiments
  5.6 Evaluation Summary
6 Conclusions and Future Work
  6.1 Summary
  6.2 Research Contribution
  6.3 Range of Applications
  6.4 Future Work
Bibliography
Author's Bibliometric Data
Author's Publications

List of Figures

1.1 View of collaboration at scale
1.2 Influence of metadata on data access
1.3 Proposed evolution of data access model
1.4 Contribution of the author (green) and collaborative tasks in which the author was involved (black)

2.1 Drawbacks of existing tools for federated computational environments (red - drawbacks, green - advantages offset by other drawbacks)

3.1 Scheme of the data access system
3.2 Stakeholders' relation to data access in federated computational environments
3.3 Basic metadata used during data access
3.4 Example of datachunk replication
3.5 Main metadata dependencies
3.6 Model of Transparent Data Access with Context Awareness
3.7 Algorithmic representation of MACAS

4.1 Overall architecture of the system
4.2 FUSE client (FClient) concept
4.3 Sample FClient pseudocode
4.4 Basic pseudocode for handling FClient requests
4.5 Pseudocode for FClient datachunk synchronization request handling
4.6 Pseudocodes for updates of metadata describing datachunks
4.7 Pseudocodes for metadata updates following merger and invalidation of datachunks
4.8 Metadata stores and caches
4.9 Sample flow between metadata stores and metadata caches
4.10 Conflict resolution pseudocode
4.11 Deployment of DMC Core
4.12 DMC modules
4.13 Request flow – different modes with different features

5.1 Normalized throughput with similar load on all DMC cluster nodes
5.2 Normalized throughput with DMC cluster nodes divided into two groups with different load
5.3 Fragments of logs from reliability tests
5.4 Total aggregated throughput and CPU usage
5.5 Data access throughput for local and shared datasets

5.6 Changing distribution of file fragments among DMCs
5.7 Test environment for comparing management policies
5.8 Context awareness test environment
5.9 Results of selected steps of the context awareness test

6.1 Relation of issues connected to transparent data access
6.2 Author's individual achievements (green), collaborative achievements (black) and tasks in which the author was not involved (orange)

List of Tables

1.1 Data storage, access and management levels

2.1 Features of data access solutions
2.2 Existing solution characteristics

3.1 Features of the data access system expected by stakeholders
3.2 Abbreviations for types of metadata used in equations
3.3 Classes of metadata with abbreviations used in equations
3.4 Metadata classes

4.1 Implementation assumptions for MACAS
4.2 Aggregation of events and changes
4.3 Implementation of MACAS layers and concerns by DMC modules

5.1 Request handling time and characteristics of request processing modes
5.2 Test configurations for metadata access
5.3 Average metadata access times for different configurations and computing environments
5.4 Number of memory slots occupied at the end of the test for different configurations and computing environments
5.5 Results of reliability tests
5.6 Description of overhead tests
5.7 Throughput of a system implementing the MACAS model in comparison with direct access
5.8 Total aggregated throughput and number of operations per second
5.9 Comparison of management policies
5.10 Types of context awareness

1 Introduction

With the fast growth of the digital universe, data access and processing at a global scale are at the centre of scientific and commercial interests. This follows from the ever increasing scale of research problems which call for wide-ranging collaboration between groups of researchers making use of geographically distributed, heterogeneous data sources (cf. Figure 1.1). The problems – such as Data Science [35; 57], Big Data [26; 53], the 4th Paradigm [18] and Science 2.0 [47], each represented by a set of activities performed worldwide – require strong support from data access and management tools, which must evolve to meet demands for data processing performance supported by the available storage resources, as well as ease of use.

Extracting knowledge or generating insight from data provides an understanding of contemporary scientific, commercial and social challenges. A recognizable feature of modern data access and management systems is the tendency to cross organizational boundaries, resulting in work at the forefront of science and engineering, such as processing of astronomy data [44], or sharing and analyzing the results of large-scale experiments and simulations (e.g., the Human Brain Project [24] or the Worldwide LHC Computing Grid [12]). New data storage and management paradigms are foreseen, e.g., the concept of data lakes [49] for maintaining and sharing different types of data. Modern science introduces many problems which require collaborative work and call for substantial resources.

Figure 1.1: View of collaboration at scale

Likewise, the business world acknowledges the increasing role of efficient data management for analyzing constantly growing volumes of data in order to remain competitive on the market.

The need for e-infrastructures which deal with open data, facilitating easy access and sharing in organizationally distributed environments, has already been acknowledged, giving rise to various projects and initiatives, e.g. [81; 72]. Nevertheless, development of such systems clearly lags behind expectations where ease of use, effectiveness, transparency of data access and heterogeneity of resources are concerned. Collaboration between distributed groups of workers calls for sophisticated systems, which, in turn, requires a set of specific problems to be solved separately to meet collaboration requirements. Due to the complexity of such systems, their development is usually performed by teams of architects and developers with well-defined roles and activities, inspired by specific use cases and end users.

In formal terms, data access and management can be analyzed at several levels, from personal data to globally distributed shared data. This leads to various cooperation problems which need to be overcome by data access and management tools. The simplest case involves access to local data, i.e., to data stored on direct attached storage (DAS). The user can control all activities; the only problem is to provide device drivers and solutions that use the hardware in an optimal fashion, such as a striping algorithm for an SSD storage array [39], tuning the storage system for a particular type of usage (e.g., archiving [23]) or balancing between several levels of a hierarchical storage system. The next level involves provisioning data access to a group of users working for a single organization, e.g., network attached storage (NAS). Problems encountered in this case pertain to possible network failures, higher latency [37], or simultaneous work by multiple users, impacting quality of service parameters [29; 30; 36]. Storage systems must be able to maintain the required QoS and cost effectiveness on the organizational level, assisted by request scheduling and resource allocation algorithms [43; 32].

The resources offered by a single organization may prove insufficient for users who require large amounts of storage and computing power to process data streams produced continuously by experiments, or make use of large amounts of information gathered in existing datasets. For such users, resource providers create federated organizations (FO) – or federations for short – i.e., groups which agree to collaborate in order to simplify access to resources which belong to multiple organizations, often defining a storage area network (SAN) and detailed common rules of cooperation and resource sharing [42]. Nevertheless, many problems related to federated data remain unresolved. Computing Grids [2] and Virtual Organizations (VOs) [1; 14] address issues related to decentralized management by organizations that use different policies and make autonomous decisions according to local requirements. Nevertheless, further work is required to improve efficiency and convenience of data access, as well as to ensure cost-effective data management. The most complex case is the use of data provided by several nonfederated organizations (NFOs), i.e., organizations that do not have any cooperation agreement in place. In this case, challenges related to trust and lack of standards appear.
As there is no bond between NFOs, the exchange of information concerning users and their data is difficult, since each NFO may apply its own authentication mechanism. Table 1.1 summarizes the presented levels of data storage, access and management. At each level, a cumulative increase of difficulty is observed.

Figure 1.2: Influence of metadata on data access

Table 1.1: Data storage, access and management levels

Environment description | Level | Sample problems
Direct attached storage | Local | Provision of device drivers to maximize hardware performance
Network attached storage for a group of users | Organization | Provision of device drivers for a distributed environment to minimize the negative impact of simultaneous use
Resources offered by closely collaborating organizations | Federated Organization (FO) | Distributed management and low-level data access policies
Resources offered by unrelated organizations | Nonfederated Organization (NFO) | No trust between organizations, lack of standards, local user accounts

One of the important aspects of a solution which addresses such problems pertains to metadata (see Figure 1.2) that can be managed automatically or manually by the user. Such metadata includes user-specific information (e.g., access control) along with storage-specific information (e.g., location of data replicas) [25; 9; 5; 6]. Metadata can also be used to describe the context of data access, e.g., system component load. The more contextual information is taken into account by management algorithms, the higher the quality provided to the users. However, management of large amounts of metadata may prove difficult for many users if it is not handled automatically by the data access system. Moreover, along with the growth of the environment in terms of round-trip times, metadata management becomes more challenging because of possible access and/or management overheads. Research indicates that operations on metadata are very likely to cause a bottleneck [13; 11; 15]. Thus, the quality of data access in a multiorganizational distributed environment is strictly related to metadata management.
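To make the distinction between these kinds of metadata concrete, the following minimal sketch (in Python) groups user-specific, storage-specific and contextual metadata for a single file and uses the context to choose a replica. The field names and the selection rule are illustrative assumptions of this example only, not the metadata schema defined later in the thesis.

    # Illustrative sketch only: field names and grouping are assumptions of this
    # example, not the metadata model introduced in Chapter 3.
    from dataclasses import dataclass, field
    from typing import Dict, List


    @dataclass
    class FileMetadata:
        # User-specific metadata (e.g., access control), often managed manually.
        owner: str
        acl: Dict[str, str] = field(default_factory=dict)       # user -> "read"/"write"
        # Storage-specific metadata (e.g., location of data replicas), managed
        # automatically by the data access system.
        replicas: List[str] = field(default_factory=list)        # storage site identifiers
        # Contextual metadata describing the circumstances of access
        # (e.g., load of the components holding the replicas).
        replica_load: Dict[str, float] = field(default_factory=dict)  # site -> load in [0, 1]


    def pick_replica(meta: FileMetadata) -> str:
        """A trivial context-aware decision: read from the least loaded replica."""
        return min(meta.replicas, key=lambda site: meta.replica_load.get(site, 0.0))


    meta = FileMetadata(owner="alice",
                        acl={"bob": "read"},
                        replicas=["site-a", "site-b"],
                        replica_load={"site-a": 0.9, "site-b": 0.2})
    print(pick_replica(meta))   # -> "site-b"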

1.1 Motivation

When a user accesses data, several aspects are typically considered important – these include the type of storage system, location of data and the state of the environment (e.g. availability of storage space, its load and type, number of users accessing storage, etc.). Minimizing the negative impact of data distribution, ensuring high availability of data, providing data replication and limiting delays in accessing replicas comprise another group of important issues. Dealing with such aspects is usually inconvenient for typical users. Hence, the topic of this dissertation is a study on provisioning distributed data in a transparent and effective manner.

As a result of the author's involvement in various national and international projects (see Chapter 1.4), strong demand for a system for accessing heterogeneous distributed data which would be user-friendly, efficient and scalable has become evident. This observation has resulted in attempts to capture customer needs in terms of a general use case and the required functionality, which, according to system engineering principles, expresses the most important factors at the early stage of development. Since – as previously mentioned – the development of such a system is a complex task, we formulate the following overall research question:

At large scale, how should the system be built to fulfill the requirements of the multiorganizational community and to offer user-friendly, efficient and scalable access to heterogeneous, distributed data resources? What are its unique features?

In addition to the above, we also state a more specific question:

What are the main elements of the system?

In order to address these research questions we specify the overall concept, architecture and core elements of the proposed system.

1.2 Thesis Statement and Research Objective

When analyzing data access in an organizationally distributed environment, we should mention elements which influence user operations, such as a consistent view of the distributed data, efficient data reading and/or writing capabilities, as well as avoiding or demanding data redundancy. Since many of these features contradict each other, according to the CAP theorem [3] it is impossible to select a single set of mutually non-exclusive features that satisfy all stakeholders. On the basis of the shortcomings of existing solutions (discussed in the State of the Art section) along with our initial experimental environments and tests [138; 137], we have identified context awareness, represented by metadata management, as the main aspect of this study. Consequently, the thesis of the dissertation may be defined as follows:

In distributed storage infrastructures context awareness enables data to be provisioned to users in a transparent manner while maintaining quality of access.

The above thesis includes three important terms:

• context awareness – the ability of entities to sense and react to the state of their environment [17]; in this work context is represented by metadata which expresses knowledge of the circumstances of data access, e.g., the environment and the user's expectations,

• data provisioning transparency – a concept which stipulates that problems associated with differing data formats, storage systems and locations must be concealed from users; instead, users access data via logical paths while the infrastructure handles the underlying technical aspects,

• quality of access (QoA) – Quality of Service (QoS) considerations related to data access [29].

The overall research objective is to develop an approach for transparent and easy provisioning of distributed data while maintaining quality of access. Transparent provisioning of distributed data not only results in access simplicity but also in significant management possibilities. Data can be automatically migrated and/or replicated in a transparent manner to decrease access latency or increase throughput. However, the concept of data access transparency does not assume any goals regarding the data access system. Instead, transparent data management should implement a specific policy (follow particular guidelines – see Chapter 1.6), e.g., improving access throughput or providing additional security mechanisms. Moreover, different policies can be applied to different datasets or groups of users to address the above-mentioned problem of mutually exclusive features. QoS requirements can be defined in the form of a Service Level Agreement (SLA) [16] to formally describe aspects of quality. The agreement may also specify best-effort mechanisms along with provisioning of characteristics that are difficult to measure (e.g., simplicity). The term QoA is used in this thesis to describe any attempt to deliver the desired access characteristics. Maintaining quality of access means that provisioning of new/upgraded features/characteristics does not result in the loss of any other desired characteristics. In particular, implementing a layer of abstraction that conceals data location from users (simplifying access) should not result in decreased access performance. While quality trade-offs are allowed, they should always relate to a specific policy (e.g., loss of performance due to use of safer but slower storage) rather than to implementation aspects of the data transparency mechanism itself. To achieve the stated objective, the following research steps are performed:

• elaboration of a model which reflects the overall idea of the system,

• design of a system architecture that represents the model,

• implementation of core system elements,

• practical verification of the approach in real and simulated computing environments.

To validate the thesis and complete the main objective the Model of Transparent Data Access with Context Awareness (MACAS) is created. In addition to modeling easy and transparent data access, it also enables implementation of different policies, e.g., focusing on data access efficiency and scalability or security (see Figure 1.3). To achieve these goals, appropriate knowledge about several aspects of data access (data access context) is modelled as a set of metadata, hence metadata definitions constitute an important element of MACAS. The MACAS model introduces various abstractions which enable implementation of different levels of data storage, access and management (see Table 1.1). Practical verification focuses on federated multiorganizational environments as a core functionality of the model, disregarding issues of trust which typically arise in NFOs.

1.3 Thesis Contribution

The work performed in this dissertation is based on collaborative research projects. In Figure 1.4 the author’s contribution is highlighted in green, while black text represents collaborative work. The presented work is aligned with the trend of Software Defined Storage (SDS) [99]. The author’s involvement is as follows:

1. Contribution to developing assumptions for the MACAS-compliant system – identifying data access stakeholders, limitations of organizational distribution as well as functional and non-functional requirements to be included in the model.

Figure 1.3: Proposed evolution of data access model

Figure 1.4: Contribution of the author (green) and collaborative tasks in which the author was involved (black)

2. Design of MACAS – a Model of Transparent Data Access with Context Awareness including transparent data management assisted by data access context. It models policy-based provisioning with hardware-independent data management and provides sufficient elasticity to account for dynamically adjusted features depending on the requirements of a particular user application.

3. Co-design of the data access system by mapping elements of MACAS to elements of the system architecture. The design takes into account efficiency, distributed management and diversity of solutions and policies.

4. Implementation of the system core. Participation in the implementation of components representing elements of the system architecture.

5. Design of experiments and tests that verify the model based on popular benchmarks and tools. Participation in the implementation of tests.

The most important novelty of the author's contribution is context awareness, represented by metadata, included in the MACAS model to accommodate dynamic changes in various factors that affect data access. Elaboration of metadata classes with different consistency and synchronization models to avoid context information processing bottlenecks should also be acknowledged. The implementation focuses on the ability to process large volumes of metadata as well as notifications, ultimately enabling the system to provision broad and up-to-date context information.

1.4 Note on Participation in Research Projects

The author has actively participated in several research projects related to distributed systems. The main background is provided by collaboration with the Academic Computer Centre Cyfronet AGH [62] and by postgraduate research at the Department of Computer Science of the Faculty of Computer Science, Electronics and Telecommunications of the AGH University of Science and Technology.

Participation in the PL-Grid family of projects [92] has provided insight into data management in federated computing infrastructures. In the PL-Grid PLUS [91] and PL-Grid CORE [90] projects the author was responsible for a development team working on implementation of tools simplifying access and management of data stored within the PL-Grid infrastructure. The author was involved in the QStorMan project [28], developing a tool for optimization of use of storage resources in accordance with user requirements.

Work for the INDIGO [81] and EGI ENGAGE [72] projects involved exploration of user requirements for organizationally distributed environments in the context of collaboration and data sharing. Basic experience was gained from the European Defence Agency EUSAS [139; 123; 122; 117] and national Rehab [96; 118; 145] projects, focusing on data farming methodology applications as well as holistic rehabilitation of stroke patients with the help of computer games.

1.5 Thesis Structure

The remainder of the thesis is organized as follows. Chapter 2 contains an overview of data access tools and environments important for the thesis. In Chapter 3 data access stakeholders and their requirements are identified. On the basis of this analysis, the Model of Transparent Data Access with Context Awareness is introduced. Selected aspects of the model implementation are presented in Chapter 4, while Chapter 5 discusses its experimental evaluation. The final chapter outlines conclusions and plans for future work.

1.6 Definitions of Terms

The following terms are used in this thesis:

• Basic terms:

– Dataset: a collection of data that may be perceived as a coherent whole.
– Metadata: data [information] that provides information about other data and/or access context.
– Data access: storing, retrieving and manipulating data.
– Data access context: the environment / conditions in which data is accessed.
– Data access context awareness: a property of data access systems that allows the system to understand the environment / conditions in which the system operates and react to changes in the environment / conditions. Data access context is represented by metadata that describes knowledge concerning the circumstances of data access.
– Data access and management policy: a set of guidelines that should be followed during data access and management to fulfill requirements.

• Elements of the environment:

– Site: a set of closely linked computing and/or storage resources.
– Client: a software entity used by the user to operate on data.
– FUSE: a software interface that allows non-privileged users to create their own filesystems without editing kernel code.
– FUSE client: a client that is based on FUSE (see the sketch following this list of terms).

• Actors involved in data access:

– Organization: an entity that has a collective goal and is linked to the external environment.
– Provider: an organization that owns/operates computing and/or storage resources and provides them to the user. The provider's resources may form one or several sites. Data is stored and manipulated using software and hardware that belong to a particular provider.
– User: a person or organization that possesses some data and/or performs computations.
– Developer: a producer of applications/services that offer particular features to users.

• Terms used to describe relations between elements of the environment:

– Federation: a group of computing or network providers who agree upon standards of operation in a collective fashion.
– Nonfederated organizations (NFOs): organizations that do not have any cooperation agreement in force.

– Support of user by provider: a term that describes the relation between the user and the provider when the provider makes its resources available for the user to store/access/process data. In such a case, the provider stores and manages metadata to provide data access that fulfills the user's requirements.
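To illustrate the FUSE concept referred to above, the following minimal sketch exposes a single read-only file through a mount point. It is purely illustrative and assumes the third-party fusepy Python bindings; it is not the FUSE client (FClient) implementation discussed in Chapter 4.

    # Minimal read-only FUSE filesystem sketch using the fusepy bindings
    # (an assumption of this example; not the FClient described later).
    import errno
    import stat
    import sys
    from fuse import FUSE, FuseOSError, Operations


    class HelloFS(Operations):
        """Exposes a single file, /hello.txt, with static contents."""
        DATA = b"transparent data access\n"

        def getattr(self, path, fh=None):
            if path == "/":
                return {"st_mode": stat.S_IFDIR | 0o755, "st_nlink": 2}
            if path == "/hello.txt":
                return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1,
                        "st_size": len(self.DATA)}
            raise FuseOSError(errno.ENOENT)

        def readdir(self, path, fh):
            return [".", "..", "hello.txt"]

        def read(self, path, size, offset, fh):
            return self.DATA[offset:offset + size]


    if __name__ == "__main__":
        # Mount with: python hellofs.py /mnt/hello (unmount with fusermount -u).
        FUSE(HelloFS(), sys.argv[1], foreground=True)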

2 Background Survey

According to the CAP theorem [3] it is impossible for a distributed data store to simultaneously provide more than two of the following three guarantees: consistency, availability and partition tolerance. When supporting Open Science, Big Data processing and cooperation among users who work with organizationally distributed resources, it is particularly important to provide availability and partition tolerance. This statement can be justified by a use case where several scientists process different parts of a read-only dataset, e.g., they analyze different scans of the human brain. Analysis of each part of the dataset takes a lot of time and requires significant computational resources, hence it is conducted at several computing centers. While processing a particular portion of the dataset does not require access to its full contents, outcomes should be made available to everyone to facilitate further comparison of results obtained for different parts of the dataset. As processing of each part can take days or weeks, it is critical to allow processing of selected parts even when other parts are temporarily unavailable or the connection to other computing centers that take part in processing has been lost. For such a use case it is more important to ultimately provide access to results produced using multiple parts than to ensure a consistent view of temporary data during processing. Thus, the author focuses on tools that allow efficient data access at any given time regardless of consistency. Tools such as GlobalFS [48], which aim to ensure strongly consistent filesystem operations in the event of node failures but come at the cost of reduced availability, are not considered.

This chapter begins with an overview of computational environments followed by a discussion of typical data access tools. Subsequently, other solutions for data access are investigated. Finally, problems related to data access and management in organizationally distributed environments are summarized.

2.1 Computational Environments

Since scientific experiments often require substantial computing power, the author focuses on the two most popular large-scale approaches: grid and cloud computing (also referred to as the grid and the cloud, respectively). The aim of both is resource sharing, but the way in which the resources are offered differs.

2.1.1 Grid Computing

The computing grid is "a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities" [2]. The essence of the grid is (1) coordination of resources that are not subject to centralized control, (2) use of standard, open, general-purpose protocols and interfaces, (3) providing nontrivial quality of service [4]. Other definitions emphasize the use of virtualization to present a unified system image [108]. Two important aspects for this thesis appear in the above-mentioned definitions: linking resources contributed by several organizations and achieving a unified system image.

From the user's point of view, the grid can be perceived as a coherent whole due to its dedicated middleware and single sign-on features. However, sites have local, independently managed storage systems and may differ with regard to data access policies. The existence of multiple storage systems inside the grid, each managed by a different organization, makes data access and management difficult for users as well as for providers. Moreover, most storage solutions are suited only for a limited number of use cases and the user is often forced to combine several tools to achieve the desired effect (see Chapter 2.2). Thus, there is room for solutions simplifying data access and management, but grid solutions are currently losing their importance.

2.1.2 Cloud Computing

According to the NIST¹ definition, "cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [104]. Cloud computing is based on virtualization of resources, which can be monitored, controlled and subjected to accounting. There are three basic cloud service models:

• Software as a Service (SaaS),

• Platform as a Service (PaaS),

• Infrastructure as a Service (IaaS).

SaaS offers end-user applications running in the cloud that are accessible from various devices. Examples include Gmail and Google Docs. When using PaaS, in contrast to SaaS, the user has control over the deployed applications and may configure the application hosting environment. This provides greater elasticity than SaaS. Examples include Google App Engine and Microsoft Azure. IaaS (e.g., Amazon Cloud) allows users to run arbitrary software with limited control of networking components (e.g., host firewalls) while in SaaS and PaaS users have no control over the network. IaaS offers the greatest elasticity but also requires substantial knowledge, including familiarity with system administration tasks. The two basic cloud deployment models are private and public clouds. However, two addi- tional models can be found in literature: community and hybrid clouds [104]. A private cloud is an infrastructure dedicated to exclusive use by a single organization, whereas a public cloud is provisioned for open use by the general public. It is often owned, managed, and operated by a business and users are billed for its use. Both private and public clouds can be managed by a third party – the type of deployment depends on the user, not the managing entity. If a cloud infrastructure is provisioned for exclusive use by a specific community of consumers (more than a single organization, but without open access), it is called a community cloud.

¹ National Institute of Standards and Technology, U.S. Department of Commerce

Community clouds are often owned and managed by organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). A hybrid cloud is a composition of two or more of the above-mentioned infrastructures (private, community, or public) that remain unique entities but are bound together by standardized or proprietary data and application technology.

Although many cloud environments provide data access transparency, this topic is also addressed by the thesis. The main assumption and motivation for the work is the diversity of clouds and their users. Different scientists use different clouds and it is often impossible to process data of all cooperating scientists using a single cloud due to limited resources, dataset migration overhead or formal reasons (e.g., funding obtained for usage of a particular environment). Thus, aspects of cloud users' collaboration remain an interesting research topic and are addressed by the presented work.

2.2 Typical Grid and Cloud Data Access Tools

Grid and cloud providers offer storage systems for various purposes. The grid usually supports such solutions as (1) scratch storage for intermediate job results and data processing; (2) long-term data storage for final job results, often accessible through a dedicated API and appropriate for sharing data between different sites. Providers can also offer (3) object storage, which manages data as objects (e.g., Amazon S3). Access to objects is fast and scalable, but there is no hierarchical structure or block access such as in traditional filesystems. Object storage is designed to deliver multiple online storage services, whereas traditional storage systems are primarily designed for high-performance computing and transaction processing. Selected examples of tools used in the grid and cloud environments are outlined below.

Lustre [83] is a parallel distributed filesystem for computational clusters. The Lustre filesystem is often used as a high-performance scratch system in the grid. In such cases, there are usually different Lustre instances on different sites, which means that data stored in this filesystem can only be shared within the local cluster. Although the efficiency of the Lustre system is high, it may nevertheless be improved through dedicated tools such as QStorMan [28; 32; 120]. QStorMan aims at delivering storage QoS and resource usage optimization for applications that use the Lustre filesystem. QStorMan continuously monitors Lustre nodes and dynamically forwards data access requests to storage resources according to predefined storage QoS requirements. The usage of QStorMan has been shown to improve the data access efficiency of PL-Grid data-intensive applications [120]. Another tool often used as a high-performance scratch system is GPFS [80; 46]: a technology provided by IBM that offers similar usage characteristics to Lustre.

In order to make data available outside the site, it should be copied to permanent storage outside the local cluster; e.g., LFC (LCG File Catalog), which is a storage mechanism for metadata management that provides common filesystem functionality for distributed storage resources [8; 22]. It supports file replication for better data protection and availability. It is commonly used with the command-line utilities described in [109]. Direct access from the application source code is possible using the GFAL API [77]. Since many users consider the usage of dedicated command-line utilities or APIs a drawback, they may use a FUSE-based [75; 54] implementation of the filesystem called GFAL-FS [77], which provides access to data in the same manner as in a regular Unix-like filesystem. However, data access via GFAL-FS is slower (compared to command-line utilities or the GFAL API) and only available in read-only mode.

OpenStack Object Store (known as Swift [88]) is an example of object storage often used in the cloud. It is able to provide common file names within the grid and cloud infrastructures, and can therefore be applied in similar use cases to LFC. However, the Swift file sharing mechanism (which makes use of API access key sharing or session tokens) seems to be more troublesome for most users compared to LFC file sharing mechanisms based on Unix permissions. There are several reasons behind the high heterogeneity of storage systems:

• user requirements for storage resources with different characteristics depending on the application,

• local resource policies,

• use of spare storage resources which already exist at the given location.

Users often emphasize the importance of specific use cases, such as archiving and efficient access to temporary files [132]. Currently, such use cases are served by different storage systems and tools on different sites. As a result, standard grid and cloud environments do not offer data access transparency when deployed in a multi-organization environment due to the heterogeneity of storage systems. The lack of easy transparent data access results in management problems from the provider’s point of view. Less technically advanced users often work only with scratch storage and manually manage data transfers using SSH-based protocols for both file sharing and staging prior to job execution. This results in suboptimal use of storage and computing resources. Thus, both users and providers require better data access methods.

2.3 Tools for Anytime/Anyplace Data Access

Although anytime/anyplace data access is a term used mainly in marketing, in the author's opinion it accurately reflects the focus of tools described in this chapter. This section provides a description of solutions whose primary objective is to provision data to the user regardless of the access point. When the user works with resources on multiple sites, all data must be available for each user process on each site.

Existing tools for anytime/anyplace data access focus on ease of access. The most popular ones are Dropbox, OneDrive and Google Drive [71]. Client applications are provided for the most popular operating systems, enabling a virtual filesystem to be mounted in order to transparently handle synchronization with cloud storage. If any operation performed without connection to the Internet conflicts with server-side changes performed by other clients, users can resolve the conflict on their own. Other significant features include file sharing mechanisms which allow users to easily publish their data. These tools impose rigorous limits on storage size and transfer speed, which become an obstacle when the research is conducted in a geographically distributed manner and data requires online synchronization across sites. The user has to carefully plan data processing depending on where the data has been generated and what transfer/synchronization operations are foreseen.

Another similar sync-and-share tool is ownCloud [34; 41]. It enables users to maintain full control over data location and transfer while hiding the underlying storage infrastructure, abstracting file storage available through directory structures or WebDAV. It also provides file synchronization between various operating systems, sharing of files using public URLs, and support for external cloud storage services. Although ownCloud is more flexible than the previously mentioned tools, its performance is also not sufficient for data-intensive applications.

2.4 Tools for Distributed Data Processing

One of the most prominent tools for remote data access is Globus Connect [45]. It is built on the GridFTP protocol to provide fast data transfer and sharing capabilities inside an organization. Globus Connect focuses on data transfer and does not abstract access to existing data resources. Thus, it does not provide any data access transparency.

Another option for distributed environments is to provision storage resources through a high-performance parallel filesystem. Solutions of this type intend to provide access to storage resources optimized for performance. They are usually built on top of dedicated storage resources and expose POSIX-compliant interfaces. Examples include BeeGFS (formerly FhGFS) [65], GlusterFS [76], Coda [103], and PanFS [89]. As there are significant differences between these systems in terms of data access, their most important features are presented below.

BeeGFS [65] is an excellent example of a high-performance parallel filesystem because it uses many typical mechanisms for this type of tool. It combines multiple storage servers to provide a shared network storage resource with striped file contents. Built on scalable multithreaded core components with native Infiniband support, BeeGFS has no architectural bottlenecks. It distributes file contents across multiple storage servers and likewise distributes filesystem metadata across multiple metadata servers. This results in high availability and low metadata access latency. Even given the multitude of metadata servers, it is guaranteed that changes to a file or directory by one client are immediately visible to other clients. BeeGFS has no support for integration of resources managed by several independent organizations. It can be used within a local storage area, but cannot easily provide transparent data access for organizationally distributed environments.

GlusterFS constitutes an interesting alternative to metadata servers. It uses an elastic hashing algorithm that allows each node to access data without use of metadata or location servers. Storage system nodes have the intelligence to locate any piece of data without looking it up in an index or querying another server. This parallelizes data access and ensures good performance scaling. GlusterFS can scale up to petabytes of storage, available to the user under a single mount point. The developers of GlusterFS [76] point to the use of an elastic hashing algorithm as the heart of GlusterFS's fundamental advantages, which results in good performance, availability and stability, and a reduction in the risk of data loss, corruption or the data becoming unavailable. The use of hashing algorithms minimizes traffic flow; however, usage of metadata servers results in better elasticity and easier reconfiguration. On the other hand, hashing algorithms require more work when a group of data servers is reconfigured, so both solutions have pros and cons; the choice of solution should depend on the use case. Similarly to BeeGFS, GlusterFS is dedicated to local storage infrastructures with no strong support for an organizationally distributed environment.
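The contrast between hash-based placement and metadata-server lookup can be illustrated with a simple consistent-hashing sketch: every client computes the responsible server from the file path alone, without consulting a location service. This is a generic illustration in Python; GlusterFS's actual elastic hashing algorithm differs in detail.

    # Generic consistent-hashing sketch of hash-based data placement;
    # not GlusterFS's actual elastic hashing algorithm.
    import bisect
    import hashlib


    class HashRing:
        """Maps file paths to storage servers without a metadata/location server."""

        def __init__(self, servers, vnodes=64):
            # Place several virtual nodes per server on a hash ring to balance load.
            self._ring = sorted(
                (self._hash(f"{server}#{i}"), server)
                for server in servers for i in range(vnodes)
            )
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value: str) -> int:
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def locate(self, path: str) -> str:
            """Any node can compute the responsible server from the path alone."""
            idx = bisect.bisect(self._keys, self._hash(path)) % len(self._ring)
            return self._ring[idx][1]


    ring = HashRing(["server-a", "server-b", "server-c"])
    print(ring.locate("/experiments/brain/scan-042.dat"))

The trade-off noted above is visible in this sketch: adding or removing a server changes the ring and forces some data to move, whereas a metadata-server design could simply update its catalogue.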
Coda [103] is an example of a system with strict support for disconnected mode operations. It offers high availability of files by replicating a file volume across many servers and caching files on the client machine. The server and client communicate through Remote Procedure Calls (RPC). When a server is notified of file updates, it instructs clients which cache copies of the affected files to invalidate these copies. Due to client-side replication of data, the user is able to continue working in case of a network failure. However, aggressive caching may lead to conflicts. Automatic conflict resolution may involve data loss (the user may be unaware of the problem), while manual conflict resolution is inconvenient for users. The main drawback of Coda is its lack of support for organizational distribution of an environment (its cache coherency algorithm would result in high utilization of the network between sites [74]).
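The callback mechanism described above can be sketched as follows. This is a conceptual illustration only, not Coda's actual protocol or RPC interface: the server remembers which clients cache a file and tells them to invalidate their copies when the file changes.

    # Conceptual sketch of callback-based cache invalidation in the spirit of Coda;
    # not Coda's actual protocol or RPC interface.
    from typing import Dict, Set


    class Server:
        def __init__(self) -> None:
            self.files: Dict[str, bytes] = {}
            self.callbacks: Dict[str, Set["Client"]] = {}    # path -> clients caching it

        def fetch(self, client: "Client", path: str) -> bytes:
            # Register a callback promise: the client will be told when path changes.
            self.callbacks.setdefault(path, set()).add(client)
            return self.files[path]

        def store(self, path: str, data: bytes) -> None:
            self.files[path] = data
            # Break callbacks: tell every caching client to invalidate its copy.
            for client in self.callbacks.pop(path, set()):
                client.invalidate(path)


    class Client:
        def __init__(self, server: Server) -> None:
            self.server = server
            self.cache: Dict[str, bytes] = {}

        def read(self, path: str) -> bytes:
            if path not in self.cache:                        # cache miss -> fetch + callback
                self.cache[path] = self.server.fetch(self, path)
            return self.cache[path]

        def invalidate(self, path: str) -> None:
            self.cache.pop(path, None)


    server = Server()
    server.files["/vol/paper.tex"] = b"v1"
    a, b = Client(server), Client(server)
    a.read("/vol/paper.tex"); b.read("/vol/paper.tex")
    server.store("/vol/paper.tex", b"v2")                     # both cached copies invalidated
    print(a.read("/vol/paper.tex"))                           # re-fetched: b'v2'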
The presented systems use different approaches to offline work and caching. Moreover, some of them require dedicated hardware, increasing efficiency at the cost of additional investments. Despite their variety, the presented tools are not well suited for organizationally distributed environments due to their limited support for provider cooperation. This is caused by their centrally managed storage systems, able to provide transparent data access only inside a single organization. Moreover, most of the above-mentioned tools are also difficult to use because of limited support for deployment on resources where some data is already stored. 2.5. TOOLS FOR UNIFIED VIEW OF MULTIORGANIZATIONAL DATA 16

2.5 Tools for Unified View of Multiorganizational Data

Another type of solution for exposing storage resources comprises systems that provide a layer of abstraction on top of storage resources across multiple organizations. They can provide a consistent view of user data stored in different systems. They expose a single namespace for data storage and often facilitate data management by enabling providers to define data management rules.

An exemplary tool is iRODS [19; 33], developed for grid environments. Under iRODS data can be stored in designated folders on any number of data servers. To integrate various external data management systems, such as GridFTP-enabled systems, SRM-compatible systems or Amazon's S3 service, a plug-in mechanism can be used. Data integration in the iRODS system is based on a metadata catalogue – iCAT – that is involved in the processing of the majority of data access requests. Metadata describing the actual data stored in the system includes information such as file name, size and location, along with user-defined parameters. The user can search for data that has been tagged, while the administrator of the system can query the metadata catalogue directly by using an SQL-like language to provide aggregated information about the system. Although iCAT provides a rich set of features for both users and administrators, it also represents a weakness of the iRODS system due to its reliance on a relational database – potentially a systemic bottleneck and a single point of failure.

Due to iRODS's ability to adjust data management to provider/user needs, it is often referred to as adaptive middleware. To allow for dynamic adaptation, the iRODS system uses rules. A rule is a chain of activities provided by low-level modules (built in or supported externally) to facilitate the required functionality. User actions are monitored by the rule engine, which can activate specific rules. Typical user interfaces are available – utilization of a POSIX interface on any FUSE-compatible Unix system [75] is enabled by a FUSE-based filesystem provided by iRODS.
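The rule mechanism can be sketched conceptually as an event-driven chain of activities triggered by user actions. The sketch below is a generic illustration in Python; iRODS expresses such rules in its own rule language, and the action and activity names used here are hypothetical.

    # Conceptual sketch of an event-driven rule engine in the spirit of iRODS rules;
    # iRODS uses its own rule language, and the activities below are hypothetical.
    from typing import Callable, Dict, List

    Activity = Callable[[dict], None]
    rules: Dict[str, List[Activity]] = {}       # user action -> chain of activities


    def on(action: str, *activities: Activity) -> None:
        rules.setdefault(action, []).extend(activities)


    def fire(action: str, context: dict) -> None:
        """The rule engine monitors user actions and activates matching rules."""
        for activity in rules.get(action, []):
            activity(context)


    # Hypothetical activities chained into a rule triggered after a file upload.
    def add_checksum(ctx: dict) -> None:
        print(f"computing checksum for {ctx['path']}")

    def replicate_to_archive(ctx: dict) -> None:
        print(f"replicating {ctx['path']} to an archive resource")

    on("put", add_checksum, replicate_to_archive)
    fire("put", {"path": "/zoneA/home/alice/results.csv"})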
Syndicate Drive is a virtual cloud storage system that combines the advantages of local storage, cloud storage, commodity CDNs, and network caches. Storj is a peer-to-peer cloud-storage network implementing end-to-end encryption to allow users to transfer and share data without support from a third-party data provider. Although these solutions contain algorithms that speed up data access, they are rarely used in common computing infrastructures. The requirement to download data prior to its use is a drawback, since such preparation should be performed in the background.

Another system worth mentioning is FAX (Federating ATLAS storage systems using XrootD) [40]. FAX brings Tier 1, Tier 2, and Tier 3 storage resources together into a common namespace that is accessible anywhere. FAX client software tools (e.g., ROOT, xrdcp) are able to reach data regardless of its location. However, the N2N component that is responsible for mapping global names to physical file names may be a performance bottleneck because of its reliance on LFC.

To enable discovery and identification of datasets, open access services such as DataCite [68; 50] or OpenAIRE [31; 51] can be used. These services rely on standards such as OAI-PMH [7] for integration with existing platforms for publication metadata harvesting, and identify datasets through globally unique handles such as DOI [27] or PID. However, these services do not directly address the issue of accessing the underlying data by end users.

In this context, we should also mention National Data Storage (NDS) [93; 20] – a national initiative and pilot implementation of a distributed data storage system intended to provide high quality backup, archiving and data access services. It introduces several useful features such as advanced monitoring and prediction, as well as replication techniques to increase availability and performance of data access. However, it lacks ease of deployment and scalability.

The tools described in this chapter are all oriented towards high-level data management. The standard POSIX filesystem interface is arguably preferable for most applications. For this reason, the effort undertaken to abstract any custom interface with a POSIX overlay is appreciated by users. However, the main goal of these systems is to enable data access from anywhere in a uniform way, rather than to achieve high performance. Hence, despite the comfort of data access and management that they offer, their applicability to HPC application execution is limited.

Table 2.1: Features of data access solutions
Feature | Tool example
Anytime/anyplace data access with location transparency | Dropbox
Increased storage system efficiency for chosen application on demand | QStorMan
Increased efficiency of data access due to use of dedicated hardware | PanFS
Efficient work with many clients owing to replication of components | BeeGFS
Fast and reliable data transfer between sites thanks to efficient protocols and transfer supervision | Globus Connect
Geographic replication that results in disaster protection and reduces risk of data loss | WOS
Stable and efficient operation when the network is slow or unreliable through client-side caching and strict support for disconnected mode | Coda
High flexibility in storage configuration due to built-in tiering | Scality Ring
Dynamic adaptation and adjusting data management behaviour to provider/user needs due to rules subsystem | iRODS
Support for distributed data management in federations through Zones mechanism | iRODS
Creating a unified view over independent data sources by providing the ability to attach remote data management systems | Parrot
Integration with various storage systems through plug-ins | iRODS
Discovery and identification of datasets | DataCite

2.6 Summary

This chapter summarizes existing data access solutions and identifies several interesting features, as listed in Table 2.1. The features offered by existing tools can be harnessed to meet the requirements of various user groups. However, all of these solutions have drawbacks (see Table 2.2 and Figure 2.1).

Figure 2.1: Drawbacks of existing tools for federated computational environments (red - drawbacks, green - advantages offset by other drawbacks)

In particular, none of the analyzed solutions support all of the following features:

• transparent anytime/anyplace data access,

• high efficiency and scalability,

• distributed (decentralized) data management.

To the best of the author's knowledge, none of the existing services or tools combine all three of the listed elements. Limitations of existing tools have led to various extensions, e.g. IBM AFM (Active File Management) [63] for GPFS or xCache [82] for xRootD. Both AFM and xCache provide additional caching that improves the quality of work in a federation. However, such improvement is only possible when all providers agree to use a common storage solution (GPFS or xRootD) across all sites – and this is often impractical. Furthermore, existing initiatives also introduce certain drawbacks. For example, NDS [93] lacks ease of deployment and scalability, while the DataNet Federation Consortium (DFC) [69] is based on iRODS and therefore exhibits the iRODS drawbacks described earlier in this chapter.

Table 2.2: Existing solution characteristics

Common grid/cloud data access tools:
  Lustre, GPFS – high-performance cluster solutions
  LFC, Swift – provide common file names within a grid or cloud
  QStorMan – improves data access performance on demand
  Drawback: several tools have to be used together – no data access transparency

Tools for anytime/anyplace data access:
  Dropbox, OneDrive, Google Drive, ownCloud – easy to use
  Drawback: limits on storage size and transfer speed

Tools for fast data movement:
  Globus Connect – provides fast data movement and data sharing capabilities based on GridFTP
  Drawback: does not abstract access to data resources

High-performance parallel filesystems:
  BeeGFS – high availability and performance due to scalable multithreaded core components with native Infiniband support
  GlusterFS – scalability and performance due to elastic file hashing algorithm
  Coda – high availability due to strict support for disconnected mode operations
  PanFS – high performance due to combination of functionality of parallel filesystem, volume manager and RAID engine into one holistic platform
  Drawback: designed to be used by a single organization – no support for organizationally distributed environments

Solutions based on object storage:
  CephFS – POSIX-compliant distributed filesystem
  DDN's WOS – true object storage
  Scality Ring – software-only storage solution with built-in tiering that provides high flexibility in storage configuration
  Drawback: designed to be used by a single organization – no support for organizationally distributed environments

Tools for map-reduce:
  HDFS – designed to stream large datasets at high bandwidth to user processes
  Tachyon – provides high performance for map-reduce applications using memory aggressively
  Drawback: no support for organizationally distributed environments

Tools for unified view of multiorganizational data:
  iRODS – integrates various external data management systems using a metadata catalogue (iCAT)
  Parrot – allows attaching existing programs to remote data management systems through a filesystem interface using the ptrace debugging interface
  Syndicate Drive, Storj – based on data download before usage
  FAX – brings Tier 1, Tier 2 and Tier 3 storage resources together into a common namespace
  DataCite, OpenAIRE – enable discovery and identification of datasets
  NDS – provides high quality backup, archiving and data access services
  Drawback: performance and/or scalability is not sufficiently high for data-intensive applications

3 MACAS - Model of Transparent Data Access with Context Awareness

This chapter defines the Model of Transparent Data Access with Context Awareness (MACAS) which enables transparent, easy, efficient and scalable data access supported by knowledge of the distributed environment (i.e. by the context). MACAS assumes that data is stored on sites that are managed by various providers (see Figure 3.1) and accessed by users through client software.

The model introduces a set of layers, each of which provides certain features to fulfill the requirements of a data access stakeholder, and cross-cutting concerns (later referred to as concerns) that describe aspects that affect many layers at once. Each layer makes use of different metadata which describes the context needed to fulfill its tasks. The model defines not only metadata which is required at a particular layer, but also describes metadata consistency and synchronization in a distributed environment. Finally, MACAS defines an algorithm that shows how the data access system should use layers and concerns to provide functionality to the user. Thus, it can be concluded that MACAS is defined by the following elements: metadata, layers and concerns, and the algorithm that makes use of the above-mentioned elements.

Metadata consistency and synchronization models determine the guarantees of consistency, availability and partition tolerance (see CAP theorem [3]) that can be provided by the system based on the MACAS model. Provider independence is a strong argument to consider availability and partition tolerance as more important than consistency – this is because data availability should not be limited by the state of resources that are not managed by the provider that hosts the data. However, lack of consistency results in different data views on different sites. Data access cannot be considered transparent if it depends on the place of access. Thus, MACAS divides metadata into groups that are treated differently to provide a balance between these mutually exclusive guarantees.

To complete the model, three steps must be performed:

1. identification of data access stakeholders and analysis of their requirements,

2. identification of metadata required to fulfill stakeholders' requirements and definition of metadata classes with different consistency and synchronization models to avoid bottlenecks,

Figure 3.1: Scheme of the data access system

Figure 3.2: Stakeholders’ relation to data access in federated computational environments

3. design of MACAS layers and concerns, including the context and functionality of each MACAS layer and concern along with its algorithmic representation.

The author performed the main tasks associated with steps 1-2, along with full personal involvement in step 3, i.e., in MACAS model development.

3.1 Data Access Stakeholders

Three main classes of data access stakeholders can be identified: users, providers and developers. The user expects a set of specific features while the provider tries to satisfy the user's requirements with the limited resources at their disposal (see Figure 3.2). The developer creates services that provide functionality for the user. These services represent IT platforms or tools which support use cases typical for a given scientific discipline. Thus, users and providers are the most important stakeholders. The developer can be perceived as an advanced user that requires additional functionality to allow integration of newly created services with the data access system. The main issues from the users' perspective are enumerated below. They are encountered while dealing with Big Data characterized by volume, velocity, variety and value:

1. easy anytime/anywhere data access,

2. easy data sharing,

3. efficient access to large volumes of data,

4. archiving or publishing data,

5. advanced control over data storage,

6. data dynamics,

7. data security.

These issues will now be described in further detail.

(1) One of the problems often encountered by the users is the lack of a uniform and easy method for anytime/anywhere access [132] to data located on distributed sites and storage systems. The users expect to access and manage their distributed data wherever they work. (2) If they cooperate with other users, they also expect easy sharing and processing of data even when the collaborators work at different sites. (3) The users also process large volumes of data and hence expect efficient data access with low latency and high throughput. (4) However, some users also require long-term storage for archiving and/or publishing data. Thus, optimization of data access efficiency cannot lead to changes in access paths.

(5) While many users expect simplicity in the above-mentioned use cases, some of the most advanced users may require better control over data storage and processing. Therefore, users should be able to influence data management policies. For example, some confidential and important data may be kept only on specific secure storage resources. Advanced customization may be required to influence system management by tagging data with the appropriate metadata. (6) Some users work on large datasets subjected to continuous modifications by external services. Such users expect dedicated features to ensure consistency of externally modified data. (7) Provisioning of any new functionality should not affect data security; hence appropriate permissions have to be set for each storage system.

Issue (6) is closely related to developer expectations. Developers create services that process data; therefore, they expect easy integration of their services with the data access system. In other words, they require an API that enables their services to access data on behalf of the user or ensure that the data access system can detect changes performed directly upon storage resources.

The providers' point of view also takes into account user expectations with regard to efficient, easy and safe access to data. Providers manage data and resources in such a way as to maximize end-user benefits. Thus, they expect:

1. the ability to tune the management algorithms to their resources,

2. monitoring and management,

3. accounting,

4. data security.

These features are explained in more detail below.

(1) Since the storage systems used by each provider differ in terms of speed, capacity and cost of purchase and maintenance, providers require the ability to influence automatic data management in order to tune the management algorithms to their resources. Therefore, providers must be able to choose a data management policy. Moreover, providers require the ability to differentiate access policies according to several factors, e.g., data format, storage system characteristics, as well as user requirements described by metadata. (2) Providers also expect monitoring and management of resource usage in order to ensure fair allocation and prevent deterioration of service quality due to unjustified usage or high demand. (3) Another issue connected with fair resource sharing is accounting. Providers are limited by costs incurred not only when purchasing resources, but also during their operation (e.g., power expenditures). Thus, to control costs, providers must be able to measure resource usage associated with specific user operations, and block activity which exceeds user quota. (4) Cooperation of providers results in additional security and operational aspects. Although providers agree to certain cooperation rules when forming a federation, they usually require full autonomy of internal resource management and access control. Thus, data access requests originating from sites which belong to other providers must be accompanied by information needed to validate access permissions.

All of the above issues are related to features of the data access system, as outlined in Table 3.1. MACAS was designed to allow development of data access systems that fulfill all the stated requirements. Focusing on particular features enables better identification of the required context (metadata) as well as better decomposition of the model into layers and concerns. These features (referenced by numbers) are then directly addressed by the algorithmic representation of MACAS (see Chapter 3.4.2).

Table 3.1: Features of the data access system expected by stakeholders
Stakeholder | Feature of data access system | Related issue(s)
Provider | 1. Management capable of following different policies | Provider 1
Provider | 2. Control of storage systems for fair resource sharing | Provider 2
Provider | 3. Accounting | Provider 3
User and Provider | 4. Data storage and access security | Provider 4, User 7
User and Provider | 5. Safe cooperation with other providers | Provider 4, User 7
User | 6. Access to distributed multiprovider environment from one place | User 1
User | 7. Advanced cooperation with other providers' users | User 2
User | 8. Efficient processing of large amounts of data | User 3
User | 9. Long-term reliable data storage | User 4, 5
User | 10. Easy realization of typical use cases with advanced customization for advanced users | User 1, 5
Developer or User | 11. Integration with external domain-specific services | User 6
Developer or User | 12. Support for integration with legacy datasets subject to frequent modifications | User 6

3.2 Basis of MACAS

MACAS aims to provide features required by the stakeholders (see Chapter 3.1). The author has decided to take advantage of automatically managed metadata. To maintain ease of access, MACAS assumes that data and metadata management are transparent to the user. Although the model does not exclude user involvement in metadata management, it assumes that such involvement is optional, reserved only for the most advanced users. Typically, the system should generate the required metadata by monitoring the environment along with user actions (i.e. it should be context-aware). Thus, key features of MACAS are data access transparency and context awareness. Relying on automatically managed metadata not only hides the system's complexity from the user but also provides flexibility, e.g., it allows easy import of large datasets without the need to copy large quantities of data (only metadata is created) or to modify legacy systems that use such datasets (a MACAS-compliant system may use a legacy dataset directly as long as it is able to import its contents).

Figure 3.3: Basic metadata used during data access

MACAS assumes that data is managed using logical files and datasets. Basic metadata is connected with logical files while datasets group logical files to simplify management. The most important part of a logical file's metadata is the description of datachunks (see Figure 3.3). Each logical file consists of one or more datachunks. The role of a datachunk is similar to that of a block in a standard filesystem; however, it may refer to multiple blocks in a block storage system or to other entities in different systems (e.g., objects). Although a datachunk can be stored using one or several storage systems, the use of many storage systems to store a single datachunk is limited to replication of the entire datachunk. If a datachunk needs to be split into several parts in such a way that each part is stored in a different storage system, it is divided into smaller datachunks. Thus, datachunks are of variable size (see Chapter 4.1).

An important assumption behind MACAS is that transparent management must not constitute a bottleneck of a MACAS-compliant data access system. To provide data access efficiency, the data access system should be able to use the full potential of high-performance storage systems where datachunks are stored. Moreover, replication of data to multiple storage systems may increase data access throughput. The use of metadata allows such optimization to be hidden from users and can maintain access paths for a long time. However, it is impossible to preserve the quality of access if metadata processing introduces large overheads (as is often the case [13; 11; 15]). Such overheads can be related to metadata access over the network. Thus, metadata is replicated across sites and clients. Minimization of metadata access overhead via replication may produce inconsistencies or additional overhead associated with metadata updates. As the environment grows (in terms of round-trip times), determination of appropriate metadata consistency and synchronization models becomes necessary. Thus, to provide transparent, easy, efficient and scalable data access, MACAS defines metadata together with consistency and synchronization models to process metadata locally and/or asynchronously whenever possible. This approach constitutes the basis upon which MACAS layers, concerns and algorithm are defined.
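To make the relationship between logical files and datachunks described above more concrete, the following sketch (in Python, purely illustrative and not part of the thesis implementation; all names are hypothetical) models a logical file as a list of variable-size datachunks, each replicated only as a whole, with a split operation used when parts of a chunk must be placed on different storage systems.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Datachunk:
    """A contiguous byte range of a logical file, replicated only as a whole."""
    offset: int                      # position within the logical file
    size: int                        # variable size (not a fixed block size)
    replicas: Set[str] = field(default_factory=set)  # storage system identifiers


@dataclass
class LogicalFile:
    """Location metadata of a logical file: an ordered list of datachunks."""
    name: str
    chunks: List[Datachunk] = field(default_factory=list)

    def split_chunk(self, index: int, split_at: int) -> None:
        """Split one datachunk into two smaller ones, e.g. when the two parts
        are to be stored on different storage systems."""
        chunk = self.chunks[index]
        assert 0 < split_at < chunk.size
        left = Datachunk(chunk.offset, split_at, set(chunk.replicas))
        right = Datachunk(chunk.offset + split_at, chunk.size - split_at,
                          set(chunk.replicas))
        self.chunks[index:index + 1] = [left, right]


# Example: a 10 MiB file stored as a single chunk on storage "ceph-1",
# later split so that its second half can be placed on "s3-archive".
f = LogicalFile("dataset/results.bin", [Datachunk(0, 10 * 2**20, {"ceph-1"})])
f.split_chunk(0, 4 * 2**20)
f.chunks[1].replicas = {"s3-archive"}
```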

The above-mentioned metadata represents only a small part of metadata useful during data access and management. All kinds of metadata can be used by the data access system:

• descriptive metadata that describes resources for purposes such as discovery and identification, e.g., metadata that describes storage systems [106],

• structural metadata that describes data containers and indicates how compound objects are put together [106], e.g., metadata that describes the structure of a filesystem,

• administrative metadata that provides information to help manage a resource, such as access times or access permissions [106],

• statistical metadata (also called process data) that describes processes that collect, process, or produce statistical data [105], e.g., information about elements of the accounting subsystem,

• reference metadata that describes the contents and quality of statistical data.

Data is accessed through client software which can be used not only by the end user, but also by external services that access data on behalf of users – instead of inconveniently uploading data to a service before processing and downloading results thereafter. The key assumption connected with client design is that data access requests produced by a particular client are handled with the resources of a single site. Binding a client to a single site has its pros and cons. When the client requests data not available on the site which handles that client’s requests, the data access system has to replicate data to the appropriate storage system before the client can access it. However, such an approach also simplifies metadata management because the client might otherwise obtain conflicting information from several external sites (e.g., due to temporary metadata inconsistencies). The author has opted for this approach because processing of some metadata by the client can greatly reduce access delays, and moreover the presented approach avoids conflicts in client-side metadata caches. In addition, appropriate use of dataset metadata to choose the site that handles client requests limits the amount of data that needs to be replicated. Finally, creation of such replicas also reduces network traffic because further data access operations (including those performed by other clients) can be handled using the replica that is closest to the client. Although MACAS does not define an algorithm for datachunk (data) replication, it recom- mends that synchronous modifications (occurring during data access operations) are only per- formed on replicas stored on a single site. This mitigates access delays and overheads connected with synchronous reconciliation of replicas over networks with high round-trip times. The sim- plest algorithm that fulfills this recommendation is one that assumes creation of a replica when reading data, and invalidation of all replicas (except the one being currently modified) when writing data (see Figure 3.4 and Chapter 4.1 for more information about default datachunk management policy).

3.3 Context Modelling in MACAS

In MACAS, context is introduced to support the stakeholders' operations, particularly with regard to storage and access to large volumes of data distributed across heterogeneous resources. The context is modelled using metadata. This results in parallel processing of metadata on multiple sites, and can therefore introduce significant overhead, potentially bottlenecking the entire data access system.

Figure 3.4: Example of datachunk replication

Since the use of metadata allows for effective access to data, a key design assumption involves replication of metadata for low metadata access delays. As a result, the system must provide efficient consistency and synchronization models for metadata, eliminating bottlenecks and providing fault tolerance. The use of weaker consistency models limits access delays while overhead can be limited by decreasing the number of entities involved in metadata synchronization. On the other hand, since some features require synchronization upon accessing metadata, stronger consistency must also be provided as an option.

3.3.1 Types of Metadata

The following types of metadata are identified:

• user metadata – provides high-level information about the users, often obtained from third-party authentication services or during registration,

• site metadata – contains information about the site name, supported authorization methods, endpoints and the provider which owns the site,

• storage metadata – composed of information about storage types and interfaces, as well as restrictions and capabilities such as availability, latency, and throughput,

• dataset metadata – provides information about data collections that are managed together, containing the name of the collection, access control list and list of sites storing elements of the collection,

• namespace metadata – holds identifiers of entities (e.g., files, documents) stored within a dataset,

• administrative metadata – describes entities stored within a dataset (e.g., creator, owner, access times, permissions, size),

• custom metadata – defined by the users; it may have an arbitrary structure and/or indicate special handling of particular data,

• location metadata – provides a mapping between logical data entities created by the users and the distributed locations of the actual data (datachunks),

• activity metadata – used to track current activity (e.g., information about user sessions, open resource handles, usage statistics, measurements from monitoring of hardware resources).

Figure 3.5 shows the main dependencies between metadata types. Namespace metadata is connected with dataset metadata – the identifiers (namespace metadata) are created within data collections described by dataset metadata. Administrative, custom and location metadata is created for entities indicated by these identifiers. A dataset can be stored by several sites described by site metadata and accessed by several users described by user metadata. Each site can have several storage systems described by storage metadata. Activity metadata can be connected to other metadata types depending on the activity it tracks (e.g., user activity or storage usage).

Figure 3.5: Main metadata dependencies

3.3.2 Description of Metadata

In the scope of context modelling, metadata relates to clients and sites (see Figure 3.1). It is assumed that each client is associated with a single site that handles its requests (see Chapter 3.2). Although each site is owned by a provider, this does not influence metadata management and providers are not used to describe metadata. The following definitions are adopted:

• basic sets:

– M = {m : m is metadata} - set of metadata,
– S = {s : s is site} - set of sites,
– C = {c : c is client} - set of clients,
– E = {e : e is entity} = S ∪ C - set of entities (sites and clients),
– T = {t : t is type of metadata} = {u, si, d, n, a, cu, l, st, act} - set of metadata types (see Table 3.2 for mapping of symbols to types defined in Chapter 3.3.1),

• relations:

– RT : m RT t - metadata m ∈ M is related to type t ∈ T if the type of metadata m is t,

Table 3.2: Abbreviations for types of metadata used in equations
Symbol | Type of metadata
u | user metadata
si | site metadata
d | dataset metadata
n | namespace metadata
a | administrative metadata
cu | custom metadata
l | location metadata
st | storage metadata
act | activity metadata

– RC : c RC s - client c ∈ C is related to site s ∈ S if site s processes the requests of client c,
– RE : m RE e - metadata m ∈ M is related to entity e ∈ E if metadata m is processed by entity e.

Based on these definitions, the distribution of metadata can be described. Client software processes selected types of metadata, while each site is capable of processing all types of metadata:

\[ \bigwedge_{c \in C} \{ m : m \; R_E \; c \} = \bigcup_{t \in \{n, a, cu, l, st, act\}} \{ m : m \; R_E \; c \land m \; R_T \; t \} \]   (3.1)

\[ \bigwedge_{s \in S} \{ m : m \; R_E \; s \} = \bigcup_{t \in T} \{ m : m \; R_E \; s \land m \; R_T \; t \} \]   (3.2)

Although not all client software must process namespace, administrative, custom, location, storage and activity metadata, the use of a client that processes metadata to reduce the negative impact of network communication is assumed (see Chapters 3.4 and 4.1). All sites operate on the same user, site and dataset metadata:

\[ \bigwedge_{t \in \{u, si, d\}} \; \bigwedge_{s_1, s_2 \in S} \{ m : m \; R_E \; s_1 \land m \; R_T \; t \} = \{ m : m \; R_E \; s_2 \land m \; R_T \; t \} \]   (3.3)

Namespace, administrative, custom and location metadata processed by all clients connected to a particular site forms a subset of that site's metadata:

\[ \bigwedge_{t \in \{n, a, cu, l\}} \; \bigwedge_{s \in S} \; \bigcup_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_T \; t \} \subseteq \{ m : m \; R_E \; s \land m \; R_T \; t \} \]   (3.4)

Activity and storage metadata is local for each site and its clients:

\[ \bigwedge_{t \in \{st, act\}} \; \bigwedge_{s_1, s_2 \in S} (s_1 \neq s_2) \Rightarrow (\{ m : m \; R_E \; s_1 \land m \; R_T \; t \} \cap \{ m : m \; R_E \; s_2 \land m \; R_T \; t \} = \emptyset) \]   (3.5)
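The distribution rules formalised in equations (3.1)–(3.5) can be read as invariants over (metadata, entity) assignments. The following illustrative check (Python; the snapshot and all identifiers are hypothetical and not part of the thesis) verifies two of them: clients hold only the permitted metadata types, and storage/activity metadata is never shared between sites.

```python
# Metadata types, following the abbreviations used in the equations.
CLIENT_TYPES = {"n", "a", "cu", "l", "st", "act"}   # eq. (3.1)
SITE_LOCAL_TYPES = {"st", "act"}                    # eq. (3.5)

# Hypothetical snapshot: entity -> set of (metadata id, metadata type) pairs.
snapshot = {
    ("client", "c1"): {("file-42-loc", "l"), ("c1-session", "act")},
    ("site", "s1"):   {("file-42-loc", "l"), ("s1-disk-stats", "st")},
    ("site", "s2"):   {("file-42-loc", "l"), ("s2-disk-stats", "st")},
}

def check_client_types(snapshot):
    """Eq. (3.1): a client processes only namespace, administrative, custom,
    location, storage and activity metadata."""
    for (kind, _), items in snapshot.items():
        if kind == "client":
            assert all(t in CLIENT_TYPES for _, t in items)

def check_site_locality(snapshot):
    """Eq. (3.5): storage and activity metadata of different sites is disjoint."""
    sites = [(e, items) for e, items in snapshot.items() if e[0] == "site"]
    for i, (_, items1) in enumerate(sites):
        for _, items2 in sites[i + 1:]:
            local1 = {m for m, t in items1 if t in SITE_LOCAL_TYPES}
            local2 = {m for m, t in items2 if t in SITE_LOCAL_TYPES}
            assert not (local1 & local2)

check_client_types(snapshot)
check_site_locality(snapshot)
```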

Subsets of other types of metadata processed by different sites can differ, depending on the datasets hosted by a particular site.

Table 3.3: Classes of metadata with abbreviations used in equations
Class of metadata | Symbol | Class name
class 1 | cl1 | cooperation metadata
class 2 | cl2 | logical file metadata
class 3 | cl3 | runtime metadata
class 3a | cl3a | private runtime metadata
class 3b | cl3b | shared runtime metadata
class 3c | cl3c | public runtime metadata

3.3.3 Classification of Metadata

Given the distribution of metadata, metadata classes are introduced in Table 3.3 and further defined in this chapter. Therefore, a set of metadata classes can be distinguished:

• CL = {cl : cl is class of metadata} = {cl1, cl2, cl3, cl3a, cl3b, cl3c}.

Additionally, the following relation is defined:

• RCL : m RCL cl - metadata m ∈ M is related to class cl ∈ CL if the class of metadata m is cl.

The sets of metadata that belong to the main classes are defined below, while the properties of class 3 and its subclasses are described later in this chapter. Class 1 - cooperation metadata consists of user, site and dataset metadata:

\[ \{ m : m \; R_{CL} \; cl1 \} = \bigcup_{t \in \{u, si, d\}} \{ m : m \; R_T \; t \} \]   (3.6)

Class 2 - logical file metadata consists of namespace, administrative, custom and location metadata:

\[ \{ m : m \; R_{CL} \; cl2 \} = \bigcup_{t \in \{n, a, cu, l\}} \{ m : m \; R_T \; t \} \]   (3.7)

Class 3 - runtime metadata consists of storage and activity metadata:

\[ \{ m : m \; R_{CL} \; cl3 \} = \bigcup_{t \in \{st, act\}} \{ m : m \; R_T \; t \} \]   (3.8)

All sites operate on the same cooperation metadata (cl1):

\[ \bigwedge_{s_1, s_2 \in S} \{ m : m \; R_E \; s_1 \land m \; R_{CL} \; cl1 \} = \{ m : m \; R_E \; s_2 \land m \; R_{CL} \; cl1 \} \]   (3.9)

Cooperation metadata (cl1) is not used by any client:

\[ \bigwedge_{c \in C} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl1 \} = \emptyset \]   (3.10)

Cooperation metadata is used for authentication, authorization and routing of data access requests. User metadata and site metadata are used to authenticate the user. The user can use different accounts to benefit from resources on multiple sites, belonging to different providers; hence, a reference to a local account is required. Such metadata also contains a description of user relations (e.g., groups the user belongs to) which are often needed when setting permissions. Site metadata and dataset metadata are required to choose the site that should handle a data access request. This type of metadata rarely changes but is frequently read. It is critical to respect each change in cooperation metadata as otherwise a security breach might occur. However, delayed propagation of changes in this metadata is allowed in many use cases, e.g., many supercomputing centres require that their data access system disallows access not later than 24 hours after a user account is banned or has expired.

MACAS assumes that data is presented to the user as 'logical files'. Logical file metadata (cl2) is shared between sites such that each site is interested in a subset of this metadata. Subsets required by different sites may share common parts. Each client can access the part of class 2 metadata that is processed by the site handling its requests:

\[ \bigwedge_{s \in S} \; \bigcup_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl2 \} \subseteq \{ m : m \; R_E \; s \land m \; R_{CL} \; cl2 \} \]   (3.11)

Namespace metadata describes the structure of data. It may specify a flat entity structure, such as in a key-value store, or a hierarchical one, such as in a filesystem. Each logical file is associated with administrative metadata and location metadata. Logical files can be translated to different entities depending on the dataset type, e.g. files for a filesystem or documents for a NoSQL database. Thus, the structure of administrative metadata and location metadata can vary. The metadata itself is usually accessed and updated frequently. Logical file metadata can also include custom metadata defined by the user. The usage characteristics of custom metadata depend mainly on the user and may result in frequent read operations as well as frequent updates. Activity metadata describes the state of resources and data usage. Together with storage metadata, it constitutes runtime metadata (cl3). Each site manages its runtime metadata independently:

\[ \bigwedge_{s_1, s_2 \in S} (s_1 \neq s_2) \Rightarrow (\{ m : m \; R_E \; s_1 \land m \; R_{CL} \; cl3 \} \cap \{ m : m \; R_E \; s_2 \land m \; R_{CL} \; cl3 \} = \emptyset) \]   (3.12)

Classes cl3a, cl3b and cl3c are subsets of runtime metadata stored and processed in similar ways. Private runtime metadata (cl3a) describes the information needed for continuous operation (information shared between requests), e.g., file handles. Private runtime metadata is not shared with anyone. Corresponding metadata structures may be created by a client and at a specific site, e.g., when the user opens a file, the client can create a handle and the corresponding information about the opened file can also be saved on the site. However, these are different metadata structures, because the client requires information about storage system-specific file/object handles, while on the site it is enough to store an identifier of the client that opened the file. Private runtime metadata stored by sites also contains the part of storage metadata which clients are not interested in:

\[ \bigwedge_{c_1, c_2 \in C} (c_1 \neq c_2) \Rightarrow (\{ m : m \; R_E \; c_1 \land m \; R_{CL} \; cl3a \} \cap \{ m : m \; R_E \; c_2 \land m \; R_{CL} \; cl3a \} = \emptyset) \]   (3.13)

\[ \bigwedge_{s \in S} \; \bigwedge_{c \in C} \{ m : m \; R_E \; s \land m \; R_{CL} \; cl3a \} \cap \{ m : m \; R_E \; c \land m \; R_{CL} \; cl3a \} = \emptyset \]   (3.14)

A certain portion of runtime metadata (cl3b) is shared between the client and the site. This mainly concerns metadata that describes the environment. While it is useful to gather it on the site in order to provide load control and fair resource sharing, clients are typically not interested in exchanging such data. Thus, class 3b metadata is shared only between a specific client and the site:

\[ \bigwedge_{c_1, c_2 \in C} (c_1 \neq c_2) \Rightarrow (\{ m : m \; R_E \; c_1 \land m \; R_{CL} \; cl3b \} \cap \{ m : m \; R_E \; c_2 \land m \; R_{CL} \; cl3b \} = \emptyset) \]   (3.15)

\[ \bigwedge_{s \in S} \{ m : m \; R_E \; s \land m \; R_{CL} \; cl3b \} = \bigcup_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl3b \} \]   (3.16)

The next component of runtime metadata is public runtime metadata (cl3c), shared between the site and its clients. It is used to control the site configuration, e.g., parameters of storage systems. It can also be used to set limits that are not client-specific, e.g., throttle¹ all clients (throttling only one selected client requires the use of class 3a). This type of metadata also contains a portion of storage metadata that is needed by all clients, e.g., description of interfaces.

\[ \bigwedge_{s \in S} \; \bigwedge_{c_1, c_2 \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c_1 \land m \; R_{CL} \; cl3c \} = \{ m : m \; R_E \; c_2 \land m \; R_{CL} \; cl3c \} \]   (3.17)

\[ \bigwedge_{s \in S} \; \bigwedge_{c \in \{c \,:\, c \; R_C \; s\}} \{ m : m \; R_E \; c \land m \; R_{CL} \; cl3c \} = \{ m : m \; R_E \; s \land m \; R_{CL} \; cl3c \} \]   (3.18)
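As an illustration of the single-writer pattern of public runtime metadata (class 3c) described above, the sketch below (hypothetical Python, not the thesis implementation) shows a site lazily broadcasting a throttling setting to all of its clients, which only read it.

```python
class Site:
    """Single writer of public runtime metadata (class 3c)."""

    def __init__(self):
        self.public_runtime = {"max_requests_per_s": 1000}  # cl3c metadata
        self.clients = []

    def set_throttle(self, value: int) -> None:
        # Only the site modifies class 3c metadata; clients never write it.
        self.public_runtime["max_requests_per_s"] = value
        self.broadcast()  # lazy broadcast may also be deferred or batched

    def broadcast(self) -> None:
        for client in self.clients:
            client.public_runtime_cache = dict(self.public_runtime)


class Client:
    """Reads class 3c metadata from its local (eventually consistent) cache."""

    def __init__(self, site: Site):
        self.public_runtime_cache = dict(site.public_runtime)
        site.clients.append(self)

    def allowed_rate(self) -> int:
        return self.public_runtime_cache["max_requests_per_s"]


site = Site()
c1, c2 = Client(site), Client(site)
site.set_throttle(100)                       # throttle all clients at once
print(c1.allowed_rate(), c2.allowed_rate())  # 100 100
```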

Metadata classes are defined in relation to distribution of metadata. Thus, for distributed metadata, consistency and synchronization models have to be specified.

¹To throttle – to regulate speed (usually implies slowing down).

3.3.4 Metadata Consistency and Synchronization Models

Keeping metadata synchronized at all times in a large-scale, distributed data access system is too expensive and wastes processing resources, since CPUs will often have to wait for synchronization of metadata. Moreover, it reduces the system's availability in case of network failure. Although temporary metadata inconsistencies result in potential problems, such as data loss caused by conflicting metadata changes (see Chapter 4.1.3), these can usually be alleviated by modifying data processing scripts in order to adapt them to a distributed environment. Thus, the author has decided to analyze the possibility of using weaker consistency models and delayed synchronization.

As stated before, class 1 metadata is read intensively and rarely changes. As it is critical to acknowledge each change, weak consistency models (e.g., eventual consistency) are not applicable in this context. However, delayed propagation of changes is often permitted. The use of lazy (instead of eager) replication may accelerate operations; hence MACAS does not require the use of a strong consistency model for class 1 metadata. Instead, sequential or causal consistency is permitted. The causal consistency model is weaker than the sequential consistency model because it stipulates that only causally related writes must be observed in a set order by all processes. Thus, the author has decided to apply causal consistency for better scalability.

Class 2 metadata typically has a much larger volume than class 1 metadata, and is also subject to more frequent updates. Applying causal consistency to class 2 metadata would represent a bottleneck. As the user expects to be able to access all data everywhere regardless of its origin, the author has decided to use eventual consistency for class 2 metadata. This metadata is changed locally and synchronized asynchronously. If a conflict occurs, it is resolved automatically. To decrease synchronization overhead, the synchronization model also assumes that metadata is synchronized only within the sites (and their clients) that are interested in a particular dataset. Information about sites interested in the metadata of a particular dataset is stored in dataset metadata that is synchronized across all providers. The proposed consistency and synchronization model enables efficient parallel processing of metadata on multiple sites. This efficiency comes at the cost of possible conflicts and temporary inconsistencies when metadata changes. Since MACAS assumes support for HPC, its metadata management cannot represent a bottleneck; thus, support for conflict-free data access is realized with the help of dataset metadata. By using dataset metadata the user can indicate that class 2 metadata of a particular dataset should be managed by a single site, at the cost of limiting the performance of access to data stored in such a dataset.

Class 3a metadata is always managed by a single entity (client or site) and therefore not subject to synchronization. Class 3b metadata is processed by a single site and client, while class 3c is processed by a single site and many clients. In both cases the metadata is used mainly for monitoring and behavioral tuning, so it is possible to propagate changes with some delay. Thus, both metadata classes (3b and 3c) apply the eventual consistency model, differing in terms of their synchronization scope.
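MACAS only requires that class 2 metadata converges under eventual consistency with automatic conflict resolution; the concrete mechanism belongs to the implementation (see Chapter 4.1.3). For illustration only, the sketch below shows one common approach based on version vectors with a deterministic tie-break; all names and the metadata layout are hypothetical.

```python
def merge_class2(local, remote):
    """Merge two replicas of a class 2 metadata document.

    Each replica carries a version vector {site_id: counter}. If one vector
    dominates the other, its value wins; otherwise the update is concurrent
    and a deterministic rule (here: higher last_writer id) resolves the conflict.
    """
    lv, rv = local["version"], remote["version"]
    sites = set(lv) | set(rv)
    local_dominates = all(lv.get(s, 0) >= rv.get(s, 0) for s in sites)
    remote_dominates = all(rv.get(s, 0) >= lv.get(s, 0) for s in sites)

    if local_dominates and not remote_dominates:
        winner = local
    elif remote_dominates and not local_dominates:
        winner = remote
    else:
        # Concurrent modification: resolve automatically and deterministically,
        # so that every site converges to the same value.
        winner = max(local, remote, key=lambda d: d["last_writer"])

    merged_version = {s: max(lv.get(s, 0), rv.get(s, 0)) for s in sites}
    return {**winner, "version": merged_version}


a = {"size": 100, "version": {"s1": 2, "s2": 1}, "last_writer": "s1"}
b = {"size": 120, "version": {"s1": 1, "s2": 2}, "last_writer": "s2"}
print(merge_class2(a, b))  # concurrent -> s2 wins, version {'s1': 2, 's2': 2}
```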
It should also be mentioned that there are no conflicts in class 3 metadata because, even when many entities are interested in particular metadata, it is always modified by a single entity, e.g., the client modifies metadata describing the machine upon which it operates, while the site only analyzes this data. Similarly, the site modifies metadata that contains throttling settings while clients are limited to reading it.

Consistency and synchronization models of metadata classes are summarized in Table 3.4. Combining several models is necessary to ensure data access transparency and quality in terms of performance. While using weaker consistency models and limiting the scope of synchronization is necessary to process metadata efficiently, such models are not applicable to metadata crucial for security and integration of different users' local accounts, which is necessary to provide data access transparency. Usage of weaker consistency models also results in temporary differences in data views. Thus, appropriate parameterization of synchronization methods is needed to maintain data access transparency in terms of place of access (see Chapter 4.1.3). Further work with metadata within this thesis comprises two main activities: designing a metadata processing mechanism and implementing it in a distributed environment while ensuring efficiency, as presented in Chapter 4.

Table 3.4: Metadata classes
Class | Description | Consistency model | Synchronization model
Class 1 (cl1) | cooperation metadata | causal consistency | all sites interested in metadata, modifications are executed in order to avoid conflicts, lazy changes broadcast to all sites
Class 2 (cl2) | logical file metadata | eventual consistency | subset of sites and their clients interested in particular metadata, parallel modifications with automatic conflict resolution
Class 3a (cl3a) | private runtime metadata | – | –
Class 3b (cl3b) | shared runtime metadata | eventual consistency | single client and site interested in particular metadata, single writer, lazy change replication
Class 3c (cl3c) | public runtime metadata | eventual consistency | single site and many clients interested in particular metadata, single writer, lazy change broadcast

3.4 Model Description

The Model of Transparent Data Access with Context Awareness consists of layers that provide different features to fulfill stakeholders' requirements (see Chapter 3.1). This is achieved by using contextual information described by metadata. The following layers are defined:

• Access Layer – enabling interaction with diverse storage resources. Allows handling of different data formats. Uses class 3 metadata to provide quality connected with performance, security and reliability.

• Executive Layer – coordinates execution of multiple operations for using more than one storage system. Manages data according to the selected policy. Uses class 2 and 3 metadata.

• Routing and Performance Layer – covers technical aspects connected with distribution of the environment. Uses class 3 metadata to balance utilization of resources and react to brief load fluctuations, and class 1 metadata for redirections between sites.

• Communication Layer – provides interfaces that enable client interaction with the data access system. Uses class 3 metadata to maximize connection capabilities.

• Cooperation Layer – provides the functionality required for cross-site cooperation (including between sites that belong to different providers) using class 1 metadata.

• Client Layer – enables user interaction with the data access system. Uses class 2 and 3 metadata to reduce network overhead.

The model also includes two cross-cutting concerns:

• Monitoring and Management Concern – produces class 3 metadata used by the layers.

Figure 3.6: Model of Transparent Data Access with Context Awareness

• Security Concern – ensures data safety. Uses class 1 metadata to provide different types of access credentials for different layers, and uses class 1 and class 2 metadata to verify access permissions for a particular file.

Appropriate combinations of features provided by each layer and concern facilitate support for various use cases (see Chapter 3.4.2).

3.4.1 Description of MACAS Layers and Concerns

The MACAS layers and concerns are depicted in Figure 3.6. Metadata classes used in the figure refer to classes defined in Chapters 3.3.3 and 3.3.4 (see Table 3.4).

The Access Layer provides access to data stored by users regardless of the point of access and location of data. It unifies data access methods through virtualization, providing the illusion that all data is stored locally. The Access Layer unifies access methods for storage systems that differ with respect to interfaces and basic concepts (e.g., keeping data in classic filesystems vs. database tables vs. flat object structures). It possesses context awareness in terms of storage system configuration and can provide quality in terms of low-level data access operations. Although the performance of access operations is strictly connected with storage system capabilities, the Access Layer enables better utilization of the storage system, e.g., buffering writes for block size optimization (see Chapter 5.2.1). It also provides data storage security and reliability, e.g., encrypting or replicating data across multiple storage nodes.

The Executive Layer is introduced to meet the needs of user groups which require several distinct storage resources. It handles management of storage resources virtualized by the Access Layer to create a unified view of data stored on several virtualized resources. The Executive Layer is characterized by high context awareness. It enforces selected policies using class 3 metadata to detect the state of the environment and class 2 metadata to acknowledge user requirements. The functionality of the Executive Layer plays a vital role in ensuring quality by choosing/combining different data access and storage methods for specific activities (e.g., by replicating selected data between storage systems), regardless of differences in quality offered by various storage systems on the Access Layer.

The Client Layer and the Communication Layer both enable the user to interact with the data access system. While many users expect a simple, intuitive interface with basic options, some may require advanced functionality. However, extending the basic interface with advanced options may confuse less advanced users. Thus, the Client Layer provides several data access methods including a POSIX-compliant interface and a WebGUI client suitable for most users. Individual clients may require different methods of communication with other layers, so the Communication Layer hides technical aspects to enable various ways of communicating with clients. This layer handles communication with popular generic clients (e.g., curl [67]) and fulfills advanced requests (e.g., CDMI [66]). The Client Layer and Communication Layer are responsible for quality in terms of ease of use. They are also context-aware with regard to access location and quality of available communication media, and attempt to reduce network overhead (e.g., caching) while maximizing connection throughput.

The Cooperation Layer and Security Concern handle secure cooperation of sites that belong to different providers sharing common goals (according to some agreed-upon policy). Typically, the basic goal is to facilitate the use of a multiprovider environment. This functionality is realized by provisioning coordinated access to class 1 metadata on all sites. The Cooperation Layer provides a unique identification of the user. It facilitates multisite operations despite different authentication and authorization systems used by each provider.
The Security Concern maps the unique user identifier to formats required by layers, e.g., to credentials for accessing a particular storage system. The Security Concern also processes class 2 metadata to check access permissions for a particular file. Additionally, by using class 1 metadata, the Security Concern is responsible for configuration of safe protocols used by the Communication Layer, setting up secure communication channels between sites and enabling authentication of each site. Class 1 metadata managed by the Cooperation Layer can also cover other aspects of cooperation, e.g., indicate that a particular type of request is handled only by a chosen set of sites with adequate resources.

The Routing and Performance Layer forwards requests to system elements that can either handle them directly or reroute them to other sites, as needed. This layer, together with the Executive and Cooperation Layers, is a key element for providing quality in terms of performance, reliability and availability. While the Executive and Cooperation Layers make decisions that have long-term impact on the system, e.g., where the data is stored and/or replicated, the Routing and Performance Layer takes into account the load and current capabilities of resources to balance utilization and react to short-term fluctuations, e.g., temporary overload of a network interface. The Routing and Performance Layer is also able to overcome problems connected with failures. It provides a very important feature – high availability – routing requests only to elements that are currently active (assuming redundancy of system components).

The Monitoring and Management Concern provides mechanisms for monitoring and adapting to the changing environment. It gathers information by executing monitoring tasks, e.g. checking memory utilization or gathering statistics which describe the communication flow. This system state data is subsequently processed to produce class 3 metadata that is used by layers. The author refers to such metadata as advice. The process of creating advice may vary, e.g., it may involve execution of commands or be based on internal counters. Different layers also use advice differently. It can be used to affect the entire system or selected users (e.g., throttling data access for users who have exceeded their quotas). An important responsibility of the Monitoring and Management Concern is accounting, i.e. control over soft and hard access quotas and undertaking appropriate actions when such quotas are exceeded. The policy may call for denying access, blocking a particular operation or activating performance limiters for selected clients or users.

The layers and concerns cover different aspects of quality, while users expect different types of quality, e.g., high performance vs. high security. Such functionality is often provided by different storage systems. Thus, the Executive Layer, which decides how data is distributed across resources with differing characteristics, plays a key role in implementing different policies. By leveraging class 2 metadata which describes user and provider expectations, the Executive Layer is able to act differently in seemingly similar situations.

3.4.2 MACAS Algorithm

Figure 3.7 shows an algorithmic representation of MACAS for a typical use case of accessing user data via client software in order to work with data stored on resources that belong to one or many sites. The figure connects each step of the algorithm with the layer or concern responsible for its realization.

To begin with, the user requests access to data and the client reads input parameters and configuration values. The user action is then processed by the Cooperation Layer, which determines the site and resources that will be used to handle it. Next, the Security Concern provides the appropriate credentials to enable authentication. Once the connection between the client and the site has been established, the user action is further processed, generating one or more low-level client requests. For each request, the client checks the availability of metadata required to handle that request locally. When no suitable metadata can be found, the Communication Layer and Routing and Performance Layer are used to initiate metadata synchronization. The Security Concern verifies each request and maps the user account to appropriate storage system accounts. The Executive Layer is responsible for processing requests connected to logical file metadata management, e.g., choosing the storage system that should be used to handle the next low-level data access request. The Cooperation Layer is used if the request modifies any class 1 metadata, e.g., when it adds a user to a group. If the request involves data access, it is processed by the Access Layer, which is responsible for interaction with the selected storage system. If needed, the Access Layer can delegate some work to other layers, e.g., when data is stored at a remote location. MACAS assumes that, regardless of user activities, the environment is constantly monitored by the Monitoring and Management Concern, which provides advice for layers (advice is provided in the form of metadata – see Chapter 3.4.1).

The described generic algorithm enables provisioning of the features listed in Table 3.1 in Chapter 3.1. It supports a wide range of use cases owing to appropriate implementation of individual steps. Thus, feature numbers used in the following section refer to the aforementioned table, while step numbers are taken from Figure 3.7.

Easy implementation of typical use cases with customization for advanced users (Feature 10) is realized by providing different interfaces – a simple one that exposes only basic operations and a complex one, extended with more advanced operations (algorithm Step 5). The client also provides integration with external domain-specific services (Feature 11), e.g., using tokens to operate on behalf of the user (Steps 3 and 11).
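The request flow described above can be summarized in the following simplified sketch (Python; it is not a literal transcription of Figure 3.7, and all component objects and method names are hypothetical stand-ins for layers and concerns).

```python
def handle_user_action(action, client, cooperation, security,
                       executive, access, routing):
    """Simplified sketch of the MACAS request flow described above
    (illustrative only; the objects stand in for layers and concerns)."""
    # Cooperation Layer: choose the site that handles the action (class 1 metadata).
    site = cooperation.select_site(action.dataset)
    # Security Concern: obtain credentials and authenticate against that site.
    credentials = security.authenticate(client.user, site)

    results = []
    for request in client.translate(action):          # one or more low-level requests
        if not client.has_local_metadata(request):    # Client Layer cache miss
            # Communication + Routing and Performance Layers: synchronize metadata.
            routing.synchronize_metadata(client, site, request)
        # Security Concern: verify the request and map the user to storage accounts.
        security.authorize(credentials, request)
        if request.touches_class1_metadata:
            cooperation.update(request)                # e.g. add a user to a group
        # Executive Layer: pick storage system / access method (class 2 + 3 metadata).
        storage = executive.select_storage(request)
        # Access Layer: interact with the selected storage system.
        results.append(access.execute(storage, request))
    return results
```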

Figure 3.7: Algorithmic representation of MACAS

Support for integration with legacy datasets which are continuously modified by external tools (Feature 12) is possible due to background activities of the Executive Layer. If requested by the user, the Executive Layer initiates the process of creating class 2 metadata for an existing dataset (Step 19). This process then involves monitoring the dataset for external changes.

Long-term reliable data storage (Feature 9) is realized by the Executive Layer. It processes class 2 metadata (especially custom metadata) and chooses the appropriate storage resources (Step 13). Similarly, the Executive Layer uses class 2 metadata to select a safe storage system when data storage and access security is requested (Feature 4). A vital role in providing security is fulfilled by the Security Concern which relies on class 1 metadata to manage access credentials (Step 11).

Interaction with a distributed multiprovider environment from a single access point (Feature 6) is enabled by the Cooperation Layer (Step 2) which uses class 1 metadata to choose the appropriate site (providing efficient access to class 2 metadata) and by the Routing and Performance Layer (Step 10) which proxies requests when needed (e.g., when a single client works with several datasets connected to different sites). When the sites communicate with each other, secure cooperation (Feature 5) is provided by the Security Concern (Step 11) which uses class 1 metadata to verify the permissions of the user on whose behalf one site accesses another. In addition, the Cooperation Layer provides support for advanced cooperation between users supported by different providers (Feature 7), managing class 1 metadata that describes user relations (Step 14).

Management which implements different policies (Feature 1) is provided by the Executive Layer, using metadata (including user-defined metadata) to select the best resources for particular data (Step 13). Control over storage systems to ensure fair resource sharing (Feature 2) and accounting (Feature 3) is enabled by the Monitoring and Management Concern, with class 3 metadata that describes resource usage by specific clients/users (Steps 16 and 17). The Executive Layer uses this metadata to control clients (Step 18).

Class 3 metadata provided by the Monitoring and Management Concern is also important for efficient processing of large amounts of data (Feature 8). It is used by the Access Layer to optimize work with a single storage system (Step 15), the Routing and Performance Layer to react to load fluctuations (Step 10) and the Communication Layer to optimize network usage (Step 4).

The most important role in the entire algorithm falls to the Executive Layer which uses both class 2 and class 3 metadata to choose the optimal storage system and access method for particular data (Step 13). It also creates and manages class 2 metadata that describes access permissions (Step 13) for individual files to allow authorization by the Security Concern (Step 12). However, the Client Layer and the Cooperation Layer are also important. The Client Layer reduces network communication overhead (Step 8) while the Cooperation Layer uses class 1 metadata to indicate sites that can process particular data/metadata efficiently, using resources contributed by multiple sites (Step 2).

3.5 Summary

Stakeholder requirements have been used to identify the desired features of the data access system. The author subsequently defined a data access model – MACAS – using the following elements: metadata, layers and concerns, and the algorithm that makes use of the above-mentioned elements. The goal of MACAS is to ensure data access transparency, implying that data access management should not involve the user unless specifically requested.

The author has decided to base data access on logical files that are composed of datachunks. Datachunks are portions of data stored on one or several storage systems. They are described by automatically managed metadata. To prevent metadata from bottlenecking the data access system even in a highly distributed environment, several classes of metadata with different consistency and synchronization models have been defined. However, weakening of consistency results in possible temporary differences in data views on different sites; hence, appropriate implementation of synchronization methods is necessary to maintain data access transparency in terms of place of access.

The functionality of MACAS is modeled using layers and concerns that process the previously defined metadata. Additionally, an algorithm has been introduced to describe the provisioning of data access using MACAS layers and concerns.

4 Architecture and Selected Aspects of Implementation

In this chapter the architecture and implementation of two main MACAS components – the Data Management Component (DMC) and the FUSE client (FClient), together with the auxiliary Handler component – are discussed. As described in Chapter 1.3 the architecture is obtained by mapping MACAS elements to elements of the architecture itself, and thereafter to components of its implementation. In the implementation phase the Erlang and C programming environments are used. The author has contributed to the overall architecture of the system by co-developing it with Lukasz Dutka, PhD. The author’s work focused on DMC: in particular, the author developed the architecture and implementation of the Data Management Component kernel called DMC Core.

4.1 Overall Architecture of the System

Since data is maintained on sites and accessed via clients, the two main elements of the MACAS model are the Data Management Component (DMC) and the FUSE client (FClient). Both use Handlers to access different types of storage systems. DMC is responsible for management of data and resources on a single site, while FClient provides a user interface (see Figure 4.1). The system exposes a single logical namespace to users while interfacing many storage systems. FClients are launched on resources which facilitate data access, in order to reduce the impact of network communication between user processes and DMC. Each FClient caches metadata required to translate logical file names to storage system locations, and accesses data via its own Handler instances, avoiding contact with DMC whenever possible (see Figure 4.2). Storage Systems represent resources managed by the system. Effective data management is enabled by direct communication between storage systems and DMC instances which manage them. Computing Elements represent nodes which execute user processes, accessing data via FClients.

DMC implements the functionality of the Executive and Communication Layers to provide interfaces to resources and manage the way in which these resources are used. It performs its job by processing metadata. Due to the diversity of resources managed by DMC, data access is delegated to the appropriate Handler responsible for a particular resource. The Handler performs both unification of access methods and access optimization, e.g., by aggregating/splitting write requests depending on the optimal block size.

Figure 4.1: Overall architecture of the system

Figure 4.2: FUSE client (FClient) concept

According to the stated assumptions, DMC can expose multiple interfaces for various client types; however, the most important DMC interface is the one which supports POSIX operations for FClients. Each FClient accesses data on behalf of the given user. It provides a set of callbacks executed for successive user actions (see Figure 4.3a). The FClient is designed to provide efficient data access for high-performance applications by combining two main features: performing operations at the actual data location (using Handlers) and handling metadata calls connected with these operations. The FClient works locally whenever possible (based on cached metadata), asynchronously informing DMC about its activities (see Figures 4.3c and 4.3d). The FClient communicates synchronously with DMC only to ensure consistency of its metadata cache (e.g., obtaining information about the Handler, storage and ID of a file in the storage system when that file is being opened – see Figure 4.3b) or to request data synchronization (see Figure 4.3d). DMC updates FClient caches when a particular metadata element changes, while FClients generate events to inform DMC about their actions (see Chapter 4.1.2).

The basic pseudocode for handling FClient requests by DMC is depicted in Figure 4.4. The request can be handled locally or rerouted to another DMC when it concerns a dataset that is not supported by the local DMC. When the request is handled locally, user permissions are verified and the code of the appropriate module (see Chapter 4.2.2) is executed on one of DMC's nodes (see Chapters 4.2.1 and 4.2.3). It should be emphasized that DMC handles multiple requests in parallel. Moreover, handling user actions or client requests is only part of FClient and DMC responsibility. Both elements also execute some background tasks (e.g., monitoring – see next paragraph) irrespective of user actions.

Some functionality is implemented by both DMC and FClient, e.g., DMC may be aware of the state of resources but not always able to monitor the FClient's environment. Information about this environment is crucial in order to select the best access method and storage system for a particular FClient request. Thus, both components contain monitoring and routing modules to ensure that this functionality can be provided throughout the system.
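Figure 4.4 itself is not reproduced in this text. As a rough illustration of the flow just described, the following Erlang sketch shows how a DMC could route, authorize and dispatch a single FClient request; the module name, function names, map keys and stub implementations are assumptions made for illustration only, not the actual code.

    -module(dmc_request_sketch).
    -export([handle_fclient_request/2]).

    %% Sketch of the flow described for Figure 4.4: requests for datasets supported
    %% locally are authorized and dispatched to a module; other requests are
    %% rerouted to the DMC of another site.
    handle_fclient_request(#{dataset := Dataset, user := User} = Req, LocalDatasets) ->
        case lists:member(Dataset, LocalDatasets) of
            false ->
                {reroute, remote_dmc_for(Dataset), Req};
            true ->
                case has_permission(User, Dataset) of
                    true  -> {ok, run_module(Req)};
                    false -> {error, eacces}
                end
        end.

    %% Stubs standing in for the real permission check, routing table and modules.
    has_permission(_User, _Dataset) -> true.
    remote_dmc_for(_Dataset) -> remote_site.
    run_module(#{op := Op}) -> {handled, Op}.

In the actual system, dispatching additionally selects one of the DMC cluster nodes and many such requests are handled in parallel (see Chapters 4.2.1 and 4.2.3).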

Figure 4.3: Sample FClient pseudocode

Figure 4.4: Basic pseudocode for handling FClient requests 4.1. OVERALL ARCHITECTURE OF THE SYSTEM 43

Table 4.1: Implementation assumptions for MACAS

Functionality: Management of data stored on particular resources; Provision of access to particular resources
Where functionality is provided: Site that owns resources
Layer or Concern: Executive Layer; Communication Layer
Comments: DMC manages data in a single site. Resources of a site are linked and should be managed together.

Functionality: Management of metadata describing relationships between users and datasets
Where functionality is provided: Any place, but with global coordination
Layer or Concern: Cooperation Layer
Comments: DMCs choose one site that manages cooperation metadata. Requests depend on each other.

Functionality: Synchronization of metadata that describes data distributed among several sites
Where functionality is provided: Sites that process metadata
Layer or Concern: Cooperation and Executive Layers
Comments: DMCs synchronize metadata. Cooperation metadata is used to choose subsets of DMCs that take part in synchronization. Conflicts resolution is needed.

Functionality: Provisioning of user interface
Where functionality is provided: Client
Layer or Concern: Client Layer
Comments: FClient handles users' requests. It processes metadata to reduce network impact.

Functionality: Accessing data on different storage systems
Where functionality is provided: Client and site
Layer or Concern: Access Layer
Comments: Handler provides access to a particular storage system. There are Handlers for different storage systems. Data may be accessed anywhere but different storage systems require different access methods.

Functionality: Routing of requests; Environment monitoring
Where functionality is provided: DMC and FClients
Layer or Concern: Routing and Performance Layer; Monitoring and Management Concern
Comments: DMC and FClients work in different environments. DMC and FClients have modules for routing and monitoring.

Functionality: Preventing unauthorized access; Adaptation to provider and user needs
Where functionality is provided: -
Layer or Concern: Security Concern; Executive Layer, Monitoring and Management, Security Concerns
Comments: DMC manages FClients and Handlers to provide security and desired characteristics.

Table 4.1 presents the overall implementation assumptions. Each DMC supervises a set of tightly linked resources and manages client requests. Data management is based on logical files (accessed through paths or UUIDs) and datachunks (see Chapter 3.2). Data is not synchronized automatically. Data replicas described by each datachunk are created/deleted when the data is read/written, unless another policy has been specified (see Chapter 3.2 and Chapter 5.3 for a description of the corresponding tests).

The pseudocode for FClient datachunk synchronization request handling is outlined in Figure 4.5. When such a request appears, DMC generates a list of datachunks that have to be requested from other DMCs, splitting datachunks on the fly if needed. Subsequently, datachunk synchronization requests are submitted to the appropriate module that manages data transfer between DMCs. When data is synchronized, datachunk descriptions are updated and information about updates is sent directly to the FClient that issued the given request, as well as – through events – to other interested FClients (see Chapter 4.1.2). Other DMCs receive information about datachunk updates through the metadata synchronization mechanism (also described in Chapter 4.1.2).

Figure 4.5: Pseudocode for FClient datachunk synchronization request handling
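Since Figure 4.5 is not reproduced here, the following Erlang fragment sketches its first step: clipping (splitting on the fly) the datachunks known to be held by other DMCs to the byte range requested by an FClient. The {Offset, Size, Dmc} tuple representation is an assumption made for illustration only.

    -module(chunk_request_sketch).
    -export([chunks_to_request/3]).

    %% Return the parts of remote datachunks that overlap the requested byte range
    %% [From, To); chunks are clipped (split on the fly) to that range.
    chunks_to_request(From, To, RemoteChunks) ->
        [{max(Offset, From), min(Offset + Size, To) - max(Offset, From), Dmc}
         || {Offset, Size, Dmc} <- RemoteChunks,
            Offset < To, Offset + Size > From].

For example, chunks_to_request(100, 300, [{0, 150, dmc_a}, {150, 400, dmc_b}]) yields [{100, 50, dmc_a}, {150, 150, dmc_b}], i.e., two synchronization requests addressed to different DMCs.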

The pseudocode for updating metadata which describes datachunks is depicted in Figure 4.6. DMC processes information not only about datachunks stored on its own resources, but also about datachunks stored on resources which belong to other DMCs to determine the recipients of synchronization requests (see Figure 4.5). Thus, the metadata describing a datachunk can be updated when an event which involves data modification appears, when information about changes to a datachunk introduced by an external DMC is received, or when a synchronization request is handled. When an update of metadata describing a datachunk results from an event or a synchronization request, newly appearing datachunks are merged with existing ones if possible. Receipt of information concerning a change to a datachunk introduced by another DMC results in updating metadata which describes the datachunks of that DMC. Afterwards, this information is used to invalidate metadata describing datachunks stored on the DMC's own resources, if these datachunks overlap with datachunks updated on the resources managed by another DMC.

Figure 4.6: Pseudocodes for updates of metadata describing datachunks

To decrease synchronization overhead, all datachunks that describe data stored on the DMC's own resources are marked as either public or private. Datachunks marked as public are created as a result of data modifications (other DMCs need to know where to locate the most recent data). Datachunks marked as private are created as a result of replication, so there is no need to broadcast information about them quickly (i.e. there is at least one replica of that data on resources managed by another DMC). As random reads can result in creation of thousands of datachunks per second, synchronizing all metadata describing these datachunks would introduce high overhead. Thus, private datachunks are treated in a special manner and metadata describing them is broadcast only after they are merged into larger datachunks (see Figure 4.6a).

The presented approach provides high elasticity and reduces network load, but calls for effective processing of metadata describing datachunks in a size-dependent manner. The smaller the datachunk, the more metadata has to be updated. However, if the datachunk is stored on several storage systems, changing even a single byte causes that datachunk to be invalidated across all systems except the one where the change took place. This can lead to unnecessary data transfers. Given that both large and small datachunks have their pros and cons, DMC supports variable datachunk size and permits datachunks to be aggregated and split on the fly. When information about a datachunk change is received by the system the datachunk may either be merged with existing datachunks or invalidation of existing datachunks may occur (see Figures 4.6b and 4.6c respectively), which can also result in splitting datachunks – as depicted in Figure 4.7. The list of datachunks to be merged or invalidated is created from a sorted tree of existing datachunks, selecting those that overlap with the changed datachunk. During invalidation and merging, metadata describing the beginnings and endings of existing datachunks is changed to properly describe data distribution following modifications. Some metadata describing datachunks can also be deleted (datachunks may be merged with each other while others might be fully contained in the datachunk which initiated the invalidation process). Since such operations on metadata require additional effort, the system limits the number of metadata operations by introducing an appropriate update mechanism (see Chapter 4.1.3).

DMC instances share class 2 metadata (see Chapters 3.3.3 and 3.3.4) when data is distributed between resources supervised by them. Metadata used for authentication and authorization is managed by a DMC module called the cooperation manager (see Chapter 4.2.2). When many DMCs are deployed, one is chosen to store and manage cooperation metadata while others delegate cooperation metadata modification requests to the selected DMC.
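The merging of datachunks into larger ones, mentioned above for private datachunks, can be illustrated with the following sketch; the {Offset, Size} pair representation and the use of a sorted list instead of the actual sorted tree are simplifying assumptions.

    -module(chunk_merge_sketch).
    -export([merge_chunks/1]).

    %% Merge overlapping or adjacent datachunks, given as {Offset, Size} pairs,
    %% into the smallest equivalent set of larger chunks.
    merge_chunks(Chunks) ->
        lists:reverse(lists:foldl(fun merge_one/2, [], lists:keysort(1, Chunks))).

    merge_one(Chunk, []) ->
        [Chunk];
    merge_one({Offset, Size}, [{PrevOffset, PrevSize} | Rest])
      when Offset =< PrevOffset + PrevSize ->
        %% Overlapping or adjacent: extend the previously accumulated chunk.
        [{PrevOffset, max(PrevOffset + PrevSize, Offset + Size) - PrevOffset} | Rest];
    merge_one(Chunk, Acc) ->
        [Chunk | Acc].

For example, merge_chunks([{0, 10}, {5, 10}, {30, 5}]) returns [{0, 15}, {30, 5}].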

4.1.1 Metadata Distribution

Since the design assumptions call for local metadata processing, metadata stores and caches are introduced (see Figure 4.8). The classes referred to in this chapter are defined in Chapters 3.3.3 and 3.3.4. The terms “metadata store” and “metadata cache” describe entities that maintain metadata either in memory or in fixed storage. However, they are not interchangeable. The metadata store is responsible for storing metadata from the time of its creation until it is no longer used and can be completely forgotten. The metadata cache is used by DMC or FClient to keep copies of metadata which is maintained in a remote metadata store (remote is interpreted as not located in the memory of a particular FClient or DMC instance, e.g., from the perspective of an FClient the DMC metadata store can be classified as remote). Thus, selected metadata is copied to the metadata cache when required by a particular FClient or DMC, and deleted from the cache when deemed irrelevant by that FClient or DMC. Removal of metadata from the metadata cache does not equate with complete deletion of metadata (however, this equality holds for the metadata store). The value of a cached metadata element does not have to be consistent with the value of the same element in the metadata store.

Figure 4.7: Pseudocodes for metadata updates following merger and invalidation of datachunks

Figure 4.8: Metadata stores and caches

An updated value in the cache can be asynchronously synchronized with the store, while updates carried out in the store can be asynchronously broadcast to caches interested in the given metadata (see Chapter 4.1.2). The consequences of such behaviour are discussed in Chapter 4.1.3. Each metadata class (see Chapters 3.3.3 and 3.3.4) is stored and managed separately.

Class 1 metadata is more frequently read than updated and it is critical to preserve each change. Updates are performed in a transactional way on the level of the whole system. The cooperation manager module of DMC redirects class 1 metadata modification requests to the DMC which manages the class 1 metadata store. Additionally, all DMCs use class 1 metadata caches where copies of such metadata are stored. All modifications are performed directly within the class 1 metadata store and dispatched asynchronously to all caches.

Each DMC stores class 2 metadata that describes logical files along with descriptions of their datachunks. Processing such metadata may incur significant load due to the operations on data which call for metadata updates. Thus, processing involves local modifications of metadata, with asynchronous multicasting of aggregated changes (only to interested DMCs) and automatic conflict resolution (see Chapter 4.1.2). The subset of DMCs which take part in synchronization of particular class 2 metadata is determined using class 1 metadata. Each DMC uses a persistent class 2 metadata store with a memory cache for better performance (see Chapter 4.2.3).

Class 3 metadata is used by the DMC to supervise the state of the environment and reconfigure system elements if needed, e.g., change connection parameters to optimize network usage. Class 3 metadata includes FClient session state, so it is intensively used when many FClients are connected. To decrease system load, it is neither shared between DMCs nor protected from hardware failures.

FClients heavily rely on class 2 and 3 metadata, e.g., the attributes of files presented to the user or the handles to opened logical files. For performance reasons, all required metadata is cached in FClient memory. For class 2 metadata, operations are performed in the cache and then asynchronously propagated to the DMC metadata store. If DMC changes the metadata or receives a change from an FClient, it dispatches asynchronous notifications to all FClients that cache the affected metadata.

For class 3a (private runtime metadata – see Chapters 3.3.3 and 3.3.4) no synchronization is needed. This class includes information required operationally by each entity (DMC and FClient); hence each entity creates its own class 3a metadata. However, the corresponding metadata structures are often created as a result of synchronous calls, e.g., local representations of sessions. Class 3b (shared runtime metadata – see Chapters 3.3.3 and 3.3.4) updates are gathered and preprocessed by FClients. They describe the activities of each FClient. Gathered changes are sent to DMC. By using class 3b metadata gathered from many FClients, along with its own knowledge/configuration, DMC is capable of controlling the environment. Such control is performed through updates of class 3c metadata (public runtime metadata – see Chapters 3.3.3 and 3.3.4) that describe various parameters relevant for FClients, e.g. storage system parameters.
Thus, changes in class 3c metadata are broadcast asynchronously to FClients to update their configurations. Figure 4.9 shows a sample flow of metadata between stores and caches triggered by handling user actions in the FClient (e.g., accessing logical file metadata) as well as by automatic, periodic actions (e.g., checking environment state). Again, the key assumption involves local processing of frequently used metadata to avoid overheads connected with remote metadata updates. The metadata is processed as soon as possible, especially on the FClient machine. Changes are aggregated and applied in batches. Any conflicts are resolved automatically, using an algorithm which ensures eventual global consistency.
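To make the store/cache distinction more tangible, the sketch below shows a write-back metadata cache in which updates are applied locally and pushed to a (stubbed) remote store asynchronously. The module, table and function names are assumptions; in the actual system the asynchronous propagation is additionally aggregated and batched, as described in the next section.

    -module(metadata_cache_sketch).
    -export([new/0, update/3, lookup/2]).

    %% Create an in-memory cache for metadata copies.
    new() ->
        ets:new(metadata_cache, [set, public]).

    %% Apply the update locally and propagate it to the metadata store asynchronously.
    update(Cache, Key, Value) ->
        ets:insert(Cache, {Key, Value}),
        spawn(fun() -> persist_in_store(Key, Value) end),
        ok.

    %% Serve reads from the cache; a miss would be resolved by the remote store.
    lookup(Cache, Key) ->
        case ets:lookup(Cache, Key) of
            [{Key, Value}] -> {ok, Value};
            []             -> miss
        end.

    persist_in_store(_Key, _Value) ->
        ok.  % stand-in for sending the update to the (remote) metadata store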

4.1.2 Handling Metadata Updates

Updates of class 2 and 3b metadata are crucial for the system. Due to the fact that strong consistency of this metadata is not required, updates may be processed asynchronously with some delay. Asynchronous updates are performed using events. Two important aspects of event processing are aggregation of events and triggering of event dispatches. Since the FClient is able to produce many similar events in a short period of time, events are buffered and then sent (by the FClient) or processed (by DMC) in groups. Due to the delay in sending and processing events, multiple similar events can be merged into a single event (Chapter 4.1.3 discusses the consequences of such delays). A typical example of aggregation involves events that describe access time updates for a logical file. They can be aggregated into one event which only includes the latest timestamp. Another example is aggregation of many events that contain descriptions of datachunk changes, producing a single event which contains a full list of the changes. Since aggregation may result in excessively large events or cause unacceptable caching delays, sets of sending and processing triggers are introduced (see Chapter 4.1.3). Metadata updates are handled as follows:

• for class 2 and 3b metadata:

1. update of metadata by the FClient in its cache generates an event,
2. the event is preprocessed on the FClient side and aggregated with a similar event, if present in the buffer,
3. sending triggers are evaluated by the FClient,
4. preprocessed events are sent to DMC if a trigger fires,
5. DMC aggregates the event with a similar event, if one has been received from another FClient,
6. processing triggers are evaluated by DMC,
7. if a processing trigger fires, the aggregated events are processed, resulting in metadata changes,
8. if some FClients are interested in the changed metadata, information describing the relevant changes is sent to interested FClients (and only to them),

• for class 2 metadata:

9. DMC chooses other DMCs that should be notified of changes in the given metadata,
10. information about the changes is aggregated over a period of time – if there is more than one update to a single metadata item, only the most recent version is used,
11. the changes are propagated to the selected DMCs,
12. other DMCs apply the changes using an automatic conflict resolution algorithm if needed (see Figure 4.10),
13. if some FClients of DMCs which received the changes are interested in the changed metadata, information about the relevant changes (including the outcome of conflict resolution) is sent to interested FClients (and only to them).
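Aggregation of similar events (step 10 above, and the access-time and datachunk-change examples given earlier) could look as follows; the event tuple shapes are assumptions made for illustration only.

    -module(event_aggregation_sketch).
    -export([aggregate/2]).

    %% Aggregate two similar events into one. Only events of the same type that
    %% concern the same logical file are aggregated; other events stay buffered
    %% separately.
    aggregate({access_time, File, T1}, {access_time, File, T2}) ->
        %% Access-time updates: keep only the latest timestamp.
        {access_time, File, max(T1, T2)};
    aggregate({chunk_changes, File, Changes1}, {chunk_changes, File, Changes2}) ->
        %% Datachunk change descriptions: produce one event with the full list.
        {chunk_changes, File, Changes1 ++ Changes2}.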

Figure 4.9: Sample flow between metadata stores and metadata caches

Figure 4.10: Conflict resolution pseudocode

Table 4.2: Aggregation of events and changes

Aggregation | Aggregation goal
FClient messages by FClient | Reduce DMC load connected with client message handling
FClient messages by DMC | Limit number of actions triggered by events
Synchronization messages by DMC | Optimize network utilization

As described, the update algorithm involves three instances of aggregation (see Table 4.2). Before processing by DMC, events are aggregated on the FClient side to reduce DMC load. As processing of an event may call for reads/writes/updates of metadata, events from various FClients are aggregated to decrease the number of metadata operations. Additional aggregation is performed during propagation of changes to other DMCs in order to optimize network usage.

The conflict resolution mechanism (see Fig. 4.10) is based on comparison of revision numbers and hashes. Each change to a metadata document results in incrementation of its revision number and generation of a new hash. The algorithm chooses documents with greater revision numbers, or hashes if the numbers are equal. Custom conflict resolvers may be used for some metadata documents, e.g., for metadata documents that describe access times the latest timestamps are always chosen regardless of revision numbers and hashes.

Although delayed processing of events reduces DMC load, it also causes delayed propagation of metadata updates and temporary inconsistency of system state (see Chapter 4.1.3). In the adopted approach this is assumed to be acceptable since otherwise system-wide transactional metadata updates would be very slow due to network delays between DMCs. Moreover, preprocessing and aggregation settings may be tuned to strike a balance between the duration of an inconsistent state and performance considerations, while locks enable permanent consistency (at the cost of reduced performance).
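A minimal sketch of the described resolution rule (the greater revision number wins, hashes break ties) is given below; the map layout with rev and hash keys is an assumption, and custom resolvers such as the access-time one are omitted.

    -module(conflict_resolution_sketch).
    -export([resolve/2]).

    %% Choose the winning version of a metadata document.
    resolve(#{rev := R1} = Doc1, #{rev := R2}) when R1 > R2 -> Doc1;
    resolve(#{rev := R1}, #{rev := R2} = Doc2) when R2 > R1 -> Doc2;
    resolve(#{hash := H1} = Doc1, #{hash := H2} = Doc2) ->
        %% Equal revisions: pick the document with the greater hash deterministically.
        case H1 >= H2 of
            true  -> Doc1;
            false -> Doc2
        end.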

4.1.3 Propagation Delay for Metadata Changes and its Consequences

Transmission and processing of metadata can be triggered using various criteria, depending on the nature of the given event. Some events may require immediate processing while others permit delays. Two typical triggers for events that can be delayed are the time and size triggers. While the size trigger is used to prevent buffer overflows, the time trigger influences the duration of temporary system state inconsistencies.
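The interplay of the two triggers can be sketched as follows; the threshold values and the buffer representation are assumptions chosen only to illustrate the mechanism.

    -module(trigger_sketch).
    -export([maybe_flush/3]).

    -define(MAX_EVENTS, 100).     % size trigger: prevents buffer overflow
    -define(MAX_DELAY_MS, 3000).  % time trigger: bounds the inconsistency window

    %% Decide whether the buffered events should be dispatched now. Buffer is a
    %% list of events (newest first); timestamps are given in milliseconds.
    maybe_flush(Buffer, LastFlushMs, NowMs)
      when length(Buffer) >= ?MAX_EVENTS; NowMs - LastFlushMs >= ?MAX_DELAY_MS ->
        {flush, lists:reverse(Buffer)};
    maybe_flush(Buffer, LastFlushMs, _NowMs) ->
        {keep, Buffer, LastFlushMs}.

Increasing the time threshold trades a longer window of temporary inconsistency for fewer, larger dispatches, which is exactly the trade-off discussed below.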

Higher event processing delays translate into lower DMC load connected with event processing due to more aggressive caching and preprocessing. However, higher delays also result in longer system state inconsistencies and therefore a greater risk of metadata change conflicts. While some conflicts can be resolved without any negative consequences (e.g., for metadata that describes access times), others may cause problems for the user. One example is renaming a file through different FClients connected to different DMCs. If two scripts work with the same file and decide to rename it in parallel, only one metadata (name) change can be acknowledged during the conflict resolving process. As a result, one of the scripts may crash because it will not find the file under its expected new name. Another example involves the use of different replicas of the same data by such FClients. When one FClient modifies the data, the unmodified replica must be invalidated. Invalidation is triggered when appropriate metadata which describes a modification appears. If the second FClient performs a read operation before its replica is invalidated, it will receive outdated data. Thus, the length of metadata change propagation delays influences the likelihood of operating on outdated data.

The above-mentioned problem is common for most tools that support distributed data processing. Conflicts can be overcome using locking or other synchronization mechanisms. However, such mechanisms may increase overhead and result in suspension of some threads/processes. Thus, the author has opted not to use any synchronization mechanism, except metadata keys used for the system's security, in order to ensure better utilization of resources. Still, aggregation timeouts have been configured in such a way that consistency is provided within a few seconds. This is enough to noticeably reduce DMC load while showing changes to all users connected to the system within an acceptable time. If the user requires stronger synchronization mechanisms, they can be implemented at a higher level than the data access system, e.g., in a service which uses the data access system or directly in the user's script.

While a greater delay in processing events enables better aggregation and reduces overhead, it also entails a longer period of system state inconsistency. The optimal balance between overhead and duration of inconsistency (directly associated with event processing delays) depends on the use case. Thus, the system allows parameterization of metadata sending and processing triggers using default values that reduce DMC load, delaying metadata synchronization only for a few seconds. One can argue whether such temporary inconsistencies affecting data views on different sites prevent data access from being called transparent in terms of place of access. However, complete elimination of inconsistencies results not only in performance problems but also reduces availability (if the network between sites fails, a choice must be made between consistency and availability – see CAP [3]). Thus, the author postulates that data access transparency should allow for temporary differences in data views. Determining the period of allowed inconsistency is up to the user.

4.2 Architecture of Data Management Component

As previously mentioned, this thesis focuses on the functionality, architecture and implementation of the Data Management Component, regarded as a crucial element of the proposed approach to ensuring transparent data access in federated computational infrastructures. In this chapter two important elements of the DMC architecture are presented along with their respective deployment details:

• DMC Core,

• a set of DMC modules activated by DMC Core.

Figure 4.11: Deployment of DMC Core

The overall architecture of DMC consists of cooperating modules (see Chapter 4.2.2) deployed within a highly scalable and available cluster solution represented by DMC Core (see Chapter 4.2.1). The main focus of DMC Core is to be able to process large numbers of requests, including large-scale metadata changes.

4.2.1 DMC Core

The DMC Core architecture follows the master-slave paradigm for parallel processing of user requests (see Figure 4.11). Each slave supervises basic constituent processes of DMC, including the Node Manager, the Request Dispatcher (described later on in this chapter) and optional Module Host processes. The latter are used by DMC modules to provide end-user functionality (described further below; see also Chapter 4.2.2), while high scalability and availability is achieved by deploying DMC Core as a cluster solution (the DMC cluster).

The Erlang language (OTP 20.0 release) was selected for implementation of high-performance request processing. Its dedicated execution environment – the Erlang Virtual Machine – provides very lightweight processes compared to standard operating systems. Erlang supports two types of applications referred to as Erlang Applications (EA) and Erlang Distributed Applications (EDA). The former is executed inside a single Erlang Virtual Machine, while the latter spans several Erlang Virtual Machines that may be run on several physical hosts.

In order to ensure persistence of metadata, the author decided to use the Couchbase NoSQL database. This decision was prompted by the database's scalability and features useful in the context of metadata synchronization between DMCs. The db sync module (see Chapter 4.2.2) uses Couchbase views to track metadata changes. Persistence Nodes communicate with processes which copy metadata from the Erlang Virtual Machine memory to Couchbase. DMC operates on a dedicated set of DMC cluster nodes which host Couchbase along with two types of applications:

1. Erlang Worker Application (EWA), which hosts DMC modules (see Chapter 4.2.2). It is a standard Erlang Application responsible for provisioning DMC functionality. It may be replicated to many nodes.

2. Erlang Management Distributed Application (EMDA), which manages EWA instances. It is an Erlang Distributed Application (EDA) responsible for implementing non-functional requirements, i.e. high availability.

Although a single standard DMC cluster node is sufficiently powerful to run DMC Core, using several nodes ensures high availability and supports handling of a greater number of concurrent requests. In this case, EWA instances are deployed on each cluster node while EMDA is deployed on a chosen subset of nodes. EMDA is initialized on all selected nodes; however, only one node initiates execution of application code. If that node fails, one of the remaining live nodes automatically takes over execution. EMDA persists key information in a database, so it is able to retrieve its state in the event of a node failure. This mechanism effectively prevents introduction of a single point of failure. EWA instances work as independent applications supervised by EMDA. EWA and EMDA do not share an Erlang Virtual Machine – if both are hosted on the same cluster node, two separate Erlang Virtual Machines must be launched.

Scaling and High Availability (HA) of DMC is achieved thanks to the features of DMC Core. Although the cluster nodes which host EWA and EMDA can potentially be implemented as virtual machines or Docker containers [70], the use of physical machines makes HA attainable by providing hardware fault tolerance. In addition, physical machines facilitate scalability by increasing the number of EWA machines that process requests (see Chapter 4.2.3 for further information about load balancing in a multi-node environment). Four groups of DMC cluster nodes are identified:

• A Master Node which hosts both EWA and EMDA. This is the node on which EMDA operates.

• Reserve Master Nodes which host both EWA and EMDA. On each of these nodes EMDA remains on standby, able to react to Master Node failures.

• Slave Nodes which host only EWA,

• Persistence Nodes which host only Couchbase.

When the Master Node fails, one of the Reserve Master Nodes becomes the new Master Node. Each Reserve Master Node maintains a prioritized list of all Reserve Master Nodes, enabling selection of a new node to replace the failed Master Node. To provide the required functionality and to meet the associated non-functional requirements, the following EWA and EMDA elements have been developed (see Figure 4.11):

• Central Manager – coordinates all nodes which comprise DMC.

• Node Manager – monitors the state of the node where the application is deployed and provides this information to the Central Manager.

• Supervisors – monitor execution of code and restart processes following failures. They can also restart entire sets of connected processes if a given process fails. Supervisors are linked with one another in a treelike structure. The Application Supervisor is the root of this tree.

• Request Dispatcher – responsible for forwarding requests to the appropriate Module Host.

• Module Host – executes the code of the selected module (see Chapter 4.2.2).

EWA Module Hosts implement the features of the respective modules. Each Module Host processes requests concurrently, optionally creating a permanent master process to improve supervision, if needed (see Chapter 4.2.3). It can parallelize execution of certain tasks by starting multiple slave processes, each of which can also be supervised on demand. Additional Module Hosts are hosted by each EWA to manage the metadata store and handle DNS requests (see Chapter 4.2.3). Module Hosts that provide metadata storage make use of EWA memory as well as the database deployed on Persistence Nodes (see Chapter 4.2.3).

The nodes and EWA instances are independent. They cooperate to balance load. All information required for such cooperation is provided by the Central Manager, which – owing to EDA properties – does not constitute a single point of failure. The set of nodes may change dynamically through the use of Node Managers. Each Node Manager periodically sends a heartbeat to the Central Manager so the Central Manager is able to discover new EWA instances. The Central Manager also monitors each EWA Erlang Virtual Machine to detect potential failures, e.g. due to the failure of a node. Node Managers constantly monitor the load of their nodes (processor and memory usage, I/O load) and forward this information to the Central Manager, which, in turn, prepares advice for the load balancing algorithm implemented by the Request Dispatcher and the DNS module. The main assumptions of the load balancing algorithm are that DNS is responsible for balancing the load generated by multiple clients while the Request Dispatcher only reacts to short load fluctuations (see [154]). Thus, the Node Manager implements the functionality of the Monitoring and Management Concern of the MACAS model, while the Request Dispatcher, along with the DNS module and the Central Manager, take part in implementing the Routing and Performance Layer.

The Supervisor is an Erlang element that monitors the Erlang processes or other supervisors and restarts them in case of failures. The Application Supervisor is the top-level supervisor (it is not monitored by any other supervisor). While the Central Manager supervises all Erlang nodes, the tree structure of Supervisors is used to control the processes executed on each node. Each process that requires monitoring (failures of some processes can be ignored) is supervised by the Application Supervisor, either directly or indirectly through other Supervisors that reside at lower levels of the supervision tree. Since a Supervisor can restart processes in case of failure, the Central Manager is not involved in the recovery process.

Figure 4.12: DMC modules

However, if a process does not resume normal operation after several restarts, the Supervisor ceases its attempts and instead propagates an error message up the supervision tree. In this situation, a higher-level Supervisor can restart the child Supervisor. This may result in a restart of the entire supervised sub-tree. Cooperating processes are usually linked to a common Supervisor, so restarting the sub-tree may clear whatever problem is encountered by the failed processes. If such restarts prove ineffective, the failure is propagated to the root of the supervision tree (to the Application Supervisor) and the whole application fails. The Central Manager then recognizes a node failure and prevents clients and other nodes from sending requests to the failed node. After the node is repaired, it can be rediscovered due to the heartbeat mechanism.

The proposed design of DMC makes it scalable and resistant to failures. The Central Manager is able to capture any changes and respond to them. All nodes are able to work independently in case of network failures since they have their own Application Supervisors. Once the network is repaired, they reconnect to each other and the Central Manager is able to reconfigure the node again, if required.
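The supervision structure described above corresponds to standard Erlang/OTP supervisors. The following sketch shows one node-local branch of such a tree; the child modules, restart strategy and intensity values are assumptions, not the actual configuration.

    -module(module_host_sup_sketch).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        SupFlags = #{strategy => one_for_one,  % restart only the failed child
                     intensity => 5,           % give up after 5 restarts ...
                     period => 10},            % ... within 10 seconds
        Children = [#{id => request_dispatcher,
                      start => {request_dispatcher, start_link, []}},
                    #{id => node_manager,
                      start => {node_manager, start_link, []}}],
        {ok, {SupFlags, Children}}.

When the restart intensity is exceeded, the supervisor itself terminates and the failure propagates up the tree, eventually reaching the Application Supervisor as described above.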

4.2.2 DMC Modules

DMC consists of a set of cooperating modules that implement the features of MACAS layers and concerns. The DMC executes modules to provide end-user functionality via Module Hosts activated by Slave Nodes (cf. Figure 4.11). The modules themselves are depicted in Figure 4.12. The connections depicted in the figure indicate modules that cooperate closely. Two additional Module Hosts are started to handle DNS requests (see Chapter 4.2.1) and metadata storage (datastore – see Chapter 4.2.3). They are not described here since they effectively constitute elements of the core system rather than typical modules.

The most important module is fslogic. When the user accesses a file, the fslogic module manages the metadata that describes the distribution of datachunks. If the data access results in metadata changes that require synchronization between sites, the db sync module broadcasts the changes and resolves conflicts, ensuring eventual consistency of metadata, and therefore eventual consistency of data views across all sites. Metadata synchronization does not result in automatic replication of data. When data stored on resources supervised by another DMC instance is needed, a multi-channel data connection is opened. This feature is provided by the rtransfer and gateway modules.

Table 4.3: Implementation of MACAS layers and concerns by DMC modules

Module | Layer(s) and/or Concern(s) implemented by module
fslogic | Executive Layer, Cooperation Layer, Routing and Quality Layer
remote | Communication Layer
session | Communication Layer, Security Concern
rtransfer and gateway | Communication Layer
event manager | Routing and Quality Layer; different actions can be connected with different layers
sequencer | Communication Layer
http worker | Communication Layer, Client Layer
db sync | Cooperation Layer
cooperation manager | Cooperation Layer

The fslogic module selects the datachunks to be involved in the transfer. If datachunks which precisely match the required data range do not yet exist, fslogic creates the required datachunks on the fly by splitting existing datachunks. Subsequently, fslogic initiates rtransfer to replicate only the chosen datachunks instead of the entire file. In addition to simple transfer, rtransfer also supports parallel transfer of datachunks, depending on user needs. rtransfer is also able to dynamically change request priorities, e.g., when a datachunk is frequently requested or a new higher-priority request for the same data appears. The fslogic module assigns higher priority to transfers requested by FClients and lower priority to other transfers, e.g., involving prefetched data.

Users can access data in various ways, using FClient, a Web-based GUI or by connecting with any tool capable of generating HTTPS requests, compatible with REST or CDMI APIs. The session module is responsible for handling incoming connections using ssl. Thereafter, synchronous requests are forwarded to fslogic, remote file manager and http worker. While fslogic handles requests concerning logical files, http worker handles requests from Web-based GUIs and remote file manager mediates between FClient and the storage system during I/O operations when FClient is not directly connected to the required storage system.

Asynchronous requests are handled by the event manager module, which executes actions triggered by incoming messages. Such actions may involve simple metadata updates or perform complex operations such as data replication between storage systems supervised by different DMCs. The task of the event manager module is to trigger actions – carrying out these actions is often delegated to other modules, e.g., fslogic. The event manager module also enables event trigger management, as well as aggregation of events, whenever necessary (see Chapter 4.1.2). When some incoming requests have to be processed in order, the sequencer module is used to ensure that requests are not intermixed. The cooperation manager is responsible for access to class 1 metadata. It redirects class 1 metadata modification requests to the site chosen for managing class 1 metadata.

The above-mentioned modules cover the functionality of MACAS layers and concerns (outlined in Figure 3.6) according to Table 4.3. The fslogic module provides Executive Layer functionality while db sync handles synchronization. The fslogic and event manager modules, together with DMC Core (see Chapter 4.2.1), take part in routing requests between sites (fslogic) and routing for aggregation purposes (event manager). The fslogic module also takes part in ensuring data security. Together with the session module which provides user identification, it controls access permissions.

4.2.3 Request Handling and Load Balancing

Requests are divided into five groups depending on their resource usage:

• light synchronous, e.g., FClient metadata requests,

• light asynchronous, e.g., metadata updates through events by FClient,

• small transfers, e.g., read/write operations generated by FClient when DMC is proxying requests to the storage systems,

• large transfers, e.g., large data downloads via Web GUI or prefetching large datasets from another site for computations,

• computationally intensive, e.g., filtering datasets on the basis of user-defined metadata.

The usage of computational resources and network bandwidth by light synchronous and light asynchronous requests is low. Light synchronous requests are processed with no delays while light asynchronous requests allow delayed processing. However, as these requests may cause metadata updates, the associated delay should be relatively short. Small and large transfers compete for network resources – this calls for proper management to avoid blocking of small transfers for FClients as a result of prefetching large datasets. Finally, computationally demanding requests consume significant processing power and/or memory. The system provides the following three features to handle different types of requests efficiently:

1. load balancing depending on load type [154],

2. four different request handling modes,

3. two different metadata access modes.

(1) Load balancing must be invoked before a request reaches the I/O interface of the DMC node. Although it is possible to implement load balancing on the FClient side, for clients other than FClient, e.g., generic tools for issuing HTTP requests, the only means of control involves appropriate preparation of DNS server replies (implementation of client-side load balancing would require modifying such tools). However, the DNS reply caching time is significantly longer than the interval at which resource utilization changes. As a result, if a single node is flooded by requests, it may take a lot of time to rebalance load within DNS (clients change their behaviour only after the DNS reply cache is invalidated). Thus, load balancing follows a two-level approach – the DNS level (executed as a Module Host – see Chapter 4.2.1) routes requests to the appropriate node, while the Request Dispatcher (see Chapter 4.2.1) is used for internal rerouting of requests in response to rapid load fluctuations [154].

(2) Different types of requests also call for different processing modes (see Figure 4.13). Thus, the Module Host that handles requests may operate in the following modes:

1. direct processing,

2. processing by a dedicated process,

3. supervision by a single master,

4. use of process pool.
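The modes are described in detail below. As a rough illustration, the following sketch corresponds to the second mode: the connection process spawns a monitored slave process, can react to its failure and can abort stalled processing. The timeout value and message shapes are assumptions made for illustration only.

    -module(slave_mode_sketch).
    -export([handle_in_slave/2]).

    %% Run HandlerFun(Request) in a monitored slave process so that the connection
    %% process can detect crashes and abort stalled processing.
    handle_in_slave(HandlerFun, Request) ->
        Parent = self(),
        {Pid, MonitorRef} =
            spawn_monitor(fun() -> Parent ! {self(), HandlerFun(Request)} end),
        receive
            {Pid, Result} ->
                erlang:demonitor(MonitorRef, [flush]),  % success: stop watching the slave
                {ok, Result};
            {'DOWN', MonitorRef, process, Pid, Reason} ->
                {error, Reason}                         % slave crashed before replying
        after 5000 ->
            exit(Pid, kill),                            % processing stalled: kill the slave
            {error, timeout}
        end.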

Requests appear in DMC by way of sockets. Each socket is connected to a process, which receives requests from the socket and is able to send responses through the socket. We will refer to this process as the connection process. The use of direct processing (mode 1) results in client requests being handled by the connection process. This method provides very low overhead but also limits the supervision capability. If processing stalls, no information can be returned to the client. Moreover, this mode blocks all other requests coming through the given connection. Thus, it is used to process connection control requests, e.g., handshake requests.

Better control of request processing is obtained by using mode 2. In this case, a slave process is started and the main process can supervise request processing and react to any failures by killing the slave process and returning an error. Moreover, this method enables parallelization of request processing. On the other hand, spawning additional processes generates overhead. Mode 2 is used to handle most FClient requests, especially large numbers of asynchronous messages, i.e., events associated with metadata updates.

Mode 3 assumes starting new processes through a single master for a particular request type. These processes are bound to the master process. The master process is supervised by the Supervisor to ensure reliability. It generates additional overhead, especially when many similar requests appear, but also offers great management possibilities for the master process, including killing/restarting all related processes if any of them fail. The master process can be considered a bottleneck in this processing mode. Mode 3 is useful for processing requests that concern data transfers between DMCs. Request processing may fail due to network errors or other types of DMC failures.

The final mode (mode 4) assumes the use of a process pool for load balancing. It enables optimal load control and is especially useful when request processing imposes a high load on any resource. However, similarly to modes 2 and 3, it is expected to generate overhead – in this case, resulting from management of the process pool. The pool is usually dedicated to processing an entire dataset (e.g., moving data from one storage system to another), accelerating request processing while preventing system overloads.

Another feature of the system is support for different metadata access modes. Despite the fact that some existing databases provide in-memory caching, database access still generates access delays. To minimize this effect, a Module Host (see Chapter 4.2.1) – datastore – is responsible for creation of metadata stores in the memory of the Erlang Virtual Machine, enabling instant metadata access. For metadata which requires persistence, the datastore triggers asynchronous write processes in the database. If memory usage increases, a portion of the metadata is cleared from memory and can be reloaded if needed. Storing metadata in the memory of the Erlang Virtual Machine becomes more complicated when DMC is deployed on many nodes. In such scenarios it is crucial to process requests on the nodes that store the metadata required for a particular request. However, replicating metadata between nodes results in higher memory usage and overhead. Thus, the datastore handles access requests in two ways:
1. The first option is the transactional mode. It is access-time oriented. Metadata is replicated to the memory of each node. Modifications are handled using Erlang Mnesia database transactions [84]. Following each modification, an Erlang process is started to persist that modification in the database, if needed. In spite of reduced data access delay, this approach can result in the database being overloaded by asynchronous processes that save changes. Additionally, metadata is replicated to the memory of all nodes and therefore transactions that modify metadata consume resources on all nodes. As a result, increasing the number of nodes only improves read throughput, while write throughput may deteriorate. To increase metadata storage space, it is necessary to increase the memory capacity of each node.

2. The second option is the key manager mode. It requires assigning a master node to each metadata key. Metadata is then kept in the memory of the assigned node, so read operations may be slower than in the former algorithm due to delegation of read requests to the appropriate nodes. However, more metadata may be cached when many nodes are used. All updates of metadata connected to a single key are managed by a single process that also manages asynchronous propagation of metadata changes to the persistent database. The process is able to aggregate requests and process them in batches. It also propagates aggregated changes to the database to reduce database load. It executes modification operations without any transactional context (resulting in lower resource consumption) as it only modifies the metadata connected with a particular key. Another advantage of the presented mechanism is that it simplifies load control.

Figure 4.13: Request flow – different modes with different features
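A simplified sketch of the key manager mode (option 2 above) is given below: a single process per metadata key serializes updates and flushes them to the persistent database in batches. The batch size, the timeout, the message shapes and the stubbed persist/2 function are illustrative assumptions.

    -module(key_manager_sketch).
    -export([start/1, update/2]).

    %% One process per metadata key applies updates in order and batches the
    %% asynchronous writes to the persistent database.
    start(Key) ->
        spawn(fun() -> loop(Key, undefined, 0) end).

    update(KeyManagerPid, Value) ->
        KeyManagerPid ! {update, Value},
        ok.

    loop(Key, Value, Pending) when Pending >= 10 ->
        persist(Key, Value),                 % size-based flush of the latest value
        loop(Key, Value, 0);
    loop(Key, Value, Pending) ->
        receive
            {update, NewValue} ->
                loop(Key, NewValue, Pending + 1)
        after 1000 ->
            case Pending of
                0 -> loop(Key, Value, 0);
                _ -> persist(Key, Value),    % time-based flush
                     loop(Key, Value, 0)
            end
        end.

    persist(_Key, _Value) ->
        ok.  % stand-in for the asynchronous database write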

The implemented system uses the latter metadata caching algorithm for most metadata, caching all keys connected with a given logical file at the same node to enable effective handling of requests concerning that file. The former mechanism is used for metadata that describes users and their permissions with regard to whole datasets, because this metadata is read frequently and rarely changes. The quantity of this metadata is also rather low compared to other metadata types. The former metadata access mechanism ensures low delays while the latter offers better throughput and scalability.

Each method of request processing has its pros and cons. However, reliable and efficient operation of the system hinges upon appropriate cooperation of all presented mechanisms. Appropriate routing of client requests to nodes that cache the required metadata minimizes the metadata access time because no delegation of metadata access requests is needed.

4.3 Summary

This chapter presented the architecture and implementation of two main components: the Data Management Component (DMC) and the FUSE client (FClient). The auxiliary Handler component was also discussed. Since one of the design assumptions is local metadata processing, metadata stores and caches were introduced, along with an efficient metadata update mechanism that is crucial for the system. The designed solution utilizes different mechanisms used by multiprocessor computers. Depending on metadata type, caches use write-through and write-back policies to prevent inconsistency or provide higher efficiency. Push and pull methods are also combined to put/update metadata in selected caches in order to decrease metadata access delay and overhead. Although the aforementioned methods and policies are not, in themselves, novel, it has not been a trivial task to adopt and combine them into a single system that provides the expected efficiency, reliability and data access transparency. The description of the DMC module focused on two aspects:

• DMC Core – a highly scalable and available cluster solution implemented in Erlang,

• a set of DMC modules activated by DMC Core.

The presented elements and mechanisms enable processing of large numbers of requests, including modification of metadata at a massive scale. Thus, they are crucial to the deployment of MACAS in any environment.

5 Experimental Evaluation

Evaluation of large systems is a complex process. It involves validating whether the specification captures stakeholders' needs, as well as ensuring that the software meets the specification. In the case of this thesis, feedback from stakeholders was gathered in the course of participating in several research projects (see Chapter 1.4). To verify the created software, functional and non-functional tests were developed. Testing included creation of unit, integration, acceptance, performance, scalability and stress tests. Code coverage was analyzed to check whether all system elements were exercised.

This chapter only describes those tests that relate to verification of the thesis or evaluation of key elements implemented by the author. More specifically, the first section describes evaluation of DMC Core – a key element of the thesis. As the main goals of DMC Core are efficient and reliable metadata access and storage, performance and integration tests of DMC Core are also presented. The described DMC performance tests have verified non-functional requirements concerning the overhead of the load-balancing subsystem, request handling and metadata access times. The DMC integration tests are white-box tests with mock-ups set up to verify its reliability. Afterwards, in order to fully verify the model, DMC Core is linked with other system elements in order to make it operational for testing purposes. The chapter contains a description of performance, scalability and acceptance black-box tests. Performance and scalability tests verify the non-functional requirement stating that the system implementing MACAS must maintain quality of access (see Chapter 1.2) while acceptance tests verify whether the system provides functionality required by stakeholders as a result of context awareness and usage of datachunks.

Tests were executed using several storage systems. They utilized physical machines, virtual machines and Docker containers depending on the features of the particular test environment. While physical machines are recommended for production deployments, the use of virtualization was necessary for certain tests due to the author's limited permissions and/or the need to execute tests using more nodes than could be provided by a particular environment. When dealing with sufficiently powerful physical or virtual machines, Docker containers were applied to simulate multiple nodes (as a lightweight alternative to standard VMs). However, if the given environment supported only small physical or virtual machines, Docker containers were not used – instead, each machine hosted a single component of the system. Virtual machines used during the tests were deployed on dedicated servers separated from other load for reproducibility of results.

Figure 5.1: Normalized throughput with similar load on all DMC cluster nodes

5.1 DMC Core Tests

DMC Core was tested to verify its three basic features: routing (including load balancing for DMC cluster nodes), different modes of handling and processing requests and different modes of accessing metadata, as outlined in Figure 4.13. The reliability of DMC Core was also verified. The types of nodes used (e.g., DMC cluster Slave Node) are defined in Chapter 4.2.1 and depicted in Figure 4.11. Although Erlang provides functions that can measure time in nanoseconds, it does not guarantee accurate timing in the nanosecond range. Hence, the microsecond resolution has been used.

5.1.1 Evaluation of Request Routing and Processing

Each request can either be handled by the DMC cluster Slave Node that received it, or routed to another node to balance the processing load (see Chapter 4.2.3). The first performance test was originally presented in [154]. The author of the thesis designed the test and analyzed the results. The test measured the system’s throughput in two scenarios:

• similar load on all DMC cluster nodes,

• DMC cluster nodes divided into two groups with different load.

Results were compared with the throughput of a system which handles each request immediately upon receipt, without any analysis or rerouting (referred to as a system with its load-balancing subsystem deactivated). This reference case was selected because it does not introduce any overhead. However, it is also unable to balance load fluctuations. Throughput was compared in an environment which included a DMC cluster Master Node, a DMC cluster Persistence Node and 1-12 DMC cluster Slave Nodes responsible for handling requests. Each node was hosted by a virtual machine with enabled network emulation. As a result, increasing the number of nodes also increased the available computing power.

In the first case, regardless of the number of nodes, throughput remained similar with and without the load-balancing subsystem activated (see Figure 5.1). In the second case, throughput with the load-balancing subsystem activated was over 70% higher than with the load-balancing subsystem deactivated (see Figure 5.2).

Figure 5.2: Normalized throughput with DMC cluster nodes divided into two groups with different load

This demonstrates that the load-balancing subsystem does not introduce any significant overhead while improving throughput in case of sharp load fluctuations.

In order to test request processing, the methods outlined in Chapter 4.2.3 were used. The performance of these methods was tested in an environment consisting of a DMC cluster Master Node, a DMC cluster Persistence Node and two DMC cluster Slave Nodes running inside dedicated Docker containers, sharing 6 cores and 8 GB of RAM. The test included measurements of three values with 100 runs each:

• request handling time at the node which received the request (local handling),

• request handling time at a node other than the one which received the request (remote handling),

• request handling time under load (50000 requests).

The measured request handling time was defined as the time of execution of a testing handle function that verifies the request content and responds with either a confirmation or an error message. Depending on the processing method, the test handle function may also include delegation of request processing to another process (slave process, master or pool process); the time of delegation is then included in the execution time. In order to measure the average request handling time for a load of 50000 requests, 50000 Erlang processes were started on DMC cluster Slave Nodes. Each process generated a request, executed the testing handle function, measured its execution time and finally terminated. Results are summarized in Table 5.1.

Table 5.1 reveals that the direct processing mode (1), where each request is processed by the process which handles the corresponding client's connection, introduces very low overhead. Processing by a separate (i.e. slave) process (mode (2)) generates greater overhead. As long as the system is not loaded, the overhead involved in starting local processes remains low. When the request is processed by a remote node, additional time is needed for communication between nodes. Such delegation is performed only when the local node is highly loaded. Thus, process initialization overhead increases when multiple requests are processed in parallel, but remains acceptable. Mode (3) involves launching new processes through a single master. This also generates overhead, especially when many similar requests arrive in short order. In the final mode, (4), which assumes the use of a process pool for load balancing, overhead is also significant – specifically, lower than the overhead of mode (3) when the system is highly loaded (12740 µs vs. 72930 µs), but higher in other cases.

Table 5.1: Request handling time and characteristics of request processing modes

No. | Processing mode | Local handling time [µs] | Remote handling time [µs] | Handling time when processing 50000 requests [µs] | Characteristics
1 | direct processing | <1 | − | <1 | Low overhead
2 | processing in slave process | 6.8 ± 1.0 | 206 ± 8 | 475 ± 19 | Single request control, best parallelization of processing
3 | supervision by single master | 21.1 ± 3.1 | 303 ± 12 | 72930 ± 1640 | Supervision over class of requests
4 | process pool | 56.3 ± 5.8 | 512 ± 20 | 12740 ± 494 | Load control

The tests reveal that the overhead of all modes remains acceptable when these modes are applied in accordance with their intended purpose (see Chapter 4.2.3). The actual overhead introduced by each mode depends on the additional functionality available in that mode. While the overhead is higher when handling requests remotely, such delegation decreases the load of the node delegating the request and may yield benefits when load is not balanced across nodes. The overhead of processing modes (3) and (4) increases substantially when many concurrent requests utilize a single master or a single pool. However, processing modes (3) and (4) are designed to handle only a limited number of heavyweight requests at any given time.
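The four modes are implemented with Erlang processes inside DMC Core; the following Python sketch is only a loose analogue (threads instead of Erlang processes) meant to illustrate why handling a lightweight request directly is cheaper per request than funnelling every request through a single master or a shared pool. None of the names below correspond to the actual implementation.

import time
from concurrent.futures import ThreadPoolExecutor

def handle(request):
    # Trivial stand-in for request verification.
    return "ok"

def per_request_us(run, n=2000):
    start = time.perf_counter()
    run(n)
    return (time.perf_counter() - start) / n * 1e6

def direct(n):
    # Analogue of mode (1): the receiving context handles the request itself.
    for i in range(n):
        handle(i)

def single_master(n):
    # Analogue of mode (3): one worker serializes all requests.
    with ThreadPoolExecutor(max_workers=1) as master:
        list(master.map(handle, range(n)))

def pool(n, workers=4):
    # Analogue of mode (4): requests go through a fixed worker pool.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(handle, range(n)))

if __name__ == "__main__":
    print(f"direct:        {per_request_us(direct):8.2f} us/request")
    print(f"single master: {per_request_us(single_master):8.2f} us/request")
    print(f"pool:          {per_request_us(pool):8.2f} us/request")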

5.1.2 Metadata Access Evaluation

The performance of metadata access was tested in environments consisting of a single DMC cluster Master Node, several DMC cluster Slave Nodes and several DMC cluster Persistence Nodes. The first environment included one DMC cluster Slave Node for request processing and one DMC cluster Persistence Node for metadata storage. The second environment included three DMC cluster Slave Nodes for request processing and three DMC cluster Persistence Nodes for database hosting. The third and fourth environments included 10 and 14 nodes respectively – half of them for request processing and the rest for database hosting. All nodes were encapsulated by Docker containers sharing 6 cores and 8 GB of RAM. DMC cluster Slave Nodes and DMC cluster Persistence Nodes were able to utilize multiple cores, hence DMCs had access to similar computing power in all environments. Tests were performed for four configurations representing various combinations of metadata storage/update and persistence modes (cf. Table 5.2). In line with Chapter 4.2.3 two metadata storage/update options were used:

1. key manager mode – storing/caching metadata in the memory of the selected node, execution of metadata writes through a dedicated process for each metadata key,

2. transactional mode – storing/caching metadata copies in the memory of all nodes, execution of metadata writes in multi-node transactions.

Table 5.2: Test configurations for metadata access

Configuration | Metadata storage/update mode | Metadata persistence mode
1 | Key manager mode | Memory only
2 | Key manager mode | Persistent database with memory cache
3 | Transactional mode | Memory only
4 | Transactional mode | Persistent database with memory cache

Table 5.3: Average metadata access times for different configurations and computing environments

Configuration | 1 Slave Node: Put / Get [µs] | 3 Slave Nodes: Put / Get [µs] | 5 Slave Nodes: Put / Get [µs] | 7 Slave Nodes: Put / Get [µs]
1 | 70.2 ± 4.2 / 6.1 ± 0.9 | 286 ± 8 / 183 ± 7 | 353 ± 10 / 250 ± 8 | 400 ± 12 / 310 ± 9
2 | 99.1 ± 4.9 / 7.8 ± 1.1 | 323 ± 9 / 215 ± 7 | 387 ± 10 / 255 ± 8 | 476 ± 14 / 312 ± 9
3 | 88.4 ± 5.1 / 3.3 ± 0.5 | 966 ± 21 / 4.0 ± 0.5 | 1527 ± 29 / 4.2 ± 0.6 | 2172 ± 39 / 5.5 ± 0.8
4 | 133 ± 6 / 5.9 ± 0.9 | 1393 ± 30 / 7.2 ± 1.0 | 2188 ± 38 / 9.1 ± 1.3 | 3123 ± 61 / 16.4 ± 1.7

This was coupled with two metadata persistence modes, i.e., in-memory storage and database storage with a memory cache. For each configuration, tests were executed 1000 times in each of the available computing environments. Each test run included metadata put and get operations. Each operation accessed a metadata document with the following contents: record name, two binary values, two integer values, a boolean value, two short lists of binaries and a nested 4-field record. Table 5.3 shows the average metadata operation time.

In the single DMC cluster Slave Node environment there is no significant difference between the two modes – the former offers slightly faster put operations, while in the latter get operations are slightly faster. Both modes introduce overhead when using a database. This is due to additional operations needed to asynchronously save the document in the database, or to execute a fallback query if the required metadata is not found in memory. The test reveals overhead associated with increases in the number of nodes, given that all nodes share physical resources. While adding database nodes (DMC cluster Persistence Nodes) consumes resources and negatively affects access time, it can be shown that the tested modes offer varying characteristics. The key manager mode introduces overhead which is associated with delegating requests to dedicated processes (depending on the metadata key). Since the test did not involve any usage of the routing algorithm for optimal selection of a Slave Node, overhead increases slowly along with the number of nodes, due to the increasing probability that an access operation will be handled by a remote node.

The transactional mode (i.e. configurations 3 and 4) introduces high overhead for put operations but supports rapid get operations. The latter feature is available because each node hosts a copy of the metadata. However, transactional updates of such copies result in substantial increases in put times corresponding to the number of nodes in the system.

The test reveals important differences in the characteristics of each mode. In addition to different access times, the greater memory consumption of the transactional mode should be emphasized. Table 5.4 shows the number of memory slots (each slot is used to store a single metadata document in the memory of a single node) occupied at the end of the test.

Table 5.4: Number of memory slots occupied at the end of the test for different configurations and computing environments

Configuration | 1 Slave Node | 3 Slave Nodes | 5 Slave Nodes | 7 Slave Nodes
1 | 1000 | 1000 | 1000 | 1000
2 | 1000 | 1000 | 1000 | 1000
3 | 1000 | 3000 | 5000 | 7000
4 | 1000 | 3000 | 5000 | 7000

While memory consumption associated with the key manager mode (i.e. configurations 1 and 2) depends on the number of stored documents, in the transactional mode (i.e. configurations 3 and 4) it further depends on the total number of nodes. Thus, while the key manager mode supports adding nodes to increase the pool of memory available for metadata storage and caching, the transactional mode instead calls for the memory of each node to be extended if more metadata is to be stored or cached. Although the key manager mode is more scalable, the transactional mode can be useful for optimization of access to small amounts of frequently read metadata. The test also shows that the metadata access time can be kept low when the appropriate (according to needs, see Chapter 4.2.3) storage/update mode is used. In this case the system can scale (in terms of available memory slots) not only by expanding the memory of each node but also by increasing the number of nodes. Thus, similarly to the request processing tests, the metadata access tests prove that the tested modes provide a sound basis for the whole system. Moreover, to store metadata efficiently, an appropriate mechanism for routing requests is required (see Chapter 4.2.3).
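As an illustration of the put/get operations that were timed, the sketch below builds a document of the same shape as the one used in the test and measures accesses against a plain in-memory dictionary; the field names and the dictionary store are illustrative assumptions, not the DMC Core data model.

import time
from dataclasses import dataclass
from typing import List

@dataclass
class Nested:
    # Hypothetical 4-field nested record.
    a: int = 0
    b: int = 0
    c: str = ""
    d: bool = False

@dataclass
class MetadataDoc:
    # Same shape as the document used in the test: a record name, two binary
    # values, two integers, a boolean, two short lists of binaries and a
    # nested 4-field record (field names are illustrative).
    record_name: str
    bin1: bytes
    bin2: bytes
    int1: int
    int2: int
    flag: bool
    list1: List[bytes]
    list2: List[bytes]
    nested: Nested

store = {}  # stand-in for an in-memory metadata cache

def put(key, doc):
    store[key] = doc

def get(key):
    return store[key]

if __name__ == "__main__":
    doc = MetadataDoc("file_meta", b"uuid-1", b"owner-1", 1, 2, True,
                      [b"x", b"y"], [b"z"], Nested(1, 2, "loc", True))
    start = time.perf_counter_ns()
    for i in range(1000):
        put(f"key_{i}", doc)
        get(f"key_{i}")
    print(f"avg put+get: {(time.perf_counter_ns() - start) / 1000 / 1000:.2f} us")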

5.1.3 Reliability Evaluation

For evaluation of DMC reliability, a set of possible failure cases was defined and simulated by mocking appropriate modules and functions. This evaluation was based on monitoring system behavior. The test environment consisted of 13 nodes – a DMC cluster Master Node, a DMC cluster Reserve Master Node, 5 DMC cluster Slave Nodes for request processing, 5 DMC cluster Persistence Nodes for database hosting and 1 node hosting the FClient. All nodes were hosted in Docker containers sharing 6 cores and 8 GB of RAM. Results of this test are shown in Table 5.5.

Table 5.5: Results of reliability tests

No. | Simulated failure case | Observed actions performed by the system
1 | Crash of process that coordinates provisioning of particular functionality | Module is restarted by supervisor
2 | Crash of Slave Node | FClient connects to another node, load balancing algorithm is requested to exclude crashed node from Central Manager replies
3 | Crash of Master Node | Manager is started at Reserve Master Node
4 | Failure of network between Slave Node and Master Node | All sessions are terminated by Node Manager of Slave Node to prevent metadata inconsistency, other elements of the system act as if Slave Node has crashed, load balancing algorithm is requested to exclude problematic node from Central Manager replies

The test shows that when one of the specified cases occurs, the system is able to continue working without human intervention. Figure 5.3 shows log fragments that contain descriptions of the actions triggered. For case 1 the log shows that the supervisor noticed the crash of the Erlang process and was forced to react. In case 2 the log shows that the crash of the Slave Node was noticed by the manager of DMC and FClient was subsequently forced to connect to another Slave Node. The case 3 log shows that the DMC master was restarted on the Reserve Master Node following failure. Logs for case 4 show the reaction of the DMC manager, which is similar to case 2, along with a confirmation of network problems resulting in termination of all active sessions. The test also reveals two basic mechanisms used in case of failure – detecting process crashes and detecting node/network crashes. The following elements are involved in monitoring activities during tests:

• Supervisor – for discovering crashed processes and restarting them (case 1).

• Node Manager of Master Node – for discovering crashes of all nodes (case 2) or network failures on particular nodes (case 4), and sending appropriate recommendations to Request Dispatchers and the DNS module.

• Node Manager of Slave Node – for discovering network failures (case 4) and halting activities on the affected node.

• Erlang node monitor (i.e. Erlang mechanism) – for discovering crashes of the entire appli- cation (case 3) and starting reserve instances.

When a failure is discovered, two other elements may also become involved:

• Request Dispatcher – for ensuring that internal requests are routed to healthy nodes only,

• DNS module – for issuing recommendations for clients, suggesting that they route their requests to healthy nodes only.

As follows from the presented results, DMC Core provides the functionality needed to build a reliable system.
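In the actual system these mechanisms are provided by Erlang supervisors and the Erlang node monitor; the Python sketch below only illustrates the general supervision pattern observed in the tests: restart a crashed worker and, if it cannot be kept alive, exclude its node from the set recommended to clients. All names and timings are illustrative assumptions.

import multiprocessing as mp
import time

def worker(name):
    # Stand-in for a module that coordinates provisioning of some
    # functionality; it exits abnormally after a moment to trigger
    # the supervision logic below.
    time.sleep(0.2)
    raise SystemExit(1)

def supervise(name, healthy_nodes, checks=3):
    proc = mp.Process(target=worker, args=(name,))
    proc.start()
    for _ in range(checks):
        time.sleep(0.5)
        if not proc.is_alive():
            print(f"{name}: worker crashed, restarting it")
            proc = mp.Process(target=worker, args=(name,))
            proc.start()
    time.sleep(0.5)
    if not proc.is_alive():
        # A node that cannot keep its services running would be excluded
        # from routing recommendations (Request Dispatcher / DNS module).
        healthy_nodes.discard(name)
    else:
        proc.terminate()
    proc.join()

if __name__ == "__main__":
    nodes = {"slave1", "slave2"}
    supervise("slave1", nodes)
    print("nodes recommended to clients:", sorted(nodes))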

5.2 Performance Evaluation of Integrated System

Performance evaluation tests measure the overhead and scalability in a single-DMC environment. Specifically, measurements concern the throughput of data access by a single process and the aggregated throughput of data access performed by multiple processes. The overhead in a multi-DMC environment is also measured. The tests assume that the system is a black box where many factors (including elements not implemented by the author) may influence performance. However, good performance of the algorithms designed and implemented by the author is a necessary condition for achieving good results in the tests described in this chapter. Thus, the described tests are an important aspect of the verification of this thesis.

Figure 5.3: Fragments of logs from reliability tests

5.2.1 Overhead Evaluation

The aim of these tests was to compare the observed system overhead to other typical systems and settings. The test measured the throughput of a single process accessing data through a system which implements the MACAS model (using FClient) and the throughput of direct access to the storage system. To better simulate possible use cases, the tests utilized different storage systems along with various scripts/tools, and were deployed in different hardware configurations (see Table 5.6). The storage systems included cluster storage and object storage. The test environment varied between a single virtual machine and a large Grid storage system (GPFS). The hardware included machines with SSD storage as well as extremely limited virtual machines with 1 virtual CPU and 0.5 GB RAM. Tools used in the test included a popular open-source benchmark (SysBench [78]), a programming API and dd – a command-line utility commonly used by administrators to verify filesystem performance. All test environments included FClients with direct access to the storage system. Additionally, tests involving the EVS storage system and the Python file access API (see Table 5.6) included measurements of throughput when the FClient is forced to proxy all storage system operations through DMC. Each test was executed 20 times.

Table 5.6: Description of overhead tests

Storage system: S3
  Storage system scale: Small (local SSD storage)
  Hardware and environment: Single 8 vCPU 8 GB RAM virtual machine that hosts 3 Docker containers with DMC and its DB, FClient and S3 Proxy [97] that utilize local SSD storage
  Used tool: cp, AWS CLI [64]
  Test settings: Single thread, copy/access whole file/object, 1 GB read/written
  Test aim: Verify ability to adapt the system to particular storage type

Storage system: NFS
  Storage system scale: Small (single virtual machine for data storage)
  Hardware and environment: Four virtual machines, 1 vCPU 0.5 GB RAM each, for DMC, its DB, FClient and NFS Server that utilize local low-performance storage
  Used tool: SysBench
  Test settings: 1-16 active threads that perform read/write operations, 10 GB read/written
  Test aim: Validate the system in a small resource-constrained environment

Storage system: EVS [73]
  Storage system scale: Medium (storage system used by single cluster)
  Hardware and environment: Two virtual machines, 4 vCPU 8 GB RAM each (one for DMC with its DB and one for FClient) with directly accessible EVS storage
  Used tool: Python file access API
  Test settings: 5 threads, operations on 10 MB block, 1 GB read/written
  Test aim: Validate the system in a cluster environment accessed by a programming API

Storage system: GPFS
  Storage system scale: Large (system used by nodes of the Zeus [110] supercomputer)
  Hardware and environment: Grid worker node with one core reserved for FClient; 4 CPU 16 GB RAM machine for DMC with its DB; worker node directly connected to GPFS, shared with other worker nodes that execute jobs
  Used tool: dd with different block sizes
  Test settings: Work with 4-640 kB blocks, 150 GB read/written
  Test aim: Validate the system in a typical Grid environment

Table 5.7: Throughput of a system implementing the MACAS model in comparison with direct access

Test description | Measured value | Direct access [MB/s] | MACAS-compliant system [MB/s]
S3 / cp, AWS CLI | Write | 123 ± 5 | 266 ± 10
S3 / cp, AWS CLI | Read | 284 ± 9 | 610 ± 15
NFS / SysBench | Write | 14.1 ± 0.1 | 14.0 ± 0.1
NFS / SysBench | Read | 11.3 ± 0.1 | 11.2 ± 0.1
EVS / Python, direct access | Write | 413 ± 10 | 415 ± 9
EVS / Python, direct access | Read | 368 ± 8 | 398 ± 8
EVS / Python, proxy via DMC | Write | − | 100 ± 6
EVS / Python, proxy via DMC | Read | − | 71 ± 5
GPFS / dd | Write | 100 ± 1 | 101 ± 2
GPFS / dd | Read | 96.1 ± 2.2 | 97.4 ± 2.8

Table 5.7 presents the average throughput. No additional overhead of the system could be observed. Although the system consumes resources when processing metadata, data access throughput is similar to or higher than in the direct-access scenario, even in environments with low computing power and memory. The throughput is much higher when using S3 storage due to splitting and parallelization of access requests as well as prefetching.

The throughput of the network connecting the virtual machines which host FClient and DMC (measured using the scp command) in the EVS storage environment (see Table 5.6) was measured at 115 MB/s. The throughput of data access when FClient was forced to proxy operations on the storage system via DMC was 71 MB/s for read and 100 MB/s for write requests. Clearly, the need to access data via DMC limits the throughput (115 MB/s was not reached), and the test shows that read operations are more significantly affected. Regardless of the environment type, the system is able to maintain access quality in terms of performance. Achieving high throughput calls for processing large amounts of metadata. Consequently, the following section is dedicated to system scalability tests.
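As an illustration of how such a comparison can be scripted, the sketch below measures sequential write and read throughput for a file placed on a directly accessible storage path and on a path exposed through an FClient mountpoint. Both paths, the block size and the data volume are assumptions; the actual tests used the tools listed in Table 5.6, and caches should be dropped (or the volume made large enough) for a fair read comparison.

import os
import time

BLOCK = 4 * 1024 * 1024            # 4 MB write/read block
TOTAL = 256 * 1024 * 1024          # 256 MB per run (1 GB was used in the tests)

def write_read_mbs(path):
    # Sequential write followed by a sequential read; returns (write, read) MB/s.
    data = os.urandom(BLOCK)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(TOTAL // BLOCK):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
    write_mbs = TOTAL / (1024 * 1024) / (time.perf_counter() - start)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass
    read_mbs = TOTAL / (1024 * 1024) / (time.perf_counter() - start)
    os.remove(path)
    return write_mbs, read_mbs

if __name__ == "__main__":
    # Hypothetical locations: direct storage access vs. access via FClient.
    targets = [("direct", "/mnt/storage/overhead_test.bin"),
               ("via FClient", "/mnt/fclient/space1/overhead_test.bin")]
    for label, path in targets:
        w, r = write_read_mbs(path)
        print(f"{label:12s} write {w:7.1f} MB/s   read {r:7.1f} MB/s")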

5.2.2 Evaluation of Scalability and System Limits

A dedicated server was used to simulate access to a very fast storage system. Its quad-channel memory configuration (4x8 GB DDR4 3200 MHz CL16 RAM) enabled very fast reads from the storage cache. The CPU was an Intel Core i7-5960X @ 4.5 GHz (overclocked, 8 cores, 16 threads, 20 MB L3 cache). During the test, system elements were deployed in separate Docker containers. In accordance with the test scenario, multiple tasks performed single sequential reads of 10 GB of data using the dd command-line utility. The generated load produced monitoring events that gave detailed insight into the FClients' activity. The total aggregated throughput was measured using two methods (see Figure 5.4): (1) by summing the throughput reported by the dd command-line utility, and (2) by dividing the total size of data by the overall read time, measured from the moment of initiating the first read operation with the dd command-line utility until all operations had completed. The first method overestimates throughput (since not all processes are started at exactly the same time), while the second method underestimates it (since it factors in startup and post-test cleanup). Additionally, the htop tool [79] was used to check which system components consume computing power. The test was performed in two variants, using the storage memory cache and a null handler – a special storage access handler which generates data on the fly to minimize access time.

The overall throughput depended on the number of FClients, and the system achieved its highest throughput for 16 FClients. According to the htop tool, over 70% of the computing power used by the system was consumed by FClients. Each FClient performs most of its work in a single thread; thus running a higher number of FClient instances results in greater aggregated computing power for event and metadata processing. If the number of FClients is greater than the number of CPU threads, the overall throughput declines slowly due to CPU access contention. The test confirmed the scalability of the system (limited by the number of independent threads in the experiment) and, indirectly, the good quality of DMC Core. The resources available on an 8-core machine allow supervision of data access with a total throughput of around 20 GB/s. Due to the use of DMC Core, which can run on a cluster, it is easy to increase the computing power available to the management component. Moreover, the test revealed that most computing power is consumed by FClients, and in typical production deployments these FClients would be running on dedicated machines. Thus, deploying DMC on a single machine would typically result in supervision of data access with a higher total throughput than during the reported test.

The test also exposed limitations of the system's implementation. The FClient performs most of its activities in a single thread, which limits data access throughput. Thus, the throughput of a single FClient can be perceived as a bottleneck in an environment with very fast storage.
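The two aggregation methods can be summarized with a small calculation; the sketch below uses illustrative timing tuples rather than measured data and shows why method (1) gives an upper bound and method (2) a lower bound.

def aggregate_throughput(runs):
    # runs: list of (start_s, end_s, bytes_read) tuples, one per dd-like reader.
    gib = 1024 ** 3
    # Method 1: sum of per-process throughputs; overestimates, because the
    # processes do not all run over the same wall-clock interval.
    summed = sum(size / (end - start) for start, end, size in runs) / gib
    # Method 2: total data divided by wall-clock time from the first start to
    # the last finish; underestimates, since startup and cleanup are included.
    wall = max(end for _, end, _ in runs) - min(start for start, _, _ in runs)
    overall = sum(size for _, _, size in runs) / wall / gib
    return summed, overall

if __name__ == "__main__":
    ten_gib = 10 * 1024 ** 3   # each reader reads 10 GB, as in the test
    sample = [(0.0, 5.0, ten_gib), (0.3, 5.6, ten_gib), (0.1, 5.2, ten_gib)]
    upper, lower = aggregate_throughput(sample)
    print(f"sum of per-process rates: {upper:.2f} GB/s (upper bound)")
    print(f"total data / wall time:   {lower:.2f} GB/s (lower bound)")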

Figure 5.4: Total aggregated throughput and CPU usage

Table 5.8: Total aggregated throughput and number of operations per second

Number of FClients | Throughput [GB/s] | Operations per second [10^6/s]
1 | 1.3 | 0.33
3 | 4.1 | 1.1
5 | 6.9 | 1.8

The limits of a single FClient were also tested using a virtual machine hosted by a node of the Prometheus supercomputer [94]. The data access throughput reported by the dd command-line utility on this virtual machine was approximately 7 GB/s. Subsequently, a large volume of data was accessed through the system using 1, 3 and 5 FClients respectively. A reduction in throughput was observed when using 1 or 3 FClients. However, as the number of FClients increases, the system can easily process greater numbers of operations per second (measured internally), with throughput close to 7 GB/s (see Table 5.8). Thus, while a bottleneck is present in the system, it can be easily overcome via simultaneous processing of data using multiple FClients. Moreover, this limitation does not follow from the assumptions of the MACAS model, but is instead related to the particular implementation of the FClient and can likely be ameliorated with an optimized multithreaded implementation.

The test revealed that the system is able to process metadata related to data access via several FClients. Although it exposed certain limitations in the presented implementation, these limitations can easily be overcome through appropriate deployment. Thus, the test proved that the system is capable of scaling up.

5.2.3 Evaluation of Overhead in a Multi-DMC Environment

The tests shown in Chapter 5.2.1 proved that the system which uses a single Data Management Component is able to maintain access quality in terms of performance. The test presented in this chapter checks whether increasing the number of DMCs in the environment introduces any overhead. This test was originally presented in [148]. The author of the thesis designed the test and analyzed its results.

To evaluate the overhead, a test environment was set up with 9 virtual machines possessing public IP addresses. Each virtual machine had a 4x2.6 GHz CPU and 8 GB of RAM. The Data Management Component and FClients were run in Docker containers. The test procedure was as follows. A file was created in a dataset shared across all DMCs. Data was written by an FClient connected to the first DMC and subsequently both read and written by FClients connected to the other DMCs. Additionally, each DMC managed a local dataset (not shared with any other DMC) upon which similar writes and reads were carried out. The file size was 1 GB. Average results of 20 test runs are shown in Figure 5.5 [148]. The first read for DMCs 2-9 is presented separately since it requires data transfers between DMCs, which are limited by the network.

Figure 5.5: Data access throughput for local and shared datasets

The figure shows the throughput for accessing data stored in: (1) a local dataset, and (2) a dataset shared across multiple DMCs (referred to as the shared dataset), for which metadata synchronization between DMCs occurs. The throughput for local and shared datasets remains similar, indicating that metadata synchronization does not introduce significant overhead. The average throughput for the first reads is 50 MB/s, while the average throughput for inter-VM communication, measured using scp, is equal to 75 MB/s. This results from the behavior of FUSE, which reads data blocks sequentially, waiting for data to arrive before sending the next request. The FClient partially compensates for this by prefetching, which, by default, is tuned for accessing fragments of large files. When more aggressive prefetching is selected (which requests all file blocks at the same time), the average throughput for the first read increases to 81 MB/s.

Given that the observed throughput for local and shared data reads/writes remains similar, the test proves that the system maintains quality of access in terms of performance in a multi-site environment (single-site tests are described in Chapter 5.2.1). However, the read throughput when data is transferred between sites strongly depends on prefetching settings.
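The distinction between the first (remote) read and subsequent (local) reads can be illustrated with a simple timing loop over the FClient mountpoints; the mountpoint paths and file name below are assumptions made for the sake of the example.

import time

BLOCK = 4 * 1024 * 1024

def read_mbs(path):
    # Single sequential pass over the file; returns throughput in MB/s.
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        chunk = f.read(BLOCK)
        while chunk:
            total += len(chunk)
            chunk = f.read(BLOCK)
    return total / (1024 * 1024) / (time.perf_counter() - start)

if __name__ == "__main__":
    # The 1 GB file was written through the FClient attached to DMC 1 into the
    # shared dataset; FClients of DMCs 2-9 read it afterwards.
    for i in range(2, 10):
        path = f"/mnt/fclient{i}/shared/file_1g"
        first = read_mbs(path)    # triggers an inter-DMC data transfer
        second = read_mbs(path)   # served from the local replica
        print(f"DMC {i}: first read {first:6.1f} MB/s, repeated read {second:6.1f} MB/s")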

Figure 5.6: Changing distribution of file fragments among DMCs

Figure 5.7: Test environment for comparing management policies

5.3 Datachunk Management Evaluation

The main features of the system – elasticity and efficiency – are strictly related to data description with datachunks. While the key to achieving high data access performance is to work with local datachunks, ease of access requires that management of datachunks be automatic and not involve the user.

This chapter begins with a test of default datachunk management, involving several operations on a file. These operations are performed by 4 FClients connected to 4 DMCs (FClient A to DMC A, FClient B to DMC B, etc.). Following each operation, datachunk locations are investigated (see Figure 5.6). At the beginning of the test FClient A creates a file (1). Other DMCs synchronize the metadata of the file, but do not store its contents. Subsequently, FClient B overwrites some parts of the file (2), which results in invalidation of the fragments stored in DMC A. At this point the new contents are stored in DMC B. In (3) FClient D reads some parts of the file, which results in replication of these parts to that DMC's storage. Afterwards the corresponding FClient appends some data to the file (4) and other DMCs are informed about the new file size, but do not replicate the new fragment until necessary. Next, a read operation on the file is performed in DMC C, resulting in complete replication to this DMC's storage (5). However, a subsequent overwrite of the whole file performed in DMC B invalidates all replicas held by other DMCs (6). File fragmentation is handled automatically by the system and remains transparent to the user – all read and write operations are performed regardless of the location of the FClient and the distribution of file contents.

The next test examines whether the system is able to provide different characteristics by relying on datachunks, differences in configurations of subsystems (e.g., different event aggregation times) and different sets of actions automatically triggered when DMC receives a message (e.g., an event describing the modification of a datachunk). This set of actions, along with the corresponding configuration, will be further referred to as a policy. The test environment contains processes that produce and consume data through FClients connected to different DMCs (see Figure 5.7). The test scenario is as follows:

1. The data producer writes data to an empty file.

2. The data consumer monitors the file to check whether its size has changed.

3. The data consumer reads the beginning of the file.

4. The data consumer reads the beginning of the file a second time.

5. The data consumer reads the remaining data from the file.

The test scenario was executed four times – each time with a different policy. For a better description of the policies, assume the following abbreviations: DMC1 – the Data Management Component supervising the FClient that reads data, DMC2 – the Data Management Component supervising the FClient that writes data (see Figure 5.7). By default, no data is copied when information about datachunk modification is propagated (see Chapters 3.2 and 4.1). When DMC1 provides to an FClient a datachunk that is stored on a storage resource managed by DMC2, a copy of this datachunk is saved on a storage resource managed by DMC1. The test policies specify the following system behavior (a configuration sketch is given after the list):

1. Fast read policy – when DMC1 is informed that a datachunk has been created/updated on a storage resource managed by DMC2 (metadata describing this datachunk has been synchronized between DMCs), this datachunk is immediately requested by DMC1 and copied to a storage resource managed by DMC1. Additionally, aggregation of events is reduced to ensure faster propagation of changes. Prefetching is disabled.

2. Storage space usage reduction policy – when DMC1 provides to an FClient a datachunk that is stored on a storage resource managed by DMC2, the copy created on a storage resource managed by DMC1 is deleted immediately following use by the interested FClients. Prefetching is disabled.

3. Balanced policy – when DMC1 provides to an FClient a datachunk that is stored on a storage resource managed by DMC2, the copy saved on a storage resource managed by DMC1 is deleted after a fixed time since its last usage (when the datachunk is used several times, its deletion is postponed). Prefetching is disabled.

4. Balanced policy with aggressive prefetching – this policy is similar to policy 3. Additionally, when an FClient connected to DMC1 lists the contents of a directory, DMC1 creates temporary copies of the beginnings of all files in that directory which are not already present on a storage resource managed by DMC1. This behavior is disabled for big directories (containing more than 200 files). When the FClient reads a datachunk from a file, the following datachunks are downloaded to DMC1 storage.
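A policy of this kind boils down to a handful of knobs; the sketch below expresses the four test policies as configuration records. The field names, TTL values and aggregation times are illustrative assumptions and do not reflect the system's actual configuration format.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DatachunkPolicy:
    replicate_on_remote_change: bool  # copy a chunk as soon as its metadata arrives
    local_copy_ttl: Optional[float]   # seconds to keep a copy after last use (None = keep)
    event_aggregation_s: float        # how long FClient events are aggregated
    prefetch: str                     # "off", "sequential" or "aggressive"

POLICIES = {
    1: DatachunkPolicy(True, None, 0.1, "off"),          # fast read
    2: DatachunkPolicy(False, 0.0, 1.0, "off"),          # storage space usage reduction
    3: DatachunkPolicy(False, 300.0, 1.0, "off"),        # balanced
    4: DatachunkPolicy(False, 300.0, 1.0, "aggressive"), # balanced + aggressive prefetching
}

def on_remote_chunk_modified(policy, chunk_id):
    # Action taken by DMC1 when metadata synchronization reports that a
    # datachunk held by DMC2 has been created or updated.
    if policy.replicate_on_remote_change:
        return f"schedule immediate transfer of {chunk_id} to local storage"
    return f"record the new version of {chunk_id}; transfer it on demand"

if __name__ == "__main__":
    for number, policy in POLICIES.items():
        print(f"policy {number}: {on_remote_chunk_modified(policy, 'chunk-42')}")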

Table 5.9 summarizes the differences in system behavior observed while executing the test scenario with different policies. The aim of the test was to verify whether the system behavior may be adapted to different needs (e.g., frequent data updates vs. read-only access to existing data). Thus, instead of measuring specific values, we compare data flows.

Table 5.9: Comparison of management policies

Policy 1 (fast read):
  Processor usage: High
  Storage usage: High
  Network interface usage: High with data changes, low with multiple reads of the same data
  First data access delay: Low
  First read throughput: Limited by local storage or network throughput if data replication is not finished
  Further reads: From local copy

Policy 2 (storage space usage reduction):
  Processor usage: Low
  Storage usage: Low
  Network interface usage: Low with data changes, high with multiple reads of the same data
  First data access delay: High
  First read throughput: Limited by network throughput
  Further reads: From remote copy

Policy 3 (balanced):
  Processor usage: Low
  Storage usage: Medium – additional space temporarily consumed by accessed data
  Network interface usage: Low – only with first read of data
  First data access delay: High
  First read throughput: Limited by network throughput
  Further reads: From local copy

Policy 4 (balanced with aggressive prefetching):
  Processor usage: Low
  Storage usage: Medium – additional space temporarily consumed by accessed and prefetched data
  Network interface usage: Low for sequential reads; medium for random reads – transfers of potentially unnecessary data (policy does not involve detection of random reads to disable prefetching)
  First data access delay: Low for sequential reads; high for random reads if requested data is not prefetched
  First read throughput: Limited by local storage or network throughput if reading outpaces prefetching; limited by network throughput for random reads
  Further reads: From local copy

One can see that the policies utilize resources in different ways. Policy 1 uses more CPU power due to the reduced aggregation of events. However, the main difference is the balance between storage and network utilization. Policy 1 consumes a lot of storage and network bandwidth when the data is written. However, read operations are local and do not use significant network resources. In contrast, policy 2 has low storage requirements. It also does not intensively utilize network resources while writing data. The tradeoff is high network usage and delays when reading data produced at another site. Although the system only transfers data once when many FClients (connected to a particular DMC) read it in parallel, data must be transferred multiple times if repeatedly demanded at different points in time. Policies 3 and 4 represent a compromise between 1 and 2. They allow temporary utilization of additional storage space to make data access more local – thereby reducing network usage during multiple reads of the same data. Considering policy 3, only the first read action following modification of remote data is, in itself, remote. Policy 4 solves the problem for sequential reads through prefetching. However, this results in higher network usage associated with small random reads of different parts of a large file.

The tests described in this chapter prove that the system manages data storage in a transparent manner and is able to adapt its actions to the selected policy.

5.4 Context Awareness Evaluation

To show context awareness, an environment with two DMCs and a legacy dataset was created (see Figure 5.8). The environment features two FClients connected to DMCs (one per DMC). Additionally, a legacy dataset may be modified directly within the storage system (without involvement of the data access system). The system is configured to create replicas of read blocks. However, if a replica created in DMC1 is not used for a predefined period of time (set to 5 minutes in the experiments), it is removed. The test scenario is as follows:

1. Create two new files, f1 and f2, directly in the storage system (inside a legacy dataset). List files directly via storage system.

2. List files via FClient 1 and FClient 2.

3. Check data distribution via FClient 1 and FClient 2.

4. Read files f1 and f2 via FClient 1.

5. Check data distribution via FClient 1 and FClient 2.

6. Append 5 bytes to f1 via FClient 1.

7. Overwrite f2 directly in the storage system.

8. List files and check data distribution via FClient 1 and FClient 2.

9. Read f1 via FClient 2.

10. Read f2 via FClient 2.

11. Wait (for 10 minutes in the experiment).

12. Check data distribution via FClient 1 and FClient 2.
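The scenario can be driven by a simple script; the sketch below covers a few of the steps, using hypothetical mountpoints for the legacy storage and the two FClients and an arbitrary waiting time for metadata synchronization.

import os
import time

STORAGE = "/mnt/legacy_storage/dataset"   # direct path to the legacy dataset
FCLIENT1 = "/mnt/fclient1/legacy"         # mountpoint of FClient 1 (DMC 1)
FCLIENT2 = "/mnt/fclient2/legacy"         # mountpoint of FClient 2 (DMC 2)

if __name__ == "__main__":
    # Step 1: create f1 and f2 directly in the storage system, bypassing the
    # data access system, and list them via the storage path.
    for name in ("f1", "f2"):
        with open(os.path.join(STORAGE, name), "wb") as f:
            f.write(os.urandom(1024))
    print("storage:", sorted(os.listdir(STORAGE)))

    # Step 2: after the external change is detected and metadata is
    # synchronized, both FClients should list both files.
    time.sleep(5)
    print("FClient 1:", sorted(os.listdir(FCLIENT1)))
    print("FClient 2:", sorted(os.listdir(FCLIENT2)))

    # Step 4: read both files via FClient 1 (creates replicas on DMC 1 storage).
    for name in ("f1", "f2"):
        with open(os.path.join(FCLIENT1, name), "rb") as f:
            f.read()

    # Step 6: append 5 bytes to f1 via FClient 1.
    with open(os.path.join(FCLIENT1, "f1"), "ab") as f:
        f.write(b"12345")

    # Step 7: overwrite f2 directly in the storage system.
    with open(os.path.join(STORAGE, "f2"), "wb") as f:
        f.write(os.urandom(1024))

    # Steps 9-10: reads via FClient 2 should return the modified contents.
    for name in ("f1", "f2"):
        with open(os.path.join(FCLIENT2, name), "rb") as f:
            print(name, "size seen by FClient 2:", len(f.read()))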

Figure 5.8: Context awareness test environment

Results of selected steps are shown in Figure 5.9 and discussed below, together with information on how context awareness affects these results.

• In step 2 of the scenario both files (f1 and f2) are listed. This shows that the system is aware that the dataset might be changed by an external tool, and monitors storage for such changes. It also shows that DMC 2 is aware of the existence of DMC 1, and the DMCs synchronize the metadata that describes both files (FClient 1, connected to DMC 1, lists both files).

• Executing step 3 of the scenario reveals that all data is stored at the storage system managed by DMC 2. Since it has not been read by FClient 1, there is no need to replicate data between storage systems. This proves that the system is aware of the selected management policy (synchronization of metadata only).

• In step 4 of the scenario a read operation occurs. The transfer log shows that the system is able to detect that the data is being read sequentially and prefetches data to reduce access delays. Thus, the system is aware of the user’s activity.

• Step 5 of the scenario shows that all data is stored in two storage systems (managed by DMC 1 and DMC 2).

• In step 6 of the scenario FClient 1 writes data directly to storage without DMC involvement. This can be observed by listing extended attributes of the file mentioned in step 5 (the access type discovered during the read operation was ’direct’). Thus, FClient is aware of the environment (it has automatically detected that it has direct access to the appropriate storage) and is able to use this knowledge to reduce overhead.

• The system is not involved in executing step 7 of the scenario.

• Step 8 of the scenario shows that DMC 2 is aware of changes performed by the FClient connected to DMC 1, as well as changes made directly in the storage system. During step 6, additional bytes are added to file f1. As a result, the replication progress of f1 shown by FClient 2 is 50%, because the newly added bytes have not been replicated to storage managed by DMC 2. During step 7, file f2 is modified without changing its size (by overwriting bytes). This modification is also discovered (FClients show a change in access times).

• In steps 9 and 10 of the scenario, the modified data is returned by FClient 2, proving again that DMC 2 is aware of changes made by the FClient connected to DMC 1 as well as changes made directly in the storage system.

• Executing step 12 of the scenario reveals that the system is aware of user activities (lack of activity for over 5 minutes – see step 11), its own state (existing data replicas) and the management policy, and is able to manage data automatically based on these elements (data is removed from the storage system managed by DMC 1 – the replication progress shown by FClient 1 is 0% for both files).

These results prove that the system uses context information.

5.5 Contribution of Context Awareness to Experiments

Since context awareness was proven by the above-mentioned tests, we may again refer to results concerning system performance (Chapter 5.2), elasticity (Chapter 5.3) and reliability (Chapter 5.1.3) in connection with context awareness. The system utilizes context information concerning:

• the characteristics of the DMC environment and storage systems – to choose the best method of interacting with a particular storage system. This has proven especially profitable for the S3 storage system, where adjusting the block size used by FClient (accessing data in the storage system) and opening multiple connections enables faster data access than with the standard API (see Chapter 5.2.1).

• the state of the environment and resources. This can be observed during reliability tests when the system reacts to the failures (see Chapter 5.1.3).

• the characteristics of the FClient environment. Exemplary usage of such information may be observed during performance tests – FClient accesses data directly in the storage system as long as its host machine has direct access to this system (see Chapter 5.2.1).

• users’ and providers’ expectations regarding the data replication policy. The replication policy is selected on the basis of metadata that describes expectations (see Chapter 5.3).

• data locations and data flow. This information is used during the tests to find data which is not directly available on the resources managed by DMC. Additionally, it enables usage of a local copy of data immediately after its download by a particular DMC (see Chapter 5.2.3).

• users’ actions to verify the results of performance tests (see Chapter 5.2.2). The system gathers information about the volume of data read/written by the user. Subsequently, this information is used to validate results obtained through external tools.

Table 5.10 summarizes information about the types of context awareness revealed in tests. It should be noted that the use of context information is not limited to the presented examples (e.g., information about user actions can be used for accounting or throttling purposes). Regardless, the tests prove that the system is able to make good use of context.

5.6 Evaluation Summary

The tests have shown that DMC Core provides various modes of request processing and metadata access, differing with regard to their characteristics. They also prove that the modes provided by DMC Core can form the basis for implementations – their overhead is acceptable in the scenarios for which they have been designed. In addition, the tests have shown that the throughput of data access when using the developed system is similar to or better than the throughput of direct access to data via the storage system API. The system has been shown to operate in various environments, including a resource-constrained one. Thus, the tests prove that an appropriate combination of the processing and metadata access modes provided by DMC Core has been utilized, enabling the system to maintain the desired quality of access in terms of performance.

Although the use of virtualization during the tests may affect the measured access and handling times and throughput, it does not affect the presented conclusions. The goal of the tests was not to measure specific numerical quantities, but to evaluate overhead and identify differences in characteristics between different work modes/configurations/policies. Numerical values compared within each test were affected by virtualization in a similar manner; therefore the results of their analysis remain trustworthy.

The system's fault tolerance has also been confirmed. An identified limitation, related to the throughput of a single FClient, may be overcome by using several FClients at the same time. The tests have also confirmed the system's context awareness and ability to carry out data management in a user-transparent manner. Thus, as a whole, they have validated all key elements of this thesis: context awareness, transparency and quality.

Figure 5.9: Results of selected steps of the context awareness test

Table 5.10: Types of context awareness

Type of context awareness | Description
Characteristics of the DMC environment and storage systems | Knowledge of the environment is used to optimize work with different storage systems.
Current state of the environment | The state of the resources is monitored to react to failures and adjust system operations to changes in resource state/load.
Characteristics of the FClient environment | Data in the storage system is accessed directly by FClient whenever possible.
Expectations of the users and providers (policy) | The data management policy is changed according to metadata provided by users and providers.
Data locations and data flow | The system is aware where the data is stored. DMCs are aware of one another and able to synchronize metadata.
User actions | The system processes information about user activities for accounting and optimization purposes.
Dataset properties | The system treats datasets differently depending on their metadata, to provide appropriate storage systems and features such as monitoring of external changes.

6 Conclusions and Future Work

In this chapter the most important elements and achievements of the thesis are outlined. The research contribution is summed up and potential areas of application are listed. Possible future work is also described.

6.1 Summary

Existing data access solutions provide several interesting features. However, to the best of the author's knowledge, none of the existing services or tools is well suited to providing transparent and efficient anytime/anyplace data access in an organizationally distributed environment. Thus, in this dissertation the author investigated the thesis defined in Chapter 1.2 on page 4.

To validate the thesis, the author identified data access stakeholders and their requirements when working in federated computational infrastructures. On that basis the author designed the MACAS model, which uses context awareness to provide data in a transparent manner while maintaining quality of access. The mapping of MACAS to system components was also described. The design of MACAS connects the stakeholders' requirements with metadata as well as layers and concerns that provide specific functionality. The layers and concerns themselves cover several aspects of data access, such as interactions with diverse storage resources, users' interactions with the data access system, coordination of execution of multiple operations to utilize more than one storage system, efficient utilization of network resources, providers' cooperation and distribution of the environment.

One of the most important features of the MACAS model involves introducing metadata to describe the context needed by each layer and concern. Although the appropriate metadata enables convenient access to data and supports various useful features, storing and processing large amounts of metadata can introduce significant overhead and bottleneck the entire data access system [13; 11; 15]. Thus, an important aspect of MACAS metadata is its distribution, along with consistency and synchronization models that reduce metadata processing overhead and delays. While stronger consistency models enable implementation of various features at the cost of increased overhead, most MACAS metadata is based on weaker consistency models, utilizing BASE (Basic Availability, Soft-state, Eventual consistency) rather than ACID (Atomic, Consistent, Isolated, Durable) properties. MACAS classifies metadata according to its required consistency and access scope, providing causal consistency only for metadata crucial for security and providers' cooperation. This classification supports implementation of appropriate metadata stores, caches and synchronization protocols and, in effect, provisioning of appropriate functionality, such as an eventually consistent view on data, while maintaining scalability and low overheads.

Figure 6.1: Relation of issues connected to transparent data access

Based on the MACAS model, a system that provides transparent data access was developed in the framework of collaborative research projects. The system is based on a core element, DMC Core, implemented by the author. Tests show that the model's basic assumptions are fulfilled, since the core element manages to provide appropriate features for processing massive numbers of metadata access requests.

The main achievement of this work is the creation of the MACAS model, including the definition of metadata types, distribution, consistency and synchronization, as well as methods for processing of crucial metadata. Developing the architecture of a system that provides several metadata caches and stores to process large amounts of metadata, including public and private datachunks of variable size, is also important. The efficient implementation of the system core that delivers request processing and metadata storage methods should also be acknowledged. Figure 6.1 presents relations between the issues described in the thesis, showing how high-level concepts translate to the system implementation.

It is worth noting that it is not possible to make data access fully independent of access origin and data storage site. The capabilities of different resources, especially network connections, may result in different quality of access depending on origin. Thus, the author associates data access transparency with maintaining quality of access, which means that concealment of problems related to differing data formats, storage systems and locations should not affect the quality of access supported by the available resources. While such problems are concealed from users through the use of context represented by metadata, efficient distributed metadata processing becomes crucial. Thus, appropriate metadata distribution modelling and implementation is required.

Disaggregation of the Cooperation Layer enables MACAS to be considered a versatile model. MACAS has been tested in a federated environment without any assumptions regarding the providers' relationships. Thus, it can be used in various environments (including a global scenario with several nonfederated organizations), with a suitably adjusted Cooperation Layer. Given that the model was successfully validated, the author believes that it will improve data access operations for organizationally distributed environments and, as a consequence, allow users to focus on better intra-community data sharing and collaboration instead of overcoming difficulties related to remote data access. A system that applies the proposed ideas is continually being improved and its evolution can be observed by following its website [86] or its open source repository [87].

6.2 Research Contribution

The work described within this thesis contributes to computer science on several levels:

• Analytical – the author has analyzed existing federated computing environments and identified data access stakeholders and their requirements. The author has also taken part in formulation of assumptions regarding the data access system fulfilling functional and non-functional requirements of the stakeholders.

• Design – the author has designed the MACAS model and the Data Management Component Core, and co-designed key mechanisms that implement their respective features.

• Implementation – the author has implemented the DMC Core and has taken part in implementation of several other elements.

• Experimental – the author has taken part in design and implementation of tests and analysis of results. The author has designed, implemented and analyzed experiments that verify the system’s flexibility as well as experiments which focus on testing DMC Core.

As the system that implements MACAS is the result of collaborative research projects, Figure 6.2 details the author’s involvement in its creation (compare to Figure 1.4).

Figure 6.2: Author’s individual achievements (green), collaborative achievements (black) and tasks in which the author was not involved (orange)

6.3 Range of Applications

MACAS is a universal model for distributed data access. Scientific computing infrastructures are an especially important area of application, since more users will be able to utilize the provided resources thanks to streamlined data access and management. This can be instrumental in scientific discovery driven by, e.g., the fourth paradigm [18] or Big Data [26] problems.

The application of MACAS should not be limited to federated environments but rather extended further, towards globalized data stored by nonfederated organizations (NFOs). The author is personally involved in the implementation of two products for the scientific community which utilize the results of this dissertation – VeilFS [138] and Onedata [86; 143]. While VeilFS is targeted at simplification of data access in federated infrastructures, Onedata simplifies data access in globalized environments consisting of multiple NFOs [149]. The VeilFS and Onedata systems have been deployed in three production environments.

The first example is the VeilFS [138] deployment on the Zeus supercomputer [110], based on the GPFS storage solution. The Data Management Component uses a single node connected to GPFS. The worker nodes executing Grid jobs are also connected to GPFS, and the system has been integrated with the Grid scheduler to mount FClient before a Grid job starts. The Onedata system [86] has been deployed twice on the Prometheus supercomputer [94]. The former deployment utilizes a single node with dedicated storage and manages over 2 TB of data from the HBP project [24], supporting ten dedicated services which access data through FClient. Requests are divided between service instances by nginx [85]. The latter deployment also utilizes a single Prometheus node to host DMC. The storage infrastructure connected to this node is controlled by the Lustre system [83]. The data is processed by a set of computing nodes using FClient for access.

Verification of the proposed ideas not only with artificial tests (see Chapter 5) but also in production systems proves their value for the scientific community. However, MACAS can also be used by providers of commercial storage and computing power to offer a more complete product, e.g., if one provider operates a fast storage mechanism for computations while another has high-capacity replicated storage resources for archiving, they can jointly provide a computational infrastructure with automatic backup to safe storage. Similarly, small and medium-sized companies can use MACAS to offer services to individual users or business partners (i.e., B2B services) even when they do not possess sufficient resources to operate such services by themselves.

6.4 Future Work

Future work will focus on evolution of layers and concerns:

• extension of Cooperation Layer for Open Data support,

• extensions of Executive Layer and Monitoring and Management Concern for better adaptation of the system behavior to users' actions.

The author participated in two initiatives [72; 81] which aimed to create e-infrastructures that deal with open data, easy sharing and access in an organizationally distributed environment. The work within these initiatives considered large-scale environments consisting of large numbers of sites that belong to various providers. Thus, it became important to limit the number of sites involved in metadata synchronization. To determine the sites that should synchronize specific metadata, a data organization was proposed in [142] to enable sites to be rapidly arranged into temporary groups for joint support for particular research. Such sites share metadata only within the given group and, as a result, proliferation of sites does not automatically result in wider metadata synchronization. Future work will utilize this organizational paradigm for crossing federation boundaries [60; 61].

MACAS provides monitoring capabilities to ensure quality of service. However, further research on adaptation to users' actions is required. The first step involves investigation of the overhead introduced by advanced monitoring and processing of the gathered information. Client-side metadata preprocessing consumes resources, and it is therefore important to determine the influence of such preprocessing on various classes of applications. The second step is to improve the algorithms that adjust system behavior. One potential direction of development involves decentralization of management. Instead of using DMC to analyze the state of the whole environment and then reconfigure FClients, the FClients themselves can act as independent agents that build their own view of the environment [150; 128] and tune their behavior in accordance with this view. Other possible improvements include the use of deep learning [38], resulting in a synergistic approach to high-performance computing, distributed/cloud computing and artificial intelligence, all of which represent crucial concepts in modern supercomputing research (see e.g., [56; 59]).

Bibliography

[1] Katzy, B. Design and Implementation of Virtual Organisations. In Proc. of Thirty-First Hawaii International Conference on System Sciences (1998), pp. 44–48.

[2] Foster, I., and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1998.

[3] Gilbert, S., and Lynch, N. Brewer’s Conjecture and the Feasibility of Consistent Available Partition-Tolerant Web Services. ACM SIGACT News 33, 2 (2002), 51–59.

[4] Foster, I. What is the Grid? A Three Point Checklist. GRIDtoday 1, 6 (2002).

[5] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proc. of The 19th ACM Symposium on Operating Systems Principles (2003), pp. 29–43.

[6] Azagury, A., Dreizin, V., Factor, M., Henis, E., Naor, D., Rinetzky, N., Rodeh, O., Satran, J., Tavory, A., and Yerushalmi, L. Towards an object store. In Proc. of 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (2003), pp. 165–176.

[7] Sompel, H. V. D., Nelson, M., Lagoze, C., and Warner, S. Resource Harvesting within the OAI-PMH Framework. D-Lib Magazine 10, 12 (2004).

[8] Baud, J.-P. B., Casey, J., Lemaitre, S., Nicholson, C., Smith, D., and Stewart, G. LCG Data Management: From EDG to EGEE. In Proc. of UK e-Science All Hands Meeting (2005).

[9] Mesnier, M., Ganger, G., and Riedel, E. Object-based storage: pushing more functionality into storage. IEEE Potentials 24, 2 (2005), 31–34.

[10] Thain, D., and Livny, M. Parrot: An Application Environment for Data-Intensive Computing. Scalable Computing: Practice and Experience 6, 3 (2005), 9–18.

[11] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A Scalable, High-Performance Distributed File System. In Proc. of The 7th Symposium on Operating Systems Design and Implementation (2006), pp. 307–320.

[12] Shiers, J. The Worldwide LHC Computing Grid (worldwide LCG). Computer Physics Communications 177, 1-2 (2007), 219–223.

[13] Leong, D. A new revolution in enterprise storage architecture. IEEE Potentials 28, 6 (2009), 32–33.

[14] Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Dynamic VO Establishment in Distributed Heterogeneous Business Environments. In Proc. of International Conference on Computational Science (2009), vol. 5545 of Lecture Notes in Computer Science, pp. 709–718.

[15] Dong, D., and Herbert, J. FSaaS: File System as a Service. In Proc. of IEEE 38th Annual Computer Software and Applications Conference (2009).

[16] Stantchev, V., and Schröpfer, C. Negotiating and Enforcing QoS and SLAs in Grid and Cloud Computing. In Proc. of International Conference on Grid and Pervasive Computing (2009), vol. 5529 of Lecture Notes in Computer Science, pp. 25–35.

[17] Ching-Hsien, H., Tai-Lung, C., and Kun-Ho, L. QoS based parallel file transfer for grid economics. In Proc. of International Conference on Multimedia Information Networking and Security (2009), pp. 653–657.

[18] Hey, A., Tansley, S., and Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

[19] Hünich, D., and Müller-Pfefferkorn, R. Managing large datasets with iRODS - a performance analyses. In Proc. of International Multiconference on Computer Science and Information Technology (2010), pp. 647–654.

[20] Słota, R., Nikolow, D., Polak, S., Kuta, M., Kapanowski, M., Skałkowski, K., Pogoda, M., and Kitowski, J. Prediction and Load Balancing System for Distributed Storage. Scalable Computing: Practice and Experience 11, 2 (2010).

[21] Shafer, J., Rixner, S., and Cox, A. The Hadoop distributed filesystem: Balancing portability and performance. In Proc. of IEEE International Symposium on Performance Analysis of Systems Software (2010), pp. 122–133.

[22] Muñoz, V. M., Vicente, G. A., and Kaci, M. A Decentralized Deployment Strategy and Performance Evaluation of LCG File Catalog Service. Journal of Grid Computing 9, 3 (2011), 345–354.

[23] Kawano, H. Hierarchical Storage Systems and File Formats for Web Archiving. In Proc. of 21st International Conference on Systems Engineering (2011), pp. 217–220.

[24] Markram, H., Meier, K., Lippert, T., Grillner, S., Frackowiak, R. S., Dehaene, S., Knoll, A., Sompolinsky, H., Verstreken, K., DeFelipe, J., Grant, S., Changeux, J.-P., and Saria, A. Introducing the Human Brain Project. In Proc. of The European Future Technologies Conference and Exhibition (2011), vol. 7 of Procedia Computer Science, pp. 39–42.

[25] Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., Haridas, J., Uddaraju, C., Khatri, H., Edwards, A., Bedekar, V., Mainali, S., Abbasi, R., Agarwal, A., Fahim ul Haq, M., Ikram ul Haq, M., Bhardwaj, D., Dayanand, S., Adusumilli, A., McNett, M., Sankaran, S., Manivannan, K., and Rigas, L. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proc. of Twenty-Third ACM Symposium on Operating Systems Principles (2011), pp. 143–157.

[26] Mills, S., Lucas, S., Irakliotis, L., Rappa, M., Carlson, T., and Perlowitz, B. Demystifying Big Data: A Practical Guide to Transforming the Business of Government. Technical report, 2012. http://www.ibm.com/software/data/demystifying-big-data/.

[27] International DOI Foundation, Ed. DOI Handbook. 2012.

[28] Słota, R., Nikolow, D., Kitowski, J., Król, D., and Kryza, B. FiVO/QStorMan Semantic Toolkit for Supporting Data-Intensive Applications in Distributed Environments. Computing and Informatics 31, 5 (2012), 1003–1024.

[29] Słota, R., Nikolow, D., Skałkowski, K., and Kitowski, J. Management of Data Access with Quality of Service in PL-Grid Environment. Computing and Informatics 31, 2 (2012), 463–479.

[30] Polak, S., and Słota, R. Organization of quality-oriented data access in modern distributed environments based on semantic interoperability of services and systems. Semantic Interoperability: Issues, Solutions, Challenges (2012), 131–152.

[31] Rettberg, N., and Principe, P. Paving the way to Open Access scientific scholarly information: OpenAIRE and OpenAIREplus. In Proc. of International Conference on Electronic Publishing (2012).

[32] Słota, R. Storage QoS provisioning for execution programming of data-intensive applications. Scientific Programming 20, 1 (2012), 69–80.

[33] Röblitz, T. Towards Implementing Virtual Data Infrastructures - a Case Study with iRODS. Computer Science 13, 4 (2012), 21–34.

[34] Martini, B., and Choo, R. Cloud storage forensics: ownCloud as a case study. Digital Investigation 10, 4 (2013), 287–299.

[35] Dhar, V. Data Science and Prediction. Communications of the ACM 56, 12 (2013), 64–73.

[36] Nikolow, D., Słota, R., Polak, S., Mitera, D., Pogoda, M., Winiarczyk, P., and Kitowski, J. Model of QoS Management in a Distributed Data Sharing and Archiving System. In Proc. of International Conference on Computational Science (2013), vol. 18 of Procedia Computer Science, pp. 100–109.

[37] Han, L., Huang, H., and Xie, C. Performance Analysis of NAND Flash Based Cache for Network Storage System. In Proc. of IEEE Eighth International Conference on Networking, Architecture and Storage (2013), pp. 68–75.

[38] Chen, X.-W., and Lin, X. Big Data Deep Learning: Challenges and Perspectives. IEEE Access 2 (2014), 514–525.

[39] Du, Y., Zhang, Y., Xiao, N., and Liu, F. CD-RAIS: Constrained dynamic striping in redundant array of independent SSDs. In Proc. of IEEE International Conference on Cluster Computing (2014), pp. 212–220.

[40] Gardner, R., Campana, S., Duckeck, G., Elmsheuser, J., Hanushevsky, A., Hönig, F. G., Iven, J., Legger, F., Vukotic, I., and Yang, W. Data federation strategies for ATLAS using XRootD. Journal of Physics: Conference Series 513 (2014).

[41] Hildmann, T., and Kao, O. Deploying and Extending On-Premise Cloud Storage Based on ownCloud. In Proc. of IEEE 34th International Conference on Distributed Computing Systems Workshops (2014), pp. 76–81.

[42] Memon, A. S., Jensen, J., Cernivec, A., Benedyczak, K., and Riedel, M. Federated Authentication and Credential Translation in the EUDAT Collaborative Data Infrastructure. In Proc. of IEEE/ACM 7th International Conference on Utility and Cloud Computing (2014), pp. 726–731.

[43] Chen, S., Wang, Y., and Pedram, M. A Joint Optimization Framework for Request Scheduling and Energy Storage Management in a Data Center. In Proc. of 8th International Conference on Cloud Computing (2015), pp. 163–170.

[44] Lamanna, G., Antonelli, L. A., Contreras, J. L., Knödlseder, J., Kosack, K., Neyroud, N., Aboudan, A., Arrabito, L., Barbier, C., Bastieri, D., Boisson, C., Brau-Nogue, S., Bregeon, J., Bulgarelli, A., Carosi, A., Costa, A., De Cesare, G., de los Reyes, R., Fioretti, V., Gallozzi, S., Jacquemier, J., Khelifi, B., Kocot, J., Lombardi, S., Lucarelli, F., Lyard, E., Maier, G., Massimino, P., Osborne, J. P., Perri, M., Rico, J., Sanchez, D. A., Satalecka, K., Siejkowski, H., Stolarczyk, T., Szepieniec, T., Testa, V., Walter, R., Ward, J. E., and Zoli, A. Cherenkov Telescope Array Data Management. In Proc. of The 34th International Cosmic Ray Conference (2015).

[45] Ananthakrishnan, R., Chard, K., Foster, I., and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency and Computation: Practice and Experience 27, 2 (2015), 290–305.

[46] Hildebrand, D., and Schmuck, F. B. On Making GPFS Truly General. ;login: 40, 3 (2015).

[47] Tochtermann, K., Scholz, W., and Atif, L. Science 2.0 - Mapping European Perspectives. Report on the General Stance of Organizations on European Commission's Public Consultation on Science 2.0, 2015. http://www.leibnizopen.de/suche/handle/document/115109.

[48] Pacheco, L., Halalai, R., Schiavoni, V., Pedone, F., Riviere, E., and Felber, P. GlobalFS: A Strongly Consistent Multi-site File System. IEEE Computer Society (2016), 147–156.

[49] Madera, C., and Laurent, A. The next information architecture evolution: the data lake wave. In Proc. of The 8th International ACM Conference on Management of Digital EcoSystems (2016), pp. 174–180.

[50] Robinson-Garcia, N., Mongeon, P., Jeng, W., and Costas, R. DataCite as a novel bibliometric source: Coverage, strengths and limitations. Journal of Informetrics 11, 3 (2017).

[51] Koukounidou, V. V. OpenAIRE: Supporting the H2020 Open Access Mandate. In Proc. of The 21st International Conference on Electronic Publishing (2017), pp. 56–61.

[52] Shankar, V., and Lin, R. Performance Study of Ceph Storage with Intel Cache Acceleration Software: Decoupling Hadoop MapReduce and HDFS over Ceph Storage. In Proc. of IEEE 4th International Conference on Cyber Security and Cloud Computing (2017), pp. 10–13.

[53] Dautov, R., and Distefano, S. Quantifying volume, velocity, and variety to support (Big) data-intensive application development. In Proc. of IEEE International Conference on Big Data (2017).

[54] Vangoor, B. K. R., Tarasov, V., and Zadok, E. To FUSE or Not to FUSE: Performance of User-Space File Systems. In Proc. of The 15th USENIX Conference on File and Storage Technologies (2017), pp. 59–72.

[55] Fan, Y., Wang, Y., and Ye, M. An Improved Small File Storage Strategy in Ceph File System. In Proc. of 14th International Conference on Computational Intelligence and Security (2018), pp. 488–491.

[56] Miyazaki, T., Sato, I., and Shimizu, N. Bayesian Optimization of HPC Systems for Energy Efficiency. In Proc. of International Conference on High Performance Computing (2018), vol. 10876 of Lecture Notes in Computer Science, pp. 44–62.

[57] Kelleher, J. D., and Tierney, B. Data Science. MIT Press, 2018.

[58] Bollig, E. F., Allan, G. T., Lynch, B. J., Huerta, Y. A., Mix, M., Munsell, E. A., Benson, R. M., and Swartz, B. Leveraging OpenStack and Ceph for a Controlled-Access Data Cloud. In Proc. of Practice and Experience in Advanced Research Computing Conference (2018).

[59] Sridharan, S., Vaidyanathan, K., Kalamkar, D. D., Das, D., Smorkalov, M. E., Shiryaev, M., Mudigere, D., Mellempudi, N., Avancha, S., Kaul, B., and Dubey, P. On Scale-out Deep Learning Training for Cloud and HPC. In Proc. of SysML conference (2018).

[60] The European Open Science Cloud for Research Pilot Project, last access April 2019. https://eoscpilot.eu/.

[61] The Horizon 2020 eXtreme DataCloud - XDC project, last access April 2019. http://www.extreme-datacloud.eu.

[62] ACC Cyfronet AGH, last access November 2018. http://www.cyfronet.krakow.pl/en/4421,main.html.

[63] Active File Management (AFM), last access November 2018. https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Active%20File%20Management%20(AFM).

[64] AWS CLI, last access November 2018. https://aws.amazon.com/cli/.

[65] BeeGFS, last access November 2018. http://www.beegfs.com/content/.

[66] Cloud Data Management Interface, last access November 2018. https://www.snia.org/cdmi.

[67] Command line tool and library for transferring data with URLs, last access November 2018. https://curl.haxx.se/.

[68] DataCite: helping you to find, access, and reuse research data, last access November 2018. http://datacite.org.

[69] DataNet Federation Consortium, last access November 2018. http://datafed.org/.

[70] Docker, last access November 2018. https://www.docker.com/why-docker.

[71] McCallion, J. Dropbox vs OneDrive vs Google Drive: what's the best cloud storage service of 2014?, last access November 2018. http://www.pcpro.co.uk/features/389929/dropbox-vs-onedrive-vs-google-drive-whats-the-best-cloud-storage-service-of-2014.

[72] EGI-Engage, last access November 2018. https://www.egi.eu/about/egi-engage/.

[73] EVS, last access November 2018. http://www.hwclouds.com/en-us/product/evs.html.

[74] File Systems, last access November 2018. http://www.cse.msu.edu/~cse812/fall03/Slides/OldSlides/filesys.pdf.

[75] FUSE: Filesystem in Userspace, last access November 2018. http://fuse.sourceforge.net/.

[76] GlusterFS community website, last access November 2018. http://www.gluster.org/about/.

[77] Grid File Access Library 2.0 official page, last access November 2018. https://svnweb.cern.ch/trac/lcgutil/wiki/gfal2.

[78] How to Benchmark Your System with Sysbench, last access November 2018. https://www.howtoforge.com/how-to-benchmark-your-system-cpu-file-io-mysql-with-sysbench.

[79] htop - an interactive process viewer for Unix, last access November 2018. https://hisham.hm/htop/.

[80] IBM General Parallel File System (GPFS), last access November 2018. https://www.ibm.com/support/knowledgecenter/en/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_gpfs_overview.html.

[81] Indigo DataCloud, last access November 2018. https://www.indigo-datacloud.eu/.

[82] Introduction to Xrootd N2N for Disk Caching Proxy (Xcache) utilizing RUCIO metalink, last access November 2018. https://github.com/wyang007/rucioN2N-for-Xcache/wiki/Introduction-to-Xrootd-N2N-for-Disk-Caching-Proxy-(Xcache)-utilizing-RUCIO-metalink.

[83] Lustre, last access November 2018. http://www.whamcloud.com/lustre/.

[84] Mnesia, last access November 2018. http://erlang.org/doc/man/mnesia.html.

[85] Nginx, last access November 2018. https://nginx.org/en/.

[86] Onedata - Global Data Access Solution for Science, last access November 2018. https://onedata.org/#/home.

[87] Onedata source repository, last access November 2018. https://github.com/onedata.

[88] OpenStack Object Storage ("Swift"), last access November 2018. https://wiki.openstack.org/wiki/Swift.

[89] PanFS Storage Operating System, last access November 2018. http://www.panasas.com/products/panfs.

[90] PLGrid CORE project, last access November 2018. http://dice.cyfronet.pl/projects/details/PLGrid_Core.

[91] PLGrid Plus project, last access November 2018. http://www.plgrid.pl/en#section-1t.

[92] PLGrid project, last access November 2018. http://projekt.plgrid.pl/en.

[93] Polish National Data Storage, last access November 2018. https://www.elettra.eu/Conferences/2014/BDOD/uploads/Main/Polish%20National%20Data%20Storage.pdf.

[94] Prometheus High Performance Computer, last access November 2018. http://www.cyfronet.krakow.pl/computers/15226,artykul,prometheus.html.

[95] Weil, S. A., Leung, A. W., Brandt, S. A., and Maltzahn, C. RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters, last access November 2018. http://ceph.com/papers/weil-rados-pdsw07.pdf.

[96] Rehab project, last access November 2018. http://www.icsr.agh.edu.pl/rehab/.

[97] S3 Proxy, last access November 2018. https://github.com/andrewgaul/s3proxy.

[98] Scality, last access November 2018. http://www.scality.com/products/what-is-ring/.

[99] Software Defined Storage, last access November 2018. http://www.snia.org/sites/default/files/SNIA_Software_Defined_Storage_%20White_Paper_v1.pdf.

[100] Storj, last access November 2018. http://storj.io/.

[101] Syndicate drive, last access November 2018. http://syndicatedrive.com/.

[102] Tachyon Project, last access November 2018. http://tachyon-project.org/.

[103] Braam, P. J. The Coda Distributed File System, last access November 2018. http://www.coda.cs.cmu.edu/ljpaper/lj.html.

[104] Mell, P., and Grance, T. The NIST Definition of Cloud Computing, Recommendations of the National Institute of Standards and Technology, last access November 2018. http://www-07.ibm.com/servers/eserver/includes/download/what_is_grid_faq.pdf.

[105] Dippo, C., and Sundgren, B. The Role of Metadata in Statistics, last access November 2018. https://www.bls.gov/osmr/pdf/st000040.pdf.

[106] National Information Standards Organization. Understanding Metadata, last access November 2018. https://groups.niso.org/apps/group_public/download.php/17446/Understanding%20Metadata.pdf.

[107] Web Object Scaler, last access November 2018. http://www.ddn.com/products/object-storage-web-object-scaler-wos/#aboutwos.

[108] What is Grid Computing for zSeries?, last access November 2018. http://www-07.ibm.com/servers/eserver/includes/download/what_is_grid_faq.pdf.

[109] Worldwide LHC Computing Grid, last access November 2018. http://wlcg.web.cern.ch/.

[110] Zeus High Performance Computer, last access November 2018. http://www.cyfronet.krakow.pl/computers/13725,artykul,zeus.html.

Author's Bibliometric Data

Web of Science (accessed 17.03.2019): documents: 17; h-index: 4; number of citations: 36; number of citations without self-citations: 20

SCOPUS (accessed 17.03.2019): documents: 17; h-index: 5; total citations: 49; total citations without self-citations: 30 (with h = 3)

Google Scholar (accessed 01.04.2019): documents: 21; h-index: 6; number of citations: 94; i10-index: 3

DBLP (accessed 17.03.2019): documents: 22

Author's Publications

[111] Kuta, M., Wrzeszcz, M., Chrzaszcz, P., and Kitowski, J. Accuracy of Baseline and Complex Methods Applied to Morphosyntactic Tagging of Polish. In Proc. of International Conference on Computational Science (2008), vol. 5101 of Lecture Notes in Computer Science, pp. 903–912.

[112] Paszyński, M., Gurgul, P., Sieniek, M., Wrzeszcz, M., and Schaefer, R. Object oriented hp adaptive finite element method system for multiscale problems. In Proc. of 8th World Congress on Computational Mechanics; 5th European Congress on Computational Methods in Applied Sciences and Engineering (2008), p. 231.

[113] Kuta, M., Wójcik, W., Wrzeszcz, M., and Kitowski, J. Application of Stacked Methods to Part-of-Speech Tagging of Polish. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2009), vol. 6067 of Lecture Notes in Computer Science, pp. 340–349.

[114] Kuta, M., Kitowski, J., Wójcik, W., and Wrzeszcz, M. Application of Weighted Voting Taggers to Languages Described with Large Tagsets. Computing and Informatics 29, 2 (2010), 203–225.

[115] Kitowski, J., Wcisło, R., Wrzeszcz, M., Słota, R., Otfinowski, J., Probosz, K., Pisula, M., Sobczyk, A., and Regula, K. Stroke patients rehabilitation supported by remote computer system – methods and results. In Proc. of Cracow Grid Workshop (2010), pp. 226–233.

[116] Kryza, B., Wrzeszcz, M., Dutka, Ł., Kitowski, J., Contat, M., Biran, H., Bracker, H., Schneider, B., Lundin, M., Svenmarck, P., Kvassay, M., and Hluchý, L. Supporting Urban Military Operations Analysis and Planning with Modern HPC Infrastructures. In Proc. of Cracow Grid Workshop (2010).

[117] Laclavik, M., Dlugolinský, S., Seleng, M., Kvassay, M., Schneider, B., Bracker, H., Wrzeszcz, M., Kitowski, J., and Hluchý, L. Agent-Based Simulation Platform Evaluation in the Context of Human Behavior Modeling. In Proc. of International Conference on Autonomous Agents and Multiagent Systems (2011), vol. 7068 of Lecture Notes in Computer Science, pp. 396–410.

[118] Wcisło, R., Kitowski, J., and Wrzeszcz, M. Remote Rehabilitation of Stroke Patients. In Proc. of International Conference on Health Informatics (2011), pp. 500–503.

[119] Dlugolinský, S., Kvassay, M., Hluchý, L., Wrzeszcz, M., Król, D., and Kitowski, J. Using parallelization for simulation of human behaviour. In Proc. of 7th International Workshop on Grid Computing for Complex Problems (2011), pp. 258–265.

[120] Słota, R., Król, D., Skałkowski, K., Orzechowski, M., Nikolow, D., Kryza, B., Wrzeszcz, M., and Kitowski, J. A Toolkit for Storage QoS Provisioning for Data-Intensive Applications. Computer Science 13, 1 (2012), 63–73.

[121] Kvassay, M., Hluchý, L., Dlugolinský, S., Laclavik, M., Schneider, B., Bracker, H., Tavcar, A., Gams, M., Król, D., Wrzeszcz, M., and Kitowski, J. An integrated approach to mission analysis and mission rehearsal. In Proc. of Winter Simulation Conference (2012), pp. 362:1–362:2.

[122] Król, D., Kryza, B., Wrzeszcz, M., Dutka, Ł., and Kitowski, J. Elastic Infrastructure for Interactive Data Farming Experiments. In Proc. of International Conference on Computational Science (2012), pp. 206–215.

[123] Kryza, B., Król, D., Wrzeszcz, M., Dutka, Ł., and Kitowski, J. Interactive Cloud Data Farming Environment for Military Mission Planning Support. Computer Science 13, 3 (2012), 89–100.

[124] Wrzeszcz, M., and Kitowski, J. Mobile Social Networks for Live Meetings. Computer Science 13, 4 (2012), 87–100.

[125] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., and Kitowski, J. Scalarm: Massively Self-scalable Platform for Data Farming. In Proc. of Cracow Grid Workshop (2012), pp. 53–54.

[126] Wrzeszcz, M., and Kitowski, J. Social Multi-agent Simulation Framework. In Proc. of The Third International Workshop on Infrastructures And Tools For Multiagent Systems (2012), pp. 163–176.

[127] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., and Kitowski, J. Using Grid Storage in Virtualized Computational Environments. In Proc. of Cracow Grid Workshop (2012), pp. 55–56.

[128] Wrzeszcz, M., and Kitowski, J. Creation of Agent’s Vision of Social Network Through Episodic Memory. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2013), vol. 8385 of Lecture Notes in Computer Science, pp. 741–750.

[129] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., and Kitowski, J. Massively Scalable Platform for Data Farming Supporting Heterogeneous Infrastructure. In Proc. of The Fourth International Conference on Cloud Computing, GRIDs, and Virtualization (2013), pp. 144–149.

[130] Dutka, Ł., Słota, R., Wrzeszcz, M., Król, D., Opioła, Ł., and Kitowski, J. On Super Easy Access to your Data in PL-Grid Infrastructure. In Proc. of Cracow Grid Workshop (2013), pp. 47–48.

[131] Król, D., Wrzeszcz, M., Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Scalarm: Scalable Platform for Data Farming. In Proc. of Sixth ACC Cyfronet AGH Users' Conference KUKDM (2013), pp. 47–48.

[132] Słota, R., Dutka, Ł., Wrzeszcz, M., Kryza, B., Nikolow, D., Król, D., and Kitowski, J. Storage Management Systems for Organizationally Distributed Environments PLGrid PLUS Case Study. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2013), vol. 8384 of Lecture Notes in Computer Science, pp. 724–733.

[133] Wrzeszcz, M. Sztuczna inteligencja i systemy agentowe w grach komputerowych: czyli jak stworzyć inteligentną grę w domu (Artificial intelligence and agent systems in computer games: how to create an intelligent game at home). Wydawnictwo Bezkresy Wiedzy, 2013.

[134] Wrzeszcz, M., Trzepla, K., Mazur, M., Opioła, Ł., Dutka, Ł., Słota, R., and Kitowski, J. Accounting and monitoring in distributed storage services. In Proc. of Cracow Grid Workshop (2014), pp. 29–30.

[135] Dutka, Ł., Lichoń, T., Słota, R., Zemek, K., Wrzeszcz, M., Słota, R., and Kitowski, J. Globalization of data access for computing infrastructures. In Proc. of Cracow Grid Workshop (2014), pp. 69–70.

[136] Dutka, Ł., Słota, R., Wrzeszcz, M., Król, D., Opioła, Ł., Słota, R., and Kitowski, J. Harnessing Organizationally Distributed Data with VeilFS. In Proc. of Seventh ACC Cyfronet AGH Users' Conference KUKDM (2014), pp. 79–80.

[137] Dutka, Ł., Słota, R., Wrzeszcz, M., Król, D., and Kitowski, J. Uniform and Efficient Access to Data in Organizationally Distributed Environments. In PL-Grid, vol. 8500 of Lecture Notes in Computer Science. Springer, 2014, pp. 178–194.

[138] Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. VeilFS – A New Face of Storage as a Service. In Proc. of eChallenges Conference (2014).

[139] Kvassay, M., Hluchý, L., Dlugolinský, S., Schneider, B., Bracker, H., Tavcar, A., Gams, M., Contat, M., Dutka, Ł., Król, D., Wrzeszcz, M., and Kitowski, J. A Novel Way of Using Simulations to Support Urban Security Operations. Computing and Informatics 34, 6 (2015), 1201–1233.

[140] Zemek, K., Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Delegation of authority in a distributed data access system. In Proc. of CGW Workshop (2015), pp. 97–98.

[141] Słota, R., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Efficient storing of metadata for distributed data management. In Proc. of CGW Workshop (2015), pp. 111–112.

[142] Wrzeszcz, M., Trzepla, K., Słota, R., Zemek, K., Lichoń, T., Opioła, Ł., Nikolow, D., Dutka, Ł., Słota, R., and Kitowski, J. Metadata Organization and Management for Globalization of Data Access with Onedata. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2015), vol. 9573 of Lecture Notes in Computer Science, pp. 312–321.

[143] Dutka, Ł., Wrzeszcz, M., Lichoń, T., Słota, R., Zemek, K., Trzepla, K., Opioła, Ł., Słota, R., and Kitowski, J. Onedata - a Step Forward towards Globalization of Data Access for Computing Infrastructures. In Proc. of International Conference on Computational Science (2015), vol. 51 of Procedia Computer Science, pp. 2843–2847.

[144] Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Two-level load balancing for onedata. In Proc. of Eighth ACC Cyfronet AGH Users' Conference KUKDM (2015), pp. 107–108.

[145] Wrzeszcz, M., Otfinowski, J., Słota, R., and Kitowski, J. Computer Aided Distributed Post-Stroke Rehabilitation Environment. Computer Science 17, 1 (2016), 3–22.

[146] Żmuda, M., Kryza, B., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Implementation of open data at global scale in low-trust environment. In Proc. of Ninth ACC Cyfronet AGH HPC Users' Conference KUKDM (2016), pp. 53–54.

[147] Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Concept of decentralized access control for open network of autonomous data providers. In Proc. of CGW Workshop (2017), pp. 43–44.

[148] Wrzeszcz, M., Nikolow, D., Lichoń, T., Słota, R., Dutka, Ł., Słota, R., and Kitowski, J. Consistency Models for Global Scalable Data Access Services. In Proc. of International Conference on Parallel Processing and Applied Mathematics (2017), vol. 10777 of Lecture Notes in Computer Science, pp. 471–480.

[149] Wrzeszcz, M., Opioła, Ł., Zemek, K., Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Effective and Scalable Data Access Control in Onedata Large Scale Distributed Virtual File System. In Proc. of International Conference on Computational Science (2017), vol. 108 of Procedia Computer Science, pp. 445–454.

[150] Wrzeszcz, M., Koźlak, J., and Kitowski, J. Modelling Agents Cooperation Through Internal Visions of Social Network and Episodic Memory. Computing and Informatics 36, 1 (2017), 86–112.

[151] Kudzia, J., Słota, R., Lichoń, T., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Using Onedata for global sharing and processing of legacy large data sets. In Proc. of Tenth ACC Cyfronet AGH Users' Conference KUKDM (2017), pp. 49–50.

[152] Dutka, Ł., Kryza, B., Orzechowski, M., Opioła, Ł., and Wrzeszcz, M. Onedata virtual filesystem for hybrid clouds. In Proc. of Workshop on Cloud Storage Synchronization and Sharing Services (2018), pp. 10–11.

[153] Wrzeszcz, M., Słota, R., and Kitowski, J. Towards transparent data access with context awareness. Computer Science 19, 2 (2018), 201–221.

[154] Opioła, Ł., Wrzeszcz, M., Dutka, Ł., Słota, R., and Kitowski, J. Two-Layer Load Balancing for Onedata System. Computing and Informatics 37, 1 (2018), 1–22.

[155] Wrzeszcz, M., Opioła, Ł., Kryza, B., Dutka, Ł., Słota, R., and Kitowski, J. Harmonizing Sequential and Random Access to Datasets in Organizationally Distributed Environments. In Proc. of International Conference on Computational Science (2019), accepted.