An investigation of a high availability DPM-based Grid Storage Element

Kwong Tat Cheung

August 17, 2017

MSc in High Performance Computing with Data Science
The University of Edinburgh
Year of Presentation: 2017

Abstract

As the data volume of scientific experiments continues to increase, there is a growing need for Grid Storage Elements to provide a reliable and robust storage solution. This work investigates the single point of failure in DPM's architecture, and identifies the components which prevent the use of redundant head nodes to provide higher availability. This work also contributes a prototype of a novel high availability DPM architecture, designed using the findings of our investigation.

Contents

1 Introduction
  1.1 Big data in science
  1.2 Storage on the grid
  1.3 The problem
    1.3.1 Challenges in availability
    1.3.2 Limitations in DPM legacy components
  1.4 Aim
  1.5 Project scope
  1.6 Report structure

2 Background
  2.1 DPM and the Worldwide LHC Computing Grid
  2.2 DPM architecture
    2.2.1 DPM head node
    2.2.2 DPM disk node
  2.3 DPM evolution
    2.3.1 DMLite
    2.3.2 Disk Operations Manager Engine
  2.4 Trade-offs in distributed systems
    2.4.1 Implication of CAP Theorem on DPM
  2.5 Concluding Remarks

3 Setting up a legacy-free DPM testbed
  3.1 Infrastructure
  3.2 Initial testbed architecture
  3.3 Testbed specification
  3.4 Creating the VMs
  3.5 Setting up a certificate authority
    3.5.1 Create a CA
    3.5.2 Create the host certificates
    3.5.3 Create the user certificate
  3.6 Nameserver
  3.7 HTTP frontend
  3.8 DMLite adaptors
  3.9 Database and Memcached
  3.10 Creating a VO
  3.11 Establishing trust between the nodes
  3.12 Setting up the file systems and disk pool
  3.13 Verifying the testbed
  3.14 Problems encountered and lessons learned

4 Investigation
  4.1 Automating the failover mechanism
    4.1.1 Implementation
  4.2 Database
    4.2.1 Metadata and operation status
    4.2.2 Issues
    4.2.3 Analysis
    4.2.4 Options
    4.2.5 Recommendation
  4.3 DOME in-memory queues
    4.3.1 Issues
    4.3.2 Options
    4.3.3 Recommendation
  4.4 DOME metadata cache
    4.4.1 Issues
    4.4.2 Options
    4.4.3 Recommendation
  4.5 Recommended architecture for High Availability DPM
    4.5.1 Failover
    4.5.2 Important considerations

5 Evaluation
  5.1 Durability
    5.1.1 Methodology
    5.1.2 Findings
  5.2 Performance
    5.2.1 Methodology
    5.2.2 Findings

6 Conclusions

7 Future work

A Software versions and configurations
  A.1 Core testbed components
  A.2 Test tools
  A.3 Example domehead.conf
  A.4 Example domedisk.conf
  A.5 Example dmlite.conf
  A.6 Example domeadapter.conf
  A.7 Example mysql.conf
  A.8 Example Galera cluster configuration

B Plots

List of Tables

3.1 Network identifiers of VMs in testbed

List of Figures

2.1 Current DPM architecture
2.2 DMLite architecture
2.3 Simplified view of DOME in head node
2.4 Simplified view of DOME in disk node
3.1 Simplified view of architecture of initial testbed
4.1 Failover using keepalived
4.2 Synchronising records with Galera cluster
4.3 Remodeled work flow of the task queues using replicated Redis caches
4.4 Remodeled work flow of the metadata cache using replicated Redis caches
4.5 Recommended architecture for High Availability DPM
5.1 Plots of average rate of operations compared to number of threads
B.1 Average rate for a write operation
B.2 Average rate for a stat operation
B.3 Average rate for a read operation
B.4 Average rate for a delete operation

Acknowledgements

First and foremost, I would like to express my gratitude to Dr Nicholas Johnson for supervising and arranging the budget for this project. Without the guidance and motivation he has provided, the quality of this work would certainly have suffered. I would also like to thank Dr Fabrizio Furano from the DPM development team for putting up with the stream of emails I have bombarded him with, and for answering my queries on the inner workings of DPM.

Chapter 1

Introduction

1.1 Big data in science

Big data has become a well-known phenomenon in the age of social media. The vast amount of user-generated content has undeniably influenced research and advancement in modern distributed computing paradigms [1][2]. However, even before the advent of social media websites, researchers in several scientific fields already faced similar challenges in dealing with the massive amounts of data generated by experiments. One such field is high energy physics, including the Large Hadron Collider (LHC) experiments based at the European Organization for Nuclear Research (CERN). In 2016 alone, it is estimated that 50 petabytes of data were gathered by the LHC detectors post-filtering [3]. Since the financial resources required to host an infrastructure able to process, store, and analyse the data are far too great for any single organisation, the experiments turned to the grid computing approach.

Grid computing, which is mostly developed and used in academia, follows the same principle as its commercial counterpart, cloud computing: computing resources are provided to end-users remotely and on demand. Similarly, the physical location of the sites which provide the resources, as well as the infrastructure, is abstracted away from the users. From the end-users' perspective, they simply submit their jobs to an appropriate job management system without any knowledge of where the jobs will run or where the data are physically stored. In grid computing, these computing resources are often distributed across multiple locations; a site that provides data storage capacity is called a Storage Element, and one that provides computation capacity is called a Compute Element.

1.2 Storage on the grid

Grid storage elements have to support some unique requirements found in the grid environment. For example, the grid relies on the concept of Virtual Organisations (VO) for resource allocation and accounting. A VO represents a group of users, not necessarily from the same organisation but usually involved in the same experiment, and manages their membership. Resources on the grid (i.e. storage space provided by a site) are allocated to specific VOs instead of individual users. Storage elements also have to support file transfer protocols that are not commonly used outside of the grid environment, such as GridFTP [4] and xrootd [5]. Various storage management systems were developed for grid storage elements to fulfil these requirements, and one such system is the Disk Pool Manager (DPM) [6].

DPM is a storage management system developed by CERN. It is currently the most widely deployed storage system on tier 2 sites, providing the Worldwide LHC Computing Grid (WLCG) with around 73 petabytes of storage across 160 instances [7]. The main functionalities of DPM are to provide a straightforward, low-maintenance solution for creating a disk-based grid storage element, and to support remote file and metadata operations using multiple protocols commonly used in grid environments.

1.3 The problem

This section presents the main challenges for DPM and the specific limitations that motivate this work, and outlines the project's aim.

1.3.1 Challenges in availability

Due to limitations in the DPM architecture, the current deployment model supports only one meta-data server and command node. This deployment model exposes a single point of failure in a DPM-based storage element. There are several scenarios where this deployment model could affect the availability of a site:

• Hardware failure in the host
• Software/OS update that results in the host being offline
• Retirement or replacement of machines

If any of the scenarios listed above happens to the command node, the entire storage element becomes inaccessible, which ultimately means expensive downtime for the site.

1.3.2 Limitations in DPM legacy components

Some components in DPM were first developed over 20 years ago. The tightly-coupled nature of these components has limited the extensibility of the DPM system and makes it impractical to modify DPM into a multi-server system. As the grid evolves, the number of users and the storage demand have also increased. New software practices and designs have also emerged that could better fulfil the requirements of a high-load storage element.

In light of this, the DPM development team have put a considerable amount of effort into modernising the system in the past few years, which has resulted in new components that can bypass some limitations of the legacy stack. The extensibility of these new components has opened up an opportunity to modify the current deployment model, which this work aims to explore.

1.4 Aim

The aim of this work is to explore the possibility of increasing the availability of a DPM-based grid storage element by modifying its current architecture and components. Specifically, this work includes:

• An investigation into the availability limitations of the current DPM deployment model.
• Our experience of setting up and configuring a legacy-free DPM instance, including a step-by-step guide.
• An in-depth analysis of the challenges in enabling a highly available DPM instance, along with potential solutions.
• A recommended architecture for a high availability DPM storage element based on the findings of our investigation, along with a prototype testbed for evaluation.

1.5 Project scope

A complete analysis, redesign, and modification of the entire DPM system and the access protocol frontends DPM supports would not realistically fit into the time frame of this project. As such, this project aims to act as a preliminary study towards the overall effort of implementing a high availability DPM system. As part of the effort to promote wider adoption of the HTTP ecosystem in the grid environment, this project will focus on providing a high availability solution for the HTTP frontend. However, compatibility with the other access frontends will also be taken into consideration in the design process, where possible.

1.6 Report structure

The remainder of the report is structured as follows.

Chapter 2 presents the background of DPM, including its deployment environment and information on the components and services which form the DPM system. The evolution of DPM and its current development direction are also discussed.

Chapter 3 describes our experience in setting up and configuring a legacy-free DPM instance on our testbed, including a step-by-step guide.

Chapter 4 provides an in-depth investigation of the current DPM components which prohibit a high availability deployment model, and describes our suggested modifications.

Chapter 5 evaluates the performance and failover behaviour of our prototype high availability testbed.

Chapter 6 presents the conclusions of this work, summarising the findings of our investigations and recommendations.

Chapter 7 describes some of the future work that is anticipated after the completion of this project.

Chapter 2

Background

DPM is a complex system which includes a number of components and services. As such, before examining potential ways to improve the availability and scalability of a DPM storage element, the architecture and components of DPM must first be understood. This chapter presents an in-depth analysis of DPM, including its architecture, history and evolution, as well as the functionalities of each component that makes up a DPM system. Common scenarios which could affect the availability of a distributed system, and the trade-offs in a highly available distributed system, are also discussed in this chapter.

2.1 DPM and the Worldwide LHC Computing Grid

As mentioned in Chapter 1, DPM is designed specifically to allow the setup and management of storage elements on the WLCG. As such, to gain a better understanding of DPM, one must also be familiar with the environment DPM is deployed in. The WLCG is a global e-infrastructure that provides compute and data storage facilities to support the LHC experiments (Alice, Atlas, CMS, LHCb). Currently, the WLCG is formed by more than 160 sites, and is organised into three main tiers:

• Tier 0 - The main data centre at the European Organisation for Nuclear Research (CERN), where raw data gathered by the detectors are processed and kept on tape.
• Tier 1 - Thirteen large-scale data centres holding a subset of the raw data. Tier 1 sites also handle the distribution of data to Tier 2 sites.
• Tier 2 - Around 150 universities and smaller scientific institutes providing storage for reconstructed data, and computational support for analysis.

Since DPM supports only disk and not tape storage, it is mostly used in tier 2 storage elements, storing data files that are used in analysis jobs submitted by physicists. For redundancy and accessibility purposes, popular files often have copies distributed across different sites; in grid terminology these copies are called replicas. These replicas are stored in filesystems on the DPM disk nodes, where a collection of filesystems forms a DPM disk pool.

Figure 2.1: Current DPM architecture

2.2 DPM architecture

DPM is a distributed system composed of two types of node: the head node and the disk node. A high-level view of the typical DPM architecture used in most DPM storage elements is shown in Figure 2.1.

2.2.1 DPM head node

The head node is the entry point to a DPM storage element; it is responsible for handling file meta-data and file access requests that come into the cluster. The head node contains decision making logic regarding load balancing, authorisation and authentication, space quota management, file system status, and physical location of the replicas it manages. In the DPM system, the head node acts as the brain of the cluster and maintains a logical view of the entire file system.

A DPM head node contains a number of components providing different services. The components can be grouped into two categories: frontends which facilitate access by different protocols, and backends which provide the underlying functionality.

Protocol frontends

• Httpd - DPM uses the Apache HTTP server to allow metadata and file operations through HTTP and WebDAV.
• SRM - The Storage Resource Manager [8] daemon that is traditionally used to provide dynamic space allocation and file management to grid sites.
• GridFTP, xrootd, RFIO - These frontends provide access to the DPM system via some of the other popular protocols used in grid environments.

Backends

• DPM - The DPM daemon (not to be confused with the DPM system as a whole) handles file access requests, manages the asynchronous operation queues, and interacts with the data access frontends.
• DPNS - The DPM nameserver daemon, which handles file and directory related metadata operations, for example adding or renaming a replica.
• MySQL - Two important databases vital to DPM operations are stored in the MySQL backend. The cns_db database contains all the file metadata, replicas and their locations in the cluster, as well as information on groups and VOs. The dpm_db database stores information on the filesystems on the disk servers, space quotas, and the status of ongoing and pending file access requests. The database can be deployed either on the same host as the head node, or remotely on another host, depending on expected load.
• Memcached - Memcached [9] is an in-memory cache for key-value pairs. In DPM, it is an optional layer that can be set up in front of the MySQL backend to reduce query load on the databases.

2.2.2 DPM disk node

Disk nodes in a DPM storage element host the actual file replicas and serve remote metadata and file access requests from clients. Clients are authenticated and authorised by the head node and then redirected to the relevant disk nodes; they never access the disk nodes directly without this redirection. A disk node will typically contain all the data access frontends supported by the DPM system (e.g. httpd, GridFTP, xrootd, RFIO).

Figure 2.2: DMLite architecture

2.3 DPM evolution

Since DPM was first developed in the early 2000s, it has gone through several rounds of major refactoring and enhancement. The historical components of DPM, for example the DPM and DPNS daemons, were written a long time ago, and extensibility was not one of the design goals. The daemons also introduced several bottlenecks, such as excessive inter-process communication and threading limitations. As such, a lot of effort has been directed to bypassing the so-called legacy components:

• DPM (daemon)
• DPNS
• SRM
• RFIO
• Other security and configuration helpers (e.g. CSec)

The most significant changes in the recent iterations are the development of the DMLite framework [10] and the Disk Operations Manager Engine (DOME) [11].

2.3.1 DMLite

DMLite is a plugin-based library that is now at the core of a DPM system. DMLite provides a layer of abstraction above the database, pool management, and I/O access. The architecture of DMLite is shown in Figure 2.2.

Figure 2.3: Simplified view of DOME in head node

By providing an abstraction over the underlying layers, additional plugins can be implemented to support other storage types, such as S3 and HDFS. Perhaps more importantly, DMLite also allows a caching layer to be loaded in front of the database backend by using the Memcached plugin, which can significantly reduce query load on the databases.

2.3.2 Disk Operations Manager Engine

DOME is arguably the most important recent addition to the DPM system because it represents a new direction in DPM development. DOME runs on both the head and disk nodes as a FastCGI daemon; it exposes a set of RESTful APIs which provide the core coordination functions, and uses HTTP and JSON to communicate with both clients and other nodes in the cluster. By implementing the functionalities of the legacy daemons and handling inter-cluster communication itself, DOME makes the legacy components, in theory, optional in a DOME-enabled DPM instance. Simplified views of a DOME-enabled head node and disk node are shown in Figure 2.3 and Figure 2.4, respectively.

The heavy use of in-memory queues and inter-process communication in the legacy components would have made any attempt to modify the single head node deployment model impractical. However, the introduction of DOME has opened up the possibility of deploying multiple head nodes in a single DPM instance, which will be explored in the next chapter.

Figure 2.4: Simplified view of DOME in disk node

2.4 Trade-offs in distributed systems

Eric Brewer introduced an idea in 2000 which is now widely known as the CAP Theorem. The CAP Theorem states that in distributed systems there is a fundamental trade-off between consistency, availability, and partition tolerance [12]: a distributed system can guarantee at most two of the three properties. An informal definition of each of these guarantees is as follows.

Consistency - A read operation should return the most up-to-date result regardless of which node receives the request.

Availability - In the event of node failure, the system should still be able to function, meaning each request will receive a response within a reasonable amount of time.

Partition tolerance - In the event of a network partition, the system will continue to function.

2.4.1 Implication of CAP Theorem on DPM

Since we cannot have all three guarantees, as stated by the CAP Theorem, we need to carefully consider which guarantee we are willing to discard based on our requirements. Availability is our highest priority, since our ultimate aim is to design a DPM architecture that is resilient to head node failure. This means deploying multiple head nodes to increase the availability of the DPM system.

DPM relies on records in both the database and cache to function. In a multiple head node architecture, these data would likely have to be synchronised on all the head nodes. As such, to ensure operation correctness, consistency is also one of our requirements.

Any network partition in a distributed system is, of course, less than ideal. Realistically, however, as DPM is mostly deployed on machines in close proximity, for instance in a single data centre as opposed to over a WAN, network partition is less of an issue. Any network issue that happens in a typical DPM environment would likely affect all the nodes in the system. Based on the reasoning above, we believe our architecture should prioritise consistency and availability.

2.5 Concluding Remarks

In this chapter, the architecture and core components of DPM were examined. Limitations of the legacy components in DPM and the motivation behind the recent refactoring effort were explained. With the addition of DMLite and DOME, it is now worthwhile to explore whether a multiple head node deployment is viable with a legacy-free DPM instance. Lastly, we have explained the reasoning behind prioritising consistency and availability over partition tolerance in our new architecture.

Chapter 3

Setting up a legacy-free DPM testbed

As DPM is composed of a number of components and services, with many opportunities for misconfiguration that would result in a non-functioning system, manual configuration is discouraged. Instead, DPM storage elements are usually set up using the supplied puppet manifest templates with the puppet configuration manager. However, since this project aims to explore the possibility of a novel DOME-only, multiple head node DPM deployment model, some of the components have to be compiled from source and then installed and configured manually.

The testbed will serve three purposes. Firstly, we want to find out whether DPM functions correctly if we exclude all the legacy components, meaning that our DPM instance will only include DMLite, DOME, MySQL (MariaDB on Centos 7), and the Apache HTTP frontend. Secondly, once we have verified that our legacy-free testbed is functional and have redesigned some of the components, the testbed will serve as a foundation for incorporating additional head nodes and the necessary changes in DPM to facilitate this new deployment model. Lastly, the testbed will be used to evaluate the performance impact of the new design.

As DOME has only recently gone into production and no other grid site has yet adopted a DOME-only approach, to the best of our knowledge no one has attempted this outside of the DPM development team. As such, we believe our experience in setting up the testbed will be valuable both to grid sites that may later upgrade to DOME, and as feedback for the DPM developers.

The remainder of this chapter describes the steps that were taken to successfully set up a DOME-only DPM testbed, including details on the infrastructure, specifications, and configurations. Major issues encountered during the process are also discussed.

3.1 Infrastructure

For ease of testing and deployment, virtual machines (VM) were used instead of physical machines. This decision will certainly impact the performance of the cluster and will be taken into account during performance evaluation. All VMs used in the testbed are hosted by the University's OpenStack instance.

3.2 Initial testbed architecture

As mentioned earlier in this chapter, our first objective is to verify the functionality of a legacy-free DPM instance. As such, our initial testbed will only have one head node; redundant head nodes will be added to the testbed once we have verified the functionality of the single head node instance. The testbed will also include two disk nodes for proper testing of file systems and pool management.

DPM provides the option to host the database server either locally on the head node, or remotely on another machine. The remote hosting option will remain open to storage elements, but in our design we will also try to accommodate the local database use-case.

We will also incorporate our own Domain Name System (DNS) server in the testbed. The rationale behind this is, firstly, that we want to evaluate our testbed in isolation. By having our private DNS server, we will be able to monitor the load on the DNS service and examine whether it becomes a bottleneck in our tests. Secondly, having full control of the DNS service opens up the possibility of hot-swapping the head nodes by changing the IP address mappings in the DNS configuration.

The initial architecture of the testbed is shown in Figure 3.1.

3.3 Testbed specification

After consulting with the DPM development team, it was decided that VMs with 4 virtual CPUs (VCPUs) and 8GB of RAM are sufficient for the purpose of this project. Among the VM flavours offered by OpenStack, the m1.large flavour provides 4 VCPUs, 8GB of RAM and 80GB of disk space, which fits our needs perfectly. The nameserver requires minimal disk space and CPU; as such, we have chosen the m1.small flavour, which provides 1 VCPU, 2GB of RAM and 20GB of disk space. All VMs in the testbed run the Centos 7 operating system. A detailed list of software used in the testbed and their versions can be found in Appendix A.

Figure 3.1: Simplified view of architecture of initial testbed

3.4 Creating the VMs

Four VM instances were created using OpenStack in the nova availability zone (.novalocal domain). We then assigned a unique floating IP address to each of these instances so that they can be accessed from outside of the private network. The hostnames and IPs of these instances will be referenced throughout this chapter and are shown in Table 3.1.

Hostname         FQDN                       Private IP     Floating IP
dpm-nameserver   dpm-nameserver.novalocal   192.168.1.10   172.16.49.14
dpm-head-1       dpm-head-1.novalocal       192.168.1.14   172.16.49.2
dpm-disk-1       dpm-disk-1.novalocal       192.168.1.12   172.16.48.224
dpm-disk-2       dpm-disk-2.novalocal       192.168.1.6    172.16.48.216

Table 3.1: Network identifiers of VMs in testbed
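For reference, instances like these can also be created and given floating IPs with the OpenStack command-line client. The image, network, and key names below are placeholders for illustration; only the flavour, availability zone, and the floating IP are values from this testbed.

# Sketch: create the first head node VM and attach its floating IP
# (image/network/key names are placeholders, not values from this testbed)
openstack server create --flavor m1.large --image CentOS-7-x86_64 \
    --network private --availability-zone nova --key-name mykey dpm-head-1
openstack floating ip create external
openstack server add floating ip dpm-head-1 172.16.49.2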

The fully qualified domain names (FQDNs) of these nodes are important, as they need to be included in the head and disk nodes' host certificates exactly as they appear; otherwise the host will not be trusted by the other nodes. Since DPM and most of the other grid middleware packages are located in the Extra Packages for Enterprise Linux (EPEL) repository, we need to install the repository on each of these VMs.

sudo yum install epel-release
sudo yum update

Then enable the EPEL testing repository for the latest version of DOME and DMLite:

sudo yum-config-manager --enable epel-testing

Install DOME and its dependencies:

On head node:
sudo yum install dmlite-dpmhead-dome

On disk node:
sudo yum install dmlite-dpmdisk-dome

Make sure SELinux is disabled on all the nodes, as it sometimes interferes with DPM operations. This is done by setting SELINUX=disabled in /etc/sysconfig/selinux. Before we can further configure the nodes, we need to acquire a host certificate for each of the nodes to be used for SSL communication.

3.5 Setting up a certificate authority

DPM requires a valid grid host certificate to be installed on all its nodes for authentication reasons. Since we did not know how many VMs we would end up using, and to avoid going through the application process with a real CA every time we have to spin up a new VM in the testbed, we decided to set up our own CA to do the signing. It does not matter which host does the signing as long as the CA is installed on that host and it has the private key of the CA. In our testbed we used the nameserver to sign certificate requests.

To set up a grid CA, install the globus-gsi-cert-utils-progs and globus-simple-ca packages from the Globus Toolkit. These packages can be found in the EPEL repository.

3.5.1 Create a CA

First we use the grid-ca-create command to create a CA with the X.509 distinguished name (DN) "/O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/CN=DPM Testbed CA". This will be the CA we use to sign host certificates with for all the nodes in the cluster.

Our new CA will have to be installed in every node in the cluster before the nodes will trust any certificate signed by it. To simplify the process, our CA can be packaged into an RPM by using the grid-ca-package command, which will give us an RPM package containing our CA and its signing policy that can be distributed and installed on the nodes using yum localinstall.

3.5.2 Create the host certificates

Each of the nodes in the cluster will need its own host certificate. Since we have control of both the CA and the nodes, we can issue all the requests on the nameserver on behalf of all the nodes. Running grid-cert-request -host will generate both a private key (hostkey.pem) for that host and a certificate request, which we then sign with our CA using the grid-ca-sign command:

grid-ca-sign -in certreq.pem -out hostcert.pem
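Since the same two steps are repeated for every node, they can be scripted on the CA host. This is a minimal sketch under the assumptions that grid-cert-request accepts the -dir option to choose an output directory and names the request file hostcert_request.pem; the exact file names may differ between Globus versions (the text above uses certreq.pem).

# Sketch: request and sign a host certificate for each node, run on the CA/nameserver host
for fqdn in dpm-head-1.novalocal dpm-disk-1.novalocal dpm-disk-2.novalocal; do
    mkdir -p "certs/$fqdn"
    grid-cert-request -host "$fqdn" -dir "certs/$fqdn"   # assumed to write hostkey.pem and hostcert_request.pem
    grid-ca-sign -in "certs/$fqdn/hostcert_request.pem" -out "certs/$fqdn/hostcert.pem"
done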

The hostkey.pem and hostcert.pem files then have to be transferred to the VM that corresponds to the FQDN, and stored in the /etc/grid-security/ directory with the correct permissions:

sudo chmod 400 /etc/grid-security/hostkey.pem
sudo chmod 444 /etc/grid-security/hostcert.pem

The certificate and private key also need to be placed in a location used by DPM:

sudo mkdir /etc/grid-security/dpmmgr
sudo cp /etc/grid-security/hostcert.pem /etc/grid-security/dpmmgr/dpmcert.pem
sudo cp /etc/grid-security/hostkey.pem /etc/grid-security/dpmmgr/dpmkey.pem

Make sure the files are owned by the dpmmgr user:

sudo groupadd -g 151 dpmmgr
sudo useradd -c "DPM manager" -g dpmmgr -u 151 -r -m dpmmgr
sudo chown -R dpmmgr.dpmmgr /etc/grid-security/dpmmgr

3.5.3 Create the user certificate

We also need to generate a grid user certificate for communicating with the testbed as a client. This certificate will be used during testing, for instance when supplied to stress testing tools. For testing purposes, we will generate a user certificate without a password to make the testing process easier. This is done by using grid-cert-request with the -nodes switch. Our user certificate has the DN:

"/O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/OU=DPM Testbed CA/CN=Eric Cheung"

3.6 Nameserver

For our nameserver, we chose the popular BIND DNS server [13]. We discuss the configuration of the DNS server in detail because it is related to how we plan on hot-swapping the head nodes. As a result, it is very important to note how the FQDN of the head node is mapped to its IP address in the configuration.

sudo yum install bind bind-utils

In /etc/named.conf, add all the nodes in our cluster that will use our DNS server to the trusted ACL group:

acl "trusted" {
    192.168.1.10;    // this nameserver
    192.168.1.14;    // dpm-head-1
    192.168.1.12;    // dpm-disk-1
    192.168.1.6;     // dpm-disk-2
};

Modify the options block:

listen-on port 53 { 127.0.0.1; 192.168.1.10; };
#listen-on-v6 port 53 { ::1; };

Change allow-query to our trusted group of nodes:

allow-query { trusted; };

Finally, add this to the end of the file:

include "/etc/named/named.conf.local";

Now set up the forward zone for our domain in /etc/named/named.conf.local:

zone "novalocal" {
    type master;
    file "/etc/named/zones/db.novalocal";    # zone file path
};

Then we can create the forward zone file where we can map the FQDNs in our zone to their IP addresses:

sudo chmod 755 /etc/named
sudo mkdir /etc/named/zones
sudo vim /etc/named/zones/db.novalocal

$TTL 604800
@ IN SOA dpm-nameserver.novalocal. admin.novalocal. (
        1         ; Serial
        604800    ; Refresh
        86400     ; Retry
        2419200   ; Expire
        604800 )  ; Negative Cache TTL
;
; name servers - NS records
    IN NS dpm-nameserver.novalocal.

; name servers - A records
dpm-nameserver.novalocal.    IN A 192.168.1.10

; 192.168.0.0/16 - A records
dpm-head-1.novalocal.        IN A 192.168.1.14
dpm-disk-1.novalocal.        IN A 192.168.1.12
dpm-disk-2.novalocal.        IN A 192.168.1.6

The two most important things to note in this configuration are the IP addresses in the trusted group in named.conf, and the A records of the nodes in db.novalocal. In theory, if we spin up an additional head node with the same FQDN, we can simply substitute its IP address in place of the IP of the old head node to redirect any inter-cluster communication and client requests toward the new head node, as illustrated in Figure x. In a production site, we recommend setting up a backup DNS server as well as the reverse zone file, so that the lookup of FQDNs using IPs is also possible. For the purpose of this project, since we are not studying the availability of the nameserver, nor do we plan on doing reverse lookups, the configuration listed above should suffice.
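To make the hot-swap idea concrete, replacing the head node only requires repointing its A record and reloading BIND. The sketch below assumes the replacement head node has the private IP 192.168.1.9 (one of the addresses later used for a redundant head node); remember to increment the zone serial so that the change is picked up.

; in /etc/named/zones/db.novalocal: bump the Serial value, then repoint the head node record
; (192.168.1.9 is the assumed address of the replacement head node)
dpm-head-1.novalocal.        IN A 192.168.1.9

# reload the nameserver so the updated zone is served
sudo systemctl reload named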

3.7 HTTP frontend

The httpd server and a few other modules are required to allow access to DPM through HTTP and the WebDAV extension. Key configuration steps include ensuring that the mod_gridsite module and the mod_lcgdm_dav module are installed, which handle authentication and WebDAV access, respectively.
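A quick sanity check is to list the modules httpd has loaded and confirm that the GridSite and LCGDM WebDAV modules are among them. This is only a sketch: the exact identifiers printed by httpd -M may differ between package versions.

# List the loaded Apache modules and filter for the two modules of interest
httpd -M 2>/dev/null | grep -Ei 'gridsite|lcgdm'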

3.8 DMLite adaptors

The DMLite framework uses plugins to communicate with the underlying backend services. A traditional DPM instance would use the adaptor plugin to route requests to the DPM and DPNS daemons. Since we do not have those legacy daemons on the testbed, we need to replace that plugin with the DOME adaptor, so that requests are routed to DOME instead. This is done by editing dmlite.conf to load the dome_adaptor library instead of the old adaptor, and by removing the adaptor.conf file.
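As a rough illustration of what this amounts to, the sketch below swaps the LoadPlugin directives in the DMLite configuration. The plugin symbol names and library path here are assumptions, not taken from a working configuration; the authoritative names are in the domeadapter.conf shipped with the DOME packages (see Appendix A.6).

# /etc/dmlite.conf.d/domeadapter.conf (illustrative sketch; the plugin symbols are assumed names)
# Route namespace and pool operations through DOME instead of the legacy DPM/DPNS daemons
LoadPlugin plugin_domeadapter_headcatalog /usr/lib64/dmlite/plugin_domeadapter.so
LoadPlugin plugin_domeadapter_pools /usr/lib64/dmlite/plugin_domeadapter.so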

3.9 Database and Memcached

DPM works with MySQL-compatible database management systems (DBMS); on our testbed we used the default relational DBMS on Centos 7, which is MariaDB. The configuration process is mostly identical to a legacy DPM instance, and involves importing the schemas of cns_db and dpm_db, as well as granting access privileges to the DPM process.

However, we initially had some trouble getting our database backend to work in a legacy-free instance. We discovered that the issue was caused by DMLite loading some mysql plugins that are no longer needed in our scenario. We managed to resolve the issue by configuring DMLite to load only the namespace plugin for mysql-related operations. In /etc/dmlite.conf.d/mysql.conf, make sure only the namespace plugin is loaded:

LoadPlugin plugin_mysql_ns /usr/lib64/dmlite/plugin_mysql.so
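For completeness, the privilege-granting step mentioned above comes down to standard MySQL grants on the two DPM databases. The account name and password below are placeholders rather than values from this testbed; the schema files themselves are shipped with the DPM packages.

-- Sketch: grant the DPM service account access to both databases
-- ('dpmdbuser' and 'changeme' are placeholders)
GRANT ALL PRIVILEGES ON cns_db.* TO 'dpmdbuser'@'localhost' IDENTIFIED BY 'changeme';
GRANT ALL PRIVILEGES ON dpm_db.* TO 'dpmdbuser'@'localhost' IDENTIFIED BY 'changeme';
FLUSH PRIVILEGES;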

Since DOME now includes an internal metadata cache which fulfils the same purpose as the Memcached layer in a legacy setup, Memcached is not installed on the testbed.

3.10 Creating a VO

Storage elements on the grid use the grid-mapfile to map all the users from VOs that are supported on the site. For testing purposes, we will use our own VO and directly map our local users to the testbed by using a local grid-mapfile. This is done so that we can bypass the Virtual Organization Membership Service (VOMS). The conventional VO name for development is dteam, and we will use that on our testbed.

To create the mapfile, add this line to /etc/lcgdm-mkgridmap.conf:

gmf_local /etc/lcgdm-mapfile-local

Then create and edit the /etc/lcgdm-mapfile-local file, entering the DN-VO pair for each user we would like to support:

"/O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/OU=DPM Testbed CA/CN=Eric Cheung" dteam

Run the supplied script manually to generate the mapfile. In a production site this would be set up as a cron job so that the mapfile stays up-to-date:

/usr/libexec/edg-mkgridmap/edg-mkgridmap.pl --conf=/etc/lcgdm-mkgridmap.conf --output=/etc/lcgdm-mapfile --safe
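As a sketch of the cron approach mentioned above, a production site could regenerate the mapfile a few times a day; the six-hour schedule below is an arbitrary choice.

# /etc/cron.d/lcgdm-mkgridmap (sketch; the schedule is arbitrary)
0 */6 * * * root /usr/libexec/edg-mkgridmap/edg-mkgridmap.pl --conf=/etc/lcgdm-mkgridmap.conf --output=/etc/lcgdm-mapfile --safe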

3.11 Establishing trust between the nodes

On the head node, edit the /etc/domehead.conf file and add the DNs of the disk nodes to the list of authorised DNs:

glb.auth.authorizeDN[]: "CN=dpm-disk-1.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid", "CN=dpm-disk-2.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid"

On the disk nodes, edit the /etc/domedisk.conf file and add the DN of the head node to the list of authorised DNs:

glb.auth.authorizeDN[]: "CN=dpm-head-1.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid"

3.12 Setting up the file systems and disk pool

During the configuration process, we encountered some issues with the dmlite-shell, which is used as an administration tool on the head node. In a normal deployment, DPM would be configured by puppet, which would create the skeleton directory tree in the DPM namespace by inserting the necessary entries into the cns_db database. Since we are manually configuring the system, we have to carry out this step ourselves. The key record is the '/' entry, which acts as the root of the logical view of the file system. On the head node:

mysql -u root
> use cns_db
> INSERT INTO Cns_file_metadata (parent_fileid, name, owner_uid, gid) VALUES (0, '/', 0, 0);

Then start the dmlite-shell, remove the entry we just added using unlink -f, then create the entry again, this time using mkdir, so that all the required fields are properly set. Once that is done we can also set up the basic directory tree using the shell:

sudo dmlite-shell
> unlink -f /
> mkdir /
> mkdir /dpm
> mkdir /dpm/novalocal    (our domain)
> mkdir /dpm/novalocal/home

Add a directory for our VO and set the appropriate ACL:

> mkdir /dpm/novalocal/home/dteam    (our VO)
> cd /dpm/novalocal/home/dteam
> chmod dteam 775
> groupadd dteam
> chgrp dteam dteam
> acl dteam d:u::rwx,d:g::rwx,d:o::r-x,u::rwx,g::rwx,o::r-x set

Add a volatile disk pool to our testbed:

> pooladd pool_01 filesystem V

Once we have a disk pool, we can add file systems to it. This has to be done on all disk nodes that wish to participate in the pool. In the normal shell, create a directory which DPM can use as a file system mount point and make sure it is owned by DPM so it can write to it:

sudo mkdir /home/dpmmgr/data
sudo chown -R dpmmgr.dpmmgr dpmmgr/data

Then we can add the file systems on both disk nodes to our pool. On the head node, inside dmlite-shell:

> fsadd /home/dpmmgr/data pool_01 dpm-disk-1.novalocal
> fsadd /home/dpmmgr/data pool_01 dpm-disk-2.novalocal

Verify our disk pool (it may take a few seconds before DOME registers the new file systems):

> poolinfo pool_01 (filesystem)
freespace: 155.14GB
poolstatus: 0
filesystems:
    status: 0
    freespace: 77.58GB
    fs: /home/dpmmgr/data
    physicalsize: 79.99GB
    server: dpm-disk-1.novalocal

    status: 0
    freespace: 77.56GB
    fs: /home/dpmmgr/data
    physicalsize: 79.99GB
    server: dpm-disk-2.novalocal

s_type: 8
physicalsize: 159.97GB
defsize: 1.00MB

One last thing we need to do before we can test the instance is to create a space token for our VO, so that we can write to the disk pool:

> quotatokenset /dpm/novalocal/home/dteam pool pool_01 size 10GB desc test_quota groups dteam
Quotatoken written.
poolname: 'pool_01'
t_space: 10737418240
u_token: 'test_quota'

3.13 Verifying the testbed

At this stage, we should have a functional legacy-free testbed that is able to begin serving client requests. To verify the testbed's functionality, we used the Davix HTTP client to issue a series of requests toward the head node. The operations we performed include uploading and downloading replicas, listing contents of directories and deleting replicas.

The outcome of the requests was verified against the log files, the database entries, and the disk nodes' file systems. For example, listing the contents of the home directory of our dteam VO:

[centos@dpm-nameserver ~]$ davix-ls --cert ~/dpmuser_cert/usercert.pem --key ~/dpmuser_cert/userkey_no_pw.pem --capath /etc/grid-security/certificates/ https://dpm-head-1.novalocal/dpm/novalocal/home/dteam -l
drwxrwxr-x 0 12738    2017-08-06 20:18:12 hammer
-rw-rw-r-- 0 10485760 2017-07-26 15:03:49 testfile_001.root
-rw-rw-r-- 0 10485760 2017-07-31 20:21:31 testfile_002.root
-rw-rw-r-- 0 10485760 2017-07-31 20:23:04 testfile_003.root
-rw-rw-r-- 0 10485760 2017-07-31 20:23:17 testfile_004.root
-rw-rw-r-- 0 10485760 2017-08-02 13:41:59 testfile_005.root
-rw-rw-r-- 0 10485760 2017-08-02 13:59:33 testfile_006.root
-rw-rw-r-- 0 10485760 2017-08-02 14:06:41 testfile_007.root
-rw-rw-r-- 0 10485760 2017-08-02 14:33:20 testfile_008.root

Reading the contents of the helloworld.txt file and printing it to stdout:

[centos@dpm-nameserver ~]$ davix-get --cert ~/dpmuser_cert/usercert.pem --key ~/dpmuser_cert/userkey_no_pw.pem --capath /etc/grid-security/certificates/ https://dpm-head-1.novalocal/dpm/novalocal/home/dteam/helloworld.txt
Hello world!
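The upload operations mentioned above were issued in the same way; the example below is a hedged sketch of such an upload with davix-put, where the local file name is arbitrary.

[centos@dpm-nameserver ~]$ davix-put --cert ~/dpmuser_cert/usercert.pem --key ~/dpmuser_cert/userkey_no_pw.pem --capath /etc/grid-security/certificates/ testfile_009.root https://dpm-head-1.novalocal/dpm/novalocal/home/dteam/testfile_009.root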

3.14 Problems encountered and lessons learned

Many of the issues we encountered setting up the testbed were due to our initial lack of knowledge of the grid environment. For instance, we were unaware of the X.509 extensions that are used in signing grid certificates and did not understand why our certificates signed using plain OpenSSL were being rejected. We were also unfamiliar with how members of a VO are authenticated by the DPM system, which resulted in a lot of time spent on log monitoring and debugging before the testbed could even be tested.

Perhaps most importantly, there are many services and plugins that need to be configured correctly in a DPM instance. A single incorrect setting in one of the many configuration files will result in a non-functional system. During the setup process, there were many occasions where we had to maximise the log level in DMLite, DOME, httpd and MariaDB and then analyse the log files in order to diagnose the source of misconfiguration.

Chapter 4

Investigation

As mentioned in Chapter 2, DMLite and DOME were designed to replace the legacy components of DPM and aim to bypass some of the limitations imposed by the old stack. However, neither DOME nor DMLite was designed to run on more than a single head node. In order to successfully design a functional high availability DPM architecture, we must first identify all the limiting factors in DMLite and DOME that would prevent us from deploying redundant head nodes, and redesign them where possible. A functional high availability DPM architecture must have the following attributes:

• Resilience to head node failure - the system must continue to function and serve client requests should the primary head node fail.
• Automatic recovery - in the event of head node failure, a DPM instance using the new architecture must fail over automatically, in a manner that is transparent to the clients.
• Strong data consistency - the redundant head nodes must have access to the most up-to-date information about the file system and the status of the cluster.

Ultimately, providing availability to DPM means increasing the number of head nodes, and therefore turning DPM into an even more distributed system. Providing any distributed system with availability and consistency guarantees will likely have performance implications, which we must also keep in mind in our design. The rest of this chapter describes the findings of our investigation and the recommended redesign of the relevant components to allow for a high availability DPM architecture.

4.1 Automating the failover mechanism

Ideally, when a head node fails, the system should automatically reroute client requests to one of the redundant head nodes in a way that is transparent to the clients. One of the options to achieve this failover mechanism is to use a floating IP address that is shared between all the head nodes, combined with a tool that facilitates node monitoring and automatic assignment of this floating IP.

Figure 4.1: Failover using keepalived

Keepalived [14] is a routing software designed for this use-case; it offers both load balancing and high availability functionalities to Linux-based infrastructure. In keepalived, high availability is provided by the Virtual Router Redundancy Protocol (VRRP), which is used to dynamically elect master and backup virtual routers. Figure 4.1 illustrates how keepalived can be used to provide automatic failover to a DPM system. In this topology, all client requests are directed at the floating IP address. If the primary head node fails, the keepalived instances on the redundant head nodes will elect a new master based on the priority value of the servers in the configuration file. The new master then reconfigures the network settings of its node and binds the floating IP to its interface. From the clients' perspective, their requests continue to be served using the same IP address even though they are now fulfilled by a different head node. If the primary head node rejoins the network, keepalived will simply promote the primary node to master again if its server has the highest priority value.

With this topology, we can use a single DNS entry in the nameserver for all the head nodes in the cluster, since they would all have the same FQDN and use the same floating IP address, thus further simplifying the configuration process of the system.
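To make this topology concrete, a minimal keepalived VRRP configuration for the primary head node could look like the sketch below. The interface name, virtual router ID, password, and floating IP are assumptions; a redundant head node would use state BACKUP and a lower priority.

# /etc/keepalived/keepalived.conf on the primary head node (sketch; values are assumed)
vrrp_instance DPM_HEAD {
    state MASTER                  # BACKUP on the redundant head nodes
    interface eth0                # interface holding the node's private address
    virtual_router_id 51
    priority 150                  # redundant head nodes use lower values, e.g. 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass dpmsecret       # placeholder password
    }
    virtual_ipaddress {
        192.168.1.100             # assumed floating IP shared by all head nodes
    }
}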

4.1.1 Implementation

Based on our research, keepalived would be the ideal solution for head node failover. Unfortunately, after spending a considerable amount of effort configuring keepalived, we discovered that in order to set it up successfully on our testbed we would require administrative privileges at the OpenStack instance level (to configure OpenStack Neutron), which we do not have.

However, on a production site this should not be an issue, especially when the site has full control of its network and deploys DPM on physical machines instead of VMs.

4.2 Database

Some grid sites prefer to host the database backend locally on the head node for performance reasons, and we would like to preserve this use-case. The first step toward achieving this goal is to fully understand what is stored in the databases, and what their roles are in the DPM system.

4.2.1 Metadata and operation status

Information stored in the DPM database backend can be categorised into two groups, metadata and operation status.

Metadata

Metadata kept by DPM include information that is critical for DPM to function correctly, for example for validating a user's DN or for translating the logical file name of a replica to its physical file name. The different groups of data kept are summarised as follows.

• File system information - which file systems are available on the disk nodes and which disk pool they belong to.
• Pool information - size, status, and policies of disk pools.
• Space reservations - space tokens of supported VOs, describing the available and used storage space of a disk pool a VO has access to.
• File metadata - information on each unique file and directory managed by the system, including POSIX file data such as size, type, and ACL.
• Replica metadata - information on the replicas of each file, including which disk pool a replica belongs to, which file system the replica is located on, and which disk node is hosting that file system.
• User and group information - including DNs of users, privilege levels, and ID mappings that are used internally.

Operation status

If a request (read, write, copy) cannot be immediately fulfilled by DPM, for instance if the requested replica has to be pulled from an external site or because of scheduling by some job management system, the request is recorded in the dpm_db database as pending. The information recorded includes the protocol used in the request, the DN and host name of the client, the number of retries, error messages, the requested resource, and the status of the request.

4.2.2 Issues

DPM cannot function without access to the information stored in the databases. As our aim is to increase the availability of a DPM storage element, we have to provide a certain degree of redundancy for the database backend. Sites that wish to use a dedicated server to host their database service will be responsible for providing redundancy for that service, and can choose from a number of options that are likely already built into the database. Since we also want to support the local database backend use-case, we have to implement a way to share database records across multiple head nodes.

Grid sites are recommended to install DPM on physical hardware instead of VMs for performance reasons. As such, simply starting another VM with the latest snapshot is not a viable solution, not to mention that the new head node would not have the most up-to-date data when it is swapped in, which would leave the system in an inconsistent state. NoSQL solutions are also deemed not acceptable because we require the ACID properties provided by a transactional database.

4.2.3 Analysis

A number of technologies already exist which aim to increase the availability of relational database services. The differences between these technologies lie in which parallel architecture, type of replication, and node management mode they support. A brief overview of these differences is presented as follows.

Parallel database architecture

Parallel database management systems are typically based on two architectures: shared nothing or shared something. A shared something architecture may use shared memory, shared disk, or a combination of both. Below is a brief overview of each of these architectures.

• Shared nothing: As the name implies, in a shared nothing architecture each node maintains a private view of its memory and disk. Because nothing is shared between the nodes, a distributed database system using this architecture often finds it easier to achieve higher availability and extensibility, at the cost of increased data partitioning complexity for load-balancing reasons [15].

• Shared memory and shared disk: The shared memory and shared disk architectures depend on fast interconnects to connect processors to the available memory pools or disks in the system. Since DPM is designed to be deployed on commodity hardware, these architectures are not applicable to our use-case.

Replication

Replication is the process of synchronising records across multiple database servers to minimise data loss due to node failure. Replication is mostly concerned with the create, update, and delete operations. There are two types of replication, each with their advantages and disadvantages; which type of replication to use depends on the use-cases of the site, as explained below.

• Synchronous replication: In synchronous replication, the node which received the request writes the record to local storage and to the remote database servers simultaneously. As such, the records held in every node in the system always remain synchronised. With synchronous replication, data loss in the event of node failure is minimised, as the other nodes hold an identical copy of the database. However, this guarantee comes at a cost in performance. To ensure strict consistency, synchronous replication has to employ the two-phase commit (or three-phase commit) technique, which means transactions are blocking and cannot be committed until every node in the cluster has confirmed it can do so.

• Asynchronous replication: In contrast to synchronous replication, asynchronous replication commits transactions locally without waiting for confirmation from the other nodes in the cluster. Instead, the transactions are sent to the nodes in near real-time, or in batches at set intervals. From a user's perspective, there should be little noticeable difference in performance compared to writing to a standalone database. However, there is a higher chance of data loss if the primary node (the one accepting write requests) fails before the transactions have been sent out and applied to the other nodes in the system.

Communication model

The communication model in a multi-server database system defines what role each node takes in the cluster. There are typically two models: master-slave, or multi-master (also known as active-active). In a master-slave scenario, only one node in the cluster is elected as the primary node (master), with the rest of the nodes being slave nodes. The role of the master node is to accept read and write requests to the cluster. In the event of writes, changes are propagated to the rest of the nodes through replication. Typically, in a master-slave cluster the slave nodes are read-only and act as backups should the master node fail.

A master-slave cluster requires some logic component for electing a new master in the event that the current master cannot be contacted, either due to node failure or network partition. In a multi-master cluster, each of the nodes is writable. This is advantageous for use-cases where parallel processing is desirable or when data is partitioned. Because there is no slave node in a multi-master cluster, master election logic is not needed.

4.2.4 Options

Several replication implementations for the MySQL database system already exist. We will examine the relevant features and behaviour of two of the popular options.

Group Replication

Group Replication (GR) [16] is a replication implementation by Oracle, the upstream provider of MySQL. GR uses the master-slave communication model by default but can also be configured to use multi-master. In the master-slave mode, GR auto-elects a master node, and the user does not influence the process. GR offers virtually synchronous replication using a majority rule, where transactions are committed, and control returned to the client, once a majority of nodes in the cluster have confirmed the transaction. This replication method has the advantage of higher performance compared to strict synchronous replication. However, it also carries a risk of losing data should the majority of nodes fail before the minority has received the transactions. In the event of a network partition, separated GR nodes are expelled from the cluster should they lose the quorum, and will not rejoin the cluster until GR is restarted.

Galera

Galera [17] is a synchronous multi-master replication library originally developed by Codership. The communication model in Galera is multi-master by default, and write transactions can be applied to any node in the cluster. Galera's replication mode is virtually synchronous, where transactions are written synchronously to every node in the cluster. If a node fails in a Galera cluster, the other healthy nodes in the cluster will wait (block) until either the faulty node rejoins the network or it is removed once the timeout expires. In the event of a network partition, the nodes with the quorum will form a new cluster, and the minority nodes will attempt to reconnect to the new cluster and rejoin it automatically when able.

Figure 4.2: Synchronising records with Galera cluster

Discussion For our use-case, load balancing is not the focus, and we only need one head node to be accessible at a given time. A multiple head node architecture would require either a multi-master model, so that every database node is writable, or a master-slave model where we can influence the election process. This is important because we want to avoid the scenario where the swapped-in redundant head node rejects client requests, which may happen if it has not been elected the new master and therefore cannot write to the database.

Another key requirement is operation correctness. For instance, should the primary head node fail before transactions are committed on the redundant head nodes, some of the nodes may have an inconsistent view of the file systems. As such, it is more acceptable to refuse a few client requests than to have inconsistent records in the databases on the head nodes. Since Galera's replication implementation provides strong consistency and automatic cluster recovery, we believe Galera is a better fit for our requirements.

4.2.5 Recommendation

MariaDB offers its own Galera implementation called MariaDB Galera Cluster, which has been integrated into the main package since MariaDB 10.1. As we have already verified the functionality of the standalone MariaDB on our CentOS 7 testbed, it makes sense to upgrade MariaDB to a version which includes Galera Cluster. There are several performance tuning options in Galera, but we will not cover them as they are outside the scope of this project. The topology of our Galera cluster on the head nodes is shown in Figure 4.2.
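As a rough sketch of the upgrade itself, assuming the MariaDB 10.1 yum repository is already configured on each head node, installing the Galera-enabled server and the wsrep provider library would look something like the following (the exact package names depend on the repository used):

sudo yum install MariaDB-server MariaDB-client galera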

Setting up the Galera cluster

Once MariaDB Galera Cluster is installed on all the head nodes, we need to configure the nodes to form a cluster. The first step is to add the private IP addresses and hostnames of all our Galera nodes to the /etc/hosts file on each node. This is necessary because all our head nodes share a common FQDN by design, but we also want to assign a different hostname to each Galera node for debugging and cluster monitoring purposes.

192.168.1.14 mariadb01
192.168.1.9  mariadb02
192.168.1.8  mariadb03

On each of the Galera nodes, edit the /etc/my.cnf.d/server.cnf file. The key configurations here are the IP addresses of the cluster members, the communication protocol, and the IP address and name of the node itself. For example, a snippet of the configuration file on our primary head node:

[galera]
wsrep_cluster_address="gcomm://192.168.1.14,192.168.1.9,192.168.1.8"
wsrep_cluster_name="dpm_db_cluster"
wsrep_node_address="192.168.1.14"
wsrep_node_name="mariadb01"

Once every node has been configured, initialise the cluster by running sudo galera_new_cluster on the primary head node and start MariaDB normally on the rest of the nodes. The status of the database cluster can be checked with the following command:

mysql -uroot -p -e "show status like 'wsrep%'"
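In other words, assuming MariaDB is managed by systemd as on our CentOS 7 nodes, the bootstrap sequence is simply:

# on the primary head node
sudo galera_new_cluster
# on each of the remaining head nodes
sudo systemctl start mariadb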

The fields of note are wsrep_cluster_size, wsrep_incoming_addresses and wsrep_cluster_status, which tell us the size of the cluster, the IP addresses and ports the nodes are listening on, and whether the database node is writable, respectively.

| wsrep_cluster_size | 3 |
| wsrep_incoming_addresses | 192.168.1.14:3306,192.168.1.9:3306,192.168.1.8:3306 |
| wsrep_cluster_status | Primary |

Verifying cluster functionality

Two tests were performed to verify the cluster's functionality. Firstly, to verify record replication, we performed a series of file upload and delete requests towards our primary head node. When the records were created or deleted on the primary head node, we observed the same changes in the databases on our redundant head nodes. Secondly, to verify cluster recovery, we removed our primary head node from the cluster by shutting down the VM. We observed that the two remaining nodes did indeed form a new cluster view and continued to accept requests. Once we had turned the VM back on, our primary head node rejoined the cluster and the changes made to the records on the redundant nodes were propagated back into the primary node's database.

Discussion We encountered some issues when upgrading MariaDB from an existing installation. After the upgrade, MariaDB would not start because it did not recognise the format of the shared tablespace file. To resolve this issue, we uninstalled MariaDB and cleaned up the /var/lib/mysql/ directory before doing a clean install; we were then able to import the databases using the database dump files we had created beforehand. Occasionally, a MariaDB server that belongs to a cluster will refuse to restart because an rsync process is still attached to it. In those instances, we had to manually kill the orphaned rsync process before the server could be restarted.

4.3 DOME in-memory queues

DOME maintains in-memory priority queues of certain operations for load balancing reasons. Currently, the operations which can be queued are checksum requests and file callouts. A checksum request involves DOME dispatching a request to a suitable disk node to calculate the checksum of a specific replica. For file callouts, DOME will dispatch the request to a disk node which will then run a hooked custom script. The custom script can include any file movement mechanism, for example, pulling a file from an external location.

4.3.1 Issues

The main issue with the current design is that the queues are private to a single DOME process. If we were to simply deploy multiple head nodes, the enqueued operations would be lost once the primary node fails, and the redundant head nodes would have no way to recover them. This issue, if unsolved, would directly impact the operational correctness of a multiple head node DPM instance. As such, it is critical that we identify all the internal functionality which relies on the data held in the queues, and design a new mechanism that not only fulfils the same functionality but also supports data replication across all the head nodes. After analysing DOME's source code, we have discovered the internal structures of the in-memory queues, as well as the interactions between the queues and many of the DOME classes. Our findings are summarised as follows.

Self-healing

DOME can self-heal these operations to a certain degree. Once a task is dispatched to a disk node, the disk node will constantly update the head node about the status of the task. However, queued tasks would still be lost if the head node fails. As such, to ensure operation correctness in a multiple head node setting, the information held in these data structures must be replicated to the redundant head nodes in real time.

Task objects

DOME uses the GenPrioQueueItem class to encapsulate the information that represents a checksum or file callout task. The class contains information such as the status of the operation, and the times of insertion and last access. More critically, GenPrioQueueItem instances also contain the parameters necessary to carry out an operation, such as the target disk node and the physical path of a file in the file system. In order to support checksum and file callout requests in a multiple head node DPM architecture, the data represented by GenPrioQueueItem objects must be replicated to all the head nodes in the cluster.

Priority task queue

The GenPrioQueue class acts as a priority task queue in DOME using the producer-consumer design pattern. Internally the class contains four key data structures:

• items: A map of namekey/GenPrioQueueItem-pointer pairs that acts as the main container of the contents of the queue.
• active: A data structure that contains the number of currently running operations and the parameters of those operations. DOME appears to use it to enforce limits set in the configuration file, for example, the total maximum number of checksum requests and the maximum number of checksum requests per node.
• waiting: A list of GenPrioQueueItem sorted by priority. DOME polls the waiting list at set intervals to acquire new tasks.
• timesort: A list of GenPrioQueueItem sorted by how long ago they were inserted into the queue. DOME polls the timesort list at set intervals to remove items which have exceeded the timeout limit.
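For illustration only, the following is a much-simplified sketch of the kind of structure described above; the types and member names are ours and do not reflect DOME's actual implementation.

#include <ctime>
#include <map>
#include <memory>
#include <string>

// Simplified stand-in for GenPrioQueueItem: status, timestamps and the
// parameters needed to dispatch the task to a disk node.
struct TaskItem {
  std::string namekey;      // unique key of the task
  std::string diskNode;     // target disk node
  std::string physicalPath; // physical path of the replica
  int status = 0;
  std::time_t inserted = 0;
  std::time_t lastAccess = 0;
};

// Simplified stand-in for GenPrioQueue with the four containers described above.
struct TaskQueue {
  // main container: namekey -> item
  std::map<std::string, std::shared_ptr<TaskItem>> items;
  // number of running operations per disk node, used to enforce configured limits
  std::map<std::string, int> active;
  // waiting items ordered by priority (lower value = served first)
  std::multimap<int, std::shared_ptr<TaskItem>> waiting;
  // all items ordered by insertion time, polled to expire old entries
  std::multimap<std::time_t, std::shared_ptr<TaskItem>> timesort;
};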

The task queue seems to be a way for DOME to quickly select a task object from the queue based on its priority and time spent waiting, as well as to act as a least recently used (LRU) cache. In a sense, the queue structure is not as critical as the task objects since it only affects the order of execution of the tasks, and its functions could potentially be replaced by an LRU cache that supports sorted sets, which we will explore in the next section.

4.3.2 Options

A simple and straightforward solution would be to move the in-memory queues into the database backend to give them persistence. As we have already implemented the Galera cluster layer, records of the operations and their status would be replicated across the head nodes. However, this approach would likely incur a heavy performance penalty, as every checksum and file callout request would now result in a database transaction that has to be synchronously replicated across the cluster, not to mention that disk access is orders of magnitude slower than memory access. Checking the status of these requests would also require a database lookup. Another option is to introduce an additional cache layer outside of the DOME process that supports replication. Many open source distributed cache technologies already exist, but before we can decide on a suitable caching solution, we must first understand our requirements for such a cache. The primary objective of this project is high availability and not horizontal scalability. This means that we do not need to accommodate the use-case where client requests are load-balanced across multiple head nodes; instead, only one head node would be accessible at any given time. Essentially, we want each of our head nodes to have its own private in-memory cache, and the contents of these caches need to be synchronised. We also want to be certain that, in the event of node failure, once client requests are routed to a redundant head node that node is immediately writable. A caching solution that fulfils our requirements would need to support replication, and use one of the following configurations:

• A master-slave configuration that provides a failover mechanism in which the master election process can be influenced by us, so we can be certain which redundant head node will be writable should the primary fail.
• A multi-master configuration that is writable on every node, so that any redundant head node promoted by keepalived will be able to handle all client requests immediately.

Available technologies

Many in-memory caching technologies already exist and are widely used in the industry. The following technologies were examined and analysed for suitability based on our requirements:

• Apache Ignite
• Redis/Redis Cluster
• Couchbase (community edition)
• Memcached

Some of our main criteria for evaluating these technologies are maintainability, the features of their client APIs (e.g. multi-threading support, a C/C++ client for compatibility with DOME), and the estimated amount of effort required to integrate the technology into the DPM stack. Our analysis suggested that Memcached and Redis would be our best options for a replicated cache solution.

Memcached

Memcached is designed to be a fast and minimalistic distributed cache that supports only the string data type. Memcached by itself does not provide any replication or failover mechanism. In fact, the individual Memcached servers in a cluster are not aware of each other's existence. As such, any replication logic would have to be handled in the client application. One option to enable replication in Memcached is to put all known head nodes into a configuration file which DOME can parse on startup. During operation, when DOME receives a request that requires queuing, instead of inserting the item into DOME's in-process queue, it would insert it into all the available Memcached instances in a broadcast-like manner (for writes, updates and deletes). This does, however, go against the design of Memcached, and it would be difficult to dynamically adjust the number of redundant head nodes without editing the configuration file on every head node and restarting DOME every time we remove or add a head node to the cluster, so that the records remain up to date.

Redis

Redis [18] is an in-memory data structure store that can be used as a cache or a database. Redis provides both a stand-alone version and a clustered version called Redis Cluster. We will refer to the stand-alone version as Redis for the remainder of the report. Although we initially thought Redis Cluster was a better fit for our requirements, further study showed that it is mostly designed for use-cases where sharding support is required, for instance where the memory required to store the necessary data exceeds the capacity of a single machine and has to be distributed, which is not our current use-case. Redis supports asynchronous replication in a master-slave configuration, where writes are only accepted on the master and are propagated to the slave nodes. The failover mechanism in Redis is provided by the monitoring daemon Sentinel [19], which is responsible for monitoring the health of the cluster. If the master node fails, and the redundant head nodes maintain the quorum, Sentinel will promote one of the slaves and configure the other slaves to accept replication from the new master. Should the master node become unreachable due to a network partition, the result depends on which side of the partition has the quorum. If it is the side with the master node, then nothing changes. If the master node is on the minority side, then the side with the quorum will elect a new master.

More crucially, Redis supports multiple data types including sorted sets, which we could potentially use to replace the priority queues currently used by DOME.

Discussion We believe Redis is the most suitable option for our cache layer because it provides similar performance to Memcached [20], while supporting more data structure types as well as a failover mechanism. For our use-case, we could have a Redis instance running in cache mode on each of the head nodes, with the instance on the primary head node configured as master and those on the redundant nodes as slaves replicating from the master instance. A Sentinel instance would also need to be running on each of the head nodes to enable failover.
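As an indication of what the Sentinel side of such a setup involves, a minimal sentinel.conf sketch for each head node could look like the following; the master name (dpm-cache), quorum and timeout values are illustrative assumptions, and 192.168.1.14 stands for the Redis instance on the primary head node:

port 26379
sentinel monitor dpm-cache 192.168.1.14 6379 2
sentinel down-after-milliseconds dpm-cache 5000
sentinel failover-timeout dpm-cache 60000
sentinel parallel-syncs dpm-cache 1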

4.3.3 Recommendation

As a proof of concept, we used the hiredis Redis client to add minimal functionality to DOME to communicate with the Redis cache. The workflow involving DOME and the new components is illustrated in Figure 4.3. A brief overview of the designs of the new components is presented in the following sections.

Redis Context Pool

Establishing a new connection to Redis for every operation would be costly and inefficient. As such, we have implemented a structure that allows DOME to reuse Redis connections. The RedisContextPool class is a pool of Redis connections that is accessible by DOME's worker threads. Redis contexts are not thread safe, so the pool needs to be protected by a mutex. The Redis contexts are set to connect to the local instance of Redis (localhost). This is intentional because each of our head nodes running DOME will have its own Redis instance on the same host, and if a head node gets promoted to primary, we want DOME to talk to its local Redis instance.
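A minimal sketch of such a pool using the hiredis C API is shown below; it is illustrative only, and the class and member names do not correspond to DOME's actual code.

#include <hiredis/hiredis.h>
#include <mutex>
#include <queue>
#include <stdexcept>

class RedisContextPool {
public:
  explicit RedisContextPool(size_t size) {
    for (size_t i = 0; i < size; ++i) {
      // Always connect to the local Redis instance, as described above.
      redisContext *c = redisConnect("127.0.0.1", 6379);
      if (c == nullptr || c->err)
        throw std::runtime_error("cannot connect to local Redis");
      pool_.push(c);
    }
  }
  ~RedisContextPool() {
    while (!pool_.empty()) { redisFree(pool_.front()); pool_.pop(); }
  }
  // Borrow a context; contexts are not thread safe, so access is serialised.
  redisContext *acquire() {
    std::lock_guard<std::mutex> lock(mtx_);
    if (pool_.empty()) return nullptr;   // caller may retry or fall back
    redisContext *c = pool_.front();
    pool_.pop();
    return c;
  }
  // Return a borrowed context once the operation has completed.
  void release(redisContext *c) {
    std::lock_guard<std::mutex> lock(mtx_);
    pool_.push(c);
  }
private:
  std::mutex mtx_;
  std::queue<redisContext *> pool_;
};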

Processing and Pending queues

As a replacement for the in-memory priority queue, we can use Redis sorted sets to simulate the same behaviour. A sorted set in Redis is a data structure of unique elements, where each element is associated with a score. Two sorted sets are kept in the Redis cache, one for running tasks and one for waiting tasks. When DOME enqueues a task, instead of inserting it into its in-process task queue, it now serialises the task object as JSON and inserts the entry into the pending sorted set, with its priority as the score, using the ZADD command. At set polling intervals, DOME pops the top N entries from the pending set and pushes them into the processing set in a single transaction, then dispatches the entries in the processing set to the disk nodes on a different thread. If an entry stays in the processing set for too long, DOME can transfer it back to the pending set with a higher priority score. As Redis supports automatic eviction of entries based on their expiry values, the LRU cache behaviour of DOME's queues is now enforced by Redis.

Figure 4.3: Remodeled workflow of the task queues using replicated Redis caches

The rationale behind this architecture is to allow the head nodes to share the same data space for common data in a producer-consumer pattern. Although at the moment we are mostly concerned with replicating the task objects for availability, the architecture should also allow DPM to scale horizontally by replacing Redis with Redis Cluster, should the future development direction require it. Although we originally had some reservations about using a set, as it can only contain unique values, we now believe this is not an issue because the parameters of a task should be unique to that task. For instance, it would make no sense for the same client to repeatedly request the checksum of the same file with the same checksum type; besides, the result is cached after the first request returns, which prevents the creation of another task.

To ensure that keepalived and Sentinel elect the same master in the event of node failure, the priority values for each node in the Sentinel master election process must have the same order as in the keepalived configuration. This is critical because, if the order differs, keepalived may promote a redundant head node that does not have write privileges to the replicated Redis cache.
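As an illustration of this pattern, the sketch below uses hiredis to insert a serialised task into the pending sorted set and to move the highest-priority entries into the processing set. The key names, the JSON payload and the function names are ours and purely illustrative; DOME's actual implementation differs.

#include <hiredis/hiredis.h>
#include <string>
#include <vector>

static const char *kPending    = "dome:tasks:pending";
static const char *kProcessing = "dome:tasks:processing";

// Add a serialised task to the pending set; a lower score means a higher priority.
bool enqueueTask(redisContext *c, const std::string &taskJson, double priority) {
  std::string score = std::to_string(priority);
  redisReply *r = (redisReply *)redisCommand(c, "ZADD %s %s %s",
                                             kPending, score.c_str(), taskJson.c_str());
  bool ok = (r != nullptr && r->type != REDIS_REPLY_ERROR);
  if (r) freeReplyObject(r);
  return ok;
}

// Move up to n of the highest-priority entries from the pending set to the
// processing set and return them for dispatching to the disk nodes.
std::vector<std::string> fetchNextTasks(redisContext *c, int n) {
  std::vector<std::string> tasks;
  redisReply *r = (redisReply *)redisCommand(c, "ZRANGE %s 0 %d", kPending, n - 1);
  if (r == nullptr || r->type != REDIS_REPLY_ARRAY) {
    if (r) freeReplyObject(r);
    return tasks;
  }
  for (size_t i = 0; i < r->elements; ++i) {
    std::string task = r->element[i]->str;
    // Move the entry between the two sets inside a MULTI/EXEC transaction;
    // the score 0 is a placeholder and could instead record a dispatch deadline.
    redisReply *t = (redisReply *)redisCommand(c, "MULTI");
    if (t) freeReplyObject(t);
    t = (redisReply *)redisCommand(c, "ZREM %s %s", kPending, task.c_str());
    if (t) freeReplyObject(t);
    t = (redisReply *)redisCommand(c, "ZADD %s 0 %s", kProcessing, task.c_str());
    if (t) freeReplyObject(t);
    t = (redisReply *)redisCommand(c, "EXEC");
    if (t) freeReplyObject(t);
    tasks.push_back(task);
  }
  freeReplyObject(r);
  return tasks;
}

Wrapping the move in MULTI/EXEC ensures that other clients never observe an entry in both sets at once.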

4.4 DOME metadata cache

DOME keeps a private in-process cache for information on frequently accessed files. Upon analysis of the source code, we discovered the cache is used internally by DOME in the add, remove, move, stat, checksum operations on files and replicas. The cache seems to act as a layer between DOME and the underlying database backend. As such, the introduction of DOME in the DPM stack has made the traditional Memcached layer redundant. The metadata in DOME are held within the DomeMetadataCache class as a container of DomeFileInfo and DomeFileInfoParent objects.

4.4.1 Issues

Because the metadata are now held in private memory of the DOME process itself, it would be difficult to replicate the cache items across multiple head nodes without fundamentally changing how the existing cache functions.

4.4.2 Options

One option is to leave the cache as it is. If the primary head node fails, the replacement will start with a cold cache. This does not affect operational correctness, and the impact on performance is likely to be small. Enforcing strong consistency on the caches across nodes may even result in worse performance. Another option is to introduce an in-memory caching layer outside of DOME that is capable of managing multiple nodes and supports replication, or partitioning if we accept losing a portion of the data on node failure. There are several options for either case. For partitioning, we could use a Memcached cluster with a server on each head node managing a subset of the metadata. In this scenario, DOME would read and write to the Memcached cluster instead of its own cache, and it would not need to know how many Memcached servers are in the cluster on startup, since it can decide which server to write to using a consistent hashing algorithm. However, this approach offers no data redundancy, and the metadata managed by a head node would be lost to the cluster if that node fails. For a replicated cache solution, since the cache is essentially a simplified version of the queue, we can make use of the Redis cache layer introduced to solve the in-memory queue issue. This has the advantage of minimising the number of extra daemons we add to the DPM system as well as limiting the maintenance effort.

4.4.3 Recommendation

The issues with replicating items in the in-memory cache are closely related to those of replicating items in the in-memory queues in DOME. As such, we can reuse most of the design described in the previous section, the only difference being that a normal set, rather than a sorted set, would be used for the metadata cache. As with the task queues, key eviction is enforced internally by Redis. The workflow involving DOME and the new metadata cache is illustrated in Figure 4.4. It should be mentioned that the inclusion of the new metadata cache is optional. More development effort and testing would be required to determine whether the cache is indeed beneficial in a high availability DPM architecture.

Figure 4.4: Remodeled workflow of the metadata cache using replicated Redis caches

4.5 Recommended architecture for High Availability DPM

Based on our investigation described in the previous sections, we have designed an architecture which we believe should increase DPM's resilience to node failure considerably. This novel high availability DPM architecture, which we call HA-DPM, is shown in Figure 4.5. In this architecture, high availability is achieved by deploying additional head nodes to provide redundancy. Clients access the DPM service by connecting to a floating IP address, which is initially assigned to the primary head node. The floating IP is managed by the keepalived instance on every head node and will be reassigned if the primary head node fails. Persistent data such as the namespace and user information, traditionally held in a single database backend, are now replicated across all the head nodes by a MariaDB Galera cluster for redundancy. Volatile data such as file metadata and task queue items, originally held in the DOME process's private memory space, are now handled by a local Redis instance that replicates the data to every Redis instance in the cluster.

Figure 4.5: Recommended architecture for High Availability DPM

4.5.1 Failover

The core rationale behind the design of this new architecture is to eliminate the single point of failure in DPM's topology, namely the single head node deployment model. There are several scenarios in which node failure could happen, and we describe the expected behaviour of the nodes under the new architecture for each of them. In the event that the primary head node fails or has to be shut down for maintenance, one of the redundant head nodes should simply be promoted to primary and take over client requests. This failover mechanism is provided by both the keepalived and Sentinel components, where the redundant head node with the highest priority setting takes over as primary. The promoted head node should be assigned the floating IP by keepalived in a process that is transparent to clients. If one of the redundant head nodes fails, the primary head node should continue to function as normal, as the redundant nodes are not involved in DPM operations and are only there as a backup. In the event of a network partition, the behaviour of the cluster depends on the location of the partition. If the primary head node is on the side of the partition where the head nodes still have the quorum, then the primary head node should function as normal. Once the network partition has been resolved, the minority nodes should rejoin the cluster and update their database records and Redis caches from the primary head node. If the primary head node is on the minority side of the partition, then the redundant head nodes that have the quorum should elect a new master. The new master should be assigned the floating IP by keepalived and start serving client requests; it should now also have write privileges to its Redis cache. Once the network partition has been resolved, the old master head node should rejoin the cluster. Depending on the priority settings, the old master node would either rejoin as a slave or be promoted back to master and update its database and cache.

4.5.2 Important considerations

The following sections describe some key aspects of the configurations that must be taken into account for the new architecture to function correctly.

FQDN of head nodes

The host names of the head nodes have to be identical. This is because the disk nodes are not aware of the existence of the redundant head nodes, meaning all the head nodes have to use the same host certificate with the same FQDN. Otherwise, when a redundant head node is promoted, there would be an identity mismatch and the disk nodes would refuse connections from the head node.

Priority

The order of priority settings for each head node must be the same for both keepalived and Sentinel. For example, if we have three head nodes in the cluster, A, B, and C, with A being the primary head node, then the priority in both configurations should be:

• A - Highest
• B - Medium
• C - Lowest

This configuration should ensure that if A fails, B would become the new master node with both the floating IP and write access to the Redis cache.
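To illustrate how the two orderings can be kept aligned, the snippets below sketch the relevant settings on head node A; the interface name, virtual_router_id, floating IP and the concrete numbers are illustrative assumptions. keepalived promotes the node with the highest priority, whereas Sentinel prefers the slave with the lowest slave-priority, so the two sets of values must produce the same ordering.

# /etc/keepalived/keepalived.conf on head node A (B and C would use priority 100 and 50, state BACKUP)
vrrp_instance DPM_HEAD {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 150
    advert_int 1
    virtual_ipaddress {
        192.168.1.100
    }
}

# /etc/redis.conf on head node A (B and C would use 20 and 30)
slave-priority 10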

Number of head nodes

In order to ensure the correct behaviour of the failover mechanism, the total number of head nodes in the cluster should be an odd number of three or greater. This is because both keepalived and Sentinel require a majority in the master election process. An even number of head nodes could potentially result in a tie, which would prevent the failover mechanism from reaching a consensus.

Chapter 5

Evaluation

This chapter presents an evaluation of our high availability DPM architecture. The key aspects of the architecture which we examined include its performance and resilience to node failure.

5.1 Durability

A critical aspect of the testbed’s functionality we want to examine is how resilient it is to node failure. This section describes some of the tests we performed to verify the behaviour of the failover mechanism.

5.1.1 Methodology

Since we cannot use the recommended keepalived on our testbed, we set up an HAProxy [21] load balancer in front of the head nodes to simulate the election of the primary head node. The proxy was configured to route all requests to the primary head node, and to only offload requests to the redundant nodes if the primary node cannot be contacted. Simple tests were designed to examine the head nodes' behaviour in the following scenarios:

• Failure of the primary head node
• Failure of a redundant head node
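For reference, the active/backup routing described above can be expressed along the following lines; the section names, port and head node addresses are assumptions based on our testbed, timeout settings are omitted, and TLS is passed straight through to the head nodes:

frontend dpm_https
    mode tcp
    bind *:443
    default_backend dpm_heads

backend dpm_heads
    mode tcp
    server head1 192.168.1.14:443 check
    server head2 192.168.1.9:443 check backup
    server head3 192.168.1.8:443 check backup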

We started the tests by sending read and write requests using Davix to the load balancer, which was configured to use our primary head node only. To simulate node failure, we simply turned off the VM in question. We could then find out whether the cluster was still functional by sending requests to the load balancer and observing the changes in the databases and on the disk nodes. Unfortunately, as our testbed is deployed on VMs in an environment not controlled by us, we were unable to simulate network partitions. Therefore, the behaviour of the cluster in the event of a network partition was not tested.

5.1.2 Findings

Once we had switched off the VM hosting the primary head node, we noticed there was a small window during which requests would fail. Upon further investigation, we discovered the issue was due to the timeout threshold being set too high in HAProxy, which delayed the switchover to a redundant head node. We believe this timeout threshold would have to be tuned in keepalived and Sentinel as well for an optimal result. Concerning operation correctness, once routed to the redundant head node, requests continued to be served by the cluster. The write requests returned successfully, and we were able to locate the files on the disk nodes, with the entries correctly recorded in the databases across the cluster. For read requests, we were still able to download from the instance after the switchover, and the downloaded files had the correct checksums. When we switched off a redundant head node, there was no observable change to operations apart from Galera reporting a decrease in cluster size. We then tested database cluster recovery by reintroducing the redundant node into the cluster. Once back on, the node rejoined the database cluster, and we were able to observe records of the new files we had created during its absence.

5.2 Performance

The high availability DPM architecture has added data redundancy and replication to the DPM system. As our modifications have introduced extra locking mechanisms and blocking operations, these changes are likely to have an impact on the performance of the DPM system. While a slight degradation in performance may be acceptable, a considerable drop in scalability would require a rethink of our approach. To address these concerns, we performed a series of tests on the testbed using both a single head node configuration and a multiple head node configuration to compare their performance and scalability, as described in the following sections.

5.2.1 Methodology

For the performance benchmarks, we used the grid-hammer [22] load testing framework developed at CERN for testing grid storage elements. The test is designed to evaluate the testbed on four different file operations over HTTP (stat uses the WebDAV extension):

• Write - upload a file to the DPM instance
• Stat - return the metadata of a file
• Read - read and check the contents of a file
• Delete - unlink a file from the instance

The test performs all four operations in sequence on 1000 small files and measures the time taken to complete each operation. To measure the scalability of the testbed, the test was repeated using 1, 4, 10 and 100 threads. Each test was repeated three times, and the average was taken for a more accurate measurement. For comparison, the tests were carried out on both the single head node instance and an instance with three head nodes. Both instances had the same performance tuning configurations.

5.2.2 Findings

While we expected the performance of the file operations on the multiple head nodes instance to be worse than on the single head node instance, it was surprising to observe a much more significant drop in performance for the read and delete operations. The results of all the tests on both instances are shown in Figure 5.1; enlarged versions of the plots can be found in Appendix B.

Figure 5.1: Plots of the average rate of operations against the number of threads: (a) write, (b) stat, (c) read, (d) delete

Figure 5.1a shows the average rate of write operations against the number of threads. It can be seen that although the high availability instance achieved similar performance to the single head node instance when only one request is received at a time, its performance starts to degrade as the number of concurrent requests increases. This, however, is expected because of the synchronous nature of the replication process. When the number of concurrent write requests on the primary node increases, so does the load on the replication channel between the databases. Since a transaction on the primary node has to be acknowledged by every database in the cluster before it can be committed, the rate of operation now depends on how quickly the databases can handle the replication. It is possible that in our testbed the bandwidth of the replication channel was simply not wide enough to handle so many concurrent replications, or that the databases did not have enough threads to process them quickly enough. Overall, the drop in write performance is expected, and it is not too severe.

The results for stat operations, shown in Figure 5.1b, show that the performance recorded on both instances is virtually identical. This is as expected, as the stat operation involves only a read from the database backend. When the system receives a stat request for a file, it simply looks up the entry in the local database and returns the metadata to the client, so no transaction or replication is involved.

Figure 5.1c shows the read performance observed on both instances. This result is unexpected, because a read operation involves no transaction that requires replicating to other databases, and therefore should not incur any performance penalty from synchronous replication. Although both instances performed comparably for sequential reads, it is not immediately clear why the concurrent read performance on the multiple head nodes instance scaled so poorly. We suspect the poor read scalability is due to two reasons. Firstly, for read requests DOME internally performs a database lookup on every parent directory preceding the file. This would have generated a large number of reads towards the database as the number of concurrent requests increased. It is possible that our Galera tuning configuration is not as efficient as the standalone database in the single node instance, and was unable to handle as many concurrent operations. Secondly, although a read request does not require updating database records, it is possible that Galera would still lock the record being read in every database in the cluster. This is a likely scenario because, in Galera's multi-master configuration, every database in the cluster can accept writes; in order to prevent potential conflicts where multiple databases write to the same record, Galera has to introduce a locking mechanism, which we believe is the cause of the performance drop we observed. The decrease in the performance difference between the two instances at 100 threads is likely due to a common bottleneck in other parts of our testbed.

Of the four operations tested, delete was evidently the worst impacted by our modifications.

The results of the operation, as shown in Figure 5.1d, indicate a severe decrease in the system's capability to handle concurrent delete requests. When sent ten requests concurrently, the clustered instance was only able to process the requests at less than half the rate of the single node instance. In order to understand why the cluster performed so poorly, we had to investigate how DOME processes a delete request. It appears that a number of select, update, and delete SQL operations are involved in a single delete request. To fulfil the request, DOME has to verify the existence of the file, which implies a read on the record itself as well as updating each of its parent entries. DOME then removes the record but also has to update multiple fields in the replica, pool, file system and space quota tables. In essence, the delete operation involves the largest number of database transactions of the four operations, and seems to incur a combination of the performance penalties we have seen in the read and write operations.

A note on the results It is worth remembering that our testbeds are hosted on VMs instead of physical machines, and as such, we have very little influence on how the VMs are mapped to hosts. The VMs could be hosted on the same machine, on different hosts in the same rack, or on different racks in the data centre. The bottom line is that we do not know, and this should be taken into account when evaluating our findings. Our results also should not be seen as representative of the capability of the DPM system, firstly because the DPM developers recommend using physical machines for deployment, and secondly because we have only performed minimal performance tuning on our testbed.

Chapter 6

Conclusions

In this work, we have carried out a preliminary study of the DPM storage management system regarding the factors limiting its availability. The study was motivated by the need to minimise the downtime of DPM-based grid storage elements, as well as by the recent refactoring effort in DPM development which has opened up an opportunity to introduce high availability functionality to the system. The main objective of this work was to explore the possibility of incorporating the latest changes in DPM into a new design which would allow the deployment of multiple redundant head nodes. In the study, we have documented our experience of manually installing and configuring a legacy-free DPM instance, including details of some of the unexpected issues and how we resolved them. By doing so, we established a baseline to which modifications could be applied once we had completed our new design. During our investigation, we identified several components that would prevent multiple head nodes from functioning as a coherent unit. The main challenges originate from the fact that the services on the head node were designed to run on a single host, and therefore only maintain data which are critical to the functionality of DPM in their private memory space. Our analysis found that these critical data include the records in the database local to the head node, and the task objects that are stored in DOME's in-memory priority queues. As the outcome of our investigation, we have proposed new designs for the limiting components based on an analysis of several potential options. These modifications include a new database cluster and a replicated cache layer. The novel high availability DPM architecture presented in this work combines the recommended modifications resulting from our investigation, and facilitates the deployment of multiple redundant head nodes as well as an automatic failover mechanism. We have shown that while it is possible to achieve high availability in DPM using our new architecture, it also comes at a cost to the system's ability to scale. Benchmarks run on our testbed have shown that while file upload and stat operations are relatively unaffected, the performance of more database query intensive operations such as read and delete was badly impacted by the strong consistency guarantee provided by the new architecture.

We believe this is because, in a distributed system, guaranteeing high availability and strong consistency implies that the system must synchronise state changes across every node in the cluster, thus creating new performance bottlenecks. Achieving an optimal balance between performance, availability, and consistency in DPM would require further study and testing. In conclusion, high availability in DPM is possible, but care must be taken in balancing performance and consistency. We believe that by identifying some of the availability-limiting factors in DPM and recommending potential solutions to them, this work has accomplished the first step in what is likely to be a long journey towards a production-ready HA-DPM solution.

Chapter 7

Future work

In this work we have begun the process of designing a high availability DPM storage solution, but there are still many challenges that must be solved before that goal can be realised. This chapter presents some of the work that would contribute towards achieving a production-ready HA-DPM solution. Because of time constraints, this project only investigated high availability solutions for the HTTP protocol; the compatibility of the GridFTP and XRootD frontends with the recommended modifications still needs to be assessed. However, we do not anticipate many issues, as the requests would ultimately be handled by DOME. There are also parts of our recommended architecture which remain untested, namely using keepalived as the failover mechanism and the behaviour of the cluster in the event of a network partition. One area worth exploring is that, with the introduction of the clustered cache layer, it may be possible for the DPM service to scale horizontally while providing a degree of availability. Lastly, in light of the benchmark results, we believe further study is also needed to determine whether strong consistency is absolutely required in HA-DPM, and if not, at what point the system would begin to exhibit incorrect behaviour. In order to obtain a more accurate measurement of the impact of the modifications, it would also be beneficial to deploy a testbed of the new architecture on physical machines.

Appendix A

Software versions and configurations

A.1 Core testbed components

• DMLite - 0.8.8
• DOME - 0.8.8
• MariaDB - 10.1.26
• Redis - 4.0.1
• httpd - 2.4.6
• HAProxy - 1.5.18
• BIND - 9.9.4

A.2 Test tools

• Davix - 0.6.6
• grid-hammer - no versioning available (source: https://gitlab.cern.ch/lcgdm/grid-hammer)

A.3 Example domehead.conf

glb.debug: 1
glb.role: head
glb.fcgi.listenport: 9001
glb.task.maxrunningtime: 3600
glb.task.purgetime: 3600
glb.put.minfreespace_mb: 1
glb.restclient.cli_certificate: /etc/grid-security/dpmmgr/dpmcert.pem
glb.restclient.cli_private_key: /etc/grid-security/dpmmgr/dpmkey.pem
glb.auth.authorizeDN[]: "CN=dpm-disk-1.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid", "CN=dpm-disk-2.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid"
head.maxcallouts: 4
head.maxcalloutspernode: 2
head.maxchecksums: 2
head.maxchecksumspernode: 2
head.filepuller.stathook: /usr/share/dmlite/filepull/externalstat_example.sh
head.filepuller.stathooktimeout: 60
head.db.host: localhost
head.db.user: dpmdbuser
head.db.password: pass
head.db.port: 0
head.db.poolsz: 128
head.dirspacereportdepth: 6

A.4 Example domedisk.conf

glb.debug: 1
glb.role: disk
glb.fcgi.listenport: 9002
glb.task.maxrunningtime: 3600
glb.task.purgetime: 3600
glb.restclient.cli_certificate: /etc/grid-security/dpmmgr/dpmcert.pem
glb.restclient.cli_private_key: /etc/grid-security/dpmmgr/dpmkey.pem
glb.auth.authorizeDN[]: "CN=dpm-head-1.novalocal,OU=dpmCA.novalocal,OU=DPM-Testbed,O=Grid"
disk.headnode.domeurl: https://dpm-head-1.novalocal/domehead
disk.filepuller.pullhook: /usr/share/dmlite/filepull/externalpull_example.sh

A.5 Example dmlite.conf

LoadPlugin plugin_config /usr/lib64/dmlite/plugin_config.so
LogLevel 1
Include /etc/dmlite.conf.d/*.conf

A.6 Example domeadapter.conf

LoadPlugin plugin_domeadapter_headcatalog /usr/lib64/dmlite/plugin_domeadapter.so
LoadPlugin plugin_domeadapter_io /usr/lib64/dmlite/plugin_domeadapter.so
LoadPlugin plugin_domeadapter_pools /usr/lib64/dmlite/plugin_domeadapter.so

DavixCAPath /etc/grid-security/certificates
DavixCertPath /etc/grid-security/dpmmgr/dpmcert.pem
DavixPrivateKeyPath /etc/grid-security/dpmmgr/dpmkey.pem

DomeHead https://dpm-head-1.novalocal/domehead

TokenPassword test
TokenId ip
TokenLife 1000

A.7 Example mysql.conf

LoadPlugin plugin_mysql_ns /usr/lib64/dmlite/plugin_mysql.so
LoadPlugin plugin_mysql_iopassthrough /usr/lib64/dmlite/plugin_mysql.so

MySqlDirectorySpaceReportDepth 6
MySqlHost localhost
MySqlUsername dpmdbuser
MySqlPassword Jthreemd
MySqlPort 0
NsDatabase cns_db
DpmDatabase dpm_db
NsPoolSize 256

MapFile /etc/lcgdm-mapfile

HostDNIsRoot yes
HostCertificate /etc/grid-security/dpmmgr/dpmcert.pem
AdminUsername /O=Grid/OU=DPM-Testbed/OU=dpmCA.novalocal/OU=DPM Testbed CA/CN=Eric Cheung

A.8 Example Galera cluster configuration

[galera]
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://192.168.1.16,192.168.1.8,192.168.1.9"
binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_cluster_name="dpm_db_cluster"
wsrep_node_address="192.168.1.9"
wsrep_node_name="mariadb02"
wsrep_sst_method=rsync
back_log=500
query_cache_size=256M
query_cache_limit=16M
innodb_buffer_pool_size=4096M
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=2
skip-innodb_doublewrite
innodb_support_xa=0
innodb_thread_concurrency=8
innodb_log_buffer_size = 8M
key_buffer_size = 16M
skip-external-locking
innodb_file_per_table=1
max_connections=2000
thread_cache_size=8
max_connect_errors=4294967295
bind-address=0.0.0.0

Appendix B

Plots

Figure B.1: Average rate for a write operation

Figure B.2: Average rate for a stat operation

Figure B.3: Average rate for a read operation

Figure B.4: Average rate for a delete operation

Bibliography

[1] A. Katal, M. Wazid, R. Goudar. 2013. Big data: Issues, challenges, tools and good practices. Contemporary Computing (IC3), 2013 Sixth International Conference on.
[2] D. Agrawal, S. Das, A. Abbadi. 2011. Big data and cloud computing: current state and future opportunities. Proceedings of the 14th International Conference on Extending Database Technology. pp 530-533.
[3] WLCG (2017). About. Available at: http://wlcg-public.web.cern.ch/about. (Accessed: 02 August 2017).
[4] Globus (no date). GridFTP. Available at: http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/. (Accessed: 02 August 2017).
[5] XRootD (no date). SLAC/CERN. Available at: http://xrootd.org/. (Accessed: 02 August 2017).
[6] DPM - Disk Pool Manager (no date). CERN. Available at: http://lcgdm.web.cern.ch/dpm. (Accessed: 02 August 2017).
[7] F. Furano. 2016. DPM status and directions. DPM workshop 2016.
[8] SRM (2008). GridPP. Available at: https://www.gridpp.ac.uk/wiki/SRM. (Accessed: 02 August 2017).
[9] Memcached (2016). Available at: https://memcached.org/. (Accessed: 02 August 2017).
[10] DMLite (no date). DMLite. Available at: http://lcgdm.web.cern.ch/dmlite. (Accessed: 02 August 2017).
[11] F. Furano. 2017. DOME - A rest-inspired engine for DPM. Available at: http://svnweb.cern.ch/world/wsvn/lcgdm/dmlite/trunk/doc/dome/dome.pdf. (Accessed: 02 August 2017).
[12] S. Gilbert, N. Lynch. 2012. Perspectives on the CAP Theorem. IEEE Computer Volume 45 Issue 2. pp 30-36.
[13] BIND - Versatile, Classic, Complete Name Server Software (2017). Internet Systems Consortium. Available at: https://www.isc.org/downloads/bind/. (Accessed: 05 August 2017).
[14] Keepalived (2017). Available at: http://www.keepalived.org/. (Accessed: 06 August 2017).
[15] M. Özsu, P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer. pp 502-505.
[16] Group Replication (2017). Oracle. Available at: https://dev.mysql.com/doc/refman/5.7/en/group-replication.html. (Accessed: 05 August 2017).
[17] Technology behind Galera Cluster (2017). Codership. Available at: http://galeracluster.com/products/technology/. (Accessed: 07 August 2017).
[18] Redis (2017). redislab. Available at: https://redis.io/. (Accessed: 07 August 2017).
[19] Redis Sentinel Documentation (2017). redislab. Available at: https://redis.io/topics/sentinel. (Accessed: 07 August 2017).
[20] A. Talha Kabakus, R. Kara. 2016. A performance evaluation of in-memory databases. Journal of King Saud University - Computer and Information Sciences. Available at: http://www.sciencedirect.com/science/article/pii/S1319157816300453. (Accessed: 11 August 2017).
[21] HAProxy - The Reliable, High Performance TCP/HTTP Load Balancer (2017). Available at: http://www.haproxy.org/. (Accessed: 11 August 2017).
[22] grid-hammer - A load testing tool for CERN storage systems (2017). G. Bitzes. Available at: https://gitlab.cern.ch/lcgdm/grid-hammer. (Accessed: 16 August 2017).