An investigation of a high availability DPM-based Grid Storage Element
Kwong Tat Cheung

August 17, 2017

MSc in High Performance Computing with Data Science
The University of Edinburgh
Year of Presentation: 2017

Abstract

As the data volume of scientific experiments continues to increase, there is a growing need for Grid Storage Elements to provide a reliable and robust storage solution. This work investigates the single point of failure in DPM's architecture and identifies the components that prevent the use of redundant head nodes to provide higher availability. This work also contributes a prototype of a novel high availability DPM architecture, designed using the findings of this investigation.

Contents

1 Introduction
  1.1 Big data in science
  1.2 Storage on the grid
  1.3 The problem
    1.3.1 Challenges in availability
    1.3.2 Limitations in DPM legacy components
  1.4 Aim
  1.5 Project scope
  1.6 Report structure
2 Background
  2.1 DPM and the Worldwide LHC Computing Grid
  2.2 DPM architecture
    2.2.1 DPM head node
    2.2.2 DPM disk node
  2.3 DPM evolution
    2.3.1 DMLite
    2.3.2 Disk Operations Manager Engine
  2.4 Trade-offs in distributed systems
    2.4.1 Implication of CAP Theorem on DPM
  2.5 Concluding Remarks
3 Setting up a legacy-free DPM testbed
  3.1 Infrastructure
  3.2 Initial testbed architecture
  3.3 Testbed specification
  3.4 Creating the VMs
  3.5 Setting up a certificate authority
    3.5.1 Create a CA
    3.5.2 Create the host certificates
    3.5.3 Create the user certificate
  3.6 Nameserver
  3.7 HTTP frontend
  3.8 DMLite adaptors
  3.9 Database and Memcached
  3.10 Creating a VO
  3.11 Establishing trust between the nodes
  3.12 Setting up the file systems and disk pool
  3.13 Verifying the testbed
  3.14 Problems encountered and lessons learned
4 Investigation
  4.1 Automating the failover mechanism
    4.1.1 Implementation
  4.2 Database
    4.2.1 Metadata and operation status
    4.2.2 Issues
    4.2.3 Analysis
    4.2.4 Options
    4.2.5 Recommendation
  4.3 DOME in-memory queues
    4.3.1 Issues
    4.3.2 Options
    4.3.3 Recommendation
  4.4 DOME metadata cache
    4.4.1 Issues
    4.4.2 Options
    4.4.3 Recommendation
  4.5 Recommended architecture for High Availability DPM
    4.5.1 Failover
    4.5.2 Important considerations
5 Evaluation
  5.1 Durability
    5.1.1 Methodology
    5.1.2 Findings
  5.2 Performance
    5.2.1 Methodology
    5.2.2 Findings
6 Conclusions
7 Future work
A Software versions and configurations
  A.1 Core testbed components
  A.2 Test tools
  A.3 Example domehead.conf
  A.4 Example domedisk.conf
  A.5 Example dmlite.conf
  A.6 Example domeadapter.conf
  A.7 Example mysql.conf
  A.8 Example Galera cluster configuration
B Plots

List of Tables

3.1 Network identifiers of VMs in testbed

List of Figures

2.1 Current DPM architecture
2.2 DMLite architecture
2.3 Simplified view of DOME in head node
2.4 Simplified view of DOME in disk node
3.1 Simplified view of architecture of initial testbed
4.1 Failover using keepalived
4.2 Synchronising records with Galera cluster
4.3 Remodeled work flow of the task queues using replicated Redis caches
4.4 Remodeled work flow of the metadata cache using replicated Redis caches
4.5 Recommended architecture for High Availability DPM
5.1 Plots of average rate of operations compared to number of threads
B.1 Average rate for a write operation
B.2 Average rate for a stat operation
B.3 Average rate for a read operation
B.4 Average rate for a delete operation

Acknowledgements

First and foremost, I would like to express my gratitude to Dr Nicholas Johnson for supervising and arranging the budget for this project. Without the guidance and motivation he has provided, the quality of this work would certainly have suffered. I would also like to thank Dr Fabrizio Furano from the DPM development team for putting up with the stream of emails I have bombarded him with, and for answering my queries on the inner workings of DPM.

Chapter 1
Introduction

1.1 Big data in science

Big data has become a well-known phenomenon in the age of social media. The vast amount of user-generated content has undeniably influenced research and advancement in modern distributed computing paradigms [1][2]. However, even before the advent of social media websites, researchers in several scientific fields already faced similar challenges in dealing with the massive amounts of data generated by experiments. One such field is high energy physics, including the Large Hadron Collider (LHC) experiments based at the European Organization for Nuclear Research (CERN). In 2016 alone, an estimated 50 petabytes of data were gathered by the LHC detectors post-filtering [3]. Since the financial resources required to host an infrastructure able to process, store, and analyse this data are far too great for any single organisation, the experiments turned to the grid computing approach.

Grid computing, which is mostly developed and used in academia, follows the same principle as its commercial counterpart, cloud computing: computing resources are provided to end users remotely and on demand. Similarly, the physical location of the sites which provide the resources, as well as the underlying infrastructure, is abstracted away from the users. From the end users' perspective, they simply submit their jobs to an appropriate job management system without any knowledge of where the jobs will run or where the data are physically stored. In grid computing, these computing resources are often distributed across multiple locations, where a site that provides data storage capacity is called a Storage Element, and one that provides computation capacity is called a Compute Element.

1.2 Storage on the grid

Grid storage elements have to support some unique requirements found in the grid environment. For example, the grid relies on the concept of Virtual Organisations (VOs) for resource allocation and accounting. A VO represents a group of users, not necessarily from the same organisation but usually involved in the same experiment, and manages their membership. Resources on the grid (e.g. the storage space provided by a site) are allocated to specific VOs rather than to individual users. Storage elements also have to support file transfer protocols that are not commonly used outside of the grid environment, such as GridFTP [4] and xrootd [5]. Various storage management systems were developed for grid storage elements to fulfil these requirements, and one such system is the Disk Pool Manager (DPM) [6].

DPM is a storage management system developed by CERN. It is currently the most widely deployed storage system on Tier 2 sites, providing the Worldwide LHC Computing Grid (WLCG) with around 73 petabytes of storage across 160 instances [7]. The main goals of DPM are to provide a straightforward, low-maintenance solution for creating a disk-based grid storage element, and to support remote file and metadata operations over multiple protocols commonly used in grid environments.
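To make this multi-protocol access model concrete, the short sketch below shows how an end user might query file metadata on a DPM storage element through its HTTP/WebDAV frontend using an X.509 proxy certificate. This is a minimal illustration only: the host name, VO path and proxy location are hypothetical placeholders and do not correspond to the testbed built later in this work, and the same file could equally be reached through GridFTP or xrootd.

    # Minimal sketch: metadata lookup over the HTTP/WebDAV frontend of a storage element.
    # The endpoint, VO path and proxy location below are illustrative placeholders.
    import requests

    SE_URL = "https://dpmhead.example.org/dpm/example.org/home/myvo/data/file.root"
    PROXY = "/tmp/x509up_u1000"  # proxy certificate created beforehand, e.g. with voms-proxy-init

    # A HEAD request returns basic metadata (existence, size) without transferring file data.
    # /etc/grid-security/certificates is the conventional grid CA trust store.
    response = requests.head(SE_URL,
                             cert=PROXY,
                             verify="/etc/grid-security/certificates")
    print(response.status_code, response.headers.get("Content-Length"))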
1.3 The problem

This section presents the main challenges for DPM, describes the specific limitations that motivate this work, and outlines the project's aim.

1.3.1 Challenges in availability

Due to limitations in the DPM architecture, the current deployment model supports only one metadata server and command node. This deployment model exposes a single point of failure in a DPM-based storage element. There are several scenarios in which it could affect the availability of a site:

• Hardware failure in the host
• Software/OS update that results in the host being offline
• Retirement or replacement of machines

If any of the scenarios listed above affects the command node, the entire storage element becomes inaccessible, which ultimately means expensive downtime for the site (an illustrative availability calculation is sketched after Section 1.4).

1.3.2 Limitations in DPM legacy components

Some components in DPM were first developed over 20 years ago. The tightly coupled nature of these components has limited the extensibility of the DPM system and makes it impractical to modify DPM into a multi-server system. As the grid has evolved, the number of users and the demand for storage have increased. New software practices and designs have also emerged that could better fulfil the requirements of a high-load storage element.

In light of this, the DPM development team has put a considerable amount of effort into modernising the system over the past few years, resulting in new components that bypass some limitations of the legacy stack. The extensibility of these new components has opened up an opportunity to modify the current deployment model, which this work aims to explore.

1.4 Aim

The aim of this work is to explore the possibility of increasing the availability of a DPM-based grid storage element by modifying its current architecture and components.
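To give a rough sense of what redundant head nodes could buy in the scenario described in Section 1.3.1, the back-of-the-envelope calculation below compares a single head node against a redundant pair. The 99.5% availability figure is an assumption chosen purely for illustration, not a measurement from this work, and the calculation treats failures of the two heads as independent.

    # Illustrative availability arithmetic; the 99.5% figure is an assumed example value.
    head_availability = 0.995                          # availability of one head node
    single_head = head_availability                    # the SE is down whenever its only head is down
    redundant_pair = 1 - (1 - head_availability) ** 2  # down only if both heads are down at once

    hours_per_year = 24 * 365
    for label, a in [("single head", single_head), ("redundant pair", redundant_pair)]:
        print(f"{label}: {a:.4%} available, ~{(1 - a) * hours_per_year:.1f} hours/year of downtime")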