Distributed Metadata Management for Parallel Filesystems

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Vilobh Meshram, B.Tech(Computer Science)

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Master’s Examination Committee:

Dr. D. K. Panda, Advisor
Dr. P. Sadayappan

Copyright by

Vilobh Meshram

2011

Abstract

Much of the research in storage systems has been focused on improving the scale and performance of data access for applications that read and write large amounts of file data. Parallel file systems do a good job of scaling large-file access bandwidth by striping or sharing I/O resources across many servers or disks. However, the same cannot be said about scaling file metadata operation rates.

Most existing parallel filesystems choose to concentrate all the metadata processing load on a single server. This centralized processing can guarantee correctness, but it severely hampers scalability. This downside is becoming more and more unacceptable, as metadata throughput is critical for large-scale applications. Distributing the metadata processing load is critical to improving metadata scalability when handling a huge number of client nodes. However, in such a distributed scenario, a solution to speed up metadata operations has to address two challenges simultaneously, namely scalability and reliability.

We propose two approaches to solve the challenges mentioned above for metadata management in parallel filesystems, with a focus on the reliability and scalability aspects. As demonstrated by experiments, our approach to distributed metadata management achieves significant improvements over native parallel filesystems by a large margin for all the major metadata operations. With 256 client processes, our approach to distributed metadata management outperforms Lustre and PVFS2 by factors of 1.9 and 23, respectively, for directory creation. With respect to the stat() operation on files, our approach is 1.3 and 3.0 times faster than Lustre and PVFS2, respectively.

This work is dedicated to my parents and my sister

Acknowledgments

I consider myself extremely fortunate to have met and worked with some remarkable people during my stay at Ohio State. While a brief note of thanks does not do justice to their impact on my life, I deeply appreciate their contributions.

I begin by thanking my advisor, Dr. Dhabaleswar K. Panda. His guidance and advice during the course of my Master's studies have shaped my career. I am thankful to Dr. P. Sadayappan for agreeing to serve on my Master's examination committee.

Special thanks to Xiangyong Ouyang for all the support and help. I would also like to thank Dr. Xavier Besseron for his insightful comments and discussions, which helped me to strengthen my thesis. I am especially grateful to Xiangyong, Xavier and Raghu, and I feel lucky to have collaborated closely with them. I would like to thank all my friends in the Network Based Computing Research Laboratory for their friendship and support.

Finally, I thank my family, especially my parents and my sister. Their love, affection, and faith have been a constant source of strength for me. None of this would have been possible without them.

Vita

April 18, 1986 ...... Born - Amravati, India

2007 ...... B.Tech., Computer Science, COEP, Pune University, Pune, India
2007-2009 ...... Software Development Engineer, Symantec R&D India
2010-2011 ...... Graduate Research Associate, The Ohio State University

Publications

Research Publications

Vilobh Meshram, Xavier Besseron, Xiangyong Ouyang, Raghunath Rajachandrasekar and Dhabaleswar K. Panda. Can a Decentralized Metadata Service Layer Benefit Parallel Filesystems? Accepted at the IASDS 2011 workshop, held in conjunction with Cluster 2011.

Vilobh Meshram, Xiangyong Ouyang and Dhabaleswar K. Panda. Minimizing Lookup RPCs in Lustre using Metadata Delegation at Client Side. OSU Technical Report OSU-CISRC-7/11-TR20, July 2011.

Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram and Dhabaleswar K. Panda. Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? To appear at the Resilience 2011 workshop, held in conjunction with Euro-Par 2011.

Fields of Study

Major Field: Computer Science and Engineering

Studies in High Performance Computing: Prof. D. K. Panda

Table of Contents


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction

1.1 Parallel Filesystems
1.2 Metadata Management in Parallel Filesystems
1.3 Distributed Coordination Service
1.4 Motivation of the Work
   1.4.1 Metadata Server Bottlenecks
   1.4.2 Consistency Management of Metadata
1.5 Problem Statement
1.6 Organization of Thesis

2. Related Work

2.1 Metadata Management approaches
2.2 Scalable filesystem directories

3. Delegating metadata at client side (DMCS)

3.1 RPC Processing in Lustre Filesystem
3.2 Existing Design
3.3 Design and challenges for delegating metadata at client side
   3.3.1 Design of communication module
   3.3.2 Design of DMCS approach
   3.3.3 Challenges
   3.3.4 Metadata revocation
   3.3.5 Distributed Lock management for DMCS approach
3.4 Performance Evaluation
   3.4.1 File Open IOPS: Varying Number of Client Processes
   3.4.2 File Open IOPS: Varying File Pool Size
   3.4.3 File Open IOPS: Varying File Path Depth
3.5 Summary

4. Design of a Decentralized Metadata Service Layer for Distributed Metadata Management

4.1 Detailed design of Distributed Union FileSystem (DUFS)
   4.1.1 Implementation Overview
   4.1.2 FUSE-based Filesystem Interface
4.2 ZooKeeper-based Metadata Management
   4.2.1 File Identifier
   4.2.2 Deterministic mapping function
   4.2.3 Back-end storage
4.3 Algorithm examples for Metadata operations
   4.3.1 Reliability concerns
4.4 Performance Evaluation
   4.4.1 Distributed coordination service throughput and memory usage experiments
   4.4.2 Scalability Experiments
   4.4.3 Experiments with varying number of distributed coordination service servers
   4.4.4 Experiment with different number of mounts combined using DUFS
   4.4.5 Experiments with different back-end parallel filesystems
4.5 Summary

5. Contributions and Future Work

5.1 Summary of Research Contributions and Future Work
   5.1.1 Delegating metadata at client side
   5.1.2 Design of a decentralized metadata service layer for distributed metadata management

Bibliography

List of Tables


1.1 LDLM and Oprofile Experiments
1.2 Transaction throughput with a fixed file pool size of 1,000 files
1.3 Transaction throughput with varying file pool size
1.4 Transaction throughput with a fixed file pool size and varying number of transactions
3.1 Metadata operation rates with different underlying storage

List of Figures


1.1 Basic Lustre Design
1.2 Zookeeper Design
1.3 Example of consistency issue with 2 clients and 2 MetaData servers
3.1 Design of DMCS approach
3.2 File open IOPS, Each Process Accesses 10,000 Files
3.3 File open IOPS, Using 16 Client Processes
3.4 Time to Finish open, Using 16 Processes Each Accessing 10,000 Files
4.1 DUFS mapping from the virtual path to the physical path using File Identifier (FID)
4.2 DUFS overview. A, B, C and D show the steps required to perform an open() operation.
4.3 Sample physical filename generated from a given FID
4.4 Algorithm for the mkdir() operation
4.5 Algorithm for the stat() operation
4.6 ZooKeeper throughput for basic operations by varying the number of ZooKeeper Servers
4.7 ZooKeeper memory usage and its comparison with DUFS and basic FUSE-based file system memory usage
4.8 Scalability experiments with 8 client nodes and varying number of client processes
4.9 Scalability experiments with 16 client nodes and varying number of client processes
4.10 Operation throughput by varying the number of ZooKeeper Servers
4.11 File operation throughput for different numbers of back-end storage
4.12 Operation throughput with respect to the number of clients for Lustre and PVFS2

Chapter 1: INTRODUCTION

High-performance computing (HPC) is an integral part of today's scientific, economic, social, and commercial fabric. We depend on HPC systems and applications for a wide range of activities such as climate modeling, drug research, weather forecasting, and energy exploration. HPC systems enable researchers and scientists to discover the origins of the universe, design automobiles and airplanes, predict weather patterns, model global trade, and develop life-saving drugs. Because of the nature of the problems that they are trying to solve, HPC applications are often data-intensive.

Scientific applications in astrophysics (CHIMERA and VULCAN2D), climate modeling (POP), combustion (S3D), fusion (GTC), visualization, astronomy, and other fields generate or consume large volumes of data. This data is on the order of terabytes and petabytes and is often shared by the entire scientific community. Today's computational requirements are increasing at a geometric rate and involve large quantities of data. While the computational power of microprocessors has kept pace with Moore's law as a result of increased chip densities, performance improvements in magnetic storage have not seen a corresponding increase. The result has been an increasing gap between the computational power and the I/O subsystem performance of current HPC systems. Hence, while processors keep getting faster, we do not see a corresponding improvement in application performance, because of the I/O bandwidth bottleneck.

Parallel file systems do a good job of improving the data throughput rate by striping or sharing I/O resources across many servers and disks. The same cannot be said about metadata operations. Every time a file is opened, saved, closed, searched, backed up or replicated, some portion of metadata is accessed. As a result, metadata operations fall in the critical path of a broad spectrum of applications. Studies [20,23] show that over 75% of all filesystem calls require access to file metadata. Therefore, efficient management of metadata is crucial for overall system performance.

Even though modern distributed file system architectures like Lustre [4], PVFS [10] and GFS [13] separate the management of metadata from the storage of the actual file data, the entire namespace is managed by a centralized metadata server. These architectures have proven to easily scale storage capacity and bandwidth. However, the management of metadata remains a bottleneck.

Recent trends in high-performance computing have also seen a shift toward distributed resource management. Scientific applications are increasingly accessing data stored in remote locations. This trend is a marked deviation from the earlier norm of co-locating an application and its data. In such a distributed environment, the management of metadata becomes even more difficult, as the important aspects of reliability, consistency and scalability need to be taken care of. As we saw above, in most parallel filesystems a single metadata server manages the entire namespace, so new approaches need to be designed for distributed metadata management. A few parallel filesystems have a design for better metadata management in order to overcome the single point of metadata bottleneck, but considering the complexity of distributed metadata management, the effort is still in progress.

Our research focuses on addressing these two problems. We have examined the existing paradigms and suggested better alternatives. In the first part, we focus on an approach for the Lustre filesystem to overcome the problem of a single point of bottleneck. In the second part, we design and evaluate our scheme for distributed metadata management in parallel filesystems, with the primary aim of improving the scalability of the filesystem while maintaining the reliability and consistency aspects.

1.1 Parallel Filesystems

Parallel filesystems are mostly used in high-performance computing environments, which deal with or generate massive amounts of data. Parallel filesystems usually separate the processing of metadata from data. Some parallel file systems, e.g., Lustre, have a separate metadata server to handle metadata operations, whereas others, e.g., PVFS, may keep the metadata and data at the same place. Let's consider the case of Lustre. Lustre is a POSIX-compliant, open-source distributed parallel filesystem. Due to its extremely scalable architecture, Lustre deployments are popular in scientific supercomputing, as well as in the oil and gas, manufacturing, rich media, and finance sectors. Lustre presents a POSIX interface to its clients with parallel access capabilities to the shared file objects. Lustre is an object-based filesystem composed of three components: a metadata server (MDS), object storage servers (OSSs), and clients. Figure 1.1 illustrates the Lustre architecture. Lustre uses block devices for file data and metadata storage, and each block device can be managed by only one Lustre service. The total data capacity of the Lustre filesystem is the sum of all individual OST capacities.

Lustre clients access and concurrently use data through the standard POSIX I/O system calls. The MDS provides metadata services. Correspondingly, an MDC (metadata client) is a client of those services. One MDS per filesystem manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions. An OSS (object storage server) exposes block devices and serves data. Correspondingly, an OSC (object storage client) is a client of those services. Each OSS manages one or more object storage targets (OSTs), and OSTs store file data objects.

Figure 1.1: Basic Lustre Design. The figure shows the clients, the LDAP server (configuration information, network connection details and security management), the Meta-Data Server (MDS) and the Object Storage Targets (OSTs), with clients performing directory operations, metadata access and locking through the MDS, and file I/O through the OSTs.

1.2 Metadata Management in Parallel Filesystems

Parallel filesystems like Lustre and the Google File System differ from classical distributed filesystems like NFS in that they separate the management of metadata from the actual file data. In a classical distributed filesystem like NFS, the server has to manage both data and metadata, which increases the load on the server and limits the performance and scalability of the filesystem. Parallel filesystems store the metadata on a separate server known as the metadata server (MDS).

Let's consider the example of the Lustre filesystem. In terms of on-disk storage of metadata, the parallel filesystem keeps additional information known as Extended Attributes (EA) apart from the normal file metadata attributes. The EA information, along with the normal file attributes, is handed over to the client in the case of a getattr or lookup operation. So when the client wants to perform actual I/O, the client knows which servers to talk to and how the file is striped among the servers. From the MDS point of view, each file is composed of multiple data objects striped on one or more OSTs. A file object's layout information is defined in the extended attributes (EA) of the inode. Essentially, the EA describes the mapping between each file object id and its corresponding OST. This information is also known as the striping EA.

So if the stripe size is 1MB and a file is striped across three objects, then [0,1M), [3M,4M), [6M,7M) are stored, say, as object x, which is on OST p; [1M,2M), [4M,5M), [7M,8M) are stored, say, as object y, which is on OST q; and [2M,3M), [5M,6M), [8M,9M) are stored, say, as object z, which is on OST r. Before reading the file, a client will query the MDS via the MDC and be informed that it should talk to OST p, OST q and OST r for this operation.
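This offset-to-object mapping is purely arithmetic. The following is a minimal illustrative sketch of the round-robin computation (our own code, not Lustre's; the function name and layout are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Maps a logical file offset to (object index, offset within object)
     * under RAID-0-style striping, matching the example above. */
    static void stripe_map(uint64_t offset, uint64_t stripe_size, int stripe_count,
                           int *obj_idx, uint64_t *obj_off)
    {
        uint64_t stripe_no = offset / stripe_size;       /* which stripe unit */
        *obj_idx = (int)(stripe_no % stripe_count);      /* object x, y or z  */
        *obj_off = (stripe_no / stripe_count) * stripe_size + offset % stripe_size;
    }

    int main(void)
    {
        const uint64_t MB = 1 << 20;
        int obj; uint64_t off;
        for (uint64_t o = 0; o < 9 * MB; o += MB) {
            stripe_map(o, MB, 3, &obj, &off);
            printf("file offset %lluM -> object %d, object offset %lluM\n",
                   (unsigned long long)(o / MB), obj, (unsigned long long)(off / MB));
        }
        return 0;
    }

For example, file offset 4M falls in stripe unit 4, which maps to object 1 (object y) at object offset 1M, consistent with the layout above.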

This information is structured in a so-called LSM (stripe metadata) structure, and the client-side LOV (logical object volume) interprets this information so the client can send requests to the OSTs. Here again, the client communicates with an OST through a client module interface known as the OSC. Depending on the context, OSC can also be used to refer to an OSS client by itself. All client/server communications in Lustre are coded as an RPC request and response. Within the Lustre source, this middle layer is known as Portal RPC, or ptl-rpc; it translates and interprets filesystem requests to and from the equivalent form of RPC request and response, and relies on the LNET module to finally put them onto the wire.

Most parallel file systems follow this kind of architecture, where a single metadata server manages the entire namespace. So in a scenario where the load on the MDS increases, the performance of the MDS degrades, which in turn slows down the entire file system. The MDS consists of many important components, such as the Lustre Distributed Lock Manager (LDLM), which occupies a major chunk of the processing time at the MDS. We performed experiments using the Oprofile tool to profile the Lustre code and understand the amount of time consumed by the LDLM module. The experiment was performed on 8 client nodes. Table 1.1 shows the amount of time consumed by the Lock Manager module at the MDS. In this kind of environment, where a single metadata server manages the entire namespace, most of the time is spent in the LDLM module and in communication. By communication we mean sending a blocking AST to the client holding a valid copy and then invalidating the local copy at that client. Also, allowing only a single Metadata Target (MDT) in a filesystem means that Lustre metadata operations can be processed only as quickly as a single server and its backing filesystem can manage. In order to improve the performance and scalability of parallel filesystems, effort has been made in the direction of distributed metadata management.

Table 1.1: LDLM and Oprofile Experiments

File                      Percentage
ldlm/ldlm_lockd.c         0.0044
ldlm/ldlm_inodebits.c     0.0044
ldlm/ldlm_internal.h      0.7104
ldlm/ldlm_lib.c           1.4341
ldlm/ldlm_lock.c          0.0132
ldlm/ldlm_pool.c          18.5729
ldlm/ldlm_request.c       1.8754
ldlm/ldlm_resource.c      5.3526

Clustered Metadata (CMD) is an approach proposed by the Lustre community for distributed metadata management. With CMD functionality, multiple MDSs jointly provide a single file system's namespace, storing the directory and file metadata on a set of MDTs. Clustered Metadata means there are multiple active MDS servers in one Lustre file system, so the MDS workload can be shared among several servers and the metadata performance can be significantly improved. Although CMD improves the performance and scalability of Lustre, it also brings some difficulties; the most complex ones are recovery, consistency and reliability. In CMD, one metadata operation may need to update several different MDSs. To maintain the consistency of the filesystem, the update must be atomic: if the update on one MDS fails, all other updates must be rolled back to their original states. To handle this, CMD uses a global lock. But a global lock slows down the overall throughput of the filesystem.
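The atomicity requirement and the cost of the global lock can be made concrete with a small sketch. The following is illustrative pseudocode of an all-or-nothing multi-MDS update with rollback; the helper names mds_apply, mds_undo and the lock functions are hypothetical stand-ins, not Lustre APIs:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for real MDS update/undo RPCs. */
    static bool mds_apply(int mds, const char *op)
    { printf("apply '%s' on MDS%d\n", op, mds); return mds != 3; /* pretend MDS3 fails */ }
    static void mds_undo(int mds, const char *op)
    { printf("undo  '%s' on MDS%d\n", op, mds); }
    static void global_lock(void)   {}
    static void global_unlock(void) {}

    /* All-or-nothing update across several MDSs, as CMD requires: apply the
     * operation under a global lock; on any failure, roll back the servers
     * already updated. The global lock serializes conflicting operations,
     * which is precisely what limits overall filesystem throughput. */
    static bool cmd_atomic_update(const int *mds_ids, size_t n, const char *op)
    {
        bool ok = true;
        size_t done = 0;

        global_lock();
        for (; done < n; done++)
            if (!mds_apply(mds_ids[done], op)) { ok = false; break; }
        if (!ok)
            while (done-- > 0)              /* undo in reverse order */
                mds_undo(mds_ids[done], op);
        global_unlock();
        return ok;
    }

    int main(void)
    {
        int servers[] = { 1, 2, 3 };
        bool ok = cmd_atomic_update(servers, 3, "mkdir /a/b");
        printf("operation %s\n", ok ? "committed" : "rolled back");
        return 0;
    }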

1.3 Distributed Coordination Service

Google's Chubby [9] is a distributed lock service which has gained wide adoption within Google's data centers. The Chubby lock service is intended to provide coarse-grained locking as well as reliable storage for a loosely-coupled distributed system. The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals include reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity are considered secondary. Chubby's client interface is similar to that of a simple file system that performs whole-file reads and writes, augmented with advisory locks and with notification of various events such as file modification. Chubby helps developers to deal with coarse-grained synchronization within their systems, and in particular with the problem of electing a leader from among a set of otherwise equivalent servers. For example, the Google File System [13] uses a Chubby lock to appoint a GFS master server, and Bigtable [11] uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of metadata; in effect they use Chubby as the root of their distributed data structures. The primary purpose of storing the root in Chubby is improved reliability and consistency: even in the event of a node failure, we are still able to view the contents of the directory due to the reliability provided by Chubby.

Apache ZooKeeper, not surprisingly, is a close clone of Chubby, designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure. ZooKeeper [14] is a distributed, open-source coordination service for distributed applications. It exposes a simple set of interfaces that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance and naming. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of special nodes known as znodes. Znodes are not meant to store bulk file data; they store small amounts of coordination and configuration information. The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. The strict ordering means that sophisticated synchronization primitives can be implemented at the client. ZooKeeper is replicated over a set of hosts. ZooKeeper performs better in a read-intensive workload than in a write/update-intensive workload [14].
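As a concrete illustration of the znode interface, the following minimal sketch uses the ZooKeeper C client to store and read back a small piece of metadata (the znode path, its value and the connection string are our own examples, purely for illustration):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <zookeeper/zookeeper.h>

    static void watcher(zhandle_t *zh, int type, int state,
                        const char *path, void *ctx) { /* connection events */ }

    int main(void)
    {
        /* Connect to a ZooKeeper ensemble (address is an example). */
        zhandle_t *zh = zookeeper_init("localhost:2181", watcher, 30000, 0, 0, 0);
        if (!zh) return 1;
        sleep(2);  /* crude: wait for the session; real code watches for ZOO_CONNECTED_STATE */

        /* Store a small piece of metadata in a znode; the hierarchical znode
         * namespace looks like a filesystem tree. */
        char created[256];
        if (zoo_create(zh, "/dufs-meta", "fid=0001", 8,
                       &ZOO_OPEN_ACL_UNSAFE, 0, created, sizeof(created)) == ZOK)
            printf("created znode %s\n", created);

        /* Read it back; the Stat carries the znode version used for
         * conditional updates. */
        char buf[64]; int len = sizeof(buf);
        struct Stat stat;
        if (zoo_get(zh, "/dufs-meta", 0, buf, &len, &stat) == ZOK)
            printf("value: %.*s (version %d)\n", len, buf, stat.version);

        zookeeper_close(zh);
        return 0;
    }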

Figure 1.2: Zookeeper Design

1.4 Motivation of the Work

Parallel file systems can easily scale bandwidth and improve performance by operating on data in parallel, using strategies such as striping, sharing resources, etc. However, most parallel file systems do not provide the ability to scale and parallelize metadata operations, as that is inherently more complex than scaling the performance of data operations [6]. PVFS provides some level of parallelism through distributed metadata servers that manage different ranges of metadata. The Lustre community has also proposed the idea of a Clustered Metadata Server (CMD) to minimize the load on a single metadata server, wherein multiple metadata servers share the metadata processing workload.

1.4.1 Metadata Server Bottlenecks

The MDS is currently restricted to a single node, with a fail-over MDS that becomes operational if the primary server becomes nonfunctional. Only one MDS is ever operational at a given time. This limitation poses a potential bottleneck as the number of clients and/or files increases. IOZone [2] is used to measure the sequential file I/O throughput, and Postmark [5] is used to measure the scalability of the MDS performance. Since MDS performance is the primary concern of this research, we discuss the Postmark experiment in more detail. Postmark is a file system benchmark that performs a lot of metadata-intensive operations to measure MDS performance. Postmark first creates a pool of small files (1KB to 10KB), and then starts many sequential transactions on the file pool. Each transaction performs two operations to either read/append a file or create/delete a file. Each of these operations happens with the same probability. The transaction throughput is measured to approximate workloads on an Internet server.

Table 1.2 gives the measured transaction throughput with a fixed file pool size of 1,000 files and different numbers of transactions on this pool. The transaction throughput remains relatively constant as the number of transactions varies. Since the cost for the MDS to perform an operation does not change at a fixed file count, this result is expected. Table 1.3, on the other hand, changes the file pool size and measures the corresponding transaction throughput. By comparing the entries in Table 1.3 with their counterparts in Table 1.2, it becomes clear that a larger file pool results in a lower transaction throughput. We also performed experiments varying the number of transactions while keeping the number of files in the file pool constant. Table 1.4 shows the details: for a constant file pool size and a varying number of transactions, we do not see a significant change in the transaction throughput. The MDS caches the most recently accessed metadata of files (the inode of a file). A client file operation requires the metadata information about that file to be returned by the MDS. With a larger number of files in the pool, a client request is less likely to be serviced from the MDS cache. A cache miss results in the MDS looking up its disk storage to load the inode of the requested file, which results in the lower transaction throughput in Table 1.3.

Table 1.2: Transaction throughput with a fixed file pool size of 1,000 files

Number of transactions    Transactions per second
1,000                     333
5,000                     313
10,000                    325
20,000                    321

Table 1.3: Transaction throughput with varying file pool size

Number of files in pool    Number of transactions    Transactions per second
1,000                      1,000                     333
5,000                      5,000                     116
10,000                     10,000                    94
20,000                     20,000                    79

Table 1.4: Transaction throughput with a fixed file pool size of 5,000 files and varying number of transactions

Number of files in pool    Number of transactions    Transactions per second
5,000                      1,000                     333
5,000                      5,000                     316
5,000                      10,000                    318
5,000                      20,000                    313
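The Postmark-style workload used above is easy to picture in code. The following is a simplified sketch of such a transaction loop (not the actual Postmark source; file names and parameters are our own, and a writable "pool" directory is assumed to exist):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Build a pool of small files, then run transactions that read/append or
     * delete/re-create a random file with equal probability, and report
     * transactions per second. */
    #define POOL   1000
    #define NTRANS 10000

    static void fname(char *buf, size_t n, int i)
    {
        snprintf(buf, n, "pool/f%05d", i);
    }

    int main(void)
    {
        char path[64];
        srand(42);

        for (int i = 0; i < POOL; i++) {        /* create the file pool */
            fname(path, sizeof(path), i);
            FILE *f = fopen(path, "w");
            if (f) { fputs("x", f); fclose(f); }
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int t = 0; t < NTRANS; t++) {
            fname(path, sizeof(path), rand() % POOL);
            if (rand() % 2) {                   /* read or append */
                FILE *f = fopen(path, rand() % 2 ? "r" : "a");
                if (f) fclose(f);
            } else {                            /* delete, then re-create */
                remove(path);
                FILE *f = fopen(path, "w");
                if (f) fclose(f);
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f transactions/sec\n", NTRANS / secs);
        return 0;
    }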

1.4.2 Consistency Management of Metadata

The majority of distributed filesystems use a single metadata server. However, this is a bottleneck that limits the operation throughput. Managing multiple metadata servers brings many difficulties: maintaining consistency between two copies of the same directory hierarchy is not straightforward. We illustrate such a difficulty in Figure 1.3.

We have two metadata servers (MDS) and we consider two clients that perform an operation on the same directory at the same time. Client 1 creates the directory d1 and client 2 renames the directory d1 to d2. As shown in Figure 1.3a, each client performs its operation in the following order: first on MDS1, then on MDS2. From the MDS point of view, there is no guarantee on the execution order of the requests, since they are coming from different clients.

Figure 1.3: Example of consistency issue with 2 clients and 2 MetaData servers.
(a) On the client side: client 1 performs 'mkdir d1' on MDS1 and then on MDS2, while client 2 performs 'mv d1 d2' on MDS1 and then on MDS2.
(b) On the MetaData server side: MDS1 executes 'mkdir d1' from client 1 before 'mv d1 d2' from client 2, resulting in d2; MDS2 executes 'mv d1 d2' from client 2 before 'mkdir d1' from client 1, resulting in d1.

As shown in Figure 1.3b, the requests can be executed in a different order on each metadata server while still respecting the ordering that each client demands. In this case, the resulting states of the two metadata servers are not consistent.

This small example highlights that distributed algorithms are required to maintain consistency between multiple metadata servers. Each client operation must appear to be atomic and must be applied in the same order on all the metadata servers. For this reason, we decided to use a distributed coordination service like ZooKeeper in the proposed metadata service layer. Such a coordination service implements the required distributed algorithms in a reliable manner.
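One way a coordination service provides this global order is through sequential znodes: the ensemble assigns every created znode a strictly increasing sequence number, so all observers can apply logged operations in exactly the same order. A minimal sketch using the ZooKeeper C client follows (the /oplog path and connection string are our own illustrative choices, not part of any real system):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <zookeeper/zookeeper.h>

    static void watcher(zhandle_t *zh, int type, int state,
                        const char *path, void *ctx) {}

    int main(void)
    {
        zhandle_t *zh = zookeeper_init("localhost:2181", watcher, 30000, 0, 0, 0);
        if (!zh) return 1;
        sleep(2);  /* crude: wait for the session; real code watches for ZOO_CONNECTED_STATE */

        /* Parent node for the operation log (it is fine if it already exists). */
        char buf[128];
        zoo_create(zh, "/oplog", "", 0, &ZOO_OPEN_ACL_UNSAFE, 0, buf, sizeof(buf));

        /* Each metadata operation is logged as a sequential znode; ZooKeeper
         * appends a 10-digit, strictly increasing counter to the name. */
        const char *ops[] = { "mkdir d1", "mv d1 d2" };
        for (int i = 0; i < 2; i++) {
            if (zoo_create(zh, "/oplog/op-", ops[i], (int)strlen(ops[i]),
                           &ZOO_OPEN_ACL_UNSAFE, ZOO_SEQUENCE,
                           buf, sizeof(buf)) == ZOK)
                printf("'%s' ordered as %s\n", ops[i], buf);
        }
        zookeeper_close(zh);
        return 0;
    }

Whichever client issues 'mkdir d1' first receives the lower sequence number on every server, removing the ambiguity shown in Figure 1.3.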

1.5 Problem Statement

The amount of data generated and consumed by high-performance computing applications is increasing exponentially. Current I/O paradigms and file system designs are often overwhelmed by this deluge of data. Parallel file systems improve the I/O throughput to a certain extent by incorporating features such as resource sharing and data striping. Distributed filesystems often dedicate a subset of the servers to metadata management. File systems such as NFS [17], AFS [15], and Lustre [4] use a single metadata server to manage a globally shared file system namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure.

In this thesis, we study and critique the current metadata management techniques in parallel file systems, taking the Lustre file system as our use case. We propose two new designs for metadata management in parallel file systems. In the first part, we present a design where we delegate metadata at the client side to solve the problem of a single metadata server (MDS) becoming a bottleneck while managing the entire namespace. We aim at minimizing the memory pressure at the MDS by delegating some of the metadata to clients so as to improve the scalability of Lustre. In the second part, we design a decentralized metadata service layer and evaluate its benefits in a parallel filesystem environment. The decentralized metadata service layer takes care of distributed metadata management with the primary aim of improving the scalability of the filesystem while maintaining the reliability and consistency aspects.

Specifically, our research attempts to answer the following questions:

1. What are the challenges and problems associated with a single server managing the entire namespace for a parallel file system?

2. How can we solve the problem of minimizing the load on a single MDS by distributing the metadata at the client side?

3. What are the challenges and problems associated with distributed metadata management?

4. Can a distributed coordination service be incorporated into parallel filesystems for distributed metadata management so as to improve the reliability and consistency aspects?

5. How will a decentralized metadata service layer perform with respect to various metadata operations as compared to the basic variants of parallel filesystems such as Lustre [4] and PVFS [10]?

6. Will a decentralized metadata service layer designed for distributed metadata management do a good job of improving the scalability of a parallel file system? Will it help in maintaining the consistency and reliability of the file system?

1.6 Organization of Thesis

The rest of the thesis is organized as follows. Chapter 2 presents an overview of the work in the area of parallel file systems, with a focus on metadata management. Chapter 3 proposes a distributed metadata management technique that delegates metadata at the client side. In Chapter 4 we explore the feasibility of using a distributed coordination service for distributed metadata management. We conclude our work and present future research directions in Chapter 5.

Chapter 2: RELATED WORK

In this chapter, we discuss some of the current literature related to metadata management in high-performance computing environments. We highlight the drawbacks of current metadata management paradigms in parallel filesystems and suggest better designs and algorithms for metadata management in parallel filesystems.

2.1 Metadata Management approaches

File system metadata management has long been an active area of research [15]. With the advent of commodity clusters and parallel file systems [4], managing metadata efficiently and in a scalable manner offers significant challenges. Distributed file systems often dedicate a subset of the servers to metadata management. Mapping the semantics of data and metadata across different, non-overlapping servers allows file systems to scale in terms of I/O performance and storage capacity.

File systems such as NFS [17], AFS [15], Lustre [4], and GFS [13] use a single metadata server to manage a globally shared file system namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure. File systems like NFS [17], [21] and AFS [15] may also partition their namespace statically among multiple servers, so most of the major metadata operations are centralized. pNFS [12] allows for distributed data but retains the concept of centralized metadata. Other parallel file systems like GPFS [22], Intermezzo [7] and Lustre [4] use directory locks for file creation, with the help of distributed lock management (DLM) for better performance. Lustre uses a single metadata server to manage the entire namespace; its distributed lock management module handles locks between clients and servers as well as local locks between the nodes. The Lustre community has also acknowledged that a single metadata server is a bottleneck in HPC environments, and so came up with the concept of the Lustre Clustered Metadata Server (CMD). CMD is still a prototype, with no implementation to date; the original design for CMD was proposed in 2008. In CMD, files are identified by a global FID and are assigned to a metadata server; once we know the FID, we can deal directly with that server. Getting this FID still requires a centralized/master metadata server, and this information is not redundant, so CMD still has a bottleneck at the master node, and its reliability and availability depend heavily on the master node. To mitigate the problems associated with a central metadata server, AFS [15] and NFS [17] employ static directory subtree partitioning [24] to partition the namespace across multiple metadata servers. Each server is delegated the responsibility of managing the metadata associated with a subtree. Hashing [8] is another technique used to partition the file system namespace: a hash of the file name assigns metadata to the corresponding MDS. Hashing diminishes the problem of hot spots that is often experienced with directory subtree partitioning. The Lazy Hybrid metadata management scheme [8,23] combines hierarchical directory management and hashing with lazy updates. Zhu et al. proposed using Hierarchical Bloom Filter Arrays [25] to map file names to the corresponding metadata servers. They used two levels of Bloom Filter Arrays, with differing degrees of accuracy and memory overhead, to distribute the metadata management responsibilities across multiple servers. Ananth et al. explored multiple algorithms for creating files on a distributed metadata file system for scalable metadata performance.

In the past, in order to get more metadata mutation throughput, efforts were aimed at mounting more independent file systems into a larger aggregate, but each directory or directory subtree is still managed by one metadata server. Some systems use clustered metadata servers in pairs for fail-over, but not for increased throughput. Some systems allow any server to act as a proxy and forward requests to the appropriate server, but this also does not increase metadata mutation throughput in a directory [3].

Symmetric shared-disk file systems that support concurrent updates to the same directory use complex distributed locking and cache consistency semantics, both of which have significant bottlenecks for concurrent-create workloads, especially with many clients working in one directory. Moreover, file systems that support client caching of directory entries for faster read-only workloads generally disable client caching during concurrent-update workloads to avoid excessive consistency overhead.

A recent trend among distributed file systems is to use the concept of objects to store data and metadata. CRUSH [23] is a data distribution algorithm that maps object replicas across a heterogeneous storage system. It uses a pseudo-random function to map data objects to storage devices. Lustre, PanFS and Ceph [23] use various non-standard object interfaces requiring the use of dedicated I/O and metadata servers. Instead, our work breaks away from the dedicated server paradigm and redesigns parallel file systems to use standards-compliant OSDs for data and metadata storage.

There has also been work in the area of combining multiple partitions into a virtual mount point. UnionFS (the official union filesystem in the kernel mainline) [18] has a lot of options, but it does not support load balancing between branches. Most of the file systems which combine multiple partitions into a virtual mount point work on a single node to combine local partitions or directories. Also, some union file systems cannot extract parallelism: their default behavior is to use the first partition until it reaches a threshold (based on the free space). Such a filesystem cannot attain higher throughput even after combining multiple mount points, and is restricted by the throughput of the first mounted partition.

2.2 Scalable filesystem directories

GPFS is a shared-disk file system that uses a distributed implementation of Fagin's extendible hashing for its directories. Fagin's extendible hashing dynamically doubles the size of the hash table, pointing pairs of links to the original buckets and expanding only the overflowing bucket (by restricting implementations to a specific family of hash functions). It has a two-level hierarchy: buckets (to store the directory entries) and a table of pointers (to the buckets). GPFS represents each bucket as a disk block, and the pointer table as the block pointers in the directory's i-node. When the directory grows in size, GPFS allocates new blocks, moves some of the directory entries from the overflowing block into the new block, and updates the block pointers in the i-node. GPFS employs its client cache consistency and distributed locking mechanism to enable concurrent accesses to a shared directory. Concurrent readers can cache the directory blocks using shared reader locks, which enables high performance for read-intensive workloads. Concurrent writers, however, need to acquire write locks from the lock manager before updating the directory blocks stored on the shared disk storage. When releasing (or acquiring) locks, GPFS versions before 3.2.1 force the directory block to be flushed to disk (or read back from disk), inducing high I/O overhead. Newer releases of GPFS have modified the cache consistency protocol to send directory insert requests directly to the current lock holder, instead of getting the block through the shared disk subsystem [22]. Still, GPFS continues to synchronously write the directory's i-node (i.e., the mapping state) and invalidate client caches to provide strong consistency guarantees. Lustre proposed a clustered metadata [1] service which splits a directory, using a hash of the directory entries, only once over all available metadata servers when it exceeds a threshold size. The effectiveness of this "split once and for all" scheme depends on the eventual directory size and does not respond to dynamic increases in the number of servers. Ceph is another object-based cluster file system that uses dynamic subtree partitioning of the namespace and hashes individual directories when they get too big or experience too many accesses.

There has been some work in the area of designing distributed indexing schemes for metadata management. GIGA+ [16] examines the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. GIGA+ builds directories with millions to billions of files with a high degree of concurrency. Compared to GPFS, GIGA+ allows the mapping state to be stale at the client and never shared between servers, thus seeking even more scalability. Compared to Lustre and Ceph, GIGA+ splits a directory incrementally as a function of size, i.e., a small directory may be distributed over fewer servers than a larger one. Furthermore, GIGA+ facilitates dynamic server addition, achieving balanced server load with minimal migration. This work is interesting but is more relevant in workloads where the directories have a huge fan-out factor, or where the application creates millions to billions of files in a single directory. In GIGA+, every server keeps only a local view of the partitions it is managing and no shared state is maintained, hence there are no synchronization and consistency bottlenecks. But in case the server or the partition goes down, or the root-level directory gets corrupted, the files will no longer be accessible.

Chapter 3: DELEGATING METADATA AT CLIENT SIDE (DMCS)

In this chapter we focus on the problem caused by a central coordinator managing the entire namespace. We propose our design, delegating metadata at client side, to handle the problem described in Section 1.4.1.

Before we delve into the design for delegating metadata at the client side, we first take a look at Remote Procedure Call (RPC) processing in the Lustre filesystem.

3.1 RPC Processing in Lustre Filesystem

When we consider RPC processing in Lustre, we also need to discuss how lock processing works in Lustre [3,5,7,18] and how our modifications can help minimize the number of LOOKUP RPCs. Let's consider an example. Assume client C1 wants to open the file /tmp/lustre/d1/d2/foo.txt to read, where /tmp/lustre is our mount point. During the VFS path lookup, the Lustre-specific lookup routine will be invoked. The first RPC request is a lock enqueue with lookup intent, sent to the MDS for a lock on d1. The second RPC request is also a lock enqueue with lookup intent, sent to the MDS asking for an inodebits lock on d2. The lock returned is an inodebits lock, and its resources would be represented by the fids of d1 and d2.

The subtle point to note is that when we request a lock, we generally need a resource id for the lock we are requesting. However, in this case, since we do not know the resource id for d1, we actually request a lock on its parent /, not on d1 itself. In the intent, we specify it as a lookup intent, and the name of the lookup is d1. Then, when the lock is returned, the lock is for d1. This lock is (or can be) different from what the client requested, and the client notices this difference and replaces the old lock requested with the new one returned. The third RPC request is a lock enqueue with open intent, but it is not asking for a lock on foo.txt. That is, you can open and read a file without a lock from the MDS, since the content is provided by the Object Storage Target (OST). The OSS/OST also has an LDLM component, and in order to perform I/O on the OSS/OST, we request locks from an OST. In other words, what happens at open is that we send a lock request, which means we do ask for a lock from the LDLM server. But in the intent data itself, we might (or might not) set a special flag if we are actually interested in receiving the lock back. The intent handler then decides (based on this flag) whether or not to return the lock. If foo.txt existed previously, then its fid, inode content (as in owner, group, mode, ctime, atime, mtime, nlink, etc.) and striping information are returned. If client C1 opens the file with the O_CREAT flag and the file does not exist, the third RPC request will be sent with open and create intent, but there will still be no lock requests. Now on the MDS side, to create a file foo.txt under d2, the MDS will request through the LDLM another EX lock on the parent directory. Note that this lock request conflicts with the previous CR lock on d2. Under normal circumstances, a fourth RPC request (a blocking AST) will go to client C1, or anyone else who may hold the conflicting locks, informing the client that someone is requesting a conflicting lock and requesting a lock cancellation. The MDS waits until it gets a cancel RPC from the client. Only then does the MDS get the EX lock it was asking for earlier and can proceed. If client C1 opens the file with the LOV_DELAY flag, the MDS creates the file as usual, but there is no striping and no objects are allocated. The user will issue an ioctl call and set the stripe information, and then the MDS will fill in the EA structure.

3.2 Existing Design

In this section we explain the existing approach followed by Lustre for metadata management.

1. When client 1 tries to open a file, it sends a LOOKUP RPC to the MDS.

2. The processing is done at the MDS side, where the Lock Manager grants the lock for the resource requested by the client. A second RPC is sent from the client to the MDS with the intent to create or open the file.

3. At the end of step 2, client 1 has the lock, the extended attribute (EA) information and the other metadata details which it needs to open the file successfully.

4. Once the client gets the EA information and the lock handle, it can proceed with the I/O operation.

5. The MDS keeps track of the allocation by making use of queues. When multiple clients try to access the same file, the new client waits in the waiting queue until the original client, the current owner of the lock, releases it; the MDS then hands the lock over to the new client. Say client 2 wants to access the same file which was earlier opened by client 1: client 2 will be placed in the waiting queue, and the MDS will send a blocking AST to client 1 to revoke the granted lock. Client 1, on receiving the blocking AST, releases the lock. In a scenario where client 1 is down or something goes wrong, the MDS waits for a ping timeout of 30 seconds, after which it revokes the lock. Once the lock is revoked, the MDS grants a lock handle and the EA for the file to client 2, and client 2 can proceed with I/O once it has the lock handle and EA information.

3.3 Design and challenges for delegating metadata at client side

Before moving ahead with the actual design of our approach, we discuss how Lustre Networking works and the communication module that we developed to do remote memory copy operations.

3.3.1 Design of communication module

We have designed a communication module for data movement. This communication module bypasses the normal Lustre Networking stack protocols and helps to do remote memory data movement operations. We use the LNET API, which originated from Sandia Portals, to design the communication module. In our design, we use the put and get APIs to do remote memory copies. The remote copy can be used by clients to copy the metadata information from the client to whom the metadata has been delegated by the MDS. LNET identifies its peers using an LNET process id, which consists of a nid and a pid. The nid identifies the node, and the pid identifies the process on the node. For example, in the case of the socket Lustre Network Driver (LND) (and for all currently existing LNET LNDs), there is only one instance of LNET running in the kernel space; the process id therefore uses a reserved ID (12345) to identify itself. Portal RPC is a client of the LNET layer and takes care of the RPC processing logic. A portal is composed of a list of match entries (ME). Each ME can be associated with a buffer, which is described by a memory descriptor (MD). The ME itself defines match bits and ignore bits, which are 64-bit identifiers used to decide whether an incoming message can use the associated buffer space.
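The match-bits selection rule is simple enough to demonstrate in a few lines. The following is a toy user-space simulation of Portals-style matching (it is not the LNET API; the struct and names are ours): an incoming message is accepted by a match entry if its bits agree with the entry's match bits everywhere outside the entry's ignore mask.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint64_t match_bits;
        uint64_t ignore_bits;
        const char *buffer;   /* stand-in for the attached memory descriptor */
    } match_entry;

    static bool me_matches(const match_entry *me, uint64_t incoming)
    {
        return ((incoming ^ me->match_bits) & ~me->ignore_bits) == 0;
    }

    int main(void)
    {
        match_entry portal[] = {
            { 0x1000, 0x0000, "bulk buffer"  },  /* requires an exact match  */
            { 0x2000, 0x00FF, "reply buffer" },  /* low 8 bits do not matter */
        };
        uint64_t incoming = 0x20A5;              /* matches the second entry */

        for (int i = 0; i < 2; i++)
            if (me_matches(&portal[i], incoming))
                printf("message 0x%llx -> %s\n",
                       (unsigned long long)incoming, portal[i].buffer);
        return 0;
    }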

Consider an example to illustrate the point. Say a client wants to read ten blocks of data from the server. It first sends an RPC request to the server telling it that it wants to read ten blocks and that it is prepared for the bulk transfer (meaning the bulk buffer is ready). Then the server initiates the bulk transfer. When the server has completed the transfer, it notifies the client by sending a reply. Looking at this data flow, it is clear that the client needs to prepare two buffers: one associated with the bulk portal for the bulk RPC, and the other associated with the reply portal.

3.3.2 Design of DMCS approach

In this section we explain the design details of the client-side metadata delegation approach, which is illustrated in Figure 3.1.

Figure 3.1: Design of DMCS approach

1. When client 1 (C1) tries to open a file, it sends a LOOKUP RPC to the MDS.

2. The processing is done at the MDS side, where the Lock Manager grants the lock for the resource requested by the client, and a second RPC is sent from the client to the MDS with the intent to create or open the file. At the end of step 2, C1 has the lock handle, the EA information and the other metadata details. Conceptually, steps 1 and 2 are similar to the current Lustre design, but in our approach we modify step 2 slightly: we make an additional check at the MDS to see whether this is a first-time access to the file. First-time access means this is the first time the metadata information for this file is created on the MDS, and none of the metadata caches maintained by the kernel hold the metadata-related information. If this is a first-time access, we keep a data structure to track who owns the file and perform some validation of whether it is a first-time access or not. We compute a hash based on the filename to speed up the lookup process at the MDS side, and we make use of the communication module for one-sided operations such as remote memory read and remote memory write.

3. In step 3, we expose the buffers with the information, such as extended attributes, that will be useful for clients who subsequently access the file that was opened by C1. We call the client who exposes the needed buffer information the new owner of the file, and use the term delegation client for such clients.

4. When C2 tries to open the same file, it performs an RPC in step 4, as in step 1. We call C2 the normal client.

5. In step 5, the normal client's request triggers a lookup in the hash table at the MDS side that we updated in step 2, which finds that C1 is the owner of the file. So instead of spending additional time at the MDS side, we return the needed information to C2.

6. In step 6, the normal client (C2 in our case) contacts the delegation client (C1 in our case) and fetches the information which was stored in the buffers exposed by the delegation client for this specific file. We use our communication module to speed up this process using one-sided operations.

7. Once C2 gets the needed metadata information from C1, it can proceed with the I/O operations.

This design can help in minimizing the request traffic at the MDS by delegating some load to the clients. Distributed subtree partitioning and pure hashing are methods used to distribute the namespace and workload among metadata servers in existing distributed file systems. We use a combination of both of these approaches. By partitioning the namespace across the MDS and clients, we minimize the load on the MDS, and by using a pure hashing scheme we compute a hash on the filename and decide which bucket, or server, the file's metadata has been delegated to. So when a client accesses a file, if the file was created earlier by some client, then this file will have an entry in the hash table; we compute the hash based on the filename and divert the client to the delegation client to get the metadata information. If the hash table does not have the needed mapping information, then this is the first-time access for the file. This design allows the authoritative copy of each metadata item to reside on different clients, distributing the workload. In addition, a delegation record is much smaller, allowing the MDS cache to be more effective as the number of files accessed increases. This architecture works well when the file pool size increases and many clients are simultaneously accessing the files on the MDS. With this design, the workload is distributed among the MDS and the clients.
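The MDS-side lookup just described can be sketched as follows (a minimal illustration with our own names and structures, not the actual implementation): hash the filename into a bucket; if a delegation record exists, redirect the caller to the delegation client; otherwise record the caller as the new owner.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 4096

    typedef struct deleg {
        char name[256];
        int owner;               /* delegation client id, simplified to an int */
        struct deleg *next;
    } deleg_t;

    static deleg_t *table[NBUCKETS];

    static uint32_t fnv1a(const char *s)  /* FNV-1a string hash */
    {
        uint32_t h = 2166136261u;
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
        return h;
    }

    /* Returns the delegation client id, or -1 on first-time access
     * (in which case 'client' becomes the owner). */
    static int lookup_or_delegate(const char *name, int client)
    {
        uint32_t b = fnv1a(name) % NBUCKETS;
        for (deleg_t *d = table[b]; d; d = d->next)
            if (strcmp(d->name, name) == 0)
                return d->owner;            /* redirect to delegation client */

        deleg_t *d = calloc(1, sizeof *d);  /* first access: delegate to caller */
        snprintf(d->name, sizeof(d->name), "%s", name);
        d->owner = client;
        d->next = table[b];
        table[b] = d;
        return -1;
    }

    int main(void)
    {
        printf("%d\n", lookup_or_delegate("/d1/d2/foo.txt", 1)); /* -1: C1 becomes owner */
        printf("%d\n", lookup_or_delegate("/d1/d2/foo.txt", 2)); /*  1: redirect C2 to C1 */
        return 0;
    }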

Secondly, with metadata delegated at the client side, instead of caching the complete metadata information at the MDS, the MDS only stores a record pointing to the delegation client for each file, which greatly reduces the cache memory usage. The metadata is distributed across all clients, so no single client becomes a bottleneck when many clients are accessing many files.

Finally, when many clients try to access a large number of files, the MDS is not able to serve these requests from its memory due to the sheer amount of metadata to be cached; Table 1.3 shows this effect. The MDS is kept busy reading the requested metadata, which is widely dispersed on disk, to load the blocks into memory. Meanwhile, metadata already in memory has to be evicted to make space for newly loaded metadata, and later requests for that evicted metadata have to be serviced from disk, which aggravates the burden on the MDS. Obviously, the MDS becomes a bottleneck. Client-side metadata delegation distributes the responsibility for many files to a lot of clients, so that a request for any given file hits the MDS disk at most once, and all following requests for that file can be serviced by a delegated client. No single node becomes a bottleneck in the entire system. Although an additional network round trip is incurred as a result of the redirection, this overhead is very small compared to the disk access time on the MDS. With high-bandwidth, low-latency interconnection technologies such as InfiniBand, this network round-trip time is likely to be negligible.

We also studied the impact of the underlying storage and the transport on metadata performance. The experiment was run on 8 client nodes, each running Lustre 1.8.1. Our results show that even after changing the underlying data storage medium to a faster device like an SSD, we do not see a large improvement in metadata operation rates. Table 3.1 shows the details.

Table 3.1: Metadata operation rates with different underlying storage

Metadata operation    HDD/TCP    SSD/TCP    HDD/IB    SSD/IB
create()              455        457        893       935
open()                602        602        1,441     1,443
stat()                1,472      1,481      3,131     3,065
()                    501        504        1,171     1,219
unlink()              405        421        843       883
mkdir()               545        519        1,221     1,229
rmdir()               265        267        609       621

3.3.3 Challenges

While designing the client-side metadata delegation approach, we need to take care of some challenges. In this section we state those challenges and describe the approach taken to solve them.

30 3.3.4 Metadata revocation

Delegating metadata at the client side distributes the workload of the MDS to client nodes. This is very beneficial when many files are being accessed by many clients (i.e., a many-to-many file access pattern). However, when all clients are accessing a single file (i.e., an N-to-1 file access pattern), the hot spot is simply moved from a very powerful MDS to a relatively less powerful client. Therefore, provisions must be made to avoid delegating to a node when this situation may arise, and to be able to pull back a delegation if this situation arises unexpectedly (including updating the information on the clients so they know who has the authoritative metadata). We have implemented the metadata revocation logic in the communication module, which takes care of this challenge by revoking the metadata when the client becomes a hot spot.

3.3.5 Distributed Lock management for DMCS approach

To address the consistency and reliability aspects, we have designed a lock management scheme. The distributed locking scheme ensures consistency when many concurrent clients access the same file. One of the primary responsibilities of the lock management scheme is to protect the shared data structure, i.e., the hash table maintained on the MDS side, as this hash table records how the metadata is delegated and who currently owns it. Consider a scenario where client C1 is the delegation client holding the metadata for a specific file, and clients C2-C10 are also accessing the same file and have learned from the MDS that the file's metadata has been delegated to C1. If a metadata revocation request arrives while clients C2-C10 are performing data movement, i.e., retrieving the file metadata information, the revocation request is queued until all the data movement operations are completed. In the existing design of Lustre, or of any parallel filesystem, the client cache is flushed whenever a file is closed. If a client is performing an operation on a file and another client requests a lock on the same file, then depending on the lock compatibility matrix, the original client may have to flush its cache to the storage node. The lock compatibility matrix determines which operations may proceed concurrently; for example, if both clients want to grab a read lock, both can proceed. In the case of conflicting lock compatibility entries, the original client flushes its cache and hands the lock back to the MDS. The MDS then grants the lock to the new client, which can proceed. The overhead involved in this step is high, since the process spends substantial time in communication and the cache must be flushed.
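To make the compatibility check concrete, the following is a minimal sketch of one way a lock compatibility matrix might be encoded; the two-mode simplification and the names are illustrative assumptions, not Lustre's actual lock manager implementation.

```c
#include <stdbool.h>

/* Hypothetical simplified lock modes. Lustre's real DLM has more
 * modes (NL, CR, CW, PR, PW, EX); two modes suffice to illustrate. */
enum lock_mode { LOCK_READ, LOCK_WRITE };

/* compatible[held][requested]: true if the requested lock can be granted
 * while the held lock is outstanding. Two reads are compatible; any
 * combination involving a write conflicts and forces a cache flush. */
static const bool compatible[2][2] = {
    /*                 READ   WRITE */
    /* READ  held */ { true,  false },
    /* WRITE held */ { false, false },
};

bool can_grant(enum lock_mode held, enum lock_mode requested)
{
    return compatible[held][requested];
}
```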

In Lustre, important file attributes such as the file size, modification time, and access time are stored at the OSS. So when one client flushes its cache and a new client subsequently accesses the file data from the OSS, the Lock Manager at the OSS ensures that consistency is maintained.

3.4 Performance Evaluation

We have implemented our design in Lustre 1.8.1.1 to minimize the number of RPC calls during a metadata operation, and we conducted experiments to evaluate metadata operation performance with the proposed design. One node acts as the Lustre Metadata Server (MDS), and two nodes are Lustre Object Storage Servers (OSS). The Lustre filesystem is mounted on eight other nodes, which act as Lustre client nodes. Each node runs kernel 2.6.18-128.7.1.el5 with Lustre 1.8.1.1 and has dual Xeon E5335 CPUs (8 cores in total) and 4GB of memory. The nodes are interconnected with 1 GigE for general-purpose networking, and we configured Lustre to use the TCP transport. To measure the performance of metadata operations such as open(), we have developed a parallel micro-benchmark.

We have extended the basic fileop testing tool that comes with the IOZone [2] benchmark to support parallel runs with multiple processes on many Lustre client nodes. The extended fileop tool creates a file tree structure for each process. This tree structure contains X Level-1 directories, with each Level-1 directory holding Y Level-2 directories. The total number of levels of subdirectories can be configured at run time. Within each bottom-level directory, Z files are created. By varying the size (fan-out) of each layer, we can generate different numbers of files in a file tree. We have developed an MPI parallel program to start multiple processes on multiple nodes. Each process works in a separate directory to create its aforementioned file tree. After that, each process walks through its neighbor's file tree to open each of the files in that sub-tree. This simulates the scenario where multiple client processes take turns accessing a shared pool of files. The wall-clock times of all the processes are then aggregated, and the total IOPS for the open system call is reported. To perform the tests, we created a number of files from a specific client, and those files were subsequently accessed by other clients in an interleaving manner. We could not simulate this scenario with the Postmark benchmark [5], since Postmark creates some N files and deletes the file pool as soon as the open/create or read/append operations complete. We therefore use the above micro-benchmark to run the tests and gather the experimental results. To see the benefits of the proposed approach in minimizing RPCs, we carried out three different types of test using our micro-benchmark: 1) open IOPS for different numbers of client processes, 2) open IOPS for different file pool sizes, and 3) time spent in open for varying file path depth.
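The following is a minimal sketch of how such an MPI micro-benchmark can be structured; the flat tree layout, file counts, naming scheme, and the aggregation by the slowest process are illustrative assumptions rather than the exact extended-fileop code.

```c
#include <mpi.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NFILES 10000  /* files per process tree (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    char path[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Phase 1: each process creates its own file tree. */
    snprintf(path, sizeof(path), "tree.%d", rank);
    mkdir(path, 0755);
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), "tree.%d/f.%d", rank, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) close(fd);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2: each process opens every file in its neighbor's tree. */
    int neighbor = (rank + 1) % size;
    double t0 = MPI_Wtime();
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), "tree.%d/f.%d", neighbor, i);
        int fd = open(path, O_RDONLY);
        if (fd >= 0) close(fd);
    }
    double elapsed = MPI_Wtime() - t0;

    /* One plausible aggregation: total opens over the slowest process. */
    double max_elapsed;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("open() IOPS: %.0f\n", (double)NFILES * size / max_elapsed);

    MPI_Finalize();
    return 0;
}
```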

3.4.1 File Open IOPS: Varying Number of Client Processes

In this test, we first create the aforementioned file tree, containing 10,000 files, for every client process, then let each process access its neighbor's file tree. Figure 3.2 shows the aggregated IOPS for the open system call on the Lustre filesystem. We vary the number of client processes from 2 to 16, evenly distributed across the 8 client nodes: with 2 processes, only two client nodes are actually used; with 16 processes, 2 client processes run on each of the 8 client nodes. As seen in Figure 3.2, the modified Lustre with MDCS improves the aggregated IOPS over basic Lustre significantly. Compared to basic Lustre, our design reduces the number of RPC calls in the metadata operation path, which improves overall performance. With two client processes, the new approach (MDCS) raises file open IOPS from 2,528 per second to 3,612 per second. Basic Lustre, on the other hand, peaks at 8 concurrent client processes, where its Metadata Server delivers a slightly higher file open IOPS. When 16 processes are used, however, MDS performance drops due to high contention, similar to what we see with the MDCS approach.

3.4.2 File Open IOPS: Varying File Pool Size

In this test we carry out the same basic steps as described in Section 3.4.1, but we vary the number of files in each file tree per process, while using the same 16 client processes.

[Figure: aggregated open() operations per second (0-4,000) versus the number of client processes (4, 8, 16), comparing Basic Lustre with MDCS: Modified Lustre.]

Figure 3.2: File open IOPS, Each Process Accesses 10,000 Files

We wanted to understand how significant this factor is for performance. Figure 3.3 shows the experimental results, and it clearly demonstrates the benefits of our MDCS design. We observe that varying the file pool size for a constant number of processes does not produce a large deviation in open IOPS. We speculate that the file pool sizes used in our test are not big enough to stress the memory of the MDS, so most of the file metadata remains in the MDS's memory cache. As a result, the aggregated metadata operation throughput remains constant across different file pool sizes. In a future study we will experiment with larger file pools to push the memory limit of the MDS.

[Figure: aggregated open() operations per second (0-4,000) versus the number of files in the file pool (5,000; 10,000; 100,000), comparing Basic Lustre with MDCS: Modified Lustre.]

Figure 3.3: File open IOPS, Using 16 Client Processes

3.4.3 File Open IOPS: Varying File Path Depth

In this test we measure the performance benefit of the new MDCS approach when accessing files with different file path depths, i.e., different numbers of components in the file path. We start by creating a file tree for each client process containing 10,000 files, with a file path depth of 3 or 4. After that, each process begins to access files within its neighbor process's file tree. Figure 3.4 compares the time spent to open one file with basic Lustre and with the MDCS-modified Lustre filesystem. First of all, it shows that MDCS can reduce the time to open one file by up to 33%. We also observe that the number of path components has a significant impact on the total cost of a metadata operation: each file path component has to be resolved using one RPC to the MDS, so a deeper file path leads to a longer processing time.

[Figure: time to finish one open() in milliseconds (0-25) versus the number of components in the file path (3, 4), comparing Basic Lustre with MDCS: Modified Lustre.]

Figure 3.4: Time to Finish open, Using 16 Processes Each Accessing 10,000 Files

3.5 Summary

We have described a mechanism for minimizing the load on a single metadata server for the Lustre filesystem. A single metadata server managing the entire filesystem namespace is common in most parallel filesystem approaches to metadata management. Our design reduces the load on the MDS, and hence the memory pressure on the MDS, by delegating metadata to the client side. We evaluated the design and compared it with the basic variant of Lustre. For a metadata operation such as file open(), throughput increases as the number of client processes increases, whereas with the basic variant of Lustre the throughput decreases. We see similar behavior when the number of files in the file pool is increased. One of the primary reasons for the slowdown in the basic variant of Lustre is that, as the file pool size grows, the amount of file metadata to be kept in the MDS cache increases.

Chapter 4: DESIGN OF A DECENTRALIZED METADATA SERVICE LAYER FOR DISTRIBUTED METADATA MANAGEMENT

4.1 Detailed design of Distributed Union FileSystem (DUFS)

The core principle of the Distributed Union FileSystem (DUFS) is to distribute the load of metadata operations across multiple distributed filesystems. DUFS provides a single POSIX-compliant filesystem abstraction to the user, without revealing the multiple underlying filesystem mounts. With such an abstraction, the single metadata server of the back-end distributed filesystem is no longer a bottleneck. However, as described in section 1.4.2, consistency has to be guaranteed across multiple clients that perform simultaneous metadata operations. This task is delegated to the distributed coordination service, ZooKeeper [14].

DUFS maps each virtual filename, as seen by the user, to a physical path corresponding to one of the underlying filesystem mounts. A single level of indirection is introduced with the use of a File Identifier (FID), which uniquely identifies each file. Figure 4.1 shows a schematic view of this indirection level in our design. The mapping between the virtual path and the FID is maintained by ZooKeeper in a consistent manner, while the mapping between the FID and the physical path is carried out using a universally known deterministic mapping function that every DUFS client is aware of.

[Figure: Virtual path → FID (via the distributed coordination service) → Physical path (via the deterministic mapping function).]

Figure 4.1: DUFS mapping from the virtual path to the physical path using File Identifier (FID)

Because the second mapping step is deterministic, it does not require any coordination between clients. Consistency management at the physical storage level is offloaded to the underlying filesystem.

This single level of indirection offers flexibility: it allows the contents of a file to be represented independently of its name. Indeed, a filename can represent two different data contents over time (after a deletion and a new creation with the same name); conversely, the same data contents can correspond to any filename (for instance, after a rename operation). This representation also makes rename operations and physical data relocation easier.

Finally, directories and directory trees are considered metadata only, so they are not physically created on the back-end storage. Instead, the directory-tree information is maintained in memory by ZooKeeper.


4.1.1 Implementation Overview

The design of DUFS comprises three main components: the filesystem interface based on FUSE, the metadata management based on ZooKeeper, and the back-end storage provided by the underlying parallel filesystem. A DUFS client instance is purely local software that does not interact directly with other DUFS clients. Any necessary interaction happens only through the ZooKeeper service or over the back-end storage.

[Figure: two client nodes, each running applications on top of a FUSE interface and a DUFS instance that maps virtual path → FID → physical path; each DUFS instance uses a ZooKeeper client library to reach the ZooKeeper server ensemble and back-end storage clients to reach the back-end distributed filesystem storage. Labels A-D mark the steps of an open() operation.]

Figure 4.2: DUFS overview. A, B, C and D show the steps required to perform an open() operation.

Figure 4.2 shows the basic steps required to perform an open() operation on a file using DUFS.

A. The open() call is intercepted by FUSE, which gives the virtual path of the file to DUFS.

B. DUFS queries ZooKeeper to get the Znode based on the filename and to retrieve the FID. If the file does not exist, ZooKeeper returns an error.

C. DUFS uses the deterministic mapping function to find the physical path associated with the FID.

D. Finally, DUFS opens the file based on its physical path. The result is returned to the application via FUSE.

Directory operations, in contrast, take place only at the metadata level, so only ZooKeeper is involved and not the back-end storage; only steps A and B are performed.
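As an illustration of steps A-D, the following is a minimal sketch of what a DUFS open() handler might look like using the synchronous ZooKeeper C API and the FUSE high-level interface. The helper names (fid_to_mount(), fid_to_relpath()), the Znode data layout (a one-byte type tag followed by the FID), and the error handling are assumptions for the sketch, not the actual DUFS source.

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <zookeeper/zookeeper.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

extern zhandle_t *zk;           /* connected via zookeeper_init()       */
extern const char *mounts[];    /* back-end mount points, size N        */
extern int  fid_to_mount(const char *fid_hex);   /* MD5(fid) mod N      */
extern void fid_to_relpath(const char *fid_hex, char *out, size_t len);

static int dufs_open(const char *vpath, struct fuse_file_info *fi)
{
    char data[64];
    int len = sizeof(data) - 1;
    struct Stat zstat;

    /* Step B: look up the Znode for the virtual path; its data field
     * holds the type tag and, for regular files, the FID. */
    int rc = zoo_get(zk, vpath, 0, data, &len, &zstat);
    if (rc != ZOK || len < 1)
        return -ENOENT;
    data[len] = '\0';

    /* Step C: deterministically map the FID to a back-end mount and a
     * relative physical path (see Sections 4.2.2 and 4.2.3). */
    const char *fid_hex = data + 1;   /* assumed layout: [type][fid] */
    char rel[64], phys[512];
    fid_to_relpath(fid_hex, rel, sizeof(rel));
    snprintf(phys, sizeof(phys), "%s/%s",
             mounts[fid_to_mount(fid_hex)], rel);

    /* Step D: open the physical file and hand the descriptor to FUSE. */
    int fd = open(phys, fi->flags);
    if (fd < 0)
        return -errno;
    fi->fh = fd;
    return 0;
}
```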

The following subsections describe the functions of the primary components of DUFS.

4.1.2 FUSE-based Filesystem Interface

We use FUSE to provide a POSIX-compliant filesystem interface to applications. Our DUFS prototype thus appears as a classic mount point of a standard filesystem.

Most of the basic filesystem operations, such as mkdir, create, open, symlink, rename, stat, readdir, rmdir, unlink, truncate, chmod, access, read, and write, are implemented in DUFS. When an application performs a filesystem operation, it operates on the virtual path exposed by DUFS. The filesystem operations are translated into the FUSE-specific operations; for example, the open() call from an application is translated into dufs_open() in DUFS. Finally, for each filesystem operation, DUFS returns the correct result after querying the ZooKeeper-based metadata management service and the back-end storage as needed.
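A minimal sketch of how these handlers could be registered with FUSE's high-level API follows; only two of the operations listed above appear, and the handler bodies are stubs standing in for the real ZooKeeper-backed implementations.

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <errno.h>

/* Stub handlers; the real versions query ZooKeeper and the back end. */
static int dufs_getattr(const char *vpath, struct stat *st)
{
    (void)vpath; (void)st;
    return -ENOSYS; /* real version: Znode lookup, then stat() if a file */
}

static int dufs_mkdir(const char *vpath, mode_t mode)
{
    (void)vpath; (void)mode;
    return -ENOSYS; /* real version: create the corresponding Znode */
}

static struct fuse_operations dufs_ops = {
    .getattr = dufs_getattr,
    .mkdir   = dufs_mkdir,
    /* ... open, readdir, unlink, etc. as listed above ... */
};

int main(int argc, char *argv[])
{
    /* fuse_main() parses the mount point from argv and runs the loop. */
    return fuse_main(argc, argv, &dufs_ops, NULL);
}
```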

4.2 ZooKeeper-based Metadata Management

We use the ZooKeeper distributed coordination service to handle the consistency issues posed by simultaneous distributed accesses from several DUFS clients. The synchronous ZooKeeper APIs were used for this purpose.

With our design, ZooKeeper stores part of the virtual filesystem metadata. It keeps track of the directories and files that are created: a separate Znode is created in ZooKeeper for each directory or file, and the virtual filesystem hierarchy is represented inside ZooKeeper using Znodes.

ZooKeeper associates several information fields with each Znode. Some of the standard fields include the Znode creation time, the list of child Znodes, etc. ZooKeeper also provides a custom data field for each Znode. In DUFS, this custom field indicates whether the Znode represents a directory or a file; in the latter case, the FID of the file is also stored in this field.
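For illustration, the sketch below shows how a file's Znode might be created with the synchronous ZooKeeper C API, placing a one-byte type tag followed by the FID in the custom data field; this layout is an assumption for the sketch, not the documented DUFS format.

```c
#include <zookeeper/zookeeper.h>
#include <stdio.h>

#define ZNODE_TYPE_DIR  'D'
#define ZNODE_TYPE_FILE 'F'

/* Create the Znode for a newly created file: data = [type][fid-hex]. */
int create_file_znode(zhandle_t *zk, const char *vpath, const char *fid_hex)
{
    char data[64];
    int len = snprintf(data, sizeof(data), "%c%s", ZNODE_TYPE_FILE, fid_hex);

    /* Persistent Znode with the open ACL; no created-path buffer needed. */
    return zoo_create(zk, vpath, data, len,
                      &ZOO_OPEN_ACL_UNSAFE, 0, NULL, 0);
}
```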

The ZooKeeper architecture uses multiple ZooKeeper servers, and the data is replicated among all of them. ZooKeeper uses coordination algorithms to ensure that the Znode hierarchy and its contents are consistent across the servers and that all modifications are applied in the same order on every server [14].

All this information is kept in memory, and ZooKeeper servers can be located close to the DUFS clients. Thanks to this, ZooKeeper queries are fast and a large operation throughput can be achieved; this raw throughput is studied in section 4.4.1. The counterpart is that the ZooKeeper servers use a large amount of memory; we study this memory usage in section 4.4.1 as well.

4.2.1 File Identifier

In our design, we use a File Identifier (FID) to uniquely represent the physical contents of a file. This FID is stored in the custom data field of the Znode that corresponds to the virtual path of the file. The FID is designed to be unique for each newly created file; however, modifications to the contents of a file do not require changing the FID.

In DUFS, the FID is a 128-bit integer. We propose a simple approach to generate a unique FID at the DUFS client without requiring any coordination. The FID for a file is generated by the client that initially creates the file. It is the concatenation of a 64-bit client ID, which uniquely identifies the DUFS client instance that created the file, and a 64-bit file creation counter, which records the number of files created during the lifetime of that DUFS client. When a client is restarted, it acquires another unique 64-bit client ID and its creation counter is reset to 0.
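A minimal sketch of this FID scheme, assuming the unique client ID has already been obtained (for example, from the coordination service or configuration):

```c
#include <stdint.h>
#include <stdio.h>

/* 128-bit FID = 64-bit client ID || 64-bit per-client creation counter. */
struct fid {
    uint64_t client_id;
    uint64_t counter;
};

static uint64_t creation_counter;  /* reset to 0 on client restart */

struct fid next_fid(uint64_t client_id)
{
    struct fid f = { client_id, creation_counter++ };
    return f;
}

/* Hexadecimal representation used for the physical filename (32 digits). */
void fid_to_hex(struct fid f, char out[33])
{
    snprintf(out, 33, "%016llx%016llx",
             (unsigned long long)f.client_id,
             (unsigned long long)f.counter);
}
```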

The FID is used by DUFS to deduce the physical location of the file and the physical filename. Firstly, the physical location of the data in the underlying filesystem is selected using the deterministic mapping function. Secondly, the filename for the data contents on the physical storage is generated from the FID. In this manner, the contents of a file do not have to be renamed or moved between different physical mounts when the virtual filename is renamed or moved.

4.2.2 Deterministic mapping function

The deterministic mapping function associates a physical location with each file's contents based on its FID. The function takes as input the 128-bit integer representing the FID and returns a number between 1 and N, where N is the number of underlying back-end storage systems. It has to be deterministic so that any DUFS client can find the right location without coordination.

To achieve good load balancing between the different underlying storage mounts, the mapping function has to distribute the FIDs in a fair manner. For this reason, the mapping function of our current implementation is based on the MD5 hash function, which has this property [19]. Our mapping function is:

fid ↦ MD5(fid) mod N
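A sketch of this mapping using OpenSSL's MD5 follows. Folding the last digest bytes into an integer before the modulo, and returning a 0-based index rather than a number in 1..N, are plausible reductions assumed for the sketch, not necessarily the exact DUFS code.

```c
#include <openssl/md5.h>
#include <stdint.h>

/* Map a 128-bit FID to a back-end index in [0, n_mounts). The digest's
 * uniformity gives a fair spread of FIDs across the mounts. */
int fid_to_mount_index(const unsigned char fid[16], int n_mounts)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(fid, 16, digest);

    /* Fold the last 8 digest bytes into an integer, then reduce mod N. */
    uint64_t h = 0;
    for (int i = 8; i < 16; i++)
        h = (h << 8) | digest[i];
    return (int)(h % (uint64_t)n_mounts);
}
```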

4.2.3 Back-end storage

Once a particular physical filesystem has been chosen using the deterministic mapping function, the data is accessed directly through the local mount point of that distributed filesystem. The filename is deterministically derived from the FID; thus, it is independent of any virtual filename, and the DUFS client does not need to communicate with any other component to find the actual physical filename.

In DUFS, the physical filename used to store a file is the hexadecimal representation of the FID computed in the previous step. To avoid congestion due to file creation in a single directory, the hexadecimal representation is divided into four parts to create multiple path components. The first component provides the filename, while the other components are used for the directory hierarchy. Figure 4.3 shows an example of the filename on the back-end storage for the FID 0123456789abcdef.

FID: 0123456789abcdef → Physical filename: cdef/89ab/4567/0123

Figure 4.3: Sample physical filename generated from a given FID

This directory hierarchy is static and identical across all the back-end mount points. This static structure avoids any potential conflicts.
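The path construction can be sketched as follows for the 16-hex-digit example of Figure 4.3; a full 128-bit FID has 32 hex digits, so the grouping for longer FIDs is an assumption that mirrors the figure.

```c
#include <stdio.h>

/* Build the physical path for a FID prefix of 16 hex digits, reversing
 * the 4-digit groups so "0123456789abcdef" becomes "cdef/89ab/4567/0123":
 * the first group of the FID ("0123") becomes the filename (the final
 * path component), and the remaining groups form the directories. */
void fid_to_relpath(const char hex[16], char *out, size_t outlen)
{
    snprintf(out, outlen, "%.4s/%.4s/%.4s/%.4s",
             hex + 12, hex + 8, hex + 4, hex);
}
```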

4.3 Algorithm examples for Metadata operations

In this section, we present the algorithms for some metadata operations in DUFS. Figure 4.4 shows the algorithm for the mkdir() operation, and Figure 4.5 shows the algorithm for the stat() operation.

4.3.1 Reliability concerns

The DUFS client does not hold any state: all the required information is stored either in ZooKeeper or on the back-end storage. DUFS reliability therefore relies on ZooKeeper and on the back-end distributed filesystems.

Within ZooKeeper, all the information is replicated among all the servers. Thanks to this, ZooKeeper is able to tolerate the failure of many servers; it needs a majority of the servers alive to maintain consistency of the data [14].

1: Get the virtual path of the directory
2: Look for the corresponding Znode
3: if Znode exists then
4:     return 'File exists' error code
5: else
6:     Generate the data field with type and metadata information
7:     Create the corresponding Znode with ZooKeeper
8:     if success then
9:         return Success
10:    else
11:        Handle error
12:    end if
13: end if

Figure 4.4: Algorithm for the mkdir() operation

1: Get the virtual path of the file/directory
2: Get the corresponding Znode with ZooKeeper
3: if Znode does not exist then
4:     return 'No such file or directory' error code
5: else
6:     ZooKeeper returned the data field (type, FID, ...)
7:     if Znode type is directory then
8:         Fill the struct stat with information stored in ZooKeeper
9:         return struct stat
10:    else
11:        Compute the physical location
12:        Compute the physical path
13:        Perform stat() on the physical file
14:        return struct stat
15:    end if
16: end if

Figure 4.5: Algorithm for the stat() operation

Furthermore, although each ZooKeeper server keeps all its data in memory, the data is periodically checkpointed to disk. The system can therefore tolerate the failure of all servers by restarting them later.

Many distributed filesystems, such as Lustre, provide fault tolerance: data can be replicated among multiple data servers. If such a filesystem is used as the back-end storage, DUFS availability benefits from it.

4.4 Performance Evaluation

In this section, we conduct experiments to evaluate the performance of metadata operations with our proposed design. These tests were performed on a Linux cluster. Each node has dual Intel Xeon E5335 CPUs (8 cores in total) and 6GB of memory. A SATA 250GB hard drive is used as the storage device on each node. The nodes are connected with 1 GigE for general-purpose networking, and each node runs kernel 2.6.30.10. We dedicate a set of nodes as Lustre MDS and OSS (version 1.8.3) to form multiple instances of the Lustre filesystem. Another set of dedicated nodes works as PVFS servers (version 2.8.2) to export multiple instances of the PVFS filesystem. Each client node mounts multiple instances of the Lustre and PVFS filesystems and uses DUFS to merge these distinct physical partitions into a logically unified partition. A ZooKeeper server runs alongside the DUFS clients, providing distributed coordination services over 1 GigE. We have used the mdtest benchmark [13] for our experiments. We carried out experiments by creating a directory structure with a fan-out factor of 10 and a directory depth of 5. As the number of processes increases, the number of files per directory increases accordingly. We have also carried out experiments where many files are created in a single directory. We used the same parameters and configuration while experimenting with the different back-end parallel filesystems, Lustre and PVFS.

4.4.1 Distributed coordination service throughput and memory usage experiments

With the DUFS design, each metadata operation has to go through the ZooKeeper service before it is actually issued to the corresponding physical back-end filesystem. In this section we perform experiments to study ZooKeeper's throughput for basic operations, namely zoo_create(), zoo_get(), zoo_set() and zoo_delete(), using ZooKeeper's synchronous API. With a total of 8 DUFS clients in the experimental setup, we varied the number of ZooKeeper servers from 1 to 8. The results are shown in Figure 4.6. For the zoo_create(), zoo_delete() and zoo_set() operations, we can see that the overall throughput drops as the number of ZooKeeper servers increases. This is the expected behavior, since these operations perform modifications on Znodes; all the ZooKeeper servers have to coordinate to ensure the consistency of their replicated states. For the zoo_get() operation, the overall throughput increases with the number of ZooKeeper servers. ZooKeeper performs very well on read-dominant workloads [14]; indeed, each ZooKeeper server can serve read requests independently of the others.
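A sketch of how such a throughput measurement might look for zoo_get() with the synchronous C API follows; the connection string, Znode path, and iteration count are illustrative assumptions.

```c
#include <zookeeper/zookeeper.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000

int main(void)
{
    /* Connect to an (assumed) local ensemble member; 30s session timeout. */
    zhandle_t *zk = zookeeper_init("localhost:2181", NULL, 30000,
                                   NULL, NULL, 0);
    if (!zk) return 1;

    char buf[128];
    struct Stat st;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++) {
        int len = sizeof(buf);
        zoo_get(zk, "/bench/node", 0, buf, &len, &st); /* synchronous read */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("zoo_get() throughput: %.0f ops/sec\n", ITERATIONS / secs);

    zookeeper_close(zk);
    return 0;
}
```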

Since ZooKeeper keeps all its data in memory, memory usage can be a concern. In the following experiment, we study the memory usage of the ZooKeeper server process, and of DUFS as well, as the amount of metadata grows. We designed a benchmark that creates a large number of directories and reports the resident process memory size. For this experiment, all the processes ran on the same node.

[Figure: four panels of throughput (ops/sec) versus the number of client processes (0-250) for 1, 4, and 8 ZooKeeper servers: (a) zoo_create(), (b) zoo_delete(), (c) zoo_set(), (d) zoo_get().]

Figure 4.6: ZooKeeper throughput for basic operations by varying the number of ZooKeeper Servers

Additionally, in order to compare the memory usage of DUFS, we run the same benchmark on a dummy FUSE filesystem that does nothing except forward requests to a local filesystem.

[Figure: memory usage in MB (0-1,400) versus millions of directories created (0-2.5) for ZooKeeper, DUFS, and the dummy FUSE filesystem.]

Figure 4.7: Zookeeper memory usage and its comparison with DUFS and basic FUSE based file system memory usage

The results are shown in Figure 4.7. We can see that the memory consumed by DUFS is bounded and similar to that of a normal FUSE-based filesystem, which is what we expect. The ZooKeeper memory usage is proportional to the number of created directories or files (the Znode data size is similar for a file or a directory). From these numbers, we estimate that storing one million files or directories requires about 417 MB of memory. This drawback follows from the ZooKeeper design choice of keeping all data in memory.

4.4.2 Scalability Experiments

For the scalability experiments, a ZooKeeper server runs on each DUFS client. We vary the number of client processes from 4 to 256 and the number of physical nodes across 4, 8 and 16. In these experiments, since the ZooKeeper servers are local to the DUFS clients, read requests achieve high throughput, but updates require a higher level of synchronization among the servers of the ensemble.

By varying the number of physical nodes and the number of client processes running on them, we can see that the approach suggested in this chapter performs better than the basic variant of Lustre/PVFS as the number of client processes increases. As expected, directory creation, directory removal, and directory stat perform better. Directory stat, being a read operation, performs exceedingly well compared to the basic variant of Lustre. File operations such as file creation, file removal, and file stat show a similar trend; although they cannot reach as high a throughput as the directory operations, they still outperform the basic variants of the parallel filesystems. For file operations, we have to contact the actual back-end filesystem to get the file attributes, whereas for directory operations most requests are satisfied at the ZooKeeper level itself.

4.4.3 Experiments with varying numbers of distributed coordination service servers

In this section we perform experiments to study the effect of varying the number of ZooKeeper servers. We used a set of 8 nodes with 8 DUFS clients,

[Figure: six panels of throughput (ops/sec) versus the number of processes (0-250), comparing Basic Lustre with DUFS: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.8: Scalability experiments with 8 Client nodes and varying number of client processes

[Figure: six panels of throughput (ops/sec) versus the number of processes (0-250), comparing Basic Lustre with DUFS: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.9: Scalability experiments with 16 Client nodes and varying number of client processes

which use a number of ZooKeeper servers varying from 1 to 8. We measured the operation throughput and compared it against the basic Lustre throughput.

The results are presented in Figure 4.10. As expected, read operations such as file stat() and directory stat() show a significant performance improvement when the number of ZooKeeper servers is increased. For the other operations, the effect of the number of ZooKeeper servers is smaller.

Finally, these results show that using 8 ZooKeeper servers is a good compromise for our configuration.

4.4.4 Experiments with different numbers of mounts combined using DUFS

In this section we perform experiments to study the influence of varying the number of back-end storage systems combined by DUFS. For this experiment we used an ensemble of 8 ZooKeeper servers. Since directory operations do not touch the back-end distributed filesystem, we focus only on file operations in this experiment.

Figure 4.11 shows the throughput of file operations for 2 and 4 back-end storage systems and for different numbers of client processes. We also compare this throughput to the basic Lustre case. Using 4 back-end storage systems instead of 2 provides a small improvement for file creation and removal. For file stat(), we see an improvement of more than 37% with 256 client processes.

Although the file operations are uniformly distributed among the back-end storage systems, there is an indirection through a ZooKeeper server. File removal and creation require a metadata modification, and the cost of this modification overtakes the benefit of multiple back-end storage systems. The file stat() operation only requires reading the metadata,

[Figure: six panels of throughput (ops/sec) versus the number of client processes (64, 128, 256), comparing Basic Lustre with DUFS using 1, 4, and 8 ZooKeeper servers: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.10: Operation throughput by varying the number of Zookeeper Servers

[Figure: six panels of throughput (ops/sec) versus the number of processes (64, 128, 256), comparing Basic Lustre with DUFS merging 2 and 4 Lustre back-end mounts: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.11: File operation throughput for different numbers of back-end storage

which is very fast with ZooKeeper. That is why we see a clear benefit from increasing the number of back-end storage systems in this case.

In any parallel filesystem, if a directory is spread across multiple partitions on a server, then the 'ls -l' operation can be costly. It is even costlier in a larger environment where directories are spread across different partitions on different servers. With the approach presented in this thesis, we obtain a significant improvement in the 'ls -l' operation even though the files may be evenly distributed across different partitions on different servers.

4.4.5 Experiments with different back-end parallel filesystems

In this section, we study the performance of our DUFS prototype in comparison with two distributed filesystems, Lustre and PVFS2. To keep the comparison fair, we also use Lustre and PVFS2 as our back-end storage. We study scalability by increasing the number of client processes.

In these experiments we had 8 DUFS clients and 8 ZooKeeper servers; the ZooKeeper servers and DUFS clients ran on the same nodes.

From Figure 4.12, we can see that DUFS with Lustre as the back-end physical filesystem outperforms basic Lustre, and we see similar results in the PVFS case. One notable point is that the directory operations show a similar trend regardless of the back-end physical mount; this is expected because, in DUFS, directory operations rely only on ZooKeeper. Also, for file operations such as creation, stat, and removal, DUFS with Lustre as the back-end filesystem performs much better than DUFS with PVFS2 as the back-end filesystem. This is because, in that case, the

[Figure: six panels of throughput (ops/sec) versus the number of processes (0-250), comparing Basic Lustre, DUFS merging 2 physical Lustre mounts, Basic PVFS, and DUFS merging 2 physical PVFS mounts: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.12: Operation throughput with respect to the number of clients for Lustre and PVFS2

back-end storage is actually used, and thus the throughput of these operations depends on the performance of the back-end filesystem.

From the scalability point of view, we see that Lustre and PVFS2 do not scale very well: when the number of client processes grows significantly, their performance drops. Conversely, DUFS does not perform as well at small scale, but it outperforms Lustre for all operations at 256 client processes. In all cases, DUFS with PVFS2 back-end storage is clearly better than PVFS2 alone.

For directory creation with 256 client processes, DUFS outperforms Lustre by a factor of 1.9, and PVFS2 by a factor of 23.

Finally, we can see that for directory and file stat, the approach discussed in this chapter performs exceedingly well compared to the basic variants, i.e., Lustre and PVFS2. With respect to file stat() with 256 processes, our approach is 1.3 and 3.0 times faster than Lustre and PVFS, respectively. This is mainly because ZooKeeper performs well on read-dominant workloads.

4.5 Summary

We have designed a Distributed Metadata Service Layer and evaluated its benefits to parallel filesystems. Distributed metadata management is a hard problem, since it involves taking care of various consistency and reliability aspects. Scaling metadata performance is also more complex than scaling raw I/O performance, and with distributed metadata this complexity increases further. This leads to the primary goal in designing a Distributed Metadata Service Layer: to improve scalability while taking care of consistency and reliability. With our approach, we are able to maintain good performance even with a large number of clients. With 256 client processes, we outperform Lustre for all six metadata operations, namely directory creation, directory removal, directory stat, file creation, file removal and file stat.

Chapter 5: CONTRIBUTIONS AND FUTURE WORK

In this thesis, we have designed approaches for managing metadata in parallel filesystems. Our work involved the design of a scheme to delegate metadata to the client side so as to minimize the load on a single metadata server (MDS). We also designed a new approach for distributed metadata management and addressed the various challenges it faces.

5.1 Summary of Research Contributions and Future Work

The research in this thesis aims at solving two important problems faced in parallel filesystem environments:

1. A single metadata server is a bottleneck. We have designed a metadata management scheme that minimizes the load on a single MDS.

2. Recent trends in high-performance computing have seen a shift toward distributed resource management. With distributed metadata, we need to take care of complex issues related to reliability and consistency; there is always a trade-off between maintaining reliability and consistency in the filesystem and achieving a scalable solution. Our approach tackles distributed metadata management with the primary aim of maintaining the reliability and consistency of the filesystem while at the same time improving its scalability.

5.1.1 Delegating metadata at client side

We have described a mechanism for minimizing the load on a single metadata server for the Lustre filesystem. A single metadata server managing the entire filesystem namespace is common in most parallel filesystem approaches to metadata management. In this design we minimize the load on the MDS, and hence the memory pressure on the MDS, by delegating metadata to the client side. We evaluated our design and compared it with the basic variant of Lustre. For a metadata operation such as file open(), the throughput increases as the number of client processes increases, whereas with the basic variant of Lustre the throughput decreases. We see similar behavior when the number of files in the file pool is increased. One of the primary reasons for the slowdown in the basic variant of Lustre is that, as the file pool size grows, the amount of file metadata to be kept in the MDS cache increases.

The MDS metadata cache is not flushed until it reaches a threshold, which varies depending on the physical memory of the MDS. But if metadata that was initially in the cache gets flushed out and is then accessed by a client, the MDS has to perform disk I/O to fetch the needed metadata from disk, which is a costly operation. With the design proposed in this chapter, we instead perform an extra hop to the client that holds the metadata for the file to be accessed; with low-latency, high-bandwidth interconnects such as InfiniBand, the cost of this extra hop is negligible compared to expensive disk I/O. In brief, the design takes advantage of subtree partitioning and hashing-based approaches to minimize the load on the MDS and to prevent it from becoming a single point of bottleneck.

In the future, we plan to study the use of this approach in MPI-IO-style environments, where it would be especially beneficial. In such an environment, a single client can traverse the path and obtain the EA information and the striping details. This information can then be broadcast to the other processes using an MPI broadcast, saving a considerable amount of time in path resolution. The number of RPCs saved is approximately (number of path components) x (number of clients accessing the file). We also plan to design a scheme for distributed metadata management.
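A sketch of this future direction, in which rank 0 resolves the path once and broadcasts the result to all other processes; the struct layout, field sizes, and the resolve_path_via_mds() helper are illustrative assumptions.

```c
#include <mpi.h>

/* Hypothetical container for the metadata one client resolves on
 * behalf of all others: EA information plus striping details. */
struct file_meta {
    char ea[256];        /* extended attribute blob (illustrative) */
    int  stripe_count;   /* striping details, as mentioned above   */
    long stripe_size;
};

/* Assumed helper: one client resolves the path via RPCs to the MDS. */
void resolve_path_via_mds(const char *path, struct file_meta *meta);

/* Rank 0 performs the (expensive) path traversal and RPCs once, then
 * shares the result, avoiding (path components) x (clients) RPCs. */
void fetch_and_share_meta(const char *path, struct file_meta *meta)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        resolve_path_via_mds(path, meta);
    MPI_Bcast(meta, sizeof(*meta), MPI_BYTE, 0, MPI_COMM_WORLD);
}
```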

5.1.2 Design of a decentralized metadata service layer for distributed metadata management

We have designed a Distributed Metadata Service Layer and evaluated its benefits to parallel filesystems. Distributed metadata management is a hard problem, since it involves taking care of various consistency and reliability aspects. Scaling metadata performance is also more complex than scaling raw I/O performance, and with distributed metadata this complexity increases further. This leads to the primary goal in designing a Distributed Metadata Service Layer: to improve scalability while taking care of consistency and reliability. To study this topic, we have designed a FUSE-based filesystem, the Distributed Union File System (DUFS). DUFS can combine multiple mounts of a parallel filesystem into a single filesystem abstraction that is exposed to the user applications. We used ZooKeeper as a distributed coordination service to take care of metadata reliability and consistency management. Our ZooKeeper-based prototype shows the main trends that can be expected when using a distributed coordination service for metadata management. From our experiments, we can see that with a higher number of processes running on the client nodes, and as the load on the client nodes increases, the approach proposed in this thesis scales well compared to the other distributed filesystems studied, Lustre and PVFS2. While Lustre performs very well for small numbers of clients, its performance drops as the number of clients increases. With our approach, we are able to maintain good performance even with a large number of clients: with 256 client processes, we outperform Lustre for the six metadata operations, namely directory creation, directory removal, directory stat, file creation, file removal and file stat.

One major drawback of our approach is memory usage, because the ZooKeeper servers keep all their data in memory. Future work will focus on addressing this issue. Additionally, we plan to replace our MD5-based mapping function with one based on consistent hashing [26]. This approach will allow back-end storage to be added and removed dynamically while ensuring that the amount of data to relocate remains bounded.

Bibliography

[1] Clustered MetaData. http://wiki.lustre.org/index.php/Clustered_Metadata.

[2] IOZONE Filesystem benchmark. http://www.iozone.org/.

[3] Isilon Systems Inc. http://www.isilon.com.

[4] Oracle Lustre File System. http://wiki.lustre.org/index.php/Main_Page.

[5] Postmark File System Benchmark. http://shub-internet.org/brad/FreeBSD/postmark.html.

[6] Amina Saify, Garima Kochhar, Jenwei Hsieh, and Onur Celebioglu. Enhancing High-Performance Clusters with Parallel File Systems.

[7] Peter J. Braam and Michael Callahan. The InterMezzo file system, 1999.

[8] Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Lan Xue. Efficient metadata management in large distributed storage systems. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), MSS '03, Washington, DC, USA. IEEE Computer Society.

[9] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. OSDI ’06, Berkeley, CA, USA. USENIX Association.

[10] Philip H. Carns, Walter B. Ligon, III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference. MIT Press, 2000.

[11] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data.

[12] G. Goodson, B. Welch, B. Halevy, D. Black, and A. Adamson. NFSv4 pNFS extensions. Technical report.

[13] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, New York, NY, USA, 2003. ACM.

[14] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, USENIXATC'10, Berkeley, CA, USA, 2010. USENIX Association.

[15] James H. Morris, Mahadev Satyanarayanan, Michael H. Conner, John H. Howard, David S. Rosenthal, and F. Donelson Smith. Andrew: a distributed personal computing environment. Commun. ACM.

[16] Swapnil V. Patil, Garth A. Gibson, Sam Lang, and Milo Polte. GIGA+: scalable directories for shared file systems. In Proceedings of the 2nd International Workshop on Petascale Data Storage: held in conjunction with Supercomputing '07, PDSW '07, New York, NY, USA, 2007. ACM.

[17] Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, and David Hitz. NFS version 3 - design and implementation. In Proceedings of the Summer USENIX Conference, pages 137-152, 1994.

[18] David Quigley, Josef Sipek, Charles P. Wright, and Erez Zadok. Unionfs: user- and community-oriented development of a unification filesystem. In Proceedings of the 2006 Linux Symposium, 2006.

[19] Ronald L. Rivest. The MD5 message digest algorithm. Internet RFC 1321, 1992.

[20] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file system workloads. In Proceedings of the Annual USENIX Technical Conference, ATEC '00, Berkeley, CA, USA, 2000. USENIX Association.

[21] Mahadev Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, and David C. Steere. Coda: a highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39:447-459, 1990.

[22] Frank Schmuck and Roger Haskin. GPFS: a shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02. USENIX Association.

[23] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC '04, Washington, DC, USA, 2004. IEEE Computer Society.

[24] Gongye Zhou, Qiuju Lan, and Jincai Chen. A dynamic metadata equipotent subtree partition policy for mass storage system. In Proceedings of the 2007 Japan-China Joint Workshop on Frontier of Computer Science and Technology, FCST '07, Washington, DC, USA. IEEE Computer Society.

[25] Yifeng Zhu, Hong Jiang, and J. Wang. Hierarchical bloom filter arrays (hba): a novel, scalable metadata management system for large cluster-based storage. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, Washington, DC, USA. IEEE Computer Society.
