Minimizing Lookup RPCs in Lustre File System using Metadata Delegation at Client Side

Vilobh Meshram, Xiangyong Ouyang and Dhabaleswar K. Panda
Department of Computer Science and Engineering, The Ohio State University
{meshram, ouyangx, panda}@cse.ohio-state.edu

Abstract

Lustre is a massively parallel distributed file system and its architecture scales well to large amounts of data. However, the performance of Lustre can be limited by the load of metadata operations at the Metadata Server (MDS). Because of the high capacity of parallel file systems, they are often used to store and access millions of small files. These small files may create a metadata bottleneck, especially for file systems that have only a single active metadata server. Also, in case of Lustre installations with a Single Metadata Server or with a Clustered Metadata Server, the path traversal from each client involves multiple LOOKUP Remote Procedure Calls (RPCs) before the client receives the actual metadata and the extended attributes, adding additional pressure on the MDS. In this paper we propose an approach to minimize the number of LOOKUP RPC calls to the MDS. We have designed a new scheme called Metadata Delegation at Client Side (MDCS) to delegate part of the metadata and the extended attributes for a file to the client, so that we can load balance the traffic at the MDS by delegating some useful information to the client. Initial experimental results show that we can achieve up to 45% improvement in metadata operation throughput for system calls such as file open. MDCS also reduces the latency to open a Lustre file by 33%.

1 Introduction

Modern distributed file system architectures like Lustre [5, 3], PVFS [7], Google File System [13] or object based storage file systems [12, 16] separate the management of metadata from the storage of the actual file data. These architectures have proven to easily scale the storage capacity and bandwidth. However, the management of metadata remains a bottleneck [10, 19]. Studies have shown that approximately 75% of all the file system calls access metadata [18]. Therefore, the efficient management of metadata is crucial for the overall system performance. Along with this, it is very beneficial if we can minimize the time spent in communication between the interacting client and server nodes for file system calls which access metadata.

Lustre is a POSIX compliant, open-source distributed parallel filesystem. Due to the extremely scalable architecture of the Lustre filesystem, Lustre deployments are popular in scientific supercomputing, as well as in the oil and gas, manufacturing, rich media, and finance sectors. Lustre presents a POSIX interface to its clients with parallel access capabilities to the shared file objects. As of this writing, 18 of the top 30 fastest supercomputers in the world use the Lustre filesystem for high-performance scratch space. Lustre is an object-based filesystem. It is composed of three components: Metadata Servers (MDSs), Object Storage Servers (OSSs), and clients. Figure 1 illustrates the Lustre architecture. Lustre uses block devices for file data and metadata storage, and each block device can be managed by only one Lustre service. The total data capacity of the Lustre filesystem is the sum of all individual OST capacities. Lustre clients access and concurrently use data through the standard POSIX I/O system calls.

The MDS provides metadata services. Correspondingly, an MDC (metadata client) is a client of those services. One MDS per filesystem manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions. The OSS (object storage server) exposes the block devices and serves data. Correspondingly, the OSC (object storage client) is a client of those services. Each OSS manages one or more object storage targets (OSTs), and OSTs store file data objects.

In our studies with the Lustre filesystem, we noticed a potential performance hit in the metadata operation path caused by repeated LOOKUP RPC calls from Lustre clients to the MDS.

Since path LOOKUP is one of the most frequently seen metadata operations in a typical filesystem workload, this issue may result in performance degradation at the Lustre MDS.

In this paper we propose a new approach, which we call "Metadata Delegation at Client Side" (MDCS). With this new strategy, we are able to minimize the number of LOOKUP RPC [9] calls and hence the request traffic on the MDS. We delegate metadata related attributes to special clients, which we call "Delegator Clients"; these are the client nodes that open or create the files initially. In case of increased workload at the MDS, we distribute the load by assigning ownership to different clients depending upon which client first created or opened the file. So in an environment where a file is subsequently accessed by many clients at regular intervals, we can distribute the workload from the MDS to the Delegator Client, and the Normal Clients who subsequently access the file will fetch the metadata related attributes from the Delegator Client, which in turn reduces the load on the MDS.

In summary, our contributions in this paper are:

• We have identified and revealed a potential performance hit in the Lustre MDS that is caused by repeated LOOKUP RPC calls to the MDS.

• We have proposed a new approach (MDCS) to address this potential performance issue.

• We have implemented the MDCS mechanism in the Lustre filesystem. Initial experiments show up to 45% improvement in terms of metadata operation throughput over the basic Lustre design.

The rest of the paper is organized as follows. In Section 2, we describe the background of metadata operations in the Lustre File System and some well known problems in metadata access from the MDS. In Section 3, we explain how path name lookup is performed in Lustre and also explain the approach we use to minimize the number of RPCs by making use of the design presented in Section 4. In Section 4, we present our detailed design and discuss our design choices. In Section 5, we conduct experiments to evaluate our design and present results that indicate improvement. In Section 6, we discuss the related work. Finally, we provide our conclusion and state the direction of the research we intend to conduct in the future.

[Figure 1: Basic Lustre Design. The figure shows the Lustre clients, the Meta-Data Server (MDS), the Object Storage Targets (OST) and an LDAP server: clients perform directory operations, metadata and concurrency/locking traffic against the MDS, file I/O against the OSTs, and obtain configuration information, network connection details and security management from the LDAP server; recovery, file status and file creation involve the MDS and the OSTs.]

2 Background

2.1 Lustre Metadata Server and its components

The Lustre Metadata Server (MDS) [5, 8, 3] is a critical component of the Lustre File System. The Lustre File System has an important component known as the Lustre Distributed Lock Manager (LDLM), which ensures cached metadata integrity, i.e., atomic access to metadata. In Lustre we also have an LDLM component at the OSS/OST [5, 8, 3], but since in this paper we focus more on the metadata operations we will consider the LDLM which is present at the MDS. In a Single Metadata Server deployment, a single MDS manages the entire namespace; when a client wants to look up or create a name in that namespace, that single MDS is the sole owner of the entire namespace. In a Clustered Metadata Server design, on the other hand, each directory can be striped over multiple metadata servers, each of which contains a disjoint portion of the namespace. So when a client wants to look up or create a name in that namespace, it uses a hashing algorithm to determine which metadata server holds the information for that name.

A single MDS server can be a bottleneck. Also, the number of RPC calls made in the path name lookup can contribute a significant portion of Lustre File System performance. In an environment where many clients are performing interleaved access and I/O on the same file, the time spent in path name lookup can be large. We address this problem in this paper.
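To make the name-to-server mapping in the clustered design concrete, the following is a minimal C sketch of the idea. The djb2-style hash, the function names and the mds_count parameter are illustrative assumptions for this example, not the actual Lustre clustered-metadata code.

    /* Illustrative sketch only: how a clustered metadata design can map a
     * directory entry name to one of several metadata servers by hashing.
     * The hash function and the mds_count parameter are assumptions made
     * for this example, not Lustre's actual implementation. */
    #include <stddef.h>

    static unsigned long name_hash(const char *name, size_t len)
    {
            unsigned long hash = 5381;              /* djb2-style string hash */
            while (len--)
                    hash = (hash * 33) ^ (unsigned char)*name++;
            return hash;
    }

    /* Returns the index of the metadata server responsible for 'name'. */
    int pick_mds(const char *name, size_t len, int mds_count)
    {
            return (int)(name_hash(name, len) % (unsigned long)mds_count);
    }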

Table 1: Transaction throughput with a fixed file pool size of 1,000 files

    Number of transactions    Transactions per second
    1,000                     333
    5,000                     313
    10,000                    325
    20,000                    321

Table 2: Transaction throughput with a varying file pool size

    Number of files in pool   Number of transactions   Transactions per second
    1,000                     1,000                    333
    5,000                     5,000                    116
    10,000                    10,000                   94
    20,000                    20,000                   79

2.2 Single Lustre Metadata Server (MDS) Bottlenecks

The MDS is currently restricted to a single node, with a fail-over MDS that becomes operational if the primary server becomes nonfunctional. Only one MDS is ever operational at a given time. This limitation poses a potential bottleneck as the number of clients and/or files increases. IOzone [1] is used to measure the sequential file I/O throughput, and PostMark [6] is used to measure the scalability of the MDS performance. Since MDS performance is the primary concern of this research, we discuss the PostMark experiment in more detail. PostMark is a file system benchmark that performs a lot of metadata intensive operations to measure MDS performance. PostMark first creates a pool of small files (1KB to 10KB), and then starts many sequential transactions on the file pool. Each transaction performs two operations to either read/append a file or create/delete a file. Each of these operations happens with the same probability. The transaction throughput is measured to approximate workloads on an Internet server. Table 1 gives the measured transaction throughput with a fixed file pool size of 1,000 files and different numbers of transactions on this pool. The transaction throughput remains relatively constant as the transaction count varies. Since the cost for the MDS to perform an operation does not change at a fixed file number, this result is expected. Table 2, on the other hand, changes the file pool size and measures the corresponding transaction throughput. By comparing the entries in Table 2 with their counterparts in Table 1, it becomes clear that a large file pool results in a lower transaction throughput. The MDS caches the most recently accessed metadata of files (the inode of a file). A client file operation requires the metadata information about that file to be returned by the MDS. At larger numbers of files in the pool, a client request is less likely to be serviced from the MDS cache. A cache miss results in the MDS looking up its disk storage to load the inode of the requested file, which results in the lower transaction throughput in Table 2.

3 RPC Mechanism in Lustre Filesystem

In the following section we discuss the RPC mechanism in the Lustre Filesystem. Our experiments show that nearly 1500-2000 usecs are spent in an RPC on a TCP transport.

3.1 Existing RPC Processing with Lustre

When we consider the RPC processing in Lustre we also have to consider how lock processing works in Lustre [5, 8, 3, 19] and how our modifications can help to minimize the number of LOOKUP RPCs. Let us consider an example where a client C1 wants to open the file /tmp/lustre/d1/d2/foo.txt to read. In this case /tmp/lustre is our mount point. During the VFS path lookup, the Lustre specific lookup routine will be invoked. The first RPC request is a lock enqueue with lookup intent. This is sent to the MDS for a lock on d1. The second RPC request is also a lock enqueue with lookup intent and is sent to the MDS asking for an "inodebits" lock on d2. The lock returned is an inodebits lock, and its resources would be represented by the file identifiers (fids) of d1 and d2. The subtle point to note is that when we request a lock, we generally need a resource id for the lock we are requesting. However in this case, since we do not know the resource id for d1, we actually request a lock on its parent "/", not on d1 itself. In the intent, we specify it as a lookup intent and the name of the lookup is d1. Then, when the lock is returned, the lock is for d1. This lock is (or can be) different from what the client requested, and the client notices this difference and replaces the old lock requested with the new one returned.

The third RPC request is a lock enqueue with open intent, but it is not asking for a lock on foo.txt. That is, you can open and read a file without a lock from the MDS, since the content is provided by the Object Storage Target (OST). The OSS/OST also has an LDLM component and, in order to perform I/O on the OSS/OST, we request locks from an OST.
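To summarize the lookup sequence described so far, the small user-space counting model below (ordinary C, not Lustre client code) tallies the intent-lock RPCs issued for the example path: one lookup-intent enqueue per intermediate directory plus one open-intent enqueue for the final component. The path handling and helper name are assumptions of this sketch.

    /* Counting model for the intent RPCs issued on open, as described above.
     * Not Lustre code; purely an illustration of the RPC count per path. */
    #include <stdio.h>
    #include <string.h>

    static int count_open_rpcs(const char *path_inside_mount)
    {
            int components = 0;
            char buf[256];
            strncpy(buf, path_inside_mount, sizeof(buf) - 1);
            buf[sizeof(buf) - 1] = '\0';

            for (char *tok = strtok(buf, "/"); tok; tok = strtok(NULL, "/"))
                    components++;          /* d1, d2, foo.txt -> 3 */

            /* (components - 1) lookup-intent enqueues + 1 open-intent enqueue */
            return components;
    }

    int main(void)
    {
            /* /tmp/lustre is the mount point, so only d1/d2/foo.txt counts. */
            printf("intent RPCs for d1/d2/foo.txt: %d\n",
                   count_open_rpcs("d1/d2/foo.txt"));   /* prints 3 */
            return 0;
    }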

In other words, what happens at open is that we send a lock request, which means we do ask for a lock from the LDLM server. But, in the intent data itself, we might (or might not) set a special flag if we are actually interested in receiving the lock back. The intent handler then decides (based on this flag) whether or not to return the lock. If foo.txt exists previously, then its fid, metadata information (such as ownership, group, mode, ctime, atime, mtime, nlink, etc.) and striping information are returned.

If client C1 opens the file with the O_CREAT flag and the file does not exist, the third RPC request will be sent with open and create intent, but still there will be no lock requests. Now on the MDS side, to create a file foo.txt under d2, the MDS will request through LDLM another EX lock on the parent directory. Note that this is a conflicting lock request with the previous CR lock on d2. Under normal circumstances, a fourth RPC request (blocking AST) will go to client C1 or anyone else who may have the conflicting locks, informing the client that someone is requesting a conflicting lock and requesting a lock cancellation. The MDS waits until it gets a cancel RPC from the client. Only then does the MDS get the EX lock it was asking for earlier and can proceed. If client C1 opens the file with the LOV_DELAY flag, the MDS creates the file as usual, but there is no striping and no objects are allocated. The user will issue an ioctl call and set the stripe information, and then the MDS will fill in the EA structure.

3.2 How Do We Improve RPC Processing

In the client side metadata delegation approach, during the creation of the file we cache some information both at the Client and at the MDS. At the MDS side we keep track of which Client has created which files, so that, if needed, the metadata information for those files can be delegated to these Clients in the future. Such clients are known as "Delegator Clients", whereas the term "Normal Clients" is used for clients who access the file in later runs to perform some I/O. Normal Clients are diverted to the Delegator Clients by the MDS if the load at the MDS increases. Along with the caching, we also maintain some book-keeping information at the Clients to keep track of the lock related information, the pathname and the inode information which we get for the respective dentry object. So apart from the dentry cache maintained by the operating system, we have additional book-keeping information to map the file handle to pathname information.

So if we consider the above example and we want to traverse /tmp/lustre/d1/d2/foo.txt, we will perform similar steps and update the book-keeping information. In the first iteration, if the file does not exist it is created. For subsequent access of the same file by different clients, in the first round we will have to go through the entire path name lookup so that we populate our cache and the book-keeping information. So if the same file is accessed in an interleaved manner by clients who have already traversed the path once, we can avoid the additional LOOKUP RPCs, as we can service them by making use of the book-keeping information. In basic Lustre we do a LOOKUP RPC for each intermediate component in the path, whereas in our approach we can service the request for an intermediate component by making use of the book-keeping information and the local cache maintained at the Client. The size of this book-keeping information is very small and will not consume much of the memory at the Client side.

4 Design of MDCS

Consider a scenario where Client 1 tries to open a file, say /tmp/lustre/test.txt, where /tmp/lustre is the mount point. Figure 2 gives a detailed description of the MDCS design. In Step 1, the Client does a LOOKUP RPC to the MDS as discussed in detail in Section 3. In Step 2, the processing is done at the MDS side, where the Lock Manager will grant the lock for the resource requested by the Client. A second RPC will be sent from the Client to the MDS with the intent to create or open the file. So at the end of Step 2, Client 1 will get the lock, the extended attribute information and other metadata details. Conceptually, Step 1 and Step 2 are similar to what we have in the current Lustre design, but in our approach we modify Step 2 slightly. In our approach, in Step 2 we make an additional check at the MDS to see if this is a first-time access to the file. First-time access means that this is the first time the metadata related information for this file is created on the MDS and that the metadata caches maintained by the kernel do not yet have the metadata related information cached. So if this is a first-time access, we keep a data structure to keep track of who owns the file and do some validation of whether it is a first-time access. At the end of Step 2 we get the needed information from the MDS to Client 1. We have written a communication module, which is a kernel module that is used for one-sided operations. We have also implemented a hashing functionality in the communication module to speed up the lookup process both at the MDS and the Client side.
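A minimal sketch of the two kinds of records described above is shown below, assuming hypothetical structure and field names; the actual MDCS patch may lay this information out differently.

    /* Sketch of the book-keeping described above. The struct and field
     * names are hypothetical illustrations of the idea, not the
     * structures used in the actual MDCS implementation. */
    #include <stdint.h>

    #define MDCS_PATH_MAX 4096

    /* Kept on the MDS, hashed by file name: which client first
     * created/opened the file and therefore owns the delegation. */
    struct mdcs_owner_record {
            char     path[MDCS_PATH_MAX];   /* pathname relative to mount  */
            uint64_t fid;                   /* Lustre file identifier      */
            uint32_t delegator_nid;         /* node id of Delegator Client */
            uint8_t  first_access_done;     /* set after the first open    */
    };

    /* Kept on each client, alongside the kernel dentry cache: maps an open
     * file handle back to the pathname, lock and inode information. */
    struct mdcs_client_entry {
            uint64_t file_handle;           /* handle returned at open     */
            char     path[MDCS_PATH_MAX];   /* full pathname               */
            uint64_t lock_cookie;           /* LDLM lock book-keeping      */
            uint64_t inode_number;          /* inode for the dentry object */
    };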

We make use of the communication module for one-sided operations like remote memory read and remote memory write.

In Step 3 we expose the buffers with the information, such as the extended attributes, that will be useful for clients who subsequently access the file that was opened by Client 1. We call such a client, who exposes the needed buffer information, the new owner of the file and use the term "Delegator Client" in this paper. Now when Client 2 tries to open the same file, it performs an RPC in Step 4 as we do in Step 1. The client accessing the file, which we call a "Normal Client" in this paper, does a lookup in the hash table at the MDS side that we updated in Step 1 and finds that Client 1 is the owner of the file. So instead of spending additional time at the MDS side, we return the needed information to Client 2. At the end of Step 5, i.e., in Step 6, Client 2 will contact Client 1 and fetch the information which was stored in the buffers exposed by the Delegator Client for this specific file. We use our communication module to speed up this process using one-sided operations in Step 6 and Step 7. Once Client 2 gets the needed information from Client 1, it can proceed ahead with its I/O operations. The design can also help in minimizing the request traffic at the MDS end by delegating some load to the client. In our approach, since we maintain a hash table to do the lookup, if we see a matching filename we divert the client asking for the metadata information to the Delegator Client. If the hash table does not have the needed mapping information, then this is the first time that the file is being accessed; otherwise we return the cached information to this new client, depending on the information we cached for the file when it was first opened or created by the Delegator Client. This architecture allows the authoritative copy of each metadata item to reside on different clients, distributing the workload. In addition, the delegation record is much smaller, allowing the MDS cache to be more effective as the number of files accessed increases. This architecture will work well when many clients are accessing many files, as the workload will be distributed amongst clients. However, using a delegate in a scenario where all clients are accessing one file moves the hot-spot from the powerful MDS to a client. In future work we can add a mechanism for the metadata server to determine the popularity of a single file. An access counter will be associated with each inode. Each client request will increase the counter by 1, and its value will decay over time. A high access value will be a sign of a very popular file, and the MDS will recall the delegated inode with a high access value. Any following request to that file will be serviced by the MDS directly from its cache without network redirection. On the other hand, a delegate client can voluntarily return all inodes it delegates to the MDS if it thinks it is overburdened. In addition, the MDS must be able to recover from client failure by reclaiming the MDS record from the failed node during recovery. We plan to work on this in our future work.
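The following user-space sketch illustrates the decision the modified MDS makes on an open request: serve the request itself on a first-time access and record the owner, or redirect the Normal Client to the recorded Delegator Client. The linear-scan table and function names are assumptions made for illustration, not the kernel implementation.

    /* Sketch of the Step 2/Step 4 decision at the MDS, as described above.
     * Names and the linear-scan "hash table" are illustrative only. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_FILES 128

    struct owner_entry {
            char path[256];
            int  delegator_nid;     /* node id of the Delegator Client */
    };

    static struct owner_entry table[MAX_FILES];
    static int nentries;

    /* Returns the delegator to redirect to, or -1 if the MDS serves the
     * request itself (first-time access by 'client_nid'). */
    int mds_open_decision(const char *path, int client_nid)
    {
            for (int i = 0; i < nentries; i++)
                    if (strcmp(table[i].path, path) == 0)
                            return table[i].delegator_nid;   /* redirect (Step 5) */

            if (nentries < MAX_FILES) {                      /* first access */
                    strncpy(table[nentries].path, path, sizeof(table[0].path) - 1);
                    table[nentries].delegator_nid = client_nid;
                    nentries++;
            }
            return -1;                                       /* MDS serves it */
    }

    int main(void)
    {
            printf("%d\n", mds_open_decision("/tmp/lustre/test.txt", 1)); /* -1 */
            printf("%d\n", mds_open_decision("/tmp/lustre/test.txt", 2)); /*  1 */
            return 0;
    }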
5 Experimental Results

We have implemented our design in Lustre 1.8.1.1 to minimize the number of RPC calls during a metadata operation. In this section, we conduct experiments to evaluate the metadata operation performance with our proposed design. One node acts as the Lustre Metadata Server (MDS), and two nodes are Lustre Object Storage Servers (OSS). The Lustre filesystem is mounted on eight other nodes, which act as Lustre client nodes. Each node runs kernel 2.6.18-128.7.1.el5 with Lustre 1.8.1.1. Each node has dual Intel Xeon E5335 CPUs (8 cores in total) and 4GB memory. They are interconnected with 1GigE for general purpose networking. In our testing we configure Lustre to use the "tcp" transport in different runs.

In order to measure the performance of metadata operations such as open, we have developed a parallel micro-benchmark. We have extended the basic "fileop" testing tool that comes with the IOzone [1] benchmark to support parallel runs with multiple processes on many Lustre client nodes. The extended "fileop" tool creates a file tree structure for each process. This tree structure contains X Level 1 directories, with each Level 1 directory having Y Level 2 directories. The total number of levels of subdirectories can be configured at runtime. Within each of the bottom level directories, Z files are created. By varying the size (fan-out) of each layer, we can generate different numbers of files in a file tree. We have developed an MPI parallel program to start multiple processes on multiple nodes. Each process works on a separate directory to create its aforementioned file tree. After that, each process walks through its neighbor's file tree to open each of the files in that sub-tree. This simulates the scenario where multiple client processes take turns to access a shared pool of files. After that, the wall clock time on all the processes is summarized and the total IOPS for the open system call is reported.

In order to perform the tests, we created some number of files from a specific Client and those files were accessed subsequently by other clients in an interleaved manner.

Figure 2: Modified Lustre with Metadata Delegation to Clients

Using the PostMark benchmark [6] we could not simulate this kind of scenario, because in the PostMark benchmark we create some number N of files and, as soon as the open/create or read/append operation is complete, the file pool is deleted. So we make use of the above mentioned micro-benchmark to perform the tests and obtain the experimental results.

In order to see the benefits of the proposed approach in minimizing the RPCs, we carried out three different types of tests using our micro-benchmark: 1) IOPS for open using our parallel benchmark for different numbers of client processes; 2) IOPS for open using our parallel benchmark for different file pool sizes; 3) time spent in open for varying path name depth.
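A simplified version of the MPI driver described above might look as follows; the directory layout, file counts and output format are assumptions of this sketch rather than the exact extended "fileop" tool.

    /* Sketch of the parallel open benchmark: each rank opens every file in
     * its neighbor's pre-created file tree and rank 0 reports the aggregate
     * open() IOPS. The flat /tmp/lustre/proc<rank>/f<i> layout is a
     * simplification assumed only for this example. */
    #include <mpi.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define FILES_PER_RANK 10000

    int main(int argc, char **argv)
    {
            int rank, size, opened = 0, total_opened;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int neighbor = (rank + 1) % size;      /* walk the neighbor's tree */

            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < FILES_PER_RANK; i++) {
                    char path[256];
                    snprintf(path, sizeof(path), "/tmp/lustre/proc%d/f%d",
                             neighbor, i);
                    int fd = open(path, O_RDONLY);
                    if (fd >= 0) {
                            opened++;
                            close(fd);
                    }
            }
            double elapsed = MPI_Wtime() - t0, max_elapsed;

            MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
                       MPI_COMM_WORLD);
            MPI_Reduce(&opened, &total_opened, 1, MPI_INT, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0)
                    printf("aggregate open IOPS: %.0f\n",
                           total_opened / max_elapsed);
            MPI_Finalize();
            return 0;
    }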

5.1 File Open IOPS: Varying Number of Client Processes

In this test, we first create the aforementioned file tree, each containing 10000 files, for every client process, and then let each process access its neighbor's file tree. Figure 3 shows the aggregated number of IOPS for the open system call on the Lustre filesystem. We vary the number of client processes from 2 to 16, evenly distributed on 8 client nodes. With 2 processes, only two client nodes are actually used. With 16 processes, 2 client processes run on each of the 8 client nodes.

[Figure 3: File open IOPS, each process accesses 10000 files. Y-axis: number of open() per second (0-4,000); x-axis: number of client processes (2, 8, 16); series: Basic Lustre vs. MDCS: Modified Lustre.]

As seen in Figure 3, the modified Lustre with MDCS improves the aggregated IOPS over the basic Lustre significantly. Compared to the basic Lustre, our design reduces the number of RPC calls in the metadata operation path and therefore helps improve the overall performance. With two client processes, the new approach (MDCS) raises file open IOPS from 2528 per second to 3612 per second, an improvement of 43%. The improvements with 8/16 processes are 23% and 25%, respectively. As more client processes access the shared Lustre filesystem concurrently, the MDS sees a higher degree of contention for request processing. As a result, the total file open IOPS drops slightly for MDCS. With basic Lustre, on the other hand, the Metadata Server has the potential to handle 8 concurrent client processes, given its slightly higher file open IOPS with 8 concurrent client processes. When 16 processes are used, however, the MDS's performance drops due to the high contention, similar to what we see with the MDCS approach.
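For reference, the 43% figure follows directly from the measured IOPS values:

\[
\frac{3612 - 2528}{2528} \approx 0.43,
\]

i.e., roughly a 43% increase in file open IOPS with two client processes.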

5.2 File Open IOPS: Varying File Pool Size

In this test we carry out the same basic steps as mentioned in Section 5.1, but we vary the number of files in each file tree per process, while using the same 16 client processes. We wanted to understand the significance of this factor from the performance perspective. Figure 4 shows the experimental results. It clearly demonstrates the benefits of our MDCS design. With the capability to minimize RPC calls to the centralized MDS, we can achieve 35% better IOPS in terms of file open.

We also observe that, by varying the file pool size for a constant number of processes, we do not see a large deviation in the number of IOPS for open. We speculate this is caused by the fact that the file pool size used in our test is not big enough to stress the memory on the MDS, such that most of the files' metadata information is stored in the MDS's memory cache. As a result, the aggregated metadata operation throughput remains constant with different file pool sizes. In a future study we will experiment with a larger file pool to push the memory limit of the MDS.

[Figure 4: File open IOPS, using 16 client processes. Y-axis: number of open() per second (0-4,000); x-axis: number of files in the file pool (5,000; 10,000; 100,000); series: Basic Lustre vs. MDCS: Modified Lustre.]

5.3 File Open IOPS: Varying File Path Depth

In this test we want to measure the performance benefit of the new MDCS approach when accessing files with different file path depths, i.e., numbers of components in the file path. We start with creating a file tree for each of the client processes containing 10,000 files, with a file path depth of 3 or 4. After that, each process begins to access files within its neighbor process's file tree.

Figure 5 compares the time spent to open one file, either with basic Lustre or with the MDCS modified Lustre filesystem. First of all, it shows MDCS can significantly reduce the time cost to open one file, by up to 33%. We also observe that the pathname component count has a significant impact on the total cost of a metadata operation. Each file path component has to be resolved using one RPC to the MDS, hence a deeper file path leads to a longer processing time.

[Figure 5: Time to finish open, using 16 processes each accessing 10000 files. Y-axis: time to finish one open() in milliseconds (0-25); x-axis: number of components in the file path (file path depth: 3, 4); series: Basic Lustre vs. MDCS: Modified Lustre.]

6 Related Work

The Lustre design with a Single MDS and the Clustered Metadata design [5, 8, 3, 4] are the currently used designs of the Lustre File System. The authors in [20, 17] discuss the evaluation of the performance model of the Lustre File System. The authors in [14, 15, 11] explore the interaction of MPI-IO and the Lustre File System. The scenario we focus on in this paper, i.e., the scenario where the pathname lookup RPCs can account for a large part of the actual performance numbers, can be common in MPI kinds of environments. In this paper we propose an approach called Client Side Metadata Delegation, in which we propose a scheme to minimize the number of LOOKUP RPC calls for the interacting clients and hence reduce the request traffic on the MDS.

There is an ongoing effort in the community [2] regarding a new system call named name_to_handle to allow a user process to get a file handle (i.e., a binary blob returned from a new name_to_handle syscall) from the kernel for a given pathname, and then later use that file handle in another process to open a file descriptor without re-traversing the path. While this would not eliminate the actual MDS open RPC, it could avoid the path traversal from each client, saving a considerable amount of time in the pathname lookup. We try to solve the problem in the Lustre design itself with MDCS.
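For illustration, the handle-based interface that later became available in mainline Linux is name_to_handle_at()/open_by_handle_at(); the sketch below uses those glibc wrappers only to show the idea of reopening by handle without re-walking the path. It is not part of the MDCS work, the example path is hypothetical, and open_by_handle_at() additionally requires the CAP_DAC_READ_SEARCH capability.

    /* Obtain a file handle once, then reopen by handle without walking
     * d1 and d2 again. Illustrative only; not related to the MDCS code. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            struct file_handle *fh;
            int mount_id, mount_fd, fd;

            fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
            fh->handle_bytes = MAX_HANDLE_SZ;

            /* One full path traversal to obtain the handle... */
            if (name_to_handle_at(AT_FDCWD, "/tmp/lustre/d1/d2/foo.txt",
                                  fh, &mount_id, 0) == -1) {
                    perror("name_to_handle_at");
                    return 1;
            }

            /* ...later (possibly in another process) the handle is reopened
             * without re-traversing the path components. */
            mount_fd = open("/tmp/lustre", O_RDONLY | O_DIRECTORY);
            fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
            if (fd == -1)
                    perror("open_by_handle_at");

            printf("reopened fd: %d\n", fd);
            free(fh);
            return 0;
    }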

7 Conclusion and Future Work

In this paper we propose a new approach called Client Side Metadata Delegation to minimize the number of RPCs and hence to minimize the traffic and load at the MDS. In the case of a Single MDS, when a file is subsequently accessed in an interleaved manner by various clients, minimizing the LOOKUP RPCs can definitely help to improve the performance. The approach proposed in this paper is also applicable to the Clustered Metadata design.

We evaluated the potential benefits of the proposed approach by using a parallel I/O micro-benchmark. Approximately 1500-2000 µsecs are spent in an RPC on a TCP transport. This factor may vary depending on the fan-out of the directory structure and the number of elements in the path. We used varying fan-out and pathname lengths to perform metadata operations, since we are most concerned about the load at the MDS. After analyzing the experimental results, we came to the conclusion that the amount of time saved in RPCs is approximately equal to (number of path elements) * (number of clients accessing the file).
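Written out as a rough model (an illustration of the estimate above, not a measured result), with the per-RPC cost taken from the 1500-2000 µs observed in Section 3:

\[
T_{saved} \;\approx\; N_{path} \times N_{clients} \times t_{RPC},
\qquad t_{RPC} \approx 1.5\text{ to }2\ \mathrm{ms}.
\]

For example, a 4-component path accessed by 16 clients would save on the order of 4 x 16 x 1.5 ms, i.e., roughly 96 ms of aggregate lookup time.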
As part of future work, we intend to conduct experiments at larger scale to understand the memory pressure at the MDS when taking the Client Side Metadata Delegation approach into account. We also want to create a distributed lock management API at the Client side to synchronize the metadata delegated to the Delegator Client and the local copies of the metadata which the Normal Clients have with them.

References

[1] IOzone. http://www.iozone.org/.
[2] Lustre Devel Queries. http://lists.lustre.org/pipermail/lustre-devel/2011-January/003703.html.
[3] Lustre File System, High-Performance Storage Architecture and Scalable Cluster File System. http://www.raidinc.com/pdf/whitepapers/.
[4] Lustre Networking: High-Performance Features and Flexible Support for a Wide Array of Networks. https://www.sun.com/offers/details/lustre-networking.xml.
[5] Lustre Wiki Page. http://wiki.lustre.org/.
[6] PostMark. http://www.shub-internet.org/brad/FreeBSD/postmark.html.
[7] PVFS2. http://www.pvfs.org/.
[8] Sun Microsystems, Inc. Lustre 1.8 Operations Manual. http://wiki.lustre.org/manual/.
[9] Andrew D. Birrell and Bruce Jay Nelson. Implementing remote procedure calls. ACM Trans. Comput. Syst., 2:39-59, February 1984.
[10] Philip Carns, Sam Lang, Robert Ross, Murali Vilayannur, Julian Kunkel, and Thomas Ludwig. Small-file access in parallel file systems. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1-11, Washington, DC, USA, 2009. IEEE Computer Society.
[11] Phillip Dickens and Jeremy Logan. Towards a high performance implementation of MPI-IO on the Lustre file system. In Proceedings of the OTM 2008 Confederated International Conferences, CoopIS, DOA, GADA, IS, and ODBASE 2008, Part I, OTM '08, pages 870-885, Berlin, Heidelberg, 2008. Springer-Verlag.
[12] M. Factor, K. Meth, D. Naor, O. Rodeh, and J. Satran. Object storage: the future building block for storage systems. In Local to Global Data Interoperability - Challenges and Technologies, 2005, pages 119-123, 2005.
[13] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37:29-43, October 2003.
[14] W.-K. Liao, A. Ching, K. Coloma, Alok Choudhary, and L. Ward. An implementation and evaluation of client-side file caching for MPI-IO. In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pages 1-10, 2007.
[15] J. Logan and P. Dickens. Towards an understanding of the performance of MPI-IO in Lustre file systems. In 2008 IEEE International Conference on Cluster Computing, pages 330-335, 2008.
[16] M. Mesnier, G. R. Ganger, and E. Riedel. Object-based storage. IEEE Communications Magazine, 41(8):84-90, 2003.
[17] Juan Piernas, Jarek Nieplocha, and Evan J. Felix. Evaluation of active storage strategies for the Lustre parallel file system. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07), pages 1-10, 2007.
[18] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file system workloads. In Proceedings of the Annual USENIX Technical Conference, ATEC '00, pages 4-4, Berkeley, CA, USA, 2000. USENIX Association.
[19] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC '04, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Tiezhu Zhao, V. March, Shoubin Dong, and S. See. Evaluation of a performance model of Lustre file system. In ChinaGrid Conference (ChinaGrid), 2010 Fifth Annual, pages 191-196, 2010.
