Distributed Metadata Management for Parallel Filesystems

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Vilobh Meshram, B.Tech(Computer Science)

Graduate Program in Computer Science and Engineering

The Ohio State University

2011

Master’s Examination Committee:

Dr. D. K. Panda, Advisor
Dr. P. Sadayappan

Copyright by

Vilobh Meshram

2011

Abstract

Much of the research in storage systems has been focused on improving the scale and performance of data access for applications that read and write large amounts of file data. Parallel file systems do a good job of scaling large-file access bandwidth by striping or sharing I/O resources across many servers or disks. However, the same cannot be said about scaling file metadata operation rates.

Most existing parallel filesystems choose to concentrate all the metadata processing load on a single server. This centralized processing can guarantee correctness, but it severely hampers scalability. This downside is becoming more and more unacceptable, as metadata throughput is critical for large-scale applications. Distributing the metadata processing load is critical to improving metadata scalability when handling a huge number of client nodes. However, in such a distributed scenario, a solution to speed up metadata operations has to address two challenges simultaneously, namely scalability and reliability.

We propose two approaches to solve the challenges mentioned above for metadata management in parallel filesystems, with a focus on the reliability and scalability aspects. As demonstrated by experiments, our approach to distributed metadata management achieves significant improvements over native parallel filesystems by a large margin for all the major metadata operations. With 256 client processes, our approach to distributed metadata management outperforms Lustre and PVFS2 by factors of 1.9 and 23, respectively, for directory creation. With respect to the stat() operation on files, our approach is 1.3 and 3.0 times faster than Lustre and PVFS2, respectively.

This work is dedicated to my parents and my sister

Acknowledgments

I consider myself extremely fortunate to have met and worked with some remarkable people during my stay at Ohio State. While a brief note of thanks does not do justice to their impact on my life, I deeply appreciate their contributions.

I begin by thanking my advisor, Dr. Dhabaleswar K. Panda. His guidance and advice during the course of my Master's studies have shaped my career. I am thankful to Dr. P. Sadayappan for agreeing to serve on my Master's examination committee.

Special thanks to Xiangyong Ouyang for all the support and help. I would also like to thank Dr. Xavier Besseron for his insightful comments and discussions, which helped me to strengthen my thesis. I am especially grateful to Xiangyong, Xavier and Raghu, and I feel lucky to have collaborated closely with them. I would like to thank all my friends in the Network Based Computing Research Laboratory for their friendship and support.

Finally, I thank my family, especially my parents and my sister. Their love, affection, and faith have been a constant source of strength for me. None of this would have been possible without them.

Vita

April 18, 1986 ...... Born - Amravati, India

2007 ...... B.Tech., Computer Science, COEP, Pune University, Pune, India
2007-2009 ...... Software Development Engineer, Symantec R&D India
2010-2011 ...... Graduate Research Associate, The Ohio State University

Publications

Research Publications

Vilobh Meshram, Xavier Besseron, Xiangyong Ouyang, Raghunath Rajachandrasekar and Dhabaleswar K. Panda. Can a Decentralized Metadata Service Layer Benefit Parallel Filesystems? Accepted at the IASDS 2011 workshop, held in conjunction with Cluster 2011.

Vilobh Meshram, Xiangyong Ouyang and Dhabaleswar K. Panda. Minimizing Lookup RPCs in Lustre using Metadata Delegation at Client Side. OSU Technical Report OSU-CISRC-7/11-TR20, July 2011.

Raghunath Rajachandrasekar, Xiangyong Ouyang, Xavier Besseron, Vilobh Meshram and Dhabaleswar K. Panda. Can Checkpoint/Restart Mechanisms Benefit from Hierarchical Data Staging? To appear at the Resilience 2011 workshop, held in conjunction with Euro-Par 2011.

Fields of Study

Major Field: Computer Science and Engineering

Studies in High Performance Computing: Prof. D. K. Panda

Table of Contents


Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction

1.1 Parallel Filesystems
1.2 Metadata Management in Parallel Filesystems
1.3 Distributed Coordination Service
1.4 Motivation of the Work
   1.4.1 Metadata Server Bottlenecks
   1.4.2 Consistency Management of Metadata
1.5 Problem Statement
1.6 Organization of Thesis

2. Related Work

2.1 Metadata Management approaches
2.2 Scalable filesystem directories

3. Delegating metadata at client side (DMCS)

3.1 RPC Processing in Lustre Filesystem
3.2 Existing Design
3.3 Design and challenges for delegating metadata at client side
   3.3.1 Design of communication module
   3.3.2 Design of DMCS approach
   3.3.3 Challenges
   3.3.4 Metadata revocation
   3.3.5 Distributed Lock management for DMCS approach
3.4 Performance Evaluation
   3.4.1 File Open IOPS: Varying Number of Client Processes
   3.4.2 File Open IOPS: Varying File Pool Size
   3.4.3 File Open IOPS: Varying File Path Depth
3.5 Summary

4. Design of a Decentralized Metadata Service Layer for Distributed Metadata Management

4.1 Detailed design of Distributed Union FileSystem (DUFS)
   4.1.1 Implementation Overview
   4.1.2 FUSE-based Filesystem Interface
4.2 ZooKeeper-based Metadata Management
   4.2.1 File Identifier
   4.2.2 Deterministic mapping function
   4.2.3 Back-end storage
4.3 Algorithm examples for Metadata operations
   4.3.1 Reliability concerns
4.4 Performance Evaluation
   4.4.1 Distributed coordination service throughput and memory usage experiments
   4.4.2 Scalability Experiments
   4.4.3 Experiments with varying number of distributed coordination service servers
   4.4.4 Experiment with different number of mounts combined using DUFS
   4.4.5 Experiments with different back-end parallel filesystems
4.5 Summary

5. Contributions and Future Work

5.1 Summary of Research Contributions and Future Work
   5.1.1 Delegating metadata at client side
   5.1.2 Design of a decentralized metadata service layer for distributed metadata management

Bibliography

List of Tables


1.1 LDLM and Oprofile Experiments
1.2 Transaction throughput with a fixed file pool size of 1,000 files
1.3 Transaction throughput with varying file pool size
1.4 Transaction throughput with a fixed file pool size and varying number of transactions
3.1 Metadata operation rates with different underlying storage

List of Figures


1.1 Basic Lustre Design
1.2 Zookeeper Design
1.3 Example of consistency issue with 2 clients and 2 MetaData servers
3.1 Design of DMCS approach
3.2 File open IOPS, Each Process Accesses 10,000 Files
3.3 File open IOPS, Using 16 Client Processes
3.4 Time to Finish open, Using 16 Processes Each Accessing 10,000 Files
4.1 DUFS mapping from the virtual path to the physical path using File Identifier (FID)
4.2 DUFS overview. A, B, C and D show the steps required to perform an open() operation.
4.3 Sample physical filename generated from a given FID
4.4 Algorithm for the mkdir() operation
4.5 Algorithm for the stat() operation
4.6 ZooKeeper throughput for basic operations by varying the number of ZooKeeper Servers
4.7 ZooKeeper memory usage and its comparison with DUFS and basic FUSE-based file system memory usage
4.8 Scalability experiments with 8 client nodes and varying number of client processes
4.9 Scalability experiments with 16 client nodes and varying number of client processes
4.10 Operation throughput by varying the number of ZooKeeper Servers
4.11 File operation throughput for different numbers of back-end storage
4.12 Operation throughput with respect to the number of clients for Lustre and PVFS2

Chapter 1: INTRODUCTION

High-performance computing (HPC) is an integral part of today's scientific, economic, social, and commercial fabric. We depend on HPC systems and applications for a wide range of activities such as climate modeling, drug research, weather forecasting, and energy exploration. HPC systems enable researchers and scientists to discover the origins of the universe, design automobiles and airplanes, predict weather patterns, model global trade, and develop life-saving drugs. Because of the nature of the problems that they are trying to solve, HPC applications are often data-intensive.

Scientific applications in astrophysics (CHIMERA and VULCAN2D), climate modeling (POP), combustion (S3D), fusion (GTC), visualization, astronomy, and other fields generate or consume large volumes of data. This data is on the order of terabytes and petabytes and is often shared by the entire scientific community. Today's computational requirements are increasing at a geometric rate and involve large quantities of data. While the computational power of microprocessors has kept pace with Moore's law as a result of increased chip densities, performance improvements in magnetic storage have not seen a corresponding increase. The result has been an increasing gap between the computational power and the I/O subsystem performance of current HPC systems. Hence, while processors keep getting faster, we do not see a corresponding improvement in application performance, because of the I/O bandwidth bottleneck.

Parallel file systems do a good job of improving the data throughput rate by striping or sharing I/O resources across many servers and disks. The same cannot be said about metadata operations. Every time a file is opened, saved, closed, searched, backed up or replicated, some portion of metadata is accessed. As a result, metadata operations fall in the critical path of a broad spectrum of applications. Studies [20,23] show that over 75% of all filesystem calls require access to file metadata. Therefore, efficient management of metadata is crucial for overall system performance.

Even though modern distributed file system architectures like Lustre [4], PVFS [10] and GFS [13] separate the management of metadata from the storage of the actual file data, the entire namespace is managed by a centralized metadata server. These architectures have proven to easily scale storage capacity and bandwidth. However, the management of metadata remains a bottleneck.

Recent trends in high-performance computing have also seen a shift toward distributed resource management. Scientific applications are increasingly accessing data stored in remote locations. This trend is a marked deviation from the earlier norm of co-locating an application and its data. In such a distributed environment, the management of metadata becomes even more difficult, as the important aspects of reliability, consistency and scalability need to be taken care of. As we saw above, in most parallel filesystems a single metadata server manages the entire namespace, so new approaches need to be designed for distributed metadata management. A few parallel filesystems have a design for better metadata management in order to overcome the single point of metadata bottleneck, but considering the complexity of distributed metadata management, the effort is still in progress.

Our research focuses on addressing these two problems. We have examined the existing paradigms and suggested better alternatives. In the first part, we focus on an approach for the Lustre filesystem to overcome the problem of a single point of bottleneck. In the second part, we design and evaluate our scheme for distributed metadata management in parallel filesystems, with the primary aim of improving the scalability of the filesystem while maintaining the reliability and consistency aspects.

1.1 Parallel Filesystems

Parallel filesystems are mostly used in high-performance computing environments, which deal with or generate massive amounts of data. Parallel filesystems usually separate the processing of metadata from data. Some parallel file systems, e.g., Lustre, have a separate metadata server to handle metadata operations, whereas others, e.g., PVFS, may keep the metadata and data at the same place. Let's consider the case of Lustre. Lustre is a POSIX-compliant, open-source distributed parallel filesystem. Due to its extremely scalable architecture, Lustre deployments are popular in scientific supercomputing, as well as in the oil and gas, manufacturing, rich media, and finance sectors. Lustre presents a POSIX interface to its clients with parallel access capabilities to the shared file objects. Lustre is an object-based filesystem composed of three components: a metadata server (MDS), object storage servers (OSSs), and clients. Figure 1.1 illustrates the Lustre architecture. Lustre uses block devices for file data and metadata storage, and each block device can be managed by only one Lustre service. The total data capacity of the Lustre filesystem is the sum of all individual OST capacities.

Lustre clients access and concurrently use data through the standard POSIX I/O system calls. The MDS provides metadata services. Correspondingly, an MDC (metadata client) is a client of those services. One MDS per filesystem manages one metadata target (MDT). Each MDT stores file metadata, such as file names, directory structures, and access permissions. An OSS (object storage server) exposes block devices and serves data. Correspondingly, an OSC (object storage client) is a client of those services. Each OSS manages one or more object storage targets (OSTs), and OSTs store file data objects.

Figure 1.1: Basic Lustre Design. The figure shows the clients, the LDAP server (configuration information, network connection details and security management), the Meta-Data Server (MDS) and the Object Storage Targets (OSTs), with clients performing directory operations, metadata access and locking through the MDS, and file I/O through the OSTs.

1.2 Metadata Management in Parallel Filesystems

Parallel filesystems like Lustre and the Google File System differ from classical distributed filesystems like NFS in that they separate the management of metadata from the actual file data. In a classical distributed filesystem like NFS, the server has to manage both data and metadata, which increases the load on the server and limits the performance and scalability of the filesystem. Parallel filesystems store the metadata on a separate server known as the metadata server (MDS).

Let's consider the example of the Lustre filesystem. In terms of on-disk storage of metadata, the parallel filesystem keeps additional information known as Extended Attributes (EA) apart from the normal file metadata attributes. The EA information, along with the normal file attributes, is handed over to the client in the case of a getattr or lookup operation. So when the client wants to perform actual I/O, the client knows which servers to talk to and how the file is striped among the servers. From the MDS point of view, each file is composed of multiple data objects striped on one or more OSTs. A file object's layout information is defined in the extended attributes (EA) of the inode. Essentially, the EA describes the mapping between each file object id and its corresponding OST. This information is also known as the striping EA.

So if the stripe size is 1MB and a file is striped across three objects, then [0,1M), [3M,4M), [6M,7M) are stored, say, as object x, which is on OST p; [1M,2M), [4M,5M), [7M,8M) are stored, say, as object y, which is on OST q; and [2M,3M), [5M,6M), [8M,9M) are stored, say, as object z, which is on OST r. Before reading the file, a client will query the MDS via the MDC and be informed that it should talk to OST p, OST q and OST r for this operation.
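This offset-to-object mapping is purely arithmetic. The following is a minimal illustrative sketch of the round-robin computation (our own code, not Lustre's; the function name and layout are assumptions for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Maps a logical file offset to (object index, offset within object)
     * under RAID-0-style striping, matching the example above. */
    static void stripe_map(uint64_t offset, uint64_t stripe_size, int stripe_count,
                           int *obj_idx, uint64_t *obj_off)
    {
        uint64_t stripe_no = offset / stripe_size;       /* which stripe unit */
        *obj_idx = (int)(stripe_no % stripe_count);      /* object x, y or z  */
        *obj_off = (stripe_no / stripe_count) * stripe_size + offset % stripe_size;
    }

    int main(void)
    {
        const uint64_t MB = 1 << 20;
        int obj; uint64_t off;
        for (uint64_t o = 0; o < 9 * MB; o += MB) {
            stripe_map(o, MB, 3, &obj, &off);
            printf("file offset %lluM -> object %d, object offset %lluM\n",
                   (unsigned long long)(o / MB), obj, (unsigned long long)(off / MB));
        }
        return 0;
    }

For example, file offset 4M falls in stripe unit 4, which maps to object 1 (object y) at object offset 1M, consistent with the layout above.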

This information is structured in a so-called LSM (stripe metadata) structure, and the client-side LOV (logical object volume) interprets this information so the client can send requests to the OSTs. Here again, the client communicates with an OST through a client module interface known as the OSC. Depending on the context, OSC can also be used to refer to an OSS client by itself. All client/server communications in Lustre are coded as an RPC request and response. Within the Lustre source, this middle layer is known as Portal RPC, or ptl-rpc; it translates and interprets filesystem requests to and from the equivalent form of RPC request and response, and relies on the LNET module to finally put them onto the wire.

Most parallel file systems follow this kind of architecture, where a single metadata server manages the entire namespace. So in a scenario where the load on the MDS increases, the performance of the MDS degrades, which in turn slows down the entire file system. The MDS consists of many important components, such as the Lustre Distributed Lock Manager (LDLM), which occupies a major chunk of the processing time at the MDS. We performed experiments using the Oprofile tool to profile the Lustre code and understand the amount of time consumed by the LDLM module. The experiment was performed on 8 client nodes. Table 1.1 shows the amount of time consumed by the Lock Manager module at the MDS. In this kind of environment, where a single metadata server manages the entire namespace, most of the time is spent in the LDLM module and in communication. By communication we mean sending a blocking AST to the client holding a valid copy and then invalidating the local copy at that client. Also, allowing only a single Metadata Target (MDT) in a filesystem means that Lustre metadata operations can be processed only as quickly as a single server and its backing filesystem can manage. In order to improve the performance and scalability of parallel filesystems, effort has been made in the direction of distributed metadata management.

Table 1.1: LDLM and Oprofile Experiments

File                      Percentage
ldlm/ldlm_lockd.c         0.0044
ldlm/ldlm_inodebits.c     0.0044
ldlm/ldlm_internal.h      0.7104
ldlm/ldlm_lib.c           1.4341
ldlm/ldlm_lock.c          0.0132
ldlm/ldlm_pool.c          18.5729
ldlm/ldlm_request.c       1.8754
ldlm/ldlm_resource.c      5.3526

Clustered Metadata (CMD) is an approach proposed by the Lustre community for distributed metadata management. With CMD functionality, multiple MDSs jointly provide a single file system's namespace, storing the directory and file metadata on a set of MDTs. Clustered Metadata means there are multiple active MDS servers in one Lustre file system, so the MDS workload can be shared among several servers and the metadata performance can be significantly improved. Although CMD improves the performance and scalability of Lustre, it also brings some difficulties; the most complex ones are recovery, consistency and reliability. In CMD, one metadata operation may need to update several different MDSs. To maintain the consistency of the filesystem, the update must be atomic: if the update on one MDS fails, all other updates must be rolled back to their original states. To handle this, CMD uses a global lock. But a global lock slows down the overall throughput of the filesystem.
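The atomicity requirement and the cost of the global lock can be made concrete with a small sketch. The following is illustrative pseudocode of an all-or-nothing multi-MDS update with rollback; the helper names mds_apply, mds_undo and the lock functions are hypothetical stand-ins, not Lustre APIs:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for real MDS update/undo RPCs. */
    static bool mds_apply(int mds, const char *op)
    { printf("apply '%s' on MDS%d\n", op, mds); return mds != 3; /* pretend MDS3 fails */ }
    static void mds_undo(int mds, const char *op)
    { printf("undo  '%s' on MDS%d\n", op, mds); }
    static void global_lock(void)   {}
    static void global_unlock(void) {}

    /* All-or-nothing update across several MDSs, as CMD requires: apply the
     * operation under a global lock; on any failure, roll back the servers
     * already updated. The global lock serializes conflicting operations,
     * which is precisely what limits overall filesystem throughput. */
    static bool cmd_atomic_update(const int *mds_ids, size_t n, const char *op)
    {
        bool ok = true;
        size_t done = 0;

        global_lock();
        for (; done < n; done++)
            if (!mds_apply(mds_ids[done], op)) { ok = false; break; }
        if (!ok)
            while (done-- > 0)              /* undo in reverse order */
                mds_undo(mds_ids[done], op);
        global_unlock();
        return ok;
    }

    int main(void)
    {
        int servers[] = { 1, 2, 3 };
        bool ok = cmd_atomic_update(servers, 3, "mkdir /a/b");
        printf("operation %s\n", ok ? "committed" : "rolled back");
        return 0;
    }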

1.3 Distributed Coordination Service

Google's Chubby [9] is a distributed lock service which has gained wide adoption within Google's data centers. The Chubby lock service is intended to provide coarse-grained locking as well as reliable storage for a loosely-coupled distributed system. The purpose of the lock service is to allow its clients to synchronize their activities and to agree on basic information about their environment. The primary goals include reliability, availability to a moderately large set of clients, and easy-to-understand semantics; throughput and storage capacity are considered secondary. Chubby's client interface is similar to that of a simple file system that performs whole-file reads and writes, augmented with advisory locks and with notification of various events such as file modification. Chubby helps developers to deal with coarse-grained synchronization within their systems, and in particular with the problem of electing a leader from among a set of otherwise equivalent servers. For example, the Google File System [13] uses a Chubby lock to appoint a GFS master server, and Bigtable [11] uses Chubby in several ways: to elect a master, to allow the master to discover the servers it controls, and to permit clients to find the master. In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of metadata; in effect they use Chubby as the root of their distributed data structures. The primary purpose of storing the root in Chubby is improved reliability and consistency: even in the event of a node failure, we are still able to view the contents of the directory due to the reliability provided by Chubby.

Apache ZooKeeper, not surprisingly, is a close clone of Chubby, designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure. ZooKeeper [14] is a distributed, open-source coordination service for distributed applications. It exposes a simple set of interfaces that distributed applications can build upon to implement higher-level services for synchronization, configuration maintenance and naming. ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical namespace which is organized similarly to a standard file system. The namespace consists of special nodes known as znodes. Znodes are not meant to store bulk file data; they store small amounts of coordination and configuration information. The ZooKeeper implementation puts a premium on high-performance, highly available, strictly ordered access. The strict ordering means that sophisticated synchronization primitives can be implemented at the client. ZooKeeper is replicated over a set of hosts. ZooKeeper performs better in a read-intensive workload than in a write/update-intensive workload [14].
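As a concrete illustration of the znode interface, the following minimal sketch uses the ZooKeeper C client to store and read back a small piece of metadata (the znode path, its value and the connection string are our own examples, purely for illustration):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <zookeeper/zookeeper.h>

    static void watcher(zhandle_t *zh, int type, int state,
                        const char *path, void *ctx) { /* connection events */ }

    int main(void)
    {
        /* Connect to a ZooKeeper ensemble (address is an example). */
        zhandle_t *zh = zookeeper_init("localhost:2181", watcher, 30000, 0, 0, 0);
        if (!zh) return 1;
        sleep(2);  /* crude: wait for the session; real code watches for ZOO_CONNECTED_STATE */

        /* Store a small piece of metadata in a znode; the hierarchical znode
         * namespace looks like a filesystem tree. */
        char created[256];
        if (zoo_create(zh, "/dufs-meta", "fid=0001", 8,
                       &ZOO_OPEN_ACL_UNSAFE, 0, created, sizeof(created)) == ZOK)
            printf("created znode %s\n", created);

        /* Read it back; the Stat carries the znode version used for
         * conditional updates. */
        char buf[64]; int len = sizeof(buf);
        struct Stat stat;
        if (zoo_get(zh, "/dufs-meta", 0, buf, &len, &stat) == ZOK)
            printf("value: %.*s (version %d)\n", len, buf, stat.version);

        zookeeper_close(zh);
        return 0;
    }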

Figure 1.2: Zookeeper Design

1.4 Motivation of the Work

Parallel file systems can easily scale bandwidth and improve performance by operating on data in parallel, using strategies such as striping, sharing resources, etc. However, most parallel file systems do not provide the ability to scale and parallelize metadata operations, as that is inherently more complex than scaling the performance of data operations [6]. PVFS provides some level of parallelism through distributed metadata servers that manage different ranges of metadata. The Lustre community has also proposed the idea of a Clustered Metadata Server (CMD) to minimize the load on a single metadata server, wherein multiple metadata servers share the metadata processing workload.

1.4.1 Metadata Server Bottlenecks

The MDS is currently restricted to a single node, with a fail-over MDS that becomes operational if the primary server becomes nonfunctional. Only one MDS is ever operational at a given time. This limitation poses a potential bottleneck as the number of clients and/or files increases. IOZone [2] is used to measure the sequential file I/O throughput, and Postmark [5] is used to measure the scalability of the MDS performance. Since MDS performance is the primary concern of this research, we discuss the Postmark experiment in more detail. Postmark is a file system benchmark that performs a lot of metadata-intensive operations to measure MDS performance. Postmark first creates a pool of small files (1KB to 10KB), and then starts many sequential transactions on the file pool. Each transaction performs two operations to either read/append a file or create/delete a file. Each of these operations happens with the same probability. The transaction throughput is measured to approximate workloads on an Internet server.

Table 1.2 gives the measured transaction throughput with a fixed file pool size of 1,000 files and different numbers of transactions on this pool. The transaction throughput remains relatively constant as the number of transactions varies. Since the cost for the MDS to perform an operation does not change at a fixed file count, this result is expected. Table 1.3, on the other hand, changes the file pool size and measures the corresponding transaction throughput. By comparing the entries in Table 1.3 with their counterparts in Table 1.2, it becomes clear that a larger file pool results in a lower transaction throughput. We also performed experiments varying the number of transactions while keeping the number of files in the file pool constant. Table 1.4 shows the details: for a constant file pool size and a varying number of transactions, we do not see a significant change in the transaction throughput. The MDS caches the most recently accessed metadata of files (the inode of a file). A client file operation requires the metadata information about that file to be returned by the MDS. With a larger number of files in the pool, a client request is less likely to be serviced from the MDS cache. A cache miss results in the MDS looking up its disk storage to load the inode of the requested file, which results in the lower transaction throughput in Table 1.3.

Table 1.2: Transaction throughput with a fixed file pool size of 1,000 files

Number of transactions    Transactions per second
1,000                     333
5,000                     313
10,000                    325
20,000                    321

Table 1.3: Transaction throughput with varying file pool size

Number of files in pool    Number of transactions    Transactions per second
1,000                      1,000                     333
5,000                      5,000                     116
10,000                     10,000                    94
20,000                     20,000                    79

Table 1.4: Transaction throughput with a fixed file pool size of 5,000 files and varying number of transactions

Number of files in pool    Number of transactions    Transactions per second
5,000                      1,000                     333
5,000                      5,000                     316
5,000                      10,000                    318
5,000                      20,000                    313
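The Postmark-style workload used above is easy to picture in code. The following is a simplified sketch of such a transaction loop (not the actual Postmark source; file names and parameters are our own, and a writable "pool" directory is assumed to exist):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Build a pool of small files, then run transactions that read/append or
     * delete/re-create a random file with equal probability, and report
     * transactions per second. */
    #define POOL   1000
    #define NTRANS 10000

    static void fname(char *buf, size_t n, int i)
    {
        snprintf(buf, n, "pool/f%05d", i);
    }

    int main(void)
    {
        char path[64];
        srand(42);

        for (int i = 0; i < POOL; i++) {        /* create the file pool */
            fname(path, sizeof(path), i);
            FILE *f = fopen(path, "w");
            if (f) { fputs("x", f); fclose(f); }
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int t = 0; t < NTRANS; t++) {
            fname(path, sizeof(path), rand() % POOL);
            if (rand() % 2) {                   /* read or append */
                FILE *f = fopen(path, rand() % 2 ? "r" : "a");
                if (f) fclose(f);
            } else {                            /* delete, then re-create */
                remove(path);
                FILE *f = fopen(path, "w");
                if (f) fclose(f);
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.0f transactions/sec\n", NTRANS / secs);
        return 0;
    }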

1.4.2 Consistency Management of Metadata

The majority of distributed filesystems use a single metadata server. However, this is a bottleneck that limits the operation throughput. Managing multiple metadata servers brings many difficulties: maintaining consistency between two copies of the same directory hierarchy is not straightforward. We illustrate such a difficulty in Figure 1.3.

We have two metadata servers (MDS) and we consider two clients that perform an operation on the same directory at the same time. Client 1 creates the directory d1 and client 2 renames the directory d1 to d2. As shown in Figure 1.3a, each client performs its operation in the following order: first on MDS1, then on MDS2. From the MDS point of view, there is no guarantee on the execution order of the requests, since they are coming from different clients.

Figure 1.3: Example of consistency issue with 2 clients and 2 MetaData servers.
(a) On the client side: client 1 performs 'mkdir d1' on MDS1 and then on MDS2, while client 2 performs 'mv d1 d2' on MDS1 and then on MDS2.
(b) On the MetaData server side: MDS1 executes 'mkdir d1' from client 1 before 'mv d1 d2' from client 2, resulting in d2; MDS2 executes 'mv d1 d2' from client 2 before 'mkdir d1' from client 1, resulting in d1.

As shown in Figure 1.3b, the requests can be executed in a different order on each metadata server while still respecting the ordering that each client demands. In this case, the resulting states of the two metadata servers are not consistent.

This small example highlights that distributed algorithms are required to maintain consistency between multiple metadata servers. Each client operation must appear to be atomic and must be applied in the same order on all the metadata servers. For this reason, we decided to use a distributed coordination service like ZooKeeper in the proposed metadata service layer. Such a coordination service implements the required distributed algorithms in a reliable manner.
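One way a coordination service provides this global order is through sequential znodes: the ensemble assigns every created znode a strictly increasing sequence number, so all observers can apply logged operations in exactly the same order. A minimal sketch using the ZooKeeper C client follows (the /oplog path and connection string are our own illustrative choices, not part of any real system):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <zookeeper/zookeeper.h>

    static void watcher(zhandle_t *zh, int type, int state,
                        const char *path, void *ctx) {}

    int main(void)
    {
        zhandle_t *zh = zookeeper_init("localhost:2181", watcher, 30000, 0, 0, 0);
        if (!zh) return 1;
        sleep(2);  /* crude: wait for the session; real code watches for ZOO_CONNECTED_STATE */

        /* Parent node for the operation log (it is fine if it already exists). */
        char buf[128];
        zoo_create(zh, "/oplog", "", 0, &ZOO_OPEN_ACL_UNSAFE, 0, buf, sizeof(buf));

        /* Each metadata operation is logged as a sequential znode; ZooKeeper
         * appends a 10-digit, strictly increasing counter to the name. */
        const char *ops[] = { "mkdir d1", "mv d1 d2" };
        for (int i = 0; i < 2; i++) {
            if (zoo_create(zh, "/oplog/op-", ops[i], (int)strlen(ops[i]),
                           &ZOO_OPEN_ACL_UNSAFE, ZOO_SEQUENCE,
                           buf, sizeof(buf)) == ZOK)
                printf("'%s' ordered as %s\n", ops[i], buf);
        }
        zookeeper_close(zh);
        return 0;
    }

Whichever client issues 'mkdir d1' first receives the lower sequence number on every server, removing the ambiguity shown in Figure 1.3.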

1.5 Problem Statement

The amount of data generated and consumed by high-performance computing applications is increasing exponentially. Current I/O paradigms and file system designs are often overwhelmed by this deluge of data. Parallel file systems improve the I/O throughput to a certain extent by incorporating features such as resource sharing and data striping. Distributed filesystems often dedicate a subset of the servers to metadata management. File systems such as NFS [17], AFS [15], and Lustre [4] use a single metadata server to manage a globally shared file system namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure.

In this thesis, we study and critique the current metadata management techniques in parallel file systems, taking the Lustre file system as our use case. We propose two new designs for metadata management in parallel file systems. In the first part, we present a design where we delegate metadata at the client side to solve the problem of a single metadata server (MDS) becoming a bottleneck while managing the entire namespace. We aim at minimizing the memory pressure at the MDS by delegating some of the metadata to clients so as to improve the scalability of Lustre. In the second part, we design a decentralized metadata service layer and evaluate its benefits in a parallel filesystem environment. The decentralized metadata service layer takes care of distributed metadata management with the primary aim of improving the scalability of the filesystem while maintaining the reliability and consistency aspects.

Specifically, our research attempts to answer the following questions:

1. What are the challenges and problems associated with a single server managing the entire namespace for a parallel file system?

2. How can we solve the problem of minimizing the load on a single MDS by distributing the metadata at the client side?

3. What are the challenges and problems associated with distributed metadata management?

4. Can a distributed coordination service be incorporated into parallel filesystems for distributed metadata management so as to improve the reliability and consistency aspects?

5. How will a decentralized metadata service layer perform with respect to various metadata operations as compared to the basic variants of parallel filesystems such as Lustre [4] and PVFS [10]?

6. Will a decentralized metadata service layer designed for distributed metadata management do a good job of improving the scalability of a parallel file system? Will it help in maintaining the consistency and reliability of the file system?

1.6 Organization of Thesis

The rest of the thesis is organized as follows. Chapter 2 presents an overview of the work in the area of parallel file systems, with a focus on metadata management. Chapter 3 proposes a distributed metadata management technique that delegates metadata at the client side. In Chapter 4 we explore the feasibility of using a distributed coordination service for distributed metadata management. We conclude our work and present future research directions in Chapter 5.

Chapter 2: RELATED WORK

In this chapter, we discuss some of the current literature related to metadata management in high-performance computing environments. We highlight the drawbacks of current metadata management paradigms in parallel filesystems and suggest better designs and algorithms for metadata management in parallel filesystems.

2.1 Metadata Management approaches

File system metadata management has long been an active area of research [15]. With the advent of commodity clusters and parallel file systems [4], managing metadata efficiently and in a scalable manner offers significant challenges. Distributed file systems often dedicate a subset of the servers to metadata management. Mapping the semantics of data and metadata across different, non-overlapping servers allows file systems to scale in terms of I/O performance and storage capacity.

File systems such as NFS [17], AFS [15], Lustre [4], and GFS [13] use a single metadata server to manage a globally shared file system namespace. While simple, this design does not scale, resulting in the metadata server becoming a bottleneck and a single point of failure. File systems like NFS [17], [21] and AFS [15] may also partition their namespace statically among multiple servers, so most of the major metadata operations are centralized. pNFS [12] allows for distributed data but retains the concept of centralized metadata. Other parallel file systems like GPFS [22], Intermezzo [7] and Lustre [4] use directory locks for file creation, with the help of distributed lock management (DLM) for better performance. Lustre uses a single metadata server to manage the entire namespace; its distributed lock management module handles locks between clients and servers as well as local locks between the nodes. The Lustre community has also acknowledged that a single metadata server is a bottleneck in HPC environments, and so came up with the concept of the Lustre Clustered Metadata Server (CMD). CMD is still a prototype, with no implementation to date; the original design for CMD was proposed in 2008. In CMD, files are identified by a global FID and are assigned to a metadata server; once we know the FID, we can deal directly with that server. Getting this FID still requires a centralized/master metadata server, and this information is not redundant, so CMD still has a bottleneck at the master node, and its reliability and availability depend heavily on the master node. To mitigate the problems associated with a central metadata server, AFS [15] and NFS [17] employ static directory subtree partitioning [24] to partition the namespace across multiple metadata servers. Each server is delegated the responsibility of managing the metadata associated with a subtree. Hashing [8] is another technique used to partition the file system namespace: a hash of the file name assigns metadata to the corresponding MDS. Hashing diminishes the problem of hot spots that is often experienced with directory subtree partitioning. The Lazy Hybrid metadata management scheme [8,23] combines hierarchical directory management and hashing with lazy updates. Zhu et al. proposed using Hierarchical Bloom Filter Arrays [25] to map file names to the corresponding metadata servers. They used two levels of Bloom Filter Arrays, with differing degrees of accuracy and memory overhead, to distribute the metadata management responsibilities across multiple servers. Ananth et al. explored multiple algorithms for creating files on a distributed metadata file system for scalable metadata performance.

In the past, in order to get more metadata mutation throughput, efforts were aimed at mounting more independent file systems into a larger aggregate, but each directory or directory subtree is still managed by one metadata server. Some systems use clustered metadata servers in pairs for fail-over, but not for increased throughput. Some systems allow any server to act as a proxy and forward requests to the appropriate server, but this also does not increase metadata mutation throughput in a directory [3].

Symmetric shared-disk file systems that support concurrent updates to the same directory use complex distributed locking and cache consistency semantics, both of which have significant bottlenecks for concurrent-create workloads, especially with many clients working in one directory. Moreover, file systems that support client caching of directory entries for faster read-only workloads generally disable client caching during concurrent-update workloads to avoid excessive consistency overhead.

A recent trend among distributed file systems is to use the concept of objects to store data and metadata. CRUSH [23] is a data distribution algorithm that maps object replicas across a heterogeneous storage system. It uses a pseudo-random function to map data objects to storage devices. Lustre, PanFS and Ceph [23] use various non-standard object interfaces requiring the use of dedicated I/O and metadata servers. Instead, our work breaks away from the dedicated server paradigm and redesigns parallel file systems to use standards-compliant OSDs for data and metadata storage.

There has also been work in the area of combining multiple partitions into a virtual mount point. UnionFS (the official union filesystem in the kernel mainline) [18] has a lot of options, but it does not support load balancing between branches. Most of the file systems which combine multiple partitions into a virtual mount point work on a single node to combine local partitions or directories. Also, some union file systems cannot extract parallelism: their default behavior is to use the first partition until it reaches a threshold (based on the free space). Such a filesystem cannot attain higher throughput even after combining multiple mount points, and is restricted by the throughput of the first mounted partition.

2.2 Scalable filesystem directories

GPFS is a shared-disk file system that uses a distributed implementation of Fagin's extendible hashing for its directories. Fagin's extendible hashing dynamically doubles the size of the hash table, pointing pairs of links to the original buckets and expanding only the overflowing bucket (by restricting implementations to a specific family of hash functions). It has a two-level hierarchy: buckets (to store the directory entries) and a table of pointers (to the buckets). GPFS represents each bucket as a disk block, and the pointer table as the block pointers in the directory's i-node. When the directory grows in size, GPFS allocates new blocks, moves some of the directory entries from the overflowing block into the new block, and updates the block pointers in the i-node. GPFS employs its client cache consistency and distributed locking mechanism to enable concurrent accesses to a shared directory. Concurrent readers can cache the directory blocks using shared reader locks, which enables high performance for read-intensive workloads. Concurrent writers, however, need to acquire write locks from the lock manager before updating the directory blocks stored on the shared disk storage. When releasing (or acquiring) locks, GPFS versions before 3.2.1 force the directory block to be flushed to disk (or read back from disk), inducing high I/O overhead. Newer releases of GPFS have modified the cache consistency protocol to send directory insert requests directly to the current lock holder, instead of getting the block through the shared disk subsystem [22]. Still, GPFS continues to synchronously write the directory's i-node (i.e., the mapping state) and invalidate client caches to provide strong consistency guarantees. Lustre proposed a clustered metadata [1] service which splits a directory, using a hash of the directory entries, only once over all available metadata servers when it exceeds a threshold size. The effectiveness of this "split once and for all" scheme depends on the eventual directory size and does not respond to dynamic increases in the number of servers. Ceph is another object-based cluster file system that uses dynamic subtree partitioning of the namespace and hashes individual directories when they get too big or experience too many accesses.

There has been some work in the area of designing distributed indexing schemes for metadata management. GIGA+ [16] examines the problem of scalable file system directories, motivated by data-intensive applications requiring millions to billions of small files to be ingested in a single directory at rates of hundreds of thousands of file creates every second. GIGA+ builds directories with millions to billions of files with a high degree of concurrency. Compared to GPFS, GIGA+ allows the mapping state to be stale at the client and never shared between servers, thus seeking even more scalability. Compared to Lustre and Ceph, GIGA+ splits a directory incrementally as a function of size, i.e., a small directory may be distributed over fewer servers than a larger one. Furthermore, GIGA+ facilitates dynamic server addition, achieving balanced server load with minimal migration. This work is interesting but is more relevant in workloads where the directories have a huge fan-out factor, or where the application creates millions to billions of files in a single directory. In GIGA+, every server keeps only a local view of the partitions it is managing and no shared state is maintained, hence there are no synchronization and consistency bottlenecks. But in case the server or the partition goes down, or the root-level directory gets corrupted, the files will no longer be accessible.

Chapter 3: DELEGATING METADATA AT CLIENT SIDE (DMCS)

In this chapter we focus on the problem caused by a central coordinator managing the entire namespace. We propose our design, delegating metadata at client side, to handle the problem described in Section 1.4.1.

Before we delve into the design for delegating metadata at the client side, we first take a look at Remote Procedure Call (RPC) processing in the Lustre filesystem.

3.1 RPC Processing in Lustre Filesystem

When we consider RPC processing in Lustre, we also need to discuss how lock processing works in Lustre [3,5,7,18] and how our modifications can help minimize the number of LOOKUP RPCs. Let's consider an example. Assume client C1 wants to open the file /tmp/lustre/d1/d2/foo.txt to read, where /tmp/lustre is our mount point. During the VFS path lookup, the Lustre-specific lookup routine will be invoked. The first RPC request is a lock enqueue with lookup intent, sent to the MDS for a lock on d1. The second RPC request is also a lock enqueue with lookup intent, sent to the MDS asking for an inodebits lock on d2. The lock returned is an inodebits lock, and its resources would be represented by the fids of d1 and d2.

The subtle point to note is that when we request a lock, we generally need a resource id for the lock we are requesting. However, in this case, since we do not know the resource id for d1, we actually request a lock on its parent /, not on d1 itself. In the intent, we specify it as a lookup intent, and the name of the lookup is d1. Then, when the lock is returned, the lock is for d1. This lock is (or can be) different from what the client requested, and the client notices this difference and replaces the old lock requested with the new one returned. The third RPC request is a lock enqueue with open intent, but it is not asking for a lock on foo.txt. That is, you can open and read a file without a lock from the MDS, since the content is provided by the Object Storage Target (OST). The OSS/OST also has an LDLM component, and in order to perform I/O on the OSS/OST, we request locks from an OST. In other words, what happens at open is that we send a lock request, which means we do ask for a lock from the LDLM server. But in the intent data itself, we might (or might not) set a special flag if we are actually interested in receiving the lock back. The intent handler then decides (based on this flag) whether or not to return the lock. If foo.txt existed previously, then its fid, inode content (as in owner, group, mode, ctime, atime, mtime, nlink, etc.) and striping information are returned. If client C1 opens the file with the O_CREAT flag and the file does not exist, the third RPC request will be sent with open and create intent, but there will still be no lock requests. Now on the MDS side, to create a file foo.txt under d2, the MDS will request through the LDLM another EX lock on the parent directory. Note that this lock request conflicts with the previous CR lock on d2. Under normal circumstances, a fourth RPC request (a blocking AST) will go to client C1, or anyone else who may hold the conflicting locks, informing the client that someone is requesting a conflicting lock and requesting a lock cancellation. The MDS waits until it gets a cancel RPC from the client. Only then does the MDS get the EX lock it was asking for earlier and can proceed. If client C1 opens the file with the LOV_DELAY flag, the MDS creates the file as usual, but there is no striping and no objects are allocated. The user will issue an ioctl call and set the stripe information, and then the MDS will fill in the EA structure.

3.2 Existing Design

In this section we explain the existing approach followed by Lustre for metadata management.

1. When client 1 tries to open a file, it sends a LOOKUP RPC to the MDS.

2. The processing is done at the MDS side, where the Lock Manager grants the lock for the resource requested by the client. A second RPC is sent from the client to the MDS with the intent to create or open the file.

3. At the end of step 2, client 1 has the lock, the extended attribute (EA) information and the other metadata details which it needs to open the file successfully.

4. Once the client gets the EA information and the lock handle, it can proceed with the I/O operation.

5. The MDS keeps track of the allocation by making use of queues. When multiple clients try to access the same file, the new client waits in the waiting queue until the original client, the current owner of the lock, releases it; the MDS then hands the lock over to the new client. Say client 2 wants to access the same file which was earlier opened by client 1: client 2 will be placed in the waiting queue, and the MDS will send a blocking AST to client 1 to revoke the granted lock. Client 1, on receiving the blocking AST, releases the lock. In a scenario where client 1 is down or something goes wrong, the MDS waits for a ping timeout of 30 seconds, after which it revokes the lock. Once the lock is revoked, the MDS grants a lock handle and the EA for the file to client 2, and client 2 can proceed with I/O once it has the lock handle and EA information.

3.3 Design and challenges for delegating metadata at client side

Before moving ahead with the actual design of our approach, we discuss how Lustre Networking works and the communication module that we developed to do remote memory copy operations.

3.3.1 Design of communication module

We have designed a communication module for data movement. This communication module bypasses the normal Lustre Networking stack protocols and helps to do remote memory data movement operations. We use the LNET API, which originated from Sandia Portals, to design the communication module. In our design, we use the put and get APIs to do remote memory copies. The remote copy can be used by clients to copy the metadata information from the client to whom the metadata has been delegated by the MDS. LNET identifies its peers using an LNET process id, which consists of a nid and a pid. The nid identifies the node, and the pid identifies the process on the node. For example, in the case of the socket Lustre Network Driver (LND) (and for all currently existing LNET LNDs), there is only one instance of LNET running in the kernel space; the process id therefore uses a reserved ID (12345) to identify itself. Portal RPC is a client of the LNET layer and takes care of the RPC processing logic. A portal is composed of a list of match entries (ME). Each ME can be associated with a buffer, which is described by a memory descriptor (MD). The ME itself defines match bits and ignore bits, which are 64-bit identifiers used to decide whether an incoming message can use the associated buffer space.
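The match-bits selection rule is simple enough to demonstrate in a few lines. The following is a toy user-space simulation of Portals-style matching (it is not the LNET API; the struct and names are ours): an incoming message is accepted by a match entry if its bits agree with the entry's match bits everywhere outside the entry's ignore mask.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        uint64_t match_bits;
        uint64_t ignore_bits;
        const char *buffer;   /* stand-in for the attached memory descriptor */
    } match_entry;

    static bool me_matches(const match_entry *me, uint64_t incoming)
    {
        return ((incoming ^ me->match_bits) & ~me->ignore_bits) == 0;
    }

    int main(void)
    {
        match_entry portal[] = {
            { 0x1000, 0x0000, "bulk buffer"  },  /* requires an exact match  */
            { 0x2000, 0x00FF, "reply buffer" },  /* low 8 bits do not matter */
        };
        uint64_t incoming = 0x20A5;              /* matches the second entry */

        for (int i = 0; i < 2; i++)
            if (me_matches(&portal[i], incoming))
                printf("message 0x%llx -> %s\n",
                       (unsigned long long)incoming, portal[i].buffer);
        return 0;
    }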

Consider an example to illustrate the point. Say a client wants to read ten blocks of data from the server. It first sends an RPC request to the server telling it that it wants to read ten blocks and that it is prepared for the bulk transfer (meaning the bulk buffer is ready). Then the server initiates the bulk transfer. When the server has completed the transfer, it notifies the client by sending a reply. Looking at this data flow, it is clear that the client needs to prepare two buffers: one associated with the bulk portal for the bulk RPC, and the other associated with the reply portal.

3.3.2 Design of DMCS approach

In this section we explain the design details of the client-side metadata delegation approach, which is illustrated in Figure 3.1.

Figure 3.1: Design of DMCS approach

1. When client 1 (C1) tries to open a file, it sends a LOOKUP RPC to the MDS.

2. The processing is done at the MDS side, where the Lock Manager grants the lock for the resource requested by the client, and a second RPC is sent from the client to the MDS with the intent to create or open the file. At the end of step 2, C1 has the lock handle, the EA information and the other metadata details. Conceptually, steps 1 and 2 are similar to the current Lustre design, but in our approach we modify step 2 slightly: we make an additional check at the MDS to see whether this is a first-time access to the file. First-time access means this is the first time the metadata information for this file is created on the MDS, and none of the metadata caches maintained by the kernel hold the metadata-related information. If this is a first-time access, we keep a data structure to track who owns the file and perform some validation of whether it is a first-time access or not. We compute a hash based on the filename to speed up the lookup process at the MDS side, and we make use of the communication module for one-sided operations such as remote memory read and remote memory write.

3. In step 3, we expose the buffers with the information, such as extended attributes, that will be useful for clients who subsequently access the file that was opened by C1. We call the client who exposes the needed buffer information the new owner of the file, and use the term delegation client for such clients.

4. When C2 tries to open the same file, it performs an RPC in step 4, as in step 1. We call C2 the normal client.

5. In step 5, the normal client's request triggers a lookup in the hash table at the MDS side that we updated in step 2, which finds that C1 is the owner of the file. So instead of spending additional time at the MDS side, we return the needed information to C2.

6. In step 6, the normal client (C2 in our case) contacts the delegation client (C1 in our case) and fetches the information which was stored in the buffers exposed by the delegation client for this specific file. We use our communication module to speed up this process using one-sided operations.

7. Once C2 gets the needed metadata information from C1, it can proceed with the I/O operations.

This design can help in minimizing the request traffic at the MDS by delegating some load to the clients. Distributed subtree partitioning and pure hashing are methods used to distribute the namespace and workload among metadata servers in existing distributed file systems. We use a combination of both of these approaches. By partitioning the namespace across the MDS and clients, we minimize the load on the MDS, and by using a pure hashing scheme we compute a hash on the filename and decide which bucket, or server, the file's metadata has been delegated to. So when a client accesses a file, if the file was created earlier by some client, then this file will have an entry in the hash table; we compute the hash based on the filename and divert the client to the delegation client to get the metadata information. If the hash table does not have the needed mapping information, then this is the first-time access for the file. This design allows the authoritative copy of each metadata item to reside on different clients, distributing the workload. In addition, a delegation record is much smaller, allowing the MDS cache to be more effective as the number of files accessed increases. This architecture works well when the file pool size increases and many clients are simultaneously accessing the files on the MDS. With this design, the workload is distributed among the MDS and the clients.
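The MDS-side lookup just described can be sketched as follows (a minimal illustration with our own names and structures, not the actual implementation): hash the filename into a bucket; if a delegation record exists, redirect the caller to the delegation client; otherwise record the caller as the new owner.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 4096

    typedef struct deleg {
        char name[256];
        int owner;               /* delegation client id, simplified to an int */
        struct deleg *next;
    } deleg_t;

    static deleg_t *table[NBUCKETS];

    static uint32_t fnv1a(const char *s)  /* FNV-1a string hash */
    {
        uint32_t h = 2166136261u;
        for (; *s; s++) { h ^= (unsigned char)*s; h *= 16777619u; }
        return h;
    }

    /* Returns the delegation client id, or -1 on first-time access
     * (in which case 'client' becomes the owner). */
    static int lookup_or_delegate(const char *name, int client)
    {
        uint32_t b = fnv1a(name) % NBUCKETS;
        for (deleg_t *d = table[b]; d; d = d->next)
            if (strcmp(d->name, name) == 0)
                return d->owner;            /* redirect to delegation client */

        deleg_t *d = calloc(1, sizeof *d);  /* first access: delegate to caller */
        snprintf(d->name, sizeof(d->name), "%s", name);
        d->owner = client;
        d->next = table[b];
        table[b] = d;
        return -1;
    }

    int main(void)
    {
        printf("%d\n", lookup_or_delegate("/d1/d2/foo.txt", 1)); /* -1: C1 becomes owner */
        printf("%d\n", lookup_or_delegate("/d1/d2/foo.txt", 2)); /*  1: redirect C2 to C1 */
        return 0;
    }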

Secondly, with metadata delegated at the client side, instead of caching the complete metadata information at the MDS, the MDS only stores a record pointing to the delegation client for each file, which greatly reduces the cache memory usage. The metadata is distributed across all clients, so no single client becomes a bottleneck when many clients are accessing many files.

Finally, when many clients try to access a large number of files, the MDS is not able to serve these requests from its memory due to the sheer amount of metadata to be cached; Table 1.3 shows this effect. The MDS is kept busy reading the requested metadata, which is widely dispersed on disk, to load the blocks into memory. Meanwhile, metadata already in memory has to be evicted to make space for newly loaded metadata, and later requests for that evicted metadata have to be serviced from disk, which aggravates the burden on the MDS. Obviously, the MDS becomes a bottleneck. Client-side metadata delegation distributes the responsibility for many files to a lot of clients, so that a request for any given file hits the MDS disk at most once, and all following requests for that file can be serviced by a delegated client. No single node becomes a bottleneck in the entire system. Although an additional network round trip is incurred as a result of the redirection, this overhead is very small compared to the disk access time on the MDS. With high-bandwidth, low-latency interconnection technologies such as InfiniBand, this network round-trip time is likely to be negligible.

We also studied the impact of the underlying storage and the transport on metadata performance. The experiment was run on 8 client nodes, each running Lustre 1.8.1. Our results show that even after changing the underlying data storage medium to a faster device like an SSD, we do not see a large improvement in metadata operation rates. Table 3.1 shows the details.

Table 3.1: Metadata operation rates with different underlying storage

Metadata operation    HDD/TCP    SSD/TCP    HDD/IB    SSD/IB
create()              455        457        893       935
open()                602        602        1,441     1,443
stat()                1,472      1,481      3,131     3,065
()                    501        504        1,171     1,219
unlink()              405        421        843       883
mkdir()               545        519        1,221     1,229
rmdir()               265        267        609       621

3.3.3 Challenges

While designing the client-side metadata delegation approach, we need to take care of some challenges. In this section we state those challenges and describe the approach taken to solve them.

30 3.3.4 Metadata revocation

Delegating metadata at the client side distributes the workload of the MDS to client nodes. This is very beneficial when many files are being accessed by many clients (i.e., a many-to-many file access pattern). However, when all clients are accessing a single file (i.e., an N-to-1 file access pattern), the hot spot is simply moved from a very powerful MDS to a relatively less powerful client. Therefore, provisions must be made to avoid delegating to a node when this situation may arise, and to be able to pull back a delegation if this situation arises unexpectedly (including updating the information on the clients so they know who has the authoritative metadata). We have implemented the metadata revocation logic in the communication module, which takes care of this challenge by revoking the metadata when the client becomes a hot spot.

3.3.5 Distributed Lock management for DMCS approach

To address the consistency and reliability aspects, we have designed a lock management scheme. The distributed locking scheme ensures consistency when many concurrent clients access the same file. One of the primary responsibilities of the lock management scheme is to protect the shared data structure, i.e., the hash table maintained on the MDS side, as this hash table records how the metadata is delegated and who currently owns it. Consider a scenario where client C1 is the delegation client holding the metadata for a specific file, and clients C2-C10 are also accessing the same file and have learned from the MDS that the file's metadata has been delegated to C1. If a metadata revocation request arrives while clients C2-C10 are performing data movement, i.e., retrieving the file metadata information, the revocation request is queued until all the data movement operations are completed. In the existing design of Lustre, or of any parallel filesystem, the client cache is flushed whenever a file is closed. If a client is performing an operation on a file and another client requests a lock on the same file, then depending on the lock compatibility matrix, the original client may have to flush its cache to the storage node. The lock compatibility matrix determines which operations may proceed concurrently; for example, if both clients want to grab a read lock, both can proceed. In the case of conflicting lock compatibility entries, the original client flushes its cache and hands the lock back to the MDS. The MDS then grants the lock to the new client, which can proceed. The overhead involved in this step is high, since the process spends substantial time in communication and the cache must be flushed.
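To make the compatibility check concrete, the following is a minimal sketch of one way a lock compatibility matrix might be encoded; the two-mode simplification and the names are illustrative assumptions, not Lustre's actual lock manager implementation.

```c
#include <stdbool.h>

/* Hypothetical simplified lock modes. Lustre's real DLM has more
 * modes (NL, CR, CW, PR, PW, EX); two modes suffice to illustrate. */
enum lock_mode { LOCK_READ, LOCK_WRITE };

/* compatible[held][requested]: true if the requested lock can be granted
 * while the held lock is outstanding. Two reads are compatible; any
 * combination involving a write conflicts and forces a cache flush. */
static const bool compatible[2][2] = {
    /*                 READ   WRITE */
    /* READ  held */ { true,  false },
    /* WRITE held */ { false, false },
};

bool can_grant(enum lock_mode held, enum lock_mode requested)
{
    return compatible[held][requested];
}
```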

In Lustre, important file attributes such as the file size, modification time, and access time are stored at the OSS. So when one client flushes its cache and a new client subsequently accesses the file data from the OSS, the Lock Manager at the OSS ensures that consistency is maintained.

3.4 Performance Evaluation

We have implemented our design in Lustre 1.8.1.1 to minimize the number of RPC calls during a metadata operation, and we conducted experiments to evaluate metadata operation performance with the proposed design. One node acts as the Lustre Metadata Server (MDS), and two nodes are Lustre Object Storage Servers (OSS). The Lustre filesystem is mounted on eight other nodes, which act as Lustre client nodes. Each node runs kernel 2.6.18-128.7.1.el5 with Lustre 1.8.1.1 and has dual Xeon E5335 CPUs (8 cores in total) and 4GB of memory. The nodes are interconnected with 1 GigE for general-purpose networking, and we configured Lustre to use the TCP transport. To measure the performance of metadata operations such as open(), we have developed a parallel micro-benchmark.

We have extended the basic fileop testing tool that comes with the IOZone [2] benchmark to support parallel runs with multiple processes on many Lustre client nodes. The extended fileop tool creates a file tree structure for each process. This tree structure contains X Level-1 directories, with each Level-1 directory holding Y Level-2 directories. The total number of levels of subdirectories can be configured at run time. Within each bottom-level directory, Z files are created. By varying the size (fan-out) of each layer, we can generate different numbers of files in a file tree. We have developed an MPI parallel program to start multiple processes on multiple nodes. Each process works in a separate directory to create its aforementioned file tree. After that, each process walks through its neighbor's file tree to open each of the files in that sub-tree. This simulates the scenario where multiple client processes take turns accessing a shared pool of files. The wall-clock times of all the processes are then aggregated, and the total IOPS for the open system call is reported. To perform the tests, we created a number of files from a specific client, and those files were subsequently accessed by other clients in an interleaving manner. We could not simulate this scenario with the Postmark benchmark [5], since Postmark creates some N files and deletes the file pool as soon as the open/create or read/append operations complete. We therefore use the above micro-benchmark to run the tests and gather the experimental results. To see the benefits of the proposed approach in minimizing RPCs, we carried out three different types of test using our micro-benchmark: 1) open IOPS for different numbers of client processes, 2) open IOPS for different file pool sizes, and 3) time spent in open for varying file path depth.
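The following is a minimal sketch of how such an MPI micro-benchmark can be structured; the flat tree layout, file counts, naming scheme, and the aggregation by the slowest process are illustrative assumptions rather than the exact extended-fileop code.

```c
#include <mpi.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NFILES 10000  /* files per process tree (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    char path[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Phase 1: each process creates its own file tree. */
    snprintf(path, sizeof(path), "tree.%d", rank);
    mkdir(path, 0755);
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), "tree.%d/f.%d", rank, i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) close(fd);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* Phase 2: each process opens every file in its neighbor's tree. */
    int neighbor = (rank + 1) % size;
    double t0 = MPI_Wtime();
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), "tree.%d/f.%d", neighbor, i);
        int fd = open(path, O_RDONLY);
        if (fd >= 0) close(fd);
    }
    double elapsed = MPI_Wtime() - t0;

    /* One plausible aggregation: total opens over the slowest process. */
    double max_elapsed;
    MPI_Reduce(&elapsed, &max_elapsed, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("open() IOPS: %.0f\n", (double)NFILES * size / max_elapsed);

    MPI_Finalize();
    return 0;
}
```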

3.4.1 File Open IOPS: Varying Number of Client Processes

In this test, we first create the aforementioned file tree, containing 10,000 files, for every client process, then let each process access its neighbor's file tree. Figure 3.2 shows the aggregated IOPS for the open system call on the Lustre filesystem. We vary the number of client processes from 2 to 16, evenly distributed across the 8 client nodes: with 2 processes, only two client nodes are actually used; with 16 processes, 2 client processes run on each of the 8 client nodes. As seen in Figure 3.2, the modified Lustre with MDCS improves the aggregated IOPS over basic Lustre significantly. Compared to basic Lustre, our design reduces the number of RPC calls in the metadata operation path, which improves overall performance. With two client processes, the new approach (MDCS) raises file open IOPS from 2,528 per second to 3,612 per second. Basic Lustre, on the other hand, peaks at 8 concurrent client processes, where its Metadata Server delivers a slightly higher file open IOPS. When 16 processes are used, however, MDS performance drops due to high contention, similar to what we see with the MDCS approach.

3.4.2 File Open IOPS: Varying File Pool Size

In this test we carry out the same basic steps as described in Section 3.4.1, but we vary the number of files in each file tree per process, while using the same 16 client processes.

[Figure: aggregated open() operations per second (0-4,000) versus the number of client processes (4, 8, 16), comparing Basic Lustre with MDCS: Modified Lustre.]

Figure 3.2: File open IOPS, Each Process Accesses 10,000 Files

We wanted to understand how significant this factor is for performance. Figure 3.3 shows the experimental results, and it clearly demonstrates the benefits of our MDCS design. We observe that varying the file pool size for a constant number of processes does not produce a large deviation in open IOPS. We speculate that the file pool sizes used in our test are not big enough to stress the memory of the MDS, so most of the file metadata remains in the MDS's memory cache. As a result, the aggregated metadata operation throughput remains constant across different file pool sizes. In a future study we will experiment with larger file pools to push the memory limit of the MDS.

[Figure: aggregated open() operations per second (0-4,000) versus the number of files in the file pool (5,000; 10,000; 100,000), comparing Basic Lustre with MDCS: Modified Lustre.]

Figure 3.3: File open IOPS, Using 16 Client Processes

3.4.3 File Open IOPS: Varying File Path Depth

In this test we measure the performance benefit of the new MDCS approach when accessing files with different file path depths, i.e., different numbers of components in the file path. We start by creating a file tree for each client process containing 10,000 files, with a file path depth of 3 or 4. After that, each process begins to access files within its neighbor process's file tree. Figure 3.4 compares the time spent to open one file with basic Lustre and with the MDCS-modified Lustre filesystem. First of all, it shows that MDCS can reduce the time to open one file by up to 33%. We also observe that the number of path components has a significant impact on the total cost of a metadata operation: each file path component has to be resolved using one RPC to the MDS, so a deeper file path leads to a longer processing time.

[Figure: time to finish one open() in milliseconds (0-25) versus the number of components in the file path (3, 4), comparing Basic Lustre with MDCS: Modified Lustre.]

Figure 3.4: Time to Finish open, Using 16 Processes Each Accessing 10,000 Files

3.5 Summary

We have described a mechanism for minimizing the load on a single metadata server for the Lustre filesystem. A single metadata server managing the entire filesystem namespace is common in most parallel filesystem approaches to metadata management. Our design reduces the load on the MDS, and hence the memory pressure on the MDS, by delegating metadata to the client side. We evaluated the design and compared it with the basic variant of Lustre. For a metadata operation such as file open(), throughput increases as the number of client processes increases, whereas with the basic variant of Lustre the throughput decreases. We see similar behavior when the number of files in the file pool is increased. One of the primary reasons for the slowdown in the basic variant of Lustre is that, as the file pool size grows, the amount of file metadata to be kept in the MDS cache increases.

Chapter 4: DESIGN OF A DECENTRALIZED METADATA SERVICE LAYER FOR DISTRIBUTED METADATA MANAGEMENT

4.1 Detailed design of Distributed Union FileSystem (DUFS)

The core principle of the Distributed Union FileSystem (DUFS) is to distribute the load of metadata operations across multiple distributed filesystems. DUFS provides a single POSIX-compliant filesystem abstraction to the user, without revealing the multiple underlying filesystem mounts. With such an abstraction, the single metadata server of the back-end distributed filesystem is no longer a bottleneck. However, as described in section 1.4.2, consistency has to be guaranteed across multiple clients that perform simultaneous metadata operations. This task is delegated to the distributed coordination service, ZooKeeper [14].

DUFS maps each virtual filename, as seen by the user, to a physical path corresponding to one of the underlying filesystem mounts. A single level of indirection is introduced with the use of a File Identifier (FID), which uniquely identifies each file. Figure 4.1 shows a schematic view of this indirection level in our design. The mapping between the virtual path and the FID is maintained by ZooKeeper in a consistent manner, while the mapping between the FID and the physical path is carried out using a universally known deterministic mapping function that every DUFS client is aware of.

[Figure: Virtual path → FID (via the distributed coordination service) → Physical path (via the deterministic mapping function).]

Figure 4.1: DUFS mapping from the virtual path to the physical path using File Identifier (FID)

Because the second mapping step is deterministic, it does not require any coordination between clients. Consistency management at the physical storage level is offloaded to the underlying filesystem.

This single level of indirection offers flexibility: it allows the contents of a file to be represented independently of its name. Indeed, a filename can represent two different data contents over time (after a deletion and a new creation with the same name); conversely, the same data contents can correspond to any filename (for instance, after a rename operation). This representation also makes rename operations and physical data relocation easier.

Finally, directories and directory trees are considered metadata only, so they are not physically created on the back-end storage. Instead, the directory-tree information is maintained in memory by ZooKeeper.


4.1.1 Implementation Overview

The design of DUFS comprises three main components: the filesystem interface based on FUSE, the metadata management based on ZooKeeper, and the back-end storage provided by the underlying parallel filesystem. A DUFS client instance is purely local software that does not interact directly with other DUFS clients. Any necessary interaction happens only through the ZooKeeper service or over the back-end storage.

[Figure: two client nodes, each running applications on top of a FUSE interface and a DUFS instance that maps virtual path → FID → physical path; each DUFS instance uses a ZooKeeper client library to reach the ZooKeeper server ensemble and back-end storage clients to reach the back-end distributed filesystem storage. Labels A-D mark the steps of an open() operation.]

Figure 4.2: DUFS overview. A, B, C and D show the steps required to perform an open() operation.

Figure 4.2 shows the basic steps required to perform an open() operation on a file using DUFS.

A. The open() call is intercepted by FUSE, which gives the virtual path of the file to DUFS.

B. DUFS queries ZooKeeper to get the Znode based on the filename and to retrieve the FID. If the file does not exist, ZooKeeper returns an error.

C. DUFS uses the deterministic mapping function to find the physical path associated with the FID.

D. Finally, DUFS opens the file based on its physical path. The result is returned to the application via FUSE.

Directory operations, in contrast, take place only at the metadata level, so only ZooKeeper is involved and not the back-end storage; only steps A and B are performed.
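As an illustration of steps A-D, the following is a minimal sketch of what a DUFS open() handler might look like using the synchronous ZooKeeper C API and the FUSE high-level interface. The helper names (fid_to_mount(), fid_to_relpath()), the Znode data layout (a one-byte type tag followed by the FID), and the error handling are assumptions for the sketch, not the actual DUFS source.

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <zookeeper/zookeeper.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>

extern zhandle_t *zk;           /* connected via zookeeper_init()       */
extern const char *mounts[];    /* back-end mount points, size N        */
extern int  fid_to_mount(const char *fid_hex);   /* MD5(fid) mod N      */
extern void fid_to_relpath(const char *fid_hex, char *out, size_t len);

static int dufs_open(const char *vpath, struct fuse_file_info *fi)
{
    char data[64];
    int len = sizeof(data) - 1;
    struct Stat zstat;

    /* Step B: look up the Znode for the virtual path; its data field
     * holds the type tag and, for regular files, the FID. */
    int rc = zoo_get(zk, vpath, 0, data, &len, &zstat);
    if (rc != ZOK || len < 1)
        return -ENOENT;
    data[len] = '\0';

    /* Step C: deterministically map the FID to a back-end mount and a
     * relative physical path (see Sections 4.2.2 and 4.2.3). */
    const char *fid_hex = data + 1;   /* assumed layout: [type][fid] */
    char rel[64], phys[512];
    fid_to_relpath(fid_hex, rel, sizeof(rel));
    snprintf(phys, sizeof(phys), "%s/%s",
             mounts[fid_to_mount(fid_hex)], rel);

    /* Step D: open the physical file and hand the descriptor to FUSE. */
    int fd = open(phys, fi->flags);
    if (fd < 0)
        return -errno;
    fi->fh = fd;
    return 0;
}
```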

The following subsections describe the functions of the primary components of DUFS.

4.1.2 FUSE-based Filesystem Interface

We use FUSE to provide a POSIX-compliant filesystem interface to applications. Our DUFS prototype thus appears as a classic mount point of a standard filesystem.

Most of the basic filesystem operations, such as mkdir, create, open, symlink, rename, stat, readdir, rmdir, unlink, truncate, chmod, access, read, and write, are implemented in DUFS. When an application performs a filesystem operation, it operates on the virtual path exposed by DUFS. The filesystem operations are translated into the FUSE-specific operations; for example, the open() call from an application is translated into dufs_open() in DUFS. Finally, for each filesystem operation, DUFS returns the correct result after querying the ZooKeeper-based metadata management service and the back-end storage as needed.
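A minimal sketch of how these handlers could be registered with FUSE's high-level API follows; only two of the operations listed above appear, and the handler bodies are stubs standing in for the real ZooKeeper-backed implementations.

```c
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <errno.h>

/* Stub handlers; the real versions query ZooKeeper and the back end. */
static int dufs_getattr(const char *vpath, struct stat *st)
{
    (void)vpath; (void)st;
    return -ENOSYS; /* real version: Znode lookup, then stat() if a file */
}

static int dufs_mkdir(const char *vpath, mode_t mode)
{
    (void)vpath; (void)mode;
    return -ENOSYS; /* real version: create the corresponding Znode */
}

static struct fuse_operations dufs_ops = {
    .getattr = dufs_getattr,
    .mkdir   = dufs_mkdir,
    /* ... open, readdir, unlink, etc. as listed above ... */
};

int main(int argc, char *argv[])
{
    /* fuse_main() parses the mount point from argv and runs the loop. */
    return fuse_main(argc, argv, &dufs_ops, NULL);
}
```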

4.2 ZooKeeper-based Metadata Management

We use the ZooKeeper distributed coordination service to handle the consistency issues posed by simultaneous distributed accesses from several DUFS clients. The synchronous ZooKeeper APIs were used for this purpose.

With our design, ZooKeeper stores part of the virtual filesystem metadata. It keeps track of the directories and files that are created: a separate Znode is created in ZooKeeper for each directory or file, and the virtual filesystem hierarchy is represented inside ZooKeeper using Znodes.

ZooKeeper associates several information fields with each Znode. Some of the standard fields include the Znode creation time, the list of child Znodes, etc. ZooKeeper also provides a custom data field for each Znode. In DUFS, this custom field indicates whether the Znode represents a directory or a file; in the latter case, the FID of the file is also stored in this field.
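For illustration, the sketch below shows how a file's Znode might be created with the synchronous ZooKeeper C API, placing a one-byte type tag followed by the FID in the custom data field; this layout is an assumption for the sketch, not the documented DUFS format.

```c
#include <zookeeper/zookeeper.h>
#include <stdio.h>

#define ZNODE_TYPE_DIR  'D'
#define ZNODE_TYPE_FILE 'F'

/* Create the Znode for a newly created file: data = [type][fid-hex]. */
int create_file_znode(zhandle_t *zk, const char *vpath, const char *fid_hex)
{
    char data[64];
    int len = snprintf(data, sizeof(data), "%c%s", ZNODE_TYPE_FILE, fid_hex);

    /* Persistent Znode with the open ACL; no created-path buffer needed. */
    return zoo_create(zk, vpath, data, len,
                      &ZOO_OPEN_ACL_UNSAFE, 0, NULL, 0);
}
```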

The ZooKeeper architecture uses multiple ZooKeeper servers, and the data is replicated among all of them. ZooKeeper uses coordination algorithms to ensure that the Znode hierarchy and its contents are consistent across the servers and that all modifications are applied in the same order on every server [14].

All this information is kept in memory, and ZooKeeper servers can be located close to the DUFS clients. Thanks to this, ZooKeeper queries are fast and a large operation throughput can be achieved; this raw throughput is studied in section 4.4.1. The counterpart is that the ZooKeeper servers use a large amount of memory; we study this memory usage in section 4.4.1 as well.

4.2.1 File Identifier

In our design, we use a File Identifier (FID) to uniquely represent the physical contents of a file. This FID is stored in the custom data field of the Znode that corresponds to the virtual path of the file. The FID is designed to be unique for each newly created file; however, modifications to the contents of a file do not require changing the FID.

In DUFS, the FID is a 128-bit integer. We propose a simple approach to generate a unique FID at the DUFS client without requiring any coordination. The FID for a file is generated by the client that initially creates the file. It is the concatenation of a 64-bit client ID, which uniquely identifies the DUFS client instance that created the file, and a 64-bit file creation counter, which records the number of files created during the lifetime of that DUFS client. When a client is restarted, it acquires another unique 64-bit client ID and its creation counter is reset to 0.
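A minimal sketch of this FID scheme, assuming the unique client ID has already been obtained (for example, from the coordination service or configuration):

```c
#include <stdint.h>
#include <stdio.h>

/* 128-bit FID = 64-bit client ID || 64-bit per-client creation counter. */
struct fid {
    uint64_t client_id;
    uint64_t counter;
};

static uint64_t creation_counter;  /* reset to 0 on client restart */

struct fid next_fid(uint64_t client_id)
{
    struct fid f = { client_id, creation_counter++ };
    return f;
}

/* Hexadecimal representation used for the physical filename (32 digits). */
void fid_to_hex(struct fid f, char out[33])
{
    snprintf(out, 33, "%016llx%016llx",
             (unsigned long long)f.client_id,
             (unsigned long long)f.counter);
}
```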

The FID is used by DUFS to deduce the physical location of the file and the physical filename. Firstly, the physical location of the data in the underlying filesystem is selected using the deterministic mapping function. Secondly, the filename for the data contents on the physical storage is generated from the FID. In this manner, the contents of a file do not have to be renamed or moved between different physical mounts when the virtual filename is renamed or moved.

4.2.2 Deterministic mapping function

The deterministic mapping function associates a physical location with each file's contents based on its FID. The function takes as input the 128-bit integer representing the FID and returns a number between 1 and N, where N is the number of underlying back-end storage systems. It has to be deterministic so that any DUFS client can find the right location without coordination.

To achieve good load balancing between the different underlying storage mounts, the mapping function has to distribute the FIDs in a fair manner. For this reason, the mapping function of our current implementation is based on the MD5 hash function, which has this property [19]. Our mapping function is:

fid ↦ MD5(fid) mod N
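A sketch of this mapping using OpenSSL's MD5 follows. Folding the last digest bytes into an integer before the modulo, and returning a 0-based index rather than a number in 1..N, are plausible reductions assumed for the sketch, not necessarily the exact DUFS code.

```c
#include <openssl/md5.h>
#include <stdint.h>

/* Map a 128-bit FID to a back-end index in [0, n_mounts). The digest's
 * uniformity gives a fair spread of FIDs across the mounts. */
int fid_to_mount_index(const unsigned char fid[16], int n_mounts)
{
    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5(fid, 16, digest);

    /* Fold the last 8 digest bytes into an integer, then reduce mod N. */
    uint64_t h = 0;
    for (int i = 8; i < 16; i++)
        h = (h << 8) | digest[i];
    return (int)(h % (uint64_t)n_mounts);
}
```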

4.2.3 Back-end storage

Once a particular physical filesystem has been chosen using the deterministic mapping function, the data is accessed directly through the local mount point of that distributed filesystem. The filename is deterministically derived from the FID; thus, it is independent of any virtual filename, and the DUFS client does not need to communicate with any other component to find the actual physical filename.

In DUFS, the physical filename used to store a file is the hexadecimal representation of the FID computed in the previous step. To avoid congestion due to file creation in a single directory, the hexadecimal representation is divided into four parts to create multiple path components. The first component provides the filename, while the other components are used for the directory hierarchy. Figure 4.3 shows an example of the filename on the back-end storage for the FID 0123456789abcdef.

FID: 0123456789abcdef → Physical filename: cdef/89ab/4567/0123

Figure 4.3: Sample physical filename generated from a given FID

This directory hierarchy is static and identical across all the back-end mount points. This static structure avoids any potential conflicts.
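The path construction can be sketched as follows for the 16-hex-digit example of Figure 4.3; a full 128-bit FID has 32 hex digits, so the grouping for longer FIDs is an assumption that mirrors the figure.

```c
#include <stdio.h>

/* Build the physical path for a FID prefix of 16 hex digits, reversing
 * the 4-digit groups so "0123456789abcdef" becomes "cdef/89ab/4567/0123":
 * the first group of the FID ("0123") becomes the filename (the final
 * path component), and the remaining groups form the directories. */
void fid_to_relpath(const char hex[16], char *out, size_t outlen)
{
    snprintf(out, outlen, "%.4s/%.4s/%.4s/%.4s",
             hex + 12, hex + 8, hex + 4, hex);
}
```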

4.3 Algorithm examples for Metadata operations

In this section, we present the algorithms for some metadata operations in DUFS. Figure 4.4 shows the algorithm for the mkdir() operation, and Figure 4.5 shows the algorithm for the stat() operation.

4.3.1 Reliability concerns

The DUFS client does not hold any state: all the required information is stored either in ZooKeeper or on the back-end storage. DUFS reliability therefore relies on ZooKeeper and on the back-end distributed filesystems.

Within ZooKeeper, all the information is replicated among all the servers. Thanks to this, ZooKeeper is able to tolerate the failure of many servers; it needs a majority of the servers alive to maintain consistency of the data [14].

1: Get the virtual path of the directory
2: Look for the corresponding Znode
3: if Znode exists then
4:     return 'File exists' error code
5: else
6:     Generate the data field with type and metadata information
7:     Create the corresponding Znode with ZooKeeper
8:     if success then
9:         return Success
10:    else
11:        Handle error
12:    end if
13: end if

Figure 4.4: Algorithm for the mkdir() operation

1: Get the virtual path of the file/directory
2: Get the corresponding Znode with ZooKeeper
3: if Znode does not exist then
4:     return 'No such file or directory' error code
5: else
6:     ZooKeeper returned the data field (type, FID, ...)
7:     if Znode type is directory then
8:         Fill the struct stat with information stored in ZooKeeper
9:         return struct stat
10:    else
11:        Compute the physical location
12:        Compute the physical path
13:        Perform stat() on the physical file
14:        return struct stat
15:    end if
16: end if

Figure 4.5: Algorithm for the stat() operation

Furthermore, although each ZooKeeper server keeps all its data in memory, the data is periodically checkpointed to disk. The system can therefore tolerate the failure of all servers by restarting them later.

Many distributed filesystems, such as Lustre, provide fault tolerance: data can be replicated among multiple data servers. If such a filesystem is used as the back-end storage, DUFS availability benefits from it.

4.4 Performance Evaluation

In this section, we conduct experiments to evaluate the performance of metadata operations with our proposed design. These tests were performed on a Linux cluster. Each node has dual Intel Xeon E5335 CPUs (8 cores in total) and 6GB of memory. A SATA 250GB hard drive is used as the storage device on each node. The nodes are connected with 1 GigE for general-purpose networking, and each node runs kernel 2.6.30.10. We dedicate a set of nodes as Lustre MDS and OSS (version 1.8.3) to form multiple instances of the Lustre filesystem. Another set of dedicated nodes works as PVFS servers (version 2.8.2) to export multiple instances of the PVFS filesystem. Each client node mounts multiple instances of the Lustre and PVFS filesystems and uses DUFS to merge these distinct physical partitions into a logically unified partition. A ZooKeeper server runs alongside the DUFS clients, providing distributed coordination services over 1 GigE. We have used the mdtest benchmark [13] for our experiments. We carried out experiments by creating a directory structure with a fan-out factor of 10 and a directory depth of 5. As the number of processes increases, the number of files per directory increases accordingly. We have also carried out experiments where many files are created in a single directory. We used the same parameters and configuration while experimenting with the different back-end parallel filesystems, Lustre and PVFS.

4.4.1 Distributed coordination service throughput and memory usage experiments

With the DUFS design, each metadata operation has to go through the ZooKeeper service before it is actually issued to the corresponding physical back-end filesystem. In this section we perform experiments to study ZooKeeper's throughput for basic operations, namely zoo_create(), zoo_get(), zoo_set() and zoo_delete(), using ZooKeeper's synchronous API. With a total of 8 DUFS clients in the experimental setup, we varied the number of ZooKeeper servers from 1 to 8. The results are shown in Figure 4.6. For the zoo_create(), zoo_delete() and zoo_set() operations, we can see that the overall throughput drops as the number of ZooKeeper servers increases. This is the expected behavior, since these operations perform modifications on Znodes; all the ZooKeeper servers have to coordinate to ensure the consistency of their replicated states. For the zoo_get() operation, the overall throughput increases with the number of ZooKeeper servers. ZooKeeper performs very well on read-dominant workloads [14]; indeed, each ZooKeeper server can serve read requests independently of the others.
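A sketch of how such a throughput measurement might look for zoo_get() with the synchronous C API follows; the connection string, Znode path, and iteration count are illustrative assumptions.

```c
#include <zookeeper/zookeeper.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000

int main(void)
{
    /* Connect to an (assumed) local ensemble member; 30s session timeout. */
    zhandle_t *zk = zookeeper_init("localhost:2181", NULL, 30000,
                                   NULL, NULL, 0);
    if (!zk) return 1;

    char buf[128];
    struct Stat st;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++) {
        int len = sizeof(buf);
        zoo_get(zk, "/bench/node", 0, buf, &len, &st); /* synchronous read */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("zoo_get() throughput: %.0f ops/sec\n", ITERATIONS / secs);

    zookeeper_close(zk);
    return 0;
}
```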

Since ZooKeeper keeps all its data in memory, memory usage can be a concern. In the following experiment, we study the memory usage of the ZooKeeper server process, and of DUFS as well, as the amount of metadata grows. We designed a benchmark that creates a large number of directories and reports the resident process memory size. For this experiment, all the processes ran on the same node.

[Figure: four panels of throughput (ops/sec) versus the number of client processes (0-250) for 1, 4, and 8 ZooKeeper servers: (a) zoo_create(), (b) zoo_delete(), (c) zoo_set(), (d) zoo_get().]

Figure 4.6: ZooKeeper throughput for basic operations by varying the number of ZooKeeper Servers

Additionally, in order to compare the memory usage of DUFS, we run the same benchmark on a dummy FUSE filesystem that does nothing except forward requests to a local filesystem.

[Figure: memory usage in MB (0-1,400) versus millions of directories created (0-2.5) for ZooKeeper, DUFS, and the dummy FUSE filesystem.]

Figure 4.7: Zookeeper memory usage and its comparison with DUFS and basic FUSE based file system memory usage

The results are shown in Figure 4.7. We can see that the memory consumed by DUFS is bounded and similar to that of a normal FUSE-based filesystem, which is what we expect. The ZooKeeper memory usage is proportional to the number of created directories or files (the Znode data size is similar for a file or a directory). From these numbers, we estimate that storing one million files or directories requires about 417 MB of memory. This drawback follows from the ZooKeeper design choice of keeping all data in memory.

4.4.2 Scalability Experiments

For the scalability experiments, a ZooKeeper server runs on each DUFS client. We vary the number of client processes from 4 to 256 and the number of physical nodes across 4, 8 and 16. In these experiments, since the ZooKeeper servers are local to the DUFS clients, read requests achieve high throughput, but updates require a higher level of synchronization among the servers of the ensemble.

By varying the number of physical nodes and the number of client processes running on them, we can see that the approach suggested in this chapter performs better than the basic variant of Lustre/PVFS as the number of client processes increases. As expected, directory creation, directory removal, and directory stat perform better. Directory stat, being a read operation, performs exceedingly well compared to the basic variant of Lustre. File operations such as file creation, file removal, and file stat show a similar trend; although they cannot reach as high a throughput as the directory operations, they still outperform the basic variants of the parallel filesystems. For file operations, we have to contact the actual back-end filesystem to get the file attributes, whereas for directory operations most requests are satisfied at the ZooKeeper level itself.

4.4.3 Experiments with varying numbers of distributed coordination service servers

In this section we perform experiments to study the effect of varying the number of ZooKeeper servers. We used a set of 8 nodes with 8 DUFS clients,

[Figure: six panels of throughput (ops/sec) versus the number of processes (0-250), comparing Basic Lustre with DUFS: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.8: Scalability experiments with 8 Client nodes and varying number of client processes

[Figure: six panels of throughput (ops/sec) versus the number of processes (0-250), comparing Basic Lustre with DUFS: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.9: Scalability experiments with 16 Client nodes and varying number of client processes

which use a number of ZooKeeper servers varying from 1 to 8. We measured the operation throughput and compared it against the basic Lustre throughput.

The results are presented in Figure 4.10. As expected, read operations such as file stat() and directory stat() show a significant performance improvement when the number of ZooKeeper servers is increased. For the other operations, the effect of the number of ZooKeeper servers is smaller.

Finally, these results show that using 8 ZooKeeper servers is a good compromise for our configuration.

4.4.4 Experiments with different numbers of mounts combined using DUFS

In this section we perform experiments to study the influence of varying the number of back-end storage systems combined by DUFS. For this experiment we used an ensemble of 8 ZooKeeper servers. Since directory operations do not touch the back-end distributed filesystem, we focus only on file operations in this experiment.

Figure 4.11 shows the throughput of file operations for 2 and 4 back-end storage systems and for different numbers of client processes. We also compare this throughput to the basic Lustre case. Using 4 back-end storage systems instead of 2 provides a small improvement for file creation and removal. For file stat(), we see an improvement of more than 37% with 256 client processes.

Although the file operations are uniformly distributed among the back-end storage systems, there is an indirection through a ZooKeeper server. File removal and creation require a metadata modification, and the cost of this modification overtakes the benefit of multiple back-end storage systems. The file stat() operation only requires reading the metadata,

[Figure: six panels of throughput (ops/sec) versus the number of client processes (64, 128, 256), comparing Basic Lustre with DUFS using 1, 4, and 8 ZooKeeper servers: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.10: Operation throughput by varying the number of Zookeeper Servers

[Figure: six panels of throughput (ops/sec) versus the number of processes (64, 128, 256), comparing Basic Lustre with DUFS merging 2 and 4 Lustre back-end mounts: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.11: File operation throughput for different numbers of back-end storage

which is very fast with ZooKeeper. That is why we see a clear benefit from increasing the number of back-end storage systems in this case.

In any parallel filesystem, if a directory is spread across multiple partitions on a server, then the 'ls -l' operation can be costly. It is even costlier in a larger environment where directories are spread across different partitions on different servers. With the approach presented in this thesis, we obtain a significant improvement in the 'ls -l' operation even though the files may be evenly distributed across different partitions on different servers.

4.4.5 Experiments with different back-end parallel filesystems

In this section, we study the performance of our DUFS prototype in comparison with two distributed filesystems, Lustre and PVFS2. To keep the comparison fair, we also use Lustre and PVFS2 as our back-end storage. We study scalability by increasing the number of client processes.

In these experiments we had 8 DUFS clients and 8 ZooKeeper servers; the ZooKeeper servers and DUFS clients ran on the same nodes.

From Figure 4.12, we can see that DUFS with Lustre as the back-end physical filesystem outperforms basic Lustre, and we see similar results in the PVFS case. One notable point is that the directory operations show a similar trend regardless of the back-end physical mount; this is expected because, in DUFS, directory operations rely only on ZooKeeper. Also, for file operations such as creation, stat, and removal, DUFS with Lustre as the back-end filesystem performs much better than DUFS with PVFS2 as the back-end filesystem. This is because, in that case, the

[Figure: six panels of throughput (ops/sec) versus the number of processes (0-250), comparing Basic Lustre, DUFS merging 2 physical Lustre mounts, Basic PVFS, and DUFS merging 2 physical PVFS mounts: (a) Directory creation, (b) Directory removal, (c) Directory stat, (d) File creation, (e) File removal, (f) File stat.]

Figure 4.12: Operation throughput with respect to the number of clients for Lustre and PVFS2

back-end storage is actually used, and thus the throughput of these operations depends on the performance of the back-end filesystem.

From the scalability point of view, we see that Lustre and PVFS2 do not scale very well: when the number of client processes grows significantly, their performance drops. Conversely, DUFS does not perform as well at small scale, but it outperforms Lustre for all operations at 256 client processes. In all cases, DUFS with PVFS2 back-end storage is clearly better than PVFS2 alone.

For directory creation with 256 client processes, DUFS outperforms Lustre by a factor of 1.9, and PVFS2 by a factor of 23.

Finally, we can see that for directory and file stat, the approach discussed in this chapter performs exceedingly well compared to the basic variants, i.e., Lustre and PVFS2. With respect to file stat() with 256 processes, our approach is 1.3 and 3.0 times faster than Lustre and PVFS, respectively. This is mainly because ZooKeeper performs well on read-dominant workloads.

4.5 Summary

We have designed a Distributed Metadata Service Layer and evaluated its benefits to parallel filesystems. Distributed metadata management is a hard problem, since it involves taking care of various consistency and reliability aspects. Scaling metadata performance is also more complex than scaling raw I/O performance, and with distributed metadata this complexity increases further. This leads to the primary goal in designing a Distributed Metadata Service Layer: to improve scalability while taking care of consistency and reliability. With our approach, we are able to maintain good performance even with a large number of clients. With 256 client processes, we outperform Lustre for all six metadata operations, namely directory creation, directory removal, directory stat, file creation, file removal and file stat.

Chapter 5: CONTRIBUTIONS AND FUTURE WORK

In this thesis, we have designed approaches for managing metadata in parallel filesystems. Our work involved the design of a scheme to delegate metadata to the client side so as to minimize the load on a single metadata server (MDS). We also designed a new approach for distributed metadata management and addressed the various challenges it faces.

5.1 Summary of Research Contributions and Future Work

The research in this thesis aims at solving two important problems faced in parallel filesystem environments:

1. A single metadata server is a bottleneck. We have designed a metadata management scheme that minimizes the load on a single MDS.

2. Recent trends in high-performance computing have seen a shift toward distributed resource management. With distributed metadata, we need to take care of complex issues related to reliability and consistency; there is always a trade-off between maintaining reliability and consistency in the filesystem and achieving a scalable solution. Our approach tackles distributed metadata management with the primary aim of maintaining the reliability and consistency of the filesystem while at the same time improving its scalability.

5.1.1 Delegating metadata at client side

We have described a mechanism for minimizing the load on a single metadata server for the Lustre filesystem. A single metadata server managing the entire filesystem namespace is common in most parallel filesystem approaches to metadata management. In this design we minimize the load on the MDS, and hence the memory pressure on the MDS, by delegating metadata to the client side. We evaluated our design and compared it with the basic variant of Lustre. For a metadata operation such as file open(), the throughput increases as the number of client processes increases, whereas with the basic variant of Lustre the throughput decreases. We see similar behavior when the number of files in the file pool is increased. One of the primary reasons for the slowdown in the basic variant of Lustre is that, as the file pool size grows, the amount of file metadata to be kept in the MDS cache increases.

The MDS metadata cache is not flushed until it reaches a threshold, which varies depending on the physical memory of the MDS. But if metadata that was initially in the cache gets flushed out and is then accessed by a client, the MDS has to perform disk I/O to fetch the needed metadata from disk, which is a costly operation. With the design proposed in this chapter, we instead perform an extra hop to the client that holds the metadata for the file to be accessed; with low-latency, high-bandwidth interconnects such as InfiniBand, the cost of this extra hop is negligible compared to expensive disk I/O. In brief, the design takes advantage of subtree partitioning and hashing-based approaches to minimize the load on the MDS and to prevent it from becoming a single point of bottleneck.

In the future, we plan to study the use of this approach in MPI-IO-style environments, where it would be especially beneficial. In such an environment, a single client can traverse the path and obtain the EA information and the striping details. This information can then be broadcast to the other processes using an MPI broadcast, saving a considerable amount of time in path resolution. The number of RPCs saved is approximately (number of path components) x (number of clients accessing the file). We also plan to design a scheme for distributed metadata management.
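A sketch of this future direction, in which rank 0 resolves the path once and broadcasts the result to all other processes; the struct layout, field sizes, and the resolve_path_via_mds() helper are illustrative assumptions.

```c
#include <mpi.h>

/* Hypothetical container for the metadata one client resolves on
 * behalf of all others: EA information plus striping details. */
struct file_meta {
    char ea[256];        /* extended attribute blob (illustrative) */
    int  stripe_count;   /* striping details, as mentioned above   */
    long stripe_size;
};

/* Assumed helper: one client resolves the path via RPCs to the MDS. */
void resolve_path_via_mds(const char *path, struct file_meta *meta);

/* Rank 0 performs the (expensive) path traversal and RPCs once, then
 * shares the result, avoiding (path components) x (clients) RPCs. */
void fetch_and_share_meta(const char *path, struct file_meta *meta)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        resolve_path_via_mds(path, meta);
    MPI_Bcast(meta, sizeof(*meta), MPI_BYTE, 0, MPI_COMM_WORLD);
}
```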

5.1.2 Design of a decentralized metadata service layer for distributed metadata management

We have designed a Distributed Metadata Service Layer and evaluated its benefits to parallel filesystems. Distributed metadata management is a hard problem, since it involves taking care of various consistency and reliability aspects. Scaling metadata performance is also more complex than scaling raw I/O performance, and with distributed metadata this complexity increases further. This leads to the primary goal in designing a Distributed Metadata Service Layer: to improve scalability while taking care of consistency and reliability. To study this topic, we have designed a FUSE-based filesystem, the Distributed Union File System (DUFS). DUFS can combine multiple mounts of a parallel filesystem into a single filesystem abstraction that is exposed to the user applications. We used ZooKeeper as a distributed coordination service to take care of metadata reliability and consistency management. Our ZooKeeper-based prototype shows the main trends that can be expected when using a distributed coordination service for metadata management. From our experiments, we can see that with a higher number of processes running on the client nodes, and as the load on the client nodes increases, the approach proposed in this thesis scales well compared to the other distributed filesystems studied, Lustre and PVFS2. While Lustre performs very well for small numbers of clients, its performance drops as the number of clients increases. With our approach, we are able to maintain good performance even with a large number of clients: with 256 client processes, we outperform Lustre for the six metadata operations, namely directory creation, directory removal, directory stat, file creation, file removal and file stat.

One major drawback of our approach is memory usage, because the ZooKeeper servers keep all their data in memory. Future work will focus on addressing this issue. Additionally, we plan to replace our MD5-based mapping function with one based on consistent hashing [26]. This approach will allow back-end storage to be added and removed dynamically while ensuring that the amount of data to relocate remains bounded.

Bibliography

[1] Clustered MetaData. http://wiki.lustre.org/index.php/Clustered_Metadata.

[2] IOZONE Filesystem benchmark. http://www.iozone.org/.

[3] Isilon Systems Inc. http://www.isilon.com.

[4] Oracle Lustre File System. http://wiki.lustre.org/index.php/Main_Page.

[5] Postmark File System Benchmark. http://shub-internet.org/brad/FreeBSD/postmark.html.

[6] Amina Saify, Garima Kochhar, Jenwei Hsieh, and Onur Celebioglu. Enhancing High-Performance Clusters with Parallel File Systems.

[7] Peter J. Braam and Michael Callahan. The InterMezzo file system, 1999.

[8] Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Lan Xue. Efficient metadata management in large distributed storage systems. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), MSS '03, Washington, DC, USA. IEEE Computer Society.

[9] Mike Burrows. The chubby lock service for loosely-coupled distributed systems. OSDI ’06, Berkeley, CA, USA. USENIX Association.

[10] Philip H. Carns, Walter B. Ligon, III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference. MIT Press, 2000.

[11] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A distributed storage system for structured data.

[12] G. Goodson, B. Welch, B. Halevy, D. Black, and A. Adamson. NFSv4 pNFS extensions. Technical report.

[13] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP '03, New York, NY, USA, 2003. ACM.

[14] Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, USENIXATC'10, Berkeley, CA, USA, 2010. USENIX Association.

[15] James H. Morris, Mahadev Satyanarayanan, Michael H. Conner, John H. Howard, David S. Rosenthal, and F. Donelson Smith. Andrew: a distributed personal computing environment. Commun. ACM.

[16] Swapnil V. Patil, Garth A. Gibson, Sam Lang, and Milo Polte. GIGA+: scalable directories for shared file systems. In Proceedings of the 2nd International Workshop on Petascale Data Storage: held in conjunction with Supercomputing '07, PDSW '07, New York, NY, USA, 2007. ACM.

[17] Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, and David Hitz. NFS version 3 - design and implementation. In Proceedings of the Summer USENIX Conference, pages 137-152, 1994.

[18] David Quigley, Josef Sipek, Charles P. Wright, and Erez Zadok. Unionfs: user- and community-oriented development of a unification filesystem. In Proceedings of the 2006 Linux Symposium, 2006.

[19] Ronald L. Rivest. The MD5 message digest algorithm. Internet RFC 1321, 1992.

[20] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. A comparison of file system workloads. In Proceedings of the Annual USENIX Technical Conference, ATEC '00, Berkeley, CA, USA, 2000. USENIX Association.

[21] Mahadev Satyanarayanan, James J. Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel, and David C. Steere. Coda: a highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39:447-459, 1990.

[22] Frank Schmuck and Roger Haskin. GPFS: a shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies, FAST '02. USENIX Association.

[23] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC '04, Washington, DC, USA, 2004. IEEE Computer Society.

[24] Gongye Zhou, Qiuju Lan, and Jincai Chen. A dynamic metadata equipotent subtree partition policy for mass storage system. In Proceedings of the 2007 Japan-China Joint Workshop on Frontier of Computer Science and Technology, FCST '07, Washington, DC, USA. IEEE Computer Society.

[25] Yifeng Zhu, Hong Jiang, and J. Wang. Hierarchical bloom filter arrays (hba): a novel, scalable metadata management system for large cluster-based storage. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, Washington, DC, USA. IEEE Computer Society.
