Fast Storage for File System Metadata
Total Page:16
File Type:pdf, Size:1020Kb
Fast Storage for File System Metadata Kai Ren CMU-CS-17-121 2017/09/26 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee: Garth A. Gibson, Chair Dave G. Andersen Greg R. Ganger Brent B Welch (Google) Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Copyright © 2017 Kai Ren This research was sponsored by the Moore Foundation under grant number 2160, Los Alamos National Security, LLC under grant numbers 161465-1 and 288103, and the ISTC-CC. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity. Keywords: File System, Metadata Management, Log-Structured Approach, Caching Dedicated to my family. iv Abstract In an era of big data, the rapid growth of data that many companies and organizations produce and manage continues to drive efforts to im- prove the scalability of storage systems. The number of objects presented in storage systems continue to grow, making metadata management crit- ical to the overall performance of file systems. Many modern parallel applications are shifting toward shorter durations and larger degrees of parallelism. Such trends continue to make storage systems to experience more diverse metadata intensive workloads. The goal of this dissertation is to improve metadata management in both local and distributed file systems. The dissertation focuses on two aspects. One is to improve the out-of-core representation of file system metadata, by exploring the use of log-structured multi-level approaches to provide a unified and efficient representation for different types of secondary storage devices (e.g., traditional hard disk and solid state disk). The other aspect is to demonstrate that such representation also can be flexibly integrated with many namespace distribution mechanisms to scale metadata performance of distributed file systems, and provide better support for a variety of big data applications in data center environment. vi Acknowledgments First and foremost, I am greatly indebted to my advisor, Garth Gib- son for his guidance and support of my PhD research. I would also like to thank my thesis committee members Dave Andersen, Greg Ganger, and Brent Welch for their insightful feedback and comments. I would also like to express my gratitude towards my collaborators, collegues, and friends with whom I spent amazing six years: Yoshihisa Abe, Magdalena Balazinska, Vishnu Naresh Boddeti, Zhuo Chen, Christos Faloutsos, Bin Fan, Bin Fu, Gary Grider, Fan Guo, Bill Howe, Wenlu Hu, Junchen Jiang, Jin Kyu Kim, Yunchuan Kong, YongChul Kwon, Ni Lao, Lei Li, Boyan Li, Hyeontaek Lim, Liu Liu, Xi Liu, Yanjin Long, Julio Lopez, Iulian Moraru, Swapnil Patil, Andew Pavlo, Richard Peng, Amar Phanishayee, Milo Polte, Long Qin, Raja R. Sambasivan, Ilari Shafer, Julian Shun, Jiri Simsa, Pingzhong Tang, Wittawat Tantisiriroj, Yuandong Tian, Yuan- dong Tian, Vijay Vasudevan, Jingliang Wei, Guang Xiang, Lin Xiao, Lianghong Xu, Zi Yang, Junming Yin, Yin Zhang, Xin Zhang, Le Zhao, Qing Zheng, Dong Zhou, Zongwei Zhou. Finally, this thesis would not be possible without the support from my parents Yanjun Ren and Weinian Zhang, as well as my wife, Jieqiong Liu. viii Contents Contents ix List of Figures xiii List of Tables xix 1 Introduction1 1.1 Thesis Statement.............................2 1.2 Results Overview.............................3 1.2.1 Out-of-core Metadata Representation..............3 1.2.2 In-memory Index Optimization.................4 1.2.3 Distributed Metadata Management...............4 1.3 Thesis Contribution............................5 2 Background and Related Work7 2.1 Overview of File System Architecture..................7 2.2 In-Memory and External Memory Index................8 2.2.1 In-Memory Indexes........................9 2.2.2 External Memory Indexes.................... 10 2.3 Metadata Management for Local File Systems............. 11 2.3.1 Optimizations for Traditional Disk Model........... 11 2.3.2 Optimizations for Solid State Disks............... 13 ix 2.4 Metadata Management for Distributed Systems............ 13 2.4.1 Namespace Distribution..................... 13 2.4.2 Metadata Caching........................ 14 2.4.3 Large Directories Support.................... 14 2.4.4 Bulk Loading Optimization................... 15 2.5 Storage Systems without File System APIs............... 15 3 Metadata Workload Analysis 19 3.1 Data Sets Overview............................ 20 3.2 File System Namespace Statistics.................... 21 3.3 Dynamic Behaviors in Metadata Workload............... 23 3.4 Statistical Properties of Metadata Operations............. 26 3.5 Lessons from Workload Analysis..................... 29 4 TableFS: A Stacked File System Design Layering on Key-Value Store 31 4.1 Background................................ 32 4.1.1 Analysis of File System Metadata Operations......... 32 4.1.2 LSM-Tree and Its Implementation LevelDB.......... 33 4.2 Design Overview of TableFS....................... 36 4.2.1 Local File System as Object Store................ 36 4.2.2 Table Schema........................... 37 4.2.3 Hard Links............................ 38 4.2.4 Scan Operation Optimization.................. 39 4.2.5 Inode Number Allocation.................... 39 4.2.6 Concurrency Control....................... 39 4.2.7 Journaling............................. 40 4.2.8 Column-Style Table for Faster Insertion............. 40 4.2.9 TableFS in the Kernel...................... 41 x 4.3 Evaluation................................. 43 4.3.1 Evaluation System........................ 43 4.3.2 Data-Intensive Macrobenchmark................. 43 4.3.3 TableFS-FUSE Overhead Analysis............... 46 4.3.4 Metadata-Intensive Microbenchmark.............. 49 4.3.5 Column-Style Metadata Storage Schema............ 55 4.4 Summary................................. 57 5 SlimFS: Space Efficient Indexing and Balanced Read-Write Perfor- mance 59 5.1 The Analysis of Log-Structured Designs................ 61 5.1.1 I/O Cost Analysis of Log-Structured Merge Tree........ 61 5.1.2 Stepped-Merge Algorithm: Reducing Write Amplification in Compaction............................ 62 5.1.3 Optimizing In-memory Indexes and Filters........... 63 5.2 The Design Overview of SlimFS..................... 65 5.2.1 The SlimFS Architecture..................... 65 5.2.2 SlimDB’s Compact Index and Multi-Store Design....... 66 5.3 Design of Compact Index and Filter in SlimDB............ 67 5.3.1 Three-Level Index: Compact Block Index for SSTable..... 67 5.3.2 Multi-Level Cuckoo Filter: Improve Tail Latency....... 72 5.3.3 Implementation of SlimDB.................... 77 5.4 Analytic Model for Selecting Indexes and Filters............ 77 5.5 Evaulation................................. 80 5.5.1 Evaluation System........................ 80 5.5.2 Full System Benchmark..................... 81 5.5.3 Compact SSTable Index Microbechmark............ 85 5.5.4 Multi-Level Cuckoo Filters Microbenchmark.......... 87 5.6 Summary................................. 89 xi 6 IndexFS: Metadata Management For Distributed File Systems 91 6.1 IndexFS System Design......................... 92 6.1.1 Dynamic Namespace Partitioning................ 93 6.1.2 Stateless Directory Caching................... 95 6.1.3 Integration with Log-Structured Metadata Storage...... 97 6.1.4 Metadata Bulk Insertion..................... 99 6.1.5 Rename Operation........................ 100 6.1.6 Fault Tolerance.......................... 102 6.2 Comparison of System Designs...................... 103 6.2.1 Table partitioned namespace (Giraffa)............. 104 6.2.2 Replicated directories with sharded files (ShardFS)...... 106 6.2.3 Comparison Summary...................... 109 6.3 Experimental Evaluation......................... 112 6.3.1 Large Directory Scaling..................... 114 6.3.2 Metadata Client Caching..................... 116 6.3.3 Load Balancing.......................... 119 6.3.4 Bulk Insertion and Factor Analysis............... 123 6.3.5 Portability to Multiple File Systems............... 124 6.4 Summary of IndexFS Benefits...................... 126 7 Conclusion and Future Work 129 7.1 Future Work................................ 130 Bibliography 131 xii List of Figures 2.1 Distributed file system architecture. .......................................8 3.1 The distribution of 50th, 90th, and 100th percentile, and maximum directory sizes and file sizes of collected file system samples...... 21 3.2 This scatter plot demonstrates the relation between the size of each file system sample and its maximum directory size. The black line is the trend lines that predicts the relation using linear regression.... 22 3.3 The distribution of directory depth of collected file systems traces. Many file systems have 50% of their directories with a depth lower than 9................................... 23 3.4 This scatter plot demonstrates the relationship between the size of each file system sample and the average depth of objects inside the system. The average depth of the tree increases sub-linearly as the file system size grows........................... 24 3.5 Distribution of file system operations in LinkedIn, Yahoo! and Open- Cloud traces. Most of these workloads are read-intensive, but compli- cated operations like rename also have a significant presence..... 25 3.6 The distribution of each type of metadata operation in a one-day LinkedIn trace over