TABLEFS: Enhancing Metadata Efficiency in the Local File System
Kai Ren, Garth Gibson
Carnegie Mellon University
{kair, garth}@cs.cmu.edu

Abstract

File systems that manage magnetic disks have long recognized the importance of sequential allocation and large transfer sizes for file data. Fast random access has dominated metadata lookup data structures with increasing use of B-trees on-disk. Yet our experiments with workloads dominated by metadata and small file access indicate that even sophisticated local disk file systems like Ext4, XFS and Btrfs leave a lot of opportunity for performance improvement in workloads dominated by metadata and small files.

In this paper we present a stacked file system, TABLEFS, which uses another local file system as an object store. TABLEFS organizes all metadata into a single sparse table backed on disk by a Log-Structured Merge (LSM) tree, LevelDB in our experiments. By stacking, TABLEFS asks only for efficient large file allocation and access from the underlying local file system. By using an LSM tree, TABLEFS ensures metadata is written to disk in large, non-overwrite, sorted and indexed logs. Even an inefficient FUSE-based user-level implementation of TABLEFS can perform comparably to Ext4, XFS and Btrfs on data-intensive benchmarks, and can outperform them by 50% to as much as 1000% on metadata-intensive workloads. Such promising performance results from TABLEFS suggest that local disk file systems can be significantly improved by more aggressive aggregation and batching of metadata updates.

1 Introduction

In the last decade parallel and Internet service file systems have demonstrated effective scaling for high-bandwidth, large file transfers [48, 13, 17, 25, 38, 39]. The same, however, is not true for workloads that are dominated by metadata and tiny file access [34, 49]. Instead there has emerged a class of scalable small-data storage systems, commonly called key-value stores, that emphasize simple (NoSQL) interfaces and large in-memory caches [2, 24, 33].

Some of these key-value stores feature high rates of change and efficient out-of-memory Log-Structured Merge (LSM) tree structures [8, 23, 32]. An LSM tree can provide fast random updates, inserts and deletes without sacrificing lookup performance [5]. We believe that file systems should adopt the LSM tree techniques used by modern key-value stores to represent metadata and tiny files, because LSM trees aggressively aggregate metadata. Moreover, today's key-value store implementations are "thin" enough to provide the performance levels required by file systems.

In this paper we present experiments in the most mature and restrictive of environments: a local file system managing one magnetic hard disk. We used a LevelDB key-value store [23] to implement TABLEFS, our POSIX-compliant stacked file system, which represents metadata and tiny files as key-value pairs. Our results show that for workloads dominated by metadata and tiny files, it is possible to improve the performance of the most modern local file systems in Linux by as much as an order of magnitude. Our demonstration is more compelling because it begins disadvantaged: we use an interposed file system layer [1] that represents metadata and tiny files in a LevelDB store whose LSM tree and log segments are stored in the same local file systems we compete with.
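To make this idea concrete, the sketch below shows one way file metadata and tiny-file contents could be packed into key-value pairs, with the key built from the parent directory's inode number and the entry name so that a directory's entries stay adjacent in the sorted order an LSM tree maintains on disk. The layouts shown here (InodeValue, packKey, and the std::map standing in for the key-value store) are assumptions chosen for illustration only, not the schema TABLEFS actually uses.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative only: attributes plus, for tiny files, the file contents
// inlined into the same value as the metadata.
struct InodeValue {
    uint64_t inode_num = 0;
    uint32_t mode = 0;
    uint64_t size = 0;
    uint64_t mtime = 0;
    std::string inline_data;   // tiny file contents embedded with metadata
};

// Assumed key layout: <parent inode bytes, entry name>. All entries of one
// directory share the same 8-byte prefix, so they sort contiguously.
static std::string packKey(uint64_t parent_inode, const std::string& name) {
    std::string key(reinterpret_cast<const char*>(&parent_inode),
                    sizeof(parent_inode));
    key += name;
    return key;
}

// A std::map stands in for the LSM-tree store (e.g., LevelDB) in this sketch.
using MetadataTable = std::map<std::string, InodeValue>;

void createFile(MetadataTable& table, uint64_t parent, const std::string& name,
                const InodeValue& attrs) {
    table[packKey(parent, name)] = attrs;   // one "put" creates the metadata
}

const InodeValue* lookup(const MetadataTable& table, uint64_t parent,
                         const std::string& name) {
    auto it = table.find(packKey(parent, name));
    return it == table.end() ? nullptr : &it->second;
}
```

Because an LSM tree batches such puts into large sorted, indexed logs, many small metadata mutations become a few large sequential writes.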
2 Background

Even in the era of big data, most things in many file systems are small [10, 28]. Inevitably, scalable systems should expect the number of small files to soon reach and exceed billions, a known challenge for both the largest [34] and most local file systems [49]. In this section we review implementation details of the systems employed in our experiments: Ext4, XFS, Btrfs and LevelDB.

2.1 Local File System Structures

Ext4 [26] is the fourth generation of Linux ext file systems and, of the three we study, the most like traditional UNIX file systems. Ext4 divides the disk into block groups, similar to cylinder groups in traditional UNIX, and stores in each block group a copy of the superblock, a block group descriptor, a bitmap describing free data blocks, a table of inodes and a bitmap describing free inodes, in addition to the actual data blocks. Inodes contain a file's attributes (such as the file's inode number, ownership, access mode, file size and timestamps) and four extent pointers for data extents or a tree of data extents. The inode of a directory contains links to an HTree (similar to a B-Tree) that can be one or two levels deep, based on a 32-bit hash of the directory entry's name. By default only changes to metadata are journaled for durability, and Ext4 asynchronously commits its journal to disk every five seconds. When committing pending data and metadata, data blocks are written to disk before the associated metadata is written to disk.

XFS [47], originally developed by SGI, aggressively and pervasively uses B+ trees to manage all file structures: free space maps, file extent maps, directory entry indices and dynamically allocated inodes. Because all file sizes, disk addresses and inode numbers are 64 bits in XFS, index structures can be large. To reduce the size of these structures, XFS partitions the disk into allocation groups, clusters allocation within an allocation group, and uses allocation-group-relative pointers. Free extents are represented in two B+ trees: one indexed by the starting address of the extent and the other indexed by the length of the extent, to enable efficient search for an appropriately sized extent. Inodes contain either a direct extent map or a B+ tree of extent maps. Each allocation group has a B+ tree indexed by inode number. Inodes for directories have a B+ tree for directory entries, indexed by a 32-bit hash of the entry's file name. XFS also journals metadata for durability, committing the journal asynchronously when a log buffer (256 KB by default) fills, or synchronously on request.

Btrfs [22, 36] is the newest and most sophisticated local file system in our comparison set. Inspired by Rodeh's copy-on-write B-tree [35], as well as features of XFS, NetApp's WAFL and Sun's ZFS [3, 18], Btrfs copies any B-tree node to an unallocated location when it is modified. Provided the modified nodes can be allocated contiguously, B-tree update writing can be highly sequential; however, more data must be written than is minimally needed (write amplification). The other significant feature of Btrfs is its collocation of different metadata components in the same B-tree, called the FS tree. The FS tree contains inodes, directory entries and file extent maps, distinguished by a type field: INODE_ITEM for inodes, DIR_ITEM and DIR_INDEX for directory entries, and EXTENT_DATA_REF for file extent maps. Directory entries are stored twice so that they can be ordered differently: in one the offset field of the FS tree index (for the directory's inode) is the hash of the entry's name, for fast single-entry lookup, and in the other the offset field is the child file's inode number. The latter allows a range scan of the FS tree to list the inodes of child files and accelerate user operations such as ls+stat. Btrfs, by default, delays writes for 30 seconds to increase disk efficiency, and metadata and data are in the same delay queue.
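A rough sketch of that dual ordering is below: one key for a directory entry is ordered by a hash of its name (fast point lookup), and a second key is ordered by the child's inode number (fast ordered scans for ls+stat). The key layout, the type constants and the hash64 helper are simplified placeholders for illustration, not Btrfs's actual on-disk definitions.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Simplified stand-in for an FS-tree key: (objectid, item type, offset).
struct FsTreeKey {
    uint64_t objectid;   // here: the directory's inode number
    uint8_t  type;       // which kind of item this key names
    uint64_t offset;     // hash(name) or child inode number, depending on type
};

constexpr uint8_t DIR_ITEM_KEY  = 84;   // illustrative type codes
constexpr uint8_t DIR_INDEX_KEY = 96;

// Illustrative 64-bit name hash; Btrfs uses its own hash function.
static uint64_t hash64(const std::string& name) {
    return std::hash<std::string>{}(name);
}

// One directory entry is indexed twice, so both access patterns are fast.
struct DirEntryKeys {
    FsTreeKey by_name_hash;   // point lookup: open("/dir/name")
    FsTreeKey by_child_inode; // range scan: list child inodes in order
};

DirEntryKeys indexEntry(uint64_t dir_inode, const std::string& name,
                        uint64_t child_inode) {
    return DirEntryKeys{
        {dir_inode, DIR_ITEM_KEY,  hash64(name)},
        {dir_inode, DIR_INDEX_KEY, child_inode},
    };
}
```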
2.2 LevelDB and its LSM Tree

Inspired by a simpler structure in BigTable [8], LevelDB [23] is an open-source key-value storage library that features a Log-Structured Merge (LSM) tree [32] for on-disk storage. It provides simple APIs such as GET, PUT, DELETE and SCAN (an iterator). Unlike BigTable, not even single-row transactions are supported in LevelDB. Because TABLEFS uses LevelDB, we review its design in greater detail in this section.

In a simple model of an LSM tree, a memory buffer cache delays writing new and changed entries until it has accumulated a significant amount of changes to record on disk. Delayed writes are made more durable by redundantly recording new and changed entries in a write-ahead log, which is pushed to disk periodically and asynchronously by default.

In LevelDB, by default, a set of changes is spilled to disk when the total size of modified entries exceeds 4 MB. When a spill is triggered, called a minor compaction, the changed entries are sorted, indexed and written to disk in a format known as an SSTable [8]. These entries may then be discarded by the memory buffer and can be reloaded by searching each SSTable on disk, possibly stopping when the first match occurs if the SSTables are searched from most recent to oldest. The number of SSTables that need to be searched can be reduced by maintaining a Bloom filter [7] on each, but with increasing numbers of records the disk access cost of finding a record not in memory increases. Scan operations in LevelDB are used to find neighbor entries, or to iterate through all key-value pairs within a range. When performing a scan operation, LevelDB first searches each SSTable to place a cursor; it then advances the cursors in the multiple SSTables and merges key-value pairs in sorted order. Compaction is the process of combining multiple SSTables into a smaller number of SSTables by merge sort.
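The short program below exercises the API surface described above through LevelDB's C++ interface: a put, a get, and an iterator-based scan over a key range. The database path and keys are arbitrary placeholders and error handling is minimal; it is a sketch of how a client such as TABLEFS might drive the store, not TABLEFS's actual code.

```cpp
#include <cassert>
#include <iostream>
#include <string>

#include "leveldb/db.h"

int main() {
    leveldb::DB* db = nullptr;
    leveldb::Options options;
    options.create_if_missing = true;            // create the store on first use

    // "/tmp/tablefs_demo" is an arbitrary placeholder path.
    leveldb::Status s = leveldb::DB::Open(options, "/tmp/tablefs_demo", &db);
    assert(s.ok());

    // PUT: buffered in memory (and the write-ahead log) until spilled to an SSTable.
    s = db->Put(leveldb::WriteOptions(), "dir1/fileA", "attrs-for-fileA");
    assert(s.ok());

    // GET: checked against the memory buffer first, then SSTables newest to oldest.
    std::string value;
    s = db->Get(leveldb::ReadOptions(), "dir1/fileA", &value);
    if (s.ok()) std::cout << "fileA -> " << value << "\n";

    // SCAN: the iterator merges cursors across SSTables and returns keys in
    // sorted order, which is how neighboring entries are listed.
    leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
    for (it->Seek("dir1/");
         it->Valid() && it->key().ToString() < "dir1/\xff";
         it->Next()) {
        std::cout << it->key().ToString() << " -> "
                  << it->value().ToString() << "\n";
    }
    delete it;
    delete db;
    return 0;
}
```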