File Systems: More Cooperation, Less Integration

Alex Depoutovitch, VMware, Inc., [email protected]
Andrei Warkentin, VMware, Inc., [email protected]

Abstract

Conventionally, file systems manage the storage space available to user programs and provide it through the file interface. Information about the physical location of used and unused space is hidden from users. This makes the file system's free space unavailable to other storage stack kernel components, for performance or layering-violation reasons. It also forces file system architects to integrate additional functionality, like snapshotting and volume management, inside file systems, increasing their complexity.

We propose a simple and easy to implement file system interface that allows different software components to efficiently share free storage space with a file system at the block level. We demonstrate the benefits of the new interface by optimizing an existing volume manager to store snapshot data in the file system's free space, instead of requiring the space to be reserved in advance, which makes it unavailable for other uses.

[Figure 1: Storage stack. A user-space application sits on top of a kernel storage stack consisting of the file system, volume manager, software RAID, and disk, stacked in that order.]

1 Introduction

A typical operating system storage stack consists of functional elements that provide many important features: volume management, snapshots, high availability, data redundancy, file management, and more. The traditional UNIX approach is to separate each piece of functionality into its own kernel component and to stack these components on top of one another (Fig. 1). Communication typically occurs only between adjacent components, using a linear block device abstraction.

This design creates several problems. First, users have to know at installation time how much space must be reserved for the file system and for each of the kernel components. This reservation cannot be easily changed after the initial configuration. For example, when deploying a file system on top of the LVM volume manager, users have to know in advance whether they intend to use snapshots, and choose the size of the file system accordingly to leave enough storage space to hold the snapshots [2]. This reserved space is not available to user programs, and snapshots cannot grow larger than the reserved space. The end result is wasted storage capacity and unnecessary over-provisioning. Second, kernel components underneath the file system have to resort to storing their data within blocks at fixed locations. For example, Linux MD software RAID expects its internal metadata in the last blocks of the block device [4]. Because of this, MD cannot be installed without relocating an existing file system.¹

To solve the above problems, many modern file systems, like ZFS, BTRFS, and WAFL, tend to incorporate a wide range of additional functionality, such as software RAID arrays, a volume manager, and/or snapshots [1, 6, 11]. Integrating all these features in a single software component brings a lot of advantages, such as flexibility and ease of storage configuration, a shared free-space pool, and potentially more robust storage allocation strategies.

¹There is an option to store the metadata on a separate file system, but this requires a separate file system to be available.

However, the additional functionality of such "monolithic" file systems results in a complex and large source code. For example, BTRFS contains twice as many lines of code as EXT4, and four times more than EXT3; the ZFS source code is even larger. A large code base results in software that is more error-prone and harder to modify. BTRFS development started in 2008, and a stable version is yet to be released. In addition, these file systems put users in an "all or nothing" position: if one needs the BTRFS volume management or snapshotting features, one has to accept its slower performance in many benchmarks [13].

We believe that a file system can and should provide centralized storage space management for stateful components of the storage stack without integrating them inside the file system code. File systems are ideally suited for this role, as they already implement disk space management. We propose a new file system interface, called Block Register (BR), that allows both user-mode applications and kernel components to dynamically reserve storage from a single shared pool of free blocks. With the BR interface, stateful storage stack components do not need to reserve storage space outside the file system; instead, they can ask the file system to reserve the necessary space for them.

With the help of the new interface, file systems such as EXT4 gain advantages like more effective storage resource utilization, reduced wasted space, and flexible system configuration, without integrating additional functionality inside the file system. The BR interface also solves the problem of fixed metadata locations by allowing kernel components to dynamically query for the location of the blocks they need.
We demonstrate the benefits of the BR interface by integrating it with the LVM snapshot mechanism, allowing snapshots to use free file system blocks instead of preallocated storage space. Importantly, our interface can be exposed by existing file systems with no changes to their disk layout and minimal or no changes to their code.

2 Block Register Interface

The main purpose of the Block Register interface is to allow kernel components to ask the file system to reserve blocks of storage, increase and decrease the number of reserved blocks, query for blocks that have been reserved before, and release blocks back for future reuse. While we want this interface to provide both fine-grained control and extensive functionality, we also make sure that it can be implemented by any general-purpose file system without core changes to its design and on-disk layout, and with no or minimal code changes. The interface needs to be generic and easy to use, with simple abstractions. Although we think that kernel components will be the primary users of the Block Register interface, user-mode applications might benefit from it as well, for example, to get access to data saved by kernel components. Therefore, the Block Register interface has to be accessible from both user and kernel modes.

The main abstraction we provide is a reservation. A reservation is a set of blocks on a block device, identified by their block numbers. These blocks belong to the caller and can be accessed directly by calling the block device driver for the device on which the file system is located. The file system will neither try to use these blocks itself nor include them in other reservation requests. Each reservation is uniquely identified by its name. Table 1 lists the functions of the Block Register API; currently, we define five functions.

    brreserve(name, block_num)
    brquery(name, offset, block_num) -> set of blocks
    brexpand(name, block_num)
    brtruncate(name, block_num)
    brfree(name)

Table 1: Block Register interface functions.

When a caller needs to reserve a set of blocks, it calls the brreserve() function, specifying a unique name for this reservation and the number of blocks requested. In response, the file system reserves the required number of blocks from its pool of free blocks.

When a caller needs the list of blocks in an existing reservation, it calls the brquery() function, passing it the name of the reservation and the range of blocks required. The range is specified by the offset of the first block in the query and the number of blocks to return. An existing reservation can grow and shrink dynamically through calls to brexpand() and brtruncate(): brexpand() takes the reservation name and the number of blocks to expand by, while brtruncate() takes the number of blocks to return back to the file system. Finally, an existing reservation can be removed by calling the brfree() function. After a reservation is removed, all blocks belonging to it are returned to the file system's pool of free blocks.
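To make the calling convention concrete, the following is a sketch of how the Table 1 functions might be declared for kernel callers. The paper does not define concrete types or an error convention, so the sector_t arguments, the output array of brquery(), and the negative-errno return values below are our assumptions, not part of the proposed interface.

    /* Hypothetical C declarations for the Block Register API of Table 1.
     * Types and the error convention are illustrative assumptions: each
     * call returns 0 on success or a negative errno value on failure. */

    #include <linux/types.h>   /* sector_t */

    /* Reserve block_num free blocks under a unique reservation name. */
    int brreserve(const char *name, sector_t block_num);

    /* Fill blocks[] with the physical numbers of block_num blocks of an
     * existing reservation, starting at logical offset inside it. */
    int brquery(const char *name, sector_t offset, sector_t block_num,
                sector_t *blocks);

    /* Grow an existing reservation by block_num blocks. */
    int brexpand(const char *name, sector_t block_num);

    /* Shrink a reservation, returning block_num blocks to the file system. */
    int brtruncate(const char *name, sector_t block_num);

    /* Remove a reservation and return all of its blocks to the free pool. */
    int brfree(const char *name);

Blocks obtained from brquery() would then be read and written by submitting block I/O directly to the device underneath the file system, bypassing the file system and its cache, as discussed above.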

3 Linux Implementation

In this section, we explore how the Block Register interface could be implemented in the EXT4 file system. Although we use some EXT4- and Linux-specific features, they serve only as shortcuts in the prototype implementation of the BR interface.

Although it is not strictly necessary, we decided to represent Block Register reservations as regular files, using a common name space for files and reservations. Thus, brreserve() and brfree() map to file creation and deletion. Leveraging the existing file name space has other advantages with respect to access and management: since reservations are essentially normal files, they can be easily accessed with existing file system utilities and interfaces. Because the BR interface is oriented towards block I/O, application data access through the file system cache would result in a data coherence problem and, therefore, should be prohibited. While open, reservation files must also be protected from modifications of the file block layout, such as truncation. On Linux, this can be solved by marking the file as immutable with the S_IMMUTABLE flag.

Treating block reservations as files has an additional implication. Invoking brquery() for every access would have a significant performance impact, so the caller of the BR interface may want to cache the results of previous brquery() calls. Consequently, the file system must not perform any optimization that changes the block mapping of the file, such as defragmentation, copy-on-write, or deduplication. To enforce this requirement, we rely on the Linux S_SWAPFILE inode flag, which marks a file as unmovable.

We implemented brexpand() and brtruncate() using the fallocate() function in Linux and EXT4, which changes the file size without performing actual writes to the file. There are, however, a few specifics of fallocate() behaviour that have to be taken into consideration. First, fallocate() makes no guarantees about the physical contiguity of the allocated space, which may affect I/O performance. Second, a fallocate() call results in write operations to the underlying block device, so the caller has to take special care to avoid deadlocks.

brquery() was implemented using the bmap() function, which returns the physical device block for a given logical file block. bmap() may also trigger read requests to the underlying storage.
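Because the prototype builds on standard Linux primitives, the reservation calls can even be emulated from user space. The sketch below is our illustration, not the paper's code: it creates a reservation file, allocates blocks with fallocate(), and resolves physical block numbers with the FIBMAP ioctl, which, like the kernel's bmap(), maps a logical file block to a device block (and requires root privileges).

    /* Illustrative user-space emulation of brreserve()/brquery() on EXT4,
     * using the primitives discussed above: fallocate(2) to allocate
     * blocks without writing data, and the FIBMAP ioctl to translate a
     * logical file block into a physical device block. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/fs.h>   /* FIBMAP, FIGETBSZ */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <reservation-file> <nblocks>\n", argv[0]);
            return 1;
        }
        long nblocks = atol(argv[2]);

        /* "brreserve": the reservation is a regular file. */
        int fd = open(argv[1], O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        int bsz;   /* file system block size */
        if (ioctl(fd, FIGETBSZ, &bsz) < 0) { perror("FIGETBSZ"); return 1; }

        /* Allocate blocks without writing file data. */
        if (fallocate(fd, 0, 0, (off_t)nblocks * bsz) < 0) {
            perror("fallocate"); return 1;
        }

        /* "brquery": translate each logical block to a physical block. */
        for (long i = 0; i < nblocks; i++) {
            int blk = (int)i;          /* in: logical; out: physical */
            if (ioctl(fd, FIBMAP, &blk) < 0) { perror("FIBMAP"); return 1; }
            printf("logical %ld -> physical %d\n", i, blk);
        }
        close(fd);
        return 0;
    }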
4 Snapshots with Block Register Interface

In order to evaluate the Block Register interface, we modified the Linux Logical Volume Manager (LVM) to use our new interface to allocate the storage necessary to create and maintain snapshots of block devices [2]. In this section, we briefly describe the changes we made to the snapshotting code so that it can make use of the Block Register interface.

Snapshots preserve the state of a block device at a particular point in time. They are widely used for many different purposes, such as backups, data recovery, or sandboxing. LVM snapshots are implemented using the device mapper. Snapshots require additional storage to hold the original version of the modified data; the amount of additional storage depends on how much data is modified during the lifetime of the snapshot. Currently, users planning to use snapshots create a file system on only a part of the available storage, reserving the rest for snapshot data. The space reserved for snapshots cannot be used to store files, and its size is fixed after file system creation.

We modified dm-snap, the device mapper target responsible for creating snapshots, to use the Block Register interface to allocate space for the snapshot data as needed, instead of using space reserved in advance. dm-snap uses this space to maintain a set of records, called an exception store, that describes the chunks of the original device that have been modified and copied to the snapshot device, as well as the old contents of these chunks. We created a new exception store persistence mechanism which uses block reservations instead of a block device for storing data. We based our code on the existing exception store code, dm-snap-persistent, and reused its metadata layout.
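For reference, the record format inherited from dm-snap-persistent is a simple pair of chunk numbers. The sketch below follows the layout used by the Linux dm-snap-persistent code (which stores these fields on disk in little-endian byte order); it is shown here only to make the reused metadata concrete.

    /* On-disk exception record, following the dm-snap-persistent layout
     * that the new reservation-backed store reuses. Each record says:
     * "the data that used to live in chunk old_chunk of the origin
     * device is now saved in chunk new_chunk of the snapshot store".
     * A chunk is a fixed-size run of device sectors. */
    #include <stdint.h>

    struct disk_exception {
        uint64_t old_chunk;   /* chunk number on the origin device */
        uint64_t new_chunk;   /* chunk number in the snapshot store */
    };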

In order to create a snapshot of a file system that supports the Block Register interface, the user specifies the device containing the file system and the name to be used for a block reservation on that file system. This reservation will contain the snapshot data. The new dm-snap code calls brreserve() for the initial block reservation. Responding to this request, the file system creates a new reservation. dm-snap then queries the blocks allocated to this reservation and creates a new snapshot using the reserved blocks.

After a certain amount of changes to the file system, dm-snap will need additional blocks to maintain the original version of the file system's data. To get these blocks, dm-snap calls brexpand() to request a new set of blocks. Because expanding the reservation might cause additional changes to the file system metadata, this call cannot be made in the context of a file system operation; otherwise, a deadlock might occur. To avoid the deadlock, the new dm-snap maintains a pool of available free blocks. If the pool falls below a "low watermark", a separate thread wakes up and requests additional blocks through the Block Register interface. The value of the low watermark depends on the write throughput of the file system.

To improve performance and prevent callbacks to the file system in the context of an I/O operation, dm-snap caches the data returned by brquery() in memory and queries this data with a binary search combined with a most-recently-used lookup. The implementation of the cache mechanism is largely based on the obsolete dm-loop prototype [3].
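The refill scheme can be summarized in a few lines of code. The sketch below is our illustration with invented names and sizes: the I/O path only consumes pre-reserved blocks from an in-memory pool, and a dedicated thread calls brexpand() when the pool drops below the low watermark, so reservation calls never happen while an I/O request is being served.

    /* Illustrative low-watermark refill logic (names and sizes invented).
     * pool_take() runs in the I/O path and never calls into the file
     * system; refill_thread() expands the reservation outside I/O context. */
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    int brexpand(const char *name, uint64_t block_num);  /* BR interface */

    #define LOW_WATERMARK 256   /* depends on file system write throughput */
    #define REFILL_BATCH 1024   /* blocks requested per brexpand() call */

    struct block_pool {
        pthread_mutex_t lock;
        pthread_cond_t need_refill;  /* signalled when the pool runs low */
        uint64_t blocks[8192];       /* physical block numbers from brquery() */
        size_t count;                /* blocks currently available */
    };

    /* I/O path: take one free block. Assumes the pool is never fully
     * drained, which the watermark is sized to guarantee. */
    uint64_t pool_take(struct block_pool *p)
    {
        pthread_mutex_lock(&p->lock);
        uint64_t blk = p->blocks[--p->count];
        if (p->count < LOW_WATERMARK)
            pthread_cond_signal(&p->need_refill);  /* wake refill thread */
        pthread_mutex_unlock(&p->lock);
        return blk;
    }

    /* Refill thread: safe to sleep and to trigger file system metadata
     * I/O here, because no I/O request is being served in this context. */
    void *refill_thread(void *arg)
    {
        struct block_pool *p = arg;
        for (;;) {
            pthread_mutex_lock(&p->lock);
            while (p->count >= LOW_WATERMARK)
                pthread_cond_wait(&p->need_refill, &p->lock);
            pthread_mutex_unlock(&p->lock);

            brexpand("snapshot0", REFILL_BATCH);
            /* ...then brquery() the new range and append the returned
             * block numbers to p->blocks under p->lock... */
        }
        return NULL;
    }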

During the operating system boot process, before the file system can be mounted in read-write mode, it has to be mounted in read-only mode so that LVM can query the file system for the blocks reserved for snapshot data. After that, LVM enables the copy-on-write (COW) mechanism for the block device containing the file system and exposes the snapshots to the user. Once the COW mechanism is enabled, the file system can be remounted in read-write mode.

5 Evaluation

In order to evaluate our changes, we incorporated them into the Linux 3.2.1 kernel and measured their impact on file system performance in the presence of a snapshot. We compared our snapshot implementation, which uses the Block Register interface, with the standard LVM snapshot implementation. The results are presented in Figure 2. Each measurement shows the performance of the file system in the corresponding benchmark with an active snapshot implemented using the Block Register interface, relative to the performance of the same benchmark with a snapshot taken using the standard Linux LVM implementation. Each result is the average of 3 runs. Values greater than 100% mean that the Block Register implementation outperforms the standard Linux implementation.

[Figure 2: Performance of snapshots using Block Register relative to LVM. Bars show random and sequential reads and writes with 1 and 8 outstanding I/O requests, plus the kernel source unpack+delete (x5) benchmark; the y-axis shows relative performance from 0% to 140%.]

In the first set of experiments, we created a 50GB test file on the file system, took a snapshot of the file system, and started issuing I/O requests to the file on the original file system. During each experiment, we kept the number of outstanding I/O operations (OIO) constant (either 1 or 8) and measured the number of completed I/O operations per second. Each I/O operation had a size of 4096 bytes. We used direct synchronous I/O to avoid caching effects. Since read operations do not involve snapshotting, we did not notice any difference between our snapshot implementation and the standard Linux one. However, our snapshot implementation behaved better while serving write operations. For random writes, the performance improvement varies from 1% for OIO=1 to 8% for OIO=8. The performance gain for sequential writes is more significant: 9% for OIO=1 and 26% for OIO=8.
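For concreteness, the write pattern described above can be reproduced with a few lines of C. The paper does not name its load generator, so this is only our sketch of the workload: 4096-byte direct, synchronous writes at random offsets with one outstanding request (OIO=1); OIO=8 would use multiple threads or asynchronous I/O.

    /* Illustrative reproduction of the described write pattern: 4 KiB
     * synchronous direct I/O (O_DIRECT requires an aligned buffer). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define IO_SIZE 4096
    #define NBLOCKS (50ULL * 1024 * 1024 * 1024 / IO_SIZE)  /* 50GB file */

    int main(void)
    {
        int fd = open("testfile", O_WRONLY | O_DIRECT | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, IO_SIZE, IO_SIZE)) return 1;

        /* Random 4 KiB writes, one outstanding request (OIO=1). */
        for (int i = 0; i < 100000; i++) {
            off_t block = (off_t)(rand() % NBLOCKS);
            if (pwrite(fd, buf, IO_SIZE, block * IO_SIZE) != IO_SIZE) {
                perror("pwrite");
                break;
            }
        }
        close(fd);
        free(buf);
        return 0;
    }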

In the second set of experiments, we created a snapshot of an empty file system and measured the time necessary to unpack 5 copies of the Linux 3.2.1 kernel source tree from an archive to the file system and then delete all created files. The last column of Figure 2 shows the result of this experiment: our implementation performs on a par with the standard Linux implementation in this benchmark.

We believe that the performance improvement comes from a shorter disk head seek span during write operations with our snapshot implementation. As shown in Figure 3, a copy-on-write operation for a block requires two additional seek operations: the first to seek to the location of block B2 in the snapshot to write the original data of block B1, and the second to seek back to the original block B1 to write the new data. In the standard Linux snapshot implementation, storage blocks containing snapshot data are located in the reserved area outside the file system, which forces the disk to perform seek operations spanning, on average, the whole file system. In our implementation, the file system has a chance of placing these two blocks closer to each other, thus reducing the seek time. Initially, when we designed the Block Register interface, we did not target performance improvements; this result comes as a pleasant side effect and leaves room for future work on snapshot placement optimizations.

[Figure 3: Disk head seek span during a copy-on-write operation. With Block Register, blocks B1 and B2 both lie inside the file system, so the seek span is short; in the standard Linux LVM implementation, the snapshot lies outside the file system, so the seek span is much larger.]

6 Open Issues

The implementation of the Block Register interface raises several important problems, and the best way to solve them is yet to be identified. In this section, we describe some of these problems.

So far, we have considered only file systems located on a single partition, disk, or hardware RAID array. Currently, the BR interface will not work if an additional component that performs non-linear mapping of disk blocks, such as a software implementation of RAID 0, is placed between the file system and the user of the BR interface.

Block reservation and query may result in additional I/O operations. Therefore, special care has to be taken, e.g., calling them asynchronously, to avoid deadlock when calling these functions while serving other I/O requests. Another solution is to mark I/O caused by reservation requests with special hints, so that kernel components can give such requests priority and guarantee that they will not be blocked.

Blocks that belong to reservations must be protected from some file system operations, such as defragmentation, and from accidental modifications by user-mode applications.

The block allocation algorithm depends on the file system and, therefore, may be suboptimal in some cases, because file systems are tuned for the most common workloads.

Finally, there are some implementation limitations, such as caching reservation block numbers in memory, which may become a problem for very large and fragmented reservations.

7 Future Work

We are going to investigate additions to the BR interface that allow other components of the storage stack to interact with a file system without being incorporated into it. The goal of this interaction is to achieve flexibility and functionality similar to those provided by "monolithic" file systems, like BTRFS or ZFS.

Other kernel block layer components may benefit from the Block Register interface. We are planning to modify MD software RAID and DRBD, the Distributed Replicated Block Device [8], to use the Block Register interface for storing their metadata within the file system.

Another interesting opportunity comes from our implementation of reservations as regular files in a file system. Because of that, reservations could be accessed by user-mode programs (with necessary precautions, such as measures to avoid the cache coherence problem). This allows, for example, copying or archiving the contents of the file system along with its snapshots using familiar utilities, like cp or tar.

We are also going to investigate changes to block allocation algorithms that would allow kernel components to request reservations in the proximity of a specific location. This can improve the performance of some operations, like copy-on-write or updates of a dirty block bitmap, by enabling proximal I/O [14].

8 Related Work

Some file systems, such as BTRFS, ZFS, or WAFL, implement snapshots internally and also store the original version of modified data in their own free space [1, 6, 11]. Our approach uses a similar technique; however, it does not depend on a particular file system implementation.

Thin provisioning tries to achieve similar goals; however, once a block has been written to, a thinly provisioned block device does not know whether that block still contains useful data. To solve this problem, a file system has to support additional interfaces to the block layer, telling the block layer when blocks become free. Another problem with thin provisioning is the duplication of file system functionality for accounting and allocating storage blocks. This results in additional levels of indirection and additional I/O for storing persistent allocation information. Conflicting allocation policies may also result in less than optimal performance, e.g., blocks that the file system allocator tries to allocate contiguously might be placed far from each other by the thin provisioning code. Examples of thinly provisioned block device implementations include Virtual Allocation and dm-thin [5, 12].

Linux provides the fibmap and fiemap ioctl calls, which return information about the physical location of files on the block device to applications [9]. Wires et al. argue that a growing number of applications can benefit from knowledge of the physical location of file data [15]. They present MapFS, an interface that allows applications to examine and modify file system mappings between individual files and underlying block devices. Block Register, on the other hand, provides applications and kernel components access to the free space of the file system. Block I/O to files using bmap() has been used in Linux to implement a file-backed swap device [7].

Parallel NFS (pNFS) is an enhancement of the NFS file system that exposes file layout to clients in order to improve scalability [10]. While pNFS work concentrates on large-scale networks, we argue that similar ideas can be successfully applied in a single-machine environment.

9 Conclusion

In this paper, we proposed a novel interface that extends the usefulness of file systems. Our interface is generic and does not rely on a specific file system implementation. It allows kernel storage stack components to share storage space with user files and to use the file system to reserve storage space and locate data. Using storage snapshotting as an example, we showed how our interface enables more flexible and efficient storage utilization without integrating snapshots inside the file system's code. In addition, we demonstrated that the new interface may also help optimize storage I/O performance.

10 Acknowledgements

The authors express sincere gratitude to Edward Goggin for his insightful advice regarding this work.

References

[1] BTRFS project. http://btrfs.wiki.kernel.org.
[2] Device-mapper resource page. http://sources.redhat.com/dm/.
[3] dm-loop: Device-mapper loopback target. http://sources.redhat.com/lvm2/wiki/DMLoop.
[4] Linux software RAID. http://linux.die.net/man/4/md.
[5] Thin provisioning. http://lwn.net/Articles/465740/.
[6] Bonwick, J., Ahrens, M., Henson, V., Maybee, M., and Shellenbaum, M. The zettabyte file system. In Proc. of the 2nd USENIX Conference on File and Storage Technologies (2003).
[7] Bovet, D. P., and Cesati, M. Understanding the Linux Kernel, 3rd ed. O'Reilly, 2006.
[8] Ellenberg, L. DRBD 9 and device-mapper: Linux block level storage replication. Proc. of the 15th International Linux System Technology Conference (2008), 125–130.
[9] Fasheh, M. Fiemap, an extent mapping ioctl. http://lwn.net/Articles/297696/.

[10] Hildebrand, D., and Honeyman, P. Exporting storage systems in a scalable manner with pNFS. In Proc. of the 22nd IEEE Conference on Mass Storage Systems and Technologies (2005), IEEE, pp. 18–27.
[11] Hitz, D., Lau, J., and Malcolm, M. A. File system design for an NFS file server appliance. Proc. of the Winter USENIX Conference (1994), 235–246.
[12] Kang, S., and Reddy, A. An approach to virtual allocation in storage systems. ACM Transactions on Storage 2, 4 (2006), 371–399.
[13] Larabel, M. Linux 3.3 kernel: BTRFS vs. EXT4. http://www.phoronix.com, 2012.
[14] Schindler, J., Shete, S., and Smith, K. Improving throughput for small disk requests with proximal I/O. In Proc. of the 9th USENIX Conference on File and Storage Technologies (2011).
[15] Wires, J., Spear, M., and Warfield, A. Exposing file system mappings with MapFS. In Proc. of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (2011).
