The Linux Implementation of a Log-structured File System

Ryusuke Konishi Yoshiji Amagai Koji Sato [email protected] [email protected] [email protected]

Hisashi Hifumi Seiji Kihara Satoshi Moriai [email protected] [email protected] [email protected]

NTT Cyber Space Laboratories, NTT Corporation, 1-1 Hikari-no-oka, Yokosuka-shi, Kanagawa 239-0847, Japan

ABSTRACT

Toward enhancing the reliability of the file system, we are developing a new log-structured file system (NILFS) for the Linux operating system. Instead of overwriting existing blocks, NILFS appends consistent sets of modified or newly created blocks continuously into segmented disk regions. This writing method allows NILFS to achieve faster recovery time and higher write performance. The address of the block that is written to changes for each write, which makes it difficult to apply modern file system technologies such as B-tree structures. To permit such writing on the Linux kernel basis, NILFS has its own write mechanism that handles data and meta data as one unit and allows them to be relocated. This paper presents the design and implementation of NILFS, focusing on the write mechanism.

1. INTRODUCTION

As use of open-source operating systems advances not only for stationary PCs but also for backend servers, their reliability and availability are becoming more and more important. One important issue in these operating systems is file system reliability. For instance, applying Linux to such fields was difficult several years ago because of the unreliability of its standard file system. This problem has been significantly eased by the adoption of journaling file systems such as ext3[9], XFS[2], JFS[1] and ReiserFS[3].

These journaling file systems enable fast and consistent recovery of the file system after unexpected system freezes or power failures. However, they still allow the fatal destruction of the file system due to the characteristic that recovery is realized by overwriting meta data with their copies saved in a journal file. This recovery is guaranteed to work properly only if the write order of on-disk data blocks and meta data blocks is physically conserved on the disk platters. Unfortunately, this constraint is often violated by the write optimizations performed by the block I/O subsystem and disk controllers. Careful implementation of write barriers and their correct application, which partially restricts the elevator seek optimizations, is indispensable for avoiding this problem. However, strict write ordering degrades storage performance because it increases disk seeks. Journaling file systems also degrade performance due to the seeks between the journal file and the original meta data blocks. At present, a few Linux journaling file systems support the write barrier, but it has not yet taken hold; only XFS has recently made it effective by default. We seem to be faced with a trade-off between performance and reliability.

One interesting alternative approach is called the log-structured file system (LFS)[6, 7]. LFS assures reliability by avoiding the overwriting of on-disk blocks; the modified blocks are appended to existing data blocks as is done for log data. This write method also improves performance since the sequential block writes minimize the number of disk head hops.

With regard to data recovery, we see the need for realizing more advanced features through the use of the technique called "snapshot". Snapshots reduce the possibility of data loss, including loss caused by human error. Some commercial storage systems support this feature, but they are costly and are far from common. LFS suits data salvage and snapshots because past data are kept on disk. It can improve the restorability of data, and can compensate for operational errors.

Some LFS implementations appeared in the '90s, but most of them are now obsolete. As for 4.4BSD and NetBSD, an implementation called BSD-LFS [8] is available. As for Linux, there was an experimental implementation called LinLogFS[5], but its development was abandoned and it is not available for recent Linux kernels. This is primarily due to the difficulty of implementing LFS; LFS basically stores blocks at different positions at each write. Combining LFS with a modern block management technique such as the B-tree[4] is a significant challenge. To overcome this shortfall, we are developing NILFS, an abbreviation of the New Implementation of a Log-structured File System, for the Linux operating system.

[Figure 1: Disk layout of NILFS. The disk holds a super block, segment usage chunks, and full segments Seg 0 ... Seg n. Each full segment contains partial segments made up of segment summary blocks and payload blocks (file blocks, file B-tree node blocks, inode blocks, inode B-tree node blocks, and a checkpoint block); one or more partial segments form a logical segment. Writing proceeds in one direction.]

The rest of the paper is organized as follows. Section 2 overviews the NILFS design. Section 3 focuses on the writing mechanism of NILFS, the key to our LFS implementation for Linux. After showing some evaluation results in Section 4, we conclude with a brief summary.

2. OVERVIEW OF THE NILFS DESIGN

2.1 Development goals

The development goals of NILFS are as follows:

1. High availability
2. Adequate reliability and performance
3. Support of online snapshots
4. High scalability
5. Compliance with Linux semantics
6. Improved operability with user-friendly tools

High availability means that the recovery time should match those of existing journaling file systems. For this purpose, a DB-like technique called checkpointing is applied. However, NILFS recovery is safer because it prevents the overwriting of meta data. In contrast, journaling file systems restore important blocks from the journal file during recovery, which can cause fatal collapse if the journal file is not written perfectly.

NILFS offers online snapshot support and adequate reliability and performance by taking advantage of LFS. Online snapshot support allows users to clip a consistent file system state without stopping services, and thus helps with online backup. The requirement of high scalability includes support of many files, large files, and large disks with minimum performance degradation. We would like to implement NILFS without losing compliance with Linux file system semantics and to minimize changes in the kernel code. We note that the development of user tools is a future task.

2.2 On-disk representation

On-disk representations of inode numbers, block numbers, and other sizes are designed around 64 bits in order to eliminate scalability limits. Data blocks and inode blocks are managed using B-trees to enable fast lookup. The B-tree structures of NILFS also have the role of translating logical block offsets into disk block numbers; we call the former the file block number and the latter the disk block number.

Figure 1 depicts the disk layout of NILFS. Disk blocks of NILFS are divided into units of segments. The term segment is used in multiple ways. To avoid ambiguity, we introduce three terms (an illustrative sketch of the corresponding on-disk records follows the list):

• Full segment: Division and allocation units. The full segment is basically divided equally and is addressable by its index value.

• Partial segment: Write units. Each partial segment consists of segment summary blocks and payload blocks. It cannot cross over the full segment boundary. The segment summary has information about how the partial segment is organized. It includes a breakdown of each block, the length of the segment, a pointer to the summary block of the previous partial segment, and so on.

• Logical segment: Recovery units. Each logical segment consists of one or more partial segments. It represents the difference between two consistent states of the file system. The partial segments composing a logical segment are regarded, logically, as one segment.
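The segment summary and checkpoint records can be pictured with the following sketch. It only illustrates the fields named above (64-bit numbers, a breakdown of the blocks, the segment length, a back pointer to the previous partial segment's summary); the actual NILFS on-disk format, field names, and widths are not specified here and should be treated as assumptions.

    #include <stdint.h>

    /*
     * Illustration only: hypothetical on-disk records mirroring the fields
     * the text attributes to the segment summary and the checkpoint.  The
     * real NILFS format, field names, and widths may differ.
     */
    struct ps_summary {                /* one per partial segment               */
            uint32_t sum_crc;          /* CRC over summary and payload blocks   */
            uint32_t flags;            /* e.g. "this partial segment ends a
                                          logical segment"                      */
            uint64_t nblocks;          /* length of this partial segment        */
            uint64_t nfile_blocks;     /* breakdown: file (data) blocks ...     */
            uint64_t nbtree_blocks;    /* ... and B-tree node blocks            */
            uint64_t prev_summary;     /* disk block number of the previous
                                          partial segment's summary block       */
            uint64_t ctime;            /* creation time of this log             */
    };

    struct checkpoint {                /* written once per logical segment      */
            uint32_t cp_crc;           /* CRC protecting the checkpoint block   */
            uint32_t pad;
            uint64_t cno;              /* checkpoint serial number (snapshots)  */
            uint64_t inode_btree_root; /* disk block number of the inode
                                          B-tree root                           */
    };

With records of this kind, recovery would amount to scanning partial segments from the position recorded in the super block and accepting the newest checkpoint whose CRCs verify, which is the behavior described in Sections 2.1 and 2.3.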

[Figure 2: Block diagram of NILFS. Above the system call interface (VFS), NILFS implements mount/recovery, file inode, and directory inode operations (super_ops, inode_ops). Below them, the block manager (B-tree) with its B-tree node cache, the segment constructor, and the segment manager perform block lookup/insert/delete, segment data allocation, and segment writing on top of the page cache (radix-tree), the buffer cache layer, the block I/O (BIO) layer, and the device driver.]

A logical segment consists of file blocks, file B-tree node blocks, inode blocks, inode B-tree node blocks, and a checkpoint block. The B-tree node blocks are intermediate blocks composing B-tree structures. The checkpoint block points to the root of the inode B-tree, which holds inode blocks on its leaves. Each inode block includes multiple inodes, each of which points to the root of a file B-tree. File data and directory data are held by the file B-trees. These payload blocks are a collection of modified or newly created blocks. Since pointers to unchanged blocks are reused, the B-tree node blocks or unchanged inodes may have pointers to blocks in past segments. Thus, the checkpoint block represents the root of the entire file system at the time the logical segment was created. Snapshots are realized as the ability to keep the on-disk blocks trackable from all or selected checkpoints. The segment summary and the checkpoint use cyclic redundancy checksums (CRCs) to assure the validity of themselves and the payload blocks.

The full segments are managed using two more special types of blocks, the super block and the segment usage chunk. The super block holds basic information about the file system, disk layout parameters, and a pointer to the latest partial segment having a valid checkpoint. The segment usage chunk includes the information needed to allocate the full segments. It also includes bidirectional links that give a logical sequence among the full segments. These blocks are redundantly stored between full segments.

2.3 Implementation architecture on Linux

Figure 2 shows the block diagram of NILFS. The upper part of NILFS is realized by implementing the object-oriented interface of the Linux Virtual File System (VFS). It is composed of mount operations, file inode operations, and directory inode operations. The mount operations read the super block and build on-memory data structures representing a file system instance; during unmounting, they clean up all resources of the file system instance. Recovery is executed as a part of the mount operations. It is realized by locating the valid checkpoint over partial segments; this search starts from the partial segment pointed to by the super block. The file inode operations read, write, delete, and truncate files through the manipulation of file inodes and file data blocks. The directory inode operations perform lookup, listing, creation, removal, and renaming of nodes in directories through the manipulation of directory inodes and directory data blocks.

The lower part of NILFS is entirely new and is composed of a block manager, a segment constructor, and a segment manager. These parts call generic functions of the Linux file cache layer or directly call the Linux block I/O (BIO) layer. The Linux file cache layer consists of the buffer cache and the page cache; the buffer cache is a legacy interface, and recent kernels basically utilize page caching. The block manager uses its own cache for B-tree node blocks, called the B-tree node cache, which we describe later. The block manager is realized as the B-tree function itself. It provides the functions required to manage blocks, such as lookup, insert, and delete, for both the inode blocks and the file blocks. The segment constructor collects newly created blocks and modified blocks and builds the partial segments needed to hold them. These on-memory blocks to be written to disk, called dirty blocks, are bulk written to the partial segments via the Linux BIO layer.

The segment manager serves as a full segment allocator and a cleaner. The cleaner is one of the essential LFS functions. It identifies unwanted blocks and makes continuous empty regions. This process is called garbage collection or cleaning. It cleans out segments by copying in-use blocks and their referrers. Without snapshots, it is conceptually simple as just described; however, for an LFS with many valid snapshots, it is hard to implement this procedure efficiently because many blocks and links are involved. The cleaner is currently under development.
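As a concrete, but hypothetical, illustration of the block manager's role, the following user-space sketch translates a file block number into a disk block number by descending B-tree node blocks obtained through a node cache. None of these names or layouts come from the NILFS sources; the in-kernel structures differ.

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Hypothetical sketch of the block manager's lookup path: translate a
     * file block number into a disk block number by descending B-tree node
     * blocks obtained through the B-tree node cache.  The node layout and
     * btnc_get() are illustrative stand-ins, not kernel code.
     */
    struct btree_node {
            int      nkeys;
            int      is_leaf;
            uint64_t keys[126];     /* file block numbers (keys)                */
            uint64_t ptrs[126];     /* child node blocks or data disk block nrs */
    };

    /* stub: a real implementation would find or read the node block through
     * the B-tree node cache; here every lookup simply misses                  */
    static struct btree_node *btnc_get(uint64_t disk_blocknr)
    {
            (void)disk_blocknr;
            return NULL;
    }

    /* returns the disk block number backing file_blocknr, or 0 if unmapped    */
    static uint64_t bmap_lookup(uint64_t root_blocknr, uint64_t file_blocknr)
    {
            struct btree_node *node = btnc_get(root_blocknr);

            while (node != NULL && node->nkeys > 0) {
                    int i = 0;

                    /* pick the last entry whose key does not exceed the target */
                    while (i + 1 < node->nkeys && node->keys[i + 1] <= file_blocknr)
                            i++;

                    if (node->is_leaf)
                            return node->keys[i] == file_blocknr ? node->ptrs[i] : 0;

                    node = btnc_get(node->ptrs[i]);  /* descend toward the leaf */
            }
            return 0;
    }

    int main(void)
    {
            /* example: block 10 of a file whose B-tree root sits at disk block
             * 12345 (always unmapped here because btnc_get() is a stub)        */
            return bmap_lookup(12345, 10) != 0;
    }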

3. IMPLEMENTING THE WRITE MECHANISM

This section describes details of how we realized the segment constructor on Linux. Although the Linux kernel is designed to efficiently share common functions to simplify the implementation of a number of different file systems, those functions are not strongly supportive of LFS:

1. Static placement of blocks: Because LFS changes the on-disk location of both data blocks and meta data blocks when writing them out, it needs to modify their block addresses. For the meta data blocks, reindexing in the page cache is also required because they are indexed by absolute address over a block device (strictly speaking, the indexing is realized per page; multiple blocks may be held within a single page). We must be careful of timing concerns and the side effects of these operations. Managing newly allocated blocks is another topic to be considered. The position of these blocks cannot be decided until just before writing them. Consequently, their locations are transiently indeterminable and cannot be managed via the block address.

2. Granularity of writing: The typical Linux file system writes data blocks per page, and meta data blocks per block. Even though the page cache and the BIO layer have the ability to write multiple pages at a time, they are written separately for each file and for each block type. Such fine-grained control is convenient for those file systems since they rely on the write order among various blocks, but not for LFS, because the atomicity of LFS writing is per segment, not per page.

3. Differences in the treatment of meta data and data: Data blocks are mainly treated by shared code, whereas meta data are handled differently by each file system. The way of writing, synchronizing, locking, and error handling differs between data and meta data. Even though LFS can write both types of blocks in the same way, we had to distinguish them to adapt to the shared kernel code.

To allow meta data blocks to change block address, we have introduced our own meta data cache. The cache is realized by extending the standard one to enable holding blocks with no fixed block address. For managing data blocks, we have adopted the standard page cache, because these blocks are indexed by file offset and the cache is therefore reusable as is.

The segment constructor carries out both background writes and foreground writes. The foreground writes are requested by synchronous write operations of the VFS layer. The segment constructor consolidates write timing and provides a unified mechanism to write out the blocks to be written. This aggregation is natural since it supports the segment-based atomicity of writing, exclusion controls, and so on. It also allows a straightforward implementation through direct manipulation of the BIO layer.

3.1 Procedure of segment construction

Since a B-tree node block of NILFS points to its child blocks with disk block numbers, the pointers must be changed when the location of a child block changes. This change makes the B-tree node block dirty, and makes it a block to be written out. It also changes the pointer in its parent block. Thus, the dirty state propagates through all ancestors up to the checkpoint block.

Because the B-tree construction may change dramatically due to insert and delete operations, the current NILFS version delays this propagation and performs it during segment construction. This also prevents the dirty B-tree node blocks from occupying memory for long periods.

To fix the number of dirty blocks to be written in the next partial segment before renumbering the disk block numbers, NILFS separates dirty state propagation from renumbering; segment construction is realized as follows:

1. Collecting dirty blocks.
2. Propagating the dirty state upstream.
3. Getting a complete set of dirty blocks and building a segment.
4. Renumbering the absolute address of each block, replacing the pointers to the block, and moving it in the page cache.
5. Calculating CRCs and finalizing both the segment summary and the checkpoint.
6. Submitting write requests using BIO and waiting for completion.

Figure 3 depicts the workflow that we employ to realize these procedures. Instead of making a whole image of the logical segment, we build, one by one, the partial segments composing it. For this we employ a buffer (the segment buffer) whose capacity is just one full segment. This on-the-fly approach simplifies the handling of long logical segments that cross over multiple full segments.

First, a partial segment is allocated from the current full segment. If no space is left in the current segment, the segment constructor calls the segment manager to allocate a new full segment. It then carries out procedures 1-3 simultaneously. This is shown as the collection phase in Figure 3. In the collection phase, one or more segment summary blocks and payload blocks are registered with the segment buffer. If the number of registered blocks reaches the capacity of the allocated partial segment, or the collection for one logical segment completes, then procedures 4-6 are applied to the blocks of the partial segment using the segment buffer. The collection phase is designed to be suspendable and continuable. It is repeatedly applied until the construction of the logical segment is completed.
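The per-partial-segment loop just described, procedures 1-6 applied through the segment buffer, might be sketched as follows. Every name and type here is a hypothetical stand-in for illustration; the real constructor operates on kernel pages and submits BIOs rather than calling these stub functions.

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Illustrative sketch of segment construction: one partial segment is
     * built and written per loop iteration until the logical segment is
     * complete.  All functions are stubs standing in for the steps named in
     * the text; the actual NILFS code differs.
     */
    struct segbuf { int nblocks; int capacity; };   /* stand-in segment buffer */

    /* procedures 1-3: collect dirty blocks, propagate the dirty state, and
     * register blocks with the buffer; returns true when the whole logical
     * segment has been covered (the phase is suspendable and continuable)     */
    static bool collection_phase(struct segbuf *b) { b->nblocks = b->capacity; return true; }

    /* procedure 4: assign disk block numbers, rewrite pointers, re-index pages */
    static void renumber_and_relocate(struct segbuf *b) { (void)b; }

    /* procedure 5: compute CRCs, finalize the segment summary and checkpoint   */
    static void finalize_summary_and_checkpoint(struct segbuf *b) { (void)b; }

    /* procedure 6: submit the partial segment via BIO and wait for completion  */
    static int submit_and_wait(struct segbuf *b) { (void)b; return 0; }

    static int construct_logical_segment(void)
    {
            struct segbuf buf = { .nblocks = 0, .capacity = 512 };
            bool done = false;

            while (!done) {                   /* one partial segment per pass   */
                    done = collection_phase(&buf);
                    renumber_and_relocate(&buf);
                    finalize_summary_and_checkpoint(&buf);
                    if (submit_and_wait(&buf) != 0)
                            return -1;        /* write error                    */
                    buf.nblocks = 0;          /* reuse the buffer for the next
                                                 partial segment                */
            }
            return 0;
    }

    int main(void)
    {
            if (construct_logical_segment() == 0)
                    puts("logical segment written");
            return 0;
    }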

[Figure 3: Main workflow of the segment construction. A partial segment is allocated from a full segment (obtained from the segment manager when needed); the collection phase runs the FBLK, FBT, IBLK, IBT, and CP stages in order, registering collected blocks with the segment buffer (payload blocks from the front, summary blocks backward from the tail); renumbering and pointer replacement, CRC calculation, and writing with BIO follow; the cycle repeats until the logical segment is done.]
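The two-ended filling of the segment buffer shown in Figure 3, with payload blocks registered from the front and summary blocks growing backward from the tail because their final count is only known once collection ends, could be pictured with a toy structure like the following. The layout and names are purely illustrative, not the kernel data structure.

    #include <stdbool.h>

    /*
     * Toy illustration of the segment buffer in Figure 3: payload blocks fill
     * the buffer from the front while summary blocks fill it backward from
     * the tail, because the number of summary blocks is not known until
     * collection finishes.
     */
    #define SEGBUF_SLOTS 512

    struct segment_buffer {
            void *blocks[SEGBUF_SLOTS];   /* one slot per block              */
            int   payload_next;           /* next free slot from the front   */
            int   summary_next;           /* next free slot from the tail    */
    };

    static void segbuf_init(struct segment_buffer *b)
    {
            b->payload_next = 0;
            b->summary_next = SEGBUF_SLOTS - 1;
    }

    /* returns false when the partial segment is full, i.e. procedures 4-6
     * must run on the buffered blocks before collection can continue        */
    static bool segbuf_register(struct segment_buffer *b, void *blk, bool is_summary)
    {
            if (b->payload_next > b->summary_next)
                    return false;
            if (is_summary)
                    b->blocks[b->summary_next--] = blk;   /* fill backward  */
            else
                    b->blocks[b->payload_next++] = blk;   /* fill forward   */
            return true;
    }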

The collection phase has five stages, which are executed sequentially. The FBLK stage collects dirty file blocks through a gang lookup of the page cache; it also tracks back the B-tree file node blocks and propagates the dirty flag upstream. The FBT stage propagates the dirty flag from the B-tree file node blocks and collects all dirty ones. The IBLK stage and the IBT stage do the same thing for the inode blocks and the inode B-tree node blocks, respectively. The CP stage adds a checkpoint block to the segment buffer. The summary block is extended and appended along with those stages. Because the number of summary blocks depends on the configuration of the segment, the segment buffer fills them backward from the tail end.

3.2 Techniques to mitigate construction overhead

Some techniques can be applied to reduce the load created by segment construction. First, we prepare a list chaining the inodes whose files are regarded as dirty. The list, called the dirty file list, allows the FBLK stage and the FBT stage to avoid exhaustive lookup over the page cache. The inode of a dirty file is inserted into the list when pages of the file are committed for writing. The inode is removed every time the logical segment is completed.

Avoiding unnecessary construction is also effective because it cuts down on the number of high-cost disk writes. Early cancellation, in addition, mitigates the overhead due to the lookup operations over the page cache and B-trees. The segment constructor of NILFS reconfirms the need for construction on a few occasions, by using the dirty file list and a flag that indicates whether the B-tree has been modified or not.

With regard to block reindexing for the meta data cache, an architecture-dependent optimization was used. Since each page may hold multiple blocks, the B-tree node blocks have to be separated and copied out of the original page when their disk block numbers change. This copying can be replaced by moving if the block size equals the page size. Although this brings in some asymmetry, it works with most common 32-bit PCs.

4. EVALUATION

We compared NILFS to ext3 using the "iozone" benchmark program. The measurement computer had a Pentium 4 3.0 GHz CPU, 1 Gbyte of memory, and 7,200 rpm IDE hard drives with an 8-Mbyte disk cache and an Ultra ATA connection. The Linux kernel version was 2.6.13. Both the disk block and the page have sizes of 4 Kbytes.

Figure 4 shows the measurement results for writing. The journaling mode of ext3 is the default ordered mode. Because the current NILFS has no cleaner and is not tuned at all, these are just preliminary values. To compare actual disk write performance, write synchronization was forced in two ways (illustrated by the sketch below). The left graph shows write throughput synchronized by the fsync() system call, whereas the right graph shows that synchronized with the O_SYNC open system call option. The horizontal axis gives the write buffer size. In the random write measurements, an existing file is rewritten; in the sequential write measurements, new files are created.
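For reference, the two forced-synchronization modes correspond to write paths like the following user-space sketch. This is not the iozone benchmark itself; the file names, buffer size, and iteration count are arbitrary.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /*
     * Sketch of the two synchronization modes used in the write measurements:
     * (1) buffered writes followed by an explicit fsync(), and (2) a file
     * opened with O_SYNC so that every write() is synchronous.
     */
    int main(void)
    {
            char buf[4096];
            ssize_t len = (ssize_t)sizeof(buf);
            memset(buf, 'x', sizeof(buf));

            /* variant 1: flush explicitly with fsync() after each buffer */
            int fd = open("testfile.fsync", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            if (fd < 0)
                    return 1;
            for (int i = 0; i < 16; i++) {
                    if (write(fd, buf, sizeof(buf)) != len)
                            return 1;
                    fsync(fd);              /* force the data out to the disk */
            }
            close(fd);

            /* variant 2: every write() is synchronous because of O_SYNC */
            fd = open("testfile.osync", O_CREAT | O_WRONLY | O_TRUNC | O_SYNC, 0644);
            if (fd < 0)
                    return 1;
            for (int i = 0; i < 16; i++)
                    if (write(fd, buf, sizeof(buf)) != len)
                            return 1;
            close(fd);
            return 0;
    }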

[Figure 4: Comparison of write performance. Two panels plot throughput (MBytes per second) against write buffer size (1-32 KBytes) for ext3 and NILFS, sequential and random write; the left panel is measured with fsync(), the right panel with the O_SYNC option.]

[Figure 5: Comparison of read performance. Throughput (MBytes per second) against buffer size (1-32 KBytes) for ext3 and NILFS, sequential and random read, measured with buffer flushing.]

NILFS shows higher write throughput in both random and sequential write. The superiority of NILFS is particularly noticeable in the random write. These results are due to the smaller number of disk seeks.

On the other hand, the read performance of the current NILFS version is unsatisfactory, as shown in Figure 5. To get real read performance, the caches are flushed through unmount and mount operations; these operations flush both the on-memory cache and the disk cache. Under the condition where the caches are flushed in advance, NILFS achieves only about 61% of the throughput of ext3 for sequential read, and 72% for random read. In this regard, however, LFS assumes the existence of the disk cache. If the disk cache is enabled, both file systems show approximately the same throughput because many blocks are read from memory. In any case, we recognize that read performance tuning is a priority target.

5. CONCLUDING REMARKS

We described the implementation of NILFS, focusing on the LFS write mechanism. NILFS is distributed under the GPL through the NILFS homepage (http://www.nilfs.org/). NILFS is a reimplementation of LFS that adopts B-tree based block management and supports snapshots. It offers an LFS alternative for the Linux operating system. Although it does not yet have an essential portion, the so-called cleaner, we expect that it will satisfy the very high level of reliability demanded of modern open-source operating systems.

6. REFERENCES

[1] JFS for Linux. http://jfs.sourceforge.net/.
[2] Project XFS Linux. http://oss.sgi.com/projects/xfs/.
[3] ReiserFS. http://www.namesys.com/.
[4] R. Bayer and E. McCreight. Organization and maintenance of large ordered indexes. Acta Informatica, 1(3):173–189, 1972.
[5] C. Czezatke and M. A. Ertl. LinLogFS — a log-structured filesystem for Linux. Freenix Track of USENIX Annual Technical Conference, pages 77–88, 2000.
[6] J. Ousterhout and F. Douglis. Beating the I/O bottleneck: a case for log-structured file systems. ACM SIGOPS Operating Systems Review, 23(1):11–28, 1989.
[7] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.
[8] M. I. Seltzer, K. Bostic, M. K. McKusick, and C. Staelin. An implementation of a log-structured file system for UNIX. USENIX Winter, pages 307–326, 1993.
[9] S. C. Tweedie. Journaling the Linux ext2fs filesystem. LinuxExpo '98, 1998.
