Copyright © 1992, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

AN IMPLEMENTATION OF A LOG-STRUCTURED FILE SYSTEM FOR UNIX

by

Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin

Memorandum No. UCB/ERL M92/134

3 December 1992

ELECTRONICS RESEARCH LABORATORY

College of Engineering, University of California, Berkeley 94720


An Implementation of a Log-Structured File System for UNIX

Margo Seltzer - University of California, Berkeley
Keith Bostic - University of California, Berkeley
Marshall Kirk McKusick - University of California, Berkeley
Carl Staelin - Hewlett-Packard Laboratories

ABSTRACT

Research results [ROSE91] demonstrate that a log-structured file system (LFS) offers the potential for dramatically improved write performance, faster recovery time, and faster file creation and deletion than traditional UNIX file systems. This paper presents a redesign and implementation of the Sprite [ROSE91] log-structured file system that is more robust and integrated into the vnode interface [KLEI86]. Measurements show its performance to be superior to the 4BSD Fast File System (FFS) in a variety of benchmarks and not significantly less than FFS in any test. Unfortunately, an enhanced version of FFS (with read and write clustering) [MCVO91] provides comparable and sometimes superior performance to our LFS. However, LFS can be extended to provide additional functionality such as embedded transactions and versioning, not easily implemented in traditional file systems.

1. Introduction

Early UNIX file systems used a small, fixed block size and made no attempt to optimize block placement [THOM78]. They assigned disk addresses to new blocks as they were created (preallocation) and wrote modified blocks back to their original disk addresses (overwrite). In these file systems, the disk became fragmented over time so that new files tended to be allocated randomly across the disk, requiring a disk seek per file system read or write even when the file was being read sequentially.

The Fast File System (FFS) [MCKU84] dramatically increased file system performance. It increased the block size, improving bandwidth. It reduced the number and length of seeks by placing related information close together on the disk. For example, blocks within files were allocated on the same or a nearby cylinder. Finally, it incorporated rotational disk positioning to reduce delays between accessing sequential blocks.

The factors limiting FFS performance are synchronous file creation and deletion and seek times between I/O requests for different files. The synchronous I/O for file creation and deletion provides file system disk data structure recoverability after failures. However, there exist alternative solutions such as NVRAM hardware [MORA90] and logging software [KAZA90]. In a UNIX environment, where the vast majority of files are small [OUST85][BAKE91], the seek times between I/O requests for different files can dominate. No solutions to this problem currently exist in the context of FFS.

The log-structured file system, as proposed in [OUST88], attempts to address both of these problems. The fundamental idea of LFS is to improve file system performance by storing all file system data in a single, continuous log. Such a file system is optimized for writing, because no seek is required between writes. It is also optimized for reading files written in their entirety over a brief period of time (as is the norm in UNIX systems), because the files are placed contiguously on disk. Finally, it provides temporal locality, in that it is optimized for accessing files that were created or modified at approximately the same time.

The write-optimization of LFS has the potential for dramatically improving system throughput, as large main-memory file caches effectively cache reads, but do little to improve write performance [OUST88]. The goal of the Sprite log-structured file system (Sprite-LFS) [ROSE91] was to design and implement an LFS that would provide acceptable read performance as well as improved write performance. Our goal is to build on the Sprite-LFS work, implementing a new version of LFS that provides the same recoverability guarantees as FFS, provides performance comparable to or better than FFS, and is well-integrated into a production quality UNIX system.

This paper describes the design of log-structured file systems in general and our implementation in particular, concentrating on those parts that differ from the Sprite-LFS implementation. We compare the performance of our implementation of LFS (BSD-LFS) with FFS using a variety of benchmarks.

[1] Permission has been granted by the USENIX Association to reprint the above article. This article was originally published in the USENIX Association Conference Proceedings, January 1993. Copyright © USENIX Association, 1993.

[2] UNIX is currently a registered trademark of UNIX System Laboratories in the United States and other countries; however, pending litigation calls this into question.

2. Log-Structured File Systems

There are two fundamental differences between an LFS and a traditional UNIX file system, as represented by FFS: the on-disk layout of the data structures and the recovery model. In this section we describe the key structural elements of an LFS, contrasting the data structures and recovery to FFS. The complete design and implementation of Sprite-LFS can be found in [ROSE92]. Table 1 compares key differences between FFS and LFS. The reasons for these differences will be described in detail in the following sections.


2.1. Disk Layout

In both FFS and LFS, a file's physical disk layout is described by an index structure (inode) that contains the disk addresses of some direct, indirect, doubly indirect, and triply indirect blocks. Direct blocks contain data, while indirect blocks contain disk addresses of direct blocks, doubly indirect blocks contain disk addresses of indirect blocks, and triply indirect blocks contain disk addresses of doubly indirect blocks. The inodes and single, double and triple indirect blocks are referred to as "meta-data" in this paper.

The FFS is described by a superblock that contains file system parameters (block size, fragment size, and file system size) and disk parameters (rotational delay, number of sectors per track, and number of cylinders). The superblock is replicated throughout the file system to allow recovery from crashes that corrupt the primary copy of the superblock. The disk is statically partitioned into cylinder groups, typically between 16 and 32 cylinders to a group. Each group contains a fixed number of inodes (usually one inode for every two kilobytes in the group) and bitmaps to record inodes and data blocks available for allocation. The inodes in a cylinder group reside at fixed disk addresses, so that disk addresses may be computed from inode numbers.

Figure 1: Physical Disk Layout of the Fast File System. The disk is statically partitioned into cylinder groups, each of which is described by a cylinder group block, analogous to a file system superblock. Each cylinder group contains a copy of the superblock and allocation information for the inodes and blocks within that group.

Task                                   FFS                     LFS
Assign disk addresses                  block creation          segment write
Allocate inodes                        fixed locations         appended to log
Maximum number of inodes               statically determined   grows dynamically
Map inode numbers to disk addresses    static address          lookup in inode map
Maintain free space                    bitmaps                 cleaner, segment usage table
Make file system state consistent      fsck                    roll-forward
Verify directory structure             fsck                    background checker

Table 1: Comparison of File System Characteristics of FFS and LFS.

New blocks are allocated to optimize for sequential file access. Ideally, logically sequential blocks of a file are allocated so that no seek is required between two consecutive accesses. Because data blocks for a file are typically accessed together, the FFS policy routines try to place data blocks for a file in the same cylinder group, preferably at rotationally optimal positions in the same cylinder. Figure 1 depicts the physical layout of FFS.

LFS is a hybrid between a sequential database log and FFS. It does all writes sequentially, like a database log, but incorporates the FFS index structures into this log to support efficient random retrieval. In an LFS, the disk is statically partitioned into fixed size segments, typically one-half megabyte. The logical ordering of these segments creates a single, continuous log.

An LFS is described by a superblock similar to the one used by FFS. When writing, LFS gathers many dirty pages and prepares to write them to disk sequentially in the next available segment. At this time, LFS sorts the blocks by logical block number, assigns them disk addresses, and updates the meta-data to reflect their addresses. The updated meta-data blocks are gathered with the data blocks, and all are written to a segment. As a result, the inodes are no longer in fixed locations, so LFS requires an additional data structure, called the inode map [ROSE90], that maps inode numbers to disk addresses.
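To make the inode map concrete, the following user-level C sketch shows one possible representation of the map and a lookup from inode number to disk address. The structure, field names, and sizes are illustrative assumptions, not the BSD-LFS on-disk format.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical inode-map entry: for each inode number the file system
     * records where the most recent copy of that inode was written. */
    struct imap_entry {
        uint32_t daddr;      /* disk address of the inode's latest location */
        uint32_t version;    /* bumped when the inode is freed and reused */
    };

    #define NINODES 16

    /* Return the current disk address of an inode; 0 means "unallocated". */
    static uint32_t
    imap_lookup(const struct imap_entry *map, uint32_t inum)
    {
        return inum < NINODES ? map[inum].daddr : 0;
    }

    int
    main(void)
    {
        struct imap_entry map[NINODES] = { { 0, 0 } };

        /* Writing a segment gives inode 5 a new on-disk location. */
        map[5].daddr = 4096;
        map[5].version = 1;

        printf("inode 5 is at disk address %u\n", imap_lookup(map, 5));
        return 0;
    }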

Since LFS writes dirty data blocks into the next available segment, modified blocks are written to the disk in different locations than the original blocks. This space reallocation is called a "no-overwrite" policy, and it necessitates a mechanism to reclaim space resulting from deleted or overwritten blocks. The cleaner is a garbage collection process that reclaims space from the file system by reading a segment, discarding "dead" blocks (blocks that belong to deleted files or that have been superseded by newer blocks), and appending any "live" blocks. For the cleaner to determine which blocks in a segment are "live," it must be able to identify each block in a segment. This determination is done by including a summary block in each segment that identifies the inode and logical block number of every block in the segment. In addition, the kernel maintains a segment usage table that shows the number of "live" bytes and the last modified time of each segment. The cleaner uses this table to determine which segments to clean [ROSE90]. Figure 2 shows the physical layout of the LFS.

While FFS flushes individual blocks and files on demand, the LFS must gather data into segments. Usually, there will not be enough dirty blocks to fill a complete segment [BAKE92], in which case LFS writes partial segments. A physical segment contains one or more partial segments. For the remainder of this paper, segment will be used to refer to the physical partitioning of the disk, and partial segment will be used to refer to a unit of writing. Small partial segments most commonly result from NFS operations or fsync(2) requests, while writes resulting from the sync(2) system call or system memory shortages typically form larger partials, ideally taking up an entire segment. During a sync, the inode map and segment usage table are also written to disk, creating a checkpoint that provides a stable point from which the file system can be recovered in case of system failure. Figure 3 shows the details of allocating three files in an LFS.

Figure 2: A Log-Structured File System. A file system is composed of segments as shown in figure (a). Each segment consists of a summary block followed by data blocks and inode blocks (b). The segment summary contains checksums to validate both the segment summary and the data blocks, a timestamp, a pointer to the next segment, and information that describes each file and inode that appears in the segment (c). Files are described by FINFO structures that identify the inode number and version of the file (as well as each block of that file) located in the segment (d).
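The segment summary and FINFO layouts described in Figure 2 might be declared along the following lines; this is a declarations-only sketch whose field names, widths, and ordering are assumptions rather than the actual BSD-LFS definitions.

    #include <stdint.h>

    /* Illustrative partial-segment summary, following the fields listed in
     * Figure 2(c): checksums, a creation time, a next-segment pointer, and a
     * count of the FINFO structures that follow. */
    struct seg_summary {
        uint32_t ss_sumcksum;    /* checksum over the summary block */
        uint32_t ss_datacksum;   /* checksum over the data blocks */
        uint32_t ss_next;        /* disk address of the next segment */
        uint32_t ss_create;      /* creation time of this partial segment */
        uint16_t ss_nfinfo;      /* number of FINFO structures */
        uint16_t ss_version;     /* summary format version */
    };

    /* Illustrative FINFO, following Figure 2(d): it names a file (inode
     * number and version) and lists that file's logical blocks written in
     * this partial segment. */
    struct finfo {
        uint32_t fi_nblocks;     /* number of entries in fi_blocks[] */
        uint32_t fi_version;     /* inode version number */
        uint32_t fi_ino;         /* inode number */
        int32_t  fi_blocks[1];   /* logical block numbers (fi_nblocks of them) */
    };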

Figure 3: A Log-Structured File System. In figure (a), two files have been written, file1 and file2. Each has an index structure in the meta-data block that is allocated after it on disk. In figure (b), the middle block of file2 has been modified. A new version of it is added to the log, as well as a new version of its meta-data. Then file3 is created, causing its blocks and meta-data to be appended to the log. Next, file1 has two more blocks appended to it, causing the blocks and a new version of file1's meta-data to be appended to the log. On checkpoint, the inode map, containing pointers to the meta-data blocks, is written.

2.2. File System Recovery

There are two aspects to file system recovery: bringing the file system to a physically consistent state and verifying the logical structure of the file system. When FFS or LFS adds a block to a file, there are several different pieces of information that may be modified: the block itself, the inode, the free block map, possibly indirect blocks, and the location of the last allocation. If the system crashes during the addition, the file system is likely to be left in a physically inconsistent state. Furthermore, there is currently no way for FFS to localize inconsistencies. As a result, FFS must rebuild the entire file system state, including cylinder group bitmaps and meta-data. At the same time, FFS verifies the directory structure and all block pointers within the file system. Traditionally, fsck(8) is the agent that performs both of these functions.

In contrast to FFS, LFS writes only to the end of the log and is able to locate potential inconsistencies and recover to a consistent physical state quickly. This part of recovery in LFS is more similar to standard database recovery [HAER83] than to fsck. It consists of two parts: initializing all the file system structures from the most recent checkpoint and then "rolling forward" to incorporate any modifications that occurred subsequently. The roll forward phase consists of reading each segment after the checkpoint in time order and updating the file system state to reflect the contents of the segment. The next segment pointers in the segment summary facilitate reading from the last checkpoint to the end of the log, the checksums are used to identify valid segments, and the timestamps are used to distinguish the partial segments written after the checkpoint from those written before which have been reclaimed. The file and block numbers in the FINFO structures are used to update the inode map, segment usage table, and inodes, making the blocks in the partial segment extant. As is the case for database recovery, the recovery time is proportional to the interval between file system checkpoints.
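A minimal, self-contained sketch of the roll-forward scan described above follows; the in-memory partial segment structure and helper logic are stand-ins for the real on-disk readers, assumed here only to illustrate the checksum and timestamp checks.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A toy in-memory partial segment used to illustrate the scan; the real
     * code reads these from the disk and validates them with checksums. */
    struct pseg {
        bool     valid;      /* stands in for "checksum verified" */
        uint32_t timestamp;  /* creation time from the segment summary */
        int      next;       /* index of the next partial segment, -1 = none */
    };

    /* Roll forward from the last checkpoint: apply each valid partial segment
     * written after the checkpoint, stopping at the first invalid or stale one. */
    static void
    roll_forward(const struct pseg *log, int start, uint32_t checkpoint_time)
    {
        for (int i = start; i != -1; i = log[i].next) {
            if (!log[i].valid)                    /* bad checksum: end of the log */
                break;
            if (log[i].timestamp < checkpoint_time)
                break;                            /* older, reclaimed segment */
            /* Here the real code uses the FINFO entries to update the inode
             * map, the segment usage table, and the inodes themselves. */
            printf("applying partial segment %d (time %u)\n", i, log[i].timestamp);
        }
    }

    int
    main(void)
    {
        struct pseg log[] = {
            { true, 100, 1 },   /* written after the checkpoint */
            { true, 101, 2 },
            { false, 0, -1 },   /* never completed: the scan stops here */
        };
        roll_forward(log, 0, 100);
        return 0;
    }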

While standard LFS recovery quickly brings the file system to a physically consistent state, it does not provide the same guarantees made by fsck. When fsck completes, not only is the file system in a consistent state, but the directory structure has been verified as well. The five passes of fsck are summarized in Table 2. For LFS to provide the same level of robustness as FFS, LFS must make many of the same checks. While LFS has no bitmaps to rebuild, the verification of block pointers and directory structure and contents is crucial for the system to recover from media failure. This recovery will be discussed in more detail in Section 3.4.

Phase I     Traverse inodes. Validate all block pointers. Record inode state (allocated or unallocated) and file type for each inode. Record inode numbers and block addresses of all directories.
Phase II    Sort directories by disk address order. Traverse directories in disk address order. Validate ".". Record "..". Validate directories' contents, type, and link counts. Recursively verify "..".
Phase III   Attach any unresolved ".." trees to lost+found. Mark all inodes in those trees as "found".
Phase IV    Put any inodes that are not "found" in lost+found. Verify link counts for every file.
Phase V     Update bitmaps in cylinder groups.

Table 2: Five Phases of fsck.

3. Engineering LFS

While the Sprite-LFS implementation was an excellent proof of concept, it had several deficiencies that made it unsuitable for a production environment. Our goal was to engineer a version of LFS that could be used as a replacement for FFS. Some of our concerns were as follows:

1. Sprite-LFS consumes excessive amounts of memory.

2. Write requests are successful even if there is insufficient disk space.

3. Recovery does nothing to verify the consistency of the file system directory structure.

4. Segment validation is hardware dependent.

5. All file systems use a single cleaner and a single cleaning policy.

6. There are no performance numbers that measure the cleaner overhead.

The earlier description of LFS focused on the overall strategy of log-structured file systems. The rest of Section 3 discusses how BSD-LFS addresses the first five problems listed above. Section 4 addresses the implementation issues specific to integration in a BSD framework, and Section 5 presents the performance analysis. In most ways, the logical framework of Sprite-LFS is unchanged. We have kept the segmented log structure and the major support structures associated with the log, namely the inode map, segment usage table, and cleaner. However, to address the problems described above and to integrate LFS into a BSD system, we have altered nearly all of the details of implementation, including a few fundamental design decisions. Most notably, we have moved the cleaner into user space, eliminated the directory operation log, and altered the segment layout on disk.

3.1. Memory Consumption

Sprite-LFS assumes that the system has a large physical memory and ties down substantial portions of it. The following storage is reserved:

Two 64K or 128K staging buffers
    Since not all devices support scatter/gather I/O, data is written in buffers large enough to allow the maximum transfer size supported by the disk controller, typically 64K or 128K. These buffers are allocated per file system from kernel memory.

One cleaning segment
    One segment's worth of buffer cache blocks per file system are reserved for cleaning.

Two read-only segments
    Two segments' worth of buffer cache blocks per file system are marked read-only so that they may be reclaimed by Sprite-LFS without requiring an I/O.

Buffers reserved for the cleaner
    Each file system also reserves some buffers for the cleaner. The number of buffers is specified in the superblock and is set during file system creation. It specifies the minimum number of clean buffers that must be present in the cache at any point in time. On the Sprite cluster, the amount of buffer space reserved for 10 commonly used file systems was 37 megabytes.

One segment
    This segment (typically one-half megabyte) is allocated from kernel memory for use by the cleaner. Since this one segment is allocated per system, only one file system per system may be cleaned at a time.

The reserved memory described above makes Sprite-LFS a very "bad neighbor" as kernel subsystems compete for memory. While memory continues to become cheaper, a typical laptop system has only three to eight megabytes of memory, and might very reasonably expect to have three or more file systems. BSD-LFS greatly reduces the memory consumption of LFS.
First, BSD-LFS does not use separate buffers for writing large transfers to disk; instead it uses the regular buffer cache blocks. For disk controllers that do not coalesce contiguous reads, we use 64K staging buffers (briefly allocated from the regular kernel memory pool) to do transfers. The size of the staging buffer was set to the minimum of the maximum transfer sizes for currently supported disks. However, simulation results in [CAR92] show that for current disks, the write size minimizing the read response time is typically about two tracks; two tracks is close to 64 kilobytes for the disks on our systems.

Secondly, rather than reserving read-only buffers, we initiate segment writes when the number of dirty buffers crosses a threshold. That threshold is currently measured in available buffer headers, not in physical memory, although systems with an integrated buffer cache and virtual memory will have simpler, more straightforward mechanisms.

Finally, the cleaner is implemented as a user space process. This approach means that it requires no dedicated memory, competing for virtual memory space with the other processes.

3.2. Block Accounting

Sprite-LFS maintains a count of the number of disk blocks available for writing, i.e. the real number of disk blocks that do not contain useful data. This count is decremented when blocks are actually written to disk. This approach implies that blocks can be successfully written to the cache but fail to be written to disk if the disk becomes full before the blocks are actually written. Even if the disk is not full, all available blocks may reside in uncleaned segments and new data cannot be written. To prevent the system from deadlocking or losing data in these cases, BSD-LFS uses two forms of accounting.

The first form of block accounting is similar to that maintained by Sprite-LFS. BSD-LFS maintains a count of the number of disk blocks that do not contain useful data. It is decremented whenever a new block is created in the cache. Since many files die in the cache [BAKE91], this number is incremented whenever blocks are deleted, even if they were never written to disk.

The second form of accounting keeps track of how much space is currently available for writing. This space is allocated as soon as a dirty block enters the cache, but is not reclaimed until segments are cleaned. This count is used to initiate cleaning. If an application attempts to write data, but there is no space currently available for writing, the write will sleep until space is available. These two forms of accounting guarantee that if the file system accepts a write request from the user, barring a crash, it will perform the write.
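The two forms of accounting can be illustrated with a pair of counters, as in the sketch below. The counter names and the places they are adjusted are assumptions chosen to mirror the description above, not the kernel's actual variables.

    #include <stdbool.h>
    #include <stdio.h>

    /* Two illustrative counters:
     * lfs_bfree - disk blocks that contain no useful data,
     * lfs_avail - space still available for new dirty data in the cache
     *             (only the cleaner replenishes it). */
    static long lfs_bfree = 1000;
    static long lfs_avail = 1000;

    /* Called when a dirty block first enters the cache. */
    static bool
    alloc_block(void)
    {
        if (lfs_avail <= 0)
            return false;        /* caller must sleep until segments are cleaned */
        lfs_avail--;             /* charged immediately, reclaimed by the cleaner */
        lfs_bfree--;             /* this block now holds useful data */
        return true;
    }

    /* Called when a cached block is deleted before ever reaching the disk;
     * many files die in the cache, so this happens often. */
    static void
    delete_cached_block(void)
    {
        lfs_bfree++;             /* its old contents are no longer useful */
    }

    int
    main(void)
    {
        if (alloc_block())
            printf("bfree=%ld avail=%ld\n", lfs_bfree, lfs_avail);
        delete_cached_block();
        printf("bfree=%ld avail=%ld\n", lfs_bfree, lfs_avail);
        return 0;
    }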
3.3. Segment Structure and Validation

Sprite-LFS places segment summary blocks at the end of the segment, trusting that if the write containing the segment summary is issued after all other writes in that partial segment, the presence of the segment summary validates the partial segment. This approach requires two assumptions: the disk controller will not reorder the write requests, and the disk writes the contents of a buffer in the order presented. Since controllers often reorder writes and reduce rotational latency by beginning track writes anywhere on the track, we felt that BSD-LFS could not make these assumptions. Therefore, we build segments from front to back, placing the segment summary at the beginning of each segment, as shown in Figure 4. We compute a checksum across four bytes of each block in the partial segment, store it in the segment summary, and use this to verify that a partial segment is valid. This approach avoids write-ordering constraints and allows us to write multiple partial segments without an intervening seek or rotation. We do not yet have reason to believe that our checksum is insufficient; however, methods exist for guaranteeing that any missing sector can be detected during roll-forward, at the expense of a bit per disk sector stored in the segment usage table and segment summary blocks.[3]

Figure 4: Partial Segment Structure Comparison Between Sprite-LFS and BSD-LFS. The numbers in each partial show the order in which the partial segments are created. Sprite-LFS builds segments back to front, chaining segment summaries. BSD-LFS builds segments front to back. After reading a segment summary block, the location of the next segment summary block can be easily computed.

[3] The database community uses a technique called patch tables to verify multi-sector block writes. Although many people have heard of this technique, none knew of any published reference for it.
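The following sketch illustrates the idea of the partial-segment checksum: a few bytes sampled from each block are folded into a value stored in the summary, so that a block that never reached the disk is detected at recovery. The checksum function itself is an assumption; the kernel's real function differs.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLKSIZE 4096

    /* Fold together four bytes sampled from each block of the partial
     * segment; a cheap per-block sample is enough to notice a block that
     * was never written. */
    static uint32_t
    pseg_cksum(const unsigned char *blocks, int nblocks)
    {
        uint32_t sum = 0;

        for (int i = 0; i < nblocks; i++) {
            uint32_t word;
            memcpy(&word, blocks + (size_t)i * BLKSIZE, sizeof(word));
            sum = (sum << 1 | sum >> 31) ^ word;   /* rotate and fold */
        }
        return sum;
    }

    int
    main(void)
    {
        static unsigned char seg[3 * BLKSIZE];     /* three-block partial segment */

        seg[0] = 0x42;                             /* some data in block 0 */
        uint32_t stored = pseg_cksum(seg, 3);      /* value kept in the summary */

        seg[2 * BLKSIZE] = 0x99;                   /* block 2 never hit the disk */
        printf("valid? %s\n", pseg_cksum(seg, 3) == stored ? "yes" : "no");
        return 0;
    }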
3.4. File System Verification

Fast recovery from system failure is desirable, but reliable recovery from media failure is necessary. Consequently, the BSD-LFS system provides two recovery strategies. The first quickly rolls forward from the last checkpoint, examining data written between the last checkpoint and the failure. The second does a complete consistency check of the file system to recover lost or corrupted data, due to the corruption of bits on the disk or errant software writing bad data to the disk. This check is similar to the functionality of fsck, the file system checker and recovery agent for FFS, and like fsck, it takes a long time to run.

As UNIX systems spend a large fraction of their time, while rebooting, in file system checks, the speed at which LFS is able to recover its file systems is considered one of its major advantages. However, FFS is an extremely robust file system. In the standard 4BSD implementation, it is possible to clear the root inode and recover the file system automatically with fsck(8). This level of robustness is necessary before users will accept LFS as a file system in traditional UNIX environments. In terms of recovery, the advantage of LFS is that writes are localized, so the file system may be recovered to a physically consistent state very quickly. The BSD-LFS implementation permits LFS to recover quickly, and applications can start running as soon as the roll-forward has been completed, while basic sanity checking of the file system is done in the background. There is the obvious problem of what to do if the sanity check fails. It is expected that the file system will be forcibly made read-only, fixed, and then once again write enabled. These events should have a limited effect on users, as it is unlikely to ever occur, and it is even more unlikely to discover an error in a file currently being written by a user, since the opening of the file would most likely have already caused a process or system failure. Of course, the root file system must always be completely checked after every reboot in case a system failure corrupted it.

3.5. The Cleaner

In Sprite-LFS the cleaner is part of the kernel and implements a single cleaning policy. There are three problems with this, in addition to the memory issues discussed in Section 3.1. First, there is no reason to believe that a single cleaning algorithm will work well on all workloads. In fact, measurements in [SELT93] show that coalescing randomly updated files would improve sequential read performance dramatically. Second, placing the cleaner in kernel-space makes it difficult to experiment with alternate cleaning policies. Third, implementing the cleaner in the kernel forces the kernel to make policy decisions (the cleaning algorithm) rather than simply providing a mechanism. To handle these problems, the BSD-LFS cleaner is implemented as a user process.

The BSD-LFS cleaner communicates with the kernel via system calls and the read-only ifile. Those functions that are already handled in the kernel (e.g. translating logical block numbers to disk addresses via bmap) are made accessible to the cleaner via system calls. If necessary functionality did not already exist in the kernel (e.g. reading and parsing segment summary blocks), it was relegated to user space.

There may be multiple cleaners, each implementing a different cleaning policy, running in parallel on a single file system. Regardless of the particular policy, the basic cleaning algorithm works as follows:

1. Choose one or more segments and read them.
2. Decide which blocks are still alive.
3. Write live blocks back to the file system.
4. Mark the segment clean.

The ifile and four new system calls, summarized in Table 3, provide the cleaner with enough information to implement this algorithm. The cleaner reads the ifile to find out the status of segments in the file system and selects segments to clean based on this information. Once a segment is selected, the cleaner reads the segment from the raw partition and uses the first segment summary to find out what blocks reside in that partial segment. It constructs an array of BLOCK_INFO structures (shown in Figure 5) and continues scanning partial segments, adding their blocks to the array. When the entire segment has been read, and all the BLOCK_INFOs constructed, the cleaner calls lfs_bmapv, which returns the current physical disk address for each BLOCK_INFO. If the disk address is the same as the location of the block in the segment being examined by the cleaner, the block is "live". Live blocks must be written back into the file system without changing their access or modify times, so the cleaner issues an lfs_markv call, which is a special write causing these blocks to be appended into the log without updating the inode times. Before rewriting the blocks, the kernel verifies that none of the blocks have "died" since the cleaner called lfs_bmapv. Once lfs_markv begins, only cleaned blocks are written into the log, until lfs_markv completes. Therefore, if cleaned blocks die after lfs_markv verifies that they are alive, partial segments written after the lfs_markv partial segments will reflect that the blocks have died. When lfs_markv returns, the cleaner calls lfs_segclean to mark the segment clean. Finally, when the cleaner has cleaned enough segments, it calls lfs_segwait, sleeping until the specified timeout elapses or a new segment is written into an LFS.

Since the cleaner is responsible for producing free space, the blocks it writes must get preference over other dirty blocks to be written to avoid running out of free space. To ensure that the cleaner can always run, normal writing is suspended when the number of clean segments drops to two.
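A user-level sketch of the per-segment cleaning pass described above appears below. The BLOCK_INFO fields and the prototypes of the four system calls are assumptions (represented here by local stubs so the example compiles), not the 4.4BSD declarations.

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* BLOCK_INFO in the spirit of Figure 5; the field names are illustrative. */
    struct block_info {
        uint32_t bi_inode;      /* inode number */
        int32_t  bi_lbn;        /* logical block number */
        uint32_t bi_daddr;      /* current disk address, filled in by lfs_bmapv */
        uint32_t bi_segcreate;  /* creation time of the segment being cleaned */
        void    *bi_bp;         /* pointer to the block contents */
    };

    /* Stand-ins for the four system calls of Table 3.  The real prototypes
     * differ; these stubs exist only so the sketch compiles and runs. */
    static void lfs_bmapv(struct block_info *b, int n) { (void)b; (void)n; }
    static void lfs_markv(struct block_info *b, int n) { (void)b; (void)n; }
    static void lfs_segclean(int seg) { (void)seg; }
    static void lfs_segwait(int seconds) { (void)seconds; }

    /* One pass of the basic cleaning algorithm for a single segment:
     * found_at[i] is where the cleaner found block i inside the segment. */
    static void
    clean_segment(int seg, struct block_info *bi, const uint32_t *found_at, int n)
    {
        int nlive = 0;

        lfs_bmapv(bi, n);                       /* step 2: get current addresses */
        for (int i = 0; i < n; i++)
            if (bi[i].bi_daddr == found_at[i])  /* unchanged, so the block is live */
                bi[nlive++] = bi[i];

        if (nlive > 0)
            lfs_markv(bi, nlive);               /* step 3: rewrite live blocks */
        lfs_segclean(seg);                      /* step 4: segment may be reused */
        lfs_segwait(60);                        /* pause until more work is likely */
        printf("segment %d: rewrote %d of %d blocks\n", seg, nlive, n);
    }

    int
    main(void)
    {
        struct block_info bi[2] = {
            { 7, 0, 8192, 0, NULL },   /* pretend lfs_bmapv left this unchanged */
            { 7, 1, 0,    0, NULL },   /* this block has moved: it is dead */
        };
        uint32_t found_at[2] = { 8192, 12288 };

        clean_segment(3, bi, found_at, 2);
        return 0;
    }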

lfs_bmapv     Take an array of inode number/logical block number pairs and return the disk address for each block. Used to determine if blocks in a segment are "live".
lfs_markv     Take an array of inode number/logical block number pairs and append them into the log. This operation is a special purpose write call that rewrites the blocks and inodes without updating the inode's access or modification times.
lfs_segwait   Causes the cleaner to sleep until a given timeout has elapsed or until another segment is written. This operation is used to let the cleaner pause until there may be more segments available for cleaning.
lfs_segclean  Mark a segment clean. After the cleaner has rewritten all the "live" blocks from a segment, the segment is marked clean for reuse.

Table 3: The System Call Interface for the Cleaner.

Figure 5: BLOCK_INFO Structure used by the Cleaner. The cleaner calculates the current disk address for each block from the disk address of the segment. The kernel specifies which have been superseded by more recent versions. (The structure contains the inode number, logical block number, current disk address, segment creation time, and a buffer pointer.)

The cleaning simulation results in [ROSE91] show that selection of segments to clean is an important design parameter in minimizing cleaning overhead, and that the cost-benefit policy defined there does extremely well for the simulated workloads. Briefly, each segment is assigned a cleaning cost and benefit. The cost to clean a segment is equal to 1 + utilization (the fraction of "live" data in the segment). The benefit of cleaning a segment is free bytes generated * age of segment, where free bytes generated is the fraction of "dead" blocks in the segment (1 - utilization) and age of segment is the time of the most recent modification to a block in that segment. When the file system needs to reclaim space, the cleaner selects the segment with the largest benefit to cost ratio. We retained this policy as the default cleaning algorithm.

Currently the cost-benefit cleaner is the only cleaner we have implemented, but two additional policies are under consideration. The first would run during idle periods and select segments to clean based on coalescing and clustering files. The second would flush blocks in the cache to disk during normal processing, even if they were not dirty, if it would improve the locality for a given file. These policies will be analyzed in future work.
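The cost-benefit computation is simple enough to show directly; the sketch below ranks two hypothetical segments using the formula above. Field names and units are illustrative.

    #include <stdio.h>

    /* Per-segment information, as kept in the segment usage table. */
    struct seg_info {
        double utilization;   /* fraction of "live" bytes, 0.0 through 1.0 */
        double age;           /* time since the most recent modification */
    };

    /* Benefit-to-cost ratio of the cost-benefit policy described above:
     * cost = 1 + utilization, benefit = (1 - utilization) * age. */
    static double
    benefit_to_cost(const struct seg_info *s)
    {
        return (1.0 - s->utilization) * s->age / (1.0 + s->utilization);
    }

    int
    main(void)
    {
        struct seg_info segs[] = {
            { 0.90, 1000.0 },  /* mostly live, but cold */
            { 0.20,   10.0 },  /* mostly dead, but recently written */
        };
        int n = sizeof(segs) / sizeof(segs[0]), best = 0;

        for (int i = 1; i < n; i++)            /* pick the largest ratio */
            if (benefit_to_cost(&segs[i]) > benefit_to_cost(&segs[best]))
                best = i;
        printf("clean segment %d first (ratio %.2f)\n",
            best, benefit_to_cost(&segs[best]));
        return 0;
    }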
4. Implementing LFS in a BSD System

While the last section focused on those design issues that addressed problems in the design of Sprite-LFS, this section presents additional design issues either inherent to LFS or resulting from the integration of an LFS into 4BSD.

4.1. Integration with FFS

The on-disk data structures used by BSD-LFS are nearly identical to the ones used by FFS. This decision was made for two reasons. The first one was that many applications have been written over the years to interpret and analyze raw FFS structures. It is reasonable that these tools could continue to function as before, with minor modifications to read the structures from a new location. The second and more important reason was that it was easy and increased the maintainability of the system. A basic LFS implementation, without cleaner or reconstruction tools, but with dumpfs(1) and newfs(1) tools, was reading and writing from/to the buffer cache in under two weeks, and reading and writing from/to the disk in under a month. This implementation was done by copying the FFS source code and replacing about 40% of it with new code. The FFS and LFS implementations have since been merged to share common code.

In BSD and similar systems (i.e. SunOS, OSF/1), a file system is defined by two sets of interface functions, vfs operations and vnode operations [KLEI86]. Vfs operations affect entire file systems (e.g. mount, unmount, etc.) while vnode operations affect files (open, close, read, write, etc.).

File systems could share code at the level of a vfs or vnode subroutine call, but they could not share the UNIX naming while implementing their own disk storage algorithms. To allow sharing of the UNIX naming, the code common to both the FFS and BSD-LFS was extracted from the FFS code and put in a new, generic file system module (UFS). This code contains all the directory traversal operations, almost all vnode operations, the inode hash table manipulation, quotas, and locking. The common code is used not only by the FFS and BSD-LFS, but by the memory file system [MCKU90] as well. The FFS and BSD-LFS implementations remain responsible for disk allocation, layout, and actual I/O.

In moving code from the FFS implementation into the generic UFS area, it was necessary to add seven new vnode and vfs operations. Table 4 lists the operations that were added to facilitate this integration and explains why they are different for the two file systems.

Vnode Operations
blkatoff    Read the block at the given offset from a file. The two file systems calculate block sizes and block offsets differently, because BSD-LFS does not implement fragments.
valloc      Allocate a new inode. FFS must consult and update bitmaps to allocate inodes, while BSD-LFS removes the inode from the head of the free inode list in the ifile.
vfree       Free an inode. FFS must update bitmaps, while BSD-LFS inserts the inode onto a free list.
truncate    Truncate a file from the given offset. FFS marks bitmaps to show that blocks are no longer in use, while BSD-LFS updates the segment usage table.
update      Update the inode for the given file. FFS pushes individual inodes synchronously, while BSD-LFS writes them in a partial segment.
bwrite      Write a block into the buffer cache. FFS does synchronous writes while BSD-LFS puts blocks on a queue for writing in the next segment.

Vfs Operations
vget        Get a vnode. FFS computes the disk address of the inode while BSD-LFS looks it up in the ifile.

Table 4: New Vnode and Vfs Operations. These routines allowed us to share 60% of the original FFS code with BSD-LFS.
4.1.1. Block Sizes

One FFS feature that is not implemented in BSD-LFS is fragments. The original reason FFS had fragments was that, given a large block size (necessary to obtain contiguous reads and writes and to lower the data to meta-data ratio), fragments were required to minimize internal fragmentation (allocated space that does not contain useful data). LFS does not require large blocks to obtain contiguous reads and writes as it sorts blocks in a file by logical block number, writing them sequentially. Still, large blocks are desirable to keep the meta-data to data ratio low. Unfortunately, large blocks can lead to wasted space if many small files are present. Since managing fragments complicates the file system, we decided to allocate progressively larger blocks instead of using a block/fragment combination. This improvement has not yet been implemented but is similar to the multiblock policy simulated in [SELT91].

4.1.2. The Buffer Cache

Prior to the integration of BSD-LFS into 4BSD, the buffer cache had been considered file system independent code. However, the buffer cache contains assumptions about how and when blocks are written to disk. First, it assumes that a single block can be flushed to disk, at any time, to reclaim its memory. There are two problems with this: flushing blocks a single block at a time would destroy any possible performance advantage of LFS, and, because of the modified meta-data and partial segment summary blocks, LFS may require additional memory to write. Therefore, BSD-LFS needs to guarantee that it can obtain any additional buffers it needs when it writes a segment. To prevent the buffer cache from trying to flush a single BSD-LFS page, BSD-LFS puts its dirty buffers on the kernel LOCKED queue, so that the buffer cache never attempts to reclaim them. The number of buffers on the locked queue is compared against two variables, the start write threshold and stop access threshold, to prevent BSD-LFS from using up all the available buffers. This problem can be much more reasonably handled by systems with better integration of the buffer cache and virtual memory.
Second, the buffer cache assumes that meta-data (both indirect blocks and inodes) are assigned disk addresses when they are created and can be assigned block numbers corresponding to those disk addresses. In BSD-LFS, the disk address is assigned when blocks are written to disk instead of when they are written into the cache. This lazy assignment of disk addresses violates the assumption that all blocks have disk addresses. FFS accesses indirect blocks by hashing on the raw device vnode and the disk block number. Since BSD-LFS has no disk address for these indirect blocks, the block name space had to incorporate meta-data block numbering. This naming is done by making block addresses be signed integers, with negative numbers referencing indirect blocks, while zero and positive numbers reference data blocks. Figure 6 shows how the blocks are numbered. Singly indirect blocks take on the negative of the first data block to which they point. Doubly and triply indirect blocks take the next lower negative number of the singly or doubly indirect block to which they point. This approach makes it simple to traverse the indirect block chains in either direction, facilitating reading a block or creating indirect blocks. Sprite-LFS partitions the "block name space" in a similar fashion. Although it is not possible for BSD-LFS to use FFS meta-data numbering, the reverse is not true. In 4.4BSD, FFS uses the BSD-LFS numbering and the bmap code has been moved into the UFS area.

Figure 6: Block-numbering in BSD-LFS. In BSD-LFS, data blocks are assigned positive block numbers beginning with 0. Indirect blocks are numbered with the negative of the first data block to which they point. Double and triple indirect blocks are numbered with one less than the first indirect or double indirect block to which they point.
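The naming rule for singly indirect blocks can be expressed as a small function. The sketch below assumes twelve direct pointers in the inode and 1024 pointers per indirect block; both values are assumptions chosen for illustration.

    #include <stdio.h>

    #define NDADDR 12      /* direct block pointers in the inode (illustrative) */
    #define NINDIR 1024    /* block pointers per 4K indirect block (illustrative) */

    /* For a data block 'lbn', return the negative logical number that names
     * the singly indirect block covering it: the negative of the first data
     * block that indirect block points to.  Direct blocks need no indirect
     * block, so return 0 for them. */
    static long
    indir_lbn(long lbn)
    {
        if (lbn < NDADDR)
            return 0;
        return -(NDADDR + ((lbn - NDADDR) / NINDIR) * NINDIR);
    }

    int
    main(void)
    {
        /* Under these parameters, -12 covers data blocks 12..1035 and -1036
         * covers 1036..2059; by the stated rule the doubly indirect block
         * would be named -1037, one less than the first indirect block it
         * references. */
        printf("%ld %ld %ld\n", indir_lbn(12), indir_lbn(500), indir_lbn(1036));
        return 0;
    }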

4.2. The IFILE

Sprite-LFS maintained the inode map and segment usage table as kernel data structures which are written to disk at file system checkpoints. BSD-LFS places both of these data structures in a read-only regular file, visible in the file system, called the ifile. There are three advantages to this approach. First, while Sprite-LFS and FFS limit the number of inodes in a file system, BSD-LFS has no such limitation, growing the ifile via the standard file mechanisms. Second, it can be treated identically to other files in most cases, minimizing the special case code in the operating system. Finally, as is discussed in Section 3.5, we intended to move the cleaner into user space, and the ifile is a convenient mechanism for communication between the operating system and the cleaner. A detailed view of the ifile is shown in Figure 7.

Both Sprite-LFS and BSD-LFS maintain disk addresses and inode version numbers in the inode map. The version numbers allow the cleaner to easily identify groups of blocks belonging to files that have been truncated or deleted. Sprite-LFS also keeps the last access time in the inode map to minimize the number of blocks that need to be written when a file system is being used only for reading. Since the access time is eight bytes in 4.4BSD and maintaining it in the inode map would cause the ifile to grow significantly larger, BSD-LFS keeps the access time in the inode. Sprite-LFS allocates inodes by scanning the inode map sequentially until it finds a free inode. This scan is costly if the file system has many inodes. BSD-LFS avoids this scan by maintaining a free list of inodes in the inode map.

The segment usage table contains the number of live bytes in and the last modified time of the segment, and is largely unchanged from Sprite-LFS. In order to support multiple and user mode cleaning processes, we have added a set of flags indicating whether the segment is clean, contains a superblock, is currently being written to, or is eligible for cleaning.

Figure 7: Detailed Description of the IFILE. The ifile is maintained as a regular file with read-only permission. It facilitates communication between the file system and the cleaner. (The ifile holds cleaner information: the number of clean and dirty segments; the segment usage table: the number of live bytes and last modified time for each segment; and the inode map: the version number, disk address, and free list pointer for each inode.)
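As a rough, declarations-only picture of the ifile contents in Figure 7, the records might look like the following. The structure and field names are guesses made for illustration; the real BSD-LFS record formats differ.

    #include <stdint.h>

    struct cleaner_info {       /* summary data read by the cleaner */
        uint32_t clean;         /* number of clean segments */
        uint32_t dirty;         /* number of dirty segments */
    };

    struct segusage {           /* one entry per segment: segment usage table */
        uint32_t live_bytes;    /* bytes of live data in the segment */
        uint32_t lastmod;       /* last modification time */
        uint32_t flags;         /* clean / has superblock / being written / ... */
    };

    struct ifile_entry {        /* one entry per inode: the inode map */
        uint32_t version;       /* inode version, used by the cleaner */
        uint32_t daddr;         /* disk address of the most recent inode copy */
        uint32_t nextfree;      /* next inode on the free list, if this one is free */
    };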
4.3. Directory Operations

Directory operations[4] pose a special problem for LFS. Since the basic premise of LFS is that operations can be postponed and coalesced to provide large I/Os, it is counterproductive to retain the synchronous behavior of directory operations. At the same time, if a file is created, filled with data, and fsynced, then both the file's data and the directory entry for the file must be on disk. Additionally, the UNIX semantics of directory operations are defined to preserve ordering (i.e. if the creation of file a precedes the creation of file b, then any post-recovery state of a file system that includes file b must include file a). We believe this semantic is used in UNIX systems to provide mutual exclusion and other locking protocols[5].

[4] Directory operations include those system calls that affect more than one inode (typically a directory and a file) and include: create, link, mkdir, mknod, remove, rename, rmdir, and symlink.

[5] We have been unable to find a real example of the ordering of directory operations being used for this purpose and are considering removing it as unnecessary complexity. If you have an example where ordering must be preserved across system failure, please send us email at [email protected]!

Sprite-LFS preserves the ordering of directory operations by maintaining a directory operation log inside the file system log. Before any directory updates are written to disk, a log entry that describes the directory operation is written. The log information always appears in an earlier segment, or the same segment, as the actual directory updates. At recovery time, this log is read and any directory operations that were not fully completed are rolled forward. Since this approach requires an additional, on-disk data structure, and since LFS is itself a log, we chose a different solution, namely segment batching.

Since directory operations affect multiple inodes, we need to guarantee that either both of the inodes and associated changes get written to disk or neither does. BSD-LFS has a unit of atomicity, the partial segment, but it does not have a mechanism that guarantees that all inodes involved in the same directory operation will fit into a single partial segment. Therefore, we introduced a mechanism that allows operations to span partial segments. At recovery, we never roll forward a partial segment if it has an unfinished directory operation and the partial segment that completes the directory operation did not make it to disk.

The requirements for segment batching are defined as follows:

1. If any directory operation has occurred since the last segment was written, the next segment write will append all dirty blocks from the ifile (that is, it will be a checkpoint, except that the superblock need not be updated).

2. During recovery, any writes that were part of a directory operation write will be ignored unless the entire write completed. A completed write can be identified if all dirty blocks of the ifile and its inode were successfully written to disk.

This definition is essentially a transaction where the writing of the ifile inode to disk is the commit operation. In this way, there is a coherent snapshot of the file system at some point after each directory operation. The penalty is that checkpoints are written more frequently, in contrast to Sprite-LFS's approach that wrote additional logging information to disk.

The BSD-LFS implementation requires synchronizing directory operations and segment writing. Each time a directory operation is performed, the affected vnodes are marked. When the segment writer builds a segment, it collects vnodes in two passes. In the first pass, all unmarked vnodes (those not participating in directory operations) are collected, and during the second pass those vnodes that are marked are collected. If any vnodes are found during the second pass, this means that there are directory operations present in the current segment, and the segment is marked, identifying it as containing a directory operation. To prevent directory operations from being partially reflected in a segment, no new directory operations are begun while the segment writer is in pass two, and the segment writer cannot begin pass two while any directory operation is in progress.

When recovery is run, the file system can be in one of three possible states with regard to directory operations:

1. The system shut down cleanly so that the file system may be mounted as is.

2. There are valid segments following the last checkpoint and the last one was a completed directory-operation write. Therefore, all that is required before mounting is to rewrite the superblock to reflect the address of the ifile inode and the current end of the log.

3. There are valid segments following the last checkpoint or directory operation write. As in the previous case, the system recovers to the last completed directory operation write and then rolls forward the segments from there to either the end of the log or the first segment beginning a directory operation that is never finished. Then the recovery process writes a checkpoint and updates the superblock.

While rolling forward, two flags are used in the segment summaries: SS_DIROP and SS_CONT. SS_DIROP specifies that a directory operation appears in the partial segment. SS_CONT specifies that the directory operation spans multiple partial segments. If the recovery agent finds a segment with both SS_DIROP and SS_CONT set, it ignores all such partial segments until it finds a later partial segment with SS_DIROP set and SS_CONT unset (i.e. the end of the directory operation write). If no such partial segment is ever found, then all the segments from the initial directory operation on are discarded. Since partial segments are small [BAKE92], this should rarely, if ever, happen.
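The roll-forward rule for directory operations can be sketched as follows. The flag values and the assumption that a directory-operation write occupies consecutive partial segments are illustrative simplifications.

    #include <stdint.h>
    #include <stdio.h>

    #define SS_DIROP 0x01   /* partial segment contains a directory operation */
    #define SS_CONT  0x02   /* the directory operation continues in a later one */

    /* Decide, for a run of partial segments after the last checkpoint, how
     * many may be rolled forward.  Partial segments belonging to a directory
     * operation that was never completed (DIROP with CONT set, never followed
     * by DIROP without CONT) are discarded. */
    static int
    roll_forward_limit(const uint16_t *flags, int n)
    {
        int committed = 0;      /* everything up to here is safe to apply */

        for (int i = 0; i < n; i++) {
            if ((flags[i] & SS_DIROP) && (flags[i] & SS_CONT))
                continue;       /* wait for the segment that completes it */
            committed = i + 1;  /* no dirop here, or the dirop just completed */
        }
        return committed;
    }

    int
    main(void)
    {
        uint16_t log[] = { 0, SS_DIROP | SS_CONT, SS_DIROP, SS_DIROP | SS_CONT };
        printf("roll forward %d of 4 partial segments\n",
            roll_forward_limit(log, 4));   /* prints 3: the last one is dropped */
        return 0;
    }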

4.4. Synchronization

To maintain the delicate balance between buffer management, free space accounting and the cleaner, synchronization between the components of the system must be carefully managed. Figure 8 shows each of the synchronization relationships. The cleaner is given precedence over all other processing in the system to guarantee that clean segments are available if the file system has space. It has its own event variable on which it waits for new work (lfs_allclean_wakeup). The segment writer and user processes will defer to the cleaner if the disk system does not have enough clean space. A user process detects this condition when it attempts to write a block but the block accounting indicates that there is no space available. The segment writer detects this condition when it attempts to begin writing to a new segment and the number of clean segments has reached two.

In addition to cleaner synchronization, the segment writer and user processes synchronize on the availability of buffer headers. When the number of buffer headers drops below the start write threshold, a segment write is initiated. If a write request would push the number of available buffer headers below the stop access threshold, the writing process waits until a segment write completes, making more buffer headers available. Finally, there is the directory operation synchronization. User processes wait on the lfs_dirop condition and the segment writer waits on the lfs_writer condition.

Figure 8: Synchronization Relationships in BSD-LFS. The cleaner has precedence over all components in the system. It waits on the lfs_allclean_wakeup condition and wakes the segment writer or user processes using the lfs_avail condition. The segment writer and user processes maintain directory operation synchronization through the lfs_dirop and lfs_writer conditions. User processes doing writes wait on the locked_queue_count when the number of dirty buffers held by BSD-LFS exceeds a system limit.
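The buffer-header thresholds behave roughly as in the sketch below; the threshold values and variable names are invented for the example.

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative limits: writing starts when free buffer headers dip below
     * one threshold, and callers block below a lower one. */
    #define START_WRITE_THRESHOLD 64
    #define STOP_ACCESS_THRESHOLD 16

    static bool segment_write_started;

    /* Called on each write request with the current number of free headers;
     * returns false when the caller must sleep until a segment write completes. */
    static bool
    may_write(int free_headers)
    {
        if (free_headers < START_WRITE_THRESHOLD && !segment_write_started) {
            segment_write_started = true;      /* kick the segment writer */
            printf("starting a segment write\n");
        }
        return free_headers >= STOP_ACCESS_THRESHOLD;
    }

    int
    main(void)
    {
        printf("%d %d %d\n", may_write(100), may_write(40), may_write(10));
        return 0;
    }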
4.5. Minor Modifications

There are a few additional changes to Sprite-LFS. To provide more robust recovery we replicate the superblock throughout the file system, as in FFS. Since the file system meta-data is stored in the ifile, we have no need for separate checkpoint regions, and simply store the disk address of the ifile inode in the superblock. Note that it is not necessary to keep a duplicate ifile since it can be reconstructed from segment summary information, if necessary.

5. Performance Measurements

This section compares the performance of the redesigned log-structured file system to more traditional, read-optimized file systems on a variety of benchmarks based on real workloads. Read-optimized policies that favor large blocks or contiguous layout perform similarly [SELT91], so we analyze FFS and a variant of FFS that does extent-like allocation. The new log-structured file system was written in November of 1991 and was left largely untouched until late spring 1992, and is therefore a completely untuned implementation. While design decisions took into account the expected performance impact, at this point there is little empirical evidence to support those decisions.

The file systems against which LFS is compared are the regular fast file system (FFS), and an enhanced version of FFS similar to that described in [MCVO91], referred to as EFS for the rest of this paper.

EFS provides extent-based file system behavior without changing the underlying structures of FFS, by allocating blocks sequentially on disk and clustering multiple block requests. FFS is parameterized by a variable called maxcontig that specifies how many logically sequential disk blocks should be allocated contiguously. When maxcontig is large (equal to a track), FFS does what is essentially track allocation. In EFS, sequential dirty buffers are accumulated in the cache, and when an extent's worth (i.e. maxcontig blocks) have been collected, they are bundled together into a cluster, providing extent-based writing.

To provide extent-based reading, the interaction between the buffer cache and the disk was modified. Typically, before a block is read from disk, the bmap routine is called to translate logical block addresses to physical disk block addresses. The block is then read from disk and the next block is requested. Since I/O interrupts are not handled instantaneously, the disk is usually unable to respond to two contiguous requests on the same rotation, so sequentially allocated blocks incur the cost of an entire rotation. For both EFS and BSD-LFS, bmap was extended to return not only the physical disk address, but also the number of contiguous blocks that follow the requested block. Then, rather than reading one block at a time and requesting the next block asynchronously, the file system reads many contiguous blocks in a single request, providing extent-based reading. Because BSD-LFS potentially allocates many blocks contiguously, it may miss rotations between reading collections of blocks. Since EFS uses the FFS allocator, it leaves a rotational delay between clusters of blocks and does not pay this penalty.

Each of the next sections describes a benchmark and presents performance analysis for each file system. The first benchmark analyzes raw file system performance. The next two benchmarks attempt to model specific workloads. A time-sharing environment is modeled by a software development benchmark, and a database environment is modeled by the industry-standard TPCB benchmark [TPCB90].

5.1. The Evaluation Platform

Our benchmarking configuration consists of a Hewlett-Packard series 9000/300 computer with a 25 Mhz MC68040 processor. It has 16 megabytes of main memory, and an HP RD335 with an HPIB[6] interface. The hardware configuration is summarized in Table 5. The system is running the 4.4BSD-Alpha operating system and all measurements were taken with the system running single-user, unattached to any network. Each of the file systems uses a 4K block size, with FFS and EFS having 1K fragments.

[6] HPIB is an unbelievably slow interface, as is shown in Table 6.

The three file systems being evaluated run in the same operating system kernel and share most of their source code. There are approximately 6000 lines of shared C code, 4000 lines of LFS-specific code, and 3500 lines of FFS-specific code. EFS uses the same source code as FFS plus an additional 500 lines of clustering code, of which 300 are also used by BSD-LFS for read clustering.

Disk (HP RD335)
  Average seek       16.6 ms
  Single rotation    15.0 ms
  Transfer time      1 MB/sec
  Track size         56.5 KB/track
CPU
  Clock rate         25 Mhz
  MIPS               10-12

Table 5: Hardware Specifications.

5.2. Raw File System Performance

The goal of this test is to measure the maximum throughput that can be expected from the given disk and system configuration for each of the file systems. For this test, the three file systems are compared against the maximum speed at which the operating system can write directly to the disk. The benchmark consists of either reading or writing a large amount of data sequentially to a newly created file system or a raw partition. Two versions of this test are run. In the first, the data is appended, while in the second, the same data is repeatedly overwritten. While FFS and EFS merely rewrite the existing data blocks, BSD-LFS is continually allocating new blocks and marking the old blocks as no longer in use, thus requiring more CPU processing than any of the other file systems. The results for both the APPEND and OVERWRITE tests are shown in Table 6.

5.1. The Evaluation Platform

Our benchmarking configuration consists of a Hewlett-Packard series 9000/300 computer with a 25 MHz MC68040 processor. It has 16 megabytes of main memory and an HP RD335 disk with an HPIB interface (HPIB is an unbelievably slow interface, as is shown in Table 6). The hardware configuration is summarized in Table 5. The system is running the 4.4BSD-Alpha operating system and all measurements were taken with the system running single-user, unattached to any network. Each of the file systems uses a 4K block size, with FFS and EFS having 1K fragments.

    Disk (HP RD335)
        Average seek        16.6 ms
        Single rotation     15.0 ms
        Transfer time       1 MB/sec
        Track size          56.5 KB/track
    CPU                     25 MHz
    MIPS                    10-12

Table 5: Hardware Specifications.

The three file systems being evaluated run in the same operating system kernel and share most of their source code. There are approximately 6000 lines of shared C code, 4000 lines of LFS-specific code, and 3500 lines of FFS-specific code. EFS uses the same source code as FFS plus an additional 500 lines of clustering code, of which 300 are also used by BSD-LFS for read clustering.

Each of the next sections describes a benchmark and presents performance analysis for each file system. The first benchmark analyzes raw file system performance. The next two benchmarks attempt to model specific workloads. A time-sharing environment is modeled by a software development benchmark, and a database environment is modeled by the industry-standard TPCB benchmark [TPCB90].

5.2. Raw File System Performance

The goal of this test is to measure the maximum throughput that can be expected from the given disk and system configuration for each of the file systems. For this test the three file systems are compared against the maximum speed at which the operating system can write directly to the disk. The benchmark consists of either reading or writing a large amount of data sequentially to a newly created file system or a raw partition. Two versions of this test are run. In the first the data is appended, while in the second the same data is repeatedly overwritten. While FFS and EFS merely rewrite the existing data blocks, BSD-LFS is continually allocating new blocks and marking the old blocks as no longer in use, thus requiring more CPU processing than any of the other file systems. The results for both the APPEND and OVERWRITE tests are shown in Table 6.
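The structure of the test is simple; the program below is a schematic of the kind of sequential write pass just described. It is not the benchmark actually used to produce Table 6, and the request size, total size, and use of fsync() are assumptions chosen for illustration.

    /*
     * Schematic of the sequential-throughput benchmark: write a large
     * file (or raw partition) in fixed-size requests and report MB/sec.
     * Illustrative only; not the original benchmark program.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define REQUEST (1024 * 1024)       /* .5M, 1M, or 2M in the real test */
    #define TOTAL   (64 * REQUEST)      /* total data written per run */

    int
    main(int argc, char *argv[])
    {
        static char buf[REQUEST];
        struct timeval start, end;
        double secs;
        long written;
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s file-or-raw-partition\n", argv[0]);
            exit(1);
        }
        memset(buf, 'x', sizeof(buf));
        if ((fd = open(argv[1], O_WRONLY | O_CREAT, 0644)) < 0) {
            perror("open");
            exit(1);
        }
        gettimeofday(&start, NULL);
        for (written = 0; written < TOTAL; written += REQUEST)
            if (write(fd, buf, REQUEST) != REQUEST) {
                perror("write");
                exit(1);
            }
        fsync(fd);                      /* include the cost of flushing */
        gettimeofday(&end, NULL);
        secs = (end.tv_sec - start.tv_sec) +
            (end.tv_usec - start.tv_usec) / 1e6;
        printf("%.2f MB/sec\n", TOTAL / secs / (1024.0 * 1024.0));
        close(fd);
        return (0);
    }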

APPENDS

                     Writes                                Reads
Transfer Unit   .5M          1M           2M            .5M          1M           2M
RAW             0.31 (0.00)  0.31 (0.00)  0.31 (0.00)   0.45 (0.01)  0.45 (0.00)  0.45 (0.00)
FFS             0.11 (0.00)  0.11 (0.00)  0.12 (0.00)   0.14 (0.00)  0.14 (0.00)  0.14 (0.00)
EFS             0.26 (0.02)  0.28 (0.01)  0.28 (0.01)   0.38 (0.02)  0.38 (0.00)  0.36 (0.03)
LFS             0.27 (0.00)  0.28 (0.00)  0.29 (0.00)   0.33 (0.00)  0.36 (0.00)  0.37 (0.00)

OVERWRITES

                     Writes                                Reads
Transfer Unit   .5M          1M           2M            .5M          1M           2M
RAW             0.30 (0.00)  0.30 (0.00)  0.30 (0.00)   0.43 (0.00)  0.43 (0.00)  0.43 (0.00)
FFS             0.12 (0.00)  0.12 (0.00)  0.12 (0.00)   0.12 (0.00)  0.14 (0.00)  0.14 (0.00)
EFS             0.29 (0.01)  0.30 (0.00)  0.28 (0.00)   0.35 (0.00)  0.37 (0.00)  0.37 (0.00)
LFS             0.25 (0.00)  0.26 (0.00)  0.28 (0.00)   0.33 (0.00)  0.35 (0.00)  0.36 (0.00)

BSD-LFS OVERWRITE PERFORMANCE WITH CLEANER RUNNING

                     Writes                                Reads
Transfer Unit   .5M          1M           2M            .5M          1M           2M
0% Utilization  0.25 (0.00)  0.26 (0.00)  0.28 (0.00)   0.33 (0.03)  0.35 (0.02)  0.36 (0.01)
50% Utilization 0.24 (0.01)  0.26 (0.01)  0.27 (0.03)   0.33 (0.03)  0.35 (0.02)  0.36 (0.01)

Table 6: Raw File System Performance (MB/sec). The first two tables show the raw file system performance in megabytes per second. The numbers shown are averages of ten runs with the standard deviation in parentheses. In the first table, new data is appended to a file each time, while in the second, the data is overwritten. The benchmark issues the writes in .5 megabyte, 1 megabyte, or 2 megabyte requests as specified by the columns. The third table shows the overwrite test for BSD-LFS when the cleaner is running. In the first row, the same file is repeatedly overwritten, leaving very little work for the cleaner, while in the second, alternating files are overwritten so that each segment cleaned is approximately half full.

Given the sequential layout of both BSD-LFS and EFS, the expectation is that both should perform comparably to the speed of the raw disk. For the write test both are nearly identical and within 10% of the raw disk speed. This 10% difference is an approximation of the file system overhead. In BSD-LFS the overhead comes from creating segments, while in EFS the overhead comes from doing block allocation. Since FFS does sequential allocation with a rotational delay between each block, the expectation is that it should exhibit performance approximately half that of EFS and BSD-LFS. The performance measurements support this hypothesis.

FFS demonstrates read performance approximately 10% better than write performance for the append tests. The 10% improvement is due partially to read-ahead and partially to avoiding the file system block allocation overhead required during writing. Since EFS and BSD-LFS do sequential allocation, the expectation is that their read performance should be comparable to that of the raw disk. However, Table 6 shows that BSD-LFS and EFS achieve only 70% and 85% of the raw disk performance, respectively. The explanation for BSD-LFS lies in the mapping between the file written and the on-disk segment layout. The file system is configured with one megabyte segments. A one megabyte segment does not hold one megabyte worth of data, due to the presence of meta-data (segment summary blocks, inode blocks, and indirect blocks) in the segment. As a result the requests span segment boundaries, and the resulting seek often incurs the penalty of a full rotation. The extent-based system does better because it takes advantage of FFS's understanding of rotational delay; it pays only a rotational delay (on the order of 0.25 rotations) between clusters of blocks. All the systems suffer a small (approximately 5%) performance penalty for the overwrite test since a disk seek is required before each read is issued.

The third table in Table 6 shows the impact of the cleaner on BSD-LFS. In the first row (0% utilization), the same file is overwritten repeatedly. As a result the only valid data is in the current segment, and the cleaner does very little work to reclaim space.
In the second row (50% utilization), every other file is overwritten, leaving each segment half full. In this case, the cleaner copies one-half segment, on average, to reclaim one segment of space.

As expected, the cleaner had virtually no impact during the read test (since there was no data being overwritten). For the 0% utilization test, there is also no impact on performance, which is not surprising. Even when each segment is half utilized, the impact is under 10%. However, when the cleaner is running, performance is less predictable, as is evidenced by the 10% standard deviations observed during those tests.

The remaining tests are all designed to stress the file systems. For BSD-LFS, that means the cleaner is running and the disk systems are fairly full (above 80%), so that the cleaner is forced to reclaim space. For EFS and FFS, it becomes more difficult to allocate blocks optimally when they run on fairly full file systems, so degradation is expected for those systems as well.

5.3. Software Development Workload

The next tests evaluate BSD-LFS in a typical software development environment. The Andrew benchmark [OUST90] is often used for this type of measurement. The Andrew benchmark was created by M. Satyanarayanan of the Information Technology Center at Carnegie-Mellon University. It contains five phases.

1. Create a directory hierarchy.
2. Make several copies of the data.
3. Recursively examine the status of every file.
4. Examine every byte of every file.
5. Compile several of the files.

Unfortunately, the test set for the Andrew benchmark is small, and main-memory file caching can make the results uninteresting. To exercise the file systems, this benchmark is run both single-user and multi-user, and the system's cache is kept small (1.6 megabytes). Table 7 shows the performance of the standard Andrew benchmark, with multi-user results presented in Table 8.

            Phase 1    Phase 2    Phase 3    Phase 4    Phase 5
    FFS     3.30       13.20      7.00       16.10      48.80
    EFS     3.30       11.80      6.90       13.20      46.00
    LFS     1.30       13.70      8.00       16.60      46.20

Table 7: Single-User Andrew Benchmark Results. This table shows the average elapsed time, in seconds, for each phase of the Andrew benchmark. As expected, BSD-LFS does best on phase 1, the directory creation portion of the test. However, in the remaining tests, the extent-based system provides the best performance because it benefits from contiguous layout on both reading and writing, but does not vie for disk servicing with the cleaner.

As expected, BSD-LFS does quite well on the directory creation phase of the benchmark (Phase 1), as BSD-LFS avoids synchronous writes and pays no allocation overhead during directory creation. However, the remaining phases of the test exercise read performance, and EFS performs the best, with BSD-LFS and FFS performing comparably. Although BSD-LFS can take advantage of contiguous layout, as does EFS, it exhibits poorer performance due to I/Os issued by the cleaner. In this test the cleaner pays a substantial performance penalty during cleaning, because it reads entire segments before determining which blocks are "live". Since the files in this benchmark are created and deleted in large groups, most of the blocks read by the cleaner are discarded and most of the reads accomplished nothing.

These single-user results are significantly different from those of Sprite-LFS discussed in [ROSE92]. There are several reasons for this. As mentioned earlier, a goal of this test is to stress the file systems, so both the cache and the file system are small. The small cache (1.6 megabytes) ensures that both read and write performance to the disk can be measured. The small file system guarantees that the cleaner has work to do, so its impact on system performance can be analyzed.

For the multi-user benchmark, four separate trees are created and the benchmark runs concurrently in each tree. The reported results are averaged across ten runs on each of the four trees for a total of forty iterations. The multi-user Andrew results are shown in Table 8. In a multi-user environment the strengths and weaknesses of BSD-LFS become more apparent.

            Phase 1    Phase 2    Phase 3    Phase 4    Phase 5
    FFS     10.68      36.17      19.18      36.27      167.40
    EFS     9.12       34.25      16.90      32.55      168.75
    LFS     1.02       22.30      19.23      31.10      183.20

Table 8: Multi-User Andrew Benchmark Results. This table shows the average elapsed time, in seconds, of each phase of the Andrew benchmark with four concurrent invocations. The times are less than four times slower than the single-user times due to the overlapped CPU and I/O time. The multi-user test emphasizes the strengths and weaknesses of BSD-LFS. It performs well in phases 1 and 2, which are predominantly writes, and in phase 4, where temporal locality produces good read performance. However, performance suffers in phase 5, where read and write traffic is interspersed and the impact of the cleaner is most severe.
Phase 1 performance is exceptionally good because BSD-LFS is not doing synchronous file and directory creation. Since phase 2 is write intensive, BSD-LFS performs approximately 35% better than its nearest competitor (EFS), because it pays very few seeks in writing. In phase 3, where every file's inode is examined, EFS demonstrates approximately 10% better performance than either LFS or FFS. In phase 4 every byte of every file is examined. Both LFS and EFS achieve approximately a 10% performance improvement over FFS due to the contiguous nature of their reads. LFS performs slightly better because the different trees were likely to be created at approximately the same time, and their data is likely to be physically close on the disk. Phase 5 highlights the weakness in LFS. This phase is characterized by interspersed reads and writes. Once again, the large reads and writes performed by the cleaner compete with synchronous read requests, making LFS the poorest performer by approximately 8%. When the cleaner is forced to run, synchronous I/O performance, typically read performance, suffers. The next benchmark demonstrates this even more dramatically.

5.4. Transaction Processing Performance

The industry-standard TPCB is used as a database-oriented test. Our benchmark is a modified version of the TPCB benchmark, configured for a 10 transaction per second system. While the benchmark is designed to be run multi-user, we show only single-user (worst case) results. Additionally, we do not keep redundant logs and we do not model "think time" between transactions. Each data point represents ten runs of 1000 transactions. The counting of transactions is not begun until the buffer pool has filled, so the measurements do not exhibit any artifacts of an empty cache. Transaction run lengths of greater than 1000 were measured, but there was no noticeable change in performance after the first 1000 transactions.
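For readers unfamiliar with the workload, the sketch below shows the shape of a single TPCB-style transaction: read and update an account, a teller, and a branch record, then append a history record. The record layout, file organization, and function names are simplifications invented for illustration; this is not the benchmark implementation used to produce Table 9. What matters for the discussion that follows is that each transaction issues a handful of small, random reads and writes.

    /*
     * Schematic of one TPCB-style transaction (invented layout; not the
     * benchmark code used for Table 9): update an account, its teller
     * and branch balances, and append a history record.
     */
    #include <sys/types.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    struct record { long id; long balance; char pad[84]; }; /* ~100 bytes */
    struct history { long acct, teller, branch, delta; };

    /* Read-modify-write one fixed-size record at offset id * sizeof(record). */
    static void
    update(int fd, long id, long delta)
    {
        struct record r;
        off_t off = (off_t)id * sizeof(r);

        if (pread(fd, &r, sizeof(r), off) != (ssize_t)sizeof(r)) {
            perror("read record");
            exit(1);
        }
        r.balance += delta;
        if (pwrite(fd, &r, sizeof(r), off) != (ssize_t)sizeof(r)) {
            perror("write record");
            exit(1);
        }
    }

    /* One transaction; the caller supplies already-open data files. */
    void
    tpcb_transaction(int acctfd, int tellerfd, int branchfd, int histfd,
        long acct, long teller, long branch, long delta)
    {
        struct history h = { acct, teller, branch, delta };

        update(acctfd, acct, delta);        /* small random I/O */
        update(tellerfd, teller, delta);
        update(branchfd, branch, delta);
        if (write(histfd, &h, sizeof(h)) != (ssize_t)sizeof(h)) {
            perror("history");              /* sequential append */
            exit(1);
        }
        /* A real implementation would commit here, e.g. by fsync(). */
    }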

When the cleaner is not running, BSD-LFS behaves as predicted in simulation [SELT90]. It shows approximately 20% improvement over the extent-based system. However, the impact of the cleaner is far worse than was expected. With one megabyte segments and the small random I/Os done by TPCB, most segments have only a few dead blocks available for reclamation. As a result the cleaner does large reads as well as large writes to reclaim space. Each of these large I/O operations busies the disk for long periods of time, making the TPCB reads wait.

In an attempt to reduce this wait time, a second set of tests was run with a smaller segment size (128 kilobytes). The performance before cleaning is the same as for the 1 megabyte case, but the after-cleaning performance is slightly better (about 13%). In either case, the impact of the cleaner is so severe that BSD-LFS cannot compete with either FFS or EFS. For the tests presented here, the disk was running at 85% utilization, and the cleaner was continually running. Note that FFS and EFS allocate up to 90% of the disk capacity without exhibiting any performance degradation [MCKU84]. This result shows that log-structured file systems are much more sensitive to the disk utilization. While the user-level cleaner avoids synchronization costs between user processes and the kernel, it cannot avoid the contention on the disk arm.

                            Transactions     Elapsed Time
                            per second       (1000 transactions)
    FFS                     8.9              112.51 (1.2)
    EFS                     9.0              111.03 (1.3)
    LFS (no cleaner)        10.8             92.76 (0.87)
    LFS (cleaner, 1M)       4.3              232.37 (16.84)
    LFS (cleaner, 128K)     5.0              200.57 (68.72)

Table 9: TPCB Performance Results for FFS, EFS, and BSD-LFS. The TPCB database was scaled for a 10 TPS system (1,000,000 accounts, 100 tellers, and 10 branches). The elapsed time is reported for runs of 1000 transactions. The BSD-LFS results show performance before the cleaner begins to run and after the cleaner begins to run. The after-cleaner performance is shown for file systems with segment sizes of one megabyte and 128K.

6. Conclusions

The implementation of BSD-LFS highlighted some subtleties in the overall LFS strategy. While allocation in BSD-LFS is simpler than in extent-based file systems or file systems like FFS, the management of memory is much more complicated. The Sprite-LFS implementation addressed this problem by reserving large amounts of memory. Since this is not feasible in most environments, a more complex mechanism to manage buffer and memory requirements is necessary. LFS operates best when it can write out many dirty buffers at once. However, holding dirty data in memory until much data has accumulated requires consuming more memory than might be desirable, and may not be allowed (e.g. NFS semantics require synchronous writes). In addition, the act of writing a segment requires allocation of additional memory (for segment summaries and on-disk inodes), so segment writing needs to be initiated before memory becomes a critical resource, to avoid memory thrashing or deadlock.

The delayed allocation of BSD-LFS makes accounting of available free space more complex than that in a pre-allocated system like FFS. In Sprite-LFS, the space available to a file system is the sum of the disk space and the buffer pool. As a result, data is written to the buffer pool for which there might not be free space available on disk. Since the applications that wrote the data may have exited before the data is written to disk, there is no way to report the "out of disk space" condition. This failure to report errors is unacceptable in a production environment. To avoid this phenomenon, available space accounting must be done as dirty blocks enter the cache instead of when they are written from cache to disk. Accounting for the actual disk space required is difficult because inodes are not written into dirty buffers and segment summaries are not created until the segment is written. Every time an inode is modified in the inode cache, a count of inodes to be written is incremented.
When blocks are dirtied, the number of available disk blocks is decremented. To decide if there is enough disk space to allow another write into the cache, the number of segment summaries necessary to write what is in the cache is computed, added to the number of inode blocks necessary to write the dirty inodes, and compared to the amount of space available on the disk. To create more available disk space, either the cleaner must run or dirty blocks in the cache must be deleted.
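This check can be written down compactly. The sketch below is an illustration of the accounting rule with invented names and simplified units of whole blocks; it is not the BSD-LFS code. The idea is that data blocks are charged against the disk as they are dirtied, and a new dirty block is admitted only if the segment summaries and inode blocks required to flush the cache would still fit in the space that remains.

    /*
     * Illustrative sketch of BSD-LFS-style free-space accounting
     * (invented names, simplified units of whole blocks).
     */
    struct fs_accounting {
        long    avail_blocks;       /* free disk blocks, already reduced by
                                     * blocks charged for cached dirty data */
        long    dirty_blocks;       /* data blocks dirtied, not yet written */
        long    dirty_inodes;       /* inodes modified in the inode cache */
        long    inodes_per_block;
        long    blocks_per_segment;
    };

    /* Meta-data blocks needed to flush the cache: one summary block per
     * segment written, plus enough blocks to hold the dirty inodes. */
    static long
    metadata_blocks(const struct fs_accounting *fs, long dirty_blocks)
    {
        long summaries, inode_blocks;

        summaries = (dirty_blocks + fs->blocks_per_segment - 1) /
            fs->blocks_per_segment;
        inode_blocks = (fs->dirty_inodes + fs->inodes_per_block - 1) /
            fs->inodes_per_block;
        return (summaries + inode_blocks);
    }

    /* Called when an inode is modified in the inode cache. */
    void
    lfs_inode_dirtied(struct fs_accounting *fs)
    {
        fs->dirty_inodes++;
    }

    /* Called before another block is dirtied; nonzero means "admit it". */
    int
    lfs_block_dirtied(struct fs_accounting *fs)
    {
        if (metadata_blocks(fs, fs->dirty_blocks + 1) + 1 > fs->avail_blocks)
            return (0);     /* caller must run the cleaner or flush */
        fs->dirty_blocks++;
        fs->avail_blocks--; /* charge the data block against the disk now */
        return (1);
    }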

A third and more critical situation arises when the cleaner consumes more disk space than it frees during cleaning. Although this cannot happen over the long term, during short term intervals it can. Consider the simple three segment file system shown below in Figure 9. Segment 1 contains one free block (the first block marked "Deleted Data"). However, cleaning that segment requires rewriting the indirect block for file 1. Therefore, after segment 1 is cleaned, segment 3 will be full, segment 1 will be clean, and one block in segment 2 will be dead (Figure 10). While the total number of live blocks on the system has not increased, it has not decreased either, and the act of cleaning the segment has not created any additional space. It is possible to construct cases where cleaning a segment actually decreases the amount of available space (consider a segment that contains N blocks from N different files, each of which is accessed via an indirect block, and the indirect block resides in a different segment). Therefore two segments are reserved for the cleaner. One guarantees that the cleaner can run at all, and the second ensures that small overflows can be accommodated until more space is reclaimed.

Figure 9: Segment Layout for Bad Cleaner Behavior. Segments 1 and 2 contain data. The cleaner will attempt to free up the one disk block of deleted data from segment 1. However, to rewrite the data in segment 1, it will dirty the meta-data block currently in segment 2. As a result, the cleaner will not generate any additional clean blocks.

Figure 10: Segment Layout After Cleaning. The cleaner cleaned segment 1. In doing so, it rewrote the indirect block that previously resided in segment 2. Now that block has been deleted and the cleaner will be able to reclaim a disk block by cleaning segment 2.

7. Future Directions

The novel structures of BSD-LFS make it an exciting vehicle for adding functionality to the file system. For example, there are two characteristics of BSD-LFS that make it desirable for transaction processing. First, the multiple, random writes of a single transaction get bundled and written at sequential speeds, so we expect to see a dramatic performance improvement in multi-user transaction applications, if sufficient disk space is available. Second, since data is never overwritten, before-images of updated pages exist in the file system until they are reclaimed by the cleaner. An implementation that exploits these two characteristics is described and analyzed for Sprite-LFS in [SELT93], and we plan on doing a prototype implementation of transactions in BSD-LFS.

The "no-overwrite" characteristic of BSD-LFS makes it ideal for supporting unrm, which would undo a file deletion. Saving a single copy of a file is no more difficult than changing the cleaner policy to not reclaim space from the last version of a file, and the only challenge is finding the old inode. More sophisticated versioning should be only marginally more complicated.
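One way to picture that policy change is sketched below. The structures and the version test are invented for illustration; the paper does not specify how the last version of a deleted file would be identified, and locating the old inode is exactly the challenge noted above.

    /*
     * Illustrative sketch of an "unrm-friendly" cleaner policy (invented
     * structures; not BSD-LFS code).  The cleaner normally discards
     * blocks that no longer belong to any live file; to support undelete
     * it would also preserve blocks of the most recent deleted version.
     */
    struct cleaner_block {
        long    ino;        /* inode number the block belonged to */
        long    version;    /* file version recorded with the segment */
        int     live;       /* nonzero if a live file still maps it */
    };

    /* Hypothetical lookup: latest version ever created for this inode. */
    long latest_version(long ino);

    /* Return nonzero if the cleaner should copy this block forward. */
    int
    cleaner_should_keep(const struct cleaner_block *b)
    {
        if (b->live)
            return (1);     /* the ordinary cleaning rule */
        /* Deleted, but this is the last version of the file: keep it so
         * that a later unrm could restore it. */
        return (b->version == latest_version(b->ino));
    }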
Also, the sequential nature of BSD-LFS write patterns makes it nearly ideal for tertiary storage devices [KOHL93]. LFS may be extended to include multiple devices in a single file system. If one or more of these devices is a robotic storage device, such as a tape stacker, then the file system may have tremendous storage capacity. Such a file system would be particularly suitable for on-line archival or backup storage.

An early version of the BSD-LFS implementation was shipped as part of the 4.4BSD-Alpha release. The current version described in this paper will be available as part of 4.4BSD. Additionally, the FFS shipped with 4.4BSD will contain the enhancements to provide clustered reading and writing.

8. Acknowledgements

Mendel Rosenblum and John Ousterhout are responsible for the original design and implementation of Sprite-LFS. Without their work, we never would have tried any of this. We also wish to express our gratitude to John Wilkes and Hewlett-Packard for their support.

9. References

[BAKE91] Baker, M., Hartman, J., Kupfer, M., Shirriff, L., Ousterhout, J., "Measurements of a Distributed File System," Proceedings of the 13th Symposium on Operating System Principles, Monterey, CA, October 1991, 198-212. Published as Operating Systems Review 25, 5 (October 1991).

[BAKE92] Baker, M., Asami, S., Deprit, E., Ousterhout, J., Seltzer, M., "Non-Volatile Memory for Fast Reliable File Systems," to appear in Proceedings of the Fifth Conference on Architectural Support for Programming Languages and Operating Systems, Boston, MA, October 1992.

[CAR92] Carson, S., Setia, S., "Optimal Write Batch Size in Log-structured File Systems," Proceedings of the 1992 Usenix Workshop on File Systems, Ann Arbor, MI, May 21-22, 1992, 79-91.

[HAER83] Haerder, T., Reuter, A., "Principles of Transaction-Oriented Database Recovery," Computing Surveys, 15(4), 1983, 237-318.

[KAZA90] Kazar, M., Leverett, B., Anderson, O., Vasilis, A., Bottos, B., Chutani, S., Everhart, C., Mason, A., Tu, S., Zayas, E., "DECorum File System Architectural Overview," Proceedings of the 1990 Summer Usenix, Anaheim, CA, June 1990, 151-164.

[KLEI86] Kleiman, S. R., "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," Usenix Conference Proceedings, June 1986, 238-247.

[KOHL93] Kohl, J., Staelin, C., Stonebraker, M., "Highlight: Using a Log-structured File System for Tertiary Storage Management," Proceedings of the 1993 Winter Usenix, San Diego, CA, January 1993.

[MCKU84] McKusick, M. K., Joy, W., Leffler, S., Fabry, R. S., "A Fast File System for UNIX," ACM Transactions on Computer Systems, 2(3), August 1984, 181-197.

[MCKU90] McKusick, M. K., Karels, M. J., Bostic, K., "A Pageable Memory Based Filesystem," Proceedings of the 1990 Summer Usenix Technical Conference, Anaheim, CA, June 1990, 137-144.

[MCV091] McVoy, L., Kleiman, S., "Extent-like Performance from a Unix File System," Proceedings of the 1991 Winter Usenix, Dallas, TX, January 1991, 33-44.

[MORA90] Moran, J., Sandberg, R., Coleman, D., Kepecs, J., Lyon, B., "Breaking Through the NFS Performance Barrier," Proceedings of the 1990 Spring European Unix Users Group, Munich, Germany, April 1990, 199-206.

[OUST85] Ousterhout, J., Costa, H., Harrison, D., Kunze, J., Kupfer, M., Thompson, J., "A Trace-Driven Analysis of the UNIX 4.2BSD File System," Proceedings of the Tenth Symposium on Operating System Principles, December 1985, 15-24. Published as Operating Systems Review 19, 5 (December 1985).

[OUST88] Ousterhout, J., Douglis, F., "Beating the I/O Bottleneck: A Case for Log-Structured File Systems," Computer Science Division (EECS), University of California, Berkeley, UCB/CSD 88/467, October 1988.

[OUST90] Ousterhout, J., "Why Aren't Operating Systems Getting Faster As Fast as Hardware?" Proceedings of the 1990 Summer Usenix, Anaheim, CA, June 1990, 247-256.

[ROSE90] Rosenblum, M., Ousterhout, J. K., "The LFS Storage Manager," Proceedings of the 1990 Summer Usenix, Anaheim, CA, June 1990, 315-324.

[ROSE91] Rosenblum, M., Ousterhout, J. K., "The Design and Implementation of a Log-Structured File System," Proceedings of the Symposium on Operating System Principles, Monterey, CA, October 1991, 1-15. Published as Operating Systems Review 25, 5 (October 1991). Also available as Transactions on Computer Systems 10, 1 (February 1992), 26-52.

[ROSE92] Rosenblum, M., "The Design and Implementation of a Log-structured File System," PhD Thesis, University of California, Berkeley, June 1992. Also available as Technical Report UCB/CSD 92/696.

[SELT90] Seltzer, M., Chen, P., Ousterhout, J., "Disk Scheduling Revisited," Proceedings of the 1990 Winter Usenix, Washington, D.C., January 1990, 313-324.

[SELT91] Seltzer, M., Stonebraker, M., "Read Optimized File Systems: A Performance Evaluation," Proceedings of the 7th Annual International Conference on Data Engineering, Kobe, Japan, April 1991, 602-611.

[SELT93] Seltzer, M., "Transaction Support in a Log-Structured File System," to appear in Proceedings of the 1993 International Conference on Data Engineering, Vienna, April 1993.

[THOM78] Thompson, K., "Unix Implementation," Bell Systems Technical Journal, 57(6), part 2, July-August 1978, 1931-1946.

[TPCB90] Transaction Processing Performance Council, "TPC Benchmark B," Standard Specification, Waterside Associates, Fremont, CA, 1990.
