Journaled Soft-updates

Marshall Kirk McKusick Author and Consultant JeffRoberson Consultant

ABSTRACT

This paper describes the work to add ‘‘journaling lite’’tosoft updates and its incorporation into the FreeBSD fast filesystem. Because soft updates prevent most inconsistencies, the journal need only track those inconsistencies that soft updates fails to address. Specifically,the journal con- tains the information needed to recoverthe block and inode resources that have been freed but whose freed status failed to makeittodisk before a system failure. After acrash, a variant of the venerable fsck program runs through the journal to identify and free the lost resources. Only if an inconsistencybetween the log and filesystem is detected is it necessary to run fsck.The jour- nal is tiny, 16Mb is usually enough independent of filesystem size. Although journal processing needs to be done before restarting, the processing time is typically just a fewseconds and in the worst case a minute. It is not necessary to build a newfilesystem to use soft-updates journaling. The addition or deletion of soft-updates journaling to existing fast filesystems is done using the tunefs program.

2. Compatibility with Other Implementations Journaling is enabled via tunefs and only 1. Background and Introduction requires a fewspare superblock fields and 16Mb of The soft updates dependencytracking system free blocks for the journal. These minimal require- wasadopted by FreeBSD in 1998 as an alternative to ments makeiteasily enabled on existing FreeBSD the popular journaled-filesystem technique [Ganger & filesystems. The journal’sfilesystem blocks are Patt, 1994; McKusick, Bostic, Karels, & Quarterman, placed in an inode named .sujournal in the root of the 1996]. While the runtime performance and consis- filesystem and filesystem flags are set such that older tencyguarantees of soft updates are comparable to non-journaling kernels will trigger a full filesystem journaled filesystems [Seltzer et al, 2000], it relies on check upon mounting a previously journaled volume. an expensive and time-consuming background filesys- When mounting a journaled filesystem, older kernels tem recovery operation after a crash [McKusick, clear a flag indicating that journaling is being done so 2002]. This paper outlines a method for eliminating that when the filesystem is next encountered by a ker- the necessity of an expensive background or fore- nel that does journaling, it will knowthat that the ground whole-filesystem check operation through the journal is invalid and will ensure that the filesystem is use of a small journal which logs the only twoincon- consistent and clear the journal before resuming use sistencies possible in soft updates. The first is allo- of the filesystem. cated but unreferenced blocks; the second is incor- rectly high link counts. Incorrectly high link counts 3. Journal Format include unreferenced inodes that were being deleted The journal is kept as a circular log of segments and files that were unlinked but open [Ganger,McKu- containing records which describe metadata opera- sick, & Patt, 2000]. This journal allows a journal- tions. If the journal fills, the filesystem must complete analysis program to complete recovery in just a few enough operations to expire journal entries before seconds independent of filesystem size. allowing newoperations. In practice, the journal almost neverfills. 4. Modifications that RequireJournaling Each journal segment contains a unique The next subsections describe the operations sequence number and a timestamp which identifies that must be journaled so that the information needed the filesystem mount instance so old segments can be to clean up the filesystem is available to fsck. discarded during journal processing. Journal entries are aggregated into segments to minimize the number 4.1. Increased Link Count of writes to the journal. Each segment contains the Alink count may be increased through a hard last valid sequence number at the time it was written link or file creation. The link count is temporarily to allow fsck to recoverthe head and tail by scanning increased during a rename. Here, the operation is the the entire journal. Segments are variably sized as same. The inode number,parent inode number,direc- some multiple of the disk block size and are written tory offset, and initial link count are all recorded in atomically to avoid read/modify/write cycles in run- the journal. Soft updates guarantees that the inode ning filesystems. link count will be increased and stable on disk prior to The journal-analysis has been incorporated into anydirectory write. The journal write must occur the fsck program. This incorporation into the existing prior to the inode write that updates the link count and fsck program has several benefits. The existing startup prior to the bitmap write that allocates the inode if it is scripts already call fsck to see if it needs to be run in newly allocated. foreground or background. Forfilesystems running with journaled soft updates, fsck can request to run in 4.2. Decreased Link Count foreground and do the needed journaled operations The inode link count is decreased through before the filesystem is brought online. If the journal unlink or rename. The inode number,parent inode, fails for some reason, it can instead report that a full directory offset, and initial link count are all recorded fsck needs to be run as the traditional fallback. Thus, in the journal. The deleted directory entry is guaran- this newfunctionality can be introduced without any teed to be written before the link is adjusted down. need for system administrators to change the way that As with increasing the link count, the journal write theystart up their systems. Finally,the invoking of must happen prior to all other writes. fsck means that after the journal has been processed, it is possible for debugging purposes to fall through and 4.3. Unlink While Referenced run a complete check of the filesystem to ensure that the journal is working properly. Unlinked yet referenced files pose a unique problem for journaled filesystems. In UNIX, an The journal entry size is 32 bytes, providing inode’sstorage is not reclaimed until after the final quite a dense representation allowing for 16 entries name is removedand the last reference is closed. per-sector.The journal is created in a single area of Simply leaving the journal entry valid while waiting the filesystem in as contiguous an allocation as is for applications to close their dangling references is available. Weconsidered spreading it out across untenable as it will easily exhaust journal space. A cylinder groups to optimize locality for writes but it solution which scales to the total number of inodes in ended up being so small that this approach was not the filesystem is required. At least twoapproaches are practical and would makescanning the entire journal possible, a replication of the inode allocation bitmap, during cleanup too slow. or a linked list of inodes to be freed. We hav e chosen The journal blocks are claimed by a named to use the linked-list approach. immutable inode. This approach allows user-level In the linked-list case, which is employed by access to the journal for debugging and statistics gath- several filesystems (, , etc.), the super-block ering purposes as well as providing backwards com- contains the inode number that serves as the head of a patibility with older kernels that do not support jour- singly linked list of inodes to be freed, with each naling. Wehav e found that a journal size of 16Mb is inode storing a pointer to the next inode in the list. sufficient in eventhe most tortuous and worst-case The advantage of this approach is that at recovery benchmarks. A 16Mb journal can coverover500,000 time you need only examine a single pointer in the namespace operations or 8Gb of outstanding alloca- superblock which will already be in memory.The tions (assuming a standard 16Kb block size). disadvantage is that you must keep an in memory dou- bly-linked list so that you can rapidly remove aninode once it is unreferenced. This approach ingrains a filesystem-wide lock in the design and incurs non- local writes when maintaining the list. In practice we 5.1. Cylinder Group Rollbacks have found that unreferenced inodes occur rarely Soft updates previously did not require anyroll- enough that this approach is not a bottleneck. backs of cylinder groups as theywere always the first Removalfrom the list may be done lazily but or last write in a group of changes. When a block or must be completed prior to anyre-use of the inode. inode has been allocated but its journal record has not Additions to the list must be stable prior to reclaiming yet been written to disk, it is not safe to write the journal space for the final unlink but otherwise may be updated bitmaps and associated allocation informa- delayed long enough to avoid needing the write at all tion. The routines which write blocks with if the file is quickly closed. Addition and removal bmsafemap dependencies nowrollback anyalloca- involveonly a single write to update the preceding tions with unwritten journal operations. pointer to the subsequent inode. 5.2. Inode Rollbacks 4.4. Change of Directory Offset The inode link count must be rolled back to the Anytime a directory compaction movesan link count as it existed prior to anyunwritten journal entry,ajournal entry must be created indicating the entries. Allowing it to growbeyond this count would old and newlocations of the entry.The kernel does not cause filesystem corruption but it would prohibit not knowatthe time of the move whether a remove the journal recovery from adjusting the link count will followit, so at this time all offset changes are properly.Soft updates already prevents the link count journaled. Without this information fsck would be from decreasing before the directory entry is removed unable to disambiguate multiple revisions of the same as a premature decrement could cause filesystem cor- directory block. ruption. When an unlinked file has been closed, its inode 4.5. Block Allocation and Free cannot be returned to the inode freelist until its zeroed When performing either block allocation or block pointers have been written to disk so that its free, whether it is a fragment, indirect block, directory blocks can be freed and it has been removedfrom the block, direct block, or extended attributes the record is on-disk list of unlinked files. The unlinked-file inode the same. The inode number of the file and the offset is not completely removedfrom the list of unlinked of the block within the file is recorded using negatives files until the next pointer of the inode that precedes it for indirects and extents as is done with ‘‘getblk’’. in the list has been updated on disk to point to the Additionally,the disk block address and number of inode that follows it on the list. If the unlinked-file fragments is included in the journal record. The jour- inode is the first inode on the list of unlinked files, nal entry must be written to disk prior to anyalloca- then it is not completely removedfrom the list of tion or free. unlinked files until the head-of-unlinked-files pointer When freeing an indirect only the root of the in the superblock has been updated on disk to point to indirect tree is logged. Thus, for truncation we need a the inode that follows it on the list. maximum of 15 journal entries, 12 for direct blocks and 3 for indirects. These 15 journal entries allowus 5.3. Reclaiming Journal Space to free a large amount of space with a minimum of To reclaim journal space from previously writ- journaling overhead. During recovery, fsck will fol- ten records, the kernel must knowthat the operation lowindirect blocks and free anydescendants includ- the journal record describes is stable on disk. This ing other indirects. Forthis algorithm to work, the requirement means that when a newfile is created, the contents of the indirect block must remain valid until journal record cannot be freed until writes are com- the journal record is free so that user data is not con- pleted for a cylinder group bitmap, an inode, a direc- fused with indirect pointers. tory block, a directory inode, and possibly some num- ber of indirect blocks. When a newblock is allocated, 5. Additional Requirements of Journaling the journal record cannot be freed until writes are Some operations that had not previously completed for the newblock pointer in the inode or required tracking under soft updates need to be indirect, the cylinder group bitmap, and the block tracked when journaling is introduced. This section itself. Blocks pointers within indirects are not stable describes these newrequirements. until all parent indirects are fully reachable on disk via the inode indirect pointers. To facilitate fulfill- ment of these requirements, the dependencies that describe these operations carry pointers to the oldest 6. The Recovery Process segment structure in the journal containing journal The next subsections describe the use of the entries that describe outstanding operations. journal by fsck to clean up the filesystem after a crash. Some operations may be described by multiple entries. For example, when making a newdirectory, 6.1. Scanning the Journal its addition creates three newnames. Each of these To dorecovery,the fsck program must first scan names is associated with a reference count on the the journal from start to end to discoverthe oldest inode to which the name refers. When one of these valid sequence number.Wecontemplated keeping dependencies is satisfied, it may pass its journal entry journal head and tail pointers, but that would require reference to another dependencyifanother operation extra writes to the superblock area. Because the jour- on which the journal entry depends is not yet com- nal is small, the extra time spent scanning it to iden- plete. If the operation is complete, the final reference tify the head and tail of the valid journal seemed a rea- on the journal record is released. When all references sonable tradeofftoreduce the run-time cost of main- to journal records in a journal segment are released, taining the journal. So, the fsck program must dis- its space is reclaimed and the oldest valid segment coverthe first segment containing a still valid sequence number is adjusted. We can only release the sequence number and work from there. Journal oldest free journal segment, since the journal is records are then resolved in order.Journal records are treated as a circular queue. marked with a timestamp that must match the filesys- tem mount time as well as a CRC to protect the valid- 5.4. Handling aFull Journal ity of the contents. If the journal everbecomes full, we must pre- vent anynew journal entries from being created until 6.2. Adjusting Link Counts more space becomes available from the retirement of Foreach journal record recording a link the oldest valid entries. Avery effective way to stop increase, fsck needs to examine the directory at the the creation of newjournal records is to suspend the offset provided and see whether the directory entry for filesystem using the mechanism in place for taking the indicated inode number exists on disk. If it does snapshots. Once suspended, existing operations on not exist, but the inode link count was increased, then the filesystem are permitted to complete, but new the recorded link count needs to be decremented. operations that wish to modify the filesystem are put to sleep until the suspension is lifted. Foreach journal record recording a link decrease, fsck needs to examine the directory at the We doacheck for journal space before each offset provided and see whether the directory entry for operation that will change a link count or allocate a the indicated inode number exists on disk. If it has block. If we find that the journal is approaching a full been deleted on disk, but the inode link count has not condition, we suspend the filesystem and expedite the been decremented, then the recorded link count needs progress on the soft-updates work-list processing to to be decremented. speed the rate at which journal entries get retired. As the operation that did the check has already started, it Compaction of directory offsets for entries that is permitted to finish, but future operations are are being tracked complicates the link adjustment blocked. Thus, operations must be suspended while scheme presented above.Since directory blocks are there is still enough journal space to complete opera- not written synchronously, fsck must look up each tions already in progress. When enough journal directory entry in all its possible locations. entries have been freed, the filesystem suspension is When an inode is added and removedfrom a lifted and normal operations resume. directory multiple times fsck is not be able to correctly In practice, we had to create a minimal sized assess the link count giventhe algorithm presented journal (4Mb) and run scripts designed to create huge above.The chosen solution is to pre-process the jour- numbers of link-count changes, block allocations, and nal and link all entries related to the same inode block frees to trigger the journal-full condition. Even together.Inthis way,all operations not known to be under these tests, the filesystem suspensions were committed to the disk can be examined concurrently infrequent and very brief lasting under a second. to determine howmanylinks should exist relative to the known stable count that existed prior to the first journal entry.Duplicate records that occur when an inode is added and deleted at the same offset many times are discarded, resulting in a coherent count. 6.3. Updating the Allocated Inode Map the time to run fsck after system crashes. Once the link counts have been adjusted, fsck The primary purpose of the journaling project must free anyinodes whose link count has fallen to wastoeliminate long filesystem check times. A zero. In addition, fsck must free anyinodes that were 40TB volume may takeanentire day and a consider- unlinked but still in use at the time that the system able amount of memory to check. We hav e run sev- crashed. The head of the list of unreferenced inode is eral scenarios to understand and validate the recovery in the superblock as described in section 4.3. The fsck time. program must traverse this list of unlinked inodes and Arelatively normal case for developers is to run free them. aparallel buildworld. Crash recovery from this case The first step in freeing an inode is to add all of demonstrates time to recoverfrom moderate write its blocks to list of blocks that need to be freed. Next workload. A 250GB disk was filled to 80% with the inode needs to be zero’ed to showthat it is not in copies of the FreeBSD source tree. One copywas use. Finally,the inode bitmap in its cylinder group selected at random and an 8 way buildworld pro- must be updated to reflect that it is available and all ceeded for 10 minutes before the box was reset. the appropriate filesystem statistics updated to reflect Recovery from the journal took 0.9 seconds. An addi- its availability. tional run with traditional fsck wasused to verify the safe recovery of the filesystem. The fsck took approx- 6.4. Updating the Allocated Block Map imately 27 minutes, or 1800 times as long. Once the journal has been scanned, it provides a Atesting volunteer with a 92% full 11TB vol- list of blocks that were intended to be freed. The jour- ume spanning 14 drivesona3ware RAID controller nal entry lists the inode from which the block was to generated hundreds of megabytes of dirty data by be freed. Forrecovery, fsck processes each free record writing random length files in parallel before resetting by checking to see if the block is still claimed by its the machine. The resulting recovery operation took associated inode. If it finds that the block is no longer less than one minute to complete. Anormal fsck run claimed, it is freed. takes approximately 10 hours on this filesystem. Foreach block that is freed either by the deallo- cation of an inode, or through the identification 8. FutureWork process described above,the block bitmap in its cylin- The next subsection describes some areas that der group must be updated to reflect that it is available we have not yet explored that may give further perfor- and all the appropriate filesystem statistics updated to mance improvements to our implementation. reflect its availability.When a fragment is freed, the fragment availability statistics must also be updated. 8.1. Rollback of Directory Deletions Doing a rollback of a directory addition is easy. 7. Performance The newdirectory entry has its inode number set to Journaling adds extra running time and memory zero to indicate that it is not really allocated. How- allocations to the traditional soft-updates requirements ev er, rollback of directory deletions is much more dif- and also additional I/O operations to write the journal. ficult as the space may have been claimed by a new The overhead of the extra running time and memory allocation. There are times when being able to roll allocations was immeasurable in the benchmarks that back a directory deletion would be very convenient. we ran. The extra I/O was mostly evident in the Forexample, preventing the removalofanold name increased delay for individual operations to complete. prior to a newname reaching the disk when a file is Operation completion time is usually only evident to renamed. Here, we have considered using a distin- an application when it does an ‘‘fsync’’system call guished inode number that the filesystem internally which causes it to wait for the file to reach the disk. would recognize as being in use, but that would not be Otherwise, the extra I/O to the journal only becomes returned to the user application. However, atpresent evident in benchmarks that are limited by the filesys- we cannot rollback deletes, which requires anydelete tem’sI/O bandwidth before journaling is enabled. In journaling to be written to disk prior to the writing of summary,asystem running with journaled soft affected directory blocks. updates will neverrun faster than one running soft updates without journaling. So, systems with small 8.2. Truncate and Weaker Guarantees filesystems such as an embedded system will usually As a potential optimization, the ‘‘truncate’’sys- want to run soft updates without journaling and take tem call may choose to instead record the intended file size and operate more lazily,relying on the log to operations associated with the records in that segment recoverany partially completed operations correctly. have been committed to disk, the jseg structure allows This approach also allows us to do partial truncations its segment to be freed. If its segment is the oldest asynchronously.Further,the journal allows for the valid segment, that segment as well as anyunused weakening of other soft dependencyguarantees segments that followitare returned for the use of although we have not yet been fully explored these future journal entries. reduced guarantees and do knowknowiftheyprovide A jsegdep structure tracks the validity of a writ- anyreal benefit. ten journal record. When all the record’sassociated dependencies have been written to disk, the jseg that 9. New Data Structures tracks the segment in which it is contained is notified Forthose with an interest in the details of the that it is no longer needed. the implementation, this section catalogs the data A jaddref structure tracks a newreference (link structures that have been added to the soft updates count) on an inode and prevents the link count implementation to support the journaling. increase and bitmap allocation until a journal entry has been written. 9.1. New Dependency Structures A jremref structure tracks a removedreference The following structures have been added to the (unlink) on an inode and prevents the directory standard soft updates structures to support the journal- remove from proceeding until the journal entry is ing. The first three records fulfill similar roles to the written. existing soft updates structures as theytrack when their filesystem resource has been written to disk, then A jmvref structure tracks name relocations trigger another step in the filesystem operation that within a directory block that occur as a result of direc- theyare tracking. tory compaction. This information is used by the recovery code to updated the expected offsets for A freework structure handles the release of a added and removednames. The jmvref prevents the tree of blocks or a single block. Each indirect block directory directory block in which the compaction in a tree is allocated its own freework structure. Each occurred from being written to disk until the journal indirect block may be freed only when all its children entry is written. have been freed. Thus, we enforce the rule that an allocated block must have a valid path to a root that is A jnewblk structure tracks a newly allocated journaled. block or fragment and prevents the direct or indirect block pointer as well as the cylinder-group bitmap A freedep structure tracks the completion of a from being written until it is written to the journal. bitmap write for a freework.One freedep may cover manyfreed blocks so long as theyreside in the same A jfreeblk structure tracks the journal write for cylinder group. When the cylinder group is written, freeing a block or tree of blocks. The block pointer the freedep decrements the reference count on the cannot be cleared in the inode or indirect prior to the freework which is freed when its reference count jfreeblk journal entry being written. reaches zero. A jfreefrag structure tracks the freeing of a sin- A sbdep structure tracks the writing of the gle block when a fragment is extended or an indirect superblock that contains the head of the list of inodes page is replaced. It is only needed if the fragment is whose names have been deleted, but are still being not part of a larger freeblks operation. The block held open by a process. This sbdep structure ensures pointer cannot be cleared in the inode or indirect prior that the superblock is always pointing at the first pos- to the jfreefrag journal entry being written. sible unlinked inode for the recovery process. A jtrunc structure journals the intent to truncate The remaining nine newdependencystructures an inode to a non-zero value. The jtrunc record must are used to track writes to the journal. Typically they be written to the journal prior to the synchronous par- will prevent updates to filesystem data structures until tial truncation process. The associated jsegdep that their tracked journal entry has been written to the tracks the jtrunc is not released until the truncation is disk. Theyare identified with a leading ‘‘j’’intheir complete and the truncated inode has been written to name. disk. A jseg structure tracks the records within a jour- nal segment. A segment contains all the journal records written in a single disk write. When all of the 9.2. Types of Journal Records did his graduate work at the University of California The following structures all exist on-disk within at Berkeley, where he receivedmaster’sdegrees in the journal file. Each structure is a uniform size, 32 computer science and business administration and a bytes, which simplifies journal processing. Each jour- doctoral degree in computer science. He has twice nal record has an opcode that can further refine its been president of the board of the Usenix Association, operation. Records with more than one opcode have is currently a member of the editorial board of ACM’s their opcode noted below. Queue magazine, and is a member of the Usenix Association and ACM, and is a senior member of the Every 512 byte disk block starts with a jsegrec IEEE. record that may describe more than one block of jour- nal entries. The jsegrec contains a 64-bit sequence In his spare time, he enjoys swimming, scuba number and the oldest valid sequence number as diving, and wine collecting. The wine is stored in a described in section 3. It also has a count of valid specially constructed wine cellar (accessible from the records and blocks along with a timestamp that identi- web at http://www.mckusick.com/˜mckusick/) in the fies the mount instance. basement of the house that he shares with Eric All- man, his domestic partner of 30-and-some-odd years. A jrefrec record uniquely describes a single link Youcan contact him via email at . (opcode of JOP_REMREF). If the link is transitioning to or awayfrom zero, it also affects the allocation bit- map. It contains the inode number,parent inode num- JeffRoberson is a consultant who livesonthe ber,directory offset, starting link count, and file mode. island of Maui in the Hawai’ian island chain. When he is not cycling, hiking, or otherwise enjoying the A jmvrec record describes a relocated directory island, he gets paid to improve FreeBSD.Heispartic- entry when its containing directory block is com- ularly interested in problems facing server installa- pacted by the kernel. It contains an inode number, tions and has worked on areas as varied as the kernel parent inode number,old directory offset and new memory allocator,thread scheduler,filesystems inter- directory offset. faces, and network packet storage among others. You A jblkrec record describes either an allocation can contact him via email at . JOP_FREEBLK)ofablock. It contains an inode num- ber,logical block number,physical block number,and References frag count. Negative logical numbers indicate Ganger,McKusick, & Patt, 2000. extended attributes or indirect blocks. G. Ganger,M.McKusick, & Y.Patt, “Soft Updates: A jtrncrec record is used only for partial trunca- ASolution to the Metadata Update Problem in tion where the recovery process must evaluate the cur- Filesystems,” ACMTransactions on Computer Sys- tems 18(2), p. 127−153 (May 2000). rent size of the inode and complete the truncation. It contains an inode number,desired file size, and Ganger & Patt, 1994. G. Ganger & Y.Patt, “Metadata Update Performance desired external attribute size. A jtrncrec record is not in File Systems,” USENIX Symposium on Operating used when truncating to zero. Rather,all direct blocks Systems Design and Implementation, p. 49−60 and root indirects are logged as frees and the inode (November 1994). pointers are written to zero so that it may all be done McKusick, Bostic, Karels, & Quarterman, 1996. asynchronously. M. McKusick, K. Bostic, M. Karels, & J. Quarter- man, The Design and Implementation of the 4.4BSD , p. 269−271, Addison WesleyPub- 10. Biographies lishing Company, Reading, MA (1996). Dr.Marshall Kirk McKusick writes books and McKusick, 2002. articles, consults, and teaches classes on UNIX- and M. K. McKusick, “Running Fsck in the Back- BSD-related subjects. While at the University of Cal- ground,” Proceedings of the BSDCon 2002 Confer- ifornia at Berkeley, heimplemented the 4.2BSD fast ence, pp. 55-64 (February 2002). filesystem, and was the Research Computer Scientist Seltzer et al, 2000. at the BerkeleyComputer Systems Research Group M. Seltzer,G.Ganger,M.K.McKusick, K. Smith, C. Soules, & C. Stein, “Journaling versus Soft (CSRG) overseeing the development and release of Updates: Asynchronous Meta-data Protection in File 4.3BSD and 4.4BSD. His particular area of interest is Systems,” Proceedings of the San Diego Usenix Con- the filesystem. He earned his undergraduate degree in ference, pp. 71-84 (June 2000). Electrical Engineering from Cornell University,and