HighLight: Using a Log-structured File System for Tertiary Storage Management†

John T. Kohl
University of California, Berkeley and Digital Equipment Corporation

Carl Staelin
Hewlett-Packard Laboratories

Michael Stonebraker
University of California, Berkeley

November 20, 1992

Abstract

Robotic storage devices offer huge storage capacity at a low cost per byte, but with large access times. Integrating these devices into the storage hierarchy presents a challenge to file system designers. Log-structured file systems (LFSs) were developed to reduce latencies involved in accessing disk devices, but their sequential write patterns match well with tertiary storage characteristics. Unfortunately, existing versions only manage memory caches and disks, and do not support a broader storage hierarchy.

HighLight extends 4.4BSD LFS to incorporate both secondary storage devices (disks) and tertiary storage devices (such as robotic tape jukeboxes), providing a hierarchy within the file system that does not require any application support. This paper presents the design of HighLight, proposes various policies for automatic migration of file data between the hierarchy levels, and presents initial migration mechanism performance figures.

This research was sponsored in part by the University of California and Digital Equipment Corporation under Digital's flagship research project "Sequoia 2000: Large Capacity Object Servers to Support Global Change Research." Other industrial and government partners include the California Department of Water Resources, United States Geological Survey, MCI, ESL, Hewlett Packard, RSI, SAIC, PictureTel, Metrum Information Storage, and Hughes Aircraft Corporation. This work was also supported in part by Digital Equipment Corporation's Graduate Engineering Education Program.

† Permission has been granted by the USENIX Association to reprint the above article. This article was originally published in the USENIX Association Conference Proceedings, January 1993. Copyright © USENIX Association, 1993.

1. Introduction

HighLight combines both conventional disk secondary storage and robotic tertiary storage into a single file system. It builds upon the 4.4BSD LFS [10], which derives directly from the Sprite Log-structured File System (LFS) [9], developed at the University of California at Berkeley by Mendel Rosenblum and John Ousterhout as part of the Sprite operating system. LFS is optimized for writing data, whereas most file systems (e.g. the BSD Fast File System [4]) are optimized for reading data. LFS divides the disk into 512KB or 1MB segments, and writes data sequentially within each segment. The segments are threaded together to form a log, so recovery is quick; it entails a roll-forward of the log from the last checkpoint. Disk space is reclaimed by copying valid data from dirty segments to the tail of the log and marking the emptied segments as clean.

Since log-structured file systems are optimized for write performance, they are a good match for the write-dominated environment of archival storage. However, system performance will depend on optimizing read performance, since LFS already optimizes write performance. Therefore, migration policies and mechanisms should arrange the data on tertiary storage to improve read performance.

HighLight was developed to provide a data storage file system for use by Sequoia researchers. Project Sequoia 2000 [14] is a collaborative project between computer scientists and earth science researchers to develop the necessary support structure to enable global change research on a larger scale than current systems can support. HighLight is one of several file management avenues under exploration as a supporting technology for this research. Other storage management efforts include the Inversion support in the POSTGRES database system [7] and the Jaquith manual archive system [6] (which was developed for other uses, but is under consideration for Sequoia's use).

The bulk of the on-line storage for Sequoia will be provided by a 600-cartridge Metrum robotic tape unit; each cartridge has a capacity of 14.5 gigabytes, for a total of nearly 9 terabytes. We also expect to have a collection of smaller robotic tertiary devices (such as the Hewlett-Packard 6300 magneto-optic changer). HighLight will have exclusive rights to some portion of the tertiary storage space.

HighLight is currently running in our laboratory, with a simple automated file-oriented migration policy as well as a manual migration tool. HighLight can migrate files to tertiary storage and automatically fetch them again from tertiary storage into the cache to enable application access.

The remainder of this paper presents HighLight's mechanisms and some preliminary performance measurements, and speculates on some useful migration policies. We begin with a thumb-nail sketch of the basic Log-structured file system, followed by a discussion of our basic storage and migration model and a comparison with existing related work in policy and mechanism design. We continue with a brief discussion of potential migration policies and a description of HighLight's architecture. We present some preliminary measurements of our system performance, and conclude with a summary and directions for future work.

2. LFS Primer

The primary characteristic of LFS is that all data are stored in a segmented log. The storage consists of large contiguous spaces called segments which may be threaded together to form a linear log. New data are appended to the log, and periodically the system checkpoints the state of the system. During recovery the system will roll forward from the last checkpoint, using the information in the log to recover the state of the file system at failure. Obviously, as data are deleted or replaced, the log contains blocks of invalid or obsolete data, and the system must coalesce this wasted space to generate new, empty segments for the log.

4.4BSD LFS shares much of its implementation with the Berkeley Fast File System (FFS) [4]. It has two auxiliary data structures not found in FFS: the segment summary table and the inode map. The segment summary table contains information describing the state of each segment in the file system. Some of this information is necessary for correct operation of the file system, such as whether the segment is clean or dirty, while other information is used to improve the performance of the cleaner, such as the number of live data bytes in the segment. The inode map contains the current disk address of each file's inode, as well as some auxiliary information used for file system bookkeeping. In 4.4BSD LFS, both the inode map and the segment summary table are contained in a regular file, called the ifile.

When reading files, the only difference between LFS and FFS is that the inode's location is variable. Once the system has found the inode (by indexing the inode map), LFS reads occur in the same fashion as FFS reads, by following direct and indirect block pointers.1

1 In fact, LFS and FFS share this indirection code in 4.4BSD.

When writing, LFS and FFS differ substantially. In FFS, each logical block within a file is assigned a location upon allocation, and each subsequent operation (read or write) is directed to that location. In LFS, data are written to the tail of the log each time they are modified, so their location changes. This requires that their index structures (indirect blocks, inodes, inode map entries, etc.) be updated to reflect their new location, so these index structures are also appended to the log.

In order to provide the system with a ready supply of empty segments for the log, a user-level process called the cleaner garbage-collects free space from dirty segments. The cleaner selects one or more dirty segments to be cleaned, appends all valid data from those segments to the tail of the log, and then marks those segments clean. The cleaner communicates with the file system by reading the ifile and calling a handful of LFS-specific system calls. Making the cleaner a user-level process simplifies the adjustment of cleaning policies.

For recovery purposes the file system takes periodic checkpoints. During a checkpoint the address of the most recent ifile inode is stored in the superblock so that the recovery agent may find it. During recovery the threaded log is used to roll forward from the last checkpoint. Each segment of the log may contain several partial segments. A partial segment is considered an atomic update to the log, and is headed by a segment summary cataloging its contents. The summary also includes a checksum to verify that the entire partial segment is intact on disk and provide an assurance of atomicity. During recovery, the system scans the log, examining each partial segment in sequence.
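The partial-segment summary is easiest to picture as a small header structure carrying checksums, counts of the blocks it describes, and the link to the next segment in the threaded log. The sketch below is illustrative only; the field names are modeled loosely on the 4.4BSD LFS summary, and the real on-disk layout should not be inferred from it.

```c
/*
 * Illustrative sketch of the bookkeeping a partial-segment summary
 * carries: checksums for atomicity, counts of the blocks it describes,
 * and the link to the next segment in the threaded log.  Not the actual
 * 4.4BSD LFS on-disk format.
 */
#include <stdint.h>

struct ps_summary {
    uint32_t ss_sumsum;   /* checksum of the summary block itself */
    uint32_t ss_datasum;  /* checksum over the partial segment's data */
    uint32_t ss_nfinfo;   /* number of file-block records that follow */
    uint32_t ss_ninos;    /* number of inode blocks in this partial segment */
    uint32_t ss_next;     /* next segment in the threaded log */
    uint32_t ss_create;   /* creation time, checked during roll-forward */
};
```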

Figure 1 shows the on-disk data structures of 4.4BSD LFS. The on-disk data space is divided into segments. Each segment has a summary of its state (whether it is clean, dirty, or active). Dirty segments contain live data (data which are still accessible to a user, i.e. not yet deleted or replaced). At the start of each segment there is a summary block describing the data contained within the segment and pointing to the next segment in the threaded log. In Figure 1 we have shown three segments, numbered 0, 1, and 2. Segment 0 contains the current tail of the log. New data are being written to this segment, so it is both active and dirty. Once Segment 0 fills up, the system will begin writing to Segment 1, which is currently clean and empty. Segment 2 was written just before Segment 0; it is dirty and contains live data.

[Figure 1: LFS data layout, showing per-segment state (d = dirty, c = clean, a = active) and log contents (s = summary block, i = inode block, remaining blocks = data).]

3. Storage and Migration Model

HighLight has a "disk farm" to provide rapid access to file data, and one or more tertiary storage devices to provide vast storage. It manages the storage and the migration between the two levels. The basic storage and migration model is illustrated in Figure 2.

[Figure 2: The storage hierarchy. New data and reads go to the disk farm; segments migrate to the tertiary jukebox(es) and are cached back on disk when referenced.]

HighLight has a great deal of flexibility, allowing arbitrary data blocks, directories, indirect blocks, and inodes to migrate to tertiary storage at any time. It uses the basic LFS layout to manage the on-disk storage and applies a variant of the cleaning mechanism to provide the migration mechanism. A natural consequence of this layout is the use of LFS segments for the tertiary-resident data representation. By migrating segments, it is possible to migrate some data blocks of a file while allowing others to remain on disk if a file's blocks span more than one segment.

Data begin life on the "disk farm" when they are created. A file (or part of it) may eventually migrate to tertiary storage according to a migration policy. The to-be-migrated data are moved to an LFS segment in a staging area, using a mechanism much like the cleaner's normal segment reclamation. When a staging segment is filled, it is written to tertiary storage as a unit.

When tertiary-resident data are referenced, their containing segment(s) are fetched into the disk cache. These read-only cached segments share the disk with active non-tertiary segments. Figure 3 shows a sample tertiary-resident segment cached in a disk segment. Data in cached tertiary-resident segments are not modified in place on disk; rather, any changes are appended to the LFS log in the normal fashion. Since cached segments never contain the sole copy of a block, they may be flushed from the cache at any time if the space is needed for other cache segments or for new data.
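A minimal sketch of the staging idea described above, assuming 1MB segments and 4KB blocks: policy-selected blocks are packed into a staging segment, which is shipped to tertiary storage only when full. The structures and the write hook are illustrative assumptions, not HighLight code.

```c
/*
 * Sketch only: pack policy-selected blocks into a 1MB staging segment and
 * write the segment to tertiary storage as a unit once it fills.
 */
#include <string.h>

#define BLOCK_SIZE   4096
#define SEG_BLOCKS   256                  /* 1MB segment / 4KB blocks */

struct staging_segment {
    unsigned char data[SEG_BLOCKS][BLOCK_SIZE];
    int           nblocks;                /* how many slots are filled */
};

/* Supplied elsewhere: ships a full segment to the tertiary device. */
extern void write_segment_to_tertiary(const struct staging_segment *seg);

/* Add one block; flush the segment to tertiary storage when it fills. */
void
stage_block(struct staging_segment *seg, const unsigned char block[BLOCK_SIZE])
{
    memcpy(seg->data[seg->nblocks], block, BLOCK_SIZE);
    if (++seg->nblocks == SEG_BLOCKS) {
        write_segment_to_tertiary(seg);
        seg->nblocks = 0;
    }
}
```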

[Figure 3: HighLight data layout, showing the secondary (disk) log with one segment (state C) caching a tertiary-resident segment, and the tertiary medium's own log of segments. State key: d = dirty, c = clean, a = active, C = cached; block key: s = summary, i = i-node, remainder = data.]

3.1. Related work

Some previous studies have considered automatic migration mechanisms and policies for tertiary storage management. Strange [16] develops a migration model based on a daily "clean up" computation which migrates candidate files to tertiary storage once a day, based on the next day's projected need for consumable secondary storage space. While Strange provides some insight on possible policies, we prefer not to require a large periodic migration run (our eventual user base will likely span many time zones, so there may not be any good "dead time" during which to process migration needs); instead we require the ability to run in continuous operation.

Unlike whole-file migration schemes such as Strange's or UniTree's [2], we want to allow migration of portions of files rather than whole files. Our partial-file migration mechanism can support whole-file migration, if desired for a particular policy. We also desire to allow file system metadata, such as directories, inode blocks or indirect pointer blocks, to migrate to tertiary storage.2

2 A back-of-the-envelope calculation suggested by Ethan Miller shows why: assuming 200MB files and a 4K block size, we have an overhead of about 0.1% (200K) for indirect pointer blocks using the FFS indirection scheme. A 10TB storage area then requires 10GB of indirect block storage. Why not use this 10GB for cache area instead of wasting it on indirect blocks of files that lay fallow?

A final reason why existing systems may not be applicable to Sequoia's needs lies with the expected access patterns. Smith [11, 12] studied file references based mostly on editing tasks; Strange [16] studied a networked workstation environment used for software development in a university environment. Unfortunately, those results may not be directly applicable to our environment, since we expect Sequoia's file system references to be generated by database, simulation, image processing, visualization, and other I/O-intensive processes [14]. In particular, the database reference patterns will be query-dependent, and will most likely be random accesses within a file rather than sequential access.

Our migration scheme is most similar to that described by Quinlan [8] for the Plan 9 file system. He provides a disk cache as a front for a WORM device which stores all permanent data. When file data are created, their tertiary addresses are assigned but the data are only written to the cache; a nightly conversion process copies that day's fresh blocks to the WORM device. A byproduct of this operation is the ability to "time travel" to a snapshot of the filesystem at the time of each nightly conversion. Unlike that implementation, however, we do not wish to be tied to a single tertiary device and its characteristics (we may wish to reclaim tertiary storage), nor do we provide time travel. Instead we generalize the 4.4BSD LFS structure to enable migration to and from any tertiary device with sufficient capacity and features.

The key combination of features which we provide is: the ability to migrate all file system data (not just file contents); tertiary placement decisions made at migration time, not file creation time; data migration in units of LFS segments; migration performed by user-level processes; and migration policy implemented by a user-level process.

4. Migration Policies

Because HighLight includes a storage hierarchy, it must move data up and down the hierarchy. Migration policies may be considered in two parts: writing to tertiary storage, and caching from tertiary storage.

Before describing our migration policies, we must first state our initial assumptions regarding file access patterns, which are based on previous analyses of systems [5, 16, 11]. Our basic assumptions are that file access patterns are skewed, and that most archived data are never re-read. However, some archived data will be accessed, and once archived data become active again, they will be accessed many times before becoming inactive again.

Since HighLight optimizes writes by virtue of its logging mechanism, migration policies must be aimed at improving read performance. When data resident on tertiary storage are cached on secondary storage and read, the migration policy should have optimized the layout so that these read operations are as inexpensive as possible. There needs to be a tight coupling between the cache fetch and migration policies.

HighLight has one primary tool for optimizing tertiary read performance: segment reads. When data are read from tertiary storage, a whole 1MB segment (the equivalent of a cache line in processor caches) is fetched and placed in the segment cache, so that additional accesses to data within the segment proceed at disk access rates. Policies used with HighLight should endeavor to cluster "related" data in a segment to improve read performance. The determination of whether data are "related" depends on the particular policy in use. If related data will not fit in one segment, then their layout on tertiary storage should be arranged to admit a simple prefetch mechanism to reduce further latencies.

Given perfect predictions, policies should migrate the data which provide the best benefit to performance (which could mean something like migrating files which will never again be referenced, or will be referenced only after all other files in the cache). Without perfect knowledge, however, migration policies need to estimate the benefits of migrating a file or set of files. We speculate below on some policies that we will evaluate with HighLight.

All the possible policy components discussed below require some additional mechanism support beyond that provided by the basic 4.4BSD LFS. They require some basic migration bookkeeping and data transfer mechanisms, which are described in the next section.

4.1. File Size-based rankings

Earlier studies [3, 12] conclude that file size alone does not work well for selecting files as migration candidates; they recommend using a space-time product (STP) ranking metric (time since last access, raised to a small power, times file size). Strange [16] evaluated different variations on the STP scheme for a typical networked workstation configuration. Those three evaluations considered different environments, but generally agreed on the space-time product as a good metric. Whether these results still work well in the Sequoia environment is something we will evaluate with HighLight.

The space-time product metric has only modest requirements on the mechanisms, needing only the file attributes (available from the base LFS) and a whole-file migration mechanism.
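As a concrete illustration of the STP ranking, the sketch below computes a rank from the attributes available through stat(2). The exponent value and the use of st_atime are assumptions for illustration, not HighLight's policy code.

```c
/*
 * Sketch of a space-time-product (STP) ranking for migration candidates:
 * rank = (seconds since last access)^P * (file size in bytes).
 * The exponent and the choice of st_atime/st_size are illustrative.
 */
#include <math.h>
#include <sys/stat.h>
#include <time.h>

#define STP_EXPONENT 1.5   /* the "small power" applied to the file's age */

double
stp_rank(const struct stat *sb, time_t now)
{
    double age = difftime(now, sb->st_atime);

    if (age < 0)
        age = 0;
    /* Larger, colder files get higher ranks and migrate first. */
    return pow(age, STP_EXPONENT) * (double)sb->st_size;
}
```

A migrator could sort its candidate files by this rank and migrate from the top until enough secondary space is reclaimed.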

4.2. Choosing block ranges

In the simplest policies, HighLight could use whole-file migration, with mechanism support based on file access and modification times contained in the inode. However, in some environments whole-file migration may be inadequate. In UNIX-like distributed file system environments, most files are accessed sequentially and many of those are read completely [1]. We expect scientific application checkpoints to be read completely and sequentially. In these cases, whole-file migration makes sense. However, database files tend to be large, may be accessed randomly and incompletely (depending on the application's queries), and in some systems are never overwritten [13]. Consequently, block-based information is useful, since old, unreferenced data may migrate to tertiary storage while active data remain on secondary storage.

In order to provide migration on a finer grain than whole files, HighLight must keep some information on each disk-resident data block in order to assist the migration decisions. Keeping information for each block on disk would be exorbitantly expensive in terms of space, and often unnecessary. It seems likely that tracking access at a finer grain than whole files can yield a benefit in terms of working set size. Such tracking requires a fair amount of support from the mechanism: access to the sequential block-range information, which implies mechanism-supplied and updated records of file access sequentiality. We do not yet have a clear implementation strategy for this policy.

4.3. Namespace Locality

When dealing with a collection of small files, it will be more efficient to migrate several related files at once. We can use a file namespace to identify these collections of "related" files, and migrate directory trees or sub-trees to tertiary storage together. This is useful primarily in an environment where whole subtrees are related and accessed at nearly the same time, such as software development environments. Such a tree could be considered in the aggregate as a file for purposes of applying a migration metric (such as STP).

Assuming such a tree is too large for a single tertiary segment, a natural prefetch policy on a cache miss is to load the missed segment and prefetch the remaining segments of the tree cluster.

The primary additional requirement of this policy is a way to examine file system trees without disturbing the access times; this is possible to do with a user program, since BSD filesystems do not update directory access times on normal directory accesses, and file inodes may be examined without modification.

4.4. Rewriting Cached Segments

It may be the case that data access patterns to tertiary-backed storage will change over time (for example, if several satellite-collected data sets are loaded independently, and then those data sets are analyzed together). Performance may be boosted in such cases by reorganizing the data layout on tertiary storage to reflect the most prevalent access pattern(s), perhaps moving segments to different tertiary media with access characteristics more suited to those segments. This reorganization can be accomplished by writing cached segments to a new storage location on the tertiary device while the segment is in the cache.

Implicit in this scheme is the need to choose which cached segments should be rewritten to a new location on tertiary storage. All of the questions appropriate to migrating data in the first place are appropriate here, so the overhead involved might be significant (and might be an impediment if cache flushes need to be fast reclaims).

This policy will require additional identifying information on each cache segment to indicate locality of reference patterns between segments. Such information could be a segment fetch timestamp or the user-id or process-id responsible for a fetch. It could be maintained by the process servicing demand fetch requests and shared with the migrator.

4.5. Supporting migration policies

To summarize, we can envision uses for (at least) the following mechanism features in an implementation:

- Basic migration bookkeeping (cache lookup control, data movement, etc.)

- Whole-file migration

- Migration of directories and metadata

- Grouping of files by some criterion (namespace)

- Cache fill timestamps/uid/pid

- Sequential block-range data (per-file)

The next section presents the design and implementation of HighLight, which covers many (but not all) of these desired features.

5. HighLight Design and Implementation

In order to provide "on-line" access to a large data storage capacity, HighLight manages secondary and tertiary storage within the framework of a unified file system based on the 4.4BSD LFS. Our discussion here covers HighLight's basic components, block addressing scheme, secondary and tertiary storage organizations, migration mechanism, and implementation details.

5.1. Components

HighLight extends 4.4BSD LFS by adding several new software components:

- A second cleaner, called the migrator, which collects data for migration from secondary to tertiary storage

- A disk-resident segment cache to hold read-only copies of tertiary-resident segments and request I/O from the user-level processes, implemented as a pseudo-disk driver

- A pseudo-disk driver which stripes multiple devices into a single logical drive

- A pair of user-level processes (the service process and the I/O process) to access the tertiary storage devices on behalf of the kernel

Besides adding these new components, HighLight slightly modifies various portions of the user-level and kernel-level 4.4BSD LFS implementation (such as changing the minimum allocatable block size, adding conditional code based on whether segments are secondary or tertiary storage resident, etc.).

5.2. Basic operation

HighLight implements the normal filesystem operations expected by the 4.4BSD file system switch. When a file is accessed, HighLight fetches the necessary metadata and file data based on the traditional FFS inode's direct and indirect 32-bit block pointers. The block address space appears uniform, so HighLight just passes the block number to its I/O device driver. The device driver maps the block number to whichever physical device stores the block (a disk, an on-disk cached copy of the block, or a tertiary medium).

The migrator process periodically examines the collection of on-disk file blocks, and decides (based upon some policy) which file data blocks and/or metadata blocks should be migrated to a tertiary medium. Those blocks are then assembled in a "staging segment" addressed by new block numbers assigned to a tertiary medium. The staging segment is assembled on-disk in a dirty cache line, using the same mechanism used by the cleaner to copy live data from an old segment to the current active segment. When the staging segment is filled, the kernel-resident part of the file system requests the server process to copy the dirty line (the entire 1MB segment) to tertiary storage. The request is served asynchronously, so that the migration control policies may choose to move multiple segments in a single logical operation for transfer efficiency.

Disk segments can be used to cache tertiary segments. Since the cached segments are read-only copies of the tertiary-resident version, cache management is relatively simple (involving no write-back issues). As in the normal LFS, when file data are updated, a new copy of the changed data is appended to the current on-disk log segment; the old copy remains undisturbed until its segment is cleaned or ejected from the cache. We don't clean cached segments on disk; any cleaning of tertiary-resident segments would be done directly with the tertiary-resident copy.

If a process requests I/O on a file for which some necessary metadata or file data are not on secondary storage, the cache may satisfy the request. If the segment containing the required data is not in the cache, the kernel requests a demand fetch from the service process and waits for a reply. The service process finds a reusable segment on disk and directs the I/O process to fetch the necessary segment into that segment. When that is complete, the service process registers the new cache line in the cache directory and calls the kernel to restart the file I/O.

The service or I/O process may choose unilaterally to eject or insert new segments into the cache. This allows them to prefetch multiple segments, perhaps based on some policy, hints, or historical access patterns.

5.3. Block addresses

HighLight uses a uniform block address space for all devices in the filesystem. A single HighLight filesystem may span multiple disk and tertiary storage devices. Figure 4 illustrates the mapping of block numbers onto disk (secondary) and tertiary devices. Block addresses can be considered as a pair: (segment number, offset). The segment number determines both the medium (disk device, tape cartridge, or jukebox platter) and the segment's location within the medium. The offset identifies a particular block in that segment.

[Figure 4: Allocation of block addresses to devices in HighLight. Disk segments occupy the bottom of the address space (disk 0 holds segments 0 through K-1, disk 1 continues from segment K, and so on), tertiary segments occupy the top (robot 0's tapes, ending at segment M-1), and the invalid addresses left over in between form a dead zone.]

HighLight allocates a fixed number of segments to each tertiary medium. Since some media may hold a variable amount of data (e.g. due to device-level compression), this number is set to be the maximum number of segments the medium is expected to hold. HighLight can tolerate device-based compression on tertiary storage since it can keep writing segments to a medium until the drive returns an "end-of-tape" message, at which point the medium is marked full and the last (partially written) segment is re-written onto the next tape. If the compression factor exceeds the expectations, however, all the segments will fit on the tape and some storage at its end may be wasted.

When HighLight's I/O driver receives a block address, it simply compares the address with a table of component sizes and dispatches to the underlying device holding the desired block. Disks are assigned to the bottom of the address space (starting at block number zero), while tertiary storage is assigned to the top (starting at the largest block number). Tertiary media are still addressed with increasing block numbers, however, so that the end of the first medium is at the largest block number, the end of the second medium is just below the beginning of the first medium, etc.

The boundary between tertiary and secondary block addresses may be set at any segment multiple. There will likely be a "dead zone" between valid disk and tertiary addresses; attempts to access these blocks result in an error. In principle, the addition of tertiary or secondary storage is just a matter of claiming part of the dead zone by adjusting the boundaries and expanding the file system's summary tables. However, we do not currently have a tool to make such adjustments after a file system has been created.

We use a single block address space for ease of implementation. By using the same format block numbers as the original LFS, we can use much of its code as is. However, with 32-bit block numbers and 4-kilobyte blocks, we are restricted to less than 16 terabytes of total storage. One segment's worth of address space is unusable for two reasons: (a) we need at least one out-of-band block number ("-1") to indicate an unassigned block, and (b) the LFS allocates space for boot blocks at the head of the disk.

We considered using a larger block address and segmenting it into components directly identifying the device, medium, and offset, and using the device field to dispatch to the appropriate device driver. However, the device/medium identity can just as well be extracted implicitly from the block number by an intelligent device driver which is integrated with the cache. The larger block addresses would also have necessitated many more changes to the base LFS, a task which we declined. We considered having the block address include both a secondary and a tertiary address, but the difficulty of keeping disk addresses current when blocks are cached (and updating those disk addresses where used) seemed prohibitive. We instead chose to locate the cached copy of a block by querying a simple hash table indexed by segment number.

Using 4-kilobyte blocks necessitates an increased partial-segment summary block size (it is only 512 bytes in 4.4BSD LFS). Since the summaries include records describing the partial segment, the larger summary blocks could either reduce or increase overall overhead, depending on whether the summaries are completely filled or not. If the summaries in both the original and new versions are completely full, overhead is reduced with the larger summary blocks. However, the larger summary blocks are almost always too large to be filled in practice, since doing so would require a segment summary to cover an entire segment, and that segment would need to be filled with one block from each of many files. This is possible but not likely given the type of files we expect to find in our environment.
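The sketch below illustrates the dispatch rule described above: segment numbers below the disk boundary go to the striped disks, numbers at or above the tertiary boundary are either satisfied from the on-disk cache (located through a hash table indexed by segment number) or trigger a demand fetch, and numbers in between fall in the dead zone. The structure layout, hash scheme, and names are assumptions for illustration, not HighLight's driver code.

```c
/*
 * Sketch of the I/O driver's dispatch decision: disk segments at the
 * bottom of the segment-number space, tertiary segments at the top,
 * a "dead zone" in between, and a hash table to find cached tertiary
 * segments on disk.
 */
#include <stdint.h>
#include <stddef.h>

#define CACHE_BUCKETS 1024

struct cache_entry {
    uint32_t tert_seg;           /* tertiary segment number */
    uint32_t disk_seg;           /* disk segment acting as its cache line */
    struct cache_entry *next;
};

struct hl_layout {
    uint32_t disk_segs;          /* segments 0 .. disk_segs-1 are on disk */
    uint32_t first_tert_seg;     /* segments >= this live on tertiary media */
    uint32_t total_segs;
    struct cache_entry *cache[CACHE_BUCKETS];
};

enum target { TARGET_DISK, TARGET_CACHED, TARGET_TERTIARY, TARGET_INVALID };

static enum target
dispatch(const struct hl_layout *l, uint32_t seg, uint32_t *where)
{
    const struct cache_entry *e;

    if (seg < l->disk_segs) {                    /* plain disk segment */
        *where = seg;
        return TARGET_DISK;
    }
    if (seg < l->first_tert_seg || seg >= l->total_segs)
        return TARGET_INVALID;                   /* "dead zone" */

    for (e = l->cache[seg % CACHE_BUCKETS]; e != NULL; e = e->next)
        if (e->tert_seg == seg) {                /* cached on disk */
            *where = e->disk_seg;
            return TARGET_CACHED;
        }
    *where = seg;                                /* must demand-fetch */
    return TARGET_TERTIARY;
}
```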

5.4. Secondary Storage Organization

The disks are concatenated by a device driver and used as a single LFS file system. Fresh data are written to the tail of the currently-active log segment. The cleaner reclaims dirty segments by forwarding any live data to the tail of the log. Both the segment selection algorithm, which chooses the next clean segment to be consumed by the log, and the cleaner, which reclaims disk segments, are identical to the 4.4BSD LFS implementations. Unlike the 4.4BSD LFS, though, some of the log segments found on disk are read-only cached segments from tertiary storage.

The ifile, which contains summaries of segments and inode locations, is a superset of that from the 4.4BSD LFS ifile. It has additional flags available for each segment's summary, such as a flag indicating that the segment is being used to cache a tertiary segment and should not be cleaned or overwritten. We also add an indication of how many bytes of storage are available in the segment (which is useful for bookkeeping for a compressing tape or other container with uncertain capacity).

To record summary information for each tertiary medium, HighLight adds a companion file similar to the ifile. It contains tertiary segment summaries in the same format as the secondary segment summaries found in the ifile.

Other special support which a migrator might need to implement its policies can be constructed in additional distinguished files. This might include sequentiality extent data (describing which parts of a file are sequentially accessed) or file clustering data (such as a recording of which files are to migrate together). For efficiency of operation, all the special files used by the base LFS and HighLight are known to the migrator and always remain on disk.

The support necessary for the migration policies may only require user-level support in the migrator, or may involve additional kernel code to record access patterns.

If a need arises for more disk storage, it is possible to initialize a new disk with empty segments and adjust the file system superblock parameters and ifile to incorporate the added disk capacity. If it is necessary to remove a disk from service, its segments can all be cleaned (so that the data are copied to another disk) and marked as having no storage. Tertiary storage may theoretically be added or removed in a similar way.

5.5. Tertiary Storage Organization

Tertiary storage in HighLight is viewed as an array of devices, each holding an array of media volumes, each of which contains an array of segments. Media are currently consumed one at a time by the migration process. We expect that the migrator may wish to direct several migration streams to different media, but do not support that in our current implementation.

We expect the need for tertiary media cleaning to be rare, because we make efforts to migrate only stable data, and to have available an appropriate spare capacity in our tertiary storage devices. Indeed, the current implementation does not clean tertiary media. We will eventually have a cleaner for tertiary storage, which will clean whole media at a time to minimize the media swap and seek latencies.3

3 Minimizing medium insertion and seek passes is also important, as some tape media become increasingly unreliable after too many readings or too many insertions in tape readers.

Since tertiary storage is often very slow (sometimes with access latencies for loading a medium and seeking to the desired offset running over a minute), the relative penalty of taking a bit more access time to the tertiary storage in return for generality and ease of management of the tertiary storage access path is an acceptable tradeoff. Our tertiary storage is accessed via "Footprint", a user-level controller process which uses Sequoia's generic robotic storage interface. It is currently a library linked into the I/O server, but the interface could be implemented by an RPC system to allow the jukebox to be physically located on a machine separate from the file server. This will be important for our environment due to hardware and device driver constraints. Using Footprint also simplifies our utilization of multiple types of tertiary devices, by providing a uniform interface.

5.6. Pseudo Devices

HighLight relies heavily on pseudo device drivers, which do not communicate directly with a device but instead provide a device driver interface to extended functionality built upon other device drivers and specialized code. For example, a striped disk driver provides a single device interface built on top of several independent disks (by mapping block addresses and calling the drivers for the respective disks).

HighLight uses pseudo device drivers for:

- A striping driver to provide a single block address space for all the disks.

- A block cache driver which sends disk requests down to the striping disk pseudo driver, and which sends tertiary storage requests to either the cache (which then uses the striping driver) or the tertiary storage pseudo driver.

- A tertiary storage driver to pass requests up to the user-level tertiary storage manager.

Figure 5 shows the organization of the layers. The block map driver, segment cache and tertiary driver are fairly tightly coupled for convenience. The block map pseudo-device handles ioctl() calls to manipulate the cache and to service kernel I/O requests, and handles read() and write() calls to provide the I/O server with access to the disk device to copy segments on or off of the disk.

To handle a demand fetch request, the tertiary driver simply enqueues it, wakes up a sleeping service process, and then sleeps as usual for any block I/O. The service process directs the I/O process to fetch the data to disk. When it has been fetched, the service process completes the block I/O by calling into the kernel and restarting the I/O through the cache. It completes like any normal block I/O and wakes up the original process.

5.7. User level processes

There are three user-level processes used in HighLight that are not present in the regular 4.4BSD LFS: the kernel request service process, the I/O process, and the migrator. The service process waits for requests from either the kernel or the I/O process: the I/O process may send a status message, while the kernel may request the fetch of a non-resident tertiary segment, the ejection of some cached line (in order to reclaim its space), or the transfer to tertiary storage of a freshly-written tertiary segment.

If the kernel requests a "push" to tertiary storage or a demand fetch, the service process records the request and forwards it to the I/O server. For a demand fetch of a non-resident segment, the service process selects an on-disk segment to act as the cache line. If there are no clean segments available for that use, the service process selects a resident cache line to be ejected and replaced. When the I/O server replies that a fetch is complete, the service process calls the kernel to complete the servicing of the request. The service process interacts with the kernel via ioctl() and select() calls on a character special device representing the unified block address space.

The I/O server is spawned as a child of the service process. It waits for a request from the service process, executes the request, and replies with a status message. It accesses the tertiary storage device(s) through the Footprint interface, and the on-disk cache directly via the cache raw device. Direct access avoids memory-memory copies and pollution of the block buffer cache with blocks ejected to tertiary storage (of course, after a demand fetch, those needed blocks will eventually end up in the buffer cache). Any necessary raw disk addresses are passed to the I/O server as part of the service process's request.

The I/O server is a separate process primarily to provide for some overlap of I/O with other kernel request servicing. If more overlap is required, the I/O server or service process could be rewritten to farm out the work to several processes or threads to perform overlapping I/O.

The third HighLight-specific process, the migrator, embodies the migration policy of the file system, directing the migration of file blocks to tertiary storage segments. It has direct access to the raw disk device, and may examine disk blocks to inspect inodes, directories, or other structures needed for its policy decisions. It selects file blocks by some criteria, and uses a system call (lfs_bmapv()) to find their current location on disk. If they are indeed on disk, it reads them into memory and directs the kernel (via the lfs_migratev() call, a variant of the call the regular cleaner uses to move data out of old segments) to gather and rewrite those blocks into the staging segment on disk. Once the staging segment is filled, the kernel posts a request to the service process to copy the segment to tertiary storage.
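A rough sketch of the service process's main loop is shown below. The device path, request structure, and request codes are hypothetical placeholders; only the overall select()-and-dispatch shape follows the description above.

```c
/*
 * Sketch of the service process's main loop: wait (via select()) for work
 * on the character special device representing the unified block address
 * space, read a request, and dispatch it.  The device name, the request
 * structure, and the codes are hypothetical.
 */
#include <sys/select.h>
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

struct hl_request {             /* hypothetical kernel-to-server request */
    int      code;              /* 1 = demand fetch, 2 = eject, 3 = push */
    unsigned segment;           /* tertiary segment involved */
};

int
main(void)
{
    int fd = open("/dev/hlfs0", O_RDWR);    /* hypothetical device name */
    fd_set rfds;
    struct hl_request req;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (;;) {
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
            continue;
        if (read(fd, &req, sizeof req) != sizeof req)
            continue;
        switch (req.code) {
        case 1: /* pick a cache line, ask the I/O process to fetch */ break;
        case 2: /* eject the named cache line */                      break;
        case 3: /* hand the staged segment to the I/O process */      break;
        }
    }
}
```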

[Figure 5: The layered architecture of the HighLight implementation. Heavy lines indicate data or data/control paths; thin lines are control paths only. The migration "cleaner", the regular cleaner, and the demand server run in user space; the block map driver and segment cache sit above the concatenated disk driver and the tertiary driver in the kernel; the demand server reaches the tertiary device(s) through Footprint, possibly across the network.]

6. Performance micro-benchmarks

To understand and evaluate the performance of HighLight and the impact of our modifications to the basic LFS mechanism, we ran benchmarks with three basic configurations:

1. The basic 4.4BSD LFS.

2. The HighLight version of LFS, using files which have not been migrated.

3. The HighLight version of LFS, using migrated files which are all in the on-disk segment cache.

We ran the tests on an HP 9000/370 CPU with 32 MB of main memory (with 3.2 MB of buffer cache) running 4.4BSD-Alpha. We used a DEC RZ57 SCSI disk drive for our tests, with the on-disk filesystem occupying an 848MB partition. Our tertiary storage device was a SCSI-attached HP 6300 magneto-optic (MO) changer with two drives and 32 cartridges. One drive was allocated for the currently-active writing segment, and the other for reading other platters (the writing drive also fulfilled any read requests for its platter). When running tests with storage, to force more frequent medium changes we constrained HighLight's use of each platter to 40MB (since we didn't have large amounts of data with which to fill the platters to capacity).

Unfortunately, our autochanger device driver does not disconnect from the SCSI bus, and any media swap transactions "hog" the SCSI bus until the robot has finished moving the cartridges. Such media swaps can take many seconds to complete.

6.1. Large object performance

To test performance with large "objects", we used the benchmark of Stonebraker and Olson [15] to measure I/O performance on relatively large transfers. It starts with a 51.2MB file, considered a collection of 12,500 frames of 4096 bytes each (these could be database data pages, compressed images in an animation, etc.). The buffer cache is flushed before each phase of the benchmark. The following operations comprise the benchmark:

- Read 2500 frames sequentially (10MB total)

- Replace 2500 frames sequentially (logically overwriting the old ones)

- Read 250 frames randomly (uniformly distributed over the 12,500 total frames, selected with the 4.4BSD random() function with the time-of-day as the seed)

- Replace 250 frames randomly

- Read 250 frames with 80/20 locality: 80% of reads are to the sequentially next frame; 20% are to a random next frame.

- Replace 250 frames with 80/20 locality.

Note that for the HighLight version with migrated files, any modifications go to local disk rather than to tertiary storage, so that portions of the file live in cached tertiary segments and other portions in regular disk segments. In practice, our migration policies attempt to avoid this situation by migrating only file blocks which are stable.

Phase                      FFS                  Base LFS             HLFS (on-disk)       HLFS (in-cache)
                           time     throughput  time     throughput  time     throughput  time     throughput
10MB sequential read       10.46 s  1002KB/s    12.8 s   819KB/s     12.9 s   813KB/s     12.9 s   813KB/s
10MB sequential write      10.0 s   1024KB/s    16.4 s   639KB/s     17.0 s   617KB/s     17.6 s   596KB/s
1MB random read            6.9 s    152KB/s     6.8 s    154KB/s     6.9 s    152KB/s     7.1 s    148KB/s
1MB random write           3.3 s    315KB/s     1.4 s    749KB/s     1.4 s    749KB/s     1.3 s    807KB/s
1MB read, 80/20 locality   6.9 s    152KB/s     6.8 s    154KB/s     6.9 s    152KB/s     7.1 s    148KB/s
1MB write, 80/20 locality  1.48 s   710KB/s     1.2 s    873KB/s     1.4 s    749KB/s     1.4 s    749KB/s

Table 1: Large Object performance tests. Time values are elapsed times; throughput is calculated from the elapsed time and total data volume. The FFS measurements are from a version with read and write clustering. For the LFS measurements, the disk had sufficient clean segments so that the cleaner did not run during the tests.
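For concreteness, the sketch below generates the 80/20-locality access pattern used in the last two benchmark phases. It illustrates the pattern only; it is not the benchmark program used for these measurements.

```c
/*
 * Sketch of the 80/20-locality phase of the large-object benchmark: 80%
 * of accesses go to the sequentially next frame, 20% to a random frame.
 */
#include <stdlib.h>
#include <time.h>

#define NFRAMES    12500
#define NACCESSES  250

/* Fill "frames" with the sequence of frame numbers to read or replace. */
static void
locality_80_20(long frames[NACCESSES])
{
    long cur = random() % NFRAMES;
    int i;

    for (i = 0; i < NACCESSES; i++) {
        if (random() % 100 < 80)
            cur = (cur + 1) % NFRAMES;      /* sequentially next frame */
        else
            cur = random() % NFRAMES;       /* random next frame */
        frames[i] = cur;
    }
}

int
main(void)
{
    static long frames[NACCESSES];

    srandom((unsigned)time(NULL));          /* time-of-day seed, as in the text */
    locality_80_20(frames);
    return 0;
}
```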

File    FFS access times        HLFS access times
size                            in-cache                uncached
        First byte  Total       First byte  Total       First byte  Total
10KB    0.06 s      0.09 s      0.11 s      0.12 s      3.57 s      3.59 s
100KB   0.06 s      0.27 s      0.11 s      0.27 s      3.59 s      3.73 s
1MB     0.06 s      1.29 s      0.10 s      1.55 s      3.51 s      8.22 s
10MB    0.07 s      11.89 s     0.09 s      13.68 s     3.57 s      44.23 s

Table 2: Access delays for files, in seconds. The time to first byte includes any delays for fetching metadata (such as an inode) from tertiary storage. The FFS measurements are from a version with read and write clustering.

Table 1 shows our measurements for the large object test. We were able to test this benchmark on the plain 4.4BSD-Alpha Fast File System (FFS) as well; we used 4096-byte blocks for FFS (the same basic size as used by LFS and HighLight) with the maximum contiguous block count set to 16 (to result in 64-kilobyte transfers in the best case). The base LFS compares unfavorably to the plain FFS; this is most likely due to extra buffer copies performed inside the LFS code. For HighLight, when data have not been migrated to tertiary storage, there is a slight performance degradation versus the base LFS (due to the slightly modified system structures). Even when data have been "migrated" but remain cached on disk, the degradation is small.

6.2. Access Delays

To measure the delays incurred by a process waiting for file data to be fetched into the cache, we migrated some files, ejected them from the cache, and then read them (so that they were fetched into the cache again). We timed both the access time for the first byte to arrive in user space and the elapsed time. The files were read from a newly-mounted filesystem (so that no blocks were cached), using the standard I/O library with an 8KB buffer. The tertiary medium was in the drive when the tests began, so time-to-first-byte does not include the media swap time.

Table 2 shows the first-byte and total elapsed times for disk-resident (both HLFS and FFS) and uncached files. FFS is faster to access the first byte, probably because it fetches fewer metadata blocks (LFS needs to consult the inode map to find the file). The time-to-first-byte is fairly even among file sizes, indicating that HighLight does make file blocks available to user space as soon as they are on disk. The total time for the uncached file read of 10MB is somewhat more than the sum of the in-cache time and the required transfer time (computable from the value in Table 5), indicating some inefficiency in the fetch process. The inefficiency probably stems from the extra copies of demand-fetched segments: they are copied from tertiary storage to memory, thence to raw disk, and are finally re-read through the file system and buffer cache. The implementation of this scheme is simple, but performance suffers. A mechanism to transfer blocks directly from the I/O server memory to the buffer cache might provide substantial improvements.

Phase              Percentage of time consumed
Footprint write    62%
I/O server read    37%
Migrator queuing   1%

Table 3: A breakdown of the components of the archiver/migrator elapsed run times while transferring data from magnetic to magneto-optical (MO) disk.

I/O type         Performance
Raw MO read      451KB/s
Raw MO write     204KB/s
Raw RZ57 read    1417KB/s
Raw RZ57 write   989KB/s
Media change     13.5 s

Table 4: Raw device measurements. Raw throughput was measured with a set of sequential 1-MB transfers. Media change measures time from an eject command to a completed read of one sector on the MO platter.

Phase                          Throughput
Magnetic disk arm contention   111KB/s
No arm contention              192KB/s
Overall                        135KB/s

Table 5: Migrator throughput measurements for phases with and without disk arm contention.

6.3. Migrator throughput

To measure the available bandwidth of the migration path, we took the original 51.2MB file from the large object benchmark and migrated it entirely to tertiary storage, while timing the components of the migration mechanism. The migration path measurements are divided into time spent in the Footprint library routines (which includes any media change or seek as well as transfer to the tertiary storage), time spent in the I/O server main code (copying from the cache disk to memory), and queuing delays. Table 3 shows the measurements; the MO disk transfer rate is the main factor in the performance, resulting in the Footprint library consuming the bulk of the running time.

To get a baseline for comparison with HighLight, we measured the raw device bandwidth available by using dd with the same I/O sizes as HighLight uses (whole segments). We also measured the average time from the start of a medium swap to medium ready for reading. Table 4 shows our raw device measurements.

Table 5 shows our measurements of two distinct phases of migrator throughput when writing segments to MO disk. The total throughput provided when the magnetic disk is in use simultaneously by the migrator (reading blocks and creating new cached segments) and by the I/O server (copying segments out to tape) is significantly less than the total throughput provided when the only access to the magnetic disk is from the I/O server. When there is no disk arm contention, the I/O server can write at nearly the full bandwidth of the tertiary medium. The magnetic disk and the optical disk shared the same SCSI bus; both were in use simultaneously for the entire migration process. Since both disks were in use during both the disk arm contention and non-contention phases, this suggests that SCSI bandwidth was not the limiting factor and that performance might improve by using a separate disk spindle for the staging cache segments.

7. Conclusions

Sequoia 2000 needs support for easy access to large volumes of data which won't economically fit on current disks or file systems. We have constructed HighLight as an extended 4.4BSD LFS. It manages tertiary storage and integrates it into the filesystem, with a disk cache to speed its operation. The mechanisms provided by HighLight are sufficient to support a variety of potential migration control policies, and provide a good testbed for evaluating these policies. The performance of HighLight's basic mechanism when all blocks reside on disk is nearly as good as the basic 4.4BSD LFS performance. Transfers to magneto-optical tertiary storage can run at nearly the tertiary device transfer speed.

We intend to evaluate our candidate policies to determine which one(s) seem to provide the best performance in the Sequoia environment. However, it seems clear that the file access characteristics of a site will be the prime determinant of a good policy. Sequoia's environment may differ sufficiently from others' environments that direct application of previous results may not be appropriate. Our architecture is flexible enough to admit implementation of a good policy for any particular site.

8. Future Work

To avoid eventual exhaustion of tertiary storage, HighLight will need a tertiary cleaning mechanism that examines tertiary volumes, a task which would best be done with at least two access points to avoid having to swap between the being-cleaned medium and the destination medium.

Some other tertiary storage systems do not cache tertiary-resident files on first reference, but bypass the cache and return the file data directly. A second reference soon thereafter results in the file being cached. While this is less feasible to implement directly in a segment-based migration scheme, we could designate some subset of the on-disk cache lines as "least-worthy" and eject them first upon reading a new segment. Upon repeated access the cache line would be marked as part of the regular pool for replacement policy (this is essentially a cross between a nearly-MRU cache replacement policy and whatever other policy is in use).

As mentioned above, the ability to add (and perhaps remove) disks and tertiary media while on-line may be quite useful to allow incremental growth or resource reallocation. Constructing such a facility should be fairly straightforward.

There are a couple of reliability issues worthy of study: backup and media failure robustness. Backing up a large storage system such as HighLight would be a daunting effort. Some variety of replication would likely be easier (perhaps having the Footprint server keep two copies of everything written to it). For reliability purposes in the face of a medium failure, it may be wise to keep certain metadata on disk and back them up regularly, rather than migrate them to a potentially faulty tertiary medium. Doing so might avoid the need to examine all the tertiary media in order to reconstruct the filesystem after a tertiary medium failure.

Acknowledgments

The authors are grateful to students and faculty in the CS division at Berkeley for their comments on early drafts of this paper. An informal study group on Sequoia data storage needs provided many keen insights we incorporated into our work. Ann Drapeau helped us understand some reliability and performance characteristics of tape devices. Randy Wang implemented the basic Footprint interface. We are especially grateful to Mendel Rosenblum and John Ousterhout, whose LFS work underlies this project, and to Margo Seltzer and the Computer Systems Research Group at Berkeley for implementing 4.4BSD LFS.

Code Availability

Source code for HighLight will be available via anonymous FTP from postgres.berkeley.edu when the system is robust enough for external distribution.

References

[1] Mary G. Baker, John H. Hartman, Michael D. Kupfer, Ken W. Shirriff, and John K. Ousterhout. Measurements of a Distributed File System. Operating Systems Review, 25(5):198-212, October 1991.

[2] General Atomics/DISCOS Division. The UniTree Virtual Disk System: An Overview. Technical report available from DISCOS, P.O. Box 85608, San Diego, CA 92186, 1991.

[3] D. H. Lawrie, J. M. Randal, and R. R. Barton. Experiments with Automatic File Migration. IEEE Computer, 15(7):45-55, July 1982.

[4] Marshall Kirk McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A Fast File System for UNIX. ACM Transactions on Computer Systems, 2(3):181-197, August 1984.

[5] Ethan L. Miller, Randy H. Katz, and Stephen Strange. An Analysis of File Migration in a UNIX Supercomputing Environment. In USENIX Association Winter 1993 Conference Proceedings, San Diego, CA, January 1993. The USENIX Association.

[6] James W. Mott-Smith. The Jaquith Archive Server. UCB/CSD Report 92-701, University of California, Berkeley, Berkeley, CA, September 1992.

[7] Michael Olson. The Design and Implementation of the Inversion File System. In USENIX Association Winter 1993 Conference Proceedings, San Diego, CA, January 1993. The USENIX Association.

[8] Sean Quinlan. A Cached WORM File System. Software: Practice and Experience, 21(12):1289-1299, December 1991.

[9] Mendel Rosenblum and John K. Ousterhout. The Design and Implementation of a Log-Structured File System. Operating Systems Review, 25(5):1-15, October 1991.

[10] Margo Seltzer, Keith Bostic, Marshall Kirk McKusick, and Carl Staelin. An Implementation of a Log-structured File System for UNIX. In USENIX Association Winter 1993 Conference Proceedings, San Diego, CA, January 1993. The USENIX Association.

[11] Alan Jay Smith. Analysis of Long-Term File Reference Patterns for Application to File Migration Algorithms. IEEE Transactions on Software Engineering, SE-7(4):403-417, 1981.

[12] Alan Jay Smith. Long-Term File Migration: Development and Evaluation of Algorithms. Communications of the ACM, 24(8):521-532, August 1981.

[13] Michael Stonebraker. The POSTGRES Storage System. In Proceedings of the 1987 VLDB Conference, Brighton, England, September 1987.

[14] Michael Stonebraker. An Overview of the Sequoia 2000 Project. Technical Report 91/5, University of California, Project Sequoia, December 1991.

[15] Michael Stonebraker and Michael Olson. Large Object Support in POSTGRES. In Proc. 9th Int'l Conf. on Data Engineering, Vienna, Austria, April 1993. To appear.

[16] Stephen Strange. Analysis of Long-Term UNIX File Access Patterns for Application to Automatic File Migration Strategies. UCB/CSD Report 92-700, University of California, Berkeley, Berkeley, CA, August 1992.

Author Information

John Kohl is a Software Engineer with Digital Equipment Corporation. He holds an MS in Computer Science from the University of California, Berkeley (December 1992) and a BS in Computer Science and Engineering from the Massachusetts Institute of Technology (May 1988). He worked several years at MIT's Project Athena, where he led the Kerberos V5 development effort. He may be reached by e-mail at [email protected], or by surface mail at Digital Equipment Corporation, ZKO3-3/U14, 110 Spit Brook Road, Nashua, NH 03062-2698.

Carl Staelin works for Hewlett-Packard Laboratories in the Berkeley Science Center and the Concurrent Systems Project. His research interests include high performance file system design and tertiary storage file systems. As part of the Science Center he is currently working with Project Sequoia at the University of California at Berkeley. He received his PhD in Computer Science from Princeton University in 1992 in high performance file system design. He may be reached by e-mail at [email protected], or by telephone at (415) 857-6823, or by surface mail at Hewlett-Packard Laboratories, Mail Stop 1U-13, 1501 Page Mill Road, Palo Alto, CA 94303.

Michael Stonebraker is a professor of Electrical Engineering and Computer Science at the University of California, Berkeley, where he has been employed since 1971. He was one of the principal architects of the INGRES relational data base management system, which was developed during the period 1973-77. Subsequently, he constructed Distributed INGRES, one of the first working distributed data base systems. Now (in conjunction with Professor Lawrence A. Rowe) he is developing a next generation DBMS, POSTGRES, that can effectively manage not only data but also objects and rules as well.

He is a founder of INGRES Corp (now the INGRES Products Division of ASK Computer Systems), a past chairman of the ACM Special Interest Group on Management of Data, and the author of many papers on DBMS technology. He lectures widely, has been the keynote speaker at recent IEEE Data Engineering, OOPSLA, and Expert DBMS Conferences, and has been a consultant for several organizations including Citicorp, McDonnell-Douglas, and Pacific Telesis. He may be reached by e-mail at [email protected] or surface mail at 549 Evans Hall, Berkeley, CA 94720.