Metadata Efficiency in Versioning File Systems
2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, Mar 31 - Apr 2, 2003.

Craig A.N. Soules, Garth R. Goodson, John D. Strunk, Gregory R. Ganger
Carnegie Mellon University

Abstract

Versioning file systems retain earlier versions of modified files, allowing recovery from user mistakes or system corruption. Unfortunately, conventional versioning systems do not efficiently record large numbers of versions. In particular, versioned metadata can consume as much space as versioned data. This paper examines two space-efficient metadata structures for versioning file systems and describes their integration into the Comprehensive Versioning File System (CVFS), which keeps all versions of all files. Journal-based metadata encodes each metadata version into a single journal entry; CVFS uses this structure for inodes and indirect blocks, reducing the associated space requirements by 80%. Multiversion b-trees extend each entry’s key with a timestamp and keep current and historical entries in a single tree; CVFS uses this structure for directories, reducing the associated space requirements by 99%. Similar space reductions are predicted via trace analysis for other versioning strategies (e.g., on-close versioning). Experiments with CVFS verify that its current-version performance is similar to that of non-versioning file systems while reducing overall space needed for history data by a factor of two. Although access to historical versions is slower than conventional versioning systems, checkpointing is shown to mitigate and bound this effect.

1 Introduction

Self-securing storage [41] is a new use for versioning in which storage servers internally retain file versions to provide detailed information for post-intrusion diagnosis and recovery of compromised client systems [40]. We envision self-securing storage servers that retain every version of every file, where every modification (e.g., a WRITE operation or an attribute change) creates a new version. Such comprehensive versioning maximizes the information available for post-intrusion diagnosis. Specifically, it avoids pruning away file versions, since this might obscure intruder actions. For self-securing storage, pruning techniques are particularly dangerous when they rely on client-provided information, such as CLOSE operations, since the versioning is being done specifically to protect stored data from malicious clients.

Obviously, finite storage capacities will limit the duration of time over which comprehensive versioning is possible. To be effective for intrusion diagnosis and recovery, this duration must be greater than the intrusion detection latency (i.e., the time from an intrusion to when it is detected). We refer to the desired duration as the detection window. In practice, the duration is limited by the rate of data change and the space efficiency of the versioning system. The rate of data change is an inherent aspect of a given environment, and an analysis of several real environments suggests that detection windows of several weeks or more can be achieved with only a 20% cost in storage capacity [41].

In a previous paper [41], we described a prototype self-securing storage system. By using standard copy-on-write and a log-structured data organization, the prototype provided comprehensive versioning with minimal performance overhead (<10%) and reasonable space efficiency. In that work, we discovered that a key design requirement is efficient encoding of metadata versions (the additional information required to track the data versions). While copy-on-write reduces data versioning costs, conventional versioning implementations still involve one or more new metadata blocks per version. On average, the metadata versions require as much space as the versioned data, halving the achievable detection window. Even with less comprehensive versioning, such as Elephant [37] or VMS [29], the metadata history can become almost (~80%) as large as the data history.

This paper describes and evaluates two methods of storing metadata versions more compactly: journal-based metadata and multiversion b-trees. Journal-based metadata encodes each version of a file’s metadata in a journal entry. Each entry describes the difference between two versions, allowing the system to roll back to the earlier version of the metadata. Multiversion b-trees retain all versions of a metadata structure within a single tree. Each entry in the tree is marked with a timestamp indicating the time over which the entry is valid.

The two mechanisms have different strengths and weaknesses. We discuss these and describe how both techniques are integrated into a comprehensive versioning file system called CVFS. CVFS uses journal-based metadata for inodes and indirect blocks to encode changes to attributes and file data pointers; doing so reduces the space used for their histories by 80%. CVFS implements directories as multiversion b-trees to encode additions and removals of directory entries; doing so reduces the space used for their histories by 99%. Combined, these mechanisms nearly double the potential detection window over conventional versioning mechanisms, without increasing the access time to current versions of the data.

Journal-based metadata and multiversion b-trees are also valuable for conventional versioning systems. Using these mechanisms with on-close versioning and snapshots would provide similar reductions in versioned metadata. For on-close versioning, this reduces the total space required by nearly 35%, thereby reducing the pressure to prune version histories. Identifying solid heuristics for such pruning remains an open area of research [37], and less pruning means fewer opportunities to mistakenly prune important versions.

The rest of this paper is divided as follows. Section 2 discusses conventional versioning and motivates this work. Section 3 discusses the two space-efficient metadata versioning mechanisms and their tradeoffs. Section 4 describes the CVFS versioning file system. Section 5 analyzes the efficiency of CVFS in terms of space efficiency and performance. Section 6 describes how our versioning techniques could be applied to other systems. Section 7 discusses additional related work. Section 8 summarizes the paper’s contributions.

2 Versioning and Space Efficiency

Every modification to a file inherently results in a new version of the file. Instead of replacing the previous version with the new one, a versioning file system retains both. Users of such a system can then access any historical versions that the system keeps as well as the most recent one. This section discusses uses of versioning, techniques for managing the associated capacity costs, and our goal of minimizing the metadata required to track file versions.

2.1 Uses of Versioning

File versioning offers several benefits to both users and system administrators. These benefits can be grouped into three categories: recovery from user mistakes, recovery from system corruption, and analysis of historical changes. Each category stresses different features of the versioning system beneath it.

Recovery from user mistakes: Human users make mistakes, such as deleting or erroneously modifying files. Versioning can help [17, 29, 37]. Recovery from such mistakes usually starts with some a priori knowledge about the nature of the mistake. Often, the exact file that should be recovered is known. Additionally, there are only certain versions that are of any value to the user; intermediate versions that contain incomplete data are useless. Therefore, versioning aimed at recovery from user mistakes should focus on retaining key versions of important files.

Recovery from system corruption: When a system becomes corrupted, administrators generally have no knowledge about the scope of the damage. Because of this, they restore the entire state of the file system from some well-known “good” time. A common versioning technique to help with this is the online snapshot. Like a backup, a snapshot contains a version of every file in the system at a particular time. Thus, snapshot systems present sets of known-valid system images at a set of well-known times.

Analysis of historical changes: A history of versions can help answer questions about how a file reached a certain state. For example, version control systems (e.g., RCS [43], CVS [16]) keep a complete record of committed changes to specific files. In addition to selective recovery, this record allows developers to figure out who made specific changes and when those changes were made. Similarly, self-securing storage seeks to enable post-intrusion diagnosis by providing a record of what happened to stored files before, during, and after an intrusion. We believe that every version of every file should be kept. Otherwise, intruders who learn the pruning heuristic will leverage this information to prune any file versions that might disclose their activities. For example, intruders may make changes and then quickly revert them once damage is caused in order to hide their tracks. With a complete history, administrators can determine which files were changed and estimate damage. Further, they can answer (or at least construct informed hypotheses for) questions such as “When and how did the intruder get in?” and “What was their goal?” [40].

2.2 Pruning Heuristics

A true comprehensive versioning system keeps all versions of all files for all time. Such a system could support all three goals described above. Unfortunately, storing this much information is not practical. As a result, all versioning systems use pruning heuristics. These pruning heuristics determine when versions should be created and when they should be removed. In other words, pruning heuristics determine which versions to keep from the total set of versions that would be available in a comprehensive versioning system.

2.2.1 Common Heuristics

A common pruning technique in versioning file systems is on-close versioning.

2.3 Lossless Version Compression