BTRFS: Investigation of Several Aspects Regarding Fragmentation
Elijah D. Mirecki
Athabasca University
February 26, 2014

Abstract

BTRFS defragmentation is currently an active topic of discussion among operating system designers. Defragmentation is especially important for BTRFS because of its copy-on-write (COW) nature, which causes data extents to become scattered across the drive as writes occur. This paper investigates the current BTRFS defragmentation algorithm as well as the autodefrag algorithm. Several illustrations have been created to represent how the defragmentation algorithm works at a high level. Possible solutions and ideas for defragmentation and autodefrag are also discussed.

The main problem with the current defragmentation algorithm is that it sometimes merges more extents than are needed to produce relatively unfragmented files. Another problem is that flash drives and disk drives are treated the same when choosing which files to defragment. This is an issue because the two drive types should have different goals: disk drives aim for contiguous space, while flash drives need fewer writes to prolong their lifetime. The solutions and ideas discussed in this paper should theoretically allow defragmentation to merge extents more appropriately while meeting the goals of both flash and disk drives. This paper should be followed by simulations comparing several defragmentation and file selection algorithms; some implementation details are provided in case the simulations return positive results. Several other filesystems are touched upon, including ext4, HFS+, and NTFS, whose fragmentation situations and defragmentation approaches are covered and compared briefly.

Introduction and background

Introduction to BTRFS

BTRFS is an extent-based copy-on-write (COW) file system. An extent is essentially an array of contiguous blocks: instead of the filesystem metadata leaf nodes pointing at each individual block, only a single block pointer and an extent length must be set in the inode. With COW, when a block is modified, the write does not occur on the existing extent; the data is written to a new extent on another part of the drive. If a write lands in the middle of an existing extent, the extent must be split into multiple parts. The inode will point to the beginning of the old extent, then to the new extent created by the write, and then to the end of the old extent. The data in the middle of the old extent may then be deallocated. This allows a much more versatile file system, but results in the system becoming much more fragmented.
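To make the split concrete, the following is a minimal sketch in C of the bookkeeping described above. The struct and function are hypothetical simplifications invented for illustration; the real BTRFS extent records and COW path in fs/btrfs/ are considerably more involved. The sketch assumes the write falls strictly inside the old extent.

```c
#include <stdint.h>

/* Hypothetical, simplified extent mapping; not the actual
 * on-disk BTRFS structure. */
struct extent {
    uint64_t file_offset;  /* logical offset within the file (blocks) */
    uint64_t disk_block;   /* starting block on the drive             */
    uint64_t num_blocks;   /* length of the extent in blocks          */
};

/*
 * COW write strictly inside `old`: the new data lands in a freshly
 * allocated extent, and the file now references three extents.
 * `out` must have room for 3 entries.
 */
void cow_split(const struct extent *old,
               uint64_t write_offset,   /* blocks into the old extent */
               uint64_t write_blocks,
               uint64_t new_disk_block, /* freshly allocated space    */
               struct extent out[3])
{
    uint64_t tail = write_offset + write_blocks;

    /* 1. Head of the old extent, untouched and still shared. */
    out[0] = (struct extent){ old->file_offset,
                              old->disk_block,
                              write_offset };

    /* 2. The new extent holding the freshly written data. */
    out[1] = (struct extent){ old->file_offset + write_offset,
                              new_disk_block,
                              write_blocks };

    /* 3. Tail of the old extent.  The blocks it no longer covers
     *    can be freed once nothing else references them. */
    out[2] = (struct extent){ old->file_offset + tail,
                              old->disk_block + tail,
                              old->num_blocks - tail };
}
```

After the split, a file that was described by one extent needs three pointers, which is exactly how repeated small writes drive up fragmentation.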
Some of the main features of BTRFS include compression, clones, and snapshots. Snapshots allow the filesystem state to be saved for backups or archives. Clones are copies of files that initially share the same extents; as changes are made to the clone or the original file, the two gradually share less with each other due to COW. Since BTRFS is a COW system, snapshots and clones can be created extremely quickly by simply creating new file pointers. Once there are no more pointers to an extent on the drive, it is deallocated.

BTRFS is also transaction based. Write operations are not completed immediately, but are placed in a transaction located in memory. The system commits all of the operations at regular intervals, or when otherwise required, such as when the system is shutting down. This allows compression and write placement optimization to take place while the transaction resides in memory, before the data is ever written to the disk. This is also a form of delayed allocation, a method designed to reduce fragmentation by combining several write operations before the actual write occurs.

As mentioned above, BTRFS suffers greatly from fragmentation. On enterprise systems, the built-in BTRFS defragmenter may be run according to the system schedule. As a more general solution, BTRFS implements a mechanism called autodefrag. Autodefrag automatically chooses files to defragment as they are modified; the chosen files are defragmented along with each transaction being performed. This is a good solution for most systems because it keeps fragmentation low when small writes occur within files, which is the circumstance that generally causes the most file fragmentation.

Motivation for Research

As I was reading about ZFS on miscellaneous websites, I stumbled upon several other file systems, such as BTRFS and XFS. At the time I read further articles about BTRFS and its basic concepts: copy-on-write, B-tree data structures, and so on. Once it was required that I find a topic to research, my first thought was BTRFS, because it had interested me in the past. I then headed to the Athabasca online library, where I found a recent article covering many aspects of BTRFS at several levels (Resource 1). As I read the article, it seemed that the defragmentation system has been one of the largest problems with BTRFS to date. From what I have gathered in several articles, the autodefrag algorithm is the default mechanism for selecting which files should be defragmented. Another problem I noticed is that the current defragmentation algorithm tends to strip out the sharing of extents with clones and snapshots. This is a big issue, since clones and snapshots are core elements of the system. I also discovered that the defragmentation algorithms treat flash drives the same as disk drives. This stood out as yet another major problem, because flash drives behave much differently: on a flash drive, access time is essentially constant, so contiguous data matters far less.

Problems

Excess Defragmentation. The current defragmentation algorithm has a major flaw: when a file begins to be defragmented, the ENTIRE file is scanned for extents that are worthy of being merged. There are several problems with this approach:

1. Performance will be affected.
2. A lot of sharing will be lost with clones and snapshots.
3. Drive lifetime will be reduced.

Performance is degraded for several reasons. One is that the algorithm must scan the entire file instead of just the part that needs attention (the section of the file which has been modified). Another, of course, is that extra extents will have to be COWed; in a large file such as a database, this could be a massive overhead. Much more sharing with snapshots and clones will also be lost unnecessarily. Defragmenting only the small section of a file which has been modified will most likely not require many COW operations, whereas if an entire file is defragmented, extents all over the file will be merged and therefore duplicated among clones and snapshots. Finally, drive lifetime decreases as more writing occurs; lifetime is especially critical for flash drives, which are starting to dominate the secondary storage market. A sketch of range-limited defragmentation is shown below.
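For reference, the kernel already lets user space request defragmentation of a byte range rather than a whole file, via the BTRFS_IOC_DEFRAG_RANGE ioctl and struct btrfs_ioctl_defrag_range_args. The minimal sketch below shows such a call, assuming a 3.x-era kernel with the UAPI header installed; error handling is trimmed for brevity.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/types.h>
#include <linux/btrfs.h>

/* Ask BTRFS to defragment only [start, start + len) of a file,
 * leaving extent sharing elsewhere in the file intact. */
int defrag_range(const char *path, __u64 start, __u64 len)
{
    struct btrfs_ioctl_defrag_range_args args;
    int fd, ret;

    fd = open(path, O_RDWR);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    memset(&args, 0, sizeof(args));
    args.start = start;   /* byte offset to begin at     */
    args.len = len;       /* number of bytes to consider */
    /* args.extent_thresh == 0 lets the kernel pick its default
     * threshold for "small enough to be worth merging". */

    ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE, &args);
    if (ret < 0)
        perror("BTRFS_IOC_DEFRAG_RANGE");

    close(fd);
    return ret;
}
```

Restricting the range in this way is essentially the behaviour argued for above: only the recently modified region is considered, so clones and snapshots keep their sharing for the rest of the file.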
Autodefrag Optimization for Flash Drives

Currently, the autodefrag algorithm essentially just checks the size of a COW operation to determine whether to add a file to the defragmentation list, regardless of which kind of drive is in use. The problem is that on flash drives fragmentation is not as big an issue as on disk drives, so files should not be defragmented as often. The main concern is that flash drives have a shorter life span than disk drives, yet the algorithm provides no means of taking this into account. Because the defragmentation algorithm behaves the same on a flash drive, it will try to merge extents to form contiguous data, which wears the drive out if overdone. A more minor problem is that performance may be impacted when more files are defragmented. The algorithm should be smarter, handling autodefrag on flash drives differently to accommodate the drive's life span.

Methods

To begin exploring the circumstances of fragmentation in BTRFS, several documents and papers were read. The documents found covered a broad range of how BTRFS works, but were not specifically about the fragmentation issue or the defragmentation algorithms, which they covered only briefly. A general understanding of how BTRFS works as a whole was gained mainly from resource 1, but other miscellaneous articles and mailing lists were used as well.

The main tactic used to explore the defragmentation algorithm was simply to poke around the Linux source code itself (version 3.12.8). The source code is a big place, so it can be difficult to know where to start. To find a starting point, the git logs for the Linux source code were searched for BTRFS defragmentation and autodefrag [5]. Once several functions were found that seemed to be reasonable entry points for exploring defragmentation and autodefrag, those functions were examined for comments and calls to other functions. The Linux Cross Reference (LXR) [6] was used extensively while exploring the source code. Data structures and functions related to autodefrag and defragmentation were traced and examined.

Results and findings

Findings

The main starting places found in the Linux source code were the functions btrfs_ioctl_defrag(), btrfs_defrag_file(), and cow_file_range(). The cow_file_range() function is what performs the COW operations. While performing these operations, it also checks whether a file should be defragmented according to the autodefrag policy. The current code adds a file to the list if a write is made that is under 64KB and the write covers only part of the file. A simplified sketch of that policy check follows below.
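To make the selection policy concrete, here is a minimal sketch of the check just described. The names (SMALL_WRITE_BYTES, should_autodefrag, file_stub) are hypothetical stand-ins invented for illustration; only the under-64KB, partial-write heuristic is taken from the source described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant mirroring the behaviour described above:
 * COW writes under 64KB mark the file as a defrag candidate. */
#define SMALL_WRITE_BYTES ((uint64_t)64 * 1024)

struct file_stub {
    uint64_t size;  /* current file size in bytes */
};

/* Simplified autodefrag policy: queue the file for defragmentation
 * when a COW write is small AND covers only part of the file.
 * A write that rewrites the whole file yields a single fresh
 * extent, so there is nothing worth merging afterwards. */
bool should_autodefrag(const struct file_stub *f,
                       uint64_t write_start, uint64_t write_len)
{
    bool small_write = write_len < SMALL_WRITE_BYTES;
    bool partial_write = !(write_start == 0 && write_len >= f->size);

    return small_write && partial_write;
}
```

In the kernel, a file passing this test is recorded in an in-memory list of inodes to defragment, which is then processed alongside transaction commits, as described earlier.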