FGDEFRAG: A Fine-Grained Defragmentation Approach to Improve Restore Performance

Yujuan Tan∗, Jian Wen∗, Zhichao Yan†, Hong Jiang†, Witawas Srisa-an‡, Baiping Wang∗, Hao Luo§
∗College of Computer Science, Chongqing University, China. Email: [email protected], [email protected], [email protected]
†University of Texas Arlington. Email: [email protected], [email protected]
‡University of Nebraska Lincoln. Email: [email protected]
§Nimble Storage. Email: [email protected]

Abstract

In deduplication-based backup systems, the removal of redundant data transforms the otherwise logically adjacent data chunks into physically scattered chunks on the disks. This, in effect, changes the retrieval operations from sequential to random and significantly degrades the performance of restoring data. These scattered chunks are called fragmented data, and many techniques have been proposed to identify and sequentially rewrite such fragmented data to new address areas, trading off increased storage space for a reduced number of random reads (disk seeks) to improve the restore performance. However, existing solutions for backup workloads share a common assumption that every read operation involves a large fixed-size window of contiguous chunks, which restricts fragment identification to a fixed-size read window. This can lead to inaccurate detections due to false positives, since data fragments can vary in size and appear at different and unpredictable address locations. Based on these observations, we propose FGDEFRAG, a Fine-Grained defragmentation approach that uses variable-sized and adaptively located data groups, instead of fixed-size read windows, to accurately identify and effectively remove fragmented data. When we compare its performance to those of existing solutions, FGDEFRAG not only reduces the amount of rewritten data but also significantly improves the restore performance. Our experimental results show that FGDEFRAG can improve the restore performance by 19% to 262%, while simultaneously reducing the rewritten data by 29% to 70%.

I. Introduction

Data deduplication is a lossless compression technology that has been widely used in backup [19], [8], [1], [17] and archival systems [12], [18], [2]. It breaks data streams into approximately equal-sized data chunks that are each uniquely "fingerprinted" [13] to identify chunk-level data redundancy. A chunk fingerprint is generated by a secure hash algorithm [11] from the content of that chunk. If the fingerprints of two chunks generated by the same hash algorithm are identical, the chunks are regarded as duplicate, or redundant, chunks and only one instance is stored; the other chunk is then replaced by an address pointer to the stored instance. In backup systems, the incrementally changing nature of the data streams leads to a very high compression ratio, typically from 10x to 100x, since a large percentage of the data is redundant among different backup versions.
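To make the fingerprint-based redundancy check concrete, the sketch below shows a minimal, illustrative chunk-level deduplicator in Python. It is not the implementation used in this paper; the fixed-size chunking, the SHA-1 choice, and the in-memory index are simplifying assumptions for illustration only.

```python
import hashlib

CHUNK_SIZE = 8 * 1024   # illustrative fixed-size chunking; real systems often chunk by content
fingerprint_index = {}  # fingerprint -> address of the stored instance
store = []              # stand-in for the on-disk chunk log; the list index acts as the address

def backup_stream(data: bytes):
    """Deduplicate a stream chunk by chunk, returning a recipe of chunk addresses."""
    recipe = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()   # the chunk fingerprint
        addr = fingerprint_index.get(fp)
        if addr is None:                       # unique chunk: store it once
            addr = len(store)
            store.append(chunk)
            fingerprint_index[fp] = addr
        # duplicate chunk: only the address pointer is kept in the recipe
        recipe.append(addr)
    return recipe
```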

While data deduplication significantly increases storage space efficiency, it also substantially complicates post-deduplication storage management [14], [4]. For example, due to the removal of redundant chunks, the logically adjacent data chunks that belong to a specific file or data stream are scattered in different places on disks, transforming the retrieval operations of such files or data streams from sequential to random. This significantly increases the retrieval time because of the extra disk seeks, with the worst case of one seek per chunk.

Fig. 1 illustrates this problem with a simple but intuitive example. In Fig. 1, file A and file A' share the common chunk C. When file A' enters the backup system after file A, only chunks E and F of file A' would be stored, since chunk C is already stored by file A; thus chunk C is stored separately (non-sequentially) from chunks E and F. Reading file A' therefore requires at least two disk seeks, one for chunk C and another for chunks E and F. Generally, we call a chunk such as chunk C fragmented data of file A'. If chunk C is not large enough to amortize the extra disk seek overhead, this fragmentation problem can result in excessive disk seeks and lead to poor restore performance that degrades the recovery time objective (RTO). RTO is a very important performance metric for any customer who buys backup products.

Fig. 1: An example of fragmented data.

The fragmentation problem in deduplication-based backup systems has been identified and studied to a certain extent by both industry and academia [3], [7], [5], [9], [10], [6]. To restore a backup stream, all these existing approaches have made a common, fundamental assumption that each read operation involves a large fixed number of contiguous chunks with a single disk seek. With this assumption, the disk seek time is sufficiently amortized to become negligible for each data read operation, and the read performance is determined by the percentage of referenced chunks per read in each backup stream. That is, existing approaches identify fragmented data based on the percentage of the referenced chunks in each fixed-size window (i.e., the size of the read data). If the percentage of the referenced chunks is smaller than a preset threshold, these referenced chunks will be identified as fragmented data and rewritten to be with the unique chunks of the same data stream to make future reads more sequential, trading off increased storage space for a reduced number of reads (disk seeks) in the restore process. Taking Fig. 1 as an example, chunk C may be rewritten to be with chunks E and F so that file A' can read them more sequentially.

Unfortunately, while the existing approaches can improve the data restore/read performance by identifying and rewriting the fragmental chunks, they fail to accurately identify and effectively remove data fragmentation. More specifically, since the amount of data involved in each read in the existing approaches is assumed to be a fixed-size unit, the identification of fragmented data is restricted to a fixed-size window (i.e., the size of the read data). But in reality, fragmented data vary in size and can appear in different, unpredictable address locations. Detecting fragmented data in a fixed-size window can restrict the size and location of the fragmented data that can be identified and cause many false positive detections.

Consider, for example, a group of referenced chunks stored sufficiently close to one another that they either reside in a single read window but fail to meet the preset percentage threshold of referenced chunks, or meet the threshold but are split into two neighboring read windows where the split parts in each neighborhood fail to meet the threshold. Clearly, in both cases this group of closely stored referenced chunks would be identified as fragmented data when they actually are not, resulting in false positive identifications. These false positive detections can lead to rewriting more fragmental chunks without substantially improving the restore performance.

Based on the aforementioned analysis and observations, we propose FGDEFRAG, a Fine-Grained defragmentation approach to improve restore performance in deduplication-based backup systems. The main idea and salient feature of FGDEFRAG is to use variable-sized and adaptively located groups, instead of the fixed-size windows of the existing approaches, to identify fragmented data and atomically read data for data restores. Specifically, FGDEFRAG first divides the data stream into variable-sized logical groups based on the on-disk store address affinity of the referenced chunks, i.e., contiguous and/or close-by referenced chunks are grouped into the same group and far-apart chunks are separated into different groups. Then, for each logical group, it identifies fragmented data by comparing the valid read bandwidth, defined to be the total data volume of the referenced chunks of the group divided by the time spent reading the entire group (referenced and non-referenced chunks) and doing the disk seek, to a preset bandwidth threshold. If it is smaller than the threshold, the corresponding referenced chunks are identified as fragmental chunks. Finally, FGDEFRAG organizes the fragmental chunks and the new unique chunks of the same backup stream into variable-sized physical groups and writes them to disks in batches. In data restores, the variable-sized physical and logical groups are each read atomically.

When we compare its performance to those of the existing approaches, FGDEFRAG has two unique advantages. First, FGDEFRAG identifies the fragmented data based on the variable-sized groups and the measure of valid read bandwidth of the referenced chunks for each group, enabling it to accurately identify fragmental chunks and only rewrite a minimal number of redundant data chunks with increased spatial locality to improve the restore performance. Second, FGDEFRAG reads variable-sized groups based on the address affinity of referenced chunks, rather than reading a fixed large number of chunks each time regardless of the disk addresses of the referenced and non-referenced chunks, enabling it to accurately locate and read the referenced chunks with fewer disk seeks and a smaller amount of data to improve the data restore performance. Our experimental results show that FGDEFRAG can achieve a restore performance improvement between 19% and 262% while simultaneously reducing the amount of rewritten data by 29% to 70%, when compared to the existing defragmentation approaches.

The rest of this paper is organized as follows. Section II describes the related work and our observations that motivate FGDEFRAG. The design and implementation are detailed in Section III. Section IV evaluates FGDEFRAG and Section V concludes the paper.

II. Background and Motivation

For a given backup stream, fragmented data that are stored separately from its unique chunks require many more disk seeks to restore than if they were stored sequentially with the unique chunks. In this section, we first review the state-of-the-art defragmentation approaches to help understand how they identify and remove the fragmental redundant chunks, and then analyze their common problems and present our observations to motivate the FGDEFRAG approach.

Fig. 2: An example of different data layouts on disks in existing solutions for backup workloads: (a) without any defragmentation algorithm, (b) with the HAR algorithm, (c) with the CAP algorithm, and (d) with the CBR algorithm. Note that grayed-out chunks indicate the redundant chunks.

A. Related Work on Defragmentation

In general, all existing defragmentation solutions for backup workloads define and quantify fragmented data based on the percentage of the data belonging to a given backup stream that is contained in a fixed-size atomic read operation. We use the example shown in Fig. 2 to help illustrate the main ideas of the existing approaches. In Fig. 2, there are two data objects, data object 1 with 20 chunks and data object 2 with 13 chunks. Data object 1 and data object 2 share 7 common chunks, B, C, H, I, J, O, and Q. All the chunks are stored in fixed-size containers of five chunks each on the disks. For data object 1, all of its chunks are stored sequentially in the first 4 containers. But for data object 2, without applying any defragmentation approach, its unique chunks are stored sequentially in the next two containers, 5 and 6, while its redundant (duplicate) chunks are stored separately, via address pointers, among the first 4 containers initiated by data object 1, as shown in Fig. 2(a). In this case, while no duplicate chunks are stored, it would require reading all 6 containers (1-6, 6 disk accesses) to restore data object 2, assuming that a container is the atomic read unit. When applying any of the defragmentation approaches, however, some of the redundant chunks are identified as fragmented data and rewritten to containers 5 and 6, to be stored along with the unique chunks of data object 2, as shown in Figures 2(b), 2(c), and 2(d). Next we describe some of the state-of-the-art defragmentation approaches.

1) HAR: The history rewriting algorithm (HAR) [3] uses a 4MB container as the atomic unit for data read operations and classifies the fragmented data into two categories based on container types, out-of-order containers and sparse containers. An out-of-order container is accessed intermittently and frequently during a restore, leading to degraded data read performance because of repeated disk accesses in a short period of time. This out-of-order container-induced problem is amenable to a cache-based solution because of the high temporal access locality of this type of container, and thus can be solved by employing powerful cache replacement algorithms, such as the Assembly Area used in CAP [7], OPT used in HAR [3], and LFK [6]. A sparse container is one for which the percentage of the referenced chunks is smaller than a preset threshold, indicating an insufficient amount of valid data in each read of the container. For example, in Fig. 2, if data object 2 represents a backup stream and the threshold is set to 50%, the three containers 1, 3 and 4 are regarded as sparse containers. HAR rewrites the referenced chunks in these sparse containers, B, C, O and Q, to containers 5 and 6, to be stored among the unique chunks of data object 2, as shown in Fig. 2(b). Thus, with HAR, one only needs to read three containers, 2, 5 and 6, to restore data object 2, at the cost of storing 4 duplicate chunks, B, C, O and Q.

2) CAP: The capping algorithm (CAP) [7] also assumes that each atomic read involves a fixed-size container and identifies fragmented data according to the number of containers that are referenced by a fixed-size segment in a backup stream, where a fixed-size segment is a small part of a backup stream composed of a fixed number of contiguous chunks. For a segment, if the number of referenced containers is larger than a preset integer M, CAP selects the top M containers that contain the most referenced chunks as non-fragmental containers; correspondingly, the remaining containers that contain fewer referenced chunks are identified as fragmental containers, and their referenced chunks are organized to be rewritten to new containers sequentially with the unique chunks. Again taking Fig. 2 as an example, if data object 2 represents a data segment and the integer M is 2, CAP identifies the two referenced containers 3 and 4, which have only one referenced chunk each, as fragmental containers. The referenced chunks in these two fragmental containers, O and Q, are then rewritten to containers 5 and 6, as shown in Fig. 2(c). Note that, with CAP, four containers, 1, 2, 5 and 6, need to be read to restore data object 2, but at a lower redundancy cost than HAR, storing only two duplicate chunks, O and Q.
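As a concrete illustration of the capping criterion just described, the following sketch selects, for one segment, the M most-referenced containers and marks the referenced chunks of all remaining containers for rewriting. The data layout is taken from the Fig. 2 example; the code itself is our illustrative reconstruction of CAP's selection rule, not the authors' implementation.

```python
from collections import Counter

def cap_select_rewrites(segment_refs, M):
    """segment_refs: list of (chunk_id, container_id) for the redundant chunks
    referenced by one fixed-size segment. Returns the chunk_ids CAP would rewrite."""
    per_container = Counter(cid for _, cid in segment_refs)
    if len(per_container) <= M:
        return []                              # few enough containers: rewrite nothing
    # keep the top-M containers holding the most referenced chunks
    kept = {cid for cid, _ in per_container.most_common(M)}
    return [chunk for chunk, cid in segment_refs if cid not in kept]

# Fig. 2 example: data object 2 references B,C (container 1), H,I,J (container 2),
# O (container 3) and Q (container 4); with M = 2, only O and Q are rewritten.
refs = [("B", 1), ("C", 1), ("H", 2), ("I", 2), ("J", 2), ("O", 3), ("Q", 4)]
print(cap_select_rewrites(refs, M=2))          # -> ['O', 'Q']
```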

3) CBR: The context-based rewriting algorithm (CBR) [5] uses a measure called Rewrite Utility to decide whether a given referenced chunk is fragmented data, which differs from CAP and HAR in that they identify an entire group of referenced chunks as fragmented data. Rewrite Utility is defined to be the size of the chunks that are in the disk context but not in the stream context, divided by the size of the chunks in the disk context. The disk context, in this case, is defined as the set of chunks following the decision chunk on disk, and the stream context is defined as the set of chunks following the decision chunk in the backup stream, where the decision chunk is the chunk that will be identified as a fragmental or non-fragmental chunk in the near future. Both the stream context and the disk context are of fixed sizes, and the size of the disk context is just the size of the data volume read in each atomic read operation, empirically set to 2MB. For a decision chunk, if the Rewrite Utility is higher than a preset minimal value, the chunk is regarded as a fragmental chunk. For the example in Fig. 2, if the stream context is the 10 chunks following the decision chunk, the disk context is the five chunks following the decision chunk, and the minimal value is set to 75%, the four chunks referenced by data object 2, B, C, O and Q, would be regarded as fragmental chunks, since their Rewrite Utilities are 80% or 100%, higher than 75%. These fragmental chunks would be rewritten to containers 5 and 6 along with the unique chunks of data object 2, as shown in Fig. 2(d).
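The Rewrite Utility computation described above can be written down directly; the short sketch below reproduces the Fig. 2 number for decision chunk B under the stated context sizes (five chunks of disk context, ten chunks of stream context). It is a worked illustration of the definition, not code from CBR itself, and it assumes equal-sized chunks so that sizes can be counted in chunks.

```python
def rewrite_utility(disk_context, stream_context):
    """Fraction of the disk context (by size; here, by equal-sized chunk count)
    that is not part of the stream context."""
    not_in_stream = [c for c in disk_context if c not in set(stream_context)]
    return len(not_in_stream) / len(disk_context)

# Decision chunk B of data object 2 (Fig. 2):
disk_ctx = ["C", "D", "E", "F", "G"]      # the five chunks following B on disk
stream_ctx = ["C", "H", "I", "J", "W", "X", "Y", "Z", "O", "Q"]  # the ten chunks following B in the stream
u = rewrite_utility(disk_ctx, stream_ctx)  # 4/5 = 0.8
print(u > 0.75)                            # True: chunk B is rewritten
```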
4) Other approaches: HAR, CAP and CBR are three prominent defragmentation solutions for deduplication-based backup systems. In addition, iDedup [15] is a dynamic defragmentation solution for primary storage workloads. Nam et al. [9] use a quantitative metric to measure the fragmentation level and propose a selective deduplication scheme [10] to reduce the data fragmentation for backup workloads. The metric is calculated based on how many containers are referenced by a backup stream, which is similar to CAP.

B. Motivation

Our review of existing defragmentation solutions for backup workloads reveals a common, fundamental assumption they share: each read operation in the restore process involves a large fixed number of contiguous chunks. The main rationale for this assumption is to amortize the disk seek time with a long data transfer time in the read operation. However, when restoring a backup stream, the effective data restore performance is determined not only by the total amount of time but also by the percentage of the referenced chunks per read. Thus, existing approaches identify the fragmented data based on the percentage of the referenced chunks in each fixed-size window (i.e., the size of the read data) and rewrite the fragmented data to be among the unique chunks of the same data stream to improve data restore performance, trading off increased space for a reduced number of reads (disk seeks) in the restore process. Ideally, the additional storage space for storing the duplicate chunks identified as fragmented data should be as small as possible while the number of disk seeks is kept at a minimum. This clearly requires the detection of fragmented data to be highly accurate and efficient. Unfortunately, fragment detection based on fixed-size read windows in the existing defragmentation approaches for backup workloads can be very inaccurate.

For a given backup stream, the referenced chunks can be grouped naturally into different data regions based on their on-disk store address proximity and affinity, i.e., contiguous or closely-located referenced chunks are grouped into the same region and far-apart chunks are separated into different regions, resulting in regions of different sizes. Taking Fig. 2 as an example, chunks B and C can be grouped into the same region with a size of two chunks, while chunks H, I, and J can be grouped together into a region with a size of three chunks. Thus, for a given data region, the disk seek time accounts for a different percentage of the total reading time depending on the size of the region, making the disk-seek overhead different for differently-sized regions. That is, a larger region results in a lower disk-seek overhead, and vice versa. Therefore, treating all redundant chunks equally by grouping them into fixed-size windows to identify the fragmental chunks and atomically read data for data restores, like the fixed-size containers in HAR and CAP and the fixed-size disk context in CBR, would surely result in a missed opportunity to explore and exploit the address affinity of these redundant chunks of the data stream to optimize the restore/read performance.
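To make the size dependence of the disk-seek overhead concrete, consider the seek time and bandwidth assumed later in the evaluation (a 10 ms seek and 100 MB/s of sequential bandwidth); the two region sizes below are chosen purely for illustration:

$$\frac{t_{\mathrm{seek}}}{t_{\mathrm{seek}} + \mathrm{size}/B}\Bigg|_{256\,\mathrm{KB}} = \frac{10\,\mathrm{ms}}{10\,\mathrm{ms} + 2.5\,\mathrm{ms}} = 80\%, \qquad \frac{t_{\mathrm{seek}}}{t_{\mathrm{seek}} + \mathrm{size}/B}\Bigg|_{8\,\mathrm{MB}} = \frac{10\,\mathrm{ms}}{10\,\mathrm{ms} + 80\,\mathrm{ms}} \approx 11\%.$$

A small region thus pays mostly for seeking while a large region pays mostly for useful transfer, an asymmetry that a fixed-size read window cannot capture.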
These observations motivate us to propose FGDEFRAG, which uses variable-sized and adaptively located data regions based on address affinity, instead of the fixed-size regions of the existing approaches, both for identifying and removing fragmented data and for atomically reading data during data restores, to optimize the restore performance in deduplication-based backup systems.

III. FGDEFRAG Approach

In this section, we first present FGDEFRAG's architecture and then describe the design of its key functional components.

Fig. 3: FGDEFRAG's architecture: the RAM-resident index cache and group cache, the on-disk group pool, fingerprint index table and group table, and the three-step data path of (1) data grouping, (2) fragment identification and (3) group store.

A. Architectural Overview

FGDEFRAG is composed of three key functional components: data grouping, fragment identification, and group store. Fig. 3 shows its architecture and critical data path. Data grouping divides the referenced redundant chunks of each backup stream into variable-sized logical groups according to their on-disk address affinity, where the redundant chunks and their on-disk addresses are found by querying the fingerprint index table. After the logical groups are identified and generated, the fragment identification component examines each group to determine whether its referenced redundant chunks are fragmented data by measuring the valid read bandwidth. If the valid read bandwidth is smaller than a preset threshold, the corresponding referenced chunks are identified as fragmental chunks. Finally, the group store component organizes the fragmental chunks and the unique chunks of each backup stream into variable-sized physical groups and writes them to the group pool on the disk, where a group table is used to store the start and end addresses of each physical group for future group retrieval. To restore a backup stream, FGDEFRAG uses a group cache that can integrate any appropriate cache replacement algorithm to improve the restore speed. Next, we describe these three components and illustrate how FGDEFRAG identifies and removes the fragmented data.

B. Data Grouping
Data grouping focuses on the referenced redundant chunks within each backup stream. Because fragmented data is caused by the redundant chunks being stored separately from the unique chunks, restoring it requires multiple extra disk seeks. We next describe the grouping process for the redundant chunks and present a critical factor, called the group gap, that affects the data grouping efficiency.

1) The Grouping Process: For each backup stream, data grouping is carried out in two steps.

In the first step, the backup stream is divided into large fixed-size segments [7] and the redundant chunks in each segment are sorted according to their on-disk addresses. In our current design, each segment is set to be 16MB by default. The redundant chunks and their on-disk addresses are identified by searching for the fingerprint of each chunk in the fingerprint index table. A hit in the table means that the decision chunk is a redundant chunk and an identical chunk has already been stored on the disks; otherwise, the chunk is unique.

In the second step, the sorted redundant chunks in each segment are divided into variable-sized logical groups according to their address affinity, i.e., contiguous and/or closely-located chunks are grouped into the same group and far-apart chunks are separated into different groups, where each logical group starts and ends with a redundant chunk. Note that, if the redundant chunks in one logical group are not contiguous, the group can contain some non-referenced chunks (i.e., chunks not belonging to the segment being processed) in its address space.

Fig. 4 illustrates the data grouping process for a segment. Fig. 4(a) shows the original sequence of the redundant chunks in the segment, and Fig. 4(b) shows these redundant chunks sorted in ascending order of their addresses, the result of Step 1. After Step 2 of the process, Fig. 4(c) shows the three logical groups formed according to their address affinity.

Fig. 4: An example of the data grouping process: (a) the original sequence of the redundant chunks in the segment, (b) the sorted list of the redundant chunks, and (c) the logical groups in the segment.

2) The Group Gap: A critical issue for data grouping is how to determine quantitatively whether two redundant chunks in a certain proximity are close enough to be in the same logical group or sufficiently far apart to be separated into two different logical groups. If two chunks are far away from each other, i.e., they are separated by many non-referenced chunks in the address space, but are placed in the same logical group, the relatively high percentage of non-referenced chunks in the group would significantly affect the fragment identification in a negative way. Taking the case in Fig. 4(c) as an example, if chunk H in group 2 were instead included in group 1, 42 non-referenced chunks between chunk G (address 1010) and chunk H (address 1052) would be added to group 1, causing group 1 to be identified as a fragmental group with a much higher probability than when chunk H is in group 2, since the referenced redundant chunks would now account for a much smaller percentage of the chunks in the new group 1.

To address this issue, FGDEFRAG uses the group gap, a distance threshold measured in MBs, to quantitatively determine whether two referenced redundant chunks are close enough to or sufficiently far away from each other. The group gap is defined to be the disk bandwidth multiplied by the disk seek time. In other words, two redundant chunks are considered sufficiently far apart to be placed in two different logical groups if the amount of non-referenced data between them takes the disk a time equal to or greater than its disk seek time to transfer (read or write); they are considered close enough to be in the same logical group otherwise. The rationale for using this distance threshold is based on the fact that each group is read atomically with one disk seek for data restoration: if the non-referenced data between two referenced chunks needs more time to transfer than a disk seek, it is more efficient, in terms of the valid read bandwidth of the segment, to put these two referenced chunks into two different groups.
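The two-step grouping just described, together with the group-gap rule, can be sketched as follows. Chunk addresses are treated as byte offsets, and the 10 ms seek time and 100 MB/s bandwidth from the evaluation section are used, giving a group gap of roughly 1 MB; the data structures are our own illustrative simplifications rather than FGDEFRAG's actual format.

```python
SEEK_TIME = 0.010               # seconds, the seek time assumed in Section IV
BANDWIDTH = 100 * 1024**2       # bytes/second, the bandwidth assumed in Section IV
GROUP_GAP = SEEK_TIME * BANDWIDTH   # ~1MB: non-referenced data that costs as much as one extra seek

def form_logical_groups(redundant_chunks):
    """redundant_chunks: list of (on_disk_address, size) pairs for one segment.
    Step 1 sorts them by address; Step 2 starts a new group whenever the run of
    non-referenced data between two neighbors reaches the group gap."""
    if not redundant_chunks:
        return []
    chunks = sorted(redundant_chunks)             # Step 1: sort by on-disk address
    groups, current = [], [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):     # Step 2: split on large gaps
        gap = cur[0] - (prev[0] + prev[1])        # non-referenced bytes between the two
        if gap >= GROUP_GAP:
            groups.append(current)
            current = [cur]
        else:
            current.append(cur)
    groups.append(current)
    return groups    # each logical group starts and ends with a redundant chunk
```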
C. Fragment Identification

For each logical group, FGDEFRAG uses the following formula to decide whether it is a fragmental group; if the inequality is false, the group is considered fragmental:

$$\frac{x}{\,t + \frac{x+y}{B}\,} \;\ge\; B \cdot \frac{1}{N} \qquad (1)$$

In this inequality, B is the disk bandwidth, t the disk seek time, N a non-zero positive integer representing the bandwidth threshold factor explained next, x the total size of the referenced chunks in the group, and y the total size of the non-referenced chunks in the group (i.e., x + y is the total size of the logical group). Note that the denominator, t + (x+y)/B, on the left-hand side of the inequality represents the total time required to read the entire logical group, including the disk seek time, and the whole left-hand expression, x/(t + (x+y)/B), represents the effective bandwidth of reading all the referenced data during this time, i.e., the valid read bandwidth mentioned earlier. The expression on the right-hand side of the inequality, B · (1/N), represents the bandwidth threshold, a given fraction of the full disk bandwidth B. In other words, a logical group is considered a fragmental group, and its referenced chunks are regarded as fragmental chunks, if the valid read bandwidth is smaller than the bandwidth threshold. This formula enables FGDEFRAG to accurately identify the fragmented data because it strictly follows the data reading process, including the disk seek and the data reading, to calculate the valid read bandwidth of the referenced chunks.
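Formula (1) translates directly into code. The sketch below evaluates the valid read bandwidth of one logical group and flags it as fragmental when the inequality fails; the parameter values are the ones assumed in the evaluation, with N = 10 as the default used there, and the group representation is an illustrative assumption.

```python
def is_fragmental(group, B=100 * 1024**2, t=0.010, N=10):
    """Apply Formula (1) to one logical group, given as a list of
    (address, size, referenced) tuples covering the group's address range."""
    x = sum(size for _, size, referenced in group if referenced)      # referenced bytes
    y = sum(size for _, size, referenced in group if not referenced)  # non-referenced bytes
    valid_read_bandwidth = x / (t + (x + y) / B)   # referenced bytes per second of group read
    return valid_read_bandwidth < B / N            # below the threshold: rewrite these chunks

# A 64KB group in which only half the bytes are referenced has a valid read
# bandwidth of roughly 3 MB/s, well below the ~10 MB/s threshold, so its
# referenced chunks are marked fragmental.
group = [(1_000_000, 16 * 1024, True), (1_016_384, 32 * 1024, False), (1_049_152, 16 * 1024, True)]
print(is_fragmental(group))    # True
```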
D. Group Store

FGDEFRAG stores two kinds of groups, logical groups and physical groups, in order to read the referenced chunks of each segment for data restoration.

1) Physical Group Store: After the fragmented data has been identified, FGDEFRAG organizes the fragmental chunks and the unique chunks of each segment into a single group, called a physical group, and then appends it to the disk. The size of such a physical group can vary from one segment to another. To maximize the data write performance, FGDEFRAG writes multiple physical groups to disks in a batch mode. Meanwhile, it uses a group table to store the group information for future group retrieval, where each entry consists of three fields: the group number, the starting address and the ending address. The group number is used to identify each group, whereas the starting address and ending address are used to locate the stored addresses of each group on disks.

2) Logical Group Store: Besides the physical groups, the logical groups that are not identified as fragmental groups based on Formula (1) must also be remembered, in order for data restores to easily locate the referenced non-fragmental chunks of each segment. Similar to the case of physical groups, FGDEFRAG stores the group information of each logical group in the group table, consisting of the group number, starting address and ending address. Both the physical groups and logical groups are numbered automatically and incrementally, adding 1 for each new group to the latest group number. However, in order to provide flexibility and efficiency, not all of the logical groups that are not identified as fragmental groups need to be numbered and remembered as new groups in the group table. Fig. 5 shows the relationships between the referenced non-fragmental chunks and their logical groups and the corresponding physical groups, where the chunks in each logical group are either stored in the same physical group or span two adjacent physical groups. As indicated in Fig. 5, a logical group that completely resides in a single physical group can leverage the same information as its hosting physical group for the purpose of chunk identification and location, and thus need not be remembered as a new group.

Fig. 5: Chunks' relationship between logical groups and physical groups.

IV. Experimental Evaluation

In this section, we assess the benefits of FGDEFRAG with an extensive experimental evaluation.

A. Experimental Setup

Baseline defragmentation approaches. We implemented FGDEFRAG in Destor [4] and compare its performance with the three state-of-the-art and prominent defragmentation approaches introduced in Section II, HAR [3], CAP [7], and CBR [5], which are referred to as the baseline defragmentation approaches in this evaluation. We implemented these three defragmentation approaches in Destor, and the thresholds used for fragment identification are set to their default values based on the original publications. Specifically, the rewrite utility in CBR is set to 0.7, the percentage threshold used to identify sparse containers in HAR is set to 0.5, and the M used to identify fragmental containers in CAP is set to 8, with a container size of 4MB and a segment size of 16MB. In FGDEFRAG, we set the bandwidth threshold factor N to 10. For evaluating the restore performance, we also implement two baseline non-defragmentation approaches that do not use any rewrite algorithm but use the LRU and OPT cache algorithms for restore optimizations.

Performance metrics and evaluation objectives. We compare these approaches using two performance metrics, deduplication ratio and restore performance. Deduplication ratio, defined as the amount of data removed by deduplication divided by the total amount of data in the backup stream, is an important metric for defragmentation space efficiency. Since all the defragmentation approaches try to decrease data fragmentation by rewriting some fragmental chunks, at the cost of storing duplicate chunks, the deduplication ratio quantifies this defragmentation space cost. In other words, a higher deduplication ratio indicates a lower space cost. Restore performance is an important metric for defragmentation effectiveness, since the goal of defragmentation is to improve the restore performance. For each backup stream, the restore performance is defined to be the total amount of data in the whole backup stream divided by the time spent reading the data regions containing the referenced chunks, including the time spent on data reads and disk seeks. In our experiments, each disk seek takes 10ms and the disk bandwidth is 100MB/s.

Evaluation workloads: the datasets. The datasets used in our experiments come from the public archive of traces and snapshots [16], including the MAC snapshots and Fslhome datasets. The MAC snapshots dataset was collected on a Mac OS X Snow Leopard server running in an academic computer lab, whereas the Fslhome dataset contains snapshots of four students' home directories from a shared network file system. In both datasets, the average chunk size is 8KB. Table I shows their other characteristics.

TABLE I: Characteristics of datasets.
Dataset name    MAC snapshots   Fslhome
# of versions   100             11
Total size      6.36TB          3.4TB
Unique size     6.29GB          400GB
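For clarity, the sketch below shows how we read the two metric definitions above; the per-restore inputs (number of seeks, bytes read) are illustrative variable names, with the seek time and bandwidth fixed at the stated 10 ms and 100 MB/s.

```python
DISK_SEEK = 0.010                  # seconds per seek, as stated above
DISK_BANDWIDTH = 100 * 1024**2     # bytes per second

def deduplication_ratio(removed_bytes, total_backup_bytes):
    """Fraction of the backup stream eliminated by deduplication (higher means lower space cost)."""
    return removed_bytes / total_backup_bytes

def restore_performance(total_backup_bytes, bytes_read, num_seeks):
    """Backup-stream bytes restored per second of reading plus seeking (higher is better)."""
    restore_time = num_seeks * DISK_SEEK + bytes_read / DISK_BANDWIDTH
    return total_backup_bytes / restore_time
```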

B. Deduplication Ratio

TABLE II: Comparison between FGDEFRAG, the baseline defragmentation approaches (CAP, CBR, HAR) and the baseline non-defragmentation approach (None) in terms of the deduplication ratio and the amount of rewritten data for all the backup versions of the MAC snapshots and Fslhome datasets.
Approaches   Deduplication Ratio (%)    Rewritten Data (GB)
             MAC       Fslhome          MAC       Fslhome
FGDEFRAG     96.4      91.5             163       112
CAP          90        85               542       380.8
CBR          95.3      90               231       175
HAR          98        93               50        88.2
None         99        93.5             0         0

Table II compares the deduplication ratios and the amounts of rewritten data between FGDEFRAG and the baseline approaches for all the 100 MAC snapshots and 14 versions of the Fslhome datasets. The results indicate that FGDEFRAG's deduplication ratio is 6.4% higher than CAP and 1.1% higher than CBR for the MAC snapshots dataset, and 6.5% higher than CAP and 1.5% higher than CBR for the Fslhome dataset. In the meantime, FGDEFRAG rewrites 70% and 29.4% less data than CAP and CBR, respectively, for the MAC snapshots dataset, and 70.6% and 36% less data than CAP and CBR for the Fslhome dataset. The significant reductions in the amount of rewritten data by FGDEFRAG are due to its accurate identification of the data fragments, which minimizes false positive identifications and thus rewrites far fewer redundant chunks to disks.

Moreover, FGDEFRAG underperforms HAR in deduplication ratio for both datasets, even though it also reduces false positive fragmental data as described in Section II-B. The main reason is that HAR has more false negatives; that is, it misses identifying some local fragmental chunks, and thus rewrites fewer redundant chunks to disks for restore performance optimization. More specifically, compared to all the other defragmentation approaches, which identify fragmental chunks locally within a segment (i.e., a small part of a backup stream), HAR identifies the fragmental chunks in sparse containers globally across a whole backup stream. As a result, each container in HAR can have more referenced chunks in the global backup stream than in a small segment. A container that is not identified as a sparse container because it meets the percentage threshold of globally referenced chunks may well be one that cannot meet the percentage threshold in a segment locally. Unable to identify such locally sparse containers, HAR therefore rewrites fewer fragmental (redundant) chunks.

C. Restore Performance

Restore performance is the most important metric for measuring the effectiveness of defragmentation approaches. In this subsection, we compare the restore performance of FGDEFRAG with the defragmentation baseline approaches (CAP, CBR, HAR) and the non-defragmentation baseline approach based on the LRU and OPT cache algorithms (simply referred to as None). For CAP, CBR and HAR, various cache algorithms are employed to improve the restore performance: for example, CAP uses the Assembly Area [7], HAR uses OPT [3], and CBR uses LFK [6]. For the FGDEFRAG and non-defragmentation baseline approaches, both the LRU and OPT cache algorithms are implemented, i.e., FGDEFRAG+LRU/OPT and None+LRU/OPT. Both the LRU and OPT cache replacement algorithms are implemented based on their basic reading/storage units. For example, when the cache is full, FGDEFRAG evicts the least recently used group under the LRU algorithm, or the group that will not be accessed for the longest time in the future under the OPT algorithm.

Moreover, because the sequence of reading chunks during a restore is just the same as the sequence of writing them during a backup, we implemented the OPT cache algorithm with the help of the backup process, in order to know the future chunk access patterns. Specifically, during a backup process, when a chunk is processed, through either redundancy elimination or data writing, it has a storage unit. We record the IDs of these storage units during the backup processes, and then use these ID sequences to guide chunk reading during data restores and to help the OPT cache algorithm know which storage unit will not be accessed for the longest time in the future, so as to make room for other storage units.

TABLE III: Comparison between FGDEFRAG and the baseline approaches in terms of the number of disk seeks and the total size of read data required to restore the last version of the MAC snapshots dataset with a 1GB cache.
Approaches       # of disk seeks   Read data (GB)
None+LRU         72402             303.68
None+OPT         55346             232.24
HAR              29420             123.4
CAP              22167             92.98
CBR              16006             67.13
FGDEFRAG+LRU     13228             59.26
FGDEFRAG+OPT     12790             57.43
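As a concrete illustration of how a recorded ID sequence drives the OPT (Belady) eviction described above, the following sketch evicts the cached group whose next use lies farthest in the future. The group-granularity cache and the pre-recorded access sequence are assumptions made for illustration, and the cache capacity is counted in groups rather than bytes to keep the example short.

```python
def opt_restore_hits(access_sequence, capacity):
    """Simulate an OPT (Belady) cache over a recorded sequence of group IDs.
    Returns the number of cache hits; evicts the group reused farthest in the future."""
    cache, hits = set(), 0
    for i, gid in enumerate(access_sequence):
        if gid in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            # victim: the cached group whose next access is farthest away (or never occurs)
            def next_use(g):
                try:
                    return access_sequence.index(g, i + 1)
                except ValueError:
                    return float("inf")
            cache.discard(max(cache, key=next_use))
        cache.add(gid)
    return hits

# e.g. a 2-group cache over a short recorded sequence of group IDs
print(opt_restore_hits([1, 2, 3, 1, 2, 1, 4, 1], capacity=2))   # -> 3
```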
Fig. 6 and Fig. 7 compare the restore performance of FGDEFRAG and the baseline approaches when processing the MAC snapshots and Fslhome datasets, respectively. As shown, FGDEFRAG outperforms all the baseline approaches consistently in all cases. For the MAC snapshots dataset, FGDEFRAG on average outperforms CAP, CBR and HAR by 60%, 20% and 176%, respectively, when the cache size is 512MB; by 63%, 19% and 116%, respectively, when the cache size is 1GB; and by 62%, 19.6% and 23%, respectively, when the cache size is 2GB. For the Fslhome dataset, FGDEFRAG outperforms CAP, CBR and HAR by 27%, 38% and 262%, respectively, with a 512MB cache; by 30%, 37% and 217% with a 1GB cache; by 35%, 38% and 159% with a 2GB cache; and by 43%, 39% and 76% with a 4GB cache.

There are two reasons for such significant improvements in FGDEFRAG. First, FGDEFRAG accurately identifies the fragmental chunks, and thus only rewrites a minimal number of redundant data chunks with increased spatial locality to improve the restore performance. Second, FGDEFRAG reads variable-sized groups based on the address affinity of the referenced chunks, rather than reading a fixed, large amount of chunks each time regardless of the disk addresses of the referenced and non-referenced chunks. Thus, FGDEFRAG can accurately locate and read the referenced chunks with a higher valid read bandwidth for data restores. The combination of these two methods enables FGDEFRAG to restore the datasets with fewer disk seeks and a smaller amount of read data than the baseline approaches, as detailed in Table III, leading to much higher restore performance.

Fig. 6: Comparison between FGDEFRAG (with LRU and OPT caches) and the baseline approaches (CAP, CBR, HAR, None+LRU, None+OPT) in restore performance with the last 10 versions of the MAC snapshots dataset, for cache sizes of (a) 512MB, (b) 1GB and (c) 2GB. Speedup represents the restore performance normalized by that of the None+LRU approach.

Fig. 7: Comparison between FGDEFRAG (with LRU and OPT caches) and the baseline approaches in restore performance with the last 10 versions of the Fslhome dataset, for cache sizes of (a) 512MB, (b) 1GB, (c) 2GB and (d) 4GB. Speedup represents the restore performance normalized by that of the None+LRU approach.

Since HAR misses identifying and rewriting local fragmental chunks for restore performance optimization (as described in Section IV-B), its performance is only comparable to that of the baseline non-defragmentation approach using the OPT cache replacement algorithm when the cache size is 512MB, 1GB, 2GB or 4GB for the Fslhome dataset. For the MAC snapshots dataset, it has the lowest restore performance among the three baseline defragmentation approaches when the cache size is 512MB or 1GB. The main reason is that HAR cannot capture the spatial locality of the local fragmental chunks in the restore cache, and thus it needs more disk accesses for those fragmental chunks it failed to identify. When the cache size is increased to 2GB for the MAC snapshots dataset, HAR outperforms CAP and CBR because of the increased spatial locality captured by the large cache. However, due to the hardware cost and the multi-user/multi-job environments in which the limited cache space is shared among multiple applications, HAR's advantage under the assumption of a large cache is likely to either diminish or become very costly. Moreover, because the MAC snapshots and Fslhome datasets only require a disk space of 63GB and 191GB, respectively, to store their unique data chunks, we did not evaluate their restore performance under a cache size of more than 2GB for the MAC snapshots dataset and 4GB for the Fslhome dataset.

D. Sensitivity Studies

In the design space of FGDEFRAG, there is one important design parameter, namely the bandwidth threshold factor N, that significantly impacts both the deduplication ratio and the restore performance. FGDEFRAG identifies fragmental groups by comparing the valid read bandwidth to a specified bandwidth threshold (i.e., factor N in Formula (1)). Thus, different bandwidth thresholds are likely to have different impacts on the deduplication ratio and restore performance measures. Fig. 8 shows FGDEFRAG's deduplication ratio and restore performance as a function of the bandwidth threshold factor N while using the LRU algorithm and the OPT algorithm, respectively. The experiment restores the last 20 backup versions of the MAC snapshots dataset.

As seen from the results, the deduplication ratio increases with N, especially from 2 to 4, where it increases from 88% to 94%. On the other hand, the restore performance decreases significantly as N increases. For example, under the LRU algorithm with a 256MB cache, the restore performance is reduced by about 30% when N grows from 2 to 18. This is because, when N increases, FGDEFRAG is likely to identify fewer logical groups as fragmental groups, so fewer fragmental (redundant) chunks are rewritten to disks to improve the restore performance, which also results in a higher deduplication ratio. To properly trade off between the deduplication ratio and restore performance, we need to select appropriate values of N for different datasets. In our experimental datasets, the appropriate N values are found to range from 6 to 10.

Fig. 8: Sensitivity of the restore performance (MB/s) and deduplication ratio to the value of the bandwidth threshold factor N, under FGDEFRAG+LRU (for cache sizes of 256MB, 512MB, 1GB and 2GB) and FGDEFRAG+OPT. The restore performance is the average value over the last 20 versions of the MAC snapshots dataset. The 9 data points on each curve represent the corresponding restore performance and deduplication ratio for each of the 9 N values, 2, 4, 6, 8, 10, 12, 14, 16, 18, from left to right.

V. Conclusion

In this paper we introduce FGDEFRAG, a new defragmentation approach that is more accurate and efficient than existing defragmentation approaches. FGDEFRAG is a fine-grained approach that uses variable-sized and adaptively located logical chunk groups, based on the address affinity of the chunks, to identify and remove fragmentation. FGDEFRAG's high accuracy in fragmentation detection and its effective exploitation of chunk locality enable it to improve both the restore performance and the deduplication ratio. Our experimental results show that FGDEFRAG outperforms three state-of-the-art defragmentation schemes, CAP, CBR and HAR, in restore performance by 27% to 63%, 19% to 39%, and 23% to 262%, respectively. In terms of deduplication ratio, FGDEFRAG also outperforms CAP and CBR but slightly underperforms HAR, because HAR identifies the fragmental chunks globally, at the expense of missing the detection of some local fragmental chunks; HAR therefore rewrites fewer redundant chunks, leading to a slightly higher deduplication ratio.

Acknowledgment

The authors wish to thank the reviewers for their constructive comments. This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61402061 and No. 61672116, the Chongqing Basic and Frontier Research Project of China under Grants No. cstc2016jcyjA0274 and No. cstc2016jcyjA0332, the Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20130191120031, NSF CCF-1629625, and a NetApp grant.

References

[1] D. Bhagwat, K. Eshghi, D. D. E. Long, and M. Lillibridge. Extreme Binning: Scalable, parallel deduplication for chunk-based file backup. Technical Report HPL-2009-10R2, HP Laboratories, Sep. 2009.
[2] C. Dubnicki, L. Gryz, L. Heldt, M. Kaczmarczyk, W. Kilian, P. Strzelczak, J. Szczepkowski, C. Ungureanu, and M. Welnicki. HYDRAstor: A scalable secondary storage. In FAST'09, Feb. 2009.
[3] M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, F. Huang, and Q. Liu. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In USENIX ATC'14, Jun. 2014.
[4] M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhuang, and Y. Tan. Reducing fragmentation impact with forward knowledge in backup systems with deduplication. In USENIX FAST'15, Feb. 2015.
[5] M. Kaczmarczyk, M. Barczynski, and C. Dubnicki. Reducing impact of data fragmentation caused by in-line deduplication. In ACM SYSTOR'12, Jun. 2012.
[6] M. Kaczmarczyk and C. Dubnicki. Reducing fragmentation impact with forward knowledge in backup systems with deduplication. In ACM SYSTOR'15, Jun. 2015.
[7] M. Lillibridge, K. Eshghi, and D. Bhagwat. Improving restore speed for backup systems that use inline chunk-based deduplication. In USENIX FAST'13, Feb. 2013.
[8] M. Lillibridge, K. Eshghi, D. Bhagwat, V. Deolalikar, G. Trezise, and P. Campbell. Sparse Indexing: Large scale, inline deduplication using sampling and locality. In FAST'09, Feb. 2009.
[9] Y. Nam, G. Park, G. Lu, W. Xiao, and D. H. Du. Chunk fragmentation level: An effective indicator for read performance degradation in deduplication storage. In IEEE HPCC'11, Sep. 2011.
[10] Y. J. Nam, D. Park, and D. H. Du. Assuring demanded read performance of data deduplication storage with backup datasets. In IEEE MASCOTS'12, Aug. 2012.
[11] NIST. Secure Hash Standard. FIPS PUB 180-1, May 1993.
[12] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In FAST'02, Jan. 2002.
[13] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[14] P. Shilane, R. Chitloor, and U. K. Jonnala. 99 deduplication problems. In USENIX HotStorage'16, Jun. 2016.
[15] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. iDedup: Latency-aware, inline data deduplication for primary storage. In USENIX FAST'12, Feb. 2012.
[16] Traces and Snapshots Public Archive. http://tracer.filesystems.org/.
[17] W. Xia, H. Jiang, D. Feng, and Y. Hua. SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput. In USENIX ATC'11, Jun. 2011.
[18] L. L. You, K. T. Pollack, and D. D. E. Long. Deep Store: An archival storage system architecture. In ICDE'05, Apr. 2005.
[19] B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the Data Domain deduplication file system. In FAST'08, Feb. 2008.