Delta Compression Techniques
Torsten Suel
Department of Computer Science and Engineering, Tandon School of Engineering, New York University, Brooklyn, NY, USA

© Springer International Publishing AG, part of Springer Nature 2018. S. Sakr, A. Zomaya (eds.), Encyclopedia of Big Data Technologies, https://doi.org/10.1007/978-3-319-63962-8_63-1

Synonyms

Data differencing; Delta encoding; Differential compression

Definition

Delta compression techniques encode a target file with respect to one or more reference files, such that a decoder who has access to the same reference files can recreate the target file from the compressed data. Delta compression is usually applied in cases where there is a high degree of redundancy between target and reference files, leading to a much smaller compressed size than could be achieved by just compressing the target file by itself. Typical application scenarios include revision control systems and versioned file systems that store many versions of a file or software or content updates over networks where the recipient already has an older version of the data. Most work on delta compression techniques has focused on the case of textual and binary files, but the concept can also be applied to multimedia and structured data.

Delta compression should not be confused with Elias delta codes, a technique for encoding integer values, or with the idea of coding sorted sequences of integers by first taking the difference (or delta) between consecutive values. Also, delta compression requires the encoder to have complete knowledge of the reference files and thus differs from more general techniques for redundancy elimination in networks and storage systems where the encoder has limited or even no knowledge of the reference files, though the boundaries with that line of work are not clearly defined.

Overview

Many applications of big data technologies involve very large data sets that need to be stored on disk or transmitted over networks. Consequently, data compression techniques are widely used to reduce data sizes. However, there are many scenarios where there are significant redundancies between different data files that cannot be exploited by compressing each file independently. For example, there may be many versions of a document, say a Wikipedia article, that differ only slightly from one to the next. Delta compression techniques attempt to exploit such redundancy between pairs or groups of files to achieve better compression.

We define delta compression for the most common case of a single reference file. Formally, we have two strings (files): $f_{tar} \in \Sigma^*$ (the target file) and $f_{ref} \in \Sigma^*$ (the reference file). Then the goal for an encoder $E$ with access to both $f_{tar}$ and $f_{ref}$ is to construct a file $f_\delta \in \Sigma^*$ of minimum size, such that a decoder $D$ can reconstruct $f_{tar}$ from $f_{ref}$ and $f_\delta$. We also refer to $f_\delta$ as the delta of $f_{tar}$ and $f_{ref}$.

Given this definition, research on delta compression techniques has focused on three challenges: (1) How to build delta compression tools that result in $f_\delta$ of small size while also achieving fast compression and decompression speeds and small memory footprint. (2) How to use delta compression techniques to compress collections of files, i.e., how to select suitable reference files given a target file, or how to select a sequence of delta compression steps between files to achieve good compression for a collection ideally while also allowing fast retrieval of individual files. (3) How to apply delta compression in various application scenarios.

Key Research Findings

String-to-String Correction and Differencing

Some early related work was done by Wagner and Fisher (1974), who studied the string-to-string correction problem. This is the problem of finding the best (shortest) sequence of insert, delete, and update operations that transform one string $f_{ref}$ to another string $f_{tar}$. The solution in Wagner and Fisher (1974) is based on finding the longest common subsequence of the two strings using dynamic programming and adding all remaining characters to $f_{tar}$ explicitly. However, the string-to-string correction problem does not capture the full generality of the delta compression problem. In particular, in the string-to-string correction problem, it is implicitly assumed that the data common to $f_{tar}$ and $f_{ref}$ appears in (roughly) the same order in the two files, and the approach cannot exploit the case of substrings in $f_{ref}$ appearing repeatedly in $f_{tar}$.

This early work motivated Unix tools such as diff and bdiff, which create a patch, i.e., a set of edit commands, that can be used to update the reference file to the target file. Such tools are widely used to create patches for software updates, and one advantage is that these patches are easily readable by humans who need to understand the changes. However, the tools typically do not generate delta files of competitive size, due to (1) limitations in the type of edit operations that are supported, as discussed in the previous paragraph, and (2) the absence of native compression methods for reducing the size of the patch. The latter issue can be partially addressed by applying general purpose file compressors such as gzip or bzip to the generated patches, though this is still far from optimal.

The first problem was partially addressed by Tichy (1984), who defined the string-to-string correction problem with block moves. For a file $f$, let $f[i]$ denote the $i$th symbol of $f$, $0 \le i < |f|$, and $f[i, j]$ denote the block of symbols from $i$ until (and including) $j$. Then a block move is a triple $(p, q, l)$ such that $f_{ref}[p, \ldots, p + l - 1] = f_{tar}[q, \ldots, q + l - 1]$, representing a nonempty common substring of $f_{ref}$ and $f_{tar}$ of length $l$. The file $f_\delta$ can now be constructed as a minimal covering set of block moves, such that every $f_{tar}[i]$ that also appears in $f_{ref}$ is included in exactly one block move.

Tichy (1984) also showed that a greedy algorithm results in a minimal cover set, leading to an $f_\delta$ that can be constructed in linear space and time using suffix trees. However, the multiplicative constant in the space complexity made the approach impractical. A more practical approach uses hash tables with linear space but quadratic time worst-case complexity (Tichy 1984). Subsequent work in Ajtai et al. (2002) showed how to further reduce the space used during compression while avoiding a significant increase in compressed size, while Agarwal et al. (2004) showed how to reduce size by picking better copies than a greedy approach. Finally, Xiao et al. (2005) showed bounds for delta compression under a very simple model of file generation where a file is edited according to a two-state Markov process that either keeps or replaces content.

Delta Compression Using Lempel-Ziv Coding

The block move-based approach proposed in Tichy (1984) leads to a basic shift in the development of delta compression algorithms, as it suggests thinking about delta compression as a sequence of copies from the reference file into the target file, as opposed to a sequence of edit operations on the reference file. This led to a set of new algorithms based on the Lempel-Ziv family of compression algorithms (Ziv and Lempel 1977, 1978), which naturally lend themselves to such a copy-based approach while also allowing for compression of the resulting patches.

In particular, the LZ77 algorithm can be viewed as a sequence of copy operations that replace a prefix of the string being encoded by a reference to an identical previously encoded substring. Thus, delta compression could be viewed as simply performing LZ77 compression with the file $f_{ref}$ representing previously encoded text. In fact, we can also include the part of $f_{tar}$ that has already been encoded in the search for a longest matching prefix. A few extra changes are needed to get a practical implementation of LZ77-based delta compression. Implementations of this approach include vdelta (Hunt et al. 1998), several versions of xdelta (MacDonald 2000), zdelta (Trendafilov et al. 2002), the recent and very fast ddelta (Xia et al. 2014b) and edelta (Xia et al.

built beforehand by scanning $f_{ref}$, assuming $f_{ref}$ is not too large. When looking for matches, all tables are searched to find the best match. Hashing of substrings is done based on the initial three characters, with chaining inside each hash bucket.

As observed in Hunt et al. (1998), in many cases, the location of the next match in $f_{ref}$ is a short distance after the end of the previous match, especially when the files are very similar. Thus, the locations of matches in $f_{ref}$ are usually best represented as offsets from the end of the previous match in $f_{ref}$. However, sometimes there are isolated matches in other parts of $f_{ref}$. This motivates zdelta to maintain two pointers into each reference file and to code locations by storing which pointer is used, the direction of the offset, and the offset itself. Pointers are initially set to the start of the file and afterward usually point to the ends of the previous two matches in the same reference file. If a match is far away from both pointers, a pointer is only moved to the end of it if it is the second such match in a row. Other more complex pointer movement policies might give additional moderate improvements.

To encode the generated offsets, pointers, directions, match lengths, and literals, zdelta uses the Huffman coding facilities provided by zlib, while vdelta uses a byte-based encoding that is faster but less compact.

Window Management in LZ-Based Delta