Date of acceptance: 6th November, 2012    Grade: ECLA

Instructor: Sasu Tarkoma

Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings

Jari Karppanen

Helsinki, September 20th, 2012
Master's Thesis
UNIVERSITY OF HELSINKI
Department of Computer Science

HELSINGIN YLIOPISTO − HELSINGFORS UNIVERSITET − UNIVERSITY OF HELSINKI

Faculty: Faculty of Science
Department: Department of Computer Science
Author: Jari Karppanen
Title: Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings
Subject: Computer Science
Level: Master's thesis
Month and year: September 20th, 2012
Number of pages: 93 pages

Abstract:

Differential compression allows expressing a modified document as differences relative to another version of the document. A compressed string requires space relative to the amount of changes, irrespective of the original document sizes. The purpose of this study was to determine which algorithms are suitable for universal lossless differential compression for synchronizing two arbitrary documents either locally or remotely. The two main problems in differential compression are finding the differences (differencing), and compactly communicating the differences (encoding). We discussed local differencing algorithms based on subsequence searching, hashtable lookups, suffix searching, and projection. We also discussed probabilistic remote algorithms based on both recursive comparison and characteristic polynomial interpolation of hashes computed from variable-length content-defined substrings. We described various heuristics for approximating optimal algorithms when arbitrarily long strings and memory limitations force discarding information. The discussion also included compact delta encoding and in-place reconstruction. We presented results from empirical testing using the discussed algorithms. The conclusions were that multiple algorithms need to be integrated into a hybrid implementation, which heuristically chooses algorithms based on evaluation of the input data. Algorithms based on hashtable lookups are faster on average and require less memory, but algorithms based on suffix searching find the fewest differences. Interpolating characteristic polynomials was found to be too slow for general use. With remote hash comparison, content-defined chunks and recursive comparison can reduce protocol overhead. A differential compressor should be merged with a state-of-the-art non-differential compressor to enable more compact delta encoding. Input should be processed multiple times to allow a constant space bound without significant reduction in compression efficiency. The compression efficiency of current popular synchronizers could be improved, as our empirical testing showed that a non-differential compressor produced smaller files without having access to one of the two strings.

ACM Computing Classification System (CCS): E.4 [Coding and Information Theory]: Data compaction and compression, Error control codes; F.2.2 [Computation by Abstract Devices]: Nonnumerical Algorithms and Problems---Pattern matching

Keywords: synchronization, differential, compression, delta encoding, differencing, dictionary searching, fingerprint, patch, hashing, projective, set reconciliation, content-defined chunking, characteristic polynomial, cpi
Where deposited: Kumpula Science Library, serial number C-2012-
Additional information:

Contents

1 Introduction ...... 1
2 Differential Compression ...... 3
  2.1 Algorithmic Objective ...... 3
  2.2 Differencing ...... 8
  2.3 Delta Encoding ...... 11
  2.4 Decompressing ...... 14
  2.5 Measuring Performance ...... 17
3 Local Differential Compression ...... 19
  3.1 Subsequence Searching ...... 20
  3.2 Dictionary Searching ...... 22
    3.2.1 Almost Optimal Greedy Algorithm ...... 23
    3.2.2 Bounded Hash Chaining ...... 29
    3.2.3 Sliding Window ...... 35
    3.2.4 Sparse Fingerprinting ...... 36
  3.3 Suffix Searching ...... 37
  3.4 Projective Searching ...... 42
  3.5 Delta Encoding and Decoding ...... 44
  3.6 Tolerating Superfluous Differences ...... 49
4 Remote Differential Compression ...... 52
  4.1 Probabilistic Method ...... 52
  4.2 Fingerprint Collisions ...... 55
  4.3 Chunking ...... 58
  4.4 Alignment Problem ...... 60
    4.4.1 Content-Defined Division Functions ...... 63
  4.5 Differencing Fingerprints ...... 67
    4.5.1 Fingerprint Comparison ...... 68
    4.5.2 Characteristic Polynomial Interpolation ...... 72
  4.6 Synchronization ...... 78
  4.7 Optimizations ...... 80
5 Empirical Comparison ...... 82
  5.1 Test Specifications ...... 83
  5.2 Measurement Results ...... 86
6 Conclusion ...... 91
List of References ...... 94

For Nella.


1 Introduction

Often data needs to be synchronized with another version of the same data: two separate instances of similar data need to be made identical. After synchronization is complete, one of the instances has been replaced with data identical to the other instance, so that only one version of the data remains. For example, a file stored in the cloud (such as Dropbox or Google Drive) needs to be updated to a more recent version of the same file that has been modified, or a software application needs to be upgraded to a more recent version of the same application. Often the old data is very similar to the new version because only small sections of the data have changed. The two versions of data to be synchronized might be stored on different physical computers, connected by a network. Copying the new version completely requires copying all of the unchanged data, too, even though the unchanged data is already on the destination computer in the outdated version. Time and resources are wasted due to the unnecessary copying. Mobile devices utilizing wireless networks in particular are often limited in transfer bandwidth and storage space. As mobile computing becomes more commonplace, data is being modified on multiple devices and the versions on the devices need to be kept synchronized.

Instead of copying the new version of data completely to replace the older version, differential compression (also called delta compression) allows transmitting or storing only the differences between the versions [ABF02]. Differential compressors can substitute sections of data in the new version that are already present in the older version with mere references to those identical sections and thus exclude the common content from transmission. An optimal compressor needs to transmit or store only the information that is not present in the other version of the data. The intended target version of the data can be reconstructed from the other version and the difference information. When differences between data versions are small relative to the total amount of data, differential compression is expected to yield a superior compression ratio compared to conventional, non-differential compression. Conventional, non-differential data compressors must store all of the information from the data to be able to reconstruct it during decompression. With differential compression, the amount of data that needs to be copied can be just a few kilobytes instead of hundreds of megabytes, depending on the properties of the content.

Differential compression is most suitable for use cases where data transfer bandwidth is limited relative to computational resources, data sets are large relative to transfer bandwidth, and differences between data versions are small relative to the total amount of data. The goal is to reduce bandwidth or storage costs [Tri99]. Examples of common use cases include: remote document synchronization, software upgrades (patches), incremental backups, cloud computing, human genome research, data deduplication storage systems, and document version control systems [Tri99, ABF02]. Differential compression can also be used to minimize space usage when storing a sequence of versions: for each successive version, only the differences relative to the preceding version are stored (or vice versa). In addition to saving time, smaller data transfers have a lower probability of failure due to network problems or user impatience [Per06]. Smaller software upgrades increase security and reliability, as more patch files can be delivered within a given time, which narrows the window of vulnerability [Ada09]. Users may defer or ignore patching because of long download durations, or they may find cloud storage too slow or unreliable to be of practical use.

Conventional, non-differential compression has been studied extensively and there are a number of efficient compressor implementations available. Differential compression as a separate problem has not gained such popularity and there are fewer implementations. This thesis attempts to answer how universal lossless differential compression can be used to synchronize two pieces of arbitrary data, without knowing anything about the data prior to compressing it. We seek to find out whether it is possible to implement a compressor application that is more efficient than current publicly available implementations. The main focus is on methods and algorithms suitable for finding the differences and then encoding the differences compactly. We discuss different approaches to differencing, what the main issues in solving the synchronization problem are, and what problems remain to be solved. We also perform empirical testing on the algorithms using a practical implementation to allow choosing the right algorithms for further development of a synchronization tool. We do not discuss application-domain-specific solutions that are based on known properties of the data. For example, lossy compression requires knowledge of how information is allowed to be discarded. We also limit the discussion of conventional compression, as it is a separate problem, even though it is directly related to performing space-efficient encoding of differences.

This thesis is organized as follows. In chapter 2 we give an overview, describe the general problem of differential compression, and provide a framework for later discussion. In chapter 3 we discuss solving the problem of local differential compression, where both pieces of data are immediately accessible locally with low temporal access cost. Local differential compression is closely related to the conventional, non-differential data compression problem. In chapter 4 we discuss remote differential compression, where accessing one of the two strings imposes a high temporal cost due to high read latency or low bandwidth. The two strings are typically stored on separate remote hosts, connected by a slow network. Slow reading restricts a host to reading only the string stored on the same host. Algorithms need to infer which sections of a string have changed without communicating large amounts of data between hosts. In chapter 5 we empirically compare different algorithm implementations for synchronizing two sets of data both locally and remotely. We provide measurements for compression ratios and algorithm execution times, and test how parameters affect compression. The comparison includes a new implementation based on the research made for this thesis. Finally, in chapter 6 we give the conclusions and summarize the discussion.

2 Differential Compression

In this chapter we introduce the problem of lossless differential compression. We introduce the algorithmic objective, discuss which sub-problems need to be solved, and describe the common issues that make solving them more difficult. We also discuss how to measure performance so that comparing algorithms in a meaningful way is possible.

In section 2.1 we discuss what problems need to be solved to perform synchronization using differential compression and what the assumptions and conditions for the algorithms discussed in this thesis are. In section 2.2 we describe the sub-problem of finding differences between two arbitrarily long strings. In section 2.3 we describe the sub-problem of encoding and storing the found differences in compressed form, which the user considers the output of a compression application. In section 2.4 we describe the sub-problem of reconstructing a copy of the target string from the encoded difference data, when the older version of the string is also present. Finally, in section 2.5 we discuss different aspects of measuring performance. Deciding which algorithm or set of algorithms performs better than another depends on the application domain. Depending on the use case, different solutions and thus different measurements are applicable.

2.1 Algorithmic Objective

We first specify some definitions and limitations needed to describe the problem in more detail. Differential compression is done between two strings, a reference string R and a version string V, that are assumed to have similar content. We specify a string as a finite single-dimensional ordered sequence of symbols S_0…S_{n−1} from some finite constant-size non-empty alphabet Σ. The same alphabet is assumed for both input strings. A substring is a contiguous subsequence of a string, [S_i…S_{i+u}) with i, u ∈ ℕ, 0 ≤ i < |S| and 1 ≤ u ≤ |S| − i, and may overlap other substrings. Symbols are assumed to be constant-width, each requiring α = ⌈log₂ |Σ|⌉ bits of space. Constant-width symbols allow referring to any position of a string in constant time without any special encoding. We denote string positions as zero-based, i.e. S_0 is the first symbol of a string. When giving asymptotic bounds, we denote the length of the reference string |R| as n and the length of the version string |V| as m. In practice, symbols are typically 8-bit bytes and strings are arbitrary files or memory areas: Σ_file = {0, 1, …, 2^8 − 1}. When presenting algorithms, we assume strings to be randomly accessible (memory mappable) locally. We give encoding space measurements using bit granularity instead of symbols, because bits as the smallest storage units are better suited for compressed content with variable encoded symbol lengths.

Similarity of two strings means that there are a number of substring pairs of some length between the two strings having zero or close to zero edit distance. When two strings are different, there exists a number of finite-length ordered sequences of edit operations that could be executed on the first string to transform it into the second string [WaF74]. Each edit operation may be assigned a cost, either static or content-defined [WaF74]. Edit distance measures difference as the minimum cost of edit operations required for the conversion [WaF74]. A repeat is a substring that exists in two strings A and B, possibly in different positions, having [A_p…A_{p+u}) = [B_q…B_{q+u}); see figure 1. A left-maximal repeat is a repeat that cannot be extended towards lower indices, having A_{p−1} ≠ B_{q−1} ∨ p = 0 ∨ q = 0. Similarly, a right-maximal repeat cannot be extended towards higher indices, having A_{p+u} ≠ B_{q+u} ∨ p + u = |A| ∨ q + u = |B|. A maximal repeat is a repeat that is both a left-maximal and a right-maximal repeat. A supermaximal repeat is a maximal repeat that does not overlap with any other repeat. An intra-repeat is a repeat where A = B. An approximate repeat is a substring that can be transformed into a repeat by executing at most d edit operations. Thus, if the edit distance between R and V is less than or equal to d, then V is an approximate repeat of R.

Figure 1: Examples of repeats. A maximal repeat (1), an overlapping repeat (2), and an intra-repeat (3). Note that repeat (1) is not supermaximal within the version string, and repeat (2) is not left-maximal.
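The repeat definitions above translate directly into predicates. The following Python sketch is purely illustrative (it is not one of the implementations evaluated later); the names A, B, p, q and u correspond to the strings and offsets in the definitions.

    def is_repeat(A, B, p, q, u):
        # [A_p...A_{p+u}) equals [B_q...B_{q+u})
        return A[p:p + u] == B[q:q + u]

    def is_left_maximal(A, B, p, q, u):
        # cannot be extended towards lower indices
        return is_repeat(A, B, p, q, u) and (p == 0 or q == 0 or A[p - 1] != B[q - 1])

    def is_right_maximal(A, B, p, q, u):
        # cannot be extended towards higher indices
        return is_repeat(A, B, p, q, u) and (
            p + u == len(A) or q + u == len(B) or A[p + u] != B[q + u])

    def is_maximal(A, B, p, q, u):
        return is_left_maximal(A, B, p, q, u) and is_right_maximal(A, B, p, q, u)

    A, B = "xabcdz", "yabcdw"
    print(is_maximal(A, B, 1, 1, 4))   # True: "abcd" cannot be extended in either direction
    print(is_maximal(A, B, 1, 1, 3))   # False: "abc" is not right-maximal, it extends to "abcd"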

The algorithmic objective of differential compression is to create a minimum cost delta encoding E_Δ from R and V such that it allows reconstructing V into an identical V_copy when given only R and E_Δ [ABF02]. Finding the differences between two strings R and V relative to R is the same as finding a set of supermaximal repeats D = (R ∩ V), and then regarding everything not included in the set as differing, E = V ∖ D [ABF02]. For clarity of discussion, we denote the reconstructed version string V_copy as the terminus string in this thesis. Minimum cost for E_Δ is defined as requiring as few bits as possible to encode; that is, being the shortest possible. Synchronization is performed by copying E_Δ to a new location and reconstructing the terminus string V_copy in the presence of R, after or during which the reference string R can be discarded. Multiple strings can be catenated into one and later re-identified and separated by position, which implies that an algorithm for compressing multiple strings can be the same as for compressing a single string. Similarly, it is also possible to use multiple reference strings.
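As a concrete but deliberately naive illustration of this objective (not one of the algorithms evaluated in later chapters), the following Python sketch expresses V as a list of copy directives referring into R and add directives carrying literal symbols. The function name naive_delta and the directive format are hypothetical, and the quadratic-time greedy scan serves only to make the notion of a delta encoding concrete.

    def naive_delta(R, V, min_copy=4):
        # Scan V left to right; at each position take the longest prefix of the
        # remaining version string that occurs anywhere in the reference string.
        delta, i = [], 0
        while i < len(V):
            best_len, best_pos = 0, -1
            length = min_copy
            while i + length <= len(V):
                pos = R.find(V[i:i + length])
                if pos < 0:
                    break
                best_len, best_pos = length, pos
                length += 1
            if best_len >= min_copy:
                delta.append(("copy", best_pos, best_len))   # reference into R
                i += best_len
            else:
                delta.append(("add", V[i]))                   # literal symbol
                i += 1
        return delta

    R = "the quick brown fox jumps over the lazy dog"
    V = "the quick red fox jumps over the lazy dog!"
    print(naive_delta(R, V))
    # Most of V becomes two copy directives; only "red" and "!" are stored literally.

A real differencing algorithm would merge adjacent add directives and find repeats far more efficiently; such algorithms are the topic of chapter 3.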

The synchronization problem can be divided into three sub-problems: (1) differencing, for finding the differences between R and V, (2) delta encoding, for constructing the bit sequence E_Δ that contains the found differences as compactly as possible, and (3) decoding, for reconstructing V_copy from the reference string R and the delta encoding E_Δ [ABF02, SNT04].

From the user's perspective, sub-problems (1) and (2) are the compression phase and sub-problem (3) is the decompression phase. The three sub-problems can be solved separately [You06]. Output from each preceding sub-problem can be considered as input for the following sub-problem, and each algorithm could be implemented as a separate program. As is often done with conventional compression, differencing can be preceded by a preprocessing phase that performs a reversible transform on the strings, transforming them into a form that allows finding more repeats. In this case the decoding is followed by a respective postprocessing phase

that reverses the transformation. Figure 2 shows the flow of execution.

Figure 2: Synchronizing using differential compression.

Differencing and encoding do not require strict sequential processing and the sub-problems can be solved interleaved using concurrent execution threads [BuL97, ABF02]. For example, an encoding algorithm can run in parallel with differencing and start producing output as soon as it receives input, before differencing has completed. Thus, the differencing and encoding algorithms can be chosen independently. We note that some complex and optimized implementations may have implementation-specific dependencies that make interchanging the algorithms more difficult or impossible in practice. Buffering can be used to provide compatibility between algorithms [Tri99, ABF02]. When directive cost is variable, choosing the optimal set of repeats requires knowledge of the encoding lengths of the repeats.

To allow remote differencing, execution of the differencing algorithm may be divided into two separate program instances communicating via interprocess communication, each instance having access to only one of the two input strings. The output from the encoding algorithm (2) can be stored into a file for later synchronization, or transmitted directly to another host for immediate synchronization. The reconstruction algorithm (3), as the inverse procedure of the encoding algorithm, is always dependent on the encoding method and must be compatible with the encoding specification [Tri99]. The decoding algorithm (3) is typically run in another process, possibly on another host, and the reference string R and delta encoding E_Δ must always be given as input.

In this thesis we discuss algorithms where the encoding and reconstruction algorithms produce lossless compression with high probability of success. The terminus string V_copy must be identical to the original version string V. However, this does not imply that the differencing should be exact and deterministic, as long as any errors are corrected before committing the terminus string [Per06]. We also limit the discussion to algorithms that do not need to have any prior knowledge of the contents of the strings before processing begins, except the strings' lengths. Having prior knowledge could obviously make differencing easier or allow the use of content-specific algorithms [LKT06]. Applications could track modifications to strings so that an algorithm already knows which regions of a string contain all the modifications. For example, a source code editor could track which lines have been altered. Similarly, it is known that all modifications to a log file are insertions beyond the previously known end-of-file position. Optimal algorithms with high asymptotic time bounds can be used for short strings. However, additional information is often not available and requiring it would restrict the general applicability of an algorithm.

Synchronization using differential compression is not restricted to lossless compression. However, lossy compression requires application-domain-specific knowledge about how the data is allowed to lose information and is thus out of the scope of this thesis. Information loss is often based on some psychoacoustic or psychovisual model, built on knowledge of the deficiencies of human senses [ZYR12]. For example, lossy video compression reduces differences in temporally successive pixel color values by using a quantization function that maps a larger set of values into a smaller set, thus removing information. Due to the removed information, decompression introduces compression artifacts caused by rounding errors from quantization [ZYR12].

Conventional data compression of a string can be viewed as a special case of local differential compression, having a zero-length reference string. There is no separation between local and remote compression because there is no concept of a remote host when only one string on one host is involved. Similarly, local differential compression can be thought of as conventionally compressing both the reference string and the version string as a single catenated string, but producing an encoded output that excludes symbols from the reference string [TMS02]. Inter-file compression is related to differential compression [You06]. Solid archives such as Rar¹ and 7zip² are compressed by treating a set of strings as one long catenated string, sharing the encoding contexts, leading to compressed encodings that contain references between strings in the solid archive. All information required for decompression is however still stored within the compressed file and no external reference string is required [You06].

¹ http://www.rarlab.com
² http://www.7-zip.org

Even though typically the reference string is considered 'old' and the version string 'new', and the objective is to synchronize by transforming old into new, the roles are interchangeable (synchronize 'new' into 'old' instead) by simply swapping the denotations. Algorithmically, and to clarify the discussion, in this thesis a reference string is always synchronized into a version string. It is up to the application to decide which one of the two strings to compress and which one to regard as the reference string. Document version control systems typically produce the reverse encoding, that is, they store the latest document version completely uncompressed and store the older versions as a sequence of delta encodings. We also do not consider any conflict semantics due to temporal ordering of modifications, such as when both of the strings have been modified and the original reference string no longer exists. Choosing either of the two possible ways to label the strings as reference and version strings would lose information. Such a case is an implementation issue and there is no method for automatically solving the problem. Human interaction or external information on the ordering and scope of editing is needed for resolving the conflict [BaP98, RaC01], and synchronization must be done for partial strings. An example of external information is using time vector pairs [CoJ05] to resolve an ordering of synchronizations that loses no information.

Compression in general relies on strings having some encoding redundancy that can be removed, allowing a string to be re-encoded into a shorter form. Compression efficiency is always data dependent [You06]. The usefulness of differential compression relies on the fact that the strings contain maximal repeats [ABF02]. When performing differential compression on two completely unrelated or random strings, there is a high probability that the strings do not have repeats, except by random chance [ABF02]. When there are no repeats, references to the other string cannot be encoded. Thus, not having any repeats is equivalent to differencing against an empty string, which again is equivalent to not differencing at all and simply compactly encoding the original string. The encoding problem is thus closely related to the problem of conventional, non-differential compression. The two strings do not need to be similar or related at all; it is only necessary to assume that they contain repeats. A compression algorithm must be able to succeed with two unrelated input strings, but it cannot be expected to produce a shorter encoded output when compared to conventional, non-differential compression. When provided with adversarial inputs, a compression algorithm should not produce output that is much longer than the version string [You06], or by extension, longer than the conventionally compressed version string.


2.2 Differencing

The two strings to be compressed are believed to be similar, so certain assumptions are made. It can be expected that the two strings have either many or very few repeats, as the strings either have similar origins or they do not [Per06]. For example, the source code of two successive versions of the same software is likely to be similar [Per06], but master's theses from different authors are unlikely to contain repeats. Often the repeats are in the same order in both strings, because it is not expected that most of the content would have been relocated [BuL97]. When two substrings forming a repeat do not have the exact same positions relative to another previously found repeat, the difference between positions is assumed to be small, caused by cumulative offset changes due to short insertions or deletions [ABF02].

The simplest method for finding repeats between two strings is a straightforward symbol-wise comparison. Hamming distance measures difference as the number of string positions having R_i ≠ V_i. There are however two problems with this approach. First, adding a symbol to or removing a symbol from any position of one string causes an alignment problem, where all symbols following the modification position are misaligned when compared against the other string [SuM02, ABF02]. Any alignment-altering modification prevents symbol-wise comparison from finding repeats within the shifted range, except where some symbols match by random chance. Adding a single symbol in front of a version string causes the whole string to be detected as different. The alignment problem can be avoided by attempting to find positions where the strings become identical again after finding a mismatching symbol [HuM76]. However, the second problem with symbol-wise comparison is that repeats can be transposed within a string [ABF02, You06]. A transposition relocates a symbol to another position, while keeping the string otherwise unmodified. Increasing the comparison position within one string until again reaching a repeat would skip any transposed repeats. Hamming distance overestimates difference and thus symbol-wise comparison is insufficient for finding differences. There is no method for finding the differences of two strings that is both trivial and fast.
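The alignment problem is easy to demonstrate. The following Python sketch (illustrative only; the helper name hamming_positions is hypothetical) compares two strings symbol by symbol, as a Hamming-distance style differencer would:

    def hamming_positions(R, V):
        # Count positions i with R[i] != V[i]; only the common prefix length is compared.
        return sum(1 for a, b in zip(R, V) if a != b)

    R = "abcdefgh"
    print(hamming_positions(R, "abXdefgh"))   # 1: a single substitution is isolated exactly
    print(hamming_positions(R, "Xabcdefg"))   # 8: one symbol prepended misaligns every position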

Another type of edit distance, the Levenshtein distance, defines three basic edit operations that can be executed at each position within a string. Change replaces a symbol with another, without affecting the string's length, and is equivalent to the edit operation of Hamming distance. Insertion adds a symbol at some position of the string, increasing the string's length. Deletion removes a symbol from some position, decreasing the string's length. For example, the string "abc" could be transformed into "bac" by first removing "b" and then re-inserting "b" at the first position, thus using two edit operations. The additional operations solve the alignment problem, but not the transposition problem. The repeats can be located anywhere and in any order within the strings (see figure 3). The repeats may also be overlapping, i.e. one repeat is a partial or complete substring of another, different repeat. A search algorithm should find a set of supermaximal repeats having maximum coverage of the reference string. Thus, more complex edit operations are needed: a transpose operation relocates a symbol to another position. Allowing a count to be specified for the transpose operation allows relocating a number of symbols at constant cost [Tic84].
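As an illustration, the Levenshtein distance can be computed with the classic dynamic-programming recurrence using unit costs for change, insertion and deletion. The following Python sketch is illustrative only and is not one of the differencing algorithms discussed in chapter 3:

    def levenshtein(a, b):
        # Row-by-row dynamic programming over the edit distance recurrence.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # change, or free match
            prev = curr
        return prev[-1]

    print(levenshtein("abc", "bac"))             # 2: the "delete b, re-insert b" example above
    print(levenshtein("abcdefgh", "Xabcdefgh"))  # 1: a single insertion, unlike the Hamming view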

Figure 3: Insertions and deletions (alignment changes) and transpositions make finding repeats more difficult. Only some symbols match in corresponding positions.

One could invent an infinite selection of more complex edit operations that could be used to describe the transformation of a string into another. For example, a reversal operation could invert the symbol ordering of a substring, making it possible to reverse all symbols of a string and encode the change using a single special instruction. Generic pattern matching would not find any common substrings from the reversed version string, except if it contains palindromes. It would be possible to encode all of the decimal fractions of π using one short mathematical equation, allowing for an infinite compression ratio (assuming infinite fractions). The more complex the allowed edit operations are, the more extensive the computational requirements for the search algorithms become. Having edit operations capable of representing more complex editing would allow finding a much shorter edit distance than using the four basic edit operations [You06]. Kolmogorov complexity measures the theoretical edit distance that produces one string from another using any possible arbitrary edit operation [Fim02]. The Kolmogorov distance is the size of some function capable of producing the terminus string V_copy from the reference string. It is impossible to calculate the distance programmatically, because solving it requires (artificial) intelligence [You06]. This thesis focuses on algorithms for finding repeats because other kinds of substring matching require content-specific algorithms or knowledge of the structure of the data [ABF02]. Content-aware pre-processing algorithms can be used to encode strings into a format that allows repeat-finding differencing algorithms to find more repeats [Tri99].

Differencing algorithms need to be able to keep information about one complete string in memory in order to be able to match repeats against any position of the string, which requires O(min(n, m)) space unless some probabilistic method is used. In practice there is often not enough memory to store all the information from an arbitrarily long string, and the string must be processed in smaller separate pieces or some of the collected information must be discarded from memory. When an algorithm is allowed to collect only a limited amount of information, it cannot make optimal choices nor produce optimal differencing output [ABF02]. An algorithm that has had to discard information about a repeat can no longer match that repeat later [ABF02]. When an algorithm is given a constant space bound and is allowed to read strings only once and in sequential order, such as with stream compression, no algorithm can find all transposed substrings [ABF02]. Thus, multiple read passes are required for differencing of arbitrarily long strings.

The atomic unit of change represents the minimum length of a substring a differencing algorithm can detect as different. When the atomic unit of change used for differencing is larger than the minimum-size edit, differencing cannot accurately separate differing data from the unchanged data [Obs87]. Atomic units covering the range of modified symbols may contain some unchanged symbols, in addition to the changed symbols. Within a unit, there can be unchanged symbols preceding, following, or between the changed symbols. These unchanged symbols are subsequently denoted slack [BBG06]. Early implementations in the 1970s used a full line of text as the atomic unit of change [Obs87], as increasing the atomic unit of change also allows reducing memory space and computation requirements. Changing a single symbol caused the whole line to be treated as different. The more fine-grained the differencing, the more completely the repeats can be found, decreasing the compression ratio (increasing efficiency) [BBG06]. The atomic unit of change is an issue for remote differencing; larger units need to be used for inferring differences due to the need to minimize communication.
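The effect of the atomic unit of change can be illustrated with Python's difflib module (used here purely for illustration; it is not one of the differencing algorithms discussed in this thesis). The same three-symbol edit is charged as a whole line of slack-containing change at line granularity, but isolated exactly at symbol granularity:

    import difflib

    old = "the quick brown fox\njumps over the lazy dog\n"
    new = "the quick brown fox\njumps over the lazy cat\n"

    def changed_symbols(a, b, split):
        # Sum the lengths of all version-side units that are not reported as equal.
        a_units, b_units = split(a), split(b)
        opcodes = difflib.SequenceMatcher(None, a_units, b_units).get_opcodes()
        return sum(len("".join(b_units[j1:j2]))
                   for tag, i1, i2, j1, j2 in opcodes if tag != "equal")

    # Line as the atomic unit: the whole second line (24 symbols) is marked changed.
    print(changed_symbols(old, new, lambda s: s.splitlines(keepends=True)))
    # Symbol as the atomic unit: only the 3 actually changed symbols are marked.
    print(changed_symbols(old, new, list))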

A small application-level modification of data can result in a completely different string on the symbol level when an application saves the modified string. Some string post-processing methods also make differencing less effective. For example, encryption attempts to reversibly transform the original string completely, so that nothing from the original string is retrievable and any structure or repeating patterns in the string become undetectable [MCM01]. Encryption typically uses the preceding encryption output as one of the input parameters to the encryption cipher, so that the output for an encrypted block is dependent on the output of the preceding block (cipher-block chaining). Altering a single symbol will then cause a cascading effect and all following encrypted content after the modification position will also change. Additionally, to prevent the encrypted output from being similar between multiple executions of encryption on the same string, encryption uses randomization such as a random initialization vector or "salt" parameter inserted in place of the non-existent preceding output for the very first block. Encrypting the same string multiple times will often produce results that have almost zero common substrings between the encrypted strings. Even if encrypting the same string twice would produce identical results, modifying the original string would produce very different results again because of the cascading effect. Differential compression is almost always ineffective for encrypted content because there are no repeats. The differencing should be done before encrypting the content.

Compressed strings are also problematic to difference. Even though compression does not attempt to scramble the string purposely and the processing is not randomized, the encoding is often done using bit alignment instead of symbol (byte) alignment. Any processing method that modifies a string using an atomic unit of change whose size is not evenly divisible by the unit used for differencing causes alignment issues [Tri99]. Inserting a single bit into a string causes a cascading alignment change on the bit level, causing changes in all following bytes. In addition, compression algorithms use adaptive encoding based on statistics of the content, and even a small modification could cause a change in the encoding sequences used due to changed statistical ratios [You06]. Attempting to differentially compress already compressed strings usually fails to produce a smaller output because the previous compression process inadvertently removed most of the similarity between the two strings [Tri99]. Uncompressing a string, differencing, and then recompressing could circumvent the problem. However, some archives are digitally signed, and recompressing is likely to produce a differing archive due to time stamp changes or a different compressor version, which causes a signature mismatch [Tri99].

2.3 Delta Encoding

Substituting substrings of the version string with references to identical substrings within the reference string is an efficient form of compression and produces most of the space savings from differential compression. Assuming that a differencing algorithm produces an optimal output by finding every supermaximal repeat of any length between the two strings, the encoding method will still have a major effect on how much space the delta-encoded output requires [You06]. Compressed delta encoding can improve total compression performance significantly relative to uncompressed delta encoding [IMS05]. Differential compression can achieve a high compression ratio even without compressed encoding, whereas conventional compression often cannot, unless a string contains mostly intra-repeats. Even a very inefficient human-readable encoding can provide acceptable results in practice when the two strings are similar, because the encoding requires much less space than the referenced substrings. In the optimum case when there are no differences (R = V), compression of the encoding becomes irrelevant because a single reference is enough to encode the whole string. However, the compactness of the delta encoding is a major factor in how a differential compressor compares in efficiency relative to other compressor implementations [You06, TMS02].

Further reduction in the size of the delta encoding is possible by compact encoding of internal metadata, finding intra-repeats within the string, and more compactly encoding the substrings that are to be included in the compressed output. Substring compression is done for the repeats that are too short to encode efficiently as references, or that the differencing algorithm was unable to find. Directive parameters, such as positions and lengths of referenced substrings, are stored using variable-length encoded integers, instead of normal fixed-width integers or simple data structures. In practice, compression implies using bit alignment for encoding. The methods are similar to those used for compressing strings with conventional, non-differential compression. Thus, an additional benefit from compressed encoding is that even if there are not many repeats between the reference string and the version string, efficient compression of the delta encoding allows a differential compressor to operate also as a conventional, non-differential compressor.
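Variable-length integer encoding of directive parameters can be done in many ways; the following Python sketch shows one common scheme (LEB128-style base-128 encoding). The function names are hypothetical and this is not necessarily the exact format used by any compressor discussed later:

    def encode_varint(value):
        # 7 payload bits per byte; the high bit marks that another byte follows.
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def decode_varint(data, offset=0):
        result, shift = 0, 0
        while True:
            byte = data[offset]
            offset += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result, offset
            shift += 7

    print(encode_varint(100).hex())              # '64'   - small values cost a single byte
    print(encode_varint(300).hex())              # 'ac02' - two bytes instead of a fixed 4 or 8
    print(decode_varint(bytes.fromhex("ac02")))  # (300, 2)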

Both the reference and version strings can contain intra-repeats, in addition to repeats between the two strings. For example, the version string might contain the same substring twice but the reference string does not have any occurrences of the string. The repeats should thus be found from both between the reference and version strings, but also from within those substrings of version string that could not be matched against reference string. Because set of differences is bounded only by the length of version string, both intra-differencing of version string and compression of delta encoding share the same ( ) space bound and issues of limited memory and processor resources. The intra-repeats should풪 푛 be substituted with references to delta encoding itself, instead of to reference string that does not contain the substrings. For differencing, the already found set of differences can be viewed as a third string, which is differenced against any new differential substring before the substring is appended into the set of differences. Finding repeats from within a reference string does not matter because the reference string will not be compressed and any occurrence of a substring within the reference string is sufficient for referencing purposes.

Delta encoding must be done so that there is no possibility of misinterpreting how the output is to be reconstructed when decoding the encoded string. The symbol set used internally in the delta encoding E_Δ need not be the same as the one used in R and V, because the contents of E_Δ are only visible to the encoding and decoding algorithms. Entropy encoding is a method where the size of an encoded symbol depends on the statistical probability of the symbol [Ski06]. Each encoded symbol may consist of more than one source symbol. Entropy encoding allows using the shortest tokens for those repeat references and symbols that consume the most space (frequency times length is the largest) and increasingly longer tokens for more rarely occurring items. Progressively associating longer tokens with content that has a smaller space requirement leads to the longest encoding symbols consuming the least space. Entropy encoding reduces the inefficiency of the original encoding. For example, the input strings might consist of only 26 distinct symbols even though the strings are originally stored using 8 bits per symbol. The set of differences can also contain repeating sequences or patterns of symbols, in addition to the original encoding being inefficient. Compression can reduce the overhead caused by slack when the atomic unit of change is long (2^8…2^11 symbols or more), which is typical in remote differencing [SNT04]. Detailed discussion of general data compression is out of scope for this thesis. For the remainder of this thesis, we denote general data compression as secondary compression. We briefly describe how compression can be done for a delta encoding. Multiple internal symbol sets can be used for different sets of substrings. Because the encoding can be decoded sequentially (random access is not required), variable-size symbols can be used without the need for special encodings that require non-zero space overhead. The probabilities for entropy encoding are calculated by a predictor, which in the simplest case manages symbol rankings [Mah05]. Multiple different prediction contexts, context mixing, can be used to better predict the following symbols and to allow compression requiring less space [Mah05]. For example, a string that contains different types of content in different parts may have very different statistical probabilities of symbols for each part. A detailed description can be found in Mahoney's paper on state-of-the-art encoding methods [Mah05]. Agarwal et al. [AGJ05] suggest monitoring compression efficiency for some initial interval. When compression reduces the output size by less than some threshold proportion, compression should be stopped to avoid using computation time for non-compressible data, and to avoid producing an output longer than the original.

Entropy encoding can be done with arithmetic coding [Ris76], which encodes all symbols into one fractional number between 0.0 and 1.0. When encoding, each parent range is progressively divided into sub-ranges whose sizes are proportional to the statistical probability of the symbols being encoded. For example, with a 60% probability for "a" and 40% for "b", encoding divides the range into [0…0.6) and [0.6…1] for the first symbol. Range [0…0.36) thus represents "aa" and range [0.36…0.6) represents "ab". For long encoded strings, the coder's output efficiency can approach the theoretical limit of information entropy, not counting inefficiencies caused by implementation issues (such as limited machine integer range). Range encoding [Mar79] is a form of arithmetic coding where binary integers are used instead of decimal fractions, which allows a faster implementation. Arithmetic coding is adaptive and does not require including a coding table in the delta encoding, because the decoder can reconstruct the same sub-ranges during decoding.
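The interval subdivision in the example above can be reproduced with a few lines of Python. This sketch only computes the final interval for a message under a fixed two-symbol model; a real arithmetic or range coder would additionally emit the shortest binary fraction inside the interval and update the model adaptively:

    def encode_interval(message, probs):
        # Narrow [low, high) once per symbol, proportionally to the symbol probabilities.
        low, high = 0.0, 1.0
        for sym in message:
            span = high - low
            cum = 0.0
            for s in sorted(probs):          # fixed, deterministic symbol ordering
                if s == sym:
                    high = low + span * (cum + probs[s])
                    low = low + span * cum
                    break
                cum += probs[s]
        return low, high

    probs = {"a": 0.6, "b": 0.4}
    print(encode_interval("aa", probs))   # approximately (0.0, 0.36), as in the text
    print(encode_interval("ab", probs))   # approximately (0.36, 0.6)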

Existing conventional compression algorithm implementations can be used to compress the final delta encoding output [SuM02, MGC07]. Currently one of the best general-purpose algorithms is Pavlov's Lempel-Ziv-Markov-chain algorithm (LZMA) [Igo12, MGC07]. Secondary compression requires statistical information on the data that is as accurate as possible to be most effective. A more compact encoding is possible if the compression algorithm is modified (optimized) for differential compression [Tri99, SNT04]. The delta encoding includes only the differing substrings, but building of the compression context can refer to the surrounding content that is not included but is available in the reference string. In the worst case, when the strings are completely different, the compression problem is exactly the same as with conventional, non-differential compression, as there is no surrounding context for the string. Tridgell's [Tri99] testing showed that the deflate algorithm produced 1-6% smaller outputs when complemented with a "process but do not emit" operation that reads the preceding input and updates the compression context but does not produce output. For remote differencing, the improvement can be 25% due to the inclusion of slack [SNT04].
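The idea of priming the compression context with content that is never transmitted can be tried out with the preset-dictionary feature of zlib. The sketch below is only an illustration of the principle; it is not the mechanism Tridgell describes, and the 32 KB deflate window limits how much of the reference string is actually used:

    import random
    import zlib

    # A pseudo-random reference string and a version that differs by one short insertion.
    random.seed(0)
    reference = bytes(random.randrange(256) for _ in range(4096))
    version = reference[:2000] + b"a small local edit" + reference[2000:]

    # Plain deflate: random content is essentially incompressible.
    plain = zlib.compress(version, 9)

    # Deflate primed with the reference string as a preset dictionary, so the encoder
    # can emit back-references into data that is never included in its own output.
    c = zlib.compressobj(level=9, zdict=reference)
    primed = c.compress(version) + c.flush()

    # The decoder needs the same reference string to resolve those references.
    d = zlib.decompressobj(zdict=reference)
    assert d.decompress(primed) == version

    print(len(plain), len(primed))   # the primed output is a small fraction of the plain one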

2.4 Decompressing

The delta decoding algorithm is the inverse operation of the delta encoding algorithm and is thus completely dependent on the encoding method. Differential decompression requires both the original uncompressed reference string R and the delta-encoded difference data E_Δ to be available during decompression. Asymptotic time and space bounds are typically very similar for decoding compared to encoding, because similar or identical data structures need to be constructed. For example, both algorithms need to build coding tables for secondary compression. Decoding algorithms allowed to use temporary space for reconstructing the terminus string do not require the large data structures that are used for differencing, and thus the decompression phase requires less memory than compression.

Assuming that the contents of V_copy need not be accessible externally until reconstruction is complete, an algorithm can iterate through the delta encoding and decode each directive in order [Chr91]. For each directive, the algorithm copies either an encoded substring from the delta encoding itself, or a referred substring from the reference string, into some given terminus string position [Chr91]. It does not matter if V_copy is in an undefined state while being reconstructed, as it will be identical to the original V again when the reconstruction is completed. Any internal structure (such as a relational database) is eventually restored when the strings become identical. Synchronization can be performed after decoding is complete by discarding the reference string R and replacing it with the reconstructed string V_copy. Reconstructing the version string in some predefined order allows using the partially reconstructed string even before the reconstruction has been fully completed. For example, backwards linear reconstruction would allow viewing the latest calendar notes soon after beginning the reconstruction, assuming the latest notes are at the end of the string and that the application supports partial file locking and is aware of the ongoing decompression.
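Continuing the hypothetical directive format sketched in section 2.1, such a decoder reduces to a single loop over the directives (illustrative Python only; a real decoder would also drive the secondary decompression and verify checksums):

    def apply_delta(R, delta):
        # Reconstruct the terminus string: "copy" refers into the reference string R,
        # "add" carries literal symbols stored in the delta encoding itself.
        out = []
        for directive in delta:
            if directive[0] == "copy":
                _, pos, length = directive
                out.append(R[pos:pos + length])
            else:
                out.append(directive[1])
        return "".join(out)

    R = "the quick brown fox jumps over the lazy dog"
    delta = [("copy", 0, 10), ("add", "red"), ("copy", 15, 28), ("add", "!")]
    print(apply_delta(R, delta))   # "the quick red fox jumps over the lazy dog!"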

However, storage space is limited and there might not be enough space for a temporary file. For example, mobile devices are often resource limited. Requiring temporary space for the terminus string limits the maximum size of a synchronizable string to approximately half of the storage device's capacity. With long strings and short changed sections, copying the identical substrings also requires significant time relative to the size of the changes, especially if the changes are all located near the end of the string. In-place reconstruction attempts to solve the problem by modifying the reference string directly, without creating a temporary file [Tic84]. Peak space usage must not exceed max(m, n) [ShS03]. For example, a backup of a log file can be synchronized by just appending the changes, without the need to copy existing unmodified content.

In-place reconstruction imposes restrictions on the delta encoding and on the ordering of writing modifications to the reference string [Tic84, BSL02]. When doing the modifications in sequential order, a copy directive in the delta encoding might refer to a position in the reference string which has already been overwritten with content for the terminus string. Reordering the copy directives cannot completely avoid the problem (see figure 4). The conflicting substrings must be temporarily stored in some smaller external storage (of size w ≪ n) until they are allowed to be written [BSL02]. The strings can also be kept in memory. We note that when the terminus string is longer than the reference string (m ≥ n + w), the extra space can be used as temporary work space. There is also a risk that if for some reason the reconstruction fails or is interrupted, the reference string might end up in an undefined state [Tri99]. Battery failure in a mobile device is not uncommon, and leaving an operating system file in an undefined state during a firmware update might prevent rebooting the device, thus preventing retrying the synchronization. The length of the reconstructed version string can be stored in the delta encoding so that the reconstructor is able to reserve the needed space before reconstruction, lessening the probability of a failure due to insufficient space.

Figure 4: Read-write conflicts prevent in-place reconstruction of terminus string.

To avoid using temporary storage, Burns, Stockmeyer and Long [BSL02] propose modifying the delta encoding into a form that does not cause read-write conflicts. The copy directives are topologically sorted using a graph of conflicting read-write intervals to minimize the number of conflicts. Copy directives that attempt to read an interval another directive writes to are moved forward. A copy directive cannot conflict with itself, as overlapping copying is possible using either left-to-right or right-to-left order. The sorting can be done using a standard depth-first topological sort algorithm [CER90]. Not all conflicts can be eliminated by reordering, as some lead to cycles in the directed graph. The cycles can be eliminated by converting any of the vertices of a cycle from a copy directive into an insert directive (see figure 5). As the sort algorithm puts newly visited vertices on a stack, a cycle can be detected by checking whether a vertex is already in the stack. Converting directives can make the delta encoding longer because more substrings are stored explicitly. A copy directive can also be replaced only partially, preserving a reference to the non-conflicting interval and converting only the conflicting substring into an insert directive.

Minimizing the increase in the length of the delta encoding is an NP-hard problem [ShS03], so heuristics are used instead and an algorithm is not guaranteed to minimize the expansion. Burns, Stockmeyer and Long [BSL02] describe two options: an algorithm can select the first one of the cycle vertices it encounters, or attempt to use a greedy method and choose the one that locally has the shortest conflicting interval. Choosing the shortest copy operation of a cycle requires more computation and has no asymptotic advantage for the worst-case bound. Empirical testing showed that choosing the shortest copy results in 0.5% larger and choosing the first copy in 3.6% larger delta encodings, which indicates the shortest-policy results in less expansion than choosing the vertex that is on top of the stack. After removing all of the cycles, the algorithm moves all insert directives to follow the copy directives, which avoids the need to include the insert directives in the reordering process. Because of the reordering, the directives must be complemented with destination positions in the terminus string, which further worsens the compression ratio. As the intervals of the directives are disjoint, any permutation of the directives results in the same terminus string. The algorithm requires O(n log₂ n + min(n, r²)) time when choosing the first vertex, or O(r²) time when using the local-minima policy (where r is the number of copy directives). The space requirement is O(n + min(n, r²)) with both policies. Empirical testing indicated that computation speed has little effect and that reconstruction time is highly dependent on the length of the delta encoding, due to limited write speed and non-sequential writing caused by reordering [ShS03]. Both cycle-breaking heuristics produce delta encodings with an O(n) space bound [ShS03], but in practice the size increase is small.

Figure 5: Eliminating read-write conflicts by removing loops.
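A much simplified Python sketch of the cycle-breaking idea follows. It uses the "convert the first cycle vertex encountered" policy, never splits a directive partially, and represents each copy as a hypothetical (read position, length, write position) triple; the real algorithm of Burns, Stockmeyer and Long includes refinements omitted here.

    def order_copies(copies, R):
        # copies: list of (read_pos, length, write_pos) triples over the same buffer.
        # Returns an order in which every surviving copy executes before anything
        # overwrites the interval it reads; copies on a cycle are downgraded to
        # inserts that carry the bytes they would have read, appended at the end.
        def overlaps(a_start, a_len, b_start, b_len):
            return a_start < b_start + b_len and b_start < a_start + a_len

        n = len(copies)
        # Edge u -> v: u reads an interval that v writes, so u must execute before v.
        succ = [[v for v in range(n) if v != u and overlaps(
                    copies[u][0], copies[u][1], copies[v][2], copies[v][1])]
                for u in range(n)]

        WHITE, GREY, BLACK = 0, 1, 2
        color = [WHITE] * n
        converted = [False] * n
        ordered, inserts = [], []

        def visit(u):
            color[u] = GREY
            for v in succ[u]:
                if converted[v]:
                    continue                      # already an insert: no constraint left
                if color[v] == GREY:
                    # Back edge closes a cycle: downgrade v from a copy to an insert.
                    read_pos, length, write_pos = copies[v]
                    inserts.append(("insert", R[read_pos:read_pos + length], write_pos))
                    converted[v] = True
                elif color[v] == WHITE:
                    visit(v)
            color[u] = BLACK
            if not converted[u]:
                ordered.append(("copy",) + copies[u])

        for u in range(n):
            if color[u] == WHITE:
                visit(u)
        ordered.reverse()             # readers finish last in post-order; reverse it
        return ordered + inserts      # inserts are moved after all copy directives

    # Two copies that swap the halves of "ABCDEF" form a cycle; one becomes an insert.
    print(order_copies([(0, 3, 3), (3, 3, 0)], b"ABCDEF"))
    # [('copy', 3, 3, 0), ('insert', b'ABC', 3)]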

The reference string R used for reconstructing the version string must be identical to the reference string used when performing differential compression of the original version string. Reconstructing the version string using as reference a string that is not identical will result in a corrupted terminus string. A separate digest checksum can be calculated from the reference string used for compressing and included in the delta encoding, allowing a pre-reconstruction verification check [Tri99]. Patch files are an example where a user might have selected the wrong patch file, not intended for the available reference string. Network transfer is also not free from errors, and long-term storage of a reference string may cause silent data corruption. A checksum enables detecting, but not recovering from, such errors.

2.5 Measuring Performance

When compressing a version string and when the two strings to be compressed contain repeats, differential compression is expected to produce a more compact encoding than non-differential compression. Often secondary compression is able to reduce the size of the delta encoding considerably, and using conventional compression is always sensible [You06, TMS02]. Thus, it is justified to use the length of the conventionally compressed version string as a benchmark reference for the encoding in all cases. When the strings are completely different, a compressor is expected to match the output size of a non-differential compressor. Thus, a differential compressor should never produce output that is significantly longer than what a conventional compressor produces. Remote differencing always requires a non-zero amount of additional communication overhead for inferring which parts of the strings are different, making it impossible to exactly match a non-differential compressor implementation.

Typically it is assumed that computational resources are plentiful compared to storage or transmission resources [Tri99]. An algorithm should have execution time and space bounds such that it can be used with large inputs; otherwise it will be unusable in practical implementations. In practice this means that an algorithm should have neither polynomial time or space bounds nor linear space bounds with a high multiplicative constant [ABF02]. Even a linear space bound is often unacceptable, because strings can be much longer than the available computer memory can contain. For practical applications, the multiplicative constant in the linear asymptotic time bound is an important factor. Using some algorithm with an O(n²) time bound for one-time synchronization of large personal documents would require more time than simply copying the documents uncompressed. On the other hand, the same O(n²) algorithm might be the best option when producing a patch file for a small software application tool that is about to be transmitted to several millions of customers. The lower transmission cost would amortize the higher computational cost [ABF02].

The three most important measurements that can be used to evaluate the relative performance of algorithms are compression ratio, computation time, and required memory [ABF02, Per06, Tri99]. Compression ratio is measured as the ratio between the size of the produced delta encoding and the size of the version string. A smaller ratio equals a smaller delta encoding and thus better compression. The length of the reference string, although it is required to be present, is not considered. When compressing requires communicating with remote hosts, any protocol overhead should also be included in calculating the size of the delta encoding [Tri99]. These three measurements form a trade-off with each other [ABF02]. Increasing computation time allows decreasing (improving) the compression ratio, as an algorithm can try more options. Increasing memory requirements allows decreasing the compression ratio, as an algorithm has more information to match against. Increasing the memory requirement allows decreasing computation, because trying fewer options is countered with more available information. Increasing computation allows reducing memory requirements, respectively.

Algorithmic asymptotic computation bounds ignore implementation issues related to latency. However, different latencies may affect total compression time more than an asymptotic difference of O(log n) [BSL02]. Randomly accessing a string, such as out-of-order writing during reconstruction, can become the main factor in execution time [BSL02]. The random access latency of mass storage can be higher by a factor of thousands compared to sequential access. Even though the asymptotic time bound is not affected, sequential access is always preferred [Tic84, Tri99]. The random accessibility assumption for strings is thus often not valid in practice. A communication round between two remote hosts also incurs a round-trip latency each time. The latency can increase the total time used for synchronization above that of some other algorithm which requires only a single communication round and produces a less compact encoding, but does not waste time on low-efficiency communication [Tri99].

Performance and space requirements for compression and decompression are typically asymmetric: decompression does not require the differencing phase or secondary compression, and thus decompression has smaller time and space requirements. For practical applications implementing the algorithms, the objective is often to perform synchronization as fast as possible, which implies a compromise between speed and compression ratio. “Good enough” solutions are often acceptable, because better solutions require more time and space [ABF02]. In general, the goal is to minimize the total synchronization event duration, which may require an application or use-case dependent combination of performance properties. For example, with remote synchronization, processing speed should exceed network bandwidth. The total event duration may include synchronizing multiple string pairs, allowing parallel interleaved synchronization, reducing round-trip latency cost per string. Often it is possible to adjust the operating parameters of algorithms due to the three-way trade-off between compression ratio, execution time and memory requirement [ABF02].

There are also more subtle performance issues, such as the capability to start outputting delta encoding immediately while still processing the input. When the output can be committed only after the complete delta encoding has been produced, the total synchronization time will always include all of the time required for compressing. A small decrease in encoding efficiency could be masked by the ability to start transmitting earlier, essentially eliminating initial latency and converting some of the compression time to data transfer time. Similarly important is the ability of a client to access a portion of the reconstructed version string while the reconstruction is still in progress. While this is also an implementation decision, algorithms often prevent accessing the strings until the algorithm has completed.

3 Local Differential Compression

When both the reference string and the version string are readable by the same host with low temporal cost, it is possible for a local differential compression algorithm to find all maximal repeats between the two strings by reading both strings completely before producing the delta encoding. Any algorithm iterating through both strings would be extremely slow and inefficient when the strings are stored on hosts separated by a slow network with high round-trip latency; it would be faster to simply copy the version string from one host to the other. Optimal differencing requires reading both versions of the data completely, and thus simple file copy operations on a local file system cannot be made faster using compression: reading both strings and writing one will always be slower than simply reading and writing the version string.

Using local differential compression, synchronization of strings can be performed by compressing a version string, transmitting the delta encoded string to the destination host, and then decompressing the delta encoding in presence of reference string. For example, to perform a software upgrade, the author of the software produces a patch file by using the old software version as reference string and new software version as version string. The author can then make the patch file available for downloading. Users can update their old software versions (reference strings) by decompressing the patch file, thus synchronizing their copy of the software with the version available on the server. Similarly, a document editor on a remote host could keep a cached copy of the original document and then produce a delta encoding from the modified version. The delta encoding can then be copied to another host for synchronizing the document with the remote host.

In section 3.1 we present historical algorithms based on finding a longest common subsequence, leading the way to the algorithms that most current implementations are based on. In section 3.2 we discuss an almost optimal algorithm utilizing dictionary search, and further discuss modifications to the algorithm that improve its time and space bounds to a level that allows practical implementations but loses optimality. In section 3.3 we discuss differencing using suffix trees and other similarly functioning data structures that allow optimally finding matching substrings with O(n) time and space bounds. In section 3.4 we discuss projective searching, where a lossy transformation is used to project the compared strings as signals into sub-spaces, and correlation of the signals is used to find repeat positions. In section 3.5 we discuss delta encoding and decoding, where the found differences are stored as compression results, for example into a compressed file, and reconstructed for synchronization. Finally, in section 3.6 we describe generally applicable methods for tolerating common superfluous differences that make efficient differential compression difficult.

3.1 Subsequence Searching

A common early solution to the differencing problem was finding the longest common subsequence of two strings. A longest common subsequence³ can be produced by deleting zero or more symbols from either string until they are identical. For example, one possible longest common subsequence for strings “abcde” and “acbed” is “abe”. According to Sankoff [San72], solutions to the problem of finding a longest common subsequence were first published in the fields of speech recognition and bioinformatics, instead of computer science. Needleman and Wunsch [NeW70] solved the problem in 1970 using dynamic programming with an O(mn²) time bound, for comparing protein sequences. Sankoff improved the time bound to O(mn). Wagner and Fischer [WaF74] specified the problem of finding the numeric Levenshtein distance as the string-to-string correction problem and presented an optimal algorithm for finding the distance. The algorithm could be extended to produce the longest common subsequence, but it has quadratic O(nm) time and space bounds. The algorithm is based on dynamic programming and iteratively calculates successive distances between increasingly longer pairs of prefixes using a matrix. Hirschberg [Hir75] found a way to improve the algorithm's space bound to O(n + m) by not storing results for unnecessary steps.
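As a brief illustration (not part of the cited works), the classic dynamic-programming recurrence underlying these algorithms can be sketched in Python as follows; it computes only the length of a longest common subsequence, and the two-row trick echoes Hirschberg's space refinement.

    def lcs_length(a: str, b: str) -> int:
        """Dynamic-programming LCS length (Wagner-Fischer style matrix),
        keeping only two rows: O(nm) time, O(min(n, m)) extra space."""
        if len(b) > len(a):
            a, b = b, a                      # make b the shorter string
        prev = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            curr = [0] * (len(b) + 1)
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    curr[j] = prev[j - 1] + 1
                else:
                    curr[j] = max(prev[j], curr[j - 1])
            prev = curr
        return prev[len(b)]


    print(lcs_length("abcde", "acbed"))      # -> 3, e.g. "abe"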

The Hunt-McIlroy [HuM76] algorithm combined ideas from previous algorithms and improved the expected execution time, even though the worst case time bound was O(mn log₂ n) and the space bound O(mn). The algorithm was implemented in the Unix diff tool to produce a human-readable list of differences from text documents. Implementations considered everything not included in the longest common subsequence as differences. Differencing was often used for source code version control and the atomic unit of change was a line of text.

While better than symbol-wise comparison, solutions based on the longest common subsequence have the shortcoming that they do not allow finding transposed repeats [Tic84], because a longest common subsequence cannot have out-of-order symbols. For example, in the previous example “c” and “d” are still present, they have just changed positions. Finding the longest common subsequence is equivalent to having an edit operation set of insert and remove, when a constant cost is used for the operations [ShS04]. To solve the transposition problem, Tichy [Tic84] described the problem as string-to-string correction with block move and presented an optimal algorithm that was not based on finding a longest common subsequence. Later, the longest common subsequence approach was improved to have O(min(m, n)·d) time and space bounds (where d is the number of differences) by both Miller and Myers [MiM85] and independently by Ukkonen [Ukk85]. Their algorithm is based on finding a shortest path in an edit graph [Mye86]. However, since the longest common subsequence method cannot find all repeats, we shall not discuss these algorithms in more detail. When diff-like output is required,

³ A subsequence does not need to be contiguous.

other algorithms can simulate finding a longest common subsequence by excluding output for transposed repeats.

Solving the differencing problem without finding a longest common subsequence was attempted by Heckel [Hec78], who presented an algorithm that would however sometimes falsely detect substrings as different. The algorithm first finds lines of text that are unique within a string and can be found in both strings, and infers that those lines must be the same. The range of matching lines is then extended both forward and backward until a differing line is found. When the strings have intra-repeats, the algorithm fails to detect the repeats between the strings. For identifying lines Heckel used hashing, which is the fundamental basis for later differencing algorithms. Heckel also suggested using the hashes for remote differencing.

The idea of Tichy's [Tic84] algorithm is to read and compare substrings starting from every position of the version string against every position of the reference string. For each position, if there is a match, the longest found maximal repeat is output as a copy directive, otherwise an insert directive is output. Tichy defined two edit operations that both allow combining a sequence of consecutive edit operations of the same type into a single operation when the operations target the same substring. The atomic unit of change is a single byte. A block move operation is defined as copying a substring from the reference string, whereas an add operation copies a substring from the delta encoding. Block moving allows referencing substrings from another string, regardless of the substrings' relative positions. With these edit operations the problem of differencing is solved by finding all maximal repeats between the two strings, no matter in which positions they are. Each repeat can be substituted with a block move operation, and each difference with an add operation.

Tichy's [Tic84] greedy algorithm operates as follows (see figure 6). The algorithm starts from the beginning of the version string (1) and iterates through every symbol (2). For each position of the version string, the algorithm iterates through every position of the reference string (3). When matching symbols are found, the algorithm extends the comparison further in both strings until non-matching symbols or the end of the reference string is found (10…11). If the found repeat is the first one for the version string position (9), or if the repeat is longer than the previously found repeat (12), the old repeat position is replaced with the new one (13…14). If a repeat was found, its position is output for delta encoding as a copy directive and the version string position is advanced by the found substring's length (6). When the end of the version string is reached, the algorithm stops. The output now includes all maximal repeats and their earliest occurrence position in the reference string. A greedy algorithm is one that in each situation selects the locally optimal choice based on available information, assuming that doing so every time will eventually solve the whole problem optimally. Tichy's algorithm is greedy because it chooses the longest repeat on each step.

Figure 6: Pseudocode for Tichy’s [Tic84] greedy search algorithm.
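Since the pseudocode of figure 6 is not reproduced here, the following Python sketch illustrates the same greedy search under simplifying assumptions: directives are collected into a plain list, and the (kind, position, length) directive format and function name are illustrative rather than Tichy's notation.

    def tichy_greedy(reference: bytes, version: bytes):
        """Greedy search sketch: for each version position, scan every reference
        position and emit the longest match as a copy directive, otherwise an
        insert directive for a single differing byte."""
        directives = []          # ("copy", ref_pos, length) or ("insert", byte)
        i = 0                    # current position in the version string
        while i < len(version):
            best_pos, best_len = -1, 0
            for j in range(len(reference)):          # every reference position
                k = 0
                while (i + k < len(version) and j + k < len(reference)
                       and version[i + k] == reference[j + k]):
                    k += 1                            # extend the match
                if k > best_len:                      # keep the longest repeat
                    best_pos, best_len = j, k
            if best_len > 0:
                directives.append(("copy", best_pos, best_len))
                i += best_len                         # jump past the repeat
            else:
                directives.append(("insert", version[i]))
                i += 1
        return directives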

The algorithm does optimally find all the maximal repeats that the version string and the reference string share [BuL97]. The found set of maximal repeats is however optimal only if a constant cost is assumed for copy directives. Alignment changes due to add, remove or transpose operations to the version string do not cause issues, because the algorithm compares against every possible position. While the algorithm operates within a Θ(1) space bound, the execution time bound is Ω(nm) and O(n²m), which makes the algorithm unsuitable for practical applications because with a long string the required time exceeds the time needed to copy the string. Especially reading through the reference string multiple times would make execution very slow, because mass storage is multiple orders of magnitude slower than working memory. Tichy proposed using the Knuth-Morris-Pratt pattern matching algorithm for faster substring matching. He identified that the preprocessing required by the pattern search algorithm could be done incrementally one symbol at a time, allowing first searching for one symbol only and then extending the search pattern after each new match.

3.2 Dictionary Searching

To avoid repeatedly searching through the reference string for matching substrings, a hash-based dictionary data structure can be used. A dictionary contains positions of short constant-length prefixes of the reference string and makes it possible to check whether a given prefix is not present without reading through the whole reference string for every check. Having the ability to check for a prefix in O(n/q) expected time (from a dictionary of size q), it is possible to construct an algorithm that finds all repeats from two strings faster than Tichy's greedy algorithm.

In subsection 1 we present an algorithm based on dictionary searching. The algorithm finds all maximal repeats longer than some small constant number of symbols, but still has O(nm) time and O(n) space bounds, even though the expected run time is close to Ω(nm/q). However, memory constraints make it impossible to keep all found substrings in memory, which prevents the algorithm from differencing long strings in practice. The quadratic worst case time bound also prevents compressing long adversarial strings. The following subsections discuss modifications to the algorithm that reduce the memory requirement. A common property of the modifications is limiting the amount of information the algorithm is allowed to store at one time, thus decreasing substring matching capability but allowing operation within nearly O(n + m) time and O(1) space bounds. Most practical implementations are based on such modified variations [ABF02]. In subsection 2 we discuss bounding the length of hash chaining in the dictionary due to hash collisions. We also discuss repairing the damage by keeping a history buffer and using multiple passes through the strings. In subsection 3 we discuss restricting the visibility of an algorithm by differencing within a sliding window instead of the complete string, thus requiring memory in proportion to the window's size instead of the string's length. In subsection 4 we discuss differencing a sparse subset of substrings, requiring less memory in proportion to the sparsity. A sparse differencing algorithm assumes that the repeats are long and that ignoring some intervals still allows finding partial repeats, which can then be extended to maximal.

3.2.1 Almost Optimal Greedy Algorithm

Obst [Obs87] presented an improvement to Tichy's algorithm by using shingling [You06] to search for variable-length substrings more efficiently. A string is viewed as a contiguous sequence of overlapping short fixed-length seed prefixes of length p (p-grams) starting from every successive position (see figure 7). Fingerprint hash values are calculated for every seed prefix of the string and for each seed prefix of the search patterns. Each fingerprint maps an 8p-bit seed prefix (one of 2^(8p) possible values) into a q-bit hash (one of 2^q values), such that q ≤ 8p [You06]. Obst's version of the greedy algorithm is essentially a multiple-pattern version of the Rabin-Karp pattern matching algorithm [KaR87, BeM99]. Obst's algorithm solves the exact string matching problem with multiple search patterns, where the search patterns are all possible substring prefixes of length p from the version string.

Figure 7: Shingling is used to compute fingerprints into a dictionary data structure.

The Rabin-Karp pattern searching algorithm [KaR87] calculates a hash value from the search pattern and then iterates through every position of the string, calculating a corresponding hash value for each overlapping substring of the same length as the search pattern. The hash value from each position is compared against the search hash value. A hash function must always produce the same output for the same input substring, so a failed comparison due to differing hash values is always a reliable result, but a successful comparison might be a false match. When the hash values do not match, the substrings cannot be identical and the algorithm proceeds to the next string position. Successful hash comparisons are verified using a symbol-wise substring comparison, to filter out the false matches. One pattern can be searched from a string in O(np + h) worst case and O(n + h) expected time [KaR87], where p is the length of the search pattern and h is the time required to calculate a single hash value.

A dictionary is a hash table data structure that uses fingerprints as keys and stores string positions of the reference string corresponding to each fingerprint, essentially mapping fingerprints to substrings (see figure 8). Fingerprints from the version string can be used to query reference string positions from the dictionary. Zero results indicate with certainty that the substring cannot be found in the reference string, without the need to iterate through the reference string. A hash function may produce identical results for multiple string positions, because the same prefix can occur multiple times in different positions. Multiple different prefixes may also produce the same hash value due to a hash collision, when the hash function inadvertently outputs identical values for different prefixes. The dictionary data structure must thus be able to store multiple string positions for each fingerprint by implementing hash chaining. Each chain entry must be verified separately, requiring Ω(c) symbol-wise comparisons (where c is the number of collisions). Constant-time comparison of prefixes is thus only possible when a dictionary is uniformly populated and each entry contains only a small constant maximum number of prefix positions [ABF02]. A worst case input contains only a long sequence of the same symbol, causing all fingerprints to be mapped into the same hash table chain and thus causing the dictionary to act like a list with an O(n) time bound for lookups [ABF02]. The expected query time is O(n/q) with random content and a perfectly uniform hash function (where q is the maximum size of the dictionary). Because in practice content is not random and hash functions are not perfect, the query time is higher. A size-unrestricted dictionary requires O(n − p + 1) = O(n) space [ABF02].


Figure 8: A dictionary data structure maps fingerprint values (hashes) into position information.

Obst's [Obs87] algorithm is similar to Tichy's algorithm, but instead of searching through the whole reference string for each position in the version string, it compares hash values from a dictionary. The algorithm executes as follows (see figure 9). Initially the dictionary is empty (1). The reference string is first iterated through and for every position a fingerprint of the respective substring prefix of length p is calculated (2…6). Calculating the fingerprints for every position thus requires O(nh) time. After the reference string has been iterated through, there are Θ(n − p + 1) positions stored in the hash table. The version string is then iterated through and a fingerprint is calculated for each position, the same way it was done for the reference string, requiring O(mh) time (7…8). Each fingerprint is queried from the dictionary (9). When there is no matching entry in the dictionary, the algorithm can be sure that the seed prefix cannot be found anywhere in the reference string. Thus, the reference string cannot contain any substring that has the same prefix. When one or more matching fingerprints are found, each must be verified using symbol-wise comparisons (11…12), requiring O(p) time per dictionary chain entry. When a prefix is verified to be identical, the prefix is a repeat and the comparison is extended forwards in both strings until a differing symbol is detected. All dictionary chain entries are always checked and the longest matching substring is chosen for delta encoding as a copy directive (16…17). If the symbol-wise comparison of a prefix fails, the algorithm proceeds as if the hash values had not matched (18). When no matching prefixes are found, the symbol at the version string's iterator position is output as a differing symbol to the encoding algorithm. Multiple sequential differences can be combined into one insert directive. If a match was found, the iterator position in the version string is moved to the next symbol following the matching substring, to avoid overlapping substring references. A single-pass algorithm can be considered to have an O(n) time bound even if it makes a limited number of random access reads [ABF02]. We note the algorithm can be modified for approximate pattern matching by allowing some non-matching symbols when extending the verified match.


Figure 9: Pseudocode for Obst’s [Obs87] differencing algorithm.
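Since the pseudocode of figure 9 is not reproduced here, the following Python sketch shows the structure of such a dictionary-based differencer under simplifying assumptions: Python's built-in hash over fixed-length seeds stands in for a rolling Rabin fingerprint, and unbounded chaining is used. All names are illustrative.

    def obst_difference(reference: bytes, version: bytes, p: int = 4):
        """Dictionary-search sketch: index every p-byte seed prefix of the
        reference string, then scan the version string and extend verified
        seed matches into maximal repeats."""
        # Build the dictionary: fingerprint -> list of reference positions.
        index = {}
        for j in range(len(reference) - p + 1):
            index.setdefault(hash(reference[j:j + p]), []).append(j)

        directives, i = [], 0
        while i < len(version):
            best_pos, best_len = -1, 0
            for j in index.get(hash(version[i:i + p]), []):
                if reference[j:j + p] != version[i:i + p]:
                    continue                      # hash collision, not a real match
                k = p
                while (i + k < len(version) and j + k < len(reference)
                       and version[i + k] == reference[j + k]):
                    k += 1                        # extend the verified match forward
                if k > best_len:
                    best_pos, best_len = j, k
            if best_len >= p:
                directives.append(("copy", best_pos, best_len))
                i += best_len
            else:
                directives.append(("insert", version[i]))
                i += 1
        return directives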

All fingerprints from the reference string must be stored in the dictionary, so the space bound has worsened from O(1) to O(n) when compared to Tichy's algorithm. None of the search patterns can be shorter than the seed prefix length, because the fingerprinting function's input length sets a lower bound for the length of matchable prefixes. No repeats shorter than the prefix will be found, and the shorter repeats will be encoded as differences. The algorithm is thus optimal only for strings containing only repeats that are as long as or longer than the chosen seed prefix length. The ability to find only longer matches is not necessarily a problem, depending on the content and on the efficiency of the encoding method. In practice, encoding schemes must ignore short repeats whose encoding would require more space than encoding the repeats as extensions of neighboring insert directives. The seed prefix length can be given as a parameter. Choosing a shorter prefix length will cause false matches more often, causing more symbol-wise substring comparisons that will not match. The optimal prefix length depends on the delta encoding algorithm and the string content: the average encoded reference must be shorter than encoding the repeats as additions [You06]. For naive encoding schemes, the optimal seed length is between 12 and 18 bytes, depending on content [Chr91, ABF02]. For state-of-the-art encoding schemes such as the one used in the LZMA compression algorithm [Igo12], the optimal seed length is between 2 and 4 bytes, depending on string length (longer strings require more bits for position information).

A good hash function must be fast and produce as uniform a distribution of hash values as possible, to prevent unevenly populating the hash table. If the hash function produces a non-uniform distribution of values or the set of possible hash values is too small, many strings will have the same hash value, which will cause a large number of verification comparisons [ABF02]. A verification comparison requires randomly accessing the reference string, which can be slow if the prefix positions are spread wide, causing disk cache misses. In particular, strings that are permutations of one another should not produce the same hash values, because such cases are common in practice [SNT04]. When the strings are similar, both will have a similar fingerprint collision distribution, meaning that the longer the chains are, the more often the longest chains have to be checked. Repetitive patterns are thus problematic, because each occurrence will have the same fingerprint.

Calculating a separate hash value for every seed prefix would require including each string position p − 1 times in the hash calculation, due to overlapping. However, it is possible to use a simple and relatively fast rolling hash function [Rab81, ABF02]. After initializing the hash function and calculating the initial hash value h(S_i), the next hash value h(S_{i+1}) can be calculated from h(S_i) in constant time, independent of the input length. Rolling hash functions calculate hash values from a sliding input window: each time the string input position is advanced, the symbol that drops out of the window's range is subtracted from the hash value and the next incoming symbol is added into the hash value. On each step one symbol is dropped and one symbol is added. Preceding content thus does not affect the hash value, and two identical seed prefixes produce identical hash values. Reichenberger [Chr91] compared different hash functions for differencing and identified Rabin fingerprinting [Rab81] as suitable due to its ability to be calculated fast and to provide a uniform enough distribution of hash values. Another commonly used fast hash function is Adler-32, which is a variation of the Fletcher checksum [Tri99]. Reichenberger inferred that using longer prefixes for hashing produces a more uniform distribution of hash values.
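As an illustration, the following Python sketch implements a simple polynomial rolling hash of the kind described above; it is not Rabin's GF(2) fingerprint discussed next, and the base and modulus constants are arbitrary choices made for the example.

    class RollingHash:
        """Polynomial rolling hash over a window of p bytes:
        h = (s[0]*B^(p-1) + s[1]*B^(p-2) + ... + s[p-1]) mod M."""
        def __init__(self, window: bytes, base: int = 257, mod: int = (1 << 61) - 1):
            self.base, self.mod, self.p = base, mod, len(window)
            self.top = pow(base, self.p - 1, mod)   # weight of the outgoing symbol
            self.value = 0
            for symbol in window:
                self.value = (self.value * base + symbol) % mod

        def roll(self, outgoing: int, incoming: int) -> int:
            """Slide the window one symbol: drop `outgoing`, append `incoming`."""
            self.value = (self.value - outgoing * self.top) % self.mod
            self.value = (self.value * self.base + incoming) % self.mod
            return self.value


    # Usage: fingerprint every 4-byte seed prefix of a string in O(n) total time.
    data = b"abcdefg"
    p = 4
    h = RollingHash(data[:p])
    fingerprints = [h.value]
    for i in range(p, len(data)):
        fingerprints.append(h.roll(data[i - p], data[i]))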

The Rabin fingerprinting function is defined by choosing an irreducible polynomial, the degree of the polynomial (polynomial coefficients for each successive symbol), and an initialization vector representing the empty string (that can be viewed as a prefix for any input string) [You06, appendix A]. The function is defined as h(S) = f(S) mod q(t), where q(t) is an irreducible polynomial of degree k over the finite field GF(2) [You06]. For calculating the hash, the string input is viewed as a large number f(S) = s_0·t^(n−1) + s_1·t^(n−2) + … + s_(n−1) and reduced modulo the polynomial q(t) to acquire a residual [You06]. The operation is not normal arithmetic modulo. The irreducible polynomial is stored as a bit string of k + 1 bits, where each bit represents a coefficient of q(t) [You06]. There are 2^k/k irreducible polynomials to choose from (2^27 for 32 bits) [You06]. Proportionally, there are only 1/log(2^k) prime numbers, while there are 1/k irreducible polynomials [You06]. The initialization vector is required to produce different results for the polynomial f(S) = 0 of different lengths (a string consisting of only zero bits) [You06]. A pseudo-random generator is used to generate the polynomial, which is then tested for irreducibility [You06]. This allows specifying new hashing functions that produce different hash values for an input than some other hashing function using a different polynomial. The function can be tested empirically and the polynomial chosen based on test results [Tri99].
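To make the non-arithmetic modulo operation concrete, the following Python sketch reduces a bit string modulo an irreducible polynomial over GF(2) using shift-and-xor; it computes a plain (non-rolling) Rabin-style fingerprint, and the example polynomial is an arbitrary illustrative choice, not one prescribed by the cited sources.

    def gf2_fingerprint(data: bytes, poly: int, k: int) -> int:
        """Interpret `data` as a polynomial over GF(2) and return its residue
        modulo `poly`, an irreducible polynomial of degree k (k+1 bits)."""
        f = int.from_bytes(data, "big")       # coefficients of f(S) as one big integer
        while f.bit_length() > k:             # while deg(f) >= k
            shift = f.bit_length() - (k + 1)
            f ^= poly << shift                # GF(2) subtraction is xor
        return f                              # k-bit residue


    # Example: degree-8 polynomial t^8 + t^4 + t^3 + t + 1 (bit pattern 0x11B),
    # chosen here only for illustration.
    print(hex(gf2_fingerprint(b"abcd", poly=0x11B, k=8)))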

When the hash value range is 0…q, the expected number of hash collisions using Obst's algorithm is O((n − p)/q) = O(n/q). The worst case time bound with adversarial inputs is now proportional to the number of matching checksums and is O((n − p + 1)m). For completely uncorrelated or identical strings, the expected execution time is O(n + m). In the general case, the expected time is O(m + nm/q) = O(m), when q = n. In practice Obst's algorithm often does not achieve the expected execution time, because structured strings have sequences of repeated symbols (such as zero padding) that cause hash collisions. Multiple hash collisions then accumulate into a single hash chain [TMS02]. When a long chain is encountered, a differencing algorithm could test whether the next string offset hashes into a shorter chain, and use the shorter chain instead to reduce worst-case behaviour [TMS02].

The hash value range also needs to be small enough to allow comparing two hash values efficiently. Hash values can be scaled to a given value range by using modulo arithmetic. In practical implementations the modulus q is usually chosen to be a prime number in order to avoid any repeating structure in the string. The most trivial hash function would be to interpret the symbols in the prefix as a base-n integer (directly interpreting the bits as an integer, instead of symbols). Calculating the hash would be fast, but the probability of a uniform distribution would be low. For example, most bytes in an English-language text document have zero as the most significant bit, causing part of the hash function's output range to always be left unused. Searching for an entry in the dictionary can be made faster by using a small hash prefix table that contains partial hash values [AAJ04]. When checking whether a given hash value is stored in the dictionary, the prefix table is first checked for the partial hash value [AAJ04]. Another similar technique is Bloom filters [TRL12], which support probabilistic set membership queries for the fingerprints. A Bloom filter is a composite mask of bits from fingerprints of the member set and may return a false positive due to more bits being set as the number of members grows. Negative member queries however are reliable and allow skipping queries from the dictionary.
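As an illustration of the membership-filter idea, the sketch below implements a minimal Bloom filter; the double-hashing construction and the sizes are illustrative assumptions of this example, not details taken from [TRL12].

    class BloomFilter:
        """Minimal Bloom filter: set bits for each member fingerprint; a query
        answering 'absent' is reliable, 'present' may be a false positive."""
        def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
            self.num_bits, self.num_hashes = num_bits, num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, fingerprint: int):
            # Derive several bit positions from one fingerprint (double hashing).
            h1 = fingerprint & 0xFFFFFFFF
            h2 = (fingerprint >> 32) | 1
            for i in range(self.num_hashes):
                yield (h1 + i * h2) % self.num_bits

        def add(self, fingerprint: int):
            for pos in self._positions(fingerprint):
                self.bits[pos >> 3] |= 1 << (pos & 7)

        def may_contain(self, fingerprint: int) -> bool:
            return all(self.bits[pos >> 3] & (1 << (pos & 7))
                       for pos in self._positions(fingerprint))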

There are other fast string searching algorithms, such as the Knuth-Morris-Pratt algorithm and the Boyer-Moore algorithm, that are more efficient than Rabin-Karp for searching for a single pattern. These algorithms avoid comparing some of the symbols by requiring preprocessing for each search pattern. After performing a comparison, these algorithms infer the number of symbols that can be skipped because they cannot match the pattern. They cannot infer the skip distance when there is an arbitrary number of search patterns. With multiple-pattern searching, the preprocessing required by these algorithms needs more calculation than Rabin-Karp, which does not attempt to skip symbols.

3.2.2 Bounded Hash Chaining

The main problems of Obst's greedy dictionary search algorithm are the O(nm) worst case time bound and the O(n) space bound. The time bound is problematic because in practice hash chains can become long with arbitrarily long strings [ABF02]. The linear space bound is worse than Tichy's greedy algorithm's constant space bound. Improving the asymptotic time and space bounds of a search algorithm while still using the fingerprinting approach can be done by limiting or discarding collected information. Reducing the available information will also prevent an algorithm from optimally finding all matching substrings [ABF02].

Trendafilov et al. [TMS02] describe a restricted hash chaining policy, where collisions due to identical seeds within the reference string are not appended into the collision chain. Let us assume that during fingerprinting of a reference string, the algorithm encounters a collision leading to an identical earlier found repeat at position i and of length k, and the latter found repeat could be matched longer than the earlier one. Instead of adding the position of a duplicate repeat into the dictionary, the algorithm jumps (k − p − 1) positions to the end of the repeat and continues fingerprinting. No fingerprints from the substring [S_i … S_{i+k−p}) are inserted. Each hash chain can contain more than one position, but only from unique seed prefixes that caused a hash collision. If the latter found repeat is not longer, the earlier repeat can be used to encode the copy directive. When differencing against the version string and the algorithm encounters a repeat, it can jump (k − p − 1) positions and check whether that leads to another repeat that extends further than where the first fingerprint match led (see figure 10). This improvement limits worst case behavior and uses memory more efficiently, but the dictionary still has an O(n) space bound.

Figure 10: Cascading position jumps for colliding hash values.

An asymptotically effective space and time limitation is restricting the chain length per dictionary entry to some small constant number c, instead of storing all found seed prefixes that cause fingerprint collisions [BuL97, ABF02, ChM06]. This is equivalent to comparing against only some of the stored positions of a fingerprint in the optimal greedy algorithm. Having only a constant number c of fingerprints to compare per string position lowers the time bound from O(nm) to O(m + q), where q is the size of the dictionary (a constant typically in the range of 2^16 to 2^24). The space requirement is decreased to O(qc) = O(1) if only prefix positions and match lengths are stored in the dictionary.
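A minimal sketch of such a chain-limited dictionary is shown below, assuming a fixed table size q and a keep-first policy for positions beyond the chain limit c; the class and method names are illustrative.

    class BoundedChainDictionary:
        """Fingerprint dictionary with at most `c` stored positions per entry.
        Further colliding positions are simply rejected (keep-first policy)."""
        def __init__(self, q: int = 1 << 16, c: int = 4):
            self.q, self.c = q, c
            self.table = [[] for _ in range(q)]   # q chains of (fingerprint, position)

        def insert(self, fingerprint: int, position: int) -> bool:
            chain = self.table[fingerprint % self.q]
            if len(chain) >= self.c:
                return False                      # chain full: reject the new position
            chain.append((fingerprint, position))
            return True

        def lookup(self, fingerprint: int):
            # Return candidate positions; each must still be verified symbol-wise.
            return [pos for fp, pos in self.table[fingerprint % self.q]
                    if fp == fingerprint]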

Burns and Long [BuL97] and Ajtai et al. [ABF02] note that the change will cause degradation of the output compression ratio on adversarial inputs, because a large number of positions that would match are not stored in the dictionary. Substring search (matching the version string to the reference string) will have limited visibility and the algorithm will no longer be optimal in finding matches. However, if one position cannot find a match due to a collision, some other position might, because any seed prefix within a repeat is sufficient to locate the repeat [Mac99]. Ajtai et al. found that a restricted algorithm finds more differences by a factor of 2 to 4 relative to optimal differencing, on typical input. Dictionary size directly affects compression performance: the larger the dictionary, the more potential substring positions will be available for comparison. The hash function's quality affects the dictionary's fill ratio: the more hash collisions a poor hash function causes, the more of the dictionary remains empty. With unlimited chaining, hash collisions make the Obst algorithm slower, but with limited chaining the collisions prevent matching the rejected repeats. This is equivalent to having a smaller dictionary when considering repeat matching capability. The dictionary size should be increased linearly relative to the length of the reference string, or there will not be enough dictionary entries for the same number of fingerprints per equal-size subsection of a string, leading to a decreasing probability of finding matches as the strings become longer.

The number of fingerprints calculated from the strings will still be the same, so when a fingerprint collision occurs, some heuristic is needed to decide which one of the two positions with colliding fingerprints to keep stored in the dictionary and which one to ignore. Limiting the number of stored colliding positions can block other, different substrings that have the same seed prefix and fingerprint [ABF02]. When there is a fingerprint collision, the existing stored fingerprint forces the algorithm to encode the substring as an insert directive [ABF02]. Even if there is a fingerprint leading to a matching substring, it might not be the longest possible match, because the longer match has been blocked from being inserted into the dictionary, or the longest match has not yet been found [ABF02]. The search algorithm might later find a match for the remaining suffix of the matching substring, but delta encoding the complete reference will then require two or more directives instead of just one [ABF02].

Ajtai et al. [ABF02] suggest different collision policies; one is to always reject colliding fingerprints and thus store only the first position for each fingerprint. We denote this as the keep-first policy. MacDonald [Mac99] suggests using a pseudo-random Boolean function to simulate extra bytes for the fingerprint when deciding whether to reject or replace the fingerprint, and to preserve a uniform distribution of fingerprints. A counter variable can be used to give the replacement a probability of 1/(c + 1), where c is the number of earlier collisions for the dictionary entry [Mac99]. Either way, only one fingerprint can be stored per dictionary entry and the dictionary does not implement chaining. With this policy, a reference string is fingerprinted completely before iterating through a version string, just as with Obst's algorithm. This way the dictionary will contain all of the fingerprints from the reference string that it is possible to store, which makes it possible to match substrings from the latter part of the reference string against the early part of the version string. Because a large number of seed prefixes cannot be matched, repeat extension by symbol-wise comparison is done backward in addition to forward, in case the matched fingerprint is not for the prefix of a repeat. With extension in both directions, it is sufficient that the fingerprinted seed prefix is anywhere within a repeat.

The backward comparison modification causes the algorithm's time bound to become super-linear instead of O(n) [ABF02]. When the algorithm finds a match, it can jump forward by the length of the match, but it could miss some overlapping longer matches. Another issue is that backward-extended references may overlap with earlier, already encoded output [ABF02]. When processing large files and keeping the dictionary size constant, the dictionary will eventually become full and the algorithm will only be able to make comparisons against the fingerprints found from the earlier part of the reference string [ABF02]. The longer the reference string or the smaller the dictionary, the more likely these stored positions are to be localized to the beginning of the reference string [ABF02]. As the assumption is that matching substrings between the strings are positionally close to each other, not being able to compare against substrings within the latter part of the reference string becomes a bigger problem the longer the reference string is [ABF02].

To mitigate the decrease in capability to match repeats, Ajtai et al. [ABF02] suggest that corrective buffering can be used to increase the number of found matching substrings and reduce the number of delta encoding directives by retroactively changing recent output decisions before committing them to the delta encoding. Encoding correction can be implemented with a lookback buffer containing a number of recent encoding directives (see figure 11). The buffer must be both modifiable and searchable, and it can be implemented in various ways. Tail correction requires inspecting the last stored directive on every iteration; a simple fixed-size buffer with “first in, first out” semantics allows fast access to the last directive. Within the buffer, simplified temporary encoding directives can be used. Tail correction checks whether the last directive inserted into the buffer overlaps with the currently found match. When the last directive is a reference and can be completely merged into the current directive, the last directive is combined with the current one. When the last directive is an insert directive that partially or completely overlaps with the current copy directive (due to backward extension), the overlapping suffix of the insert is removed. For example, consider a substring “abcabcdefg”: if the algorithm has just found a repeat position for the first “abc”, but then later finds a repeat anywhere in the remaining longer substring, the algorithm can replace the encoded copy directive with the longer one.


Figure 11: Correcting the previous choices with better choices by using a history buffer.
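The following Python sketch illustrates tail correction only, under the assumption that directives are tuples of the hypothetical form ("copy", ref_pos, ver_pos, length) and ("insert", ver_pos, data); the merging rules shown are a simplification of those described above.

    def tail_correct(buffer, new_copy):
        """Tail correction sketch: merge or trim the last buffered directive when a
        newly found copy directive (possibly extended backwards) overlaps it.
        Directives: ("copy", ref_pos, ver_pos, length) / ("insert", ver_pos, data)."""
        _, ref_pos, ver_pos, length = new_copy
        if buffer:
            last = buffer[-1]
            if last[0] == "copy" and last[2] >= ver_pos and \
                    last[2] + last[3] <= ver_pos + length:
                buffer.pop()                      # previous copy is contained: drop it
            elif last[0] == "insert":
                overlap = (last[1] + len(last[2])) - ver_pos
                if overlap >= len(last[2]):
                    buffer.pop()                  # insert fully overlapped by the copy
                elif overlap > 0:
                    buffer[-1] = ("insert", last[1], last[2][:-overlap])  # trim suffix
        buffer.append(new_copy)


    # Usage: the buffered insert "xyz" overlaps the backward-extended copy by one symbol.
    buf = [("insert", 10, b"xyz")]
    tail_correct(buf, ("copy", 40, 12, 8))
    print(buf)   # [('insert', 10, b'xy'), ('copy', 40, 12, 8)]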

Another collision policy, suggested by Burns [BuL97], is to clear the dictionary entry after finding a repeat, denoted as the flushing policy. This avoids the problem of the dictionary becoming read-only on long strings. Because the respective positions of matching substrings are assumed to be near each other, a further assumption can be made that after some differing symbols the strings will again become identical and the search will find a pair of synchronized offsets. The reference string is not fingerprinted completely before differencing against the version string; instead both strings are processed in parallel. On each step, fingerprints are calculated from the seed prefixes of both strings. When a string position from the other string is already found in the dictionary, a possible repeat has been found and the verification comparison is done; otherwise the collision is ignored. The dictionary does not need to implement chaining and thus the space bound is O(q) = O(1). After a repeat is found, the dictionary entry is cleared and the algorithm resumes hashing as if the strings started from the current position and the previously processed data had never existed. None of the substrings already fingerprinted can be matched later. Similarly, if a repeat is located further along in the strings and it has not yet been found, it cannot be matched. Another issue is that if there is a shorter repeat preceding a synchronized offset, the algorithm assumes the offsets are nonetheless synchronized and outputs a copy directive for the short repeat. The algorithm might later match the latter part of the longer repeat, but requires another copy directive to encode it. When the strings are not transposed and the version string contains only insertion and deletion edits, the flush policy approach will approximate the optimal greedy algorithm's search ratio closely [ABF02]. The time bound of the algorithm is O((n + m)p + q) [BuL97].

Ajtai et al. [ABF02] presented an improved version of the flushing algorithm that does not clear the dictionary when a repeat is found. We denote this as the replace policy. Instead of ignoring fingerprint collisions, the earlier position is replaced with the most recent position corresponding to a fingerprint. Instead of becoming read-only, the dictionary is updated even when full. Chedid and Mouawad [ChM06] propose storing the k latest fingerprints using a circular linked list. Each time a fingerprint collision occurs and there already are k positions stored, the smallest position is removed from the collision chain. Storing some past fingerprints in the dictionary

allows matching some spatially close transpositions. Repeats are extended both backwards and forwards, again causing a super-linear worst case time bound.

Ajtai et al. also extended the correction: general correction searches the whole lookback buffer, instead of checking only the last item (see figure 12). Directives can be corrected multiple times, and the oldest directives are committed when the buffer becomes full. Ajtai et al. note that encoding correction improves the capability of finding transposed substrings when the substrings are close enough to each other that the directive related to the previously found substring is still stored in the buffer. If the lookback buffer is allowed an unlimited size and the dictionary stores the latest fingerprint, instead of blocking fingerprint collisions, an algorithm can find all transposed strings. General correction also makes it possible to replace multi-part references caused by “spurious” repeats with longer single references that are found later. When processing the version string linearly, the string offsets in the buffer are in increasing order, and there are no duplicate positions in the buffer. These two properties allow using binary search within the buffer. A simple buffer does not allow fast insertion of new directives, which prevents some corrections from being done; a balanced binary tree can be used instead. Noop directives can be used in place of removed directives. Since each iteration may now require a search in the buffer, the time bound of the whole algorithm becomes O(n log₂ b), where b is the size of the buffer.


Figure 12: Pseudocode for one-pass chain-limited dictionary searching with general correction.
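Since figure 12 is not reproduced here, the sketch below illustrates only the buffer lookup used by general correction: a binary search over buffered directives ordered by version-string offset. The parallel-list representation is an assumption of this example, not the structure used in [ABF02].

    import bisect

    def find_overlapping(offsets, lengths, ver_pos):
        """General-correction lookup sketch: directives are kept ordered by
        version-string offset; `offsets`/`lengths` are parallel lists describing
        the version range each buffered directive covers. Returns the index of
        the directive covering `ver_pos`, or -1 if none does."""
        i = bisect.bisect_right(offsets, ver_pos) - 1   # last directive starting <= ver_pos
        if i >= 0 and ver_pos < offsets[i] + lengths[i]:
            return i
        return -1


    # Usage: three buffered directives covering version ranges [0,5), [5,12), [20,24).
    print(find_overlapping([0, 5, 20], [5, 7, 4], 10))   # -> 1
    print(find_overlapping([0, 5, 20], [5, 7, 4], 15))   # -> -1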

Ajtai et al.'s [ABF02] empirical testing showed that, for dictionaries of equal size and compared to the flush and replace policies, the keep-first policy produces a better matching ratio for short strings due to finding longer matches from the latter part of the reference string, but it is also slower and loses matching ratio as the strings become longer. Ajtai et al. conclude that none of the policies is clearly better for short strings, but with long strings the keep-first policy prevents finding repeats and thus the replace policy is preferred. Ajtai et al. also suggest using two dictionaries, one for each string, to allow storing separate positions for two colliding fingerprints that do not correspond to a repeat. We note that more complex heuristics could be used, such as using a cost function to evict the least-referenced fingerprints when using limited-length chaining, but having to store counter variables would make the dictionary larger.

3.2.3 Sliding Window

To attempt to solve the problem of matching transposed strings while avoiding filling the dictionary, Trendafilov, Memon and Suel [TMS02] suggest a sliding window approach. This method is based on the assumption of spatial locality: repeats are assumed to have relatively close positions, even when transposed. Thus, choosing some large enough arbitrary length w for the window, the assumption is that corresponding repeats from both strings are contained within the sliding window. Fingerprinting and subsequent differencing are done only within the window. The reference string is fingerprinted one window at a time, as if the window were the whole reference string (see figure 13).

When the version string fingerprinting position has advanced enough, some heuristic function will trigger window advancement. For example, a weighted average of recently found repeat positions (weighted by copy length) can be used as a threshold [TMS02]. The advancement function will clear the dictionary and increase the window's starting position by some number of symbols chosen heuristically, e.g., half the window's size or a number of symbols based on the distribution of fingerprints stored in the dictionary. After sliding the window, the algorithm is restarted from the new window position. With the window size w independent of the lengths of either string, the sliding window reduces the algorithm's space bound to O(w) and its worst-case time bound to O(nw) [TMS02].
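A minimal sketch of such an advancement heuristic is shown below, assuming the copy-length-weighted average trigger mentioned above; the threshold fraction and function names are illustrative.

    def should_advance(window_start: int, window_size: int,
                       recent_copies, threshold: float = 0.5) -> bool:
        """Trigger window advancement once the copy-length-weighted average of
        recently matched reference positions passes a fraction of the window.
        `recent_copies` is a list of (ref_pos, length) pairs."""
        total = sum(length for _, length in recent_copies)
        if total == 0:
            return False
        weighted = sum(pos * length for pos, length in recent_copies) / total
        return weighted >= window_start + threshold * window_size


    def advance_window(window_start: int, window_size: int, dictionary) -> int:
        """Clear the fingerprint dictionary and slide the window half its size."""
        dictionary.clear()
        return window_start + window_size // 2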

The sliding window method is compatible with both the Obst greedy algorithm and the chain-limited modified versions of the algorithm. When combined with the greedy method, an algorithm will optimally find all repeats within the sliding window's range only, and cannot find any repeats crossing or outside the sliding window boundary [BeM99]. When used with a chain-limited dictionary and the keep-first eviction policy, the sliding window avoids overflowing the dictionary and gains the ability to match more of the later repeats, at the expense of losing the ability to match against early repeats.

Figure 13: Sliding window for fingerprint dictionary.

The sliding window's size can be chosen based on the amount of memory available or based on a prior empirical study of the effect on compression ratio using typical input strings. The window size can also be dynamically adjusted during execution based on the dictionary fill ratio: while the fill ratio is still low, the window can be expanded instead of sliding the range forward. A sliding window is commonly used in non-differential compression, and was first used by the Ziv-Lempel [ZiL77] algorithm. Obst's algorithm modified to use a sliding window is similar to the widely used Deflate algorithm [Kat96], used for the zip compression format. Deflate hashes 3-grams with a 64 kibibyte dictionary and limits the length of fingerprint chains using a remove-oldest eviction policy.

Instead of reducing memory requirements, Shapira and Storer [ShS03] propose the sliding window as a solution for preventing read-write conflicts of in-place reconstruction without converting copy directives into insert directives. Moving the sliding window forwards prevents referencing any position preceding the window, and eliminates any conflicting reads crossing the window boundary. However, this also prevents referencing any transposed substrings that cross the window boundary. To remedy this problem, Shapira and Storer [ShS04] propose first finding supermaximal repeats and then constructing a temporary string that contains the repeats in a rearranged ordering more suitable for the sliding window approach. Computing the new ordering requires O((n/k)²) time, where k is a chosen constant maximum length for repeats. The found repeats are sorted into a list from longest to shortest and then numbered. The repeats are then inserted into the temporary string in order, skipping those that overlap any already inserted substring. The temporary string is then differenced and encoded together with an array of the rearranged repeat identifier numbers.

3.2.4 Sparse Fingerprinting

Ajtai et al. [ABF02] describe a checkpointing method to improve the execution time and reduce the dictionary size of the dictionary differencing algorithm by increasing the granularity of fingerprinting. A similar approach was described by Bentley and McIlroy [BeM99] for non-differential compression, to allow matching far-apart long repeats outside a sliding window. The method is based on the assumption that common substrings between the reference and version strings are relatively long and occur in the same relative order. Instead of attempting to insert fingerprints for every position into the dictionary, only fingerprints calculated from positions satisfying a position selection function are inserted. Ajtai et al. suggest either fingerprinting every s'th position, or using the equation f mod s = z, where f is a fingerprint and z is chosen by first calculating a fingerprint from a random position. The granularity of fingerprinting can be chosen arbitrarily by adjusting s. We however note that the latter function requires calculating fingerprints for every position and, depending on the distribution of f, may be triggered only at the randomly chosen position.

Sparse fingerprinting allows storing fingerprints from a range s times bigger, but the range is filled with gaps [ABF02]. Discarding some portion of the fingerprints should not degrade match-finding capability by a large degree, because finding any matching section of a long common substring is sufficient for extending the range of the match [ABF02]. As the matching seed might be in any position within a longer matching substring, the match must be extended both backward and forward. Finding a repeat requires that at least one seed is completely contained within the repeat, implying that some repeats shorter than s + p − 1 symbols cannot be found (see figure 14). Bentley and McIlroy [BeM99] used sparse fingerprints as a secondary differencing method with a sliding window to find repeats that are outside the sliding window.
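A minimal sketch of the two selection predicates, assuming rolling fingerprints are already being computed for every position:

    def select_every_sth(position: int, s: int) -> bool:
        """Checkpointing by position: keep the fingerprint of every s'th position."""
        return position % s == 0


    def select_by_value(fingerprint: int, s: int, z: int) -> bool:
        """Checkpointing by fingerprint value: keep positions whose fingerprint
        satisfies f mod s == z (content-dependent sampling)."""
        return fingerprint % s == z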

Sparse fingerprinting effectively allows choosing an arbitrary length for a string, because differencing with n/s = g fingerprints requires approximately the same amount of memory as differencing with all fingerprints of a string of length g [ABF02]. It becomes possible to compress arbitrarily large strings by adjusting s as a function of string length [ABF02]. Choosing s can also be done dynamically based on available memory. Sparse fingerprinting can be combined with both bounded chaining and a sliding window, for very long strings. The fingerprinting granularity can be denser when using the flush or replace fingerprint collision policy on the dictionary [ABF02]. With the keep-first policy, using too dense a sampling causes the dictionary to become full and lose the ability to accept new fingerprints.

Figure 14: With sparse fingerprinting, short repeats may not be found, and repeats must be extended both forwards and backwards because fingerprints do not represent prefixes.

Ajtai et al. [ABF02] note that by iterating through the strings again an algorithm can attempt to improve the choices made on the first pass, because it has more information available on the second pass. For example, a substring from the version string might first be encoded as an added literal, but the search may later find another substring that could be used to partially or completely encode it as a reference. An algorithm might have found a matching substring that is not the longest possible match, and the longer match will later be missed because the fingerprint dictionary entry has been flushed due to space constraints. We note that a multiple-pass approach also allows runtime analysis of several different methods, possibly on some limited initial range of the strings, and then choosing the method that appears most suitable after evaluating the preliminary differencing results. Another pass of equal time bound will not change the asymptotic time bound of the whole algorithm, but in practice it may increase the processing time experienced by a user beyond what one finds acceptable.

3.3 Suffix Searching

A suffix tree [Wei73] is a data structure representing and storing every suffix of a string (see figure 15). Suffix trees and related suffix data structures are the only known methods that optimally solve the problem of finding all repeats of any length p, including short repeats that dictionary search cannot find, while having an O(p) time bound [VMG07, ABF02]. Any substring of a string is also a prefix of some suffix of the same string. A suffix tree contains all the possible suffixes, each starting from the root node. Each node of the tree represents a substring of the string. Each distinct suffix of the string has exactly one possible route from the tree's root to a leaf node. The leaf nodes represent suffixes of different lengths. Nodes with only one child node can be progressively merged into one substring node, producing a path-compressed trie. A suffix tree can be constructed in O(n) time [Lar96]. Weiner's algorithm [Wei73] constructs the tree backwards, reading from right to left. Ukkonen's construction algorithm [Ukk92] creates a tree with one left-to-right pass through the string. A completed suffix tree contains at most 2n − 1 nodes, out of which n are leaf nodes, one for each suffix [Lar96]. Because each internal node has at least two child nodes, there can be O(n − 1) internal nodes, each representing an intra-repeat [VMG07]. The path from the root to each leaf through the internal nodes defines the complete suffix each leaf represents. Every path has a different length, corresponding to the length of the suffix.

A search for any substring of length k anywhere in the string can be solved in O(k) time by traversing the tree from the root until the end of the substring is reached [Lar96]. We describe an algorithm for finding all repeats between a reference string and a version string. Let T_R be a suffix tree constructed from the reference string. The version string is iterated through, starting from the beginning. For each position i, the algorithm attempts to traverse T_R from the root, using the substring [S_i … S_{n−1}) as the search pattern. When the first symbol of the search pattern matches the first symbol of a root node of the suffix tree, the algorithm can start progressively traversing the tree until a symbol within a node mismatches or no matching descendant node is found. The algorithm can then output the position interval between the positions where traversing started and ended as a repeat for delta encoding. When there is no matching root node in the suffix tree, the symbol is not present anywhere in the reference string. Optimal differencing for any repeat length can be done in O(n + m) time [ShS04]. We suggest that finding intra-repeats within the differences can be done by incrementally constructing a suffix tree T_D from the set of differences using Ukkonen's algorithm. After each previous substring, the next one is searched from the tree before adding it into the tree.

Figure 15: Finding identical substrings from two strings using a suffix tree.

It is also possible to find all supermaximal repeats directly from within one suffix tree, without iterating through the version string and specifying individual patterns to search for [Bak93, Lin07]. Baker's [Bak93] algorithm operates as follows. Duplicate prefixes of suffixes share the same nodes up to the point where the tree branches due to a different symbol. To find a suffix of a repeat, an algorithm needs to find the deepest fork nodes v (the largest number of symbols traversed from the root to a node that has more than one child, c₁, c₂ ∈ children(v), c₁ ≠ c₂). Finding the deepest fork node can be done in O(n) time by inspecting every node in the tree and storing the longest found. Because each fork node has more than one child node, the respective substrings have different next symbols and cannot be extended forwards. The fork nodes thus represent suffixes of right-maximal repeats [Bak93]. For finding the limits of extending repeats backward, to find the positions where repeats become left-maximal, Baker [Bak93] describes an optimal algorithm. However, because maximal repeats can overlap, there can be a quadratic number of matching substring pairs. Linhard [Lin07] presented a modification to a suffix tree that allows finding the limits in O(n) time. The leaves of the suffix tree are replaced with lists of left contexts, positions after which left-maximal repeats begin.

Suffix trees are problematic to use in practice because even though the space bound is O(n), the

multiplicative constant is high [Kur99]. Nodes can store starting positions of the substrings,풪 푛 instead of the substrings, thus requiring constant space per node [Bak93]. It is sufficient to store only one position for each internal node. However, having a linear time bound assumes having a constant size alphabet. Traversing from one node to another can be done in constant time; in this case the descendant nodes are not iterated. Constant-time lookup requires an array data structure with | | entries. The vertices from one node to another can be stored using a linked

list or a balancedΣ (red-black) tree [Bak93], but this increases the search time with a factor of = log | |, increasing the time bound to ( ). Space required for a node is not dependent

on훼 the number2 Σ of child nodes, thus a suffix 풪tree푘 훼can be stored in ( ) bits of space, where is 풪 푛푛 � 40 the size of the node data structure [Bak93, VMG07]. Typically a well-implemented suffix tree requires more than 20 times the memory space compared to the string the tree was build from [Kur99]. Compressed suffix trees [Kur99, GrV00] allow constructing more compact suffix trees requiring ~10.1 bytes per symbol on average, but the linear space bound still prevents processing large strings without using a sliding window. Välimäki et al. [VMG07] describe a compressed suffix tree that can be built in ( log ) time and requires ~6 space during construction and ~4 when completed. The풪 푛 creation2 푛 훼 algorithm is however 푛30� times slower than for an uncompressed푛� tree [VMG07]. An improved sliding window method can be used with a suffix tree to allow differencing with constant space bound. Larsson [Lar96] describes an algorithm that uses a modified Ukkonen’s tree constructing algorithm to slide the window one position at a time, instead of clearing data structures and making a longer jump forward, as is often done with dictionary searching. A suffix tree is created from an initial substring = [ … ), where is the size of the window.

Moving the sliding window forward one position requires two steps (see figure 16). First, the leftmost symbol of s is removed from the tree, which is equivalent to removing the longest suffix, which is s itself. Second, after removing the longest suffix, the tree construction algorithm is invoked to insert the symbol S[w+1] into the tree, as if continuing to build the tree.

Figure 16: Sliding the window of a suffix tree one step, appending an ‘x’.

A suffix array [MaM93] is a data structure semantically similar to a suffix tree (see figure 17). A suffix array contains the starting positions of all suffixes of a string in lexicographically sorted order. Both the construction and the completed suffix array require O(n) space. Compressed suffix arrays require less space than the original string [GrV00], but are much slower to construct. Binary search can be used to locate the position of a given search pattern, so a search for a single pattern requires O(p log2 n) time. A suffix array can replace a suffix tree in most algorithms requiring operations suitable for differencing, by simulating the operations of an optimal suffix tree at the cost of an O(log2 n) asymptotic slowdown [VMG07, Lin07]. Using additional helper data structures, it is possible to eliminate the slowdown and search in O(p) time [AKO04]. Considering that a suffix tree is often implemented with O(log2 n) time node traversal, and that finding the position of a repeat requires traversing to a leaf node, the suffix array's time bound is equal. In our empirical tests, we found suffix arrays to be faster than suffix trees in practice.

Figure 17: A suffix array on the left, with an illustrative equivalent suffix tree on the right.

The differencing algorithm described for suffix trees can be used with an O(m log2 n) time bound by substituting the tree traversal with binary search over the suffix array. Symbol-wise comparison between the search pattern and a sorted suffix is used to detect whether a suffix is lexicographically greater or less than the pattern [MaM93]. The suffix at which the binary search terminates shares the longest common prefix with the pattern. Separate pre-computed tables of longest common prefixes (LCP) between successive sorted suffixes can be used to eliminate the need to restart the symbol-wise comparison from the beginning of the search pattern for each comparison [MaM93]. The LCP structures can be constructed in O(n) time while constructing the suffix array [MaM93]. Kurtz et al. [KAO04] describe an algorithm for finding maximal repeats from a string that can be used in place of Baker's and Linhard's algorithms.
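A minimal sketch of this suffix-array differencing approach follows, with a naive suffix-array construction and without the LCP tables (so each comparison restarts symbol-wise); the function names and the minimum repeat length are ours.

    def build_suffix_array(reference):
        # Naive O(n^2 log n) construction for illustration; linear-time
        # constructions (e.g. the skew algorithm) exist.
        return sorted(range(len(reference)), key=lambda i: reference[i:])

    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def longest_match(reference, sa, pattern):
        # Binary search for the insertion point of the pattern among the
        # sorted suffixes, then check both neighbours for the longest
        # common prefix (symbol-wise comparison).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if reference[sa[mid]:] < pattern:
                lo = mid + 1
            else:
                hi = mid
        best_pos, best_len = -1, 0
        for idx in (lo - 1, lo):
            if 0 <= idx < len(sa):
                l = common_prefix_len(reference[sa[idx]:], pattern)
                if l > best_len:
                    best_pos, best_len = sa[idx], l
        return best_pos, best_len

    def find_repeats_sa(reference, version, min_len=4):
        sa = build_suffix_array(reference)
        repeats, i = [], 0
        while i < len(version):
            pos, length = longest_match(reference, sa, version[i:])
            if length >= min_len:
                repeats.append((i, pos, length))   # (version_pos, reference_pos, length)
                i += length
            else:
                i += 1
        return repeats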

Suffix arrays can also be used to make dictionary-based differencing faster. Agarwal, Amalapurapu and Jain [AAJ04] describe an algorithm that replaces the hash chaining used with fingerprinting with a suffix array. Agarwal et al. [AGJ05] refine the algorithm in a later paper. Because a suffix array stores suffixes having identical prefixes sequentially, an interval of a suffix array containing suffixes with identical prefixes can be thought of as a (hash) chain of colliding seed prefixes. The modified algorithm stores in its hash table an index h into the suffix array and the number c of hash collisions (the number of suffixes with identical prefixes). When the differencing algorithm queries the dictionary and a fingerprint corresponds to a dictionary entry having c = 1, there is only one suffix of the reference string matching the fingerprint, and it is verified using symbol-wise comparison. When the entry has c > 1, a binary search over the suffix array items A[h] … A[h+c] is used to find the suffix leading to the longest repeat. The algorithm also uses an auxiliary array and a Bloom filter to reject non-matching fingerprints before searching the dictionary. These other data structures are not required for the algorithm to operate; they just make it faster. Empirical testing by Agarwal, Amalapurapu and Jain indicates the array and filtering reject 97% of fingerprints before the dictionary is queried.
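A sketch of this combination is given below. It is keyed directly by the seed prefix rather than by its fingerprint, omits the auxiliary array and Bloom filter, and assumes a suffix array sa built as in the previous sketch; all names are illustrative.

    from os.path import commonprefix

    def build_seed_index(reference, sa, seed_len):
        # Map each distinct seed prefix to [first index in the suffix array,
        # number of suffixes sharing that prefix]; the interval plays the
        # role of a chain of colliding seeds.
        index = {}
        for i, suffix_start in enumerate(sa):
            seed = reference[suffix_start:suffix_start + seed_len]
            if len(seed) < seed_len:
                continue
            if seed not in index:
                index[seed] = [i, 0]
            index[seed][1] += 1
        return index

    def lookup_longest(reference, sa, index, version, vpos, seed_len):
        # Return (reference_pos, length) of the longest repeat starting at vpos.
        entry = index.get(version[vpos:vpos + seed_len])
        if entry is None:
            return -1, 0
        h, c = entry
        best_pos, best_len = -1, 0
        # With c == 1 this is a single symbol-wise verification; with c > 1 a
        # real implementation binary-searches sa[h:h+c], here we simply scan it.
        for idx in range(h, h + c):
            l = len(commonprefix([reference[sa[idx]:], version[vpos:]]))
            if l > best_len:
                best_pos, best_len = sa[idx], l
        return best_pos, best_len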

3.4 Projective Searching

Whereas other differencing methods discussed in this thesis are designed primarily for exact pattern matching, Percival’s [Per06] projective searching algorithm is designed for approximate differencing, even though it can be used for exact pattern matching. The idea is to find approximate matches by converting the strings into discrete signal data and then calculating cyclic correlation of the two sequences of signal data. High numeric values, correlation spikes, in the resulting correlative vector indicate probable positions where the two strings match: the higher the spike, the higher the probability of a match (see figure 18). The idea is that substrings either match with low edit distance, or do not match at all.

Figure 18: Cyclic correlation showing two spikes, indicating positions that are likely to contain repeats.

Percival's algorithm consists of three phases. First, both strings are converted into signal data using a randomized conversion function. The shorter string is zero-padded to make both strings equally long. Percival suggests using a function that randomly maps each symbol into either a −1 or a +1 signal, which means that the conversion is irreversible and lossy for all alphabets having more than one symbol. The conversion can be done in O(n + m) time. Lothar [Lot08] suggests calculating the distribution of symbols and then using the first bit of a Huffman code calculated from the distribution to decide which symbols get the −1 or +1 mapping, to avoid pathological cases due to randomization. In the second phase of Percival's algorithm, the cyclic correlation is calculated (see figure 19). First, the strings are converted into signal data according to the symbol mapping (1…3). Then, an empty array is initialized (4), and for each string position, the correlation is computed (5…12). Computing the cyclic correlation requires O(n²) time and O(n) space. The cyclic correlation (7b) is equivalent to computing a discrete Fourier transform on the signal data. A fast Fourier transform (FFT) algorithm can be used to obtain the results in O(nk log2 k) time per signal table, where k is the number of projections. In the third phase, a number h of candidate match positions is acquired by retrieving the positions of the highest spikes from the correlation vector. The h highest values represent the most probable match positions. The h matches can then be verified by symbol-wise comparison.

Figure 19: Pseudocode for calculating cyclic correlation [Lot08].
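The correlation step can be sketched with NumPy as follows, using a single random ±1 projection (the text above discusses k projections); the convolution-theorem formulation replaces the quadratic loop of the pseudocode, and all names are illustrative.

    import numpy as np

    def random_projection(alphabet, seed=0):
        # Randomly map each symbol to -1 or +1 (irreversible for larger alphabets).
        rng = np.random.default_rng(seed)
        return {sym: rng.choice((-1.0, 1.0)) for sym in alphabet}

    def cyclic_correlation(reference, version):
        # Convert both strings to signal data, zero-pad to equal length, and
        # compute the cyclic cross-correlation via FFT (convolution theorem).
        signal = random_projection(set(reference) | set(version))
        n = max(len(reference), len(version))
        r, v = np.zeros(n), np.zeros(n)
        r[:len(reference)] = [signal[s] for s in reference]
        v[:len(version)] = [signal[s] for s in version]
        return np.fft.ifft(np.fft.fft(r) * np.conj(np.fft.fft(v))).real

    def candidate_shifts(correlation, h=3):
        # The h highest spikes are the most probable relative alignments;
        # they are then verified by symbol-wise comparison.
        return np.argsort(correlation)[::-1][:h]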

If the conversion into signal data is randomized, with some algorithm executions the correlative data computed using the mapping might indicate full matches at positions that do not match. Some symbol permutations also result in low signal correlation that indicates there are no matches at positions that do match, because different symbols map to the same signal value [Per06]. Therefore k should be selected large enough to prevent signal noise (random matches of symbols) from causing too many invalid results [Per06]. The larger the alphabet, the lower the probability of a random match; the lossy transformation increases noise, but the reduced probability decreases noise [Per06].

Percival notes that the superlinear time required for the Fourier transform can be reduced by splitting the tables into smaller sub-tables to reduce the log n component of the time bound. Solving multiple smaller tables is faster than solving the large original. The original positions are recoverable when the sizes of the sub-tables are coprime, by using the Chinese remainder theorem. Any recovered integer must be smaller than the product of the coprimes used; otherwise it can be considered a false result (out of range). Intuitively, the original table is shortened by adding each sub-table onto the first one, using a different boundary coprime for each sub-table [Lot08]. The information about which coprime belongs to which sub-table is lost, but can be recovered by factoring all possible positions p and ignoring those where p ≥ n. Using smaller primes reduces the time and space bounds, but increases noise and the probability of random spikes. Using both small and large primes lowers the probability of long repetitions of a symbol causing similar projections. The probability of false matches can be lowered by using a different randomized transformation for each sub-table, which causes false matches to occur at different positions and more likely to appear as noise (see figure 20).

Figure 20: Random signal noise caused by computing correlation of multiple sub-tables.
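A minimal illustration of only the position-recovery step via the Chinese remainder theorem; the two coprime sub-table sizes and the example position are arbitrary values of ours.

    from math import prod

    def crt_recover(residues, moduli):
        # Recover p from its residues p mod m_i, assuming pairwise coprime moduli.
        M = prod(moduli)
        p = 0
        for r, m in zip(residues, moduli):
            Mi = M // m
            p += r * Mi * pow(Mi, -1, m)     # pow(x, -1, m) is the modular inverse
        return p % M

    position = 40000                         # original table index
    residues = [position % 251, position % 256]
    assert crt_recover(residues, [251, 256]) == position
    # Recovery is valid only while position < 251 * 256; larger values wrap
    # around and show up as out-of-range (false) results.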

Percival's approach computes the cyclic correlation for non-overlapping chunks of the version string, due to the computation time required per search pattern. Chunking results in the alignment problem: if half of a searched chunk has changed, there is no longer a repeat for it in the reference string. The correlation spike will be lower, and lowering the candidate threshold would include spurious matches. Percival thus combined projective searching with suffix array searching [Per06]. After finding aligned chunks, the remaining intervals are searched using a suffix searching algorithm. Percival notes that processors have been optimized for computing FFTs, and in practice the algorithm is faster than the asymptotic bound alone would indicate. The algorithm also has a trade-off between repeat matching capability and both memory consumption and computation time, making it unsuitable for long strings as the requirements grow faster than linearly [Lot08]. This was the last differencing algorithm discussed; in the next section we discuss storing the found repeats and differences as a delta encoding.

3.5 Delta Encoding and Decoding

Conceptually, the output of delta encoding can be expressed as an ordered list of directives representing the edit operations needed to transform the reference string into the version string. Normally only a forward delta encoding is generated, making it possible to reconstruct the terminus string from a reference string and the delta encoding. Forward encoding does not allow reconstructing a reference string from a version string and the delta encoding (reverse reconstruction), because substrings that are no longer present in the version string are not stored in the delta encoding. Bidirectional delta encoding requires storing two sets of differences [ShK10].

The delta encoding output for local differential compression needs to contain all the difference information required to uncompress the version string V from the reference string R and the compressed delta encoding. The lower bound for the size of a delta encoding is Ω(k lg n), where k is the edit distance between the two strings and n is the length of the version string [IMS05]. A compressed delta encoding contains zero or more coding tables for secondary compression and all the encoding directives compactly encoded. It can also contain a digest checksum for verifying the identity of the reference string and another checksum for verifying that uncompressing produced an identical copy of the original version string V. The decompressor calculates and verifies the checksum of V from the completed terminus string.

Tichy notes [Tic84] that a change edit operation can be simulated by a combination of deletion and insertion, and that any such combination can be further substituted with a combination of deletion and copy, if the corresponding content is available in the reference string. Deletion can be simulated by simply omitting the substring from the delta encoding. Thus, two kinds of explicit encoding directives are needed [Tic84, BuL97]. A copy directive can be defined as a constant-cost special insertion operation from an external source. An insert directive instructs to add a literal string, and is essentially a special copy directive that has an implicit source position within the delta encoding itself. The deletion operation is implicit: the range of a removed substring is simply omitted from the copy directives, and nothing is written when reconstructing. We note that it would also be possible to use explicit add and remove directives instead of copy and add, thus having an implicit copy directive. When approximate (sparse) repeats are encoded as references, a third explicit directive type, repair, is needed as a special version of add to fill in the errors within the repeat [Per03].

For copy directives, the source position and the length of the copy are stored. Insert directives store the length of the substring and the substring itself. If the added string is stored adjacent to the directive itself, or separately in a sequence identical to that of the directives, an explicit source position does not need to be stored, as each substring always follows the preceding one. In case the directives have been reordered for in-place reconstruction, a destination position is also stored. Intra-repeats from within a version string can be expressed using copy directives having source positions outside the range of the reference string. If the reference and version strings are identical, a single copy directive is sufficient to encode the whole version string. If the strings are completely different, a single insert directive may store the complete version string.
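A sketch of reconstruction from such directives follows; the tuple layout and names are ours and do not correspond to any particular on-disk format.

    def apply_delta(reference, directives):
        # Directives (illustrative layout):
        #   ("copy", source_position, length) - copy a repeat from the reference
        #   ("insert", literal_substring)     - add literal data
        # Deletion is implicit: ranges never referenced are simply not emitted.
        out = []
        for d in directives:
            if d[0] == "copy":
                _, pos, length = d
                out.append(reference[pos:pos + length])
            elif d[0] == "insert":
                out.append(d[1])
            else:
                raise ValueError("unknown directive: %r" % (d,))
        return "".join(out)

    reference = "the quick brown fox"
    delta = [("copy", 0, 4), ("insert", "slow"), ("copy", 9, 10), ("insert", "!")]
    print(apply_delta(reference, delta))     # -> "the slow brown fox!"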

Ajtai et al. [ABF02] describe a conceptual encoding using add and copy directives, as follows (see figure 21). An insert directive is encoded as a sequence of three elements: a command identifier, the length of a substring, and the substring. A copy directive is similarly encoded as a sequence of three elements: a command identifier, a source position, and the length of the substring. For calculating a cost for directives, an insert directive's cost is the length of the substring and a copy directive's cost is always one. The cost of a delta encoding is the sum of the costs of all directives included in the encoding. The following transformations do not change the total cost: 1) an insert directive adding a substring is substituted with a sequence of insert directives, each adding one symbol from the substring; 2) an insert directive adding a single-symbol substring is replaced with a copy directive that copies the symbol from the reference string. This simple cost measure implies that when the reference string contains at least one of each symbol present in the version string, the delta encoding could be done with just copy directives, and the minimum-cost delta encoding would be the one with the minimum number of directives.

Figure 21: Conceptual delta encoding.

The encoding itself requires additional space for information such as which type each directive is, what the source position in the referenced string is, and how long the referred substring is. This extra information always requires a non-zero amount of space. In practice, the cost function of directives is thus more complex. We first discuss storing uncompressed directives using constant-size integers for positions. In this case, copy directives themselves require constant space and are equally good choices for encoding, regardless of source position or content [ABF02]. Insert directives require constant space, plus space proportional to the length of the inserted substring [Obs87], when substrings are stored without compressing them. Encoding fewer but longer references allows a better compression ratio than encoding the same substring partitioned into a greater number of shorter references [ABF02].

The constant cost of a copy directive is greater than the cost of extending an insert directive by one symbol, in practice by a factor of 8…16. Unencoded absolute position or length information requires Ω(⌈log2 n⌉) bits. When a differencing algorithm is capable of finding short repeats, encoding a reference to a repeat could produce longer encoded output than encoding the repeat as an extension to a neighboring insert directive [ABF02]. For example, a sequence of (add, reference, add) could be more efficiently encoded as a single insert directive, if the copy directive requires more space than the string it refers to. The directive space requirement forces ignoring short repeats and allows some inefficiency in finding the repeats [ABF02]. Due to this, e.g., dictionary searching is able to use longer seed prefixes [ABF02]. A simple encoding scheme using constant-size integers may require discarding relatively long repeats, up to 18 symbols [Obs87, Chr91]. What is too short a repeat to encode depends on the compression efficiency of the encoding relative to the original string encoding [ABF02]. Substrings of one symbol are always too short to encode as references; otherwise it would be possible to compress random string content using references, as most symbols would have a matching single-symbol substring somewhere.

Absolute position encoding cannot consume fewer than ⌈log2 n⌉ bits, but when there are multiple directives, they can contain positions relative to neighboring directives [KoV02]. Copy directives with relative offsets thus require ⌈log2 p + log2 l + d⌉ bits, where p is the relative position, l is the relative length of the copy, and d is the size of the directive type information [KoV02]. The cost of a copy directive is thus dependent on both the source position and the length. Two references to the same substring may have different costs depending on from where and when it is referred to [TMS02]. The closer a position is to some previously processed reference, the smaller the cost of encoding the relative offset [TMS02]. Additionally, secondary compression can be used for the substrings of insert directives. This produces varying encoded lengths for substrings depending on the current probability prediction context, causing directive sizes to become content-dependent. The compression context may contain information outside the substring, if that information is also available in the reference string. Compressing the substring of an insert directive makes the directive's length only indirectly dependent on the substring length [ShS03]. In practice, the cost of an insert directive can be known only after compressing it. Choosing whether to encode a reference or an insert directive cannot be based on a repeat's length prior to encoding, thus a differencing algorithm cannot know in advance how short repeats it could ignore [ABF02]. Because directives require different amounts of space depending on relative positions and content, the cost functions are non-trivial. Garey and Johnson [GaJ79] have proven that choosing the optimal set of encoding directives under a varying cost measure is an NP-complete problem [KoV02, ShS03].

When the strings are locally and randomly accessible, an algorithm is able to read the same subsection of a string again and again. This allows an algorithm to keep only partial information in memory, because the algorithm can later re-read a subsection of a string if it needs to know the full content again [ABF02]. A differencing algorithm can thus output only the string positions instead of copies of substrings, and an encoding algorithm will receive as input a list of value triplets {type, position, length} representing repeats and differences. Delta encoding can be performed by processing the input triplets in order. For each input, either an add or a copy directive is output with non-zero length. For insert directives, the substring is copied into the delta encoding. A detailed description of secondary compression of the output is outside the scope of this thesis. Agarwal et al. [AGJ05] note that encoding directive metadata separately from substrings requires less space because each data type can have its own prediction context for secondary compression. However, this results in multiple output streams, which requires temporary space for the additional streams, as their final length is unknown prior to encoding. Trendafilov, Memon and Suel [TMS02] have modified an existing deflate compression algorithm implementation for delta encoding. They use two position reference variables instead of one (one for each string), and encode the substrings separately from the copy length tables using two Huffman tables.
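A sketch of that encoding pass over the {type, position, length} triplets, pairing with the apply_delta sketch above; the triplet kinds and names are ours, and secondary compression of the output is omitted.

    def encode_delta(version, triplets):
        # Each triplet is (kind, position, length):
        #   kind == "repeat": position is the source position in the reference string
        #   kind == "diff":   position is the start of the literal range in the version string
        directives = []
        for kind, pos, length in triplets:
            if length == 0:
                continue
            if kind == "repeat":
                directives.append(("copy", pos, length))
            else:
                directives.append(("insert", version[pos:pos + length]))
        return directives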

Decoding the delta encoding cannot be completed with a single reading pass through both the reference string and the terminus string. Copy directives may refer to any position of the reference string in any order, due to transpositions [Tic84]. Reordering the directives can reduce the time required for reconstructing the terminus string due to a reduced amount of random access [BSL02]. Reordering however requires explicit destination positions for the directives, which require additional space. Both Tichy's and Obst's greedy algorithms could be optimized for decoding by starting the search for the longest repeat from the position where the most recent repeat was previously encoded, wrapping around to the beginning of the string when reaching the end, and stopping the search when reaching the starting position again [Tic84].

We describe a standardized encoding format for delta output as an example. The VCDIFF generic differencing and compression data format [KoV02, KMM02] specifies a simplified encoding suitable for local differential compression. The specification focuses on differencing two local files and does not attempt to generalize enough to be suitable for remote differential compression; it does not define concepts compatible with fingerprint set reconciliation, such as recursive fingerprint comparison using multiple communication rounds. The emphasis is on simplicity instead of efficiency: header structures are constant-length, one header for each file and one for each encoding window. The specification does not specify any secondary compression scheme for symbols; instead it recommends using an external compression algorithm to compress the output after completion. There are no restrictions on the differencing algorithm used, but the specification assumes that dictionary searching with non-overlapping sliding windows is used. The specification assumes that the reference and version strings are catenated into one superstring, and uses positions greater than the length of the reference string to refer to the version string. A decoder does not need to be aware of any sliding window heuristics used for encoding [KoV02].

The format uses base-128 encoding for numeric values. The most significant bit of each byte is reserved as an indicator of whether the number contains more bytes. The encoding thus allows using 1…n bytes per number. In addition to the required add and copy directives, a "run" directive is specified as a repetitive add. All directives have two parameters. An insert directive has the length of the added substring and the substring itself. A copy directive has the position of the substring within the string and the length of the substring. The reference can refer to either the reference string or the version string, differentiated by the position (a position exceeding the reference string length refers to the version string, as if the strings were catenated). The encoding allows optimizing repeating references by specifying lengths that exceed the available string length. A run directive has a count and a byte to be repeatedly inserted. Uncompressing can be done one directive at a time from left to right, by simply either adding a substring or copying a referred substring. Encoding and decoding are both possible in Θ(n) time [KoV02].
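A sketch of such a variable-length integer coder; note that the byte order below follows the common LEB128 (least-significant group first) convention, which matches the description above but is not necessarily the exact byte layout mandated by VCDIFF.

    def encode_uint(value):
        # 7 value bits per byte; the most significant bit flags "more bytes follow".
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            out.append(byte | (0x80 if value else 0x00))
            if not value:
                return bytes(out)

    def decode_uint(buf, offset=0):
        # Returns (value, offset of the next unread byte).
        value = shift = 0
        while True:
            byte = buf[offset]
            offset += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value, offset

    assert encode_uint(300) == b"\xac\x02"
    assert decode_uint(encode_uint(300))[0] == 300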

3.6 Tolerating Superfluous Differences

A minor change at the application level can cause significant string-wide changes at the symbol level, which prevents differencing from finding repeats efficiently. Percival [Per06] has divided the types of changes into three classes. Zeroth-order changes are semantically non-modifications, caused by issues such as inserting timestamps or counter values into a string when saving it into a file. First-order changes are the actual changes caused by editing. Second-order changes are changes indirectly caused by first-order changes, such as changed relative offsets that often propagate through the string. For example, recompiling a binary application with minor source code changes often produces machine-language output that contains a large number of small differences spread around the binary output. A single inserted machine instruction pushes all following code forward and the compiler must change all affected address values (see figure 22), thus dividing long repeats into short fragments [Per06, MGC07]. Minor code changes can also cause cascading changes in register allocation instructions. Modifications to inlined⁴ functions replicate the changes into each copy of the function code and cause progressive alignment changes. Adding a single line of new code can change up to 10% of all symbols in an executable [Per03]. As the number of second-order changes becomes larger, matching longer repeats becomes increasingly improbable.

Application-level domain knowledge is required for these cases [Per06]. Some universal, fully reversible preprocessing methods can be used to increase apparent redundancy by transforming and restructuring the substrings to be more similar to each other, thus transforming the content into a form that can be more easily compressed [Ski06]. One such method is the Burrows-Wheeler transform (BWT) [BuW94], which rearranges the symbols of a string in O(n) time and space. The transform does not change the length of the string. Tridgell [Tri99] applied BWT to text documents having different line-separator symbol sequences (e.g., '\n' in Unix versus '\r\n' in Windows) and noted a speed-up factor of 27 for synchronization.

⁴ An inlined function is one for which the compiler generates separate copies of binary code at each call site, instead of generating jump instructions to shared code.

Knowledge of the executable binary format can be used to completely disassemble binary code into a source code format and deterministically encode the changed offsets, or a modified compiler can be used to generate a list of relative address offset locations [Per06]. Google's Courgette algorithm [Ste09, Ada09] disassembles the executable into a simple proprietary format. All jump and address instructions are analyzed and replaced with symbolic references. The other binary machine code is not analyzed and is left untouched. Both the updated executable (version string) and the existing executable (reference string) are processed, which results in similar strings, because the instructions are similar and only the address values have changed. These pre-processed strings are then compressed differentially using publicly available software (bsdiff). The resulting delta encoding is an order of magnitude more compact than without preprocessing [Ada09]. The updated executable is reconstructed on the user's system by first disassembling the user's existing version and then using an assembler.

Figure 22: Second-order changes caused by changed offsets.

Another approach is to find approximate repeats and then include additional information for repairing the non-identical symbols included in the approximate repeats. Percival [Per03] implemented a compressor that uses dictionary searching for finding initial short seed prefixes, but when attempting to expand the found repeats, ignores mismatches up to 50% of the number of symbols in the extended range. Only repeats that are a minimum of 8 symbols longer than the longest previously found repeat are accepted as new best choices. To prevent including leading or trailing non-matching sequences, expanding is stopped and non-matching symbols are excluded when a contiguous sequence of k non-matching symbols is detected [MGC07]. During construction of the delta encoding, the mismatching substrings within approximate repeats are sequentially inserted into a separate section of the delta encoding. For each mismatch, the copy directive is split and an insert directive is encoded with a length.

The exediff algorithm [BMM99] uses the same method but improves compression by predicting symbol values for second-order changes, based on knowledge of the positions of machine-code jump opcodes. When a relative offset has changed value, it is likely that close-by opcode offset values have changed by a similar amount [MGC07]. Exediff calls its differencing pre-matching, as the exediff algorithm iteratively attempts to decode sections during a value recovery phase to test the predictions and compares the length of the encoded copy directive output against using an insert directive [MGC07]. Approximate matches that require more bits to encode are then ignored and encoded as insert directives. Approximate matching may find spurious matches (that match by random chance), some of which can still be encoded using references [MGC07].

Motta, Gustafson and Chen [MGC07] make the assumption that second-order mismatches have regular patterns and can be predicted without content-dependent knowledge. They describe an algorithm that attempts to predict the changes based on the k previous mismatches, stored as cached information. The delta decoder needs to recreate the same prediction data structure. They suggest using a cost function to estimate the relative ranking of approximate repeats, because the length of an approximate repeat alone does not describe the goodness of a repeat. One simple cost function is c = b / r, where b is the number of bits needed to encode the directive and r is the length of the approximate repeat. The algorithm attempts to match a mismatch against an earlier mismatch in the cache. Multiple similar mismatches can be encoded more efficiently, as the same repair operation can be repeated.

Percival [Per06] later improved his algorithm in the bsdiff implementation. The algorithm uses projective searching for differencing and allows some number of second-order changes when matching repeats. Error correction codes are used in the delta encoding to allow recovering from these changes during decoding. Instead of attempting to exactly match the repeats, Percival's method treats a small number of changes as transmit errors that can be repaired. The algorithm shifts the relative position of the version string, and approximate repeats that have relatively close positions in the strings are considered to contain second-order changes. Delta encoding with error correction codes allows creating a single patch file that can be used to synchronize against multiple slightly differing reference strings [Per06]. Such cases are common in the open source community, where different versions of a compiler are used to compile applications, resulting in slightly different binaries for the same application source code [Per06]. Unfortunately, neither the implementation of Percival's algorithm nor its encoding format has been published.

Both Percival's bsdiff and Motta, Gustafson and Chen's algorithms produce compression ratios that differ by less than a factor of 1.2 from the results of platform-specific implementations [MGC07]. Even though we used executable binaries as an example, the approximate differencing methods are generally suitable for strings having second-order changes with repeating similar modifications. The algorithms are not suitable for cases such as a binary document converted from little-endian to big-endian integer encoding.

4 Remote Differential Compression

This chapter discusses known methods for remote differencing. The common case is that the two strings needing to be synchronized are stored on two separate hosts. Neither host knows anything about the string stored on the other host. One host cannot read both strings and perform differencing on its own; instead two separate instances must exchange messages, transmitting as little data as possible between the hosts. The temporal cost of accessing a string on another host is high due to access latencies such as network round-trip time or security layer handshake negotiation. Attempting to read multiple pieces of a string from the other host may require more time than simply transmitting the whole string compressed using non-differential compression. An algorithm should infer which sections of the strings are different and synchronize the strings without transmitting much more data or using more time than copying a non-differentially compressed version string would require.

In section 1 we discuss solutions to the general problem of inferring which sections of strings are different, without being able to read the actual strings. A common feature of the solutions is computing some summary value (fingerprint) of a substring, and then using the summaries to probabilistically infer which intervals of the strings are different. In section 2 we discuss fingerprint collision probability and choosing a hash function for fingerprinting. In section 3 we discuss chunking, where fingerprints are used to represent arbitrarily long chunks (substrings), thus requiring fewer bits for communication but enforcing a long atomic unit of change. In section 4 we discuss different methods for avoiding chunk alignment problems. Solutions include using variable-length chunks whose division is defined by content, and comparing all possible fingerprints of a string against the set of fingerprints from the other host. In section 5 we discuss algorithms for differencing the ordered multi-sets of fingerprints, using interprocess communication. In section 6 we discuss the synchronization algorithm, which includes producing the delta encoding and reconstructing the terminus string. Finally, in section 7 we discuss improvements to the fingerprinting scheme that allow more fine-grained chunking and fewer bits per fingerprint.

4.1 Probabilistic Method

A synchronization algorithm must minimize communication between the hosts. A non-zero amount of data needs to be communicated between the two hosts to infer which parts of the strings are different. Communicating less information than the strings contain implies that differencing needs to be probabilistic: remote differential compression cannot achieve full certainty of succeeding [Tri99]. Computer hardware in general also has a given probability of hardware error, depending on the system, so probabilistic operation is not necessarily a problem. For discussion both for and against the probabilistic approach, see the papers by Black [Bla06] and Henson [Hen03]. There is no deterministic solution requiring fewer bits of communication than copying a conventionally compressed version string to the host having the reference string, and differencing locally [Yao79].

The hosts exchange metadata about the strings and then infer which substrings need to be transferred to the other host. In the following discussion, the host storing the reference string is denoted the reference host, and the host storing the version string is denoted the version host. The hosts' roles as client or server from the user's perspective do not matter. The worst case is that the strings are completely different, requiring that the version string be copied completely. Thus, the worst-case communication bound is Ω(g + h) bits, where g is the size of the conventionally compressed version string, and h is the number of overhead bits required for remote differencing. Orlitsky [Orl91] gives a lower bound of h = Ω(⌈log n⌉) with a single communication round and Ω(⌈log log n⌉ + 1) bits with two communication rounds. Single-round synchronization can require exponentially more data than multi-round synchronization, but only twice as much if the edit distance is known in advance [Orl90]. Two messages are close to the optimal [Orl93], but three messages are still not optimal [ZhX94]. An algorithm is communication efficient if it transmits at most k lg^c(n) bits, where c is some constant [IMS05]. This efficiency measure ignores the number of communication rounds and common network protocol overhead, such as TCP, IP and TLS headers. For measurement purposes, string length can be converted to time (or vice versa) by multiplying with a constant representing some chosen transmit bandwidth, to allow simpler comparison with a single measure.

Remote differential compression is not required if either of the hosts has both the reference string and the version string available. It is also possible to avoid remote synchronization when only a known, finite set of possible different reference strings exists. For example, a server offering software update services can store all known versions of the binaries along with a digest checksum (hash) calculated from each reference string. A client can then calculate a corresponding digest from its reference file and transmit only the digest to the server. The server can then match the digest against all known ones and transmit a pre-compressed local delta encoding to the client.

Early remote differencing algorithms were intended for detecting and correcting errors in remotely replicated strings. Purposeful modifications can be viewed as random errors. Orlitsky [OrV02] noted that the problem of using error correction codes for repairing transmission errors and the problem of synchronizing two remote hosts with similar data are closely related. The reference string can be seen as a corrupted copy of the version string, and error correction is used to repair the reference string into the version string [STA03]. Minsky, Trachtenberg and Zippel [MTZ03] note that correcting errors in a string is more difficult than in a set of integers because the ordering of symbols is important and the length of a string is unbounded.

Metzner [Met83] described an algorithm for locating errors between two remote strings by constructing a tree structure of parity values calculated from chunks. The algorithm divides the two strings into equal-length chunks and then calculates a parity value for each chunk, separately and independently on each host. From these parity values, disjoint sequences are formed and another level of parity values is calculated from each sequence, forming the tree of parity values. Metzner's implementation used only two chunk parity values to calculate each higher-level parity, thus creating a binary parity tree. The parity scheme is modified for each level to prevent two changed parity values from cancelling each other out. String positions are used as seeds for a pseudo-random number generator for the parity functions. Binary search and recursive transmitting are used to compare the chunk parities. The algorithm requires transmitting (log2(n/c) + 1)(b + 2) + c bits using ⌈log2(n/c) + 1⌉ communication rounds for locating and correcting a single differing chunk on either branch from the root, or twice the amount if there is a difference on both binary branches (where c is the size of the chunks, and b is the number of bits per parity). Metzner later [Met91] improved the efficiency of the algorithm by probabilistically transmitting additional parity values (from the next levels) on each round, thus requiring fewer communication rounds if the additional parity values solved the error location problem. However, because the algorithm was intended for detecting errors, it did not consider substring insertions and removals and the resulting alignment problem. Due to constant-length chunks, alignment changes lead to a large number of differing parity values.
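The recursive narrowing idea can be sketched as follows. This is a generic recursive hash comparison in the spirit of the parity tree, not Metzner's exact parity scheme; the remote_digest callback simply stands in for a message exchange with the other host, and a real protocol would batch these requests into communication rounds.

    import hashlib

    def digest(data):
        return hashlib.sha256(data).digest()[:8]    # short hash acting as a "parity"

    def find_differing_ranges(local, remote_digest, lo, hi, min_chunk=64):
        # Compare the hash of local[lo:hi] against the other host's hash of the
        # same interval; recurse into both halves only where they disagree.
        if digest(local[lo:hi]) == remote_digest(lo, hi):
            return []                               # identical interval, prune
        if hi - lo <= min_chunk:
            return [(lo, hi)]                       # smallest unit of change reached
        mid = (lo + hi) // 2
        return (find_differing_ranges(local, remote_digest, lo, mid, min_chunk)
                + find_differing_ranges(local, remote_digest, mid, hi, min_chunk))

    # Demonstration with both strings held locally:
    ours = b"x" * 1000
    theirs = b"x" * 500 + b"y" * 8 + b"x" * 492
    print(find_differing_ranges(ours, lambda lo, hi: digest(theirs[lo:hi]), 0, len(ours)))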

Schwarz, Bowdidge and Burkhard [SBB90] improved Metzner's algorithm by adding the capability of identifying single missing or added chunks. The algorithm uses supersignatures, calculated from a contiguous sequence of parity values. Finding single missing pages is attempted by shifting the parity sequence interval from which the supersignature is calculated, up to d steps in either direction. Each time the algorithm proceeds to the next level down the parity tree, d is decreased by one, because there must be something differing since the parities did not match. Choosing d sets an upper bound for detectable differences. Their algorithm cannot match transposed chunks unless the chunks are within distance d from each other. Differencing the sets of parity values can also be modeled as a special case of the non-adaptive group testing problem, where d items are different. Madej [Mad89] describes an algorithm for differencing with arbitrary d. The algorithm requires ⌈(2d log2 n)/a⌉ additional composite parity signatures, where a = ⌈log2(d + 1)⌉. The signatures are calculated using XOR binary logic, leading to the problem that multiple changed bits can mask differences. The space bound of the algorithm is O(c(d + 1) log2 c), where c is the number of chunks. A lower bound for the probability of success is 1 − 1/d!. Again, transposed chunks cannot be matched. We exclude further discussion of these parity-based methods because they are unable to find transposed repeats, similarly to local differencing by finding a longest common subsequence.

Later algorithms are based on the fingerprinting scheme used for seed prefixes with local differential compression, which can be used to identify longer substrings by increasing fingerprint strength. Thus, a relatively short bit string allows representing an arbitrarily large amount of data and enables detecting substring inequality in constant time [Tri99]. The computation of each fingerprint is independent of the others [Tri99]. Changes to multiple chunks do not cause masking of fingerprint bits nor prevent detecting the differences. The idea of transmitting hashes to a remote host was mentioned in Heckel's paper [Hec78], including using recursion to initially difference using longer atomic units of change. By using fingerprints to represent chunks, the problem of remotely differencing two strings becomes a problem of differencing the two ordered multi-sets of fingerprints.

Verifying by symbol-wise comparison after matching two fingerprints is not possible due to the transmission cost between the hosts. Thus, each fingerprint calculated from a unique chunk must be unique among all fingerprints calculated from the reference and version strings [Tri99]. More complex hash algorithms must be used to produce a more uniform distribution of hashes and thus lessen the probability of a hash collision. However, since far fewer bits are used for a fingerprint than for the string it identifies, it is impossible to guarantee that the hash function produces a unique value for every different string [Tri99, You06]. More complex algorithms are computationally more demanding, which means that selecting a hash function is a compromise between reliability (collision improbability) and performance.

Algorithms designed for remote synchronization are often not practical for local differential compression because they require more execution time and their compression ratio is comparatively bad due to the necessity of using a longer atomic unit of change. Remote synchronization always requires transmitting data to detect properties of the strings, whereas local differential compression does not. Still, chunking can be used and has been implemented for local differencing, because it allows differencing arbitrarily long strings by increasing the chunk length as a function of string length [BeM99]. Chunking can be viewed as sparse fingerprinting with long seed prefixes. Chunk fingerprints can be stored for later use to avoid reading a string; for example, backup software could use the stored fingerprints to difference against a reference string that itself is no longer accessible because it has been moved to a safe location.

4.2 Fingerprint Collisions

A fingerprint collision can cause silent corruption [Hen03], where the terminus string is not identical to the version string and the error is not detected. The rate of failure is dependent on the fingerprint size, the number of chunks, and the mathematical qualities of the fingerprinting function [Bla06, Tri99]. In general, the purpose of hashing is to distribute values among a given range to speed up processing, and the number of distinct objects can be relatively large compared to the number of possible hash values h [You06]. Universal hashing assumes hash collisions are common and can be resolved with hash chaining. Fingerprinting without the possibility to verify requires that all fingerprints are unique, meaning that collisions are not allowed at all. The number n of distinct objects must be much smaller than the range of possible fingerprint values (n ≪ h) [You06]. This implies that the required fingerprint size is dependent on the strings' length and the chosen chunk size [Tri99]. The longer the string, the more chunks there will be, when the average chunk size is assumed constant. The more chunks there are, the higher the probability of a hash collision. The probability of synchronization failure is approximately c·p, where c is the number of unique fingerprint comparisons and p is the probability of a collision [Tri99]. When a collision occurs, the synchronization algorithm will wrongly assume the chunks are identical and that it already has a copy of the chunk, even though it does not. Synchronization will produce a corrupted version string that has some chunk substituted with another, different one, both having the same fingerprint. Thus, unlike with local differential compression, a weak rolling hash function cannot be used (by itself), because comparing the chunks symbol-wise afterwards for verification is not possible. The fingerprint function has to produce checksums that have a probability of collision less than the probability of other unavoidable failures, such as unrecoverable hardware errors [You06]. For example, if the fingerprinting function simply interprets the next four symbols as the fingerprint of a chunk, a hash collision would occur each time the same four symbols occurred at the beginning of a chunk.

An ideal fingerprinting function's output can be considered random, and it is expected to produce a fully uniform distribution of output values among the output value range [Bla06]. The probability of a hash collision, one random block having the same fingerprint as another, is approximately 2^-b when using b bits for the fingerprint [SNT04]. About log2 n bits are needed for an equal chance of collision or non-collision, and each additional bit decreases the probability of collision. The probability of two fingerprints being the same is approximately 1 − e^(−k(k−1)/(2n)) [SNT04, Bla06], where k is the number of fingerprints. Actual real-life hash functions' output cannot be considered random, and the probability of collision is higher because hash function implementations are not truly ideal [Hen03]. Hash collisions are deterministic, and when a strong fingerprinting function causes a collision on different chunks, it will prevent synchronizing the strings [Hen03]. Retrying does not help, as the fingerprint output stays the same. Evaluating hash collision properties on arbitrary content is non-trivial [Tri99]. For example, a strong hash function H modified with an exception of H(0) = H(1) has almost the same collision probability as the non-modified function [Hen03]. The random polynomial function used for Rabin fingerprinting in Tridgell's rsync was found to have 5 bits less strength than ideal [Tri99]. Improved polynomial hash functions produced results 3 bits less than ideal for 32-bit hashes. To compensate for the limited randomness, some chosen number b of additional bits is needed for the fingerprints. A total of log2((n + m)/c) + b bits is needed per fingerprint when comparing chunks having an average length of c symbols [IMS05]. Irmak, Mihaylov and Suel [IMS05] suggest using b = 10 additional bits.
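A small numeric illustration of the birthday-bound approximation above; the string and chunk sizes are arbitrary example values.

    from math import exp

    def collision_probability(num_fingerprints, bits):
        # 1 - e^(-k(k-1) / (2 * 2^b)): probability that at least two of k
        # ideal b-bit fingerprints collide.
        k, space = num_fingerprints, 2.0 ** bits
        return 1.0 - exp(-k * (k - 1) / (2.0 * space))

    # A 1 GiB string cut into 2 KiB chunks yields k = 524288 fingerprints.
    k = 2**30 // 2**11
    for bits in (32, 48, 64, 96):
        print(bits, collision_probability(k, bits))
    # 32 bits: ~1.0 (hopeless), 48 bits: ~5e-4, 64 bits: ~7e-9, 96 bits: ~2e-18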

Because designing a strong hash function is difficult, cryptographic hash functions are typically used for chunk fingerprinting [Tri99, You06]. They have been designed such that any kind of data modification has a very high probability of causing a change in the calculated hash value [You06]. They have also been designed to be resistant to malicious tampering, which means it is difficult to produce hash collisions even on purpose [You06]. Beyond security properties, other reasons for using cryptographic hash functions are consistency and availability [Tri99]. Popular cryptographic hash functions also have performance advantages because there are often highly optimized implementations available [You06]. There are specific processor instructions added into hardware to accelerate calculating popular hash functions, such as the AES instruction set for the x86-64 family of processors [Int10]. Different implementations of a given hash function can have a factor of 10 difference in total execution time [You06].

Popular cryptographic hash functions such as SHA-1 and SHA-512 [NIS08] are being studied extensively. When vulnerabilities (ways to produce hash collisions) are found, improved hash functions become recommended instead. Currently recommended hash functions have gone through intense cryptanalysis and are believed to have strong collision resistance. Application implementations will benefit from upgrading the fingerprinting functions to the new, stronger cryptographic hash functions when available [You06]. Cryptographic hash functions are resistant to preimage attacks, where content is fabricated to produce a colliding hash value [Bla06]. Invertible, non-cryptographic hash functions allow fabricating content to produce a given hash value relatively easily. Tampering resistance is not required for compression purposes because purposeful tampering is not usually an issue [Tri99]. However, for cloud storage where data deduplication (a form of differential compression) is used between accounts, a malicious user could cause hash collisions that cross user account boundaries [Hen03]. For example, SHA-1 collisions can be found in 2^69 function calls for 160-bit digests, whereas the expected number of calls (cryptographic strength) is 2^80 [PCT11].

When synchronizing long strings, there can be millions of chunks and even small changes in fingerprint size overhead will be meaningful in practice. The compression ratio can be improved by reducing the strength of the checksums dynamically as a function of string length, to optimize the amount of data that needs to be transmitted [IMS05]. Checksum strength can be reduced by reducing the hash values to a finite field [0 … m) using modulo arithmetic, or by simply dropping off a number of bits. We note that unless either the checksum strength or the average chunk size is dynamically increased, synchronization of long strings becomes improbable to succeed, or finally impossible, as the number of chunks becomes larger than the number of possible different fingerprints.

To detect corruption due to a hash collision, Tridgell [Tri99] suggests calculating a separate digest checksum from the entire original version string. The reference host can then calculate a corresponding digest from the reconstructed version string and compare the digests for verification. Because a strong digest checksum is relatively small compared to the total amount of data transmitted, a much stronger checksum can be used than for chunk fingerprints. A digest checksum increases the probability of detecting a corrupted reconstructed version string from zero to 1 − 2^-b if the checksum functions are assumed ideal [Tri99]. Detecting collisions makes it possible to weaken the chunk checksum strength used for fingerprinting. Reducing the chunk fingerprint size reduces the overhead of the protocol, but may require additional differencing attempts with stronger fingerprints if a collision is detected. Without the digest, the chunk fingerprints need to be strong enough to never cause collisions, with very high probability. With the additional digest, the fingerprints need only be strong enough to rarely cause collisions, if it is assumed that the digest allows detecting failure with high probability. We note that a separate checksum also protects against network transmission errors (a noisy transmission channel) and hardware errors. Network data transfer using the TCP protocol is prone to errors because a simple 16-bit checksum is used for detecting transmission errors. The acceptable collision frequency is implementation specific (e.g., how frequently retries are allowed).

Lacking digest strength combined with weak chunk fingerprints may cause a failure when synchronizing certain content. For example, it is possible to fabricate strings that old versions of the popular Rsync tool cannot synchronize, due to their use of MD4 [Riv90] for the strong fingerprint [Per06]. Version 3.0 thus changed to using MD5 [Riv92] instead of MD4 [Tri08]. When the digest verification fails, some failover semantics are needed. One solution is to perform the synchronization again using a stronger fingerprint size or a different fingerprinting function [Tri99]. Current cryptographic hash functions do not allow reparametrization like simpler functions, such as Rabin fingerprinting based on random polynomials [Tri99]. Another fallback solution is to treat all chunks as different and re-transmit the complete version string compressed using non-differential compression [Tri99]. Fingerprinting, even with very strong digest checksums, can never provide a 100% probability of either successful reconstruction or detection of corruption.

4.3 Chunking

We describe a generalized method of differencing as a basis for discussion, based on algorithms by Tridgell and Mackerras [TrM96] and Teodosiu et al. [TBG06]. For inferring which substrings are different between the reference and version strings, the strings are deterministically divided into non-overlapping chunks (short substrings) of some length using identical heuristics on both hosts [BBG06]. Division points are positions of a string specifying a starting position for a chunk. When no division points can be found, the whole string is the only chunk. Both hosts may operate independently, and it is assumed that performing the chunking in the exact same way on two similar strings will produce similar chunk cutpoints [BBG06]. The simplest division heuristic is to use chunks of fixed length c. For each chunk, a fingerprint is then calculated using some strong hash function. Each fingerprint uniquely represents the corresponding chunk, but only in the context of the two strings. These checksums, which require much less space than the chunks themselves, are then transmitted to the other host, along with the parameters of the chunking method. The other host can then calculate the respective fingerprints from its own local string and compare them against the set received from the other host. The hosts assume that an identical fingerprint implies an identical chunk. Only one communication round is required between the reference and version hosts [TrM96]. With this basic algorithm, synchronization requires sending at least Ω(n/c) fingerprints (where c is the average chunk length) from the reference host to the version host, even when the strings are identical [Tri99]. Chunk fingerprints can be saved and later used for differencing without having the reference string present any more. This method can be used for incremental backups, where fingerprints calculated from the current string are compared against fingerprints saved during the last backup run.
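A sketch of this basic one-round scheme, with SHA-256 standing in for the strong fingerprint; the function names, the chunk length, and the plan representation are ours, and digest verification and protocol framing are omitted.

    import hashlib

    def chunk_fingerprints(data, chunk_len=2048):
        # Fixed-length chunking: fingerprint every chunk_len-byte block.
        return [hashlib.sha256(data[i:i + chunk_len]).digest()
                for i in range(0, len(data), chunk_len)]

    def plan_transfer(reference_fps, version, chunk_len=2048):
        # Run on the version host after receiving the reference host's
        # fingerprints: per chunk, either reference an identical chunk the
        # other side already has, or mark the literal bytes for sending.
        known = {fp: idx for idx, fp in enumerate(reference_fps)}
        plan = []
        for i in range(0, len(version), chunk_len):
            chunk = version[i:i + chunk_len]
            idx = known.get(hashlib.sha256(chunk).digest())
            plan.append(("have", idx) if idx is not None else ("send", chunk))
        return plan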

Chunking has the intrinsic problem that if one single symbol of a chunk is modified, the whole chunk is considered different [Tri99]. Because a hash function's output is expected to be pseudo-random for differing inputs, changing a single symbol is expected to change the fingerprint completely, making approximate matching impossible. Subsequently, if every chunk has a single symbol modified, then the whole string will be considered different [SNT04]. Due to the longer atomic unit of change, chunking is not suitable for strings with second-order changes, because the changes may occur with dense granularity. Remote differential compression of binary executables often fails to reduce the transmit cost over conventional compression, because systematic second-order changes have changed most chunks. Decreasing the atomic unit of change (the average chunk size) would increase the protocol overhead from fingerprints, exceeding the amount of space saved by compression [Tri99].

Another issue with chunk-based differencing is the relative alignment of the repeats between the two strings [Tri99]. If the strings are divided into fixed-size chunks and a single symbol is added to or removed from either string, then none of the chunks that contain one or more symbols from the shifted range will match anymore, if the chunks are compared trivially in the same sequence [You06]. Similarly, all chunks that contain symbols from transposed ranges will have different checksums when compared against chunks in the same relative positions. We note that the alignment problem is analogous to the alignment problem with naive symbol-wise comparison in local differential compression, if symbols from chunks are viewed as bits and chunks are viewed as symbols. However, if a chunk is completely contained within a transposed region, there is a chunk with a matching checksum in some other relative position [Tri99] (see figure 23).

Figure 23: Chunked string with an alignment change and slack.

Chunk size affects compression performance, because transmitting a whole chunk implies transmitting some slack, which increases overhead [Tri99, BBG06]. Chunking granularity should always be much denser than the average change granularity to avoid excess slack [BBG06]. With a smaller average chunk size, each chunk contains less slack on average, but there will be more chunks. For each chunk, some amount of metadata (fingerprint, chunk size) must be transmitted, and thus smaller chunks increase the metadata overhead [BBG06]. Larger chunks require less overhead space because there are fewer chunks. When there is only a small number of changes relative to the string length and the changes are clustered close together, larger chunks require fewer bits to be transmitted. The number of chunks also affects computational requirements [Tri99]. For example, cryptographic hash functions often require initialization and padding per chunk [Tri99]. The average chunk size is an application-chosen compromise between slack and protocol overhead [BBG06]. The optimal chunk size depends on the string content and the characteristics of the changes, and no single chunk size is optimal for the general case with arbitrary strings [IMS05, You06]. Chunk size could be optimized by sampling the input, but with secondary compression the chunk lengths are content-defined. Optimization would require knowing the compressibility of the chunks, which is an NP-hard problem [YIS08].

Tridgell [Tri99] notes that remote differencing is an asymmetric workload for the two hosts. In the worst case, the reference host needs to send all of the fingerprints and the version host still needs to send the complete version string. In the best case, the version host does not need to send anything. Network speed is also often asymmetric: uplink speed from a mobile device can be slower by a factor of 100 than downlink speed to the mobile device, making any outgoing traffic undesirable. When the uplink speed is insufficient to transmit the fingerprint data faster than the downlink speed allows copying a compressed version string, differential compression should not be used at all [Tri99]. It is preferable that the mobile device transmit the least amount of data possible, whereas the server should do the least amount of computation possible and should not need to maintain large session state, which would expose it to malicious attacks that create a large number of abandoned sessions.
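The decision of whether differential compression pays off on such an asymmetric link can be estimated with simple arithmetic. The following sketch uses entirely hypothetical link speeds and sizes:

```python
# Hypothetical numbers, for illustration only.
uplink_bps = 1 * 10**6                  # 1 Mbit/s uplink from the mobile device
downlink_bps = 100 * 10**6              # 100 Mbit/s downlink to the mobile device
fingerprint_bits = 160                  # e.g. weak + strong fingerprint per chunk
chunks = 1_000_000                      # number of chunks in the version string
compressed_copy_bits = 400 * 8 * 10**6  # size of a plainly compressed copy (400 MB)

upload_time = chunks * fingerprint_bits / uplink_bps   # send fingerprints upstream
download_time = compressed_copy_bits / downlink_bps    # just fetch a compressed copy
print(f"fingerprint upload: {upload_time:.0f} s, full compressed copy: {download_time:.0f} s")
# If upload_time exceeds download_time, differential compression is not worthwhile.
```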

4.4 Alignment Problem

Content-defined division points attempt to solve the alignment problem, where adding or removing symbols shifts the relative alignment of all symbols following the modification position, causing the checksums of all affected chunks to change. Bjorner, Blass and Gurevich [BBG06] make the assumption that when a symbol insertion or removal causes an alignment mismatch between the reference and version strings, there will soon follow a position where the strings would again be identical if one of the strings were re-aligned by repositioning the cutpoint function. If the same content-defined function is used for both strings, the function's output will again become identical after processing some small number of matching, re-aligned symbols. A change should thus only affect a few neighboring chunks, making the content-defined chunking method shift-resistant. A chunking method should also be local, that is, depend only on the current input and not need to keep a history [EsT05]. A local function can start anywhere within a string and, after some initial input, start to produce identical division points compared to starting from the beginning. When chunk division does not depend on previous content, the same substring will locally induce the same division points, even if the string is otherwise completely different. Content-defined chunking leads to variable-length chunks, because there is an infinite number of possible different input string contents. No general chunking method can be optimized for all possible inputs [BBG06]. For example, if the cutpoint function is simply to insert a cutpoint whenever a certain symbol is encountered, but a string does not contain any such symbol, the cutpoint condition is never satisfied. The string could also consist of only such symbols, causing the cutpoint condition to be satisfied at every position.

Chunk division points are defined by iterating through a string and evaluating the content with a chunk division function [BBG06]. When the content satisfies a given condition, a cutpoint is specified for the position. The cutpoint might not be set to the last evaluated position, but to some other position preceding or following it, depending on the chunking function. Worst case inputs are not common in practice, but there can be long sequences of the same symbol [BBG06]. Examples of problematic content are padding symbols in structured documents and areas of identical color values in picture documents. Bjorner, Blass and Gurevich [BBG06] suggest pre-processing with run-length encoding (RLE) prior to chunking to eliminate sequences of the same symbol. However, if the differing chunks are compressed using a secondary algorithm, long sequences and repeating patterns can typically be encoded into much shorter space, lessening the problem of long chunks. Unbounded chunk length would cause implementation issues such as overflowing variables or insufficient memory, so hard limits must usually be enforced. For practical applications, absolute guarantees are not crucial; it is enough that the probability of misbehavior is low [BBG06]. Avoiding unsuitable chunking parameters, and thus reducing slack for different input strings, would require analyzing the content before chunking. Both hosts should negotiate and agree on the chunking parameters beforehand, which requires an additional communication round for transmitting the analysis results; otherwise the other host would not know how to perform the chunking. We note that by selecting a chunking method tailored for a given string, edits could be expected to be better contained within chunks, so that changes cross chunk boundaries less often.

Because string content is arbitrary, evaluating which chunk division function is better than another becomes a problem. Bjorner, Blass and Gurevich [BBG06] suggest measuring the variance of chunk size on random strings to compare different division functions, as the chunk size should be neither too small nor too large. The theoretical minimum of zero variance is only achievable when using constant-length chunks [EsT05]. There cannot be a chunking method that guarantees a maximum chunk length on all possible input strings, unless a strict numeric length limit is enforced or some restricted input window is used for the division function [TBG06]. Strict minimum and maximum chunk size limits need to be enforced to avoid an excessive or insufficient number of chunks or overflow of application variables. Enforcing chunk length limits interferes with the chunks being content-defined, by effectively applying constant-length chunking to a subsection of a string [EsT05]. Deviating from being content-defined may interfere with finding alignment again. For example, if the current chunk was length-limited due to being too long, and there is a minimum length restriction for the next chunk, and the content-dependent cutpoint position would have been at the next symbol after the enforced cutpoint, the alignment will be regained too late by the minimum amount [MCM01]. Variance calculations are done assuming random content, but in practice the assumption does not hold true due to, e.g., repetitive patterns, and chunking algorithms produce worse results [EsT05].

Another method, capable of detecting an unbounded number of both transposed and misaligned chunks, was described by Tridgell and McKerras [TrM96] and implemented as the rsync tool by Tridgell. The basic idea of the rsync algorithm is for one of the hosts to calculate fingerprints of fixed-length chunks at every possible position of its string and then compare each of the fingerprints against all of the fingerprints received from the other host. If there is a repeat corresponding to the chunk in some position of the other string, comparing against every position is guaranteed to find the repeat, even when it is transposed or partially overlapping with other chunks. The principle is the same as with Obst's algorithm (discussed in section 3.2.1): chunks are the seed prefixes, and the fingerprints are received from the reference host. However, now there are n − c + 1 fingerprints calculated on the version host, instead of approximately n/c. The fingerprint strength must be increased to log₂ n + log₂(m/c) + b bits per chunk to avoid increasing the collision probability. Content-defined chunking thus allows using log₂ c fewer bits per fingerprint than the two-level method. Comparing the chunk fingerprints against a much smaller set of respective fingerprints, instead of against all positions, also requires less computational resources [IMS05].

Another issue with comparing against all positions is that one host must compute Ω(n) fingerprints, which requires a lot of computational power. One of the hosts must calculate a factor of c − 1 more fingerprints; typically 700 times more [Tri99]. Tridgell and McKerras [TrM96] solved the problem (and made the algorithm practical) by using a two-level fingerprinting scheme. The idea is to use more network bandwidth to allow computing less. The reference host calculates two separate fingerprints, a weak and a strong one, for each chunk the host has. The weak fingerprint is calculated using a fast rolling hash function and uses fewer bits than the strong fingerprint. The weak fingerprints are compared first, and only when there is a match is the strong fingerprint computed and compared. The weak fingerprint is only used as an optimization, and its strength is only relevant to the number of unnecessary strong fingerprint computations. Relatively few strong fingerprints need to be computed in the average case [BBG06]. We note that the time saved by not calculating strong fingerprints should exceed the time required to compute and transmit the weak checksums. Our empirical testing results (section 5.2) show that this no longer holds true for sufficiently fast modern multi-core processors combined with sufficiently slow networks.

4.4.1 Content-Defined Division Functions

In this section we discuss different generic, content-agnostic methods for applying content-defined chunking to arbitrary strings. Implementation-specific chunking methods relying on known content properties are more likely to produce a better division into chunks than generic methods. For all division methods, when the string is shorter than the minimum chunk length or when the division condition is never satisfied, the chunk size will be the length of the string.

Arbitrary strings may contain any permutation of symbols, which makes content-defined division problematic [BBG06]. The division functions discussed in this section use hash values calculated from a short sliding window. We assume a uniform distribution of output values for the hash function h(S); hashing is thus expected to produce a more uniform distribution of values on any input string [BBG06]. Hashing is used to avoid some of the pathological cases, such as certain symbols (value ranges) missing from input strings. For example, an input missing half of the possible symbols is still expected to produce a uniform fingerprint distribution, whereas any mask depending on the symbols themselves could fail due to not encountering any of the required symbols. Both the probability of a hash collision and the probability of not computing a fingerprint with the chosen properties at all are assumed to be low. Long sequences of an identical symbol will still produce a long sequence of identical hash values, since the inputs to the hash function are identical. The fingerprinting function's sliding input window can be made longer to prevent short symbol repetitions from causing repetitions of fingerprint values [BBG06].

Expecting a specific hash value would have too low a probability of success [BBG06]. The interval filter as defined by Bjorner, Blass and Gurevich [BBG06] inserts a cutpoint after a contiguous sequence of u fingerprints from a given subset. The fingerprint value range is first divided into two disjoint subsets, X and C. Subset C contains the fingerprints used to detect chunk division positions; fingerprint values from subset X are ignored. When iterating through a string, a position p is a cutpoint if h(S_p) ∈ X and the u − 1 preceding fingerprints all belong to set C: [h(S_{p−u+1}) … h(S_{p−1})] ∈ C^{u−1}, p ≥ u. The probability of any p being a cutpoint on a random input string is |X| |C|^{u−1} / (|C| + |X|)^u, which implies that u has to be small or C large relative to C + X. The average chunk length can be adjusted by adding symbols to or removing symbols from C, and it is expected to be the reciprocal of the cutpoint probability [TBG06]. The minimum chunk length is u, bounded by the hash function's input window. There is no intrinsic maximum for chunk length. The variance of chunk length is (1 − (2u + 1)p̄)/p̄², where p̄ is the probability of a position being a division point. The minimal slack is approximately 3/4, whereas the expected slack is 1 − e⁻¹ + e⁻² ≈ 0.77 [TBG06].

Modulo arithmetic can be used to reduce the range of possible output values to a suitable scale, increasing the probability of computing a specific hash value. The Low-Bandwidth File System [MCM01] implements a point filter method where a position p is a cutpoint when h(S_p) mod q = z, where q is a chosen modulus and z is a chosen constant. Assuming the fingerprint values are uniform, the probability of any position being a cutpoint can be chosen by adjusting the modulus q. The probability of any S_p being a division point is 1/q. The variance of chunk length is approximately q² [BBG06]. The expected slack is approximately 1 − 1/q, whereas the minimum slack of 0.82 is reached when the minimum chunk size limit c is about 0.3q [TBG06]. The Low-Bandwidth File System calculates fingerprints with a Rabin rolling hash function using a sampling window of 48 bytes. Whenever h(S_p) mod 2^13 = 0 (the lowest 13 bits of the fingerprint are zero), a division point is specified. The expected chunk size is 2^13 + 48 bytes, and the effect of the sampling window's size on the expected chunk size is small. For remote differencing, a much shorter chunk length is practical. Long sequences of the same symbol may prevent the division condition from ever becoming true, and the method thus has no intrinsic minimum or maximum limit for chunk size [You06]. Division points that would produce chunks smaller than 2^11 bytes are ignored and the maximum chunk size is limited to 2^16 bytes [MCM01]. The division function is not local [BBG06].
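The following sketch illustrates the point-filter idea (our illustration; it uses a simple polynomial rolling hash rather than LBFS's Rabin fingerprints, and the 48-byte window, 13-bit mask and 2^11/2^16 size limits merely mirror the parameters quoted above):

```python
def point_filter_cutpoints(data: bytes, window: int = 48, mask_bits: int = 13,
                           min_len: int = 2**11, max_len: int = 2**16):
    """Content-defined cutpoints: cut when the rolling hash of the last `window`
    bytes is 0 modulo 2**mask_bits, subject to hard min/max chunk length limits."""
    BASE, MOD = 257, (1 << 61) - 1        # simple polynomial rolling hash (not Rabin)
    top = pow(BASE, window - 1, MOD)      # weight of the byte that leaves the window
    cutpoints, h, last_cut = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD
        if i >= window:
            h = (h - data[i - window] * top * BASE) % MOD
        chunk_len = i + 1 - last_cut
        if (chunk_len >= min_len and h % (1 << mask_bits) == 0) or chunk_len >= max_len:
            cutpoints.append(i + 1)
            last_cut = i + 1
    return cutpoints

import os
data = os.urandom(1 << 20)
cuts = point_filter_cutpoints(data)
print(len(cuts), "cutpoints, average chunk ~", len(data) // max(1, len(cuts)), "bytes")
```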

Instead of expecting specific fingerprint values, relative properties of the fingerprint values can be tested. Schleimer, Wilkerson and Aiken [SWA03] introduced a method named winnowing, suitable for chunk division, based on finding the local minimum of fingerprint values. Bjorner, Blass and Gurevich [BBG06] specified a similar algorithm, based instead on finding the local maximum. The method sets division points based on the unique maximum of a fingerprinting function h(S) within a local horizon, which is defined as the substring S_{p−w} … S_{p+w}, p ≥ w, p + w < n. The position that gives the unique maximum value for the hash function within the horizon is chosen as a division point (see figure 24). If there are multiple maxima, no division point is chosen. Each string position has an equal probability of 1/(2w) of being a cutpoint, and the expected average chunk length is thus 2w. The filter is symmetric, so the expected portion of slack is equal in both directions from the middle of a chunk, totaling ~0.7. Implementations typically specify w to be of the magnitude of 2^8, and the number of possible different fingerprint values is of the magnitude of 2^32 to 2^64. The probability of having multiple shared maxima within a horizon is small, because w is much smaller than the number of possible values of h(S). The probability of not having a local maximum at all is lower than with the interval filter and the point filter. Calculating the variance of the chunk length remains an open problem [BBG06]. Because the method considers only the unique maximum from a given horizon, it has an implicit minimum limit of w for the chunk size [TBG06].

Figure 24: Local maxima of a string, using a horizon size of w = 2.

The local-maxima algorithm finds division points as follows. The pseudocode and description are based on our implementation (see figure 25), which in turn is based on the description by Bjorner, Blass and Gurevich [BBG06]. For clarity, we ignore some trivialities such as a last block with length less than w, which is easily resolved using zero-padding. The string is divided into blocks of length w (28) or w + 1 (7), and all of the blocks are processed in order, from left to right (7, 28). Within a block, all positions are processed in reverse order, from right to left (9, 30). The algorithm stores a greedy sequence containing all positions i where h(i) is greater than all previously found h(v), v > i, within the currently active block and thus also within the greedy sequence (13, 36). Values in the greedy sequence are thus in increasing order of h, and the last value is the largest. The last value in the sequence is a candidate for a division point, except when another equal fingerprint is found – then the candidate status is cleared and the duplicate value is not added into the sequence (15, 38). There are two kinds of blocks, those without an active candidate (‘first half’) and those with a candidate (‘second half’). When a ‘first half’ block has been processed and a candidate was found, there should not be another equal or larger value within w symbols in either direction. To reduce the number of comparisons, instead of scanning backward to check, the algorithm uses the greedy sequence to check for equal or larger values from the immediately preceding block (17…22). When a larger value is found, the candidate status from the current block is cleared (21) and the next block is scanned for candidates (6…16); otherwise the next block (‘second half’) is scanned without clearing the candidate (26…39). After the scanning, if no larger values were found (32), the active candidate from the preceding block is chosen as a division point (46), as no value from the current block was greater, and the preceding block has already been checked (otherwise the candidate status would have been cleared). Processing then moves to the next block (6). If a larger value was found, it becomes the active candidate, and the next block is scanned (26).

The greedy sequence is formed in O(n − 1) time by iterating through every position of a block, and it contains an expected number of log₂ w values. Checking for conflicting values from the preceding block thus requires log₂ w comparisons on average, instead of w. Thus, chunking a string requires O(n + (n/w) log₂ w) time on average. The worst case time bound is O(2n) and the space bound is O(2w).

Figure 25: Pseudocode for local maxima chunk division function.

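As an illustration of the local-maxima division described above, the following naive O(n·w) sketch (ours; it omits the block-wise greedy-sequence optimization of figure 25, and the window hash and horizon size are arbitrary assumptions) selects positions whose hash is the unique maximum within the horizon:

```python
import hashlib

def position_hash(data: bytes, i: int, window: int = 16) -> int:
    """Hash of a short window ending at position i (assumption: any decent hash works)."""
    start = max(0, i - window + 1)
    return int.from_bytes(hashlib.blake2b(data[start:i + 1], digest_size=8).digest(), "big")

def local_maxima_cutpoints(data: bytes, w: int = 64):
    """A position is a cutpoint if its hash is the unique maximum within the
    horizon of w positions on each side (naive O(n*w) version for clarity)."""
    h = [position_hash(data, i) for i in range(len(data))]
    cutpoints = []
    for p in range(w, len(data) - w):
        horizon = h[p - w:p + w + 1]
        if h[p] == max(horizon) and horizon.count(h[p]) == 1:
            cutpoints.append(p)
    return cutpoints

import os
data = os.urandom(1 << 16)
cuts = local_maxima_cutpoints(data)
print(len(cuts), "cutpoints, expected roughly", (1 << 16) // (2 * 64))
```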

To solve the problem of too short or too long chunks, Eshghi and Tang [EsT05] present an algorithm named Two Thresholds, Two Divisors that uses a secondary chunking method for situations where the primary method produces too small or too large chunks. To avoid too large chunks, the secondary chunking method uses a smaller modulus divisor or a shorter w for its division function. Producing hash values from a smaller range of values causes division points to be found more often. The algorithm still implements a hard maximum size limit for the secondary function, using 2.8 times the average chunk size as the limit. Bjorner, Blass and Gurevich [BBG06] suggest ignoring a division point when the chunk size is less than desired, thus joining the short chunk into the following chunk. Sheng, Xu and Wang [SXW10] suggest using fixed-length chunking as the second-level division method.

The local-maxima filter was found to cause less slack and checksum overhead in practice than the interval filter and the fingerprint mask filter in testing done by Bjorner, Blass and Gurevich [BBG06]. In practice a naive fingerprinting function (not actually calculating a hash at all) is faster in execution time than Rabin's polynomial hash function and still sufficient for finding the locally maximal fingerprint [TBG06]. We however noted in our empirical testing that alignment changes (string modifications) cause second-order changes in chunk division points due to the local maxima moving across horizon boundaries. When sequences of hash values are identical in two strings but the horizon boundaries are shifted, some unchanged chunks are split or joined and thus can no longer match. We note that the local-maxima function could be used as a position selection function for sparse fingerprinting in local differencing. Local minima would have identical performance due to symmetry. However, when comparing content directly without hashing, the local maximum would likely yield better performance, as zero-padding is common in practice.

4.5 Differencing Fingerprints

We now discuss algorithms for inferring which chunks of the version host are not present on the reference host and which chunks have been transposed. In subsection 1 we discuss differencing algorithms based on (recursive) comparison of chunk fingerprints. The general idea is that one host transmits a set of fingerprints, and the other host responds with comparison results. These algorithms share a communication space bound of O(n log n). In subsection 2 we discuss another approach to differencing, based on interpolating solutions to a group of polynomial equations. A set of evaluations of a characteristic polynomial of the fingerprints is communicated from the reference host to the version host. No fingerprints are sent for differencing. On the version host, factoring is used to reconstruct the polynomial, which can be used to infer which fingerprints are present on the reference host. The communication space bound of the algorithm is not dependent on the length of the strings and is O(2d), where d is the number of differing chunks. Both methods perform the same task of inferring which fingerprints are missing from the reference host or relocated, and thus the two methods are interchangeable as parts of an implementation [YIS08].

4.5.1 Fingerprint Comparison

Remote differencing algorithms can be divided into two classes depending on whether multiple communication rounds are allowed: multiple-round and single-round. A communication round consists of one host transmitting a message and the other host responding. Additional communication rounds allow communicating fewer bits in total, as after each message the hosts have more information, allowing them to avoid sending data known to be unnecessary [YIS08]. Each round requires waiting for the network's round-trip time. High communication latency is a problem because, while waiting for a response from the other host, large amounts of data could have been transmitted. For example, using a synchronous wireless network capable of transmitting 8 mebibits per second and having a round-trip time of 100 ms, ~100 kibibytes could be transmitted from one host to the other in the time it takes for the first byte to be communicated back to the originating host. Latency can make single-round algorithms more efficient for synchronizing individual short strings [YIS08]. Multi-round communication also requires the other host to maintain session state, which can be problematic due to malicious users and resource limitations [IMS05]. For example, the session state includes data structures storing information about remaining unmatched chunks, such as a bitmap [SNT04]. Another reason for avoiding multiple communication rounds is maintaining compatibility with existing applications or protocols that do not allow maintaining session state [MKD02, IMS05].
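The round-trip figure quoted above can be verified with a short calculation (using the stated assumptions of 8 Mibit/s and 100 ms):

```python
# Verifying the latency example above under the stated assumptions.
link_bits_per_s = 8 * 2**20   # 8 Mibit/s
rtt_s = 0.100                 # 100 ms round-trip time
bytes_per_rtt = link_bits_per_s * rtt_s / 8
print(f"{bytes_per_rtt / 2**10:.0f} KiB can be sent during one round trip")  # ~102 KiB
```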

Fingerprint differencing with a single communication round is performed as follows, based on the algorithm by Tridgell and McKerras [TrM96] (see figure 26). A similar algorithm was described by Pyne [Pyn95]. The reference host first notifies the version host (7). The version host divides its string into chunks and calculates a fingerprint for each chunk (1). The version host then transmits all of the fingerprints to the reference host in a single communication (2). The communication includes the parameters of the chunk division function and the chunk sizes. The size list allows inferring chunk positions, and each chunk can be identified by its index within the list. The reference host stores all of the fingerprints into a hash table (9). The reference host then iterates through its string and divides it into chunks using a chunk division function identical to the one the version host used (10). A fingerprint is calculated for every chunk (19). Each fingerprint from the reference string is queried from the hash table (11…12), and thus compared against all fingerprints from the version string, which allows finding transposed chunks. When the fingerprints match, the algorithm assumes that the chunk contents are identical and proceeds to the next reference string fingerprint (11). When the fingerprint does not match, the algorithm inserts the index of the fingerprint into a list of missing chunks (13). After all fingerprints have been checked, the missing-chunk list contains the indices of all those version host fingerprints that the reference host does not have. The reference host requests the contents of the missing chunks by transmitting their fingerprints to the version host (14). The version host responds by sending the chunk substrings (4…6), and the algorithm is complete. The lower bound of communication for this algorithm is Ω(m/c) [Tri99], the space bound is O(1) and the time bound is O(m).

Figure 26: Pseudocode for comparison-based fingerprint differencing.
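The same single-round exchange can be sketched compactly as follows (our illustration, not the pseudocode of figure 26; both hosts are simulated in one process and the chunk length and hash are arbitrary assumptions):

```python
import hashlib

CHUNK = 1024

def fingerprints(data: bytes):
    return [hashlib.sha1(data[i:i + CHUNK]).digest() for i in range(0, len(data), CHUNK)]

def reference_host_diff(reference: bytes, version_fps):
    """Return indices of version chunks whose fingerprints the reference host lacks."""
    have = set(fingerprints(reference))            # hash table of local fingerprints
    return [i for i, fp in enumerate(version_fps) if fp not in have]

# Version host: compute fingerprints and send them (the single communication round).
reference = bytes(range(256)) * 64                 # 16 KiB reference string
version = bytearray(reference)
version[5000:5100] = bytes(100)                    # same-length in-place edit: no shift
missing = reference_host_diff(reference, fingerprints(bytes(version)))
print("version chunk indices to request:", missing)   # only the edited chunk, e.g. [4]
```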

To solve the alignment problem by comparing reference chunks against every position of the version string, the algorithm can be modified to use two-level fingerprinting as follows, as described by Tridgell and Mackerras [TrM96]. The reference host calculates both a weak and a strong fingerprint for each constant-length chunk, instead of just the strong fingerprint. The weak checksum is calculated using a fast rolling hash function using fewer bits, while the strong checksum is calculated using a cryptographic hash function using the same number of bits as without two-level fingerprinting. Both sets of fingerprints are then sent to the version host. The version host stores the two sets of fingerprints into two separate hash tables, instead of storing strong fingerprints into a hash table using weak fingerprints as keys. Weak hashes may cause collisions, but weak hashes are only used as a probabilistic filter. Boolean membership queries are sufficient, and hash chaining is not needed. The version host then iterates through its string and calculates a weak fingerprint for the interval of length w starting at every position. Each weak fingerprint is queried from the hash table, and only when a fingerprint is found to exist is a strong fingerprint calculated and compared for verification. After matching a chunk, weak hash comparison is continued at the end position of the matched repeat (overlapping repeats are not considered).

While two-level fingerprinting allows finding matches when there are alignment changes or long enough transposed substrings (at least one whole chunk within the transposed substring), it requires transmitting the weak fingerprints and requires larger strong hashes. In total, the chunks require Θ((h_w + log₂ w)(n/c)) bits of additional data, where h_w is the weak fingerprint length and c is the average chunk size. With Tridgell's and McKerras' [TrM96] implementation, the weak fingerprints are calculated using a function based on Adler-32, which is based on Fletcher's checksum. Strong fingerprints are calculated using 128-bit MD5, thus requiring a total of 160 fingerprint bits per chunk [TrM96, Tri08]. The two-level scheme is effective only when there are few collisions between the weak fingerprints. Choosing the weak hash function is a compromise between speed and collision probability [Tri99]. Differencing against a string consisting of a long sequence of the same symbol causes all positions to match the same weak fingerprint, thus causing the version host to calculate strong fingerprints for every position of the string [BBG06]. The worst case execution time is thus worse by a factor of c than without the modification. A denial of service attack can easily be launched against a public server by requesting synchronization of a worst-case string containing only a long sequence of the same symbol, causing all weak checksums to match [BBG06].

The roles of computing weak fingerprints per chunk or computing them for every position are interchangeable. Depending on the implementation, the version host can be the one calculating and comparing fingerprints for all positions [Tri99]. The decision on which host takes which role is based on upstream network bandwidth and security considerations [Tri99]. The reference host needs to do more computation and the version host needs to send all checksums, regardless of whether there are any differing chunks. Often the server should do the least computation. We note that the two-level modification can be used with variable-length chunks by choosing the shortest of the chunk lengths as the upper bound w for the size of the input window of the rolling hash function (using a shorter w is also possible). When a chunk is longer than w, the remaining suffix of the chunk is ignored and excluded from the input of the weak hash function. Thus, the weak hash no longer represents a chunk, but instead a seed prefix of a chunk, which can span up to the whole chunk. We also note that the two-level algorithm is similar to dictionary searching for local differencing. Instead of performing a symbol-wise comparison, Obst's algorithm could compute and compare strong fingerprints. Instead of having string positions in the hash table, it could contain strong hashes. An algorithm could thus avoid any random access related to the verification comparison.
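A sketch of the two-level matching on the rolling side (our simplification: a Fletcher-like sum stands in for rsync's Adler-32-based weak hash, SHA-1 for MD5, the weak hash is recomputed rather than rolled, and the 1024-byte chunk length is an arbitrary assumption):

```python
import hashlib

CHUNK = 1024

def weak(data: bytes) -> int:
    # Stand-in for Adler-32/Fletcher: a byte sum and a weighted sum, each mod 2**16.
    a = sum(data) % 65536
    b = sum((len(data) - i) * x for i, x in enumerate(data)) % 65536
    return (b << 16) | a

def strong(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def find_matches(local: bytes, remote_fps):
    """Slide over `local`, testing the cheap weak hash first and computing the
    strong hash only on weak hits; returns (local_offset, remote_chunk_index) pairs."""
    weak_index = {}
    for idx, (w, s) in enumerate(remote_fps):
        weak_index.setdefault(w, []).append((idx, s))
    matches, i = [], 0
    while i + CHUNK <= len(local):
        w = weak(local[i:i + CHUNK])        # recomputed each step for clarity;
        candidates = weak_index.get(w, [])  # a real implementation rolls this in O(1)
        hit = next((idx for idx, s in candidates if strong(local[i:i + CHUNK]) == s), None)
        if hit is not None:
            matches.append((i, hit))
            i += CHUNK                      # continue after the matched chunk
        else:
            i += 1
    return matches

remote = b"HEADER--" + bytes(1024) + b"PAYLOAD!" * 256
remote_fps = [(weak(remote[i:i + CHUNK]), strong(remote[i:i + CHUNK]))
              for i in range(0, len(remote) - CHUNK + 1, CHUNK)]
local = b"xxx" + remote                     # 3-byte shift: fixed chunks are still found
print(find_matches(local, remote_fps))      # [(3, 0), (1027, 1), (2051, 2)]
```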

The described single-round algorithm requires all of the reference checksums to be sent, even when the strings are identical (the hosts do not know that), which adds overhead to the protocol [Tri99]. When synchronizing a long string there can be millions of chunks. There are often only a few changes made to the string, and most of the chunks (and fingerprints) are identical between the strings [IMS05]. Compression efficiency can be increased by allowing multiple communication rounds between the hosts so that they can avoid transmitting some of the fingerprints by exchanging information progressively [TBG06]. Multi-round recursive chunking reduces the total number of fingerprints that need to be transmitted by first starting with superchunks representing multiple chunks and then recursively dividing the chunks into progressively smaller chunks on each communication round (see figure 27) [TBG06]. When two sequences of fingerprints are identical on both hosts, the superfingerprints are also identical, and the algorithm can avoid transmitting the lower-level fingerprints. Recursive chunking requires the hosts to calculate multiple levels of fingerprints, which can be done using a single reading pass, but requires more computation in proportion to the number of recursion levels. Requiring multiple levels of fingerprints also delays the start of transmission, because higher-level fingerprints cannot be computed until there is a sufficient number of lower-level fingerprints. Improvements in compression ratio can mask the additional latency and protocol overhead. Multi-round synchronization can reduce bandwidth usage by a factor of 2 to 3 [Lan01, IMS05, SNT04] compared to the generic single-round comparison algorithm. The longer the string, the more multiple-round communication can improve synchronization time [TBG06].

Figure 27: Recursive comparison-based fingerprint differencing.

A synchronizer might begin with a superchunk size of 2^20 bytes and then halve the chunk size on each communication round, until finally having a chunk size of 2^8 bytes after 11 rounds. For comparison, Tridgell's popular single-round rsync algorithm uses a chunk size of √n bytes with a minimum limit of 700 bytes [Tri99, You06]. The best case allows detecting that the whole string is identical on the first round, thus requiring transmission of only a few superfingerprints [SNT04, You06]. The worst case of having two completely different strings is worse than with single-round synchronization, because O(log₂ C) additional units of data had to be transmitted, using O(log₂ |C|) communication rounds, to detect that all of the fingerprints were different. The chunk size at which recursion should be stopped is ~2^6…2^7 bytes, which is when the increase in fingerprint data exceeds the decrease in slack [SNT04]. The upper bound for transmission when splitting chunks recursively is O(k (log₂ n)²), but this does not apply when block copying is allowed [IMS05].

Recursion using constant-length chunks again exhibits the alignment problem: if enough symbols are added or removed to insert or remove a chunk from a string, the contents of all superchunks following the modification position will be different [BBG06]. All superfingerprints for higher recursion levels change. Removing the first lowest-level chunk would change all superfingerprints. Because multiple levels of fingerprints are calculated before transmitting anything to the other host, comparison against all string positions cannot circumvent the alignment problem [BBG06]. Teodosiu et al. [TBG06] presented a solution that recursively applies content-defined chunking to the fingerprints. A string is chunked and fingerprints are calculated from the chunks. All of the fingerprints are then catenated into a temporary string. The temporary fingerprint string is chunked using a content-defined variable-length chunking function, thus providing a new, smaller set of superfingerprints, reducing the count by roughly a factor of c per level (so on the order of log_c n levels suffice). Each new superfingerprint now represents a variable-length sequence of chunks of the original string. The new set of superfingerprints is again recursively catenated into another temporary string and fingerprinted, until a sufficiently short temporary string is produced. This method first synchronizes the temporary superfingerprint strings, starting from the highest level, and then uses a final communication round to synchronize the actual strings.

The optimal number of recursions depends on network latency and bandwidth: a first temporary superfingerprint string that is too short can lead to one additional communication round, and the latency then exceeds the time saved from transmitting less data. Based on their empirical testing, Bjorner, Blass and Gurevich [BBG06] recommend using a smaller chunk size of 2^8 for the superfingerprints and 2^11 for the strings. There is no need for all of the recursive chunking rounds to use the same chunking method. For example, two-phase chunking [SXW10] can use a different chunking method for later rounds. We note that because each communication round requires waiting for the round-trip latency, the recursion should be optimized so that the first message contains at least a full protocol-layer packet's length of fingerprints. Using an estimate of the bandwidth corresponding to the round-trip latency, the algorithm can calculate whether it is faster to transmit more fingerprints within one message, or to split the message by introducing one additional level of recursion.
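One recursion step of this superfingerprint construction can be sketched as follows (our illustration; the content-defined grouping rule on the fingerprint string and the mask value are arbitrary stand-ins for the chunking functions discussed in section 4.4.1):

```python
import hashlib

def strong_fp(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def cdc_on_fingerprints(fps, group_mask: int = 0x3):
    """Group a sequence of fingerprints into superchunks using a content-defined rule:
    end a group whenever the first byte of a fingerprint matches a mask (assumption)."""
    groups, current = [], []
    for fp in fps:
        current.append(fp)
        if fp[0] & group_mask == 0:      # content-defined boundary on the fp string
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

def superfingerprints(fps):
    return [strong_fp(b"".join(g)) for g in cdc_on_fingerprints(fps)]

# Level-0 fingerprints of e.g. 2^11-byte chunks would be fed in here; one recursion
# step reduces their count by roughly the average group size (~4 with this mask).
level0 = [strong_fp(bytes([i % 256])) for i in range(1000)]  # stand-in level-0 fingerprints
level1 = superfingerprints(level0)
print(len(level0), "->", len(level1), "superfingerprints")
```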

4.5.2 Characteristic Polynomial Interpolation

The fingerprints can be viewed as an unordered multi-set of integers, F = {f₀, f₁, …, f_{n−1}}, if the order of the fingerprint sequence is given separately. Consequently, the fingerprint differencing problem can be viewed as a set reconciliation problem on unordered integers. Solving the problem will reveal which fingerprints are missing, but does not reveal transpositions or to which positions the missing fingerprints belong. Starobinski, Trachtenberg and Agarwal [STA03] note that it is possible to reconstruct any subset of missing fingerprints of size d by solving a set of polynomial equations, instead of transmitting fingerprints for comparisons. The set reconciliation can be solved with a communication space bound of O(2d), where d is an upper bound for the number of symmetric differences [STA03]. Symmetric difference means that if one fingerprint is changed on the version host, it is then missing from the reference host, but the unchanged fingerprint on the reference host is now also missing from the version host. One change thus equals two differences. The required bandwidth is not dependent on the length of the strings (the number of fingerprints), but only on the allowed maximum number of differences. The set reconciliation algorithm, however, requires prior knowledge of an upper bound for d, which means that an algorithm has to choose d [STA03]. Also, set reconciliation considers only the content of the elements, not their relative ordering, which requires transmitting additional information [STA03]. The amount of additional information is constant per fingerprint and thus does not change the asymptotic bound.

The polynomial is not a summary function like a hash function, because it contains all the information in the set [MTZ03]. Minsky, Trachtenberg and Zippel [MTZ03] describe the operating principle of the method as follows. When dividing a polynomial by another, the common terms cancel each other out. The polynomial itself cannot be transmitted with fewer bits than the set, but it is possible to compute values from a small number of evaluation points, at least k of them (where k is the number of actual differences, not the upper bound d). The results of the divisions can be used to interpolate the rational function by factoring. Let Δv = F_v ∖ F_r and Δr = F_r ∖ F_v. All terms that correspond to elements in both F_r and F_v will cancel each other out, leaving only χ_Δr(Z) and χ_Δv(Z) (see figure 28). The idea is to divide out the terms of the polynomials at a set of evaluation points. When Δr and Δv are small, the interpolated rational function has a small degree, requiring a small number of evaluation points. However, this implies that both Δr and Δv are known, which they are not; each host knows only its own polynomial. Choosing the number of evaluation points thus becomes a problem. Sending too few sampled evaluations, by choosing a d smaller than the number of differing fingerprints, prevents interpolating the rational function because there are multiple possible answers [STA03]. The result of the division is a parity sum of the difference set. When there is only a single error, the division produces the missing bit sequence [MTZ03].

χ_{F_r}(Z) / χ_{F_v}(Z) = (χ_{F_r ∩ F_v}(Z) · χ_{Δr}(Z)) / (χ_{F_r ∩ F_v}(Z) · χ_{Δv}(Z)) = χ_{Δr}(Z) / χ_{Δv}(Z)

Figure 28: Common terms of characteristic polynomials cancel each other out.
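The cancellation in figure 28 can be checked numerically over a small prime field (a sketch with made-up fingerprint sets; 𝔽₉₇ matches the field used in figure 29):

```python
# A minimal sketch verifying the cancellation idea of figure 28 over F_97.
# The fingerprint sets below are hypothetical, chosen for illustration only.
P = 97  # prime modulus of the finite field

def char_poly_eval(fingerprints, z, p=P):
    """Evaluate the characteristic polynomial prod (z - f) at point z over F_p."""
    result = 1
    for f in fingerprints:
        result = (result * (z - f)) % p
    return result

ref = {1, 2, 9, 12, 33}      # fingerprints on the reference host (hypothetical)
ver = {1, 2, 9, 10, 12, 28}  # fingerprints on the version host (hypothetical)

# Evaluation points must not collide with any fingerprint value.
points = [90, 91, 92, 93]
for z in points:
    lhs = (char_poly_eval(ref, z) * pow(char_poly_eval(ver, z), -1, P)) % P
    rhs = (char_poly_eval(ref - ver, z) * pow(char_poly_eval(ver - ref, z), -1, P)) % P
    assert lhs == rhs  # common terms cancel: the ratio depends only on the differences
print("ratios match at", points, "for difference sets", ref - ver, ver - ref)
```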

Starobinski, Trachtenberg and Agarwal [STA03] describe an algorithm for synchronizing unordered chunks. Both hosts start by dividing their string into chunks, separately on each host, using some identical division function. Each host then computes a strong fingerprint for each chunk and uses the fingerprints to compute a characteristic univariate polynomial χ_F(Z) = (Z − f₀)(Z − f₁) … (Z − f_{n−1}) over its set of fingerprints. Each host computes its own characteristic polynomial independently. The hosts choose 2d evaluation points each and then evaluate the polynomial at the chosen sample points (see figure 29). Both hosts need to have the same evaluation points. The evaluation points cannot be the same as any fingerprint, because a characteristic polynomial would then evaluate to zero. A seed number can be transmitted from one host to another to enable generating identical pseudo-random points, instead of sending a series of numbers [YIS08]. The version host transmits the evaluation results to the reference host, after which the version host does not need to do any more computation. The reference host builds a Vandermonde matrix and solves it using back substitution, retrieving a set of coefficients that are converted to polynomials. It then calculates the greatest common divisor of both hosts' polynomials and divides them to eliminate common terms (fingerprints). Then, the reference host evaluates the polynomials to check that the results are valid. Finally, the reference host uses factoring to solve the roots of the polynomials. The solved roots are the missing fingerprints: the numerator gives the fingerprints missing from the version host, and the denominator the fingerprints missing from the reference host. After solving which fingerprints the reference host is missing, the version host iterates through all its fingerprints and transmits all chunk substrings that the reference host does not have. The algorithm does not reveal where the differing chunks (fingerprints) belong or whether any chunks are transposed. Because the communication cost is not dependent on the number of chunks, the average chunk length can be shorter than with comparison-based algorithms, thus reducing slack [MTZ03]. The same message could be broadcast to multiple reference hosts, and each one could independently compute the missing fingerprints, as long as the number of symmetric differences to the version fingerprint set is smaller than d on each host [MTZ03].

Figure 29: Sample values calculated over the finite field 𝔽₉₇ from chosen evaluation points [MTZ03].

For detecting positions and transpositions of chunks, Chauhan and Trachtenberg [ChT04] describe a modified version of the set reconciliation algorithm. To convert the ordered set of chunks into an unordered one, they used fixed-length chunks starting at every position of a string, overlapping the next chunk by c − 1 symbols. The idea is to use the overlapping interval for inferring where the chunk belongs, like pieces of a puzzle [ACT06]. When the overlapping suffix matches a prefix of another chunk, the chunks are assumed to be successive. It is possible to use a bitmask filter to create a non-contiguous overlap by including those following symbols that have the corresponding mask bit set. The hosts construct modified de Bruijn directed graphs from their chunks. Because the graphs may allow finding multiple Eulerian cycles, and only some of them correspond to the string, the cycles are enumerated. The ordinal of the correct Eulerian cycle is sent to the other host, thus requiring only a single integer. To solve the problem of converting a multi-set into a set, each chunk is appended with the number of times it has occurred in the multi-set, before fingerprinting the chunk [MTZ03]. Because the chunks overlap, a single change may affect multiple chunks [MTZ03]. Each symbol position corresponds to c chunks, increasing the number of differences proportionally to the chunk size [YIS08]. The algorithm requires O(k (log₂ n)²) bits on average with two communication rounds, when the effective mask length is log₂ n [YIS08].

Yan, Irmak and Suel [YIS08] further refined the algorithm by using variable-length content-defined chunk division and by complementing the message to the other host with explicit ordering information. To eliminate ambiguities, the overlapping interval has to be long. Transmitting explicit position information to resolve ambiguities allows using a shorter overlap length. Each chunk (except the last one) is extended forward by b symbols to create an overlapping interval for successive chunks, but no de Bruijn graph is constructed. Choosing the overlap length is a compromise between increasing the number of differences and increasing the number of ambiguous positions, which require explicit position information. Increasing the overlap length reduces the number of different possible positions for a chunk, also making it possible to encode the sequence of differences with fewer bits [YIS08]. The algorithm requires O(k log₂ n) bits on average, using a single communication round [YIS08]. The transmission includes the number of chunks the version host has, and the chunk substrings and their lengths. Additionally, if there are any chunks that have ambiguous positions based on the overlap, an enumeration ordinal of the correct choice is included [YIS08]. Ordering information is required only for those chunks whose position cannot be inferred from the overlapping interval, requiring ⌈log₂ a⌉ bits, where a is the number of possible following chunks [YIS08]. Secondary compression can be used for the substrings. It is possible for a version host to reconcile sets with several different reference hosts using the same transmission content, because the same information is required to be transmitted regardless of how the reference string differs [YIS08].

We now present the pseudocode for characteristic polynomial interpolation based set reconciliation, as described by Starobinski, Trachtenberg and Agarwal [STA03] (see figure 30). For clarity, we have omitted special cases, such as when either set of fingerprints is empty or shorter than the chosen d, or when the lengths of the fingerprint sets differ by more than d. The reference host first requests d samples from the version host (1). The version host initializes its parameters based on the chosen fingerprint size and the maximum number of differences (9…10). A finite field large enough to hold the fingerprints and evaluation points is used to limit the size of the product of polynomial factors. The version host then divides its string into chunks using some division function and computes a fingerprint for each chunk (11). Out of the set of fingerprints, the version host chooses a number of evaluation points and computes the sample values, which are then sent to the reference host (13…14). After the reference host receives the request parameters and the version samples (2…3), it initializes itself with the same parameters (4) and computes fingerprints from its string the same way the version host did (5). The reference host evaluates its polynomial at the same evaluation points as the version host did (7). To make room for symmetric evaluation points, the range of fingerprint values is shifted on both hosts (6, 12). The evaluation points are chosen as the d largest and smallest values of the finite field. From the samples, the reference host constructs a Vandermonde matrix (15…19), which is then reduced to a smaller dimension using Gaussian elimination (20…25). The algorithm then solves the resulting group of equations using back substitution (26…31), and calculates the greatest common divisor (32…33) to eliminate common terms (34). The result is then verified by evaluating the polynomials at the evaluation points and comparing the results (35). Additional sample values can be used for verification. If the verification fails, there are too many differences in the fingerprint sets and set reconciliation must be retried using more samples. When the verification succeeds, the algorithm can use factorization to solve the roots of the numerator and denominator of the characteristic polynomial equation (36…39). The roots of the numerator are the fingerprint values missing from the reference host, and it can request the contents of the respective chunks from the version host. To detect transpositions and duplicate chunks, the fingerprints should be altered so that the position of the chunk affects the fingerprint calculation, in addition to just the chunk content. The version host must also transmit chunk positions, because they cannot be inferred implicitly as with recursive fingerprint comparison. The roots of the denominator are the fingerprint values missing from the version host, but this information is not required for one-way synchronization. The roles of the version host and reference host can be switched, so that the version host does the factorization. The change allows synchronization with a single request-response round, but may prevent the version host from servicing multiple reference hosts, because the computational requirement is high.


Figure 30: Pseudocode for set reconciliation based fingerprint differencing.

Computing the values at the evaluation points requires evaluating both strings completely, requiring O((n + m)d) time. On the reference host, the interpolation problem is solved by solving 2d + 1 linear equations. Factoring the linear equations using Gaussian elimination requires O(d³) time. Because integers larger than the processor's machine word are used (strong fingerprints), additional processing is required per operation, due to the need to compute in pieces using smaller integers. Minsky, Trachtenberg and Zippel [MTZ03] note that there are asymptotically faster algorithms, but it is unclear whether they are faster in practice. Validating the rational functions can be done probabilistically by evaluating them at additional random points and checking that the results are the same. Communication requires O((b + 1)d) bits without additional evaluation samples for testing the rational functions. Solving the required upper bound d for the number of differences could be done by iteratively attempting to reconcile and then retrying with a progressively larger number of sample evaluations after each failure. The testing can be done so that d is increased each time the test fails, leading to O(d⁴) execution complexity (as the test can fail for each d), or so that the number of test evaluations is increased by a factor of c after each failure, leading to O((d + c)³) complexity when c is constant. Set reconciliation is not suitable for use cases where the number of differences is large, due to the cubic or quartic time bound. Choosing d as close as possible to the correct number of differences is important, because overestimating causes unnecessary computation and bandwidth usage, and underestimating causes a failure of the algorithm or requires another iteration with a larger d [YIS08]. Knowing the magnitude of d would allow switching to the recursive comparison algorithm when d is too large [YIS08]. Recursive fingerprint comparison becomes faster approximately when d ≥ n/log₂ n [YIS08]. The problem, however, is that the number of differences is not known beforehand and failure cannot be detected before getting the results from the factorization. Estimating an upper bound for d can be done by sending partial fingerprints of ⌈log₂ s⌉ + 8 bits per sample, where s is the number of samples, and inferring from the ratio of matches an estimate for the total number of differences [AgT06, YIS08]. When d is chosen as the actual number of differences, the method is nearly communication optimal [MTZ03]. Finally, as noted earlier, synchronization is related to transmitting with error correcting codes: the data sent by this algorithm can be viewed as the redundancy data of a transformed Reed-Solomon error correcting code [YIS08].

4.6 Synchronization

Similarly to local differential compression, remote synchronization is done in three phases [TrM96]. First, the hosts use some remote differencing algorithm to infer which chunks are missing from or relocated on the reference host. In the second phase, the version host transmits chunk position information and the contents of the missing chunks to the reference host. Position information is needed for handling chunk transpositions and duplicate chunks. A differencing algorithm that cannot detect transpositions does not need to transmit position information; (implicit) chunk index information is sufficient. For matching chunks, the version host transmits the chunk identifier and chunk position, which is analogous to a copy directive in delta encoding. For non-matching chunks, the version host transmits the chunk contents, length and position, which is analogous to an insert directive. The insert directives' substrings can be compressed using a secondary compression algorithm, with compression contexts that refer to surrounding substrings that are not themselves transmitted [SNT04]. From the set of copy and insert directives, a reconstruction algorithm can infer which intervals have been removed, so explicit remove directives are not required.

Reconstructing the terminus string can be done similarly to local differencing, by iterating through the directives and constructing a temporary string [Tri99]. Remote synchronization can be done in-place similarly to local synchronization, requiring read-write conflict resolution prior to beginning to write the terminus string. Conversion of copy directives to insert directives will increase the size of the delta encoding (communicated from the version host to the reference host). If the number of conflicts is small, using temporary space allows a version host to transmit modified copy directives instructing the receiver to make a temporary copy of a conflicting chunk, to avoid overwriting the chunk prior to referring to it. Temporary space also allows the reference host to do the read-write conflict resolution, instead of the version host. The temporary space is analogous to the increase in delta encoding output size with local differential compression.

The reference host should transmit some initial notification message to the version host to allow the version host to begin fingerprinting its string concurrently with the reference host, instead of waiting for a multi-gibibyte reference string to complete fingerprinting [Tri99]. Without a notification, the version host has no knowledge of the reference host. The message can also contain information required for correct operation of a differencing algorithm, such as the parameters of the chunk division function. No response is required, so the initial communication can be viewed as a prefix of the first main message, and does not incur the round-trip latency [Tri99]. Practical implementations often send a directory listing of files and skip synchronizing files with identical modification times [Tri99]. Single-round algorithms can begin sending the fingerprints while still fingerprinting. However, multi-round recursive comparison requires lower-level fingerprinting to be completed before it can fingerprint the higher level. Characteristic polynomial interpolation requires full completion of the fingerprinting to be able to compute the characteristic polynomial, and thus benefits from a separate start notification. Similarly, reconstructing the terminus string can be started while receiving fingerprints, unless in-place reconstruction is required. In-place reconstruction is not possible until differencing is completed, because read-write conflict resolution requires knowing the complete set of copy directives. When synchronizing multiple strings, the latency can be masked by pipelining multiple synchronizations into one communication round, eliminating latency from connecting and protocol handshaking [TrM96]. Pipelining can improve synchronization speed by a factor of 100 with short strings [Tri99].

A synchronization algorithm should transmit a strong checksum of the whole string for verifying that the reconstructed terminus string is identical to the version string [Tri99]. The reference host computes a corresponding checksum of the terminus string and compares it against the received checksum. When the checksum does not match and a rolling hash function is used, it is re-initialized with another randomized polynomial [Tri99]. With cryptographic hash functions, the fingerprint size can be increased if less than the full hash was used. Synchronization is then attempted again against the corrupted terminus string, instead of the original reference string [Tri99]. The corrupted terminus string is assumed to have only a few differing chunks due to fingerprint collisions, fewer than the original reference string had prior to the failed synchronization [Tri99, RaB03]. If (in-place) synchronization fails, the corrupted terminus string can be preserved as a renamed file and synchronization can later be retried against it, until it succeeds [RaB03]. However, if cryptographic hash functions are used for fingerprinting and the failure was not a result of a transmission error, the failure was likely caused by a hash collision. In such a case, the strings cannot be synchronized.

4.7 Optimizations

Random values have no redundancy between them. Checksums are thus incompressible by themselves, and no known encoding can be used to make the checksums more compact. Because fingerprints are assumed to be pseudo-random, they are not expected to be compressible and there is no need to entropy-encode fingerprints [YIS08]. However, the fingerprint values can be sorted into an ordered list. Instead of sending the fingerprints themselves, the difference to the preceding fingerprint value is sent using Golomb coding [YIS08]. For example, when an individual fingerprint needs 64 bits and there are 2²⁰ distinct fingerprints to be sent, the expected size for successive fingerprints is ~44 bits per fingerprint. A similar optimization can be used for chunk length and position information. Chunk positions can be given as relative offsets to preceding chunks. Assuming that chunk length variance is small, it is more efficient to transmit the difference to the average chunk length (a smaller number) instead of the chunk length itself (a larger number). Additionally, the sequences of integers can be compressed using some secondary compression algorithm.
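The following minimal C++ sketch illustrates the gap-encoding idea with a Rice code (a Golomb code whose divisor is a power of two); the coder, the parameter choice and the randomly generated fingerprints are our own illustrative assumptions rather than the exact construction of [YIS08]. For 2²⁰ uniformly distributed 64-bit fingerprints it should print roughly 46 bits per fingerprint, i.e. close to the ~44-bit figure above plus a small unary overhead.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <vector>

    // Append one value as a Rice code (Golomb code with divisor 2^k):
    // quotient in unary, remainder in k binary bits.
    static void riceEncode(std::vector<bool>& bits, std::uint64_t value, unsigned k) {
        std::uint64_t q = value >> k;
        for (std::uint64_t i = 0; i < q; ++i) bits.push_back(true);
        bits.push_back(false);                       // unary terminator
        for (int b = static_cast<int>(k) - 1; b >= 0; --b)
            bits.push_back((value >> b) & 1u);
    }

    // Sorts the fingerprints and encodes the gaps between consecutive values.
    // For m uniformly distributed 64-bit fingerprints the expected gap is about
    // 2^64 / m, so k is chosen near log2 of that instead of spending 64 bits each.
    std::vector<bool> encodeFingerprintGaps(std::vector<std::uint64_t> fps, unsigned k) {
        std::sort(fps.begin(), fps.end());
        std::vector<bool> bits;
        std::uint64_t prev = 0;
        for (std::uint64_t fp : fps) {
            riceEncode(bits, fp - prev, k);          // gap to the preceding fingerprint
            prev = fp;
        }
        return bits;
    }

    int main() {
        std::mt19937_64 rng(42);
        std::vector<std::uint64_t> fps(1 << 20);
        for (auto& fp : fps) fp = rng();
        std::vector<bool> bits = encodeFingerprintGaps(fps, 44);
        std::cout << static_cast<double>(bits.size()) / fps.size()
                  << " bits per fingerprint instead of 64\n";
    }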

For transmitting less fingerprint data with comparison-based differencing algorithms, Suel, Noel and Trendafilov [SNT04] suggest that the reference host can send only partial (weakened) fingerprints. Multiple successful comparisons can then be confirmed by calculating a strong checksum from the matched chunks using another communication round. A differing weakened fingerprint cannot be made identical by complementing it with the missing part. The weakened matches are strong enough to identify candidate chunks for later verification, but not strong enough to verify matches by themselves. The version host can then request chunk contents for the partial fingerprints that did not match and transmit the verification checksum in the same communication round. When the verification checksum does not match, the set of potentially matching chunks can be divided into smaller subsets (possibly of size 1), and verification checksums are calculated and transmitted again on the next round. Finding out which blocks had fingerprint collisions is closely related to the problem of group testing and binary search with errors [SNT04]. Suel, Noel and Trendafilov also note that verifying the weak matches on each communication round is not optimal, because temporary results help infer which chunks may differ with higher probability (candidate matches). When a host receives weak hashes, it calculates the respective weak fingerprints and responds to the other host with bitmaps of groups of candidate matches and strong verification checksums for the groups. The other host then responds with a bitmap of successfully matched verification checksums.

A differing chunk rarely contains exactly the differences: often there is slack at both the beginning and the end of a chunk. A neighboring matching chunk could be extended to remove some of the slack. Suel, Noel and Trendafilov [SNT04] suggest using weakened continuation hashes that allow extending matches with fewer bits than separate independent fingerprints. A decomposable hash function [SBB90] allows calculating hash values such that, when a chunk is divided into two, the fingerprint of one sub-chunk can be calculated from the fingerprints of the other sub-chunk and the parent chunk: h(right) = f(h(parent), h(left)). Suel, Noel and Trendafilov use a modified Adler-32 checksum where the bits that can be calculated from the parent and the sibling are masked out from being transmitted. This enables transmitting only one child fingerprint on each communication round and enables differencing with a smaller chunk size that would require too much fingerprint data using stronger fingerprints. When there are multiple contiguous matching chunks, Suel, Noel and Trendafilov suggest that fewer than log₂ n bits are needed for a fingerprint because modifications are assumed to be spatially close. An algorithm should expand matching intervals incrementally, instead of expanding aggressively and attempting to filter out invalid candidates. They also note that processor computation speed may become the bottleneck, and thus even though less data is transmitted, synchronization may complete slower.
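As a toy illustration of the decomposability property (not the modified Adler-32 construction of [SNT04]), the C++ sketch below uses a plain byte-sum checksum: because the sum over a parent chunk equals the sum over its two halves, the fingerprint of the right half can be derived from the fingerprints of the parent and the left half without reading the right half's data at all.

    #include <cstdint>
    #include <iostream>
    #include <string>

    // A toy decomposable fingerprint: the sum of all bytes modulo 2^32.
    std::uint32_t sumHash(const std::string& s) {
        std::uint32_t h = 0;
        for (unsigned char c : s) h += c;            // wraps modulo 2^32
        return h;
    }

    int main() {
        std::string parent = "differential compression";
        std::string left   = parent.substr(0, parent.size() / 2);
        std::string right  = parent.substr(parent.size() / 2);

        std::uint32_t hParent = sumHash(parent);
        std::uint32_t hLeft   = sumHash(left);

        // h(right) derived from h(parent) and h(left) without hashing the right half.
        std::uint32_t hRightDerived = hParent - hLeft;

        std::cout << std::boolalpha
                  << (hRightDerived == sumHash(right)) << "\n";   // prints true
    }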

Another approach, called half-block alignment, was presented by Irmak, Mihaylov and Suel [IMS05]. The idea is that when a non-matching chunk is found next to a matching chunk, it might be possible to extend the match halfway into the non-matching chunk, even though the complete chunk does not match. Also, when there are two non-matching chunks, there is a possibility of finding a match of two halves beginning from the latter half of the first non-matching chunk. Before reconciling fingerprints, the string is divided into half-length chunks and the fingerprint size is also halved. The remaining halves are not requested when the matching other halves of the fingerprints belong to contiguous sequences of matches. With half-size fingerprints, one match is not enough to provide confidence because the fingerprint is too weak. The algorithm considers at least two consecutive matches as sufficiently reliable. Irmak, Mihaylov and Suel also suggest that other sub-divisions are possible, but not likely to result in a significant improvement in compression.

When single-round communication is required with comparison-based fingerprint differencing, it is possible to use fewer bits for communication by emulating multi-round communication. Using only a single communication round, a differencing algorithm sends additional information that can be used to reconstruct missing data up to some pre-chosen threshold amount. This is based on the assumption that there are at most d differing chunks.

Irmak, Mihaylov and Suel [IMS05] presented a method of using erasure codes to encode error recovery information, allowing recovery of a maximum of d intentionally discarded fingerprint values by sending additional bits. The idea is similar to error correction, where extra fingerprints are calculated to allow recovering up to d missing fingerprints, except that now the fingerprints are intentionally excluded. An initial message (a set of fingerprints) from the other host can be used to estimate d and to choose k. A host partitions its string using fixed-width chunks as it normally would for recursive differencing and calculates the fingerprints. The version host then calculates 2k additional erasure fingerprints per recursion level, except for the root level, and transmits the root-level fingerprints together with the erasure fingerprints; the erasure fingerprints allow recovering the deliberately omitted fingerprints as long as at most k of them are missing on a level.

The algorithm of Irmak, Mihaylov and Suel [IMS05] emulates a recursive algorithm that sends all of the fingerprints, not just those whose parent fingerprint did not match. The algorithm thus requires sending more data than true recursive comparison, but less than single-round algorithms. During differencing the other host calculates descendant fingerprints for the matching chunks, and the erasure fingerprints are used to recover any of the missing fingerprints as recursion progresses deeper. If there are more than k differing parent chunks, recursion cannot continue and the algorithm stops recursing and requests content for the remaining chunks, regardless of the level the recursion had reached. When recursion is stopped the chunk size can remain large, but the algorithm is still able to complete differencing. This algorithm is communication efficient as defined earlier and requires sending fewer bits than, e.g., rsync. An emulated algorithm, however, cannot match the compression ratio of true multi-round differencing. The emulation has an advantage over true multi-round synchronization: erasure codes allow multicasting, where the same communication content can be used to synchronize multiple different reference strings, because the missing chunks need not always be the same ones.
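The erasure codes used by Irmak, Mihaylov and Suel are more general than the following toy C++ sketch, which uses a single XOR parity fingerprint and therefore only covers the case of one missing value; it is meant solely to illustrate how one extra transmitted value lets the receiver reconstruct any one deliberately omitted fingerprint from the surviving ones.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // XOR of all fingerprints acts as a trivial erasure code: if exactly one
    // fingerprint is missing, XOR-ing the parity with the surviving values
    // reproduces it.  Real schemes (e.g. Reed-Solomon) generalize this to d > 1.
    std::uint64_t parity(const std::vector<std::uint64_t>& fps) {
        std::uint64_t p = 0;
        for (std::uint64_t fp : fps) p ^= fp;
        return p;
    }

    std::uint64_t recoverMissing(const std::vector<std::uint64_t>& survivors,
                                 std::uint64_t parityFp) {
        std::uint64_t p = parityFp;
        for (std::uint64_t fp : survivors) p ^= fp;
        return p;                                    // equals the omitted fingerprint
    }

    int main() {
        std::vector<std::uint64_t> all = {0xdeadbeefULL, 0x12345678ULL, 0xabcdef01ULL};
        std::uint64_t parityFp = parity(all);

        // Pretend the middle fingerprint was never transmitted.
        std::vector<std::uint64_t> survivors = {all[0], all[2]};
        std::cout << std::hex << recoverMissing(survivors, parityFp) << "\n"; // 12345678
    }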

5 Empirical Comparison

Typically asymptotic bounds, including those given in this thesis, are based on evaluating the algorithms with random input strings. Average case analysis based on random content is problematic, because practical inputs are not random and contain repetitive patterns [YIS08]. Implementation specifics can also have a large effect on execution time, memory requirement and compression ratio [BSL02, You06]. Thus, we perform empirical evaluation of differential compression algorithms and present the results.

We implemented a subset of the algorithms discussed in this thesis to test how operating parameters affect the results. We tested our implementation and a number of existing popular implementations on various input samples to find out how the algorithms perform on current hardware. In section 5.1 we describe the specifications for the empirical comparison: the test data sets we chose, the algorithm implementations and their features, and the environment in which we ran the tests. Finally, in section 5.2 we present the measurement results and give our conclusions on them.

5.1 Test Specifications

We chose six different data sets for testing (see figure 31). For each data set, we concatenated all of the files into a single file by iterating through each directory and then each file in lexicographical order. Each data set is thus a pair of strings. First, we chose two large Linux kernel source code distributions to test with transposed repeats, extensive modifications and a large number of intra-repeats. The second data set, gcc, has two strings of a similar type, but not from the same software: we used gcc as the reference string and the Linux kernel source as the version string. The strings have many short repeats, such as C/C++ function declarations, but not many longer repeats as they do not share the same codebase. The third data set simulates a binary software update (patch) and consists of all executable and dynamic link library files from the 'system32' directory of the Windows 7 Pro 64-bit (English) operating system as the reference string, and the respective set of updated files from the Service Pack 1 version of the same operating system as the version string. As a reference, the Service Pack 1 distribution download package from Microsoft is 903 MiB in size.

Additionally, we included three synthetic tests with artificial strings. For the fourth data set we generated two distinct pseudo-random strings of 512 MiB each, using OpenSSL to generate the strings. For the fifth, 'easy', data set we used the Linux 3.5.3 source code as the reference string, and created a version string by performing a search and replace on the reference string, changing all substrings "unsigned" into "different". The modification replaced 258 958 substrings, fragmenting the version string into short repeats separated by the same intra-repeat. As the substring "different" already exists in the reference string, the optimal differencing result would thus be a sequence of copy directives, where each directive could use short relative offsets for position information. For the sixth data set we created transpositions by first dividing the 'easy' version string into chunks and then concatenating the chunks in reverse order. Each chunk starts with the word 'different' (except the first one). The version string thus does not contain any modifications other than transpositions, and an optimal differencing algorithm should match the whole version string as repeats and produce a sequence of copy directives.


Figure 31: Six data sets for evaluating differential compression algorithms’ performance.

The differential compression implementations we test cover the main differencing algorithms discussed in this thesis (see figure 32). For all implementations, we verified that the delta encoding can be used to reconstruct the terminus string. We used our own implementation, Differium, to test algorithms for which we could not find a public implementation. The first algorithm we implemented is Obst's greedy algorithm using dictionary searching (Differium/dict). Using 12-byte seed prefixes, we are able to optimally find all repeats longer than 11 bytes. We chose a dictionary size of 32459641 entries (a prime number) and used xdelta's fingerprinting function, as our tests showed it was faster than Rabin fingerprinting and produced fewer collisions. For the next algorithm (Differium/linear) we then limited the dictionary chain length to 10 entries, using a collision policy that removes the oldest entry when the chain length limit is reached. Our third algorithm (Differium/ST) uses a suffix tree to optimally find all repeats longer than four bytes from the strings. We also implemented a suffix array algorithm, denoted as Differium/SA, which is similar to Differium/ST except for the array in place of a suffix tree. We implemented a simple delta encoding method using the VCDiff integer encoding described in section 3.5. Copy directive positions are encoded relative to the previous copy directive, and we do not attempt to optimize by choosing from multiple possible repeats. Due to the inefficiency of the encoding method, we do not attempt to reference repeats shorter than four bytes. For secondary compression, we used 7-Zip to compress using the LZMA2 algorithm.
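Since Differium is not publicly available, the following C++ sketch only illustrates the general structure of seed-prefix dictionary differencing as described above: every 12-byte seed of the reference string is indexed by a fingerprint, each position of the version string is looked up, and candidate matches are extended greedily. The hash function, the unbounded standard-library table and the example strings are simplifications of our own, not the actual Differium/dict code.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // A repeat found between the two strings: version[vPos..) matches
    // reference[rPos..) for 'len' bytes.
    struct Match { std::size_t rPos, vPos, len; };

    // Dictionary-based greedy differencing with fixed-length seed prefixes.
    // Every p-byte substring of the reference is indexed by a hash of its seed;
    // each position of the version string is then looked up and matches are
    // extended byte by byte.  (Real implementations bound the table size and
    // chain length; std::unordered_multimap keeps every entry for simplicity.)
    std::vector<Match> dictionaryDiff(const std::string& ref, const std::string& ver,
                                      std::size_t p = 12) {
        std::unordered_multimap<std::uint64_t, std::size_t> table;
        auto seedHash = [p](const std::string& s, std::size_t i) {
            std::uint64_t h = 1469598103934665603ULL;          // FNV-1a as a stand-in
            for (std::size_t k = 0; k < p; ++k) {
                h ^= static_cast<unsigned char>(s[i + k]);
                h *= 1099511628211ULL;
            }
            return h;
        };
        for (std::size_t i = 0; i + p <= ref.size(); ++i)
            table.emplace(seedHash(ref, i), i);

        std::vector<Match> matches;
        std::size_t v = 0;
        while (v + p <= ver.size()) {
            std::size_t bestLen = 0, bestRef = 0;
            auto range = table.equal_range(seedHash(ver, v));
            for (auto it = range.first; it != range.second; ++it) {
                std::size_t r = it->second, len = 0;
                while (r + len < ref.size() && v + len < ver.size() && ref[r + len] == ver[v + len])
                    ++len;
                if (len > bestLen) { bestLen = len; bestRef = r; }
            }
            if (bestLen >= p) { matches.push_back({bestRef, v, bestLen}); v += bestLen; }
            else ++v;                                           // no repeat here, emit as insert
        }
        return matches;
    }

    int main() {
        std::string ref = "the quick brown fox jumps over the lazy dog";
        std::string ver = "a quick brown fox leaps over the lazy dog!";
        for (const Match& m : dictionaryDiff(ref, ver))
            std::cout << "copy " << m.len << " bytes from reference offset " << m.rPos << "\n";
    }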

Additionally, we included four other implementations. Open-vcdiff's⁵ algorithm [BeM99] is based on dictionary searching, but instead of searching for short seed prefixes, it divides a string into fixed-length non-overlapping chunks and uses the dictionary for the chunk fingerprints. The resulting delta output is compressed using an external third-party non-differential compressor [AGJ05]. The bsdiff⁶ algorithm is based on suffix array searching [Per03]. It uses Larsson and Sadakane's suffix sorting algorithm to construct two suffix arrays (one for each string). The bsdiff algorithm requires O(n) space and O((n + m) log n) time [Per03]. The xdelta⁷ algorithm is based on a chunk-comparison remote differencing algorithm, but applies it for local differencing using a short chunk length [Tri99, AGJ05]. The algorithm is designed to prioritize speed over compression ratio and has an O(n) time and space bound [Mac99]. The zdelta⁸ algorithm is modified from a non-differential compressor (zlib) and uses the deflate algorithm with a "process but do not emit" instruction [AGJ05]. Zdelta performs entropy encoding using Huffman coding. The algorithm has O(n) time and O(1) space bounds [AGJ05].

5 http://code.google.com/p/open-vcdiff/
6 http://www.daemonology.net/bsdiff/
7 http://xdelta.org
8 cis.poly.edu/zdelta/

For remote synchronization, we chose two implementations to compare against, in addition to implementing two ourselves. Both Rsync⁹ and Unison¹⁰ use a similar differencing algorithm. The algorithm divides the strings into fixed-length chunks and compares the chunks from the version string against every position of the reference string, using two-level fingerprinting. The algorithm has an O(n) time bound and an O(n/c) space bound (where c is the chunk size). Our recursive fingerprint comparison implementation (Differium/rec) uses a content-defined local-maxima chunk division function. We chose a horizon length of 250 based on empirical evaluation of average chunk sizes using various horizon lengths. The last of our implementations is based on interpolating the characteristic polynomial (Differium/cpi), and uses the algorithm described by Yan, Irmak and Suel [YIS08]. We used the NTL 5.5.2 number theory library for factoring the results. In addition to these differential compressors, we chose a non-differential compressor as a reference benchmark: 7-Zip 9.20 is used to compress the version string only, representing a use case where a user does not use differential compression.

9 http://rsync.samba.org/
10 http://www.cis.upenn.edu/~bcpierce/unison/

Figure 32: Summary of algorithm implementations included in empirical comparison.

Tests were run under the Windows Server 2008 R2 SP1 operating system, except for zdelta and open-vcdiff, which were run under Debian "wheezy" Linux. The compression tests were performed on a machine equipped with an Intel 2600K processor running at 5.0 GHz, 32 GiB of RAM and an OCZ Vertex3 MaxIOPS 128 GiB solid-state drive. For remote differential compression, the same machine was assigned both reference host and version host roles. For testing, power-saving features were disabled, forcing the processor cores to run at maximum clock frequency. We stopped all non-essential daemons and waited for one minute before starting each test. For each test, the machine was rebooted prior to testing to clear the processor and operating system caches. Three attempts were made for each test if the running time was less than five minutes, and the fastest result was chosen (any increase in execution time would be due to some external process). We implemented our algorithms using the C++ language and the Visual C++ 2010 SP1 compiler. The compilation was done using the "maximum speed" optimization option.

5.2 Measurement Results

We measured the size of the delta encodings produced by the differential compressor implementations (see figures 33 and 34). When a compressor had an option to produce delta output with a secondary compression algorithm, we enabled the option. We present results without secondary compression separately. The 'easy' test was the only one where local differential compressors could produce a smaller compression ratio than the 7-Zip compressor, even though 7-Zip does not have the advantage of having a reference string available. There are also big differences in results produced by similar algorithms. For example, both Unison and Rsync are based on the same algorithmic principle, but Unison's results are worse by a factor of ~4. Similarly, xdelta and open-vcdiff are both based on dictionary search of chunks, but they produce results that differ by up to a factor of ~40. We thus infer that implementation details are important in practice. On average, the most efficient local compressor was bsdiff, which is based on a suffix array algorithm. Using bsdiff, the Windows 7 Service Pack 1 could be stored using ~97% less space than it currently uses. Of the remote compressors, the most efficient was the Differium algorithm based on recursive fingerprint comparison. However, the advantage was due to Differium using the LZMA2 algorithm for secondary compression, not because Differium found fewer differences. Even though remote differencing includes slack, the results from Differium/rec are not much worse than those of the local differential compressors, except for the 'easy' data set. This again is the result of using efficient secondary compression (see figure 37). For reference, we measured the number of differing symbols by finding all repeats longer than 3 bytes using the Differium/SA algorithm (given as optimal/4 in the table). The optimal/4 results do not include any encoding directives and are not secondary compressed.

Figure 33: Compression ratios of delta encodings produced by the compressor implementations.

Figure 34: Compression ratios of delta encodings produced by the compressor implementations.

We failed to get results from our Differium/cpi implementation. Characteristic polynomial interpolation requires too much computation with our test data sets, as the number of differences is too large for an algorithm requiring cubic execution time. We stopped testing after waiting 10 hours. We tested the implementation separately with data specifically fabricated for the CPI algorithm. Differencing between two sets of 10⁶ fingerprints with 500 differences was executed in 47 seconds and required 4.6 kibibytes of data from the reference host to the version host, after which the version host had the required information to send the chunk substrings. Recursive fingerprint comparison would require 195 kB for fingerprints using three communication rounds, when assuming an even distribution of differences, 128-bit fingerprints and 130 bytes per fingerprint-level chunk (we do not account for chunk contents at all). Characteristic polynomial interpolation would be superior in cases where the number of differences is known to be low.

Differium/linear produced larger delta encodings because the limited chain length prevented it from matching as many repeats, but secondary compression almost eliminated the difference. Differium/ST produced identical encoding results to Differium/SA, as both are able to find repeats of any length. However, the suffix array version was faster and required less memory. Directives for the short copies found by Differium/SA did not compress as well as the fewer, longer copy directives produced by Differium/dict, which resulted in worse overall results for Differium/SA, even though the uncompressed delta encoding was shorter. To see how many repeats are missed due to Differium/dict being able to find only repeats longer than 11 bytes, we used our optimal suffix array searching algorithm to count the number of repeats by their lengths (see figure 35). In the Linux test there are 5484090 repeats shorter than 12 symbols, consuming ~36.4 mebibytes or ~8.5% of the string.

Figure 35: Number of repeats by length from Linux dataset (up to length of 100 symbols).

We measured the peak memory consumption and compression time for each compressor and test case (see figure 36). Of the public implementations, bsdiff produced the smallest compression ratio on average, but also requires a lot of memory. Note that our Differium implementations were made specifically for evaluating the algorithms and memory consumption was not optimized.

Figure 36: Execution time (in seconds) needed to compress a version string and generate a delta encoding, with memory requirements, of differential compressor implementations.

Some of the compressors had an option to disable secondary compression. We measured how much secondary compression lowers the compression ratio (see figure 37). We measured both uncompressed delta encoding size, and additionally compressed the uncompressed delta encodings with 7-Zip to see if any further size reduction was possible. Based on the results, we infer that secondary compression is essential to remote differencing.

Figure 37: Compression ratios of delta encodings after secondary compression. The left chart is for Linux dataset, and the right chart is for Windows dataset.

We tested various horizon sizes for the local-maxima chunk division function and measured both the average chunk size and the amount of slack for each block size (w); see figure 38. Slack decreases as block size decreases, but protocol overhead increases due to the increasing number of fingerprints. A block size of 125 led to the smallest total communication requirement for the Linux dataset, but the optimal size is content-dependent. We also measured slack for all datasets (see figure 39). Our Differium/rec implementation failed to find most repeats due to the long atomic unit of change.
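The following C++ sketch illustrates the local-maxima chunk division rule with a horizon parameter: a position becomes a cut point when its hash value is strictly greater than that of every position within the horizon on both sides, so boundaries depend only on local content and are not shifted by distant insertions. The per-position hash and the test data are illustrative stand-ins of our own, not the functions used by Differium/rec or by the implementations measured above.

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <string>
    #include <vector>

    // Cheap per-position value used only to rank positions; any hash of a small
    // window around the position would do.
    static std::uint64_t posValue(const std::string& s, std::size_t i) {
        std::uint64_t h = 14695981039346656037ULL;
        for (std::size_t k = i; k < i + 4 && k < s.size(); ++k) {
            h ^= static_cast<unsigned char>(s[k]);
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Local-maxima chunking: position i is a cut point when its value is strictly
    // larger than the value of every position within 'horizon' bytes before and
    // after it (ties suppress the boundary).
    std::vector<std::size_t> localMaximaCuts(const std::string& s, std::size_t horizon) {
        std::vector<std::uint64_t> v(s.size());
        for (std::size_t i = 0; i < s.size(); ++i) v[i] = posValue(s, i);

        std::vector<std::size_t> cuts;
        for (std::size_t i = horizon; i + horizon < s.size(); ++i) {
            bool isMax = true;
            for (std::size_t j = i - horizon; j <= i + horizon && isMax; ++j)
                if (j != i && v[j] >= v[i]) isMax = false;
            if (isMax) cuts.push_back(i);
        }
        return cuts;
    }

    int main() {
        std::mt19937 rng(1);
        std::string data(100000, '\0');
        for (char& c : data) c = static_cast<char>(rng());   // pseudo-random test content

        std::vector<std::size_t> cuts = localMaximaCuts(data, 250);
        std::cout << cuts.size() << " cut points, average chunk size ~"
                  << (cuts.empty() ? 0 : data.size() / cuts.size()) << " bytes\n";
    }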

Figure 38: Average chunk size in bytes (left), and delta encoding sizes (in MiB) produced by chunk division (right), measured for a range of local-maxima chunking block lengths.

Figure 39: Slack for tested datasets from Differium/rec implementation.

A hash function's speed and the uniformity of its hashes are relevant for both local and remote differencing. To test the uniformity of hashes, we measured the variance of dictionary chain lengths with a modified Differium/dict algorithm. The modification allows inserting fingerprints only from unique seed prefixes, thus eliminating all hash collisions caused by identical seed prefixes. We measured commonly used hash functions (see figure 40). Rsync's weak hash function causes significantly more hash collisions than the others. Additionally, we tested how much the two-level fingerprinting scheme improves synchronization speed on modern hardware. We used a SHA-1 implementation provided by Intel to compute 160-bit strong fingerprints and Rabin fingerprinting for 32-bit weak fingerprints. The SHA-1 implementation was faster than MD5 due to being optimized for the given hardware. Chunked hashing was done by computing hashes of 256-byte chunks, initializing the hash function for each chunk. Shingled hashing was performed by initializing the hash function at every byte position and computing (overlapping) hashes of 256-byte chunks. Our measurements indicate that the two-level fingerprinting scheme used in rsync is still advantageous if the network bandwidth is above ~40 kiB per processor core, per second. One core can compute ~9700 hashes per second, corresponding to 38.8 kiB of weak fingerprints per second.
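The two-level scheme can be sketched as follows in C++ (our own illustration; the Adler-style rolling sum stands in for rsync's weak hash and std::hash stands in for SHA-1/MD5): weak fingerprints of fixed-length chunks of one string go into a table, a rolling weak hash is slid over every position of the other string, and only positions whose weak fingerprint matches are verified with the expensive strong hash.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Adler-style rolling checksum over a window of length L: two running sums
    // that can be updated in O(1) when the window slides by one byte.
    struct RollingSum {
        std::uint32_t a = 0, b = 0;
        std::size_t len = 0;
        void init(const std::string& s, std::size_t pos, std::size_t L) {
            a = b = 0; len = L;
            for (std::size_t k = 0; k < L; ++k) {
                a += static_cast<unsigned char>(s[pos + k]);
                b += a;
            }
        }
        void roll(unsigned char out, unsigned char in) {
            a += in; a -= out;
            b += a; b -= static_cast<std::uint32_t>(len) * out;
        }
        std::uint32_t value() const { return (b << 16) | (a & 0xffff); }
    };

    int main() {
        const std::size_t L = 16;                       // chunk length
        std::string version   = "the quick brown fox jumps over the lazy dog, twice over";
        std::string reference = "lorem the quick brown fox jumps over a lazy dog ipsum";

        // Level 1: weak fingerprints of the version string's fixed-length chunks.
        std::unordered_multimap<std::uint32_t, std::size_t> weakTable;
        RollingSum rs;
        for (std::size_t pos = 0; pos + L <= version.size(); pos += L) {
            rs.init(version, pos, L);
            weakTable.emplace(rs.value(), pos);
        }

        // Level 2: slide a rolling weak hash over the reference string; only when a
        // weak fingerprint matches is the (expensive) strong hash computed.
        std::hash<std::string> strongHash;              // stand-in for SHA-1/MD5
        rs.init(reference, 0, L);
        for (std::size_t pos = 0; pos + L <= reference.size(); ++pos) {
            if (pos > 0) rs.roll(static_cast<unsigned char>(reference[pos - 1]),
                                 static_cast<unsigned char>(reference[pos + L - 1]));
            auto range = weakTable.equal_range(rs.value());
            for (auto it = range.first; it != range.second; ++it) {
                if (strongHash(reference.substr(pos, L)) == strongHash(version.substr(it->second, L)))
                    std::cout << "chunk at version offset " << it->second
                              << " found at reference offset " << pos << "\n";
            }
        }
    }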

Figure 40: Throughput of hash functions using single execution thread (left). Variance of chain lengths (right).

6 Conclusion

The main challenges in differential compression are long strings, recurring patterns with an uneven distribution of symbols, granularity of changes, and transpositions. Information collected from long strings cannot fit into memory, and discarding information makes it impossible to find all repeats. Dense changes prevent using a longer atomic unit of change, which in turn prevents saving memory by summarizing information using fingerprints. A transposition may relocate a substring far away from its original location, making matching it problematic when information from the complete string cannot be kept in memory. Solving the differencing problem is about developing heuristics that simulate optimal algorithms as closely as possible, while using the least amount of memory possible. Differential compression requires compromising and choosing between multiple trade-offs. There is a three-way trade-off between compression ratio, execution time and memory requirement.

There are two local differencing algorithms capable of perfect differencing by finding all repeats between two strings, and both have either a time or a space asymptotic bound that prevents using them with arbitrarily long strings. There is no known optimal differencing algorithm for remote differencing due to the long atomic unit of change. It is possible to solve the local differencing problem optimally either in O(n²) time and O(1) space, or in O(n) time and O(n) space. The greedy algorithm that compares all positions of one string against all positions of another string is too slow because of its O(n²) time bound. With remote differencing, shortening the atomic unit of change would be possible with a nearly communication-optimal algorithm, but the characteristic polynomial interpolation algorithm requires O(n⁴) time in the general case. The algorithm also has a linear memory requirement, making it unsuitable for arbitrarily long strings. For specialized use cases the algorithm can be the best choice, as its communication requirement does not depend on the string lengths.

The optimal greedy local algorithm can be approximated with a dictionary search algorithm that keeps repeat positions in the dictionary and finds the positions using a summary function. Information can be limited by discarding old information, refusing to collect new information, or collecting information from sparse intervals. Multiple heuristics have been developed for deciding which fingerprints to discard from the dictionary. A sliding window approach commonly used with non-differential compression can be used to limit memory requirements. Suffix tree algorithms can be substituted with suffix array algorithms, which require much less memory and are faster in practice. Suffix arrays are still large and have a linear asymptotic space bound, but in our empirical testing we got the smallest compression ratio from suffix array algorithms. Compared to dictionary searching, suffix array creation algorithms are complex and thus slower. Dictionary searching can find more repeats in low-memory situations. There are compressed suffix arrays, which require less space than the original string and which are currently being researched heavily. However, creating the arrays is computationally expensive. There is also an interesting algorithm based on projecting strings as discrete binary signals and using cyclic correlation to detect possible repeats. The algorithm is, however, not competitive with either dictionary or suffix searching, as it requires more memory than dictionaries and is slower than suffix searching.

Remote differencing is probabilistic, and there is always a small probability of undetected corrupted results. For remote differencing, the best algorithm in practice is recursive fingerprint comparison of chunks. Dividing the strings into chunks using a local-maxima content-defined division function makes the chunks resistant to alignment problems and does not require comparing chunks from one string against every position of another string. Recursion allows extending known long repeats into neighboring chunks and optimizing fingerprint sizes by initially using weaker fingerprints, as matches can be verified later. Due to round-trip latency, recursion should not be used with short strings, as several kibibytes could be transferred during the delay. Characteristic polynomial interpolation would be superior to fingerprint comparison algorithms when the number of differences is low, but the problem is that the number is not known beforehand. Estimating the number of differences requires transmitting sample fingerprints, which reduces communication efficiency. Estimation is also not reliable: under-estimating causes the algorithm to fail, while over-estimating causes a lot of unnecessary computation.

Choosing the most suitable algorithm for differencing and encoding depends on several variables, which implies that one algorithm cannot be the best choice for all cases. The variables should be evaluated empirically and the algorithms and heuristics chosen at runtime. We also propose that multiple differencing algorithms could be combined into a hybrid implementation. For example, an algorithm could optimistically use sparse dictionary searching to find long repeats, and then construct a suffix array of the remaining non-matched substrings. This potentially much shorter array could then be matched against the other string with dense granularity. Using just a suffix tree would not be possible due to its memory requirement, and using just dictionary searching would not find all repeats. For practical implementations, the implementation details are important. Implementations of the same algorithm can produce compression ratios that differ by a factor of 10. Avoiding random access to the strings is important, as random access can slow down an algorithm by a factor of 100. Unnecessary copying of unchanged data and the need for large temporary space can be avoided by using in-place reconstruction with a read-write conflict-free delta encoding.

Considering that the problem of differential compression is close to conventional compression, combining the compressors would likely produce a superior differential compressor. The hybrid compressor could operate as both a differential and a non-differential compressor, and use the same efficient non-differential algorithms for delta encoding. Compressing a completed delta encoding is not as efficient as using a specialized compression algorithm, as the compression predictor model can access the surrounding string context while differencing. The surrounding string is no longer available in the finished delta encoding. In our empirical testing, a non-differential compressor (7-Zip) produced smaller outputs in several cases, even though it did not access the reference string at all. Some of the heuristics discussed in this thesis are almost 10 years old, but we could not find any applications that implement them.

Even though optimal algorithms are known, there remain unsolved problems. It is not known whether an optimal chunk division algorithm with respect to slack exists [TBG06]. Methods for remotely estimating the number of differences could be improved, to allow general use of characteristic polynomial interpolation. It also has not been shown that an optimal differencing algorithm with a linear time bound and a constant space bound could not be created [ABF02]. The heuristics can also be improved. New succinct suffix array construction algorithms may allow using suffix searching in more general use scenarios in the future. We conclude by suggesting that current implementations could be much improved by implementing the heuristics discussed in this thesis and by using more efficient secondary compression, as currently non-differential compressors often produce better results.


List of References

AKO04 Abouelhoda, M. I., Kurtz, S., & Ohlebusch, E. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms, Volume 2, Issue 1, March 2004, pages 53–86.
Ada09 Adams, S. Chromium Blog: Smaller is Faster (and Safer Too). URL: http://www.chromium.org/developers/design-documents/software-updates-courgette, [2009, July 15]
Ste09 Adams, S. Chromium Projects: Software Updates: Courgette. URL: http://dev.chromium.org/developers/design-documents/software-updates-courgette, [2009]
AGJ05 Agarwal, R. C., Gupta, K., Jain, S., & Amalapurapu, S. An approximation to the greedy algorithm for differential compression. IBM Journal of Research and Development, Volume 50, Issue 1, January 2005, pages 149-166.
AAJ04 Agarwal, R. C., Amalapurapu, S., & Jain, S. An Approximation to the Greedy Algorithm for Differential Compression of Very Large Files. Technical report, IBM, 2004.
AgT06 Agarwal, S., & Trachtenberg, A. Approximating the number of differences between remote sets. Proc. Information Theory Workshop, March 2006, pages 217-221.
ACT06 Agarwal, S., Chauhan, V., & Trachtenberg, A. Bandwidth Efficient String Reconciliation using Puzzles. IEEE Parallel and Distributed Systems, November 2006, pages 1217-1225.
Bak93 Baker, B. S. On Finding Duplication in Strings and Software. Technical Report, 1993.
BMM99 Baker, B., Manber, U., & Muth, R. Compressing Differences of Executable Code. ACM SIGPLAN Workshop on Compiler Support for System Software, April 1999.
BaP98 Balasubramaniam, S., & Pierce, B. C. What is a File Synchronizer? Proc. 4th ACM/IEEE International Conference on Mobile Computing and Networking MobiCom'98, 1998, pages 98-108.
BeM99 Bentley, J., & McIlroy, D. Data Compression Using Long Common Strings. Proc. Data Compression Conference, March 1999, pages 287-295.
BBG06 Bjorner, N., Blass, A., Gurevich, Y. Content-Dependent Chunking for Differential Compression, The Local Maximum Approach. Technical Report, Microsoft Research, 2006.
Bla06 Black, J. Compare-by-Hash: A Reasoned Analysis. Proc. USENIX Annual Technical Conference, 2006, pages 85-90.
Bro93 Broder, A. Z. Some applications of Rabin's fingerprinting method. Proc. Sequences II: Methods in Communications, Security, and Computer Science, 1993, pages 143-153.
BuL97 Burns, L. A linear time, constant space differencing algorithm. Proc. Performance, Computing, and Communications Conference (IPCCC), February 1997, pages 429-436.
BSL02 Burns, R., Stockmeyer, L., & Long, D. D. In-Place Reconstruction of Version Differences. IEEE Knowledge and Data Engineering, July-August 2003, pages 973-984.
BuW94 Burrows, M., & Wheeler, D. A block sorting lossless data compression algorithm. Technical Report SRC-RR-124, HP Labs, 1994.
ChM06 Chedid, F. B., & G.Mouawad, P. On Compactly Encoding With Differential Compression. Proc. IEEE Computer Systems and Applications, 2006, pages 123-129.
CLR01 Cormen, L. R. Introduction to Algorithms, 2nd edition. MIT Press, ISBN 0-262-03293-7, 2001, Section 16.3, pages 385–392.
CoJ05 Cox, R., & Josephson, W. File Synchronization with Vector Time Pairs. Technical Report MIT-CSAIL-TR-2005-014, February 2005.
TBG06 Teodosiu, D., Bjørner, N., Gurevich, Y., Manasse, M., Porkka, J. Optimizing File Replication over Limited-Bandwidth Networks using Remote Differential Compression. Technical Report MSR-TR-2006-157, Microsoft Research, 2006.
TMS02 Trendafilov, D., Memon, N., Suel, T. zdelta: An Efficient Delta Compression Tool. Technical Report TR-CIS-2002-02, June 2006.
EsT05 Eshghi, K., Tang, H. K. A Framework for Analyzing and Improving Content-Based Chunking Algorithms. Technical Report HPL-2005-30R1, HP Labs, 2005.
Fim02 Fimbel, E. Edit distance and Chaitin-Kolmogorov difference. Technical Report ETSRT-2002-001, 2002.
GaJ79 Garey, M., & Johnson, D. Computers and Intractability: A Guide to the Theory of NP-Completeness. ISBN 0716710447, 1979.
GrV00 Grossi, R., & Vitter, J. S. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. Proc. 32nd ACM Symposium on the Theory of Computing, 2000, pages 397-406.
Hec78 Heckel, P. A technique for Isolating Differences Between Files. Communications of the ACM, Volume 21, Issue 4, April 1978, pages 264-268.
Hen03 Henson, V. An Analysis of Compare-by-hash. Proc. 9th Conference on Hot Topics in Operating Systems (HOTOS), Volume 9, July 2003, pages 13-18.
Hir75 Hirschberg, D. A Linear Space Algorithm for Computing Maximal Common Subsequences. Communications of the ACM, Volume 18, Issue 6, June 1975, pages 341-343.
HuM76 Hunt, J., McIlroy, M. An Algorithm for Differential File Comparison. Technical Report, AT&T Bell Labs, 1976.
Int10 Intel Corporation. Improving the Performance of the Secure Hash Algorithm (SHA-1). URL: http://software.intel.com/en-us/articles/improving-the-performance-of-the-secure-hash-algorithm-1/, March 2010.
KaR87 Karp, R., Rabin, M. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, March 1987, pages 249-260.
KoV02 Korn, D., & Vo, K.-P. Engineering a Differencing and Compression Data Format. Proc. USENIX Annual Technical Conference, 2002, pages 219-228.
KMM02 Korn, D., MacDonald, J., Mogul, J. The VCDIFF Generic Differencing and Compression Data Format. RFC 3284.
Kur99 Kurtz, S. Reducing the space requirements of suffix trees. Software – Practice and Experience, Volume 29, Issue 13, November 1999, pages 1149–1171.
KAO04 Kurtz, S., Abouelhoda, M., & Ohlebusch, E. Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, Volume 2, Issue 1, March 2004, pages 53-86.
Lan01 Langford, J. Multiround Rsync. Unpublished paper, November 2001.
Lar96 Larsson, N. Extended Application of Suffix Trees to Data Compression. Proc. Data Compression Conference, March 1996, pages 190-199.
LKT06 Lindholm, T., Kangasharju, J., & Tarkoma, S. Fast and Simple XML Tree Differencing by Sequence Alignment. Proc. ACM Symposium on Document Engineering, 2006, pages 75-84.
Lin07 Linhard, M. Data structure for representation of maximal repeats in strings. Diploma Thesis, Comenius University, Bratislava, 2007.
Lot08 Lothar, M. Delta Compression of Executable Code: Analysis, Implementation and Application-Specific Improvements. Master's Thesis, Münster University of Applied Sciences, November 2008.
ABF02 Ajtai, M., Burns, R., Fagin, R., & Long, D. D. Compactly Encoding Unstructured Inputs with Differential Compression. Journal of the ACM, Volume 49, Issue 3, May 2002, pages 318-367.
Mac99 MacDonald, J. Versioned File Archiving, Compression and Distribution. Technical Report, 1999.
Mad89 Madej, T. An application of group testing to the file comparison problem. Proc. 9th Distributed Computing Systems, June 1989, pages 237-243.
Mah05 Mahoney, M. Adaptive Weighing of Context Models for Lossless Data Compression. Technical Report CS-2005-16, Florida Tech, 2005.
MaM93 Manber, U., & Myers, E. Suffix Arrays: A New Method for On-Line String Searches. Proc. Annual ACM-SIAM Symposium on Discrete Algorithms, 1993, pages 319-327.
Mar79 Martin, G. Range encoding: an algorithm for removing redundancy from a digitised message. Proc. Video & Data Recording Conference, July 1979, pages 24-27.
Met91 Metzner, J. Efficient replicated remote file comparison. IEEE Transactions on Computers, Volume 40, Issue 5, May 1991, pages 651-660.
Met83 Metzner, J. J. A Parity Structure for Large Remotely Located Replicated Data Files. IEEE Transactions on Computers, Volume C-32, Issue 8, August 1983, pages 727-730.
MiM85 Miller, W., & Myers, E. W. A File Comparison Program. Software: Practice and Experience, Volume 15, Issue 11, pages 1025-1040.
MTZ03 Minsky, Y., Trachtenberg, A., Zippel, R. Set Reconciliation with Nearly Optimal Communication Complexity. IEEE Transactions on Information Theory, Volume 49, Issue 9, September 2003, pages 2213-2218.
MGC07 Motta, G., Gustafson, J., & Chen, S. Differential Compression of Executable Code. Proc. Data Compression Conference, March 2007, pages 103-112.
MCM01 Muthitacharoen, A., Chen, B., Mazières, D. A low-bandwidth network file system. Proc. Eighteenth ACM Symposium on Operating Systems Principles (SOSP'01), 2001, pages 174-187.
Mye86 Myers, E. An O(ND) Difference Algorithm and Its Variations. Algorithmica, Volume 1, 1986, pages 251-266.
Nis08 National Institute of Standards and Technology (NIST). Federal Information Processing Standards Publication 180-3, 2008.
NeW70 Needleman, S. B., Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, Volume 48, Issue 3, 1970, pages 443-453.
Obs87 Obst, W. Delta technique and string-to-string correction. Lecture Notes in Computer Science, 1987, pages 64-68.
Orl93 Orlitsky, A. Interactive communication of balanced distributions and of correlated files. SIAM Journal on Discrete Mathematics, Volume 6, Issue 4, November 1993, pages 548-564.
Orl90 Orlitsky, A. Worst-case interactive communication I: Two messages are almost optimal. IEEE Transactions on Information Theory, Volume 36, Issue 5, September 1990, pages 1111-1126.
Orl91 Orlitsky, A. Interactive communication: balanced distributions, correlated files, and average-case complexity. Proc. 32nd Annual Symposium on Foundations of Computer Science, October 1991, pages 228-238.
OrV02 Orlitsky, A., & Viswanathan, K. One-Way Communication and Error-Correcting Codes. Proc. IEEE International Symposium on Information Theory, 2002, page 394.
Igo12 Pavlov, I. LZMA2 algorithm. URL: http://www.7-zip.org/sdk.html
Per03 Percival, C. Naive differences of executable code. Unpublished paper, URL: http://www.daemonology.net/bsdiff/, 2003.
Per06 Percival, C. Matching with Mismatches and Assorted Applications. PhD Thesis, University of Oxford, 2006.
PCT11 Polk, T., Chen, L., Turner, S., Hoffman, P. Security Considerations for the SHA-0 and SHA-1 Message-Digest Algorithms. RFC 6194, March 2011.
Pyn95 Pyne, C. Patent No. 5,446,888 (also 5,721,907), United States, 1995.
Rab81 Rabin, M. O. Fingerprinting by Random Polynomials. Technical Report TR-CSE-03-01, Harvard University, 1981.
RaC01 Ramsey, N., & Csirmaz, E. An Algebraic Approach to File Synchronization. Proc. 8th European Software Engineering Conference held jointly with 9th ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2001, pages 175-185.
RaB03 Rasch, D., & Burns, R. In-Place Rsync: File Synchronization for Mobile and Wireless Devices. Proc. USENIX Annual Technical Conference (ATEC'03), 2003.
Chr91 Reichenberger, C. Delta storage for arbitrary non-text files. Proc. 3rd International Workshop on Software Configuration Management, June 1991, pages 144–152.
Ris76 Rissanen, J. Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development, Volume 20, Issue 3, May 1976, pages 198-203.
Riv90 Rivest, R. The MD4 message digest algorithm. RFC 1186, IETF, 1990.
Riv92 Rivest, R. The MD5 message digest algorithm. RFC 1321, IETF, 1992.
SWA03 Schleimer, S., Wilkerson, D. S., Aiken, A. Winnowing: Local algorithms for document fingerprinting. Proc. ACM SIGMOD International Conference on Management of Data, 2003, pages 76-85.
San72 Sankoff, D. Matching Sequences under Deletion/Insertion Constraints. Proc. National Academy of Sciences, Volume 69, Issue 1, January 1972, pages 4-6.
SBB90 Schwarz, T., Bowdidge, R. W., Burkhard, W. A. Low Cost Comparison of File Copies. Proc. 10th International Conference on Distributed Computing Systems, May 1990, pages 196-202.
ShK10 Shapira, D., & Kats, M. Bidirectional Delta Files. Proc. Data Compression Conference DCC-2010, 2010, pages 249-258.
ShS03 Shapira, D., & Storer, J. A. In-Place Differential File Compression. Proc. Data Compression Conference DCC-2003, 2003, pages 263-272.
ShS04 Shapira, D., & Storer, J. A. In-Place Differential File Compression of Non-Aligned Files With Applications to File Distribution, Backups, and String Similarity. Proc. Data Compression Conference DCC-2004, 2004, pages 82-91.
SXW10 Sheng, Y., Xu, D., & Wang, D. A Two-Phase Differential Synchronization Algorithm for Remote Files. Proc. 10th International Conference on Algorithms and Architectures for Parallel Processing, 2010, pages 65-78.
Ski06 Skibinski, P. Reversible data transforms that improve effectiveness of universal lossless data compression. PhD Thesis, Department of Mathematics and Computer Science, University of Wroclaw, 2006.
STA03 Starobinski, D., Trachtenberg, A., Agarwal, S. Efficient PDA Synchronization. IEEE Transactions on Mobile Computing, Volume 2, Issue 1, January 2003, pages 40-51.
SuM02 Suel, T., & Memon, N. Algorithms for Delta Compression and Remote File Synchronization. In K. Sayood, Lossless Compression Handbook, ISBN-13: 978-0126208610, January 2003.
SNT04 Suel, T., Noel, P., & Trendafilov, D. Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks. Proc. 20th International Conference on Data Engineering, 2004, pages 153-164.
TRL12 Tarkoma, S., Rothenberger, C., & Lagerspetz, E. Theory and Practice of Bloom Filters for Distributed Systems. IEEE Communications Surveys & Tutorials, Volume 14, Issue 1, 2012, pages 131-155.
Tic84 Tichy, W. F. The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, Volume 2, Issue 4, November 1984, pages 309-321.
Tri99 Tridgell, A. Efficient Algorithms for Sorting and Synchronization. PhD Thesis, Australian National University, February 1999.
Tri08 Tridgell, A. Rsync 3.0.0 what's new. URL: http://rsync.samba.org/ftp/rsync/src/rsync-3.0.0-NEWS
TrM96 Tridgell, A., Mackerras, P. The rsync algorithm. Technical Report TR-CS96-05, Australian National University, 1996.
Ukk85 Ukkonen, E. Algorithms for Approximate String Matching. Information and Control, Volume 64, Issue 1-3, January 1985, pages 100–118.
Ukk95 Ukkonen, E. Constructing Suffix Trees On-Line in Linear Time. Proc. IFIP 12th World Computer Congress on Algorithms, Software, Architecture – Information Processing, 1992, pages 484-492.
IMS05 Irmak, U., Mihaylov, S., Suel, T. Improved Single-Round Protocols for Remote File Synchronization. Proc. INFOCOM'05, 2005, pages 1665-1676.
VMG07 Välimäki, N., Mäkinen, V., Gerlach, W., Dixit, K. Engineering a Compressed Suffix Tree Implementation. Journal of Experimental Algorithms, Volume 14, December 2009, article no. 2.
ChT04 Chauhan, V., Trachtenberg, A. Reconciliation Puzzles. Proc. Global Telecommunications Conference GLOBECOM'04, November 2003, pages 600-604 (volume 2).
WaF74 Wagner, R. A., & Fischer, M. J. The String-to-String Correction Problem. Journal of the ACM, Volume 21, Issue 1, 1974, pages 168-173.
Wei73 Weiner, P. Linear Pattern Matching Algorithms. Proc. Switching and Automata Theory, 1973, pages 1-11.
YIS08 Yan, H., Irmak, U., Suel, T. Algorithms for Low-Latency Remote File Synchronization. Proc. 27th Conference on Computer Communications INFOCOM 2008, April 2008, pages 156-160.
Yao79 Yao, A. C. Some complexity questions related to distributive computing. Proc. 11th Annual ACM Symposium on Theory of Computing, 1979, pages 209–213.
You06 You, L. Efficient Archival Data Storage. PhD Thesis, University of California, 2006.
ZYR12 Zhang, H., Yeo, C., & Ramchandran, K. VSYNC: Bandwidth-Efficient and Distortion-Tolerant Video File Synchronization. IEEE Transactions on Circuits and Systems for Video Technology, Volume 22, Issue 1, January 2012, pages 67-76.
ZhX94 Zhang, Z., Xia, X.-G. Three messages are not optimal in worst case. IEEE Transactions on Information Theory, Volume 40, Issue 1, January 1994, pages 3-10.
ZiL77 Ziv, J., Lempel, A. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, Volume 23, Issue 3, 1977, pages 337–343.