Remote Synchronization Using Differential Compression


Date of acceptance: November 6th, 2012
Grade: ECLA
Instructor: Sasu Tarkoma

Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings

Jari Karppanen

Helsinki, September 20th, 2012
Master's Thesis
UNIVERSITY OF HELSINKI
Department of Computer Science

HELSINGIN YLIOPISTO − HELSINGFORS UNIVERSITET − UNIVERSITY OF HELSINKI

Faculty: Faculty of Science
Department: Department of Computer Science
Author: Jari Karppanen
Title: Lossless Differential Compression for Synchronizing Arbitrary Single-Dimensional Strings
Subject: Computer Science
Level: Master's thesis
Month and year: September 20th, 2012
Number of pages: 93 pages

Abstract

Differential compression allows expressing a modified document as differences relative to another version of the document. A compressed string requires space relative to the amount of changes, irrespective of the original document sizes. The purpose of this study was to answer which algorithms are suitable for universal lossless differential compression for synchronizing two arbitrary documents, either locally or remotely. The two main problems in differential compression are finding the differences (differencing) and compactly communicating the differences (encoding). We discussed local differencing algorithms based on subsequence searching, hashtable lookups, suffix searching, and projection. We also discussed probabilistic remote algorithms based on both recursive comparison and characteristic polynomial interpolation of hashes computed from variable-length content-defined substrings. We described various heuristics for approximating optimal algorithms, as arbitrarily long strings and memory limitations force discarding information. The discussion also included compact delta encoding and in-place reconstruction. We presented results from empirical testing using the discussed algorithms.

The conclusions were that multiple algorithms need to be integrated into a hybrid implementation, which heuristically chooses algorithms based on an evaluation of the input data. Algorithms based on hashtable lookups are faster on average and require less memory, but algorithms based on suffix searching find the fewest differences. Interpolating characteristic polynomials was found to be too slow for general use. With remote hash comparison, content-defined chunks and recursive comparison can reduce protocol overhead. A differential compressor should be merged with a state-of-the-art non-differential compressor to enable more compact delta encoding. Input should be processed multiple times to allow a constant space bound without significant reduction in compression efficiency. The compression efficiency of current popular synchronizers could be improved, as our empirical testing showed that a non-differential compressor produced smaller files without having access to one of the two strings.
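The content-defined chunking mentioned in the abstract can be made concrete with a short sketch. The Python fragment below is a minimal illustration, not the implementation evaluated in the thesis: it splits a byte string at positions where a Rabin-Karp-style polynomial rolling hash of the last `window` bytes matches a bit mask, so identical content produces identical chunk boundaries regardless of its byte offset. The window size, mask, length bounds, and function name are all illustrative assumptions; production systems typically use Rabin fingerprints over GF(2), but the boundary logic is the same.

```python
import hashlib
import os

def content_defined_chunks(data: bytes, window: int = 48, mask: int = 0x1FFF,
                           min_len: int = 2048, max_len: int = 65536):
    """Split data into variable-length chunks at content-defined boundaries.

    A boundary is declared where a rolling hash of the last `window` bytes
    matches `mask`, so identical substrings chunk identically no matter
    where they sit in the file (this is what sidesteps the alignment problem).
    """
    BASE, MOD = 257, (1 << 61) - 1      # polynomial rolling hash parameters
    pow_w = pow(BASE, window, MOD)      # BASE^window, used to expire old bytes
    chunks, start, h = [], 0, 0
    for i in range(len(data)):
        h = (h * BASE + data[i]) % MOD              # slide the new byte in
        if i - start >= window:
            h = (h - data[i - window] * pow_w) % MOD  # slide the oldest byte out
        length = i + 1 - start
        if (length >= min_len and (h & mask) == mask) or length >= max_len:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0                     # restart the window in the next chunk
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks

# Each side fingerprints its chunks; only chunks whose digests differ need to be sent.
data = os.urandom(1 << 18)                          # 256 KiB of sample input
fingerprints = [hashlib.sha1(c).digest() for c in content_defined_chunks(data)]
```

With remote hash comparison, each side computes such fingerprints independently, and, as the abstract notes, recursive comparison over coarser and finer chunk levels can reduce the number of hashes exchanged.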
ACM Computing Classification System (CCS):
E.4 [Coding and Information Theory]: Data compaction and compression, Error control codes
F.2.2 [Computation by Abstract Devices]: Nonnumerical Algorithms and Problems: Pattern matching

Keywords: synchronization, differential, compression, delta encoding, differencing, dictionary searching, fingerprint, patch, hashing, projective, set reconciliation, content-defined chunking, characteristic polynomial, CPI

Where deposited: Kumpula Science Library, serial number C-2012-

Contents

1 Introduction
2 Differential Compression
  2.1 Algorithmic Objective
  2.2 Differencing
  2.3 Delta Encoding
  2.4 Decompressing
  2.5 Measuring Performance
3 Local Differential Compression
  3.1 Subsequence Searching
  3.2 Dictionary Searching
    3.2.1 Almost Optimal Greedy Algorithm
    3.2.2 Bounded Hash Chaining
    3.2.3 Sliding Window
    3.2.4 Sparse Fingerprinting
  3.3 Suffix Searching
  3.4 Projective Searching
  3.5 Delta Encoding and Decoding
  3.6 Tolerating Superfluous Differences
4 Remote Differential Compression
  4.1 Probabilistic Method
  4.2 Fingerprint Collisions
  4.3 Chunking
  4.4 Alignment Problem
    4.4.1 Content-Defined Division Functions
  4.5 Differencing Fingerprints
    4.5.1 Fingerprint Comparison
    4.5.2 Characteristic Polynomial Interpolation
  4.6 Synchronization
  4.7 Optimizations
5 Empirical Comparison
  5.1 Test Specifications
  5.2 Measurement Results
6 Conclusion
List of References

For Nella.

1 Introduction

Often data needs to be synchronized with another version of the same data: two separate instances of similar data need to be made identical. After synchronization is complete, one of the instances has been replaced with data identical to the other instance, so that only one version of the data remains. For example, a file stored in the cloud (such as Dropbox or Google Drive) needs to be updated to a more recent version of the same file that has been modified; or a software application needs to be upgraded to a more recent version of the same application. Often the old data is very similar to the new version because only small sections of the data have changed.

The two versions of data to be synchronized might be stored on different physical computers, connected by a network. Copying the new version completely requires copying all of the unchanged data too, even though the unchanged data is already on the destination computer in the outdated version. Time and resources are wasted due to the unnecessary copying. Mobile devices utilizing wireless networks, especially, are often limited in transfer bandwidth and storage space. As mobile computing becomes more commonplace, data is being modified on multiple devices, and the versions on the devices need to be kept synchronized.

Instead of copying the new version of the data completely to replace the older version, differential compression (also called delta compression) allows transmitting or storing only the differences between the versions [ABF02]. Differential compressors can substitute sections of data in the new version that are already present in the older version with mere references to those identical sections, and thus exclude the common content from transmission. An optimal compressor needs to transmit or store only the information that is not present in the other version of the data. The intended target version of the data can be reconstructed from the other version and the difference information, as the sketch below illustrates.
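Concretely, a delta can be viewed as a sequence of copy commands, which reference byte ranges already present in the old version, and insert commands, which carry the literal bytes found only in the new version. The following Python sketch uses an assumed, simplified command format chosen for readability; actual delta encodings (the subject of Section 3.5) represent these commands far more compactly.

```python
def apply_delta(source: bytes, delta) -> bytes:
    """Reconstruct the target string from the source string and a delta.

    The delta is a list of commands (an illustrative format, not a real
    delta encoding):
      ("copy", offset, length) -- reuse `length` bytes at `offset` in source
      ("insert", literal)      -- bytes that exist only in the target
    """
    out = bytearray()
    for cmd in delta:
        if cmd[0] == "copy":
            _, offset, length = cmd
            out += source[offset:offset + length]  # shared content stays local
        elif cmd[0] == "insert":
            out += cmd[1]                          # only literals are transmitted
        else:
            raise ValueError(f"unknown delta command: {cmd[0]!r}")
    return bytes(out)

old = b"The quick brown fox jumps over the lazy dog."
# The new version changes one word; everything else is referenced, not resent.
delta = [("copy", 0, 10), ("insert", b"red"), ("copy", 15, 29)]
assert apply_delta(old, delta) == b"The quick red fox jumps over the lazy dog."
```

The differencing algorithms discussed in Chapter 3 differ mainly in how they find the matching sections that become copy commands: the fewer and longer the copies, the smaller the encoded delta.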