String Hashing for Collection-Based Compression


Andrew Gregory Peel
orcid.org/0000-0001-6710-7513

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

December 2015

Department of Computing and Information Systems
Melbourne School of Engineering
The University of Melbourne

Produced on archival quality paper

Abstract

Data collections are traditionally stored as individually compressed files. Where the files have a significant degree of similarity, such as genomes, incremental backup archives, versioned repositories, and web archives, additional compression can be achieved using references to matching data from other files in the collection. We describe compression using long-range or inter-file similarities as collection-based compression (CBC). The principal problem of CBC is the efficient location of matching data. A common technique for multiple string search uses an index of hashes, referred to as fingerprints, of strings sampled from the text, which are then compared with hashes of substrings from the search string. In this thesis, we investigate the suitability of this technique for CBC.

A CBC system, cobald, was developed which employs a two-step scheme: a preliminary long-range delta encoding step using the fingerprint index, followed by compression of the delta file with a standard compression utility. Tests were performed on data collections from two sources: snapshots of web crawls (54 Gbytes) and multiple versions of a genome (26 Gbytes). Using an index of hashes of fixed-length substrings of 1024 bytes, significantly improved compression was achieved: the web collection was compressed six times more than by gzip and three times more than by 7-zip, and the genome collection ten times better than by either gzip or 7-zip. The compression time was less than that taken by 7-zip to compress the files individually.
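The fingerprint-index technique described above can be illustrated with a minimal sketch. This is not cobald's actual design: the chunk length (16 bytes rather than the thesis's 1024), the hash function (BLAKE2b), and the dictionary layout are illustrative assumptions only.

```python
import hashlib

CHUNK = 16  # fingerprinted substring length (the thesis uses 1024 bytes)

def fingerprint(substring):
    """Illustrative fingerprint: an 8-byte BLAKE2b digest of the substring."""
    return hashlib.blake2b(substring, digest_size=8).digest()

def build_index(collection):
    """Index fingerprints of fixed-offset substrings, mapping to (file, offset)."""
    index = {}
    for name, data in collection.items():
        for off in range(0, len(data) - CHUNK + 1, CHUNK):
            fp = fingerprint(data[off:off + CHUNK])
            index.setdefault(fp, []).append((name, off))
    return index

def find_matches(index, new_data):
    """Slide over a new file; report positions whose fingerprint is indexed."""
    hits = []
    for off in range(len(new_data) - CHUNK + 1):
        fp = fingerprint(new_data[off:off + CHUNK])
        if fp in index:
            hits.append((off, index[fp][0]))
    return hits
```

In a real system the sliding side would use a recursive (rolling) hash so each position costs O(1) rather than O(CHUNK), and a candidate hit would be verified against the stored data and extended in both directions before emitting a copy instruction.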
The use of content-defined selection of substrings for the fingerprint index was also investigated, using chunking techniques. The overall result is a dramatic improvement in compression compared to existing approaches.

String hash functions are an essential component of the CBC system, but they have not been thoroughly investigated in the literature. As the fingerprint must remain unchanged for the lifetime of the collection, schemes such as universal hashing, which may require re-hashing, are not suitable. Instead, hash functions having randomness properties similar to cryptographic hash functions, but with much faster computation times, are preferred. Many functions in wide use are purported to have these properties, but the reported test results do not support conclusions about the statistical significance of the tests. A protocol for testing the collision properties of hash functions was developed that has a well-defined statistical significance measure, and it was used to assess the performance of several popular string hashing functions.

Avalanche is a property of hash functions that is considered to indicate good randomness. Avalanche is essential for good cryptographic hash functions, but the necessity of avalanche for non-cryptographic hash functions, which do not need to conceal their operation, has not been thoroughly investigated. This question was addressed indirectly by investigating the effect of function resilience when hashing biased data; the resilience of a function directly limits its degree of avalanche. A proof is given that a highly resilient function is optimal when hashing data from a zero-order source, while experiments show that high resilience is less advantageous for data biased in other ways, such as natural language text. Consequently, we conclude that a good hash function requires some degree of both resilience and avalanche.
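The content-defined selection mentioned above can be sketched with a simple rolling-hash chunker. This is an assumption-laden illustration, not the thesis's scheme: a polynomial rolling hash stands in for Rabin fingerprinting, and the window and mask sizes are arbitrary. The key property is that cut points depend only on local content, so an insertion or deletion disturbs only nearby chunk boundaries.

```python
# Minimal content-defined chunking sketch. The hash, window size, and mask
# are illustrative choices, not parameters from the thesis.
WINDOW = 8           # rolling-hash window in bytes
MASK = (1 << 6) - 1  # cut where the low 6 hash bits are zero: ~64-byte chunks
BASE = 257
MOD = (1 << 61) - 1

def chunk_boundaries(data):
    """Return cut positions chosen from content rather than fixed offsets."""
    h = 0
    pow_w = pow(BASE, WINDOW, MOD)
    cuts = []
    for i, byte in enumerate(data):
        h = (h * BASE + byte) % MOD          # push the new byte into the hash
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD  # drop the oldest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            cuts.append(i + 1)               # boundary after position i
    return cuts
```

Because each cut depends only on the preceding WINDOW bytes, prepending data to a file shifts the surviving boundaries without invalidating them, which is what makes content-defined selection robust for delta encoding.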
Declaration

I declare that: (i) this thesis comprises only my original work towards the PhD except where indicated in the Preface; (ii) due acknowledgement has been made in the text to all other material used; and (iii) the thesis is fewer than 100 000 words in length, exclusive of tables, maps, bibliographies and appendices.

Andrew Peel, December 3, 2015.

Preface

Chapter 3 and Chapter 4 were substantially previously published as:

Andrew Peel, Anthony Wirth and Justin Zobel. Collection-Based Compression using Discovered Long Matching Strings, in Proceedings of the Twentieth ACM Conference on Information and Knowledge Management (CIKM), pages 2361–2364, 2011.

None of the work described in this thesis has been submitted for other qualifications. None of the work described in this thesis was carried out prior to PhD candidature enrolment. Writing and editing of this thesis was undertaken solely by the author.

Acknowledgements

I am immensely grateful to my supervisors, Professor Justin Zobel and Associate Professor Tony Wirth. This thesis would not have been possible without their oversight and advice. Having taken over five years, it has placed an unusually high demand on their time and attention. They have guided my development as a scholar and their constant care and encouragement are greatly appreciated. I am also grateful for the support given by Associate Professor Andrew Turpin, the third member of my thesis committee.

My studies were financially supported by an Australian Postgraduate Award scholarship from the Australian Government, and a supplementary scholarship from NICTA, which is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. My travel to the CIKM and SPIRE conferences in 2011 was partially funded by a grant from Google Inc.
In addition, the University of Melbourne provided facilities and technical support for the duration of my studies.

Thanks to Pedro Reviriego, who provided his Matlab code to calculate the exact solution of the balls-in-bins problem. Thanks also to Halil Ali, who provided the Web Snapshots data collection.

I extend special thanks to Dr Kerri Morgan for introducing me to the mathematics of finite fields, for reading my paper drafts, and for her encouragement and friendship. I also thank Professor Tony Wirth for convening TRICS, a reading group in theoretical computer science, where I obtained a broad understanding of the field beyond the narrow focus of my PhD.

I thank the professional staff of the CIS department, especially Rhonda Smithies, Madalain Dolic and Leonora Strocio, for their assistance on many occasions. Also, the Engineering Faculty IT team, especially Thomas Weichert, who supported my computing needs.

I am also grateful for the experience of tutoring and lecturing an undergraduate subject for the CIS department, and the guidance and support of Professor Peter Stuckey, the previous lecturer.

Finally, I am immensely grateful to my family, Gilda, Lorenza and Giuliana. This thesis has taken substantially longer than anticipated, and they have waited patiently for me to finish before we embark on new plans. I would not have completed this thesis without their continual affection and support.

Contents

Abstract
Declaration
Preface
Acknowledgements
1 Introduction
  1.1 Problem Statement
  1.2 Aim and Scope
  1.3 Significance
  1.4 Thesis Overview
2 Literature Review
  2.1 String Hashing
    2.1.1 Notation
    2.1.2 Hash Functions
    2.1.3 Desirable Properties
    2.1.4 Ideal Hash Function
    2.1.5 Universal Hashing
    2.1.6 Balanced Hash Functions for Biased Keys
  2.2 Non-Universal Hash Functions for Strings
    2.2.1 Testing Hash Functions
  2.3 Recursive Hash Functions
  2.4 String Search
    2.4.1 Fundamental Algorithms for String Search
    2.4.2 The Karp-Rabin Algorithm
    2.4.3 Fingerprint Indexes for String Search
    2.4.4 Compressed Self-Indexes
  2.5 Delta Encoding
    2.5.1 Optimal Encoding
    2.5.2 Suboptimal Encoding with Minimum Match Length
  2.6 Content-Defined Selection and Chunking
  2.7 Delta Compression Systems
  2.8 Resemblance Detection
  2.9 Collection Compression
  2.10 Summary
3 Collection-Based Compression
  3.1 Overview of the CBC Scheme
  3.2 Detailed Description
  3.3 Independence of Recursive Hash Functions
  3.4 Complexity of the Encoding Algorithm
  3.5 Performance Evaluation
  3.6 Data Collections
  3.7 Results: Fixed-Offset Selection
  3.8 Content-Defined Chunk Selection
  3.9 Results: Content-Defined Selection
  3.10 Comparison With Other Systems
  3.11 Discussion
4 CBC with Compressed Collections
  4.1 Evaluation
  4.2 Results: Compressed Collections
  4.3 Results: CBC on a Compressed Collection
  4.4 Discussion
5 Hash Function Testing
  5.1 Black-Box Collision Test
  5.2 Validation of the Collision Test
  5.3 Results Without Bit-Selection
  5.4 Results With Bit-Selection
  5.5 Discussion
6 Hashing Biased Data
  6.1 Resilient Hash Functions on Biased Data
  6.2 Case Study of a Function Family
  6.3 Test For Effective Resilience