String Hashing for Collection-Based Compression

Andrew Gregory Peel

orcid.org/0000-0001-6710-7513

Submitted in total fulfilment of the requirements of the

degree of Doctor of Philosophy

December 2015

Department of Computing and Information Systems

Melbourne School of Engineering

The University of Melbourne

Produced on archival quality paper


Abstract

Data collections are traditionally stored as individually compressed files. Where the files have a significant degree of similarity, such as genomes, incremental backup archives, versioned repositories, and web archives, additional compression can be achieved using references to matching data from other files in the collection. We describe compression using long-range or inter-file similarities as collection-based compression (CBC). The principal problem of CBC is the efficient location of matching data. A common technique for multiple string search uses an index of hashes, referred to as fingerprints, of strings sampled from the text, which are then compared with hashes of substrings from the search string. In this thesis, we investigate the suitability of this technique to the problem of CBC.

A CBC system, cobald, was developed which employs a two-step scheme: a preliminary long-range delta encoding step using the fingerprint index, followed by compression of the delta file by a standard compression utility. Tests were performed on data collections from two sources: snapshots of web crawls (54 Gbytes) and multiple versions of a genome (26 Gbytes). Using an index of hashes of fixed-length substrings of length 1 024 bytes, significantly improved compression was achieved. The compression of the web collection was six times that achieved by gzip and three times that achieved by 7-zip, and the genome collection was compressed ten times better than by gzip or 7-zip. The compression time was less than that taken by 7-zip to compress the files individually. The use of content-defined selection of substrings for the fingerprint index, using chunking techniques, was also investigated. The overall result is a dramatic improvement in compression compared to existing approaches.

String hash functions are an essential component of the CBC system, but they have not been thoroughly investigated in the literature. As the fingerprint must remain unchanged for the lifetime of the collection, schemes such as universal hashing, which may require re-hashing, are not suitable. Instead, hash functions having randomness properties similar to cryptographic hash functions, but with much faster computation time, are preferred. Many functions in wide use are purported to have these properties, but the reported test results do not support conclusions about the statistical significance of the tests. A protocol for testing the collision properties of hash functions was developed having a well-defined statistical significance measure, and used to assess the performance of several popular string hashing functions.

Avalanche is a property of hash functions which is considered to indicate good randomness. Avalanche is essential for good cryptographic hash functions, but the necessity of avalanche for non-cryptographic hash functions, which do not need to conceal their operation, has not been thoroughly investigated. This question was addressed indirectly by investigating the effect of function resilience when hashing biased data. The resilience of a function directly limits the degree of avalanche. A proof is given that a highly resilient function is optimal when hashing data from a zero-order source, while experiments show that high resilience is less advantageous for data biased in other ways, such as natural language text.

Consequently, we conclude that a good hash function requires some degree of both resilience and avalanche.

Declaration

I declare that:

(i) this thesis comprises only my original work towards the PhD except where indicated in the Preface;

(ii) due acknowledgement has been made in the text to all other material used; and

(iii) the thesis is fewer than 100 000 words in length, exclusive of tables, maps, bibliographies and appendices.

Andrew Peel, December 3, 2015.

Preface

Chapter 3 and Chapter 4 were substantially previously published as:

Andrew Peel, Anthony Wirth and Justin Zobel. Collection-Based Compression using Discovered Long Matching Strings, in Proceedings of the twentieth ACM Conference on Information and Knowledge Management (CIKM), pages 2361–2364, 2011.

None of the work described in this thesis has been submitted for other qualifications. None of the work described in this thesis was carried out prior to PhD candidature enrolment. Writing and editing of this thesis was undertaken solely by the author.

Acknowledgements

I am immensely grateful to my supervisors, Professor Justin Zobel and Associate Professor Tony Wirth. This thesis would not have been possible without their oversight and advice. Having taken over five years, it has placed an unusually high demand on their time and attention. They have guided my development as a scholar and their constant care and encouragement are greatly appreciated. I am also grateful for the support given by Associate Professor Andrew Turpin, the third member of my thesis committee.

My studies were financially supported by an Australian Postgraduate Award scholarship from the Australian Government, and a supplementary scholarship from NICTA, which is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. My travel to the CIKM and SPIRE conferences in 2011 was partially funded by a grant from Google Inc. In addition, the University of Melbourne provided facilities and technical support for the duration of my studies.

Thanks to Pedro Reviriego, who provided his Matlab code to calculate the exact solution of the balls-in-bins problem. Thanks also to Halil Ali, who provided the Web Snap-shots data collection.

I extend special thanks to Dr Kerri Morgan for introducing me to the mathematics of finite fields, for reading my paper drafts, and for her encouragement and friendship. I also thank Professor Tony Wirth for convening TRICS, a reading group in theoretical computer science, where I obtained a broad understanding of the field beyond the narrow focus of my PhD.

I thank the professional staff of the CIS department, especially Rhonda Smithies, Madalain Dolic and Leonora Strocio, for their assistance on many occasions. Also, the Engineering Faculty IT team, especially Thomas Weichert, who supported my computing needs.

I am also grateful for the experience of tutoring and lecturing an undergraduate subject for the CIS department, and the guidance and support of Professor Peter Stuckey, the previous lecturer.

Finally, I am immensely grateful to my family, Gilda, Lorenza and Giuliana. This thesis has taken substantially longer than anticipated, and they have waited patiently for me to finish before we embark on new plans. I would not have completed this thesis without their continual affection and support.

Contents

Abstract
Declaration
Preface
Acknowledgements

1 Introduction
  1.1 Problem Statement
  1.2 Aim and Scope
  1.3 Significance
  1.4 Thesis Overview

2 Literature Review
  2.1 String Hashing
    2.1.1 Notation
    2.1.2 Hash Functions
    2.1.3 Desirable Properties
    2.1.4 Ideal Hash Function
    2.1.5 Universal Hashing
    2.1.6 Balanced Hash Functions for Biased Keys
  2.2 Non-Universal Hash Functions for Strings
    2.2.1 Testing Hash Functions
  2.3 Recursive Hash Functions
  2.4 String Search
    2.4.1 Fundamental Algorithms for String Search
    2.4.2 The Karp-Rabin Algorithm
    2.4.3 Fingerprint Indexes for String Search
    2.4.4 Compressed Self-Indexes
  2.5 Delta Encoding
    2.5.1 Optimal Encoding
    2.5.2 Suboptimal Encoding with Minimum Match Length
  2.6 Content-Defined Selection and Chunking
  2.7 Delta Compression Systems
  2.8 Resemblance Detection
  2.9 Collection Compression
  2.10 Summary

3 Collection-Based Compression
  3.1 Overview of the CBC Scheme
  3.2 Detailed Description
  3.3 Independence of Recursive Hash Functions
  3.4 Complexity of the Encoding Algorithm
  3.5 Performance Evaluation
  3.6 Data Collections
  3.7 Results: Fixed-Offset Selection
  3.8 Content-Defined Chunk Selection
  3.9 Results: Content-Defined Selection
  3.10 Comparison With Other Systems
  3.11 Discussion

4 CBC with Compressed Collections
  4.1 Evaluation
  4.2 Results: Compressed Collections
  4.3 Results: CBC on a Compressed Collection
  4.4 Discussion

5 Hash Function Testing
  5.1 Black-Box Collision Test
  5.2 Validation of the Collision Test
  5.3 Results Without Bit-Selection
  5.4 Results With Bit-Selection
  5.5 Discussion

6 Hashing Biased Data
  6.1 Resilient Hash Functions on Biased Data
  6.2 Case Study of a Function Family
  6.3 Test For Effective Resilience
  6.4 Validation of the Resilience Test
  6.5 Hashing English Text
  6.6 Resilience of Popular Hash Functions
  6.7 Discussion

7 Conclusions
  7.1 Thesis Outcomes
  7.2 Further Work

Bibliography

List of Figures

1.1 Conceptual illustration of delta compression

2.1 SpookyHash function inner loop
2.2 MurmurHash3 function inner loop
2.3 Illustration of the phased offset problem
2.4 Content-defined selection methods

3.1 Functional diagram of the collection-based compression scheme
3.2 Encoding example
3.3 Encoding example with data structure values
3.4 Index data structure
3.5 Encoding effectiveness with empty and filled indexes
3.6 Compression with a range of n-gram lengths
3.7 The location of n-grams in content-defined chunks
3.8 Compression with a range of maximum chunk lengths
3.9 Observed distribution of chunk lengths
3.10 Encoding time with a range of maximum chunk lengths

4.1 Functional diagram of the collection compression scheme
4.2 Example of encoding and inserting two files into a collection
4.3 Sizes of compressed delta files
4.4 Cumulative compressed collection size
4.5 Cumulative compressed collection size with various parameters
4.6 Collection compression times
4.7 Encode and insert times with various parameters
4.8 Maximum memory used for collection compression
4.9 Maximum compression memory with various parameters
4.10 Linearity of the CBC encode time with input file size
4.11 Linearity of CBC memory use with the size of the collection

5.1 Probability distribution of LLPS for 256 balls into 128 bins
5.2 Relation between balance and unused bins
5.3 Collision test statistical power for several numbers of trials
5.4 Derived function hash value distributions
5.5 Results for the test without bit-selection
5.6 Results for the test using bit-selection
5.7 Comparison of observed LLPS with and without bit-selection

6.1 Representative values of unbalanced fraction, $f_{j,t}$
6.2 Biased symbol probability distributions from values of α
6.3 Results of the resilience test
6.4 Hashing English text with resilient functions
6.5 Resilience of popular hash functions

List of Tables

2.1 Comparison of content-defined selection methods

3.1 Summary of data collections
3.2 Encoding effectiveness with empty and filled indexes
3.3 Compression with a range of n-gram lengths
3.4 Resource use by cobald
3.5 Encoding effectiveness with various maximum chunk lengths
3.6 Resource use with content-defined selection
3.7 Compression by cobald compared with other delta encoders
3.8 Maximum memory used by cobald and other delta encoders
3.9 Compression time by cobald and other delta encoders
3.10 Decompression time by cobald and other delta encoders

4.1 Configurations of cobald in the collection compression experiments
4.2 Sizes of the compressed collections
4.3 Compressed collection improvement due to cobald

5.1 Probability distributions of balls into bins
5.2 Collision test critical regions with several numbers of trials
5.3 Results for the test without bit-selection
5.4 Results for the test using bit-selection

6.1 Function case study: mapping of elements of $\mathbb{F}_8$
6.2 Function case study: hash probability distribution
6.3 Function case study: function pre-images
6.4 Function case study: symbol counts in pre-images of $f_0$
6.5 Function case study: 2-gram counts in pre-images of $f_1$
6.6 Example: number of unbalanced keys
6.7 Standard deviation of key counts vs $k$
6.9 Degree of bias of symbol distribution for the test validation
6.10 Results of the resilience test

List of Algorithms

1 FNV-1a hash function
2 Non-recursive hash by general polynomial division
3 Recursive hash by general polynomial division
4 The Karp-Rabin string search algorithm
5 String search using a sampled fingerprint index
6 Greedy algorithm for delta encoding
7 Encoding with fixed-offset sampled fingerprint index
8 Encoding with content-defined sampled fingerprint index

Chapter 1

Introduction

Efficient storage of large data collections is a pressing problem. While advances in storage technology continue to reduce the cost of data storage, these savings are currently outpaced by the demand for additional storage. As a result, the need for data storage is an increasingly restrictive constraint on data-intensive endeavours. This thesis is an exploration of novel approaches to repository-scale compression.

To illustrate, consider the field of genomics, which is the original motivating application for this thesis. Since 2000, when the first sequence of the human genome was completed, sequencing technology has advanced extremely rapidly, and now a human genome can be sequenced for less than US$5 000 (Hayden, 2014). Enabled by these technological advances, a deluge of data is being generated and stored in scientific collections of genome sequences from numerous species and from thousands of individual humans. Furthermore, some leading clinics¹ now sequence the genomes of patients for routine medical diagnosis. The difficulty of storing these huge and rapidly growing collections is widely recognised (Schatz and Langmead, 2013). Researchers, having filled their data stores, are having to delete old data in order to accommodate new data, and are no longer able to store some intermediate data (Dublin, 2011).

The total size of some genomic data collections is vast. For example, the 1000 genomes project² occupied 464 Tbytes³ in 2013. A human genome, having slightly more than 3.3 billion base pairs with each pair encoding two bits of information, can be represented as approximately 750 Mbytes of data. In practice, an individual genome occupies many gigabytes as additional metadata associated with each base pair is stored. If we conservatively estimate that an individual genome is 1 Gbyte in size, the genomes of 1 million people occupy approximately 1 petabyte (PB, $10^{15}$ bytes), costing at least US$1.5 million to store for 5 years.⁴ Moreover, the rapid advance of storage technology is not expected to solve the problem. The density of magnetic storage on hard disks increases approximately 40% per year, and alternative storage media are unlikely to surpass its cost effectiveness before the year 2020 (Kryder and Kim, 2009). While this is a remarkably rapid rate of progress, substantially faster than Moore's law (the observation that the density of transistors on an integrated circuit doubles every two years), it has been outpaced by improvements in the cost of genomic sequencing.

¹ For instance, the Mayo Clinic medical genome facility, http://mayoresearch.mayo.edu/center-for-individualized-medicine/medical-genome-facility.asp

Data compression has an essential role in managing the genome data storage problem, and the large public genome collections typically use standard compression tools. For example, the 1000 genomes project and the Ensembl project⁵, a database of annotated vertebrate genomes, publish their genome collections as individually compressed gzip files. However, while compression of individual files is a mature technology, there is scope to obtain greater compression by using additional information about the files in the collection. The key source of additional information of relevance to this thesis is that genomes are similar to one another; for the human genome, individuals typically differ in approximately 0.4% of their base pairs (Tishkoff and Kidd, 2004). Hence, a genome can be encoded as the difference from a reference genome, such as by the scheme of Kuruppu et al. (2012), to obtain improved compression.

Genomics is not the only domain with high similarity between files in a collection, and in this thesis we pursue a general approach, rather than developing a special-purpose scheme tailored to each type of data we may encounter. Other sources of highly similar data include regular archiving of data that is continuously expanded and revised over time, such as web sites, repositories of computer source code and executables, and archives of email and other business documents that must include regular snap-shots in order to satisfy corporate regulations and enable legal discovery. A well-known instance from the web is the Wikimedia archive,⁶ which includes the text of Wikipedia. The monthly dumps are highly similar, differing by the changes to Wikipedia content since the previous dump.

² http://www.1000genomes.org
³ Throughout this thesis, data quantities are expressed in binary units: Kbytes = $2^{10}$ bytes, Mbytes = $2^{20}$ bytes, Gbytes = $2^{30}$ bytes and Tbytes = $2^{40}$ bytes. Where powers of 10 are used, this will be indicated explicitly, and written as KB ($10^3$ bytes), MB ($10^6$ bytes), and so on.
⁴ Assuming each genome is uncompressed. The Google cloud storage rate is US$0.026 / (Gbyte month) on 14 August, 2014. From https://developers.google.com/storage/pricing
⁵ http://www.ensembl.org
⁶ http://meta.wikimedia.org/wiki/Data_dumps

In August 2013, the archive was 30 Tbytes of compressed files, uncompressing to 4 to 5 times larger. Each monthly dump adds a further 1.8 Tbytes to the archive. However, the similarity between versions is not exploited to reduce the storage needs; files on the Wikimedia archive are individually compressed with 7-zip or bzip2.

Traditional, single-file compressors, such as gzip, bzip2, and 7-zip, achieve compression by finding similarities between two or more parts of the data, and encode later occurrences in terms of the earlier ones. However, they are only able to identify similar data that is close to the first occurrence, as they process the input file in blocks, or in a sliding window, in order to limit their memory use. We refer to similar data that cannot be identified by a single-file compressor, either by being outside of the sliding window or block, or in separate files, as long-range matching data, and a collection of these files has long-range similarity.

We define compression using similar data from any location in a collection of files as collection-based compression (CBC).

1.1 Problem Statement

A CBC system aims to achieve improved compression by identifying matching data in a large collection. This presents a significant search problem, and it is clear that an index is essential: reading the entire collection each time a new file is compressed is not feasible. In light of this observation, a successful CBC system:

1. is able to compress an input string or file using information from any location in the collection;

2. does not read the entire collection for each compression operation, but instead uses an index to support the search for similarity;

3. achieves greater compression than traditional compression systems;

4. has time complexity linear in the size of the input file being compressed, and the compression time should be comparable to or faster than compression and decompression with traditional compression systems;

5. has memory complexity which is sublinear in the collection size, or has a constant memory limit with a gradual degradation in compression performance;

6. may have permanent supporting files, such as indexes, but they need to be small compared to the size of the compressed collection;

7. is able to delete a file from the collection in time sublinear in the collection size.

Compression of a genome by encoding the differences from a single reference genome does not meet our definition of CBC, as it does not satisfy the first point. Consider a collection of genomes compressed in this manner; the genomes are compressed with respect to the reference genome, and do not refer to any similar data from other genomes in the collection. A CBC system, as we have defined it, will achieve better compression when similar data exists in multiple files, but at the cost of indexing the entire collection, rather than just a single reference file.

Delta compression Many existing methods for compressing collections are based on delta encoding, the technique of encoding a file as a sequence of differences to a reference file. When a string in the input file has a duplicate in the reference file, it is replaced in the encoding, known as the delta file, by a pointer to the location in the reference file, and the length of the match. By also encoding duplicates that match strings occurring earlier in the input file, Hunt et al. (1998) developed delta compression. Figure 1.1 illustrates how delta compression could be used for CBC.

Figure 1.1: Conceptual illustration of delta compression. Sequential data is represented by the colour bars, with pink indicating non-matching data, and blue showing data that occurs multiple times. (a) A file containing a duplicate string. (b) Single-file compression, where the redundant data is encoded in terms of the matching data earlier in the file. The striped box represents the data encoding the blue box, and the white space following it represents the quantity of data which has been removed. (c) Multi-file delta compression, where parts of the input data are encoded in terms of matching data in collection files, as well as earlier in the input file.

To date, delta compression has not been used directly for CBC. There are several delta compressors available, such as xdelta, vcdiff, and hsadelta, which compress with respect to a single reference file, but compression using multiple reference files has, to our knowledge, not been pursued. Furthermore, the existing delta compressors read the reference file prior to reading the input file each time they compress a file, contrary to point 2 of our CBC requirements. Extending delta compression to perform CBC is the broad problem investigated in this thesis. We present a CBC system, named cobald, and use it to investigate several algorithms for indexing the collection files, and to demonstrate the effectiveness of CBC on suitable data.
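To make the delta-file representation concrete, the following sketch greedily encodes an input string against a single reference as a stream of COPY (offset, length) and ADD (literal) instructions. It is our own illustration: the instruction names, the 8-byte anchor length and the greedy matching are assumptions for the example, not the format of cobald, xdelta or vcdiff.

```python
# Toy delta encoder: COPY/ADD instruction stream against one reference file.
# Illustrative only; the anchor length K and the greedy extension strategy are
# assumptions for this sketch.

K = 8  # anchor length used to find candidate matches

def delta_encode(reference: bytes, target: bytes):
    # Index every position of each K-byte substring of the reference.
    index = {}
    for i in range(len(reference) - K + 1):
        index.setdefault(reference[i:i + K], []).append(i)

    instructions, literal, pos = [], bytearray(), 0
    while pos < len(target):
        candidates = index.get(target[pos:pos + K], [])
        if candidates:
            # Greedily extend the first candidate match as far as possible.
            start, length = candidates[0], K
            while (start + length < len(reference) and pos + length < len(target)
                   and reference[start + length] == target[pos + length]):
                length += 1
            if literal:
                instructions.append(("ADD", bytes(literal)))
                literal = bytearray()
            instructions.append(("COPY", start, length))
            pos += length
        else:
            literal.append(target[pos])
            pos += 1
    if literal:
        instructions.append(("ADD", bytes(literal)))
    return instructions

def delta_decode(reference: bytes, instructions) -> bytes:
    out = bytearray()
    for ins in instructions:
        if ins[0] == "COPY":
            _, start, length = ins
            out += reference[start:start + length]
        else:
            out += ins[1]
    return bytes(out)

if __name__ == "__main__":
    ref = b"the quick brown fox jumps over the lazy dog"
    tgt = b"a quick brown fox leaps over the lazy dog!"
    delta = delta_encode(ref, tgt)
    assert delta_decode(ref, delta) == tgt
    print(delta)
```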

Hashing There are many aspects of CBC that are worthy of detailed investigation, but we have given particular attention to the use of hashing in the search for matching strings. Early experience in this investigation showed us that match discovery was a key obstacle to effective CBC. A key step in the delta compression algorithm is identifying matching strings in the input and reference files. The existing delta compressors use variations of the Karp-Rabin string search algorithm (Karp and Rabin, 1987), the most efficient method of searching for multiple patterns simultaneously in a string, taking time proportional to the total length of the string and patterns. A subset of the strings in the collection are stored in an index, referenced by their hash value, and matching strings in the input can then be identified as they have the same hash. In many delta compressors, a rolling-hash function is employed, which efficiently finds the hash of a substring using the hash of the string at the previous position.
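As a concrete illustration of the rolling idea, the sketch below maintains a polynomial hash of a sliding n-byte window in constant time per shifted position. The base, modulus and window length are arbitrary choices for this example, not the particular recursive hash functions used by cobald.

```python
# Minimal polynomial rolling hash over a sliding window.
# The base B and modulus M are illustrative choices only.

B = 257            # base of the polynomial
M = (1 << 61) - 1  # a large prime modulus

def full_hash(window: bytes) -> int:
    """Hash computed from scratch: h = sum(b_i * B^(n-1-i)) mod M."""
    h = 0
    for byte in window:
        h = (h * B + byte) % M
    return h

def roll(h: int, outgoing: int, incoming: int, n: int) -> int:
    """Update the hash when the window slides one byte to the right."""
    h = (h - outgoing * pow(B, n - 1, M)) % M  # remove the contribution of the old byte
    return (h * B + incoming) % M              # shift and add the new byte

if __name__ == "__main__":
    data = b"collection-based compression with rolling hashes"
    n = 16
    h = full_hash(data[:n])
    for i in range(1, len(data) - n + 1):
        h = roll(h, data[i - 1], data[i + n - 1], n)
        assert h == full_hash(data[i:i + n])   # same value as hashing from scratch
```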

However, while hash-based search is the most efficient method for finding matching strings, there is evidence that the rolling-hash function is not a sensible choice for CBC. Lemire and Kaser (2010) showed that rolling-hash functions can be no better than pairwise independent, and may therefore have a higher collision rate in the hash index compared to an ideal random hash function. Excess collisions would slow the CBC operation, or perhaps reduce the compression achieved, depending on the implementation. We are unable to turn to universal hashing (Carter and Wegman, 1979), in which amortized constant index query time is obtained by the rare necessity of rehashing the entire index with a new hash function. In a CBC scenario, this involves reading the entire collection, which we consider to be impractical. Instead, a single hash function must be used for the lifetime of the collection. One solution, employed in data de-duplication systems, is to use a cryptographic hash function. However, these are substantially slower than rolling-hash functions.

An alternative group of hash functions that may be suitable are the non-universal, non-cryptographic hash functions. These aim to achieve randomness properties similar to cryptographic hash functions at faster speeds. Recent examples are CityHash, MurmurHash, and SpookyHash. However, these functions do not have theoretically proven randomness properties. While comparative testing of these hash functions (Estébanez et al., 2013) shows that they perform well, the tests only provide a relative measure of the uniformity of their hash distributions. We investigate which hash functions are appropriate for CBC, showing that the rolling-hash functions are suitable. We develop an alternative method of testing hash function randomness which gives an absolute measure, and use it to investigate how the randomness of the hash distributions obtained from several hash functions depends on properties of the input data.

Also, we investigate the theoretical properties that are necessary for a good hash function. While cryptographic hash functions exhibit good randomness, they are designed to satisfy additional requirements which are not needed for non-cryptographic purposes. Specifically, they are designed to obscure their operation from an attacker attempting to decrypt the ciphertext. Cryptographic hash functions are required to have several properties, including completeness, avalanche, and resilience (Carlet, 2010a,b), but whether these contribute to hash function randomness or obscurity is not clearly identified. We investigate the importance of function resilience to the randomness of the hash distribution, and its interaction with bias in the input data.

1.2 Aim and Scope

In view of the uncertainty concerning the appropriate hash function for CBC, and the dependence of the algorithms on the choice of hash function, this thesis focuses on the aspects of CBC relating to hashing. The research aim is

To investigate effective collection-based compression using string hashing.

Several research questions arise immediately, which guide the course of the investigation: What is an effective length for the hashed substrings? Which substrings should be inserted in the index? Which hash-based string search algorithm is most effective for delta encoding? How large should the hashes be? Are the rolling-hash functions safe for a large collection? Can the non-universal, non-cryptographic functions be used instead? How can the hash function be tested for randomness? The interdependence of these questions is an important aspect of this research investigation. For instance, inserting more substrings into the index allows more matching substrings to be found, achieving better compression of individual files.

However, the index will be larger, increasing the overall storage required for the collection. Increasing the number of hash bits stored in the index will reduce the number of collisions, perhaps mitigating any departure from randomness of the hash function, but again at the cost of a larger index. There are many more examples of dependencies and trade-offs that are described throughout this thesis.

While we pursue a general approach to CBC, there are some types of data which are not amenable to such compression, and are beyond the scope of this research. Multimedia data is a significant instance. While it might be expected that multiple photographs of the same scene will be very similar, the random noise inherent in photography, coupled with compression in common multimedia file formats, results in files with little byte-level similarity. Similar problems occur in video and audio recordings, and medical images.

Compression for transmission of data is not investigated, with attention limited to compression for storage. The transmission problem is related, and our CBC approach may be useful for transmission, but it requires additional assumptions and models concerning the synchronisation of collection data available to the sender and receiver that we do not pursue in this thesis.

While we report experimental results for delta file sizes, we are primarily concerned with the amount of data removed during the encoding process, and have not given attention to efficient compression of the encoded file. This was because efficient coding has a separate and extensive literature that is largely peripheral to the problems of string search and delta encoding. Furthermore, efficient formats already exist for delta files, and the delta files were subsequently compressed using a traditional single-file compressor, which removed at least part of the redundancy in our simply formatted delta files. Last, we assume that the data collections contain long-range similarity, usually as matching data between multiple files. The test collections used in experiments were selected from domains that generate data with this desired property.

1.3 Significance

This thesis makes several contributions. We have developed a CBC system, cobald, and demonstrated that CBC is feasible and provides a significant advantage over single-file compression using realistic data. We perform an empirical comparison of several algorithms and parameter choices for CBC, and identify those which achieve good compression. A second outcome is a rigorous test to evaluate whether a hash function is indistinguishable from the ideal random hash function on the given test data.

Currently, hash functions are usually evaluated by a relative comparison with other hash functions (Estébanez et al., 2013). A third outcome is a study of the necessity of function resilience for non-universal, non-cryptographic hash functions. We prove that resilient functions are optimal for a certain type of zero-order biased data, then use this result to develop an empirical test for function resilience. This is a significant contribution as the existing test regimes for non-universal, non-cryptographic hash functions do not have a test for resilience.

1.4 Thesis Overview

There are six further chapters in this thesis. In Chapter 2, we review the literature on string hashing, and algorithms employing hashing, from the perspective of CBC, including the desirable properties of hash functions, testing hash functions, delta encoding and compression, content-defined chunking and archive systems. In Chapters 3 and 4 we describe our CBC system, cobald, and explore the effectiveness of several algorithms and parameter choices by experimental comparisons. Chapter 3 focuses on the delta encoding operation when the collection is uncompressed, while Chapter 4 is concerned with the overall outcome of compressing entire collections, and performing CBC on a compressed collection. In Chapter 5 we present a statistical test for hash function randomness, and an experimental validation of its effectiveness. In Chapter 6 we study the effect of hash function resilience when hashing several classes of biased data. We prove that resilient functions are optimal for a certain type of zero-order biased data, then describe an empirical test for function resilience. Last, in Chapter 7 the conclusions of the study are presented, along with a description of the further research directions arising from it.

Chapter 2

Literature Review

The task of collection-based compression (CBC) can be divided into several distinct steps or processes which have been widely investigated in the literature.

The key processes are multi-pattern string search, primarily using the Karp-Rabin search algorithm, delta encoding and delta compression. The numerous variations of these techniques share a common thread: they employ string hashing in various ways. In this chapter, we describe these algorithms and review their limitations and trade-offs from the perspective of CBC.

One theme of this review is the behaviour of hash functions with biased data, and the effect that has on algorithms employing hashing. Hash-based algorithms typically work well in practice, but have poor worst-case bounds for a limited set of situations that occur with very low probability. We consider the mathematical properties of effective hash functions and the relationship between the function and bias in the data. A second theme is the evaluation of hash functions. How do we decide whether a hash function is suitable for our CBC system? The final major theme is the necessity of heuristic algorithms for several steps in the CBC scheme. This arises from the poor worst-case bounds of non-universal hashing, as well as from hard algorithms at higher levels, such as delta encoding and collection compression.

This review is structured in a bottom-up manner. We begin with a presentation of string hashing as it is the foundational technique for the algorithms described in the later sections, and familiarity with the probabilistic limitations of hashing is necessary to understand the operation of these higher level techniques. This is followed by string search using the Karp-Rabin algorithm, then the extensive literature on delta encoding and delta compression. Finally, the prior work on compression of large data collections is reviewed.

2.1 String Hashing

We begin our review at the ground level, as it is necessary to understand string hashing before we can move on to the algorithms which employ it, and also because the limitations of string hashing constrain their operation. After first introducing some notation and definitions, we consider the requirements for a good hash function in the context of the collection-based compression problem. The tension between randomness and determinism inherent in all practical hash functions is reviewed by considering the ideal hash function and why it is impractical. Then we describe the traditional approach of universal hashing, explaining why it is not suitable for CBC, before moving on to the mathematical properties of non-universal hash functions, giving a detailed description of the $t$-resilient functions, which are needed for the work in Chapter 6. We then review methods for evaluating hash functions using statistical tests. Finally, we describe recursive or rolling hash functions in detail as they are widely used in systems employing the Karp-Rabin string search algorithm that is described in later sections.

2.1.1 Notation

A string $S = s_1 s_2 \ldots s_D$ is a sequence of symbols, letters, or characters from an alphabet $\Sigma$, with $s_i \in \Sigma$ and $S \in \Sigma^D$, the set of all strings of length $D$. We can also write $s_i$, the $i$th element of $S$, as $S[i]$, and say that $s_i$ is at the $i$th position or location in $S$. A substring $Q$ of string $S$ is a string which occurs in $S$ and can be formed by removing elements from the beginning and end of $S$. A substring beginning at position $i$ and ending at position $j$ in $S$ is written as $Q = S[i, j]$. A substring is a special instance of a subsequence, a selection of symbols from $S$ in the order they occur in $S$, but which are not necessarily adjacent. For example, in the string “echidna”, “hid” is a substring, and “chin” is a subsequence but not a substring. In this thesis we are concerned primarily with substrings.

2.1.2 Hash Functions

A hash function $h : U \to V$ maps the universe of possible input data $U$ to a smaller set $V$. The domain may be either a string of fixed length, $U = \Sigma^D$, or a string of variable length up to a maximum size, $U = \cup_{k \le D} \Sigma^k$. The range is the set of natural numbers modulo $M$, $\{0, 1, \ldots, M - 1\}$. Typically, the $u \in U$ are referred to as keys and $v = h(u)$, $v \in V$, as the values, hashes, or fingerprints. The hash function is a many-to-one mapping, and the pairs $\{u_1, u_2 \in U;\ u_1 \ne u_2,\ h(u_1) = h(u_2)\}$ are called collisions. The set of all keys mapping to a single value, $h^{-1}(v)$, is the pre-image of $v$.
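As a small worked illustration of keys, values, collisions and pre-images (the toy function and key set are our own, chosen only for this example):

```python
# Toy illustration of keys, values, collisions and pre-images.
# The hash function and the key set are arbitrary choices for this example.

from collections import defaultdict

M = 4
def h(key: str) -> int:
    return len(key) % M              # a deliberately simple hash of the key

keys = ["echidna", "wombat", "possum", "quoll", "dingo"]

preimage = defaultdict(list)         # v -> h^{-1}(v), restricted to this key set
for k in keys:
    preimage[h(k)].append(k)

for v in range(M):
    print(v, preimage[v])
# "wombat" and "possum" (both length 6) collide, as do "quoll" and "dingo".
```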

2.1.3 Desirable Properties

Hash functions should be deterministic, have low computational cost, and exhibit uniformity and randomness, where the randomness is a form of pseudo-randomness that we clarify below. A deterministic hash function always gives the same output for the same input. A fast hash function is a necessity in applications involving frequent calculations, such as hashing each value of a large data set, or hashing all of the prefixes of a long string. A related aspect of the cost of a hash function is the rate of collisions. Collision resolution involves comparing the keys directly, which may incur significant cost, such as retrieval from external storage.

Informally, a hash function is said to be uniform if the probability of generating each value is equal. Uniformity depends both on the pre-images of the hash function and on the statistical properties of the key selection process. To relate this to a deterministic hash function, a balanced function has pre-images of equal size. It is readily apparent that a balanced hash function, $h$, hashing randomly selected keys will be uniform with $\Pr[h(u_{\mathrm{random}}) = v] = \frac{1}{M}$. A hash function obtaining a uniform distribution of values has the minimum expected number of collisions.

In practical situations, however, the input may not be randomly selected but is a highly biased subset $X \subset U$. To minimise collisions, the hash function should provide a uniform distribution of hashes over all $X$ that may be expected. To achieve this, the hash function needs an additional property loosely described as randomness, which is considered in depth in the next section. To illustrate that balance alone is insufficient, a balanced hash function with $M = 2^8$ can be made by setting the value equal to the first byte of the string. As the function is balanced, the hash distribution will be uniform when the input is randomly selected from all possible strings. But in nearly all practical uses the values will be highly biased. For example, randomly selecting words from an English dictionary will produce many more $v = $ ‘e’ than $v = $ ‘z’.
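The first-byte example is easily checked empirically; the random keys and the tiny word list below are illustrative assumptions, not data used in the thesis.

```python
# A balanced hash (here, the first byte of the key) is uniform on randomly
# selected keys but skewed on biased keys such as English words.
# The word list is a tiny illustration only.

import os
from collections import Counter

def first_byte_hash(key: bytes) -> int:
    return key[0]          # balanced: each of the 256 values has an equal-sized pre-image

random_keys = [os.urandom(8) for _ in range(10_000)]
english_keys = [w.encode() for w in
                "echidna emu eagle egret earth every zebra wombat wallaby whale water".split()]

print(Counter(first_byte_hash(k) for k in random_keys).most_common(3))
print(Counter(first_byte_hash(k) for k in english_keys))  # many keys hash to ord('e'), few to ord('z')
```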

2.1.4 Ideal Hash Function

A random oracle is an idealised hash function where the value for each key is randomly assigned (Bellare and Rogaway, 1993). When the oracle is asked for the value of a previously unseen key, it randomly selects a new value. Consequently, the expected value $h(u)$ of any key is uniformly distributed and the distribution of values over any subset of the input $X \subset U$ will be uniform. To appear deterministic, the oracle remembers all of its previous answers and provides the same answer to a previously requested key.

The random oracle has the useful property that the hash of a key is truly independent of all other keys. Knowledge of the value for some keys provides no information on the values for other keys (Bellare and Rogaway, 1993). However, construction of a random oracle is not practical. The previously seen keys must be stored and each request requires increasing time to search the list of previous keys. An alternative construction for an ideal hash function over a finite domain would be to completely fill the truth table at the beginning, requiring space proportional to the size of the domain, which is exponential in the number of bits in each key (Mitzenmacher and Vadhan, 2008).
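A lazy, memoised simulation makes the random-oracle idea concrete, and also shows why it is impractical: the table of previous answers grows with every distinct key queried. This is a sketch of the concept only, not a usable hash function.

```python
# A "random oracle" simulated by lazy memoisation: each new key receives a
# uniformly random value, and repeated queries return the stored answer.
# Memory grows with the number of distinct keys ever queried, which is why a
# real ideal hash function is infeasible.

import random

class RandomOracle:
    def __init__(self, m: int, seed: int = 0):
        self.m = m                      # size of the range V = {0, ..., m-1}
        self.memory = {}                # previously answered keys
        self.rng = random.Random(seed)

    def hash(self, key: bytes) -> int:
        if key not in self.memory:
            self.memory[key] = self.rng.randrange(self.m)
        return self.memory[key]

oracle = RandomOracle(m=2**16)
assert oracle.hash(b"echidna") == oracle.hash(b"echidna")   # deterministic on repeated queries
```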

For ideal hash functions over a finite domain, a distinction is made between random ideal hash functions and regular ideal hash functions (Bellare and Kohno, 2004). A random hash function is a random selection from all possible mappings $U \to V$, and many of these functions will not be balanced. Regular ideal hash functions are selected from the subset of mappings which are balanced.

While construction of the random oracle is random, it is a deterministic hash function. The probability distribution of values over $X \subset U$ is found by summing the probability of the keys in each pre-image. Observe, however, that many $X \subset U$ exist for which $h : X \to V$ is not balanced. An extreme example is when $X$ is a pre-image of $h$. These non-uniform outcomes cannot be eliminated, but may occur with a small but finite probability. So the behaviour of the random oracle on real input keys depends on the input data as well as the function. While a regular ideal hash function guarantees independence and balance, it can only achieve uniformity with some probability. Nonetheless, it is the best that can be done, and defines the benchmark for evaluating real hash functions.

A related observation is that, with an unbiased key set, all balanced hash functions have the same expected probability of a non-uniform hash distribution as a random allocation of balls into bins. So, when evaluating a real balanced hash function, we are concerned entirely with its performance on biased input data.

As we have already observed, a hash function cannot be uniform on all possible sets of biased keys, and we are only able to achieve high expected uniformity over a subdomain. In Section 2.1.6 we review mathematical properties of balanced hash functions, and consider their relation to biased keys, and in Section 2.2.1 we review empirical methods for evaluating the uniformity of hash functions on biased keys.

2.1.5 Universal Hashing

Data structures based on hashing are limited in the worst case by the possibility that hash collisions may be common in the input data. For instance, a hash table storing $N$ keys with no collisions requires a single access to retrieve any key, but, when all keys collide at a single hash value, $O(N)$ accesses are needed, which is no better than traversing a linked list. All hash functions are subject to this limitation, including an ideal hash function, as it is a consequence of the existence of collisions.

Universal hashing (Carter and Wegman, 1979) overcomes this limitation by selecting the hash function from a function family, achieving constant worst-case amortised time to query a separately-chained hash table. While universal hashing is not used in this thesis, we provide a brief description in order to explain why it is not applicable. Furthermore, it provides a context for key results regarding the effect of independence on the maximum number of collisions, which are needed in the investigation of hashing of biased data in Chapter 6.

When using universal hashing to fill a hash table, a function $h$ is randomly selected from a universal family of hash functions $\mathcal{H} = \{h : U \to V\}$ having the property that no pair of distinct input data are mapped to the same hash value by more than $\frac{1}{M}$ of the functions in $\mathcal{H}$. Specifically,

If a high number of collisions occurs during hash table construction, the entire input is re-hashed using another function randomly selected from . This pro- H cedure limits the worst-case behaviour, and the cost of re-hashing O(N) can be amortised over all functions from such that the expected amortised time per H query is O(1). Importantly, Equation 2.1 does not depend on the probability of x or y occurring, and universal hashing achieves the constant time bound regardless of the bias in the input data. Universality is a weaker criterion than pairwise independence, whereby, for any pair of distinct keys x, y, x = y and any hash values v , v , 6 1 2

Pr [h(x)= v h(y)= v ] = Pr [h(x)= v ] Pr [h(y)= v ] 1 ∧ 2 1 2 1 = . (2.2) M 2

Pairwise independent function families are described as strongly universal2 or 20 2.1. String Hashing

2-independent (Wegman and Carter, 1979). Generally, a k-independent or k- universal function family has the property that the hash values for any k keys are independent. For example, Wegman and Carter (1981) described the con- struction of a k-independent family using polynomials on a finite field, h(x) = ( k−1 a xi) (mod p) (mod M), where p is a prime U , and 0 < a < p. i=0 i ≥ | | i h P i The second modulo reduces the hash to the desired table size. When k = 1 and a0 = 0, h(x) is universal, but not 2-independent. Despite the theoretical elegance of universal hashing, there are hashing appli- cations where it is not used, principally because the cost of rehashing the entire data structure is too high. Fingerprint indexes of large data sets on external storage, which is the scenario of interest in this thesis, is one such application. Rehashing the index entails reading the entire collection, taking many hours or perhaps days. As it is not feasible to rehash in normal operation, the hash func- tion initially selected must be used for the entire lifetime of the data structure, and the worst case time bounds of universal hashing are no longer applicable. One remedy is to increase the size of the hash to accommodate the anticipated maximum size of the data structure as well as some additional bits as a precau- tion against the possibility of a non-uniform distribution of hashes. Also, care is taken to use a non-universal hash function which is likely to work well with similar data. A common approach is to use a cryptographic hash function and truncate it to the desired length. However cryptographic hash functions are very slow compared to non-cryptographic alternatives reviewed in the later sections. Universal hashing has been used to obtain theoretical bounds on the expected number of hash collisions which demonstrate the direct connection between inde- pendence and uniformity. One measure of uniformity is the length of the longest chain in a hash table, which is the number of keys hashing to the value having the most collisions. Consider a separately-chained hash table of size M storing Chapter 2. Literature Review 21

N keys, where the table size is increased as additional keys are inserted such that

$N$ keys, where the table size is increased as additional keys are inserted such that $M = \Theta(N)$. Using an ideal hash function, the worst-case query time is $O\!\left(\frac{\lg N}{\lg\lg N}\right)$ with high probability, meaning that

$$\Pr\left[l > \frac{(c + 1)\ln N}{\ln\ln N}\right] < \frac{1}{N^{c-1}} \qquad (2.3)$$

where $l$ is the length of the longest chain, and $c$ is a constant chosen to obtain the desired bound (Raab and Steger, 1998; Gonnet, 1981). However, this bound is not achieved by function families with low independence. Alon et al. (1999) demonstrated a 2-independent family for which a set of keys of size $N$ exists such that the maximum chain length is always $\Omega(\sqrt{N})$. Schmidt et al. (1995) showed

that $\Theta\!\left(\frac{\lg N}{\lg\lg N}\right)$-wise independence is sufficient to achieve the same bound as the ideal hash function. Pătrașcu and Thorup (2012) showed that a 3-independent hash function that achieves the ideal bound can be constructed using simple tabulation, in which small blocks of the input are hashed using a random lookup table, and then the block hashes are combined with a bit-wise exclusive-OR operation. A similar dependence of uniformity on independence has been described for linearly probed hash tables (Knuth, 1973; Pătrașcu and Thorup, 2012; Pagh et al., 2009). In summary, pairwise independence achieves some degree of uniformity, but more independence is needed to match the uniformity of the ideal hash function.
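The ideal-hash-function chain-length behaviour can be checked with a small balls-in-bins simulation, using a uniform random assignment as a stand-in for the ideal hash function; the parameters are arbitrary choices for this sketch.

```python
# Balls-in-bins simulation of the longest chain in a separately-chained hash
# table: N keys placed in M = N bins by an idealised (uniformly random) hash.
# Parameters are illustrative only.

import math
import random
from collections import Counter

def longest_chain(n_keys: int, n_bins: int, rng: random.Random) -> int:
    counts = Counter(rng.randrange(n_bins) for _ in range(n_keys))
    return max(counts.values())

rng = random.Random(1)
N = 1 << 16
observed = longest_chain(N, N, rng)
scale = math.log(N) / math.log(math.log(N))   # the Theta(lg N / lg lg N) scale of Equation 2.3
print(f"longest chain: {observed}, ln N / ln ln N = {scale:.1f}")
```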

We have already observed that uniformity depends on both the hash function and the input data. Mitzenmacher and Vadhan (2008) investigated the contribution to uniformity of randomness in the keys, motivated by the observation that universal hash functions nearly always perform better than predicted by worst-case analysis. They found minimum bounds on the input data entropy necessary to asymptotically achieve a similar worst-case query time to an ideal hash function, and observed a clear trade-off between independence and randomness in the input. Using Rényi entropy, a generalisation of the Shannon entropy, they found that, for a chained hash table, a 2-independent family requires approximately $3 \log N$ bits of entropy per key, and a 4-independent family requires $2 \log N$ bits. For linear probing, a 2-independent family requires $4 \log M$ bits, and a 4-independent family requires $2.5 \log M$ bits.

2.1.6 Balanced Hash Functions for Biased Keys

It was noted in Section 2.1.4 that an ideal hash function, in which the hashes are guaranteed to be mutually independent, requires exponential space. A practical hash function using polynomial space necessarily achieves less independence between hashes. On the other hand, it was also noted that any balanced hash function is expected to generate a uniform hash distribution with random input data. Clearly, the hash distribution depends on a combination of two sources of randomness: the variation in the data, and the pseudo-randomness of the hash function. If the combined randomness from these two sources is insufficient, a uniform hash distribution is unlikely to occur.

“Pseudo-randomness” is an informal term used here somewhat loosely. It highlights that a practical hash function cannot have true randomness as it is deterministic, and cannot have complete independence between the hash values. The term is adopted from a similar distinction made for pseudo-random sequences generated by deterministic functions. Here we review the mathematical properties of hash functions that are thought to contribute to their pseudo-randomness.

These concepts have been extensively investigated for cryptography. The essential goal of encryption is to transform a message (the plaintext) into a form (the ciphertext) that appears to be independent of the plaintext. The ciphertext is indistinguishable from a random string, but the encryption must be deterministic in order for the original plaintext to be decrypted. Shannon (1949), investigating this tension between randomness and determinism, observed that a good cipher requires both diffusion, to spread the information over as much of the ciphertext as possible, and confusion, to hide any correlations between the message and the ciphertext, and that diffusion and confusion are in opposition to each other within a single function. The framework of confusion and diffusion is a convenient viewpoint on the cryptographic literature for our investigation of non-cryptographic hash functions. Function properties relating to confusion seem unlikely to be of interest, as there is no need for a non-cryptographic hash function to hide its operation, while properties relating to diffusion will be desirable.

After introducing some notation and terminology, we describe several function properties and discuss their relevance to cryptographic and non-cryptographic functions. Much of this material is from the literature on cryptology as well as that on error-correcting codes. Our focus is on the relationship between the properties and the behaviour of the function on keys biased in certain ways. The cryptographic viewpoint considers biased keys from the point of view of mounting an attack on the function by hashing a carefully constructed key set. A deeper overview can be found in the reviews by Carlet (2010a,b). We are unaware of a review of the properties of these functions from a non-cryptographic perspective.

The function properties described below (resilience, non-linearity, avalanche) were first defined for Boolean functions, which have a single output taking the values $\{0, 1\}$, also written as $\mathbb{F}_2$, the finite field with 2 elements. The properties have been extended to vector Boolean functions, $\mathbb{F}_2^c \to \mathbb{F}_2^b$, $b \le c$, being a set of $b$ Boolean functions whose outputs are considered jointly. Vector Boolean functions are also referred to as S-boxes, or substitution boxes, in the cryptology literature.

When the number of output bits $b$ has a common divisor $d$ with the number of input bits $c$, the function may be represented as a univariate polynomial on the subfield $\mathbb{F}_q$ of $\mathbb{F}_{2^c}$ and $\mathbb{F}_{2^b}$, where $q = 2^d$, giving a function $\mathbb{F}_q^n \to \mathbb{F}_q^m$, $n = c/d$, $m = b/d$. This representation imposes an equivalence between $\mathbb{F}_2^d$ and $\mathbb{F}_{2^d}$, and requires a mapping between the elements of $\mathbb{F}_2^d$ and $\mathbb{F}_{2^d}$. Typically, the elements of $\mathbb{F}_{2^d}$ are represented by polynomials of degree $d - 1$ on an indeterminate variable in $\mathbb{F}_2$, and a simple mapping to $\mathbb{F}_2^d$ is obtained by interpreting the coefficient of the term with exponent $i$ as the value of the $i$-th vector Boolean function over $\mathbb{F}_2^d$.

For this investigation, the univariate polynomial representation is convenient as the input is grouped into symbols of $d$ bits in length, being elements of $\mathbb{F}_q$, which allows the expression of the input bias as a probability distribution over a larger alphabet. For example, $d = 7$ is sufficient to encode the ASCII alphabet. The representation restricts the hash value length to multiples of $d$ bits, but this is not a significant constraint in practice as the hash value can be truncated if a shorter hash is desired. It is useful to express hash function $h$ as a polynomial, or sum of products, known as algebraic normal form (ANF),

$$
\begin{aligned}
h(x) = {} & a_{0,0,\ldots,0} \\
& + a_{1,0,\ldots,0}\,x_1 + a_{0,1,\ldots,0}\,x_2 + \cdots + a_{0,0,\ldots,1}\,x_n \\
& + a_{2,0,\ldots,0}\,x_1^2 + a_{1,1,\ldots,0}\,x_1 x_2 + \cdots + a_{0,\ldots,1,1}\,x_{n-1}x_n + a_{0,0,\ldots,2}\,x_n^2 \\
& + \cdots \\
& + a_{q-1,q-1,\ldots,q-1}\,x_1^{q-1} x_2^{q-1} \cdots x_n^{q-1} \\
= {} & \sum_{i_1=0}^{q-1} \sum_{i_2=0}^{q-1} \cdots \sum_{i_n=0}^{q-1} a_{i_1,i_2,\ldots,i_n}\, x_1^{i_1} x_2^{i_2} \cdots x_n^{i_n},
\end{aligned}
\qquad (2.4)
$$

where $x_i \in \mathbb{F}_q$ is the $i$-th symbol of the input key, and the coefficients $a_{i_1,i_2,\ldots,i_n} \in \mathbb{F}_q$. The coefficients are identified by $n$ subscripts, each taking values from 0 to $q - 1$ matching the exponents of the variables in their corresponding term.

The algebraic degree of a function is the maximum of the sums of the exponents in each term. Specifically, any term may be written as $a \prod_{i=1}^{n} x_i^{b_i}$. The algebraic degree is the maximum of $\sum_{i=1}^{n} b_i$ over all terms with $a \ne 0$. For vector Boolean functions the maximum exponent is 1, so the degree is the number of variables in the longest term.

There is a complex relationship between the algebraic degree of a vector Boolean function and the degree of its univariate form. The vector Boolean degree can be obtained from the univariate ANF by finding the binary expansion of its exponents. The binary expansion of a number $j$ is a vector $(j_0, j_1, \ldots)$ such that $j = \sum_{s \ge 0} j_s 2^s$. See Carlet (2010b) for more detail on this relationship, but for our purpose it suffices to observe that the maximum exponent of a univariate variable, $2^d - 1 = q - 1$, coincides with the maximum possible degree, $d$, of its vector Boolean form.

A linear function has terms in its algebraic normal form containing a single variable with exponent equal to 1, as shown in the second line of Equation 2.4, and an affine function may also have a constant term. These functions may be written with simplified coefficient subscripts as $h(x) = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n$. The univariate form of a linear vector Boolean function, denoted as $\mathbb{F}_2$-linear, may have terms $a x^{2^j}$, $j \in \mathbb{N}$. As with the algebraic degree, this relates to the binary decomposition of the exponent, specifically that $2^j$ has a single element in its binary decomposition.
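For the binary case ($q = 2$), the ANF coefficients, and hence the algebraic degree, can be computed from a Boolean function's truth table with the binary Möbius (Reed-Muller) transform. The sketch below is our own illustration with an arbitrary example function; it is not taken from the thesis.

```python
# Algebraic normal form (ANF) and algebraic degree of a Boolean function on
# F_2^n, computed from its truth table by the binary Moebius transform.
# The example function is an arbitrary illustration.

def anf(truth_table):
    """truth_table[x] is f(x), with the bits of x giving (x_1, ..., x_n)."""
    a = list(truth_table)
    n = len(a).bit_length() - 1
    for i in range(n):
        for x in range(len(a)):
            if x & (1 << i):
                a[x] ^= a[x ^ (1 << i)]
    return a            # a[m] is the coefficient of the monomial whose variable set is the bits of m

def algebraic_degree(truth_table):
    return max((bin(m).count("1") for m, c in enumerate(anf(truth_table)) if c), default=0)

# Example: f(x1, x2, x3) = x1*x2 XOR x3 has algebraic degree 2.
f = [(((x >> 0) & 1) & ((x >> 1) & 1)) ^ ((x >> 2) & 1) for x in range(8)]
assert algebraic_degree(f) == 2
```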

Non-linearity A function which is not affine is non-linear, and there are several definitions in use for the degree of non-linearity. A simple measure is the algebraic degree of the vector Boolean function. Affine functions have degree 1, and non-linear functions have degree 2 or more, with a higher degree corresponding to more non-linearity. A more precise definition of the degree of non-linearity is the minimum Hamming distance between the function and all affine functions on the input (Carlet, 2010b). For cryptographic functions, non-linearity is necessary to counter the linear attack (Matsui, 1994). A large set of plaintext-ciphertext pairs is analysed to find linear functions on bits of the plaintext and ciphertext that have a high probability of predicting a bit of the encryption key. This involves constructing a linear model of the non-linear parts of the cipher, so a high degree of non-linearity in the cipher will increase the complexity of the model. Non-linearity is considered to contribute to confusion, so it seems that a high degree of non-linearity may not be necessary for non-cryptographic hash functions.

Resilience A Boolean function $h$ is known as $t$-resilient if it is balanced and has the property that, when any $t$ bits of the input are arbitrarily fixed, the sub-function of $h$ over all of the remaining $n - t$ free inputs is balanced (Chor et al., 1985). Resilient functions have the property of correlation immunity: every $t$-tuple obtained by fixing $t$ elements of the input $x$ is statistically independent of the output $h(x)$ (Siegenthaler, 1984). Resilience is invariant to permutations of the alphabet and bit-flips of the input (Meier and Staffelbach, 1990). The definition of $t$-resilient functions has been extended to functions over finite fields (Camion and Canteaut, 1999), and resilient functions have been characterised in terms of several other linear structures: matrices, Fourier transforms, orthogonal arrays (Gopalakrishnan and Stinson, 1995) and multipermutations (Camion and Canteaut, 1999).

Definition 1 ($t$-resilient function over a finite field). A function $h : \mathbb{F}_q^n \to V$, where $V$ is a finite set, is $t$-resilient over $\mathbb{F}_q$ if, when $t$ of the inputs are fixed, $h(x)$ is balanced over the remaining $n - t$ inputs, for every subset $\{i_1, \ldots, i_t\} \subseteq \{1, 2, \ldots, n\}$ of $t$ inputs and every value $x_i \in \mathbb{F}_q$ taken by the fixed inputs.
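As a simple illustration (not taken from the cited works), consider the Boolean sum of all input bits:

$$
h(x_1, \ldots, x_n) = x_1 \oplus x_2 \oplus \cdots \oplus x_n .
$$

Fixing any $n-1$ of the inputs still leaves the output balanced over the remaining free input, so this function is $(n-1)$-resilient; it is, however, linear, which foreshadows the trade-off between resilience and algebraic degree discussed below.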

Resilient functions are so named as they are resilient to the correlation attack (Siegenthaler, 1984). A typical structure for a cipher, which is known to the attacker, is a set of linear feedback shift registers (LFSRs) whose outputs are merged using a combining function. The attacker exhaustively searches through the configurations of t of the LFSRs. If the combining function has resilience less than t, when the attacker uses the correct LFSR configurations, a correlation will be observed between the plaintext and the ciphertext. In cryptographic functions, resilience is considered to be a source of diffusion (Schnorr and Vaudenay, 1995; Camion and Canteaut, 1999).

From the non-cryptographic point of view, a $t$-resilient function hashing a set of biased keys having more than $t$ bits in common is not guaranteed to generate a uniform hash distribution. Conversely, if the input is biased such that each key pair has $t$ or fewer bits in common, then the hashes are pairwise independent, and we expect a uniform hash distribution. The resilience of a function is bounded by its algebraic degree, a bound described by Siegenthaler (1984) for Boolean functions, and generalised to functions on a finite field by Camion and Canteaut (1999).

Theorem 2 (Resilience bound). (Camion and Canteaut, 1999, Corollary 1) Let $h : \mathbb{F}_q^n \to \mathbb{F}_q^l$ be a $t$-resilient function over $\mathbb{F}_q$. If $n \neq l + t$ and $q \neq 2$, then the total degree $d$ of its algebraic normal form satisfies

$$
d + t \le (q-1)n - 1 . \qquad (2.5)
$$

Hence there is a trade-off between the non-linearity and resilience of a function. This agrees with the earlier observation that non-linearity relates primarily to confusion, while resilience relates to diffusion.

Resilient functions can be constructed from orthogonal arrays, or from linear codes. Camion and Canteaut (1999) describe construction of resilient functions over finite fields by composition of smaller (meaning over fewer input variables) resilient functions.

Completeness and Avalanche A function is complete if every output bit depends on every input bit, and not just on a proper subset of the input bits (Kam and Davida, 1979). This is satisfied if every input variable appears in one or more terms of the algebraic normal form. The avalanche effect was described informally by Feistel (1973) as the phenomenon of a small change to the input key resulting in a large and unpredictable change in the function output. Specifically, complementing a single bit of the key should result in an output with on average 1/2 of the bits changing, and 1/2 equal to their original values. As it is possible for an incomplete function to have good avalanche, completeness and avalanche were combined into a single strict avalanche criterion (SAC) for Boolean functions by Webster and Tavares (1986).

Definition 3 (Strict avalanche criterion). Let $x$ and $x_i$ be two $n$-bit binary input vectors such that $x$ and $x_i$ differ only in bit $i$, $1 \le i \le n$. Let $V_i = h(x) \oplus h(x_i)$. Function $h$ meets the strict avalanche criterion if the probability that each bit in $V_i$ is equal to 1 is 0.5 over the set of all possible input vectors $x$ and $x_i$. This must hold for all values of $i$, assuming a uniform distribution of input bits.

An important property of the SAC is that, for a given set of $V_i$ obtained by complementing a single input bit, the resulting output bits are pairwise independent (Webster and Tavares, 1986). Several generalisations of avalanche have been proposed. The definition of SAC was extended to multiple bits by Forré (1990), and to finite fields by Li and Cusick (2007). The propagation criterion is an alternative generalisation of avalanche proposed by Preneel et al. (1991). For our purpose, the key result is that the degree of propagation places a lower bound on the non-linearity of a function (Zhang and Zheng, 1996). A high degree of avalanche is necessary for a cryptographic function to resist a differential attack. The simplest differential attack is to choose a plaintext $m$, from which ciphertext $c$ is obtained, and form a second plaintext by making a small modification $m + \alpha$, from which ciphertext $c'$ is generated. Then find the bitwise difference of the ciphertexts $c + c' = \beta$. The attack proceeds by finding pairs $(\alpha, \beta)$ such that $\beta$ occurs with a higher probability than random chance (Biham and Shamir, 1991). Non-linear functions are resistant to the differential attack. We can now describe the type of input bias which influences the uniformity of a non-cryptographic hash function with a certain avalanche degree. Consider a set of keys which is biased such that all key pairs differ in fewer than $k$ bit positions. A function with avalanche degree $k$ will generate hash pairs which are uncorrelated, and the hash distribution is expected to be uniform.

Relationship between resilience, avalanche and non-linearity

Avalanche is considered to contribute primarily to diffusion of a cryptographic function (Meier and Staffelbach, 1990; Zheng and Zhang, 2003; Carlet, 2010a). However, the existence of lower bounds on algebraic degree for both avalanche and non-linearity suggests that they are related. On the other hand, we observed that resilience and non-linearity are inversely related. This apparent tension between resilience and avalanche was explicitly described by Zheng and Zhang (2003), who showed that if a $t$-resilient function has avalanche degree $l$, then $t + l \le n - 2$. From the perspective of non-cryptographic hash functions, this relation between resilience and avalanche reflects their differing behaviour with biased input data. Recall that a $t$-resilient function is balanced when $t$ or fewer bits of the keys are fixed. Conversely, a function with avalanche of degree $k$ has uncorrelated values when $k$ or more bits of the keys are fixed. The mutual limitation between resilience and avalanche states that a function does not exist having expected uniformity for both types of biased keys: those with many bits in common, and those with few. The effect of resilience on hashing of biased data is explored further in Chapter 6.

2.2 Non-Universal Hash Functions for Strings

Numerous functions have been used to hash strings. Many of these were designed for the purpose of non-universal, non-cryptographic hashing, while others have been co-opted from other purposes. The cyclic redundancy check (CRC) algorithms such as CRC-32 and Adler32, and cryptographic hash functions such as MD4 and MD5, are common instances. We now briefly describe several of the leading functions, and observe that their design is not directly related to the mathematical properties described in Section 2.1.6, nor were the functions subjected to academic peer review before widespread adoption. Instead, the design process follows an engineering approach (Vincenti, 1990) involving incremental improvements, heuristics, empirical testing and informal peer review by users. Evaluation and selection of hash functions is described in detail in Section 2.2.1, but we briefly note the leading functions here. There are several websites maintaining informal comparisons of string hash functions (Kankowski, 2012; Jenkins, 1997; Mulvey, 2007). Estébanez et al. (2013) recently published a peer-reviewed comparison of many string hash functions, noting three 32-bit hash functions that have good avalanche, collision, and speed properties: MurmurHash2 (Appleby, 2008), lookup3 (Jenkins, 2006) and SuperFastHash (Hsieh, 2008). Of these, SuperFastHash was the fastest, hashing 1 000 000 1 024-byte keys in 177 seconds on a 2.8 GHz Intel CPU, MurmurHash2 slightly slower, and lookup3 being only half as fast. There are two main classes of non-universal string hash algorithms: multiplication hashes which use prime number arithmetic, and shift-add-XOR hashes using only those basic register operations. Estébanez et al. (2013) compared these two classes, and concluded that the functions based on primes generally suffered from reduced avalanche performance. Here we will describe four hash functions. Three of them are shift-add-XOR style hashes that are not included in the Estébanez et al. (2013) comparison, while the fourth, FNVhash, is included as an example of a hash based on prime number arithmetic. Each of these functions is widely used. The Fowler, Noll and Vo hash (FNVhash) was proposed in 1991 and is documented in an IETF internet draft (Fowler et al., 2012). The essential elements of the hash function are shown in Algorithm 1. It is a multiplication-style hash using a prime number, and the algorithm contains an implicit modulo $2^b$ operation as a consequence of performing the computations in a register of length $b$ bits. The FNV prime is carefully selected.

Algorithm 1 FNV-1a hash function

Data: The key, S = s_1 s_2 ... s_n.

1: h ← FNVoffsetBasis
2: for each i from 1 to n do
3:     h ← h ⊕ s_i    (the bit-wise exclusive-or operation)
4:     h ← h × FNVprime
5: return h

The authors suggest that the number of 1 bits in the prime should not be too high, and that a small prime does not achieve good mixing for short keys. Mixing is an informal property relating to avalanche: changes in all parts of the key lead to changes in all bits of the hash value. For example, the prime for the 64-bit hash is 00000100000001B3 (hexadecimal). FNVhash is used in numerous applications: database indexes, hash maps in Microsoft Visual C++, hash tables in various network protocol implementations, the libmemcached distributed key–value store by Twitter Inc (Adams, 2010), email spam detection, and several word search systems (Fowler et al., 2013). The hash() function for strings in python is a variation on this algorithm (it uses a smaller prime, 1 000 003, among other minor modifications). Interestingly, the hash function was revised in response to user testing. The original function had steps 3 and 4 in the reverse order, and the new FNV-1a algorithm was observed to have improved avalanche. It was tested by Estébanez et al. (2013) and found to have comparatively poor avalanche performance.
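For concreteness, a minimal C++ sketch of the FNV-1a loop of Algorithm 1 is given below. The 64-bit offset basis and prime are the published FNV constants; the function name and interface are illustrative and not part of the IETF draft.

```cpp
#include <cstddef>
#include <cstdint>

// Minimal FNV-1a sketch (64-bit variant), following Algorithm 1:
// exclusive-or the next byte into the state, then multiply by the FNV prime.
std::uint64_t fnv1a_64(const unsigned char* key, std::size_t len) {
    std::uint64_t h = 0xcbf29ce484222325ULL;       // FNV offset basis
    for (std::size_t i = 0; i < len; ++i) {
        h ^= key[i];                               // step 3: exclusive-or
        h *= 0x00000100000001B3ULL;                // step 4: multiply by FNV prime
    }
    return h;                                      // implicit modulo 2^64
}
```

Swapping the two statements in the loop body recovers the original FNV-1 ordering mentioned above.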

Jenkins (2013) has been influential, publishing several hash functions, including lookup3, providing test software and benchmarks that have been widely used and adapted, and maintaining a performance comparison table of numerous hash functions (Jenkins, 1997). His most recent function is SpookyHash (Jenkins, 2012), a 128-bit shift-add-XOR style hash, motivated by the speed improvement obtained by fully utilising 64-bit registers. The inner loop C++ code is shown in Figure 2.1.

Mix(const uint64 *data)
{
    s0 += data[0];    s2 ^= s10;   s11 ^= s0;   s0 = Rot64(s0,11);    s11 += s1;
    s1 += data[1];    s3 ^= s11;   s0 ^= s1;    s1 = Rot64(s1,32);    s0 += s2;
    s2 += data[2];    s4 ^= s0;    s1 ^= s2;    s2 = Rot64(s2,43);    s1 += s3;
    s3 += data[3];    s5 ^= s1;    s2 ^= s3;    s3 = Rot64(s3,31);    s2 += s4;
    s4 += data[4];    s6 ^= s2;    s3 ^= s4;    s4 = Rot64(s4,17);    s3 += s5;
    s5 += data[5];    s7 ^= s3;    s4 ^= s5;    s5 = Rot64(s5,28);    s4 += s6;
    s6 += data[6];    s8 ^= s4;    s5 ^= s6;    s6 = Rot64(s6,39);    s5 += s7;
    s7 += data[7];    s9 ^= s5;    s6 ^= s7;    s7 = Rot64(s7,57);    s6 += s8;
    s8 += data[8];    s10 ^= s6;   s7 ^= s8;    s8 = Rot64(s8,55);    s7 += s9;
    s9 += data[9];    s11 ^= s7;   s8 ^= s9;    s9 = Rot64(s9,54);    s8 += s10;
    s10 += data[10];  s0 ^= s8;    s9 ^= s10;   s10 = Rot64(s10,22);  s9 += s11;
    s11 += data[11];  s1 ^= s9;    s10 ^= s11;  s11 = Rot64(s11,46);  s10 += s0;
}

Figure 2.1: The 128-bit SpookyHash inner loop, from the file SpookyV2.h (Jenkins, 2012). It is a carefully chosen sequence of shift, add and XOR operations in 64-bit registers.

It consumes 3 bytes of input per CPU cycle, obtaining a significant speed improvement compared to hash functions that use 1 byte of input per iteration of their inner loop, such as FNVhash. The core operation of SpookyHash performs a sequence of additions, exclusive-ors and rotations on the state, which is maintained in twelve 64-bit registers. It avoids a multiplication operation, which is slower than other operations on most processors. The sequence of rotations has been carefully tuned to achieve good collision and avalanche performance.

Appleby (2008) also made a sustained endeavour to develop non-universal, non-cryptographic hash functions, creating MurmurHash2 and MurmurHash3, which are used for Bloom filters in the Apache hadoop distributed application framework (http://hadoop.apache.org/) and the Apache hbase distributed no-SQL database. This effort has also produced the SMHasher suite of tests for non-cryptographic hash functions, including tests for speed, avalanche, and collisions, which is described in Section 2.2.1. The C++ code for the MurmurHash3 inner loop is shown in Figure 2.2. It is a hybrid of both the multiplication and shift-add-XOR styles.

const uint64_t * blocks = (const uint64_t *)(data);

for(int i = 0; i < n64blocks; i++)
{
    uint64_t k1 = getblock64(blocks, i*2+0);   // get next 64-bit block
    uint64_t k2 = getblock64(blocks, i*2+1);   // get next 64-bit block

    k1 *= 0x87c37b91114253d5;  k1 = ROTL64(k1,31);  k1 *= 0x4cf5ad432745937f;
    h1 ^= k1;  h1 = ROTL64(h1,27);  h1 += h2;  h1 = h1*5 + 0x52dce729;

    k2 *= 0x4cf5ad432745937f;  k2 = ROTL64(k2,33);  k2 *= 0x87c37b91114253d5;
    h2 ^= k2;  h2 = ROTL64(h2,31);  h2 += h1;  h2 = h2*5 + 0x38495ab5;
}
// h2 and h1 contain the 128-bit hash

Figure 2.2: The inner loop of the 64-bit version of MurmurHash3, from the file MurmurHash3.cpp (Appleby, 2008). The function is a sequence of multiplication, addition and exclusive-or operations.

The input is read in blocks of 128 bits, and a sequence of multiply, rotation, exclusive-or and addition operations is applied. The function uses several large constants that have been tuned to achieve good collision and avalanche performance.

The development of MurmurHash3 is a clear example of the evolutionary nature of the design process. An earlier version (MurmurHash2) was released but found by a user to have poor performance with keys that have a certain periodic structure. Appleby responded by redesigning the hash, releasing it as MurmurHash3 and adding a new test to the SMHasher test suite. CityHash is a new hash published by Google Inc. (2011). Its development was strongly influenced by MurmurHash, and it achieves increased speed by using the built-in CRC instruction on some new Intel CPUs, as well as using multiple independent mathematical operations at each step to exploit multiple CPU cores. Google use it extensively in hash maps. The code is not shown here as it is more complex than the previous functions, and not instructive for the present discussion. The speed of these hash functions depends greatly on the hardware.

CityHash is clearly faster if the CPU possesses the CRC instruction, but has similar speed when the instruction is absent and may in fact be slower than SpookyHash on short keys (Jenkins, 2012). In summary, we can see that the design of non-universal, non-cryptographic hash functions follows an engineering approach. The functions meet several competing design goals (speed, avalanche, low collisions on expected data) that are evaluated by testing. The documented examples of function modification in response to feedback from users, as well as blog and email list entries by the designers, demonstrate that design progresses incrementally in response to outcomes from repeated testing and operational performance. The design loop of tune-and-test, or trial-and-error, results in an evolutionary design cycle. A key observation is that the hash function designs are not mathematically derived to have desired properties such as avalanche degree or resilience. Nor are the functions analysed to confirm that they possess the desired properties. There is a disconnection between the theoretical results and design practice. We suggest several potential explanations. First, it is difficult to translate a hash algorithm into a mathematical expression of the function, such as algebraic normal form, on which to perform further analysis. The ANF has $O(q^n)$ terms and large functions will be intractable. Second, there is no analytic test to determine the degree of avalanche or resilience of a function, only bounds based on the algebraic degree, as described in Section 2.1.6. It can be seen that the design of hash functions is as much an art as a science. This is reflected in the fact that the designers of these functions have not published an analysis of their designs for peer review, nor provided detailed explanations of their design decisions in the documentation or source code. But they do publish the source code with open source licences and solicit users to perform tests and make suggestions. From the user's perspective, in the absence of mathematical guarantees, thorough empirical testing is the most reliable means of evaluating the functions.

2.2.1 Testing Hash Functions

Non-universal, non-cryptographic hash functions are validated using an engineering approach. Black-box tests, which observe the function response to carefully selected inputs, are used to obtain an estimate of the important function properties. This also allows direct comparison of hash functions to identify the best function, at least in the environment of the tests. There is no standardised or formally codified suite of tests for non-cryptographic hash functions, in contrast to cryptographic functions, for which there is a standardised testing regime (Rukhin et al., 2010). Nonetheless, a common set of tests has evolved over time that appears to be generally accepted. The development of these tests has partly been informal, in the sense that some tests have not been the subject of academic peer review and analysis. Rather, hash testing has largely been situated in on-line conversations, and on web pages of test authors and hash function developers, in a manner similar to open source software development. Noteworthy examples are Jenkins (1997), Mulvey (2007), Appleby (2008), and Hsieh (2008). The tests are of two main kinds: collision tests and avalanche tests. In this thesis, we are principally interested in collision tests as they directly measure hash function uniformity. These are investigated in Chapters 5 and 6. Briefly, avalanche tests involve selecting a sample of keys, then hashing all of the keys differing from the original key sample by one bit. The results are displayed in an avalanche matrix, where the correlation between each key bit and value bit is encoded using a colour range so that bit combinations with high correlations are easily identified. Avalanche testing and avalanche matrices are described in detail in Mulvey (2007) and Appleby (2008), and presented in a peer-reviewed publication in Estébanez et al. (2013).
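As a concrete illustration of this style of test, the following C++ sketch accumulates the counts behind an avalanche matrix for an arbitrary 32-bit hash of 64-bit keys. It is a minimal sketch of the general procedure described above, not the SMHasher implementation; the hash function pointer and sample size are placeholders.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <random>

// Estimate the avalanche matrix of a 32-bit hash over 64-bit keys:
// bias[i][j] is the observed probability that flipping input bit i
// changes output bit j. For good avalanche each entry should be near 0.5.
std::array<std::array<double, 32>, 64>
avalanche_matrix(std::uint32_t (*hash)(std::uint64_t), std::size_t samples) {
    std::array<std::array<double, 32>, 64> bias{};
    std::mt19937_64 rng(42);                                 // fixed seed for repeatability
    for (std::size_t s = 0; s < samples; ++s) {
        std::uint64_t key = rng();
        std::uint32_t h = hash(key);
        for (int i = 0; i < 64; ++i) {
            std::uint32_t diff = h ^ hash(key ^ (1ULL << i)); // flip input bit i
            for (int j = 0; j < 32; ++j)
                bias[i][j] += (diff >> j) & 1u;               // count flipped output bits
        }
    }
    for (auto& row : bias)
        for (double& b : row) b /= samples;                   // counts to probabilities
    return bias;
}
```

Entries far from 0.5 identify input-output bit pairs with poor mixing, which is exactly what the colour-coded avalanche matrices of Mulvey (2007) and Appleby (2008) display.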

Theoretical collision results A hash collision test involves hashing $N$ keys followed by analysis of the distribution of the resulting hashes over the $M$ possible hash values. Essentially, the keys are inserted into a hash table of size $M$. The observed distribution is compared to the distribution expected from an ideal hash function. Mathematically, this is known as the balls in bins problem, where a ball (a key) is randomly assigned (the hash function) into a bin (a hash value). One commonly used statistic for characterising the distribution of balls into bins is the number of collisions. This can be divided by the number of bins to give the collision rate. Assuming each ball has a uniform probability of being assigned to each bin, the expected number of collisions is $N - M + M(1 - 1/M)^N$, which can be easily derived from the expected number of empty bins, $M(1 - 1/M)^N$. Note that this is not a useful measure when $N \gg M$, as once all of the bins have at least one ball, each subsequent ball must be a collision and the number of collisions provides no additional information about the uniformity of the hash distribution. Another measure of the theoretical distribution of collisions is the number of balls in the fullest bin. In the hashing literature, a related measure is used, the length of the longest probe sequence (LLPS) in a separately chained hash table. The longest probe sequence involves an initial read of the hash table to retrieve the address of the linked list of keys with the requested hash value, then further reads of each node in the list to retrieve the last element; thus LLPS = 1 + balls in the fullest bin.
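As a worked example of the expected-collision formula (the numbers are illustrative, not taken from the cited studies), take $N = 1\,000$ keys hashed into $M = 1\,000$ bins:

$$
M\left(1 - \tfrac{1}{M}\right)^{N} = 1000 \times 0.999^{1000} \approx 1000\, e^{-1} \approx 368 \text{ empty bins},
$$
$$
N - M + M\left(1 - \tfrac{1}{M}\right)^{N} \approx 1000 - 1000 + 368 = 368 \text{ expected collisions}.
$$

An ideal hash function filling a table at load factor 1 is therefore expected to leave roughly a third of the slots empty, and an observed collision count far from this value indicates a biased function.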

Ramakrishna (1988) proposed a recurrent expression providing an exact solution for the expected value of LLPS. Reviriego et al. (2011) proposed an exact analytical expression for the LLPS distribution, providing both the expected value and the variance. We do not give the expression here due to its length. The calculation has several steps involving combinatoric sums, and is costly to perform, only being practical for small numbers (< 1 000) of balls and bins. Gonnet (1981) gives an approximation for the expected value of LLPS for large hash tables, which was described earlier in Section 2.1.5.

Published collision test methodologies The first peer-reviewed investigation of string hashing performance was by Ramakrishna and Zobel (1997). They filled a hash table of size $M$ with $N = 1\,000$ keys and measured the average number of keys hashing to each value, which has an expected value of $N/M$ for an ideal hash function, and the LLPS. They varied $M$ to achieve several hash table load factors. Presciently, they hypothesised that shift-add-XOR style functions are superior to functions based on multiplication, even though they did not consider avalanche as a desirable property. They were particularly interested in testing the universality of functions, and performed empirical tests over a subset of the seed space. They performed a large number of trials (for example, they tested 10 000 seeds in the universality tests) and presented the mean and standard deviation of the measurements along with the theoretical expected values, allowing an assessment of the significance of each result in isolation as well as comparatively. Henke et al. (2008) reviewed a range of cryptographic and non-cryptographic hash functions for rapid selection of network packets. They used a $\chi^2$-test to evaluate the independence of the hash values from categorical variables relating to the input packets: network protocol, packet length ranges, and particular values at specific positions in the highly structured headers of network packets.

Note that these tests may not detect a biased function if the bias is such that the hashes are uniformly distributed among the tested categories. Nonetheless, the approach could be modified to fully test for randomness by comparison with the distribution of a statistic representing the un-grouped collision rate, such as LLPS, if its distribution was known. Estébanez et al. (2013) evaluated collision tests using two measures. The Bhattacharyya distance, $B_d$, measures the similarity between two distributions, in this case being the observed distribution of hash values and the ideal uniform distribution. Samples from distributions $a$ and $b$ are partitioned into $M$ bins, then $B_d(a, b) = -\ln\left(\sum_{i=1}^{M} \sqrt{a_i b_i}\right)$, where $a_i$, $b_i$ are the counts of the samples from each distribution $a$, $b$ in the $i$-th bin (Bhattacharyya, 1943). $B_d$ ranges from 0 for identical distributions to $\infty$ for distributions having no overlap. Estébanez et al. (2013) only used the Bhattacharyya distance for comparison between hash functions, and did not consider the statistical significance of the results. The second measure used by Estébanez et al. (2013) was the collision rate, the ratio of the number of collisions to the number of hashes. Again, the measure was only used for comparison of hash functions, and the statistical significance of the results was not discussed. They tested a single set of 1 000 or 2 000 keys (depending on the data set used), which were inserted into hash tables with load factors of 0.5 and 2.0. The most recent of the informal hash tests is the SMHasher test suite (Appleby, 2008), which was developed in conjunction with the development of MurmurHash, and subsequently contributed to the development of CityHash (Google Inc., 2011). SMHasher was informed by the earlier work of Jenkins (1997).

The SMHasher collision test fails if the number of collisions is more than twice the expected number. The statistical significance of this outcome is not described, but it is rather used as an imprecise indication of an obviously bad outcome. The collision test is applied to several key sets that are chosen to probe specific aspects of the hash function, as well as any additional key sets provided by the user:

• Differential test. A set of keys is created from a randomly selected key by modifying a small number of its bits. The test fails if 3 or more keys collide. The number of "differential" keys increases for longer keys, so only a subset of the differentials is tested. For a 64-bit key, SMHasher tests all 5-bit differentials, all 4-bit differentials for a 128-bit key, and all 3-bit differentials for a 256-bit key.

• Cyclic key test. A key set is created (10 million keys) with each key being N repetitions of M bits. There will be a higher than expected collision rate if stages of the hash function cancel each other out. This test was created by Appleby to show that MurmurHash2 has this defect.

• Sparse key test. The key set contains all N-bit keys with only M bits set, N ≫ M. This test will expose a function which does not provide good mixing of 0-bits.

• Permuted key test. Ten small blocks of data are created, which are concatenated to create a key. The key set consists of all 3.6 million permutations of these blocks. A higher number of collisions will be observed if the hash function contains duplicate sub-functions for different input blocks.

• Zero test. The key set contains strings of zeroes of different lengths. These should hash to different values.

• Seed test. If the hash function uses a seed value, a single key is hashed with a set of different seeds.

In summary, while a common procedure for performing collision tests has emerged, evaluation of the results is ad hoc. Surprisingly little consideration has been given to the statistical significance of the test results and the related question of whether enough testing has been performed to support any conclusions inferred from the test. We propose a statistically rigorous test methodology in Chapter 5. Nonetheless, current practice appears to detect poor hash functions with high reliability. Evaluation of test outcomes typically involves comparison between several functions undergoing the same test, and, if one or more of the functions is good, then a poor function will be readily identified. While the significance of the tests is not known, any uncertainty or ambiguity arising from conflicting results is usually resolved by performing more tests. It follows that we can have reasonable confidence that the hash functions developed using extensive testing have good randomness properties, and that the many tests and observations of operational performance reported by numerous users are effective at identifying problematic functions in the long run.

2.3 Recursive Hash Functions

Recursive hash functions for strings, also known as rolling hash functions, allow the efficient calculation of a hash on a window sliding over a string. They are an essential element of the Karp-Rabin string search algorithm described later. The hash is calculated using the hash calculated from the previous position of the sliding window: the hash $v_i$ of the substring starting at position $i$ and of length $n$ is $v_i = h(s_{i-1}, s_{i+n}, v_{i-1})$, where $v_{i-1}$ is the hash of the preceding window, $s_{i-1}$ is the symbol at the start of the preceding window and $s_{i+n}$ is the symbol at the end of the current window. Observe that the symbols $s_i$ to $s_{i+n-1}$ occur in both the current and the preceding window, and their contribution to the hash value is included in $v_{i-1}$. This allows the hashes of succeeding substrings to be calculated from the preceding hash using very few operations.

Karp-Rabin Rolling Hash Function

The Karp and Rabin (1987) rolling hash algorithm, based on division by a prime number, is widely used. A substring u of length n of a string S can be expressed as a number v′ in base q, the alphabet size. Typically q = 256 so that the input string is interpreted as a sequence of bytes. For a substring starting at position i,

$$
v' = q^0 s_{i+n-1} + q^1 s_{i+n-2} + q^2 s_{i+n-3} + \cdots + q^{n-1} s_i . \qquad (2.6)
$$

The integer $v'$ is converted to a hash $v$ by taking the modulus with respect to a large prime number $p$,

$$
h(u) = v \equiv v' \pmod{p} . \qquad (2.7)
$$

Gonnet and Baeza-Yates (1990) observed that the prime should be chosen so that q is a primitive root modulo p, to ensure that the hash function has maximum cycle. Specifically, this means that

$$
\min\{\, k : q^k \equiv 1 \pmod{p} \,\} = p - 1 . \qquad (2.8)
$$

The function $h$ can be calculated using fixed-width integer arithmetic by $n$ iterations of the recurrence relation

$$
v_j \equiv (q\, v_{j-1} + s_{i+j-1}) \pmod{p} , \qquad (2.9)
$$

with $j = 1, 2, \ldots, n$ and $v_0 = 0$, provided that the integer $q(p + 1)$ does not overflow the register. The final hash value $v$ of Equation 2.7 is $v = v_n$. The modulo operation after each iteration prevents the calculation from overflowing, although the $n$ modulo operations are time-consuming. For example, with an 8-bit alphabet ($q = 256$) a 54-bit hash can be calculated using 64-bit signed integer arithmetic.

The advantage of the rolling hash method is that, once the initial hash has been calculated, the hash of the next string $u_{i+1}$ can be efficiently calculated from $h(u_i)$ with the rolling hash equation

$$
h(u_{i+1}) \equiv (s_{i+n} + q\, h(u_i) - q^n s_i) \pmod{p} . \qquad (2.10)
$$

The term $q^n \pmod{p}$ can be pre-calculated, and the calculation of $h(u_{i+1})$ requires a single modulo operation, a substantial speed-up on the $n$ modulo operations required to calculate the initial hash. Gonnet and Baeza-Yates (1990) proposed a more efficient variation of the algorithm where the radix is a prime and the modulo is to a power of 2, which is chosen to be the word size so that the modulo happens as a natural result of multiplication in a fixed size register.

Again, the prime is carefully chosen to ensure the hash function has maximum cycle.
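The following C++ sketch illustrates Equations 2.9 and 2.10. The class name is illustrative; the prime $2^{31}-1$ is chosen only so that all intermediate products fit comfortably in 64-bit registers, and it has not been checked against the primitive-root condition of Gonnet and Baeza-Yates (1990).

```cpp
#include <cstddef>
#include <cstdint>

// Minimal Karp-Rabin rolling hash sketch (Equations 2.9 and 2.10),
// with q = 256 (byte alphabet) and an illustrative prime p.
class KarpRabinRollingHash {
public:
    explicit KarpRabinRollingHash(std::size_t n) : n_(n), qn_(1) {
        for (std::size_t i = 0; i < n; ++i)          // pre-calculate q^n mod p
            qn_ = (qn_ * q_) % p_;
    }

    // Initial hash of the n-byte window starting at s (Equation 2.9).
    std::uint64_t init(const unsigned char* s) {
        v_ = 0;
        for (std::size_t j = 0; j < n_; ++j)
            v_ = (v_ * q_ + s[j]) % p_;
        return v_;
    }

    // Slide the window one symbol: remove 'out', append 'in' (Equation 2.10).
    std::uint64_t roll(unsigned char out, unsigned char in) {
        // add p_ before subtracting to keep the value non-negative
        v_ = (v_ * q_ + in + (p_ - (qn_ * out) % p_)) % p_;
        return v_;
    }

private:
    static constexpr std::uint64_t q_ = 256;
    static constexpr std::uint64_t p_ = 2147483647ULL;   // 2^31 - 1 (prime)
    std::size_t n_;
    std::uint64_t qn_;    // q^n mod p
    std::uint64_t v_ = 0;
};
```

For a window of length n over a byte string S, init(&S[0]) gives the hash of the first window and successive calls roll(S[i], S[i+n]) give the hash of the window starting at i+1.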

Hashing by General Polynomial Division

Cohen (1997) describes several recursive hash functions where the input strings are mapped to polynomials over a finite field. One of these methods, hashing by general polynomial division, is described in detail below. It was shown by Lemire and Kaser (2010) to be pairwise independent, and has several advantages compared to the Karp-Rabin hashing algorithm. The calculation is more efficient, requiring only bitwise shift and exclusive-or operations, while Karp-Rabin requires integer multiplication. Also, the range of hash values is conveniently a power of 2, rather than a prime number. It is described in detail here as it is used extensively in later chapters.

Algorithm 2 Non-recursive calculation of hash by general polynomial division implementing Equation 2.12.

Data: S = s_0 s_1 ..., the input string. n, the length of the key. i, the position of the first symbol of the key in S.
Parameters: p, the irreducible polynomial of degree d, expressed as a binary number.
1: v′ ← 0                              (the hash state variable)
2: for each symbol s_j, j = i to i + n − 1 do
3:     v′ ← v′ ≪ 1                     (left shift 1 bit)
4:     if bit d of v′ = 1 then
5:         v′ ← v′ ⊕ p                 (bit-wise exclusive-or with p)
6:     v′ ← v′ ⊕ T(s_j)                (bit-wise exclusive-or with next symbol)
7: h(S[i, i + n − 1]) ← v′             (return the hash value)

Conceptually, the hash function is similar to the Karp-Rabin function: the input string is converted to a large number represented as an element of a finite field, rather than as an integer, to which a modulo operation is applied where the divisor is an irreducible polynomial in a finite field rather than a prime number.

Let the input key be the substring $u_i = s_i s_{i+1} \ldots s_{i+n-1}$ of $n$ consecutive input symbols beginning at position $i$ of the input string. The input key is mapped to $v_i' \in \mathbb{F}_2[x]$, the ring of polynomials in the indeterminate variable $x$ having coefficients that are elements of $\mathbb{F}_2$, the finite field with two elements, $\{0, 1\}$, also known as the Galois field GF(2),

$$
v_i' = \sum_{j=1}^{n} x^{n-j}\, T(s_{i+j-1}) . \qquad (2.11)
$$

The hash function $h$ reduces the polynomial $v_i'$ modulo an irreducible polynomial $p(x)$, so that the range of the hash function is the ring $R = \mathbb{F}_2[x]/\langle p(x) \rangle$. The function $T$ maps the input symbols into $R$, and it is recommended that the mapping is random to avoid bunching of hash values at low values, especially for short keys.

$$
h(u_i) = v_i = \sum_{j=1}^{n} x^{n-j}\, T(s_{i+j-1}) \pmod{p(x)} . \qquad (2.12)
$$

Algorithm 3 Recursive calculation of hash by general polynomial division implementing Equation 2.13.

Data: S = s_0 s_1 ..., the input string. n, the length of the key. i, the position of the first symbol of the key in S.
Parameters: p, the irreducible polynomial of degree d, expressed as a binary number. h(S[i − 1, i + n − 2]), the hash of the preceding key.
1: v′ ← h(S[i − 1, i + n − 2])          (the hash state variable)
2: v′ ← v′ ≪ 1                          (left shift 1 bit)
3: if bit d of v′ = 1 then
4:     v′ ← v′ ⊕ p                      (bit-wise exclusive-or with p)
5: v′ ← v′ ⊕ T(s_{i+n−1})               (bit-wise exclusive-or with new symbol)
6: v′ ← v′ ⊕ x^n T(s_{i−1})             (bit-wise exclusive-or with preceding symbol)
7: h(S[i, i + n − 1]) ← v′              (return the hash value)

The order of $R$, $2^d$, is the size of the hash table, and $d$ is the degree of $p(x)$, which is the length of the hash value in bits. As $R$ is an integral domain, the ring $R$ is cyclic and the distribution of keys to elements of $R$ is balanced. Finally, the hash value $v \in R$ is converted to an integer by simply interpreting its polynomial coefficients as a binary number, for example $x^4 + x + 1$ becomes 10011.

Example As a running example to illustrate this algorithm, consider hashing the string gram using an alphabet size $q = 256 = 2^8$. Furthermore, assume that the hash value is 8 bits long, so all of our calculations are performed in $\mathbb{F}_{2^8}$. The elements of the field are expressed as polynomials in an indeterminate variable $x$ with coefficients from $\mathbb{F}_2 = \{0, 1\}$. We describe $x$ as indeterminate as we never solve the polynomial for $x$. Rather, the polynomials are added and multiplied to obtain new elements of $\mathbb{F}_{2^8}$. These operations follow the arithmetic rules of

$\mathbb{F}_2$, where addition is the bitwise exclusive-OR operation and multiplication is the bit-wise AND operation. So $x^3 + x + 1$ and $x^3 + x^2 + 1$ are elements of $\mathbb{F}_{2^8}$, and their sum $x^3 + x^3 + x^2 + x + 1 + 1 = x^2 + x$ is also an element. There is a direct mapping between the polynomial coefficients and binary numbers, such as the number $1010011011 \leftrightarrow x^9 + x^7 + x^4 + x^3 + x^1 + x^0$. Observe that this binary number is longer than our 8-bit hash values, and the polynomial is not an element of $\mathbb{F}_{2^8}$ because of the $x^9$ term. To convert it to $\mathbb{F}_{2^8}$, we divide by an irreducible polynomial of order 8, $p(x) = x^8 + x^6 + x^5 + x^2 + 1 \leftrightarrow 101100101$. Irreducible polynomials cannot be factored, and are analogous to prime numbers among the integers. For example $x^5 + x + 1 = (x^2 + x + 1)(x^3 + x^2 + 1)$ is reducible.

Next, we assume that $T(s_i)$ is the ASCII value of $s_i$, so that $T(\text{'g'}) = 67_{hex} = 0110\,0111 \leftrightarrow x^6 + x^5 + x^2 + x + 1$, $T(\text{'r'}) = 72_{hex}$, $T(\text{'a'}) = 61_{hex}$, and $T(\text{'m'}) = 6d_{hex}$. Then, applying Equation 2.12,

$$
\begin{aligned}
h(\text{'gram'}) &= \sum_{j=1}^{4} x^{4-j}\, T(s_{j-1}) \pmod{p(x)} \\
&= x^3(x^6+x^5+x^2+x+1) + x^2(x^6+x^5+x^4+x) \\
&\qquad + x(x^6+x^5+1) + (x^6+x^5+x^3+x^2+1) \pmod{p(x)} \\
&= x^9+x^8+x^5+x^4+x^3 + x^8+x^7+x^6+x^3 \\
&\qquad + x^7+x^6+x + x^6+x^5+x^3+x^2+1 \pmod{p(x)} \\
&= x^9 + x^6 + x^4 + x^3 + x^2 + x + 1 \pmod{p(x)} .
\end{aligned}
$$

Now, dividing by $p(x)$, we find that

$$
\begin{aligned}
&= x\,p(x) + x^7 + x^4 + x^2 + 1 \pmod{p(x)} \\
&= x^7 + x^4 + x^2 + 1 \\
&\rightarrow 1001\,0101,
\end{aligned}
$$

which is our hash value. $\square$

Once the initial hash has been calculated, the previous hash value can be used to calculate the hash of the next key using the rolling hash function

$$
h(u_i) = x\, h(u_{i-1}) + T(s_{i+n-1}) + x^n T(s_{i-1}) \pmod{p(x)} \qquad (2.13)
$$

Calculation of $x\,h(u_{i-1})$ requires a small number of bit-wise operations. Let $p(x) = x^d + p_{d-1}x^{d-1} + p_{d-2}x^{d-2} + \cdots + p_0$, and as $p(x) = 0$ we can write $x^d = p_{d-1}x^{d-1} + p_{d-2}x^{d-2} + \cdots + p_0$. For a polynomial $w(x) \in R$ where $w(x) = w_{d-1}x^{d-1} + w_{d-2}x^{d-2} + \cdots + w_0$,

$$
x\,w(x) = w_{d-1}x^d + w_{d-2}x^{d-1} + \cdots + w_0 x \qquad (2.14)
$$

$$
= \begin{cases}
w_{d-2}x^{d-1} + w_{d-3}x^{d-2} + \cdots + w_0 x, & w_{d-1} = 0 \\
(w_{d-2} + p_{d-1})x^{d-1} + (w_{d-3} + p_{d-2})x^{d-2} + \cdots + p_0, & w_{d-1} = 1
\end{cases} \qquad (2.15)
$$

Hence, when representing $p(x)$ and $w(x)$ as binary numbers $p$ and $w$, finding $w(x) \leftarrow x\,w(x)$ is achieved as follows:

1. Shift $w$ left without carry. Shift in 0 on the right hand side, so that $w_0 = 0$.

2. Test $w_d$ (which was $w_{d-1}$ before the left shift).

3. If $w_d = 1$, $w = w \oplus p$ ($\oplus$ is the exclusive-or operation).

$T(s_{i+n-1})$ and $x^n T(s_{i-1})$ can be precalculated and implemented in look-up tables of $q$ elements. Addition of the terms in the recursion is performed by their bit-wise exclusive-or. Note that the result of this addition will be an element of $R$, and there is no need to perform an additional operation for the modulo $p(x)$. Algorithms 2 and 3 show the complete calculation of $h(u_i)$ using both the recursive and non-recursive methods.

Example (continued) We calculate the hash from the previous example using the non-recursive Algorithm 2.

iteration   operation                  result in register
(1)         v′ = 0                     0000 0000
            v′ = T('g') = 67_hex       0110 0111
(2)         v′ = v′ ≪ 1                1100 1110
            T('r') = 72_hex            0111 0010
            v′ = v′ ⊕ T('r')           1011 1100
(3)         v′ = v′ ≪ 1                1 0111 1000
            p = 1 0110 0101
            v′ = v′ ⊕ p                0001 1101
            T('a') = 61_hex            0110 0001
            v′ = v′ ⊕ T('a')           0111 1100
(4)         v′ = v′ ≪ 1                1111 1000
            T('m') = 6d_hex            0110 1101
            v′ = v′ ⊕ T('m')           1001 0101

and the final hash value, 1001 0101, is the same as the previous calculation.

Now, for the recursive calculation in Algorithm 3, we assume that the preceding character was 'n' (the entire string is then ngram), so that T('n') = 6e_hex, and that we have already calculated x^n T('n') = db_hex. Also, the previous hash h('ngra') = a3_hex.

algorithm line number   operation               result in register
(1)                     v′ = h('ngra')          1010 0011
(2)                     v′ = v′ ≪ 1             1 0100 0110
                        p = 1 0110 0101
(4)                     v′ = v′ ⊕ p             0010 0011
                        T(s_5) = 6d_hex         0110 1101
(6)                     v′ = v′ ⊕ T(s_5)        0100 1110
                        x^n T(s_1)              1101 1011
(7)                     v′ = v′ ⊕ x^n T(s_1)    1001 0101

As before, the hash value is 1001 0101. □

Calculation of the recursion requires at most five bit-wise operations (a left shift, a bit test, an exclusive-OR operation if $w_{d-1} = 1$, and two more exclusive-OR operations to add the second and third terms) and two table lookups. On most processors, this will be faster than calculation of the Karp-Rabin recursive hash, which requires an integer multiplication along with several other operations.
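To make the bit-level operations concrete, the following C++ sketch implements Algorithms 2 and 3 for the running example above. The class name and the identity choice of T are illustrative assumptions; the parameters (d = 8, p(x) = x^8 + x^6 + x^5 + x^2 + 1) are those of the example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hashing by general polynomial division (Algorithms 2 and 3), with an
// 8-bit hash, p(x) = x^8 + x^6 + x^5 + x^2 + 1 (binary 1 0110 0101) and
// T taken as the identity mapping of the byte value (both illustrative).
class PolyDivRollingHash {
public:
    explicit PolyDivRollingHash(std::size_t n) : n_(n) {
        for (unsigned s = 0; s < 256; ++s) {        // pre-calculate x^n * T(s) mod p(x)
            std::uint32_t v = T(static_cast<unsigned char>(s));
            for (std::size_t k = 0; k < n_; ++k) v = mulx(v);
            xnT_[s] = static_cast<std::uint8_t>(v);
        }
    }

    // Algorithm 2: non-recursive hash of the n-byte key starting at s.
    std::uint8_t init(const unsigned char* s) {
        std::uint32_t v = 0;
        for (std::size_t j = 0; j < n_; ++j)
            v = mulx(v) ^ T(s[j]);                  // shift, reduce, add next symbol
        return v_ = static_cast<std::uint8_t>(v);
    }

    // Algorithm 3: slide the window, removing 'out' and appending 'in'.
    std::uint8_t roll(unsigned char out, unsigned char in) {
        std::uint32_t v = mulx(v_);                 // multiply the state by x and reduce
        v ^= T(in);                                 // add the new symbol
        v ^= xnT_[out];                             // remove the preceding symbol
        return v_ = static_cast<std::uint8_t>(v);
    }

private:
    static std::uint8_t T(unsigned char s) { return s; }   // illustrative mapping

    // Multiply by x modulo p(x): left shift, then reduce if bit d is set.
    static std::uint32_t mulx(std::uint32_t v) {
        v <<= 1;
        if (v & (1u << d_)) v ^= p_;
        return v;
    }

    static constexpr unsigned d_ = 8;
    static constexpr std::uint32_t p_ = 0x165;      // x^8 + x^6 + x^5 + x^2 + 1
    std::size_t n_;
    std::array<std::uint8_t, 256> xnT_{};
    std::uint8_t v_ = 0;
};
```

With n = 4, init applied to "gram" reproduces the hash 1001 0101 of the worked example, and calling roll('n', 'm') after hashing "ngra" gives the same value.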

2.4 String Search

String search is the problem of finding the occurrences of a substring in a longer string. Formally, the search is for a string of length N, called the pattern or target,

$Q = q_1 q_2 \ldots q_N$, within a longer string of length $M$ called the text or reference string, $R = r_1 r_2 \ldots r_M$. Two strings are said to match or to be identical if they have the same symbols at each position. A pattern $Q$ matching a substring of $R$ starting at position $k$ has $q_1 = r_k$, $q_2 = r_{k+1}, \ldots, q_N = r_{k+N-1}$. The cost of string comparison is $O(N)$ as each symbol in $Q$ must be inspected.

2.4.1 Fundamental Algorithms for String Search

Fundamental string search techniques are briefly reviewed, describing their performance and limitations, to explain why the Karp-Rabin string search algorithm, which is reviewed in the next section, is preferred for our application. Detailed descriptions of many of these algorithms can be found in string processing texts such as Crochemore and Rytter (2002) or Navarro and Raffinot (2002). A brute force string search algorithm takes $O(MN)$ time. The pattern $Q$ is compared with the $M - N$ substrings of $R$ of length $N$, and each string comparison is $O(N)$. Note that each position in $R$ is compared up to $N$ times. There are several algorithms that complete the search in linear time by avoiding multiple comparisons at each position in $R$. They are used for one-time searches of a pattern in a text, and achieve their improved efficiency by pre-processing the pattern. The Knuth-Morris-Pratt algorithm (Knuth et al., 1977) requires $O(N + M)$ comparisons using $O(N)$ additional space for an array recording the length of each prefix of $Q$ matching the suffix of the substring ending at each position in $Q$. As the text is processed left to right, the algorithm can be used for on-line, or stream, processing. The Boyer-Moore algorithm, with improvements by Galil (1979), also requires $O(N + M)$ time. It takes $O(N)$ time to pre-process the pattern, then makes $O(M)$ comparisons in the text. At a location in the text, it performs the string comparison from right to left, then skips a number of positions following a partial match. The average search time is $O((M \log N)/N)$, and in practice it is generally faster than the Knuth-Morris-Pratt algorithm. It is especially suitable for long patterns and large alphabets. A variant is the Horspool (1980) algorithm, which uses more space for the pre-processed pattern to achieve a lower constant factor for time in the average case, although the worst case search time is $O(NM)$. The Aho and Corasick (1975) algorithm searches for multiple patterns in a single pass of the text in $O(M + N)$ time, where $N$ is the total length of the patterns. The pre-processing stage involves construction of a finite state automaton encoding the dictionary of patterns. Unlike the previous algorithms, the space needed to represent the multiple patterns is reduced when they have substantial similarity. The usefulness of these algorithms for collection-based compression is limited as they need to read the entire data collection for each search. To reduce the search time it is necessary to create data structures that describe the text. Various alternatives are discussed in the next sections.

Algorithm 4 The Karp-Rabin string search algorithm
Data: Q, the pattern string of length N. R, the text string of length M.
Result: The positions of the occurrences of Q in R.

1: p ← a randomly chosen prime from the range 1, ..., NM², selecting the function Fingerprint() from the family of hash functions.
2: h_Q ← Fingerprint(Q).    (The fingerprint of the pattern, Q.)
3: L ← an empty list of matches.
4: for i = 1 to M − N + 1 do
5:     if Fingerprint(R[i, i + N − 1]) = h_Q then
6:         if Q = R[i..i + N − 1] by symbol-wise comparison then
7:             append i to L.
8: return L.

2.4.2 The Karp-Rabin Algorithm

An alternative method for string search is to calculate a hash of the pattern, and hashes of the length $N$ substrings of the text, then compare the hashes. As hashes typically fit into a single 32-bit or 64-bit word, the comparison is performed in a single operation. This approach is usually inferior to the best methods described in the previous section as the hashing operation is performed at every position in the text, and symbol-wise string comparisons are still needed when the hashes match. However, it is the preferred method when searching for multiple patterns of the same length, as the pattern hashes can be compared in a single hash table query. Karp and Rabin (1987) described three algorithms using this string search method. The basic algorithm (algorithm 2 in their paper) is shown in Algorithm 4. After first calculating the hash of the pattern $Q$, the length $N$ substrings of the text $R$ are hashed using the Karp-Rabin rolling hash function, or any of the recursive hash functions described in Section 2.3. If the hashes differ, the algorithm moves to the next position in $R$, and, if the hashes match, a collision resolution step is performed by directly comparing the substring in $R$ with $Q$. It is possible, although extremely unlikely, when using a given hash function that every position in $R$ results in a hash collision, resulting in a worst-case running time of $O(NM)$. Karp and Rabin showed that by randomly selecting the hash function from a family, there is a negligible probability of a run of collisions and the expected worst case running time is $O(N + M)$. For the family of Karp-Rabin rolling hash functions, each member uses a different prime number, which is selected on line 1 of the algorithm. When searching for multiple patterns, $N$ is the combined length of the patterns.
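A self-contained C++ sketch of this scan is given below. The radix q = 256 and the fixed prime are illustrative; Algorithm 4 would choose the prime at random to select a member of the hash family.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the basic Karp-Rabin scan (Algorithm 4), 0-based positions.
std::vector<std::size_t> karp_rabin_search(const std::string& Q, const std::string& R) {
    std::vector<std::size_t> matches;
    const std::size_t N = Q.size(), M = R.size();
    if (N == 0 || N > M) return matches;

    const std::uint64_t q = 256, p = 2147483647ULL;    // illustrative prime 2^31 - 1
    std::uint64_t qn = 1;                               // q^N mod p, for Equation 2.10
    for (std::size_t i = 0; i < N; ++i) qn = (qn * q) % p;

    std::uint64_t hQ = 0, hR = 0;                       // initial hashes (Equation 2.9)
    for (std::size_t i = 0; i < N; ++i) {
        hQ = (hQ * q + static_cast<unsigned char>(Q[i])) % p;
        hR = (hR * q + static_cast<unsigned char>(R[i])) % p;
    }

    for (std::size_t i = 0; i + N <= M; ++i) {
        // Matching fingerprints trigger symbol-wise collision resolution.
        if (hQ == hR && R.compare(i, N, Q) == 0)
            matches.push_back(i);
        if (i + N < M) {                                // roll the window (Equation 2.10)
            std::uint64_t out = static_cast<unsigned char>(R[i]);
            std::uint64_t in  = static_cast<unsigned char>(R[i + N]);
            hR = (hR * q + in + (p - (qn * out) % p)) % p;
        }
    }
    return matches;
}
```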

It is a common practice to omit the prime selection step in many domains, such as delta-encoding, content-defined chunking, and plagiarism and similarity detection, which are described in later sections. This is also the case for collection-based compression, for similar reasons that universal hashing is not suitable: it is impractical to re-hash the collection with a new function. An interesting variation of Algorithm 4 considered by Karp and Rabin (1987) (their first algorithm) uses multiple hash functions. A string matching collision occurs when all $k$ functions generate a hash collision, which has greatly diminished probability as $k$ increases. For a large $k$, the probability of error can be disregarded and the string match is accepted without performing the symbol-wise string comparison. This is an example of the compare-by-hash technique, which is widely implemented using cryptographic hash functions in systems such as rsync (Tridgell and Mackerras, 1996).

The Karp-Rabin algorithm uses $O(N)$ space, needing to store the patterns $Q$, the pattern hashes, the current substring in $R$ of length $N$ and its hash. Compared to the basic string search algorithms, it is especially well suited to searching for multiple patterns, and for longer patterns on small alphabets.

2.4.3 Fingerprint Indexes for String Search

In order to avoid re-reading the text for each search, a fingerprint index can be maintained, which stores hashes of selected substrings of the text. We refer to these hashes as fingerprints in order to avoid confusion with other hash values that may be used in the index data structure. A typical such structure is a hash table in memory in which the fingerprints are keys which may then be hashed a second time into a reduced range according to the table size. This allows the fingerprint to be longer, such as 32 or 64 bits, without requiring a hash table of that size. We first describe the approach of Bentley and McIlroy (2001), before covering several variations on the data structure. The text, which for collection-based compression is the collection of files, is regarded as a large set of pattern strings.

The fingerprints of the text patterns are pre-computed and stored in the index. String search proceeds by scanning the query string for matching patterns in the text in a manner similar to Algorithm 4: the query string is traversed with a rolling hash function, and the fingerprint of each substring is queried in the index. The matching substrings are then retrieved from the collection for direct string comparison to resolve collisions. If the fingerprint index resides in memory, the only external reads of the collection data are to confirm the matching data and resolve collisions. The string search algorithm using an index requires three key modifications to Algorithm 4. First, the hash function cannot be randomly selected at query time. Instead, all fingerprints in the index are calculated using a hash function that is selected when the index is first created, and is used for the life of the collection.

Second, another significant change to the Karp-Rabin algorithm arises because the index is created before the length of the query string is known. In Algorithm 4, the fingerprints are calculated on substrings of the text having the same length as the pattern string, but the index must support search for query strings of various lengths. The solution is to insert substrings having some common property that supports comparison by their fingerprints. Bentley and McIlroy (2001) inserted patterns of a fixed length, $n$, which are referred to as n-grams. Some alternative properties are discussed in Section 2.6 on content-defined chunking and selection. When n-grams are used, only strings of length $n$ can be found in the index, so it is not possible to find a match for a query string shorter than $n$. A query string with length $m > n$ can be found by searching for each of its $m - n + 1$ substrings of length $n$. Third, the last modification needed to Algorithm 4 arises because storing the fingerprints of all of the n-grams in the text requires more space than the text itself. A text of length $M$ bytes has $M - n + 1$ n-grams, requiring $8(M - n + 1)$ bytes to store the 64-bit fingerprints. The solution is to discard most of the fingerprints, creating a sampled fingerprint index. If the index contains the fingerprint of every $k$-th n-gram, then a matching substring of length $k + n - 1$ is certain to be found, as one of its n-grams will be contained in the index, and shorter matches which are longer than $n$ have some likelihood of being found. A consequence of fingerprint sampling is that a query substring (which may be much longer than $n$) with a match in the text will not be completely identified by n-gram fingerprints in the index: many of the n-gram fingerprints will have been discarded in the sampling process. This is resolved by a symbol-by-symbol comparison at the same time as comparison of the matching n-grams for collision resolution. Bentley and McIlroy (2001) referred to this as extending the match forwards and backwards to find its full extent. An example of the modified algorithm for string search using a sampled fingerprint index is shown in Algorithm 5.

When searching for a complete string which is much longer than n, these additional symbols create the possibility of false positive queries, whereby all of the n-grams match but the entire string does not. Resolving this situation requires an unnecessary read from the collection to compare the non-matching string. On the other hand, this is not problematic when searching for substrings of the query string, rather than an entire match. The modified algorithm for string search using a sampled fingerprint index is shown in Algorithm 5. Consider the example of a text containing the string

R = abcdefghijklmnopqrstuvwxyz

When n = 4 and k = 4, the index will contain fingerprints of the n-grams

abcd, efgh, ijkl, mnop, qrst, uvwx

For the query string Q1 = fghijklmnopqrs, the n-grams ijkl and mnop have matching fingerprints in the index, and the entire match will be confirmed during the symbol-wise comparison. Another query string Q2 = cdefghijklmn! also has two matching n-grams in the index (efgh and ijkl), but will fail the symbol-wise comparison as the final symbol does not match. An important effect of using a sampled hash index is the creation of false negative index queries, when there is a matching n-gram in the collection, but its

fingerprint has not been inserted into the index due to sampling. In order to find a matching substring, $n - 1$ unsuccessful index queries may be required before a matching n-gram is found. Algorithm 5, often with minor variations, has been used in many applications.

Important areas for CBC are delta encoding and delta compression, described in detail in Sections 2.5 and 2.7, where it is common to use a fingerprint index for repeated searches of a reference string for the longest match to a prefix of the query string.

Algorithm 5 String search using a sampled fingerprint index
Data: Q, the query string of length N. R, the reference string of length M. n, the length of the n-grams. k, the index sample period. The fingerprint of every k-th n-gram is inserted in the index.
Result: The positions of the occurrences of Q in R.

1: for each n-gram in R from position i = 0 to M − n do    (Create the index: insert into the index the position and fingerprint of every k-th n-gram of R.)
2:     if i ≡ 0 (mod k) then
3:         f_p ← Fingerprint(R[i, i + n − 1])
4:         IndexInsert(i, f_p)

5: Let L be an empty list of positions in R partially matching Q.
6: for i = 1 to k do    (Find the fingerprint matches in the first k n-grams of Q.)
7:     f_p ← Fingerprint(Q[i, i + n − 1])
8:     results ← IndexQuery(f_p)
9:     For each result r in results, append r − i, the beginning of the potential match in R, to L.

10: for each partial match, r, in the list L do    (Perform a byte-wise comparison of the potential matches to detect hash collisions and to extend beyond the matching n-gram.)
11:     for each position j in Q from j = 0 to N − 1 do
12:         if Q[j] ≠ R[r + j] then
13:             Delete r from L.
14: Return L
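The index operations assumed by Algorithm 5 (IndexInsert and IndexQuery) can be sketched in C++ as a hash map from 64-bit fingerprints to text positions. The structure below is a minimal illustration under that assumption, with any rolling hash usable as the Fingerprint function; it is not the cobald implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Minimal sampled fingerprint index for Algorithm 5: the fingerprint of every
// k-th n-gram of the reference text is stored with its starting position.
class SampledFingerprintIndex {
public:
    void insert(std::uint64_t fingerprint, std::size_t position) {   // IndexInsert
        table_[fingerprint].push_back(position);
    }

    // IndexQuery: positions of indexed n-grams sharing this fingerprint
    // (possibly including hash collisions, resolved later by byte-wise comparison).
    const std::vector<std::size_t>& query(std::uint64_t fingerprint) const {
        static const std::vector<std::size_t> empty;
        auto it = table_.find(fingerprint);
        return it == table_.end() ? empty : it->second;
    }

private:
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> table_;
};
```

Building the index for a reference string R then amounts to the loop on lines 1 to 4 of Algorithm 5: compute the rolling fingerprint at every position and call insert only when the position is a multiple of k.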

2.4.4 Compressed Self-Indexes

The compressed self-indexes, and related data structures such as suffix trees, are potential alternatives for indexing the collection to support string search. We briefly review them here primarily to explain why we preferred fingerprint indexes for collection-based compression, and also to note recent advances which have addressed some of their previous shortcomings. A suffix tree is a compressed trie containing all of the suffixes of the text R, with the start position of each suffix stored at the corresponding leaf (Weiner, 1973). All occurrences of a pattern Q in the text are found in O(N) time by traversing a path from the root of the tree to each successive symbol in Q. The subtrie at the end of the path contains all of the suffixes having Q as a prefix. Suffix trees require O(M) space, but with a high constant factor such that the size of the tree is several times greater than the original text, they are unsuitable for collection-based compression, which requires an index much smaller than the collection size. A suffix array stores the position of each suffix of the text sorted in lexico- graphic order (Manber and Myers, 1993). It corresponds to the leaves of the suffix tree obtained from a depth-first traversal. The space complexity is O(M) as the array contains M integers, which is still several times larger than the text although smaller than the suffix tree. If the size of the integers is accounted Chapter 2. Literature Review 59 for, the space is O(M log M) bits compared to O(M log q) bits for the text. The search time for a pattern Q is O(N + log M) when an additional LCP array (longest common prefix) is available. The compressed suffix array (Grossi and Vitter, 2000, 2005) achieved a substantial reduction in memory usage by representing the suffix array in a compressed form. An improved structure by Grossi et al. (2003) required

MHh + O(log log M/ logq M) bits, with search time O(N log q + polylog(M)).

The term Hh < log q is the h-order empirical entropy of the text. In practice, the compressed suffix array requires less than half of the memory needed to store the original text.

The FM-index (Ferragina and Manzini, 2000) is a compressed index based on the Burrows-Wheeler transformed strings (Burrows and Wheeler, 1994) having many similarities to a suffix array. Search requires O(N + log1+ǫ M) time. As with compressed suffix arrays, the text can be reconstructed from the index and there is no need to store the text. A substantial reduction in memory is achieved by using the compressed suffix array or FM-index to represent the text. As a result, these data structures are known as compressed self-indexes. Navarro and M¨akinen (2007) review numerous variations of these data structures.

The effectiveness of collection-based compression depends on extensive rep- etition of data within the collection, and there has been some investigation of the performance of compressed self-indexes on highly compressible data. How- ever the entropy model of data similarity only considers the predictability of a symbol using the its immediate neighbours, and is not able to exploit similarity over a large distance. Consider an incompressible substring which is repeated some billions of bytes further into the input file. M¨akinen et al. (2010) showed that an entropy-based compressed-self index does not exploit the duplication as the matching string is represented twice in the index. They referred to data 60 2.4. String Search containing such long-range duplicates as highly repetitive data and designed a compressed self-index employing a run-length encoding to take advantage of long- range duplicates (Sir´en et al., 2009; M¨akinen et al., 2010). Applying traditional data compression techniques to compressed-self indexes has since been the sub- ject of substantial research interest. Claude and Navarro (2011) constructed a compressed self-index using context-free grammars, which can be considered as a general framework for many compression schemes (Charikar et al., 2005). Kreft and Navarro (2011) compressed the index using the LZ77 compression format (Ziv and Lempel, 1977), which is also able to encode long-range duplicates, and found that the index size was 2.5 to 4 times the compressed file size, when it was com- pressed with p7zip. Despite these recent advances, for the moment compressed self-indexes are too large for use in collection-based compression as it is desirable for the index to take less space than the compressed collection. Index updates are another difficulty with suffix arrays that has seen recent advances. The traditional suffix array is a static data structure which must be rebuilt entirely to reflect any changes to the underlying text. Consequently, suffix trees are used for indexing dynamic texts. Sir´en (2009) developed an algorithm for merging two compressed suffix arrays. While this does not allow edits within an existing file, it does support batch insertion of new files into the collection, which is sufficient for collection-based compression. Furthermore, it allows a parallel construction algorithm for the compressed suffix array by constructing suffix arrays for partitions of the text that are subsequently merged. Salson et al. (2010) developed a dynamic suffix array which updates the index to reflect the insertion, substitution or deletion of a single symbol or contiguous string in the text, although the update time is slow for large texts. To summarise, since the commencement of this doctoral project, there have been significant advances in compressed self-indexes. It is a possibility that con- Chapter 2. Literature Review 61 tinued progress will make compressed self-indexes the preferred data structure for collection-based compression. Nonetheless the fingerprint index adopted in this project is the best alternative at present, and has several advantages which suggest it will remain the preferred data structure. Ferragina and Manzini (2010) compared compressed self-indexes with two stage compression scheme using Bent- ley and McIlroy (2001) delta encoding followed by LZMA compression of the delta, finding that self-indexes achieved compression several times less than BM with LZMA, and also that decompression was much slower. 
Query speed is a key consideration, requiring constant time in the fingerprint index. Also, the index size can be tuned, at the cost of reduced search effectiveness, by increasing the sampling length. This is not a criticism of compress self-indexes, but rather a consequence of the fact that they are designed for a different problem. Suffix arrays index the entire text and support search for all substrings. Collection- based compression has a relaxed search requirement as significant compression can be achieved by searching only for matching substrings longer than a certain minimum size. The additional capability of compressed self-indexes necessarily comes at the cost of storing significant additional data. On the other hand, it can be expected that the compressed self-indexes would achieve better compression as all matching strings can be found, rather than just long ones.

2.5 Delta Encoding

Having constructed a fingerprint index to find substrings in the collection string that match substrings of the query string, the next step in collection-based com- pression is to encode the query string by replacing these matching substrings with references to their duplicates in the collection string. This process is known as delta encoding or data differencing. In the delta encoding literature, the query 62 2.5. Delta Encoding string Q of length N is often known as the target file, and the collection string R of length M is known as the source file or reference file. The resulting file, which encodes the target using references to the source is variously called a delta, diff, or patch, file as it represents the differences between the two files. The problem of finding a delta is an example of the string-to-string correction problem, first described by Wagner and Fischer (1974): given two strings, find a sequence of edit operations which transform one string into the other. The solution depends on the allowed edit operations on substrings, such as insertion, deletion, substitution of characters, relocation (referred to below as a block move), and reversal of substrings. This type of delta encoding is known as COPY/ADD encoding, where the delta contains these two types of instructions which are executed in sequence to generate the reconstruction, Q′, of the query file. A COPY instruction describes a substring of R to be appended to Q′, and an ADD instruction describes a verbatim string to be appended to Q′. The delta encoding operation involves searching for matching substrings of Q and R that are encoded as COPY instructions. For example,

R: catdogpigcowgoat Q: pigxyzgoatcow ∆: COPY(7, 3), ADD(xyz), COPY(13, 4), COPY(10, 3) where the COPY (position, length) instruction specifies the location and length of the matching substring in R. Delta encoding is widely used as a compression technique to reduce storage space and network bandwidth use. For example, disaster recovery systems for large file systems typically store snapshots of the data as a delta with respect to the previous snapshot, requiring much less storage space than a full copy of the file system, and needing less time to transmit to a remote system for off- site storage. Other applications are efficient transmission of large files over the Chapter 2. Literature Review 63

Algorithm 6 Greedy algorithm for COPY/ADD delta encoding. All symbols from the alphabet Σ are assumed to occur in the reference string, so no ADD instructions are needed. Data: R, the reference string. Q, the query string of length N. Result: ∆, the optimal sequence of COPY instructions encoding Q.

1: i, j 0 ← 2: while i < N do 3: r, l the location r in R and length l of the longest match with the substring← at position i in Q. 4: δ (COPY ,r,l), and append δ to ∆. j ← j 5: i i + l, j j +1 ← ← 6: Return ∆. internet, synchronisation of data mirrors, efficient storage of multiple versions of dynamic files in version control systems, and efficient transmission of updated web pages.

2.5.1 Optimal Encoding

Tichy (1984) described several algorithms for COPY/ADD encoding that find a minimal delta, meaning one which describes the target file using the least number of references to the source file. He called this the string-to-string correction with block moves problem. The optimality of the greedy algorithm was also proved by Burns and Long (1997). Tichy defined a matching string reference (called a block move) as a triple (p,q,l) where R[p,p + l 1] = Q[q, q + l 1] is a substring − − of length l which occurs in both the source file R, and target file Q. The delta is constructed as a covering set of block moves such that every element of Q is referred to in R in exactly one block move.

The algorithm is shown in Algorithm 6. The target file is encoded from beginning to end in a greedy manner, meaning that, at the current offset, the largest matching substring is used. The time complexity is O(NM) when the 64 2.5. Delta Encoding search for the largest matching substring uses a classical linear-time string search algorithm such as Knuth-Morris-Pratt, and the space complexity is O(N + M). Tichy improved on this by indexing the source file using a suffix tree, achieving O(N + M) time and space complexity. Tichy applied the algorithm in a line- oriented fashion to create the RCS version control system. The bdiff2 program

(Hunt et al., 1998) implements Tichy’s algorithm using suffix arrays and extends it to handle large files by sequentially processing the source and target files in blocks. Another suffix array-based implementation is bsdiff (Percival, 2003). An important observation is that, to achieve a compact encoding, the target and source files do not need to be similar in size. Rather, the source file can be much larger and contain any amount of non-matching data, provided it also contains substrings that match the target file. Tichy’s algorithm should not be confused with a related family of algorithms known as insert/delete delta encoding, which is implemented in the diff utility on

UNIX. It performs a byte-by-byte, or line-by-line comparison of two files using an algorithm discovered independently by Myers (1986) and Ukkonen (1985) solving the longest common subsequence (LCS) problem. The LCS solution achieves good compression when the source and target files are related by a sequence of insertion and deletion edits, but it does not take advantage of block moves. Consequently, the LCS solution is unable to use similar data from any location in the source file, but is limited so that the matching data occurs in same order in both the source and target files.

2This should not be confused with the bdiff program (‘big diff’) commonly installed on UNIX, which compares large files by partitioning them and then calling diff for each part. Chapter 2. Literature Review 65

2.5.2 Suboptimal Encoding with Minimum Match Length

With time complexity O(NM), the greedy algorithm is too slow for CBC. Fur- thermore, Tichy’s proof of optimality is valid only when matching strings as short as a single byte may be encoded as references. However, when using fingerprint- based search, matching strings shorter than n, the length of the n-grams on which the fingerprints were calculated, cannot be found and Tichy’s proof is not valid. To show this, consider the proof of optimality of the encoding generated by Algorithm 6.

Theorem 4. The greedy algorithm finds an encoding of the target string Q in terms of the reference string R using the minimum number of substring matches (COPY instructions).

Proof. (by induction from Burns and Long (1997)) Suppose there is a delta G found by the greedy algorithm, and a shorter delta P using fewer substring matches than G. Let gi and pi be the lengths of the i-th substring matches

j in G and P . The first j instructions from G encode the first i=1 gi symbols of P the target string Q. We assume that P uses fewer instructions than G, so we look for the first j instructions where j g j p as P must encode more symbols than G i=1 i ≤ i=1 i P P using the same number of instructions. Hence p j−1 g j−1 p + g . j ≥ i=1 i − i=1 i j P P However, at the beginning of the delta, g p as the greedy algorithm 1 ≥ 1 selects the longest match. So, for some instructions at the beginning of the delta j−1 g j−1 p 0, and the assumption requires that, at some j, p g . i=1 i − i=1 i ≥ j ≥ j P P But G always takes the longest match so there can be no string match starting

j at i=1 gi and longer than gj. Thus there can be no P encoding the target string P with fewer COPY instructions than G, and the greedy algorithm finds a delta using the minimum number of COPY instructions. 66 2.5. Delta Encoding

When there is a minimum match length n, it is possible that the longest match at the current position in Q is shorter than n, so no match will be found. The next few characters will be encoded as ADD instructions until the start of the next matching substring. The greedy algorithm using string search with a minimum length, which pro- duces a suboptimal encoding, was implemented by Reichenberger (1991). He constructed a hash table referencing each n-gram in the source file, and then, to encode the target file, the fingerprint of the next n-gram was queried in the hash table, and all potential matches were then extended by a symbol-wise comparison to find the longest match. In the worst case, every n-gram could be a match or a collision, so the time complexity is O(NM). The space needed is O(M + N) as the hash table size is O(M). Burns and Long (1997) proposed a variation of the algorithm using constant space, which achieved a good encoding for files related by insertions and deletions, although it was unable to exploit block moves. As the algorithm only search for- ward in both files, it is unable to find string matches to locations earlier in file, and this algorithm finds a solution similar to the longest common subsequence. The target and source file were stepped through simultaneously and at each po- sition, the n-grams in both files were fingerprinted and queried in a hash table. If a match was not found, then the n-gram references were inserted into the index and the procedure repeated at the next position. If a match was found, symbol- wise comparison was used to confirm the match, then the match was extended forward in the two files until the next non-matching symbol. Then, importantly, the hash table was flushed and the searching procedure was continued from the end of the match in both files with an empty hash table. Constant space for the hash table was achieved by discarding references to n-grams having a fingerprint that had previously been inserted. Flushing the hash table is important to en- Chapter 2. Literature Review 67 sure that n-gram fingerprints at the current location are not all discarded. The

final innovation was to extend the symbol-wise comparison backwards as well as forwards. Ajtai et al. (2002) made further enhancements while retaining the constant space use. As with Burns and Long (1997), only one n-gram reference was stored in a hash table for each fingerprint value; subsequent (or prior) fingerprint occur- rences were discarded. Also, string matching was extended both backwards and forwards from each matching n-gram to determine the full extent of the match. This created the possibility that an extended match covered parts of the target file which had already been encoded, so a correction step was added to merge or replace overlapping matches. Of particular interest for large collections is their correcting 1.5-pass algorithm with checkpointing. The hash table for the source file was created first, and could be pre-computed. The default behaviour of 1.5-pass is to store the first occurrence of a fingerprint and discard subsequent instances.

However, for large source collections there will be many fingerprint duplicates (from repeated n-grams as well as collisions) resulting in less coverage of the data toward the end of the source file. To achieve even, although sparse, coverage of the source file, a system of checkpointing was proposed such that in any region of the query file a subset of the fingerprints were replaced if a duplicate occurred, while outside of the region, duplicate fingerprints were discarded. For example, the fingerprints beginning with 0000 form a set containing 1/16-th of the total fingerprint space. The target file was divided into blocks and a checkpoint (a particular value for the bottom 4 bits of the hash) was associated with each block such that only n-grams with hashes in the checkpoint are inserted into the index. Once the source file had been indexed, the target file was encoded by scanning from start to finish. But observe that this is not a greedy algorithm as the string search does not find all matches and select the largest. Moreover, all but one of 68 2.5. Delta Encoding the references to the many occurrences of a repeated n-gram have been discarded from the index, so the string search can only find a single matching n-gram. While, the performance of this algorithm for very large files is not presented, it provides some scope for tuning the balance between compression performance and resource use.

Chedid and Mouawad (2006) investigated variations of the Ajtai et al. (2002) algorithms. Specifically, rather than only storing one string offset for each fin- gerprint and discarding subsequent occurrences, they stored up to l references in a linked list, only discarding occurrences when the list was full. Compared with the l = 1 case, there was a significant improvement in compression. It would be interesting to compare with the performance gain when the hash table size is increased by an equivalent amount. Bentley and McIlroy (2001) used a sampled fingerprint index for the string search in their delta encoding scheme as described in Section 2.4.3. They used a sample period of k = n, and found that n = 100 was the most effective n-gram length. The xdelta program (MacDonald, 2000) constructs a hash table of the 16- grams of the reference file. As the file is indexed, a new fingerprint replaces an existing fingerprint in the hash table, thereby limiting memory use to a constant size. Consequently, like the Ajtai algorithm, it is not a greedy algorithm as only one match is found. This is then extended forwards and backwards using symbol-by-symbol comparison. The vdelta delta encoding tool used hash-based search (Hunt et al., 1998).

It is reported to be 3 to 6 times faster than bdiff and uses less space, although at a small cost in compression performance, as it does not always find the longest matching substring. It is included in the vcdiff tool set (Korn and Vo, 2002). The hsadelta utility employs another differential encoding algorithm using Chapter 2. Literature Review 69 constant space (Agarwal et al., 2006). It uses a sampled fingerprint index in which every n-th n-gram is indexed, and achieves constant space by varying n with the input size so that the number of index entries is constant. Consequently, worse compression is expected for larger collection sizes. Also, increasing the size of the collection requires re-hashing all of the fingerprints with a larger n, which is not practical so the size of the collection must be known in advance. However, hsadelta includes features that are likely to be useful for CBC. It uses hash sampling with a fixed sampling period in order to reduce the index size to a fraction of the collection size. The problem of false negatives is addressed by including several data structures intended to quickly answer most queries for

fingerprints which are not in the index. Another feature is that it uses a larger fingerprint space of 264 1 to obtain a very small probability of hash collisions. − This index is sparsely populated even for multi-terabyte collections, so a tree structure is used for the index, rather than a hash array. As hsadelta records duplicate fingerprints, it is able to locate all instances of a repeated long substring. Where multiple matches are found, it searches for the longest in a greedy manner.

2.6 Content-Defined Selection and Chunking

As described in the previous section, an index of every k-th n-gram fingerprint may have up to k 1 false negative index queries before a matching substring is − identified. This inefficiency can be avoided using content-defined selection meth- ods, in which a set of locations in a string is deterministically selected based solely on the symbols in some limited vicinity near each location. We refer to the selected locations as selection points, although various names are used in the literature, such as anchors (Manber, 1994), break points (Brin et al., 1995) and cut points (Bjørner et al., 2010). By adding fingerprints to the index only at 70 2.6. Content-Defined Selection and Chunking selection points, we know that a fingerprint that is not at a selection point will not be in the index, and the false negative index query can be avoided. The set of selection points is denoted S . Location i in string Q is a selection point, i S , P ∈ P if the content of the substring W = Q[i,...,i+w 1], called the local window, of i − length w satisfies some criterion. For example, if Wi is an element of a criterion set C Σw, or W satisfies a criterion function, such as having the lowest hash ⊂ i value in the local vicinity. Selection points have the useful property that, when changes are made at a distant location in the string, their position relative to their local neighbourhood is unchanged. Chunking describes methods for partitioning a string at selection points, and the resulting substrings are known as chunks. Sometimes in the literature, chunk- ing is used as a general term for partitioning a string, including at points selected independent of the string content, such as into n-grams at every k-th offset as described previously (Bjørner et al., 2010). In our terminology, “chunk” refers to a content-defined piece, and a fixed-offset substring is an n-gram. When it is necessary to use the more general sense, we qualify it such as “content-defined chunks” and “fixed-offset chunks”. To understand the usefulness of content-defined selection, consider the phased offset problem, illustrated in in Figure 2.3. We wish to search for matching data in two long strings by comparing some selected substrings. When the second string is a copy of the first, except for an insertion or deletion, the symbols following the edit are shifted to the right or left. If we compare substrings selected at fixed offsets, the selected substrings will not match, and the similarity of the data will not be found. However, when we compare substrings selected according to their local content, which is unchanged following the edit, the matching data is detected. We have already seen one solution to the phased offset problem: the Karp- Chapter 2. Literature Review 71

Input strings: R = abcdefghijklmnopqrstuv Q = abcdefZghijklmnopqrstuv

Fixed-offset selection (4-grams): abcd efgh ijkl mnop qrst abcd efZg hijk lmno pqrs

Content-defined chunking (w =2): abc defg hij klmno pqrst uv abc defZghij klmno pqrst uv

Figure 2.3: An example of the phased offset problem. The new string Q is a copy of the original string R with a Z inserted near the start. When fixed- offset selection is used, none of the sampled 4-grams following the insertion in Q match the sampled 4-grams from R, even though part of the strings match. With content-defined chunking, the chunks are terminated by the selection points (underlined), and the chunks following the insertion in Q match the chunks in R.

Rabin string search algorithm. One of the strings is sampled at fixed offsets, then those selected n-grams are compared with every n-gram from the other string. However, there are situations when it is not practical to compare every n- gram, and content-defined selection is preferred. Applications include obtaining approximate measures of file similarity (Manber, 1994; Broder, 1997) reviewed in Section 2.8; plagiarism detection (Brin et al., 1995; Schleimer et al., 2003); identification of near-duplicate results for web searches (Broder, 2000; Potthast and Stein, 2008); remote differential file compression (Muthitacharoen et al., 2001; Bjørner et al., 2010); and file de-duplication (Quinlan and Dorward, 2002; Zhu et al., 2008). Also, content-defined selection is a key component of the data archive systems described in Section 2.9.

Several approaches towards content-defined selection have been proposed, each with its own selection point criterion. Their effectiveness is largely de- termined by the statistical properties of the distribution of selection points. Par- 72 2.6. Content-Defined Selection and Chunking ticularly desirable qualities are a small variance in the distance between selection points (also referred to as the chunk length as chunks are terminated at the se- lection points), bounds on the smallest and largest distances, and the expected number of symbols before synchronised selection points occur following a transi- tion from non-matching to matching data, a measure named slack (Bjørner et al.,

2010). The slack quantifies susceptibility to the phased offset problem. In the case of fixed-offset selection, the selection points may never synchronise follow- ing a transition to matching data, and the slack is unbounded. The statistical properties of the chunk length for several content-defined selection algorithms are summarised in Table 2.1.

Manber (1994) proposed two methods. He first employed a dictionary of short (4 symbols) substrings named anchors, and defined selection points to be the loca- tions where anchors occur in the input string. However, the set of anchors needed to be tuned to the bias of the input data, and he reported anchors occurred ei- ther too frequently or too rarely, so that it was difficult to select an anchor set obtaining well-distributed selection points. Longer anchors, and a larger dictio- nary, involved increasing computational cost, as each anchor is compared with each location in the string. Manber therefore preferred a second method in which n-gram hashes are calculated over the string with a rolling hash, and the selec- tion points are locations of n-grams having a hash h, such that h 0 (mod c). ≡ The parameter c determines the expected distance between selection points. For example, for c = 16, the criterion set C comprises the n-grams having hashes with 0 in the 4 least significant bits, and the expected distance between selection points is 16 symbols. An example is shown in Figure 2.4 (a). This scheme has two main advantages: the anchors are defined by the hash function, and do not need to be selected by hand; and the anchor comparison is performed in a single integer comparison of the hashes, which is faster than a symbol-by-symbol string Chapter 2. Literature Review 73

Manber (1994) Selection point at i when h(i) 0 (mod c). Probability p of a selection point≡ at i is p =1/c. Chunk length L distribution is geometric, Pr [L = j]= p(1 p)j−1. Local algorithm. −

E(L) c Min(L)=1 E(slack)/E(L) 1 1/c Var(L≈) c2 c Max(L) Pr [L>KE(L)]≈>e−−K ≈ − → ∞ Muthitacharoen et al. (2001) Selection point at i when h(i) 0 (mod c) if L>m. LBFS method Probability p of a selection point≡ at i is p =1/(c + m). Non-local algorithm.

E(L) c + m Min(L)= m E(slack)/E(L) 1.41 when k 1.281. Var(L≈) c2 c Max(L) Pr [L>KE(L)]≥>e−(k+1)K ,≈ k = m/c ≈ − → ∞ Bjørner et al. (2010) Selection point at i when h(i) 0 (mod c) and Interval filter method h(j) 0 (mod c), i ≡m j

E(L)min em Min(L)= m E(slack)/E(L) 0.767. Var(L) ≈(1 (2m + 1)p)/p2 Max(L) Pr [L>KE(L)]≈> Ae−BK , ≈ − → ∞ 0

Schleimer et al. (2003) Selection point at i when h(i) is minimum in window Winnowing of width w. Probability p of a selection point at i is p =1/w. Local algorithm.

E(L) w Min(L)=1 E(slack)/E(L) not known. Var(L≈) not known. Max(L)=2w 1 − Bjørner et al. (2010) Selection point at i when h(i) is maximum in window of Local maximum width from i w to i + w. Probability p of a selection− point at i is p =1/(2w + 1). Local algorithm.

E(L) 2w +1 Min(L)= w +1 E(slack)/E(L) not known. Var(L≈) not known. Max(L) Pr [L>KE(L)] < 22K /(2K)! → ∞ Fixed offset Selection point at i when i 0 (mod n). Non-local algorithm. ≡

E(L)= n Min(L)= n E(slack)/E(L) Var(L)=0 Max(L)= n → ∞ Table 2.1: Comparison of the statistical properties of the distance between adjacent selection points (chunk length), L, for several content-defined selec- tion methods. Principally from Bjørner et al. (2010). 74 2.6. Content-Defined Selection and Chunking comparison.

Brin et al. (1995) extended this approach to develop content-defined chunking, which they used to find blocks of matching data for plagiarism detection. When the hash of a 48-byte sliding window equalled a specified value, a break point was defined, the chunk terminated at the end of Wi, and a new chunk commenced at the next position. The expected chunk length is c, although they observed that care must be taken of particular situations, such as files containing long runs of zeroes that would contain either, depending on the hash function being used, no selection points, or a selection point at each location. Muthitacharoen et al. (2001) employed the content-defined chunking of Brin et al. (1995) in their low bandwidth file system (LBFS), using c = 213 for an expected chunk length of 8 192 bytes, and a 48-byte window. While the probability of very long chunks diminishes gradually, there is no upper bound on chunk length. Muthitacharoen et al. (2001) used a heuristic to limit the lengths of chunks: they implemented a maximum chunk length of 216 by terminating the chunk and starting a new one. Also, a minimum chunk length of m = 211 bytes was imposed by ignoring any selection points occurring when the chunk was shorter. However this modification violates the locality of the algorithm: the location of a selection point can now be affected by symbols outside of the local window (Schleimer et al., 2003). The problem occurs when there is a concentration of potential selection points in close proximity, such that some of them are skipped over for being too close to their predecessor. An edit changing the first selection point in the chain will cause a different grouping of the following selection points to be skipped. Muthitacharoen et al. (2001) observed a significant departure from random- ness where some input files had a higher than expected proportion of excessively long chunks. They associated these with highly structured or stylised files. The Chapter 2. Literature Review 75

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Lizards and frogs

Hashes (of 8-grams) 20 16 11 31 11 28 24 17 8 29

(a) Manber (c =8)

16 24 8

(b) Winnowing (window has 4 8-grams)

11 20 16 31 11 16 31 11 11 31 11 28 11 31 28 24 11 28 24 17 8 28 24 17 8 24 17 29

Figure 2.4: Content-defined selection methods. Hashes of 8-grams are shown below the string “Lizards and frogs”. The selection points are shown as circled hash values. (a) Selection points from the Manber method are the positions of 8-grams with hash divisible by c = 8. (b) For winnowing, the hashes within each step of the sliding window are shown, and the position of the minimum hash is the selection point. 76 2.6. Content-Defined Selection and Chunking problem is that, in repetitive strings, there are fewer unique n-grams, which changes the probability distribution. Bjørner et al. (2010) described interval filter chunking, where a selection point is defined at position i if the n-gram at i meets the criterion S C, and the m i ∈ preceding n-grams are not in C. This differs from the LBFS method by excluding all selection points in a close bunch. Interval filter chunking is claimed by US patent 7555531 (2009). Schleimer et al. (2003) addressed the problem of excessive gaps with a method named winnowing, which guarantees an upper bound on the distance between se- lection points, shown in Figure 2.4 (b). In a sliding window containing w adjacent n-grams, hence having length w + n 1 symbols, the position of the n-gram hav- − ing the minimum hash is chosen as the selection point. When the minimum hash occurs at multiple n-grams in the window, the right-most point is selected. Again, this method is vulnerable to input containing long sequences of repeated symbols. All n-grams in the local window are the same and have the same hash, so every position is a selection point. The authors introduced a modification that the selection point of the previous window is chosen if it is also a minimum in the current window, otherwise the right-most minimal hash is selected, although this violates the locality of the algorithm similarly to the heuristic of Muthitacharoen et al. (2001). As each window must contain a selection point, the distance be- tween selection points is bounded by w + n 1. It is possible, although unlikely, − for selection points to be adjacent, and long runs of adjacent selection points due to repetitive input data are prevented by the select previous tie-breaking rule.

Winnowing is claimed by US patent 6 757 675 (2004). A team at Microsoft (Teodosiu et al., 2006) developed the local maximum chunking algorithm, which is closely related to winnowing, and presented a de- tailed mathematical analysis of the chunk length distribution (Bjørner et al., Chapter 2. Literature Review 77

2010). A location i is a selection point if the hash of the n-gram at i is the max- imum of the n-gram hashes of the preceding w and following w positions. The authors were unable to find approximations for the chunk length variance and expected slack. Experimental comparisons on several types of data showed that local maximum chunking had smaller variance than either the Manber method or interval filter chunking. The probability of long chunks decreases faster than exponentially, a significant advantage over the previous methods. In particular, the probability of chunks longer than 4 times the expected chunk length (K = 4) is smaller for the local maximum algorithm than the Manber method and the LBFS method with k = 1/4. When K increases to 7, the probability is 1 000 times less than LBFS with k = 1/4. There is an additional computational cost for finding the maximal hash in the local window, compared to the interval filter and Manber algorithms, which may all be processed in a single pass through the input. The authors give an algorithm requiring

ln w 1 1+ + O w w comparisons per symbol on average, which is an insignificant additional cost. Lo- cal maximum chunking is deployed in the remote differential compression (RDC) algorithm of the Microsoft Distributed File System Replication (DFSR) engine, a component of the Windows operating system. Local maximum chunking is claimed by US patent 7555531 (2009). Another contribution of Teodosiu et al. (2006) is recursive chunking. After the initial chunking step, the individual chunks are hashed and concatenated into a string in the same order as they appear in the input string. The chunking algorithm can then be applied to this string to create chunks of chunks. The advantage is that very long substrings of matching data, spanning numerous 78 2.6. Content-Defined Selection and Chunking chunks, can be identified in a single comparison.

Chunking biased data The detailed statistical analyses of these algorithms by Bjørner et al. (2010), shown in Table 2.1, assume that the input string has maximum entropy: specifically that the probability of a symbol occurring at a location in the input string is independent of the symbols at other locations, and that the probabilities are uniformly distributed. However, this assumption does not generally apply to the biased data encountered in CBC, or the various appli- cations for content-defined chunking noted above. Bjørner et al. (2010) reported results with non-random data that were similar to their statistical analysis, but they also noted specific cases that deviated from the uniform-derived values. Real-world data departs from the randomness assumption in two distinct ways. The first involves data from a biased Markov source. That is, the prob- ability model is stationary and the resulting string exhibits no periodicity, but the symbol frequencies are non-uniform and dependent on a finite window of prior symbols. Natural language data typically approximates these statistical properties. Bjørner et al. (2010) attempted to address this using a preliminary “randomisation” step: they replaced each symbol in the input S with a hash value calculated over a small window, using a Rabin rolling algebraic hash func- tion. The intention was that the string of hashes would conform more closely to the randomness assumptions required by their statistical analysis. They conceded that it is not possible to increase the randomness of a string using a deterministic hash function, and also observed that omitting the preliminary hashing step did not lead to poorer statistical properties of chunk length distribution. Essentially, they demonstrated that, for rolling hash functions, a hash of a hash is no more random than a hash from a single function. This observation agrees with the the- oretical result from Lemire and Kaser (2010) that rolling hashes of overlapping Chapter 2. Literature Review 79 n-grams are at most pair-wise independent.

The second, more problematic, mode of departure from randomness is the occurrence of repeated n-grams in the input string. This has already been noted when describing the motivation for adding heuristics to the LBFS (Muthi- tacharoen et al., 2001) and local maximum algorithms (Bjørner et al., 2010): a long sequence of repeated symbols can lead to either selection points at every location, or no selection points until the conclusion of the sequence. The pre- liminary hash encoding described in the preceding paragraph is of no help as the repeated n-grams create repeated hash values. In fact, Bjørner et al. (2010) observe that the occurrence of long sequences of repeated symbols prevents the existence of a chunking method having bounds on both minimum and maximum chunk length. One remedy suggested by Bjørner et al. (2010) is to perform a pre- liminary run-length encoding step, in which sequences of repeated substrings are replaced with a single occurrence and the number of repeats, but no evaluation of this approach is presented. In a chunking application, there is typically a cost relating to the number of chunks, and a benefit from avoiding many short chunks. The local maximum algorithm is preferred as it has a minimum chunk size, a rapidly diminishing probability of long chunks, and a smaller chunk length variance compared to the other methods which also have a minimum chunk length. However, in an application involving selection of n-grams for indexing, the cost of short distances between selection points is merely some additional index entries. Alternatively, if some of the selection points are not indexed, the cost is the possibility of some false-negative index queries. In this situation, it may be better to limit the occurrence of very long distances between selection points, and the winnowing algorithm is preferred. All methods are unable to handle long strings of repeated symbols in a prin- 80 2.7. Delta Compression Systems cipled manner. In many application domains, however, this is not problematic, or else can be controlled by specific workarounds.

2.7 Delta Compression Systems

Delta encoding can achieve substantial reductions in file size, and, when combined with the techniques of data compression to achieve further reductions, is known as delta compression or differential compression. Generally, this involves further processing of the delta file to achieve a compact format. The principal tech- niques employed are efficient codes for the substring position and length values in the COPY instructions, storing the COPY and ADD instructions in separate segments of the delta file to enhance subsequence compression, and the use of standard data compression tools (gzip, 7-zip and bzip2) to compress the ADD instructions, or sometimes the whole delta file. Hunt et al. (1998) made a significant contribution in vdelta by using COPY instructions to encode strings matching data earlier in the target file as well as in the reference file. Conceptually, the target file is appended to the reference

file and they are both read by vdelta as a single input stream, which then emits the delta encoding for the portion of input from the target file. The encoding refers to matching strings occurring anywhere in the previous input, which is the reference file and that part of the target file that has already been read. This is a generalisation of the dictionary encoding of Ziv and Lempel (1978) to delta encoding. As with delta encoding, in Ziv-Lempel compression the input stream is encoded as references to matching strings occurring previously in the input. Vdelta can perform traditional file compression — create a Ziv-Lempel encoding of a file with respect to itself — by loading an empty source file. Hence LZ77 compression is a special case of delta compression. Chapter 2. Literature Review 81

Another contribution from vdelta is the VCDIFF file format for differentially compressed data, which supports several coding schemes for efficient storage of file offsets (Korn and Vo, 2002; Korn et al., 2002). This may be a suitable basis for a CBC file storage format, with the addition of capability to refer to multiple source files.

The zdelta utility (Trendafilov et al., 2002) builds on vdelta, achieving fur- ther compression by Huffman coding the references to matching strings in the source file. It is implemented by modification of the zlib library for Ziv-Lempel compression. Memory is limited for the hash index by imposing a sliding window on the input files.

An alternative to using a data compression library is to compress the delta file using a data compression tool. Bentley and McIlroy (2001) used a two-stage pipeline in which the delta created in their first stage was compressed with gzip in the second stage.

Delta compression with multiple reference files Delta encoding with re- spect to multiple reference files is a simple extension of single-file delta encoding by concatenating the reference files before encoding. However, few systems have been implemented, and some technical enhancements to single-reference-file delta encoders are needed before it is practical. A rudimentary system can be implemented using existing tools. For example, the reference files could be stored in concatenated form using the tar utility, then encode the query file with respect to the tar file with xdelta or vcdiff. The Zdelta utility (Trendafilov et al., 2002) is the only delta-encoder that explicitly supports multiple reference files. However, it was developed for encod- ing HTML files in web caches, and is quite limited for the purposes of CBC. It only allows four reference files, and does not accept large input files. 82 2.8. Resemblance Detection

2.8 Resemblance Detection

An alternative approach to CBC is to identify the file in a collection which is most similar to the query file, then use that as the reference file to create a delta encoding of the query file. Techniques for identifying similar files have been in- vestigated for other applications unrelated to CBC, such as plagiarism detection, removal of near-duplicates from web search queries, content-addressable file sys- tems, distributed file systems, and peer-to-peer file systems. Many approaches have been proposed, such as heuristics to identify related file versions based on file names, or using domain-specific knowledge such as the similar content be- tween an MSword file and its PDF exported version. These are reviewed in detail in Kumar and Govindarajulu (2009). Here, we review the techniques known as resemblance detection that have been use in the collection compression systems considered in Section 2.9. What does it mean to say that two files are similar? The notion of similarity or resemblance varies according to the needs of the application domain. In revi- sion control, similar files are related by a small number of edits. In plagiarism detection, similar files are considered to have blocks of identical text in a similar order. For CBC, a looser notion of similarity is needed: that the similar files share some common blocks of data. The shared substrings do not have to be in sequential order, but can be in arbitrary locations as though the files have been shuffled. Similarity detection involves extraction of features from a file that summarise its contents, then searching for files with some common features, and finally selecting the file with the features having the closest fit to the original file. The features may be shared substrings, or more typically smaller features such as fingerprints of blocks of data. Chapter 2. Literature Review 83

Broder (1997) proposed two measures to describe the degree of similarity between two files, A and B,

S(A, w) S(B,w) the resemblance of A and B r (A, B)= | ∩ | (2.16) w S(A, w) S(B,w) | ∪ | S(A, w) S(B,w) the containment of A in B c (A, B)= | ∩ | (2.17) w S(A, w) | | where S(A, w) is the set of features in file A. To construct the features, Broder tokenised the reference string using word or line boundaries, then defined a shingle of length w to be the substring containing w contiguous tokens. Fingerprints of the shingles, calculated using a recursive hash function, were stored as the set of features. The distance between A and B may be defined as d(A, B)=1 r(A, B), − which obeys the triangle inequality d(A, C) d(A, B)+ d(B,C), and is a metric. ≤ This allows file similarity to be extended beyond pair-wise comparisons, such as the detection of clusters of similar files. For CBC, this is a more appropriate measure of similarity than previous metrics such as edit distance and Lewenstein distance. An alternative estimate resemblance and containment is obtained by selecting the s smallest fingerprints to form a sketch of the file. The common regions in two files will generate the same shingles and fingerprints, and, depending on how many non-matching shingles are in the two files, the sketch is expected to contain some of the matching fingerprints. Manber (1994) proposed the approximate fingerprinting technique for evalu- ating file similarity, and developed the Sif program to find all similar files in a file system. An approximate fingerprint of a file is a subset of the n-gram fingerprints from the file, which are calculated using a recursive hash function as described in Section 2.3. The n-gram fingerprints in the approximate fingerprint are selected 84 2.9. Collection Compression using the anchor method described in Section 2.6. The approximate fingerprint is a set F = x , x ,...x where x is the fingerprint of the n-gram at the i-th { 1 2 m} i anchor in the file. Files having approximate fingerprints which match for some threshold proportion p =(F F )/(F F ), such as 50%, are reported as being A ∩ B A ∪ B similar. Using an anchor density to generate approximately 4–5 fingerprints per kByte of data, the sorted list of anchors was stored on disk using approximately 5% of the space of the files. The algorithm was effective at identifying similar files, able to reliably match a file with a version having been approximately 50% changed with random edits.

2.9 Collection Compression

We now turn to the problem of compressing entire collections. Specifically, we are interested in schemes which allow random access to data in the collection, or at least pseudo-random access where the desired data is efficiently accessed by decompressing a larger block. This is an essential requirement if the collection is to be used for CBC, which requires the retrieval of strings from all parts of the collection. Random access is also necessary for many other applications, such as with text or genomic data, that retrieve files or records without decompressing the entire collection. Compressed collections may also have improved access times, compared to uncompressed storage of the data, as fewer reads from external storage are needed, and decompression is often faster than the additional external reads of uncompressed data (Yiannis and Zobel, 2007; Hoobin et al., 2011). Furthermore, CBC can be used to compress a collection by successive steps of compression and insertion. To illustrate, consider an empty collection. The

first file is compressed with respect to itself, then the compressed file is inserted into the collection. The next file is compressed relative to the file in the collec- Chapter 2. Literature Review 85 tion, then its delta file is inserted, and so on. The complexity of this and related schemes was investigated by Storer and Szymanski (1982). They assumed that compression is achieved by replacing a string with a pointer to a duplicate string elsewhere in the collection. When the pointers refer to the original (uncom- pressed) files, there is an O(n) greedy algorithm, where n is the total length of the collection files, to find the most compact compressed representation of the collection. However, as each file contains pointers to other uncompressed files, many other files may need to be decompressed in order to decompress a single file. This is avoided when the pointers refer to the compressed representations of the collection files, as is the case when using the iterative CBC procedure described above. However, finding the most compact representation of the collection using compressed pointers is NP-complete. Storer and Szymanski (1982) also consider several other restrictions on the pointers, including the use of recursive point- ers that refer to compressed data which itself contains a pointer. Decompression with recursive pointers require multiple passes to resolve all of the pointers. They found that, when recursion is prohibited, the most compact representation of the collection is O(√n). Consequently, solutions to the collection compression problem must be heuris- tic. One approach is to find a representative subset of the collection, from which an index is constructed and the remainder of the collection is compressed rela- tive to the subset. Examples of this approach are Cannane and Williams (2002) for general-purpose data, and Kuruppu et al. (2012) for genomic data using a reference genome. However, our definition of CBC uses data from throughout the collection, so this approach is not within our scope. We now describe several systems which do employ techniques similar to CBC. Douglis and Iyengar (2003) evaluated the effectiveness of delta encoding via resemblance detection (DERD) for compression of data collections. They com- 86 2.9. Collection Compression pressed collections of HTML files from web site crawls, and email collections of a number of individual people, using resemblance detection to identify similar files, then delta encoded pairs of similar files. First, duplicate files were elim- inated from the collection. Duplicate files were identified by comparing their MD5 hashes. Next, compressed files were decompressed. Then resemblance de- tection and delta encoding was attempted. The resemblance detection followed the approach of Broder (1997) whereby features were generated by calculating fingerprints of the shingles (approximately 30 bytes long) from the files. These features were sorted, and the lowest 30 were retained as the sketch of the file. Similar files for delta encoding were identified as having a minimum number of common features in their approximate fingerprints. When there were multiple files sharing the same number of features, several pairs were delta encoded and the smallest delta was used, but an exhaustive search of all such pairs of files was not performed. Delta encoding was performed using vcdiff (Korn and Vo, 2002).

Finally, files for which a similar file could not be found were self-compressed, again using vcdiff by delta encoding with respect to an empty file. They tested the compression performance for all minimum common feature thresholds from 1 to 30, and observed a monotonic increase in compression as the minimum number increased, although the improvement beyond approximately 20 features was minimal. They also evaluated retention of 100 features in the sketches and found that the compression improvement was marginal compared to that achieved with 30 features. Overall, the compression achieved on the HTML collections was 2 to 3 times more than using self-compression alone. The results were not as dramatic for the email collections, and they observed that most of the compression achieved was due to a small number of large files.

The principal shortcomings of DERD were that (i) the number of pairs of files sharing a minimum number of features increases faster than O(n), so the approach of comparing the sizes of delta encoded pairs does not scale well; and (ii) as each file had a fixed number of features in its sketch, larger files were less likely to be identified as being similar. Essentially, the feature density decreased as the file size increased, making it less likely that similar data could be identified. Also, DERD is unable to identify a file with high similarity to a fractional part of a larger file; it is only able to detect file similarity, and not file containment.

These disadvantages motivated the development of redundancy elimination at the block level (REBL) (Kulkarni et al., 2004). Each collection file is divided into content-defined chunks of approximately similar size, then REBL proceeds in a manner similar to DERD. First, duplicate chunks are identified by comparing their SHA hashes, and eliminated. Then resemblance detection and delta encoding are attempted on the chunks. Finally, the remaining chunks are self-compressed. A further enhancement was the evaluation of two alternative procedures for selecting the pairs of similar chunks for delta encoding. In FirstFit, a chunk is arbitrarily selected as a reference chunk, then all other chunks having at least one matching feature are delta encoded with respect to it. This is a fast process, although it is unlikely that it will achieve the best pairings of files. The alternative procedure, BestFit, sorts the chunks according to the greatest number of matching features with other chunks. These are the candidate reference chunks, and they are then evaluated by finding the number of other chunks which share a minimum number of features with them. These are then delta encoded with respect to the reference chunks. BestFit achieves better compression than FirstFit, and also uses a smaller number of reference chunks; however, it has quadratic time complexity. As with DERD, they used collections of HTML files from crawling a number of web sites, and also email collections.

PRESIDIO is a system for efficient compression of data for archival storage (You et al., 2011; You, 2006). In several aspects, it follows a similar approach to REBL, but it uses a larger range of features for identifying similarity, and provides a framework for adding new types of features in the future. The features used in PRESIDIO are:

1. Digests: strong hashes (SHA-1 or MD5) of whole files.

2. Chunk digests: strong hashes (SHA-1 or MD5) of file chunks. The chunks are generated with a weak hash (Rabin fingerprints) using the method of Manber.

3. Chunk lists: concatenation of the chunk digests from a file.

4. Sketches: Using the method of Broder, these contain a deterministic sample of fingerprints of shingles of chunks. The shingle fingerprints are generated using a weak hash function (32 bits).

5. Super-fingerprints: fingerprints of permutations of shingles.

Each of these feature types is stored in a single virtual content-addressable store (VCAS) in which objects are retrieved by reference to their fingerprint.

One consequence is that hash collisions are ignored, although the probability of a collision on the strong hash functions is insignificant compared to the probability of a disk error. To add a new file to the archive, PRESIDIO attempts to match each of the feature types in the order of least to most computational effort. First, the digest for the entire file is calculated. If it already exists in the VCAS, then the file is a duplicate and no further compression needs to be performed. If not, the file is chunked, the chunk digests are calculated, and duplicates of chunks already existing in the VCAS are removed. The remaining chunks are shingled and their sketches calculated. Similar chunks are identified in the VCAS, and the new chunks are delta encoded with respect to these.

To summarize, in many circumstances, PRESIDIO achieves excellent compression. However, there are two key aspects of the design which may limit its effectiveness and speed. First, de-duplication of chunks ignores potential matching data before and after the matching chunk. Finding the full extent of the matching data surrounding a matching chunk would allow the entire match to be encoded using fewer bits. Also, consecutive matching chunks could be joined into a single encoded reference. Second, retrieving a file from the collection, which has been delta encoded, requires that its reference file is fully restored, which in turn requires that all predecessor reference files are fully restored. PRESIDIO takes some measures to limit this by selecting reference files with a short dependency graph where possible, and by only delta encoding using a single reference file. However, this retrieval process performs substantial work which does not contribute to the end result; parts of the reference files are restored which are not needed for the requested file, or data is retrieved multiple times where it occurs in several reference files.

Bigtable by Google Inc. (Chang et al., 2008) is an alternative approach which used a sampled hash index and delta encoding similar to Bentley and McIlroy (2001) as a preliminary filter before compression with an LZ-style compressor. Only a brief sketch of the algorithm is provided in Chang et al. (2008), but the approach was investigated in detail by Ferragina and Manzini (2010). They compressed two large collections (one was 440 Gbytes) of web pages with a variety of existing compression tools. They found that Bentley-McIlroy (BM) compression followed by LZMA compression of the delta file was the best combination, achieving a compression ratio of < 5%. They concatenated the collection files into a single long file in order to compress the collection as a whole, rather than as individual files. Nonetheless, the BM delta encoding step followed by standard compression was clearly superior to compression without it. They also achieved compression improvement by changing the order in which files were appended to the concatenation. Both of these observations suggest there were many long-range matching substrings within the single file which standard compressors were unable to find.

2.10 Summary

We have reviewed string hashing and several algorithms which employ it for the purpose of collection-based compression. A recurring theme to emerge is that the behaviour of a particular CBC implementation will substantially depend upon the type of bias in the input. The behaviour of the hash function for generating fingerprints depends on the short-range bias over the length of n-grams or chunks, while the existence of long-range similarity greatly influences the effectiveness of the delta encoding and compression algorithms. Much of the design process will involve validation against representative data. Another theme is the poor worst-case bounds, and the use of heuristic algorithms. This is a necessity for the string hash function due to the impracticality of rehashing the collection. Similarly for delta encoding, the impracticality of the O(N + M) algorithm for large collections and the use of sampling to reduce the fingerprint index size have led to the development of several heuristic algorithms.

Finally, the need for random access into a compressed collection leads to the use of pointers to the compressed files, rather than the original files, for which finding the most compact representation is NP-complete. Nonetheless, from the many algorithms described here, and the experience of the collection compression systems, it is clear that there are many heuristic O(N) CBC algorithms which can be expected to achieve good compression.

Chapter 3

Collection-Based Compression

In Chapter 2 we described several algorithms and systems that can be adapted for CBC, or employed as components of a CBC system. However, as we describe in the following paragraphs, these schemes were designed for other purposes, and each has limitations in some way when used for CBC. In this chapter we implement and evaluate a new method of CBC that avoids these problems. It employs many elements from the existing approaches combined with a new procedure of delta encoding with respect to multiple reference files, and a fingerprint index maintained on external storage. To motivate the design of our new scheme, we briefly recount how the existing approaches described in Chapter 2 are restricted in their ability to detect long-range matching substrings. Collection compression systems employ a CBC step to compress new files as they are added to the collection. Some of these systems, such as de-duplication systems (Zhu et al., 2008), partition the collection files into chunks, and do not extend matching substrings into adjacent chunks unless that entire chunk also matches. These systems identify matching strings using compare-by-hash, meaning that they do not retrieve the collection string to resolve collisions, instead relying on very long fingerprints to reduce the possibility of collision to a negligible probability. As de-duplication systems do not retrieve the data from the collection file, they are unable to extend a match forward or backward, and consequently cannot find the full extent of a string match extending into neighbouring chunks. Each long match is encoded as multiple chunk references, rather than as a single long reference, resulting in a less compact encoding. Collection compression schemes employing file similarity detection, such as DERD (Douglis and Iyengar, 2003), encode the input with respect to the most similar collection file, ignoring additional matches in other files. This was partially remedied by first partitioning the collection files into smaller chunks (Kulkarni et al., 2004, REBL) and then encoding with respect to the most similar chunk, but there are still potentially many matching substrings from other chunks that are not exploited, as well as matching substrings that span chunk boundaries. PRESIDIO (You et al., 2011) employed sketches and super-fingerprints to address the problem of matches spanning multiple chunks, but it restricted the number of files referenced by the encoding in order to limit the dependency graph of new files added to the collection, thereby ignoring additional matching substrings in other files.

Differential encoding and compression utilities, such as xdelta or vcdiff, are unsuitable as they do not maintain an index of the collection, instead reading the entire collection to construct the index before compressing a file. Furthermore, they view the collection as a single large file, so storing the index is only a partial solution as the index needs to be entirely reconstructed when any collection file is modified. A similar problem occurs using the archive mode of standard compression utilities, such as gzip and 7-zip: multiple files added to the archive at the same time are compressed simultaneously, exploiting matching strings shared between the files, but a file added to the archive at a later time is compressed only with respect to itself, making no use of string matches elsewhere in the archive. In order to exploit matching data in the previous files, the entire archive must be decompressed, then re-compressed together with the new file. In this chapter we present a new CBC scheme employing several techniques to overcome these problems. We investigate the effectiveness of our scheme using several algorithmic variations and parameter values, and identify configurations obtaining significant compression improvement. The investigation uses data from two application domains, web sites and genomics, where the benefit from CBC is particularly pronounced.

3.1 Overview of the CBC Scheme

The components of our CBC scheme are illustrated in Figure 3.1. Compression of the input file is via a two-step pipeline. The first step is a delta encoding, which produces a delta file that refers to matching strings in multiple collection files. The matching strings are located using the fingerprint index in memory, then the strings are retrieved from the externally stored collection files for collision resolution. The second step compresses the delta file with a conventional single-file compression utility, such as gzip, bzip2, or 7-zip. Decompression is achieved by reversing the pipeline: the compressed file is decompressed by a conventional utility to obtain the delta; then the COPY commands in the delta file are replaced by their original strings, which are retrieved from the collection files, producing the original input file. The implementation of our scheme is named cobald, a novel word beginning with “coba” from “COllection-BAsed compression”. When compared to existing delta compression schemes, cobald is most similar to the system of Bentley and McIlroy (2001), with the addition of a delta encoder able to refer to multiple reference files.


Figure 3.1: Functional diagram of the collection-based compression scheme implemented by cobald. The pipeline has two stages: the encode process performs a delta encoding of the input with references to multiple collection files, then the resulting delta file is compressed using a standard compression utility. The encode process queries the fingerprint index in memory to find matching strings in the collection files, then retrieves the matching strings from the externally stored collection files to resolve fingerprint collisions.

The index is a sampled fingerprint index of n-gram references. Fingerprints are calculated using a rolling hash function. Index entries identify the location of the n-gram in the collection using two coordinates: the file identifier and the position within the file. The n-gram references inserted in the index are selected at fixed offsets in the reference file, at every n-th position, similar to the method of Bentley and McIlroy (2001). In a later section we evaluate content-defined selection of n-gram references. Fingerprint collision resolution is performed during index construction. Distinct n-grams with the same fingerprint (collisions) are added to the index using a linked list, while re-occurrences of an n-gram that is already in the index are discarded. Note that the use of the linked list at each index entry incurs a substantial memory cost. It is included in cobald so that the effect of various fingerprint lengths can be evaluated, and should not be employed in an optimised CBC system. The fingerprint length would be chosen so that collisions were rare, and we expect that discarding these few collisions from the index would not result in a noticeable decrease in compression.
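The fixed-offset sampling of index entries can be illustrated with the following C++ sketch. It is a simplification rather than the cobald implementation: std::hash is a placeholder for the rolling polynomial hash, and a full implementation would also fetch colliding n-grams from disk to assign collision IDs.

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // An n-gram reference: the two coordinates stored in the index.
    struct NGramRef {
        uint32_t file_id;
        uint64_t position;
    };
    using FingerprintIndex = std::unordered_map<uint64_t, std::vector<NGramRef>>;

    // Insert every n-th n-gram of one collection file, discarding
    // re-occurrences of an n-gram fingerprint already indexed.
    void index_collection_file(FingerprintIndex& index, uint32_t file_id,
                               const std::string& contents, std::size_t n) {
        std::hash<std::string> h;
        for (std::size_t pos = 0; pos + n <= contents.size(); pos += n) {
            uint64_t fp = h(contents.substr(pos, n));
            auto& refs = index[fp];
            if (refs.empty())              // keep only the first occurrence
                refs.push_back({file_id, pos});
        }
    }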

The encoding operation is suboptimal, generating a larger delta than the greedy algorithm of Tichy (1984), for several reasons. As the index contains fingerprints of n-grams, matching substrings shorter than n bytes cannot be found. Fingerprint sampling further increases the length of matching substrings which are guaranteed to be found. Finally, discarding references to duplicate strings from the index in memory eliminates the capability of searching for the longest match in a greedy manner. The encoder fingerprints each n-gram of the input file using a rolling hash function, and queries the fingerprint in the index. When a matching n-gram is found, the encoder inserts a COPY instruction into the delta. Variations of this “select first match” encoding algorithm were employed by Burns and Long (1997), Ajtai et al. (2002) and MacDonald (2000). It is faster than finding all of the matches at the current location in the input file and selecting the longest in a greedy manner, and there is a high likelihood that, if a much longer alternative match exists in the collection, it will be found when the encoder moves beyond the current match, replacing the shorter match by backward extension. A consequence of using select-first matching is that only the first instance of an n-gram in the index will ever be used, and there is no advantage in storing references to duplicates. That is why duplicate n-grams are discarded during index construction. During encoding, identifying matching n-grams involves first finding the matching fingerprint in the index, then resolving collisions by retrieving the n-gram from the collection file and performing a byte-by-byte comparison. Then the match is extended backwards and forwards by further byte-wise comparison of the input file and collection file. The COPY instruction emitted by the encoder contains the full extent of the string match. This method was also employed by Bentley and McIlroy (2001) and Ajtai et al. (2002).


Figure 3.2: Example of encoding a file with two collection files, file1 and file2. The index is constructed using fixed-offset sampling, so that every n-th n-gram of the collection files is inserted in the index. The example uses n = 4; the indexed n-grams are indicated with square brackets, and also listed in the right-hand column. The third n-gram is a duplicate of the first, so it is discarded from the index in memory, which is indicated by the line through it. The third occurrence of ABCD is from the input file, so it is not discarded from the index, as matches in the input file are preferred to matches in the collection. As the input file is encoded, its n-grams are progressively indexed, again using fixed-offset sampling. Encoding proceeds by searching for each n-gram from the input file in the index. The n-grams which are successfully matched are shown in bold. These matches are extended backwards and forwards, then encoded in the delta as COPY commands. Note that the final COPY refers to a string earlier in the input file, even though EFGH is also indexed from file1, as the encoding operation prefers matches in the input file when there is a choice.

The n-grams from the input file are indexed as it is encoded, allowing matching strings later in the input to be encoded as COPY instructions referring to the input file, as well as to the collection files. This is similar to the delta compression approach of vdelta (Hunt et al., 1998). When the index contains n-gram references to both the input file and a collection file, the encoding uses the reference to the input file, in order to reduce the number of COPY instructions referring to the collection.

The delta file contains ADD instructions of verbatim data that were not matched with any collection data, and COPY instructions containing references to strings in the collection. The COPY references contain a collection file identi- fier and the position in the file of the first byte of the matching string, similar to the n-gram references in the index, as well as the length of the string match. Us- ing two-dimensional references (file identifier and start position) to locate a string in the COPY instructions and the index allows the addition or modification of collection files without needing to reconstruct the entire index, or to re-encode all existing delta files. The encoding operation is illustrated in Figure 3.2.

3.2 Detailed Description

We now describe the remaining details of the CBC scheme, along with some implementation details of cobald. Note that cobald contains several elements to support comparison of algorithms, and parameters that impose some performance cost, both in speed and memory use. The previously described use of linked lists in the index for fingerprint collisions, which would be removed in an optimised system, is an instance of this. Also, there are many aspects of cobald that could be made more efficient but are beyond the scope of this investigation, such as compact coding of integers storing string locations and lengths. Those aspects that affect the compression results are noted in the following description.

Figure 3.3 shows the worked example from Figure 3.2, with example values for the data structures used in cobald.

Collection files The collection files are stored individually on external disk.

They are stored uncompressed as the encoder requires random access to the files to make comparisons with matching strings in the input.


Figure 3.3: Example of encoding showing the values written by cobald in the index, match list, and delta file. The example uses the same collection files and input file as Figure 3.2. The n-gram index uses fixed-offset sampling with n = 4. The n-grams inserted into the index are underlined with square brackets, and are also listed in the right-hand column. The third n-gram, indicated with a line through it, is a duplicate of the first, and is discarded from the index in memory. There is a third occurrence of the n-gram ABCD, but it is not discarded as it occurs in the input file, which is preferred over n-grams from the collection files, and is inserted into I_input. There are two distinct n-grams having fingerprint 3123, so the second one has collision ID = 1. During encoding, three matching n-grams are found in the index, indicated using bold text. The full matches, after backward and forward extension, are shown in the match list. The delta contains these matches as COPY blocks, as well as an ADD block for the unmatched text UVWXY between the first and second COPY blocks. The mapping of the COPY and ADD blocks to the original files is indicated by the pink arrows.

Fingerprint index file An index file stores the fingerprints and locations of the indexed n-grams. The index file contains an entry for each collection file, comprised of the file name as a null-terminated string, followed by a sequence of triples, (fingerprint, collision ID, location), where fingerprint is the hash value of the n-gram beginning at location in the file. The collision ID is an integer in the range 0 to 254 assigned to different n-grams having the same fingerprint (collisions). The triples are fixed-length records. For most experiments, the lengths used were 8 bytes for the location, 8 bytes for the fingerprint, and 1 byte for the collision ID. The end of the list of triples is indicated by an additional triple containing the special value of 255 for the collision ID.

The index file is constructed in a single pass by appending the fingerprints and position references as each collection file is read. Adding a file to the collection simply involves appending its n-gram references to the existing index file. Collision resolution is performed as the index file is constructed, which requires that the index resides in memory as construction proceeds. Before appending a new n-gram reference, the existing index entries with a matching fingerprint are read from their collection files and compared with the new n-gram. If the new n-gram matches one from the index, it is assigned the same collision ID. If the new n-gram does not match any of the existing entries, it is a collision and is assigned a new collision ID. Consequently, all occurrences of an n-gram in the index file will have the same collision ID. The index file format was chosen for simplicity and flexibility, and we have not sought a more efficient coding, although the size of the index file is a consideration when evaluating the compression effectiveness. Furthermore, some entries in the file are redundant in certain circumstances, but are retained to allow comparison with other situations. For instance, when n-grams are indexed at fixed offsets from the beginning of the collection file, the position can be inferred from the sequence of entries, so there is no need to record it. However, with content-defined selection of n-grams, the position cannot be calculated.
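The on-disk layout just described can be sketched as follows. The field widths follow the text (8-byte fingerprint, 1-byte collision ID, 8-byte location), but the helper function and struct names are illustrative only, not the cobald source.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    // One fixed-length index file record; a collision ID of 255 marks the
    // sentinel triple that terminates a collection file's entries.
    #pragma pack(push, 1)
    struct IndexTriple {
        uint64_t fingerprint;
        uint8_t  collision_id;   // 0..254 for real entries
        uint64_t location;       // byte offset of the n-gram in the collection file
    };
    #pragma pack(pop)

    // Append one collection file's entry: its null-terminated name, the
    // triples, and the terminating sentinel.
    void append_file_entries(std::FILE* out, const std::string& name,
                             const std::vector<IndexTriple>& triples) {
        std::fwrite(name.c_str(), 1, name.size() + 1, out);   // include the '\0'
        std::fwrite(triples.data(), sizeof(IndexTriple), triples.size(), out);
        IndexTriple sentinel{0, 255, 0};
        std::fwrite(&sentinel, sizeof(sentinel), 1, out);
    }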

Fingerprint index in memory Before encoding a file, the fingerprint index is constructed in memory by reading the index file. A subset of n-gram references in the index file are loaded as references to duplicate n-grams are discarded. When a duplicate n-gram reference, which is identified by matching fingerprint and collision ID with an existing entry, is read from the index file during construction, it can either be discarded or be used to replace the previous n-gram reference in the index. In all of our experiments, duplicate n-gram references were discarded, as we expected this to produce delta files having references to fewer collection files. However, the choice does not arise in Chapter 4 where duplicate strings have been removed from the collection files. The structure of the index, illustrated in

Figure 3.4, is a chained hash table of pointers to linked lists of n-gram references. The index size is linear with the number of n-grams, and the index has constant average access time when the hashes are uniformly distributed. To allow the encoding algorithm to find matches earlier in the input file, a second index, having the same structure as the collection index, stores the n-gram references from the

file being encoded. When encoding, the input index is queried first, and the collection index is only queried when the input index query was unsuccessful, so that matches to the input file have priority, and the number of COPY instructions referencing the collection is reduced. Note that an alternative could be to insert the input n-gram references into the collection index during encoding, replacing any duplicate entries from the collection; however, these references would need to be removed from the collection index before encoding a second file. The index is implemented in cobald using the std::unordered_multimap data structure from the GNU C++ standard library.


Figure 3.4: Illustration of the index data structure. fp is the n-gram fingerprint, and h(fp) is the hash table index derived from the fingerprint. Typically, h is the least significant bits of fp such that the table has the desired occupancy. A table entry contains a pointer to a linked list whose elements contain the fingerprints giving the value h(fp). Each of these entries contains pointers to a list of references to n-grams having that fingerprint.

As elements are inserted and the table becomes full, the array is dynamically re-sized by copying it to an array double in size. The table elements are fixed-sized records having 1 byte for the collision ID, 4 bytes for the file ID, and 8 bytes for the position. The fingerprints are stored in 8 bytes of memory.
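The in-memory structure can be sketched directly on top of std::unordered_multimap, as stated above; the struct layout and the duplicate-discarding insertion are illustrative rather than a copy of the cobald code.

    #include <cstdint>
    #include <unordered_map>

    // One n-gram reference stored under its fingerprint key.
    struct Ref {
        uint8_t  collision_id;
        uint32_t file_id;
        uint64_t position;
    };
    using MemIndex = std::unordered_multimap<uint64_t, Ref>;

    // Insert an n-gram reference unless a reference with the same fingerprint
    // and collision ID is already present (duplicate n-grams are discarded).
    void insert_if_new(MemIndex& idx, uint64_t fp, const Ref& r) {
        auto range = idx.equal_range(fp);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second.collision_id == r.collision_id)
                return;                          // duplicate n-gram, discard
        idx.emplace(fp, r);
    }

Distinct n-grams that collide on the same fingerprint simply appear as multiple entries under that key, which plays the role of the linked list in Figure 3.4.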

Compress operation In the compress operation, a standard compression util- ity such as gzip, 7-zip, or bzip2 can be used to compress the delta file to produce the final compressed version of the input file.

Delta file: The output from the encoding operation is a delta file having the following format. First is an array of the names of all collection files referred to by the COPY commands in the delta. The delta commands follow as a sequence of blocks. A COPY block describes a COPY instruction to locate the matching string in a collection file and consists of a triple (file id, length, offset), where file id is the index of the file name array in the header, length is the number of bytes of the matching string and offset is the position in the file of the first byte. An ADD block describes an ADD instruction, and consists of a marker to distinguish it from a COPY block, followed by an integer indicating the length of the ADD instruction, followed by length bytes of verbatim data. The first two fields of the COPY and ADD blocks are the same length, with file id = −1 marking the start of an ADD block, and file id = −2 indicating a COPY block which refers to earlier in the input file. File id is a 4 byte integer, and length and offset are 8 bytes to accommodate very large files.
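A minimal sketch of emitting these blocks is given below; the field widths and marker values follow the description above, while the function names are hypothetical.

    #include <cstdint>
    #include <cstdio>
    #include <string>

    constexpr int32_t kAddMarker     = -1;   // file id = -1 marks an ADD block
    constexpr int32_t kCopyFromInput = -2;   // file id = -2: COPY from earlier in the input

    // COPY block: (file id, length, offset).
    void emit_copy(std::FILE* out, int32_t file_id,
                   uint64_t length, uint64_t offset) {
        std::fwrite(&file_id, sizeof(file_id), 1, out);
        std::fwrite(&length,  sizeof(length),  1, out);
        std::fwrite(&offset,  sizeof(offset),  1, out);
    }

    // ADD block: marker, length, then the verbatim bytes.
    void emit_add(std::FILE* out, const std::string& verbatim) {
        int32_t marker = kAddMarker;
        uint64_t length = verbatim.size();
        std::fwrite(&marker, sizeof(marker), 1, out);
        std::fwrite(&length, sizeof(length), 1, out);
        std::fwrite(verbatim.data(), 1, verbatim.size(), out);
    }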

The cobald delta file format is designed for simplicity rather than efficiency. When an input file is encoded with respect to a single reference file, we are able to convert it to the standard VCDIFF format (Korn et al., 2002). We found that the VCDIFF format delta was up to 5% smaller than the cobald delta file.

The principal aspects of VCDIFF format enabling improved compression are byte coding of match reference offsets and lengths, and grouping the verbatim data for ADD commands into a contiguous block. VCDIFF-ed delta files were used for the trials in Section 3.10.

Encoding algorithm: The encoding operation is described in Algorithm 7. Encoding is done in a single pass through the input file, searching for matching fingerprints in the index. The fingerprint of the n-gram at the current position is found using a rolling hash function, and queried in the index. When a fingerprint match is found, collision resolution is performed: the n-gram reference in the index is retrieved from the collection file, or earlier in the input file, and compared byte for byte with the n-gram from the input file. A successful match is extended forwards and backwards to find the complete matching substring, which is then appended to the match list. As a result of backward extension, it is possible that the new match overlaps matches found previously, and their entries in the match list are amended. Any previous match which is entirely contained within the new match in the input string is removed from the match list. If the new match begins part-way through a previous match, then the overlapping part is removed from the new match so that it begins immediately following the previous match. Finally, if the length of the new match is less than a minimum value (8 bytes in cobald), it is discarded as the COPY block would be longer than the substring which it is encoding. Then the search resumes in the input file from the end of the match. As the search proceeds through the input file, every n-th n-gram is inserted in the input file index. Once the search of the input file is complete, the delta file is created by writing the references from the match list as COPY blocks and inserting ADD blocks in the gaps between matches. The match reference record in the match list is of fixed length containing four items: the position of the first matching byte in the input file (8 bytes); the position of the first matching byte in the collection file, or earlier in the input (8 bytes); the length of the match (8 bytes); and an identifier indicating which collection file contains the match (4 bytes). When encoding large files, the match list can become a large data structure.
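The match-list bookkeeping described above can be sketched in a few lines of C++. This is an illustration of the overlap-resolution rules, not the cobald code; the names and the explicit trimming guard are additions.

    #include <cstdint>
    #include <vector>

    struct Match {
        uint64_t input_pos;   // first matching byte in the input file
        uint64_t ref_pos;     // first matching byte in the reference file
        uint64_t length;
        int32_t  file_id;
    };

    // Append a new match, removing previous matches it fully contains and
    // trimming its front if it starts part-way through the previous match.
    // Matches shorter than the minimum length are dropped, since their COPY
    // block would exceed the bytes they encode.
    void append_match(std::vector<Match>& matches, Match m,
                      uint64_t min_length = 8) {
        while (!matches.empty() && matches.back().input_pos >= m.input_pos)
            matches.pop_back();                   // previous match fully contained
        if (!matches.empty()) {
            const Match& prev = matches.back();
            uint64_t prev_end = prev.input_pos + prev.length;
            if (m.input_pos < prev_end) {         // partial overlap: trim the front
                uint64_t cut = prev_end - m.input_pos;
                if (cut >= m.length) return;      // nothing left after trimming
                m.input_pos += cut;
                m.ref_pos   += cut;
                m.length    -= cut;
            }
        }
        if (m.length >= min_length)
            matches.push_back(m);
    }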

Hash function As each n-gram from the input file may have a matching reference in the index, the encoder calculates the n-gram fingerprint and performs an index query at each position in the input file, unless it is extending forward. A rolling hash function is most suitable in this situation. Cobald uses the division by irreducible polynomial hash algorithm of Cohen (1997). Unless stated otherwise, the hash length is 62 bits, using the irreducible polynomial $x^{62} + x^{48} + x^{18} + x^{16} + x^{9} + x^{2} + 1$ (Živković, 1994).
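For illustration, the constant-time window slide that makes fingerprinting every position affordable is sketched below using a simple Karp-Rabin multiply-and-mod recurrence; this is not the division-by-irreducible-polynomial hash of Cohen (1997) used by cobald, but it rolls in the same way.

    #include <cstdint>
    #include <string>

    class RollingHash {
    public:
        // Hash the first n bytes of s (assumes s.size() >= n).
        RollingHash(const std::string& s, std::size_t n) : n_(n) {
            for (std::size_t i = 0; i + 1 < n; ++i)
                pow_ = (pow_ * kBase) % kMod;          // kBase^(n-1) mod kMod
            for (std::size_t i = 0; i < n; ++i)
                h_ = (h_ * kBase + static_cast<unsigned char>(s[i])) % kMod;
        }
        uint64_t value() const { return h_; }

        // Slide the window one byte: remove `out`, append `in`.
        void roll(unsigned char out, unsigned char in) {
            h_ = (h_ + kMod - (out * pow_) % kMod) % kMod;
            h_ = (h_ * kBase + in) % kMod;
        }

    private:
        static constexpr uint64_t kMod  = (1ULL << 31) - 1;   // a Mersenne prime
        static constexpr uint64_t kBase = 256;
        std::size_t n_;
        uint64_t h_ = 0, pow_ = 1;
    };

The encoder would construct the hash over the first n-gram of the input and then call roll once per byte, querying the index at each position.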

Algorithm 7 Encoding with fixed-offset sampled fingerprint index
Data: R, the set of collection files; Q, the input file of length N (bytes); index file of sampled n-gram fingerprints and references from files in R.
Parameters: n, the length of n-grams (bytes); b, the maximum backward extension distance (bytes).
Result: A delta file encoding Q in terms of R.

1: Construct I_R, the index of fingerprints from R, in memory. Each n-gram reference in the index file is inserted into I_R if an entry with the same fingerprint and collision ID has not previously been inserted.
2: Create I_Q, the empty index of fingerprints from Q, in memory.
3: Create L, the empty list of string match references found during encoding.
4: i ← 0. (Set the position, i, of the encoding operation in Q to the beginning.)
5: while i ≤ N − n do
6:   fp ← Hash(Q[i, i + n − 1]). (Find fp, the fingerprint of the n-gram at position i in Q.)
7:   M, j ← QueryIndex(fp, i). (Query the index for the n-gram. When a matching n-gram is found, M is a reference to the file containing the match, and j contains the offset of the first byte of the match in M.)
8:   if a match is found then
9:     i_b ← ExtendBackward(i, M, j), and j_b ← j − (i − i_b). (Set the positions i_b in Q, and j_b in M, to the beginning of the matching string.)
10:    i_e ← ExtendForward(i + n, M, j + n), and l ← i_e − i_b. (Find the position, i_e, in Q of the non-matching byte immediately following the matching string, and the length, l, of the match.)
11:    AppendMatch(i_b, M, j_b, l). (Append the matching string to L.)
12:    i ← i_e. (Move the encoding position past the end of the match.)
13:  else
14:    if i ≡ 0 (mod n) then
15:      InsertIndex(fp, i). (Insert the fingerprint of every n-th n-gram from Q into I_Q.)
16:    i ← i + 1.
17: GenerateDelta( ).

18: function ExtendBackward(i, M, j) (Find the beginning of the matching string.)
19:   Starting at positions x = i in Q and y = j in collection file M, decrement x and y until either Q[x − 1] ≠ M[y − 1], the beginning of one of the files is reached, or i − x > b, the limit on backward extension.
20:   Return x, the position in Q of the first symbol of the matching string.

21: function ExtendForward(i, M, j) (Find the end of the matching string, while inserting n-grams into I_Q.)
22:   Starting at positions x = i in Q and y = j in collection file M, increment x and y until either Q[x] ≠ M[y], or the end of Q or M is reached.
23:   At each position, if x + 1 ≡ 0 (mod n) then
24:     fp ← Hash(Q[x − n + 1, x]), and InsertIndex(fp, x − n + 1). (Insert every n-th n-gram fingerprint into I_Q. Note that x is the position of the last symbol in the n-gram.)
25:   Return x, the position in Q following the end of the matching string.

26: function QueryIndex(fp, i) (Search for the fingerprint, fp, in the input index, I_Q. If a matching n-gram is not found, search for fp in the collection index, I_R. When an entry for fp is found, confirm that it is not a hash collision by retrieving the indexed n-gram M[j, j + n − 1] from the collection file M (or from Q if the match is in I_Q, in which case M = Q) and performing a byte-wise comparison with Q[i, i + n − 1].)
27:   Return M, j, or FALSE if a matching string is not found.

28: function AppendMatch(i, M, j, l) (Resolve overlaps with previous string matches resulting from extending backwards, then append the adjusted string match to L. The matching string of length l starts at position i in Q and position j in collection file M. Let i_z and l_z refer to the starting position in Q and the length of the last matching string in L.)
29:   while i_z ≥ i do
30:     Remove the last element from L.
31:   if i < i_z + l_z then
32:     j ← j − i + (i_z + l_z), l ← l + i − (i_z + l_z), i ← i_z + l_z. (Remove the overlapping portion of the new matching string.)
33:   Append the adjusted matching string reference (i, M, j, l) to L.

34: function GenerateDelta( )
35:   Emit a header containing a table of the collection filenames defining the COPY block fileID values.
36:   Iterate over L, emitting each match reference as a COPY block to the delta file, and gaps in Q between matching strings as ADD blocks.

3.3 Independence of Recursive Hash Functions

Recall that Lemire and Kaser (2010) showed that rolling hash functions are pair- wise independent at best. Consequently, it would seem that a rolling hash func- tion is a poor choice for a fingerprint index. For an index containing millions of fingerprints, a pairwise independent function is expected to have significantly more fingerprint collisions than would be obtained from a stronger hash function. However, we now show that the pairwise independence result of Lemire and Kaser (2010) only applies to n-grams that substantially overlap. This situation can be readily avoided by ensuring that fingerprints are sampled from n-grams separated in the input file by a minimum distance, and there is no reason to conclude that they have reduced independence.

Theorem 5. The theorem of Lemire and Kaser (2010), that hashes from a rolling hash function can be pairwise independent at best, is not valid when the n-grams overlap in fewer than n/2 positions.

Proof. We begin by restating the proof of Lemire and Kaser (2010). Let h be a 3-wise independent rolling hash function onto $\{0, 1, \ldots, M-1\}$, chosen randomly from a function family, and having the iterative relation $h(u_i = s_i s_{i+1} \ldots s_{i+n-1}) = F(u_{i-1}, s_{i-1}, s_{i+n-1})$. For 3-wise independence, $\Pr[h(u_1) = v_1,\ h(u_2) = v_2,\ h(u_3) = v_3] = 1/M^3$. LK then showed that 3-wise independence was not satisfied for a class of input strings containing repeated bytes. Let the input be $a^n bb$. The 3-way independence calculation for this input string, such that the adjacent n-grams have the same hash values $v_1 = v_2 = v_3 = v$, is

\begin{align*}
\Pr\bigl[h(a^n)=v \wedge h(a^{n-1}b)=v \wedge h(a^{n-2}bb)=v\bigr]
&= \Pr\bigl[h(a^n)=v \wedge F(v,a,b)=v \wedge F(v,a,b)=v\bigr] \\
&= \Pr\bigl[h(a^n)=v \wedge F(v,a,b)=v\bigr] \\
&= 1/M^2 \\
&\neq 1/M^3,
\end{align*}

as would be required for 3-wise independence.

So, when $h(a^n) = h(a^{n-1}b)$, then $h(a^{n-2}bb)$ has no independence. This argument is also valid if the input is extended to $a^n bbbb$, and we select every second n-gram. To obtain the next selected hash value, the iteration relation is applied twice. When the first and third n-gram hashes are equal, $h(a^n) = v = h(a^{n-2}bb) = F(F(v,a,b),a,b)$, then the fifth n-gram hash has the same iterative expression as the third, $h(a^{n-4}bbbb) = F(F(h(a^{n-2}bb),a,b),a,b) = F(F(v,a,b),a,b) = v$.

This holds true for selection separation distances up to $n/2$, but, for larger separations, the first symbol of the final n-gram is past the end of the first n-gram, so the iterative expressions for the middle and final n-gram hashes must differ. To express this mathematically, let $F_j(x, y) = F(F_{j-1}(x, y), x, y)$, with $F_0 = v$, represent the iterative application of the rolling hash function $j$ times. Then, with input $a^n b^n cc$ and $n$ even, when selecting every $(n/2 + 1)$-th n-gram, the hash of the middle n-gram is $F_{n/2+1}(a, b)$. The next n-gram selected will be $n + 2$ bytes from the start of the first n-gram. The hash of this final n-gram is $F(F(F_n(a,b),b,c),b,c)$, which differs from the expression for the hash of the middle n-gram.

As we described in Section 2.1, both pseudo-randomness of the hash function and randomness in the key set contribute to the uniformity of the hash distribution. The theorem of Lemire and Kaser (2010) is a succinct illustration of this observation. When the keys are biased by the condition that they overlap, the uniformity of the hash distribution can be no better than that obtained by pairwise independence. But when the bias of the keys is reduced by removing the condition that they overlap, the uniformity of the hash values is no longer limited.

3.4 Complexity of the Encoding Algorithm

In Chapter 1 we stated a requirement that the compression time complexity should be linear with the size of the input file. This can be readily obtained by an algorithm that makes a constant number of passes over the input, and performs a constant time operation on each input byte. However, the cobald encoding algorithm performs backward extension of string matches during which some input bytes are read multiple times.

It is clear that the encoding algorithm would be linear if it did not extend backwards. There would be at most 2N reads from the input file Q, first to find the matching strings, and the second time to insert the ADD instructions in the delta file. There would be at most N − n − 1 constant-access-time index queries. Finally, there would be at most N bytes read from the externally stored collection files to resolve collisions and extend forwards, assuming that the n-gram fingerprint length is sufficiently long to ensure an insignificant number of fingerprint collisions. However, the inclusion of backwards extension in the encoding algorithm allows the possibility that multiple comparisons will be performed on input bytes, and the algorithm is quadratic in the worst case.

Theorem 6. Encoding the input file Q of length N using backward extension requires O(N^2) byte comparisons on the input file with O(N^2) bytes read from the collection. The bounds are tight.

Proof. We construct a collection such that an encoding requires O(N^2) comparisons. Let the input file be the string G_1 G_2 G_3 ... G_{N/n}, where the G_i are unique n-grams. The collection files containing these G_i are

file 1: * G_1 *
file 2: * G_1 G_2 *
file 3: * G_1 G_2 G_3 *
...
file N/n: * G_1 G_2 G_3 ... G_{N/n} *

The * indicate data that is not in the input file. The files were added to the index in the order file 1, file 2, ..., file N/n, so that each file contains a prefix of the input file, and that prefix contains the prefix in the previously indexed collection file. Finally, file N/n contains the entire input file. The n-gram references inserted into the index are (G_1 in file 1), (G_2 in file 2), ..., (G_{N/n} in file N/n). Note that, during index construction, references to G_1 in file 2, file 3, ... are discarded as the index already contains a reference to G_1 in file 1.

The encoding operation proceeds by finding (G1 in file 1), then (G2 in file 2) and extending backwards through the G1 in file 2, then finding (G3 in file 3) and extending backwards through G2 then G1 in file 3, and so on. The sequence of comparisons on the input file are

G_1, G_2, G_1, G_3, G_2, G_1, ..., G_{N/n}, G_{N/n−1}, ..., G_2, G_1.

The total number of byte comparisons is a multiple of a triangular number, (1 + 2 + 3 + ... + N/n) · n = (N(N/n + 1))/2, which is Θ(N^2) as n is constant. This is also the number of bytes read from the collection files.

Finally, observe that the number of index queries is O(N). Also note that this proof ignores the possibility that, with very large N, the number of unique n-grams will be exhausted, and it will not be possible to construct the collection files as described. However, the size of such a collection is far beyond practical consideration.

The encoding-time asymptotic cost cannot be improved by changing the in- dex insertion policy, such as replacing an n-gram reference in the index when a duplicate is found, rather than discarding it. The encoding in the proof will again result in quadratic comparison when the collection files are inserted in the reverse order. An encoding algorithm with linear complexity can be obtained by simply imposing a limit on the length of backward extension.

Theorem 7. Encoding the input file Q of length N using backward extension no longer than b bytes requires O(N) byte comparisons on the input file with O(N) bytes read from the collection.

Proof. The highest number of comparisons will occur when no n-grams are able to be extended forward, but all are extended backwards to the maximum extent.

The number of comparisons, c, for each such n-gram is c < n + b, with backward extension limited to less than b bytes at the beginning of the input file. The total number of such n-grams is N/n, so the total number of comparisons is cN/n < (n + b)N/n which is O(N).

Note that the proof does not establish a limit on the maximum backward extension b. It can be arbitrarily large, although, to have a practical effect, it must be less than the collection size. Also, we observe that the proof invokes a collection having a pathological structure unlikely to occur in practice. In cobald, the maximum length of backwards extension is 10^7 bytes.

The memory required by the encoding algorithm is O(M + N), where M is the total length of the collection files. The largest data structure is the fingerprint index. An n-gram reference in the index requires constant space. Specifically, it uses 1 byte for the collision ID, 4 for the file ID, 8 for the position and 8 for the pointer to the next list item, for a total of 21 bytes. Additionally, each n-gram reference list is pointed to by a list item having 8 bytes for the fingerprint, 8 bytes for a pointer to the n-gram reference list, and 8 bytes for a pointer to the next fingerprint item. Finally, there are 8 bytes for the entry in the table of pointers to the fingerprint lists. The overall total is 53 bytes per n-gram reference. References to n-grams are stored from the collection files and from the input, so the maximum number of n-gram references stored in the index is (M + N)/n, and the total memory for the index is 53(M + N)/n. The other data structure of significant size is the list of matches found during encoding. It is a linked list using O(N) memory. Each match reference is a constant size of 28 bytes, and the maximum number of matches is N/n. The O(M + N) memory use exceeds the requirement we stated in Chapter 1 that the space complexity should be sublinear with the collection size. The requirement is achieved, for suitable data having high similarity, in Chapter 4 by indexing delta files instead of the original collection files. This ensures that each byte in a long string that is duplicated in the collection is represented in the index by a single n-gram reference. Discarding duplicate n-grams from the index, as we do in this chapter, is not sufficient to achieve this reduction in index size because the duplicate string is likely to be sampled at different offsets. As the index contains n-gram references located at fixed multiples of n, when the duplicate string is shifted relative to the start of the file, the n-grams inserted from the duplicate will not match those from the first occurrence, and will be inserted into the index as well. In the worst case, a byte in a string duplicated numerous times in the collection could be represented in the index n times.
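As a rough, purely illustrative estimate using the per-reference cost above (the figures below are example values, not measurements from the experiments):

\[
\text{index memory} \approx \frac{53\,(M+N)}{n} \text{ bytes};\qquad
\text{e.g. } M+N \approx 14\times 10^{9} \text{ bytes},\ n = 1024
\;\Rightarrow\; \frac{53 \times 14\times 10^{9}}{1024} \approx 7.2\times 10^{8} \text{ bytes} \approx 0.7 \text{ Gbytes}.
\]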

3.5 Performance Evaluation

To evaluate the performance of cobald, we use two measures to quantify the compression achieved when encoding a file. The effectiveness of the encoding operation is evaluated using the percent not matched, that is, the number of bytes in delta file ADD blocks as a percentage of the un-encoded input file size.

This is the proportion of the file that was not matched with other locations in the collection, or some earlier position in the input file. Note that the size of the unmatched data is less than the size of the delta file as it does not include the additional data to encode the COPY blocks. It allows comparison of the performance of the encoding algorithm independent of the efficiency of the delta file format. The other performance measure used is comparison of compressed delta sizes. Several file types are used for comparison: the uncompressed delta file, the gzip-ed delta file, and the 7-zip-ed delta file. These are compared to the un-encoded uncompressed input file, the gzip-ed input file and the 7-zip-ed input file. Comparisons are expressed as a percentage of the un-encoded uncompressed input file size. We can then obtain a measure of the compression improvement achieved by using cobald as a preliminary step before conventional compression: the size of the compressed (by gzip or 7-zip) input files divided by the size of the corresponding compressed delta files. When reporting the compression of a data set containing multiple files, the sizes of all files in it are summed to provide a single total.
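Stated as formulas, consistent with the definitions above and with the multiples reported in Table 3.3:

\[
\text{percent not matched} = 100 \times
  \frac{\sum_{\text{ADD blocks}} \text{verbatim length}}{\text{un-encoded input file size}},
\qquad
\text{improvement factor} =
  \frac{\text{compressed input file size}}{\text{compressed delta file size}}.
\]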

The compressors were configured with their best practical options to achieve high compression:

7z -m0=lzma -mx=9 -mfb=64 -md=32m -t7z -l
gzip --best
bzip2 --best
xz --best

The trials were run on a Dell Optiplex 755 desktop computer, having a dual-core Intel Core(tm)2 Duo E6850 3.00 GHz Processor, which is a 32 bit architecture, and 4 Gbytes of RAM. The operating system was the Ubuntu 12.04 distribution of the Gnu/Linux kernel, version 3.2.0-60-generic-pae. The system had two ATA hard disks, of 1 Tbyte and 250 Gbytes size. For all of the experiments, the collection files were stored on the larger drive, while the input files were read from, and the index and output files written to, the other drive.

3.6 Data Collections

The experiments, and those in Chapter 4, used two data collections: Web snap- shots and Genome. Each collection contains several data sets, each containing multiple files. Each data set was created sequentially as a revision of the previous data set. Therefore a substantial proportion of the data is unchanged between data sets, similar to what might be expected from a sequence of archival snap- shots. Table 3.1 summarizes the properties of the data sets.

Web snapshots The Web snap-shots collection is eight web crawls of a media web site (abc.net.au) taken one week apart over a two-month period (Ali, 2008). It contains only the (untruncated) html files; media files such as images have been discarded.

Collection        Number of    Data set size    Collection size
                  data sets    (Mbytes)         (Mbytes)
Web snap-shots    8            6450–7850        55714
Genome            21           1100–1400        26141

Table 3.1: Summary of data collections.

Genomes The Genome data set comprises releases 36 to 56 of the Drosophila melanogaster genome from the Ensembl project (ftp.ensembl.org). The data consists of nucleotide sequences with annotations describing aspects such as pro- tein and RNA expression. Each release is an update of the data from the previous month incorporating new results from the genomic research community. The files are individually gzip-ed on the Ensembl ftp server. The collection is chosen as it has multiple versions of the genome of a single organism, which is a common scenario in research and clinical situations. Also, the variation between releases could be viewed as a proxy for the genomes of multiple individuals from the same species.

The files are in three common genome file formats: embl, genbank, and fasta. These are plain text files formatted with white space and offset counts to facilitate human readability. Unfortunately, the visual formatting confounds delta encoding: for example, the insertion of an additional nucleotide near the start of a sequence causes changes to the text of the remainder of the sequence as the position of white space or offset counts is adjusted. We therefore strip the files of their visual formatting before analysis, by transforming each nucleotide sequence into a single long line of text without any interleaved space or newline characters. The files were restored to their original format by re-inserting the stripped formatting after retrieving files from the collection. For comparison of compression performance, the gzip and 7-zip file sizes reported are for the stripped input

files. Interestingly, stripping the genome files improved the compression achieved by gzip and 7-zip by approximately 15%.
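The stripping transformation can be illustrated, for fasta input only, with the following small filter (the embl and genbank formats need more involved handling); it is a simplified sketch rather than the preprocessing tool used in the experiments.

    #include <cctype>
    #include <iostream>
    #include <string>

    // Concatenate the sequence lines between '>' headers into a single long
    // line with no interleaved whitespace, so that an insertion near the
    // start of a sequence no longer shifts the layout of every later line.
    int main() {
        std::string line, sequence;
        auto flush = [&]() {
            if (!sequence.empty()) { std::cout << sequence << '\n'; sequence.clear(); }
        };
        while (std::getline(std::cin, line)) {
            if (!line.empty() && line[0] == '>') {      // header line: keep as-is
                flush();
                std::cout << line << '\n';
            } else {
                for (char c : line)
                    if (!std::isspace(static_cast<unsigned char>(c))) sequence += c;
            }
        }
        flush();
        return 0;
    }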

3.7 Results: Fixed-Offset Selection

We evaluated our CBC scheme with fixed-offset selection of n-grams by encoding and compressing some data sets from the Web snap-shots and Genome collections, in a range of conditions.

Encoding test: First, we evaluated the effectiveness of encoding with a range of n-gram lengths, n = 64, 256, 1024, 4096, 16384 and 65536. The release-56 data set from the Genome collection was encoded by cobald with release-55 in the index, and the week2 data set from Web snap-shots with week1 in the index. By populating the index with the preceding data set, we ensure that there are numerous string matches in the input files, and the test explores how much of this matching data is found by the encoder. As a baseline, we also encoded the files with an empty index. In this situation, the encoder may still find string matches to earlier in the input, but significantly less data should be matched. By comparing the outcomes from empty and filled indexes, we are able to evaluate the effectiveness of encoding with the index.

The results, shown in Table 3.2 and Figure 3.5, are presented as the amount of data that was not matched, which is found by summing the lengths of the ADD instructions in the delta files. Each data set contains several files that were encoded individually. The ADD lengths from each delta file were summed to produce a single total for the whole data set. These are expressed as a proportion of the total size of the un-encoded input files. The percent not matched is displayed in Figure 3.5, rather than the percent matched, as the values close to zero can be clearly shown on a logarithmic scale.

              Web snap-shots, week2         Genome, release-56
                     index                        index
  n            empty       week1            empty       release-55
  65 536       99.98       98.07            99.77       46.09
  16 384       96.07       71.65            99.32       45.20
   4 096       43.03        9.44            96.43       36.20
   1 024       36.99        3.16            89.59       17.78
     256       24.89        1.51            81.31        6.18
      64       16.58        0.68            71.93        3.73
      16        6.29           –            59.63           –

Table 3.2: Percentage of the input files that were not matched by the encoding operation, using empty and filled indexes, for a range of n-gram lengths (left-hand column). The remaining columns are the sum of the ADD instruction lengths in the delta files, expressed as a percentage of the total input file sizes. A percent not matched of 100% indicates that no compression was achieved. The files from the Web snap-shots collection, week2, were encoded with an empty index, and then with the files from week1 in the index. The files from the Genome collection, release-56, were encoded with an empty index, and then with the files from release-55 in the index.

Some initial trials were performed with n = 16 but we considered it to be impractical as the resulting indexes were very large and encoding was much slower than n = 64. We can see from Figure 3.5 that the encoding improves monotonically as the n-gram length decreases. This is to be expected as encoding with a shorter n-gram will find all of the string matches found using a longer n-gram, as well as some additional string matches too short to be found with the longer n. The longer n-gram lengths were ineffective. Web snap-shots week2 found almost no matching data with n = 65 536, and Genome release-56 saw insignificant improvement between n = 65 536 and 4 096. The shorter n-gram lengths are increasingly effective, and achieve a large compression improvement compared to compressing with an empty index.

[Figure 3.5 charts: (i) Encoding Web snap-shots, week 2; (ii) Encoding Genome, release-56. Each chart plots percent not matched (logarithmic scale) against n-gram length (65536 down to 16), for an empty index and for an index filled with the preceding data set.]

Figure 3.5: Encoding effectiveness with empty and filled indexes using a range of n-gram lengths. The percent not matched is the sum of the lengths of ADD instructions in the delta files, expressed as a percentage of the total input file sizes. A low percent not matched indicates high compression. (i) Encoding files from the Web-snap shots collection, week 2 and (ii) encoding files from the Genome collection, release-56. The blue line in each chart is the result from encoding with an initially empty index. The pink line is encoding after the preceding data set, (i) Web snap-shots, week 1 and (ii) Genome release-55, were added to the index. A logarithmic scale is used on the y-axis to clearly show the compression improvement with shorter n-grams. The data is listed in Table 3.2. 118 3.7. Results: Fixed-Offset Selection

Compression test: We evaluated the performance of the full encoding and compression pipeline by compressing the delta files obtained by encoding Web snap-shots week2 using cobald with week1 in the index, and by encoding Genome release-56 with release-55 in the index. The compressed file sizes are shown in Figure 3.6 and Table 3.3. The overall compression is very effective. With n = 256, Web snap-shots week2 is compressed to 0.2% of the input file size by cobald and 7-zip together. The Genome release-56 files were compressed to 1.5% of the input file size. The contribution of cobald is the gap between the input file + compressed (dashed line), and the corresponding delta + compressed (solid line). We can see that this increases as the n-gram length decreases, indicating that cobald encoded more matching strings that would not have otherwise been found by the second-stage compressor. Using cobald with n = 256 improved the compression obtained by 7-zip 22.9 times for the Web snap-shots collection, and 11.5 times for the Genome collection.

The contribution of the compression second stage can be seen as the gap be- tween the uncompressed delta, shown as the solid yellow line, and the compressed delta (blue or red line). This gap is largely constant, indicating that the effec- tiveness of the second-stage compressor is independent of the n-gram length used to obtain the delta, and of the amount of compression achieved by cobald in the first stage. In particular, when cobald achieves a high compression, the effec- tiveness of the second stage is not adversely affected, and we can conclude that cobald is finding string matches that a conventional compression utility does not. For example, for Web snap-shots week 2 encoded using n = 256 bytes, the delta

We also performed some tests using bzip2 --best, which achieved better compression than gzip but slightly less than 7-zip, and xz --best, which achieved compression very close to that achieved by 7-zip.


Figure 3.6: Compressed file sizes, expressed as a percentage of the input file size, using a range of n-gram lengths. The dashed lines represent the sizes of the input files compressed without using cobald. The solid lines are the cobald delta file sizes when uncompressed, and when compressed with gzip and 7-zip. (i) Web snap-shots week2 was encoded with week1 in the index. (ii) Genome release-56 was encoded with release-55 in the index. This data is listed in Table 3.3.

(i) Encoding Web snap-shots, week 2

             Input/Delta   gzip             7-zip
  n          (%)           (%)       x      (%)      x
  no cobald     100.00     11.93     -       4.94    -
  65 536         98.07     11.63     1.0     4.75    1.0
  16 384         71.68      7.65     1.6     3.15    1.6
   4 096          9.62      1.20     9.9     0.74    6.7
   1 024          3.41      0.46    26.0     0.30   16.5
     256          1.87      0.31    38.6     0.22   22.9
      64          1.33      0.32    37.4     0.24   20.6

(ii) Encoding Genome, release-56

             Input/Delta   gzip             7-zip
  n          (%)           (%)       x      (%)      x
  no cobald     100.00     23.25     -      18.51    -
  65 536         46.09      9.32     2.5     7.84    2.4
  16 384         45.21      9.26     2.5     7.81    2.4
   4 096         36.24      8.15     2.9     6.94    2.7
   1 024         18.07      4.58     5.1     3.82    4.8
     256          7.00      2.01    11.6     1.61   11.5
      64          5.02      1.60    14.5     1.29   14.4

Table 3.3: Compressed file sizes using a range of n-gram lengths. The percentages are the file size as a proportion of the uncompressed input file size. The multiples (in the columns labelled x) are the improvement from cobald compared to compression of the files individually without cobald, calculated as the compressed input file size divided by the compressed delta file size, using gzip and 7-zip respectively. (i) Web snap-shots week2 was encoded with week1 in the index. (ii) Genome release-56 was encoded with release-55 in the index. This data is shown in Figure 3.6.

While the general trend is for monotonic improvement in compression as the n-gram length decreases, this is not the case for n = 64, where the Web snap-shots compressed deltas are slightly larger than for n = 256, and the Genome compressed deltas are only marginally smaller. As the uncompressed deltas are clearly smaller, the reduced compression can be attributed to reduced effectiveness of the second-stage compression by gzip and 7-zip. One partial explanation for this observation is the diminishing benefit of encoding shorter matching strings. In addition to the unmatched bytes in each ADD block, the delta file includes data describing each COPY block, as well as the length of each ADD instruction. These references are encoded in 20 bytes, so the gain from encoding short strings is less significant.

Table 3.4 shows the index file size, memory use, and time taken to insert and encode by cobald. The index file size is inversely related to the n-gram length, as halving the n-gram length results in twice as many n-grams. This relationship does not hold exactly when the index is created in memory, as references to duplicate n-grams are discarded. The other large data structures are the index for n-grams from earlier in the input file, which has size proportional to the input file length, and the list of matches used to construct the delta file, which is limited to a size also proportional to the input file length. Consequently, the memory used by cobald is proportional to the combined size of the indexed files and the input file being encoded. The memory use is shown as a percentage of this total file size. We can see that the memory use increases at slightly less than the inverse of the n-gram length.
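Returning to the per-instruction overhead noted above, a rough worked calculation (treating the 20-byte figure as the whole cost of a COPY reference and ignoring the second-stage compression, so the numbers are only indicative) shows why short matches contribute little:

\[
\text{saving for a match of length } \ell \;\approx\; \ell - 20 \text{ bytes}, \qquad
\text{e.g. } 64 - 20 = 44 \ (\approx 69\%), \qquad 1\,024 - 20 = 1\,004 \ (\approx 98\%).
\]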

The insertion operation consists of two main elements: reading the file and calculating the fingerprints with the rolling hash function, which relates to the file size and is the same for all n-gram lengths; and inserting the n-gram references in the index, which increases with the number of n-grams. These two components can be clearly identified. When inserting Web snap-shots, approximately 230 seconds is required to read and fingerprint the files, and approximately 50 seconds for Genome release-55. The remaining time is inversely related to the n-gram length.

The encoding time is more complicated. The encode operation has two modes when processing the input file, which run at different speeds. When the current position has no matching string in the collection, the n-gram at every position of the input file is queried in the index, which is expensive. When the current position is part of a match, the encoder is extending the match forward by direct comparison of the strings in the input and collection files, which is a faster operation. Consequently, files that cobald compresses well are encoded faster. Counteracting this is the high cost of starting a new matching string, which involves reading from the collection file on external storage, so a high number of short matches is expensive. The Web snap-shots data set achieved higher compression than the Genome data set, and this is reflected in the reduced encoding time for Web snap-shots as the n-gram length decreases. However, when n = 64 the Web snap-shots encoding time increases, largely because of the time penalty from external reads for small matches.

The key conclusion is that encoding with n = 64 is not worthwhile: it does not achieve better compression than n = 256, and the memory use and index file size are a significant fraction of the collection size. At the other extreme, n = 65 536 and n = 16 384 do not achieve significant compression. The most practical choices of n-gram length are n = 1 024 and n = 256, which achieve good compression while using resources amounting to less than 10% of the collection size.

(i) Encoding Web snap-shots, week2

          Index file size       Memory               Time (seconds)
  n       Mbytes      %         Mbytes      %        insert      encode
  65 536      1.7    0.03           6.3    0.07       231.8     5 129.4
  16 384      6.8    0.10          21.2    0.25       236.9     6 239.0
   4 096     27.2    0.42          77.3    0.91       266.3     3 495.6
   1 024    108.6    1.66         228.6    2.71       371.3     1 742.2
     256    434.5    6.64         600.0    7.10       639.1     1 516.5
      64  1 738.1   26.56       1 544.0   18.27     1 657.5     2 924.8

(ii) Encoding Genome, release-56

          Index file size       Memory               Time (seconds)
  n       Mbytes      %         Mbytes      %        insert      encode
  65 536      0.3    0.03           2.5    0.19        47.1       480.6
  16 384      1.2    0.10           4.5    0.35        48.3       475.5
   4 096      4.7    0.42          14.0    1.04        73.3       556.3
   1 024     18.9    1.66          51.3    3.82       179.2       607.7
     256     75.5    6.64         179.4   13.35       575.9     1 009.5
      64    302.0   26.56         666.0   49.55     2 153.5     2 908.9

Table 3.4: Index file size, memory and time used by cobald during the compression test. (i) The Web snap-shots week2 data set was encoded with week1 in the index. (ii) The Genome release-56 data set was encoded with release-55 in the index. The index file size is shown as a percentage of the input files which were inserted in the index, (i) Web snap-shots week1 and (ii) Genome release-55. The memory column shows the maximum memory used when encoding an input file, expressed as a percentage of the sum of the file sizes of the indexed files and the input file being encoded. The insert time is the total time to insert the files into the index. The encode time is the total time to encode all of the files in the input data sets.

3.8 Content-Defined Chunk Selection

Fixed-offset chunk selection, where the chunk references inserted in the index are selected at every n-th position in the file, leads to some inefficiency for subsequent searches. When there is a long matching substring in the collection, up to n − 1 index queries may be performed before a matching n-gram is found. We refer to these index queries for chunks that exist in the collection but which were not indexed as false negative queries.

Several alternative chunk selection algorithms, which reduce the rate of false negative queries, are described in Section 2.6. These algorithms define some property for each chunk determined solely by the content of the chunk or its local neighbourhood. If only those chunks having the property are inserted in the index, there is no need to search for chunks not having the property, thus eliminating false negative queries. In this section we investigate whether delta encoding is affected by eliminating false negative index queries using content-defined selection.

While a reduction in the number of index queries is expected to increase the speed of the encoding step, content-defined selection involves some additional costs. One key effect is that the compression achieved will be reduced for a given index size. If the content-defined parameters are selected so that the expected inter-chunk distance is equal to n, so that the index size is the same as a fixed-offset index, there will be some adjacent chunks with large gaps between them. Data in these long chunks is not represented in the index, and a string match in these chunks will not be found. Another cost is the additional computation required to calculate the selection property. In some content-defined selection algorithms this is significant, and the algorithm is not useful if it exceeds the time saved by eliminating the false-negative queries.

Selection algorithm: The various content-defined selection algorithms are summarised in Table 2.1. The choice of algorithm depends largely on the relative cost of very short chunks and very long chunks. For our hash index in memory, the cost of short chunks is a larger index, while the cost of very long chunks is that some matching strings will not be found, potentially leading to reduced compression. A larger index will not have a significant impact on query speed, although it will reduce the limit on the collection size before measures must be taken to cap the memory use. Here, we investigate a method to achieve good compression by reducing the occurrence of long chunks, even at the cost of a larger index.

A related consideration is that long periodic strings (strings containing repeated short substrings, such as abababab...) should be handled gracefully. Again, the cost of long strings that do not contain selection points is a potential loss of compression, while the cost of long strings having a selection point at every position must be limited. In cobald, all of the chunks will be appended to the index file, but only a single reference to each unique n-gram is loaded into the index in memory. By careful choice of the selection function, so that known periodic strings in the input (such as a string of zeroes) do not contain selection points, the occurrence of adjacent n-gram references can be minimised. Two content-defined selection algorithms meet these criteria: (i) the method of Manber (1994), where a selection point occurs when the n-gram hash satisfies h ≡ c (mod d), with a maximum distance between selection points; and (ii) winnowing of Schleimer et al. (2003). In cobald, we employed the method of Manber.

The chunking algorithm, which is defined in the GetChunk() function of Algorithm 8, has three parameters: the n-gram length n, the number of bits in the chunk selection mask s, and the maximum chunk length m. Each n-gram of the input is hashed, and if the least significant s bits of the hash are all set to 1, that n-gram terminates the chunk.


Figure 3.7: Content-defined chunk and n-gram positions. Each chunk begins at the byte following an indexed n-gram, and ends with the last byte of the n-gram which terminates the chunk. In a short chunk, the n-gram may begin before the first byte of the chunk.

Also, if m bytes of the input have been read since the end of the previous chunk, the chunk is terminated. The relative positions of the terminating n-grams and the chunk boundaries are illustrated in Figure 3.7. Assuming a uniform hash distribution, each n-gram has a probability of 1/2^s of satisfying the termination condition. As described in Section 2.6, the theoretical distribution of chunk lengths without a maximum length is geometric, with a mean chunk length of 2^s. The actual expected chunk length will be slightly less when a maximum chunk length is imposed.

The algorithm for encoding with content-defined selection is described in Algorithm 8. Like fixed-offset encoding, every n-gram of the input is hashed; however, only the last n-gram of each chunk is queried in the index, as we know that n-grams which did not satisfy the chunking condition were not inserted. This is not the case when the chunk is terminated because it reached the maximum length, since then the final n-gram did not satisfy the chunking condition. If the maximal chunk is part of a matching string in the collection, we do not know which n-gram was inserted in the index, so, during a maximal chunk, the encoding algorithm reverts to the fixed-offset behaviour and every n-gram is queried in the index.
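To make the selection rule concrete before the full algorithm (Algorithm 8) is given below, here is a minimal Python sketch of the boundary test and of the truncated-geometric expectation just described. It is an illustration only, not cobald's implementation: the fingerprint is a toy stand-in for the rolling hash, and the function names are hypothetical.

```python
import random

def toy_fingerprints(data: bytes, n: int):
    """Yield (position, fingerprint) for every n-gram; a stand-in for the rolling hash."""
    for i in range(n - 1, len(data)):
        yield i, hash(data[i - n + 1:i + 1])

def chunk_lengths(data: bytes, n: int, s: int, m: int):
    """Content-defined chunking: a chunk ends at the first n-gram whose lower s
    bits are all set, or after m n-gram positions (the maximum chunk length)."""
    mask = (1 << s) - 1
    lengths, length = [], 0
    for _, fp in toy_fingerprints(data, n):
        length += 1
        if (fp & mask) == mask or length == m:
            lengths.append(length)
            length = 0
    return lengths

def expected_chunk_length(s: int, m: int) -> float:
    """Mean of min(X, m) where X is geometric with success probability 2**-s."""
    p = 2.0 ** -s
    return (1.0 - (1.0 - p) ** m) / p

if __name__ == "__main__":
    data = bytes(random.getrandbits(8) for _ in range(1_000_000))
    observed = chunk_lengths(data, n=1024, s=10, m=2048)
    print("observed mean:", sum(observed) / len(observed))
    print("theoretical mean:", expected_chunk_length(10, 2048))   # roughly 886
```

For s = 10 the truncated-geometric mean is roughly 886 with m = 2 048 and approaches 1 024 as m grows, which is broadly in line with the average chunk lengths reported later in Table 3.6 (the measured averages are slightly larger, so the approximation is indicative rather than exact).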

Algorithm 8 Encoding with content-defined sampled fingerprint index

Data: R, the set of collection files; Q, the input file of length N (bytes); index file of sampled n-gram fingerprints and references from files in R.
Parameters: n, the length of n-grams (bytes); b, the maximum backward extension distance (bytes); s, the number of bits in the chunk selection mask; c, the maximum chunk length (bytes).
Result: A delta file encoding Q in terms of R.

 1: Construct IR, the index of fingerprints from R in memory. Each n-gram reference in the index file is inserted into IR if an entry with the same fingerprint and collision ID has not previously been inserted. The lower s bits of the fingerprints are removed (using right-shift) before insertion into the index, as these bits are used to define the chunk boundaries.
 2: Create IQ, the empty index of fingerprints from Q in memory.
 3: Create L, the empty list of string match references found during encoding.
 4: mode ← Searching.
        The algorithm uses two modes as it processes Q: in Searching mode the index is queried for matching n-grams; in Extending mode a string match is extended forward.
 5: i ← n − 1.
        Position i in Q is the last byte of the previously encoded chunk. It is initialised to the last byte of the first n-gram.
 6: while i < N do
 7:     j, lc, F ← GetChunk(i).
            Read the next chunk from Q, and set j, the position in Q of the first byte of the last n-gram in the chunk; the chunk length, lc; and the array, F, containing the fingerprints of all n-grams in the chunk.
 8:     if mode = Extending or lc = c then
 9:         k ← i − n + 1.
                Set k, the position of the next operation on this chunk, to the start of the first n-gram in the chunk.
10:     else
11:         k ← j.
                For short chunks, only the last n-gram is queried in the index.
12:     while k ≤ j do
13:         if mode = Extending then
14:             ie ← ExtendForward(k + n − 1, M, a + n, j + n − 1).
15:             if the end of the match is found then
16:                 l ← ie − ib, the length of the match.
17:                 mode ← Searching.
18:                 AppendMatch(ib, M, ab, l). Resolve overlapping matches.
19:                 if lc = c then
20:                     k ← ie − n + 1. Reset k to continue searching for matches.
21:                 else
22:                     k ← j
23:             else
24:                 a ← a + j − k + 1
25:                 k ← j
26:         if mode = Searching then
27:             M, a ← IndexQuery(F[k − i + n + 1]).
                    Query the index for the n-gram. When a match is found, M is a reference to the file containing the matching n-gram, and a is the position of the n-gram in M.
28:             if a match is found then
29:                 ib ← ExtendBackward(k, M, a), and ab ← a − (k − ib).
30:                 mode ← Extending.
31:             k ← k + 1.
32:     IndexInsert(Q, F[0], j)
33:     i ← j + n
34: GenerateDelta().

35: function ExtendForward(i, M, j, imax)
        Find the end of the matching string, while inserting n-grams into IQ.
36:     Starting at positions x = i in Q and y = j in collection file M, increment x and y until either Q[x] ≠ M[y], the end of Q or M is reached, or the end of the chunk, imax, is reached.
37:     At each position, if x + 1 ≡ 0 (mod n) then
38:         fp ← Hash(Q[x − n + 1, x]), and InsertIndex(fp, x − n + 1).
                Insert every n-th n-gram fingerprint into IQ. Note that x is the position of the last symbol in the n-gram.
39:     Return x, the position in Q following the end of the matching string, or FALSE if the end of the match is not found before reaching the end of the chunk.

40: function GetChunk(i)
        Find the next chunk in Q after position i. The chunk ends when the lower s bits of the n-gram fingerprint are set, or the chunk reaches the maximum length, c.
41:     j ← 0
42:     repeat
43:         fp ← Hash(Q[i + j − n + 1, i + j]).
                Find the fingerprint of the n-gram ending at position i + j in Q.
44:         F[j] ← fp right-shifted by s bits.
                The lower s bits are removed from the fingerprint used in the index.
45:         j ← j + 1.
46:     until j = c or j = N − i or the lower s bits of fp are set.
47:     Return i + j, j, F.

48: function ExtendBackward(i, M, j)
49: function AppendMatch(i, M, j, l)
50: function QueryIndex(fp, i)
        These functions are defined in Algorithm 7.

3.9 Results: Content-Defined Selection

The effectiveness of content-defined selection was evaluated by encoding the same files used in Section 3.7 using a range of maximum chunk lengths. The results are presented in Table 3.5 and Figure 3.8. With the data from both collections, we see that the encoding effectiveness decreases with longer maximum lengths, which is expected as there are larger gaps between indexed n-grams and a greater chance that a matching string in collection files does not contain an indexed n-gram. The reduction in compression effectiveness is greater for the Genome release-56 files than for the Web snap-shots data set. We attribute this to the characteristics of the data: Genome release-55 appears to contain a high number of matches close to 1 024 bytes in length that are missed by the content-defined selection process.

This is supported by the results in Figure 3.5: there is little improvement in compression between n = 4 096 and n = 65 536, indicating that there are few string matches of length 4 096 that are not part of matches longer than 65 536. The broad effect of the maximum chunk length on the chunk length distribution can be seen in the average chunk length, listed in Table 3.6, which increases gradually with the maximum length. Also, the index file size decreases slightly as there are fewer chunks, and the memory used when encoding decreases slightly as the index is smaller. The chunk length distribution is shown in detail in Figure 3.9. Without a maximum chunk length, the theoretical distribution is geometric, as described in Section 2.6. It can be seen that the imposition of a maximum chunk length does not significantly change the distribution of lengths that are less than the maximum length.

A key result is that the memory used when encoding is reduced by over 20% for Web snap-shots and 50% for the Genome data set compared to encoding with fixed-offset n-gram selection.


Figure 3.8: Percent matched by the encoding operation using a range of maximum chunk lengths. This data is presented in Table 3.5. Web snap-shots week2 was encoded with week1 in the index, and Genome release-56 was encoded with release-55 in the index. The encoding was performed with n-gram length n = 1 024 and a chunk selection mask of 10 bits. Each data point represents the total length of data in COPY blocks in the delta files, as a proportion of the total input file size. The dashed line is the result achieved when encoding with fixed-offset selection from Section 3.7.

This is because most false-negative n-grams have been eliminated from the index. The encoding algorithm only inserts n-grams that meet the chunk selection condition, so, when indexing a duplicate string, the same n-grams will be inserted from both strings, and the second one will be discarded. With fixed-offset selection, however, different n-grams are likely to be indexed from a duplicate string, and both sets of n-grams will be inserted into the index in memory.

The encoding times are shown in Table 3.6 and Figure 3.10. The encode time decreases as the maximum chunk length increases.

                        % matched
  m               Web snap-shots    Genome
  2 048                95.92         78.62
  3 072                95.42         77.56
  4 096                95.19         77.08
  5 120                95.05         76.82
  6 144                95.01         76.66
  7 168                94.98         76.53
  8 192                94.96         76.42
  fixed-offset,
  n = 1 024            96.84         82.22

Table 3.5: Percentage of input files that were matched by the encoding operation using a range of maximum chunk lengths, m. Web snap-shots week2 was encoded with week1 in the index, and Genome release-56 was encoded with release-55 in the index. The encoding was performed with n-gram length n = 1 024 and a chunk selection mask of 10 bits, so that each n-gram has a 1/1024 probability of being selected. The second and third columns show the total length of data represented in COPY blocks in the delta files, as a proportion of the total input file size. The last line is the result achieved when encoding with fixed-offset selection from Section 3.7.

The main reason is that, with a short maximum length, there is a higher proportion of chunks having the maximum length. These maximum-length chunks are treated as a special case by the encoding algorithm: the n-grams at each position in the chunk are queried in the index, in a similar manner to fixed-offset encoding. This takes substantially longer than a shorter chunk, where only the last n-gram is queried. As the maximum chunk length increases, there are fewer chunks having the maximum length and therefore fewer index queries. A lesser contribution to the reduced encoding time is that, as the maximum chunk length increases, the average chunk length increases and there are fewer n-grams to insert in the input file index.

The insertion times are much slower for content-defined selection than for fixed-offset selection, being nearly twice as long for the Genome data and four times longer for Web snap-shots. This is because, when inserting data that matches previously inserted data, the matching fingerprint will be found in the index and the inserter will perform a collision resolution step, which involves an external read from the collection file. With fixed-offset selection, it is unlikely that the same n-gram references will be selected for both occurrences of the matching data. Consequently, the reduced memory for content-defined selection comes at the cost of longer insertion time.

The key conclusions are that content-defined selection provides a significant benefit by reducing the memory needed to encode, and is also faster for longer maximum chunk lengths. The best choice depends on the data: for the Web snap-shots collection, a maximum of 5 120 (5 times the expected chunk length) achieves fast encoding, while for the Genome collection there is little speed improvement with a maximum length beyond 3 times the expected chunk length. Even a maximum chunk length of 2 048 (2 times the expected chunk length) achieved a large reduction in memory use, compared to fixed-offset encoding, with only a small decrease in compression.

3.10 Comparison With Other Systems

We were unable to run performance comparisons with other systems on a CBC task involving multiple collection files. Of the collection compression systems described in Section 2.9, PRESIDIO (You et al., 2011) is the most developed, and has the capability to perform CBC with multiple files. However, we were unable to obtain the software from the authors.

Instead, we compared the performance of cobald against some of the single reference file delta encoders and compressors described in Section 2.5.2. Specifically, we made comparisons with xdelta3 and open-vcdiff, which are available as open source software, and hsadelta, which is provided by IBM under an ‘early release’ licence. Our tests involve encoding and compression of a single input file with respect to a single reference file, which is done in cobald by inserting the reference file into the index, then encoding the input file and compressing the delta.

(i) Encoding Web snap-shots week 2

  Maximum        Average
  chunk length   chunk length   Index file size      Memory               Time (seconds)
  Bytes          Bytes          Mbytes      %        Mbytes      %        insert      encode
  8 192          1 059.3         105.0     1.60       159.2     1.88      1 341.7     1 586.2
  7 168          1 058.7         105.1     1.61       159.2     1.88      1 311.0     1 584.9
  6 144          1 056.7         105.3     1.61       159.5     1.89      1 304.8     1 611.4
  5 120          1 051.2         105.8     1.62       160.1     1.89      1 368.2     1 624.3
  4 096          1 035.4         107.4     1.64       161.5     1.91      1 349.5     1 702.5
  3 072            998.8         111.4     1.70       166.3     1.97      1 333.5     1 804.1
  2 048            902.8         123.2     1.88       181.0     2.14      1 374.2     2 008.6
  fixed          1 024           108.6     1.66       228.6     2.71        371.3     1 742.2

(ii) Encoding Genome release-56

  Maximum        Average
  chunk length   chunk length   Index file size      Memory               Time (seconds)
  Bytes          Bytes          Mbytes      %        Mbytes      %        insert      encode
  8 192          1 060.8          18.2     1.60        23.0     1.71        342.4       400.8
  7 168          1 059.4          18.3     1.61        23.0     1.71        344.6       410.3
  6 144          1 056.6          18.3     1.61        23.0     1.71        349.0       403.9
  5 120          1 050.7          18.4     1.62        23.0     1.71        345.9       415.4
  4 096          1 036.1          18.7     1.64        23.3     1.73        349.6       427.7
  3 072            999.3          19.3     1.70        24.1     1.79        356.2       459.8
  2 048            905.1          21.4     1.88        26.1     1.94        374.5       527.7
  fixed          1 024            18.9     1.66        51.3     3.82        179.2       607.7

Table 3.6: Index file sizes, memory use, insertion and encoding times for encoding with content-defined n-gram selection. (i) The Web snap-shots week2 data set was encoded with the week1 files in the index. (ii) The Genome release-56 files were encoded with release-55 in the index. The index file size is shown as a percentage of the total file sizes inserted in the index. The memory use is shown as a percentage of the total size of the indexed files plus the input file for which the memory result was obtained. The insert time is the total time to insert the indexed files. The encode time is the total time to encode all files from the input data set.


Figure 3.9: Distribution of chunk lengths for a range of maximum chunk lengths. Each point represents the mean probability of chunk lengths in bins of size 32 bytes. The chunk lengths were obtained from the inserted files from (i) Web snap-shots week1 and (ii) Genome release-55.


Figure 3.10: Encoding times with content-defined n-gram selection for various maximum chunk lengths. The data is given in Table 3.6.

The results are presented in Tables 3.7 to 3.10. In addition to some data from the Web snap-shots and Genome collections, these trials use some files from commonly available source and binary software distributions which were used by Agarwal et al. (2006) when evaluating the performance of hsadelta. All of these files are aggregated into the combine set. The files used are listed in the Appendix. Hsadelta includes a compression stage using bzip2, so the delta files from cobald and open-vcdiff were compressed with bzip2 --best to ensure a fair comparison. Xdelta3 uses zlib to compress its delta, so it was run with the -9 option to achieve best compression.

[Table 3.7: Comparison of compression achieved by cobald (n = 256 and n = 1 024, with fixed-offset and content-defined selection), hsadelta, xdelta3, vcdiff and bzip2 for each file pair (samba, gentoo1, gcc1, gcc2, linux1 to linux3, gentoo2, install, genome, combine, web). For each trial the compressed sizes are given in Mbytes and as percentages.]

The xdelta3 delta is encoded using the VCDIFF format with an improved variable bit-length encoding which is more compact than the original byte code. Xdelta3 version 3.0.0 was used in the tests with the command line xdelta3 encode -9 -S djw -s REFERENCEFILE INPUTFILE OUTPUTFILE. The delta files from cobald were converted to VCDIFF format prior to compression with bzip2.

The first observation from the compression results in Table 3.7 is that, for files less than 300 Mbytes, vcdiff and hsadelta are the most effective. Cobald is more effective than xdelta3 on some of the files. All of the tools, except xdelta3 on some files, show substantially greater compression than bzip2 compressing the input file, demonstrating that there was substantial similarity in the file pairs to be exploited by delta encoding.

An important observation from the compression results is that the relative compression achieved by cobald improves for larger input and reference files. Or, perhaps more accurately, the performance of the other tools degrades as their mechanisms for limiting memory take effect. Xdelta3 uses 16-byte n-grams and discards earlier n-grams from its index to limit its size. As the file sizes increase, the indexed n-grams will cover less of the input and reference files, and we expect fewer matches to be found, leading to degraded compression.

We observe this effect as xdelta3 achieves significantly less compression than the other tools on the largest files. Hsadelta uses a different mechanism to achieve constant memory use: it increases n as the input and reference file sizes increase, so that approximately the same number of n-grams are inserted in the index. This allows the index to cover all of the input and reference files, but at a lower sampling frequency. Again, we observe that the performance of hsadelta declines as the file sizes increase. It was the best compressor for many of the small files, but was surpassed by vcdiff and cobald for the larger ones. Finally, only cobald was able to compress files larger than 4 Gbytes.

[Table 3.8: Maximum memory used, in kB and as a percentage of the sum of the input and reference file sizes.]

[Table 3.9: Compression time (seconds) for encoding and compressing.]

[Table 3.10: Decompression time (seconds) for decompressing and decoding.]

The memory used by the compressors is summarised in Table 3.8. Xdelta3, hsadelta and bzip2 reached their memory limit for all but the smallest input files. For the four largest pairs of files, cobald with n = 1 024 used less memory than the other delta compressors, while also obtaining better compression.

Cobald with n = 256 used three to four times more memory than with n = 1 024, which was expected as the index contains approximately four times more n-gram references; it also used more memory than the other delta compressors. The compression times, measured as the sum of the user time and system time of each process, are shown in Table 3.9. While xdelta3 is always the fastest, often by several times, it never achieved the best compression. Cobald is consistently slower than the other tools, typically two to four times slower. We expect this is partly due to the additional features of the data structures included for evaluation and comparison, and the time would be improved in an optimized implementation. The results show that cobald scales in a similar manner to the other tools, and is broadly competitive with them for delta encoding. Finally, the decompression times, shown in Table 3.10, are similar for all tools except xdelta3, which is significantly faster on many pairs of files.
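The user-plus-system timing measure used above can be reproduced with a small harness along the following lines (a Python sketch for Unix systems; it is illustrative only, not the script used for these experiments):

```python
import resource
import subprocess

def user_plus_system_seconds(cmd):
    """Run a command and return the user + system CPU time it consumed.

    RUSAGE_CHILDREN accumulates the CPU time of waited-for child processes,
    so the difference before and after the run isolates this command.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return ((after.ru_utime - before.ru_utime) +
            (after.ru_stime - before.ru_stime))

# Example: timing an xdelta3 encode run (the file names are placeholders).
# print(user_plus_system_seconds(
#     ["xdelta3", "encode", "-9", "-S", "djw", "-s", "REFERENCEFILE",
#      "INPUTFILE", "OUTPUTFILE"]))
```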

3.11 Discussion

From the comparison with other delta compression systems, it can be seen that cobald is an effective delta encoding system, achieving compression close to or better than existing systems, especially for files larger than 600 Mbytes. The decompression times are similar, and with content-defined selection cobald used substantially less memory than the other systems. The encoding time for cobald is two to three times slower, which is primarily due to the number of external reads of collection data during encoding, although some attention to optimization should yield a substantial improvement. On the data collections tested, we found that n-gram lengths of 256 and 1 024 gave the best trade-off between good compression and resource use. When cobald is used with 7-zip with these n-gram lengths, the compression was improved by between 5 and 20 times. Further work to find a more compact representation of n-gram references in the index and index file is needed for n = 64 to be practical. Furthermore, content-defined selection of fingerprints provides a large reduction in memory use and some speed improvement compared to fixed-offset selection, at the cost of a larger index file, as more fingerprints need to be added to achieve the same compression. With fixed-offset selection, memory use and encoding time are linear in the collection size. Using content-defined selection improves this, as n-grams from duplicate data can be identified and discarded. This process is investigated in detail in Chapter 4.

Chapter 4

CBC with Compressed Collections

In Chapter 3, we compressed files with respect to an uncompressed collection.

However, the storage space needed for large collections could become a significant impediment, and it would be advantageous if CBC could be performed when the collection is compressed. We have already seen how this may be practical using cobald. Specifically, when using content-defined selection, the cobald algorithm inserts only the first occurrence of a duplicated chunk into the index in memory, and later occurrences are never referred to in delta file COPY blocks. Consequently, the effectiveness of CBC would not be reduced if the subsequent chunk occurrences were removed from the collection. This is easily achieved using cobald by storing the delta files instead of the original collection files.

In this chapter, we investigate using cobald to compress the collection, then performing CBC with respect to this compressed collection of delta files. The scheme follows the general approach outlined in Section 2.9: an iterative process of encoding a file, inserting the delta into the collection index, encoding the next file with respect to the first delta, inserting the new delta into the index, and so on. The result is a compressed representation of the collection as delta files that contain references to matching strings in other deltas. This process can be thought of as an alternative method of de-duplication: the long duplicate strings in the collection files are replaced with references to a single occurrence elsewhere in the collection. But there are key differences from conventional de-duplication algorithms: the files are not partitioned into separate chunks prior to the matching process, and the string matches are confirmed with a byte-by-byte comparison, rather than assumed to match on the basis of matching fingerprints.

The encoding and compression pipeline for compressing a collection using cobald is shown in Figure 4.1. It is similar to the CBC pipeline in Figure 3.1, with the addition of the insertion process for adding the delta file to the fingerprint index and a link indicating that the delta file is stored in the collection. This highlights a key difference from the work described in Chapter 3: CBC involves encoding and compression with respect to a static collection that has been constructed and indexed at some prior time, whereas here we are concerned with the collection construction process, and are investigating how the delta encoding algorithms result in a more compact and efficient representation of the collection.

Retrieval of files from the collection is the same as decoding of files compressed using CBC, as described in Chapter 3: the delta file COPY blocks are replaced by the strings they refer to, which are retrieved from the collection files. As the delta files are added to the index, rather than the input files, each COPY block can be decoded by reading the matching string directly from the collection without needing to decode a delta file. To understand why this approach is preferred, consider the alternative of inserting the input files into the index before they are encoded. To retrieve a string match, the delta file stored in the collection must be decoded to retrieve the original input file. But part of the matching string may have been encoded as a COPY block referring to another input file, which must also be decoded.
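For reference, the iterative construction outlined above (encode a file against the current collection of deltas, then index and store its delta) can be written as a short structural sketch. The following Python fragment illustrates the loop only: the encoder and index here are deliberately trivial stand-ins, with hypothetical names and types, not cobald's algorithms.

```python
from pathlib import Path
from typing import Dict, List, Tuple

FingerprintIndex = Dict[int, Tuple[int, int]]   # fingerprint -> (delta id, offset)
Delta = List[Tuple[str, object]]                # e.g. ("ADD", bytes) or ("COPY", ...)

def encode_delta(data: bytes, index: FingerprintIndex) -> Delta:
    # Placeholder encoder: a real implementation would emit COPY blocks for
    # strings matched through the index and ADD blocks for the remainder.
    return [("ADD", data)]

def index_delta(delta: Delta, delta_id: int, index: FingerprintIndex) -> None:
    # Placeholder: a real implementation would insert selected n-gram
    # fingerprints from the delta's ADD blocks into the index.
    pass

def build_collection(files: List[Path]) -> List[Delta]:
    """Encode each file against the deltas already stored, then index its delta."""
    index: FingerprintIndex = {}
    collection: List[Delta] = []
    for delta_id, path in enumerate(files):
        delta = encode_delta(path.read_bytes(), index)   # encode against the current collection
        index_delta(delta, delta_id, index)              # index the delta, not the input file
        collection.append(delta)                         # the delta is what is stored
    return collection
```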


Figure 4.1: Functional diagram of the collection compression scheme implemented by cobald. The pipeline contains three stages: the encode process performs a delta encoding of the input using references to multiple collection files; the insert operation inserts references to n-grams from the delta file into the fingerprint index and stores the delta in the collection; and the compress operation compresses the delta using a conventional compression utility (gzip, LZMA, bzip2).

The result may be a long cascade of COPY blocks from multiple delta files which are decoded in order to retrieve the original string. Clearly, decoding will be more efficient from a collection where the index contains references to delta files.

The encoding algorithms from Chapter 3 could be used without modification to compress collections. However, many of the n-grams selected for index insertion would contain data from COPY blocks. Inserting COPY block data into the index would allow the possibility of a COPY block referring to a string containing another COPY block, and consequently decoding a delta file may require decoding a cascade of other delta files. COPY blocks are eliminated from the index by modifying the cobald insertion algorithm so that only n-grams in delta file ADD blocks are indexed, achieving a further reduction in the index size. A worked example of compressing a collection is shown in Figure 4.2.

There are additional benefits from indexing the delta files, compared to indexing the input files. As the delta files are smaller and the index file size is linear with the quantity of data being indexed, the index file is expected to be significantly smaller, resulting in a smaller index in memory when using fixed-offset n-gram selection.

File IDs: −1 denotes an ADD block, −2 a self reference.

  file1:                ABCDEFGHIJKLMN CDEFGHI OPQRSTUV
  file1.delta (ID 0):   <−1,14,> ABCDEFGHIJKLMN   <−2,7,2>   <−1,8,> OPQRSTUV

  file2:                AABBC GHIJKLM EEFFGGHHQRSTUIIJJK
  file2.delta (ID 1):   <−1,5,> AABBC   <0,7,18>   <−1,18,> EEFFGGHHQRSTUIIJJK

  n-grams appended to the index file:
    from file1.delta:   <7234, 0, 12> ABCD   <934, 0, 16> EFGH   <2564, 0, 20> IJKL
                        <23, 0, 58> OPQR     <4452, 0, 62> STUV
    from file2.delta:   <634, 0, 12> AABB    <934, 1, 49> EEFF   <6722, 0, 53> GGHH
                        <9103, 0, 57> QRST   <701, 0, 61> UIIJ

Figure 4.2: Example of encoding and inserting two files into a collection. COPY and ADD blocks in the delta files refer to the data shown alongside them, and each of the indexed n-grams is also listed next to its entry in the index file. File1 contains a duplicate string, CDEFGHI, and the second occurrence is encoded in file1.delta as a COPY block. The remainder of file1.delta is ADD blocks encoding the data that could not be matched. File1.delta is indexed by appending its fingerprints to the index file. File2 has two substrings that also occur in file1.delta. GHIJKLM is matched with IJKL, the third entry in the index file; the full match is found by extending forwards and backwards. The other match, QRSTU, is not found during encoding as it does not contain a complete n-gram listed in the index, but instead spans the boundary between two indexed n-grams. File2.delta is then indexed. Note that its second n-gram has a duplicate fingerprint with the second n-gram from file1.delta. This is a hash collision, as the two underlying strings differ, so a new collision set identifier is assigned.

However, this does not result in a corresponding reduction in the size of the index in memory when content-defined selection is used, as references to duplicate chunks are already detected and eliminated. Note that this is not a criticism of content-defined selection, but rather highlights that indexing the deltas is another method for removing references to duplicate strings from the index, and allows fixed-offset selection to obtain a similar memory reduction to content-defined selection. Overall, this is perhaps the key insight of our compression scheme: as files are encoded using the first string match found, any additional references to the matching string in the index will never be used, and the memory used by the algorithms can be reduced by eliminating references to duplicates without significantly affecting the compression achieved. Furthermore, while encoding with the first match might miss other, much longer matches in the collection, there is a high probability that the longer match will be found later in the encoding process and then fully recovered by extending backwards. So, while we do not explicitly search for the longest match in a greedy manner, the delta file is likely to be encoded using long matches where they exist.
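The collision-set mechanism illustrated in Figure 4.2 can be sketched as a small in-memory index. This is a toy illustration under simplifying assumptions (the stored n-gram bytes are kept in a dictionary rather than read back from the collection file, and the names are hypothetical); it is not cobald's data structure.

```python
from typing import Dict, Tuple

# (fingerprint, collision_id) -> (file_id, offset)
Index = Dict[Tuple[int, int], Tuple[int, int]]

def insert_reference(index: Index,
                     stored_grams: Dict[Tuple[int, int], bytes],
                     fp: int, gram: bytes, file_id: int, offset: int) -> None:
    """Insert one n-gram reference, assigning a new collision-set identifier
    when the same fingerprint already maps to a different underlying string."""
    cid = 0
    while (fp, cid) in index:
        if stored_grams[(fp, cid)] == gram:
            return                   # same string already indexed: keep only the first reference
        cid += 1                     # same fingerprint, different string: try the next collision set
    index[(fp, cid)] = (file_id, offset)
    stored_grams[(fp, cid)] = gram
```

In cobald the string comparison is performed by reading the n-gram back from the collection file, an external read, which is why insertion was observed to be slower when much of the inserted data duplicates data already in the index (Section 3.9).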

Collection compression in this manner, in the situation where the alphabet size is unlimited, was analysed by Storer and Szymanski (1982). When the delta employs pointers to a compressed representation of the collection files, which in our case are the deltas of files previously added to the collection, the problem of finding the most compact representation of the collection is NP-complete. Furthermore, when the matching strings referred to by pointers in the delta files are not permitted to themselves contain pointers (‘recursive pointers’ in the terminology of Storer), the compressed collection will have size at least O(√R), where R is the total length of the uncompressed collection files.

The encoding algorithm is necessarily heuristic and, moreover, we are unable to give an improved approximation bound on how closely the compressed collection approaches the most compact representation, beyond the O(√R) of Storer and Szymanski (1982). There are two main aspects of the encoding algorithm that resist analysis: the sequence in which files are added to the collection affects the compactness of subsequent delta files; and we have already seen in Section 2.5.2 that delta encoding with a minimum match length is sub-optimal.

4.1 Evaluation

The remainder of this chapter is an empirical investigation of the effectiveness of CBC with respect to a collection of deltas. There are two key questions we seek to answer: is storing the delta files an effective method of collection compression, and how effective is CBC with respect to this compressed collection? These questions are investigated separately in two sets of experiments, using several cobald configurations that performed well in Chapter 3. In both experiments, the Genome and Web snap-shots data collections from Chapter 3 are used, and the trials are run on the same computer system.

The first experiment investigates collection compression. Compression effectiveness is evaluated by comparing the storage space, computing time and memory required by cobald with the resources required to compress the collection using a single-file compressor. Four compressors were used in the trials: gzip, bzip2, 7-zip and xz. Detailed results are provided using gzip, as it is the established practice for several large genome data repositories, and also using 7-zip, as it achieved similar compression to xz and was superior to gzip and bzip2. Additionally, the collections were compressed using 7-zip in archive mode, in which inter-file similarity is used to compress files which are passed to it in a single batch.

Compression was measured using the total storage space needed to store the compressed collections. For the compression achieved by cobald, this was the sum of the delta file sizes and the index file size. The compressed storage space was expressed as a percentage of the total size of the uncompressed collection files, to obtain the measure of compression.

Results are also presented for the sizes of the compressed delta files following the second-stage compression. This allows us to quantify the compression improvement achieved by cobald as a pre-filter to single-file compression. The compression improvement is expressed as the ratio of the storage space needed by single-file compression without cobald, divided by the storage space needed after the deltas were compressed by the second stage. A compression improvement ratio of 1 indicates that cobald did not achieve any improvement, and higher numbers show an increasing benefit obtained by cobald.
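Written as a formula (the notation here is introduced only for clarity):

\[
\text{improvement ratio} \;=\; \frac{\text{size of the collection compressed without cobald}}{\text{size of the cobald deltas after second-stage compression}} .
\]

For example, taking the gzip row of Table 4.2 for Web snap-shots, 6 539 / 933 ≈ 7.0, the value reported in Table 4.3 for configuration F8.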

The second experiment evaluated CBC with respect to a collection of deltas. The aim of this experiment is not to evaluate the compression achieved, which was thoroughly investigated in the first experiment, but to test whether the CBC requirements stated in Chapter 1 are satisfied. The results are presented as charts of compression time versus input file size, and memory use versus collection size. The results are evaluated by inspecting the linearity of the observed trends.

4.2 Results: Compressed Collections

The Web snap-shots and Genome collections were compressed using cobald with several combinations of parameters that were found to be effective in Chapter 3.

The configurations are listed in Table 4.1. Two content-defined configurations are tested for each n-gram length, having maximum chunk lengths two and five times the expected chunk length.

  Test      Selection          Parameters                  Mean n-gram
  label     method             n        s       m          separation
  F8        fixed-offset       256      -       -          256
  CD8.2     content-defined    256      8       512        ≈256
  CD8.5     content-defined    256      8       1 280      ≈256
  F10       fixed-offset       1 024    -       -          1 024
  CD10.2    content-defined    1 024    10      2 048      ≈1 024
  CD10.5    content-defined    1 024    10      5 120      ≈1 024

Table 4.1: Configurations of cobald for the collection compression experiment. The parameters are the n-gram length, n, the number of 1 bits in the chunk selection mask, s, and the maximum chunk length, m. Note that 2^s = n for each of the content-defined configurations. The maximum chunk lengths are 2 and 5 times the n-gram length, as indicated in the configuration label.

The delta files were added to the collection one at a time, by encoding the input file and then inserting the delta into the hash index, in the sequence listed in the appendix. After the collections were completely encoded in this manner, the delta files were compressed by several standard compression utilities to obtain the compressed collection. Table 4.2 shows the total storage space used for the compressed collections, comprising the compressed delta files and the index file. These results are summarised in Table 4.3, showing the compression improvement achieved when cobald is used compared to compression without the cobald first stage. Compressing with cobald is significantly more effective in all cases. With the best cobald configuration (F8), four times more compression was achieved using cobald than using the best single-file compressors (xz, 7-zip) alone for the Web snap-shots collection, and more than 17 times more for the Genome collection. While the other cobald configurations achieve a little less improvement, the least effective configuration (CD10.5) is still worthwhile, achieving 3.2 times more compression on the Web snap-shots collection and 10.2 times for the Genome collection. The larger improvement observed with the Genome collection is primarily because it contains more data sets: there are 20 data sets in Genome, and only 8 in Web snap-shots. As data sets are added to a collection, the compression improvement increases as there are more opportunities to match data.

  Compression              Input      cobald encoding method
  method                   files      F8      CD8.2   CD8.5   F10     CD10.2  CD10.5

  Web snap-shots:
  uncompressed (Mbytes)   55 344   3 778   4 594   4 872   6 097   7 210   7 795
               (%)        100.00    6.83    8.30    8.80   11.02   13.03   14.08
  gzip         (Mbytes)    6 539     933   1 067   1 058   1 047   1 199   1 247
               (%)         11.82    1.69    1.93    1.91    1.89    2.17    2.25
  bzip2        (Mbytes)    3 955     711     802     794     756     859     889
               (%)          7.15    1.29    1.45    1.44    1.37    1.55    1.61
  7-zip        (Mbytes)    2 742     672     753     739     683     761     780
               (%)          4.95    1.21    1.36    1.34    1.23    1.38    1.41
  xz           (Mbytes)    2 672     667     747     732     674     751     768
               (%)          4.83    1.21    1.35    1.32    1.22    1.36    1.39
  7-zip archive (Mbytes)   2 741
               (%)          4.95

  Genome:
  uncompressed (Mbytes)   23 902   1 319   1 731   1 832   1 592   2 158   2 271
               (%)        100.00    5.52    7.24    7.66    6.66    9.03    9.51
  gzip         (Mbytes)    5 794     347     458     473     401     540     558
               (%)         24.24    1.45    1.92    1.98    1.68    2.26    2.33
  bzip2        (Mbytes)    5 662     292     389     404     369     501     521
               (%)         23.69    1.23    1.63    1.69    1.54    2.10    2.18
  7-zip        (Mbytes)    4 843     279     369     381     321     441     456
               (%)         20.26    1.17    1.55    1.60    1.34    1.85    1.91
  xz           (Mbytes)    4 818     278     369     381     320     440     455
               (%)         20.16    1.17    1.55    1.59    1.34    1.84    1.91
  7-zip archive (Mbytes)   4 576
               (%)         19.14

Table 4.2: Total size of the compressed collections, including the index file, using the various cobald test configurations. The input files column gives the sizes of the collection files compressed individually without using cobald as a preliminary stage. The remaining columns give the total size of the delta files and the index file for each combination of cobald test configuration and second-stage compression utility. The percentage rows show the compressed collection size as a proportion of the uncompressed collection size.

  Compression            cobald encoding method
  method           F8     CD8.2   CD8.5   F10     CD10.2  CD10.5

  Web snap-shots:
  gzip             7.0     6.2     6.3     5.9     5.2     5.0
  bzip2            5.6     5.0     5.1     4.8     4.3     4.2
  7-zip            4.1     3.7     3.8     3.7     3.3     3.3
  xz               4.0     3.7     3.7     3.6     3.3     3.2

  Genome:
  gzip            16.7    12.7    12.2    13.8    10.4    10.1
  bzip2           19.3    14.7    14.0    14.7    11.0    10.5
  7-zip           17.4    13.2    12.7    14.3    10.6    10.2
  xz              17.3    13.2    12.6    14.3    10.6    10.2

Table 4.3: Compression improvement from using cobald. The numbers shown are the compression improvement ratio: the size of the collection compressed without cobald (the input files column in Table 4.2) divided by the size of the delta files compressed with cobald followed by the second stage (the configuration columns in Table 4.2).

In most cases, the various cobald configurations perform as expected, with the compressed delta sizes in the order F8 < CD8.2 < CD8.5 < F10 < CD10.2 < CD10.5. However, an interesting anomaly occurred for the Web snap-shots collection, where the compressed deltas satisfied CD8.5 < CD8.2. At first this appeared to be an erroneous result, as the uncompressed delta file sizes were in the correct order CD8.2 < CD8.5. Detailed inspection of the individual file sizes showed that, after the first data set was encoded, subsequent data sets generated larger delta files using CD8.5, but these were compressed to smaller sizes compared to the CD8.2 compressed deltas.

These delta files share some common characteristics that suggest an explanation. The CD8.5 deltas had fewer COPY blocks than the CD8.2 deltas, and also had more data encoded as references into the collection files. The CD8.2 deltas were smaller overall as they had less data encoded as ADD blocks, and also had more data encoded as references to earlier in the input file. We hypothesise that the second stage was able to compress ADD blocks better than COPY blocks, which we consider to be a reasonable expectation as the COPY blocks contain integers that are likely to be unique to each block. It is not clear why the CD8.2 deltas had substantially more COPY blocks than the CD8.5 deltas in this case, but it is not surprising that less compression was obtained from files with more COPY blocks.

Also shown in Table 4.2 are the results of compressing the collections using 7-zip in archive mode. Rather than compressing the files individually, they were passed to 7-zip in a single batch, allowing 7-zip to find inter-file matching data, to the extent that the sliding window included multiple files. With the Web snap-shots collection, the compressed deltas were the same size as when they were compressed individually, achieving no compression improvement, while for the Genome collection compression improved by approximately 5%. This is likely to be because the files in Web snap-shots are mostly 2 Gbytes in size, which is greater than the 7-zip sliding window, while there are many smaller files in the Genome collection. Nonetheless, the deltas compressed using the cobald first stage are 17 times smaller, and we conclude that, for large collections, a compressor based on a sliding window over the concatenated collection files is inferior to our system, which can find duplicate data from any location in the collection.

We now look at the compression of each data set in the collections. Figure 4.3 shows the sizes of the compressed delta files as each data set was added to the cobald index. The files were compressed using the best cobald configuration, F8. For both collections, the compressed deltas from the first data set were much larger than for subsequent data sets, as the index was empty. Once the index was populated with the first data set, the delta sizes of the following data sets were fairly stable.

[Figure 4.3 here: (i) Web snap-shots, (ii) Genome — relative data set size (%) versus data set (week # / release #).]

Figure 4.3: Relative sizes of the compressed data sets in the collection compressed with test configuration F8. ‘Un-encoded, gzip’ and ‘un-encoded, 7-zip’ are the sizes of the data set files compressed individually using gzip and 7-zip. ‘Cobald, uncompressed’ indicates the sizes of the delta files for each data set, which were encoded after the deltas from the previous data sets were inserted into the cobald index. ‘Cobald, gzip’ and ‘cobald, 7-zip’ indicate the sizes of the delta files following the second-stage compression of the deltas by gzip and 7-zip.

The delta files for Genome release-49 and release-56 were noticeably larger than the other data sets because more new data was added to the collection in these releases, while the other releases have very little change from their predecessors. Figure 4.3 neatly demonstrates the advantage of cobald over the single-file compressors. After the first data set, cobald achieved a large improvement in compression, while the single-file compressors compressed each data set similarly to the first. The contribution of cobald to the total compression is the difference between the compressed delta size and the individually compressed data set sizes. For the second and following data sets, using 7-zip as the second stage, the improvement with the Web snap-shots collection was 5 to 10 times, and with the Genome collection was 10 to 25 times. Note that the index file size is not included in the compressed delta sizes, so the overall advantage from using cobald is slightly less. The compressed delta sizes are shown cumulatively in Figure 4.4, which clearly shows the increasing advantage of using cobald as additional data sets are added to the collection.

The cumulative size of the compressed delta for each of the cobald test configurations is shown in Figure 4.5. The results for the Web snap-shots collection conform to the expected ranking of F8 < CD8.2 < CD8.5 < F10 < CD10.2 < CD10.5, with CD10.5 deltas approximately two times larger than the F8 deltas. However, the results for the Genome data set were more interesting. They conformed to the expected order for the first 3 data sets, but then F10, and to a lesser extent CD10.2 and CD10.5, increased with a lower slope as these configurations generated smaller deltas. This is unexpected: the configurations using a shorter n-gram length (F8, CD8.2 and CD8.5 have n = 256) are able to find more matching strings than the configurations using longer n-grams (F10, CD10.2 and CD10.5 have n = 1 024). The effect emerges as a result of adding delta files to the index.

[Figure 4.4 here: (i) Web snap-shots, (ii) Genome — compressed collection size (Gbytes) versus uncompressed collection size (Gbytes).]

Figure 4.4: Cumulative size of the compressed collection with test configuration F8. ‘Un-encoded, gzip’ and ‘un-encoded, 7-zip’ are the sizes of files compressed individually using gzip and 7-zip. ‘Cobald, uncompressed’ indicates the cumulative size of the delta files in the collection, which were encoded after the deltas from the previous data sets were inserted into the cobald index. ‘Cobald, gzip’ and ‘cobald, 7-zip’ indicate the cumulative sizes of the delta files following the second-stage compression by gzip and 7-zip.

[Figure 4.5 here: (i) Web snap-shots, (ii) Genome — delta size (Gbytes) versus uncompressed collection size (Gbytes), for each test configuration.]

Figure 4.5: Cumulative sizes of the delta files as data sets are added to the collection for the various test configurations.

There are some strings in the early input files that were not found to contain matches using n = 1 024, but which were partially matched using n = 256. These strings then occurred unchanged in the following data sets. The n = 1 024 configurations were able to match these subsequent occurrences with the first occurrence as it was encoded entirely in an ADD block. However, the n = 256 configurations could only match the smaller substring which was not matched in the earlier delta. This seemingly unintuitive outcome reminds us that the compression achieved by the algorithm depends on the sequence in which files are added to the collection, and also shows that poor initial compression can lead to better compression of subsequent data.

Computation time The computation time is shown in Figure 4.6, for the cobald configuration F8, compared with compression by the single-file compressors 7-zip and gzip. For both collections, the cobald encoding time and the second-stage compression time of the first data set was longer than for subsequent data sets, as less compression was obtained and the delta files were larger. For the second and subsequent data sets, most of the time was for cobald encoding, and the second-stage compression of the delta files took less than 10% additional time. In each collection, compression of the second and subsequent data sets was significantly faster than compressing the input files with 7-zip, and two-stage compression was even faster than gzip for the Genome collection, although not for the Web snap-shots collection. This is an important result, showing that two-stage compression can use less computing resources than compression of individual files, while also achieving much greater compression.

Figure 4.7 shows the times to encode and insert the data sets using each cobald configuration. Fixed-offset selection with n = 1 024 (F10) was the slowest for both collections, while content-defined selection with a long maximum chunk length (CD8.5 and CD10.5) was the fastest.

[Figure 4.6 here: (i) Web snap-shots, (ii) Genome — time (minutes) versus uncompressed collection size (Gbytes).]

Figure 4.6: Computation time for each data set of the compressed collection with test configuration F8. ‘7-zip of input’ and ‘gzip of input’ are the times to compress the files individually using 7-zip and gzip. ‘Cobald’ indicates the time to encode the delta files, and insert them into the index. The delta files were encoded after the deltas from the previous data sets were inserted into the cobald index. ‘Cobald + gzip’ and ‘cobald + 7-zip’ indicate the total time to create the deltas and compress them in the second-stage compression by gzip and 7-zip. The time for the second-stage compression is the difference between these lines and the ‘cobald’ encode line.

[Figure 4.7 here: (i) Web snap-shots, (ii) Genome — time (minutes) versus uncompressed collection size (Gbytes), for each test configuration.]

Figure 4.7: Time to encode and insert the data sets for each cobald configuration. The times are shown cumulatively so that each point shows the time to encode and insert all of the preceding data sets.

Memory The maximum memory used to compress the collection is shown in Figure 4.8. The memory used by the cobald encoding stage gradually increased as files were added to the collection, which is due to the growth of the index. As delta files are added, rather than the input files, the index grows slowly when the new files are similar to the files already in the collection. This effect is seen with each collection, where the increase in memory use was much slower than the increase in collection size. However, there was a sudden increase for the Genome collection when the collection size was approximately 15 Gbytes, which is caused by the large proportion of new data in release-49.

The memory used by the second-stage compression using 7-zip is also shown in Figure 4.8. In the configuration used in these experiments, the memory used by 7-zip is limited to approximately 380 Mbytes. The maximum memory was required for all data sets in the Web snap-shots collection, and the two data sets having the largest deltas in the Genome collection. Compared to single-file compression using 7-zip, cobald encoding used an additional 20% memory for the Web snap-shots collection, and less than one third of the memory for the Genome collection. This difference is mainly because the index for the Web snap-shots collection is larger than for the Genome collection. The results with gzip are not shown as its memory use is limited to less than 1 Mbyte.

Figure 4.9 compares the maximum memory used to encode with the various cobald configurations. There was a large difference between the two n-gram lengths, with n = 256 using approximately 2.5 times the memory used with n = 1 024. There was a much smaller difference in the memory used between fixed-offset and content-defined selection. As stated earlier, the large difference observed in Chapter 3 has largely been eliminated by inserting the delta files into the index. This is because the duplicate strings in the input files, which were likely to be indexed multiple times by fixed-offset selection starting at different offsets, have been removed from the delta files.

[Figure 4.8 here: (i) Web snap-shots, (ii) Genome — maximum memory (Mbytes) versus uncompressed collection size (Gbytes).]

Figure 4.8: Maximum memory used to compress each data set of the collection with test configuration F8. ‘7-zip of input’ is the maximum memory used by 7-zip when compressing the input files individually. Note that in chart (i), the green line is overlaid by the purple line as 7-zip reached its memory limit when compressing both the input and delta files. ‘Cobald’ indicates the maximum memory used by cobald when encoding the files in the data set. Insertion of the delta files into the collection is not shown as it always uses less memory than encoding. ‘7-zip of deltas’ shows the memory used by 7-zip when compressing the deltas in the second stage. Memory use for gzip is not shown as it always used less memory than cobald.

4.3 Results: CBC on a Compressed Collection

We now return to the collection-based compression problem, and review the effectiveness of CBC with respect to collections containing delta files. Recall the requirements of a successful CBC system from Chapter 1: the compression time should increase linearly with the input file size, and the memory required should increase less than linearly with the collection size. The following experiments investigate whether these criteria are met when encoding with respect to the Web snap-shots and Genome collections that have been indexed as delta files.

To evaluate the relationship between encoding time and input file size, input files were created from the last data set from each collection, week8 in Web snap-shots and release-56 in Genome, then encoded using cobald with respect to collections containing the preceding data sets, week1 to week7 for Web snap-shots, and release-36 to release-55 for Genome. These collection files had previously been encoded, then their delta files inserted into the index in the manner described earlier in this chapter. To obtain a wide range of input file sizes, files were created by taking a prefix of each desired size from the largest file in the last data set.

The results for encoding time versus input file size are shown in Figure 4.10. Broadly, the encoding time appears to increase linearly with the input file size, although some aspects of the charts suggest some caution against a definitive conclusion. The ratio of largest to smallest input file sizes is only 8 times, and the results of encoding input files an order of magnitude larger are needed to be confident there is no super-linear contribution.

[Figure 4.9 here: (i) Web snap-shots, (ii) Genome — memory (Mbytes) versus uncompressed collection size (Gbytes), for each test configuration.]

Figure 4.9: Maximum memory used to encode each data set for each cobald configuration.

[Figure 4.10 here: (i) Web snap-shots, (ii) Genome — encode time (seconds) versus input file size (Mbytes), for each test configuration.]

Figure 4.10: CBC encode time by cobald versus input file size for the different configurations of the cobald encoding parameters.

As it is, a slight increase in the slope can be observed for F10 (fixed-offset selection with n = 1 024) for the larger files from Web snap-shots. There is also some small variation about a linear trend, which is expected from our understanding of the encoding algorithm in Chapter 3. The encoding operation has three modes which are each linear: extending a match forward, M_ext; searching for a match, M_search; and backward extension, M_back.

Each of these modes has a different speed, with M_ext being faster than M_search and M_back. Consequently, the encoding speed depends on the input data, with highly similar data being encoded faster than dissimilar data. Also, there will be some departure from linearity due to the specific operating environment of the computer, such as the increasing number of cache misses for larger files.

The relationship between the encoding times of the various cobald configurations is also complex. In particular, F8 (fixed-offset selection and n = 256) is the fastest algorithm for Web snap-shots, but the slowest for Genome. It is not clear why this occurred. There are aspects of the F8 algorithm that are both faster and slower than the other configurations. F8 has the largest index, and also inserts the most n-gram references in the input index, which requires more time. F8 will also take more time when searching for a match than encoding with content-defined selection, which does not need to query the index at every position. On the other hand, F8 will find the most matches and perhaps take less time, as extending forwards is faster than searching for a match. The balance between these trade-offs will depend on the degree of similarity in the data. The other configurations are ranked in the expected order for both Genome and Web snap-shots: CD8.5 is faster than CD8.2, and CD10.5 is faster than CD10.2, for both collections, as the number of maximal-length n-grams is lower, and these are searched by querying each n-gram in similar fashion to fixed-offset encoding. Overall, we conclude that the results are consistent with a linear relationship between encoding time and input file size, although this is not a conclusive demonstration.

The relation between memory use and collection size was evaluated by encoding a file with cobald using a range of subsets of each collection. The input file was the largest file in the last data set from each collection, week8 for Web snap-shots, and release-56 for Genome. The collection subsets available to the encoder were created by sequentially encoding each data set and inserting the delta files into the index, following the method described earlier in this chapter. For example, the Web snap-shots input file was first encoded with the week1 delta files in the index, then encoded again with the week1 and week2 delta files in the index, and so on until the final encoding used an index containing the deltas of week1, week2, ..., week7.

The results for memory use versus the collection size are shown in Figure 4.11. The broad relationship appears to be sublinear for the Genome collection, and linear for Web snap-shots. Again, we need to be cautious as the ratio of the largest to the smallest collection sizes is only 8 times for Web snap-shots and 20 times for the Genome data, and we consider that measurements over a larger range are needed to be confident that the relationship is sublinear. As we described in Chapter 3, there are three large data structures in cobald. For a large collection, the largest structure is the collection index, with size proportional to the total size of the indexed files. The other two data structures, the input index and the match list, are related to the input file size and will be effectively identical for each data point in this experiment. As the collection delta files were inserted into the index, the memory needed is proportional to the amount of unmatched data in the collection. As the collection grows, the collection index will increase with the amount of dissimilar data in the newly inserted files. We can see this in the Genome collection when the collection size increases beyond 15 Gbytes. This data set contains a large quantity of new data, while the preceding data sets were very similar to their predecessors.

[Figure 4.11 here: (i) Web snap-shots, (ii) Genome — memory (Mbytes) versus uncompressed collection size (Gbytes), for each test configuration.]

Figure 4.11: CBC memory use by cobald versus collection size for the different configurations of the cobald encoding parameters.

In summary, the relationship between memory and collection size depends on the collection data. If, as the collection grows, new files are increasingly similar to the files already in the collection, such as the Genome collection appears to be, then memory use grows sublinearly with the collection size. On the other hand, an influx of dissimilar data will result in a corresponding leap in memory use. Recalling that the purpose of this investigation is to achieve improved collection compression by exploiting inter-file duplication that is not exploited by existing methods, we consider that our algorithm successfully satisfies the requirement for sublinear memory use, within the scope we set in Chapter 1. Another way of viewing this outcome is that our CBC system provides substantial benefit for highly similar collections, and is no worse than other linear systems for collections with low similarity.

4.4 Discussion

In this chapter, we applied the CBC algorithms developed in Chapter 3 to the problem of compressing collections, then tested collection-based compression with respect to a collection compressed in this manner. By first encoding the collection files, then inserting the deltas into the index, the collection is compressed while also allowing a file to be decoded without needing to reconstruct the files referred to in its match references. A small modification was made to the insertion algorithm so that only data in delta file ADD blocks was indexed, achieving a modest reduction in the size of the index. A second compression stage, using a conventional single-file compressor, compressed the delta files.
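As an illustration of this two-step scheme, the sketch below adds a set of files to a collection one at a time and then applies the second compression stage. It is a minimal sketch only: the functions encode_delta and insert_delta are hypothetical placeholders standing in for cobald's encode and insert operations, not its actual interface, and the second stage here uses xz-format compression via Python's lzma module.

    import lzma
    from pathlib import Path

    def encode_delta(input_path: Path, index) -> bytes:
        """Hypothetical first stage: produce a long-range delta for input_path
        using matches against the fingerprint index of the collection."""
        raise NotImplementedError

    def insert_delta(delta: bytes, index) -> None:
        """Hypothetical insertion step: index the fingerprints of the delta
        (only its ADD blocks) so that later files can reference this one."""
        raise NotImplementedError

    def add_to_collection(files, index, out_dir: Path) -> None:
        """Add files one at a time: encode, insert, then compress each delta
        with a conventional single-file compressor (the second stage)."""
        out_dir.mkdir(parents=True, exist_ok=True)
        for path in files:
            delta = encode_delta(path, index)
            insert_delta(delta, index)
            (out_dir / (path.name + ".delta.xz")).write_bytes(lzma.compress(delta))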

Compressing the Web snap-shots (55 Gbytes) and Genome (24 Gbytes) collections in this manner achieved a substantial improvement in compression, compared to compression of the files individually with a conventional compressor.

With the F8 cobald configuration, the compressed Web snap-shots collection was 4 times smaller, and the Genome collection almost 17 times smaller, than with compression of the individual files by xz and 7-zip. Moreover, our two-stage compression required less computing resources, taking less time and less memory with n = 1 024, although a little more memory with n = 256 on the Web snap-shots collection. Results for decompression of files from the collection are not presented as decoding is unchanged from Chapter 3.

While these are noteworthy improvements in compression, there are several aspects of these results that suggest caution when evaluating the cobald algorithms, or when applying the algorithms in other situations. The experiments revealed a complex dependence of the compression achieved and resources needed on the input data. A high degree of similarity between input files is needed to achieve the observed compression improvements. Also, the compression effectiveness varies to a small degree with the sequence in which the input files are inserted into the index.

Caution is also needed when comparing the effectiveness of the algorithm with other CBC and collection compression algorithms. It could be argued that compression of the collection as individual files by standard compression utilities is not informative, as those algorithms were not designed for the purpose of compressing collections of similar files. In particular, we acknowledge that comparison with gzip is not informative regarding the effectiveness of our algorithm, as gzip uses comparatively minuscule computing resources. The comparison with gzip is included in this study as compression of individual files by gzip is the common use case employed by many existing collections. To assess the cobald algorithms, we focus on the comparison with 7-zip, which does use similar computing resources, and, along with xz, is the best available single-file compressor.

In particular, comparison with compression by 7-zip in archive mode, which did not achieve a large improvement compared to compressing individual files, is an informative test showing that the cobald algorithm does obtain a significant advantage. Unfortunately, we were unable to compare cobald with other systems. The comparison of delta encoders in Chapter 3 suggests that cobald would be more effective. However, these utilities are not well suited for compression of collections containing tens of gigabytes.

Six combinations of cobald parameters were tested, but we could not identify a single combination as being superior. F8 achieved the greatest compression, but F10, CD10.2 and CD10.5 used less than half as much memory. Encoding with a large maximum n-gram separation (CD8.5 and CD10.5) was a little faster. Selecting the most appropriate combination is a trade-off between compression, time and memory. An important improvement from the results in Chapter 3 is that inserting delta files into the index greatly improves the memory used by fixed-offset n-gram selection, so that there is little advantage from using content-defined selection.

Having compressed our collections, we reconsidered the collection-based compression problem. When compressing with the collection delta files as references, we found that the compression time is linear with the input file size, and the memory used is sublinear with the collection size when new data added to the collection has increasing similarity. These are essential requirements for a practical CBC scheme, as outlined in Chapter 1.

For the moment, the second compression stage is best performed by 7-zip or xz. However, future work on cobald will compress the deltas such that pseudo-random access into the compressed delta is supported, which allows a block of the compressed delta file to be decompressed rather than the entire file. Random access into the collection files is essential so that the encoder can resolve collisions and extend matching strings, and also for the decoder to retrieve COPY blocks.

We expect that this second stage will not be as effective as 7-zip or xz as it will have a much smaller sliding window, but the compression may nonetheless be complementary with cobald, which will have removed the long-range duplicates in the first stage. This issue is discussed further in Chapter 7.

While the memory used by cobald can be reduced by some careful optimisation, another area of future work is algorithms to implement a memory limit. When the limit is reached, fingerprint references must be discarded from the index, which will reduce the compression effectiveness. The algorithms will aim to reduce this impact and achieve a smooth and gradual performance degradation.

Chapter 5

Hash Function Testing

We now turn our attention to the hash functions employed in CBC systems to generate n-gram fingerprints. As noted previously, re-hashing the entire collection is expensive, perhaps even impractical, and we must expect that the hash function will be employed for the life of the collection. How can we be sure that a hash function is suitable?

The key requirements for a non-cryptographic, non-universal hash function are speed, and pseudo-randomness, which is the subject of this chapter. A non-random distribution of hash values can lead to an increased frequency of hash collisions, causing hash table query time to be greater than O(1). The mechanism is two-fold: multiple collisions at one or more high-frequency hash values lead to long chains at these entries in the hash table, so that these chains are searched more often, and each search takes more time. Reliable testing of hash function pseudo-randomness is important for numerous applications employing hash-based data structures. In the specific case of our CBC scheme, the collision resolution step involves an external read of an n-gram from the collection, and unnecessary collisions will rapidly degrade system responsiveness.

Collision testing is the principal evaluation methodology for non-cryptographic, non-universal hash functions, as described in Section 2.2.1. This involves observing the collisions obtained when hashing a suite of input keys having a similar distribution to the keys expected in the application. The selection of keys is important; as the hash function is not truly random, it cannot generate a random distribution of hash values for all possible sets of keys. The best we can aim for is that it behaves randomly for all of the keys it may be expected to hash. Collision testing is a form of black-box testing, where the evaluation is informed solely by the response to various inputs. The test method requires no knowledge of the internal operation of the hash function, and allows hash functions to be compared on equal terms, before assessing more subjective considerations such as their design and implementation.

Collision test outcomes are usually interpreted comparatively, ranking the hash functions from least collisions (best) to most collisions (worst) (Mulvey, 2007; Appleby, 2008; Estébanez et al., 2013). Interpretation of these results would be improved by including the significance of the test against the hypothesis that the distribution of hash values is truly random. The results could be supplemented with a p-value indicating the probability of the outcome being obtained by a random allocation of balls into bins. With this additional information, it may be that several of the functions at the top of the comparative ranking obtained significant p-values, indicating that they are statistically equivalent to a random function for that test. Then, rather than evaluate one function as superior to the others, we know they are all satisfactory. In short, we desire an absolute test against a concrete hypothesis, rather than a relative comparison with other functions.

An additional element of statistical testing is the possibility of accepting a bad hash function as good: a type II error. In comparative collision tests, the probability of type II errors is not usually published. However, an accurate measure of the statistical power of a test is necessary to determine how much testing is required in order to support the desired conclusions.
In this chapter, we present a collision testing method which evaluates, within well-defined statistical confidence limits and sensitivity, the behaviour of a hash function compared to a random allocation of balls into bins, as well as allowing comparative evaluation of multiple functions. Additionally, the test procedure employs multiple trials on subsets of the input key space, which may detect poor outcomes for some subsets that would not be found if the keys were tested in a single large set. A common example is sets of keys having a common prefix or suffix, such as “http://” or “@gmail.com”.

An alternate use for the test is to investigate properties of the hash function, such as balance, avalanche and resilience. This can be done by testing with keys having a specific distribution. In this chapter, we use randomly generated keys applied to unbalanced hash functions to demonstrate that the test is effective.

In Chapter 6 we use key sets having different permutations of an alphabet to investigate the resilience of the hash functions.

5.1 Black-Box Collision Test

The test procedure involves multiple trials in which 256 unique keys are hashed and inserted into a hash table of size 128. Typically, a hash function generates a much larger hash value, so the hash table entry is determined by selecting seven bits from the hash to create a selected hash. The bit positions are selected randomly for each trial, then the same bit positions are used for the 256 selected hashes within each trial. The length of the longest probe sequence (LLPS) is recorded for each trial, and the mean of these measurements is the statistic used to test the hypothesis that the hash function generates a random distribution of hash values.

LLPS was chosen as the test statistic as it directly measures the aspect of non-randomness that impacts on system performance: the creation of long chains in a hash table. Also, the theoretical distribution of LLPS for the ideal case of randomly allocated balls into bins is known for small hash tables (Reviriego et al., 2011), allowing the statistical significance and power of the test to be determined. This improves on the previous method of performing a single collision test on a larger hash table where the theoretical distribution of the collision measure is not known exactly, which prevents accurate calculation of the test significance. While it is not usual practice to insert more keys than the size of the hash table, we use twice as many keys to obtain a wider distribution of LLPS values, and to reduce the likelihood of a biased function achieving an apparently random outcome.

Table 5.1 shows the theoretical probability distributions of LLPS for several combinations of balls and bins calculated using Matlab software provided by Reviriego et al. (2011). The distribution for 256 balls into 128 bins is illustrated in Figure 5.1. Distributions for larger numbers of balls and bins were not obtained as the solution involves calculation of large combinatoric quantities and the calculation time increases rapidly. The calculation of 256 balls into 128 bins took 3 hours, but 256 balls into 256 bins did not complete after 2 days of computation. Reviriego et al. (2011) give E[LLPS] values for several larger combinations, but do not provide the distributions. Consequently, 7-bit selected hashes are used in the test.

The distribution of the mean of the LLPS measurements, X̄, from T > 30 trials is approximated by the normal distribution N(µ0, σ²/T) (central limit theorem), where µ0 is the mean, and σ² is the variance, of the theoretical LLPS distribution. The test significance is determined using a one-sided, single-sample t-test. The test hypothesis is that the hash distribution is produced by a random allocation of balls into bins, so the null hypothesis can be stated explicitly,

           Probability of LLPS when 256 balls go into ...
LLPS       124 bins     126 bins     127 bins     128 bins
 1         0.000000     0.000000     0.000000     0.000000
 2         0.000000     0.000000     0.000000     0.000000
 3         0.000000     0.000000     0.000000     0.000000
 4         0.000095     0.000148     0.000182     0.000223
 5         0.077965     0.090133     0.096540     0.103151
 6         0.437370     0.449160     0.454382     0.459160
 7         0.338497     0.325573     0.319114     0.312672
 8         0.112521     0.104585     0.100833     0.097219
 9         0.026887     0.024463     0.023341     0.022275
10         0.005472     0.004893     0.004629     0.004380
11         0.000997     0.000877     0.000823     0.000773
12         0.000166     0.000143     0.000133     0.000124
13         0.000025     0.000022     0.000020     0.000018
14         0.000004     0.000003     0.000003     0.000003
15         0.000000     0.000000     0.000000     0.000000

E[LLPS]    6.5941       6.5427       6.5175       6.4926
Var[LLPS]                                         0.8583

Table 5.1: Probability distribution of LLPS when 256 balls are randomly allocated to 124, 126, 127 and 128 bins, calculated using the method of Reviriego et al. (2011).

[Figure 5.1 here: probability versus LLPS (1 to 15).]

Figure 5.1: Theoretical probability distribution of LLPS for 256 balls into 128 bins, from Table 5.1.

H0 : µ0 = E[LLPS] = 6.493    (5.1)

and the alternative hypothesis,

H1 : µ0 = E[LLPS] > 6.493.    (5.2)

Note that the alternative hypothesis does not include the possibility that µ0 < E[LLPS] as the random allocation achieves the minimum E[LLPS]. An observation X̄ < E[LLPS] is unremarkable in the sense that it would not imply that the hash distribution was non-random. The critical region of the test, being the set of outcomes where the null hypothesis is rejected, is

(X̄ − µ0) / (σ/√T) ≥ zα.    (5.3)

The standard deviation of the theoretical LLPS distribution is σ = √Var[LLPS] = √0.8583 = 0.92645. For a significance level of α = 0.05, to obtain 95% confidence, z0.05 = 1.64485. The critical region calculated for several sample sizes is shown in Table 5.2. The p-value for the test is

p(X̄) = Φ(−(X̄ − µ0) / (σ/√T)),    (5.4)

where Φ is the standard normal cumulative distribution. The sensitivity of the test is not affected by the bit-selection operation used to obtain the selected hash from a much larger hash value. Recall that the significance level of the test is the probability of making a type I error: deciding that H0 is false when it is really true. When H0 is true, the hash values of the hash function have uniform probability, therefore the selected hashes from any selection of 7 bits will also occur with uniform probability.

T          Critical region (X̄ ≥)
100        6.645
300        6.581
1 000      6.541
3 000      6.520
10 000     6.508

Table 5.2: Critical regions for X̄, the mean observed LLPS, for 95% confidence with various numbers of trials, T, in the black-box collision test with 256 hashes into 128 bins.
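The entries of Table 5.2, and the p-value of equation 5.4, follow directly from the normal approximation. A minimal sketch using only the Python standard library (the constants are E[LLPS] and Var[LLPS] for 256 balls into 128 bins, from Table 5.1):

    import math

    MU0 = 6.4926                  # E[LLPS] for 256 balls into 128 bins (Table 5.1)
    SIGMA = math.sqrt(0.8583)     # sqrt(Var[LLPS])
    Z_05 = 1.64485                # z_alpha for a one-sided test at alpha = 0.05

    def critical_region(trials: int) -> float:
        """Smallest mean LLPS that rejects the null hypothesis (equation 5.3)."""
        return MU0 + Z_05 * SIGMA / math.sqrt(trials)

    def p_value(mean_llps: float, trials: int) -> float:
        """One-sided p-value for an observed mean LLPS (equation 5.4)."""
        z = (mean_llps - MU0) / (SIGMA / math.sqrt(trials))
        return 0.5 * math.erfc(z / math.sqrt(2))   # equals 1 - Phi(z)

    for t in (100, 300, 1000, 3000, 10000):
        print(t, round(critical_region(t), 3))      # reproduces Table 5.2

For example, the mean LLPS of 6.561 observed for HiLo with β = 0.10 in Table 5.3 gives p_value(6.561, 1000) ≈ 0.01, matching the tabulated value.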

To illustrate the sensitivity of the test, consider whether a function which is unbalanced to varying degrees would be detected when hashing random keys. If a hash function with a range of 128 hash values was unbalanced such that two of its hash values never occurred, but the keys were uniformly distributed over the remaining hash values, it would have the same theoretical LLPS as the 126-bin function. A function with 126 bins has E[LLPS] = 6.5427 (Table 5.1), which is within the critical region when T = 1 000 trials are used, and the test is able to detect the unbalanced hash function with greater than 95% confidence. To detect a function which is unbalanced such that 127 of the 128 hash values are used, which has E[LLPS] = 6.5175, T = 10 000 trials are needed.

What is an appropriate sensitivity for the test? The previous example, of an unbalanced function having the keys from a few hash values evenly redistributed among the others, can be thought of as a best case, or least-harmful, form of unbalance. The hash function still behaves as a balanced function, only with a few less bins, so that the increase in expected LLPS is small, and the resulting impact on system performance is limited. Another form of unbalance, where a few keys from each hash value are redistributed to a small number of high-frequency hash values, has more damaging effects. Long chains accumulate at the high-frequency hash values, and the distribution of hash values becomes increasingly unbalanced as more hashes are added to the table. We elected to set the test sensitivity to detect a moderate degree of the best-case type of unbalance; specifically, the test should detect a function that does not use two hash values, but is balanced over the remaining 126. Functions that are less unbalanced have minimal impact on system performance, while we ensure the test will detect the more damaging cases. Hence, we use 1 000 trials in the test, but note that the sensitivity can be increased by performing more trials.

It is useful to relate this intuitive indicator of unbalance to a formal measure, such as that of Bellare and Rogaway (1993), which we use in the later experiments. The balance B_h of hash function h : U → V is

B_h = log_M ( D² / (d_1² + d_2² + ... + d_M²) ),    (5.5)

where M is the size of the range of h, d_i is the number of keys in the pre-image h⁻¹(i) of hash value i, and D is the size of the domain of h. Note that Σ_{i∈V} d_i = D. As described in Section 2.2.1, balance is the normalised entropy of the distribution of hash pre-image sizes. A balanced function has a balance measure equal to 1.0, and a function hashing all keys to a single value has balance equal to 0.0. The relationship between balance and the example functions with unused hash values (and balanced over the remaining hash values) is shown in Figure 5.2. The numerical value of balance declines slowly as the number of unused hash values increases. Functions that are nearly balanced will have balance values very close to 1.0.
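Equation 5.5 is simple to compute from the pre-image sizes d_i. A short sketch, with an example corresponding to a function of range 128 that leaves two hash values unused and spreads the keys evenly over the remaining 126:

    import math

    def balance(preimage_sizes) -> float:
        """Balance measure of Bellare and Rogaway (1993), equation 5.5.
        preimage_sizes holds d_i for every hash value in the range, including
        zeros for unused hash values, so len(preimage_sizes) == M."""
        m = len(preimage_sizes)
        d_total = sum(preimage_sizes)               # D, the size of the domain
        denom = sum(d * d for d in preimage_sizes)  # d_1^2 + ... + d_M^2
        return math.log(d_total * d_total / denom, m)

    # Two unused hash values, 1 000 keys in each of the remaining 126 pre-images.
    sizes = [0, 0] + [1000] * 126
    print(round(balance(sizes), 4))   # ~0.9968, cf. Figure 5.2 and Table 5.3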

[Figure 5.2 here: balance measure versus number of unused bins (0 to 120).]

Figure 5.2: Relationship between the balance measure and the number of unused hash values of an unbalanced hash function having a range of 128 and which is balanced over the remaining hash values.

Statistical power The statistical power of the test is the probability of correctly accepting the alternative hypothesis, H1. When the hash function is unbalanced with expected LLPS µ, and {X̄ ≥ C} is the critical region where the null hypothesis is rejected, the power function K(µ) gives this probability,

K(µ) = Pr[X̄ ≥ C; µ]    (5.6)
     ≈ 1 − Φ((C − µ) / (σ/√T)),   µ ≥ µ0.    (5.7)

The approximation error is insignificant when T > 30 due to the central limit theorem. The power function is plotted in Figure 5.3 for several numbers of trials. When T = 1 000 trials, the statistical power approaches 1.0 when µ > 6.6.

Specifically, at µ = 6.594, the expected LLPS of an unbalanced hash function allocating random keys to only 124 of the 128 hash values, K(µ = 6.594) = 0.97. By definition, when the expected LLPS µ = C = 6.54, which is approximately the expected LLPS from a hash function using only 126 of its 128 hash values, the statistical power is 0.5.

[Figure 5.3 here: statistical power K versus expected LLPS µ, for 100 to 10 000 trials.]

Figure 5.3: The statistical power function of the test for several numbers of trials. The statistical power is the probability of accepting the alternative hypothesis when the hash function has E[LLPS] = µ.

The bit-selection operation, obtaining a 7-bit hash from the original hash value, affects the statistical power of the test, but we are unable to state conclusively whether statistical power is enhanced or diminished. For our test, statistical power is the probability of correctly identifying a non-pseudo-random hash function. Numerous such hash functions exist, each leading to a different distribution of the selected hashes, and we have not attempted to find a general solution solving all of the cases.

Nonetheless, we have analysed a single contrived example, and found that the statistical power is increased, by a minute amount, by the bit-selection step. The function is constructed so that the statistical power depends on several E[LLPS] results for balanced functions which were calculated using the method of Reviriego et al. (2011). The function is extremely unbalanced, but nonetheless illustrates some aspects of the problem supporting our tentative hypothesis that the bit-selection process does not greatly diminish the statistical power of the test.

Example of statistical power with an unbalanced hash function Consider a hash function that does not use one half of its 128 hash values, and uniformly distributes hashes over the remaining 64. It does this in a manner such that the first bit of the hash value is always equal to 0. When we fill a hash table with the hashes of 256 random keys, without using the bit-selection operation, the theoretical LLPS distribution into 128 bins has µ0 = E[LLPS] = 6.4926 and σ² = Var[LLPS] = 0.8583 (from Table 5.1). However, our unbalanced hash function is effectively a balanced function with 64 hash values, and we calculate that its distribution has µ = E[LLPS] = 9.3703. The function is very unbalanced, so we test using T = 100 trials, and calculate the critical region with confidence level of α = 0.05 as (C − 6.4926)/√(0.8583/100) ≥ 1.6449, so that C ≥ 6.6450. Then the power of this test for the unbalanced function is K(µ = 9.3703) = 1 − Φ((C − 9.3703)/√(0.8583/100)) = 1 − Φ(−29.4169) ≈ 1 − 10^−185.

Now we introduce the bit-selection operation. Six of the seven bits of the hash are selected, and we again perform 100 trials of populating a hash table with hashes of 256 random keys. For 256 balls into 64 bins, the theoretical µ′0 = E[LLPS] = 9.3703, and the variance is σ′² = 1.4393. The critical region with confidence level of α = 0.05 is (C′ − 9.3703)/√(1.4393/100) ≥ 1.6449, so that C′ ≥ 9.5676. There are seven possible bit combinations. Six of them include the first bit, which is never equal to 1, so these selections will only fill 32 of the possible 64 hash values in the 6-bit hash function, which has E[LLPS] = 14.3840. The last combination does not contain the first bit, and fills all 64 of its hash values uniformly, with E[LLPS] = 9.3703. Assuming the bit combinations are selected randomly, the overall expected mean LLPS of the unbalanced function from the 6-bit tests is the weighted average of these two cases, µ′ = (6 × 14.3840 + 1 × 9.3703)/7 = 13.6678. The statistical power is K(µ′ = 13.6678) = 1 − Φ((C′ − 13.6678)/√(1.4393/100)) = 1 − Φ(−34.1766) ≈ 1 − 10^−253.

This function is extremely unbalanced, so the statistical power is very close to 1. Nonetheless, the statistical power is higher when using the bit-selection operation. An important effect demonstrated by the example is that different bit selections from an unbalanced hash function lead to different values for the expected LLPS of the selected hashes. Consequently, the variance of the observed LLPS values is greater than the variance of the LLPS of the original hash values. As the LLPS distribution is skewed toward larger values, an increased variance implies a larger observed mean, and an increased statistical power. However, the theoretical mean LLPS of balls into the reduced number of bins is also higher, leading to a larger value for the critical region and a reduction in statistical power. In general, we are unable to determine which of these opposing effects is greater. However, we expect that neither effect will dominate, and that the statistical power is not substantially diminished when bit selection is used. Later in this chapter, we test this empirically.
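The power function (5.7), and the numbers quoted above, can be checked with a few lines of standard-library Python, evaluating Φ via math.erfc:

    import math

    def phi(z: float) -> float:
        """Standard normal cumulative distribution function."""
        return 0.5 * math.erfc(-z / math.sqrt(2))

    def power(mu: float, mu0: float, var: float, trials: int,
              z_alpha: float = 1.64485) -> float:
        """Probability of rejecting H0 when the true expected LLPS is mu (equation 5.7)."""
        se = math.sqrt(var / trials)
        c = mu0 + z_alpha * se          # critical region boundary (equation 5.3)
        return 1.0 - phi((c - mu) / se)

    # T = 1 000 trials, unbalanced function using 124 of 128 hash values.
    print(round(power(6.594, 6.4926, 0.8583, 1000), 2))    # ~0.97

    # The contrived example: without and with 6-bit selection, T = 100 trials.
    print(power(9.3703, 6.4926, 0.8583, 100))    # ~1; argument of Phi is about -29.4
    print(power(13.6678, 9.3703, 1.4393, 100))   # ~1; argument of Phi is about -34.2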

Statement of the test procedure The specific steps of the test procedure we choose are,

1. Obtain 1 000 sets of 256 distinct keys. The keys should reflect the distribution expected in the application.

2. For each key set, make a random selection of 7 bit positions in the hash.

3. For each key set, hash the keys and convert the hashes into 7-bit hashes, known as selected hashes, using the bit positions selected in step 2. Insert the selected hashes into a hash table of size 128, and measure the LLPS.

4. Find X̄, the mean of the 1 000 LLPS measurements.

5. If X̄ < 6.541, accept the null hypothesis and conclude, with 95% confidence, that there is no evidence that the hash function does not produce a random allocation of hashes; or, if X̄ ≥ 6.541, do not accept the null hypothesis, and conclude that the hash function does not produce a random allocation of hashes.

The random selection of bits in step 2 must be different for each key set to ensure a good coverage of the whole hash function. In hash applications, it is a common practice to truncate a large hash to obtain a shorter hash from the least significant bits. This works well when the hash function is unbiased, but is not sound when it is not known that the hash function behaves well.

A random selection of bit positions can be efficiently obtained using Floyd’s algorithm (Bentley and Floyd, 1987) for random sampling from a set without replacement.
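A minimal sketch of this procedure is given below, assuming only a hash function that returns an integer of at least hash_bits bits. Bit positions are chosen with Floyd's sampling algorithm, and the LLPS of a trial is taken as the largest bucket count in the 128-entry chained table.

    import os
    import random
    import zlib
    from collections import Counter

    def floyd_sample(n: int, k: int, rng: random.Random) -> list:
        """Floyd's algorithm: k distinct values sampled from range(n)."""
        chosen = set()
        for j in range(n - k, n):
            t = rng.randint(0, j)
            chosen.add(t if t not in chosen else j)
        return sorted(chosen)

    def trial_llps(keys, hash_func, hash_bits: int, rng: random.Random) -> int:
        """One trial: hash the keys, form 7-bit selected hashes from randomly
        chosen bit positions, insert into a table of size 128, return the LLPS."""
        bits = floyd_sample(hash_bits, 7, rng)
        table = Counter()
        for key in keys:
            h = hash_func(key)
            selected = sum(((h >> b) & 1) << i for i, b in enumerate(bits))
            table[selected] += 1
        return max(table.values())      # length of the longest chain

    def mean_llps(key_sets, hash_func, hash_bits: int, seed: int = 0) -> float:
        """Mean LLPS over all key sets; compare with the critical region of 6.541."""
        rng = random.Random(seed)
        return sum(trial_llps(k, hash_func, hash_bits, rng) for k in key_sets) / len(key_sets)

    # Example: CRC32 on 1 000 sets of 256 random 32-byte keys.
    key_sets = [{os.urandom(32) for _ in range(256)} for _ in range(1000)]
    print(mean_llps(key_sets, zlib.crc32, hash_bits=32))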

5.2 Validation of the Collision Test

We performed an empirical validation of the hash function test to demonstrate that it is able to identify both pseudo-random and non-pseudo-random functions. These experiments also show that the bit-selection step does not significantly diminish the power of the test. To do this, we constructed two families of hash functions having a parameterized degree of unbalance and applied the test procedure to them using randomly generated keys. The use of unbiased keys is necessary in these tests to ensure that the distribution of hashes is entirely due to the degree of unbalance of the hash function, and is independent of other function properties, such as avalanche and resilience. Although this situation may seem contrived, and hashing of random keys is an uninteresting application, the purpose is not to demonstrate a useful hash function, but to validate the test procedure.

Generating unbalanced hashes The unbalanced functions, which we refer to as the derived functions, were created from a balanced hash function with a larger range by partitioning the balanced hashes in an uneven manner. Let the balanced function have a range of W hash values, which we refer to as bins, and the derived functions have V < W hash values, which are called buckets. When the derived function is balanced, each bucket contains W/V bins.

The derived functions were implemented as an array indexed by the bucket number, with each element containing the least bin in the range corresponding to that bucket. The bucket containing a given bin was found by linear search through the array. In the experiments, the unbalanced functions were all less than 1 million buckets in size, so the array occupied a fraction of system memory. The CRC32 checksum function, which generates a 32-bit hash, served as the balanced hash function. CRC32 is known to be balanced as it involves a simple division in the finite field F2 by a primitive polynomial. The implementation was from the zlib module in the Python 3.3.0 distribution.

Two families of derived functions were implemented. The first, named HiLo, was created by starting with a balanced function: each bucket contained W/V bins. Then the buckets were grouped into pairs. A number of bins were moved from one bucket in the pair to the other bucket. The resulting function has two groups of buckets: the ‘low buckets’ having fewer bins than a balanced function, and the ‘high buckets’ having more bins by the same amount. The distribution of bucket sizes is shown in Figure 5.4 part (i). The degree of unbalance was determined by the parameter β, 0 ≤ β ≤ 1, the fraction of bins moved between buckets. Specifically, the number of bins in a low bucket was w_low = ⌈(1 − β)W/V⌉, and the number of bins in a high bucket was w_high = ⌈(1 + β)W/V⌉. A value of β = 0.0 resulted in a balanced function as no bins were moved, and the maximum value of β = 1.0 resulted in all of the bins being moved, so that one half of the hash values would never occur.

The second derived hash function family, named Redistribute, was created from a balanced function by removing all of the bins from a few buckets and redistributing them uniformly to the remaining buckets. The controlling parameter, γ, 0 ≤ γ ≤ V − 1, specified the number of buckets to redistribute, so that ⌈γW/V⌉ bins are removed from ⌈γ⌉ buckets and distributed over the remaining V − ⌈γ⌉ buckets. For example, if a balanced bucket contains 256 bins, γ = 4.5 specifies a function with four empty buckets, one half-empty bucket with 128 bins, and the remainder having 266 bins in 45 buckets and 265 bins in 78 buckets. A value of γ = 0.0 specifies a balanced function, and γ = V − 1 obtains a function where all keys hash to the same value. The distribution of bucket sizes is shown in Figure 5.4 part (ii).

The random keys were strings of length 32 over an alphabet of size 256: that is, they were arrays of 32 random bytes. The bytes were generated using the Python random library. In each individual trial, duplicate keys were discarded.
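A sketch of the HiLo construction, with CRC32 as the balanced base function as described above; as in the experiments, the bucket for a hash is found by a linear search over the array of least bins. The Redistributed family can be built the same way by supplying a different list of bucket sizes.

    import math
    import zlib

    W = 2 ** 32     # range of the balanced base function (CRC32 bins)
    V = 128         # range of the derived function (buckets)

    def hilo_bucket_sizes(beta: float) -> list:
        """Bucket sizes for the HiLo family: buckets are paired and a fraction
        beta of a balanced bucket is moved from the low to the high bucket."""
        w_low = math.ceil((1 - beta) * W / V)
        w_high = math.ceil((1 + beta) * W / V)
        return [w_low, w_high] * (V // 2)

    def make_derived(bucket_sizes):
        """Build a derived hash function mapping a key to one of V buckets."""
        least_bins = []
        start = 0
        for size in bucket_sizes:
            least_bins.append(start)
            start += size

        def derived(key: bytes) -> int:
            h = zlib.crc32(key)                      # balanced 32-bit hash (the bin)
            bucket = 0
            for i, least in enumerate(least_bins):   # linear search for the bucket
                if h >= least:
                    bucket = i
                else:
                    break
            return bucket

        return derived

    unbalanced = make_derived(hilo_bucket_sizes(beta=0.2))   # cf. Figure 5.4 (i)
    print(unbalanced(b"example key"))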

5.3 Results Without Bit-Selection

We first verify that the test reliably detects a biased distribution of hash values, while omitting the potentially complicating bit-selection step. This was done using derived hash functions that generated 7-bit hashes. Consequently, all 7 bits of the hash value were needed to create the selected hash, and step 2 of the test procedure, for making a random selection of hash bits, is not needed.

The results of the experiment are shown in Table 5.3 and Figure 5.5. Both families of derived functions, HiLo and Redistributed, were used with a range of parameter values to obtain various degrees of unbalance.

[Figure 5.4 here: bucket sizes as a fraction of a balanced bucket versus hash value, for (i) HiLo and (ii) Redistributed.]

Figure 5.4: Illustrative distributions of bucket sizes in the derived 7-bit functions, as a fraction of a balanced bucket. In HiLo, β = 0.2, and in Redistributed, γ = 16.0.

Each data point is the mean observed LLPS from 1 000 trials of filling a hash table of size 128 with the hashes of 256 random keys as described in Section 5.2, using a fixed balance parameter. In Figure 5.5, the mean observed LLPS is plotted against decreasing function balance. The results from each derived function family show a broadly linear relation between LLPS and unbalance. There is some variability around the overall trend, reflecting the randomness involved in the experiment.

Interestingly, the results for each hash function family follow trend lines with different gradients. This suggests that the relationship between expected LLPS for unbalanced functions and the balance measure is complex. The difference between the two unbalanced function families is in the distribution of the unbalanced bins: in HiLo, the unbalanced bins are spread evenly over half of the buckets, so the range of bucket sizes is more compact compared to the Redistributed family, where some of the buckets are empty.

(i) HiLo                                     (ii) Redistributed

                 mean observed                                mean observed
β       balance  LLPS      p-value           γ       balance  LLPS      p-value
0.00    1.0000   6.480     0.666             0.00    1.0000   6.490     0.535
0.05    0.9995   6.490     0.535             0.05    0.9999   6.515     0.222
0.07    0.9990   6.511     0.265             0.07    0.9999   6.535     0.074
0.10    0.9979   6.561*    0.010             0.10    0.9999   6.473     0.748
0.15    0.9954   6.685*    0.000             0.50    0.9996   6.476     0.715
0.20    0.9919   6.769*    0.000             1.00    0.9984   6.529     0.107
0.25    0.9875   6.895*    0.000             1.50    0.9980   6.504     0.349
0.30    0.9822   7.084*    0.000             2.00    0.9968   6.543*    0.043
0.35    0.9762   7.224*    0.000             2.50    0.9963   6.565*    0.007
0.40    0.9694   7.455*    0.000             3.00    0.9951   6.596*    0.000
0.45    0.9612   7.492*    0.000             4.00    0.9935   6.567*    0.006
0.50    0.9540   7.681*    0.000             5.00    0.9918   6.653*    0.000
                                             6.00    0.9901   6.651*    0.000
                                             7.00    0.9884   6.695*    0.000
                                             8.00    0.9867   6.685*    0.000
                                             9.00    0.9850   6.722*    0.000
                                             10.00   0.9832   6.737*    0.000
                                             11.00   0.9815   6.812*    0.000
                                             12.00   0.9797   6.863*    0.000

Table 5.3: Results for the test without bit-selection. The mean observed LLPS and p-values from 1 000 trials of 256 random 32-byte strings hashed by the derived functions, HiLo and Redistributed, with a range of parameter values. A balance value of 1.0 indicates a balanced function, and the functions become more unbalanced as the balance value decreases. LLPS observations in the 95% critical region of $\bar{X} > 6.541$ are marked with an asterisk.


Figure 5.5: Results for the test without bit-selection, listed in Table 5.3. The mean LLPS observed from 1 000 trials of 256 random 32-byte strings hashed by the unbalanced functions HiLo and Redistributed having 7-bit hashes. Balance equals 1.0 for a balanced function, and decreasing balance to the right of the x-axis indicates a more unbalanced function. The critical region for the test procedure, mean observed LLPS $\bar{X} > 6.541$, and the expected LLPS, $E[\mathrm{LLPS}] = 6.4926$, for a random allocation of 256 balls into 128 bins are shown as dashed lines.

The difference between the two unbalanced function families is in the distribution of the unbalanced bins: in HiLo, the unbalanced bins are spread evenly over half of the buckets, so the range of bucket sizes is more compact compared to the Redistributed family, where some of the buckets are empty. The relationship of LLPS to the distribution of pre-image sizes of an unbalanced function appears to depend on skewness or higher moments. Consequently, we urge caution when using the balance measure. It is a convenient indicator of the distance from balance, but care is needed when using it for comparison between unbalanced functions having different distributions.

The experiment obtained a sensitivity matching the intended sensitivity of the test procedure. The test is designed to detect a function assigning hashes to only 126 of its 128 hash values, which is the case when the Redistributed function has $\gamma = 2$; and we see in Table 5.3 that this is the first data point that the test procedure detected as not being a balanced function. The HiLo function was also detected as being unbalanced at a similar balance value, although, as noted earlier, we should not make a direct comparison between the functions based on the balance measure. We conclude that the test is able to accurately discriminate between pseudo-random and non-pseudo-random functions with the intended sensitivity.

5.4 Results With Bit-Selection

Having seen in the previous experiment that the test is able to detect a biased distribution of hash values, we now investigate the effect of the bit-selection procedure when testing hash functions having a larger range. An anticipated criticism of the test procedure is that a hash table of size 128 is too small, and seemingly provides little insight into the behaviour of hash tables of many millions of hashes.

However, this experiment shows that the test procedure is valid, and is in fact preferable to a test which fills a single large hash table, as more data points are obtained.

The results are shown in Table 5.4 and Figure 5.6. The Redistributed family of unbalanced hash functions was used with a hash length of 17 bits, obtaining a range of 131 072 hashes (the HiLo family was not used in this experiment). From these hashes, selected hashes were obtained, similarly to the test procedure, but with a range of lengths from 7 to 17 bits. Consequently, the selected hashes were inserted into hash tables having a range of sizes, from 128 values for the 7-bit selected hashes, to 131 072 values for the 17-bit selected hashes. For each test, the same number, 1 024 × 256, of random 32-byte keys were hashed, so that the tests involved comparable computational effort. The number of hashes inserted into each table was twice the size of the table, so that the LLPS distributions were broadly comparable between the tests. As noted earlier, when the number of keys is a small fraction of the hash table size, there is a greater likelihood that the distribution of hash values will appear to be random. Consequently, the number of trials performed in each test was progressively smaller for larger selected hashes. Specifically, for the 7-bit selected hashes, there were 1 024 trials in which 256 hashes were loaded into a hash table of size 128; for 9 selected bits, there were 256 trials with 1 024 hashes loaded into a table of size 512; and so on. Finally, in the test using 17 selected bits, all 262 144 hashes were inserted into a single hash table of size 131 072. The numbers of trials for the selected bit lengths are shown in the bottom line of Table 5.4. Consequently, the experiment investigates the trade-off between testing longer hashes, which require a larger hash table and fewer trials, compared to shorter hashes which allow smaller hash tables and more trials. For each trial, a new random selection of the hash bits was made.

The tests with 7 selection bits which were detected as being unbalanced are marked in Table 5.4. Note that the test procedure confidence was calculated for 1 000 trials, whereas these results used 1 024 trials (to provide convenient division by 2 for the number of trials with more selected bits), so the confidence is slightly greater than 95%.
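The bit-selection step itself is simple; the sketch below shows one trial under stated assumptions: the 17-bit hashes are available as Python integers, and the maximum slot occupancy of the table is used as a stand-in for the LLPS statistic.

    import random

    def select_bits(hash_value, positions):
        # Pack the bits found at the chosen positions into a smaller hash.
        selected = 0
        for i, pos in enumerate(positions):
            selected |= ((hash_value >> pos) & 1) << i
        return selected

    def run_trial(hashes, hash_bits=17, selected_bits=7):
        # One trial: pick a fresh random set of bit positions, form the
        # selected hashes, and fill a table of size 2**selected_bits.
        positions = random.sample(range(hash_bits), selected_bits)
        table = [0] * (1 << selected_bits)
        for h in hashes:
            table[select_bits(h, positions)] += 1
        return max(table)   # stand-in for the LLPS of this trial

In the experiment above, each trial receives twice as many hashes as there are table slots, and a new random selection of positions is drawn for every trial.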

                                   number of selected hash bits
  γ            balance     7         9        11        13        15       17
  0.0          1.0000      6.488     7.485    8.430     9.338     10.075   11.1
  1 024.0      0.9993      6.528     7.438    8.438     9.313     10.00    11
  2 048.0      0.9987      6.525     7.523    8.547     9.688     10.25    14
  4 096.0      0.9973      6.485     7.582    8.265     9.500     10.50    10
  8 192.0      0.9945      6.551 *   7.730    8.516     9.375     10.25    12
  16 384.0     0.9887      6.705 *   7.781    8.750     9.813     10.75    11
  24 576.0     0.9824      6.833 *   8.000    9.063     10.063    11.00    12
  32 768.0     0.9756      7.023 *   8.191    9.453     10.625    11.00    14
  40 960.0     0.9682      7.093 *   8.410    9.609     11.063    12.00    12
  49 152.0     0.9601      7.308 *   8.828    10.328    11.563    12.50    14
  57 344.0     0.9512      7.475 *   9.207    10.328    11.688    13.25    14
  65 536.0     0.9412      7.653 *   9.285    10.578    10.313    14.50    15
  number of trials         1024      256      64        16        4        1

Table 5.4: Results for the test using bit-selection. A Redistributed unbalanced function generated 17-bit hashes from which selected hashes with fewer bits were created. In each test, 262 144 keys were hashed and inserted into hash tables. In each trial, the number of hashes inserted into the hash table was twice the hash table size, so fewer trials were run in tests having more selected bits, as listed on the bottom line. The number of keys used for the tests with a balanced function ($\gamma = 0.0$) was 2 621 440, to obtain a more accurate estimate of the theoretical E[LLPS]. LLPS observations using 7 selected bits that were greater than the rejection threshold of 6.541 are marked with an asterisk.


Figure 5.6: Results for the test using bit-selection, shown in Table 5.4. 17-bit hashes were generated by derived functions from the Redistributed family with a range of unbalance. These hashes were tested for a random distribution using several numbers of selected bits. Each curve shows the mean LLPS observations using a given number of selected bits over a range of function unbalance. Balance equals 1.0 for a balanced function and decreasing balance indicates a more unbalanced function.

However, the theoretical E[LLPS] is not known for the larger selected hashes, and a critical region cannot be determined. Instead, an estimate of E[LLPS] was made experimentally by performing 10 times as many trials using a balanced function (when $\gamma = 0.0$). Consequently, evaluation of the longer selected hashes can only be by comparison to the mean LLPS when $\gamma = 0.0$, and we do not know the variance or an estimate of the 95% confidence limit.

Figure 5.6 shows that the mean LLPS for each selected hash size increases fairly linearly with the degree of unbalance, in a manner similar to the results from the previous experiment without the bit-selection step. With 7 selected bits, the LLPS exceeds the critical value at a similar balance to the first experiment, indicating that there was no reduction in LLPS caused by the bit-selection step. The curves are separated vertically as the E[LLPS] value is higher for the larger hash tables. The deviations about the linear trend are greater for the larger selected hash sizes as the number of trials is smaller, resulting in a larger variance for the mean observed LLPS. Note that the runs with a selected hash length of 17 bits have only a single measurement of LLPS, so the mean LLPS must be a whole number.

Figure 5.7 compares the results from the previous experiment with 7-bit hashes and no bit-selection with the results from this experiment with 17-bit hashes and the selection of 7 bits. The broad relationship between the mean observed LLPS and the degree of unbalance is similar for both data sets. Considering the noise present in the data, as expected from the random nature of the test, the chart gives no support for the hypothesis that the bit-selection process leads to a reduced LLPS from unbalanced functions, and in fact suggests that the LLPS is higher. Consequently, we tentatively conclude that the statistical power of the test is unaffected by the bit-selection step.

Overall, the test using 7 selected bits is able to detect an unbalanced function at least as well as a test that uses more selected bits, while performing the same number of hash calculations. The sensitivity with 7 selected bits is similar to the results obtained in the previous experiment, and appears to be unaffected by the inclusion of the bit-selection step. Even if the E[LLPS] were known for the larger hash tables with more selected bits, it seems likely that these tests would not be more accurate, as fewer trials are performed, resulting in a wider variance in the mean LLPS measurement.

5.5 Discussion

The black-box collision test procedure is able to correctly identify a sufficiently unbalanced hash function. It provides a significance measure of hash function performance compared to an ideal random allocation of balls into bins; an absolute measure of randomness rather than a relative comparison to other hash functions. If more precision is required, then it can easily be improved by increasing the number of trials. While the analysis and empirical evaluations presented in this chapter have all involved hashing random keys with unbalanced hash functions, the test is intended for evaluating functions in other situations. The results from hashing random keys are well known, depending only on the function balance, and are independent of other properties such as avalanche and resilience. As such, random keys provide a convenient setting for demonstrating the effectiveness of the test. However, the test is also applicable in the practical circumstance of biased keys.

The test can be used to evaluate a hash function for an application by using keys that are representative of the future data. By performing multiple small trials, small sets of keys representing specific situations can be included in the test, such as changes to single characters, or common prefixes.


Figure 5.7: Comparison of testing with and without bit-selection. The mean observed LLPS using the Redistributed function family with 7-bit hashes without bit-selection from Table 5.3, and with 17-bit hashes tested with 7 selected bits from Table 5.4. Expected LLPS is the theoretical LLPS for a uniform allocation of 256 balls into 128 bins, and the critical region is the 95% confidence limit.

Additionally, other function properties can be tested by employing a specific set of keys, such as in the next chapter, where we test function resilience by hashing key sets created from permuted alphabets.

One aspect of this investigation deserving further work is to perform more detailed empirical studies. There are several parameters that could be varied, such as the hash table size and the number of keys which are hashed. The values we chose work well, but there may be more effective settings. The impact of the bit-selection step is another aspect requiring detailed experimental study, using larger hashes, a greater range of functions, and more observations close to the critical region. While our experiments suggest that the power of the test is not decreased by the bit-selection process, a wider range of results is needed to confirm this tentative conclusion.

Chapter 6

Hashing Biased Data

The hash function for generating fingerprints is an essential element of a CBC system. A poor hash function generates a higher rate of hash collisions, which degrades the CBC system in various ways, depending on the algorithm used for handling collisions in the fingerprint index. If collisions are discarded, then the index will contain fewer n-gram references and the likelihood of identifying a matching string will be reduced, leading to a decrease in the compression achieved.

Alternatively, if hash collisions are recorded in the index, then the index query time will increase, as more n-gram references must be retrieved from external storage for collision resolution. How can we know that a hash function is suitable for CBC? In Chapters 3 and 4, our CBC system, cobald, used the rolling hash function by irreducible polynomial division of Cohen (1997). We noted that there is some uncertainty concerning the suitability of this function after Lemire and Kaser (2010) proved that rolling hash functions can have no more than pairwise independence. We addressed this uncertainty by proving that the independence is not necessarily limited when the function keys do not overlap, such as for the fixed-offset n-gram selection method of cobald. In this chapter, we seek further reassurance on this question, and investigate the properties which are necessary for a hash function to be well behaved.

In Chapter 2, we reviewed the desirable properties of non-universal, non-cryptographic hash functions. As described in Section 2.2, current practice for evaluating hash functions is to test for one, or sometimes two, bits of avalanche, and to perform collision tests with a range of artificial and representative sets of keys. We also noted there is an inverse relation between function resilience and avalanche, described by Zheng and Zhang (2003), which bounds the avalanche of a function having a given resilience. They showed that a function on keys of length $n$ with resilience degree $t$ and avalanche degree $l$ satisfies the bound $t + l \le n - 2$. This relationship allows us to use function resilience as an indirect means to understand the role of avalanche.

In this chapter, we investigate the conditions under which function resilience is beneficial for a hash function. Specifically, how can a set of input keys be biased while still obtaining a uniform distribution of hash values with a resilient function? We prove that the hash values from a resilient function are balanced for pairs of keys having a limited degree of similarity. Then, in the particular case of zero-order biased keys, we show when a resilient function is effectively pairwise independent. These results lead to an empirical test for function resilience, and we present an experimental study in which we hash zero-order biased data with resilient functions to validate the test and confirm the theoretical results. Finally, we consider the importance of function resilience in practical situations by hashing some English text, and by testing the resilience of several popular hash functions.

6.1 Resilient Hash Functions on Biased Data

In order to study the effect of biased keys on the distribution of hash values, we introduce a probability distribution for the keys. Let $h : U \to V$ be a hash function where $U$ and $V$ are finite sets, and let $e : U \to [0, 1]$ be the probability of occurrence of each key $u \in U$. The probability of a hash value $v \in V$ occurring is then the sum of the probabilities of the keys in its pre-image, $p_v = \sum_{u \in h^{-1}(v)} e(u)$. The expected uniformity of the hash values from function $h$ and bias distribution $e$ can then be measured by the hash distribution entropy, $H_{h,e} = -\sum_{v \in V} p_v \log_2 p_v$ bits. The value of $H_{h,e}$ ranges from $\log_2 |V|$ bits, for a uniform distribution of hash values, to 0 bits for a hash function that always gives the same hash value. For our purpose of inserting a finite set of keys into a hash table, we consider functions having $H_{h,e}$ close to $\log_2 |V|$ bits to be effectively uniform when the number of hash collisions is statistically indistinguishable from that of the ideal hash function. This is similar to the approach to hash testing adopted in Chapter 5, where a function was accepted as being ideal if it behaved as an ideal hash function within the statistical significance of the test.

For a given biased input distribution $e$, there will be many hash functions that are effectively uniform. While an exactly uniform function may not exist, there will be an optimal hash function for which $H_{h,e}$ is maximised and the expected distribution of hash values is closest to uniform. However, the optimal hash function for a particular biased input distribution $e$ is of little practical interest, as the exact distribution $e$ is not known precisely in practical situations, and this function will not be optimal for input data that is biased in a slightly different way. Instead, we desire a hash function that is effectively uniform, while not necessarily maximal, over an entire class of biased input data. In this section, we present an advance towards this goal by proving that a $t$-resilient function is effectively pairwise independent when hashing keys which are biased in a certain way: specifically, when few pairs of keys, $x, y \in \Sigma^n$, have the same symbol, $x_i = y_i$, $1 \le i \le n$, at more than $t$ positions. Resilient Boolean functions were defined by Chor et al. (1985) and generalised to functions on a finite field by Camion and Canteaut (1999).
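Before turning to resilient functions, note that the hash distribution entropy $H_{h,e}$ is easy to compute directly when the key distribution is available. The following is a minimal sketch with a toy hash and an arbitrary key distribution of our own choosing.

    import math
    from collections import defaultdict

    def hash_distribution_entropy(h, keys_with_probs):
        # H_{h,e} = -sum_v p_v log2(p_v), where p_v sums e(u) over the
        # pre-image of v under h.
        p = defaultdict(float)
        for key, prob in keys_with_probs:
            p[h(key)] += prob
        return -sum(pv * math.log2(pv) for pv in p.values() if pv > 0)

    # Toy example: a 1-bit hash of 2-symbol keys over {0, 1} with biased keys.
    keys = [(0, 0), (0, 1), (1, 0), (1, 1)]
    e = [0.4, 0.3, 0.2, 0.1]
    h = lambda k: (k[0] + k[1]) % 2
    print(hash_distribution_entropy(h, zip(keys, e)))   # 1.0 bit for this h and e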

Definition 8 ($t$-resilient function). Let $h$ be a hash function $h : \Sigma^n \to V$, where the input keys are strings of $n$ symbols from an alphabet $\Sigma$ of cardinality $q$, and $V$ is the range. The key $x$ is a string of $n$ symbols $x_1 x_2 \ldots x_n$, and $V$ can be defined as the set of integers $\{0, \ldots, |V| - 1\}$. A function is balanced when equal numbers of keys map to each hash value $v \in V$. Function $h$ is $t$-resilient if it is balanced and has the property that, when any set of $t$ positions of the input is arbitrarily fixed, $h$ is balanced over the restricted domain of the remaining $n - t$ un-fixed, or free, positions. Specifically, $h$ is balanced over each set of keys $L$, referred to as the restricted key set, having $t$ positions $\{i_1, \ldots, i_t\} \subseteq \{1, 2, \ldots, n\}$ that are fixed to $x_i = s$, for any choice of symbols $s \in \Sigma$. Furthermore, $t$ is the maximum such value: $h$ is not balanced over some restriction when $t + 1$ positions are fixed.

To illustrate, consider a 2-resilient function on strings of length four over the English upper-case letters, $\Sigma = \{A, B, \ldots, Z\}$. The '*' symbol indicates an unfixed, or free, position that can take an arbitrary value from $\Sigma$. Fixing one or more of the positions defines a restricted key set $L \subset \Sigma^n$. For example, when the first two characters of the input string are fixed to AB, the restricted key set is written {AB**}, and contains {ABAA, ABAB, ..., ABAZ, ABBA, ABBB, ..., ABZZ}. The resilience condition implies that the hashes of the keys in the restricted key set, $h(x)$, $x \in$ {AB**}, are balanced.

The 2-resilient function is also balanced over the other key sets restricted to two positions, such as {PQ**}, {*AB*} and {*Z*A}. Also, a restriction of a single position, such as {A***}, is balanced in a 2-resilient function. However, a restriction at three positions, {ABC*}, will not be balanced, as that would imply that the function is 3-resilient. The following property of $t$-resilient functions is useful in the proof and later examples.
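Definition 8 can also be checked mechanically for small parameters. The brute-force sketch below (our own helper, exponential in $n$ and $t$, so usable only for toy alphabets) verifies that a function is balanced under every restriction of $t$ positions; it does not check the maximality condition on $t$.

    from itertools import product, combinations

    def is_balanced(h, keys, values):
        counts = {v: 0 for v in values}
        for k in keys:
            counts[h(k)] += 1
        return len(set(counts.values())) == 1

    def is_at_least_t_resilient(h, alphabet, n, t, values):
        # Check balance over the full domain and over every restricted key
        # set obtained by fixing any t positions to any symbols.
        keys = list(product(alphabet, repeat=n))
        if not is_balanced(h, keys, values):
            return False
        for positions in combinations(range(n), t):
            for fixed in product(alphabet, repeat=t):
                restricted = [k for k in keys
                              if all(k[p] == s for p, s in zip(positions, fixed))]
                if not is_balanced(h, restricted, values):
                    return False
        return True

    # The modular sum of symbol values, for example, stays balanced whenever
    # at least one position is left free.
    q, n = 4, 3
    h = lambda key: sum(key) % q
    print(is_at_least_t_resilient(h, range(q), n, n - 1, range(q)))   # True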

Property 9 (Balanced $t$-grams). A $t$-gram is a string in $\Sigma^t$. A $t$-resilient function, $h : \Sigma^n \to V$, has an equal number of each $t$-gram in each of its pre-images.

Proof. By definition, a $t$-resilient function has the property that when values are assigned to a set of $t$ input positions, the function is balanced over the remaining free inputs. Let $q = |\Sigma|$ be the alphabet size. Fixing the input symbols at $t$ positions partitions each pre-image into $q^t$ subsets containing $q^{n-t}/|V|$ keys. These subsets must have equal size for the restricted functions to be balanced. Therefore, there must be an equal number of each combination of $t$ symbols at the positions of the $t$ fixed inputs in each pre-image. As a $t$-resilient function is balanced for all combinations of $t$ fixed input positions, including the adjacent positions which form $t$-grams, the number of $t$-grams in each pre-image is balanced.

We now define a measure of the degree of unbalance of a function, which is needed in the later proof.

Definition 10 (Unbalance over a subdomain). Let $h_L^{-1}(v)$ be the pre-image of function $h$ for hash value $v \in V$ over a key set $L$. When $h$ is balanced over $L$, the cardinality of each pre-image over $L$ is $|h_L^{-1}(v)| = |L|/|V|$ for all $v \in V$. When function $h$ is unbalanced over $L$, there exists a pre-image with $|h_L^{-1}(v)| \neq |L|/|V|$, $v \in V$. The degree of unbalance, $\delta_L$, is the sum

$\delta_L = \sum_{v \in V} \left| \, |h_L^{-1}(v)| - \frac{|L|}{|V|} \, \right|.$    (6.1)

The degree of unbalance, $\delta_L$, is zero when $h$ is balanced over $L$. When $\delta_L > 0$, it represents twice the least number of keys whose hashes could be changed in order to transform $h$ into a balanced function over $L$. The first step in our proof regarding the effective pairwise independence of $t$-resilient functions is to define the concept of pairwise balanced keys.

Definition 11 (Pairwise balanced keys). Let $x, y \in \Sigma^n$, $x \neq y$, be distinct input keys. Furthermore, the keys $x$ and $y$ are said to constrain each other as they have matching symbols, $x_i = y_i$, at $k < n$ of their positions. Let $L$ be the set of keys satisfying the constraint, by fixing the same matching symbols at the $k$ positions, and having an arbitrary symbol at the free, or non-fixed, positions. The keys $x$ and $y$ are said to be pairwise balanced with respect to the hash function $h$ if $h(z)$, $z \in L$, is balanced.

For example, ABCZ and DEFZ match in their last position. Consider pairs of keys which match at $k$ positions. For example, $x =$ ABXY and $y =$ ABPQ match at two positions, and $L =$ {AB**}. The matching positions are the constraints imposed by $x$ on $y$ and by $y$ on $x$. These keys are pairwise balanced with respect to $h$ if the function $h(z)$, $z \in$ {AB**}, is balanced.

Pairwise balance is a useful property when the bias of the keys relates to the constraint. Suppose that the bias is entirely characterised by the constraint, and the remaining unconstrained aspect of the pair of keys is unbiased; we can then consider that the keys $x$ and $y$ were randomly selected from the balanced set $L$, resulting in a uniform distribution of hash values. While other constraints could be defined, here we are specifically concerned with keys that are biased such that key pairs are likely to have $k$ matching positions, but are otherwise unbiased.

Pairwise balance has some similarities to the familiar concept of pairwise independence of universal hash functions. In a universal hash function family, the hash values associated with a pair of input keys are constrained by the additional input of a seed that selects the hash function from the universal family. Pairwise independence is obtained when the hashes of each key pair are balanced over all of the seeds, and the seed is chosen uniformly at random. Consequently, for all key pairs, the hash of the second key is uniformly distributed independent of the first key. With the non-universal function $h$, consider partitioning the key into two parts: one that is randomly selected, and is analogous to the seed of a universal hash function, and the other remaining part, which corresponds to the biased key of a universal hash function. Pairwise balance of the randomly selected part ensures that the distribution of hash values is uniform, regardless of the degree of bias of the other part of the key.

We now prove a bound on pairwise balance on some subsets of the domain of a $t$-resilient function. When the fraction of key pairs in the subset that are not pairwise balanced approaches zero, the hash function is effectively pairwise balanced over that subset.

Theorem 12. Consider a pair of distinct keys $x$ and $y$, $x \neq y$, which match at $k$ positions; let function $h$ be $t$-resilient, and let $j$, $0 \le j < t$, be a lower bound on the number of matching positions found in the key set, so that $j \le k \le t$.

1. When $k > t$, the pair $x, y$ is not pairwise balanced; and

2. The fraction $f_{j,t}$ of pairs $x, y$ matching in $k$ positions that are unbalanced is

$f_{j,t} \le \frac{q^{n-j}}{\sum_{c=0}^{t-j} \binom{n-j}{n-j-c} (q-1)^{n-j-c}} - 1.$    (6.2)

Proof. Let $x \in \Sigma^n$ be a key, and $y \in \Sigma^n$ be a key having $j$ matching positions with $x$. The restricted key set $L_j$ contains the keys having these $j$ positions fixed to the matching symbols in $x$ and $y$. Observe that $|L_j| = q^{n-j}$, and also note that $z \in L_j$ may match $x$ at more than $j$ positions. Indeed, the key $z = x$ is an element of $L_j$. To illustrate, suppose $h$ has $n = 8$, $x =$ ABCDEFGH and $y =$ ABCDETUV, so that the first $j = 5$ positions are fixed to ABCDE. The set of possible keys is $L_5 =$ {ABCDE***}, and contains $z =$ ABCDEFYZ, which has $k = 6$, as well as $z = x =$ ABCDEFGH, which has $k = 8 = n$.

The first part of the theorem is a direct consequence of the resilience of $h$: when $k > t$, $h(z)$, $z \in L_k$, is not balanced, so the key pair $x, y$ is not pairwise balanced. For the second part of the theorem, when $j \le k \le t$, we know that $h(z)$, $z \in L_j$, is balanced, from the definition of resilience. While $L_j$ is balanced, it contains keys that have more than $t$ matching positions with $x$. As we want to find the balance of pairs with $j \le k \le t$ matching positions, the keys with more than $t$ matching positions must be subtracted from $L_j$. Furthermore, as the function $h$ is unbalanced when more than $t$ positions are fixed, the residual set formed by removing those keys from $L_j$ must also be unbalanced by an equal amount. We now calculate the fraction of keys in this residual set that are unbalanced by counting the sizes of the sets which are removed.

Let $N_j \subset L_j$ be the set of keys matching $x$ only at the $j$ given matching positions. As none of the $n - j$ free positions of $z \in N_j$ can match the free positions in $x$, $|N_j| = (q-1)^{n-j}$. Furthermore, let $E_j$ contain the keys in $L_j$ having more than $j$ matching positions with $x$, so that $E_j = L_j \setminus N_j$, and $|E_j| = q^{n-j} - (q-1)^{n-j}$. To illustrate the relationship between the sets $L_j$, $N_j$ and $E_j$, let $x =$ ABCDEFGH and $y =$ ABCDETUV, so that the first $j = 5$ positions are fixed to ABCDE. The restricted key set is $L_5 =$ {ABCDE***}, and contains ABCDEXYZ $\in N_5$ and ABCDEFYZ $\in N_6$. The set $E_5 =$ {ABCDEF**} $\cup$ {ABCDE*G*} $\cup$ {ABCDE**H}, and

$E_6 =$ {ABCDEFG*} $\cup$ {ABCDEF*H} $\cup$ {ABCDE*GH}. $N_5 = L_5 \setminus E_5 =$ {ABCDE***} $\setminus$ ({ABCDEF**} $\cup$ {ABCDE*G*} $\cup$ {ABCDE**H}).

Generally, the number of keys matching $x$ at the given $j$ positions, and at a further $i$ of the free positions, is

$|N_{j+i}| = \binom{n-j}{n-j-i} (q-1)^{n-j-i}.$    (6.3)

Also, $E_{j+i}$ contains the keys in $L_j$ having more than $j + i$ matching positions with $x$, and $N_{j+1} = E_j \setminus E_{j+1}$. The cardinality of $E_{j+i}$ can now be found,

$E_{j+i} = L_j \setminus \bigcup_{c=0}^{i} N_{j+c}.$    (6.4)

As the sets $N_j, N_{j+1}, \ldots$ do not intersect,

$|E_{j+i}| = |L_j| - \sum_{c=0}^{i} |N_{j+c}|$    (6.5)

$\phantom{|E_{j+i}|} = q^{n-j} - \sum_{c=0}^{i} \binom{n-j}{n-j-c} (q-1)^{n-j-c}$    (6.6)

We can see that $E_{j+i}$ is the union of several restricted key sets. When $j + i < t$, each of the restricted key sets in $E_{j+i}$ is balanced, but $E_{j+i}$ itself is not balanced, as some keys occur in more than one of the restricted key sets. Conversely, when $j + i = t$, the restricted key sets in $E_t$ are not balanced. In the worst case, assume that all keys in $E_t$ hash to the same value. As $L_j$ is balanced, the degree of unbalance of $L_j \setminus E_t$ is at most $|E_t|$, and the fraction $f_{j,t}$ of keys matching in $k$, $j \le k \le t$, positions that are unbalanced is


Figure 6.1: Selected solutions to Equation 6.2, the upper bound of the fraction of pairs which are unbalanced, $f_{j,t}$, for various values of $j$, the lower bound on the number of matching positions between pairs, and functions having a range of resilience values. The functions have alphabet size $q = 128$ and key length $n = 32$, which are the values used in the experiment in Section 6.4. As the range $t - j$ increases, the fraction of pairs which are not balanced approaches 0, and the function is effectively pairwise balanced over the subdomain where key pairs have $k$, $j \le k \le t$, matching positions.

$f_{j,t} \le \frac{|E_t|}{|L_j \setminus E_t|}$    (6.7)

$\phantom{f_{j,t}} \le \frac{q^{n-j} - \sum_{c=0}^{t-j} \binom{n-j}{n-j-c} (q-1)^{n-j-c}}{\sum_{c=0}^{t-j} \binom{n-j}{n-j-c} (q-1)^{n-j-c}}$    (6.8)

$\phantom{f_{j,t}} \le \frac{q^{n-j}}{\sum_{c=0}^{t-j} \binom{n-j}{n-j-c} (q-1)^{n-j-c}} - 1$    (6.9)

The fraction $f_{j,t}$ approaches zero as $t - j$ increases.

Values of $f_{j,t}$ for a representative set of resilient functions are plotted in Figure 6.1, and show that $f_{j,t}$ is less than 0.001 when the range $t - j > 2$. When $f_{j,t}$ approaches zero, the hash function $h$ is effectively pairwise balanced for all of the key pairs having $k$ matching positions within the range $j \le k \le t$. If the keys are biased such that key pairs outside of this range are unlikely to occur, then the hash function $h$ will be effectively pairwise balanced for the biased data.

Furthermore, if the keys are otherwise unbiased, then the hash function will be effectively pairwise independent for the biased keys. The parameter j is included as the lower bound of the number of matching positions, k, to more accurately model highly biased key distributions where the probability of k < j is very low.
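The bound of Equation 6.2 is straightforward to evaluate numerically. The short sketch below reproduces the kind of values plotted in Figure 6.1 (the function name is ours).

    from math import comb

    def unbalanced_fraction_bound(q, n, j, t):
        # Equation 6.2: upper bound on f_{j,t}, the fraction of key pairs
        # matching in j..t positions that are not pairwise balanced.
        denom = sum(comb(n - j, n - j - c) * (q - 1) ** (n - j - c)
                    for c in range(t - j + 1))
        return q ** (n - j) / denom - 1

    # Parameters used in Figure 6.1 and in the experiment of Section 6.4.
    for t in (4, 8, 16):
        print(t, unbalanced_fraction_bound(128, 32, 0, t))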

Hashing zero-order biased keys. The above proof shows that a $t$-resilient function is effectively pairwise independent for keys biased in a certain way: when the number of matching positions between keys is highly likely to be less than $t$, and furthermore the symbols in the non-matching positions are not biased. We now argue that keys generated by a zero-order source meet these conditions in certain circumstances. A zero-order source assigns a symbol to each position of a key according to a probability distribution over the alphabet, where $p_i$ is the probability that the $i$-th symbol in $\Sigma$ is assigned, independent of the symbols at other positions. The probability that a pair of keys have matching symbols at a position is $\Pr[x_i = y_i] = p_m = \sum_{j=1}^{q} p_j^2$. For keys of length $n$, the probability of obtaining a pair containing $k$ matches follows a binomial distribution,

$\Pr[k \text{ matching positions}] = \binom{n}{k} p_m^k (1 - p_m)^{n-k},$    (6.10)

and the probability of having at most $t$ matching positions is

$\Pr[k \le t] = \sum_{i=0}^{t} \binom{n}{i} p_m^i (1 - p_m)^{n-i}.$    (6.11)
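Equations 6.10 and 6.11 can likewise be evaluated directly; a minimal sketch:

    from math import comb

    def prob_at_most_t_matches(symbol_probs, n, t):
        # p_m = sum(p_j^2) is the per-position match probability for a
        # zero-order source; k then follows Binomial(n, p_m).
        pm = sum(p * p for p in symbol_probs)
        return sum(comb(n, i) * pm ** i * (1 - pm) ** (n - i)
                   for i in range(t + 1))

    # Example: a uniform 128-symbol alphabet and 32-symbol keys.
    print(prob_at_most_t_matches([1 / 128] * 128, 32, 2))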

We know that when the expected value of $k$ is less than the resilience $t$, the key pairs are effectively balanced, meaning that the hashes of the keys in the restricted key set formed by the $k$ matching positions are close to balance. However, this does not imply that the distribution of hash values is uniform, as the keys are selected by a zero-order biased process. Nonetheless, the following informal argument shows that, when $k$ is significantly less than $t$, we expect the hash distribution to be effectively uniform for zero-order biased data. When $k$ key positions in a $t$-resilient function are fixed, the function over the remaining $n - k$ positions is $(t - k)$-resilient. Consider the case when $t = k$. A highly biased hash distribution could be achieved when one of the pre-images contained keys composed of the most likely symbols, leading to a high probability of the occurrence of that hash value. Similarly, another pre-image could contain keys composed of the least likely symbols, and that hash value would be very unlikely to occur. Now consider the case when $t - k = 1$. Due to Property 9 (Page 205), the 1-resilient function has balanced 1-grams, meaning that every symbol occurs in each pre-image an equal number of times. So the above situation for the $t - k = 0$ function cannot occur. Instead, a pre-image can only contain a few keys with a large number of the most likely symbol, and the other likely keys must be distributed among other pre-images. The situation is further improved when $t - k = 2$, and the distribution of hash values approaches uniformity as $t - k$ increases. This phenomenon is demonstrated with a case study in Section 6.2. Consequently, when the expected value of $k$ is significantly less than $t$, a $t$-resilient function is effectively pairwise independent when hashing zero-order biased data.

Discussion. The motivation for this proof is to identify when resilience is a useful property for a non-universal, non-cryptographic hash function. Specifically, we showed that when the resilience exceeds the number of matching positions between pairs of keys, the function is balanced over the remaining non-matching positions, which results in a uniform distribution of hash values when the keys were generated by a zero-order biased source. While a zero-order source is an idealised simplification that is not encountered in practical situations, there is a zero-order component in the bias of keys from many common applications; the letter frequencies in written languages are an important instance. We expect that, when zero-order bias is a major component of the overall bias, pairwise balanced keys will tend to generate a uniform hash distribution, as the informal argument above will still apply. But we cannot expect the distribution of hash values to remain uniform as the degree of higher-order bias increases. We perform an empirical investigation of resilience with higher-order biased data using English text in Section 6.5.

6.2 Case Study of a Function Family

To illustrate the influence of input bias and function resilience on hash uniformity, and to provide a concrete context for the preceding proof, we analyse in detail the following family of functions having a range of resilience. Consider the family of functions $F_q^n \to F_q$ described by Camion and Canteaut (1999) (theorem 6). For a given $q$ and $n$, the family provides functions with varying resilience $1 \le t \le n - 1$, with the additional desirable property that each function has maximal nonlinearity, meaning that the resilience bound of Theorem 2 is an equality, so that $d + t = (q-1)n - 1$, where $d$ is the algebraic degree of the hash function. We made a minor extension to the family to include a 0-resilient function. We refer to the modified function family as the CC99 family.

Definition 13 (CC99 $t$-resilient function family). The function $f : F_q^n \to F_q$, over a string $x_0 x_1 \ldots x_{n-1}$, $x_i \in F_q$, where $q$ is even, $q > 4$, and $q \not\equiv 1 \pmod{3}$, is a $t$-resilient function constructed in the following manner. First, define a 0-resilient function on the first input symbol,

$f(x_0) = x_0^{q-2}.$    (6.12)

An $i$-resilient function with $i + 1$ inputs may be constructed from an $(i-1)$-resilient function with $i$ inputs,

$f(x_0, \ldots, x_i) = \left( f(x_0, \ldots, x_{i-1}) + x_i^{(q/2)-1} \right)^3.$    (6.13)

This allows us to make a $t$-resilient function with $t + 1$ inputs by beginning with

Equation 6.12, and successively applying Equation 6.13. Then the remaining $n - t - 1$ inputs are added to the function without increasing the resilience by successively applying

$f(x_0, \ldots, x_i) = x_i^{q-1} f(x_0, \ldots, x_{i-1}) + (1 - x_i^{q-1}) \, \alpha f(x_0, \ldots, x_{i-1}),$    (6.14)

where $\alpha \in F_q \setminus \{0, 1\}$.

This case study uses $q = 8$ and $n = 3$, being the smallest values necessary to illustrate our arguments. Let the finite field $F_q$ be expressed as polynomials in the indeterminate variable $a$; then $\alpha \in \{a,\, a+1,\, a^2,\, a^2+1,\, a^2+a,\, a^2+a+1\}$. In these examples, we selected $\alpha = a$.
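To make the recursion concrete, the sketch below evaluates $f_2$ with GF(8) represented as 3-bit polynomials reduced modulo $x^3 + x + 1$ (any irreducible cubic gives an isomorphic field, so the resilience property is unaffected), and brute-force checks that fixing any two input positions leaves a balanced function of the remaining symbol. The helpers are ours, not the Sage implementation described in Section 6.4.

    from itertools import product

    IRRED = 0b1011  # x^3 + x + 1

    def gf8_mul(a, b):
        # Carry-less multiplication of 3-bit polynomials, reduced by IRRED.
        r = 0
        while b:
            if b & 1:
                r ^= a
            b >>= 1
            a <<= 1
            if a & 0b1000:
                a ^= IRRED
        return r

    def gf8_pow(a, e):
        r = 1
        for _ in range(e):
            r = gf8_mul(r, a)
        return r

    def f2(x0, x1, x2):
        # Equations 6.12 and 6.13 with q = 8; addition in GF(2^3) is XOR.
        f = gf8_pow(x0, 6)                      # f(x0) = x0^(q-2)
        for x in (x1, x2):
            f = gf8_pow(f ^ gf8_pow(x, 3), 3)   # f = (f + x^((q/2)-1))^3
        return f

    # 2-resilience check: every restriction of two positions is balanced
    # (here each restriction is in fact a bijection of the free symbol).
    for fixed_pos in ((0, 1), (0, 2), (1, 2)):
        for fixed_val in product(range(8), repeat=2):
            key = [None, None, None]
            for p, v in zip(fixed_pos, fixed_val):
                key[p] = v
            free = key.index(None)
            hashes = sorted(f2(*(key[:free] + [v] + key[free + 1:]))
                            for v in range(8))
            assert hashes == list(range(8))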

First, we construct the 2-resilient function $f_2$ from the family. The construction begins with Equation 6.12 on the first input variable $x_0$, then Equation 6.13 is applied for the variables $x_1$ and $x_2$.

$f_2(x_0) = x_0^6$    (6.15)

$f_2(x_0, x_1) = (f_2(x_0) + x_1^3)^3$    (6.16)

$\phantom{f_2(x_0, x_1)} = x_0^6 x_1^6 + x_0^5 x_1^3 + x_0^4 + x_1^2$

$f_2(x_0, x_1, x_2) = (f_2(x_0, x_1) + x_2^3)^3$

$\phantom{f_2(x_0, x_1, x_2)} = x_0^6 x_1^6 x_2^6 + x_0^5 x_1^3 x_2^6 + x_0^5 x_1^5 x_2^3 + x_0^3 x_1^6 x_2^3 + x_0^4 x_2^6 + x_1^2 x_2^6$    (6.17)

$\phantom{f_2(x_0, x_1, x_2) =} + x_1^4 x_2^3 + x_1^6 + x_0^5 + x_0 x_2^3 + x_2^2$

The degree of $f_2$ is 18, from the first term $x_0^6 x_1^6 x_2^6$. Note that the resilience bound $d + t = (q-1)n - 1$ is satisfied, as expected.

The 1-resilient function f1 is constructed by composing Equation 6.12 with Equation 6.13 once, then composing the result with Equation 6.14.

$f_1(x_0) = x_0^6$    (6.18)

$f_1(x_0, x_1) = (f_1(x_0) + x_1^3)^3$    (6.19)

$\phantom{f_1(x_0, x_1)} = x_0^6 x_1^6 + x_0^5 x_1^3 + x_0^4 + x_1^2$

$f_1(x_0, x_1, x_2) = x_2^7 f_1(x_0, x_1) + (1 - x_2^7) \alpha f_1(x_0, x_1)$

$\phantom{f_1(x_0, x_1, x_2)} = (a+1) x_0^6 x_1^6 x_2^7 + (a+1) x_0^5 x_1^3 x_2^7 + a x_0^6 x_1^6$    (6.20)

$\phantom{f_1(x_0, x_1, x_2) =} + (a+1) x_0^4 x_2^7 + (a+1) x_1^2 x_2^7 + a x_0^5 x_1^3 + a x_0^4 + a x_1^2$

The degree of $f_1$ is 19, from the term $(a+1) x_0^6 x_1^6 x_2^7$, and the resilience bound $d + t = (q-1)n - 1$ is satisfied.

The 0-resilient function $f_0$ is constructed by composing Equation 6.12 with Equation 6.14 twice.

  $F_8$ element    input symbol    hash value    symbol probability
  0                A               0             0.03
  a                B               1             0.06
  a^2              C               2             0.08
  a + 1            D               3             0.11
  a^2 + a          E               4             0.14
  a^2 + a + 1      F               5             0.17
  a^2 + 1          G               6             0.19
  1                H               7             0.22

Table 6.1: Mapping of elements of $F_8$ to input symbols and to hash values for the input and output of functions $f_0$, $f_1$, and $f_2$. The last column contains the arbitrary probability assigned to each input symbol for the example calculation of hash distributions for each function with biased keys.


$f_0(x_0) = x_0^6$    (6.21)

$f_0(x_0, x_1) = x_1^7 f_0(x_0) + (1 - x_1^7) \alpha f_0(x_0)$    (6.22)

$\phantom{f_0(x_0, x_1)} = (a+1) x_0^6 x_1^7 + a x_0^6$

$f_0(x_0, x_1, x_2) = x_2^7 f_0(x_0, x_1) + (1 - x_2^7) \alpha f_0(x_0, x_1)$    (6.23)

$\phantom{f_0(x_0, x_1, x_2)} = (a^2 + 1) x_0^6 x_1^7 x_2^7 + (a^2 + a) x_0^6 x_1^7 + (a^2 + a) x_0^6 x_2^7 + a^2 x_0^6$

The degree of $f_0$ is 20, from the term $(a^2 + 1) x_0^6 x_1^7 x_2^7$, and the resilience bound is satisfied.

To illustrate the function behaviour, we define a mapping, shown in Table 6.1, of elements of the finite field $F_8$ to input symbols and to hash values. A biased distribution of symbol probabilities is created to illustrate the effect of input bias on the distribution of hash values, which is shown in Table 6.2. Observe that the high-resilience function $f_2$ has hash probabilities which are very close to uniform, but $f_1$ diverges a little, and $f_0$ is distinctly non-uniform. To explain this observation, consider the function pre-images $f^{-1}(2)$ shown in Table 6.3.

                     Function
  Hash value       $f_2$     $f_1$     $f_0$
  0                0.122     0.146     0.030
  1                0.124     0.123     0.192
  2                0.123     0.138     0.171
  3                0.127     0.115     0.142
  4                0.125     0.138     0.112
  5                0.126     0.123     0.082
  6                0.125     0.119     0.061
  7                0.129     0.100     0.211
  Entropy (bits)   3.000     2.991     2.813

Table 6.2: The probability distribution of hash values from the example functions $f_0$, $f_1$ and $f_2$ when the input is biased with the symbol probabilities shown in Table 6.1.

A visual inspection shows that the pre-image of $f_2$ appears to be a well-mixed set of keys, whereas the pre-image of $f_0$ is clearly biased, having a large number of keys beginning with 'F', and some with 'G' and 'H', but none with 'A', 'B', 'C', nor 'D'. Also, the symbol 'F' is more frequent than other symbols.

The pre-image of $f_1$ is not so obviously biased as $f_0$, but not as well mixed as $f_2$. More precisely, the pre-image of each function has the same cardinality, 64 keys, as expected for balanced functions. The number of occurrences of each symbol in each pre-image of functions $f_2$ and $f_1$ is also balanced; each symbol occurs 24 times. Furthermore, the 2-grams in the pre-images of $f_2$ are balanced; each appears twice. These observations are in agreement with Property 9 (Page 205): that the pre-images of a $t$-resilient function have balanced $t$-grams.

The uneven distribution of symbols over the pre-images of $f_0$ is shown in Table 6.4, and the uneven distribution of 2-grams over the pre-images of $f_1$ is listed in Table 6.5. We are now able to explain the non-uniform distribution of hash values from functions $f_0$ and $f_1$ shown in Table 6.2. In the biased input keys, 'H' was the most likely symbol. For function $f_0$, the keys in pre-image $f_0^{-1}(7)$ contain more 'H' symbols than the other pre-images.

Pre-image $f_2^{-1}(2)$: AAB ABE ACC ADG AEH AFA AGD AHF BAC BBD BCB BDF BEA BFH BGE BHG CAH CBG CCA CDE CEB CFC CGF CHD DAD DBC DCE DDA DEF DFG DGB DHH EAE EBB ECD EDH EEG EFF EGC EHA FAG FBH FCF FDB FEE FFD FGA FHC GAA GBF GCH GDD GEC GFB GGG GHE HAF HBA HCG HDC HED HFE HGH HHB

Pre-image $f_1^{-1}(2)$: ABB ABC ABD ABE ABF ABG ABH AEA BFA BGB BGC BGD BGE BGF BGG BGH CAA CDB CDC CDD CDE CDF CDG CDH DCB DCC DCD DCE DCF DCG DCH DHA EAB EAC EAD EAE EAF EAG EAH EDA FBA FEB FEC FED FEE FEF FEG FEH GCA GHB GHC GHD GHE GHF GHG GHH HFB HFC HFD HFE HFF HFG HFH HGA

Pre-image $f_0^{-1}(2)$: FBB FBC FBD FBE FBF FBG FBH FCB FCC FCD FCE FCF FCG FCH FDB FDC FDD FDE FDF FDG FDH FEB FEC FED FEE FEF FEG FEH FFB FFC FFD FFE FFF FFG FFH FGB FGC FGD FGE FGF FGG FGH FHB FHC FHD FHE FHF FHG FHH GAB GAC GAD GAE GAF GAG GAH GBA GCA GDA GEA GFA GGA GHA HAA

Table 6.3: A pre-image from each of the example functions $f_2$, $f_1$ and $f_0$, showing all of the keys that generate a hash value of 2.

Consequently, these keys occur more frequently than the keys in other pre-images, and 7 is the most likely hash value. Generally, we expect that for every biased input distribution generated by a zero-order model, the unbalanced allocation of symbols to each pre-image in a 0-resilient function will cause the hash distribution to be non-uniform.

The function $f_1$ also has a non-uniform probability distribution of hash values with the biased keys, although markedly less so than $f_0$. However, as each symbol occurs an equal number of times in the pre-images of $f_1$, the preceding explanation for the non-uniformity of $f_0$ is not applicable. Instead, the non-uniformity of $f_1$ is due to the uneven distribution of 2-grams in its pre-images.

                           Value of $f_0()$
  Input symbol    0    1    2    3    4    5    6    7
  A              80   16   16   16   16   16   16   16
  B              16   17   16   16   16   16   65   30
  C              16   16   16   16   16   65   30   17
  D              16   16   16   16   65   30   17   16
  E              16   16   16   65   30   17   16   16
  F              16   16   65   30   17   16   16   16
  G              16   65   30   17   16   16   16   16
  H              16   30   17   16   16   16   16   65

Table 6.4: Counts of each input symbol in the pre-images of function $f_0$.

As the keys are generated with the biased distribution of symbol probabilities, the 2-grams also have a biased distribution. For example, as 'H' is the most frequent symbol, 'HH' will be the most frequent 2-gram.

The pre-image $f_1^{-1}(0)$ contains 9 of the 16 occurrences of 'HH' in the keys (Table 6.5), and we observe that hash value 0 is the most likely (Table 6.2). Generally, this example illustrates that, at least for keys generated by a zero-order bias model, functions with high resilience have pre-images containing a better mixing of likely and unlikely keys, and so are expected to generate a more uniform hash distribution.

Pairwise balance in the CC99 family. To further illustrate the relationship between function resilience and effective pairwise balance, we investigated the pairwise balance of functions from the CC99 family with respect to a small selection of keys. Recall the definition of pairwise balance from Definition 11 (Page 206). We select an arbitrary key and allow it to impose a constraint on the other keys that may be selected to form a pair. In Theorem 12, the constraint is that the keys have matching symbols at at least $j$ positions and at most $t$ positions.

            Value of $f_1()$                          Value of $f_1()$
  2-gram  0 1 2 3 4 5 6 7            2-gram  0 1 2 3 4 5 6 7
  AA      9 1 1 1 1 1 1 1            EA      1 1 8 2 1 1 1 1
  AB      1 1 8 2 1 1 1 1            EB      9 1 1 1 1 1 1 1
  AC      1 1 1 1 8 2 1 1            EC      1 1 1 1 1 8 2 1
  AD      1 1 1 1 1 1 8 2            ED      1 8 2 1 1 1 1 1
  AE      1 8 2 1 1 1 1 1            EE      1 1 1 1 1 1 8 2
  AF      1 1 1 8 2 1 1 1            EF      1 2 1 1 1 1 1 8
  AG      1 1 1 1 1 8 2 1            EG      1 1 1 1 8 2 1 1
  AH      1 2 1 1 1 1 1 8            EH      1 1 1 8 2 1 1 1
  BA      1 1 1 1 8 2 1 1            FA      1 1 1 1 1 1 8 2
  BB      1 1 1 1 1 8 2 1            FB      1 8 2 1 1 1 1 1
  BC      9 1 1 1 1 1 1 1            FC      1 2 1 1 1 1 1 8
  BD      1 2 1 1 1 1 1 8            FD      9 1 1 1 1 1 1 1
  BE      1 1 1 8 2 1 1 1            FE      1 1 8 2 1 1 1 1
  BF      1 8 2 1 1 1 1 1            FF      1 1 1 1 1 8 2 1
  BG      1 1 8 2 1 1 1 1            FG      1 1 1 8 2 1 1 1
  BH      1 1 1 1 1 1 8 2            FH      1 1 1 1 8 2 1 1
  CA      1 8 2 1 1 1 1 1            GA      1 1 1 8 2 1 1 1
  CB      1 1 1 1 1 1 8 2            GB      1 2 1 1 1 1 1 8
  CC      1 1 1 8 2 1 1 1            GC      1 8 2 1 1 1 1 1
  CD      1 1 8 2 1 1 1 1            GD      1 1 1 1 1 8 2 1
  CE      9 1 1 1 1 1 1 1            GE      1 1 1 1 8 2 1 1
  CF      1 1 1 1 8 2 1 1            GF      9 1 1 1 1 1 1 1
  CG      1 2 1 1 1 1 1 8            GG      1 1 1 1 1 1 8 2
  CH      1 1 1 1 1 8 2 1            GH      1 1 8 2 1 1 1 1
  DA      1 1 1 1 1 8 2 1            HA      1 2 1 1 1 1 1 8
  DB      1 1 1 1 8 2 1 1            HB      1 1 1 8 2 1 1 1
  DC      1 1 8 2 1 1 1 1            HC      1 1 1 1 1 1 8 2
  DD      1 1 1 8 2 1 1 1            HD      1 1 1 1 8 2 1 1
  DE      1 2 1 1 1 1 1 8            HE      1 1 1 1 1 8 2 1
  DF      1 1 1 1 1 1 8 2            HF      1 1 8 2 1 1 1 1
  DG      9 1 1 1 1 1 1 1            HG      1 8 2 1 1 1 1 1
  DH      1 8 2 1 1 1 1 1            HH      9 1 1 1 1 1 1 1

Table 6.5: Counts of each 2-gram in the pre-images of function $f_1$.

The function is said to be pairwise balanced with respect to the first key if the hashes of the set of second keys are balanced. If the function is pairwise balanced with respect to all keys, then the function is said to be pairwise balanced subject to the constraint. Table 6.6(i) shows the number of keys in each pre-image of the 3-resilient function with $n = 6$ and $q = 8$ from the CC99 family with respect to an arbitrarily selected key. If the function were pairwise balanced, the numbers in each column would be identical; however, there are small variations in the numbers for all values of $k$. Our theoretical results for resilient functions relate to key pairs having a range of matching positions. Table 6.6(ii) shows the counts of keys hashing to each value where the key pair matches at $k$ or fewer positions. The magnitude of the variation between the sizes of these subsets of the pre-images is measured by the standard deviation. The standard deviations are very small compared to the means, less than 0.2% for all values of $k$, and we can consider the function to be effectively pairwise balanced, at least for the particular key with which these results were obtained.

The standard deviations for the full range of resilient functions using the selected key are shown in Table 6.7. The principal theoretical claim of this chapter, that functions with higher resilience are closer to pairwise balance, is clearly illustrated in the table. One clear trend is that the standard deviation declines with increasing $k$. This is partly because, as $k$ increases, the set of keys having greater than $k$ matching positions shrinks, and as the functions are all balanced, the potential deviation from the mean decreases. However, there is an important exception to this trend: there is an increase in standard deviation from $k = t$ to $k = t + 1$ for the 1-, 2-, and 3-resilient functions. This agrees with the theoretical result that $t$-resilient functions approximate a pairwise balanced function when $k \le t$, and are not pairwise balanced when $k > t$.

(i) Number of keys matching at $k$ positions.

                               k
  hash value    0         1         2        3       4      5
  0             14700     12628     4472     872     92     4
  1             14736     12496     4641     796     86     12
  2             14712     12582     4536     836     96     6
  3             14701     12624     4478     868     93     4
  4             14700     12628     4472     872     92     4
  5             14700     12628     4472     872     92     4
  6             14700     12628     4472     872     92     4
  7             14700     12628     4472     872     92     4
  Totals:       117649    100842    36015    6860    735    42

(ii) Number of keys matching at $\le k$ positions.

                               k
  hash value    0         1         2         3         4         5
  0             14700     27328     31800     32672     32764     32767
  1             14736     27232     31873     32669     32755     32768
  2             14712     27294     31830     32666     32762     32768
  3             14701     27325     31803     32671     32764     32768
  4             14700     27328     31800     32672     32764     32768
  5             14700     27328     31800     32672     32764     32768
  6             14700     27328     31800     32672     32764     32768
  7             14700     27328     31800     32672     32764     32768
  Totals:       117649    218491    254506    261366    262101    262143
  mean:         14706     27311     31813     32671     32763     32768
  SD:           11.94     31.96     24.58     2.05      2.96      0.33

Table 6.6: Unbalance of keys matching at $k$ positions for the 3-resilient function from the CC99 family with $n = 6$ and $q = 8$. Table (i) shows the numbers of keys hashing to each hash value, and having $k$ matching positions with the key $x = [5, 2, 3, 7, 4, 1]$. Table (ii) shows the cumulative numbers: for instance, the counts for $k \le 2$ are the sum of the counts for $k = 0$, $k = 1$ and $k = 2$ from (i). The mean and standard deviation of the cumulative counts are shown at the bottom.

Another trend that is clear in the columns of Table 6.7 is the decline of standard deviation with increasing function resilience. Again, the only departures from this trend are when $t = k + 1$. These results appear to be a reversion to the trend following an especially sharp decline of standard deviation when $t = k$. This is an important observation concerning the practical circumstances in which resilience is a useful property. If the data is biased such that key pairs with many matching positions are highly unlikely, a function with high resilience approaches pairwise balance, with nearly equal numbers of the likely keys in each pre-image. While we cannot be sure that the resulting hash values will be uniformly distributed, as the biased keys are not selected uniformly at random, we can state that the function will achieve a more uniform hash distribution than a function with less resilience.

Note that these results were obtained with a single key, and the function will behave differently with other keys. We did not perform this analysis with all keys as it would have approached the limits of our computational resources, requiring calculations for $2^{34}$ key pairs. However, the standard deviations using three keys are shown in Table 6.8. We performed the calculations for several other keys, and these achieved the same results as one of the keys listed in the table. We can see that there are large differences in the standard deviations among the various keys, especially for the functions with low resilience. But the overall trends noted earlier are preserved.

6.3 Test For Effective Resilience

Our interest in function resilience is motivated by a desire to minimize collisions when hashing a set of biased keys. When hashing randomly generated keys in Chapter 5, functions having a small degree of unbalance achieved a low number of collisions, similar to the number expected from a balanced function.

                                      k
  Resilience t     0           1           2          3         4       5
  0                2 998.82    2 340.37    741.10     119.42    9.82    0.33
  1                468.07      180.98      226.00     63.91     7.53    0.33
  2                74.11       113.00      13.12      24.58     5.24    0.33
  3                11.94       31.96       24.58      2.05      2.96    0.33
  4                1.96        7.53        10.49      5.91      0.70    0.33
  5                0.33        1.65        3.31       3.31      1.65    0.33

Table 6.7: The standard deviation of the number of keys hashing to each hash value having less than or equal to $k$ matching positions with the key $x = [5, 2, 3, 7, 4, 1]$, for functions having a range of resilience properties from the CC99 family, with $q = 8$ and $n = 6$.

                                             k
  Resilience t   Key    0           1           2           3         4        5
  0              1      2 998.82    2 340.37    741.10      119.42    9.82     0.33
                 2      5 558.39    3 970.28    1 134.37    162.05    11.58    0.33
                 3      3 276.52    2 212.39    603.82      88.72     7.81     0.33
  1              1      468.07      180.98      226.00      63.91     7.53     0.33
                 2      794.06      340.31      356.51      87.97     8.93     0.33
                 3      794.06      340.31      356.51      87.97     8.93     0.33
  2              1      74.11       113.00      13.12       24.58     5.24     0.33
                 2      113.44      178.26      23.15       35.06     6.28     0.33
                 3      83.59       138.75      31.45       23.16     3.67     0.33
  3              1      11.94       31.96       24.58       2.05      2.96     0.33
                 2      16.21       43.99       35.06       3.31      3.64     0.33
                 3      11.94       31.96       24.58       2.05      2.96     0.33
  4              1      1.96        7.53        10.49       5.91      0.70     0.33
                 2      2.31        8.93        12.57       7.28      0.99     0.33
                 3      1.96        7.53        10.49       5.91      0.71     0.33
  5              1      0.33        1.65        3.31        3.31      1.65     0.33
                 2      0.33        1.65        3.31        3.31      1.65     0.33
                 3      0.33        1.65        3.31        3.31      1.65     0.33

Table 6.8: The standard deviation of the number of keys hashing to each hash value having less than or equal to $k$ matching positions with several keys. The functions are from the CC99 family with $q = 8$ and $n = 6$. Key 1 $= [5, 2, 3, 7, 4, 1]$ (from Table 6.7), key 2 $= [0, 0, 0, 0, 0, 0]$ and key 3 $= [4, 1, 3, 0, 7, 2]$. Several other keys were analysed, and obtained identical results to one of these three keys.

The hash test was unable to distinguish between these functions, and the slightly unbalanced functions were accepted as being 'effectively balanced', at least for the purpose of hashing random keys. A similar approach is appropriate when considering the effect of resilience on hash functions, and we define an effectively $t$-resilient function to be a function which is indistinguishable from a $t$-resilient function when assessed by the number of collisions generated. To illustrate, consider constructing a function by starting with a $t$-resilient function, then swapping two keys between two pre-images. The function will no longer be $t$-resilient in the strict mathematical sense, but when hashing a set of keys, the number of collisions will be indistinguishable from those generated by the original $t$-resilient function. In Chapter 5, we used the balance measure to quantify the degree of unbalance of a function. However, there is no similar measure with which we can quantify effective resilience. Instead, we can only compare functions by observing the collisions. A similarly stochastic approach is employed elsewhere to measure function avalanche using avalanche matrices, as described in Section 2.2.1.

We can now apply the theoretical results to implement a black-box test for effective resilience, based on the hash collision test described in Chapter 5. The test uses sets of keys that are generated by a zero-order biased model. By increasing the degree of bias when generating the key sets, we are able to increase the expected number of matching positions between pairs of keys. If the hash function has high effective resilience such that $t > k$, we know from Theorem 12 that the function is effectively pairwise independent. Then the probability distribution of the hash values will be uniform, and we expect to observe a mean LLPS value within the critical region of the test. On the other hand, when $t < k$, the function is not effectively pairwise independent, and the distribution of hash values is not expected to be uniform.

A single test is only able to determine whether $t$ is greater than or less than $k$. However, by performing several tests using key sets having a range of bias, and therefore having a range of expected values for $k$, we can obtain a coarse estimate of $t$. The relationship between $k$ and the bias of the key sets is not precise, so we are not able to directly relate the measurements of $\bar{X}$ from the tests to a precise value of $t$. Nonetheless, we can test multiple hash functions and rank them in order of effective resilience, or perhaps compare the results with those obtained using a function with known resilience.

In each test, 1 000 trials are performed, each using a set of 256 keys generated using a biased zero-order distribution. The distributions used to generate the key sets have the same entropy, but the symbols are randomly permuted. Specifically, a key $s_1 s_2 \ldots s_n$ is a string of symbols $s_i \in \Sigma$. We define a probability distribution $p_j$, $j \in \{1, 2, \ldots, q\}$, $q = |\Sigma|$, having the same cardinality as the alphabet. In the zero-order generation model, the probability of each symbol occurring is $\Pr[s] = p_{\pi(s)}$, where $\pi : \Sigma \to \{1, 2, \ldots, q\}$ is a mapping of the symbols onto the probability distribution. In each trial, the probability distribution $p_i$ is the same, but a different randomly generated mapping $\pi$ of symbols to probabilities is used. Generating the keys in this manner ensures that the symbols have a uniform expected frequency over all of the trials, but the expected number of matching positions $k$ between keys can be varied.

The specific steps of the test procedure are,

1. Generate several probability distributions over the alphabet size $q$ having a range of degrees of bias. Then perform the remaining steps using each distribution.

2. Generate 1 000 random permutations of $\{1, 2, \ldots, q\}$.

3. Using each permutation from Step 2, map symbols onto the probability distribution and generate a set of 256 distinct keys.

4. For each key set, make a random selection of 7 bit positions in the hash.

5. For each key set, hash the keys and convert the hashes into 7-bit hashes, known as selected hashes, using the selected positions from step 4. Insert the selected hashes into a hash table of size 128, and measure the LLPS.

6. Find $\bar{X}$, the mean of the 1 000 LLPS measurements.

The test procedure produces a measurement of X̄ for each test, each of which uses a different probability distribution. The null hypothesis, that the function resilience t is greater than the number of matching positions between keys in the key sets, is accepted when X̄ ≤ 6.541.
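To make the procedure concrete, the sketch below implements the trial loop in Python. It is an illustration only, not the thesis code: the hash function hash_fn, its output width hash_bits, and the helper names are assumptions, and probs can be any biased zero-order distribution over an alphabet of size q.

```python
# Sketch of the resilience test of Section 6.3 (illustrative, not the thesis code).
# Assumptions: hash_fn maps a key (a tuple of symbols) to an integer of hash_bits bits;
# probs is a biased zero-order distribution over an alphabet of size q.
import random

def selected_hash(h, positions):
    # Project the hash onto 7 chosen bit positions, giving a 7-bit selected hash.
    return sum(((h >> p) & 1) << i for i, p in enumerate(positions))

def one_trial(hash_fn, probs, q, key_len=32, n_keys=256, hash_bits=64):
    symbols = list(range(q))
    random.shuffle(symbols)                      # random mapping pi of symbols to probabilities
    keys = set()
    while len(keys) < n_keys:                    # 256 distinct keys from the zero-order model
        keys.add(tuple(random.choices(symbols, weights=probs, k=key_len)))
    positions = random.sample(range(hash_bits), 7)
    table = [0] * 128
    for key in keys:
        table[selected_hash(hash_fn(key), positions)] += 1
    return max(table)                            # LLPS: longest chain in the 128-slot table

def resilience_test(hash_fn, probs, q, trials=1000):
    mean_llps = sum(one_trial(hash_fn, probs, q) for _ in range(trials)) / trials
    return mean_llps, mean_llps <= 6.541         # accept: indistinguishable from t > k
```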

6.4 Validation of the Resilience Test

To validate the test for resilience, we applied the test procedure to functions having a known resilience. We used the CC99 hash function family, Equations 6.12, 6.13, and 6.14, with all functions h : F_128^32 → F_128 having symbols and hashes of 7 bits and a key length of 32 symbols. We used all 32 functions from the family, with resilience t varying from 0 to 31.

We implemented the CC99 family using Sage (Stein et al., 2012; Decker et al., 2011; Gautier et al., 2005), an open source mathematical system implemented in Python. Our program uses the Sage module for symbolic algebra on a finite field to generate lookup tables implementing the hash function. These tables were then included by a C program that efficiently performed the hash calculations. The experiment followed the procedure described in Section 6.3. For each pairing of a resilient function and a biased data set, 1 000 trials were run hashing 256 keys, and the mean of the observed LLPS values, X̄, was recorded. The biased input keys were artificially generated using a zero-order model. Nine probability distributions were used, with a range of degrees of bias, denoted by the bias parameter α, 0 ≤ α ≤ 1.0. The value α = 0 gives a uniform distribution, and α = 1.0 produces a single outcome having probability 1.0. The relation for generating the probability distribution was,

p_i = (α(q − 1) + 1)(1 − c_{i−1}) / (q − i + 1),        (6.24)

where p_i is the probability of the i-th term, and c_{i−1} is the cumulative probability of the previous i − 1 terms. The generated distribution is a discrete approximation of the exponential distribution, which can be seen from the relation: 1 − c_{i−1} is the remaining probability available to be allocated, and q − i + 1 is the number of probabilities yet to be generated. A value of α = 0.5 will allocate half of the remaining probability to the next value.

At the start of each trial, the input alphabet was randomly permuted before assignment to the probabilities, ensuring that in each trial the probability distribution of the symbols was different, while having the same entropy. The distribution entropies, and charts of the probabilities, are shown in Table 6.9 and Figure 6.2. To generate the strings in each trial, each symbol was randomly selected according to the symbol distribution for that trial. Each symbol was chosen independently of previous symbols, and duplicate keys were discarded.
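The following sketch generates the distribution of Equation (6.24) directly. It is our own illustration, not the code used for the experiments; a small clamp is added (our addition) so that the tail cannot overshoot the remaining probability due to the final terms of the recurrence.

```python
# Sketch of the biased zero-order distribution of Equation (6.24); illustrative only.
def biased_distribution(alpha, q):
    probs, cumulative = [], 0.0
    for i in range(1, q + 1):
        p = (alpha * (q - 1) + 1) * (1.0 - cumulative) / (q - i + 1)
        p = min(p, 1.0 - cumulative)   # clamp: guard against overshoot at the tail (our addition)
        probs.append(p)
        cumulative += p
    return probs

# alpha = 0 gives the uniform distribution; alpha = 0.5 allocates roughly half of the
# remaining probability to each successive symbol.
uniform = biased_distribution(0.0, 128)
skewed = biased_distribution(0.5, 128)
```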

Results. The observed mean LLPS values from the tests are listed in Table 6.10 and graphed in Figure 6.3. The functions with resilience of 10 or greater passed the test for all key sets.


Figure 6.2: Probability distributions of input symbols for the experiment in Section 6.4, using several values of the bias parameter, α. The top chart shows the full distribution, and the lower chart is zoomed in to the lower part of the probability range.

α               Entropy (bits)   p_m     E(k)     95-th percentile of k
0.00 (uniform)  7.000            0.008    0.250    1
0.02            6.197            0.016    0.520    2
0.05            5.332            0.032    1.011    3
0.07            4.939            0.042    1.349    3
0.10            4.487            0.058    1.870    4
0.20            3.511            0.117    3.735    7
0.30            2.876            0.182    5.819   10
0.40            2.385            0.255    8.162   12
0.50            1.969            0.338   10.815   15

Table 6.9: Distributions of the zero-order biased strings generated for the experiment. Several sets of strings were generated with various degrees of bias ranging from unbiased, α = 0.0, to highly biased, α = 0.5. The degree of bias is indicated by the distribution entropy. The last three columns give parameters relating to the probability distribution of k, the number of matching positions between a pair of keys, from Equations 6.10 and 6.11. The upper tail of the distribution is indicated by the 95-th percentile: 95% of key pairs have k less than or equal to this value.

These functions generated a random distribution of hashes, even with the most biased input. There were 11 observations from these functions that were outside the 95% critical region, which is close to the expected rate of 10 (5%) failures from the 198 trials. These functions were statistically indistinguishable from a truly random hash function, at least when hashing biased keys from a zero-order source. On the other hand, the 0-resilient function only achieved a uniform hash distribution for unbiased data (α = 0). Between these upper and lower extremes of resilience, there is a clear monotonic relation between the degree of bias in the input data and function resilience: (i) with data having a certain degree of bias, a function having greater resilience will achieve a more uniform hash distribution, up to a resilience threshold where the hash distribution is effectively uniform; and (ii) a function with a certain degree of resilience will uniformly distribute the hashes of biased keys up to a threshold of bias, beyond which the hash distribution becomes increasingly non-uniform. This threshold of bias is higher for functions with higher resilience. Consequently, the test is able to distinguish between functions having different degrees of effective resilience, and rank them in order.

In addition to validating the test for effective resilience, this experiment supports the theoretical results presented earlier in this chapter. As predicted, the functions having resilience t > k passed the test, and are effectively pairwise independent. Comparing the results with the expected values of k listed in Table 6.9, for all key sets, functions with t somewhat less than the 95-th percentile value of k still passed the test. This suggests that a small fraction of pairs may have k > t while still achieving effective pairwise independence. From our results, when k > t for up to 5% of pairs, there is a high likelihood of obtaining a uniform distribution of hashes. Furthermore, in 7 of the 9 key sets, functions having resilience t > E(k) passed the test. The other two cases were α = 0.02, having E(k) = 0.52, where the 1-resilient function failed the test, and α = 0.1, having E(k) = 1.87, where the 2-resilient and 3-resilient functions did not pass. It appears that the threshold of function resilience needed to achieve pairwise independence is close to t = E(k).

6.5 Hashing English Text

We now briefly investigate the performance of resilient functions on data that is not generated by a zero-order source. While we know from the theoretical results that a function with t > k is effectively pairwise balanced, the data may be biased so that the function does not produce pairwise-independent hashes.

Consequently, we do not expect resilient functions to perform as well with key sets that are not generated by a zero-order source. To demonstrate this, we hashed English text using the hash functions from the CC99 family used in the previous section.

Resilience      Symbol distribution bias parameter, α
t         0.0     0.02    0.05    0.07    0.1     0.2     0.3      0.4      0.5
0       6.457  10.241  16.518  21.076  27.865  50.501  75.212  100.079  125.722
1       6.466   6.587   6.965   7.415   8.396  14.322  25.031   41.808   63.745
2       6.500   6.510   6.481   6.512   6.639   7.649  10.967   18.714   33.228
3       6.470   6.522   6.497   6.469   6.560   6.644   7.378   10.286   18.056
4       6.552   6.491   6.488   6.518   6.482   6.509   6.656    7.536   11.091
5       6.474   6.498   6.470   6.524   6.479   6.510   6.485    6.700    8.156
6       6.508   6.466   6.485   6.494   6.518   6.517   6.526    6.630    7.033
7       6.543   6.487   6.479   6.502   6.492   6.522   6.475    6.495    6.644
8       6.464   6.539   6.485   6.529   6.487   6.519   6.462    6.461    6.583
9       6.503   6.507   6.489   6.474   6.473   6.461   6.459    6.538    6.546
10      6.537   6.486   6.498   6.469   6.502   6.497   6.507    6.477    6.508
11      6.460   6.542   6.487   6.531   6.463   6.542   6.547    6.495    6.436
12      6.547   6.471   6.488   6.471   6.474   6.474   6.528    6.537    6.504
13      6.449   6.529   6.506   6.446   6.455   6.467   6.514    6.517    6.494
14      6.464   6.510   6.477   6.497   6.527   6.488   6.545    6.515    6.494
15      6.507   6.550   6.468   6.436   6.526   6.562   6.493    6.487    6.535
16      6.505   6.453   6.488   6.474   6.517   6.473   6.442    6.479    6.505
17      6.497   6.455   6.508   6.521   6.511   6.496   6.539    6.491    6.509
18      6.529   6.457   6.490   6.456   6.464   6.483   6.523    6.408    6.585
19      6.499   6.509   6.515   6.513   6.503   6.530   6.454    6.483    6.466
20      6.517   6.523   6.493   6.479   6.503   6.545   6.477    6.497    6.509
21      6.476   6.507   6.486   6.531   6.516   6.500   6.461    6.474    6.428
22      6.483   6.501   6.517   6.507   6.479   6.471   6.481    6.474    6.496
23      6.453   6.474   6.507   6.517   6.502   6.475   6.543    6.435    6.484
24      6.475   6.525   6.508   6.459   6.464   6.518   6.486    6.488    6.501
25      6.504   6.511   6.454   6.523   6.506   6.523   6.505    6.487    6.482
26      6.469   6.467   6.442   6.506   6.448   6.521   6.503    6.533    6.528
27      6.486   6.466   6.529   6.463   6.447   6.437   6.574    6.440    6.472
28      6.460   6.504   6.493   6.510   6.506   6.440   6.444    6.515    6.419
29      6.484   6.519   6.510   6.523   6.529   6.490   6.437    6.481    6.529
30      6.481   6.529   6.483   6.462   6.472   6.510   6.438    6.407    6.494
31      6.500   6.499   6.447   6.452   6.491   6.459   6.502    6.447    6.468

Table 6.10: Means of the observed LLPS values from the experimental validation of the resilience test. Nine sets of keys ranging from unbiased (α = 0.0) to most biased (α = 0.5) were hashed by functions having various degrees of resilience, t, from the CC99 family. Observations with a mean LLPS greater than 6.540 are outside of the critical region.


Figure 6.3: Means of the observed LLPS values from the experimental validation of the resilience test, which are listed in Table 6.10. Nine sets of keys ranging from unbiased (α = 0.0) to most biased (α = 0.5) were hashed by functions having various degrees of resilience, t, from the CC99 family. The top chart shows the full range of observations, and the bottom chart shows the region close to the theoretical mean LLPS. Observations above the line at 6.540 are outside the 95% confidence interval, and have less than 0.05 probability of being generated by a uniform distribution of hash values.

The input keys were obtained from the EnronSent corpus (Styler, 2011), a collection of 96 107 email messages collected during the prosecution of the Enron Corporation. The email headers and attachments have been removed, retaining text mainly entered by people. Two sets of keys were extracted: Separated Strings, where a random-length gap of between 10 and 30 characters was skipped before copying the next key (a string of length 32), to ensure that dependence between the keys was not significant; and Overlapping Strings, where the next key begins at the second character of the previous key, as would be used in a rolling-hash algorithm. The zero-order empirical entropy of the keys was 4.845 for Separated Strings, and 4.776 for Overlapping Strings. As in the experiment of Section 6.4, each function from the CC99 family was evaluated by recording the mean LLPS from 1 000 trials, each hashing 256 keys. No trial contained duplicate keys.
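As an illustration of the two extraction schemes (not the scripts used for the thesis), the keys could be cut from the corpus text as follows; the key length and gap range are those quoted above.

```python
# Sketch of the Separated Strings and Overlapping Strings extraction schemes.
import random

def separated_keys(text, key_len=32, min_gap=10, max_gap=30):
    keys, pos = [], 0
    while pos + key_len <= len(text):
        keys.append(text[pos:pos + key_len])
        pos += key_len + random.randint(min_gap, max_gap)   # skip a random gap
    return keys

def overlapping_keys(text, key_len=32):
    # Each key starts one character after the previous one, as a rolling hash would see them.
    return [text[i:i + key_len] for i in range(len(text) - key_len + 1)]
```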

The observed mean LLPS values are shown in Figure 6.4. There are clear differences from the results obtained with zero-order biased keys, shown in Figure 6.3. With English text, only the functions with very high resilience (greater than or equal to 27) passed the test. The functions having lower resilience performed noticeably worse with the English text than with the zero-order biased keys. As the number of matching positions k between keys in the English text data is much less than 27, we conclude that the keys are biased in some way so that pairwise balance does not result in a uniform distribution of hash values. We suggest that this is because English text is not a zero-order source.

Each letter has a strong dependency on the preceding letters: knowledge of the preceding five letters allows much improved prediction of the next letter (Teahan and Cleary, 1997). This more complex bias model leads to more complex constraints between pairs of keys, which, in turn, are not well suited to the resilient functions.


Figure 6.4: Mean observed LLPS from hashing English text with resilient functions. The upper chart shows the full range of observations, and the lower chart shows the region close to the theoretical mean LLPS. The upper limit of the 95% critical region at 6.54 is shown as a horizontal line.

The overlapping keys have an additional constraint, compared to the separated keys, and the resilient functions produce a less uniform distribution of hashes. Overall, these results confirm our earlier result that resilient functions are best suited to data from a zero-order source. That the resilient functions do not perform as well when the data contains additional biases is expected. For practical situations with biased keys from complex sources, these results suggest that very high resilience is needed.

6.6 Resilience of Popular Hash Functions

We investigated the resilience of popular hash functions. High function resilience was not an explicit design goal during the development of these functions. However, as there is extensive experience of their use in numerous systems, we accept that they behave well with a wide variety of biased data. Consequently, we expect that they will have a high degree of resilience. The functions were tested using the zero-order biased key sets from Section 6.4: nine sets of keys ranging from unbiased (α = 0) to highly biased (α = 0.5). The test procedure was similar, with 1 000 trials, each inserting the hashes of 256 keys into a table of size 128. The additional step of random bit-selection was performed for each trial to convert the hash values to 7 bits, as described in Chapter 5. The results are shown in Figure 6.5.

The function Adler32 is a checksum used in the zlib compression library. It is not intended to be used as a hash function, and is known to produce a skewed distribution of values from short keys (Stone et al., 2002). It was included in these tests as an example of a function likely to have poor resilience. The results confirm this, showing that it is not even balanced for short keys, as it did not pass the test with the unbiased key set, and the hash distribution becomes progressively less uniform with more biased keys.


Figure 6.5: Mean observed LLPS of several popular hash functions tested using the zero-order biased keys from Section 6.4. Adler32 is a checksum included to show a function with poor resilience. Polynomial division and Karp-Rabin are rolling hash functions. The other functions are non-universal, non-cryptographic hash functions widely used in hash-based data structures.

Two rolling hash functions were tested: the Karp-Rabin hash function with radix = 256 and prime divisor = 36028797018963913 = 2^55 − 55, producing a 54-bit hash value; and the 62-bit Cohen hash function using division by an irreducible polynomial, described in Chapter 3. The implementation of the CRC32 function was from the zlib module in the Python 3.3.0 distribution. The other functions, SpookyHash, MurmurHash2, MurmurHash3, CityHash, and the Python built-in hash function, were described in Section 2.2. All of these functions passed the test for all sets of keys. When the functions with known resilience were tested in Section 6.4, only functions having resilience greater than or equal to 10 passed the test with all of the key sets, so we conclude that the resilience of these popular hash functions is at least 10.
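For reference, a minimal Karp-Rabin rolling hash with the radix and divisor quoted above can be written as follows. This is our own sketch for illustration, not the implementation that was tested.

```python
# Sketch of a Karp-Rabin rolling hash with radix 256 and prime divisor 2^55 - 55.
RADIX = 256
PRIME = 2**55 - 55                                 # 36028797018963913

def kr_hash(window: bytes) -> int:
    h = 0
    for b in window:
        h = (h * RADIX + b) % PRIME
    return h

def kr_roll(h: int, out_byte: int, in_byte: int, n: int) -> int:
    """Slide an n-byte window one byte to the right in O(1)."""
    msb = pow(RADIX, n - 1, PRIME)                 # weight of the outgoing byte
    h = (h - out_byte * msb) % PRIME               # remove the outgoing byte
    return (h * RADIX + in_byte) % PRIME           # append the incoming byte

data = b"collection-based compression"
n = 8
h = kr_hash(data[:n])
for i in range(1, len(data) - n + 1):
    h = kr_roll(h, data[i - 1], data[i + n - 1], n)
    assert h == kr_hash(data[i:i + n])
```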

6.7 Discussion

We have presented several results indicating that high function resilience is necessary when hashing biased data. The proof in Section 6.1 showed that resilient functions are effectively pairwise balanced when key pairs have a low number of matching positions, and that increased resilience is needed for pairs having more matching positions. The test with English text demonstrated that the connection between pairwise balance and effective pairwise independence is complex, and that a higher resilience was required to obtain a uniform distribution of hash values compared to the test using zero-order biased keys. Finally, all of the popular hash functions that we tested were found to have high resilience. While function resilience has not been explicitly stated as a design goal of these functions, the functions nonetheless require high resilience in order to be successful.

Our particular motivation for this investigation was to determine whether the rolling hash functions are suitable for use in a CBC system. The results show that, when the keys do not overlap, the rolling hash functions have high resilience, and also that they have similar resilience to the popular hash functions which are widely accepted to be well behaved. Consequently, we conclude that the rolling hash functions are suitable for use in CBC systems such as cobald.

The popular hash functions SpookyHash, CityHash, MurmurHash2 and MurmurHash3 are not well suited for use in cobald as they are not rolling-hash functions: they do not calculate the hash of the next position using the previous hash, instead processing the entire key to calculate the next hash. Hence we conclude that the rolling hash functions are the preferred hash functions to use in cobald. We noted at the start of this chapter that resilience imposes a bound on function avalanche, and the necessity for high resilience suggests that avalanche is not an important property of non-cryptographic hash functions. This proposition is an important question for further investigation, as it contradicts the currently accepted view. Until that is resolved, we suggest that a hash function should possess at least a small degree of avalanche. Our results with English text obtained a uniform distribution of hash values when t = 27, while the maximum available resilience was t = 31, leaving scope for 1 or 2 degrees of avalanche. Generally, our experience suggests that a good hash function should have a balance of properties if it is to perform well with a wide variety of input data, and it is clear that a maximally resilient function will not perform well with highly similar keys differing in only one or two positions; a realistic scenario in some practical situations.

Resilient functions have the further advantage of being symmetric in the sense that, because they have balanced t-grams, a permutation of the symbols in the alphabet will not affect the pairwise balance. Therefore, the hash value distribution is not affected by which symbols are most or least frequent, but only by the distribution of frequencies. A consequence of this property is that a resilient function is a safe choice when little is known about the biased distribution of the input data. Furthermore, while it may be possible to exploit knowledge of the biased distribution to achieve a more uniform hash distribution using a non-resilient function, that function may achieve a poorer hash distribution when used with data biased in a different way.

There are many avenues of further work to extend this investigation. The theoretical result could be extended to find the balance fraction between triples or larger sets of keys, and the conditions needed to obtain effective three-wise and higher independence. Improved independence yields a more uniform distribution of hash values compared to pairwise independence. Furthermore, the theoretical LLPS result used in the collision tests is for the ideal case of complete independence, while we have only proved results for effective pairwise independence. The experimental results were consistent with the ideal case, although some divergence should have been observed if the functions were only pairwise independent. A potential explanation is that the functions with high resilience, such as when t is significantly greater than k, are effectively three-wise independent or greater. Another aspect needing further work is to improve the resilience test so that it is better able to discriminate between adjacent values of t. We suggest that this can be achieved by testing with key sets having exactly k matching positions between each pair, rather than the zero-order keys used here, where k follows a distribution.
Last, the suitability of highly resilient functions for hashing typical data should be rigorously tested using a range of key sets from common applications, and also some pathological data sets to anticipate worst-case situations. If highly resilient functions prove to be suitable, there remains substantial work to improve their speed, which has not been addressed in this investigation.

Chapter 7

Conclusions

The storage of large collections of data, which are being created at an ever-increasing rate, is an acute and growing problem. To address this issue, this thesis advances the art of collection-based compression (CBC). While existing storage techniques, such as single-file compression, provide some alleviation, CBC can provide further benefits on suitable data. It achieves this by exploiting long-range matching data from all parts of a data collection, and we have demonstrated that substantial improvement can be obtained on collections of genomic data and regular archives of web sites, where the data has a high degree of similarity. The specific focus of this investigation is the use of string hashing to efficiently identify long-range matching data. In this concluding chapter, we summarise the outcomes of the investigation, then discuss their implications for further work. We describe avenues to extend this research program, as well as the potential impact of this research on the design and evaluation of non-universal hash functions and the compression of data collections generally.

7.1 Thesis Outcomes

We proposed a new algorithm, in Chapter 3, for delta encoding with respect to multiple reference files that runs in time linear in the size of the input file. We implemented this algorithm in the cobald CBC system. It obtained a significant improvement in compression effectiveness compared to single-file compression: between 5 and 20 times the compression obtained from compressing the files individually using 7-zip and xz, the most effective single-file compressors. The cobald system was compared to other delta encoders and found to achieve better compression and to use less memory for inputs larger than 1.5 Gbytes (the sum of the input and reference file sizes). The most effective settings for the key algorithm parameters were found to be a trade-off between compression and resource use.

Significant compression was obtained using n-gram lengths of 256 and 1 024 bytes, with n = 256 achieving better compression, but at the cost of a larger index and using more memory. We also investigated two methods to construct the index, finding that fixed-offset selection of n-grams achieved better compression, while content-defined selection used significantly less memory. We proved that the hashes from a rolling hash function are not limited to pairwise independence when the keys do not overlap by more than half of their length. This result is applicable to the hash index used in our CBC algorithm when every n-th n-gram is selected with fixed-offset selection, and also for most n-grams selected using content-defined selection. We demonstrated, in Chapter 4, the effectiveness of compressing entire collections by storing the CBC deltas, obtaining between a 3 times compression improvement on a 55 Gbyte collection of web files and a 17 times improvement on a 26 Gbyte collection of genome files, compared to compression of the files individually by 7-zip. Moreover, the total compression time was faster using cobald, while it used slightly more memory than 7-zip on some of the data.
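As a rough illustration of the two index-construction strategies (the details of cobald's implementation differ, and the CRC fingerprint below is only a stand-in for the rolling hash used in the thesis):

```python
# Sketch of fixed-offset versus content-defined selection of n-grams for the index.
import zlib

def fingerprint(gram: bytes) -> int:
    return zlib.crc32(gram)        # stand-in fingerprint; the thesis uses a rolling hash

def fixed_offset_selection(data: bytes, n: int = 1024):
    # Take every n-th n-gram, i.e. non-overlapping blocks at fixed offsets.
    return [(i, fingerprint(data[i:i + n])) for i in range(0, len(data) - n + 1, n)]

def content_defined_selection(data: bytes, n: int = 1024, mask: int = (1 << 10) - 1):
    # Select an n-gram whenever its fingerprint matches a fixed bit pattern, so the
    # selected positions depend on the content rather than on byte offsets. A rolling
    # hash makes this a constant-time check per position.
    selected = []
    for i in range(len(data) - n + 1):
        fp = fingerprint(data[i:i + n])
        if fp & mask == 0:
            selected.append((i, fp))
    return selected
```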

We satisfied the requirements of a successful CBC system relating to resource use (as stated in Chapter 1) when the collection is stored in a compressed form as cobald delta files. Memory used by cobald is sublinear in the size of the collection, at least when there is a high degree of similarity between the data, and the compression time was linear in the input file size. We developed, in Chapter 5, a new black-box test procedure for evaluating the collision properties of a hash function, which obtains a p-value for the hypothesis that the hash function is indistinguishable from an ideal hash function. We proved, in Chapter 6, that a t-resilient function is effectively pairwise independent when the set of keys is biased such that key pairs have fewer than t matching positions. Using the test procedure from Chapter 5, we confirmed this result by hashing sets of zero-order biased keys having a range of degrees of bias. Finally, we demonstrated that high resilience is a necessary property of hash functions when hashing data biased in ways which are encountered in practice, by hashing English text with functions having a range of resilience, and also by showing that several commonly used hash functions, which are thought to be well behaved, have high resilience.

Three key outcomes of this thesis are of immediate interest to practitioners. We have established that compressing a collection using CBC may provide a significant improvement in compression compared to the current practice of compressing files individually, while using a similar amount of computer memory and in less time. Owners of data collections having a high degree of similarity between the collection files can expect to reduce their storage needs several-fold by adopting CBC.

Users of non-universal, non-cryptographic hash functions, as well as developers of these functions, can immediately adopt the black-box collision test. By allowing the tester to select the significance and power of the test, the tester can avoid drawing unwarranted conclusions from testing too little, or conversely avoid wastefulness by performing more tests than necessary. Furthermore, the test provides an absolute measure that may lead to acceptance of good hash functions that were previously overlooked because they were marginally outperformed by other functions in relative tests.

The results concerning hash function resilience are also of immediate interest to designers and evaluators of non-universal, non-cryptographic hash functions. Current practice for testing hash functions does not include any consideration of function resilience, but we have shown that it is an essential property. The test for function resilience can be readily added to hash function test suites. A related implication arises from the trade-off between function resilience and avalanche. Current practice is to test hash functions for avalanche, requiring that the hash function has at least one bit of avalanche, sometimes two, on the assumption that a high degree of avalanche is necessary for a good hash function. Our results show that, while a small degree of avalanche is useful, any more may be detrimental as it limits the degree of resilience.

7.2 Further Work

There are many avenues that would extend the ideas developed in this thesis.

The CBC algorithm is incomplete in two key respects. A memory limit is not described, and we did not investigate this in detail, but it is essential if the algorithm is to be deployed in practice. An obvious resolution is to discard a record from the index before adding a new one when the hash index in memory reaches its maximum size. If the lowest or highest hash in the index is discarded, we expect that the discarded records will be from random locations in the collection.
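A minimal sketch of such a bounded index follows; the class and its eviction policy (discard the highest hash) are our own illustration of the idea, not part of cobald.

```python
# Sketch of a memory-limited fingerprint index that evicts the highest hash when full.
import heapq

class BoundedFingerprintIndex:
    def __init__(self, max_records):
        self.max_records = max_records
        self.index = {}          # fingerprint -> location of the n-gram in the collection
        self.heap = []           # max-heap of fingerprints (stored negated)

    def add(self, fp, location):
        if fp in self.index:
            return
        if len(self.index) >= self.max_records:
            victim = -heapq.heappop(self.heap)    # highest remaining fingerprint; since hashes
            del self.index[victim]                # are effectively random, the evicted record
                                                  # comes from a random location
        self.index[fp] = location
        heapq.heappush(self.heap, -fp)

    def lookup(self, fp):
        return self.index.get(fp)
```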

While it is not clear how this will affect system performance, we expect it will lead to a gradual degradation of compression, as a similar scheme is used by other delta encoders, such as xdelta, leading to a steady decline in compression as the input file size increases. File deletion is the other respect in which our algorithm is incomplete, and was not investigated during the course of this thesis. The problem arises when the collection files are stored as the compressed deltas. Before a file is deleted from the collection, the COPY blocks in other delta files that refer to the deleted file must be replaced. There are multiple ways to implement this; an elementary scheme would be to replace the COPY blocks with an ADD block containing the original string, then re-compress the altered files. Importantly, in a collection of delta files such as we discussed in Chapter 4, COPY blocks can only refer to ADD blocks in other files, so there is no cascading of COPY block re-writes.

A related issue is that the compression obtained when compressing a collection depends on the order in which files are added. In cobald, COPY blocks in later files refer to ADD blocks in files already in the collection, but an alternative scheme could rewrite the collection files to refer to the new file. We did not explore this phenomenon in detail, but it is possible that other methods may achieve better results than cobald.

There is scope for additional development of the algorithm implementation in cobald. We expect one fruitful line of development to be the use of pseudo-random access into the compressed delta files. An example of pseudo-random compression is the use of the Z_FULL_FLUSH option in zlib to insert sync points.

The file can then be partially decompressed starting from a sync point. Note that this blocking technique results in less compression than would be achieved otherwise, as data preceding the sync point cannot be used to compress the data following it. Essentially, the compressor is using a smaller sliding window. Currently in cobald, the stored deltas are files compressed by 7-zip or another single-file compressor, and must be decompressed before cobald can read the delta file during encoding. With pseudo-random access, only the desired block of the compressed delta needs to be decompressed, providing a significant saving of time and eliminating the need to temporarily store the decompressed files. An interesting possibility is that cobald may mitigate the reduction in compression obtained by the second stage, as it will have removed all of the large matching strings beforehand. Furthermore, compressing the delta in blocks permits the use of a parallel compressor for the second stage. Some commonly used single-threaded compressors have been re-implemented using this approach, for example pigz (http://www.zlib.net/pigz/), a multi-threaded implementation of gzip, and PBZIP2 (http://compression.ca/pbzip2/), a re-implementation of bzip2. This would provide some improvement in the overall compression time for cobald.
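The sketch below illustrates the sync-point idea with Python's zlib module; it is a simplified example rather than cobald's storage format, and it assumes raw deflate streams (wbits=-15) so that a block can be inflated from a recorded sync offset without the preceding data.

```python
# Sketch of pseudo-random access into a compressed file using Z_FULL_FLUSH sync points.
import zlib

def compress_with_sync_points(blocks):
    comp = zlib.compressobj(wbits=-15)             # raw deflate, no zlib header
    out, offsets = b"", []
    for block in blocks:
        offsets.append(len(out))                   # sync point at the start of this block
        out += comp.compress(block)
        out += comp.flush(zlib.Z_FULL_FLUSH)       # reset state: no back-references cross this point
    return out + comp.flush(zlib.Z_FINISH), offsets

def decompress_block(data, offsets, i):
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    d = zlib.decompressobj(wbits=-15)
    return d.decompress(data[offsets[i]:end]) + d.flush()

compressed, offsets = compress_with_sync_points([b"delta block one", b"delta block two"])
assert decompress_block(compressed, offsets, 1) == b"delta block two"
```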

Another aspect of the implementation of cobald which has not been considered in this thesis is its interaction with other systems, and what additional facilities it should provide to support efficient operation in a larger application. For instance, a common scenario may be the regular addition to the collection of copies of a file system or web site, similar to the web snapshot collections used here. These images will include files which have not changed since the previous addition, and which will already be in the collection. Re-compressing such a file and adding it to the collection will not consume significant additional storage space, but it is an unnecessary use of time. However, the only way to compare files in a cobald collection is to decompress them. By storing a hash of the uncompressed file with the compressed delta in the cobald collection, duplicate files could be detected more efficiently.
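A sketch of that duplicate check follows; the digest choice and manifest layout are illustrative only.

```python
# Sketch: store a digest of the uncompressed file beside its delta, and compare
# digests before re-adding a file to the collection.
import hashlib

def file_digest(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def is_duplicate(path, manifest):
    """manifest maps stored digests to entries already in the collection."""
    return file_digest(path) in manifest
```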


There are several aspects of the black-box collision test that would benefit from further investigation. We did not justify the selection of some parameters of the black-box test, in particular the hash table size of 128 and the key set size of 256, which consequently appear to be arbitrary choices. In our investigation, these values were chosen for pragmatic reasons.

Specifically, we were able to find the theoretical solution to the balls-in-bins problem for these values, and we were unable to implement a family of resilient functions having a larger range. While this does not diminish the effectiveness of the test, it is possible that other parameter choices may provide more statistical power. Further work is needed to definitively determine whether the power of the test is reduced by the bit-selection process when inserting the hash into a table of size 128. Our assessment, based on an informal argument and an empirical evaluation, suggests that it is not, but the possibility remains that bit-selection may, in some circumstances, be detrimental.

Another avenue of investigation is to resolve the mismatch between the black-box test, which compares the hash function to an ideal hash function having independent hashes, and the proof regarding effective pairwise independence of resilient functions. One approach is to extend the proof to effective 3-way and higher degrees of independence. Alternatively, it would be interesting to find an analytic solution to the balls-in-bins problem with pairwise independent balls, which, to our knowledge, and perhaps surprisingly, has not been found. The test for resilience can be improved by using a more carefully selected set of keys. In particular, if the keys had exactly k matching positions between each pair in the set, rather than merely a distribution having mean k, the resilience of the function could be resolved with greater fidelity. However, generating such a key set is non-trivial if we are to also satisfy the condition that the keys are zero-order biased.
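As a starting point, a key with exactly k positions matching a given reference key can be generated as below; this is our own sketch, and it only controls matches against one reference key, not between every pair in a set, which, as noted, is the harder requirement.

```python
# Sketch: build a key with exactly k positions matching a reference key.
import random

def key_with_k_matches(reference, k, alphabet):
    positions = set(random.sample(range(len(reference)), k))
    key = []
    for i, s in enumerate(reference):
        if i in positions:
            key.append(s)                                               # forced match
        else:
            key.append(random.choice([c for c in alphabet if c != s]))  # forced mismatch
    return key

ref = [random.randrange(128) for _ in range(32)]
other = key_with_k_matches(ref, 5, range(128))
assert sum(a == b for a, b in zip(ref, other)) == 5
```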

This thesis is a significant contribution to the fields of string hashing and data compression, with immediate implications for practitioners, and it provides several research questions for future investigation.

Bibliography

Adams, J. (2010). Billions of hits: Scaling twitter. http://assets.en.oreilly.com/1/event/37/Billions of Hits Scaling Twitter Paper.pdf (retrieved on 29 Jan, 2013). Presentation at Web 2.0 expo, 5 May, 2010.

Agarwal, R. C., Gupta, K., Jain, S., and Amalapurapu, S. (2006). An approximation to the greedy algorithm for differential compression. IBM Journal of Research and Development, pages 149–166.

Aho, A. V. and Corasick, M. J. (1975). Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340.

Ajtai, M., Burns, R., Fagin, R., Long, D. D. E., and Stockmeyer, L. (2002). Compactly encoding unstructured inputs with differential compression. Journal of the ACM, 49(3):318–367.

Ali, H. (2008). Effective Web Crawlers. PhD thesis, RMIT University, School of Computer Science and Information Technology, Melbourne, Australia.

Alon, N., Dietzfelbinger, M., Miltersen, P. B., Petrank, E., and Tardos, G. (1999). Linear hash functions. Journal of the ACM, 46:667–683.

Appleby, A. (2008). SMHasher and MurmurHash project homepage. https://code.google.com/p/smhasher/ (retrieved on 29 Jan, 2013).

Bellare, M. and Kohno, T. (2004). Hash function balance and its impact on birthday attacks. In Cachin, C. and Camenisch, J., editors, Advances in Cryptology — EUROCRYPT’04, LNCS 3027, pages 401–418. Springer-Verlag.

Bellare, M. and Rogaway, P. (1993). Random oracles are practical: a paradigm for designing efficient protocols. In Proceedings of the first ACM Conference on Computer and Communications Security (CCS ’93), pages 62–73. ACM.

Bentley, J. and Floyd, B. (1987). Programming pearls: a sample of brilliance. Communications of the ACM, 30(9):754–757.

Bentley, J. and McIlroy, D. (2001). Data compression with long repeated strings. Information Sciences, 135(1-2):1–11.

Bhattacharyya, A. (1943). On a measure of divergence between two statistical populations defined by their probability distributions. Bulletin of the Calcutta Mathematics Society, 35(99–109):4.

Biham, E. and Shamir, A. (1991). Differential cryptanalysis of DES-like cryptosystems. Journal of Cryptology, 4(1):3–72.

Bjørner, N., Blass, A., and Gurevich, Y. (2010). Content-dependent chunking for differential compression, the local maximum approach. Journal of Computer and System Sciences, 76(3-4):154–203.

Brin, S., Davis, J., and García-Molina, H. (1995). Copy detection mechanisms for digital documents. In SIGMOD ’95: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 398–409. ACM.

Broder, A. Z. (1997). On the resemblance and containment of documents. In SEQUENCES ’97: Proceedings of the Compression and Complexity of Sequences 1997, pages 21–29. IEEE Computer Society.

Broder, A. Z. (2000). Identifying and filtering near-duplicate documents. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (COM ’00), pages 1–10. Springer-Verlag.

Burns, R. C. and Long, D. D. E. (1997). A linear time, constant space differencing algorithm. In Proceedings of the IEEE International Performance, Computing and Communications Conference (IPCCC97), pages 429–436.

Burrows, M. and Wheeler, D. J. (1994). A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Systems Research Centre.

Camion, P. and Canteaut, A. (1999). Correlation-immune and resilient functions over a finite alphabet and their applications in cryptography. Designs, Codes and Cryptography, 16:121–149.

Cannane, A. and Williams, H. E. (2002). A general-purpose compression scheme for large collections. ACM Transactions on Information Systems, 20(3):329–355.

Carlet, C. (2010a). Boolean functions for cryptography and error-correcting codes. In Crama, Y. and Hammer, P. L., editors, Boolean Models and Methods in Mathematics, number 134 in Encyclopedia of Mathematics and its Applications. Cambridge University Press.

Carlet, C. (2010b). Vectorial boolean functions for cryptography. In Crama, Y. and Hammer, P. L., editors, Boolean Models and Methods in Mathematics, number 134 in Encyclopedia of Mathematics and its Applications. Cambridge University Press.

Carter, J. L. and Wegman, M. N. (1979). Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143–154.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems, 26(2):1–26.

Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., and Shelat, A. (2005). The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–76.

Chedid, F. B. and Mouawad, P. G. (2006). On compactly encoding with differential compression. In IEEE International Conference on Computer Systems and Applications, 2006, pages 123–129.

Chor, B., Goldreich, O., Hasted, J., Freidmann, J., Rudich, S., and Smolensky, R. (1985). The bit extraction problem or t-resilient functions. In 26th IEEE Symposium on Foundations of Computer Science, pages 396–407.

Claude, F. and Navarro, G. (2011). Self-indexed grammar-based compression. Fundamenta Informaticae, 111(3):313–337.

Cohen, J. D. (1997). Recursive hashing functions for n-grams. ACM Transactions on Information Systems, 15(3):291–320.

Crochemore, M. and Rytter, W. (2002). Jewels of Stringology. World Scientific Publishing.

Decker, W., Greuel, G.-M., Pfister, G., and Schönemann, H. (2011). Singular 3-1-3 — A computer algebra system for polynomial computations. http://www.singular.uni-kl.de. Used by Sagemath for finite field algebra.

Douglis, F. and Iyengar, A. (2003). Application-specific delta-encoding via resemblance detection. In Proceedings of the USENIX Annual Technical Conference, pages 113–126.

Dublin, M. (2011). Storage saga. Genome Technology, pages 29–31.

Estébanez, C., Saez, Y., Recio, G., and Isasi, P. (2013). Performance of the most common non-cryptographic hash functions. Software: Practice and Experience, Wiley Online Library.

Feistel, H. (1973). Cryptography and computer privacy. Scientific American, 228(5):15–23.

Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, pages 390–398.

Ferragina, P. and Manzini, G. (2010). On compressing the textual web. In Proceedings of the third ACM International Conference on Web Search and Data Mining (WSDM ’10), pages 391–400. ACM.

Forré, R. (1990). The strict avalanche criterion: spectral properties of boolean functions and an extended definition. In Proceedings on Advances in Cryptology, pages 450–468.

Fowler, G., Noll, L. C., and Vo, K.-P. (2013). Fnvhash project homepage. http://www.isthe.com/chongo/tech/comp/fnv/index.html (retrieved on 4 June 2013).

Fowler, G., Noll, L. C., Vo, K.-P., and Eastlake, D. (2012). The FNV non-cryptographic hash algorithm. http://tools.ietf.org/html/draft-eastlake-fnv-03 (retrieved on 4 June 2013). IETF Internet Draft.

Galil, Z. (1979). On improving the worst case running time of the Boyer-Moore string matching algorithm. Communications of the ACM, 22(9):505–508.

Gautier, T., Villard, G., Roch, J.-L., Dumas, J.-G., and Giorgi, P. (2005). Givaro, a C++ library for computer algebra: exact arithmetic and data structures. Software. http://givaro.forge.imag.fr/. Used by Sagemath for finite field algebra.

Gonnet, G. H. (1981). Expected length of the longest probe sequence in hash code searching. Journal of the ACM, 28(2):289–304.

Gonnet, G. H. and Baeza-Yates, R. A. (1990). An analysis of the Karp-Rabin string matching algorithm. Information Processing Letters, 34(5):271–274.

Google Inc. (2011). Introducing CityHash. http://google-opensource.blogspot.com.au/2011/04/introducing-cityhash.html (retrieved on 3 June 2013). Announcement of the release of the CityHash open source non-cryptographic hash function.

Gopalakrishnan, K. and Stinson, D. R. (1995). Three characterizations of non-binary correlation-immune and resilient functions. Designs, Codes and Cryptography, 5(3):241–251.

Grossi, R., Gupta, A., and Vitter, J. S. (2003). High-order entropy-compressed text indexes. In Proceedings of the 14th annual ACM-SIAM Symposium on Discrete Algorithms, pages 841–850.

Grossi, R. and Vitter, J. S. (2000). Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract). In Proceedings of the 32nd annual ACM Symposium on the Theory of Computing, pages 397–406. ACM.

Grossi, R. and Vitter, J. S. (2005). Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35:378–407.

Hayden, E. C. (2014). The $1000 genome. Nature, 507:294–295.

Henke, C., Schmoll, C., and Zseby, T. (2008). Empirical evaluation of hash functions for multipoint measurements. ACM SIGCOMM Computer Communication Review, 38(3):39–50.

Hoobin, C., Puglisi, S. J., and Zobel, J. (2011). Relative Lempel-Ziv factorization for efficient storage and retrieval of web collections. Proceedings of the VLDB Endowment, 5(3):265–273.

Horspool, R. N. (1980). Practical fast searching in strings. Software - Practice and Experience, 10(6):501–506.

Hsieh, P. (2008). SuperFastHash homepage. http://www.azillionmonkeys.com/qed/hash.html (downloaded on 20 July 2015). Source code and development history.

Hunt, J. J., Vo, K.-P., and Tichy, W. F. (1998). Delta algorithms: an empirical analysis. ACM Transactions on Software Engineering Methodology, 7(2):192–214.

Jenkins, B. (1997). Hash functions. Dr Dobbs Journal. Jenkins has updated the original article with tests of new hash functions and additional observations. The updated version is available at http://burtleburtle.net/bob/hash/doobs.html (downloaded 20 July, 2015).

Jenkins, B. (2006). lookup3 hash function homepage. http://burtleburtle.net/bob/hash/ (downloaded 20 July, 2015).

Jenkins, B. (2012). SpookyHash homepage. http://burtleburtle.net/bob/hash/spooky.html (retrieved on 5 June, 2013).

Jenkins, B. (2013). Hash functions and block ciphers. http://burtleburtle.net/bob/hash/ (retrieved on 5 June, 2013). Descriptions of hash functions, tests and source code published over several years since at least 1997.

Kam, J. B. and Davida, G. I. (1979). Structured design of substitution-permutation encryption networks. IEEE Transactions on Computers, C-28(10):747–753.

Kankowski, P. (2012). Hash functions: an empirical comparison. http://www.strchr.com/hash_functions (downloaded on 20 July 2015). First published in 2007, and last revised in 2012.

Karp, R. M. and Rabin, M. O. (1987). Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260.

Knuth, D. E. (1973). The Art of Computer Programming, volume 3, Sorting and Searching. Addison Wesley.

Knuth, D. E., Morris, Jr, J. H., and Pratt, V. R. (1977). Fast pattern matching in strings. SIAM Journal on Computing, 6:323–350.

Korn, D., MacDonald, J., Mogul, J., and Vo, K. (2002). The VCDIFF generic differencing and compression data format. RFC3284, The Internet Society.

Korn, D. G. and Vo, K.-P. (2002). Engineering a differencing and compression data format. In Proceedings of the 2002 USENIX Annual Technical Conference. USENIX Association.

Kreft, S. and Navarro, G. (2011). Self-indexing based on LZ77. In Combinatorial Pattern Matching, pages 41–54. Springer.

Kryder, M. H. and Kim, C. S. (2009). After hard drives – what comes next? IEEE Transactions on Magnetics, 45(10):3406–3413.

Kulkarni, P., Douglis, F., LaVoie, J., and Tracey, J. M. (2004). Redundancy elimination within large collections of files. In Proceedings of USENIX 2004 Annual Technical Conference, pages 59–72.

Kumar, J. P. and Govindarajulu, P. (2009). Duplicate and near duplicate documents detection: A review. European Journal of Scientific Research, 32(4):514–527.

Kuruppu, S., Beresford-Smith, B., Conway, T., and Zobel, J. (2012). Iterative dictionary construction for compression of large DNA data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(1):137–149.

Lemire, D. and Kaser, O. (2010). Recursive n-gram hashing is pairwise independent, at best. Computer Speech and Language, 24(4):698–710.

Li, Y. and Cusick, T. W. (2007). Strict avalanche criterion over finite fields. Mathematical Cryptology JMC, 1(1):65–78.

MacDonald, J. P. (2000). File system support for delta compression. Master's thesis, University of California at Berkeley, Department of Electrical Engineering and Computer Sciences, Berkeley, CA, USA.

Mäkinen, V., Navarro, G., Sirén, J., and Välimäki, N. (2010). Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308.

Manber, U. (1994). Finding similar files in a large file system. In Proceedings of the USENIX Winter 1994 Technical Conference (WTEC’94), pages 2–2. USENIX Association.

Manber, U. and Myers, G. (1993). Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948.

Matsui, M. (1994). Linear cryptanalysis method for DES cipher. In Proceedings of EUROCRYPT’93, LNCS 765, pages 386–397.

Meier, W. and Staffelbach, O. (1990). Nonlinearity criteria for cryptographic functions. In Quisquater, J. J. and Vandewalle, J., editors, Advances in Cryptology — EUROCRYPT’89, volume 434 of LNCS, pages 549–562.

Mitzenmacher, M. and Vadhan, S. (2008). Why simple hash functions work: exploiting the entropy in a data stream. In Proceedings of the 19th annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’08), pages 746–755.

Mulvey, B. (2007). Hash functions. http://home.comcast.net/~bretm/hash/ (downloaded on 5 June 2013). Informal comparison of hash functions.

Muthitacharoen, A., Chen, B., and Mazières, D. (2001). A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP ’01), pages 174–187. ACM.

Myers, E. W. (1986). An O(ND) difference algorithm and its variations. Algo- rithmica, 1(2):251–266.

Navarro, G. and Mäkinen, V. (2007). Compressed full-text indexes. ACM Computing Surveys, 39(1):2.

Navarro, G. and Raffinot, M. (2002). Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences. Cambridge University Press.

Pagh, A., Pagh, R., and Ružić, M. (2009). Linear probing with constant independence. SIAM Journal on Computing, 39(3):1107–1120.

Percival, C. (2003). Naive differences of executable code. Available from http://www.daemonology.net/bsdiff/.

Potthast, M. and Stein, B. (2008). New issues in near-duplicate detection. In Preisach, C., Burkhardt, H., Schmidt-Thieme, L., and Decker, R., editors, Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, pages 601–609. Springer.

Preneel, B., Van Leekwijck, W., Van Linden, L., Govaerts, R., and Vandewalle, J. (1991). Propagation characteristics of boolean functions. In Proceedings of the Workshop on the Theory and Application of Cryptographic Techniques on Advances in Cryptology (EUROCRYPT ’90), pages 161–173.

Pătrașcu, M. and Thorup, M. (2012). The power of simple tabulation hashing. Journal of the ACM, 59(3):14.

Quinlan, S. and Dorward, S. (2002). Venti: A new approach to archival storage. In Proceedings of the Conference on File and Storage Technologies (FAST ’02), pages 89–101. USENIX Association.

Raab, M. and Steger, A. (1998). Balls into bins — a simple and tight analysis. In Randomization and Approximation Techniques in Computer Science, pages 159–170. Springer.

Ramakrishna, M. V. (1988). An exact probability model for finite hash tables. In Proceedings of the 4th International Conference on Data Engineering 1988, pages 362–368.

Ramakrishna, M. V. and Zobel, J. (1997). Performance in practice of string hashing functions. In Proceedings of the Database Systems for Advanced Applications Symposium, pages 215–223.

Reichenberger, C. (1991). Delta storage for arbitrary non-text files. In Proceedings of the 3rd International Workshop on Software Configuration Management, pages 144–152. ACM.

Reviriego, P., Holst, L., and Maestro, J. A. (2011). On the expected longest length probe sequence for hashing with separate chaining. Journal of Discrete Algorithms, 9(3):307–312.

Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., Levenson, M., Vangel, M., Banks, D., Heckert, A., Dray, J., and Vo, S. (2010). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Technical Report 800-22, Revision 1a, National Institute of Standards and Technology (NIST), US Department of Commerce. Revised by Lawrence E. Bassham III.

Salson, M., Lecroq, T., Léonard, M., and Mouchard, L. (2010). Dynamic extended suffix arrays. Journal of Discrete Algorithms, 8(2):241–257.

Schatz, M. C. and Langmead, B. (2013). The DNA data deluge. IEEE Spectrum, 50(7):28–33.

Schleimer, S., Wilkerson, D. S., and Aiken, A. (2003). Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM.

Schmidt, J. P., Siegel, A., and Srinivasan, A. (1995). Chernoff-Hoeffding bounds for applications with limited independence. SIAM Journal on Discrete Mathematics, 8(2):223–250.

Schnorr, C. P. and Vaudenay, S. (1995). Black box cryptanalysis of hash networks based on multipermutations. In Advances in Cryptology — EUROCRYPT’94, pages 47–57. Springer.

Shannon, C. E. (1949). Communication theory of secrecy systems. Bell System Technical Journal, 28(4):656–715.

Siegenthaler, T. (1984). Correlation-immunity of nonlinear combining functions for cryptographic applications. IEEE Transactions on Information Theory, 30(5):776–780.

Sirén, J. (2009). Compressed suffix arrays for massive data. In Symposium on String Processing and Information Retrieval (SPIRE ’09), pages 63–74. Springer.

Sirén, J., Välimäki, N., Mäkinen, V., and Navarro, G. (2009). Run-length compressed indexes are superior for highly repetitive sequence collections. In Symposium on String Processing and Information Retrieval (SPIRE ’08), pages 164–175.

Stein, W. A. et al. (2012). Sage Mathematics Software (Version 5.3). The Sage Development Team. http://www.sagemath.org.

Stone, J., Stewart, R., and Otis, D. (2002). Stream control transmission protocol (SCTP) checksum change. RFC3309, The Internet Society.

Storer, J. A. and Szymanski, T. G. (1982). Data compression via textual substitution. Journal of the ACM, 29(4):928–951.

Styler, W. (2011). The EnronSent corpus. Technical Report 01-2011, University of Colorado at Boulder, Institute of Cognitive Science. http://verbs.colorado.edu/enronsent/.

Teahan, W. J. and Cleary, J. G. (1997). Models of English text. In Proceedings of the 1997 Data Compression Conference (DCC ’97), pages 12–21. IEEE.

Teodosiu, D., Bjørner, N., Gurevich, Y., Manasse, M., and Porkka, J. (2006). Optimizing file replication over limited bandwidth networks using remote differential compression. Technical Report TR2006-157-1, Microsoft Corporation.

Tichy, W. F. (1984). The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4):309–321.

Tishkoff, S. A. and Kidd, K. K. (2004). Implications of biogeography of human populations for ‘race’ and medicine. Nature genetics, 36:S21–S27.

Trendafilov, D., Memon, N., and Suel, T. (2002). Zdelta: An efficient delta compression tool. Technical Report TR-CIS-2002-02, Polytechnic University, Department of Computer and Information Science, Brooklyn, NY, USA.

Tridgell, A. and Mackerras, P. (1996). The rsync algorithm. Technical Report TR-CS-96-05, Australian National University. Included in the rsync source distribution.

Ukkonen, E. (1985). Algorithms for approximate string matching. Information and Control, 64(1-3):100–118.

Vincenti, W. G. (1990). What Engineers Know and How They Know It. The John Hopkins University Press.

Wagner, R. A. and Fischer, M. J. (1974). The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

Webster, A. F. and Tavares, S. E. (1986). On the design of S-Boxes. In Advances in Cryptology, CRYPTO ’85, pages 523–534. Springer-Verlag.

Wegman, M. N. and Carter, J. L. (1979). New classes and applications of hash functions. In 20th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 175–182. IEEE Computer Society.

Wegman, M. N. and Carter, J. L. (1981). New hash functions and their use in authentication and set equality. Journal of Computer and System Sciences, 22(3):265–279.

Weiner, P. (1973). Linear pattern matching algorithms. In 14th Annual IEEE Symposium on Switching and Automata Theory, pages 1–11. IEEE.

Yiannis, J. and Zobel, J. (2007). Compression techniques for fast external sorting. VLDB Journal, 16(2):269–291.

You, L. (2006). Efficient Archival Data Storage. PhD thesis, Baskin School of Engineering, University of California, Santa Cruz. Technical report UCSC-SSRC-06-04.

You, L. L., Pollack, K. T., Long, D. D. E., and Gopinath, K. (2011). Presidio: A framework for efficient archival data storage. ACM Transactions on Storage, 7:6:1–6:60.

Zhang, X.-M. and Zheng, Y. (1996). Auto-correlations and new bounds on the nonlinearity of boolean functions. In Advances in Cryptology — EUROCRYPT’96, pages 294–306. Springer.

Zheng, Y. and Zhang, X.-M. (2003). Connections among nonlinearity, avalanche and correlation immunity. Theoretical Computer Science, 292(3):697–710.

Zhu, B., Li, K., and Patterson, H. (2008). Avoiding the disk bottleneck in the data domain deduplication file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST ’08), pages 269–282. USENIX Association.

Ziv, J. and Lempel, A. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343.

Ziv, J. and Lempel, A. (1978). Compression of individual sequences via variable- rate coding. IEEE Transactions on Information Theory, 24(5):530–536.

Živković, M. (1994). A table of primitive binary polynomials. Mathematics of Computation, 62(205):385–386.
