String Hashing for Collection-Based Compression
Andrew Gregory Peel
orcid.org/0000-0001-6710-7513

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

December 2015

Department of Computing and Information Systems
Melbourne School of Engineering
The University of Melbourne

Produced on archival quality paper

Abstract

Data collections are traditionally stored as individually compressed files. Where the files have a significant degree of similarity, such as genomes, incremental backup archives, versioned repositories, and web archives, additional compression can be achieved using references to matching data from other files in the collection. We describe compression using long-range or inter-file similarities as collection-based compression (CBC). The principal problem of CBC is the efficient location of matching data. A common technique for multiple string search uses an index of hashes, referred to as fingerprints, of strings sampled from the text, which are then compared with hashes of substrings from the search string. In this thesis, we investigate the suitability of this technique to the problem of CBC.

A CBC system, cobald, was developed which employs a two-step scheme: a preliminary long-range delta encoding step using the fingerprint index, followed by compression of the delta file with a standard compression utility. Tests were performed on data collections from two sources: snapshots of web crawls (54 Gbytes) and multiple versions of a genome (26 Gbytes). Using an index of hashes of fixed-length substrings of 1024 bytes, significantly improved compression was achieved: the compression of the web collection was six times that achieved by gzip and three times that achieved by 7-zip.
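The fingerprinting technique the abstract describes, hashing fixed-length substrings into an index and probing it with hashes of substrings of the search string, can be illustrated with a short sketch. The 1024-byte block size matches the reported experiments, but the polynomial rolling hash, the non-overlapping sampling policy, and all names below are illustrative assumptions, not cobald's actual implementation.

# Sketch of fingerprint-based match detection: hash fixed-length blocks
# sampled from the collection, then slide a rolling hash over new data to
# find candidate matches. All parameter choices here are illustrative.

BLOCK = 1024          # fixed substring length, as in the reported experiments
BASE = 257            # polynomial rolling-hash base (illustrative choice)
MOD = (1 << 61) - 1   # large Mersenne prime modulus

def block_hash(data: bytes) -> int:
    """Polynomial hash (fingerprint) of one fixed-length block."""
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h

def build_index(collection: bytes) -> dict:
    """Fingerprint every non-overlapping BLOCK-sized substring."""
    return {
        block_hash(collection[i:i + BLOCK]): i
        for i in range(0, len(collection) - BLOCK + 1, BLOCK)
    }

def find_matches(index: dict, collection: bytes, text: bytes):
    """Slide a rolling hash over `text`; verify candidates byte-for-byte
    to rule out hash collisions. Yields (text_offset, collection_offset)."""
    if len(text) < BLOCK:
        return
    top = pow(BASE, BLOCK - 1, MOD)        # weight of the outgoing byte
    h = block_hash(text[:BLOCK])
    for i in range(len(text) - BLOCK + 1):
        if i:                              # roll: drop text[i-1], add text[i+BLOCK-1]
            h = ((h - text[i - 1] * top) * BASE + text[i + BLOCK - 1]) % MOD
        j = index.get(h)
        if j is not None and collection[j:j + BLOCK] == text[i:i + BLOCK]:
            yield i, j

if __name__ == "__main__":
    old = bytes(range(256)) * 16             # 4096-byte "collection"
    new = old[:2048] + b"edit" + old[2048:]  # a lightly modified version
    index = build_index(old)
    print(next(find_matches(index, old, new)))  # first verified match: (0, 0)

A delta encoder built on this lookup would extend each verified match in both directions, emit copy references for matched ranges and literals for the remainder, and then hand the delta file to a standard compressor, as in the two-step scheme described above.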