Bicriteria Data Compression∗
Andrea Farruggia, Paolo Ferragina, Antonio Frangioni, and Rossano Venturini
Dipartimento di Informatica, Università di Pisa, Italy
{farruggi, ferragina, frangio, [email protected]

Abstract

In this paper we address the problem of trading optimally, and in a principled way, the compressed size/decompression time of LZ77 parsings by introducing what we call the Bicriteria LZ77-Parsing problem. The goal is to determine an LZ77 parsing which minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by T. Symmetrically, we can exchange the roles of the two resources and ask to minimize the decompression time provided that the compressed space is bounded by a fixed amount given in advance.

We address this goal in three stages: (i) we introduce the novel Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data compressors have traditionally approached by means of heuristics; (ii) we solve this problem efficiently, up to a negligible additive constant, in O(n log² n) time and optimal O(n) words of working space, by proving and deploying some specific structural properties of a weighted graph derived from the possible LZ77-parsings of the input file; (iii) we execute a preliminary set of experiments which show that our novel compressor is very competitive with all the highly engineered competitors (such as Snappy, lzma, bzip2), hence offering a win-win situation in theory & practice.

1 Introduction

The advent of massive datasets and the consequent design of high-performing distributed storage systems, such as BigTable by Google [7], Cassandra by Facebook [5], and Hadoop by Apache, have reignited the interest of the scientific and engineering community in the design of lossless data compressors which achieve an effective compression ratio and a very efficient decompression speed.

The literature abounds with solutions for this problem, named "compress once, decompress many times", that can be cast into two main families: the compressors based on the Burrows-Wheeler Transform [6], and the ones based on the Lempel-Ziv parsing scheme [35, 36]. Compressors are known in both families that require time linear in the input size, both for compressing and decompressing the data, and take compressed space which can be bounded in terms of the k-th order empirical entropy of the input [25, 35].

But the compressors running behind those large-scale storage systems are not derived from those scientific results. The reason lies in the fact that theoretically efficient compressors are optimal in the RAM model, but they elicit many cache/IO misses during the decompression step. This poor behavior is most prominent in the BWT-based compressors, and it is not negligible in the LZ-based approaches. This motivated software engineers to devise variants of Lempel-Ziv's original proposal (e.g., Snappy, LZ4) with the injection of several software tricks which have beneficial effects on memory-access locality. These compressors expanded further the known jungle of space/time trade-offs¹, thus posing software engineers in front of a choice: either achieve effective/optimal compression ratios, possibly sacrificing the decompression speed (as occurs in the theory-based results [15-17]); or try to balance them by adopting a plethora of programming tricks which trade compressed space for decompression time (such as in Snappy, LZ4, or the recent LZ77-end [26]), thus waiving mathematical guarantees on their final performance.

In the light of this dichotomy, it would be natural to ask for an algorithm which guarantees effective compression ratios and efficient decompression speed in hierarchical memories. In this paper, however, we aim for a more ambitious goal, which is further motivated by the following two simple, yet challenging, questions:

• Who cares whether the compressed file is slightly longer than the one achievable with BWT-based compressors, provided that we can improve significantly BWT's decompression time? This is a natural question arising in the context of distributed storage systems, and the one leading the design of Snappy and LZ4.

• Who cares whether the compressed file can be decompressed slightly slower than with Snappy or LZ4, provided that we can improve significantly their compressed space? This is a natural question in a context where space occupancy is a major concern, e.g., tablets and smartphones, and the one for which tools like Google's Zopfli have been recently introduced.

If we are able to offer mathematical guarantees on the meaning of "slightly longer/slower", then these two questions become pertinent and challenging in theory too. So in this paper we introduce the following problem, which we call bicriteria data compression: given an input file S and an upper bound T on its decompression time, the goal is to determine a compressed version of S which minimizes the compressed space provided that it can be decompressed in time T. Symmetrically, we could exchange the roles of the time/space resources, and thus ask for the compressed version of S which minimizes the decompression time provided that the compressed-space occupancy is within a fixed bound.

In order to attack this problem in a principled way we need to fix two ingredients: the class of compressed versions of S over which this bicriteria optimization will take place, and the computational model measuring the resources to be optimized. For the former ingredient we take the class of LZ77-based compressors, because they are dominant both in the theoretical (e.g., [8, 9, 14, 17, 22]) and in the practical setting (e.g., gzip, 7zip, Snappy, LZ4, [24, 26, 34]). In Section 2, we will show that the bicriteria data-compression problem formulated over LZ77-based compressors is well founded, because there exists an infinite class of strings which can be parsed in many different ways, thus offering a wide spectrum of space-time trade-offs in which small variations in the usage of one resource (e.g., time) may induce arbitrarily large variations in the usage of the other resource (e.g., space).

For the latter ingredient, we take inspiration from several models of computation which abstract multi-level memory hierarchies and the fetching of contiguous memory words [1, 2, 4, 28, 33]. In these models the cost of fetching a word at address x is f(x) time, where f(x) is a non-decreasing, polynomially bounded function (e.g., f(x) = ⌈log x⌉ and f(x) = x^{O(1)}), and a sequence of ℓ consecutive words can be copied from memory location x to memory location y (with x ≥ y) in time f(x) + ℓ. We remark that, in our scenario, this model is more appropriate than the frequently adopted two-level memory model [3], because we care to differentiate between contiguous and random accesses to memory-disk blocks, a feature heavily exploited in the design of modern compressors [10].

Given these two ingredients, we devise a formal framework that allows us to analyze any LZ77-parsing scheme in terms of both the space occupancy (in bits) of the compressed file and the time cost of its decompression, taking into account the underlying memory hierarchy (see Section 3). More in detail, we will extend the model proposed in [17], based on a special weighted DAG G consisting of n = |S| nodes, one per character of S, and m = O(n²) edges, one per possible phrase in the LZ77-parsing of S. In our new graph each edge carries two weights: a time weight, which accounts for the time to decompress a phrase (derived according to the hierarchical-memory model mentioned above), and a space cost, which accounts for the number of bits needed to store the LZ77-phrase associated to that edge (derived according to the integer encoder adopted in the compressor). Every path π from node 1 to node n in G (hereafter named "1n-path") corresponds to an LZ77-parsing of the input file S whose compressed-space occupancy is given by the sum of the space costs of π's edges (say s(π)) and whose decompression time is given by the sum of the time weights of π's edges (say t(π)).

As a result of this correspondence, we will be able to rephrase our bicriteria LZ77-parsing problem as the well-known weight-constrained shortest path problem (WCSPP, see [30] and references therein) over the weighted DAG G, in which the goal is to search for the 1n-path π whose decompression time is t(π) ≤ T and whose compressed-space occupancy s(π) is minimized. Due to its vast range of applications, WCSPP has received a great deal of attention from the optimization community. It is an NP-hard problem, even on a DAG with positive weights and costs [11, 18], and it can be solved in pseudo-polynomial O(mT) time via dynamic programming [27]. Our version of the WCSPP problem has m and T bounded by O(n log n) (see Section 3), so it can be solved in polynomial time, namely O(mT) = O(n² log² n) time and O(n² log n) space. Unfortunately these bounds are unacceptable in practice, because n² ≈ 2⁶⁴ just for one GB of data to be compressed.

The second contribution of this paper is to prove some structural properties of our weighted DAG which allow us to design an algorithm that approximately solves our version of WCSPP in O(n log² n) time.

∗ This work was partially supported by MIUR of Italy under projects PRIN ARS Technomedia 2012 and the eCloud EU Project.
¹ See e.g., http://mattmahoney.net/dc/.
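To make the reduction concrete, here is a minimal sketch of the pseudo-polynomial O(mT) dynamic program for WCSPP on a DAG, run on a toy instance. The graph, its weights, and the function name `wcspp_dp` are our own illustrative assumptions; this is not the paper's algorithm, which avoids the O(mT) table altogether.

```python
# Sketch of the O(mT) dynamic program for WCSPP on a DAG (toy code, not the
# paper's method). Nodes 1..n stand for the characters of S; an edge
# (v, space, time) leaving u models an LZ77 phrase covering S[u..v-1],
# with its bit cost and its decompression-time weight.
INF = float("inf")

def wcspp_dp(n, adj, T):
    """adj: {u: [(v, space, time), ...]} with u < v, so node order is topological.
    Returns the minimum s(pi) over 1n-paths pi with t(pi) <= T."""
    # dp[v][t] = min space of a path from node 1 to v with total time exactly t
    dp = [[INF] * (T + 1) for _ in range(n + 1)]
    dp[1][0] = 0
    for u in range(1, n + 1):                      # topological order
        for v, space, time in adj.get(u, []):
            for t in range(T + 1 - time):          # stay within the budget T
                if dp[u][t] + space < dp[v][t + time]:
                    dp[v][t + time] = dp[u][t] + space
    return min(dp[n])                              # INF if no path fits in T

# Toy instance on 4 nodes: one long phrase (few bits, slow copy) versus three
# short phrases (fast, but larger). The time budget decides which path wins.
adj = {
    1: [(4, 10, 5), (2, 6, 1)],
    2: [(3, 6, 1)],
    3: [(4, 6, 1)],
}
print(wcspp_dp(4, adj, T=3))   # -> 18  (long phrase needs time 5 > 3)
print(wcspp_dp(4, adj, T=5))   # -> 10  (budget admits the space-optimal path)
```

Even this toy run shows the trade-off the paper formalizes: tightening T from 5 to 3 forces the parsing onto the faster path and the compressed size grows from 10 to 18 bits. The O(mT) table is exactly what becomes untenable at scale, motivating the structural shortcuts proved in the paper.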