Lossless Data Compression

Christian Steinruecken
King's College

Dissertation submitted in candidature for the degree of Doctor of Philosophy,
University of Cambridge

August 2014

© 2014 Christian Steinruecken, University of Cambridge.
Re-typeset on March 2, 2018, using LaTeX, gnuplot, and custom-written software.

Abstract

This thesis makes several contributions to the field of data compression. Lossless data compression algorithms shorten the description of input objects, such as sequences of text, in a way that allows perfect recovery of the original object. Such algorithms exploit the fact that input objects are not uniformly distributed: by allocating shorter descriptions to more probable objects and longer descriptions to less probable objects, the expected length of the compressed output can be made shorter than the object's original description. Compression algorithms can be designed to match almost any given probability distribution over input objects.

This thesis employs probabilistic modelling, Bayesian inference, and arithmetic coding to derive compression algorithms for a variety of applications, making the underlying probability distributions explicit throughout. A general compression toolbox is described, consisting of practical algorithms for compressing data distributed by various fundamental probability distributions, and mechanisms for combining these algorithms in a principled way.

Building on the compression toolbox, new mathematical theory is introduced for compressing objects with an underlying combinatorial structure, such as permutations, combinations, and multisets. An example application is given that compresses unordered collections of strings, even if the strings in the collection are individually incompressible.

For text compression, a novel unifying construction is developed for a family of context-sensitive compression algorithms. Special cases of this family include the PPM algorithm and the Sequence Memoizer, an unbounded-depth hierarchical Pitman–Yor process model. It is shown how these algorithms are related, what their probabilistic models are, and how they produce fundamentally similar results. The work concludes with experimental results, example applications, and a brief discussion on cost-sensitive compression and adversarial sequences.
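To make the principle above concrete, here is a small worked example; the two-symbol source and its probabilities are hypothetical, chosen purely for illustration. Suppose a source emits symbols $a$ and $b$ with $P(a) = \tfrac{3}{4}$ and $P(b) = \tfrac{1}{4}$. An ideal code spends $\log_2 \frac{1}{p}$ bits on a symbol of probability $p$, giving an expected length of

\[
\mathbb{E}[\ell] \;=\; \tfrac{3}{4}\log_2\tfrac{4}{3} \;+\; \tfrac{1}{4}\log_2 4 \;\approx\; 0.811 \text{ bits per symbol},
\]

whereas assigning both symbols equal-length codewords costs 1 bit per symbol. Arithmetic coding, the workhorse of this thesis, attains this ideal total length to within a few bits for any explicitly specified distribution.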
Declaration

I hereby declare that my dissertation entitled "Lossless Data Compression" is not substantially the same as any that I have submitted for a degree or diploma or other qualification at any other University. I further state that no part of my dissertation has already been or is being concurrently submitted for any such degree or diploma or other qualification.

Except where explicit reference is made to the work of others, this dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration. This dissertation does not exceed sixty thousand words in length.

Date: ...........................

Signed: .........................

Christian Steinruecken
King's College

Contents

Notation

1 Introduction
  1.1 Codes and communication
  1.2 Data compression
  1.3 Outline of this thesis

2 Classical compression algorithms
  2.1 Symbol codes
    2.1.1 Huffman coding
  2.2 Integer codes
    2.2.1 Unary code
    2.2.2 Elias gamma code
    2.2.3 Exponential Golomb codes
    2.2.4 Other integer codes
  2.3 Dictionary coding
    2.3.1 LZW
    2.3.2 History and significance
    2.3.3 Properties
  2.4 Transformations
    2.4.1 Move-to-front encoding
    2.4.2 Block compression
  2.5 Summary

3 Arithmetic coding
  3.1 Introduction
    3.1.1 Information content and information entropy
    3.1.2 The relationship of codes and distributions
    3.1.3 Algorithms for building optimal compression codes
  3.2 How an arithmetic coder operates
    3.2.1 Mapping a sequence of symbols to nested regions
    3.2.2 Mapping a region to a sequence of symbols
    3.2.3 Encoding a region as a sequence of binary digits
  3.3 Concrete implementation of arithmetic coding
    3.3.1 Historical notes
    3.3.2 Other generic compression algorithms
  3.4 Arithmetic coding for various distributions
    3.4.1 Bernoulli code
    3.4.2 Discrete uniform code
    3.4.3 Finite discrete distributions
    3.4.4 Binomial code
    3.4.5 Beta-binomial code
    3.4.6 Multinomial code
    3.4.7 Dirichlet-multinomial code
    3.4.8 Codes for infinite discrete distributions
  3.5 Combining distributions
    3.5.1 Sequential coding
    3.5.2 Exclusion coding
    3.5.3 Mixture models

4 Adaptive compression
  4.1 Learning a distribution
    4.1.1 Introduction
    4.1.2 Motivation
  4.2 Histogram methods
    4.2.1 A simple exchangeable method for adaptive compression
    4.2.2 Dirichlet processes
    4.2.3 Power-law variants
    4.2.4 Non-exchangeable histogram learners
  4.3 Other adaptive models
    4.3.1 A Pólya tree symbol compressor
  4.4 Online versus header-payload compression
    4.4.1 Online compression
    4.4.2 Header-payload compression
    4.4.3 Which method is better?
  4.5 Summary

5 Compressing structured objects
  5.1 Introduction
  5.2 Permutations
    5.2.1 Complete permutations
    5.2.2 Truncated permutations
    5.2.3 Set permutations via elimination sampling
  5.3 Combinations
  5.4 Compositions
    5.4.1 Strictly positive compositions with unbounded components
    5.4.2 Compositions with fixed number of components
    5.4.3 Uniform K-compositions
    5.4.4 Multinomial K-compositions
  5.5 Multisets
    5.5.1 Multisets via repeated draws from a known distribution
    5.5.2 Multisets drawn from a Poisson process
    5.5.3 Multisets from unknown discrete distributions
    5.5.4 Submultisets
    5.5.5 Multisets from a Blackwell–MacQueen urn scheme
  5.6 Ordered partitions
  5.7 Sequences
    5.7.1 Sequences of known distribution, known length
    5.7.2 Sequences of known distribution, uncertain length
    5.7.3 Sequences of uncertain distribution
    5.7.4 Sequences with known ordering
  5.8 Sets
    5.8.1 Uniform sets
    5.8.2 Bernoulli sets
  5.9 Summary

6 Context-sensitive sequence compression
  6.1 Introduction to context models
    6.1.1 Pure bigram and trigram models
    6.1.2 Hierarchical context models
  6.2 General construction
    6.2.1 Engines
    6.2.2 Probabilistic models
    6.2.3 Learning
  6.3 The PPM algorithm
    6.3.1 Basic operation
    6.3.2 Probability estimation
    6.3.3 Learning mechanism
    6.3.4 PPM escape mechanisms
    6.3.5 Other variants of PPM
  6.4 Optimising PPM
    6.4.1 The effects of context depth
    6.4.2 Generalisation of the PPM escape mechanism
    6.4.3 The probabilistic model of PPM, stated concisely
    6.4.4 Why the escape mechanism is convenient
  6.5 Blending
    6.5.1 Optimising the parameters of a blending PPM
    6.5.2 Backing-off versus blending
  6.6 Unbounded depth
    6.6.1 History
    6.6.2 The Sequence Memoizer
    6.6.3 Hierarchical Pitman–Yor processes
    6.6.4 Deplump
    6.6.5 What makes Deplump compress better than PPM?
    6.6.6 Optimising depth-dependent parameters
    6.6.7 Applications of PPM-like algorithms
  6.7 Conclusions

7 Multisets of sequences
  7.1 Introduction
  7.2 Collections of fixed-length binary sequences
    7.2.1 Tree representation for multisets of fixed-length strings
    7.2.2 Fixed-depth multiset tree compression