The Context Tree Weighting Method: Basic Properties

Frans M.J. Willems, Yuri M. Shtarkov∗ and Tjalling J. Tjalkens

Eindhoven University of Technology, Electrical Engineering Department, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

Abstract: We describe a new sequential procedure for universal source coding for FSMX sources. Using a context tree, this method recursively weights the coding distributions corresponding to all FSMX models, and realizes an efficient coding distribution for unknown models. Computational and storage complexity of the proposed procedure are linear in the sequence length. We give upper bounds on the cumulative redundancies, but also on the codeword length for individual sequences. With a slight modification, the procedure can be used to determine the FSMX model that gives the largest contribution to the coding probability, i.e. the minimum description length model. This model estimator is shown to converge with probability one to the actual model, for ergodic sources that are not degenerate. A suboptimal method is presented that finds a model with a description length that is not more than a prescribed number of bits larger than the minimum description length. This method makes it possible to choose a desired trade-off between performance and complexity. In the last part of the manuscript, lower bounds on the maximal individual redundancies that can be achieved by arbitrary coding procedures are found. It follows from these bounds that the context tree weighting procedure is almost optimal.

1 Introduction, old and new concepts

Sequential universal source coding procedures for FSMX sources often make use of a context tree which contains for each string (context) the number of zeros and the number of ones that have followed this context, in the source sequence seen so far. The standard approach (see e.g. Rissanen and Langdon [19], Rissanen [16], [17], and Weinberger, Lempel and Ziv [23]) is that, given all past source symbols, one uses this context tree to estimate the actual ‘state’ of the FSMX source. Subsequently this state (context) is used to estimate the distribution that generates the next source symbol. This estimated distribution can be used by procedures (see e.g. Rissanen and Langdon [19]) to encode (and

∗On leave from the Institute for Problems of Information Transmission, Ermolovoystr. 19, 101447, Moscow, GSP-4.

decode) the next source symbol efficiently, i.e. with low complexity and with negligible redundancy. It was shown by Rissanen [16] that the probability of choosing the correct context, i.e. the actual state of the FSMX source, tends to 1 if the length T of the observed sequence tends to infinity. More recently Weinberger, Lempel and Ziv [23] designed a procedure that achieves optimal exponential decay of the error probability in estimating the current state. These authors were also able to demonstrate that their coding procedure achieves asymptotically the lower bound on the average redundancy, as stated by Rissanen ([17], thm. 1). An unpleasant fact about the standard approach is that one has to specify parameters (α and β in Rissanen's procedure [16] or K for the Weinberger, Lempel and Ziv [23] method) that do not affect the asymptotic behavior of the procedure, but may have a big influence on the performance for practical sequence lengths. These artificial parameters are necessary to regulate the state estimation characteristics. This gave the authors the idea that the state estimation concept may not be as natural as we believe. A better starting principle would be to just find a good coding distribution. This, more or less trivial, guideline immediately suggests the application of model weighting techniques. In what follows we will develop a tree-recursive model-weighting procedure, which is very easy to analyze, and has a very good performance, both in realized compression and in complexity.

2 Binary FSMX sources

2.1 Strings

Strings are concatenations of binary symbols, hence $s = b_{1-l} b_{2-l} \cdots b_0$ with $b_{-i} \in \{0,1\}$ for $i = 0, 1, \cdots, l-1$. Note that we index the symbols in the string from right to left, starting with 0 and going negative. For the length of a string $s$ we write $l(s)$. A semi-infinite string $s = \cdots b_{-1} b_0$ has length $l(s) = \infty$. If we have two strings $s' = b'_{1-l'} b'_{2-l'} \cdots b'_0$ and $s = b_{1-l} b_{2-l} \cdots b_0$, then $s's = b'_{1-l'} b'_{2-l'} \cdots b'_0 b_{1-l} b_{2-l} \cdots b_0$ is the concatenation of both. If $S$ is a set of strings, then $S \times b \triangleq \{sb : s \in S\}$, for $b \in \{0,1\}$. We say that a string $s = b_{1-l} b_{2-l} \cdots b_0$ is a postfix of string $s' = b'_{1-l'} b'_{2-l'} \cdots b'_0$ if $l \le l'$ and $b_{-i} = b'_{-i}$ for $i = 0, \cdots, l-1$. The empty string $\lambda$ has length $l(\lambda) = 0$. It is a postfix of all other strings.

2.2 FSMX source definition

A binary FSMX information source generates a sequence $x_{-\infty}^{\infty}$ of digits assuming values in the alphabet $\{0,1\}$. We denote by $x_a^b$ the sequence $x_a x_{a+1} \cdots x_b$, and allow $a$ and $b$ to be infinitely large. The statistical behavior of an FSMX source can be described by means of a postfix set $S$. This postfix set is a collection of binary strings $s(k)$, with $k = 1, 2, \cdots, |S|$. It should be proper and complete. Properness of the postfix set implies that no string in $S$

is a postfix of any other string in $S$. Completeness guarantees that each semi-infinite string $\cdots b_{-2} b_{-1} b_0$ has a postfix that belongs to $S$. The properness and completeness of the postfix set make it possible to define the postfix function $\sigma(\cdot)$. This function maps semi-infinite sequences onto their unique postfix $s$ in $S$. To each postfix $s$ in $S$ there corresponds a parameter $\theta_s$. Together, all parameters form the parameter vector $\Theta_S \triangleq (\theta_{s(1)}, \theta_{s(2)}, \cdots, \theta_{s(|S|)})$. Each parameter assumes a value in $[0,1]$ and specifies a distribution over $\{0,1\}$. If the FSMX source has emitted the semi-infinite sequence $x_{-\infty}^{t-1} \triangleq \cdots x_{t-2} x_{t-1}$ up to now, the postfix function tells us that the parameter for generating the next binary digit of the source is $\theta_s$, where $s = \sigma(x_{-\infty}^{t-1})$, hence

Definition 1 : The actual next-symbol probabilities for an FSMX source with postfix set $S$ and parameter vector $\Theta_S$ are

$$P_a(X_t = 1 | x_{-\infty}^{t-1}, S, \Theta_S) \triangleq 1 - P_a(X_t = 0 | x_{-\infty}^{t-1}, S, \Theta_S) = \theta_{\sigma(x_{-\infty}^{t-1})}, \quad \text{for all } t, \qquad (1)$$

where $x_{-\infty}^{t-1}$ is the semi-infinite sequence of symbols preceding $x_t$. △

All FSMX sources with the same postfix set are said to have the same model. Model and postfix set are equivalent. The model $S$ together with the parameter vector $\Theta_S$ specifies the FSMX source, sometimes denoted by $(S, \Theta_S)$. We will see later that the number of parameters of the FSMX source determines the (asymptotic) redundancy of our coding procedure. The FSMX model class is the set of all FSMX models. It is possible to specify FSMX models in this model class with a natural code. This code uses $2|S| - 1$ binary digits to describe the model $S$. The method that achieves this is simple. We encode the postfix set $S$ recursively. The code of $S$ is the code of the empty string $\lambda$. The code of a string $s$ is 0 if this string is in $S$. If $s$ is not in $S$ then its code is a 1 followed by the codes of the strings $0s$ and $1s$. Observe now that there are $2|S| - 1$ nodes that contribute one bit each to the specification code. FSMX sources were first described by Rissanen [17]. These sources form a subclass of the set of sources defined by finite state machines ('FSM's) and an extension ('X') of the class of finite order Markov sources.

Example 1 : Consider a source with postfix set $S = \{s(1) = 00, s(2) = 10, s(3) = 1\}$ and parameters $\theta_{00} = 0.5$, $\theta_{10} = 0.3$, and $\theta_1 = 0.1$ (see figure 1). The (conditional) probability of the source generating the sequence 0110100 given all past symbols forming the sequence $\cdots 010$ can be calculated as follows:

$$P_a(0110100 | \cdots 010) = P_a(0|\cdots 010)\, P_a(1|\cdots 0100)\, P_a(1|\cdots 01001)\, P_a(0|\cdots 010011)$$
$$\cdot\, P_a(1|\cdots 0100110)\, P_a(0|\cdots 01001101)\, P_a(0|\cdots 010011010)$$
$$= (1-\theta_{10})\,\theta_{00}\,\theta_1\,(1-\theta_1)\,\theta_{10}\,(1-\theta_1)\,(1-\theta_{10}) = 0.7 \cdot 0.5 \cdot 0.1 \cdot 0.9 \cdot 0.3 \cdot 0.9 \cdot 0.7 = 0.0059535. \qquad (2)$$
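As a quick check of (2), the product can be recomputed symbol by symbol. The sketch below is our own illustration (the helper name `actual_prob` and the string-keyed parameter table are not from the text), assuming contexts are written with the most recent symbol last, as in figure 1.

```python
# Parameters of example 1; theta[s] = P(next symbol = 1 | context s).
theta = {"00": 0.5, "10": 0.3, "1": 0.1}

def actual_prob(symbols, past):
    """P_a of definition 1 for the source of example 1, multiplied out per symbol."""
    history, P = list(past), 1.0
    for x in symbols:
        # sigma: the unique postfix of the past that belongs to the postfix set
        s = next(s for s in theta if "".join(map(str, history[-len(s):])) == s)
        P *= theta[s] if x == 1 else 1 - theta[s]
        history.append(x)
    return P

print(round(actual_prob([0, 1, 1, 0, 1, 0, 0], [0, 1, 0]), 7))  # -> 0.0059535
```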

The model (postfix set) S can be specified by

code(S) = code(λ) = 1 code(0) code(1)
        = 1 1 code(00) code(10) 0
        = 1 1 0 0 0,    (3)

Figure 1: Model (postfix set) and parameters ($\theta_{00} = 0.5$, $\theta_{10} = 0.3$, $\theta_1 = 0.1$).

whose length is 2|S| − 1 = 5 binary digits, as we expected.
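The recursive specification of this natural code can be sketched in a few lines. This is our own illustration (the helper name is hypothetical), assuming the given postfix set is proper and complete so the recursion terminates.

```python
def specify_model(postfix_set, s=""):
    """Natural model code: '0' if s is in the set, otherwise '1' followed by
    the codes of 0s and 1s (strings are written left to right, e.g. '10')."""
    if s in postfix_set:
        return "0"
    return "1" + specify_model(postfix_set, "0" + s) + specify_model(postfix_set, "1" + s)

# Example 1: S = {00, 10, 1} gives 11000, i.e. 2|S| - 1 = 5 binary digits.
print(specify_model({"00", "10", "1"}))  # -> 11000
```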

3 Codes and redundancy

Instead of the source sequence $x_1^T \triangleq x_1 x_2 \cdots x_T$ itself, the encoder sends a codeword $c^L \triangleq c_1 c_2 \cdots c_L$ to the decoder. The decoder must be able to reconstruct the source sequence $x_1^T$ from this codeword (noiseless source coding). We assume that both the encoder and the decoder have access to the past source symbols $x_{-\infty}^0 \triangleq \cdots x_{-1} x_0$. The codeword that is formed by the encoder depends not only on the source sequence $x_1^T$ but also on this sequence of past symbols $x_{-\infty}^0$. To denote this functional relationship we write $c^L(x_1^T|x_{-\infty}^0)$. The length of the codeword is denoted as $L(x_1^T|x_{-\infty}^0)$. The set of codewords that can be produced for a fixed sequence of past symbols $x_{-\infty}^0$ is assumed to be proper, i.e. this set of codewords satisfies the prefix condition. A codeword set satisfies the prefix condition if no codeword is the prefix of any other codeword in the set. All sequences $c_1 c_2 \cdots c_l$, for some $l = 1, 2, \cdots, L$, are a prefix of $c_1 c_2 \cdots c_L$. The codeword lengths determine the individual redundancies.

Definition 2 : The individual redundancy $\rho(x_1^T|x_{-\infty}^0, S, \Theta_S)$ of a sequence $x_1^T$ given a sequence $x_{-\infty}^0$ of past symbols, with respect to a source with model $S$ and parameter vector $\Theta_S$,¹ is defined as

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) \triangleq L(x_1^T|x_{-\infty}^0) - \log \frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}, \qquad (4)$$

where $L(x_1^T|x_{-\infty}^0)$ is the length of the codeword that corresponds to $x_1^T$. We consider only sequences $x_1^T$ with positive probability $P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)$. △

¹The base of the log(·) is assumed to be 2 throughout this manuscript.

The value $\log(1/P_a(x_1^T|x_{-\infty}^0, S, \Theta_S))$ can be regarded as the information contained in $x_1^T$ given the past $x_{-\infty}^0$. Note that we do not divide the redundancies by the block length $T$; we consider only cumulative redundancies. The objective in source coding is to design methods that achieve small (individual) redundancies. Since it is also very important that these methods have low (storage and computational) complexity, it would be more appropriate to say that the emphasis in source coding is on finding a desirable trade-off between redundancy and complexity. Therefore we analyze and compare methods both from the point of view of redundancy and complexity.

Example 2 : Consider the FSMX source of the previous example. For each past sequence $x_{-\infty}^0$ we could use Huffman's construction ([9], or [4]) to find the set of codewords that achieves the smallest possible expected redundancy $\sum_{x_1^T} P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)\, \rho(x_1^T|x_{-\infty}^0, S, \Theta_S)$. For past sequence $x_{-\infty}^0 = \cdots 010$ we listed in the table below the probabilities, codeword lengths and individual redundancies for all sequences $x_1^T$, for $T = 3$.

x_1^T | P_a(x_1^T|x_{-∞}^0, S, Θ_S) | log(1/P_a) | c^L(x_1^T|x_{-∞}^0) | L | ρ(x_1^T|x_{-∞}^0, S, Θ_S)
000   | 0.175                       | 2.515      | 11                   | 2 | -0.515
001   | 0.175                       | 2.515      | 000                  | 3 |  0.485
010   | 0.315                       | 1.667      | 01                   | 2 |  0.333
011   | 0.035                       | 4.837      | 00110                | 5 |  0.163
100   | 0.189                       | 2.404      | 10                   | 2 | -0.404
101   | 0.081                       | 3.626      | 0010                 | 4 |  0.374
110   | 0.027                       | 5.211      | 001110               | 6 |  0.789
111   | 0.003                       | 8.381      | 001111               | 6 | -2.381

The expected redundancy computed from this table turns out to be 0.074 bits.
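The 0.074 bits can be recomputed directly from the table; the snippet below is a small check of our own (not part of the original text).

```python
from math import log2

# (probability, codeword length) pairs taken from the table above
table = [(0.175, 2), (0.175, 3), (0.315, 2), (0.035, 5),
         (0.189, 2), (0.081, 4), (0.027, 6), (0.003, 6)]

# expected redundancy = sum over sequences of Pa * (L - log(1/Pa)) = E[L] - H
expected_redundancy = sum(p * (L - log2(1 / p)) for p, L in table)
print(round(expected_redundancy, 3))  # -> approximately 0.074 bits
```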

4 Arithmetic coding

4.1 Introduction to the Elias algorithm

An arithmetic encoder computes the codeword that corresponds to the actual source sequence. The corresponding decoder reconstructs the actual source sequence from this codeword, again performing simple computations only. This is different from encoders and decoders that use tables, which need to be constructed first and stored in memory. Arithmetic coding does not require construction and storage of the complete code before the encoding and decoding process starts. On the other hand we see that arithmetic coding requires some computational effort during encoding and decoding. Using arithmetic codes it is possible to process source sequences with a large length. Large values of the length $T$ are often needed to reduce the redundancy per source symbol. All arithmetic codes are based on the Elias algorithm, which was unpublished but described by Abramson [1] and Jelinek [10].

The first idea behind the Elias algorithm is that to each source sequence $x_1^T$ there corresponds a subinterval of $[0,1)$. This principle can be traced back to Shannon [20]. The

second idea, due to Elias, is to order the sequences $x_1^t$ of length $t$ lexicographically, for $t = 1, \cdots, T$. For two sequences $x_1^t$ and $\tilde{x}_1^t$ we have that $x_1^t < \tilde{x}_1^t$ if and only if there exists a $\tau \in \{1, 2, \cdots, t\}$ such that $x_i = \tilde{x}_i$ for $i = 1, 2, \cdots, \tau-1$ and $x_\tau < \tilde{x}_\tau$. This lexicographical ordering makes it possible to compute the subinterval corresponding to a source sequence sequentially (recursively).

4.2 Representation of sequences by intervals

Suppose distribution $P_c(x_1^t)$, $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, is what we shall call the coding distribution, known to both encoder and decoder. The empty sequence ($x_1^0$) will be denoted by $\phi$. We assume (require) that

$$P_c(\phi) = 1,$$
$$P_c(x_1^{t-1}) = P_c(x_1^{t-1}, X_t = 0) + P_c(x_1^{t-1}, X_t = 1), \quad \text{for all } x_1^{t-1} \in \{0,1\}^{t-1},\; t = 1, \cdots, T,$$
$$\text{and } P_c(x_1^T) > 0 \text{ for all possible } x_1^T \in \{0,1\}^T. \qquad (5)$$

Possible sequences are sequences that can actually occur, i.e. sequences $x_1^T$ with $P_a(x_1^T) > 0$. To each source sequence there now corresponds a subinterval of $[0,1)$.

Definition 3 : The interval $I(x_1^t)$ corresponding to $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, is defined as
$$I(x_1^t) \triangleq \left[ B(x_1^t), B(x_1^t) + P_c(x_1^t) \right) \qquad (6)$$
where $B(x_1^t) \triangleq \sum_{\tilde{x}_1^t < x_1^t} P_c(\tilde{x}_1^t)$. △

Note that for $t = 0$ we have that $P_c(\phi) = 1$ and $B(\phi) = 0$ (the only sequence of length 0 is $\phi$ itself), and consequently $I(\phi) = [0,1)$. Observe that for any fixed value of $t$, $t = 0, 1, \cdots, T$, all intervals $I(x_1^t)$ are disjoint, and their union is $[0,1)$. Each interval has a length equal to the corresponding coding probability. Just like all source sequences, each codeword $c^L = c_1 c_2 \cdots c_L$ can be associated with a subinterval of $[0,1)$.

Definition 4 : The interval $J(c^L)$ corresponding to the codeword $c^L$ is defined as
$$J(c^L) \triangleq \left[ F(c^L), F(c^L) + 2^{-L} \right), \qquad (7)$$

with $F(c^L) \triangleq \sum_{l=1}^{L} c_l 2^{-l}$. △

To understand this, note that $c^L$ can be considered as a binary fraction $F(c^L)$. Since $c^L$ is followed by other codewords, the decoder receives a stream of code digits from which only the first $L$ digits correspond to $c^L$. The decoder can determine the value that is represented by the binary fraction formed by the total stream $c_1 c_2 \cdots c_L c_{L+1} \cdots$, i.e.
$$F_\infty \triangleq \sum_{l=1}^{\infty} c_l 2^{-l}, \qquad (8)$$
where it should be noted that the length of the total stream is not necessarily infinite. Since $F(c^L) \le F_\infty < F(c^L) + 2^{-L}$ we may say that the interval $J(c^L)$ corresponds to the codeword $c^L$.

4.3 Compression

To compress a sequence $x_1^T$, we search for a (short) codeword $c^L(x_1^T)$ whose code interval $J(c^L)$ is contained in the sequence interval $I(x_1^T)$. This is the main principle in arithmetic coding.

Definition 5 : The codeword $c^L(x_1^T)$ for source sequence $x_1^T$ consists of $L(x_1^T) \triangleq \lceil \log(1/P_c(x_1^T)) \rceil + 1$ binary digits such that

$$F(c^L(x_1^T)) \triangleq \lceil B(x_1^T) \cdot 2^{L(x_1^T)} \rceil \cdot 2^{-L(x_1^T)}, \qquad (9)$$

where $\lceil a \rceil$ is the smallest integer $\ge a$. We consider only sequences $x_1^T$ with $P_c(x_1^T) > 0$. △

Since
$$F(c^L(x_1^T)) \ge B(x_1^T) \qquad (10)$$
and

$$F(c^L(x_1^T)) + 2^{-L(x_1^T)} < B(x_1^T) + 2^{-L(x_1^T)} + 2^{-L(x_1^T)} \le B(x_1^T) + P_c(x_1^T), \qquad (11)$$

we may conclude that $J(c^L(x_1^T)) \subseteq I(x_1^T)$, and therefore $F_\infty \in I(x_1^T)$. Since all intervals $I(x_1^T)$ are disjoint, the decoder can reconstruct the source sequence $x_1^T$ from $F_\infty$. Note that after this reconstruction, the decoder can compute $c^L(x_1^T)$, just like the encoder, and find the location of the first digit of the next codeword. Note also that, since all code intervals are disjoint, no codeword is the prefix of any other codeword. Thus the code satisfies the prefix condition. From (9) we obtain the following

Theorem 1 : The arithmetic code we have just described achieves codeword lengths $L(x_1^T)$ that satisfy
$$L(x_1^T) < \log\frac{1}{P_c(x_1^T)} + 2, \qquad (12)$$
for all possible $x_1^T \in \{0,1\}^T$.

The difference between the codeword length $L(x_1^T)$ and $\log(1/P_c(x_1^T))$ is always less than 2 bits. We say that the individual coding redundancy is less than 2 bits.
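The codeword construction of definition 5 is a two-line computation once $B(x_1^T)$ and $P_c(x_1^T)$ are known. The sketch below is our own (the function name is hypothetical); it uses exact rationals to sidestep the accuracy issues mentioned at the end of this section, and the numbers are those of example 3 below.

```python
from fractions import Fraction
from math import ceil, log2

def elias_codeword(B, Pc):
    """Definition 5: L = ceil(log(1/Pc)) + 1 digits with
    F(c^L) = ceil(B * 2^L) * 2^(-L); returns the digits c_1 ... c_L."""
    L = ceil(log2(1 / Pc)) + 1
    F = ceil(B * 2 ** L)                   # integer numerator of F(c^L) = F / 2^L
    return [(F >> (L - l)) & 1 for l in range(1, L + 1)]

# Example 3: B(011) = 0.665 and Pc(011) = 0.035 give the codeword 101011.
print(elias_codeword(Fraction(665, 1000), Fraction(35, 1000)))  # -> [1, 0, 1, 0, 1, 1]
```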

4.4 Sequential computation

The lexicographical ordering over the source sequences makes it possible to compute the interval $I(x_1^T)$ sequentially. To do so, we transform the starting interval $I(\phi) = [0,1)$ into $I(x_1)$, $I(x_1 x_2)$, $\cdots$, and $I(x_1 x_2 \cdots x_T)$ respectively. The consequence of the lexicographical ordering over the source sequences is that

$$B(x_1^t) = \sum_{\tilde{x}_1^t < x_1^t} P_c(\tilde{x}_1^t) = \sum_{\tilde{x}_1^{t-1} < x_1^{t-1}} P_c(\tilde{x}_1^{t-1}) + \sum_{\tilde{x}_t < x_t} P_c(x_1^{t-1}, \tilde{x}_t) = B(x_1^{t-1}) + x_t \cdot P_c(x_1^{t-1}, X_t = 0). \qquad (13)$$

In other words $B(x_1^t)$ can be computed from $B(x_1^{t-1})$ and $P_c(x_1^{t-1}, X_t = 0)$. Therefore the encoder and the decoder can easily find $I(x_1^t)$ after having determined $I(x_1^{t-1})$, if it is 'easy' to determine the probabilities $P_c(x_1^{t-1}, X_t = 0)$ and $P_c(x_1^{t-1}, X_t = 1)$ after having processed $x_1 x_2 \cdots x_{t-1}$. This is often the case, also in the context tree weighting method as we shall see later. Observe that when the symbol $x_t$ is being processed, the source interval $I(x_1^{t-1}) = [B(x_1^{t-1}), B(x_1^{t-1}) + P_c(x_1^{t-1}))$ is subdivided into two subintervals

$$I(x_1^{t-1}, X_t = 0) = \left[ B(x_1^{t-1}),\; B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0) \right) \text{ and}$$
$$I(x_1^{t-1}, X_t = 1) = \left[ B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0),\; B(x_1^{t-1}) + P_c(x_1^{t-1}) \right). \qquad (14)$$

The encoder proceeds with one of these subintervals depending on the symbol $x_t$, therefore $I(x_1^t) \subseteq I(x_1^{t-1})$. This implies that

$$I(x_1^T) \subseteq I(x_1^{T-1}) \subseteq \cdots \subseteq I(\phi). \qquad (15)$$

The decoder determines from $F_\infty$ the source symbols $x_1, x_2, \cdots, x_T$ respectively by comparing $F_\infty$ with thresholds $D(x_1^{t-1})$.

Definition 6 : The thresholds $D(x_1^{t-1})$ are defined as

$$D(x_1^{t-1}) \triangleq B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0), \qquad (16)$$
for $t = 1, 2, \cdots, T$. △

Observe that threshold $D(x_1^{t-1})$ splits up the interval $I(x_1^{t-1})$ into two parts (see (14)). It is the upper boundary point of $I(x_1^{t-1}, X_t = 0)$ but also the lower boundary point of $I(x_1^{t-1}, X_t = 1)$. Since always $F_\infty \in I(x_1^T) \subseteq I(x_1^t)$, we have for $D(x_1^{t-1})$ that

$$F_\infty < B(x_1^t) + P_c(x_1^t) = B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0) = D(x_1^{t-1}) \quad \text{if } x_t = 0, \qquad (17)$$
and

$$F_\infty \ge B(x_1^t) = B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0) = D(x_1^{t-1}) \quad \text{if } x_t = 1. \qquad (18)$$

Therefore the decoder can easily find $x_t$ by comparing $F_\infty$ with the threshold $D(x_1^{t-1})$; in other words it can operate sequentially.
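A minimal sketch of this sequential behaviour (our own naming; `cond_prob(prefix, x)` is a placeholder for whatever coding distribution supplies $P_c(x_1^{t-1}, X_t = x)/P_c(x_1^{t-1})$): the encoder narrows the interval as in (14), and the decoder compares $F_\infty$ with the thresholds of definition 6.

```python
from fractions import Fraction

def encode_interval(symbols, cond_prob):
    """Return (B, P) of the final interval I(x_1^T), following (13)-(14)."""
    B, P = Fraction(0), Fraction(1)
    for t, x in enumerate(symbols):
        p0 = P * cond_prob(symbols[:t], 0)   # Pc(x_1^{t-1}, X_t = 0)
        if x == 0:
            P = p0                           # keep the lower subinterval
        else:
            B, P = B + p0, P - p0            # keep the upper subinterval
    return B, P

def decode_symbols(F_inf, T, cond_prob):
    """Recover x_1^T from F_inf by comparing with the thresholds D(x_1^{t-1})."""
    B, P, out = Fraction(0), Fraction(1), []
    for _ in range(T):
        p0 = P * cond_prob(out, 0)
        D = B + p0                           # threshold of definition 6
        if F_inf < D:
            out.append(0)
            P = p0
        else:
            out.append(1)
            B, P = D, P - p0
    return out
```

Plugging in the conditional probabilities of the source of figure 1, given the past $\cdots 010$, reproduces $B(011) = 0.665$ and $P_c(011) = 0.035$ of example 3 below.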

Figure 2: Encoding and decoding of 011 given past symbols $\cdots 010$ (interval boundaries 0, 0.35, 0.665, 0.7, 1; codeword interval [43/64, 44/64)).

We conclude this section with the observation that the Elias algorithm combines an acceptable coding redundancy with a desirable sequential implementation. The number of operations is linear in the source sequence length $T$. It is crucial however that the encoder and decoder have access to, or can easily determine, the probabilities $P_c(x_1^{t-1}, X_t = 0)$ and $P_c(x_1^{t-1}, X_t = 1)$ after having processed $x_1 x_2 \cdots x_{t-1}$. If we accept a loss of at most 2 bits coding redundancy, we are now left with the problem of finding good, sequentially available, coding distributions $P_c(x_1^T)$.

Example 3 : (See figure 2.) Assuming that the source is as in figure 1, we can use the actual source distribution $P_a(x_1^T|\cdots 010)$ as coding distribution, i.e. $P_c(x_1^T) = P_a(x_1^T|\cdots 010)$ for all $x_1^T$. In that case an arithmetic encoder computes the codeword for $x_1^T = 011$ (hence $T = 3$) as follows. From $B(\phi) = 0$ the encoder first computes $B(0) = B(\phi) = 0$ and $P_c(0) = 0.7$. Then $B(01) = B(0) + P_c(00) = 0 + P_c(0) \cdot 0.5 = 0.7 \cdot 0.5 = 0.35$ and $P_c(01) = P_c(0) \cdot 0.5 = 0.7 \cdot 0.5 = 0.35$. Finally $B(011) = B(01) + P_c(010) = 0.35 + P_c(01) \cdot 0.9 = 0.35 + 0.35 \cdot 0.9 = 0.665$ and $P_c(011) = P_c(01) \cdot 0.1 = 0.35 \cdot 0.1 = 0.035$. Now the codeword can be determined. The length $L = \lceil \log(1/0.035) \rceil + 1 = 6$ and therefore $c^L$ is the binary expansion of $\lceil 0.665 \cdot 64 \rceil / 64 = 43/64$, which is 101011. To see how the decoder operates, note that $43/64 \le F_\infty < 44/64$. First the decoder compares $F_\infty$ with $D(\phi) = B(\phi) + P_c(0) = 0 + 0.7 = 0.7$. Clearly $F_\infty < 44/64 < 0.7$ and therefore the decoder finds $x_1 = 0$. Just like the encoder, the decoder computes $B(0)$ and $P_c(0)$. The next threshold is $D(0) = B(0) + P_c(00) = 0 + P_c(0) \cdot 0.5 = 0.7 \cdot 0.5 = 0.35$. Now $F_\infty \ge 43/64 > 0.35$ and $x_2 = 1$. The decoder computes $B(01)$ and $P_c(01)$ and the next threshold $D(01) = B(01) + P_c(010) = 0.35 + P_c(01) \cdot 0.9 = 0.35 + 0.35 \cdot 0.9 = 0.665$. Again $F_\infty \ge 43/64 > 0.665$ and the decoder produces $x_3 = 1$. After computation of $B(011)$ and $P_c(011)$ the decoder can reconstruct the codeword 101011. Note that for this example the coding redundancy is $6 - \log(1/0.035) = 1.163$ bit.

At the end of this section we mention that we have avoided discussing accuracy issues.

These issues were first investigated by Rissanen [15], who contributed much more to arithmetic coding, and (independently) by Pasco [13].

5 Coding for a known source

When the FSMX source is known to encoder and decoder, which means that the model $S$ and the corresponding parameter vector $\Theta_S$ are available to both, they can use the actual distribution as coding distribution, hence for each sequence of past symbols $x_{-\infty}^0$ take

$$P_c(x_1^t|x_{-\infty}^0, S, \Theta_S) \triangleq P_a(x_1^t|x_{-\infty}^0, S, \Theta_S), \quad \text{for all } x_1^t \in \{0,1\}^t,\; t = 0, 1, \cdots, T. \qquad (19)$$

It will be clear that a coding distribution equal to the actual distribution satisfies (5). This definition allows sequential coding and decoding since

$$P_c(x_1^{t-1}, X_t = 0|x_{-\infty}^0, S, \Theta_S) = (1 - \theta_{\sigma(x_{-\infty}^{t-1})}) \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S, \Theta_S), \text{ and}$$
$$P_c(x_1^{t-1}, X_t = 1|x_{-\infty}^0, S, \Theta_S) = \theta_{\sigma(x_{-\infty}^{t-1})} \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S, \Theta_S), \quad \text{for } t = 1, \cdots, T, \qquad (20)$$

and since $P_c(x_1^{t-1}|x_{-\infty}^0, S, \Theta_S)$ and $\theta_{\sigma(x_{-\infty}^{t-1})}$ are known to encoder and decoder after $x_1^{t-1}$ was processed.

Theorem 2 : For any known FSMX source with model S and parameter vector ΘS , the individual redundancies with respect to the source (S, ΘS ) are upper bounded by

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < 2, \qquad (21)$$

for all possible $x_1^T \in \{0,1\}^T$, i.e. $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0, S, \Theta_S) > 0$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (19).

Proof : Fix $x_1^T \in \{0,1\}^T$. Using for $L(x_1^T|x_{-\infty}^0)$ the bound in theorem 1, we obtain for the individual redundancy of sequence $x_1^T$ with respect to source $(S, \Theta_S)$

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) = L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} < \log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S, \Theta_S)} + 2 - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} = 2 \qquad (22)$$

by (19). It is assumed that $P_c(x_1^T|x_{-\infty}^0, S, \Theta_S) = P_a(x_1^T|x_{-\infty}^0, S, \Theta_S) > 0$. □

It should be noted that the redundancy consists only of coding redundancy, i.e. redundancy resulting from not taking codeword lengths exactly equal to the information in the corresponding sequence. We do not do so partly because it is not always possible, but also because it is not (computationally) efficient.

6 Coding for a memoryless source with unknown parameter

In the FSMX framework, a memoryless source has a postfix set $S = \{\lambda\}$, hence $|S| = 1$. Consequently $\sigma(x_{-\infty}^{t-1}) = \lambda$ for all $x_{-\infty}^{t-1}$, and $\theta_\lambda$ is the only parameter involved. Therefore

$$P_a(X_t = 1|x_{-\infty}^{t-1}, \{\lambda\}, \theta_\lambda) = 1 - P_a(X_t = 0|x_{-\infty}^{t-1}, \{\lambda\}, \theta_\lambda) = \theta_\lambda, \quad \text{for all } t. \qquad (23)$$

When the parameter $\theta_\lambda$ is unknown to encoder and decoder, they cannot use the actual distribution as coding distribution. Instead they must use a cleverly chosen distribution. A coding distribution with nice properties was suggested by Krichevsky and Trofimov [12].

Definition 7 : The Krichevsky-Trofimov probability estimator for a sequence containing $a \in \{0, 1, \cdots\}$ zeros and $b \in \{0, 1, \cdots\}$ ones, is given by

 1 ·(1+ 1 )·...·(a− 1 )· 1 ·(1+ 1 )·...·(b− 1 )  2 2 2 2 2 2 for a > 0, b > 0,  1·2·...·(a+b)  1 ·(1+ 1 )·...·(a− 1 ) ∆  2 2 2 for a > 0, b = 0, Pe(a, b) = 1·2·...·a (24) 1 ·(1+ 1 )·...·(b− 1 )  2 2 2 for a = 0, b > 0, and  1·2·...·b  1 for a = b = 0.

△

It can be shown (in fact this is how Krichevsky and Trofimov defined their estimator) that
$$P_e(a, b) = \int_0^1 \frac{(1-\theta)^a \theta^b}{\pi \sqrt{(1-\theta)\theta}} \, d\theta, \qquad (25)$$

in other words $P_e(a,b)$ is a weighting of actual probabilities $(1-\theta)^a \theta^b$. This weighting is accomplished by assuming that the parameter $\theta$ has a $(\frac12, \frac12)$-Dirichlet distribution. For this estimator we can find an upper and a lower bound that are easier to handle than expression (24). The lemma below contains these bounds. It is proved in appendix A.

Lemma 1 : For $a + b \ge 1$ the following inequalities hold for $P_e(a,b)$ :
$$\frac12 \cdot \frac{1}{\sqrt{a+b}} \left(\frac{a}{a+b}\right)^a \left(\frac{b}{a+b}\right)^b \le P_e(a,b) \le \sqrt{\frac{2}{\pi}} \cdot \frac{1}{\sqrt{a+b}} \left(\frac{a}{a+b}\right)^a \left(\frac{b}{a+b}\right)^b. \qquad (26)$$

We are now ready to define a coding distribution for a memoryless source. Let $a(x_1^t)$ be the number of zeros in $x_1^t$ and $b(x_1^t)$ the number of ones in $x_1^t$. The empty sequence $\phi$ contains $a(\phi) = 0$ zeroes and $b(\phi) = 0$ ones. The 'Krichevsky-Trofimov' coding distribution for a memoryless source is defined as

$$P_c(x_1^t|\{\lambda\}) \triangleq P_e(a(x_1^t), b(x_1^t)), \quad \text{for all } x_1^t \in \{0,1\}^t,\; t = 0, 1, \cdots, T. \qquad (27)$$

It can easily be verified that $P_c(\cdot|\{\lambda\})$ as defined in (27) satisfies (5). Again, as in the case where the source was known to encoder and decoder, $P_c(\cdot|\{\lambda\})$ is sequentially available. This follows from

$$P_c(x_1^{t-1}, X_t = 0|\{\lambda\}) = \frac{a(x_1^{t-1}) + \frac12}{a(x_1^{t-1}) + b(x_1^{t-1}) + 1} \cdot P_c(x_1^{t-1}|\{\lambda\}), \text{ and}$$
$$P_c(x_1^{t-1}, X_t = 1|\{\lambda\}) = \frac{b(x_1^{t-1}) + \frac12}{a(x_1^{t-1}) + b(x_1^{t-1}) + 1} \cdot P_c(x_1^{t-1}|\{\lambda\}), \quad \text{for } t = 1, \cdots, T, \qquad (28)$$

and the fact that $a(x_1^{t-1})$, $b(x_1^{t-1})$, and $P_c(x_1^{t-1}|\{\lambda\})$ are known after $x_1^{t-1}$ has been processed. When $X_t = 0$ we increment count $a$, i.e. $a(x_1^{t-1}, X_t = 0) = a(x_1^{t-1}) + 1$, and $b(x_1^{t-1}, X_t = 0) = b(x_1^{t-1})$. When $X_t = 1$, count $b$ is incremented, hence we update $a(x_1^{t-1}, X_t = 1) = a(x_1^{t-1})$, and $b(x_1^{t-1}, X_t = 1) = b(x_1^{t-1}) + 1$. It should be noted that the encoder and decoder must both update their counts, in order to keep them identical. This is essential since the coding distributions depend on the previously encountered data, and the encoder and decoder must use the same coding distributions to guarantee proper performance. The performance of the estimator of Krichevsky and Trofimov follows from the next result.
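Before stating that result, here is a minimal sketch of the update (28) and the count bookkeeping just described (our own naming; exact fractions are used for clarity).

```python
from fractions import Fraction

def kt_sequential(symbols):
    """Krichevsky-Trofimov coding distribution (27), computed via (28):
    returns Pc(x_1^t | {lambda}) = Pe(a, b) after processing the sequence."""
    a, b, P = 0, 0, Fraction(1)
    for x in symbols:
        if x == 0:
            P *= Fraction(2 * a + 1, 2 * (a + b + 1))   # (a + 1/2) / (a + b + 1)
            a += 1
        else:
            P *= Fraction(2 * b + 1, 2 * (a + b + 1))   # (b + 1/2) / (a + b + 1)
            b += 1
    return P

print(kt_sequential([0, 1]))  # -> 1/8, i.e. Pe(1, 1)
```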

Theorem 3 : For any memoryless source with unknown parameter $\theta_\lambda$, the individual redundancies with respect to this source $(\{\lambda\}, \theta_\lambda)$ are upper bounded by
$$\rho(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda) < \frac12 \log T + 3, \qquad (29)$$

for all $x_1^T \in \{0,1\}^T$, for all sequences of past symbols $x_{-\infty}^0$, if we use the coding distribution specified in (27).

Proof : First we split up the redundancy into parameter redundancy (the first term) and coding redundancy (the last two terms):

$$\rho(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda) = L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)} = \log\frac{P_a(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)}{P_c(x_1^T|\{\lambda\})} + L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|\{\lambda\})}. \qquad (30)$$
For the coding redundancy we obtain, using theorem 1, that

$$L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|\{\lambda\})} < 2. \qquad (31)$$

To upper bound the parameter redundancy, consider a sequence $x_1^T$ with $a$ zeros and $b$ ones. Then

$$\log\frac{P_a(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)}{P_c(x_1^T|\{\lambda\})} = \log\frac{(1-\theta_\lambda)^a \theta_\lambda^b}{P_e(a,b)} \le \log\frac{(1-\theta_\lambda)^a \theta_\lambda^b}{\frac12 \frac{1}{\sqrt{a+b}} \left(\frac{a}{a+b}\right)^a \left(\frac{b}{a+b}\right)^b} \le \frac12 \log(a+b) + 1 = \frac12 \log T + 1. \qquad (32)$$

In this derivation the first inequality follows from lemma 1, and the second from the fact that $(1-\theta_\lambda)^a \theta_\lambda^b \le (\frac{a}{a+b})^a (\frac{b}{a+b})^b$. The proof of the theorem is complete after substitution of (31) and (32) in (30). □

We have seen that the redundancy now consists of coding redundancy and parameter redundancy. Parameter redundancy is a result of not knowing parameters, and therefore not being able to use the actual distribution as coding distribution. Note that this parameter redundancy is not very big. For $T = 262144 = 2^{18}$ we get a parameter redundancy of at most 10 bits.

Example 4 : Suppose that a memoryless source with parameter $\theta_\lambda = 0.1$ generates a sequence with $a = 901$ zeroes and $b = 123$ ones. Then the probability $P_a(x_1^{1024}|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)$ of this sequence is $2^{-545.552}$. The probability estimate $P_e(901, 123) = 2^{-547.737}$. By lemma 1 the probability estimate $P_e(901, 123)$ is larger than $2^{-1024 \cdot h(123/1024) - 5 - 1}$ and smaller than $2^{-1024 \cdot h(123/1024) - 5 - 0.326}$, where $h(p) \triangleq p \log\frac{1}{p} + (1-p)\log\frac{1}{1-p}$, for $p \in [0,1]$. With $1024 \cdot h(123/1024) = 542.410$ this gives the bounds $2^{-548.410}$ and $2^{-547.736}$. Note that the upper bound is very close to the actual value of the estimate. Asymptotically, for large $T$, this bound is tight. Check that the parameter redundancy is smaller than 6 bits.
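The numbers of example 4 can be reproduced in the log domain. The sketch below is our own check (not part of the original text), computing $\log_2 P_a$, $\log_2 P_e(901, 123)$ via (24), and the two exponents of lemma 1.

```python
from math import log2, pi

a, b, theta = 901, 123, 0.1

log_pa = a * log2(1 - theta) + b * log2(theta)            # about -545.55

# log2 Pe(a, b) from (24): products of (i + 1/2), i = 0..a-1 and 0..b-1, over (a+b)!
log_pe = (sum(log2(i + 0.5) for i in range(a)) +
          sum(log2(i + 0.5) for i in range(b)) -
          sum(log2(i) for i in range(1, a + b + 1)))       # about -547.74

p = b / (a + b)                                            # binary entropy h(123/1024)
h = p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))
lower = -(a + b) * h - 0.5 * log2(a + b) - 1               # lemma 1, lower bound exponent
upper = -(a + b) * h - 0.5 * log2(a + b) + 0.5 * log2(2 / pi)
print(round(log_pa, 3), round(log_pe, 3), round(lower, 3), round(upper, 3))
```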

7 Coding for a source with known model and unknown parameters

7.1 Definition of the coding distribution

Now consider a general FSMX source, whose postfix set $S$ is known to the encoder and the decoder. In that case, the corresponding postfix function $\sigma(\cdot)$ partitions the sequence $x_1^T$ into memoryless subsequences. For each of these subsequences we can use the Krichevsky-Trofimov estimator. More precisely, let $a_s(x_1^t|x_{-\infty}^0)$, respectively $b_s(x_1^t|x_{-\infty}^0)$, denote the number of zeros, respectively ones, that occur in $x_1^t$ at instants $\tau$, for $\tau = 1, \cdots, t$, such that $s = \sigma(x_{-\infty}^{\tau-1})$, for each $s \in S$. Then for each sequence of past symbols $x_{-\infty}^0$ we define our coding distribution as

$$P_c(x_1^t|x_{-\infty}^0, S) \triangleq \prod_{s \in S} P_e(a_s(x_1^t|x_{-\infty}^0), b_s(x_1^t|x_{-\infty}^0)), \quad \text{for all } x_1^t \in \{0,1\}^t,\; t = 0, 1, \cdots, T. \qquad (33)$$

Since the model of the source is known to the encoder and decoder, they can keep track of the postfix $\sigma(x_{-\infty}^{t-1})$ that determines the generation of $x_t$ (not the parameter $\theta_{\sigma(x_{-\infty}^{t-1})}$ itself), after having processed $x_1 x_2 \cdots x_{t-1}$. This makes it possible for encoder and decoder to

update the counts $a_s(x_1^t|x_{-\infty}^0)$ and $b_s(x_1^t|x_{-\infty}^0)$ for $s \in S$ properly. It is again easy to see that $P_c(\cdot|x_{-\infty}^0, S)$ which is defined in (33) is acceptable as a coding distribution, i.e. it satisfies (5). Observe that also now $P_c(\cdot|x_{-\infty}^0, S)$ is sequentially available. This follows from

$$P_c(x_1^{t-1}, X_t = 0|x_{-\infty}^0, S) = \frac{a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + \frac12}{a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + 1} \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S), \text{ and}$$
$$P_c(x_1^{t-1}, X_t = 1|x_{-\infty}^0, S) = \frac{b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + \frac12}{a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + 1} \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S), \quad \text{for } t = 1, \cdots, T, \qquad (34)$$

and the fact that $a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$, $b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$, and $P_c(x_1^{t-1}|x_{-\infty}^0, S)$ are known after $x_1^{t-1}$ has been processed. Count $a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$ is incremented when $X_t = 0$ and $b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$ when $X_t = 1$.

The notation $P_c(x_1^T|x_{-\infty}^0, S)$ suggests that averaging $P_c(x_1^T|x_{-\infty}^0, S, \Theta_S) = P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)$ over all $\Theta_S$ has led to $P_c(x_1^T|x_{-\infty}^0, S)$. The assumption that all parameters $\theta_s$, $s \in S$, in $\Theta_S$ are Dirichlet-$(\frac12, \frac12)$ distributed justifies this observation. Equation (25) is the one-dimensional version of this weighting operation.
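A minimal sketch of (33)-(34) for a known model (function and variable names are ours): one Krichevsky-Trofimov count pair per postfix, selected by the postfix function $\sigma$.

```python
from fractions import Fraction

def known_model_prob(symbols, past, model):
    """Pc(x_1^T | past, S) of (33), computed sequentially as in (34).
    `model` is a proper and complete postfix set such as {"00", "10", "1"};
    context strings are read left to right, most recent symbol last."""
    counts = {s: [0, 0] for s in model}                   # (a_s, b_s) per postfix
    P, history = Fraction(1), list(past)
    for x in symbols:
        ctx = next(s for s in model                       # sigma(x_{-inf}^{t-1})
                   if "".join(map(str, history[-len(s):])) == s)
        a, b = counts[ctx]
        P *= Fraction(2 * a + 1 if x == 0 else 2 * b + 1, 2 * (a + b + 1))
        counts[ctx][x] += 1
        history.append(x)
    return P

# Example 5: the seven symbols 0110100 after past ...010 give Pc = 1/512.
print(known_model_prob([0, 1, 1, 0, 1, 0, 0], [0, 1, 0], {"00", "10", "1"}))
```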

7.2 An upper bound on the redundancy

To investigate the redundancy of this coding method we first give a definition.

Definition 8 :
$$\gamma(z) \triangleq \begin{cases} z & \text{for } 0 \le z < 1,\\ \frac12 \log z + 1 & \text{for } z \ge 1. \end{cases} \qquad (35)$$
△

It can be seen that the function $\gamma(\cdot)$ is a convex-$\cap$ continuation of $\frac12 \log z + 1$ satisfying $\gamma(0) = 0$. Using this function we can state the following theorem.

Theorem 4 : For any FSMX source with known model $S$ and unknown parameter vector $\Theta_S$, the individual redundancies with respect to $(S, \Theta_S)$ are upper bounded by
$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < |S|\,\gamma\!\left(\frac{T}{|S|}\right) + 2, \qquad (36)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (33). Note that (36) can be rewritten as
$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < \begin{cases} T + 2 & \text{for } T = 1, \cdots, |S| - 1,\\ \frac{|S|}{2}\log\frac{T}{|S|} + |S| + 2 & \text{for } T = |S|, |S|+1, \cdots. \end{cases} \qquad (37)$$

Theorem 4 shows that the parameter redundancy is not more than roughly $\frac12 \log T$ bits per parameter. Asymptotic considerations lead to stronger bounds, but we prefer absolute results here.

Proof : The proof of this theorem differs from the previous one only in the treatment of the parameter redundancy. Instead of one parameter, there are $|S|$ parameters involved now. The structure of the coding distribution makes it possible, however, to split up the total parameter redundancy into $|S|$ terms representing the parameter redundancies corresponding to each of the $|S|$ subsequences, one for each postfix $s$ in $S$. Each of these terms can be upper bounded by $\frac12 \log(a_s + b_s) + 1$ as we have seen before, however only if $a_s + b_s > 0$. For $a_s + b_s = 0$ the term should not contribute to the redundancy. This is why we have introduced the function $\gamma$. Consider a sequence $x_1^T$ and let $a_s = a_s(x_1^T|x_{-\infty}^0)$ and $b_s = b_s(x_1^T|x_{-\infty}^0)$, for $s \in S$. For the parameter redundancy we can now write

$$\log\frac{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}{P_c(x_1^T|x_{-\infty}^0, S)} = \sum_{s\in S}\log\frac{(1-\theta_s)^{a_s}\theta_s^{b_s}}{P_e(a_s, b_s)} \le \sum_{s\in S: a_s+b_s>0}\left(\frac12\log(a_s+b_s) + 1\right) = \sum_{s\in S}\gamma(a_s+b_s) \le |S|\,\gamma\!\left(\frac{\sum_{s\in S}(a_s+b_s)}{|S|}\right) = |S|\,\gamma\!\left(\frac{T}{|S|}\right). \qquad (38)$$

The first inequality is obtained along the lines of (32); the second follows from the convex-$\cap$ property of $\gamma(\cdot)$. Adding the upper bound of 2 bits for the coding redundancy yields the theorem. □

Example 5 : Consider the FSMX source of example 1. For past sequence $x_{-\infty}^0 = \cdots 010$ we listed in the table below the postfix $\sigma(x_{-\infty}^{t-1})$, the source symbol $x_t$, the coding probability $P_c(x_1^t|x_{-\infty}^0, S)$, and the counts $(a_{00}, b_{00})$, $(a_{10}, b_{10})$, and $(a_1, b_1)$, for $t = 1, \cdots, 7$.

t | σ(x_{-∞}^{t-1}) | x_t | P_c(x_1^t|x_{-∞}^0, S) | (a_00, b_00) | (a_10, b_10) | (a_1, b_1)
1 | 10              | 0   | 1/2 · 1 = 1/2          | (0,0)        | (1,0)        | (0,0)
2 | 00              | 1   | 1/2 · 1/2 = 1/4        | (0,1)        | (1,0)        | (0,0)
3 | 1               | 1   | 1/2 · 1/4 = 1/8        | (0,1)        | (1,0)        | (0,1)
4 | 1               | 0   | 1/4 · 1/8 = 1/32       | (0,1)        | (1,0)        | (1,1)
5 | 10              | 1   | 1/4 · 1/32 = 1/128     | (0,1)        | (1,1)        | (1,1)
6 | 1               | 0   | 1/2 · 1/128 = 1/256    | (0,1)        | (1,1)        | (2,1)
7 | 10              | 0   | 1/2 · 1/256 = 1/512    | (0,1)        | (2,1)        | (2,1)

The parameter redundancy is $\log(0.0059535/(1/512)) = 1.608$ bit. The bound on the parameter redundancy yields $\frac32 \log\frac73 + 3 = 4.834$ bit.

7.3 An upper bound on the codeword length

The corollary below is a straightforward consequence of theorem 4.

Corollary 1 : For any FSMX source with known model $S$, the codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by

$$L(x_1^T|x_{-\infty}^0) < \min_{\Theta_S} \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} + |S|\,\gamma\!\left(\frac{T}{|S|}\right) + 2, \qquad (39)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (33).

8 Coding for a source with unknown model and parameters

8.1 Introduction of the context tree weighting method

Consider an FSMX source, whose postfix set $S$ and parameter vector $\Theta_S$ are unknown to the encoder and the decoder. Can we choose the coding distribution such that the redundancy is not much higher than what we have found in theorem 4 for sources with known model (postfix set)? The answer to this question turns out to be affirmative. We will propose a weighted coding distribution and show that it achieves redundancies not much higher than in the case where we knew the model. The resulting coding procedure is sequential again, and therefore allows one-pass coding. The weighted coding distribution is based on the concept of a context tree (see figure 3).

Definition 9 : Let $D \in \{0, 1, \cdots\}$. The context tree $T_D$ is a set of nodes labeled $s$, where $s$ is a (binary) string with length $l(s)$ such that $0 \le l(s) \le D$. Each node $s \in T_D$ with $l(s) < D$ 'splits up' into two nodes, $0s$ and $1s$. The node $s$ is called the parent of the nodes $0s$ and $1s$, which in turn are the children of $s$. To each node $s \in T_D$, there correspond counts $a_s \ge 0$ and $b_s \ge 0$. For the children $0s$ and $1s$ of parent node $s$, the counts must satisfy $a_{0s} + a_{1s} = a_s$ and $b_{0s} + b_{1s} = b_s$. △

Note that we assume that the tree is full up to length $D$, i.e. all nodes corresponding to strings $s$ with length not exceeding $D$, including the root which corresponds to $\lambda$, are present. Now, to each node there corresponds a weighted probability. This weighted probability is defined recursively on the tree $T_D$. Without any doubt, this is the basic definition in this manuscript.

Definition 10 : To each node $s \in T_D$, we assign a weighted probability $P_{w,s}^D$ which is defined as

$$P_{w,s}^D \triangleq \begin{cases} \frac12 P_e(a_s, b_s) + \frac12 P_{w,0s}^D \cdot P_{w,1s}^D & \text{for } 0 \le l(s) < D,\\ P_e(a_s, b_s) & \text{for } l(s) = D. \end{cases} \qquad (40)$$

The context tree together with the weighted probability distribution is called a weighted context tree. △

To be able to define a weighted coding distribution, we assume that the counts $(a_s, b_s)$, $s \in T_D$, are determined by the source sequence $x_1^t$ seen up to now, assuming that $x_{-\infty}^0$ are the past symbols.

Definition 11 : Let $a_s = a_s(x_1^t|x_{-\infty}^0)$, respectively $b_s(x_1^t|x_{-\infty}^0)$, be the number of zeros, respectively ones, that occur in $x_1^t$ at instants $\tau$, $1 \le \tau \le t$, such that $s = \sigma(x_{-\infty}^{\tau-1})$, for each $s \in T_D$. The weighted probabilities corresponding to the nodes $s \in T_D$ are now denoted by $P_{w,s}^D(x_1^t|x_{-\infty}^0)$. For any sequence $x_{-\infty}^0$ of past symbols, we define our weighted coding distribution as
$$P_c(x_1^t|x_{-\infty}^0) \triangleq P_{w,\lambda}^D(x_1^t|x_{-\infty}^0), \qquad (41)$$
for all $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$. △

This coding distribution determines the context tree weighting method. To verify that the weighted coding distribution $P_c(\cdot|x_{-\infty}^0)$ satisfies (5), but also to help us understand the sequential updating procedure, we formulate a lemma. The proof of this lemma can be found in appendix B.

Lemma 2 : Let $s \in T_D$. If $s$ is a postfix of $x_{-\infty}^{t-1}$, then

$$P_{w,s}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) + P_{w,s}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0) = P_{w,s}^D(x_1^{t-1}|x_{-\infty}^0), \qquad (42)$$

and, if $s$ is not a postfix of $x_{-\infty}^{t-1}$, then

$$P_{w,s}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) = P_{w,s}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0) = P_{w,s}^D(x_1^{t-1}|x_{-\infty}^0). \qquad (43)$$

To check that $P_c(\cdot|x_{-\infty}^0)$ as defined in (41) is an allowable coding distribution, observe that $P_{w,\lambda}^D(\phi|x_{-\infty}^0) = 1$, i.e. when no symbols are processed yet. In fact, the weighted probabilities corresponding to all nodes are 1 in that case. Subsequently, note that lemma 2 states that $P_{w,\lambda}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) + P_{w,\lambda}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0) = P_{w,\lambda}^D(x_1^{t-1}|x_{-\infty}^0)$ since $\lambda$ is a postfix of all strings $x_{-\infty}^{t-1}$. We may conclude that $P_c(\cdot|x_{-\infty}^0)$ satisfies (5) after having verified that weighted probabilities are always positive.

8.2 Complexity issues

Another consequence of lemma 2 is that processing a new source symbol $x_t$ requires updating of only the weighted probabilities for nodes that correspond to postfixes of $x_{-\infty}^{t-1}$, i.e. the nodes $x_{t-D}^{t-1}, \cdots, x_{t-2}^{t-1}, x_{t-1}^{t-1}, \lambda$. In other words the sequence of past symbols $x_{-\infty}^{t-1}$ defines a path in the tree, and only the weighted probabilities along this path are updated, starting from the full-depth node $x_{t-D}^{t-1}$ and working towards the root $\lambda$. More precisely

$$P_{w,x_{t-d}^{t-1}}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) = \begin{cases} \frac12 P_e\!\left(a_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0),\, b_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0)\right) \\[0.5ex] \quad +\, \frac12 P_{w,x_{t-d-1}x_{t-d}^{t-1}}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) \cdot P_{w,\overline{x}_{t-d-1}x_{t-d}^{t-1}}^D(x_1^{t-1}|x_{-\infty}^0) & \text{if } 0 \le d < D,\\[1ex] P_e\!\left(a_{x_{t-D}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0),\, b_{x_{t-D}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0)\right) & \text{for } d = D, \end{cases} \qquad (44)$$

where $x_t^{t-1} = \lambda$, and $\overline{x} = 1 - x$, for $x \in \{0,1\}$. A similar expression can be given for $P_{w,x_{t-d}^{t-1}}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0)$. Formula (44) more or less suggests that each node $s$ not only contains a weighted probability $P_{w,s}^D$, but also an estimated probability $P_e(a_s, b_s)$, and of course the counts $a_s$ and $b_s$. Also the estimated probabilities must be updated, hence in addition to (44) the encoder and decoder must perform

$$P_e\!\left(a_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0),\, b_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0)\right) = \frac{a_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0) + \frac12}{a_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0) + b_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0) + 1} \cdot P_e\!\left(a_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0),\, b_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0)\right), \quad \text{for } 0 \le d \le D, \qquad (45)$$

and a similar computation if $X_t = 1$. Again only the counts along the path from the full-depth leaf node to the root are updated. When $X_t = 0$ all $a$-counts along this path are increased by one, for $X_t = 1$ the encoder and decoder increase the $b$-counts. The final conclusion after all this is that the computational complexity of the context tree weighting method is linear in $T$. At first sight the storage complexity is exponential in $D$, since the number of nodes in $T_D$ is $2^{D+1} - 1$. On the other hand however, it need not be more than linear in $T$. This follows from the fact that nodes $s$ with $a_s = b_s = 0$ need not be stored in memory. The reason for this is that these nodes $s$ (and all nodes that have $s$ as a postfix) have $P_{w,s}^D = P_e(a_s, b_s) = 1$. Each time a symbol is processed, $D + 1$ nodes are visited, and added if necessary. The number of nodes that are added during the processing of $T$ symbols therefore cannot exceed $(D+1)T$.
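The path update (44)-(45) can be sketched as follows; this is our own simplified illustration (class and variable names are hypothetical). The weighted probabilities are recomputed bottom-up along the current context path, and nodes are created on first visit, so storage grows as described above.

```python
from fractions import Fraction

class CTW:
    """Context tree weighting along the lines of (40)-(45); a context is a tuple
    of the D most recent past symbols, most recent symbol last."""
    def __init__(self, D):
        self.D = D
        self.nodes = {}                        # context -> [a, b, Pe, Pw]

    def _node(self, ctx):
        return self.nodes.setdefault(ctx, [0, 0, Fraction(1), Fraction(1)])

    def update(self, context, x):
        """Process symbol x given the last D past symbols; returns the new root
        weighted probability Pw_lambda, i.e. the coding probability Pc(x_1^t)."""
        path = [tuple(context[len(context) - d:]) for d in range(self.D, -1, -1)]
        for ctx in path:                       # from the depth-D node towards the root
            a, b, Pe, _ = self._node(ctx)
            Pe *= Fraction(2 * a + 1 if x == 0 else 2 * b + 1, 2 * (a + b + 1))
            a, b = (a + 1, b) if x == 0 else (a, b + 1)
            if len(ctx) == self.D:
                Pw = Pe                        # leaf node: second case of (40)
            else:                              # internal node: first case of (40);
                                               # the off-path child is unchanged (lemma 2)
                Pw = (Fraction(1, 2) * Pe +
                      Fraction(1, 2) * self._node((0,) + ctx)[3] * self._node((1,) + ctx)[3])
            self.nodes[ctx] = [a, b, Pe, Pw]
        return self.nodes[()][3]

# Example 6: x_1^7 = 0110100 with past ...010 and D = 3 gives Pc = 95/32768.
ctw, past = CTW(3), [0, 1, 0]
for x in [0, 1, 1, 0, 1, 0, 0]:
    P = ctw.update(past[-3:], x)
    past.append(x)
print(P)  # -> 95/32768
```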

8.3 An upper bound on the redundancy

What about the redundancy of the context tree weighting method? Before we can answer this question we have to define the cost of a model.

Definition 12 : The cost ΓD(S) of a model S, with respect to a tree TD, is defined as

$$\Gamma_D(S) \triangleq |S| - 1 + |\{s : s \in S, l(s) \ne D\}|. \qquad (46)$$

We assume that $S$ is contained in $T_D$, i.e. $S \subset T_D$. △ Our basic result concerning the context tree weighting technique can be stated now.

Theorem 5 : For any FSMX source with unknown model $S \subset T_D$ and unknown parameter vector $\Theta_S$, the individual redundancies with respect to $(S, \Theta_S)$ are upper bounded by
$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < \Gamma_D(S) + |S|\,\gamma\!\left(\frac{T}{|S|}\right) + 2, \qquad (47)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence of past symbols $x_{-\infty}^0$, if we use the coding distribution specified in (41).

Proof : Consider a sequence $x_1^T \in \{0,1\}^T$. We split up the individual redundancy into model redundancy, parameter redundancy and coding redundancy.

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) = L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} = \log\frac{P_c(x_1^T|x_{-\infty}^0, S)}{P_c(x_1^T|x_{-\infty}^0)} + \log\frac{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}{P_c(x_1^T|x_{-\infty}^0, S)} + L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|x_{-\infty}^0)}. \qquad (48)$$
For the coding redundancy we obtain, using theorem 1, that

$$L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|x_{-\infty}^0)} < 2. \qquad (49)$$
For the parameter redundancy we use the bound given by (38):

$$\log\frac{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}{P_c(x_1^T|x_{-\infty}^0, S)} \le |S|\,\gamma\!\left(\frac{T}{|S|}\right). \qquad (50)$$
What remains to be investigated is the model redundancy. From (33) we know that

$$P_c(x_1^T|x_{-\infty}^0, S) = \prod_{s\in S} P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)). \qquad (51)$$

It is easy to lower bound $P_c(\cdot|x_{-\infty}^0)$ in similar terms, as we shall see soon. Consider the postfix set (model) $S$ which is contained in the context tree $T_D$. Leaves of $S$ are nodes $s$ in the context tree for which $s \in S$. An internal node $s$ of $S$ is a node in $T_D$ that is a postfix of a string $s' \in S$, $s \ne s'$. We can now obtain a recursive lower bound on $P_{w,\lambda}^D(x_1^T|x_{-\infty}^0)$ by observing that for $s \in T_D$

$$P_{w,s}^D(x_1^T|x_{-\infty}^0) \ge \begin{cases} \frac12 P_{w,0s}^D(x_1^T|x_{-\infty}^0)\cdot P_{w,1s}^D(x_1^T|x_{-\infty}^0) & \text{if } s \text{ is an internal node of } S,\\ \frac12 P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)) & \text{if } s \text{ is a leaf of } S \text{ with } l(s) \ne D,\\ P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)) & \text{if } s \text{ is a leaf of } S \text{ with } l(s) = D. \end{cases} \qquad (52)$$
It turns out that we 'lose' not more than a factor $\frac12$ (1 bit) in every internal node and in every leaf of $S$ with $l(s) \ne D$. Therefore we get

$$P_c(x_1^T|x_{-\infty}^0) = P_{w,\lambda}^D(x_1^T|x_{-\infty}^0) \ge 2^{-\Gamma_D(S)} \prod_{s\in S} P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)). \qquad (53)$$
This results in the following upper bound for the model redundancy:

$$\log\frac{P_c(x_1^T|x_{-\infty}^0, S)}{P_c(x_1^T|x_{-\infty}^0)} \le \log\frac{1}{2^{-\Gamma_D(S)}} = \Gamma_D(S). \qquad (54)$$

Figure 3: Weighted context tree $T_3$ for $x_1^T = 0110100$ and $x_{-\infty}^0 = \cdots 010$ (each node shows its counts $(a_s, b_s)$, the Krichevsky-Trofimov estimate $P_e(a_s, b_s)$, and the weighted probability $P_{w,s}^D$).

Substituting (49), (50) and (54) in (48) yields the theorem. □

Theorem 5 is the basic theorem in this manuscript. In this theorem we recognize, beside the coding and parameter redundancy, the model redundancy. Model redundancy is a consequence of not knowing the model, and therefore not being able to take distribution $P_c(\cdot|x_{-\infty}^0, S)$ as coding distribution. This results in a loss which is upper bounded by $\Gamma_D(S)$ bits. If we specify the model $S \subset T_D$ using a natural code, similar to the one described in section 2, we also need $\Gamma_D(S)$ bits. To see this note that, in contrast to the code described there, strings $s$ with $l(s) = D$ need not contribute to the specification code. Per parameter, the model redundancy is not more than 2 bits. Another way of looking at the model redundancy in the theorem is that the first parameter costs no more than 1 bit, all additional ones cost no more than 2 bits, however parameters corresponding to nodes at level $D$ cost no more than 1 bit.

Example 6 : Suppose a source generated the sequence $x_1^T = 0110100$ while the sequence of past symbols $x_{-\infty}^0 = \cdots 010$. For $D = 3$ we have plotted the weighted context tree $T_D$ in figure 3. Node $s$ contains the counts $(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0))$, the Krichevsky-Trofimov estimate $P_e(a_s, b_s)$, and the weighted probability $P_{w,s}^D(x_1^T|x_{-\infty}^0)$. The coding probability for this sequence is 95/32768. If we assume that the sequence was generated by the source from example 1, then we obtain an upper bound on the model redundancy of $\Gamma_3(S) = 5$ bits. This can be checked from
$$P_{w,\lambda}^D(x_1^T|x_{-\infty}^0) \ge \frac12 P_{w,0} \cdot P_{w,1} \ge \frac12 \left(\frac12 P_{w,00} \cdot P_{w,10}\right) P_{w,1} \ge \frac12 \left(\frac12 \left(\frac12 P_e(a_{00}, b_{00})\right) \left(\frac12 P_e(a_{10}, b_{10})\right)\right) \left(\frac12 P_e(a_1, b_1)\right) = 2^{-5} P_c(x_1^T|x_{-\infty}^0, S). \qquad (55)$$

From the fact that the parameter redundancy is not larger than $\frac32 \log\frac73 + 3 = 4.834$ bits, we obtain the upper bound 9.834 bits for the model and parameter redundancy together. The actual value of the total redundancy (not including coding redundancy) is $\log(0.0059535/(95/32768)) = 1.038$ bit. Note that this is less than the 1.608 bit redundancy that results for the same sequence if we assume that the model is known. The reason for this is that a model different from the actual model matches the source sequence better than the actual model. The context tree weighting method always acts as if it uses the best model for compression, as we shall see in the next subsection.

8.4 Upper bounds on the codeword length

An important property of the context tree weighting method is given in the lemma below. In appendix C we give the proof of this lemma.

Lemma 3 : The weighted coding probability $P_c(x_1^t|x_{-\infty}^0)$ is a weighting over all coding distributions $P_c(x_1^t|x_{-\infty}^0, S)$ for all models $S \subset T_D$. More precisely

$$P_c(x_1^t|x_{-\infty}^0) = \sum_{S \subset T_D} 2^{-\Gamma_D(S)} \cdot P_c(x_1^t|x_{-\infty}^0, S), \qquad (56)$$

for all $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, and all $x_{-\infty}^0$, and
$$\sum_{S \subset T_D} 2^{-\Gamma_D(S)} = 1. \qquad (57)$$

A consequence of this lemma is that for the codeword length achieved by the context tree weighting method we have that

$$L(x_1^T|x_{-\infty}^0) < \log\frac{1}{P_c(x_1^T|x_{-\infty}^0)} + 2 = \log\frac{1}{\sum_{S \subset T_D} 2^{-\Gamma_D(S)} \cdot P_c(x_1^T|x_{-\infty}^0, S)} + 2. \qquad (58)$$
This immediately leads to an upper bound on the codeword length, which is stated in the next theorem.

Theorem 6 : The codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by

$$L(x_1^T|x_{-\infty}^0) < \log\frac{1}{\sum_{S \subset T_D} 2^{-\Gamma_D(S) - \log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S)}}} + 2 \qquad (59)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (41).

The interpretation of this theorem is as follows. Let $\log(1/P_c(x_1^T|x_{-\infty}^0, S)) + \Gamma_D(S)$ be the description length of our sequence with respect to $S$. This is reasonable as we have seen before. It takes $\Gamma_D(S)$ bits to specify the model $S$. Then $\log(1/P_c(x_1^T|x_{-\infty}^0, S))$ bits are needed to encode the sequence $x_1^T$, given the past symbols $x_{-\infty}^0$, assuming that the model is $S$, if we ignore the coding redundancy. Inequality (59) says that the weighting method yields a codeword length which is not larger than minus the log of the sum of the probabilities induced by the description lengths of all models $S \subset T_D$, plus 2 bits of coding redundancy. This means that if there are two models yielding the same description length, the weighting method will achieve a codeword length which is at least one bit less. Replacing in theorem 6 the sum in the denominator inside the log by single terms corresponding to models $S \subset T_D$, and taking the minimum over all $S \subset T_D$, we obtain the following corollary:

Corollary 2 : The codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by
$$L(x_1^T|x_{-\infty}^0) < \min_{S\subset T_D}\left(\log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S)} + \Gamma_D(S)\right) + 2, \qquad (60)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (41).

Given a model $S$, for any parameter vector $\Theta_S$, we know (see (38)) that:
$$\log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S)} \le \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} + |S|\,\gamma\!\left(\frac{T}{|S|}\right). \qquad (61)$$
Substitution of this in corollary 2 yields:

Corollary 3 : The codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by
$$L(x_1^T|x_{-\infty}^0) < \min_{S\subset T_D}\min_{\Theta_S}\left(\log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} + \Gamma_D(S) + |S|\,\gamma\!\left(\frac{T}{|S|}\right)\right) + 2, \qquad (62)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (41).

Example 7 : We have simulated the source described in example 1. A sample sequence $x_{1-D}^T$ was generated, from which the first $D$ binary digits $x_{1-D}^0$ have served as the relevant part of $x_{-\infty}^0$ and $x_1^T$ as the sequence to be compressed with the context tree weighting method, with $D = 6$. To make things simple we have assumed that the coding redundancy is zero. First we have computed the weighted codeword lengths $\log(1/P_c(x_1^t|x_{-\infty}^0))$ for $t = 1, \cdots, 350$. These codeword lengths are compared to the description lengths $\log(1/P_c(x_1^t|x_{-\infty}^0, S)) + \Gamma_D(S)$ for various models $S$. We have looked at 'underestimated' models $\{\lambda\}$ and $\{0, 1\}$, the actual model $\{00, 10, 1\}$, and 'overestimated' models $\{00, 10, 01, 11\}$ and $\{000, 100, 010, 110, 001, 101, 011, 111\}$.

22 35 ”wei” 30 ”f0” ”fw” 25 ”f1w0”

20

15

10

5

0

-5 0 50 100 150 200 250 300 350

Figure 4: Redundancies in bits for t = 1, 2, ··· , 350 for underestimated models, actual model, and weighted method.

Figure 5: Redundancies in bits for t = 1, 2, ···, 350 for overestimated models, actual model, and weighted method (curves 'wei', 'f1w0', 'f1ww', 'f11ww1ww').

Figure 4 contains the description lengths corresponding to the underestimated models and the actual model, together with the weighted method codeword length; figure 5 contains the description lengths for the overestimated models together with the actual model and the weighted method length. It is important to note that we always subtract the ideal codeword length $\log(1/P_a(x_1^T|x_{-\infty}^0, S, \Theta_S))$ from the description lengths and the weighted codeword lengths, so that the vertical axis measures redundancy with respect to the actual model. Figure 4 shows that the description length corresponding to the model $\{\lambda\}$, indicated by curve 'f0', is smaller than the description length corresponding to model $\{0, 1\}$, indicated by curve 'fw', and also smaller than the description length corresponding to the actual model $\{00, 10, 1\}$, indicated by 'f1w0', for $t$ smaller than 90. For $t$ between 90 and 225 model $\{0, 1\}$ gives the smallest description length. The actual model yields the best results only for $t$ larger than 225. This indicates that for small values of the sequence length, it does not pay off to use complicated models for compression. As expected from corollary 2, the weighted method codeword length (indicated by 'wei') is always smaller than the description lengths of the fixed models. For large values of $t$ it approaches the description length corresponding to the actual model. Figure 5 shows the description lengths corresponding to two overestimated models. Model $\{00, 10, 01, 11\}$ is represented by curve 'f1ww' and $\{000, 100, 010, 110, 001, 101, 011, 111\}$ by 'f11ww1ww'. They are compared to the description length corresponding to the actual model and the codeword length yielded by the weighted method, which is again the smallest of all four. It can be observed that, as expected, overestimated models are worse than the actual model and the weighted method. To get an idea of the tightness of the bound in theorem 5 we computed the upper bound on the redundancy of the weighting method with respect to the actual source. It is $5 + \frac32 \log\frac{350}{3} + 3 = 18.300$ bit for $T = 350$. The computed value is roughly 15 bit, as can be seen in the figures. The difference can be partly explained by the fact that for large $T$ the upper bound on the parameter redundancy is not tight. Instead using the (asymptotically tight) lower bound of lemma 1 yields $5 + \frac32 \log\frac{350}{3} + \frac32 \log\frac{\pi}{2} = 16.277$ bit. The difference of a little bit more than one bit must be the consequence of the fact that other models may have description lengths larger than the description length of the actual model, but all together they lead to a better result, as follows from theorem 6.

Weighting techniques are not new in universal data compression. Already Davisson [6] demonstrates the importance of weighting (composite source, mixture distribution). Weighting in modeling systems is also not very new. Rissanen [17] more or less proposes a weighting technique for FSM sources, by discussing a distribution over all models. This definition does not lead to a one-pass modeling method. Context tree weighting is a tree recursive method that does allow one-pass coding. Therefore it should be regarded as an efficient implementation of weighting principles.

9 Minimum description length model selection

9.1 A sequential model selection method

In example 7 we have seen that for a sequence $x_1^T$, given the past symbols $x_{-\infty}^0$, different models $S \subset T_D$ give different description lengths $\log(1/P_c(x_1^t|x_{-\infty}^0, S)) + \Gamma_D(S)$. In this section we are interested in methods that give us the model that corresponds to the smallest description length. We start with a definition.

Definition 13 : The minimum description length of a sequence $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, over the models $S \subset T_D$ is defined as
$$\Lambda_D(x_1^t|x_{-\infty}^0) \triangleq \min_{S\subset T_D}\left(\log\frac{1}{P_c(x_1^t|x_{-\infty}^0, S)} + \Gamma_D(S)\right), \qquad (63)$$

given any sequence $x_{-\infty}^0$ of past symbols. △

It turns out that a slight modification of the context tree weighting procedure yields a method that finds the minimum description length and the model that achieves it. Just like the weighting method, this method combines sequential updating with a good performance. Instead of assigning weighted probabilities to each node, there is a maximized probability connected to each node now. This maximized probability is also defined recursively on the context tree. In addition to the maximized probability, there corresponds a maximizing set to each node. These proper and complete sets minimize the 'description length' in each node (see the proof in appendix D).

Definition 14 : To each node $s \in T_D$, we assign a maximized probability $P_{m,s}^D$, defined as

$$P_{m,s}^D \triangleq \begin{cases} \frac12 \max(P_e(a_s, b_s),\, P_{m,0s}^D \cdot P_{m,1s}^D) & \text{for } 0 \le l(s) < D,\\ P_e(a_s, b_s) & \text{for } l(s) = D, \end{cases} \qquad (64)$$

and a maximizing set $S_{m,s}^D$ which is defined as

$$S_{m,s}^D \triangleq \begin{cases} (S_{m,0s}^D \times 0) \cup (S_{m,1s}^D \times 1) & \text{if } P_{m,0s}^D \cdot P_{m,1s}^D > P_e(a_s, b_s), \text{ for } 0 \le l(s) < D,\\ \{\lambda\} & \text{if } P_{m,0s}^D \cdot P_{m,1s}^D \le P_e(a_s, b_s), \text{ for } 0 \le l(s) < D,\\ \{\lambda\} & \text{for } l(s) = D. \end{cases} \qquad (65)$$

The context tree together with the maximized probability distribution and the maximizing sets is called a maximized context tree. △

In order to define a maximized coding distribution, and a maximizing postfix set, we assume that the counts $(a_s, b_s)$, for all $s \in T_D$, are determined by the source sequence $x_1^t$ seen so far, given the past symbols $x_{-\infty}^0$.

Definition 15 : Let $a_s = a_s(x_1^t|x_{-\infty}^0)$, respectively $b_s(x_1^t|x_{-\infty}^0)$, be the number of zeros, respectively ones, that occur in $x_1^t$ at instants $\tau$, for $\tau = 1, \cdots, t$, such that $s = \sigma(x_{-\infty}^{\tau-1})$, for all $s \in T_D$. The maximized probabilities corresponding to the nodes $s \in T_D$ are now denoted

by $P_{m,s}^D(x_1^t|x_{-\infty}^0)$, the maximizing sets by $S_{m,s}^D(x_1^t|x_{-\infty}^0)$. For any sequence $x_{-\infty}^0$ of past symbols, we define the maximized coding distribution as

$$P_m(x_1^t|x_{-\infty}^0) \triangleq P_{m,\lambda}^D(x_1^t|x_{-\infty}^0), \qquad (66)$$

and the maximizing postfix set as

$$S_m(x_1^t|x_{-\infty}^0) \triangleq S_{m,\lambda}^D(x_1^t|x_{-\infty}^0), \qquad (67)$$

for all $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$. △

Note that the definition of maximizing sets guarantees their properness and completeness, hence also that of the maximizing postfix set $S_m(x_1^t|x_{-\infty}^0)$. The maximized coding distribution $P_m(\cdot|x_{-\infty}^0)$ does not satisfy the restrictions (5), and therefore it is not a coding distribution in this sense. It does satisfy the weaker condition $\sum_{x_1^T\in\{0,1\}^T} P_m(x_1^T|x_{-\infty}^0) \le \sum_{x_1^T\in\{0,1\}^T} P_c(x_1^T|x_{-\infty}^0) = 1$ however, and therefore it is possible to design a code based on this distribution. This explains why we call $P_m(\cdot|x_{-\infty}^0)$ a coding distribution. Observe that the complexity properties of the sequential context tree maximizing method are similar to those of the weighting method. Condition (43) in lemma 2 also holds for maximized probabilities. Consequently the sequential context tree maximizing method has a storage complexity and computational complexity which is not larger than linear in $T$, when we do not process nodes that are descendants of a node $s$ with counts $a_s + b_s = 0$. We are now ready to state the theorem that justifies proper performance of the context tree maximizing procedure, in the sense that it produces the minimum description length model belonging to the observed data. The proof of this theorem can be found in appendix D.

Theorem 7 : For any sequence $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, the context tree maximizing method yields

$$
\log\frac{1}{P_m(x_1^t|x_{-\infty}^0)} = \log\frac{1}{P_c(x_1^t|x_{-\infty}^0, \mathcal{S}_m(x_1^t|x_{-\infty}^0))} + \Gamma_D(\mathcal{S}_m(x_1^t|x_{-\infty}^0)) = \Lambda_D(x_1^t|x_{-\infty}^0), \qquad (68)
$$

given any sequence $x_{-\infty}^0$ of past symbols.

To get an impression of the maximizing procedure we give an example.

Example 8 : Consider again the sequence $x_1^T = 0110100$, with past symbols $x_{-\infty}^0 = \cdots 010$. For $D = 3$ we have plotted the maximized context tree $\mathcal{T}_D$ in figure 6. Node $s$ contains the counts $(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0))$, the Krichevsky-Trofimov estimate $P_e(a_s,b_s)$, the maximized probability $P^D_{m,s}(x_1^T|x_{-\infty}^0)$, and the maximizing set $\mathcal{S}^D_{m,s}(x_1^T|x_{-\infty}^0)$. The maximized coding probability for this sequence is $5/4096$. The maximizing postfix set is $\{\lambda\}$, which is not unusual for $T = 7$. △

Now that we know that the context tree maximizing procedure yields the minimum description length model, it is still a question whether this is a useful technique.


Figure 6: Maximized context tree $\mathcal{T}_3$ for $x_1^T = 0110100$ and $x_{-\infty}^0 = \cdots 010$.

We have seen before, in corollary 2, that the codeword length achieved by the weighting method is smaller than the description length corresponding to any model $\mathcal{S} \subset \mathcal{T}_D$. So from the perspective of compression we should not select the model first using the maximizing method, specify it to the decoder, and then use the techniques from section 7 for coding with a known model and unknown parameters. What then is the advantage of this two-pass procedure? A reason to choose this method could be that the decoder complexity is lower than that of the weighting method. The decoder only needs to keep track of the counts $(a_s, b_s)$ for $s \in \mathcal{S}$ instead of for all $s \in \mathcal{T}_D$. Moreover, the computation of the coding distribution can be done as in (34), which is much simpler than the tree-recursive weighting method described in (44). Therefore the two-pass method achieves a decoder complexity, both in storage and in number of computations, that is lower than before, and this can be important in some applications.

In addition to this, we mention that it is possible to select a model not sequentially, but recursively on the context tree. This has the advantage that it can decrease the storage complexity enormously, at the expense of an increased computational complexity of course. Furthermore it is possible to modify this recursive model selection procedure, so that it does not have to find the model that achieves the minimum description length, but a model with a description length which is not much larger. This again decreases the (computational) complexity. In subsection 9.3 we deal with recursive model selection.

9.2 Model estimation

It is known from the literature (see Rissanen [18], Barron [2], and Barron and Cover [3]) that minimum description length models can be good estimators of the actual model. We will show here that our minimum description length estimator, the context tree maximizing procedure, produces the actual model for all sufficiently large T with probability one, under certain (natural) conditions on the actual source. We discuss these conditions first.

Definition 16 : A postfix set $\mathcal{S}$ is said to be closed if the generator of each postfix $s \in \mathcal{S}$ belongs to $\mathcal{S}$ or is a postfix of an $s \in \mathcal{S}$. The generator of a postfix $s = b_{1-l}\cdots b_{-1}b_0$ is $b_{1-l}\cdots b_{-1}$. The closure $\mathcal{S}^*$ of a postfix set $\mathcal{S}$ is the smallest closed postfix set that contains $\mathcal{S}$. To the closure $\mathcal{S}^*$ of a postfix set $\mathcal{S}$ we can associate a finite state machine defined source (see Rissanen [17]), since every postfix $s \in \mathcal{S}^*$ followed by a binary symbol must have a postfix in $\mathcal{S}^*$. To each postfix $s' \in \mathcal{S}^*$ there corresponds a state; its parameter $\theta_{s'}$ is equal to $\theta_s$ if $s \in \mathcal{S}$ is a postfix of $s'$. We say that a source $(\mathcal{S}, \Theta_{\mathcal{S}})$ is irreducible if and only if the states of the finite state machine defined source associated with the closure $\mathcal{S}^*$ of its postfix set $\mathcal{S}$, with parameters determined by the parameter vector $\Theta_{\mathcal{S}}$, form an irreducible set, i.e. if from each state in this set we can reach any other state in the set in one or more transitions (see Gallager [7], p. 65). A source $(\mathcal{S}, \Theta_{\mathcal{S}})$ is minimal if and only if for any pair of postfixes $0s, 1s \in \mathcal{S}$ we have that $\theta_{0s} \neq \theta_{1s}$. △

Example 9 : Consider the postfix set $\mathcal{S} = \{00, 010, 110, 1\}$. This set is not closed since the generator 01 of 010 and the generator 11 of 110 are missing. The closure of this set is $\mathcal{S}^* = \{00, 010, 110, 01, 11\}$. Note that there are strings in $\mathcal{S}$ that do not completely specify future statistical behavior. E.g. postfix 1 determines the generation of the next digit, but if this digit is a 0 we are in trouble. Assuming that $\theta_1$ must be 0, in order to avoid the generation of a 0, is inconsistent with the fact that 010 and 110 are relevant (occurring) postfixes. Postfixes in a closed postfix set do not suffer from this phenomenon; they specify the future statistics completely.
The source $(\mathcal{S}, \Theta_{\mathcal{S}})$ with $\mathcal{S} = \{0, 1\}$, $\theta_0 = 0$ and $\theta_1 = 1$ is not irreducible. There are two irreducible sets, $\{0\}$ and $\{1\}$. The source $(\mathcal{S}, \Theta_{\mathcal{S}})$ with $\mathcal{S} = \{0, 1\}$ and $\theta_0 = \theta_1 = \theta$ is not minimal. The source $\mathcal{S} = \{\lambda\}$ with $\theta_\lambda = \theta$ has identical behavior, and 'costs' less. △

Now we can state the theorem that informs us about the quality of the context tree maximizing estimator.

Theorem 8 : For an irreducible and minimal source $(\mathcal{S}, \Theta_{\mathcal{S}})$ we have that

$$
\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0) \rightarrow \mathcal{S}, \qquad (69)
$$

for any sequence $x_{-\infty}^0$ of past symbols, for all sufficiently large T, with probability one. The underlining indicates that $\underline{x}^T$ is a random variable. The proof of this theorem can be found in appendix E.

It is clear that the conditions are not too restrictive. If a source is not irreducible, there is either more than one irreducible set, or there are transient states. If there is more than one irreducible set, the estimator will find the model corresponding to the set in which the actual sequence moves. If there are transient states, the estimator will not discover them. It is also easy to see that a source which is not minimal will not be estimated. Although there exist other model estimators (for an overview see Kieffer [11]), the context tree maximizing method can be regarded as a first efficient method that converges, for all T large enough, with probability one to the actual model.
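To get a feeling for theorem 8, the following self-contained sketch draws data from a small irreducible and minimal source and applies the maximizing recursion of definitions 14 and 15 to prefixes of growing length. The source, its parameter values, and all helper names are illustrative assumptions of ours; the computation is done in the log domain to handle long sequences. For large T the printed set should typically coincide with the postfix set of the source.

```python
import math, random

D = 3
SOURCE = {'00': 0.1, '10': 0.3, '1': 0.8}     # postfix set with theta_s = P(next = 1 | s)

def theta(ctx):
    """Parameter of the postfix in SOURCE matching ctx (rightmost symbol = most recent)."""
    for s, th in SOURCE.items():
        if ctx.endswith(s):
            return th
    raise ValueError('no matching postfix')

def log_kt(a, b):
    """log2 of the Krichevsky-Trofimov estimate P_e(a, b)."""
    v = sum(math.log2(i + 0.5) for i in range(a))
    v += sum(math.log2(j + 0.5) for j in range(b))
    return v - sum(math.log2(k) for k in range(1, a + b + 1))

def counts(x, past):
    """Counts (a_s, b_s) for every context s with l(s) <= D, as in definition 15."""
    c, full = {}, past + x
    for t in range(len(past), len(full)):
        sym = full[t]
        for l in range(D + 1):
            s = full[t - l:t]
            a, b = c.get(s, (0, 0))
            c[s] = (a + 1, b) if sym == '0' else (a, b + 1)
    return c

def maximize(s, c):
    """Return (log2 P_m,s , S_m,s) following (64) and (65)."""
    a, b = c.get(s, (0, 0))
    if len(s) == D:
        return log_kt(a, b), {''}
    l0, s0 = maximize('0' + s, c)
    l1, s1 = maximize('1' + s, c)
    if l0 + l1 > log_kt(a, b):
        return -1.0 + l0 + l1, {u + '0' for u in s0} | {u + '1' for u in s1}
    return -1.0 + log_kt(a, b), {''}

random.seed(1)
symbols = list('010')                          # the past symbols, then the generated ones
for T in (100, 1000, 10000):
    while len(symbols) - 3 < T:
        ctx = ''.join(symbols[-3:])            # three symbols suffice to find the state here
        symbols.append('1' if random.random() < theta(ctx) else '0')
    x = ''.join(symbols[3:])
    print(T, sorted(maximize('', counts(x, '010'))[1]))
```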

9.3 Recursive model selection

In the previous subsection we have discussed the sequential context tree maximizing method. To process the sequence $x_1^T$ sequentially, we update counts, estimated probabilities, maximized probabilities, and maximizing sets, for each symbol of the sequence $x_1^T$. After having processed the entire sequence we can find the maximized coding probability and the maximizing postfix set in the root node of the context tree.

It is also possible to find this probability and set using a method that works recursively (we will not discuss methods that are partly sequential and partly recursive) on the context tree $\mathcal{T}_D$. We start with the root node. When we process a node $s$ we first collect from $x_1^T$ the counts $a_s$ and $b_s$ and compute from these the estimated probability $P_e(a_s, b_s)$. Only when it is necessary do we process the nodes $0s$ and $1s$ to obtain $P^D_{m,0s}\cdot P^D_{m,1s}$, so that we can compute the maximized probability $P^D_{m,s}$ and the maximizing set $\mathcal{S}^D_{m,s}$. The storage complexity of this method is negligible. The computational complexity is proportional to the number of nodes that are processed. This number is essentially equal to the number of nodes encountered in the sequential procedure, if we consider it unnecessary to process the nodes descending from nodes $s$ with $a_s + b_s = 0$. Taking into account that each processed node requires a scan of the entire sequence $x_1^T$, this approach does not seem very attractive. Nevertheless there could be an advantage, if there exist methods to detect that it is unnecessary to process descending nodes. The lemma below provides a tool that leads to such a method. It is proved in appendix F.

Lemma 4 : For a node $s \in \mathcal{T}_D$ with $l(s) < D$ we have
$$
P^D_{m,0s}\cdot P^D_{m,1s} \le \left\{
\begin{array}{ll}
P_e(a_s,0)P_e(0,b_s) & \mbox{if } l(s) = D-1, \\
\frac{1}{4}P_e(a_s,0)P_e(0,b_s) & \mbox{for } 0 \le l(s) < D-1.
\end{array}\right. \qquad (70)
$$

It follows from the lemma that if $b_s = 0$ then $P_e(a_s, b_s) = P_e(a_s, 0) \ge P^D_{m,0s}\cdot P^D_{m,1s}$ and further processing is unnecessary. So if a node contains either only zeros or only ones, processing of its descendants is not needed. Note that this is more general than the observation that the descendants of nodes $s$ with $a_s + b_s = 0$ need not be processed. In general we can compare, for a processed node, the estimated probability $P_e(a_s, b_s)$ to the upper bound on the product of the maximized probabilities of its children as given by lemma 4. If the estimated probability is larger than the bound, we need not process the descendants of the node. It can be useful to replace the terms $P_e(a_s, 0)$ and $P_e(0, b_s)$ in the upper bound by upper bounds again. From the proof of lemma 1 (see appendix A) we know that $\Delta(a, 0) \le \lim_{a\to\infty}\Delta(a, 0) = \sqrt{1/\pi}$, and consequently

$$
P_e(a, 0) \le \sqrt{\frac{1}{\pi a}}, \quad \mbox{for } a > 0. \qquad (71)
$$
This elementary (and very tight) bound can be used to simplify the comparisons. It can be shown that the bounds given in lemma 4 are tight. This is the case when the node $s$ splits up into two nodes such that one has counts $(a_s, 0)$ and the other $(0, b_s)$.
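The bound (71) is easy to check numerically. The small sketch below (helper name is ours; the sequential product form of $P_e(a,0)$ is used to avoid overflow) prints the ratio $P_e(a,0)\big/\sqrt{1/(\pi a)}$, which stays below 1 and approaches 1 as $a$ grows, confirming that the bound is elementary but tight.

```python
import math

def kt_zero(a):
    """P_e(a, 0), via the sequential update P_e(i+1, 0) = P_e(i, 0) * (i + 1/2) / (i + 1)."""
    p = 1.0
    for i in range(a):
        p *= (i + 0.5) / (i + 1)
    return p

for a in (1, 2, 5, 10, 100, 1000, 10000):
    print(a, kt_zero(a) / math.sqrt(1.0 / (math.pi * a)))
```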

9.4 Suboptimal recursive model selection

If we use the method described in the previous subsection, and the computational and storage complexity are still very large, we can try to find a model that has a description length close to the minimum description length but is not necessarily minimal. Therefore we modify the previously discussed method. The following definition describes this modification (for $\alpha = 0$ it defines the optimal recursive method again).

Definition 17 : Let $\alpha \ge 0$. Start in the root node. When we process a node $s \in \mathcal{T}_D$ we collect the counts $a_s$ and $b_s$ and compute the estimated probability $P_e(a_s, b_s)$. Only when
$$
P_e(a_s, b_s) < 2^{-\alpha(a_s+b_s)}\cdot\left\{
\begin{array}{ll}
P_e(a_s,0)P_e(0,b_s) & \mbox{for } l(s) = D-1, \\
\frac{1}{4}P_e(a_s,0)P_e(0,b_s) & \mbox{for } 0 \le l(s) < D-1,
\end{array}\right. \qquad (72)
$$
do we process the nodes $0s$ and $1s$ to obtain $P^D_{m,0s}\cdot P^D_{m,1s}$, so that we can compute the maximized probability $P^D_{m,s}$ and the maximizing set $\mathcal{S}^D_{m,s}$ as in (64) and (65). If (72) is not satisfied, we set $P^D_{m,s} = \frac{1}{2}P_e(a_s, b_s)$ and $\mathcal{S}^D_{m,s} = \{\lambda\}$ and leave all descendants of $s$ unprocessed. △

The performance of this suboptimal recursive tree maximizing method follows from the next theorem. The proof of this theorem can be found in appendix G.

Theorem 9 : Let $\alpha \ge 0$. The suboptimal recursive context tree maximizing method achieves, for any sequence $x_1^T \in \{0,1\}^T$,

$$
\log\frac{1}{P_m(x_1^T|x_{-\infty}^0)} = \log\frac{1}{P_c(x_1^T|x_{-\infty}^0, \mathcal{S}_m(x_1^T|x_{-\infty}^0))} + \Gamma_D(\mathcal{S}_m(x_1^T|x_{-\infty}^0)) \le \Lambda_D(x_1^T|x_{-\infty}^0) + \alpha T, \qquad (73)
$$
given any sequence $x_{-\infty}^0$ of past symbols.

It is clear that the number of processed nodes decreases if we increase $\alpha$. Taking $\alpha = 0$ yields optimal recursive model selection but also maximal complexity. Increasing $\alpha$ to 1 results in a possibly poor model $\{\lambda\}$, but also leads to minimal complexity, i.e. we only have to process the root node. This follows from the inequality $P_e(a, b) \ge 2^{-(a+b)}P_e(a, 0)P_e(0, b)$, which can be proved by induction on $a + b$. It is hard to find explicit relations between the number of processed nodes and the parameter $\alpha$.
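The following sketch is one possible reading of definition 17: counts are collected by a fresh scan of the sequence for every processed node, as described in subsection 9.3, and the test (72) decides whether the children are visited. All names are our own, the computation is done in the log domain, and with $\alpha = 0$ the procedure reduces to the optimal recursive method; by theorem 9 the resulting description length exceeds the minimum by at most $\alpha T$ bits.

```python
import math

def log_kt(a, b):
    """log2 of the Krichevsky-Trofimov estimate P_e(a, b)."""
    v = sum(math.log2(i + 0.5) for i in range(a))
    v += sum(math.log2(j + 0.5) for j in range(b))
    return v - sum(math.log2(k) for k in range(1, a + b + 1))

def node_counts(s, x, past):
    """One scan of the sequence: the counts (a_s, b_s) for the single context s."""
    full, a, b = past + x, 0, 0
    for t in range(len(past), len(full)):
        if t >= len(s) and full[t - len(s):t] == s:
            if full[t] == '0':
                a += 1
            else:
                b += 1
    return a, b

def select(s, x, past, D, alpha):
    """Suboptimal recursive maximizing; returns (log2 P_m,s , S_m,s relative to s)."""
    a, b = node_counts(s, x, past)
    lkt = log_kt(a, b)
    if len(s) == D:
        return lkt, {''}
    slack = 0.0 if len(s) == D - 1 else -2.0      # log2 of the factor 1 or 1/4 in (72)
    if lkt >= -alpha * (a + b) + slack + log_kt(a, 0) + log_kt(0, b):
        return -1.0 + lkt, {''}                   # (72) not satisfied: boundary node
    l0, s0 = select('0' + s, x, past, D, alpha)
    l1, s1 = select('1' + s, x, past, D, alpha)
    if l0 + l1 > lkt:                             # combine the children as in (64)-(65)
        return -1.0 + l0 + l1, {u + '0' for u in s0} | {u + '1' for u in s1}
    return -1.0 + lkt, {''}
```

Increasing `alpha` makes the stopping test easier to pass and therefore prunes more of the tree, at the cost of a possibly larger description length.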

10 Lower bounds on the (maximum) redundancy

10.1 A general bound

Consider a code for the sequences $x_1^T \in \{0,1\}^T$ generated by a source $(\mathcal{S}, \Theta_{\mathcal{S}})$. If this code is uniquely decodable, then its codeword lengths satisfy the Kraft inequality (see Cover and Thomas [4], p. 90), i.e.

$$
\sum_{x_1^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} 2^{-L(x_1^T|x_{-\infty}^0)} \le 1, \quad \mbox{for any } x_{-\infty}^0. \qquad (74)
$$

This code induces a normalized dyadic distribution over the sequences $x_1^T \in \{0,1\}^T$ in the following way:

$$
P_d(x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} \left\{
\begin{array}{ll}
\displaystyle\frac{2^{-L(x_1^T|x_{-\infty}^0)}}{\sum_{x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0} 2^{-L(x_1^T|x_{-\infty}^0)}} & \mbox{if } P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0, \\
0 & \mbox{if } P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) = 0.
\end{array}\right. \qquad (75)
$$

Using (74) and (75) we can write, for the redundancy of any $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0$, given any $x_{-\infty}^0$, that

$$
\begin{array}{rcl}
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) &=& L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})} \\
&=& \log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} + \log\frac{1}{\sum_{x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0} 2^{-L(x_1^T|x_{-\infty}^0)}} \\
&\ge& \log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)}, \quad \mbox{for } P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0. \qquad (76)
\end{array}
$$

This implies that a lower bound on $\log(P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})/P_d(x_1^T|x_{-\infty}^0))$ is also a lower bound on the redundancy $\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$. This will be used several times in the following subsections.
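The inequality (76) can be illustrated with a toy example. In the sketch below all choices (a memoryless source and mismatched Shannon code lengths) are our own assumptions: we build codeword lengths that satisfy the Kraft inequality, form the induced normalized dyadic distribution, and check that $\log(P_a/P_d)$ never exceeds the individual redundancy.

```python
import math
from itertools import product

theta, theta_code, T = 0.3, 0.6, 3               # toy source and a mismatched code
seqs = [''.join(bits) for bits in product('01', repeat=T)]
Pa = {x: (1 - theta) ** x.count('0') * theta ** x.count('1') for x in seqs}
Q  = {x: (1 - theta_code) ** x.count('0') * theta_code ** x.count('1') for x in seqs}

L = {x: math.ceil(-math.log2(Q[x])) for x in seqs}        # Shannon lengths for Q
assert sum(2.0 ** -L[x] for x in seqs) <= 1.0 + 1e-12     # Kraft inequality (74)

norm = sum(2.0 ** -L[x] for x in seqs if Pa[x] > 0)
Pd = {x: 2.0 ** -L[x] / norm for x in seqs}               # normalized dyadic distribution (75)

for x in seqs:
    rho = L[x] - math.log2(1.0 / Pa[x])                   # individual redundancy
    assert rho + 1e-9 >= math.log2(Pa[x] / Pd[x])         # the lower bound (76)
print('lower bound (76) verified on all', len(seqs), 'sequences')
```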

10.2 Known source

The maximum individual redundancy over all sequences $x_1^T \in \{0,1\}^T$ is easy to bound from below.

Theorem 10 : For a source $(\mathcal{S}, \Theta_{\mathcal{S}})$ and any uniquely decodable code

$$
\max_{x_1^T \in \{0,1\}^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} \rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \ge 0, \qquad (77)
$$

for any sequence $x_{-\infty}^0$ of past symbols.

Proof : Using (76) we find that

$$
\max_{x_1^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} \rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \;\ge\; \max_{x_1^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} \log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} \;\ge\; 0, \qquad (78)
$$

for any $x_{-\infty}^0$. □

If we compare this with the upper bound on the redundancy obtained in theorem 2, we see that there is a gap of less than 2 bits.

10.3 Sources with known model and unknown parameters

Definition 18 : We define for all $x_1^T \in \{0,1\}^T$
$$
d(x_1^T|x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \sum_{s\in\mathcal{S}: a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\frac{\pi}{2}\right), \qquad (79)
$$

for a given model $\mathcal{S}$ and any sequence $x_{-\infty}^0$ of past symbols, where $a_s = a_s(x_1^T|x_{-\infty}^0)$ and $b_s = b_s(x_1^T|x_{-\infty}^0)$. △

Theorem 11 : For a source with model $\mathcal{S} \subset \mathcal{T}_D$, and any uniquely decodable code,

$$
\max_{\Theta_{\mathcal{S}}\in[0,1]^{|\mathcal{S}|},\; x_1^T\in\{0,1\}^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S})\right) \ge 0, \qquad (80)
$$

for any sequence $x_{-\infty}^0$ of past symbols.

Proof : Things are a little bit more complicated now. We slightly modify the maximum likelihood approach that was suggested by Shtarkov [22]. Instead of maximizing the probability of the sequence $x_1^T$ given the past $x_{-\infty}^0$ over all parameters in the vector $\Theta_{\mathcal{S}}$, we maximize a biased probability. We obtain a biased maximum likelihood code by defining the distribution

$$
P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \frac{M(x_1^T|x_{-\infty}^0, \mathcal{S})}{M(x_{-\infty}^0, \mathcal{S})}, \quad\mbox{where}\quad M(x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \sum_{x_1^T} M(x_1^T|x_{-\infty}^0, \mathcal{S}) \quad\mbox{and}\quad M(x_1^T|x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \max_{\Theta_{\mathcal{S}}} \frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0, \mathcal{S})}}. \qquad (81)
$$

Now consider any uniquely decodable code. Then for any $x_{-\infty}^0$ we obtain for the maximum biased individual redundancy
$$
\begin{array}{l}
\displaystyle\max_{\Theta_{\mathcal{S}},\,x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S})\right) \\
\displaystyle\quad\ge \max_{\Theta_{\mathcal{S}},\,x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} - d(x_1^T|x_{-\infty}^0, \mathcal{S})\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log\frac{\max_{\Theta_{\mathcal{S}}} P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})/2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}}{P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S})} + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S})}{P_d(x_1^T|x_{-\infty}^0)}\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log M(x_{-\infty}^0, \mathcal{S}) + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S})}{P_d(x_1^T|x_{-\infty}^0)}\right) \;\ge\; \log M(x_{-\infty}^0, \mathcal{S}). \qquad (82)
\end{array}
$$

The first inequality holds since the maximum of the actual probabilities over all $\Theta_{\mathcal{S}}$, given $x_{-\infty}^0$, is certainly not 0. We are ready with the proof if we can show that $M(x_{-\infty}^0, \mathcal{S}) \ge 1$:
$$
\begin{array}{rcl}
M(x_{-\infty}^0, \mathcal{S}) &=& \displaystyle\sum_{x_1^T}\max_{\Theta_{\mathcal{S}}}\frac{P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}} \\
&=& \displaystyle\sum_{x_1^T}\frac{\prod_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{a_s}{a_s+b_s}\right)^{a_s}\left(\frac{b_s}{a_s+b_s}\right)^{b_s}}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}} \\
&\ge& \displaystyle\sum_{x_1^T}\frac{\prod_{s\in\mathcal{S}:a_s+b_s>0}\sqrt{\frac{\pi}{2}}\,P_e(a_s,b_s)\sqrt{a_s+b_s}}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}} \\
&=& \displaystyle\sum_{x_1^T}\prod_{s\in\mathcal{S}}P_e(a_s,b_s) \\
&=& \displaystyle\sum_{x_1^T}P_c(x_1^T|x_{-\infty}^0,\mathcal{S}) = 1. \qquad (83)
\end{array}
$$
It will be clear that in this derivation the first inequality follows from lemma 1. □

The theorem says that, no matter what code is used, there is always a sequence $x_1^T$ and a parameter vector $\Theta_{\mathcal{S}}$ such that
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \ge \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\frac{\pi}{2}\right), \qquad (84)
$$
given $x_{-\infty}^0$ and model $\mathcal{S}$. From inspection of the proof of theorem 4 it follows that the method described there achieves, for all $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}}) > 0$,
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) < \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + 1\right) + 2. \qquad (85)
$$
We observe a gap of 2 bits (coding redundancy) and, in addition to that, a gap of $\frac{1}{2}\log\frac{8}{\pi} = 0.674$ bits per parameter. Asymptotic considerations can make these last gaps smaller, but we prefer absolute bounds here.
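The key step of the proof, $M(x_{-\infty}^0, \mathcal{S}) \ge 1$, can be checked numerically for the simplest model $\mathcal{S} = \{\lambda\}$ (a memoryless source). The sketch below, with our own helper names, sums the biased maximized likelihoods of (81) over all sequences of length T; every printed value should be at least 1, in line with (83).

```python
import math
from itertools import product

def biased_ml_mass(T):
    """Sum over x in {0,1}^T of max_theta P_a(x|theta) / 2^{d(x|S)} for S = {lambda}."""
    total = 0.0
    for bits in product('01', repeat=T):
        a, b = bits.count('0'), bits.count('1')
        ml = (a / T) ** a * (b / T) ** b                        # maximum likelihood; 0**0 == 1
        d = 0.5 * math.log2(T) + 0.5 * math.log2(math.pi / 2)   # (79) with one active state
        total += ml / 2 ** d
    return total

for T in (1, 2, 4, 8, 12):
    print(T, biased_ml_mass(T))      # each value should be >= 1
```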

10.4 Sources with unknown model and parameters

A lower bound on the maximum redundancy can only be given for asymptotically large T, i.e. for sequences of codes.

Theorem 12 : Any sequence of uniquely decodable codes, one for each T, satisfies

$$
\liminf_{T\to\infty}\ \max_{\mathcal{S}\subset\mathcal{T}_D,\,\Theta_{\mathcal{S}}\in[0,1]^{|\mathcal{S}|},\,x_1^T\in\{0,1\}^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \ge 0, \qquad (86)
$$
for any sequence of past symbols $x_{-\infty}^0$.

Proof : We redefine the biased maximum likelihood code as follows:

$$
P_{mp}(x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} \frac{N(x_1^T|x_{-\infty}^0)}{N(x_{-\infty}^0)}, \quad\mbox{where}\quad N(x_{-\infty}^0) \stackrel{\Delta}{=} \sum_{x_1^T}N(x_1^T|x_{-\infty}^0) \quad\mbox{and}\quad N(x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} \max_{\mathcal{S},\Theta_{\mathcal{S}}}\frac{P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})+\Gamma_D(\mathcal{S})}}. \qquad (87)
$$
Now we obtain, along the same lines as (82), that

$$
\begin{array}{l}
\displaystyle\max_{\mathcal{S},\Theta_{\mathcal{S}},x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \\
\displaystyle\quad\ge \max_{\mathcal{S},\Theta_{\mathcal{S}},x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log\frac{\max_{\mathcal{S},\Theta_{\mathcal{S}}} P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})/2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})+\Gamma_D(\mathcal{S})}}{P_{mp}(x_1^T|x_{-\infty}^0)} + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0)}{P_d(x_1^T|x_{-\infty}^0)}\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log N(x_{-\infty}^0) + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0)}{P_d(x_1^T|x_{-\infty}^0)}\right) \;\ge\; \log N(x_{-\infty}^0), \qquad (88)
\end{array}
$$

for any $x_{-\infty}^0$, and any uniquely decodable code. For $N(x_{-\infty}^0)$ we can now write

$$
\begin{array}{rcl}
N(x_{-\infty}^0) &=& \displaystyle\sum_{x_1^T}\max_{\mathcal{S},\Theta_{\mathcal{S}}}\frac{P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})+\Gamma_D(\mathcal{S})}} \\
&=& \displaystyle\sum_{x_1^T}\max_{\mathcal{S}}\frac{\max_{\Theta_{\mathcal{S}}}P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})/2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}}{2^{\Gamma_D(\mathcal{S})}} \\
&\ge& \displaystyle\sum_{x_1^T}\max_{\mathcal{S}}\frac{P_c(x_1^T|x_{-\infty}^0,\mathcal{S})}{2^{\Gamma_D(\mathcal{S})}}, \quad\mbox{for any } x_{-\infty}^0. \qquad (89)
\end{array}
$$
Now we define the joint distribution

$$
P_c(\mathcal{S}, x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} P_c(\mathcal{S})\cdot P_c(x_1^T|x_{-\infty}^0, \mathcal{S}), \quad\mbox{where}\quad P_c(\mathcal{S}) \stackrel{\Delta}{=} 2^{-\Gamma_D(\mathcal{S})}, \qquad (90)
$$

with marginals and conditional distributions derived from it, denoted by $P_c(\cdot)$. Then

$$
\begin{array}{rcl}
\log N(x_{-\infty}^0) &\ge& \displaystyle\log\sum_{x_1^T}\max_{\mathcal{S}}\frac{P_c(x_1^T|x_{-\infty}^0,\mathcal{S})}{2^{\Gamma_D(\mathcal{S})}} \\
&=& \displaystyle\log\sum_{x_1^T}\max_{\mathcal{S}}P_c(\mathcal{S}, x_1^T|x_{-\infty}^0) \\
&\ge& \displaystyle\log\sum_{x_1^T}\sum_{\mathcal{S}}P_c(\mathcal{S}|x_1^T, x_{-\infty}^0)\cdot P_c(\mathcal{S}, x_1^T|x_{-\infty}^0) \\
&\ge& \displaystyle\sum_{\mathcal{S},x_1^T}P_c(\mathcal{S}, x_1^T|x_{-\infty}^0)\log P_c(\mathcal{S}|x_1^T, x_{-\infty}^0) \\
&=& -H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0), \quad\mbox{for any } x_{-\infty}^0. \qquad (91)
\end{array}
$$

For this conditional entropy we can write, for any $x_{-\infty}^0$, using Fano's inequality, that
$$
\begin{array}{rcl}
0 \le H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0) &=& H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0, \mathcal{S}_m(\underline{x}^T|x_{-\infty}^0)) \\
&\le& H(\mathcal{S}|\mathcal{S}_m, x_{-\infty}^0) \\
&\le& h(P_c(\mathcal{S}_m \ne \mathcal{S}|x_{-\infty}^0)) + P_c(\mathcal{S}_m \ne \mathcal{S}|x_{-\infty}^0)\cdot\log(|\mathcal{T}_D|-1), \qquad (92)
\end{array}
$$
where $h(p) \stackrel{\Delta}{=} -p\log(p) - (1-p)\log(1-p)$ for $p\in[0,1]$ is the binary entropy function. The convergence with probability one in theorem 8 of $\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0)$ to $\mathcal{S}$ for irreducible and minimal FSMX sources $(\mathcal{S},\Theta_{\mathcal{S}})$ implies convergence in probability, i.e.

$$
\lim_{T\to\infty}P_a(\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0) \ne \mathcal{S}\,|\,x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) = 0, \quad\mbox{for any } x_{-\infty}^0. \qquad (93)
$$

Taking into account that $P_c(x_1^T|x_{-\infty}^0, \mathcal{S})$ is a Dirichlet weighting of $P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$ (see the last part of subsection 7.1), and that the set of non-irreducible or non-minimal sources has measure 0, we may conclude that

$$
\lim_{T\to\infty}P_c(\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0) \ne \mathcal{S}\,|\,x_{-\infty}^0, \mathcal{S}) = 0, \quad\mbox{for all } \mathcal{S}\subset\mathcal{T}_D \mbox{ and } x_{-\infty}^0. \qquad (94)
$$

If we average over all models S ⊂ TD we finally obtain

$$
\lim_{T\to\infty}P_c(\mathcal{S}_m \ne \mathcal{S}\,|\,x_{-\infty}^0) = 0, \quad\mbox{for all } x_{-\infty}^0. \qquad (95)
$$
Substitution of (95) in (92) yields

$$
\lim_{T\to\infty}H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0) = 0, \quad\mbox{for all } x_{-\infty}^0. \qquad (96)
$$
Now we are finished if we note that, for any sequence of uniquely decodable codes,

$$
\begin{array}{l}
\displaystyle\liminf_{T\to\infty}\ \max_{\mathcal{S},\Theta_{\mathcal{S}},x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \\
\displaystyle\quad\ge \liminf_{T\to\infty}\left(-H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0)\right) = \lim_{T\to\infty}\left(-H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0)\right) = 0, \quad\mbox{for all } x_{-\infty}^0. \qquad (97)
\end{array}
$$

□

Theorem 12 says that, for asymptotically large T, no matter what code is used, there is always a sequence $x_1^T$ and a source $(\mathcal{S}, \Theta_{\mathcal{S}})$ such that
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \ge \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\frac{\pi}{2}\right) + \Gamma_D(\mathcal{S}), \qquad (98)
$$

given any $x_{-\infty}^0$. From inspection of the proof of theorem 5 it follows that, for asymptotically large T, the weighting method achieves, for all sources $(\mathcal{S}, \Theta_{\mathcal{S}})$ and sequences $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0$,
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) < \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\pi\right) + \Gamma_D(\mathcal{S}) + 2, \qquad (99)
$$

for any $x_{-\infty}^0$. We have used here that, asymptotically for large $a + b$, the redundancy per parameter is not larger than $\frac{1}{2}\log(\pi(a+b))$, which can be derived as in the proof of lemma 1. Now we observe a gap of 2 bits (coding redundancy) and, in addition to that, a gap of half a bit per parameter.

11 Conclusion

We have investigated the context tree weighting procedure, which was presented recently at the IEEE International Symposium on Information Theory in San Antonio, Texas (see [24]). Compression and complexity of this new technique appear to be as desired. Analysis is simple. Moreover, a modification of this method, the context tree maximizing procedure, leads to a model estimator that converges with probability one, for all large enough sequence lengths, to the actual model.

Much more can be written about recursive weighting methods, and further results will be given in [25]. We mention a modification that makes it unnecessary to have access to $x_{-\infty}^0$, and an implementation that realizes infinite context tree depth D, which is still linear in T. This implementation makes it possible to guarantee that the weighted method achieves entropy for any stationary and ergodic source. It turns out that recursive weighting techniques exist for model classes that are much more general than the FSMX model class. Finally we remark that the methods that are described here for binary data can easily be extended to the non-binary case.

12 Acknowledgement

This research was carried out while the second author visited the Information Theory Group at Eindhoven University. The authors thank the Eindhovens Hogeschoolfonds for supporting this visit.

A Bounds for the Krichevsky-Trofimov estimator

Proof : Define
$$
\Delta(a,b) \stackrel{\Delta}{=} \frac{P_e(a,b)}{\frac{1}{\sqrt{a+b}}\left(\frac{a}{a+b}\right)^a\left(\frac{b}{a+b}\right)^b}. \qquad (100)
$$
First we assume that $a \ge 1$. Consider
$$
\frac{\Delta(a+1,b)}{\Delta(a,b)} = \frac{a^a(a+1/2)}{(a+1)^{a+1}}\cdot\left(\frac{a+b+1}{a+b}\right)^{a+b+1/2}. \qquad (101)
$$
To analyze (101) we define, for $t\in[1,\infty)$, the functions
$$
f(t) \stackrel{\Delta}{=} \ln\frac{t^t(t+1/2)}{(t+1)^{t+1}} \quad\mbox{and}\quad g(t) \stackrel{\Delta}{=} \ln\left(\frac{t+1}{t}\right)^{t+1/2}. \qquad (102)
$$
The derivatives of these functions are
$$
\frac{df(t)}{dt} = \ln\frac{t}{t+1} + \frac{1}{t+1/2} \quad\mbox{and}\quad \frac{dg(t)}{dt} = \ln\frac{t+1}{t} - \frac{t+1/2}{t(t+1)}. \qquad (103)
$$

Take $\alpha = \frac{1/2}{t+1/2}$ and observe that $0 < \alpha \le 1/3$. Then from
$$
\ln\frac{t}{t+1} = \ln\frac{1-\alpha}{1+\alpha} = -2\left(\alpha + \frac{\alpha^3}{3} + \frac{\alpha^5}{5} + \cdots\right) \le -2\alpha = -\frac{1}{t+1/2}, \qquad (104)
$$

we obtain that $\frac{df(t)}{dt} \le 0$. Therefore
$$
\frac{a^a(a+1/2)}{(a+1)^{a+1}} \ge \lim_{a\to\infty}\frac{a^a(a+1/2)}{(a+1)^{a+1}} = \frac{1}{e}. \qquad (105)
$$
Similarly, from
$$
\ln\frac{t+1}{t} = 2\left(\alpha + \frac{\alpha^3}{3} + \frac{\alpha^5}{5} + \cdots\right) \le 2(\alpha + \alpha^3 + \alpha^5 + \cdots) = \frac{2\alpha}{1-\alpha^2} = \frac{t+1/2}{t(t+1)}, \qquad (106)
$$

we may conclude that $\frac{dg(t)}{dt} \le 0$. This results in
$$
\left(\frac{a+b+1}{a+b}\right)^{a+b+1/2} \ge \lim_{a+b\to\infty}\left(\frac{a+b+1}{a+b}\right)^{a+b+1/2} = e. \qquad (107)
$$
Combining (105) and (107) yields that

$$
\Delta(a+1,b) \ge \Delta(a,b) \quad\mbox{for } a \ge 1. \qquad (108)
$$

Next we investigate the case where $a = 0$. Note that this implies that $b \ge 1$. Consider
$$
\frac{\Delta(1,b)}{\Delta(0,b)} = \frac{1}{2}\cdot\left(\frac{b+1}{b}\right)^{b+1/2}. \qquad (109)
$$

If we again use the fact that $\frac{dg(t)}{dt} \le 0$, we find that
$$
\Delta(1,b) \ge \frac{e}{2}\cdot\Delta(0,b). \qquad (110)
$$
Inequality (108), together with (110), now implies that

$$
\Delta(a+1,b) \ge \Delta(a,b) \quad\mbox{for } a \ge 0. \qquad (111)
$$

Therefore
$$
\Delta(1,0) = \Delta(0,1) \le \Delta(a,b) \le \lim_{a\to\infty,b\to\infty}\Delta(a,b). \qquad (112)
$$
The lemma now follows from the observation that $\Delta(1,0) = \Delta(0,1) = 1/2$, and the fact that

$$
\lim_{a\to\infty,b\to\infty}\Delta(a,b) = \lim_{a\to\infty,b\to\infty}\frac{\frac{(2a)!}{2^{2a}a!}\cdot\frac{(2b)!}{2^{2b}b!}\cdot\frac{1}{(a+b)!}}{\frac{1}{\sqrt{a+b}}\left(\frac{a}{a+b}\right)^a\left(\frac{b}{a+b}\right)^b} = \sqrt{\frac{2}{\pi}}. \qquad (113)
$$
This limit was evaluated using the Stirling approximation. Note that (112) implies that the bounds are tight. □
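The bounds established here, $1/2 \le \Delta(a,b) \le \sqrt{2/\pi}$, are easy to verify numerically. The sketch below uses our own helper names and the sequential product form of the estimator (to avoid overflow) and checks the bounds for all counts up to 60.

```python
import math

def kt(a, b):
    """Krichevsky-Trofimov estimate P_e(a, b), built up with the sequential updates."""
    p = 1.0
    for i in range(a):
        p *= (i + 0.5) / (i + 1)
    for j in range(b):
        p *= (j + 0.5) / (a + j + 1)
    return p

def delta(a, b):
    """Delta(a,b) of (100): P_e(a,b) / ((1/sqrt(a+b)) (a/(a+b))^a (b/(a+b))^b)."""
    n = a + b
    return kt(a, b) * math.sqrt(n) / ((a / n) ** a * (b / n) ** b)

lo, hi = 0.5, math.sqrt(2.0 / math.pi)
for a in range(60):
    for b in range(60):
        if a + b > 0:
            assert lo - 1e-12 <= delta(a, b) <= hi + 1e-12
print('1/2 <= Delta(a,b) <= sqrt(2/pi) holds for all a, b < 60')
```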

B Updating properties of weighted probabilities

Proof : By induction on $l(s)$. Observe that the lemma holds for $l(s) = D$. To see this, note that in that case the weighted probabilities are determined only by the Krichevsky-Trofimov estimator. If $s$ is a postfix of $x_{-\infty}^{t-1}$ then
$$
\begin{array}{rcl}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) &=& P_e(a_s(x_1^{t-1}|x_{-\infty}^0)+1,\, b_s(x_1^{t-1}|x_{-\infty}^0)) + P_e(a_s(x_1^{t-1}|x_{-\infty}^0),\, b_s(x_1^{t-1}|x_{-\infty}^0)+1) \\
&=& P_e(a_s(x_1^{t-1}|x_{-\infty}^0),\, b_s(x_1^{t-1}|x_{-\infty}^0)) \\
&=& P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0). \qquad (114)
\end{array}
$$
If $s$ is not a postfix of $x_{-\infty}^{t-1}$ then
$$
\begin{array}{rcl}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) &=& P_e(a_s(x_1^{t-1}, X_t=0|x_{-\infty}^0),\, b_s(x_1^{t-1}, X_t=0|x_{-\infty}^0)) \\
&=& P_e(a_s(x_1^{t-1}|x_{-\infty}^0),\, b_s(x_1^{t-1}|x_{-\infty}^0)) \\
&=& P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0), \qquad (115)
\end{array}
$$
and, similarly, we can show that $P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) = P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0)$.

Now assume that the lemma holds for $l(s) = d$, $0 < d \le D$. Consider nodes corresponding to strings $s$ with $l(s) = d-1$. First we assume that $s$ is a postfix of $x_{-\infty}^{t-1}$. Then $0s$ is a postfix of $x_{-\infty}^{t-1}$ and $1s$ is not, or vice versa. Let $0s$ be a postfix of $x_{-\infty}^{t-1}$. Then

$$
\begin{array}{l}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0)+1, b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)+1) \\
\qquad+ \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=1|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=1|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}\left(P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + P^D_{w,0s}(x_1^{t-1}, X_t=1|x_{-\infty}^0)\right)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
\quad= P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0). \qquad (116)
\end{array}
$$
The induction hypothesis is used to obtain the second and fourth equality. The proof is analogous when $1s$ is a postfix of $x_{-\infty}^{t-1}$. Secondly, we assume that $s$ is not a postfix of $x_{-\infty}^{t-1}$. Then neither $0s$ nor $1s$ is a postfix of $x_{-\infty}^{t-1}$. Thus
$$
\begin{array}{rcl}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) &=& \frac{1}{2}P_e(a_s(x_1^{t-1}, X_t=0|x_{-\infty}^0), b_s(x_1^{t-1}, X_t=0|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) \\
&=& \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
&=& P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0). \qquad (117)
\end{array}
$$
Similarly we can show that $P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) = P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0)$. □

C Weighting properties of the coding distribution

Proof : To prove (56) we state a hypothesis: for a node $s$ in the weighted context tree $\mathcal{T}_D$ with $l(s) = d$ we have that
$$
P^D_{w,s} = \sum_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us}, b_{us}), \qquad (118)
$$
where the summation is over (complete and proper) postfix sets $\mathcal{U}$.

By induction we show that this hypothesis holds for $0 \le d \le D$. For $d = D$ the hypothesis holds. Next assume that it also holds for $0 < d \le D$. Now consider a node $s$ with $l(s) = d-1$; then
$$
\begin{array}{rcl}
P^D_{w,s} &=& \frac{1}{2}P_e(a_s,b_s) + \frac{1}{2}P^D_{w,0s}\cdot P^D_{w,1s} \\
&=& \frac{1}{2}P_e(a_s,b_s) + \frac{1}{2}\left(\displaystyle\sum_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\prod_{v\in\mathcal{V}}P_e(a_{v0s},b_{v0s})\right)\left(\displaystyle\sum_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\prod_{w\in\mathcal{W}}P_e(a_{w1s},b_{w1s})\right) \\
&=& 2^{-1}P_e(a_s,b_s) + \displaystyle\sum_{\mathcal{V},\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-1-\Gamma_{D-d}(\mathcal{V})-\Gamma_{D-d}(\mathcal{W})}\prod_{u\in\mathcal{V}\times 0\,\cup\,\mathcal{W}\times 1}P_e(a_{us},b_{us}) \\
&=& \displaystyle\sum_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}). \qquad (119)
\end{array}
$$
We have used the induction hypothesis in the second step of the derivation. The conclusion is that the hypothesis also holds for $d-1$, and by induction for all $0 \le d \le D$. The hypothesis for $d = 0$ proves (56).

The proof of (57) is similar. The hypothesis

$$
\sum_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})} = 1, \qquad (120)
$$
where the summation is over complete and proper postfix sets $\mathcal{U}$, holds for $d = D$. Assume that it holds for $0 < d \le D$; then
$$
\begin{array}{rcl}
\displaystyle\sum_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})} &=& 2^{-\Gamma_{D-d+1}(\{\lambda\})} + \displaystyle\sum_{\mathcal{V},\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-1-\Gamma_{D-d}(\mathcal{V})-\Gamma_{D-d}(\mathcal{W})} \\
&=& \frac{1}{2} + \frac{1}{2}\left(\displaystyle\sum_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\right)\cdot\left(\displaystyle\sum_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\right) \\
&=& \frac{1}{2} + \frac{1}{2} = 1, \qquad (121)
\end{array}
$$
and the hypothesis therefore holds for all $0 \le d \le D$, which proves (57). □
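The normalization hypothesis (120) can also be checked by brute force. The sketch below (our own helper names) enumerates all complete and proper binary postfix sets of depth at most D, computes their costs with $\Gamma_D(\mathcal{S}) = |\mathcal{S}| - 1 + |\{s \in \mathcal{S} : l(s) \ne D\}|$ as in (46), and verifies that the weights $2^{-\Gamma_D}$ sum to one.

```python
from fractions import Fraction

def postfix_sets(d):
    """All complete and proper binary postfix sets of depth at most d (sets of strings)."""
    if d == 0:
        return [{''}]
    sets = [{''}]                                     # the root alone, i.e. {lambda}
    for u0 in postfix_sets(d - 1):
        for u1 in postfix_sets(d - 1):                # split the root: combine two subtrees
            sets.append({s + '0' for s in u0} | {s + '1' for s in u1})
    return sets

def gamma(U, D):
    """Cost Gamma_D(U) = |U| - 1 + #{u in U : l(u) != D}."""
    return len(U) - 1 + sum(1 for u in U if len(u) != D)

for D in (1, 2, 3, 4):
    sets = postfix_sets(D)
    total = sum(Fraction(1, 2 ** gamma(U, D)) for U in sets)
    print(D, len(sets), total)       # the last column should always be 1
```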

D Maximizing properties of the coding distribution

Proof : To prove the lemma we state a hypothesis: for a node $s$ in the maximized context tree $\mathcal{T}_D$ with $l(s) = d$ we have that
$$
P^D_{m,s} = 2^{-\Gamma_{D-d}(\mathcal{S}^D_{m,s})}\prod_{u\in\mathcal{S}^D_{m,s}}P_e(a_{us},b_{us}) = \max_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}), \qquad (122)
$$
where the maximum is over (complete and proper) postfix sets $\mathcal{U}$.

By induction we show that this hypothesis holds for $0 \le d \le D$. For $d = D$ the hypothesis holds. Next assume that it also holds for $0 < d \le D$. Consider a node $s$ with $l(s) = d-1$; then
$$
\begin{array}{rcl}
P^D_{m,s} &=& \frac{1}{2}\max\left(P_e(a_s,b_s),\; P^D_{m,0s}\cdot P^D_{m,1s}\right) \\
&=& \frac{1}{2}\max\left(P_e(a_s,b_s),\; \left(\displaystyle\max_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\prod_{v\in\mathcal{V}}P_e(a_{v0s},b_{v0s})\right)\left(\displaystyle\max_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\prod_{w\in\mathcal{W}}P_e(a_{w1s},b_{w1s})\right)\right) \\
&=& \max\left(2^{-1}P_e(a_s,b_s),\; \displaystyle\max_{\mathcal{V},\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-1-\Gamma_{D-d}(\mathcal{V})-\Gamma_{D-d}(\mathcal{W})}\prod_{u\in\mathcal{V}\times 0\,\cup\,\mathcal{W}\times 1}P_e(a_{us},b_{us})\right) \\
&=& \displaystyle\max_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}). \qquad (123)
\end{array}
$$
We have used the induction hypothesis in the second step of the derivation. To see that $\mathcal{S}^D_{m,s}$ 'achieves' $P^D_{m,s}$, observe that $\{\lambda\}$ yields $P^D_{m,s}$ if $P_e(a_s,b_s) \ge P^D_{m,0s}\cdot P^D_{m,1s}$, and that $\mathcal{S}^D_{m,0s}\times 0 \cup \mathcal{S}^D_{m,1s}\times 1$ achieves $P^D_{m,s}$ if $P_e(a_s,b_s) < P^D_{m,0s}\cdot P^D_{m,1s}$. The conclusion is that the hypothesis also holds for $d-1$, and by induction for all $0 \le d \le D$. The hypothesis for $d = 0$ proves the lemma. □

E Convergence of context tree maximizing estimator

Proof : The proof of this theorem consists of several steps.

a) Consider an FSMX source $(\mathcal{S}, \Theta_{\mathcal{S}})$ and its associated finite state machine defined source. For all states $s' \in \mathcal{S}^*$ there is a probability $Q(s|s')$ corresponding to a transition to any other state $s \in \mathcal{S}^*$. These transition probabilities determine a stationary state distribution $Q(s)$, $s \in \mathcal{S}^*$, which is the solution of
$$
\sum_{s'\in\mathcal{S}^*}Q(s')Q(s|s') = Q(s), \mbox{ for all } s \in \mathcal{S}^*, \quad\mbox{and}\quad \sum_{s\in\mathcal{S}^*}Q(s) = 1. \qquad (124)
$$
Note that not all $Q(s)$ for $s \in \mathcal{S}^*$ can be 0. Since all these states $s$ form an irreducible set, this implies that $Q(s) > 0$ for all $s \in \mathcal{S}^*$.

b) The stationary state distribution $Q(s)$, $s \in \mathcal{S}^*$, determines the stationary sequence (and substring) distribution
$$
Q(x_1^t) = \sum_{s\in\mathcal{S}^*}Q(s)P_a(x_1^t|s, \mathcal{S}, \Theta_{\mathcal{S}}), \quad\mbox{for all } x_1^t\in\{0,1\}^t \mbox{ and } t = 0, 1, \cdots. \qquad (125)
$$
This stationary distribution makes the irreducible Markov source stationary and ergodic. Therefore the ergodic theorem (see e.g. Gray and Davisson [8], or Shields [21]) holds.

The ergodic theorem says that $a_s/T$ and $b_s/T$ converge, with probability one for all sufficiently large T, to $Q(s0)$ and $Q(s1)$ respectively, with respect to the stationary probability $Q(\cdot)$. We are interested, however, in convergence with respect to the actual distribution $P_a(\cdot|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$. Since all $Q(s) > 0$ for $s \in \mathcal{S}^*$, we may conclude that
$$
P_a(x_1^t|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) = P_a(x_1^t|s, \mathcal{S}, \Theta_{\mathcal{S}}) \le \frac{1}{Q(s)}Q(x_1^t) \quad\mbox{if } s \in \mathcal{S}^* \mbox{ is a postfix of } x_{-\infty}^0. \qquad (126)
$$
This implies, together with the ergodic theorem, that also $a_s/T$ and $b_s/T$ converge with probability one to $Q(s0)$ and $Q(s1)$ respectively, with respect to the actual probability $P_a(\cdot|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$, given any sequence $x_{-\infty}^0$ of past symbols.

c) Consider a source with actual model $\mathcal{S}$ and an estimated alternative source with model $\mathcal{W} \subset \mathcal{T}_D$, not equal to $\mathcal{S}$. We can now partition the model $\mathcal{S}$ into three parts. The first part consists of postfixes (strings) $s \in \mathcal{S}$ that are correctly estimated by the alternative model; hence for such a string $s$ there is a $w \in \mathcal{W}$ such that $w = s$. The second part consists of strings that are underestimated by $\mathcal{W}$, i.e. for each such $s$ there exists a $w \in \mathcal{W}$ such that $w$ is a postfix of $s$. All strings $s$ that are underestimated by the same $w \in \mathcal{W}$ form a proper subtree $\mathcal{S}|w$, which is complete in the sense that all semi-infinite sequences ending with $w$ must have a postfix in $\mathcal{S}|w$. The third part consists of those strings in $\mathcal{S}$ that are overestimated by $\mathcal{W}$. For such an $s$ there exist strings $w \in \mathcal{W}$ such that $s$ is a postfix of all these $w$'s. All strings $w$ that overestimate $s$ make up the proper and 'complete' subtree $\mathcal{W}|s$.

d) First consider the set of strings $\mathcal{S}|w \subset \mathcal{S}$ that are underestimated by $w \in \mathcal{W}$. Since all $Q(s) > 0$ for $s \in \mathcal{S}^*$, also $Q(s) > 0$ for $s \in \mathcal{S}$, and $Q(w) > 0$. This implies that, with probability one, $a_s + b_s > 0$ for all $s \in \mathcal{S}|w$ and $a_w + b_w > 0$, for all large enough T. We therefore investigate only this case. If the alternative string $w$ is not preferred by the context tree maximizing procedure over the set $\mathcal{S}|w$, then
$$
\log\frac{1}{P_e(a_w,b_w)} + \Gamma_D(\mathcal{W}|w) > \log\frac{1}{\prod_{s\in\mathcal{S}|w}P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|w) \qquad (127)
$$
$$
\Longleftrightarrow\quad \log\frac{\prod_{s\in\mathcal{S}|w}(1-\theta_s)^{a_s}\theta_s^{b_s}}{\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}} + \log\frac{\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}}{P_e(a_w,b_w)} + \Gamma_D(\mathcal{W}|w) > \log\frac{\prod_{s\in\mathcal{S}|w}(1-\theta_s)^{a_s}\theta_s^{b_s}}{\prod_{s\in\mathcal{S}|w}P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|w)
$$
$$
\Longleftarrow\quad \frac{a_w+b_w}{T}h\!\left(\frac{b_w}{a_w+b_w}\right) + \sum_{s\in\mathcal{S}|w}\left(\frac{a_s}{T}\log(1-\theta_s) + \frac{b_s}{T}\log\theta_s\right) + \frac{1}{2T}\log(a_w+b_w) + \frac{1}{2T}\log\frac{\pi}{2} + \frac{\Gamma_D(\mathcal{W}|w)}{T} > \sum_{s\in\mathcal{S}|w}\left(\frac{1}{2T}\log(a_s+b_s) + \frac{1}{T}\right) + \frac{\Gamma_D(\mathcal{S}|w)}{T}. \qquad (128)
$$
The cost $\Gamma_D(\mathcal{S}|w)$ of a postfix set $\mathcal{S}|w$ is defined as $|\mathcal{S}|w| - 1 + |\{s : s \in \mathcal{S}|w,\ l(s) \ne D\}|$, similar to the definition of $\Gamma_D(\mathcal{S})$ in (46). Now consider the two 'entropy'-terms in (128).

For these two terms we obtain, using a), that
$$
\frac{a_w+b_w}{T}h\!\left(\frac{b_w}{a_w+b_w}\right) + \sum_{s\in\mathcal{S}|w}\left(\frac{a_s}{T}\log(1-\theta_s) + \frac{b_s}{T}\log\theta_s\right) \;\rightarrow\; Q(w)\left(h\!\left(\sum_{s\in\mathcal{S}|w}\frac{Q(s)}{Q(w)}\theta_s\right) - \sum_{s\in\mathcal{S}|w}\frac{Q(s)}{Q(w)}h(\theta_s)\right) > 0, \qquad (129)
$$
where the convergence is with probability one, for sufficiently large T. To understand that the difference between the entropies is non-zero, note that $Q(s) > 0$ for all $s \in \mathcal{S}|w$, and (consequently) $Q(w) > 0$. Furthermore observe that not all $\theta_s$ for $s \in \mathcal{S}|w$ can be equal, since the source is minimal. Therefore the entropies cannot be the same. All other terms in (128) converge, with probability one, to 0. Therefore inequality (128) is satisfied with probability one. This implies that also (127) is satisfied for all sufficiently large T, with probability one.

e) Next we consider a string $s \in \mathcal{S}$ which is overestimated by the set of strings $\mathcal{W}|s \subset \mathcal{W}$. Note first that if only one $w \in \mathcal{W}|s$ has probability $Q(w) > 0$, there will be no overestimation. Therefore we consider the case where at least two strings $w$ in $\mathcal{W}|s$ have non-zero probability $Q(w)$. Denote by $\mathcal{W}^+|s$ the set of $w$'s with positive probability. If the context tree maximizing method prefers the string $s$ over the set $\mathcal{W}|s$, then
$$
\log\frac{1}{\prod_{w\in\mathcal{W}|s}P_e(a_w,b_w)} + \Gamma_D(\mathcal{W}|s) > \log\frac{1}{P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|s) \qquad (130)
$$
$$
\Longleftrightarrow\quad \log\frac{\prod_{w\in\mathcal{W}^+|s}\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}}{\prod_{w\in\mathcal{W}^+|s}P_e(a_w,b_w)} + \log\frac{(1-\theta_s)^{a_s}\theta_s^{b_s}}{\prod_{w\in\mathcal{W}^+|s}\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}} + \Gamma_D(\mathcal{W}|s) > \log\frac{(1-\theta_s)^{a_s}\theta_s^{b_s}}{P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|s)
$$
$$
\Longleftarrow\quad -\sum_{w\in\mathcal{W}^+|s}(a_w+b_w)\,d\!\left(\frac{b_w}{a_w+b_w}\Big\|\theta_s\right) + \sum_{w\in\mathcal{W}^+|s}\left(\frac{1}{2}\log(a_w+b_w) + \frac{1}{2}\log\frac{\pi}{2}\right) + \Gamma_D(\mathcal{W}|s) > \frac{1}{2}\log(a_s+b_s) + 1 + \Gamma_D(\mathcal{S}|s). \qquad (131)
$$

Now we consider the 'divergence'-term in (131). Note that
$$
0 \le \sum_{w\in\mathcal{W}^+|s}(a_w+b_w)\,d\!\left(\frac{b_w}{a_w+b_w}\Big\|\theta_s\right) \le \log e\cdot\sum_{w\in\mathcal{W}^+|s}(a_w+b_w)\frac{\left(\frac{b_w}{a_w+b_w}-\theta_s\right)^2}{\theta_s(1-\theta_s)}. \qquad (132)
$$
The last inequality in (132) follows from $d(p\|q) \le \log e\cdot(p-q)^2/(q(1-q))$, which can be regarded as the upper bound counterpart of Pinsker's inequality (see Csiszár and Körner [5], p. 58) for the binary case. Now we apply the 'law of the iterated logarithm' (see Rényi [14], p. 337) to the 'variance'-term in (132). This term is overbounded by $\sum_{w\in\mathcal{W}^+|s}2\ln\ln(a_w+b_w)$ with probability one, for all sufficiently large T. Since there are at least two $w$'s in $\mathcal{W}|s$ with $Q(w) > 0$, the left-hand side of (131) increases essentially at least as fast as $\log T$, while the right-hand side increases like $\frac{1}{2}\log T$. The conclusion is that inequality (131) is satisfied with probability one, which implies that (130) is also satisfied with probability one, for all large enough T.

f) The rest of the proof is straightforward now. Each alternative model contains at least an underestimating string or an overestimating set of strings. With probability one, however, the alternative description length is larger than the description length corresponding to the actual model. Therefore with probability one the context tree maximizing method refuses to choose the alternative model. The number of alternatives is $|\mathcal{T}_D| - 1$, which is finite. Therefore the actual model is chosen by the context tree maximizing method, with probability one, for all sufficiently large T. □

F A condition to avoid deeper visits

Proof : We prove the lemma by induction. For nodes $s$ with $l(s) = D-1$ we have
$$
\begin{array}{rcl}
P^D_{m,0s}\cdot P^D_{m,1s} &=& P_e(a_{0s},b_{0s})P_e(a_{1s},b_{1s}) \\
&\le& P_e(a_{0s},0)P_e(0,b_{0s})P_e(a_{1s},0)P_e(0,b_{1s}) \\
&\le& P_e(a_s,0)P_e(0,b_s). \qquad (133)
\end{array}
$$

For nodes $s$ with $0 \le l(s) < D-1$ we get (by using the induction hypothesis) that
$$
P^D_{m,00s}\cdot P^D_{m,10s} \le P_e(a_{0s},0)P_e(0,b_{0s}). \qquad (134)
$$
If we combine this with the basic inequality
$$
P_e(a_{0s},b_{0s}) \le P_e(a_{0s},0)P_e(0,b_{0s}), \qquad (135)
$$
we obtain that
$$
P^D_{m,0s} \le \frac{1}{2}P_e(a_{0s},0)P_e(0,b_{0s}). \qquad (136)
$$
Similarly, we find
$$
P^D_{m,1s} \le \frac{1}{2}P_e(a_{1s},0)P_e(0,b_{1s}). \qquad (137)
$$
Combining (136) and (137) yields
$$
P^D_{m,0s}\cdot P^D_{m,1s} \le \frac{1}{2}P_e(a_{0s},0)P_e(0,b_{0s})\cdot\frac{1}{2}P_e(a_{1s},0)P_e(0,b_{1s}) \le \frac{1}{4}P_e(a_s,0)P_e(0,b_s), \qquad (138)
$$
and the hypothesis holds for $0 \le l(s) < D-1$. □

G Suboptimal recursive model selection

Proof : The set of nodes $s \in \mathcal{T}_D$ that are processed satisfies two conditions. First, if a node is processed, all its ancestors are processed. Secondly, if a node is processed, the other child of its parent is also processed. If a processed node $s$ does not satisfy condition (72) then, using lemma 4, we find that
$$
\begin{array}{rcl}
P^D_{m,s} &=& \frac{1}{2}P_e(a_s,b_s) \\
&\ge& \frac{1}{2}\,2^{-\alpha(a_s+b_s)}\max\left(P_e(a_s,b_s),\; P^{D,opt}_{m,0s}\cdot P^{D,opt}_{m,1s}\right) \\
&=& 2^{-\alpha(a_s+b_s)}\displaystyle\max_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}), \qquad (139)
\end{array}
$$
where $P^{D,opt}_{m,0s}$ and $P^{D,opt}_{m,1s}$ are the maximized probabilities that would occur in the optimal maximizing method. They are connected with the minimum description length, which explains the last step of (139). The processed nodes for which (72) is not satisfied are called boundary nodes. Processed nodes that are not boundary nodes have a 0-child and a 1-child, which are both processed. We now state the hypothesis: for a processed node $s$ in the suboptimally maximized context tree $\mathcal{T}_D$ with $l(s) = d$ we have that
$$
P^D_{m,s} = 2^{-\Gamma_{D-d}(\mathcal{S}^D_{m,s})}\prod_{u\in\mathcal{S}^D_{m,s}}P_e(a_{us},b_{us}) \ge 2^{-\alpha(a_s+b_s)}\max_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}), \qquad (140)
$$
where the maximum is over (complete and proper) postfix sets $\mathcal{U}$.

By induction we show that this hypothesis holds for $0 \le d \le D$. For $d = D$ the hypothesis clearly holds. Next assume that it also holds for processed nodes with $0 < d \le D$. Now consider a processed node $s$ with $l(s) = d-1$. If $s$ is a boundary node, the hypothesis holds by (139). If $s$ is not a boundary node, it has children that were processed, and for these children the hypothesis (140) holds. Therefore
$$
\begin{array}{rcl}
P^D_{m,s} &=& \frac{1}{2}\max\left(P_e(a_s,b_s),\; P^D_{m,0s}\cdot P^D_{m,1s}\right) \\
&\ge& \frac{1}{2}\max\left(2^{-\alpha(a_s+b_s)}P_e(a_s,b_s),\; 2^{-\alpha(a_{0s}+b_{0s})}\displaystyle\max_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\prod_{v\in\mathcal{V}}P_e(a_{v0s},b_{v0s})\cdot 2^{-\alpha(a_{1s}+b_{1s})}\displaystyle\max_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\prod_{w\in\mathcal{W}}P_e(a_{w1s},b_{w1s})\right) \\
&=& 2^{-\alpha(a_s+b_s)}\displaystyle\max_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}). \qquad (141)
\end{array}
$$
The induction hypothesis is again used in the second step of the derivation. Note that $\mathcal{S}^D_{m,s}$ also 'achieves' $P^D_{m,s}$. The conclusion is that the hypothesis also holds for $d-1$, and by induction for all $0 \le d \le D$. The hypothesis for $d = 0$ proves the lemma. □

References

[1] N. Abramson, Information Theory and Coding, New York: McGraw-Hill, 1963, pp. 61-62.

[2] A.R. Barron, "Logically smooth density estimation," Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, Aug. 1985.

[3] A.R. Barron and T.M. Cover, "Minimum Complexity Density Estimation," IEEE Trans. Inform. Theory, vol. IT-37, pp. 1034-1054, July 1991.

[4] T.M. Cover and J.A. Thomas, Elements of Information Theory, New York: John Wiley, 1991.

[5] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Budapest: Akadémiai Kiadó, 1981.

[6] L.D. Davisson, "Universal Noiseless Source Coding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 783-795, Nov. 1973.

[7] R.G. Gallager, Information Theory and Reliable Communication, New York: Wiley, 1968.

[8] R.M. Gray and L.D. Davisson, Random Processes: A Mathematical Approach for Engineers, Englewood Cliffs: Prentice-Hall, 1986.

[9] D.A. Huffman, "A Method for the Construction of Minimum Redundancy Codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952. Reprinted in Key Papers in the Development of Information Theory, (D. Slepian, Ed.) New York: IEEE Press, 1974, pp. 47-50.

[10] F. Jelinek, Probabilistic Information Theory, New York: McGraw-Hill, 1968, pp. 476-489.

[11] J.C. Kieffer, "Strongly Consistent Code-Based Identification and Order Estimation for Constrained Finite-State Model-Classes," to appear in IEEE Trans. Inform. Theory, vol. IT-39, May 1993.

[12] R.E. Krichevsky and V.K. Trofimov, "The Performance of Universal Encoding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 199-207, March 1981.

[13] R. Pasco, Source Coding Algorithms for Fast Data Compression, Ph.D. thesis, Stanford University, 1976.

[14] A. Rényi, Wahrscheinlichkeitsrechnung mit einem Anhang über Informationstheorie, Berlin: VEB Deutscher Verl. Wissenschaften, 1962.

[15] J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J. Res. Devel., vol. 20, p. 198, 1976.

[16] J. Rissanen, "A Universal Data Compression System," IEEE Trans. Inform. Theory, vol. IT-29, pp. 656-664, September 1983.

[17] J. Rissanen, "Complexity of Strings in the Class of Markov Sources," IEEE Trans. Inform. Theory, vol. IT-32, pp. 526-532, July 1986.

[18] J. Rissanen, Stochastic Complexity in Statistical Inquiry, Teaneck, NJ: World Scientific Publ., 1989.

[19] J. Rissanen and G.G. Langdon, Jr., "Universal Modeling and Coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, January 1981.

[20] C.E. Shannon, "A Mathematical Theory of Communication," Bell Sys. Tech. Journal, vol. 27, pp. 379-423, July 1948. Reprinted in Key Papers in the Development of Information Theory, (D. Slepian, Ed.) New York: IEEE Press, 1974, pp. 5-18.

[21] P.C. Shields, "The Ergodic and Entropy Theorems Revisited," IEEE Trans. Inform. Theory, vol. IT-33, pp. 263-266, March 1987.

[22] Y.M. Shtarkov, "Coding of Discrete Sources with Unknown Statistics," presented at Colloquia Math. Soc. János Bolyai, 16, Topics in Information Theory, Keszthely, Hungary, 1975.

[23] M.J. Weinberger, A. Lempel, and J. Ziv, "A Sequential Algorithm for the Universal Coding of Finite Memory Sources," IEEE Trans. Inform. Theory, vol. IT-38, pp. 1002-1014, May 1992.

[24] F.M.J. Willems, Y.M. Shtarkov and Tj.J. Tjalkens, "Context Tree Weighting: A Sequential Universal Source Coding Procedure for FSMX Sources," IEEE Int. Symp. Inform. Theory, San Antonio, Texas, January 17-22, 1993, p. 59.

[25] F.M.J. Willems, Y.M. Shtarkov and Tj.J. Tjalkens, "The Context Tree Weighting Method: Extensions," in preparation.
