The Context Tree Weighting Method: Basic Properties

Frans M.J. Willems, Yuri M. Shtarkov∗ and Tjalling J. Tjalkens

Eindhoven University of Technology, Electrical Engineering Department, P.O. Box 513, 5600 MB Eindhoven, The Netherlands

Abstract: We describe a new sequential procedure for universal source coding for FSMX sources. Using a context tree, this method recursively weights the coding distributions corresponding to all FSMX models, and realizes an efficient coding distribution for unknown models. Computational and storage complexity of the proposed procedure are linear in the sequence length. We give upper bounds on the cumulative redundancies, but also on the codeword length for individual sequences. With a slight modification, the procedure can be used to determine the FSMX model that gives the largest contribution to the coding probability, i.e. the minimum description length model. This model estimator is shown to converge with probability one to the actual model, for ergodic sources that are not degenerate. A suboptimal method is presented that finds a model with a description length that is not more than a prescribed number of bits larger than the minimum description length. This method makes it possible to choose a desired trade-off between performance and complexity. In the last part of the manuscript, lower bounds on the maximal individual redundancies that can be achieved by arbitrary coding procedures are found. It follows from these bounds that the context tree weighting procedure is almost optimal.

1 Introduction, old and new concepts

Sequential universal source coding procedures for FSMX sources often make use of a context tree which contains for each string (context) the number of zeros and the number of ones that have followed this context, in the source sequence seen so far. The standard approach (see e.g. Rissanen and Langdon [19], Rissanen [16], [17], and Weinberger, Lempel and Ziv [23]) is that, given all past source symbols, one uses this context tree to estimate the actual ‘state’ of the FSMX source. Subsequently this state (context) is used to estimate the distribution that generates the next source symbol. This estimated distribution can be used by procedures (see e.g. Rissanen and Langdon [19]) to encode (and

∗On leave from the Institute for Problems of Information Transmission, Ermolovoystr. 19, 101447, Moscow, GSP-4.

decode) the next source symbol efficiently, i.e. with low complexity and with negligible redundancy. It was shown by Rissanen [16] that the probability of choosing the correct context, i.e. the actual state of the FSMX source, tends to 1 if the length T of the observed sequence tends to infinity. More recently Weinberger, Lempel and Ziv [23] designed a procedure that achieves optimal exponential decay of the error probability in estimating the current state. These authors were also able to demonstrate that their coding procedure achieves asymptotically the lower bound on the average redundancy, as stated by Rissanen ([17], thm. 1). An unpleasant fact about the standard approach is that one has to specify parameters (α and β in Rissanen's procedure [16] or K for the Weinberger, Lempel and Ziv [23] method) that do not affect the asymptotic behavior of the procedure, but may have a big influence on the performance for practical sequence lengths. These artificial parameters are necessary to regulate the state estimation characteristics. This gave the authors the idea that the state estimation concept may not be as natural as we believe. A better starting principle would be to just find a good coding distribution. This, more or less trivial, guideline immediately suggests the application of model weighting techniques. In what follows we will develop a tree-recursive model-weighting procedure, which is very easy to analyze, and has a very good performance, both in realized compression and in complexity.

2 Binary FSMX sources

2.1 Strings

Strings are concatenations of binary symbols, hence $s = b_{1-l} b_{2-l} \cdots b_0$ with $b_{-i} \in \{0,1\}$ for $i = 0, 1, \cdots, l-1$. Note that we index the symbols in the string from right to left, starting with 0 and going negative. For the length of a string $s$ we write $l(s)$. A semi-infinite string $s = \cdots b_{-1} b_0$ has length $l(s) = \infty$. If we have two strings $s' = b'_{1-l'} b'_{2-l'} \cdots b'_0$ and $s = b_{1-l} b_{2-l} \cdots b_0$, then $s's = b'_{1-l'} b'_{2-l'} \cdots b'_0 b_{1-l} b_{2-l} \cdots b_0$ is the concatenation of both. If $S$ is a set of strings, then $S \times b \triangleq \{sb : s \in S\}$, for $b \in \{0,1\}$. We say that a string $s = b_{1-l} b_{2-l} \cdots b_0$ is a postfix of string $s' = b'_{1-l'} b'_{2-l'} \cdots b'_0$ if $l \le l'$ and $b_{-i} = b'_{-i}$ for $i = 0, \cdots, l-1$. The empty string $\lambda$ has length $l(\lambda) = 0$. It is a postfix of all other strings.

2.2 FSMX source definition

A binary FSMX information source generates a sequence $x_{-\infty}^{\infty}$ of digits assuming values in the alphabet $\{0,1\}$. We denote by $x_a^b$ the sequence $x_a x_{a+1} \cdots x_b$, and allow $a$ and $b$ to be infinitely large. The statistical behavior of an FSMX source can be described by means of a postfix set $S$. This postfix set is a collection of binary strings $s(k)$, with $k = 1, 2, \cdots, |S|$. It should be proper and complete. Properness of the postfix set implies that no string in $S$

is a postfix of any other string in $S$. Completeness guarantees that each semi-infinite string $\cdots b_{-2} b_{-1} b_0$ has a postfix that belongs to $S$. The properness and completeness of the postfix set make it possible to define the postfix function $\sigma(\cdot)$. This function maps semi-infinite sequences onto their unique postfix $s$ in $S$. To each postfix $s$ in $S$ there corresponds a parameter $\theta_s$. Together, all parameters form the parameter vector $\Theta_S \triangleq (\theta_{s(1)}, \theta_{s(2)}, \cdots, \theta_{s(|S|)})$. Each parameter assumes a value in $[0,1]$ and specifies a distribution over $\{0,1\}$. If the FSMX source has emitted the semi-infinite sequence $x_{-\infty}^{t-1} \triangleq \cdots x_{t-2} x_{t-1}$ up to now, the postfix function tells us that the parameter for generating the next binary digit of the source is $\theta_s$, where $s = \sigma(x_{-\infty}^{t-1})$, hence

Definition 1 : The actual next-symbol probabilities for an FSMX source with postfix set $S$ and parameter vector $\Theta_S$ are

$$P_a(X_t = 1 | x_{-\infty}^{t-1}, S, \Theta_S) \triangleq 1 - P_a(X_t = 0 | x_{-\infty}^{t-1}, S, \Theta_S) = \theta_{\sigma(x_{-\infty}^{t-1})}, \quad \text{for all } t, \qquad (1)$$

where $x_{-\infty}^{t-1}$ is the semi-infinite sequence of symbols preceding $x_t$. △

All FSMX sources with the same postfix set are said to have the same model. Model and postfix set are equivalent. The model $S$ together with the parameter vector $\Theta_S$ specifies the FSMX source, sometimes denoted by $(S, \Theta_S)$. We will see later that the number of parameters of the FSMX source determines the (asymptotic) redundancy of our coding procedure. The FSMX model class is the set of all FSMX models. It is possible to specify FSMX models in this model class with a natural code. This code uses $2|S| - 1$ binary digits to describe the model $S$. The method that achieves this is simple. We encode the postfix set $S$ recursively. The code of $S$ is the code of the empty string $\lambda$. The code of a string $s$ is 0 if this string is in $S$. If $s$ is not in $S$ then its code is a 1 followed by the codes of the strings $0s$ and $1s$. Observe now that there are $2|S| - 1$ nodes that contribute one bit each to the specification code. FSMX sources were first described by Rissanen [17]. These sources form a subclass of the set of sources defined by finite state machines ('FSM's) and an extension ('X') of the class of finite order Markov sources.

Example 1 : Consider a source with postfix set $S = \{s(1) = 00, s(2) = 10, s(3) = 1\}$ and parameters $\theta_{00} = 0.5$, $\theta_{10} = 0.3$, and $\theta_1 = 0.1$ (see figure 1). The (conditional) probability of the source generating the sequence 0110100 given all past symbols forming the sequence $\cdots 010$ can be calculated as follows:

$$P_a(0110100 | \cdots 010) = P_a(0|\cdots 010)\, P_a(1|\cdots 0100)\, P_a(1|\cdots 01001)\, P_a(0|\cdots 010011)$$
$$\cdot\, P_a(1|\cdots 0100110)\, P_a(0|\cdots 01001101)\, P_a(0|\cdots 010011010)$$
$$= (1-\theta_{10})\,\theta_{00}\,\theta_1\,(1-\theta_1)\,\theta_{10}\,(1-\theta_1)\,(1-\theta_{10}) = 0.7 \cdot 0.5 \cdot 0.1 \cdot 0.9 \cdot 0.3 \cdot 0.9 \cdot 0.7 = 0.0059535. \qquad (2)$$
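As a quick check of (2), the product can be recomputed symbol by symbol. The sketch below is our own illustration (the helper name `actual_prob` and the string-keyed parameter table are not from the text), assuming contexts are written with the most recent symbol last, as in figure 1.

```python
# Parameters of example 1; theta[s] = P(next symbol = 1 | context s).
theta = {"00": 0.5, "10": 0.3, "1": 0.1}

def actual_prob(symbols, past):
    """P_a of definition 1 for the source of example 1, multiplied out per symbol."""
    history, P = list(past), 1.0
    for x in symbols:
        # sigma: the unique postfix of the past that belongs to the postfix set
        s = next(s for s in theta if "".join(map(str, history[-len(s):])) == s)
        P *= theta[s] if x == 1 else 1 - theta[s]
        history.append(x)
    return P

print(round(actual_prob([0, 1, 1, 0, 1, 0, 0], [0, 1, 0]), 7))  # -> 0.0059535
```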

The model (postfix set) S can be specified by

code(S) = code(λ) = 1 code(0) code(1)
        = 1 1 code(00) code(10) 0
        = 1 1 0 0 0,    (3)

Figure 1: Model (postfix set) and parameters ($\theta_{00} = 0.5$, $\theta_{10} = 0.3$, $\theta_1 = 0.1$).

whose length is 2|S| − 1 = 5 binary digits, as we expected.
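The recursive specification of this natural code can be sketched in a few lines. This is our own illustration (the helper name is hypothetical), assuming the given postfix set is proper and complete so the recursion terminates.

```python
def specify_model(postfix_set, s=""):
    """Natural model code: '0' if s is in the set, otherwise '1' followed by
    the codes of 0s and 1s (strings are written left to right, e.g. '10')."""
    if s in postfix_set:
        return "0"
    return "1" + specify_model(postfix_set, "0" + s) + specify_model(postfix_set, "1" + s)

# Example 1: S = {00, 10, 1} gives 11000, i.e. 2|S| - 1 = 5 binary digits.
print(specify_model({"00", "10", "1"}))  # -> 11000
```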

3 Codes and redundancy

Instead of the source sequence $x_1^T \triangleq x_1 x_2 \cdots x_T$ itself, the encoder sends a codeword $c^L \triangleq c_1 c_2 \cdots c_L$ to the decoder. The decoder must be able to reconstruct the source sequence $x_1^T$ from this codeword (noiseless source coding). We assume that both the encoder and the decoder have access to the past source symbols $x_{-\infty}^0 \triangleq \cdots x_{-1} x_0$. The codeword that is formed by the encoder depends not only on the source sequence $x_1^T$ but also on this sequence of past symbols $x_{-\infty}^0$. To denote this functional relationship we write $c^L(x_1^T|x_{-\infty}^0)$. The length of the codeword is denoted as $L(x_1^T|x_{-\infty}^0)$. The set of codewords that can be produced for a fixed sequence of past symbols $x_{-\infty}^0$ is assumed to be proper, i.e. this set of codewords satisfies the prefix condition. A codeword set satisfies the prefix condition if no codeword is the prefix of any other codeword in the set. All sequences $c_1 c_2 \cdots c_l$, for some $l = 1, 2, \cdots, L$, are a prefix of $c_1 c_2 \cdots c_L$. The codeword lengths determine the individual redundancies.

Definition 2 : The individual redundancy $\rho(x_1^T|x_{-\infty}^0, S, \Theta_S)$ of a sequence $x_1^T$ given a sequence $x_{-\infty}^0$ of past symbols, with respect to a source with model $S$ and parameter vector $\Theta_S$,¹ is defined as

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) \triangleq L(x_1^T|x_{-\infty}^0) - \log \frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}, \qquad (4)$$

where $L(x_1^T|x_{-\infty}^0)$ is the length of the codeword that corresponds to $x_1^T$. We consider only sequences $x_1^T$ with positive probability $P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)$. △

¹The base of the log(·) is assumed to be 2 throughout this manuscript.

The value $\log(1/P_a(x_1^T|x_{-\infty}^0, S, \Theta_S))$ can be regarded as the information contained in $x_1^T$ given the past $x_{-\infty}^0$. Note that we do not divide the redundancies by the block length $T$; we consider only cumulative redundancies. The objective in source coding is to design methods that achieve small (individual) redundancies. Since it is also very important that these methods have low (storage and computational) complexity, it would be more appropriate to say that the emphasis in source coding is on finding a desirable trade-off between redundancy and complexity. Therefore we analyze and compare methods both from the point of view of redundancy and complexity.

Example 2 : Consider the FSMX source of the previous example. For each past sequence $x_{-\infty}^0$ we could use Huffman's construction ([9], or [4]) to find the set of codewords that achieves the smallest possible expected redundancy $\sum_{x_1^T} P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)\, \rho(x_1^T|x_{-\infty}^0, S, \Theta_S)$. For past sequence $x_{-\infty}^0 = \cdots 010$ we listed in the table below the probabilities, codeword lengths and individual redundancies for all sequences $x_1^T$, for $T = 3$.

x_1^T | P_a(x_1^T|x_{-∞}^0, S, Θ_S) | log(1/P_a) | c^L(x_1^T|x_{-∞}^0) | L | ρ(x_1^T|x_{-∞}^0, S, Θ_S)
000   | 0.175                       | 2.515      | 11                   | 2 | -0.515
001   | 0.175                       | 2.515      | 000                  | 3 |  0.485
010   | 0.315                       | 1.667      | 01                   | 2 |  0.333
011   | 0.035                       | 4.837      | 00110                | 5 |  0.163
100   | 0.189                       | 2.404      | 10                   | 2 | -0.404
101   | 0.081                       | 3.626      | 0010                 | 4 |  0.374
110   | 0.027                       | 5.211      | 001110               | 6 |  0.789
111   | 0.003                       | 8.381      | 001111               | 6 | -2.381

The expected redundancy computed from this table turns out to be 0.074 bits.
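The 0.074 bits can be recomputed directly from the table; the snippet below is a small check of our own (not part of the original text).

```python
from math import log2

# (probability, codeword length) pairs taken from the table above
table = [(0.175, 2), (0.175, 3), (0.315, 2), (0.035, 5),
         (0.189, 2), (0.081, 4), (0.027, 6), (0.003, 6)]

# expected redundancy = sum over sequences of Pa * (L - log(1/Pa)) = E[L] - H
expected_redundancy = sum(p * (L - log2(1 / p)) for p, L in table)
print(round(expected_redundancy, 3))  # -> approximately 0.074 bits
```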

4 Arithmetic coding

4.1 Introduction to the Elias algorithm

An arithmetic encoder computes the codeword that corresponds to the actual source sequence. The corresponding decoder reconstructs the actual source sequence from this codeword, again performing simple computations only. This is different from encoders and decoders that use tables, which need to be constructed first and stored in memory. Arithmetic coding does not require construction and storage of the complete code before the encoding and decoding process starts. On the other hand we see that arithmetic coding requires some computational effort during encoding and decoding. Using arithmetic codes it is possible to process source sequences with a large length. Large values of the length $T$ are often needed to reduce the redundancy per source symbol. All arithmetic codes are based on the Elias algorithm, which was unpublished but described by Abramson [1] and Jelinek [10].

The first idea behind the Elias algorithm is that to each source sequence $x_1^T$ there corresponds a subinterval of $[0,1)$. This principle can be traced back to Shannon [20]. The

second idea, due to Elias, is to order the sequences $x_1^t$ of length $t$ lexicographically, for $t = 1, \cdots, T$. For two sequences $x_1^t$ and $\tilde{x}_1^t$ we have that $x_1^t < \tilde{x}_1^t$ if and only if there exists a $\tau \in \{1, 2, \cdots, t\}$ such that $x_i = \tilde{x}_i$ for $i = 1, 2, \cdots, \tau-1$ and $x_\tau < \tilde{x}_\tau$. This lexicographical ordering makes it possible to compute the subinterval corresponding to a source sequence sequentially (recursively).

4.2 Representation of sequences by intervals

Suppose distribution $P_c(x_1^t)$, $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, is what we shall call the coding distribution, known to both encoder and decoder. The empty sequence ($x_1^0$) will be denoted by $\phi$. We assume (require) that

$$P_c(\phi) = 1,$$
$$P_c(x_1^{t-1}) = P_c(x_1^{t-1}, X_t = 0) + P_c(x_1^{t-1}, X_t = 1), \quad \text{for all } x_1^{t-1} \in \{0,1\}^{t-1},\; t = 1, \cdots, T,$$
$$\text{and } P_c(x_1^T) > 0 \text{ for all possible } x_1^T \in \{0,1\}^T. \qquad (5)$$

Possible sequences are sequences that can actually occur, i.e. sequences $x_1^T$ with $P_a(x_1^T) > 0$. To each source sequence there now corresponds a subinterval of $[0,1)$.

Definition 3 : The interval $I(x_1^t)$ corresponding to $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, is defined as
$$I(x_1^t) \triangleq \left[ B(x_1^t), B(x_1^t) + P_c(x_1^t) \right) \qquad (6)$$
where $B(x_1^t) \triangleq \sum_{\tilde{x}_1^t < x_1^t} P_c(\tilde{x}_1^t)$. △

Note that for $t = 0$ we have that $P_c(\phi) = 1$ and $B(\phi) = 0$ (the only sequence of length 0 is $\phi$ itself), and consequently $I(\phi) = [0,1)$. Observe that for any fixed value of $t$, $t = 0, 1, \cdots, T$, all intervals $I(x_1^t)$ are disjoint, and their union is $[0,1)$. Each interval has a length equal to the corresponding coding probability. Just like all source sequences, each codeword $c^L = c_1 c_2 \cdots c_L$ can be associated with a subinterval of $[0,1)$.

Definition 4 : The interval $J(c^L)$ corresponding to the codeword $c^L$ is defined as
$$J(c^L) \triangleq \left[ F(c^L), F(c^L) + 2^{-L} \right), \qquad (7)$$

with $F(c^L) \triangleq \sum_{l=1}^{L} c_l 2^{-l}$. △

To understand this, note that $c^L$ can be considered as a binary fraction $F(c^L)$. Since $c^L$ is followed by other codewords, the decoder receives a stream of code digits from which only the first $L$ digits correspond to $c^L$. The decoder can determine the value that is represented by the binary fraction formed by the total stream $c_1 c_2 \cdots c_L c_{L+1} \cdots$, i.e.
$$F_\infty \triangleq \sum_{l=1}^{\infty} c_l 2^{-l}, \qquad (8)$$
where it should be noted that the length of the total stream is not necessarily infinite. Since $F(c^L) \le F_\infty < F(c^L) + 2^{-L}$ we may say that the interval $J(c^L)$ corresponds to the codeword $c^L$.

4.3 Compression

To compress a sequence $x_1^T$, we search for a (short) codeword $c^L(x_1^T)$ whose code interval $J(c^L)$ is contained in the sequence interval $I(x_1^T)$. This is the main principle in arithmetic coding.

Definition 5 : The codeword $c^L(x_1^T)$ for source sequence $x_1^T$ consists of $L(x_1^T) \triangleq \lceil \log(1/P_c(x_1^T)) \rceil + 1$ binary digits such that

$$F(c^L(x_1^T)) \triangleq \lceil B(x_1^T) \cdot 2^{L(x_1^T)} \rceil \cdot 2^{-L(x_1^T)}, \qquad (9)$$

where $\lceil a \rceil$ is the smallest integer $\ge a$. We consider only sequences $x_1^T$ with $P_c(x_1^T) > 0$. △

Since
$$F(c^L(x_1^T)) \ge B(x_1^T) \qquad (10)$$
and

$$F(c^L(x_1^T)) + 2^{-L(x_1^T)} < B(x_1^T) + 2^{-L(x_1^T)} + 2^{-L(x_1^T)} \le B(x_1^T) + P_c(x_1^T), \qquad (11)$$

we may conclude that $J(c^L(x_1^T)) \subseteq I(x_1^T)$, and therefore $F_\infty \in I(x_1^T)$. Since all intervals $I(x_1^T)$ are disjoint, the decoder can reconstruct the source sequence $x_1^T$ from $F_\infty$. Note that after this reconstruction, the decoder can compute $c^L(x_1^T)$, just like the encoder, and find the location of the first digit of the next codeword. Note also that, since all code intervals are disjoint, no codeword is the prefix of any other codeword. Thus the code satisfies the prefix condition. From (9) we obtain the following

Theorem 1 : The arithmetic code we have just described achieves codeword lengths $L(x_1^T)$ that satisfy
$$L(x_1^T) < \log\frac{1}{P_c(x_1^T)} + 2, \qquad (12)$$
for all possible $x_1^T \in \{0,1\}^T$.

The difference between the codeword length $L(x_1^T)$ and $\log(1/P_c(x_1^T))$ is always less than 2 bits. We say that the individual coding redundancy is less than 2 bits.
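The codeword construction of definition 5 is a two-line computation once $B(x_1^T)$ and $P_c(x_1^T)$ are known. The sketch below is our own (the function name is hypothetical); it uses exact rationals to sidestep the accuracy issues mentioned at the end of this section, and the numbers are those of example 3 below.

```python
from fractions import Fraction
from math import ceil, log2

def elias_codeword(B, Pc):
    """Definition 5: L = ceil(log(1/Pc)) + 1 digits with
    F(c^L) = ceil(B * 2^L) * 2^(-L); returns the digits c_1 ... c_L."""
    L = ceil(log2(1 / Pc)) + 1
    F = ceil(B * 2 ** L)                   # integer numerator of F(c^L) = F / 2^L
    return [(F >> (L - l)) & 1 for l in range(1, L + 1)]

# Example 3: B(011) = 0.665 and Pc(011) = 0.035 give the codeword 101011.
print(elias_codeword(Fraction(665, 1000), Fraction(35, 1000)))  # -> [1, 0, 1, 0, 1, 1]
```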

4.4 Sequential computation

The lexicographical ordering over the source sequences makes it possible to compute the interval $I(x_1^T)$ sequentially. To do so, we transform the starting interval $I(\phi) = [0,1)$ into $I(x_1)$, $I(x_1 x_2)$, $\cdots$, and $I(x_1 x_2 \cdots x_T)$ respectively. The consequence of the lexicographical ordering over the source sequences is that

$$B(x_1^t) = \sum_{\tilde{x}_1^t < x_1^t} P_c(\tilde{x}_1^t) = \sum_{\tilde{x}_1^{t-1} < x_1^{t-1}} P_c(\tilde{x}_1^{t-1}) + \sum_{\tilde{x}_t < x_t} P_c(x_1^{t-1}, \tilde{x}_t) = B(x_1^{t-1}) + x_t \cdot P_c(x_1^{t-1}, X_t = 0). \qquad (13)$$

In other words $B(x_1^t)$ can be computed from $B(x_1^{t-1})$ and $P_c(x_1^{t-1}, X_t = 0)$. Therefore the encoder and the decoder can easily find $I(x_1^t)$ after having determined $I(x_1^{t-1})$, if it is 'easy' to determine the probabilities $P_c(x_1^{t-1}, X_t = 0)$ and $P_c(x_1^{t-1}, X_t = 1)$ after having processed $x_1 x_2 \cdots x_{t-1}$. This is often the case, also in the context tree weighting method as we shall see later. Observe that when the symbol $x_t$ is being processed, the source interval $I(x_1^{t-1}) = [B(x_1^{t-1}), B(x_1^{t-1}) + P_c(x_1^{t-1}))$ is subdivided into two subintervals

$$I(x_1^{t-1}, X_t = 0) = \left[ B(x_1^{t-1}),\; B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0) \right) \text{ and}$$
$$I(x_1^{t-1}, X_t = 1) = \left[ B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0),\; B(x_1^{t-1}) + P_c(x_1^{t-1}) \right). \qquad (14)$$

The encoder proceeds with one of these subintervals depending on the symbol $x_t$, therefore $I(x_1^t) \subseteq I(x_1^{t-1})$. This implies that

$$I(x_1^T) \subseteq I(x_1^{T-1}) \subseteq \cdots \subseteq I(\phi). \qquad (15)$$

The decoder determines from $F_\infty$ the source symbols $x_1, x_2, \cdots, x_T$ respectively by comparing $F_\infty$ with thresholds $D(x_1^{t-1})$.

Definition 6 : The thresholds $D(x_1^{t-1})$ are defined as

$$D(x_1^{t-1}) \triangleq B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0), \qquad (16)$$
for $t = 1, 2, \cdots, T$. △

Observe that threshold $D(x_1^{t-1})$ splits up the interval $I(x_1^{t-1})$ into two parts (see (14)). It is the upper boundary point of $I(x_1^{t-1}, X_t = 0)$ but also the lower boundary point of $I(x_1^{t-1}, X_t = 1)$. Since always $F_\infty \in I(x_1^T) \subseteq I(x_1^t)$, we have for $D(x_1^{t-1})$ that

$$F_\infty < B(x_1^t) + P_c(x_1^t) = B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0) = D(x_1^{t-1}) \quad \text{if } x_t = 0, \qquad (17)$$
and

$$F_\infty \ge B(x_1^t) = B(x_1^{t-1}) + P_c(x_1^{t-1}, X_t = 0) = D(x_1^{t-1}) \quad \text{if } x_t = 1. \qquad (18)$$

Therefore the decoder can easily find $x_t$ by comparing $F_\infty$ with the threshold $D(x_1^{t-1})$; in other words it can operate sequentially.
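A minimal sketch of this sequential behaviour (our own naming; `cond_prob(prefix, x)` is a placeholder for whatever coding distribution supplies $P_c(x_1^{t-1}, X_t = x)/P_c(x_1^{t-1})$): the encoder narrows the interval as in (14), and the decoder compares $F_\infty$ with the thresholds of definition 6.

```python
from fractions import Fraction

def encode_interval(symbols, cond_prob):
    """Return (B, P) of the final interval I(x_1^T), following (13)-(14)."""
    B, P = Fraction(0), Fraction(1)
    for t, x in enumerate(symbols):
        p0 = P * cond_prob(symbols[:t], 0)   # Pc(x_1^{t-1}, X_t = 0)
        if x == 0:
            P = p0                           # keep the lower subinterval
        else:
            B, P = B + p0, P - p0            # keep the upper subinterval
    return B, P

def decode_symbols(F_inf, T, cond_prob):
    """Recover x_1^T from F_inf by comparing with the thresholds D(x_1^{t-1})."""
    B, P, out = Fraction(0), Fraction(1), []
    for _ in range(T):
        p0 = P * cond_prob(out, 0)
        D = B + p0                           # threshold of definition 6
        if F_inf < D:
            out.append(0)
            P = p0
        else:
            out.append(1)
            B, P = D, P - p0
    return out
```

Plugging in the conditional probabilities of the source of figure 1, given the past $\cdots 010$, reproduces $B(011) = 0.665$ and $P_c(011) = 0.035$ of example 3 below.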

Figure 2: Encoding and decoding of 011 given past symbols $\cdots 010$ (interval boundaries 0, 0.35, 0.665, 0.7, 1; codeword interval [43/64, 44/64)).

We conclude this section with the observation that the Elias algorithm combines an acceptable coding redundancy with a desirable sequential implementation. The number of operations is linear in the source sequence length $T$. It is crucial however that the encoder and decoder have access to, or can easily determine, the probabilities $P_c(x_1^{t-1}, X_t = 0)$ and $P_c(x_1^{t-1}, X_t = 1)$ after having processed $x_1 x_2 \cdots x_{t-1}$. If we accept a loss of at most 2 bits coding redundancy, we are now left with the problem of finding good, sequentially available, coding distributions $P_c(x_1^T)$.

Example 3 : (See figure 2.) Assuming that the source is as in figure 1, we can use the actual source distribution $P_a(x_1^T|\cdots 010)$ as coding distribution, i.e. $P_c(x_1^T) = P_a(x_1^T|\cdots 010)$ for all $x_1^T$. In that case an arithmetic encoder computes the codeword for $x_1^T = 011$ (hence $T = 3$) as follows. From $B(\phi) = 0$ the encoder first computes $B(0) = B(\phi) = 0$ and $P_c(0) = 0.7$. Then $B(01) = B(0) + P_c(00) = 0 + P_c(0) \cdot 0.5 = 0.7 \cdot 0.5 = 0.35$ and $P_c(01) = P_c(0) \cdot 0.5 = 0.7 \cdot 0.5 = 0.35$. Finally $B(011) = B(01) + P_c(010) = 0.35 + P_c(01) \cdot 0.9 = 0.35 + 0.35 \cdot 0.9 = 0.665$ and $P_c(011) = P_c(01) \cdot 0.1 = 0.35 \cdot 0.1 = 0.035$. Now the codeword can be determined. The length $L = \lceil \log(1/0.035) \rceil + 1 = 6$ and therefore $c^L$ is the binary expansion of $\lceil 0.665 \cdot 64 \rceil / 64 = 43/64$, which is 101011. To see how the decoder operates, note that $43/64 \le F_\infty < 44/64$. First the decoder compares $F_\infty$ with $D(\phi) = B(\phi) + P_c(0) = 0 + 0.7 = 0.7$. Clearly $F_\infty < 44/64 < 0.7$ and therefore the decoder finds $x_1 = 0$. Just like the encoder, the decoder computes $B(0)$ and $P_c(0)$. The next threshold is $D(0) = B(0) + P_c(00) = 0 + P_c(0) \cdot 0.5 = 0.7 \cdot 0.5 = 0.35$. Now $F_\infty \ge 43/64 > 0.35$ and $x_2 = 1$. The decoder computes $B(01)$ and $P_c(01)$ and the next threshold $D(01) = B(01) + P_c(010) = 0.35 + P_c(01) \cdot 0.9 = 0.35 + 0.35 \cdot 0.9 = 0.665$. Again $F_\infty \ge 43/64 > 0.665$ and the decoder produces $x_3 = 1$. After computation of $B(011)$ and $P_c(011)$ the decoder can reconstruct the codeword 101011. Note that for this example the coding redundancy is $6 - \log(1/0.035) = 1.163$ bit.

At the end of this section we mention that we have avoided discussing accuracy issues.

These issues were first investigated by Rissanen [15], who contributed much more to arithmetic coding, and (independently) by Pasco [13].

5 Coding for a known source

When the FSMX source is known to encoder and decoder, which means that the model $S$ and the corresponding parameter vector $\Theta_S$ are available to both, they can use the actual distribution as coding distribution, hence for each sequence of past symbols $x_{-\infty}^0$ take

$$P_c(x_1^t|x_{-\infty}^0, S, \Theta_S) \triangleq P_a(x_1^t|x_{-\infty}^0, S, \Theta_S), \quad \text{for all } x_1^t \in \{0,1\}^t,\; t = 0, 1, \cdots, T. \qquad (19)$$

It will be clear that a coding distribution equal to the actual distribution satisfies (5). This definition allows sequential coding and decoding since

$$P_c(x_1^{t-1}, X_t = 0|x_{-\infty}^0, S, \Theta_S) = (1 - \theta_{\sigma(x_{-\infty}^{t-1})}) \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S, \Theta_S), \text{ and}$$
$$P_c(x_1^{t-1}, X_t = 1|x_{-\infty}^0, S, \Theta_S) = \theta_{\sigma(x_{-\infty}^{t-1})} \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S, \Theta_S), \quad \text{for } t = 1, \cdots, T, \qquad (20)$$

and since $P_c(x_1^{t-1}|x_{-\infty}^0, S, \Theta_S)$ and $\theta_{\sigma(x_{-\infty}^{t-1})}$ are known to encoder and decoder after $x_1^{t-1}$ was processed.

Theorem 2 : For any known FSMX source with model S and parameter vector ΘS , the individual redundancies with respect to the source (S, ΘS ) are upper bounded by

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < 2, \qquad (21)$$

for all possible $x_1^T \in \{0,1\}^T$, i.e. $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0, S, \Theta_S) > 0$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (19).

Proof : Fix $x_1^T \in \{0,1\}^T$. Using for $L(x_1^T|x_{-\infty}^0)$ the bound in theorem 1, we obtain for the individual redundancy of sequence $x_1^T$ with respect to source $(S, \Theta_S)$

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) = L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} < \log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S, \Theta_S)} + 2 - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} = 2 \qquad (22)$$

by (19). It is assumed that $P_c(x_1^T|x_{-\infty}^0, S, \Theta_S) = P_a(x_1^T|x_{-\infty}^0, S, \Theta_S) > 0$. □

It should be noted that the redundancy consists only of coding redundancy, i.e. redundancy resulting from not taking codeword lengths exactly equal to the information in the corresponding sequence. We do not do so partly because it is not always possible, but also because it is not (computationally) efficient.

6 Coding for a memoryless source with unknown parameter

In the FSMX framework, a memoryless source has a postfix set $S = \{\lambda\}$, hence $|S| = 1$. Consequently $\sigma(x_{-\infty}^{t-1}) = \lambda$ for all $x_{-\infty}^{t-1}$, and $\theta_\lambda$ is the only parameter involved. Therefore

$$P_a(X_t = 1|x_{-\infty}^{t-1}, \{\lambda\}, \theta_\lambda) = 1 - P_a(X_t = 0|x_{-\infty}^{t-1}, \{\lambda\}, \theta_\lambda) = \theta_\lambda, \quad \text{for all } t. \qquad (23)$$

When the parameter $\theta_\lambda$ is unknown to encoder and decoder, they cannot use the actual distribution as coding distribution. Instead they must use a cleverly chosen distribution. A coding distribution with nice properties was suggested by Krichevsky and Trofimov [12].

Definition 7 : The Krichevsky-Trofimov probability estimator for a sequence containing $a \in \{0, 1, \cdots\}$ zeros and $b \in \{0, 1, \cdots\}$ ones, is given by

 1 ·(1+ 1 )·...·(a− 1 )· 1 ·(1+ 1 )·...·(b− 1 )  2 2 2 2 2 2 for a > 0, b > 0,  1·2·...·(a+b)  1 ·(1+ 1 )·...·(a− 1 ) ∆  2 2 2 for a > 0, b = 0, Pe(a, b) = 1·2·...·a (24) 1 ·(1+ 1 )·...·(b− 1 )  2 2 2 for a = 0, b > 0, and  1·2·...·b  1 for a = b = 0.

△

It can be shown (in fact this is how Krichevsky and Trofimov defined their estimator) that
$$P_e(a, b) = \int_0^1 \frac{(1-\theta)^a \theta^b}{\pi \sqrt{(1-\theta)\theta}} \, d\theta, \qquad (25)$$

in other words $P_e(a,b)$ is a weighting of actual probabilities $(1-\theta)^a \theta^b$. This weighting is accomplished by assuming that the parameter $\theta$ has a $(\frac12, \frac12)$-Dirichlet distribution. For this estimator we can find an upper and a lower bound that are easier to handle than expression (24). The lemma below contains these bounds. It is proved in appendix A.

Lemma 1 : For $a + b \ge 1$ the following inequalities hold for $P_e(a,b)$ :
$$\frac12 \cdot \frac{1}{\sqrt{a+b}} \left(\frac{a}{a+b}\right)^a \left(\frac{b}{a+b}\right)^b \le P_e(a,b) \le \sqrt{\frac{2}{\pi}} \cdot \frac{1}{\sqrt{a+b}} \left(\frac{a}{a+b}\right)^a \left(\frac{b}{a+b}\right)^b. \qquad (26)$$

We are now ready to define a coding distribution for a memoryless source. Let $a(x_1^t)$ be the number of zeros in $x_1^t$ and $b(x_1^t)$ the number of ones in $x_1^t$. The empty sequence $\phi$ contains $a(\phi) = 0$ zeroes and $b(\phi) = 0$ ones. The 'Krichevsky-Trofimov' coding distribution for a memoryless source is defined as

$$P_c(x_1^t|\{\lambda\}) \triangleq P_e(a(x_1^t), b(x_1^t)), \quad \text{for all } x_1^t \in \{0,1\}^t,\; t = 0, 1, \cdots, T. \qquad (27)$$

It can easily be verified that $P_c(\cdot|\{\lambda\})$ as defined in (27) satisfies (5). Again, as in the case where the source was known to encoder and decoder, $P_c(\cdot|\{\lambda\})$ is sequentially available. This follows from

$$P_c(x_1^{t-1}, X_t = 0|\{\lambda\}) = \frac{a(x_1^{t-1}) + \frac12}{a(x_1^{t-1}) + b(x_1^{t-1}) + 1} \cdot P_c(x_1^{t-1}|\{\lambda\}), \text{ and}$$
$$P_c(x_1^{t-1}, X_t = 1|\{\lambda\}) = \frac{b(x_1^{t-1}) + \frac12}{a(x_1^{t-1}) + b(x_1^{t-1}) + 1} \cdot P_c(x_1^{t-1}|\{\lambda\}), \quad \text{for } t = 1, \cdots, T, \qquad (28)$$

and the fact that $a(x_1^{t-1})$, $b(x_1^{t-1})$, and $P_c(x_1^{t-1}|\{\lambda\})$ are known after $x_1^{t-1}$ has been processed. When $X_t = 0$ we increment count $a$, i.e. $a(x_1^{t-1}, X_t = 0) = a(x_1^{t-1}) + 1$, and $b(x_1^{t-1}, X_t = 0) = b(x_1^{t-1})$. When $X_t = 1$, count $b$ is incremented, hence we update $a(x_1^{t-1}, X_t = 1) = a(x_1^{t-1})$, and $b(x_1^{t-1}, X_t = 1) = b(x_1^{t-1}) + 1$. It should be noted that the encoder and decoder must both update their counts, in order to keep them identical. This is essential since the coding distributions depend on the previously encountered data, and the encoder and decoder must use the same coding distributions to guarantee proper performance. The performance of the estimator of Krichevsky and Trofimov follows from the next result.
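Before stating that result, here is a minimal sketch of the update (28) and the count bookkeeping just described (our own naming; exact fractions are used for clarity).

```python
from fractions import Fraction

def kt_sequential(symbols):
    """Krichevsky-Trofimov coding distribution (27), computed via (28):
    returns Pc(x_1^t | {lambda}) = Pe(a, b) after processing the sequence."""
    a, b, P = 0, 0, Fraction(1)
    for x in symbols:
        if x == 0:
            P *= Fraction(2 * a + 1, 2 * (a + b + 1))   # (a + 1/2) / (a + b + 1)
            a += 1
        else:
            P *= Fraction(2 * b + 1, 2 * (a + b + 1))   # (b + 1/2) / (a + b + 1)
            b += 1
    return P

print(kt_sequential([0, 1]))  # -> 1/8, i.e. Pe(1, 1)
```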

Theorem 3 : For any memoryless source with unknown parameter $\theta_\lambda$, the individual redundancies with respect to this source $(\{\lambda\}, \theta_\lambda)$ are upper bounded by
$$\rho(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda) < \frac12 \log T + 3, \qquad (29)$$

for all $x_1^T \in \{0,1\}^T$, for all sequences of past symbols $x_{-\infty}^0$, if we use the coding distribution specified in (27).

Proof : First we split up the redundancy into parameter redundancy (the first term) and coding redundancy (the last two terms):

$$\rho(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda) = L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)} = \log\frac{P_a(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)}{P_c(x_1^T|\{\lambda\})} + L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|\{\lambda\})}. \qquad (30)$$
For the coding redundancy we obtain, using theorem 1, that

$$L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|\{\lambda\})} < 2. \qquad (31)$$

To upper bound the parameter redundancy, consider a sequence $x_1^T$ with $a$ zeros and $b$ ones. Then

$$\log\frac{P_a(x_1^T|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)}{P_c(x_1^T|\{\lambda\})} = \log\frac{(1-\theta_\lambda)^a \theta_\lambda^b}{P_e(a,b)} \le \log\frac{(1-\theta_\lambda)^a \theta_\lambda^b}{\frac12 \frac{1}{\sqrt{a+b}} \left(\frac{a}{a+b}\right)^a \left(\frac{b}{a+b}\right)^b} \le \frac12 \log(a+b) + 1 = \frac12 \log T + 1. \qquad (32)$$

In this derivation the first inequality follows from lemma 1, and the second from the fact that $(1-\theta_\lambda)^a \theta_\lambda^b \le (\frac{a}{a+b})^a (\frac{b}{a+b})^b$. The proof of the theorem is complete after substitution of (31) and (32) in (30). □

We have seen that the redundancy now consists of coding redundancy and parameter redundancy. Parameter redundancy is a result of not knowing parameters, and therefore not being able to use the actual distribution as coding distribution. Note that this parameter redundancy is not very big. For $T = 262144 = 2^{18}$ we get a parameter redundancy of at most 10 bits.

Example 4 : Suppose that a memoryless source with parameter $\theta_\lambda = 0.1$ generates a sequence with $a = 901$ zeroes and $b = 123$ ones. Then the probability $P_a(x_1^{1024}|x_{-\infty}^0, \{\lambda\}, \theta_\lambda)$ of this sequence is $2^{-545.552}$. The probability estimate $P_e(901, 123) = 2^{-547.737}$. By lemma 1 the probability estimate $P_e(901, 123)$ is larger than $2^{-1024 \cdot h(123/1024) - 5 - 1}$ and smaller than $2^{-1024 \cdot h(123/1024) - 5 - 0.326}$, where $h(p) \triangleq p \log\frac{1}{p} + (1-p)\log\frac{1}{1-p}$, for $p \in [0,1]$. With $1024 \cdot h(123/1024) = 542.410$ this gives the bounds $2^{-548.410}$ and $2^{-547.736}$. Note that the upper bound is very close to the actual value of the estimate. Asymptotically, for large $T$, this bound is tight. Check that the parameter redundancy is smaller than 6 bits.
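The numbers of example 4 can be reproduced in the log domain. The sketch below is our own check (not part of the original text), computing $\log_2 P_a$, $\log_2 P_e(901, 123)$ via (24), and the two exponents of lemma 1.

```python
from math import log2, pi

a, b, theta = 901, 123, 0.1

log_pa = a * log2(1 - theta) + b * log2(theta)            # about -545.55

# log2 Pe(a, b) from (24): products of (i + 1/2), i = 0..a-1 and 0..b-1, over (a+b)!
log_pe = (sum(log2(i + 0.5) for i in range(a)) +
          sum(log2(i + 0.5) for i in range(b)) -
          sum(log2(i) for i in range(1, a + b + 1)))       # about -547.74

p = b / (a + b)                                            # binary entropy h(123/1024)
h = p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))
lower = -(a + b) * h - 0.5 * log2(a + b) - 1               # lemma 1, lower bound exponent
upper = -(a + b) * h - 0.5 * log2(a + b) + 0.5 * log2(2 / pi)
print(round(log_pa, 3), round(log_pe, 3), round(lower, 3), round(upper, 3))
```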

7 Coding for a source with known model and unknown parameters

7.1 Definition of the coding distribution

Now consider a general FSMX source, whose postfix set $S$ is known to the encoder and the decoder. In that case, the corresponding postfix function $\sigma(\cdot)$ partitions the sequence $x_1^T$ into memoryless subsequences. For each of these subsequences we can use the Krichevsky-Trofimov estimator. More precisely, let $a_s(x_1^t|x_{-\infty}^0)$, respectively $b_s(x_1^t|x_{-\infty}^0)$, denote the number of zeros, respectively ones, that occur in $x_1^t$ at instants $\tau$, for $\tau = 1, \cdots, t$, such that $s = \sigma(x_{-\infty}^{\tau-1})$, for each $s \in S$. Then for each sequence of past symbols $x_{-\infty}^0$ we define our coding distribution as

$$P_c(x_1^t|x_{-\infty}^0, S) \triangleq \prod_{s \in S} P_e(a_s(x_1^t|x_{-\infty}^0), b_s(x_1^t|x_{-\infty}^0)), \quad \text{for all } x_1^t \in \{0,1\}^t,\; t = 0, 1, \cdots, T. \qquad (33)$$

Since the model of the source is known to the encoder and decoder, they can keep track of the postfix $\sigma(x_{-\infty}^{t-1})$ that determines the generation of $x_t$ (not the parameter $\theta_{\sigma(x_{-\infty}^{t-1})}$ itself), after having processed $x_1 x_2 \cdots x_{t-1}$. This makes it possible for encoder and decoder to

update the counts $a_s(x_1^t|x_{-\infty}^0)$ and $b_s(x_1^t|x_{-\infty}^0)$ for $s \in S$ properly. It is again easy to see that $P_c(\cdot|x_{-\infty}^0, S)$ which is defined in (33) is acceptable as a coding distribution, i.e. it satisfies (5). Observe that also now $P_c(\cdot|x_{-\infty}^0, S)$ is sequentially available. This follows from

$$P_c(x_1^{t-1}, X_t = 0|x_{-\infty}^0, S) = \frac{a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + \frac12}{a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + 1} \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S), \text{ and}$$
$$P_c(x_1^{t-1}, X_t = 1|x_{-\infty}^0, S) = \frac{b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + \frac12}{a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0) + 1} \cdot P_c(x_1^{t-1}|x_{-\infty}^0, S), \quad \text{for } t = 1, \cdots, T, \qquad (34)$$

and the fact that $a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$, $b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$, and $P_c(x_1^{t-1}|x_{-\infty}^0, S)$ are known after $x_1^{t-1}$ has been processed. Count $a_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$ is incremented when $X_t = 0$ and $b_{\sigma(x_{-\infty}^{t-1})}(x_1^{t-1}|x_{-\infty}^0)$ when $X_t = 1$.

The notation $P_c(x_1^T|x_{-\infty}^0, S)$ suggests that averaging $P_c(x_1^T|x_{-\infty}^0, S, \Theta_S) = P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)$ over all $\Theta_S$ has led to $P_c(x_1^T|x_{-\infty}^0, S)$. The assumption that all parameters $\theta_s$, $s \in S$, in $\Theta_S$ are Dirichlet-$(\frac12, \frac12)$ distributed justifies this observation. Equation (25) is the one-dimensional version of this weighting operation.
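A minimal sketch of (33)-(34) for a known model (function and variable names are ours): one Krichevsky-Trofimov count pair per postfix, selected by the postfix function $\sigma$.

```python
from fractions import Fraction

def known_model_prob(symbols, past, model):
    """Pc(x_1^T | past, S) of (33), computed sequentially as in (34).
    `model` is a proper and complete postfix set such as {"00", "10", "1"};
    context strings are read left to right, most recent symbol last."""
    counts = {s: [0, 0] for s in model}                   # (a_s, b_s) per postfix
    P, history = Fraction(1), list(past)
    for x in symbols:
        ctx = next(s for s in model                       # sigma(x_{-inf}^{t-1})
                   if "".join(map(str, history[-len(s):])) == s)
        a, b = counts[ctx]
        P *= Fraction(2 * a + 1 if x == 0 else 2 * b + 1, 2 * (a + b + 1))
        counts[ctx][x] += 1
        history.append(x)
    return P

# Example 5: the seven symbols 0110100 after past ...010 give Pc = 1/512.
print(known_model_prob([0, 1, 1, 0, 1, 0, 0], [0, 1, 0], {"00", "10", "1"}))
```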

7.2 An upper bound on the redundancy

To investigate the redundancy of this coding method we first give a definition.

Definition 8 :
$$\gamma(z) \triangleq \begin{cases} z & \text{for } 0 \le z < 1,\\ \frac12 \log z + 1 & \text{for } z \ge 1. \end{cases} \qquad (35)$$
△

It can be seen that the function $\gamma(\cdot)$ is a convex-$\cap$ continuation of $\frac12 \log z + 1$ satisfying $\gamma(0) = 0$. Using this function we can state the following theorem.

Theorem 4 : For any FSMX source with known model $S$ and unknown parameter vector $\Theta_S$, the individual redundancies with respect to $(S, \Theta_S)$ are upper bounded by
$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < |S|\,\gamma\!\left(\frac{T}{|S|}\right) + 2, \qquad (36)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (33). Note that (36) can be rewritten as
$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < \begin{cases} T + 2 & \text{for } T = 1, \cdots, |S| - 1,\\ \frac{|S|}{2}\log\frac{T}{|S|} + |S| + 2 & \text{for } T = |S|, |S|+1, \cdots. \end{cases} \qquad (37)$$

Theorem 4 shows that the parameter redundancy is not more than roughly $\frac12 \log T$ bits per parameter. Asymptotic considerations lead to stronger bounds, but we prefer absolute results here.

Proof : The proof of this theorem differs from the previous one only in the treatment of the parameter redundancy. Instead of one parameter, there are $|S|$ parameters involved now. The structure of the coding distribution makes it possible, however, to split up the total parameter redundancy into $|S|$ terms representing the parameter redundancies corresponding to each of the $|S|$ subsequences, one for each postfix $s$ in $S$. Each of these terms can be upper bounded by $\frac12 \log(a_s + b_s) + 1$ as we have seen before, however only if $a_s + b_s > 0$. For $a_s + b_s = 0$ the term should not contribute to the redundancy. This is why we have introduced the function $\gamma$. Consider a sequence $x_1^T$ and let $a_s = a_s(x_1^T|x_{-\infty}^0)$ and $b_s = b_s(x_1^T|x_{-\infty}^0)$, for $s \in S$. For the parameter redundancy we can now write

$$\log\frac{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}{P_c(x_1^T|x_{-\infty}^0, S)} = \sum_{s\in S}\log\frac{(1-\theta_s)^{a_s}\theta_s^{b_s}}{P_e(a_s, b_s)} \le \sum_{s\in S: a_s+b_s>0}\left(\frac12\log(a_s+b_s) + 1\right) = \sum_{s\in S}\gamma(a_s+b_s) \le |S|\,\gamma\!\left(\frac{\sum_{s\in S}(a_s+b_s)}{|S|}\right) = |S|\,\gamma\!\left(\frac{T}{|S|}\right). \qquad (38)$$

The first inequality is obtained along the lines of (32); the second follows from the convex-$\cap$ property of $\gamma(\cdot)$. Adding the upper bound of 2 bits for the coding redundancy yields the theorem. □

Example 5 : Consider the FSMX source of example 1. For past sequence $x_{-\infty}^0 = \cdots 010$ we listed in the table below the postfix $\sigma(x_{-\infty}^{t-1})$, the source symbol $x_t$, the coding probability $P_c(x_1^t|x_{-\infty}^0, S)$, and the counts $(a_{00}, b_{00})$, $(a_{10}, b_{10})$, and $(a_1, b_1)$, for $t = 1, \cdots, 7$.

t | σ(x_{-∞}^{t-1}) | x_t | P_c(x_1^t|x_{-∞}^0, S) | (a_00, b_00) | (a_10, b_10) | (a_1, b_1)
1 | 10              | 0   | 1/2 · 1 = 1/2          | (0,0)        | (1,0)        | (0,0)
2 | 00              | 1   | 1/2 · 1/2 = 1/4        | (0,1)        | (1,0)        | (0,0)
3 | 1               | 1   | 1/2 · 1/4 = 1/8        | (0,1)        | (1,0)        | (0,1)
4 | 1               | 0   | 1/4 · 1/8 = 1/32       | (0,1)        | (1,0)        | (1,1)
5 | 10              | 1   | 1/4 · 1/32 = 1/128     | (0,1)        | (1,1)        | (1,1)
6 | 1               | 0   | 1/2 · 1/128 = 1/256    | (0,1)        | (1,1)        | (2,1)
7 | 10              | 0   | 1/2 · 1/256 = 1/512    | (0,1)        | (2,1)        | (2,1)

The parameter redundancy is $\log(0.0059535/(1/512)) = 1.608$ bit. The bound on the parameter redundancy yields $\frac32 \log\frac73 + 3 = 4.834$ bit.

7.3 An upper bound on the codeword length

The corollary below is a straightforward consequence of theorem 4.

Corollary 1 : For any FSMX source with known model $S$, the codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by

$$L(x_1^T|x_{-\infty}^0) < \min_{\Theta_S} \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} + |S|\,\gamma\!\left(\frac{T}{|S|}\right) + 2, \qquad (39)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (33).

8 Coding for a source with unknown model and parameters

8.1 Introduction of the context tree weighting method

Consider an FSMX source, whose postfix set $S$ and parameter vector $\Theta_S$ are unknown to the encoder and the decoder. Can we choose the coding distribution such that the redundancy is not much higher than what we have found in theorem 4 for sources with known model (postfix set)? The answer to this question turns out to be affirmative. We will propose a weighted coding distribution and show that it achieves redundancies not much higher than in the case where we knew the model. The resulting coding procedure is sequential again, and therefore allows one-pass coding. The weighted coding distribution is based on the concept of a context tree (see figure 3).

Definition 9 : Let $D \in \{0, 1, \cdots\}$. The context tree $T_D$ is a set of nodes labeled $s$, where $s$ is a (binary) string with length $l(s)$ such that $0 \le l(s) \le D$. Each node $s \in T_D$ with $l(s) < D$ 'splits up' into two nodes, $0s$ and $1s$. The node $s$ is called the parent of the nodes $0s$ and $1s$, which in turn are the children of $s$. To each node $s \in T_D$, there correspond counts $a_s \ge 0$ and $b_s \ge 0$. For the children $0s$ and $1s$ of parent node $s$, the counts must satisfy $a_{0s} + a_{1s} = a_s$ and $b_{0s} + b_{1s} = b_s$. △

Note that we assume that the tree is full up to length $D$, i.e. all nodes corresponding to strings $s$ with length not exceeding $D$, including the root which corresponds to $\lambda$, are present. Now, to each node there corresponds a weighted probability. This weighted probability is defined recursively on the tree $T_D$. Without any doubt, this is the basic definition in this manuscript.

Definition 10 : To each node $s \in T_D$, we assign a weighted probability $P_{w,s}^D$ which is defined as

$$P_{w,s}^D \triangleq \begin{cases} \frac12 P_e(a_s, b_s) + \frac12 P_{w,0s}^D \cdot P_{w,1s}^D & \text{for } 0 \le l(s) < D,\\ P_e(a_s, b_s) & \text{for } l(s) = D. \end{cases} \qquad (40)$$

The context tree together with the weighted probability distribution is called a weighted context tree. △

To be able to define a weighted coding distribution, we assume that the counts $(a_s, b_s)$, $s \in T_D$, are determined by the source sequence $x_1^t$ seen up to now, assuming that $x_{-\infty}^0$ are the past symbols.

Definition 11 : Let $a_s = a_s(x_1^t|x_{-\infty}^0)$, respectively $b_s(x_1^t|x_{-\infty}^0)$, be the number of zeros, respectively ones, that occur in $x_1^t$ at instants $\tau$, $1 \le \tau \le t$, such that $s = \sigma(x_{-\infty}^{\tau-1})$, for each $s \in T_D$. The weighted probabilities corresponding to the nodes $s \in T_D$ are now denoted by $P_{w,s}^D(x_1^t|x_{-\infty}^0)$. For any sequence $x_{-\infty}^0$ of past symbols, we define our weighted coding distribution as
$$P_c(x_1^t|x_{-\infty}^0) \triangleq P_{w,\lambda}^D(x_1^t|x_{-\infty}^0), \qquad (41)$$
for all $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$. △

This coding distribution determines the context tree weighting method. To verify that the weighted coding distribution $P_c(\cdot|x_{-\infty}^0)$ satisfies (5), but also to help us understand the sequential updating procedure, we formulate a lemma. The proof of this lemma can be found in appendix B.

Lemma 2 : Let $s \in T_D$. If $s$ is a postfix of $x_{-\infty}^{t-1}$, then

$$P_{w,s}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) + P_{w,s}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0) = P_{w,s}^D(x_1^{t-1}|x_{-\infty}^0), \qquad (42)$$

and, if $s$ is not a postfix of $x_{-\infty}^{t-1}$, then

$$P_{w,s}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) = P_{w,s}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0) = P_{w,s}^D(x_1^{t-1}|x_{-\infty}^0). \qquad (43)$$

To check that $P_c(\cdot|x_{-\infty}^0)$ as defined in (41) is an allowable coding distribution, observe that $P_{w,\lambda}^D(\phi|x_{-\infty}^0) = 1$, i.e. when no symbols are processed yet. In fact, the weighted probabilities corresponding to all nodes are 1 in that case. Subsequently, note that lemma 2 states that $P_{w,\lambda}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) + P_{w,\lambda}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0) = P_{w,\lambda}^D(x_1^{t-1}|x_{-\infty}^0)$ since $\lambda$ is a postfix of all strings $x_{-\infty}^{t-1}$. We may conclude that $P_c(\cdot|x_{-\infty}^0)$ satisfies (5) after having verified that weighted probabilities are always positive.

8.2 Complexity issues

Another consequence of lemma 2 is that processing a new source symbol $x_t$ requires updating of only the weighted probabilities for nodes that correspond to postfixes of $x_{-\infty}^{t-1}$, i.e. the nodes $x_{t-D}^{t-1}, \cdots, x_{t-2}^{t-1}, x_{t-1}^{t-1}, \lambda$. In other words the sequence of past symbols $x_{-\infty}^{t-1}$ defines a path in the tree, and only the weighted probabilities along this path are updated, starting from the full-depth node $x_{t-D}^{t-1}$ and working towards the root $\lambda$. More precisely

$$P_{w,x_{t-d}^{t-1}}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) = \begin{cases} \frac12 P_e\!\left(a_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0),\, b_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0)\right) \\[0.5ex] \quad +\, \frac12 P_{w,x_{t-d-1}x_{t-d}^{t-1}}^D(x_1^{t-1}, X_t = 0|x_{-\infty}^0) \cdot P_{w,\overline{x}_{t-d-1}x_{t-d}^{t-1}}^D(x_1^{t-1}|x_{-\infty}^0) & \text{if } 0 \le d < D,\\[1ex] P_e\!\left(a_{x_{t-D}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0),\, b_{x_{t-D}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0)\right) & \text{for } d = D, \end{cases} \qquad (44)$$

where $x_t^{t-1} = \lambda$, and $\overline{x} = 1 - x$, for $x \in \{0,1\}$. A similar expression can be given for $P_{w,x_{t-d}^{t-1}}^D(x_1^{t-1}, X_t = 1|x_{-\infty}^0)$. Formula (44) more or less suggests that each node $s$ not only contains a weighted probability $P_{w,s}^D$, but also an estimated probability $P_e(a_s, b_s)$, and of course the counts $a_s$ and $b_s$. Also the estimated probabilities must be updated, hence in addition to (44) the encoder and decoder must perform

$$P_e\!\left(a_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0),\, b_{x_{t-d}^{t-1}}(x_1^{t-1}, X_t = 0|x_{-\infty}^0)\right) = \frac{a_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0) + \frac12}{a_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0) + b_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0) + 1} \cdot P_e\!\left(a_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0),\, b_{x_{t-d}^{t-1}}(x_1^{t-1}|x_{-\infty}^0)\right), \quad \text{for } 0 \le d \le D, \qquad (45)$$

and a similar computation if $X_t = 1$. Again only the counts along the path from the full-depth leaf node to the root are updated. When $X_t = 0$ all $a$-counts along this path are increased by one, for $X_t = 1$ the encoder and decoder increase the $b$-counts. The final conclusion after all this is that the computational complexity of the context tree weighting method is linear in $T$. At first sight the storage complexity is exponential in $D$, since the number of nodes in $T_D$ is $2^{D+1} - 1$. On the other hand however, it need not be more than linear in $T$. This follows from the fact that nodes $s$ with $a_s = b_s = 0$ need not be stored in memory. The reason for this is that these nodes $s$ (and all nodes that have $s$ as a postfix) have $P_{w,s}^D = P_e(a_s, b_s) = 1$. Each time a symbol is processed, $D + 1$ nodes are visited, and added if necessary. The number of nodes that are added during the processing of $T$ symbols therefore cannot exceed $(D+1)T$.
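The path update (44)-(45) can be sketched as follows; this is our own simplified illustration (class and variable names are hypothetical). The weighted probabilities are recomputed bottom-up along the current context path, and nodes are created on first visit, so storage grows as described above.

```python
from fractions import Fraction

class CTW:
    """Context tree weighting along the lines of (40)-(45); a context is a tuple
    of the D most recent past symbols, most recent symbol last."""
    def __init__(self, D):
        self.D = D
        self.nodes = {}                        # context -> [a, b, Pe, Pw]

    def _node(self, ctx):
        return self.nodes.setdefault(ctx, [0, 0, Fraction(1), Fraction(1)])

    def update(self, context, x):
        """Process symbol x given the last D past symbols; returns the new root
        weighted probability Pw_lambda, i.e. the coding probability Pc(x_1^t)."""
        path = [tuple(context[len(context) - d:]) for d in range(self.D, -1, -1)]
        for ctx in path:                       # from the depth-D node towards the root
            a, b, Pe, _ = self._node(ctx)
            Pe *= Fraction(2 * a + 1 if x == 0 else 2 * b + 1, 2 * (a + b + 1))
            a, b = (a + 1, b) if x == 0 else (a, b + 1)
            if len(ctx) == self.D:
                Pw = Pe                        # leaf node: second case of (40)
            else:                              # internal node: first case of (40);
                                               # the off-path child is unchanged (lemma 2)
                Pw = (Fraction(1, 2) * Pe +
                      Fraction(1, 2) * self._node((0,) + ctx)[3] * self._node((1,) + ctx)[3])
            self.nodes[ctx] = [a, b, Pe, Pw]
        return self.nodes[()][3]

# Example 6: x_1^7 = 0110100 with past ...010 and D = 3 gives Pc = 95/32768.
ctw, past = CTW(3), [0, 1, 0]
for x in [0, 1, 1, 0, 1, 0, 0]:
    P = ctw.update(past[-3:], x)
    past.append(x)
print(P)  # -> 95/32768
```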

8.3 An upper bound on the redundancy

What about the redundancy of the context tree weighting method? Before we can answer this question we have to define the cost of a model.

Definition 12 : The cost ΓD(S) of a model S, with respect to a tree TD, is defined as

$$\Gamma_D(S) \triangleq |S| - 1 + |\{s : s \in S, l(s) \ne D\}|. \qquad (46)$$

We assume that $S$ is contained in $T_D$, i.e. $S \subset T_D$. △ Our basic result concerning the context tree weighting technique can be stated now.

Theorem 5 : For any FSMX source with unknown model $S \subset T_D$ and unknown parameter vector $\Theta_S$, the individual redundancies with respect to $(S, \Theta_S)$ are upper bounded by
$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) < \Gamma_D(S) + |S|\,\gamma\!\left(\frac{T}{|S|}\right) + 2, \qquad (47)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence of past symbols $x_{-\infty}^0$, if we use the coding distribution specified in (41).

Proof : Consider a sequence $x_1^T \in \{0,1\}^T$. We split up the individual redundancy into model redundancy, parameter redundancy and coding redundancy.

$$\rho(x_1^T|x_{-\infty}^0, S, \Theta_S) = L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} = \log\frac{P_c(x_1^T|x_{-\infty}^0, S)}{P_c(x_1^T|x_{-\infty}^0)} + \log\frac{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}{P_c(x_1^T|x_{-\infty}^0, S)} + L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|x_{-\infty}^0)}. \qquad (48)$$
For the coding redundancy we obtain, using theorem 1, that

$$L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_c(x_1^T|x_{-\infty}^0)} < 2. \qquad (49)$$
For the parameter redundancy we use the bound given by (38):

$$\log\frac{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)}{P_c(x_1^T|x_{-\infty}^0, S)} \le |S|\,\gamma\!\left(\frac{T}{|S|}\right). \qquad (50)$$
What remains to be investigated is the model redundancy. From (33) we know that

$$P_c(x_1^T|x_{-\infty}^0, S) = \prod_{s\in S} P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)). \qquad (51)$$

It is easy to lower bound $P_c(\cdot|x_{-\infty}^0)$ in similar terms, as we shall see soon. Consider the postfix set (model) $S$ which is contained in the context tree $T_D$. Leaves of $S$ are nodes $s$ in the context tree for which $s \in S$. An internal node $s$ of $S$ is a node in $T_D$ that is a postfix of a string $s' \in S$, $s \ne s'$. We can now obtain a recursive lower bound on $P_{w,\lambda}^D(x_1^T|x_{-\infty}^0)$ by observing that for $s \in T_D$

$$P_{w,s}^D(x_1^T|x_{-\infty}^0) \ge \begin{cases} \frac12 P_{w,0s}^D(x_1^T|x_{-\infty}^0)\cdot P_{w,1s}^D(x_1^T|x_{-\infty}^0) & \text{if } s \text{ is an internal node of } S,\\ \frac12 P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)) & \text{if } s \text{ is a leaf of } S \text{ with } l(s) \ne D,\\ P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)) & \text{if } s \text{ is a leaf of } S \text{ with } l(s) = D. \end{cases} \qquad (52)$$
It turns out that we 'lose' not more than a factor $\frac12$ (1 bit) in every internal node and in every leaf of $S$ with $l(s) \ne D$. Therefore we get

$$P_c(x_1^T|x_{-\infty}^0) = P_{w,\lambda}^D(x_1^T|x_{-\infty}^0) \ge 2^{-\Gamma_D(S)} \prod_{s\in S} P_e(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0)). \qquad (53)$$
This results in the following upper bound for the model redundancy:

$$\log\frac{P_c(x_1^T|x_{-\infty}^0, S)}{P_c(x_1^T|x_{-\infty}^0)} \le \log\frac{1}{2^{-\Gamma_D(S)}} = \Gamma_D(S). \qquad (54)$$

Figure 3: Weighted context tree $T_3$ for $x_1^T = 0110100$ and $x_{-\infty}^0 = \cdots 010$ (each node shows its counts $(a_s, b_s)$, the Krichevsky-Trofimov estimate $P_e(a_s, b_s)$, and the weighted probability $P_{w,s}^D$).

Substituting (49), (50) and (54) in (48) yields the theorem. □

Theorem 5 is the basic theorem in this manuscript. In this theorem we recognize, beside the coding and parameter redundancy, the model redundancy. Model redundancy is a consequence of not knowing the model, and therefore not being able to take distribution $P_c(\cdot|x_{-\infty}^0, S)$ as coding distribution. This results in a loss which is upper bounded by $\Gamma_D(S)$ bits. If we specify the model $S \subset T_D$ using a natural code, similar to the one described in section 2, we also need $\Gamma_D(S)$ bits. To see this note that, in contrast to the code described there, strings $s$ with $l(s) = D$ need not contribute to the specification code. Per parameter, the model redundancy is not more than 2 bits. Another way of looking at the model redundancy in the theorem is that the first parameter costs no more than 1 bit, all additional ones cost no more than 2 bits, however parameters corresponding to nodes at level $D$ cost no more than 1 bit.

Example 6 : Suppose a source generated the sequence $x_1^T = 0110100$ while the sequence of past symbols $x_{-\infty}^0 = \cdots 010$. For $D = 3$ we have plotted the weighted context tree $T_D$ in figure 3. Node $s$ contains the counts $(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0))$, the Krichevsky-Trofimov estimate $P_e(a_s, b_s)$, and the weighted probability $P_{w,s}^D(x_1^T|x_{-\infty}^0)$. The coding probability for this sequence is 95/32768. If we assume that the sequence was generated by the source from example 1, then we obtain an upper bound on the model redundancy of $\Gamma_3(S) = 5$ bits. This can be checked from
$$P_{w,\lambda}^D(x_1^T|x_{-\infty}^0) \ge \frac12 P_{w,0} \cdot P_{w,1} \ge \frac12 \left(\frac12 P_{w,00} \cdot P_{w,10}\right) P_{w,1} \ge \frac12 \left(\frac12 \left(\frac12 P_e(a_{00}, b_{00})\right) \left(\frac12 P_e(a_{10}, b_{10})\right)\right) \left(\frac12 P_e(a_1, b_1)\right) = 2^{-5} P_c(x_1^T|x_{-\infty}^0, S). \qquad (55)$$

From the fact that the parameter redundancy is not larger than $\frac32 \log\frac73 + 3 = 4.834$ bits, we obtain the upper bound 9.834 bits for the model and parameter redundancy together. The actual value of the total redundancy (not including coding redundancy) is $\log(0.0059535/(95/32768)) = 1.038$ bit. Note that this is less than the 1.608 bit redundancy that results for the same sequence if we assume that the model is known. The reason for this is that a model different from the actual model matches the source sequence better than the actual model. The context tree weighting method always acts as if it uses the best model for compression, as we shall see in the next subsection.

8.4 Upper bounds on the codeword length

An important property of the context tree weighting method is given in the lemma below. In appendix C we give the proof of this lemma.

Lemma 3 : The weighted coding probability $P_c(x_1^t|x_{-\infty}^0)$ is a weighting over all coding distributions $P_c(x_1^t|x_{-\infty}^0, S)$ for all models $S \subset T_D$. More precisely

$$P_c(x_1^t|x_{-\infty}^0) = \sum_{S \subset T_D} 2^{-\Gamma_D(S)} \cdot P_c(x_1^t|x_{-\infty}^0, S), \qquad (56)$$

for all $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, and all $x_{-\infty}^0$, and
$$\sum_{S \subset T_D} 2^{-\Gamma_D(S)} = 1. \qquad (57)$$

A consequence of this lemma is that for the codeword length achieved by the context tree weighting method we have that

$$L(x_1^T|x_{-\infty}^0) < \log\frac{1}{P_c(x_1^T|x_{-\infty}^0)} + 2 = \log\frac{1}{\sum_{S \subset T_D} 2^{-\Gamma_D(S)} \cdot P_c(x_1^T|x_{-\infty}^0, S)} + 2. \qquad (58)$$
This immediately leads to an upper bound on the codeword length, which is stated in the next theorem.

Theorem 6 : The codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by

$$L(x_1^T|x_{-\infty}^0) < \log\frac{1}{\sum_{S \subset T_D} 2^{-\Gamma_D(S) - \log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S)}}} + 2 \qquad (59)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (41).

The interpretation of this theorem is as follows. Let $\log(1/P_c(x_1^T|x_{-\infty}^0, S)) + \Gamma_D(S)$ be the description length of our sequence with respect to $S$. This is reasonable as we have seen before. It takes $\Gamma_D(S)$ bits to specify the model $S$. Then $\log(1/P_c(x_1^T|x_{-\infty}^0, S))$ bits are needed to encode the sequence $x_1^T$, given the past symbols $x_{-\infty}^0$, assuming that the model is $S$, if we ignore the coding redundancy. Inequality (59) says that the weighting method yields a codeword length which is not larger than minus the log of the sum of the probabilities induced by the description lengths of all models $S \subset T_D$, plus 2 bits of coding redundancy. This means that if there are two models yielding the same description length, the weighting method will achieve a codeword length which is at least one bit less. Replacing in theorem 6 the sum in the denominator inside the log by single terms corresponding to models $S \subset T_D$, and taking the minimum over all $S \subset T_D$, we obtain the following corollary:

Corollary 2 : The codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by
$$L(x_1^T|x_{-\infty}^0) < \min_{S\subset T_D}\left(\log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S)} + \Gamma_D(S)\right) + 2, \qquad (60)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (41).

Given a model $S$, for any parameter vector $\Theta_S$, we know (see (38)) that:
$$\log\frac{1}{P_c(x_1^T|x_{-\infty}^0, S)} \le \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} + |S|\,\gamma\!\left(\frac{T}{|S|}\right). \qquad (61)$$
Substitution of this in corollary 2 yields:

Corollary 3 : The codeword length $L(x_1^T|x_{-\infty}^0)$ is upper bounded by
$$L(x_1^T|x_{-\infty}^0) < \min_{S\subset T_D}\min_{\Theta_S}\left(\log\frac{1}{P_a(x_1^T|x_{-\infty}^0, S, \Theta_S)} + \Gamma_D(S) + |S|\,\gamma\!\left(\frac{T}{|S|}\right)\right) + 2, \qquad (62)$$

for all $x_1^T \in \{0,1\}^T$, for any sequence $x_{-\infty}^0$ of past symbols, if we use the coding distribution specified in (41).

Example 7 : We have simulated the source described in example 1. A sample sequence $x_{1-D}^T$ was generated, from which the first $D$ binary digits $x_{1-D}^0$ have served as the relevant part of $x_{-\infty}^0$ and $x_1^T$ as the sequence to be compressed with the context tree weighting method, with $D = 6$. To make things simple we have assumed that the coding redundancy is zero. First we have computed the weighted codeword lengths $\log(1/P_c(x_1^t|x_{-\infty}^0))$ for $t = 1, \cdots, 350$. These codeword lengths are compared to the description lengths $\log(1/P_c(x_1^t|x_{-\infty}^0, S)) + \Gamma_D(S)$ for various models $S$. We have looked at 'underestimated' models $\{\lambda\}$ and $\{0, 1\}$, the actual model $\{00, 10, 1\}$, and 'overestimated' models $\{00, 10, 01, 11\}$ and $\{000, 100, 010, 110, 001, 101, 011, 111\}$.

22 35 ”wei” 30 ”f0” ”fw” 25 ”f1w0”

20

15

10

5

0

-5 0 50 100 150 200 250 300 350

Figure 4: Redundancies in bits for t = 1, 2, ··· , 350 for underestimated models, actual model, and weighted method.

Figure 5: Redundancies in bits for t = 1, 2, ···, 350 for overestimated models, actual model, and weighted method (curves 'wei', 'f1w0', 'f1ww', 'f11ww1ww').

Figure 4 contains the description lengths corresponding to the underestimated models and the actual model, together with the weighted method codeword length; figure 5 contains the description lengths for the overestimated models together with the actual model and the weighted method length. It is important to note that we always subtract the ideal codeword length $\log(1/P_a(x_1^T|x_{-\infty}^0, S, \Theta_S))$ from the description lengths and the weighted codeword lengths, so that the vertical axis measures redundancy with respect to the actual model. Figure 4 shows that the description length corresponding to the model $\{\lambda\}$, indicated by curve 'f0', is smaller than the description length corresponding to model $\{0, 1\}$, indicated by curve 'fw', and also smaller than the description length corresponding to the actual model $\{00, 10, 1\}$, indicated by 'f1w0', for $t$ smaller than 90. For $t$ between 90 and 225 model $\{0, 1\}$ gives the smallest description length. The actual model yields the best results only for $t$ larger than 225. This indicates that for small values of the sequence length, it does not pay off to use complicated models for compression. As expected from corollary 2, the weighted method codeword length (indicated by 'wei') is always smaller than the description lengths of the fixed models. For large values of $t$ it approaches the description length corresponding to the actual model. Figure 5 shows the description lengths corresponding to two overestimated models. Model $\{00, 10, 01, 11\}$ is represented by curve 'f1ww' and $\{000, 100, 010, 110, 001, 101, 011, 111\}$ by 'f11ww1ww'. They are compared to the description length corresponding to the actual model and the codeword length yielded by the weighted method, which is again the smallest of all four. It can be observed that, as expected, overestimated models are worse than the actual model and the weighted method. To get an idea of the tightness of the bound in theorem 5 we computed the upper bound on the redundancy of the weighting method with respect to the actual source. It is $5 + \frac32 \log\frac{350}{3} + 3 = 18.300$ bit for $T = 350$. The computed value is roughly 15 bit, as can be seen in the figures. The difference can be partly explained by the fact that for large $T$ the upper bound on the parameter redundancy is not tight. Instead using the (asymptotically tight) lower bound of lemma 1 yields $5 + \frac32 \log\frac{350}{3} + \frac32 \log\frac{\pi}{2} = 16.277$ bit. The difference of a little bit more than one bit must be the consequence of the fact that other models may have description lengths larger than the description length of the actual model, but all together they lead to a better result, as follows from theorem 6.

Weighting techniques are not new in universal data compression. Already Davisson [6] demonstrates the importance of weighting (composite source, mixture distribution). Weighting in modeling systems is also not very new. Rissanen [17] more or less proposes a weighting technique for FSM sources, by discussing a distribution over all models. This definition does not lead to a one-pass modeling method. Context tree weighting is a tree recursive method that does allow one-pass coding. Therefore it should be regarded as an efficient implementation of weighting principles.

9 Minimum description length model selection

9.1 A sequential model selection method

In example 7 we have seen that for a sequence $x_1^T$, given the past symbols $x_{-\infty}^0$, different models $S \subset T_D$ give different description lengths $\log(1/P_c(x_1^t|x_{-\infty}^0, S)) + \Gamma_D(S)$. In this section we are interested in methods that give us the model that corresponds to the smallest description length. We start with a definition.

Definition 13 : The minimum description length of a sequence $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, over the models $S \subset T_D$ is defined as
$$\Lambda_D(x_1^t|x_{-\infty}^0) \triangleq \min_{S\subset T_D}\left(\log\frac{1}{P_c(x_1^t|x_{-\infty}^0, S)} + \Gamma_D(S)\right), \qquad (63)$$

given any sequence $x_{-\infty}^0$ of past symbols. △

It turns out that a slight modification of the context tree weighting procedure yields a method that finds the minimum description length and the model that achieves it. Just like the weighting method, this method combines sequential updating with a good performance. Instead of assigning weighted probabilities to each node, there is a maximized probability connected to each node now. This maximized probability is also defined recursively on the context tree. In addition to the maximized probability, there corresponds a maximizing set to each node. These proper and complete sets minimize the 'description length' in each node (see the proof in appendix D).

Definition 14 : To each node $s \in T_D$, we assign a maximized probability $P_{m,s}^D$, defined as

$$P_{m,s}^D \triangleq \begin{cases} \frac12 \max(P_e(a_s, b_s),\, P_{m,0s}^D \cdot P_{m,1s}^D) & \text{for } 0 \le l(s) < D,\\ P_e(a_s, b_s) & \text{for } l(s) = D, \end{cases} \qquad (64)$$

and a maximizing set $S_{m,s}^D$ which is defined as

$$S_{m,s}^D \triangleq \begin{cases} (S_{m,0s}^D \times 0) \cup (S_{m,1s}^D \times 1) & \text{if } P_{m,0s}^D \cdot P_{m,1s}^D > P_e(a_s, b_s), \text{ for } 0 \le l(s) < D,\\ \{\lambda\} & \text{if } P_{m,0s}^D \cdot P_{m,1s}^D \le P_e(a_s, b_s), \text{ for } 0 \le l(s) < D,\\ \{\lambda\} & \text{for } l(s) = D. \end{cases} \qquad (65)$$

The context tree together with the maximized probability distribution and the maximizing sets is called a maximized context tree. △

In order to define a maximized coding distribution, and a maximizing postfix set, we assume that the counts $(a_s, b_s)$, for all $s \in T_D$, are determined by the source sequence $x_1^t$ seen so far, given the past symbols $x_{-\infty}^0$.

Definition 15 : Let $a_s = a_s(x_1^t|x_{-\infty}^0)$, respectively $b_s(x_1^t|x_{-\infty}^0)$, be the number of zeros, respectively ones, that occur in $x_1^t$ at instants $\tau$, for $\tau = 1, \cdots, t$, such that $s = \sigma(x_{-\infty}^{\tau-1})$, for all $s \in T_D$. The maximized probabilities corresponding to the nodes $s \in T_D$ are now denoted

by $P_{m,s}^D(x_1^t|x_{-\infty}^0)$, the maximizing sets by $S_{m,s}^D(x_1^t|x_{-\infty}^0)$. For any sequence $x_{-\infty}^0$ of past symbols, we define the maximized coding distribution as

$$P_m(x_1^t|x_{-\infty}^0) \triangleq P_{m,\lambda}^D(x_1^t|x_{-\infty}^0), \qquad (66)$$

and the maximizing postfix set as

$$S_m(x_1^t|x_{-\infty}^0) \triangleq S_{m,\lambda}^D(x_1^t|x_{-\infty}^0), \qquad (67)$$

for all $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$. △

Note that the definition of maximizing sets guarantees their properness and completeness, hence also that of the maximizing postfix set $S_m(x_1^t|x_{-\infty}^0)$. The maximized coding distribution $P_m(\cdot|x_{-\infty}^0)$ does not satisfy the restrictions (5), and therefore it is not a coding distribution in this sense. It does satisfy the weaker condition $\sum_{x_1^T\in\{0,1\}^T} P_m(x_1^T|x_{-\infty}^0) \le \sum_{x_1^T\in\{0,1\}^T} P_c(x_1^T|x_{-\infty}^0) = 1$ however, and therefore it is possible to design a code based on this distribution. This explains why we call $P_m(\cdot|x_{-\infty}^0)$ a coding distribution. Observe that the complexity properties of the sequential context tree maximizing method are similar to those of the weighting method. Condition (43) in lemma 2 also holds for maximized probabilities. Consequently the sequential context tree maximizing method has a storage complexity and computational complexity which is not larger than linear in $T$, when we do not process nodes that are descendants of a node $s$ with counts $a_s + b_s = 0$. We are now ready to state the theorem that justifies proper performance of the context tree maximizing procedure, in the sense that it produces the minimum description length model belonging to the observed data. The proof of this theorem can be found in appendix D.

Theorem 7 : For any sequence $x_1^t \in \{0,1\}^t$, $t = 0, 1, \cdots, T$, the context tree maximizing method yields

$$
\log\frac{1}{P_m(x_1^t|x_{-\infty}^0)} = \log\frac{1}{P_c(x_1^t|x_{-\infty}^0, \mathcal{S}_m(x_1^t|x_{-\infty}^0))} + \Gamma_D(\mathcal{S}_m(x_1^t|x_{-\infty}^0)) = \Lambda_D(x_1^t|x_{-\infty}^0), \qquad (68)
$$

given any sequence $x_{-\infty}^0$ of past symbols.

To get an impression of the maximizing procedure we give an example.

Example 8 : Consider again the sequence $x_1^T = 0110100$, with past symbols $x_{-\infty}^0 = \cdots 010$. For $D = 3$ we have plotted the maximized context tree $\mathcal{T}_D$ in figure 6. Node $s$ contains the counts $(a_s(x_1^T|x_{-\infty}^0), b_s(x_1^T|x_{-\infty}^0))$, the Krichevsky-Trofimov estimate $P_e(a_s,b_s)$, the maximized probability $P^D_{m,s}(x_1^T|x_{-\infty}^0)$, and the maximizing set $\mathcal{S}^D_{m,s}(x_1^T|x_{-\infty}^0)$. The maximized coding probability for this sequence is $5/4096$. The maximizing postfix set is $\{\lambda\}$, which is not unusual for $T = 7$. △

Now that we know that the context tree maximizing procedure yields the minimum description length model, it is still a question whether this is a useful technique.


Figure 6: Maximized context tree $\mathcal{T}_3$ for $x_1^T = 0110100$ and $x_{-\infty}^0 = \cdots 010$.

We have seen before, in corollary 2, that the codeword length achieved by the weighting method is smaller than the description length corresponding to any model $\mathcal{S} \subset \mathcal{T}_D$. So from the perspective of compression we should not select the model first using the maximizing method, specify it to the decoder, and then use the techniques from section 7 for coding with a known model and unknown parameters. What then is the advantage of this two-pass procedure? A reason to choose this method could be that the decoder complexity is lower than that of the weighting method. The decoder only needs to keep track of the counts $(a_s, b_s)$ for $s \in \mathcal{S}$ instead of for all $s \in \mathcal{T}_D$. Moreover, the computation of the coding distribution can be done as in (34), which is much simpler than the tree-recursive weighting method described in (44). Therefore the two-pass method achieves a decoder complexity, both in storage and in number of computations, that is lower than before, and this can be important in some applications.

In addition to this, we mention that it is possible to select a model not sequentially, but recursively on the context tree. This has the advantage that it can decrease the storage complexity enormously, at the expense of an increased computational complexity of course. Furthermore it is possible to modify this recursive model selection procedure, so that it does not have to find the model that achieves the minimum description length, but a model with a description length which is not much larger. This again decreases the (computational) complexity. In subsection 9.3 we deal with recursive model selection.

9.2 Model estimation

It is known from the literature (see Rissanen [18], Barron [2], and Barron and Cover [3]) that minimum description length models can be good estimators of the actual model. We will show here that our minimum description length estimator, the context tree maximizing procedure, produces the actual model for all sufficiently large T with probability one, under certain (natural) conditions on the actual source. We discuss these conditions first.

Definition 16 : A postfix set $\mathcal{S}$ is said to be closed if the generator of each postfix $s \in \mathcal{S}$ belongs to $\mathcal{S}$ or is a postfix of an $s \in \mathcal{S}$. The generator of a postfix $s = b_{1-l}\cdots b_{-1}b_0$ is $b_{1-l}\cdots b_{-1}$. The closure $\mathcal{S}^*$ of a postfix set $\mathcal{S}$ is the smallest closed postfix set that contains $\mathcal{S}$. To the closure $\mathcal{S}^*$ of a postfix set $\mathcal{S}$ we can associate a finite state machine defined source (see Rissanen [17]), since every postfix $s \in \mathcal{S}^*$ followed by a binary symbol must have a postfix in $\mathcal{S}^*$. To each postfix $s' \in \mathcal{S}^*$ there corresponds a state; its parameter $\theta_{s'}$ is equal to $\theta_s$ if $s \in \mathcal{S}$ is a postfix of $s'$. We say that a source $(\mathcal{S}, \Theta_{\mathcal{S}})$ is irreducible if and only if the states of the finite state machine defined source associated with the closure $\mathcal{S}^*$ of its postfix set $\mathcal{S}$, with parameters determined by the parameter vector $\Theta_{\mathcal{S}}$, form an irreducible set, i.e. if from each state in this set we can reach any other state in the set in one or more transitions (see Gallager [7], p. 65). A source $(\mathcal{S}, \Theta_{\mathcal{S}})$ is minimal if and only if for any pair of postfixes $0s, 1s \in \mathcal{S}$ we have that $\theta_{0s} \neq \theta_{1s}$. △

Example 9 : Consider the postfix set $\mathcal{S} = \{00, 010, 110, 1\}$. This set is not closed since the generator 01 of 010 and the generator 11 of 110 are missing. The closure of this set is $\mathcal{S}^* = \{00, 010, 110, 01, 11\}$. Note that there are strings in $\mathcal{S}$ that do not completely specify future statistical behavior. E.g. postfix 1 determines the generation of the next digit, but if this digit is a 0 we are in trouble. Assuming that $\theta_1$ must be 0, in order to avoid the generation of a 0, is inconsistent with the fact that 010 and 110 are relevant (occurring) postfixes. Postfixes in a closed postfix set do not suffer from this phenomenon; they specify the future statistics completely.
The source $(\mathcal{S}, \Theta_{\mathcal{S}})$ with $\mathcal{S} = \{0, 1\}$, $\theta_0 = 0$ and $\theta_1 = 1$ is not irreducible. There are two irreducible sets, $\{0\}$ and $\{1\}$. The source $(\mathcal{S}, \Theta_{\mathcal{S}})$ with $\mathcal{S} = \{0, 1\}$ and $\theta_0 = \theta_1 = \theta$ is not minimal. The source $\mathcal{S} = \{\lambda\}$ with $\theta_\lambda = \theta$ has identical behavior, and 'costs' less. △

Now we can state the theorem that informs us about the quality of the context tree maximizing estimator.

Theorem 8 : For an irreducible and minimal source $(\mathcal{S}, \Theta_{\mathcal{S}})$ we have that

$$
\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0) \rightarrow \mathcal{S}, \qquad (69)
$$

for any sequence $x_{-\infty}^0$ of past symbols, for all sufficiently large T, with probability one. The underlining indicates that $\underline{x}^T$ is a random variable. The proof of this theorem can be found in appendix E.

It is clear that the conditions are not too restrictive. If a source is not irreducible, there is either more than one irreducible set, or there are transient states. If there is more than one irreducible set, the estimator will find the model corresponding to the set in which the actual sequence moves. If there are transient states, the estimator will not discover them. It is also easy to see that a source which is not minimal will not be estimated. Although there exist other model estimators (for an overview see Kieffer [11]), the context tree maximizing method can be regarded as a first efficient method that converges, for all T large enough, with probability one to the actual model.
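To get a feeling for theorem 8, the following self-contained sketch draws data from a small irreducible and minimal source and applies the maximizing recursion of definitions 14 and 15 to prefixes of growing length. The source, its parameter values, and all helper names are illustrative assumptions of ours; the computation is done in the log domain to handle long sequences. For large T the printed set should typically coincide with the postfix set of the source.

```python
import math, random

D = 3
SOURCE = {'00': 0.1, '10': 0.3, '1': 0.8}     # postfix set with theta_s = P(next = 1 | s)

def theta(ctx):
    """Parameter of the postfix in SOURCE matching ctx (rightmost symbol = most recent)."""
    for s, th in SOURCE.items():
        if ctx.endswith(s):
            return th
    raise ValueError('no matching postfix')

def log_kt(a, b):
    """log2 of the Krichevsky-Trofimov estimate P_e(a, b)."""
    v = sum(math.log2(i + 0.5) for i in range(a))
    v += sum(math.log2(j + 0.5) for j in range(b))
    return v - sum(math.log2(k) for k in range(1, a + b + 1))

def counts(x, past):
    """Counts (a_s, b_s) for every context s with l(s) <= D, as in definition 15."""
    c, full = {}, past + x
    for t in range(len(past), len(full)):
        sym = full[t]
        for l in range(D + 1):
            s = full[t - l:t]
            a, b = c.get(s, (0, 0))
            c[s] = (a + 1, b) if sym == '0' else (a, b + 1)
    return c

def maximize(s, c):
    """Return (log2 P_m,s , S_m,s) following (64) and (65)."""
    a, b = c.get(s, (0, 0))
    if len(s) == D:
        return log_kt(a, b), {''}
    l0, s0 = maximize('0' + s, c)
    l1, s1 = maximize('1' + s, c)
    if l0 + l1 > log_kt(a, b):
        return -1.0 + l0 + l1, {u + '0' for u in s0} | {u + '1' for u in s1}
    return -1.0 + log_kt(a, b), {''}

random.seed(1)
symbols = list('010')                          # the past symbols, then the generated ones
for T in (100, 1000, 10000):
    while len(symbols) - 3 < T:
        ctx = ''.join(symbols[-3:])            # three symbols suffice to find the state here
        symbols.append('1' if random.random() < theta(ctx) else '0')
    x = ''.join(symbols[3:])
    print(T, sorted(maximize('', counts(x, '010'))[1]))
```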

9.3 Recursive model selection

In the previous subsection we have discussed the sequential context tree maximizing method. To process the sequence $x_1^T$ sequentially, we update counts, estimated probabilities, maximized probabilities, and maximizing sets, for each symbol of the sequence $x_1^T$. After having processed the entire sequence we can find the maximized coding probability and the maximizing postfix set in the root node of the context tree.

It is also possible to find this probability and set using a method that works recursively (we will not discuss methods that are partly sequential and partly recursive) on the context tree $\mathcal{T}_D$. We start with the root node. When we process a node $s$ we first collect from $x_1^T$ the counts $a_s$ and $b_s$ and compute from these the estimated probability $P_e(a_s, b_s)$. Only when it is necessary do we process the nodes $0s$ and $1s$ to obtain $P^D_{m,0s}\cdot P^D_{m,1s}$, so that we can compute the maximized probability $P^D_{m,s}$ and the maximizing set $\mathcal{S}^D_{m,s}$. The storage complexity of this method is negligible. The computational complexity is proportional to the number of nodes that are processed. This number is essentially equal to the number of nodes encountered in the sequential procedure, if we consider it unnecessary to process the nodes descending from nodes $s$ with $a_s + b_s = 0$. Taking into account that each processed node requires a scan of the entire sequence $x_1^T$, this approach does not seem very attractive. Nevertheless there could be an advantage, if there exist methods to detect that it is unnecessary to process descending nodes. The lemma below provides a tool that leads to such a method. It is proved in appendix F.

Lemma 4 : For a node $s \in \mathcal{T}_D$ with $l(s) < D$ we have
$$
P^D_{m,0s}\cdot P^D_{m,1s} \le \left\{
\begin{array}{ll}
P_e(a_s,0)P_e(0,b_s) & \mbox{if } l(s) = D-1, \\
\frac{1}{4}P_e(a_s,0)P_e(0,b_s) & \mbox{for } 0 \le l(s) < D-1.
\end{array}\right. \qquad (70)
$$

It follows from the lemma that if $b_s = 0$ then $P_e(a_s, b_s) = P_e(a_s, 0) \ge P^D_{m,0s}\cdot P^D_{m,1s}$ and further processing is unnecessary. So if a node contains either only zeros or only ones, processing of its descendants is not needed. Note that this is more general than the observation that the descendants of nodes $s$ with $a_s + b_s = 0$ need not be processed. In general we can compare, for a processed node, the estimated probability $P_e(a_s, b_s)$ to the upper bound on the product of the maximized probabilities of its children as given by lemma 4. If the estimated probability is larger than the bound, we need not process the descendants of the node. It can be useful to replace the terms $P_e(a_s, 0)$ and $P_e(0, b_s)$ in the upper bound by upper bounds again. From the proof of lemma 1 (see appendix A) we know that $\Delta(a, 0) \le \lim_{a\to\infty}\Delta(a, 0) = \sqrt{1/\pi}$, and consequently

$$
P_e(a, 0) \le \sqrt{\frac{1}{\pi a}}, \quad \mbox{for } a > 0. \qquad (71)
$$
This elementary (and very tight) bound can be used to simplify the comparisons. It can be shown that the bounds given in lemma 4 are tight. This is the case when the node $s$ splits up into two nodes such that one has counts $(a_s, 0)$ and the other $(0, b_s)$.
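The bound (71) is easy to check numerically. The small sketch below (helper name is ours; the sequential product form of $P_e(a,0)$ is used to avoid overflow) prints the ratio $P_e(a,0)\big/\sqrt{1/(\pi a)}$, which stays below 1 and approaches 1 as $a$ grows, confirming that the bound is elementary but tight.

```python
import math

def kt_zero(a):
    """P_e(a, 0), via the sequential update P_e(i+1, 0) = P_e(i, 0) * (i + 1/2) / (i + 1)."""
    p = 1.0
    for i in range(a):
        p *= (i + 0.5) / (i + 1)
    return p

for a in (1, 2, 5, 10, 100, 1000, 10000):
    print(a, kt_zero(a) / math.sqrt(1.0 / (math.pi * a)))
```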

9.4 Suboptimal recursive model selection

If we use the method described in the previous subsection, and the computational and storage complexity are still very large, we can try to find a model that has a description length close to the minimum description length but is not necessarily minimal. Therefore we modify the previously discussed method. The following definition describes this modification (for $\alpha = 0$ it defines the optimal recursive method again).

Definition 17 : Let $\alpha \ge 0$. Start in the root node. When we process a node $s \in \mathcal{T}_D$ we collect the counts $a_s$ and $b_s$ and compute the estimated probability $P_e(a_s, b_s)$. Only when
$$
P_e(a_s, b_s) < 2^{-\alpha(a_s+b_s)}\cdot\left\{
\begin{array}{ll}
P_e(a_s,0)P_e(0,b_s) & \mbox{for } l(s) = D-1, \\
\frac{1}{4}P_e(a_s,0)P_e(0,b_s) & \mbox{for } 0 \le l(s) < D-1,
\end{array}\right. \qquad (72)
$$
do we process the nodes $0s$ and $1s$ to obtain $P^D_{m,0s}\cdot P^D_{m,1s}$, so that we can compute the maximized probability $P^D_{m,s}$ and the maximizing set $\mathcal{S}^D_{m,s}$ as in (64) and (65). If (72) is not satisfied, we set $P^D_{m,s} = \frac{1}{2}P_e(a_s, b_s)$ and $\mathcal{S}^D_{m,s} = \{\lambda\}$ and leave all descendants of $s$ unprocessed. △

The performance of this suboptimal recursive tree maximizing method follows from the next theorem. The proof of this theorem can be found in appendix G.

Theorem 9 : Let $\alpha \ge 0$. The suboptimal recursive context tree maximizing method achieves, for any sequence $x_1^T \in \{0,1\}^T$,

$$
\log\frac{1}{P_m(x_1^T|x_{-\infty}^0)} = \log\frac{1}{P_c(x_1^T|x_{-\infty}^0, \mathcal{S}_m(x_1^T|x_{-\infty}^0))} + \Gamma_D(\mathcal{S}_m(x_1^T|x_{-\infty}^0)) \le \Lambda_D(x_1^T|x_{-\infty}^0) + \alpha T, \qquad (73)
$$
given any sequence $x_{-\infty}^0$ of past symbols.

It is clear that the number of processed nodes decreases if we increase $\alpha$. Taking $\alpha = 0$ yields optimal recursive model selection but also maximal complexity. Increasing $\alpha$ to 1 results in a possibly poor model $\{\lambda\}$, but also leads to minimal complexity, i.e. we only have to process the root node. This follows from the inequality $P_e(a, b) \ge 2^{-(a+b)}P_e(a, 0)P_e(0, b)$, which can be proved by induction on $a + b$. It is hard to find explicit relations between the number of processed nodes and the parameter $\alpha$.
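The following sketch is one possible reading of definition 17: counts are collected by a fresh scan of the sequence for every processed node, as described in subsection 9.3, and the test (72) decides whether the children are visited. All names are our own, the computation is done in the log domain, and with $\alpha = 0$ the procedure reduces to the optimal recursive method; by theorem 9 the resulting description length exceeds the minimum by at most $\alpha T$ bits.

```python
import math

def log_kt(a, b):
    """log2 of the Krichevsky-Trofimov estimate P_e(a, b)."""
    v = sum(math.log2(i + 0.5) for i in range(a))
    v += sum(math.log2(j + 0.5) for j in range(b))
    return v - sum(math.log2(k) for k in range(1, a + b + 1))

def node_counts(s, x, past):
    """One scan of the sequence: the counts (a_s, b_s) for the single context s."""
    full, a, b = past + x, 0, 0
    for t in range(len(past), len(full)):
        if t >= len(s) and full[t - len(s):t] == s:
            if full[t] == '0':
                a += 1
            else:
                b += 1
    return a, b

def select(s, x, past, D, alpha):
    """Suboptimal recursive maximizing; returns (log2 P_m,s , S_m,s relative to s)."""
    a, b = node_counts(s, x, past)
    lkt = log_kt(a, b)
    if len(s) == D:
        return lkt, {''}
    slack = 0.0 if len(s) == D - 1 else -2.0      # log2 of the factor 1 or 1/4 in (72)
    if lkt >= -alpha * (a + b) + slack + log_kt(a, 0) + log_kt(0, b):
        return -1.0 + lkt, {''}                   # (72) not satisfied: boundary node
    l0, s0 = select('0' + s, x, past, D, alpha)
    l1, s1 = select('1' + s, x, past, D, alpha)
    if l0 + l1 > lkt:                             # combine the children as in (64)-(65)
        return -1.0 + l0 + l1, {u + '0' for u in s0} | {u + '1' for u in s1}
    return -1.0 + lkt, {''}
```

Increasing `alpha` makes the stopping test easier to pass and therefore prunes more of the tree, at the cost of a possibly larger description length.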

10 Lower bounds on the (maximum) redundancy

10.1 A general bound

Consider a code for the sequences $x_1^T \in \{0,1\}^T$ generated by a source $(\mathcal{S}, \Theta_{\mathcal{S}})$. If this code is uniquely decodable, then its codeword lengths satisfy the Kraft inequality (see Cover and Thomas [4], p. 90), i.e.

$$
\sum_{x_1^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} 2^{-L(x_1^T|x_{-\infty}^0)} \le 1, \quad \mbox{for any } x_{-\infty}^0. \qquad (74)
$$

This code induces a normalized dyadic distribution over the sequences $x_1^T \in \{0,1\}^T$ in the following way:

$$
P_d(x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} \left\{
\begin{array}{ll}
\displaystyle\frac{2^{-L(x_1^T|x_{-\infty}^0)}}{\sum_{x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0} 2^{-L(x_1^T|x_{-\infty}^0)}} & \mbox{if } P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0, \\
0 & \mbox{if } P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) = 0.
\end{array}\right. \qquad (75)
$$

Using (74) and (75) we can write, for the redundancy of any $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0$, given any $x_{-\infty}^0$, that

$$
\begin{array}{rcl}
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) &=& L(x_1^T|x_{-\infty}^0) - \log\frac{1}{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})} \\
&=& \log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} + \log\frac{1}{\sum_{x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0} 2^{-L(x_1^T|x_{-\infty}^0)}} \\
&\ge& \log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)}, \quad \mbox{for } P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0. \qquad (76)
\end{array}
$$

This implies that a lower bound on $\log(P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})/P_d(x_1^T|x_{-\infty}^0))$ is also a lower bound on the redundancy $\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$. This will be used several times in the following subsections.
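The inequality (76) can be illustrated with a toy example. In the sketch below all choices (a memoryless source and mismatched Shannon code lengths) are our own assumptions: we build codeword lengths that satisfy the Kraft inequality, form the induced normalized dyadic distribution, and check that $\log(P_a/P_d)$ never exceeds the individual redundancy.

```python
import math
from itertools import product

theta, theta_code, T = 0.3, 0.6, 3               # toy source and a mismatched code
seqs = [''.join(bits) for bits in product('01', repeat=T)]
Pa = {x: (1 - theta) ** x.count('0') * theta ** x.count('1') for x in seqs}
Q  = {x: (1 - theta_code) ** x.count('0') * theta_code ** x.count('1') for x in seqs}

L = {x: math.ceil(-math.log2(Q[x])) for x in seqs}        # Shannon lengths for Q
assert sum(2.0 ** -L[x] for x in seqs) <= 1.0 + 1e-12     # Kraft inequality (74)

norm = sum(2.0 ** -L[x] for x in seqs if Pa[x] > 0)
Pd = {x: 2.0 ** -L[x] / norm for x in seqs}               # normalized dyadic distribution (75)

for x in seqs:
    rho = L[x] - math.log2(1.0 / Pa[x])                   # individual redundancy
    assert rho + 1e-9 >= math.log2(Pa[x] / Pd[x])         # the lower bound (76)
print('lower bound (76) verified on all', len(seqs), 'sequences')
```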

10.2 Known source

The maximum individual redundancy over all sequences $x_1^T \in \{0,1\}^T$ is easy to bound from below.

Theorem 10 : For a source $(\mathcal{S}, \Theta_{\mathcal{S}})$ and any uniquely decodable code

$$
\max_{x_1^T \in \{0,1\}^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} \rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \ge 0, \qquad (77)
$$

for any sequence $x_{-\infty}^0$ of past symbols.

Proof : Using (76) we find that

$$
\max_{x_1^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} \rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \;\ge\; \max_{x_1^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0} \log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} \;\ge\; 0, \qquad (78)
$$

for any $x_{-\infty}^0$. □

If we compare this with the upper bound on the redundancy obtained in theorem 2, we see that there is a gap of less than 2 bits.

10.3 Sources with known model and unknown parameters

Definition 18 : We define for all $x_1^T \in \{0,1\}^T$
$$
d(x_1^T|x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \sum_{s\in\mathcal{S}: a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\frac{\pi}{2}\right), \qquad (79)
$$

for a given model $\mathcal{S}$ and any sequence $x_{-\infty}^0$ of past symbols, where $a_s = a_s(x_1^T|x_{-\infty}^0)$ and $b_s = b_s(x_1^T|x_{-\infty}^0)$. △

Theorem 11 : For a source with model $\mathcal{S} \subset \mathcal{T}_D$, and any uniquely decodable code,

$$
\max_{\Theta_{\mathcal{S}}\in[0,1]^{|\mathcal{S}|},\; x_1^T\in\{0,1\}^T : P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S})\right) \ge 0, \qquad (80)
$$

for any sequence $x_{-\infty}^0$ of past symbols.

Proof : Things are a little bit more complicated now. We slightly modify the maximum likelihood approach that was suggested by Shtarkov [22]. Instead of maximizing the probability of the sequence $x_1^T$ given the past $x_{-\infty}^0$ over all parameters in the vector $\Theta_{\mathcal{S}}$, we maximize a biased probability. We obtain a biased maximum likelihood code by defining the distribution

$$
P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \frac{M(x_1^T|x_{-\infty}^0, \mathcal{S})}{M(x_{-\infty}^0, \mathcal{S})}, \quad\mbox{where}\quad M(x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \sum_{x_1^T} M(x_1^T|x_{-\infty}^0, \mathcal{S}) \quad\mbox{and}\quad M(x_1^T|x_{-\infty}^0, \mathcal{S}) \stackrel{\Delta}{=} \max_{\Theta_{\mathcal{S}}} \frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0, \mathcal{S})}}. \qquad (81)
$$

Now consider any uniquely decodable code. Then for any $x_{-\infty}^0$ we obtain for the maximum biased individual redundancy
$$
\begin{array}{l}
\displaystyle\max_{\Theta_{\mathcal{S}},\,x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S})\right) \\
\displaystyle\quad\ge \max_{\Theta_{\mathcal{S}},\,x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} - d(x_1^T|x_{-\infty}^0, \mathcal{S})\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log\frac{\max_{\Theta_{\mathcal{S}}} P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})/2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}}{P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S})} + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S})}{P_d(x_1^T|x_{-\infty}^0)}\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log M(x_{-\infty}^0, \mathcal{S}) + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0, \mathcal{S})}{P_d(x_1^T|x_{-\infty}^0)}\right) \;\ge\; \log M(x_{-\infty}^0, \mathcal{S}). \qquad (82)
\end{array}
$$

The first inequality holds since the maximum of the actual probabilities over all $\Theta_{\mathcal{S}}$, given $x_{-\infty}^0$, is certainly not 0. We are ready with the proof if we can show that $M(x_{-\infty}^0, \mathcal{S}) \ge 1$:
$$
\begin{array}{rcl}
M(x_{-\infty}^0, \mathcal{S}) &=& \displaystyle\sum_{x_1^T}\max_{\Theta_{\mathcal{S}}}\frac{P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}} \\
&=& \displaystyle\sum_{x_1^T}\frac{\prod_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{a_s}{a_s+b_s}\right)^{a_s}\left(\frac{b_s}{a_s+b_s}\right)^{b_s}}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}} \\
&\ge& \displaystyle\sum_{x_1^T}\frac{\prod_{s\in\mathcal{S}:a_s+b_s>0}\sqrt{\frac{\pi}{2}}\,P_e(a_s,b_s)\sqrt{a_s+b_s}}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}} \\
&=& \displaystyle\sum_{x_1^T}\prod_{s\in\mathcal{S}}P_e(a_s,b_s) \\
&=& \displaystyle\sum_{x_1^T}P_c(x_1^T|x_{-\infty}^0,\mathcal{S}) = 1. \qquad (83)
\end{array}
$$
It will be clear that in this derivation the first inequality follows from lemma 1. □

The theorem says that, no matter what code is used, there is always a sequence $x_1^T$ and a parameter vector $\Theta_{\mathcal{S}}$ such that
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \ge \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\frac{\pi}{2}\right), \qquad (84)
$$
given $x_{-\infty}^0$ and model $\mathcal{S}$. From inspection of the proof of theorem 4 it follows that the method described there achieves, for all $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}}) > 0$,
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) < \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + 1\right) + 2. \qquad (85)
$$
We observe a gap of 2 bits (coding redundancy) and, in addition to that, a gap of $\frac{1}{2}\log\frac{8}{\pi} = 0.674$ bits per parameter. Asymptotic considerations can make these last gaps smaller, but we prefer absolute bounds here.
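The key step of the proof, $M(x_{-\infty}^0, \mathcal{S}) \ge 1$, can be checked numerically for the simplest model $\mathcal{S} = \{\lambda\}$ (a memoryless source). The sketch below, with our own helper names, sums the biased maximized likelihoods of (81) over all sequences of length T; every printed value should be at least 1, in line with (83).

```python
import math
from itertools import product

def biased_ml_mass(T):
    """Sum over x in {0,1}^T of max_theta P_a(x|theta) / 2^{d(x|S)} for S = {lambda}."""
    total = 0.0
    for bits in product('01', repeat=T):
        a, b = bits.count('0'), bits.count('1')
        ml = (a / T) ** a * (b / T) ** b                        # maximum likelihood; 0**0 == 1
        d = 0.5 * math.log2(T) + 0.5 * math.log2(math.pi / 2)   # (79) with one active state
        total += ml / 2 ** d
    return total

for T in (1, 2, 4, 8, 12):
    print(T, biased_ml_mass(T))      # each value should be >= 1
```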

10.4 Sources with unknown model and parameters

A lower bound on the maximum redundancy can only be given for asymptotically large T, i.e. for sequences of codes.

Theorem 12 : Any sequence of uniquely decodable codes, one for each T, satisfies

$$
\liminf_{T\to\infty}\ \max_{\mathcal{S}\subset\mathcal{T}_D,\,\Theta_{\mathcal{S}}\in[0,1]^{|\mathcal{S}|},\,x_1^T\in\{0,1\}^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \ge 0, \qquad (86)
$$
for any sequence of past symbols $x_{-\infty}^0$.

Proof : We redefine the biased maximum likelihood code as follows:

$$
P_{mp}(x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} \frac{N(x_1^T|x_{-\infty}^0)}{N(x_{-\infty}^0)}, \quad\mbox{where}\quad N(x_{-\infty}^0) \stackrel{\Delta}{=} \sum_{x_1^T}N(x_1^T|x_{-\infty}^0) \quad\mbox{and}\quad N(x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} \max_{\mathcal{S},\Theta_{\mathcal{S}}}\frac{P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})+\Gamma_D(\mathcal{S})}}. \qquad (87)
$$
Now we obtain, along the same lines as (82), that

$$
\begin{array}{l}
\displaystyle\max_{\mathcal{S},\Theta_{\mathcal{S}},x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \\
\displaystyle\quad\ge \max_{\mathcal{S},\Theta_{\mathcal{S}},x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\log\frac{P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})}{P_d(x_1^T|x_{-\infty}^0)} - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log\frac{\max_{\mathcal{S},\Theta_{\mathcal{S}}} P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})/2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})+\Gamma_D(\mathcal{S})}}{P_{mp}(x_1^T|x_{-\infty}^0)} + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0)}{P_d(x_1^T|x_{-\infty}^0)}\right) \\
\displaystyle\quad= \max_{x_1^T}\left(\log N(x_{-\infty}^0) + \log\frac{P_{mp}(x_1^T|x_{-\infty}^0)}{P_d(x_1^T|x_{-\infty}^0)}\right) \;\ge\; \log N(x_{-\infty}^0), \qquad (88)
\end{array}
$$

for any $x_{-\infty}^0$, and any uniquely decodable code. For $N(x_{-\infty}^0)$ we can now write

$$
\begin{array}{rcl}
N(x_{-\infty}^0) &=& \displaystyle\sum_{x_1^T}\max_{\mathcal{S},\Theta_{\mathcal{S}}}\frac{P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})}{2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})+\Gamma_D(\mathcal{S})}} \\
&=& \displaystyle\sum_{x_1^T}\max_{\mathcal{S}}\frac{\max_{\Theta_{\mathcal{S}}}P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})/2^{d(x_1^T|x_{-\infty}^0,\mathcal{S})}}{2^{\Gamma_D(\mathcal{S})}} \\
&\ge& \displaystyle\sum_{x_1^T}\max_{\mathcal{S}}\frac{P_c(x_1^T|x_{-\infty}^0,\mathcal{S})}{2^{\Gamma_D(\mathcal{S})}}, \quad\mbox{for any } x_{-\infty}^0. \qquad (89)
\end{array}
$$
Now we define the joint distribution

$$
P_c(\mathcal{S}, x_1^T|x_{-\infty}^0) \stackrel{\Delta}{=} P_c(\mathcal{S})\cdot P_c(x_1^T|x_{-\infty}^0, \mathcal{S}), \quad\mbox{where}\quad P_c(\mathcal{S}) \stackrel{\Delta}{=} 2^{-\Gamma_D(\mathcal{S})}, \qquad (90)
$$

with marginals and conditional distributions derived from it, denoted by $P_c(\cdot)$. Then

$$
\begin{array}{rcl}
\log N(x_{-\infty}^0) &\ge& \displaystyle\log\sum_{x_1^T}\max_{\mathcal{S}}\frac{P_c(x_1^T|x_{-\infty}^0,\mathcal{S})}{2^{\Gamma_D(\mathcal{S})}} \\
&=& \displaystyle\log\sum_{x_1^T}\max_{\mathcal{S}}P_c(\mathcal{S}, x_1^T|x_{-\infty}^0) \\
&\ge& \displaystyle\log\sum_{x_1^T}\sum_{\mathcal{S}}P_c(\mathcal{S}|x_1^T, x_{-\infty}^0)\cdot P_c(\mathcal{S}, x_1^T|x_{-\infty}^0) \\
&\ge& \displaystyle\sum_{\mathcal{S},x_1^T}P_c(\mathcal{S}, x_1^T|x_{-\infty}^0)\log P_c(\mathcal{S}|x_1^T, x_{-\infty}^0) \\
&=& -H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0), \quad\mbox{for any } x_{-\infty}^0. \qquad (91)
\end{array}
$$

For this conditional entropy we can write, for any $x_{-\infty}^0$, using Fano's inequality, that
$$
\begin{array}{rcl}
0 \le H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0) &=& H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0, \mathcal{S}_m(\underline{x}^T|x_{-\infty}^0)) \\
&\le& H(\mathcal{S}|\mathcal{S}_m, x_{-\infty}^0) \\
&\le& h(P_c(\mathcal{S}_m \ne \mathcal{S}|x_{-\infty}^0)) + P_c(\mathcal{S}_m \ne \mathcal{S}|x_{-\infty}^0)\cdot\log(|\mathcal{T}_D|-1), \qquad (92)
\end{array}
$$
where $h(p) \stackrel{\Delta}{=} -p\log(p) - (1-p)\log(1-p)$ for $p\in[0,1]$ is the binary entropy function. The convergence with probability one in theorem 8 of $\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0)$ to $\mathcal{S}$ for irreducible and minimal FSMX sources $(\mathcal{S},\Theta_{\mathcal{S}})$ implies convergence in probability, i.e.

$$
\lim_{T\to\infty}P_a(\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0) \ne \mathcal{S}\,|\,x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) = 0, \quad\mbox{for any } x_{-\infty}^0. \qquad (93)
$$

Taking into account that $P_c(x_1^T|x_{-\infty}^0, \mathcal{S})$ is a Dirichlet weighting of $P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$ (see the last part of subsection 7.1), and that the set of non-irreducible or non-minimal sources has measure 0, we may conclude that

$$
\lim_{T\to\infty}P_c(\mathcal{S}_m(\underline{x}^T|x_{-\infty}^0) \ne \mathcal{S}\,|\,x_{-\infty}^0, \mathcal{S}) = 0, \quad\mbox{for all } \mathcal{S}\subset\mathcal{T}_D \mbox{ and } x_{-\infty}^0. \qquad (94)
$$

If we average over all models S ⊂ TD we finally obtain

$$
\lim_{T\to\infty}P_c(\mathcal{S}_m \ne \mathcal{S}\,|\,x_{-\infty}^0) = 0, \quad\mbox{for all } x_{-\infty}^0. \qquad (95)
$$
Substitution of (95) in (92) yields

$$
\lim_{T\to\infty}H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0) = 0, \quad\mbox{for all } x_{-\infty}^0. \qquad (96)
$$
Now we are finished if we note that, for any sequence of uniquely decodable codes,

$$
\begin{array}{l}
\displaystyle\liminf_{T\to\infty}\ \max_{\mathcal{S},\Theta_{\mathcal{S}},x_1^T : P_a(x_1^T|x_{-\infty}^0,\mathcal{S},\Theta_{\mathcal{S}})>0}\left(\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) - d(x_1^T|x_{-\infty}^0, \mathcal{S}) - \Gamma_D(\mathcal{S})\right) \\
\displaystyle\quad\ge \liminf_{T\to\infty}\left(-H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0)\right) = \lim_{T\to\infty}\left(-H(\mathcal{S}|\underline{x}^T, x_{-\infty}^0)\right) = 0, \quad\mbox{for all } x_{-\infty}^0. \qquad (97)
\end{array}
$$

□

Theorem 12 says that, for asymptotically large T, no matter what code is used, there is always a sequence $x_1^T$ and a source $(\mathcal{S}, \Theta_{\mathcal{S}})$ such that
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) \ge \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\frac{\pi}{2}\right) + \Gamma_D(\mathcal{S}), \qquad (98)
$$

given any $x_{-\infty}^0$. From inspection of the proof of theorem 5 it follows that, for asymptotically large T, the weighting method achieves, for all sources $(\mathcal{S}, \Theta_{\mathcal{S}})$ and sequences $x_1^T$ with $P_a(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) > 0$,
$$
\rho(x_1^T|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) < \sum_{s\in\mathcal{S}:a_s+b_s>0}\left(\frac{1}{2}\log(a_s+b_s) + \frac{1}{2}\log\pi\right) + \Gamma_D(\mathcal{S}) + 2, \qquad (99)
$$

for any $x_{-\infty}^0$. We have used here that, asymptotically for large $a + b$, the redundancy per parameter is not larger than $\frac{1}{2}\log(\pi(a+b))$, which can be derived as in the proof of lemma 1. Now we observe a gap of 2 bits (coding redundancy) and, in addition to that, a gap of half a bit per parameter.

11 Conclusion

We have investigated the context tree weighting procedure, which was presented recently at the IEEE International Symposium on Information Theory in San Antonio, Texas (see [24]). Compression and complexity of this new technique appear to be as desired. Analysis is simple. Moreover, a modification of this method, the context tree maximizing procedure, leads to a model estimator that converges with probability one, for all large enough sequence lengths, to the actual model.

Much more can be written about recursive weighting methods, and further results will be given in [25]. We mention a modification that makes it unnecessary to have access to $x_{-\infty}^0$, and an implementation that realizes infinite context tree depth D, which is still linear in T. This implementation makes it possible to guarantee that the weighted method achieves entropy for any stationary and ergodic source. It turns out that recursive weighting techniques exist for model classes that are much more general than the FSMX model class. Finally we remark that the methods that are described here for binary data can easily be extended to the non-binary case.

12 Acknowledgement

This research was carried out while the second author visited the Information Theory Group at Eindhoven University. The authors thank the Eindhovens Hogeschoolfonds for supporting this visit.

A Bounds for the Krichevsky-Trofimov estimator

Proof : Define
$$
\Delta(a,b) \stackrel{\Delta}{=} \frac{P_e(a,b)}{\frac{1}{\sqrt{a+b}}\left(\frac{a}{a+b}\right)^a\left(\frac{b}{a+b}\right)^b}. \qquad (100)
$$
First we assume that $a \ge 1$. Consider
$$
\frac{\Delta(a+1,b)}{\Delta(a,b)} = \frac{a^a(a+1/2)}{(a+1)^{a+1}}\cdot\left(\frac{a+b+1}{a+b}\right)^{a+b+1/2}. \qquad (101)
$$
To analyze (101) we define, for $t\in[1,\infty)$, the functions
$$
f(t) \stackrel{\Delta}{=} \ln\frac{t^t(t+1/2)}{(t+1)^{t+1}} \quad\mbox{and}\quad g(t) \stackrel{\Delta}{=} \ln\left(\frac{t+1}{t}\right)^{t+1/2}. \qquad (102)
$$
The derivatives of these functions are
$$
\frac{df(t)}{dt} = \ln\frac{t}{t+1} + \frac{1}{t+1/2} \quad\mbox{and}\quad \frac{dg(t)}{dt} = \ln\frac{t+1}{t} - \frac{t+1/2}{t(t+1)}. \qquad (103)
$$

Take $\alpha = \frac{1/2}{t+1/2}$ and observe that $0 < \alpha \le 1/3$. Then from
$$
\ln\frac{t}{t+1} = \ln\frac{1-\alpha}{1+\alpha} = -2\left(\alpha + \frac{\alpha^3}{3} + \frac{\alpha^5}{5} + \cdots\right) \le -2\alpha = -\frac{1}{t+1/2}, \qquad (104)
$$

we obtain that $\frac{df(t)}{dt} \le 0$. Therefore
$$
\frac{a^a(a+1/2)}{(a+1)^{a+1}} \ge \lim_{a\to\infty}\frac{a^a(a+1/2)}{(a+1)^{a+1}} = \frac{1}{e}. \qquad (105)
$$
Similarly, from
$$
\ln\frac{t+1}{t} = 2\left(\alpha + \frac{\alpha^3}{3} + \frac{\alpha^5}{5} + \cdots\right) \le 2(\alpha + \alpha^3 + \alpha^5 + \cdots) = \frac{2\alpha}{1-\alpha^2} = \frac{t+1/2}{t(t+1)}, \qquad (106)
$$

we may conclude that $\frac{dg(t)}{dt} \le 0$. This results in
$$
\left(\frac{a+b+1}{a+b}\right)^{a+b+1/2} \ge \lim_{a+b\to\infty}\left(\frac{a+b+1}{a+b}\right)^{a+b+1/2} = e. \qquad (107)
$$
Combining (105) and (107) yields that

$$
\Delta(a+1,b) \ge \Delta(a,b) \quad\mbox{for } a \ge 1. \qquad (108)
$$

Next we investigate the case where $a = 0$. Note that this implies that $b \ge 1$. Consider
$$
\frac{\Delta(1,b)}{\Delta(0,b)} = \frac{1}{2}\cdot\left(\frac{b+1}{b}\right)^{b+1/2}. \qquad (109)
$$

If we again use the fact that $\frac{dg(t)}{dt} \le 0$, we find that
$$
\Delta(1,b) \ge \frac{e}{2}\cdot\Delta(0,b). \qquad (110)
$$
Inequality (108), together with (110), now implies that

$$
\Delta(a+1,b) \ge \Delta(a,b) \quad\mbox{for } a \ge 0. \qquad (111)
$$

Therefore
$$
\Delta(1,0) = \Delta(0,1) \le \Delta(a,b) \le \lim_{a\to\infty,b\to\infty}\Delta(a,b). \qquad (112)
$$
The lemma now follows from the observation that $\Delta(1,0) = \Delta(0,1) = 1/2$, and the fact that

$$
\lim_{a\to\infty,b\to\infty}\Delta(a,b) = \lim_{a\to\infty,b\to\infty}\frac{\frac{(2a)!}{2^{2a}a!}\cdot\frac{(2b)!}{2^{2b}b!}\cdot\frac{1}{(a+b)!}}{\frac{1}{\sqrt{a+b}}\left(\frac{a}{a+b}\right)^a\left(\frac{b}{a+b}\right)^b} = \sqrt{\frac{2}{\pi}}. \qquad (113)
$$
This limit was evaluated using the Stirling approximation. Note that (112) implies that the bounds are tight. □
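The bounds established here, $1/2 \le \Delta(a,b) \le \sqrt{2/\pi}$, are easy to verify numerically. The sketch below uses our own helper names and the sequential product form of the estimator (to avoid overflow) and checks the bounds for all counts up to 60.

```python
import math

def kt(a, b):
    """Krichevsky-Trofimov estimate P_e(a, b), built up with the sequential updates."""
    p = 1.0
    for i in range(a):
        p *= (i + 0.5) / (i + 1)
    for j in range(b):
        p *= (j + 0.5) / (a + j + 1)
    return p

def delta(a, b):
    """Delta(a,b) of (100): P_e(a,b) / ((1/sqrt(a+b)) (a/(a+b))^a (b/(a+b))^b)."""
    n = a + b
    return kt(a, b) * math.sqrt(n) / ((a / n) ** a * (b / n) ** b)

lo, hi = 0.5, math.sqrt(2.0 / math.pi)
for a in range(60):
    for b in range(60):
        if a + b > 0:
            assert lo - 1e-12 <= delta(a, b) <= hi + 1e-12
print('1/2 <= Delta(a,b) <= sqrt(2/pi) holds for all a, b < 60')
```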

B Updating properties of weighted probabilities

Proof : By induction on $l(s)$. Observe that the lemma holds for $l(s) = D$. To see this, note that in that case the weighted probabilities are determined only by the Krichevsky-Trofimov estimator. If $s$ is a postfix of $x_{-\infty}^{t-1}$ then
$$
\begin{array}{rcl}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) &=& P_e(a_s(x_1^{t-1}|x_{-\infty}^0)+1,\, b_s(x_1^{t-1}|x_{-\infty}^0)) + P_e(a_s(x_1^{t-1}|x_{-\infty}^0),\, b_s(x_1^{t-1}|x_{-\infty}^0)+1) \\
&=& P_e(a_s(x_1^{t-1}|x_{-\infty}^0),\, b_s(x_1^{t-1}|x_{-\infty}^0)) \\
&=& P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0). \qquad (114)
\end{array}
$$
If $s$ is not a postfix of $x_{-\infty}^{t-1}$ then
$$
\begin{array}{rcl}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) &=& P_e(a_s(x_1^{t-1}, X_t=0|x_{-\infty}^0),\, b_s(x_1^{t-1}, X_t=0|x_{-\infty}^0)) \\
&=& P_e(a_s(x_1^{t-1}|x_{-\infty}^0),\, b_s(x_1^{t-1}|x_{-\infty}^0)) \\
&=& P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0), \qquad (115)
\end{array}
$$
and, similarly, we can show that $P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) = P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0)$.

Now assume that the lemma holds for $l(s) = d$, $0 < d \le D$. Consider nodes corresponding to strings $s$ with $l(s) = d-1$. First we assume that $s$ is a postfix of $x_{-\infty}^{t-1}$. Then $0s$ is a postfix of $x_{-\infty}^{t-1}$ and $1s$ is not, or vice versa. Let $0s$ be a postfix of $x_{-\infty}^{t-1}$. Then

$$
\begin{array}{l}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0)+1, b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)+1) \\
\qquad+ \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=1|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=1|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}\left(P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) + P^D_{w,0s}(x_1^{t-1}, X_t=1|x_{-\infty}^0)\right)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
\quad= \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
\quad= P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0). \qquad (116)
\end{array}
$$
The induction hypothesis is used to obtain the second and fourth equality. The proof is analogous when $1s$ is a postfix of $x_{-\infty}^{t-1}$. Secondly, we assume that $s$ is not a postfix of $x_{-\infty}^{t-1}$. Then neither $0s$ nor $1s$ is a postfix of $x_{-\infty}^{t-1}$. Thus
$$
\begin{array}{rcl}
P^D_{w,s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) &=& \frac{1}{2}P_e(a_s(x_1^{t-1}, X_t=0|x_{-\infty}^0), b_s(x_1^{t-1}, X_t=0|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}, X_t=0|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}, X_t=0|x_{-\infty}^0) \\
&=& \frac{1}{2}P_e(a_s(x_1^{t-1}|x_{-\infty}^0), b_s(x_1^{t-1}|x_{-\infty}^0)) + \frac{1}{2}P^D_{w,0s}(x_1^{t-1}|x_{-\infty}^0)\cdot P^D_{w,1s}(x_1^{t-1}|x_{-\infty}^0) \\
&=& P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0). \qquad (117)
\end{array}
$$
Similarly we can show that $P^D_{w,s}(x_1^{t-1}, X_t=1|x_{-\infty}^0) = P^D_{w,s}(x_1^{t-1}|x_{-\infty}^0)$. □

C Weighting properties of the coding distribution

Proof : To prove (56) we state a hypothesis: for a node $s$ in the weighted context tree $\mathcal{T}_D$ with $l(s) = d$ we have that
$$
P^D_{w,s} = \sum_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us}, b_{us}), \qquad (118)
$$
where the summation is over (complete and proper) postfix sets $\mathcal{U}$.

By induction we show that this hypothesis holds for $0 \le d \le D$. For $d = D$ the hypothesis holds. Next assume that it also holds for $0 < d \le D$. Now consider a node $s$ with $l(s) = d-1$; then
$$
\begin{array}{rcl}
P^D_{w,s} &=& \frac{1}{2}P_e(a_s,b_s) + \frac{1}{2}P^D_{w,0s}\cdot P^D_{w,1s} \\
&=& \frac{1}{2}P_e(a_s,b_s) + \frac{1}{2}\left(\displaystyle\sum_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\prod_{v\in\mathcal{V}}P_e(a_{v0s},b_{v0s})\right)\left(\displaystyle\sum_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\prod_{w\in\mathcal{W}}P_e(a_{w1s},b_{w1s})\right) \\
&=& 2^{-1}P_e(a_s,b_s) + \displaystyle\sum_{\mathcal{V},\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-1-\Gamma_{D-d}(\mathcal{V})-\Gamma_{D-d}(\mathcal{W})}\prod_{u\in\mathcal{V}\times 0\,\cup\,\mathcal{W}\times 1}P_e(a_{us},b_{us}) \\
&=& \displaystyle\sum_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}). \qquad (119)
\end{array}
$$
We have used the induction hypothesis in the second step of the derivation. The conclusion is that the hypothesis also holds for $d-1$, and by induction for all $0 \le d \le D$. The hypothesis for $d = 0$ proves (56).

The proof of (57) is similar. The hypothesis

$$
\sum_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})} = 1, \qquad (120)
$$
where the summation is over complete and proper postfix sets $\mathcal{U}$, holds for $d = D$. Assume that it holds for $0 < d \le D$; then
$$
\begin{array}{rcl}
\displaystyle\sum_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})} &=& 2^{-\Gamma_{D-d+1}(\{\lambda\})} + \displaystyle\sum_{\mathcal{V},\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-1-\Gamma_{D-d}(\mathcal{V})-\Gamma_{D-d}(\mathcal{W})} \\
&=& \frac{1}{2} + \frac{1}{2}\left(\displaystyle\sum_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\right)\cdot\left(\displaystyle\sum_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\right) \\
&=& \frac{1}{2} + \frac{1}{2} = 1, \qquad (121)
\end{array}
$$
and the hypothesis therefore holds for all $0 \le d \le D$, which proves (57). □
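The normalization hypothesis (120) can also be checked by brute force. The sketch below (our own helper names) enumerates all complete and proper binary postfix sets of depth at most D, computes their costs with $\Gamma_D(\mathcal{S}) = |\mathcal{S}| - 1 + |\{s \in \mathcal{S} : l(s) \ne D\}|$ as in (46), and verifies that the weights $2^{-\Gamma_D}$ sum to one.

```python
from fractions import Fraction

def postfix_sets(d):
    """All complete and proper binary postfix sets of depth at most d (sets of strings)."""
    if d == 0:
        return [{''}]
    sets = [{''}]                                     # the root alone, i.e. {lambda}
    for u0 in postfix_sets(d - 1):
        for u1 in postfix_sets(d - 1):                # split the root: combine two subtrees
            sets.append({s + '0' for s in u0} | {s + '1' for s in u1})
    return sets

def gamma(U, D):
    """Cost Gamma_D(U) = |U| - 1 + #{u in U : l(u) != D}."""
    return len(U) - 1 + sum(1 for u in U if len(u) != D)

for D in (1, 2, 3, 4):
    sets = postfix_sets(D)
    total = sum(Fraction(1, 2 ** gamma(U, D)) for U in sets)
    print(D, len(sets), total)       # the last column should always be 1
```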

D Maximizing properties of the coding distribution

Proof : To prove the lemma we state a hypothesis: for a node $s$ in the maximized context tree $\mathcal{T}_D$ with $l(s) = d$ we have that
$$
P^D_{m,s} = 2^{-\Gamma_{D-d}(\mathcal{S}^D_{m,s})}\prod_{u\in\mathcal{S}^D_{m,s}}P_e(a_{us},b_{us}) = \max_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}), \qquad (122)
$$
where the maximum is over (complete and proper) postfix sets $\mathcal{U}$.

By induction we show that this hypothesis holds for $0 \le d \le D$. For $d = D$ the hypothesis holds. Next assume that it also holds for $0 < d \le D$. Consider a node $s$ with $l(s) = d-1$; then
$$
\begin{array}{rcl}
P^D_{m,s} &=& \frac{1}{2}\max\left(P_e(a_s,b_s),\; P^D_{m,0s}\cdot P^D_{m,1s}\right) \\
&=& \frac{1}{2}\max\left(P_e(a_s,b_s),\; \left(\displaystyle\max_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\prod_{v\in\mathcal{V}}P_e(a_{v0s},b_{v0s})\right)\left(\displaystyle\max_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\prod_{w\in\mathcal{W}}P_e(a_{w1s},b_{w1s})\right)\right) \\
&=& \max\left(2^{-1}P_e(a_s,b_s),\; \displaystyle\max_{\mathcal{V},\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-1-\Gamma_{D-d}(\mathcal{V})-\Gamma_{D-d}(\mathcal{W})}\prod_{u\in\mathcal{V}\times 0\,\cup\,\mathcal{W}\times 1}P_e(a_{us},b_{us})\right) \\
&=& \displaystyle\max_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}). \qquad (123)
\end{array}
$$
We have used the induction hypothesis in the second step of the derivation. To see that $\mathcal{S}^D_{m,s}$ 'achieves' $P^D_{m,s}$, observe that $\{\lambda\}$ yields $P^D_{m,s}$ if $P_e(a_s,b_s) \ge P^D_{m,0s}\cdot P^D_{m,1s}$, and that $\mathcal{S}^D_{m,0s}\times 0 \cup \mathcal{S}^D_{m,1s}\times 1$ achieves $P^D_{m,s}$ if $P_e(a_s,b_s) < P^D_{m,0s}\cdot P^D_{m,1s}$. The conclusion is that the hypothesis also holds for $d-1$, and by induction for all $0 \le d \le D$. The hypothesis for $d = 0$ proves the lemma. □

E Convergence of context tree maximizing estimator

Proof : The proof of this theorem consists of several steps.

a) Consider an FSMX source $(\mathcal{S}, \Theta_{\mathcal{S}})$ and its associated finite state machine defined source. For all states $s' \in \mathcal{S}^*$ there is a probability $Q(s|s')$ corresponding to a transition to any other state $s \in \mathcal{S}^*$. These transition probabilities determine a stationary state distribution $Q(s)$, $s \in \mathcal{S}^*$, which is the solution of
$$
\sum_{s'\in\mathcal{S}^*}Q(s')Q(s|s') = Q(s), \mbox{ for all } s \in \mathcal{S}^*, \quad\mbox{and}\quad \sum_{s\in\mathcal{S}^*}Q(s) = 1. \qquad (124)
$$
Note that not all $Q(s)$ for $s \in \mathcal{S}^*$ can be 0. Since all these states $s$ form an irreducible set, this implies that $Q(s) > 0$ for all $s \in \mathcal{S}^*$.

b) The stationary state distribution $Q(s)$, $s \in \mathcal{S}^*$, determines the stationary sequence (and substring) distribution
$$
Q(x_1^t) = \sum_{s\in\mathcal{S}^*}Q(s)P_a(x_1^t|s, \mathcal{S}, \Theta_{\mathcal{S}}), \quad\mbox{for all } x_1^t\in\{0,1\}^t \mbox{ and } t = 0, 1, \cdots. \qquad (125)
$$
This stationary distribution makes the irreducible Markov source stationary and ergodic. Therefore the ergodic theorem (see e.g. Gray and Davisson [8], or Shields [21]) holds.

The ergodic theorem says that $a_s/T$ and $b_s/T$ converge, with probability one for all sufficiently large T, to $Q(s0)$ and $Q(s1)$ respectively, with respect to the stationary probability $Q(\cdot)$. We are interested, however, in convergence with respect to the actual distribution $P_a(\cdot|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$. Since all $Q(s) > 0$ for $s \in \mathcal{S}^*$, we may conclude that
$$
P_a(x_1^t|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}}) = P_a(x_1^t|s, \mathcal{S}, \Theta_{\mathcal{S}}) \le \frac{1}{Q(s)}Q(x_1^t) \quad\mbox{if } s \in \mathcal{S}^* \mbox{ is a postfix of } x_{-\infty}^0. \qquad (126)
$$
This implies, together with the ergodic theorem, that also $a_s/T$ and $b_s/T$ converge with probability one to $Q(s0)$ and $Q(s1)$ respectively, with respect to the actual probability $P_a(\cdot|x_{-\infty}^0, \mathcal{S}, \Theta_{\mathcal{S}})$, given any sequence $x_{-\infty}^0$ of past symbols.

c) Consider a source with actual model $\mathcal{S}$ and an estimated alternative source with model $\mathcal{W} \subset \mathcal{T}_D$, not equal to $\mathcal{S}$. We can now partition the model $\mathcal{S}$ into three parts. The first part consists of postfixes (strings) $s \in \mathcal{S}$ that are correctly estimated by the alternative model; hence for such a string $s$ there is a $w \in \mathcal{W}$ such that $w = s$. The second part consists of strings that are underestimated by $\mathcal{W}$, i.e. for each such $s$ there exists a $w \in \mathcal{W}$ such that $w$ is a postfix of $s$. All strings $s$ that are underestimated by the same $w \in \mathcal{W}$ form a proper subtree $\mathcal{S}|w$, which is complete in the sense that all semi-infinite sequences ending with $w$ must have a postfix in $\mathcal{S}|w$. The third part consists of those strings in $\mathcal{S}$ that are overestimated by $\mathcal{W}$. For such an $s$ there exist strings $w \in \mathcal{W}$ such that $s$ is a postfix of all these $w$'s. All strings $w$ that overestimate $s$ make up the proper and 'complete' subtree $\mathcal{W}|s$.

d) First consider the set of strings $\mathcal{S}|w \subset \mathcal{S}$ that are underestimated by $w \in \mathcal{W}$. Since all $Q(s) > 0$ for $s \in \mathcal{S}^*$, also $Q(s) > 0$ for $s \in \mathcal{S}$, and $Q(w) > 0$. This implies that, with probability one, $a_s + b_s > 0$ for all $s \in \mathcal{S}|w$ and $a_w + b_w > 0$, for all large enough T. We therefore investigate only this case. If the alternative string $w$ is not preferred by the context tree maximizing procedure over the set $\mathcal{S}|w$, then
$$
\log\frac{1}{P_e(a_w,b_w)} + \Gamma_D(\mathcal{W}|w) > \log\frac{1}{\prod_{s\in\mathcal{S}|w}P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|w) \qquad (127)
$$
$$
\Longleftrightarrow\quad \log\frac{\prod_{s\in\mathcal{S}|w}(1-\theta_s)^{a_s}\theta_s^{b_s}}{\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}} + \log\frac{\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}}{P_e(a_w,b_w)} + \Gamma_D(\mathcal{W}|w) > \log\frac{\prod_{s\in\mathcal{S}|w}(1-\theta_s)^{a_s}\theta_s^{b_s}}{\prod_{s\in\mathcal{S}|w}P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|w)
$$
$$
\Longleftarrow\quad \frac{a_w+b_w}{T}h\!\left(\frac{b_w}{a_w+b_w}\right) + \sum_{s\in\mathcal{S}|w}\left(\frac{a_s}{T}\log(1-\theta_s) + \frac{b_s}{T}\log\theta_s\right) + \frac{1}{2T}\log(a_w+b_w) + \frac{1}{2T}\log\frac{\pi}{2} + \frac{\Gamma_D(\mathcal{W}|w)}{T} > \sum_{s\in\mathcal{S}|w}\left(\frac{1}{2T}\log(a_s+b_s) + \frac{1}{T}\right) + \frac{\Gamma_D(\mathcal{S}|w)}{T}. \qquad (128)
$$
The cost $\Gamma_D(\mathcal{S}|w)$ of a postfix set $\mathcal{S}|w$ is defined as $|\mathcal{S}|w| - 1 + |\{s : s \in \mathcal{S}|w,\ l(s) \ne D\}|$, similar to the definition of $\Gamma_D(\mathcal{S})$ in (46). Now consider the two 'entropy'-terms in (128).

For these two terms we obtain, using a), that
$$
\frac{a_w+b_w}{T}h\!\left(\frac{b_w}{a_w+b_w}\right) + \sum_{s\in\mathcal{S}|w}\left(\frac{a_s}{T}\log(1-\theta_s) + \frac{b_s}{T}\log\theta_s\right) \;\rightarrow\; Q(w)\left(h\!\left(\sum_{s\in\mathcal{S}|w}\frac{Q(s)}{Q(w)}\theta_s\right) - \sum_{s\in\mathcal{S}|w}\frac{Q(s)}{Q(w)}h(\theta_s)\right) > 0, \qquad (129)
$$
where the convergence is with probability one, for sufficiently large T. To understand that the difference between the entropies is non-zero, note that $Q(s) > 0$ for all $s \in \mathcal{S}|w$, and (consequently) $Q(w) > 0$. Furthermore observe that not all $\theta_s$ for $s \in \mathcal{S}|w$ can be equal, since the source is minimal. Therefore the entropies cannot be the same. All other terms in (128) converge, with probability one, to 0. Therefore inequality (128) is satisfied with probability one. This implies that also (127) is satisfied for all sufficiently large T, with probability one.

e) Next we consider a string $s \in \mathcal{S}$ which is overestimated by the set of strings $\mathcal{W}|s \subset \mathcal{W}$. Note first that if only one $w \in \mathcal{W}|s$ has probability $Q(w) > 0$, there will be no overestimation. Therefore we consider the case where at least two strings $w$ in $\mathcal{W}|s$ have non-zero probability $Q(w)$. Denote by $\mathcal{W}^+|s$ the set of $w$'s with positive probability. If the context tree maximizing method prefers the string $s$ over the set $\mathcal{W}|s$, then
$$
\log\frac{1}{\prod_{w\in\mathcal{W}|s}P_e(a_w,b_w)} + \Gamma_D(\mathcal{W}|s) > \log\frac{1}{P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|s) \qquad (130)
$$
$$
\Longleftrightarrow\quad \log\frac{\prod_{w\in\mathcal{W}^+|s}\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}}{\prod_{w\in\mathcal{W}^+|s}P_e(a_w,b_w)} + \log\frac{(1-\theta_s)^{a_s}\theta_s^{b_s}}{\prod_{w\in\mathcal{W}^+|s}\left(\frac{a_w}{a_w+b_w}\right)^{a_w}\left(\frac{b_w}{a_w+b_w}\right)^{b_w}} + \Gamma_D(\mathcal{W}|s) > \log\frac{(1-\theta_s)^{a_s}\theta_s^{b_s}}{P_e(a_s,b_s)} + \Gamma_D(\mathcal{S}|s)
$$
$$
\Longleftarrow\quad -\sum_{w\in\mathcal{W}^+|s}(a_w+b_w)\,d\!\left(\frac{b_w}{a_w+b_w}\Big\|\theta_s\right) + \sum_{w\in\mathcal{W}^+|s}\left(\frac{1}{2}\log(a_w+b_w) + \frac{1}{2}\log\frac{\pi}{2}\right) + \Gamma_D(\mathcal{W}|s) > \frac{1}{2}\log(a_s+b_s) + 1 + \Gamma_D(\mathcal{S}|s). \qquad (131)
$$

Now we consider the 'divergence'-term in (131). Note that
$$
0 \le \sum_{w\in\mathcal{W}^+|s}(a_w+b_w)\,d\!\left(\frac{b_w}{a_w+b_w}\Big\|\theta_s\right) \le \log e\cdot\sum_{w\in\mathcal{W}^+|s}(a_w+b_w)\frac{\left(\frac{b_w}{a_w+b_w}-\theta_s\right)^2}{\theta_s(1-\theta_s)}. \qquad (132)
$$
The last inequality in (132) follows from $d(p\|q) \le \log e\cdot(p-q)^2/(q(1-q))$, which can be regarded as the upper bound counterpart of Pinsker's inequality (see Csiszár and Körner [5], p. 58) for the binary case. Now we apply the 'law of the iterated logarithm' (see Rényi [14], p. 337) to the 'variance'-term in (132). This term is overbounded by $\sum_{w\in\mathcal{W}^+|s}2\ln\ln(a_w+b_w)$ with probability one, for all sufficiently large T. Since there are at least two $w$'s in $\mathcal{W}|s$ with $Q(w) > 0$, the left-hand side of (131) increases essentially at least as fast as $\log T$, while the right-hand side increases like $\frac{1}{2}\log T$. The conclusion is that inequality (131) is satisfied with probability one, which implies that (130) is also satisfied with probability one, for all large enough T.

f) The rest of the proof is straightforward now. Each alternative model contains at least an underestimating string or an overestimating set of strings. With probability one, however, the alternative description length is larger than the description length corresponding to the actual model. Therefore with probability one the context tree maximizing method refuses to choose the alternative model. The number of alternatives is $|\mathcal{T}_D| - 1$, which is finite. Therefore the actual model is chosen by the context tree maximizing method, with probability one, for all sufficiently large T. □

F A condition to avoid deeper visits

Proof : We prove the lemma by induction. For nodes $s$ with $l(s) = D-1$ we have
$$
\begin{array}{rcl}
P^D_{m,0s}\cdot P^D_{m,1s} &=& P_e(a_{0s},b_{0s})P_e(a_{1s},b_{1s}) \\
&\le& P_e(a_{0s},0)P_e(0,b_{0s})P_e(a_{1s},0)P_e(0,b_{1s}) \\
&\le& P_e(a_s,0)P_e(0,b_s). \qquad (133)
\end{array}
$$

For nodes $s$ with $0 \le l(s) < D-1$ we get (by using the induction hypothesis) that
$$
P^D_{m,00s}\cdot P^D_{m,10s} \le P_e(a_{0s},0)P_e(0,b_{0s}). \qquad (134)
$$
If we combine this with the basic inequality
$$
P_e(a_{0s},b_{0s}) \le P_e(a_{0s},0)P_e(0,b_{0s}), \qquad (135)
$$
we obtain that
$$
P^D_{m,0s} \le \frac{1}{2}P_e(a_{0s},0)P_e(0,b_{0s}). \qquad (136)
$$
Similarly, we find
$$
P^D_{m,1s} \le \frac{1}{2}P_e(a_{1s},0)P_e(0,b_{1s}). \qquad (137)
$$
Combining (136) and (137) yields
$$
P^D_{m,0s}\cdot P^D_{m,1s} \le \frac{1}{2}P_e(a_{0s},0)P_e(0,b_{0s})\cdot\frac{1}{2}P_e(a_{1s},0)P_e(0,b_{1s}) \le \frac{1}{4}P_e(a_s,0)P_e(0,b_s), \qquad (138)
$$
and the hypothesis holds for $0 \le l(s) < D-1$. □

G Suboptimal recursive model selection

Proof : The set of nodes $s \in \mathcal{T}_D$ that are processed satisfies two conditions. First, if a node is processed, all its ancestors are processed. Secondly, if a node is processed, the other child of its parent is also processed. If a processed node $s$ does not satisfy condition (72) then, using lemma 4, we find that
$$
\begin{array}{rcl}
P^D_{m,s} &=& \frac{1}{2}P_e(a_s,b_s) \\
&\ge& \frac{1}{2}\,2^{-\alpha(a_s+b_s)}\max\left(P_e(a_s,b_s),\; P^{D,opt}_{m,0s}\cdot P^{D,opt}_{m,1s}\right) \\
&=& 2^{-\alpha(a_s+b_s)}\displaystyle\max_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}), \qquad (139)
\end{array}
$$
where $P^{D,opt}_{m,0s}$ and $P^{D,opt}_{m,1s}$ are the maximized probabilities that would occur in the optimal maximizing method. They are connected with the minimum description length, which explains the last step of (139). The processed nodes for which (72) is not satisfied are called boundary nodes. Processed nodes that are not boundary nodes have a 0-child and a 1-child, which are both processed. We now state the hypothesis: for a processed node $s$ in the suboptimally maximized context tree $\mathcal{T}_D$ with $l(s) = d$ we have that
$$
P^D_{m,s} = 2^{-\Gamma_{D-d}(\mathcal{S}^D_{m,s})}\prod_{u\in\mathcal{S}^D_{m,s}}P_e(a_{us},b_{us}) \ge 2^{-\alpha(a_s+b_s)}\max_{\mathcal{U}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}), \qquad (140)
$$
where the maximum is over (complete and proper) postfix sets $\mathcal{U}$.

By induction we show that this hypothesis holds for $0 \le d \le D$. For $d = D$ the hypothesis clearly holds. Next assume that it also holds for processed nodes with $0 < d \le D$. Now consider a processed node $s$ with $l(s) = d-1$. If $s$ is a boundary node, the hypothesis holds by (139). If $s$ is not a boundary node, it has children that were processed, and for these children the hypothesis (140) holds. Therefore
$$
\begin{array}{rcl}
P^D_{m,s} &=& \frac{1}{2}\max\left(P_e(a_s,b_s),\; P^D_{m,0s}\cdot P^D_{m,1s}\right) \\
&\ge& \frac{1}{2}\max\left(2^{-\alpha(a_s+b_s)}P_e(a_s,b_s),\; 2^{-\alpha(a_{0s}+b_{0s})}\displaystyle\max_{\mathcal{V}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{V})}\prod_{v\in\mathcal{V}}P_e(a_{v0s},b_{v0s})\cdot 2^{-\alpha(a_{1s}+b_{1s})}\displaystyle\max_{\mathcal{W}\subset\mathcal{T}_{D-d}}2^{-\Gamma_{D-d}(\mathcal{W})}\prod_{w\in\mathcal{W}}P_e(a_{w1s},b_{w1s})\right) \\
&=& 2^{-\alpha(a_s+b_s)}\displaystyle\max_{\mathcal{U}\subset\mathcal{T}_{D-d+1}}2^{-\Gamma_{D-d+1}(\mathcal{U})}\prod_{u\in\mathcal{U}}P_e(a_{us},b_{us}). \qquad (141)
\end{array}
$$
The induction hypothesis is again used in the second step of the derivation. Note that $\mathcal{S}^D_{m,s}$ also 'achieves' $P^D_{m,s}$. The conclusion is that the hypothesis also holds for $d-1$, and by induction for all $0 \le d \le D$. The hypothesis for $d = 0$ proves the lemma. □

References

[1] N. Abramson, Information Theory and Coding, New York: McGraw-Hill, 1963, pp. 61-62.

[2] A.R. Barron, "Logically smooth density estimation," Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., Stanford, CA, Aug. 1985.

[3] A.R. Barron and T.M. Cover, "Minimum Complexity Density Estimation," IEEE Trans. Inform. Theory, vol. IT-37, pp. 1034-1054, July 1991.

[4] T.M. Cover and J.A. Thomas, Elements of Information Theory, New York: John Wiley, 1991.

[5] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Budapest: Akadémiai Kiadó, 1981.

[6] L.D. Davisson, "Universal Noiseless Source Coding," IEEE Trans. Inform. Theory, vol. IT-19, pp. 783-795, Nov. 1973.

[7] R.G. Gallager, Information Theory and Reliable Communication, New York: Wiley, 1968.

[8] R.M. Gray and L.D. Davisson, Random Processes: A Mathematical Approach for Engineers, Englewood Cliffs: Prentice-Hall, 1986.

[9] D.A. Huffman, "A Method for the Construction of Minimum Redundancy Codes," Proc. IRE, vol. 40, pp. 1098-1101, Sept. 1952. Reprinted in Key Papers in the Development of Information Theory, (D. Slepian, Ed.) New York: IEEE Press, 1974, pp. 47-50.

[10] F. Jelinek, Probabilistic Information Theory, New York: McGraw-Hill, 1968, pp. 476-489.

[11] J.C. Kieffer, "Strongly Consistent Code-Based Identification and Order Estimation for Constrained Finite-State Model-Classes," to appear in IEEE Trans. Inform. Theory, vol. IT-39, May 1993.

[12] R.E. Krichevsky and V.K. Trofimov, "The Performance of Universal Encoding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 199-207, March 1981.

[13] R. Pasco, Source Coding Algorithms for Fast Data Compression, Ph.D. thesis, Stanford University, 1976.

[14] A. Rényi, Wahrscheinlichkeitsrechnung mit einem Anhang über Informationstheorie, Berlin: VEB Deutscher Verl. Wissenschaften, 1962.

[15] J. Rissanen, "Generalized Kraft Inequality and Arithmetic Coding," IBM J. Res. Devel., vol. 20, p. 198, 1976.

[16] J. Rissanen, "A Universal Data Compression System," IEEE Trans. Inform. Theory, vol. IT-29, pp. 656-664, September 1983.

[17] J. Rissanen, "Complexity of Strings in the Class of Markov Sources," IEEE Trans. Inform. Theory, vol. IT-32, pp. 526-532, July 1986.

[18] J. Rissanen, Stochastic Complexity in Statistical Inquiry, Teaneck, NJ: World Scientific Publ., 1989.

[19] J. Rissanen and G.G. Langdon, Jr., "Universal Modeling and Coding," IEEE Trans. Inform. Theory, vol. IT-27, pp. 12-23, January 1981.

[20] C.E. Shannon, "A Mathematical Theory of Communication," Bell Sys. Tech. Journal, vol. 27, pp. 379-423, July 1948. Reprinted in Key Papers in the Development of Information Theory, (D. Slepian, Ed.) New York: IEEE Press, 1974, pp. 5-18.

[21] P.C. Shields, "The Ergodic and Entropy Theorems Revisited," IEEE Trans. Inform. Theory, vol. IT-33, pp. 263-266, March 1987.

[22] Y.M. Shtarkov, "Coding of Discrete Sources with Unknown Statistics," presented at Colloquia Math. Soc. János Bolyai, 16, Topics in Information Theory, Keszthely, Hungary, 1975.

[23] M.J. Weinberger, A. Lempel, and J. Ziv, "A Sequential Algorithm for the Universal Coding of Finite Memory Sources," IEEE Trans. Inform. Theory, vol. IT-38, pp. 1002-1014, May 1992.

[24] F.M.J. Willems, Y.M. Shtarkov and Tj.J. Tjalkens, "Context Tree Weighting: A Sequential Universal Source Coding Procedure for FSMX Sources," IEEE Int. Symp. Inform. Theory, San Antonio, Texas, January 17-22, 1993, p. 59.

[25] F.M.J. Willems, Y.M. Shtarkov and Tj.J. Tjalkens, "The Context Tree Weighting Method: Extensions," in preparation.
