Convergence of Binarized Context-tree Weighting for Estimating Distributions of Stationary Sources

Badri N. Vellambi and Marcus Hutter Australian National University {badri.vellambi, marcus.hutter}@anu.edu.au Abstract This work investigates the convergence rate of learning the stationary distribution of finite-alphabet stationary ergodic sources using a binarized context-tree weighting approach. The binarized context-tree weighting (CTW) algorithm estimates the stationary distribution of a symbol as a product of conditional distributions of each component bit, which are determined in a sequential manner using the well known binary context-tree weighting method. We establish that CTW algorithm is a consistent estimator of the stationary distribution, and that the worst-case L1- prediction error between the CTW and frequency estimates using n source symbols each of which q  k log n when binarized consists of k > 1 bits decays as Θ 2 n . Keywords: Context-tree weighting, KT estimator, frequency estimator, binarization, stationary distribution, tree source, stationary ergodic source, convergence rate, worst-case bounds

1. Introduction The binary context-tree weighting (CTW) algorithm proposed in (Willems et al., 1993, 1995) is a universal scheme that uses a Bayesian mixture model over all context trees of up to a certain design depth to assign probabilities to sequences. The probability estimate and the weights corresponding to each tree assigned by the CTW algorithm is derived by an efficient algorithm that uses the well known Krichevsky-Trofimov (KT) estimator that models source bits as having a Bernoulli distribution whose parameter has a Dirichlet prior (Krichevsky and Trofimov, 1981). The CTW algorithm offers two attractive features: optimal data compression for binary tree (FSMX) sources (Rissanen, 1984, Theorem 1); • superior redundancy bounds and amenability to analysis as opposed to other universal data • compression schemes such as the Lempel-Ziv algorithm (Ziv and Lempel, 1977, 1978), e.g., using (Abramson, 1963), it can be shown that for any binary tree source P with C context parameters, the redundancy ρ of the binary CTW algorithm is bounded by

1 C n n ρ(x1:n) := L(x1:n) log2 log2 + (2C 1) + + 2, x1:n 0, 1 . (1) − P (x1:n) ≤ 2 C − K ∈ { } The first is the asymptotically dominant optimal redundancy achievable for the source P , and the second term (2C 1) is the model redundancy, i.e., the price paid for weighting various tree models as opposed− to using the actual source model if it were indeed known. An extension of CTW to sources over non-binary alphabets was proposed in (Tjalkens et al., 1993a,b), and a redundancy bound similar to (1) for a tree source with C contexts over a finite alphabet was derived; the asymptotically dominant term in this bound for a general alphabet A C( 1) n A was established to be |A|−2 log2 C( 1) . In (Tjalkens et al., 1993a), a decomposition-based ap- proach that uses binarization to map symbols|A|− from large alphabets to bit strings was also summarily proposed; this approach was later analyzed in greater detail in (Tjalkens et al., 1994). There are several motivating factors for taking a closer look at binarization.

1 First, many data sources of interest such as text, image, and audio files are factored and repre- sented into smaller units (such as bits or bytes) for digital storage. Hence, despite the fact that a source of practical interest might be distributed over a large alphabet, its symbols can effectively be viewed as a tuple of binary random variables. Second, the decomposition of larger alphabets into appropriate binary strings translates the CTW algorithm on the large alphabet to a sequence of binary CTW problems. This transla- tion in turn has the potential of reducing the number of model parameters and consequently of- fer improved redundancy guarantees (Tjalkens et al., 1994, 1997; Begleiter and El-Yaniv, 2006). This is better understood via an example. Consider an i.i.d. source over a hexadecimal alphabet whose distribution is given in the table in Fig. 1. While no specific structure is appar- 0 9/500 4 189/1000 8 81/1000 12 27/800 ent from the table, the example corresponds to 1 1/50 5 21/125 9 9/1000 13 3/800 a binary tree source with each source symbol 2 7/125 6 9/250 10 126/500 14 9/320 comprising of four bits, the first of which is an i.i.d. Bernoulli random variable with parame- 3 3/125 7 27/500 11 27/250 15 27/320 ter 0.6. For i = 2, 3, 4, the ith bit potentially 1st bit 2nd bit 3rd bit 4th bit depends on the realization of the first i 1 bits, − th (1, 0.25) (111, 0.75) and the conditional distribution of the i bit (11, 0.75) is a Bernoulli distribution that depends on its e e (011, 0.6) e context, e.g., the third bit is a Bernoulli ran- (01, 0.3) (e, 0.6) (01, 0.3) dom variable with parameter 0.75 if the first (0, 0.8) (0, 0.75) (0, 0.1) two bits are both ones. The naïve CTW over the hexadecimal alphabet looks for no struc- Figure 1: Hexadecimal source P and its tree model. ture within each symbol, and hence must learn 15 parameters, whereas the dominant trees for each of the 4 bits in the binarized factored 0 100 data1data1data1data1data1E[ Pˆ P ] CTW together have to learn only 10 parame- k CTW k ˆ ters, which yields improvements in terms of re- data2E[ P# P ] k k dundancy. In general, for a tree source with al- phabet and with C contexts, the asymptotic dominantA term for redundancy can reduce from error C( 1) log C 1 n P 2 i n -11 |A|− to d |A|e ,

L log log 1010 2 2 C( 1) i=1 2 2 Ci L1-error O i |A|−1 (n where Ci 2 − C is the number of contexts in 1 2 ≤ th )decay the tree of the i bit in the binary decomposi- Average tion. The scaling factor of the log2 n can poten- C( 1) C log d 2 |A|e tially reduce from |A|−2 to 2 , which 2 1010-2 can be significant when compressing sources 0 1 2 3 4 5 100 101 102 103 104 105 with large alphabets. As an illustration, again 10 10 10 n 10 10 10 n consider the i.i.d source P of Fig. 1. Let Pˆ Figure 2: An illustration of the performance of the # denote the frequency estimate of P . Let Pˆ binarized CTW algorithm. CTW denote the binarized factored CTW (CTW es- timate), which is a product of four estimators for each of the component bits; the ith bit, i = 1,..., 4, is estimated using a binary CTW estimator of depth i 1 and the appropriate context (for the com- − plete definition, see Sec. 3). Fig. 2 presents the variation of mean L1-prediction error with the number of samples n; the following observations can be made.

CTW is a better estimator than a naïve frequency (ML) estimator when the size n of data • is small, or sufficiently large (n . 30 or n & 450 in the above example). The improved prediction error for small data size can be attributed to the fact Bayesian averaging inherited by the CTW algorithm from the CTW algorithm whereas the frequency estimator does not have enough samples to predict accurately. When the data size is large, the contribution

2 of the tree model that best describes the data dominates the CTW algorithm. The CTW estimator therefore offers a scaling improvement depending of the number of parameters needed to describe the ‘best’ tree model, which is, in general, fewer than that required to naïvely describe the distribution completely. The CTW estimator, on average, offers similar asymptotic convergence rate as the frequency • 1 estimator, which in this example (and for any finite-alphabet i.i.d. source) decays as n− 2 . Note that while there is a potential benefit in terms of redundancy, the computational cost of implementing the naïve CTW algorithm over the larger alphabet vs the binarized CTW are similar. In (Tjalkens et al., 1994, 1997), the focus is primarily to identify binarized decompositions that efficiently translate the elements of the larger alphabet as a binary string and offer good redundancy guarantees, and not on the estimation error or convergence rates offered by the binarized CTW algorithm. Recently, Veness et al. explored the idea that a good data compression scheme is a good estimator, and applied it to the problem of policy evaluation and control in reinforcement learning problems (Veness et al., 2015). This work translates policy evaluation and on-policy control in reinforcement learning to the estimation of the stationary distribution of a hidden Markov model whose state corresponds to the concatenation of the state, action and return in the reinforcement learning problem. The following summarizes the main result of this approach. A consistent estimator of the stationary distribution of the state of the derived hidden Markov • model, whose estimation error converges stochastically to zero suitably fast results in a con- sistent estimator of the state-action value function. It was observed via simulations in (Veness et al., 2015) that binarizing large state spaces, and using a binarized factored CTW offers two benefits: excellent convergence rate in estimating the stationary distribution of the underlying process; • the ability to handle considerably larger state spaces than the frequency estimator in practice. • In this work, we assume that the source symbols take values from a vector binary alphabet1 (i.e., 0, 1 k for some k N), and derive some fundamental properties of the binarized CTW estimate {of the} source distribution,∈ thereby providing a theoretical reasoning to why the binarized CTW estimate performs well in simulations (Veness et al., 2015). In specific, we establish the following.

The worst-case L1-prediction error between the binarized CTW and frequency (ML) estimates • for the stationary distribution of a stationary ergodic source over 0, 1 k for some k > 1 is q  { } k log n Θ 2 n ; The binarized CTW method yields a consistent estimator of the stationary distribution of • stationary ergodic sources; and Any (Bayesian) dependency structure that exists in the binary representation of source symbols • can be naturally incorporated in the binarized CTW algorithm. The rest of the paper is organized as follows. Section 2 details the definitions and notation used in this work and Section 3 defines the binarized CTW algorithm. Section 4 presents a discussion and relevant results regarding the contribution of various context trees in the CTW estimate, and Section 5 presents the main results pertaining to binarized CTW, and finally, Section 6 concludes this work with some future directions. Most proofs are relegated to Appendices A-E.

k 1. Throughout this work, we assume that the alphabet Z of the source is mapped to {0, 1} for some k ∈ N via a meaningful binarization procedure. All of the results hold even if we partially factorize the alphabet into a finite Cartesian product of finite sets. In this setting, the appropriate non-binary CTW must be used for each component. However, for the sake of simplicity, we assume complete binarization of Z.

3 2. Notation and Definitions

k Definition 1 Given sequence z1:n where each symbol zi is an element of a vector alphabet , string k A a ≤ , and m a , we let ∈ A ≥ k k m a k m # a := i : z = bac for some b −k k and c − . (2) m |{ i ∈ A ∈ A }| k Specifically, if a , then #ka denotes the frequency of the vector symbol a appearing in the ∈ A k0 length-n sequence z1:n (which is tacit), and if a for k0 < k, then #ka denotes the sum of the frequencies of all vectors ending in a that appear∈ in A the length-n sequence.

Example 2 Let k = 3, n = 12, = 0, 1 , and A { }

z1:n = (000)(111)(011)(100)(101)(101)(000)(111)(011)(001)(001)(111). (3)

Then,

#3000 = 2 #3001 = 2 #3010 = 0 #3011 = 2 #3100 = 1 #3101 = 2 #3110 = 0 #3111 = 3 #300 = 3 #301 = 4 #310 = 0 #311 = 5 #30 = 3 #31 = 8 . (4) #200 = 4 #201 = 2 #210 = 3 #211 = 3 #20 = 7 #21 = 5 #10 = 6 #11 = 6

Since the CTW estimator uses a weighted average of all tree sources to represent its estimate, we required the following notation for binary trees.

Definition 3 For k N, let k denote the set of all context trees of depth at most k, i.e., a subset k ∈ T T 0, 1 ≤ is an element of iff the following two conditions hold: ⊆ { } Tk 1. No element of T is a proper suffix (postfix) of another element of T , i.e., if x, y T and x = cy, then x = y; and ∈ 2. Every string of length k has a suffix that lies in T , i.e., for every b 0, 1 k, there exists (a unique) a T and binary string c such that b = ca. ∈ { } ∈ k 2 We specifically denote T ∗ := 0, 1 to represent the complete tree of depth k. k { } ∈ Tk P a Remark 4 For any k N and T k, a T 2−k k = 1. ∈ ∈ T ∈

Lastly, natural logarithms and logarithms to the base 2 are denoted by log and log2, respectively. 2. The term complete in this work refers solely to the membership of all strings of length k in the tree, as opposed to the second condition which is termed as completeness in (Rissanen, 1986; Willems et al., 1993).

4 3. Binarized Context-tree Weighting Estimator

In this work, we simply assume that the random process (source) Zi i N under study is stationary, ergodic, and takes values over a vector binary alphabet 0, 1 k for{ some} ∈ k 1. For i = 1, . . . , k, th { } ≥ we let Bi,` i N to denote the i component random process. The goal is to study the conver- gence properties{ } ∈ of estimating the stationary distribution of the random process Z using the bi- narized context-tree weighting (CTW) estimator defined below. Given n samples of the source z1:n = ((b11, . . . , b1k), (b21, . . . , b2k),..., (bn1, . . . , bnk)), the aim is to estimate the stationary distri- bution. A natural and simple way to estimate the distribution is by the empirical frequency of the appearances of various symbols (binary tuples B1 B2 B3 B4 Context Context Context of length k), which is well known frequency esti- z n+1 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 mator and is denoted by Pˆ . The frequency es- z # n 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 timator assigns zero probability to (infrequent) zn 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 symbols that haven’t appeared in the data sam- . 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 . 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 ple z1:n. This is remedied in the Krichevsky- ...... Trofimov (KT) or Dirichlet or the Add-half Es- History History History History timator that models the source as an i.i.d multi- Figure 3: An illustration of the binarized CTW. nomial/categorical distribution with a Dirich- let prior, and assigns to each source sequence a probability using the prior predictive distribution. In the binary case, the KT-estimate for a sequence with #10 zeros and #11 ones equals

1 1 Γ(#10 + )Γ(#11 + ) P (# 0, # 1) := 2 2 , (5) e 1 1 πΓ(n + 1)

R ∞ t 1 x where the Γ(t) := 0 x − e− dx is the well known Gamma function. The binarized CTW (CTW) estimator, illustrated in Fig. 3, estimates each of the k bits sequen- tially by using k binary CTW estimators. The CTW estimator uses the KT-estimator (equivalently, the binary CTW estimator of memory/order 1) to estimate the distribution of the first bit as

1 P (# 0 + β = 0 , # 1 + β = 1 ) #1β + Pˆ(1) (B = β; z ) := e 1 1 = 2 , β 0, 1 , (6) CTW 1 1:n J K J K Pe(#10, #11) n + 1 ∈ { } where #1β, by Definition 1, is the number of symbols in z1:n with the prefix β 0, 1 , and is the Iverson bracket which returns one if the logical proposition (within the brackets)∈ { } is true,J·K and zero otherwise. Note that the estimate Pˆ(1) depends only on the bits (b , b , . . . , b ). For CTW 11 21 n1 ` > 1, the (conditional) distribution of `th bit is estimated sequentially using the binary CTW estimator of memory/order ` using the first ` 1 bits as its context. Specifically, for any context ` 1 − th β1:` 1 0, 1 − , the (conditional) distribution of the ` bit is given by the following weighted average− ∈ of { appropriate} KT-estimates

 1  (`) X (`) #`ξT (β1:` 1)β + Pˆ (B = β B = β ; z ) = Λ − 2 , β 0, 1 , (7) CTW ` 1:` 1 1:` 1 1:n T | − − #` 1ξT (β1:` 1) + 1 ∈ { } T `−1 − − ∈T

(`) where ξT (β1:` 1) is the unique element of T that is also a suffix of β1:` 1, and ΛT is the normalized weight corresponding− to tree T assigned by the binary CTW estimator.− For each context tree (`) T ` 1, this normalized weight Λ is given by ∈ T − T ω P (`)(T, z ) Λ(`) := T n 1:n , (8) T P (`) 0 ω 0 P (T , z ) T `− T n 0 1:n ∈T 1

5 where for any T `, ωT is non-negative part of the weight that the CTW algorithm associates with ∈ T (`) a tree T that is independent of the sequence z1:n (Willems et al., 1995), and Pn (T, z1:n), which we term the sequence-dependent part of the weight, is given by

(`) Y Pn (T, z1:n) := Pe(#`a0, #`a1), (9) a T ∈ where Pe( , ) is as defined in (5). A closer look at (6)-(9) reveals that just as in the case of the first bit, the CTW· · estimate for the conditional distribution of the `th bit depends only on the context and the partial history (bi1, . . . , bi,`), i = 1, . . . , n. Lastly, the k (conditional) distributions are then pieced together to obtain the CTW estimate for the stationary distribution of the source using

k ˆ Y ˆ(`) k PCTW(β1:k; z1:n) := P (B` = β` B1:` 1 = β1:` 1 ; z1:n), β1:k 0, 1 . (10) CTW | − − ∈ { } `=1 Two things must be remarked at this juncture. First, the CTW estimator is a biased estimator of the distribution of the source. The bias is inherited directly from the the KT estimator, and can be established by simply considering a binary i.i.d. Bernoulli source with parameter θ = 1 , for which 6 2 # 1 + 1 nθ + 1 ˆ E 1 2 2 E P (Z = 1; z1:n) = = = θ. (11) CTW n + 1 n + 1 6 Second, the binarized CTW estimator defined above learns the stationary distribution of the source process, and does not consider the statistical correlation between symbols of the source. A tree source taking values in 0, 1 k that has C contexts, the longest of which is m symbols can be fully modeled using the binarized{ } CTW estimator with km factors in (10).

4. Comparing Contributions of Trees in CTW Since the binarized CTW algorithm estimates the distribution of all but the first bit as a weighted average (i.e., mixture) of KT estimates of corresponding different tree sources, the behavior of the CTW estimate will be shaped by the contri- 00 ` =4 000 000 00 butions of the dominant context trees, which 100 a = 0 100 0 need to be identified. One way to identify such 10 10 trees is to develop tools to compare the ratio of 010 010 e weights of any two trees. To do just that, we 110 110 e 01 01 proceed in three stages. In each stage, we use 001 001 the bound developed in the earlier stage for a 101 11 101 limited class of pairs of trees to generalize to a 011 11 1 011 1 bound for a larger class of pairs of trees; at the 111 111 end of the third stage, we will have the required

T = 0, 01, 011, 111 T 0 = 00, 10, 01, 011, 111 result to compare the weight of any tree with { } { } that of the complete tree. Figure 4: An example of trees compared in (12). Consider the CTW estimate of the `th bit for some ` 2, . . . , k . Let T ` 1 be any context tree, and let a T with a < ` 1. Let context ∈ { } ∈ T − ∈ k k − tree T 0 be obtained from T by replacing the node a by its two children nodes 1a and 0a as illustrated in Fig. 3, i.e., T 0 = (T a ) 0a, 1a . In the first stage, we compare the sequence-dependent part \{ } ∪ { } of CTW weights Pn(T, z1:n) and Pn(T 0, z1:n) corresponding to two trees T and T 0. Using (9), we see that the ratio of these two quantities is

P (`)(T, z ) P (# a0, # a1) n 1:n = e ` ` . (12) (`) P (# 0a0, # 0a1)P (# 1a0, # 1a1) Pn (T 0, z1:n) e ` ` e ` `

6 Since #`a0 = #`0a0 + #`1a0 and #`a1 = #`0a1 + #`1a1, it is only natural to study the following quantity so as to compare the ratio of the weights of two such trees. Let for a, b, c, d N 0 , ∈ ∪ { } Pe(a + c, b + d) ηa,b,c,d := . (13) Pe(a, b)Pe(c, d)

Using the Stirling’s approximation (Bollobas, 1998), one can relate ηa,b,c,d to the binary entropy function h(x) := x log x (1 x) log (1 x) to show that if a + b > 0 and c + d > 0, − 2 − − 2 − s 2π(a + b)(c + d) (a+b+c+d)h a+c +(a+b)h a +(c+d)h c η = λ e− ( a+b+c+d ) ( a+b ) ( c+d ), (14) a,b,c,d abcd (a + b + c + d)

1 1 for some e− 2 λabcd e 2 , and ηa,b,c,d = 1 otherwise. A derivation of (14) is given in Lemma 14 in ≤ ≤ a c Appendix A. Observe that when b and d are quite different, the ratio can be exponentially small; thus, the sequence-dependent weight for T 0 can by larger than that of T (by an exponential factor). a In other words, the context tree T 0 better explains the data z1:n than does tree T . Similarly, when b c and d are close, the exponential term does not play a significant role, and the sequence-dependent weight for T is larger than that of T 0 by a polynomial factor, and hence T is a better model for the data z1:n than T 0. Now, for the second stage, we can, by re- 000 00 ` =4 000 00 peated application of the above result, com- 100 a = 0 100 0 10 pare the sequence-dependent part of the weights 010 10 010 e of two trees related as in Fig. 3. To illus- 110 e 110 01 01 trate this idea, suppose that we are inter- 001 001 ested in CTW estimate for the 4th bit, and 101 101 11 11 suppose that we would like to compare the 011 011 1 1 weights of trees T := 0, 01, 011, 111 and T 00 := 111 111 { } T = 0, 01, 011, 111 T 00 = 000, 010, 100, 110, 01, 011, 111 000, 010, 100, 110, 01, 011, 111 given in Fig. 4. { } { } {We can compare the sequence-dependent} part of Figure 5: Illustration of a pair of trees compared weights of T and T by bounding three ratios of in Lemma 5. 00 the sequence-dependent part of weights: (a) the ratio corresponding to trees T and T , where T := 00, 10, 01, 011, 111 ; (b) the ratio corresponding a a { } to trees Ta and Tb, where Tb := 000, 100, 10, 01, 011, 111 , and (c) the ratio corresponding to trees { th} Tb and T 00. More generally, in the CTW estimate of the ` bit for some ` = 2, . . . , k, let T,T 00 ` 1 be two trees such that the former contains, say, a node a, and the latter contains instead of∈a T,− all nodes at depth ` 1 which have a as its suffix, e.g., in the illustration in Fig. 4, a = 0, and T 00 contains 00a, 01a−, 10a, and 11a instead of a. The ratio of the sequence-dependent part of weights of trees T and T 00 is then Q (`) Pe(#`b0, #`b1) Pn (T, z1:n) b T Pe(#`a0, #`a1) a := = ∈ = (15) (`) Q P (# b0, # b1) Q P (# ba0, # ba1) Pn (T 00, z1:n) e ` ` e ` ` b T 00 b 0,1 m ∈ ∈{ }

Using (14), we can relate a to the number of occurrences of various substrings in a form that is amenable to further analysis.

k <` Lemma 5 Let k ` 2 and z1:n be a sequence over 0, 1 and a 0, 1 such that #` 1a > 0. ≥ ≥ { } ∈ { } − Let a be as in (15) and m := ` 1 a . Then, − − k k m 2 −1 1 Q p (2πe) 2 #` 1ba √#`−1a m − b 0,1 :#`−1ba>0 a ( ∈{ } ). (16) ≤  2 P #`ba0 #`a0 exp 4 #` 1ba # ba # a m − `−1 `−1 b 0,1 :#`− ba>0 · − ∈{ } 1

7 Proof The proof proceeds by expressing a as a product of η-terms, and then using the concavity of the binary entropy function to obtain the quadratic term in the exponent in the denominator. For a complete proof, see Appendix B. Finally, in the last and third stage, we can repeatedly use Lemma 5 to compare the sequence- ` 1 dependent part of weights of any tree T ` 1 with that of the complete tree T`∗ 1 = 0, 1 − . ∈ T − − { }

k Lemma 6 Let k ` 2 and z1:n be a sequence over 0, 1 and T ` 1. Let ma := ` 1 a ma − ma <` ≥ ≥ { } 2∈ T−1 2 −1− − k k for a 0, 1 . Then, for some λa depending on z such that e− 2 λa e 2 , ∈ { } 1:n ≤ ≤ ! 2`−1−|T | Q λa Q p (2π) 2 #` 1c (`) √#`−1a `− − a T :#`−1a>0 c 0 1 1 c Pn (T, z1:n) ∈ , :#`−1 >0 ( ∈{ } !). (17) P (`)(T , z ) ≤  2 n `∗ 1 1:n P P #`ba0 #`a0 − exp 4 #` 1ba #`−1ba #`−1a ma − a T b 0,1 :#`− ba>0 · − ∈ ∈{ } 1 Proof See Appendix C As in the case of Lemma 5, Lemma 6 presents an analytically tractable bound for the ratio of source-dependent part of the weights of a tree with that of a complete tree that involves explicitly the number of occurrences of various substrings. This last result, and especially the exponent in the denominator of (17), will be key in the derivation of an upper bound in the following section.

5. New Results Having proven the required machinery, we are now equipped to state our results pertaining to the CTW estimator. We first present an upper bound for the worst-case L1-error between the CTW and frequency estimates, followed by a lower bound established by a suitable choice for z1:n.

Theorem 7 (Upper Bound) Let = 0, 1 k for some k N. Then, Z { } ∈  

X #kc ˆ (18) ∆k := max n  PCTW(Z = c ; z1:n)  z1:n | − n ∈Z c 0,1 k ∈{ }  1  2n k = 1 r   k−2 2πe5n (19) k−1 2 log . ≤ 2 2k−1  ∆k 1 + + k > 1 − 2n n

Proof The proof proceeds by induction, and uses (6)-(10) to relate the worst-case L1-error ∆k k k 1 for sequences over 0, 1 to worst-case L1-error ∆k 1 for sequences over 0, 1 − and the square root of exponent in{ the} denominator of (17). The proof− is then complete{ by using} (17) and some standard information-theoretic bounds. A complete proof can be found in Appendix D. Since Theorem 7 quantifies a worst-case bound (i..e, holds for all sequences), and since the frequency estimator is a consistent estimator of the stationary distribution of any finite-alphabet stationary ergodic source, the following inferences ensue.

Corollary 8 The worst-case L1-error between the binarized CTW and frequency estimates of the stationary distribution for a stationary ergodic source over 0, 1 k is { } k 1 r k ! 2 − 1 log n √2 √2 ∆k − + − (1 + o(1)). (20) ≤ 2n n √2 1 −

8 Corollary 9 The binarized context-tree weighting estimator CTW is a consistent estimator of the stationary distribution of any finite-alphabet stationary ergodic source.

At this point, one might suspect the tightness of the worst-case bound, especially since for i.i.d. and uniformly-ergodic Markov sources, by use of Hoeffding’s inequality (Glynn and Ormoneit, 2002), one can show that the expected L1-error between binarized CTW (or the frequency) estimator and 1 the actual stationary distribution of the source exhibit n− 2 decay. Hence, for these sources,

#kc 1 ˆ 2 E P (Z = c ; z1:n) = Θ(n− ). (21) CTW | − n

However, the following result identifies sequences z1:n for which the CTW L1-error between the CTW and frequency estimates is of the same order as the upper bound of Theorem 7.

Theorem 10 (Lower Bound) Let = 0, 1 k for some k 2. Then, for each  > 0, there exists n0 Z { } ≥ n0 N and z1:n0 such that ∈ ∈ Z r X #kc k 2 log n0 PCTW(Z = c ; z1:n0 ) 2 − (1 ) . (22) | − n0 ≥ − n0 c 0,1 k ∈{ } k k Proof Let z1:2kn be a sequence of length 2 n over 0, 1 with n σ√n log n occurrences of each b 0, 1 k of even weight, and n + σ√n log n occurrences{ } of each−b b 0, 1 k cof odd weight, where ∈ {1 } b c ∈ { } σ = 2 √1 α and α (0, 1). The proof establishes that (22) holds for this choice z1:2kn. Details can be found− in Appendix∈ E. By a careful investigation of the proof of the above theorem, the following inference can be made. If instead of the choice of z k in the proof of Theorem 10, z k is chosen such that the number 1:2 n  1:2 n of occurrence of each symbol was n + o √n log n , the resulting L1-error can be shown to decay as p 1   o n− log n . On the other hand, if the number of occurrence of each symbol is n + ω √n log n , e ceases to be the dominant tree, and consequently, the L1-error can be shown to decay as { } 1 O(n− 2 ). The empirical frequency of the symbols in z1:2kn is chosen so that the order of decay of the L1-predictive error between the CTW and frequency estimates of the lower bound in Theorem 10 matches that of the upper bound in Theorem 7, thereby yielding the following result.

Corollary 11 Let = 0, 1 k for k 2. Then, Z { } ≥    r 1 n X #kc 1 lim inf max Pˆ (Z = c ; z ) . (23)  n  k CTW 1:n  2 ≤ n z1:n 2 log n | − n ≤ √2 1 →∞ ∈Z c 0,1 k ∈{ } − It must be remarked that the proof of Theorem 10 identifies sequences whose lengths are multiples of the alphabet size. However, using an appropriate ‘continuity argument’ through appropriate choices for lengths that are not multiples of the alphabet size, it can be shown that the above corollary holds even when lim inf is replaced by a lim operator.

5.1 Extensions of CTW. The formulation of the binarized CTW and the bounds established in Theorems 7 and 10 assume no prior (knowledge of any) dependency structure between the bits in the binary representation (B1,...,Bk) of each random variable of the process under study. In other words, the CTW derives a Bayesian estimate of the stationary distribution using the most general decomposition of the joint probability of B1,...,Bk, i.e., k random variables representing the bits are modeled by the Bayesian network given in Fig. 6.

9 One of the main motivations for analyzing the k edges 1 1 convergence of binarized CTW algorithm is for

edges edges ... 3 k ... k policy evaluation in reinforcement learning prob- lems. Extensive simulations revealed the effective- B1 B2 B3 ... Bk ... ness of a binarized factored CTW-based approach k 2 edges ... to learn an appropriate conditional distribution and consequently the action-state value function in Figure 6: The Bayesian Network underlying a variety of arcade games such as PONG, FREEWAY, CTW decomposition in (10). or Q*BERT. In such applications, the random pro- cess under study corresponds to the state of the game at discrete intervals of time, and the state at each instant of time is an image of pixels (e.g., in the Arcade Learning Environment (Bellemare et al., 2013) used for simulations in (Veness et al., . 2015), the state comprises of 160 210 seven-bit .. pixels) that is naturally represented× by a vector or matrix of binary random variables as depicted in Fig. 7. This binary representation allows for com- ponent bits of the state to have a natural corre- lation/dependency structure intrinsic to the game Figure 7: Bayesian network model of an image. such as the one depicted in Fig. 7. Via simulations, it was shown that incorporating such dependencies offers excellent estimation of state distribution and value function for policy evaluation. In general, any known dependency structure between the bits that is representable by a Bayesian network can be incorporated into the binarized CTW algorithm by modifying (10) as follows3. B

k ˆ n Y ˆ(`) n k P B (β1:k; b`,1:k `=1) := P (B` = β` Bpa(`) = βpa(`) ; bi,pa(`) ` i=1), β1:k 0, 1 , (24) CTW { } CTW | { ∪{ }} ∈ { } `=1 where for an index i, pa(i) denotes the set of B B3 2 indices of all parent nodes of Bi in . The main B ˆ difference between the binarized CTW PCTW defined in Section 3 and ˆ is that for , PCTWB ` > 1 binarized CTW defined in Section 3 learns a B1 B4 model for B` as a Bayesian mixture of all tree Figure 8: A Bayesian Network example. sources with context (B1,...,B` 1) and depth ` 1, whereas (24) incorporates− the dependecy − B1 B2 B3 B4 B1 B2 B1 B3 B1 B2 B4 specified by to only learn the conditional dis- z B n+1 0 1 1 0 0 1 1 0 0 1 0 0 1 0 tribution of Bi given Bpa(`) using the history zn 1 0 0 1 1 0 0 1 1 0 1 1 0 1 n n b`,pa(i) i `=1, and context b`,pa(i) `=1, ig- zn 1 0 0 1 1 0 0 1 1 0 1 1 0 0 1 { ∪{ }} { } noring the realizations of other bits. To achieve . 1 1 1 1 1 1 1 1 1 1 1 1 1 1 . this, the component estimator for the ith bit 1 0 0 1 1 0 0 1 1 0 1 1 0 1 . . . . computes a Bayesian mixture over all context . . . . History History History History trees of depth pa(i) considering only the con- | | text Bpa(i) and ignoring the non-parent bits. Figure 9: An example to illustrate an extension of As an illustration, consider the setting where binarized CTW. k = 4, and the dependency between the four bits of Z are given by the Bayesian network in Fig. 8. Suppose also that the realization of the process is as in Fig. 9. The distribution of B1 is gleaned simply from the realizations in the first column

3. The random variables B1,...,Bk are assumed to be ordered in such a way that the parent set of each Bi is a subset of {B1,...,Bi−1} for any i > 1.

10 corresponding to the process B n . The conditional distribution of B given B is gleaned { `,1}`=1 2 1 from B n using a Bayesian mixture of two tree sources. While in Pˆ , the conditional { `,1:2}`=1 CTW distribution of B3 given (B1,B2) is given by a Bayesian mixture of five tree sources of depth at most 2, ˆ incorporates the dependency given in Fig. 8 by ignoring ; it learns the distribution of PCTWB B2 B given B as a Bayesian mixture over two tree sources. If the realization z = b n is 3 1 1:n { `,1:4}`=1 typical, the conditional independence of B2 and B3 given B1 implied by the dependency structure will be reflected in the empirical frequencies, and in the estimates ˆ and ˆ ; however, the PCTW PCTWB dominant tree in ˆ will be complete tree of depth 2, whereas in ˆ , the dominant tree will PCTW PCTWB be complete tree of depth 1. Thus, in general, the binarized CTW variant ˆ , being a Bayesian PCTWB mixture over simpler models that incorporates the natural dependency structure in (B1,...,Bk), ˆ will yield some computational benefits over PCTW. Similar to Theorems 7 and 10, we can establish the following result quantifying the -prediction error between ˆ and the maximum-likelihood L1 PCTWB estimate PˆML, defined by B

PˆML, ( ; z1:n) := argmax P (z1:n), (25) B · P ( ) ∈P B where ( ) is the set of all probability distributions that satisfy the Markov/dependency conditions P B given by the Bayesian network . Note that PˆML, ( ; z1:n) is simply the frequency estimate when imposes no Markov conditions amongB B ,...,B ,B i.e.,· corresponds to the network in Fig. 6. B 1 k B Theorem 12 Given Bayesian network consisting of k binary random variables, and = 0, 1 k, B Z { }   r ! X log n max PˆB (c ; z ) PˆML (c ; z ) = Θ . (26) n  CTW 1:n , 1:n  z1:n | − B n ∈Z c 0,1 k ∈{ } The proof traces those of Theorems 7 and 10, and is omitted. The upper bound follows almost identically from that of Theorem 7 by appropriately generalizing the counting operator #` to a counting operator for subsets of 1, . . . , k that returns the number occurrences of a string corre- sponding to locations specified by{ the subset.} The lower bound follows by replacing (82) in the proof of Theorem 10 by

(  P  an if bk + i pa(k) bi is even Mk(b) = ∈ . (27) bn otherwise

6. Closing Remarks This paper proves that the binarized CTW estimator is a consistent estimator of the stationary distribution of a stationary ergodic source by deriving order-tight bounds on the worst-case L1- prediction error between the binarized CTW estimator and the frequency estimator. The next steps would be to investigate the average rate of convergence and the concentration of the L1-prediction error between the binarized CTW estimator and the frequency estimator for various classes of stationary, ergodic sources. Extension of binarized CTW to incorporate dependencies specified by Markov random fields will also be a valuable direction to explore.

Acknowledgments

This work was supported by the Australian Research Council Discovery Projects DP120100950 and DP150104590.

11 References N. Abramson. and coding. McGraw-Hill, New York, 1963. T. M Apostol. Mathematical analysis; 2nd ed. Addison-Wesley Series in Mathematics. Addison- Wesley, 1974.

R. Begleiter and R. El-Yaniv. Superior guarantees for sequential prediction and via alphabet decomposition. J. Mach. Learn. Res., 7:379–411, December 2006. M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, jun 2013. Bela Bollobas. Modern graph theory. Springer, 1998. P. W. Glynn and D. Ormoneit. Hoeffding’s inequality for uniformly ergodic markov chains. Statistics & Probability Letters, 56(2):143–146, 2002.

R. Krichevsky and V. Trofimov. The performance of universal encoding. IEEE Transactions on Information Theory, 27(2):199–207, Mar 1981. J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4):629–636, Jul 1984. J. Rissanen. Complexity of strings in the class of Markov sources. IEEE Transactions on Information Theory, 32(4):526–532, Jul 1986. T. J. Tjalkens, Y. M. Shtarkov, and F. Willems. Context tree weighting: Multi-alphabet sources. Proc. 14th Symposium on Information Theory, Benelux, pages 128–135, 1993a. T. J. Tjalkens, F. Willems, and Y. M. Shtarkov. Sequential weighting algorithms for multi-alphabet sources. Proc. 6th Swedish-Russian Workshop on Information Theory, Mölle, Sweden, pages 22– 27, Aug 1993b. T. J. Tjalkens, F. Willems, and Y. M. Shtarkov. Multi-alphabet universal coding using a binary decomposition context tree weighting algorithm. Proc. 15th Symposium on Information Theory, Benelux, pages 259–265, 1994.

T. J. Tjalkens, P. A. J. Volf, and F. Willems. A context-tree weighting method for text generating sources. Proc. Data Compression Conference (DCC ’97), pages 472–472, March 1997. J. Veness, M. G. Bellemare, M. Hutter, A. Chua, and G. Desjardins. Compress and control. Proc. Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15), pages 3016–3023, 2015.

F. Willems, Y. M. Shtarkov, and T. J. Tjalkens. Context tree weighting: a sequential universal source coding procedure for FSMX sources. Proc. IEEE International Symposium on Information Theory, pages 59–59, Jan 1993. F. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, May 1995.

J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, May 1977. J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans- actions on Information Theory, 24(5):530–536, Sept 1978.

12 Appendix A. Relevant Technical Lemmas

Lemma 13 For a, b, c, d N 0 , ∈ ∪ { } (2a + 2c)!(2b + 2d)!(a + b)!(c + d)!a!b!c!d! η = (28) a,b,c,d (2a)!(2b)!(2c)!(2d)!(a + b + c + d)!(a + c)!(b + d)!

1   (2`)! 2 1 1 3 1 Proof The claim follows by the use of π− Γ ` + = ` = 2` for ` N 0 . 2 2 · 2 ··· − 2 2 `! ∈ ∪ { }

Using the above result and Stirling’s approximation (Bollobas, 1998)

x+ 1 x 1 x+ 1 x √2πx 2 e− xΓ(x) e 12x √2πx 2 e− , x > 0, (29) ≤ ≤ we can then relate ηabcd with the binary entropy function h(x) := x log x (1 x) log(1 x) in the following manner. − − − −

1 1 Lemma 14 Let a, b, c, d N 0 . Then, for some e− 2 λabcd e 2 ∈ ∪ { } ≤ ≤ ( q a+c a c 2π(a+b)(c+d) (a+b+c+d)h( a+b+c+d )+(a+b)h( a+b )+(c+d)h( c+d ) a+b>0 λabcd e− , c+d>0 ηa,b,c,d = (a+b+c+d) . (30) 1, otherwise Proof We proceed in a case-by-case fashion. If a, b, c, d > 0, then we can directly apply Stirling’s approximation to establish the claim. • If a + b = 0 or c + d = 0, then it is straightforward to verify that ηabcd = 1 from the definition • of P ( , ). Note that this case subsumes the case when at least three of a, b, c and d equal zero. · · If exactly one of a or b or c or d is zero, due to the symmetry in Pe( , ), we can, without loss • of generality, assume that a = 0. In this case, we can see from (28) that· · (2b + 2d)!(c + d)b!2d! η = (31) a,b,c,d (2b)!(2d)!(b + c + d)!(b + d)! r (29) πb(c + d) (b+c+d)h c +(c+d)h c = λ e− ( b+c+d ) ( c+d ), (32) abcd b + c + d

1 1 1 1 1 + 1 + 1 + 1 where e− 24b − 24d − 12(b+c+d) − 12(b+d) λ e 24(b+d) 12(c+d) 6b 12d . ≤ abcd ≤ Lastly, if exactly two a or b or c or d is zero, and neither a + b nor c + d is zero, due to the • symmetry in Pe( , ), we only need to consider two cases: (a) a = c = 0, b, d > 0, or (b) a = d = 0, b, c > 0· .· When Case (a) holds, r (2b + 2d)!b!2d!2 (29) πbd η = = λ , (33) a,b,c,d (2b)!(2d)!(b + d)!2 abcd b + d

1 1 1 1 + 1 + 1 where e− 24b − 24d − 6(b+d) λ e 24(b+d) 6b 6d , and when Case (b) holds ≤ abcd ≤ r b!c! (29) 2πbc (b+c)h c η = = λ e− ( b+c ), (34) a,b,c,d (b + c)! abcd b + c

1 1 1 + 1 where e− 12(b+d) − 12(b+d) λ e 12b 12d . ≤ abcd ≤

Note that by the concavity of the binary entropy function, the exponent is always negative. Thus, the following upper bound trivially follows.

13 q πe(a+b+c+d) Remark 15 For a, b, c, d N 0 with a + b + c + d > 0, ηa,b,c,d . ∈ ∪ { } ≤ 2 Lemma 16 Fix σ (0, 1 ). Let a = d = n σ√n log n , and b = c = 2n a . Then, ∈ 2 n n − b c n n − n η rπ lim anbncndn = . (35) n 1 2σ2 →∞ n 2 − 2 Proof Since Γ(` + 1) = `Γ(`) = `! for ` N, one can use (28) and (29) to obtain ∈ rπn n4n η = λ , (36) anbncndn n 2an 2bn 2 an bn

−7 1 where λ (e 48n , e 6n ). Hence, n ∈ P (a + c , b + d ) lim e n n n n (37) n 1 2σ2 →∞ n 2 − Pe(an, bn) Pe(cn, dn) 2 · rπ n2σ n4n = lim (38) n 2an 2bn 2 →∞ an bn 2 rπ n2σ 1 = lim (39) n 2n 2 σ√n log n 2n+2 σ√n log n 2  σ√n log n  − b c  σ√n log n  b c →∞ 1 b c 1 + b c − n n 2 2 rπ n 2σ n4σ = lim − (40) n 2n 2 σ√n log n 4 σ√nn log n 2  σ√n log n 2  − b c  σ√n log n  b c →∞ 1 b c 1 + b c − n2 n Note that

2  √ 2  2σ 2 bσ n log nc n 2σ log n (2n 2 σ√n log n ) log 1 2 lim − = lim e− − − b c − n n 2n 2 σ√n log n n  σ√n log n 2  − b c →∞ 1 b c →∞ − n2  √ 2  2  2σ2 log n (2n 2 σ√n log n ) bσ n log nc +O (log n) = lim e− − − b c − n2 n2 = lim eo(1) = 1. (41) n n →∞ →∞ Similarly,

4σ2  √  n 4σ2 log n 4 σ√n log n log 1+ bσ n log nc lim = lim e − b c n n  4 σ√n log n n →∞ σ√n log n b c →∞ 1 + b n c     p σ√n log n log n = lim exp 4σ2 log n 4 σ n log n b c + O (42) n →∞ − b c n n     p σ√n log n 1 log n lim exp 4σ2 log n 4(σ n log n 1) − + O (43) n ≤ →∞ − − n n ( r ) log n 4 log n = lim exp 8σ + O = 1. (44) n →∞ n − n n

Using (41) and (44) in (40) completes the proof.

14 Appendix B. Proof of Lemma 5

If a = ` 1, then m = 0 and a = 1, and the claim is trivially true. If a < ` 1, then m 1 k k − k k − ≥ and we can rewrite a as the following telescopic product.  Q  m Pe(#`ca0, #`ca1) Y c 0,1 i−1 a =  ∈{ }  , (45)  Q P (# ba0, # ba1)  i=1 e ` ` b 0,1 i ∈{ } 0 i 1 where we let 0, 1 := e . Note that the term within the brackets has a product of 2 − probabil- ities in the numerator,{ } and{ } 2i probabilities in the denominator, which can be appropriately grouped i 1 into 2 − appropriate η terms as follows.

m m Y Y Pe(#`ca0, #`ca1) Y Y (46) a = Q = η#`0ca0,#`0ca1,#`1ca0,#`1ca1. Pe(#`bca0, #`bca1) i=1 c 0,1 i−1 i=1 c 0,1 i−1 ∈{ } b 0,1 ∈{ } ∈{ } Using the result in Lemma 14, we obtain the following

 #`a0  #k−1a h # a p 2m−1 e− · `−1 Y #` 1ba a = (2π) 2 λa − , (47) p  #`ba0  #`−1ba h #` 1a m #`− ba b 0,1 :#`−1ba>0 e− · 1 − ∈{ }

2m−1 2m−1 where e− 2 λa e 2 . Now, by the mean value theorem (Apostol, 1974), we can expand the ≤ ≤ #`a0 binary entropy function h( ) around as follows. For each x, there exists a βx such that · #k−1a        2 #`a0 #`a0 #`a0 #`a0 h(x) = h + h0 x + h00 (βx) x , #` 1a #` 1a − #` 1a − #` 1a − − − − which when employed in (47) yields

2m−1 Q p (2π) 2 λa #` 1ba m − b 0,1 :#`−1ba>0 a = ( ∈{ } ). (48)  2 p P #`ba0 #`a0 #` 1a exp #` 1ba h00 (βb) # ba # a − m − `−1 `−1 − b 0,1 :#`− ba>0 · − ∈{ } 1 Note that in the above manipulation, the first order term is absent because     X #`ba0 #`a0 X #`a0 X #` 1ba = #`ba0 #` 1ba (49) m − · #` 1ba − #` 1a m − #` 1a m − b 0,1 − − b 0,1 − b 0,1 ∈{ } ∈{ } ∈{ } #`−1ba>0 #`−1ba>0 #`−1ba>0   #`a0 = #`a0 #` 1a = 0. (50) − #` 1a − − 1 Lastly, the claim follows from (48) by using the fact that h00(x) = x(1 x) 4 for any x [0, 1]. − − ≥ ∈ Appendix C. Proof of Lemma 6 The proof follows from the following arguments. Y Pn(T, z1:n) = Pe(#`a0, #`a1) (51) a T ∈

15 Y  Y  = a Pe(#`ba0, #`ba1) (52) a T b 0,1 `−1−kak ∈ ∈{ }  Y  Y  Y  = a Pe(#`ba0, #`ba1) (53) a T a T b 0,1 `−1−kak ∈ ∈ ∈{ }  Y  Y = a Pe(#`c0, #`c1) (54) a T c 0,1 `−1 ∈ ∈{ } Y = Pn(T`∗ 1, z1:n) a. (55) − a T ∈ The result then follows from the fact that Y Y p Y p #` 1ba = #` 1c , (56) − − a T b 0,1 ma c 0,1 `−1 ∈ ∈{ } ∈{ } m m `−1 Y 2 a −1 P 2 a −1 P 2`−1−kak |T | 2 −|T | (2π) 2 = (2π) a∈T 2 = (2π) a∈T 2 − 2 = (2π) 2 , (57) a T ∈ where the last step follows from the tightness of Kraft’s inequality for any T (see Remark 4). ∈ T`

Appendix D. Proof of Theorem 7 We proceed by induction. Suppose that k = 1. Then, for any z n and c , 1:n ∈ Z ∈ Z # c + 1 n # c ˆ #kc k 2 #kc 2 k 1 max PCTW(Z = c ; z1:n) = max = max − = . (58) z1:n | − n z1:n n + 1 − n z1:n n(n + 1) 2(n + 1) Now, suppose that the claim is true for alphabets 0, 1 ` for 1 ` < k. Now consider the CTW k { } ≤ k k 1 k estimator for a random process over 0, 1 . Let functions φ : 0, 1 0, 1 − , and ψ : 0, 1 { } { } → { } { } → 0, 1 be defined by φ(c1, . . . , ck) = (c1, . . . , ck 1) and ψ(c1, . . . , ck) = ck, respectively. By an abuse of{ notation,} let us allow φ and ψ to extend− to sequences over 0, 1 k to output sequences over k 1 { } 0, 1 − and 0, 1 , respectively, by operating over blockwise over chunks of k bits at a time. { } { } Pˆ (Z = c; z ) = Pˆ (φ(Z) = φ(c) ; φ(z ))Pˆ(k) (ψ(Z) = ψ(c) φ(c); z ). (59) CTW | 1:n CTW | 1:n CTW | 1:n k 1 Since by assumption, the claim holds for all sequences of length n over 0, 1 − , we have { }

X #kc ∆ = max Pˆ (Z = c ; z ) (60) k n CTW 1:n z1:n | − n ∈Z c 0,1 k ∈{ }  X #k 1φ(c) (k) #kc  max − Pˆ (ψ(Z) = ψ(c) φ(c); z1:n) z1:n CTW  k n | − n   c 0,1   ∈{ }   | {z }  (61) ≤ =:δk(z1:n)    P #k−1b  + max Pˆ (φ(Z) = b ; φ(z1:n)) z CTW n 1:n b 0,1 k−1 | − ∈{ } = ∆k 1 + max δk(z1:n), (62) − z1:n where in (61), we have used the triangle inequality in conjunction with the fact that the estimate Pˆ (ψ(Z) = ψ(c) φ(c); z ) 1. We then proceed as follows. CTW | 1:n ≤

X #k 1φ(c) (k) #kc δk(z1:n) = − Pˆ (ψ(Z) = ψ(c) φ(c); z1:n) (63) n CTW | − n c 0,1 k ∈{ } #kc>0

16

1 X #k 1φ(c) X (k) #kξT (φ(c))ψ(c) + 2 #kc = − ΛT (64) k n #k 1ξT (φ(c)) + 1 − n c 0,1 T k−1 − ∈{ } ∈T #kc>0

 1  X #k 1φ(c) X (k) #kξT (φ(c))ψ(c) + 2 #kξT (φ(c))ψ(c) − ΛT ≤ k n #k 1ξT (φ(c)) + 1 − #k 1ξT (φ(c)) c 0,1 T k−1 − − ∈{ } ∈T #kc>0

X #k 1φ(c) X (k) #kξT (φ(c))ψ(c) #kc + − ΛT (65) k n #k 1ξT (φ(c)) − n c 0,1 T k−1 − ∈{ } ∈T #kc>0

#k−1ξT (φ(c)) X #k 1φ(c) X (k) 2 #kξT (φ(c))ψ(c) = − ΛT − k n #k 1ξT (φ(c))(#k 1ξT (φ(c)) + 1) c 0,1 T k−1 − − ∈{ } ∈T #kc>0

X #k 1φ(c) X (k) #kξT (φ(c))ψ(c) #kc + − ΛT (66) k n #k 1ξT (φ(c)) − #k 1φ(c) c 0,1 T k−1 − − ∈{ } ∈T #kc>0

X #k 1φ(c) X (k) #kξT (φ(c))ψ(c) #kξT (φ(c))ψ(c) = − ΛT − k n 2#k 1ξT (φ(c))(#k 1ξT (φ(c)) + 1) c 0,1 T k−1 − − ∈{ } ∈T #kc>0

X #k 1φ(c) X (k) #kξT (φ(c))ψ(c) #kc + − ΛT (67) k n #k 1ξT (φ(c)) − #k 1φ(c) c 0,1 T k−1 − − ∈{ } ∈T #kc>0

X #k 1φ(c) X (k) 1 − ΛT ≤ k n 2(#k 1ξT (φ(c)) + 1) c 0,1 T k−1 − ∈{ } ∈T #kc>0

X X (k) #k 1φ(c) #kξT (φ(c))ψ(c) #kc + ΛT − (68) k n #k 1ξT (φ(c)) − #k 1φ(c) c 0,1 T k−1 − − ∈{ } ∈T #kc>0

X X X (k) #k 1ba = ΛT − ma 2n(#k 1a + 1) T k−1 a T b 0,1 − ∈T ∈ ∈{ } #k−1ba>0

X X X (k) #k 1ba #ka0 #kba0 + 2ΛT − (69) ma n #k 1a − #k 1ba T k−1 a T b 0,1 − − ∈T ∈ ∈{ } #k−1ba>0

X X (k) #k 1a ΛT − ≤ 2n(#k 1a + 1) T k−1 a T − ∈T ∈ v u  2 u X X X (k) #k 1ba #ka0 #kba0 + u 4ΛT − (70) t ma n #k 1a − #k 1ba T k−1 a T b 0,1 − − ∈T ∈ ∈{ } #k−1ba>0

17 v u (k)  2 E[ T ] u X X X 4Λ #k 1ba #ka0 #kba0 | | + u T − , (71) u ≤ 2n ma n #k 1a − #k 1ba tT k−1 a T b 0,1 − − ∈T ∈ ∈{ } #k−1ba>0 where

(a) in (63), we use the fact that the summand is zero whenever #kc = 0, and hence the introduction of the constraint #kc > 0 preserves the equality; (b) in (64), we use the CTW estimate for the kth bit given by (7);

(c) in (67), we have used the equality #k 1ξT (φ(c)) = #kξT (φ(c)0 + #kξT (φ(c)1; −

(d) in (69), we introduce ma := k 1 a (as in Lemma 6) when we split the sum over c into − − k k two depending on the value of ξT (φ(c)) and its prefix, and the factor of two in (69) appears as ψ(c) can take one of two values for which the absolute value terms are identical; and (e) in (70), we have used Jensen’s inequality exploiting the convexity of the square function. Now, note that the summation within the square-root operation in (71) is precisely that (17) of Lemma 6. Rearranging terms in (71) using (17) then yields v u Q p# c u 2k−1−|T | k 1 (k) k−1 − [ T ] u X Λ (2π) 2 c 0,1 :#k−1c>0 E u T ∈{ } (72) δk(z1:n) | | + log P (T,z ) . u n 1:n #k−1a ≤ 2n n ∗ Q √ tT k−1 Pn(T ,z1:n) ∈T k λa a T :#k− a>0 ∈ 1 To simplify the above expression, we use the following bounds.

1. the product of the square-root terms for any T k 1 can be upper bounded using ∈ T − Q p Q p #k 1c #k 1c − − c 0,1 k−1 c 0,1 k−1 ∈{ } ∈{ } r 2k−1 #k−1c>0 #k−1c>0  n  max . (73) Q p n Q p k 1 ≤ z1:n ≤ 2 #k 1a ∈Z #k 1a − − T k−1 − a T ∈T a T #k−∈1a>0 #k−∈1a>0

2. The term involving the product of λa’s, a T , can be bounded using Lemma 6 as follows. ∈   (k) (k) k 1 a k 1 X ΛT Y X ΛT X 2 − −k k 1 2 − E[ T ] log  λa − = − | | . (74) n ≤ n 2 2n T a T :#k− a>0 T a T ∈ 1 ∈ 3. The term involving the ratio of the sequence-dependent part of the CTW weights can be bounded as follows.

(k) (k) (k) (k) ! X ΛT Pn (T, z1:n) X ΛT wT Pn (T, z1:n) log log + log w ∗ (75) (k) (k) Tk−1 − n P (T , z ) ≤ − n w ∗ P (T , z ) T n k∗ 1 1:n T Tk−1 n k∗ 1 1:n − − P (k) k 1 P (k) (k) w 0 P (T , z ) 2 Λ log Λ (a) T n 0 1:n − T T 1 T 0 − T log (k) + ≤ −n w ∗ P (T , z ) n Tk−1 n k∗ 1 1:n − 2k 1 H(T ) (b) 2k = − + , (76) n n ≤ n

18 2k−1 where in (a) we use the fact that the CTW weight for Tk∗ 1 is at least 2− , and that − (k) X wT 0 Pn (T 0, z1:n) 1 (77) (k) = (k) 1, 0 w ∗ P (T , z ) Λ ∗ ≥ T Tk−1 n k∗ 1 1:n T − k−1

2k−1 k 1 and in (b) we use the fact that k 1 2 (and therefore H(T ) log k 1 2 − ). |T − | ≤ ≤ |T − | ≤ 4. Finally, the product of the 2π factors are bounded as follows.

k− (k) 2 1−|T | (k) P 2 P k 1 ΛT log(2π) ΛT (2 − T ) log(2π) k 1 T T − | | (2 − E[ T ]) log(2π) = − | | . (78) n ≤ 2n 2n Employing the above four bounds in (72) then yields

r k 1 k k 2 k 2 k 1 E[ T ] 2 − E[ T ] 2 2 − log n 2 − log 2 − δ (z ) | | + − | | log(2πe) + + (79) k 1:n ≤ 2n 2n n n − n k 1 r k 2 k 2 4 k 2 k 2 k 1 2 − 2 log(2πe) + 2 log(e ) + 2 log n 2 log(2 ) + − − − − − − (80) ≤ 2n n s k 1 k 2 2πe5n  2 2 log k−1 = − + − 2 . (81) 2n n

Lastly, since the bound in (81) is devoid of any dependency on z1:n, the claim follows from (62).

Appendix E. Proof of Theorem 10

1 Fix n N, and σ = 2 √1 α for some α (0, 1). Let an := n σ√n log n , and bn = 2n an. Let ∈ k − ∈ − b c − Mk : 0, 1 R be defined by { } →  an if b is of even weight Mk(b) = . (82) bn otherwise

k k k Let z1:2kn be a sequence of length 2 n over 0, 1 with Mk(b) occurrences of each b 0, 1 . Then, { } ∈ { }

k c

19 and if T is such that ξ (φ(b)) = k 1, then by the choice of z , k T k − 1:n 1 #kξT (φ(b))ψ(b) + 1 (83) an n bn n 2 = | − | = − . (87) #k 1ξT (φ(b)) + 1 − 2 2n + 1 2n + 1 −

(k)   Pn (T, z1:n) Θ(n ) a = k 2 = k k − . (90) (k) Θ(√n) a < k 2 Pn (T 0, z1:n) k k − Now, divide the proof into two cases depending on whether k = 2 or k > 2. (2)  2 Suppose k = 2. Then, from (90), it follows that Λ e = 1 Θ(n− ), and hence, for any b 0, 1 , { } − ∈ { } 1 (84) (85)  # b +  ˆ #2b , 1 (2) 1 (2) 2 2 #kb PCTW(Z = b ; z1:4n) = Λ e + ΛT ∗ (91) | − 4n 2 { } 2 1 2n + 1 − 4n  (2) (k) n #2b Λ e (2n + 1) + ΛT ∗ = | − | · { } 1 (92) 4n(2n + 1) σ√n log n = (1 + o(1)). (93) 4n

Pn(T,z1:n) 1 Now, if k > 2, then from (90), it follows that for any T k 1 e , = O(n− 2 ), which ∈ T − \ {{ }} Pn( e ,z1:n) (k) 1 { } 2 k then entails that Λ e = 1 O(n− ). Thus, for any b 0, 1 , { } − ∈ { }

#kb (84),(85) 1 1 1 #kb Pˆ (Z = b ; z k ) = (1 O(n− 2 )) + O(n− 2 ) CTW | 1:2 n − 2kn 2k − − 2kn

n #2b 1 σ√n log n = | − | + O(n− 2 ) = (1 + o(1)). (94) 2kn 2kn It then follows from (93) and (94) that for any k 2, ≥

X #kb σ√n log n Pˆ (Z = b ; z k ) = (1 + o(1)) (95) CTW | 1:2 n − 2kn n b r log(2kn) = 2k 2(1 α) (1 + o(1)). (96) − − 2kn The proof is then complete by choosing α small enough and n sufficiently large so that the scaling term √1 α(1 + o(1)) > √1 . − −

20