Convergence of Binarized Context-tree Weighting for Estimating Distributions of Stationary Sources

Badri N. Vellambi, Marcus Hutter
Australian National University
{badri.vellambi, marcus.hutter}@anu.edu.au

This work was supported by the Australian Research Council Discovery Projects DP120100950 and DP150104590.

Abstract—This work investigates the convergence rate of learning the stationary distribution of finite-alphabet stationary ergodic sources using a binarized context-tree weighting approach. The binarized context-tree weighting (CTW) algorithm estimates the stationary distribution of a symbol as a product of conditional distributions of each component bit, which are determined in a sequential manner using the well-known binary context-tree weighting method. We establish that the binarized CTW algorithm is a consistent estimator of the stationary distribution, and that the worst-case $L_1$-prediction error between the CTW and frequency estimates using $n$ source symbols, each of which when binarized consists of $k > 1$ bits, decays as $\Theta\big(\sqrt{\tfrac{2^k \log n}{n}}\big)$.

I. INTRODUCTION

The binary context-tree weighting (CTW) algorithm proposed in [1], [2] is a universal data compression scheme that uses a Bayesian mixture model over all context trees of up to a certain design depth to assign probabilities to sequences. The probability estimate and the weights corresponding to each tree assigned by the CTW algorithm are derived efficiently using the well-known Krichevsky-Trofimov (KT) estimator, which models source bits as having a Bernoulli distribution whose parameter has a Dirichlet prior [3]. The CTW algorithm offers two attractive features:
• optimal compression for binary tree sources [4, Thm. 1];
• superior redundancy bounds and amenability to analysis as opposed to other universal compression schemes such as the Lempel-Ziv algorithm [5], [6]; e.g., it was shown in [2, Thm. 2] that for any binary tree source $P$ with $C$ context parameters, the redundancy $\rho(\cdot) := \log_2 \frac{P(\cdot)}{P_{\mathrm{CTW}}(\cdot)}$ of the binary CTW algorithm is bounded by

$$\max_{x_{1:n}} \rho(x_{1:n}) \le C\Big(\tfrac{1}{2}\log_2\tfrac{n}{C} + 1\Big) + (2C-1) + 2. \qquad (1)$$

The first term is the asymptotically dominant optimal redundancy achievable for the source $P$, and the second term $(2C-1)$ is the model redundancy, i.e., the price paid for weighting various tree models as opposed to using the actual source model if it were indeed known.
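The KT estimator mentioned above is the building block at every node of the CTW mixture. The following minimal Python sketch (ours, not from [1]–[3]) shows the standard sequential rule it implements: each successive bit $b$ is predicted with probability $(\#b + \tfrac12)/(t+1)$, where $\#b$ counts the prior occurrences of $b$ among the first $t$ bits.

```python
from fractions import Fraction

def kt_probability(bits):
    """Sequential Krichevsky-Trofimov probability of a binary string.

    The next bit b is predicted with probability
    (previous occurrences of b + 1/2) / (number of previous bits + 1),
    i.e., a Bernoulli model whose parameter has a Dirichlet(1/2, 1/2) prior.
    """
    prob = Fraction(1)
    counts = [0, 0]  # counts[b] = occurrences of bit b so far
    for t, b in enumerate(bits):
        prob *= Fraction(2 * counts[b] + 1, 2 * (t + 1))
        counts[b] += 1
    return prob

# Example: P_KT(0101) = (1/2)(1/4)(1/2)(3/8) = 3/128.
print(kt_probability([0, 1, 0, 1]))  # -> 3/128
```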

An extension of CTW to sources over non-binary alphabets was proposed in [7], [8], and a redundancy bound similar to (1) for a tree source with $C$ contexts over a finite alphabet $\mathcal{A}$ was derived; the asymptotically dominant term in this bound for a general alphabet $\mathcal{A}$ was established to be $\frac{C(|\mathcal{A}|-1)}{2}\log_2\frac{n}{C(|\mathcal{A}|-1)}$. In [7], a decomposition-based approach that uses binarization to map symbols from large alphabets to bit strings was also summarily proposed; this approach was later analyzed in greater detail in [9].

There are several motivating factors for taking a closer look into binarization. First, many data sources of interest such as text, image, and audio files are factored and represented into smaller units (such as bits or bytes) for digital storage. Hence, despite the fact that a source of practical interest might be distributed over a large alphabet, its symbols can effectively be viewed as a tuple of binary random variables. Second, the decomposition of larger alphabets into appropriate binary strings translates the CTW algorithm on the large alphabet to a sequence of binary CTW problems. This translation in turn has the potential of reducing the number of model parameters and can consequently offer improved redundancy guarantees [9]–[11]. This is better understood via an example. Consider an i.i.d. source over a hexadecimal alphabet whose distribution is given in the table in Fig. 1. While no specific structure is apparent from the table, the example corresponds to a binary tree source with each source symbol comprising four bits, the first of which is an i.i.d. Bernoulli random variable with parameter 0.6. For $i = 2, 3, 4$, the $i$-th bit potentially depends on the realization of the first $i-1$ bits, and the conditional distribution of the $i$-th bit is a Bernoulli distribution that depends on its context; e.g., the third bit is a Bernoulli random variable with parameter 0.75 if the first two bits are both ones. The naïve CTW over the hexadecimal alphabet looks for no structure within each symbol, and hence must learn 15 parameters, whereas the dominant trees for each of the 4 bits in the binarized factored CTW together have to learn only 10 parameters, which yields improvements in terms of redundancy.

[Fig. 1. Performance of the binarized CTW algorithm for a tree source. Top: the source distribution over the hexadecimal alphabet,

| $c$ | $P(c)$ | $c$ | $P(c)$ | $c$ | $P(c)$ | $c$ | $P(c)$ |
|---|---|---|---|---|---|---|---|
| 0 | 9/500 | 4 | 189/1000 | 8 | 81/1000 | 12 | 27/800 |
| 1 | 1/500 | 5 | 21/1000 | 9 | 9/1000 | 13 | 3/800 |
| 2 | 7/125 | 6 | 9/250 | 10 | 126/500 | 14 | 9/320 |
| 3 | 3/125 | 7 | 27/500 | 11 | 27/250 | 15 | 27/320 |

together with the per-bit context trees and their (context, parameter) pairs: 1st bit: $(\epsilon, 0.6)$; 2nd bit: $(1, 0.25)$, $(0, 0.75)$; 3rd bit: $(11, 0.75)$, $(01, 0.3)$, $(0, 0.8)$; 4th bit: $(111, 0.75)$, $(011, 0.6)$, $(01, 0.3)$, $(0, 0.1)$. Bottom: log-log plot of the average $L_1$ error $\mathbb{E}\|\hat{P}_{\mathrm{CTW}} - P\|_1$ and $\mathbb{E}\|\hat{P}_{\#} - P\|_1$ versus $n \in \{10^0, \dots, 10^5\}$; both curves exhibit an $O(n^{-1/2})$ decay.]

In general, for a tree source with alphabet $\mathcal{A}$ and with $C$ contexts, the asymptotically dominant term for redundancy can reduce from $\frac{C(|\mathcal{A}|-1)}{2}\log_2\frac{n}{C(|\mathcal{A}|-1)}$ to $\sum_{i=1}^{\lceil \log_2 |\mathcal{A}|\rceil} \frac{C_i}{2}\log_2\frac{n}{C_i}$, where $C_i \le 2^{i-1}C$ is the number of contexts in the tree of the $i$-th bit in the binary decomposition. The scaling factor of $\log_2 n$ can thus potentially reduce from $\frac{C(|\mathcal{A}|-1)}{2}$ to $\frac{C\lceil \log_2 |\mathcal{A}|\rceil}{2}$, which can be significant when compressing sources with large alphabets.
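To make the example of Fig. 1 concrete, the following Python sketch (our reconstruction; the suffix-matching rule for contexts follows our reading of the trees in Fig. 1) rebuilds the 16-symbol distribution from the 10 per-bit Bernoulli parameters and checks that it is a valid probability distribution.

```python
from fractions import Fraction as F
from itertools import product

# Per-bit context trees: each maps a suffix of the preceding bits to
# P(bit = 1); the empty context '' governs the (i.i.d.) first bit.
TREES = [
    {'': F(3, 5)},
    {'1': F(1, 4), '0': F(3, 4)},
    {'11': F(3, 4), '01': F(3, 10), '0': F(4, 5)},
    {'111': F(3, 4), '011': F(3, 5), '01': F(3, 10), '0': F(1, 10)},
]

def theta(tree, prefix):
    """Parameter of the unique leaf that is a suffix of the previous bits."""
    for depth in range(len(prefix), -1, -1):
        ctx = prefix[len(prefix) - depth:]
        if ctx in tree:
            return tree[ctx]
    raise KeyError(prefix)

dist = {}
for bits in product('01', repeat=4):
    p = F(1)
    for l, b in enumerate(bits):
        t = theta(TREES[l], ''.join(bits[:l]))
        p *= t if b == '1' else 1 - t
    dist[int(''.join(bits), 2)] = p

assert sum(dist.values()) == 1        # the 10 parameters define a distribution
print(dist[15], dist[10], dist[1])    # 27/320, 63/250 (= 126/500), 1/500
```

The 10 parameters (1 + 2 + 3 + 4 across the four trees) reproduce the table exactly, versus the 15 free parameters of a naïve hexadecimal model.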
As an illustration, again consider the i.i.d. source $P$ of Fig. 1. Let $\hat{P}_{\#}$ denote the frequency estimate of $P$, and let $\hat{P}_{\mathrm{CTW}}$ denote the binarized factored CTW estimate (for the complete definition, see Sec. III). Fig. 1 presents the variation of the mean $L_1$-prediction error with the number of samples $n$; the following observations can be made.
• CTW is a better estimator than a naïve frequency (ML) estimator when the size $n$ of the data is small or sufficiently large ($n \lesssim 30$ or $n \gtrsim 450$ in the above example). The improved prediction error for small data sizes can be attributed to the Bayesian averaging inherent in the CTW algorithm, whereas the frequency estimator does not have enough samples to predict accurately. When the data size is large, the contribution of the tree model that best describes the data dominates the CTW algorithm. The CTW estimator therefore offers a scaling improvement depending on the number of parameters needed to describe the 'best' tree model, which is, in general, fewer than that required to naïvely describe the distribution completely.
• On average, the CTW and frequency estimators offer similar asymptotic convergence rates, which in this example decay as $O(n^{-1/2})$ (cf. Fig. 1).

Third, the compress-and-control framework of [12] estimates state-action value functions in reinforcement learning using a density estimator over a derived hidden Markov model whose state corresponds to the concatenation of the state, action and return in the reinforcement learning problem. The following summarizes the main result of this approach.
• A consistent estimator of the stationary distribution of the state of the derived hidden Markov model whose estimation error converges stochastically to zero suitably fast results in a consistent estimator of the state-action value function.
Simulations in [12] revealed two benefits from binarizing large state spaces and using a binarized factored CTW:
• an excellent convergence rate in estimating the stationary distribution of the underlying process;
• the ability to effectively handle considerably larger state spaces than the frequency estimator.

In this work, we assume that the source symbols take values from a vector binary alphabet (i.e., $\{0,1\}^k$ for some $k \in \mathbb{N}$), and derive some fundamental properties of the binarized CTW estimate of the source distribution, thereby providing a theoretical reasoning for why the binarized CTW estimate performs well in simulations [12]. We establish the following.
• The worst-case $L_1$-prediction error between the binarized CTW and frequency (ML) estimates of the stationary distribution of a stationary ergodic source over $\{0,1\}^k$ for some $k > 1$ is $\Theta\big(\sqrt{\tfrac{2^k \log n}{n}}\big)$;
• the binarized CTW method yields a consistent estimator of the stationary distribution of stationary ergodic sources; and
• any (Bayesian) dependency structure that exists in the binary representation of source symbols can be naturally incorporated in the binarized CTW algorithm.

The rest of the paper is organized as follows. Section II presents the definitions and notation used, and Section III defines the binarized CTW algorithm. Section IV presents relevant background results regarding the contribution of various context trees in the CTW estimate, and finally, Section V presents the main results pertaining to binarized CTW. For want of space, all proofs are relegated to the full version [13].

II. NOTATION AND DEFINITIONS

[Fig. 2. Some details of the binarized CTW estimator: the symbol stream $z_{1:n}$ is split into $k$ component bit streams $z^1_{1:n}, \dots, z^k_{1:n}$, and each bit stream is predicted by a binary CTW using its own history.]

The per-bit conditional CTW estimates are pieced together to obtain the CTW estimate for the stationary distribution of the source as follows. For each $\beta_{1:k} \in \{0,1\}^k$,

$$\hat{P}_{\mathrm{CTW}}(\beta_{1:k}; z_{1:n}) \triangleq \prod_{\ell=1}^{k} \hat{P}^{(\ell)}_{\mathrm{CTW}}(B_\ell = \beta_\ell \mid B_{1:\ell-1} = \beta_{1:\ell-1}; z_{1:n}). \qquad (7)$$

Two things must be remarked at this juncture. First, the CTW estimator is a biased estimator of the distribution of the source. The bias is inherited directly from the KT estimator, and can be seen by simply considering a binary i.i.d. Bernoulli source with parameter $\theta \ne \tfrac12$, for which

$$\mathbb{E}\big[\hat{P}_{\mathrm{CTW}}(Z = 1; z_{1:n})\big] = \mathbb{E}\bigg[\frac{\#_1 1 + \tfrac12}{n+1}\bigg] = \frac{n\theta + \tfrac12}{n+1} \ne \theta. \qquad (8)$$

Second, the binarized CTW estimator defined above learns the stationary distribution of the source process, neglecting the statistical correlation between symbols of the source. Correlation between symbols can be incorporated into the binarized CTW estimator by allowing more factors in (7).
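The bias in (8) is easy to verify numerically. The sketch below (ours, not from the paper) evaluates the exact expectation of the add-half KT estimate, to which the CTW estimate reduces for a single i.i.d. bit (cf. (8)), and compares it with the closed form $(n\theta + \tfrac12)/(n+1)$.

```python
from math import comb

def expected_kt_estimate(n, theta):
    """Exact E[(#ones + 1/2)/(n + 1)] when #ones ~ Binomial(n, theta)."""
    return sum(comb(n, x) * theta**x * (1 - theta)**(n - x)
               * (x + 0.5) / (n + 1) for x in range(n + 1))

n, theta = 20, 0.3
print(expected_kt_estimate(n, theta))   # 0.30952...
print((n * theta + 0.5) / (n + 1))      # identical: (n*theta + 1/2)/(n + 1)
```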
To do 0, 1 and a 0, 1 such that #` 1a > 0. Let a be as { } ∈ { } − just that, we proceed in three stages. In each stage, we use the in (11) and m := ` 1 a . Then, m − − k k bound developed in the earlier stage for a limited class of pairs 2 −1 1 Q p (2πe) 2 #` 1ba of trees to generalize to a bound for a larger class of pairs of √#`−1a m − b 0,1 :#`− ba>0  ∈{ } 1 . trees; after the third stage, we will have the required result to a   2  ≤ P #`ba0 #`a0 exp 4 #` 1ba compare the weight of any tree with that of the complete tree. #`−1ba #`−1a th b 0,1 m − − Consider the CTW estimate of the ` bit for ` > 1. Let ∈{ } #`−1ba>0 T ` 1 be any context tree, and let a T with a < ∈ T − ∈ k k Proof: The proof proceeds by expressing a as a product ` 1. Let context tree T 0 be obtained from T by replacing a−by its two children nodes 1a and 0a as in Fig. 3, i.e., of η-terms, and then exploiting the concavity of the binary entropy function. For a complete proof, see [13]. T 0 = (T a ) 0a, 1a . In the first stage, by using (6), we see that\{ the} ratio∪ { of the} sequence-dependent part of CTW Finally, in the third stage, we can repeatedly use Lemma 1 to weights P (T, z ) and P (T , z ) is compare the sequence-dependent part of weights of any tree n 1:n n 0 1:n ` 1 T ` 1 with that of the complete tree T`∗ 1 = 0, 1 − . (`) ∈ T − − { } Pn (T, z1:n) Pe(#`a0, #`a1) Lemma 2: Let k ` 2 and z1:n be a sequence over (`) = . k ≥ ≥ <` Pn (T , z ) Pe(#`0a0, #`0a1)Pe(#`1a0, #`1a1) 0, 1 and T ` 1. Let ma := ` 1 a for a 0, 1 . 0 1:n { } ∈ T − − −k k ∈ { } Since # a0 = # 0a0+# 1a0 and # a1 = # 0a1+# 1a1, Then, for some λa depending on z1:n such that log λa ` ` ` ` ` ` ma (`) | | ≤ 2 1 Pn (T,z1:n) one needs to study the following quantity so as to compare the 2− , the ratio (`) ∗ is bounded above by Pn (T`−1,z1:n) ratio of the weights of two trees. Let for a, b, c, d N 0 , ∈ ∪ { } 2`−1−|T |   Q λa Q p Pe(a + c, b + d) (2π) 2 #` 1c ηa,b,c,d := . (9) √#`−1a `− − P (a, b)P (c, d) a T c 0,1 1 e e #`−∈1a>0 ∈{ } #`−1c>0 Using Stirling’s approximation, one can relate ηa,b,c,d and the ( ). (12)  2 binary entropy function h as follows. If a+b > 0 and c+d > 0, P P #`ba0 #`a0 exp 4 #` 1ba # ba # a m − `−1 `−1 (a+b)h a +(c+d)h c a T b 0,1 a − e ( a+b ) ( c+d ) ∈ ∈{ } η = λ , (10) #`−1ba>0 a,b,c,d abcd q a+c (a+b+c+d) (a+b+c+d)h( a+b+c+d ) 2π(a+b)(c+d) e Proof: For a complete proof, see [13]. Having proven the required machinery, we are now A. Extensions of CTW. equipped to state our results pertaining to the CTW estimator. The formulation of the binarized CTW and the bounds in Theorems 1 and 2 assume no prior dependency structure V. NEW RESULTS between the bits B1,...,Bk in the binary representation of each symbol of the random process under study. Any known We first present an upper bound for the worst-case L1-error dependency structure between the bits representable by a between the CTW and frequency estimates, followed by a Bayesian network can be incorporated to define an appropri- lower bound established by a suitable choice for z1:n. B ate binarized CTW estimate PˆB ( ; z1:n) by modifying (7). Theorem 1 (Upper Bound): Let = 0, 1 k, k . Then, CTW · | N In general, the binarized CTW variant ˆ will be a Bayesian Z { } ∈ PCTWB X #kc mixture over all tree sources that satisfy the dependencies ∆ := max Pˆ (Z = c ; z ) k n CTW 1:n imposed by , and will yield some computational benefits z1:n | − n ∈Z c 0,1 k ˆ B ∈{ } over PCTW. 
Now, for the second stage, we can, by repeated application of the above result, compare the sequence-dependent parts of the weights of two trees $T, T''$ related as follows. Let $T, T'' \in \mathcal{T}_{\ell-1}$ be two trees such that the former contains, say, a node $a$, and the latter contains, instead of $a$, all nodes at depth $\ell - 1$ which have $a$ as a suffix. Consider the example in Fig. 3, where $a = 0$, and $T''$ contains $00a$, $01a$, $10a$, and $11a$ instead of $a$. We can compare the sequence-dependent parts of the weights of $T$ and $T''$ by bounding three ratios of the sequence-dependent parts of the weights of: (a) trees $T$ and $T_a := \{00, 10, 01, 011, 111\}$; (b) trees $T_a$ and $T_b := \{000, 100, 10, 01, 011, 111\}$; and (c) trees $T_b$ and $T''$.

[Fig. 3. An example of trees compared in the first and second stages ($\ell = 4$): $T = \{0, 01, 011, 111\}$, $T' = \{00, 10, 01, 011, 111\}$, and $T'' = \{000, 010, 100, 110, 01, 011, 111\}$.]

More generally, in the CTW estimate of the $\ell$-th bit for some $\ell = 2, \dots, k$, the ratio of the sequence-dependent parts of the weights of trees $T$ and $T''$ (related in the above manner) is

$$\varrho_a := \frac{P^{(\ell)}_n(T, z_{1:n})}{P^{(\ell)}_n(T'', z_{1:n})} = \frac{\prod_{b \in T} P_e(\#_\ell b0, \#_\ell b1)}{\prod_{b \in T''} P_e(\#_\ell b0, \#_\ell b1)}, \qquad (11)$$

which can be bounded as follows.

Lemma 1: Let $k \ge \ell \ge 2$, let $z_{1:n}$ be a sequence over $\{0,1\}^k$, and let $a \in \{0,1\}^{<\ell}$ be such that $\#_{\ell-1} a > 0$. Let $\varrho_a$ be as in (11) and $m := \ell - 1 - \|a\|$. Then,

$$\varrho_a \le \frac{(2\pi e)^{\frac{2^m - 1}{2}}\, \frac{1}{\sqrt{\#_{\ell-1} a}} \prod_{b \in \{0,1\}^m :\, \#_{\ell-1} ba > 0} \sqrt{\#_{\ell-1} ba}}{\exp\Big(\frac{1}{4}\sum_{b \in \{0,1\}^m :\, \#_{\ell-1} ba > 0} \#_{\ell-1} ba \Big(\frac{\#_\ell ba0}{\#_{\ell-1} ba} - \frac{\#_\ell a0}{\#_{\ell-1} a}\Big)^{2}\Big)}.$$

Proof: The proof proceeds by expressing $\varrho_a$ as a product of $\eta$-terms, and then exploiting the concavity of the binary entropy function. For a complete proof, see [13].

Finally, in the third stage, we can repeatedly use Lemma 1 to compare the sequence-dependent part of the weights of any tree $T \in \mathcal{T}_{\ell-1}$ with that of the complete tree $T^*_{\ell-1} = \{0,1\}^{\ell-1}$.

Lemma 2: Let $k \ge \ell \ge 2$, let $z_{1:n}$ be a sequence over $\{0,1\}^k$, and let $T \in \mathcal{T}_{\ell-1}$. Let $m_a := \ell - 1 - \|a\|$ for $a \in T$. Then, for some $\lambda_a$ depending on $z_{1:n}$ such that $|\ln \lambda_a| \le \frac{2^{m_a}-1}{2}$, the ratio $\frac{P^{(\ell)}_n(T, z_{1:n})}{P^{(\ell)}_n(T^*_{\ell-1}, z_{1:n})}$ is bounded above by

$$(2\pi)^{\frac{2^{\ell-1} - |T|}{2}}\; \frac{\prod_{a \in T :\, \#_{\ell-1} a > 0} \frac{\lambda_a}{\sqrt{\#_{\ell-1} a}} \prod_{c \in \{0,1\}^{m_a} :\, \#_{\ell-1} ca > 0} \sqrt{\#_{\ell-1} ca}}{\exp\Big(\frac{1}{4} \sum_{a \in T} \sum_{b \in \{0,1\}^{m_a} :\, \#_{\ell-1} ba > 0} \#_{\ell-1} ba \Big(\frac{\#_\ell ba0}{\#_{\ell-1} ba} - \frac{\#_\ell a0}{\#_{\ell-1} a}\Big)^{2}\Big)}. \qquad (12)$$

Proof: For a complete proof, see [13].

Having proven the required machinery, we are now equipped to state our results pertaining to the CTW estimator.

V. NEW RESULTS

We first present an upper bound for the worst-case $L_1$-error between the CTW and frequency estimates, followed by a lower bound established by a suitable choice for $z_{1:n}$.

Theorem 1 (Upper Bound): Let $\mathcal{Z} = \{0,1\}^k$, $k \in \mathbb{N}$. Then,

$$\Delta_k := \max_{z_{1:n} \in \mathcal{Z}^n} \sum_{c \in \{0,1\}^k} \Big|\hat{P}_{\mathrm{CTW}}(Z = c; z_{1:n}) - \frac{\#_k c}{n}\Big| \le \begin{cases} \frac{1}{2n}, & k = 1\\[4pt] \Delta_{k-1} + \frac{2^{k-1}}{2n} + \sqrt{\frac{2^{k-2}}{n}\log\frac{2\pi e^5 n}{2^{k-1}}}, & k > 1. \end{cases}$$

Proof: The proof proceeds by induction, and uses (3)–(7) to relate the worst-case $L_1$-error $\Delta_k$ for sequences over $\{0,1\}^k$ to the worst-case $L_1$-error $\Delta_{k-1}$ for sequences over $\{0,1\}^{k-1}$. The proof is then completed by using (12) and some standard information-theoretic bounds. For a full proof, see [13].

Since Theorem 1 quantifies a worst-case bound that holds for all sequences, and since the frequency estimator is a consistent estimator of the stationary distribution of any finite-alphabet stationary ergodic source, the following two inferences ensue.

Corollary 1:
$$\Delta_k \le \frac{2^{k-1} - 1}{2n} + \sqrt{\frac{\log n}{n}}\; \frac{\sqrt{2^k} - \sqrt{2}}{\sqrt{2} - 1}\, (1 + o(1)), \quad k \in \mathbb{N}.$$

Corollary 2: The binarized context-tree weighting estimator $\hat{P}_{\mathrm{CTW}}$ is a consistent estimator of the stationary distribution of any finite-alphabet stationary ergodic source.

At this point, one might suspect the tightness of the worst-case bound, especially since for i.i.d. and finite-order ergodic Markov sources one can show using, say, Hoeffding's inequality that the expected $L_1$-error between the binarized CTW (or the frequency) estimator and the actual stationary distribution of the source decays as $n^{-\frac12}$. Hence, for these sources,

$$\mathbb{E}\Big|\hat{P}_{\mathrm{CTW}}(Z = c; z_{1:n}) - \frac{\#_k c}{n}\Big| = \Theta\big(n^{-\frac12}\big). \qquad (13)$$
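Before turning to the matching lower bound, the following Python sketch (ours) unrolls the recursion of Theorem 1 to give a sense of the scale of the worst-case bound relative to the $\sqrt{\log n / n}$ rate; the natural logarithm is used, consistent with the constant $e^5$ appearing in the bound.

```python
from math import e, log, pi, sqrt

def delta_upper_bound(k, n):
    """Unrolled recursive upper bound of Theorem 1 on the worst-case
    L1 gap between the binarized CTW and frequency estimates."""
    bound = 1 / (2 * n)                              # base case k = 1
    for j in range(2, k + 1):
        bound += 2 ** (j - 1) / (2 * n) + sqrt(
            2 ** (j - 2) / n * log(2 * pi * e**5 * n / 2 ** (j - 1)))
    return bound

for n in (10**3, 10**5, 10**7):
    print(n, delta_upper_bound(4, n), sqrt(log(n) / n))
```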
However, the following result identifies sequences $z_{1:n}$ for which the $L_1$-error between the CTW and frequency estimates is of the same order as the upper bound of Theorem 1.

Theorem 2 (Lower Bound): Let $\mathcal{Z} = \{0,1\}^k$ for $k \ge 2$. Then, for any $\epsilon > 0$, there exist $n \in \mathbb{N}$ and $z_{1:n} \in \mathcal{Z}^n$ such that

$$\sum_{c \in \{0,1\}^k} \Big|\hat{P}_{\mathrm{CTW}}(Z = c; z_{1:n}) - \frac{\#_k c}{n}\Big| \ge \sqrt{\frac{2^{k-2}(1-\epsilon)\log n}{n}}.$$

Proof: The proof establishes the above bound using a sequence of length $2^k n$ over $\{0,1\}^k$ with $n - \lfloor \sigma\sqrt{n \log n}\rfloor$ occurrences of each $b \in \{0,1\}^k$ of even weight, and $n + \lfloor \sigma\sqrt{n \log n}\rfloor$ occurrences of each $b \in \{0,1\}^k$ of odd weight, where $\sigma = \frac{1}{2}\sqrt{1 - \alpha}$ and $\alpha \in (0,1)$. For details, see [13].

A. Extensions of CTW

The formulation of the binarized CTW and the bounds in Theorems 1 and 2 assume no prior dependency structure between the bits $B_1, \dots, B_k$ in the binary representation of each symbol of the random process under study. Any known dependency structure between the bits representable by a Bayesian network $\mathcal{B}$ can be incorporated to define an appropriate binarized CTW estimate $\hat{P}^{\mathcal{B}}_{\mathrm{CTW}}(\cdot\,; z_{1:n})$ by modifying (7). In general, the binarized CTW variant $\hat{P}^{\mathcal{B}}_{\mathrm{CTW}}$ will be a Bayesian mixture over all tree sources that satisfy the dependencies imposed by $\mathcal{B}$, and will yield some computational benefits over $\hat{P}_{\mathrm{CTW}}$. Similar to Theorems 1 and 2, we can establish the following result quantifying the $L_1$-prediction error between $\hat{P}^{\mathcal{B}}_{\mathrm{CTW}}$ and the maximum-likelihood estimate

$$\hat{P}_{\mathrm{ML},\mathcal{B}}(\cdot\,; z_{1:n}) := \operatorname*{argmax}_{P \in \mathcal{P}(\mathcal{B})} P(z_{1:n}), \qquad (14)$$

where $\mathcal{P}(\mathcal{B})$ is the set of all probability distributions that satisfy the Markov/dependency conditions of the Bayesian network $\mathcal{B}$. When $\mathcal{B}$ imposes no Markov conditions among $B_1, \dots, B_k$, $\hat{P}_{\mathrm{ML},\mathcal{B}}(\cdot\,; z_{1:n})$ is simply the frequency estimate.

Theorem 3: Given a Bayesian network $\mathcal{B}$ consisting of $k$ binary random variables, and $\mathcal{Z} = \{0,1\}^k$,

$$\max_{z_{1:n} \in \mathcal{Z}^n} \sum_{c \in \mathcal{Z}} \big|\hat{P}^{\mathcal{B}}_{\mathrm{CTW}}(c\,; z_{1:n}) - \hat{P}_{\mathrm{ML},\mathcal{B}}(c\,; z_{1:n})\big| = \Theta\Big(\sqrt{\frac{\log n}{n}}\Big).$$

REFERENCES

[1] F. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "Context tree weighting: a sequential universal source coding procedure for FSMX sources," Proc. IEEE International Symposium on Information Theory, p. 59, Jan 1993.
[2] ——, "The context-tree weighting method: basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653–664, May 1995.
[3] R. Krichevsky and V. Trofimov, "The performance of universal encoding," IEEE Transactions on Information Theory, vol. 27, no. 2, pp. 199–207, Mar 1981.
[4] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Transactions on Information Theory, vol. 30, no. 4, pp. 629–636, Jul 1984.
[5] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337–343, May 1977.
[6] ——, "Compression of individual sequences via variable-rate coding," IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, Sept 1978.
[7] T. J. Tjalkens, Y. M. Shtarkov, and F. Willems, "Context tree weighting: Multi-alphabet sources," Proc. 14th Symposium on Information Theory, Benelux, pp. 128–135, 1993.
[8] T. J. Tjalkens, F. Willems, and Y. M. Shtarkov, "Sequential weighting algorithms for multi-alphabet sources," Proc. 6th Swedish-Russian Workshop on Information Theory, Mölle, Sweden, pp. 22–27, Aug 1993.
[9] ——, "Multi-alphabet universal coding using a binary decomposition context tree weighting algorithm," Proc. 15th Symposium on Information Theory, Benelux, pp. 259–265, 1994.
[10] T. J. Tjalkens, P. A. J. Volf, and F. Willems, "A context-tree weighting method for text generating sources," Proc. Data Compression Conference (DCC '97), p. 472, March 1997.
[11] R. Begleiter and R. El-Yaniv, "Superior guarantees for sequential prediction and lossless compression via alphabet decomposition," J. Mach. Learn. Res., vol. 7, pp. 379–411, Dec. 2006.
[12] J. Veness, M. G. Bellemare, M. Hutter, A. Chua, and G. Desjardins, "Compress and control," Proc. Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI'15), pp. 3016–3023, 2015.
[13] B. N. Vellambi and M. Hutter, "Convergence of binarized context-tree weighting for estimating distributions of stationary sources." Online: http://www.hutter1.net/publ/convbinctwx.pdf