Convergence of Binarized Context-Tree Weighting for Estimating Distributions of Stationary Sources
Badri N. Vellambi, Marcus Hutter
Australian National University
{badri.vellambi, marcus.hutter}@anu.edu.au

This work was supported by the Australian Research Council Discovery Projects DP120100950 and DP150104590.

Abstract—This work investigates the convergence rate of learning the stationary distribution of finite-alphabet stationary ergodic sources using a binarized context-tree weighting approach. The binarized context-tree weighting (CTW) algorithm estimates the stationary distribution of a symbol as a product of conditional distributions of each component bit, which are determined in a sequential manner using the well-known binary context-tree weighting method. We establish that the binarized CTW algorithm is a consistent estimator of the stationary distribution, and that the worst-case $L_1$-prediction error between the CTW and frequency estimates using $n$ source symbols, each of which when binarized consists of $k > 1$ bits, decays as $\Theta\big(\sqrt{\tfrac{2^k \log n}{n}}\big)$.

I. INTRODUCTION

The binary context-tree weighting (CTW) algorithm proposed in [1], [2] is a universal data compression scheme that uses a Bayesian mixture model over all context trees of up to a certain design depth to assign probabilities to sequences. The probability estimate and the weights corresponding to each tree assigned by the CTW algorithm are derived efficiently using the well-known Krichevsky-Trofimov (KT) estimator, which models source bits as having a Bernoulli distribution whose parameter has a Dirichlet prior [3]. The CTW algorithm offers two attractive features:
• optimal compression for binary tree sources [4, Thm. 1];
• superior redundancy bounds and amenability to analysis as opposed to other universal compression schemes such as the Lempel-Ziv algorithm [5], [6]; e.g., it was shown in [2, Thm. 2] that for any binary tree source $P$ with $C$ context parameters, the redundancy $\rho(\cdot) := \log_2 \frac{P(\cdot)}{P_{\mathrm{CTW}}(\cdot)}$ of the binary CTW algorithm is bounded by

$$\max_{x_{1:n}} \rho(x_{1:n}) \le C\left(\tfrac{1}{2}\log_2\tfrac{n}{C} + 1\right) + (2C-1) + 2. \tag{1}$$

The first term is the asymptotically dominant optimal redundancy achievable for the source $P$, and the second term $(2C-1)$ is the model redundancy, i.e., the price paid for weighting various tree models as opposed to using the actual source model if it were indeed known.

An extension of CTW to sources over non-binary alphabets was proposed in [7], [8], and a redundancy bound similar to (1) for a tree source with $C$ contexts over a finite alphabet $\mathcal{A}$ was derived; the asymptotically dominant term in this bound for a general alphabet $\mathcal{A}$ was established to be $\frac{C(|\mathcal{A}|-1)}{2}\log_2\frac{n}{C(|\mathcal{A}|-1)}$. In [7], a decomposition-based approach that uses binarization to map symbols from large alphabets to bit strings was also summarily proposed; this approach was later analyzed in greater detail in [9].

There are several motivating factors for taking a closer look into binarization. First, many data sources of interest such as text, image, and audio files are factored and represented in smaller units (such as bits or bytes) for digital storage. Hence, even though a source of practical interest might be distributed over a large alphabet, its symbols can effectively be viewed as tuples of binary random variables. Second, the decomposition of larger alphabets into appropriate binary strings translates the CTW algorithm on the large alphabet into a sequence of binary CTW problems. This translation in turn has the potential of reducing the number of model parameters and can consequently offer improved redundancy guarantees [9]–[11]. This is better understood via an example. Consider an i.i.d. source over a hexadecimal alphabet whose distribution is given in the table in Fig. 1:

Symbol  Prob.     Symbol  Prob.      Symbol  Prob.     Symbol  Prob.
0       9/500     4       189/1000   8       81/1000   12      27/800
1       1/500     5       21/1000    9       9/1000    13      3/800
2       7/125     6       9/250      10      126/500   14      9/320
3       3/125     7       27/500     11      27/250    15      27/320

[Fig. 1 additionally depicts the context trees of the four component bits, with each leaf labeled by a (context, Bernoulli parameter) pair — 1st bit: (e, 0.6); 2nd bit: (0, 0.75), (1, 0.25); 3rd bit: (0, 0.8), (01, 0.3), (11, 0.75); 4th bit: (0, 0.1), (01, 0.3), (011, 0.6), (111, 0.75) — together with a log-log plot of the average $L_1$-errors $\mathbb{E}[\|\hat{P}_{\mathrm{CTW}} - P\|]$ and $\mathbb{E}[\|\hat{P}_{\#} - P\|]$ against the number of samples $n$ from $10^0$ to $10^5$, both curves exhibiting $O(n^{-1/2})$ decay. Caption: "Fig. 1. Performance of the binarized CTW algorithm for a tree source."]

While no specific structure is apparent from the table, the example corresponds to a binary tree source with each source symbol comprising four bits, the first of which is an i.i.d. Bernoulli random variable with parameter 0.6. For $i = 2, 3, 4$, the $i$th bit potentially depends on the realization of the first $i-1$ bits, and the conditional distribution of the $i$th bit is a Bernoulli distribution that depends on its context; e.g., the third bit is a Bernoulli random variable with parameter 0.75 if the first two bits are both ones. The naïve CTW over the hexadecimal alphabet looks for no structure within each symbol, and hence must learn 15 parameters, whereas the dominant trees for each of the four bits in the binarized factored CTW together have to learn only 10 parameters, which yields improvements in terms of redundancy.
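To make the factored structure concrete, the following sketch (ours, not from the paper) reconstructs the hexadecimal distribution as the product of the per-bit conditionals read off the context trees above. The bit ordering (first bit most significant) and the suffix-based context lookup are conventions inferred from the table, so treat them as assumptions.

```python
from fractions import Fraction
from itertools import product

# Context trees of the four component bits (Fig. 1). Each leaf maps a
# context -- a suffix of the previously generated bits, oldest bit first --
# to theta = P(bit = 1 | context). In total 1 + 2 + 3 + 4 = 10 parameters,
# versus the 15 free parameters of an unstructured hexadecimal distribution.
TREES = [
    {"": Fraction(3, 5)},                                                # 1st bit (i.i.d.)
    {"0": Fraction(3, 4), "1": Fraction(1, 4)},                          # 2nd bit
    {"0": Fraction(4, 5), "01": Fraction(3, 10), "11": Fraction(3, 4)},  # 3rd bit
    {"0": Fraction(1, 10), "01": Fraction(3, 10),
     "011": Fraction(3, 5), "111": Fraction(3, 4)},                      # 4th bit
]

def theta(tree, history):
    """P(bit = 1 | history): exactly one leaf is a suffix of the history."""
    for ctx, p in tree.items():
        if history.endswith(ctx):
            return p
    raise KeyError(history)

# The stationary probability of a symbol is the product of the per-bit
# conditional probabilities along its four bits.
dist = {}
for bits in product("01", repeat=4):
    p = Fraction(1)
    for i, b in enumerate(bits):
        t = theta(TREES[i], "".join(bits[:i]))
        p *= t if b == "1" else 1 - t
    dist[int("".join(bits), 2)] = p    # first bit = most significant

assert sum(dist.values()) == 1
print(sorted(dist.items()))  # e.g., symbol 0 -> 9/500, symbol 15 -> 27/320
```

Under these conventions the ten parameters reproduce every entry of the table, which is precisely the parameter saving the example describes.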
In general, for a tree source with alphabet $\mathcal{A}$ and with $C$ contexts, the asymptotically dominant term for redundancy can reduce from $\frac{C(|\mathcal{A}|-1)}{2}\log_2\frac{n}{C(|\mathcal{A}|-1)}$ to $\sum_{i=1}^{\lceil \log_2 |\mathcal{A}| \rceil} \frac{C_i}{2}\log_2\frac{n}{C_i}$, where $C_i \le 2^{i-1}C$ is the number of contexts in the tree of the $i$th bit in the binary decomposition. The scaling factor of $\log_2 n$ can thus potentially reduce from $\frac{C(|\mathcal{A}|-1)}{2}$ to $\frac{1}{2}\sum_{i=1}^{\lceil \log_2 |\mathcal{A}| \rceil} C_i$, which can be significant when compressing sources with large alphabets.
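As a quick numerical check of these two expressions (a sketch based on the formulas above, not code from the paper), one can compare the dominant redundancy terms for the hexadecimal example, where the naïve CTW has $C = 1$ context over the 16-ary alphabet while the per-bit trees have $C_1, \ldots, C_4 = 1, 2, 3, 4$ contexts:

```python
import math

def naive_term(n, C, A):
    """Dominant redundancy term of CTW over an alphabet of size A with C
    contexts: C*(A-1)/2 * log2(n / (C*(A-1)))."""
    m = C * (A - 1)
    return m / 2 * math.log2(n / m)

def binarized_term(n, Cs):
    """Dominant redundancy term of the binarized factored CTW, where Cs[i]
    is the number of contexts in the tree of the (i+1)-th bit."""
    return sum(c / 2 * math.log2(n / c) for c in Cs)

# Hexadecimal example of Fig. 1: the log2(n) coefficient drops from
# 15/2 (naive) to (1 + 2 + 3 + 4)/2 = 5 (binarized).
for n in (10**3, 10**4, 10**5):
    print(n, round(naive_term(n, C=1, A=16), 1),
          round(binarized_term(n, [1, 2, 3, 4]), 1))
```

The gap between the two terms grows like $\tfrac{5}{2}\log_2 n$, which is the redundancy improvement the example alludes to.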
As an illustration, again consider the i.i.d. source $P$ of Fig. 1. Let $\hat{P}_{\#}$ denote the frequency estimate of $P$, and let $\hat{P}_{\mathrm{CTW}}$ denote the binarized factored CTW estimate (for the complete definition, see Sec. III). Fig. 1 presents the variation of the mean $L_1$-prediction error with the number of samples $n$; the following observations can be made.
• CTW is a better estimator than a naïve frequency (ML) estimator when the size $n$ of the data is small or sufficiently large ($n \lesssim 30$ or $n \gtrsim 450$ in the above example). The improved prediction error for small data sizes can be attributed to the Bayesian averaging inherited by the binarized CTW algorithm from the CTW algorithm, whereas the frequency estimator does not have enough samples to predict accurately. When the data size is large, the contribution of the tree model that best describes the data dominates the CTW algorithm. The CTW estimator therefore offers a scaling improvement depending on the number of parameters needed to describe the ‘best’ tree model, which is, in general, fewer than that required to naïvely describe the distribution completely.
• On average, the CTW and frequency estimators offer similar asymptotic convergence rates, which in this example (and for discrete memoryless sources) decay as $\frac{1}{\sqrt{n}}$.

Note that while there is a potential benefit in terms of redundancy, the computational cost of implementing the naïve estimator must also be weighed against that of the binarized CTW.

Binarization is also relevant to reinforcement learning: [12] applies a binarized factored CTW to learn a hidden Markov model whose state corresponds to the concatenation of the state, action and return in the reinforcement learning problem. The following summarizes the main result of this approach.
• A consistent estimator of the stationary distribution of the state of the derived hidden Markov model, whose estimation error converges stochastically to zero suitably fast, results in a consistent estimator of the state-action value function.

Simulations in [12] revealed two benefits from binarizing large state spaces and using a binarized factored CTW:
• excellent convergence rate in estimating the stationary distribution of the underlying process;
• the ability to effectively handle considerably larger state spaces than the frequency estimator.

In this work, we assume that the source symbols take values from a vector binary alphabet (i.e., $\{0,1\}^k$ for some $k \in \mathbb{N}$), and derive some fundamental properties of the binarized CTW estimate of the source distribution, thereby providing a theoretical reasoning as to why the binarized CTW estimate performs well in simulations [12]. We establish the following.
• The worst-case $L_1$-prediction error between the binarized CTW and frequency (ML) estimates for the stationary distribution of a stationary ergodic source over $\{0,1\}^k$ for some $k > 1$ is $\Theta\big(\sqrt{\tfrac{2^k \log n}{n}}\big)$;
• the binarized CTW method yields a consistent estimator of the stationary distribution of stationary ergodic sources; and
• any (Bayesian) dependency structure that exists in the binary representation of source symbols can be naturally incorporated in the binarized CTW algorithm.

The rest of the paper is organized as follows. Section II presents the definitions and notation used, and Section III defines the binarized CTW algorithm. Section IV presents relevant background results regarding the contribution of various context trees in the CTW estimate, and finally, Section V presents the main results pertaining to binarized CTW. For want of space, all proofs are relegated to the full version [13].

II. NOTATION AND DEFINITIONS

Given a finite alphabet (set) $\mathcal{A}$ and $k \in \mathbb{N}$, let $\mathcal{A}^{\le k} := \mathcal{A} \cup \mathcal{A}^2 \cup \cdots \cup \mathcal{A}^k \cup \mathcal{A}^0$, where $\mathcal{A}^0 := \{e\}$ and $e$ is the empty string. For a string $a \in \mathcal{A}^{\le k}$, $|a|$ denotes its length.
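As a concrete reading of this notation (an illustrative snippet of ours, not part of the paper), the following enumerates $\mathcal{A}^{\le k}$ for the binary alphabet and evaluates the length function:

```python
from itertools import product

def strings_up_to(alphabet, k):
    """A^{<=k}: all strings over the alphabet of lengths 0 through k;
    the single length-0 string is the empty string e."""
    return ["".join(s) for i in range(k + 1)
            for s in product(alphabet, repeat=i)]

print(strings_up_to("01", 2))   # ['', '0', '1', '00', '01', '10', '11']
a = "011"
print(len(a))                   # |a| = 3, the length of the string a
```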