An Efficient Bayes Coding Algorithm for the Non-Stationary Source in Which Context Tree Model Varies from Interval to Interval

Koshi Shimada
Department of Pure and Applied Mathematics
Waseda University
Tokyo, Japan
[email protected]

Shota Saito
Faculty of Informatics
Gunma University
Gunma, Japan
[email protected]

Toshiyasu Matsushima
Department of Pure and Applied Mathematics
Waseda University
Tokyo, Japan
[email protected]

Abstract—The context tree source is a source model in which the occurrence probability of a symbol is determined by a finite past sequence; it is a broad class of sources that includes i.i.d. and Markov sources. The source model proposed in this paper represents a sequence whose subsequence in each interval is generated from a different context tree model. The Bayes code for such a source requires weighting of the posterior probability distributions over the change patterns of the context tree source and over all possible context tree models. The challenge is therefore how to reduce this exponential-order computational complexity. In this paper, we assume a special class of prior probability distributions on change patterns and context tree models, and propose an efficient Bayes coding algorithm whose computational complexity is of polynomial order.

I. INTRODUCTION

Arithmetic codes asymptotically achieve the minimum expected codeword length for lossless source coding. The problem with this method is that it cannot be used unless the probabilistic structure of the source is known in advance. Therefore, universal codes, which can be used when the probability distribution of the source is unknown, have been studied.

The context tree source is one of the major source models for universal coding, and CTW (Context Tree Weighting) [1] is known as an efficient universal code for context tree sources. The CTW method can be interpreted as a special case of the Bayes code proposed by Matsushima and Hirasawa [2]. The CTW method encodes the entire source sequence at once, which is not an efficient use of memory and causes an underflow problem in the calculation, whereas the Bayes code of Matsushima and Hirasawa [2] can encode sequentially and is free from these problems. It is known that the Bayes code has the same codeword length whether it encodes sequentially or encodes the entire sequence at once [3].

However, in the Bayes code, the computational complexity of weighting by the posterior probability of the context tree models increases exponentially with the maximum depth of the context tree models. Matsushima and Hirasawa [2] developed an efficient Bayes coding algorithm by assuming an appropriate class of prior probability distributions on the context tree models, which reduces the computational complexity from exponential order to polynomial order.

Now, a source model for source coding should not only admit a concise mathematical description but should also reflect well the probabilistic structure of the real data sequences to be compressed. For example, the context tree source includes the i.i.d. source and the Markov source as special cases. It is a broad class of sources and has been applied to text data, for example.

On the other hand, there are cases where it is appropriate to regard the symbols as being generated according to a different context tree source in each interval, rather than to model the entire data sequence as being generated according to a single context tree source. For example, the human genome consists of about 3 billion base pairs, and the DNA sequence is described as a pair of sequences of about 3 billion symbols over a four-letter alphabet: A, G, T, and C. Although a Markov source is sometimes assumed in DNA sequence compression algorithms [6], it is known that there are genetic and non-genetic regions in the human genome, which have different structural characteristics. Therefore, in this paper, we present a non-stationary source in which the context tree source changes from interval to interval.

An example of a non-stationary source in which the source changes from interval to interval is the i.p.i.d. (independently piecewise identically distributed) source [4], [5]. An i.p.i.d. source consists of i.i.d. sequences whose parameters differ from interval to interval. It can be regarded as a special case of the source proposed in this paper. An efficient Bayes code for i.p.i.d. sources has already been proposed by Suko et al. [8].

Assuming a source model in which the symbols are generated by different context tree sources in each interval, we present an efficient Bayes coding algorithm for it. In this algorithm, we use the prior probability of context tree models by Matsushima and Hirasawa [2] and that of parameter change patterns by Suko et al. [8]. The proposed algorithm reduces the computational complexity from exponential order to polynomial order.

II. NON-STATIONARY SOURCE WHOSE CONTEXT TREE MODEL CHANGES FROM INTERVAL TO INTERVAL

In this section, we present a non-stationary source whose context tree model changes from interval to interval. The symbols are generated from different context tree models depending on the interval, as shown in Figure 1.

Fig. 1. Diagram of a non-stationary source with a context tree model that changes from interval to interval.

Now we define the change pattern of the context tree models as follows.

Definition 1. The change pattern $c$ is defined to indicate when the context tree model has changed. That is,
$$c \stackrel{\mathrm{def}}{=} \left(w_1^{(c)}, \ldots, w_t^{(c)}, \ldots, w_N^{(c)}\right) \in \mathcal{C}_N \stackrel{\mathrm{def}}{=} \{0,1\}^N, \quad (1)$$
$$w_t^{(c)} \stackrel{\mathrm{def}}{=} \begin{cases} 1 & \text{if the context tree model changes at time } t,\\ 0 & \text{otherwise.} \end{cases}$$

Now, for convenience, let $w_1^{(c)} = 1$. The length $N$ of a source sequence is fixed. The set of all change patterns $\mathcal{C}_N$ is abbreviated as $\mathcal{C}$ from now on. Next, we define the set of time points at which the context tree model changes.

Definition 2. Let $\mathcal{T}_c$ denote the set of time points at which the parameter changes in the change pattern $c$. That is,
$$\mathcal{T}_c \stackrel{\mathrm{def}}{=} \left\{\, t \,\middle|\, w_t^{(c)} = 1 \,\right\} = \left\{ t_0^{(c)}, t_1^{(c)}, \ldots, t_{|\mathcal{T}_c|-1}^{(c)} \right\}, \quad (2)$$
where $t_j^{(c)}$ is the $j$-th changing point in the change pattern $c$. In other words, there are $|\mathcal{T}_c| - 1$ parameter changes in $c$. For convenience, let $t_0^{(c)} = 1$ and $t_{|\mathcal{T}_c|}^{(c)} = N + 1$. If the change pattern $c$ is specified in advance, $t_j^{(c)}$ is abbreviated as $t_j$.

From the $j$-th changing point $t_j$ to the $(j+1)$-th changing point $t_{j+1}$, the symbols are generated according to a single context tree model $m_{t_j}^{(c)}$. The parameter $\theta^{m_{t_j}^{(c)}}$ for this $m_{t_j}^{(c)}$ is defined as in Definition 3, and an example is shown in Figure 2.

Definition 3. In the change pattern $c$, for the context tree model $m_{t_j}^{(c)}$ in the interval $[t_j, t_{j+1})$, we denote the set of its leaf nodes by $\mathcal{L}_{m_{t_j}^{(c)}}$. The parameter $\theta^{m_{t_j}^{(c)}}$ for $m_{t_j}^{(c)}$ is defined as follows:
$$\theta^{m_{t_j}^{(c)}} \stackrel{\mathrm{def}}{=} \left\{\, \theta_s \in (0,1)^{|\mathcal{X}|} \,\middle|\, s \in \mathcal{L}_{m_{t_j}^{(c)}} \,\right\}, \quad (3)$$
where
$$\theta_s \stackrel{\mathrm{def}}{=} \left( \theta_{0|s}, \theta_{1|s}, \ldots, \theta_{|\mathcal{X}|-1|s} \right)^{\mathrm{T}}, \quad (4)$$
$$\sum_{a \in \mathcal{X}} \theta_{a|s} = 1, \quad \theta_{a|s} \in (0,1) \ \text{for each symbol } a. \quad (5)$$
Note that $\theta_{a|s}$ is the occurrence probability of $a \in \mathcal{X}$ under the state corresponding to the node $s$, where $\mathcal{X}$ denotes the source alphabet.

Fig. 2. Example of occurrence probabilities based on a context tree model.

In the case where the change pattern $c$ is specified in advance, $m_{t_j}^{(c)}$ is abbreviated as $m_{t_j}$. Furthermore, the parameters for the change pattern $c$ are defined as follows.

Definition 4. The parameter $\Theta^c$ for the change pattern $c$ is defined as follows:
$$\Theta^c \stackrel{\mathrm{def}}{=} \left\{\, \theta^{m_t^{(c)}} \,\middle|\, t \in \mathcal{T}_c \,\right\} = \left\{ \theta^{m_{t_0}^{(c)}}, \theta^{m_{t_1}^{(c)}}, \ldots, \theta^{m_{t_{|\mathcal{T}_c|-1}}^{(c)}} \right\}. \quad (6)$$

From Definition 3, using the state $s_m(x^{t-1})$ corresponding to the source sequence (past context) $x^{t-1}$ in a certain context tree model $m$, the occurrence probability at time $t \in [t_j, t_{j+1})$ is expressed as follows:
$$p\left(x_t \,\middle|\, x^{t-1}, \theta^{m_{t_j}^{(c)}}, m_{t_j}^{(c)}, c\right) = p\left(x_t \,\middle|\, x^{t-1}, \theta^{m_{t_j}}, m_{t_j}\right) = \theta_{x_t \mid s_{m_{t_j}}(x^{t-1})}. \quad (7)$$

Therefore, the probability distribution of $X^N = X_1 \cdots X_N$ under the change pattern $c \in \mathcal{C}$ is expressed by the following equation:
$$\begin{aligned}
p\left(x^N \,\middle|\, \Theta^c, c\right) &= \prod_{j=0}^{|\mathcal{T}_c|-1} p\left(x_{t_j}^{t_{j+1}-1} \,\middle|\, \theta^{m_{t_j}^{(c)}}, m_{t_j}\right) = \prod_{j=0}^{|\mathcal{T}_c|-1} \prod_{t=t_j}^{t_{j+1}-1} p\left(x_t \,\middle|\, x^{t-1}, \theta^{m_{t_j}^{(c)}}, m_{t_j}\right) \\
&= \prod_{j=0}^{|\mathcal{T}_c|-1} \prod_{t=t_j}^{t_{j+1}-1} \theta_{x_t \mid s_{m_{t_j}}(x^{t-1})}. \quad (8)
\end{aligned}$$

Regarding a change pattern $c$, a context tree model $m_{t_j}^{(c)}$, and the parameter $\theta^{m_{t_j}^{(c)}}$ for $m_{t_j}^{(c)}$, we assume prior probability distributions $\pi(c)$, $P(m_{t_j}^{(c)} \mid c)$, and $w(\theta^{m_{t_j}^{(c)}} \mid m_{t_j}^{(c)}, c)$, respectively.
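To make the source model of this section concrete, the following is a minimal generation sketch, assuming a binary alphabet and two hypothetical context tree models represented as dictionaries from leaf contexts (read newest symbol first) to occurrence probabilities. The representation and the concrete models are illustrative assumptions, not taken from the paper.

```python
import random

# A context tree model is represented here as a dict that maps a leaf context
# (the most recent past symbols, newest first) to the probability that the
# next symbol is 1.  Both the representation and the two models below are
# illustrative assumptions.
MODEL_A = {"0": 0.1, "1": 0.8}               # depth-1 context tree
MODEL_B = {"00": 0.5, "01": 0.2, "1": 0.9}   # depth 2 on the 0-branch

def theta_one(model, past):
    """Descend the context tree along the past symbols until a leaf is hit."""
    context = ""
    for symbol in reversed(past):
        context += symbol
        if context in model:
            return model[context]
    return 0.5  # fallback while the past is still shorter than every leaf context

def generate(segments, n, seed=0):
    """segments: list of (change point t_j, model m_{t_j}); the first change point is 1."""
    rng = random.Random(seed)
    x = []
    for t in range(1, n + 1):
        # the model in force at time t is the one with the largest change point <= t
        _, model = max((s for s in segments if s[0] <= t), key=lambda s: s[0])
        x.append("1" if rng.random() < theta_one(model, x) else "0")
    return "".join(x)

# Two intervals, as in Definitions 1 and 2: MODEL_A on [1, 150), MODEL_B on [150, 301).
print(generate([(1, MODEL_A), (150, MODEL_B)], 300))
```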
III. BAYES CODE FOR THE PROPOSED SOURCE

In this section, we present the coding probability of the Bayes code for the proposed source.

Theorem 1. The coding probability of the sequential Bayes code for the proposed source is
$$\mathrm{AP}^{*}(x_t \mid x^{t-1}) = \sum_{c \in \mathcal{C}} \pi(c \mid x^{t-1}) \sum_{m^{(c)} \in \mathcal{M}} P(m^{(c)} \mid x^{t-1}, c) \int p\left(x_t \,\middle|\, x^{t-1}, \theta^{m^{(c)}}, m^{(c)}, c\right) w\left(\theta^{m^{(c)}} \,\middle|\, x^{t-1}, m^{(c)}, c\right) \mathrm{d}\theta^{m^{(c)}}, \quad (9)$$
where $\pi(c \mid x^{t-1})$ is the posterior probability distribution of the change pattern $c$, $P(m^{(c)} \mid x^{t-1}, c)$ is that of the context tree model $m^{(c)}$, $\mathcal{M}$ is the set of all context tree models (see Example 1; to be more precise, the size of $\mathcal{M}$, i.e., the total number of context tree models, depends on the maximum depth $d$ of the context tree, but in this paper $d$ is fixed), and $w(\theta^{m^{(c)}} \mid x^{t-1}, m^{(c)}, c)$ is the posterior probability distribution of the parameter $\theta^{m^{(c)}}$.

Proof. The proof outline of Theorem 1 is the same as that of Theorem 2 in [3]. Now, let $\mathrm{AP}_p(x_t \mid x^{t-1})$ be an arbitrary sequential coding probability. Taking the logarithmic loss of the coding probability, the loss function is as follows:
$$V\left(\mathrm{AP}_p, x^N, \theta^{m^{(c)}}, m^{(c)}, c\right) = \log p\left(x^N \,\middle|\, \theta^{m^{(c)}}, m^{(c)}, c\right) - \log \prod_{t=1}^{N} \mathrm{AP}_p(x_t \mid x^{t-1}). \quad (10)$$

We take the expectation with respect to the probability distribution of the source sequences and obtain the risk function as follows:
$$R\left(\mathrm{AP}_p, \theta^{m^{(c)}}, m^{(c)}, c\right) = \sum_{x^N} p\left(x^N \,\middle|\, \theta^{m^{(c)}}, m^{(c)}, c\right) \log \frac{p\left(x^N \,\middle|\, \theta^{m^{(c)}}, m^{(c)}, c\right)}{\prod_{t=1}^{N} \mathrm{AP}_p(x_t \mid x^{t-1})}. \quad (11)$$

The Bayes risk is then obtained by taking the expectation with respect to the probability distribution of each parameter:
$$\mathrm{BR}(\mathrm{AP}_p) = \sum_{c \in \mathcal{C}} \pi(c) \sum_{m^{(c)} \in \mathcal{M}} P(m^{(c)} \mid c) \int R\left(\mathrm{AP}_p, \theta^{m^{(c)}}, m^{(c)}, c\right) w(\theta^{m^{(c)}} \mid m^{(c)}, c)\, \mathrm{d}\theta^{m^{(c)}}. \quad (12)$$

On the other hand, we have
$$p(x^N) = \sum_{c \in \mathcal{C}} \sum_{m^{(c)} \in \mathcal{M}} \int p(\theta^{m^{(c)}}, m^{(c)}, c)\, p(x^N \mid \theta^{m^{(c)}}, m^{(c)}, c)\, \mathrm{d}\theta^{m^{(c)}} = \prod_{t=1}^{N} \frac{\sum_{c}\sum_{m^{(c)}} \int p(\theta^{m^{(c)}}, m^{(c)}, c)\, p(x^{t} \mid \theta^{m^{(c)}}, m^{(c)}, c)\, \mathrm{d}\theta^{m^{(c)}}}{\sum_{c}\sum_{m^{(c)}} \int p(\theta^{m^{(c)}}, m^{(c)}, c)\, p(x^{t-1} \mid \theta^{m^{(c)}}, m^{(c)}, c)\, \mathrm{d}\theta^{m^{(c)}}}. \quad (13)$$

Hence, $\mathrm{AP}^{*}(x_t \mid x^{t-1})$ given as follows minimizes the Bayes risk:
$$\begin{aligned}
\mathrm{AP}^{*}(x_t \mid x^{t-1}) &= \frac{\sum_{c}\sum_{m^{(c)}} \int p(\theta^{m^{(c)}}, m^{(c)}, c)\, p(x^{t} \mid \theta^{m^{(c)}}, m^{(c)}, c)\, \mathrm{d}\theta^{m^{(c)}}}{\sum_{c}\sum_{m^{(c)}} \int p(\theta^{m^{(c)}}, m^{(c)}, c)\, p(x^{t-1} \mid \theta^{m^{(c)}}, m^{(c)}, c)\, \mathrm{d}\theta^{m^{(c)}}} \\
&= \sum_{c \in \mathcal{C}} \pi(c \mid x^{t-1}) \sum_{m^{(c)} \in \mathcal{M}} P(m^{(c)} \mid x^{t-1}, c) \int p\left(x_t \,\middle|\, x^{t-1}, \theta^{m^{(c)}}, m^{(c)}, c\right) w\left(\theta^{m^{(c)}} \,\middle|\, x^{t-1}, m^{(c)}, c\right) \mathrm{d}\theta^{m^{(c)}}. \quad (14)
\end{aligned}$$

Example 1. Consider the case where $\mathcal{X} = \{0, 1\}$ and the depth of a context tree model is at most two. Then there are five context tree models, as shown in Figure 3, and $\mathcal{M} = \{m_1, m_2, m_3, m_4, m_5\}$.

Fig. 3. Overall view of the context tree models of a 0-1 sequence with maximum depth d = 2.
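The size of $\mathcal{M}$ grows rapidly with the maximum depth $d$, which is why the weighting over all models in (9) is expensive. The following small counting sketch (my own, not from the paper) uses the recursion that a model is either a root leaf or an internal root whose $|\mathcal{X}|$ subtrees are again context tree models of maximum depth $d-1$; for $d = 2$ it reproduces the five models of Example 1.

```python
def count_context_trees(max_depth, alphabet_size=2):
    """Number of context tree models with maximum depth `max_depth`: a model is
    either the root alone (a leaf) or an internal root whose `alphabet_size`
    subtrees are themselves context tree models of maximum depth `max_depth - 1`."""
    if max_depth == 0:
        return 1
    return 1 + count_context_trees(max_depth - 1, alphabet_size) ** alphabet_size

print(count_context_trees(2))  # 5  -> the models m_1, ..., m_5 of Example 1
print(count_context_trees(3))  # 26 -> |M| grows doubly exponentially in d
```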

IV. EFFICIENT ALGORITHM FOR CALCULATING THE CODING PROBABILITY OF THE BAYES CODE

The coding probability of the Bayes code shown in Theorem 1 is given by weighting the Bayes optimal coding probability for each change pattern $c$ by the posterior probability distribution of the change pattern $\pi(c \mid x^{t-1})$. That is, (9) is expressed as follows:
$$\mathrm{AP}^{*}(x_t \mid x^{t-1}) = \sum_{c \in \mathcal{C}} \pi(c \mid x^{t-1})\, \mathrm{AP}_c(x_t \mid x^{t-1}, c), \quad (15)$$
where
$$\mathrm{AP}_c(x_t \mid x^{t-1}, c) = \sum_{m^{(c)} \in \mathcal{M}} P\left(m^{(c)} \,\middle|\, x^{t-1}, c\right) \int p\left(x_t \,\middle|\, x^{t-1}, \theta^{m^{(c)}}, m^{(c)}, c\right) w\left(\theta^{m^{(c)}} \,\middle|\, x^{t-1}, m^{(c)}, c\right) \mathrm{d}\theta^{m^{(c)}}. \quad (16)$$

For this $\mathrm{AP}_c(x_t \mid x^{t-1}, c)$, Matsushima and Hirasawa [2] have already shown an algorithm that calculates it analytically while reducing the amount of computation. This algorithm is explained in the next subsection.

A. Efficient Bayes Coding Algorithm for a Fixed Change Pattern

First, the prior probability distribution for the context tree model is assumed to be as follows.

Assumption 1. For each context tree model $m \in \mathcal{M}$ with the set of leaf nodes $\mathcal{L}_m$, let $\mathcal{I}_m$ be the set of its internal nodes. Assume that each node $s$ has a hyper-parameter $g_s \in [0, 1]$ and that the prior distribution of each model $m$ is
$$P(m) = \prod_{\bar{s} \in \mathcal{I}_m} g_{\bar{s}} \prod_{s \in \mathcal{L}_m} (1 - g_s), \quad (17)$$
where $g_s = 0$ for a leaf node $s$ at the maximum depth of a context tree model. An example is shown in Figure 4.

Fig. 4. Example of a context tree model. The $\lambda$ represents the root node, and the internal nodes are $s_\lambda$ and $s_1$. The leaf nodes are $s_0$, $s_{01}$, and $s_{11}$, but the hyper-parameters of $s_{01}$ and $s_{11}$ are 0 because they exist at the maximum depth.

Remark 1. $P(m)$ is a probability distribution, i.e., (17) satisfies
$$\sum_{m \in \mathcal{M}} P(m) = 1. \quad (18)$$
The proof of this fact is given by Nakahara and Matsushima [7].

For example, the prior probability of the model in Figure 4 (corresponding to model $m_4$ in Figure 3) is
$$P(m_4) = g_{s_\lambda} (1 - g_{s_0})\, g_{s_1} (1 - \underbrace{g_{s_{01}}}_{0})(1 - \underbrace{g_{s_{11}}}_{0}) = g_{s_\lambda} (1 - g_{s_0})\, g_{s_1}, \quad (19)$$
where $s_\lambda$ represents the root node.
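As a quick numerical check of (17) and (19), the following minimal sketch evaluates $P(m)$ for a model given by its internal and leaf nodes. The node names follow Figure 4, and the value $g_s = 0.5$ for the nodes below the maximum depth is an illustrative assumption.

```python
# Hyper-parameters g_s; g = 0 at the maximum depth, as required by Assumption 1.
g = {"lambda": 0.5, "0": 0.5, "1": 0.5, "01": 0.0, "11": 0.0}

def model_prior(internal_nodes, leaf_nodes, g):
    """P(m) = product of g_s over internal nodes times product of (1 - g_s) over leaves, eq. (17)."""
    p = 1.0
    for s in internal_nodes:
        p *= g[s]
    for s in leaf_nodes:
        p *= 1.0 - g[s]
    return p

# Model m_4 of Figure 4: internal nodes s_lambda and s_1; leaves s_0, s_01, s_11.
print(model_prior(["lambda", "1"], ["0", "01", "11"], g))  # 0.5 * 0.5 * 0.5 = 0.125
```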

Second, we assume the prior probability distribution of the parameter $\theta^m$ for each context tree model $m \in \mathcal{M}$ as follows.

Assumption 2. For each leaf node $s \in \mathcal{L}_m$, we assume that the prior probability distribution $w(\theta_s)$ of its parameter is a Dirichlet distribution
$$w(\theta_s) = \frac{\Gamma\left(\sum_{i=0}^{|\mathcal{X}|-1} \beta(i \mid s)\right)}{\prod_{i=0}^{|\mathcal{X}|-1} \Gamma(\beta(i \mid s))} \prod_{i=0}^{|\mathcal{X}|-1} \theta_{i|s}^{\beta(i|s)-1}, \quad (20)$$
where $\Gamma(\cdot)$ is the Gamma function, $\mathcal{X}$ denotes the source alphabet, and $\beta(i \mid s)$ denotes a parameter of the Dirichlet distribution. In addition, the prior probability distribution of $\theta^m$ is assumed to be the product of all $w(\theta_s)$. That is,
$$w(\theta^m \mid m) = \prod_{s \in \mathcal{L}_m} w(\theta_s). \quad (21)$$

Then, Matsushima and Hirasawa [2] recursively compute the Bayes coding probability for context tree sources as follows.

First, let $\mathcal{L}$ denote the set of all leaf nodes in the superposed context tree. The term "superposed context tree" refers to the tree structure that represents the superposition of all possible context tree models. For example, a superposed context tree for the context tree models shown in Figure 3 is as shown in Figure 5.

Fig. 5. A superposed context tree for the entire set of context tree models in Figure 3.
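The following is a minimal sketch of one node of such a superposed context tree for a binary alphabet, holding the quantities used in the recursion below: the hyper-parameter $g_s$ of Assumption 1, the Dirichlet hyper-parameters $\beta(i \mid s)$ of Assumption 2, and the counts $N(i \mid x_{\tau_t} \cdots x_{t-1}, s)$. The concrete data layout is my assumption, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Node:
    """One node s of the superposed context tree (binary alphabet)."""
    depth: int
    max_depth: int
    g: float = 0.5                                                              # g_s of Assumption 1
    beta: Dict[int, float] = field(default_factory=lambda: {0: 0.5, 1: 0.5})    # beta(i|s) of Assumption 2
    counts: Dict[int, int] = field(default_factory=lambda: {0: 0, 1: 0})        # N(i|., s)
    children: Dict[int, "Node"] = field(default_factory=dict)

    def __post_init__(self):
        if self.depth == self.max_depth:
            self.g = 0.0                             # g_s = 0 at the maximum depth
        else:
            self.children = {a: Node(self.depth + 1, self.max_depth) for a in (0, 1)}

# The root s_lambda of a superposed context tree of maximum depth d = 2 (Figure 5).
root = Node(depth=0, max_depth=2)
```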

Now, we denote by $\tau_t$ the last change point of the context tree model at time $t$. If $\tau_t$ is the $j$-th changing point in the sequence $x^{t-1}$, then $x_t$ is generated according to the context tree model $m_{t_j}$. In other words, the context tree models before $\tau_t$ are irrelevant to $x_t$. Next, a recursive function that calculates the coding probability is defined as follows.

Definition 5.
$$\tilde{q}_s(x_t \mid x_{\tau_t} \cdots x_{t-1}) \stackrel{\mathrm{def}}{=} \begin{cases} q_s(x_t \mid x_{\tau_t} \cdots x_{t-1}) & \text{if } s \in \mathcal{L},\\ (1 - g_{s \mid x_{\tau_t} \cdots x_{t-1}})\, q_s(x_t \mid x_{\tau_t} \cdots x_{t-1}) + g_{s \mid x_{\tau_t} \cdots x_{t-1}}\, \tilde{q}_{s_{\mathrm{child}}}(x_t \mid x_{\tau_t} \cdots x_{t-1}) & \text{otherwise,} \end{cases} \quad (22)$$
where
$$q_s(x_t \mid x_{\tau_t} \cdots x_{t-1}) \stackrel{\mathrm{def}}{=} \frac{\beta(x_t \mid s) + N(x_t \mid x_{\tau_t} \cdots x_{t-1}, s)}{\sum_{i=0}^{|\mathcal{X}|-1} \left\{ \beta(i \mid s) + N(i \mid x_{\tau_t} \cdots x_{t-1}, s) \right\}}. \quad (23)$$
Note that $N(a \mid x_{\tau_t} \cdots x_{t-1}, s)$ denotes the number of occurrences of the symbol $a \in \mathcal{X}$ under the state $s$ in the subsequence $x_{\tau_t} \cdots x_{t-1}$, and $s_{\mathrm{child}}$ is the child node of $s$ in the superposed context tree along the context $x_{\tau_t} \cdots x_{t-1}$. The posterior hyper-parameter $g_{s \mid x_{\tau_t} \cdots x_{t-1}}$ is updated as follows:
$$g_{s \mid x_{\tau_t} \cdots x_{t}} \stackrel{\mathrm{def}}{=} \begin{cases} g_s & \text{if } t = 0,\\ g_{s \mid x_{\tau_t} \cdots x_{t-1}} \dfrac{\tilde{q}_{s_{\mathrm{child}}}(x_t \mid x_{\tau_t} \cdots x_{t-1})}{\tilde{q}_s(x_t \mid x_{\tau_t} \cdots x_{t-1})} & \text{otherwise.} \end{cases} \quad (24)$$

In this case, the Bayes coding probability $\mathrm{AP}_c(x_t \mid x^{t-1}, c)$ for the context tree source can be calculated as follows.

Theorem 2 (Matsushima and Hirasawa [2]).
$$\mathrm{AP}_c(x_t \mid x^{t-1}, c) = \tilde{q}_{s_\lambda}(x_t \mid x_{\tau_t} \cdots x_{t-1}). \quad (25)$$
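The following is a minimal sketch of how (22)-(25) can be evaluated along the context path of the superposed tree, reusing the Node structure sketched above. It assumes that each node's counts cover only the current segment $x_{\tau_t} \cdots x_{t-1}$; the handling of contexts shorter than the maximum depth and the moment at which the counts are refreshed are my assumptions, not details given in the paper.

```python
def q_s(node, symbol):
    """Sequential predictive probability of eq. (23) at node s."""
    num = node.beta[symbol] + node.counts[symbol]
    den = sum(node.beta[a] + node.counts[a] for a in (0, 1))
    return num / den

def q_tilde(node, path, symbol):
    """Weighted probability of eq. (22).  `path` holds the remaining past
    symbols (newest first) and selects the child node s_child."""
    if not node.children or not path:        # leaf of the superposed tree (or short context)
        return q_s(node, symbol)
    child = node.children[path[0]]
    return (1.0 - node.g) * q_s(node, symbol) + node.g * q_tilde(child, path[1:], symbol)

def update(node, path, symbol):
    """After coding x_t = symbol: update g_s by eq. (24) and the counts N(.|., s)
    along the context path.  All ratios use the values before x_t is counted."""
    if node.children and path:
        child = node.children[path[0]]
        node.g = node.g * q_tilde(child, path[1:], symbol) / q_tilde(node, path, symbol)
        update(child, path[1:], symbol)
    node.counts[symbol] += 1

# AP_c(x_t | x^{t-1}, c) of Theorem 2 is q_tilde evaluated at the root s_lambda with
# the context x_{t-1}, x_{t-2}, ..., x_{tau_t} (newest first), e.g.:
#   ap_c = q_tilde(root, [x_prev1, x_prev2], x_t); update(root, [x_prev1, x_prev2], x_t)
```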

B. Efficient Bayes Coding Algorithm for the Proposed Source

In the previous subsection, we described that the algorithm by Matsushima and Hirasawa [2] reduces the computational complexity of calculating $\mathrm{AP}_c(x_t \mid x^{t-1}, c)$ for each change pattern $c$ from $O(2^{|\mathcal{X}|^{d-1}})$ to $O(d)$. However, (15) for the proposed source requires a computational effort of $O(d \cdot 2^N)$ because the total number of change patterns is $|\mathcal{C}| = 2^N$ for the length $N$ of the source sequence.

In this subsection, we propose an algorithm that reduces the computational complexity to $O(d \cdot N^2)$ using a class of prior probability distributions of change patterns by Suko et al. [8]. In their proposal of efficient Bayes coding for i.p.i.d. sources, they assumed a class that follows a Bernoulli distribution for the pattern of parameter changes. In this paper, we assume a similar class for the change pattern $c$.

Definition 6. For the change pattern $c = (w_1^{(c)}, \ldots, w_t^{(c)}, \ldots, w_N^{(c)})$, we assume that each $w_t^{(c)}$ (except $t = 1$) independently follows the Bernoulli distribution $\mathrm{Ber}(\alpha)$. That is,
$$\pi(c) \stackrel{\mathrm{def}}{=} \alpha^{|\mathcal{T}_c|-1} (1-\alpha)^{N - |\mathcal{T}_c|}. \quad (26)$$
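As a small worked example of (26) (my own, not from the paper): for a change pattern $w$ with $w_1 = 1$, the prior is $\alpha$ raised to the number of changes after $t = 1$ times $(1-\alpha)$ raised to the number of non-changes.

```python
def change_pattern_prior(w, alpha):
    """pi(c) of eq. (26) for a change pattern w = (w_1, ..., w_N) with w_1 = 1."""
    changes = sum(w[1:])                                   # |T_c| - 1
    return alpha ** changes * (1 - alpha) ** (len(w) - 1 - changes)

# N = 6, changes at t = 1 (by convention) and t = 4: pi(c) = alpha * (1 - alpha)**4.
print(change_pattern_prior([1, 0, 0, 1, 0, 0], alpha=0.01))
```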

In addition, Suko et al. [8] introduced a prior probability distribution of $\tau_t$ and proposed an efficient Bayes code for i.p.i.d. sources. In the same way, we define the prior probability distribution $v(\tau_t)$ of the last change point as follows:
$$v(\tau_t) \stackrel{\mathrm{def}}{=} \sum_{c \,:\, \tau_t \in \mathcal{T}_c} \pi(c), \quad \text{where } \tau_t = 1, 2, \ldots, t. \quad (27)$$

Finally, the efficient Bayes coding algorithm for the model proposed in this paper is shown below.

Step i. Load $x_t$.
Step ii. The coding probability is calculated as follows:
$$\tilde{p}(x_t \mid x^{t-1}) = \sum_{\tau_t = 1}^{t} \tilde{q}_{s_\lambda}(x_t \mid x_{\tau_t} \cdots x_{t-1})\, v(\tau_t \mid x^{t-1}).$$
Step iii. $v(\tau_{t+1} \mid x^t)$ is calculated as follows:
• If $\tau_{t+1} = 1, 2, \ldots, t$,
$$v(\tau_{t+1} \mid x^t) = (1 - \alpha)\, \frac{\tilde{q}_{s_\lambda}(x_t \mid x_{\tau_t} \cdots x_{t-1})\, v(\tau_t \mid x^{t-1})}{\tilde{p}(x_t \mid x^{t-1})}.$$
• If $\tau_{t+1} = t + 1$, $v(\tau_{t+1} \mid x^t) = \alpha$.
Step iv. Go back to Step i.

We show that the above algorithm correctly computes the Bayes coding probability for the proposed source.

Proof. (Outline only.)
$$\begin{aligned}
\mathrm{AP}^{*}(x_t \mid x^{t-1}) &= \sum_{c \in \mathcal{C}} \pi(c \mid x^{t-1})\, \mathrm{AP}_c(x_t \mid x^{t-1}, c) = \sum_{c \in \mathcal{C}} \pi(c \mid x^{t-1})\, \tilde{q}_{s_\lambda}(x_t \mid \tau_t, x^{t-1}) \\
&= \sum_{\tau_t=1}^{t} \tilde{q}_{s_\lambda}(x_t \mid \tau_t, x^{t-1}) \left\{ \sum_{c \,:\, \tau_t \in \mathcal{T}_c} \pi(c \mid x^{t-1}) \right\} = \sum_{\tau_t=1}^{t} \tilde{q}_{s_\lambda}(x_t \mid \tau_t, x^{t-1})\, v(\tau_t \mid x^{t-1}) = \tilde{p}(x_t \mid x^{t-1}). \quad (28)
\end{aligned}$$
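The following is a minimal end-to-end sketch of Steps i-iv, reusing the Node, q_tilde, and update helpers sketched in Section IV-A. One superposed tree is kept per candidate last change point $\tau$, so the work at time $t$ is $O(t \cdot d)$ and the total is $O(d \cdot N^2)$, as stated above; the bookkeeping details (for example, how the context is truncated at the segment boundary) are my assumptions.

```python
def bayes_coding_probabilities(x, alpha, max_depth=2):
    """Return the coding probabilities p~(x_t | x^{t-1}) for a 0/1 list x (Steps i-iv)."""
    trees, weights, probs = {}, {}, []
    for t, symbol in enumerate(x, start=1):
        trees[t] = Node(depth=0, max_depth=max_depth)    # fresh segment starting at tau = t
        weights[t] = alpha if t > 1 else 1.0             # v(t | x^{t-1}); v(1 | x^0) = 1
        # Step ii: p~(x_t | x^{t-1}) = sum_tau q~_{s_lambda}(x_t | x_tau .. x_{t-1}) v(tau | x^{t-1})
        contexts = {tau: list(reversed(x[tau - 1:t - 1]))[:max_depth] for tau in trees}
        q_root = {tau: q_tilde(trees[tau], contexts[tau], symbol) for tau in trees}
        p = sum(q_root[tau] * weights[tau] for tau in trees)
        probs.append(p)
        # Step iii: v(tau | x^t) for tau = 1, ..., t; v(t+1 | x^t) = alpha is set next round
        for tau in trees:
            weights[tau] = (1 - alpha) * q_root[tau] * weights[tau] / p
        # Feed x_t into every tree so that its counts cover x_tau .. x_t, then go to Step i
        for tau in trees:
            update(trees[tau], contexts[tau], symbol)
    return probs

# Usage sketch: probs = bayes_coding_probabilities([1, 0, 0, 1, ...], alpha=0.01)
# gives the per-symbol coding probabilities that enter the redundancy (29) below.
```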

V. EXPERIMENT

We performed an experiment running the proposed algorithm on an artificially generated source sequence. The purpose of this experiment is to check the compression performance of the proposed algorithm for several settings of the hyper-parameter $\alpha$ (see Definition 6).

First, we describe the source sequence used in the experiment. The length of the sequence is 300 and, as shown in Figure 6, each run of 100 consecutive symbols is generated from a different context tree model.

Fig. 6. Context tree models that generate the source sequence used in the experiment. Note that $\theta_{x=1 \mid s_\lambda}$ is given by $1 - \theta_{x=0 \mid s_\lambda}$.

Since the context tree model of the source sequence changes every 100 symbols, the appropriate hyper-parameter $\alpha$ (the parameter of the Bernoulli distribution in Definition 6) is 0.01. Therefore, in order to confirm the compression performance for different values of $\alpha$, we conducted the following Experiment A.

Experiment A. We observe the redundancy
$$\log_2 \frac{1}{\tilde{p}(x_t \mid x^{t-1})} - H\left(X_t \,\middle|\, X_{\tau_t}^{t-1}\right) \quad (29)$$
of the proposed algorithm, where $H(\cdot)$ denotes the entropy rate. We set the hyper-parameter $\alpha$ to three values: $\alpha = 0.1$, $0.01$, and $0.001$. We ran the algorithm 10000 times with each value of $\alpha$. In the algorithm, we set $g_s = 0.5$ (see Assumption 1) and $\beta(0 \mid s) = \beta(1 \mid s) = 0.5$ (see Assumption 2). The average redundancy over the 10000 trials is shown in Figure 7.

Fig. 7. Average redundancy over the 10000 trials.

The redundancy jumps up instantly at each changing point ($t = 101, 201$) but decreases gradually for every value of $\alpha$.

($\alpha = 0.1$ vs. $\alpha = 0.01$) When $\alpha = 0.1$, the redundancy decreases more rapidly. However, it converges to a smaller value when $\alpha = 0.01$ at the end of each interval.

($\alpha = 0.01$ vs. $\alpha = 0.001$) When $\alpha = 0.01$, the redundancy decreases more rapidly. Moreover, there is little difference in the convergence values at the end of each interval.

Therefore, it can be concluded that $\alpha = 0.01$ seems the best for compression performance.

In the Bayes code, we can observe the posterior probabilities of the parameters. For example, the posterior probability of $\tau_t$ indicates the characteristics of the context tree model changes. Next, we conducted the following Experiment B.

Experiment B. The $\tau_t$ with the largest posterior probability $v(\tau_t \mid x^{t-1})$ is regarded as the estimate of the last change point at time $t$, and we observe the transition of this estimate. As in Experiment A, we set the hyper-parameter to the three values $\alpha = 0.1$, $0.01$, and $0.001$, and we ran the algorithm 10000 times with each value. The average of the estimates over the 10000 trials is shown in Figure 8.

Fig. 8. Average of the estimates of the last change point over the 10000 trials.

When $\alpha = 0.01$ and $0.001$, the transition of $\arg\max_{\tau_t} v(\tau_t \mid x^{t-1})$ appears to correspond to the changes of the context tree model. When $\alpha = 0.1$, however, $\arg\max_{\tau_t} v(\tau_t \mid x^{t-1}) \simeq t$, so it does not correspond to the changes of the context tree model.

ACKNOWLEDGEMENT

This work was supported in part by JSPS KAKENHI Grant Numbers JP17K06446, JP19K04914, and JP19K14989.

REFERENCES

[1] F. M. J. Willems, Y. M. Shtarkov and T. J. Tjalkens, "The Context-Tree Weighting Method: Basic Properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653-664, May 1995, doi: 10.1109/18.382012.
[2] T. Matsushima and S. Hirasawa, "A Class of Prior Distributions on Context Tree Models and an Efficient Algorithm of the Bayes Codes Assuming It," 2007 IEEE International Symposium on Signal Processing and Information Technology, Giza, 2007, pp. 938-941, doi: 10.1109/ISSPIT.2007.4458049.
[3] T. Matsushima, H. Inazumi and S. Hirasawa, "A Class of Distortionless Codes Designed by Bayes Decision Theory," IEEE Transactions on Information Theory, vol. 37, no. 5, pp. 1288-1293, Sept. 1991, doi: 10.1109/18.133247.
[4] N. Merhav, "On the Minimum Description Length Principle for Sources with Piecewise Constant Parameters," IEEE Transactions on Information Theory, vol. 39, no. 6, pp. 1962-1967, Nov. 1993, doi: 10.1109/18.265504.
[5] G. I. Shamir and N. Merhav, "Low-complexity Sequential Lossless Coding for Piecewise-stationary Memoryless Sources," IEEE Transactions on Information Theory, vol. 45, no. 5, pp. 1498-1519, July 1999, doi: 10.1109/18.771150.
[6] M. Duc Cao, T. I. Dix, L. Allison and C. Mears, "A Simple Statistical Algorithm for Biological Sequence Compression," DCC'07, Snowbird, UT, 2007, pp. 43-52, doi: 10.1109/DCC.2007.7.
[7] Y. Nakahara and T. Matsushima, "A Stochastic Model of Block Segmentation Based on the Quadtree and the Bayes Code for It," 2020 Data Compression Conference (DCC), pp. 293-302, 2020, doi: 10.1109/DCC47342.2020.00037.
[8] T. Suko, T. Matsushima and S. Hirasawa, "Bayes Coding for Sources with Piecewise Constant Parameters," Proceedings of the 26th Symposium on Information Theory and Its Applications, pp. 165-168, 2003. (in Japanese)