Computation and Estimation of Generalized Entropy Rates for Denumerable Markov Chains
Gabriela Ciuperca, Valérie Girardin, Loïck Lhote
IEEE Transactions on Information Theory, vol. 57, pp. 4026-4034, 2011. DOI: 10.1109/TIT.2011.2133710.


Abstract—We study entropy rates of random sequences for general entropy functionals, including the classical Shannon and Rényi ones and the more recent Tsallis and Sharma-Mittal ones. In the first part, we obtain an explicit formula for the entropy rate for a large class of entropy functionals, as soon as the process satisfies a regularity property known in dynamical systems theory as the quasi-power property. Independent and identically distributed sequences of random variables naturally satisfy this property. Markov chains are proven to satisfy it too, under simple explicit conditions on their transition probabilities. All the entropy rates under study are thus shown to be either infinite or zero, except at a threshold where they are equal to Shannon or Rényi entropy rates up to a multiplicative constant. In the second part, we focus on the estimation of the marginal generalized entropy and entropy rate for parametric Markov chains. Estimators with good asymptotic properties are built through a plug-in procedure using a maximum likelihood estimation of the parameter.

Index Terms—entropy rate, entropy functional, parametric Markov chain, plug-in estimation, Rényi entropy, Tsallis entropy.

G. Ciuperca is with the Laboratoire de Probabilité, Combinatoire et Statistique, Université Lyon I, Domaine de Gerland, Bât. Recherche B, 50 Av. Tony-Garnier, 69366 Lyon cedex 07, France, [email protected]
V. Girardin is with the Laboratoire de Mathématiques N. Oresme, UMR 6139, Campus II, Université de Caen, BP 5186, 14032 Caen, France, [email protected]
L. Lhote is with ENSICAEN, GREYC, Campus II, Université de Caen, BP 5186, 14032 Caen, France, [email protected]

I. INTRODUCTION

In [21], Shannon adapted to the field of information theory the concept of entropy introduced by Boltzmann and Gibbs in the XIX-th century. Entropy measures the randomness or uncertainty of a random phenomenon. It now applies to various areas such as information theory, finance, statistics, cryptography, physics, artificial intelligence, etc.; see [8] for details. Rényi proposed in [19] a one-parameter family of entropies extending Shannon entropy to new applications. Since then, many different generalized entropies have been defined to adapt to many different fields. Among them, Tsallis [24] or Sharma-Mittal [23] entropies are instances of what we will call (h, φ)-entropies, thus following [20], where only parametric probability density functions are considered. Precisely, we set

    S_{h(y),φ(x)}(ν) = h( Σ_{i∈E} φ(ν(i)) )                      (1)

for any measure ν on a countable space E such that the quantity is finite.

In this paper, we address the problems of computing and then estimating the (h, φ)-entropy rates of random sequences, especially Markov chains, taking values in countable spaces, either finite or denumerable. This entropy rate is defined as the limit of the time average of the entropy of the considered random sequence, that is, as the entropy per unit time of the sequence.

For an independent identically distributed (i.i.d.) sequence, Shannon and Rényi entropy rates are well-known to be the entropy of the common distribution. The Shannon entropy rate of an ergodic and homogeneous Markov chain with a countable state space is an explicit function of its transition probabilities and stationary distribution; it is also known to be related to the dominant eigenvalue of some perturbation of the transition matrix, a result proven for Rényi entropy in [18]. [12] deals with the denumerable case but the proofs contain flaws (see the end of Section IV-B). Up to our knowledge, no other result exists in the literature concerning the explicit determination of (h, φ)-entropy rates. The aim of the first part of the present paper is to fill this gap, with a particular interest in Markov chains.

The entropy of the stationary distribution of a Markov chain is the (asymptotic) entropy of the chain at equilibrium; if this distribution is taken as the initial distribution of the chain, its entropy is also the marginal entropy of the chain. In both cases, the entropy rate is more representative of the whole trajectory of the sequence. Having the marginal entropy and entropy rate of Markov chains under an explicit form allows one to use them efficiently in all applications involving Markov modeling.

When only observations of the chain are available, the need for estimation obviously appears. We consider the case of countable parametric chains, for which the transition probabilities are functions of a finite set of parameters. Since the entropy is an explicit function of the transition probabilities, and hence of the parameters, plug-in estimators of the marginal entropy and entropy rate are obtained by replacing the parameters by their maximum likelihood estimators (MLE).

Up to our knowledge, no result exists in the literature on the estimation either of the generalized entropy of the stationary distribution or of the generalized entropy rate of a countable Markov chain. Concerning Shannon entropy, see [5] for results on the estimation of the marginal entropy through a Monte Carlo method, and [7] for the estimation of the marginal entropy and entropy rate of finite chains. The special case of two-state Markov chains is studied in [11].

The paper is organized as follows. Basics on generalized entropies and entropy rates are given in Section II. In Section III, the regularity property called quasi-power property is

h(y)                                 | φ(x)                          | (h, φ)-entropy
-------------------------------------|-------------------------------|---------------------------
y                                    | −x log x                      | Shannon (1948)
(1 − s)^{−1} log y                   | x^s                           | Rényi (1961)
[t(t − r)]^{−1} log y                | x^{r/t}                       | Varma (1966)
y                                    | (1 − 2^{1−r})^{−1}(x − x^r)   | Havrda and Charvat (1967)
(t − 1)^{−1}(y^t − 1)                | x^{1/t}                       | Arimoto (1971)
(r − 1)^{−1}[y^{(r−1)/(s−1)} − 1]    | x^s                           | Sharma and Mittal 1 (1975)
(r − 1)^{−1}[exp((r − 1)y) − 1]      | −x log x                      | Sharma and Mittal 2 (1975)
y                                    | −x^r log x                    | Taneja (1975)
y                                    | (t − r)^{−1}(x^r − x^t)       | Sharma and Taneja (1975)
(r − 1)^{−1}(1 − y)                  | x^r                           | Tsallis (1988)

TABLE I: SOME (h, φ)-ENTROPIES.
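For illustration, the following minimal sketch (in Python, assuming NumPy is available) evaluates a few of the (h, φ)-entropies of Table I for a finite distribution, directly from definition (1). The helper names h_phi_entropy, shannon, renyi and tsallis, as well as the example distribution, are our own illustrative choices and not part of the paper.

```python
import numpy as np

def h_phi_entropy(nu, h, phi):
    """(h, phi)-entropy of a measure nu, as in definition (1):
    S_{h,phi}(nu) = h( sum_i phi(nu(i)) ), skipping zero-mass terms."""
    nu = np.asarray(nu, dtype=float)
    return h(np.sum(phi(nu[nu > 0])))

# A few rows of Table I written as (h, phi) pairs.
def shannon(nu):
    return h_phi_entropy(nu, h=lambda y: y, phi=lambda x: -x * np.log(x))

def renyi(nu, s):
    return h_phi_entropy(nu, h=lambda y: np.log(y) / (1.0 - s), phi=lambda x: x ** s)

def tsallis(nu, r):
    return h_phi_entropy(nu, h=lambda y: (1.0 - y) / (r - 1.0), phi=lambda x: x ** r)

nu = [0.5, 0.25, 0.125, 0.125]          # an illustrative distribution
print(shannon(nu), renyi(nu, s=2.0), tsallis(nu, r=2.0))
```

Terms with ν(i) = 0 are skipped, in line with the usual convention 0 log 0 = 0.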

−1 stated and shown to induce convergence of the time average Renyi´ entropy is obtained for hs(y) = (1 − s) log y and s entropy to an explicit limit. In Section IV, mild conditions φs(x) = x with s > 0, that is are shown to be sufficient for countable Markov chains to 1 X satisfy the quasi-power property. Estimation of the generalized (ν) = (ν) = log ν(i)s; Rs Shs(y),φs(x) 1 − s marginal entropy and entropy rate for parametric countable i∈E Markov chains is studied in Section V. Shannon entropy corresponds to s → 1.Renyi´ entropy is additive, but is concave only for s ≤ 1. Note that [2] proves II.GENERALIZEDENTROPIESANDENTROPYRATES that additive entropies are necessarily non linear transforms of Renyi´ entropies. We refer to [13] for detailed applications of A. Generalized (h, φ)-entropies Renyi´ entropies. Standard thermodynamical extensivity is lost in strong mix- All throughout the paper, E will be a countable set, either ing, long range interacting or non-Markovian physical systems. finite or denumerable. Both functions h : → and φ : R+ R This led Tsallis to postulate in [24] a nonadditive general- [0, 1] → are twice continuously differentiable functions, R+ ization of Shannon entropy which now bears his name, thus with h monotonous and either φ concave or convex. Both φ allowing for superextensivity (when r < 1) or subextensivity and h ◦ φ will be supposed to be positive for simplification, (when r > 1). Note that Tsallis entropy equals Havrda-Charvat but all the results below can be adapted to negative functions. entropy up to a multiplicative term depending only on the We define the (h, φ)-entropy (ν) of any measure Sh(x),φ(y) parameter. Tsallis entropy involves the functions φ (x) = xr ν on E as in (1) if P φ(ν(i)) is finite, and as +∞ either. r i∈E for some positive r 6= 1, and h (y) = (r − 1)−1(1 − y), so For the sake of simplicity, we will suppose that (ν) r Sh(x),φ(y) that is nonnegative. " # The conditions on both φ and h are the usual conditions for 1 X (X) = (ν) = 1 − ν(i)r . entropy that are for instance satisfied by all the entropies of Tr Shr (y),φr (x) r − 1 Table I. The function h may not be positive (see for instance i∈E Renyi´ or Varma entropies) but the parameters in h and φ Tsallis entropy is concave and appears as the unique solu- behave such that h ◦ φ is indeed positive. Note also that h(x) tion of a generalized Khinchin’s set of conditions. Shannon is finite for all x ∈ R+, but that h may not be bounded on entropy corresponds to the value r = 1.Renyi´ entropy is a R+. monotonically decreasing function of Tsallis entropy, precisely −1 For a a random variable X with distribution ν, we set Rs(X) = (1 − s) log[1 − (s − 1)Ts(X)], but concavity is Sh(x),φ(y)(X) = Sh(x),φ(y)(ν). For a stationary random se- not preserved through monotonicity. See Tsallis [24] for details quence (Xn)n∈N with common distribution ν, we will call in statistical mechanics, and for hints for determining r from marginal entropy of the sequence the quantity Sh(x),φ(y)(ν) = fitting physical constraints. See [27] and the references therein Sh(x),φ(y)(Xn). for other applications in statistical mechanics, in thermody- Definition (1) includes all classical entropies. First, we get namics, in the study of DNA binding sites, etc. 
Shannon entropy for φ(x) = −x log x and h the identity Both Renyi´ and Tsallis entropies appear as particular cases function, so that of Sharma-Mittal entropy introduced in [23] with hs,r(y) = (r − 1)−1[1 − y(1−r)/(1−s)] φ (x) = xs X and s , that is S(ν) = Sy,−x log x(ν) = − ν(i) log ν(i).  1−r  i∈E " # 1−s 1 X (ν) = (ν) = 1 − ν(i)s . Ss,r Shs,r (y),φs(x) r − 1   Shannon entropy is concave and is additive (and fits well to i∈E extensive systems), meaning that the Shannon entropy of the product of marginal measures is the sum of the entropies of Renyi´ entropy corresponds to r → 1 and Tsallis entropy to the marginal measures. s → r. The case s → 1 is sometimes called Gaussian entropy; 3 precisely, III.QUASI-POWER PROPERTY AND (h, φ)-ENTROPY RATE 1 We will first introduce the quasi-power property and then lim Ss,r(X) = (1 − exp [(r − 1)S(X)]) . prove that the (h, φ)-entropy rate of a random sequence s→1 1 − r satisfying that property can be computed explicitly for a large In general, Sharma-Mittal entropy is neither extensive nor class of (h, φ) functions. The proof will involve the series concave. X Λ (s) = ν (in−1)s (4) A list of (h, φ)-entropies is given in Table I; we refer to n n 0 in−1∈En [15] and to the references therein for details. 0 for s > 0, and its formal derivatives for k ≥ 1, X Λ(k)(s) = [log ν (in−1)]kν (in−1)s, (5) B. Entropy rates n n 0 n 0 n−1 n i0 ∈E For a discrete-time process X = (X ) , under suitable n n∈N where ν denotes the marginal distribution of order n of the conditions (see [10]), the entropy of (X ,...,X ) divided n 0 n−1 random sequence X = (X ) ; in other words, ν (in−1) = by n converges to a limit called the entropy rate of the process, n n∈N n 0 P(X0 = i0,...,Xn−1 = in−1). In dynamical systems theory, say Hh(y),φ(x)(X). Precisely, Λn(s) is called the Dirichlet series of fundamental measures 1 of depth n. It is a central tool for studying general sources, in Hh(y),φ(x)(X) = lim Sh(y),φ(x)(X0,...,Xn−1). n→∞ n pattern matching or in the analysis of data structures; see for instance [6]. This series is also introduced in [17] (see V (n, s) See [9] for an interesting study of Tsallis and other non- page 36). extensive entropies and entropy rates. The simplest case of a random sequence is an i.i.d. sequence For all additive entropy functionals, the entropy rate of an with a non-degenerated distribution ν over a finite set E. X = (X ) ν n−1 i.i.d. sequence n n∈N with common distribution is Since its marginal distribution of order n is νn(i0 ) = the entropy of ν, so that in particular Hs(X) = Ss(ν) for ν(i0)ν(i1) . . . ν(in−1), the Dirichlet series Λn(s) defined in Renyi´ entropy and H(X) = S(ν) for Shannon entropy. For an (4) can be simply written as the n-th power of an analytic ergodic homogeneous Markov chain X with countable state function, precisely space, the Shannon entropy rate depends on the transition " #n probabilities P = (p(i, j))i,j∈E and stationary distribution X s Λn(s) = ν(i) . π = (π(i))i∈E (such that πP = π) through the well-known i∈E expression stated in [21], The quasi-power property next stated says that Λn(s) behaves X for more general random sequences satisfying it like the n-th H(X) = − π(i)p(i, j) log p(i, j). (2) power of some analytic function up to some error term. i,j∈E
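As an illustration of the closed form (2), the sketch below (Python/NumPy; the function names and the two-state transition matrix are invented for the example and are not taken from the paper) computes the stationary distribution of a finite ergodic chain as a left eigenvector and then its Shannon entropy rate.

```python
import numpy as np

def stationary_distribution(P):
    """Probability vector pi with pi P = pi, taken as the left eigenvector
    of P associated with the eigenvalue closest to 1."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def shannon_entropy_rate(P):
    """Formula (2): H(X) = - sum_{i,j} pi(i) p(i,j) log p(i,j), in nats."""
    pi = stationary_distribution(P)
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    return -float(np.sum(pi[:, None] * plogp))

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])               # an illustrative ergodic chain
print(shannon_entropy_rate(P))
```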

Rached et al. proved in [18] that the Rényi entropy rate of a finite Markov chain is

    H_s(X) = (1 − s)^{−1} log λ(s),                              (3)

using that the perturbated matrix P_s = (p(i, j)^s)_{i,j∈E} has a unique dominant eigenvalue λ(s) for any s > 0. We will show in Section IV-B that (3) holds true for a denumerable Markov chain too under mild regularity conditions. Note that by letting s go to 1 in (3), since λ(1) = 1, Shannon entropy rate is also related to the dominant eigenvalue through H(X) = −λ′(1).

A random sequence can also be described in terms of symbolic dynamical systems theory. We refer to [25] for the definition in terms of dynamical systems of i.i.d. sequences (also called Bernoulli processes) and finite Markov chains, and to [6] for Markovian dynamical systems with countable state spaces. Both deal, among other topics, with the Shannon entropy rate of processes by means of functional operators also called transfer operators. These operators play the same role as perturbations of transition matrices do in [18]; they also have a unique dominant eigenvalue λ(s), and the Shannon entropy rate is thus proven to be −λ′(1). In the next section, we will use similar operator techniques for determining (h, φ)-entropy rates.

Property 1 Let X = (X_n)_{n∈N} be a random sequence taking values in a countable set E and let ν_n denote the marginal distribution of (X_0, ..., X_{n−1}). Then X is said to satisfy the quasi-power property with parameters [σ_0, λ, c, ρ] if both following conditions are fulfilled:
1. sup_{i_0^{n−1} ∈ E^n} ν_n(i_0^{n−1}) converges to zero when n tends to infinity.
2. There exists a real number σ_0 < 1 such that for all real numbers s > σ_0 and all integers n ≥ 0, the series Λ_n(s) defined in (4) is convergent and satisfies

    Λ_n(s) = c(s) · λ(s)^{n−1} + R_n(s),                         (6)

with |R_n(s)| = O(ρ(s)^{n−1} λ(s)^{n−1}), where c and λ are strictly positive analytic functions for s > σ_0, λ is strictly decreasing with λ(1) = c(1) = 1, R_n is also analytic, and finally ρ(s) < 1.

Obviously, any i.i.d. sequence taking values in a finite set E satisfies the quasi-power property for σ_0 = 0 and functions λ, c and ρ defined by

    λ(s) = Σ_{i∈E} ν(i)^s,   c(s) = λ(s)   and   ρ(s) = 0.

The result extends to the case of a denumerable set E as soon as some σ_0 < 1 exists such that, for all s > σ_0, the series

λ(s) converges. Note that Shannon entropy rate is then equal to −λ′(1) and Rényi entropy rate to (1 − s)^{−1} log λ(s).

We can now state the main result for determining the generalized entropy rates of random sequences satisfying the quasi-power property.

[End of the proof of Theorem 1, displaced here by the two-column layout:] ... in the same way as for s = 1 the results listed in Table II for s < 1 and s > 1. □

Table III shows applications of Theorem 1 to various entropy rates for random sequences satisfying the quasi-power property with parameters [λ, c, ρ, σ_0]. Remark that almost all

Theorem 1 Let X = (Xn)n∈N be a random sequence satis- the entropy rates are finite and non-zero only at a threshold fying the quasi-power property with parameters [σ0, λ, c, ρ]. where they are equal to Shannon or Renyi´ entropy rates up Suppose that to a multiplicative factor. Elsewhere, they are either null or s k infinite, which limits their practical interest in applications. φ(x) ∼ c1 · x · (− log x) (7) x→0 ∗ ∗ with s > σ0, c1 ∈ R+ and k ∈ N . Then Table II gives the IV. MARKOVCHAINSANDTHEQUASI-POWER PROPERTY entropy rate Hh,φ(X) according to the behavior of h around 0 Let us begin this section by connecting Markov chains to for s > 1 and around +∞ for s ≤ 1. dynamical sources. A dynamical source is defined by five objects: a countable alphabet E, a topological partition (I ) Note that condition (7) is satisfied for all the usual entropies i i∈E I = [0, 1] σ : I → E listed in Table I. of the interval , a coding function such σ(I ) = i i E f Proof: Point 1 of the quasi-power property and (7) that i for all symbols of , a density function 0 on I and finally a shift function T which is twice continuously together induce that for any ε > 0, there exists n0 ∈ N such n−1 n differentiable and strictly monotonous on each interval of the that for all n ≥ n0 and i0 ∈ E , partition. The random sequence X = (Xn) associated to the k n−1 s k n−1 n−1 n (1 − ε)(−1) c1νn(i0 ) log νn(i0 ) ≤ φ(νn(i0 )) dynamical source corresponds to the trace of the iterates T (x) for some x chosen according to the distribution f . Precisely, and 0 n n−1 k n−1 s k n−1 Xn = σ(T (X0)), φ(νn(i0 )) ≤ (1 + ε)(−1) c1νn(i0 ) log νn(i0 ). where X is a random variable with density function f on I. Therefore, 0 0 The dynamical source (T,I, E, (Ii)i∈E, f0) is Bernoulli if k (k) k (k) (1 − ε)(−1) c1Λn (s) ≤ Σn ≤ (1 + ε)(−1) c1Λn (s), (8) T is surjective (i.e., T (Ii) = I) and affine on each Ii and P n−1 if f0 is constant. Then, it is easy to check that the associated where we have set Σn = n−1 n φ(νn(i0 )) for simpli- i0 ∈E random sequence X is i.i.d.. The dynamical source is said to be fication. Due to the analyticity of all the functions involved in Markovian if the image of each interval is the union of images (6), for n large enough, of intervals of the partition. If, furthermore, T is piecewise   1  affine and f is constant on each interval of the partition, the Λ(k)(s) = c(s) · λ0(s)k · nk · λ(s)n−k−1 · 1 + O . 0 n n associated random sequence X is a Markov chain. Conversely, any Markov chain X over a countable state Putting this into (8) yields space E can be represented by a (non unique) Markovian 0 k k n−k−1 Σn ∼ c1 · c(s) · (−λ (s)) · n · λ(s) , (9) dynamical source. Let P = (p(i, j))i,j∈E denote the transition matrix of X (that is p(i, j) = (X = j|X = i)) and µ = with −λ0(s) = |λ0(s)| since λ is a decreasing function. Let us P n n−1 (µ(i)) its initial distribution (that is µ(i) = (X = i)). now study the three different cases concerning s. i∈E P 0 For instance, we can consider a topological partition (I ) First, suppose that s = 1. Since λ(1) = c(1) = 1, (9) i i∈E of I = [0, 1] and then a second one (I ) such that for simplifies into j|i i,j∈E all i, j ∈ E, Σ ∼ c · |λ0(1)|k · nk. n 1 [ |Ij|i| = p(i, j) · |Ii| and Ij|i = Ii, Since φ is a positive function, Σn converges polynomially j∈E to +∞. Depending on the conditions on h, this leads to the 1/k 0 next equivalences: h(Σn) ∼ c2 · c1 · |λ (1)| · n in Case (I), where I denotes the closure of the interval I. 
A dynamical h(Σn) ∼ o(n) in Case (II), and h(Σn) ∼ sn·n with sn → +∞ source simulating the Markov chain X is then given by the in Case (III). By definition, the (h, φ)-entropy rate is the limit following five elements: the alphabet, say E = {j|i, (i, j) ∈ 2 of h(Σn)/n when n tends to infinity, so the results given in E }, the topological partition (Ij|i)i,j∈E, the coding function Table II for s = 1 follow immediately. σ defined by σ(Ij|i) = j|i, the piecewise constant density For either s < 1 or s > 1, the function λ is strictly function f defined by f(Ii) = µ(i)/|Ii|, and finally the decreasing with λ(1) = 1. Hence λ(s) < 1 for s > 1 piecewise linear function T defined by T (Ij|i) = Ij. Note and λ(s) > 1 for s < 1, from which we deduce that that even if the state spaces of the Markov chain and of the + Σn tends exponentially to +∞ for s < 1 and to 0 for associated dynamical source seem different, a clear bijection s > 1. Depending on conditions on h, this leads to the next exists between both processes. equivalences: h(Σn) ∼ c2 ·log λ(s)·n in cases (IV) and (VII), Finally, note that [6] give sufficient conditions on count- h(Σn) ∼ o(n) in cases (V) and (VIII), and h(Σn) ∼ sn · n able Markovian dynamical systems ensuring the quasi-power with sn → +∞ in cases (VI) and (IX). Then, we get exactly property. The associated transfer operator (see (12) below) has 5

Value of s    | Condition on h                   | Entropy rate                | Case
--------------|----------------------------------|-----------------------------|------
s = 1         | h(x) ∼ c_2 · x^{1/k}, x → +∞     | c_2 · c_1^{1/k} · |λ′(1)|   | (I)
s = 1         | h(x) = o(x^{1/k}), x → +∞        | 0                           | (II)
s = 1         | x^{1/k} = o(h(x)), x → +∞        | +∞                          | (III)
s > 1         | h(x) ∼ c_2 · log x, x → 0^+      | c_2 · log λ(s)              | (IV)
s > 1         | h(x) = o(log x), x → 0^+         | 0                           | (V)
s > 1         | log x = o(h(x)), x → 0^+         | +∞                          | (VI)
σ_0 < s < 1   | h(x) ∼ c_2 · log x, x → +∞       | c_2 · log λ(s)              | (VII)
σ_0 < s < 1   | h(x) = o(log x), x → +∞          | 0                           | (VIII)
σ_0 < s < 1   | log x = o(h(x)), x → +∞          | +∞                          | (IX)

TABLE II: VALUE OF THE ENTROPY RATE H_{h,φ}, ACCORDING TO THE BEHAVIOR OF h.
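The quantities λ(s) and λ′(1) appearing in Tables II and III can be computed numerically for a finite chain from the perturbated matrix P_s = (p(i, j)^s) of relation (3). The sketch below (Python/NumPy) is one possible illustration, not the paper's own code: the function names and the two-state chain are invented, and λ′(1) is approximated by a central finite difference purely for the example.

```python
import numpy as np

def dominant_eigenvalue(P, s):
    """lambda(s): dominant eigenvalue of the perturbated matrix P_s = (p(i,j)^s)."""
    return float(max(abs(np.linalg.eigvals(P ** s))))

def renyi_entropy_rate(P, s):
    """Relation (3): H_s(X) = log(lambda(s)) / (1 - s), for s != 1."""
    return np.log(dominant_eigenvalue(P, s)) / (1.0 - s)

def shannon_rate_via_lambda(P, eps=1e-6):
    """H(X) = -lambda'(1), approximated here by a central finite difference."""
    return -(dominant_eigenvalue(P, 1.0 + eps)
             - dominant_eigenvalue(P, 1.0 - eps)) / (2.0 * eps)

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])               # an illustrative ergodic chain
print(renyi_entropy_rate(P, s=2.0), shannon_rate_via_lambda(P))
```

Since λ(1) = 1 for a stochastic matrix, the finite-difference value −λ′(1) should agree with the Shannon entropy rate given by formula (2).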

Entropy Parameters Entropy rate Shannon −λ0(1) Renyi´ s = 1 −λ0(1) 1 s 6= 1 log λ(s) 1 − s 1 Varma r = t − λ0(1) t2 1 r 6= t log λ(r/t) t(t − r) Havrda and Charvat r > 1 0 −1 r = 1 λ0(1) log 2 r < 1 +∞ Arimoto t > 1 +∞ t = 1 −λ0(1) t < 1 0 Sharma-Mittal 1 r < 1 +∞ r > 1 0 s = r = 1 −λ0(1) 1 r = 1 6= s log λ(s) 1 − s 1 0 Sharma-Mittal 2 1−r (exp[−(r − 1)λ (1)] − 1) Taneja r < 1 +∞ r = 1 −λ0(1) r > 1 0 Sharma and Taneja r < 1 or t < 1 +∞ r > 1 and t > 1 0 r = 1 and t > 1 0 r = 1 and t = 1 −λ0(1) r > 1 and t = 1 0 Tsallis r < 1 +∞ r = 1 −λ0(1) r > 1 0 TABLE III VALUESOFCLASSICALENTROPYRATESOFARANDOMSEQUENCESATISFYINGTHEQUASI-POWER PROPERTY WITH PARAMETERS [λ, c, ρ, σ0]. 6 only one eigenvalue with maximum modulus, isolated from the Before stating the main result of this section, let us prove remainder of the spectrum by a spectral gap. The dominant a technical lemma which essentially transforms Point C of eigenvector is strictly positive and a spectral decomposition Assumption 1 into a more convenient form. of the operators exists. Unfortunately, to exhibit the right topological partition (Ij)j∈E for which these conditions hold Lemma 1 If Assumptions 1 holds true, then for all s > σ0, 1 1 is quite difficult in terms of transition matrices. Therefore, in the operator Ps :(L , || . ||1) → (L , || . ||1) defined by the following sections, we prefer to state conditions especially Ps[v] = v · Ps is a compact operator on fitted to transition matrices under which we prove that the X L1 = {u = (u ) : kuk = |u | < +∞}. quasi-power property holds true for countable Markov chains. i i∈N 1 i i∈N A. Finite Markov chains Proof: For the sake of simplicity, since any denumerable state space E can be enumerated as a sequence E = (i ) Let X = (X ) be an ergodic Markov chain with finite k k∈K n n∈N with K = , we will here set E = , so that Point C of state space E, transition matrix P = (p(i, j)) and initial N N i,j∈E Assumption 1 takes the form distribution µ = (µ(i))i∈E. The marginal distribution νn of X s (X0,...,Xn−1) satisfies ∀ε > 0, ∀s > σ0, ∃N ∈ N, sup p(i, j) < ε. i∈ n−1 N j>N νn(i0 ) = µ(i0)p(i0, i1)p(i1, i2) . . . p(in−2, in−1). First, let us prove that for all s > σ0, there exists a sequence The series Λn(s) defined in (4) can be written in matrix form of integers Nk increasing to infinity such that n−1 Λn(s) = µs · Ps · 1, X 1 sup p(i, j)s < . (10) s i∈ k where Ps = (p(i, j) )i,j∈E and µs is the column vector N j>Nk (µ(i)s) . Since P is irreducible and aperiodic, the same is i∈E Point C of Assumption 1 says that for all k ∈ ∗, some true for P for any s. In particular, P has a unique dominant N s s N ∈ exists such that (10) holds true. If N is replaced eigenvalue with maximum modulus. This eigenvalue λ(s) is k N k by sup N , the inequality remains true and the sequence positive and its associated left and right eigenvectors, say l l≤k l s is clearly increasing. If N did not converge to infinity, some and r , are also positive in the sense that all their components k s j ∈ would exist such that j > N for all k, and hence, are positive. We deduce from these spectral properties that 0 N 0 k 1 X n−1 n−1 n−1 ≥ sup p(i, j)s ≥ sup p(i, j )s. v · P = λ(s) · < v, rs > ls + v · R (s), 0 s k i∈ i∈ N j>Nk N where the spectral radius of R(s) is strictly less than λ(s). k p(i, j ) This defines the functions λ, c and ρ of the quasi-power Letting tend to infinity, we would obtain that 0 is zero i property. They are analytic due to perturbation arguments that for all , which is untrue since the chain is irreducible. 
Now, let us prove that P is indeed a compact operator. Let are detailed in [14]. s (un) denote a sequence of elements un = (un) of L1 Note that this result was indirectly proven in [18], thus n i i∈N ||un|| ≤ 1 vn = un · P inducing the explicit determination of the Renyi´ entropy rate such that 1 and define s. By induction k s : → of finite Markov chains. on , we can build a sequence of functions k N N such that (sk+1(n))n is a subsequence of (sk(n))n and such that sk(n) for all i ≤ Nk, with Nk such as in (10), vi converges B. Denumerable Markov chains 1 to some vi. Then v = (vi)i∈N belongs to L ; indeed, for all Let X = (Xn) be a Markov chain with denumerable state ε > 0 and all M ∈ N, there exists Nk ∈ N, such that M < Nk space E, transition matrix P = (p(i, j))i,j∈E and initial and for n large enough, distribution µ = (µ(i))i∈E. The following assumptions will X X sk(n) X sk(n) be proven to be sufficient for X to satisfy the quasi-power |vi| ≤ |vi − vi | + |vi | property. i σ0, for the L1-norm. Since v belongs to L1, for all k ∈ N∗, there X s sup p(i, j) < ∞ exists Mk ∈ N such that i∈E j∈E X 1 |v | < . and i k X i>Mk µ(i)s < ∞; ∗ 0 0 i∈E Let us set k = max(k, k ), where k is such that Mk < Nk0 . Then C. for all ε > 0 and all s > σ0, there exists some A ⊂ E with a finite number of elements, such that X sk∗ (n) |vi − vi| ≤ X sup p(i, j)s < ε. i∈N i∈E P |vsk∗ (n) − v | + P |vsk∗ (n)| + P |v |. j∈E\A iNk∗ i i>Nk∗ i 7

For n˜ =n ˜(k) such that the first term in the sum is less than where η = sup(i,j)∈E2 p(i, j), so that s → λ(s) is strictly 1/k, we get decreasing. 

X sk∗ (˜n) 2 X sk∗ (n) |v − vi| ≤ + |v |. Then, applying Theorems 1 and 2 to φ satisfying (7), the i k i i∈N i>Nk∗ (h, φ)-entropy rate of the chain is given by

Finally, since by Point C of Assumption 1 again, 1 0 k k n−k−1 Hh(x),φ(y)(X) = lim h c1c(s)λ (s) n λ(s) . X s ∗ (n) X s ∗ (n) X n→+∞ n |v k | ≤ |u k | p(i, j)s i j Remarks i>N ∗ j∈ i>N ∗ k N k 1. In dynamical sources theory, the perturbated matrices P 1 1 s ≤ ||usk∗ (n)|| ≤ , are replaced for Bernoulli sources by the transfer operators j 1 k∗ k Gs defined by we obtain f(x) X sk∗ (˜n) 3 X |v − v | ≤ , Gs[f](y) = , i i k |T 0(x)|s i∈N x:T (x)=y which concludes the proof that Ps is compact.  and for Markovian sources by the secant operators Gs defined by Lemma 1 allows us to prove that denumerable Markov Gs[F ](y1, y2) = (12) chains satisfy the quasi-power property under mild conditions. s X X |x1 − x2| Note that all results remain true if Point A of Assumption 1 s F (x1, x2), |T (x1) − T (x2)| is replaced by: there exist n ∈ N and η < 1 such that all the i∈E (x1,x2)∈Ii(y1,y2) coefficients of P n are less than η. where Ii(y1, y2) = {(x1, x2) ∈ Ii|T (x1) = y1,T (x2) = y2}. 2. As noticed in the introduction, Golshani et al deal in Theorem 2 Let X = (X ) be an irreducible and aperi- n n∈N [12] with the denumerable case for the Renyi´ entropy rate odic Markov chain with transition matrix P = (p(i, j))(i,j)∈E2 but their proofs contain some errors. Indeed, they use results and initial distribution µ = (µ(i))i∈E. If Assumption 1 holds issued from the R-theory of non-negative matrices developed true, then X satisfies the quasi-power property. in [26]. In particular, they invoke the following asymptotic argument: if T is a positive irreducible and aperiodic matrix Proof: It follows from Lemma 1 that Ps is compact for 1 with radius of convergence R, then for all states i, j ∈ E, all s > σ0. Therefore, the spectrum of Ps over L is a the (i, j) coefficient of RnT n converges to some finite value sequence converging to zero. Hence, Ps has a finite number of eigenvalues with maximum modulus and there exists a spectral µi,j. Actually, in [12], the expression of the Renyi´ entropy rate P n n gap separating these dominant eigenvalues from the remainder involves the double sum Sn = i,j∈E(R T )i,jfj for large of the spectrum; details can be found in [14]. n, where (fj)j∈E is related to the initial distribution of the Markov chain, and T is a perturbation of the transition matrix Further, since X is irreducible and aperiodic, Ps is a non- negative irreducible and aperiodic infinite matrix, so has a of the chain. The authors exchange the limit with respect to n unique dominant eigenvalue λ(s) which, moreover, is posi- with the double infinite sum whereas the uniform convergence tive. We deduce from these spectral properties the following is not proven to hold true. On the contrary, [26, Theorem 6.2] states necessary and sufficient conditions to allow the change spectral decomposition of the iterates of Ps, when T is R-positive recurrent; unfortunately, these conditions n n n 1 u · Ps = λ(s) · u · Qs + u · Rs , u ∈ L , (11) involve the generally unknown R-invariant vectors, which makes them difficult to check in practice. Furthermore, even if where Q is the projector over the dominant eigenspace and s the transition matrix of the chain was supposed to be positive R is the projector over the remainder of the spectrum. In s recurrent, it would remain to prove that its perturbation T particular, the spectral radius of Rs can be written ρ(s) · λ(s) 1 shares the same property. with ρ(s) < 1. 
Finally, Λn(s) is given by the L -norm of µ · P n−1, µ = (µ(i)s) s s where s i∈E, so that In the above first part of the paper, we have explicitly ob- n−1 n−1 n−1 Λn(s) = λ(s) ||µs · Qs||1[1 + O(ρ(s) λ(s) )], tained the generalized (h, φ)-entropy rate of random sequences satisfying the quasi-power property. We have also given simple which means that X satisfies the quasi-power property. The assumptions on countable Markov chains for the quasi-power analyticity with respect to s of all the functions involved property to hold. In the second part below, we will focus on the in (11) is due jointly to the analyticity of s → Ps and to estimation of the entropy rates for parametric Markov chains perturbation arguments detailed in [14]. Moreover, for s < t, using the expressions given in Table III and plug-in estimators due to Point A of Assumption 1, built from the MLE of the parameter. X Λ (t) = Pr(in−1)t n 0 V. ESTIMATIONOFENTROPYFORDENUMERABLE in−1∈En 0 MARKOVCHAINS n(t−s) X n−1 s (n−1)(t−s) ≤ η Pr(i0 ) = η Λn(s), We suppose that the transition probabilities of the chain n−1 n d i0 ∈E depend on an unknown parameter θ ∈ Θ , where Θ is an 8 open subset of some Euclidean space and d ≥ 1. where n X The partial derivatives will be denoted with a subscript, as N (i, j) = 11 , for example f = ∂f/∂θ . The expectation under the value n {Xm−1=i,Xm=j} u u m=1 θ of the parameter will be denoted E θ. The true value of the parameter will be denoted by θ0. and n−1 X X Nn(i) = Nn(i, j) = 11Xm=i, i, j ∈ E, A. Estimation of the parameters j∈E m=0 Let X = (X0,...,Xn) be a sample of the chain. Let The estimators p (i, j) are strongly consistent and asymp- n+1 bn (x0, . . . , xn) ∈ E denote an observation; the associated totically normal. Precisely, when n tends to infinity, the log-likelihood is √ vector ( n[pbn(i, j) − p(i, j)]) converges in distribution to n−1 a centered Gaussian vector with covariances δ [δ p(i, j) − X ik jl log µ(x0) + log p(xm, xm+1; θ), p(i, j)p(i, l)]/π(i) for 1 ≤ i, j, k, l ≤ |E|. m=0 A natural estimator of the stationary distribution π is the where µ denotes the initial distribution of the chain. The infor- empirical estimator mation contained in the observation of this initial distribution N (i) π (i) = n , i ∈ E. does not increase with n. Hence, for a large sample theory, it bn n is convenient to consider the value of θ maximizing the pseudo It is strongly consistent and asymptotically normal. Precisely, log-likelihood when n tends to infinity, n−1 √ X n [π (i) − π(i)] −→L N 0, π(i)2[2 τ(i) − 1] − π(i) , log p(xm, xm+1; θ). bn E π m=0 where Eπτ(i) is the expectation of the return time τ(i) of the Asymptotic results on the MLE θbn of the parameter θ thus chain to state i when the initial distribution is π. obtained are proven in [3] under the following regularity These asymptotic properties derive from the law of large assumptions. numbers and central limit theorem for Markov chains; see [7] for details. Assumptions 2 For a finite chain with state space E, the transition prob- A. For any x, the set of y for which p(x, y; θ) > 0 does not abilities may also be functions of a number d of parameters strictly smaller than |E|(|E|−1). In this case, Points C to E of depend on θ.   θ d×d pu (i, j) B. For any (x, y), the partial derivatives pu(x, y; θ), Assumption 2 reduce to: for any , the matrix p puv(x, y; θ) and puvw(x, y; θ) exist and are continuous with has rank d; see [3]. respect to Θ. C. For all θ ∈ Θ, there exists a neighborhood N such that B. 
Estimation of marginal entropy for any u, v, x, y, the functions pu(x, y; θ) and puv(x, y; θ) are uniformly bounded in L1(µ(dy)) on N and Since the transition probabilities of the chain depend on θ, the stationary distribution also depends on θ. It is nat- 0 2 ˆ Eθ[ sup | pu(x, y; θ ) | ] < +∞. ural to consider the plug-in estimator Sh(y),φ(x)(π(θn)) of θ0∈N Sh(y),φ(x)(π(θ)). D. There exists α > 0 (possibly depending on θ) such that 2+α Let X be an ergodic homogeneous finite Markov Eθ[| pu(x, y; θ) | ] is finite, for u = 1, . . . , d. Theorem 3 E. The d × d Fisher information matrix σ(θ) = (σuv(θ)) is chain satisfying the quasi-power property. If Assumption 2 is satisfied, then the estimator (π(θˆ )) is strongly con- non singular, where σuv(θ) = Eθ[pu(x, y; θ)pv(x, y; θ)]. Sh(y),φ(x) n sistent. If, moreover, the differential function DθSh(y),φ(x)(π) 0 Proposition 1 If Assumptions 2 are satisfied, then a strongly is not null at θ , then the plug-in estimator is asymptotically √ normal. Precisely consistent MLE θbn of θ exists. Moreover, n(θbn − θn) is √ asymptotically centered and normal, with covariance matrix ˆ n[Sh(y),φ(x)(π(θn)) − Sh(y),φ(x)(π(θ))] → N (0, Σπ), σ−1(θ0). where Note that if n is large, there is exactly one MLE in N. We  t −1 0   refer to [16] for weaker differentiability assumptions on the Σπ = Dθ0 Sh(y),φ(x)(π) σ (θ ) Dθ0 Sh(y),φ(x)(π) . transition functions. Proof: We know from Proposition 1 that θ converges Any finite Markov chain can be considered as a parametric bn almost surely to θ0. Due to operators properties detailed in chain, with parameters θ = p(i, j), for i 6= j. The MLE of i,j [1] (see particularly p94), the eigenvector π is known to be the transition probabilities are the empirical ones, defined by a continuously differentiable function of the operator; using N (i, j) Point B of Assumption 2 shows that π(x, y; θ) is absolutely p (i, j) = n 11 , bn Nn(i)>0 Nn(i) continuous with respect to θ. The continuous mapping theorem 9

ˆ implies that Sh(y),φ(x)(π(θn)) converges almost surely to [4] Bourdon, J., Nebel M. E.and Vallee´ B., On the Stack-Size of General 0 Tries, ITA, vol. 35 (2), pp163-185, 2001. Sh(y),φ(x)(π(θ )). ˆ [5] Chauveau, D., and Vandekerkhove, P., Monte Carlo estimation of the Then, the normality of Sh(y),φ(x)(π(θn)) follows from entropy for Markov chains. Meth. Comp. Appl. Probab., vol. 9 (1), pp133– Proposition 1 by the delta method.  149, 2007. [6] Chazal, F., and Maume-Deschamps, V., Statistical properties of General Markov dynamical sources: applications to information theory Discrete Math. Theor. Comp. Sc. vol. 6 (2), pp283–314, 2004. [7] Ciuperca, G., and Girardin, V., Estimation of the Entropy Rate of a C. Estimation of entropy rates Countable Markov Chain Comm. Stat. Th. Methods, vol. 36, pp2543– Table III shows that when the (h, φ)-entropy rate is neither 2557, 2007. [8] Cover, L., and Thomas, J., Elements of information theory. Wiley series null nor infinite, only two cases happen. Either, the entropy in telecommunications, New-York, 1991. rate is equal to −λ0(1), that is to Shannon entropy rate, or [9] Furuichi, S., Information theoretical properties of Tsaliis entropies J. it is a simple function of Renyi´ entropy rate, that is of (1 − Math. Physics, vol. 47, 2006. −1 [10] Girardin, V., On the Different Extensions of the Ergodic Theorem of s) log λ(s). Therefore, we will only detail the estimation of Information Theory, in Recent Advances in Applied Probability, R. Baeza- Shannon and Renyi´ entropy rates. Yates, J. Glaz, H. Gzyl, J. Husler¨ and J. L. Palacios (Eds), Springer- The estimation of Shannon entropy rate has already been Verlag, San Francisco, pp163–179, 2005. [11] Girardin, V., and Sesbou¨e,´ A., Comparative Construction of Plug-in considered by two of the authors in [7], mainly for finite chains Estimators of the Entropy Rate of Two-State Markov Chains, Method. for which estimation is detailed under different schemes of Comput. Appl. Probab., V11, pp181–200, 2009. observation, with a plug-in method based on (2). It allowed [12] Golshani, L., Pasha, E., and Yari, G., Some properties of Renyi´ entropy and Renyi´ entropy rate, Inf. Sci., vol. 179 (14), pp2426–2433, 2009. them to prove the asymptotic normality of plug-in estimators [13] Harremoes,¨ P., Interpretations of Renyi´ Entropies and Divergences for finite chains but does not apply to the denumerable case. Physica A vol. 365 (1), pp57–62, 2006. We will solve the problem here, for any countable parametric [14] Kato, T., Perturbation Theory for Linear Operators, 2d edition, Springer- Verlag, Berlin, 1976. chains, by applying results from operators theory. [15] Menendez,´ M.L., Morales, D., Pardo, L., and Salicru,´ M., (h, Φ)-entropy ˆ 0 ˆ Let us define the plug-in estimators H(θn) = −λ (1; θn) differential metric Appl. Math., vol. 42 (2), pp81-98, 1997. ˆ [16] Prakasa Rao, B.L.S., Maximum Likelihood Estimation for Markov of Shannon entropy rate H(θ), and Hs(θn) = (1 − −1 ˆ Process. Ann. Inst. Stat. Math. vol. 24, pp333–345, 1972. s) log λ(s; θn) of Renyi´ entropy rate Hs(θ). [17] Rached, Z., Renyi’s Entropy for Discrete Markov Sources. Master of Science Project, September, 1998. Theorem 4 Let X be an ergodic homogeneous countable [18] Rached, Z., Alajaji, F., and Campbell, L. L., Renyi’s´ Entropy Rate for Discrete Markov Sources, Proc. CISS, pp613-618, 1999. Markov chain satisfying the quasi-power property. If Assump- [19] Renyi,´ A., On measures of information and entropy, Proc. 
4th Berkeley ˆ ˆ tion 2 is satisfied, then the estimators H(θn) and H(s; θn) Symposium on Mathematics, Statistics and Probability, pp547-561, 1960. for s 6= 1, are strongly consistent and asymptotically normal. [20] Salicru,´ M., Menendez,´ M. L., Morales, D., and Pardo, L. Asymptotic distribution of (h, φ)-entropies, Comm. Stat. (Theory and Methods) vol. Precisely √ 22 (7), pp2015–2031, 1993. ˆ n[H(θn) − H(θ)] → N (0, Σ1), [21] Shannon, C., A mathematical theory of communication. Bell Syst.Techn. J. vol. 27, pp379–423 and pp623-656, 1948. where [22] Shao, J., Mathematical Statistics Springer-Verlag, New York, 2003. [23] Sharma, B.D., and Mittal, P., New non-additive measures of relative  t ∂ 0 −1 ∂ 0 information J. Comb. Inform. and Syst. Sci. vol.2, pp122–133, 1975. Σ1 = [−λ (1; θ)] σ (θ) [−λ (1; θ)] [24] Tsallis, C., Possible generalization of Boltzmann-Gibbs statistics, J. Stat. ∂θ ∂θ Physics vol. 52 pp479–487, 1988. √ ˆ 0 [25] Vallee,´ V., Dynamical Sources in Information Theory: Fundamental and n[Hs(θn) − Hs(θ )] → N (0, Σs), where Intervals and Word Prefixes, Algorithmica vol. 29, pp262–306, 2001. t [26] Vere-Jones, D., Ergodic properties of nonnegative matrices. I and II 1  ∂  ∂ Pacific J. Math. Σ = λ(s; θ) σ−1(θ) λ(s; θ). vol. 22 (2) pp361–386, 1967 and vol.26, issue 3, pp601- s (1 − s)2 ∂θ ∂θ 620, 1968. [27] Wachowiak, M.P., Smolikova, R., Tourassi, G.D., Elmaghray A.S., Proof: For a parametric chain depending on θ, let us Estimation of generalized entropies with sample spacing, Pattern Anal. s Applic. vol. 8, pp95–101, 2005. set ps(x, y, θ) = p(x, y, θ) . Due to operators properties (see again [1, p94]), the eigenvalue λ(s, θ) of the perturbated operator defined by Ps = (ps(x, y, θ)) and its derivative 0 λ (s, θ) are known to be continuous with respect to Ps. Point B of Assumption 2 induces that Ps too is a continuously differentiable function of θ. Therefore both λ(s; θ) and λ0(s; θ) are continuous with respect to θ. The results follow from the continuous mapping theorem and the delta method.
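For a finite chain, the empirical transition estimates p̂_n(i, j) = N_n(i, j)/N_n(i) of Section V-A are the MLE, so the mechanics of the plug-in estimators of Theorem 4 can be sketched directly: simulate a trajectory, estimate the transition matrix, and plug its perturbated version into relation (3). The sketch below (Python/NumPy) is only an illustration under stated assumptions: the simulation helpers, the chosen "true" chain and the value s = 2 are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_chain(P, n, x0=0):
    """Simulate n steps of a finite Markov chain with transition matrix P."""
    x = [x0]
    for _ in range(n - 1):
        x.append(rng.choice(len(P), p=P[x[-1]]))
    return np.array(x)

def empirical_transition_matrix(x, k):
    """Empirical (maximum likelihood) estimates p_hat(i,j) = N_n(i,j) / N_n(i)."""
    N = np.zeros((k, k))
    for a, b in zip(x[:-1], x[1:]):
        N[a, b] += 1.0
    rows = N.sum(axis=1, keepdims=True)
    return np.divide(N, rows, out=np.zeros_like(N), where=rows > 0)

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])               # "true" chain, chosen for the example
x = simulate_chain(P, 20000)
P_hat = empirical_transition_matrix(x, k=2)

# Plug-in Renyi entropy rate estimate: dominant eigenvalue of the perturbated
# estimated matrix, then relation (3) with s = 2.
s = 2.0
lam_hat = float(max(abs(np.linalg.eigvals(P_hat ** s))))
print(np.log(lam_hat) / (1.0 - s))
```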

REFERENCES [1] Ahues, A., Largillier A. and Limaye B. V. Spectral Computations for Bounded Operators Chapman & Hall/CRC, 2001. [2] Amblard, P.-O., and Vignat, C., A note on bounded entropies, Physica A, vol. 365 (1) pp50–56, 2006. [3] Billingsley, P., Statistical Inference for Markov Processes The university of Chicago Press, 1961.