
Metric and topological entropy bounds for optimal coding of stochastic dynamical systems

Christoph Kawan and Serdar Yüksel

Abstract—We consider the problem of optimal zero-delay coding and estimation of a stochastic dynamical system over a noisy communication channel with a finite capacity under three estimation criteria concerned with the low-distortion regime. The criteria considered are (i) a strong and (ii) a weak form of almost sure stability of the estimation error as well as (iii) quadratic stability in expectation. For all three objectives, we derive lower bounds on the smallest channel capacity $C_0$ above which the objective can be achieved with an arbitrarily small error. We first obtain bounds through a dynamical systems approach by constructing an infinite-dimensional dynamical system and relating the capacity with the topological and the metric entropy of this dynamical system. We also consider information-theoretic and probability-theoretic approaches to address the different criteria. Finally, we prove that a memoryless noisy channel in general constitutes no obstruction to asymptotic almost sure state estimation with arbitrarily small errors, when there is no noise in the system. The results provide new solution methods for the criteria introduced (e.g., standard information-theoretic bounds cannot be applied for some of the criteria) and establish further connections between dynamical systems, networked control, and information theory, especially in the context of nonlinear stochastic systems.

C. Kawan is with the Faculty of Computer Science and Mathematics, University of Passau, 94032 Passau, Germany (e-mail: [email protected]). S. Yüksel is with the Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada, K7L 3N6 (e-mail: [email protected]). This research was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada. Some results of this paper appeared in part at the 2017 IEEE International Symposium on Information Theory.

I. INTRODUCTION

In this paper, we consider nonlinear stochastic systems given by an equation of the form

$$x_{t+1} = f(x_t, w_t). \qquad (1)$$

Here $x_t$ is the state at time $t$ and $(w_t)_{t \in \mathbb{Z}_+}$ is an i.i.d. sequence of random variables with common distribution $w_t \sim \nu$, modeling the noise. In general, we assume that

$$f : X \times W \to X$$

is a Borel measurable map, where $(X, d)$ is a complete metric space and $W$ a measurable space, so that for any $w \in W$ the map $f(\cdot, w)$ is a homeomorphism of $X$. We further assume that $x_0$ is a random variable on $X$ with an associated probability measure $\pi_0$, stochastically independent of $(w_t)_{t \in \mathbb{Z}_+}$. We use the notations

$$f_w(x) = f(x, w), \qquad f^x(w) = f(x, w),$$

so that $f_w : X \to X$ and $f^x : W \to X$.

The system (1) is connected over a possibly noisy channel with a finite capacity to an estimator, as shown in Fig. 1. The estimator has access to the information it has received through the channel. A source coder maps the source symbols (i.e., state values) to corresponding channel inputs. The channel inputs are transmitted through the channel; we assume that the channel is a discrete channel with input alphabet $\mathcal{M}$ and output alphabet $\mathcal{M}'$.

[Fig. 1. Coding and state estimation over a noisy channel with feedback. Block diagram: SYSTEM $\to$ CODER $\to$ CHANNEL $\to$ ESTIMATOR, with state $x_t$, channel input $q_t$, channel output $q'_t$, and estimate $\hat{x}_t$; the channel output is fed back to the coder.]

We refer by a coding policy $\Pi$ to a sequence of functions $(\gamma^e_t)_{t \in \mathbb{Z}_+}$ which are causal such that the channel input at time $t$, $q_t \in \mathcal{M}$, under $\Pi$ is generated by a function of its local information, i.e.,

$$q_t = \gamma^e_t(\mathcal{I}_t),$$

where $\mathcal{I}_t = \{x_{[0,t]}, q'_{[0,t-1]}\}$ and $q_t \in \mathcal{M}$, the channel input alphabet given by $\mathcal{M} = \{1, 2, \dots, M\}$, for $0 \le t \le T-1$. Here, we use the notation $x_{[0,t-1]} = \{x_s : 0 \le s \le t-1\}$ for $t \ge 1$.

The channel maps $q_t$ to $q'_t$ in a stochastic fashion so that $P(q'_t \mid q_t, q_{[0,t-1]}, q'_{[0,t-1]})$ is a conditional probability measure on $\mathcal{M}'$ for all $t \in \mathbb{Z}_+$. If this expression is equal to $P(q'_t \mid q_t)$, the channel is said to be memoryless, i.e., the past variables do not affect the channel output $q'_t$ given the current channel input $q_t$.

The receiver, upon receiving the information from the channel, generates an estimate $\hat{x}_t$ at time $t$, also causally: An admissible causal estimation policy is a sequence of functions $(\gamma^d_t)_{t \in \mathbb{Z}_+}$ such that $\hat{x}_t = \gamma^d_t(q'_{[0,t]})$ with

$$\gamma^d_t : (\mathcal{M}')^{t+1} \to X, \qquad t \ge 0.$$

For a given $\varepsilon > 0$, we denote by $C_\varepsilon$ the smallest channel capacity above which there exist an encoder and an estimator so that one of the following estimation objectives is achieved:

(E1) Eventual almost sure stability of the estimation error: There exists $T(\varepsilon) \ge 0$ so that

$$\sup_{t \ge T(\varepsilon)} d(x_t, \hat{x}_t) \le \varepsilon \quad \text{a.s.} \qquad (2)$$

(E2) Asymptotic almost sure stability of the estimation error:

$$P\Big( \limsup_{t \to \infty} d(x_t, \hat{x}_t) \le \varepsilon \Big) = 1. \qquad (3)$$

(E3) Asymptotic quadratic stability of the estimation error in expectation:

$$\limsup_{t \to \infty} E[d(x_t, \hat{x}_t)^2] \le \varepsilon. \qquad (4)$$

A. Literature Review and Contributions

In a recent work [16], we investigated the same problem for the special case involving only deterministic systems and discrete noiseless channels. In this paper, we will provide further connections between the ergodic theory of dynamical systems and information theory by answering the problems posed in the previous section and relating the answers to the concepts of either metric or topological entropy. Our findings complement and generalize our results in [16], since here we consider stochasticity in the system dynamics and/or the communication channels.

As we note in [16], optimal coding of stochastic processes is a problem that has been studied extensively: in information theory in the context of per-symbol cost minimization, in dynamical systems in the context of identifying representational and equivalence properties between dynamical systems, and in networked control in the context of identifying information transmission requirements for stochastic stability or cost minimization. As such, for the criteria laid out in (E1)-(E3) above, the results in our paper are related to the efforts in the literature in the following three general areas.

Dynamical systems and ergodic theory. Historically there has been a symbiotic relation between the ergodic theory of dynamical systems and information theory (see, e.g., [38], [6] for comprehensive reviews). Information-theoretic tools have been foundational in the study of dynamical systems; for example, the metric (also known as Kolmogorov-Sinai or measure-theoretic) entropy is crucial in the celebrated Shannon-McMillan-Breiman theorem as well as in two important representation theorems: Ornstein's (isomorphism) theorem and Krieger's generator theorem [9], [31], [32], [14], [6]. The concept of sliding block encoding [8] is a stationary encoding of a dynamical system defined by the shift process, leading to fundamental results on the existence of stationary codes which perform as well as the limit performance of a sequence of optimal block codes. For topological dynamical systems, the theory of entropy structures and symbolic extensions answers the question to which extent a system can be represented by a symbolic system (under preservation of some topological structure); cf. [6] for an overview of this theory. Entropy concepts have extensive operational usage in identifying limits on source and channel coding for a large class of sources [38], [8], [10].

Networked control and stochastic stability under information constraints. In networked control, there has been a recurrent interest in identifying limitations on state estimation and control under information constraints. The results in this area have typically involved linear systems, and in the non-linear case the studies have only been on deterministic systems estimated/controlled over deterministic channels, with few exceptions. For linear systems, data-rate theorem type results have been presented in [44], [39], [28], [23], [26].

The papers [18], [19], [34], [24], [25] studied state estimation for non-linear deterministic systems and noise-free channels. In [18], [19], Liberzon and Mitra characterized the critical data rate $C_0$ for exponential state estimation with a given exponent $\alpha \ge 0$ for a continuous-time system on a compact subset $K$ of its state space. As a measure for $C_0$, they introduced a quantity called estimation entropy $h_{\mathrm{est}}(\alpha, K)$, which equals the topological entropy on $K$ in case $\alpha = 0$, but for $\alpha > 0$ is no longer a purely topological quantity. The paper [15] provided a lower bound on $h_{\mathrm{est}}(\alpha, K)$ in terms of Lyapunov exponents under the assumption that the system preserves a smooth measure. In [24], [25], Matveev and Pogromsky studied three estimation objectives of increasing strength for discrete-time non-linear systems. For the weakest one, the smallest bit rate was shown to be equal to the topological entropy. For the other ones, general upper and lower bounds were obtained which can be computed directly in terms of the linearized right-hand side of the equation generating the system.

A further closely related paper is due to Savkin [36], which uses topological entropy to study state estimation for a class of non-linear systems over noise-free digital channels. In fact, our results can be seen as stochastic analogues of some of the results presented in [36], which show that for sufficiently perturbed deterministic systems state estimation with arbitrarily small error is not possible over finite-capacity channels. See Remark III.11 for further discussions with regard to [36].

A related problem is the control of non-linear systems over communication channels. This problem has been studied in few publications, and mainly for deterministic systems and/or deterministic channels. Recently, [48] studied stochastic stability properties for a more general class of stochastic non-linear systems building on information-theoretic bounds and Markov-chain-theoretic constructions. However, these bounds do not distinguish between the unstable and stable components of the tangent space associated with a dynamical non-linear system, while the entropy bounds established in this paper make such a distinction, but only for estimation problems and in the low-distortion regime.

Zero-delay coding over communication channels. In our setup, we have causality as a restriction in coding and decoding. Zero-delay coding is an increasingly important research area of significant practical relevance, as we review in [16]. Notable papers include the classical works by Witsenhausen [43], Walrand and Varaiya [42] and Teneketzis [41]. The findings of [42] have been generalized to continuous sources in [47] (see also [20] and [3], where the latter imposes a structure a priori); and the structural results on optimal fixed-rate coding in [43] and [42] have been shown to be applicable to setups where one also allows for variable-length source coding in [12]. Structural results on coding over noisy channels have been studied in [42], [41], [22], [45] among others. Related work also includes [45], [22], [1], [11], which have primarily considered the coding of discrete sources. [45], [20], [1], [3], [11] have considered infinite horizon problems and, in particular, [45] has established the optimality of stationary and deterministic policies for finite aperiodic and irreducible Markov sources. A related lossy coding procedure was introduced by Neuhoff and Gilbert [30], called causal source coding, which has a different operational definition, since delays in coding and decoding are allowed so that efficiencies through entropy coding can be utilized. Further discussions on the literature are available in [12], [20], [1], [29]. Among those that are most relevant to our paper is [21], where causal coding under a high rate assumption for stationary sources and individual sequences was studied, though only in a source coding context.

The setup with Gaussian channels is a special case studied extensively for the coding of linear systems. We will not consider such a setup in this paper, though we note that explicit results have been obtained for a variety of criteria in the literature.

Contributions. In view of this literature review, we make the following contributions. We establish that for (E1) the topological entropy of a properly defined infinite-dimensional dynamical system describing the stochastic evolution of the process provides lower bounds; for (E2) a lower bound is provided by the metric entropy; and for (E3) the metric entropy of this system also provides a lower bound under a restriction on the class of encoders considered. Through a novel analysis which also recovers the widely studied linear case, we also provide achievability results for the case where the only stochasticity is in the communication channel and the initial state. We also establish impossibility results when the noise is such that the process is sufficiently mixing. We show that our results reduce to those reported in [16] for deterministic systems. An implication is that the rate bounds may not depend continuously on the presence of stochastic noise, i.e., an arbitrarily small noisy perturbation in the system dynamics may lead to a discontinuous change in the rate requirements for each of the criteria. Throughout the analysis, we provide further connections between information theory and dynamical systems by identifying the operational usage of entropy concepts for the three different estimation criteria.

II. PRELIMINARIES

Notation: All logarithms in this paper are taken to the base 2. By $\mathbb{N}$ we denote the set of positive integers. We write $\mathbb{Z}$ for the set of all integers and $\mathbb{Z}_+ = \mathbb{N} \cup \{0\}$. By $1_A$ we denote the characteristic function of a set $A$. We write $B_\varepsilon(x)$ for the open ball of radius $\varepsilon > 0$ centered at $x \in X$. If $f : X \to Y$ is a measurable map between measurable spaces $(X, \mathcal{F})$ and $(Y, \mathcal{G})$, we write $f_*$ for the push-forward operator associated with $f$ on the space of measures on $(X, \mathcal{F})$, i.e., for any measure $\mu$ on $(X, \mathcal{F})$, $f_*\mu$ is the measure on $(Y, \mathcal{G})$ defined by $(f_*\mu)(G) := \mu(f^{-1}(G))$ for all $G \in \mathcal{G}$.

A. Entropy notions for dynamical systems

An important concept used in this paper is the topological entropy of a dynamical system. If $f : X \to X$ is a continuous map on a metric space $(X, d)$, and $K \subset X$ is a compact set, we say that $E \subset K$ is $(n, \varepsilon; f)$-separated for some $n \in \mathbb{N}$ and $\varepsilon > 0$ if for all $x, y \in E$ with $x \ne y$, $d(f^i(x), f^i(y)) > \varepsilon$ for some $i \in \{0, 1, \dots, n-1\}$. We write $r_{\mathrm{sep}}(n, \varepsilon, K; f)$ for the maximal cardinality of an $(n, \varepsilon; f)$-separated subset of $K$ and define the topological entropy $h_{\mathrm{top}}(f, K)$ of $f$ on $K$ by

$$h_{\mathrm{sep}}(f, \varepsilon, K) := \limsup_{n \to \infty} \frac{1}{n} \log r_{\mathrm{sep}}(n, \varepsilon, K; f), \qquad h_{\mathrm{top}}(f, K) := \lim_{\varepsilon \downarrow 0} h_{\mathrm{sep}}(f, \varepsilon, K).$$

If $X$ is compact and $K = X$, we omit the argument $K$ and call $h_{\mathrm{top}}(f)$ the topological entropy of $f$. Alternatively, one can define $h_{\mathrm{top}}(f, K)$ using $(n, \varepsilon)$-spanning sets. A set $F \subset X$ $(n, \varepsilon)$-spans another set $K \subset X$ if for each $x \in K$ there is $y \in F$ with $d(f^i(x), f^i(y)) \le \varepsilon$ for $i = 0, 1, \dots, n-1$. Letting $r_{\mathrm{span}}(n, \varepsilon, K; f)$ (or $r_{\mathrm{span}}(n, \varepsilon, K)$ if the map $f$ is clear from the context) denote the minimal cardinality of a set which $(n, \varepsilon)$-spans $K$, the topological entropy of $f$ on $K$ satisfies

$$h_{\mathrm{top}}(f, K) = \lim_{\varepsilon \downarrow 0} \limsup_{n \to \infty} \frac{1}{n} \log r_{\mathrm{span}}(n, \varepsilon, K; f).$$

If $f : X \to X$ is a measure-preserving map on a probability space $(X, \mathcal{F}, \mu)$, i.e., $f_*\mu = \mu$, its metric entropy $h_\mu(f)$ is defined as follows. Let $\mathcal{A}$ be a finite measurable partition of $X$. Then the entropy of $f$ with respect to $\mathcal{A}$ is defined by

$$h_\mu(f; \mathcal{A}) := \lim_{n \to \infty} \frac{1}{n} H_\mu\Big( \bigvee_{i=0}^{n-1} f^{-i}\mathcal{A} \Big). \qquad (5)$$

Here $\vee$ denotes the join operation, i.e., $\bigvee_{i=0}^{n-1} f^{-i}\mathcal{A}$ is the partition of $X$ consisting of all intersections of the form $A_0 \cap f^{-1}(A_1) \cap \dots \cap f^{-n+1}(A_{n-1})$ with $A_i \in \mathcal{A}$. For any partition $\mathcal{B}$ of $X$, $H_\mu(\mathcal{B}) = -\sum_{B \in \mathcal{B}} \mu(B) \log \mu(B)$ is the Shannon entropy of $\mathcal{B}$. The existence of the limit in (5) follows from a subadditivity argument. The metric entropy of $f$ is then defined by

$$h_\mu(f) := \sup_{\mathcal{A}} h_\mu(f; \mathcal{A}),$$

the supremum taken over all finite measurable partitions $\mathcal{A}$ of $X$. If $f$ is continuous, $X$ is compact metric and $\mu$ is ergodic, there is an alternative characterization of $h_\mu(f)$ due to Katok [13]: For any $n \in \mathbb{N}$, $\varepsilon > 0$ and $\delta \in (0, 1)$ put

$$r_{\mathrm{span}}(n, \varepsilon, \delta) := \min\{ r_{\mathrm{span}}(n, \varepsilon; A) : A \subset X \text{ Borel}, \ \mu(A) \ge 1 - \delta \}.$$

Then for every $\delta \in (0, 1)$ it holds that

$$h_\mu(f) = \lim_{\varepsilon \downarrow 0} \limsup_{n \to \infty} \frac{1}{n} \log r_{\mathrm{span}}(n, \varepsilon, \delta).$$
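The counting quantities above can be approximated numerically for simple maps. The following sketch is not part of the original development; the doubling map $x \mapsto 2x \bmod 1$ on the circle (topological entropy $\log 2 = 1$ bit per step), the grid size and the greedy separation routine are illustrative choices. It estimates $\frac{1}{n} \log r_{\mathrm{sep}}(n, \varepsilon)$ for finitely many $n$:

```python
import numpy as np

def orbit(x0, n):
    """First n points of the doubling-map orbit x -> 2x (mod 1)."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        xs[i] = x
        x = (2.0 * x) % 1.0
    return xs

def r_sep_greedy(initial_points, n, eps):
    """Greedy lower estimate of r_sep(n, eps): keep an orbit only if it is
    (n, eps)-separated from every orbit kept so far (circle metric)."""
    kept = []
    for x0 in initial_points:
        ob = orbit(x0, n)
        separated = True
        for kb in kept:
            d = np.abs(ob - kb)
            d = np.minimum(d, 1.0 - d)   # distance on S^1 = R/Z
            if d.max() <= eps:           # orbits never more than eps apart
                separated = False
                break
        if separated:
            kept.append(ob)
    return len(kept)

grid = np.linspace(0.0, 1.0, 1000, endpoint=False)
eps = 0.05
for n in (4, 6, 8, 10):
    r = r_sep_greedy(grid, n, eps)
    # (1/n) log2 r_sep decreases toward h_top = 1 bit as n grows
    print(n, r, np.log2(r) / n)
```

Since the greedy pass restricted to a finite grid need not be maximal over all of $X$, the routine only produces a lower estimate of $r_{\mathrm{sep}}$; it is meant as a sanity check on the definitions, not as a rigorous entropy computation.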

Topological and metric entropy are related to each other via the variational principle [27]: For a continuous map $f : X \to X$ on a compact metric space $X$,

$$h_{\mathrm{top}}(f) = \sup_\mu h_\mu(f),$$

the supremum taken over all $f$-invariant Borel probability measures $\mu$, i.e., measures such that $f_*\mu = \mu$.

If two maps $f : X \to X$ and $g : Y \to Y$ on compact metric spaces $X$ and $Y$ satisfy $h \circ f = g \circ h$ with a homeomorphism $h : X \to Y$, they are called topologically conjugate and $h$ is called a topological conjugacy. In this case, the topological entropy of $f$ and $g$ is the same, i.e., $h_{\mathrm{top}}(f) = h_{\mathrm{top}}(g)$. If $h$ is only a continuous surjection from $X$ onto $Y$, then $g$ is called a topological factor of $f$ and $h_{\mathrm{top}}(g) \le h_{\mathrm{top}}(f)$.

III. A DYNAMICAL SYSTEMS APPROACH

In order to use the concepts of topological and metric entropy, defined for deterministic maps, we associate a shift map with the given stochastic system (1). More precisely, we consider the space $X^{\mathbb{Z}_+}$ of all sequences in $X$, equipped with the product topology. We write $\bar{x} = (x_0, x_1, x_2, \dots)$ for the elements of $X^{\mathbb{Z}_+}$ and we fix the product metric

$$D(\bar{x}, \bar{y}) := \sum_{t=0}^{\infty} 2^{-t} \frac{d(x_t, y_t)}{1 + d(x_t, y_t)}, \qquad (6)$$

where $d(\cdot,\cdot)$ is the given metric on $X$. A natural dynamical system on $X^{\mathbb{Z}_+}$ is the shift map $\theta : X^{\mathbb{Z}_+} \to X^{\mathbb{Z}_+}$, $(\theta\bar{x})_t \equiv x_{t+1}$, which is continuous with respect to the product topology. An analogous shift map is defined on $W^{\mathbb{Z}_+}$ and denoted by $\vartheta$.

Observing that the sequence of random variables $(x_t)_{t \in \mathbb{Z}_+}$ forms a Markov chain when $x_0$ is fixed, the following lemma shows how a stationary measure of this Markov chain defines an invariant measure for $\theta$.

III.1 Lemma: Let $\pi$ be a stationary measure of the Markov chain $(x_t)_{t \in \mathbb{Z}_+}$. Then an invariant Borel probability measure $\mu$ for $\theta$ is defined on cylinder sets by

$$\mu(B_0 \times B_1 \times \dots \times B_n \times X^{[n+1,\infty)}) := \int_{B_0 \times B_1 \times \dots \times B_n} \pi(dx_0) P(dx_1 \mid x_0) \cdots P(dx_n \mid x_{n-1}),$$

where $B_0, B_1, \dots, B_n$ are arbitrary Borel sets in $X$. Here

$$P(x_{n+1} \in B \mid x_n = x) = P(f(x_n, w) \in B \mid x_n = x) = \nu((f^x)^{-1}(B)).$$

The support of $\mu$ is contained in the closure of the set of all trajectories, i.e., $\mathrm{supp}\,\mu \subset \mathrm{cl}\,\mathcal{T}$ with

$$\mathcal{T} := \big\{ \bar{x} \in X^{\mathbb{Z}_+} : \exists\, w_t \in W \text{ with } x_{t+1} \equiv f(x_t, w_t), \ t \in \mathbb{Z}_+ \big\}.$$

Proof: We consider the map

$$G : X \times W^{\mathbb{Z}_+} \to X^{\mathbb{Z}_+},$$

which maps a pair $(x_0, \bar{w})$ with $\bar{w} = (w_t)_{t \in \mathbb{Z}_+}$ to the trajectory $(x_t)_{t \in \mathbb{Z}_+}$ obtained by $x_{t+1} := f(x_t, w_t)$. We claim that this map is measurable and its associated push-forward operator on measures maps $\pi \times \nu^{\mathbb{Z}_+}$ to $\mu$. To prove that $G$ is measurable, consider a cylinder set $A = B_0 \times \dots \times B_n \times X^{[n+1,\infty)}$ in $X^{\mathbb{Z}_+}$. Then

$$G^{-1}(A) = \{ (x_0, \bar{w}) : x_0 \in B_0, \ G(x_0, \bar{w})_1 \in B_1, \dots, G(x_0, \bar{w})_n \in B_n \}.$$

Hence, $G^{-1}(A)$ can be expressed as the preimage of $B_0 \times \dots \times B_n \subset X^{n+1}$ under the map

$$(x_0, \bar{w}) \mapsto (x_0, f_{w_0}(x_0), f_{w_1} \circ f_{w_0}(x_0), \dots, f_{w_{n-1}} \circ \dots \circ f_{w_0}(x_0)).$$

To show that this map is measurable, it suffices to show that each component is a measurable map. This follows from the fact that the projection $W^{\mathbb{Z}_+} \to W^{n+1}$ to the first $n+1$ components is measurable and $f$ is measurable. Hence, we have proved that $G$ is measurable. To see that $G_*(\pi \times \nu^{\mathbb{Z}_+}) = \mu$, observe that for a set of the form $A = B_0 \times B_1 \times X^{[2,\infty)}$ we have

$$\pi \times \nu^{\mathbb{Z}_+}\big( \{ (x_0, \bar{w}) : x_0 \in B_0, \ f_{w_0}(x_0) \in B_1 \} \big) = \int_X \int_{W^{\mathbb{Z}_+}} \nu^{\mathbb{Z}_+}(d\bar{w}) \pi(dx_0) 1_{B_0}(x_0) 1_{B_1}(f_{w_0}(x_0))$$
$$= \int_{B_0} \pi(dx_0) \int_W \nu(dw) 1_{B_1}(f_w(x_0)) = \int_{B_0} \pi(dx_0)\, \nu(\{ w \in W : f_w(x_0) \in B_1 \}) = \mu(B_0 \times B_1 \times X^{[2,\infty)}). \qquad (7)$$

For more general cylinder sets, the claim follows inductively. The fact that $\mathrm{supp}\,\mu$ is contained in $\mathrm{cl}\,\mathcal{T}$ follows from

$$\mu(\mathrm{cl}\,\mathcal{T}) = G_*[\pi \times \nu^{\mathbb{Z}_+}](\mathrm{cl}\,\mathcal{T}) = \pi \times \nu^{\mathbb{Z}_+}(G^{-1}(\mathrm{cl}\,\mathcal{T})) \ge \pi \times \nu^{\mathbb{Z}_+}(G^{-1}(\mathcal{T})) = \pi \times \nu^{\mathbb{Z}_+}(G^{-1}(G(X \times W^{\mathbb{Z}_+}))) \ge \pi \times \nu^{\mathbb{Z}_+}(X \times W^{\mathbb{Z}_+}) = 1.$$

Finally, we show that $\mu$ is $\theta$-invariant. To this end, note that the map $\Phi : X \times W^{\mathbb{Z}_+} \to X \times W^{\mathbb{Z}_+}$, $(x, \bar{w}) \mapsto (f(x, w_0), \vartheta\bar{w})$, satisfies $\theta \circ G = G \circ \Phi$. Using that

$$\pi \times \nu^{\mathbb{Z}_+}(\Phi^{-1}(A \times B)) = \pi \times \nu^{\mathbb{Z}_+}\big( \{ (x_0, \bar{w}) : f_{w_0}(x_0) \in A, \ \vartheta\bar{w} \in B \} \big)$$
$$= \pi \times \nu^{\mathbb{Z}_+}\Big( \bigcup_{x_0 \in X} \{x_0\} \times (f^{x_0})^{-1}(A) \times B \Big) = \int_X \pi(dx_0)\, \nu(\{ w : f(x_0, w) \in A \})\, \nu^{\mathbb{Z}_+}(B)$$
$$= \nu^{\mathbb{Z}_+}(B) \int_X \pi(dx) P(x, A) = \pi(A)\, \nu^{\mathbb{Z}_+}(B) = \pi \times \nu^{\mathbb{Z}_+}(A \times B),$$

i.e., $\Phi_*(\pi \times \nu^{\mathbb{Z}_+}) = \pi \times \nu^{\mathbb{Z}_+}$, we find that

$$\theta_*\mu = \theta_* G_*(\pi \times \nu^{\mathbb{Z}_+}) = G_* \Phi_*(\pi \times \nu^{\mathbb{Z}_+}) = G_*(\pi \times \nu^{\mathbb{Z}_+}) = \mu,$$

completing the proof.
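For intuition, the objects of Lemma III.1, namely the map $G$ sending $(x_0, \bar{w})$ to a trajectory, the shift $\theta$, and the product metric $D$ from (6), can be simulated directly. The sketch below is illustrative only: the additive-noise circle system and the uniform noise law are hypothetical choices (they match Example III.6 later in this section), and the series in (6) is truncated after finitely many terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(x0, wbar, f):
    """Truncated version of the map G from Lemma III.1:
    (x0, w0, w1, ...) -> (x0, x1, x2, ...) with x_{t+1} = f(x_t, w_t)."""
    xs = np.empty(len(wbar) + 1)
    xs[0] = x0
    for t, w in enumerate(wbar):
        xs[t + 1] = f(xs[t], w)
    return xs

def D(xbar, ybar):
    """Product metric (6), truncated; d = circle distance on R/Z."""
    d = np.abs(xbar - ybar)
    d = np.minimum(d, 1.0 - d)
    weights = 2.0 ** (-np.arange(len(d)))
    return np.sum(weights * d / (1.0 + d))

f = lambda x, w: (x + w) % 1.0          # x_{t+1} = x_t + w_t (mod 1)
wbar = rng.uniform(0.0, 0.05, size=60)   # one fixed noise realization
xbar = G(0.30, wbar, f)
ybar = G(0.31, wbar, f)                  # same noise, nearby initial state
print("D(xbar, ybar)       =", D(xbar, ybar))
print("D(theta x, theta y) =", D(xbar[1:], ybar[1:]))  # shift drops x_0
```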

We will also need the following characterization of topological entropy.

III.2 Lemma: Let $f : X \to X$ be a homeomorphism on a compact metric space $(X, d)$. Fix $\varepsilon > 0$ and $n_0 \in \mathbb{N}$. For $n > n_0$ we say that a set $E \subset X$ is $(n, \varepsilon; n_0)$-separated if $d(f^i(x), f^i(y)) > \varepsilon$ for some $i \in \{n_0, n_0+1, \dots, n-1\}$, whenever $x, y \in E$ with $x \ne y$. We write $r_{\mathrm{sep}}(n, \varepsilon; n_0, f)$ for the maximal cardinality of an $(n, \varepsilon; n_0)$-separated set. Then, for any choice of $n_0(\varepsilon) \in \mathbb{N}$, $\varepsilon > 0$, we have

$$h_{\mathrm{top}}(f) = \lim_{\varepsilon \downarrow 0} \limsup_{n \to \infty} \frac{1}{n} \log r_{\mathrm{sep}}(n, \varepsilon; n_0(\varepsilon)). \qquad (8)$$

III.3 Theorem: Consider the estimation objective (E1) for an initial measure $\pi_0$ which is stationary under the Markov chain $(x_t)_{t \in \mathbb{Z}_+}$. Then, if $\mathrm{supp}\,\mu$ is compact,

$$C_0 \ge h_{\mathrm{top}}(\theta|_{\mathrm{supp}\,\mu}) \ge h_\mu(\theta).$$

Proof: Suppose that for some $\varepsilon > 0$ the objective (2) is achieved by a coder-estimator pair via a noiseless channel of capacity $C = \log|\mathcal{M}|$. Then for every $k \in \mathbb{N}$ we define the set

$$\mathcal{E}_k := \{ (\hat{x}_0, \hat{x}_1, \dots, \hat{x}_{k-1}) : q_t \in \mathcal{M}, \ 0 \le t \le k-1 \}$$

of all possible estimation sequences of length $k$ the estimator can generate in the time interval $[0, k-1]$.

Assume to the contrary that there exists a measurable set $A \subset X^{\mathbb{Z}_+}$ of positive measure $\alpha := \mu(A) > 0$ so that for every $\bar{x} = (x_t)_{t \in \mathbb{Z}_+} \in A$ there is $t \ge T(\varepsilon)$ with $d(x_t, \hat{x}_t) > \varepsilon$ in case the sequence $(x_t)$ is realized as a trajectory of the system. If $G : X \times W^{\mathbb{Z}_+} \to X^{\mathbb{Z}_+}$ is the map from the proof of Lemma III.1, then the preimage $G^{-1}(A)$ is measurable in $X \times W^{\mathbb{Z}_+}$ with $\pi_0 \times \nu^{\mathbb{Z}_+}$-measure $\alpha > 0$. This contradicts the assumption that the almost sure estimation objective (2) is achieved. Hence, the set

$$\tilde{\mathcal{T}} := \big\{ \bar{x} \in X^{\mathbb{Z}_+} : d(x_t, \hat{x}_t) \le \varepsilon \text{ for all } t \ge T(\varepsilon) \big\}$$

has measure one and consequently is dense in $\mathrm{supp}\,\mu$. Choose $\tau = \tau(\varepsilon)$ large enough so that

$$\sum_{t=\tau}^{\infty} \frac{1}{2^t} \le \varepsilon.$$

Let $E \subset \mathrm{supp}\,\mu$ be a finite $(k, 5\varepsilon; T(\varepsilon))$-separated set for some $k > T(\varepsilon)$. Since $\tilde{\mathcal{T}}$ is dense in $\mathrm{supp}\,\mu$, a small perturbation of $E$ yields a $(k, 5\varepsilon; T(\varepsilon))$-separated set in $\tilde{\mathcal{T}}$ with the same cardinality as $E$ (using that $\theta$ is continuous). Hence, we may assume $E \subset \tilde{\mathcal{T}}$. We define a map $\alpha : E \to \mathcal{E}_{k+\tau}$ by assigning to $(x_t)_{t \in \mathbb{Z}_+} \in E$ the estimation sequence generated by the coder-estimator pair along this trajectory. Since any two distinct elements of $E$ are $5\varepsilon$-separated at some time $i \ge T(\varepsilon)$, while each of them is traced to within $\varepsilon$ by its own estimation sequence from time $T(\varepsilon)$ on, the map $\alpha$ is injective, and therefore

$$r_{\mathrm{sep}}(k, 5\varepsilon; T(\varepsilon)) \le |\mathcal{E}_{k+\tau}| \le |\mathcal{M}|^{k+\tau}.$$

This implies

$$\limsup_{k \to \infty} \frac{1}{k} \log r_{\mathrm{sep}}(k, 5\varepsilon; T(\varepsilon)) \le \log|\mathcal{M}| = C.$$

Using Lemma III.2, the result follows by letting $C \to C_\varepsilon$ and $\varepsilon \downarrow 0$. The statement about the metric entropy now follows from the variational principle.

III.4 Remark: To make the statement of the theorem clearer, let us consider the two extreme cases when there is no noise and when there is only noise:

(i) If the system is deterministic, i.e., $x_{t+1} = f(x_t)$ for a homeomorphism $f : X \to X$ of a compact metric space $X$, then $\pi$ is an invariant measure of $f$. Moreover, $P(x_t \in B \mid x_{t-1} = x) = 1$ if $f(x) \in B$ and $0$ otherwise, implying

$$\mu(B_0 \times B_1 \times \dots \times B_n \times X^{[n+1,\infty)}) = \pi_0\big( B_0 \cap f^{-1}(B_1) \cap f^{-2}(B_2) \cap \dots \cap f^{-n}(B_n) \big).$$

From this expression, we see that the support of $\mu$ is contained in the set $\mathcal{T}$ of all trajectories of $f$ (which in this case coincides with its closure), as already proved in Lemma III.1. The map $h : \mathcal{T} \to X$ defined by $h(\bar{x}) := x_0$ is easily seen to be a homeomorphism, which conjugates $\theta|_{\mathcal{T}}$ and $f$. That is, the following diagram commutes:

$$\begin{array}{ccc} \mathcal{T} & \xrightarrow{\ \theta\ } & \mathcal{T} \\ h \downarrow & & \downarrow h \\ X & \xrightarrow{\ f\ } & X \end{array}$$

Since $h_*\mu = \pi_0$ and conjugate systems have the same entropy, our theorem implies

$$C_0 \ge h_{\mathrm{top}}(f; \mathrm{supp}\,\pi_0). \qquad (9)$$

The right-hand side of this inequality is finite under mild assumptions, e.g., if $f$ is Lipschitz continuous on $\mathrm{supp}\,\pi_0$ and $\mathrm{supp}\,\pi_0$ has finite lower box dimension (see [2, Thm. 6.1.2]). These conditions are in particular satisfied when $f$ is a diffeomorphism on a finite-dimensional manifold. However, one should be aware that even on a compact interval there exist continuous maps with infinite topological entropy on the support of an invariant measure. The lower bound (9) has already been derived in [16, Thm. III.1], and in fact for the deterministic case considered here the bound was shown to be tight.

(ii) Assume that $X = W$ is compact and the system is given by $x_{t+1} = w_t$, i.e., the trajectories are only determined by the noise. In this case, with $\pi_0 := \nu$, the measure $\mu$ is the product measure $\nu^{\mathbb{Z}_+}$. Hence, $C_0$ is bounded below by the topological entropy of the shift on $W^{\mathbb{Z}_+}$ restricted to $\mathrm{supp}\,\nu^{\mathbb{Z}_+} = (\mathrm{supp}\,\nu)^{\mathbb{Z}_+}$. This number is finite if and only if $\mathrm{supp}\,\nu$ is finite, and in this case it is given by $\log|\mathrm{supp}\,\nu|$.

If the system is not deterministic, then usually $C_0 = \infty$. In fact, this is always the case if the estimator is able to recover the noise to a sufficiently large extent. The following corollary treats the case when the noise can be recovered completely from the state trajectory.

III.5 Corollary: Additionally to the assumptions in Theorem III.3, suppose that $W$ and $X$ are compact and $f^x : W \to X$ is invertible for every $x \in X$ so that $(x, y) \mapsto (f^x)^{-1}(y)$ is continuous. Then, for (E1),

$$C_0 \ge h_{\mathrm{top}}(\Phi|_{\mathrm{supp}(\pi_0 \times \nu^{\mathbb{Z}_+})}) \ge h_{\mathrm{top}}(\vartheta|_{\mathrm{supp}\,\nu^{\mathbb{Z}_+}}), \qquad (10)$$

where $\Phi : X \times W^{\mathbb{Z}_+} \to X \times W^{\mathbb{Z}_+}$ is the skew-product map $(x, \bar{w}) \mapsto (f_{w_0}(x), \vartheta\bar{w})$. As a consequence, $C_0 = \infty$ whenever $\mathrm{supp}\,\nu$ contains infinitely many elements.

Proof: We consider the map $h : X^{\mathbb{Z}_+} \to W^{\mathbb{Z}_+}$, $\bar{x} \mapsto \bar{w} = (w_t)_{t \in \mathbb{Z}_+}$ with

$$w_t = (f^{x_t})^{-1}(x_{t+1}).$$

If we equip $W^{\mathbb{Z}_+}$ with the product topology, $h$ becomes continuous. Indeed, if the distance of two points $\bar{x}^1, \bar{x}^2 \in X^{\mathbb{Z}_+}$ is small, then the distances $d(x^1_t, x^2_t)$ are small for finitely many values of $t$. Hence, by the continuity of $(x, y) \mapsto (f^x)^{-1}(y)$ on $X \times X$, also the distances $d_W(h(\bar{x}^1)_t, h(\bar{x}^2)_t)$ can be made small for sufficiently many values of $t$, guaranteeing that $D(h(\bar{x}^1), h(\bar{x}^2))$ becomes small, where $D$ is a product metric on $W^{\mathbb{Z}_+}$.

The map $G : X \times W^{\mathbb{Z}_+} \to X^{\mathbb{Z}_+}$, used in the proof of Lemma III.1, satisfies

$$h(G(x_0, \bar{w})) = \bar{w} \quad \text{for all } (x_0, \bar{w}) \in X \times W^{\mathbb{Z}_+},$$

because we can write

$$G(x_0, \bar{w}) = (x_0, f^{x_0}(w_0), f^{x_1}(w_1), f^{x_2}(w_2), \dots).$$

Consequently, $G$ - as a map from $X \times W^{\mathbb{Z}_+}$ to the space $\mathcal{T}$ of trajectories - is invertible with

$$G^{-1}(\bar{x}) = (x_0, h(\bar{x})).$$

From the assumptions it follows that $G$ is continuous, hence $G$ is a homeomorphism and $\mathcal{T}$ is compact. By the proof of Lemma III.1, we have $\theta \circ G = G \circ \Phi$, where $\Phi$ is the skew-product map $\Phi(x, \bar{w}) = (f(x, w_0), \vartheta\bar{w})$ and $G_*(\pi_0 \times \nu^{\mathbb{Z}_+}) = \mu$. Hence, $G$ is a topological conjugacy between $\theta|_{\mathrm{supp}\,\mu}$ and $\Phi|_{\mathrm{supp}(\pi_0 \times \nu^{\mathbb{Z}_+})}$, implying

$$C_0 \ge h_{\mathrm{top}}(\theta|_{\mathrm{supp}\,\mu}) = h_{\mathrm{top}}(\Phi|_{\mathrm{supp}(\pi_0 \times \nu^{\mathbb{Z}_+})}).$$

Since the projection map $\pi : (x, \bar{w}) \mapsto \bar{w}$ exhibits $\vartheta$ as a topological factor of $\Phi$ and $\pi_*(\pi_0 \times \nu^{\mathbb{Z}_+}) = \nu^{\mathbb{Z}_+}$, the second inequality in (10) follows.

III.6 Example: Let $X = W = S^1 = \mathbb{R}/\mathbb{Z}$. Let $f(x, w) = x + w \bmod 1$ and let $\pi_0 = \nu$ be the normalized Lebesgue measure on $S^1$. In this case, the map $f^x : S^1 \to S^1$, $w \mapsto x + w$, is obviously invertible and $(x, y) \mapsto (f^x)^{-1}(y) = y - x$ is continuous. Hence, $C_0 = \infty$ for the estimation objective (E1).
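Example III.6 can be checked directly in simulation: for the additive-noise rotation, the map $h$ from the proof of Corollary III.5 recovers the entire noise sequence from the state trajectory, which is the mechanism behind $C_0 = \infty$. A minimal sketch (the uniform noise law is an illustrative choice; any $\nu$ with infinite support yields the same conclusion):

```python
import numpy as np

rng = np.random.default_rng(1)

# System of Example III.6: x_{t+1} = x_t + w_t (mod 1) on S^1 = R/Z.
T = 10
w = rng.uniform(size=T)            # nonatomic noise: supp(nu) is infinite
x = np.empty(T + 1)
x[0] = rng.uniform()
for t in range(T):
    x[t + 1] = (x[t] + w[t]) % 1.0

# Map h of Corollary III.5: w_t = (f^{x_t})^{-1}(x_{t+1}) = x_{t+1} - x_t (mod 1).
w_recovered = (x[1:] - x[:-1]) % 1.0
print(np.allclose(w_recovered, w))  # True: the trajectory determines the noise
```

Since an estimator that tracks the state within a small error thus effectively tracks the i.i.d. noise sequence, and the shift on $(\mathrm{supp}\,\nu)^{\mathbb{Z}_+}$ has infinite topological entropy when $\mathrm{supp}\,\nu$ is infinite, no finite capacity suffices.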

III.7 Theorem: Consider the estimation objective (E2) for an initial measure $\pi_0$ which is stationary and ergodic under the Markov chain $(x_t)_{t \in \mathbb{Z}_+}$. Then, if $\mathrm{supp}\,\mu$ is compact,

$$C_0 \ge h_\mu(\theta).$$

Proof: First observe that the ergodicity of $\pi_0$ implies the ergodicity of $\mu$. Indeed, it is well-known that the product measure $\pi_0 \times \nu^{\mathbb{Z}_+}$ is ergodic for the skew-product $\Phi$ if $\pi_0$ is ergodic (cf. [17]). Since $G_*(\pi_0 \times \nu^{\mathbb{Z}_+}) = \mu$ and $\theta \circ G = G \circ \Phi$, this implies the ergodicity of $\mu$. Now consider a noiseless channel with input alphabet $\mathcal{M}$ and a pair of coder and decoder/estimator which solves the estimation problem (E2) for some $\varepsilon > 0$. For every $\bar{x} \in X^{\mathbb{Z}_+}$ and $\delta > \varepsilon$ let

$$T(\bar{x}, \delta) := \inf\Big\{ k \in \mathbb{N} : \sup_{t \ge k} d(x_t, \hat{x}_t) \le \delta \Big\},$$

where the infimum is defined as $+\infty$ if the corresponding set is empty. Note that $T(\bar{x}, \delta)$ depends measurably on $\bar{x}$. Define

$$B^K(\delta) := \{ \bar{x} \in \mathrm{supp}\,\mu : T(\bar{x}, \delta) \le K \}, \qquad \delta > \varepsilon, \ K \in \mathbb{N},$$

and observe that these sets are measurable. From (3) it follows that for every $\delta > \varepsilon$,

$$\lim_{K \to \infty} \mu(B^K(\delta)) = \mu\Big( \bigcup_{K \in \mathbb{N}} B^K(\delta) \Big) = 1.$$

Fixing a $K$ large enough so that $\mu(B^K(\delta)) > 0$, Katok's characterization of metric entropy yields the assertion, which is proved with the same arguments as in the proof of Theorem III.3, using the simple fact that a maximal $(n, \varepsilon)$-separated set contained in some set $K$ also $(n, \varepsilon)$-spans $K$.

In the following, we consider (E3). To obtain a lower bound, we restrict the encoder to have finite memory and be periodic.

III.8 Theorem: Consider the estimation objective (E3) for an initial measure $\pi_0$ which is stationary and ergodic under the Markov chain $(x_t)_{t \in \mathbb{Z}_+}$. Additionally, assume that there exists $\tau > 0$ so that the coder map $\delta_t$ is of the form

$$q_t = \delta_t(x_{[t-\tau+1, t]}) \qquad (11)$$

and is periodic so that $\delta_{t+\tau} \equiv \delta_t$. Further assume that the estimator map is of the form

$$\hat{x}_t = \gamma_t(q_{[t-\tau+1, t]})$$

and also $\gamma_{t+\tau} \equiv \gamma_t$. Then, if $\mathrm{supp}\,\mu$ is compact, the smallest channel capacity above which (E3) can be achieved for every $\varepsilon > 0$ satisfies

$$C_0 \ge h_\mu(\theta).$$

Proof: Step (i). First note that we would obtain a lower bound on $C_0$ if we allowed the periodic encoders to be of the form

$$q_t = \delta_t(x_{[t-\tau+1, \infty)}), \qquad (12)$$

that is, if we allow the encoder to have non-causal access to the realizations of $x_t$. Note that every encoder policy of the form (11) would be of the form (12). We keep the structure of the decoder as is.

Step (ii). The criterion (E3) considered in this paper implies (E3) considered in [16] with the distortion metric $d$ being the product metric $D$ introduced in (6) for the dynamical system $\theta$ and $p = 2$: This follows since, if $\limsup_{t \to \infty} E[d(x_t, \hat{x}_t)^2] \le \varepsilon$, we have with $\bar{x}_t = (x_t, x_{t+1}, \dots)$ and $\hat{\bar{x}}_t = (\hat{x}_t, \hat{x}_{t+1}, \dots)$ that

$$\limsup_{t \to \infty} E[D(\bar{x}_t, \hat{\bar{x}}_t)^2] = \limsup_{t \to \infty} E\Bigg[ \Bigg( \sum_{i=0}^{\infty} 2^{-i} \frac{d(x_{t+i}, \hat{x}_{t+i})}{1 + d(x_{t+i}, \hat{x}_{t+i})} \Bigg)^2 \Bigg] \le 4 \limsup_{t \to \infty} E\Bigg[ \Bigg( \sum_{i=0}^{\infty} 2^{-(i+1)} \frac{d(x_{t+i}, \hat{x}_{t+i})}{1 + d(x_{t+i}, \hat{x}_{t+i})} \Bigg)^2 \Bigg]$$
$$\le 4 \limsup_{t \to \infty} \sum_{i=0}^{\infty} 2^{-(i+1)} E\Bigg[ \Bigg( \frac{d(x_{t+i}, \hat{x}_{t+i})}{1 + d(x_{t+i}, \hat{x}_{t+i})} \Bigg)^2 \Bigg] \le 4 \sum_{i=0}^{\infty} 2^{-(i+1)} \limsup_{t \to \infty} E\big[ d(x_{t+i}, \hat{x}_{t+i})^2 \big] \le 4 \sum_{i=0}^{\infty} 2^{-(i+1)} \varepsilon = 4\varepsilon =: \bar{\varepsilon}. \qquad (13)$$

In particular, $\bar{\varepsilon} \to 0$ as $\varepsilon \to 0$. Thus, if (E3) holds for every $\varepsilon > 0$, (E3) considered in [16] also holds for every $\varepsilon > 0$. Here, we apply Jensen's inequality in (13).

Step (iii). Thus, if the encoder is of the form (12), the problem can be viewed as an instance of [16, Thm. V.2] for the dynamical system $\theta$. Under the stated periodicity assumption and (12), [16, Thm. V.2] directly implies that $C_0 \ge h_\mu(\theta)$.

Three remarks are in order.

III.9 Remark: It is worth noting here that for a deterministic dynamical system, the property of being ergodic is typically very restrictive; however, for a stochastic system ergodicity is often very simple to satisfy: The presence of noise often leads to strong mixing conditions, which directly lead to ergodicity.

III.10 Remark: The discussion in Theorem III.8 leads to an interesting relation between the classical information-theoretic problem of optimally encoding (non-causally) sequences of random variables and the metric entropy of an infinite-dimensional dynamical system defined via the shift operator. A close look at the proof of [16, Thm. V.1] reveals that under a stationarity and ergodicity assumption, when the channel is noisefree, the lower bound presented in Theorem III.8 is essentially achievable, provided that the encoder has non-causal access to the source realizations; in particular, a large enough look-ahead is sufficient for an approximately optimal performance. Note though that the decoder is still restricted to be zero-delay.

III.11 Remark: Besides our results in [16], a related study to the approach of this section is due to Savkin [36]. This paper is concerned with nonlinear systems of the form

$$x(t+1) = F(x(t), \omega(t)),$$

where $\omega(t)$ is interpreted as an uncertainty input or a disturbance. However, no statistical structure is imposed on $\omega$, so that the system can be regarded as deterministic (thus, the formulation is distribution-free). A characterization of the smallest channel capacity $C_0$ above which the state $x(t)$ can be estimated with arbitrary precision and for every initial state $x(0)$ in a specified compact set via a noiseless channel is given in [36, Thm. 3.1]. A close inspection of this result shows that it characterizes $C_0$ precisely as the topological entropy of the associated shift operator acting on system trajectories. Moreover, [36, Thm. 3.2] shows that under mild assumptions on the system the entropy of this operator is infinite, and hence observation via a finite-capacity channel is not possible.
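To illustrate the class of coders to which Theorem III.8 applies, the following sketch implements a trivial member of the family (11): a memoryless (hence $\tau$-periodic, with $\tau = 1$) uniform quantizer over a noiseless channel, together with its zero-delay midpoint estimator. The noisy doubling map, the alphabet size $M$ and the noise level are hypothetical choices for illustration; the sketch shows the coder structure, not the entropy bound.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 16                                   # channel alphabet size, C = log2 M = 4 bits

def delta(x):
    """A memoryless (hence tau-periodic) coder of type (11): uniform quantizer."""
    return min(int(x * M), M - 1)

def gamma(q):
    """Zero-delay estimator: midpoint of the received quantization cell."""
    return (q + 0.5) / M

# Noisy doubling map on [0, 1): x_{t+1} = 2 x_t + w_t (mod 1).
T, errs = 5000, []
x = rng.uniform()
for t in range(T):
    q = delta(x)                         # channel input (noiseless channel here)
    xhat = gamma(q)                      # estimate from the received symbol
    errs.append((x - xhat) ** 2)
    x = (2.0 * x + rng.normal(0.0, 0.01)) % 1.0

print("empirical E[d^2] =", np.mean(errs), " vs cell bound", (0.5 / M) ** 2)
```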

IV. AN INFORMATION-THEORETIC AND PROBABILITY-THEORETIC APPROACH

In this section, we consider a much broader class of channels, so-called Class A type channels (see [46, Def. 8.5.1]). We restrict ourselves to systems with state space $X = \mathbb{R}^N$ and provide lower bounds on the channel capacity for the objectives (E2) and (E3), using information-theoretic methods.

IV.1 Definition: A channel is said to be of Class A type, if

- it satisfies the following Markov chain condition:

$$q'_t \leftrightarrow q_t, q_{[0,t-1]}, q'_{[0,t-1]} \leftrightarrow \{ x_0, w_s, s \ge 0 \},$$

i.e., almost surely, for all Borel sets $B$,

$$P(q'_t \in B \mid q_t, q_{[0,t-1]}, q'_{[0,t-1]}, x_0, w_s, s \ge 0) = P(q'_t \in B \mid q_t, q_{[0,t-1]}, q'_{[0,t-1]})$$

for all $t \ge 0$, and

- its capacity with feedback is given by

$$C = \lim_{T \to \infty} \max_{\{P(q_t \mid q_{[0,t-1]}, q'_{[0,t-1]}), \ 0 \le t \le T-1\}} \frac{1}{T} I(q_{[0,T-1]} \to q'_{[0,T-1]}),$$

where the directed mutual information is defined by

$$I(q_{[0,T-1]} \to q'_{[0,T-1]}) := \sum_{t=1}^{T-1} I(q_{[0,t]}; q'_t \mid q'_{[0,t-1]}) + I(q_0; q'_0).$$

Discrete noiseless channels and memoryless channels belong to this class; for such channels, feedback does not increase the capacity [4]. Class A type channels also include finite state stationary Markov channels which are indecomposable [33], and non-Markov channels which satisfy certain symmetry properties [37]. Further examples can be found in [5], [40].

IV.2 Theorem: Consider system (1) with state space $X = \mathbb{R}^N$. Suppose that

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} h(x_t \mid x_{t-1}) > -\infty$$

and $h(x_t) < \infty$ for all $t \in \mathbb{Z}_+$. Then, under (E3) (and thus under (E1)),

$$C_\varepsilon \ge \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} h(x_t \mid x_{t-1}) - \frac{N}{2} \log(2\pi e \varepsilon).$$

In particular, $C_0 = \infty$.

Proof: Let $(\varepsilon_t)_{t \in \mathbb{Z}_+}$ be a sequence of non-negative real numbers so that $E[\|x_t - \hat{x}_t\|^2] \le \varepsilon_t$ for all $t \in \mathbb{Z}_+$ and $\limsup_{t \to \infty} \varepsilon_t \le \varepsilon$. Observe that for every $t > 0$ we have

$$I(q_{[0,t]}; q'_t \mid q'_{[0,t-1]}) = H(q'_t \mid q'_{[0,t-1]}) - H(q'_t \mid q_{[0,t]}, q'_{[0,t-1]}) = H(q'_t \mid q'_{[0,t-1]}) - H(q'_t \mid q_{[0,t]}, x_t, q'_{[0,t-1]}) \qquad (14)$$
$$\ge H(q'_t \mid q'_{[0,t-1]}) - H(q'_t \mid x_t, q'_{[0,t-1]}) = I(x_t; q'_t \mid q'_{[0,t-1]}).$$

Here, (14) follows from the assumption that the channel is of Class A type. Define

$$R_T := \max_{\{P(q_t \mid q_{[0,t-1]}, q'_{[0,t-1]}), \ 0 \le t \le T-1\}} \frac{1}{T} \sum_{t=0}^{T-1} I(q_{[0,t]}; q'_t \mid q'_{[0,t-1]}).$$

Now consider the following identities and inequalities:

$$\lim_{T \to \infty} R_T \ge \limsup_{T \to \infty} \frac{1}{T} \Bigg( \sum_{t=1}^{T-1} I(x_t; q'_t \mid q'_{[0,t-1]}) + I(x_0; q'_0) \Bigg)$$
$$= \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} \Big( h(x_t \mid q'_{[0,t-1]}) - h(x_t \mid q'_{[0,t]}) \Big)$$
$$\ge \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} \Big( h(x_t \mid x_{[0,t-1]}, q'_{[0,t-1]}) - h(x_t \mid q'_{[0,t]}) \Big)$$
$$= \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} \Big( h(x_t \mid x_{[0,t-1]}, q'_{[0,t-1]}) - h(x_t - \hat{x}_t \mid q'_{[0,t]}) \Big)$$
$$\ge \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} \Big( h(x_t \mid x_{[0,t-1]}, q'_{[0,t-1]}) - h(x_t - \hat{x}_t) \Big)$$
$$\ge \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} \Big( h(x_t \mid x_{[0,t-1]}, q'_{[0,t-1]}) - \frac{N}{2} \log(2\pi e \varepsilon_t) \Big)$$
$$= \limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} h(x_t \mid x_{t-1}) - \frac{N}{2} \log(2\pi e \varepsilon). \qquad (15)$$

Here, the second inequality uses the property that entropy decreases under conditioning on more information. The second equality follows from the fact that $\hat{x}_t$ is a function of $q'_{[0,t]}$, and the last inequality follows from the fact that among all real random variables $X$ that satisfy a given second moment constraint $E[X^2] \le \varepsilon$, a Gaussian maximizes the entropy, and the differential entropy in this case is given by $\frac{1}{2}\log(2\pi e \varepsilon)$. Using the fact that for an $n$-dimensional vector $X = [X_1, \dots, X_n]^T$, $h(X) = h(X_1) + \sum_{i=2}^{n} h(X_i \mid X_{[1,i-1]}) \le \sum_{i=1}^{n} h(X_i)$, it follows with $E[\|x_t - \hat{x}_t\|^2] \le \varepsilon_t$ that $h(x_t - \hat{x}_t) \le \frac{N}{2}\log(2\pi e \varepsilon_t)$. The final equality then follows from the fact that, conditioned on $x_{t-1}$, $x_t$ and $q'_{[0,t-1]}$ are independent. For the final result in (15), taking the limit as $\varepsilon \to 0$, $\log(\varepsilon) \to -\infty$, and $C_0 = \infty$ follows.
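As a worked instance of the bound in Theorem IV.2 (under an additive Gaussian noise assumption, which goes beyond what the theorem itself requires): for $x_{t+1} = f(x_t) + w_t$ with $w_t \sim \mathcal{N}(0, \sigma^2 I_N)$, conditioned on $x_{t-1}$ the state $x_t$ is Gaussian, so $h(x_t \mid x_{t-1}) = \frac{N}{2}\log(2\pi e \sigma^2)$ and the bound becomes $C_\varepsilon \ge \frac{N}{2}\log(\sigma^2/\varepsilon)$, which diverges as $\varepsilon \downarrow 0$:

```python
import numpy as np

def capacity_lower_bound(sigma2, eps, N=1):
    """Bound of Theorem IV.2 specialized to x_{t+1} = f(x_t) + w_t with
    w_t ~ N(0, sigma2 I_N):
      C_eps >= (N/2) log2(2 pi e sigma2) - (N/2) log2(2 pi e eps)
             = (N/2) log2(sigma2 / eps)   [bits per time step]."""
    return 0.5 * N * np.log2(sigma2 / eps)

sigma2 = 1e-2
for eps in (1e-2, 1e-4, 1e-6):
    print(f"eps = {eps:.0e}:  C_eps >= {capacity_lower_bound(sigma2, eps):6.2f} bits")
# the bound grows without limit as eps -> 0, consistent with C_0 = infinity
```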

IV.3 Theorem: Suppose that $X \subset \mathbb{R}^N$ and the system given by $x_{t+1} = f(x_t, w_t)$ is so that for all Borel sets $B \subset \mathbb{R}^N$,

$$P(x_{t+1} \in B \mid x_t = x) \le K \lambda(B),$$

where $\lambda$ denotes the Lebesgue measure, $K \in \mathbb{R}_+$, and $w_t$ is an i.i.d. noise process. Then, for the objective (E2), $C_0 = \infty$.

A special case for the above is a system of the form

$$x_{t+1} = f(x_t) + w_t,$$

where $w_t \sim \nu$ with the noise measure $\nu$ admitting a bounded density function.

Proof: Given a finite alphabet channel with $0 < |\mathcal{M}| < \infty$, for a given time $t > 0$, under any encoding and decoding policy there exists a finite partition of the state space $X$ for encoding $x_t$ leading to $\hat{x}_t$. Thus there exists $\bar{\varepsilon} > 0$ so that for all $\varepsilon \in (0, \bar{\varepsilon})$, for each set

$$A_t(q'_0, \dots, q'_t) := \big\{ x \in \mathbb{R}^N : d\big( x, \hat{x}_t(q'_0, \dots, q'_{t-1}, q'_t) \big) \ge \varepsilon \big\},$$

where $q'_0, \dots, q'_t \in \mathcal{M}'$, we find that

$$P\big( x_t \in A_t(q'_0, \dots, q'_t) \mid x_{[0,t-1]}, q'_{[0,t-1]} \big) = \sum_{q' \in \mathcal{M}'} P\big( q'_t = q' \mid x_{[0,t-1]}, q'_{[0,t-1]} \big) \, P\big( x_t \in A_t(q'_0, \dots, q'_t) \mid x_{[0,t-1]}, q'_{[0,t-1]}, q'_t = q' \big)$$
$$\ge 1 - \sum_{q' \in \mathcal{M}'} P\big( q'_t = q' \mid x_{[0,t-1]}, q'_{[0,t-1]} \big) \, P\big( d(x_t, \hat{x}_t(q'_0, \dots, q'_{t-1}, q')) < \varepsilon \mid x_{[0,t-1]}, q'_{[0,t-1]}, q'_t = q' \big)$$
$$\ge 1 - |\mathcal{M}'| K \lambda(B_\varepsilon(0)) > 0.$$

This implies that

$$P\big( x_t \in A_t(q'_0, \dots, q'_t) \mid x_{[0,t-1]}, q'_{[0,t-1]} \big) > 0 \qquad (16)$$

uniformly over all realizations of $x_{[0,t-1]}, q'_{[0,t-1]}$. Let

$$\eta := \sum_{t=1}^{\infty} 1_{\{x_t \in A_t(q'_0, \dots, q'_t)\}}.$$

Our goal is to show that $\eta = \infty$ almost surely, leading to the desired conclusion. Let

$$\tau(1) = \min\{ t > 0 : x_t \in A_t(q'_0, \dots, q'_t) \}$$

and for $z > 1$, $z \in \mathbb{N}$,

$$\tau(z) = \min\{ t > \tau(z-1) : x_t \in A_t(q'_0, \dots, q'_t) \}.$$

It follows that $P(\tau(1) < \infty) = 1$ by a repeated use of (16), since the event $\{\tau(1) = \infty\}$ would imply that the event (whose probability is lower bounded by (16)) would be avoided infinitely many times, leading to a zero measure. Thus, $P(\eta \ge 1) = 1$. By a repeated use of (16) and induction, if $P(\eta \ge k-1) = 1$, we have that

$$P(\eta \ge k) = P(\eta \ge k, \eta \ge k-1) = P(\tau(1) < \infty \mid \mathcal{F}_{\tau(k-1)}) \, P(\eta \ge k-1) = 1,$$

where $\mathcal{F}_{\tau(k-1)}$ is the $\sigma$-field generated by $\{x_s, q'_s\}$ up to time $\tau(k-1)$. Thus, for every $k \in \mathbb{N}$, $P(\eta \ge k) = 1$, and it follows by continuity in probability that $P(\eta = \infty) = \lim_{k \to \infty} P(\eta \ge k) = 1$. Hence, for any finite communication rate, almost sure boundedness is not possible for arbitrarily small $\varepsilon > 0$.

V. ACHIEVABILITY BOUNDS

A. Coding of deterministic dynamical systems over noisy communication channels

In this section, we show that for a noisefree system a discrete memoryless noisy communication channel is no obstruction to achieving the objectives (E2) and (E3) with finite capacity. More precisely, we prove the following theorem.

V.1 Theorem: Consider a nonlinear deterministic system $x_{t+1} = f(x_t)$ given by a continuous map $f : X \to X$ on a compact metric space $X$, estimated via a discrete memoryless channel (DMC). Then, for the asymptotic estimation objectives (E2) and (E3), we have

$$C_0 \le h_{\mathrm{top}}(f).$$

Proof: It suffices to prove the result for (E2), since for a compact metric space, (E2) implies (E3); therefore the construction below also applies for the objective (E3). Without loss of generality, we may assume that $h_{\mathrm{top}}(f) < \infty$, since otherwise the statement trivially holds. Then it suffices to show that for any $\varepsilon > 0$ the estimation objective can be achieved whenever the channel capacity satisfies $C > h_{\mathrm{top}}(f)$. Since the capacity of a DMC can take any positive value, it follows that $C_\varepsilon \le h_{\mathrm{top}}(f)$ for every $\varepsilon > 0$ and thus $C_0 \le h_{\mathrm{top}}(f)$.

Now, consider a channel with capacity $C > h_{\mathrm{top}}(f)$ and fix $\varepsilon > 0$. Recall that the input alphabet is denoted by $\mathcal{M}$ and the output alphabet by $\mathcal{M}'$. By the random coding construction of Shannon [7], we can achieve a rate $R$ satisfying

$$h_{\mathrm{top}}(f) < R < C \qquad (17)$$

with a sequence of increasing sets $\{1, \dots, M_n\}$ of input messages so that for all $n$,

$$2^{nR} \le M_n \quad \text{and} \quad \lim_{n \to \infty} \frac{1}{n} \log M_n = C. \qquad (18)$$

Furthermore, there exists a sequence of encoders $E^n : \{1, \dots, M_n\} \to \mathcal{M}^n$, yielding codewords $x^n(1), \dots, x^n(M_n)$, and a sequence of decoders $D^n : (\mathcal{M}')^n \to \{1, \dots, M_n\}$ so that

$$P\big( D^n(q'_{[0,n-1]}) \ne c \mid q_{[0,n-1]} = x^n(c) \big) \le e^{-nE(R) + o(n)},$$

uniformly for all $c \in \{1, \dots, M_n\}$. Here $\frac{o(n)}{n} \to 0$ as $n \to \infty$ and $E(R) > 0$. In particular, we observe that with $c_n \in \{1, \dots, M_n\}$ being the message transmitted and $D^n(q'_{[0,n-1]})$ the decoder output,

$$P\big( D^n(q'_{[0,n-1]}) \ne c_n \big) = \sum_{c \in \{1, \dots, M_n\}} P\big( D^n(q'_{[0,n-1]}) \ne c \mid q_{[0,n-1]} = x^n(c) \big) \, P\big( q_{[0,n-1]} = x^n(c) \big) \le e^{-nE(R) + o(n)}.$$

This also implies that the bound holds even when the messages to be transmitted are not uniformly distributed. Thus, for the sequence of encoders and decoders constructed above we have

$$\sum_n P\big( D^n(q'_{[0,n-1]}) \ne c_n \big) \le \sum_n e^{-nE(R) + o(n)} < \infty.$$

The Borel-Cantelli Lemma then implies

$$P\big\{ D^n(q'_{[0,n-1]}) \ne c_n \text{ infinitely often} \big\} = 0. \qquad (19)$$

Now we choose $\delta \in (0, \varepsilon)$ so that, by uniform continuity,

$$d(x, y) < \delta \ \Rightarrow \ d(f(x), f(y)) < \varepsilon \quad \text{for all } x, y \in X. \qquad (20)$$

Furthermore, we choose $N$ sufficiently large so that

$$r_{\mathrm{span}}(n, \delta) \le M_n \quad \text{for all } n \ge N. \qquad (21)$$

This is possible because, by (17) and (18),

$$\limsup_{n \to \infty} \frac{1}{n} \log r_{\mathrm{span}}(n, \delta) \le h_{\mathrm{top}}(f) < R = \frac{1}{n} \log 2^{nR} \le \frac{1}{n} \log M_n \quad \text{for every } n \in \mathbb{N}.$$

Let $S_j$ be a $(j, \delta)$-spanning set of cardinality $r_{\mathrm{span}}(j, \delta)$ and fix injective functions

$$\iota_j : S_j \to \{1, \dots, M_j\}.$$

In fact, by possibly enlarging the set $S_j$, we can assume that $\iota_j$ is bijective. For any $a \in X$ let $x^*_j(a)$ denote a fixed element of $S_j$ satisfying $d(f^t(x^*_j(a)), f^t(a)) \le \delta$ for $0 \le t \le j-1$. Define sampling times by

$$\tau_0 := 0 \quad \text{and} \quad \tau_{j+1} := \tau_j + j + 1 \ \text{for } j \ge 0.$$

In the following, we specify the coding scheme. In this coding scheme, the encoder, from $\tau_j$ to $\tau_{j+1} - 1$, encodes the information regarding the orbit of the state from $\tau_{j+1}$ to $\tau_{j+2} - 1$. For all $j \ge N$, at time $\tau_j$, use the input $\iota_{j+1}(x^*_{j+1}(f^{j+1}(x_{\tau_j})))$ for the encoder, where $x_{\tau_j}$ is the state at time $\tau_j$. Then $x^{j+1}(\iota_{j+1}(x^*_{j+1}(f^{j+1}(x_{\tau_j}))))$ is sent during the next $j+1$ units of time. This is possible by (21). For $j < N$, it is not important what we transmit.

Let the estimator apply $x^*_{j+1} \circ \iota_{j+1}^{-1}$ to the output of the decoder, obtaining an element $y_{j+1} \in S_{j+1}$, and use $y_{j+1}, f(y_{j+1}), \dots, f^j(y_{j+1}), f^{j+1}(y_{j+1})$ as the estimates during the forthcoming time interval of length $\tau_{j+2} - \tau_{j+1} = \tau_{j+1} - \tau_j + 1 = j + 2$. Then $\delta < \varepsilon$, (20) and the fact that $S_{j+1}$ is $(j+1, \delta)$-spanning imply that the desired estimation accuracy is achieved, provided that there was no error in the transmission.

Now (19) implies that after a finite random time, there are no more errors in the transmission. By the analysis above, the errors will be uniformly bounded by $\varepsilon$. Hence, the objective (3) is achieved.

We note that the proof above crucially depends on the fact that the system is deterministic and is impractical to implement. The theorem is essentially a possibility result. Note that the proof does not even make use of the fact that the encoder has access to the realizations of the channel output; hence feedback is not utilized.

For linear systems, a constructive proof is given in [26, Thm. 6.4.1]. We state the following for completeness. Consider the noiseless linear system

$$x_{t+1} = A x_t \qquad (22)$$

with $x_t \in \mathbb{R}^N$. The following result, essentially given in [26, Thm. 6.4.1], provides a positive answer to the question whether the estimation objective (3) can be achieved when no noise is present in the system.

V.2 Theorem: Consider system (22) estimated over a memoryless erasure channel with finite capacity. Then, for (E2), we have

$$C_0 \le \sum_{|\lambda_i| > 1} \lceil \log|\lambda_i| \rceil. \qquad (23)$$

For completeness, we also note that [35, Cor. 5.3 and Thm. 4.3] show that for a discrete memoryless channel it suffices that $C > \sum_{|\lambda_i| \ge 1} \log|\lambda_i|$ for the existence of encoder and controller policies leading to almost sure stability.

V.3 Remark: An implication on the achievability for non-causal codes over discrete memoryless channels. In Section III we utilized the fact that one can view a stochastic dynamical system as a deterministic one under the shift operator. Building on a similar argument as that in the proof of Theorem III.8, it can be shown that (E2) considered in this paper implies and is implied by (E2) considered in [16] with the distortion metric $d$ being the product metric $D$ introduced in (6) for the dynamical system $\theta$. Therefore, provided that the encoder has access to future realizations of the state sequence, the proof of Theorem V.1 implies an achievability result: If the encoder has non-causal access to the source realizations, for (E2) one obtains $C_0 \le h_{\mathrm{top}}(\theta|_{\mathrm{supp}\,\mu})$, and this can be achieved through the construction in the proof of Theorem V.1 via an encoder which has non-causal access to the future state realizations. Note though that the decoder is still restricted to be zero-delay. We note that in the traditional Shannon theory, block codes are allowed to be non-causal.

VI. EXAMPLES

VI.1 Example: Consider the diffeomorphism $f_A : \mathbb{T}^2 \to \mathbb{T}^2$ on the 2-torus $\mathbb{T}^2 = \mathbb{R}^2/\mathbb{Z}^2$, induced by the linear map

$$A = \begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}, \qquad (24)$$

i.e., $f_A(x + \mathbb{Z}^2) = Ax + \mathbb{Z}^2$. Note that the inverse of $f_A$ is given by $f_{A^{-1}}$, which is well-defined, since $\det A = 1$. The map $f_A$ is known as Arnold's Cat Map, and is one of the simplest examples of an Anosov diffeomorphism. Since $\det Df_A(x) \equiv \det A \equiv 1$, the map $f_A$ is area-preserving. The eigenvalues of the matrix $A$ are given by

$$\gamma_1 = \frac{3}{2} + \frac{1}{2}\sqrt{5} \quad \text{and} \quad \gamma_2 = \frac{3}{2} - \frac{1}{2}\sqrt{5}$$

and satisfy $|\gamma_1| > 1 > |\gamma_2|$. It is well-known that both the topological entropy and the metric entropy of $f_A$ with respect to Lebesgue measure are given by $\log|\gamma_1| > 0$. Hence, Theorem V.1 yields

$$C_0 \le \log\Big( \frac{3}{2} + \frac{1}{2}\sqrt{5} \Big) \approx 1.3885$$

for (E2) to be achieved over a DMC.
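The figures quoted in this example are easy to verify numerically; the following small check is not part of the original text:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 1.0]])        # the matrix (24)
eigs = np.linalg.eigvals(A)
gamma1 = max(abs(eigs))                        # (3 + sqrt(5))/2 ~ 2.618
print("det A            =", round(np.linalg.det(A), 12))  # 1: area-preserving
print("gamma1           =", gamma1)
print("h_top = log2(g1) =", np.log2(gamma1))   # ~ 1.3885 bits per step
```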

Now, suppose we have additive noise for the cat map, so that $f_A(x + \mathbb{Z}^2) = Ax + w + \mathbb{Z}^2$, with $w \sim \nu$ which admits a density supported on $\mathbb{T}^2$. In this case, the map $f^x : \mathbb{T}^2 \to \mathbb{T}^2$, $w \mapsto Ax + w$, is invertible and $(x, y) \mapsto (f^x)^{-1}(y) = y - Ax$ is continuous. By Corollary III.5, $C_0 = \infty$ for the estimation objective (E1), under a stationary initial measure. For the objective (E2) it can be shown that, under corresponding initial measure conditions, Theorem III.7 leads to $C_0 = \infty$.

VII. DISCUSSION AND CONCLUDING REMARKS

In this paper, we considered three estimation objectives for stochastic non-linear systems $x_{t+1} = f(x_t, w_t)$ with i.i.d. noise $(w_t)$, assuming that the estimator receives state information via a noisy channel of finite capacity.

(1) For noiseless channels, assuming that the initial measure $\pi_0$ is stationary, we proved that $C_0$ is bounded below by either the topological or the metric entropy of a shift dynamical system on the space of trajectories (Theorems III.3, III.7 and III.8).

(2) For systems on Euclidean space and noisy channels, we provided information-theoretic and probability-theoretic conditions enforcing $C_0 = \infty$. In particular, Theorem IV.2 shows that $C_0 = \infty$ for the quadratic stability objective, whenever

$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T-1} h(x_t \mid x_{t-1}) > -\infty. \qquad (25)$$

We have a corresponding negative result under (E2) in Theorem IV.3 for noisy systems which are sufficiently irreducible. Since $h(x_t \mid x_{t-1})$ is a measure for the uncertainty of $x_t$ given $x_{t-1}$, the condition (25) means that the noise in the long run (on average) influences the state process in a substantial way. Similarly to the results in Section III, this means that the noise makes the space of relevant trajectories too large (or too complicated) to estimate the state with arbitrarily small error over a finite-capacity channel.

(3) Compared with our earlier work [16], our results reveal that the rate requirements are not robust with respect to the presence of noise: That is, even an arbitrarily small noise may lead to drastic effects in the rate requirements. However, the metric or topological entropy bounds are always present and our lower bounds reduce to those established in [16]. We also note that the metric entropy definition for random dynamical systems [17] in the ergodic theory literature is not the answer to the operational questions we proposed in this paper, unlike the one for the deterministic case, which precisely answered the operational question (E2).

(4) In Section V-A we assumed that the system is deterministic with a compact state space, but the channel is noisy. We proved that in this case $C_0$ is bounded from above by the topological entropy of the system for the asymptotic almost sure objective (thus leading to an achievability result). Our result strictly generalizes the previously known results in the literature, which have considered only linear systems to our knowledge.

REFERENCES

[1] H. Asnani, T. Weissman. Real-time coding with limited lookahead. IEEE Trans. Inform. Theory 59 (2013), no. 6, 3582-3606.
[2] V. A. Boichenko, G. A. Leonov, V. Reitmann. Dimension Theory for Ordinary Differential Equations. Teubner, Stuttgart, 2005.
[3] V. S. Borkar, S. K. Mitter, S. Tatikonda. Optimal sequential vector quantization of Markov sources. SIAM J. Control Optim. 40 (2001), no. 1, 135-148.
[4] T. M. Cover, J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.
[5] R. Dabora, A. Goldsmith. On the capacity of indecomposable finite-state channels with feedback. Proceedings of the Allerton Conf. Commun. Control Comput., 1045-1052, Sept. 2008.
[6] T. Downarowicz. Entropy in Dynamical Systems. Cambridge University Press, 2011.
[7] R. G. Gallager. A simple derivation of the coding theorem and some applications. IEEE Trans. Inform. Theory 11 (1965), 3-18.
[8] R. M. Gray. Entropy and Information Theory. Springer, New York, 2011.
[9] R. M. Gray. Probability, Random Processes, and Ergodic Properties. 2nd edition, Springer, Dordrecht, 2009.
[10] R. M. Gray, D. L. Neuhoff. Quantization. Information theory: 1948-1998. IEEE Trans. Inform. Theory 44 (1998), no. 6, 2325-2383.
[11] T. Javidi, A. Goldsmith. Dynamic joint source-channel coding with feedback. In Information Theory Proceedings (ISIT), 2013 IEEE International Symposium on, pp. 16-20. IEEE, 2013.
[12] Y. Kaspi, N. Merhav. Structure theorems for real-time variable-rate coding with and without side information. IEEE Trans. Inform. Theory 58 (2012), no. 12, 7135-7153.
[13] A. Katok. Lyapunov exponents, entropy and periodic orbits for diffeomorphisms. Inst. Hautes Études Sci. Publ. Math. 51 (1980), 137-173.
[14] A. Katok. Fifty years of entropy in dynamics: 1958-2007. J. Mod. Dyn. 1 (2007), no. 4, 545-596.
[15] C. Kawan. Exponential state estimation, entropy and Lyapunov exponents. Systems Control Lett. 113 (2018), 78-85.
[16] C. Kawan, S. Yüksel. On optimal coding of non-linear dynamical systems. IEEE Trans. Inform. Theory, to appear (also arXiv:1711.06600).
[17] F. Ledrappier, L.-S. Young. Entropy formula for random transformations. Probability Theory and Related Fields 80 (1988), 217-240.
[18] D. Liberzon, S. Mitra. Entropy and minimal data rates for state estimation and model detection. In Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control, pp. 247-256. ACM, 2016.
[19] D. Liberzon, S. Mitra. Entropy and minimal bit rates for state estimation and model detection. IEEE Trans. Automatic Control, 2018.
[20] T. Linder, S. Yüksel. On optimal zero-delay quantization of vector Markov sources. IEEE Trans. Inform. Theory 60 (2014), no. 10, 5975-5991.
[21] T. Linder, R. Zamir. Causal coding of stationary sources and individual sequences with high resolution. IEEE Trans. Inform. Theory 52 (2006), no. 2, 662-680.
[22] A. Mahajan, D. Teneketzis. Optimal design of sequential real-time communication systems. IEEE Trans. Inform. Theory 55 (2009), no. 11, 5317-5338.
[23] A. S. Matveev. State estimation via limited capacity noisy communication channels. Math. Control Signals Systems 20 (2008), no. 1, 1-35.
[24] A. Matveev, A. Pogromsky. Observation of nonlinear systems via finite capacity channels: constructive data rate limits. Automatica J. IFAC 70 (2016), 217-229.
[25] A. Matveev, A. Pogromsky. Observation of nonlinear systems via finite capacity channels. Part II: Restoration entropy and its estimates. Submitted, 2017.
[26] A. S. Matveev, A. V. Savkin. Estimation and Control over Communication Networks. Control Engineering. Birkhäuser Boston, Inc., Boston, MA, 2009.
[27] M. Misiurewicz. Topological entropy and metric entropy. Ergodic theory (Sem., Les Plans-sur-Bex, 1980) (French), pp. 61-66, Monograph. Enseign. Math., 29, Univ. Genève, Geneva, 1981.
[28] G. N. Nair, R. J. Evans. Stabilizability of stochastic linear systems with finite feedback data rates. SIAM J. Control Optim. 43 (2004), 413-436.

[29] A. Nayyar, D. Teneketzis. On the structure of real-time encoders and decoders in a multi-terminal communication system. IEEE Trans. Inform. Theory 57 (2011), no. 9, 6196-6214.
[30] D. L. Neuhoff, R. K. Gilbert. Causal source codes. IEEE Trans. Inform. Theory 28 (1982), 701-713.
[31] D. Ornstein. Bernoulli shifts with the same entropy are isomorphic. Advances in Math. 4 (1970), 337-352.
[32] D. Ornstein. An application of ergodic theory to probability theory. Ann. Probability 1 (1973), no. 1, 43-65.
[33] H. H. Permuter, T. Weissman, A. J. Goldsmith. Finite state channels with time-invariant deterministic feedback. IEEE Trans. Inform. Theory 55 (2009), 644-662.
[34] A. Pogromsky, A. Matveev. A topological entropy approach for observation via channels with limited data rate. IFAC Proceedings Volumes 44 (2011), no. 1, 14416-14421.
[35] A. Sahai, S. Mitter. The necessity and sufficiency of anytime capacity for stabilization of a linear system over a noisy communication link, part I: Scalar systems. IEEE Trans. Inform. Theory 52 (2006), no. 8, 3369-3395.
[36] A. V. Savkin. Analysis and synthesis of networked control systems: Topological entropy, observability, robustness and optimal control. Automatica J. IFAC 42 (2006), no. 1, 51-62.
[37] N. Şen, F. Alajaji, S. Yüksel. Feedback capacity of a class of symmetric finite-state Markov channels. IEEE Trans. Inform. Theory 56 (2011), 4110-4122.
[38] P. C. Shields. The interactions between ergodic theory and information theory. Information theory: 1948-1998. IEEE Trans. Inform. Theory 44 (1998), no. 6, 2079-2093.
[39] S. Tatikonda. Control under Communication Constraints. PhD dissertation, Massachusetts Institute of Technology, Cambridge, MA, 2000.
[40] S. Tatikonda, S. Mitter. The capacity of channels with feedback. IEEE Trans. Inform. Theory 55 (2009), 323-349.
[41] D. Teneketzis. On the structure of optimal real-time encoders and decoders in noisy communication. IEEE Trans. Inform. Theory 52 (2006), no. 9, 4017-4035.
[42] J. C. Walrand, P. Varaiya. Optimal causal coding-decoding problems. IEEE Trans. Inform. Theory 29 (1983), no. 6, 814-820.
[43] H. S. Witsenhausen. On the structure of real-time source coders. Bell Syst. Tech. J. 58 (1979), 1437-1451.
[44] W. S. Wong, R. W. Brockett. Systems with finite communication bandwidth constraints, part II: Stabilization with limited information feedback. IEEE Trans. Automatic Control 42 (1997), 1294-1299.
[45] R. G. Wood, T. Linder, S. Yüksel. Optimal zero delay coding of Markov sources: stationary and finite memory codes. IEEE Trans. Inform. Theory 63 (2017), no. 9, 5968-5980.
[46] S. Yüksel, T. Başar. Stochastic Networked Control Systems: Stabilization and Optimization under Information Constraints. Birkhäuser, New York, NY, 2013.
[47] S. Yüksel. On optimal causal coding of partially observed Markov sources in single and multi-terminal settings. IEEE Trans. Inform. Theory 59 (2013), no. 1, 424-437.
[48] S. Yüksel. Stationary and ergodic properties of stochastic nonlinear systems controlled over communication channels. SIAM J. Control Optim. 54 (2016), no. 5, 2844-2871.