Directed Information, Causal Estimation, and Communication in Continuous Time

Young-Han Kim, University of California, San Diego, La Jolla, CA, USA, [email protected]
Haim H. Permuter, Ben Gurion University of the Negev, Beer-Sheva, Israel, [email protected]
Tsachy Weissman, Stanford University / Technion, Stanford, CA, USA / Haifa, Israel, [email protected]

Abstract—The notion of directed information is introduced for stochastic processes in continuous time. Properties and operational interpretations are presented for this notion of directed information, which generalizes mutual information between stochastic processes in a similar manner as Massey's original notion of directed information generalizes Shannon's mutual information in the discrete-time setting. As a key application, Duncan's theorem is generalized to estimation problems in which the evolution of the target signal is affected by the past channel noise, and the causal minimum mean squared error estimation is related to directed information from the target signal to the observation corrupted by additive white Gaussian noise. An analogous relationship holds for the Poisson channel. The notion of directed information as a characterization of the fundamental limit on reliable communication for a wide class of continuous-time channels with feedback is discussed.

I. INTRODUCTION

Directed information $I(X^n \to Y^n)$ between two random $n$-sequences $X^n = (X_1, \ldots, X_n)$ and $Y^n = (Y_1, \ldots, Y_n)$ is a natural generalization of Shannon's mutual information to random objects with causal structures. Introduced by Massey [1], this notion of directed information has been shown to arise as the canonical answer to a variety of problems with causally dependent components. For example, it plays a pivotal role in characterizing the capacity $C_{\mathrm{FB}}$ of communication channels with feedback. Massey [1] showed that the feedback capacity is upper bounded by

$C_{\mathrm{FB}} \le \lim_{n \to \infty} \max_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n),$

where the definition of directed information $I(X^n \to Y^n)$ is given in Section II and $p(x^n \| y^{n-1}) = \prod_{i=1}^n p(x_i \mid x^{i-1}, y^{i-1})$ is the causal conditioning notation streamlined by Kramer [2], [3]. This upper bound is tight for a certain class of ergodic channels [4]–[6], paving the road to a computable characterization of feedback capacity; see [7], [8] for examples.

Directed information and its variants also characterize (via multi-letter expressions) the capacity of two-way channels and multiple access channels with feedback [2], [9], and the rate distortion function with feedforward [10], [11]. In another context, directed information also captures the difference in growth rates of wealth in horse race gambling due to causal side information [12]. This provides a natural interpretation of $I(X^n \to Y^n)$ as the amount of information about $Y^n$ causally provided by $X^n$ on the fly. A similar conclusion can be drawn for other engineering and science problems, in which directed information measures the value of causal side information [13].
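Since the continuous-time definition in Section II is built from Massey's discrete-time quantity, it may help to see the latter computed exactly. The following Python sketch (not from the paper; the toy binary channel with unit-delay feedback, the crossover probability, and the helper names are illustrative assumptions) evaluates $I(X^2 \to Y^2) = \sum_{i=1}^2 I(X^i; Y_i \mid Y^{i-1})$ from a joint pmf and contrasts it with $I(Y^2 \to X^2)$ and the mutual information, making the asymmetry concrete.

```python
import itertools
import math
from collections import defaultdict

def marginal(pmf, keep):
    """Marginalize a pmf {outcome_tuple: prob} onto the coordinates listed in `keep`."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def cond_mi(pmf, a, b, c):
    """I(A; B | C) in nats, where a, b, c are lists of coordinate indices."""
    p_abc, p_ac, p_bc, p_c = (marginal(pmf, k) for k in (a + b + c, a + c, b + c, c))
    total = 0.0
    for o, p in pmf.items():
        if p == 0.0:
            continue
        num = p_abc[tuple(o[i] for i in a + b + c)] * p_c[tuple(o[i] for i in c)]
        den = p_ac[tuple(o[i] for i in a + c)] * p_bc[tuple(o[i] for i in b + c)]
        total += p * math.log(num / den)
    return total

def joint_pmf(eps=0.1):
    """Joint pmf of (X1, X2, Y1, Y2): a BSC(eps) used twice, with X2 = Y1 fed back."""
    pmf = defaultdict(float)
    for x1, z1, z2 in itertools.product((0, 1), repeat=3):
        y1 = x1 ^ z1
        x2 = y1                       # unit-delay feedback: next input is the last output
        y2 = x2 ^ z2
        pmf[(x1, x2, y1, y2)] += 0.5 * (eps if z1 else 1 - eps) * (eps if z2 else 1 - eps)
    return pmf

pmf = joint_pmf()
X1, X2, Y1, Y2 = 0, 1, 2, 3
di_xy = cond_mi(pmf, [X1], [Y1], []) + cond_mi(pmf, [X1, X2], [Y2], [Y1])   # I(X^2 -> Y^2)
di_yx = cond_mi(pmf, [Y1], [X1], []) + cond_mi(pmf, [Y1, Y2], [X2], [X1])   # I(Y^2 -> X^2)
mi = cond_mi(pmf, [X1, X2], [Y1, Y2], [])                                   # I(X^2 ; Y^2)
print(f"I(X^2 -> Y^2) = {di_xy:.4f} nats")
print(f"I(Y^2 -> X^2) = {di_yx:.4f} nats (asymmetric)")
print(f"I(X^2 ; Y^2)  = {mi:.4f} nats (upper bounds both)")
```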
In this paper, we extend the notion of directed information to continuous-time random processes. The contribution of this paper is twofold. First, the definition we give for directed information in continuous time is valuable in itself. Just as in the discrete-time setting, directed information in continuous time generalizes mutual information between two stochastic processes: when the two processes have no causal dependence between them, the two notions coincide. Directed information in continuous time is also a generalization of its discrete-time counterpart.

Second, we demonstrate the utility of this notion of directed information by generalizing classical results on the relationship between mutual information and causal estimation in continuous time. In particular, we generalize Duncan's theorem, which relates the minimum mean squared error (MMSE) of a target signal based on an observation through an additive white Gaussian channel to directed information between the target signal and the observation. We similarly generalize the Poisson analogue of Duncan's theorem.

The rest of the paper is organized as follows. Section II is devoted to the definitions of directed information and directed information density in continuous time, followed by key properties of continuous-time directed information in Section III. Section IV presents a generalization of Duncan's theorem, and of its Poisson counterpart, for target signals that depend on the past noise. In Section V we present a feedback communication setting in which our notion of directed information in continuous time emerges naturally as the characterization of the capacity. We conclude with a few remarks in Section VI. This extended abstract is taken essentially verbatim from [14], with the addition of Section V. More details, proofs of the stated results, and additional related results will be given in [15].

II. DEFINITION OF DIRECTED INFORMATION IN CONTINUOUS TIME

Let $(X^n, Y^n)$ be a pair of random $n$-sequences. Directed information (from $X^n$ to $Y^n$) is defined as

$I(X^n \to Y^n) := \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1}).$

Note that unlike mutual information, directed information is asymmetric in its arguments, so in general $I(X^n \to Y^n) \ne I(Y^n \to X^n)$.

For a continuous-time process $\{X_t\}$, let $X_a^b = \{X_s : a \le s \le b\}$ denote the process in the interval $[a, b]$ when $a \le b$ and the empty set otherwise. Let $X_a^{b-} = \{X_s : a \le s < b\}$ denote the process in the interval $[a, b)$ if $a < b$ and the empty set otherwise. Similarly, let $X_{a+}^{b} = \{X_s : a < s \le b\}$ denote the process in the interval $(a, b]$ if $a < b$ and the empty set otherwise. For $0 \le T < \infty$, let

$\mathbf{t} = (t_1, t_2, \ldots, t_n), \qquad 0 \le t_1 \le t_2 \le \cdots \le t_n \le T,$   (1)

and let $X_0^{T,\mathbf{t}}$ denote the induced $(n+1)$-sequence

$X_0^{T,\mathbf{t}} = \bigl(X_0^{t_1-}, X_{t_1}^{t_2-}, \ldots, X_{t_{n-1}}^{t_n-}, X_{t_n}^{T}\bigr),$   (2)

with $Y_0^{T,\mathbf{t}}$ defined analogously. Note that each sequence component is a continuous-time stochastic process. Define now

$I_{\mathbf{t}}(X_0^T \to Y_0^T) := I\bigl(X_0^{T,\mathbf{t}} \to Y_0^{T,\mathbf{t}}\bigr)$   (3)

$= \sum_{i=1}^{n} I\bigl(Y_{t_{i-1}}^{t_i-};\, X_0^{t_i-} \,\big|\, Y_0^{t_{i-1}-}\bigr) + I\bigl(Y_{t_n}^{T};\, X_0^{T} \,\big|\, Y_0^{t_n-}\bigr),$   (4)

where the right side of (3) is the directed information between two sequences of length $n+1$ as defined above, in (4) we take $t_0 = 0$, and the mutual information terms between two continuous-time processes, conditioned on a third, are well defined objects, as developed in [16], [17]. $I_{\mathbf{t}}(X_0^T \to Y_0^T)$ is monotone in $\mathbf{t}$ in the following sense.

Proposition 1. If $\mathbf{t}'$ is a refinement of $\mathbf{t}$, then $I_{\mathbf{t}'}(X_0^T \to Y_0^T) \le I_{\mathbf{t}}(X_0^T \to Y_0^T)$.

The following definition is now natural.

Definition 1. Directed information between $X_0^T$ and $Y_0^T$ is defined as

$I(X_0^T \to Y_0^T) := \inf_{\mathbf{t}} I_{\mathbf{t}}(X_0^T \to Y_0^T),$   (5)

where the infimum is over all $n$ and $\mathbf{t}$ as in (1).

Note, in light of Proposition 1, that

$I(X_0^T \to Y_0^T) = \lim_{\varepsilon \to 0^+} \inf_{\{\mathbf{t} :\, t_i - t_{i-1} \le \varepsilon\}} I_{\mathbf{t}}(X_0^T \to Y_0^T).$   (6)

Directed information can also be given an integral representation via an associated notion of a directed information density, a relationship that holds in more general situations, as elaborated in the next section.
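Each term in (4) is a conditional mutual information between pieces of the two processes. For a crude numerical surrogate, one can replace each piece by a single sample, in which case the sum in (3)–(4) reduces to Massey's discrete-time directed information, and for jointly Gaussian samples every term becomes a log-ratio of conditional variances. The sketch below (illustrative only; the AR(1) signal model, the observation noise level, and helper names such as `gaussian_directed_information` are assumptions, not from the paper) computes this Gaussian sum directly from a joint covariance matrix.

```python
import numpy as np

def cond_var(Sigma, i, given):
    """Var(Z_i | Z_given) for a zero-mean jointly Gaussian vector with covariance Sigma."""
    if not given:
        return Sigma[i, i]
    S = Sigma[np.ix_(given, given)]
    c = Sigma[np.ix_(given, [i])]
    return Sigma[i, i] - (c.T @ np.linalg.solve(S, c)).item()

def gaussian_directed_information(Sigma, n):
    """I(X^n -> Y^n) in nats for a Gaussian vector ordered as (X_1..X_n, Y_1..Y_n)."""
    x = list(range(n))
    y = list(range(n, 2 * n))
    di = 0.0
    for i in range(n):
        v_past = cond_var(Sigma, y[i], y[:i])              # Var(Y_i | Y^{i-1})
        v_full = cond_var(Sigma, y[i], y[:i] + x[:i + 1])  # Var(Y_i | Y^{i-1}, X^i)
        di += 0.5 * np.log(v_past / v_full)                # I(X^i; Y_i | Y^{i-1})
    return di

# Samples of a stationary AR(1) signal observed through additive white Gaussian noise,
# Y_i = X_i + Z_i with no feedback, taken at the points of a partition t.
n, rho, noise_var = 20, 0.9, 0.5
idx = np.arange(n)
Sigma_x = rho ** np.abs(np.subtract.outer(idx, idx)) / (1 - rho ** 2)
Sigma = np.block([[Sigma_x, Sigma_x],
                  [Sigma_x, Sigma_x + noise_var * np.eye(n)]])
print("sampled directed information:", gaussian_directed_information(Sigma, n))
```

Because there is no feedback in this model, the printed value also equals the sampled mutual information, consistent with the coincidence property discussed in the next section.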
The directed information we have just defined is between two processes on $[0, T]$. We extend this definition to processes on any other closed and bounded interval, and to the conditional directed information $I(X_0^T \to Y_0^T \mid V)$, where $V$ is a random object jointly distributed with $(X_0^T, Y_0^T)$, in the obvious way. We also define the notion of directed information between a process on $[0, T)$ and a process on $[0, T]$: letting $X_0^{[\delta],T}$ denote the process on $[0, T]$ formed by shifting $X_0^{T-}$ forward by $\delta$, the directed information $I(X_0^{T-} \to Y_0^T)$ is obtained from $I(X_0^{[\delta],T} \to Y_0^T)$ by letting $\delta \to 0$, whenever the limit exists.

III. PROPERTIES OF DIRECTED INFORMATION IN CONTINUOUS TIME

Key properties of continuous-time directed information are collected in Proposition 3, whose items include the following.

2) Invariance to time dilation: For all $\alpha > 0$, if $\tilde{X}_t = X_{t\alpha}$ and $\tilde{Y}_t = Y_{t\alpha}$, then $I\bigl(\tilde{X}_0^{T/\alpha} \to \tilde{Y}_0^{T/\alpha}\bigr) = I\bigl(X_0^T \to Y_0^T\bigr)$. More generally, if $\phi$ is monotone, strictly increasing, and continuous, and $(\tilde{X}_{\phi(t)}, \tilde{Y}_{\phi(t)}) = (X_t, Y_t)$, then

$I\bigl(X_0^T \to Y_0^T\bigr) = I\bigl(\tilde{X}_{\phi(0)}^{\phi(T)} \to \tilde{Y}_{\phi(0)}^{\phi(T)}\bigr).$   (15)

3) Coincidence of directed and mutual information: If the Markov relation $Y_0^{t-} - X_0^{t-} - X_t^T$ holds for all $0 \le t \le T$, then

$I\bigl(X_0^T \to Y_0^T\bigr) = I\bigl(X_0^T; Y_0^T\bigr).$   (16)

4) Equivalence between discrete time and piecewise constancy in continuous time: Let $U^n, V^n$ be an arbitrarily jointly distributed pair of $n$-tuples and let $t_0, t_1, \ldots, t_n$ be a sequence of numbers satisfying $t_0 = 0$, $t_n = T$, and $t_{i-1} < t_i$ for $1 \le i \le n$. Let the pair $X_0^T, Y_0^T$ be formed as the piecewise-constant processes satisfying $(X_t, Y_t) = (U_i, V_i)$ if $t_{i-1} \le t < t_i$, $1 \le i \le n$. Then

$I\bigl(U^n \to V^n\bigr) = I\bigl(X_0^T \to Y_0^T\bigr).$   (17)

In addition, a conservation law holds: under suitable regularity conditions, the directed information $I\bigl(Y_0^{T-} \to X_0^T\bigr)$ exists and

$I\bigl(X_0^T \to Y_0^T\bigr) + I\bigl(Y_0^{T-} \to X_0^T\bigr) = I\bigl(X_0^T; Y_0^T\bigr).$   (18)

Remarks. 1) The first, second, and fourth items in the above proposition present properties that are known to hold for mutual information (i.e., when all the directed information expressions in those items are replaced by the corresponding mutual information), and that follow immediately from the data processing inequality and from the invariance of mutual information to one-to-one transformations of its arguments. That these properties hold also for directed information is not as obvious, in view of the fact that directed information is, in general, not invariant to one-to-one transformations, nor does it satisfy a data processing inequality in its second argument.

2) The third part of the proposition is the natural analogue of the fact that $I(X^n; Y^n) = I(X^n \to Y^n)$ whenever $Y^i - X^i - X_{i+1}^n$ for all $1 \le i \le n$. It covers, in particular, any scenario where $X_0^T$ and $Y_0^T$ are the input and output of any channel of the form $Y_t = g_t(X_0^t, W_0^T)$, where the process $W_0^T$ (which can be thought of as the internal channel noise) is independent of the channel input process $X_0^T$. In this scenario we also have $I_{\mathbf{t}}(Y_0^{T-} \to X_0^T) = 0$ for every $\mathbf{t}$, and therefore $I(Y_0^{T-} \to X_0^T)$ exists and equals zero. Thus (18) in this case collapses to (16).
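The conservation law (18) has an exact discrete-time counterpart that can be checked by direct computation: for any jointly distributed pair of $n$-tuples, $I(X^n; Y^n) = I(X^n \to Y^n) + I(Y^{n-1} \to X^n)$, where $I(Y^{n-1} \to X^n) := \sum_{i=1}^n I(Y^{i-1}; X_i \mid X^{i-1})$ is the reverse directed information with a unit delay. The following self-contained Python sketch (an illustration, not code from the paper; the randomly drawn joint pmf over binary pairs is an assumption) verifies the identity numerically for $n = 2$.

```python
import itertools
import math
import random
from collections import defaultdict

def marginal(pmf, keep):
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def cond_mi(pmf, a, b, c):
    """I(A; B | C) in nats; a, b, c index into outcomes ordered as (X1, X2, Y1, Y2)."""
    p_abc, p_ac, p_bc, p_c = (marginal(pmf, k) for k in (a + b + c, a + c, b + c, c))
    return sum(p * math.log(p_abc[tuple(o[i] for i in a + b + c)]
                            * p_c[tuple(o[i] for i in c)]
                            / (p_ac[tuple(o[i] for i in a + c)]
                               * p_bc[tuple(o[i] for i in b + c)]))
               for o, p in pmf.items() if p > 0)

# An arbitrary (randomly drawn) joint pmf on (X1, X2, Y1, Y2), all binary.
random.seed(0)
outcomes = list(itertools.product((0, 1), repeat=4))
weights = [random.random() for _ in outcomes]
pmf = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

X1, X2, Y1, Y2 = 0, 1, 2, 3
di_fwd = cond_mi(pmf, [X1], [Y1], []) + cond_mi(pmf, [X1, X2], [Y2], [Y1])  # I(X^2 -> Y^2)
di_rev = cond_mi(pmf, [Y1], [X2], [X1])      # I(Y^1 -> X^2); its first term I(Y^0; X1) is zero
mi = cond_mi(pmf, [X1, X2], [Y1, Y2], [])    # I(X^2 ; Y^2)
print(f"{di_fwd + di_rev:.6f} == {mi:.6f}")
```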
IV. DIRECTED INFORMATION AND CAUSAL ESTIMATION

A. The Gaussian Channel

In [18], Duncan discovered the following fundamental relationship between the minimum mean squared error in causal estimation of a target signal corrupted by additive white Gaussian noise in continuous time and the mutual information between the clean and noise-corrupted signals.

Theorem 1 (Duncan [18]). Let $X_0^T$ be a signal of finite average power $\int_0^T E[X_t^2]\,dt < \infty$, independent of the standard Brownian motion $\{B_t\}$, and let $Y_0^T$ satisfy $dY_t = X_t\,dt + dB_t$. Then

$\frac{1}{2} \int_0^T E\bigl[(X_t - E[X_t \mid Y_0^t])^2\bigr]\,dt = I\bigl(X_0^T; Y_0^T\bigr).$   (20)

A remarkable aspect of Duncan's theorem is that the relationship (20) holds regardless of the distribution of $X_0^T$. Among its ramifications is the invariance of the causal MMSE (minimum mean squared error) to the flow of time, or indeed to any way of reordering time [19], [20].

A key stipulation in Duncan's theorem is the independence between the noise-free signal $X_0^T$ and the channel noise $\{B_t\}$, which excludes scenarios in which the evolution of $X_t$ is affected by the channel noise, as is often the case in signal processing (e.g., target tracking) and in communications (e.g., in the presence of feedback). Indeed, (20) does not hold in the absence of such a stipulation. As an extreme example, consider the case where the channel input is simply the channel output with some delay, i.e., $X_{t+\varepsilon} = Y_t$ for some $\varepsilon > 0$ (and, say, $X_t \equiv 0$ for $t \in [0, \varepsilon)$). In this case the causal MMSE on the left side of (20) is clearly 0, while the mutual information on its right side is infinite. On the other hand, in this case the directed information $I(X_0^T \to Y_0^T) = 0$, as can be seen by noting that $I_{\mathbf{t}}(X_0^T \to Y_0^T) = 0$ for all $\mathbf{t}$ satisfying $\max_i (t_i - t_{i-1}) \le \varepsilon$ (since for such $\mathbf{t}$, $X_0^{t_i-}$ is determined by $Y_0^{t_{i-1}-}$ for all $i$).

The third comment following Proposition 3 implies that Theorem 1 could equivalently be stated with $I(X_0^T; Y_0^T)$ on the right side of (20) replaced by $I(X_0^T \to Y_0^T)$. Further, such a modified equality would be valid in the extreme example just given. This is no coincidence, and is a consequence of the following result, which generalizes Duncan's theorem.

Theorem 2. Let $\{B_t\}$ be a standard Brownian motion, let $\{W_t\}$ be independent of $\{B_t\}$, and let $\{(X_t, Y_t)\}$ satisfy

$X_t = a_t\bigl(X_0^{t-\delta}, Y_0^{t-\delta}, W_0^T\bigr)$   (21)

(for deterministic mappings $a_t$ and some $\delta > 0$) such that $\{X_t\}$ has finite average power $\int_0^T E[X_t^2]\,dt < \infty$, and

$dY_t = X_t\,dt + dB_t.$   (22)

Then

$\frac{1}{2} \int_0^T E\bigl[(X_t - E[X_t \mid Y_0^t])^2\bigr]\,dt = I\bigl(X_0^T \to Y_0^T\bigr).$   (23)

The evolution of the noise-free process in the above theorem (equation (21)), which assumes a nonzero delay in the feedback loop, is the standard evolution arising in signal processing and communications (cf., e.g., [21]).

Proof of Theorem 2: For every $t \ge 0$ and $\varepsilon \le \delta$,

$\frac{1}{2}\, E \int_t^{t+\varepsilon-} \bigl(X_s - E[X_s \mid Y_0^s]\bigr)^2\,ds$   (24)

$= \frac{1}{2} \int_t^{t+\varepsilon-} E\Bigl[ E\bigl[(X_s - E[X_s \mid Y_0^s])^2 \,\big|\, Y_0^{t-}\bigr] \Bigr]\,ds$   (25)

$= \frac{1}{2}\, E \int_t^{t+\varepsilon-} E\bigl[(X_s - E[X_s \mid Y_0^s])^2 \,\big|\, Y_0^{t-}\bigr]\,ds$   (26)

$= \frac{1}{2} \int \biggl[ \int_t^{t+\varepsilon-} E\bigl[(X_s - E[X_s \mid Y_0^s])^2 \,\big|\, y_0^{t-}\bigr]\,ds \biggr]\,dP\bigl(y_0^{t-}\bigr)$   (27)

$\overset{(a)}{=} \int I\bigl(X_t^{t+\varepsilon-};\, Y_t^{t+\varepsilon-} \,\big|\, y_0^{t-}\bigr)\,dP\bigl(y_0^{t-}\bigr)$   (28)

$= I\bigl(X_t^{t+\varepsilon-};\, Y_t^{t+\varepsilon-} \,\big|\, Y_0^{t-}\bigr),$   (29)

where (a) is due to the following: since $(Y_0^{t-}, X_0^{t-}, W_0^T)$ is independent of $B_t^{t+\varepsilon-}$, and $X_t^{t+\varepsilon-}$ is determined by $(Y_0^{t-}, X_0^{t-}, W_0^T)$, it follows that $(Y_0^{t-}, X_t^{t+\varepsilon-})$ is independent of $B_t^{t+\varepsilon-}$. Thus, (a) is nothing but an application of Duncan's theorem to the conditional distribution of $(X_t^{t+\varepsilon-}, B_t^{t+\varepsilon-})$ given $y_0^{t-}$, which yields equality between the integrand of (27) and that of (28). Fixing now an arbitrary $\mathbf{t} = (t_1, \ldots, t_N)$ that satisfies $\max_i (t_i - t_{i-1}) \le \delta$ gives

$\frac{1}{2} \int_0^T E\bigl[(X_t - E[X_t \mid Y_0^t])^2\bigr]\,dt$   (30)

$= \frac{1}{2} \sum_{i=1}^{N} E \int_{t_{i-1}}^{t_i-} \bigl(X_t - E[X_t \mid Y_0^t]\bigr)^2\,dt$   (31)

$\quad + \frac{1}{2}\, E \int_{t_N}^{T} \bigl(X_t - E[X_t \mid Y_0^t]\bigr)^2\,dt$   (32)

$\overset{(a)}{=} \sum_{i=1}^{N} I\bigl(Y_{t_{i-1}}^{t_i-};\, X_0^{t_i-} \,\big|\, Y_0^{t_{i-1}-}\bigr) + I\bigl(Y_{t_N}^{T};\, X_0^{T} \,\big|\, Y_0^{t_N-}\bigr)$   (33)

$= I_{\mathbf{t}}\bigl(X_0^T \to Y_0^T\bigr),$   (34)

where (a) follows by applying (29) to each of the summands in (31) and (32) with the associations $t \leftrightarrow t_{i-1}$ and $\varepsilon \leftrightarrow t_i - t_{i-1}$. Finally, since (34) holds for an arbitrary $\mathbf{t}$ satisfying $\max_i (t_i - t_{i-1}) \le \delta$, by (6) we obtain

$\frac{1}{2} \int_0^T E\bigl[(X_t - E[X_t \mid Y_0^t])^2\bigr]\,dt = I\bigl(X_0^T \to Y_0^T\bigr).$   (35)

Note that since, in general, $I(X_0^T \to Y_0^T)$ is not invariant to the direction of the flow of time, Theorem 2 implies, as should be expected, that neither is the causal MMSE for processes evolving in the generality afforded by (21) and (22). The result, however, can be extended to accommodate a more general model with zero delay, as introduced and developed in [22], by combining our arguments above with those used in [22].
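As a sanity check on (20) and (23), consider the special case $X_t \equiv X$ for a single Gaussian random variable $X \sim N(0, \sigma^2)$ independent of the noise, so that the causal estimate has the closed form $E[X \mid Y_0^t] = \sigma^2 Y_t / (1 + \sigma^2 t)$ and $I(X_0^T; Y_0^T) = \tfrac{1}{2}\log(1 + \sigma^2 T)$, which here also equals the directed information since there is no feedback. The Monte Carlo sketch below (illustrative; the constant-signal model, the discretization step, and the closed-form filter are assumptions specific to this example, not part of the paper) compares half the time-integrated causal MMSE with that value.

```python
import numpy as np

rng = np.random.default_rng(1)
T, dt, sigma2, n_paths = 2.0, 2e-3, 4.0, 2000
n_steps = int(T / dt)
t_grid = dt * np.arange(1, n_steps + 1)

# Simulate dY_t = X dt + dB_t with X ~ N(0, sigma2) constant in time and independent of B.
X = rng.normal(0.0, np.sqrt(sigma2), size=n_paths)
dB = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
Y = np.cumsum(X[:, None] * dt + dB, axis=1)              # Y_t sampled on the grid

# Conjugate Gaussian filtering: E[X | Y_0^t] = sigma2 * Y_t / (1 + sigma2 * t).
X_hat = sigma2 * Y / (1.0 + sigma2 * t_grid)
mmse_t = np.mean((X[:, None] - X_hat) ** 2, axis=0)      # empirical E[(X - E[X|Y_0^t])^2]

lhs = 0.5 * np.sum(mmse_t) * dt                          # (1/2) * integral of the causal MMSE
rhs = 0.5 * np.log(1.0 + sigma2 * T)                     # I(X_0^T; Y_0^T) = I(X_0^T -> Y_0^T)
print(f"half integrated causal MMSE ~ {lhs:.4f}   vs   information {rhs:.4f}")
```

The two printed numbers should agree up to Monte Carlo and discretization error, which is the content of (20) in this special case.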
B. The Poisson Channel

The following result can be considered an analogue of Duncan's theorem for the case of Poisson noise.

Theorem 3 ([23]). Let $Y_0^T$ be a doubly stochastic Poisson process and let $X_0^T$ be its intensity process. Then, provided $E \int_0^T |X_t \log X_t|\,dt < \infty$,

$E \int_0^T \Bigl[\phi(X_t) - \phi\bigl(E[X_t \mid Y_0^t]\bigr)\Bigr]\,dt = I\bigl(X_0^T; Y_0^T\bigr),$   (36)

where $\phi(\alpha) = \alpha \log \alpha$.

It is easy to verify that the condition stipulated in the third item of Proposition 3 holds when $Y_0^T$ is a doubly stochastic Poisson process and $X_0^T$ is its intensity process. Thus, the above theorem could equivalently be stated with directed rather than mutual information on the right-hand side of (36). Indeed, with continuous-time directed information replacing mutual information, this relationship remains true in much wider generality, as the next theorem shows. In the statement of the theorem, we use the notions of a point process and its predictable intensity, as developed in detail in [24, Chapter II].

Theorem 4. Let $Y_t$ be a point process and let $X_t$ be its $\mathcal{F}_t$-predictable intensity, where $\mathcal{F}_t = \mathcal{F}_t^Y \vee \sigma(X_t)$ (with $\sigma(X_t)$ the $\sigma$-field generated by $X_t$). Then, provided $E \int_0^T |X_t \log X_t|\,dt < \infty$,

$E \int_0^T \Bigl[\phi(X_t) - \phi\bigl(E[X_t \mid Y_0^t]\bigr)\Bigr]\,dt = I\bigl(X_0^T \to Y_0^T\bigr).$   (37)
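For a concrete check of (36), let the intensity be a single random variable $X$ taking one of two values, constant over $[0, T]$ and independent of the Poisson noise. Then both sides can be evaluated semi-analytically: the count $N_t$ is a sufficient statistic, so $E[X \mid Y_0^t]$ depends only on $N_t$, and the right side reduces to $I(X; N_T)$. The Python sketch below (an illustration under these specific modeling assumptions; the two-point prior, the truncation of the count distribution, and the time grid are not from the paper) evaluates both sides by summing over counts and integrating over time.

```python
import numpy as np
from scipy.stats import poisson

# Two-point prior on the constant-in-time intensity X, independent of the Poisson noise.
rates = np.array([0.5, 3.0])
prior = np.array([0.4, 0.6])
T = 2.0
k = np.arange(0, 200)                                    # truncated support of the counts

def phi(a):
    """phi(alpha) = alpha * log(alpha), applied elementwise."""
    return a * np.log(a)

def gap(t):
    """E[phi(X)] - E[phi(E[X | N_t])] at time t (the integrand on the left side of (36))."""
    if t == 0.0:
        return prior @ phi(rates) - phi(prior @ rates)
    pk_given_x = np.vstack([poisson.pmf(k, x * t) for x in rates])         # P(N_t = k | X = x)
    joint = prior[:, None] * pk_given_x                                    # P(X = x, N_t = k)
    pk = joint.sum(axis=0)
    x_hat = (joint * rates[:, None]).sum(axis=0) / np.maximum(pk, 1e-300)  # E[X | N_t = k]
    return prior @ phi(rates) - np.sum(pk * phi(x_hat))

# Left side of (36): trapezoidal integration of the gap over [0, T].
t_grid = np.linspace(0.0, T, 401)
vals = np.array([gap(t) for t in t_grid])
lhs = np.sum((vals[1:] + vals[:-1]) * np.diff(t_grid)) / 2.0

# Right side of (36): I(X_0^T; Y_0^T) = I(X; N_T) for a constant intensity.
pk_given_x = np.vstack([poisson.pmf(k, x * T) for x in rates])
joint = prior[:, None] * pk_given_x
pk = joint.sum(axis=0)
mask = joint > 0
denom = (prior[:, None] * pk[None, :])[mask]
rhs = np.sum(joint[mask] * np.log(joint[mask] / denom))

print(f"integral of E[phi(X) - phi(E[X|Y_0^t])] ~ {lhs:.5f}   vs   I(X_0^T; Y_0^T) ~ {rhs:.5f}")
```

The two printed values should match up to the quadrature and truncation error of the sketch, which is the statement of (36) for this particular intensity; by the discussion above, the same value is obtained with directed information on the right side.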
V. COMMUNICATION OVER CONTINUOUS-TIME CHANNELS WITH FEEDBACK

Consider a channel characterized via the following ingredients:
• $\mathcal{X}$ and $\mathcal{Y}$ are the channel input and output alphabets.
• $M$ is the message, uniformly distributed on $\{1, 2, \ldots, \lfloor 2^{TR} \rfloor\}$ and independent of the stationary and ergodic channel noise process $\{W_t\}$.
• Channel output process:

$Y_t = g\bigl(X_{t-s}^t, W_t\bigr)$   (38)

for some fixed $s > 0$.
• Encoding process: $X_t = f_t(M, Y^{t-\Delta-})$ for $t \ge 0$ and some $\Delta > 0$ (and set arbitrarily for $t < 0$). Here $\Delta$ is the feedback delay.

An encoding scheme for the time interval $[0, T]$ is characterized by the family of encoding functions $\{f_t\}_{t=0}^T$. While similar settings were studied by Ihara [25], [26], the focus therein is the information capacity, that is, the maximal mutual information between the message and the output processes. In contrast, we focus on the operational capacity, defined as follows.

Definition 3. A rate $R$ is said to be achievable with feedback delay $\Delta$ if for each $T$ there exists a family of encoding functions $\{f_t\}_{t=0}^T$ such that

$P\bigl(M \ne \hat{M}(Y_0^T)\bigr) \to 0 \quad \text{as } T \to \infty,$   (39)

where $\hat{M}(Y_0^T)$ in (39) is the maximum likelihood estimate of $M$ given $Y_0^T$, when employing the encoding functions $\{f_t\}_{t=0}^T$.

Let

$C_\Delta = \sup\{R : R \text{ is achievable with feedback delay } \Delta\}$   (40)

be the feedback capacity with delay $\Delta$ and, finally, define the feedback capacity by

$C = \sup_{\Delta > 0} C_\Delta.$   (41)

Our goal is to characterize $C_\Delta$ and $C$ for the class of channels defined by (38). Our preliminary result suggests that $C$ is given by

$C = \sup_{\Delta > 0} \sup_{\mathcal{S}_\Delta} \lim_{T \to \infty} \frac{1}{T} I\bigl(X_0^T \to Y_0^T\bigr),$   (42)

where the inner supremum in (42) is over $\mathcal{S}_\Delta$, the set of all channel input processes of the form $X_t = g_t(U_t, Y^{t-\Delta-})$ for some family of functions $\{g_t\}_{t=0}^T$ and some process $U_0^T$ which is independent of the channel noise process $W_0^T$ (appearing in (38)). Existence of the limit in (42) follows from a standard superadditivity argument; recall that the input memory is bounded in (38). The main difficulty in completing the proof is the fact that chopping the time into small segments causes technical difficulties, such as not preserving the ergodicity of the channel within a segment.

VI. CONCLUDING REMARKS AND RESEARCH DIRECTIONS

The machinery developed here for directed information between continuous-time stochastic processes appears to have several powerful applications, emerging naturally both in estimation and communication theoretic contexts. In [15], we provide further substantiation of the significance of directed information in continuous time through connections to the Kalman–Bucy filter theory. One immediate goal of our future research is to complete the characterization of feedback capacity as presented in Section V and to establish the capacity of certain power-constrained non-white Gaussian channels with feedback.

REFERENCES

[1] J. Massey, "Causality, feedback and directed information," in Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), pp. 303–305, 1990.
[2] G. Kramer, "Directed information for channels with feedback," Ph.D. dissertation, Swiss Federal Institute of Technology (ETH) Zurich, 1998.
[3] G. Kramer, "Feedback strategies for a class of two-user multiple-access channels," IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 2054–2059, 1999.
[4] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, no. 1, pp. 323–349, Jan. 2009.
[5] Y.-H. Kim, "A coding theorem for a class of stationary channels with feedback," IEEE Trans. Inf. Theory, vol. 54, no. 4, pp. 1488–1499, Apr. 2008.
[6] H. H. Permuter, T. Weissman, and A. J. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2, pp. 644–662, 2009.
[7] Y.-H. Kim, "Feedback capacity of stationary Gaussian channels," 2006, submitted to IEEE Trans. Inf. Theory. Available at arxiv.org/pdf/cs.IT/0602091.
[8] H. H. Permuter, P. Cuff, B. Van Roy, and T. Weissman, "Capacity of the trapdoor channel with feedback," IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3150–3165, 2008.
[9] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Trans. Inf. Theory, vol. 49, pp. 4–21, 2003.
[10] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: Rate-distortion theorems and error exponents for a general source," IEEE Trans. Inf. Theory, vol. 53, no. 6, pp. 2154–2179, 2007.
[11] S. Pradhan, "On the role of feedforward in Gaussian sources: Point-to-point source coding and multiple description source coding," IEEE Trans. Inf. Theory, vol. 53, no. 1, pp. 331–349, 2007.
[12] H. H. Permuter, Y.-H. Kim, and T. Weissman, "On directed information and gambling," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Toronto, Canada, 2008.
[13] Y.-H. Kim, H. H. Permuter, and T. Weissman, "An interpretation of directed information in gambling, portfolio theory and data compression," 2009, in preparation.
[14] Y.-H. Kim, H. H. Permuter, and T. Weissman, "Directed information and causal estimation in continuous time," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Seoul, Korea, 2009.
[15] Y.-H. Kim, H. H. Permuter, and T. Weissman, "Directed information in continuous time," 2009, in preparation.
[16] A. N. Kolmogorov, "On the Shannon theory of information transmission in the case of continuous signals," IRE Trans. Inf. Theory, vol. 2, no. 4, pp. 102–108, Dec. 1956.
[17] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. Moscow: Izv. Akad. Nauk, 1960, in Russian.
[18] T. E. Duncan, "On the calculation of mutual information," SIAM J. Appl. Math., vol. 19, pp. 215–220, 1970.
[19] A. Cohen, N. Merhav, and T. Weissman, "Scanning and sequential decision making for multidimensional data, Part II: Noisy data," IEEE Trans. Inf. Theory, vol. 54, no. 12, pp. 5609–5631, Dec. 2008.
[20] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1283, Apr. 2005.
[21] T. T. Kadota, M. Zakai, and J. Ziv, "Capacity of a continuous memoryless channel with feedback," IEEE Trans. Inf. Theory, vol. 17, pp. 372–378, 1971.
[22] T. T. Kadota, M. Zakai, and J. Ziv, "Mutual information of the white Gaussian channel with and without feedback," IEEE Trans. Inf. Theory, vol. 17, pp. 368–371, 1971.
[23] R. S. Liptser and A. N. Shiryaev, Statistics of Random Processes II: Applications, 2nd ed. Springer, 2001.
[24] P. Brémaud, Point Processes and Queues: Martingale Dynamics, 1st ed. New York: Springer-Verlag, 1981.
[25] S. Ihara, "Coding theorems for a continuous-time Gaussian channel with feedback," IEEE Trans. Inf. Theory, vol. 40, no. 6, pp. 2041–2044, 1994.
[26] S. Ihara, Information Theory for Continuous Systems. River Edge, NJ: World Scientific, 1993.