Universal Estimation of Directed Information

Lei Zhao, Haim Permuter, Young-Han Kim, and Tsachy Weissman

Lei Zhao and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University. Email: {leiz,tsachy}@stanford.edu. Haim Permuter is with the Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Israel. Email: [email protected]. Young-Han Kim is with the Department of Electrical and Computer Engineering, University of California, San Diego. Email: [email protected]

Abstract—In this paper, we develop a universal algorithm to estimate Massey's directed information for stationary ergodic processes. The sequential probability assignment induced by a universal source code plays the critical role in the estimation. In particular, we use context tree weighting to implement the algorithm. Some numerical results are provided to illustrate the performance of the proposed algorithm.

I. INTRODUCTION

First introduced by Massey in [1], directed information arises as a natural counterpart of mutual information for channel capacity when feedback is present. In [2] and [3], Kramer extended the use of directed information to discrete memoryless networks with feedback, including the two-way channel and the multiple access channel. For a class of stationary channels with feedback, where the output is a function of the current and past $m$ inputs and channel noise, Kim [4] proved that the feedback capacity is equal to the limit of the supremum of the normalized directed information from the input to the output. Tatikonda and Mitter [5] used directed information to prove a general feedback channel coding theorem for channels with memory. In [6], Permuter et al. considered the capacity of discrete-time channels with feedback where the feedback is a time-invariant deterministic function of the output. Under mild conditions, they showed that the capacity is the maximum of the normalized directed information between the input and output sequence in the limit. Recently, Permuter et al. [7] showed that directed information plays an important role in portfolio theory, data compression, and hypothesis testing, where causality constraints exist.

Besides information theory, directed information has been shown to be a valuable tool in biology, when inference about causality is needed. In [8], directed information was used to identify pairwise influence. The authors in [9] used directed information to test the inference of influence in gene networks. Thus it is of both theoretical and practical interest to develop a way to estimate directed information efficiently.

As we were completing this paper, [10] was brought to our attention, in which the authors used directed information to infer causal relationships in ensemble neural spike train recordings. At the heart of both our estimation framework and theirs is the estimation of the causally conditional entropy. The main difference is that they took a parametric approach [10, Assumption 3, Page 10], while our approach is based on non-parametric universal data compressors, and therefore leads to stronger universality properties.

Notation: We use a capital letter $X$ to denote a random variable and a small letter $x$ to denote the corresponding realization or constant. A calligraphic letter $\mathcal{X}$ denotes the alphabet of $X$ and $|\mathcal{X}|$ denotes the cardinality of the alphabet.

II. PRELIMINARIES

We first give the mathematical definitions of directed information and causally conditional entropy, and then discuss the relation between universal sequential probability assignment and universal source coding.

A. Directed information

Directed information from $X^n$ to $Y^n$ is defined as
$$I(X^n \to Y^n) = H(Y^n) - H(Y^n \| X^n), \qquad (1)$$
where $H(Y^n \| X^n)$ is the causally conditional entropy [2], defined as
$$H(Y^n \| X^n) = \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}, X^i). \qquad (2)$$
Compared with the definition of mutual information,
$$I(X^n; Y^n) = H(X^n) - H(X^n \mid Y^n),$$
the conditional entropy is replaced by the causal conditioning. And unlike mutual information, directed information is not symmetric, i.e., $I(Y^n \to X^n) \neq I(X^n \to Y^n)$ in general. Other interesting properties of directed information, such as the conservation law, can be found in [2], [11].

For random processes $\mathbf{X}, \mathbf{Y}$ which are jointly stationary, we can define the directed information rate [2] as follows:
$$H(\mathbf{Y} \| \mathbf{X}) = \lim_{n \to \infty} \frac{1}{n} H(Y^n \| X^n), \qquad (3)$$
$$I(\mathbf{X} \to \mathbf{Y}) = \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n). \qquad (4)$$
The existence of the limit can be checked as follows:
$$\begin{aligned}
I(\mathbf{X} \to \mathbf{Y}) &= \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n) \\
&= \lim_{n \to \infty} \frac{1}{n} \big( H(Y^n) - H(Y^n \| X^n) \big) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}) - \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}, X^i) \\
&= H(Y_0 \mid Y_{-\infty}^{-1}) - H(Y_0 \mid X_{-\infty}^{0}, Y_{-\infty}^{-1}),
\end{aligned}$$
where the last equality is obtained via the property of the Cesàro mean [12] and standard martingale arguments [13]. Note that the entropy rate $H(\mathbf{Y})$ of the process $\mathbf{Y}$ is equal to $H(Y_0 \mid Y_{-\infty}^{-1})$, and $H(\mathbf{Y} \| \mathbf{X}) = H(Y_0 \mid X_{-\infty}^{0}, Y_{-\infty}^{-1})$. Thus
$$I(\mathbf{X} \to \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \| \mathbf{X}). \qquad (5)$$
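To make definitions (1), (2) and (5) concrete, the following brute-force sketch (our illustration, not part of the paper) computes $I(X^n \to Y^n)$ for a small block length directly from a known joint pmf of $(X^n, Y^n)$; the dictionary-based pmf representation and function names are our own.

```python
import itertools
from math import log2

def directed_information(pmf, n):
    """Brute-force I(X^n -> Y^n) = H(Y^n) - H(Y^n || X^n), Eqs. (1)-(2), in bits.

    pmf maps ((x_1,...,x_n), (y_1,...,y_n)) tuples to probabilities.
    """
    def cond_entropy(i, with_x):
        # H(Y_i | Y^{i-1}, X^i) if with_x, else H(Y_i | Y^{i-1}).
        joint, marg = {}, {}
        for (xs, ys), p in pmf.items():
            ctx = (xs[:i] if with_x else ()) + ys[:i - 1]
            joint[ctx, ys[i - 1]] = joint.get((ctx, ys[i - 1]), 0.0) + p
            marg[ctx] = marg.get(ctx, 0.0) + p
        return -sum(p * log2(p / marg[ctx])
                    for (ctx, _), p in joint.items() if p > 0)

    h_y = sum(cond_entropy(i, with_x=False) for i in range(1, n + 1))       # H(Y^n) by the chain rule
    h_y_causal = sum(cond_entropy(i, with_x=True) for i in range(1, n + 1)) # H(Y^n || X^n), Eq. (2)
    return h_y - h_y_causal

# Toy check: Y_i = X_i over a noiseless channel with i.i.d. fair inputs,
# so I(X^n -> Y^n) = n bits.
n = 3
pmf = {(xs, xs): 2.0 ** -n for xs in itertools.product((0, 1), repeat=n)}
print(directed_information(pmf, n))  # -> 3.0
```

For stationary ergodic processes such a brute-force computation is of course infeasible, which is what motivates the universal estimator developed in the rest of the paper.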


B. Universal sequential probability assignment and universal source coding

A sequential probability assignment $Q$ consists of a set of conditional probabilities $\{Q_{X_i|x^{i-1}}(\cdot),\ \forall x^{i-1} \in \mathcal{X}^{i-1}\}_{i=1}^{\infty}$. Note that $Q$ induces a probability distribution on $\mathcal{X}^{\infty}$ in the obvious way.
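As a concrete toy instance (ours, not the assignment used in the paper), the Krichevsky-Trofimov add-1/2 rule is a valid sequential probability assignment on binary sequences; context tree weighting, used later, mixes many such estimators over contexts.

```python
from fractions import Fraction

def kt_sequential_assignment(x):
    """Krichevsky-Trofimov (add-1/2) sequential probability assignment for a
    binary sequence x: returns Q_{X_i | x^{i-1}}(x_i) for each i.

    A toy instance of a sequential probability assignment, not the paper's.
    """
    counts = [0, 0]
    probs = []
    for symbol in x:
        # Q(symbol | past) = (count(symbol) + 1/2) / (number of past symbols + 1)
        total = counts[0] + counts[1]
        probs.append(Fraction(2 * counts[symbol] + 1, 2 * total + 2))
        counts[symbol] += 1
    return probs

# Conditional probabilities assigned along the sequence 0, 0, 1, 0:
q = kt_sequential_assignment([0, 0, 1, 0])
print(q)                  # Fractions 1/2, 3/4, 1/6, 5/8
prod = Fraction(1)
for p in q:
    prod *= p             # Q(x^n) is the product of the conditionals
print(prod)               # -> 5/128
```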

Definition 1: A sequential probability assignment $Q$ is universal if
$$\limsup_{n \to \infty} \frac{1}{n} D(P_{X^n} \| Q_{X^n}) = 0 \qquad (6)$$
for any stationary probability measure $P$.

A source code for an $n$-block source sequence, $C_n$, is defined as a mapping from a source sequence $x^n$ to a binary sequence of finite length, i.e.,
$$C_n : \mathcal{X}^n \to \{0,1\}^*. \qquad (7)$$
More explicitly,
$$C_n(x^n) = b_1, b_2, \ldots, b_{l_n}, \qquad (8)$$
where $l_n = l_n(x^n)$ is the code length. Furthermore, $C_n$ is said to be non-singular if $C_n(x^n) \neq C_n(y^n)$ for all $x^n \neq y^n$. It is said to be uniquely decodable if all its extensions are non-singular [12]. The codeword lengths $l_n(\cdot)$ of any uniquely decodable code must satisfy the Kraft inequality
$$\sum_{x^n \in \mathcal{X}^n} 2^{-l_n(x^n)} \leq 1. \qquad (9)$$
See [12] for a proof.

For any uniquely decodable code, we have
$$\frac{1}{n} E\, l_n(X^n) = \frac{1}{n} H(X^n) + \frac{1}{n} D(P_{X^n} \| Q_{X^n}) - \frac{1}{n} \log k_n, \qquad (10)$$
where $k_n = \sum_{x^n} 2^{-l(x^n)}$ and $Q(x^n) = \frac{2^{-l(x^n)}}{k_n}$. $Q(x^n)$ induces a probability measure on $\mathcal{X}^n$. With slight abuse of notation, call this measure $Q$.

Definition 2: The sequential probability assignment induced by a source code $C_n$ is the set of conditional probabilities $\{Q_{X_i|x^{i-1}}\}_{i=1}^{n}$, where
$$Q_{X_i|x^{i-1}}(a) = \frac{Q(a x^{i-1})}{Q(x^{i-1})}, \qquad (11)$$
where $a x^{i-1}$ denotes the concatenation of the symbol $a$ and the sequence $x^{i-1}$.

Definition 3: A source coding scheme is a sequence of source codes. It is said to be universal if each code is uniquely decodable and
$$\limsup_{n \to \infty} \frac{1}{n} E[l_n(X^n)] = H(\mathbf{X}) \qquad (12)$$
for every stationary source $\mathbf{X}$.

The per-symbol expected number of bits based on the universal source coding scheme is a good estimate of $H(\mathbf{X})$. By the Kraft inequality, $-\frac{1}{n} \log k_n \geq 0$. Given a universal source coding scheme and a stationary source $\mathbf{X}$, by (10),
$$\limsup_{n \to \infty} \frac{1}{n} D(P_{X^n} \| Q_{X^n}) \leq \limsup_{n \to \infty} \left( \frac{1}{n} E\, l_n(X^n) - \frac{1}{n} H(X^n) \right) = 0. \qquad (13)$$
Thus we can construct a universal sequential probability assignment from a universal coding scheme.
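The construction behind (10) and (11) can be carried out mechanically once the codeword lengths are available. The sketch below (our illustration, using a hypothetical toy code) normalizes $2^{-l(x^n)}$ into the measure $Q$, marginalizes over suffixes, and reads (11) as the probability of extending the prefix $x^{i-1}$ by the symbol $a$.

```python
def induced_assignment(lengths):
    """From codeword lengths l(x^n), build Q(x^n) = 2^{-l(x^n)} / k_n as in Eq. (10)
    and return the induced conditional Q_{X_i | x^{i-1}}(a) of Eq. (11), read here
    as the probability of extending the prefix x^{i-1} by the symbol a.

    `lengths` maps full sequences (tuples) to integer codeword lengths.
    """
    k_n = sum(2.0 ** -l for l in lengths.values())
    q_full = {seq: 2.0 ** -l / k_n for seq, l in lengths.items()}

    def q_prefix(prefix):
        # Marginal of Q on the first len(prefix) symbols.
        return sum(p for seq, p in q_full.items() if seq[:len(prefix)] == prefix)

    def conditional(prefix, a):
        return q_prefix(prefix + (a,)) / q_prefix(prefix)

    return conditional

# Hypothetical uniquely decodable code on {0,1}^2 with lengths l(00)=1, l(01)=2, l(10)=3, l(11)=3.
lengths = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 3}
cond = induced_assignment(lengths)
print(cond((), 0))     # Q_{X_1}(0)            -> 0.75
print(cond((0,), 1))   # Q_{X_2 | x_1 = 0}(1)  -> 1/3
```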


III. ESTIMATION OF $I(\mathbf{X} \to \mathbf{Y})$

As we have seen, $I(\mathbf{X} \to \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \| \mathbf{X})$. In this section, we will show an estimate of $H(\mathbf{Y} \| \mathbf{X})$ based on a universal sequential probability assignment. A similar method applies to the estimate of $H(\mathbf{Y})$.

Let $\mathcal{M}(\mathcal{X}, \mathcal{Y})$ be the set of all distributions on $\mathcal{X} \times \mathcal{Y}$. Define $f$ as the function that maps a joint distribution $P_{X,Y}$ of a random vector $(X, Y)$ to the corresponding conditional entropy $H(Y|X)$, i.e.,
$$f(P_{X,Y}) = -\sum_{x,y} P_{X,Y}(x, y) \log P_{Y|X}(y|x), \qquad (14)$$
where $P_{Y|X}(\cdot|\cdot)$ is the conditional distribution induced by $P_{X,Y}$.

Lemma 1: For any $\epsilon > 0$, there exists $K_\epsilon > 0$ such that for all $P$ and $Q$ in $\mathcal{M}(\mathcal{X}, \mathcal{Y})$:
$$|f(P) - f(Q)| \leq \epsilon + K_\epsilon \|P - Q\|_1,$$
where $\|\cdot\|_1$ is the $l_1$ norm (viewing $P$ and $Q$ as $|\mathcal{X}||\mathcal{Y}|$-dimensional simplex vectors).

Proof: Fix $\epsilon > 0$. Since $\mathcal{M}(\mathcal{X}, \mathcal{Y})$ is bounded and closed, $f(\cdot)$ is uniformly continuous. Thus there exists $\delta_\epsilon$ such that $|f(P) - f(Q)| \leq \epsilon$ if $\|P - Q\|_1 \leq \delta_\epsilon$. Furthermore, $f(\cdot)$ is bounded by $f_{\max} \triangleq \log|\mathcal{X}| + \log|\mathcal{Y}|$. We have
$$\begin{aligned}
|f(P) - f(Q)| &\leq \epsilon\, 1\{\|P-Q\|_1 \leq \delta_\epsilon\} + f_{\max}\, 1\{\|P-Q\|_1 > \delta_\epsilon\} \\
&\leq \epsilon + f_{\max} \frac{\|P - Q\|_1}{\delta_\epsilon} \\
&= \epsilon + K_\epsilon \|P - Q\|_1, \qquad (15)
\end{aligned}$$
where $K_\epsilon = \frac{f_{\max}}{\delta_\epsilon}$.

Lemma 2 [15]: Let $\mathbf{X}$ be a stationary ergodic process. If $\lim_{k \to \infty} g_k(\mathbf{X}) = g(\mathbf{X})$ w.p. 1 and $\{g_k(\cdot)\}$ are bounded, then
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} g_k\big(T^k(\mathbf{X})\big) = E\, g(\mathbf{X}), \quad \text{w.p. 1 and in } L_1, \qquad (16)$$
where $T(\cdot)$ is the shift operator.

Now define $g_k(\mathbf{X}, \mathbf{Y}) = f\big(P_{X_0, Y_0 | X_{-k}^{-1}, Y_{-k}^{-1}}\big)$ for a jointly stationary and ergodic process $(\mathbf{X}, \mathbf{Y})$. Note that, by martingale convergence [13], $g_k(\mathbf{X}, \mathbf{Y}) \to g(\mathbf{X}, \mathbf{Y})$ w.p. 1, where $g(\mathbf{X}, \mathbf{Y}) = f\big(P_{X_0, Y_0 | X_{-\infty}^{-1}, Y_{-\infty}^{-1}}\big)$. Noting further that $E\, g(\mathbf{X}, \mathbf{Y}) = H(\mathbf{Y} \| \mathbf{X})$, we can apply Lemma 2 and get the following corollary:

Corollary 1:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) = H(\mathbf{Y} \| \mathbf{X}) \quad \text{w.p. 1 and in } L_1. \qquad (17)$$

Let $Q$ be a universal sequential probability assignment on $(\mathcal{X} \times \mathcal{Y})^{\infty}$. Denote the normalized causally conditional entropy estimate induced by $Q$ by
$$H_Q(X^{n-1}, Y^{n-1}) \triangleq \frac{1}{n} \sum_{i=1}^{n} f\big(Q_{X_i, Y_i | X^{i-1}, Y^{i-1}}\big). \qquad (18)$$
Note that $H_Q(X^{n-1}, Y^{n-1})$ is a random variable, which is a function of the pair $(X^{n-1}, Y^{n-1})$, not to be confused with deterministic objects such as $H(X^{n-1}, Y^{n-1})$ that depend on the distribution of $(X^{n-1}, Y^{n-1})$. Our goal is to prove that $H_Q(\cdot)$ is a good estimator of $H(\mathbf{Y} \| \mathbf{X})$ if $Q$ is a universal sequential probability assignment.

Theorem 1: Let $Q$ be a universal sequential probability assignment on $(\mathcal{X} \times \mathcal{Y})^{\infty}$. Then
$$\lim_{n \to \infty} H_Q(X^{n-1}, Y^{n-1}) = H(\mathbf{Y} \| \mathbf{X}) \quad \text{in } L_1. \qquad (19)$$

Proof: Fix an arbitrary $\epsilon > 0$. Then
$$\begin{aligned}
E\Big| H_Q(X^{n-1}, Y^{n-1}) - \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big|
&= \frac{1}{n} E\Big| \sum_{k=1}^{n} \Big( f\big(Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) - f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big) \Big| \\
&\leq \frac{1}{n} \sum_{k=1}^{n} E\Big| f\big(Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) - f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big| \\
&\overset{(a)}{\leq} \epsilon + K_\epsilon \frac{1}{n} \sum_{k=1}^{n} E\big\| Q_{X_k, Y_k | X^{k-1}, Y^{k-1}} - P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \big\|_1 \\
&\overset{(b)}{\leq} \epsilon + K_\epsilon \frac{1}{n} \sum_{k=1}^{n} E \sqrt{ \tfrac{1}{2} D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) } \\
&\overset{(c)}{\leq} \epsilon + K_\epsilon \frac{1}{n} \sum_{k=1}^{n} \sqrt{ \tfrac{1}{2} E\, D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) } \\
&\overset{(d)}{\leq} \epsilon + K_\epsilon \sqrt{ \frac{1}{2n} \sum_{k=1}^{n} E\, D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) } \\
&\overset{(e)}{=} \epsilon + K_\epsilon \sqrt{ \frac{1}{2n} D\big(P_{X^n, Y^n} \big\| Q_{X^n, Y^n}\big) }, \qquad (20)
\end{aligned}$$
where
• (a) comes from Lemma 1,
• (b) is due to Pinsker's inequality. Note that both $P_{X_k, Y_k | X^{k-1}, Y^{k-1}}$ and $Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}$ are functions of $(X^{k-1}, Y^{k-1})$. Thus $D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big)$ is a random variable,
• (c) and (d) come from the concavity of $\sqrt{\cdot}$,
• (e) is because of the chain rule of the Kullback-Leibler divergence.

Consequently,
$$\begin{aligned}
\limsup_{n \to \infty} E\big| H_Q(X^{n-1}, Y^{n-1}) - H(\mathbf{Y} \| \mathbf{X}) \big|
&\leq \limsup_{n \to \infty} E\Big| H_Q(X^{n-1}, Y^{n-1}) - \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big| \\
&\quad + \limsup_{n \to \infty} E\Big| \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) - H(\mathbf{Y} \| \mathbf{X}) \Big| \\
&\overset{(f)}{=} \limsup_{n \to \infty} E\Big| H_Q(X^{n-1}, Y^{n-1}) - \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big| \\
&\overset{(g)}{\leq} \epsilon + \limsup_{n \to \infty} K_\epsilon \sqrt{ \frac{1}{2n} D\big(P_{X^n, Y^n} \big\| Q_{X^n, Y^n}\big) } \\
&\overset{(h)}{=} \epsilon, \qquad (21)
\end{aligned}$$
where (f) is because of Corollary 1, (g) comes from (20), and (h) is due to Definition 1. Now we can use the arbitrariness of $\epsilon$ to complete the proof.

Set $\mathcal{X} = \emptyset$. If $Q^{\mathbf{Y}}$ is a universal probability assignment for $\mathcal{Y}^{\infty}$, we have
$$\lim_{n \to \infty} H_{Q^{\mathbf{Y}}}(Y^{n-1}) = H(\mathbf{Y}) \quad \text{in } L_1. \qquad (22)$$

To sum up, provided with universal probability assignments $Q^{\mathbf{Y}}$ and $Q^{\mathbf{X},\mathbf{Y}}$ on $\mathcal{Y}^{\infty}$ and $(\mathcal{X}, \mathcal{Y})^{\infty}$, respectively, we can construct an estimate of $I(\mathbf{X} \to \mathbf{Y})$ based on a realization of $(\mathbf{X}, \mathbf{Y})$, which converges to the true value in $L_1$. As we have shown in the previous section, a universal source coding scheme induces a universal probability assignment. Thus, using two universal compressors on $\mathcal{Y}$ and $(\mathcal{X}, \mathcal{Y})$, respectively (in fact, the same source coding method can be used, applied to the two different alphabets), we can obtain a good estimate of the directed information rate from $\mathbf{X}$ to $\mathbf{Y}$.
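A minimal sketch of the plug-in recipe just described (our code, not the paper's implementation): given any two sequential probability assignments, one on $\mathcal{Y}$ and one on $(\mathcal{X}, \mathcal{Y})$, supplied here as hypothetical callables, the estimate of $I(\mathbf{X} \to \mathbf{Y})$ is the difference of the two averaged $f$-values as in (18).

```python
from math import log2

def f(p_xy):
    """Map a one-step joint pmf {(x, y): p} to H(Y|X), Eq. (14), in bits."""
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

def estimate_directed_info(xs, ys, q_joint, q_y):
    """Plug-in estimate of I(X -> Y) = H(Y) - H(Y||X) as in Section III.

    q_joint(x_past, y_past) -> {(a, b): prob}: one-step conditional of the
        sequential probability assignment on (X, Y) pairs, as used in Eq. (18).
    q_y(y_past) -> {b: prob}: one-step conditional of the assignment on Y alone.
    Both callables are assumptions; any universal assignment fits here.
    """
    n = len(ys)
    h_y = sum(f({(0, b): p for b, p in q_y(ys[:i]).items()})        # X empty: f is plain entropy
              for i in range(n)) / n
    h_y_given_x = sum(f(q_joint(xs[:i], ys[:i])) for i in range(n)) / n
    return h_y - h_y_given_x
```

Any universal assignment, such as the one induced by context tree weighting in the next section, can be plugged in for `q_joint` and `q_y`.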


IV. ALGORITHM AND NUMERICAL EXAMPLES

Given a source realization $(x^n, y^n)$ and universal source codes $C_n^{\mathbf{Y}}$ on $\mathcal{Y}$ and $C_n^{\mathbf{X},\mathbf{Y}}$ on $(\mathcal{X}, \mathcal{Y})$, we use the sequential probability assignments $Q^{\mathbf{Y}}$ and $Q^{\mathbf{X},\mathbf{Y}}$ induced by $C_n^{\mathbf{Y}}$ and $C_n^{\mathbf{X},\mathbf{Y}}$ to calculate $H_{Q^{\mathbf{Y}}}(y^{n-1})$ and $H_{Q^{\mathbf{X},\mathbf{Y}}}(x^{n-1}, y^{n-1})$. The first is our estimate of $H(\mathbf{Y})$ and the second is our estimate of $H(\mathbf{Y} \| \mathbf{X})$. The difference of the two gives the estimate of $I(\mathbf{X} \to \mathbf{Y})$. We have proved that when $n$ is large, our estimates will be close to the true value (in the $L_1$ sense). Although our theorem holds true for any universal source coding scheme, in practice a scheme with low complexity and a fast rate of convergence is preferred. We emphasize here that the computation of the sequential probability assignment (11) induced by an arbitrary source code is computationally non-trivial in general.

In our implementation, we choose context tree weighting as our universal source coding scheme. One advantage is that the complexity of the algorithm is linear in the block length $n$, and the algorithm provides the sequential probability assignments $\{Q_{X_i|x^{i-1}}\}_{i=1}^{n}$ directly [16]. Note that while the original context tree weighting was tuned for binary sequences, it has been extended to larger alphabets [17]. As a first-step experiment, we assume a depth $D$ of the context tree that is larger than the memory of the source.

This shortcoming can be overcome by the method introduced in [18], although we did not implement it here. Our algorithm for estimating $H(\mathbf{Y} \| \mathbf{X})$ is explained in Algorithm 1; $H(\mathbf{Y})$ can be obtained similarly, as we have discussed.

Algorithm 1 Universal estimation algorithm based on context tree weighting

Fix block length $n$ and context tree depth $D$.
$\hat{H} \leftarrow 0$
for $i \leftarrow 1, n$ do
    $z_i \leftarrow (x_i, y_i)$   ▷ Make a super symbol with alphabet size $|\mathcal{X}||\mathcal{Y}|$
end for
for $i \leftarrow D + 1, n$ do
    Gather the context $z_{i-D}^{i-1}$ for the $i$-th symbol $z_i$.
    Update the context tree. The estimated probability $\hat{P}_{Z_i|z^{i-1}}(\cdot)$ is obtained along the way.
    Update $\hat{H}$ as $\hat{H} \leftarrow \hat{H} + f\big(\hat{P}_{X_i, Y_i | y^{i-1}, x^{i-1}}\big)$.
end for
$\hat{H} \leftarrow \frac{1}{n} \hat{H}$
$\hat{H}$ is the estimate of $H(\mathbf{Y} \| \mathbf{X})$.
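A full context tree weighting implementation is beyond the scope of a short sketch, so the Python version below (ours) keeps the structure of Algorithm 1 but replaces the context-tree update with a single fixed-order-$D$ Krichevsky-Trofimov estimator per context; it is a cruder sequential probability assignment, intended only to illustrate how the $f(\hat{P}_{X_i, Y_i | \cdot})$ terms are accumulated.

```python
from collections import defaultdict
from math import log2

def f(p_xy):
    """H(Y|X) of a one-step joint pmf {(x, y): p}, Eq. (14), in bits."""
    p_x = defaultdict(float)
    for (x, _), p in p_xy.items():
        p_x[x] += p
    return -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

def estimate_causal_entropy(xs, ys, depth=3, alphabet=((0, 1), (0, 1))):
    """Sketch of Algorithm 1 with the CTW update replaced by a fixed-order-D
    Krichevsky-Trofimov (add-1/2) estimator over super symbols z_i = (x_i, y_i).
    Returns an estimate of H(Y||X) in bits per symbol.
    """
    symbols = [(a, b) for a in alphabet[0] for b in alphabet[1]]
    m = len(symbols)
    counts = defaultdict(lambda: defaultdict(float))    # context -> symbol -> count
    z = list(zip(xs, ys))
    h_hat, terms = 0.0, 0
    for i in range(depth, len(z)):
        ctx = tuple(z[i - depth:i])
        total = sum(counts[ctx].values())
        # Predictive joint distribution of the next (x, y) given the context.
        p_next = {s: (counts[ctx][s] + 0.5) / (total + 0.5 * m) for s in symbols}
        h_hat += f(p_next)
        terms += 1
        counts[ctx][z[i]] += 1                           # sequential update
    return h_hat / max(terms, 1)
```

With the hypothetical `estimate_causal_entropy` above, the estimate of $H(\mathbf{Y})$ is obtained by the same routine with `xs` set to a constant all-zeros sequence and `alphabet=((0,), (0, 1))`, in which case $f$ reduces to the plain entropy of the next $Y$ symbol.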

A. Stationary process passing through a DMC

Let $\mathbf{X}$ be a binary first-order Markov process with transition probability $p$, i.e., $P(X_n \neq X_{n-1} \mid X_{n-1}) = p$. Let $\mathbf{Y}$ be the output of a binary symmetric channel with parameter $\epsilon$ with $\mathbf{X}$ as the input process, as illustrated in Fig. 1.


Fig. 1. Example 1 setup: $\mathbf{X}$ is a binary first-order Markov process.
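For concreteness, the data of Example 1 can be generated in a few lines (our sketch; the function name and the convention that the chain flips with probability $p$ are our reading of the setup):

```python
import random

def markov_bsc(n, p=0.1, eps=0.4, seed=0):
    """Generate Example-1 data: X flips with prob. p, Y = X corrupted by a BSC(eps)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1)]
    for _ in range(n - 1):
        x.append(x[-1] ^ (rng.random() < p))        # flip the previous symbol w.p. p
    y = [xi ^ (rng.random() < eps) for xi in x]     # independent crossovers w.p. eps
    return x, y

xs, ys = markov_bsc(100000)
```

The resulting pair of sequences can then be fed to any of the estimator sketches above, with the roles of $\mathbf{X}$ and $\mathbf{Y}$ swapped to target $I(\mathbf{Y} \to \mathbf{X})$.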

We run our universal estimation algorithm to estimate the directed information rate $I(\mathbf{Y} \to \mathbf{X})$ with $p = 0.1$ and $\epsilon = 0.4$. The results are shown in Fig. 2. The depth of the context tree is set to 3. As the data length grows, the estimated value gets closer to the true value.

Fig. 2. $\mathbf{X}$ is a binary first-order Markov process with transition probability 0.1. $\mathbf{Y}$ is the result of $\mathbf{X}$ passing through a binary symmetric channel with parameter 0.4. The simulation was performed four times, each with data lengths $10^4$, $10^{4.5}$, $10^5$, $10^{5.5}$, $10^6$ and context tree depth 3.

B. Presence of shifts

We use the same setup as in Section IV-A. Instead of observing $X^n$ directly, we assume that a shifted version of $X^n$ is actually observed, i.e., $Z_i = X_{i+d}$, where $d$ is the unit of shift. We use our algorithm to estimate $H(\mathbf{Y} \| \mathbf{Z})$ for different values of $d$. Note that $H(Y_i \mid Y^{i-1}, Z^i) = H(Y_i \mid X_i)$ for $d \geq 0$. For $d < 0$, $H(Y_i \mid Y^{i-1}, Z^i)$ increases with $d$ for large $i$.

We run the algorithm for $p = 0.2$ and $\epsilon = 0.4$ with block length 200000 and context tree depth 3. Note that $H(Y_i \mid X_i) = H_2(0.4)$. In Fig. 3, the simulation results are plotted for different values of $d$.

Fig. 3. $\mathbf{X}$ is a binary first-order Markov process with transition probability 0.2. $\mathbf{Y}$ is the result of $\mathbf{X}$ passing through a binary symmetric channel with parameter 0.4. $\mathbf{Z}$ is a shifted version of $\mathbf{X}$ with $d$ units of shift. The algorithm estimates $H(\mathbf{Y} \| \mathbf{Z})$ with $n = 200000$ and context tree depth 6.
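The alignment implied by $Z_i = X_{i+d}$ deserves a moment of care for negative $d$; a small helper (ours) that trims both sequences to the indices where the pair $(Z_i, Y_i)$ is defined could look as follows:

```python
def shifted_pair(x, y, d):
    """Align (Z, Y) with Z_i = X_{i+d} over the indices where both are defined."""
    n = len(x)
    if d >= 0:
        return x[d:], y[:n - d]
    return x[:n + d], y[-d:]

# z, y_aligned = shifted_pair(xs, ys, d)
# Feed (z, y_aligned) to the causal-entropy estimator sketched earlier to get H(Y||Z).
```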

C. Detection of the presence of feedback

The following example is used by Massey to demonstrate the difference between mutual information (MI) and directed information (DI) when feedback is present in a discrete memoryless channel. The input of a binary symmetric channel is the delayed output of the channel, as shown in Fig. 4.


Fig. 4. Example 2 setup: The output of a binary symmetric channel is used as the input to the channel with unit delay.

The mutual information rate and directed information rate can be easily computed in the following way:

$$\begin{aligned}
I(\mathbf{X}; \mathbf{Y}) &= \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n) \\
&= \lim_{n \to \infty} \frac{1}{n} H(Y^n) - \lim_{n \to \infty} \frac{1}{n} H(Y^n \mid X^n) \\
&= \lim_{n \to \infty} \frac{1}{n} H(Y^n) - \lim_{n \to \infty} \frac{1}{n} H(Y_n \mid X^n) \\
&= H_2(\epsilon), \qquad (23)
\end{aligned}$$
where the second-to-last equality is because $X_i = Y_{i-1}$, and the last equality comes from the fact that $H(Y_n \mid X^n)$ is bounded by 1 (together with the observation that $\frac{1}{n} H(Y^n) \to H_2(\epsilon)$, since $\{Y_i\}$ is a binary Markov chain with $H(Y_i \mid Y_{i-1}) = H_2(\epsilon)$). Here $H_2(t) \triangleq -t \log_2 t - (1-t) \log_2 (1-t)$. Similarly,
$$\begin{aligned}
I(\mathbf{X} \to \mathbf{Y}) &= \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}) - \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}, X^i) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y_{i-1}) - \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y_{i-1}) \\
&= 0. \qquad (24)
\end{aligned}$$

In this setup, the difference between the estimated MI rate and DI rate can be used as an indicator of the presence of feedback. For different values of $\epsilon$, we estimate the MI rate and the DI rate for data size $n = 10^5$, shown in Fig. 5.

Fig. 5. Estimation of the MI rate and DI rate when feedback is present.
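As a sanity check (ours, not in the paper), the conclusion of (23) and (24) can be reproduced empirically. Because this particular example is first-order Markov, simple count-based plug-in estimates of $H(Y_i \mid Y_{i-1})$ and $H(Y_i \mid Y_{i-1}, X_i)$ already recover the MI and DI rates, without invoking the CTW machinery; all names below are our own.

```python
import random
from collections import Counter
from math import log2

def h2(t):
    return 0.0 if t in (0.0, 1.0) else -t * log2(t) - (1 - t) * log2(1 - t)

def cond_entropy(pairs):
    """Empirical H(B|A) in bits from a list of (a, b) samples."""
    joint, marg = Counter(pairs), Counter(a for a, _ in pairs)
    n = len(pairs)
    return -sum(c / n * log2(c / marg[a]) for (a, _), c in joint.items())

def feedback_example(n=100000, eps=0.3, seed=0):
    """Massey's example: X_i = Y_{i-1}, Y_i = X_i xor noise with noise ~ Bern(eps)."""
    rng = random.Random(seed)
    y, x = [0], []
    for _ in range(n):
        x.append(y[-1])                              # feedback: input is the delayed output
        y.append(x[-1] ^ (rng.random() < eps))
    y = y[1:]
    # MI rate = lim (1/n)H(Y^n) - lim (1/n)H(Y^n|X^n); the second term vanishes, Eq. (23),
    # and for this Markov example (1/n)H(Y^n) is estimated by H(Y_i | Y_{i-1}).
    mi_rate = cond_entropy(list(zip(y[:-1], y[1:])))
    # DI rate per Eq. (24): H(Y_i | Y_{i-1}) - H(Y_i | Y_{i-1}, X_i), which is ~0 here.
    di_rate = mi_rate - cond_entropy(list(zip(zip(y[:-1], x[1:]), y[1:])))
    return mi_rate, di_rate, h2(eps)

print(feedback_example())   # MI rate near H_2(0.3) ~ 0.881, DI rate near 0
```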

REFERENCES

[1] J. Massey, "Causality, feedback and directed information," Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), pp. 303–305, Nov. 1990.
[2] G. Kramer, Directed Information for Channels with Feedback, Ph.D. dissertation, Swiss Federal Institute of Technology (ETH) Zurich, 1998.
[3] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Trans. Inf. Theory, vol. IT-49, pp. 4–21, 2003.
[4] Y.-H. Kim, "A coding theorem for a class of stationary channels with feedback," IEEE Trans. Inf. Theory, vol. 25, pp. 1488–1499, April 2008.
[5] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, pp. 323–349, 2009.
[6] H. H. Permuter, T. Weissman, and A. J. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2, pp. 644–662, 2009.
[7] H. Permuter, Y.-H. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inf. Theory, submitted. http://arxiv.org/abs/0912.4872
[8] P. Mathai, N. Martins, and B. Shapiro, "On the detection of gene network interconnections using directed mutual information," ITA 2007.
[9] A. Rao, A. O. Hero III, D. J. States, and J. D. Engel, "Using directed information to build biologically relevant influence networks," Journal of Bioinformatics and Computational Biology, vol. 6, no. 3, pp. 493–519, June 2008.
[10] C. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," Journal of Computational Neuroscience: Special Issue on Methods of Information Theory in Computational Neuroscience, submitted Dec. 15, 2009.
[11] J. Massey and P. C. Massey, "Conservation of mutual and directed information," Proc. Int. Symp. Information Theory (ISIT 2005), pp. 157–158, 2005.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
[13] L. Breiman, Probability. SIAM: Society for Industrial and Applied Mathematics, May 1992.
[14] M. Weinberger, N. Merhav, and M. Feder, "Optimal sequential probability assignment for individual sequences," IEEE Trans. Inf. Theory, vol. 40, no. 2, pp. 384–396, March 1994.
[15] R. Durrett, Probability: Theory and Examples, 3rd ed. Duxbury Press, 2004.
[16] F. Willems, Y. Shtarkov, and T. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653–664, May 1995.
[17] Tj. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems, "Sequential weighting algorithms for multi-alphabet sources," 6th Joint Swedish-Russian International Workshop on Information Theory, Molle, Sweden, 1993, pp. 230–234.
[18] F. Willems, "The context-tree weighting method: Extensions," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 792–798, Mar. 1998.
