Universal Estimation of Directed Information

Lei Zhao, Haim Permuter, Young-Han Kim, and Tsachy Weissman

Lei Zhao and Tsachy Weissman are with the Department of Electrical Engineering, Stanford University. Email: {leiz,tsachy}@stanford.edu. Haim Permuter is with the Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Israel. Email: [email protected]. Young-Han Kim is with the Department of Electrical and Computer Engineering, University of California, San Diego. Email: [email protected]

Abstract—In this paper, we develop a universal algorithm to estimate Massey's directed information for stationary ergodic processes. The sequential probability assignment induced by a universal source code plays the critical role in the estimation. In particular, we use context tree weighting to implement the algorithm. Some numerical results are provided to illustrate the performance of the proposed algorithm.

I. INTRODUCTION

First introduced by Massey in [1], directed information arises as a natural counterpart of mutual information for channel capacity when feedback is present. In [2] and [3], Kramer extended the use of directed information to discrete memoryless networks with feedback, including the two-way channel and the multiple access channel. For a class of stationary channels with feedback, where the output is a function of the current and past $m$ inputs and channel noise, Kim [4] proved that the feedback capacity is equal to the limit of the supremum of the normalized directed information from the input to the output. Tatikonda and Mitter [5] used directed information to prove a general feedback channel coding theorem for channels with memory. In [6], Permuter et al. considered the capacity of discrete-time channels with feedback where the feedback is a time-invariant deterministic function of the output. Under mild conditions, they showed that the capacity is the maximum of the normalized directed information between the input and output sequence in the limit. Recently, Permuter et al. [7] showed that directed information plays an important role in portfolio theory, data compression, and hypothesis testing, where causality constraints exist.

Besides information theory, directed information has been shown to be a valuable tool in biology, when inference about causality is needed. In [8], directed information was used to identify pairwise influence. The authors in [9] used directed information to test the inference of influence in gene networks. Thus it is of both theoretical and practical interest to develop a way to estimate directed information efficiently.

As we were completing this paper, [10] was brought to our attention, in which the authors used directed information to infer causal relationships in ensemble neural spike train recordings. At the heart of both our estimation framework and theirs is the estimation of the causally conditional entropy. The main difference is that they took a parametric approach [10, Assumption 3, Page 10], while our approach is based on non-parametric universal data compressors, and therefore leads to stronger universality properties.

Notation: We use a capital letter $X$ to denote a random variable and a small letter $x$ to denote the corresponding realization or constant. A calligraphic letter $\mathcal{X}$ denotes the alphabet of $X$ and $|\mathcal{X}|$ denotes the cardinality of the alphabet.

II. PRELIMINARIES

We first give the mathematical definitions of directed information and causally conditional entropy, and then discuss the relation between universal sequential probability assignment and universal source coding.

A. Directed information

Directed information from $X^n$ to $Y^n$ is defined as
$$I(X^n \to Y^n) = H(Y^n) - H(Y^n \| X^n), \qquad (1)$$
where $H(Y^n \| X^n)$ is the causally conditional entropy [2], defined as
$$H(Y^n \| X^n) = \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}, X^i). \qquad (2)$$
Compared with the definition of mutual information,
$$I(X^n; Y^n) = H(X^n) - H(X^n \mid Y^n),$$
the conditional entropy is replaced by the causal conditioning. And unlike mutual information, directed information is not symmetric, i.e., $I(Y^n \to X^n) \neq I(X^n \to Y^n)$ in general. Other interesting properties of directed information, such as the conservation law, can be found in [2], [11].

For random processes $\mathbf{X}, \mathbf{Y}$ which are jointly stationary, we can define the directed information rate [2] as follows:
$$H(\mathbf{Y} \| \mathbf{X}) = \lim_{n \to \infty} \frac{1}{n} H(Y^n \| X^n), \qquad (3)$$
$$I(\mathbf{X} \to \mathbf{Y}) = \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n). \qquad (4)$$
The existence of the limit can be checked as follows:
$$\begin{aligned}
I(\mathbf{X} \to \mathbf{Y}) &= \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n) \\
&= \lim_{n \to \infty} \frac{1}{n} \big( H(Y^n) - H(Y^n \| X^n) \big) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}) - \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}, X^i) \\
&= H(Y_0 \mid Y_{-\infty}^{-1}) - H(Y_0 \mid X_{-\infty}^{0}, Y_{-\infty}^{-1}),
\end{aligned}$$
where the last equality is obtained via the property of the Cesàro mean [12] and standard martingale arguments [13]. Note that the entropy rate $H(\mathbf{Y})$ of the process $\mathbf{Y}$ is equal to $H(Y_0 \mid Y_{-\infty}^{-1})$, and $H(\mathbf{Y} \| \mathbf{X}) = H(Y_0 \mid X_{-\infty}^{0}, Y_{-\infty}^{-1})$. Thus
$$I(\mathbf{X} \to \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \| \mathbf{X}). \qquad (5)$$
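To make definitions (1), (2) and (5) concrete, the following brute-force sketch (our illustration, not part of the paper) computes $I(X^n \to Y^n)$ for a small block length directly from a known joint pmf of $(X^n, Y^n)$; the dictionary-based pmf representation and function names are our own.

```python
import itertools
from math import log2

def directed_information(pmf, n):
    """Brute-force I(X^n -> Y^n) = H(Y^n) - H(Y^n || X^n), Eqs. (1)-(2), in bits.

    pmf maps ((x_1,...,x_n), (y_1,...,y_n)) tuples to probabilities.
    """
    def cond_entropy(i, with_x):
        # H(Y_i | Y^{i-1}, X^i) if with_x, else H(Y_i | Y^{i-1}).
        joint, marg = {}, {}
        for (xs, ys), p in pmf.items():
            ctx = (xs[:i] if with_x else ()) + ys[:i - 1]
            joint[ctx, ys[i - 1]] = joint.get((ctx, ys[i - 1]), 0.0) + p
            marg[ctx] = marg.get(ctx, 0.0) + p
        return -sum(p * log2(p / marg[ctx])
                    for (ctx, _), p in joint.items() if p > 0)

    h_y = sum(cond_entropy(i, with_x=False) for i in range(1, n + 1))       # H(Y^n) by the chain rule
    h_y_causal = sum(cond_entropy(i, with_x=True) for i in range(1, n + 1)) # H(Y^n || X^n), Eq. (2)
    return h_y - h_y_causal

# Toy check: Y_i = X_i over a noiseless channel with i.i.d. fair inputs,
# so I(X^n -> Y^n) = n bits.
n = 3
pmf = {(xs, xs): 2.0 ** -n for xs in itertools.product((0, 1), repeat=n)}
print(directed_information(pmf, n))  # -> 3.0
```

For stationary ergodic processes such a brute-force computation is of course infeasible, which is what motivates the universal estimator developed in the rest of the paper.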


B. Universal sequential probability assignment and universal source coding

A sequential probability assignment $Q$ consists of a set of conditional probabilities $\{Q_{X_i|x^{i-1}}(\cdot),\ \forall x^{i-1} \in \mathcal{X}^{i-1}\}_{i=1}^{\infty}$. Note that $Q$ induces a probability distribution on $\mathcal{X}^{\infty}$ in the obvious way.
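As a concrete toy instance (ours, not the assignment used in the paper), the Krichevsky-Trofimov add-1/2 rule is a valid sequential probability assignment on binary sequences; context tree weighting, used later, mixes many such estimators over contexts.

```python
from fractions import Fraction

def kt_sequential_assignment(x):
    """Krichevsky-Trofimov (add-1/2) sequential probability assignment for a
    binary sequence x: returns Q_{X_i | x^{i-1}}(x_i) for each i.

    A toy instance of a sequential probability assignment, not the paper's.
    """
    counts = [0, 0]
    probs = []
    for symbol in x:
        # Q(symbol | past) = (count(symbol) + 1/2) / (number of past symbols + 1)
        total = counts[0] + counts[1]
        probs.append(Fraction(2 * counts[symbol] + 1, 2 * total + 2))
        counts[symbol] += 1
    return probs

# Conditional probabilities assigned along the sequence 0, 0, 1, 0:
q = kt_sequential_assignment([0, 0, 1, 0])
print(q)                  # Fractions 1/2, 3/4, 1/6, 5/8
prod = Fraction(1)
for p in q:
    prod *= p             # Q(x^n) is the product of the conditionals
print(prod)               # -> 5/128
```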

Definition 1: A sequential probability assignment $Q$ is universal if
$$\limsup_{n \to \infty} \frac{1}{n} D(P_{X^n} \| Q_{X^n}) = 0 \qquad (6)$$
for any stationary probability measure $P$.

A source code for an $n$-block source sequence, $C_n$, is defined as a mapping from a source sequence $x^n$ to a binary sequence of finite length, i.e.,
$$C_n : \mathcal{X}^n \to \{0,1\}^*. \qquad (7)$$
More explicitly,
$$C_n(x^n) = b_1, b_2, \ldots, b_{l_n}, \qquad (8)$$
where $l_n = l_n(x^n)$ is the code length. Furthermore, $C_n$ is said to be non-singular if $C_n(x^n) \neq C_n(y^n)$ for all $x^n \neq y^n$. It is said to be uniquely decodable if all its extensions are non-singular [12]. The codeword lengths $l_n(\cdot)$ of any uniquely decodable code must satisfy the Kraft inequality
$$\sum_{x^n \in \mathcal{X}^n} 2^{-l_n(x^n)} \leq 1. \qquad (9)$$
See [12] for a proof.

For any uniquely decodable code, we have
$$\frac{1}{n} E\, l_n(X^n) = \frac{1}{n} H(X^n) + \frac{1}{n} D(P_{X^n} \| Q_{X^n}) - \frac{1}{n} \log k_n, \qquad (10)$$
where $k_n = \sum_{x^n} 2^{-l(x^n)}$ and $Q(x^n) = \frac{2^{-l(x^n)}}{k_n}$. $Q(x^n)$ induces a probability measure on $\mathcal{X}^n$. With slight abuse of notation, call this measure $Q$.

Definition 2: The sequential probability assignment induced by a source code $C_n$ is the set of conditional probabilities $\{Q_{X_i|x^{i-1}}\}_{i=1}^{n}$, where
$$Q_{X_i|x^{i-1}}(a) = \frac{Q(a x^{i-1})}{Q(x^{i-1})}, \qquad (11)$$
where $a x^{i-1}$ denotes the concatenation of the symbol $a$ and the sequence $x^{i-1}$.

Definition 3: A source coding scheme is a sequence of source codes. It is said to be universal if each code is uniquely decodable and
$$\limsup_{n \to \infty} \frac{1}{n} E[l_n(X^n)] = H(\mathbf{X}) \qquad (12)$$
for every stationary source $\mathbf{X}$.

The per-symbol expected number of bits based on the universal source coding scheme is a good estimate of $H(\mathbf{X})$. By the Kraft inequality, $-\frac{1}{n} \log k_n \geq 0$. Given a universal source coding scheme and a stationary source $\mathbf{X}$, by (10),
$$\limsup_{n \to \infty} \frac{1}{n} D(P_{X^n} \| Q_{X^n}) \leq \limsup_{n \to \infty} \left( \frac{1}{n} E\, l_n(X^n) - \frac{1}{n} H(X^n) \right) = 0. \qquad (13)$$
Thus we can construct a universal sequential probability assignment from a universal coding scheme.
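The construction behind (10) and (11) can be carried out mechanically once the codeword lengths are available. The sketch below (our illustration, using a hypothetical toy code) normalizes $2^{-l(x^n)}$ into the measure $Q$, marginalizes over suffixes, and reads (11) as the probability of extending the prefix $x^{i-1}$ by the symbol $a$.

```python
def induced_assignment(lengths):
    """From codeword lengths l(x^n), build Q(x^n) = 2^{-l(x^n)} / k_n as in Eq. (10)
    and return the induced conditional Q_{X_i | x^{i-1}}(a) of Eq. (11), read here
    as the probability of extending the prefix x^{i-1} by the symbol a.

    `lengths` maps full sequences (tuples) to integer codeword lengths.
    """
    k_n = sum(2.0 ** -l for l in lengths.values())
    q_full = {seq: 2.0 ** -l / k_n for seq, l in lengths.items()}

    def q_prefix(prefix):
        # Marginal of Q on the first len(prefix) symbols.
        return sum(p for seq, p in q_full.items() if seq[:len(prefix)] == prefix)

    def conditional(prefix, a):
        return q_prefix(prefix + (a,)) / q_prefix(prefix)

    return conditional

# Hypothetical uniquely decodable code on {0,1}^2 with lengths l(00)=1, l(01)=2, l(10)=3, l(11)=3.
lengths = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 3}
cond = induced_assignment(lengths)
print(cond((), 0))     # Q_{X_1}(0)            -> 0.75
print(cond((0,), 1))   # Q_{X_2 | x_1 = 0}(1)  -> 1/3
```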


III. ESTIMATION OF $I(\mathbf{X} \to \mathbf{Y})$

As we have seen, $I(\mathbf{X} \to \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y} \| \mathbf{X})$. In this section, we will show an estimate of $H(\mathbf{Y} \| \mathbf{X})$ based on a universal sequential probability assignment. A similar method applies to the estimate of $H(\mathbf{Y})$.

Let $\mathcal{M}(\mathcal{X}, \mathcal{Y})$ be the set of all distributions on $\mathcal{X} \times \mathcal{Y}$. Define $f$ as the function that maps a joint distribution $P_{X,Y}$ of a random vector $(X, Y)$ to the corresponding conditional entropy $H(Y|X)$, i.e.,
$$f(P_{X,Y}) = -\sum_{x,y} P_{X,Y}(x, y) \log P_{Y|X}(y|x), \qquad (14)$$
where $P_{Y|X}(\cdot|\cdot)$ is the conditional distribution induced by $P_{X,Y}$.

Lemma 1: For any $\epsilon > 0$, there exists $K_\epsilon > 0$ such that for all $P$ and $Q$ in $\mathcal{M}(\mathcal{X}, \mathcal{Y})$:
$$|f(P) - f(Q)| \leq \epsilon + K_\epsilon \|P - Q\|_1,$$
where $\|\cdot\|_1$ is the $l_1$ norm (viewing $P$ and $Q$ as $|\mathcal{X}||\mathcal{Y}|$-dimensional simplex vectors).

Proof: Fix $\epsilon > 0$. Since $\mathcal{M}(\mathcal{X}, \mathcal{Y})$ is bounded and closed, $f(\cdot)$ is uniformly continuous. Thus there exists $\delta_\epsilon$ such that $|f(P) - f(Q)| \leq \epsilon$ if $\|P - Q\|_1 \leq \delta_\epsilon$. Furthermore, $f(\cdot)$ is bounded by $f_{\max} \triangleq \log|\mathcal{X}| + \log|\mathcal{Y}|$. We have
$$\begin{aligned}
|f(P) - f(Q)| &\leq \epsilon\, 1\{\|P-Q\|_1 \leq \delta_\epsilon\} + f_{\max}\, 1\{\|P-Q\|_1 > \delta_\epsilon\} \\
&\leq \epsilon + f_{\max} \frac{\|P - Q\|_1}{\delta_\epsilon} \\
&= \epsilon + K_\epsilon \|P - Q\|_1, \qquad (15)
\end{aligned}$$
where $K_\epsilon = \frac{f_{\max}}{\delta_\epsilon}$.

Lemma 2 [15]: Let $\mathbf{X}$ be a stationary ergodic process. If $\lim_{k \to \infty} g_k(\mathbf{X}) = g(\mathbf{X})$ w.p. 1 and $\{g_k(\cdot)\}$ are bounded, then
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} g_k\big(T^k(\mathbf{X})\big) = E\, g(\mathbf{X}), \quad \text{w.p. 1 and in } L_1, \qquad (16)$$
where $T(\cdot)$ is the shift operator.

Now define $g_k(\mathbf{X}, \mathbf{Y}) = f\big(P_{X_0, Y_0 | X_{-k}^{-1}, Y_{-k}^{-1}}\big)$ for a jointly stationary and ergodic process $(\mathbf{X}, \mathbf{Y})$. Note that, by martingale convergence [13], $g_k(\mathbf{X}, \mathbf{Y}) \to g(\mathbf{X}, \mathbf{Y})$ w.p. 1, where $g(\mathbf{X}, \mathbf{Y}) = f\big(P_{X_0, Y_0 | X_{-\infty}^{-1}, Y_{-\infty}^{-1}}\big)$. Noting further that $E\, g(\mathbf{X}, \mathbf{Y}) = H(\mathbf{Y} \| \mathbf{X})$, we can apply Lemma 2 and get the following corollary:

Corollary 1:
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) = H(\mathbf{Y} \| \mathbf{X}) \quad \text{w.p. 1 and in } L_1. \qquad (17)$$

Let $Q$ be a universal sequential probability assignment on $(\mathcal{X} \times \mathcal{Y})^{\infty}$. Denote the normalized causally conditional entropy estimate induced by $Q$ by
$$H_Q(X^{n-1}, Y^{n-1}) \triangleq \frac{1}{n} \sum_{i=1}^{n} f\big(Q_{X_i, Y_i | X^{i-1}, Y^{i-1}}\big). \qquad (18)$$
Note that $H_Q(X^{n-1}, Y^{n-1})$ is a random variable, which is a function of the pair $(X^{n-1}, Y^{n-1})$, not to be confused with deterministic objects such as $H(X^{n-1}, Y^{n-1})$ that depend on the distribution of $(X^{n-1}, Y^{n-1})$. Our goal is to prove that $H_Q(\cdot)$ is a good estimator of $H(\mathbf{Y} \| \mathbf{X})$ if $Q$ is a universal sequential probability assignment.

Theorem 1: Let $Q$ be a universal sequential probability assignment on $(\mathcal{X} \times \mathcal{Y})^{\infty}$. Then
$$\lim_{n \to \infty} H_Q(X^{n-1}, Y^{n-1}) = H(\mathbf{Y} \| \mathbf{X}) \quad \text{in } L_1. \qquad (19)$$

Proof: Fix an arbitrary $\epsilon > 0$. Then
$$\begin{aligned}
E\Big| H_Q(X^{n-1}, Y^{n-1}) - \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big|
&= \frac{1}{n} E\Big| \sum_{k=1}^{n} \Big( f\big(Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) - f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big) \Big| \\
&\leq \frac{1}{n} \sum_{k=1}^{n} E\Big| f\big(Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) - f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big| \\
&\overset{(a)}{\leq} \epsilon + K_\epsilon \frac{1}{n} \sum_{k=1}^{n} E\big\| Q_{X_k, Y_k | X^{k-1}, Y^{k-1}} - P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \big\|_1 \\
&\overset{(b)}{\leq} \epsilon + K_\epsilon \frac{1}{n} \sum_{k=1}^{n} E \sqrt{ \tfrac{1}{2} D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) } \\
&\overset{(c)}{\leq} \epsilon + K_\epsilon \frac{1}{n} \sum_{k=1}^{n} \sqrt{ \tfrac{1}{2} E\, D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) } \\
&\overset{(d)}{\leq} \epsilon + K_\epsilon \sqrt{ \frac{1}{2n} \sum_{k=1}^{n} E\, D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) } \\
&\overset{(e)}{=} \epsilon + K_\epsilon \sqrt{ \frac{1}{2n} D\big(P_{X^n, Y^n} \big\| Q_{X^n, Y^n}\big) }, \qquad (20)
\end{aligned}$$
where
• (a) comes from Lemma 1,
• (b) is due to Pinsker's inequality. Note that both $P_{X_k, Y_k | X^{k-1}, Y^{k-1}}$ and $Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}$ are functions of $(X^{k-1}, Y^{k-1})$. Thus $D\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}} \,\big\|\, Q_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big)$ is a random variable,
• (c) and (d) come from the concavity of $\sqrt{\cdot}$,
• (e) is because of the chain rule of the Kullback-Leibler divergence.

Consequently,
$$\begin{aligned}
\limsup_{n \to \infty} E\big| H_Q(X^{n-1}, Y^{n-1}) - H(\mathbf{Y} \| \mathbf{X}) \big|
&\leq \limsup_{n \to \infty} E\Big| H_Q(X^{n-1}, Y^{n-1}) - \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big| \\
&\quad + \limsup_{n \to \infty} E\Big| \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) - H(\mathbf{Y} \| \mathbf{X}) \Big| \\
&\overset{(f)}{=} \limsup_{n \to \infty} E\Big| H_Q(X^{n-1}, Y^{n-1}) - \frac{1}{n} \sum_{k=1}^{n} f\big(P_{X_k, Y_k | X^{k-1}, Y^{k-1}}\big) \Big| \\
&\overset{(g)}{\leq} \epsilon + \limsup_{n \to \infty} K_\epsilon \sqrt{ \frac{1}{2n} D\big(P_{X^n, Y^n} \big\| Q_{X^n, Y^n}\big) } \\
&\overset{(h)}{=} \epsilon, \qquad (21)
\end{aligned}$$
where (f) is because of Corollary 1, (g) comes from (20), and (h) is due to Definition 1. Now we can use the arbitrariness of $\epsilon$ to complete the proof.

Set $\mathcal{X} = \emptyset$. If $Q^{\mathbf{Y}}$ is a universal probability assignment for $\mathcal{Y}^{\infty}$, we have
$$\lim_{n \to \infty} H_{Q^{\mathbf{Y}}}(Y^{n-1}) = H(\mathbf{Y}) \quad \text{in } L_1. \qquad (22)$$

To sum up, provided with universal probability assignments $Q^{\mathbf{Y}}$ and $Q^{\mathbf{X},\mathbf{Y}}$ on $\mathcal{Y}^{\infty}$ and $(\mathcal{X}, \mathcal{Y})^{\infty}$, respectively, we can construct an estimate of $I(\mathbf{X} \to \mathbf{Y})$ based on a realization of $(\mathbf{X}, \mathbf{Y})$, which converges to the true value in $L_1$. As we have shown in the previous section, a universal source coding scheme induces a universal probability assignment. Thus, using two universal compressors on $\mathcal{Y}$ and $(\mathcal{X}, \mathcal{Y})$, respectively (in fact, the same source coding method can be used, applied to the two different alphabets), we can obtain a good estimate of the directed information rate from $\mathbf{X}$ to $\mathbf{Y}$.
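A minimal sketch of the plug-in recipe just described (our code, not the paper's implementation): given any two sequential probability assignments, one on $\mathcal{Y}$ and one on $(\mathcal{X}, \mathcal{Y})$, supplied here as hypothetical callables, the estimate of $I(\mathbf{X} \to \mathbf{Y})$ is the difference of the two averaged $f$-values as in (18).

```python
from math import log2

def f(p_xy):
    """Map a one-step joint pmf {(x, y): p} to H(Y|X), Eq. (14), in bits."""
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

def estimate_directed_info(xs, ys, q_joint, q_y):
    """Plug-in estimate of I(X -> Y) = H(Y) - H(Y||X) as in Section III.

    q_joint(x_past, y_past) -> {(a, b): prob}: one-step conditional of the
        sequential probability assignment on (X, Y) pairs, as used in Eq. (18).
    q_y(y_past) -> {b: prob}: one-step conditional of the assignment on Y alone.
    Both callables are assumptions; any universal assignment fits here.
    """
    n = len(ys)
    h_y = sum(f({(0, b): p for b, p in q_y(ys[:i]).items()})        # X empty: f is plain entropy
              for i in range(n)) / n
    h_y_given_x = sum(f(q_joint(xs[:i], ys[:i])) for i in range(n)) / n
    return h_y - h_y_given_x
```

Any universal assignment, such as the one induced by context tree weighting in the next section, can be plugged in for `q_joint` and `q_y`.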


IV. ALGORITHM AND NUMERICAL EXAMPLES

Given a source realization $(x^n, y^n)$ and universal source codes $C_n^{\mathbf{Y}}$ on $\mathcal{Y}$ and $C_n^{\mathbf{X},\mathbf{Y}}$ on $(\mathcal{X}, \mathcal{Y})$, we use the sequential probability assignments $Q^{\mathbf{Y}}$ and $Q^{\mathbf{X},\mathbf{Y}}$ induced by $C_n^{\mathbf{Y}}$ and $C_n^{\mathbf{X},\mathbf{Y}}$ to calculate $H_{Q^{\mathbf{Y}}}(y^{n-1})$ and $H_{Q^{\mathbf{X},\mathbf{Y}}}(x^{n-1}, y^{n-1})$. The first is our estimate of $H(\mathbf{Y})$ and the second is our estimate of $H(\mathbf{Y} \| \mathbf{X})$. The difference of the two gives the estimate of $I(\mathbf{X} \to \mathbf{Y})$. We have proved that when $n$ is large, our estimates will be close to the true value (in the $L_1$ sense). Although our theorem holds true for any universal source coding scheme, in practice a scheme with low complexity and a fast rate of convergence is preferred. We emphasize here that the computation of the sequential probability assignment (11) induced by an arbitrary source code is computationally non-trivial in general.

In our implementation, we choose context tree weighting as our universal source coding scheme. One advantage is that the complexity of the algorithm is linear in the block length $n$, and the algorithm provides the sequential probability assignments $\{Q_{X_i|x^{i-1}}\}_{i=1}^{n}$ directly [16]. Note that while the original context tree weighting was tuned for binary sequences, it has been extended to larger alphabets [17]. As a first-step experiment, we assume a depth $D$ of the context tree that is larger than the memory of the source.

This shortcoming can be overcome by the method introduced in [18], although we did not implement it here. Our algorithm for estimating $H(\mathbf{Y} \| \mathbf{X})$ is explained in Algorithm 1; $H(\mathbf{Y})$ can be obtained similarly, as we have discussed.

Algorithm 1 Universal estimation algorithm based on context tree weighting

Fix block length $n$ and context tree depth $D$.
$\hat{H} \leftarrow 0$
for $i \leftarrow 1, n$ do
    $z_i \leftarrow (x_i, y_i)$   ▷ Make a super symbol with alphabet size $|\mathcal{X}||\mathcal{Y}|$
end for
for $i \leftarrow D + 1, n$ do
    Gather the context $z_{i-D}^{i-1}$ for the $i$-th symbol $z_i$.
    Update the context tree. The estimated probability $\hat{P}_{Z_i|z^{i-1}}(\cdot)$ is obtained along the way.
    Update $\hat{H}$ as $\hat{H} \leftarrow \hat{H} + f\big(\hat{P}_{X_i, Y_i | y^{i-1}, x^{i-1}}\big)$.
end for
$\hat{H} \leftarrow \frac{1}{n} \hat{H}$
$\hat{H}$ is the estimate of $H(\mathbf{Y} \| \mathbf{X})$.
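A full context tree weighting implementation is beyond the scope of a short sketch, so the Python version below (ours) keeps the structure of Algorithm 1 but replaces the context-tree update with a single fixed-order-$D$ Krichevsky-Trofimov estimator per context; it is a cruder sequential probability assignment, intended only to illustrate how the $f(\hat{P}_{X_i, Y_i | \cdot})$ terms are accumulated.

```python
from collections import defaultdict
from math import log2

def f(p_xy):
    """H(Y|X) of a one-step joint pmf {(x, y): p}, Eq. (14), in bits."""
    p_x = defaultdict(float)
    for (x, _), p in p_xy.items():
        p_x[x] += p
    return -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

def estimate_causal_entropy(xs, ys, depth=3, alphabet=((0, 1), (0, 1))):
    """Sketch of Algorithm 1 with the CTW update replaced by a fixed-order-D
    Krichevsky-Trofimov (add-1/2) estimator over super symbols z_i = (x_i, y_i).
    Returns an estimate of H(Y||X) in bits per symbol.
    """
    symbols = [(a, b) for a in alphabet[0] for b in alphabet[1]]
    m = len(symbols)
    counts = defaultdict(lambda: defaultdict(float))    # context -> symbol -> count
    z = list(zip(xs, ys))
    h_hat, terms = 0.0, 0
    for i in range(depth, len(z)):
        ctx = tuple(z[i - depth:i])
        total = sum(counts[ctx].values())
        # Predictive joint distribution of the next (x, y) given the context.
        p_next = {s: (counts[ctx][s] + 0.5) / (total + 0.5 * m) for s in symbols}
        h_hat += f(p_next)
        terms += 1
        counts[ctx][z[i]] += 1                           # sequential update
    return h_hat / max(terms, 1)
```

With the hypothetical `estimate_causal_entropy` above, the estimate of $H(\mathbf{Y})$ is obtained by the same routine with `xs` set to a constant all-zeros sequence and `alphabet=((0,), (0, 1))`, in which case $f$ reduces to the plain entropy of the next $Y$ symbol.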

A. Stationary process passing through a DMC

Let $\mathbf{X}$ be a binary first-order Markov process with transition probability $p$, i.e., $P(X_n \neq X_{n-1} \mid X_{n-1}) = p$. Let $\mathbf{Y}$ be the output of a binary symmetric channel with parameter $\epsilon$ with $\mathbf{X}$ as the input process, as illustrated in Fig. 1.


Fig. 1. Example 1 setup: $\mathbf{X}$ is a binary first-order Markov process.
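For concreteness, the data of Example 1 can be generated in a few lines (our sketch; the function name and the convention that the chain flips with probability $p$ are our reading of the setup):

```python
import random

def markov_bsc(n, p=0.1, eps=0.4, seed=0):
    """Generate Example-1 data: X flips with prob. p, Y = X corrupted by a BSC(eps)."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1)]
    for _ in range(n - 1):
        x.append(x[-1] ^ (rng.random() < p))        # flip the previous symbol w.p. p
    y = [xi ^ (rng.random() < eps) for xi in x]     # independent crossovers w.p. eps
    return x, y

xs, ys = markov_bsc(100000)
```

The resulting pair of sequences can then be fed to any of the estimator sketches above, with the roles of $\mathbf{X}$ and $\mathbf{Y}$ swapped to target $I(\mathbf{Y} \to \mathbf{X})$.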

We run our universal estimation algorithm to estimate the directed information rate $I(\mathbf{Y} \to \mathbf{X})$ with $p = 0.1$ and $\epsilon = 0.4$. The results are shown in Fig. 2. The depth of the context tree is set to 3. As the data length grows, the estimated value gets closer to the true value.

Fig. 2. $\mathbf{X}$ is a binary first-order Markov process with transition probability 0.1. $\mathbf{Y}$ is the result of $\mathbf{X}$ passing through a binary symmetric channel with parameter 0.4. The simulation was performed four times, each with data lengths $10^4$, $10^{4.5}$, $10^5$, $10^{5.5}$, $10^6$ and context tree depth 3.

B. Presence of shifts

We use the same setup as in Section IV-A. Instead of observing $X^n$ directly, we assume that a shifted version of $X^n$ is actually observed, i.e., $Z_i = X_{i+d}$, where $d$ is the unit of shift. We use our algorithm to estimate $H(\mathbf{Y} \| \mathbf{Z})$ for different values of $d$. Note that $H(Y_i \mid Y^{i-1}, Z^i) = H(Y_i \mid X_i)$ for $d \geq 0$. For $d < 0$, $H(Y_i \mid Y^{i-1}, Z^i)$ increases with $d$ for large $i$.

We run the algorithm for $p = 0.2$ and $\epsilon = 0.4$ with block length 200000 and context tree depth 3. Note that $H(Y_i \mid X_i) = H_2(0.4)$. In Fig. 3, the simulation results are plotted for different values of $d$.

Fig. 3. $\mathbf{X}$ is a binary first-order Markov process with transition probability 0.2. $\mathbf{Y}$ is the result of $\mathbf{X}$ passing through a binary symmetric channel with parameter 0.4. $\mathbf{Z}$ is a shifted version of $\mathbf{X}$ with $d$ units of shift. The algorithm estimates $H(\mathbf{Y} \| \mathbf{Z})$ with $n = 200000$ and context tree depth 6.
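The alignment implied by $Z_i = X_{i+d}$ deserves a moment of care for negative $d$; a small helper (ours) that trims both sequences to the indices where the pair $(Z_i, Y_i)$ is defined could look as follows:

```python
def shifted_pair(x, y, d):
    """Align (Z, Y) with Z_i = X_{i+d} over the indices where both are defined."""
    n = len(x)
    if d >= 0:
        return x[d:], y[:n - d]
    return x[:n + d], y[-d:]

# z, y_aligned = shifted_pair(xs, ys, d)
# Feed (z, y_aligned) to the causal-entropy estimator sketched earlier to get H(Y||Z).
```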

C. Detection of the presence of feedback

The following example is used by Massey to demonstrate the difference between mutual information (MI) and directed information (DI) when feedback is present in a discrete memoryless channel. The input of a binary symmetric channel is the delayed output of the channel, as shown in Fig. 4.


Fig. 4. Example 2 setup: The output of a binary symmetric channel is used as the input to the channel with unit delay.

The mutual information rate and directed information rate can be easily computed in the following way:

$$\begin{aligned}
I(\mathbf{X}; \mathbf{Y}) &= \lim_{n \to \infty} \frac{1}{n} I(X^n; Y^n) \\
&= \lim_{n \to \infty} \frac{1}{n} H(Y^n) - \lim_{n \to \infty} \frac{1}{n} H(Y^n \mid X^n) \\
&= \lim_{n \to \infty} \frac{1}{n} H(Y^n) - \lim_{n \to \infty} \frac{1}{n} H(Y_n \mid X^n) \\
&= H_2(\epsilon), \qquad (23)
\end{aligned}$$
where the second-to-last equality is because $X_i = Y_{i-1}$, and the last equality comes from the fact that $H(Y_n \mid X^n)$ is bounded by 1 (together with the observation that $\frac{1}{n} H(Y^n) \to H_2(\epsilon)$, since $\{Y_i\}$ is a binary Markov chain with $H(Y_i \mid Y_{i-1}) = H_2(\epsilon)$). Here $H_2(t) \triangleq -t \log_2 t - (1-t) \log_2 (1-t)$. Similarly,
$$\begin{aligned}
I(\mathbf{X} \to \mathbf{Y}) &= \lim_{n \to \infty} \frac{1}{n} I(X^n \to Y^n) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}) - \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y^{i-1}, X^i) \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y_{i-1}) - \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} H(Y_i \mid Y_{i-1}) \\
&= 0. \qquad (24)
\end{aligned}$$

In this setup, the difference between the estimated MI rate and DI rate can be used as an indicator of the presence of feedback. For different values of $\epsilon$, we estimate the MI rate and the DI rate for data size $n = 10^5$, shown in Fig. 5.

Fig. 5. Estimation of the MI rate and DI rate when feedback is present.
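As a sanity check (ours, not in the paper), the conclusion of (23) and (24) can be reproduced empirically. Because this particular example is first-order Markov, simple count-based plug-in estimates of $H(Y_i \mid Y_{i-1})$ and $H(Y_i \mid Y_{i-1}, X_i)$ already recover the MI and DI rates, without invoking the CTW machinery; all names below are our own.

```python
import random
from collections import Counter
from math import log2

def h2(t):
    return 0.0 if t in (0.0, 1.0) else -t * log2(t) - (1 - t) * log2(1 - t)

def cond_entropy(pairs):
    """Empirical H(B|A) in bits from a list of (a, b) samples."""
    joint, marg = Counter(pairs), Counter(a for a, _ in pairs)
    n = len(pairs)
    return -sum(c / n * log2(c / marg[a]) for (a, _), c in joint.items())

def feedback_example(n=100000, eps=0.3, seed=0):
    """Massey's example: X_i = Y_{i-1}, Y_i = X_i xor noise with noise ~ Bern(eps)."""
    rng = random.Random(seed)
    y, x = [0], []
    for _ in range(n):
        x.append(y[-1])                              # feedback: input is the delayed output
        y.append(x[-1] ^ (rng.random() < eps))
    y = y[1:]
    # MI rate = lim (1/n)H(Y^n) - lim (1/n)H(Y^n|X^n); the second term vanishes, Eq. (23),
    # and for this Markov example (1/n)H(Y^n) is estimated by H(Y_i | Y_{i-1}).
    mi_rate = cond_entropy(list(zip(y[:-1], y[1:])))
    # DI rate per Eq. (24): H(Y_i | Y_{i-1}) - H(Y_i | Y_{i-1}, X_i), which is ~0 here.
    di_rate = mi_rate - cond_entropy(list(zip(zip(y[:-1], x[1:]), y[1:])))
    return mi_rate, di_rate, h2(eps)

print(feedback_example())   # MI rate near H_2(0.3) ~ 0.881, DI rate near 0
```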

REFERENCES

[1] J. Massey, "Causality, feedback and directed information," Proc. Int. Symp. Inf. Theory Applic. (ISITA-90), pp. 303–305, Nov. 1990.
[2] G. Kramer, Directed Information for Channels with Feedback, Ph.D. dissertation, Swiss Federal Institute of Technology (ETH) Zurich, 1998.
[3] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Trans. Inf. Theory, vol. IT-49, pp. 4–21, 2003.
[4] Y.-H. Kim, "A coding theorem for a class of stationary channels with feedback," IEEE Trans. Inf. Theory, vol. 25, pp. 1488–1499, April 2008.
[5] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, pp. 323–349, 2009.
[6] H. H. Permuter, T. Weissman, and A. J. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2, pp. 644–662, 2009.
[7] H. Permuter, Y.-H. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inf. Theory, submitted. http://arxiv.org/abs/0912.4872
[8] P. Mathai, N. Martins, and B. Shapiro, "On the detection of gene network interconnections using directed mutual information," ITA 2007.
[9] A. Rao, A. O. Hero III, D. J. States, and J. D. Engel, "Using directed information to build biologically relevant influence networks," Journal of Bioinformatics and Computational Biology, vol. 6, no. 3, pp. 493–519, June 2008.
[10] C. Quinn, T. P. Coleman, N. Kiyavash, and N. G. Hatsopoulos, "Estimating the directed information to infer causal relationships in ensemble neural spike train recordings," Journal of Computational Neuroscience: Special Issue on Methods of Information Theory in Computational Neuroscience, submitted Dec. 15, 2009.
[11] J. Massey and P. C. Massey, "Conservation of mutual and directed information," Proc. Int. Symp. Information Theory (ISIT 2005), pp. 157–158, 2005.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York: Wiley, 2006.
[13] L. Breiman, Probability. SIAM: Society for Industrial and Applied Mathematics, May 1992.
[14] M. Weinberger, N. Merhav, and M. Feder, "Optimal sequential probability assignment for individual sequences," IEEE Trans. Inf. Theory, vol. 40, no. 2, pp. 384–396, March 1994.
[15] R. Durrett, Probability: Theory and Examples, 3rd ed. Duxbury Press, 2004.
[16] F. Willems, Y. Shtarkov, and T. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653–664, May 1995.
[17] Tj. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems, "Sequential weighting algorithms for multi-alphabet sources," 6th Joint Swedish-Russian International Workshop on Information Theory, Molle, Sweden, 1993, pp. 230–234.
[18] F. Willems, "The context-tree weighting method: Extensions," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 792–798, Mar. 1998.
