arXiv:1201.2334v4 [cs.IT] 30 May 2013

Universal Estimation of Directed Information

Jiantao Jiao, Student Member, IEEE, Haim H. Permuter, Member, IEEE, Lei Zhao, Young-Han Kim, Senior Member, IEEE, and Tsachy Weissman, Fellow, IEEE

Abstract—Four estimators of the directed information rate between a pair of jointly stationary ergodic finite-alphabet processes are proposed, based on universal probability assignments. The first one is a Shannon–McMillan–Breiman type estimator, similar to those used by Verdú (2005) and Cai, Kulkarni, and Verdú (2006) for estimation of other information measures. We show the almost sure and L1 convergence properties of this estimator for any underlying universal probability assignment. The other three estimators map universal probability assignments to different functionals, each exhibiting relative merits such as smoothness, nonnegativity, and boundedness. We establish the consistency of these estimators in almost sure and L1 senses, and derive near-optimal rates of convergence in the minimax sense under mild conditions. These estimators carry over directly to estimating other information measures of stationary ergodic finite-alphabet processes, such as entropy rate and mutual information rate, with near-optimal performance, and provide alternatives to classical approaches in the existing literature. Guided by these theoretical results, the proposed estimators are implemented using the context-tree weighting algorithm as the universal probability assignment. Experiments on synthetic and real data are presented, demonstrating the potential of the proposed schemes in practice and the utility of directed information estimation in detecting and measuring causal influence and delay.

Index Terms—Causal influence, context-tree weighting, directed information, rate of convergence, universal probability assignment.

Manuscript received Month 00, 0000; revised Month 00, 0000; accepted Month 00, 0000. Date of current version Month 00, 0000. This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, by the US–Israel Binational Science Foundation (BSF), and by the Air Force Office of Scientific Research (AFOSR) through Grant FA9550-10-1-0124. Haim H. Permuter was supported in part by the Marie Curie Reintegration Fellowship. The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory, Austin, TX, and the 2012 IEEE International Symposium on Information Theory, Cambridge, MA.

Jiantao Jiao is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA (e-mail: [email protected]).

Haim H. Permuter is with the Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel (e-mail: [email protected]).

Lei Zhao was with the Department of Electrical Engineering, Stanford University, Stanford, CA, USA. He is now with Jump Operations, Chicago, IL 60654, USA (e-mail: [email protected]).

Young-Han Kim is with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, USA (e-mail: [email protected]).

Tsachy Weissman is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA (e-mail: [email protected]).

Communicated by I. Kontoyiannis, Associate Editor for Shannon Theory.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2013.0000000

I. INTRODUCTION

Introduced by Marko [1] and Massey [2], directed information arises as a natural counterpart of mutual information for channel capacity when causal feedback from the receiver to the sender is present. In [3], Kramer extended the use of directed information to discrete networks with feedback and memory, including the two-way channel and the multiple access channel [4]. Tatikonda and Mitter [5] used the directed information spectrum to establish a general channel coding theorem for channels with feedback and memory. For a class of stationary channels with feedback, where the output is a function of the current and past m inputs and channel noise, Kim [6] proved that the feedback capacity is equal to the limit of the maximum normalized directed information from the input to the output. Permuter, Weissman, and Goldsmith [7] considered the capacity of discrete-time finite-state channels with feedback, where the feedback is a time-invariant function of the output. Under mild conditions, they showed that the capacity is again the limit of the maximum normalized directed information. Recently, Permuter, Kim, and Weissman [8] showed that directed information plays an important role in portfolio theory, data compression, and hypothesis testing under causality constraints.

Beyond information theory, directed information is a valuable tool in biology, for it provides an alternative to the notion of Granger causality [9], which has been perhaps the most widely-used means of identifying causal influence between two random processes. For example, Mathai, Martins, and Shapiro [10] used directed information to identify pairwise influence in gene networks. Similarly, Rao, Hero, States, and Engel [11] used directed information to test the direction of influence in gene networks.

Since directed information has significance in various fields, it is of both theoretical and practical importance to develop efficient methods of estimating it. The problem of estimating information measures, such as entropy, relative entropy, and mutual information, has been extensively studied in the literature. Verdú [12] gave an overview of universal estimation of information measures. Wyner and Ziv [13] applied the idea of Lempel–Ziv parsing to estimate entropy rate, which converges in probability for all stationary ergodic processes. Ziv and Merhav [14] used Lempel–Ziv parsing to estimate relative entropy (Kullback–Leibler divergence) and established consistency under the assumption that the observations are generated by independent Markov sources. Cai, Kulkarni, and Verdú [15] proposed two universal relative entropy estimators for finite-alphabet sources, one based on the Burrows–Wheeler transform (BWT) [16] and the other based on the context-tree weighting (CTW) algorithm [17]. The BWT-based estimator

n n was applied in universal entropy estimation by Cai, Kulkarni, X and (x1, x2,...,xn) as x . Calligraphic letters , ,... and Verd´u[18], while the CTW-based one was applied in denote alphabets of X,Y,..., and denotes the cardinalityX Y universal erasure entropy estimation by Yu and Verd´u[19]. of . Boldface letters X, Y,... denote|X | stochastic processes, For the problem of estimating directed information, Quinn, andX throughout this paper, they are finite-alphabet. Coleman, Kiyavashi, and Hatspoulous [20] developed an es- Given a probability law P , P (xi) = P Xi = xi denotes timator to infer causality in an ensemble of neural spike train { i } i−1 the probability mass function (pmf) of X and P (xi x ) recordings. Assuming a parametric generalized linear model i−1 i−1| denotes the conditional pmf of Xi given X = x , i.e., and stationary ergodic Markov processes, they established { } with slight abuse of notation, xi here is a dummy variable and strong consistency results. Compared to [20], Zhao, Kim, i−1 P (xi x ) is an element of ( ), the probability simplex Permuter, and Weissman [21] focused on universal methods on |, representing the saidM conditionalX pmf. Accordingly, for arbitrary stationary ergodic processes with finite alphabet P (xXXi−1) denotes the conditional pmf P (x xi−1) evalu- i| i| and showed their L1 consistencies. ated for the random sequence Xi−1, which is an ( )- As an improvement and extension of [21], the main con- i−1 M X valued random vector, while P (Xi X ) is the random vari- tribution of this paper is a general framework for estimating | i−1 able denoting the Xi-th component of P (xi X ). Through- information measures of stationary ergodic finite-alphabet pro- out this paper, log( ) is base 2 and ln( ) is| base e. cesses, using “single-letter” information-theoretic functionals. · · Although our methods can be applied in estimating a number of information measures, for concreteness and relevance to emerging applications we focus on estimating the directed information rate between a pair of jointly stationary ergodic A. Directed Information finite-alphabet processes. The first proposed estimator is adapted from the universal Given a pair of random sequences Xn and Y n, the directed relative entropy estimator in [15] using the CTW algorithm, information from Xn to Y n is defined as and we provide a refined analysis yielding strong consistency n results. We further propose three additional estimators in a I(Xn Y n) , I(Xi; Y Y i−1) (1) → i| unified framework, present both weak and strong consistency i=1 X results, and establish near-optimal rates of convergence under = H(Y n) H(Y n Xn), (2) mild conditions. We then employ our estimators on both sim- − k where n n is the causally conditional entropy [3], ulated and real data, showing their effectiveness in measuring H(Y X ) defined as k channel delays and causal influences between different pro- n cesses. In particular, we use these estimators on the daily stock H(Y n Xn) , H(Y Y i−1,Xi). (3) market data from 1990 to 2011 to observe a significant level k i| i=1 of causal influence from the Dow Jones Industrial Average to X the Hang Seng Index, but relatively low causal influence in Compared to the reverse direction. I(Xn; Y n)= H(Y n) H(Y n Xn), (4) The rest of the paper is organized as follows. Section II − | reviews preliminaries on directed information, universal prob- directed information in (2) has the causally conditional en- ability assignments, and the context-tree weighting algorithm. 
tropy in place of the conditional entropy. Thus, unlike mu- Section III presents our proposed estimators and their basic tual information, directed information is not symmetric, i.e., properties. Section IV is dedicated to performance guarantees I(Y n Xn) = I(Xn Y n), in general. → 6 → of the proposed estimators, including their consistencies and The following notation of causally conditional pmfs will be miminax-optimal rates of convergence. Section V shows ex- used throughout: perimental results in which we apply the proposed estimators n to simulated and real data. Section VI concludes the paper. p(xn yn)= p(x xi−1,yi), (5) k i| The proofs of the main results are given in the Appendices. i=1 Yn II. PRELIMINARIES p(xn yn−1)= p(x xi−1,yi−1). (6) k i| i=1 We begin with mathematical definitions of directed in- Y formation and causally conditional entropy. We also define It can be easily verified that universal and pointwise universal probability assignments. We p(xn,yn)= p(yn xn)p(xn yn−1) (7) then introduce the context-tree weighting (CTW) algorithm k k used in our implementations of the universal estimators that and that we have the conservation laws: are introduced in the next section. I(Xn; Y n)= I(Xn Y n)+ I(Y n−1 Xn), (8) Throughout the paper, we use uppercase letters X,Y,... to → → denote random variables and lowercase letters x, y, . . . to de- I(Xn; Y n)= I(Xn−1 Y n)+ I(Y n−1 Xn) → → note values they assume. By convention, X = means that X n is a degenerate random variable (unspecified constant)∅ regard- + I(X ; Y Xi−1, Y i−1), (9) i i| less of its support. We denote the -tuple as i=1 n (X1,X2,...,Xn) X 3 where Definition 2 (Pointwise universal probability assignment) A probability assignment Q is said to be pointwise universal n−1 n n−1 n (10) I(Y X )= I(( , Y ) X ) for P if → ∅ n→ = H(Xn) H(X Xi−1, Y i−1) (11) 1 1 1 1 − i| lim sup log log 0 P -a.s. i=1 n Q(Xn) − n P (Xn) ≤ X n→∞   denotes the reverse directed information. Other interesting (21) for every P P. A probability assignment Q is said to be properties of directed information can be found in [3], [22], ∈ [23]. pointwise universal if it is pointwise universal for the class of The directed information rate [3] between a pair of jointly stationary ergodic probability measures. stationary finite-alphabet processes X and Y is defined as It is well known that there exist universal and pointwise universal probability assignments. Ornstein [25] constructed a , 1 n n I(X Y) lim I(X Y ). (12) pointwise universal probability assignment, which was gen- → n→∞ n → eralized to Polish spaces by Algoet [26]. Morvai, Yakowitz, The existence of the limit can be checked [3] as and Algoet [27] used universal source codes to induce a 1 I(X Y) = lim I(Xn Y n) (13) probability assignment and established its universality. Since → n→∞ n → the quantity (1/n) log(1/Q(Xn)) is generally unbounded, a 1 = lim (H(Y n) H(Y n Xn)) (14) pointwise universal probability assignment is not necessarily n→∞ n − k universal. However, if we have a pointwise universal probabil- n 1 i−1 ity assignment, it is easy to construct a probability assignment = lim H(Yi Y ) n→∞ n | that is both pointwise universal and universal. Let Q (xn) be i=1 1 n X n a pointwise universal probability assignment and Q2(x ) be 1 i−1 i lim H(Yi Y ,X ) (15) the i.i.d. 
uniform distribution, then it is easy to verify that − n→∞ n | i=1 ˜ n n n −1 X 0 −1 Q(x )= anQ2(x )+(1 an)Q1(x ) (22) = H(Y0 Y−∞) H(Y0 X−∞, Y−∞), (16) − | − | is both universal and pointwise universal provided that a where the last equality is obtained via the property of Ces´aro n decays subexponentially, for example, an = 1/n. For more mean and standard martingale arguments; see [24, Chs. 4 discussions on universal probability assignments, see, for Y and 16]. Note that the entropy rate H( ) of the process example, [28] and the references therein. Y −1 is equal to H(Y0 Y−∞). In a similar vein, the causally conditional entropy rate| is defined as 1 C. Context-Tree Weighting H(Y X) , lim H(Y n Xn) (17) k n→∞ n k The sequential probability assignment we use in the imple- = H(Y X0 , Y −1 ). (18) mentations of our directed information estimators is the cel- 0| −∞ −∞ ebrated context-tree weighting (CTW) algorithm by Willems, Thus, Shtarkov, and Tjalken [17]. One of the main advantages of I(X Y)= H(Y) H(Y X). (19) → − k the CTW algorithm is that its computational complexity is This identity shows that if we estimate H(Y) and H(Y X) linear in the block length n, and the algorithm provides the separately and if both estimates converge, we have a conver-k probability assignments Q directly; see [17] and [29]. Note gent estimate of the directed information rate. that while the original CTW algorithm was tuned for binary processes, it has been extended for larger alphabets in [30], an B. Universal Probability Assignment extension that we use in this paper. In our experiments with simulated data, we assume that the depth of the context tree is A probability assignment Q consists of a set of conditional larger than the memory of the source. This assumption can be pmfs Q(x xi−1) for every xi−1 i−1 and i = 1, 2,.... i alleviated by the algorithm introduced by Willems [31], which Note that Q| induces a probability∈ measure X on a random pro- we will not implement in this paper. cess X and the pmf Q(xn)= Q(x )Q(x x ) Q(x xn−1) 1 2 1 n An example of a context tree of input sequence x8 with on Xn for each n. | ··· | −2 a binary alphabet is shown in Fig. 1. In general, each node in Definition 1 (Universal probability assignment) Let P be the tree corresponds to a context, which is a string of symbols a class of probability measures. A probability assignment Q is preceding the symbol that follows. For concreteness, assume said to be universal for the class P if the normalized relative the alphabet is 0, 1,...,M 1 . With a slight abuse of entropy satisfies notation, we use{s to represent− both} a node in the context 1 tree and a specific context. At every node s, we use a length- lim D(P (xn) Q(xn))=0 (20) n→∞ n || M array (a0,s,a1,s,...,aM−1,s) to count the numbers of different values emitted with context in sequence n. In for every probability measure P in P. A probability assign- s x Fig. 1, the counts are marked near each node , ment Q is said to be universal (without a qualifier) if it is (a0,s,a1,s) s universal for the class of stationary probability measures. and they are simply numbers of zeros and ones emitted from node s. 4
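For illustration, the following minimal Python sketch (ours, not part of the paper; the function and variable names are assumptions) maintains the per-context count arrays (a_{0,s}, ..., a_{M-1,s}) described above for a depth-D context tree, and applies the Krichevsky–Trofimov add-1/2 rule of (23) to turn the counts at a single node into a sequential probability estimate. On the sequence of Fig. 1 it reproduces the count (3, 1) for context 1.

```python
from collections import defaultdict

def context_counts(x, depth, M=2):
    """Maintain the count arrays (a_{0,s}, ..., a_{M-1,s}) for every context s
    of length 0..depth (the empty tuple is the root), counting from x[depth] on."""
    counts = defaultdict(lambda: [0] * M)
    for i in range(depth, len(x)):
        for l in range(depth + 1):
            s = tuple(x[i - l:i])        # the l symbols preceding x[i]
            counts[s][x[i]] += 1
    return counts

def kt_probability(count, symbol):
    """Krichevsky-Trofimov estimate (b_q + 1/2) / (b_0 + ... + b_{M-1} + M/2)."""
    M = len(count)
    return (count[symbol] + 0.5) / (sum(count) + M / 2.0)

# The binary sequence of Fig. 1: (x_{-2}, ..., x_8) = 00011010010, counts start at x_1.
x = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
c = context_counts(x, depth=3)
print(c[(1,)])                       # [3, 1]: 3 zeros and 1 one following context "1"
print(kt_probability(c[(1,)], 0))    # estimated probability that the next symbol is 0
```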

; 11 (0 0) 11(1; 0) is the universal probability assignment in the CTW algorithm, which will be denoted as Q(xn) in Section III. We compute (1; 0) 00 11(3; 1) the sequential probability assignments as (1; 0)11 00 Q(xn+1) P λ(xn+1) Q(x xn)= = w . (25) (2; 1) n+1| Q(xn) P λ(xn) (1; 1)00 w root In [29, Ch. 5], Willems and Tjalkens introduced a factor (0; 1)11 (4; 4) 11 (1; 1) βs(xn) at every node s to simplify the calculation of the (1; 0)00 sequential probability assignment, which could also help un- derstand how the weighted probabilities are updated when the (0; 1)11 00 00(1; 3) input sequence xn grows to xn+1. For each node s, we define factor βs(xn) as (0; 1)00 (0; 2) P s(xn) βs(xn) , e . (26) Fig. 1. An illustration of the CTW algorithm when and M−1 is n D = 3 i=0 Pw (x ) (x−2,x−1,x0,x1,...,x8) = 00011010010. The count starts at x1. For example, there are 3 zeros and 1 one with context 1, represented by count Assuming js is a contextQ of xn+1, where 0 j M (3, 1) at the node of context 1 in the upper right. 1. Obviously, any other node ks,k = j cannot≤ be a≤ context− s 6 n of xn+1. We express Pw(Xn+1 = q x ), q = 0, 1,...,M 1 in (33) at the bottom of this page,| which shows that the− n Take any sequence z whose alphabet is 0, 1,...,M 1 . sequential probability assignment Q(x xn) is a weighted n { − } n+1 If z contains b0 zeros, b1 ones, b2 twos, and so on, the summation of the Krichevsky–Trofimov sequential| probability n Krichevsky–Trofimov probability estimate of z [32], i.e., assignments, i.e., P s(X = q xn) at all nodes of the context n e n+1 Pe(z ) = Pe(b0,b1,...,bM−1) can be computed sequen- tree. | tially. We let P (0, 0,..., 0) = 1, and for b 0,b e 0 ≥ 1 ≥ By (23), for any node s, 0,...,bM−1 0, 0 i M 1, we have ≥ ≤ ≤ − s n 1/2 1 P (Xn+1 = q x ) . (27) Pe(b0,b1,...,bi−1,bi +1,bi+1,...,bM−1) e | ≥ n + /2 ≥ 2n + |X | |X | bi +1/2 n , Thus, Q(xn+1 x ) = Ω(1/n), or more precisely, b0 + b1 + ... + bM−1 + M/2 | n 1 Pe(b0,b1,...,bi−1,bi,bi+1,...,bM−1). (23) Q(x x ) . (28) × n+1| ≥ 2n + We denote the Krichevsky–Trofimov probability estimate of |X | n s n The probability assignment Q in the CTW algorithm is both the M-array counts at node s of sequence x as Pe (x ). The s n universal and pointwise universal for the class of stationary weighted probability Pw at node s of sequence x in the CTW irreducible aperiodic finite-alphabet Markov processes. For the algorithm is calculated as proof of universality, see [17]. The pointwise universality is 1 P s(xn)+ 1 M−1 P is(xn) 0 l(s)

P s (X = q, xn) s n w n+1 (30) Pw(Xn+1 = q x )= s n | Pw(x ) 1 s n 1 M−1 is n P (Xn+1 = q, x )+ P (Xn+1 = q, x ) = 2 e 2 i=0 w (31) P s (xn) wQ 1 s n s n 1 js n M−1 is n P (x )P (Xn+1 = q x )+ P (Xn+1 = q x ) P (x ) = 2 e e | 2 w | i=0 w (32) P s (xn) w Q βs(xn) 1 = P s(X = q xn)+ P js(X = q xn), (33) 1+ βs(xn) e n+1 | 1+ βs(xn) w n+1 | 5 where P (y x) is the conditional pmf induced by P (x, y). Take where G is | n Q as a universal probability assignment, either on processes n 1 1 with ( )-valued components, or with -valued compo- i i Gn = Q(xi+1,yi+1 X , Y ) log i . X × Y Y n | Q(yi+1 Y ) nents, as will be clear from the context. i=1 (xi+1,yi+1) | X X (44) n Recall the definition of the directed information from X It is also worthwhile to note that Iˆ4 involves an average of n to Y : xi+1 in the relative entropy term for each i, which makes it n analytically different from Iˆ3. I(Xn Y n)= I(Xi; Y Y i−1)= H(Y n) H(Y n Xn), Here is the big picture of the general ideas behind these → i| − k i=1 estimators. The first estimator Iˆ is calculated through the X 1 (34) difference of two terms, each of which takes the form of (39). we define the four estimators as follows: Since the Shannon–McMillan–Breiman theorem guarantees n n n n n Iˆ1(X Y ) , Hˆ1(Y ) Hˆ1(Y X ), (35) the asymptotic equipartition property (AEP) of entropy rate → − k [24] as well as directed information rate [33], it is natural Iˆ (Xn Y n) , Hˆ (Y n) Hˆ (Y n Xn), (36) 2 2 2 to believe that Iˆ would converge to the directed information → n − k 1 1 rate. This is indeed the case, which is proved in Appendix B. Iˆ (Xn Y n) , D Q(y Xi, Y i−1) Q(y Y i−1) , 3 → n i| k i| The Shannon–McMillan–Breiman type estimators have been i=1 X (37) widely applied in the literature of information-theoretic mea- n sure estimation, for example, relative entropy estimation by 1 Iˆ (Xn Y n) , D Q(x ,y Xi, Y i) Cai, Kulkarni, and Verd´u[15], and erasure entropy estimation 4 → n i+1 i+1| i=1 i i i by Yu and Verd´u[19]. X Q(yi+1 Y )Q(xi+1 X , Y ) , k | | Equation (39) can be rewritten in the Ces´aro mean form, (38) i.e., n 1 1 1 where n n (45) log Q(Y X )= log i−1 i , − n k n Q(Yi Y ,X ) ˆ n n , 1 n n i=1 H1(Y X ) log Q(Y X ), (39) X | k −n k and estimators Iˆ through Iˆ are derived by changing every 1 n 2 4 Hˆ (Y n Xn) , f(Q(x ,y Xi, Y i)), (40) term in the Ces´aro mean to other functionals of probability 2 k n i+1 i+1| i=1 assignments Q. For concreteness, estimator Iˆ2 uses conditional Xn entropy as the functional, and estimators Iˆ3 and Iˆ4 use relative ˆ n , 1 i 1 H2(Y ) Q(yi+1 Y ) log i , entropy. n | Q(yi+1 Y ) i=1 yi+1 ˆ X X | One disadvantage of I1 is that it has a nonzero probability (41) of being very large, since it only averages over logarithms of n n n Hˆ1(Y ) , Hˆ1(Y ). (42) estimated conditional probabilities, while the directed infor- k∅ mation rate that it estimates is always bounded by log . |Y| The estimator Iˆ is the universal directed information esti- Recall that Q(y Xi, Y i−1) denotes the conditional pmf 2 i mator introduced in [21]. Thanks to the use of information- Q(y xi,yi−1) evaluated| for the random sequence (Xi, Y i−1), i theoretic functionals to “smooth” the estimate, the absolute and |Q(Y n Xn) denotes the causally conditional pmf value of Iˆ (Xn Y n) is upper bounded by log on any Q(yn xn) evaluated|| for (Xn, Y n). Thus, an entropy estimate 2 realization, a clear→ advantage over Iˆ . 
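To make the plug-in recipe concrete, the following sketch (ours, not the authors' implementation) assumes that a universal probability assignment such as CTW has already produced, for each i, an estimated pmf q_y[i] of Y_i given Y^{i-1} and an estimated joint pmf q_joint[i] of (X_i, Y_i) given the joint past; it then evaluates Î1 and Î2 along the lines of (35)-(41), ignoring the one-step index shift between (39) and (40)-(41) for readability.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as a 1-D array."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cond_entropy_y_given_x(p_xy):
    """H(Y | X) in bits for a joint pmf given as a 2-D array indexed [x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    return sum(p_x[x] * entropy(p_xy[x] / p_x[x])
               for x in range(p_xy.shape[0]) if p_x[x] > 0)

def directed_info_estimates(xs, ys, q_y, q_joint):
    """Plug-in values of I1_hat and I2_hat.

    xs, ys     : realized symbol sequences of length n
    q_y[i]     : estimated pmf of Y_i given Y^{i-1}  (1-D array over the Y alphabet)
    q_joint[i] : estimated joint pmf of (X_i, Y_i) given the past  (2-D array [x, y])
    """
    n = len(ys)
    i1 = i2 = 0.0
    for i in range(n):
        row = np.asarray(q_joint[i], dtype=float)[xs[i]]
        q_y_causal = row / row.sum()     # Q(y_i | X^i, Y^{i-1}), formed as described above
        i1 += np.log2(q_y_causal[ys[i]]) - np.log2(q_y[i][ys[i]])
        i2 += entropy(q_y[i]) - cond_entropy_y_given_x(q_joint[i])
    return i1 / n, i2 / n
```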
|Y| such|| as Hˆ (Y n Xn) is a random variable (since it is a 1 1 The common disadvantage of Iˆ and Iˆ is that they are function of (Xn,k Y n)), as opposed to the entropy terms such 1 2 computed by subtraction of two nonnegative quantities. When as H(Y n Xn), which are deterministic and depend on the there is insufficient data, or the stationarity assumption is distributionk of (Xn, Y n). violated, Iˆ1 and Iˆ2 may generate negative outputs, which is clearly undesirable. In order to overcome this, Iˆ and Iˆ are Note that in (37) and (38) the universal probabil- 3 4 introduced, which take the form of a (random) relative entropy ity assignments conditioned on different data are calcu- and are always nonnegative. Section V-D gives an example lated separately. For example, Q(y Y i−1) is not com- i where Iˆ and Iˆ give negative estimates, which might be puted from Q(x ,y Xi−1, Y i−1), but from| running the uni- 1 2 i i caused by the fact that the underlying process (stock market) versal probability assignment| algorithm again on dataset i−1 i i−1 is not stationary, at least in a short term. Y . In the case of Q(Yi X , Y ), which is inherent in the computation of Q(Y|n Xn), the estimate is com- i−1 k i−1 i i−1 IV. PERFORMANCE GUARANTEES puted from pmf Q(xi,yi X , Y ) via Q(Yi X , Y )= i−1 i−1 | i−1 i−| 1 Q(Xi, Yi X , Y )/ Q(Xi,yi X , Y ). In this section, we establish the consistency of the proposed | yi | estimators, mainly in the almost sure and L senses. Under P 1 We can express Iˆ4 in another form which might be enlight- some mild conditions, we derive near-optimal rates of conver- ening: gence in the minimax sense. The proofs of the stated results Iˆ = G Hˆ (Y n Xn), (43) are given in the Appendices. 4 n − 2 k 6
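In the same spirit, the nonnegative estimator Î3 of (37) is just an average of relative entropies between two estimated conditional pmfs of Y_i, computed with and without causal conditioning on X^i; a minimal sketch (ours, with illustrative names) follows.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits for pmfs on the same finite alphabet (q assumed strictly
    positive, as is the case for CTW estimates)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def i3_estimate(q_y_causal, q_y):
    """I3_hat of (37): average of D( Q(y_i | X^i, Y^{i-1}) || Q(y_i | Y^{i-1}) ),
    nonnegative on every sample path by construction."""
    return float(np.mean([kl_divergence(qc, qm) for qc, qm in zip(q_y_causal, q_y)]))
```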

Theorem 1 Let Q be a universal probability assignment and information rate if we take process Y = X, so the minimax (X, Y) be a pair of jointly stationary ergodic finite-alphabet lower bound also applies in the universal entropy estimation processes. Then situation. Actually in the proof of proposition 3, we indeed

n n reduce the general problem to entropy estimation problem to lim Iˆ1(X Y )= I(X Y) in L1. (46) n→∞ → → show the minimax lower bound. Furthermore, if Q is also a pointwise universal probability Proposition 3 Let P(X, Y) be any class of processes that assignment, then the limit in (46) holds almost surely as well. includes the class of i.i.d. processes. Then, there exists a The proof of Theorem 1 is in Appendix B-A. If (X, Y) positive constant C3 such that is a stationary irreducible aperiodic finite-alphabet Markov −1/2 inf sup E Iˆ I(X Y) C3n , (52) process, we can say more about the performance of Iˆ1 using Iˆ P(X,Y) | − → |≥ the probability assignment in the CTW algorithm. where the infimum is over all estimators Iˆ of the directed Proposition 1 Let Q be the CTW probability assignment information rate based on (Xn, Y n). and let (X, Y) be a jointly stationary irreducible aperiodic The proof of Proposition 3 is in Appendix B-E. Evidently, finite-alphabet Markov process whose order is bounded by −1/2 the prescribed tree depth in the CTW algorithm, and let Y convergence rates better than O(n ) is not attainable even be a stationary irreducible aperiodic finite-alphabet Markov with respect to the class of i.i.d. sources and thus, a fortiori, process with the same order as (X, Y). Then there exists a in our setting of a much larger uncertainty set. For the third and fourth estimators, we establish the follow- constant C1 such that ing consistency results using the CTW algorithm. E Iˆ (Xn Y n) I(X Y) C n−1/2 log n, (47) 1 → − → ≤ 1 Theorem 3 Let Q be the probability assignment in the CTW X Y and ǫ> 0, P -a.s. algorithm. If ( , ) is a jointly stationary irreducible ape- ∀ riodic finite-alphabet Markov process whose order does not ˆ n n X Y −1/2 5/2+ǫ I1(X Y ) I( ) = o(n (log n) ). exceed the prescribed tree depth in the CTW algorithm, and → − → (48) Y is also a stationary irreducible aperiodic finite-alphabet X Y The proof of Proposition 1 is in Appendix B-B. Markov process with the same order as ( , ), then n n We can establish similar consistency results for the second lim Iˆ3(X Y )= I(X Y) P -a.s. and in L1. ˆ n→∞ → → estimator I2 in (36). (53) Theorem 2 Let Q be a universal probability assignment, and Theorem 4 Let Q be the probability assignment in the CTW X Y finite-alphabet process ( , ) be jointly stationary ergodic. algorithm. If (X, Y) is a jointly stationary irreducible ape- Then riodic finite-alphabet Markov process whose order does not n n exceed the prescribed tree depth in the CTW algorithm, and lim Iˆ2(X Y )= I(X Y) in L1. (49) n→∞ → → Y is also a stationary irreducible aperiodic finite-alphabet The proof of Theorem 2 is in Appendix B-C. As was the Markov process with the same order as (X, Y), then ˆ X Y case for I1, if the process ( , ) is a jointly stationary irre- n n lim Iˆ4(X Y )= I(X Y) P -a.s. and in L1. ducible aperiodic finite-alphabet Markov process, we can say n→∞ → → (54) more about the performance of Iˆ2 using the CTW algorithm as follows: The proofs of Theorem 3 and Theorem 4 are in Appen- Proposition 2 Let Q be the probability assignment in the dices B-F and B-G. CTW algorithm. 
If (X, Y) is a jointly stationary irreducible Remark 1 The properties of the CTW probability assignment aperiodic finite-alphabet Markov process whose order does not we use in the proofs of Theorem 3 and Theorem 4 are not exceed the prescribed tree depth in the CTW algorithm, and only universality and pointwise universality, but also lower Y is also a stationary irreducible aperiodic finite-alphabet boundedness (recall Section II-C). Markov process with the same order as (X, Y), then Remark 2 Note that the assumption that (X, Y) is a jointly n n lim Iˆ2(X Y )= I(X Y) P -a.s. and in L1, stationary irreducible aperiodic finite-alphabet Markov process n→∞ → → (50) does not imply Y also has these properties. Suppose that X Y and there exists a constant C2 such that is a Markov process of order m, is a hidden Markov process whose internal process is X, then it is obvious that E Iˆ (Xn Y n) I(X Y) C n−1/2(log n)3/2. X Y Y 2 → − → ≤ 2 joint process ( , ) is Markov with order m, but is not a (51) Markov process. In applications, it is sensible to assume that Z The proof of Proposition 2 is in Appendix B-D. a process can be approximated by Markov processes better We also investigate the minimax lower bound of estimating and better as the Markov order increases, i.e., there exists constants C′ > 0, 0 ρ< 1, such that directed information rate, and show the rates of convergence ≤ for the first two estimators are optimal within a logarithmic C′ 0 H(Z Z−1) H(Z) ρk. (55) factor. Note that entropy rate is a special case of directed ≤ 0| −k − ≤ ln(2) 7

It deserves mentioning that the exponentially fast convergence Markov processes that are passed through simple channels in (55) can be satisfied under mild conditions. For example, such as discrete memory channels (DMC), or channels with as shown in Birch [34], let G be a Markov process with intersymbol interference. We compare the performances of the strictly positive transition probabilities, and Zn = ψ(Gn), estimators to each other, as well as the ground truth, which we then (55) holds. For more on this “exponential forgetting” are able to analytically compute. We also extend the proposed property, please refer to Gland and Mevel [35] and Hochwald methods to estimation of directed information with delay, and and Jelenkovi´c[36]. to estimation of mutual information. Further, we show how one The properties established for the proposed estimators are can use the directed information estimator to detect delay ofa summarized in Table I. channel, and to detect the “causal influence” of one sequence on another. Finally, we apply our estimators on real stock TABLE I market data to detect the causal influence that exists between PROPERTIES OF THE PROPOSED ESTIMATORS the Chinese and the US stock markets. Support Rates of convergence Iˆ ( , ) O(n−1/2 log n) 1 −∞ ∞ Iˆ [ log , log ] O(n−1/2(log n)3/2) 2 − |Y| |Y| A. Stationary Hidden Markov Processes Iˆ3 [0, ) – ∞ Let X be a binary symmetric first order Markov process ˆ I4 [0, ) – P ∞ with transition probability p, i.e. (Xn = Xn−1 Xn−1) = p. Let Y be the output of a binary symmetric6 channel| with crossover probability ǫ, corresponding to the input process X, V. ALGORITHMS AND NUMERICAL EXAMPLES as depicted in Fig. 2. In this section, we use the context-tree weighting (CTW) algorithm as the universal probability assignment to describe X n Y n = {Xi}i=1 = {Yi}i the corresponding directed information estimation algorithms BSC(ǫ) =1 and perform experiments on simulated as well as real data. The CTW algorithm [17] has a linear computational complexity in Fig. 2. Section V-A setup: X is a binary first order Markov process with the block length n, and it provides the probability assignment transition probability p, and Y is the output of a binary symmetric channel with crossover probability ǫ corresponding to the input X. Q directly. A brief introduction on how the CTW works can be found in Section II-C. For simplicity and concreteness, we explicitly describe the We use the four algorithms presented to estimate the di- ˆ algorithm for computing I2. The algorithms for the other rected information rate I(Y X) for the case where p =0.3 estimators are identical, except for the update rule, which is and ǫ = 0.2. The depth of→ the context tree is set to be 3. given, respectively, by (35) to (38). The simulation was performed three times. The results are shown in Fig. 3. As the data length grows, the estimated value Algorithm 1 Universal estimator Iˆ2 based on the CTW approaches the true value for all four algorithms. algorithm

Fix block length n and context tree depth D. n n n n Iˆ1(Y → X ) Iˆ2(Y → X ) Iˆ2 0 0.4 0.4 for←i 1, n do ← 0.35 0.35 zi = (xi,yi) ⊲ Make a super symbol with alphabet 0.3 0.3 size 0.25 0.25 end for|X ||Y| 0.2 0.2 0.15 0.15 for i D +1, n +1 do 0.1 0.1 ← i−1 Gather the context zi−D for the ith symbol zi. 0.05 0.05 2 3 4 5 2 3 4 5 Update the context tree for every possible value of zi. 10 10 10 10 10 10 10 10 i−1 n n The estimated pmf Q(zi Z ) is obtained along the way. ˆ n n ˆ n n |i−1 I3(Y → X ) I4(Y → X ) Gather the context y for the ith symbol yi. i−D 0.4 0.4 Update the context tree for every possible value of yi. 0.35 0.35 i−1 The estimated pmf Q(yi Y ) is obtained along the way. 0.3 0.3 | i−1 i−1 Update Iˆ2 as Iˆ2 Iˆ2 + f(Q(xi,yi X , Y )) 0.25 0.25 i−1 ← | − 0.2 0.2 f(Q(yi Y )) where f( ) is defined in (29). end for| · 0.15 0.15 ˆ ˆ 0.1 0.1 I2 I2/(n D) 0.05 0.05 2 3 4 5 2 3 4 5 ← − 10 10 10 10 10 10 10 10 n n We now present the performance of the estimators on synthetic and real data. The synthetic data is generated using Fig. 3. Estimation of I(Y → X): The straight line is the analytical value. 8
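To reproduce the input data of this experiment, the pair (X, Y) of Section V-A can be generated as in the sketch below (ours; the parameter names and the random seed are illustrative assumptions), with p = 0.3 and ε = 0.2 as in Fig. 3.

```python
import numpy as np

def markov_bsc_pair(n, p=0.3, eps=0.2, seed=0):
    """Binary symmetric first-order Markov input X with P(X_i != X_{i-1}) = p,
    observed through a BSC with crossover probability eps to produce Y."""
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    flips = rng.random(n) < p
    for i in range(1, n):
        x[i] = x[i - 1] ^ int(flips[i])
    y = x ^ (rng.random(n) < eps).astype(int)
    return x, y

x, y = markov_bsc_pair(10**5)
```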

The true value can be simply computed analytically as for d D′, I(Y n Xn−d) = H(Xn−d). Therefore, we ≥ d+1 → can use the shifted directed information I(Y n Xn−d) to I(Y n Xn)= H(Xn) H(Xn Y n) (56) d+1 → → − || estimate D′. n ˆ n n−d 6 i−1 i−1 i Fig. 5 depicts I2(Yd+1 X ) where n = 10 for = H(Xi X ) H(Xi X , Y ) (57) → | − | the setting in Fig. 4, where the input is a binary stationary i=1 Xn Markov process of order one and the channel is given by ′ = H(X X ) H(X X , Y ) (58) (61). The delay of the channel, D , is equal to 2. We use i| i−1 − i| i−1 i i=1 I2 to estimate the shifted directed information (all algorithms X n pǫ perform similarly for this case) where the tree depth of the = Hb(p) (pǫ +¯p¯ǫ)Hb CTWb algorithm is set to be 6. The result in Fig. 5 shows that − pǫ +¯p¯ǫ i=1   ′ ˆ n n−d X for d

0 B. Channel Delay Estimation via Shifted Directed Information −4 −2 0 2 4 d Assume a setting similar to that in Section V-A, a stationary process that passes through a channel, but now there exists a ˆ n n−d 6 Fig. 5. The value of I2(Yd+1 → X ) where n = 10 for the setting delay in the entrance of the input to the channel, as depicted ′ ˆ n n−d depicted in Fig. 4 with D = 2. When d < 2, I2(Yd+1 → X ) is very ˆ n n−d in Fig. 4. close to zero and for d ≥ 2, I2(Yd+1 → X ) is significantly larger than zero.
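One possible way to automate the readout of Fig. 5 is sketched below (our illustration): given a routine estimate_di that returns a normalized directed-information estimate such as Î2, the delay is declared as the smallest shift d at which the estimated shifted directed information I(Y^n_{d+1} → X^{n-d}) rises above a small threshold. The threshold-based rule and its default value are our assumptions, not part of the paper.

```python
def estimate_delay(x, y, estimate_di, max_delay=10, threshold=0.01):
    """Return the smallest shift d at which the estimated shifted directed information
    from (Y_{d+1}, ..., Y_n) to (X_1, ..., X_{n-d}) exceeds `threshold`.

    estimate_di(y_part, x_part) is assumed to return a normalized directed-information
    estimate (e.g., I2_hat) from its first argument to its second.
    """
    n = len(x)
    for d in range(max_delay + 1):
        di = estimate_di(y[d:], x[:n - d])
        if di > threshold:
            return d, di
    return None, 0.0
```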

··· X0,X1, ··· ′ A channel ··· Y0,Y1, ··· D units delay with memory C. Causal Influence Measurement Fig. 4. Using the shifted directed information estimation to find the delay There is an extensive literature on detecting and measuring D′. causal influence. See, for example, [37] for a recent survey of some of the common tools and approaches in biomedical ′ Our goal is to find the delay D . We use the shifted informatics. One particularly celebrated tool, in both the life directed information I(Y n Xn−d) to estimate D′, where d+1 → sciences and economics, for assessing whether and to what I(Y n Xn−d) is defined as d+1 → extent one time series influences another is the Granger n−d causality test [9]. The idea is to model Y first as a univariate n n−d i−1 i−1 d+i I(Y X ) , H(X X ) H(X X , Y ). autoregressive time series with error correction term Vi d+1 → i| − i| d+1 i=1 p X (60) Y = a Y + V , (62) To illustrate the idea, suppose X is a binary stationary i j i−j i j=1 process, and we define the binary process Y as follows X and then model it again using X as causal side information: ′ ′ Yi = Xi−D + Xi−D −1 + Wi, (61) p ˜ where Wi Bernouli(ǫ) and addition in (61) is modulo 2. Yi = [bj Yi−j + cjXi+1−j ]+ Vi (63) ∼ ′ j=1 The goal is to find the delay D from the observations of X the processes Y and X. Note that the mutual information with V˜i as the new error correction term. The Granger causality rate lim 1 I(Y n; Xn) is not influenced by D′. However, the n→∞ n is defined as 1 n n−d var(Vi) shifted directed information rate lim n−d I(Yd+1 X ) GX Y , log , (64) n→∞ → → ˜ is highly influenced by D′. Assuming that there is no feedback, var(Vi) ′ i+d i−1 for d 0. For instance, in the to verify that when the process pair is jointly Gauss–Markov channel≥ example (61),→ if W = 0 with probability 1 then with evolution that obeys both (62) and (63) with p = , the i ∞ 9

Granger causality coincides with the directed information rate information that the forward link exists if and only if I(Xn (up to a multiplicative constant) [23]. Y n) > 0 and the backward link exists if and only→ if In this section, we implement our universal estimators of I(Y n−1 Xn) > 0. More generally, the directed information directed information to infer causal influences in more general I(Xn →Y n) quantifies how much X influences Y, while the scenarios, where the Gauss–Markov modeling assumption directed→ information in the reverse direction I(Y n−1 Xn) inherent in Granger causality fails to adequately capture the quantifies how much Y influences X. The mutual information,→ nature of the data. which is the sum of those two directed , (see (8)), One philosophical basis for causal analysis is that when quantifies the mutual influence of the two sequences. There- we measure causal influence between two processes, X and fore, using the directed information measures, it is natural to Y, there is an underlying assumption that Xi happens earlier adopt terminology as follows: n n n−1 n than Yi for every (Xi, Yi). Under this assumption, we say two • Case A: I(X Y ) I(Y X ), we say that jointly distributed processes X and Y induce a forward chan- X causes Y. → ≫ → i i−1 i−1 i−1 n n n−1 n nel P (yi x ,y ) and a backward channel P (xi x ,y ), • Case B: , we say that | | I(X Y ) I(Y X ) as depicted in Fig. 6, where X is the input process. In this Y causes X. → ≪ → n n n−1 n section we present the use of directed information, reverse • Case C: I(X Y ) I(Y X ) 0, we say directed information, and mutual information to measure the that the processes→ are mutually≃ causing→ each≫ other. n n causal influence between two processes. • Case D: I(X ; Y )=0, we say that the processes are independent of each other. forward channel To illustrate this idea, consider processes X and Y gener- ated by the system that is depicted in Fig. 7, where the forward Xi P (y xi,yi−1) Yi i| channel is a BSC(α) and the backward channel is a BSC(β) 1 1 where 0 α 2 and 0 β 2 . Intuitively, if α is much ≤ ≤ ≤ X≤ Y backward channel less than β, then the process is influencing , and if α is much larger than β, the process Y is influencing X. If α and β Yi−1 i−1 i−1 Delay have similar values then the processes mutually influence each P (xi x ,y ) 1 | other, and finally if they are both equal to 2 , then the processes are independent of each other. Note that the information- i i−1 Fig. 6. Modeling any two processes using forward channel P (yi|x ,y ) i−1 i−1 theoretic measures can be analytically calculated as in (65)- and backward channel P (xi|x ,y ). (67), and indeed if I(Xn Y n) > I(Y n−1 Xn), then α<β and vice versa. Hence→ the intuition regarding→ which process influences the other is consistent with cases A through Xi BSC(α) Yi D presented above. 1 I(Xn Y n)= H (αβ + αβ) H (α) (65) n → b − b where the terms α and β denote 1 α and 1 β respectively. Yi−1 − − BSC(β) Delay Similarly, we have 1 I(Y n−1 Xn)= H (αβ + αβ) H (β) (66) n → b − b Fig. 7. Simulation of a sequence of random variables {Xi,Yi}i ≥ 1 according to the relation shown in the scheme. Namely, Yi is the output and of a binary symmetric channel with parameter α and input Xi and Xi is the 1 output of a binary symmetric channel with parameter β and input Yi−1. The n n 1 I(Y ; X )=2Hb(αβ + αβ) Hb(β) Hb(α). (67) initial random variable X1 is assumed to be distributed Bernoulli( 2 ). 
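For reference, the closed-form values (65)-(67) for the system of Fig. 7 can be evaluated directly; the short sketch below (ours) does so for the case α = 0.1, β = 0.2 that is plotted in Fig. 8.

```python
import math

def hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def analytic_rates(alpha, beta):
    """Normalized values of (65)-(67) for the forward BSC(alpha) / backward BSC(beta) system."""
    mix = alpha * (1 - beta) + (1 - alpha) * beta    # crossover of the cascaded channels
    di_forward = hb(mix) - hb(alpha)                 # (1/n) I(X^n -> Y^n)
    di_reverse = hb(mix) - hb(beta)                  # (1/n) I(Y^{n-1} -> X^n)
    mi = 2 * hb(mix) - hb(alpha) - hb(beta)          # (1/n) I(X^n; Y^n)
    return di_forward, di_reverse, mi

print(analytic_rates(0.1, 0.2))   # the case plotted in Fig. 8
```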
n − − Since the normalized reverse directed information is nothing Definition 3 (Existence of a channel) We say that the for- but the normalized directed information between another pair i i−1 i−1 of processes, where one is shifted, the estimators Iˆ1 to Iˆ4 ward channel does not exist if P (yi x ,y ) = P (yi y ) for i 1 and similarly the backward| channel does not| exist can be easily adapted to this situation, and the convergence ≥ i−1 i−1 i−1 theorems (Theorem 1 through Theorem 4) apply also (with the if P (xi x ,y )= P (xi x ) for i 1. | | ≥ appropriate translations) to the reverse directed information. We interpret the existence of the forward link as that the Finally, the normalized mutual information can be estimated Y X sequence is “influenced” or “caused” by the process . once we have the normalized directed information and the Similarly, the existence of the backward link is interpreted as normalized reverse directed information simply by summing X Y that is “influenced” or “caused” by the sequence . We them. would like to answer the following two questions: Fig. 8 depicts the estimated and analytical information- 1) Does the forward channel exist? 1 n n 1 n−1 n theoretic measures n I(X Y ), n I(Y X ), and 2) Does the backward channel exist? 1 n n → → n I(X ; Y ) for the case α =0.1 and β =0.2. One can note Directed information can naturally answer these questions. that with just a few hundreds of samples, directed information It is straightforward to note from the definition of directed and reverse directed information start strongly indicating that 10

α < β, in other words, X influences Y more than the other way around.

Fig. 8. The information-theoretic measures (1/n) I(X^n → Y^n), (1/n) I(Y^{n-1} → X^n), and (1/n) I(X^n; Y^n) evaluated using the four algorithms. The data was generated according to the setting in Fig. 7 where α = 0.1 and β = 0.2. The straight black line is the analytical value given by (65)-(67) and the blue lines are the estimated values.

D. Causal Influence in Stock Markets

Here we use the historic data of the Hang Seng Index (HSI) and the Dow Jones Industrial Average (DJIA) between 1990 and 2011 to compute the directed information rate between these two indexes. The data of those two indexes are presented in Fig. 9 on a daily time scale.

Fig. 9. The Hang Seng Index (HSI) and the Dow Jones Industrial Average (DJIA) between 1990 and 2011. The goal is to determine which index is causally influencing the other.

There is no time overlap between the stock market in Hong Kong and that in New York; that is, when the stock market in Hong Kong is open, the stock market in New York is closed, and vice versa. Therefore the causal influence between the markets is well defined. Since the value of the stock market is continuous, we discretize it into three values: -1, 0, and 1. Value -1 means that the stock market went down in one day by more than 0.8%, value 1 means that the stock market went up in one day by more than 0.8%, and value 0 means that the absolute change is less than 0.8%.

We denote by Xi and Yi the (quantized ternary valued) change in the HSI and the DJIA in day i, respectively, and estimate the normalized mutual information (1/n) I(X^n; Y^n), the normalized directed information (1/n) I(X^n → Y^n), and the normalized reverse directed information (1/n) I(Y^{n-1} → X^n), using all four algorithms. Fig. 10 plots our estimates of these information-theoretic measures.

Fig. 10. Estimates of the information-theoretic measures between the HSI, denoted by X, and the DJIA, denoted by Y. It is clear that the reverse directed information is much higher than the directed information; hence it is the DJIA that causally influences the HSI rather than the other way around.

Evidently, the reverse directed information is much higher than the directed information; hence there is a significant causal influence by the DJIA on the HSI, and a low influence in the reverse direction. In other words, between 1990 and 2011, it was the Chinese market that was influenced by the US market rather than the other way around.

It is also worth noting that estimators Î1 and Î2 do generate negative outputs, as shown in Fig. 10. This may be caused by various reasons, such as data insufficiency and non-stationarity of the process (X, Y). In such cases of insufficient data, we would prefer estimators Î3 and Î4, since they are always nonnegative, which can be sensibly interpreted in practice.

VI. CONCLUDING REMARKS

We have presented four approaches to estimating the directed information rate between a pair of jointly stationary ergodic finite-alphabet processes. Weak and strong consistency results have been established for all four estimators, in precise senses of varying strengths. For two of these estimators we established convergence rates that are optimal to within logarithmic factors. The other two have their own merits, such as nonnegativity on every sample path. Experiments on simulated and real data substantiate the potential of the proposed approaches in practice and the efficacy of directed information estimation as a tool for detecting and quantifying causality and delay.

VII. ACKNOWLEDGMENTS where f is defined in (29). The authors would like to thank Todd Coleman for helpful Lemma 5 Let X be a stationary irreducible aperiodic finite- discussions on the merits of nonnegative directed information alphabet Markov process. For fixed i 1, let random variable i ≥ i estimators during Haim Permuter’s visit at UCSD. They would Vi(Xi−m) be a deterministic function of random vector Xi−m, like to thank the associate editor and anonymous reviewers for where m is the Markov order. Let Vi be uniformly bounded their very helpful suggestions that significantly improved the by a constant V for any i, and EV = 0, i 1. Then there i ∀ ≥ presentation of our results. Jiantao Jiao would like to thank exists a constant C4 such that Hyeji Kim for very helpful discussions in the revision stage n 2 1 of the paper. E V C V 2n−1. (74) n i ≤ 4 i=1 ! X APPENDIX A Lemma 6 (Breiman’s generalized ergodic theorem [38]) SOME KEY LEMMAS Let X be a stationary ergodic process. If lim gk(X) = k→∞ Here is the roadmap of the Appendices. In Appendix A g(X) P -a.s. and E[sup gk ] < , then we list some key lemmas without proofs, and in Appendix B k | | ∞ we prove the main theorems and propositions in Section IV. n 1 k X E X Appendix C provides the proofs of the lemmas in Appendix A. lim gk T ( ) = g( ) P -a.s. (75) n→∞ n k=1 The first lemma is on the asymptotic equipartition property X  (AEP) for causally conditional entropy rate. It was proved in where T ( ) is the shift operator which increases the index by [33] that the AEP for causally conditional entropy rate holds 1, and T k· increases the index by k. in the almost sure sense. Here we prove that it holds in the Here we paraphrase a result from [30] on the redundancy L sense as well. We also show convergence rates for jointly 1 bounds of the CTW probability assignment. stationary irreducible aperiodic Markov processes. Lemma 7 ([30]) Let Q be the CTW probability assignment Lemma 1 Let X Y be a jointly stationary ergodic finite- ( , ) and let X be a stationary finite-alphabet Markov process alphabet process. Then, whose order is bounded by the prescribed tree depth of the 1 n n CTW algorithm. Then there exist constants C5, C6 such that lim log P (Y X )= H(Y X) P -a.s. and in L1. n→∞ −n k k the pointwise redundancy is bounded as (68) In addition, if (X, Y) is irreducible aperiodic Markov, then 1 1 max log log C5 log n + C6 (76) xn Q(xn) − P (xn) ≤ 1   E log P (Y n Xn) H(Y X) = O(n−1/2 log n) (69) −n k − k where C5 > 0, C6 depend on nothing but the parameters X specifying the process . In particular, taking expectation over and for every ǫ> 0, the inequality with respect to P , the redundancy is bounded 1 as log P (Y n Xn) H(Y X) − n k − k D(P (xn) Q(xn)) C log n + C . (77) k ≤ 5 6 = o(n−1/2(log n)5/2+ǫ) P -a.s. (70) Remark 3 The constants C5, C6 can be specified once the The next lemma shows that the conditional probability in- parameters of process X are given. For example, see [30], duced by the CTW algorithm converges to the true probability where of a Markov process if the CTW depth is sufficiently large. (γ 1) C5 = − |S| , (78) Lemma 2 Let Q be the CTW probability assignment and let 2 X be a stationary irreducible aperiodic finite-alphabet Markov (γ 1) 1 γ 1 C6 = − |S| log + + log γ . process whose order is bounded by the prescribed tree depth 2 |S| γ 1 − γ 1 |S|  −  − of the CTW algorithm. Then, (79) n n Here γ is the size of alphabet, in this case γ = . is the lim Q(xn+1 X ) P (xn+1 X )=0 P -a.s. 
(71) n→∞ | − | number of states in the Markov process, given|X Markov | |S| order Lemma 3 ([21, Lemma 1]) For any ǫ> 0, there exists K > m, m. ǫ |S| ≤ |X | 0 such that for all P and Q in ( , ): M X Y APPENDIX B f(P ) f(Q) ǫ + K P Q , (72) | − |≤ ǫk − k1 PROOFS OF THEOREMS AND PROPOSITIONS where 1 is the l1 norm (viewing P and Q as - For brevity, in the sequel we denote Hˆ (Y n Xn) by Hˆ , k·k |X ||Y| 1 1 dimensional simplex vectors), and f is defined in (29). Hˆ (Y n Xn) by Hˆ , Iˆ (Xn Y n) by Iˆ ,i =1k, 2, 3, 4. 2 k 2 i → i Lemma 4 Let P,Q be two probability mass functions in ( , ), denote θ = P Q . If θ< 1/2, we have A. Proof of Theorem 1 M X Y k − k1 Briefly speaking, we need to show estimator Iˆ1 converges f(P ) f(Q) 2θ log |X ||Y| , (73) to the corresponding directed information rate I(X Y) | − |≤ θ → 12

i i−1 for any jointly stationary ergodic process (X, Y). Since Iˆ1 is Q(yi x ,y ) . We bound Cn2 from (112) to (120), where n n n | } defined in (35) as Hˆ1(Y ) Hˆ1(Y X ), if we can show − k n n the corresponding convergence properties of Hˆ1(Y X ), k • (113) follows by the log-sum inequality [24, Theorem then we have the desired convergence properties of Iˆ1 since n n 2.7.1], Hˆ1(Y )= Hˆ1(Y ). k∅ • (114) follows since x> 1, log(1 + x) x/ ln(2), Given Q is a universal probability assignment, first we show ∀ − ≤ ˆ • (115) follows since x x, I1 converges in L1. Then we show given Q is a pointwise | |≥ • (116) follows by Scheff´e’s theorem [42, Lemma 2.1], universal probability assignment, Iˆ1 also converges almost surely. • (117) follows by Pinsker’s inequality [42, Lemma 2.5], • (118) follows by the concavity of √ , 1) L convergence: We decompose · 1 • (119) follows by data-processing inequality [24, Theorem

Hˆ1 H(Y X)= Cn + Dn, (80) 2.8.1], − k • (120) follows by the chain rule for relative entropy, the where 1 concavity of √ , and data-processing inequality. C = Hˆ + log P (Y n Xn) (81) · n 1 n k 1 D = log P (Y n Xn) H(Y X). (82) Combining (89) and (120), we have n −n k − k E 1 n n n n According to Lemma 1 shown in Appendix A, we know D Cn D(P (x ,y ) Q(x ,y )) n | |≤ n k converges to zero in L1. Now we deal with Cn. Pinsker [39] 2 proved the existence of a universal constant Γ > 0 such that + D(P (xn+1,yn+1) Q(xn+1,yn+1))/n, ln(2) k dP s D(P Q) E log( ) D(P Q)+Γ D(P Q), p (90) k ≤ P dQ ≤ k k   p (83) by definition of universal probability assignment, we show Cn

Barron [40] simplified Pinsker’s argument and proved that the converges to zero in L1. Since constant Γ= √2 is best possible when natural logarithms are E Hˆ H(Y X) E C + E D 0 n , (91) used in the definition of D(P Q). Here we follow Barron’s | 1 − k |≤ | n| | n| → → ∞ E k arguments to bound Cn with Cn defined in (81). we know Iˆ converges to I(X Y) in L . | | 1 → 1 Denote the set (xn,yn): P (yn xn) Q(yn xn) as , { k ≤ k } Bn we have 2) Almost sure convergence: Consider the probability of n n the following event E n n 1 P (y x ) Cn = P (x ,y ) log nk n | | n n n n Q(y x ) 1 (x ,y )∈(X ×Y) \Bn n n ˆ n n X k n,ǫ = (x ,y ): H1 log P (y x ) ǫ , (92) 1 Q(yn xn) A { ≤−n k − } + P (xn,yn) log k (84) n P (yn xn) we have (xn,yn)∈B X n k n n n n P( )= P (x ,y ) (93) 1 P (Y X ) An,ǫ = E log k (xn,yn)∈A n Q(Y n Xn) X n,ǫ   n n n n−1 k n n = P (y x )P (x y ) (94) n n 1 Q(y x ) +2 P (x ,y ) log k (85) n n k k n n (x ,y )∈An,ǫ n n n P (y x ) X (x ,y )∈Bn k X Q(yn xn)2−nǫP (xn yn−1) (95) n n ≤ n n k k , E 1 P (Y kX ) , (x ,y )∈An,ǫ Define Cn1 n log Q(Y nkXn) , Cn2 X n n −nǫ n n n n−1 n n 1 Q(y kx ) =2 Q(y x )P (x y ) (96) n n P (x ,y ) logh n n , we boundi (x ,y )∈Bn n P (y kx ) n n k k (x ,yX)∈An,ǫ n i i−1 −nǫ P 1 P (Yi X , Y ) 2 , (97) C = E log | (86) n1 n Q(Y Xi, Y i−1) ≤ i=1 i X  |  where the first inequality is because of the definition of n i i−1 1 P (Yi X , Y ) i−1 i−1 even n,ǫ, and the last step follows from the fact that for = E E log | X , Y A n n i i−1 any two conditional distributions of the form Q(y x ) and n " " Q(Yi X , Y ) ## i=1 | P (xn yn−1), we have Q(yn xn)P (xn yn−1) = Q˜k(xn,yn) X (87) wherekQ˜ is a joint distribution.k As k 1 n P (Y ,X Xi−1, Y i−1) E E log i i| Xi−1, Y i−1 ∞ i−1 i−1 P ≤ n " " Q(Yi,Xi X , Y ) ## ( n,ǫ) < , (98) i=1 | A ∞ X (88) n=1 X 1 by the Borel-Cantelli Lemma, we have = D(P (xn,yn) Q(xn,yn)). (89) n k ˆ 1 n n , i i−1 i i−1 lim inf H1 log P (Y X ) 0. P -a.s. (99) Then, i, define i i(x ,y ) = yi : P (yi x ,y ) n→∞ − −n k ≥ ∀ C C { | ≤   13

In order to get an inequality with the reverse direction, write which implies the convergence of Iˆ1 to I(X Y) also holds Hˆ ( 1 log P (Y n Xn)) explicitly as almost surely. → 1 − − n k 1 Hˆ + log P (Y n Xn) 1 n k 1 P (Y n Xn) B. Proof of Proposition 1 = log k (100) n Q(Y n Xn) For similar reasons as shown in the proof of Theorem 1, nk n n n−1 1 P (Y ,X ) 1 P (X Y ) here it suffices to show the convergence properties of Hˆ1. For = log log k , (101) n Q(Y n,Xn) − n Q(Xn Y n−1) convenience, we restate some arguments shown in the proof k of Theorem 1. We decompose Hˆ H(Y X) as by the definition of pointwise universality (2), we know 1 − k ˆ Y X 1 P (Y n,Xn) H1 H( )= Cn + Dn, (107) -a.s. (102) − k lim sup log n n 0, P n→∞ n Q(Y ,X ) ≤ where 1 n n with a similar argument used for showing (99), we show Cn = Hˆ1 + log P (Y X ) (108) n k 1 P (Xn Y n−1) -a.s. (103) 1 n n Y X lim sup log nk n−1 0, P Dn = log P (Y X ) H( ), (109) n→∞ −n Q(X Y ) ≤ −n k − k k then we have and we restate (90)

1 n n 1 n n n n lim sup Hˆ log P (Y X ) 0. P -a.s. (104) E Cn D(P (x ,y ) Q(x ,y )) 1 − −n k ≤ | |≤ n k n→∞   2 Combining (104) with (99), + D(P (xn+1,yn+1) Q(xn+1,yn+1))/n. sln(2) k 1 n n lim Hˆ1 log P (Y X ) =0. P -a.s. (105) p (110) n→∞ − −n k   By Lemma 1 shown in Appendix A, 1) L1 convergence rates: We apply Lemma 7 in Appendix A. Plugging (77) of Lemma 7 in (110), we have 1 n n Y X lim log P (Y X ) H( )=0, P -a.s. (106) E 1/2 −1/2 n→∞ −n k − k Cn = O((log n) n ). (111) | | Combining (111) with the L1 convergence rates of Dn

n 1 Q(y xi,yi−1) C P (xi,yi−1) P (y xi,yi−1) log i| (112) n2 ≤ n i| P (y xi,yi−1) i=1 i i−1 i X (x X,y ) yXi∈Ci | 1 n Q(Y xi,yi−1) P (xi,yi−1)P (Y xi,yi−1) log i ∈Ci| (113) ≤ n i ∈Ci| P (Y xi,yi−1) i=1 i i−1 i i X (x X,y ) ∈C | n 1 1 P (xi,yi−1) (Q(Y xi,yi−1) P (Y xi,yi−1)) (114) ≤ n ln(2) i ∈Ci| − i ∈Ci| i=1 i i−1 X (x X,y ) n 1 1 P (xi,yi−1) Q(Y xi,yi−1) P (Y xi,yi−1) (115) ≤ n ln(2)| i ∈Ci| − i ∈Ci| | i=1 i i−1 X (x X,y ) n 1 i i−1 1 i i−1 i i−1 P (x ,y ) P (yi x ,y ) Q(yi x ,y ) (116) ≤ 2n − ln(2) | | − | | i=1 i i 1 yi X (x X,y ) X 1 n 2 P (xi,yi−1) D(P (y xi,yi−1) Q(y xi,yi−1) (117) ≤ 2n ln(2) i| k i| i=1 i i−1 s X (x X,y ) n 1 2 ED(P (y Xi, Y i−1) Q(y Xi, Y i−1) (118) ≤ 2n ln(2) i| k i| i=1 s X p 1 n 2 ED(P (y , x Xi, Y i−1) Q(y , x Xi, Y i−1) (119) ≤ 2n ln(2) i i+1| k i i+1| i=1 s X p 1 D(P (xn+1,yn+1) Q(xn+1,yn+1)/n, (120) ≤ s2 ln(2) k p 14

Combining (111) with the $L_1$ convergence rates of $D_n$ shown in Lemma 1 in Appendix A, we have
$$\mathbb{E}\big|\hat H_1 - H(\mathbf{Y}\|\mathbf{X})\big| \le \mathbb{E}|C_n| + \mathbb{E}|D_n| \quad (121)$$
$$= O(n^{-1/2}\log n), \quad (122)$$
and thus the convergence rates in Proposition 1 hold:
$$\mathbb{E}\big|\hat I_1(X^n\to Y^n) - I(\mathbf{X}\to\mathbf{Y})\big| = O(n^{-1/2}\log n). \quad (123)$$

2) Almost sure convergence rates: We first look at the almost sure convergence rates of $C_n$ in (108). We know the probability of the event $\mathcal{A}_{n,\epsilon}$ defined in (92) is bounded as
$$P(\mathcal{A}_{n,\epsilon}) \le 2^{-n\epsilon}. \quad (124)$$
For any fixed $\delta' > \delta > 0$, taking $\epsilon = n^{-1+\delta}$ in (92), we see that $\mathcal{A}_{n,\epsilon}$ is equal to the set
$$\Big\{(x^n,y^n)\colon n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(y^n\|x^n)\Big) \le -n^{\delta-\delta'}\Big\}. \quad (125)$$
Note that
$$\sum_{n=1}^\infty P(\mathcal{A}_{n,\epsilon}) \le \sum_{n=1}^\infty 2^{-n^{\delta}} < \infty. \quad (126)$$
By the Borel--Cantelli lemma, since $n^{\delta-\delta'}$ goes to zero as $n\to\infty$, we have proved that
$$\liminf_{n\to\infty} n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) \ge 0 \quad P\text{-a.s.} \quad (127)$$
In order to get an inequality in the reverse direction, dividing (101) by $n^{-1+\delta'}$, we have
$$n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) \quad (128)$$
$$= n^{1-\delta'}\,\frac{1}{n}\log\frac{P(Y^n,X^n)}{Q(Y^n,X^n)} - n^{1-\delta'}\,\frac{1}{n}\log\frac{P(X^n\|Y^{n-1})}{Q(X^n\|Y^{n-1})}. \quad (129)$$
By the pointwise redundancy of the CTW algorithm restated in Lemma 7 in Appendix A, we know
$$\limsup_{n\to\infty}\frac{1}{\log n}\log\frac{P(Y^n,X^n)}{Q(Y^n,X^n)} \le 1 \quad P\text{-a.s.,} \quad (130)$$
and therefore
$$\limsup_{n\to\infty} n^{1-\delta'}\,\frac{1}{n}\log\frac{P(Y^n,X^n)}{Q(Y^n,X^n)} \le 0 \quad P\text{-a.s.} \quad (131)$$
For the second term on the right-hand side of (129), following a similar argument to the one used to show (127), we know
$$\limsup_{n\to\infty}\,-n^{1-\delta'}\,\frac{1}{n}\log\frac{P(X^n\|Y^{n-1})}{Q(X^n\|Y^{n-1})} \le 0 \quad P\text{-a.s.} \quad (132)$$
From (131) and (132), we obtain

$$\limsup_{n\to\infty} n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) \le 0 \quad P\text{-a.s.} \quad (133)$$
Combining (127) and (133), we know that for every $\delta' > 0$,
$$\lim_{n\to\infty}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) = o(n^{-1+\delta'}) \quad P\text{-a.s.} \quad (134)$$
Putting (134) and the almost sure convergence rates of $D_n$ shown in Lemma 1 in Appendix A together, we know that for every $\epsilon > 0$,
$$\hat I_1(X^n\to Y^n) - I(\mathbf{X}\to\mathbf{Y}) = o\big(n^{-1/2}(\log n)^{5/2+\epsilon}\big) \quad P\text{-a.s.}$$

C. Proof of Theorem 2

It suffices to show the convergence properties of $\hat H_2$. We decompose
$$\hat H_2(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X}) = A_n + B_n, \quad (135)$$
where
$$A_n = \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big) - H(\mathbf{Y}\|\mathbf{X}), \quad (136)$$
$$B_n = \hat H_2(Y^n\|X^n) - \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big). \quad (137)$$
Define $g_k(\mathbf{X},\mathbf{Y}) \triangleq f\big(P(x_1,y_1|X_{-k}^0,Y_{-k}^0)\big)$ for a jointly stationary and ergodic process $(\mathbf{X},\mathbf{Y})$. Note that, by martingale convergence [41], $g_k(\mathbf{X},\mathbf{Y}) \to g(\mathbf{X},\mathbf{Y})$ $P$-a.s., where $g(\mathbf{X},\mathbf{Y}) = f\big(P(x_1,y_1|X_{-\infty}^0,Y_{-\infty}^0)\big)$. Noting further that $\mathbb{E}\,g(\mathbf{X},\mathbf{Y}) = H(\mathbf{Y}\|\mathbf{X})$ and that the $g_k$ are bounded for all $k$, we can apply Lemma 6 in Appendix A and get the following result:
$$\lim_{n\to\infty} A_n = 0 \quad P\text{-a.s. and in } L_1. \quad (138)$$
Then we deal with $B_n$ defined in (137) in (155)--(162) below, where, fixing an arbitrary $\epsilon > 0$,
• (157) follows by Lemma 3 in Appendix A,
• (158) follows by Pinsker's inequality,
• (159) and (161) follow by the concavity of $\sqrt{\cdot}$,
• (162) follows by the chain rule for relative entropy.
We continue to bound
$$\lim_{n\to\infty}\mathbb{E}\big|\hat H_2(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X})\big| \quad (139)$$
$$\le \lim_{n\to\infty}\mathbb{E}|A_n| + \lim_{n\to\infty}\mathbb{E}|B_n| \quad (140)$$
$$= \lim_{n\to\infty}\mathbb{E}|B_n| \quad (141)$$
$$\le \epsilon + \lim_{n\to\infty} K_\epsilon\sqrt{\frac{2\ln 2}{n}\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)} \quad (142)$$
$$= \epsilon, \quad (143)$$
where (141) follows by (138), (142) follows by (162), and (143) follows by Definition 1. Now we can use the arbitrariness of $\epsilon$ to complete the proof.

D. Proof of Proposition 2

It suffices to show the convergence properties of $\hat H_2$.

1) Almost sure convergence: For the stationary ergodic process $(\mathbf{X},\mathbf{Y})$, let
$$g_k(\mathbf{X},\mathbf{Y}) = f\big(Q(x_0,y_0|X_{-k}^{-1},Y_{-k}^{-1})\big), \quad (144)$$
$$g(\mathbf{X},\mathbf{Y}) = f\big(P(x_0,y_0|X_{-\infty}^{-1},Y_{-\infty}^{-1})\big). \quad (145)$$
By Lemma 2 in Appendix A,
$$\lim_{k\to\infty}\big(g_k(\mathbf{X},\mathbf{Y}) - g(\mathbf{X},\mathbf{Y})\big) = 0 \quad P\text{-a.s.} \quad (146)$$

Since $\mathbb{E}\big[\sup_k |g_k|\big] \le \log|\mathcal{Y}|$, by Lemma 6 in Appendix A,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n g_k\big(T^k(\mathbf{X},\mathbf{Y})\big) = \lim_{n\to\infty}\hat H_2 = H(\mathbf{Y}\|\mathbf{X}), \quad (147)$$
which justifies the almost sure convergence of $\hat H_2$.

2) $L_1$ convergence rates: For convenience, we restate the definitions of $A_n$ and $B_n$ as follows:
$$A_n = \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big) - H(\mathbf{Y}\|\mathbf{X}), \quad (148)$$
$$B_n = \hat H_2(Y^n\|X^n) - \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big). \quad (149)$$
Letting $V_k$ be $f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big) - H(\mathbf{Y}\|\mathbf{X})$, $V$ be $\log|\mathcal{Y}|$, and applying Lemma 5 in Appendix A, we know
$$\mathbb{E}|A_n| \le \sqrt{\mathbb{E}A_n^2} = O(n^{-1/2}). \quad (150)$$
Then we bound $\mathbb{E}|B_n|$ in (178)--(183) below, where
• (180) is an application of Lemma 2 and Lemma 4 in Appendix A. Indeed, Lemma 2 guarantees that when $n\to\infty$, the $\ell_1$ norm of the difference of $P(x_{k+1},y_{k+1}|X^k,Y^k)$ and $Q(x_{k+1},y_{k+1}|X^k,Y^k)$ will be small enough so that Lemma 4 can be applied;
• (181) follows by Pinsker's inequality and the fact that the function $t\log(1/t)$ is increasing for small $t$;
• (182) and (183) are by the concavity of $\sqrt{t}\log(1/\sqrt{t})$ and the chain rule for relative entropy.
Because of the monotonicity of $\sqrt{t}\log(1/\sqrt{t})$ when $t\approx 0$, we can plug the redundancy bound of the CTW algorithm restated in Lemma 7 in Appendix A, i.e., (77), into (183), and then have
$$\mathbb{E}|B_n| = O\big(n^{-1/2}(\log n)^{3/2}\big). \quad (151)$$
Combining (151) with (150), we have proved Proposition 2.

E. Proof of Proposition 3

We rephrase a general lemma showing minimax lower bounds.

Lemma 8 ([42, Theorem 2.2, Page 90]): Let $\mathcal{F}$ be a class of models, and suppose we have observations $Z$ distributed according to $P_f$, $f\in\mathcal{F}$. Let $d(\hat f, f)$ be the performance measure of the estimator $\hat f(Z)$ relative to the true model $f$. Assume also that $d(\cdot,\cdot)$ is a semi-distance, i.e., it satisfies
1) $d(f,g) = d(g,f) \ge 0$,
2) $d(f,f) = 0$,
3) $d(f,g) \le d(h,f) + d(h,g)$.
Let $f_0, f_1 \in \mathcal{F}$ satisfy $d(f_0,f_1) \ge 2s > 0$, where $s$ is fixed. Then
$$\inf_{\hat f}\sup_{f\in\mathcal{F}} P_f\big(d(\hat f,f)\ge s\big) \ge \inf_{\hat f}\max_{j\in\{0,1\}} P_{f_j}\big(d(\hat f,f_j)\ge s\big) \quad (152)$$
$$\ge \frac{1}{4}\exp\big(-D(P_{f_1}\|P_{f_0})\big). \quad (153)$$
In this proof, $\mathcal{F}$ in Lemma 8 is taken to be $\mathcal{P}(\mathbf{X},\mathbf{Y})$. Denote the binary entropy by $H_b(p) = -p\log p - (1-p)\log(1-p)$ and the class of i.i.d. processes by $\mathcal{M}_0$. Since
$$H_b'(p) = \log\frac{1-p}{p}, \quad (154)$$
and $H_b'(p)$ is decreasing on the interval $[2/8, 3/8]$, we have the following lemma.

The bound on $\mathbb{E}|B_n|$ used in the proof of Theorem 2 is derived as follows:
$$\mathbb{E}|B_n| = \mathbb{E}\left|\frac{1}{n}\sum_{k=1}^n\Big(f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big)\right| \quad (155)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\Big|f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big| \quad (156)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\Big[\epsilon + K_\epsilon\big\|Q(x_{k+1},y_{k+1}|X^k,Y^k) - P(x_{k+1},y_{k+1}|X^k,Y^k)\big\|_1\Big] \quad (157)$$
$$\le \frac{1}{n}\sum_{k=1}^n\Big(K_\epsilon\,\mathbb{E}\sqrt{2\ln 2\, D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)} + \epsilon\Big) \quad (158)$$
$$\le \frac{1}{n}\sum_{k=1}^n\Big(K_\epsilon\sqrt{2\ln 2\,\mathbb{E}\big[D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)\big]} + \epsilon\Big) \quad (159)$$
$$= \epsilon + \frac{K_\epsilon}{n}\sum_{k=1}^n\sqrt{2\ln 2\,\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)} \quad (160)$$
$$\le \epsilon + K_\epsilon\sqrt{\frac{2\ln 2}{n}\sum_{k=1}^n\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)} \quad (161)$$
$$= \epsilon + K_\epsilon\sqrt{\frac{2\ln 2}{n}\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)}. \quad (162)$$

Lemma 9: For all $p,q\in[2/8,3/8]$, we have
$$|H_b(p) - H_b(q)| \ge \log(5/3)\,|p-q|. \quad (163)$$
We also show a lemma bounding the divergence between two Bernoulli pmfs.

Lemma 10: Let $P$ and $Q$ be Bernoulli pmfs with parameters, respectively, $1/2-p$ and $1/2-q$. If $|p|,|q|\le 1/4$, then $D(P\|Q)\le 8(p-q)^2$.

Lemma 10 can be verified as follows:
$$D(P\|Q) = (1/2-p)\log\frac{1/2-p}{1/2-q} + (1/2+p)\log\frac{1/2+p}{1/2+q} \quad (164)$$
$$= (1/2-p)\log\Big(1+\frac{q-p}{1/2-q}\Big) + (1/2+p)\log\Big(1+\frac{p-q}{1/2+q}\Big) \quad (165)$$
$$\le \frac{1}{\ln 2}\Big((1/2-p)\frac{q-p}{1/2-q} + (1/2+p)\frac{p-q}{1/2+q}\Big) \quad (166)$$
$$= \frac{1}{\ln 2}\,\frac{(p-q)^2}{1/4-q^2} \quad (167)$$
$$\le 8(p-q)^2, \quad (168)$$
where the first inequality holds because $\log(1+x)\le x/\ln 2$ for all $x>-1$, and the second inequality holds because $|q|\le 1/4$.

Taking the observation model to be $X_i$ i.i.d. $\sim\mathrm{Bern}(q)$ and $Y_i = X_i$, we have $I(\mathbf{X}\to\mathbf{Y}) = H(X)$. Assume that under model $f_0$, $q = q_0 = 1/4$; under model $f_1$, $q = q_1 = 1/4 + 1/\sqrt{n}$; and $n\ge 64$. Let $\hat I_n$ be an arbitrary estimator of $I(\mathbf{X}\to\mathbf{Y})$ based on $(X^n,Y^n)$ and let $d(x,y) = |x-y|$. Then we have
$$d\big(H_b(q_0),H_b(q_1)\big) \ge \log(5/3)\,|q_0-q_1| = \log(5/3)/\sqrt{n}. \quad (169)$$
Then we take $s = \log(5/3)/(2\sqrt{n})$ to satisfy the assumption of Lemma 8. For brevity, here we denote $I(\mathbf{X}\to\mathbf{Y})$ by $I$. By Lemma 8,
$$\inf_{\hat I_n}\sup_{\mathcal{M}_0} P_f\big(d(\hat I_n, I)\ge s\big) \ge \inf_{\hat I_n}\max_{j\in\{0,1\}} P_{f_j}\big(d(\hat I_n, H_b(q_j))\ge s\big) \quad (170)$$
$$\ge \frac{1}{4}\exp\big(-D(P_{f_1}\|P_{f_0})\big). \quad (171)$$
Then we bound $D(P_{f_1}\|P_{f_0})$:
$$D(P_{f_1}\|P_{f_0}) = n\,\mathbb{E}_{f_1}\log\frac{P_{f_1}(X_1,Y_1)}{P_{f_0}(X_1,Y_1)} \quad (172)$$
$$\le 8n(q_0-q_1)^2 \quad (173)$$
$$= 8. \quad (174)$$
Thus we have
$$\inf_{\hat I_n}\sup_{\mathcal{M}_0} P_f\big(d(\hat I_n, I)\ge s\big) \ge \frac{1}{4}e^{-8}. \quad (175)$$
Using Markov's inequality,
$$\inf_{\hat I_n}\sup_{\mathcal{P}(\mathbf{X},\mathbf{Y})}\mathbb{E}\big|\hat I_n - I\big| \ge \inf_{\hat I_n}\sup_{\mathcal{M}_0}\mathbb{E}\big|\hat I_n - I\big| \quad (176)$$
$$\ge \frac{1}{4}e^{-8}s = \frac{1}{8}e^{-8}\log(5/3)\,\frac{1}{\sqrt{n}}. \quad (177)$$
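As a sanity check, the bound of Lemma 10 and the size of the constant appearing in (177) can be verified numerically. The grid and the helper function below are our own illustrative choices, with all logarithms in bits:

```python
import numpy as np

def d_bern_bits(a, b):
    # D(Bern(a) || Bern(b)) in bits.
    return a * np.log2(a / b) + (1 - a) * np.log2((1 - a) / (1 - b))

# Lemma 10: for |p|, |q| <= 1/4, D(Bern(1/2 - p) || Bern(1/2 - q)) <= 8 (p - q)^2.
grid = np.linspace(-0.25, 0.25, 101)
worst = max(d_bern_bits(0.5 - p, 0.5 - q) - 8 * (p - q) ** 2
            for p in grid for q in grid)
print("max violation of Lemma 10 on the grid:", worst)   # should be <= 0 up to rounding

# Constant in the minimax lower bound (177): (1/8) e^{-8} log(5/3) / sqrt(n).
n = 64
print("lower bound at n = 64:", np.exp(-8) / 8 * np.log2(5 / 3) / np.sqrt(n))
```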

The bound on $\mathbb{E}|B_n|$ used in the proof of Proposition 2 is derived as follows:
$$\mathbb{E}|B_n| = \mathbb{E}\left|\frac{1}{n}\sum_{k=1}^n\Big(f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big)\right| \quad (178)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\Big|f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big| \quad (179)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\left[2\big\|P(x_{k+1},y_{k+1}|X^k,Y^k) - Q(x_{k+1},y_{k+1}|X^k,Y^k)\big\|_1\log\frac{|\mathcal{X}||\mathcal{Y}|}{\big\|P(x_{k+1},y_{k+1}|X^k,Y^k) - Q(x_{k+1},y_{k+1}|X^k,Y^k)\big\|_1}\right] \quad (180)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\left[2\sqrt{2\ln 2\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}\,\log\frac{|\mathcal{X}||\mathcal{Y}|}{\sqrt{2\ln 2\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}}\right] \quad (181)$$
$$\le \frac{1}{n}\sum_{k=1}^n 2\sqrt{2\ln 2\,\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}\,\log\frac{|\mathcal{X}||\mathcal{Y}|}{\sqrt{2\ln 2\,\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}} \quad (182)$$
$$\le 2\sqrt{2\ln 2\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)/n}\,\log\frac{|\mathcal{X}||\mathcal{Y}|}{\sqrt{2\ln 2\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)/n}}. \quad (183)$$
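Step (180) invokes the continuity bound of Lemma 4, $|f(P)-f(Q)|\le 2\|P-Q\|_1\log\big(|\mathcal{X}||\mathcal{Y}|/\|P-Q\|_1\big)$, which requires the $\ell_1$ distance to be small. A rough numerical check on small random perturbations; the toy setup below is our own, and [45, Lemma 2.7] is applied only when the distance is at most $1/2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def H_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def f(Pxy):
    # f(P) = H_P(X, Y) - H_P(X), i.e., the conditional entropy H_P(Y | X) in bits.
    return H_bits(Pxy.ravel()) - H_bits(Pxy.sum(axis=1))

X, Y = 3, 4                                 # alphabet sizes |X|, |Y|
for _ in range(1000):
    P = rng.dirichlet(np.ones(X * Y)).reshape(X, Y)
    Q = 0.95 * P + 0.05 * rng.dirichlet(np.ones(X * Y)).reshape(X, Y)
    theta = np.abs(P - Q).sum()             # ell_1 distance, at most 0.1 by construction
    if 0 < theta <= 0.5:
        assert abs(f(P) - f(Q)) <= 2 * theta * np.log2(X * Y / theta) + 1e-9
```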

F. Proof of Theorem 3

We decompose
$$\hat I_3 = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log\frac{1}{Q(y_i|Y^{i-1})} - \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log\frac{1}{Q(y_i|X^i,Y^{i-1})}. \quad (184)$$
Following the proof of the almost sure and $L_1$ convergence of $\hat H_2$ in the proof of Proposition 2, we can show that the second term on the right-hand side of (184) converges to $H(\mathbf{Y}\|\mathbf{X})$ almost surely and in $L_1$ under the conditions of Theorem 3. Denote the first term on the right-hand side of (184) by
$$F_n = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log\frac{1}{Q(y_i|Y^{i-1})}. \quad (185)$$
Then it suffices to show the almost sure and $L_1$ convergence of $F_n$ to $H(\mathbf{Y})$. Decompose $F_n - H(\mathbf{Y})$ as
$$F_n - H(\mathbf{Y}) = R_n + S_n,$$
where
$$R_n = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} P(y_i|X^i,Y^{i-1})\log P(y_i|Y^{i-1}) - \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log Q(y_i|Y^{i-1}), \quad (186)$$
$$S_n = -\frac{1}{n}\sum_{i=1}^n\sum_{y_i} P(y_i|X^i,Y^{i-1})\log P(y_i|Y^{i-1}) - H(\mathbf{Y}). \quad (187)$$
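For concreteness, the decomposition (184) is just the plug-in form of the estimator: it averages, at each time $i$, a divergence-like quantity built from the two universal conditional assignments. A minimal sketch of that averaging, assuming the assignments are already available as arrays (the array layout and names are ours, not the paper's):

```python
import numpy as np

def directed_info_estimate(q_joint, q_marg):
    """
    q_joint[i, y] plays the role of Q(y | x^i, y^{i-1})  (causally conditioned assignment),
    q_marg[i, y]  plays the role of Q(y | y^{i-1})        (assignment ignoring X).
    Both are (n, |Y|) arrays with strictly positive rows summing to one.
    Returns the average of sum_y q_joint * log2(q_joint / q_marg),
    mirroring the structure of (184).
    """
    ratio = np.log2(q_joint / q_marg)
    return float(np.mean(np.sum(q_joint * ratio, axis=1)))

# Toy usage with made-up assignments over a binary Y alphabet:
n = 5
q_joint = np.array([[0.9, 0.1]] * n)
q_marg = np.array([[0.6, 0.4]] * n)
print(directed_info_estimate(q_joint, q_marg))
```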

1) Almost sure convergence: Express $R_n$ as $\frac{1}{n}\sum_{i=1}^n Z_i$, where
$$Z_i = -\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log Q(y_i|Y^{i-1}) + \sum_{y_i} P(y_i|X^i,Y^{i-1})\log P(y_i|Y^{i-1}). \quad (188)$$
According to Lemma 2 in Appendix A, the CTW probability assignments $Q(y_i|X^i,Y^{i-1})$ and $Q(y_i|Y^{i-1})$ both converge almost surely to the true probabilities $P(y_i|X^i,Y^{i-1})$ and $P(y_i|Y^{i-1})$. Therefore,
$$\lim_{i\to\infty} Z_i = 0 \quad P\text{-a.s.} \quad (189)$$
Then we know the Cesàro mean of $\{Z_i\}_{i=1}^n$ also converges to zero almost surely, i.e.,
$$\lim_{n\to\infty} R_n = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n Z_i = 0 \quad P\text{-a.s.} \quad (190)$$
Now we show that $S_n$ converges to zero almost surely, which is implied by Birkhoff's ergodic theorem.

2) $L_1$ convergence: We express $R_n$ in another form in (205) and bound $\mathbb{E}|R_n|$ in (206)--(212) below, where
• the first part of (209) is derived from (83), and the second part of (209) is implied by the fact that the CTW probability assignment is lower bounded (27);
• (210) follows by Pinsker's inequality;
• (211) follows by the data-processing inequality;
• (212) follows by the chain rule of relative entropy and the concavity of $\sqrt{\cdot}$.
After applying Lemma 7 in Appendix A, we know $R_n$ converges to zero in $L_1$. By Birkhoff's ergodic theorem, the convergence of $S_n$ is also in $L_1$, which completes the proof of $L_1$ convergence.

G. Proof of Theorem 4

We decompose $\hat I_4$ as
$$\hat I_4 = G_n - \hat H_2, \quad (191)$$
where $\hat H_2$ is the estimator for $H(\mathbf{Y}\|\mathbf{X})$ in $\hat I_2$, and $G_n$ is defined as
$$G_n = \frac{1}{n}\sum_{i=1}^n\sum_{(x_{i+1},y_{i+1})} Q(x_{i+1},y_{i+1}|X^i,Y^i)\log\frac{1}{Q(y_{i+1}|Y^i)}. \quad (192)$$
Since $G_n$ has a form similar to that of $F_n$, we can follow the corresponding steps in the proof of Theorem 3 to establish Theorem 4 analogously.

APPENDIX C
PROOFS OF TECHNICAL LEMMAS

A. Proof of Lemma 1

1) General stationary ergodic processes: The convergence holds almost surely by the Shannon--McMillan--Breiman theorem for the causally conditional entropy rate (see, for example, [33]). We now prove that the AEP also holds in $L_1$. Denote
$$A_n = -\frac{1}{n}\log P(Y^n\|X^n), \quad (193)$$
$$B_n = -\frac{1}{n}\log P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0), \quad (194)$$
$$C_n = B_n - A_n, \quad (195)$$
where $P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0) = \prod_{i=1}^n P(Y_i|X_{-\infty}^i,Y_{-\infty}^{i-1})$. Our goal is to show that $\mathbb{E}|A_n - H(\mathbf{Y}\|\mathbf{X})|$ converges to zero when $n\to\infty$. Note that
$$\mathbb{E}A_n = \frac{1}{n}\sum_{i=1}^n H(Y_i|Y^{i-1},X^i), \quad (196)$$
$$\mathbb{E}B_n = H(\mathbf{Y}\|\mathbf{X}). \quad (197)$$
By the stationarity of $(\mathbf{X},\mathbf{Y})$ and the fact that conditioning reduces entropy, we know $H(Y_i|Y^{i-1},X^i)$ is a nonnegative, nonincreasing sequence in $i$, and further, it converges to $H(\mathbf{Y}\|\mathbf{X})$. Since $\mathbb{E}A_n$ is the Cesàro mean of the sequence $\{H(Y_i|Y^{i-1},X^i)\}_{i=1}^n$, it follows that $\mathbb{E}A_n$ converges to $H(\mathbf{Y}\|\mathbf{X})$ as $n\to\infty$. Thus,
$$\lim_{n\to\infty}\mathbb{E}C_n = 0. \quad (198)$$
We have
$$\mathbb{E}\big|A_n - H(\mathbf{Y}\|\mathbf{X})\big| = \mathbb{E}\big|A_n - \mathbb{E}B_n\big| \quad (199)$$
$$\le \mathbb{E}|C_n| + \mathbb{E}\big|B_n - \mathbb{E}B_n\big|. \quad (200)$$

By Birkhoff's ergodic theorem, $\mathbb{E}|B_n - \mathbb{E}B_n|$ converges to zero when $n\to\infty$. It now suffices to show that $\lim_{n\to\infty}\mathbb{E}|C_n| = 0$. Denote the CDF of the random variable $C_n$ by $F_n(x)$. Then we have
$$\mathbb{E}|C_n| = -\mathbb{E}C_n + 2\int_0^\infty x\,dF_n(x) \quad (201)$$
$$= -\mathbb{E}C_n + 2\int_0^\infty P(C_n > x)\,dx, \quad (202)$$
where the second step follows by integration by parts and the fact that $1 - F_n(x) = P(C_n > x)$. Let $\mathcal{B}(X_{-\infty}^0,Y_{-\infty}^0) \triangleq \{(x^n,y^n)\colon P(x^n,y^n|X_{-\infty}^0,Y_{-\infty}^0) > 0\}$. We have (235) below; then, by Markov's inequality,
$$P\left(\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)} \ge t_n\right) \le \frac{1}{t_n} \quad (203)$$
for arbitrary positive $t_n$. Taking $t_n = 2^{nx}$, $x\ge 0$, we have
$$P\left(\frac{1}{n}\log\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)} \ge x\right) \le 2^{-nx}. \quad (204)$$

The expression for $R_n$ and the bound on $\mathbb{E}|R_n|$ used in the proof of Theorem 3 are as follows:
$$R_n = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} P(y_i|X^i,Y^{i-1})\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})} + \frac{1}{n}\sum_{i=1}^n\sum_{y_i}\big(P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big)\log Q(y_i|Y^{i-1}), \quad (205)$$
$$\mathbb{E}|R_n| \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left|\sum_{y_i} P(y_i|X^i,Y^{i-1})\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})}\right| + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left|\sum_{y_i}\big(P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big)\log Q(y_i|Y^{i-1})\right| \quad (206)$$
$$\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i} P(y_i|X^i,Y^{i-1})\left|\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})}\right|\right] + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i}\big|P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big|\,\big|\log Q(y_i|Y^{i-1})\big|\right] \quad (207)$$
$$\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i} P(y_i|Y^{i-1})\left|\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})}\right|\right] + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i}\log\frac{1}{Q(y_i|Y^{i-1})}\,\big|P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big|\right] \quad (208)$$
$$\le \frac{1}{n}\sum_{i=1}^n\left(\sqrt{\frac{2}{\ln 2}\,\mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)} + \mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)\right) + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\log(2i+|\mathcal{Y}|)\sum_{y_i}\big|P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big|\right] \quad (209)$$
$$\le \frac{1}{n}\sum_{i=1}^n\left(\sqrt{\frac{2}{\ln 2}\,\mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)} + \mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)\right) + \frac{1}{n}\sum_{i=1}^n \log(2i+|\mathcal{Y}|)\sqrt{2\ln 2\,\mathbb{E}D\big(P(y_i|X^i,Y^{i-1})\,\big\|\,Q(y_i|X^i,Y^{i-1})\big)} \quad (210)$$
$$\le \frac{1}{n}\sum_{i=1}^n\left(\sqrt{\frac{2}{\ln 2}\,\mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)} + \mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)\right) + \frac{1}{n}\sum_{i=1}^n \log(2i+|\mathcal{Y}|)\sqrt{2\ln 2\,\mathbb{E}D\big(P(x_i,y_i|X^{i-1},Y^{i-1})\,\big\|\,Q(x_i,y_i|X^{i-1},Y^{i-1})\big)} \quad (211)$$
$$\le \frac{1}{n}D\big(P(y^n)\,\big\|\,Q(y^n)\big) + \sqrt{\frac{2}{\ln 2}\,\frac{D\big(P(y^n)\,\big\|\,Q(y^n)\big)}{n}} + \log(2n+|\mathcal{Y}|)\sqrt{2\ln 2\,\frac{D\big(P(x^n,y^n)\,\big\|\,Q(x^n,y^n)\big)}{n}}. \quad (212)$$

Equivalently,
$$P(C_n > x) \le 2^{-nx}. \quad (213)$$
Plugging (213) into (202), we have
$$\mathbb{E}|C_n| = -\mathbb{E}C_n + 2\int_0^\infty P(C_n > x)\,dx \quad (214)$$
$$\le -\mathbb{E}C_n + \frac{2}{n\ln 2}. \quad (215)$$
By (198), we know
$$\lim_{n\to\infty}\mathbb{E}|C_n| = 0. \quad (216)$$
By (200), we know the AEP for the causally conditional entropy rate holds in $L_1$.

2) Irreducible aperiodic Markov processes: We express $-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X})$ as
$$-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n Z_i, \quad (217)$$
where
$$Z_i = -\log P(Y_i|X_{i-m}^i,Y_{i-m}^{i-1}) - H(\mathbf{Y}\|\mathbf{X}), \quad (218)$$
and $m$ is the order of the Markov process $(\mathbf{X},\mathbf{Y})$. Let
$$g_i = -\log P(Y_i|X_{i-m}^i,Y_{i-m}^{i-1}) \quad (219)$$
and denote $\mathbb{E}g_i$ by $H$. Here $H$ does not depend on $i$ since the Markov process is stationary. We decompose $Z_i$ as
$$Z_i = (g_i^L - H^L) + (g_i^{L'} - H^{L'}), \quad (220)$$
where $g_i^L = g_i 1_{\{|g_i|\le L\}}$, $g_i^{L'} = g_i - g_i^L$, $H^L = \mathbb{E}g_i^L$, and $H^{L'} = \mathbb{E}g_i^{L'} = H(\mathbf{Y}\|\mathbf{X}) - H^L$. We expand
$$\mathbb{E}\left(\sum_{i=1}^n Z_i\right)^2 = \mathbb{E}\left(\sum_{i=1}^n (g_i^L - H^L)\right)^2 + \mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2 + 2\,\mathbb{E}\left[\left(\sum_{i=1}^n (g_i^L - H^L)\right)\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)\right] \quad (221)$$
and bound the three terms on the right-hand side of (221) separately.

For the first term, by Lemma 5 in Appendix A with $\mathbf{X}\leftarrow(\mathbf{X},\mathbf{Y})$, $V_i\leftarrow g_i^L - H^L$, and $V\leftarrow L$, we have
$$\mathbb{E}\left(\sum_{i=1}^n (g_i^L - H^L)\right)^2 = O(nL^2). \quad (222)$$
For the second term, consider
$$\mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2 \le n^2\max_i \mathbb{E}(g_i^{L'} - H^{L'})^2. \quad (223)$$
Define
$$E_{i,K} = \big\{(x_{i-m}^i,y_{i-m}^i)\colon K \le -\log P(y_i|x_{i-m}^i,y_{i-m}^{i-1}) \le K+1\big\}; \quad (224)$$
we have
$$\mathbb{E}(g_i^{L'} - H^{L'})^2 \le \mathbb{E}(g_i^{L'})^2 \quad (225)$$
$$\le \sum_{K=L}^\infty \int_{E_{i,K}} \big(\log P(Y_i|X_{i-m}^i,Y_{i-m}^{i-1})\big)^2\,d\mu \quad (226)$$
$$\le \sum_{K=L}^\infty |\mathcal{Y}|(K+1)^2 2^{-K} \quad (227)$$
$$= O(L^2 2^{-L}), \quad (228)$$
where the last inequality is an inequality developed by McMillan [43], and the last step can be understood intuitively: since the terms decay rapidly, the sum is dominated by the largest term, hence the order. Now we have
$$\mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2 = O(n^2 L^2 2^{-L}). \quad (229)$$
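The step from (227) to (228) says that the tail sum $\sum_{K\ge L}(K+1)^2 2^{-K}$ is of order $L^2 2^{-L}$. A quick numerical check of that order (purely illustrative):

```python
# Numerically confirm sum_{K >= L} (K+1)^2 2^{-K} = O(L^2 2^{-L}).
def tail(L, terms=200):
    return sum((K + 1) ** 2 * 2.0 ** (-K) for K in range(L, L + terms))

for L in (5, 10, 20, 40):
    # The ratio stays bounded (it approaches 2 as L grows), confirming the O(L^2 2^{-L}) order.
    print(L, tail(L) / (L ** 2 * 2.0 ** (-L)))
```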

The chain of relations establishing (235), used above via Markov's inequality, is as follows:
$$\mathbb{E}\left[\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)}\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)}\,\Big|\,X_{-\infty}^0,Y_{-\infty}^0\right]\right] \quad (230)$$
$$= \mathbb{E}\left[\sum_{(x^n,y^n)\in\mathcal{B}(X_{-\infty}^0,Y_{-\infty}^0)} P(x^n,y^n|X_{-\infty}^0,Y_{-\infty}^0)\,\frac{P(y^n\|x^n)}{P(y^n\|x^n,X_{-\infty}^0,Y_{-\infty}^0)}\right] \quad (231)$$
$$= \mathbb{E}\left[\sum_{(x^n,y^n)\in\mathcal{B}(X_{-\infty}^0,Y_{-\infty}^0)} P(y^n\|x^n)\,P(x^n\|y^{n-1},X_{-\infty}^0,Y_{-\infty}^0)\right] \quad (232)$$
$$\le \sum_{(x^n,y^n)} P(y^n\|x^n)\,P(x^n\|y^{n-1}) \quad (233)$$
$$= \sum_{(x^n,y^n)} P(x^n,y^n) \quad (234)$$
$$= 1. \quad (235)$$

For the third term, we apply the Cauchy--Schwarz inequality:
$$2\,\mathbb{E}\left[\left(\sum_{i=1}^n (g_i^L - H^L)\right)\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)\right] \quad (236)$$
$$\le 2\sqrt{\mathbb{E}\left(\sum_{i=1}^n (g_i^L - H^L)\right)^2}\sqrt{\mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2} \quad (237)$$
$$= O\big(n^{3/2}L^2 2^{-L/2}\big). \quad (238)$$
Summing the three terms together and taking $L = 2\log n$, we have
$$\mathbb{E}\left(\sum_{i=1}^n Z_i\right)^2 = O\big(n(\log n)^2\big), \quad (239)$$
and thus
$$\mathbb{E}\left|-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X})\right| = \mathbb{E}\left|\frac{1}{n}\sum_{i=1}^n Z_i\right| \quad (240)$$
$$\le \sqrt{\frac{1}{n^2}\,\mathbb{E}\left(\sum_{i=1}^n Z_i\right)^2} \quad (241)$$
$$= O(n^{-1/2}\log n). \quad (242)$$
Now we deal with the almost sure convergence rates of the AEP for the causally conditional entropy rate. We restate the Gál--Koksma theorem [44] as follows:

Lemma 11 (Gál--Koksma theorem): Let $(\Omega,\mathcal{F},P)$ be a probability space and let $(Z_n)_{n\ge1}$ be a sequence of random variables belonging to $L^p$, $p\ge1$, such that
$$\mathbb{E}\big|Z_{M+1} + Z_{M+2} + \cdots + Z_{M+n}\big|^p = O(\Psi(n)) \quad (243)$$
uniformly in $M$, where $\Psi(n)/n$ is a nondecreasing sequence. Then for every $\epsilon > 0$,
$$Z_1(\omega) + Z_2(\omega) + \cdots + Z_n(\omega) = o\big((\Psi(n)(\log n)^{p+1+\epsilon})^{1/p}\big) \quad P\text{-a.s.} \quad (244)$$
The bound in (239) indicates that if we take $\Psi(n) = n(\log n)^2$ and $p = 2$ in the Gál--Koksma theorem, then for every $\epsilon > 0$,
$$-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X}) = o\big(n^{-1/2}(\log n)^{5/2+\epsilon}\big) \quad P\text{-a.s.} \quad (245){-}(246)$$

B. Proof of Lemma 2

Denote the alphabet size $|\mathcal{X}|$ by $M$. We examine the updating computation of $P_w^\lambda(X_{n+1} = q|x^n)$, $q = 0,1,\ldots,M-1$. For an internal node $s$ in the updating path, if $is$ is in the updating path, we have (33). For the leaf node $v$ in the updating path,
$$P_w^v(X_{n+1} = q|x^n) = P_e^v(X_{n+1} = q|x^n). \quad (247)$$
The computation of $P_w^\lambda(X_{n+1} = q|x^n)$ starts from a leaf and is repeated recursively along the updating path, until we reach the root node $\lambda$ and obtain $P_w^\lambda(X_{n+1} = q|x^n)$. Thus, $P_w^\lambda(X_{n+1} = q|x^n)$ is a weighted sum of $P_e^s(X_{n+1} = q|x^n)$, where $s$ is any node in the updating path.

Let $\{s\to\lambda\}$ denote the set of nodes in the path from $s$ to $\lambda$. The weight associated with $P_e^s(X_{n+1} = q|x^n)$ is
$$\frac{\beta^s(x^n)}{\prod_{u\in\{s\to\lambda\}}\big(\beta^u(x^n)+1\big)}, \quad (248)$$
where $s$ is an internal node in the updating path. The weight associated with $P_w^v(X_{n+1} = q|x^n)$, where $v$ is the leaf node in the updating path, is
$$\frac{1}{\prod_{u\in\{v\to\lambda\}\setminus\{v\}}\big(\beta^u(x^n)+1\big)}. \quad (249)$$
The convergence properties of $P_w^\lambda(X_{n+1} = q|X^n)$ depend on the limiting behavior of $\beta^s(X^n)$ at every node $s$ along the updating path. If $s$ is an internal node in the tree representation of the source, we actually have $\lim_{n\to\infty}\beta^s(X^n) = 0$ almost surely. This fact was stated in [15, Lemma 4]. Here, we restate this fact and give a proof for stationary irreducible aperiodic finite-alphabet Markov processes.

Lemma 12: Let $s$ be an internal node in the tree representation of the source. Then
$$\lim_{n\to\infty}\beta^s(X^n) = 0 \quad P\text{-a.s.} \quad (250)$$
Proof: It suffices to show
$$\lim_{n\to\infty}\frac{\beta^s(X^n)}{\beta^s(X^n)+1} = 0 \quad P\text{-a.s.} \quad (251)$$
We have
$$\frac{\beta^s(X^n)}{\beta^s(X^n)+1} \quad (252)$$
$$= \frac{P_e^s(X^n)}{2P_w^s(X^n)} \quad (253)$$
$$\le \frac{P_e^s(X^n)}{\prod_{i=0}^{M-1} P_w^{is}(X^n)} \quad (254)$$
$$\le 2^M\,\frac{P_e^s(X^n)}{\prod_{i=0}^{M-1} P_e^{is}(X^n)} \quad (255)$$
$$= 2^M\exp\left\{n_s\left(\frac{1}{n_s}\log P_e^s(X^n) - \frac{1}{n_s}\log\prod_{i=0}^{M-1} P_e^{is}(X^n)\right)\right\}, \quad (256)$$
where $n_s$ denotes the number of symbols in $X^n$ with context $s$, and the inequalities follow from applying (24) repeatedly. Here, since $s$ is an internal node of the tree, without loss of generality we can assume that the offsprings of $s$ do not all have the same conditional distribution; if this were violated, we could simply iterate the inequalities obtained above until we reach the leaf nodes of the tree, after which we can apply the same arguments that will be shown later.
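For intuition, $\beta^s$ is the ratio of the two mixture components in the standard CTW recursion $P_w^s = \tfrac{1}{2}P_e^s + \tfrac{1}{2}\prod_i P_w^{is}$ at an internal node $s$. The following is a minimal sketch of that recursion for a binary alphabet with a fixed-depth context tree and Krichevsky--Trofimov estimates at the nodes; it follows the generic textbook formulation, not necessarily the exact bookkeeping of (24)--(33) in this paper, and all function names are ours:

```python
import numpy as np
from itertools import product

def kt_block(counts):
    # Krichevsky-Trofimov block probability for a binary count vector (n0, n1).
    p, n = 1.0, 0
    for sym_count in counts:
        for j in range(sym_count):
            p *= (j + 0.5) / (n + 1.0)
            n += 1
    return p

def ctw_block_probability(x, depth):
    """Weighted CTW block probability P_w at the root for a binary sequence x,
    using the first `depth` symbols only as initial context (a common simplification)."""
    x = list(x)
    counts = {ctx: [0, 0] for ctx in product((0, 1), repeat=depth)}
    for i in range(depth, len(x)):
        ctx = tuple(x[i - depth:i])          # the `depth` symbols preceding x[i]
        counts[ctx][x[i]] += 1

    def node(ctx):
        # Returns (symbol counts at this node, weighted probability P_w at this node).
        if len(ctx) == depth:
            c = counts[ctx]
            return c, kt_block(c)            # at maximal depth, P_w equals the KT estimate
        c0, pw0 = node((0,) + ctx)           # children extend the context into the past
        c1, pw1 = node((1,) + ctx)
        c = [c0[0] + c1[0], c0[1] + c1[1]]
        pe = kt_block(c)
        return c, 0.5 * pe + 0.5 * pw0 * pw1
    return node(())[1]

rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=200)
print(ctw_block_probability(x, depth=3))
```

The sequential assignment discussed in Lemma 2 then corresponds to the ratio of successive block probabilities, $Q(x_{n+1}|x^n) = P_w^\lambda(x^{n+1})/P_w^\lambda(x^n)$.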
It was shown in [32] that the Krichevsky--Trofimov probability estimate of the sequence $X^n$, i.e., $P_e(X^n)$, satisfies the following bound:
$$\left|-\frac{\log P_e(X^n)}{n} + \sum_{a\in\mathcal{X}}\frac{N(a|X^n)}{n}\log\frac{N(a|X^n)}{n}\right| \le \frac{M-1}{2}\,\frac{\log n}{n} + \frac{C}{n}, \quad (257)$$
where $N(a|X^n)$ denotes the number of occurrences of the symbol $a$ in the sequence $X^n$, and $C$ is a constant depending only on the alphabet size $M$. For a leaf node $s$, by the property of the Krichevsky--Trofimov probability estimate, we know
$$\lim_{n\to\infty}\frac{N(a|X^n)}{n} = \pi(a), \quad (258)$$
where $\pi(\cdot)$ is the stationary distribution of $\mathbf{X}$. Equation (258) implies that
$$\lim_{n\to\infty}\frac{-\log P_e(X^n)}{n} = H(\pi). \quad (259)$$

Applying the same argument to $\frac{1}{n_s}\log P_e^s(X^n)$, we have
$$\lim_{n\to\infty}\frac{-\log P_e^s(X^n)}{n_s} = H(\pi_s), \quad (260)$$
where $\pi_s$ is the stationary conditional distribution conditioned on context $s$. Analogously, for node $is$, we have
$$\lim_{n\to\infty}\frac{-\log P_e^{is}(X^n)}{n_{is}} = H(\pi_{is}). \quad (261)$$

1 is n Fix ǫ > 0. Since ( , ) is bounded and closed, f( ) is lim log Pe (X )= H(πis), (261) M X Y · n→∞ nis − uniformly continuous. Thus there exists δǫ such that f(P ) f(Q) ǫ if P Q δ . Furthermore, f( ) is bounded| − thus | ≤ k − k1 ≤ ǫ · by fmax , log + log . Therefore, we have M−1 |X | |Y| 1 s n 1 is n lim log P (X ) log P (X ) f(P ) f(Q) ǫ1 + fmax1 n→∞ n e − n e {kP −Qk1≤δǫ} {kP −Qk1>δǫ} s s i=0 | − |≤ Y (272)

We know that $Q(q|X^n) = P_w^\lambda(X_{n+1} = q|X^n)$ can be expressed as a weighted sum of $P_e^s(X_{n+1} = q|X^n)$ for $s$ in the updating path:
$$Q(q|X^n) = \sum_s w_s\,P_e^s(X_{n+1} = q|X^n), \quad (267)$$
where the $w_s$ are given in (248) and (249). Lemma 12 implies that for $s$ an internal node of the tree representation of $\mathbf{X}$, $w_s\to 0$. Hence
$$\lim_{n\to\infty}\left|Q(q|X^n) - \sum_{s\ \text{leaf node}} w_s\,P_e^s(X_{n+1} = q|X^n)\right| = 0 \quad P\text{-a.s.} \quad (268)$$
Under the assumption of Lemma 2, the Markov process $\mathbf{X}$ is ergodic, hence
$$\lim_{n\to\infty}\left|P_e^s(X_{n+1} = q|X^n) - P(q|X^n)\right| = 0, \quad (269)$$
where $P(q|X^n)$ is the true conditional probability. Thus we have
$$\big|Q(x_{n+1}|X^n) - P(x_{n+1}|X^n)\big| = \big|P_w^\lambda(x_{n+1}|X^n) - P(x_{n+1}|X^n)\big| \quad (270)$$
$$\to 0 \quad P\text{-a.s.} \quad (271)$$

C. Proof of Lemma 3

Fix $\epsilon > 0$. Since $\mathcal{M}(\mathcal{X},\mathcal{Y})$ is bounded and closed, $f(\cdot)$ is uniformly continuous. Thus there exists $\delta_\epsilon$ such that $|f(P) - f(Q)|\le\epsilon$ if $\|P-Q\|_1\le\delta_\epsilon$. Furthermore, $f(\cdot)$ is bounded by $f_{\max} \triangleq \log|\mathcal{X}| + \log|\mathcal{Y}|$. Therefore, we have
$$|f(P) - f(Q)| \le \epsilon\,1_{\{\|P-Q\|_1\le\delta_\epsilon\}} + f_{\max}\,1_{\{\|P-Q\|_1>\delta_\epsilon\}} \quad (272)$$
$$\le \epsilon + f_{\max}\,\frac{\|P-Q\|_1}{\delta_\epsilon} \quad (273)$$
$$= \epsilon + K_\epsilon\|P-Q\|_1, \quad (274)$$
where $K_\epsilon = f_{\max}/\delta_\epsilon$.

D. Proof of Lemma 4

Since
$$H(Y|X) = H(X,Y) - H(X), \quad (275)$$
we can bound $|f(P) - f(Q)|$ as
$$|f(P) - f(Q)| = \big|H_P(X,Y) - H_P(X) - H_Q(X,Y) + H_Q(X)\big| \quad (276)$$
$$\le \big|H_P(X,Y) - H_Q(X,Y)\big| + \big|H_P(X) - H_Q(X)\big|. \quad (277)$$
Now, by [45, Lemma 2.7], we have
$$\big|H_P(X,Y) - H_Q(X,Y)\big| \le \theta\log\frac{|\mathcal{X}||\mathcal{Y}|}{\theta}, \quad (278)$$
$$\big|H_P(X) - H_Q(X)\big| \le \theta_X\log\frac{|\mathcal{X}|}{\theta_X}, \quad (279)$$
where $\theta = \|P_{XY} - Q_{XY}\|_1$ and $\theta_X = \|P_X - Q_X\|_1$. Since
$$\theta = \sum_{x\in\mathcal{X},\,y\in\mathcal{Y}}|P(x,y) - Q(x,y)| \quad (280)$$
$$= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}|P(x,y) - Q(x,y)| \quad (281)$$
$$\ge \sum_{x\in\mathcal{X}}\left|\sum_{y\in\mathcal{Y}}\big(P(x,y) - Q(x,y)\big)\right| \quad (282)$$
$$= \sum_{x\in\mathcal{X}}|P(x) - Q(x)| \quad (283)$$
$$= \theta_X, \quad (284)$$
we have
$$|f(P) - f(Q)| \le 2\theta\log\frac{|\mathcal{X}||\mathcal{Y}|}{\theta}. \quad (285)$$

E. Proof of Lemma 5

We first define the $\alpha$-mixing coefficient of a stationary process.

Definition 4 ($\alpha$-mixing coefficient): For a stationary process $\mathbf{X}$ adapted to the filtration $(\mathcal{F}_n)_{-\infty}^\infty$, the $\alpha$-mixing coefficient is defined as
$$\alpha(n) \triangleq \sup\big|P(A\cap B) - P(A)P(B)\big|, \quad (286)$$
where the supremum is over all $A\in\mathcal{F}_{-\infty}^0$ and $B\in\mathcal{F}_n^\infty$.

According to [46], if $\mathbf{X}$ is a stationary irreducible aperiodic Markov process, $\alpha(n)$ tends to zero exponentially fast in $n$, i.e., there exist $C_7 > 0$ and $C_8 > 0$ such that
$$\alpha(n) \le C_7 e^{-C_8 n}. \quad (287)$$
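For a concrete feel for (287): it is a standard fact that the $\alpha$-mixing coefficient of a stationary Markov chain is bounded above by its $\beta$-mixing coefficient, which is in turn controlled by the total-variation distance between the $n$-step transition law and the stationary distribution, and this distance decays geometrically for an irreducible aperiodic finite chain. A toy illustration for a two-state chain (our own example; the constants $C_7$, $C_8$ of (287) are not computed here):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                   # irreducible, aperiodic two-state chain
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                           # stationary distribution (2/3, 1/3)

Pn = np.eye(2)
for n in range(1, 21):
    Pn = Pn @ P
    tv = 0.5 * np.max(np.abs(Pn - pi).sum(axis=1))   # worst-case n-step TV distance
    if n % 5 == 0:
        # Roughly geometric decay, with rate |second eigenvalue| = 0.7.
        print(n, tv)
```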
We bound $\mathbb{E}\big((1/n)\sum_{i=1}^n V_i\big)^2$ as follows:
$$\mathbb{E}\left(\frac{1}{n}\sum_{i=1}^n V_i\right)^2 = \frac{1}{n^2}\sum_{i=1}^n\mathbb{E}|V_i|^2 + \frac{2}{n^2}\sum_{1\le i<j\le n}\mathbb{E}V_iV_j. \quad (288)$$

REFERENCES

[1] H. Marko, "The bidirectional communication theory–a generalization of information theory," IEEE Trans. Commun., vol. COM-21, pp. 1345–1351, 1973.
[2] J. L. Massey, "Causality, feedback, and directed information," in Proc. Int. Symp. Inf. Theory Appl., Honolulu, HI, Nov. 1990, pp. 303–305.
[3] G. Kramer, Directed Information for Channels with Feedback. Konstanz: Hartung-Gorre Verlag, 1998, Dr. sc. techn. Dissertation, Swiss Federal Institute of Technology (ETH) Zurich.
[4] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Trans. Inf. Theory, vol. 49, no. 1, pp. 4–21, 2003.
[5] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, no. 1, pp. 323–349, 2009.
[6] Y.-H. Kim, "A coding theorem for a class of stationary channels with feedback," IEEE Trans. Inf. Theory, vol. 54, no. 4, pp. 1488–1499, 2008.
[7] H. H. Permuter, T. Weissman, and A. J. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2, pp. 644–662, 2009.
[8] H. H. Permuter, Y.-H. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 3248–3259, Jun. 2011.
[9] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, pp. 424–438, 1969.
[10] P. Mathai, N. C. Martins, and B. Shapiro, "On the detection of gene network interconnections using directed mutual information," in Proc. UCSD Inf. Theory Appl. Workshop, 2007.
[11] A. Rao, A. O. Hero, D. J. States, and J. D. Engel, "Using directed information to build biologically relevant influence networks," J. Bioinform. Comput. Biol., vol. 6, no. 3, pp. 493–519, 2008.
[12] S. Verdú, "Universal estimation of information measures," in Proc. IEEE Inf. Theory Workshop, 2005.
[13] A. D. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inf. Theory, vol. 35, no. 6, pp. 1250–1258, 1989.
[14] J. Ziv and N. Merhav, "A measure of relative entropy between individual sequences with application to universal classification," IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1270–1279, 1993.
[15] H. Cai, S. R. Kulkarni, and S. Verdú, "Universal divergence estimation for finite-alphabet sources," IEEE Trans. Inf. Theory, vol. 52, no. 8, pp. 3456–3475, 2006.
[16] M. Burrows and D. J. Wheeler, A Block-Sorting Lossless Data Compression Algorithm. Digital Systems Research Center, Tech. Rep. 124, 1994.
[17] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653–664, 1995.

[29] F. Willems and T. Tjalkens, Complexity Reduction of the Context-Tree Weighting Algorithm: A Study for KPN Research. Univ. Eindhoven, Eindhoven, The Netherlands, EIDMA Rep. RS.97.01, 1997.
[30] T. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems, "Sequential weighting algorithms for multi-alphabet sources," in 6th Joint Swedish–Russian Int. Workshop Inf. Theory, 1993, pp. 230–234.
[31] F. M. J. Willems, "The context-tree weighting method: Extensions," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 792–798, 1998.
[32] R. E. Krichevsky and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inf. Theory, vol. 27, no. 2, pp. 199–207, 1981.
[33] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: Rate–distortion theorems and error exponents for a general source," IEEE Trans. Inf. Theory, vol. 53, no. 6, pp. 2154–2179, 2007.
[34] J. Birch, "Approximations for the entropy for functions of Markov chains," Ann. Math. Statist., vol. 33, pp. 930–938, 1962.
[35] F. L. Gland and L. Mevel, "Exponential forgetting and geometric ergodicity in hidden Markov models," Math. Control Signals Syst., vol. 13, no. 1, pp. 63–93, 2000.
[36] B. M. Hochwald and P. Jelenković, "State learning and mixing in entropy of hidden Markov processes and the Gilbert–Elliott channel," IEEE Trans. Inf. Theory, vol. 45, no. 1, pp. 128–138, 1999.
[37] S. Kleinberg and G. Hripcsak, "A review of causal inference for biomedical informatics," J. Biomed. Inform., vol. 44, no. 6, pp. 1102–1112, 2011.
[38] L. Breiman, "The individual ergodic theorem of information theory," Ann. Math. Statist., vol. 28, no. 3, pp. 809–811, 1957; correction, vol. 31, no. 3, pp. 809–810, 1960.
[39] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[40] A. R. Barron, "Entropy and the central limit theorem," Ann. Probab., vol. 14, pp. 336–342, 1986.
[41] L. Breiman, Probability. SIAM: Society for Industrial and Applied Mathematics, 1992.
[42] A. Tsybakov, Introduction to Nonparametric Estimation. Springer-Verlag, 2008.
[43] B. McMillan, "The basic theorems of information theory," Ann. Math. Statist., vol. 24, no. 2, pp. 196–219, 1953.
[44] I. S. Gál and J. F. Koksma, "Sur l'ordre de grandeur des fonctions sommables," C. R. Acad. Sci. Paris, vol. 227, pp. 1321–1323, 1948.
[45] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest: Akadémiai Kiadó, 1981.
[46] R. Bradley, "Basic properties of strong mixing conditions. A survey and some open questions," Probab. Surveys, vol. 2, pp. 107–144, 2005.
[47] D. Bosq, "Nonparametric statistics for stochastic processes," Lecture Notes in Statist., 1996.

Lei Zhao received the B.Eng. degree from Tsinghua University, China, in 2003, the M.S. degree in Electrical and Computer Engineering from Iowa State University, Ames, in 2006, and the Ph.D. degree in Electrical Engineering from Stanford University, California, in 2011.
Dr. Zhao is currently working at Jump Operations, Chicago, IL, USA.

Young-Han Kim (S'99–M'06–SM'12) received the B.S. degree with honors in electrical engineering from Seoul National University, Seoul, Korea, in 1996 and the M.S. degrees in electrical engineering and in statistics, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 2001, 2006, and 2006, respectively.
In July 2006, he joined the University of California, San Diego, where he is an Associate Professor of Electrical and Computer Engineering. His research interests are in statistical signal processing and information theory, with applications in communication, control, computation, networking, data compression, and learning.
Dr. Kim is a recipient of the 2008 NSF Faculty Early Career Development (CAREER) Award, the 2009 US-Israel Binational Science Foundation Bergmann Memorial Award, and the 2012 IEEE Information Theory Paper Award. He is currently on the Editorial Board of the IEEE TRANSACTIONS ON INFORMATION THEORY, serving as an Associate Editor for Shannon theory.
He is also serving as a Distinguished Lecturer for the IEEE Information Theory Society.

Jiantao Jiao (S'13) received the B.Eng. degree with the highest honor in Electronic Engineering from Tsinghua University, Beijing, China, in 2012. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering, Stanford University. His research interests include information theory and statistical signal processing, with applications in communication, control, computation, networking, data compression, and learning.
Mr. Jiao is a recipient of the Stanford Graduate Fellowship (SGF), the highest award offered by Stanford University.

Tsachy Weissman (S'99–M'02–SM'07–F'13) graduated summa cum laude with a B.Sc. in electrical engineering from the Technion in 1997, and earned his Ph.D. at the same place in 2001. He then worked at Hewlett-Packard Laboratories with the information theory group until 2003, when he joined Stanford University, where he is Associate Professor of Electrical Engineering and incumbent of the STMicroelectronics chair in the School of Engineering. He has spent leaves at the Technion and at ETH Zurich.
Tsachy's research is focused on information theory, statistical signal processing, the interplay between them, and their applications. He is a recipient of several best paper awards and prizes for excellence in research. He currently serves on the editorial boards of the IEEE TRANSACTIONS ON INFORMATION THEORY and Foundations and Trends in Communications and Information Theory.

Haim Permuter (M'08) received his B.Sc. (summa cum laude) and M.Sc. (summa cum laude) degrees in Electrical and Computer Engineering from Ben-Gurion University, Israel, in 1997 and 2003, respectively, and the Ph.D. degree in Electrical Engineering from Stanford University, California, in 2008.
Between 1997 and 2004, he was an officer at a research and development unit of the Israeli Defense Forces. He is currently a senior lecturer at Ben-Gurion University.
Dr. Permuter is a recipient of the Fulbright Fellowship, the Stanford Graduate Fellowship (SGF), the Allon Fellowship, and the 2009 U.S.-Israel Binational Science Foundation Bergmann Memorial Award.