arXiv:1201.2334v4 [cs.IT] 30 May 2013

Universal Estimation of Directed Information

Jiantao Jiao, Student Member, IEEE, Haim H. Permuter, Member, IEEE, Lei Zhao, Young-Han Kim, Senior Member, IEEE, and Tsachy Weissman, Fellow, IEEE

Abstract—Four estimators of the directed information rate between a pair of jointly stationary ergodic finite-alphabet processes are proposed, based on universal probability assignments. The first one is a Shannon–McMillan–Breiman type estimator, similar to those used by Verdú (2005) and Cai, Kulkarni, and Verdú (2006) for estimation of other information measures. We show the almost sure and L1 convergence properties of this estimator for any underlying universal probability assignment. The other three estimators map universal probability assignments to different functionals, each exhibiting relative merits such as smoothness, nonnegativity, and boundedness. We establish the consistency of these estimators in almost sure and L1 senses, and derive near-optimal rates of convergence in the minimax sense under mild conditions. These estimators carry over directly to estimating other information measures of stationary ergodic finite-alphabet processes, such as entropy rate and mutual information rate, with near-optimal performance, and provide alternatives to classical approaches in the existing literature. Guided by these theoretical results, the proposed estimators are implemented using the context-tree weighting algorithm as the universal probability assignment. Experiments on synthetic and real data are presented, demonstrating the potential of the proposed schemes in practice and the utility of directed information estimation in detecting and measuring causal influence and delay.

Index Terms—Causal influence, context-tree weighting, directed information, rate of convergence, universal probability assignment.

Manuscript received Month 00, 0000; revised Month 00, 0000; accepted Month 00, 0000. Date of current version Month 00, 0000. This work was supported in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, by the US–Israel Binational Science Foundation (BSF), and by the Air Force Office of Scientific Research (AFOSR) through Grant FA9550-10-1-0124. Haim H. Permuter was supported in part by the Marie Curie Reintegration Fellowship. The material in this paper was presented in part at the 2010 IEEE International Symposium on Information Theory, Austin, TX, and the 2012 IEEE International Symposium on Information Theory, Cambridge, MA.

Jiantao Jiao is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA (e-mail: [email protected]).

Haim H. Permuter is with the Department of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel (e-mail: [email protected]).

Lei Zhao was with the Department of Electrical Engineering, Stanford University, Stanford, CA, USA. He is now with Jump Operations, Chicago, IL 60654, USA (e-mail: [email protected]).

Young-Han Kim is with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, USA (e-mail: [email protected]).

Tsachy Weissman is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA (e-mail: [email protected]).

Communicated by I. Kontoyiannis, Associate Editor for Shannon Theory.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIT.2013.0000000

I. INTRODUCTION

Introduced by Marko [1] and Massey [2], directed information arises as a natural counterpart of mutual information for channel capacity when causal feedback from the receiver to the sender is present. In [3], Kramer extended the use of directed information to discrete networks with feedback and memory, including the two-way channel and the multiple access channel [4]. Tatikonda and Mitter [5] used the directed information spectrum to establish a general channel coding theorem for channels with feedback and memory. For a class of stationary channels with feedback, where the output is a function of the current and past m inputs and channel noise, Kim [6] proved that the feedback capacity is equal to the limit of the maximum normalized directed information from the input to the output. Permuter, Weissman, and Goldsmith [7] considered the capacity of discrete-time finite-state channels with feedback, where the feedback is a time-invariant function of the output. Under mild conditions, they showed that the capacity is again the limit of the maximum normalized directed information. Recently, Permuter, Kim, and Weissman [8] showed that directed information plays an important role in portfolio theory, data compression, and hypothesis testing under causality constraints.

Beyond information theory, directed information is a valuable tool in biology, for it provides an alternative to the notion of Granger causality [9], which has been perhaps the most widely-used means of identifying causal influence between two random processes. For example, Mathai, Martins, and Shapiro [10] used directed information to identify pairwise influence in gene networks. Similarly, Rao, Hero, States, and Engel [11] used directed information to test the direction of influence in gene networks.

Since directed information has significance in various fields, it is of both theoretical and practical importance to develop efficient methods of estimating it. The problem of estimating information measures, such as entropy, relative entropy, and mutual information, has been extensively studied in the literature. Verdú [12] gave an overview of universal estimation of information measures. Wyner and Ziv [13] applied the idea of Lempel–Ziv parsing to estimate entropy rate, which converges in probability for all stationary ergodic processes. Ziv and Merhav [14] used Lempel–Ziv parsing to estimate relative entropy (Kullback–Leibler divergence) and established consistency under the assumption that the observations are generated by independent Markov sources. Cai, Kulkarni, and Verdú [15] proposed two universal relative entropy estimators for finite-alphabet sources, one based on the Burrows–Wheeler transform (BWT) [16] and the other based on the context-tree weighting (CTW) algorithm [17]. The BWT-based estimator

n n was applied in universal entropy estimation by Cai, Kulkarni, X and (x1, x2,...,xn) as x . Calligraphic letters , ,... and Verd´u[18], while the CTW-based one was applied in denote alphabets of X,Y,..., and denotes the cardinalityX Y universal erasure entropy estimation by Yu and Verd´u[19]. of . Boldface letters X, Y,... denote|X | stochastic processes, For the problem of estimating directed information, Quinn, andX throughout this paper, they are finite-alphabet. Coleman, Kiyavashi, and Hatspoulous [20] developed an es- Given a probability law P , P (xi) = P Xi = xi denotes timator to infer causality in an ensemble of neural spike train { i } i−1 the probability mass function (pmf) of X and P (xi x ) recordings. Assuming a parametric generalized linear model i−1 i−1| denotes the conditional pmf of Xi given X = x , i.e., and stationary ergodic Markov processes, they established { } with slight abuse of notation, xi here is a dummy variable and strong consistency results. Compared to [20], Zhao, Kim, i−1 P (xi x ) is an element of ( ), the probability simplex Permuter, and Weissman [21] focused on universal methods on |, representing the saidM conditionalX pmf. Accordingly, for arbitrary stationary ergodic processes with finite alphabet P (xXXi−1) denotes the conditional pmf P (x xi−1) evalu- i| i| and showed their L1 consistencies. ated for the random sequence Xi−1, which is an ( )- As an improvement and extension of [21], the main con- i−1 M X valued random vector, while P (Xi X ) is the random vari- tribution of this paper is a general framework for estimating | i−1 able denoting the Xi-th component of P (xi X ). Through- information measures of stationary ergodic finite-alphabet pro- out this paper, log( ) is base 2 and ln( ) is| base e. cesses, using “single-letter” information-theoretic functionals. · · Although our methods can be applied in estimating a number of information measures, for concreteness and relevance to emerging applications we focus on estimating the directed information rate between a pair of jointly stationary ergodic A. Directed Information finite-alphabet processes. The first proposed estimator is adapted from the universal Given a pair of random sequences Xn and Y n, the directed relative entropy estimator in [15] using the CTW algorithm, information from Xn to Y n is defined as and we provide a refined analysis yielding strong consistency n results. We further propose three additional estimators in a I(Xn Y n) , I(Xi; Y Y i−1) (1) → i| unified framework, present both weak and strong consistency i=1 X results, and establish near-optimal rates of convergence under = H(Y n) H(Y n Xn), (2) mild conditions. We then employ our estimators on both sim- − k where n n is the causally conditional entropy [3], ulated and real data, showing their effectiveness in measuring H(Y X ) defined as k channel delays and causal influences between different pro- n cesses. In particular, we use these estimators on the daily stock H(Y n Xn) , H(Y Y i−1,Xi). (3) market data from 1990 to 2011 to observe a significant level k i| i=1 of causal influence from the Dow Jones Industrial Average to X the Hang Seng Index, but relatively low causal influence in Compared to the reverse direction. I(Xn; Y n)= H(Y n) H(Y n Xn), (4) The rest of the paper is organized as follows. Section II − | reviews preliminaries on directed information, universal prob- directed information in (2) has the causally conditional en- ability assignments, and the context-tree weighting algorithm. 
tropy in place of the conditional entropy. Thus, unlike mu- Section III presents our proposed estimators and their basic tual information, directed information is not symmetric, i.e., properties. Section IV is dedicated to performance guarantees I(Y n Xn) = I(Xn Y n), in general. → 6 → of the proposed estimators, including their consistencies and The following notation of causally conditional pmfs will be miminax-optimal rates of convergence. Section V shows ex- used throughout: perimental results in which we apply the proposed estimators n to simulated and real data. Section VI concludes the paper. p(xn yn)= p(x xi−1,yi), (5) k i| The proofs of the main results are given in the Appendices. i=1 Yn II. PRELIMINARIES p(xn yn−1)= p(x xi−1,yi−1). (6) k i| i=1 We begin with mathematical definitions of directed in- Y formation and causally conditional entropy. We also define It can be easily verified that universal and pointwise universal probability assignments. We p(xn,yn)= p(yn xn)p(xn yn−1) (7) then introduce the context-tree weighting (CTW) algorithm k k used in our implementations of the universal estimators that and that we have the conservation laws: are introduced in the next section. I(Xn; Y n)= I(Xn Y n)+ I(Y n−1 Xn), (8) Throughout the paper, we use uppercase letters X,Y,... to → → denote random variables and lowercase letters x, y, . . . to de- I(Xn; Y n)= I(Xn−1 Y n)+ I(Y n−1 Xn) → → note values they assume. By convention, X = means that X n is a degenerate random variable (unspecified constant)∅ regard- + I(X ; Y Xi−1, Y i−1), (9) i i| less of its support. We denote the -tuple as i=1 n (X1,X2,...,Xn) X 3 where Definition 2 (Pointwise universal probability assignment) A probability assignment Q is said to be pointwise universal n−1 n n−1 n (10) I(Y X )= I(( , Y ) X ) for P if → ∅ n→ = H(Xn) H(X Xi−1, Y i−1) (11) 1 1 1 1 − i| lim sup log log 0 P -a.s. i=1 n Q(Xn) − n P (Xn) ≤ X n→∞   denotes the reverse directed information. Other interesting (21) for every P P. A probability assignment Q is said to be properties of directed information can be found in [3], [22], ∈ [23]. pointwise universal if it is pointwise universal for the class of The directed information rate [3] between a pair of jointly stationary ergodic probability measures. stationary finite-alphabet processes X and Y is defined as It is well known that there exist universal and pointwise universal probability assignments. Ornstein [25] constructed a , 1 n n I(X Y) lim I(X Y ). (12) pointwise universal probability assignment, which was gen- → n→∞ n → eralized to Polish spaces by Algoet [26]. Morvai, Yakowitz, The existence of the limit can be checked [3] as and Algoet [27] used universal source codes to induce a 1 I(X Y) = lim I(Xn Y n) (13) probability assignment and established its universality. Since → n→∞ n → the quantity (1/n) log(1/Q(Xn)) is generally unbounded, a 1 = lim (H(Y n) H(Y n Xn)) (14) pointwise universal probability assignment is not necessarily n→∞ n − k universal. However, if we have a pointwise universal probabil- n 1 i−1 ity assignment, it is easy to construct a probability assignment = lim H(Yi Y ) n→∞ n | that is both pointwise universal and universal. Let Q (xn) be i=1 1 n X n a pointwise universal probability assignment and Q2(x ) be 1 i−1 i lim H(Yi Y ,X ) (15) the i.i.d. 
uniform distribution, then it is easy to verify that − n→∞ n | i=1 ˜ n n n −1 X 0 −1 Q(x )= anQ2(x )+(1 an)Q1(x ) (22) = H(Y0 Y−∞) H(Y0 X−∞, Y−∞), (16) − | − | is both universal and pointwise universal provided that a where the last equality is obtained via the property of Ces´aro n decays subexponentially, for example, an = 1/n. For more mean and standard martingale arguments; see [24, Chs. 4 discussions on universal probability assignments, see, for Y and 16]. Note that the entropy rate H( ) of the process example, [28] and the references therein. Y −1 is equal to H(Y0 Y−∞). In a similar vein, the causally conditional entropy rate| is defined as 1 C. Context-Tree Weighting H(Y X) , lim H(Y n Xn) (17) k n→∞ n k The sequential probability assignment we use in the imple- = H(Y X0 , Y −1 ). (18) mentations of our directed information estimators is the cel- 0| −∞ −∞ ebrated context-tree weighting (CTW) algorithm by Willems, Thus, Shtarkov, and Tjalken [17]. One of the main advantages of I(X Y)= H(Y) H(Y X). (19) → − k the CTW algorithm is that its computational complexity is This identity shows that if we estimate H(Y) and H(Y X) linear in the block length n, and the algorithm provides the separately and if both estimates converge, we have a conver-k probability assignments Q directly; see [17] and [29]. Note gent estimate of the directed information rate. that while the original CTW algorithm was tuned for binary processes, it has been extended for larger alphabets in [30], an B. Universal Probability Assignment extension that we use in this paper. In our experiments with simulated data, we assume that the depth of the context tree is A probability assignment Q consists of a set of conditional larger than the memory of the source. This assumption can be pmfs Q(x xi−1) for every xi−1 i−1 and i = 1, 2,.... i alleviated by the algorithm introduced by Willems [31], which Note that Q| induces a probability∈ measure X on a random pro- we will not implement in this paper. cess X and the pmf Q(xn)= Q(x )Q(x x ) Q(x xn−1) 1 2 1 n An example of a context tree of input sequence x8 with on Xn for each n. | ··· | −2 a binary alphabet is shown in Fig. 1. In general, each node in Definition 1 (Universal probability assignment) Let P be the tree corresponds to a context, which is a string of symbols a class of probability measures. A probability assignment Q is preceding the symbol that follows. For concreteness, assume said to be universal for the class P if the normalized relative the alphabet is 0, 1,...,M 1 . With a slight abuse of entropy satisfies notation, we use{s to represent− both} a node in the context 1 tree and a specific context. At every node s, we use a length- lim D(P (xn) Q(xn))=0 (20) n→∞ n || M array (a0,s,a1,s,...,aM−1,s) to count the numbers of different values emitted with context in sequence n. In for every probability measure P in P. A probability assign- s x Fig. 1, the counts are marked near each node , ment Q is said to be universal (without a qualifier) if it is (a0,s,a1,s) s universal for the class of stationary probability measures. and they are simply numbers of zeros and ones emitted from node s. 4
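For illustration, the following minimal Python sketch (ours, not part of the paper; the function and variable names are assumptions) maintains the per-context count arrays (a_{0,s}, ..., a_{M-1,s}) described above for a depth-D context tree, and applies the Krichevsky–Trofimov add-1/2 rule of (23) to turn the counts at a single node into a sequential probability estimate. On the sequence of Fig. 1 it reproduces the count (3, 1) for context 1.

```python
from collections import defaultdict

def context_counts(x, depth, M=2):
    """Maintain the count arrays (a_{0,s}, ..., a_{M-1,s}) for every context s
    of length 0..depth (the empty tuple is the root), counting from x[depth] on."""
    counts = defaultdict(lambda: [0] * M)
    for i in range(depth, len(x)):
        for l in range(depth + 1):
            s = tuple(x[i - l:i])        # the l symbols preceding x[i]
            counts[s][x[i]] += 1
    return counts

def kt_probability(count, symbol):
    """Krichevsky-Trofimov estimate (b_q + 1/2) / (b_0 + ... + b_{M-1} + M/2)."""
    M = len(count)
    return (count[symbol] + 0.5) / (sum(count) + M / 2.0)

# The binary sequence of Fig. 1: (x_{-2}, ..., x_8) = 00011010010, counts start at x_1.
x = [0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
c = context_counts(x, depth=3)
print(c[(1,)])                       # [3, 1]: 3 zeros and 1 one following context "1"
print(kt_probability(c[(1,)], 0))    # estimated probability that the next symbol is 0
```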

; 11 (0 0) 11(1; 0) is the universal probability assignment in the CTW algorithm, which will be denoted as Q(xn) in Section III. We compute (1; 0) 00 11(3; 1) the sequential probability assignments as (1; 0)11 00 Q(xn+1) P λ(xn+1) Q(x xn)= = w . (25) (2; 1) n+1| Q(xn) P λ(xn) (1; 1)00 w root In [29, Ch. 5], Willems and Tjalkens introduced a factor (0; 1)11 (4; 4) 11 (1; 1) βs(xn) at every node s to simplify the calculation of the (1; 0)00 sequential probability assignment, which could also help un- derstand how the weighted probabilities are updated when the (0; 1)11 00 00(1; 3) input sequence xn grows to xn+1. For each node s, we define factor βs(xn) as (0; 1)00 (0; 2) P s(xn) βs(xn) , e . (26) Fig. 1. An illustration of the CTW algorithm when and M−1 is n D = 3 i=0 Pw (x ) (x−2,x−1,x0,x1,...,x8) = 00011010010. The count starts at x1. For example, there are 3 zeros and 1 one with context 1, represented by count Assuming js is a contextQ of xn+1, where 0 j M (3, 1) at the node of context 1 in the upper right. 1. Obviously, any other node ks,k = j cannot≤ be a≤ context− s 6 n of xn+1. We express Pw(Xn+1 = q x ), q = 0, 1,...,M 1 in (33) at the bottom of this page,| which shows that the− n Take any sequence z whose alphabet is 0, 1,...,M 1 . sequential probability assignment Q(x xn) is a weighted n { − } n+1 If z contains b0 zeros, b1 ones, b2 twos, and so on, the summation of the Krichevsky–Trofimov sequential| probability n Krichevsky–Trofimov probability estimate of z [32], i.e., assignments, i.e., P s(X = q xn) at all nodes of the context n e n+1 Pe(z ) = Pe(b0,b1,...,bM−1) can be computed sequen- tree. | tially. We let P (0, 0,..., 0) = 1, and for b 0,b e 0 ≥ 1 ≥ By (23), for any node s, 0,...,bM−1 0, 0 i M 1, we have ≥ ≤ ≤ − s n 1/2 1 P (Xn+1 = q x ) . (27) Pe(b0,b1,...,bi−1,bi +1,bi+1,...,bM−1) e | ≥ n + /2 ≥ 2n + |X | |X | bi +1/2 n , Thus, Q(xn+1 x ) = Ω(1/n), or more precisely, b0 + b1 + ... + bM−1 + M/2 | n 1 Pe(b0,b1,...,bi−1,bi,bi+1,...,bM−1). (23) Q(x x ) . (28) × n+1| ≥ 2n + We denote the Krichevsky–Trofimov probability estimate of |X | n s n The probability assignment Q in the CTW algorithm is both the M-array counts at node s of sequence x as Pe (x ). The s n universal and pointwise universal for the class of stationary weighted probability Pw at node s of sequence x in the CTW irreducible aperiodic finite-alphabet Markov processes. For the algorithm is calculated as proof of universality, see [17]. The pointwise universality is 1 P s(xn)+ 1 M−1 P is(xn) 0 l(s)

P s (X = q, xn) s n w n+1 (30) Pw(Xn+1 = q x )= s n | Pw(x ) 1 s n 1 M−1 is n P (Xn+1 = q, x )+ P (Xn+1 = q, x ) = 2 e 2 i=0 w (31) P s (xn) wQ 1 s n s n 1 js n M−1 is n P (x )P (Xn+1 = q x )+ P (Xn+1 = q x ) P (x ) = 2 e e | 2 w | i=0 w (32) P s (xn) w Q βs(xn) 1 = P s(X = q xn)+ P js(X = q xn), (33) 1+ βs(xn) e n+1 | 1+ βs(xn) w n+1 | 5 where P (y x) is the conditional pmf induced by P (x, y). Take where G is | n Q as a universal probability assignment, either on processes n 1 1 with ( )-valued components, or with -valued compo- i i Gn = Q(xi+1,yi+1 X , Y ) log i . X × Y Y n | Q(yi+1 Y ) nents, as will be clear from the context. i=1 (xi+1,yi+1) | X X (44) n Recall the definition of the directed information from X It is also worthwhile to note that Iˆ4 involves an average of n to Y : xi+1 in the relative entropy term for each i, which makes it n analytically different from Iˆ3. I(Xn Y n)= I(Xi; Y Y i−1)= H(Y n) H(Y n Xn), Here is the big picture of the general ideas behind these → i| − k i=1 estimators. The first estimator Iˆ is calculated through the X 1 (34) difference of two terms, each of which takes the form of (39). we define the four estimators as follows: Since the Shannon–McMillan–Breiman theorem guarantees n n n n n Iˆ1(X Y ) , Hˆ1(Y ) Hˆ1(Y X ), (35) the asymptotic equipartition property (AEP) of entropy rate → − k [24] as well as directed information rate [33], it is natural Iˆ (Xn Y n) , Hˆ (Y n) Hˆ (Y n Xn), (36) 2 2 2 to believe that Iˆ would converge to the directed information → n − k 1 1 rate. This is indeed the case, which is proved in Appendix B. Iˆ (Xn Y n) , D Q(y Xi, Y i−1) Q(y Y i−1) , 3 → n i| k i| The Shannon–McMillan–Breiman type estimators have been i=1 X (37) widely applied in the literature of information-theoretic mea- n sure estimation, for example, relative entropy estimation by 1 Iˆ (Xn Y n) , D Q(x ,y Xi, Y i) Cai, Kulkarni, and Verd´u[15], and erasure entropy estimation 4 → n i+1 i+1| i=1 i i i by Yu and Verd´u[19]. X Q(yi+1 Y )Q(xi+1 X , Y ) , k | | Equation (39) can be rewritten in the Ces´aro mean form, (38) i.e., n 1 1 1 where n n (45) log Q(Y X )= log i−1 i , − n k n Q(Yi Y ,X ) ˆ n n , 1 n n i=1 H1(Y X ) log Q(Y X ), (39) X | k −n k and estimators Iˆ through Iˆ are derived by changing every 1 n 2 4 Hˆ (Y n Xn) , f(Q(x ,y Xi, Y i)), (40) term in the Ces´aro mean to other functionals of probability 2 k n i+1 i+1| i=1 assignments Q. For concreteness, estimator Iˆ2 uses conditional Xn entropy as the functional, and estimators Iˆ3 and Iˆ4 use relative ˆ n , 1 i 1 H2(Y ) Q(yi+1 Y ) log i , entropy. n | Q(yi+1 Y ) i=1 yi+1 ˆ X X | One disadvantage of I1 is that it has a nonzero probability (41) of being very large, since it only averages over logarithms of n n n Hˆ1(Y ) , Hˆ1(Y ). (42) estimated conditional probabilities, while the directed infor- k∅ mation rate that it estimates is always bounded by log . |Y| The estimator Iˆ is the universal directed information esti- Recall that Q(y Xi, Y i−1) denotes the conditional pmf 2 i mator introduced in [21]. Thanks to the use of information- Q(y xi,yi−1) evaluated| for the random sequence (Xi, Y i−1), i theoretic functionals to “smooth” the estimate, the absolute and |Q(Y n Xn) denotes the causally conditional pmf value of Iˆ (Xn Y n) is upper bounded by log on any Q(yn xn) evaluated|| for (Xn, Y n). Thus, an entropy estimate 2 realization, a clear→ advantage over Iˆ . 
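To make the plug-in recipe concrete, the following sketch (ours, not the authors' implementation) assumes that a universal probability assignment such as CTW has already produced, for each i, an estimated pmf q_y[i] of Y_i given Y^{i-1} and an estimated joint pmf q_joint[i] of (X_i, Y_i) given the joint past; it then evaluates Î1 and Î2 along the lines of (35)-(41), ignoring the one-step index shift between (39) and (40)-(41) for readability.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as a 1-D array."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cond_entropy_y_given_x(p_xy):
    """H(Y | X) in bits for a joint pmf given as a 2-D array indexed [x, y]."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    return sum(p_x[x] * entropy(p_xy[x] / p_x[x])
               for x in range(p_xy.shape[0]) if p_x[x] > 0)

def directed_info_estimates(xs, ys, q_y, q_joint):
    """Plug-in values of I1_hat and I2_hat.

    xs, ys     : realized symbol sequences of length n
    q_y[i]     : estimated pmf of Y_i given Y^{i-1}  (1-D array over the Y alphabet)
    q_joint[i] : estimated joint pmf of (X_i, Y_i) given the past  (2-D array [x, y])
    """
    n = len(ys)
    i1 = i2 = 0.0
    for i in range(n):
        row = np.asarray(q_joint[i], dtype=float)[xs[i]]
        q_y_causal = row / row.sum()     # Q(y_i | X^i, Y^{i-1}), formed as described above
        i1 += np.log2(q_y_causal[ys[i]]) - np.log2(q_y[i][ys[i]])
        i2 += entropy(q_y[i]) - cond_entropy_y_given_x(q_joint[i])
    return i1 / n, i2 / n
```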
|Y| such|| as Hˆ (Y n Xn) is a random variable (since it is a 1 1 The common disadvantage of Iˆ and Iˆ is that they are function of (Xn,k Y n)), as opposed to the entropy terms such 1 2 computed by subtraction of two nonnegative quantities. When as H(Y n Xn), which are deterministic and depend on the there is insufficient data, or the stationarity assumption is distributionk of (Xn, Y n). violated, Iˆ1 and Iˆ2 may generate negative outputs, which is clearly undesirable. In order to overcome this, Iˆ and Iˆ are Note that in (37) and (38) the universal probabil- 3 4 introduced, which take the form of a (random) relative entropy ity assignments conditioned on different data are calcu- and are always nonnegative. Section V-D gives an example lated separately. For example, Q(y Y i−1) is not com- i where Iˆ and Iˆ give negative estimates, which might be puted from Q(x ,y Xi−1, Y i−1), but from| running the uni- 1 2 i i caused by the fact that the underlying process (stock market) versal probability assignment| algorithm again on dataset i−1 i i−1 is not stationary, at least in a short term. Y . In the case of Q(Yi X , Y ), which is inherent in the computation of Q(Y|n Xn), the estimate is com- i−1 k i−1 i i−1 IV. PERFORMANCE GUARANTEES puted from pmf Q(xi,yi X , Y ) via Q(Yi X , Y )= i−1 i−1 | i−1 i−| 1 Q(Xi, Yi X , Y )/ Q(Xi,yi X , Y ). In this section, we establish the consistency of the proposed | yi | estimators, mainly in the almost sure and L senses. Under P 1 We can express Iˆ4 in another form which might be enlight- some mild conditions, we derive near-optimal rates of conver- ening: gence in the minimax sense. The proofs of the stated results Iˆ = G Hˆ (Y n Xn), (43) are given in the Appendices. 4 n − 2 k 6
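In the same spirit, the nonnegative estimator Î3 of (37) is just an average of relative entropies between two estimated conditional pmfs of Y_i, computed with and without causal conditioning on X^i; a minimal sketch (ours, with illustrative names) follows.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits for pmfs on the same finite alphabet (q assumed strictly
    positive, as is the case for CTW estimates)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def i3_estimate(q_y_causal, q_y):
    """I3_hat of (37): average of D( Q(y_i | X^i, Y^{i-1}) || Q(y_i | Y^{i-1}) ),
    nonnegative on every sample path by construction."""
    return float(np.mean([kl_divergence(qc, qm) for qc, qm in zip(q_y_causal, q_y)]))
```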

Theorem 1 Let Q be a universal probability assignment and information rate if we take process Y = X, so the minimax (X, Y) be a pair of jointly stationary ergodic finite-alphabet lower bound also applies in the universal entropy estimation processes. Then situation. Actually in the proof of proposition 3, we indeed

n n reduce the general problem to entropy estimation problem to lim Iˆ1(X Y )= I(X Y) in L1. (46) n→∞ → → show the minimax lower bound. Furthermore, if Q is also a pointwise universal probability Proposition 3 Let P(X, Y) be any class of processes that assignment, then the limit in (46) holds almost surely as well. includes the class of i.i.d. processes. Then, there exists a The proof of Theorem 1 is in Appendix B-A. If (X, Y) positive constant C3 such that is a stationary irreducible aperiodic finite-alphabet Markov −1/2 inf sup E Iˆ I(X Y) C3n , (52) process, we can say more about the performance of Iˆ1 using Iˆ P(X,Y) | − → |≥ the probability assignment in the CTW algorithm. where the infimum is over all estimators Iˆ of the directed Proposition 1 Let Q be the CTW probability assignment information rate based on (Xn, Y n). and let (X, Y) be a jointly stationary irreducible aperiodic The proof of Proposition 3 is in Appendix B-E. Evidently, finite-alphabet Markov process whose order is bounded by −1/2 the prescribed tree depth in the CTW algorithm, and let Y convergence rates better than O(n ) is not attainable even be a stationary irreducible aperiodic finite-alphabet Markov with respect to the class of i.i.d. sources and thus, a fortiori, process with the same order as (X, Y). Then there exists a in our setting of a much larger uncertainty set. For the third and fourth estimators, we establish the follow- constant C1 such that ing consistency results using the CTW algorithm. E Iˆ (Xn Y n) I(X Y) C n−1/2 log n, (47) 1 → − → ≤ 1 Theorem 3 Let Q be the probability assignment in the CTW X Y and ǫ> 0, P -a.s. algorithm. If ( , ) is a jointly stationary irreducible ape- ∀ riodic finite-alphabet Markov process whose order does not ˆ n n X Y −1/2 5/2+ǫ I1(X Y ) I( ) = o(n (log n) ). exceed the prescribed tree depth in the CTW algorithm, and → − → (48) Y is also a stationary irreducible aperiodic finite-alphabet X Y The proof of Proposition 1 is in Appendix B-B. Markov process with the same order as ( , ), then n n We can establish similar consistency results for the second lim Iˆ3(X Y )= I(X Y) P -a.s. and in L1. ˆ n→∞ → → estimator I2 in (36). (53) Theorem 2 Let Q be a universal probability assignment, and Theorem 4 Let Q be the probability assignment in the CTW X Y finite-alphabet process ( , ) be jointly stationary ergodic. algorithm. If (X, Y) is a jointly stationary irreducible ape- Then riodic finite-alphabet Markov process whose order does not n n exceed the prescribed tree depth in the CTW algorithm, and lim Iˆ2(X Y )= I(X Y) in L1. (49) n→∞ → → Y is also a stationary irreducible aperiodic finite-alphabet The proof of Theorem 2 is in Appendix B-C. As was the Markov process with the same order as (X, Y), then ˆ X Y case for I1, if the process ( , ) is a jointly stationary irre- n n lim Iˆ4(X Y )= I(X Y) P -a.s. and in L1. ducible aperiodic finite-alphabet Markov process, we can say n→∞ → → (54) more about the performance of Iˆ2 using the CTW algorithm as follows: The proofs of Theorem 3 and Theorem 4 are in Appen- Proposition 2 Let Q be the probability assignment in the dices B-F and B-G. CTW algorithm. 
If (X, Y) is a jointly stationary irreducible Remark 1 The properties of the CTW probability assignment aperiodic finite-alphabet Markov process whose order does not we use in the proofs of Theorem 3 and Theorem 4 are not exceed the prescribed tree depth in the CTW algorithm, and only universality and pointwise universality, but also lower Y is also a stationary irreducible aperiodic finite-alphabet boundedness (recall Section II-C). Markov process with the same order as (X, Y), then Remark 2 Note that the assumption that (X, Y) is a jointly n n lim Iˆ2(X Y )= I(X Y) P -a.s. and in L1, stationary irreducible aperiodic finite-alphabet Markov process n→∞ → → (50) does not imply Y also has these properties. Suppose that X Y and there exists a constant C2 such that is a Markov process of order m, is a hidden Markov process whose internal process is X, then it is obvious that E Iˆ (Xn Y n) I(X Y) C n−1/2(log n)3/2. X Y Y 2 → − → ≤ 2 joint process ( , ) is Markov with order m, but is not a (51) Markov process. In applications, it is sensible to assume that Z The proof of Proposition 2 is in Appendix B-D. a process can be approximated by Markov processes better We also investigate the minimax lower bound of estimating and better as the Markov order increases, i.e., there exists constants C′ > 0, 0 ρ< 1, such that directed information rate, and show the rates of convergence ≤ for the first two estimators are optimal within a logarithmic C′ 0 H(Z Z−1) H(Z) ρk. (55) factor. Note that entropy rate is a special case of directed ≤ 0| −k − ≤ ln(2) 7

It deserves mentioning that the exponentially fast convergence Markov processes that are passed through simple channels in (55) can be satisfied under mild conditions. For example, such as discrete memory channels (DMC), or channels with as shown in Birch [34], let G be a Markov process with intersymbol interference. We compare the performances of the strictly positive transition probabilities, and Zn = ψ(Gn), estimators to each other, as well as the ground truth, which we then (55) holds. For more on this “exponential forgetting” are able to analytically compute. We also extend the proposed property, please refer to Gland and Mevel [35] and Hochwald methods to estimation of directed information with delay, and and Jelenkovi´c[36]. to estimation of mutual information. Further, we show how one The properties established for the proposed estimators are can use the directed information estimator to detect delay ofa summarized in Table I. channel, and to detect the “causal influence” of one sequence on another. Finally, we apply our estimators on real stock TABLE I market data to detect the causal influence that exists between PROPERTIES OF THE PROPOSED ESTIMATORS the Chinese and the US stock markets. Support Rates of convergence Iˆ ( , ) O(n−1/2 log n) 1 −∞ ∞ Iˆ [ log , log ] O(n−1/2(log n)3/2) 2 − |Y| |Y| A. Stationary Hidden Markov Processes Iˆ3 [0, ) – ∞ Let X be a binary symmetric first order Markov process ˆ I4 [0, ) – P ∞ with transition probability p, i.e. (Xn = Xn−1 Xn−1) = p. Let Y be the output of a binary symmetric6 channel| with crossover probability ǫ, corresponding to the input process X, V. ALGORITHMS AND NUMERICAL EXAMPLES as depicted in Fig. 2. In this section, we use the context-tree weighting (CTW) algorithm as the universal probability assignment to describe X n Y n = {Xi}i=1 = {Yi}i the corresponding directed information estimation algorithms BSC(ǫ) =1 and perform experiments on simulated as well as real data. The CTW algorithm [17] has a linear computational complexity in Fig. 2. Section V-A setup: X is a binary first order Markov process with the block length n, and it provides the probability assignment transition probability p, and Y is the output of a binary symmetric channel with crossover probability ǫ corresponding to the input X. Q directly. A brief introduction on how the CTW works can be found in Section II-C. For simplicity and concreteness, we explicitly describe the We use the four algorithms presented to estimate the di- ˆ algorithm for computing I2. The algorithms for the other rected information rate I(Y X) for the case where p =0.3 estimators are identical, except for the update rule, which is and ǫ = 0.2. The depth of→ the context tree is set to be 3. given, respectively, by (35) to (38). The simulation was performed three times. The results are shown in Fig. 3. As the data length grows, the estimated value Algorithm 1 Universal estimator Iˆ2 based on the CTW approaches the true value for all four algorithms. algorithm

Fix block length n and context tree depth D. n n n n Iˆ1(Y → X ) Iˆ2(Y → X ) Iˆ2 0 0.4 0.4 for←i 1, n do ← 0.35 0.35 zi = (xi,yi) ⊲ Make a super symbol with alphabet 0.3 0.3 size 0.25 0.25 end for|X ||Y| 0.2 0.2 0.15 0.15 for i D +1, n +1 do 0.1 0.1 ← i−1 Gather the context zi−D for the ith symbol zi. 0.05 0.05 2 3 4 5 2 3 4 5 Update the context tree for every possible value of zi. 10 10 10 10 10 10 10 10 i−1 n n The estimated pmf Q(zi Z ) is obtained along the way. ˆ n n ˆ n n |i−1 I3(Y → X ) I4(Y → X ) Gather the context y for the ith symbol yi. i−D 0.4 0.4 Update the context tree for every possible value of yi. 0.35 0.35 i−1 The estimated pmf Q(yi Y ) is obtained along the way. 0.3 0.3 | i−1 i−1 Update Iˆ2 as Iˆ2 Iˆ2 + f(Q(xi,yi X , Y )) 0.25 0.25 i−1 ← | − 0.2 0.2 f(Q(yi Y )) where f( ) is defined in (29). end for| · 0.15 0.15 ˆ ˆ 0.1 0.1 I2 I2/(n D) 0.05 0.05 2 3 4 5 2 3 4 5 ← − 10 10 10 10 10 10 10 10 n n We now present the performance of the estimators on synthetic and real data. The synthetic data is generated using Fig. 3. Estimation of I(Y → X): The straight line is the analytical value. 8
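To reproduce the input data of this experiment, the pair (X, Y) of Section V-A can be generated as in the sketch below (ours; the parameter names and the random seed are illustrative assumptions), with p = 0.3 and ε = 0.2 as in Fig. 3.

```python
import numpy as np

def markov_bsc_pair(n, p=0.3, eps=0.2, seed=0):
    """Binary symmetric first-order Markov input X with P(X_i != X_{i-1}) = p,
    observed through a BSC with crossover probability eps to produce Y."""
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    flips = rng.random(n) < p
    for i in range(1, n):
        x[i] = x[i - 1] ^ int(flips[i])
    y = x ^ (rng.random(n) < eps).astype(int)
    return x, y

x, y = markov_bsc_pair(10**5)
```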

The true value can be simply computed analytically as for d D′, I(Y n Xn−d) = H(Xn−d). Therefore, we ≥ d+1 → can use the shifted directed information I(Y n Xn−d) to I(Y n Xn)= H(Xn) H(Xn Y n) (56) d+1 → → − || estimate D′. n ˆ n n−d 6 i−1 i−1 i Fig. 5 depicts I2(Yd+1 X ) where n = 10 for = H(Xi X ) H(Xi X , Y ) (57) → | − | the setting in Fig. 4, where the input is a binary stationary i=1 Xn Markov process of order one and the channel is given by ′ = H(X X ) H(X X , Y ) (58) (61). The delay of the channel, D , is equal to 2. We use i| i−1 − i| i−1 i i=1 I2 to estimate the shifted directed information (all algorithms X n pǫ perform similarly for this case) where the tree depth of the = Hb(p) (pǫ +¯p¯ǫ)Hb CTWb algorithm is set to be 6. The result in Fig. 5 shows that − pǫ +¯p¯ǫ i=1   ′ ˆ n n−d X for d

0 B. Channel Delay Estimation via Shifted Directed Information −4 −2 0 2 4 d Assume a setting similar to that in Section V-A, a stationary process that passes through a channel, but now there exists a ˆ n n−d 6 Fig. 5. The value of I2(Yd+1 → X ) where n = 10 for the setting delay in the entrance of the input to the channel, as depicted ′ ˆ n n−d depicted in Fig. 4 with D = 2. When d < 2, I2(Yd+1 → X ) is very ˆ n n−d in Fig. 4. close to zero and for d ≥ 2, I2(Yd+1 → X ) is significantly larger than zero.
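One possible way to automate the readout of Fig. 5 is sketched below (our illustration): given a routine estimate_di that returns a normalized directed-information estimate such as Î2, the delay is declared as the smallest shift d at which the estimated shifted directed information I(Y^n_{d+1} → X^{n-d}) rises above a small threshold. The threshold-based rule and its default value are our assumptions, not part of the paper.

```python
def estimate_delay(x, y, estimate_di, max_delay=10, threshold=0.01):
    """Return the smallest shift d at which the estimated shifted directed information
    from (Y_{d+1}, ..., Y_n) to (X_1, ..., X_{n-d}) exceeds `threshold`.

    estimate_di(y_part, x_part) is assumed to return a normalized directed-information
    estimate (e.g., I2_hat) from its first argument to its second.
    """
    n = len(x)
    for d in range(max_delay + 1):
        di = estimate_di(y[d:], x[:n - d])
        if di > threshold:
            return d, di
    return None, 0.0
```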

··· X0,X1, ··· ′ A channel ··· Y0,Y1, ··· D units delay with memory C. Causal Influence Measurement Fig. 4. Using the shifted directed information estimation to find the delay There is an extensive literature on detecting and measuring D′. causal influence. See, for example, [37] for a recent survey of some of the common tools and approaches in biomedical ′ Our goal is to find the delay D . We use the shifted informatics. One particularly celebrated tool, in both the life directed information I(Y n Xn−d) to estimate D′, where d+1 → sciences and economics, for assessing whether and to what I(Y n Xn−d) is defined as d+1 → extent one time series influences another is the Granger n−d causality test [9]. The idea is to model Y first as a univariate n n−d i−1 i−1 d+i I(Y X ) , H(X X ) H(X X , Y ). autoregressive time series with error correction term Vi d+1 → i| − i| d+1 i=1 p X (60) Y = a Y + V , (62) To illustrate the idea, suppose X is a binary stationary i j i−j i j=1 process, and we define the binary process Y as follows X and then model it again using X as causal side information: ′ ′ Yi = Xi−D + Xi−D −1 + Wi, (61) p ˜ where Wi Bernouli(ǫ) and addition in (61) is modulo 2. Yi = [bj Yi−j + cjXi+1−j ]+ Vi (63) ∼ ′ j=1 The goal is to find the delay D from the observations of X the processes Y and X. Note that the mutual information with V˜i as the new error correction term. The Granger causality rate lim 1 I(Y n; Xn) is not influenced by D′. However, the n→∞ n is defined as 1 n n−d var(Vi) shifted directed information rate lim n−d I(Yd+1 X ) GX Y , log , (64) n→∞ → → ˜ is highly influenced by D′. Assuming that there is no feedback, var(Vi) ′ i+d i−1 for d 0. For instance, in the to verify that when the process pair is jointly Gauss–Markov channel≥ example (61),→ if W = 0 with probability 1 then with evolution that obeys both (62) and (63) with p = , the i ∞ 9

Granger causality coincides with the directed information rate information that the forward link exists if and only if I(Xn (up to a multiplicative constant) [23]. Y n) > 0 and the backward link exists if and only→ if In this section, we implement our universal estimators of I(Y n−1 Xn) > 0. More generally, the directed information directed information to infer causal influences in more general I(Xn →Y n) quantifies how much X influences Y, while the scenarios, where the Gauss–Markov modeling assumption directed→ information in the reverse direction I(Y n−1 Xn) inherent in Granger causality fails to adequately capture the quantifies how much Y influences X. The mutual information,→ nature of the data. which is the sum of those two directed , (see (8)), One philosophical basis for causal analysis is that when quantifies the mutual influence of the two sequences. There- we measure causal influence between two processes, X and fore, using the directed information measures, it is natural to Y, there is an underlying assumption that Xi happens earlier adopt terminology as follows: n n n−1 n than Yi for every (Xi, Yi). Under this assumption, we say two • Case A: I(X Y ) I(Y X ), we say that jointly distributed processes X and Y induce a forward chan- X causes Y. → ≫ → i i−1 i−1 i−1 n n n−1 n nel P (yi x ,y ) and a backward channel P (xi x ,y ), • Case B: , we say that | | I(X Y ) I(Y X ) as depicted in Fig. 6, where X is the input process. In this Y causes X. → ≪ → n n n−1 n section we present the use of directed information, reverse • Case C: I(X Y ) I(Y X ) 0, we say directed information, and mutual information to measure the that the processes→ are mutually≃ causing→ each≫ other. n n causal influence between two processes. • Case D: I(X ; Y )=0, we say that the processes are independent of each other. forward channel To illustrate this idea, consider processes X and Y gener- ated by the system that is depicted in Fig. 7, where the forward Xi P (y xi,yi−1) Yi i| channel is a BSC(α) and the backward channel is a BSC(β) 1 1 where 0 α 2 and 0 β 2 . Intuitively, if α is much ≤ ≤ ≤ X≤ Y backward channel less than β, then the process is influencing , and if α is much larger than β, the process Y is influencing X. If α and β Yi−1 i−1 i−1 Delay have similar values then the processes mutually influence each P (xi x ,y ) 1 | other, and finally if they are both equal to 2 , then the processes are independent of each other. Note that the information- i i−1 Fig. 6. Modeling any two processes using forward channel P (yi|x ,y ) i−1 i−1 theoretic measures can be analytically calculated as in (65)- and backward channel P (xi|x ,y ). (67), and indeed if I(Xn Y n) > I(Y n−1 Xn), then α<β and vice versa. Hence→ the intuition regarding→ which process influences the other is consistent with cases A through Xi BSC(α) Yi D presented above. 1 I(Xn Y n)= H (αβ + αβ) H (α) (65) n → b − b where the terms α and β denote 1 α and 1 β respectively. Yi−1 − − BSC(β) Delay Similarly, we have 1 I(Y n−1 Xn)= H (αβ + αβ) H (β) (66) n → b − b Fig. 7. Simulation of a sequence of random variables {Xi,Yi}i ≥ 1 according to the relation shown in the scheme. Namely, Yi is the output and of a binary symmetric channel with parameter α and input Xi and Xi is the 1 output of a binary symmetric channel with parameter β and input Yi−1. The n n 1 I(Y ; X )=2Hb(αβ + αβ) Hb(β) Hb(α). (67) initial random variable X1 is assumed to be distributed Bernoulli( 2 ). 
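For reference, the closed-form values (65)-(67) for the system of Fig. 7 can be evaluated directly; the short sketch below (ours) does so for the case α = 0.1, β = 0.2 that is plotted in Fig. 8.

```python
import math

def hb(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def analytic_rates(alpha, beta):
    """Normalized values of (65)-(67) for the forward BSC(alpha) / backward BSC(beta) system."""
    mix = alpha * (1 - beta) + (1 - alpha) * beta    # crossover of the cascaded channels
    di_forward = hb(mix) - hb(alpha)                 # (1/n) I(X^n -> Y^n)
    di_reverse = hb(mix) - hb(beta)                  # (1/n) I(Y^{n-1} -> X^n)
    mi = 2 * hb(mix) - hb(alpha) - hb(beta)          # (1/n) I(X^n; Y^n)
    return di_forward, di_reverse, mi

print(analytic_rates(0.1, 0.2))   # the case plotted in Fig. 8
```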
n − − Since the normalized reverse directed information is nothing Definition 3 (Existence of a channel) We say that the for- but the normalized directed information between another pair i i−1 i−1 of processes, where one is shifted, the estimators Iˆ1 to Iˆ4 ward channel does not exist if P (yi x ,y ) = P (yi y ) for i 1 and similarly the backward| channel does not| exist can be easily adapted to this situation, and the convergence ≥ i−1 i−1 i−1 theorems (Theorem 1 through Theorem 4) apply also (with the if P (xi x ,y )= P (xi x ) for i 1. | | ≥ appropriate translations) to the reverse directed information. We interpret the existence of the forward link as that the Finally, the normalized mutual information can be estimated Y X sequence is “influenced” or “caused” by the process . once we have the normalized directed information and the Similarly, the existence of the backward link is interpreted as normalized reverse directed information simply by summing X Y that is “influenced” or “caused” by the sequence . We them. would like to answer the following two questions: Fig. 8 depicts the estimated and analytical information- 1) Does the forward channel exist? 1 n n 1 n−1 n theoretic measures n I(X Y ), n I(Y X ), and 2) Does the backward channel exist? 1 n n → → n I(X ; Y ) for the case α =0.1 and β =0.2. One can note Directed information can naturally answer these questions. that with just a few hundreds of samples, directed information It is straightforward to note from the definition of directed and reverse directed information start strongly indicating that 10

α < β, in other words, X influences Y more than the other way around.

Fig. 8. The information-theoretic measures (1/n) I(X^n → Y^n), (1/n) I(Y^{n-1} → X^n), and (1/n) I(X^n; Y^n) evaluated using the four algorithms. The data was generated according to the setting in Fig. 7 where α = 0.1 and β = 0.2. The straight black line is the analytical value given by (65)-(67) and the blue lines are the estimated values.

D. Causal Influence in Stock Markets

Here we use the historic data of the Hang Seng Index (HSI) and the Dow Jones Industrial Average (DJIA) between 1990 and 2011 to compute the directed information rate between these two indexes. The data of those two indexes are presented in Fig. 9 on a daily time scale.

Fig. 9. The Hang Seng Index (HSI) and the Dow Jones Industrial Average (DJIA) between 1990 and 2011. The goal is to determine which index is causally influencing the other.

There is no time overlap between the stock market in Hong Kong and that in New York; that is, when the stock market in Hong Kong is open, the stock market in New York is closed, and vice versa. Therefore the causal influence between the markets is well defined. Since the value of the stock market is continuous, we discretize it into three values: -1, 0, and 1. Value -1 means that the stock market went down in one day by more than 0.8%, value 1 means that the stock market went up in one day by more than 0.8%, and value 0 means that the absolute change is less than 0.8%.

We denote by Xi and Yi the (quantized ternary valued) change in the HSI and the DJIA in day i, respectively, and estimate the normalized mutual information (1/n) I(X^n; Y^n), the normalized directed information (1/n) I(X^n → Y^n), and the normalized reverse directed information (1/n) I(Y^{n-1} → X^n), using all four algorithms. Fig. 10 plots our estimates of these information-theoretic measures.

Fig. 10. Estimates of the information-theoretic measures between the HSI, denoted by X, and the DJIA, denoted by Y. It is clear that the reverse directed information is much higher than the directed information; hence it is the DJIA that causally influences the HSI rather than the other way around.

Evidently, the reverse directed information is much higher than the directed information; hence there is a significant causal influence by the DJIA on the HSI, and a low influence in the reverse direction. In other words, between 1990 and 2011, it was the Chinese market that was influenced by the US market rather than the other way around.

It is also worth noting that estimators Î1 and Î2 do generate negative outputs, as shown in Fig. 10. This may be caused by various reasons, such as data insufficiency and non-stationarity of the process (X, Y). In such cases of insufficient data, we would prefer estimators Î3 and Î4, since they are always nonnegative, which can be sensibly interpreted in practice.

VI. CONCLUDING REMARKS

We have presented four approaches to estimating the directed information rate between a pair of jointly stationary ergodic finite-alphabet processes. Weak and strong consistency results have been established for all four estimators, in precise senses of varying strengths. For two of these estimators we established convergence rates that are optimal to within logarithmic factors. The other two have their own merits, such as nonnegativity on every sample path. Experiments on simulated and real data substantiate the potential of the proposed approaches in practice and the efficacy of directed information estimation as a tool for detecting and quantifying causality and delay.

VII. ACKNOWLEDGMENTS where f is defined in (29). The authors would like to thank Todd Coleman for helpful Lemma 5 Let X be a stationary irreducible aperiodic finite- discussions on the merits of nonnegative directed information alphabet Markov process. For fixed i 1, let random variable i ≥ i estimators during Haim Permuter’s visit at UCSD. They would Vi(Xi−m) be a deterministic function of random vector Xi−m, like to thank the associate editor and anonymous reviewers for where m is the Markov order. Let Vi be uniformly bounded their very helpful suggestions that significantly improved the by a constant V for any i, and EV = 0, i 1. Then there i ∀ ≥ presentation of our results. Jiantao Jiao would like to thank exists a constant C4 such that Hyeji Kim for very helpful discussions in the revision stage n 2 1 of the paper. E V C V 2n−1. (74) n i ≤ 4 i=1 ! X APPENDIX A Lemma 6 (Breiman’s generalized ergodic theorem [38]) SOME KEY LEMMAS Let X be a stationary ergodic process. If lim gk(X) = k→∞ Here is the roadmap of the Appendices. In Appendix A g(X) P -a.s. and E[sup gk ] < , then we list some key lemmas without proofs, and in Appendix B k | | ∞ we prove the main theorems and propositions in Section IV. n 1 k X E X Appendix C provides the proofs of the lemmas in Appendix A. lim gk T ( ) = g( ) P -a.s. (75) n→∞ n k=1 The first lemma is on the asymptotic equipartition property X  (AEP) for causally conditional entropy rate. It was proved in where T ( ) is the shift operator which increases the index by [33] that the AEP for causally conditional entropy rate holds 1, and T k· increases the index by k. in the almost sure sense. Here we prove that it holds in the Here we paraphrase a result from [30] on the redundancy L sense as well. We also show convergence rates for jointly 1 bounds of the CTW probability assignment. stationary irreducible aperiodic Markov processes. Lemma 7 ([30]) Let Q be the CTW probability assignment Lemma 1 Let X Y be a jointly stationary ergodic finite- ( , ) and let X be a stationary finite-alphabet Markov process alphabet process. Then, whose order is bounded by the prescribed tree depth of the 1 n n CTW algorithm. Then there exist constants C5, C6 such that lim log P (Y X )= H(Y X) P -a.s. and in L1. n→∞ −n k k the pointwise redundancy is bounded as (68) In addition, if (X, Y) is irreducible aperiodic Markov, then 1 1 max log log C5 log n + C6 (76) xn Q(xn) − P (xn) ≤ 1   E log P (Y n Xn) H(Y X) = O(n−1/2 log n) (69) −n k − k where C5 > 0, C6 depend on nothing but the parameters X specifying the process . In particular, taking expectation over and for every ǫ> 0, the inequality with respect to P , the redundancy is bounded 1 as log P (Y n Xn) H(Y X) − n k − k D(P (xn) Q(xn)) C log n + C . (77) k ≤ 5 6 = o(n−1/2(log n)5/2+ǫ) P -a.s. (70) Remark 3 The constants C5, C6 can be specified once the The next lemma shows that the conditional probability in- parameters of process X are given. For example, see [30], duced by the CTW algorithm converges to the true probability where of a Markov process if the CTW depth is sufficiently large. (γ 1) C5 = − |S| , (78) Lemma 2 Let Q be the CTW probability assignment and let 2 X be a stationary irreducible aperiodic finite-alphabet Markov (γ 1) 1 γ 1 C6 = − |S| log + + log γ . process whose order is bounded by the prescribed tree depth 2 |S| γ 1 − γ 1 |S|  −  − of the CTW algorithm. Then, (79) n n Here γ is the size of alphabet, in this case γ = . is the lim Q(xn+1 X ) P (xn+1 X )=0 P -a.s. 
(71) n→∞ | − | number of states in the Markov process, given|X Markov | |S| order Lemma 3 ([21, Lemma 1]) For any ǫ> 0, there exists K > m, m. ǫ |S| ≤ |X | 0 such that for all P and Q in ( , ): M X Y APPENDIX B f(P ) f(Q) ǫ + K P Q , (72) | − |≤ ǫk − k1 PROOFS OF THEOREMS AND PROPOSITIONS where 1 is the l1 norm (viewing P and Q as - For brevity, in the sequel we denote Hˆ (Y n Xn) by Hˆ , k·k |X ||Y| 1 1 dimensional simplex vectors), and f is defined in (29). Hˆ (Y n Xn) by Hˆ , Iˆ (Xn Y n) by Iˆ ,i =1k, 2, 3, 4. 2 k 2 i → i Lemma 4 Let P,Q be two probability mass functions in ( , ), denote θ = P Q . If θ< 1/2, we have A. Proof of Theorem 1 M X Y k − k1 Briefly speaking, we need to show estimator Iˆ1 converges f(P ) f(Q) 2θ log |X ||Y| , (73) to the corresponding directed information rate I(X Y) | − |≤ θ → 12

i i−1 for any jointly stationary ergodic process (X, Y). Since Iˆ1 is Q(yi x ,y ) . We bound Cn2 from (112) to (120), where n n n | } defined in (35) as Hˆ1(Y ) Hˆ1(Y X ), if we can show − k n n the corresponding convergence properties of Hˆ1(Y X ), k • (113) follows by the log-sum inequality [24, Theorem then we have the desired convergence properties of Iˆ1 since n n 2.7.1], Hˆ1(Y )= Hˆ1(Y ). k∅ • (114) follows since x> 1, log(1 + x) x/ ln(2), Given Q is a universal probability assignment, first we show ∀ − ≤ ˆ • (115) follows since x x, I1 converges in L1. Then we show given Q is a pointwise | |≥ • (116) follows by Scheff´e’s theorem [42, Lemma 2.1], universal probability assignment, Iˆ1 also converges almost surely. • (117) follows by Pinsker’s inequality [42, Lemma 2.5], • (118) follows by the concavity of √ , 1) L convergence: We decompose · 1 • (119) follows by data-processing inequality [24, Theorem

Hˆ1 H(Y X)= Cn + Dn, (80) 2.8.1], − k • (120) follows by the chain rule for relative entropy, the where 1 concavity of √ , and data-processing inequality. C = Hˆ + log P (Y n Xn) (81) · n 1 n k 1 D = log P (Y n Xn) H(Y X). (82) Combining (89) and (120), we have n −n k − k E 1 n n n n According to Lemma 1 shown in Appendix A, we know D Cn D(P (x ,y ) Q(x ,y )) n | |≤ n k converges to zero in L1. Now we deal with Cn. Pinsker [39] 2 proved the existence of a universal constant Γ > 0 such that + D(P (xn+1,yn+1) Q(xn+1,yn+1))/n, ln(2) k dP s D(P Q) E log( ) D(P Q)+Γ D(P Q), p (90) k ≤ P dQ ≤ k k   p (83) by definition of universal probability assignment, we show Cn

Barron [40] simplified Pinsker’s argument and proved that the converges to zero in L1. Since constant Γ= √2 is best possible when natural logarithms are E Hˆ H(Y X) E C + E D 0 n , (91) used in the definition of D(P Q). Here we follow Barron’s | 1 − k |≤ | n| | n| → → ∞ E k arguments to bound Cn with Cn defined in (81). we know Iˆ converges to I(X Y) in L . | | 1 → 1 Denote the set (xn,yn): P (yn xn) Q(yn xn) as , { k ≤ k } Bn we have 2) Almost sure convergence: Consider the probability of n n the following event E n n 1 P (y x ) Cn = P (x ,y ) log nk n | | n n n n Q(y x ) 1 (x ,y )∈(X ×Y) \Bn n n ˆ n n X k n,ǫ = (x ,y ): H1 log P (y x ) ǫ , (92) 1 Q(yn xn) A { ≤−n k − } + P (xn,yn) log k (84) n P (yn xn) we have (xn,yn)∈B X n k n n n n P( )= P (x ,y ) (93) 1 P (Y X ) An,ǫ = E log k (xn,yn)∈A n Q(Y n Xn) X n,ǫ   n n n n−1 k n n = P (y x )P (x y ) (94) n n 1 Q(y x ) +2 P (x ,y ) log k (85) n n k k n n (x ,y )∈An,ǫ n n n P (y x ) X (x ,y )∈Bn k X Q(yn xn)2−nǫP (xn yn−1) (95) n n ≤ n n k k , E 1 P (Y kX ) , (x ,y )∈An,ǫ Define Cn1 n log Q(Y nkXn) , Cn2 X n n −nǫ n n n n−1 n n 1 Q(y kx ) =2 Q(y x )P (x y ) (96) n n P (x ,y ) logh n n , we boundi (x ,y )∈Bn n P (y kx ) n n k k (x ,yX)∈An,ǫ n i i−1 −nǫ P 1 P (Yi X , Y ) 2 , (97) C = E log | (86) n1 n Q(Y Xi, Y i−1) ≤ i=1 i X  |  where the first inequality is because of the definition of n i i−1 1 P (Yi X , Y ) i−1 i−1 even n,ǫ, and the last step follows from the fact that for = E E log | X , Y A n n i i−1 any two conditional distributions of the form Q(y x ) and n " " Q(Yi X , Y ) ## i=1 | P (xn yn−1), we have Q(yn xn)P (xn yn−1) = Q˜k(xn,yn) X (87) wherekQ˜ is a joint distribution.k As k 1 n P (Y ,X Xi−1, Y i−1) E E log i i| Xi−1, Y i−1 ∞ i−1 i−1 P ≤ n " " Q(Yi,Xi X , Y ) ## ( n,ǫ) < , (98) i=1 | A ∞ X (88) n=1 X 1 by the Borel-Cantelli Lemma, we have = D(P (xn,yn) Q(xn,yn)). (89) n k ˆ 1 n n , i i−1 i i−1 lim inf H1 log P (Y X ) 0. P -a.s. (99) Then, i, define i i(x ,y ) = yi : P (yi x ,y ) n→∞ − −n k ≥ ∀ C C { | ≤   13

In order to get an inequality with the reverse direction, write which implies the convergence of Iˆ1 to I(X Y) also holds Hˆ ( 1 log P (Y n Xn)) explicitly as almost surely. → 1 − − n k 1 Hˆ + log P (Y n Xn) 1 n k 1 P (Y n Xn) B. Proof of Proposition 1 = log k (100) n Q(Y n Xn) For similar reasons as shown in the proof of Theorem 1, nk n n n−1 1 P (Y ,X ) 1 P (X Y ) here it suffices to show the convergence properties of Hˆ1. For = log log k , (101) n Q(Y n,Xn) − n Q(Xn Y n−1) convenience, we restate some arguments shown in the proof k of Theorem 1. We decompose Hˆ H(Y X) as by the definition of pointwise universality (2), we know 1 − k ˆ Y X 1 P (Y n,Xn) H1 H( )= Cn + Dn, (107) -a.s. (102) − k lim sup log n n 0, P n→∞ n Q(Y ,X ) ≤ where 1 n n with a similar argument used for showing (99), we show Cn = Hˆ1 + log P (Y X ) (108) n k 1 P (Xn Y n−1) -a.s. (103) 1 n n Y X lim sup log nk n−1 0, P Dn = log P (Y X ) H( ), (109) n→∞ −n Q(X Y ) ≤ −n k − k k then we have and we restate (90)

1 n n 1 n n n n lim sup Hˆ log P (Y X ) 0. P -a.s. (104) E Cn D(P (x ,y ) Q(x ,y )) 1 − −n k ≤ | |≤ n k n→∞   2 Combining (104) with (99), + D(P (xn+1,yn+1) Q(xn+1,yn+1))/n. sln(2) k 1 n n lim Hˆ1 log P (Y X ) =0. P -a.s. (105) p (110) n→∞ − −n k   By Lemma 1 shown in Appendix A, 1) L1 convergence rates: We apply Lemma 7 in Appendix A. Plugging (77) of Lemma 7 in (110), we have 1 n n Y X lim log P (Y X ) H( )=0, P -a.s. (106) E 1/2 −1/2 n→∞ −n k − k Cn = O((log n) n ). (111) | | Combining (111) with the L1 convergence rates of Dn

n 1 Q(y xi,yi−1) C P (xi,yi−1) P (y xi,yi−1) log i| (112) n2 ≤ n i| P (y xi,yi−1) i=1 i i−1 i X (x X,y ) yXi∈Ci | 1 n Q(Y xi,yi−1) P (xi,yi−1)P (Y xi,yi−1) log i ∈Ci| (113) ≤ n i ∈Ci| P (Y xi,yi−1) i=1 i i−1 i i X (x X,y ) ∈C | n 1 1 P (xi,yi−1) (Q(Y xi,yi−1) P (Y xi,yi−1)) (114) ≤ n ln(2) i ∈Ci| − i ∈Ci| i=1 i i−1 X (x X,y ) n 1 1 P (xi,yi−1) Q(Y xi,yi−1) P (Y xi,yi−1) (115) ≤ n ln(2)| i ∈Ci| − i ∈Ci| | i=1 i i−1 X (x X,y ) n 1 i i−1 1 i i−1 i i−1 P (x ,y ) P (yi x ,y ) Q(yi x ,y ) (116) ≤ 2n − ln(2) | | − | | i=1 i i 1 yi X (x X,y ) X 1 n 2 P (xi,yi−1) D(P (y xi,yi−1) Q(y xi,yi−1) (117) ≤ 2n ln(2) i| k i| i=1 i i−1 s X (x X,y ) n 1 2 ED(P (y Xi, Y i−1) Q(y Xi, Y i−1) (118) ≤ 2n ln(2) i| k i| i=1 s X p 1 n 2 ED(P (y , x Xi, Y i−1) Q(y , x Xi, Y i−1) (119) ≤ 2n ln(2) i i+1| k i i+1| i=1 s X p 1 D(P (xn+1,yn+1) Q(xn+1,yn+1)/n, (120) ≤ s2 ln(2) k p 14

Combining (111) with the $L_1$ convergence rates of $D_n$ shown in Lemma 1 in Appendix A, we have
$$\mathbb{E}\big|\hat H_1 - H(\mathbf{Y}\|\mathbf{X})\big| \le \mathbb{E}|C_n| + \mathbb{E}|D_n| \quad (121)$$
$$= O(n^{-1/2}\log n), \quad (122)$$
and thus the convergence rates in Proposition 1 hold:
$$\mathbb{E}\big|\hat I_1(X^n\to Y^n) - I(\mathbf{X}\to\mathbf{Y})\big| = O(n^{-1/2}\log n). \quad (123)$$

2) Almost sure convergence rates: We first look at the almost sure convergence rates of $C_n$ in (108). We know the probability of the event $\mathcal{A}_{n,\epsilon}$ defined in (92) is bounded as
$$P(\mathcal{A}_{n,\epsilon}) \le 2^{-n\epsilon}. \quad (124)$$
For any fixed $\delta' > \delta > 0$, taking $\epsilon = n^{-1+\delta}$ in (92), we see that $\mathcal{A}_{n,\epsilon}$ is equal to the set
$$\Big\{(x^n,y^n)\colon n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(y^n\|x^n)\Big) \le -n^{\delta-\delta'}\Big\}. \quad (125)$$
Note that
$$\sum_{n=1}^\infty P(\mathcal{A}_{n,\epsilon}) \le \sum_{n=1}^\infty 2^{-n^{\delta}} < \infty. \quad (126)$$
By the Borel--Cantelli lemma, since $n^{\delta-\delta'}$ goes to zero as $n\to\infty$, we have proved that
$$\liminf_{n\to\infty} n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) \ge 0 \quad P\text{-a.s.} \quad (127)$$
In order to get an inequality in the reverse direction, dividing (101) by $n^{-1+\delta'}$, we have
$$n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) \quad (128)$$
$$= n^{1-\delta'}\,\frac{1}{n}\log\frac{P(Y^n,X^n)}{Q(Y^n,X^n)} - n^{1-\delta'}\,\frac{1}{n}\log\frac{P(X^n\|Y^{n-1})}{Q(X^n\|Y^{n-1})}. \quad (129)$$
By the pointwise redundancy of the CTW algorithm restated in Lemma 7 in Appendix A, we know
$$\limsup_{n\to\infty}\frac{1}{\log n}\log\frac{P(Y^n,X^n)}{Q(Y^n,X^n)} \le 1 \quad P\text{-a.s.,} \quad (130)$$
and therefore
$$\limsup_{n\to\infty} n^{1-\delta'}\,\frac{1}{n}\log\frac{P(Y^n,X^n)}{Q(Y^n,X^n)} \le 0 \quad P\text{-a.s.} \quad (131)$$
For the second term on the right-hand side of (129), following a similar argument to the one used to show (127), we know
$$\limsup_{n\to\infty}\,-n^{1-\delta'}\,\frac{1}{n}\log\frac{P(X^n\|Y^{n-1})}{Q(X^n\|Y^{n-1})} \le 0 \quad P\text{-a.s.} \quad (132)$$
From (131) and (132), we obtain

$$\limsup_{n\to\infty} n^{1-\delta'}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) \le 0 \quad P\text{-a.s.} \quad (133)$$
Combining (127) and (133), we know that for every $\delta' > 0$,
$$\lim_{n\to\infty}\Big(\hat H_1 + \tfrac{1}{n}\log P(Y^n\|X^n)\Big) = o(n^{-1+\delta'}) \quad P\text{-a.s.} \quad (134)$$
Putting (134) and the almost sure convergence rates of $D_n$ shown in Lemma 1 in Appendix A together, we know that for every $\epsilon > 0$,
$$\hat I_1(X^n\to Y^n) - I(\mathbf{X}\to\mathbf{Y}) = o\big(n^{-1/2}(\log n)^{5/2+\epsilon}\big) \quad P\text{-a.s.}$$

C. Proof of Theorem 2

It suffices to show the convergence properties of $\hat H_2$. We decompose
$$\hat H_2(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X}) = A_n + B_n, \quad (135)$$
where
$$A_n = \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big) - H(\mathbf{Y}\|\mathbf{X}), \quad (136)$$
$$B_n = \hat H_2(Y^n\|X^n) - \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big). \quad (137)$$
Define $g_k(\mathbf{X},\mathbf{Y}) \triangleq f\big(P(x_1,y_1|X_{-k}^0,Y_{-k}^0)\big)$ for a jointly stationary and ergodic process $(\mathbf{X},\mathbf{Y})$. Note that, by martingale convergence [41], $g_k(\mathbf{X},\mathbf{Y}) \to g(\mathbf{X},\mathbf{Y})$ $P$-a.s., where $g(\mathbf{X},\mathbf{Y}) = f\big(P(x_1,y_1|X_{-\infty}^0,Y_{-\infty}^0)\big)$. Noting further that $\mathbb{E}\,g(\mathbf{X},\mathbf{Y}) = H(\mathbf{Y}\|\mathbf{X})$ and that the $g_k$ are bounded for all $k$, we can apply Lemma 6 in Appendix A and get the following result:
$$\lim_{n\to\infty} A_n = 0 \quad P\text{-a.s. and in } L_1. \quad (138)$$
Then we deal with $B_n$ defined in (137) in (155)--(162) below, where, fixing an arbitrary $\epsilon > 0$,
• (157) follows by Lemma 3 in Appendix A,
• (158) follows by Pinsker's inequality,
• (159) and (161) follow by the concavity of $\sqrt{\cdot}$,
• (162) follows by the chain rule for relative entropy.
We continue to bound
$$\lim_{n\to\infty}\mathbb{E}\big|\hat H_2(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X})\big| \quad (139)$$
$$\le \lim_{n\to\infty}\mathbb{E}|A_n| + \lim_{n\to\infty}\mathbb{E}|B_n| \quad (140)$$
$$= \lim_{n\to\infty}\mathbb{E}|B_n| \quad (141)$$
$$\le \epsilon + \lim_{n\to\infty} K_\epsilon\sqrt{\frac{2\ln 2}{n}\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)} \quad (142)$$
$$= \epsilon, \quad (143)$$
where (141) follows by (138), (142) follows by (162), and (143) follows by Definition 1. Now we can use the arbitrariness of $\epsilon$ to complete the proof.

D. Proof of Proposition 2

It suffices to show the convergence properties of $\hat H_2$.

1) Almost sure convergence: For the stationary ergodic process $(\mathbf{X},\mathbf{Y})$, let
$$g_k(\mathbf{X},\mathbf{Y}) = f\big(Q(x_0,y_0|X_{-k}^{-1},Y_{-k}^{-1})\big), \quad (144)$$
$$g(\mathbf{X},\mathbf{Y}) = f\big(P(x_0,y_0|X_{-\infty}^{-1},Y_{-\infty}^{-1})\big). \quad (145)$$
By Lemma 2 in Appendix A,
$$\lim_{k\to\infty}\big(g_k(\mathbf{X},\mathbf{Y}) - g(\mathbf{X},\mathbf{Y})\big) = 0 \quad P\text{-a.s.} \quad (146)$$

Since $\mathbb{E}\big[\sup_k |g_k|\big] \le \log|\mathcal{Y}|$, by Lemma 6 in Appendix A,
$$\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^n g_k\big(T^k(\mathbf{X},\mathbf{Y})\big) = \lim_{n\to\infty}\hat H_2 = H(\mathbf{Y}\|\mathbf{X}), \quad (147)$$
which justifies the almost sure convergence of $\hat H_2$.

2) $L_1$ convergence rates: For convenience, we restate the definitions of $A_n$ and $B_n$ as follows:
$$A_n = \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big) - H(\mathbf{Y}\|\mathbf{X}), \quad (148)$$
$$B_n = \hat H_2(Y^n\|X^n) - \frac{1}{n}\sum_{k=1}^n f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big). \quad (149)$$
Letting $V_k$ be $f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big) - H(\mathbf{Y}\|\mathbf{X})$, $V$ be $\log|\mathcal{Y}|$, and applying Lemma 5 in Appendix A, we know
$$\mathbb{E}|A_n| \le \sqrt{\mathbb{E}A_n^2} = O(n^{-1/2}). \quad (150)$$
Then we bound $\mathbb{E}|B_n|$ in (178)--(183) below, where
• (180) is an application of Lemma 2 and Lemma 4 in Appendix A. Indeed, Lemma 2 guarantees that when $n\to\infty$, the $\ell_1$ norm of the difference of $P(x_{k+1},y_{k+1}|X^k,Y^k)$ and $Q(x_{k+1},y_{k+1}|X^k,Y^k)$ will be small enough so that Lemma 4 can be applied;
• (181) follows by Pinsker's inequality and the fact that the function $t\log(1/t)$ is increasing for small $t$;
• (182) and (183) are by the concavity of $\sqrt{t}\log(1/\sqrt{t})$ and the chain rule for relative entropy.
Because of the monotonicity of $\sqrt{t}\log(1/\sqrt{t})$ when $t\approx 0$, we can plug the redundancy bound of the CTW algorithm restated in Lemma 7 in Appendix A, i.e., (77), into (183), and then have
$$\mathbb{E}|B_n| = O\big(n^{-1/2}(\log n)^{3/2}\big). \quad (151)$$
Combining (151) with (150), we have proved Proposition 2.

E. Proof of Proposition 3

We rephrase a general lemma showing minimax lower bounds.

Lemma 8 ([42, Theorem 2.2, Page 90]): Let $\mathcal{F}$ be a class of models, and suppose we have observations $Z$ distributed according to $P_f$, $f\in\mathcal{F}$. Let $d(\hat f, f)$ be the performance measure of the estimator $\hat f(Z)$ relative to the true model $f$. Assume also that $d(\cdot,\cdot)$ is a semi-distance, i.e., it satisfies
1) $d(f,g) = d(g,f) \ge 0$,
2) $d(f,f) = 0$,
3) $d(f,g) \le d(h,f) + d(h,g)$.
Let $f_0, f_1 \in \mathcal{F}$ satisfy $d(f_0,f_1) \ge 2s > 0$, where $s$ is fixed. Then
$$\inf_{\hat f}\sup_{f\in\mathcal{F}} P_f\big(d(\hat f,f)\ge s\big) \ge \inf_{\hat f}\max_{j\in\{0,1\}} P_{f_j}\big(d(\hat f,f_j)\ge s\big) \quad (152)$$
$$\ge \frac{1}{4}\exp\big(-D(P_{f_1}\|P_{f_0})\big). \quad (153)$$
In this proof, $\mathcal{F}$ in Lemma 8 is taken to be $\mathcal{P}(\mathbf{X},\mathbf{Y})$. Denote the binary entropy by $H_b(p) = -p\log p - (1-p)\log(1-p)$ and the class of i.i.d. processes by $\mathcal{M}_0$. Since
$$H_b'(p) = \log\frac{1-p}{p}, \quad (154)$$
and $H_b'(p)$ is decreasing on the interval $[2/8, 3/8]$, we have the following lemma.

The bound on $\mathbb{E}|B_n|$ used in the proof of Theorem 2 is derived as follows:
$$\mathbb{E}|B_n| = \mathbb{E}\left|\frac{1}{n}\sum_{k=1}^n\Big(f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big)\right| \quad (155)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\Big|f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big| \quad (156)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\Big[\epsilon + K_\epsilon\big\|Q(x_{k+1},y_{k+1}|X^k,Y^k) - P(x_{k+1},y_{k+1}|X^k,Y^k)\big\|_1\Big] \quad (157)$$
$$\le \frac{1}{n}\sum_{k=1}^n\Big(K_\epsilon\,\mathbb{E}\sqrt{2\ln 2\, D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)} + \epsilon\Big) \quad (158)$$
$$\le \frac{1}{n}\sum_{k=1}^n\Big(K_\epsilon\sqrt{2\ln 2\,\mathbb{E}\big[D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)\big]} + \epsilon\Big) \quad (159)$$
$$= \epsilon + \frac{K_\epsilon}{n}\sum_{k=1}^n\sqrt{2\ln 2\,\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)} \quad (160)$$
$$\le \epsilon + K_\epsilon\sqrt{\frac{2\ln 2}{n}\sum_{k=1}^n\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)} \quad (161)$$
$$= \epsilon + K_\epsilon\sqrt{\frac{2\ln 2}{n}\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)}. \quad (162)$$

Lemma 9: For all $p,q\in[2/8,3/8]$, we have
$$|H_b(p) - H_b(q)| \ge \log(5/3)\,|p-q|. \quad (163)$$
We also show a lemma bounding the divergence between two Bernoulli pmfs.

Lemma 10: Let $P$ and $Q$ be Bernoulli pmfs with parameters, respectively, $1/2-p$ and $1/2-q$. If $|p|,|q|\le 1/4$, then $D(P\|Q)\le 8(p-q)^2$.

Lemma 10 can be verified as follows:
$$D(P\|Q) = (1/2-p)\log\frac{1/2-p}{1/2-q} + (1/2+p)\log\frac{1/2+p}{1/2+q} \quad (164)$$
$$= (1/2-p)\log\Big(1+\frac{q-p}{1/2-q}\Big) + (1/2+p)\log\Big(1+\frac{p-q}{1/2+q}\Big) \quad (165)$$
$$\le \frac{1}{\ln 2}\Big((1/2-p)\frac{q-p}{1/2-q} + (1/2+p)\frac{p-q}{1/2+q}\Big) \quad (166)$$
$$= \frac{1}{\ln 2}\,\frac{(p-q)^2}{1/4-q^2} \quad (167)$$
$$\le 8(p-q)^2, \quad (168)$$
where the first inequality holds because $\log(1+x)\le x/\ln 2$ for all $x>-1$, and the second inequality holds because $|q|\le 1/4$.

Taking the observation model to be $X_i$ i.i.d. $\sim\mathrm{Bern}(q)$ and $Y_i = X_i$, we have $I(\mathbf{X}\to\mathbf{Y}) = H(X)$. Assume that under model $f_0$, $q = q_0 = 1/4$; under model $f_1$, $q = q_1 = 1/4 + 1/\sqrt{n}$; and $n\ge 64$. Let $\hat I_n$ be an arbitrary estimator of $I(\mathbf{X}\to\mathbf{Y})$ based on $(X^n,Y^n)$ and let $d(x,y) = |x-y|$. Then we have
$$d\big(H_b(q_0),H_b(q_1)\big) \ge \log(5/3)\,|q_0-q_1| = \log(5/3)/\sqrt{n}. \quad (169)$$
Then we take $s = \log(5/3)/(2\sqrt{n})$ to satisfy the assumption of Lemma 8. For brevity, here we denote $I(\mathbf{X}\to\mathbf{Y})$ by $I$. By Lemma 8,
$$\inf_{\hat I_n}\sup_{\mathcal{M}_0} P_f\big(d(\hat I_n, I)\ge s\big) \ge \inf_{\hat I_n}\max_{j\in\{0,1\}} P_{f_j}\big(d(\hat I_n, H_b(q_j))\ge s\big) \quad (170)$$
$$\ge \frac{1}{4}\exp\big(-D(P_{f_1}\|P_{f_0})\big). \quad (171)$$
Then we bound $D(P_{f_1}\|P_{f_0})$:
$$D(P_{f_1}\|P_{f_0}) = n\,\mathbb{E}_{f_1}\log\frac{P_{f_1}(X_1,Y_1)}{P_{f_0}(X_1,Y_1)} \quad (172)$$
$$\le 8n(q_0-q_1)^2 \quad (173)$$
$$= 8. \quad (174)$$
Thus we have
$$\inf_{\hat I_n}\sup_{\mathcal{M}_0} P_f\big(d(\hat I_n, I)\ge s\big) \ge \frac{1}{4}e^{-8}. \quad (175)$$
Using Markov's inequality,
$$\inf_{\hat I_n}\sup_{\mathcal{P}(\mathbf{X},\mathbf{Y})}\mathbb{E}\big|\hat I_n - I\big| \ge \inf_{\hat I_n}\sup_{\mathcal{M}_0}\mathbb{E}\big|\hat I_n - I\big| \quad (176)$$
$$\ge \frac{1}{4}e^{-8}s = \frac{1}{8}e^{-8}\log(5/3)\,\frac{1}{\sqrt{n}}. \quad (177)$$
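As a sanity check, the bound of Lemma 10 and the size of the constant appearing in (177) can be verified numerically. The grid and the helper function below are our own illustrative choices, with all logarithms in bits:

```python
import numpy as np

def d_bern_bits(a, b):
    # D(Bern(a) || Bern(b)) in bits.
    return a * np.log2(a / b) + (1 - a) * np.log2((1 - a) / (1 - b))

# Lemma 10: for |p|, |q| <= 1/4, D(Bern(1/2 - p) || Bern(1/2 - q)) <= 8 (p - q)^2.
grid = np.linspace(-0.25, 0.25, 101)
worst = max(d_bern_bits(0.5 - p, 0.5 - q) - 8 * (p - q) ** 2
            for p in grid for q in grid)
print("max violation of Lemma 10 on the grid:", worst)   # should be <= 0 up to rounding

# Constant in the minimax lower bound (177): (1/8) e^{-8} log(5/3) / sqrt(n).
n = 64
print("lower bound at n = 64:", np.exp(-8) / 8 * np.log2(5 / 3) / np.sqrt(n))
```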

The bound on $\mathbb{E}|B_n|$ used in the proof of Proposition 2 is derived as follows:
$$\mathbb{E}|B_n| = \mathbb{E}\left|\frac{1}{n}\sum_{k=1}^n\Big(f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big)\right| \quad (178)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\Big|f\big(Q(x_{k+1},y_{k+1}|X^k,Y^k)\big) - f\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\big)\Big| \quad (179)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\left[2\big\|P(x_{k+1},y_{k+1}|X^k,Y^k) - Q(x_{k+1},y_{k+1}|X^k,Y^k)\big\|_1\log\frac{|\mathcal{X}||\mathcal{Y}|}{\big\|P(x_{k+1},y_{k+1}|X^k,Y^k) - Q(x_{k+1},y_{k+1}|X^k,Y^k)\big\|_1}\right] \quad (180)$$
$$\le \frac{1}{n}\sum_{k=1}^n \mathbb{E}\left[2\sqrt{2\ln 2\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}\,\log\frac{|\mathcal{X}||\mathcal{Y}|}{\sqrt{2\ln 2\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}}\right] \quad (181)$$
$$\le \frac{1}{n}\sum_{k=1}^n 2\sqrt{2\ln 2\,\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}\,\log\frac{|\mathcal{X}||\mathcal{Y}|}{\sqrt{2\ln 2\,\mathbb{E}\,D\big(P(x_{k+1},y_{k+1}|X^k,Y^k)\,\big\|\,Q(x_{k+1},y_{k+1}|X^k,Y^k)\big)}} \quad (182)$$
$$\le 2\sqrt{2\ln 2\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)/n}\,\log\frac{|\mathcal{X}||\mathcal{Y}|}{\sqrt{2\ln 2\,D\big(P(x^{n+1},y^{n+1})\,\big\|\,Q(x^{n+1},y^{n+1})\big)/n}}. \quad (183)$$
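Step (180) invokes the continuity bound of Lemma 4, $|f(P)-f(Q)|\le 2\|P-Q\|_1\log\big(|\mathcal{X}||\mathcal{Y}|/\|P-Q\|_1\big)$, which requires the $\ell_1$ distance to be small. A rough numerical check on small random perturbations; the toy setup below is our own, and [45, Lemma 2.7] is applied only when the distance is at most $1/2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def H_bits(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def f(Pxy):
    # f(P) = H_P(X, Y) - H_P(X), i.e., the conditional entropy H_P(Y | X) in bits.
    return H_bits(Pxy.ravel()) - H_bits(Pxy.sum(axis=1))

X, Y = 3, 4                                 # alphabet sizes |X|, |Y|
for _ in range(1000):
    P = rng.dirichlet(np.ones(X * Y)).reshape(X, Y)
    Q = 0.95 * P + 0.05 * rng.dirichlet(np.ones(X * Y)).reshape(X, Y)
    theta = np.abs(P - Q).sum()             # ell_1 distance, at most 0.1 by construction
    if 0 < theta <= 0.5:
        assert abs(f(P) - f(Q)) <= 2 * theta * np.log2(X * Y / theta) + 1e-9
```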

F. Proof of Theorem 3

We decompose
$$\hat I_3 = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log\frac{1}{Q(y_i|Y^{i-1})} - \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log\frac{1}{Q(y_i|X^i,Y^{i-1})}. \quad (184)$$
Following the proof of the almost sure and $L_1$ convergence of $\hat H_2$ in the proof of Proposition 2, we can show that the second term on the right-hand side of (184) converges to $H(\mathbf{Y}\|\mathbf{X})$ almost surely and in $L_1$ under the conditions of Theorem 3. Denote the first term on the right-hand side of (184) by
$$F_n = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log\frac{1}{Q(y_i|Y^{i-1})}. \quad (185)$$
Then it suffices to show the almost sure and $L_1$ convergence of $F_n$ to $H(\mathbf{Y})$. Decompose $F_n - H(\mathbf{Y})$ as
$$F_n - H(\mathbf{Y}) = R_n + S_n,$$
where
$$R_n = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} P(y_i|X^i,Y^{i-1})\log P(y_i|Y^{i-1}) - \frac{1}{n}\sum_{i=1}^n\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log Q(y_i|Y^{i-1}), \quad (186)$$
$$S_n = -\frac{1}{n}\sum_{i=1}^n\sum_{y_i} P(y_i|X^i,Y^{i-1})\log P(y_i|Y^{i-1}) - H(\mathbf{Y}). \quad (187)$$
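For concreteness, the decomposition (184) is just the plug-in form of the estimator: it averages, at each time $i$, a divergence-like quantity built from the two universal conditional assignments. A minimal sketch of that averaging, assuming the assignments are already available as arrays (the array layout and names are ours, not the paper's):

```python
import numpy as np

def directed_info_estimate(q_joint, q_marg):
    """
    q_joint[i, y] plays the role of Q(y | x^i, y^{i-1})  (causally conditioned assignment),
    q_marg[i, y]  plays the role of Q(y | y^{i-1})        (assignment ignoring X).
    Both are (n, |Y|) arrays with strictly positive rows summing to one.
    Returns the average of sum_y q_joint * log2(q_joint / q_marg),
    mirroring the structure of (184).
    """
    ratio = np.log2(q_joint / q_marg)
    return float(np.mean(np.sum(q_joint * ratio, axis=1)))

# Toy usage with made-up assignments over a binary Y alphabet:
n = 5
q_joint = np.array([[0.9, 0.1]] * n)
q_marg = np.array([[0.6, 0.4]] * n)
print(directed_info_estimate(q_joint, q_marg))
```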

1) Almost sure convergence: Express $R_n$ as $\frac{1}{n}\sum_{i=1}^n Z_i$, where
$$Z_i = -\sum_{y_i} Q(y_i|X^i,Y^{i-1})\log Q(y_i|Y^{i-1}) + \sum_{y_i} P(y_i|X^i,Y^{i-1})\log P(y_i|Y^{i-1}). \quad (188)$$
According to Lemma 2 in Appendix A, the CTW probability assignments $Q(y_i|X^i,Y^{i-1})$ and $Q(y_i|Y^{i-1})$ both converge almost surely to the true probabilities $P(y_i|X^i,Y^{i-1})$ and $P(y_i|Y^{i-1})$. Therefore,
$$\lim_{i\to\infty} Z_i = 0 \quad P\text{-a.s.} \quad (189)$$
Then we know the Cesàro mean of $\{Z_i\}_{i=1}^n$ also converges to zero almost surely, i.e.,
$$\lim_{n\to\infty} R_n = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^n Z_i = 0 \quad P\text{-a.s.} \quad (190)$$
Now we show that $S_n$ converges to zero almost surely, which is implied by Birkhoff's ergodic theorem.

2) $L_1$ convergence: We express $R_n$ in another form in (205) and bound $\mathbb{E}|R_n|$ in (206)--(212) below, where
• the first part of (209) is derived from (83), and the second part of (209) is implied by the fact that the CTW probability assignment is lower bounded (27);
• (210) follows by Pinsker's inequality;
• (211) follows by the data-processing inequality;
• (212) follows by the chain rule of relative entropy and the concavity of $\sqrt{\cdot}$.
After applying Lemma 7 in Appendix A, we know $R_n$ converges to zero in $L_1$. By Birkhoff's ergodic theorem, the convergence of $S_n$ is also in $L_1$, which completes the proof of $L_1$ convergence.

G. Proof of Theorem 4

We decompose $\hat I_4$ as
$$\hat I_4 = G_n - \hat H_2, \quad (191)$$
where $\hat H_2$ is the estimator for $H(\mathbf{Y}\|\mathbf{X})$ in $\hat I_2$, and $G_n$ is defined as
$$G_n = \frac{1}{n}\sum_{i=1}^n\sum_{(x_{i+1},y_{i+1})} Q(x_{i+1},y_{i+1}|X^i,Y^i)\log\frac{1}{Q(y_{i+1}|Y^i)}. \quad (192)$$
Since $G_n$ has a form similar to that of $F_n$, we can follow the corresponding steps in the proof of Theorem 3 to establish Theorem 4 analogously.

APPENDIX C
PROOFS OF TECHNICAL LEMMAS

A. Proof of Lemma 1

1) General stationary ergodic processes: The convergence holds almost surely by the Shannon--McMillan--Breiman theorem for the causally conditional entropy rate (see, for example, [33]). We now prove that the AEP also holds in $L_1$. Denote
$$A_n = -\frac{1}{n}\log P(Y^n\|X^n), \quad (193)$$
$$B_n = -\frac{1}{n}\log P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0), \quad (194)$$
$$C_n = B_n - A_n, \quad (195)$$
where $P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0) = \prod_{i=1}^n P(Y_i|X_{-\infty}^i,Y_{-\infty}^{i-1})$. Our goal is to show that $\mathbb{E}|A_n - H(\mathbf{Y}\|\mathbf{X})|$ converges to zero when $n\to\infty$. Note that
$$\mathbb{E}A_n = \frac{1}{n}\sum_{i=1}^n H(Y_i|Y^{i-1},X^i), \quad (196)$$
$$\mathbb{E}B_n = H(\mathbf{Y}\|\mathbf{X}). \quad (197)$$
By the stationarity of $(\mathbf{X},\mathbf{Y})$ and the fact that conditioning reduces entropy, we know $H(Y_i|Y^{i-1},X^i)$ is a nonnegative, nonincreasing sequence in $i$, and further, it converges to $H(\mathbf{Y}\|\mathbf{X})$. Since $\mathbb{E}A_n$ is the Cesàro mean of the sequence $\{H(Y_i|Y^{i-1},X^i)\}_{i=1}^n$, it follows that $\mathbb{E}A_n$ converges to $H(\mathbf{Y}\|\mathbf{X})$ as $n\to\infty$. Thus,
$$\lim_{n\to\infty}\mathbb{E}C_n = 0. \quad (198)$$
We have
$$\mathbb{E}\big|A_n - H(\mathbf{Y}\|\mathbf{X})\big| = \mathbb{E}\big|A_n - \mathbb{E}B_n\big| \quad (199)$$
$$\le \mathbb{E}|C_n| + \mathbb{E}\big|B_n - \mathbb{E}B_n\big|. \quad (200)$$

By Birkhoff's ergodic theorem, $\mathbb{E}|B_n - \mathbb{E}B_n|$ converges to zero when $n\to\infty$. It now suffices to show that $\lim_{n\to\infty}\mathbb{E}|C_n| = 0$. Denote the CDF of the random variable $C_n$ by $F_n(x)$. Then we have
$$\mathbb{E}|C_n| = -\mathbb{E}C_n + 2\int_0^\infty x\,dF_n(x) \quad (201)$$
$$= -\mathbb{E}C_n + 2\int_0^\infty P(C_n > x)\,dx, \quad (202)$$
where the second step follows by integration by parts and the fact that $1 - F_n(x) = P(C_n > x)$. Let $\mathcal{B}(X_{-\infty}^0,Y_{-\infty}^0) \triangleq \{(x^n,y^n)\colon P(x^n,y^n|X_{-\infty}^0,Y_{-\infty}^0) > 0\}$. We have (235) below; then, by Markov's inequality,
$$P\left(\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)} \ge t_n\right) \le \frac{1}{t_n} \quad (203)$$
for arbitrary positive $t_n$. Taking $t_n = 2^{nx}$, $x\ge 0$, we have
$$P\left(\frac{1}{n}\log\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)} \ge x\right) \le 2^{-nx}. \quad (204)$$

The expression for $R_n$ and the bound on $\mathbb{E}|R_n|$ used in the proof of Theorem 3 are as follows:
$$R_n = \frac{1}{n}\sum_{i=1}^n\sum_{y_i} P(y_i|X^i,Y^{i-1})\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})} + \frac{1}{n}\sum_{i=1}^n\sum_{y_i}\big(P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big)\log Q(y_i|Y^{i-1}), \quad (205)$$
$$\mathbb{E}|R_n| \le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left|\sum_{y_i} P(y_i|X^i,Y^{i-1})\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})}\right| + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left|\sum_{y_i}\big(P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big)\log Q(y_i|Y^{i-1})\right| \quad (206)$$
$$\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i} P(y_i|X^i,Y^{i-1})\left|\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})}\right|\right] + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i}\big|P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big|\,\big|\log Q(y_i|Y^{i-1})\big|\right] \quad (207)$$
$$\le \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i} P(y_i|Y^{i-1})\left|\log\frac{P(y_i|Y^{i-1})}{Q(y_i|Y^{i-1})}\right|\right] + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\sum_{y_i}\log\frac{1}{Q(y_i|Y^{i-1})}\,\big|P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big|\right] \quad (208)$$
$$\le \frac{1}{n}\sum_{i=1}^n\left(\sqrt{\frac{2}{\ln 2}\,\mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)} + \mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)\right) + \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\log(2i+|\mathcal{Y}|)\sum_{y_i}\big|P(y_i|X^i,Y^{i-1}) - Q(y_i|X^i,Y^{i-1})\big|\right] \quad (209)$$
$$\le \frac{1}{n}\sum_{i=1}^n\left(\sqrt{\frac{2}{\ln 2}\,\mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)} + \mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)\right) + \frac{1}{n}\sum_{i=1}^n \log(2i+|\mathcal{Y}|)\sqrt{2\ln 2\,\mathbb{E}D\big(P(y_i|X^i,Y^{i-1})\,\big\|\,Q(y_i|X^i,Y^{i-1})\big)} \quad (210)$$
$$\le \frac{1}{n}\sum_{i=1}^n\left(\sqrt{\frac{2}{\ln 2}\,\mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)} + \mathbb{E}D\big(P(y_i|Y^{i-1})\,\big\|\,Q(y_i|Y^{i-1})\big)\right) + \frac{1}{n}\sum_{i=1}^n \log(2i+|\mathcal{Y}|)\sqrt{2\ln 2\,\mathbb{E}D\big(P(x_i,y_i|X^{i-1},Y^{i-1})\,\big\|\,Q(x_i,y_i|X^{i-1},Y^{i-1})\big)} \quad (211)$$
$$\le \frac{1}{n}D\big(P(y^n)\,\big\|\,Q(y^n)\big) + \sqrt{\frac{2}{\ln 2}\,\frac{D\big(P(y^n)\,\big\|\,Q(y^n)\big)}{n}} + \log(2n+|\mathcal{Y}|)\sqrt{2\ln 2\,\frac{D\big(P(x^n,y^n)\,\big\|\,Q(x^n,y^n)\big)}{n}}. \quad (212)$$

Equivalently,
$$P(C_n > x) \le 2^{-nx}. \quad (213)$$
Plugging (213) into (202), we have
$$\mathbb{E}|C_n| = -\mathbb{E}C_n + 2\int_0^\infty P(C_n > x)\,dx \quad (214)$$
$$\le -\mathbb{E}C_n + \frac{2}{n\ln 2}. \quad (215)$$
By (198), we know
$$\lim_{n\to\infty}\mathbb{E}|C_n| = 0. \quad (216)$$
By (200), we know the AEP for the causally conditional entropy rate holds in $L_1$.

2) Irreducible aperiodic Markov processes: We express $-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X})$ as
$$-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X}) = \frac{1}{n}\sum_{i=1}^n Z_i, \quad (217)$$
where
$$Z_i = -\log P(Y_i|X_{i-m}^i,Y_{i-m}^{i-1}) - H(\mathbf{Y}\|\mathbf{X}), \quad (218)$$
and $m$ is the order of the Markov process $(\mathbf{X},\mathbf{Y})$. Let
$$g_i = -\log P(Y_i|X_{i-m}^i,Y_{i-m}^{i-1}) \quad (219)$$
and denote $\mathbb{E}g_i$ by $H$. Here $H$ does not depend on $i$ since the Markov process is stationary. We decompose $Z_i$ as
$$Z_i = (g_i^L - H^L) + (g_i^{L'} - H^{L'}), \quad (220)$$
where $g_i^L = g_i 1_{\{|g_i|\le L\}}$, $g_i^{L'} = g_i - g_i^L$, $H^L = \mathbb{E}g_i^L$, and $H^{L'} = \mathbb{E}g_i^{L'} = H(\mathbf{Y}\|\mathbf{X}) - H^L$. We expand
$$\mathbb{E}\left(\sum_{i=1}^n Z_i\right)^2 = \mathbb{E}\left(\sum_{i=1}^n (g_i^L - H^L)\right)^2 + \mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2 + 2\,\mathbb{E}\left[\left(\sum_{i=1}^n (g_i^L - H^L)\right)\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)\right] \quad (221)$$
and bound the three terms on the right-hand side of (221) separately.

For the first term, by Lemma 5 in Appendix A with $\mathbf{X}\leftarrow(\mathbf{X},\mathbf{Y})$, $V_i\leftarrow g_i^L - H^L$, and $V\leftarrow L$, we have
$$\mathbb{E}\left(\sum_{i=1}^n (g_i^L - H^L)\right)^2 = O(nL^2). \quad (222)$$
For the second term, consider
$$\mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2 \le n^2\max_i \mathbb{E}(g_i^{L'} - H^{L'})^2. \quad (223)$$
Define
$$E_{i,K} = \big\{(x_{i-m}^i,y_{i-m}^i)\colon K \le -\log P(y_i|x_{i-m}^i,y_{i-m}^{i-1}) \le K+1\big\}; \quad (224)$$
we have
$$\mathbb{E}(g_i^{L'} - H^{L'})^2 \le \mathbb{E}(g_i^{L'})^2 \quad (225)$$
$$\le \sum_{K=L}^\infty \int_{E_{i,K}} \big(\log P(Y_i|X_{i-m}^i,Y_{i-m}^{i-1})\big)^2\,d\mu \quad (226)$$
$$\le \sum_{K=L}^\infty |\mathcal{Y}|(K+1)^2 2^{-K} \quad (227)$$
$$= O(L^2 2^{-L}), \quad (228)$$
where the last inequality is an inequality developed by McMillan [43], and the last step can be understood intuitively: since the terms decay rapidly, the sum is dominated by the largest term, hence the order. Now we have
$$\mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2 = O(n^2 L^2 2^{-L}). \quad (229)$$
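The step from (227) to (228) says that the tail sum $\sum_{K\ge L}(K+1)^2 2^{-K}$ is of order $L^2 2^{-L}$. A quick numerical check of that order (purely illustrative):

```python
# Numerically confirm sum_{K >= L} (K+1)^2 2^{-K} = O(L^2 2^{-L}).
def tail(L, terms=200):
    return sum((K + 1) ** 2 * 2.0 ** (-K) for K in range(L, L + terms))

for L in (5, 10, 20, 40):
    # The ratio stays bounded (it approaches 2 as L grows), confirming the O(L^2 2^{-L}) order.
    print(L, tail(L) / (L ** 2 * 2.0 ** (-L)))
```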

The chain of relations establishing (235), used above via Markov's inequality, is as follows:
$$\mathbb{E}\left[\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)}\right] = \mathbb{E}\left[\mathbb{E}\left[\frac{P(Y^n\|X^n)}{P(Y^n\|X^n,X_{-\infty}^0,Y_{-\infty}^0)}\,\Big|\,X_{-\infty}^0,Y_{-\infty}^0\right]\right] \quad (230)$$
$$= \mathbb{E}\left[\sum_{(x^n,y^n)\in\mathcal{B}(X_{-\infty}^0,Y_{-\infty}^0)} P(x^n,y^n|X_{-\infty}^0,Y_{-\infty}^0)\,\frac{P(y^n\|x^n)}{P(y^n\|x^n,X_{-\infty}^0,Y_{-\infty}^0)}\right] \quad (231)$$
$$= \mathbb{E}\left[\sum_{(x^n,y^n)\in\mathcal{B}(X_{-\infty}^0,Y_{-\infty}^0)} P(y^n\|x^n)\,P(x^n\|y^{n-1},X_{-\infty}^0,Y_{-\infty}^0)\right] \quad (232)$$
$$\le \sum_{(x^n,y^n)} P(y^n\|x^n)\,P(x^n\|y^{n-1}) \quad (233)$$
$$= \sum_{(x^n,y^n)} P(x^n,y^n) \quad (234)$$
$$= 1. \quad (235)$$

For the third term, we apply the Cauchy--Schwarz inequality:
$$2\,\mathbb{E}\left[\left(\sum_{i=1}^n (g_i^L - H^L)\right)\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)\right] \quad (236)$$
$$\le 2\sqrt{\mathbb{E}\left(\sum_{i=1}^n (g_i^L - H^L)\right)^2}\sqrt{\mathbb{E}\left(\sum_{i=1}^n (g_i^{L'} - H^{L'})\right)^2} \quad (237)$$
$$= O\big(n^{3/2}L^2 2^{-L/2}\big). \quad (238)$$
Summing the three terms together and taking $L = 2\log n$, we have
$$\mathbb{E}\left(\sum_{i=1}^n Z_i\right)^2 = O\big(n(\log n)^2\big), \quad (239)$$
and thus
$$\mathbb{E}\left|-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X})\right| = \mathbb{E}\left|\frac{1}{n}\sum_{i=1}^n Z_i\right| \quad (240)$$
$$\le \sqrt{\frac{1}{n^2}\,\mathbb{E}\left(\sum_{i=1}^n Z_i\right)^2} \quad (241)$$
$$= O(n^{-1/2}\log n). \quad (242)$$
Now we deal with the almost sure convergence rates of the AEP for the causally conditional entropy rate. We restate the Gál--Koksma theorem [44] as follows:

Lemma 11 (Gál--Koksma theorem): Let $(\Omega,\mathcal{F},P)$ be a probability space and let $(Z_n)_{n\ge1}$ be a sequence of random variables belonging to $L^p$, $p\ge1$, such that
$$\mathbb{E}\big|Z_{M+1} + Z_{M+2} + \cdots + Z_{M+n}\big|^p = O(\Psi(n)) \quad (243)$$
uniformly in $M$, where $\Psi(n)/n$ is a nondecreasing sequence. Then for every $\epsilon > 0$,
$$Z_1(\omega) + Z_2(\omega) + \cdots + Z_n(\omega) = o\big((\Psi(n)(\log n)^{p+1+\epsilon})^{1/p}\big) \quad P\text{-a.s.} \quad (244)$$
The bound in (239) indicates that if we take $\Psi(n) = n(\log n)^2$ and $p = 2$ in the Gál--Koksma theorem, then for every $\epsilon > 0$,
$$-\frac{1}{n}\log P(Y^n\|X^n) - H(\mathbf{Y}\|\mathbf{X}) = o\big(n^{-1/2}(\log n)^{5/2+\epsilon}\big) \quad P\text{-a.s.} \quad (245){-}(246)$$

B. Proof of Lemma 2

Denote the alphabet size $|\mathcal{X}|$ by $M$. We examine the updating computation of $P_w^\lambda(X_{n+1} = q|x^n)$, $q = 0,1,\ldots,M-1$. For an internal node $s$ in the updating path, if $is$ is in the updating path, we have (33). For the leaf node $v$ in the updating path,
$$P_w^v(X_{n+1} = q|x^n) = P_e^v(X_{n+1} = q|x^n). \quad (247)$$
The computation of $P_w^\lambda(X_{n+1} = q|x^n)$ starts from a leaf and is repeated recursively along the updating path, until we reach the root node $\lambda$ and obtain $P_w^\lambda(X_{n+1} = q|x^n)$. Thus, $P_w^\lambda(X_{n+1} = q|x^n)$ is a weighted sum of $P_e^s(X_{n+1} = q|x^n)$, where $s$ is any node in the updating path.

Let $\{s\to\lambda\}$ denote the set of nodes in the path from $s$ to $\lambda$. The weight associated with $P_e^s(X_{n+1} = q|x^n)$ is
$$\frac{\beta^s(x^n)}{\prod_{u\in\{s\to\lambda\}}\big(\beta^u(x^n)+1\big)}, \quad (248)$$
where $s$ is an internal node in the updating path. The weight associated with $P_w^v(X_{n+1} = q|x^n)$, where $v$ is the leaf node in the updating path, is
$$\frac{1}{\prod_{u\in\{v\to\lambda\}\setminus\{v\}}\big(\beta^u(x^n)+1\big)}. \quad (249)$$
The convergence properties of $P_w^\lambda(X_{n+1} = q|X^n)$ depend on the limiting behavior of $\beta^s(X^n)$ at every node $s$ along the updating path. If $s$ is an internal node in the tree representation of the source, we actually have $\lim_{n\to\infty}\beta^s(X^n) = 0$ almost surely. This fact was stated in [15, Lemma 4]. Here, we restate this fact and give a proof for stationary irreducible aperiodic finite-alphabet Markov processes.

Lemma 12: Let $s$ be an internal node in the tree representation of the source. Then
$$\lim_{n\to\infty}\beta^s(X^n) = 0 \quad P\text{-a.s.} \quad (250)$$
Proof: It suffices to show
$$\lim_{n\to\infty}\frac{\beta^s(X^n)}{\beta^s(X^n)+1} = 0 \quad P\text{-a.s.} \quad (251)$$
We have
$$\frac{\beta^s(X^n)}{\beta^s(X^n)+1} \quad (252)$$
$$= \frac{P_e^s(X^n)}{2P_w^s(X^n)} \quad (253)$$
$$\le \frac{P_e^s(X^n)}{\prod_{i=0}^{M-1} P_w^{is}(X^n)} \quad (254)$$
$$\le 2^M\,\frac{P_e^s(X^n)}{\prod_{i=0}^{M-1} P_e^{is}(X^n)} \quad (255)$$
$$= 2^M\exp\left\{n_s\left(\frac{1}{n_s}\log P_e^s(X^n) - \frac{1}{n_s}\log\prod_{i=0}^{M-1} P_e^{is}(X^n)\right)\right\}, \quad (256)$$
where $n_s$ denotes the number of symbols in $X^n$ with context $s$, and the inequalities follow from applying (24) repeatedly. Here, since $s$ is an internal node of the tree, without loss of generality we can assume that the offsprings of $s$ do not all have the same conditional distribution; if this were violated, we could simply iterate the inequalities obtained above until we reach the leaf nodes of the tree, after which we can apply the same arguments that will be shown later.
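For intuition, $\beta^s$ is the ratio of the two mixture components in the standard CTW recursion $P_w^s = \tfrac{1}{2}P_e^s + \tfrac{1}{2}\prod_i P_w^{is}$ at an internal node $s$. The following is a minimal sketch of that recursion for a binary alphabet with a fixed-depth context tree and Krichevsky--Trofimov estimates at the nodes; it follows the generic textbook formulation, not necessarily the exact bookkeeping of (24)--(33) in this paper, and all function names are ours:

```python
import numpy as np
from itertools import product

def kt_block(counts):
    # Krichevsky-Trofimov block probability for a binary count vector (n0, n1).
    p, n = 1.0, 0
    for sym_count in counts:
        for j in range(sym_count):
            p *= (j + 0.5) / (n + 1.0)
            n += 1
    return p

def ctw_block_probability(x, depth):
    """Weighted CTW block probability P_w at the root for a binary sequence x,
    using the first `depth` symbols only as initial context (a common simplification)."""
    x = list(x)
    counts = {ctx: [0, 0] for ctx in product((0, 1), repeat=depth)}
    for i in range(depth, len(x)):
        ctx = tuple(x[i - depth:i])          # the `depth` symbols preceding x[i]
        counts[ctx][x[i]] += 1

    def node(ctx):
        # Returns (symbol counts at this node, weighted probability P_w at this node).
        if len(ctx) == depth:
            c = counts[ctx]
            return c, kt_block(c)            # at maximal depth, P_w equals the KT estimate
        c0, pw0 = node((0,) + ctx)           # children extend the context into the past
        c1, pw1 = node((1,) + ctx)
        c = [c0[0] + c1[0], c0[1] + c1[1]]
        pe = kt_block(c)
        return c, 0.5 * pe + 0.5 * pw0 * pw1
    return node(())[1]

rng = np.random.default_rng(2)
x = rng.integers(0, 2, size=200)
print(ctw_block_probability(x, depth=3))
```

The sequential assignment discussed in Lemma 2 then corresponds to the ratio of successive block probabilities, $Q(x_{n+1}|x^n) = P_w^\lambda(x^{n+1})/P_w^\lambda(x^n)$.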
It was shown in [32] that the Krichevsky--Trofimov probability estimate of the sequence $X^n$, i.e., $P_e(X^n)$, satisfies the following bound:
$$\left|-\frac{\log P_e(X^n)}{n} + \sum_{a\in\mathcal{X}}\frac{N(a|X^n)}{n}\log\frac{N(a|X^n)}{n}\right| \le \frac{M-1}{2}\,\frac{\log n}{n} + \frac{C}{n}, \quad (257)$$
where $N(a|X^n)$ denotes the number of occurrences of the symbol $a$ in the sequence $X^n$, and $C$ is a constant depending only on the alphabet size $M$. For a leaf node $s$, by the property of the Krichevsky--Trofimov probability estimate, we know
$$\lim_{n\to\infty}\frac{N(a|X^n)}{n} = \pi(a), \quad (258)$$
where $\pi(\cdot)$ is the stationary distribution of $\mathbf{X}$. Equation (258) implies that
$$\lim_{n\to\infty}\frac{-\log P_e(X^n)}{n} = H(\pi). \quad (259)$$

Applying the same argument to $\frac{1}{n_s}\log P_e^s(X^n)$, we have
$$\lim_{n\to\infty}\frac{-\log P_e^s(X^n)}{n_s} = H(\pi_s), \quad (260)$$
where $\pi_s$ is the stationary conditional distribution conditioned on context $s$. Analogously, for node $is$, we have
$$\lim_{n\to\infty}\frac{-\log P_e^{is}(X^n)}{n_{is}} = H(\pi_{is}). \quad (261)$$

1 is n Fix ǫ > 0. Since ( , ) is bounded and closed, f( ) is lim log Pe (X )= H(πis), (261) M X Y · n→∞ nis − uniformly continuous. Thus there exists δǫ such that f(P ) f(Q) ǫ if P Q δ . Furthermore, f( ) is bounded| − thus | ≤ k − k1 ≤ ǫ · by fmax , log + log . Therefore, we have M−1 |X | |Y| 1 s n 1 is n lim log P (X ) log P (X ) f(P ) f(Q) ǫ1 + fmax1 n→∞ n e − n e {kP −Qk1≤δǫ} {kP −Qk1>δǫ} s s i=0 | − |≤ Y (272)

We know that $Q(q|X^n) = P_w^\lambda(X_{n+1} = q|X^n)$ can be expressed as a weighted sum of $P_e^s(X_{n+1} = q|X^n)$ for $s$ in the updating path:
$$Q(q|X^n) = \sum_s w_s\,P_e^s(X_{n+1} = q|X^n), \quad (267)$$
where the $w_s$ are given in (248) and (249). Lemma 12 implies that for $s$ an internal node of the tree representation of $\mathbf{X}$, $w_s\to 0$. Hence
$$\lim_{n\to\infty}\left|Q(q|X^n) - \sum_{s\ \text{leaf node}} w_s\,P_e^s(X_{n+1} = q|X^n)\right| = 0 \quad P\text{-a.s.} \quad (268)$$
Under the assumption of Lemma 2, the Markov process $\mathbf{X}$ is ergodic, hence
$$\lim_{n\to\infty}\left|P_e^s(X_{n+1} = q|X^n) - P(q|X^n)\right| = 0, \quad (269)$$
where $P(q|X^n)$ is the true conditional probability. Thus we have
$$\big|Q(x_{n+1}|X^n) - P(x_{n+1}|X^n)\big| = \big|P_w^\lambda(x_{n+1}|X^n) - P(x_{n+1}|X^n)\big| \quad (270)$$
$$\to 0 \quad P\text{-a.s.} \quad (271)$$

C. Proof of Lemma 3

Fix $\epsilon > 0$. Since $\mathcal{M}(\mathcal{X},\mathcal{Y})$ is bounded and closed, $f(\cdot)$ is uniformly continuous. Thus there exists $\delta_\epsilon$ such that $|f(P) - f(Q)|\le\epsilon$ if $\|P-Q\|_1\le\delta_\epsilon$. Furthermore, $f(\cdot)$ is bounded by $f_{\max} \triangleq \log|\mathcal{X}| + \log|\mathcal{Y}|$. Therefore, we have
$$|f(P) - f(Q)| \le \epsilon\,1_{\{\|P-Q\|_1\le\delta_\epsilon\}} + f_{\max}\,1_{\{\|P-Q\|_1>\delta_\epsilon\}} \quad (272)$$
$$\le \epsilon + f_{\max}\,\frac{\|P-Q\|_1}{\delta_\epsilon} \quad (273)$$
$$= \epsilon + K_\epsilon\|P-Q\|_1, \quad (274)$$
where $K_\epsilon = f_{\max}/\delta_\epsilon$.

D. Proof of Lemma 4

Since
$$H(Y|X) = H(X,Y) - H(X), \quad (275)$$
we can bound $|f(P) - f(Q)|$ as
$$|f(P) - f(Q)| = \big|H_P(X,Y) - H_P(X) - H_Q(X,Y) + H_Q(X)\big| \quad (276)$$
$$\le \big|H_P(X,Y) - H_Q(X,Y)\big| + \big|H_P(X) - H_Q(X)\big|. \quad (277)$$
Now, by [45, Lemma 2.7], we have
$$\big|H_P(X,Y) - H_Q(X,Y)\big| \le \theta\log\frac{|\mathcal{X}||\mathcal{Y}|}{\theta}, \quad (278)$$
$$\big|H_P(X) - H_Q(X)\big| \le \theta_X\log\frac{|\mathcal{X}|}{\theta_X}, \quad (279)$$
where $\theta = \|P_{XY} - Q_{XY}\|_1$ and $\theta_X = \|P_X - Q_X\|_1$. Since
$$\theta = \sum_{x\in\mathcal{X},\,y\in\mathcal{Y}}|P(x,y) - Q(x,y)| \quad (280)$$
$$= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}|P(x,y) - Q(x,y)| \quad (281)$$
$$\ge \sum_{x\in\mathcal{X}}\left|\sum_{y\in\mathcal{Y}}\big(P(x,y) - Q(x,y)\big)\right| \quad (282)$$
$$= \sum_{x\in\mathcal{X}}|P(x) - Q(x)| \quad (283)$$
$$= \theta_X, \quad (284)$$
we have
$$|f(P) - f(Q)| \le 2\theta\log\frac{|\mathcal{X}||\mathcal{Y}|}{\theta}. \quad (285)$$

E. Proof of Lemma 5

We first define the $\alpha$-mixing coefficient of a stationary process.

Definition 4 ($\alpha$-mixing coefficient): For a stationary process $\mathbf{X}$ adapted to the filtration $(\mathcal{F}_n)_{-\infty}^\infty$, the $\alpha$-mixing coefficient is defined as
$$\alpha(n) \triangleq \sup\big|P(A\cap B) - P(A)P(B)\big|, \quad (286)$$
where the supremum is over all $A\in\mathcal{F}_{-\infty}^0$ and $B\in\mathcal{F}_n^\infty$.

According to [46], if $\mathbf{X}$ is a stationary irreducible aperiodic Markov process, $\alpha(n)$ tends to zero exponentially fast in $n$, i.e., there exist $C_7 > 0$ and $C_8 > 0$ such that
$$\alpha(n) \le C_7 e^{-C_8 n}. \quad (287)$$
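For a concrete feel for (287): it is a standard fact that the $\alpha$-mixing coefficient of a stationary Markov chain is bounded above by its $\beta$-mixing coefficient, which is in turn controlled by the total-variation distance between the $n$-step transition law and the stationary distribution, and this distance decays geometrically for an irreducible aperiodic finite chain. A toy illustration for a two-state chain (our own example; the constants $C_7$, $C_8$ of (287) are not computed here):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                   # irreducible, aperiodic two-state chain
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()                           # stationary distribution (2/3, 1/3)

Pn = np.eye(2)
for n in range(1, 21):
    Pn = Pn @ P
    tv = 0.5 * np.max(np.abs(Pn - pi).sum(axis=1))   # worst-case n-step TV distance
    if n % 5 == 0:
        # Roughly geometric decay, with rate |second eigenvalue| = 0.7.
        print(n, tv)
```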
We bound $\mathbb{E}\big((1/n)\sum_{i=1}^n V_i\big)^2$ as follows:
$$\mathbb{E}\left(\frac{1}{n}\sum_{i=1}^n V_i\right)^2 = \frac{1}{n^2}\sum_{i=1}^n\mathbb{E}|V_i|^2 + \frac{2}{n^2}\sum_{1\le i<j\le n}\mathbb{E}V_iV_j. \quad (288)$$

REFERENCES

[1] H. Marko, "The bidirectional communication theory–a generalization of information theory," IEEE Trans. Commun., vol. COM-21, pp. 1345–1351, 1973.
[2] J. L. Massey, "Causality, feedback, and directed information," in Proc. Int. Symp. Inf. Theory Appl., Honolulu, HI, Nov. 1990, pp. 303–305.
[3] G. Kramer, Directed Information for Channels with Feedback. Konstanz: Hartung-Gorre Verlag, 1998, Dr. sc. techn. Dissertation, Swiss Federal Institute of Technology (ETH) Zurich.
[4] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Trans. Inf. Theory, vol. 49, no. 1, pp. 4–21, 2003.
[5] S. Tatikonda and S. Mitter, "The capacity of channels with feedback," IEEE Trans. Inf. Theory, vol. 55, no. 1, pp. 323–349, 2009.
[6] Y.-H. Kim, "A coding theorem for a class of stationary channels with feedback," IEEE Trans. Inf. Theory, vol. 54, no. 4, pp. 1488–1499, 2008.
[7] H. H. Permuter, T. Weissman, and A. J. Goldsmith, "Finite state channels with time-invariant deterministic feedback," IEEE Trans. Inf. Theory, vol. 55, no. 2, pp. 644–662, 2009.
[8] H. H. Permuter, Y.-H. Kim, and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 3248–3259, Jun. 2011.
[9] C. Granger, "Investigating causal relations by econometric models and cross-spectral methods," Econometrica, vol. 37, no. 3, pp. 424–438, 1969.
[10] P. Mathai, N. C. Martins, and B. Shapiro, "On the detection of gene network interconnections using directed mutual information," in Proc. UCSD Inf. Theory Appl. Workshop, 2007.
[11] A. Rao, A. O. Hero, D. J. States, and J. D. Engel, "Using directed information to build biologically relevant influence networks," J. Bioinform. Comput. Biol., vol. 6, no. 3, pp. 493–519, 2008.
[12] S. Verdú, "Universal estimation of information measures," in Proc. IEEE Inf. Theory Workshop, 2005.
[13] A. D. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inf. Theory, vol. 35, no. 6, pp. 1250–1258, 1989.
[14] J. Ziv and N. Merhav, "A measure of relative entropy between individual sequences with application to universal classification," IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1270–1279, 1993.
[15] H. Cai, S. R. Kulkarni, and S. Verdú, "Universal divergence estimation for finite-alphabet sources," IEEE Trans. Inf. Theory, vol. 52, no. 8, pp. 3456–3475, 2006.
[16] M. Burrows and D. J. Wheeler, A Block-Sorting Lossless Data Compression Algorithm. Digital Systems Research Center, Tech. Rep. 124, 1994.
[17] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 653–664, 1995.

[29] F. Willems and T. Tjalkens, Complexity Reduction of the Context-Tree Weighting Algorithm: A Study for KPN Research. Univ. Eindhoven, Eindhoven, The Netherlands, EIDMA Rep. RS.97.01, 1997.
[30] T. J. Tjalkens, Y. M. Shtarkov, and F. M. J. Willems, "Sequential weighting algorithms for multi-alphabet sources," in 6th Joint Swedish–Russian Int. Workshop Inf. Theory, 1993, pp. 230–234.
[31] F. M. J. Willems, "The context-tree weighting method: Extensions," IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 792–798, 1998.
[32] R. E. Krichevsky and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inf. Theory, vol. 27, no. 2, pp. 199–207, 1981.
[33] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: Rate–distortion theorems and error exponents for a general source," IEEE Trans. Inf. Theory, vol. 53, no. 6, pp. 2154–2179, 2007.
[34] J. Birch, "Approximations for the entropy for functions of Markov chains," Ann. Math. Statist., vol. 33, pp. 930–938, 1962.
[35] F. L. Gland and L. Mevel, "Exponential forgetting and geometric ergodicity in hidden Markov models," Math. Control Signals Syst., vol. 13, no. 1, pp. 63–93, 2000.
[36] B. M. Hochwald and P. Jelenković, "State learning and mixing in entropy of hidden Markov processes and the Gilbert–Elliott channel," IEEE Trans. Inf. Theory, vol. 45, no. 1, pp. 128–138, 1999.
[37] S. Kleinberg and G. Hripcsak, "A review of causal inference for biomedical informatics," J. Biomed. Inform., vol. 44, no. 6, pp. 1102–1112, 2011.
[38] L. Breiman, "The individual ergodic theorem of information theory," Ann. Math. Statist., vol. 28, no. 3, pp. 809–811, 1957; correction, vol. 31, no. 3, pp. 809–810, 1960.
[39] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964.
[40] A. R. Barron, "Entropy and the central limit theorem," Ann. Probab., vol. 14, pp. 336–342, 1986.
[41] L. Breiman, Probability. SIAM: Society for Industrial and Applied Mathematics, 1992.
[42] A. Tsybakov, Introduction to Nonparametric Estimation. Springer-Verlag, 2008.
[43] B. McMillan, "The basic theorems of information theory," Ann. Math. Statist., vol. 24, no. 2, pp. 196–219, 1953.
[44] I. S. Gál and J. F. Koksma, "Sur l'ordre de grandeur des fonctions sommables," C. R. Acad. Sci. Paris, vol. 227, pp. 1321–1323, 1948.
[45] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Budapest: Akadémiai Kiadó, 1981.
[46] R. Bradley, "Basic properties of strong mixing conditions. A survey and some open questions," Probab. Surveys, vol. 2, pp. 107–144, 2005.
[47] D. Bosq, "Nonparametric statistics for stochastic processes," Lecture Notes in Statist., 1996.

Lei Zhao received the B.Eng. degree from Tsinghua University, China, in 2003, the M.S. degree in Electrical and Computer Engineering from Iowa State University, Ames, in 2006, and the Ph.D. degree in Electrical Engineering from Stanford University, California, in 2011.
Dr. Zhao is currently working at Jump Operations, Chicago, IL, USA.

Young-Han Kim (S'99–M'06–SM'12) received the B.S. degree with honors in electrical engineering from Seoul National University, Seoul, Korea, in 1996 and the M.S. degrees in electrical engineering and in statistics, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, in 2001, 2006, and 2006, respectively.
In July 2006, he joined the University of California, San Diego, where he is an Associate Professor of Electrical and Computer Engineering. His research interests are in statistical signal processing and information theory, with applications in communication, control, computation, networking, data compression, and learning.
Dr. Kim is a recipient of the 2008 NSF Faculty Early Career Development (CAREER) Award, the 2009 US-Israel Binational Science Foundation Bergmann Memorial Award, and the 2012 IEEE Information Theory Paper Award. He is currently on the Editorial Board of the IEEE TRANSACTIONS ON INFORMATION THEORY, serving as an Associate Editor for Shannon theory.
He is also serving as a Distinguished Lecturer for the IEEE Information Theory Society.

Jiantao Jiao (S'13) received the B.Eng. degree with the highest honor in Electronic Engineering from Tsinghua University, Beijing, China, in 2012. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering, Stanford University. His research interests include information theory and statistical signal processing, with applications in communication, control, computation, networking, data compression, and learning.
Mr. Jiao is a recipient of the Stanford Graduate Fellowship (SGF), the highest award offered by Stanford University.

Tsachy Weissman (S'99–M'02–SM'07–F'13) graduated summa cum laude with a B.Sc. in electrical engineering from the Technion in 1997, and earned his Ph.D. at the same place in 2001. He then worked at Hewlett-Packard Laboratories with the information theory group until 2003, when he joined Stanford University, where he is Associate Professor of Electrical Engineering and incumbent of the STMicroelectronics chair in the School of Engineering. He has spent leaves at the Technion and at ETH Zurich.
Tsachy's research is focused on information theory, statistical signal processing, the interplay between them, and their applications. He is a recipient of several best paper awards and prizes for excellence in research. He currently serves on the editorial boards of the IEEE TRANSACTIONS ON INFORMATION THEORY and Foundations and Trends in Communications and Information Theory.

Haim Permuter (M'08) received his B.Sc. (summa cum laude) and M.Sc. (summa cum laude) degrees in Electrical and Computer Engineering from Ben-Gurion University, Israel, in 1997 and 2003, respectively, and the Ph.D. degree in Electrical Engineering from Stanford University, California, in 2008.
Between 1997 and 2004, he was an officer at a research and development unit of the Israeli Defense Forces. He is currently a senior lecturer at Ben-Gurion University.
Dr. Permuter is a recipient of the Fulbright Fellowship, the Stanford Graduate Fellowship (SGF), the Allon Fellowship, and the 2009 U.S.-Israel Binational Science Foundation Bergmann Memorial Award.