Causality Through Directed Information

Young-Han Kim

University of California, San Diego

SNU Institute for Research in Finance and Economics, April

Joint work with Jiantao Jiao (Stanford), Haim Permuter (Ben Gurion), Tsachy Weissman (Stanford), and Lei Zhao (Jump Operations)

Supported in part by the National Science Foundation (NSF), the US–Israel Binational Science Foundation (BSF), and the BSF Bergmann Memorial Award

Related publications

∙ Haim H. Permuter, Young-Han Kim, and Tsachy Weissman, “Interpretations of directed information in portfolio theory, data compression, and hypothesis testing,” IEEE Transactions on Information Theory, June 2011.

∙ Tsachy Weissman, Young-Han Kim, and Haim H. Permuter, “Directed information, causal estimation, and communication in continuous time,” IEEE Transactions on Information Theory, March 2013.

∙ Jiantao Jiao, Lei Zhao, Haim H. Permuter, Young-Han Kim, and Tsachy Weissman, “Universal estimation of directed information,” to appear in IEEE Transactions on Information Theory.

∙ For more information, visit http://circuit.ucsd.edu/~yhk

Shannon’s information measures

∙ Entropy: “uncertainty in a random variable X”
  H(X) = ∑_x p(x) log(1/p(x))

∙ Mutual information: “information about X provided by Y”
  I(X; Y) = H(X) + H(Y) − H(X, Y)

∙ Relative entropy (Kullback–Leibler divergence): “distinction between p and q”
  D(p ‖ q) = ∑_x p(x) log(p(x)/q(x))
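All three quantities are straightforward to evaluate for a small discrete distribution. The following Python sketch (the joint pmf is a made-up toy example, and all function names are mine) computes H(X), I(X; Y), and D(p ‖ q) directly from the definitions above.

import numpy as np

def entropy(p):
    """H = sum_x p(x) log2 1/p(x), skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pxy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint pmf given as a 2-D array."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return entropy(px) + entropy(py) - entropy(pxy.ravel())

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2 p(x)/q(x); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Toy joint pmf of (X, Y) on {0, 1} x {0, 1}
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(entropy(pxy.sum(axis=1)))               # H(X) = 1 bit
print(mutual_information(pxy))                # I(X; Y) ≈ 0.278 bits
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # D(p || q) ≈ 0.737 bits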

Where do they come from?

∙ Mathematical communication theory (Shannon 1948)
  ▸ Fundamental limits on communication and compression

  ▸ Probability theory and statistics

∙ Axiomatic definitions (Aczél–Daróczy)
  ▸ “Reasonable” properties for information measures
  ▸ Functional equations: f(p × q) = f(p) + f(q) ⇒ f ≅ H

∙ How about finance and economics?

Gambling in horse races

Horses: ,,..., m ∙ Odds: o  , o  ,..., o m (say, o x m) ∙ ( ) ( ) ( ) ( )≡ Win probabilities: p  , p  ,..., p m ∙ ( ) ( ) ( )

Optimal gambling

∙ Bets: b(1), b(2), ..., b(m)
  ▸ No short: b(x) ≥ 0, x = 1, 2, ..., m
  ▸ No margin: ∑_x b(x) = 1
  ▸ In other words, b(x) lies in the probability simplex

∙ Payoff: If horse x wins (with probability p(x)), then 1 turns into b(x) o(x)

Question: How should we choose our portfolio b(x)?

Kelly gambling and log-optimal portfolio

Kelly (), “A new interpretation of information rate”: ∙ b∗ x p x ( )= ( )

∙ Maximize E[log(b(X) o(X))]
  ▸ Logarithmic utility

  ▸ Growth rate optimality

  ▸ Competitive optimality (Bell–Cover)

  ▸ Other properties (MacLean–Thorp–Ziemba)
∙ Optimal growth rate:
  W*(X) = max_{b(x)} E[log(b(X) o(X))] = E[log o(X)] − H(X)

∙ With o(x) ≡ m,
  W*(X) = log m − H(X)
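As a sanity check on W*(X) = log m − H(X), here is a short Python simulation (my own toy setup, not from the talk) of repeated Kelly betting with uniform odds o(x) ≡ m; the empirical per-race growth rate of wealth concentrates around log m − H(X).

import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.6, 0.3, 0.1])   # win probabilities for m = 3 horses
m = len(p)                      # uniform odds o(x) = m
n = 100_000                     # number of races

# Kelly bets: wager fraction b*(x) = p(x) on each horse
x = rng.choice(m, size=n, p=p)          # race outcomes
growth = np.log2(p[x] * m)              # per-race log2 wealth multiplier b*(x) o(x)

H = -np.sum(p * np.log2(p))             # entropy of a single race
print("empirical growth rate :", growth.mean())
print("log2 m - H(X)         :", np.log2(m) - H)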

Entropy

H(X) = ∑_x p(x) log(1/p(x)) = E[log(1/p(X))]

∙ Amount of randomness (information, uncertainty) in X
∙ Fundamental limit on lossless compression (Shannon 1948)
∙ Can be generalized to measures other than the counting measure

∙ Conditional entropy:
  H(X | Y) = ∑_{x,y} p(x, y) log(1/p(x | y)) = E[log(1/p(X | Y))]

Gambling with side information

∙ Side information Y about the horse race outcome X
∙ Bets: b(x | y), x = 1, 2, ..., m

∙ Kelly gambling: b*(x | y) = p(x | y)
∙ Optimal growth rate:
  W*(X | Y) = max_{b(x|y)} E[log(b(X | Y) o(X))] = E[log o(X)] − H(X | Y)

Value of side information (Kelly 1956)
  ΔW = W*(X | Y) − W*(X) = I(X; Y)
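The identity ΔW = I(X; Y) can be checked numerically for any small joint pmf. The sketch below (a made-up two-horse example with my own variable names) computes the left side from the betting formulation and the right side from entropies.

import numpy as np

# Toy joint pmf p(x, y): rows are horses x, columns are side-information symbols y
pxy = np.array([[0.35, 0.15],
                [0.05, 0.45]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
m = pxy.shape[0]                      # uniform odds o(x) = m

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Growth rates of Kelly betting, computed directly from the betting formulation
W_star = np.sum(px * np.log2(px * m))                      # b*(x) = p(x)
W_star_side = sum(py[y] * np.sum((pxy[:, y] / py[y]) *
                                 np.log2((pxy[:, y] / py[y]) * m))
                  for y in range(pxy.shape[1]))            # b*(x | y) = p(x | y)

# Mutual information computed from entropies
I_XY = H(px) + H(py) - H(pxy.ravel())

print("Delta W =", W_star_side - W_star)   # value of side information
print("I(X;Y)  =", I_XY)                   # equal, as claimed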

Mutual information

I(X; Y) = H(X) + H(Y) − H(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)

∙ Amount of information about X provided by Y (and vice versa)
∙ For a general stock market (Barron–Cover):
  ▸ ΔW ≤ I(X; Y)

∙ Fundamental limit on communication (Shannon 1948)
∙ Fundamental limit on lossy compression/quantization (Shannon 1959)

∙ Can be generalized to any pair of random objects

∙ Conditional mutual information:
  I(X; Y | Z) = H(X | Z) + H(Y | Z) − H(X, Y | Z)

Repeated gambling in horse races with memory

∙ Win probabilities: p(x_1), p(x_2 | x_1), p(x_3 | x_1, x_2), ..., p(x_n | x^{n−1})
∙ Odds: o(x_i) ≡ m
∙ Bets: b(x_1), b(x_2 | x_1), b(x_3 | x_1, x_2), ..., b(x_n | x^{n−1})
∙ Kelly gambling: b*(x_i | x^{i−1}) = p(x_i | x^{i−1}), i = 1, 2, ...
∙ Optimal growth rate:
  W*(X^n) = log m − (1/n) H(X^n) = log m − (1/n) ∑_{i=1}^n H(X_i | X^{i−1})

∙ If the horse race process {X_n} is stationary ergodic, then
  ▸ (1/n) H(X^n) → H*(X)
  ▸ W*(X^n) → W*(X)
  ▸ wealth ≐ 2^{nW*} almost surely (Shannon 1948, McMillan 1953, Breiman 1957)

Gambling with causal side information

∙ Side information Y_1, Y_2, ...
∙ Bets: b(x_i | x^{i−1}, y^i)
∙ Kelly gambling: b*(x_i | x^{i−1}, y^i) = p(x_i | x^{i−1}, y^i), i = 1, 2, ...
∙ Optimal growth rate:
  W*(X^n ‖ Y^n) = log m − (1/n) ∑_{i=1}^n H(X_i | X^{i−1}, Y^i) = log m − (1/n) H(X^n ‖ Y^n)

∙ If the process {(X_n, Y_n)} is stationary ergodic, then (1/n) H(X^n ‖ Y^n) → H*(X ‖ Y)

Value of causal side information (Permuter–K–Weissman)
  ΔW = W*(X^n ‖ Y^n) − W*(X^n) = (1/n)(H(X^n) − H(X^n ‖ Y^n)) = (1/n) I(Y^n → X^n)

Directed information

I(Y^n → X^n) = H(X^n) − H(X^n ‖ Y^n) = ∑_{i=1}^n I(X_i; Y^i | X^{i−1})

∙ Amount of information about X causally provided by Y
∙ For a general stock market:
  ▸ ΔW ≤ (1/n) I(Y^n → X^n)
∙ Arrow of time: directed and asymmetric
  I(Y^n → X^n) ≠ I(X^n → Y^n)

∙ Fundamental limit on feedback communication (Tatikonda–Mitter, K, Permuter–Weissman–Goldsmith)

∙ Can be generalized to continuous time (Weissman–K–Permuter)
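For small alphabets and short horizons, directed information can be evaluated by brute force from a known joint law. The Python sketch below uses my own toy model (X i.i.d. fair bits, Y a noisy observation of X with crossover probability eps; all names are mine): it computes I(X^n → Y^n) and I(Y^{n−1} → X^n) from the sum-of-conditional-mutual-information form and checks the conservation law I(X^n; Y^n) = I(X^n → Y^n) + I(Y^{n−1} → X^n) that appears a few slides later.

import itertools
from collections import defaultdict
import numpy as np

n, eps = 4, 0.1   # horizon and crossover probability of the toy channel

# Toy joint law: X_i i.i.d. Bernoulli(1/2), Y_i = X_i flipped with probability eps
joint = {}
for xs in itertools.product((0, 1), repeat=n):
    for ys in itertools.product((0, 1), repeat=n):
        p = 1.0
        for xi, yi in zip(xs, ys):
            p *= 0.5 * ((1 - eps) if xi == yi else eps)
        joint[(xs, ys)] = p

def H(key):
    """Entropy (bits) of the marginal obtained by grouping outcomes via key(xs, ys)."""
    marg = defaultdict(float)
    for (xs, ys), p in joint.items():
        marg[key(xs, ys)] += p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def cmi(A, B, C):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C) for key functions A, B, C."""
    return (H(lambda xs, ys: (A(xs, ys), C(xs, ys)))
            + H(lambda xs, ys: (B(xs, ys), C(xs, ys)))
            - H(lambda xs, ys: (A(xs, ys), B(xs, ys), C(xs, ys)))
            - H(C))

# I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1})
I_X_to_Y = sum(cmi(lambda xs, ys, i=i: xs[:i],
                   lambda xs, ys, i=i: ys[i - 1],
                   lambda xs, ys, i=i: ys[:i - 1]) for i in range(1, n + 1))

# I(Y^{n-1} -> X^n) = sum_i I(Y^{i-1}; X_i | X^{i-1})
I_Yp_to_X = sum(cmi(lambda xs, ys, i=i: ys[:i - 1],
                    lambda xs, ys, i=i: xs[i - 1],
                    lambda xs, ys, i=i: xs[:i - 1]) for i in range(1, n + 1))

I_XY = H(lambda xs, ys: xs) + H(lambda xs, ys: ys) - H(lambda xs, ys: (xs, ys))

print("I(X^n -> Y^n)     =", round(I_X_to_Y, 4))   # = n*(1 - h(eps)) for this toy model
print("I(Y^{n-1} -> X^n) =", round(I_Yp_to_X, 4))  # = 0 here: X ignores past Y
print("I(X^n ; Y^n)      =", round(I_XY, 4))       # conservation law: sum of the two above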

Test for causal dependence

H H

Xi Yi Xi Yi

Controller Output generatorController Output generator

i− i− Xi i i− Yi i− i− Xi i− Yi p(xi|x , y ) p(yi|x , y ) p(xi|x , y ) p(yi|y )

Yi− Yi−

∙ Type-I and type-II error probabilities: α = P(A^c | H_1), β = P(A | H_0)

Chernoff–Stein lemma for the causal dependence test
  β* = min_{A ⊆ X^n × Y^n : α < ε} β ≐ 2^{−I(X^n → Y^n)}

Brief history

∙ Marko (), “The bidirectional communication theory: A generalization of information theory”

  ▸ Direction of information flow for mutually coupled statistical systems

  ▸ Cybernetics: group behavior with monkeys
∙ Massey (1990), “Causality, feedback, and directed information”

Relationship to other notions for causality

∙ Granger causality (Granger 1969, Geweke 1982):

n i− LMMSE Yi Yi−p G Xn → Yn = log ( | ) ( ) ᚰ LMMSE Y Yi− , Xi i= ( i| i−p i−p) The higher G Xn → Y n is, the more X influences Y é ( ) If X , Y is Gauss–Markov of order p, then é {( n n)} I Xn → Y n ≡ G Xn → Y n ( ) ( )

∙ Transfer entropy (Schreiber 2000):
  T_i(X → Y) = I(X^{i−1}; Y_i | Y^{i−1})
  ▸ The higher T_i(X → Y) is, the more X influences Y (with one-step delay)
  ▸ If {(X_n, Y_n)} is stationary, then (1/n) I(X^{n−1} → Y^n) → T(X → Y)
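In practice Granger causality is estimated by comparing residual variances of two nested autoregressions. Below is a minimal Python sketch (a toy order-1 example with my own simulated data and variable names, not from the talk): Y is regressed on its own past, then on its own past together with X, and the per-step log-ratio of residual variances (the summand in the definition of G above) is reported.

import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 1

# Toy data: X drives Y with a one-step delay, Y does not drive X
x = rng.standard_normal(n)
y = np.zeros(n)
for i in range(1, n):
    y[i] = 0.5 * y[i - 1] + 0.8 * x[i - 1] + 0.3 * rng.standard_normal()

def residual_var(target, regressors):
    """Residual variance of an OLS fit (LMMSE estimate) of target on regressors."""
    beta, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return np.var(target - regressors @ beta)

# Restricted model uses Y_{i-1} only; full model adds X_{i-1} and X_i (order p = 1)
t = np.arange(p, n)
Z_r = np.column_stack([y[t - 1], np.ones(len(t))])
Z_f = np.column_stack([y[t - 1], x[t - 1], x[t], np.ones(len(t))])

G_per_step = np.log(residual_var(y[t], Z_r) / residual_var(y[t], Z_f))
print("Granger causality per step (nats):", G_per_step)   # clearly positive: X influences Y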

Causal conditioning

∙ Causally conditional probability (Kramer 1998):

  p(y^n ‖ x^n) = ∏_{i=1}^n p(y_i | x^i, y^{i−1})
  p(y^n ‖ x^{n−1}) = ∏_{i=1}^n p(y_i | x^{i−1}, y^{i−1})
∙ Causally conditional entropy:

  H(Y^n ‖ X^n) = −E[log p(Y^n ‖ X^n)],
  H(Y^n ‖ X^{n−1}) = −E[log p(Y^n ‖ X^{n−1})]

Chain rules

p(x^n, y^n) = p(x^n ‖ y^n) p(y^n ‖ x^{n−1}) = p(x^n ‖ y^{n−1}) p(y^n ‖ x^n),
H(X^n, Y^n) = H(X^n ‖ Y^n) + H(Y^n ‖ X^{n−1}) = H(X^n ‖ Y^{n−1}) + H(Y^n ‖ X^n)
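As a quick check (added here, not on the original slide), the first chain rule follows by pairing the per-step factors of the two causally conditional probabilities:

  p(x^n ‖ y^n) p(y^n ‖ x^{n−1}) = ∏_{i=1}^n p(x_i | x^{i−1}, y^i) p(y_i | x^{i−1}, y^{i−1}) = ∏_{i=1}^n p(x_i, y_i | x^{i−1}, y^{i−1}) = p(x^n, y^n),

and taking −E[log(·)] of each factorization gives the entropy chain rule.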

Properties of directed information

I(X^n → Y^n) = H(Y^n) − H(Y^n ‖ X^n),
I(X^{n−1} → Y^n) = H(Y^n) − H(Y^n ‖ X^{n−1})

∙ I(X^n → Y^n) ≤ I(X^n; Y^n)
∙ I(X^n → Y^n) = I(X^n; Y^n) if p(x^n ‖ y^{n−1}) = p(x^n)
∙ I(X^n → Y^n) = I(X^n; Y^n) = n I(X; Y) if {(X_n, Y_n)} is IID

Conservation law
  I(X^n; Y^n) = I(X^n → Y^n) + I(Y^{n−1} → X^n) = I(X^{n−1} → Y^n) + I(Y^n → X^n)

∙ Measure of causal influence
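For completeness, a short derivation of the conservation law (added here; it is not on the original slide) from the chain rule for causally conditional entropy:

  I(X^n; Y^n) = H(X^n) + H(Y^n) − H(X^n, Y^n)
              = [H(X^n) − H(X^n ‖ Y^{n−1})] + [H(Y^n) − H(Y^n ‖ X^n)]
              = I(Y^{n−1} → X^n) + I(X^n → Y^n),

and the second form follows by exchanging the roles of X and Y.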

Universal estimation of directed information

∙ In reality, the probability distribution may not be known or may not even exist

∙ Something out of nothing
∙ Can we perform as if the distribution were known?
∙ Can we perform as well as the best estimator in a given class?

∙ Answer: Yes! (Jiao–Zhao–Permuter–K–Weissman)

Universal probability assignments

∙ Probability assignment: q(x^n)
∙ Sequential probability assignment: q(x_1), q(x_2 | x_1), q(x_3 | x_1, x_2), ..., q(x_n | x^{n−1})
∙ Probability assignment q is universal if
  lim_{n→∞} (1/n) D(p(x^n) ‖ q(x^n)) = 0
  for every stationary distribution p

∙ Probability assignment q is pointwise universal if
  lim sup_{n→∞} (1/n) log [p(X^n) / q(X^n)] ≤ 0  p-a.s.
  for every stationary ergodic distribution p

∙ (Pointwise) universal probability assignments

  ▸ Compression-based approaches: Ziv–Lempel (1978), Willems–Shtarkov–Tjalkens (1995)

  ▸ Ergodic-theoretic approaches: Ornstein (1978), Morvai–Yakowitz–Algoet (1997)
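A concrete, if simplistic, example of a sequential probability assignment is an add-1/2 (Krichevsky–Trofimov style) estimate of the transition probabilities of a first-order Markov model. The Python sketch below is my own illustration (far cruder than Lempel–Ziv or context-tree weighting, and universal only over first-order Markov sources): it forms q(x_i | x^{i−1}) from counts seen so far and shows that −(1/n) log q(x^n) approaches the entropy rate of a simulated stationary source.

import numpy as np

rng = np.random.default_rng(2)

# Simulate a binary first-order Markov chain (stationary source)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])             # P[a, b] = Pr{next symbol = b | current = a}
n = 50_000
x = np.zeros(n, dtype=int)
for i in range(1, n):
    x[i] = rng.choice(2, p=P[x[i - 1]])

# Sequential probability assignment: add-1/2 counts of transitions seen so far
counts = np.ones((2, 2)) * 0.5         # KT-style prior pseudo-counts
log_q = np.log2(0.5)                   # q(x_1) = 1/2
for i in range(1, n):
    a, b = x[i - 1], x[i]
    q = counts[a, b] / counts[a].sum() # q(x_i = b | x_{i-1} = a)
    log_q += np.log2(q)
    counts[a, b] += 1

# True entropy rate of the chain for comparison
pi = np.array([0.75, 0.25])            # stationary distribution of P
h = lambda row: -np.sum(row * np.log2(row))
H_rate = pi @ np.array([h(P[0]), h(P[1])])

print("-(1/n) log2 q(x^n):", -log_q / n)   # approaches the entropy rate for this source
print("entropy rate      :", H_rate)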

Algorithm 1

Î_1(X^n → Y^n) = Ĥ(Y^n) − Ĥ(Y^n ‖ X^n)

∙ Ĥ(Y^n) = −(1/n) log q(Y^n) and Ĥ(Y^n ‖ X^n) = −(1/n) log q(Y^n ‖ X^n)

✓ Consistency (almost sure and L_1 convergence)

✓ Essentially optimal convergence rate

✗ Erratic for small n

✗ Unbounded support
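To make this concrete, here is a rough Python sketch in the spirit of Algorithm 1 (not the authors' implementation: my q is a simple add-1/2 first-order assignment rather than context-tree weighting, and all names are mine). Ĥ(Y^n) and Ĥ(Y^n ‖ X^n) are computed as −(1/n) log q and their difference estimates the directed information rate, here on toy data where Y is a noisy copy of X.

import numpy as np

rng = np.random.default_rng(3)
n, eps = 100_000, 0.1

# Toy data: X i.i.d. fair bits, Y a noisy observation of X (true rate = 1 - h(eps))
x = rng.integers(0, 2, n)
y = np.where(rng.random(n) < eps, 1 - x, x)

def seq_code_length(symbols, contexts, n_ctx):
    """-log2 q(symbols) for an add-1/2 sequential assignment of each symbol given its context."""
    counts = np.full((n_ctx, 2), 0.5)
    total = 0.0
    for c, s in zip(contexts, symbols):
        total -= np.log2(counts[c, s] / counts[c].sum())
        counts[c, s] += 1
    return total

# Hhat(Y^n): context is the previous Y symbol (first-order model of Y alone)
ctx_y = np.concatenate(([0], y[:-1]))
H_Y = seq_code_length(y, ctx_y, 2) / n

# Hhat(Y^n || X^n): context is (current X, previous Y), a crude stand-in for (X^i, Y^{i-1})
ctx_xy = 2 * x + np.concatenate(([0], y[:-1]))
H_Y_given_X = seq_code_length(y, ctx_xy, 4) / n

print("Ihat_1 =", H_Y - H_Y_given_X)   # approx 1 - h(eps) ≈ 0.531 bits for this toy data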

Algorithm 2

Î_2(X^n → Y^n) = Ĥ(Y^n) − Ĥ(Y^n ‖ X^n)

∙ Ĥ(Y^n) = (1/n) ∑_{i=1}^n H(q(y_i | Y^{i−1})) and Ĥ(Y^n ‖ X^n) = (1/n) ∑_{i=1}^n H(q(y_i | X^i, Y^{i−1}))

✓ Similar convergence rate as Î_1

✓ Smooth and bounded support

✗ Can be negative

Algorithms 3 and 4

Î_3(X^n → Y^n) = (1/n) ∑_{i=1}^n D(q(y_i | X^i, Y^{i−1}) ‖ q(y_i | Y^{i−1})),
Î_4(X^n → Y^n) = (1/n) ∑_{i=1}^n D(q(x_i, y_i | X^{i−1}, Y^{i−1}) ‖ q(y_i | Y^{i−1}) q(x_i | X^{i−1}, Y^{i−1}))

✓ Smooth, nonnegative, and bounded support

✗ Weaker performance guarantee than Î_1 and Î_2

Performance comparison

[Figure: estimates produced by Algorithms 1–4 versus sample size n (10^2 to 10^5, log scale), one panel per algorithm.]

Causal influence

X X X X Xn Xn+ Xn+

Y Y Y Yn Yn+

Question: Which process influences the other?

∙ Conservation law: I(X^n; Y^n) = I(X^n → Y^n) + I(Y^{n−1} → X^n)

∙ If I(X^n → Y^n) ≫ I(Y^{n−1} → X^n), then X causes Y
∙ If I(X^n → Y^n) ≪ I(Y^{n−1} → X^n), then Y causes X

Causal influence

[Figure: estimates of I(X^n; Y^n)/n, I(X^n → Y^n)/n, and I(Y^{n−1} → X^n)/n versus n (10^2 to 10^5, log scale), one panel per algorithm (Alg. 1–4).]

HSI versus DJIA

[Figure: Hang Seng Index (HSI) and Dow Jones Industrial Average (DJIA) levels, 1990–2010.]

Questions
∙ Are these markets correlated?
∙ Which index leads the other?

HSI versus DJIA

[Figure: estimates of I(X^n; Y^n)/n, I(X^n → Y^n)/n, and I(Y^{n−1} → X^n)/n for the HSI and DJIA series over time, one panel per algorithm (Alg. 1–4).]

Delay estimation

[Figure: {X_n} passes through a Δ-unit delay followed by a channel to produce {Y_n}.]

Question: Can we estimate Δ efficiently?

∙ Shifted directed information

  I(Y_{d+1}^n → X^{n−d}) = ∑_{i=1}^{n−d} [ H(X_i | X^{i−1}) − H(X_i | X^{i−1}, Y_{d+1}^{d+i}) ]

∙ If d < Δ, then I(Y_{d+1}^n → X^{n−d}) = 0
∙ If d ≥ Δ, then I(Y_{d+1}^n → X^{n−d}) ≫ 0

[Figure: estimated shifted directed information versus delay d (−4 ≤ d ≤ 4).]

Concluding remarks

∙ Directed information I(X^n → Y^n)

  ▸ Arrow of time + Shannon’s mutual information

  ▸ A natural generalization of Granger causality

  ▸ Causation beyond correlation

  ▸ Universal estimation algorithms (MATLAB codes available)

∙ Looking forward

  ▸ Large alphabets

  ▸ More than a pair of random processes

  ▸ Piecewise stationary processes

  ▸ More applications (economics, biology, climate change, ...)

References

Aczél, J. and Daróczy, Z. (1975). On Measures of Information and Their Characterizations. Academic Press, New York.
Barron, A. R. and Cover, T. M. (1988). A bound on the financial value of information. IEEE Trans. Inf. Theory.
Bell, R. M. and Cover, T. M. (1980). Competitive optimality of logarithmic investment. Math. Oper. Res.
Breiman, L. (1957). The individual ergodic theorem of information theory. Ann. Math. Statist. Correction (1960).
Geweke, J. F. (1982). Measurement of linear dependence and feedback between multiple time series. J. Amer. Statist. Assoc. With discussion and with a reply by the author.
Granger, C. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica.
Jiao, J., Zhao, L., Permuter, H. H., Kim, Y.-H., and Weissman, T. Universal estimation of directed information. To appear in IEEE Trans. Inf. Theory.
Kelly, J. L., Jr. (1956). A new interpretation of information rate. Bell Syst. Tech. J.
Kim, Y.-H. (2008). A coding theorem for a class of stationary channels with feedback. IEEE Trans. Inf. Theory.
Kramer, G. (1998). Directed Information for Channels with Feedback. Hartung-Gorre Verlag, Konstanz. Dr. sc. techn. dissertation, Swiss Federal Institute of Technology (ETH) Zurich.

References (cont.)

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Statist.
MacLean, L. C., Thorp, E. O., and Ziemba, W. T. (2011). The Kelly Capital Growth Investment Criterion: Theory and Practice. World Scientific, Singapore.
Marko, H. (1973). The bidirectional communication theory: A generalization of information theory. IEEE Trans. Comm.
Massey, J. L. (1990). Causality, feedback, and directed information. In Proc. IEEE Int. Symp. Inf. Theory Appl., Honolulu, HI.
McMillan, B. (1953). The basic theorems of information theory. Ann. Math. Statist.
Morvai, G., Yakowitz, S. J., and Algoet, P. (1997). Weakly convergent nonparametric forecasting of stationary time series. IEEE Trans. Inf. Theory.
Ornstein, D. (1978). Guessing the next output of a stationary process. Israel J. Math.
Permuter, H. H., Kim, Y.-H., and Weissman, T. (2011). Interpretations of directed information in portfolio theory, data compression, and hypothesis testing. IEEE Trans. Inf. Theory.
Permuter, H. H., Weissman, T., and Goldsmith, A. J. (2009). Finite state channels with time-invariant deterministic feedback. IEEE Trans. Inf. Theory.
Schreiber, T. (2000). Measuring information transfer. Phys. Rev. Lett.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J.

References (cont.)

Shannon, C. E. (). Coding theorems for a discrete source with a fidelity criterion. In IRE Int. Conv. Rec., vol. , part , pp. –. Reprint with changes (). In R. E. Machol (ed.) Information and Decision Processes, pp. –. McGraw-Hill, New York. Tatikonda, S. and Mitter, S. (). The capacity of channels with feedback. IEEE Trans. Inf. Theory, (), –. Weissman, T., Kim, Y.-H., and Permuter, H. H. (). Directed information, causal estimation, and communication in continuous time. IEEE Trans. Inf. Theory, (), –. Willems, F. M. J., Shtarkov, Y. M., and Tjalkens, T. J. (). The context-tree weighting method: Basic properties. IEEE Trans. Inf. Theory, (), –. Ziv, J. and Lempel, A. (). Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory, IT-(), –.
