<<

COMMON , EFFICIENCY, AND ACTIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF ELECTRICAL ENGINEERING AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Lei Zhao August 2011

© 2011 by Lei Zhao. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/bn436fy2758

ii I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Thomas Cover, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Itschak Weissman, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Abbas El-Gamal

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

iii Preface

The source coding theorem and channel coding theorem, first established by Shannon in 1948, are the two pillars of information theory. The insight obtained from Shan- non’s work greatly changed the way modern communication systems were thought and built. As the original ideas of Shannon were absorbed by researchers, the mathe- matical tools in information theory were put to great use in , portfolio theory, complexity theory, and probability theory. In this work, we explore the area of common randomness generation, where remote nodes use nature’s correlated random resource and communication to generate a in common. In particular, we investigate the initial efficiency of common randomness generation as the communication rate goes down to zero, and the saturation efficiency as the communication exhausts nature’s randomness. We also consider the setting where some of the nodes can generate action sequences to influence part of nature’s randomness. At last, we consider actions in the framework of source coding. The tools from channel coding and distributed source coding are combined to establish the funda- mental limit of compression with actions.

iv Acknowledgements

The five years I spent at Stanford doing my Ph.D. have been a very pleasant and fulfilling journey. And it is my advisor Thomas Cover, who made it possible. His weekly round-robin group meeting was the best place for research discussion and was also full of interesting puzzles and stories. He revealed the pearls of information theory as well as statistics through all those beautiful examples and always encouraged me on every small findings I obtained. It is a privilege to work with him and I would like to thank him for his support, and guidance. I am also truly grateful to Professor Tsachy Weissman, who taught me amazing universal schemes in information theory and was always willing to let me do “random drawing” on his white boards. I really like his way of asking have-we-convinced- ourselves questions, which often led to surprisingly simple yet insightful discoveries. Professor Abbas El Gamal is of great influence on me. I would like to extend my sincere thanks to him. His broad knowledge on network information theory and his teaching of EE478 were invaluable to my research. I would like to thank my colleagues at Stanford, especially, Himanshu Asnani, Bernd Bandemer, Yeow Khiang Chia, Paul Cuff, Shirin Jalali, Gowtham Kumar, Vinith Misra, Alexandros Manolakos, Taesup Moon, Albert No, Idoia Ochoa, Haim Permuter, Han-I Su, and Kartik Venkat. Last but not least, I am grateful to my family. I thank my parents for their constant support and love. I thank my wife for her love, and for completing my life.

v Contents

Preface iv

Acknowledgements v

1 Introduction 1

2 Hirschfeld-Gebelein-Renyi maximal correlation 4 2.1 HGRcorrelation ...... 4 2.2 Examples ...... 6 2.2.1 Doubly symmetric binary source ...... 7 2.2.2 Z-Channel with Bern(1/2)input...... 7 2.2.3 ErasureChannel ...... 8

3 Common randomness generation 11 3.1 Commonrandomnessandefficiency ...... 11 3.1.1 Commonrandomness andcommoninformation ...... 13 3.1.2 Continuity at R =0 ...... 14 3.1.3 Initial Efficiency (R 0) ...... 14 ↓ 3.1.4 Efficiency at R H(X Y )(saturationefficiency) ...... 16 ↑ | 3.2 Examples ...... 18 3.2.1 DBSC(p)example...... 18 3.2.2 Gaussianexample...... 20 3.2.3 Erasureexample ...... 23 3.3 Extensions...... 24

vi 3.3.1 CRperunitcost ...... 24 3.3.2 Secretkeygeneration...... 24 3.3.3 Non-degenerate V ...... 26 3.3.4 Broadcastsetting ...... 26

4 Common randomness generation with actions 28 4.1 Commonrandomnesswithaction ...... 28 4.2 Example...... 32 4.3 Efficiency ...... 34 4.3.1 Initial Efficiency ...... 34 4.3.2 Saturationefficiency ...... 35 4.4 Extensions...... 35

5 Compression with actions 37 5.1 Introduction...... 37 5.2 Definitions...... 39 5.2.1 Losslesscase...... 39 5.2.2 Lossycase...... 39 5.2.3 Causalobservationsofstatesequence ...... 40 5.3 Losslesscase...... 40 5.3.1 Lossless, noncausal compression with action ...... 40 5.3.2 Lossless, causal compression with action ...... 46 5.3.3 Examples ...... 48 5.4 Lossycompressionwithactions ...... 51

6 Conclusions 54

A Proofs of Chapter 2 56

A.1 Proof of the convexity of ρ(PX PY X ) in PY X ...... 56 ⊗ | | B Proofs of Chapter 3 59 B.1 Proof of the continuity of C(R) at R =0...... 59

vii C Proofs of Chapter 4 61 C.1 ConverseproofofTheorem5...... 61 C.2 Proof for initial efficiency with actions ...... 63 C.3 ProofofLemma 5 ...... 66

D Proofs of Chapter 5 68 D.1 ProofofLemma6...... 68 D.2 ProofofLemma7...... 70

Bibliography 71

viii List of Tables

ix List of Figures

n n 1.1 Generate common randomness: K = K(X ), K′ = K′(Y ) satisfy-

ing P(K = K′) 1 as n . What is the maximum common → → ∞ 1 randomness per symbol, i.e. what is sup n H(K)? ...... 2

2.1 ...... 6 2.2 ...... 6 2.3 X Bern(1/2) ...... 7 ∼ 2.4 ρ(X; Y )=1 2 min p, 1 p ...... 7 − { − } 2.5 Z-channel ...... 8 1 p 2.6 ρ(X; Y )= 1+−p ...... 8 2.7 ErasureChannelq ...... 8

3.1 Common Randomness Capacity: (Xi,Yi) are i.i.d.. Node 1 generates a r.v. K based on the Xn sequence it observes. It also generates a message M and transmits the message to Node 2 under rate constraint n R. Node 2 generates a r.v. K′ based on the Y sequence it observes

and M. We require that P(K = K′) approaches 1 as n goes to infinity. The entropy of K measures the amount of common randomness those two nodes can generate. What is the maximum entropy of K? ... 12

3.2 The probability structure of Un...... 17

3.3 DBSC example: X Bern(1/2), pY X (x x)=(1 p), pY X (1 x x)= p. 18 ∼ | | − | − | 3.4 C(R) for p =0.08...... 19 3.5 GaussianExample ...... 21 3.6 Auxiliary r.v. U inGaussianexample...... 21

x 3.7 Gaussian example: C(R) for N =0.5...... 22 3.8 Erasureexample ...... 23 3.9 Erasure example: C R curve...... 23 − 3.10 Commonrandomnessperunitcost...... 24 3.11 SecretKeyGeneration ...... 25 3.12CRbroadcastsetup...... 26

4.1 Common Randomness Capacity: X is an i.i.d. source. Node { i}i=1,... 1 generates a r.v. K based on the Xn sequence it observes. It also generates a message M and transmits the message to Node 2 under rate constraint R. Node 2 first generates an action sequence An as a function of M and receives a sequence of side information Y n, where n n n Y (A ,X ) p(y a, x). Then Node 2 generates a r.v. K′ based on | ∼ | n both M and Y sequence it observes and M. We require P(K = K′) to be close to 1. The entropy of K measures the amount of common randomness those two nodes can generate. What is the maximum entropy of K? ...... 29 4.2 CRwithActionexample ...... 33 4.3 Correlate A with X ...... 33 4.4 CR with action example: option one: set A X; option two: correlate ⊥ A with X...... 34

5.1 Compression with actions. The Action encoder first observes the state sequence Sn and then generates an action sequence An. The ith out- put Y is the output of a channel p(y a, s) when a = A and s = S . i | i i The compressor generates a description M of 2nR bits to describe Y n. The remote decoder generates Yˆ n based on M and it’s available side information Zn as a reconstruction of Y n...... 38 5.2 Binary example with side information Z = ...... 48 ∅ H2(b) H2(p) dH2 5.3 The threshold b∗ solves − = , b [0, 1/2]...... 50 b db ∈ 5.4 Comparison between the non-causal and causal rate-cost functions. TheparameteroftheBernoullinoiseissetat0.1...... 51

xi Chapter 1

Introduction

Given a pair of random variables (X,Y ) with joint distribution p(x, y), what do they have in common? Different quantities can be justified as the right measure of “com- mon” in different settings. For example, in linear estimation, correlation determines the minimum square error (MMSE) when we use one random variable to esti- mate the other. And the MMSE suggests that the larger the absolute value of the correlation, the more “commonness” X and Y have. In information theory, insight about p(x, y) can often be gained when independent and identically distributed (i.i.d.) copies, (Xi,Yi), i = 1, ..., n, are considered. In source coding with side information, the celebrated Slepian-Wolf Theorem [21] shows that when compressing X n loss- { i}i=1 lessly, the rate reduction by having side information Y n is the mutual information { i}i=1 I(X; Y ) between X and Y . It makes a lot of sense that a large rate reduction suggests a lot in common between X and Y , which indicates that I(X; Y ) is a good measure. A more direct attempt addressing the commonness was first considered by G´acs and K¨orner in [10]. In their setting, illustrated in Fig. 1.1, nature generates (Xn,Y n) ∼ i.i.d. p(x, y). Node 1 observes Xn, and Node 2 observes Y n. The task is for the two nodes to generate common randomness (CR), i.e., a random variable K in common. The entropy of the common random variable is the number of common bits gener- ated by nature’s resource at either node. The supremum of the normalized entropy, 1 n H(K), is defined as the common information between X and Y . It would be an extremely interesting measure of commonness if not for the fact that it is zero for a

1 CHAPTER 1. INTRODUCTION 2

Xn Y n

Node 1 Node 2

K K′

n n Figure 1.1: Generate common randomness: K = K(X ), K′ = K′(Y ) satisfying P(K = K′) 1 as n . What is the maximum common randomness per symbol, → 1 →∞ i.e. what is sup n H(K)? large class of joint distributions. Witsenhausen [28] used Hirschfeld-Gebelein-Renyi maximal correlation (HGR correlation) to sharpen the result by G´acs and K¨orner. Surprisingly, if the HGR correlation between X and Y is strictly less than 1, not a single bit in common can be generated by the two nodes. In this thesis, we investigate the role of HGR correlation in common randomness generation with a rate-limited communication link between Node 1 and Node 2, with and without actions. In particular, we link the HGR correlation with initial efficiency, i.e., the initial rate of common randomness unlocked by communication, thus giving an operational justification of using HGR correlation as a measure of commonness. Furthermore, we extend common randomness generation to the setting where one node can take actions to affect the side information. A single letter expression for common randomness capacity is obtained, based on which the initial efficiency and saturation efficiency are derived. The maximum HGR correlation conditioned on a fixed action determines the initial efficiency. In the last chapter we consider the problem of compression with actions. While traditionally in source coding, nature fixes the source distribution, in our setting, we introduce the idea of using actions to affect nature’s source. Notation: We use capital letter X to denote a random variable, small letter x to denote the corresponding realization, calligraphic letter to denote the alphabet X of X, and to denote the cardinality of the alphabet. The subscripts in joint |X| CHAPTER 1. INTRODUCTION 3

distributions are mostly omitted. For example pXY (x, y) is written as p(x, y). But to emphasize the probability structure, we sometimes write the joint distribution as

PX PY X , where PX is the marginal of X and PY X as the conditional distribution of ⊗ | | Y given X. We use X Y to indicate that X and Y are independent, and X Y Z ⊥ − − to indicate that X and Z are conditionally independent given Y . Subscripts and n j superscripts are used in the standard way: X =(X1, ..., Xn) and Xi =(Xi, ..., Xj). Most of the notations follow [8]. Chapter 2

Hirschfeld-Gebelein-Renyi maximal correlation

2.1 HGR correlation

We focus on random variables with finite alphabet.

Definition 1. The HGR correlation [12,14,18] between two random variables (r.v.) X and Y , denoted as ρ(X; Y ), is defined as

ρ(X; Y ) = max Eg(X)f(Y ) (2.1) subject to Eg(X)=0, Ef(Y )=0, Eg2(X) 1, Ef 2(Y ) 1. ≤ ≤

If neither X nor Y is degenerate, i.e., a constant, then the inequalities can be replaced by equality in the constraints. An equivalent characterization was proved by R´enyi in [18]:

ρ2(X; Y ) = sup E E2(g(Y ) X) (2.2) Eg(Y )=0,Eg2(Y ) 1 | ≤   Note that HGR correlation is a function of the joint distribution p(x, y) and does not dependent on the support of X and Y . We sometimes use ρ(p(x, y)) or ρ(PX PY X ) ⊗ | 4 CHAPTER 2. HIRSCHFELD-GEBELEIN-RENYI MAXIMAL CORRELATION 5

to emphasize the joint . The HGR correlation shares quite a few properties with mutual information ρ(X; Y ).

Positivity [18]: 0 ρ(X; Y ) 1 • ≤ ≤ ρ(X; Y ) = 0 iff X Y . ◦ ⊥ ρ(X; Y ) = 1 iff there exists a non-degenerate random variable V such that ◦ V is both a function of X and a function of Y .

Data processing inequality: If X,Y and Z form a Markov chain X Y Z, • − − then ρ(X; Y )= ρ(X; Y,Z) ρ(X; Z). ≥ Proof. Consider any function g such that Eg(X)=0,Eg2(X) = 1. By the Markovity X Y Z, E [E2(g(X) Y )] = E [E2(g(X) Y,Z)]. Thus using the − − | | alternative characterization Eq.(2.2), we have

ρ(X; Y )= ρ(X; Y,Z) ρ(X; Z). ≥

2 Convexity: Fixing PX , ρ (PX PY X ) is convex in PY X . • ⊗ | | 1, w.p. λ; Proof. Consider r.v.’s X,Y1,Y2. Let0 <λ< 1, and let Θ = , ( 2, w.p. 1 λ − where Θ is independent of (X,Y1,Y2). Let Y = YΘ. We have

ρ2(X; Y ) ρ2(X; Y , Θ) Θ ≤ Θ λρ2(X; Y )+(1 λ)ρ2(X; Y ), ≤ 1 − 2

where last inequality comes from the following lemma.

Lemma 1. Assume X Z, where Z has a finite alphabet . Let ρ(X; Y,Z) be ⊥ Z the R´enyi correlation between X and (Y,Z).

ρ2(X; Y,Z) P (z)ρ2(X; Y Z = z), (2.3) ≤ Z | v X CHAPTER 2. HIRSCHFELD-GEBELEIN-RENYI MAXIMAL CORRELATION 6

where ρ(X; Y Z = z)= ρ(PX PY X,Z=z). | ⊗ |

Proof. See Appendix A.1.

2 However, we note here that ρ (PX PY X ) is not concave in PX when fixing PY X , ⊗ | | which differs from mutual information. We provide a numerical example: Consider P = [1/2, 1/4, 1/4]T and P = [1/3, 1/3, 1/3]. Let P = θP +(1 θ)P . 1 2 θ 1 − 2 2 We show the plots of ρ (Pθ PY X ) as a function of θ for two different PY X matrices ⊗ | | in the following figures:

0.107 0.098

0.106 0.097

0.105 0.096 0.104 0.095 0.103 0.094 0.102 (X;Y) (X;Y)

2 2 0.093 ρ 0.101 ρ 0.092 0.1 0.091 0.099

0.098 0.09

0.097 0.089 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Time sharing θ Time sharing θ

Figure 2.1: Figure 2.2:

0.0590 0.4734 0.4677

For the left figure, PY X =  0.3252 0.2415 0.4333 , and for the right figure, |  0.1778 0.6230 0.1992    0.3162 0.6139 0.0699 

PY X =  0.6351 0.2702 0.0948 . |  0.5519 0.3570 0.0911      2.2 Examples

In this section, we calculate the HGR correlation for a few simple examples. CHAPTER 2. HIRSCHFELD-GEBELEIN-RENYI MAXIMAL CORRELATION 7

2.2.1 Doubly symmetric binary source

Let X be a Bern(1/2) r.v. and Y be the output of a binary symmetric channel with cross probability p and input X. Since X and Y are binary, the HGR correlation can be easily computed as [9]

ρ(X; Y )=1 2 min p, 1 p . − { − }

When p = 0, i.e., X and Y are independent, ρ = 0; when p = 1, X and Y are ±

1

0.9 X Y 0.8 1 p 0.7 − ) 0 0 0.6 Y p ; 0.5

X 0.4 (

ρ 0.3 p 0.2 0.1

0 1 1 0 0.2 0.4 0.6 0.8 1 1 p p − Figure 2.3: X Bern(1/2) Figure 2.4: ρ(X; Y )=1 2 min p, 1 p ∼ − { − } essentially identical and ρ achieves its maximum value 1. These values agrees with one’s intuition about the commonness measure on X and Y .

2.2.2 Z-Channel with Bern(1/2) input

Let X be a Bern(1/2). And let us consider the Z-channel 2.5 of with probability p, that relates X and Y . The HGR correlation can be computed as

1 p ρ(X; Y )= − 1+ p r Note that ρ2(X; Y )= 1+ 2 , which is a convex function in p. − 1+p CHAPTER 2. HIRSCHFELD-GEBELEIN-RENYI MAXIMAL CORRELATION 8

X 1 1 Y 0.9 0.8

0 0 0.7 )

0.6 Y

; 0.5

X 0.4 ( 0.3 p ρ 0.2

0.1

1 1 0 1 p 0 0.2 0.4 0.6 0.8 1 − p Figure 2.5: Z-channel 1 p Figure 2.6: ρ(X; Y )= 1+−p q 2.2.3 Erasure Channel

Let (X,Y ) be two random variables with some general distributions p(x, y) and Y be an erased version of Y with erasure probability q, shown in Fig. 2.7. e

X p(x, y) Y Y

e e

Figure 2.7: Erasure Channel

It is shown that an erasure erases a portion q of the information between X and Y , i.e., I(X; Y ) = (1 q)I(X; Y ) [4]. Interestingly, for HGR correlation (HGR − correlation squared to be precise), a similar property holds as proved in the following e lemma:

Lemma 2. ρ2(X; Y )=(1 q)ρ2(X; Y ) − Proof. If either X or Y is degenerate,e the proof is trivial. Thus, we only consider the case where neither X nor Y is degenerate. Define Θ = 1 Y =e . Note that Θ is { e } independent of X. For any f and g such that Ef(X) = 0, Ef 2(X) = 1, Eg(Y )=0

e CHAPTER 2. HIRSCHFELD-GEBELEIN-RENYI MAXIMAL CORRELATION 9

and Eg2(Y ) = 1, we have

Ege (X)f(Y ) = E E[g(X)f(Y ) Θ] Θ e | = qE[g(X)f(e) Θ=1]+(1 q)E[g(X)f(Y ) Θ = 0] |e − | (a) = qf(e)Eg(X)+(1 q)E[g(X)f(Y )] − (b) = (1 q)E[g(X)f(Y )] − = (1 q)E[g(X)]E[f(Y )]+(1 q)E (g(X) E[g(X)])(f(Y ) E[f(Y )]) − − − − (c) n o = (1 q)E[(g(X) E[g(X)])(f(Y ) E[f(Y )])] − − − (d) (1 q) Var(g(X)) Var(f(Y ))ρ(X; Y ) ≤ − (e) 1 (1 q)p1 ρ(X; Y ) ≤ − 1 q r − = ρ(X; Y ) 1 q, − p where (a) is due to the independence between Θ and X; (b) and (c) is due to the fact Ef(X) = 0; (d) comes from the definition of HGR correlation between X and Y ; (e) is because

1 = Eg2(Y ) = E[g2(Y ) Θ] e | = qEg2(e)+(1 q)Eg2(Y ) e − (1 q)Eg2(Y ) ≥ − (1 q) Var(g(Y )); ≥ −

1 √1 q g∗(y), if y , Equality can be achieved by setting f = f ∗ and g(y)= − ∈Y ( 0, y = e. where f and g are the functions achieving the HGR correlation betweene eX and Y . ∗ ∗ e Thus ρ2(X; Y )=(1 q)ρ2(X; Y ). e − There aree some other interesting non-trivial examples in the literature: CHAPTER 2. HIRSCHFELD-GEBELEIN-RENYI MAXIMAL CORRELATION10

Gaussian: [16] If (X,Y ) joint Gaussian distribution with correlation r, then • ∼

ρ(X; Y )= r . | |

Partial sums: [7] Let Yi, i = 1, ..., n be i.i.d. r.v.’s with finite . Let • k Sk = i=1 Yi, k =1, .., n be the partial sums. Then we have P ρ(Sk,Sn)= k/n. p Chapter 3

Common randomness generation

3.1 Common randomness and efficiency

If the HGR correlation between X and Y is close to 1, intuitively there is a lot in common between the two. The obstacle of generating common randomness in Fig. 1.1 is the lack of communication between the two nodes. It turns out that a communication link can be of great help facilitating common randomness generation. This setting was first considered by Ahlswede and Csisz´ar [2]. The system has two nodes. Node 1 observes Xn and Node 2 observes Y n, where (Xn,Y n) i.i.d. p(x, y). ∼ A (n, 2nR) scheme, shown in Fig. 3.1, consists of

a message encoding function f : n [1 : 2nR], M = f (Xn), • m X 7→ m a CR encoding function at Node 1 f : K = f (Xn), • 1 1 n a CR encoding function at Node 2 f : K′ = f (M,Y ). • 2 2 Definition 2. A common randomness-rate pair (C, R) is said to be achievable if there exists a sequence of schemes at rate R such that

P(K = K′) 1 as n , • → →∞ 1 lim infn H(K) C, • →∞ n →

11 CHAPTER 3. COMMON RANDOMNESS GENERATION 12

1 1 lim supn n H(K K′) 0 . • →∞ | → In words, the entropy of K measures the amount of common randomness those two nodes can generate.

Xn Y n

M [1 : 2nR] Node 1 ∈ Node 2

K K′

Figure 3.1: Common Randomness Capacity: (Xi,Yi) are i.i.d.. Node 1 generates a r.v. K based on the Xn sequence it observes. It also generates a message M and transmits the message to Node 2 under rate constraint R. Node 2 generates a r.v. K′ n based on the Y sequence it observes and M. We require that P(K = K′) approaches 1 as n goes to infinity. The entropy of K measures the amount of common randomness those two nodes can generate. What is the maximum entropy of K?

Definition 3. The supremum of all the common randomness achievable at rate R is defined as the common randomness capacity at rate R. That is

C(R) = sup C :(C, R) is achievable { }

Theorem 1. [2] The common randomness capacity at rate R is

C(R) = max I(X; U) (3.1) p(u x):R I(X;U) I(Y ;U) | ≥ − If private randomness generation is allowed at Node 1, then

maxp(u x):I(U:X) I(U;Y ) R I(U; X), R H(X Y ); C(R)= | − ≤ ≤ | (3.2) ( R + I(X; Y ), R>H(X Y ). | 1This is a technical condition to constrain the cardinality of K. Mathematically, the condition guarantees that the converse proof works out CHAPTER 3. COMMON RANDOMNESS GENERATION 13

The C(R) curve is thus a straight line with slope 1 for R>H(X Y ). Although we | focus on 0 R H(X Y ) in this thesis, in most of the figures we plot the straight ≤ ≤ | line part for completeness. We note here that computing C(R) is highly related with the information bottle neck method developed in [25]. The usage of common randomness in generating coordinated actions is discussed in detail in [6].

3.1.1 Common randomness and common information

Let us clarify the relation between common randomness and common information.

Definition 4. [10] The maximum common r.v. V between X and Y satisfies:

There exists functions g and f such that V = g(X)= f(Y ). •

For any V ′ such that V ′ = g′(X) = f ′(Y ) for some deterministic functions f ′ • and g′, V ′ is a function of V .

Definition 5. [10] The common information between X and Y is defined as H(V ) where V is the maximum common r.v. between X and Y .

It turns out that common information is equal to common randomness at rate 0, i.e., H(V )= C(0) [5].

Lemma 3. It is without loss of optimality to assume that V is a function of U when optimizing maxp(u x):R I(X;U) I(Y ;U) I(X; U). | ≥ − Proof. Because for any U such that I(X; U) I(Y ; U) R and U X Y hold, we − ≤ − − can construct a new auxiliary r.v. U ′ = [U, V ]. Note that

Markov chain U ′ X Y holds. • − − CHAPTER 3. COMMON RANDOMNESS GENERATION 14

The rate constraint: •

I(X; U ′) I(Y ; U ′) − = I(X; U, V ) I(Y ; U, V ) − = I(X; U V ) I(Y ; U V ) | − | = I(X; U, V ) I(X; V ) I(Y ; U, V )+ I(Y ; V ) − − = I(X; U) I(Y ; U) − R ≤

The common randomness generated: I(X; U ′) I(X; U). • ≥

Thus using U ′ as the new auxiliary r.v. preserves the rate and does not decrease the common randomness.

3.1.2 Continuity at R =0

The C(R) curve is concave for R 0 thus continuous for R > 0. The following ≥ theorem establishes the continuity at R = 0.

Theorem 2. The common randomness capacity as a function of the communication rate R is continuous at R =0, i.e., limR 0 C(R)= C(0). ↓ Proof. See Appendix.

The value of C(0) is equal to the common information defined in [10]. We note here that C(0) > 0 if and only if ρ(X; Y ) = 1.

3.1.3 Initial Efficiency (R 0) ↓ If the commonness between X and Y is large, then it is natural to expect that the first few bits of communication should be able to unlock a huge amount of common randomness. It is indeed the case as shown in the following theorem. Furthermore, the HGR correlation ρ plays the key role in the characterization of the initial efficiency. CHAPTER 3. COMMON RANDOMNESS GENERATION 15

Theorem 3. The initial efficiency of common randomness generation is characterized as

C(R) 1 lim = . R 0 R 1 ρ2(X; Y ) ↓ − In words, the initial efficiency is the initial number of bits of common randomness unlocked by the first few bits of communications. Comments:

Since ρ(X; Y )= ρ(Y ; X), the slope is symmetric in X and Y . Thus if we reverse • the direction of the communication link, i.e., the message is sent from Node 2 to Node 1 in Fig. 3.1, the initial efficiency remains the same.

The initial efficiency increases with the HGR correlation ρ between X and Y . • Without communication, as long as ρ< 1, the common randomness capacity is 0. But with communication, the first few bits can “unlock” a huge amount of common randomness if ρ(X; Y ) is close to 1.

Proof. If ρ(X; Y ) = 1, then C(0) > 0 which yields the + slope. For the case ∞ ρ(X; Y ) < 1, we have

C(R) (a) I(X; U) lim = sup (3.3) R 0 R p(u x) I(X; U) I(Y ; U) ↓ | − (b) 1 = I(Y ;U) 1 supp(u x) I(X;U) − | (c) 1 = 1 ρ2(X; Y ) − 1 where (a) comes from the fact C(R) is a concave function; (b) is because function 1 x − is monotonically increasing for x [0, 1); (c) comes from the following lemma [9]. ∈ I(Y ;U) 2 Lemma 4. [9] supp(u x) I(X;U) = ρ (X,Y ) | CHAPTER 3. COMMON RANDOMNESS GENERATION 16

3.1.4 Efficiency at R H(X Y ) (saturation efficiency) ↑ | At R = H(X Y ), C(R) reaches its maximum value 2 H(X). That is the point where | Xn is losslessly known at Node 2. In other words, nature’s resource is exhausted by the system. It is of interest to check the slope of C(R) when R goes up to H(X Y ). | A natural guess is 1, since one pure random bit (which is independent of nature’s (Xn,Y n)) sent over the communication link can yield 1 bit in common between the two nodes. As shown in the erasure example in the next section, this guess is not correct in general. Here, we provide a sufficient condition for the saturation slope to be 1.

Theorem 4. The efficiency of common randomness generation at R = H(X Y ) is 1 | if there exist x , x such that for all y , if p(x ,y) > 0, then p(x ,y) > 03. 1 2 ∈ X ∈Y 1 2 Proof. We have

C(H(X Y )) C(R) lim | − (3.4) R H(X Y ) H(X Y ) R ↑ | | − (a) C(H(X Y )) I(X; U) = inf | − p(u x) H(X Y ) (I(X; U) I(U; Y )) | | − − (b) H(X) I(X; U) = inf − p(u x) H(X Y ) (I(X; U) I(U; Y )) | | − − H(X U) = inf | p(u x) H(X U) (H(Y U) H(Y X)) | | − | − | 1 = inf (H(Y U) H(Y X)) p(u x) | − | | 1 H(X U) − | (c) 1 = H(Y U) H(Y X) | − | 1 infp(u x) H(X U) − | | where (a) comes from the concavity of C(R); (b) is because C(H(X Y )) = H(X); | 1 And (c) is because of the monotonicity of function 1 x for x [0, 1). − ∈ 2If private randomness is allowed, C(R) is a straight line with slope 1 for R>H(X Y ) [2]. The result in this section thus give a sufficient condition for the slope at R = H(X Y ) to be| continuous. 3If two input letters are of the same conditional distribution p(y x), then| we view them as one letter. Also, the letters with zero probability are discarded. | CHAPTER 3. COMMON RANDOMNESS GENERATION 17

H(Y U) H(Y X) | − | The next step is to show infp(u x) H(X U) = 0 under the condition given | | in the theorem. First note that H(Y U) H(Y X) 0 because of U X Y . H(Y U) H(Y X) | − | ≥ − − | − | Thus infp(u x) H(X U) 0. Without loss of generality, we can assume that | | ≥ = 1, 2, ..., M , = 1, 2, ..., N and that P(x =1,y) > 0 implies P(x =2,y) > 0. X { } Y { } Choose a sequence of positive numbers ǫn converging to 0. Construct a sequence of U ’s with cardinality 1, ..., M such that n { } P (1) = P (1) ǫn P (2), Un X 1 ǫn X • − − P (2) = 1 P (2), Un 1 ǫn X • − P (u)= P (u), u =3, ..., M, • Un X which is illustrated in Fig. 3.2. Note that these are valid distributions because we preserve the marginal distribution of X.

Un X Y 1 ǫn PX (1) PX (2) 1 1 1 − 1 ǫn − ǫn 1 ǫ 1 P (2) − n 2 2 1 ǫn X 2 − 1 3 PX (3) 3 3 ...... 1 PX (M) M M N

Figure 3.2: The probability structure of Un.

As n goes to infinity, it can be shown that

The denominator H(X U ) behaves as ǫ log ǫ , i.e. H(X U ) = Θ(ǫ log ǫ ); • | n n n | n ∼ n n The numerator H(Y U) H(Y X) behaves linearly, i.e., H(Y U) H(Y X) = • | − | | − | ∼ Θ(ǫn).

H(Y Un) H(Y X) Thus lim | − | = 0, which completes the proof. n H(X Un) →∞ | For convenience, we introduce saturation efficiency in the following way:

Definition 6. The slope of C(R) when R approaches Rm from below is defined as the saturation efficiency, where Rm is the threshold such that C(Rm)= H(X). CHAPTER 3. COMMON RANDOMNESS GENERATION 18

3.2 Examples

3.2.1 DBSC(p) example

Let X be a Bernoulli (1/2) random variable and let Y be the output of a BSC channel with cross probability p< 1/2, and with X as the input, shown in Fig. 3.3.

X Y 1 p 0 − 0 p

p 1 1 1 p −

Figure 3.3: DBSC example: X Bern(1/2), pY X (x x)=(1 p), pY X (1 x x)= p. ∼ | | − | − |

H(X U) = H (α) for some α [0, 1/2]. Mrs Gerber’s Lemma [29]provides the | 2 ∈ following lower bound on H(Y U): |

1 H(Y U) H (H− (H(X U)) p), (3.5) | ≥ 2 2 | ∗ = H (α p) 2 ∗ where (α p)= α(1 p)+(1 α)p. Thus ∗ − −

I(X; U) = H(X) H(X U)=1 H (α), − | − 2 I(X; U) I(Y ; U) = H(X) H(X U) H(Y )+ H(Y U) − − | − | = H(Y U) H(X U) | − | H (α p) H (α). ≥ 2 ∗ − 2

x, w.p. 1 α; Equality can be achieved by setting p(u x)= − , | ( 1 x, w.p. α − as shown in Fig. 3.2.1. CHAPTER 3. COMMON RANDOMNESS GENERATION 19

U X 1 α − 0 0 α

α

1 1 1 α −

We can write C(R) in parametric form:

C = 1 H (α) (3.6) − 2 R = H (α p) H (α), (3.7) 2 ∗ − 2

for α [0, 1/2]. Fig. 3.4 shows C(R) for p =0.08. ∈

1.5

1 C

0.5

0 0 0.2 0.4 0.6 0.8 1 R

Figure 3.4: C(R) for p =0.08.

The initial efficiency:

C(R) lim (3.8) R 0 ↓ R 1 H (α) = lim − 2 α 1/2 H (α p) H (α) ↑ 2 ∗ − 2 CHAPTER 3. COMMON RANDOMNESS GENERATION 20

1 α log − = lim − 2 α 1 (1 2p)α p 1 α α 1/2 − − − ↑ (1 2p) log2 (1 2p)α+p log2 −α − − − 1 1 1 α + α = lim − (1 2p) 1 2p 1 1 α 1/2 − − ↑ (1 2p) 1 (1 2p)α p (1 2−p)α+p +( 1 α + α ) − − − − − − − 1 =  1 (1 2p)2 − − Note that the HGR correlation between X and Y is (1 2p)2. − The saturation efficiency, C′(R−) as R approaches H(X Y ): | C(H(X Y )) C(R) lim | − R H(X Y ) H(X Y ) R ↑ | | − 1 (1 H (α)) = lim − − 2 α 0 H (p) (H (α p) H (α)) ↓ 2 − 2 ∗ − 2 H (α)) = lim 2 α 0 H (p) H (α p)+ H (α) ↓ 2 2 2 − ∗ 1 α log − = lim 2 α 1 (1 2p)α p 1 α α 0 − − − ↓ (1 2p) log2 (1 2p)α+p + log2 −α − − − log α = lim 2 α 0 log α ↓ 2 = 1

3.2.2 Gaussian example

Although we mainly consider discrete random variables with finite alphabet, the results can be extended to continuous random variables as well. In this section, we consider a Gaussian example. Let Y = X + Z, where X (0, 1), Z (0, N), ∼ N ∼ N and X and Z are independent, illustrated in Fig. 3.5. Let h(X U)= 1 log (2πeα) for some 0 <α 1. The entropy power inequality [19] | 2 2 ≤ CHAPTER 3. COMMON RANDOMNESS GENERATION 21

Z (0, N) ∼N

X (0, 1) Y ∼N L Figure 3.5: Gaussian Example

gives the following lower bound on h(Y U): |

1 2h(X U) 2h(Z U) h(Y U) log 2 | +2 | (3.9) | ≥ 2 2 1 2 1 log (2πeα) 2 1 log (2πeN) = log 2 2 2 +2 2 2 2 2 1   = log (2πe(α + N)) 2 2

Equality can be achieved by X = U +U ′ where U U ′, U (0, 1 α), U ′ (0,α), ⊥ ∼N − ∼N shown in Fig. 3.6

V (0,α) ∼N

U (0, 1 α) X ∼N − L Figure 3.6: Auxiliary r.v. U in Gaussian example.

We write C(R) in a parametric form:

1 C = log α (3.10) −2 2 1 α + N R = log (3.11) 2 (1 + N)α for α (0, 1]. Fig. 3.7 shows the case N =0.5. ∈ The initial efficiency is calculated in the following way:

C(R) 1 log α lim = lim − 2 2 (3.12) R 0 α 1 1 α+N ↓ R ↑ 2 log (1+N)α CHAPTER 3. COMMON RANDOMNESS GENERATION 22

2

1.8

1.6

1.4

1.2 C 1

0.8

0.6

0.4

0.2

0.2 0.4 0.6 0.8 1 1.2 1.4 R

Figure 3.7: Gaussian example: C(R) for N =0.5

1 = lim − 2α α 1 1 1 ↑ 2(α+N) − 2α 1 = 1+ N

Note the ordinary correlation between X and Y is 1/√1+ N. For a pair of joint Gaussian random variables the HGR correlation is equal to the ordinary correlation [16]. One can use Theorem 3 to obtain the same expression. Asymptotic saturation efficiency

dC(R) dC(α) lim = lim dα R α 0 dR ↑∞ dR ↓ dα 1 log α = lim − 2 2 α 0 1 α+N ↓ 2 log (1+N)α 1 = lim − 2α α 0 1 1 ↓ 2(α+N) − 2α = 1

For continuous r.v.’s, nature’s randomness is not exhausted at any finite R. It is always more efficient to generate common randomness from nature’s resources than from communicating private randomness generated locally. CHAPTER 3. COMMON RANDOMNESS GENERATION 23

3.2.3 Erasure example

X, w.p. 1 q Let Y be an randomly erased version of X, i.e., Y = − , shown in ( e, w.p. q. Fig. 3.8.

X Y

e

Figure 3.8: Erasure example

For any U such that U X Y holds, I(Y ; U) = (1 q)I(X; U), I(X; U) − − − − I(X; Y ) = qI(X; U). Thus C(R) = R for 0 R H(X Y ), where H(X Y ) = q ≤ ≤ | | q log , shown in Fig. 3.9. 2 |X| C

H(X)

R pH(X)

Figure 3.9: Erasure example: C R curve −

1 The initial efficiency is therefore q . Since ρ(X; X) = 1 and Y is an erased version 1 1 of X, we have ρ(X; Y )= √1 q. Note that q = 1 ρ2(X;Y ) . − − C(R) 1 The saturation efficiency is limR H(X Y ) = , which is not equal to 1. ↑ | R p CHAPTER 3. COMMON RANDOMNESS GENERATION 24

3.3 Extensions

3.3.1 CR per unit cost

The communication link between Node 1 and Node 2 in Fig. 3.1 is a bit pipe, which is essentially a noisyless channel. It turns out that the common randomness capacity remains unchanged when we replace the bit pipe with a noisy channel with the same capacity [2]. More interestingly, one may consider the case where the channel inputs are subject to some cost constraints β. The initial efficiency of channel capacity as C a function of β is solved in the seminal paper [26]. The initial efficiency of the overall system, illustrated in Fig. 3.10, is thus the product of the initial efficiency of common randomness generation and the capacity per unit cost of the channel.

Xn Y n

Node 1 (β) Node 2 C

K K′

Figure 3.10: Common randomness per unit cost.

Corollary 1. The initial efficiency of Fig. 3.10 (common randomness per unit cost) is equal to C(β) 1 (β) lim = lim C β 0 β 1 ρ2(X; Y ) · β 0 β ↓ − ↓ (β) We refer to [26] the calculation of limβ 0 C . ↓ β

3.3.2 Secret key generation

Common randomness generation is closely related to secret key generation [1]. Sup- pose there is an eavesdropper listening to the communication link (Fig. 3.11). We CHAPTER 3. COMMON RANDOMNESS GENERATION 25

would like the common randomness generated by Node 1 and Node 2 to be kept away from the eavesdropper. One commonly used secrecy constraint is that

1 lim sup I(M; K)=0, n n →∞ where M [1 : 2nR] is the message Node 1 sends to Node 2. ∈ Xn Y n

M [1 : 2nR] Node 1 ∈ Node 2 K′

K 1 Eavesdropper I(M; K) ǫ n ≤

Figure 3.11: Secret Key Generation

The secret key capacity is shown to be [1]: C(R) = maxR I(X;U) I(Y ;U) I(Y ; U). ≥ − We can calculate the initial efficiency of the secret key capacity in the following way:

C(R) I(Y ; U) lim = sup R 0 R I(X; U) I(Y ; U) ↓ − I(X; U) = sup 1 I(X; U) I(Y ; U) − − 1 = 1, 1 ρ2(X; Y ) − − which is the initial efficiency without the secrecy constraint minus one. It makes sense, because the eavesdropper observers every bit Node 1 communicates to Node 2. CHAPTER 3. COMMON RANDOMNESS GENERATION 26

3.3.3 Non-degenerate V

If the maximum common r.v. V is not a constant, the slope of C(R) as R 0 (It C(R) ↓ differs from limR 0 ) can be calculated in the following way: ↓ R C(R) C(0) I(Y ; U) H(V ) lim − = sup − R 0 R I(X; U) I(Y ; U) ↓ − (a) I(Y ; U, V ) H(V ) = sup − I(X; U, V ) I(Y ; U, V ) − I(Y ; U V ) = sup | I(X; U V ) I(Y ; U V ) | − | 1 = I(X;U V ) | 1 sup I(Y ;U V ) − | 1 = 1 max ρ2(X; Y V = v) − v | where (a) is due to Lemma 3.

3.3.4 Broadcast setting

The common randomness generation setup can be generalized to multiple nodes. A broadcast setting was considered in [2], shown in Fig. 3.12. The goal is for all three nodes to generate a random variable K in common. The common randomness

n Y1 Xn

Node 2 K R Node 1

Node 3 K

K n Y2

Figure 3.12: CR broadcast setup CHAPTER 3. COMMON RANDOMNESS GENERATION 27

capacity is proved [2] to be

C(R) = max I(X; A, U) p(u x):R I(X;U) I(Yi;U) R,i=1,2 | ≥ − ≤ We provide a conjecture that deals with the initial efficiency in the broadcast setting:

Conjecture 1. The initial efficiency of the setup in Fig. 3.12 is:

C(R) 1 lim = , R 0 R 1 Ψ2(X; Y,Z) ↓ − where Ψ(X; Y,Z) is a modified HGR correlation between X and Y,Z, defined in the following way:

Ψ(X; Y,Z) = max min Eg(X)f(Y ),Eg(X)h(Z) { } where the maximization is among all functions g f and h such that Eg(X)=0, Ef(Y )= 0, Eh(Z)=0,Eg2(X) 1, Ef 2(Y ) 1, Eh2(Z) 1. ≤ ≤ ≤ Proof. Achievability: Similar to the HGR correlation, there is an alternative characterization of Ψ(X; Y,Z):

Ψ2(X; Y,Z) = max min E(E[g(X) Y ])2, E(E[g(X) Z])2 { | | } where the maximization is among function g such that Eg(X)=0,Eg2(X) 1. ≤ Applying the maximizer g∗( ) in the achievability scheme in [9], one can show that · 1 the initial efficiency 1 Ψ2(X;Y,Z) is achievable. − Chapter 4

Common randomness generation with actions

4.1 Common randomness with action

Recently, in the line of work by Weissman, et al. [27], action was introduced as a feature that one node can explore to boost the performance of lossy compression. We adopt their setting but consider common randomness generation. The setup is shown in Fig. 4.1. Comparing with the no action case, the key difference is that after receiving the message M, Node 2 first generates an action sequence An based on M, i.e., An = f (M). It then gets the side information Y n according to p(y x, a), i.e., a | Y n (An,Xn) n p(y a , x ). One scenario where this setting applies is that Node | ∼ i=1 i| i i 2 requests sideQ information from some center through actions. The ith action determines the type of the side information correlated with Xi that the data center n sends back to Node 2. Node 2 then generates K′ based both on the Y sequence n nR n it observes and the message M(X ) [1 : 2 ], K′ = f (Y , M). The common ∈ 2 randomness capacity at rate R is defined in the same way as in the no action case.

28 CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 29

Xn An Y n

M [1 : 2nR] Node 1 ∈ Node 2

K K′

Figure 4.1: Common Randomness Capacity: Xi i=1,... is an i.i.d. source. Node 1 generates a r.v. K based on the Xn sequence{ } it observes. It also generates a message M and transmits the message to Node 2 under rate constraint R. Node 2 first generates an action sequence An as a function of M and receives a sequence of side information Y n, where Y n (An,Xn) p(y a, x). Then Node 2 generates a r.v. n | ∼ | K′ based on both M and Y sequence it observes and M. We require P(K = K′) to be close to 1. The entropy of K measures the amount of common randomness those two nodes can generate. What is the maximum entropy of K?

Theorem 5. The common randomness action capacity at rate R is

C(R) = max I(X; A, U)

where the joint distribution is of the form p(a, u x)p(x)p(y a, x), and the maximization | | is among all p(a, u x) such that |

I(X; A)+ I(X; U A) I(Y ; U A) R. | − | ≤

Cardinality of U can be bounded by +1. |U| ≤ |X||A| Setting A = , we recover the no action result. ∅

Achievability proof

Codebook generation

Generate 2n(I(X;A)+ǫ) An(l ) sequences according to n p (a ), l 2n(I(X;A)+ǫ). • 1 i=1 A i 1 ∈ Q CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 30

n n(I(X;U A)+ǫ) n For each A (l ) sequence, generate 2 | U (l ,l ) sequences according • 1 1 2 n n(I(X;U A)+ǫ) to pU A(ui ai), l2 [1 : 2 | ]. i=1 | | ∈ Q n n(I(X;U A,Y )+2ǫ) For each A (l ) sequence, partition the set of l indices into 2 | • 1 2 equal sized bins, (l ). B 3

Encoding

For simplicity, we will assume that the encoder is allowed to randomize, but the can be readily absorbed into the codebook generation stage, and hence, does not use up the encoder’s private randomization.

Given xn, the encoder selects the index L [1 : 2n(I(X;A)+ǫ)] of the an(L ) • A ∈ A n n (n) sequence such that (x , a (L )) ǫ . If there is none, it selects an index A ∈ T uniformly at random from [1 : 2n(I(X;A)+ǫ)]. If there is more than one such index, it selects an index uniformly at random from the set if indices such that n n (n) (x , a (l)) ǫ . ∈ T Given xn and the selected an(L ), the encoder then selects an index L [1 : • A U ∈ n(I(X;U A)+ǫ) n n n (n) 2 | ] such that (x , a (L ),u (L , L )) ǫ . A A U ∈ T n(I(X;U A,Y )+2ǫ) The encoder sends out L and L [1 : 2 | ] such that L (L ). • A B ∈ U ∈ B B

Decoding

n The decoder first takes actions based on the transmitted A (LA) sequence. Therefore, Y n is generated based on Y n n p(y x , a (L )). Given an and side information ∼ i=1 i| i i A n ˆ y , the decoder then tries to decodeQ the LU index. That is, it looks for the unique LU n n n (n) index in bin (L ) such that (y , a (L ),u (L , Lˆ )) ǫ . Finally, the decoder B B A A U ∈ T declares LA, LˆU as the common indices.

Analysis of probability of error

The analysis of probability of error follows standard analysis. An error occurs if any of the following two events occur. CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 31

n n n n (n) 1. (a (L ),u (L , L ),X ,Y ) / ǫ . A A U ∈ T n n n 2. There exists more than one LˆU (LB) such that (Y , a (LA),u (LA, LˆU )) (n) ∈ B ∈ ǫ . T The probability of the first error goes to zero as n goes to infinity since we generated enough sequences to cover Xn in the codebook generation stage. The fact that the probability of error for the second error event goes to zero as n follows from → ∞ standard Wyner-Ziv analysis.

Analysis of common randomness rate

We analyze the common randomness rate averaged over codebooks.

H(L , L )= H(L , L ,Xn ) H(Xn , L , L ) A U |C A U |C − |C A U H(Xn ) H(Xn , L , L , An(L ), U n(L , L )) ≥ |C − |C A U A A U nH(X) H(Xn U n, An). (4.1) ≥ − |

The second step follows from the fact that Xn is independent of the codebook and the third step follows conditioning reduces entropy. We now proceed to upper bound n n n n n n (n) H(X U , A ). Define E := 1 if (X , U , A ) / ǫ and 0 otherwise. | ∈ T

H(Xn U n, An) H(Xn, E U n, An) | ≤ | = H(E)+ H(Xn E, U n, An) | 1+P(E = 0)H(Xn E =0, U n, An)+P(E = 1)H(Xn E =1, U n, An) ≤ | | (a) 1+ n(H(X U, A)+ δ(ǫ)) + nP(E = 1) log ≤ | |X| = n(H(X U, A)+ δ′(ǫ)). (4.2) |

n n n (n) (a) follows from the fact that when E =0, (U , A ,X ) ǫ . Hence, there are at ∈ T n(H(X U,A)+δ(ǫ)) n most 2 | possible X sequences. The last step follows from P(E = 1) 0 → as n , which in turn follows from the encoding scheme. Combining (4.1) with → ∞ CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 32

(4.2) then gives the desired lower bound on the achievable common randomness rate.

1 H(L , L ) H(X) H(X U, A) δ′(ǫ) n A U |C ≥ − | −

= I(X; U, A) δ′(ǫ). −

Converse: See Appendix C.1.

4.2 Example

By correlating the action sequence An with Xn and communicating the action se- quence with Node 2, we incur a communication rate cost I(X; A). That only gen- erates I(X; A) in the rate of CR generation. Using 1 bit of communication to get 1 bit common randomness is of course sub-optimal, but the benefit comes in the second stage where conditioned on the An sequence, U n is sent to Node 2. The com- munication rate required is I(X; U A) I(Y ; U A) and the rate of CR generated is | − | I(X; U A). | One greedy scheme is to simply fix the action and just repeat it (so there is no need to communicate An). We use the following example to show explicitly that in general this kind of scheme is suboptimal. Let X be a r.v. uniformly distributed over the set 1, 2, 3, 4 . There are two { } actions A = 1 and A = 2. The probability structure conditioned on each sequence is shown in 4.2.

Lemma 5. For the setup in Fig. 4.2,

setting A X: the optimal achievable (C, R) pair is given as • ⊥ 3 R C(R)= + , R [0,p/2] 2 p ∈ CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 33

A =1 A =2

X Y X Y

1 1 1 1 1 p −

2 2 2 2 p e e p

3 3 3 3

1 p 4 − 4 4 4

Figure 4.2: CR with Action example

correlating A with X as shown in Fig. 4.3, the following (C, R) pair is achiev- • able:

C(α) = 2 α − R(α) = 1 H (α), α [0, 1/2]. − 2 ∈

Proof. See Appendix

X A 1 α 1, 2 − 0 α

α 3, 4 1 1 α − Figure 4.3: Correlate A with X

It can be shown that the (C, R) pair achieved by setting A X cannot be the ⊥ CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 34

optimal one for all R in general. We illustrate this by a numerical example p = 0.6 with the results plotted in Fig. 4.4.

1.9 Option one Option two 1.85

1.8

1.75

1.7 C

1.65

1.6

1.55

1.5 0 0.05 0.1 0.15 0.2 R

Figure 4.4: CR with action example: option one: set A X; option two: correlate A with X. ⊥

4.3 Efficiency

4.3.1 Initial Efficiency

For simplicity, we assume that ρ(PX PY X,A=a) < 1, a . ⊗ | ∀ ∈A C(R) lim (4.3) R 0 R ↓ I(X; A, U) = sup p(a,u x) I(X; A)+ I(X; U A) I(Y ; U A) | | − | 1 = I(Y ;U A) | 1 supp(a,u x) I(X;A)+I(X;U A) − | | 1 = 2 1 maxa ρ (X,Y A = a) − ∈A |

where ρ(X,Y A = a) = ρ(PX PY X,A=a) and the last step is proved in Ap- | ⊗ | pendix. C.2. CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 35

4.3.2 Saturation efficiency

Similar to the no action case, when the communication rate reaches the threshold such that Xn can be losslessly reconstructed at Node 2, nature’s randomness Xn is ex- hausted by the system. Thus the maximum CR H(X) (without private randomness) is achieved. This threshold Rm can be computed as [27]:

Rm = min I(X; A)+ H(X A, Y ). p(a x) | | The following theorem consider the slope of CR generation when R R . ↑ m Theorem 6. If there exists a p(a x) such that | I(X; A)+ H(X A, Y )= R • | m For each action a, P(A = a) > 0, there exist x , x such that P(X = x A = • 1 2 ∈ X 1| a)) > 0, P(X = x A = a) > 0, if p(y, x A = a) > 0, then p(y, x A = a) > 0, 2| 1| 2| y . ∀ ∈Y then dC(R) lim =1 R Rm dR ↑ Essentially, we require the condition in the no action setting to hold for each active action when R R . ↑ m

4.4 Extensions

Theorem 5 extends to the case where there is a cost function Λ and a cost constraint Γ on the action sequence, i.e., Λ(An)= 1 n Λ(A ) Γ. n i=1 i ≤ Corollary 2. The common randomnessP capacity with rate constraint R and cost constraint Γ is

C(R, Γ) = max I(X; A, U) p(a, u x):Λ(A) Γ | ≤ R I(X; A)+ I(X; U A, Y ) ≥ | CHAPTER 4. COMMON RANDOMNESS GENERATION WITH ACTIONS 36

Proof. Simply note that the achievablity proof and converse carry over to this setting directly.

Theorem 5 also extends naturally to the case where there are multiple receivers with different side information (Fig. 4.4).

n n A Y1 Xn

Node 2 K2 R Node 1

Node 3 K3

K1 n n A Y2

Corollary 3. The common randomness capacity with rate constraint R with two receivers and side information structure (Y n,Y n) Xn, An i.i.d. p(y ,y x, a) is 1 2 | ∼ 1 2| given by

C(R) = max I(X; A, U) p(a, u x):Λ(A) Γ | ≤ R I(X; A)+ I(X; U A, Y ), i =1, 2 ≥ | i Because the action sequence of each node is a function of the same message both receives, one node knows the action sequence of the other node. Therefore we do

not lose optimality by setting Ai = (A1i, A2i), where A1i and A2i are the individual actions.

Proof. We may simply repeat the achievablity proof for each receiver, and recognize

that the auxiliary random variable UQ in the converse proof C.1 works for both receivers. Chapter 5

Compression with actions

5.1 Introduction

Consider an independent, identically distributed (i.i.d) binary sequence Sn, S ∼ Bern(1/2). From standard source coding theory [19], we need at least one bit per source symbol to describe the sequence for lossless compression. But suppose now that we are allowed to make some modifications, subject to cost constraints, to the sequence before compressing it, and we are only interested in describing the modified sequence losslessly. The problem then becomes one of choosing the modifications so that the rate required to describe the modified sequence is reduced, while staying within our cost constraints. More concretely, for the binary sequence Sn, if we are allowed to flip more than n/2 ones to zero, then the rate required to describe the modified sequence is essentially zero. But what happens when we are allowed to flip fewer than n/2 ones? As a potentially more practical example, imagine we have a number of robots working on a factory floor and the positions of all the robots need to be reported to a remote location. Letting S represent the positions of the robots, we would expect to send H(S) bits to the remote location. However, this ignores the fact that the robots can also take actions to change their positions. A local command center can first “take a picture” of the position sequence and then send out action commands to the robots based on the picture so that they move in cooperative way such that

37 CHAPTER 5. COMPRESSION WITH ACTIONS 38

the final position sequence requires fewer bits to describe. The command center may face two issues in general: cost constraints and uncertainty. A cost constraint occurs because each robot should save its power and not move too far away from its current location. The uncertainty is a result of the robots not moving exactly as instructed by the local command center. Motivated by the preceding examples, we consider the problem illustrated in Fig. 5.1 (Formal definitions will be given in the next section). Sn is our observed state sequence. We model the constraint as a general cost function Λ( , , ) and the · · · uncertainty in the final output Y by a channel p(y a, s). | n Sn Z n n nR A Y 2 n PY X,S Yˆ | Action Encoder Compressor Decoder Figure 5.1: Compression with actions. The Action encoder first observes the state n n sequence S and then generates an action sequence A . The ith output Yi is the output of a channel p(y a, s) when a = A and s = S . The compressor generates a | i i description M of 2nR bits to describe Y n. The remote decoder generates Yˆ n based on M and it’s available side information Zn as a reconstruction of Y n.

Our problem setup is closely related to the channel coding problem when the state information is available at the encoder. The case where the state information is causally available was first solved by Shannon in [20]. When the state information is non-causally known at the encoder, the channel capacity result was derived in [11] and [13]. Various interesting extensions can be found in [15, 17, 22–24]. The difference in our approach described here is that we make the output of the channel as compressible as possible. We give formal definitions for our problem are given in the next section. Our main results when the decoder requires lossless reconstruction are given in section 5.3, where we characterize the rate-cost tradeoff function for the setting in Fig. 5.1. We also characterize the rate-cost function when Sn is only causally known at the action encoder. In section 5.4, we extend the setting to the lossy case where the decoder requires a lossy version of Y n. CHAPTER 5. COMPRESSION WITH ACTIONS 39

5.2 Definitions

We give formal definitions for the setups under consideration in this section. We will follow the notation of [8]. Sources (Sn,Zn) are assumed to be i.i.d.; i.e. (Sn,Zn) n ∼ i=1 pS,Z(si, zi). Q 5.2.1 Lossless case

Referring to Figure 5.1, a (n, 2nR) code for this setup consists of

an action encoding function f : n n; • e S →A a compression function f : n M [1 : 2nR]; • c Y → ∈ a decoding function f : [1 : 2nR] n Yˆ n. • d × Z → n n n , 1 n The average cost of the system is E Λ(A ,S ,Y ) n i=1 EΛ(Ai,Si,Yi). A rate- cost tuple (R, B) is said to be achievable if there existsP a sequence of codes such that

n n n lim sup Pr(Y = fd(fc(Y ),Z ))=0, (5.1) n 6 →∞ lim sup EΛ(An,Sn,Y n) B, (5.2) n ≤ →∞

n n n n where Λ(A ,S ,Y ) = i=1 Λ(Ai,Si,Yi)/n. Given cost B, the rate-cost function, R(B), is then the infimumP of rates R such that (R, B) is achievable.

5.2.2 Lossy case

We also consider the setup where the decoder requires a lossy version of Y n. The definitions remain largely the same, with the exception that the probability of error constraint, inequality (5.1), is replaced by the following distortion constraint:

n n n 1 lim sup E d(Y , Yˆ ) = lim sup E d(Yi, Yˆi) D. (5.3) n n n ≤ i →∞ →∞ X CHAPTER 5. COMPRESSION WITH ACTIONS 40

A rate R is said to be achievable if there exists a sequence of (n, 2nR) codes satisfying both the cost constraint (inequality 5.2) and the distortion constraint (inequality 5.3). Given cost B and distortion D, the rate-cost-distortion function, R(B,D), is then the infimum of rates R such that the tuple (R,B,D) is achievable.

5.2.3 Causal observations of state sequence

In both the lossless and lossy case, we will also consider the setup when the state sequence is only causally known at the action encoder. The definitions remain the same, except for the action encoding function which is now restricted to the following form: For each i [1 : n], f : i . ∈ e,i S →A

5.3 Lossless case

In this section, we present our main results for the lossless case. Theorem 7 gives the rate-cost function when the state sequence is noncausally available at the action encoder, while Theorem 8 gives the rate-cost function when the state sequence is causally available.

5.3.1 Lossless, noncausal compression with action

Theorem 7 (Rate-cost function for lossless, noncausal case). The rate-cost function for the compression with action setup when state sequence Sn is noncausally available at the action encoder is given by

R(B) = min I(V ; S Z)+ H(Y V,Z), (5.4) p(v s),a=f(s,v):EΛ(S,A,Y ) B | | | ≤ where the joint distribution is of the form p(s,v,a,y)= p(s)p(v s)1 f(s,v)=a p(y a, s). | { } | The cardinality of the auxiliary random variable V is upper bounded by +2. |V| ≤ |S| Remarks

Replacing a = f(s, v) by a general distribution p(a s, v) does not decrease the • | minimum in (5.4). For any joint distribution p(s)p(s v)p(a s, v), we can always | | CHAPTER 5. COMPRESSION WITH ACTIONS 41

find a random variable W and a function f such that W is independent of S,V

and Y , and A = f(V,W,X). Consider V ′ = (V, W ). The Markov condition

V ′ (A, S) (Y,Z) still holds. Thus H(Y V ′,Z)+ I(V ′; S Z) is achievable. − − | | Furthermore,

I(V ′; S Z)+ H(Y V ′,Z) | | = I(V, W ; S Z)+ H(Y V,W,Z) | | I(V, W ; S Z)+ H(Y V,Z) ≤ | | = I(V ; S Z)+ H(Y V,Z). | |

R(B) is a convex function in B. • For each cost function Λ(s,a,y), we can replace it with a new cost function • involving only s and a by defining Λ′(s, a)= E[Λ(S,A,Y ) S = s, A = a]. Note | that Y is distributed as p(y s, a) given S = s, A = a. | Achievability of Theorem 7 involves an interesting observation in the decoding oper- ation, but before proving the theorem, we first state a corollary of Theorem 7, the case when side information is absent (Z = ). We will also sketch an alternative ∅ achievability proof for the corollary, which will serve as a contrast to the achievability scheme for Theorem 7.

Corollary 4 (Side information is absent). If Z = , then rate-cost function is given ∅ by

R(B) = min I(V ; S)+ H(Y V ) p(v s),a=f(s,v):EΛ(S,A,Y ) B | | ≤ for some p(s,v,a,y)= p(s)p(v s)1 f(s,v)=a p(y a, s). | { } |

Achievability for Corollary 1

Code book generation: Fix p(v s) and f(s, v) and ǫ> 0. | CHAPTER 5. COMPRESSION WITH ACTIONS 42

Generate 2n(I(S;V )+ǫ) vn(l) sequences independently, l [1 : 2n(I(V ;S)+ǫ)], each • ∈ n according to pV (vi) to cover S .

For each V n Qsequence, the Y n sequences that are jointly typical with V n are • (n(H(Y V )+ǫ) indexed by 2 | numbers.

Encoding and Decoding:

The action encoder looks for a V n in the code book that is jointly typical with • n S and generates Ai = f(Si,Vi), i =1, ..., n.

The compressor looks for a Vˆ n in the codebook that is jointly typical with the • channel output Y n and sends the index of that Vˆ n sequence to the decoder. The compressor then sends the index of Y n as described in the second part of code book generation.

• The decoder simply uses both indices from the compressor to reconstruct Y^n.

Using standard typicality arguments, we can show that the encoding succeeds with high probability and that the probability of error can be made arbitrarily small.

Remark: Note that V̂^n is not necessarily equal to V^n. That is, the V^n codeword chosen by the action encoder can be different from the V̂^n codeword chosen by the compressor. But this is not an error event, since we still recover the same Y^n even if a different V^n codeword was used.

This scheme, however, does not extend to the case when side information is available at the decoder. The term H(Y|Z, V) in Theorem 7 requires us to bin the set of Y^n sequences according to the side information available at the decoder. If we were to extend the above achievability scheme, we would bin the set of Y^n sequences into 2^{n(H(Y|Z,V)+ε)} bins. The compressor would find a V̂^n sequence that is jointly typical with Y^n, send its index to the decoder using a rate of I(V; S|Z) + ε, and then send the index of the bin which contains Y^n. The decoder would then look for the unique Y^n sequence in the bin that is jointly typical with V̂^n and Z^n. Unfortunately, while the V̂^n codeword is jointly typical with Y^n with high probability, it is not necessarily jointly typical with Z^n, since V̂^n may not be equal to V^n (V^n is jointly typical with Z^n with high probability because V^n is jointly typical with S^n with high probability and V − S − Z). One could try to overcome this problem by insisting that the compressor find the same V^n sequence as the action encoder, but this requirement imposes additional constraints on the achievable rate.

Instead of requiring the compressor to find a jointly typical V^n sequence, we use an alternative approach to prove Theorem 7. We simply bin the set of all Y^n sequences into 2^{n(I(V;S|Z)+H(Y|Z,V)+ε)} bins and send the bin index to the decoder. The decoder looks for the unique Y^n sequence in bin M such that (V^n(l), Y^n, Z^n) are jointly typical for some l ∈ [1 : 2^{n(I(V;S)+ε)}]. Note that there can be more than one V^n(l) sequence that is jointly typical with (Y^n, Z^n), but this is not an error event as long as the Y^n sequence in bin M is unique. We now give the details of this achievability scheme.

Proof of achievability for Theorem 7

Codebook generation

• Generate 2^{n(I(V;S)+δ(ε))} V^n codewords, each according to ∏_{i=1}^n p(v_i).

• For the entire set of possible Y^n sequences, bin them uniformly at random into 2^{nR} bins B(M), where R > I(V; S) − I(V; Z) + H(Y|Z, V).

Encoding

• Given s^n, the encoder looks for a v^n sequence in the codebook such that (v^n, s^n) ∈ T_ε^(n). If there is more than one, it picks one uniformly at random from the set of jointly typical candidates. If there is none, it picks an index uniformly at random from [1 : 2^{n(I(V;S)+δ(ε))}].

• It then generates a^n according to a_i = f(v_i, s_i) for i ∈ [1 : n].

• The second encoder (the compressor) takes the output sequence y^n and sends out the bin index M such that y^n ∈ B(M).

Decoding

• The decoder looks for the unique ŷ^n sequence such that (v^n(l), ŷ^n, z^n) ∈ T_ε^(n) for some l ∈ [1 : 2^{n(I(V;S)+δ(ε))}] and ŷ^n ∈ B(M). If there is none or more than one, it declares an error.
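Before analyzing the error probability, here is a toy illustration (our own, with hypothetical helper names) of the two operations just described: uniform random binning of all output sequences, and a decoder that searches the received bin for the unique sequence passing a joint-typicality test. The typicality test itself is passed in as a callable, since its exact form depends on the distributions at hand; enumerating all sequences is only feasible for very small n.

    import itertools
    import random

    def make_random_bins(n, num_bins, alphabet=(0, 1), seed=0):
        # Assign every length-n sequence over `alphabet` to a bin chosen
        # uniformly at random.
        rng = random.Random(seed)
        bin_of, bins = {}, [[] for _ in range(num_bins)]
        for y in itertools.product(alphabet, repeat=n):
            b = rng.randrange(num_bins)
            bin_of[y] = b
            bins[b].append(y)
        return bin_of, bins

    def decode(bins, m, codebook, z, is_jointly_typical):
        # Return the unique y in bin m that is jointly typical with some
        # codeword v and the side information z; otherwise declare an error.
        candidates = [y for y in bins[m]
                      if any(is_jointly_typical(v, y, z) for v in codebook)]
        return candidates[0] if len(candidates) == 1 else None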

Analysis of probability of error

Define the following error events

E_0 := {(V^n(L), Z^n, Y^n) ∉ T_ε^(n)},
E_l := {(V^n(l), Z^n, Ŷ^n) ∈ T_ε^(n) for some Ŷ^n ≠ Y^n, Ŷ^n ∈ B(M)}.

By the symmetry of the codebook generation, it suffices to consider M = 1. The probability of error is upper bounded by

P(E) ≤ P(E_0) + Σ_{l=1}^{2^{n(I(V;S)+δ(ε))}} P(E_l).

P(E_0) → 0 as n → ∞ by the standard analysis of the probability of error. It remains to analyze the second error term. Consider P(E_l) and define

E_l(v^n, z^n) := {(V^n(l), Z^n, Ŷ^n) ∈ T_ε^(n) for some Ŷ^n ≠ Y^n, Ŷ^n ∈ B(1)}.

We have

P(E_l) = P(E_l(V^n, Z^n))
= Σ_{(v^n,z^n)∈T_ε^(n)} P(V^n(l) = v^n, Z^n = z^n) P(E_l(v^n, z^n)|v^n, z^n)
= Σ_{(v^n,z^n)∈T_ε^(n)} P(V^n(l) = v^n, Z^n = z^n) ( Σ_{y^n} P(Y^n = y^n|v^n, z^n) P(E_l(v^n, z^n)|v^n, z^n, y^n) )
≤(a) Σ_{(v^n,z^n)∈T_ε^(n)} P(V^n(l) = v^n, Z^n = z^n) ( Σ_{y^n} P(Y^n = y^n|v^n, z^n) 2^{n(H(Y|Z,V)+δ(ε)−R)} )
=(b) Σ_{(v^n,z^n)∈T_ε^(n)} P(V^n(l) = v^n) P(Z^n = z^n) 2^{n(H(Y|Z,V)+δ(ε)−R)}
≤ 2^{n(H(V,Z)+δ(ε))} 2^{−n(H(V)−δ(ε))} 2^{−n(H(Z)−δ(ε))} 2^{n(H(Y|Z,V)+δ(ε)−R)}
= 2^{n(H(Y|V,Z)−I(V;Z)−R+4δ(ε))}.

(a) follows since the set of Y^n sequences is binned uniformly at random, independently of the other Y^n sequences, and from the fact that there are at most 2^{n(H(Y|Z,V)+δ(ε))} Y^n sequences that are jointly typical with a given typical (v^n, z^n). (b) follows from the fact that the codebook generation is independent of (S^n, Z^n); therefore, for any fixed l, V^n(l) is independent of Z^n. Hence, if R ≥ I(V; S) − I(V; Z) + H(Y|Z, V) + 6δ(ε),

Σ_{l=1}^{2^{n(I(V;S)+δ(ε))}} P(E_l) ≤ 2^{−nδ(ε)} → 0

as n → ∞. We now turn to the proof of the converse for Theorem 7.

Proof of converse for Theorem 7

Given a (n, 2^{nR}) code for which the probability of error goes to zero with n and which satisfies the cost constraint, define V_i = (Z^{n\i}, S_{i+1}^n, Y^{i−1}). We have

nR ≥ H(M|Z^n)
= H(M, Y^n|Z^n) − H(Y^n|M, Z^n)
≥(a) H(M, Y^n|Z^n) − nε_n
= H(Y^n|Z^n) − nε_n
= Σ_{i=1}^n H(Y_i|Y^{i−1}, Z^n) − nε_n
= Σ_{i=1}^n [H(Y_i|Y^{i−1}, S_{i+1}^n, Z^n) + I(Y_i; S_{i+1}^n|Y^{i−1}, Z^n)] − nε_n
=(b) Σ_{i=1}^n H(Y_i|Y^{i−1}, S_{i+1}^n, Z^n) + Σ_{i=1}^n I(Y^{i−1}; S_i|S_{i+1}^n, Z^n) − nε_n
=(c) Σ_{i=1}^n H(Y_i|Y^{i−1}, S_{i+1}^n, Z^n) + Σ_{i=1}^n I(Y^{i−1}, S_{i+1}^n, Z^{n\i}; S_i|Z_i) − nε_n
=(d) Σ_{i=1}^n H(Y_i|V_i, Z_i) + Σ_{i=1}^n I(V_i; S_i|Z_i) − nε_n
= nH(Y_Q|V_Q, Q, Z_Q) + nI(V_Q; S_Q|Q, Z_Q) − nε_n,

where (a) is due to Fano's inequality, (b) follows from the Csiszár sum identity, and (c) holds because (S^n, Z^n) is an i.i.d. source. Note that the Markov conditions V_i − (S_i, A_i) − Y_i and V_i − S_i − Z_i hold. Finally, we introduce Q as the time-sharing random variable, i.e., Q ∼ Unif[1 : n], and set V = (V_Q, Q), Y = Y_Q, and S = S_Q, which completes the proof.

5.3.2 Lossless, causal compression with action

Our next result gives the rate-cost function for the case of lossless, causal compression with action.

Theorem 8 (Rate-cost function for the lossless, causal case). The rate-cost function for compression with action when the state information is causally available at the action encoder is given by

R(B) = min H(Y|V, Z),      (5.5)

where the minimum is over p(v) and a = f(s, v) such that EΛ(S, A, Y) ≤ B, and the joint distribution is of the form p(s, v, a, y) = p(s)p(v)1{f(s, v) = a}p(y|a, s).

Achievability sketch: Here V simply serves as a time-sharing random variable. Fix p(v) and f(s, v). We first generate a V^n sequence and reveal it to the action encoder, the compressor, and the decoder. The action encoder generates A_i = f(S_i, V_i). The compressor simply bins the set of Y^n sequences into 2^{n(H(Y|V,Z)+ε)} bins and sends the index of the bin which contains Y^n. The decoder recovers Y^n by finding the unique Y^n sequence in bin M such that (V^n, Z^n, Y^n) are jointly typical.

Remark: Just as the achievability in the non-causal case is closely related to the channel coding strategy in [11], our achievability in this section uses the "Shannon strategy" of [20]. In both cases, the optimal channel coding strategy yields the most compressible output when the message rate goes to zero.

Proof of converse: Given a (n, 2^{nR}) code that satisfies the constraints, define V_i = (S^{i−1}, Z^{n\i}). We have

nR ≥ H(M|Z^n)
= H(M, Y^n|Z^n) − H(Y^n|M, Z^n)
≥(a) H(M, Y^n|Z^n) − nε_n
= H(Y^n|Z^n) − nε_n
= Σ_{i=1}^n H(Y_i|Y^{i−1}, Z_i, Z^{n\i}) − nε_n
≥ Σ_{i=1}^n H(Y_i|Y^{i−1}, A^{i−1}, S^{i−1}, Z_i, Z^{n\i}) − nε_n
=(b) Σ_{i=1}^n H(Y_i|A^{i−1}, S^{i−1}, Z_i, Z^{n\i}) − nε_n
=(c) Σ_{i=1}^n H(Y_i|V_i, Z_i) − nε_n
=(d) nH(Y_Q|V_Q, Q, Z_Q) − nε_n,

where (a) is due to Fano's inequality; (b) follows from the Markov chain Y_i − (S^{i−1}, A^{i−1}, Z^n) − Y^{i−1}; and (c) follows since A^{i−1} is a function of S^{i−1}. Note that A_i is now a function of S_i and V_i. Finally, we introduce Q as the time-sharing random variable, i.e., Q ∼ Unif[1 : n]. Thus, by setting V = (V_Q, Q) and Y = Y_Q, we have completed the proof.

5.3.3 Examples

In this subsection, we consider an example with state sequence S^n i.i.d. ∼ Bern(1/2) and Z = ∅. There are two actions available, A = 0 and A = 1. The cost constraint is on the expected frequency of the action A = 1, namely EA ≤ B. The channel output is Y_i = S_i ⊕ A_i ⊕ S_{N,i}, where ⊕ is the modulo-2 sum and {S_{N,i}} is an i.i.d. Bern(p) noise sequence with p < 1/2. The example is illustrated in Fig. 5.2.

Figure 5.2: Binary example with side information Z = ∅. The state sequence S^n is i.i.d. Bern(1/2) and the noise sequence S_N^n is i.i.d. Bern(p). The action encoder observes S^n and produces A^n subject to EA ≤ B; the compressor observes the channel output Y^n and sends a message M ∈ {1, ..., 2^{nR}} to the decoder, which reconstructs Y^n.
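As a quick sanity check on this model (our own illustration, not part of the thesis), the following sketch simulates Y_i = S_i ⊕ A_i ⊕ S_{N,i} under one simple, not necessarily optimal, randomized policy: with probability 2B the action cancels the state (A = S), otherwise A = 0. Its expected cost is B, and the output is a mixture of a Bern(p) and a Bern(1/2) source.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, B = 100_000, 0.1, 0.2              # block length, noise parameter, cost budget

    s = rng.integers(0, 2, n)                # state  S^n ~ Bern(1/2)
    s_n = (rng.random(n) < p).astype(int)    # noise  S_N^n ~ Bern(p)

    # Simple policy: with probability 2B set A = S (cancels the state), else A = 0.
    # Expected cost E[A] = 2B * P(S = 1) = B.
    active = rng.random(n) < 2 * B
    a = np.where(active, s, 0)

    y = s ^ a ^ s_n                          # channel output Y = S xor A xor S_N

    print("empirical cost E[A]:", a.mean())  # should be close to B
    print("empirical P(Y = 1):", y.mean())   # mixture of Bern(p) and Bern(1/2)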

We use the following lemma to simplify the optimization problem in Eq. (5.4) applied to the binary example.

Lemma 6. For the binary example, it is without loss of optimality to impose the following constraints when solving the optimization problem of Eq. (5.4):

• V = {0, 1, 2} and P(V = 0) = P(V = 1) = θ/2 for some θ ∈ [0, 1].

• The function a = f(s, v) is of the form f(s, 0) = s, f(s, 1) = 1 − s, and f(s, 2) = 0.

• P(S = 0|V = 1) = P(S = 1|V = 0) = ∆ and P(S = 0|V = 2) = 1/2.

• ∆θ ≤ B.

Note that these constraints guarantee that P(S = 0) = P(S = 1) = 1/2.

Proof. See Appendix D.1.

Using Lemma 6, we can simplify the objective function in Eq. (5.4) in the following way:

H(Y|V) + I(V; S)
= H(Y|V) − H(S|V) + H(S)
= H(S ⊕ A ⊕ S_N|V) − H(S|V) + 1
= (θ/2)(H(0 ⊕ S_N|V = 0) − H_2(∆)) + (θ/2)(H(1 ⊕ S_N|V = 1) − H_2(∆)) + (1 − θ)(H(S ⊕ S_N|V = 2) − 1) + 1
= θ(H_2(p) − H_2(∆)) + 1,

where H_2(·) is the binary entropy function, i.e., H_2(δ) = −δ log δ − (1 − δ) log(1 − δ).
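This identity is easy to verify numerically. The sketch below (ours, with arbitrary test values of θ, ∆, and p) builds the joint distribution implied by Lemma 6 and compares H(Y|V) + I(V; S) computed from it against θ(H_2(p) − H_2(∆)) + 1.

    import numpy as np

    def h2(x):
        return 0.0 if x in (0.0, 1.0) else -x*np.log2(x) - (1 - x)*np.log2(1 - x)

    theta, delta, p = 0.6, 0.2, 0.1                      # arbitrary test values

    p_v = {0: theta/2, 1: theta/2, 2: 1 - theta}
    p_s_given_v = {0: {0: 1 - delta, 1: delta},          # P(S=1|V=0) = delta
                   1: {0: delta, 1: 1 - delta},          # P(S=0|V=1) = delta
                   2: {0: 0.5, 1: 0.5}}
    f = {0: lambda s: s, 1: lambda s: 1 - s, 2: lambda s: 0}

    joint = np.zeros((3, 2, 2))                          # indices (v, s, y)
    for v in range(3):
        for s in range(2):
            a = f[v](s)
            for y in range(2):
                p_y = (1 - p) if y == (s ^ a) else p     # Y = S xor A xor S_N
                joint[v, s, y] = p_v[v] * p_s_given_v[v][s] * p_y

    def H(pvec):
        pvec = pvec[pvec > 0]
        return float(-(pvec * np.log2(pvec)).sum())

    p_vs = joint.sum(axis=2); p_vy = joint.sum(axis=1)
    p_v_marg = joint.sum(axis=(1, 2)); p_s_marg = joint.sum(axis=(0, 2))
    lhs = (H(p_vy.ravel()) - H(p_v_marg)) + (H(p_s_marg) + H(p_v_marg) - H(p_vs.ravel()))
    rhs = theta * (h2(p) - h2(delta)) + 1
    print(lhs, rhs)                                      # the two values should agree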

R(B) = min_{θ∈[2B,1], θ∆≤B} θ(H_2(p) − H_2(∆)) + 1
= 1 + min_{∆∈[B,1/2]} (B/∆)(H_2(p) − H_2(∆))
= 1 − B max_{∆∈[B,1/2]} (H_2(∆) − H_2(p))/∆
= { 1 − B(H_2(b*) − H_2(p))/b*,  if 0 ≤ B < b*;
    1 − H_2(B) + H_2(p),          if b* ≤ B ≤ 1/2,      (5.6)

where b* is the solution of the following equation:

(H_2(b) − H_2(p))/b = dH_2(b)/db,  b ∈ [0, 1/2],      (5.7)

which is illustrated in Fig. 5.3.

Figure 5.3: The threshold b* solves (H_2(b) − H_2(p))/b = dH_2(b)/db, b ∈ [0, 1/2]. (The original plot shows H_2(b) versus b with the points (p, H_2(p)) and b* marked.)
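Numerically, b* and the non-causal rate-cost function in (5.6) are easy to evaluate. The sketch below is our own and assumes SciPy is available; it uses the fact that the derivative of the binary entropy is dH_2/db = log_2((1 − b)/b).

    import numpy as np
    from scipy.optimize import brentq

    def h2(x):
        x = np.clip(x, 1e-12, 1 - 1e-12)
        return -x*np.log2(x) - (1 - x)*np.log2(1 - x)

    def b_star(p):
        # Solve (H2(b) - H2(p))/b = dH2/db = log2((1-b)/b) on (p, 1/2).
        g = lambda b: (h2(b) - h2(p)) / b - np.log2((1 - b) / b)
        return brentq(g, p + 1e-9, 0.5 - 1e-9)

    def rate_cost_noncausal(B, p):
        bs = b_star(p)
        if B < bs:
            return 1 - B * (h2(bs) - h2(p)) / bs
        return 1 - h2(min(B, 0.5)) + h2(p)

    print(b_star(0.1), rate_cost_noncausal(0.25, 0.1))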

Now let us shift our attention to the causal case of the binary example, i.e., S_i is only causally available at the action encoder.

Lemma 7. For the causal case of the binary example, it is without loss of optimality to impose the following constraints when solving the optimization problem in Eq. (5.5):

• V = {0, 1} and P(V = 0) = θ for some θ ∈ [0, 1].

• The function a = f(s, v) is of the form f(s, 0) = s, f(s, 1) = 0.

• θ/2 ≤ B.

Proof. See Appendix D.2.

Using Lemma 7, we can simplify the objective function in Eq. (5.5) in the following way:

R(B) = min_{θ∈[0,1], θ/2≤B} H(Y|V)
= min_{θ∈[0,1], θ/2≤B} [θH(Y|V = 0) + (1 − θ)H(Y|V = 1)]
= min_{θ∈[0,1], θ/2≤B} [θH(S_N|V = 0) + (1 − θ)H(S ⊕ S_N|V = 1)]
= min_{θ∈[0,1], θ/2≤B} [θH_2(p) + (1 − θ)]
= { 2B H_2(p) + (1 − 2B),  if 0 ≤ B ≤ 1/2;
    H_2(p),                 if B ≥ 1/2.

For the binary example with p = 0.1, we plot the rate-cost functions R(B) for both cases in the following figure.

Figure 5.4: Comparison between the non-causal and causal rate-cost functions R(B). The parameter of the Bernoulli noise is set at p = 0.1. (The original plot shows both curves over the cost constraint B ∈ [0, 0.5], with the level H_2(0.1) marked.)
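The comparison in Fig. 5.4 can be reproduced qualitatively with a few lines of code (our own sketch); the non-causal curve is evaluated directly from the max form in (5.6) on a grid, and the causal curve from its closed form.

    import numpy as np

    def h2(x):
        x = np.clip(x, 1e-12, 1 - 1e-12)
        return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

    def R_noncausal(B, p):
        # Eq. (5.6) in its max form: 1 - B * max_{D in [B,1/2]} (H2(D) - H2(p)) / D.
        if B <= 0:
            return 1.0
        grid = np.linspace(B, 0.5, 2001)
        return 1 - B * float(np.max((h2(grid) - h2(p)) / grid))

    def R_causal(B, p):
        # 2B*H2(p) + (1 - 2B) for B <= 1/2, and H2(p) beyond that.
        B = min(B, 0.5)
        return 2 * B * h2(p) + (1 - 2 * B)

    p = 0.1
    for B in np.linspace(0.0, 0.5, 6):
        print(f"B={B:.1f}  non-causal={R_noncausal(B, p):.3f}  causal={R_causal(B, p):.3f}")

One expects the non-causal values to lie on or below the causal ones, with the two curves meeting at B = 0 (where both equal 1) and at B = 1/2 (where both equal H_2(p)).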

5.4 Lossy compression with actions

In this section, we extend our setup to the lossy case. We give an achievable rate-cost-distortion region when S^n is available noncausally at the action encoder and characterize the rate-cost-distortion function when S^n is available causally at the encoder and Z = ∅.

Theorem 9. An upper bound on the rate-cost function for the case with non-causal

state information is given by

R(B) ≤ min I(V; S|Z) + I(Ŷ; Y|V, Z),      (5.8)

where the minimum is over distributions satisfying EΛ(S, A, Y) ≤ B and Ed(Y, Ŷ) ≤ D, and the joint distribution is of the form

p(s, v, a, y, ŷ, z) = p(s, z)p(v|s)1{f(s, v) = a}p(y|a, s)p(ŷ|y, v).

Sketch of achievability: The codebook generation and the encoding at the action encoder are largely the same as in the lossless case. We generate 2^{n(I(V;S)+ε)} V^n sequences according to ∏_{i=1}^n p_V(v_i), and for each v^n we generate 2^{n(I(Ŷ;Y|V)+ε)} ŷ^n sequences according to ∏_{i=1}^n p(ŷ_i|v_i). The set of v^n sequences is partitioned into 2^{n(I(V;S|Z)+2ε)} equal-sized bins B(m_0), and for each m_0 the set of ŷ^n sequences is partitioned into 2^{n(I(Ŷ;Y|V,Z)+2ε)} equal-sized bins B(m_0, m_1). Given a sequence s^n, the action encoder finds the v^n sequence which is jointly typical with s^n and takes actions according to A_i = f(s_i, v_i) for i ∈ [1 : n]. At the compressor, we first find a v̂^n that is jointly typical with Y^n and then a ŷ^n such that (v̂^n, ŷ^n, y^n) ∈ T_ε^(n). The compressor then sends the indices M_0, M_1 such that the selected v̂^n ∈ B(M_0) and ŷ^n ∈ B(M_0, M_1). The decoder first recovers v̂^n by looking for the unique v̂^n ∈ B(m_0) such that (v̂^n, z^n) ∈ T_ε^(n). Next, it recovers ŷ^n by looking for the unique ŷ^n ∈ B(m_0, m_1) such that (v̂^n, z^n, ŷ^n) ∈ T_ε^(n). From the rates given, it is easy to see that all encoding and decoding steps succeed with high probability as n → ∞.

We now turn to the case when s^n is causally known at the action encoder. In this case, we are able to characterize the rate-cost-distortion function when no side information is available, Z = ∅.

Theorem 10. The rate-cost-distortion function for the case with causal state information and no side information is given by

R(B) = min I(Y; Ŷ|V),      (5.9)

where the minimum is over p(v), a = f(s, v), and p(ŷ|y, v) such that EΛ(S, A, Y) ≤ B and Ed(Y, Ŷ) ≤ D, and the joint distribution is of the form p(s, v, a, y, ŷ) = p(s)p(v)1{a = f(s, v)}p(y|a, s)p(ŷ|y, v).

The achievability is straightforward, with V as a time-sharing random variable known to all parties, and follows an analysis similar to that of Theorem 8.
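As with the lossless case, the objective in Theorem 10 can be evaluated numerically for small alphabets. The sketch below (our own construction, with hypothetical argument names) computes I(Y; Ŷ|V) and the expected distortion Ed(Y, Ŷ) for one fixed choice of p(v), f(s, v), p(y|a, s), and test channel p(ŷ|y, v); minimizing over such choices subject to the cost and distortion constraints approximates R(B).

    import itertools
    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    def theorem10_objective(p_s, p_v, f, p_y_given_as, p_yh_given_yv, dist):
        # Joint p(s)p(v)1{a=f(s,v)}p(y|a,s)p(yhat|y,v); returns (I(Y;Yhat|V), E d(Y,Yhat)).
        S, V = len(p_s), len(p_v)
        A, _, Y = p_y_given_as.shape
        Yh = p_yh_given_yv.shape[2]           # p_yh_given_yv[y, v, yhat]
        p_vyyh = np.zeros((V, Y, Yh))
        exp_d = 0.0
        for s, v, y, yh in itertools.product(range(S), range(V), range(Y), range(Yh)):
            a = f[s][v]
            pr = p_s[s] * p_v[v] * p_y_given_as[a, s, y] * p_yh_given_yv[y, v, yh]
            p_vyyh[v, y, yh] += pr
            exp_d += pr * dist[y][yh]
        # I(Y; Yhat | V) = H(Y|V) + H(Yhat|V) - H(Y, Yhat|V)
        p_v_marg = p_vyyh.sum(axis=(1, 2))
        p_vy = p_vyyh.sum(axis=2); p_vyh = p_vyyh.sum(axis=1)
        h_v = entropy(p_v_marg)
        mi = (entropy(p_vy.ravel()) - h_v) + (entropy(p_vyh.ravel()) - h_v) \
             - (entropy(p_vyyh.ravel()) - h_v)
        return mi, exp_d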

Converse: Given a (n, 2^{nR}) code satisfying the cost and distortion conditions, we have

nR ≥ H(M)
≥ I(M; Y^n)
= Σ_{i=1}^n I(M; Y_i|Y^{i−1})
=(a) Σ_{i=1}^n I(M; Y_i|V_i)
≥(b) Σ_{i=1}^n I(Ŷ_i; Y_i|V_i)
=(c) nI(Ŷ_Q; Y_Q|V_Q, Q),

where in (a) we set V_i = Y^{i−1}; (b) holds because Ŷ^n is a function of M; note that V_i is independent of S_i. In (c) we introduce Q as the time-sharing random variable, i.e., Q ∼ Unif[1 : n]. Thus, by setting V = (V_Q, Q) and Y = Y_Q, we have shown that R(B, D) ≥ I(Y; Ŷ|V) with V ⊥ S. This is, however, equivalent to the expression in the theorem because of the following:

• Replacing p(ŷ|y, v) by a general distribution p(ŷ|a, y, v, s) does not decrease the minimum in (5.9), since the mutual information term I(Y; Ŷ|V) depends only on the marginal distribution p(ŷ, y, v).

• Replacing a = f(s, v) by a general distribution p(a|s, v) does not decrease the minimum in (5.9), because for any joint distribution p(s)p(v)p(a|s, v)p(y|a, s)p(ŷ|y, v), I(Ŷ; Y|V = v) is a concave function of p(y|s, v), which is a linear function of p(a|s, v).

Chapter 6

Conclusions

In this thesis, we first revisited Gács and Körner's definition of common information. It is equal to the common randomness that two remote nodes, with access to X and Y respectively, can generate without communication. The fact that this quantity is degenerate in most cases motivated us to investigate the initial efficiency of common randomness generation when the communication rate goes to zero. It turned out that the initial efficiency is equal to 1/(1 − ρ²(X; Y)), where ρ is the Hirschfeld–Gebelein–Rényi maximal correlation between X and Y. This result gave the Hirschfeld–Gebelein–Rényi maximal correlation an operational justification as a measure of commonness between two random variables. The result also indicated that communication is the key to unlocking common randomness. We then turned to the saturation efficiency as the communication exhausts nature's randomness. We provided a sufficient condition for the saturation efficiency to be 1, which implies the continuity of the slope of common randomness generation at that point. An example was given to show that the slope is not continuous in general.

In the next part of the thesis, we introduced common randomness generation with actions, in which a node can take actions to influence the random variables received from nature. A single-letter expression of the common randomness-rate function was obtained. We showed through an example that the greedy approach of fixing the "best" action is not optimal in general when the communication rate is strictly positive. But as the rate goes down to zero, the initial efficiency in the action setting was proved


to be 1/(1 − max_{a∈A} ρ²(X, Y|A = a)), i.e., the reciprocal of one minus the square of the Hirschfeld–Gebelein–Rényi maximal correlation conditioned on the best action. The saturation efficiency with actions was analyzed similarly to the no-action setting.

In the last part of the thesis, we kept the action feature but shifted our focus to source coding. The idea that one could modify a source subject to a cost constraint before compression was formulated in an information-theoretic setting. Techniques from both channel coding and source coding were combined to obtain a single-letter expression of the rate-cost function. In our achievability scheme, modification of the source sequence is essentially equivalent to setting up cloud centers for the source sequence. Compression of the modified sequence is carried out via a classic binning approach. Interestingly, this approach does not require correct decoding of the cloud center.

Appendix A

Proofs of Chapter 2

A.1 Proof of the convexity of ρ(P_X ⊗ P_{Y|X}) in P_{Y|X}

Eg(X)f(Y,Z) (A.1) = Eg(X)(f(Y,Z) µ(Z)+ µ(Z)) − = Eg(X)(f(Y,Z) µ(Z)) + Eg(X)µ(Z) − (a) = Eg(X)(f(Y,Z) µ(Z)) + Eg(X)Eµ(Z) − (b) = Eg(X)(f(Y,Z) µ(Z)), −

2 where (a) is because X Z and (b) is due to Eg(X) = 0. Define η = Ef (Y,Z) . E(f(Y,Z) µ(Z))2 ⊥ − q


Note that

E(f(Y, Z) − μ(Z))²      (A.2)
= E_Z[ E[(f(Y, Z) − μ(Z))²|Z] ]
≤ E_Z[ E[f²(Y, Z)|Z] ]
= Ef²(Y, Z).

Thus η ≥ 1. Consider a new function f′ = η[f(Y, Z) − μ(Z)]. Note that Ef′(Y, Z) = η[Ef(Y, Z) − Eμ(Z)] = 0 and E(f′(Y, Z))² = 1. Furthermore, Eg(X)f′(Y, Z) = ηEg(X)f(Y, Z) ≥ Eg(X)f(Y, Z). Thus it is sufficient to consider f with the property that E[f(Y, Z)|Z = z] = 0, which enables us to write the optimization problem for ρ(X; Y, Z) in the following equivalent form:

max Eg(X)f(Y, Z)      (A.3)
subject to Eg(X) = 0,
E_{Y|Z=z} f(Y, z) = 0 for all z,
Eg²(X) = Ef²(Y, Z) = 1.

Define s_z = √(E[f²(Y, Z)|Z = z]). To simplify the notation, let p_z = P_Z(z) and ρ_z = ρ(P_X ⊗ P_{Y|X,Z=z}). We have the constraint

Σ_z p_z s_z² = 1.      (A.4)

Note that

max Eg(X)f(Y, Z)      (A.5)
= max E_Z[ E[g(X)f(Y, Z)|Z] ]
= max Σ_z p_z E[g(X)f(Y, Z)|Z = z]
≤(a) Σ_z p_z s_z ρ(P_X ⊗ P_{Y|X,Z=z})
= Σ_z p_z s_z ρ_z
≤(b) √(Σ_z p_z ρ_z²),

where (a) is due to the fact that, given Z = z, (X, Y) has joint distribution P_X ⊗ P_{Y|X,Z=z}, and (b) is based on the following argument. Consider the optimization problem with optimization variables s_z, z ∈ Z:

max Σ_z p_z ρ_z s_z
subject to Σ_z p_z s_z² = 1,  s_z ≥ 0 for all z.

Using the method of Lagrange multipliers, we construct L(s, λ) = Σ_z p_z ρ_z s_z − λ Σ_z p_z s_z². Solving ∂L/∂s_z = p_z ρ_z − 2λ p_z s_z = 0, we obtain s_z = ρ_z/(2λ) for all z ∈ Z. Using the constraint Σ_z p_z s_z² = 1, we have λ = √(Σ_z p_z ρ_z²)/2, which yields √(Σ_z p_z ρ_z²) as the maximum. This completes the proof of Lemma 1.
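The Lagrangian step above is easy to sanity-check numerically (our own sketch with arbitrary p_z and ρ_z): the closed-form maximum √(Σ_z p_z ρ_z²) should upper-bound, and be approached by, the objective evaluated at random feasible points.

    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.2, 0.3, 0.5])       # hypothetical p_z
    rho = np.array([0.9, 0.4, 0.6])     # hypothetical rho_z
    closed_form = np.sqrt(np.sum(p * rho**2))

    # Random nonnegative points, projected onto the constraint sum_z p_z s_z^2 = 1.
    s = np.abs(rng.normal(size=(200_000, 3)))
    s /= np.sqrt((p * s**2).sum(axis=1, keepdims=True))
    values = (p * rho * s).sum(axis=1)

    print(closed_form, values.max())    # the maximum approaches, but never exceeds, the closed form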

Appendix B

Proofs of Chapter 3

B.1 Proof of the continuity of C(R) at R =0

Fix an arbitrary ε > 0.

ε ≥ I(X; U) − I(Y; U)      (B.1)
=(a) I(X; U|Y)
= Σ_y p(y) I(X; U|Y = y)
= Σ_y p(y) D(p(x, u|Y = y) ‖ p(x|Y = y)p(u|Y = y)).

Thus D(p(x, u|y) ‖ p(x|y)p(u|y)) ≤ ε / min_{y∈Y} p(y) for all y ∈ Y. Via Pinsker's inequality, Σ_x |p(x) − q(x)| ≤ √((2/ln 2) D(p‖q)), we obtain

Σ_{x,u} |p(x, u|y) − p(x|y)p(u|y)| ≤ ε′      (B.2)
⇒ Σ_{x,u} p(x|y) |p(u|x, y) − p(u|y)| ≤ ε′
⇒ Σ_{x,u} p(x|y) |p(u|x) − p(u|y)| ≤ ε′,


where ε′ = √((2/ln 2) ε / min_{y∈Y} p(y)) and the last step is due to the Markov chain U − X − Y. Thus for each (x, y) pair such that p(x, y) > 0, we have |p(u|x) − p(u|y)| ≤ δ, where δ = ε′ / min_{(x,y):p(x,y)>0} p(x, y).

Let V be the maximum common random variable of p(x, y). There exist deterministic functions g and f such that V = g(X) = f(Y). For each v, pick an arbitrary y* from the set of y's such that f(y) = v. Thus we create a mapping y* = y*(v).

We claim that if x and y are in the same block, then |p(u|x) − p(u|y)| ≤ δ′, where δ′ = (2|X| + 1)δ. This is due to the fact that if x and y satisfy g(x) = f(y), then there exists a sequence (x, y_1), (x_1, y_1), (x_1, y_2), ..., (x_n, y) such that the probability of each pair is strictly positive [10]. Using the triangle inequality:

|p(u|x) − p(u|y)|
≤ |p(u|x) − p(u|y_1)| + |p(u|y_1) − p(u|y)|
≤ δ + |p(u|y_1) − p(u|y)|
≤ δ + |p(u|x_1) − p(u|y_1)| + |p(u|x_1) − p(u|y)|
≤ 2δ + |p(u|x_1) − p(u|y)|
...
≤ (2n + 1)δ
≤ (2|X| + 1)δ
= δ′.

Consider a new distribution p*(x, y, u) = p(x, y)p(u|y*(f(y))). Note that ‖p − p*‖_1 goes to zero as ε goes to 0. Therefore lim_{ε→0} |I(X; U|V) − I*(X; U|V)| = 0. Furthermore, under the distribution p*, the Markov chain X − V − U holds. Thus

lim_{ε→0} I(X; U) = lim_{ε→0} I(X; U, V)
= I(X; V) + lim_{ε→0} I*(X; U|V)
= I(X; V)
= H(V).

Appendix C

Proofs of Chapter 4

C.1 Converse proof of Theorem 5

We bound the rate R as follows:

nR ≥ H(M)
= I(X^n; M)
= I(X^n; M, Y^n) − I(X^n; Y^n|M)
= Σ_{i=1}^n I(X_i; M, Y^n|X^{i−1}) − I(X^n; Y^n|M)
=(a) Σ_{i=1}^n I(X_i; M, Y^n, X^{i−1}) − I(X^n; Y^n|M)
=(b) Σ_{i=1}^n I(X_i; M, A_i, Y^n, X^{i−1}) − I(X^n; Y^n|M)
= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i) + Σ_{i=1}^n I(X_i; Y_i|A_i) − I(X^n; Y^n|M)
= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i) + Σ_{i=1}^n [I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M)]


≥(c) Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; M, Y^{n\i}, X^{i−1}|A_i, Y_i)
= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − Σ_{i=1}^n I(X_i; K|M, Y^n, X^{i−1}, A_i)
=(d) Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − Σ_{i=1}^n I(X_i; K|M, Y^n, X^{i−1})
= Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − I(X^n; K|M, Y^n)
≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − H(K|M, Y^n)
≥(e) Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, Y^{n\i}, X^{i−1}|A_i, Y_i) − H(K|K′)
≥ Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; K, M, X^{i−1}|A_i, Y_i) − H(K|K′)
=(f) Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; U_i|A_i, Y_i) − H(K|K′)
=(g) Σ_{i=1}^n I(X_i; A_i) + Σ_{i=1}^n I(X_i; U_i|A_i) − Σ_{i=1}^n I(Y_i; U_i|A_i) − H(K|K′)
= Σ_{i=1}^n I(X_i; A_i, U_i) − Σ_{i=1}^n I(Y_i; U_i|A_i) − H(K|K′)
=(h) n[ I(X_Q; A_Q, U_Q|Q) − I(Y_Q; U_Q|A_Q, Q) − H(K|K′)/n ]
= n[ I(X_Q; A_Q, U_Q, Q) − I(Y_Q; U_Q|A_Q, Q) − H(K|K′)/n ]
≥ n[ I(X_Q; A_Q, U_Q, Q) − I(Y_Q; U_Q, Q|A_Q) − H(K|K′)/n ],

where (a) is because the X_i's are i.i.d.; (b) and (d) are due to the fact that A^n is a function of M; and (c) comes from the following chain of inequalities:

I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M)      (C.1)
= I(Y_i; X_i|A_i) − I(Y_i; X^n|Y^{i−1}, M, A^n)
= H(Y_i|A_i) − H(Y_i|Y^{i−1}, M, A^n) − H(Y_i|X_i, A_i) + H(Y_i|X^n, Y^{i−1}, M, A^n)
= H(Y_i|A_i) − H(Y_i|Y^{i−1}, M, A^n) − H(Y_i|X_i, A_i) + H(Y_i|X_i, A_i)
≥ 0,

where the third equality comes from the Markov chain Y_i − (X_i, A_i) − (X^{n\i}, Y^{i−1}, M, A^{n\i}); (e) is because K′ is a function of M and Y^n; in (f), we set U_i = (K, M, X^{i−1}). Note that U_i − (X_i, A_i) − Y_i, which justifies (g). In (h), we introduce a time-sharing random variable Q, which is uniformly distributed on {1, ..., n} and independent of (X^n, K, M, Y^n). We bound the entropy of K as follows:

H(K) =(a) I(X^n; K)      (C.2)
= Σ_{i=1}^n I(X_i; K|X^{i−1})
= Σ_{i=1}^n I(X_i; K, X^{i−1})
≤ Σ_{i=1}^n I(X_i; U_i)
= nI(X_Q; U_Q|Q)
= nI(X_Q; U_Q, Q),

where (a) is due to the fact that K is a function of X^n. Setting X = X_Q, Y = Y_Q, and U = (U_Q, Q) finishes the proof.

C.2 Proof for initial efficiency with actions

The goal is to prove that

sup_{p(a,u|x)} I(Y; U|A) / (I(X; A) + I(X; U|A)) = max_{a∈A} ρ_m²(X, Y|A = a).

Define

∆_1(P_A) = { p(a, u|x) : Σ_x p(a|x)p_X(x) = P_A(a) for all a ∈ A },

∆_2(δ) = { p(a, u|x) : I(X; A) + I(X; U|A) ≤ δ }.

That is, ∆_1(P_A) is the set of conditional distributions p(a, u|x) such that the induced marginal distribution of A is P_A, and ∆_2(δ) is the set of conditional distributions p(a, u|x) such that I(X; A) + I(X; U|A) does not exceed δ. Then

sup_{p(a,u|x)} I(Y; U|A) / (I(X; A) + I(X; U|A))
= sup_{P_A} sup_{δ≥0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A))
= sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A)),

where the last step can be proved by the following concavity argument.

Lemma 8. Fixing an arbitrary marginal distribution P_A, define

f(δ) = sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A).

Then f(δ) is concave in δ.

Proof. Fixing the marginal distribution P_A, consider any p(a_1, u_1|x) ∈ ∆_1(P_A) ∩ ∆_2(δ_1) and p(a_2, u_2|x) ∈ ∆_1(P_A) ∩ ∆_2(δ_2). Construct p(a, u|x) = λp(a_1, u_1|x) + (1 − λ)p(a_2, u_2|x), and introduce a time-sharing random variable Q which equals 1 with probability λ and 2 with probability 1 − λ. We have

I(X; A, U, Q) = I(X; A, U|Q)
= λI(X; A_1, U_1) + (1 − λ)I(X; A_2, U_2)
≤ λδ_1 + (1 − λ)δ_2,

and

I(Y; U, Q|A) ≥ I(Y; U|A, Q)
= λI(Y; U_1|A_1) + (1 − λ)I(Y; U_2|A_2).

Note that (A, U′) corresponds to a valid choice of p(a, u|x), where U′ = (U, Q). Thus

f(λδ_1 + (1 − λ)δ_2) ≥ λf(δ_1) + (1 − λ)f(δ_2),

which completes the proof of the concavity of f.

Note that

I(Y; U|A) / (I(X; A) + I(X; U|A)) ≤ max_{a: P_A(a)>0} I(Y; U|A = a) / I(X; U|A = a)
≤ max_{a: P_A(a)>0} ρ²(P_{X|A=a} ⊗ P_{Y|X,A=a}),

where the first inequality is a consequence of [4, Lemma 16.7.1] and the last inequality comes from Lemma 4. Therefore

sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} I(Y; U|A) / (I(X; A) + I(X; U|A))
≤ sup_{P_A} lim_{δ↓0} sup_{∆_1(P_A) ∩ ∆_2(δ)} max_{a: P_A(a)>0} ρ²(P_{X|A=a} ⊗ P_{Y|X,A=a})
=(a) sup_{P_A} max_{a∈A: P_A(a)>0} ρ²(P_X ⊗ P_{Y|X,A=a})
= max_{a∈A} ρ²(P_X ⊗ P_{Y|X,A=a}),

where (a) can be proved by observing that, for a fixed marginal distribution P_A, δ ↓ 0 implies that ‖P_X − P_{X|A=a}‖_{l_1} ↓ 0 for all a ∈ A with P_A(a) > 0, and that ρ(P_X′ ⊗ P_{Y|X,A=a}) as a function of P_X′ is uniformly continuous around P_X′ = P_X. This upper bound is actually achievable: we can simply fix the action a that maximizes ρ²(P_X ⊗ P_{Y|X,A=a}) over a ∈ A and use Lemma 4 to complete the proof.
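For finite alphabets, the right-hand side of this initial-efficiency formula can be computed directly using the standard singular-value characterization of maximal correlation: ρ_m(X; Y) equals the second-largest singular value of the matrix with entries p(x, y)/√(p(x)p(y)). The sketch below is our own, with a hypothetical pair of action-dependent binary symmetric channels; it evaluates max_{a∈A} ρ_m²(X, Y|A = a).

    import numpy as np

    def maximal_correlation(p_xy):
        # HGR maximal correlation of a finite joint pmf: the second largest
        # singular value of Q[x, y] = p(x, y)/sqrt(p(x) p(y)).
        p_x = p_xy.sum(axis=1)
        p_y = p_xy.sum(axis=0)
        q = p_xy / np.sqrt(np.outer(p_x, p_y))
        sv = np.linalg.svd(q, compute_uv=False)
        return sv[1]                      # sv[0] == 1 corresponds to constant functions

    # Hypothetical example: for each action a, a binary symmetric channel p(y|x, a).
    p_x = np.array([0.5, 0.5])
    channels = {0: 0.1, 1: 0.3}           # crossover probability under each action
    best = 0.0
    for a, eps in channels.items():
        p_y_given_x = np.array([[1 - eps, eps], [eps, 1 - eps]])
        p_xy = p_x[:, None] * p_y_given_x
        best = max(best, maximal_correlation(p_xy) ** 2)
    print(best)                           # max_a rho_m^2(X, Y|A = a); here (1 - 2*0.1)^2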

C.3 Proof of Lemma 5

Set A ⊥ X

By symmetry, it is without loss of optimality to set A = 1. The maximum common random variable V between X and Y has the following form: V = 1 if X = 1; V = 2 if X = 2; V = 3 if X = 3 or 4. For any U such that U − X − Y:

I(U; X) − I(U; Y)
=(a) I(U, V; X) − I(V, U; Y)
= I(V; X) − I(V; Y) + I(U; X|V) − I(U; Y|V)
=(b) I(U; X|V) − I(U; Y|V)
=(c) (1/2)[I(U; X|V = 3) − I(U; Y|V = 3)]
=(d) (1/2)[I(U; X|V = 3) − (1 − p)I(U; X|V = 3)]
= (p/2) I(U; X|V = 3),

where (a) is due to Lemma 3; (b) is because V is a deterministic function of X and a deterministic function of Y; (c) is due to the fact that, conditioned on V = 1 or V = 2, X = Y; and (d) is because, conditioned on V = 3, Y is an erased version of X. On the other hand,

I(U; X) = I(U, V; X)
= I(V; X) + I(U; X|V)
= H(V) + I(U; X|V)
= H(V) + (1/2) I(U; X|V = 3)
= 3/2 + (1/2) I(U; X|V = 3).

Thus the achievable (C, R) pair when A ⊥ X is of the form C = 3/2 + R/p. Note that 0 ≤ I(U; X|V = 3) ≤ 1, and thus 0 ≤ R ≤ p/2.

Correlate A with X through Fig. 4.3

We construct a random variable V in the following way to facilitate the proof: V has support set {1, 2, 3} and is a deterministic function of (X, A):

• If A = 1, then V = 1 if X = 1; V = 2 if X = 2; V = 3 if X = 3 or 4.

• If A = 2, then V = 1 if X = 3; V = 2 if X = 4; V = 3 if X = 1 or 2.

Note that, conditioned on A, V is the maximum common random variable between X and Y. We simply set U = V (this is not optimal in general but is good enough to beat the A ⊥ X choice for some R). The communication rate is

I(X; A) + I(V; X|A) − I(V; Y|A) = I(X; A) + H(V|A) − H(V|A)
= I(X; A)
= 1 − H_2(α).

On the other hand, the common randomness generated is

I(X; A, V) = I(X; V|A) + I(X; A)
= 1 − H_2(α) + I(X; V|A)
= 1 − H_2(α) + H(V|A)
= 1 − H_2(α) + H(α, (1 − α)/2, (1 − α)/2)
= 2 − α.
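The two schemes just analyzed are easy to compare numerically (our own sketch, with a hypothetical erasure probability p): the correlated-action construction achieves the pair (R, C) = (1 − H_2(α), 2 − α), while the A ⊥ X choice achieves C = 3/2 + R/p only for R ≤ p/2. Which scheme generates more common randomness at a given rate depends on p and α.

    import numpy as np

    def h2(x):
        x = np.clip(x, 1e-12, 1 - 1e-12)
        return -x*np.log2(x) - (1 - x)*np.log2(1 - x)

    p = 0.4                                   # hypothetical erasure probability
    for alpha in [0.5, 0.4, 0.3, 0.2, 0.1]:
        R = 1 - h2(alpha)                     # rate used by the correlated-action scheme
        C_corr = 2 - alpha                    # common randomness it generates
        # The A-independent-of-X scheme achieves C = 3/2 + R/p only for R <= p/2.
        C_indep = 1.5 + R / p if R <= p / 2 else None
        print(f"alpha={alpha:.1f}  R={R:.3f}  C_corr={C_corr:.3f}  C_indep={C_indep}")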

Appendix D

Proofs of Chapter 5

D.1 Proof of Lemma 6

Fixing a v, the function a = f(s, v) has only four possible forms: a = s, a = 1 − s, a = 0, and a = 1. Thus, we can divide V into four groups:

V_0 = {v : f(s, v) = s},
V_1 = {v : f(s, v) = 1 − s},
V_2 = {v : f(s, v) = 0},
V_3 = {v : f(s, v) = 1}.      (D.1)

First, it is without loss of optimality to set V_3 = ∅. That is because, for each v ∈ V_3, we can change the function to f(s, v) = 0. The rate I(V; S) + H(Y|V) does not change and the cost EA only decreases. Rewrite the objective function in the following way:

I(V; S) + H(Y|V) = H(Y|V) − H(S|V) + H(S)
= H(S ⊕ A ⊕ S_N|V) − H(S|V) + H(S)
= Σ_{v∈V_0} (H_2(p) − H(S|V = v)) p(v) + Σ_{v∈V_1} (H_2(p) − H(S|V = v)) p(v)
  + Σ_{v∈V_2} (H(S ⊕ S_N|V = v) − H(S|V = v)) p(v) + H(S),      (D.2)

where the last step is obtained by plugging in the actual form of a = f(s, v) for each group of v.

Second, it is sufficient to have |V_0| = 1 and |V_1| = 1. To prove this, let v_1, v_2 ∈ V_0. Note that H(S|V = v) is a concave function of p(s|V = v). Thus, if we replace v_1, v_2 by a v_3 with p(v_3) = p(v_1) + p(v_2) and

p(s|V = v_3) = [p(v_1)/(p(v_1) + p(v_2))] p(s|V = v_1) + [p(v_2)/(p(v_1) + p(v_2))] p(s|V = v_2),

we preserve the distribution of S and the cost EA, but we reduce the first term, i.e., Σ_{v∈V_0} (H_2(p) − H(S|V = v)) p(v), in Eq. (D.2). Therefore, we can set V_0 = {0} and V_1 = {1}.

Third, note that for each v ∈ V_2,

H(Y|V = v) − H(S|V = v)
= H(S ⊕ A ⊕ S_N|V = v) − H(S|V = v)
= H(S ⊕ S_N|V = v) − H(S|V = v)
≥ 0.

Last, if P(S = 0|V = 0) ≠ P(S = 1|V = 1), consider a new auxiliary random variable V′ with the following distribution:

• V′ = {0, 1, 2}, P(V′ = 0) = P(V′ = 1) = (P(V = 0) + P(V = 1))/2.

• The function a = f(s, v′) is of the form f(s, 0) = s, f(s, 1) = 1 − s, and f(s, 2) = 0.

• P(S = 0|V′ = 2) = 1/2, and

P(S = 1|V′ = 0) = P(S = 0|V′ = 1) = [P(S = 1|V = 0)P(V = 0) + P(S = 0|V = 1)P(V = 1)] / [P(V = 0) + P(V = 1)].

Comparing (S,V ′) with (S,V ), we can check that the cost EA and the distribution of S are preserved. Meanwhile, the objective function is reduced, which completes the proof.

D.2 Proof of Lemma 7

Similar to the proof of Lemma 6, we divide V into V_0, V_1, V_2, V_3. Using the same argument, we show that V_3 = ∅. Rewrite the objective function H(Y|V) in the following way:

H(Y|V)      (D.3)
= H(S ⊕ A ⊕ S_N|V)
= Σ_{v∈V_0} H_2(p)p(v) + Σ_{v∈V_1} H_2(p)p(v) + Σ_{v∈V_2} H(S ⊕ S_N|V = v)p(v)
= H_2(p) Σ_{v∈V_0 ∪ V_1} p(v) + Σ_{v∈V_2} p(v),

which implies that it is sufficient to consider the case |V_0| = 1, V_1 = ∅, and |V_2| = 1. This completes the proof.

Bibliography

[1] R. Ahlswede and I. Csiszár, "Common Randomness in Information Theory and Cryptography – Part I: Secret Sharing," IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1121–1132, July 1993.

[2] R. Ahlswede and I. Csiszár, "Common Randomness in Information Theory and Cryptography – Part II: CR Capacity," IEEE Trans. Inf. Theory, vol. 44, no. 1, pp. 225–240, January 1998.

[3] R. F. Ahlswede and J. Körner, "Source coding with side information and a converse for degraded broadcast channels," IEEE Trans. Inf. Theory, vol. 21, no. 6, pp. 629–637, 1975.

[4] T. Cover and J. Thomas, “Elements of Information Theory”, John Wiley&Sons, 2nd Edition, 2006.

[5] I. Csiszár and P. Narayan, "Common Randomness and Secret Key Generation with a Helper," IEEE Trans. Inf. Theory, vol. 46, no. 2, pp. 344–366, March 2000.

[6] P. Cuff, T. Cover, and H. Permuter, “Coordination capacity,” IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4181–4206, September 2010.

[7] A. Dembo, A. Kagan, and L. A. Shepp, “Remarks on the maximum correlation coefficient”, Bernoulli, no. 2, pp. 343–350, April 2001.

[8] A. El Gamal, and Y. H. Kim, “Lectures on Network Information Theory,” 2010, available online at ArXiv: http://arxiv.org/abs/1001.3404.


[9] E. Erkip and T. Cover, "The Efficiency of Investment Information," IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 1026–1040, May 1998.

[10] P. Gács and J. Körner, "Common information is far less than mutual information," Problems of Control and Information Theory, vol. 2, no. 2, pp. 119–162, 1972.

[11] S. I. Gelfand and M. S. Pinsker, "Coding for Channel with Random Parameters," Probl. Contr. and Inform. Theory, vol. 9, no. 1, pp. 19–31, 1980.

[12] H. Gebelein, "Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichungsrechnung," Z. für angewandte Math. und Mech., vol. 21, pp. 364–379, 1941.

[13] C. Heegard and A. El Gamal, "On the Capacity of Computer Memory with Defects," IEEE Trans. Inform. Theory, vol. 29, no. 5, pp. 731–739, September 1983.

[14] H. O. Hirschfeld, “A connection between correlation and contingency,” Proc. Cambridge Philosophical Soc., vol. 31, pp. 520-524, 1935

[15] Y. H. Kim, A. Sutivong, and T. M. Cover, "State amplification," IEEE Trans. Inform. Theory, vol. 54, no. 5, pp. 1850–1859, May 2008.

[16] H. O. Lancaster, "Some properties of the bivariate normal distribution considered in the form of a contingency table," Biometrika, vol. 44, pp. 289–292, 1957.

[17] P. Grover, A. Wagner, and A. Sahai, "Information Embedding meets Distributed Control," in Proc. IEEE Information Theory Workshop, Cairo, Egypt, January 2010.

[18] A. Rényi, "On measures of dependence," Acta Mathematica Hungarica, vol. 10, no. 3–4, pp. 441–451, 1959.

[19] C. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.

[20] C. Shannon, “Channels with side information at the transmitter,” IBM J. Res. Develop., Vol. 2, pp. 289-293, 1958.

[21] D. Slepian, J. Wolf, “Noiseless coding of correlated information sources”, IEEE Trans. Inf. Theory, vol. 19, no. 4, pp. 471–480.

[22] S. Sigurjonsson, and Y. H. Kim, “On multiple user channels with causal state in- formation at the transmitters,” in Proceedings of IEEE International Symposium on Information Theory, Adelaide, Australia, Sep. 2005

[23] A. Sutivong and T. Cover, "Rate vs. Distortion Trade-off for Channels with State Information," in Proceedings of the 2002 IEEE International Symposium on Information Theory, Lausanne, Switzerland, June 2002.

[24] O. Sumszyk, and Y. Steinberg, “Information embedding with reversible stego- text”, in Proceedings of the 2009 IEEE Symposium on Information Theory, Seoul, Korea, Jun. 2009

[25] N. Tishby, F.C. Pereira, and W. Bialek, “The Information Bottleneck method,” The 37th annual Allerton Conference on Communication, Control, and Comput- ing, Sept. 1999, pp. 368-377

[26] S. Verdu, “On channel capacity per unit cost,” IEEE Trans. Inf. Theory, vol. 36, no. 5, pp. 1019–1030, September 1990.

[27] T. Weissman and H. Permuter, “Source Coding with a Side Information ‘Vending Machine’ ”, IEEE Trans. Inf. Theory, submitted 2009.

[28] H. S. Witsenhausen, “ On sequences of pairs of dependent random variables.”, SIAM J. APPL. Math. vol. 28, no. 1, January 1975.

[29] A. Wyner, and J. Ziv, “A theorem on the entropy of certain binary sequences and applications-I,” IEEE Trans. Inf. Theory, vol. 19, no. 6, pp. 769-772, 1973.

[30] A. Wyner and J. Ziv, “The rate distortion function for source coding with side information at the receiver”, IEEE Trans. Inf. Theory, vol. 22, no. 1, pp. 1–10, 1976