
arXiv:1912.07083v2 [cs.IT] 28 Sep 2020

On Wyner's Common Information in the Gaussian Case

Erixhen Sula, Student Member, IEEE, and Michael Gastpar, Fellow, IEEE

Abstract—Wyner's Common Information and a natural relaxation are studied in the special case of jointly Gaussian random variables. The relaxation replaces conditional independence by a bound on the conditional mutual information. The main contribution is the proof that Gaussian auxiliaries are optimal, leading to a closed-form formula. As a corollary, the proof technique also establishes the optimality of Gaussian auxiliaries for the Gaussian Gray-Wyner network, a long-standing open coding problem.

Index Terms—Wyner's Common Information, Gray-Wyner network, Gaussian, water filling, conditional independence, source coding.

I. INTRODUCTION

Wyner's Common Information [1] is a measure of dependence between two random variables. Its operational significance lies in network information theory problems (including a canonical information-theoretic model of the problem of coded caching) as well as in distributed simulation of shared randomness. Specifically, for a pair of random variables, Wyner's common information can be described by the search for the most compact third variable that makes the pair conditionally independent. Compactness is measured in terms of the mutual information between the pair and the third variable. The value of Wyner's common information is the minimum of this mutual information. The main difficulty of Wyner's common information is finding the optimal choice for the third variable. Indeed, explicit solutions are known only for a handful of special cases, including the binary symmetric double source and the case of jointly Gaussian random variables. In the same paper [1, Section 4.2], Wyner also proposes a natural relaxation of his common information, obtained by replacing conditional independence with an upper bound on the conditional mutual information. This relaxation is again directly related to network information theory problems, including the Gray-Wyner source coding network [2]. In the present paper, we study this relaxation in the special case of jointly Gaussian random variables.

The work in this manuscript was supported in part by the Swiss National Science Foundation under Grant 169294, and by EPFL. The work in this manuscript was partially presented at the 2019 IEEE Information Theory Workshop, Visby, Sweden, and at the 2020 Annual Conference on Information Sciences and Systems, Princeton, NJ, USA. E. Sula and M. Gastpar are with the School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland (email: {erixhen.sula,michael.gastpar}@epfl.ch).

September 29, 2020  DRAFT

A. Related Work and Contribution

The development of Wyner's common information started with the consideration of a particular network source coding problem, now referred to as the Gray-Wyner network [2]. From this consideration, Wyner extracted the compact form of the common information in [1], initially restricting attention to the case of discrete random variables. Extensions to continuous random variables are considered in [3], [4], with a closed-form solution for the Gaussian case. Our work provides an alternative and fundamentally different proof of this same formula (along with a generalization). In the same line of work, Wyner's common information is computed for additive Gaussian channels in [5]. A local characterization of Wyner's common information is provided in [6], by optimizing over weakly dependent random variables. In [7], Witsenhausen gives closed-form formulas for a class of distributions he refers to as "L-shaped." The concept of Wyner's common information has also been extended using other information measures [8]. Other related works include [9], [10]. Wyner's common information has many applications, including to communication networks [1], to caching [11, Section III.C], and to source coding [12]. For Gaussian sources, the Gray-Wyner network problem [2] still remains unsolved. A closed-form solution is given in [2] under the assumption that the auxiliaries are Gaussian. Partial progress was made in [13], [4] for the case when the sum of the common rate and the private rates is exactly equal to the joint rate-distortion function. For this corner case, it is known that Wyner's common information is the smallest rate needed on the common link. In the present paper, we solve the Gray-Wyner network [2] for Gaussian sources, encompassing all previous partial results. Other variants of Wyner's common information include [14], [15].
In [14], the conditional independence constraint is replaced by a conditional maximal correlation constraint, whereas in [15], the mutual information objective is replaced by the . The relaxation of Wyner's common information studied in this paper differs from the above variants in that it can be expressed using only mutual information, and thus it can be expressed as a trade-off curve in the Gray-Wyner region. The main difficulty in dealing with Wyner's common information is the fact that it is not a convex optimization problem. Specifically, while the objective is convex, the constraint set is not a convex set: taking convex combinations does not respect the constraint of conditional independence. The main contributions of our work concern explicit solutions to this non-convex optimization problem in the special case when the underlying random variables are jointly Gaussian. Our contributions include the following:

1) We establish an alternative and fundamentally different proof of the well-known formula for (standard) Wyner's common information in the Gaussian case, both for scalars and for vectors. Our proof leverages the technique of factorization of convex envelopes [16].

2) In doing so, we establish a more general formula for the Gaussian case of a natural relaxation of Wyner's common information. This relaxation was proposed by Wyner; in it, the constraint of conditional independence is replaced by an upper bound on the conditional mutual information. The quantity is of independent interest, for example establishing a rigorous connection between Canonical Correlation Analysis and Wyner's Common Information [17].

3) As a corollary, our proof technique also solves a long-standing open problem concerning the Gaussian Gray-Wyner network. Specifically, we establish the optimality of Gaussian auxiliaries for the latter.

B. Notation

We use the following notation. Random variables are denoted by uppercase letters such as X and their realizations by lowercase letters such as x. The alphabets in which they take their values will be denoted by calligraphic letters such as X. Random column vectors are denoted by boldface uppercase letters and their realizations by boldface lowercase letters. Depending on the context we will denote the random column vector also as X^n := (X_1, X_2, ..., X_n). We denote matrices with uppercase letters, e.g., A, B, C. For the cross-covariance matrix of X and Y, we use the shorthand notation K_XY, and for the covariance matrix of a random vector X we use the shorthand notation K_X := K_XX. In slight abuse of notation, we will let K_(X,Y) denote the covariance matrix of the stacked vector (X, Y)^T. We denote the identity matrix of dimension 2×2 by I_2 and the Kullback-Leibler divergence by D(·‖·). The diagonal matrix is denoted by diag(·). We denote log^+(x) = max(log x, 0).

II. PRELIMINARIES

A. Wyner’s Common Information

Wyner’s common information is defined for two random variables X and Y of arbitrary fixed joint distribution p(x, y).

Definition 1. For random variables X and Y with joint distribution p(x, y), Wyner’s common information is defined as

C(X; Y) = inf_{p(w|x,y)} I(X, Y; W)  such that  I(X; Y|W) = 0.  (1)

Wyner's common information satisfies a number of interesting properties. We state some of them below in Lemmas 1 and 2 for a generalized definition given in Definition 2. We note that explicit formulas for Wyner's common information are known only for a small number of special cases. The case of the doubly symmetric binary source is solved completely in [1] and can be written as

C(X; Y) = 1 + h_b(a_0) − 2 h_b( (1 − √(1 − 2a_0)) / 2 ),  (2)

where a_0 denotes the probability that the two sources are unequal (assuming without loss of generality a_0 ≤ 1/2). In this case, the optimizing W in Equation (1) can be chosen to be binary. Further special cases of discrete-alphabet sources appear in [18]. Moreover, when X and Y are jointly Gaussian with correlation coefficient ρ, then C(X; Y) = (1/2) log( (1+|ρ|)/(1−|ρ|) ). Note that for this example, I(X; Y) = (1/2) log( 1/(1−ρ²) ). This case was solved in [3], [4] using a parameterization of conditionally independent distributions. We note that an alternative proof follows from our arguments below.
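As a quick numerical sanity check (an illustration added here, not part of the original development), the two closed-form special cases can be evaluated directly. The helper names below are chosen for illustration; Equation (2) is evaluated in bits, the Gaussian formulas in nats.

```python
import math

def h_b(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def C_binary(a0):
    """Wyner's common information of a doubly symmetric binary source,
    Equation (2), in bits; a0 = P(X != Y) <= 1/2."""
    return 1 + h_b(a0) - 2 * h_b((1 - math.sqrt(1 - 2 * a0)) / 2)

def C_gauss(rho):
    """C(X;Y) = (1/2) log[(1+|rho|)/(1-|rho|)] in nats."""
    r = abs(rho)
    return 0.5 * math.log((1 + r) / (1 - r))

def I_gauss(rho):
    """I(X;Y) = (1/2) log[1/(1-rho^2)] in nats."""
    return 0.5 * math.log(1 / (1 - rho ** 2))

# Sanity checks: independent bits (a0 = 1/2) share nothing; identical
# bits (a0 = 0) share exactly one bit; for Gaussians C(X;Y) >= I(X;Y).
print(C_binary(0.5))  # 0.0
print(C_binary(0.0))  # 1.0
print(C_gauss(0.5), I_gauss(0.5))
```

For ρ = 1/2 the last line reproduces the values log √3 and log(2/√3) quoted later for Figure 1.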

B. A Natural Relaxation of Wyner’s Common Information

Wyner, in [19, Section 4.2], defines an auxiliary quantity Γ(δ1,δ2). Starting from this definition, it is natural to introduce the following quantity:


Definition 2. For jointly continuous random variables X and Y with joint distribution p(x, y), we define

C_γ(X; Y) = inf_{p(w|x,y)} I(X, Y; W)  such that  I(X; Y|W) ≤ γ.  (3)

With respect to [19, Section 4.2], we have that C_γ(X; Y) = H(X, Y) − Γ(0, γ). Comparing Definitions 1 and 2, we see that in C_γ(X; Y), the constraint of conditional independence is relaxed into an upper bound on the conditional mutual information. Specifically, for γ = 0, we have C_0(X; Y) = C(X; Y), the regular Wyner's common information. In this sense, it is tempting to refer to C_γ(X; Y) as relaxed Wyner's common information. The following lemma summarizes some basic properties.

Lemma 1. C_γ(X; Y) satisfies the following basic properties:

1) C_γ(X; Y) ≥ max{ I(X; Y) − γ, 0 }.
2) Data processing inequality: If X − Y − Z form a Markov chain, then C_γ(X; Z) ≤ min{ C_γ(X; Y), C_γ(Y; Z) }.
3) C_γ(X; Y) is a convex and continuous function of γ for γ ≥ 0.
4) If Z is independent of (X, Y), then C_γ((X, Z); Y) = C_γ(X; Y).

Proofs are given in Appendix A.

A further property of Cγ (X; Y ) is a tensorization result for independent pairs, which we will use below to solve the case of the Gaussian vector source.

Lemma 2 (Tensorization). Let {(X_i, Y_i)}_{i=1}^n be n independent pairs of random variables. Then

C_γ(X^n; Y^n) = min_{ {γ_i}_{i=1}^n : Σ_{i=1}^n γ_i = γ }  Σ_{i=1}^n C_{γ_i}(X_i; Y_i).  (4)

The proof is given in Appendix B. The lemma has an intuitive interpretation in the R² plane. If we express C_γ(X^n; Y^n) as a region in R², determined by the pairs (γ, C_γ(X^n; Y^n)), then the computation of (γ, C_γ(X^n; Y^n)) is simply the Minkowski sum of the individual regions determined by (γ_i, C_{γ_i}(X_i; Y_i)).

Remark 1. Not surprisingly, for probabilistic models beyond independent pairs of random variables, one cannot generally order the quantities on the left and right hand sides in Equation (4), respectively. To see that the right hand side in Equation (4) can be an upper bound to the left hand side, suppose first that X1 = X2 and Y1 = Y2. Then,

C_γ(X_1, X_2; Y_1, Y_2) = C_γ(X_1; Y_1) ≤ C_{γ/2}(X_1; Y_1) + C_{γ/2}(X_2; Y_2) = min_{γ_1+γ_2 ≤ γ} C_{γ_1}(X_1; Y_1) + C_{γ_2}(X_2; Y_2), since C_γ(X; Y) is a non-increasing function of γ. By contrast, to see that the right hand side in Equation (4) can be a lower bound to the left hand side, consider now binary random variables and let X_1 and Y_1 be independent and uniform. Let X_2 = X_1 ⊕ Z and Y_2 = Y_1 ⊕ Z, where Z is binary uniform and independent, and ⊕ denotes modulo-addition. Then, C_γ(X_1; Y_1) = C_γ(X_2; Y_2) = 0, while C_γ(X_1, X_2; Y_1, Y_2) ≥ C_γ(Z; Z) = 1 − γ, where the inequality is due to the Data Processing Inequality, i.e., Item 2) of Lemma 1, and it is straightforward to establish that for discrete random variables Z, we have C_γ(Z; Z) = H(Z) − γ.


III. THE SCALAR GAUSSIAN CASE

One of the main technical contributions of this work is a closed-form formula for Cγ (X; Y ) in the case where X and Y are jointly Gaussian.

Theorem 3. When X and Y are jointly Gaussian with correlation coefficient ρ, then

C_γ(X; Y) = (1/2) log⁺( ( (1+|ρ|)(1 − √(1 − e^{−2γ})) ) / ( (1−|ρ|)(1 + √(1 − e^{−2γ})) ) ).  (5)

The proof is given below in Section III-B.
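The closed form can be probed numerically. The sketch below (an added illustration; the function name is ours) evaluates Equation (5) and checks three facts stated elsewhere in the paper: γ = 0 recovers the standard common information, γ ≥ I(X; Y) drives the relaxed quantity to zero, and the lower bound of Lemma 1, Item 1), holds.

```python
import math

def C_gamma(rho, gamma):
    """Relaxed Wyner's common information, Equation (5), in nats."""
    r = abs(rho)
    a = math.sqrt(1 - math.exp(-2 * gamma))
    val = 0.5 * math.log((1 + r) / (1 - r) * (1 - a) / (1 + a))
    return max(val, 0.0)  # the log^+ clipping

rho = 0.5
I_xy = 0.5 * math.log(1 / (1 - rho ** 2))

# gamma = 0 recovers C(X;Y); gamma = I(X;Y) gives zero.
assert abs(C_gamma(rho, 0.0) - 0.5 * math.log((1 + rho) / (1 - rho))) < 1e-9
assert C_gamma(rho, I_xy) < 1e-9

# Lemma 1, Item 1): C_gamma >= max(I(X;Y) - gamma, 0), on a small grid.
for k in range(100):
    g = 0.01 * k
    assert C_gamma(rho, g) >= max(I_xy - g, 0.0) - 1e-12
print("checks passed")
```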

Fig. 1. C_γ(X; Y) for jointly Gaussian X and Y for the case ρ = 1/2; thus, we have C(X; Y) = log √3 and I(X; Y) = log(2/√3). The dashed line is the lower bound from Lemma 1, Item 1).

A. Preliminary Results for the Proof of Theorem 3

The following results are used as intermediate tools in the proof of the main results.

Theorem 4. For K ⪰ 0 and 0 < λ < 1, there exist a matrix K′ with 0 ⪯ K′ ⪯ K and (X′, Y′) ~ N(0, K′) such that for (X, Y) ~ p_{X,Y} with covariance matrix K the following inequality holds:

inf_W  h(Y|W) + h(X|W) − (1+λ) h(X, Y|W)  ≥  h(Y′) + h(X′) − (1+λ) h(X′, Y′).  (6)

Proof. The theorem is a consequence of [20, Theorem 2], for the specific choice p = 1/λ + 1. The rest of the proof is given in Appendix C.

To leverage Theorem 4, we need to understand the covariance matrix K′. In [20], the right hand side in Equation (6) is further lower bounded as h(Y) + h(X) − (1+λ) h(X, Y), where (X, Y) ~ N(0, K) (the correlation coefficient of the matrix K is ρ and the diagonal entries are unity), which holds for λ < ρ. This choice establishes the hypercontractivity bound (1+ρ) I(W; X, Y) ≥ I(W; X) + I(W; Y) (for jointly Gaussian X, Y and any W).


Unfortunately, for the problem of Wyner’s common information, this leads to a loose lower bound, which can be seen as follows:

C_{γ=0}(X; Y) = inf_{p(w|x,y): I(X;Y|W) = 0} I(X, Y; W)  (7)
= inf_{p(w|x,y): I(X,Y;W) + I(X;Y) − I(W;X) − I(W;Y) = 0} I(X, Y; W)  (8)
≥ inf_{p(w|x,y): I(X;Y) − ρ I(X,Y;W) ≤ 0} I(X, Y; W)  (9)
= inf_{p(w|x,y): I(X,Y;W) ≥ I(X;Y)/ρ} I(X, Y; W)  (10)
= I(X; Y)/ρ = (1/ρ) · (1/2) log( 1/(1−ρ²) ),  (11)

where (9) follows from (1+ρ) I(W; X, Y) ≥ I(W; X) + I(W; Y).
We now show that by a different lower bound on the right hand side in Equation (6), we can indeed get a tight

lower bound for the problem of Wyner’s common information as well as its relaxation Cγ (X; Y ). Specifically, we have the following lower bound:

Lemma 5. For (X′, Y′) ~ N(0, K′), the following inequality holds:

min_{K′: 0 ⪯ K′ ⪯ (1 ρ; ρ 1)}  h(X′) + h(Y′) − (1+λ) h(X′, Y′)  ≥  (1/2) log( 1/(1−λ²) ) − (λ/2) log( (2πe)² (1−ρ)² (1+λ)/(1−λ) ),  (12)

where λ ≤ ρ.

Proof. The proof is given in Appendix D.
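To make the looseness of the hypercontractivity route concrete, here is a small numerical comparison (an added illustration, not from the paper) of the bound I(X; Y)/ρ obtained in (11) against the exact value (1/2) log((1+ρ)/(1−ρ)) from Equation (5) at γ = 0.

```python
import math

def loose_bound(rho):
    """The bound of Equation (11): I(X;Y)/rho."""
    return 0.5 * math.log(1 / (1 - rho ** 2)) / rho

def C_wyner(rho):
    """Exact value C(X;Y) = (1/2) log[(1+rho)/(1-rho)], Equation (5) at gamma = 0."""
    return 0.5 * math.log((1 + rho) / (1 - rho))

for rho in (0.1, 0.5, 0.9):
    lb, exact = loose_bound(rho), C_wyner(rho)
    assert lb < exact  # the hypercontractivity-based bound is strictly loose
    print(f"rho={rho}: loose {lb:.4f} vs exact {exact:.4f}")
```

The gap is strict for every 0 < ρ < 1, which is why the sharper bound of Lemma 5 is needed.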

B. Proof of Theorem 3

The proof of the converse for Theorem 3 involves two main steps. In this section, we prove that one optimal distribution is jointly Gaussian via a variant of the factorization of the convex envelope. Then, we tackle the resulting optimization problem with Lagrange duality. Let us start from the lower bound first.

Lemma 6. When X and Y are jointly Gaussian with correlation coefficient ρ and unit variance, then

C_γ(X; Y) ≥ (1/2) log⁺( ( (1+|ρ|)(1 − √(1 − e^{−2γ})) ) / ( (1−|ρ|)(1 + √(1 − e^{−2γ})) ) ).

Proof. The lower bound is derived in the following lines:

C_γ(X; Y) = inf_{W: I(X;Y|W) ≤ γ} I(X, Y; W)  (13)
≥ inf_W (1+μ) I(X, Y; W) − μ I(X; W) − μ I(Y; W) + μ I(X; Y) − μγ  (14)
= h(X, Y) − μγ + μ inf_W [ h(X|W) + h(Y|W) − (1 + 1/μ) h(X, Y|W) ]  (15)
≥ h(X, Y) − μγ + μ min_{K′: 0 ⪯ K′ ⪯ (1 ρ; ρ 1)} [ h(X′) + h(Y′) − (1 + 1/μ) h(X′, Y′) ]  (16)
≥ (1/2) log( (2πe)²(1−ρ²) ) − μγ + (μ/2) log( μ²/(μ²−1) ) − (1/2) log( (2πe)² (1−ρ)² (μ+1)/(μ−1) )  (17)
≥ (1/2) log⁺( ( (1+|ρ|)(1 − √(1 − e^{−2γ})) ) / ( (1−|ρ|)(1 + √(1 − e^{−2γ})) ) ),  (18)

where (14) is a bound for all μ ≥ 0; (16) follows from Theorem 4 where (X′, Y′) ~ N(0, K′) and μ := 1/λ, and for the assumption 0 < λ < 1 to be satisfied we need μ > 1; (17) follows from Lemma 5 for μ ≥ 1/ρ; and (18) follows by maximizing the function

g(μ) := (1/2) log( (2πe)²(1−ρ²) ) − μγ + (μ/2) log( μ²/(μ²−1) ) − (1/2) log( (2πe)² (1−ρ)² (μ+1)/(μ−1) ),  (19)

for μ ≥ 1/ρ. Now we need to choose the tightest bound, which is max_{μ ≥ 1/ρ} g(μ). The function g is concave in μ:

∂²g/∂μ² = −1/( μ(μ²−1) ) < 0.  (20)

By studying the monotonicity we obtain

∂g/∂μ = −(1/2) log( (μ²−1)/μ² ) − γ;  (21)

since the function is concave, the maximum is attained where the derivative vanishes, which leads to the optimal solution μ* = 1/√(1 − e^{−2γ}), where μ* ≥ 1/ρ. Substituting the optimal μ* we obtain

C_γ(X; Y) ≥ g( 1/√(1 − e^{−2γ}) ) = (1/2) log⁺( ( (1+ρ)(1 − √(1 − e^{−2γ})) ) / ( (1−ρ)(1 + √(1 − e^{−2γ})) ) ).  (22)

Now let us turn our attention to the upper bound. Let us assume (without loss of generality) that X and Y have unit variance and are non-negatively correlated with correlation coefficient ρ ≥ 0. Since they are jointly Gaussian, we can express them as

X = σW + √(1−σ²) N_X  (23)
Y = σW + √(1−σ²) N_Y,  (24)

where W, N_X, N_Y are jointly Gaussian, and where W ~ N(0, 1) is independent of (N_X, N_Y). Letting the covariance of the vector (N_X, N_Y) be

K_(N_X, N_Y) = (1 α; α 1)  (25)

for some 0 ≤ α ≤ ρ, we find that we need to choose σ² = (ρ−α)/(1−α). Specifically, let us select α = √(1 − e^{−2γ}), for some 0 ≤ γ ≤ (1/2) log( 1/(1−ρ²) ). For this choice, we find I(X; Y|W) = γ and

I(X, Y; W) = (1/2) log( ( (1+ρ)(1−α) ) / ( (1−ρ)(1+α) ) ).  (26)
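The construction above can be checked mechanically. The following sketch (an added illustration; variable names are ours) verifies, for one choice of ρ and γ, that the selections in (25)–(26) reproduce the marginal correlation ρ, satisfy I(X; Y|W) = γ, and achieve exactly the rate of Theorem 3, using the standard Gaussian mutual-information formulas.

```python
import math

rho, gamma = 0.8, 0.1
I_xy = 0.5 * math.log(1 / (1 - rho ** 2))
assert gamma <= I_xy  # needed so that alpha <= rho and sigma^2 >= 0

alpha = math.sqrt(1 - math.exp(-2 * gamma))
sigma2 = (rho - alpha) / (1 - alpha)

# Marginal correlation of X = sigma*W + sqrt(1-sigma^2) N_X (same for Y):
assert abs((sigma2 + (1 - sigma2) * alpha) - rho) < 1e-12

# Conditioned on W, (X, Y) is Gaussian with correlation alpha, so I(X;Y|W) = gamma:
I_cond = 0.5 * math.log(1 / (1 - alpha ** 2))
assert abs(I_cond - gamma) < 1e-12

# The rate of Equation (26) matches the formula of Theorem 3:
rate = 0.5 * math.log((1 + rho) * (1 - alpha) / ((1 - rho) * (1 + alpha)))
a = math.sqrt(1 - math.exp(-2 * gamma))
theorem3 = 0.5 * math.log((1 + rho) / (1 - rho) * (1 - a) / (1 + a))
assert abs(rate - theorem3) < 1e-12
print("construction is consistent")
```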


IV. THE VECTOR GAUSSIAN CASE

In this section, we consider the case where X and Y are jointly Gaussian random vectors. The key observation is that in this case, there exist invertible matrices A and B such that AX and BY are vectors of independent pairs, exactly as in Lemma 2. Therefore, we can use that lemma to give an explicit formula for the relaxed Wyner's common information between arbitrarily correlated jointly Gaussian random vectors, as stated in the following theorem.

Theorem 7. Let X and Y be jointly Gaussian random vectors of length n and covariance matrix K_(X,Y). Then,

C_γ(X; Y) = min_{ {γ_i}: Σ_{i=1}^n γ_i = γ }  Σ_{i=1}^n C_{γ_i}(X_i; Y_i),  (27)

where

C_{γ_i}(X_i; Y_i) = (1/2) log⁺( ( (1+ρ_i)(1 − √(1 − e^{−2γ_i})) ) / ( (1−ρ_i)(1 + √(1 − e^{−2γ_i})) ) )  (28)

and ρ_i (for i = 1, ..., n) are the singular values of K_X^{−1/2} K_XY K_Y^{−1/2}, where K_X^{−1/2} and K_Y^{−1/2} are defined to mean that only the positive eigenvalues are inverted.

Remark 2. Note that we do not assume that K_X and K_Y are of full rank. Moreover, note that the case where X and Y are of unequal length is included: simply invoke Lemma 1, Item 4), to append the shorter vector with independent Gaussians so as to end up with two vectors of the same length.

Proof. Note that the mean is irrelevant for the problem at hand, so we assume it to be zero without loss of generality. The first step of the proof is to apply the same transform used, e.g., in [12]. Namely, we form X̂ = K_X^{−1/2} X and Ŷ = K_Y^{−1/2} Y, where K_X^{−1/2} and K_Y^{−1/2} are defined to mean that only the positive eigenvalues are inverted. Let us denote the rank of K_X by r_X and the rank of K_Y by r_Y. Then, we have

K_X̂ = ( I_{r_X}  0 ;  0  0_{n−r_X} )  (29)

and

K_Ŷ = ( I_{r_Y}  0 ;  0  0_{n−r_Y} ).  (30)

Moreover, we have K_X̂Ŷ = K_X^{−1/2} K_XY K_Y^{−1/2}. Let us denote the singular value decomposition of this matrix by K_X̂Ŷ = R_X^T Λ R_Y. Define X̃ = R_X X̂ and Ỹ = R_Y Ŷ, which implies that K_X̃ = K_X̂, K_Ỹ = K_Ŷ, and K_X̃Ỹ = Λ. The second step of the proof is to observe that the mappings from X to X̃ and from Y to Ỹ, respectively, are linear and one-to-one, and mutual information is preserved under such transformations. Hence, we have C_γ(X; Y) = C_γ(X̃; Ỹ). The third, and key, step of the proof is now to observe that {(X̃_i, Ỹ_i)}_{i=1}^n are n independent pairs of random variables. Hence, we can apply Lemma 2. The final step is to apply Theorem 3 separately to each of the independent pairs, thus establishing the claimed formula.

In the remainder of this section, we explore the structure of the allocation problem in Theorem 7, that is, the

problem of optimally choosing the values of γi. As we will show, the answer is of the water-filling type. That is,


there is a "water level" γ*. Then, all γ_i whose corresponding correlation coefficient ρ_i is large enough will be set equal to γ*. The remaining γ_i, corresponding to those i with low correlation coefficient ρ_i, will be set to their respective maximal values (all of which are smaller than γ*). To establish this result, we prefer to change notation as follows. We define α_i = √(1 − e^{−2γ_i}). With this, we can express the allocation problem in Theorem 7 as

C_γ(X; Y) = min_{α_1, α_2, ..., α_n} Σ_{i=1}^n (1/2) log⁺( ( (1+ρ_i)(1−α_i) ) / ( (1−ρ_i)(1+α_i) ) )  such that  Σ_{i=1}^n (1/2) log( 1/(1−α_i²) ) ≤ γ.  (31)

Moreover, defining

C(ρ) = (1/2) log( (1+ρ)/(1−ρ) ),   I(ρ) = (1/2) log( 1/(1−ρ²) ),  (32)

we can rewrite Equation (31) as

C_γ(X; Y) = min_{α_1, α_2, ..., α_n} Σ_{i=1}^n ( C(ρ_i) − C(α_i) )⁺  such that  Σ_{i=1}^n I(α_i) ≤ γ.  (33)

Theorem 8. The solution to the allocation problem of Theorem 7 can be expressed as

C_γ(X; Y) = Σ_{i=1}^n ( C(ρ_i) − β* )⁺,  (34)

where β* is selected such that

Σ_{i=1}^n min{ f(β*), I(ρ_i) } = γ,  (35)

where

f(β*) = (1/2) log( (exp(2β*)+1)² / (4 exp(2β*)) ).  (36)

Proof of Theorem 8. Note that (33) can be rewritten as

C_γ(X; Y) = min_{γ_1, γ_2, ..., γ_n} Σ_{i=1}^n ( C(ρ_i) − C(I^{−1}(γ_i)) )⁺  such that  Σ_{i=1}^n γ_i ≤ γ,  (37)

and thus, for notational compactness, let us define

g(x) = C(I^{−1}(x)) = (1/2) log( (1 + √(1 − e^{−2x})) / (1 − √(1 − e^{−2x})) ),  (38)

which is a strictly concave, strictly increasing function. We also define its inverse,

f(x) = g^{−1}(x) = I(C^{−1}(x)) = (1/2) log( 1 / ( 1 − ( (exp(2x)−1)/(exp(2x)+1) )² ) ) = (1/2) log( (exp(2x)+1)² / (4 exp(2x)) ),  (39)

which is a strictly convex, strictly increasing function. Without loss of generality, suppose that ρ_1 ≥ ρ_2 ≥ ... ≥ ρ_n. The objective function is composed of n terms which can be active or not, meaning that they can be either positive or zero. Since the function C(ρ) is increasing in ρ, we have that C(ρ_1) ≥ C(ρ_2) ≥ ... ≥ C(ρ_n). To summarize the intuition of the proof, note that the n-th term, i.e., (C(ρ_n) − g(γ_n))⁺, will be inactive first. Therefore, by increasing γ the terms will become inactive in a decreasing fashion until we are left with only the first term active and the rest inactive.


Let us start with the case when they are all active, which means that Σ_{i=1}^n (C(ρ_i) − g(γ_i))⁺ = Σ_{i=1}^n (C(ρ_i) − g(γ_i)). Then, by the concavity of g(γ_i), we have

Σ_{i=1}^n g(γ_i) ≤ n g(γ/n),  (40)

thus an optimal choice is γ_i* = γ/n, for all i. Hence, in our notation, in this case β* = g(γ/n). Clearly, all the terms are active in the interval 0 ≤ γ ≤ n I(ρ_n), with the reasoning that if the n-th term is active then the rest of the terms are active too. Next, consider the case when the n-th term is inactive and the rest are active. Therefore, Σ_{i=1}^n (C(ρ_i) − g(γ_i))⁺ = Σ_{i=1}^{n−1} (C(ρ_i) − g(γ_i)), and by the concavity of g(γ_i), we have

Σ_{i=1}^{n−1} g(γ_i) ≤ (n−1) g( (γ − γ_n)/(n−1) ),  (41)

thus an optimal choice is γ_i* = (γ − γ_n)/(n−1) for all i ∈ {1, 2, ..., n−1}. The optimal choice for γ_n is γ_n = I(ρ_n), which makes the n-th term exactly zero. This scenario will happen in the interval n I(ρ_n) < γ ≤ I(ρ_n) + (n−1) I(ρ_{n−1}). The corresponding β* in our notation is β* = g( (γ − I(ρ_n))/(n−1) ). In general, let us consider the case when the k-th term is active and the (k+1)-th is inactive. By a similar argument as above, the optimal choice is γ_i* = ( γ − Σ_{i=k+1}^n γ_i )/k for i ∈ {1, 2, ..., k} and γ_i = I(ρ_i) for i ∈ {k+1, ..., n}. This scenario will happen in the interval (k+1) I(ρ_{k+1}) + Σ_{i=k+2}^n I(ρ_i) < γ ≤ k I(ρ_k) + Σ_{i=k+1}^n I(ρ_i). Importantly, observe that the optimal γ_i can be rewritten as γ_i = min{ I(ρ_i), γ* }, therefore the solution to the allocation problem can be expressed as

C_γ(X; Y) = Σ_{i=1}^n ( C(ρ_i) − g(γ*) )⁺,  (42)

where γ* is selected such that

Σ_{i=1}^n min{ γ*, I(ρ_i) } = γ.  (43)

The solution to the allocation problem can be rewritten as

C_γ(X; Y) = Σ_{i=1}^n ( C(ρ_i) − β* )⁺,  (44)

where β* is selected such that

Σ_{i=1}^n min{ f(β*), I(ρ_i) } = γ.  (45)
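The allocation of Theorem 8 is straightforward to implement. The sketch below (illustrative code added here, stdlib only; it assumes γ is smaller than the total Σ I(ρ_i)) finds β* by bisection for a two-pair example and cross-checks the result against a brute-force grid search over the split γ_1 + γ_2 = γ using the per-pair formula (28).

```python
import math

def C(rho):  # Equation (32)
    return 0.5 * math.log((1 + rho) / (1 - rho))

def I(rho):  # Equation (32)
    return 0.5 * math.log(1 / (1 - rho ** 2))

def f(beta):  # Equation (36): gamma needed to lower a pair's cost by beta
    e = math.exp(2 * beta)
    return 0.5 * math.log((e + 1) ** 2 / (4 * e))

def C_gamma_pair(rho, g):  # Equation (28)
    a = math.sqrt(1 - math.exp(-2 * g))
    return max(0.5 * math.log((1 + rho) * (1 - a) / ((1 - rho) * (1 + a))), 0.0)

def waterfill(rhos, gamma):
    """Theorem 8: bisect on the water level beta* (assumes gamma < sum I(rho_i))."""
    lo, hi = 0.0, max(C(r) for r in rhos)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(min(f(mid), I(r)) for r in rhos) < gamma:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return sum(max(C(r) - beta, 0.0) for r in rhos)

rhos, gamma = [0.8, 0.3], 0.2
wf = waterfill(rhos, gamma)

# Brute-force the allocation gamma_1 + gamma_2 = gamma on a fine grid.
n = 20000
bf = min(C_gamma_pair(rhos[0], gamma * k / n) +
         C_gamma_pair(rhos[1], gamma * (1 - k / n)) for k in range(n + 1))
assert abs(wf - bf) < 1e-3
print(wf, bf)
```

In this example the second pair saturates (γ_2 = I(ρ_2)) and only the first pair contributes to C_γ(X; Y), matching the reverse water-filling picture of Figures 2 and 3.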

Theorem 8 shows that the allocation problem has a natural reverse water-filling interpretation which can be visualized in two dual ways. First, we could consider the space of the γ_i parameters, which leads to Figure 2: none of the γ_i should be selected larger than the corresponding I(ρ_i), and those γ_i that are strictly smaller than their maximum value should all be equal. This graphically identifies the optimal value γ*, and thus, the resulting solution to our optimization problem. Alternatively, we could consider directly the space of the individual contributions to the objective, denoted by C(ρ_i) in Equation (37), which leads to Figure 3.


Fig. 2. Example of reverse water-filling. The (whole) bars represent the γ_i-s which make C_{γ_i}(X_i; Y_i) = 0, and the shaded area of the bars is the proper allocation γ_i to minimize the original problem. In this example, γ = Σ_{i=1}^n γ_i is chosen such that C_{γ_{n−1}}(X_{n−1}; Y_{n−1}) = C_{γ_n}(X_n; Y_n) = 0.

Fig. 3. Example of reverse water-filling. The (whole) bars represent the (standard) Wyner's common information of each individual pair, respectively. The shaded area of the bars is the respective contribution to C_γ(X; Y). In this example, γ is chosen such that (C(ρ_{n−1}) − β*)⁺ = (C(ρ_n) − β*)⁺ = 0.

[Figure 4 depicts the network: an encoder E observes (X, Y) and emits three descriptions at rates R_0, R_1, R_2; decoder D_x receives the descriptions at rates R_0 and R_1 and outputs X̂, and decoder D_y receives the descriptions at rates R_0 and R_2 and outputs Ŷ.]

Fig. 4. The Gray-Wyner Network

V. THE GAUSSIAN GRAY-WYNER NETWORK

The Gray-Wyner network [2] is composed of one sender and two receivers, as illustrated in Figure 4. In a nutshell, the sender compresses two underlying correlated sources X and Y (with fixed p(x, y)) into three descriptions. Here,

we follow the notation and formal problem statement as given in [2, Section II]. The central description, of rate R0, is provided to both receivers. Additionally, each receiver also has access to a tailored private description at rates

R1 and R2, respectively. At the receivers, reconstruction is accomplished to within a fidelity criterion. For given

fidelity requirements ∆1 and ∆2 in the reconstruction of sources X and Y, respectively, we seek to characterize


the set of achievable rate triples (R0, R1, R2), again following [2, Section II] to the letter. The full solution, up to the optimization over an auxiliary, is characterized in [2, Theorem 8], see [2, Equations (40a)-(40b)]. Namely, define the regions

R^{(W)}(∆_1, ∆_2) = { (R_0, R_1, R_2) : R_0 ≥ I(X, Y; W), R_1 ≥ R_{X|W}(∆_1), R_2 ≥ R_{Y|W}(∆_2) }.  (46)

Here, R_{X|W}(·) and R_{Y|W}(·) denote the conditional rate-distortion functions of X and Y, respectively, given W; see [21]. Then, the optimal region, denoted by R*(∆_1, ∆_2), is the (set) closure of the union of these regions over all choices of W. The difficulty with this result is taking the union over all W.

For the jointly Gaussian source (X, Y) subject to mean-squared error distortion, the complete solution remains unknown. An account of this special case already appears in [2, Section 2.5(B)]. Partial progress was made in [13],

[4] for the special case where R0 + R1 + R2 = RX,Y (∆1, ∆2). Here, RX,Y (∆1, ∆2) denotes the rate-distortion

function of jointly encoding X and Y to fidelities ∆_1 and ∆_2, respectively. By a simple cut-set argument, any scheme must satisfy R_0 + R_1 + R_2 ≥ R_{X,Y}(∆_1, ∆_2), and thus, if a scheme attains this bound with equality, it is necessarily optimal. It is immediate that if R_0 is large enough, then it is possible to meet with equality

R0 + R1 + R2 = RX,Y (∆1, ∆2). However, to date, no progress has been reported for the general case where it is not possible to attain this cut-set bound with equality. The main contribution of the present paper is a closed-form solution for the general case. Specifically, our techniques allow us to establish that restricting the union over all W to only jointly Gaussian auxiliaries is without loss of optimality. To keep notation simple, we consider the following symmetric projection of the optimal rate region:

R_{∆,α}(X, Y) = min R_0  such that  (R_0, R_1, R_2) ∈ R*(∆, ∆) and R_1 + R_2 ≤ α.  (47)

Using Equation (46), we can express R∆,α(X, Y ) explicitly as the following optimization problem:

R_{∆,α}(X, Y) = inf I(X, Y; W)

such that  I(X; X̂|W) + I(Y; Ŷ|W) ≤ α  and  E[(X − X̂)²] ≤ ∆  and  E[(Y − Ŷ)²] ≤ ∆,  (48)

where the infimum is over all distributions p(w, x̂, ŷ|x, y). Then, we have the following theorem:

Theorem 9. Let X and Y be jointly Gaussian with mean zero, equal variance σ², and with correlation coefficient ρ. Let the distortion measure be mean-squared error. Then,

R_{∆,α}(X, Y) = (1/2) log⁺( (1+ρ) / ( 2∆e^α/σ² + ρ − 1 ) ),   if σ²(1−ρ) ≤ ∆e^α ≤ σ²,
R_{∆,α}(X, Y) = (1/2) log⁺( (1−ρ²)σ⁴ / (∆²e^{2α}) ),   if ∆e^α ≤ σ²(1−ρ).

Proof. First, we observe that for mean-squared error, the source variance is irrelevant: a scheme attaining distortion ∆ for sources of variance σ² is a scheme attaining distortion ∆/σ² on unit-variance sources, and vice versa.


Therefore, for ease of notation, in the sequel, we assume that the sources are of unit variance. Then, we can bound:

R_{∆,α}(X, Y) = inf_{W,X̂,Ŷ: I(X;X̂|W)+I(Y;Ŷ|W) ≤ α, E[(X−X̂)²] ≤ ∆, E[(Y−Ŷ)²] ≤ ∆} I(X, Y; W)  (49)
≥ inf_{W,X̂,Ŷ: E[(X−X̂)²] ≤ ∆, E[(Y−Ŷ)²] ≤ ∆} I(X, Y; W) + ν( I(X; X̂|W) + I(Y; Ŷ|W) − α )  (50)
= inf_{W,X̂,Ŷ: E[(X−X̂)²] ≤ ∆, E[(Y−Ŷ)²] ≤ ∆} h(X, Y) − να + ν( h(X|W) + h(Y|W) ) − h(X, Y|W) − ν( h(X|W, X̂) + h(Y|W, Ŷ) )  (51)
≥ h(X, Y) − να + ν inf_W [ h(X|W) + h(Y|W) − (1/ν) h(X, Y|W) ] + inf_{W,X̂,Ŷ: E[(X−X̂)²] ≤ ∆, E[(Y−Ŷ)²] ≤ ∆} −ν( h(X|W, X̂) + h(Y|W, Ŷ) )  (52)
≥ h(X, Y) − να + ν · min_{K′: 0 ⪯ K′ ⪯ (1 ρ; ρ 1)} [ h(X′) + h(Y′) − (1/ν) h(X′, Y′) ] + ν( min_{(W,X̂,Ŷ) ∈ P_G: E[(X−X̂)²] ≤ ∆} −h(X|W, X̂) + min_{(W,X̂,Ŷ) ∈ P_G: E[(Y−Ŷ)²] ≤ ∆} −h(Y|W, Ŷ) )  (53)
= h(X, Y) − να − ν log(2πe∆) + ν · min_{K′: 0 ⪯ K′ ⪯ (1 ρ; ρ 1)} [ h(X′) + h(Y′) − (1/ν) h(X′, Y′) ]  (54)
= (1/2) log( (2πe)²(1−ρ²) ) − να − ν log(2πe∆) + (ν/2) log( ν²/(2ν−1) ) − ((1−ν)/2) log( (2πe)²(1−ρ)²/(2ν−1) )  (55)
= (1/2) log⁺( (1+ρ)/(2∆e^α − 1 + ρ) ),   if 1−ρ ≤ ∆e^α ≤ 1,
= (1/2) log⁺( (1−ρ²)/(∆²e^{2α}) ),   if ∆e^α ≤ 1−ρ,  (56)

where (50) follows from weak duality for ν ≥ 0; (52) follows from bounding the infimum of the sum by the sum of the infima of its summands, and the fact that relaxing the constraints cannot increase the value of the infimum; (53) follows from Theorem 4 where ν := 1/(1+λ), and for the constraint 0 ≤ λ < 1 (indeed we can also include zero) to be satisfied we need 1/2 < ν ≤ 1, together with [22, Lemma 1] applied to each of the terms; (54) follows by observing

h(X|W, X̂) = h(X − X̂|W, X̂)  (57)
≤ h(X − X̂)  (58)
≤ (1/2) log(2πe∆),  (59)

where the last step is due to the fact that E[(X − X̂)²] ≤ ∆; (55) follows from Lemma 5 for ν ≥ 1/(1+ρ); and (56) follows from maximizing

ℓ(ν) := (1/2) log( (2πe)²(1−ρ²) ) − να − ν log(2πe∆) + (ν/2) log( ν²/(2ν−1) ) − ((1−ν)/2) log( (2πe)²(1−ρ)²/(2ν−1) ),  (60)


for ν ≥ 1/(1+ρ). Now we need to choose the tightest bound, max_{ν ≥ 1/(1+ρ)} ℓ(ν). Note that the function ℓ is concave since

∂²ℓ/∂ν² = −1/( ν(2ν−1) ) < 0.  (61)

Since it also satisfies

∂ℓ/∂ν = log( ν(1−ρ) / ( (2ν−1)∆e^α ) ),  (62)

its maximal value occurs when the derivative vanishes, that is, when ν* = ∆e^α/(2∆e^α − 1 + ρ). Substituting the optimal ν* we get

R_{∆,α}(X, Y) ≥ ℓ(ν*) = (1/2) log⁺( (1+ρ)/(2∆e^α − 1 + ρ) ),  (63)

for 1 ≥ ν* ≥ 1/(1+ρ), which means the expression is valid for 1−ρ ≤ ∆e^α ≤ 1.

The other case is ∆e^α ≤ 1−ρ. In this case note that ν(1−ρ) ≥ ν∆e^α ≥ (2ν−1)∆e^α for ν ≤ 1. This implies ν(1−ρ)/((2ν−1)∆e^α) ≥ 1, thus we have ∂ℓ/∂ν ≥ 0. Since the function is concave and increasing, the maximum is attained at ν* = 1, thus

R_{∆,α}(X, Y) ≥ ℓ(1) = (1/2) log⁺( (1−ρ²)/(∆²e^{2α}) ),  (64)

where the expression is valid for ∆e^α ≤ 1−ρ. As stated at the beginning of the proof, this is the correct formula assuming unit-variance sources. For sources of variance σ², it suffices to replace ∆ with ∆/σ², which leads to the expression given in the theorem statement.
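As a numerical cross-check of the converse (again an added illustration, for unit variance), one can maximize ℓ(ν) of Equation (60) on a grid over [1/(1+ρ), 1] and compare against the closed form of Theorem 9 in both regimes.

```python
import math

TWO_PI_E = 2 * math.pi * math.e

def ell(nu, rho, delta, alpha):
    """The function of Equation (60), defined for nu > 1/2 (unit variance)."""
    return (0.5 * math.log(TWO_PI_E ** 2 * (1 - rho ** 2))
            - nu * alpha - nu * math.log(TWO_PI_E * delta)
            + 0.5 * nu * math.log(nu ** 2 / (2 * nu - 1))
            - 0.5 * (1 - nu) * math.log(TWO_PI_E ** 2 * (1 - rho) ** 2 / (2 * nu - 1)))

def R_closed(rho, delta, alpha):
    """Theorem 9 with sigma^2 = 1."""
    d = delta * math.exp(alpha)
    if d <= 1 - rho:
        return max(0.5 * math.log((1 - rho ** 2) / (delta ** 2 * math.exp(2 * alpha))), 0.0)
    return max(0.5 * math.log((1 + rho) / (2 * d - 1 + rho)), 0.0)

def R_numeric(rho, delta, alpha, n=100000):
    """Grid maximization of ell over [1/(1+rho), 1]."""
    lo = 1 / (1 + rho)
    return max(ell(lo + (1 - lo) * k / n, rho, delta, alpha) for k in range(n + 1))

for rho, delta, a in [(0.5, 0.7, 0.2),   # regime 1-rho <= delta*e^a <= 1
                      (0.5, 0.3, 0.1)]:  # regime delta*e^a <= 1-rho
    assert abs(R_numeric(rho, delta, a) - R_closed(rho, delta, a)) < 1e-6
print("closed form matches numerical maximization")
```

In the second regime the maximum sits at the endpoint ν = 1, in line with the monotonicity argument above.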

VI. CONCLUDING REMARKS

We studied a natural relaxation of Wyner's common information, whereby the constraint of conditional independence is replaced by an upper bound on the conditional mutual information. This leads to a novel and different optimization problem. We established a number of properties of this novel quantity, including a chain rule type formula for the case of independent pairs of random variables. For the case of jointly Gaussian sources, both scalar and vector, we presented a closed-form expression for the relaxed Wyner's common information. Finally, using the same tool set, we fully characterized the lossy Gaussian Gray-Wyner network subject to mean-squared error.

APPENDIX A

PROOF OF LEMMA 1

For Item 1), the inequality follows from the fact that mutual information is non-negative. If γ ≥ I(X;Y), we may select W to be a constant, thus we have equality to zero. If γ < I(X;Y), then the lower bound proved in the next item establishes that we cannot have equality to zero. Also, observe that the Lagrangian for the relaxed Wyner's common information problem of Equation (3) is L(λ, p(w|x,y)) = I(X,Y;W) + λ(I(X;Y|W) − γ). From Lagrange duality, we thus have the lower bound C_γ(X;Y) ≥ inf_{p(w|x,y)} L(λ, p(w|x,y)) for all positive λ. Setting


λ = 1, we have inf_{p(w|x,y)} (I(X,Y;W) + I(X;Y|W) − γ) = inf_{p(w|x,y)} (I(X;Y) + I(X;W|Y) + I(Y;W|X) − γ) = I(X;Y) − γ. For Item 2), observe that for fixed p(x,y,z), we can write

C_γ(X;Y) = inf_{p(x,y,z)p(w|x,y): I(X;Y|W)≤γ} I(X,Y;W)   (65)
≥ inf_{p(x,y,z)p(w|x,y): I(X;Y|W)≤γ} I(X,Z;W),   (66)

due to the Markov chain (X,Z) − (X,Y) − W. Moreover, note that since we consider only joint distributions of the form p(x,y)p(z|y)p(w|x,y), we also have the Markov chain (X,W) − Y − Z, which implies the Markov chain X − (W,Y) − Z. The latter implies I(X;Y|W) ≥ I(X;Z|W). Hence,

C_γ(X;Y) ≥ inf_{p(x,y,z)p(w|x,y): I(X;Z|W)≤γ} I(X,Z;W) ≥ C_γ(X;Z).   (67)

By the same token, C_γ(Y;Z) ≥ C_γ(X;Z), which completes the proof. Item 3) follows directly from [19, Corollary 4.5]. For Item 4), on the one hand, we have

C_γ((X,Z);Y) = inf_{p(w|x,y,z): I(X,Z;Y|W)≤γ} I(X,Z,Y;W)   (68)
≤ inf_{p(w|x,y): I(X;Y|W)+I(Z;Y|W,X)≤γ} I(X,Y;W) + I(Z;W|X,Y)   (69)
= C_γ(X;Y),   (70)

where in Equation (69) we add the constraint that W is selected to be independent of Z, which cannot reduce the value of the infimum. Clearly, for such a choice of W, we have I(Z;Y|W,X) = 0 and I(Z;W|X,Y) = 0, which thus establishes the last step. Conversely, observe that

C_γ((X,Z);Y) = inf_{p(w|x,y,z): I(X,Z;Y|W)≤γ} I(X,Y,Z;W)   (71)
≥ inf_{p(w|x,y): I(X;Y|W)≤γ} I(X,Y;W) + inf_{p(w|x,y,z): I(X,Z;Y|W)≤γ} I(Z;W|X,Y)   (72)
≥ C_γ(X;Y),   (73)

where (72) follows from the fact that the infimum of the sum is lower bounded by the sum of the infima and the fact that relaxing constraints cannot increase the value of the infimum, and (73) follows from non-negativity of the second term.

APPENDIX B

PROOF OF LEMMA 2

The achievability part, that is, the inequality

C_γ(X^n; Y^n) ≤ min_{{γ_i}_{i=1}^n : Σ_{i=1}^n γ_i = γ} Σ_{i=1}^n C_{γ_i}(X_i; Y_i),   (74)

merely corresponds to a particular choice of W in the definition given in Equation (3). Specifically, let W = (W_1, W_2, ..., W_n), and choose {(X_i, Y_i, W_i)}_{i=1}^n to be n independent triples of random vectors. The converse is more subtle. We prove the case n = 2 first, followed by induction. For n = 2, we have

inf_{p(w|x_1,x_2,y_1,y_2): I(X_1,X_2;Y_1,Y_2|W)≤γ} I(X_1,X_2,Y_1,Y_2;W)
(a) ≥ inf_{p(w|x_1,x_2,y_1,y_2): I(X_1;Y_1|W)+I(X_2;Y_2|W,X_1)≤γ} I(X_1,Y_1;W) + I(X_2,Y_2;W,X_1)   (75)
(b) = min_{γ_1+γ_2=γ} inf_{p(w|x_1,x_2,y_1,y_2): I(X_1;Y_1|W)≤γ_1, I(X_2;Y_2|W,X_1)≤γ_2} { I(X_1,Y_1;W) + I(X_2,Y_2;W,X_1) }   (76)
(c) ≥ min_{γ_1+γ_2=γ} { inf_{p(w|x_1,x_2,y_1,y_2): I(X_1;Y_1|W)≤γ_1, I(X_2;Y_2|W,X_1)≤γ_2} I(X_1,Y_1;W)   (77)
+ inf_{p(w̃|x_1,x_2,y_1,y_2): I(X_1;Y_1|W̃)≤γ_1, I(X_2;Y_2|W̃,X_1)≤γ_2} I(X_2,Y_2;W̃,X_1) }   (78)
(d) ≥ min_{γ_1+γ_2=γ} { inf_{p(w|x_1,y_1): I(X_1;Y_1|W)≤γ_1} I(X_1,Y_1;W) + inf_{p(w̃|x_1,x_2,y_2): I(X_2;Y_2|W̃,X_1)≤γ_2} I(X_2,Y_2;W̃,X_1) }   (79)
(e) ≥ min_{γ_1+γ_2=γ} { inf_{p(w_1|x_1,y_1): I(X_1;Y_1|W_1)≤γ_1} I(X_1,Y_1;W_1) + inf_{p(w̃,x̃_1|x_2,y_2): I(X_2;Y_2|W̃,X̃_1)≤γ_2} I(X_2,Y_2;W̃,X̃_1) },   (80)

where Step (a) follows from

I(X_1,X_2,Y_1,Y_2;W) = I(X_1,Y_1;W) + I(X_2,Y_2;W|X_1,Y_1) + I(X_1,Y_1;X_2,Y_2)   (81)
= I(X_1,Y_1;W) + I(X_2,Y_2;W,X_1,Y_1)   (82)
≥ I(X_1,Y_1;W) + I(X_2,Y_2;W,X_1),   (83)

and the constraint is relaxed as follows:

γ ≥ I(X_1,X_2;Y_1,Y_2|W) = I(X_1;Y_1,Y_2|W) + I(X_2;Y_1,Y_2|W,X_1)   (84)
≥ I(X_1;Y_1|W) + I(X_2;Y_2|W,X_1).   (85)

Step (b) follows from splitting the minimization, Step (c) follows from minimizing each subproblem individually, which results in a lower bound to the original problem, Step (d) follows from reducing the number of constraints, resulting in a lower bound, and Step (e) follows from introducing X̃_1 as a random variable to be optimized, whereas before X_1 had a fixed distribution. In other words, the preceding minimization is taken over p(w̃|x_2,y_2,x_1)p(x_1|x_2,y_2), where p(x_1|x_2,y_2) has a fixed distribution, whereas now the minimization is taken over p(w̃|x_2,y_2,x̃_1)p(x̃_1|x_2,y_2), where we also optimize over p(x̃_1|x_2,y_2). Lastly, denoting W_2 = (W̃,X̃_1), this can be expressed as

inf_{p(w|x_1,x_2,y_1,y_2): I(X_1,X_2;Y_1,Y_2|W)≤γ} I(X_1,X_2,Y_1,Y_2;W)
≥ min_{γ_1+γ_2=γ} { inf_{p(w_1|x_1,y_1): I(X_1;Y_1|W_1)≤γ_1} I(X_1,Y_1;W_1) + inf_{p(w_2|x_2,y_2): I(X_2;Y_2|W_2)≤γ_2} I(X_2,Y_2;W_2) }.   (86)
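The key decomposition (81)-(85) consists of standard chain-rule identities and data-processing inequalities, and can be sanity-checked numerically on a jointly Gaussian example, where all entropies reduce to log-determinants. The specific covariances and the choice of W below are arbitrary illustrative assumptions; W is taken jointly Gaussian with the sources.

```python
import numpy as np

def H(C, idx):
    # Differential entropy of the Gaussian marginal on coordinates idx.
    S = C[np.ix_(idx, idx)]
    return 0.5 * np.log((2 * np.pi * np.e) ** len(idx) * np.linalg.det(S))

def MI(C, a, b):                   # I(A; B)
    return H(C, a) + H(C, b) - H(C, a + b)

def CMI(C, a, b, c):               # I(A; B | C)
    return H(C, a + c) + H(C, b + c) - H(C, a + b + c) - H(C, c)

# (X1, Y1) independent of (X2, Y2); W is a noisy linear mixture of all four.
S4 = np.zeros((4, 4))
S4[:2, :2] = [[1.0, 0.6], [0.6, 1.0]]     # cov of (X1, Y1)
S4[2:, 2:] = [[1.0, 0.2], [0.2, 1.0]]     # cov of (X2, Y2)
a = np.array([1.0, 2.0, -1.0, 0.5])       # W = a . (X1, Y1, X2, Y2) + N(0, 1)
C = np.zeros((5, 5))
C[:4, :4] = S4
C[:4, 4] = C[4, :4] = S4 @ a
C[4, 4] = a @ S4 @ a + 1.0

X1, Y1, X2, Y2, W = [0], [1], [2], [3], [4]
# Identity (81)-(82): I(X1,X2,Y1,Y2;W) = I(X1,Y1;W) + I(X2,Y2;W,X1,Y1).
assert np.isclose(MI(C, X1 + Y1 + X2 + Y2, W),
                  MI(C, X1 + Y1, W) + MI(C, X2 + Y2, W + X1 + Y1))
# Inequality (83): dropping Y1 cannot increase the second term.
assert MI(C, X2 + Y2, W + X1 + Y1) >= MI(C, X2 + Y2, W + X1) - 1e-12
# Constraint relaxation (84)-(85).
assert CMI(C, X1 + X2, Y1 + Y2, W) >= \
       CMI(C, X1, Y1, W) + CMI(C, X2, Y2, W + X1) - 1e-12
```

The check exercises the identities for one W; the proof, of course, needs them for every admissible W.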


After proving it for n = 2, we use standard induction. In other words, we assume that the converse holds for n−1, i.e.,

C_γ̄(X^{n−1}; Y^{n−1}) ≥ min_{{γ_i}: Σ_{i=1}^{n−1} γ_i = γ̄} Σ_{i=1}^{n−1} C_{γ_i}(X_i; Y_i),   (87)

and prove it for n as follows:

inf_{p(w|x^n,y^n): I(X^n;Y^n|W)≤γ} I(X^n,Y^n;W)
(f) ≥ inf_{p(w|x^n,y^n): I(X^{n−1};Y^{n−1}|W)+I(X_n;Y_n|W,X^{n−1})≤γ} I(X^{n−1},Y^{n−1};W) + I(X_n,Y_n;W,X^{n−1})   (88)
(g) = min_{γ̄+γ_n=γ} inf_{p(w|x^n,y^n): I(X^{n−1};Y^{n−1}|W)≤Σ_{i=1}^{n−1}γ_i, I(X_n;Y_n|W,X^{n−1})≤γ_n} { I(X^{n−1},Y^{n−1};W) + I(X_n,Y_n;W,X^{n−1}) }   (89)
(h) ≥ min_{γ̄+γ_n=γ} { inf_{p(w|x^n,y^n): I(X^{n−1};Y^{n−1}|W)≤Σ_{i=1}^{n−1}γ_i, I(X_n;Y_n|W,X^{n−1})≤γ_n} I(X^{n−1},Y^{n−1};W)   (90)
+ inf_{p(w̃|x^n,y^n): I(X^{n−1};Y^{n−1}|W̃)≤Σ_{i=1}^{n−1}γ_i, I(X_n;Y_n|W̃,X^{n−1})≤γ_n} I(X_n,Y_n;W̃,X^{n−1}) }   (91)
(i) ≥ min_{γ̄+γ_n=γ} { inf_{p(w|x^{n−1},y^{n−1}): I(X^{n−1};Y^{n−1}|W)≤Σ_{i=1}^{n−1}γ_i} I(X^{n−1},Y^{n−1};W)   (92)
+ inf_{p(w̃|x^n,y_n): I(X_n;Y_n|W̃,X^{n−1})≤γ_n} I(X_n,Y_n;W̃,X^{n−1}) }   (93)
(j) ≥ min_{γ̄+γ_n=γ} { inf_{p(w|x^{n−1},y^{n−1}): I(X^{n−1};Y^{n−1}|W)≤Σ_{i=1}^{n−1}γ_i} I(X^{n−1},Y^{n−1};W)   (94)
+ inf_{p(w̃,x̃^{n−1}|x_n,y_n): I(X_n;Y_n|W̃,X̃^{n−1})≤γ_n} I(X_n,Y_n;W̃,X̃^{n−1}) }   (95)
(k) = min_{γ̄+γ_n=γ} { C_γ̄(X^{n−1};Y^{n−1}) + inf_{p(w_n|x_n,y_n): I(X_n;Y_n|W_n)≤γ_n} I(X_n,Y_n;W_n) }   (96)
(ℓ) ≥ min_{γ̄+γ_n=γ} { min_{{γ_i}: Σ_{i=1}^{n−1}γ_i=γ̄} Σ_{i=1}^{n−1} C_{γ_i}(X_i;Y_i) + C_{γ_n}(X_n;Y_n) }   (97)
= min_{{γ_i}: Σ_{i=1}^n γ_i=γ} Σ_{i=1}^n C_{γ_i}(X_i;Y_i),   (98)

where Step (f) follows from

I(X^n,Y^n;W) = I(X^{n−1},Y^{n−1};W) + I(X_n,Y_n;W|X^{n−1},Y^{n−1}) + I(X_n,Y_n;X^{n−1},Y^{n−1})   (99)
= I(X^{n−1},Y^{n−1};W) + I(X_n,Y_n;W,X^{n−1},Y^{n−1})   (100)
≥ I(X^{n−1},Y^{n−1};W) + I(X_n,Y_n;W,X^{n−1}),   (101)

and the constraint is relaxed as follows:

γ ≥ I(X^n;Y^n|W) = I(X^{n−1};Y^n|W) + I(X_n;Y^n|W,X^{n−1})   (102)
≥ I(X^{n−1};Y^{n−1}|W) + I(X_n;Y_n|W,X^{n−1}).   (103)

Step (g) follows from the same argument as (b), Step (h) follows from the same argument as (c), Step (i) follows from the same argument as (d), Step (j) follows from a similar argument as (e), Step (k) follows from denoting W_n = (W̃,X̃^{n−1}), and Step (ℓ) follows from the induction hypothesis (87).

APPENDIX C

PROOF OF THEOREM 4

The technique used to establish the optimality of Gaussian distributions appears in [16] and is known as factorization of the lower convex envelope. Let us define the following object:

V(K) = inf_{(X,Y): K_{(X,Y)}=K} inf_W h(Y|W) + h(X|W) − (1+λ)h(X,Y|W),   (104)

where λ is a real number with 0 < λ < 1 and K is an arbitrary covariance matrix. Let ℓ_λ(X,Y) = h(Y) + h(X) − (1+λ)h(X,Y), and let ℓ̆_λ(X,Y) = inf_W h(Y|W) + h(X|W) − (1+λ)h(X,Y|W) denote the lower convex envelope of ℓ_λ(X,Y). First, in Section C-A, we prove that the infimum is attained; then, in Section C-B, we prove that a Gaussian W attains the infimum in Equation (104). Together, these two arguments establish Theorem 4.

A. The infimum in Equation (104) is attained

Proposition 10 (Proposition 17 in [16]). Consider a sequence of random variables {X_n, Y_n} such that K_{(X_n,Y_n)} ⪯ K for all n; then the sequence is tight.

Theorem 11 (Prokhorov). If {X_n, Y_n} is a tight sequence, then there exists a subsequence {X_{n_i}, Y_{n_i}} and a limiting pair {X*, Y*} such that {X_{n_i}, Y_{n_i}} ⇒ {X*, Y*}, i.e., the subsequence converges weakly in distribution.

Note that ℓ_λ(X,Y) = h(Y) + h(X) − (1+λ)h(X,Y) can be written as (1+λ)I(X;Y) − λ[h(X) + h(Y)]. Thus, it is enough to show that this expression is lower semi-continuous. We will show this by utilizing the following theorem.

Theorem 12 ([23]). If p_{X_n,Y_n} ⇒ p_{X,Y} and q_{X_n,Y_n} ⇒ q_{X,Y} weakly, then D(p_{X,Y} || q_{X,Y}) ≤ liminf_{n→∞} D(p_{X_n,Y_n} || q_{X_n,Y_n}).

Observe that I(X;Y) = D(p_{X,Y} || q_{X,Y}), where q_{X,Y} = p_X p_Y. For the theorem to hold we need to check the assumptions. First, from Theorem 11, we have p_{X_n,Y_n} ⇒ p_{X,Y} weakly. Second, since the marginal distributions converge weakly whenever the joint distribution converges weakly, we also have q_{X_n,Y_n} ⇒ q_{X,Y} weakly. Therefore,

I(X;Y) ≤ liminf_{n→∞} I(X_n;Y_n).   (105)

To preserve the covariance matrix K_{(X,Y)}, there are three degrees of freedom, plus one degree of freedom coming from minimizing the objective; thus |W| ≤ 4 is enough to attain the minimum.


Let us introduce δ > 0 and define N_δ ~ N(0, δ), independent of {X_n}, X, {Y_n} and Y. From the entropy power inequality, we have

h(X_n + N_δ) ≥ h(X_n)   (106)
h(Y_n + N_δ) ≥ h(Y_n),   (107)

and moreover, for Gaussian perturbations, we have

liminf_{n→∞} h(X_n + N_δ) = h(X + N_δ).   (108)

This results in

liminf_{n→∞} ℓ_λ(X_n,Y_n) = liminf_{n→∞} (1+λ)I(X_n;Y_n) − λ[h(X_n) + h(Y_n)]   (109)
≥ liminf_{n→∞} (1+λ)I(X_n;Y_n) − λ[h(X_n + N_δ) + h(Y_n + N_δ)]   (110)
≥ (1+λ)I(X;Y) − λ[h(X + N_δ) + h(Y + N_δ)],   (111)

where (110) follows from (106) and (107), and (111) follows from (105) and (108). Letting δ → 0, we obtain the weak semicontinuity of our object, liminf_{n→∞} ℓ_λ(X_n,Y_n) ≥ ℓ_λ(X,Y).

B. A Gaussian auxiliary W attains the infimum in Equation (104)

This proof closely follows the arguments in [20]. We include it for completeness. We start by creating two identical and independent copies of the minimizer (W,X,Y), which are (W_1,X_1,Y_1) and (W_2,X_2,Y_2). In addition, let us define X_A = (X_1+X_2)/√2, X_B = (X_1−X_2)/√2, Y_A = (Y_1+Y_2)/√2 and Y_B = (Y_1−Y_2)/√2. Thus, we have

2V(K) = h(X_1,X_2|W_1,W_2) + h(Y_1,Y_2|W_1,W_2) − (1+λ)h(X_1,X_2,Y_1,Y_2|W_1,W_2)   (112)
= h(X_A,X_B|W_1,W_2) + h(Y_A,Y_B|W_1,W_2) − (1+λ)h(X_A,X_B,Y_A,Y_B|W_1,W_2)   (113)
= h(X_A|W_1,W_2) + h(Y_A|W_1,W_2) − (1+λ)h(X_A,Y_A|W_1,W_2)   (114)
+ h(X_B|X_A,Y_A,W_1,W_2) + h(Y_B|X_A,Y_A,W_1,W_2) − (1+λ)h(X_B,Y_B|X_A,Y_A,W_1,W_2)   (115)
+ I(X_A;Y_B|Y_A,W_1,W_2) + I(Y_A;X_B|X_A,W_1,W_2)   (116)
≥ 2V(K) + I(X_A;Y_B|Y_A,W_1,W_2) + I(Y_A;X_B|X_A,W_1,W_2),   (117)

where (113) follows from entropy preservation under bijective transformations and (117) follows from the definition of V(K), using that K_{(X_A,Y_A)} ⪯ K. This implies that

I(X_A;Y_B|Y_A,W_1,W_2) = I(Y_A;X_B|X_A,W_1,W_2) = 0.   (118)

Similarly we get

I(X_B;Y_A|Y_B,W_1,W_2) = I(Y_B;X_A|X_B,W_1,W_2) = 0,   (119)


by switching the roles of index A and B. Through another way of factorization we get

2V (K)= h(X ,X W , W )+ h(Y , Y W , W ) (1 + λ)h(X ,X , Y , Y W , W ) (120) 1 2| 1 2 1 2| 1 2 − 1 2 1 2| 1 2 = h(X ,X W , W )+ h(Y , Y W , W ) (1 + λ)h(X ,X , Y , Y W , W ) (121) A B| 1 2 A B| 1 2 − A B A B| 1 2 = h(X W , W )+ h(Y W , W ) (1 + λ)h(X , Y W , W ) (122) A| 1 2 A| 1 2 − A A| 1 2 + h(X W , W )+ h(Y W , W ) (1 + λ)h(X , Y W , W ) (123) B| 1 2 B| 1 2 − B B| 1 2 I(Y ; Y W , W )+(1+ λ)I(X , Y ; X , Y W , W ) I(X ; X W , W ) (124) − A B| 1 2 A A B B| 1 2 − A B| 1 2 2V (K) I(Y ; Y W , W )+(1+ λ)I(X , Y ; X , Y W , W ) I(X ; X W , W ), (125) ≥ − A B| 1 2 A A B B| 1 2 − A B| 1 2 which implies that

−I(Y_A;Y_B|W_1,W_2) + (1+λ)I(X_A,Y_A;X_B,Y_B|W_1,W_2) − I(X_A;X_B|W_1,W_2) ≤ 0.   (126)

Yet, through another way of factorizing we get

2V(K) = h(X_1,X_2|W_1,W_2) + h(Y_1,Y_2|W_1,W_2) − (1+λ)h(X_1,X_2,Y_1,Y_2|W_1,W_2)   (127)
= h(X_A,X_B|W_1,W_2) + h(Y_A,Y_B|W_1,W_2) − (1+λ)h(X_A,X_B,Y_A,Y_B|W_1,W_2)   (128)
= h(X_A|X_B,Y_B,W_1,W_2) + h(Y_A|X_B,Y_B,W_1,W_2) − (1+λ)h(X_A,Y_A|X_B,Y_B,W_1,W_2)   (129)
+ h(X_B|X_A,Y_A,W_1,W_2) + h(Y_B|X_A,Y_A,W_1,W_2) − (1+λ)h(X_B,Y_B|X_A,Y_A,W_1,W_2)   (130)
+ I(X_B,Y_B;Y_A|W_1,W_2) + I(X_A;Y_B|Y_A,W_1,W_2) − (1+λ)I(X_B,Y_B;X_A,Y_A|W_1,W_2)   (131)
+ I(Y_A;X_B|X_A,W_1,W_2) + I(X_B,Y_B;X_A|W_1,W_2)   (132)
≥ 2V(K) + I(X_B,Y_B;Y_A|W_1,W_2) + I(X_A;Y_B|Y_A,W_1,W_2)   (133)
− (1+λ)I(X_B,Y_B;X_A,Y_A|W_1,W_2) + I(Y_A;X_B|X_A,W_1,W_2) + I(X_B,Y_B;X_A|W_1,W_2),   (134)

which implies that

I(X_B,Y_B;Y_A|W_1,W_2) + I(X_A;Y_B|Y_A,W_1,W_2) − (1+λ)I(X_B,Y_B;X_A,Y_A|W_1,W_2)   (135)
+ I(Y_A;X_B|X_A,W_1,W_2) + I(X_B,Y_B;X_A|W_1,W_2) ≤ 0.   (136)

By substituting (118) and (119), we obtain

I(Y_B;Y_A|W_1,W_2) + I(X_B;X_A|W_1,W_2) − (1+λ)I(X_B,Y_B;X_A,Y_A|W_1,W_2) ≤ 0.   (137)

Equations (126) and (137) imply that

I(Y_B;Y_A|W_1,W_2) + I(X_B;X_A|W_1,W_2) − (1+λ)I(X_B,Y_B;X_A,Y_A|W_1,W_2) = 0.   (138)


Another way of factorizing is the following:

2V(K) = h(X_1,X_2|W_1,W_2) + h(Y_1,Y_2|W_1,W_2) − (1+λ)h(X_1,X_2,Y_1,Y_2|W_1,W_2)   (139)
= h(X_A,X_B|W_1,W_2) + h(Y_A,Y_B|W_1,W_2) − (1+λ)h(X_A,X_B,Y_A,Y_B|W_1,W_2)   (140)
= h(X_A|X_B,Y_B,W_1,W_2) + h(Y_A|X_B,Y_B,W_1,W_2) − (1+λ)h(X_A,Y_A|X_B,Y_B,W_1,W_2)   (141)
+ h(X_B|X_A,W_1,W_2) + h(Y_B|X_A,W_1,W_2) − (1+λ)h(X_B,Y_B|X_A,W_1,W_2)   (142)
+ I(X_A;Y_B|W_1,W_2) + I(Y_A;X_B|Y_B,W_1,W_2) − (1+λ)I(X_B,Y_B;X_A|W_1,W_2)   (143)
+ I(X_B,Y_B;X_A|W_1,W_2)   (144)
≥ 2V(K) + I(X_A;Y_B|W_1,W_2) + I(Y_A;X_B|Y_B,W_1,W_2) − (1+λ)I(X_B,Y_B;X_A|W_1,W_2)   (145)
+ I(X_B,Y_B;X_A|W_1,W_2),   (146)

which implies that

I(X_A;Y_B|W_1,W_2) + I(Y_A;X_B|Y_B,W_1,W_2) − λI(X_B,Y_B;X_A|W_1,W_2) ≤ 0.   (147)

Using (118) and (119), we simplify the above equation into

I(X_A;Y_B|W_1,W_2) − λI(X_B;X_A|W_1,W_2) ≤ 0.   (148)

Switching the roles of A and B we obtain

I(X_B;Y_A|W_1,W_2) − λI(X_A;X_B|W_1,W_2) ≤ 0.   (149)

By factorizing for the last time we get

2V(K) = h(X_1,X_2|W_1,W_2) + h(Y_1,Y_2|W_1,W_2) − (1+λ)h(X_1,X_2,Y_1,Y_2|W_1,W_2)   (150)
= h(X_A,X_B|W_1,W_2) + h(Y_A,Y_B|W_1,W_2) − (1+λ)h(X_A,X_B,Y_A,Y_B|W_1,W_2)   (151)
= h(X_A|X_B,Y_B,W_1,W_2) + h(Y_A|X_B,Y_B,W_1,W_2) − (1+λ)h(X_A,Y_A|X_B,Y_B,W_1,W_2)   (152)
+ h(X_B|Y_A,W_1,W_2) + h(Y_B|Y_A,W_1,W_2) − (1+λ)h(X_B,Y_B|Y_A,W_1,W_2)   (153)
+ I(Y_A;X_B,Y_B|W_1,W_2) − (1+λ)I(X_B,Y_B;Y_A|W_1,W_2)   (154)
+ I(Y_B;X_A|W_1,W_2) + I(Y_B;X_A|X_B,W_1,W_2)   (155)
≥ 2V(K) − λI(X_B,Y_B;Y_A|W_1,W_2)   (156)
+ I(Y_B;X_A|W_1,W_2) + I(Y_B;X_A|X_B,W_1,W_2),   (157)

which implies that

−λI(X_B,Y_B;Y_A|W_1,W_2) + I(Y_B;X_A|W_1,W_2) + I(Y_B;X_A|X_B,W_1,W_2) ≤ 0.   (158)

The above inequality is simplified by using (118) and (119) into

−λI(Y_B;Y_A|W_1,W_2) + I(Y_B;X_A|W_1,W_2) ≤ 0.   (159)


Switching the roles of A and B we obtain

−λI(Y_A;Y_B|W_1,W_2) + I(Y_A;X_B|W_1,W_2) ≤ 0.   (160)

Observe that

I(X_A;Y_B|X_B,Y_A,W_1,W_2) = I(X_A;Y_B,X_B|Y_A,W_1,W_2) − I(X_A;X_B|Y_A,W_1,W_2)   (161)
= I(X_A,Y_A;Y_B,X_B|W_1,W_2) − I(Y_A;Y_B,X_B|W_1,W_2)   (162)
− I(X_A,Y_A;X_B|W_1,W_2) + I(Y_A;X_B|W_1,W_2)   (163)
= I(X_A,Y_A;Y_B,X_B|W_1,W_2) − I(Y_A;Y_B|W_1,W_2)   (164)
− I(X_A;X_B|W_1,W_2) + I(Y_A;X_B|W_1,W_2)   (165)
= (1/(1+λ) − 1) I(X_A;X_B|W_1,W_2) + I(Y_A;X_B|W_1,W_2)   (166)
+ (1/(1+λ) − 1) I(Y_A;Y_B|W_1,W_2) ≥ 0,   (167)

where (165) follows from (118) and (119), and (167) follows from (138); from the non-negativity of the conditional mutual information term we obtain

(1/(1+λ) − 1) I(X_A;X_B|W_1,W_2) + I(Y_A;X_B|W_1,W_2) + (1/(1+λ) − 1) I(Y_A;Y_B|W_1,W_2) ≥ 0.   (168)

Using (149), the above equation simplifies into

(1/(1+λ) − 1) I(X_A;X_B|W_1,W_2) + λI(X_A;X_B|W_1,W_2) + (1/(1+λ) − 1) I(Y_A;Y_B|W_1,W_2) ≥ 0,   (169)

which simplifies into

λI(X_A;X_B|W_1,W_2) ≥ I(Y_A;Y_B|W_1,W_2),   (170)

from the fact that λ > 0. Using Equation (160) in (168) we obtain

(1/(1+λ) − 1) I(X_A;X_B|W_1,W_2) + λI(Y_A;Y_B|W_1,W_2) + (1/(1+λ) − 1) I(Y_A;Y_B|W_1,W_2) ≥ 0,   (171)

which simplifies into

λI(Y_A;Y_B|W_1,W_2) ≥ I(X_A;X_B|W_1,W_2).   (172)

Combining (170) and (172) we obtain

(1 − λ²) I(Y_A;Y_B|W_1,W_2) ≤ 0,   (173)
(1 − λ²) I(X_A;X_B|W_1,W_2) ≤ 0.   (174)

Knowing that 0 < λ < 1, we deduce that I(Y_A;Y_B|W_1,W_2) = I(X_A;X_B|W_1,W_2) = 0. Finally, using (138) we obtain

I(X_B,Y_B;X_A,Y_A|W_1,W_2) = 0.   (175)

In addition, let us introduce the following theorem before arriving at the concluding result.


Theorem 13 (Corollary to Theorem 1 in [24]). If Z_1 and Z_2 are independent multidimensional random column vectors, and if Z_1 + Z_2 and Z_1 − Z_2 are independent, then Z_1 and Z_2 are normally distributed with identical covariances.

Thus, the following statements are true:

• The pair (X_1,Y_1) and the pair (X_2,Y_2) are conditionally independent given W_1 = w_1, W_2 = w_2, by assumption.
• The pair ((X_1+X_2)/√2, (Y_1+Y_2)/√2) and the pair ((X_1−X_2)/√2, (Y_1−Y_2)/√2) are conditionally independent given W_1 = w_1, W_2 = w_2. This follows from (175).

By applying Theorem 13 to the conditional independence established in Equation (175), we can infer that (X,Y)|{W = w} ~ N(0, K_w), where K_w might depend on the realization of W = w. We will now argue that this is not the case. To summarize briefly, we have shown the existence part; thus, by choosing W to be the trivial random variable, a single Gaussian (i.e., not a Gaussian mixture) is one of the possible minimizers. Let us suppose that there are two Gaussian minimizers N(0, K_{w_1}) and N(0, K_{w_2}), where K_{w_1} ≠ K_{w_2}. Consider the random variable (W,X,Y) where (X,Y)|{W = w_1} ~ N(0, K_{w_1}) and (X,Y)|{W = w_2} ~ N(0, K_{w_2}). Then the triple (W,X,Y) also attains V(K) and satisfies the covariance constraint. At the same time, we showed that the sum and the difference are also minimizers and that they must be independent of each other, which happens only when K_{w_1} = K_{w_2}. In other words, K_w does not depend on the realization W = w, and (W,X,Y) is a single Gaussian (i.e., not a Gaussian mixture). We established that (W,X,Y) is a unique Gaussian minimizer, and thus

V(K) = h(X|W) + h(Y|W) − (1+λ)h(X,Y|W).   (176)
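The role of identical covariances in Theorem 13, and hence in the conclusion K_{w_1} = K_{w_2} above, can be illustrated by a small simulation: for independent Gaussian vectors Z_1 ~ N(0, K_1) and Z_2 ~ N(0, K_2), the cross-covariance of the sum and the difference equals K_1 − K_2, so the sum and the difference can only be independent when K_1 = K_2. The covariance values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
K1 = np.array([[2.0, 0.5], [0.5, 1.0]])
K2 = np.array([[1.0, -0.3], [-0.3, 1.5]])
n = 500_000
Z1 = rng.multivariate_normal(np.zeros(2), K1, size=n)
Z2 = rng.multivariate_normal(np.zeros(2), K2, size=n)

# Sample estimate of cov(Z1 + Z2, Z1 - Z2); analytically this equals K1 - K2,
# which is nonzero here, so sum and difference are not independent.
cross = (Z1 + Z2).T @ (Z1 - Z2) / n
assert np.allclose(cross, K1 - K2, atol=0.05)
```

This is only the easy (Gaussian) direction; the strength of Theorem 13 is the converse, which cannot be checked by simulation.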

Furthermore, there exists a decomposition (X,Y) = W + (X′,Y′) ~ N(0, K), where W is independent of (X′,Y′), with W ~ N(0, K − K′) and (X′,Y′) ~ N(0, K′). Then,

V(K) = h(X′) + h(Y′) − (1+λ)h(X′,Y′),   (177)

thus establishing Theorem 4, because

V(K) ≤ inf_W h(X|W) + h(Y|W) − (1+λ)h(X,Y|W),   (178)

by the definition of V(K).
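The repeated factorizations above rest on exact entropy identities, e.g., that the rotation to (X_A, Y_A, X_B, Y_B) preserves the objective (the step from (112) to (113)) and that the residual mutual-information terms in (114)-(116) balance exactly. These identities can be checked numerically for a Gaussian example, with the conditioning on (W_1, W_2) dropped (trivial W) and arbitrarily chosen covariances, since Gaussian entropies are log-determinants:

```python
import numpy as np

def H(C, idx):
    # Differential entropy of the Gaussian marginal on coordinates idx.
    S = C[np.ix_(idx, idx)]
    return 0.5 * np.log((2 * np.pi * np.e) ** len(idx) * np.linalg.det(S))

def Hc(C, a, b):                    # h(A | B)
    return H(C, a + b) - H(C, b)

def Ic(C, a, b, c):                 # I(A; B | C)
    return Hc(C, a, c) + Hc(C, b, c) - Hc(C, a + b, c)

lam = 0.4
# (X1, Y1) independent of (X2, Y2), with different covariances so that the
# A- and B-parts after the rotation are correlated (nontrivial residuals).
S4 = np.zeros((4, 4))
S4[:2, :2] = [[1.0, 0.6], [0.6, 1.0]]
S4[2:, 2:] = [[2.0, 0.4], [0.4, 1.5]]

# Orthonormal map (X1,Y1,X2,Y2) -> (XA,YA,XB,YB), with XA = (X1+X2)/sqrt(2), etc.
T = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, -1, 0],
              [0, 1, 0, -1]]) / np.sqrt(2)
S2 = T @ S4 @ T.T
XA, YA, XB, YB = [0], [1], [2], [3]

# (112) = (113): the objective is preserved by the rotation.
orig = H(S4, [0, 2]) + H(S4, [1, 3]) - (1 + lam) * H(S4, [0, 1, 2, 3])
rot = H(S2, XA + XB) + H(S2, YA + YB) - (1 + lam) * H(S2, XA + YA + XB + YB)
assert np.isclose(orig, rot)

# (113) = (114) + (115) + (116): the factorization is exact.
t114 = H(S2, XA) + H(S2, YA) - (1 + lam) * H(S2, XA + YA)
t115 = (Hc(S2, XB, XA + YA) + Hc(S2, YB, XA + YA)
        - (1 + lam) * Hc(S2, XB + YB, XA + YA))
t116 = Ic(S2, XA, YB, YA) + Ic(S2, YA, XB, XA)
assert np.isclose(rot, t114 + t115 + t116)
```

The inequality step (117), by contrast, uses the definition of V(K) and cannot be reduced to a determinant identity.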

APPENDIX D

PROOF OF LEMMA 5

Let K′ be parametrized as K′ = [[σ_X², qσ_Xσ_Y], [qσ_Xσ_Y, σ_Y²]] ⪰ 0. Then, the problem is equivalent to the following one:

min_{K′: 0⪯K′⪯[[1,ρ],[ρ,1]]} h(X′) + h(Y′) − (1+λ)h(X′,Y′)
= min_{(σ_X,σ_Y,q)∈A_ρ} (1/2) log( (2πe)² σ_X² σ_Y² ) − ((1+λ)/2) log( (2πe)² σ_X² σ_Y² (1−q²) ),   (179)

where the set

A_ρ = { (σ_X, σ_Y, q) : [[σ_X² − 1, qσ_Xσ_Y − ρ], [qσ_Xσ_Y − ρ, σ_Y² − 1]] ⪯ 0 }.   (180)

A 2×2 matrix is negative semi-definite if and only if its trace is non-positive and its determinant is non-negative. Thus, we can rewrite the set as

A_ρ = { (σ_X, σ_Y, q) : σ_X² + σ_Y² ≤ 2, (1−q²)σ_X²σ_Y² + 2ρqσ_Xσ_Y + 1 − ρ² − (σ_X² + σ_Y²) ≥ 0 }.   (181)

By making use of σ_X² + σ_Y² ≥ 2σ_Xσ_Y, we derive that A_ρ ⊆ B_ρ, where

B_ρ = { (σ_X, σ_Y, q) : σ_Xσ_Y ≤ 1, (1−q²)σ_X²σ_Y² + 2ρqσ_Xσ_Y + 1 − ρ² − 2σ_Xσ_Y ≥ 0 }.   (182)

We further reparametrize by defining σ² = σ_Xσ_Y, thus

D_ρ = { (σ², q) : σ² ≤ 1, (σ²(1−q) − 1 + ρ)(σ²(1+q) − 1 − ρ) ≥ 0 }.   (183)

The second inequality in the definition of the set D_ρ has roots σ² = (1+ρ)/(1+q) and σ² = (1−ρ)/(1−q); thus the inequality is true if σ² does not lie between these two roots. Thus, we can rewrite the set D_ρ as

D_ρ = { (σ², q) : σ²(1−q) ≤ 1−ρ if ρ ≥ q, and σ²(1+q) ≤ 1+ρ if ρ < q }.   (184)
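The algebraic factorization used to pass from (182) to (183), namely (1−q²)σ⁴ + 2ρqσ² + 1 − ρ² − 2σ² = (σ²(1−q) − (1−ρ))(σ²(1+q) − (1+ρ)) with σ² = σ_Xσ_Y, can be verified at random points; a quick sketch, not part of the proof:

```python
import random

random.seed(0)
for _ in range(1000):
    s = random.uniform(0.0, 2.0)       # plays the role of sigma^2
    q = random.uniform(-1.0, 1.0)
    rho = random.uniform(-1.0, 1.0)
    lhs = (1 - q**2) * s**2 + 2 * rho * q * s + 1 - rho**2 - 2 * s
    rhs = (s * (1 - q) - (1 - rho)) * (s * (1 + q) - (1 + rho))
    assert abs(lhs - rhs) < 1e-9       # the quartic factors exactly
```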


By equating (191) and (192) we deduce that q* = λ. Since λ > 0, then μ* ≠ 0, and by using (190) we get σ*² = (1−ρ)/(1−λ). In addition, μ* = λ/(1−ρ). Since the KKT conditions are satisfied by q*, σ*² and μ*, strong duality holds; thus

min_{σ²,q} f(λ,σ²,q) + μ(σ²(1−q) − 1 + ρ) = f(λ, (1−ρ)/(1−λ), λ)
= (1/2) log( 1/(1−λ²) ) − (λ/2) log( (2πe)² (1−ρ)² (1+λ)/(1−λ) ).   (193)

By combining (179), (185), (187) and (193) we get the desired lower bound.

For the case ρ < q, let us optimize over σ² for any fixed q. The function f is decreasing in σ², and it is also convex in σ². Since the objective is continuous in σ² and the constraint is linear for any fixed q, the optimal choice is σ² = (1+ρ)/(1+q). Thus,

min_{(σ²,q)∈D_ρ} f(λ,σ²,q) ≥ min_{q∈[ρ,1]} f(λ, (1+ρ)/(1+q), q).   (194)

The function on the right-hand side can be written as

f(λ, (1+ρ)/(1+q), q) = (1/2) log( (2πe)² (1+ρ)²/(1+q)² ) − ((1+λ)/2) log( (2πe)² (1+ρ)²(1−q)/(1+q) ).   (195)

The function is convex and increasing in q for q ∈ [ρ,1]:

∂f/∂q = (q+λ)/(1−q²) > 0,   (196)
∂²f/∂q² = (1 + q² + 2λq)/(1−q²)² > 0,   (197)

thus the value q = ρ is guaranteed to give the minimum. To conclude, we show that f(λ, (1−ρ)/(1−λ), λ) ≤ f(λ, 1, ρ) for λ ≤ ρ. To show this we define

h(λ) := f(λ, (1−ρ)/(1−λ), λ) − f(λ, 1, ρ)   (198)
= (1/2) log( (1−ρ²)/(1−λ²) ) − (λ/2) log( (1+λ)(1−ρ)/((1−λ)(1+ρ)) ),   (199)

and this newly defined function is increasing in λ,

∂h/∂λ = −(1/2) log( (1+λ)(1−ρ)/((1−λ)(1+ρ)) ) ≥ 0, for λ ≤ ρ,   (200)

and concave in λ,

∂²h/∂λ² = −1/(1−λ²) < 0,   (201)

thus h(λ) ≤ h(ρ) = 0. Then f(λ, 1, ρ) ≥ f(λ, (1−ρ)/(1−λ), λ). The argument also goes through for the case when ρ is negative, which completes the proof.
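The claimed minimizer of Lemma 5 (q* = λ, σ*² = (1−ρ)/(1−λ), for 0 < λ ≤ ρ) and the value (193) can be cross-checked against a brute-force grid minimization over the relaxed feasible set D_ρ of (183), writing the objective of (179) in terms of σ² = σ_Xσ_Y. The grid resolution and the parameter values λ = 0.3, ρ = 0.6 are arbitrary illustrative choices.

```python
import numpy as np

lam, rho = 0.3, 0.6                     # requires 0 < lam <= rho < 1
c = (2 * np.pi * np.e) ** 2

def f(s2, q):
    # Objective of (179) with s2 = sigma_X * sigma_Y and correlation q.
    return 0.5 * np.log(c * s2**2) - 0.5 * (1 + lam) * np.log(c * s2**2 * (1 - q**2))

# Brute-force grid over the set D_rho from (183).
q = np.linspace(-0.99, 0.99, 801)
s2 = np.linspace(1e-3, 1.0, 800)
Q, S = np.meshgrid(q, s2)
feasible = (S * (1 - Q) - (1 - rho)) * (S * (1 + Q) - (1 + rho)) >= 0
grid_min = f(S, Q)[feasible].min()

closed = 0.5 * np.log(1 / (1 - lam**2)) \
         - 0.5 * lam * np.log(c * (1 - rho)**2 * (1 + lam) / (1 - lam))
# f at the claimed minimizer equals the closed form (193) ...
assert abs(f((1 - rho) / (1 - lam), lam) - closed) < 1e-9
# ... and no feasible grid point does better (up to grid resolution).
assert grid_min >= closed - 1e-9
assert grid_min - closed < 5e-3
```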

ACKNOWLEDGMENT

This work was supported in part by the Swiss National Science Foundation under Grant 169294, Grant P2ELP2 165137.


REFERENCES

[1] A. Wyner, "The common information of two dependent random variables," IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 163–179, March 1975.
[2] R. M. Gray and A. D. Wyner, "Source coding for a simple network," The Bell System Technical Journal, vol. 53, no. 9, pp. 1681–1721, 1974.
[3] G. Xu, W. Liu, and B. Chen, "Wyner's common information for continuous random variables - A lossy source coding interpretation," in Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, March 2011.
[4] ——, "A lossy source coding interpretation of Wyner's common information," IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 754–768, February 2016.
[5] P. Yang and B. Chen, "Wyner's common information in Gaussian channels," in IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, August 2014.
[6] S.-L. Huang, X. Xu, L. Zheng, and G. W. Wornell, "A local characterization for Wyner common information," in IEEE International Symposium on Information Theory, Los Angeles, California, USA, June 2020.
[7] H. S. Witsenhausen, "Values and bounds for the common information of two discrete random variables," SIAM Journal on Applied Mathematics, vol. 31, no. 2, September 1976.
[8] L. Yu and V. Y. F. Tan, "Wyner's common information under Rényi divergence measures," IEEE Transactions on Information Theory, vol. 64, no. 5, pp. 3616–3632, 2018.
[9] G. Op 't Veld and M. Gastpar, "Total correlation of Gaussian vector sources on the Gray-Wyner network," in Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, September 2016.
[10] A. Lapidoth and M. Wigger, "Conditional and relevant common information," in IEEE International Conference on the Science of Electrical Engineering (ICSEE), Eilat, Israel, 2016.
[11] C.-Y. Wang, S. H. Lim, and M. Gastpar, "Information-theoretic caching: Sequential coding for computing," IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 6393–6406, August 2016.
[12] S. Satpathy and P. Cuff, "Gaussian secure source coding and Wyner's common information," in IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, October 2015.
[13] K. B. Viswanatha, E. Akyol, and K. Rose, "The lossy common information of correlated sources," IEEE Transactions on Information Theory, vol. 60, no. 6, pp. 3238–3253, June 2014.
[14] L. Yu, H. Li, and C. W. Chen, "Generalized common informations: Measuring commonness by the conditional maximal correlation." [Online]. Available: https://arxiv.org/abs/1610.09289
[15] G. R. Kumar, C. T. Li, and A. E. Gamal, "Exact common information," in IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, August 2014.
[16] Y. Geng and C. Nair, "The capacity region of the two-receiver Gaussian vector broadcast channel with private and common messages," IEEE Transactions on Information Theory, vol. 60, no. 4, April 2014.
[17] M. Gastpar and E. Sula, "Common information components analysis," in Proceedings of the 2020 Information Theory and Applications (ITA) Workshop, San Diego, USA, February 2020.
[18] H. S. Witsenhausen, "Values and bounds for the common information of two discrete random variables," SIAM J. Appl. Math., vol. 31, no. 2, pp. 313–333, September 1976.
[19] A. Wyner, "The common information of two dependent random variables," IEEE Transactions on Information Theory, vol. 21, no. 2, pp. 163–179, March 1975.
[20] C. Nair, "An extremal inequality related to hypercontractivity of Gaussian random variables." [Online]. Available: http://chandra.ie.cuhk.edu.hk/pub/papers/manuscripts/ITA14.pdf
[21] R. Gray, "Conditional rate-distortion theory," Stanford University, Tech. Rep., 1972.
[22] J. Thomas, "Feedback can at most double Gaussian multiple access capacity (Corresp.)," IEEE Transactions on Information Theory, vol. 33, no. 5, pp. 711–716, September 1987.
[23] E. Posner, "Random coding strategies for minimum entropy," IEEE Transactions on Information Theory, vol. 21, no. 4, pp. 388–391, 1975.
[24] S. G. Ghurye and I. Olkin, "A characterization of the multivariate normal distribution," Ann. Math. Statist., vol. 33, no. 2, pp. 533–541, 1962.
