Gaussian Universal Features, Canonical Correlations, and Common Information

Shao-Lun Huang
DSIT Research Center, TBSI, SZ, China 518055
Email: [email protected]

Gregory W. Wornell, Lizhong Zheng
Dept. EECS, MIT, Cambridge, MA 02139 USA
Email: {gww, lizhong}@mit.edu

Abstract—We address the problem of optimal feature selection for a Gaussian vector pair in the weak dependence regime, when the inference task is not known in advance. In particular, we show that multiple formulations all yield the same solution, and correspond to the singular value decomposition (SVD) of the canonical dependence matrix. Our results reveal key connections between canonical correlation analysis (CCA), principal component analysis (PCA), the Gaussian information bottleneck, Wyner's common information, and the Ky Fan (nuclear) norms.

I. INTRODUCTION

Typical applications of machine learning involve data whose dimension is high relative to the amount of training data that is available. As a consequence, it is necessary to perform dimensionality reduction before the regression or other inference task is carried out. This reduction corresponds to extracting a set of comparatively low-dimensional features from the data. When the inference task is fully specified, classical statistics establishes that the appropriate features take the form of a (minimal) sufficient statistic. However, in most contemporary settings, the task is not known in advance (or, equivalently, there are multiple tasks), and we require a set of universal features that are, in an appropriate sense, uniformly good.

With this motivation, the Gaussian universal feature selection problem can be expressed as: given a pair of high-dimensional jointly distributed Gaussian data vectors $X \in \mathbb{R}^{K_X}$ and $Y \in \mathbb{R}^{K_Y}$, how should we choose low-dimensional features $f(X)$ and $g(Y)$ before knowing the desired inference task, so as to ensure that after the task is revealed, inference based on the features performs as well as possible?

Mathematically, we express this problem as one of making inference about latent variables $U, V \in \mathbb{R}^k$, for $1 \le k \le K \triangleq \min\{K_X, K_Y\}$, in the Gauss-Markov chain
$$U \leftrightarrow X \leftrightarrow Y \leftrightarrow V, \qquad (1)$$
where the (Gaussian) distributions of these variables, i.e., $P_U$, $P_{X|U}$, $P_V$, and $P_{Y|V}$, are not known at the time the features are selected. Our results can be viewed as an extension of the framework for discrete variables described in [1]. We note in advance that, to simplify the exposition, we treat $P_{X,Y}$ as known, though in practice we must estimate the relevant aspects of this distribution from training samples $\{(x_1, y_1), \dots, (x_n, y_n)\}$.

The contributions of this paper are summarized as follows. To deal with inference about unknown attributes, in Section III we define a rotation-invariant ensemble (RIE) that assigns a uniform prior to the unknown attributes, and formulate a universal feature selection problem that selects the features minimizing the MSE averaged over the RIE. We show that the optimal features can be obtained from the SVD of a canonical dependence matrix (CDM). In addition, we demonstrate that in the weak dependence regime this SVD also provides the optimal features and solutions for several related problems, such as CCA, the information bottleneck, and Wyner's common information, for jointly Gaussian variables. This reveals important connections between information theory and machine learning problems.

II. GAUSSIAN LOCAL ANALYSIS FRAMEWORK

In the sequel, we restrict our attention to zero-mean variables, for simplicity of exposition. In the model of interest, $X \in \mathbb{R}^{K_X}$ and $Y \in \mathbb{R}^{K_Y}$. Moreover,
$$Z = \begin{bmatrix} X \\ Y \end{bmatrix} \sim N(0, \Lambda_Z), \qquad \Lambda_Z = \mathbb{E}\bigl[Z Z^T\bigr] = \begin{bmatrix} \Lambda_X & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_Y \end{bmatrix},$$
so $X \sim N(0, \Lambda_X)$, $Y \sim N(0, \Lambda_Y)$, $\Lambda_{YX} = \mathbb{E}\bigl[Y X^T\bigr]$, and $\Lambda_{XY} = \Lambda_{YX}^T$. We assume without loss of generality that $\Lambda_X$ and $\Lambda_Y$ are (strictly) positive definite. The joint distribution takes the form
$$P_{X,Y}(x, y) = P_Z(z) = \frac{|\Lambda_Z|^{-1/2}}{(2\pi)^{K_Z/2}} \exp\Bigl(-\tfrac{1}{2} z^T \Lambda_Z^{-1} z\Bigr), \qquad (2)$$
where $K_Z = K_X + K_Y$, and with $|\cdot|$ denoting the determinant of its argument. It will be convenient to normalize $X$ and $Y$ according to
$$\tilde{X} = \Lambda_X^{-1/2} X \quad\text{and}\quad \tilde{Y} = \Lambda_Y^{-1/2} Y,$$
so that
$$\tilde{Z} = \begin{bmatrix} \tilde{X} \\ \tilde{Y} \end{bmatrix}, \qquad \Lambda_{\tilde{Z}} = \begin{bmatrix} I & B^T \\ B & I \end{bmatrix}, \qquad (3)$$
where
$$B \triangleq \Lambda_Y^{-1/2} \Lambda_{YX} \Lambda_X^{-1/2} = \Lambda_Y^{-1/2} \Gamma_{Y|X} \Lambda_X^{1/2} \qquad (4)$$
is called the canonical dependence matrix (CDM); here $\Gamma_{Y|X} \triangleq \Lambda_{YX} \Lambda_X^{-1}$ denotes the matrix of linear regression coefficients of $Y$ on $X$.
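To make the normalization (3) and the definition (4) concrete, here is a minimal numerical sketch (our own illustration, assuming numpy; the helper names `inv_sqrt` and `cdm` and the toy covariance are ours, not the paper's). It builds the CDM from a joint covariance and confirms that its singular values never exceed one, consistent with the positive semidefiniteness of (3).

```python
import numpy as np

def inv_sqrt(M):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cdm(Lam_Z, KX):
    """Canonical dependence matrix B of (4), from the joint covariance Lam_Z in (2)."""
    Lam_X  = Lam_Z[:KX, :KX]
    Lam_Y  = Lam_Z[KX:, KX:]
    Lam_YX = Lam_Z[KX:, :KX]
    return inv_sqrt(Lam_Y) @ Lam_YX @ inv_sqrt(Lam_X)

# toy joint covariance: any (strictly) positive definite Lam_Z will do
rng = np.random.default_rng(0)
KX, KY = 4, 3
A = rng.standard_normal((KX + KY, KX + KY))
Lam_Z = A @ A.T + 0.5 * np.eye(KX + KY)

B = cdm(Lam_Z, KX)
sigma = np.linalg.svd(B, compute_uv=False)
print(sigma)                      # singular values; all <= 1 since (3) is positive semidefinite
assert np.all(sigma <= 1 + 1e-9)
```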
The CDM plays the same key role in Gaussian local analysis that the divergence transfer matrix (DTM) plays in the discrete case [1]. We note that the MMSE estimate of $\tilde{Y}$ based on $\tilde{X}$ is $\hat{\tilde{Y}}(\tilde{X}) = B \tilde{X}$, and the associated error $\tilde{\nu} \triangleq \tilde{Y} - \hat{\tilde{Y}}(\tilde{X})$ has covariance $\mathbb{E}\bigl[\tilde{\nu}\tilde{\nu}^T\bigr] = I - B B^T$, so the resulting MSE is $\tilde{\sigma}_e^2 = \mathrm{tr}\bigl(I - B B^T\bigr) = K_Y - \|B\|_F^2$, with $\|\cdot\|_F$ denoting the Frobenius norm. The SVD of $B$ takes the form
$$B = \Psi^Y \Sigma \bigl(\Psi^X\bigr)^T = \sum_{i=1}^{K} \sigma_i\, \psi_i^Y \bigl(\psi_i^X\bigr)^T, \qquad (5)$$
where $K = \min\{K_X, K_Y\}$ and where we order the singular values according to $\sigma_1 \ge \cdots \ge \sigma_K$. Note that since (3) is positive semidefinite, it follows that $\sigma_i \le 1$ for $i = 1, \dots, K$.

As in the discrete case [1], it is useful to define a local analysis regime for such variables. In particular, we make use of the following notion of neighborhood.

Definition 1 (Gaussian $\epsilon$-Neighborhood). For a given $\epsilon > 0$, the $\epsilon$-neighborhood of a $K_0$-dimensional Gaussian distribution $P_0 = N(0, \Lambda_0)$ is the set of Gaussian distributions in a covariance-divergence ball of radius $\epsilon$ about $P_0$, i.e.,
$$\mathcal{G}_\epsilon^{K_0}(P_0) \triangleq \Bigl\{ P = N(0, \Lambda) \colon \bigl\|\Lambda_0^{-1/2} (\Lambda - \Lambda_0) \Lambda_0^{-1/2}\bigr\|_F^2 \le \epsilon^2 \Bigr\}.$$

Note that $P_{X,Y}$ lies in an $\epsilon$-neighborhood of $P_X P_Y$ if and only if $P_{\tilde{X},\tilde{Y}}$ lies in an $\epsilon$-neighborhood of $P_{\tilde{X}} P_{\tilde{Y}}$. Hence, $P_{X,Y} \in \mathcal{G}_\epsilon^{K_X + K_Y}(P_X P_Y)$ when $\|\Lambda_{\tilde{Z}} - I\|_F^2 \le \epsilon^2 (K_X + K_Y)$. We conclude that the neighborhood constraint limits how much the mean-square error (MSE) in the estimate of $\tilde{Y}$ based on observing $\tilde{X}$ can be reduced relative to the MSE in the estimate of $\tilde{Y}$ based on no data. In the rest of this paper, we focus on the regime in which $\epsilon$ is small. The K-L divergence and mutual information in this regime admit the following useful asymptotic expressions.

Lemma 1. In the weak dependence regime,
$$D\bigl(P_{Y|X}(\cdot|x) \,\|\, P_Y\bigr) = \tfrac{1}{2} \|B \tilde{x}\|^2 + o(\epsilon^2),$$
and
$$I(X; Y) = \tfrac{1}{2} \sum_{i=1}^{K} \sigma_i^2 + o(\epsilon^2). \qquad (6)$$

Proof. This is straightforward from the fact that, for an arbitrary matrix $A$, $\ln\bigl|I - \epsilon^2 A A^T\bigr| = -\epsilon^2 \|A\|_F^2 + o(\epsilon^2)$.

To interpret (6), consider the modal decomposition of $P_{X,Y}$. In particular, observe that as $\epsilon \to 0$,
$$\Lambda_{\tilde{Z}}^{-1} = \begin{bmatrix} I & -B^T \\ -B & I \end{bmatrix} + o(\epsilon). \qquad (7)$$
Hence,
$$P_{X,Y}(x, y) = P_X(x)\, P_Y(y) \left( \prod_{i=1}^{K} e^{\sigma_i f_i^*(x)\, g_i^*(y)} \right) \bigl(1 + o(\epsilon)\bigr), \qquad (8a)$$
where $f_i^*$ and $g_i^*$ are (linear) functions given by
$$f_i^*(x) = \underbrace{\bigl(\psi_i^X\bigr)^T \Lambda_X^{-1/2}}_{\triangleq\, f_i^{*T}}\, x \quad\text{and}\quad g_i^*(y) = \underbrace{\bigl(\psi_i^Y\bigr)^T \Lambda_Y^{-1/2}}_{\triangleq\, g_i^{*T}}\, y. \qquad (8b)$$
Moreover, using (8b) with (4) and (5) we obtain the covariance expansion
$$\Lambda_Y^{-1} \Lambda_{YX} \Lambda_X^{-1} = \Lambda_Y^{-1/2} B\, \Lambda_X^{-1/2} = \sum_{i=1}^{K} \sigma_i\, g_i^* f_i^{*T}.$$
These linear functions can be computed from $P_{X,Y}$ (or estimated from training data) using a linearly-constrained version of the alternating conditional expectation (ACE) algorithm [2], which can be interpreted as a power iteration method for computing this SVD.

From this perspective, we see that approximations to $P_{X,Y}$ can be obtained by truncating the representation (8) to the first $k < K$ of the terms in the product, yielding
$$P_{X^{(k)}, Y^{(k)}}(x, y) = P_X(x)\, P_Y(y) \left( \prod_{i=1}^{k} e^{\sigma_i f_i^*(x)\, g_i^*(y)} \right) \bigl(1 + o(\epsilon)\bigr).$$
This corresponds to jointly Gaussian $X^{(k)}$ and $Y^{(k)}$ with the same marginals as $X$ and $Y$, respectively, but
$$\Lambda_{Y^{(k)} X^{(k)}} = \Lambda_Y^{1/2} \underbrace{\left( \sum_{i=1}^{k} \sigma_i\, \psi_i^Y \bigl(\psi_i^X\bigr)^T \right)}_{\triangleq\, B^{(k)}} \Lambda_X^{1/2}, \qquad (9)$$
so
$$I\bigl(X^{(k)}; Y^{(k)}\bigr) = \frac{1}{2} \sum_{i=1}^{k} \sigma_i^2 + o(\epsilon^2).$$
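The local expressions above are easy to probe numerically. The following sketch (again a toy setup of our own, assuming numpy) compares the exact Gaussian mutual information, which equals $-\tfrac12\ln|I - BB^T|$ in the normalized coordinates of (3), against the weak-dependence approximation (6), and then forms the rank-$k$ matrix playing the role of $B^{(k)}$ in (9).

```python
import numpy as np

def exact_gaussian_mi(B):
    """I(X;Y) = -1/2 ln|I - B B^T| for the normalized model (3)."""
    KY = B.shape[0]
    return -0.5 * np.linalg.slogdet(np.eye(KY) - B @ B.T)[1]

def local_mi(B):
    """Weak-dependence approximation (6): (1/2) sum_i sigma_i^2."""
    s = np.linalg.svd(B, compute_uv=False)
    return 0.5 * np.sum(s**2)

rng = np.random.default_rng(1)
KX, KY, eps = 5, 4, 0.05
B = eps * rng.standard_normal((KY, KX))        # a weakly dependent CDM (sigma_i = O(eps))
print(exact_gaussian_mi(B), local_mi(B))       # agree up to o(eps^2)

# rank-k truncation, i.e., the role of B^{(k)} in (9), expressed in whitened coordinates
k = 2
U, s, Vt = np.linalg.svd(B)
B_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(local_mi(B_k))                           # approx I(X^{(k)}; Y^{(k)}) = (1/2) sum_{i<=k} sigma_i^2
```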
III. UNIVERSAL LINEAR FEATURE SELECTION

We use our framework to address the problem of Gaussian universal feature selection. In our analysis, $\tilde{U}, \tilde{X}, \tilde{Y}, \tilde{V}$ denote normalized versions of the variables in (1), and so are $N(0, I)$ random vectors of appropriate dimension. In the sequel, we consider several different formulations, all of which yield the same linear features, and these coincide with the features defined by the modal expansion (8). In our development, the following lemma will be useful (see, e.g., [3, Corollary 4.3.39, p. 248]).

Lemma 2. Given an arbitrary $k_1 \times k_2$ matrix $A$ and any $k \in \{1, \dots, \min\{k_1, k_2\}\}$, we have
$$\max_{\{M \in \mathbb{R}^{k_2 \times k} \colon M^T M = I\}} \|A M\|_F^2 = \sum_{i=1}^{k} \sigma_i(A)^2, \qquad (10)$$
with $\sigma_1(A) \ge \cdots \ge \sigma_{\min\{k_1, k_2\}}(A)$ denoting the (ordered) singular values of $A$. Moreover, the maximum in (10) is achieved by $M = \bigl[\psi_1(A) \ \cdots\ \psi_k(A)\bigr]$, with $\psi_i(A)$ denoting the right singular vector of $A$ corresponding to $\sigma_i(A)$, for $i = 1, \dots, \min\{k_1, k_2\}$.
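Lemma 2 can be verified directly; the sketch below (a hypothetical toy check of our own, not code from the paper) confirms that the top-$k$ right singular vectors attain (10) and that randomly drawn orthonormal matrices $M$ never exceed that value.

```python
import numpy as np

rng = np.random.default_rng(2)
k1, k2, k = 5, 6, 2
A = rng.standard_normal((k1, k2))

# maximizer predicted by Lemma 2: the top-k right singular vectors of A
U, s, Vt = np.linalg.svd(A)
M_star = Vt[:k, :].T                                  # k2 x k, orthonormal columns
print(np.linalg.norm(A @ M_star, 'fro')**2, np.sum(s[:k]**2))   # equal

# random orthonormal M never does better
for _ in range(1000):
    M, _ = np.linalg.qr(rng.standard_normal((k2, k)))  # random k2 x k orthonormal matrix
    assert np.linalg.norm(A @ M, 'fro')**2 <= np.sum(s[:k]**2) + 1e-9
```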

A. Optimum Features, Rotation-Invariant Ensembles

In this formulation, we seek to determine optimum features for estimating an unknown $k$-dimensional $U$ from $Y$, in the case where $U$ and $X$ are weakly dependent; specifically, $P_{X,U} \in \mathcal{G}_\epsilon^{K_X + k}(P_X P_U)$. Accordingly, from the innovations form $\tilde{X} = \epsilon\, \Phi^{X|U} \tilde{U} + \nu_{\tilde{U} \to \tilde{X}}$, where $\tilde{U}$ and $\nu_{\tilde{U} \to \tilde{X}}$ are independent, it follows that weak dependence means, using Definition 1, that $\Phi^{X|U}$ satisfies
$$\bigl\|\Phi^{X|U}\bigr\|_F^2 \le \tfrac{1}{2} (K_X + k), \qquad (11)$$
but is otherwise unknown.

We observe $\tilde{Y} = B \tilde{X} + \nu_{\tilde{X} \to \tilde{Y}} = \epsilon\, B \Phi^{X|U} \tilde{U} + \nu_{\tilde{U} \to \tilde{Y}}$. From this data, and before knowing $\Phi^{X|U}$, we construct a linear feature (statistic) of the form
$$T = \bigl(\Xi^Y\bigr)^T \tilde{Y}, \qquad (12)$$
which we normalize for convenience (and without loss of generality) according to
$$\bigl(\Xi^Y\bigr)^T \Xi^Y = I, \qquad (13)$$
so $\tilde{T} = T$ since $\Lambda_T = I$. We refer to $\Xi^Y$ as the feature weights associated with the linear feature $T$. When $\Phi^{X|U}$ is determined, we can generate the MMSE estimate of $U$ based on $T$, and express the resulting MSE in the form
$$\mathrm{tr}\bigl(\Lambda_{U|T}\bigr) = \mathrm{tr}\bigl(\Lambda_U\bigr) - \epsilon^2 \bigl\|\bigl(\Xi^Y\bigr)^T B\, \Phi^{X|U} \Lambda_U^{1/2}\bigr\|_F^2. \qquad (14)$$
Since $\Phi^{X|U}$ is unknown, to determine the optimum choice of $\Xi^Y$, we assume that the $(X, U)$ configuration is randomly drawn from a rotation-invariant ensemble (RIE), defined as follows.

Definition 2. An RIE is a collection of configurations such that, for any given $\Phi^{X|U}$ and any unitary matrix $Q$, $\Phi^{X|U} \stackrel{d}{=} Q \Phi^{X|U}$, where $\stackrel{d}{=}$ denotes that the two configurations have equal probability, i.e., the prior distribution of $\Phi^{X|U}$ is spherically symmetric.

To derive the optimal features, the following lemma from [4] is useful.

Lemma 3. Let $Z$ be a $k_1 \times k_2$ spherically symmetric [4] random matrix, i.e., for any orthogonal $k_1 \times k_1$ and $k_2 \times k_2$ matrices $Q_1$ and $Q_2$, respectively, we have $Z \stackrel{d}{=} Q_1 Z Q_2$. Then if $A_1$ and $A_2$ are any fixed matrices of appropriate dimension,
$$\mathbb{E}\Bigl[\bigl\|A_1^T Z A_2\bigr\|_F^2\Bigr] = \frac{1}{k_1 k_2}\, \|A_1\|_F^2\, \|A_2\|_F^2\; \mathbb{E}\Bigl[\|Z\|_F^2\Bigr].$$

Theorem 1. The optimal feature $T_*$ minimizing the averaged MSE (14) over the RIE prior is given by $T_* = g^*(Y)$, where $g^*(Y)$ is from the Gaussian modal decomposition (8).

Proof. Taking the expectation of (14) over the RIE and applying Lemma 3, with some computations, we obtain that the averaged MSE is
$$\mathbb{E}\bigl[\mathrm{tr}\bigl(\Lambda_{U|T}\bigr)\bigr] = \mathrm{tr}\bigl(\Lambda_U\bigr) \left( 1 - \frac{\epsilon^2}{2} \left( \frac{1}{k} + \frac{1}{K_X} \right) \bigl\|\bigl(\Xi^Y\bigr)^T B\bigr\|_F^2 \right). \qquad (15)$$
Then via Lemma 2 it follows that (15) is minimized subject to (13) by choosing the columns of $\Xi^Y$ as the left singular vectors of $B$ corresponding to the $k$ largest singular values, i.e., $\Xi^Y = \Psi^Y_{(k)} \triangleq \bigl[\psi_1^Y \ \cdots\ \psi_k^Y\bigr]$. Hence, the optimum feature vector for inferences about the resulting variable $U$ from $Y$ is, via (12),
$$T_* = \bigl(\Psi^Y_{(k)}\bigr)^T \Lambda_Y^{-1/2} Y = \underbrace{\bigl[g_1^*(Y) \ \cdots\ g_k^*(Y)\bigr]^T}_{\triangleq\, g^*(Y)}, \qquad (16)$$
which we note coincides with the features of $Y$ that arise in the Gaussian modal decomposition (8).

By symmetry, an analogous derivation yields the corresponding results for constructing feature weights $\Xi^X$ and a linear feature (statistic)
$$S = \bigl(\Xi^X\bigr)^T \tilde{X} \qquad (17)$$
with [cf. (13)]
$$\bigl(\Xi^X\bigr)^T \Xi^X = I \qquad (18)$$
for estimating an unknown $V$ from $X$. Similarly, we obtain that the optimal $\Xi^X$ has as its columns the right singular vectors $\Psi^X_{(k)} \triangleq \bigl[\psi_1^X \ \cdots\ \psi_k^X\bigr]$ of $B$ corresponding to the $k$ largest singular values, and the optimum feature vector
$$S_* = \bigl(\Psi^X_{(k)}\bigr)^T \Lambda_X^{-1/2} X = \underbrace{\bigl[f_1^*(X) \ \cdots\ f_k^*(X)\bigr]^T}_{\triangleq\, f^*(X)} \qquad (19)$$
is from the Gaussian modal decomposition (8).
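To illustrate (16) and (19), the following sketch (our own; the paper treats $P_{X,Y}$ as known, whereas here the covariances are estimated from samples, and the function and variable names are ours) extracts the $k$-dimensional universal features $S_* = f^*(X)$ and $T_* = g^*(Y)$ by whitening and an SVD of the estimated CDM; the empirical cross-covariance of the two feature vectors then comes out (approximately) as $\mathrm{diag}(\sigma_1, \dots, \sigma_k)$.

```python
import numpy as np

def inv_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def universal_features(X, Y, k):
    """Return f*(X), g*(Y) of (19) and (16), with covariances estimated from samples.

    X: (n, KX) array, Y: (n, KY) array; rows are (approximately) zero-mean samples.
    """
    n = len(X)
    Lam_X, Lam_Y, Lam_YX = X.T @ X / n, Y.T @ Y / n, Y.T @ X / n
    Wx, Wy = inv_sqrt(Lam_X), inv_sqrt(Lam_Y)
    B = Wy @ Lam_YX @ Wx                      # estimated CDM, eq. (4)
    PsiY, s, PsiXt = np.linalg.svd(B)
    S = X @ Wx @ PsiXt[:k, :].T               # S_* = (Psi^X_(k))^T Lam_X^{-1/2} X, eq. (19)
    T = Y @ Wy @ PsiY[:, :k]                  # T_* = (Psi^Y_(k))^T Lam_Y^{-1/2} Y, eq. (16)
    return S, T, s[:k]

# toy data: X Gaussian, Y a weak noisy linear function of part of X
rng = np.random.default_rng(3)
n, KX, KY, k = 20000, 6, 4, 2
X = rng.standard_normal((n, KX)) @ rng.standard_normal((KX, KX))
Y = 0.1 * X[:, :KY] + rng.standard_normal((n, KY))
S, T, sigma = universal_features(X, Y, k)
print(sigma)                                  # estimated top-k singular values of the CDM
print(np.round(S.T @ T / n, 3))               # approx diag(sigma_1, ..., sigma_k)
```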
B. Cooperative Optimization and CCA

In this subsection, we consider a variant optimization problem, where the system designer and nature share the goal of constructing and estimating $U$ from $Y$ in the chain (1) so as to minimize the relative MSE subject to the weak dependence constraint $P_{X,U} \in \mathcal{G}_\epsilon^{K_X + k}(P_X P_U)$. In this game, nature chooses the variable $U$ via $\Phi^{X|U}$ and the attribute covariance $\Lambda_U$, and the system designer chooses the feature weights $\Xi^Y$ for the statistic (12). Nature is constrained in that $\Lambda_U$ cannot have eigenvalues larger than 1, i.e., its spectral norm $\|\cdot\|_s$, denoting the largest eigenvalue, satisfies $\|\Lambda_U\|_s \le 1$, and $\Phi^{X|U}$ is constrained according to [cf. (11)]
$$\bigl\|\Phi^{X|U}\bigr\|_F^2 \le \frac{1}{2}\, \frac{K_X + k}{k}. \qquad (20)$$
The system designer is constrained in that the columns of $\Xi^Y$ must be orthonormal, i.e., (13) must be satisfied.

Theorem 2. To minimize the MSE (14) in the cooperative game, nature should choose $\Lambda_U = I$ and $\Phi^{X|U} = \Psi^X_{(k)}$, and the system designer should choose $\Xi^Y = \Psi^Y_{(k)}$.

Proof. Although we omit the steps due to space constraints, it is straightforward to show, again using Lemma 2, that
$$\mathrm{tr}\bigl(\Lambda_U\bigr) - \mathrm{tr}\bigl(\Lambda_{U|T}\bigr) \le \frac{\epsilon^2 (K_X + k)}{2k} \sum_{i=1}^{k} \sigma_i^2, \qquad (21)$$
and the equality holds when nature chooses the variable $U$ according to $\Lambda_U = I$ and $\Phi^{X|U} = \Psi^X_{(k)}$, and the system designer chooses the feature weights according to $\Xi^Y = \Psi^Y_{(k)}$.

Note that the analysis of this section is not asymptotic; it holds for all $\epsilon$ sufficiently small. We further emphasize that this analysis establishes that (16) is a sufficient statistic for inferences about the resulting variable $U$ from $Y$.

An identical analysis yields the solution to the cooperative game for estimating $V$ from $X$. In particular, nature chooses the variable $V$ according to $\Lambda_V = I$ and $\Phi^{Y|V} = \Psi^Y_{(k)}$, and the system designer chooses the feature weights according to $\Xi^X = \Psi^X_{(k)}$. Moreover, analogously, (19) is a sufficient statistic for inferences about the resulting variable $V$ from $X$.

CCA Interpretation: The cooperative game can be viewed as selecting the most detectable attribute of $X$ and detecting this attribute by the most correlated feature of $Y$. These optimizations can be equivalently and directly interpreted via CCA [5]. To demonstrate the connection, the following (von Neumann) lemma (see, e.g., [6]) is useful.

Lemma 4. Given an arbitrary $k_1 \times k_2$ matrix $A$ and any $k \in \{1, \dots, \min\{k_1, k_2\}\}$, we have
$$\max_{\substack{\{M_1 \in \mathbb{R}^{k_1 \times k},\, M_2 \in \mathbb{R}^{k_2 \times k} \colon \\ M_1^T M_1 = M_2^T M_2 = I\}}} \mathrm{tr}\bigl(M_1^T A M_2\bigr) = \sum_{i=1}^{k} \sigma_i(A), \qquad (22)$$
with $\sigma_1(A) \ge \cdots \ge \sigma_{\min\{k_1, k_2\}}(A)$ denoting the (ordered) singular values of $A$. Moreover, the maximum in (22) is achieved by $M_j = \bigl[\psi_1^{(j)}(A) \ \cdots\ \psi_k^{(j)}(A)\bigr]$, $j = 1, 2$, with $\psi_i^{(1)}(A)$ and $\psi_i^{(2)}(A)$ denoting the left and right singular vectors, respectively, of $A$ corresponding to $\sigma_i(A)$, for $i = 1, \dots, \min\{k_1, k_2\}$.

To develop this perspective, first note that for zero-mean random vectors $X \in \mathbb{R}^{K_X}$ and $Y \in \mathbb{R}^{K_Y}$ with given covariance structure $\Lambda_{X,Y}$, CCA seeks to find $k$-dimensional linear feature vectors $f(X) = \bigl(\Xi^X\bigr)^T \tilde{X}$ and $g(Y) = \bigl(\Xi^Y\bigr)^T \tilde{Y}$, for $k \le K$, normalized according to (18) and (13) so that
$$\mathbb{E}\bigl[f(X)\bigr] = \mathbb{E}\bigl[g(Y)\bigr] = 0 \qquad (23a)$$
$$\mathbb{E}\bigl[f(X) f(X)^T\bigr] = \mathbb{E}\bigl[g(Y) g(Y)^T\bigr] = I, \qquad (23b)$$
so as to maximize the vector correlation (generalized Pearson correlation coefficient)
$$\sigma(f, g) \triangleq \mathbb{E}\bigl[f(X)^T g(Y)\bigr] = \mathrm{tr}\Bigl(\bigl(\Xi^Y\bigr)^T B\, \Xi^X\Bigr), \qquad (24)$$
which via Lemma 4 immediately yields that the optimizing $f$ and $g$ are as given by (19) and (16), and the maximal correlation is $\sigma(f^*, g^*) = \sum_{i=1}^{k} \sigma_i$, which corresponds to the Ky Fan $k$-norm of $B$. Note that the special case $k = 1$ corresponds to standard CCA [5].
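The CCA reading of (24) can be checked with population covariances: the optimizing feature weights are singular vectors of $B$, the $i$-th pair of canonical features has correlation $\sigma_i$, and the maximal vector correlation equals the Ky Fan $k$-norm of $B$. A minimal sketch under a toy covariance of our own choosing (assuming numpy):

```python
import numpy as np

def inv_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(4)
KX, KY, k = 5, 4, 3
A = rng.standard_normal((KX + KY, KX + KY))
Lam_Z = A @ A.T + 0.5 * np.eye(KX + KY)          # toy joint covariance
Lam_X, Lam_Y = Lam_Z[:KX, :KX], Lam_Z[KX:, KX:]
Lam_YX = Lam_Z[KX:, :KX]

B = inv_sqrt(Lam_Y) @ Lam_YX @ inv_sqrt(Lam_X)   # CDM, eq. (4)
PsiY, sigma, PsiXt = np.linalg.svd(B)
XiX, XiY = PsiXt[:k, :].T, PsiY[:, :k]           # optimal feature weights, cf. (19) and (16)

# vector correlation (24): tr((Xi^Y)^T B Xi^X) equals the Ky Fan k-norm of B
print(np.trace(XiY.T @ B @ XiX), np.sum(sigma[:k]))

# componentwise, the i-th pair of features has correlation sigma_i
print(np.round(XiY.T @ B @ XiX, 6))              # = diag(sigma_1, ..., sigma_k)
```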
C. The Local Gaussian Information Bottleneck

For the Gauss-Markov chain (1), we want to determine the Gaussian vector $U = (U_1, \dots, U_k)$ that maximizes $I(Y; U)$ subject to the constraints: 1) the $U_i$ are i.i.d. and unit-variance; 2) $I(X; U_i) \le \epsilon^2/2$ for all $i$; 3) $U_i$ and $U_j$ are conditionally independent given $X$. This can be viewed as a variation of the Gaussian information bottleneck problem [7] in the weak dependence regime. In this subsection, we illustrate the connection between our optimal features and this information bottleneck problem.

Theorem 3. Denote the innovations form of $U$ as $\tilde{X} = \epsilon\, \Phi^{X|U} \tilde{U} + \nu_{\tilde{U} \to \tilde{X}}$, and that of each $U_i$ as $U_i = \epsilon\, \bigl(\phi^{X|U_i}\bigr)^T \tilde{X} + \nu_{\tilde{X} \to U_i}$; then the $\Phi^{X|U}$ maximizing $I(Y; U)$ is given by $\Psi^X_{(k)}$.

Proof. Note that constraint 1) expresses $\Lambda_U = I$, and, via the innovations form $U_i = \epsilon\, \bigl(\phi^{X|U_i}\bigr)^T \tilde{X} + \nu_{\tilde{X} \to U_i}$, constraint 2) imposes that $\bigl\|\phi^{X|U_i}\bigr\| \le 1 + o(1)$ for all $i$. Finally, constraint 3) imposes that the $\nu_{\tilde{X} \to U_i}$ are independent for different $i$, thus $I - \epsilon^2 \bigl(\Phi^{X|U}\bigr)^T \Phi^{X|U}$ is diagonal, where $\Phi^{X|U} = \bigl[\phi^{X|U_1} \ \cdots\ \phi^{X|U_k}\bigr]$, which, in turn, means $\bigl(\phi^{X|U_i}\bigr)^T \phi^{X|U_j} = 0$ for $i \ne j$. Hence, omitting the computations due to space limitations, it follows using Lemma 1 and Lemma 2 that
$$I(Y; U) \le \frac{\epsilon^2}{2} \sum_{i=1}^{k} \sigma_i^2 + o(\epsilon^2), \qquad (25)$$
with equality when the $\Phi^{X|U}$ defining $U$ is given by $\Psi^X_{(k)}$.

An analogous development determines the Gaussian vector $V$ that maximizes $I(X; V)$ subject to the constraints: 1) the $V_i$ are i.i.d. and unit-variance; 2) $I(Y; V_i) \le \epsilon^2/2$ for all $i$; 3) $V_i$ and $V_j$ are conditionally independent given $Y$. In particular, via the innovations form $V_i = \epsilon\, \bigl(\phi^{Y|V_i}\bigr)^T \tilde{Y} + \nu_{\tilde{Y} \to V_i}$ with $\Phi^{Y|V} = \bigl[\phi^{Y|V_1} \ \cdots\ \phi^{Y|V_k}\bigr]$, we obtain $I(X; V) \le \frac{\epsilon^2}{2} \sum_{i=1}^{k} \sigma_i^2 + o(\epsilon^2)$, with equality when the $\Phi^{Y|V}$ defining $V$ is given by $\Psi^Y_{(k)}$.

D. Most Strongly Revealed Joint Dependency

In this formulation, for the Gauss-Markov chain (1) we determine Gaussian $U, V \in \mathbb{R}^k$ that maximize $I(U; V)$ subject to the constraints: 1) the elements of $U$ and $V$ are each i.i.d. with unit-variance; 2) $I(X; U_i) \le \epsilon^2/2$ and $I(Y; V_i) \le \epsilon^2/2$ for all $i$; 3) $U_i$ and $U_j$ are conditionally independent given $X$, and $V_i$ and $V_j$ are conditionally independent given $Y$.

Theorem 4. The optimal $\Phi^{X|U}$ and $\Phi^{Y|V}$ in the innovations forms of $U$ and $V$ that maximize $I(U; V)$ are given by $\Psi^X_{(k)}$ and $\Psi^Y_{(k)}$, respectively.

Proof. Using the fact that
$$\Lambda_{UV} = \epsilon^2 \bigl(\Phi^{X|U}\bigr)^T B^T \Phi^{Y|V}, \qquad (26)$$
we obtain, using Lemma 1 and Lemma 2 and the analysis of Section III-C (again omitting the computations),
$$I(U; V) \le \frac{\epsilon^4}{2} \sum_{i=1}^{k} \sigma_i^2 + o(\epsilon^4), \qquad (27)$$
which is achieved with equality when $\Phi^{Y|V}$ and $\Phi^{X|U}$ are given by the singular vectors $\Psi^Y_{(k)}$ and $\Psi^X_{(k)}$, respectively.

The resulting optimizing (jointly Gaussian) $U, V$ then has covariance $\Lambda_{UV} = \epsilon^2 \Sigma_{(k)}$, where $\Sigma_{(k)}$ is diagonal with diagonal elements $\sigma_1, \dots, \sigma_k$, so $\mathbb{E}\bigl[U_i V_j\bigr] = \epsilon^2 \sigma_i\, 1_{i=j}$ and
$$I(U_i; V_j) = \frac{\epsilon^4}{2} \sigma_i^2\, 1_{i=j} + o(\epsilon^4).$$
Note, finally, that $(S, T)$ given by (19) and (16) are sufficient statistics for inferences about the resulting $(U, V)$, which we emphasize are obtained by separate processing of $X$ and $Y$.

IV. GAUSSIAN COMMON INFORMATION

We interpret the dominant structure in terms of the common information associated with the jointly Gaussian pair $(X, Y)$ as defined by Wyner [8]. In our analysis, the following (variational) characterization of the Ky Fan (nuclear) norm (see, e.g., [9], [10]) is useful.

Lemma 5. Given an arbitrary $k_1 \times k_2$ matrix $A$ and any positive integer $k$, we have
$$\min_{\substack{\{M_1 \in \mathbb{R}^{k_1 \times k},\, M_2 \in \mathbb{R}^{k \times k_2} \colon \\ M_1 M_2 = A\}}} \frac{1}{2} \Bigl( \|M_1\|_F^2 + \|M_2\|_F^2 \Bigr) = \|A\|_*, \qquad (28)$$
where $\|\cdot\|_*$ denotes the Ky Fan norm, i.e.,
$$\|A\|_* \triangleq \mathrm{tr}\Bigl(\sqrt{A^T A}\Bigr) = \sum_{i=1}^{k_* \triangleq \min\{k_1, k_2\}} \sigma_i(A), \qquad (29)$$
with $\sigma_1(A), \dots, \sigma_{k_*}(A)$ denoting the singular values of $A$.
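Lemma 5 can be illustrated numerically: the balanced factorization built from the SVD of $A$ attains the minimum in (28), while reparametrized factorizations of the same product never cost less. The sketch below is our own toy check (numpy; the matrix sizes and the family of reparametrizations are illustrative assumptions, not an exhaustive search over the feasible set in (28)).

```python
import numpy as np

rng = np.random.default_rng(5)
k1, k2 = 4, 6
A = rng.standard_normal((k1, k2))

U, s, Vt = np.linalg.svd(A)
nuclear = s.sum()                                    # ||A||_* in (29)

# balanced factorization from the SVD attains the minimum in (28)
r = len(s)                                           # inner dimension min(k1, k2)
M1 = U[:, :r] @ np.diag(np.sqrt(s))                  # k1 x r
M2 = np.diag(np.sqrt(s)) @ Vt[:r, :]                 # r x k2
cost = 0.5 * (np.linalg.norm(M1, 'fro')**2 + np.linalg.norm(M2, 'fro')**2)
print(cost, nuclear)                                 # equal

# reparametrized factorizations (M1 R, R^{-1} M2) still multiply to A but never cost less
for _ in range(200):
    R = rng.standard_normal((r, r)) + 3.0 * np.eye(r)    # generically invertible
    c = 0.5 * (np.linalg.norm(M1 @ R, 'fro')**2
               + np.linalg.norm(np.linalg.inv(R) @ M2, 'fro')**2)
    assert c >= nuclear - 1e-9
```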
To begin, we express the common information $C(\tilde{X}, \tilde{Y})$ as
$$C(\tilde{X}, \tilde{Y}) = \min I(\tilde{W}; \tilde{X}, \tilde{Y}), \qquad (30)$$
where the minimum is over all (zero-mean) random vectors $\tilde{W} \in \mathbb{R}^k$ for some $k$ such that $(\tilde{W}, \tilde{X}, \tilde{Y})$ are jointly Gaussian and have the Markov structure $\tilde{X} \leftrightarrow \tilde{W} \leftrightarrow \tilde{Y}$, and where we have normalized $\tilde{W}$ such that $\Lambda_{\tilde{W}} = I$. This corresponds to Wyner's common information for Gaussian vectors [13]. Then, we express this Gaussian structure via the innovations form
$$\tilde{Z} = \begin{bmatrix} A_X \\ A_Y \end{bmatrix} \tilde{W} + \nu_{W \to Z}, \qquad (31)$$
where $A_X$ and $A_Y$ are $K_X \times k$ and $K_Y \times k$ matrices, respectively.

We focus on the case where $P_{X|W}(\cdot|w) \in \mathcal{G}_\epsilon^{K_X}(P_X)$ and $P_{Y|W}(\cdot|w) \in \mathcal{G}_\epsilon^{K_Y}(P_Y)$, so $\|A_X\|_F \le \epsilon$ and $\|A_Y\|_F \le \epsilon$. In addition, note that $A_X$ and $A_Y$ are related according to
$$\mathbb{E}\bigl[\tilde{Y} \tilde{X}^T\bigr] = A_Y A_X^T = B. \qquad (32)$$

Theorem 5. The optimal $A_X$ and $A_Y$ achieving the minimum in (30) are given by $\Psi^X_{(K)} \Sigma_{(K)}^{1/2}$ and $\Psi^Y_{(K)} \Sigma_{(K)}^{1/2}$, where $\Sigma_{(K)}$ is diagonal with diagonal elements $\sigma_1, \dots, \sigma_K$.

Proof. Note that via Lemma 1 (omitting the computations due to space constraints) we obtain
$$I(\tilde{W}; \tilde{X}, \tilde{Y}) = \tfrac{1}{2}\, \mathrm{tr}\bigl(A_X^T A_X\bigr) + \tfrac{1}{2}\, \mathrm{tr}\bigl(A_Y^T A_Y\bigr) + o(\epsilon^2). \qquad (33)$$
Minimizing (33) subject to the constraint (32) yields, via Lemma 5,
$$C(\tilde{X}, \tilde{Y}) = \min_{\{A_X, A_Y \colon A_Y A_X^T = B\}} I(\tilde{W}; \tilde{X}, \tilde{Y}) = \sum_{i=1}^{K} \sigma_i + o(\epsilon^2),$$
for which an optimizing $\tilde{W}$ is defined via
$$\begin{bmatrix} A_X \\ A_Y \end{bmatrix} = \begin{bmatrix} \Psi^X_{(K)} \\ \Psi^Y_{(K)} \end{bmatrix} \Sigma_{(K)}^{1/2}. \qquad (34)$$

In addition, we note that $R \triangleq S + T$, with $S$ and $T$ from (19) and (16), is a sufficient statistic for inferences about $\tilde{W}$. We can interpret the common information variable $W_i$ as capturing the common information between $U_i$ and $V_i$. Indeed, it can be verified that $C(U_i, V_i) = \sigma_i + o(\epsilon^2) = I(W_i; X, Y)$.

V. CONNECTION TO PCA

PCA [11], [12] can be interpreted as a special case of the preceding results. Specifically, in some instances, the statistics realized by PCA correspond to the optimum statistics defined in (19) and (16), respectively, for the universal estimation of the unknown attributes $U$ and $V$ under any of our formulations.

Example 1. Suppose we have the innovations form $Y = X + \nu_{X \to Y}$, where $X$ and $Y$ are $K$-dimensional, and where $\Lambda_\nu = \sigma_\nu^2 I$ but $\Lambda_X$ is arbitrary. Moreover, let $\Lambda_X = \Upsilon \Lambda \Upsilon^T$ denote the diagonalization of $\Lambda_X$, so the columns of $\Upsilon$ are orthonormal, and $\Lambda$ is diagonal with entries $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K$. Then it is immediate that $\Lambda_Y = \Upsilon \bigl(\Lambda + \sigma_\nu^2 I\bigr) \Upsilon^T$, so
$$B = \Lambda_Y^{-1/2} \Lambda_X^{1/2} = \Upsilon \bigl(I + \sigma_\nu^2 \Lambda^{-1}\bigr)^{-1/2} \Upsilon^T. \qquad (35)$$
As a result, (19) specializes to
$$f^*(X) = \Lambda_{(k)}^{-1/2}\, \Upsilon_{(k)}^T X, \qquad (36a)$$
where $\Upsilon_{(k)}$ denotes the $K \times k$ matrix consisting of the first $k$ columns of $\Upsilon$, and where $\Lambda_{(k)}$ denotes the $k \times k$ upper-left submatrix of $\Lambda$. In turn, it follows that the $k$-dimensional PCA vector
$$S^{\rm PCA} = f^{\rm PCA}(X) \triangleq \Upsilon_{(k)}^T X \qquad (36b)$$
is a sufficient statistic for inferences about the unknown $V$. Via a similar analysis,
$$T^{\rm PCA} = g^{\rm PCA}(Y) \triangleq \Upsilon_{(k)}^T Y \qquad (36c)$$
is a sufficient statistic for inferences about the unknown $U$ in this case. More generally, $(S^{\rm PCA}, T^{\rm PCA}) = \bigl(f^{\rm PCA}(X), g^{\rm PCA}(Y)\bigr)$ is a sufficient statistic pair for inferences about $(U, V)$.

Beyond this example, for a general jointly Gaussian pair $(X, Y)$, our statistics $S = f(X)$ and $T = g(Y)$ specialize to the PCA statistics (36) whenever $K_X = K_Y = K$ and $\Lambda_X$ and $\Lambda_Y$ are simultaneously diagonalizable, i.e., when they share the same set of eigenvectors, which is equivalent to the condition that $\Lambda_X$ and $\Lambda_Y$ commute (see, e.g., [3, Theorem 1.3.12]). In fact, if $\Lambda_X$ has distinct eigenvalues and commutes with $\Lambda_Y$, then there is a polynomial $\pi(\cdot)$ of degree at most $K - 1$ such that $\Lambda_Y = \pi(\Lambda_X)$, which follows from the Cayley-Hamilton theorem (see, e.g., [3, Theorem 2.4.3.2 and Problem 1.3.P4]).
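Example 1 can be verified numerically: when $\Lambda_\nu = \sigma_\nu^2 I$, the top-$k$ right singular vectors of $B$ span the same subspace as the top-$k$ eigenvectors of $\Lambda_X$, so (19) reduces to an invertible rescaling of the PCA projection $\Upsilon_{(k)}^T X$, as (36) asserts. A short sketch with toy parameters of our own (assuming numpy):

```python
import numpy as np

def inv_sqrt(M):
    w, V = np.linalg.eigh(M)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(6)
K, k, sigma_nu = 5, 2, 2.0

G = rng.standard_normal((K, K))
Lam_X = G @ G.T + np.eye(K)                    # arbitrary positive definite Lam_X
Lam_Y = Lam_X + sigma_nu**2 * np.eye(K)        # Y = X + noise, so Lam_Y = Lam_X + sigma_nu^2 I
Lam_YX = Lam_X                                 # E[Y X^T] = Lam_X

B = inv_sqrt(Lam_Y) @ Lam_YX @ inv_sqrt(Lam_X) # eq. (35): B = Lam_Y^{-1/2} Lam_X^{1/2}
_, _, PsiXt = np.linalg.svd(B)

lam, Ups = np.linalg.eigh(Lam_X)               # eigendecomposition Lam_X = Ups diag(lam) Ups^T
Ups_k = Ups[:, ::-1][:, :k]                    # eigenvectors for the k largest eigenvalues

# the top-k right singular vectors of B span the same subspace as the top-k PCA directions
P_svd = PsiXt[:k, :].T @ PsiXt[:k, :]
P_pca = Ups_k @ Ups_k.T
print(np.allclose(P_svd, P_pca, atol=1e-8))    # True
```

The subspace comparison via projection matrices sidesteps sign and ordering ambiguities; it requires a strict eigengap $\lambda_k > \lambda_{k+1}$, which holds generically for the random $\Lambda_X$ used here.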
ACKNOWLEDGMENT

The research of Shao-Lun Huang was funded by the Shenzhen Municipal Scientific Program JCYJ20170818094022586.

REFERENCES

[1] S.-L. Huang, A. Makur, L. Zheng, and G. W. Wornell, "An information-theoretic approach to universal feature selection in high-dimensional inference," in Proc. Int. Symp. Inform. Theory (ISIT), Aachen, Germany, Jun. 2017.
[2] L. Breiman and J. H. Friedman, "Estimating optimal transformations for multiple regression and correlation," J. Am. Stat. Assoc., vol. 80, no. 391, pp. 580–598, Sep. 1985.
[3] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge, UK: Cambridge University Press, 2012.
[4] M. A. Chmielewski, "Elliptically symmetric distributions: A review and bibliography," Int. Stat. Review, vol. 49, no. 1, pp. 67–74, Apr. 1981.
[5] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, pp. 321–377, 1936.
[6] A. S. Lewis, "The convex analysis of unitarily invariant matrix functions," J. Convex Anal., vol. 2, no. 1/2, pp. 173–183, 1995.
[7] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss, "Information bottleneck for Gaussian variables," J. Machine Learning Res., vol. 6, pp. 165–188, May 2005.
[8] A. D. Wyner, "The common information of two dependent random variables," IEEE Trans. Inform. Theory, vol. 21, no. 2, pp. 163–179, Mar. 1975.
[9] F. Bach, J. Mairal, and J. Ponce, "Convex sparse matrix factorizations," CNRS, Paris, France, Tech. Rep. HAL-00345747, 2008.
[10] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Rev., vol. 52, no. 3, pp. 471–501, 2010.
[11] K. Pearson, "On lines and planes of closest fit to systems of points in space," Phil. Mag., vol. 2, no. 11, pp. 559–572, 1901.
[12] H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Ed. Psych., vol. 24, pp. 417–441, 498–520, 1933.
[13] S. Satpathy and P. Cuff, "Gaussian secure source coding and Wyner's common information," in Proc. IEEE Int. Symp. Inf. Theory, Jun. 2015, pp. 116–120.