Gaussian Universal Features, Canonical Correlations, and Common Information
Shao-Lun Huang
DSIT Research Center, TBSI, SZ, China 518055
Email: [email protected]

Gregory W. Wornell, Lizhong Zheng
Dept. EECS, MIT, Cambridge, MA 02139 USA
Email: {gww, lizhong}@mit.edu

Abstract—We address the problem of optimal feature selection for a Gaussian vector pair in the weak dependence regime, when the inference task is not known in advance. In particular, we show that multiple formulations all yield the same solution, and correspond to the singular value decomposition (SVD) of the canonical correlation matrix. Our results reveal key connections between canonical correlation analysis (CCA), principal component analysis (PCA), the Gaussian information bottleneck, Wyner's common information, and the Ky Fan (nuclear) norms.

I. INTRODUCTION

Typical applications of machine learning involve data whose dimension is high relative to the amount of training data that is available. As a consequence, it is necessary to perform dimensionality reduction before the regression or other inference task is carried out. This reduction corresponds to extracting a set of comparatively low-dimensional features from the data. When the inference task is fully specified, classical statistics establishes that the appropriate features take the form of a (minimal) sufficient statistic. However, in most contemporary settings, the task is not known in advance—or equivalently there are multiple tasks—and we require a set of universal features that are, in an appropriate sense, uniformly good.

With this motivation, the Gaussian universal feature selection problem can be expressed as: given a pair of high-dimensional jointly distributed Gaussian data vectors $X \in \mathbb{R}^{K_X}$ and $Y \in \mathbb{R}^{K_Y}$, how should we choose low-dimensional features $f(X)$ and $g(Y)$ before knowing the desired inference task, so as to ensure that after the task is revealed, inference based on the features performs as well as possible?

Mathematically, we express this problem as one of making inference about latent variables $U, V \in \mathbb{R}^k$, for $1 \le k \le K \triangleq \min\{K_X, K_Y\}$, in the Gauss-Markov chain
\[
U \leftrightarrow X \leftrightarrow Y \leftrightarrow V, \tag{1}
\]
where the (Gaussian) distributions for these variables, i.e., $P_U$, $P_{X|U}$, $P_V$, and $P_{Y|V}$, are not known at the time of feature extraction. Our results can be viewed as an extension of the framework for discrete variables described in [1]. We note in advance that, to simplify the exposition, we treat $P_{X,Y}$ as known, though in practice we must estimate the relevant aspects of this distribution from training samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$.

Our contribution in this paper is summarized as follows. To deal with inference for unknown attributes, in Section III we define a rotation-invariant ensemble (RIE) that assigns a uniform prior to the unknown attributes, and formulate a universal feature selection problem that aims to select optimal features minimizing the average MSE over the RIE. We show that the optimal features can be obtained from the SVD of a canonical dependence matrix (CDM). In addition, we demonstrate that in the weak dependence regime, this SVD also provides the optimal features and solutions for several problems, such as CCA, the information bottleneck, and Wyner's common information, for jointly Gaussian variables. This reveals important connections between information theory and machine learning problems. Our development also connects to [2], which interprets the power iteration method for SVD computation.

II. GAUSSIAN LOCAL ANALYSIS FRAMEWORK

In the sequel, we restrict our attention to zero-mean variables, for simplicity of exposition. In the model of interest, $X \in \mathbb{R}^{K_X}$ and $Y \in \mathbb{R}^{K_Y}$. Moreover,
\[
Z = \begin{bmatrix} X \\ Y \end{bmatrix} \sim N(0, \Lambda_Z), \qquad
\Lambda_Z = \mathbb{E}\bigl[Z Z^{\mathrm{T}}\bigr] = \begin{bmatrix} \Lambda_X & \Lambda_{XY} \\ \Lambda_{YX} & \Lambda_Y \end{bmatrix},
\]
so $X \sim N(0, \Lambda_X)$, $Y \sim N(0, \Lambda_Y)$, $\Lambda_{YX} = \mathbb{E}\bigl[Y X^{\mathrm{T}}\bigr]$, and $\Lambda_{XY} = \Lambda_{YX}^{\mathrm{T}}$. We assume without loss of generality that $\Lambda_X$ and $\Lambda_Y$ are (strictly) positive definite. The joint distribution takes the form
\[
P_{X,Y}(x, y) = P_Z(z) = \frac{|\Lambda_Z|^{-1/2}}{(2\pi)^{K_Z/2}} \exp\Bigl(-\tfrac{1}{2}\, z^{\mathrm{T}} \Lambda_Z^{-1} z\Bigr), \tag{2}
\]
where $K_Z = K_X + K_Y$, and with $|\cdot|$ denoting the determinant of its argument. It will be convenient to normalize $X$ and $Y$ according to
\[
\tilde{X} = \Lambda_X^{-1/2} X \quad \text{and} \quad \tilde{Y} = \Lambda_Y^{-1/2} Y,
\]
so that
\[
\tilde{Z} = \begin{bmatrix} \tilde{X} \\ \tilde{Y} \end{bmatrix}, \qquad
\Lambda_{\tilde{Z}} = \begin{bmatrix} I & B^{\mathrm{T}} \\ B & I \end{bmatrix}, \tag{3}
\]
where
\[
B \triangleq \Lambda_Y^{-1/2} \Lambda_{YX} \Lambda_X^{-1/2} = \Lambda_Y^{-1/2} \Gamma_{Y|X} \Lambda_X^{1/2} \tag{4}
\]
is called the canonical dependence matrix (CDM). The CDM plays the same key role in Gaussian local analysis as the divergence transfer matrix (DTM) does in the discrete case [1].

We note that the MMSE estimate of $\tilde{Y}$ based on $\tilde{X}$ is $\hat{\tilde{Y}}(\tilde{X}) = B \tilde{X}$, and the associated error $\tilde{\nu} \triangleq \tilde{Y} - \hat{\tilde{Y}}(\tilde{X})$ has covariance $\mathbb{E}\bigl[\tilde{\nu} \tilde{\nu}^{\mathrm{T}}\bigr] = I - B B^{\mathrm{T}}$, so the resulting MSE is $\tilde{\sigma}_e^2 = \operatorname{tr}\bigl(I - B B^{\mathrm{T}}\bigr) = K_Y - \|B\|_F^2$, with $\|\cdot\|_F$ denoting the Frobenius norm. The SVD of $B$ takes the form
\[
B = \Psi^Y \Sigma\, \Psi^{X\,\mathrm{T}} = \sum_{i=1}^{K} \sigma_i\, \psi_i^Y \psi_i^{X\,\mathrm{T}}, \tag{5}
\]
where $K = \min\{K_X, K_Y\}$ and where we order the singular values according to $\sigma_1 \ge \cdots \ge \sigma_K$. Note that since (3) is positive semidefinite, it follows that $\sigma_i \le 1$ for $i = 1, \ldots, K$.
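To make (4)–(5) concrete, here is a minimal numerical sketch (our illustration, not code from the paper; numpy and all variable names are our assumptions): given the covariance blocks of a toy jointly Gaussian pair, it forms the CDM $B$ and its SVD, whose singular values are the canonical correlations and necessarily lie in $[0, 1]$.

```python
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def canonical_dependence_matrix(L_x, L_y, L_yx):
    """CDM B = L_y^{-1/2} L_yx L_x^{-1/2}, as in (4)."""
    return inv_sqrt(L_y) @ L_yx @ inv_sqrt(L_x)

# Toy model: a random positive definite covariance for Z = (X, Y), with K_X = 4, K_Y = 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((7, 7))
L_z = A @ A.T
L_x, L_y, L_yx = L_z[:4, :4], L_z[4:, 4:], L_z[4:, :4]

B = canonical_dependence_matrix(L_x, L_y, L_yx)
psi_y, sigma, psi_x_T = np.linalg.svd(B)   # SVD in (5)
print("canonical correlations:", sigma)    # each sigma_i lies in [0, 1]
```

The whitened features defined later in (8b) can then be read off directly from the rows of `psi_x_T` and the columns of `psi_y`.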
As in the discrete case [1], it is useful to define a local analysis regime for such variables. In particular, we make use of the following notion of neighborhood.

Definition 1 (Gaussian $\epsilon$-Neighborhood). For a given $\epsilon > 0$, the $\epsilon$-neighborhood of a $K_0$-dimensional Gaussian distribution $P_0 = N(0, \Lambda_0)$ is the set of Gaussian distributions in a covariance-divergence ball of radius $\epsilon$ about $P_0$, i.e.,
\[
\mathcal{G}_\epsilon^{K_0}(P_0) \triangleq \Bigl\{ P = N(0, \Lambda) \colon \bigl\| \Lambda_0^{-1/2} (\Lambda - \Lambda_0) \Lambda_0^{-1/2} \bigr\|_F^2 \le \epsilon^2 K_0 \Bigr\}.
\]

Note that $P_{X,Y}$ lies in an $\epsilon$-neighborhood of $P_X P_Y$ if and only if $P_{\tilde{X},\tilde{Y}}$ lies in an $\epsilon$-neighborhood of $P_{\tilde{X}} P_{\tilde{Y}}$. Hence, $P_{X,Y} \in \mathcal{G}_\epsilon^{K_X + K_Y}(P_X P_Y)$ when $\|\Lambda_{\tilde{Z}} - I\|_F^2 \le \epsilon^2 (K_X + K_Y)$. We conclude that the neighborhood constraint limits how much the mean-square error (MSE) in the estimate of $\tilde{Y}$ based on observing $\tilde{X}$ can be reduced relative to the MSE in the estimate of $\tilde{Y}$ based on no data. In the rest of this paper, we focus on the regime in which $\epsilon$ is small. The K-L divergence and mutual information in this regime admit the following useful asymptotic expressions.

Lemma 1. In the weak dependence regime,
\[
D\bigl(P_{Y|X}(\cdot \mid x) \,\|\, P_Y\bigr) = \frac{1}{2} \|B \tilde{x}\|^2 + o(\epsilon^2),
\]
and
\[
I(X; Y) = \frac{1}{2} \sum_{i=1}^{K} \sigma_i^2 + o(\epsilon^2). \tag{6}
\]

Proof: This is straightforward from the fact that, for an arbitrary matrix $A$, $\ln |I - A A^{\mathrm{T}}| = -\|A\|_F^2 + o(\epsilon^2)$.

To interpret (6), consider the modal decomposition of $P_{X,Y}$. In particular, observe that as $\epsilon \to 0$,
\[
\Lambda_{\tilde{Z}}^{-1} = \begin{bmatrix} I & -B^{\mathrm{T}} \\ -B & I \end{bmatrix} + o(\epsilon). \tag{7}
\]
Hence,
\[
P_{X,Y}(x, y) = P_X(x)\, P_Y(y) \prod_{i=1}^{K} e^{\sigma_i f_i^*(x)\, g_i^*(y)} \bigl(1 + o(\epsilon)\bigr), \tag{8a}
\]
where $f_i^*$ and $g_i^*$ are (linear) functions given by
\[
f_i^*(x) = \underbrace{(\psi_i^X)^{\mathrm{T}} \Lambda_X^{-1/2}}_{\triangleq f_i^{*\mathrm{T}}} x
\quad \text{and} \quad
g_i^*(y) = \underbrace{(\psi_i^Y)^{\mathrm{T}} \Lambda_Y^{-1/2}}_{\triangleq g_i^{*\mathrm{T}}} y. \tag{8b}
\]
Moreover, using (8b) with (4) and (5) we obtain the covariance $\mathbb{E}\bigl[g_i^*(Y)\, f_j^*(X)\bigr] = \sigma_i\, \delta_{ij}$ for $i, j = 1, \ldots, K$.

From this perspective, we see that approximations to $P_{X,Y}$ can be obtained by truncating the representation (8) to the first $k < K$ of the terms in the product, yielding
\[
P_{X^{(k)}, Y^{(k)}}(x, y) = P_X(x)\, P_Y(y) \prod_{i=1}^{k} e^{\sigma_i f_i^*(x)\, g_i^*(y)} \bigl(1 + o(\epsilon)\bigr).
\]
This corresponds to jointly Gaussian $X^{(k)}$ and $Y^{(k)}$ with the same marginals as $X$ and $Y$, respectively, but
\[
\Lambda_{Y^{(k)} X^{(k)}} = \Lambda_Y^{1/2} \underbrace{\Biggl( \sum_{i=1}^{k} \sigma_i\, \psi_i^Y \psi_i^{X\,\mathrm{T}} \Biggr)}_{\triangleq B^{(k)}} \Lambda_X^{1/2}, \tag{9}
\]
so
\[
I\bigl(X^{(k)}; Y^{(k)}\bigr) = \frac{1}{2} \sum_{i=1}^{k} \sigma_i^2 + o(\epsilon^2).
\]
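For jointly Gaussian variables the mutual information is available in closed form, $I(X;Y) = -\tfrac{1}{2} \ln|I - B B^{\mathrm{T}}|$, which makes the weak-dependence approximations in (6) and below (9) easy to check numerically. The sketch below is our illustration only (not code from the paper): it builds a small CDM with prescribed singular values well below one and compares the exact mutual information of the full and rank-$k$ truncated models against $\tfrac{1}{2}\sum_i \sigma_i^2$.

```python
import numpy as np

# Weak-dependence toy CDM: B = Psi^Y Sigma Psi^{X T} with small singular values,
# so that the o(eps^2) terms in (6) are negligible.
rng = np.random.default_rng(1)
K_x, K_y, k = 5, 4, 2
Psi_y, _ = np.linalg.qr(rng.standard_normal((K_y, K_y)))
Psi_x, _ = np.linalg.qr(rng.standard_normal((K_x, K_x)))
sigma = np.array([0.10, 0.07, 0.05, 0.02])          # canonical correlations << 1
B = Psi_y @ np.diag(sigma) @ Psi_x[:, :K_y].T        # as in (5)

def gaussian_mi(B):
    """Exact I(X;Y) for the normalized model (3): -1/2 log det(I - B B^T)."""
    return -0.5 * np.log(np.linalg.det(np.eye(B.shape[0]) - B @ B.T))

print("exact I(X;Y)            :", gaussian_mi(B))
print("approx 1/2 sum sigma_i^2:", 0.5 * np.sum(sigma ** 2))      # eq. (6)

# Rank-k truncation B^(k) built from the top-k singular triplets, as in (9).
B_k = Psi_y[:, :k] @ np.diag(sigma[:k]) @ Psi_x[:, :k].T
print("exact I(X^(k);Y^(k))    :", gaussian_mi(B_k))
print("approx 1/2 sum_{i<=k}   :", 0.5 * np.sum(sigma[:k] ** 2))
```

With singular values this small the exact and approximate values agree up to a correction of order $\sigma_i^4$, and the truncation keeps only the $k$ strongest dependence modes.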
III. UNIVERSAL LINEAR FEATURE SELECTION

We use our framework to address the problem of Gaussian universal feature selection. In our analysis, $\tilde{U}, \tilde{X}, \tilde{Y}, \tilde{V}$ denote normalized versions of the variables in (1), and so are $N(0, I)$ random vectors of appropriate dimension. In the sequel, we consider several different formulations, all of which yield the same linear features, which coincide with those defined by the modal expansion (8). In our development, the following lemma will be useful (see, e.g., [3, Corollary 4.3.39, p. 248]).

Lemma 2. Given an arbitrary $k_1 \times k_2$ matrix $A$ and any $k \in \{1, \ldots, \min\{k_1, k_2\}\}$, we have
\[
\max_{\{M \in \mathbb{R}^{k_2 \times k} \colon M^{\mathrm{T}} M = I\}} \|A M\|_F^2 = \sum_{i=1}^{k} \sigma_i(A)^2, \tag{10}
\]
with $\sigma_1(A) \ge \cdots \ge \sigma_{\min\{k_1, k_2\}}(A)$ denoting the (ordered) singular values of $A$. Moreover, the maximum in (10) is achieved by $M = \bigl[ \psi_1(A) \ \cdots \ \psi_k(A) \bigr]$, with $\psi_i(A)$ denoting the right singular vector of $A$ corresponding to $\sigma_i(A)$, for $i = 1, \ldots, \min\{k_1, k_2\}$.

A. Optimum Features, Rotation-Invariant Ensembles

In this formulation, we seek to determine optimum features for estimating an unknown $k$-dimensional $U$ from $Y$, in the case where $U$ and $X$ are weakly dependent; specifically, $P_{X,U} \in \mathcal{G}_\epsilon^{K_X + k}(P_X P_U)$. Accordingly, from the innovations form $\tilde{X} = \Phi^{X|U} \tilde{U} + \nu_{\tilde{U} \to \tilde{X}}$, where $\tilde{U}$ and $\nu_{\tilde{U} \to \tilde{X}}$ are independent, it follows that weak dependence means, using Definition 1, that $\Phi^{X|U}$ satisfies
\[
\bigl\| \Phi^{X|U} \bigr\|_F^2 \le \frac{1}{2}\, \epsilon^2 (K_X + k), \tag{11}
\]
but is otherwise unknown. We observe $\tilde{Y} = B \tilde{X} + \nu_{\tilde{X} \to \tilde{Y}} = B \Phi^{X|U} \tilde{U} + \nu_{\tilde{U} \to \tilde{Y}}$.
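Lemma 2 is straightforward to sanity-check numerically: the orthonormal $M$ whose columns are the top-$k$ right singular vectors of $A$ attains $\sum_{i=1}^{k} \sigma_i(A)^2$, and any other orthonormal $M$ does no better. The following sketch is our illustration only (numpy and the variable names are our assumptions, not from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)
k1, k2, k = 4, 6, 2
A = rng.standard_normal((k1, k2))

# Singular values and right singular vectors of A.
_, s, Vt = np.linalg.svd(A)
M_opt = Vt[:k].T                       # k2 x k matrix of top-k right singular vectors

print("sum of top-k sigma_i(A)^2:", np.sum(s[:k] ** 2))               # RHS of (10)
print("||A M_opt||_F^2          :", np.linalg.norm(A @ M_opt) ** 2)   # maximum attained

# A generic orthonormal M gives a strictly smaller value with probability one.
M_rand, _ = np.linalg.qr(rng.standard_normal((k2, k)))
print("||A M_rand||_F^2         :", np.linalg.norm(A @ M_rand) ** 2)
```

This is the mechanism by which, in the formulations of Section III, optimal $k$-dimensional features end up aligned with the top-$k$ singular vectors of the CDM $B$, consistent with the contribution summary in the introduction.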