Conditional Central Limit Theorems for Gaussian Projections
Galen Reeves

arXiv:1612.09252v2 [cs.IT] 30 Dec 2016

Abstract—This paper addresses the question of when projections of a high-dimensional random vector are approximately Gaussian. This problem has been studied previously in the context of high-dimensional data analysis, where the focus is on low-dimensional projections of high-dimensional point clouds. The focus of this paper is on the typical behavior when the projections are generated by an i.i.d. Gaussian projection matrix. The main results are bounds on the deviation between the conditional distribution of the projections and a Gaussian approximation, where the conditioning is on the projection matrix. The bounds are given in terms of the quadratic Wasserstein distance and relative entropy and are stated explicitly as a function of the number of projections and certain key properties of the random vector. The proof uses Talagrand's transportation inequality and a general integral-moment inequality for mutual information. Applications to random linear estimation and compressed sensing are discussed.

Index Terms—Central Limit Theorems, Compressed Sensing, High-dimensional Data Analysis, Random Projections

The work of G. Reeves was supported in part by funding from the Laboratory for Analytic Sciences (LAS). Any opinions, findings, conclusions, and recommendations expressed in this material are those of the author and do not necessarily reflect the views of the sponsors. G. Reeves is with the Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University, Durham, NC 27708 USA (e-mail: [email protected]).

I. INTRODUCTION

A somewhat surprising phenomenon is that the distributions of certain weighted sums (or projections) of random variables can be close to Gaussian, even if the variables themselves have a nontrivial dependence structure. This fact can be traced back to the work of Sudakov [1], who showed that under mild conditions, the distributions of most one-dimensional projections of a high-dimensional vector are close to Gaussian. An independent line of work by Diaconis and Freedman [2] provides similar results for projections of high-dimensional point clouds. In both cases, it is shown that the phenomenon persists with high probability when the weights are drawn randomly from the uniform measure on the sphere. Ensuing work [3]–[14] has generalized and strengthened these results in several directions, including the case of multivariate projections.

Most related to the current paper is the recent line of work by Meckes [11], [12], who provides bounds with respect to the bounded-Lipschitz metric when the projections are distributed uniformly on the Stiefel manifold. Meckes shows that, under certain assumptions on a sequence of n-dimensional random vectors, the distributions of the projections are close to Gaussian provided that the number of projections k satisfies k < 2 log n / log log n. Meckes also shows that this condition cannot be improved in general.

The focus of this paper is on the typical behavior when the projections are generated randomly and independently of the random variables. Given an n-dimensional random vector X, the k-dimensional linear projection Z is defined according to

Z = \Theta X, \qquad (1)

where Θ is a k × n random matrix that is independent of X. Throughout this paper it is assumed that X has finite second moment and that the entries of Θ are i.i.d. Gaussian random variables with mean zero and variance 1/n.

Our main results are bounds on the deviation between the conditional distribution of Z given Θ and a Gaussian approximation. These bounds are given in terms of the quadratic Wasserstein distance and relative entropy and are stated explicitly as a function of the number of projections k and certain properties of the distribution on X. For example, under the same assumptions used by Meckes [12, Corollary 4], we show that

\mathbb{E}\, W_2^2\bigl(P_{Z|\Theta}, G_Z\bigr) \le C \bigl( n^{-1/4} + k\, n^{-2/(k+4)} \bigr),

where the expectation is with respect to the random matrix Θ, W_2(·,·) denotes the quadratic Wasserstein distance, and G_Z is the Gaussian distribution with the same mean and covariance as Z.
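As a concrete illustration of the objects appearing in this bound, the following short Python sketch (not taken from the paper) draws a projection matrix Θ with i.i.d. N(0, 1/n) entries, forms samples of Z = ΘX for a fixed Θ, and compares the empirical covariance of the conditional distribution P_{Z|Θ} with the Gaussian approximation G_Z. The choice of X (i.i.d. ±1 signs), the dimensions, and the sample size are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
n, k, num_samples = 1000, 3, 5000

# Illustrative choice of X (an assumption, not from the paper): i.i.d. +/-1 signs,
# so that gamma = E||X||^2 / n = 1.
X = rng.choice([-1.0, 1.0], size=(num_samples, n))

# Entries of Theta are i.i.d. N(0, 1/n), as assumed throughout the paper.
Theta = rng.normal(loc=0.0, scale=1.0 / np.sqrt(n), size=(k, n))

# Each row of Z is one draw of Z = Theta X, conditioned on this fixed Theta.
Z = X @ Theta.T

# G_Z has mean 0 and covariance gamma * I_k; the empirical covariance of Z
# given Theta should be close to the identity matrix in this example.
print(np.round(np.cov(Z, rowvar=False), 3))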
In comparison with previous work, one of the contributions of this paper is that our results provide a stronger characterization of the approximation error. Specifically, the analysis requires fewer assumptions about the distribution of X and the bounds are stated with respect to stronger measures of statistical distance, namely the quadratic Wasserstein distance and relative entropy.

A further contribution of the paper is given by our proof technique, which appears to be quite different from previous approaches. The first step in our proof is to characterize the conditional distribution of Z after it has been passed through an additive white Gaussian noise (AWGN) channel of noise power t ∈ (0, ∞). In particular, the k-dimensional random vector Y is defined according to

Y = Z + \sqrt{t}\, N, \qquad (2)

where N ∼ N(0, I_k) is independent of Z. The bulk of the work is to bound the relative entropy between the conditional distribution of Y given Θ and the Gaussian distribution with the same mean and covariance as Y. To this end, we take advantage of a general integral-moment inequality (Lemma 6) that allows us to bound the mutual information I(Y; Θ) in terms of the variance of the density function of P_{Y|Θ}. Note that this density is guaranteed to exist because of the added Gaussian noise.

The next step in our proof is to use the fact that the square of the conditional density can be expressed as an expectation with respect to two independent copies of X using the identity

p_{Y|\Theta}^2(y \mid \Theta) = \mathbb{E}\bigl[\, p_{Y|X,\Theta}(y \mid X_1, \Theta)\, p_{Y|X,\Theta}(y \mid X_2, \Theta) \,\big|\, \Theta \,\bigr],

where the expectation is with respect to independent vectors X_1 and X_2 with the same distribution as X. By swapping the order of expectation between Θ and the pair (X_1, X_2), we are then able to obtain closed-form expressions for integrals involving the variance of the density. These expressions lead to explicit bounds with respect to the relative entropy (Theorem 2).
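The identity above can be verified in one line from the mixture representation of the conditional density; this short check is an addition here, not quoted from the paper, and uses only that X_1 and X_2 are independent copies of X, independent of Θ:

\begin{align*}
p_{Y|\Theta}(y \mid \Theta) &= \mathbb{E}\bigl[\, p_{Y|X,\Theta}(y \mid X, \Theta) \,\big|\, \Theta \,\bigr],\\
p_{Y|\Theta}^2(y \mid \Theta) &= \mathbb{E}\bigl[\, p_{Y|X,\Theta}(y \mid X_1, \Theta) \,\big|\, \Theta \,\bigr]\,
\mathbb{E}\bigl[\, p_{Y|X,\Theta}(y \mid X_2, \Theta) \,\big|\, \Theta \,\bigr]
= \mathbb{E}\bigl[\, p_{Y|X,\Theta}(y \mid X_1, \Theta)\, p_{Y|X,\Theta}(y \mid X_2, \Theta) \,\big|\, \Theta \,\bigr],
\end{align*}

where the product of the two conditional expectations collapses into a single conditional expectation because X_1 and X_2 are conditionally independent given Θ.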
Finally, the last step of our proof leverages Talagrand's transportation inequality [15] to obtain bounds on the conditional distribution of Z given Θ with respect to the quadratic Wasserstein distance (Theorem 1). This step requires careful control of the behavior of the conditional distribution P_{Y|Θ} in the limit as the noise power t converges to zero.

One of the primary motivations for this work comes from the author's recent work on the asymptotic properties of a certain random linear estimation problem [16], [17]. In particular, Theorem 10 of this paper plays a key role in rigorously characterizing certain phase transitions that had been predicted using the heuristic replica method from statistical physics [18]. More generally, we believe that the results in this paper could be useful for the analysis of algorithms that rely on Gaussian approximations for weighted sums of large numbers of random variables. These include, for example, expectation propagation [19], expectation consistent approximate inference [20], relaxed belief propagation [21], and the rapidly growing class of algorithms based on approximate message passing [22]–[24]. Another potential application for our results is to provide theoretical guarantees for approximate inference. Some initial work in this direction is described in [25], [26].

A. Statement of Main Results

Before we can state our main results we need some additional definitions. The quadratic Wasserstein distance between distributions P and Q on R^k is defined according to

W_2(P, Q) = \Bigl( \inf \mathbb{E}\,\|U - V\|^2 \Bigr)^{1/2},

where the infimum is over all couplings of random vectors (U, V) such that U has distribution P and V has distribution Q. Convergence with respect to relative entropy, which is sometimes referred to as convergence in information, is much stronger than convergence in distribution [28].

The main results of this paper are bounds on the conditional distributions of the random projection Z defined in (1) and the noisy random projection Y defined in (2). The marginal distributions of these vectors are denoted by P_Z and P_Y and the Gaussian distributions with the same mean and covariance are denoted by G_Z and G_Y. The conditional distributions corresponding to the random matrix Θ are denoted by P_{Z|Θ} and P_{Y|Θ}. Using this notation, the marginal distributions can be expressed as P_Z = E[P_{Z|Θ}] and P_Y = E[P_{Y|Θ}], where the expectation is with respect to Θ.

The following definition describes the properties of the distribution of X that are needed for our bounds.

Definition 1. For any n-dimensional random vector X with E‖X‖² < ∞, the functions α(X) and β_r(X) are defined according to

\alpha(X) = \frac{1}{n}\, \mathbb{E}\bigl|\, \|X\|^2 - \mathbb{E}\|X\|^2 \,\bigr|,
\qquad
\beta_r(X) = \frac{1}{n}\, \bigl( \mathbb{E}\bigl[\, |\langle X_1, X_2 \rangle|^r \,\bigr] \bigr)^{1/r},

where r ∈ {1, 2}, ⟨·,·⟩ denotes the Euclidean inner product between vectors, and X_1 and X_2 are independent vectors with the same distribution as X.

The function α(X) measures the deviation of the squared magnitude of X about its expectation. The function β_r(X) is non-decreasing in r. It is straightforward to show that the case r = 2 can be expressed equivalently as β_2(X) = (1/n) ‖E[X Xᵀ]‖_F, where ‖·‖_F denotes the Frobenius norm.

Assumption 1 (IID Gaussian Projections). The entries of the k × n matrix Θ are i.i.d. Gaussian random variables with mean zero and variance 1/n.

Assumption 2 (Finite Second Moment). The n-dimensional random vector X has finite second moment: (1/n) E‖X‖² = γ ∈ (0, ∞).

Under Assumptions 1 and 2, the marginal distribution of Z has mean zero and covariance γI_k, and thus the Gaussian approximations are given by G_Z = N(0, γI_k) and G_Y = N(0, (γ + t)I_k).
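As a supporting computation (not written out in this section), the mean and covariance of the marginal distribution of Z follow directly from Assumptions 1 and 2, with the expectation taken over both Θ and X:

\mathbb{E}[Z] = \mathbb{E}[\Theta]\,\mathbb{E}[X] = 0, \qquad
\mathbb{E}\bigl[Z Z^{\mathsf T}\bigr] = \mathbb{E}_{\Theta}\bigl[\Theta\,\mathbb{E}[X X^{\mathsf T}]\,\Theta^{\mathsf T}\bigr]
= \frac{1}{n}\operatorname{tr}\bigl(\mathbb{E}[X X^{\mathsf T}]\bigr)\, I_k
= \frac{1}{n}\,\mathbb{E}\|X\|^2\, I_k = \gamma I_k,

where the middle step uses \mathbb{E}[\Theta A \Theta^{\mathsf T}] = \frac{1}{n}\operatorname{tr}(A)\, I_k, valid for any fixed n × n matrix A when the entries of Θ are i.i.d. with mean zero and variance 1/n. Adding the independent noise \sqrt{t}\,N then gives the covariance (γ + t)I_k used for G_Y.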
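Similarly, the r = 2 case of Definition 1 reduces to the Frobenius-norm expression through a trace identity; this short check is an addition here, not quoted from the paper:

\mathbb{E}\bigl[\langle X_1, X_2 \rangle^2\bigr]
= \mathbb{E}\bigl[\operatorname{tr}\bigl(X_1 X_1^{\mathsf T}\, X_2 X_2^{\mathsf T}\bigr)\bigr]
= \operatorname{tr}\Bigl(\mathbb{E}[X X^{\mathsf T}]\;\mathbb{E}[X X^{\mathsf T}]\Bigr)
= \bigl\|\mathbb{E}[X X^{\mathsf T}]\bigr\|_F^2,

so that \beta_2(X) = \frac{1}{n}\bigl(\mathbb{E}[\langle X_1, X_2 \rangle^2]\bigr)^{1/2} = \frac{1}{n}\,\bigl\|\mathbb{E}[X X^{\mathsf T}]\bigr\|_F.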