Experiments with Random Projection

Sanjoy Dasgupta*
AT&T Labs – Research

*Work done while at University of California, Berkeley.

Abstract

Recent theoretical work has identified random projection as a promising dimensionality reduction technique for learning mixtures of Gaussians. Here we summarize these results and illustrate them by a wide variety of experiments on synthetic and real data.

1 Introduction

It has recently been suggested that the learning of high-dimensional mixtures of Gaussians might be facilitated by first projecting the data into a randomly chosen subspace of low dimension (Dasgupta, 1999). In this paper we present a comprehensive series of experiments intended to precisely illustrate the benefits of this technique.

There are two main theoretical results about random projection. The first is that data from a mixture of $k$ Gaussians can be projected into just $O(\log k)$ dimensions while still retaining the approximate level of separation between the clusters. This projected dimension is independent of the number of data points and of their original dimension. In the experiments we perform, a value of $10 \ln k$ works nicely. Second, even if the original clusters are highly eccentric (that is, far from spherical), random projection will make them more spherical. This effect is of major importance because raw high-dimensional data can be expected to form very eccentric clusters, owing, for instance, to different units of measurement for different attributes. Clusters of high eccentricity present an algorithmic challenge. For example, they are problematic for the EM algorithm because special pains must be taken to ensure that intermediate covariance matrices do not become singular, or close to singular. Often this is accomplished by imposing special restrictions on the matrices.

These two benefits have made random projection the key ingredient in the first polynomial-time, provably correct (in a PAC-like sense) algorithm for learning mixtures of Gaussians (Dasgupta, 1999). Random projection can also easily be used in conjunction with EM. To test this combination, we have performed experiments on synthetic data from a variety of Gaussian mixtures. In these, EM with random projection is seen to consistently yield models of quality (log-likelihood on a test set) comparable to or better than that of models found by regular EM. And the reduction in dimension saves a lot of time.

Finally, we have used random projection to construct a classifier for handwritten digits, from a canonical USPS data set in which each digit is represented as a vector in $\mathbb{R}^{256}$. We projected the training data randomly into $\mathbb{R}^{40}$, and were able to fit a mixture of fifty Gaussians (five per digit) to this data quickly and easily, without any tweaking or covariance restrictions. The details of the experiment directly corroborated our theoretical results.

Another, very popular, technique for dimensionality reduction is principal component analysis (PCA). Throughout this paper, both in conceptual discussions and empirical studies, we will contrast PCA with random projection in order to get a better feel for each.

2 High-dimensional Gaussians

2.1 Some counter-intuitive effects

An $n$-dimensional Gaussian $N(\mu, \Sigma)$ has density function

$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right).$$

If $\Sigma$ is a multiple of the identity matrix, then the Gaussian is called spherical. Some important intuition about the behavior of Gaussians in high dimension can quickly be gained by examining the spherical Gaussian $N(0, \sigma^2 I_n)$. Although its density is highest at the origin, it turns out that for large $n$ most of the probability mass lies far away from this center. A point $X \in \mathbb{R}^n$ chosen randomly from this Gaussian has coordinates $X_i$ which are i.i.d. $N(0, \sigma^2)$.
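The claim that most of the mass of $N(0, \sigma^2 I_n)$ lies far from the origin when $n$ is large can be checked with a small simulation. The following is only an illustrative sketch and is not part of the experiments reported in this paper: the function names `norm_ratios` and `summarize`, the sample count, the seed, and the choice of dimensions are all assumptions made for the example. It draws points whose coordinates are i.i.d. $N(0, \sigma^2)$ and looks at the ratio $\|X\| / (\sigma\sqrt{n})$.

```python
import math
import random


def norm_ratios(n, sigma=1.0, samples=200, seed=0):
    """Draw `samples` points from N(0, sigma^2 I_n) and return
    ||X|| / (sigma * sqrt(n)) for each of them."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(samples):
        # Coordinates are i.i.d. N(0, sigma^2); accumulate the squared norm.
        sq_norm = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n))
        ratios.append(math.sqrt(sq_norm) / (sigma * math.sqrt(n)))
    return ratios


def summarize(n, **kwargs):
    """Return (mean, standard deviation) of the scaled norms in dimension n."""
    r = norm_ratios(n, **kwargs)
    mean = sum(r) / len(r)
    var = sum((v - mean) ** 2 for v in r) / len(r)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    # Illustrative dimensions only; any increasing sequence would do.
    for n in (1, 2, 10, 100, 1000):
        mean, std = summarize(n)
        print(f"n={n:5d}  mean ratio={mean:.3f}  std={std:.3f}")
```

In one or two dimensions the scaled norm is widely dispersed, whereas for large $n$ it should settle close to 1 with a small spread, which is the sense in which the mass sits far from the center.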
Therefore the expected squared Euclidean norm of $X$ is $E(\|X\|^2) = \sum_i E X_i^2 = n\sigma^2$. In fact, it can be shown quite routinely, by writing out the moment-generating function of $\|X\|^2$, that the distribution of $\|X\|^2$ will be tightly concentrated around its expected value. Specifically,

$$P\left(\left|\|X\|^2 - \sigma^2 n\right| > \epsilon \sigma^2 n\right) \le 2e^{-n\epsilon^2/24}.$$

That is to say, for big enough $n$, almost the entire distribution lies in a thin shell of radius approximately $\sigma\sqrt{n}$. Thus the natural scale of this Gaussian is in units of $\sigma\sqrt{n}$.

This effect might arouse some initial skepticism because it is not observable in one or two dimensions. But it can perhaps be made more plausible by the following explanation. The Gaussian $N(0, I_n)$ assigns density proportional to $e^{-\rho^2 n/2}$ to points on the surface of the sphere centered at the origin and of radius $\rho\sqrt{n}$, $\rho \le 1$. But the surface area of this sphere is proportional to $(\rho\sqrt{n})^{n-1}$. For large $n$, as $\rho \uparrow 1$, this surface area is growing much faster than the density is decaying, and thus most of the probability mass lies at distance about $\sqrt{n}$ from the origin (Bishop, 1995, exercise 1.4). Figure 1 is a graphical depiction of this effect for various values of $n$.

The more general Gaussian $N(0, \Sigma)$ has ellipsoidal contours of equal density. Each such ellipsoid is of the form $\{x : x^T \Sigma^{-1} x = r^2\}$, corresponding to points at a fixed Mahalanobis distance $\|x\|_\Sigma = \sqrt{x^T \Sigma^{-1} x}$ from the center of the Gaussian. The principal axes of these ellipsoids are given by the eigenvectors of $\Sigma$. The radius along a particular axis is proportional to the square root of the corresponding eigenvalue. Denote the eigenvalues by $\lambda_1 \le \cdots \le \lambda_n$. We will measure how non-spherical a Gaussian is by its eccentricity, namely $\sqrt{\lambda_n/\lambda_1}$. As in the spherical case, for large $n$ the distribution of $N(0, \Sigma)$ will be concentrated around an ellipsoidal shell $\|x\|_\Sigma^2 \approx n$. Yet, if $\Sigma$ has bounded eccentricity, this distribution will also be concentrated, perhaps less tightly, around a spherical shell $\|x\|^2 \approx \lambda_1 + \cdots + \lambda_n = \mathrm{trace}(\Sigma)$.

2.2 Formalizing separation

It is reasonable to imagine, and is borne out by experience with techniques like EM (Duda and Hart, 1973; Redner and Walker, 1984), that a mixture of Gaussians is easiest to learn when the Gaussians do not overlap too much. Our discussion of $N(\mu, \sigma^2 I_n)$ suggests that it is natural to define the radius of this Gaussian as $\sigma\sqrt{n}$, which leads to the following

Definition Two Gaussians $N(\mu_1, \sigma^2 I_n)$ and $N(\mu_2, \sigma^2 I_n)$ are c-separated if $\|\mu_1 - \mu_2\| \ge c\sigma\sqrt{n}$. More generally, Gaussians $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ in $\mathbb{R}^n$ are c-separated if

$$\|\mu_1 - \mu_2\| \ge c\sqrt{\max\{\mathrm{trace}(\Sigma_1), \mathrm{trace}(\Sigma_2)\}}.$$

A mixture of Gaussians is c-separated if its component Gaussians are pairwise c-separated.

The intention is that two Gaussians are c-separated if their centers are $c$ radii apart. Our choice of radius for non-spherical Gaussians $N(\mu, \Sigma)$ is motivated by the observation that points $X$ from such Gaussians have $E\|X - \mu\|^2 = \mathrm{trace}(\Sigma)$.

In high dimension, a 2-separated mixture corresponds roughly to almost completely separated Gaussian clusters, whereas a mixture that is 1- or 1/2-separated has slightly more (though still negligible) overlap. What kind of separation should be expected of real data sets? This will vary from case to case. As an example, we did some simple analysis of a collection of 9,709 handwritten digits from USPS, where each digit was represented as a vector in 256-dimensional space. We fit a mixture of ten Gaussians to the data, by doing each digit separately, and found that this mixture was 0.63-separated.

One way to think about high-dimensional c-separated mixtures is to imagine that their projections to any one coordinate are c-separated. For instance, suppose that measurements are made on a population consisting of two kinds of fish. Various attributes, such as length and weight, are recorded. Suppose also that restricting attention to any one attribute gives a 1-separated mixture of two Gaussians in $\mathbb{R}^1$, which is unimodal and therefore potentially difficult to learn. However, if several (say ten) independent attributes are considered together, then the mixture in $\mathbb{R}^{10}$ will remain 1-separated but will no longer have a unimodal distribution. It is precisely to achieve such an effect that multiple attributes are used. This improvement in terms of better-defined clusters is bought at the price of an increase in dimensionality. It is then up to the learning algorithm to effectively exploit this tradeoff.

It is worth clarifying that our particular notion of separation corresponds to the expectation that at least some fraction of the attributes will provide a little bit of discriminative information between the clusters. As an example of when this is not the case, consider two spherical Gaussians $N(\mu_1, I_n)$ and $N(\mu_2, I_n)$ in some very high dimension $n$, and suppose that only one of the attributes is at all useful. In other words, $\mu_1$ and $\mu_2$ are identical on every coordinate save one. We will consider these clusters to be poorly separated, with separation $O(n^{-1/2})$, even though clustering might information-theoretically be possible.

3 Dimensionality reduction

Dimensionality reduction has been the subject of keen study for the past few decades, and instead of trying to summarize this work we will focus upon two popular contemporary techniques: principal component analysis (PCA) and random projection. They are both designed for data with Euclidean ($L_2$) interpoint distances and are both