UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000

Experiments with Random Projection

Sanjoy Dasgupta*
AT&T Labs - Research

*Work done while at University of California, Berkeley.

Abstract

Recent theoretical work has identified random projection as a promising dimensionality reduction technique for learning mixtures of Gaussians. Here we summarize these results and illustrate them by a wide variety of experiments on synthetic and real data.

1 Introduction

It has recently been suggested that the learning of high-dimensional mixtures of Gaussians might be facilitated by first projecting the data into a randomly chosen subspace of low dimension (Dasgupta, 1999). In this paper we present a comprehensive series of experiments intended to precisely illustrate the benefits of this technique.

There are two main theoretical results about random projection. The first is that data from a mixture of $k$ Gaussians can be projected into just $O(\log k)$ dimensions while still retaining the approximate level of separation between the clusters. This projected dimension is independent of the number of data points and of their original dimension. In the experiments we perform, a value of $10 \ln k$ works nicely. Second, even if the original clusters are highly eccentric (that is, far from spherical), random projection will make them more spherical. This effect is of major importance because raw high-dimensional data can be expected to form very eccentric clusters, owing, for instance, to different units of measurement for different attributes. Clusters of high eccentricity present an algorithmic challenge. For example, they are problematic for the EM algorithm because special pains must be taken to ensure that intermediate covariance matrices do not become singular, or close to singular. Often this is accomplished by imposing special restrictions on the matrices.

These two benefits have made random projection the key ingredient in the first polynomial-time, provably correct (in a PAC-like sense) algorithm for learning mixtures of Gaussians (Dasgupta, 1999). Random projection can also easily be used in conjunction with EM. To test this combination, we have performed experiments on synthetic data from a variety of Gaussian mixtures. In these, EM with random projection is seen to consistently yield models of quality (log-likelihood on a test set) comparable to or better than that of models found by regular EM. And the reduction in dimension saves a lot of time.

Finally, we have used random projection to construct a classifier for handwritten digits, from a canonical USPS data set in which each digit is represented as a vector in $\mathbb{R}^{256}$. We projected the training data randomly into $\mathbb{R}^{40}$, and were able to fit a mixture of fifty Gaussians (five per digit) to this data quickly and easily, without any tweaking or covariance restrictions. The details of the experiment directly corroborated our theoretical results.

Another very popular technique for dimensionality reduction is principal component analysis (PCA). Throughout this paper, both in conceptual discussions and empirical studies, we will contrast PCA with random projection in order to get a better feel for each.
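As a concrete illustration of this setup, here is a minimal sketch of random projection written with NumPy. It is not code from the paper: the function name is ours, and the use of a QR-orthonormalized Gaussian matrix is only one common way of realizing "a randomly chosen subspace"; the target dimension $10 \ln k$ follows the rule of thumb quoted above.

```python
import numpy as np

def random_projection(X, k, rng=None):
    """Project rows of X (shape m x n) onto a randomly chosen
    d-dimensional subspace, with d = ceil(10 ln k) as suggested in the text.

    Assumption: the random subspace is obtained by orthonormalizing the
    columns of an i.i.d. N(0, 1) matrix via QR (one standard construction,
    not prescribed by the paper).
    """
    rng = np.random.default_rng(rng)
    m, n = X.shape
    d = int(np.ceil(10 * np.log(k)))   # projected dimension, 10 ln k
    G = rng.standard_normal((n, d))    # random n x d Gaussian matrix
    Q, _ = np.linalg.qr(G)             # orthonormal basis of a random d-dim subspace
    return X @ Q                       # m x d projected data

# Example: 256-dimensional data, k = 10 components -> about 24 dimensions
X = np.random.default_rng(0).standard_normal((1000, 256))
X_low = random_projection(X, k=10)
print(X_low.shape)   # (1000, 24)
```

Note that the projected dimension depends only on the number of clusters $k$, not on the original dimension or on the number of data points.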
2 High-dimensional Gaussians

2.1 Some counter-intuitive effects

An $n$-dimensional Gaussian $N(\mu, \Sigma)$ has density function
$$p(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$
If $\Sigma$ is a multiple of the identity matrix, then the Gaussian is called spherical. Some important intuition about the behavior of Gaussians in high dimension can quickly be gained by examining the spherical Gaussian $N(0, \sigma^2 I_n)$. Although its density is highest at the origin, it turns out that for large $n$ most of the probability mass lies far away from this center. A point $X \in \mathbb{R}^n$ chosen randomly from this Gaussian has coordinates $X_i$ which are i.i.d. $N(0, \sigma^2)$. Therefore its expected squared Euclidean norm is $E(\|X\|^2) = \sum_i E X_i^2 = n\sigma^2$. In fact, it can be shown quite routinely, by writing out the moment-generating function of $\|X\|^2$, that the distribution of $\|X\|^2$ will be tightly concentrated around its expected value. Specifically,
$$P\left(\big|\,\|X\|^2 - \sigma^2 n\,\big| > \epsilon \sigma^2 n\right) \le 2e^{-n\epsilon^2/24}.$$
That is to say, for big enough $n$, almost the entire distribution lies in a thin shell of radius approximately $\sigma\sqrt{n}$. Thus the natural scale of this Gaussian is in units of $\sigma\sqrt{n}$.

This effect might arouse some initial skepticism because it is not observable in one or two dimensions. But it can perhaps be made more plausible by the following explanation. The Gaussian $N(0, I_n)$ assigns density proportional to $e^{-\rho^2 n/2}$ to points on the surface of the sphere centered at the origin and of radius $\rho\sqrt{n}$, $\rho \le 1$. But the surface area of this sphere is proportional to $(\rho\sqrt{n})^{n-1}$. For large $n$, as $\rho \uparrow 1$, this surface area is growing much faster than the density is decaying, and thus most of the probability mass lies at distance about $\sqrt{n}$ from the origin (Bishop, 1995, exercise 1.4). Figure 1 is a graphical depiction of this effect for various values of $n$.

[Figure 1: a graphical depiction of this effect for various values of $n$.]

The more general Gaussian $N(0, \Sigma)$ has ellipsoidal contours of equal density. Each such ellipsoid is of the form $\{x : x^T \Sigma^{-1} x = r^2\}$, corresponding to points at a fixed Mahalanobis distance $\|x\|_\Sigma = \sqrt{x^T \Sigma^{-1} x}$ from the center of the Gaussian. The principal axes of these ellipsoids are given by the eigenvectors of $\Sigma$. The radius along a particular axis is proportional to the square root of the corresponding eigenvalue. Denote the eigenvalues by $\lambda_1 \le \cdots \le \lambda_n$. We will measure how non-spherical a Gaussian is by its eccentricity, namely $\sqrt{\lambda_n/\lambda_1}$. As in the spherical case, for large $n$ the distribution of $N(0, \Sigma)$ will be concentrated around an ellipsoidal shell $\|x\|_\Sigma^2 \approx n$. Yet, if $\Sigma$ has bounded eccentricity, this distribution will also be concentrated, perhaps less tightly, around a spherical shell $\|x\|^2 \approx \lambda_1 + \cdots + \lambda_n = \mathrm{trace}(\Sigma)$.
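The thin-shell effect is easy to check numerically. The following is our own quick Monte Carlo illustration (assuming NumPy; the sample size and the values of $n$ are arbitrary choices), showing that $\|X\|/(\sigma\sqrt{n})$ settles near 1 with shrinking spread as $n$ grows.

```python
import numpy as np

# Empirical check of the thin-shell effect: for X ~ N(0, sigma^2 I_n),
# ||X|| concentrates around sigma * sqrt(n) as n grows.
rng = np.random.default_rng(0)
sigma = 1.0
for n in (1, 2, 10, 100, 1000):
    X = sigma * rng.standard_normal((10000, n))          # 10,000 samples from N(0, sigma^2 I_n)
    norms = np.linalg.norm(X, axis=1) / (sigma * np.sqrt(n))
    print(f"n = {n:5d}   mean of ||X||/(sigma sqrt(n)) = {norms.mean():.3f}   "
          f"std = {norms.std():.3f}")
# The normalized norm approaches 1 and its spread shrinks with n, so nearly
# all of the mass lies in a thin shell of radius about sigma * sqrt(n).
```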
2.2 Formalizing separation

It is reasonable to imagine, and is borne out by experience with techniques like EM (Duda and Hart, 1973; Redner and Walker, 1984), that a mixture of Gaussians is easiest to learn when the Gaussians do not overlap too much. Our discussion of $N(\mu, \sigma^2 I_n)$ suggests that it is natural to define the radius of this Gaussian as $\sigma\sqrt{n}$, which leads to the following definition.

Definition. Two Gaussians $N(\mu_1, \sigma^2 I_n)$ and $N(\mu_2, \sigma^2 I_n)$ are $c$-separated if $\|\mu_1 - \mu_2\| \ge c\sigma\sqrt{n}$. More generally, Gaussians $N(\mu_1, \Sigma_1)$ and $N(\mu_2, \Sigma_2)$ in $\mathbb{R}^n$ are $c$-separated if
$$\|\mu_1 - \mu_2\| \ge c\,\sqrt{\max(\mathrm{trace}(\Sigma_1), \mathrm{trace}(\Sigma_2))}.$$
A mixture of Gaussians is $c$-separated if its component Gaussians are pairwise $c$-separated.

The intention is that two Gaussians are $c$-separated if their centers are $c$ radii apart. Our choice of radius for non-spherical Gaussians $N(\mu, \Sigma)$ is motivated by the observation that points $X$ from such Gaussians have $E\|X - \mu\|^2 = \mathrm{trace}(\Sigma)$.

In high dimension, a 2-separated mixture corresponds roughly to almost completely separated Gaussian clusters, whereas a mixture that is 1- or 1/2-separated has slightly more (though still negligible) overlap. What kind of separation should be expected of real data sets? This will vary from case to case. As an example, we did some simple analysis of a collection of 9,709 handwritten digits from USPS, where each digit was represented as a vector in 256-dimensional space. We fit a mixture of ten Gaussians to the data, by fitting each digit separately, and found that this mixture was 0.63-separated.

One way to think about high-dimensional $c$-separated mixtures is to imagine that their projections to any one coordinate are $c$-separated. For instance, suppose that measurements are made on a population consisting of two kinds of fish. Various attributes, such as length and weight, are recorded. Suppose also that restricting attention to any one attribute gives a 1-separated mixture of two Gaussians in $\mathbb{R}^1$, which is unimodal and therefore potentially difficult to learn. However, if several (say ten) independent attributes are considered together, then the mixture in $\mathbb{R}^{10}$ will remain 1-separated but will no longer have a unimodal distribution. It is precisely to achieve such an effect that multiple attributes are used. This improvement, in the form of better-defined clusters, is bought at the price of an increase in dimensionality. It is then up to the learning algorithm to effectively exploit this tradeoff.

It is worth clarifying that our particular notion of separation corresponds to the expectation that at least some fraction of the attributes will provide a little bit of discriminative information between the clusters. As an example of when this is not the case, consider two spherical Gaussians $N(\mu_1, I_n)$ and $N(\mu_2, I_n)$ in some very high dimension $n$, and suppose that only one of the attributes is at all useful. In other words, $\mu_1$ and $\mu_2$ are identical on every coordinate save one. We will consider these clusters to be poorly separated, since their separation is $O(n^{-1/2})$, even though clustering might information-theoretically be possible.

3 Dimensionality reduction

Dimensionality reduction has been the subject of keen study for the past few decades, and instead of trying to summarize this work we will focus upon two popular contemporary techniques: principal component analysis (PCA) and random projection. They are both designed for data with Euclidean ($\ell_2$) interpoint distances and are both achieved via linear mappings.
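Both maps can be written in a few lines. The sketch below is our own illustration (assuming NumPy; the helper names pca_map and random_map are hypothetical) of the basic contrast: PCA chooses its $d$ directions from the sample covariance of the data, while random projection chooses a $d$-dimensional subspace without looking at the data at all.

```python
import numpy as np

def pca_map(X, d):
    """Project onto the top-d principal directions of X (data-dependent):
    the eigenvectors of the sample covariance with the largest eigenvalues,
    obtained here via an SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt: principal directions
    return Xc @ Vt[:d].T

def random_map(X, d, rng=None):
    """Project onto a randomly chosen d-dimensional subspace (data-independent),
    here realized by orthonormalizing a random Gaussian matrix."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], d)))
    return X @ Q

X = np.random.default_rng(1).standard_normal((500, 256))
print(pca_map(X, 40).shape, random_map(X, 40).shape)   # (500, 40) (500, 40)
```

Both produce a $d$-dimensional linear image of the data; the difference lies entirely in how the $d$ directions are chosen.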