Latent Variable Models for Dimensionality Reduction

Zhihua Zhang
College of Computer Science and Technology
Zhejiang University
Hangzhou, Zhejiang 310027, China
[email protected]

Michael I. Jordan
Departments of EECS and Statistics
University of California, Berkeley
Berkeley, CA 94720, USA
[email protected]

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

Principal coordinate analysis (PCO), a dual of principal component analysis (PCA), is a classical method for exploratory data analysis. In this paper we provide a probabilistic interpretation of PCO. We show that this interpretation yields a maximum likelihood procedure for estimating the PCO parameters, and we also present an iterative expectation-maximization algorithm for obtaining the maximum likelihood estimates. Finally, we show that our framework yields a probabilistic formulation of kernel PCA.

1 Introduction

Multidimensional scaling (MDS) (Borg and Groenen, 1997) has been widely applied to data analysis and processing. Like principal component analysis (PCA) (Jolliffe, 2002), MDS is an important tool for dimensionality reduction and visualization. Given the (dis)similarities between pairs of objects, MDS is concerned with the problem of representing the objects as points in a (usually) Euclidean space so that the distances between the points match the original dissimilarities as closely as possible.

In terms of the techniques used to obtain the configurations of points, MDS methods can be categorized as metric or nonmetric. Metric MDS methods in turn include classical scaling and least squares scaling. Classical scaling is commonly called principal coordinate analysis (PCO). Considering that there exists a duality between PCO and PCA (Gower, 1966), and given our interest in the relationship between MDS and PCA, we prefer the term PCO in this paper to refer to classical scaling MDS.

In PCO, the original dissimilarity measure is required to be Euclidean. Equivalently, the inner product function that induces the dissimilarity is positive definite. This inner product function thus defines a similarity measure and can be referred to as a reproducing kernel. On this basis, Schölkopf et al. (1998) proposed kernel PCA (KPCA) as a nonlinear extension of PCA. There also exists a duality between PCO and KPCA (Williams, 2001), so a nonlinear version of PCO can be devised by using reproducing kernels as similarities.

Different MDS models make use of different techniques to model the configurations of points. For example, conventional PCO employs the spectral decomposition method, while metric least squares scaling uses an iterative majorization method (Borg and Groenen, 1997). Statistical approaches to MDS have also been devised, in which maximum likelihood methods are used to estimate the configurations of points (Ramsay, 1982; Groenen et al., 1995). In addition, Oh and Raftery (2001) proposed a Bayesian method for estimating the configuration. In these statistical treatments, the dissimilarities are generally modeled as following a truncated normal or log-normal distribution.

However, these statistical approaches are not appropriate for PCO. Since the dissimilarities in PCO are Euclidean, the metric inequality (i.e., the triangle inequality) must be satisfied, and for dissimilarities generated from a truncated normal or log-normal distribution the metric inequality is no longer guaranteed. Thus, a probabilistic formulation has been absent for PCO. In the current paper we attempt to address this gap by showing that PCO does indeed fit into a maximum likelihood estimation framework.

Tipping and Bishop (1999) proposed a probabilistic PCA (PPCA) model in which PCA is reformulated as a normal latent variable model that is closely related to factor analysis (FA) (Bartholomew and Knott, 1999).

Owing to the duality between PCO and PCA, it would be desirable to develop such a latent variable model for PCO so that we have a probabilistic formulation of PCO.

Conventional PCO (or KPCA) does not necessarily require that the original objects (called feature vectors in the literature) are explicitly available. Instead, it only requires that the (dis)similarities are given. However, the original objects themselves are given in PPCA and FA. As a result, it seems difficult to follow the approach taken for PCA, in which a connection to FA is exploited, in developing a latent variable model for PCO.

Recall that since in PCO or KPCA the dissimilarities (or similarities) are Euclidean (or positive definite), there exists a set of feature vectors such that the Euclidean distances (or the inner products) between them are exactly equal to the dissimilarities (or similarities). This motivates us to treat the feature vectors as virtual observations. We will show how this treatment allows us to specify normal latent variable models for PCO as well as KPCA. We refer to these interpretations as probabilistic PCO (PPCO) and probabilistic KPCA (PKPCA), respectively.

In PPCO the principal coordinates (the configurations in a low-dimensional Euclidean space) are treated as the model parameters. As a result, we can use maximum likelihood (ML) to estimate the principal coordinates. We shall see that the estimated results agree with those obtained via the spectral decomposition method. Moreover, the latent variable idea allows us to use the expectation-maximization (EM) algorithm for PPCO. Importantly, without the explicit usage of the virtual observations themselves, we can still implement the ML and EM procedures using only the available (dis)similarities.

In PKPCA the principal components (the orthonormal bases spanning the low-dimensional subspace) are treated as the model parameters, which are also estimated by ML. Our model differs from the PPCA model of Tipping and Bishop (1999) in that they use non-orthonormal principal components (factor loadings) instead of orthonormal principal components. Although this difference seems minor, it has important consequences: the solution of our model agrees with that of conventional PCA, but the solution of the PPCA of Tipping and Bishop (1999) does not.

The remainder of the paper is organized as follows. Section 2 describes the original formulations of PCO and KPCA. In Section 3 we propose a normal latent variable model for PCO; a direct ML method and an EM algorithm for parameter estimation are also devised. In Section 4 we propose PKPCA and establish the duality between PPCO and PKPCA. Experimental studies and concluding remarks are given in Sections 5 and 6, respectively. All proofs and derivations are omitted; they are presented in a long version of this paper.

2 PCO and KPCA

Suppose we are given a set of dissimilarities, {δ_ij² : i, j = 1,...,n}, between n objects. Let ∆ = [δ_ij²] be the n × n dissimilarity matrix. We assume that ∆ is Euclidean. This implies that there exists a set of n points in a Euclidean space, denoted by {f_i : i = 1,...,n}, such that

$$\delta_{ij}^2 = (f_i - f_j)'(f_i - f_j) = f_i'f_i + f_j'f_j - 2 f_i'f_j. \qquad (1)$$

We thus have

$$-\tfrac{1}{2} H \Delta H = H F F' H,$$

where F = [f_1,...,f_n]′ and H = I_n − (1/n)1_n 1_n′ (a centering matrix). Here and later, I_n is the n × n identity matrix and 1_n is the n × 1 vector of ones. Thus, the assumption that ∆ is Euclidean is equivalent to the positive semidefiniteness of −(1/2)H∆H.

We now establish a connection of ∆ to the theory of reproducing kernels. Starting with a set of n p-dimensional input vectors, {x_i : i = 1,...,n} ⊂ X ⊂ R^p, we define a positive definite function K : X × X → R as a kernel function. Three kernel functions are widely used in practice:

(a) Linear kernel: K(x_i, x_j) = x_i′x_j;

(b) Gaussian kernel: K(x_i, x_j) = exp(−‖x_i − x_j‖²/θ) with θ > 0;

(c) Polynomial kernel: K(x_i, x_j) = (x_i′x_j + 1)^m.

From the kernel function and the data, we obtain an n × n kernel matrix K = [k_ij] where k_ij = K(x_i, x_j). Since the kernel matrix K is positive semidefinite (p.s.d.), it can be regarded as an inner product matrix and induces a Euclidean matrix, which is then defined as the aforementioned ∆. It is readily seen that

$$\delta_{ij}^2 = k_{ii} + k_{jj} - 2 k_{ij}. \qquad (2)$$

Comparing (2) with (1), we equate k_ij with the inner product between f_i and f_j, i.e., k_ij = f_i′f_j and K = FF′. The vector f_i is referred to as the feature vector corresponding to x_i. Thus, the kernel technology provides us with an approach to the construction of inner product matrices and dissimilarity (distance) matrices.
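To make the construction in (2) concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper, whose experiments were implemented in Matlab; the function name and synthetic data are ours). It builds a Gaussian kernel matrix K on random inputs, forms the squared dissimilarities of (2), and checks numerically that −(1/2)H∆H = HKH:

```python
import numpy as np

def gaussian_kernel(X, theta):
    # K with entries k_ij = exp(-||x_i - x_j||^2 / theta)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / theta)

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))              # synthetic input vectors (illustrative only)

K = gaussian_kernel(X, theta=2.0)
Delta = np.diag(K)[:, None] + np.diag(K)[None, :] - 2.0 * K   # eq. (2)

H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
assert np.allclose(-0.5 * H @ Delta @ H, H @ K @ H)           # both routes give the same centered matrix
```

The centered matrix HKH computed here is the matrix denoted Q in what follows.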

PCO (or classical multidimensional scaling) was originally used to construct the coordinates for the points, {y_i : i = 1,...,n}, in a Euclidean space, such that

$$(y_i - y_j)'(y_i - y_j) = d_{ij}^2 \approx \delta_{ij}^2 = (f_i - f_j)'(f_i - f_j). \qquad (3)$$

The focus of this paper is dimensionality reduction: letting y_i ∈ R^q and f_i ∈ R^r, q should be less than r. Assuming that the centroid of the y_i is at the origin of R^q, from (3) we obtain −(1/2)H∆H ≈ YY′ where Y = [y_1,...,y_n]′.

We now consider using a kernel function in the PCO setting. If we use a linear kernel, we obtain standard PCO in which f_i = x_i. More generally, we obtain a nonlinear version of PCO which is based on a nonlinear mapping from x_i to f_i.

Given either ∆ or K, we henceforth denote Q = −(1/2)H∆H or Q = HKH. PCO attempts to find the configurations of the y_i by applying eigenvalue decomposition to Q. Let Ψ_q be an n × q matrix whose columns are the top q eigenvectors of Q and Γ_q be a q × q diagonal matrix whose diagonal elements are the top q eigenvalues of Q. The solution of PCO is then given by Y = Ψ_q Γ_q^{1/2} S, where S is an arbitrary orthonormal matrix. It is worth noting that the solution satisfies Y′1_n = 0. In general, we set S = I_q so that Y′Y = Γ_q is diagonal.

In order to explore the nonlinear structure of the x_i in a low-dimensional representation, KPCA (Schölkopf et al., 1998) uses the sample covariance matrix of the f_i, which is given by

$$R = \frac{1}{n}\sum_{i=1}^{n}(f_i - \bar{f})(f_i - \bar{f})' = \frac{1}{n} F'HF = \frac{1}{n} F'HHF$$

with f̄ = (1/n) Σ_{i=1}^n f_i. The goal of KPCA is to find the first q principal components. Typically, F is not explicitly available. Since F′HHF has the same nonzero eigenvalues as HFF′H = HKH = Q, KPCA works with Q instead. That is, the first q principal components constitute the matrix S Γ_q^{−1/2} Ψ_q′ HF. The q-dimensional configuration of F is then HQHΨ_q Γ_q^{−1/2} S = Ψ_q Γ_q^{1/2} S, which is computed without the explicit use of F. This shows the duality between KPCA and PCO.¹

¹Gower (1966) referred to −(1/2)H∆H or HKH as a Q matrix and the covariance matrix (1/n)F′HF as an R matrix. The corresponding dimensionality reduction methods are called Q and R techniques, respectively. It is clear that PCO employs the Q technique while KPCA employs the R technique. Since the Q and R techniques are dual to each other, there also exists a duality between PCO and KPCA. The difference is that PCO directly computes the low-dimensional configurations, while KPCA computes the bases that span the low-dimensional subspace; the low-dimensional configurations are then the projections of the feature vectors onto this subspace.
3 Probabilistic PCO

Before presenting our probabilistic approach to PCO, we first introduce some notation. We use A⁺ for the Moore-Penrose inverse of A. The Kronecker product of A and B is denoted by A ⊗ B. We employ the notation of Gupta and Nagar (2000) for matrix-variate distributions. Thus, for an s × t random matrix Z, Z ∼ N_{s,t}(M, A ⊗ B) means that Z follows a matrix-variate normal distribution with mean matrix M (s × t) and covariance matrix A ⊗ B, where A (s × s) and B (t × t) are p.s.d.

3.1 Normal Latent Variable Model

We attempt to develop PPCO through a normal latent variable model. However, it is not immediately clear how to derive PPCO from PPCA or FA because for PCO, apart from the Q matrix Q, the f_i and their dimension r are not explicitly available. In this case, we regard the f_i as virtual observations. Recall that PCO directly calculates the coordinates of the y_i. Thus we treat the y_i as parameters of the model that need to be estimated. We thus reformulate PCO as a latent variable model in matrix form:

$$F = Y W + 1_n u' + \Upsilon, \qquad (4)$$

where u is an r × 1 mean vector, W is a q × r latent matrix, and Υ is an n × r error matrix. Furthermore, we assume

$$W \sim N_{q,r}\big(0, (I_q \otimes I_r)/r\big), \qquad \Upsilon \sim N_{n,r}\big(0, (\lambda I_n \otimes I_r)/r\big), \qquad (5)$$

where λ > 0. Thus we get a PPCO model in which each row of Y is just the target coordinate associated with the corresponding row of F. Now the difficulty is that both r and F are usually unknown. Fortunately, we will see that a linear algebraic manipulation yields an estimation procedure for the unknown parameters, Y and λ, that does not explicitly depend on r and F.
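For intuition about the matrix-variate assumptions in (5), the following illustrative sketch (the dimensions n, q, r, the value of λ, and the choice of Y are all hypothetical, chosen by us) draws one realization of the virtual feature matrix F from model (4); note that N_{q,r}(0, (I_q ⊗ I_r)/r) simply means i.i.d. N(0, 1/r) entries:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, r, lam = 50, 2, 10, 0.1                     # hypothetical sizes and noise level

Y = rng.normal(size=(n, q))
Y -= Y.mean(axis=0)                               # enforce Y'1_n = 0, as assumed for PPCO
u = rng.normal(size=r)                            # r x 1 mean vector

W = rng.normal(size=(q, r)) / np.sqrt(r)          # W ~ N_{q,r}(0, (I_q (x) I_r)/r)
Ups = rng.normal(size=(n, r)) * np.sqrt(lam / r)  # Upsilon ~ N_{n,r}(0, (lam I_n (x) I_r)/r)

F = Y @ W + np.outer(np.ones(n), u) + Ups         # model (4): F = Y W + 1_n u' + Upsilon
```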


As described in Section 2, we assume that the columns of Y are centered to have mean 0, i.e., Y′1_n = 0. Premultiplying (4) by the centering matrix H, we obtain

$$H F = H Y W + H \Upsilon.$$

It is clearly seen that Y′H1_n = 0. Thus, we now treat HY as the low-dimensional representation of F. For notational simplicity, we still use Y and Υ to denote HY and HΥ. Thus, we have

$$H F = Y W + \Upsilon \qquad (6)$$

with Y′1_n = 0 and Υ ∼ N_{n,r}(0, λ(H ⊗ I_r)/r). Since H is singular, the distribution of Υ degenerates to a singular normal distribution (Mardia et al., 1979).

3.2 Maximum Likelihood Estimates

It follows readily from (6) that

$$H F \mid W \sim N_{n,r}\big(Y W, \ \lambda (H \otimes I_r)/r\big)$$

and so, by integrating out W, we have

$$H F \sim N_{n,r}\big(0, \ (Y Y' + \lambda H) \otimes I_r / r\big). \qquad (7)$$

Using Bayes' rule, we can compute the conditional distribution of W given F as

$$W \mid F \sim N_{q,r}\big(\Sigma^{-1} Y' H F, \ \lambda (\Sigma^{-1} \otimes I_r)/r\big), \qquad (8)$$

where Σ = λI_q + Y′Y. Now HF also follows a singular normal distribution and its p.d.f. is given by

$$\frac{(2\pi)^{\frac{(1-n)r}{2}} \, r^{\frac{nr}{2}}}{\big(\prod_{i=1}^{n-1}\eta_i\big)^{r/2}} \exp\Big\{-\frac{r}{2}\,\mathrm{tr}\big[(Y Y' + \lambda H)^{+} H F F' H\big]\Big\},$$

where η_i, i = 1,...,n−1, are the nonzero eigenvalues of YY′ + λH. Note that we use the fact that the rank of YY′ + λH (= H(YY′ + λI_n)H) is equal to n−1. This is because YY′ + λI_n is nonsingular and the rank of H is n−1.

Treating Y and λ as unknown parameters, we now consider their maximum likelihood (ML) estimates. The corresponding log-likelihood function is given by

$$\varphi = -\frac{r}{2}\sum_{i=1}^{n-1}\log\eta_i - \frac{r}{2}\,\mathrm{tr}\big[(Y Y' + \lambda H)^{+} H F F' H\big] - \frac{(n-1)r}{2}\log(2\pi) + \frac{nr}{2}\log r.$$

It is easily seen that the maximization of ϕ with respect to (w.r.t.) Y and λ is equivalent to the minimization of

$$f(Y,\lambda) = \sum_{i=1}^{n-1}\log\eta_i + \mathrm{tr}\big[(Y Y' + \lambda H)^{+} Q\big] \qquad (9)$$

w.r.t. Y and λ. Note that f is independent of both F and r. Thus, given either K or ∆, we can estimate Y and λ without the explicit usage of F and r. In particular, we have

Theorem 1  Let Q = HKH or Q = −(1/2)H∆H be of rank d ≤ n−1, and let f be the real-valued loss function defined by (9), where Y ∈ R^{n×q} subject to q ≤ d and Y′1_n = 0. Then the minimum of f w.r.t. Y and λ is obtained when

$$\hat{Y} = \Psi_q (\Gamma_q - \hat{\lambda} I_q)^{1/2} S \quad \text{and} \quad \hat{\lambda} = \frac{1}{n-q-1}\sum_{j=q+1}^{n-1}\gamma_j,$$

where γ_1 ≥ ··· ≥ γ_q ≥ ··· ≥ γ_{n−1} are the eigenvalues of Q, S is an arbitrary q × q orthonormal matrix, Γ_q is a q × q diagonal matrix containing the first q principal (largest) eigenvalues γ_i, and Ψ_q is an n × q orthogonal matrix in which the q column vectors are the principal eigenvectors corresponding to Γ_q.

Since Q1_n = 0, we have Ŷ′1_n = 0. Furthermore, we have that Ŷ′Ŷ = Γ_q − λ̂I_q when S = I_q. Thus, the ML estimate of Y agrees with that obtained via the eigenvalue decomposition method in Section 2. Moreover, if λ̂ → 0, then the methods are entirely equivalent. It is worth pointing out that if γ_q > γ_{q+1}, (Ŷ, λ̂) is a strict local minimum.
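Theorem 1 can be turned into a few lines of code. The following NumPy sketch (our illustration of the closed-form formulas with S = I_q; the function name is ours, and this is not the authors' Matlab implementation) computes λ̂ and Ŷ from the eigendecomposition of Q:

```python
import numpy as np

def ppco_ml(Q, q):
    # Direct ML estimates of Theorem 1: lam_hat and Y_hat = Psi_q (Gamma_q - lam_hat I_q)^{1/2}.
    n = Q.shape[0]
    evals, evecs = np.linalg.eigh(Q)
    order = np.argsort(evals)[::-1]
    gamma = evals[order]                  # gamma_1 >= gamma_2 >= ... >= gamma_n
    psi = evecs[:, order]
    lam_hat = gamma[q:n-1].mean()         # average of gamma_{q+1}, ..., gamma_{n-1}
    Y_hat = psi[:, :q] * np.sqrt(np.maximum(gamma[:q] - lam_hat, 0.0))
    return Y_hat, lam_hat
```

As stated after the theorem, Ŷ′Ŷ = Γ_q − λ̂I_q, and as λ̂ → 0 the estimate approaches the conventional PCO configuration Ψ_q Γ_q^{1/2}.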

3.3 EM Algorithm

Considering W as the missing data, {W, F} as the complete data, and Y and λ as the model parameters, we devise an EM algorithm for our PPCO model. Given the tth estimates Y(t) and λ(t) of Y and λ, the EM algorithm updates Y and λ by the following procedure:

$$Y(t{+}1) = Q\,Y(t)\,\big[\lambda(t)\, I_q + \Sigma^{-1}(t)\,Y'(t)\,Q\,Y(t)\big]^{-1}, \qquad (10)$$

$$\lambda(t{+}1) = \frac{1}{n-1}\Big[\mathrm{tr}(Q) - \mathrm{tr}\big(Y(t{+}1)\,\Sigma^{-1}(t)\,Y'(t)\,Q\big)\Big], \qquad (11)$$

where Σ(t) = λ(t)I_q + Y′(t)Y(t). The derivation of the EM algorithm is omitted. We see that the iterative equations (10) and (11) also work with only Q. Note that 1_n′Y(t) = 0 is always satisfied in the EM algorithm. It can be proven that if λ(0) > 0, the λ calculated via (11) is always positive. Moreover, the EM iterates of Y and λ converge to the direct ML estimates.

Compared with the direct ML estimate, the EM approach is more efficient when n is large, because the former involves the spectral decomposition of an n × n matrix, whereas the latter involves the inversion of q × q matrices. Thus, the EM algorithm provides an efficient numerical method for PCO.
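The updates (10) and (11) touch only Q and q × q matrices, so they are straightforward to implement. A minimal NumPy sketch (our illustration; the function name, initialization, and fixed iteration count are hypothetical, and in practice one can initialize from conventional PCA as done in Section 5):

```python
import numpy as np

def ppco_em(Q, q, Y0, lam0=0.1, T=100):
    # EM iteration (10)-(11) for PPCO; only Q is needed, never F or r.
    n = Q.shape[0]
    Y, lam = Y0.copy(), lam0
    for _ in range(T):
        Sigma = lam * np.eye(q) + Y.T @ Y                       # Sigma(t) = lam(t) I_q + Y(t)'Y(t)
        QY = Q @ Y
        M = lam * np.eye(q) + np.linalg.solve(Sigma, Y.T @ QY)  # bracketed q x q matrix in (10)
        Y_new = np.linalg.solve(M.T, QY.T).T                    # (10): Y(t+1) = Q Y(t) M^{-1}
        lam = (np.trace(Q) - np.trace(Y_new @ np.linalg.solve(Sigma, Y.T @ Q))) / (n - 1)  # (11)
        Y = Y_new
    return Y, lam
```

Only q × q linear systems are solved at each sweep, and the iterates converge to the direct ML estimates given above.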


4 Probabilistic KPCA

In this section we present our probabilistic KPCA (PKPCA) model and explore its relationship with PPCO. Our point of departure is Tipping and Bishop (1999), who proposed PPCA based on a latent variable model. We extend this approach to KPCA. In particular, PKPCA expresses the feature vector f ∈ R^r as a linear combination of q principal components (say, the w_j) plus noise (ε):

$$f = \sum_{j=1}^{q} w_j y_j + u + \epsilon = W' y + u + \epsilon, \qquad \epsilon \sim N(0, \lambda I_r), \quad y \sim N(0, I_q),$$

where y = (y_1,...,y_q)′ ∈ R^q with q < min{p, r} is the latent vector and W′ = [w_1,...,w_q] is an r × q basis matrix that relates f and y. Given an input matrix X = [x_1,...,x_n]′ and its corresponding feature matrix F, we express PKPCA in matrix form as

$$F = Y W + 1_n u' + \Upsilon, \qquad (12)$$

$$Y \sim N_{n,q}(0, \ I_n \otimes I_q), \qquad \Upsilon \sim N_{n,r}\big(0, \ \lambda (I_n \otimes I_r)\big).$$

In this model, Y is treated as a latent matrix and W as a matrix of principal components such that WW′ = I_q. The q-dimensional configuration of f (or x) is obtained by using the expectation of y conditioned on f and W, i.e., E(y | f, W). However, in PPCO, W is treated as a latent matrix and Y is just the q-dimensional representation of X. This provides a perspective on the duality between PPCA and PPCO. Note that in the PPCA of Tipping and Bishop (1999), the constraint WW′ = I_q is not imposed. We will see shortly that this constraint is indeed necessary.

In the linear kernel case, i.e., f = x, the ML estimate of W is U_q′ where U_q is the matrix formed by the top q eigenvectors of R. We thus obtain the configuration of x in the q-dimensional space as E(y | x, W) = (WW′ + λI_q)^{−1}W(x − u) = (1/(1+λ)) U_q′(x − u). When λ → 0, the solution of PPCA is the same as that of conventional PCA. In the PPCA of Tipping and Bishop (1999), however, the ML estimate of W is (Γ_q − λI_q)^{1/2} U_q′ where Γ_q is the diagonal matrix of the q largest eigenvalues of nR. Hence, E(y | x, W) = (WW′ + λI_q)^{−1}W(x − u) = Γ_q^{−1}(Γ_q − λI_q)^{1/2} U_q′(x − u), which does not converge to the conventional PCA as λ → 0.

The ML estimate of u is û = (1/n) Σ_{i=1}^n f_i. After substituting û for u into the likelihood, the resulting log-likelihood is referred to as the log concentrated likelihood (Magnus and Neudecker, 1999), which is used for the ML estimation of W and λ. Typically, the feature vectors f are assumed unavailable. In this case, the ML estimate of W is given by Ŵ = S Γ_q^{−1/2} Ψ_q′ HF, where Γ_q is the diagonal matrix of the q largest eigenvalues of Q, Ψ_q is the corresponding eigenvector matrix of Q, and S is a q × q orthonormal matrix. We omit the derivation. As we can see, Ŵ does not depend on λ and r but on the feature matrix F. Fortunately, since [y | f, W] ∼ N_q(Σ^{−1}W(f − u), λΣ^{−1}) where Σ = WW′ + λI_q = (1+λ)I_q, we are able to compute the q-dimensional configuration of f via the kernel trick. In particular, we have

$$E(y \mid f, W) = \frac{1}{1+\hat{\lambda}}\,\Gamma_q^{-1/2}\Psi_q' H F (f - \hat{u}) = \frac{1}{1+\hat{\lambda}}\,\Gamma_q^{-1/2}\Psi_q' H F \Big(f - \frac{1}{n}F'1_n\Big) = \frac{1}{1+\hat{\lambda}}\,\Gamma_q^{-1/2}\Psi_q'\Big(k - \frac{1}{n}K 1_n\Big),$$

where k = (K(x, x_1),...,K(x, x_n))′. In matrix form, we have

$$E(Y \mid F, W) = \frac{1}{1+\hat{\lambda}}\,Q \Psi_q \Gamma_q^{-1/2} = \frac{1}{1+\hat{\lambda}}\,\Psi_q \Gamma_q^{1/2}.$$

This shows that when λ̂ → 0, the solution of PKPCA is the same as that of PPCO as well as the conventional KPCA and PCO. Since 1/(1+λ̂) is a constant, we use (1+λ̂)E(y | f, W) as the low-dimensional configuration of f. As a result, the solution of PKPCA fully agrees with that of the conventional KPCA as well as the conventional PCO. Moreover, we can ignore the ML estimate of λ for our purpose of dimensionality reduction. Unfortunately, it is not feasible to obtain an EM algorithm for PKPCA when the feature vectors are not explicitly available.
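As an illustration of the kernel trick just described (a sketch under the notation above, not the authors' code; the function name is ours), the quantity (1+λ̂)E(y | f, W) = Γ_q^{−1/2}Ψ_q′(k − (1/n)K1_n) for a new input x can be computed from kernel evaluations alone:

```python
import numpy as np

def pkpca_project(K, k_new, q):
    # (1 + lam_hat) * E(y | f, W) for a new point, using only kernel values.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Q = H @ K @ H
    evals, evecs = np.linalg.eigh(Q)
    idx = np.argsort(evals)[::-1][:q]
    gamma_q, psi_q = evals[idx], evecs[:, idx]
    centered = k_new - K.mean(axis=1)               # k - (1/n) K 1_n
    return (psi_q.T @ centered) / np.sqrt(gamma_q)  # Gamma_q^{-1/2} Psi_q' (k - (1/n) K 1_n)
```

Applying this to the training points themselves (k_new = K[:, i]) reproduces the rows of Ψ_q Γ_q^{1/2}, i.e., (1+λ̂)E(Y | F, W), so PKPCA, PPCO, KPCA, and PCO all deliver the same configuration.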

5 Experiments

In this paper our principal focus has been to provide a probabilistic perspective from which to view PCO and KPCA. Although the direct ML estimation approaches to these models give the same solutions as their conventional counterparts, our analysis has also provided an EM algorithm for PPCO, and it is of interest to compare the performance of the EM algorithm with the direct ML method.

As we saw in Section 3.3, the EM algorithm is more efficient than the ML method when n is large. Moreover, the solution of the EM algorithm converges to that of the ML estimate. As is well known, EM algorithms rely on initial values; we use conventional PCA to initialize the EM algorithm for our PPCO. All algorithms have been implemented in Matlab on a Pentium 4 computer with a 2.00GHz CPU and 1.96GB of RAM.

Table 1: Summary of the benchmark datasets: n—# of samples; p—# of variates; c—# of classes; β—parameter in the Gaussian kernel K.

        Iris   Oil    Letter  Segmen  NIST
  n     150    1000   1978    2310    3823
  p     4      12     16      19      256
  c     3      2      10      7       10
  β     2      0.2    100     10000   1000

Table 2: Results on the iris and oil flow datasets: γ_1—largest eigenvalue of Q; λ̂—ML estimate of λ; γ_1(0)—initial value of γ_1 in the EM iteration; λ(0)—initial value of λ in the EM iteration.

             γ_1     γ_1(0)  λ̂       λ(0)
  Iris       0.2799  3.7321  0.0029  0.2178
  Oil flow   0.0437  0.1004  0.0009  0.0145

Table 3: CPU time (s) of running PPCO with the ML estimate and EM iteration.

        Iris    Oil     Letter   Segmen   NIST
  ML    0.0469  7.6250  59.8438  70.9375  419.0469
  EM    0.2344  4.3906  17.7656  23.1875  65.6563

5.1 Convergence Analysis

Our first experimental analysis is based on two data sets: the iris data set and the multi-phase oil flow data set studied by Bishop and James (1993), which consists of 12 features and three classes, stratified, annular and homogeneous, corresponding to the phases of flow in an oil pipeline.

We adopt the Gaussian kernel K(x_i, x_j) = exp(−(1/n)‖x_i − x_j‖²/β) with β = 2.0 for the iris data and β = 0.2 for the oil flow data. For PPCO, we implement the direct ML estimation and the EM algorithm. The EM algorithm is initialized using conventional linear PCA and the maximum number of iterations is 100. Figure 1 depicts the two-dimensional (q = 2) principal coordinates of the two data sets using these two algorithms. We can see that the two algorithms give essentially the same results up to a rotation transformation.

5.2 Estimating the Largest Eigenvalue of Q

Let q = 1. Since S = ±1, we always have ŷ′ŷ = γ_1 − λ̂, where ŷ (n × 1) and λ̂ are the ML estimates of PPCO, and γ_1 is the largest eigenvalue of Q = HKH (see Theorem 1). Note that we can compute the largest eigenvalue of Q via γ_1 = ŷ′ŷ + λ̂ and its corresponding principal eigenvector as ψ_1 = ŷ/√(γ_1 − λ̂). Since the solution of the EM algorithm given in (10) and (11) converges to that of the ML estimate, it can be used to iteratively estimate γ_1 and ψ_1 by γ_1(t) = y′(t)y(t) + λ(t) and ψ_1(t) = y(t)/√(γ_1(t) − λ(t)). Interestingly, this EM algorithm bears a resemblance to the power method (Golub and Van Loan, 1996) (see also p. 462 in Anderson (1984)). Table 2 lists the value of γ_1 and the ML estimate of λ on the iris and oil flow datasets. We also report the EM estimates of λ and y, and hence that of γ_1. From Figure 2, we see that γ_1(t) converges to the true value, while λ(t) converges to the ML estimate. Moreover, the convergence of λ takes only several iterations.
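A hedged sketch of this eigenvalue estimator (our illustration, reusing updates (10) and (11) with q = 1; the function name, initialization, and fixed iteration count are hypothetical):

```python
import numpy as np

def largest_eigenvalue_em(Q, y0, lam0=0.1, T=100):
    # Estimate gamma_1 and psi_1 of Q via the q = 1 EM iteration (10)-(11).
    n = Q.shape[0]
    y, lam = np.asarray(y0, dtype=float).copy(), lam0
    for _ in range(T):
        sigma = lam + y @ y                    # Sigma(t) is a scalar when q = 1
        Qy = Q @ y
        y_new = Qy / (lam + (y @ Qy) / sigma)  # update (10)
        lam = (np.trace(Q) - (y_new @ Qy) / sigma) / (n - 1)  # update (11)
        y = y_new
    gamma1 = y @ y + lam                       # gamma_1(t) = y(t)'y(t) + lam(t)
    psi1 = y / np.sqrt(gamma1 - lam)           # psi_1(t) = y(t) / sqrt(gamma_1(t) - lam(t))
    return gamma1, psi1
```

Like the power method, each pass is dominated by the single matrix-vector product Qy.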
5.3 Performance Analysis

In this subsection we further investigate the performance of the EM algorithm for PPCO with the iris and oil flow data as well as three publicly available datasets from the UCI machine learning repository (the NIST optical handwritten digit data, the letter data and the image segmentation data). The NIST dataset contains the handwritten digits 0–9, where each instance consists of 16×16 pixels and the digits are treated as classes. The letter dataset consists of images of the letters "A" to "Z." In our experiments we selected the first 10 letters with 195, 199, 182, 207, 203, 210, 226, 196, 188 and 172 instances, respectively. The image segmentation data consists of seven types of images: "brickface," "sky," "foliage," "cement," "window," "path," and "grass." Table 1 gives a summary of these datasets.

In Table 3 we report the CPU time of implementing the direct ML estimate and the EM iteration of PCO. The results are based on q = 2 and T = 100 (the maximum number of iterations of the EM algorithm). The computational complexity of the ML estimate is O(n³), while that of the EM iteration is O(Tnq²). We thus see that the EM iteration is more efficient than the ML estimate for large values of n.

Given a kernel K, PKPCA with the ML estimate has the same computational complexity as PPCO with the ML estimate. In the general case it is not possible to devise an EM algorithm for PKPCA because the feature vectors f are not explicitly available. However, PKPCA directly estimates the principal components rather than the principal coordinates. When n is large, in order to reduce computation, we can implement PKPCA on a small-size dataset. This motivates us to devise an initialization method for the EM algorithm for PPCO in the case that both n and p are large.


Figure 1: Two-dimensional configurations of the iris data (left column) and the multi-phase oil flow data (right column): (a) conventional PCA; (b) PPCO with the direct ML estimate; (c) PPCO with the EM estimate.

6 Conclusion

In this paper we have studied normal latent variable models for PCO and KPCA. This has yielded algorithms that we refer to as PPCO and PKPCA. Moreover, we have further explored the duality between PCO and KPCA based on their probabilistic formulations. Our work demonstrates that PCO and KPCA can be derived within an ML estimation framework. These normal latent variable models are closely related to factor analysis. However, we impose some new constraints on the unknown parameter (factor loading) matrix in these models. In particular, we impose the constraint 1_n′Y = 0 on the parameter matrix Y in PPCO, and the constraint WW′ = I_q on the parameter matrix W in PKPCA. Under these constraints, we have shown that there is still a closed-form solution for the ML estimate of the parameter matrix.

Acknowledgement

Zhihua Zhang has been supported in part by the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), China. Michael Jordan acknowledges support from NSF Grant 0509559 and grants from Google and Microsoft Research.


Figure 2: The largest eigenvalue γ_1 and the noise variance λ vs. the iteration count for the iris data (left column) and the multi-phase oil flow data (right column): (a) the EM estimate of λ; (b) the EM estimate of γ_1.

References

Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis (Second ed.). New York: John Wiley & Sons.

Bartholomew, D. J. and M. Knott (1999). Latent Variable Models and Factor Analysis (Second ed.). London: Arnold.

Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580–593.

Borg, I. and P. Groenen (1997). Modern Multidimensional Scaling. New York: Springer-Verlag.

Golub, G. H. and C. F. Van Loan (1996). Matrix Computations (Third ed.). Baltimore: The Johns Hopkins University Press.

Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate data analysis. Biometrika 53, 315–328.

Groenen, P. J. F., R. Mathar, and W. J. Heiser (1995). The majorization approach to multidimensional scaling for Minkowski distance. Journal of Classification 12, 3–19.

Gupta, A. and D. Nagar (2000). Matrix Variate Distributions. Chapman & Hall/CRC.

Jolliffe, I. (2002). Principal Component Analysis (Second ed.). New York: Springer.

Magnus, J. R. and H. Neudecker (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). New York: John Wiley & Sons.

Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. New York: Academic Press.

Oh, M.-S. and A. E. Raftery (2001). Bayesian multidimensional scaling and choice of dimension. Journal of the American Statistical Association 96(455), 1031–1044.

Ramsay, J. O. (1982). Some statistical approaches to multidimensional scaling data. Journal of the Royal Statistical Society, Series A 145, 285–312.

Schölkopf, B., A. Smola, and K.-R. Müller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319.

Tipping, M. E. and C. M. Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622.

Williams, C. K. I. (2001). On a connection between kernel PCA and metric multidimensional scaling. In Advances in Neural Information Processing Systems 13, Cambridge, MA: MIT Press.
