Latent Variable Models for Dimensionality Reduction

Zhihua Zhang
College of Computer Science and Technology
Zhejiang University
Hangzhou, Zhejiang 310027, China
[email protected]

Michael I. Jordan
Departments of EECS and Statistics
University of California, Berkeley
Berkeley, CA 94720, USA
[email protected]

Appearing in Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS) 2009, Clearwater Beach, Florida, USA. Volume 5 of JMLR: W&CP 5. Copyright 2009 by the authors.

Abstract

Principal coordinate analysis (PCO), a dual of principal component analysis (PCA), is a classical method for exploratory data analysis. In this paper we provide a probabilistic interpretation of PCO. We show that this interpretation yields a maximum likelihood procedure for estimating the PCO parameters, and we also present an iterative expectation-maximization algorithm for obtaining the maximum likelihood estimates. Finally, we show that our framework yields a probabilistic formulation of kernel PCA.

1 Introduction

Multidimensional scaling (MDS) (Borg and Groenen, 1997) has been widely applied to data analysis and processing. Like principal component analysis (PCA) (Jolliffe, 2002), MDS is an important tool for dimensionality reduction and visualization. Given the (dis)similarities between pairs of objects, MDS is concerned with the problem of representing the objects as points in a (usually) Euclidean space so that the distances between the points match the original dissimilarities as closely as possible.

In terms of the techniques used to obtain the configurations of points, MDS methods can be categorized as metric or nonmetric. Metric MDS methods in turn include classical scaling and least squares scaling. Classical scaling is commonly called principal coordinate analysis (PCO). Considering that there exists a duality between PCO and PCA (Gower, 1966), and given our interest in the relationship between MDS and PCA, we prefer the term PCO in this paper to refer to classical scaling MDS.

In PCO, the original dissimilarity measure is required to be Euclidean. Equivalently, the inner product function that induces the dissimilarity is positive definite. This inner product function thus defines a similarity measure and can be referred to as a reproducing kernel. On this basis, Schölkopf et al. (1998) proposed kernel PCA (KPCA) as a nonlinear extension of PCA. There also exists a duality between PCO and KPCA (Williams, 2001), so a nonlinear version of PCO can be devised by using reproducing kernels as similarities.

Different MDS models make use of different techniques to model the configurations of points. For example, conventional PCO employs the spectral decomposition method, while metric least squares scaling uses an iterative majorization method (Borg and Groenen, 1997). Statistical approaches to MDS have also been devised, in which maximum likelihood methods are used to estimate the configurations of points (Ramsay, 1982; Groenen et al., 1995). In addition, Oh and Raftery (2001) proposed a Bayesian method for estimating the configuration. In these statistical treatments, the dissimilarities are generally modeled as following a truncated normal or log-normal distribution.

However, these statistical approaches are not appropriate for PCO. Since the dissimilarities in PCO are Euclidean, the metric inequality (i.e., the triangle inequality) must be satisfied, and for dissimilarities generated from a truncated normal or log-normal distribution the metric inequality is no longer guaranteed. Thus, a probabilistic formulation has been absent for PCO. In the current paper we attempt to address this gap by showing that PCO does indeed fit into a maximum likelihood estimation framework.

Tipping and Bishop (1999) proposed a probabilistic PCA (PPCA) model in which PCA is reformulated as a normal latent variable model that is closely related to factor analysis (FA) (Bartholomew and Knott, 1999).

Owing to the duality between PCO and PCA, it would be desirable to develop such a latent variable model for PCO so that we have a probabilistic formulation of PCO.

Conventional PCO (or KPCA) does not necessarily require that the original objects (called feature vectors in the literature) are explicitly available. Instead, it only requires that the (dis)similarities are given. However, the original objects themselves are given in PPCA and FA. As a result, it seems difficult to follow the approach taken for PCA, in which a connection to FA is exploited, in developing a latent variable model for PCO.

Recall that since in PCO or KPCA the dissimilarities (or similarities) are Euclidean (or positive definite), there exists a set of feature vectors such that the Euclidean distances (or the inner products) between them are exactly equal to the dissimilarities (or similarities). This motivates us to treat the feature vectors as virtual observations. We will show how this treatment allows us to specify normal latent variable models for PCO as well as KPCA. We refer to these interpretations as probabilistic PCO (PPCO) and probabilistic KPCA (PKPCA), respectively.

In PPCO the principal coordinates (the configurations in a low-dimensional Euclidean space) are treated as the model parameters. As a result, we can use maximum likelihood (ML) to estimate the principal coordinates. We shall see that the estimated results agree with those obtained via the spectral decomposition method. Moreover, the latent variable idea allows us to use the expectation-maximization (EM) algorithm for PPCO. Importantly, without the explicit usage of the virtual observations themselves, we can still implement the ML and EM procedures using only the available (dis)similarities.

In PKPCA the principal components (the orthonormal bases spanning the low-dimensional subspace) are treated as the model parameters, which are also estimated by ML. Our model differs from the PPCA model of Tipping and Bishop (1999) in that they use non-orthonormal principal components (factor loadings) instead of orthonormal principal components. Although this difference seems minor, it has important consequences: the solution of our model agrees with that of conventional PCA, but the solution of the PPCA of Tipping and Bishop (1999) does not.

The remainder of the paper is organized as follows. Section 2 describes the original formulations of PCO and KPCA. In Section 3 we propose a normal latent variable model for PCO; a direct ML method and an EM algorithm for parameter estimation are also devised. In Section 4 we propose PKPCA and establish the duality between PPCO and PKPCA. Experimental studies and concluding remarks are given in Sections 5 and 6, respectively. All proofs and derivations are omitted; they are presented in a long version of this paper.

2 PCO and KPCA

Suppose we are given a set of dissimilarities, {δ_ij² : i, j = 1,...,n}, between n objects. Let ∆ = [δ_ij²] be the n × n dissimilarity matrix. We assume that ∆ is Euclidean. This implies that there exists a set of n points in a Euclidean space, denoted by {f_i : i = 1,...,n}, such that

$$\delta_{ij}^2 = (f_i - f_j)'(f_i - f_j) = f_i'f_i + f_j'f_j - 2 f_i'f_j. \qquad (1)$$

We thus have

$$-\tfrac{1}{2} H \Delta H = H F F' H,$$

where F = [f_1,...,f_n]′ and H = I_n − (1/n)1_n 1_n′ (a centering matrix). Here and later, I_n is the n × n identity matrix and 1_n is the n × 1 vector of ones. Thus, the assumption that ∆ is Euclidean is equivalent to the positive semidefiniteness of −(1/2)H∆H.

We now establish a connection of ∆ to the theory of reproducing kernels. Starting with a set of n p-dimensional input vectors, {x_i : i = 1,...,n} ⊂ X ⊂ R^p, we define a positive definite function K : X × X → R as a kernel function. Three kernel functions are widely used in practice:

(a) Linear kernel: K(x_i, x_j) = x_i′x_j;

(b) Gaussian kernel: K(x_i, x_j) = exp(−‖x_i − x_j‖²/θ) with θ > 0;

(c) Polynomial kernel: K(x_i, x_j) = (x_i′x_j + 1)^m.

From the kernel function and the data, we obtain an n × n kernel matrix K = [k_ij] where k_ij = K(x_i, x_j). Since the kernel matrix K is positive semidefinite (p.s.d.), it can be regarded as an inner product matrix and induces a Euclidean matrix, which is then defined as the aforementioned ∆. It is readily seen that

$$\delta_{ij}^2 = k_{ii} + k_{jj} - 2 k_{ij}. \qquad (2)$$

Comparing (2) with (1), we equate k_ij with the inner product between f_i and f_j, i.e., k_ij = f_i′f_j and K = FF′. The vector f_i is referred to as the feature vector corresponding to x_i. Thus, the kernel technology provides us with an approach to the construction of inner product matrices and dissimilarity (distance) matrices.
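To make the construction in (2) concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper, whose experiments were implemented in Matlab; the function name and synthetic data are ours). It builds a Gaussian kernel matrix K on random inputs, forms the squared dissimilarities of (2), and checks numerically that −(1/2)H∆H = HKH:

```python
import numpy as np

def gaussian_kernel(X, theta):
    # K with entries k_ij = exp(-||x_i - x_j||^2 / theta)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / theta)

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))              # synthetic input vectors (illustrative only)

K = gaussian_kernel(X, theta=2.0)
Delta = np.diag(K)[:, None] + np.diag(K)[None, :] - 2.0 * K   # eq. (2)

H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
assert np.allclose(-0.5 * H @ Delta @ H, H @ K @ H)           # both routes give the same centered matrix
```

The centered matrix HKH computed here is the matrix denoted Q in what follows.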

PCO (or classical multidimensional scaling) was originally used to construct the coordinates for the points, {y_i : i = 1,...,n}, in a Euclidean space, such that

$$(y_i - y_j)'(y_i - y_j) = d_{ij}^2 \approx \delta_{ij}^2 = (f_i - f_j)'(f_i - f_j). \qquad (3)$$

The focus of this paper is dimensionality reduction: letting y_i ∈ R^q and f_i ∈ R^r, q should be less than r. Assuming that the centroid of the y_i is at the origin of R^q, from (3) we obtain −(1/2)H∆H ≈ YY′ where Y = [y_1,...,y_n]′.

We now consider using a kernel function in the PCO setting. If we use a linear kernel, we obtain standard PCO in which f_i = x_i. More generally, we obtain a nonlinear version of PCO which is based on a nonlinear mapping from x_i to f_i.

Given either ∆ or K, we henceforth denote Q = −(1/2)H∆H or Q = HKH. PCO attempts to find the configurations of the y_i by applying eigenvalue decomposition to Q. Let Ψ_q be an n × q matrix whose columns are the top q eigenvectors of Q and Γ_q be a q × q diagonal matrix whose diagonal elements are the top q eigenvalues of Q. The solution of PCO is then given by Y = Ψ_q Γ_q^{1/2} S, where S is an arbitrary orthonormal matrix. It is worth noting that the solution satisfies Y′1_n = 0. In general, we set S = I_q so that Y′Y = Γ_q is diagonal.

In order to explore the nonlinear structure of the x_i in a low-dimensional representation, KPCA (Schölkopf et al., 1998) uses the sample covariance matrix of the f_i, which is given by

$$R = \frac{1}{n}\sum_{i=1}^{n}(f_i - \bar{f})(f_i - \bar{f})' = \frac{1}{n} F'HF = \frac{1}{n} F'HHF$$

with f̄ = (1/n) Σ_{i=1}^n f_i. The goal of KPCA is to find the first q principal components. Typically, F is not explicitly available. Since F′HHF has the same nonzero eigenvalues as HFF′H = HKH = Q, KPCA works with Q instead. That is, the first q principal components constitute the matrix S Γ_q^{−1/2} Ψ_q′ HF. The q-dimensional configuration of F is then HQHΨ_q Γ_q^{−1/2} S = Ψ_q Γ_q^{1/2} S, which is computed without the explicit use of F. This shows the duality between KPCA and PCO.¹

¹Gower (1966) referred to −(1/2)H∆H or HKH as a Q matrix and the covariance matrix (1/n)F′HF as an R matrix. The corresponding dimensionality reduction methods are called Q and R techniques, respectively. It is clear that PCO employs the Q technique while KPCA employs the R technique. Since the Q and R techniques are dual to each other, there also exists a duality between PCO and KPCA. The difference is that PCO directly computes the low-dimensional configurations, while KPCA computes the bases that span the low-dimensional subspace; the low-dimensional configurations are then the projections of the feature vectors onto this subspace.
3 Probabilistic PCO

Before presenting our probabilistic approach to PCO, we first introduce some notation. We use A⁺ for the Moore-Penrose inverse of A. The Kronecker product of A and B is denoted by A ⊗ B. We employ the notation of Gupta and Nagar (2000) for matrix-variate distributions. Thus, for an s × t random matrix Z, Z ∼ N_{s,t}(M, A ⊗ B) means that Z follows a matrix-variate normal distribution with mean matrix M (s × t) and covariance matrix A ⊗ B, where A (s × s) and B (t × t) are p.s.d.

3.1 Normal Latent Variable Model

We attempt to develop PPCO through a normal latent variable model. However, it is not immediately clear how to derive PPCO from PPCA or FA because for PCO, apart from the Q matrix Q, the f_i and their dimension r are not explicitly available. In this case, we regard the f_i as virtual observations. Recall that PCO directly calculates the coordinates of the y_i. Thus we treat the y_i as parameters of the model that need to be estimated. We thus reformulate PCO as a latent variable model in matrix form:

$$F = Y W + 1_n u' + \Upsilon, \qquad (4)$$

where u is an r × 1 mean vector, W is a q × r latent matrix, and Υ is an n × r error matrix. Furthermore, we assume

$$W \sim N_{q,r}\big(0, (I_q \otimes I_r)/r\big), \qquad \Upsilon \sim N_{n,r}\big(0, (\lambda I_n \otimes I_r)/r\big), \qquad (5)$$

where λ > 0. Thus we get a PPCO model in which each row of Y is just the target coordinate associated with the corresponding row of F. Now the difficulty is that both r and F are usually unknown. Fortunately, we will see that a linear algebraic manipulation yields an estimation procedure for the unknown parameters, Y and λ, that does not explicitly depend on r and F.
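For intuition about the matrix-variate assumptions in (5), the following illustrative sketch (the dimensions n, q, r, the value of λ, and the choice of Y are all hypothetical, chosen by us) draws one realization of the virtual feature matrix F from model (4); note that N_{q,r}(0, (I_q ⊗ I_r)/r) simply means i.i.d. N(0, 1/r) entries:

```python
import numpy as np

rng = np.random.default_rng(1)
n, q, r, lam = 50, 2, 10, 0.1                     # hypothetical sizes and noise level

Y = rng.normal(size=(n, q))
Y -= Y.mean(axis=0)                               # enforce Y'1_n = 0, as assumed for PPCO
u = rng.normal(size=r)                            # r x 1 mean vector

W = rng.normal(size=(q, r)) / np.sqrt(r)          # W ~ N_{q,r}(0, (I_q (x) I_r)/r)
Ups = rng.normal(size=(n, r)) * np.sqrt(lam / r)  # Upsilon ~ N_{n,r}(0, (lam I_n (x) I_r)/r)

F = Y @ W + np.outer(np.ones(n), u) + Ups         # model (4): F = Y W + 1_n u' + Upsilon
```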


As described in Section 2, we assume that the columns of Y are centered to have mean 0, i.e., Y′1_n = 0. Premultiplying (4) by the centering matrix H, we obtain

$$H F = H Y W + H \Upsilon.$$

It is clearly seen that Y′H1_n = 0. Thus, we now treat HY as the low-dimensional representation of F. For notational simplicity, we still use Y and Υ to denote HY and HΥ. Thus, we have

$$H F = Y W + \Upsilon \qquad (6)$$

with Y′1_n = 0 and Υ ∼ N_{n,r}(0, λ(H ⊗ I_r)/r). Since H is singular, the distribution of Υ degenerates to a singular normal distribution (Mardia et al., 1979).

3.2 Maximum Likelihood Estimates

It follows readily from (6) that

$$H F \mid W \sim N_{n,r}\big(Y W, \ \lambda (H \otimes I_r)/r\big)$$

and so, by integrating out W, we have

$$H F \sim N_{n,r}\big(0, \ (Y Y' + \lambda H) \otimes I_r / r\big). \qquad (7)$$

Using Bayes' rule, we can compute the conditional distribution of W given F as

$$W \mid F \sim N_{q,r}\big(\Sigma^{-1} Y' H F, \ \lambda (\Sigma^{-1} \otimes I_r)/r\big), \qquad (8)$$

where Σ = λI_q + Y′Y. Now HF also follows a singular normal distribution and its p.d.f. is given by

$$\frac{(2\pi)^{\frac{(1-n)r}{2}} \, r^{\frac{nr}{2}}}{\big(\prod_{i=1}^{n-1}\eta_i\big)^{r/2}} \exp\Big\{-\frac{r}{2}\,\mathrm{tr}\big[(Y Y' + \lambda H)^{+} H F F' H\big]\Big\},$$

where η_i, i = 1,...,n−1, are the nonzero eigenvalues of YY′ + λH. Note that we use the fact that the rank of YY′ + λH (= H(YY′ + λI_n)H) is equal to n−1. This is because YY′ + λI_n is nonsingular and the rank of H is n−1.

Treating Y and λ as unknown parameters, we now consider their maximum likelihood (ML) estimates. The corresponding log-likelihood function is given by

$$\varphi = -\frac{r}{2}\sum_{i=1}^{n-1}\log\eta_i - \frac{r}{2}\,\mathrm{tr}\big[(Y Y' + \lambda H)^{+} H F F' H\big] - \frac{(n-1)r}{2}\log(2\pi) + \frac{nr}{2}\log r.$$

It is easily seen that the maximization of ϕ with respect to (w.r.t.) Y and λ is equivalent to the minimization of

$$f(Y,\lambda) = \sum_{i=1}^{n-1}\log\eta_i + \mathrm{tr}\big[(Y Y' + \lambda H)^{+} Q\big] \qquad (9)$$

w.r.t. Y and λ. Note that f is independent of both F and r. Thus, given either K or ∆, we can estimate Y and λ without the explicit usage of F and r. In particular, we have

Theorem 1  Let Q = HKH or Q = −(1/2)H∆H be of rank d ≤ n−1, and let f be the real-valued loss function defined by (9), where Y ∈ R^{n×q} subject to q ≤ d and Y′1_n = 0. Then the minimum of f w.r.t. Y and λ is obtained when

$$\hat{Y} = \Psi_q (\Gamma_q - \hat{\lambda} I_q)^{1/2} S \quad \text{and} \quad \hat{\lambda} = \frac{1}{n-q-1}\sum_{j=q+1}^{n-1}\gamma_j,$$

where γ_1 ≥ ··· ≥ γ_q ≥ ··· ≥ γ_{n−1} are the eigenvalues of Q, S is an arbitrary q × q orthonormal matrix, Γ_q is a q × q diagonal matrix containing the first q principal (largest) eigenvalues γ_i, and Ψ_q is an n × q orthogonal matrix in which the q column vectors are the principal eigenvectors corresponding to Γ_q.

Since Q1_n = 0, we have Ŷ′1_n = 0. Furthermore, we have that Ŷ′Ŷ = Γ_q − λ̂I_q when S = I_q. Thus, the ML estimate of Y agrees with that obtained via the eigenvalue decomposition method in Section 2. Moreover, if λ̂ → 0, then the methods are entirely equivalent. It is worth pointing out that if γ_q > γ_{q+1}, (Ŷ, λ̂) is a strict local minimum.
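Theorem 1 can be turned into a few lines of code. The following NumPy sketch (our illustration of the closed-form formulas with S = I_q; the function name is ours, and this is not the authors' Matlab implementation) computes λ̂ and Ŷ from the eigendecomposition of Q:

```python
import numpy as np

def ppco_ml(Q, q):
    # Direct ML estimates of Theorem 1: lam_hat and Y_hat = Psi_q (Gamma_q - lam_hat I_q)^{1/2}.
    n = Q.shape[0]
    evals, evecs = np.linalg.eigh(Q)
    order = np.argsort(evals)[::-1]
    gamma = evals[order]                  # gamma_1 >= gamma_2 >= ... >= gamma_n
    psi = evecs[:, order]
    lam_hat = gamma[q:n-1].mean()         # average of gamma_{q+1}, ..., gamma_{n-1}
    Y_hat = psi[:, :q] * np.sqrt(np.maximum(gamma[:q] - lam_hat, 0.0))
    return Y_hat, lam_hat
```

As stated after the theorem, Ŷ′Ŷ = Γ_q − λ̂I_q, and as λ̂ → 0 the estimate approaches the conventional PCO configuration Ψ_q Γ_q^{1/2}.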

3.3 EM Algorithm

Considering W as the missing data, {W, F} as the complete data, and Y and λ as the model parameters, we devise an EM algorithm for our PPCO model. Given the tth estimates Y(t) and λ(t) of Y and λ, the EM algorithm updates Y and λ by the following procedure:

$$Y(t{+}1) = Q\,Y(t)\,\big[\lambda(t)\, I_q + \Sigma^{-1}(t)\,Y'(t)\,Q\,Y(t)\big]^{-1}, \qquad (10)$$

$$\lambda(t{+}1) = \frac{1}{n-1}\Big[\mathrm{tr}(Q) - \mathrm{tr}\big(Y(t{+}1)\,\Sigma^{-1}(t)\,Y'(t)\,Q\big)\Big], \qquad (11)$$

where Σ(t) = λ(t)I_q + Y′(t)Y(t). The derivation of the EM algorithm is omitted. We see that the iterative equations (10) and (11) also work with only Q. Note that 1_n′Y(t) = 0 is always satisfied in the EM algorithm. It can be proven that if λ(0) > 0, the λ calculated via (11) is always positive. Moreover, the EM iterates of Y and λ converge to the direct ML estimates.

Compared with the direct ML estimate, the EM approach is more efficient when n is large, because the former involves the spectral decomposition of an n × n matrix, whereas the latter involves the inversion of q × q matrices. Thus, the EM algorithm provides an efficient numerical method for PCO.
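The updates (10) and (11) touch only Q and q × q matrices, so they are straightforward to implement. A minimal NumPy sketch (our illustration; the function name, initialization, and fixed iteration count are hypothetical, and in practice one can initialize from conventional PCA as done in Section 5):

```python
import numpy as np

def ppco_em(Q, q, Y0, lam0=0.1, T=100):
    # EM iteration (10)-(11) for PPCO; only Q is needed, never F or r.
    n = Q.shape[0]
    Y, lam = Y0.copy(), lam0
    for _ in range(T):
        Sigma = lam * np.eye(q) + Y.T @ Y                       # Sigma(t) = lam(t) I_q + Y(t)'Y(t)
        QY = Q @ Y
        M = lam * np.eye(q) + np.linalg.solve(Sigma, Y.T @ QY)  # bracketed q x q matrix in (10)
        Y_new = np.linalg.solve(M.T, QY.T).T                    # (10): Y(t+1) = Q Y(t) M^{-1}
        lam = (np.trace(Q) - np.trace(Y_new @ np.linalg.solve(Sigma, Y.T @ Q))) / (n - 1)  # (11)
        Y = Y_new
    return Y, lam
```

Only q × q linear systems are solved at each sweep, and the iterates converge to the direct ML estimates given above.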


4 Probabilistic KPCA

In this section we present our probabilistic KPCA (PKPCA) model and explore its relationship with PPCO. Our point of departure is Tipping and Bishop (1999), who proposed PPCA based on a latent variable model. We extend this approach to KPCA. In particular, PKPCA expresses the feature vector f ∈ R^r as a linear combination of q principal components (say, the w_j) plus noise (ε):

$$f = \sum_{j=1}^{q} w_j y_j + u + \epsilon = W' y + u + \epsilon, \qquad \epsilon \sim N(0, \lambda I_r), \quad y \sim N(0, I_q),$$

where y = (y_1,...,y_q)′ ∈ R^q with q < min{p, r} is the latent vector and W′ = [w_1,...,w_q] is an r × q basis matrix that relates f and y. Given an input matrix X = [x_1,...,x_n]′ and its corresponding feature matrix F, we express PKPCA in matrix form as

$$F = Y W + 1_n u' + \Upsilon, \qquad (12)$$

$$Y \sim N_{n,q}(0, \ I_n \otimes I_q), \qquad \Upsilon \sim N_{n,r}\big(0, \ \lambda (I_n \otimes I_r)\big).$$

In this model, Y is treated as a latent matrix and W as a matrix of principal components such that WW′ = I_q. The q-dimensional configuration of f (or x) is obtained by using the expectation of y conditioned on f and W, i.e., E(y | f, W). However, in PPCO, W is treated as a latent matrix and Y is just the q-dimensional representation of X. This provides a perspective on the duality between PPCA and PPCO. Note that in the PPCA of Tipping and Bishop (1999), the constraint WW′ = I_q is not imposed. We will see shortly that this constraint is indeed necessary.

In the linear kernel case, i.e., f = x, the ML estimate of W is U_q′ where U_q is the matrix formed by the top q eigenvectors of R. We thus obtain the configuration of x in the q-dimensional space as E(y | x, W) = (WW′ + λI_q)^{−1}W(x − u) = (1/(1+λ)) U_q′(x − u). When λ → 0, the solution of PPCA is the same as that of conventional PCA. In the PPCA of Tipping and Bishop (1999), however, the ML estimate of W is (Γ_q − λI_q)^{1/2} U_q′ where Γ_q is the diagonal matrix of the q largest eigenvalues of nR. Hence, E(y | x, W) = (WW′ + λI_q)^{−1}W(x − u) = Γ_q^{−1}(Γ_q − λI_q)^{1/2} U_q′(x − u), which does not converge to the conventional PCA as λ → 0.

The ML estimate of u is û = (1/n) Σ_{i=1}^n f_i. After substituting û for u into the likelihood, the resulting log-likelihood is referred to as the log concentrated likelihood (Magnus and Neudecker, 1999), which is used for the ML estimation of W and λ. Typically, the feature vectors f are assumed unavailable. In this case, the ML estimate of W is given by Ŵ = S Γ_q^{−1/2} Ψ_q′ HF, where Γ_q is the diagonal matrix of the q largest eigenvalues of Q, Ψ_q is the corresponding eigenvector matrix of Q, and S is a q × q orthonormal matrix. We omit the derivation. As we can see, Ŵ does not depend on λ and r but on the feature matrix F. Fortunately, since [y | f, W] ∼ N_q(Σ^{−1}W(f − u), λΣ^{−1}) where Σ = WW′ + λI_q = (1+λ)I_q, we are able to compute the q-dimensional configuration of f via the kernel trick. In particular, we have

$$E(y \mid f, W) = \frac{1}{1+\hat{\lambda}}\,\Gamma_q^{-1/2}\Psi_q' H F (f - \hat{u}) = \frac{1}{1+\hat{\lambda}}\,\Gamma_q^{-1/2}\Psi_q' H F \Big(f - \frac{1}{n}F'1_n\Big) = \frac{1}{1+\hat{\lambda}}\,\Gamma_q^{-1/2}\Psi_q'\Big(k - \frac{1}{n}K 1_n\Big),$$

where k = (K(x, x_1),...,K(x, x_n))′. In matrix form, we have

$$E(Y \mid F, W) = \frac{1}{1+\hat{\lambda}}\,Q \Psi_q \Gamma_q^{-1/2} = \frac{1}{1+\hat{\lambda}}\,\Psi_q \Gamma_q^{1/2}.$$

This shows that when λ̂ → 0, the solution of PKPCA is the same as that of PPCO as well as the conventional KPCA and PCO. Since 1/(1+λ̂) is a constant, we use (1+λ̂)E(y | f, W) as the low-dimensional configuration of f. As a result, the solution of PKPCA fully agrees with that of the conventional KPCA as well as the conventional PCO. Moreover, we can ignore the ML estimate of λ for our purpose of dimensionality reduction. Unfortunately, it is not feasible to obtain an EM algorithm for PKPCA when the feature vectors are not explicitly available.
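As an illustration of the kernel trick just described (a sketch under the notation above, not the authors' code; the function name is ours), the quantity (1+λ̂)E(y | f, W) = Γ_q^{−1/2}Ψ_q′(k − (1/n)K1_n) for a new input x can be computed from kernel evaluations alone:

```python
import numpy as np

def pkpca_project(K, k_new, q):
    # (1 + lam_hat) * E(y | f, W) for a new point, using only kernel values.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Q = H @ K @ H
    evals, evecs = np.linalg.eigh(Q)
    idx = np.argsort(evals)[::-1][:q]
    gamma_q, psi_q = evals[idx], evecs[:, idx]
    centered = k_new - K.mean(axis=1)               # k - (1/n) K 1_n
    return (psi_q.T @ centered) / np.sqrt(gamma_q)  # Gamma_q^{-1/2} Psi_q' (k - (1/n) K 1_n)
```

Applying this to the training points themselves (k_new = K[:, i]) reproduces the rows of Ψ_q Γ_q^{1/2}, i.e., (1+λ̂)E(Y | F, W), so PKPCA, PPCO, KPCA, and PCO all deliver the same configuration.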

5 Experiments

In this paper our principal focus has been to provide a probabilistic perspective from which to view PCO and KPCA. Although the direct ML estimation approaches to these models give the same solutions as their conventional counterparts, our analysis has also provided an EM algorithm for PPCO, and it is of interest to compare the performance of the EM algorithm with the direct ML method.

As we saw in Section 3.3, the EM algorithm is more efficient than the ML method when n is large. Moreover, the solution of the EM algorithm converges to that of the ML estimate. As is well known, EM algorithms rely on initial values; we use conventional PCA to initialize the EM algorithm for our PPCO. All algorithms have been implemented in Matlab on a Pentium 4 computer with a 2.00GHz CPU and 1.96GB of RAM.

Table 1: Summary of the benchmark datasets: n—# of samples; p—# of variates; c—# of classes; β—parameter in the Gaussian kernel K.

        Iris   Oil    Letter  Segmen  NIST
  n     150    1000   1978    2310    3823
  p     4      12     16      19      256
  c     3      2      10      7       10
  β     2      0.2    100     10000   1000

Table 2: Results on the iris and oil flow datasets: γ_1—largest eigenvalue of Q; λ̂—ML estimate of λ; γ_1(0)—initial value of γ_1 in the EM iteration; λ(0)—initial value of λ in the EM iteration.

             γ_1     γ_1(0)  λ̂       λ(0)
  Iris       0.2799  3.7321  0.0029  0.2178
  Oil flow   0.0437  0.1004  0.0009  0.0145

Table 3: CPU time (s) of running PPCO with the ML estimate and EM iteration.

        Iris    Oil     Letter   Segmen   NIST
  ML    0.0469  7.6250  59.8438  70.9375  419.0469
  EM    0.2344  4.3906  17.7656  23.1875  65.6563

5.1 Convergence Analysis

Our first experimental analysis is based on two data sets: the iris data set and the multi-phase oil flow data set studied by Bishop and James (1993), which consists of 12 features and three classes, stratified, annular and homogeneous, corresponding to the phases of flow in an oil pipeline.

We adopt the Gaussian kernel K(x_i, x_j) = exp(−(1/n)‖x_i − x_j‖²/β) with β = 2.0 for the iris data and β = 0.2 for the oil flow data. For PPCO, we implement the direct ML estimation and the EM algorithm. The EM algorithm is initialized using conventional linear PCA and the maximum number of iterations is 100. Figure 1 depicts the two-dimensional (q = 2) principal coordinates of the two data sets using these two algorithms. We can see that the two algorithms give essentially the same results up to a rotation transformation.

5.2 Estimating the Largest Eigenvalue of Q

Let q = 1. Since S = ±1, we always have ŷ′ŷ = γ_1 − λ̂, where ŷ (n × 1) and λ̂ are the ML estimates of PPCO, and γ_1 is the largest eigenvalue of Q = HKH (see Theorem 1). Note that we can compute the largest eigenvalue of Q via γ_1 = ŷ′ŷ + λ̂ and its corresponding principal eigenvector as ψ_1 = ŷ/√(γ_1 − λ̂). Since the solution of the EM algorithm given in (10) and (11) converges to that of the ML estimate, it can be used to iteratively estimate γ_1 and ψ_1 by γ_1(t) = y′(t)y(t) + λ(t) and ψ_1(t) = y(t)/√(γ_1(t) − λ(t)). Interestingly, this EM algorithm bears a resemblance to the power method (Golub and Van Loan, 1996) (see also p. 462 in Anderson (1984)). Table 2 lists the value of γ_1 and the ML estimate of λ on the iris and oil flow datasets. We also report the EM estimates of λ and y, and hence that of γ_1. From Figure 2, we see that γ_1(t) converges to the true value, while λ(t) converges to the ML estimate. Moreover, the convergence of λ takes only several iterations.
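A hedged sketch of this eigenvalue estimator (our illustration, reusing updates (10) and (11) with q = 1; the function name, initialization, and fixed iteration count are hypothetical):

```python
import numpy as np

def largest_eigenvalue_em(Q, y0, lam0=0.1, T=100):
    # Estimate gamma_1 and psi_1 of Q via the q = 1 EM iteration (10)-(11).
    n = Q.shape[0]
    y, lam = np.asarray(y0, dtype=float).copy(), lam0
    for _ in range(T):
        sigma = lam + y @ y                    # Sigma(t) is a scalar when q = 1
        Qy = Q @ y
        y_new = Qy / (lam + (y @ Qy) / sigma)  # update (10)
        lam = (np.trace(Q) - (y_new @ Qy) / sigma) / (n - 1)  # update (11)
        y = y_new
    gamma1 = y @ y + lam                       # gamma_1(t) = y(t)'y(t) + lam(t)
    psi1 = y / np.sqrt(gamma1 - lam)           # psi_1(t) = y(t) / sqrt(gamma_1(t) - lam(t))
    return gamma1, psi1
```

Like the power method, each pass is dominated by the single matrix-vector product Qy.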
5.3 Performance Analysis

In this subsection we further investigate the performance of the EM algorithm for PPCO with the iris and oil flow data as well as three publicly available datasets from the UCI machine learning repository (the NIST optical handwritten digit data, the letter data and the image segmentation data). The NIST dataset contains the handwritten digits 0–9, where each instance consists of 16×16 pixels and the digits are treated as classes. The letter dataset consists of images of the letters "A" to "Z." In our experiments we selected the first 10 letters with 195, 199, 182, 207, 203, 210, 226, 196, 188 and 172 instances, respectively. The image segmentation data consists of seven types of images: "brickface," "sky," "foliage," "cement," "window," "path," and "grass." Table 1 gives a summary of these datasets.

In Table 3 we report the CPU time of implementing the direct ML estimate and the EM iteration of PCO. The results are based on q = 2 and T = 100 (the maximum number of iterations of the EM algorithm). The computational complexity of the ML estimate is O(n³), while that of the EM iteration is O(Tnq²). We thus see that the EM iteration is more efficient than the ML estimate for large values of n.

Given a kernel K, PKPCA with the ML estimate has the same computational complexity as PPCO with the ML estimate. In the general case it is not possible to devise an EM algorithm for PKPCA because the feature vectors f are not explicitly available. However, PKPCA directly estimates the principal components rather than the principal coordinates. When n is large, in order to reduce computation, we can implement PKPCA on a small-size dataset. This motivates us to devise an initialization method for the EM algorithm for PPCO in the case that both n and p are large.


Figure 1: Two-dimensional configurations of the iris data (left column) and the multi-phase oil flow data (right column): (a) conventional PCA; (b) PPCO with the direct ML estimate; (c) PPCO with the EM estimate.

6 Conclusion

In this paper we have studied normal latent variable models for PCO and KPCA. This has yielded algorithms that we refer to as PPCO and PKPCA. Moreover, we have further explored the duality between PCO and KPCA based on their probabilistic formulations. Our work demonstrates that PCO and KPCA can be derived within an ML estimation framework. These normal latent variable models are closely related to factor analysis. However, we impose some new constraints on the unknown parameter (factor loading) matrix in these models. In particular, we impose the constraint 1_n′Y = 0 on the parameter matrix Y in PPCO, and the constraint WW′ = I_q on the parameter matrix W in PKPCA. Under these constraints, we have shown that there is still a closed-form solution for the ML estimate of the parameter matrix.

Acknowledgement

Zhihua Zhang has been supported in part by the Program for Changjiang Scholars and Innovative Research Team in University (IRT0652, PCSIRT), China. Michael Jordan acknowledges support from NSF Grant 0509559 and grants from Google and Microsoft Research.


Figure 2: The largest eigenvalue γ_1 and the noise variance λ vs. the iteration count for the iris data (left column) and the multi-phase oil flow data (right column): (a) the EM estimate of λ; (b) the EM estimate of γ_1.

References

Anderson, T. W. (1984). An Introduction to Multivariate Statistical Analysis (Second ed.). New York: John Wiley & Sons.

Bartholomew, D. J. and M. Knott (1999). Latent Variable Models and Factor Analysis (Second ed.). London: Arnold.

Bishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580–593.

Borg, I. and P. Groenen (1997). Modern Multidimensional Scaling. New York: Springer-Verlag.

Golub, G. H. and C. F. Van Loan (1996). Matrix Computations (Third ed.). Baltimore: The Johns Hopkins University Press.

Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate data analysis. Biometrika 53, 315–328.

Groenen, P. J. F., R. Mathar, and W. J. Heiser (1995). The majorization approach to multidimensional scaling for Minkowski distance. Journal of Classification 12, 3–19.

Gupta, A. and D. Nagar (2000). Matrix Variate Distributions. Chapman & Hall/CRC.

Jolliffe, I. (2002). Principal Component Analysis (Second ed.). New York: Springer.

Magnus, J. R. and H. Neudecker (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). New York: John Wiley & Sons.

Mardia, K. V., J. T. Kent, and J. M. Bibby (1979). Multivariate Analysis. New York: Academic Press.

Oh, M.-S. and A. E. Raftery (2001). Bayesian multidimensional scaling and choice of dimension. Journal of the American Statistical Association 96(455), 1031–1044.

Ramsay, J. O. (1982). Some statistical approaches to multidimensional scaling data. Journal of the Royal Statistical Society, Series A 145, 285–312.

Schölkopf, B., A. Smola, and K.-R. Müller (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 10, 1299–1319.

Tipping, M. E. and C. M. Bishop (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B 61(3), 611–622.

Williams, C. K. I. (2001). On a connection between kernel PCA and metric multidimensional scaling. In Advances in Neural Information Processing Systems 13, Cambridge, MA: MIT Press.
