Regression on Manifolds Using Kernel Dimension Reduction

Jens Nilsson @.. Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden Fei Sha @.. Computer Science Division, University of California, Berkeley, CA 94720 USA Michael I. Jordan @.. Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720 USA

Abstract

We study the problem of discovering a manifold that best preserves information relevant to a nonlinear regression. Solving this problem involves extending and uniting two threads of research. On the one hand, the literature on sufficient dimension reduction has focused on methods for finding the best linear subspace for nonlinear regression; we extend this to manifolds. On the other hand, the literature on manifold learning has focused on unsupervised dimensionality reduction; we extend this to the supervised setting. Our approach to solving the problem involves combining the machinery of kernel dimension reduction with Laplacian eigenmaps. Specifically, we optimize cross-covariance operators in kernel feature spaces that are induced by the normalized graph Laplacian. The result is a highly flexible method in which no strong assumptions are made on the regression function or on the distribution of the covariates. We illustrate our methodology on the analysis of global temperature data and image manifolds.

1. Introduction

Dimension reduction is an important theme in machine learning. Dimension reduction problems can be approached from the point of view of either unsupervised learning or supervised learning. A classical example of the former is principal component analysis (PCA), a method that projects data onto a linear manifold. More recent research has focused on nonlinear manifolds, and the long list of “manifold learning” algorithms—including LLE, Isomap, and Laplacian eigenmaps—provides sophisticated examples of unsupervised dimension reduction (Tenenbaum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, 2003; Donoho & Grimes, 2003). The supervised learning setting is somewhat more involved; one must make a choice of the family of manifolds to represent the covariate vectors, and one must also choose a family of functions to represent the regression surface (or classification boundary). Due in part to this additional complexity, most of the focus in the supervised setting has been on reduction to linear manifolds. This is true of classical linear discriminant analysis, and also of the large family of methods known as sufficient dimension reduction (SDR) (Li, 1991; Cook, 1998; Fukumizu et al., 2006). SDR aims to find a linear subspace S such that the response Y is conditionally independent of the covariate vector X, given the projection of X on S. This formulation in terms of conditional independence means that essentially no assumptions are made on the form of the regression from X to Y, but strong assumptions are made on the manifold representation of X (it is a linear manifold). Finally, note that the large literature on feature selection can also be conceived of as a projection onto a family of linear manifolds.

It is obviously of interest to consider methods that combine manifold learning and sufficient dimension reduction. From the point of view of manifold learning, we can readily imagine situations in which some form of side information is available to help guide the choice of manifold. Such side information might come from a human user in an exploratory data analysis setting. We can also envisage regression and classification problems in which nonlinear representations of the covariate vectors are natural on subject matter grounds. For example, we will consider a problem involving atmosphere temperature close to the Earth's surface in which a manifold representation of the covariate vectors is quite natural. We will also consider an example involving image manifolds. Other examples involving dynamical systems are readily envisaged; for example, a torus is often a natural representation of robot kinematics, and robot dynamics can be viewed as a regression on this manifold.

It is obviously an ambitious undertaking to attempt to identify a nonlinear manifold from a large family of nonlinear manifolds while making few assumptions regarding the regression surface. Nonetheless, it is an important undertaking, because the limitation to linear manifolds in SDR can be quite restrictive in practice, and the lack of a role for supervised data in manifold learning is limiting. As in much of the unsupervised manifold learning literature, we aim to make progress on this problem by focusing on visualization. Without attempting to define the problem formally, we attempt to study situations in which supervised manifold learning is natural and investigate the ability of algorithms to find useful visualizations.

The methodology that we develop combines techniques from SDR and unsupervised manifold learning. Specifically, we make use of ideas from kernel dimension reduction (KDR), a recently-developed approach to SDR that uses cross-covariance operators on reproducing kernel Hilbert spaces to measure quantities related to conditional independence. We will show that this approach combines naturally with representations of manifolds based on Laplacian eigenmaps.

The paper is organized as follows. In Section 2, we provide basic background on SDR, KDR and unsupervised manifold learning.
Section 3 presents our new manifold kernel dimension reduction (mKDR) method. In Section 4, we present experimental results evaluating mKDR on both synthetic and real data sets. In Section 5, we comment briefly on related work. Finally, we conclude and discuss future directions in Section 6.

2. Background

We begin by outlining the SDR problem. We then describe KDR, a specific methodology for SDR in which the linear subspace is characterized by cross-covariance operators on reproducing kernel Hilbert spaces. Finally, we also provide a brief overview of unsupervised manifold learning.

2.1. Sufficient Dimension Reduction

Let (X, B_X) and (Y, B_Y) be measurable spaces of covariates X and response variables Y respectively. SDR aims at finding a linear subspace S ⊂ X such that S contains as much predictive information regarding the response Y as the original covariate space. This desideratum is captured formally as a conditional independence assertion:

    Y ⊥⊥ X | BᵀX,    (1)

where B denotes the orthogonal projection of X onto S. The subspace S is referred to as a dimension reduction subspace. Dimension reduction subspaces are not unique. We can derive a unique "minimal" subspace, defined as the intersection of all dimension reduction subspaces S. This minimal subspace does not necessarily satisfy the conditional independence assertion; when it does, the subspace is referred to as the central subspace.

Many approaches have been developed to identify central subspaces (Li, 1991; Li, 1992; Cook & Li, 1991; Cook & Yin, 2001; Chiaromonte & Cook, 2002; Li et al., 2005). Many of these approaches are based on inverse regression; that is, the problem of estimating E[X | Y]. The intuition is that, if the forward regression model P(Y|X) is concentrated in a subspace of X, then E[X | Y] should lie in the same subspace. Moreover, the responses Y are typically of much lower dimension than the covariates X, and thus the subspace may be more readily identified via inverse regression. A difficulty with this approach, however, is that rather strong assumptions generally have to be imposed on the distribution of X (e.g., that the distribution be elliptical), and the methods can fail when these assumptions are not met. This issue is of particular importance in our setting, in which the focus is on capturing the structure underlying the distribution of X and in which such strong assumptions would be a significant drawback. We thus turn to a description of KDR, an approach to SDR that does not make such strong assumptions.

2.2. Kernel Dimension Reduction

The framework of kernel dimension reduction was first described in Fukumizu et al. (2004) and later refined in Fukumizu et al. (2006). The key idea of KDR is to map random variables X and Y to reproducing kernel Hilbert spaces (RKHS) and to characterize conditional independence using cross-covariance operators.

Let H_X be an RKHS of functions on X induced by the kernel function K_X(·, X) for X ∈ X. We define the space H_Y and the kernel function K_Y similarly. Define the cross-covariance between a pair of functions f ∈ H_X and g ∈ H_Y as follows:

    C_fg = E_XY[ (f(X) − E_X[f(X)]) (g(Y) − E_Y[g(Y)]) ].    (2)

It turns out that there exists a bilinear operator Σ_YX from H_X to H_Y such that C_fg = ⟨g, Σ_YX f⟩_{H_Y} for all functions f and g. Similarly, we can define covariance operators Σ_XX and Σ_YY. Finally, we can use these operators to define a class of conditional cross-covariance operators in the following way:

    Σ_{YY|X} = Σ_YY − Σ_YX Σ_XX^{-1} Σ_XY.    (3)

This definition assumes that Σ_XX is invertible; more general cases are discussed in Fukumizu et al. (2006).

Note that the conditional covariance operator Σ_{YY|X} of eq. (3) is "less" than the covariance operator Σ_YY, as the difference Σ_YX Σ_XX^{-1} Σ_XY is positive semidefinite. This agrees with the intuition that conditioning reduces uncertainty. We gain further insight by noting the similarity between eq. (3) and the covariance matrix of the conditional distribution P(Y|X) when X and Y are jointly Gaussian.

Finally, we are ready to link these cross-covariance operators to the central subspace. Consider any subspace S in X. Let us map this subspace to an RKHS H_S with a kernel function K_S. Furthermore, let us define the conditional covariance operator Σ_{YY|S} as if we were to regress Y on S. What is the relation between Σ_{YY|S} and Σ_{YY|X}?

Intuitively, Σ_{YY|S} would indicate a greater residual error in predicting Y unless S contains the central subspace. This intuition was formalized in Fukumizu et al. (2006):

Theorem 1. Suppose Z = BBᵀX ∈ S, where B ∈ R^{D×d} is a projection matrix such that BᵀB is an identity matrix. Further, assume Gaussian RBF kernels for K_X, K_Y and K_S. Then

• Σ_{YY|X} ≺ Σ_{YY|Z}, where ≺ stands for "less than or equal to" in some operator partial ordering;

• Σ_{YY|X} = Σ_{YY|Z} if and only if Y ⊥⊥ X | BᵀX, that is, S is the central subspace.

Note that this theorem does not impose assumptions on either the marginal distributions of X and Y or the conditional distribution P(Y|X). Note also that although we have stated the theorem using Gaussian kernels, other kernels are possible, and general conditions on these are discussed in Fukumizu et al. (2006).

Theorem 1 leads to an algorithm for estimating the central subspace, characterized by B, from empirical samples. Using the trace to order operators, B is the matrix that minimizes Tr[Σ̂_{YY|Z}], where Σ̂_{YY|Z} is the empirical version of the conditional covariance operator (3). Let {x_i, y_i}_{i=1}^N denote N samples from the joint distribution P(X, Y), and let K_Y ∈ R^{N×N} and K_Z ∈ R^{N×N} denote the Gram matrices computed over {y_i} and {z_i = Bᵀx_i}. Fukumizu et al. (2006) show that this minimization problem can be formulated in terms of K_Y and K_Z, so that B is the solution to:

    min Tr[ K_Y^c (K_Z^c + NεI)^{-1} ]   such that BᵀB = I,    (4)

where I is the identity matrix of appropriate dimensionality and ε is a regularization coefficient. The matrix K^c denotes the centered kernel matrix

    K^c = (I − (1/N) e eᵀ) K (I − (1/N) e eᵀ),    (5)

where e is a vector of all ones.
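As a concrete illustration (ours, not code from the paper), the empirical contrast of eqs. (4)-(5) can be evaluated in a few lines of NumPy. The Gaussian kernel widths sigma_z and sigma_y and the regularization value eps below are assumed settings for the sketch, not values prescribed by the authors.

```python
import numpy as np

def center(K):
    """Center a Gram matrix: K^c = (I - ee^T/N) K (I - ee^T/N), cf. eq. (5)."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def rbf_gram(Z, sigma):
    """Gaussian RBF Gram matrix over the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kdr_contrast(X, Y, B, sigma_z=1.0, sigma_y=1.0, eps=1e-3):
    """Empirical contrast Tr[K^c_Y (K^c_Z + N*eps*I)^{-1}] of eq. (4),
    evaluated for the projection Z = X B (B has orthonormal columns)."""
    N = X.shape[0]
    Kz = center(rbf_gram(X @ B, sigma_z))
    Ky = center(rbf_gram(np.asarray(Y).reshape(N, -1), sigma_y))
    return np.trace(np.linalg.solve(Kz + N * eps * np.eye(N), Ky))
```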
2.3. Manifold Learning

Many real-world data sets are generated with very few degrees of freedom; an example is pictures of the same object under different imaging conditions, such as rotation angle or translation. The extrinsic dimensionality of these images as data points in the Euclidean space spanned by the pixel intensities far exceeds the intrinsic dimensionality determined by those underlying factors. When these factors vary smoothly, the data points can be seen as lying on a low-dimensional submanifold embedded in the high-dimensional ambient space. Discovering such submanifolds and finding low-dimensional representations of them has been a focus of much recent work on unsupervised learning (Roweis & Saul, 2000; Tenenbaum et al., 2000; Belkin & Niyogi, 2003; Donoho & Grimes, 2003; Sha & Saul, 2005). A key theme of these learning algorithms is to preserve (local) topological and geometrical properties (for example, geodesics, proximity, symmetry, angles) while projecting data points to low-dimensional representations.

In this section, we briefly review the method of Laplacian eigenmaps, which is the manifold learning method on which we base our extension to supervised manifold learning. This method is based on the (normalized) graph Laplacian, which can be seen as a discrete approximation to the Laplace-Beltrami operator on continuous manifolds.

Let {x_i ∈ R^D}_{i=1}^N denote N data points sampled from a submanifold. We start by constructing a graph which has N vertices, one for each x_i. Vertex i and vertex j are linked by an edge if x_i and x_j are nearest neighbors. Let W be the matrix whose element W_ij = exp(−||x_i − x_j||²/σ²) if there is an edge between vertices i and j, and zero otherwise. Furthermore, let D be the diagonal matrix whose diagonal elements are the row sums of W; i.e., D_ii = Σ_j W_ij.

The aim of the Laplacian eigenmap procedure is to find an m < D dimensional embedding {u_i ∈ R^m} such that u_i and u_j are close if x_i and x_j are close. Formally, let v_m ∈ R^N be column vectors such that [u_1, u_2, ..., u_N] = [v_1, v_2, ..., v_m]ᵀ. Then v_m is chosen such that

    Σ_{ij} W_ij ( v_{mi}/√D_ii − v_{mj}/√D_jj )²    (6)

is minimized, subject to the constraint that v_m is orthogonal to v_{m'} if m ≠ m'. By the Rayleigh quotient theorem, the vector v_m must be the m-th bottom eigenvector of the matrix L = D^{-1/2}(D − W)D^{-1/2}, excluding the very bottom constant eigenvector. The matrix L is referred to as the normalized (and symmetrized) graph Laplacian, or the graph Laplacian for brevity in the rest of the paper.
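A minimal sketch of this construction (our illustration, not the authors' implementation) builds a symmetrized k-nearest-neighbor graph with Gaussian weights, forms L, and returns the bottom non-constant eigenvectors as the embedding; the neighborhood size k and kernel width sigma are assumed parameters.

```python
import numpy as np

def laplacian_embedding(X, M, k=10, sigma=1.0):
    """Rows of the returned U (M x N) are the bottom non-constant eigenvectors
    of the normalized graph Laplacian L = D^{-1/2} (D - W) D^{-1/2}."""
    N = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    # symmetrized k-nearest-neighbor graph with Gaussian edge weights
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]
    adj = np.zeros((N, N), dtype=bool)
    adj[np.repeat(np.arange(N), k), nn.ravel()] = True
    adj = adj | adj.T
    W = np.where(adj, np.exp(-d2 / sigma ** 2), 0.0)
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = Dinv @ (np.diag(d) - W) @ Dinv
    vals, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return vecs[:, 1:M + 1].T               # skip the trivial bottom eigenvector
```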

The eigenvectors of the graph Laplacian can be seen as discretized versions of harmonic functions (Lafon, 2004; Coifman et al., 2005); i.e., they are eigenfunctions of the continuous Laplace-Beltrami operator sampled at the locations of the x_i. As an example, Fig. 1 shows some of the first non-constant eigenvectors (mapped onto 2-D) for data points sampled from a 3-D torus. The image intensities correspond to the high and low values of the eigenvectors. The variation in the intensities can be interpreted as the high and low frequency components of the harmonic functions. Intuitively, these eigenvectors can be used to approximate smooth functions on the manifold (Bengio et al., 2003; Ham et al., 2004). We will explore this intuition in the next section to parameterize the central subspace with these eigenfunctions.

Figure 1. 3-D torus and bottom eigenvectors of the graph Laplacian for data sampled from the torus. See text for details.

3. Manifold KDR

Consider regression problems where the covariates have an intrinsic manifold structure. Can we incorporate such geometrical information into the process of identifying a predictive representation for the covariates? To this end, we present an algorithm that we refer to as manifold kernel dimension reduction (mKDR). The algorithm combines ideas from unsupervised manifold learning and KDR.

At a high level, the algorithm has three ingredients: (1) the computation of a low-dimensional embedding of the covariates X; (2) the parametrization of the central subspace as a linear transformation of the low-dimensional embedding; (3) the computation of the coefficients of the optimal linear map using the KDR framework. The linear map yields directions in the low-dimensional embedding that contribute most significantly to the central subspace. Such directions can be used for data visualization.

Let us choose M eigenvectors {v_m}_{m=1}^M, or equivalently, an M-dimensional embedding U ∈ U ⊂ R^{M×N}. As in the framework of KDR, we consider a kernel function that maps a point Bᵀx_i in the central subspace to the RKHS. We construct the mapping explicitly:

    K(·, Bᵀx_i) ≈ Φ u_i    (7)

by approximating it with a linear expansion Φ ∈ R^{M×M} in the eigenfunctions. Note that the linear map Φ is independent of x_i, enforcing a globally smooth transformation on the embedding U.

We identify the linear map Φ in the framework of KDR. Specifically, we aim to minimize the contrast function of eq. (4) for statistical conditional independence between the response y_i and x_i. The Gram matrix is then approximated and parameterized by the linear map Φ:

    ⟨ K(·, Bᵀx_i), K(·, Bᵀx_j) ⟩ ≈ u_iᵀ Φᵀ Φ u_j.    (8)

Finally, we formulate the following optimization problem:

    min Tr[ K_Y^c (UᵀΩU + NεI)^{-1} ]
    such that Ω ⪰ 0, Tr(Ω) = 1,    (9)

where Ω = ΦᵀΦ. Note that if Ω is allowed to grow arbitrarily large then the objective function attains an arbitrarily small value, with infimum of zero. Therefore, we constrain the matrix Ω to have unit trace: Tr(Ω) = 1. Furthermore, the matrix Ω needs to be constrained to the cone of positive semidefinite matrices, i.e., Ω ⪰ 0, so that a linear map Φ can be computed as the square root of Ω.

Note that recovering B from the linear map Φ is possible via inverting the map in eq. (7). However, we would like to point out that, for the purpose of regression using reduced dimensionality, it is sufficient to build regression models for Y from the central subspace ΦU in the embedding space U generated by the graph Laplacian.

The optimization of eq. (9) is nonlinear and nonconvex. We have applied the projected gradient method to find local minimizers. This method worked well in all of our experiments. Despite the nonconvexity of the problem, initialization with the identity matrix gave fast convergence in our experiments. Algorithm 1 gives the pseudocode for the mKDR algorithm. In our experiments we choose a regularized linear response kernel K_Y = YᵀY + NεI, but other choices could also be considered.

The problem of choosing the dimensionality of the central subspace is still open in the case of KDR and is open in our case as well. A heuristic strategy is to choose M to be larger than a conservative prior estimate of the dimensionality d of the central subspace and to rely on the fact that the solution to the optimization eq. (9) often leads to a low-rank linear map Φ. The rank r = rank(Φ) then provides an empirical estimate of d. Let Φ_r stand for the matrix formed by the top r eigenvectors of Φ. We can approximate the subspace of ΦU with Φ_r U. Additionally, the row vectors of the matrix Φ_r are combinations of eigenfunctions. We can select the eigenfunctions with the largest coefficients and use them to visualize the original data {x_i}. Note that these eigenfunctions are not necessarily chosen consecutively from the spectral bottom of the graph Laplacian (in contradistinction to the unsupervised manifold learning setting). Indeed, the seemingly out-of-order selection of eigenfunctions implemented by our procedure can be interpreted as selecting the "most predictive" functions that respect the intrinsic geometry of {x_i}. In our experiments, we have used these strategies to reveal the central subspace as well as to visualize data.
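The two ingredients of the projected gradient iteration summarized in Algorithm 1 below, namely the objective and gradient of eqs. (9)-(10) and the feasibility projection of eq. (11), can be written compactly. The sketch below is our NumPy rendering under the notation above (U is the M×N embedding, Ky_c the centered response Gram matrix, eps the regularization coefficient ε); it is an illustration under those assumptions, not the authors' code.

```python
import numpy as np

def mkdr_value_and_grad(Omega, U, Ky_c, eps):
    """Objective Tr[K^c_Y (U^T Omega U + N*eps*I)^{-1}] of eq. (9) and its gradient
    -U G K^c_Y G U^T, where G = (U^T Omega U + N*eps*I)^{-1} (cf. eq. (10))."""
    N = U.shape[1]
    G = np.linalg.inv(U.T @ Omega @ U + N * eps * np.eye(N))
    return np.trace(Ky_c @ G), -U @ G @ Ky_c @ G @ U.T

def project_feasible(Omega):
    """Project onto the positive semidefinite cone (eq. (11)) and rescale to unit trace."""
    lam, A = np.linalg.eigh(Omega)
    Omega = (A * np.maximum(lam, 0.0)) @ A.T
    return Omega / np.trace(Omega)
```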

Algorithm 1 Manifold Kernel Dimension Reduction
  Input: covariates {x_i ∈ R^D}_{i=1}^N and responses {y_i ∈ R^{D'}}_{i=1}^N;
         M ≤ D: number of eigenvectors; d ≤ M: the dimensionality of the central subspace
  Output: linear map Φ
  Compute eigenvectors {v_m}_{m=1}^M via the graph Laplacian
  Initialize Ω ← I; t ← 1
  repeat
    Set step size η ← 1/t
    Update Ω with gradient descent:
        Ω ← Ω − η ∂ Tr[ K_Y^c (UᵀΩU + NεI)^{-1} ] / ∂Ω,    (10)
    where the gradient is given by −U (UᵀΩU + NεI)^{-1} K_Y^c (UᵀΩU + NεI)^{-1} Uᵀ
    Project Ω onto the positive semidefinite cone:
        Ω ← Σ_m max(λ_m, 0) a_m a_mᵀ,    (11)
    where (λ_m, a_m) are Ω's eigenvalues and eigenvectors
    Scale Ω to have unit trace: Ω ← Ω / Tr(Ω)
    Increase t: t ← t + 1
  until |V(t) − V(t − 1)| / |V(t)| < tol, where V(t) is the value of the objective function of eq. (9) at step t
  Compute Φ as the square root of Ω

4. Experimental Results

We demonstrate the effectiveness of the mKDR algorithm with one synthetic and two real-world data sets. For two of the three data sets, the true manifolds underlying the data can be directly visualized with 3-D graphics. The true manifold for the third data set is high-dimensional and cannot be visualized directly. Applying the mKDR algorithm to these data sets yielded interesting and informative low-dimensional representations, confirming the potential of the algorithm for exploratory data analysis and visualization of high-dimensional data.

4.1. Regression on a Torus

We begin by analyzing data points lying on the surface of a torus, illustrated in Fig. 1. A torus can be constructed by rotating a 2-D circle in R^3 with respect to an axis. Therefore, a data point on the surface has two degrees of freedom: the rotated angle θ_r with respect to the axis and the polar angle θ_p on the circle. We synthesized our data set by sampling these two angles from the Cartesian product [0, 2π] × [0, 2π]. The 3-D coordinates of our torus are thus given by x_1 = (2 + cos θ_r) cos θ_p, x_2 = (2 + cos θ_r) sin θ_p, and x_3 = sin θ_r. We then embed the torus in x ∈ R^{10} by augmenting the coordinates with 7-dimensional all-zero or random vectors. To set up the regression problem, we define the response by

    y = σ[ −17 ( √((θ_r − π)² + (θ_p − π)²) − 0.6π ) ],

where σ[·] is the sigmoid function. Note that y is radially symmetric, depending only on the distance between (θ_r, θ_p) and (π, π). The colors on the surface of the torus in Fig. 1 correspond to the value of the response.

We applied mKDR to the torus data set generated from 961 uniformly sampled angles θ_p and θ_r. We used M = 50 bottom eigenvectors from the graph Laplacian. The mKDR algorithm then computed the matrix Φ ∈ R^{50×50} that minimizes the empirical conditional covariance operator; cf. eq. (9). We found that this matrix is nearly rank 1 and can be approximated by aᵀa, where a is the eigenvector corresponding to the largest eigenvalue. Hence, we projected the 50-D embedding of the graph Laplacian onto this principal direction a. Fig. 2(a) shows the scatter plot of the projections and the responses of all samples. The scatter plot reveals a clear linear relation. Thus, if we want to regress the response on the eigenvectors returned by the graph Laplacian, a linear regression function is appropriate and very likely to be sufficient.

Figure 2. Central subspaces of data on torus, sampled uniformly and randomly. See text for details.
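For readers who wish to reproduce a data set of this form, the sketch below is our rendering of the data-generating process just described (the grid size of 31 × 31 = 961 angle pairs follows the text; the optional additive noise scale is an assumption and does not exactly reproduce the randomly resampled noisy variant of Fig. 2(b)).

```python
import numpy as np

def make_torus_data(n_grid=31, noise=0.0, seed=0):
    """Torus covariates embedded in R^10 and the radially symmetric response of Sec. 4.1."""
    rng = np.random.default_rng(seed)
    tr, tp = np.meshgrid(np.linspace(0.0, 2 * np.pi, n_grid),
                         np.linspace(0.0, 2 * np.pi, n_grid))
    tr, tp = tr.ravel(), tp.ravel()
    X = np.column_stack([(2 + np.cos(tr)) * np.cos(tp),
                         (2 + np.cos(tr)) * np.sin(tp),
                         np.sin(tr)])
    X = np.hstack([X, np.zeros((X.shape[0], 7))])            # pad to R^10 with zeros
    dist = np.sqrt((tr - np.pi) ** 2 + (tp - np.pi) ** 2)
    y = 1.0 / (1.0 + np.exp(17.0 * (dist - 0.6 * np.pi)))    # sigma[-17(dist - 0.6*pi)]
    if noise > 0.0:                                          # optional additive Gaussian noise
        X = X + rng.normal(scale=noise, size=X.shape)
        y = y + rng.normal(scale=noise, size=y.shape)
    return X, y
```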

v v v v (a) (b) 3 1 2 1

The linear relation revealed in Fig. 2(a) does not mean that it is appropriate to choose linear functions to regress the response on the original coordinates of the torus. Specifically, the coordinates are mapped to the kernel space via a highly nonlinear mapping induced by the graph Laplacian. For this particular example, the kernel mapping is powerful enough to make linear regression in the RKHS sufficient.

In practice it is often not possible to achieve a uniform sampling of the underlying manifold. Furthermore, it is reasonable to assume that both covariates and response are noisy. To examine the robustness of the mKDR algorithm, we created a torus data set where θ_r and θ_p are randomly sampled and the covariate x and response y are disturbed by additive Gaussian noise. Fig. 2(b) illustrates the central subspace of the noisy data set. Compared to the noiseless central subspace in Fig. 2(a), the central subspace is more diffuse, although the general trend is still linear.

The principal direction a allows us to view central subspaces in the RKHS induced by the manifold learning kernel. It also gives rise to the possibility of visualizing data in the coordinate space induced by the graph Laplacian. Specifically, a encodes the combining coefficients of the eigenvectors. To understand which eigenvectors are the most useful in forming the principal direction, we chose the three eigenvectors with the largest coefficients in magnitude. They corresponded to the first, the third and the fifth bottom eigenvectors of the graph Laplacian. We call them predictive eigenvectors. Fig. 3(a) shows the 3-D embedding of samples using the predictive eigenvectors as coordinates, where the color encodes the responses. As a contrast, Fig. 3(b) shows the 3-D embedding of the torus using the bottom three eigenvectors. The difference in how the data are visualized is clear: mKDR arranges samples under the guidance of the responses, while the unsupervised graph Laplacian does so solely based on the intrinsic geometry of the covariates.

Figure 3. Visualizing data using the most predictive eigenvectors, contributing the most to the central subspace, as well as the bottom eigenvectors ignoring responses. See text for details.

4.2. Predicting Global Temperature

To investigate the effectiveness of mKDR on complex nonlinear regression problems, we have applied it to visualize and analyze a set of 3168 satellite measurements of temperatures in the middle troposphere (Remote Sensing Systems, 2004). Fig. 4(a) shows these temperatures encoded by colors on a world map, where yellow or white colors indicate higher temperatures and red colors indicate lower temperatures. Our regression problem is to predict the temperature using the coordinates of latitude and longitude. Note that there are only two covariates in the problem. However, it is not obvious from the temperature map what the most suitable regression function is. Moreover, the domain of the covariates is not Euclidean, but rather ellipsoidal, a fact that a regression method should take into consideration when estimating the global temperature distribution.

Figure 4. A map of the global temperature in Dec. 2004 and its central subspace. See text for details.

We applied the mKDR algorithm to the temperature data set. We chose M = 100 eigenvectors from the graph Laplacian. We projected the M-dimensional manifold embedding onto the principal direction of the linear map Φ. Fig. 4(b) displays the scatter plot of the projection against the temperatures. The relationship between the two is largely linear.

We tested the linearity by regressing the temperatures on the projections, using a linear regression function. The predicted temperatures and the prediction errors are shown in Fig. 5(a) and Fig. 5(b), with color encoding temperatures and errors respectively. Note that the central subspace predicts the overall temperature pattern well. In areas of inner Greenland, inner Antarctica and the Himalayas, the prediction errors are relatively large, shown in red in Fig. 5(b). Climates in these areas are typically extreme and vary significantly even across small local regions. Such variations might not be well represented by the graph Laplacian eigenfunctions, which are smooth.

Figure 5. Prediction and prediction errors of the global temperature. See text for details.
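The linearity check just described reduces to projecting the Laplacian embedding onto the principal direction of the learned Ω and fitting ordinary least squares. The following is our sketch of that step; the function names are ours, not the paper's.

```python
import numpy as np

def principal_projection(Omega, U):
    """Project the Laplacian embedding U (M x N) onto the top eigenvector a of Omega."""
    _, vecs = np.linalg.eigh(Omega)     # eigenvalues in ascending order
    a = vecs[:, -1]                     # principal direction of the learned linear map
    return a @ U                        # one scalar coordinate per sample

def linear_fit_predictions(proj, y):
    """Ordinary least squares of y on the 1-D projection; returns fitted values."""
    A = np.column_stack([proj, np.ones_like(proj)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef
```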

4.3. Regression on Image Manifolds

In our two previous experiments, the underlying manifolds are low-dimensional and can be directly visualized. In this section, we experiment with a real-world data set whose underlying manifold is high-dimensional and unknown. Our data set contains one thousand 110×80 images of a snowman in a three-dimensional scene. Each snowman is rotated around its axis by an angle chosen uniformly at random from the interval [−45°, 45°], and tilted by an angle chosen uniformly at random from [−10°, 10°]. Moreover, the objects are subject to random vertical plane translations spanning 5 pixels in each direction. A few representative images of such variations are shown in Fig. 6(a). Note that all variations are chosen independently. Therefore, the data set resides on a four-dimensional manifold embedded in 110×80-dimensional space.

Figure 6. Images of the rotating, tilting and translating snowman and its central subspace when the rotation angle is used as the regression response. See text for details.

Suppose that we are interested in the rotation angle and would like to create a low-dimensional embedding of the data that highlights this aspect of image variation, while also maintaining variations in other factors. To this end, we set up a regression problem whose covariates are image pixel intensities and whose responses are the rotation angles of the images.

We applied the mKDR algorithm to the data set and the associated regression problem, using M = 100 eigenvectors of the graph Laplacian. The algorithm resulted in a central space whose first direction correlates fairly well with the rotation angle, as shown in Fig. 6(b). In Fig. 7(a), we visualize the data in R^3 by using the top three predictive eigenvectors. We use colors to encode the rotation angle of each point. The data clustering patterns in the rotation angles are clearly present in the embedding. As a contrast, we used the bottom three eigenvectors to visualize the data as typically practiced in unsupervised manifold learning. The embedding, shown in Fig. 7(b) with colors encoding rotation angles, shows no clear structure or pattern in terms of rotation. In fact, the nearly one-dimensional embedding reflects closely the tilt angle, which tends to cause large variations in image pixel intensities.

Figure 7. Embedding the snowman images with predictive eigenvectors and bottom graph Laplacian eigenvectors, respectively. Color corresponds to rotation angle.
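The "predictive eigenvector" visualizations of Figs. 3 and 7 amount to ranking the Laplacian eigenvectors by the magnitude of their coefficients in the principal direction a. The helper below is our sketch of that selection step, under the same notation as before.

```python
import numpy as np

def predictive_eigenvectors(Omega, U, n_dims=3):
    """Indices and coordinates of the n_dims Laplacian eigenvectors with the largest
    coefficients (in magnitude) in the principal direction of Omega."""
    _, vecs = np.linalg.eigh(Omega)
    a = vecs[:, -1]                          # principal direction of the learned map
    top = np.argsort(-np.abs(a))[:n_dims]    # most predictive eigenvector indices
    return top, U[top]                       # U[top] gives coordinates for a 3-D scatter plot
```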
Related Work We have proposed an algorithm of manifold kernel dimen- sion reduction (mKDR). We have applied the algorithm to There has been relatively little previous work on the su- several synthetic and real-world data sets in the interest of pervised manifold learning problem. Sajama and Orlit- exploratory data analysis and visualization. In these exper- sky (2005) recently proposed a dimensionality algorithm iments, the algorithm discovered low-dimensional and pre- that exploits supervised information to tie the parameters dictive subspaces and revealed interesting and useful data of Gaussian mixture models. This contrasts to the model- patterns that are not accessible to unsupervised manifold free assumption of the SDR framework. This is also the learning algorithms. case for the manifold regularization framework (Belkin et al., 2006), which implements semi-supervised learning The mKDR algorithm is a particular instantiation of the with regularization terms controlling the complexity both framework of kernel dimension reduction. Therefore, it en- Regression on Manifolds Using Kernel Dimension Reduction

Acknowledgements

This work has been partly supported by grants from the Swedish Knowledge Foundation, AstraZeneca, Yahoo! Research and Microsoft Research.

References

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15, 1373–1396.

Belkin, M., Niyogi, P., & Sindhwani, V. (2006). Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7, 2399–2434.

Bengio, Y., Paiement, J.-F., & Vincent, P. (2003). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering (Technical Report 1238). Département d'Informatique et Recherche Opérationnelle, Université de Montréal.

Chiaromonte, F., & Cook, R. D. (2002). Sufficient dimension reduction and graphics in regression. Annals of the Institute of Statistical Mathematics, 54, 768–795.

Coifman, R., Lafon, S., Lee, A., Maggioni, M., Nadler, B., Warner, F., & Zucker, S. (2005). Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences USA, 102, 7426–7431.

Cook, R. D. (1998). Regression graphics. Wiley Inter-Science.

Cook, R. D., & Li, B. (1991). Discussion of Li (1991). Journal of the American Statistical Association, 86, 328–332.

Cook, R. D., & Yin, X. (2001). Dimension reduction and visualization in discriminant analysis (with discussion). Australian & New Zealand Journal of Statistics, 43, 147–199.

Donoho, D. L., & Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences USA, 100, 5591–5596.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2004). Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5, 73–99.

Fukumizu, K., Bach, F. R., & Jordan, M. I. (2006). Kernel dimension reduction in regression (Technical Report). Department of Statistics, University of California, Berkeley.

Globerson, A., & Tishby, N. (2003). Sufficient dimensionality reduction. Journal of Machine Learning Research, 3, 1307–1331.

Ham, J., Lee, D., Mika, S., & Schölkopf, B. (2004). A kernel view of the dimensionality reduction of manifolds. Proceedings of the 21st International Conference on Machine Learning. ACM.

Lafon, S. (2004). Diffusion maps and geometric harmonics. Doctoral dissertation, Yale University.

Li, B., Zha, H., & Chiaromonte, F. (2005). Contour regression: A general approach to dimension reduction. The Annals of Statistics, 33, 1580–1616.

Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327.

Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 86, 316–342.

Remote Sensing Systems (2004). Microwave sounding units (MSU) data. Sponsored by the NOAA Climate and Global Change Program. Data available at www.remss.com.

Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.

Sajama, & Orlitsky, A. (2005). Supervised dimensionality reduction using mixture models. Proceedings of the 22nd International Conference on Machine Learning (pp. 760–767). ACM.

Sha, F., & Saul, L. (2005). Analysis and extension of spectral methods for nonlinear dimensionality reduction. Proceedings of the 22nd International Conference on Machine Learning (pp. 785–792). ACM.

Tenenbaum, J., de Silva, V., & Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290, 2319–2322.

Yang, X., Fu, H., Zha, H., & Barlow, J. (2006). Semi-supervised nonlinear dimensionality reduction. Proceedings of the 23rd International Conference on Machine Learning (pp. 1065–1072). ACM.