Regression on Manifolds Using Kernel Dimension Reduction

Regression on Manifolds Using Kernel Dimension Reduction Jens Nilsson @.. Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden Fei Sha @.. Computer Science Division, University of California, Berkeley, CA 94720 USA Michael I. Jordan @.. Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720 USA Abstract list of “manifold learning” algorithms—including LLE, Isomap, and Laplacian eigenmaps—provide sophisticated We study the problem of discovering a manifold examples of unsupervised dimension reduction (Tenen- that best preserves information relevant to a non- baum et al., 2000; Roweis & Saul, 2000; Belkin & Niyogi, linear regression. Solving this problem involves 2003; Donoho & Grimes, 2003). The supervised learn- extending and uniting two threads of research. ing setting is somewhat more involved; one must make a On the one hand, the literature on sufficient di- choice of the family of manifolds to represent the covari- mension reduction has focused on methods for ate vectors, and one must also choose a family of functions finding the best linear subspace for nonlinear re- to represent the regression surface (or classification bound- gression; we extend this to manifolds. On the ary). Due in part to this additional complexity, most of the other hand, the literature on manifold learning focus in the supervised setting has been on reduction to lin- has focused on unsupervised dimensionality re- ear manifolds. This is true of classical linear discriminant duction; we extend this to the supervised setting. analysis, and also of the large family of methods known Our approach to solving the problem involves as sufficient dimension reduction (SDR) (Li, 1991; Cook, combining the machinery of kernel dimension re- 1998; Fukumizu et al., 2006). SDR aims to find a linear duction with Laplacian eigenmaps. Specifically, subspace S such that the response Y is conditionally inde- we optimize cross-covariance operators in kernel pendent of the covariate vector X, given the projection of feature spaces that are induced by the normalized X on S. This formulation in terms of conditional indepen- graph Laplacian. The result is a highly flexible dence means that essentially no assumptions are made on method in which no strong assumptions are made the form of the regression from X to Y, but strong assump- on the regression function or on the distribution tions are made on the manifold representation of X (it is a of the covariates. We illustrate our methodology linear manifold). Finally, note that the large literature on on the analysis of global temperature data and feature selection for supervised learning can also be con- image manifolds. ceived of as a projection onto a family of linear manifolds. It is obviously of interest to consider methods that com- 1. Introduction bine manifold learning and sufficient dimension reduction. From the point of view of manifold learning, we can read- Dimension reduction is an important theme in machine ily imagine situations in which some form of side infor- learning. Dimension reduction problems can be ap- mation is available to help guide the choice of manifold. proached from the point of view of either unsupervised Such side information might come from a human user in learning or supervised learning. A classical example of the an exploratory data analysis setting. We can also envisage former is principal component analysis (PCA), a method regression and classification problems in which nonlinear that projects data onto a linear manifold. More recent re- representations of the covariate vectors are natural on sub- search has focused on nonlinear manifolds, and the long ject matter grounds. For example, we will consider a prob- th lem involving atmosphere temperature close to the Earth’s Appearing in Proceedings of the 24 International Conference on surface in which a manifold representation of the covariate Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s). vectors is quite natural. We will also consider an exam- Regression on Manifolds Using Kernel Dimension Reduction ple involving image manifolds. Other examples involving formally as a conditional independence assertion: dynamical systems are readily envisaged; for example, a torus is often a natural representation of robot kinematics Y y X | BTX (1) and robot dynamics can be viewed as a regression on this manifold. where B denotes the orthogonal projection of X onto S. The subspace S is referred to as a dimension reduction sub- It is obviously an ambitious undertaking to attempt to iden- space. Dimension reduction subspaces are not unique. We tify a nonlinear manifold from a large family of nonlinear can derive a unique “minimal” subspace, defined as the in- manifolds while making few assumptions regarding the re- tersections of all reduction subspaces S. This minimal sub- gression surface. Nonetheless, it is an important undertak- space does not necessarily satisfy the conditional indepen- ing, because the limitation to linear manifolds in SDR can dence assertion; when it does, the subspace is referred to as be quite restrictive in practice, and the lack of a role for su- the central subspace. pervised data in manifold learning is limiting. As in much of the unsupervised manifold learning literature, we aim to Many approaches have been developed to identify central make progress on this problem by focusing on visualiza- subspaces (Li, 1991; Li, 1992; Cook & Li, 1991; Cook & tion. Without attempting to define the problem formally, Yin, 2001; Chiaromonte & Cook, 2002; Li et al., 2005). we attempt to study situations in which supervised mani- Many of these approaches are based on inverse regression; fold learning is natural and investigate the ability of algo- that is, the problem of estimating E X|Y . The intuition J K rithms to find useful visualizations. is that, if the forward regression model P(Y|X) is concen- trated in a subspace of X then E X|Y should lie in the The methodology that we develop combines techniques same subspace. Moreover, the responsesJ K Y are typically from SDR and unsupervised manifold learning. Specif- of much lower dimension than the covariates X, and thus ically, we make use of ideas from kernel dimension re- the subspace may be more readily identified via inverse reduction (KDR), a recently-developed approach to SDR gression. A difficulty with this approach, however, is that that uses cross-covariance operators on reproducing kernel rather strong assumptions generally have to be imposed on Hilbert spaces to measure quantities related to conditional the distribution of X (e.g., that the distribution be elliptical), independence. We will show that this approach combines and the methods can fail when these assumptions are not naturally with representations of manifolds based on Lapla- met. This issue is of particular importance in our setting, cian eigenmaps. in which the focus is on capturing the structure underlying The paper is organized as follows. In Section 2, we pro- the distribution of X and in which such strong assumptions vide basic background on SDR, KDR and unsupervised would be a significant drawback. We thus turn to a descrip- manifold learning. Section 3 presents our new manifold tion of KDR, an approach to SDR that does not make such kernel dimension reduction (mKDR) method. In Section 4, strong assumptions. we present experimental results evaluating mKDR on both synthetic and real data sets. In Section 5, we comment 2.2. Kernel Dimension Reduction briefly on related work. Finally, we conclude and discuss The framework of kernel dimension reduction was first de- future directions in Section 6. scribed in Fukumizu et al. (2004) and later refined in Fuku- mizu et al. (2006). The key idea of KDR is to map ran- 2. Background dom variables X and Y to reproducing kernel Hilbert spaces (RKHS) and to characterize conditional independence us- We begin by outlining the SDR problem. We then describe ing cross-covariance operators. KDR, a specific methodology for SDR in which the linear subspace is characterized by cross-covariance operators on Let HX be an RKHS of functions on X induced by the reproducing kernel Hilbert spaces. Finally, we also provide kernel function KX(·, X) for X ∈ X. We define the space a brief overview of unsupervised manifold learning. HY and the kernel function KY similarly. Define the cross- covariance between a pair of functions f ∈ HX and g ∈ HY 2.1. Sufficient Dimension Reduction as follows: Let (X, BX) and (Y, BY) be measurable spaces of covari- C f g = EXY r f (X) − EX f (X) g(Y) − EY g(Y) z. (2) ates X and response variables Y respectively. SDR aims at J K J K S ⊂ X S finding a linear subspace such that contains as It turns out that there exists a bilinear operator ΣYX from much predictive information regarding the response Y as HX to HY such that C f g = hg, ΣYX f iHY for all functions the original covariate space. This desideratum is captured f and g. Similarly we can define covariance operators ΣXX and ΣYY . Finally, we can use these operators to define a class of conditional cross-covariance operators in the fol- Regression on Manifolds Using Kernel Dimension Reduction lowing way: where I is the identity matrix of appropriate dimensionality, and a regularization coefficient. The matrix Kc de- −1 ΣYY|X = ΣYY − ΣYXΣXXΣXY . (3) notes the centered kernel matrices ! ! 1 1 This definition assumes that ΣXX is invertible; more general Kc = I − eeT K I − eeT (5) cases are discussed in Fukumizu et al. (2006). N N Note that the conditional covariance operator ΣYY|X of where e is a vector of all ones. eq. (3) is “less” than the covariance operator ΣYY , as the −1 difference ΣYXΣXXΣXY is positive semidefinite. This agrees 2.3. Manifold Learning with the intuition that conditioning reduces uncertainty.

Regression on Manifolds Using Kernel Dimension Reduction

ECS289: Scalable Machine Learning

Semi-Supervised Classification Using Local and Global Regularization

Linear Manifold Regularization for Large Scale Semi-Supervised Learning

Semi-Supervised Deep Learning Using Improved Unsupervised Discriminant Projection?

Hyper-Parameter Optimization for Manifold Regularization Learning

Zero Shot Learning Via Multi-Scale Manifold Regularization

Semi-Supervised Deep Metric Learning Networks for Classiﬁcation of Polarimetric SAR Data

Manifold Regularization and Semi-Supervised Learning: Some Theoretical Analyses

A Graph Based Approach to Semi-Supervised Learning

Cross-Modal and Multimodal Data Analysis Based on Functional Mapping of Spectral Descriptors and Manifold Regularization

Classification by Discriminative Regularization

Approximate Manifold Regularization: Scalable Algorithm and Generalization Analysis