On Manifold Regularization

Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

{misha, niyogi, vikass}@cs.uchicago.edu
Department of Computer Science, University of Chicago

Abstract

We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including Support Vector Machines and Regularized Least Squares can be obtained as special cases. We utilize properties of Reproducing Kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph based approaches) we obtain a natural out-of-sample extension to novel examples and are thus able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. In the absence of labeled examples, our framework gives rise to a regularized form of spectral clustering with an out-of-sample extension.

1 Introduction

The problem of learning from labeled and unlabeled data (semi-supervised and transductive learning) has attracted considerable attention in recent years (cf. [11, 7, 10, 15, 18, 17, 9]). In this paper, we consider this problem within a new framework for data-dependent regularization. Our framework exploits the geometry of the probability distribution that generates the data and incorporates it as an additional regularization term. We consider in some detail the special case where this probability distribution is supported on a submanifold of the ambient space.

Within this general framework, we propose two specific families of algorithms: the Laplacian Regularized Least Squares (hereafter LapRLS) and the Laplacian Support Vector Machines (hereafter LapSVM). These are natural extensions of RLS and SVM respectively. In addition, several recently proposed transductive methods (e.g., [18, 17, 1]) are also seen to be special cases of this general approach. Our solution for the semi-supervised case can be expressed as an expansion over labeled and unlabeled data points. Building on a solid theoretical foundation, we obtain a natural solution to the problem of out-of-sample extension (see also [6] for some recent work). When all examples are unlabeled, we obtain a new regularized version of spectral clustering.

Our general framework brings together three distinct concepts that have received some independent recent attention in machine learning: regularization in Reproducing Kernel Hilbert Spaces, the technology of Spectral Graph Theory, and the geometric viewpoint of Manifold Learning algorithms.

2 The Semi-Supervised Learning Framework

First, we recall the standard statistical framework of learning from examples, where there is a probability distribution $P$ on $X \times \mathbb{R}$ according to which training examples are generated. Labeled examples are $(x, y)$ pairs drawn from $P$. Unlabeled examples are simply $x \in X$ drawn from the marginal distribution $P_X$ of $P$.

One might hope that knowledge of the marginal $P_X$ can be exploited for better function learning (e.g., in classification or regression tasks). Figure 1 shows how unlabeled data can radically alter our prior belief about the appropriate choice of classification functions. However, if there is no identifiable relation between $P_X$ and the conditional $P(y|x)$, the knowledge of $P_X$ is unlikely to be of much use. Therefore, we will make a specific assumption about the connection between the marginal and the conditional. We will assume that if two points $x_1, x_2 \in X$ are close in the intrinsic geometry of $P_X$, then the conditional distributions $P(y|x_1)$ and $P(y|x_2)$ are similar. In other words, the conditional probability distribution $P(y|x)$ varies smoothly along the geodesics in the intrinsic geometry of $P_X$.

[Figure 1: Unlabeled data and prior beliefs.]

We utilize these geometric ideas to extend an established framework for function learning. A number of popular algorithms such as SVM, Ridge Regression, splines, and Radial Basis Functions may be broadly interpreted as regularization algorithms with different empirical cost functions and complexity measures in an appropriately chosen Reproducing Kernel Hilbert Space (RKHS).

For a Mercer kernel $K : X \times X \to \mathbb{R}$, there is an associated RKHS $\mathcal{H}_K$ of functions $X \to \mathbb{R}$ with the corresponding norm $\|\cdot\|_K$. Given a set of labeled examples $(x_i, y_i)$, $i = 1, \ldots, l$, the standard framework estimates an unknown function by minimizing

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2 \qquad (1) $$

where $V$ is some loss function, such as the squared loss $(y_i - f(x_i))^2$ for RLS or the soft margin loss function for SVM. Penalizing the RKHS norm imposes smoothness conditions on possible solutions. The classical Representer Theorem states that the solution to this minimization problem exists in $\mathcal{H}_K$ and can be written as

$$ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x). \qquad (2) $$

Therefore, the problem is reduced to optimizing over the finite dimensional space of coefficients $\alpha_i$, which is the algorithmic basis for SVM, Regularized Least Squares and other regression and classification schemes.
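For concreteness, a minimal NumPy sketch of this reduction for the squared loss is given below (the kernel choice, function names and parameter values are illustrative, not prescribed by the framework): substituting the expansion (2) into (1) gives the closed form $\alpha = (K + \gamma l I)^{-1} y$.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)); an illustrative Mercer kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def rls_fit(X, y, gamma=1e-2, sigma=1.0):
    """Standard RLS (Eqn 1 with squared loss): alpha = (K + gamma * l * I)^{-1} y."""
    l = len(y)
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + gamma * l * np.eye(l), y)

def rls_predict(alpha, X_train, X_new, sigma=1.0):
    """Evaluate the expansion of Eqn (2): f*(x) = sum_i alpha_i K(x_i, x)."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha
```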

Our goal is to extend this framework by incorporating additional information about the geometric structure of the marginal $P_X$. We would like to ensure that the solution is smooth with respect to both the ambient space and the marginal distribution $P_X$. To achieve that, we introduce an additional regularizer:

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2 \qquad (3) $$

where $\|f\|_I^2$ is an appropriate penalty term that should reflect the intrinsic structure of $P_X$. Here $\gamma_A$ controls the complexity of the function in the ambient space while $\gamma_I$ controls the complexity of the function in the intrinsic geometry of $P_X$. Given this setup one can prove the following representer theorem:

Theorem 2.1. Assume that the penalty term $\|\cdot\|_I$ is sufficiently smooth with respect to the RKHS norm $\|\cdot\|_K$. Then the solution $f^*$ to the optimization problem in Eqn (3) above exists and admits the following representation

$$ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(z) K(x, z)\, dP_X(z) \qquad (4) $$

where $\mathcal{M} = \mathrm{supp}\{P_X\}$ is the support of the marginal $P_X$.

The proof of this theorem runs over several pages and is omitted for lack of space. See [4] for details, including the exact statement of the smoothness conditions.

In most applications, however, we do not know $P_X$. Therefore we must attempt to get empirical estimates of $\|\cdot\|_I$. Note that in order to get such empirical estimates it is sufficient to have unlabeled examples.

A case of particular recent interest is when the support of $P_X$ is a compact submanifold $\mathcal{M} \subset X = \mathbb{R}^n$. In that case, a natural choice for $\|f\|_I^2$ is $\int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$. The optimization problem becomes

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X. \qquad (5) $$

The term $\int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$ may be approximated on the basis of labeled and unlabeled data using the graph Laplacian ([1]). Thus, given a set of labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ and a set of unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$, we consider the following optimization problem:

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \, \mathbf{f}^\top L \mathbf{f} \qquad (6) $$

where $\mathbf{f} = [f(x_1), \ldots, f(x_{l+u})]^\top$, and $L$ is the graph Laplacian given by $L = D - W$, where $W_{ij}$ are the edge weights in the data adjacency graph. Here, the diagonal matrix $D$ is given by $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$. The normalizing coefficient $\frac{1}{(u+l)^2}$ is the natural scale factor for the empirical estimate of the Laplace operator. On a sparse adjacency graph it may be replaced by $\sum_{i,j=1}^{l+u} W_{ij}$.
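A minimal sketch of this empirical estimate (ours; the neighborhood size $k$ and heat kernel parameter $t$ are illustrative choices) constructs the adjacency graph, the Laplacian $L = D - W$, and the quadratic form of Eqn (6):

```python
import numpy as np

def adjacency_graph(X, k=6, t=1.0):
    """Symmetrized kNN graph with heat kernel weights W_ij = exp(-||x_i - x_j||^2 / (4 t))."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]            # k nearest neighbors, excluding the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (4 * t))
    return np.maximum(W, W.T)                        # make the adjacency symmetric

def graph_laplacian(W):
    """L = D - W with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W

def intrinsic_penalty(f_vals, L, l, u):
    """Empirical intrinsic term of Eqn (6): f^T L f / (u + l)^2."""
    return f_vals @ L @ f_vals / (u + l) ** 2
```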

The following simple version of the representer theorem shows that the minimizer has an expansion in terms of both labeled and unlabeled examples and is a key to our algorithms.

Theorem 2.2. The minimizer of optimization problem (6) admits an expansion

$$ f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x) \qquad (7) $$

in terms of the labeled and unlabeled examples.

The proof is a variation of the standard orthogonality argument, which we omit for lack of space.

Remarks: (a) Other natural choices of $\|\cdot\|_I$ exist. Examples are (i) the heat kernel, (ii) the iterated Laplacian, and (iii) kernels in geodesic coordinates. The above kernels are geodesic analogs of similar kernels in Euclidean space. (b) Note that $K$ restricted to $\mathcal{M}$ (denoted by $K_{\mathcal{M}}$) is also a kernel defined on $\mathcal{M}$ with an associated RKHS $\mathcal{H}_{\mathcal{M}}$ of functions $\mathcal{M} \to \mathbb{R}$. While this might suggest $\|f\|_I = \|f_{\mathcal{M}}\|_{\mathcal{H}_{\mathcal{M}}}$ ($f_{\mathcal{M}}$ is $f$ restricted to $\mathcal{M}$) as a reasonable choice for $\|\cdot\|_I$, it turns out that for the minimizer $f^*$ of the corresponding optimization problem we get $\|f^*\|_I = \|f^*\|_K$, yielding the same solution as standard regularization, although with a different $\gamma$.

3 Algorithms

We now present solutions to the optimization problem posed in Eqn (6). To fix notation, we assume we have $l$ labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ and $u$ unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$. We use $K$ interchangeably to denote the kernel function or the Gram matrix.

3.1 Laplacian Regularized Least Squares (LapRLS)

The Laplacian Regularized Least Squares algorithm solves Eqn (6) with the squared loss function $V(x_i, y_i, f) = (y_i - f(x_i))^2$. Since the solution is of the form given by (7), the objective function can be reduced to a convex differentiable function of the $(l+u)$-dimensional expansion coefficient vector $\alpha$, whose minimizer is given by:

$$ \alpha^* = \Big( J K + \gamma_A l \, I + \frac{\gamma_I l}{(u+l)^2} L K \Big)^{-1} Y \qquad (8) $$

Here, $K$ is the $(l+u) \times (l+u)$ Gram matrix over labeled and unlabeled points; $Y$ is an $(l+u)$-dimensional label vector given by $Y = [y_1, \ldots, y_l, 0, \ldots, 0]^\top$; and $J$ is an $(l+u) \times (l+u)$ diagonal matrix given by $J = \mathrm{diag}(1, \ldots, 1, 0, \ldots, 0)$, with the first $l$ diagonal entries equal to 1 and the rest 0.

Note that when $\gamma_I = 0$, Eqn (8) gives zero coefficients over the unlabeled data. The coefficients over the labeled data are then exactly those for standard RLS.
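A direct NumPy transcription of Eqn (8) (an illustrative sketch; names and default parameter values are ours) assumes the Gram matrix $K$ and Laplacian $L$ are built over the $l+u$ points with the labeled points first:

```python
import numpy as np

def lap_rls_fit(K, L, y_labeled, l, u, gamma_A=1e-2, gamma_I=1e-2):
    """LapRLS, Eqn (8): alpha = (J K + gamma_A l I + gamma_I l / (u+l)^2 L K)^{-1} Y."""
    n = l + u
    J = np.diag(np.concatenate([np.ones(l), np.zeros(u)]))   # zeroes out unlabeled rows
    Y = np.concatenate([y_labeled, np.zeros(u)])             # zero-padded label vector
    M = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / (u + l) ** 2) * (L @ K)
    return np.linalg.solve(M, Y)
```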

3.2 Laplacian Support Vector Machines (LapSVM)

Laplacian SVMs solve the optimization problem in Eqn (6) with the soft margin loss function defined as $V(x_i, y_i, f) = \max(0, 1 - y_i f(x_i))$, where $y_i \in \{-1, +1\}$. Introducing slack variables and using the standard Lagrange multiplier techniques used for deriving SVMs [16], we first arrive at the following quadratic program in $l$ dual variables $\beta$:

$$ \beta^* = \arg\max_{\beta \in \mathbb{R}^l} \; \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^\top Q \beta \qquad (9) $$

subject to the constraints $\sum_{i=1}^{l} y_i \beta_i = 0$ and $0 \le \beta_i \le \frac{1}{l}$, $i = 1, \ldots, l$, where

$$ Q = Y J K \Big( 2 \gamma_A I + \frac{2 \gamma_I}{(u+l)^2} L K \Big)^{-1} J^\top Y. \qquad (10) $$

Here, $Y$ is the diagonal matrix $Y_{ii} = y_i$; $K$ is the Gram matrix over both the labeled and the unlabeled data; $L$ is the data adjacency graph Laplacian; and $J$ is an $l \times (l+u)$ matrix given by $J_{ij} = 1$ if $i = j$ and $x_i$ is a labeled example, and $J_{ij} = 0$ otherwise. To obtain the optimal expansion coefficient vector $\alpha^* \in \mathbb{R}^{l+u}$, one has to solve the following linear system after solving the quadratic program above:

$$ \alpha^* = \Big( 2 \gamma_A I + \frac{2 \gamma_I}{(u+l)^2} L K \Big)^{-1} J^\top Y \beta^*. \qquad (11) $$

One can note that when $\gamma_I = 0$, the SVM QP and Eqns (10, 11) give zero expansion coefficients over the unlabeled data; the expansion coefficients over the labeled data and the $Q$ matrix are then exactly as in standard SVM. Laplacian SVMs can thus be easily implemented using standard SVM software and packages for solving linear systems.
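The sketch below (ours) builds the matrix $Q$ of Eqn (10) and performs the back-substitution of Eqn (11); the dual vector $\beta^*$ is assumed to be produced by any standard SVM QP solver applied to Eqn (9) under the constraints stated above.

```python
import numpy as np

def lapsvm_matrices(K, L, y_labeled, l, u, gamma_A=1e-2, gamma_I=1e-2):
    """Q of Eqn (10) plus the pieces needed for the back-substitution of Eqn (11)."""
    n = l + u
    Y = np.diag(y_labeled)           # l x l diagonal label matrix
    J = np.zeros((l, n))
    J[:, :l] = np.eye(l)             # l x (l+u) selector of the labeled points
    Minv = np.linalg.inv(2 * gamma_A * np.eye(n) + (2 * gamma_I / (u + l) ** 2) * (L @ K))
    Q = Y @ J @ K @ Minv @ J.T @ Y   # Eqn (10)
    return Q, Minv, J, Y

def lapsvm_alpha(Minv, J, Y, beta):
    """Eqn (11): alpha = (2 gamma_A I + 2 gamma_I/(u+l)^2 L K)^{-1} J^T Y beta."""
    return Minv @ J.T @ Y @ beta
```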

The Manifold Regularization algorithms and some of the connections to other methods are presented in the tables below. For Graph Regularization and Label Propagation see [12, 3, 18].

Manifold Regularization Algorithms

Input: $l$ labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ and $u$ unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$
Output: Estimated function $f : \mathbb{R}^n \to \mathbb{R}$

Step 1: Construct a data adjacency graph with $(l+u)$ nodes using, e.g., $k$ nearest neighbors. Choose edge weights $W_{ij}$, e.g., binary weights or heat kernel weights $W_{ij} = e^{-\|x_i - x_j\|^2 / 4t}$.
Step 2: Choose a kernel function $K(x, y)$. Compute the Gram matrix $K_{ij} = K(x_i, x_j)$.
Step 3: Compute the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal matrix given by $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$.
Step 4: Choose $\gamma_A$ and $\gamma_I$.
Step 5: Compute $\alpha^*$ using Eqn (8) for the squared loss (Laplacian RLS) or using Eqns (10, 11) together with an SVM QP solver for the soft margin loss (Laplacian SVM).
Step 6: Output the function $f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x_i, x)$.
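Step 6 is the out-of-sample extension: the learned expansion is simply evaluated at new points. A one-line sketch (ours):

```python
import numpy as np

def predict(alpha, X_train, X_new, kernel):
    """Step 6 / Eqn (7): f*(x) = sum_i alpha_i K(x_i, x) over the l+u training points.

    `kernel(A, B)` is any function returning the matrix of kernel values K(a_i, b_j).
    """
    return kernel(X_new, X_train) @ alpha
```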

Connections to other algorithms

$\gamma_A > 0$, $\gamma_I > 0$: Manifold Regularization
$\gamma_A > 0$, $\gamma_I = 0$: Standard Regularization (RLS or SVM)
$\gamma_A \to 0$, $\gamma_I > 0$: Out-of-sample extension for Graph Regularization (RLS or SVM)
$\gamma_A \to 0$, $\gamma_I \to 0$ (with $\gamma_I \gg \gamma_A$): Out-of-sample extension for Label Propagation (RLS or SVM)
$\gamma_A \to 0$, $\gamma_I = 0$: Hard margin (RLS or SVM)
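As a quick numerical illustration of the second row of the table (a toy check of our own, reusing the helper sketches above rather than an experiment from the paper), setting $\gamma_I = 0$ in LapRLS gives zero coefficients on unlabeled points and the standard RLS coefficients on labeled points:

```python
import numpy as np

rng = np.random.default_rng(0)
l, u = 5, 15
X = rng.normal(size=(l + u, 2))          # toy data; the first l points are labeled
y = np.sign(X[:l, 0])

K = rbf_kernel(X, X, sigma=1.0)          # helper sketches defined earlier in this document
L = graph_laplacian(adjacency_graph(X, k=6))

alpha_lap = lap_rls_fit(K, L, y, l, u, gamma_A=0.1, gamma_I=0.0)
alpha_rls = rls_fit(X[:l], y, gamma=0.1, sigma=1.0)

assert np.allclose(alpha_lap[l:], 0.0)                     # unlabeled coefficients vanish
assert np.allclose(alpha_lap[:l], alpha_rls, atol=1e-6)    # labeled ones match standard RLS
```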

4 Related Work

In this section we survey various approaches to semi-supervised and transductive learning and highlight connections of Manifold Regularization to other algorithms.

Transductive SVM (TSVM) [16], [11]: TSVMs are based on the following optimization principle:

$$ f^* = \arg\min_{f \in \mathcal{H}_K,\; y_{l+1}, \ldots, y_{l+u}} \; \sum_{i=1}^{l} \big(1 - y_i f(x_i)\big)_+ + \lambda_1 \|f\|_K^2 + \lambda_2 \sum_{j=l+1}^{l+u} \big(1 - y_j f(x_j)\big)_+ $$

which proposes a joint optimization of the SVM objective function over binary-valued labels on the unlabeled data and functions in the RKHS. Here, $\lambda_1, \lambda_2$ are parameters that control the relative hinge loss over the labeled and unlabeled sets. The joint optimization is implemented in [11] by first using an inductive SVM to label the unlabeled data and then iteratively solving SVM quadratic programs, at each step switching labels to improve the objective function. However, this procedure is susceptible to local minima and requires an unknown, possibly large number of label switches before converging. Note that even though TSVMs were inspired by transductive inference, they do provide an out-of-sample extension.

Semi-Supervised SVMs (S3VM) [5]: S3VMs incorporate unlabeled data by including the minimum hinge loss for the two choices of labels for each unlabeled example. This is formulated as a mixed-integer program for linear SVMs in [5] and is found to be intractable for large amounts of unlabeled data. The presentation of the algorithm is restricted to the linear case.

Measure-Based Regularization [9]: The conceptual framework of this work is closest to our approach. The authors consider a gradient-based regularizer that penalizes variations of the function more in high density regions and less in low density regions, leading to the following optimization principle:

$$ f^* = \arg\min_{f} \; \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \int \langle \nabla f(x), \nabla f(x) \rangle \, p(x)\, dx $$

where $p$ is the density of the marginal distribution $P_X$. The authors observe that it is not straightforward to find a kernel for arbitrary densities $p$ whose associated RKHS norm is $\int \langle \nabla f(x), \nabla f(x) \rangle \, p(x)\, dx$. Thus, in the absence of a representer theorem, the authors propose to perform minimization over the linear space generated by the span of a fixed set of basis functions chosen a priori. It is also worth noting that while [9] uses the gradient $\nabla f(x)$ in the ambient space, we use the penalty functional associated with the gradient $\nabla_{\mathcal{M}} f$ over a submanifold. In a situation where the data truly lies on or near a submanifold $\mathcal{M}$, the difference between these two penalizers can be significant, since smoothness in the normal direction to the data manifold is irrelevant to classification or regression. The algorithm in [9] does not demonstrate performance improvements in real world experiments.

Graph Based Approaches (see, e.g., [7, 10, 15, 17, 18, 1]): A variety of graph based methods have been proposed for transductive inference. However, these methods do not provide an out-of-sample extension. In [18], nearest neighbor labeling for test examples is proposed once unlabeled examples have been labeled by transductive learning. In [10], test points are approximately represented as a linear combination of training and unlabeled points in the feature space induced by the kernel. We also note the very recent work [6] on out-of-sample extensions for semi-supervised learning. For Graph Regularization and Label Propagation see [12, 3, 18]. Manifold regularization provides natural out-of-sample extensions to several graph based approaches; these connections are summarized in the tables in Section 3.

Other methods with different paradigms for using unlabeled data include Co-training [8] and Bayesian techniques, e.g., [14].

5 Experiments

We performed experiments on a synthetic dataset and two real world classification problems arising in visual and speech recognition. Comparisons are made with inductive methods (SVM, RLS). We also compare with Transductive SVM (e.g., [11]), based on our survey of related algorithms in Section 4. For all experiments, we constructed adjacency graphs with 6 nearest neighbors. Software and datasets for these experiments are available at http://manifold.cs.uchicago.edu/manifold_regularization. More detailed experimental results are presented in [4].

5.1 Synthetic Data: Two Moons Dataset

The two moons dataset is shown in Figure 2, along with the best decision surfaces across a wide range of parameter settings for SVM, Transductive SVM and Laplacian SVM. The dataset contains 200 examples with only 1 labeled example for each class. Figure 2 demonstrates how TSVM fails to find the optimal solution, whereas the Laplacian SVM decision boundary seems to be intuitively the most satisfying. Figure 3 shows how increasing the intrinsic regularization allows effective use of unlabeled data for constructing classifiers.

[Figure 2: Two Moons Dataset: Best decision surfaces using RBF kernels for SVM, TSVM and Laplacian SVM. Labeled points are shown in color, other points are unlabeled.]

[Figure 3: Two Moons Dataset: Laplacian SVM with increasing intrinsic regularization ($\gamma_A = 0.03125$ with $\gamma_I = 0$, $0.01$, and $1$).]

5.2 Handwritten Digit Recognition

In this set of experiments we applied the Laplacian SVM and Laplacian RLS algorithms to the 45 binary classification problems that arise in pairwise classification of handwritten digits. The first 400 images for each digit in the USPS training set (preprocessed using PCA to 100 dimensions) were taken to form the training set, and 2 of these were randomly labeled. The remaining images formed the test set. Polynomial kernels of degree 3 were used, and $\gamma l = 0.05$ ($C = 10$) was set for the inductive methods following the experiments reported in [13]. For manifold regularization, we chose to split the same weight in the ratio 1 : 9, so that $\gamma_A l = 0.005$ and $\frac{\gamma_I l}{(u+l)^2} = 0.045$. The observations reported in this section hold consistently across a wide choice of parameters.

In Figure 4, we compare the error rates of the Laplacian algorithms, SVM and TSVM at the precision-recall breakeven points in the ROC curves (averaged over 10 random choices of labeled examples) for the 45 binary classification problems. The following comments can be made: (a) Manifold regularization results in significant improvements over inductive classification, for both RLS and SVM, and either compares well with or significantly outperforms TSVM across the 45 classification problems. Note that TSVM solves multiple quadratic programs in the size of the labeled and unlabeled sets, whereas LapSVM solves a single QP in the size of the labeled set, followed by a linear system; this resulted in substantially faster training times for LapSVM in this experiment. (b) Scatter plots of performance on the test and unlabeled data sets confirm that the out-of-sample extension is good for both LapRLS and LapSVM. (c) Finally, we found the Laplacian algorithms to be significantly more stable with respect to the choice of labeled data than the inductive methods and TSVM, as shown in the scatter plot of the standard deviation of error rates in Figure 4. In Figure 5, we plot performance as a function of the number of labeled examples.

[Figure 4: USPS Experiment - Error rates at precision-recall breakeven points for the 45 binary classification problems (RLS vs LapRLS, SVM vs LapSVM, TSVM vs LapSVM), together with scatter plots of the out-of-sample extension (test vs unlabeled error) and of the standard deviation of error rates.]

[Figure 5: USPS Experiment - Mean error rate at precision-recall breakeven points as a function of the number of labeled points, from 2 to 128 (T: Test Set, U: Unlabeled Set).]
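The precision-recall breakeven point used above can be computed, for example, by scoring the test set and taking the top $k$ examples with $k$ equal to the number of true positives, at which point precision and recall coincide (a sketch of ours, not code from the paper):

```python
import numpy as np

def pr_breakeven(scores, labels):
    """Precision (= recall) when the number of predicted positives equals the
    number of true positives; labels are +1 / -1, scores are real-valued f*(x)."""
    k = int(np.sum(labels == 1))
    top_k = np.argsort(-scores)[:k]      # indices of the k highest-scoring examples
    return float(np.mean(labels[top_k] == 1))
```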

5.3 Spoken Letter Recognition

This experiment was performed on the Isolet database of letters of the English alphabet spoken in isolation (available from the UCI machine learning repository). The dataset contains utterances of 150 subjects who spoke the name of each letter of the English alphabet twice. The speakers are grouped into 5 sets of 30 speakers each, referred to as isolet1 through isolet5. For the purposes of this experiment, we chose to train on the first 30 speakers (isolet1), forming a training set of 1560 examples, and to test on isolet5, containing 1559 examples (1 utterance is missing in the database due to poor recording). We considered the task of classifying the first 13 letters of the English alphabet from the last 13. The experimental set-up is meant to simulate a real-world situation: we considered 30 binary classification problems corresponding to 30 splits of the training data where all utterances of one speaker were labeled and all the rest were left unlabeled. The test set is composed of entirely new speakers, forming the separate group isolet5.

We chose to train with RBF kernels of width $\sigma = 10$ (this was the best value among several settings with respect to 5-fold cross-validation error rates for the fully supervised problem using standard SVM). For SVM and RLS we set $\gamma l = 0.05$ ($C = 10$); this was the best value among several settings with respect to mean error rates over the 30 splits. For Laplacian RLS and Laplacian SVM we set $\gamma_A l = \frac{\gamma_I l}{(u+l)^2} = 0.005$.

In Figure 6, we compare these algorithms. The following comments can be made: (a) LapSVM and LapRLS make significant performance improvements over the inductive methods and TSVM, for predictions on unlabeled speakers that come from the same group as the labeled speaker, over all choices of the labeled speaker. (b) On isolet5, which comprises a separate group of speakers, the performance improvements are smaller but consistent over the choice of the labeled speaker. This can be expected since there appears to be a systematic bias that affects all algorithms, in favor of same-group speakers. For further details, see [4].

[Figure 6: Isolet Experiment - Error rates at precision-recall breakeven points of the 30 binary classification problems (RLS vs LapRLS and SVM vs TSVM vs LapSVM), on the unlabeled set and on the test set, as a function of the labeled speaker.]

5.4 Regularized Spectral Clustering and Data Representation

When all training examples are unlabeled, the optimization problem of our framework, expressed in Eqn (6), reduces to the following clustering objective function:

$$ \min_{f \in \mathcal{H}_K} \; \gamma \|f\|_K^2 + \mathbf{f}^\top L \mathbf{f} \qquad (12) $$

where $\gamma$ (the ratio of $\gamma_A$ to $\gamma_I$, up to the $(u+l)^2$ scale factor) is a regularization parameter that controls the complexity of the clustering function. To avoid degenerate solutions we need to impose some additional conditions (cf. [2]). It can be easily seen that a version of the Representer theorem holds, so that the minimizer has the form $f^*(\cdot) = \sum_{i=1}^{u} \alpha_i K(x_i, \cdot)$. By substituting back in Eqn (12), we come up with the following optimization problem:

$$ \bar{\alpha}^* = \arg\min_{\mathbf{1}^\top K \bar{\alpha} = 0,\; \bar{\alpha}^\top K^2 \bar{\alpha} = 1} \; \gamma \, \bar{\alpha}^\top K \bar{\alpha} + \bar{\alpha}^\top K L K \bar{\alpha} $$

where $\mathbf{1}$ is the vector of all ones, $\bar{\alpha} = (\alpha_1, \ldots, \alpha_u)$ and $K$ is the corresponding Gram matrix.

Letting $P$ be the projection onto the subspace of $\mathbb{R}^u$ orthogonal to $K\mathbf{1}$, one obtains the solution for the constrained quadratic problem, which is given by the generalized eigenvalue problem

$$ P (\gamma K + K L K) P v = \lambda \, P K^2 P v. \qquad (13) $$

The final solution is given by $\bar{\alpha} = P v$, where $v$ is the eigenvector corresponding to the smallest eigenvalue.

The framework for clustering sketched above provides a regularized form of spectral clustering, where $\gamma$ controls the smoothness of the resulting function in the ambient space. We also obtain a natural out-of-sample extension for clustering points not in the original data set. Figure 7 shows the results of this method on a two-dimensional clustering problem.

[Figure 7: Two Moons Dataset: Regularized clustering with $(\gamma_A, \gamma_I)$ equal to $(10^{-6}, 1)$, $(10^{-4}, 1)$ and $(0.1, 1)$.]

By taking multiple eigenvectors of the system in Eqn (13) we obtain a natural regularized out-of-sample extension of Laplacian Eigenmaps [2]. This leads to a new method for dimensionality reduction and data representation. Further study of this approach is a direction of future research.
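A compact sketch of Eqn (13) (ours; the helper names and the use of SciPy's generalized symmetric eigensolver are illustrative, and a non-singular Gram matrix $K$ is assumed) solves the problem on the subspace orthogonal to $K\mathbf{1}$ and clusters by the sign of the resulting function:

```python
import numpy as np
from scipy.linalg import eigh, null_space

def regularized_spectral_clustering(K, L, gamma=1e-4):
    """Solve Eqn (13), P(gamma K + K L K) P v = lambda P K^2 P v, and cluster by sign of f = K alpha."""
    u = K.shape[0]
    B = null_space((K @ np.ones(u))[None, :])     # orthonormal basis of {a : 1^T K a = 0}
    A_red = B.T @ (gamma * K + K @ L @ K) @ B     # constrained quadratic form (numerator)
    N_red = B.T @ (K @ K) @ B                     # normalization alpha^T K^2 alpha (needs K nonsingular)
    w, V = eigh(A_red, N_red)                     # generalized eigenproblem, eigenvalues ascending
    alpha = B @ V[:, 0]                           # eigenvector for the smallest eigenvalue
    f = K @ alpha                                 # function values on the data points
    return np.sign(f), alpha
```

Because the solution is a function in $\mathcal{H}_K$, new points can be clustered by evaluating $f(x) = \sum_i \alpha_i K(x_i, x)$, which is the out-of-sample extension discussed above.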

Acknowledgments. We are grateful to Marc Coram, Steve Smale and Peter Bickel for intellectual support and to the NSF for financial support. We would also like to acknowledge the Toyota Technological Institute for its support of this work.

References

[1] M. Belkin, P. Niyogi, Using Manifold Structure for Partially Labeled Classification, NIPS 2002.
[2] M. Belkin, P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, June 2003.
[3] M. Belkin, I. Matveeva, P. Niyogi, Regression and Regularization on Large Graphs, COLT 2004.
[4] M. Belkin, P. Niyogi, V. Sindhwani, Manifold Regularization: A Geometric Framework for Learning From Examples, Technical Report TR-2004-06, Department of Computer Science, University of Chicago. Available at: http://www.cs.uchicago.edu/research/publications/techreports/TR-2004-06
[5] K. Bennett and A. Demiriz, Semi-Supervised Support Vector Machines, NIPS 1998.
[6] Y. Bengio, O. Delalleau and N. Le Roux, Efficient Non-Parametric Function Induction in Semi-Supervised Learning, Technical Report 1247, DIRO, University of Montreal, 2004.
[7] A. Blum, S. Chawla, Learning from Labeled and Unlabeled Data using Graph Mincuts, ICML 2001.
[8] A. Blum, T. Mitchell, Combining Labeled and Unlabeled Data with Co-Training, COLT 1998.
[9] O. Bousquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS 2003.
[10] O. Chapelle, J. Weston and B. Schoelkopf, Cluster Kernels for Semi-Supervised Learning, NIPS 2002.
[11] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, ICML 1999.
[12] A. Smola and R. Kondor, Kernels and Regularization on Graphs, COLT/KW 2003.
[13] B. Schoelkopf, C. J. C. Burges, V. Vapnik, Extracting Support Data for a Given Task, KDD 1995.
[14] M. Seeger, Learning with Labeled and Unlabeled Data, Technical Report, Edinburgh University, 2000.
[15] M. Szummer, T. Jaakkola, Partially Labeled Classification with Markov Random Walks, NIPS 2001.
[16] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
[17] D. Zhou, O. Bousquet, T. N. Lal, J. Weston and B. Schoelkopf, Learning with Local and Global Consistency, NIPS 2003.
[18] X. Zhu, J. Lafferty and Z. Ghahramani, Semi-Supervised Learning using Gaussian Fields and Harmonic Functions, ICML 2003.