On Manifold Regularization

Mikhail Belkin, Partha Niyogi, Vikas Sindhwani

{misha, niyogi, vikass}@cs.uchicago.edu
Department of Computer Science, University of Chicago

Abstract

We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including Support Vector Machines and Regularized Least Squares can be obtained as special cases. We utilize properties of Reproducing Kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph based approaches) we obtain a natural out-of-sample extension to novel examples and are thus able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. In the absence of labeled examples, our framework gives rise to a regularized form of spectral clustering with an out-of-sample extension.

1 Introduction

The problem of learning from labeled and unlabeled data (semi-supervised and transductive learning) has attracted considerable attention in recent years (cf. [11, 7, 10, 15, 18, 17, 9]). In this paper, we consider this problem within a new framework for data-dependent regularization. Our framework exploits the geometry of the probability distribution that generates the data and incorporates it as an additional regularization term. We consider in some detail the special case where this probability distribution is supported on a submanifold of the ambient space.

Within this general framework, we propose two specific families of algorithms: the Laplacian Regularized Least Squares (hereafter LapRLS) and the Laplacian Support Vector Machines (hereafter LapSVM). These are natural extensions of RLS and SVM respectively. In addition, several recently proposed transductive methods (e.g., [18, 17, 1]) are also seen to be special cases of this general approach. Our solution for the semi-supervised case can be expressed as an expansion over labeled and unlabeled data points. Building on a solid theoretical foundation, we obtain a natural solution to the problem of out-of-sample extension (see also [6] for some recent work). When all examples are unlabeled, we obtain a new regularized version of spectral clustering.

Our general framework brings together three distinct concepts that have received some independent recent attention in machine learning: regularization in Reproducing Kernel Hilbert Spaces, the technology of Spectral Graph Theory, and the geometric viewpoint of Manifold Learning algorithms.

2 The Semi-Supervised Learning Framework

First, we recall the standard statistical framework of learning from examples, where there is a probability distribution $P$ on $X \times \mathbb{R}$ according to which training examples are generated. Labeled examples are $(x, y)$ pairs drawn from $P$. Unlabeled examples are simply $x \in X$ drawn from the marginal distribution $P_X$ of $P$.

One might hope that knowledge of the marginal $P_X$ can be exploited for better function learning (e.g., in classification or regression tasks). Figure 1 shows how unlabeled data can radically alter our prior belief about the appropriate choice of classification functions. However, if there is no identifiable relation between $P_X$ and the conditional $P(y|x)$, the knowledge of $P_X$ is unlikely to be of much use. Therefore, we will make a specific assumption about the connection between the marginal and the conditional. We will assume that if two points $x_1, x_2 \in X$ are close in the intrinsic geometry of $P_X$, then the conditional distributions $P(y|x_1)$ and $P(y|x_2)$ are similar. In other words, the conditional probability distribution $P(y|x)$ varies smoothly along the geodesics in the intrinsic geometry of $P_X$.

[Figure 1: Unlabeled data and prior beliefs.]

We utilize these geometric ideas to extend an established framework for function learning. A number of popular algorithms such as SVM, Ridge Regression, splines, and Radial Basis Functions may be broadly interpreted as regularization algorithms with different empirical cost functions and complexity measures in an appropriately chosen Reproducing Kernel Hilbert Space (RKHS).

For a Mercer kernel $K : X \times X \to \mathbb{R}$, there is an associated RKHS $\mathcal{H}_K$ of functions $X \to \mathbb{R}$ with the corresponding norm $\|\cdot\|_K$. Given a set of labeled examples $(x_i, y_i)$, $i = 1, \ldots, l$, the standard framework estimates an unknown function by minimizing

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2 \qquad (1) $$

where $V$ is some loss function, such as the squared loss $(y_i - f(x_i))^2$ for RLS or the soft margin loss function for SVM. Penalizing the RKHS norm imposes smoothness conditions on possible solutions. The classical Representer Theorem states that the solution to this minimization problem exists in $\mathcal{H}_K$ and can be written as

$$ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x). \qquad (2) $$

Therefore, the problem is reduced to optimizing over the finite dimensional space of coefficients $\alpha_i$, which is the algorithmic basis for SVM, Regularized Least Squares and other regression and classification schemes.
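For concreteness, a minimal NumPy sketch of this reduction for the squared loss is given below (the kernel choice, function names and parameter values are illustrative, not prescribed by the framework): substituting the expansion (2) into (1) gives the closed form $\alpha = (K + \gamma l I)^{-1} y$.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2)); an illustrative Mercer kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def rls_fit(X, y, gamma=1e-2, sigma=1.0):
    """Standard RLS (Eqn 1 with squared loss): alpha = (K + gamma * l * I)^{-1} y."""
    l = len(y)
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + gamma * l * np.eye(l), y)

def rls_predict(alpha, X_train, X_new, sigma=1.0):
    """Evaluate the expansion of Eqn (2): f*(x) = sum_i alpha_i K(x_i, x)."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha
```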

Our goal is to extend this framework by incorporating additional information about the geometric structure of the marginal $P_X$. We would like to ensure that the solution is smooth with respect to both the ambient space and the marginal distribution $P_X$. To achieve that, we introduce an additional regularizer:

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2 \qquad (3) $$

where $\|f\|_I^2$ is an appropriate penalty term that should reflect the intrinsic structure of $P_X$. Here $\gamma_A$ controls the complexity of the function in the ambient space while $\gamma_I$ controls the complexity of the function in the intrinsic geometry of $P_X$. Given this setup one can prove the following representer theorem:

Theorem 2.1. Assume that the penalty term $\|\cdot\|_I$ is sufficiently smooth with respect to the RKHS norm $\|\cdot\|_K$. Then the solution $f^*$ to the optimization problem in Eqn (3) above exists and admits the following representation

$$ f^*(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(z) K(x, z)\, dP_X(z) \qquad (4) $$

where $\mathcal{M} = \mathrm{supp}\{P_X\}$ is the support of the marginal $P_X$.

The proof of this theorem runs over several pages and is omitted for lack of space. See [4] for details, including the exact statement of the smoothness conditions.

In most applications, however, we do not know $P_X$. Therefore we must attempt to get empirical estimates of $\|\cdot\|_I$. Note that in order to get such empirical estimates it is sufficient to have unlabeled examples.

A case of particular recent interest is when the support of $P_X$ is a compact submanifold $\mathcal{M} \subset X = \mathbb{R}^n$. In that case, a natural choice for $\|f\|_I^2$ is $\int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$. The optimization problem becomes

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X. \qquad (5) $$

The term $\int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$ may be approximated on the basis of labeled and unlabeled data using the graph Laplacian ([1]). Thus, given a set of labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ and a set of unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$, we consider the following optimization problem:

$$ f^* = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(u+l)^2} \, \mathbf{f}^\top L \mathbf{f} \qquad (6) $$

where $\mathbf{f} = [f(x_1), \ldots, f(x_{l+u})]^\top$, and $L$ is the graph Laplacian given by $L = D - W$, where $W_{ij}$ are the edge weights in the data adjacency graph. Here, the diagonal matrix $D$ is given by $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$. The normalizing coefficient $\frac{1}{(u+l)^2}$ is the natural scale factor for the empirical estimate of the Laplace operator. On a sparse adjacency graph it may be replaced by $\sum_{i,j=1}^{l+u} W_{ij}$.
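A minimal sketch of this empirical estimate (ours; the neighborhood size $k$ and heat kernel parameter $t$ are illustrative choices) constructs the adjacency graph, the Laplacian $L = D - W$, and the quadratic form of Eqn (6):

```python
import numpy as np

def adjacency_graph(X, k=6, t=1.0):
    """Symmetrized kNN graph with heat kernel weights W_ij = exp(-||x_i - x_j||^2 / (4 t))."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]            # k nearest neighbors, excluding the point itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (4 * t))
    return np.maximum(W, W.T)                        # make the adjacency symmetric

def graph_laplacian(W):
    """L = D - W with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W

def intrinsic_penalty(f_vals, L, l, u):
    """Empirical intrinsic term of Eqn (6): f^T L f / (u + l)^2."""
    return f_vals @ L @ f_vals / (u + l) ** 2
```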

The following simple version of the representer theorem shows that the minimizer has an expansion in terms of both labeled and unlabeled examples and is a key to our algorithms.

Theorem 2.2. The minimizer of optimization problem (6) admits an expansion

$$ f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x) \qquad (7) $$

in terms of the labeled and unlabeled examples.

The proof is a variation of the standard orthogonality argument, which we omit for lack of space.

Remarks: (a) Other natural choices of $\|\cdot\|_I$ exist. Examples are (i) the heat kernel, (ii) the iterated Laplacian, and (iii) kernels in geodesic coordinates. The above kernels are geodesic analogs of similar kernels in Euclidean space. (b) Note that $K$ restricted to $\mathcal{M}$ (denoted by $K_{\mathcal{M}}$) is also a kernel defined on $\mathcal{M}$ with an associated RKHS $\mathcal{H}_{\mathcal{M}}$ of functions $\mathcal{M} \to \mathbb{R}$. While this might suggest $\|f\|_I = \|f_{\mathcal{M}}\|_{\mathcal{H}_{\mathcal{M}}}$ ($f_{\mathcal{M}}$ is $f$ restricted to $\mathcal{M}$) as a reasonable choice for $\|\cdot\|_I$, it turns out that for the minimizer $f^*$ of the corresponding optimization problem we get $\|f^*\|_I = \|f^*\|_K$, yielding the same solution as standard regularization, although with a different $\gamma$.

3 Algorithms

We now present solutions to the optimization problem posed in Eqn (6). To fix notation, we assume we have $l$ labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ and $u$ unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$. We use $K$ interchangeably to denote the kernel function or the Gram matrix.

3.1 Laplacian Regularized Least Squares (LapRLS)

The Laplacian Regularized Least Squares algorithm solves Eqn (6) with the squared loss function $V(x_i, y_i, f) = (y_i - f(x_i))^2$. Since the solution is of the form given by (7), the objective function can be reduced to a convex differentiable function of the $(l+u)$-dimensional expansion coefficient vector $\alpha$, whose minimizer is given by:

$$ \alpha^* = \Big( J K + \gamma_A l \, I + \frac{\gamma_I l}{(u+l)^2} L K \Big)^{-1} Y \qquad (8) $$

Here, $K$ is the $(l+u) \times (l+u)$ Gram matrix over labeled and unlabeled points; $Y$ is an $(l+u)$-dimensional label vector given by $Y = [y_1, \ldots, y_l, 0, \ldots, 0]^\top$; and $J$ is an $(l+u) \times (l+u)$ diagonal matrix given by $J = \mathrm{diag}(1, \ldots, 1, 0, \ldots, 0)$, with the first $l$ diagonal entries equal to 1 and the rest 0.

Note that when $\gamma_I = 0$, Eqn (8) gives zero coefficients over the unlabeled data. The coefficients over the labeled data are then exactly those for standard RLS.
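A direct NumPy transcription of Eqn (8) (an illustrative sketch; names and default parameter values are ours) assumes the Gram matrix $K$ and Laplacian $L$ are built over the $l+u$ points with the labeled points first:

```python
import numpy as np

def lap_rls_fit(K, L, y_labeled, l, u, gamma_A=1e-2, gamma_I=1e-2):
    """LapRLS, Eqn (8): alpha = (J K + gamma_A l I + gamma_I l / (u+l)^2 L K)^{-1} Y."""
    n = l + u
    J = np.diag(np.concatenate([np.ones(l), np.zeros(u)]))   # zeroes out unlabeled rows
    Y = np.concatenate([y_labeled, np.zeros(u)])             # zero-padded label vector
    M = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / (u + l) ** 2) * (L @ K)
    return np.linalg.solve(M, Y)
```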

3.2 Laplacian Support Vector Machines (LapSVM)

Laplacian SVMs solve the optimization problem in Eqn (6) with the soft margin loss function defined as $V(x_i, y_i, f) = \max(0, 1 - y_i f(x_i))$, where $y_i \in \{-1, +1\}$. Introducing slack variables and using the standard Lagrange multiplier techniques used for deriving SVMs [16], we first arrive at the following quadratic program in $l$ dual variables $\beta$:

$$ \beta^* = \arg\max_{\beta \in \mathbb{R}^l} \; \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^\top Q \beta \qquad (9) $$

subject to the constraints $\sum_{i=1}^{l} y_i \beta_i = 0$ and $0 \le \beta_i \le \frac{1}{l}$, $i = 1, \ldots, l$, where

$$ Q = Y J K \Big( 2 \gamma_A I + \frac{2 \gamma_I}{(u+l)^2} L K \Big)^{-1} J^\top Y. \qquad (10) $$

Here, $Y$ is the diagonal matrix $Y_{ii} = y_i$; $K$ is the Gram matrix over both the labeled and the unlabeled data; $L$ is the data adjacency graph Laplacian; and $J$ is an $l \times (l+u)$ matrix given by $J_{ij} = 1$ if $i = j$ and $x_i$ is a labeled example, and $J_{ij} = 0$ otherwise. To obtain the optimal expansion coefficient vector $\alpha^* \in \mathbb{R}^{l+u}$, one has to solve the following linear system after solving the quadratic program above:

$$ \alpha^* = \Big( 2 \gamma_A I + \frac{2 \gamma_I}{(u+l)^2} L K \Big)^{-1} J^\top Y \beta^*. \qquad (11) $$

One can note that when $\gamma_I = 0$, the SVM QP and Eqns (10, 11) give zero expansion coefficients over the unlabeled data; the expansion coefficients over the labeled data and the $Q$ matrix are then exactly as in standard SVM. Laplacian SVMs can thus be easily implemented using standard SVM software and packages for solving linear systems.
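The sketch below (ours) builds the matrix $Q$ of Eqn (10) and performs the back-substitution of Eqn (11); the dual vector $\beta^*$ is assumed to be produced by any standard SVM QP solver applied to Eqn (9) under the constraints stated above.

```python
import numpy as np

def lapsvm_matrices(K, L, y_labeled, l, u, gamma_A=1e-2, gamma_I=1e-2):
    """Q of Eqn (10) plus the pieces needed for the back-substitution of Eqn (11)."""
    n = l + u
    Y = np.diag(y_labeled)           # l x l diagonal label matrix
    J = np.zeros((l, n))
    J[:, :l] = np.eye(l)             # l x (l+u) selector of the labeled points
    Minv = np.linalg.inv(2 * gamma_A * np.eye(n) + (2 * gamma_I / (u + l) ** 2) * (L @ K))
    Q = Y @ J @ K @ Minv @ J.T @ Y   # Eqn (10)
    return Q, Minv, J, Y

def lapsvm_alpha(Minv, J, Y, beta):
    """Eqn (11): alpha = (2 gamma_A I + 2 gamma_I/(u+l)^2 L K)^{-1} J^T Y beta."""
    return Minv @ J.T @ Y @ beta
```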

The Manifold Regularization algorithms and some of the connections to other methods are presented in the tables below. For Graph Regularization and Label Propagation see [12, 3, 18].

Manifold Regularization Algorithms

Input: $l$ labeled examples $\{(x_i, y_i)\}_{i=1}^{l}$ and $u$ unlabeled examples $\{x_j\}_{j=l+1}^{l+u}$
Output: Estimated function $f : \mathbb{R}^n \to \mathbb{R}$

Step 1: Construct a data adjacency graph with $(l+u)$ nodes using, e.g., $k$ nearest neighbors. Choose edge weights $W_{ij}$, e.g., binary weights or heat kernel weights $W_{ij} = e^{-\|x_i - x_j\|^2 / 4t}$.
Step 2: Choose a kernel function $K(x, y)$. Compute the Gram matrix $K_{ij} = K(x_i, x_j)$.
Step 3: Compute the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal matrix given by $D_{ii} = \sum_{j=1}^{l+u} W_{ij}$.
Step 4: Choose $\gamma_A$ and $\gamma_I$.
Step 5: Compute $\alpha^*$ using Eqn (8) for the squared loss (Laplacian RLS) or using Eqns (10, 11) together with an SVM QP solver for the soft margin loss (Laplacian SVM).
Step 6: Output the function $f^*(x) = \sum_{i=1}^{l+u} \alpha_i^* K(x_i, x)$.
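Step 6 is the out-of-sample extension: the learned expansion is simply evaluated at new points. A one-line sketch (ours):

```python
import numpy as np

def predict(alpha, X_train, X_new, kernel):
    """Step 6 / Eqn (7): f*(x) = sum_i alpha_i K(x_i, x) over the l+u training points.

    `kernel(A, B)` is any function returning the matrix of kernel values K(a_i, b_j).
    """
    return kernel(X_new, X_train) @ alpha
```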

Connections to other algorithms

$\gamma_A > 0$, $\gamma_I > 0$: Manifold Regularization
$\gamma_A > 0$, $\gamma_I = 0$: Standard Regularization (RLS or SVM)
$\gamma_A \to 0$, $\gamma_I > 0$: Out-of-sample extension for Graph Regularization (RLS or SVM)
$\gamma_A \to 0$, $\gamma_I \to 0$ (with $\gamma_I \gg \gamma_A$): Out-of-sample extension for Label Propagation (RLS or SVM)
$\gamma_A \to 0$, $\gamma_I = 0$: Hard margin (RLS or SVM)
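As a quick numerical illustration of the second row of the table (a toy check of our own, reusing the helper sketches above rather than an experiment from the paper), setting $\gamma_I = 0$ in LapRLS gives zero coefficients on unlabeled points and the standard RLS coefficients on labeled points:

```python
import numpy as np

rng = np.random.default_rng(0)
l, u = 5, 15
X = rng.normal(size=(l + u, 2))          # toy data; the first l points are labeled
y = np.sign(X[:l, 0])

K = rbf_kernel(X, X, sigma=1.0)          # helper sketches defined earlier in this document
L = graph_laplacian(adjacency_graph(X, k=6))

alpha_lap = lap_rls_fit(K, L, y, l, u, gamma_A=0.1, gamma_I=0.0)
alpha_rls = rls_fit(X[:l], y, gamma=0.1, sigma=1.0)

assert np.allclose(alpha_lap[l:], 0.0)                     # unlabeled coefficients vanish
assert np.allclose(alpha_lap[:l], alpha_rls, atol=1e-6)    # labeled ones match standard RLS
```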

4 Related Work

In this section we survey various approaches to semi-supervised and transductive learning and highlight connections of Manifold Regularization to other algorithms.

Transductive SVM (TSVM) [16], [11]: TSVMs are based on the following optimization principle:

$$ f^* = \arg\min_{f \in \mathcal{H}_K,\; y_{l+1}, \ldots, y_{l+u}} \; \sum_{i=1}^{l} \big(1 - y_i f(x_i)\big)_+ + \lambda_1 \|f\|_K^2 + \lambda_2 \sum_{j=l+1}^{l+u} \big(1 - y_j f(x_j)\big)_+ $$

which proposes a joint optimization of the SVM objective function over binary-valued labels on the unlabeled data and functions in the RKHS. Here, $\lambda_1, \lambda_2$ are parameters that control the relative hinge loss over the labeled and unlabeled sets. The joint optimization is implemented in [11] by first using an inductive SVM to label the unlabeled data and then iteratively solving SVM quadratic programs, at each step switching labels to improve the objective function. However, this procedure is susceptible to local minima and requires an unknown, possibly large number of label switches before converging. Note that even though TSVMs were inspired by transductive inference, they do provide an out-of-sample extension.

Semi-Supervised SVMs (S3VM) [5]: S3VMs incorporate unlabeled data by including the minimum hinge loss for the two choices of labels for each unlabeled example. This is formulated as a mixed-integer program for linear SVMs in [5] and is found to be intractable for large amounts of unlabeled data. The presentation of the algorithm is restricted to the linear case.

Measure-Based Regularization [9]: The conceptual framework of this work is closest to our approach. The authors consider a gradient-based regularizer that penalizes variations of the function more in high density regions and less in low density regions, leading to the following optimization principle:

$$ f^* = \arg\min_{f} \; \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \int \langle \nabla f(x), \nabla f(x) \rangle \, p(x)\, dx $$

where $p$ is the density of the marginal distribution $P_X$. The authors observe that it is not straightforward to find a kernel for arbitrary densities $p$ whose associated RKHS norm is $\int \langle \nabla f(x), \nabla f(x) \rangle \, p(x)\, dx$. Thus, in the absence of a representer theorem, the authors propose to perform minimization over the linear space generated by the span of a fixed set of basis functions chosen a priori. It is also worth noting that while [9] uses the gradient $\nabla f(x)$ in the ambient space, we use the penalty functional associated with the gradient $\nabla_{\mathcal{M}} f$ over a submanifold. In a situation where the data truly lies on or near a submanifold $\mathcal{M}$, the difference between these two penalizers can be significant, since smoothness in the normal direction to the data manifold is irrelevant to classification or regression. The algorithm in [9] does not demonstrate performance improvements in real world experiments.

Graph Based Approaches (see, e.g., [7, 10, 15, 17, 18, 1]): A variety of graph based methods have been proposed for transductive inference. However, these methods do not provide an out-of-sample extension. In [18], nearest neighbor labeling for test examples is proposed once unlabeled examples have been labeled by transductive learning. In [10], test points are approximately represented as a linear combination of training and unlabeled points in the feature space induced by the kernel. We also note the very recent work [6] on out-of-sample extensions for semi-supervised learning. For Graph Regularization and Label Propagation see [12, 3, 18]. Manifold regularization provides natural out-of-sample extensions to several graph based approaches; these connections are summarized in the tables in Section 3.

Other methods with different paradigms for using unlabeled data include Co-training [8] and Bayesian techniques, e.g., [14].

5 Experiments

We performed experiments on a synthetic dataset and two real world classification problems arising in visual and speech recognition. Comparisons are made with inductive methods (SVM, RLS). We also compare with Transductive SVM (e.g., [11]), based on our survey of related algorithms in Section 4. For all experiments, we constructed adjacency graphs with 6 nearest neighbors. Software and datasets for these experiments are available at http://manifold.cs.uchicago.edu/manifold_regularization. More detailed experimental results are presented in [4].

5.1 Synthetic Data: Two Moons Dataset

The two moons dataset is shown in Figure 2, along with the best decision surfaces across a wide range of parameter settings for SVM, Transductive SVM and Laplacian SVM. The dataset contains 200 examples with only 1 labeled example for each class. Figure 2 demonstrates how TSVM fails to find the optimal solution, whereas the Laplacian SVM decision boundary seems to be intuitively the most satisfying. Figure 3 shows how increasing the intrinsic regularization allows effective use of unlabeled data for constructing classifiers.

[Figure 2: Two Moons Dataset: Best decision surfaces using RBF kernels for SVM, TSVM and Laplacian SVM. Labeled points are shown in color, other points are unlabeled.]

[Figure 3: Two Moons Dataset: Laplacian SVM with increasing intrinsic regularization ($\gamma_A = 0.03125$ with $\gamma_I = 0$, $0.01$, and $1$).]

5.2 Handwritten Digit Recognition

In this set of experiments we applied the Laplacian SVM and Laplacian RLS algorithms to the 45 binary classification problems that arise in pairwise classification of handwritten digits. The first 400 images for each digit in the USPS training set (preprocessed using PCA to 100 dimensions) were taken to form the training set, and 2 of these were randomly labeled. The remaining images formed the test set. Polynomial kernels of degree 3 were used, and $\gamma l = 0.05$ ($C = 10$) was set for the inductive methods following the experiments reported in [13]. For manifold regularization, we chose to split the same weight in the ratio 1 : 9, so that $\gamma_A l = 0.005$ and $\frac{\gamma_I l}{(u+l)^2} = 0.045$. The observations reported in this section hold consistently across a wide choice of parameters.

In Figure 4, we compare the error rates of the Laplacian algorithms, SVM and TSVM at the precision-recall breakeven points in the ROC curves (averaged over 10 random choices of labeled examples) for the 45 binary classification problems. The following comments can be made: (a) Manifold regularization results in significant improvements over inductive classification, for both RLS and SVM, and either compares well with or significantly outperforms TSVM across the 45 classification problems. Note that TSVM solves multiple quadratic programs in the size of the labeled and unlabeled sets, whereas LapSVM solves a single QP in the size of the labeled set, followed by a linear system; this resulted in substantially faster training times for LapSVM in this experiment. (b) Scatter plots of performance on the test and unlabeled data sets confirm that the out-of-sample extension is good for both LapRLS and LapSVM. (c) Finally, we found the Laplacian algorithms to be significantly more stable with respect to the choice of labeled data than the inductive methods and TSVM, as shown in the scatter plot of the standard deviation of error rates in Figure 4. In Figure 5, we plot performance as a function of the number of labeled examples.

[Figure 4: USPS Experiment - Error rates at precision-recall breakeven points for the 45 binary classification problems (RLS vs LapRLS, SVM vs LapSVM, TSVM vs LapSVM), together with scatter plots of the out-of-sample extension (test vs unlabeled error) and of the standard deviation of error rates.]

[Figure 5: USPS Experiment - Mean error rate at precision-recall breakeven points as a function of the number of labeled points, from 2 to 128 (T: Test Set, U: Unlabeled Set).]
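The precision-recall breakeven point used above can be computed, for example, by scoring the test set and taking the top $k$ examples with $k$ equal to the number of true positives, at which point precision and recall coincide (a sketch of ours, not code from the paper):

```python
import numpy as np

def pr_breakeven(scores, labels):
    """Precision (= recall) when the number of predicted positives equals the
    number of true positives; labels are +1 / -1, scores are real-valued f*(x)."""
    k = int(np.sum(labels == 1))
    top_k = np.argsort(-scores)[:k]      # indices of the k highest-scoring examples
    return float(np.mean(labels[top_k] == 1))
```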

5.3 Spoken Letter Recognition

This experiment was performed on the Isolet database of letters of the English alphabet spoken in isolation (available from the UCI machine learning repository). The dataset contains utterances of 150 subjects who spoke the name of each letter of the English alphabet twice. The speakers are grouped into 5 sets of 30 speakers each, referred to as isolet1 through isolet5. For the purposes of this experiment, we chose to train on the first 30 speakers (isolet1), forming a training set of 1560 examples, and to test on isolet5, containing 1559 examples (1 utterance is missing in the database due to poor recording). We considered the task of classifying the first 13 letters of the English alphabet from the last 13. The experimental set-up is meant to simulate a real-world situation: we considered 30 binary classification problems corresponding to 30 splits of the training data where all utterances of one speaker were labeled and all the rest were left unlabeled. The test set is composed of entirely new speakers, forming the separate group isolet5.

We chose to train with RBF kernels of width $\sigma = 10$ (this was the best value among several settings with respect to 5-fold cross-validation error rates for the fully supervised problem using standard SVM). For SVM and RLS we set $\gamma l = 0.05$ ($C = 10$); this was the best value among several settings with respect to mean error rates over the 30 splits. For Laplacian RLS and Laplacian SVM we set $\gamma_A l = \frac{\gamma_I l}{(u+l)^2} = 0.005$.

In Figure 6, we compare these algorithms. The following comments can be made: (a) LapSVM and LapRLS make significant performance improvements over the inductive methods and TSVM, for predictions on unlabeled speakers that come from the same group as the labeled speaker, over all choices of the labeled speaker. (b) On isolet5, which comprises a separate group of speakers, the performance improvements are smaller but consistent over the choice of the labeled speaker. This can be expected since there appears to be a systematic bias that affects all algorithms, in favor of same-group speakers. For further details, see [4].

[Figure 6: Isolet Experiment - Error rates at precision-recall breakeven points of the 30 binary classification problems (RLS vs LapRLS and SVM vs TSVM vs LapSVM), on the unlabeled set and on the test set, as a function of the labeled speaker.]

5.4 Regularized Spectral Clustering and Data Representation

When all training examples are unlabeled, the optimization problem of our framework, expressed in Eqn (6), reduces to the following clustering objective function:

$$ \min_{f \in \mathcal{H}_K} \; \gamma \|f\|_K^2 + \mathbf{f}^\top L \mathbf{f} \qquad (12) $$

where $\gamma$ (the ratio of $\gamma_A$ to $\gamma_I$, up to the $(u+l)^2$ scale factor) is a regularization parameter that controls the complexity of the clustering function. To avoid degenerate solutions we need to impose some additional conditions (cf. [2]). It can be easily seen that a version of the Representer theorem holds, so that the minimizer has the form $f^*(\cdot) = \sum_{i=1}^{u} \alpha_i K(x_i, \cdot)$. By substituting back in Eqn (12), we come up with the following optimization problem:

$$ \bar{\alpha}^* = \arg\min_{\mathbf{1}^\top K \bar{\alpha} = 0,\; \bar{\alpha}^\top K^2 \bar{\alpha} = 1} \; \gamma \, \bar{\alpha}^\top K \bar{\alpha} + \bar{\alpha}^\top K L K \bar{\alpha} $$

where $\mathbf{1}$ is the vector of all ones, $\bar{\alpha} = (\alpha_1, \ldots, \alpha_u)$ and $K$ is the corresponding Gram matrix.

Letting $P$ be the projection onto the subspace of $\mathbb{R}^u$ orthogonal to $K\mathbf{1}$, one obtains the solution for the constrained quadratic problem, which is given by the generalized eigenvalue problem

$$ P (\gamma K + K L K) P v = \lambda \, P K^2 P v. \qquad (13) $$

The final solution is given by $\bar{\alpha} = P v$, where $v$ is the eigenvector corresponding to the smallest eigenvalue.

The framework for clustering sketched above provides a regularized form of spectral clustering, where $\gamma$ controls the smoothness of the resulting function in the ambient space. We also obtain a natural out-of-sample extension for clustering points not in the original data set. Figure 7 shows the results of this method on a two-dimensional clustering problem.

[Figure 7: Two Moons Dataset: Regularized clustering with $(\gamma_A, \gamma_I)$ equal to $(10^{-6}, 1)$, $(10^{-4}, 1)$ and $(0.1, 1)$.]

By taking multiple eigenvectors of the system in Eqn (13) we obtain a natural regularized out-of-sample extension of Laplacian Eigenmaps [2]. This leads to a new method for dimensionality reduction and data representation. Further study of this approach is a direction of future research.
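A compact sketch of Eqn (13) (ours; the helper names and the use of SciPy's generalized symmetric eigensolver are illustrative, and a non-singular Gram matrix $K$ is assumed) solves the problem on the subspace orthogonal to $K\mathbf{1}$ and clusters by the sign of the resulting function:

```python
import numpy as np
from scipy.linalg import eigh, null_space

def regularized_spectral_clustering(K, L, gamma=1e-4):
    """Solve Eqn (13), P(gamma K + K L K) P v = lambda P K^2 P v, and cluster by sign of f = K alpha."""
    u = K.shape[0]
    B = null_space((K @ np.ones(u))[None, :])     # orthonormal basis of {a : 1^T K a = 0}
    A_red = B.T @ (gamma * K + K @ L @ K) @ B     # constrained quadratic form (numerator)
    N_red = B.T @ (K @ K) @ B                     # normalization alpha^T K^2 alpha (needs K nonsingular)
    w, V = eigh(A_red, N_red)                     # generalized eigenproblem, eigenvalues ascending
    alpha = B @ V[:, 0]                           # eigenvector for the smallest eigenvalue
    f = K @ alpha                                 # function values on the data points
    return np.sign(f), alpha
```

Because the solution is a function in $\mathcal{H}_K$, new points can be clustered by evaluating $f(x) = \sum_i \alpha_i K(x_i, x)$, which is the out-of-sample extension discussed above.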

Acknowledgments. We are grateful to Marc Coram, Steve Smale and Peter Bickel for intellectual support and to the NSF for financial support. We would also like to acknowledge the Toyota Technological Institute for its support of this work.

References

[1] M. Belkin, P. Niyogi, Using Manifold Structure for Partially Labeled Classification, NIPS 2002.
[2] M. Belkin, P. Niyogi, Laplacian Eigenmaps for Dimensionality Reduction and Data Representation, Neural Computation, June 2003.
[3] M. Belkin, I. Matveeva, P. Niyogi, Regression and Regularization on Large Graphs, COLT 2004.
[4] M. Belkin, P. Niyogi, V. Sindhwani, Manifold Regularization: A Geometric Framework for Learning From Examples, Technical Report TR-2004-06, Department of Computer Science, University of Chicago. Available at: http://www.cs.uchicago.edu/research/publications/techreports/TR-2004-06
[5] K. Bennett and A. Demiriz, Semi-Supervised Support Vector Machines, NIPS 1998.
[6] Y. Bengio, O. Delalleau and N. Le Roux, Efficient Non-Parametric Function Induction in Semi-Supervised Learning, Technical Report 1247, DIRO, University of Montreal, 2004.
[7] A. Blum, S. Chawla, Learning from Labeled and Unlabeled Data using Graph Mincuts, ICML 2001.
[8] A. Blum, T. Mitchell, Combining Labeled and Unlabeled Data with Co-Training, COLT 1998.
[9] O. Bousquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS 2003.
[10] O. Chapelle, J. Weston and B. Schoelkopf, Cluster Kernels for Semi-Supervised Learning, NIPS 2002.
[11] T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, ICML 1999.
[12] A. Smola and R. Kondor, Kernels and Regularization on Graphs, COLT/KW 2003.
[13] B. Schoelkopf, C. J. C. Burges, V. Vapnik, Extracting Support Data for a Given Task, KDD 1995.
[14] M. Seeger, Learning with Labeled and Unlabeled Data, Technical Report, Edinburgh University, 2000.
[15] M. Szummer, T. Jaakkola, Partially Labeled Classification with Markov Random Walks, NIPS 2001.
[16] V. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998.
[17] D. Zhou, O. Bousquet, T. N. Lal, J. Weston and B. Schoelkopf, Learning with Local and Global Consistency, NIPS 2003.
[18] X. Zhu, J. Lafferty and Z. Ghahramani, Semi-Supervised Learning using Gaussian Fields and Harmonic Functions, ICML 2003.