Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Semi-supervised Classification Using Local and Global Regularization

Fei Wang¹, Tao Li², Gang Wang³, Changshui Zhang¹
¹Department of Automation, Tsinghua University, Beijing, China
²School of Computing and Information Sciences, Florida International University, Miami, FL, USA
³Microsoft China Research, Beijing, China

Abstract

In this paper, we propose a semi-supervised learning (SSL) algorithm based on local and global regularization. In the local regularization part, our algorithm constructs a regularized classifier for each data point using its neighborhood, while the global regularization part adopts a Laplacian regularizer to smooth the data labels predicted by those local classifiers. We show that some existing SSL algorithms can be derived from our framework. Finally we present some experimental results to show the effectiveness of our method.

Introduction

Semi-supervised learning (SSL), which aims at learning from partially labeled data sets, has received considerable interest from the machine learning and data mining communities in recent years (Chapelle et al., 2006b). One reason for the popularity of SSL is that in many real world applications the acquisition of sufficient labeled data is quite expensive and time consuming, while large amounts of unlabeled data are far easier to obtain.

Many SSL methods have been proposed in recent decades (Chapelle et al., 2006b), among which the graph based approaches, such as Gaussian Random Fields (Zhu et al., 2003), Learning with Local and Global Consistency (Zhou et al., 2004) and the graph regularization method of (Belkin et al., 2004), have become one of the hottest research areas in the SSL field. The common denominator of those algorithms is to model the whole data set as an undirected weighted graph, whose vertices correspond to the data points and whose edges reflect the relationships between pairwise data points. In the SSL setting, some of the vertices on the graph are labeled while the remainder are unlabeled, and the goal of graph based SSL is to predict the labels of the unlabeled data points (and even of new testing data which are not in the graph) such that the predicted labels are sufficiently smooth with respect to the data graph.

One common strategy for realizing graph based SSL is to minimize a criterion composed of two parts: the first part is a loss that measures the difference between the predictions and the initial data labels, and the second part is a smoothness penalty measuring the smoothness of the predicted labels over the whole data graph. Most of the past works concentrate on the derivation of different forms of smoothness regularizers, such as the ones using the combinatorial graph Laplacian (Zhu et al., 2003)(Belkin et al., 2006), the normalized graph Laplacian (Zhou et al., 2004), the exponential/iterative graph Laplacian (Belkin et al., 2004), local linear regularization (Wang & Zhang, 2006) and local learning regularization (Wu & Schölkopf, 2007), but they rarely touch the problem of how to derive a more effective loss function.

In this paper, we argue that rather than applying a global loss function which is based on the construction of a global predictor using the whole data set, it is more desirable to measure such loss locally by building local predictors for different regions of the input data space. According to (Vapnik, 1995), it is usually difficult to find a predictor with good predictability over the entire input data space, but it is much easier to find a good predictor restricted to a local region of the input space. Such a divide and conquer scheme has been shown to be much more effective in some real world applications (Bottou & Vapnik, 1992). One problem of this local strategy is that the number of data points in each region is usually too small to train a good predictor; therefore we propose to also apply a global smoother to make the predicted data labels comply better with the intrinsic data distribution.

A Brief Review of Manifold Regularization

Before going into the details of our algorithm, let's first review the basic idea of manifold regularization (Belkin et al., 2006), since it is closely related to this paper.

In semi-supervised learning, we are given a set of data points X = {x_1, ..., x_l, x_{l+1}, ..., x_n}, where X_l = {x_i}_{i=1}^l are labeled and X_u = {x_j}_{j=l+1}^n are unlabeled. Each x_i \in X is drawn from a fixed but usually unknown distribution p(x). Belkin et al. (2006) proposed a general geometric framework for semi-supervised learning called manifold regularization, which seeks an optimal classification function f by minimizing the objective

    \mathcal{J}_g = \sum_{i=1}^{l} \mathcal{L}(y_i, f(x_i, w)) + \gamma_A \|f\|_F^2 + \gamma_I \|f\|_I^2,    (1)

where y_i represents the label of x_i, f(x, w) denotes the classification function f with its parameter w, \|f\|_F penalizes the complexity of f in the functional space F, \|f\|_I reflects the intrinsic geometric information of the marginal distribution p(x), and \gamma_A, \gamma_I are the regularization parameters.
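Before turning to how the \|f\|_I term is realized, it may help to fix the graph quantities that all of the methods above rely on. The following is a minimal sketch in Python/NumPy (our own illustration, not code from the paper): it builds a Gaussian-weighted adjacency matrix W, the degree matrix D, and the combinatorial Laplacian L = D - W, whose quadratic form f^T L f is the smoothness penalty used below (cf. Eq.(2)). The bandwidth `sigma` is an assumed free parameter.

```python
import numpy as np

def gaussian_graph_laplacian(X, sigma=1.0):
    """Gaussian-weighted adjacency W, degree matrix D, and combinatorial Laplacian L = D - W."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(axis=1))
    return W, D, D - W

def smoothness_penalty(f, L):
    """The quadratic smoothness penalty f^T L f over the data graph (cf. Eq. (2))."""
    return float(f @ L @ f)
```

A label assignment that varies slowly across heavily weighted edges yields a small value of smoothness_penalty, which is exactly the behavior the intrinsic regularizer \|f\|_I is meant to encourage.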

The reason why we should penalize the geometrical information of f is that in semi-supervised learning we only have a small portion of labeled data (i.e. l is small), which is not enough to train a good learner by purely minimizing the structural loss of f. Therefore, we need some prior knowledge to guide us toward a good f, and what p(x) reflects is just such type of prior information. Moreover, it is usually assumed (Belkin et al., 2006) that there is a direct relationship between p(x) and p(y|x), i.e. if two points x_1 and x_2 are close in the intrinsic geometry of p(x), then the conditional distributions p(y|x_1) and p(y|x_2) should be similar. In other words, p(y|x) should vary smoothly along the geodesics in the intrinsic geometry of p(x).

Specifically, (Belkin et al., 2006) also showed that \|f\|_I^2 can be approximated by

    \hat{S} = \sum_{i,j} (f(x_i) - f(x_j))^2 W_{ij} = f^T L f,    (2)

where n is the total number of data points, W_{ij} are the edge weights in the data adjacency graph, and f = (f(x_1), \dots, f(x_n))^T. L = D - W \in R^{n \times n} is the graph Laplacian, where W is the graph weight matrix with its (i, j)-th entry W(i, j) = W_{ij}, and D is the diagonal degree matrix with D(i, i) = \sum_j W_{ij}. There has been extensive discussion of the fact that, under certain conditions, choosing Gaussian weights for the adjacency graph leads to convergence of the graph Laplacian to the Laplace-Beltrami operator \Delta_M (or its weighted version) on the manifold M (Belkin & Niyogi, 2005)(Hein et al., 2005).

The Algorithm

In this section we introduce our learning with local and global regularization approach in detail. First let's see the motivation of this work.

Why Local Learning

Although (Belkin et al., 2006) provides an excellent framework for learning from labeled and unlabeled data, the loss \mathcal{J}_g is defined in a global way, i.e. for the whole data set we only need to pursue one classification function f that can minimize \mathcal{J}_g. According to (Vapnik, 1995), selecting a good f in such a global way might not be a good strategy, because the function set f(x, w), w \in W, may not contain a good predictor for the entire input space. However, it is much easier for the set to contain some functions that are capable of producing good predictions on some specified regions of the input space. Therefore, if we split the whole input space into C local regions, it is usually more effective to minimize a separate local cost function for each region.

Nevertheless, there are still some problems with pure local learning algorithms, since there might not be enough data points in each local region for training the local classifiers. Therefore, we propose to also apply a global smoother to smooth the predicted data labels with respect to the intrinsic data manifold, such that the predicted data labels become more reasonable and accurate.

The Construction of Local Classifiers

In this subsection, we introduce how to construct the local classifiers. Specifically, in our method, we split the whole input data space into n overlapping regions {R_i}_{i=1}^n, such that R_i is just the k-nearest neighborhood of x_i. We further construct a classification function g_i for region R_i, which, for simplicity, is assumed to be linear. Then g_i predicts the label of x by

    g_i(x) = w_i^T (x - x_i) + b_i,    (3)

where w_i and b_i are the weight vector and bias term of g_i. (Since there are only a few data points in each neighborhood, the structural penalty term \|w_i\| would pull the weight vector w_i toward some arbitrary origin; for isotropy reasons, we translate the origin of the input space to the neighborhood medoid x_i by subtracting x_i from the training points x_j \in R_i.)

A general approach for getting the optimal parameter set {(w_i, b_i)}_{i=1}^n is to minimize the following structural loss:

    \hat{\mathcal{J}}_l = \sum_{i=1}^{n} \left[ \sum_{x_j \in R_i} (w_i^T (x_j - x_i) + b_i - y_j)^2 + \gamma_A \|w_i\|^2 \right].

However, in the semi-supervised learning scenario we only have a few labeled points, i.e., we do not know the corresponding y_j for most of the points. To alleviate this problem, we associate each y_i with a "hidden label" f_i, such that y_i is directly determined by f_i. Then we can minimize the following loss function instead to get the optimal parameters:

    \mathcal{J}_l = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda \hat{\mathcal{J}}_l
                  = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda \sum_{i=1}^{n} \left[ \sum_{x_j \in R_i} (w_i^T (x_j - x_i) + b_i - f_j)^2 + \gamma_A \|w_i\|^2 \right].    (4)

Let

    \mathcal{J}_l^i = \sum_{x_j \in R_i} (w_i^T (x_j - x_i) + b_i - f_j)^2 + \gamma_A \|w_i\|^2,

which can be rewritten in matrix form as

    \tilde{\mathcal{J}}_l^i = \left\| G_i \begin{bmatrix} w_i \\ b_i \end{bmatrix} - \tilde{f}_i \right\|^2,

where

    G_i = \begin{bmatrix} x_{i_1}^T - x_i^T & 1 \\ x_{i_2}^T - x_i^T & 1 \\ \vdots & \vdots \\ x_{i_{n_i}}^T - x_i^T & 1 \\ \sqrt{\gamma_A} I_d & \mathbf{0} \end{bmatrix}, \qquad \tilde{f}_i = \begin{bmatrix} f_{i_1} \\ f_{i_2} \\ \vdots \\ f_{i_{n_i}} \\ \mathbf{0} \end{bmatrix},

where x_{i_j} represents the j-th neighbor of x_i, n_i is the cardinality of R_i, \mathbf{0} is a d \times 1 zero vector, and d is the dimensionality of the data vectors. By setting \partial \mathcal{J}_l^i / \partial (w_i, b_i) = 0, we can get

    \begin{bmatrix} w_i^* \\ b_i^* \end{bmatrix} = (G_i^T G_i)^{-1} G_i^T \tilde{f}_i.    (5)

Then the total loss we want to minimize becomes

    \hat{\mathcal{J}}_l = \sum_i \mathcal{J}_l^i = \sum_i \tilde{f}_i^T \tilde{G}_i^T \tilde{G}_i \tilde{f}_i,    (6)

where \tilde{G}_i = I - G_i (G_i^T G_i)^{-1} G_i^T \in R^{(n_i+d) \times (n_i+d)}. Note that \tilde{G}_i is a symmetric and idempotent projection matrix, so \tilde{G}_i^T \tilde{G}_i = \tilde{G}_i. If we partition \tilde{G}_i into four blocks as

    \tilde{G}_i = \begin{bmatrix} A_i^{n_i \times n_i} & B_i^{n_i \times d} \\ C_i^{d \times n_i} & D_i^{d \times d} \end{bmatrix}

and let f_i = [f_{i_1}, f_{i_2}, \dots, f_{i_{n_i}}]^T, then

    \tilde{f}_i^T \tilde{G}_i \tilde{f}_i = [f_i^T \; \mathbf{0}^T] \begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix} \begin{bmatrix} f_i \\ \mathbf{0} \end{bmatrix} = f_i^T A_i f_i.

Thus

    \hat{\mathcal{J}}_l = \sum_i f_i^T A_i f_i.    (7)

Furthermore, we have the following theorem.

Theorem 1.

    A_i = I_{n_i} - X_i^T H_i^{-1} X_i - \frac{X_i^T H_i^{-1} X_i \mathbf{1}\mathbf{1}^T X_i^T H_i^{-1} X_i}{n_i - c} + \frac{X_i^T H_i^{-1} X_i \mathbf{1}\mathbf{1}^T}{n_i - c} + \frac{\mathbf{1}\mathbf{1}^T X_i^T H_i^{-1} X_i}{n_i - c} - \frac{\mathbf{1}\mathbf{1}^T}{n_i - c},

where X_i = [x_{i_1} - x_i, \dots, x_{i_{n_i}} - x_i] is the d \times n_i matrix of translated neighbors, H_i = X_i X_i^T + \gamma_A I_d, c = \mathbf{1}^T X_i^T H_i^{-1} X_i \mathbf{1}, \mathbf{1} \in R^{n_i \times 1} is the all-one vector, and A_i \mathbf{1} = 0.

Proof. See the supplemental material.

Then we can define the label vector f = [f_1, f_2, \dots, f_n]^T \in R^{n \times 1}, the concatenated label vector \hat{f} = [f_1^T, f_2^T, \dots, f_n^T]^T, and the concatenated block-diagonal matrix

    \hat{G} = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_n \end{bmatrix},

which is of size \sum_i n_i \times \sum_i n_i. Then from Eq.(7) we can derive that \hat{\mathcal{J}}_l = \hat{f}^T \hat{G} \hat{f}. Define the selection matrix S \in \{0, 1\}^{\sum_i n_i \times n}, a 0-1 matrix with exactly one 1 in each row, such that \hat{f} = S f. Then \hat{\mathcal{J}}_l = f^T S^T \hat{G} S f. Let

    M = S^T \hat{G} S \in R^{n \times n},    (8)

which is a square matrix; then we can rewrite \hat{\mathcal{J}}_l as

    \hat{\mathcal{J}}_l = f^T M f.    (9)

SSL with Local & Global Regularizations

As stated in the Why Local Learning subsection, we also need to apply a global smoother to smooth the predicted hidden labels {f_i}. Here we apply the same smoothness regularizer as in Eq.(2), so the predicted labels can be obtained by minimizing

    \mathcal{J} = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda f^T M f + \frac{\gamma_I}{n^2} f^T L f.    (10)

By setting \partial \mathcal{J} / \partial f = 0 we can get

    f = \left( J + \lambda M + \frac{\gamma_I}{n^2} L \right)^{-1} J y,    (11)

where J is a diagonal matrix with its (i, i)-th entry

    J(i, i) = 1 if x_i is labeled, and 0 otherwise,    (12)

and y is an n \times 1 column vector with its i-th entry y(i) = y_i if x_i is labeled, and 0 otherwise.

Induction

To predict the label of an unseen testing data point which has not appeared in X, we propose a three-step approach:

Step 1. Solve the optimal label vector f^* using LGReg, i.e. Eq.(11).
Step 2. Solve the parameters {w_i^*, b_i^*} of the optimal local classification functions using Eq.(5).
Step 3. For a new testing point x, first identify the local region that x falls in (e.g. by computing the Euclidean distances between x and the region medoids and selecting the nearest one), then apply the local prediction function of the corresponding region to predict its label.

Discussions

In this section, we discuss the relationships between the proposed framework and some existing related approaches, and present another, mixed-regularization view of the algorithm introduced in the previous section.

Relationship with Related Approaches

There have already been several semi-supervised learning algorithms based on different regularizations. In this subsection, we discuss the relationships between our algorithm and those existing approaches.

Relationship with Gaussian-Laplacian Regularized Approaches. Most traditional graph based SSL algorithms (e.g. (Belkin et al., 2004; Zhou et al., 2004; Zhu et al., 2003)) are based on the following framework:

    f = \arg\min_f \sum_{i=1}^{l} (f_i - y_i)^2 + \zeta f^T L f,    (13)

where f = [f_1, f_2, \dots, f_l, \dots, f_n]^T and L is the graph Laplacian constructed by Gaussian functions. Clearly, the above framework is just a special case of our algorithm obtained by setting \lambda = 0 and \gamma_I = n^2 \zeta in Eq.(10).

Relationship with Local Learning Regularized Approaches. Recently, Wu & Schölkopf (2007) proposed a novel transduction method based on local learning, which aims to solve the following optimization problem:

    f = \arg\min_f \sum_{i=1}^{l} (f_i - y_i)^2 + \zeta \sum_{i=1}^{n} \|f_i - o_i\|^2,    (14)

where o_i is the label of x_i predicted by the local classifier constructed on the neighborhood of x_i, and the parameters of the local classifier can be represented by f via minimizing local structural loss functions as in Eq.(5).
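To make the construction above concrete, the following minimal sketch in Python/NumPy (our own illustration, not the authors' implementation) assembles the local blocks A_i of Eq.(7), accumulates M = S^T \hat{G} S of Eq.(8), and evaluates the closed-form solution of Eq.(11). The helper `knn_indices`, the Gaussian bandwidth `sigma`, and the parameter names `gamma_A`, `gamma_I` and `lam` are assumptions made for illustration.

```python
import numpy as np

def knn_indices(X, k):
    # Indices of the k nearest neighbors of each point (excluding the point itself).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def lgreg_fit(X, y, labeled, k=10, gamma_A=1.0, gamma_I=1.0, lam=1.0, sigma=1.0):
    """Sketch of LGReg: local blocks A_i (Eq. 7), M = S^T Ghat S (Eq. 8),
    and the closed-form label vector of Eq. (11)."""
    n, d = X.shape
    nbrs = knn_indices(X, k)

    # Global Gaussian-weight graph Laplacian L = D - W (Eq. 2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W

    # M accumulates one A_i per region, scattered onto that region's neighbor indices.
    M = np.zeros((n, n))
    for i in range(n):
        idx = nbrs[i]
        Xi = X[idx] - X[i]                          # n_i x d, origin moved to the medoid x_i
        Gi = np.vstack([np.hstack([Xi, np.ones((k, 1))]),
                        np.hstack([np.sqrt(gamma_A) * np.eye(d), np.zeros((d, 1))])])
        P = Gi @ np.linalg.solve(Gi.T @ Gi, Gi.T)   # hat matrix of the local ridge fit (Eq. 5)
        Gtilde = np.eye(k + d) - P
        Ai = Gtilde[:k, :k]                         # top-left block, as in Eq. (7)
        M[np.ix_(idx, idx)] += Ai

    # Closed-form solution of Eq. (11): f = (J + lam*M + gamma_I/n^2 * L)^{-1} J y.
    J = np.diag(labeled.astype(float))
    yvec = np.where(labeled, y, 0.0)
    return np.linalg.solve(J + lam * M + (gamma_I / n ** 2) * L, J @ yvec)
```

For multi-class data sets such as COIL, one would typically apply the same solve to a one-column-per-class label indicator matrix; the sketch keeps a single label vector for brevity.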

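Continuing the sketch above, the three-step induction procedure of the Induction subsection can be written as follows. The solved label vector `f_star` and the neighborhood index array `nbrs` are assumed to come from the previous block, and all names remain illustrative rather than taken from the paper.

```python
import numpy as np

def local_classifiers(X, f_star, nbrs, gamma_A=1.0):
    """Step 2: recover (w_i, b_i) for every region from Eq. (5), given the solved labels f*."""
    n, d = X.shape
    k = nbrs.shape[1]
    params = []
    for i in range(n):
        idx = nbrs[i]
        Gi = np.vstack([np.hstack([X[idx] - X[i], np.ones((k, 1))]),
                        np.hstack([np.sqrt(gamma_A) * np.eye(d), np.zeros((d, 1))])])
        ftil = np.concatenate([f_star[idx], np.zeros(d)])
        wb = np.linalg.solve(Gi.T @ Gi, Gi.T @ ftil)
        params.append((wb[:d], wb[d]))
    return params

def predict_new(x, X, params):
    """Step 3: pick the region whose medoid is nearest to x and apply its local classifier."""
    i = int(np.argmin(((X - x) ** 2).sum(axis=1)))
    w, b = params[i]
    return float(w @ (x - X[i]) + b)
```

Step 3 here uses the single nearest medoid; the paper's description also allows applying the local classifiers of several nearby regions.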
The local learning regularization approach of Eq.(14) can be understood as a two-step approach for optimizing Eq.(10) with \gamma_I = 0: in the first step, it optimizes the classifier parameters by minimizing the local structural loss (Eq.(4)); in the second step, it minimizes the prediction loss of each data point incurred by the local classifier constructed on its neighborhood.

A Mixed-Regularization Viewpoint

As stated earlier, our algorithm aims to minimize

    \mathcal{J} = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda f^T M f + \frac{\gamma_I}{n^2} f^T L f,    (15)

where M is defined in Eq.(8) and L is the conventional graph Laplacian constructed by Gaussian functions. It is easy to prove that M has the following property.

Theorem 2. M 1 = 0, where 1 \in R^{n \times 1} is a column vector with all its elements equal to 1.

Proof. From the definition of M (Eq.(8)), we have M 1 = S^T \hat{G} S 1 = S^T \hat{G} 1 = 0.

Therefore, M can also be viewed as a Laplacian matrix. That is, the last two terms of Eq.(15) can both be viewed as regularization terms with different Laplacians: one is derived from local learning, the other from the heat kernel. Hence our algorithm can also be understood from a mixed regularization viewpoint (Chapelle et al., 2006a)(Zhu & Goldberg, 2007). Just like multiview learning algorithms, which train the same type of classifier using different data features, our method trains different classifiers using the same data features. Different types of Laplacians may better reveal different (maybe complementary) information and thus provide a more powerful classifier.

Experiments

In this section, we present a set of experiments to show the effectiveness of our method. First let's describe the basic information of the data sets.

The Data Sets

We adopt 12 data sets in our experiments, including two artificial data sets g241c and g241n, three image data sets USPS, COIL and digit1, one BCI data set, four text data sets cornell, texas, wisconsin and washington from the WebKB data set, and two UCI data sets diabetes and ionosphere. The first six data sets can be downloaded from http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html, the WebKB data sets from http://www.cs.cmu.edu/~WebKB/, and the UCI data sets from http://www.ics.uci.edu/mlearn/MLRepository.html. Table 1 summarizes the characteristics of the datasets.

Table 1: Descriptions of the datasets

Datasets     Sizes   Classes   Dimensions
g241c        1500    2         241
g241n        1500    2         241
USPS         1500    2         241
COIL         1500    6         241
digit1       1500    2         241
cornell      827     7         4134
texas        814     7         4029
wisconsin    1166    7         4189
washington   1210    7         4165
BCI          400     2         117
diabetes     768     2         8
ionosphere   351     2         34

Methods & Parameter Settings

Besides our method, we also implemented some other competing methods for experimental comparison. For all the methods, the hyperparameters were set by 5-fold cross validation over the grids introduced in the following.

• Local and Global Regularization (LGReg). In the implementation the neighborhood size is searched from {5, 10, 50}, \gamma_A and \lambda are searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}, and we set \lambda + \gamma_I / n^2 = 1. The width of the Gaussian similarity when constructing the graph is set by the method in (Zhu et al., 2003).

• Local Learning Regularization (LLReg). The implementation of this algorithm is the same as in (Wu & Schölkopf, 2007), in which we also adopt the mutual neighborhood with its size searched from {5, 10, 50}. The regularization parameter of the local classifier and the tradeoff parameter between the loss and the local regularization term are searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}.

• Laplacian Regularized Least Squares (LapRLS). The implementation code is downloaded from http://manifold.cs.uchicago.edu/manifold_regularization/software.html, in which the width of the Gaussian similarity is also set by the method in (Zhu et al., 2003), and the extrinsic and intrinsic regularization parameters are searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}. We adopt the linear kernel since our algorithm is locally linear.

• Learning with Local and Global Consistency (LLGC). The implementation of the algorithm is the same as in (Zhou et al., 2004), in which the width of the Gaussian similarity is also set by the method in (Zhu et al., 2003), and the regularization parameter is searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}.

• Gaussian Random Fields (GRF). The implementation of the algorithm is the same as in (Zhu et al., 2003).

• Support Vector Machine (SVM). We use libSVM (Fan et al., 2005) to implement the SVM algorithm with a linear kernel, and the cost parameter is searched from {10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4}.
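As an illustration of the model selection protocol described above, here is a sketch of a 5-fold cross-validated grid search over the labeled points (plain NumPy; the callable `fit_predict`, for instance the `lgreg_fit` sketch given earlier, and the grid values are assumptions for illustration, and binary labels in {-1, +1} are assumed).

```python
import numpy as np
from itertools import product

def cv_grid_search(fit_predict, X, y, labeled_idx, grid, n_folds=5, seed=0):
    """5-fold CV over the labeled points only: each fold's labels are hidden,
    the model is refit semi-supervised, and accuracy is measured on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(labeled_idx), n_folds)
    best_params, best_score = None, -np.inf
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        accs = []
        for fold in folds:
            train_mask = np.zeros(len(X), dtype=bool)
            train_mask[labeled_idx] = True
            train_mask[fold] = False                  # hide this fold's labels
            pred = fit_predict(X, y, train_mask, **params)
            accs.append(np.mean(np.sign(pred[fold]) == y[fold]))
        score = float(np.mean(accs))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Example grid mirroring the settings quoted above (illustrative only):
# grid = {"k": [5, 10, 50],
#         "gamma_A": [4.0 ** p for p in range(-3, 4)],
#         "lam": [4.0 ** p for p in range(-3, 4)]}
```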

[Figure 1 appears here: twelve panels, (a) g241c, (b) g241n, (c) USPS, (d) COIL, (e) digit1, (f) cornell, (g) texas, (h) wisconsin, (i) washington, (j) BCI, (k) diabetes, (l) ionosphere, each plotting the average classification accuracy of LGReg, LLReg, LapRLS, LLGC, GRF and SVM against the percentage of randomly labeled points.]

Figure 1: Experimental results of different algorithms.

Table 2: Experimental results with 10% of the data points randomly labeled (mean classification accuracy ± standard deviation, %)

Datasets     SVM              GRF              LLGC             LLReg            LapRLS           LGReg
g241c        75.64 ± 1.1383   56.34 ± 2.1665   77.13 ± 2.5871   65.31 ± 2.1220   80.44 ± 1.0746   72.29 ± 0.1347
g241n        75.01 ± 1.7155   55.06 ± 1.9519   49.75 ± 0.2570   73.25 ± 0.2466   76.89 ± 1.1350   73.20 ± 0.5983
USPS         88.32 ± 1.1087   94.87 ± 1.7490   96.19 ± 0.7588   95.79 ± 0.6804   88.80 ± 1.0087   99.21 ± 1.1290
COIL         78.59 ± 1.9936   91.23 ± 1.8321   92.04 ± 1.9170   86.86 ± 2.2190   73.35 ± 1.8921   89.61 ± 1.2197
digit1       92.80 ± 1.4818   96.95 ± 0.9601   95.49 ± 0.5638   97.64 ± 0.6636   92.79 ± 1.0960   97.10 ± 1.0982
cornell      70.26 ± 0.4807   71.43 ± 0.8564   76.30 ± 2.5865   79.46 ± 1.6336   80.59 ± 1.6665   81.39 ± 0.8968
texas        69.06 ± 0.5612   70.03 ± 0.8371   75.93 ± 3.6708   79.44 ± 1.7638   78.15 ± 1.5667   80.75 ± 1.2513
wisconsin    74.01 ± 0.3988   74.65 ± 0.4979   80.57 ± 1.9062   83.62 ± 1.5191   84.21 ± 0.9656   84.05 ± 0.5421
washington   69.54 ± 0.4603   78.26 ± 0.4053   80.23 ± 1.3997   86.37 ± 1.5516   86.58 ± 1.4985   88.01 ± 1.1369
BCI          59.77 ± 4.1279   50.49 ± 1.9392   53.07 ± 2.9037   51.56 ± 2.8277   61.84 ± 2.8177   65.31 ± 2.5354
diabetes     72.63 ± 1.5924   70.69 ± 2.6321   67.15 ± 1.9766   68.38 ± 2.1772   64.95 ± 1.1024   72.36 ± 1.3223
ionosphere   75.52 ± 1.2622   70.21 ± 2.2778   67.31 ± 2.6155   68.15 ± 2.3018   65.17 ± 0.6628   84.05 ± 0.5421

Experimental Results

The experimental results are shown in Figure 1. In all the figures, the x-axis represents the percentage of randomly labeled points, and the y-axis is the average classification accuracy over 50 independent runs. From the figures we can observe the following.

• The LapRLS algorithm works very well on the toy and text data sets, but not very well on the image and UCI data sets.
• The LLGC and GRF algorithms work well on the image data sets, but not very well on the other data sets.
• The LLReg algorithm works well on the image and text data sets, but not very well on the BCI and toy data sets.
• SVM works well when the data sets are not well structured, e.g. the toy, UCI and BCI data sets.
• LGReg works very well on almost all the data sets, except for the toy data sets.

To better illustrate the experimental results, we also provide the numerical results of those algorithms on all the data sets with 10% of the points randomly labeled. The values in Table 2 are the mean classification accuracies and standard deviations over 50 independent runs, from which we can also see the superiority of the LGReg algorithm.

Conclusions

In this paper we proposed a general learning framework based on local and global regularization. We showed that many existing learning algorithms can be derived from our framework. Finally, experiments were conducted to demonstrate the effectiveness of our method.

References

Belkin, M., Matveeva, I., and Niyogi, P. (2004). Regularization and Semi-supervised Learning on Large Graphs. In COLT 17.
Belkin, M., and Niyogi, P. (2005). Towards a Theoretical Foundation for Laplacian-Based Manifold Methods. In COLT 18.
Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research 7(Nov): 2399-2434.
Bottou, L. and Vapnik, V. (1992). Local Learning Algorithms. Neural Computation, 4:888-900.
Chapelle, O., Chi, M. and Zien, A. (2006a). A Continuation Method for Semi-Supervised SVMs. ICML 23, 185-192.
Chapelle, O., Schölkopf, B. and Zien, A. (2006b). Semi-Supervised Learning. MIT Press, Cambridge, MA.
Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training SVM. Journal of Machine Learning Research 6.
Lal, T. N., Schröder, M., Hinterberger, T., Weston, J., Bogdan, M., Birbaumer, N., and Schölkopf, B. (2004). Support Vector Channel Selection in BCI. IEEE Transactions on Biomedical Engineering, 51(6).
Golub, G. H. and Van Loan, C. F. (1983). Matrix Computations. Johns Hopkins University Press, Baltimore.
Hein, M., Audibert, J. Y., and von Luxburg, U. (2005). From Graphs to Manifolds - Weak and Strong Pointwise Consistency of Graph Laplacians. In COLT 18, 470-485.
Schölkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, MA.
Smola, A. J., Bartlett, P. L., Schölkopf, B., and Schuurmans, D. (2000). Advances in Large Margin Classifiers. The MIT Press.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, Berlin.
Wang, F. and Zhang, C. (2006). Label Propagation Through Linear Neighborhoods. ICML 23.
Wu, M. and Schölkopf, B. (2007). Transductive Classification via Local Learning Regularization. AISTATS 11.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with Local and Global Consistency. In NIPS 16.
Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In ICML 20.
Zhu, X. and Goldberg, A. (2007). Kernel Regression with Order Preferences. In AAAI.
