Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Semi-supervised Classification Using Local and Global Regularization

Fei Wang¹, Tao Li², Gang Wang³, Changshui Zhang¹
¹Department of Automation, Tsinghua University, Beijing, China
²School of Computing and Information Sciences, Florida International University, Miami, FL, USA
³Microsoft China Research, Beijing, China

Abstract

In this paper, we propose a semi-supervised learning (SSL) algorithm based on local and global regularization. In the local regularization part, our algorithm constructs a regularized classifier for each data point using its neighborhood, while the global regularization part adopts a Laplacian regularizer to smooth the data labels predicted by those local classifiers. We show that some existing SSL algorithms can be derived from our framework. Finally we present some experimental results to show the effectiveness of our method.

Introduction

Semi-supervised learning (SSL), which aims at learning from partially labeled data sets, has received considerable interest from the machine learning and data mining communities in recent years (Chapelle et al., 2006b). One reason for the popularity of SSL is that in many real world applications the acquisition of sufficient labeled data is quite expensive and time consuming, while large amounts of unlabeled data are far easier to obtain.

Many SSL methods have been proposed in recent decades (Chapelle et al., 2006b), among which the graph based approaches, such as Gaussian Random Fields (Zhu et al., 2003), Learning with Local and Global Consistency (Zhou et al., 2004) and the graph regularization method of (Belkin et al., 2004), have become one of the hottest research areas in the SSL field. The common denominator of those algorithms is to model the whole data set as an undirected weighted graph, whose vertices correspond to the data points and whose edges reflect the relationships between pairwise data points. In the SSL setting, some of the vertices on the graph are labeled while the remainder are unlabeled, and the goal of graph based SSL is to predict the labels of the unlabeled data points (and even of new testing data which are not in the graph) such that the predicted labels are sufficiently smooth with respect to the data graph.

One common strategy for realizing graph based SSL is to minimize a criterion composed of two parts: the first part is a loss that measures the difference between the predictions and the initial data labels, and the second part is a smoothness penalty measuring the smoothness of the predicted labels over the whole data graph. Most of the past works concentrate on the derivation of different forms of smoothness regularizers, such as the ones using the combinatorial graph Laplacian (Zhu et al., 2003)(Belkin et al., 2006), the normalized graph Laplacian (Zhou et al., 2004), the exponential/iterative graph Laplacian (Belkin et al., 2004), local linear regularization (Wang & Zhang, 2006) and local learning regularization (Wu & Schölkopf, 2007), but they rarely touch the problem of how to derive a more effective loss function.

In this paper, we argue that rather than applying a global loss function which is based on the construction of a global predictor using the whole data set, it is more desirable to measure such loss locally by building local predictors for different regions of the input data space. According to (Vapnik, 1995), it is usually difficult to find a predictor with good predictability over the entire input data space, but it is much easier to find a good predictor restricted to a local region of the input space. Such a divide and conquer scheme has been shown to be much more effective in some real world applications (Bottou & Vapnik, 1992). One problem of this local strategy is that the number of data points in each region is usually too small to train a good predictor; therefore we propose to also apply a global smoother to make the predicted data labels comply better with the intrinsic data distribution.

A Brief Review of Manifold Regularization

Before going into the details of our algorithm, let's first review the basic idea of manifold regularization (Belkin et al., 2006), since it is closely related to this paper.

In semi-supervised learning, we are given a set of data points X = {x_1, ..., x_l, x_{l+1}, ..., x_n}, where X_l = {x_i}_{i=1}^l are labeled and X_u = {x_j}_{j=l+1}^n are unlabeled. Each x_i \in X is drawn from a fixed but usually unknown distribution p(x). Belkin et al. (2006) proposed a general geometric framework for semi-supervised learning called manifold regularization, which seeks an optimal classification function f by minimizing the objective

    \mathcal{J}_g = \sum_{i=1}^{l} \mathcal{L}(y_i, f(x_i, w)) + \gamma_A \|f\|_F^2 + \gamma_I \|f\|_I^2,    (1)

where y_i represents the label of x_i, f(x, w) denotes the classification function f with its parameter w, \|f\|_F penalizes the complexity of f in the functional space F, \|f\|_I reflects the intrinsic geometric information of the marginal distribution p(x), and \gamma_A, \gamma_I are the regularization parameters.
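Before turning to how the \|f\|_I term is realized, it may help to fix the graph quantities that all of the methods above rely on. The following is a minimal sketch in Python/NumPy (our own illustration, not code from the paper): it builds a Gaussian-weighted adjacency matrix W, the degree matrix D, and the combinatorial Laplacian L = D - W, whose quadratic form f^T L f is the smoothness penalty used below (cf. Eq.(2)). The bandwidth `sigma` is an assumed free parameter.

```python
import numpy as np

def gaussian_graph_laplacian(X, sigma=1.0):
    """Gaussian-weighted adjacency W, degree matrix D, and combinatorial Laplacian L = D - W."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                              # no self-loops
    D = np.diag(W.sum(axis=1))
    return W, D, D - W

def smoothness_penalty(f, L):
    """The quadratic smoothness penalty f^T L f over the data graph (cf. Eq. (2))."""
    return float(f @ L @ f)
```

A label assignment that varies slowly across heavily weighted edges yields a small value of smoothness_penalty, which is exactly the behavior the intrinsic regularizer \|f\|_I is meant to encourage.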

The reason why we should penalize the geometrical information of f is that in semi-supervised learning we only have a small portion of labeled data (i.e. l is small), which is not enough to train a good learner by purely minimizing the structural loss of f. Therefore, we need some prior knowledge to guide us toward a good f, and what p(x) reflects is just such type of prior information. Moreover, it is usually assumed (Belkin et al., 2006) that there is a direct relationship between p(x) and p(y|x), i.e. if two points x_1 and x_2 are close in the intrinsic geometry of p(x), then the conditional distributions p(y|x_1) and p(y|x_2) should be similar. In other words, p(y|x) should vary smoothly along the geodesics in the intrinsic geometry of p(x).

Specifically, (Belkin et al., 2006) also showed that \|f\|_I^2 can be approximated by

    \hat{S} = \sum_{i,j} (f(x_i) - f(x_j))^2 W_{ij} = f^T L f,    (2)

where n is the total number of data points, W_{ij} are the edge weights in the data adjacency graph, and f = (f(x_1), \dots, f(x_n))^T. L = D - W \in R^{n \times n} is the graph Laplacian, where W is the graph weight matrix with its (i, j)-th entry W(i, j) = W_{ij}, and D is the diagonal degree matrix with D(i, i) = \sum_j W_{ij}. There has been extensive discussion of the fact that, under certain conditions, choosing Gaussian weights for the adjacency graph leads to convergence of the graph Laplacian to the Laplace-Beltrami operator \Delta_M (or its weighted version) on the manifold M (Belkin & Niyogi, 2005)(Hein et al., 2005).

The Algorithm

In this section we introduce our learning with local and global regularization approach in detail. First let's see the motivation of this work.

Why Local Learning

Although (Belkin et al., 2006) provides an excellent framework for learning from labeled and unlabeled data, the loss \mathcal{J}_g is defined in a global way, i.e. for the whole data set we only need to pursue one classification function f that can minimize \mathcal{J}_g. According to (Vapnik, 1995), selecting a good f in such a global way might not be a good strategy, because the function set f(x, w), w \in W, may not contain a good predictor for the entire input space. However, it is much easier for the set to contain some functions that are capable of producing good predictions on some specified regions of the input space. Therefore, if we split the whole input space into C local regions, it is usually more effective to minimize a separate local cost function for each region.

Nevertheless, there are still some problems with pure local learning algorithms, since there might not be enough data points in each local region for training the local classifiers. Therefore, we propose to also apply a global smoother to smooth the predicted data labels with respect to the intrinsic data manifold, such that the predicted data labels become more reasonable and accurate.

The Construction of Local Classifiers

In this subsection, we introduce how to construct the local classifiers. Specifically, in our method, we split the whole input data space into n overlapping regions {R_i}_{i=1}^n, such that R_i is just the k-nearest neighborhood of x_i. We further construct a classification function g_i for region R_i, which, for simplicity, is assumed to be linear. Then g_i predicts the label of x by

    g_i(x) = w_i^T (x - x_i) + b_i,    (3)

where w_i and b_i are the weight vector and bias term of g_i. (Since there are only a few data points in each neighborhood, the structural penalty term \|w_i\| would pull the weight vector w_i toward some arbitrary origin; for isotropy reasons, we translate the origin of the input space to the neighborhood medoid x_i by subtracting x_i from the training points x_j \in R_i.)

A general approach for getting the optimal parameter set {(w_i, b_i)}_{i=1}^n is to minimize the following structural loss:

    \hat{\mathcal{J}}_l = \sum_{i=1}^{n} \left[ \sum_{x_j \in R_i} (w_i^T (x_j - x_i) + b_i - y_j)^2 + \gamma_A \|w_i\|^2 \right].

However, in the semi-supervised learning scenario we only have a few labeled points, i.e., we do not know the corresponding y_j for most of the points. To alleviate this problem, we associate each y_i with a "hidden label" f_i, such that y_i is directly determined by f_i. Then we can minimize the following loss function instead to get the optimal parameters:

    \mathcal{J}_l = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda \hat{\mathcal{J}}_l
                  = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda \sum_{i=1}^{n} \left[ \sum_{x_j \in R_i} (w_i^T (x_j - x_i) + b_i - f_j)^2 + \gamma_A \|w_i\|^2 \right].    (4)

Let

    \mathcal{J}_l^i = \sum_{x_j \in R_i} (w_i^T (x_j - x_i) + b_i - f_j)^2 + \gamma_A \|w_i\|^2,

which can be rewritten in matrix form as

    \tilde{\mathcal{J}}_l^i = \left\| G_i \begin{bmatrix} w_i \\ b_i \end{bmatrix} - \tilde{f}_i \right\|^2,

where

    G_i = \begin{bmatrix} x_{i_1}^T - x_i^T & 1 \\ x_{i_2}^T - x_i^T & 1 \\ \vdots & \vdots \\ x_{i_{n_i}}^T - x_i^T & 1 \\ \sqrt{\gamma_A} I_d & \mathbf{0} \end{bmatrix}, \qquad \tilde{f}_i = \begin{bmatrix} f_{i_1} \\ f_{i_2} \\ \vdots \\ f_{i_{n_i}} \\ \mathbf{0} \end{bmatrix},

where x_{i_j} represents the j-th neighbor of x_i, n_i is the cardinality of R_i, \mathbf{0} is a d \times 1 zero vector, and d is the dimensionality of the data vectors. By setting \partial \mathcal{J}_l^i / \partial (w_i, b_i) = 0, we can get

    \begin{bmatrix} w_i^* \\ b_i^* \end{bmatrix} = (G_i^T G_i)^{-1} G_i^T \tilde{f}_i.    (5)

Then the total loss we want to minimize becomes

    \hat{\mathcal{J}}_l = \sum_i \mathcal{J}_l^i = \sum_i \tilde{f}_i^T \tilde{G}_i^T \tilde{G}_i \tilde{f}_i,    (6)

where \tilde{G}_i = I - G_i (G_i^T G_i)^{-1} G_i^T \in R^{(n_i+d) \times (n_i+d)}. Note that \tilde{G}_i is a symmetric and idempotent projection matrix, so \tilde{G}_i^T \tilde{G}_i = \tilde{G}_i. If we partition \tilde{G}_i into four blocks as

    \tilde{G}_i = \begin{bmatrix} A_i^{n_i \times n_i} & B_i^{n_i \times d} \\ C_i^{d \times n_i} & D_i^{d \times d} \end{bmatrix}

and let f_i = [f_{i_1}, f_{i_2}, \dots, f_{i_{n_i}}]^T, then

    \tilde{f}_i^T \tilde{G}_i \tilde{f}_i = [f_i^T \; \mathbf{0}^T] \begin{bmatrix} A_i & B_i \\ C_i & D_i \end{bmatrix} \begin{bmatrix} f_i \\ \mathbf{0} \end{bmatrix} = f_i^T A_i f_i.

Thus

    \hat{\mathcal{J}}_l = \sum_i f_i^T A_i f_i.    (7)

Furthermore, we have the following theorem.

Theorem 1.

    A_i = I_{n_i} - X_i^T H_i^{-1} X_i - \frac{X_i^T H_i^{-1} X_i \mathbf{1}\mathbf{1}^T X_i^T H_i^{-1} X_i}{n_i - c} + \frac{X_i^T H_i^{-1} X_i \mathbf{1}\mathbf{1}^T}{n_i - c} + \frac{\mathbf{1}\mathbf{1}^T X_i^T H_i^{-1} X_i}{n_i - c} - \frac{\mathbf{1}\mathbf{1}^T}{n_i - c},

where X_i = [x_{i_1} - x_i, \dots, x_{i_{n_i}} - x_i] is the d \times n_i matrix of translated neighbors, H_i = X_i X_i^T + \gamma_A I_d, c = \mathbf{1}^T X_i^T H_i^{-1} X_i \mathbf{1}, \mathbf{1} \in R^{n_i \times 1} is the all-one vector, and A_i \mathbf{1} = 0.

Proof. See the supplemental material.

Then we can define the label vector f = [f_1, f_2, \dots, f_n]^T \in R^{n \times 1}, the concatenated label vector \hat{f} = [f_1^T, f_2^T, \dots, f_n^T]^T, and the concatenated block-diagonal matrix

    \hat{G} = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_n \end{bmatrix},

which is of size \sum_i n_i \times \sum_i n_i. Then from Eq.(7) we can derive that \hat{\mathcal{J}}_l = \hat{f}^T \hat{G} \hat{f}. Define the selection matrix S \in \{0, 1\}^{\sum_i n_i \times n}, a 0-1 matrix with exactly one 1 in each row, such that \hat{f} = S f. Then \hat{\mathcal{J}}_l = f^T S^T \hat{G} S f. Let

    M = S^T \hat{G} S \in R^{n \times n},    (8)

which is a square matrix; then we can rewrite \hat{\mathcal{J}}_l as

    \hat{\mathcal{J}}_l = f^T M f.    (9)

SSL with Local & Global Regularizations

As stated in the Why Local Learning subsection, we also need to apply a global smoother to smooth the predicted hidden labels {f_i}. Here we apply the same smoothness regularizer as in Eq.(2), so the predicted labels can be obtained by minimizing

    \mathcal{J} = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda f^T M f + \frac{\gamma_I}{n^2} f^T L f.    (10)

By setting \partial \mathcal{J} / \partial f = 0 we can get

    f = \left( J + \lambda M + \frac{\gamma_I}{n^2} L \right)^{-1} J y,    (11)

where J is a diagonal matrix with its (i, i)-th entry

    J(i, i) = 1 if x_i is labeled, and 0 otherwise,    (12)

and y is an n \times 1 column vector with its i-th entry y(i) = y_i if x_i is labeled, and 0 otherwise.

Induction

To predict the label of an unseen testing data point which has not appeared in X, we propose a three-step approach:

Step 1. Solve the optimal label vector f^* using LGReg, i.e. Eq.(11).
Step 2. Solve the parameters {w_i^*, b_i^*} of the optimal local classification functions using Eq.(5).
Step 3. For a new testing point x, first identify the local region that x falls in (e.g. by computing the Euclidean distances between x and the region medoids and selecting the nearest one), then apply the local prediction function of the corresponding region to predict its label.

Discussions

In this section, we discuss the relationships between the proposed framework and some existing related approaches, and present another, mixed-regularization view of the algorithm introduced in the previous section.

Relationship with Related Approaches

There have already been several semi-supervised learning algorithms based on different regularizations. In this subsection, we discuss the relationships between our algorithm and those existing approaches.

Relationship with Gaussian-Laplacian Regularized Approaches. Most traditional graph based SSL algorithms (e.g. (Belkin et al., 2004; Zhou et al., 2004; Zhu et al., 2003)) are based on the following framework:

    f = \arg\min_f \sum_{i=1}^{l} (f_i - y_i)^2 + \zeta f^T L f,    (13)

where f = [f_1, f_2, \dots, f_l, \dots, f_n]^T and L is the graph Laplacian constructed by Gaussian functions. Clearly, the above framework is just a special case of our algorithm obtained by setting \lambda = 0 and \gamma_I = n^2 \zeta in Eq.(10).

Relationship with Local Learning Regularized Approaches. Recently, Wu & Schölkopf (2007) proposed a novel transduction method based on local learning, which aims to solve the following optimization problem:

    f = \arg\min_f \sum_{i=1}^{l} (f_i - y_i)^2 + \zeta \sum_{i=1}^{n} \|f_i - o_i\|^2,    (14)

where o_i is the label of x_i predicted by the local classifier constructed on the neighborhood of x_i, and the parameters of the local classifier can be represented by f via minimizing local structural loss functions as in Eq.(5).
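To make the construction above concrete, the following minimal sketch in Python/NumPy (our own illustration, not the authors' implementation) assembles the local blocks A_i of Eq.(7), accumulates M = S^T \hat{G} S of Eq.(8), and evaluates the closed-form solution of Eq.(11). The helper `knn_indices`, the Gaussian bandwidth `sigma`, and the parameter names `gamma_A`, `gamma_I` and `lam` are assumptions made for illustration.

```python
import numpy as np

def knn_indices(X, k):
    # Indices of the k nearest neighbors of each point (excluding the point itself).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def lgreg_fit(X, y, labeled, k=10, gamma_A=1.0, gamma_I=1.0, lam=1.0, sigma=1.0):
    """Sketch of LGReg: local blocks A_i (Eq. 7), M = S^T Ghat S (Eq. 8),
    and the closed-form label vector of Eq. (11)."""
    n, d = X.shape
    nbrs = knn_indices(X, k)

    # Global Gaussian-weight graph Laplacian L = D - W (Eq. 2).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W

    # M accumulates one A_i per region, scattered onto that region's neighbor indices.
    M = np.zeros((n, n))
    for i in range(n):
        idx = nbrs[i]
        Xi = X[idx] - X[i]                          # n_i x d, origin moved to the medoid x_i
        Gi = np.vstack([np.hstack([Xi, np.ones((k, 1))]),
                        np.hstack([np.sqrt(gamma_A) * np.eye(d), np.zeros((d, 1))])])
        P = Gi @ np.linalg.solve(Gi.T @ Gi, Gi.T)   # hat matrix of the local ridge fit (Eq. 5)
        Gtilde = np.eye(k + d) - P
        Ai = Gtilde[:k, :k]                         # top-left block, as in Eq. (7)
        M[np.ix_(idx, idx)] += Ai

    # Closed-form solution of Eq. (11): f = (J + lam*M + gamma_I/n^2 * L)^{-1} J y.
    J = np.diag(labeled.astype(float))
    yvec = np.where(labeled, y, 0.0)
    return np.linalg.solve(J + lam * M + (gamma_I / n ** 2) * L, J @ yvec)
```

For multi-class data sets such as COIL, one would typically apply the same solve to a one-column-per-class label indicator matrix; the sketch keeps a single label vector for brevity.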

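Continuing the sketch above, the three-step induction procedure of the Induction subsection can be written as follows. The solved label vector `f_star` and the neighborhood index array `nbrs` are assumed to come from the previous block, and all names remain illustrative rather than taken from the paper.

```python
import numpy as np

def local_classifiers(X, f_star, nbrs, gamma_A=1.0):
    """Step 2: recover (w_i, b_i) for every region from Eq. (5), given the solved labels f*."""
    n, d = X.shape
    k = nbrs.shape[1]
    params = []
    for i in range(n):
        idx = nbrs[i]
        Gi = np.vstack([np.hstack([X[idx] - X[i], np.ones((k, 1))]),
                        np.hstack([np.sqrt(gamma_A) * np.eye(d), np.zeros((d, 1))])])
        ftil = np.concatenate([f_star[idx], np.zeros(d)])
        wb = np.linalg.solve(Gi.T @ Gi, Gi.T @ ftil)
        params.append((wb[:d], wb[d]))
    return params

def predict_new(x, X, params):
    """Step 3: pick the region whose medoid is nearest to x and apply its local classifier."""
    i = int(np.argmin(((X - x) ** 2).sum(axis=1)))
    w, b = params[i]
    return float(w @ (x - X[i]) + b)
```

Step 3 here uses the single nearest medoid; the paper's description also allows applying the local classifiers of several nearby regions.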
The local learning regularization approach of Eq.(14) can be understood as a two-step approach for optimizing Eq.(10) with \gamma_I = 0: in the first step, it optimizes the classifier parameters by minimizing the local structural loss (Eq.(4)); in the second step, it minimizes the prediction loss of each data point incurred by the local classifier constructed on its neighborhood.

A Mixed-Regularization Viewpoint

As stated earlier, our algorithm aims to minimize

    \mathcal{J} = \sum_{i=1}^{l} (y_i - f_i)^2 + \lambda f^T M f + \frac{\gamma_I}{n^2} f^T L f,    (15)

where M is defined in Eq.(8) and L is the conventional graph Laplacian constructed by Gaussian functions. It is easy to prove that M has the following property.

Theorem 2. M 1 = 0, where 1 \in R^{n \times 1} is a column vector with all its elements equal to 1.

Proof. From the definition of M (Eq.(8)), we have M 1 = S^T \hat{G} S 1 = S^T \hat{G} 1 = 0.

Therefore, M can also be viewed as a Laplacian matrix. That is, the last two terms of Eq.(15) can both be viewed as regularization terms with different Laplacians: one is derived from local learning, the other from the heat kernel. Hence our algorithm can also be understood from a mixed regularization viewpoint (Chapelle et al., 2006a)(Zhu & Goldberg, 2007). Just like multiview learning algorithms, which train the same type of classifier using different data features, our method trains different classifiers using the same data features. Different types of Laplacians may better reveal different (maybe complementary) information and thus provide a more powerful classifier.

Experiments

In this section, we present a set of experiments to show the effectiveness of our method. First let's describe the basic information of the data sets.

The Data Sets

We adopt 12 data sets in our experiments, including two artificial data sets g241c and g241n, three image data sets USPS, COIL and digit1, one BCI data set, four text data sets cornell, texas, wisconsin and washington from the WebKB data set, and two UCI data sets diabetes and ionosphere. The first six data sets can be downloaded from http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html, the WebKB data sets from http://www.cs.cmu.edu/~WebKB/, and the UCI data sets from http://www.ics.uci.edu/mlearn/MLRepository.html. Table 1 summarizes the characteristics of the datasets.

Table 1: Descriptions of the datasets

Datasets     Sizes   Classes   Dimensions
g241c        1500    2         241
g241n        1500    2         241
USPS         1500    2         241
COIL         1500    6         241
digit1       1500    2         241
cornell      827     7         4134
texas        814     7         4029
wisconsin    1166    7         4189
washington   1210    7         4165
BCI          400     2         117
diabetes     768     2         8
ionosphere   351     2         34

Methods & Parameter Settings

Besides our method, we also implemented some other competing methods for experimental comparison. For all the methods, the hyperparameters were set by 5-fold cross validation over the grids introduced in the following.

• Local and Global Regularization (LGReg). In the implementation the neighborhood size is searched from {5, 10, 50}, \gamma_A and \lambda are searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}, and we set \lambda + \gamma_I / n^2 = 1. The width of the Gaussian similarity when constructing the graph is set by the method in (Zhu et al., 2003).

• Local Learning Regularization (LLReg). The implementation of this algorithm is the same as in (Wu & Schölkopf, 2007), in which we also adopt the mutual neighborhood with its size searched from {5, 10, 50}. The regularization parameter of the local classifier and the tradeoff parameter between the loss and the local regularization term are searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}.

• Laplacian Regularized Least Squares (LapRLS). The implementation code is downloaded from http://manifold.cs.uchicago.edu/manifold_regularization/software.html, in which the width of the Gaussian similarity is also set by the method in (Zhu et al., 2003), and the extrinsic and intrinsic regularization parameters are searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}. We adopt the linear kernel since our algorithm is locally linear.

• Learning with Local and Global Consistency (LLGC). The implementation of the algorithm is the same as in (Zhou et al., 2004), in which the width of the Gaussian similarity is also set by the method in (Zhu et al., 2003), and the regularization parameter is searched from {4^-3, 4^-2, 4^-1, 1, 4^1, 4^2, 4^3}.

• Gaussian Random Fields (GRF). The implementation of the algorithm is the same as in (Zhu et al., 2003).

• Support Vector Machine (SVM). We use libSVM (Fan et al., 2005) to implement the SVM algorithm with a linear kernel, and the cost parameter is searched from {10^-4, 10^-3, 10^-2, 10^-1, 1, 10^1, 10^2, 10^3, 10^4}.
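As an illustration of the model selection protocol described above, here is a sketch of a 5-fold cross-validated grid search over the labeled points (plain NumPy; the callable `fit_predict`, for instance the `lgreg_fit` sketch given earlier, and the grid values are assumptions for illustration, and binary labels in {-1, +1} are assumed).

```python
import numpy as np
from itertools import product

def cv_grid_search(fit_predict, X, y, labeled_idx, grid, n_folds=5, seed=0):
    """5-fold CV over the labeled points only: each fold's labels are hidden,
    the model is refit semi-supervised, and accuracy is measured on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(labeled_idx), n_folds)
    best_params, best_score = None, -np.inf
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        accs = []
        for fold in folds:
            train_mask = np.zeros(len(X), dtype=bool)
            train_mask[labeled_idx] = True
            train_mask[fold] = False                  # hide this fold's labels
            pred = fit_predict(X, y, train_mask, **params)
            accs.append(np.mean(np.sign(pred[fold]) == y[fold]))
        score = float(np.mean(accs))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Example grid mirroring the settings quoted above (illustrative only):
# grid = {"k": [5, 10, 50],
#         "gamma_A": [4.0 ** p for p in range(-3, 4)],
#         "lam": [4.0 ** p for p in range(-3, 4)]}
```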

[Figure 1 appears here: twelve panels, (a) g241c, (b) g241n, (c) USPS, (d) COIL, (e) digit1, (f) cornell, (g) texas, (h) wisconsin, (i) washington, (j) BCI, (k) diabetes, (l) ionosphere, each plotting the average classification accuracy of LGReg, LLReg, LapRLS, LLGC, GRF and SVM against the percentage of randomly labeled points.]

Figure 1: Experimental results of different algorithms.

Table 2: Experimental results with 10% of the data points randomly labeled (mean classification accuracy ± standard deviation, %)

Datasets     SVM              GRF              LLGC             LLReg            LapRLS           LGReg
g241c        75.64 ± 1.1383   56.34 ± 2.1665   77.13 ± 2.5871   65.31 ± 2.1220   80.44 ± 1.0746   72.29 ± 0.1347
g241n        75.01 ± 1.7155   55.06 ± 1.9519   49.75 ± 0.2570   73.25 ± 0.2466   76.89 ± 1.1350   73.20 ± 0.5983
USPS         88.32 ± 1.1087   94.87 ± 1.7490   96.19 ± 0.7588   95.79 ± 0.6804   88.80 ± 1.0087   99.21 ± 1.1290
COIL         78.59 ± 1.9936   91.23 ± 1.8321   92.04 ± 1.9170   86.86 ± 2.2190   73.35 ± 1.8921   89.61 ± 1.2197
digit1       92.80 ± 1.4818   96.95 ± 0.9601   95.49 ± 0.5638   97.64 ± 0.6636   92.79 ± 1.0960   97.10 ± 1.0982
cornell      70.26 ± 0.4807   71.43 ± 0.8564   76.30 ± 2.5865   79.46 ± 1.6336   80.59 ± 1.6665   81.39 ± 0.8968
texas        69.06 ± 0.5612   70.03 ± 0.8371   75.93 ± 3.6708   79.44 ± 1.7638   78.15 ± 1.5667   80.75 ± 1.2513
wisconsin    74.01 ± 0.3988   74.65 ± 0.4979   80.57 ± 1.9062   83.62 ± 1.5191   84.21 ± 0.9656   84.05 ± 0.5421
washington   69.54 ± 0.4603   78.26 ± 0.4053   80.23 ± 1.3997   86.37 ± 1.5516   86.58 ± 1.4985   88.01 ± 1.1369
BCI          59.77 ± 4.1279   50.49 ± 1.9392   53.07 ± 2.9037   51.56 ± 2.8277   61.84 ± 2.8177   65.31 ± 2.5354
diabetes     72.63 ± 1.5924   70.69 ± 2.6321   67.15 ± 1.9766   68.38 ± 2.1772   64.95 ± 1.1024   72.36 ± 1.3223
ionosphere   75.52 ± 1.2622   70.21 ± 2.2778   67.31 ± 2.6155   68.15 ± 2.3018   65.17 ± 0.6628   84.05 ± 0.5421

Experimental Results

The experimental results are shown in Figure 1. In all the figures, the x-axis represents the percentage of randomly labeled points, and the y-axis is the average classification accuracy over 50 independent runs. From the figures we can observe the following.

• The LapRLS algorithm works very well on the toy and text data sets, but not very well on the image and UCI data sets.
• The LLGC and GRF algorithms work well on the image data sets, but not very well on the other data sets.
• The LLReg algorithm works well on the image and text data sets, but not very well on the BCI and toy data sets.
• SVM works well when the data sets are not well structured, e.g. the toy, UCI and BCI data sets.
• LGReg works very well on almost all the data sets, except for the toy data sets.

To better illustrate the experimental results, we also provide the numerical results of those algorithms on all the data sets with 10% of the points randomly labeled. The values in Table 2 are the mean classification accuracies and standard deviations over 50 independent runs, from which we can also see the superiority of the LGReg algorithm.

Conclusions

In this paper we proposed a general learning framework based on local and global regularization. We showed that many existing learning algorithms can be derived from our framework. Finally, experiments were conducted to demonstrate the effectiveness of our method.

References

Belkin, M., Matveeva, I., and Niyogi, P. (2004). Regularization and Semi-supervised Learning on Large Graphs. In COLT 17.
Belkin, M., and Niyogi, P. (2005). Towards a Theoretical Foundation for Laplacian-Based Manifold Methods. In COLT 18.
Belkin, M., Niyogi, P., and Sindhwani, V. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research 7(Nov): 2399-2434.
Bottou, L. and Vapnik, V. (1992). Local Learning Algorithms. Neural Computation, 4:888-900.
Chapelle, O., Chi, M. and Zien, A. (2006a). A Continuation Method for Semi-Supervised SVMs. ICML 23, 185-192.
Chapelle, O., Schölkopf, B. and Zien, A. (2006b). Semi-Supervised Learning. MIT Press, Cambridge, MA.
Fan, R.-E., Chen, P.-H., and Lin, C.-J. (2005). Working Set Selection Using Second Order Information for Training SVM. Journal of Machine Learning Research 6.
Lal, T. N., Schröder, M., Hinterberger, T., Weston, J., Bogdan, M., Birbaumer, N., and Schölkopf, B. (2004). Support Vector Channel Selection in BCI. IEEE Transactions on Biomedical Engineering, 51(6).
Golub, G. H. and Van Loan, C. F. (1983). Matrix Computations. Johns Hopkins University Press, Baltimore.
Hein, M., Audibert, J. Y., and von Luxburg, U. (2005). From Graphs to Manifolds - Weak and Strong Pointwise Consistency of Graph Laplacians. In COLT 18, 470-485.
Schölkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, MA.
Smola, A. J., Bartlett, P. L., Schölkopf, B., and Schuurmans, D. (2000). Advances in Large Margin Classifiers. The MIT Press.
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, Berlin.
Wang, F. and Zhang, C. (2006). Label Propagation Through Linear Neighborhoods. ICML 23.
Wu, M. and Schölkopf, B. (2007). Transductive Classification via Local Learning Regularization. AISTATS 11.
Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. (2004). Learning with Local and Global Consistency. In NIPS 16.
Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions. In ICML 20.
Zhu, X. and Goldberg, A. (2007). Kernel Regression with Order Preferences. In AAAI.
